Linux-XFS Archive mirror
* [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib
@ 2024-02-24  1:00 Darrick J. Wong
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:00 UTC
  To: Kent Overstreet, daniel, akpm, keescook; +Cc: linux-bcachefs, linux-kernel, xfs

Hi all,

Kent and I went on a little sprint of figuring out if there were any
pieces of bcachefs that we could steal for XFS.  It turns out that there
are -- the timestats code is useful for measuring delays due to lock
contention, and thread_with_file will be very helpful for exporting
filesystem metadata health events to userspace.

So here's a pile of patchsets lifting those pieces of bcachefs to lib
and fixing a bunch of bugs in them.  These patches have already been
soaking in Kent's testing tree (and -next) for a few days, but
apparently not all of them got emailed so here I am blasting out the
entire thing.

If you want to see what XFS does with this, have a look at [1] and [2].
For 6.9 it'd be helpful to get these modules lifted.

--Darrick

[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring_2024-02-23
[2] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/tag/?h=contention-timestats_2024-02-23

* [PATCHSET 1/6] time_stats: promote to lib/
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
@ 2024-02-24  1:07 ` Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 1/4] mean and variance: Promote to lib/math Darrick J. Wong
                     ` (3 more replies)
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                   ` (5 subsequent siblings)
  6 siblings, 4 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:07 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: Dave Chinner, Theodore Ts'o, Coly Li, linux-xfs,
	linux-bcachefs, linux-kernel

Hi all,

This is Kent Overstreet's series to lift the mean and variance
computation code, as well as the time statistics code, to become
generic library code.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=timestats-hoist
---
Commits in this patchset:
 * mean and variance: Promote to lib/math
 * eytzinger: Promote to include/linux/
 * bcachefs: bch2_time_stats_to_seq_buf()
 * time_stats: Promote to lib/
---
 MAINTAINERS                          |   22 ++
 fs/bcachefs/Kconfig                  |   10 -
 fs/bcachefs/Makefile                 |    3 
 fs/bcachefs/alloc_foreground.c       |   13 -
 fs/bcachefs/bcachefs.h               |   11 +
 fs/bcachefs/bset.c                   |    2 
 fs/bcachefs/btree_cache.c            |    2 
 fs/bcachefs/btree_gc.c               |    2 
 fs/bcachefs/btree_io.c               |    8 -
 fs/bcachefs/btree_iter.c             |    8 -
 fs/bcachefs/btree_locking.h          |    2 
 fs/bcachefs/btree_update_interior.c  |    8 -
 fs/bcachefs/io_read.c                |    4 
 fs/bcachefs/io_write.c               |    4 
 fs/bcachefs/journal.c                |    5 -
 fs/bcachefs/journal_io.c             |    9 -
 fs/bcachefs/journal_reclaim.c        |    9 -
 fs/bcachefs/journal_seq_blacklist.c  |    6 -
 fs/bcachefs/journal_types.h          |   11 -
 fs/bcachefs/nocow_locking.c          |    2 
 fs/bcachefs/replicas.c               |   19 +-
 fs/bcachefs/replicas.h               |    3 
 fs/bcachefs/super-io.h               |    2 
 fs/bcachefs/super.c                  |   14 +
 fs/bcachefs/util.c                   |  339 ++--------------------------------
 fs/bcachefs/util.h                   |   86 ---------
 include/linux/eytzinger.h            |   58 +++---
 include/linux/mean_and_variance.h    |    0 
 include/linux/time_stats.h           |  134 +++++++++++++
 lib/Kconfig                          |    4 
 lib/Kconfig.debug                    |    9 +
 lib/Makefile                         |    2 
 lib/math/Kconfig                     |    3 
 lib/math/Makefile                    |    2 
 lib/math/mean_and_variance.c         |    3 
 lib/math/mean_and_variance_test.c    |    3 
 lib/sort.c                           |   89 +++++++++
 lib/time_stats.c                     |  271 +++++++++++++++++++++++++++
 38 files changed, 662 insertions(+), 520 deletions(-)
 rename fs/bcachefs/eytzinger.h => include/linux/eytzinger.h (77%)
 rename fs/bcachefs/mean_and_variance.h => include/linux/mean_and_variance.h (100%)
 create mode 100644 include/linux/time_stats.h
 rename fs/bcachefs/mean_and_variance.c => lib/math/mean_and_variance.c (99%)
 rename fs/bcachefs/mean_and_variance_test.c => lib/math/mean_and_variance_test.c (99%)
 create mode 100644 lib/time_stats.c


* [PATCHSET 2/6] time_stats: cleanups and fixes
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
@ 2024-02-24  1:08 ` Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 01/10] time_stats: report lifetime of the stats object Darrick J. Wong
                     ` (9 more replies)
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
                   ` (4 subsequent siblings)
  6 siblings, 10 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:08 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

Hi all,

This series reduces the memory consumption of individual time_stats
objects, and adds reporting of how long each stat counter has been
making observations.  It's a prep patch for adding some counters to XFS,
for which we'll want as low overhead as possible to maximize the
shotgunning effect.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=timestats-cleanups
---
Commits in this patchset:
 * time_stats: report lifetime of the stats object
 * time_stats: split stats-with-quantiles into a separate structure
 * time_stats: fix struct layout bloat
 * time_stats: add larger units
 * time_stats: don't print any output if event count is zero
 * time_stats: allow custom epoch names
 * mean_and_variance: put struct mean_and_variance_weighted on a diet
 * time_stats: shrink time_stat_buffer for better alignment
 * time_stats: report information in json format
 * time_stats: Kill TIME_STATS_HAVE_QUANTILES
---
 fs/bcachefs/bcachefs.h            |    2 -
 fs/bcachefs/io_write.c            |    2 -
 fs/bcachefs/super.c               |   10 +--
 fs/bcachefs/sysfs.c               |    4 +
 fs/bcachefs/util.c                |   15 ++--
 include/linux/mean_and_variance.h |   14 ++--
 include/linux/time_stats.h        |   43 ++++++++++-
 lib/math/mean_and_variance.c      |   28 +++++--
 lib/math/mean_and_variance_test.c |   80 +++++++++++---------
 lib/time_stats.c                  |  148 +++++++++++++++++++++++++++++++------
 10 files changed, 248 insertions(+), 98 deletions(-)


* [PATCHSET RFC 3/6] xfs: capture statistics about wait times
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
@ 2024-02-24  1:08 ` Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 1/4] xfs: present wait time statistics Darrick J. Wong
                     ` (3 more replies)
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:08 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

Hi all,

This patchset builds off of Kent Overstreet's timestats code to capture
information about the amount of time we spend waiting for buffer, quota,
and inode locks; as well as time spent in the scrub code.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=contention-timestats
---
Commits in this patchset:
 * xfs: present wait time statistics
 * xfs: present time stats for scrubbers
 * xfs: present timestats in json format
 * xfs: create debugfs uuid aliases
---
 fs/xfs/Kconfig         |    8 ++
 fs/xfs/Makefile        |    1 
 fs/xfs/scrub/repair.c  |    6 +-
 fs/xfs/scrub/scrub.c   |    6 +-
 fs/xfs/scrub/stats.c   |  136 +++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/stats.h   |   21 +-----
 fs/xfs/xfs_buf.c       |    4 +
 fs/xfs/xfs_dquot.c     |   11 +++
 fs/xfs/xfs_dquot.h     |    4 +
 fs/xfs/xfs_inode.c     |   12 +++-
 fs/xfs/xfs_linux.h     |    5 ++
 fs/xfs/xfs_log.c       |    9 +++
 fs/xfs/xfs_mount.h     |   14 ++++
 fs/xfs/xfs_super.c     |   17 +++++
 fs/xfs/xfs_timestats.c |  156 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_timestats.h |   37 +++++++++++
 16 files changed, 418 insertions(+), 29 deletions(-)
 create mode 100644 fs/xfs/xfs_timestats.c
 create mode 100644 fs/xfs/xfs_timestats.h


* [PATCHSET 4/6] thread_with_file: promote to lib/
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
@ 2024-02-24  1:08 ` Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 01/10] bcachefs: thread_with_stdio: eliminate double buffering Darrick J. Wong
                     ` (9 more replies)
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
                   ` (2 subsequent siblings)
  6 siblings, 10 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:08 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: fuyuanli, linux-xfs, linux-bcachefs, linux-kernel

Hi all,

This is Kent Overstreet's series to lift the thread_with_file support
code to generic library code.  This enables the kernel to create a
pseudo file that userspace can use to read deeply structured event
information from the kernel.  kthreads are used to manage the buffers
underlying the file operations.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=twf-hoist
---
Commits in this patchset:
 * bcachefs: thread_with_stdio: eliminate double buffering
 * bcachefs: thread_with_stdio: convert to darray
 * bcachefs: thread_with_stdio: kill thread_with_stdio_done()
 * bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline()
 * bcachefs: Thread with file documentation
 * darray: lift from bcachefs
 * thread_with_file: Lift from bcachefs
 * thread_with_stdio: Mark completed in ->release()
 * kernel/hung_task.c: export sysctl_hung_task_timeout_secs
 * thread_with_stdio: suppress hung task warning
---
 MAINTAINERS                            |   16 +
 fs/bcachefs/Kconfig                    |    1 
 fs/bcachefs/Makefile                   |    2 
 fs/bcachefs/bcachefs.h                 |    2 
 fs/bcachefs/btree_types.h              |    2 
 fs/bcachefs/btree_update.c             |    2 
 fs/bcachefs/btree_write_buffer_types.h |    2 
 fs/bcachefs/chardev.c                  |   24 +-
 fs/bcachefs/error.c                    |    4 
 fs/bcachefs/fsck.c                     |    2 
 fs/bcachefs/journal_sb.c               |    2 
 fs/bcachefs/sb-downgrade.c             |    3 
 fs/bcachefs/sb-errors_types.h          |    2 
 fs/bcachefs/sb-members.h               |    2 
 fs/bcachefs/subvolume.h                |    1 
 fs/bcachefs/subvolume_types.h          |    2 
 fs/bcachefs/super.c                    |    9 -
 fs/bcachefs/thread_with_file.c         |  299 -------------------------
 fs/bcachefs/thread_with_file.h         |   41 ---
 fs/bcachefs/thread_with_file_types.h   |   16 -
 fs/bcachefs/util.h                     |   29 --
 include/linux/darray.h                 |   59 +++--
 include/linux/darray_types.h           |   22 ++
 include/linux/thread_with_file.h       |   71 ++++++
 include/linux/thread_with_file_types.h |   25 ++
 kernel/hung_task.c                     |    1 
 lib/Kconfig                            |    3 
 lib/Makefile                           |    3 
 lib/darray.c                           |   12 +
 lib/thread_with_file.c                 |  379 ++++++++++++++++++++++++++++++++
 30 files changed, 596 insertions(+), 442 deletions(-)
 delete mode 100644 fs/bcachefs/thread_with_file.c
 delete mode 100644 fs/bcachefs/thread_with_file.h
 delete mode 100644 fs/bcachefs/thread_with_file_types.h
 rename fs/bcachefs/darray.h => include/linux/darray.h (66%)
 create mode 100644 include/linux/darray_types.h
 create mode 100644 include/linux/thread_with_file.h
 create mode 100644 include/linux/thread_with_file_types.h
 rename fs/bcachefs/darray.c => lib/darray.c (56%)
 create mode 100644 lib/thread_with_file.c


* [PATCHSET 5/6] thread_with_file: cleanups and fixes
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
                   ` (3 preceding siblings ...)
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
@ 2024-02-24  1:08 ` Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 1/5] thread_with_file: allow creation of readonly files Darrick J. Wong
                     ` (4 more replies)
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  6 siblings, 5 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:08 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

Hi all,

This series fixes some problems with the thread_with_file code -- namely
that blocking stdout writes attempt a non-atomic memory allocation while
holding a spinlock.  It also cleans up the ops handling so that we can
support ioctls on the thread_with_file itself.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=twf-cleanups
---
Commits in this patchset:
 * thread_with_file: allow creation of readonly files
 * thread_with_file: fix various printf problems
 * thread_with_file: create ops structure for thread_with_stdio
 * thread_with_file: allow ioctls against these files
 * thread_with_file: Fix missing va_end()
---
 fs/bcachefs/chardev.c            |   18 ++++--
 include/linux/thread_with_file.h |   20 +++++--
 lib/thread_with_file.c           |  113 ++++++++++++++++++++++++++++++--------
 3 files changed, 115 insertions(+), 36 deletions(-)


* [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
                   ` (4 preceding siblings ...)
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
@ 2024-02-24  1:09 ` Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 1/8] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
                     ` (7 more replies)
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  6 siblings, 8 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:09 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file code to
deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  This is a private
ioctl, so events are expressed using simple json objects so that we can
enrich the output later on without having to rev a ton of C structs.

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: use thread_with_file to create a monitoring file
 * xfs: create hooks for monitoring health updates
 * xfs: create a filesystem shutdown hook
 * xfs: report shutdown events through healthmon
 * xfs: report metadata health events through healthmon
 * xfs: report media errors through healthmon
 * xfs: allow reconfiguration of the health monitoring device
 * xfs: send uevents when mounting and unmounting a filesystem
---
 fs/xfs/Kconfig                 |    9 
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_fs.h         |    1 
 fs/xfs/libxfs/xfs_fs_staging.h |   18 +
 fs/xfs/libxfs/xfs_health.h     |   48 ++
 fs/xfs/xfs_buf.c               |    1 
 fs/xfs/xfs_fsops.c             |   57 ++
 fs/xfs/xfs_fsops.h             |   14 +
 fs/xfs/xfs_health.c            |  266 ++++++++++
 fs/xfs/xfs_healthmon.c         | 1108 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h         |   86 +++
 fs/xfs/xfs_ioctl.c             |   21 +
 fs/xfs/xfs_linux.h             |    3 
 fs/xfs/xfs_mount.h             |    9 
 fs/xfs/xfs_notify_failure.c    |  161 +++++-
 fs/xfs/xfs_notify_failure.h    |   42 ++
 fs/xfs/xfs_super.c             |   43 ++
 fs/xfs/xfs_super.h             |    1 
 fs/xfs/xfs_trace.c             |    3 
 fs/xfs/xfs_trace.h             |  240 +++++++++
 20 files changed, 2098 insertions(+), 34 deletions(-)
 create mode 100644 fs/xfs/xfs_healthmon.c
 create mode 100644 fs/xfs/xfs_healthmon.h
 create mode 100644 fs/xfs/xfs_notify_failure.h


* [PATCH 1/4] mean and variance: Promote to lib/math
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
@ 2024-02-24  1:09   ` Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 2/4] eytzinger: Promote to include/linux/ Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:09 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Small statistics library, for taking in a series of values and computing
mean, weighted mean, standard deviation and weighted deviation.

The main use case is for statistics on latency measurements.
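
To make the idea concrete, here is a minimal userspace sketch of a
running mean/variance accumulator that keeps only a count, a sum, and a
sum of squares.  The names (struct mv_sketch, mv_update, mv_mean,
mv_variance) are hypothetical and are not the interfaces in
include/linux/mean_and_variance.h.  The sketch also ignores overflow of
the 64-bit accumulators; the library file touched by this patch carries
128-bit helpers such as u128_div(), presumably for that reason.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator for illustration only. */
struct mv_sketch {
	uint64_t n;		/* number of samples seen */
	int64_t sum;		/* sum of all samples */
	uint64_t sum_squares;	/* sum of each sample squared */
};

/* Fold one sample into the accumulator (no overflow checking). */
void mv_update(struct mv_sketch *s, int64_t v)
{
	s->n++;
	s->sum += v;
	s->sum_squares += (uint64_t)(v * v);
}

/* Integer mean of the samples seen so far. */
int64_t mv_mean(const struct mv_sketch *s)
{
	assert(s->n > 0);
	return s->sum / (int64_t)s->n;
}

/* Population variance: (sum_squares - sum^2 / n) / n. */
uint64_t mv_variance(const struct mv_sketch *s)
{
	uint64_t square_of_sum;

	assert(s->n > 0);
	square_of_sum = (uint64_t)(s->sum * s->sum);
	return (s->sum_squares - square_of_sum / s->n) / s->n;
}
```

Because only three numbers are stored, a latency counter built this way
costs the same amount of memory no matter how many samples it absorbs.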

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Daniel Hill <daniel@gluo.nz>
Cc: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                          |    9 +++++++++
 fs/bcachefs/Kconfig                  |   10 +---------
 fs/bcachefs/Makefile                 |    3 ---
 fs/bcachefs/util.c                   |    2 +-
 fs/bcachefs/util.h                   |    3 +--
 include/linux/mean_and_variance.h    |    0 
 lib/Kconfig.debug                    |    9 +++++++++
 lib/math/Kconfig                     |    3 +++
 lib/math/Makefile                    |    2 ++
 lib/math/mean_and_variance.c         |    3 +--
 lib/math/mean_and_variance_test.c    |    3 +--
 11 files changed, 28 insertions(+), 19 deletions(-)
 rename fs/bcachefs/mean_and_variance.h => include/linux/mean_and_variance.h (100%)
 rename fs/bcachefs/mean_and_variance.c => lib/math/mean_and_variance.c (99%)
 rename fs/bcachefs/mean_and_variance_test.c => lib/math/mean_and_variance_test.c (99%)


diff --git a/MAINTAINERS b/MAINTAINERS
index 9ed4d38685394..3e13de69b7f07 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13387,6 +13387,15 @@ S:	Maintained
 F:	drivers/net/mdio/mdio-regmap.c
 F:	include/linux/mdio/mdio-regmap.h
 
+MEAN AND VARIANCE LIBRARY
+M:	Daniel B. Hill <daniel@gluo.nz>
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+S:	Maintained
+T:	git https://github.com/YellowOnion/linux/
+F:	include/linux/mean_and_variance.h
+F:	lib/math/mean_and_variance.c
+F:	lib/math/mean_and_variance_test.c
+
 MEASUREMENT COMPUTING CIO-DAC IIO DRIVER
 M:	William Breathitt Gray <william.gray@linaro.org>
 L:	linux-iio@vger.kernel.org
diff --git a/fs/bcachefs/Kconfig b/fs/bcachefs/Kconfig
index 5cdfef3b551a7..72d1179262b33 100644
--- a/fs/bcachefs/Kconfig
+++ b/fs/bcachefs/Kconfig
@@ -24,6 +24,7 @@ config BCACHEFS_FS
 	select XXHASH
 	select SRCU
 	select SYMBOLIC_ERRNAME
+	select MEAN_AND_VARIANCE
 	help
 	The bcachefs filesystem - a modern, copy on write filesystem, with
 	support for multiple devices, compression, checksumming, etc.
@@ -86,12 +87,3 @@ config BCACHEFS_SIX_OPTIMISTIC_SPIN
 	Instead of immediately sleeping when attempting to take a six lock that
 	is held by another thread, spin for a short while, as long as the
 	thread owning the lock is running.
-
-config MEAN_AND_VARIANCE_UNIT_TEST
-	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
-	depends on KUNIT
-	depends on BCACHEFS_FS
-	default KUNIT_ALL_TESTS
-	help
-	  This option enables the kunit tests for mean_and_variance module.
-	  If unsure, say N.
diff --git a/fs/bcachefs/Makefile b/fs/bcachefs/Makefile
index 1a05cecda7cc5..b11ba74b8ad41 100644
--- a/fs/bcachefs/Makefile
+++ b/fs/bcachefs/Makefile
@@ -57,7 +57,6 @@ bcachefs-y		:=	\
 	keylist.o		\
 	logged_ops.o		\
 	lru.o			\
-	mean_and_variance.o	\
 	migrate.o		\
 	move.o			\
 	movinggc.o		\
@@ -88,5 +87,3 @@ bcachefs-y		:=	\
 	util.o			\
 	varint.o		\
 	xattr.o
-
-obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST)   += mean_and_variance_test.o
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index 231003b405efc..c9d13dcf3ef1a 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -22,9 +22,9 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/sched/clock.h>
+#include <linux/mean_and_variance.h>
 
 #include "eytzinger.h"
-#include "mean_and_variance.h"
 #include "util.h"
 
 static const char si_units[] = "?kMGTPEZY";
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index b414736d59a5b..0059481995ef7 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -17,8 +17,7 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
-
-#include "mean_and_variance.h"
+#include <linux/mean_and_variance.h>
 
 #include "darray.h"
 
diff --git a/fs/bcachefs/mean_and_variance.h b/include/linux/mean_and_variance.h
similarity index 100%
rename from fs/bcachefs/mean_and_variance.h
rename to include/linux/mean_and_variance.h
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 975a07f9f1cc0..817ddfe132cda 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2191,6 +2191,15 @@ config CPUMASK_KUNIT_TEST
 
 	  If unsure, say N.
 
+config MEAN_AND_VARIANCE_UNIT_TEST
+	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	select MEAN_AND_VARIANCE
+	default KUNIT_ALL_TESTS
+	help
+	  This option enables the kunit tests for mean_and_variance module.
+	  If unsure, say N.
+
 config TEST_LIST_SORT
 	tristate "Linked list sorting test" if !KUNIT_ALL_TESTS
 	depends on KUNIT
diff --git a/lib/math/Kconfig b/lib/math/Kconfig
index 0634b428d0cb7..7530ae9a3584f 100644
--- a/lib/math/Kconfig
+++ b/lib/math/Kconfig
@@ -15,3 +15,6 @@ config PRIME_NUMBERS
 
 config RATIONAL
 	tristate
+
+config MEAN_AND_VARIANCE
+	tristate
diff --git a/lib/math/Makefile b/lib/math/Makefile
index 91fcdb0c9efe4..8cdfa13a67ce0 100644
--- a/lib/math/Makefile
+++ b/lib/math/Makefile
@@ -4,6 +4,8 @@ obj-y += div64.o gcd.o lcm.o int_log.o int_pow.o int_sqrt.o reciprocal_div.o
 obj-$(CONFIG_CORDIC)		+= cordic.o
 obj-$(CONFIG_PRIME_NUMBERS)	+= prime_numbers.o
 obj-$(CONFIG_RATIONAL)		+= rational.o
+obj-$(CONFIG_MEAN_AND_VARIANCE) += mean_and_variance.o
 
 obj-$(CONFIG_TEST_DIV64)	+= test_div64.o
 obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
+obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST)   += mean_and_variance_test.o
diff --git a/fs/bcachefs/mean_and_variance.c b/lib/math/mean_and_variance.c
similarity index 99%
rename from fs/bcachefs/mean_and_variance.c
rename to lib/math/mean_and_variance.c
index bf0ef668fd383..ba90293204bae 100644
--- a/fs/bcachefs/mean_and_variance.c
+++ b/lib/math/mean_and_variance.c
@@ -40,10 +40,9 @@
 #include <linux/limits.h>
 #include <linux/math.h>
 #include <linux/math64.h>
+#include <linux/mean_and_variance.h>
 #include <linux/module.h>
 
-#include "mean_and_variance.h"
-
 u128_u u128_div(u128_u n, u64 d)
 {
 	u128_u r;
diff --git a/fs/bcachefs/mean_and_variance_test.c b/lib/math/mean_and_variance_test.c
similarity index 99%
rename from fs/bcachefs/mean_and_variance_test.c
rename to lib/math/mean_and_variance_test.c
index 019583c3ca0ea..f45591a169d87 100644
--- a/fs/bcachefs/mean_and_variance_test.c
+++ b/lib/math/mean_and_variance_test.c
@@ -1,7 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <kunit/test.h>
-
-#include "mean_and_variance.h"
+#include <linux/mean_and_variance.h>
 
 #define MAX_SQR (SQRT_U64_MAX*SQRT_U64_MAX)
 


* [PATCH 2/4] eytzinger: Promote to include/linux/
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 1/4] mean and variance: Promote to lib/math Darrick J. Wong
@ 2024-02-24  1:09   ` Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 3/4] bcachefs: bch2_time_stats_to_seq_buf() Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 4/4] time_stats: Promote to lib/ Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:09 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

eytzinger trees are a faster alternative to binary search. They're a bit
more expensive to set up, but lookups perform much better assuming the
tree isn't entirely in cache.

Binary search is a worst case scenario for branch prediction and
prefetching, but eytzinger trees have children adjacent in memory and
thus we can prefetch before knowing the result of a comparison.

An eytzinger tree is a binary tree laid out in an array, with the same
geometry as the usual binary heap construction, but used as a search
tree instead.
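
As an illustration of the layout described above, here is a minimal
userspace sketch that builds an eytzinger array from a sorted array and
searches it.  It uses one-based indexing (children of node i live at 2i
and 2i+1) to keep the arithmetic simple; the function names eytz_build
and eytz_find are hypothetical and are not the helpers declared in
include/linux/eytzinger.h.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill eytz[1..n] from sorted[0..n-1] with an in-order walk of the
 * implicit tree.  Call with pos = 0 and node = 1; returns the number
 * of elements consumed from sorted[].
 */
size_t eytz_build(const int *sorted, int *eytz, size_t n,
		  size_t pos, size_t node)
{
	if (node <= n) {
		pos = eytz_build(sorted, eytz, n, pos, 2 * node);
		eytz[node] = sorted[pos++];
		pos = eytz_build(sorted, eytz, n, pos, 2 * node + 1);
	}
	return pos;
}

/* Return the one-based index of key in eytz[1..n], or 0 if absent. */
size_t eytz_find(const int *eytz, size_t n, int key)
{
	size_t i = 1;

	while (i <= n) {
		if (eytz[i] == key)
			return i;
		/* Both children sit next to each other at 2i and 2i+1. */
		i = 2 * i + (key > eytz[i]);
	}
	return 0;
}
```

Since the two children of a node are adjacent in memory, a lookup can
prefetch the pair before the comparison at the current node resolves,
which is the advantage over a classic binary search claimed above.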

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                         |    6 +
 fs/bcachefs/bset.c                  |    2 
 fs/bcachefs/journal_seq_blacklist.c |    6 +
 fs/bcachefs/replicas.c              |   19 +++--
 fs/bcachefs/replicas.h              |    3 -
 fs/bcachefs/super-io.h              |    2 
 fs/bcachefs/util.c                  |  145 -----------------------------------
 fs/bcachefs/util.h                  |    4 -
 include/linux/eytzinger.h           |   58 ++++++++------
 lib/sort.c                          |   89 +++++++++++++++++++++
 10 files changed, 148 insertions(+), 186 deletions(-)
 rename fs/bcachefs/eytzinger.h => include/linux/eytzinger.h (77%)


diff --git a/MAINTAINERS b/MAINTAINERS
index 3e13de69b7f07..98a17270566d3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8066,6 +8066,12 @@ L:	iommu@lists.linux.dev
 S:	Maintained
 F:	drivers/iommu/exynos-iommu.c
 
+EYTZINGER TREE LIB
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Maintained
+F:	include/linux/eytzinger.h
+
 F2FS FILE SYSTEM
 M:	Jaegeuk Kim <jaegeuk@kernel.org>
 M:	Chao Yu <chao@kernel.org>
diff --git a/fs/bcachefs/bset.c b/fs/bcachefs/bset.c
index 3fd1085b6c61e..1d77aa55d641c 100644
--- a/fs/bcachefs/bset.c
+++ b/fs/bcachefs/bset.c
@@ -9,12 +9,12 @@
 #include "bcachefs.h"
 #include "btree_cache.h"
 #include "bset.h"
-#include "eytzinger.h"
 #include "trace.h"
 #include "util.h"
 
 #include <asm/unaligned.h>
 #include <linux/console.h>
+#include <linux/eytzinger.h>
 #include <linux/random.h>
 #include <linux/prefetch.h>
 
diff --git a/fs/bcachefs/journal_seq_blacklist.c b/fs/bcachefs/journal_seq_blacklist.c
index 0200e299cfbb9..024c9b1b323f8 100644
--- a/fs/bcachefs/journal_seq_blacklist.c
+++ b/fs/bcachefs/journal_seq_blacklist.c
@@ -2,10 +2,11 @@
 
 #include "bcachefs.h"
 #include "btree_iter.h"
-#include "eytzinger.h"
 #include "journal_seq_blacklist.h"
 #include "super-io.h"
 
+#include <linux/eytzinger.h>
+
 /*
  * journal_seq_blacklist machinery:
  *
@@ -119,8 +120,7 @@ int bch2_journal_seq_blacklist_add(struct bch_fs *c, u64 start, u64 end)
 	return ret ?: bch2_blacklist_table_initialize(c);
 }
 
-static int journal_seq_blacklist_table_cmp(const void *_l,
-					   const void *_r, size_t size)
+static int journal_seq_blacklist_table_cmp(const void *_l, const void *_r)
 {
 	const struct journal_seq_blacklist_table_entry *l = _l;
 	const struct journal_seq_blacklist_table_entry *r = _r;
diff --git a/fs/bcachefs/replicas.c b/fs/bcachefs/replicas.c
index cc2672c120312..678b9c20e2514 100644
--- a/fs/bcachefs/replicas.c
+++ b/fs/bcachefs/replicas.c
@@ -6,12 +6,15 @@
 #include "replicas.h"
 #include "super-io.h"
 
+#include <linux/sort.h>
+
 static int bch2_cpu_replicas_to_sb_replicas(struct bch_fs *,
 					    struct bch_replicas_cpu *);
 
 /* Some (buggy!) compilers don't allow memcmp to be passed as a pointer */
-static int bch2_memcmp(const void *l, const void *r, size_t size)
+static int bch2_memcmp(const void *l, const void *r,  const void *priv)
 {
+	size_t size = (size_t) priv;
 	return memcmp(l, r, size);
 }
 
@@ -39,7 +42,8 @@ void bch2_replicas_entry_sort(struct bch_replicas_entry_v1 *e)
 
 static void bch2_cpu_replicas_sort(struct bch_replicas_cpu *r)
 {
-	eytzinger0_sort(r->entries, r->nr, r->entry_size, bch2_memcmp, NULL);
+	eytzinger0_sort_r(r->entries, r->nr, r->entry_size,
+			  bch2_memcmp, NULL, (void *)(size_t)r->entry_size);
 }
 
 static void bch2_replicas_entry_v0_to_text(struct printbuf *out,
@@ -228,7 +232,7 @@ static inline int __replicas_entry_idx(struct bch_replicas_cpu *r,
 
 	verify_replicas_entry(search);
 
-#define entry_cmp(_l, _r, size)	memcmp(_l, _r, entry_size)
+#define entry_cmp(_l, _r)	memcmp(_l, _r, entry_size)
 	idx = eytzinger0_find(r->entries, r->nr, r->entry_size,
 			      entry_cmp, search);
 #undef entry_cmp
@@ -824,10 +828,11 @@ static int bch2_cpu_replicas_validate(struct bch_replicas_cpu *cpu_r,
 {
 	unsigned i;
 
-	sort_cmp_size(cpu_r->entries,
-		      cpu_r->nr,
-		      cpu_r->entry_size,
-		      bch2_memcmp, NULL);
+	sort_r(cpu_r->entries,
+	       cpu_r->nr,
+	       cpu_r->entry_size,
+	       bch2_memcmp, NULL,
+	       (void *)(size_t)cpu_r->entry_size);
 
 	for (i = 0; i < cpu_r->nr; i++) {
 		struct bch_replicas_entry_v1 *e =
diff --git a/fs/bcachefs/replicas.h b/fs/bcachefs/replicas.h
index 654a4b26d3a3c..983cce782ac2a 100644
--- a/fs/bcachefs/replicas.h
+++ b/fs/bcachefs/replicas.h
@@ -3,9 +3,10 @@
 #define _BCACHEFS_REPLICAS_H
 
 #include "bkey.h"
-#include "eytzinger.h"
 #include "replicas_types.h"
 
+#include <linux/eytzinger.h>
+
 void bch2_replicas_entry_sort(struct bch_replicas_entry_v1 *);
 void bch2_replicas_entry_to_text(struct printbuf *,
 				 struct bch_replicas_entry_v1 *);
diff --git a/fs/bcachefs/super-io.h b/fs/bcachefs/super-io.h
index 95e80e06316bf..f37620919e11a 100644
--- a/fs/bcachefs/super-io.h
+++ b/fs/bcachefs/super-io.h
@@ -3,12 +3,12 @@
 #define _BCACHEFS_SUPER_IO_H
 
 #include "extents.h"
-#include "eytzinger.h"
 #include "super_types.h"
 #include "super.h"
 #include "sb-members.h"
 
 #include <asm/byteorder.h>
+#include <linux/eytzinger.h>
 
 static inline bool bch2_version_compatible(u16 version)
 {
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index c9d13dcf3ef1a..902f6b1a8a142 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -11,6 +11,7 @@
 #include <linux/console.h>
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
+#include <linux/eytzinger.h>
 #include <linux/freezer.h>
 #include <linux/kthread.h>
 #include <linux/log2.h>
@@ -24,7 +25,6 @@
 #include <linux/sched/clock.h>
 #include <linux/mean_and_variance.h>
 
-#include "eytzinger.h"
 #include "util.h"
 
 static const char si_units[] = "?kMGTPEZY";
@@ -864,149 +864,6 @@ void memcpy_from_bio(void *dst, struct bio *src, struct bvec_iter src_iter)
 	}
 }
 
-static int alignment_ok(const void *base, size_t align)
-{
-	return IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
-		((unsigned long)base & (align - 1)) == 0;
-}
-
-static void u32_swap(void *a, void *b, size_t size)
-{
-	u32 t = *(u32 *)a;
-	*(u32 *)a = *(u32 *)b;
-	*(u32 *)b = t;
-}
-
-static void u64_swap(void *a, void *b, size_t size)
-{
-	u64 t = *(u64 *)a;
-	*(u64 *)a = *(u64 *)b;
-	*(u64 *)b = t;
-}
-
-static void generic_swap(void *a, void *b, size_t size)
-{
-	char t;
-
-	do {
-		t = *(char *)a;
-		*(char *)a++ = *(char *)b;
-		*(char *)b++ = t;
-	} while (--size > 0);
-}
-
-static inline int do_cmp(void *base, size_t n, size_t size,
-			 int (*cmp_func)(const void *, const void *, size_t),
-			 size_t l, size_t r)
-{
-	return cmp_func(base + inorder_to_eytzinger0(l, n) * size,
-			base + inorder_to_eytzinger0(r, n) * size,
-			size);
-}
-
-static inline void do_swap(void *base, size_t n, size_t size,
-			   void (*swap_func)(void *, void *, size_t),
-			   size_t l, size_t r)
-{
-	swap_func(base + inorder_to_eytzinger0(l, n) * size,
-		  base + inorder_to_eytzinger0(r, n) * size,
-		  size);
-}
-
-void eytzinger0_sort(void *base, size_t n, size_t size,
-		     int (*cmp_func)(const void *, const void *, size_t),
-		     void (*swap_func)(void *, void *, size_t))
-{
-	int i, c, r;
-
-	if (!swap_func) {
-		if (size == 4 && alignment_ok(base, 4))
-			swap_func = u32_swap;
-		else if (size == 8 && alignment_ok(base, 8))
-			swap_func = u64_swap;
-		else
-			swap_func = generic_swap;
-	}
-
-	/* heapify */
-	for (i = n / 2 - 1; i >= 0; --i) {
-		for (r = i; r * 2 + 1 < n; r = c) {
-			c = r * 2 + 1;
-
-			if (c + 1 < n &&
-			    do_cmp(base, n, size, cmp_func, c, c + 1) < 0)
-				c++;
-
-			if (do_cmp(base, n, size, cmp_func, r, c) >= 0)
-				break;
-
-			do_swap(base, n, size, swap_func, r, c);
-		}
-	}
-
-	/* sort */
-	for (i = n - 1; i > 0; --i) {
-		do_swap(base, n, size, swap_func, 0, i);
-
-		for (r = 0; r * 2 + 1 < i; r = c) {
-			c = r * 2 + 1;
-
-			if (c + 1 < i &&
-			    do_cmp(base, n, size, cmp_func, c, c + 1) < 0)
-				c++;
-
-			if (do_cmp(base, n, size, cmp_func, r, c) >= 0)
-				break;
-
-			do_swap(base, n, size, swap_func, r, c);
-		}
-	}
-}
-
-void sort_cmp_size(void *base, size_t num, size_t size,
-	  int (*cmp_func)(const void *, const void *, size_t),
-	  void (*swap_func)(void *, void *, size_t size))
-{
-	/* pre-scale counters for performance */
-	int i = (num/2 - 1) * size, n = num * size, c, r;
-
-	if (!swap_func) {
-		if (size == 4 && alignment_ok(base, 4))
-			swap_func = u32_swap;
-		else if (size == 8 && alignment_ok(base, 8))
-			swap_func = u64_swap;
-		else
-			swap_func = generic_swap;
-	}
-
-	/* heapify */
-	for ( ; i >= 0; i -= size) {
-		for (r = i; r * 2 + size < n; r  = c) {
-			c = r * 2 + size;
-			if (c < n - size &&
-			    cmp_func(base + c, base + c + size, size) < 0)
-				c += size;
-			if (cmp_func(base + r, base + c, size) >= 0)
-				break;
-			swap_func(base + r, base + c, size);
-		}
-	}
-
-	/* sort */
-	for (i = n - size; i > 0; i -= size) {
-		swap_func(base, base + i, size);
-		for (r = 0; r * 2 + size < i; r = c) {
-			c = r * 2 + size;
-			if (c < i - size &&
-			    cmp_func(base + c, base + c + size, size) < 0)
-				c += size;
-			if (cmp_func(base + r, base + c, size) >= 0)
-				break;
-			swap_func(base + r, base + c, size);
-		}
-	}
-}
-
 static void mempool_free_vp(void *element, void *pool_data)
 {
 	size_t size = (size_t) pool_data;
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index 0059481995ef7..c3b11c3d24ea9 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -737,10 +737,6 @@ static inline void memset_u64s_tail(void *s, int c, unsigned bytes)
 	memset(s + bytes, c, rem);
 }
 
-void sort_cmp_size(void *base, size_t num, size_t size,
-	  int (*cmp_func)(const void *, const void *, size_t),
-	  void (*swap_func)(void *, void *, size_t));
-
 /* just the memmove, doesn't update @_nr */
 #define __array_insert_item(_array, _nr, _pos)				\
 	memmove(&(_array)[(_pos) + 1],					\
diff --git a/fs/bcachefs/eytzinger.h b/include/linux/eytzinger.h
similarity index 77%
rename from fs/bcachefs/eytzinger.h
rename to include/linux/eytzinger.h
index b04750dbf870b..1031501030449 100644
--- a/fs/bcachefs/eytzinger.h
+++ b/include/linux/eytzinger.h
@@ -1,27 +1,37 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _EYTZINGER_H
-#define _EYTZINGER_H
+#ifndef _LINUX_EYTZINGER_H
+#define _LINUX_EYTZINGER_H
 
 #include <linux/bitops.h>
 #include <linux/log2.h>
 
-#include "util.h"
+#ifdef EYTZINGER_DEBUG
+#define EYTZINGER_BUG_ON(cond)		BUG_ON(cond)
+#else
+#define EYTZINGER_BUG_ON(cond)
+#endif
 
 /*
  * Traversal for trees in eytzinger layout - a full binary tree layed out in an
- * array
- */
-
-/*
- * One based indexing version:
+ * array.
  *
- * With one based indexing each level of the tree starts at a power of two -
- * good for cacheline alignment:
+ * Consider using an eytzinger tree any time you would otherwise be doing binary
+ * search over an array. Binary search is a worst case scenario for branch
+ * prediction and prefetching, but in an eytzinger tree every node's children
+ * are adjacent in memory, thus we can prefetch children before knowing the
+ * result of the comparison, assuming multiple nodes fit on a cacheline.
+ *
+ * Two variants are provided, for one based indexing and zero based indexing.
+ *
+ * Zero based indexing is more convenient, but one based indexing has better
+ * alignment and thus better performance because each new level of the tree
+ * starts at a power of two, and thus if element 0 was cacheline aligned, each
+ * new level will be as well.
  */
 
 static inline unsigned eytzinger1_child(unsigned i, unsigned child)
 {
-	EBUG_ON(child > 1);
+	EYTZINGER_BUG_ON(child > 1);
 
 	return (i << 1) + child;
 }
@@ -58,7 +68,7 @@ static inline unsigned eytzinger1_last(unsigned size)
 
 static inline unsigned eytzinger1_next(unsigned i, unsigned size)
 {
-	EBUG_ON(i > size);
+	EYTZINGER_BUG_ON(i > size);
 
 	if (eytzinger1_right_child(i) <= size) {
 		i = eytzinger1_right_child(i);
@@ -74,7 +84,7 @@ static inline unsigned eytzinger1_next(unsigned i, unsigned size)
 
 static inline unsigned eytzinger1_prev(unsigned i, unsigned size)
 {
-	EBUG_ON(i > size);
+	EYTZINGER_BUG_ON(i > size);
 
 	if (eytzinger1_left_child(i) <= size) {
 		i = eytzinger1_left_child(i) + 1;
@@ -101,7 +111,7 @@ static inline unsigned __eytzinger1_to_inorder(unsigned i, unsigned size,
 	unsigned shift = __fls(size) - b;
 	int s;
 
-	EBUG_ON(!i || i > size);
+	EYTZINGER_BUG_ON(!i || i > size);
 
 	i  ^= 1U << b;
 	i <<= 1;
@@ -126,7 +136,7 @@ static inline unsigned __inorder_to_eytzinger1(unsigned i, unsigned size,
 	unsigned shift;
 	int s;
 
-	EBUG_ON(!i || i > size);
+	EYTZINGER_BUG_ON(!i || i > size);
 
 	/*
 	 * sign bit trick:
@@ -164,7 +174,7 @@ static inline unsigned inorder_to_eytzinger1(unsigned i, unsigned size)
 
 static inline unsigned eytzinger0_child(unsigned i, unsigned child)
 {
-	EBUG_ON(child > 1);
+	EYTZINGER_BUG_ON(child > 1);
 
 	return (i << 1) + 1 + child;
 }
@@ -231,11 +241,9 @@ static inline unsigned inorder_to_eytzinger0(unsigned i, unsigned size)
 	     (_i) != -1;				\
 	     (_i) = eytzinger0_next((_i), (_size)))
 
-typedef int (*eytzinger_cmp_fn)(const void *l, const void *r, size_t size);
-
 /* return greatest node <= @search, or -1 if not found */
 static inline ssize_t eytzinger0_find_le(void *base, size_t nr, size_t size,
-					 eytzinger_cmp_fn cmp, const void *search)
+					 cmp_func_t cmp, const void *search)
 {
 	unsigned i, n = 0;
 
@@ -244,7 +252,7 @@ static inline ssize_t eytzinger0_find_le(void *base, size_t nr, size_t size,
 
 	do {
 		i = n;
-		n = eytzinger0_child(i, cmp(search, base + i * size, size) >= 0);
+		n = eytzinger0_child(i, cmp(search, base + i * size) >= 0);
 	} while (n < nr);
 
 	if (n & 1) {
@@ -269,13 +277,13 @@ static inline ssize_t eytzinger0_find_le(void *base, size_t nr, size_t size,
 	int _res;							\
 									\
 	while (_i < _nr &&						\
-	       (_res = _cmp(_search, _base + _i * _size, _size)))	\
+	       (_res = _cmp(_search, _base + _i * _size)))		\
 		_i = eytzinger0_child(_i, _res > 0);			\
 	_i;								\
 })
 
-void eytzinger0_sort(void *, size_t, size_t,
-		    int (*cmp_func)(const void *, const void *, size_t),
-		    void (*swap_func)(void *, void *, size_t));
+void eytzinger0_sort_r(void *, size_t, size_t,
+		       cmp_r_func_t, swap_r_func_t, const void *);
+void eytzinger0_sort(void *, size_t, size_t, cmp_func_t, swap_func_t);
 
-#endif /* _EYTZINGER_H */
+#endif /* _LINUX_EYTZINGER_H */
diff --git a/lib/sort.c b/lib/sort.c
index b399bf10d6759..f5b2206c73461 100644
--- a/lib/sort.c
+++ b/lib/sort.c
@@ -290,3 +290,92 @@ void sort(void *base, size_t num, size_t size,
 	return sort_r(base, num, size, _CMP_WRAPPER, SWAP_WRAPPER, &w);
 }
 EXPORT_SYMBOL(sort);
+
+#include <linux/eytzinger.h>
+
+static inline int eytzinger0_do_cmp(void *base, size_t n, size_t size,
+			 cmp_r_func_t cmp_func, const void *priv,
+			 size_t l, size_t r)
+{
+	return do_cmp(base + inorder_to_eytzinger0(l, n) * size,
+		      base + inorder_to_eytzinger0(r, n) * size,
+		      cmp_func, priv);
+}
+
+static inline void eytzinger0_do_swap(void *base, size_t n, size_t size,
+			   swap_r_func_t swap_func, const void *priv,
+			   size_t l, size_t r)
+{
+	do_swap(base + inorder_to_eytzinger0(l, n) * size,
+		base + inorder_to_eytzinger0(r, n) * size,
+		size, swap_func, priv);
+}
+
+void eytzinger0_sort_r(void *base, size_t n, size_t size,
+		       cmp_r_func_t cmp_func,
+		       swap_r_func_t swap_func,
+		       const void *priv)
+{
+	int i, c, r;
+
+	/* called from 'sort' without swap function, let's pick the default */
+	if (swap_func == SWAP_WRAPPER && !((struct wrapper *)priv)->swap)
+		swap_func = NULL;
+
+	if (!swap_func) {
+		if (is_aligned(base, size, 8))
+			swap_func = SWAP_WORDS_64;
+		else if (is_aligned(base, size, 4))
+			swap_func = SWAP_WORDS_32;
+		else
+			swap_func = SWAP_BYTES;
+	}
+
+	/* heapify */
+	for (i = n / 2 - 1; i >= 0; --i) {
+		for (r = i; r * 2 + 1 < n; r = c) {
+			c = r * 2 + 1;
+
+			if (c + 1 < n &&
+			    eytzinger0_do_cmp(base, n, size, cmp_func, priv, c, c + 1) < 0)
+				c++;
+
+			if (eytzinger0_do_cmp(base, n, size, cmp_func, priv, r, c) >= 0)
+				break;
+
+			eytzinger0_do_swap(base, n, size, swap_func, priv, r, c);
+		}
+	}
+
+	/* sort */
+	for (i = n - 1; i > 0; --i) {
+		eytzinger0_do_swap(base, n, size, swap_func, priv, 0, i);
+
+		for (r = 0; r * 2 + 1 < i; r = c) {
+			c = r * 2 + 1;
+
+			if (c + 1 < i &&
+			    eytzinger0_do_cmp(base, n, size, cmp_func, priv, c, c + 1) < 0)
+				c++;
+
+			if (eytzinger0_do_cmp(base, n, size, cmp_func, priv, r, c) >= 0)
+				break;
+
+			eytzinger0_do_swap(base, n, size, swap_func, priv, r, c);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(eytzinger0_sort_r);
+
+void eytzinger0_sort(void *base, size_t n, size_t size,
+		     cmp_func_t cmp_func,
+		     swap_func_t swap_func)
+{
+	struct wrapper w = {
+		.cmp  = cmp_func,
+		.swap = swap_func,
+	};
+
+	return eytzinger0_sort_r(base, n, size, _CMP_WRAPPER, SWAP_WRAPPER, &w);
+}
+EXPORT_SYMBOL_GPL(eytzinger0_sort);
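For readers unfamiliar with the layout described in the
include/linux/eytzinger.h comment above, here is a small userspace sketch
of the zero-based variant.  It is not the kernel code: eyt_build() and
eyt_find() are hypothetical names, the elements are plain ints rather than
opaque records compared through a cmp_func_t, and the build step is a
recursive in-order walk into a second array rather than the in-place
eytzinger0_sort().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill eyt[] in zero-based Eytzinger order from sorted[] by walking the
 * implicit tree in order: the children of slot i live at 2i+1 and 2i+2.
 * Call with i = 0 and pos = 0; returns the number of elements consumed.
 */
static size_t eyt_build(const int *sorted, int *eyt, size_t n,
			size_t i, size_t pos)
{
	if (i >= n)
		return pos;

	pos = eyt_build(sorted, eyt, n, 2 * i + 1, pos);	/* left subtree */
	eyt[i] = sorted[pos++];					/* this node */
	return eyt_build(sorted, eyt, n, 2 * i + 2, pos);	/* right subtree */
}

/* Return the slot in eyt[] that holds key, or -1 if key is absent. */
static long eyt_find(const int *eyt, size_t n, int key)
{
	size_t i = 0;

	assert(eyt != NULL || n == 0);

	while (i < n) {
		if (eyt[i] == key)
			return (long)i;
		/* descend: left child for smaller keys, right for larger */
		i = 2 * i + 1 + (key > eyt[i]);
	}
	return -1;
}
```

With sorted[] = {1, 2, 3, 4, 5, 6, 7}, eyt_build(sorted, eyt, 7, 0, 0)
leaves eyt[] as {4, 2, 6, 1, 3, 5, 7}: the root is the median, and both
children of slot i sit next to each other at 2i+1 and 2i+2, which is what
the prefetching argument in the header comment relies on.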



* [PATCH 3/4] bcachefs: bch2_time_stats_to_seq_buf()
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 1/4] mean and variance: Promote to lib/math Darrick J. Wong
  2024-02-24  1:09   ` [PATCH 2/4] eytzinger: Promote to include/linux/ Darrick J. Wong
@ 2024-02-24  1:09   ` Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 4/4] time_stats: Promote to lib/ Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:09 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/super.c |    2 +
 fs/bcachefs/util.c  |  133 +++++++++++++++++++++++++++++++++++++++++++++------
 fs/bcachefs/util.h  |    4 ++
 3 files changed, 123 insertions(+), 16 deletions(-)


diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 6b23e11825e6d..b9e2c1032b920 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -1262,6 +1262,8 @@ static struct bch_dev *__bch2_dev_alloc(struct bch_fs *c,
 
 	bch2_time_stats_init(&ca->io_latency[READ]);
 	bch2_time_stats_init(&ca->io_latency[WRITE]);
+	ca->io_latency[READ].quantiles_enabled = true;
+	ca->io_latency[WRITE].quantiles_enabled = true;
 
 	ca->mi = bch2_mi_to_cpu(member);
 
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index 902f6b1a8a142..4c63f81e18bc4 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -506,10 +506,8 @@ static inline void pr_name_and_units(struct printbuf *out, const char *name, u64
 
 void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats)
 {
-	const struct time_unit *u;
 	s64 f_mean = 0, d_mean = 0;
-	u64 q, last_q = 0, f_stddev = 0, d_stddev = 0;
-	int i;
+	u64 f_stddev = 0, d_stddev = 0;
 
 	if (stats->buffer) {
 		int cpu;
@@ -608,19 +606,122 @@ void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats
 
 	printbuf_tabstops_reset(out);
 
-	i = eytzinger0_first(NR_QUANTILES);
-	u = pick_time_units(stats->quantiles.entries[i].m);
-
-	prt_printf(out, "quantiles (%s):\t", u->name);
-	eytzinger0_for_each(i, NR_QUANTILES) {
-		bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
-
-		q = max(stats->quantiles.entries[i].m, last_q);
-		prt_printf(out, "%llu ",
-		       div_u64(q, u->nsecs));
-		if (is_last)
-			prt_newline(out);
-		last_q = q;
+	if (stats->quantiles_enabled) {
+		int i = eytzinger0_first(NR_QUANTILES);
+		const struct time_unit *u =
+			pick_time_units(stats->quantiles.entries[i].m);
+		u64 last_q = 0;
+
+		prt_printf(out, "quantiles (%s):\t", u->name);
+		eytzinger0_for_each(i, NR_QUANTILES) {
+			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+
+			u64 q = max(stats->quantiles.entries[i].m, last_q);
+			prt_printf(out, "%llu ", div_u64(q, u->nsecs));
+			if (is_last)
+				prt_newline(out);
+			last_q = q;
+		}
+	}
+}
+
+#include <linux/seq_buf.h>
+
+static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
+{
+	const struct time_unit *u = pick_time_units(ns);
+
+	seq_buf_printf(out, "%8llu %s", div64_u64(ns, u->nsecs), u->name);
+}
+
+void bch2_time_stats_to_seq_buf(struct seq_buf *out, struct bch2_time_stats *stats)
+{
+	s64 f_mean = 0, d_mean = 0;
+	u64 f_stddev = 0, d_stddev = 0;
+
+	if (stats->buffer) {
+		int cpu;
+
+		spin_lock_irq(&stats->lock);
+		for_each_possible_cpu(cpu)
+			__bch2_time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
+		spin_unlock_irq(&stats->lock);
+	}
+
+	/*
+	 * avoid divide by zero
+	 */
+	if (stats->freq_stats.n) {
+		f_mean = mean_and_variance_get_mean(stats->freq_stats);
+		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
+		d_mean = mean_and_variance_get_mean(stats->duration_stats);
+		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
+	}
+
+	seq_buf_printf(out, "count: %llu\n", stats->duration_stats.n);
+
+	seq_buf_printf(out, "                       since mount        recent\n");
+
+	seq_buf_printf(out, "duration of events\n");
+
+	seq_buf_printf(out, "  min:                     ");
+	seq_buf_time_units_aligned(out, stats->min_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  max:                     ");
+	seq_buf_time_units_aligned(out, stats->max_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  total:                   ");
+	seq_buf_time_units_aligned(out, stats->total_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  mean:                    ");
+	seq_buf_time_units_aligned(out, d_mean);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  stddev:                  ");
+	seq_buf_time_units_aligned(out, d_stddev);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "time between events\n");
+
+	seq_buf_printf(out, "  min:                     ");
+	seq_buf_time_units_aligned(out, stats->min_freq);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  max:                     ");
+	seq_buf_time_units_aligned(out, stats->max_freq);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  mean:                    ");
+	seq_buf_time_units_aligned(out, f_mean);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  stddev:                  ");
+	seq_buf_time_units_aligned(out, f_stddev);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	if (stats->quantiles_enabled) {
+		int i = eytzinger0_first(NR_QUANTILES);
+		const struct time_unit *u =
+			pick_time_units(stats->quantiles.entries[i].m);
+		u64 last_q = 0;
+
+		seq_buf_printf(out, "quantiles (%s):\t", u->name);
+		eytzinger0_for_each(i, NR_QUANTILES) {
+			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+
+			u64 q = max(stats->quantiles.entries[i].m, last_q);
+			seq_buf_printf(out, "%llu ", div_u64(q, u->nsecs));
+			if (is_last)
+				seq_buf_printf(out, "\n");
+			last_q = q;
+		}
 	}
 }
 #else
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index c3b11c3d24ea9..7ff2d4fe26f68 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -382,6 +382,7 @@ struct bch2_time_stat_buffer {
 
 struct bch2_time_stats {
 	spinlock_t	lock;
+	bool		quantiles_enabled;
 	/* all fields are in nanoseconds */
 	u64             min_duration;
 	u64		max_duration;
@@ -435,6 +436,9 @@ static inline bool track_event_change(struct bch2_time_stats *stats,
 
 void bch2_time_stats_to_text(struct printbuf *, struct bch2_time_stats *);
 
+struct seq_buf;
+void bch2_time_stats_to_seq_buf(struct seq_buf *, struct bch2_time_stats *);
+
 void bch2_time_stats_exit(struct bch2_time_stats *);
 void bch2_time_stats_init(struct bch2_time_stats *);
 



* [PATCH 4/4] time_stats: Promote to lib/
  2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:09   ` [PATCH 3/4] bcachefs: bch2_time_stats_to_seq_buf() Darrick J. Wong
@ 2024-02-24  1:10   ` Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:10 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: Dave Chinner, Theodore Ts'o, Coly Li, linux-xfs,
	linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Library code from bcachefs for tracking latency measurements.

The main interface is
  time_stats_update(stats, start_time);

which collects a new event with an end time of the current time.

It features percpu buffering of input values, making it very low
overhead, and nicely formatted output to printbufs or seq_buf.
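To make the bookkeeping concrete, here is a rough userspace sketch of
what recording one event involves.  It is an illustration only:
struct sketch_stats and sketch_stats_update() are hypothetical names, the
caller passes both timestamps explicitly (much like the
__time_stats_update(stats, start, end) form used by callers in this
patch), there is no percpu buffering or locking, and the mean/variance
use Welford's method with doubles, which is an assumption made for
brevity and not something kernel code could do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the duration half of the stats. */
struct sketch_stats {
	uint64_t count;
	uint64_t min_duration;		/* all durations in nanoseconds */
	uint64_t max_duration;
	uint64_t total_duration;
	double	 mean;			/* running mean of durations */
	double	 m2;			/* sum of squared deviations */
};

/* Record one event that ran from @start to @end (both in ns). */
static void sketch_stats_update(struct sketch_stats *s,
				uint64_t start, uint64_t end)
{
	uint64_t d;
	double delta;

	assert(end >= start);
	d = end - start;

	if (!s->count || d < s->min_duration)
		s->min_duration = d;
	if (d > s->max_duration)
		s->max_duration = d;
	s->total_duration += d;
	s->count++;

	/* Welford's online update for mean and variance */
	delta = (double)d - s->mean;
	s->mean += delta / (double)s->count;
	s->m2 += delta * ((double)d - s->mean);
}

/* Sample variance of the recorded durations; 0 with fewer than 2 events. */
static double sketch_stats_variance(const struct sketch_stats *s)
{
	return s->count > 1 ? s->m2 / (double)(s->count - 1) : 0.0;
}
```

Feeding it three events lasting 10, 20 and 30 ns gives count 3, min 10,
max 30, total 60, mean 20 and variance 100 (stddev 10).  The "time
between events" rows in the output are tracked separately and are left
out of this sketch.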

Sample output, from the bcache conversion:

root@moria-kvm:/sys/fs/bcache/bdaedb8c-4554-4dd2-87e4-276e51eb47cc# cat internal/btree_sort_times
count: 6414
                       since mount        recent
duration of events
  min:                          440 ns
  max:                         1102 us
  total:                        674 ms
  mean:                         105 us     102 us
  stddev:                       101 us      88 us
time between events
  min:                          881 ns
  max:                            3 s
  mean:                           7 ms       6 ms
  stddev:                        52 ms       6 ms
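The column layout above comes from printing each value with a "%8llu %s"
format after scaling it to a suitable unit, as
seq_buf_time_units_aligned() does in the previous patch.  A userspace
sketch of that step follows.  The unit table and the selection rule in
pick_unit() (move to the next unit once the value is at least twice its
size) are assumptions, chosen only because they agree with values such as
"1102 us" and "440 ns" in the sample; pick_unit() and format_aligned()
are hypothetical names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct unit {
	const char *name;
	uint64_t    nsecs;
};

/* Assumed, truncated unit table: nanoseconds up to seconds only. */
static const struct unit units[] = {
	{ "ns", 1ULL },
	{ "us", 1000ULL },
	{ "ms", 1000000ULL },
	{ "s",	1000000000ULL },
};

/* Assumed rule: step up while ns is at least two of the next unit. */
static const struct unit *pick_unit(uint64_t ns)
{
	size_t i = 0;

	while (i + 1 < sizeof(units) / sizeof(units[0]) &&
	       ns >= units[i + 1].nsecs * 2)
		i++;
	return &units[i];
}

/* Format ns as a right-aligned 8-column number followed by its unit. */
static int format_aligned(char *buf, size_t len, uint64_t ns)
{
	const struct unit *u = pick_unit(ns);

	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%8" PRIu64 " %s", ns / u->nsecs, u->name);
}
```

Under these assumptions 1102000 ns is rendered as "    1102 us" rather
than "       1 ms", so values just above a unit boundary keep their
precision.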

Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                         |    7 +
 fs/bcachefs/Kconfig                 |    2 
 fs/bcachefs/alloc_foreground.c      |   13 +-
 fs/bcachefs/bcachefs.h              |   11 +
 fs/bcachefs/btree_cache.c           |    2 
 fs/bcachefs/btree_gc.c              |    2 
 fs/bcachefs/btree_io.c              |    8 +
 fs/bcachefs/btree_iter.c            |    8 +
 fs/bcachefs/btree_locking.h         |    2 
 fs/bcachefs/btree_update_interior.c |    8 +
 fs/bcachefs/io_read.c               |    4 -
 fs/bcachefs/io_write.c              |    4 -
 fs/bcachefs/journal.c               |    5 -
 fs/bcachefs/journal_io.c            |    9 +
 fs/bcachefs/journal_reclaim.c       |    9 -
 fs/bcachefs/journal_types.h         |   11 -
 fs/bcachefs/nocow_locking.c         |    2 
 fs/bcachefs/super.c                 |   12 +-
 fs/bcachefs/util.c                  |  263 ----------------------------------
 fs/bcachefs/util.h                  |   83 -----------
 include/linux/time_stats.h          |  134 +++++++++++++++++
 lib/Kconfig                         |    4 +
 lib/Makefile                        |    2 
 lib/time_stats.c                    |  271 +++++++++++++++++++++++++++++++++++
 24 files changed, 470 insertions(+), 406 deletions(-)
 create mode 100644 include/linux/time_stats.h
 create mode 100644 lib/time_stats.c


diff --git a/MAINTAINERS b/MAINTAINERS
index 98a17270566d3..aa762fe654e3e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22157,6 +22157,13 @@ F:	kernel/time/ntp.c
 F:	kernel/time/time*.c
 F:	tools/testing/selftests/timers/
 
+TIME STATS:
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+M:	Darrick J. Wong <djwong@kernel.org>
+S:	Maintained
+F:	include/linux/time_stats.h
+F:	lib/time_stats.c
+
 TIPC NETWORK LAYER
 M:	Jon Maloy <jmaloy@redhat.com>
 M:	Ying Xue <ying.xue@windriver.com>
diff --git a/fs/bcachefs/Kconfig b/fs/bcachefs/Kconfig
index 72d1179262b33..8c587ddd2f85e 100644
--- a/fs/bcachefs/Kconfig
+++ b/fs/bcachefs/Kconfig
@@ -24,7 +24,7 @@ config BCACHEFS_FS
 	select XXHASH
 	select SRCU
 	select SYMBOLIC_ERRNAME
-	select MEAN_AND_VARIANCE
+	select TIME_STATS
 	help
 	The bcachefs filesystem - a modern, copy on write filesystem, with
 	support for multiple devices, compression, checksumming, etc.
diff --git a/fs/bcachefs/alloc_foreground.c b/fs/bcachefs/alloc_foreground.c
index 633d3223b353f..ca58193dd9027 100644
--- a/fs/bcachefs/alloc_foreground.c
+++ b/fs/bcachefs/alloc_foreground.c
@@ -236,8 +236,7 @@ static struct open_bucket *__try_alloc_bucket(struct bch_fs *c, struct bch_dev *
 		if (cl)
 			closure_wait(&c->open_buckets_wait, cl);
 
-		track_event_change(&c->times[BCH_TIME_blocked_allocate_open_bucket],
-				   &c->blocked_allocate_open_bucket, true);
+		track_event_change(&c->times[BCH_TIME_blocked_allocate_open_bucket], true);
 		spin_unlock(&c->freelist_lock);
 		return ERR_PTR(-BCH_ERR_open_buckets_empty);
 	}
@@ -263,11 +262,8 @@ static struct open_bucket *__try_alloc_bucket(struct bch_fs *c, struct bch_dev *
 	ca->nr_open_buckets++;
 	bch2_open_bucket_hash_add(c, ob);
 
-	track_event_change(&c->times[BCH_TIME_blocked_allocate_open_bucket],
-			   &c->blocked_allocate_open_bucket, false);
-
-	track_event_change(&c->times[BCH_TIME_blocked_allocate],
-			   &c->blocked_allocate, false);
+	track_event_change(&c->times[BCH_TIME_blocked_allocate_open_bucket], false);
+	track_event_change(&c->times[BCH_TIME_blocked_allocate], false);
 
 	spin_unlock(&c->freelist_lock);
 	return ob;
@@ -555,8 +551,7 @@ static struct open_bucket *bch2_bucket_alloc_trans(struct btree_trans *trans,
 			goto again;
 		}
 
-		track_event_change(&c->times[BCH_TIME_blocked_allocate],
-				   &c->blocked_allocate, true);
+		track_event_change(&c->times[BCH_TIME_blocked_allocate], true);
 
 		ob = ERR_PTR(-BCH_ERR_freelist_empty);
 		goto err;
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 69d0d60d50e36..92547d6fd2d95 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -200,6 +200,7 @@
 #include <linux/seqlock.h>
 #include <linux/shrinker.h>
 #include <linux/srcu.h>
+#include <linux/time_stats.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
 #include <linux/zstd.h>
@@ -593,7 +594,7 @@ struct bch_dev {
 
 	/* The rest of this all shows up in sysfs */
 	atomic64_t		cur_latency[2];
-	struct bch2_time_stats	io_latency[2];
+	struct time_stats	io_latency[2];
 
 #define CONGESTED_MAX		1024
 	atomic_t		congested;
@@ -640,8 +641,8 @@ struct btree_debug {
 #define BCH_TRANSACTIONS_NR 128
 
 struct btree_transaction_stats {
-	struct bch2_time_stats	duration;
-	struct bch2_time_stats	lock_hold_times;
+	struct time_stats	duration;
+	struct time_stats	lock_hold_times;
 	struct mutex		lock;
 	unsigned		nr_max_paths;
 	unsigned		journal_entries_size;
@@ -919,8 +920,6 @@ struct bch_fs {
 	/* ALLOCATOR */
 	spinlock_t		freelist_lock;
 	struct closure_waitlist	freelist_wait;
-	u64			blocked_allocate;
-	u64			blocked_allocate_open_bucket;
 
 	open_bucket_idx_t	open_buckets_freelist;
 	open_bucket_idx_t	open_buckets_nr_free;
@@ -1104,7 +1103,7 @@ struct bch_fs {
 	unsigned		copy_gc_enabled:1;
 	bool			promote_whole_extents;
 
-	struct bch2_time_stats	times[BCH_TIME_STAT_NR];
+	struct time_stats	times[BCH_TIME_STAT_NR];
 
 	struct btree_transaction_stats btree_transaction_stats[BCH_TRANSACTIONS_NR];
 
diff --git a/fs/bcachefs/btree_cache.c b/fs/bcachefs/btree_cache.c
index d7c81beac14af..8b3c04fc406f5 100644
--- a/fs/bcachefs/btree_cache.c
+++ b/fs/bcachefs/btree_cache.c
@@ -648,7 +648,7 @@ struct btree *bch2_btree_node_mem_alloc(struct btree_trans *trans, bool pcpu_rea
 	bch2_btree_keys_init(b);
 	set_btree_node_accessed(b);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_node_mem_alloc],
+	time_stats_update(&c->times[BCH_TIME_btree_node_mem_alloc],
 			       start_time);
 
 	memalloc_nofs_restore(flags);
diff --git a/fs/bcachefs/btree_gc.c b/fs/bcachefs/btree_gc.c
index 1102995643b13..774df395e4c73 100644
--- a/fs/bcachefs/btree_gc.c
+++ b/fs/bcachefs/btree_gc.c
@@ -1970,7 +1970,7 @@ int bch2_gc_gens(struct bch_fs *c)
 
 	c->gc_count++;
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_gc], start_time);
+	time_stats_update(&c->times[BCH_TIME_btree_gc], start_time);
 	trace_and_count(c, gc_gens_end, c);
 err:
 	for_each_member_device(c, ca) {
diff --git a/fs/bcachefs/btree_io.c b/fs/bcachefs/btree_io.c
index aa9b6cbe32269..a56dcabb7ace7 100644
--- a/fs/bcachefs/btree_io.c
+++ b/fs/bcachefs/btree_io.c
@@ -327,7 +327,7 @@ static void btree_node_sort(struct bch_fs *c, struct btree *b,
 	BUG_ON(vstruct_end(&out->keys) > (void *) out + bytes);
 
 	if (sorting_entire_node)
-		bch2_time_stats_update(&c->times[BCH_TIME_btree_node_sort],
+		time_stats_update(&c->times[BCH_TIME_btree_node_sort],
 				       start_time);
 
 	/* Make sure we preserve bset journal_seq: */
@@ -397,7 +397,7 @@ void bch2_btree_sort_into(struct bch_fs *c,
 			&dst->format,
 			true);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_node_sort],
+	time_stats_update(&c->times[BCH_TIME_btree_node_sort],
 			       start_time);
 
 	set_btree_bset_end(dst, dst->set);
@@ -1251,7 +1251,7 @@ int bch2_btree_node_read_done(struct bch_fs *c, struct bch_dev *ca,
 out:
 	mempool_free(iter, &c->fill_iter);
 	printbuf_exit(&buf);
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_node_read_done], start_time);
+	time_stats_update(&c->times[BCH_TIME_btree_node_read_done], start_time);
 	return retry_read;
 fsck_err:
 	if (ret == -BCH_ERR_btree_node_read_err_want_retry ||
@@ -1323,7 +1323,7 @@ static void btree_node_read_work(struct work_struct *work)
 		}
 	}
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_node_read],
+	time_stats_update(&c->times[BCH_TIME_btree_node_read],
 			       rb->start_time);
 	bio_put(&rb->bio);
 
diff --git a/fs/bcachefs/btree_iter.c b/fs/bcachefs/btree_iter.c
index 5467a8635be11..f2d7b1dabcfbb 100644
--- a/fs/bcachefs/btree_iter.c
+++ b/fs/bcachefs/btree_iter.c
@@ -2899,7 +2899,7 @@ u32 bch2_trans_begin(struct btree_trans *trans)
 
 	if (!IS_ENABLED(CONFIG_BCACHEFS_NO_LATENCY_ACCT) &&
 	    time_after64(now, trans->last_begin_time + 10))
-		__bch2_time_stats_update(&btree_trans_stats(trans)->duration,
+		__time_stats_update(&btree_trans_stats(trans)->duration,
 					 trans->last_begin_time, now);
 
 	if (!trans->restarted &&
@@ -3224,7 +3224,7 @@ void bch2_fs_btree_iter_exit(struct bch_fs *c)
 	     s < c->btree_transaction_stats + ARRAY_SIZE(c->btree_transaction_stats);
 	     s++) {
 		kfree(s->max_paths_text);
-		bch2_time_stats_exit(&s->lock_hold_times);
+		time_stats_exit(&s->lock_hold_times);
 	}
 
 	if (c->btree_trans_barrier_initialized)
@@ -3240,8 +3240,8 @@ void bch2_fs_btree_iter_init_early(struct bch_fs *c)
 	for (s = c->btree_transaction_stats;
 	     s < c->btree_transaction_stats + ARRAY_SIZE(c->btree_transaction_stats);
 	     s++) {
-		bch2_time_stats_init(&s->duration);
-		bch2_time_stats_init(&s->lock_hold_times);
+		time_stats_init(&s->duration);
+		time_stats_init(&s->lock_hold_times);
 		mutex_init(&s->lock);
 	}
 
diff --git a/fs/bcachefs/btree_locking.h b/fs/bcachefs/btree_locking.h
index 4bd72c855da1a..f2e2c5881b7e4 100644
--- a/fs/bcachefs/btree_locking.h
+++ b/fs/bcachefs/btree_locking.h
@@ -122,7 +122,7 @@ static void btree_trans_lock_hold_time_update(struct btree_trans *trans,
 					      struct btree_path *path, unsigned level)
 {
 #ifdef CONFIG_BCACHEFS_LOCK_TIME_STATS
-	__bch2_time_stats_update(&btree_trans_stats(trans)->lock_hold_times,
+	__time_stats_update(&btree_trans_stats(trans)->lock_hold_times,
 				 path->l[level].lock_taken_time,
 				 local_clock());
 #endif
diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c
index 4530b14ff2c37..669379eaea2f0 100644
--- a/fs/bcachefs/btree_update_interior.c
+++ b/fs/bcachefs/btree_update_interior.c
@@ -517,7 +517,7 @@ static void bch2_btree_update_free(struct btree_update *as, struct btree_trans *
 	bch2_disk_reservation_put(c, &as->disk_res);
 	bch2_btree_reserve_put(as, trans);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_interior_update_total],
+	time_stats_update(&c->times[BCH_TIME_btree_interior_update_total],
 			       as->start_time);
 
 	mutex_lock(&c->btree_interior_update_lock);
@@ -1039,7 +1039,7 @@ static void bch2_btree_update_done(struct btree_update *as, struct btree_trans *
 	continue_at(&as->cl, btree_update_set_nodes_written,
 		    as->c->btree_interior_update_worker);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_interior_update_foreground],
+	time_stats_update(&c->times[BCH_TIME_btree_interior_update_foreground],
 			       start_time);
 }
 
@@ -1630,7 +1630,7 @@ static int btree_split(struct btree_update *as, struct btree_trans *trans,
 
 	bch2_trans_verify_locks(trans);
 
-	bch2_time_stats_update(&c->times[n2
+	time_stats_update(&c->times[n2
 			       ? BCH_TIME_btree_node_split
 			       : BCH_TIME_btree_node_compact],
 			       start_time);
@@ -1936,7 +1936,7 @@ int __bch2_foreground_maybe_merge(struct btree_trans *trans,
 
 	bch2_btree_update_done(as, trans);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_btree_node_merge], start_time);
+	time_stats_update(&c->times[BCH_TIME_btree_node_merge], start_time);
 out:
 err:
 	if (new_path)
diff --git a/fs/bcachefs/io_read.c b/fs/bcachefs/io_read.c
index 3c574d8873a1e..dce136cd22713 100644
--- a/fs/bcachefs/io_read.c
+++ b/fs/bcachefs/io_read.c
@@ -134,7 +134,7 @@ static void promote_done(struct bch_write_op *wop)
 		container_of(wop, struct promote_op, write.op);
 	struct bch_fs *c = op->write.op.c;
 
-	bch2_time_stats_update(&c->times[BCH_TIME_data_promote],
+	time_stats_update(&c->times[BCH_TIME_data_promote],
 			       op->start_time);
 	promote_free(c, op);
 }
@@ -356,7 +356,7 @@ static inline struct bch_read_bio *bch2_rbio_free(struct bch_read_bio *rbio)
 static void bch2_rbio_done(struct bch_read_bio *rbio)
 {
 	if (rbio->start_time)
-		bch2_time_stats_update(&rbio->c->times[BCH_TIME_data_read],
+		time_stats_update(&rbio->c->times[BCH_TIME_data_read],
 				       rbio->start_time);
 	bio_endio(&rbio->bio);
 }
diff --git a/fs/bcachefs/io_write.c b/fs/bcachefs/io_write.c
index 2c098ac017b30..8123a84320e3f 100644
--- a/fs/bcachefs/io_write.c
+++ b/fs/bcachefs/io_write.c
@@ -88,7 +88,7 @@ void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw)
 
 	bch2_congested_acct(ca, io_latency, now, rw);
 
-	__bch2_time_stats_update(&ca->io_latency[rw], submit_time, now);
+	__time_stats_update(&ca->io_latency[rw], submit_time, now);
 }
 
 #endif
@@ -457,7 +457,7 @@ static void bch2_write_done(struct closure *cl)
 
 	EBUG_ON(op->open_buckets.nr);
 
-	bch2_time_stats_update(&c->times[BCH_TIME_data_write], op->start_time);
+	time_stats_update(&c->times[BCH_TIME_data_write], op->start_time);
 	bch2_disk_reservation_put(c, &op->res);
 
 	if (!(op->flags & BCH_WRITE_MOVE))
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index bc890776eb579..c5d6cc29be870 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -518,8 +518,7 @@ static int __journal_res_get(struct journal *j, struct journal_res *res,
 	ret = journal_entry_open(j);
 
 	if (ret == JOURNAL_ERR_max_in_flight) {
-		track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight],
-				   &j->max_in_flight_start, true);
+		track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], true);
 		if (trace_journal_entry_full_enabled()) {
 			struct printbuf buf = PRINTBUF;
 			buf.atomic++;
@@ -727,7 +726,7 @@ int bch2_journal_flush_seq(struct journal *j, u64 seq)
 	ret = wait_event_interruptible(j->wait, (ret2 = bch2_journal_flush_seq_async(j, seq, NULL)));
 
 	if (!ret)
-		bch2_time_stats_update(j->flush_seq_time, start_time);
+		time_stats_update(j->flush_seq_time, start_time);
 
 	return ret ?: ret2 < 0 ? ret2 : 0;
 }
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 47805193f18cc..23401e9686eed 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1576,9 +1576,9 @@ static CLOSURE_CALLBACK(journal_write_done)
 	u64 v, seq;
 	int err = 0;
 
-	bch2_time_stats_update(!JSET_NO_FLUSH(w->data)
-			       ? j->flush_write_time
-			       : j->noflush_write_time, j->write_start_time);
+	time_stats_update(!JSET_NO_FLUSH(w->data)
+			  ? j->flush_write_time
+			  : j->noflush_write_time, j->write_start_time);
 
 	if (!w->devs_written.nr) {
 		bch_err(c, "unable to write journal to sufficient devices");
@@ -1639,8 +1639,7 @@ static CLOSURE_CALLBACK(journal_write_done)
 	bch2_journal_reclaim_fast(j);
 	bch2_journal_space_available(j);
 
-	track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight],
-			   &j->max_in_flight_start, false);
+	track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight], false);
 
 	closure_wake_up(&w->wait);
 	journal_wake(j);
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 2cf626315652c..d58503b73b966 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -62,12 +62,9 @@ void bch2_journal_set_watermark(struct journal *j)
 		? BCH_WATERMARK_reclaim
 		: BCH_WATERMARK_stripe;
 
-	if (track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_space],
-			       &j->low_on_space_start, low_on_space) ||
-	    track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_pin],
-			       &j->low_on_pin_start, low_on_pin) ||
-	    track_event_change(&c->times[BCH_TIME_blocked_write_buffer_full],
-			       &j->write_buffer_full_start, low_on_wb))
+	if (track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_space], low_on_space) ||
+	    track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_pin], low_on_pin) ||
+	    track_event_change(&c->times[BCH_TIME_blocked_write_buffer_full], low_on_wb))
 		trace_and_count(c, journal_full, c);
 
 	swap(watermark, j->watermark);
diff --git a/fs/bcachefs/journal_types.h b/fs/bcachefs/journal_types.h
index 38817c7a08515..b93e02c0a178a 100644
--- a/fs/bcachefs/journal_types.h
+++ b/fs/bcachefs/journal_types.h
@@ -274,14 +274,9 @@ struct journal {
 	u64			nr_noflush_writes;
 	u64			entry_bytes_written;
 
-	u64			low_on_space_start;
-	u64			low_on_pin_start;
-	u64			max_in_flight_start;
-	u64			write_buffer_full_start;
-
-	struct bch2_time_stats	*flush_write_time;
-	struct bch2_time_stats	*noflush_write_time;
-	struct bch2_time_stats	*flush_seq_time;
+	struct time_stats	*flush_write_time;
+	struct time_stats	*noflush_write_time;
+	struct time_stats	*flush_seq_time;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map	res_map;
diff --git a/fs/bcachefs/nocow_locking.c b/fs/bcachefs/nocow_locking.c
index 3c21981a4a1c0..181efa4a83fa1 100644
--- a/fs/bcachefs/nocow_locking.c
+++ b/fs/bcachefs/nocow_locking.c
@@ -85,7 +85,7 @@ void __bch2_bucket_nocow_lock(struct bucket_nocow_lock_table *t,
 		u64 start_time = local_clock();
 
 		__closure_wait_event(&l->wait, __bch2_bucket_nocow_trylock(l, dev_bucket, flags));
-		bch2_time_stats_update(&c->times[BCH_TIME_nocow_lock_contended], start_time);
+		time_stats_update(&c->times[BCH_TIME_nocow_lock_contended], start_time);
 	}
 }
 
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index b9e2c1032b920..c491c5e102287 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -520,7 +520,7 @@ static void __bch2_fs_free(struct bch_fs *c)
 	unsigned i;
 
 	for (i = 0; i < BCH_TIME_STAT_NR; i++)
-		bch2_time_stats_exit(&c->times[i]);
+		time_stats_exit(&c->times[i]);
 
 	bch2_free_pending_node_rewrites(c);
 	bch2_fs_sb_errors_exit(c);
@@ -753,7 +753,7 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
 	c->journal_keys.initial_ref_held = true;
 
 	for (i = 0; i < BCH_TIME_STAT_NR; i++)
-		bch2_time_stats_init(&c->times[i]);
+		time_stats_init(&c->times[i]);
 
 	bch2_fs_copygc_init(c);
 	bch2_fs_btree_key_cache_init_early(&c->btree_key_cache);
@@ -1168,8 +1168,8 @@ static void bch2_dev_free(struct bch_dev *ca)
 	bch2_dev_buckets_free(ca);
 	free_page((unsigned long) ca->sb_read_scratch);
 
-	bch2_time_stats_exit(&ca->io_latency[WRITE]);
-	bch2_time_stats_exit(&ca->io_latency[READ]);
+	time_stats_exit(&ca->io_latency[WRITE]);
+	time_stats_exit(&ca->io_latency[READ]);
 
 	percpu_ref_exit(&ca->io_ref);
 	percpu_ref_exit(&ca->ref);
@@ -1260,8 +1260,8 @@ static struct bch_dev *__bch2_dev_alloc(struct bch_fs *c,
 
 	INIT_WORK(&ca->io_error_work, bch2_io_error_work);
 
-	bch2_time_stats_init(&ca->io_latency[READ]);
-	bch2_time_stats_init(&ca->io_latency[WRITE]);
+	time_stats_init(&ca->io_latency[READ]);
+	time_stats_init(&ca->io_latency[WRITE]);
 	ca->io_latency[READ].quantiles_enabled = true;
 	ca->io_latency[WRITE].quantiles_enabled = true;
 
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index 4c63f81e18bc4..88853513a15fa 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -337,32 +337,6 @@ void bch2_prt_datetime(struct printbuf *out, time64_t sec)
 }
 #endif
 
-static const struct time_unit {
-	const char	*name;
-	u64		nsecs;
-} time_units[] = {
-	{ "ns",		1		 },
-	{ "us",		NSEC_PER_USEC	 },
-	{ "ms",		NSEC_PER_MSEC	 },
-	{ "s",		NSEC_PER_SEC	 },
-	{ "m",          (u64) NSEC_PER_SEC * 60},
-	{ "h",          (u64) NSEC_PER_SEC * 3600},
-	{ "eon",        U64_MAX          },
-};
-
-static const struct time_unit *pick_time_units(u64 ns)
-{
-	const struct time_unit *u;
-
-	for (u = time_units;
-	     u + 1 < time_units + ARRAY_SIZE(time_units) &&
-	     ns >= u[1].nsecs << 1;
-	     u++)
-		;
-
-	return u;
-}
-
 void bch2_pr_time_units(struct printbuf *out, u64 ns)
 {
 	const struct time_unit *u = pick_time_units(ns);
@@ -370,121 +344,6 @@ void bch2_pr_time_units(struct printbuf *out, u64 ns)
 	prt_printf(out, "%llu %s", div_u64(ns, u->nsecs), u->name);
 }
 
-/* time stats: */
-
-#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
-static void bch2_quantiles_update(struct bch2_quantiles *q, u64 v)
-{
-	unsigned i = 0;
-
-	while (i < ARRAY_SIZE(q->entries)) {
-		struct bch2_quantile_entry *e = q->entries + i;
-
-		if (unlikely(!e->step)) {
-			e->m = v;
-			e->step = max_t(unsigned, v / 2, 1024);
-		} else if (e->m > v) {
-			e->m = e->m >= e->step
-				? e->m - e->step
-				: 0;
-		} else if (e->m < v) {
-			e->m = e->m + e->step > e->m
-				? e->m + e->step
-				: U32_MAX;
-		}
-
-		if ((e->m > v ? e->m - v : v - e->m) < e->step)
-			e->step = max_t(unsigned, e->step / 2, 1);
-
-		if (v >= e->m)
-			break;
-
-		i = eytzinger0_child(i, v > e->m);
-	}
-}
-
-static inline void bch2_time_stats_update_one(struct bch2_time_stats *stats,
-					      u64 start, u64 end)
-{
-	u64 duration, freq;
-
-	if (time_after64(end, start)) {
-		duration = end - start;
-		mean_and_variance_update(&stats->duration_stats, duration);
-		mean_and_variance_weighted_update(&stats->duration_stats_weighted, duration);
-		stats->max_duration = max(stats->max_duration, duration);
-		stats->min_duration = min(stats->min_duration, duration);
-		stats->total_duration += duration;
-		bch2_quantiles_update(&stats->quantiles, duration);
-	}
-
-	if (stats->last_event && time_after64(end, stats->last_event)) {
-		freq = end - stats->last_event;
-		mean_and_variance_update(&stats->freq_stats, freq);
-		mean_and_variance_weighted_update(&stats->freq_stats_weighted, freq);
-		stats->max_freq = max(stats->max_freq, freq);
-		stats->min_freq = min(stats->min_freq, freq);
-	}
-
-	stats->last_event = end;
-}
-
-static void __bch2_time_stats_clear_buffer(struct bch2_time_stats *stats,
-					   struct bch2_time_stat_buffer *b)
-{
-	for (struct bch2_time_stat_buffer_entry *i = b->entries;
-	     i < b->entries + ARRAY_SIZE(b->entries);
-	     i++)
-		bch2_time_stats_update_one(stats, i->start, i->end);
-	b->nr = 0;
-}
-
-static noinline void bch2_time_stats_clear_buffer(struct bch2_time_stats *stats,
-						  struct bch2_time_stat_buffer *b)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&stats->lock, flags);
-	__bch2_time_stats_clear_buffer(stats, b);
-	spin_unlock_irqrestore(&stats->lock, flags);
-}
-
-void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end)
-{
-	unsigned long flags;
-
-	WARN_ONCE(!stats->duration_stats_weighted.weight ||
-		  !stats->freq_stats_weighted.weight,
-		  "uninitialized time_stats");
-
-	if (!stats->buffer) {
-		spin_lock_irqsave(&stats->lock, flags);
-		bch2_time_stats_update_one(stats, start, end);
-
-		if (mean_and_variance_weighted_get_mean(stats->freq_stats_weighted) < 32 &&
-		    stats->duration_stats.n > 1024)
-			stats->buffer =
-				alloc_percpu_gfp(struct bch2_time_stat_buffer,
-						 GFP_ATOMIC);
-		spin_unlock_irqrestore(&stats->lock, flags);
-	} else {
-		struct bch2_time_stat_buffer *b;
-
-		preempt_disable();
-		b = this_cpu_ptr(stats->buffer);
-
-		BUG_ON(b->nr >= ARRAY_SIZE(b->entries));
-		b->entries[b->nr++] = (struct bch2_time_stat_buffer_entry) {
-			.start = start,
-			.end = end
-		};
-
-		if (unlikely(b->nr == ARRAY_SIZE(b->entries)))
-			bch2_time_stats_clear_buffer(stats, b);
-		preempt_enable();
-	}
-}
-
 static void bch2_pr_time_units_aligned(struct printbuf *out, u64 ns)
 {
 	const struct time_unit *u = pick_time_units(ns);
@@ -504,7 +363,7 @@ static inline void pr_name_and_units(struct printbuf *out, const char *name, u64
 
 #define TABSTOP_SIZE 12
 
-void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats)
+void bch2_time_stats_to_text(struct printbuf *out, struct time_stats *stats)
 {
 	s64 f_mean = 0, d_mean = 0;
 	u64 f_stddev = 0, d_stddev = 0;
@@ -514,7 +373,7 @@ void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats
 
 		spin_lock_irq(&stats->lock);
 		for_each_possible_cpu(cpu)
-			__bch2_time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
+			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
 		spin_unlock_irq(&stats->lock);
 	}
 
@@ -625,124 +484,6 @@ void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats
 	}
 }
 
-#include <linux/seq_buf.h>
-
-static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
-{
-	const struct time_unit *u = pick_time_units(ns);
-
-	seq_buf_printf(out, "%8llu %s", div64_u64(ns, u->nsecs), u->name);
-}
-
-void bch2_time_stats_to_seq_buf(struct seq_buf *out, struct bch2_time_stats *stats)
-{
-	s64 f_mean = 0, d_mean = 0;
-	u64 f_stddev = 0, d_stddev = 0;
-
-	if (stats->buffer) {
-		int cpu;
-
-		spin_lock_irq(&stats->lock);
-		for_each_possible_cpu(cpu)
-			__bch2_time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
-		spin_unlock_irq(&stats->lock);
-	}
-
-	/*
-	 * avoid divide by zero
-	 */
-	if (stats->freq_stats.n) {
-		f_mean = mean_and_variance_get_mean(stats->freq_stats);
-		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
-		d_mean = mean_and_variance_get_mean(stats->duration_stats);
-		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
-	}
-
-	seq_buf_printf(out, "count: %llu\n", stats->duration_stats.n);
-
-	seq_buf_printf(out, "                       since mount        recent\n");
-
-	seq_buf_printf(out, "duration of events\n");
-
-	seq_buf_printf(out, "  min:                     ");
-	seq_buf_time_units_aligned(out, stats->min_duration);
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  max:                     ");
-	seq_buf_time_units_aligned(out, stats->max_duration);
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  total:                   ");
-	seq_buf_time_units_aligned(out, stats->total_duration);
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  mean:                    ");
-	seq_buf_time_units_aligned(out, d_mean);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted));
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  stddev:                  ");
-	seq_buf_time_units_aligned(out, d_stddev);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted));
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "time between events\n");
-
-	seq_buf_printf(out, "  min:                     ");
-	seq_buf_time_units_aligned(out, stats->min_freq);
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  max:                     ");
-	seq_buf_time_units_aligned(out, stats->max_freq);
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  mean:                    ");
-	seq_buf_time_units_aligned(out, f_mean);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted));
-	seq_buf_printf(out, "\n");
-
-	seq_buf_printf(out, "  stddev:                  ");
-	seq_buf_time_units_aligned(out, f_stddev);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
-	seq_buf_printf(out, "\n");
-
-	if (stats->quantiles_enabled) {
-		int i = eytzinger0_first(NR_QUANTILES);
-		const struct time_unit *u =
-			pick_time_units(stats->quantiles.entries[i].m);
-		u64 last_q = 0;
-
-		prt_printf(out, "quantiles (%s):\t", u->name);
-		eytzinger0_for_each(i, NR_QUANTILES) {
-			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
-
-			u64 q = max(stats->quantiles.entries[i].m, last_q);
-			seq_buf_printf(out, "%llu ", div_u64(q, u->nsecs));
-			if (is_last)
-				seq_buf_printf(out, "\n");
-			last_q = q;
-		}
-	}
-}
-#else
-void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats) {}
-#endif
-
-void bch2_time_stats_exit(struct bch2_time_stats *stats)
-{
-	free_percpu(stats->buffer);
-}
-
-void bch2_time_stats_init(struct bch2_time_stats *stats)
-{
-	memset(stats, 0, sizeof(*stats));
-	stats->duration_stats_weighted.weight = 8;
-	stats->freq_stats_weighted.weight = 8;
-	stats->min_duration = U64_MAX;
-	stats->min_freq = U64_MAX;
-	spin_lock_init(&stats->lock);
-}
-
 /* ratelimit: */
 
 /**
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index 7ff2d4fe26f68..cf8d16a911622 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -15,6 +15,7 @@
 #include <linux/preempt.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/time_stats.h>
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
 #include <linux/mean_and_variance.h>
@@ -360,87 +361,7 @@ static inline void prt_bdevname(struct printbuf *out, struct block_device *bdev)
 #endif
 }
 
-#define NR_QUANTILES	15
-#define QUANTILE_IDX(i)	inorder_to_eytzinger0(i, NR_QUANTILES)
-#define QUANTILE_FIRST	eytzinger0_first(NR_QUANTILES)
-#define QUANTILE_LAST	eytzinger0_last(NR_QUANTILES)
-
-struct bch2_quantiles {
-	struct bch2_quantile_entry {
-		u64	m;
-		u64	step;
-	}		entries[NR_QUANTILES];
-};
-
-struct bch2_time_stat_buffer {
-	unsigned	nr;
-	struct bch2_time_stat_buffer_entry {
-		u64	start;
-		u64	end;
-	}		entries[32];
-};
-
-struct bch2_time_stats {
-	spinlock_t	lock;
-	bool		quantiles_enabled;
-	/* all fields are in nanoseconds */
-	u64             min_duration;
-	u64		max_duration;
-	u64		total_duration;
-	u64             max_freq;
-	u64             min_freq;
-	u64		last_event;
-	struct bch2_quantiles quantiles;
-
-	struct mean_and_variance	  duration_stats;
-	struct mean_and_variance_weighted duration_stats_weighted;
-	struct mean_and_variance	  freq_stats;
-	struct mean_and_variance_weighted freq_stats_weighted;
-	struct bch2_time_stat_buffer __percpu *buffer;
-};
-
-#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
-void __bch2_time_stats_update(struct bch2_time_stats *stats, u64, u64);
-
-static inline void bch2_time_stats_update(struct bch2_time_stats *stats, u64 start)
-{
-	__bch2_time_stats_update(stats, start, local_clock());
-}
-
-static inline bool track_event_change(struct bch2_time_stats *stats,
-				      u64 *start, bool v)
-{
-	if (v != !!*start) {
-		if (!v) {
-			bch2_time_stats_update(stats, *start);
-			*start = 0;
-		} else {
-			*start = local_clock() ?: 1;
-			return true;
-		}
-	}
-
-	return false;
-}
-#else
-static inline void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end) {}
-static inline void bch2_time_stats_update(struct bch2_time_stats *stats, u64 start) {}
-static inline bool track_event_change(struct bch2_time_stats *stats,
-				      u64 *start, bool v)
-{
-	bool ret = v && !*start;
-	*start = v;
-	return ret;
-}
-#endif
-
-void bch2_time_stats_to_text(struct printbuf *, struct bch2_time_stats *);
-
-struct seq_buf;
-void bch2_time_stats_to_seq_buf(struct seq_buf *, struct bch2_time_stats *);
-
-void bch2_time_stats_exit(struct bch2_time_stats *);
-void bch2_time_stats_init(struct bch2_time_stats *);
+void bch2_time_stats_to_text(struct printbuf *, struct time_stats *);
 
 #define ewma_add(ewma, val, weight)					\
 ({									\
diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
new file mode 100644
index 0000000000000..caefa7aba65a0
--- /dev/null
+++ b/include/linux/time_stats.h
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * time_stats - collect statistics on events that have a duration, with nicely
+ * formatted textual output on demand
+ *
+ * - percpu buffering of event collection: cheap enough to shotgun
+ *   everywhere without worrying about overhead
+ *
+ * tracks:
+ *  - number of events
+ *  - maximum event duration ever seen
+ *  - sum of all event durations
+ *  - average event duration, standard and weighted
+ *  - standard deviation of event durations, standard and weighted
+ * and analogous statistics for the frequency of events
+ *
+ * We provide both mean and weighted mean (exponentially weighted), and standard
+ * deviation and weighted standard deviation, to give an efficient-to-compute
+ * view of current behaviour versus average behaviour - "did this event source
+ * just become wonky, or is this typical?".
+ *
+ * Particularly useful for tracking down latency issues.
+ */
+#ifndef _LINUX_TIME_STATS_H
+#define _LINUX_TIME_STATS_H
+
+#include <linux/mean_and_variance.h>
+#include <linux/sched/clock.h>
+#include <linux/spinlock_types.h>
+
+struct time_unit {
+	const char	*name;
+	u64		nsecs;
+};
+
+/*
+ * given a nanosecond value, pick the preferred time units for printing:
+ */
+const struct time_unit *pick_time_units(u64 ns);
+
+/*
+ * quantiles - do not use:
+ *
+ * Only enabled if time_stats->quantiles_enabled has been manually set - don't
+ * use in new code.
+ */
+
+#define NR_QUANTILES	15
+#define QUANTILE_IDX(i)	inorder_to_eytzinger0(i, NR_QUANTILES)
+#define QUANTILE_FIRST	eytzinger0_first(NR_QUANTILES)
+#define QUANTILE_LAST	eytzinger0_last(NR_QUANTILES)
+
+struct quantiles {
+	struct quantile_entry {
+		u64	m;
+		u64	step;
+	}		entries[NR_QUANTILES];
+};
+
+struct time_stat_buffer {
+	unsigned	nr;
+	struct time_stat_buffer_entry {
+		u64	start;
+		u64	end;
+	}		entries[32];
+};
+
+struct time_stats {
+	spinlock_t	lock;
+	bool		quantiles_enabled;
+	/* all fields are in nanoseconds */
+	u64             min_duration;
+	u64		max_duration;
+	u64		total_duration;
+	u64             max_freq;
+	u64             min_freq;
+	u64		last_event;
+	u64		last_event_start;
+	struct quantiles quantiles;
+
+	struct mean_and_variance	  duration_stats;
+	struct mean_and_variance_weighted duration_stats_weighted;
+	struct mean_and_variance	  freq_stats;
+	struct mean_and_variance_weighted freq_stats_weighted;
+	struct time_stat_buffer __percpu *buffer;
+};
+
+void __time_stats_clear_buffer(struct time_stats *, struct time_stat_buffer *);
+void __time_stats_update(struct time_stats *stats, u64, u64);
+
+/**
+ * time_stats_update - collect a new event being tracked
+ *
+ * @stats: time_stats to update
+ * @start: start time of event, recorded with local_clock()
+ *
+ * The end time of the event will be the current time
+ */
+static inline void time_stats_update(struct time_stats *stats, u64 start)
+{
+	__time_stats_update(stats, start, local_clock());
+}
+
+/**
+ * track_event_change - track state change events
+ *
+ * @stats: time_stats to update
+ * @v: new state, true or false
+ *
+ * Use this when tracking time stats for state changes, i.e. resource X becoming
+ * blocked/unblocked.
+ */
+static inline bool track_event_change(struct time_stats *stats, bool v)
+{
+	if (v != !!stats->last_event_start) {
+		if (!v) {
+			time_stats_update(stats, stats->last_event_start);
+			stats->last_event_start = 0;
+		} else {
+			stats->last_event_start = local_clock() ?: 1;
+			return true;
+		}
+	}
+
+	return false;
+}
+
+struct seq_buf;
+void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *);
+
+void time_stats_exit(struct time_stats *);
+void time_stats_init(struct time_stats *);
+
+#endif /* _LINUX_TIME_STATS_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 5ddda7c2ed9b3..3ba8b965f8c7e 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -785,3 +785,7 @@ config POLYNOMIAL
 
 config FIRMWARE_TABLE
 	bool
+
+config TIME_STATS
+	tristate
+	select MEAN_AND_VARIANCE
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e619..57858997c87aa 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -370,6 +370,8 @@ obj-$(CONFIG_SBITMAP) += sbitmap.o
 
 obj-$(CONFIG_PARMAN) += parman.o
 
+obj-$(CONFIG_TIME_STATS) += time_stats.o
+
 obj-y += group_cpus.o
 
 # GCC library routines
diff --git a/lib/time_stats.c b/lib/time_stats.c
new file mode 100644
index 0000000000000..081aeba88b535
--- /dev/null
+++ b/lib/time_stats.c
@@ -0,0 +1,271 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/eytzinger.h>
+#include <linux/jiffies.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
+#include <linux/time.h>
+#include <linux/time_stats.h>
+#include <linux/spinlock.h>
+
+static const struct time_unit time_units[] = {
+	{ "ns",		1		 },
+	{ "us",		NSEC_PER_USEC	 },
+	{ "ms",		NSEC_PER_MSEC	 },
+	{ "s",		NSEC_PER_SEC	 },
+	{ "m",          (u64) NSEC_PER_SEC * 60},
+	{ "h",          (u64) NSEC_PER_SEC * 3600},
+	{ "eon",        U64_MAX          },
+};
+
+const struct time_unit *pick_time_units(u64 ns)
+{
+	const struct time_unit *u;
+
+	for (u = time_units;
+	     u + 1 < time_units + ARRAY_SIZE(time_units) &&
+	     ns >= u[1].nsecs << 1;
+	     u++)
+		;
+
+	return u;
+}
+EXPORT_SYMBOL_GPL(pick_time_units);
+
+static void quantiles_update(struct quantiles *q, u64 v)
+{
+	unsigned i = 0;
+
+	while (i < ARRAY_SIZE(q->entries)) {
+		struct quantile_entry *e = q->entries + i;
+
+		if (unlikely(!e->step)) {
+			e->m = v;
+			e->step = max_t(unsigned, v / 2, 1024);
+		} else if (e->m > v) {
+			e->m = e->m >= e->step
+				? e->m - e->step
+				: 0;
+		} else if (e->m < v) {
+			e->m = e->m + e->step > e->m
+				? e->m + e->step
+				: U32_MAX;
+		}
+
+		if ((e->m > v ? e->m - v : v - e->m) < e->step)
+			e->step = max_t(unsigned, e->step / 2, 1);
+
+		if (v >= e->m)
+			break;
+
+		i = eytzinger0_child(i, v > e->m);
+	}
+}
+
+static inline void time_stats_update_one(struct time_stats *stats,
+					      u64 start, u64 end)
+{
+	u64 duration, freq;
+
+	if (time_after64(end, start)) {
+		duration = end - start;
+		mean_and_variance_update(&stats->duration_stats, duration);
+		mean_and_variance_weighted_update(&stats->duration_stats_weighted, duration);
+		stats->max_duration = max(stats->max_duration, duration);
+		stats->min_duration = min(stats->min_duration, duration);
+		stats->total_duration += duration;
+
+		if (stats->quantiles_enabled)
+			quantiles_update(&stats->quantiles, duration);
+	}
+
+	if (stats->last_event && time_after64(end, stats->last_event)) {
+		freq = end - stats->last_event;
+		mean_and_variance_update(&stats->freq_stats, freq);
+		mean_and_variance_weighted_update(&stats->freq_stats_weighted, freq);
+		stats->max_freq = max(stats->max_freq, freq);
+		stats->min_freq = min(stats->min_freq, freq);
+	}
+
+	stats->last_event = end;
+}
+
+void __time_stats_clear_buffer(struct time_stats *stats,
+			       struct time_stat_buffer *b)
+{
+	for (struct time_stat_buffer_entry *i = b->entries;
+	     i < b->entries + ARRAY_SIZE(b->entries);
+	     i++)
+		time_stats_update_one(stats, i->start, i->end);
+	b->nr = 0;
+}
+EXPORT_SYMBOL_GPL(__time_stats_clear_buffer);
+
+static noinline void time_stats_clear_buffer(struct time_stats *stats,
+					     struct time_stat_buffer *b)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&stats->lock, flags);
+	__time_stats_clear_buffer(stats, b);
+	spin_unlock_irqrestore(&stats->lock, flags);
+}
+
+void __time_stats_update(struct time_stats *stats, u64 start, u64 end)
+{
+	unsigned long flags;
+
+	WARN_ONCE(!stats->duration_stats_weighted.weight ||
+		  !stats->freq_stats_weighted.weight,
+		  "uninitialized time_stats");
+
+	if (!stats->buffer) {
+		spin_lock_irqsave(&stats->lock, flags);
+		time_stats_update_one(stats, start, end);
+
+		if (mean_and_variance_weighted_get_mean(stats->freq_stats_weighted) < 32 &&
+		    stats->duration_stats.n > 1024)
+			stats->buffer =
+				alloc_percpu_gfp(struct time_stat_buffer,
+						 GFP_ATOMIC);
+		spin_unlock_irqrestore(&stats->lock, flags);
+	} else {
+		struct time_stat_buffer *b;
+
+		preempt_disable();
+		b = this_cpu_ptr(stats->buffer);
+
+		BUG_ON(b->nr >= ARRAY_SIZE(b->entries));
+		b->entries[b->nr++] = (struct time_stat_buffer_entry) {
+			.start = start,
+			.end = end
+		};
+
+		if (unlikely(b->nr == ARRAY_SIZE(b->entries)))
+			time_stats_clear_buffer(stats, b);
+		preempt_enable();
+	}
+}
+EXPORT_SYMBOL_GPL(__time_stats_update);
+
+#include <linux/seq_buf.h>
+
+static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
+{
+	const struct time_unit *u = pick_time_units(ns);
+
+	seq_buf_printf(out, "%8llu %s", div64_u64(ns, u->nsecs), u->name);
+}
+
+void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
+{
+	s64 f_mean = 0, d_mean = 0;
+	u64 f_stddev = 0, d_stddev = 0;
+
+	if (stats->buffer) {
+		int cpu;
+
+		spin_lock_irq(&stats->lock);
+		for_each_possible_cpu(cpu)
+			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
+		spin_unlock_irq(&stats->lock);
+	}
+
+	/*
+	 * avoid divide by zero
+	 */
+	if (stats->freq_stats.n) {
+		f_mean = mean_and_variance_get_mean(stats->freq_stats);
+		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
+		d_mean = mean_and_variance_get_mean(stats->duration_stats);
+		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
+	}
+
+	seq_buf_printf(out, "count: %llu\n", stats->duration_stats.n);
+
+	seq_buf_printf(out, "                       since mount        recent\n");
+
+	seq_buf_printf(out, "duration of events\n");
+
+	seq_buf_printf(out, "  min:                     ");
+	seq_buf_time_units_aligned(out, stats->min_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  max:                     ");
+	seq_buf_time_units_aligned(out, stats->max_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  total:                   ");
+	seq_buf_time_units_aligned(out, stats->total_duration);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  mean:                    ");
+	seq_buf_time_units_aligned(out, d_mean);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  stddev:                  ");
+	seq_buf_time_units_aligned(out, d_stddev);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "time between events\n");
+
+	seq_buf_printf(out, "  min:                     ");
+	seq_buf_time_units_aligned(out, stats->min_freq);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  max:                     ");
+	seq_buf_time_units_aligned(out, stats->max_freq);
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  mean:                    ");
+	seq_buf_time_units_aligned(out, f_mean);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	seq_buf_printf(out, "  stddev:                  ");
+	seq_buf_time_units_aligned(out, f_stddev);
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
+	seq_buf_printf(out, "\n");
+
+	if (stats->quantiles_enabled) {
+		int i = eytzinger0_first(NR_QUANTILES);
+		const struct time_unit *u =
+			pick_time_units(stats->quantiles.entries[i].m);
+		u64 last_q = 0;
+
+		seq_buf_printf(out, "quantiles (%s):\t", u->name);
+		eytzinger0_for_each(i, NR_QUANTILES) {
+			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+
+			u64 q = max(stats->quantiles.entries[i].m, last_q);
+			seq_buf_printf(out, "%llu ", div_u64(q, u->nsecs));
+			if (is_last)
+				seq_buf_printf(out, "\n");
+			last_q = q;
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(time_stats_to_seq_buf);
+
+void time_stats_exit(struct time_stats *stats)
+{
+	free_percpu(stats->buffer);
+}
+EXPORT_SYMBOL_GPL(time_stats_exit);
+
+void time_stats_init(struct time_stats *stats)
+{
+	memset(stats, 0, sizeof(*stats));
+	stats->duration_stats_weighted.weight = 8;
+	stats->freq_stats_weighted.weight = 8;
+	stats->min_duration = U64_MAX;
+	stats->min_freq = U64_MAX;
+	spin_lock_init(&stats->lock);
+}
+EXPORT_SYMBOL_GPL(time_stats_init);
+
+MODULE_AUTHOR("Kent Overstreet");
+MODULE_LICENSE("GPL");


* [PATCH 01/10] time_stats: report lifetime of the stats object
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
@ 2024-02-24  1:10   ` Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 02/10] time_stats: split stats-with-quantiles into a separate structure Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:10 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Capture the initialization time of the time_stats object so that we can
report how long the counter has been observing data.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |    2 ++
 lib/time_stats.c           |   10 ++++++++++
 2 files changed, 12 insertions(+)
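
For readers skimming the thread, the idea can be sketched in userspace
C roughly as below.  This is only an illustration: the demo_* names are
made up, and the timestamps are passed in by the caller, whereas the
patch itself reads local_clock().

```c
#include <stdint.h>

struct demo_stats {
	uint64_t start_time;	/* ns timestamp captured at init */
};

/* Record when the counter started observing data. */
static void demo_stats_init(struct demo_stats *s, uint64_t now_ns)
{
	s->start_time = now_ns;
}

/* How long the counter has been alive, in nanoseconds. */
static uint64_t demo_stats_lifetime(const struct demo_stats *s, uint64_t now_ns)
{
	return now_ns - s->start_time;
}
```

A report function would call demo_stats_lifetime() once, up front, and
print the result next to the event count.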


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index caefa7aba65a0..eb1957cb77c0d 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -78,6 +78,8 @@ struct time_stats {
 	u64		last_event_start;
 	struct quantiles quantiles;
 
+	u64		start_time;
+
 	struct mean_and_variance	  duration_stats;
 	struct mean_and_variance_weighted duration_stats_weighted;
 	struct mean_and_variance	  freq_stats;
diff --git a/lib/time_stats.c b/lib/time_stats.c
index 081aeba88b535..8df4b55fc6337 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -158,10 +158,16 @@ static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
 	seq_buf_printf(out, "%8llu %s", div64_u64(ns, u->nsecs), u->name);
 }
 
+static inline u64 time_stats_lifetime(const struct time_stats *stats)
+{
+	return local_clock() - stats->start_time;
+}
+
 void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
 {
 	s64 f_mean = 0, d_mean = 0;
 	u64 f_stddev = 0, d_stddev = 0;
+	u64 lifetime = time_stats_lifetime(stats);
 
 	if (stats->buffer) {
 		int cpu;
@@ -183,6 +189,9 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
 	}
 
 	seq_buf_printf(out, "count: %llu\n", stats->duration_stats.n);
+	seq_buf_printf(out, "lifetime: ");
+	seq_buf_time_units_aligned(out, lifetime);
+	seq_buf_printf(out, "\n");
 
 	seq_buf_printf(out, "                       since mount        recent\n");
 
@@ -263,6 +272,7 @@ void time_stats_init(struct time_stats *stats)
 	stats->freq_stats_weighted.weight = 8;
 	stats->min_duration = U64_MAX;
 	stats->min_freq = U64_MAX;
+	stats->start_time = local_clock();
 	spin_lock_init(&stats->lock);
 }
 EXPORT_SYMBOL_GPL(time_stats_init);


* [PATCH 02/10] time_stats: split stats-with-quantiles into a separate structure
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 01/10] time_stats: report lifetime of the stats object Darrick J. Wong
@ 2024-02-24  1:10   ` Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 03/10] time_stats: fix struct layout bloat Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:10 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Currently, struct time_stats has the optional ability to track quantiles
of the durations that it collects.  This is /probably/ useful for callers
who want to see quantile information, but it more than doubles the size
of the structure, from 224 bytes to 464 bytes.  For users who don't care
about quantiles (e.g. upcoming xfs patches) and want to avoid wasting 240
bytes per counter, split the two into separate structures.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bcachefs.h     |    2 +-
 fs/bcachefs/io_write.c     |    2 +-
 fs/bcachefs/super.c        |   10 ++++------
 fs/bcachefs/sysfs.c        |    4 ++--
 fs/bcachefs/util.c         |    7 ++++---
 include/linux/time_stats.h |   36 ++++++++++++++++++++++++++++++++++--
 lib/time_stats.c           |   17 ++++++++++-------
 7 files changed, 56 insertions(+), 22 deletions(-)
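
The two tricks used here (a flag hidden in bit 0 of the start time, and
stepping back from the embedded struct to its container) can be sketched
in userspace C roughly as follows.  All demo_* names and the size of the
quantile payload are assumptions made for illustration; the kernel code
uses container_of() instead of the open-coded offsetof() arithmetic.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HAVE_QUANTILES	(1ULL << 0)

struct demo_stats {
	uint64_t start_time;	/* bit 0 doubles as the "has quantiles" flag */
};

struct demo_quantiles {
	uint64_t entries[15];	/* placeholder payload, size is arbitrary */
};

struct demo_stats_quantiles {
	struct demo_stats	stats;
	struct demo_quantiles	quantiles;
};

static void demo_stats_init(struct demo_stats *s, uint64_t now_ns)
{
	/* keep bit 0 clear so a plain counter never looks quantiled */
	s->start_time = now_ns & ~DEMO_HAVE_QUANTILES;
}

static void demo_stats_quantiles_init(struct demo_stats_quantiles *sq,
				      uint64_t now_ns)
{
	demo_stats_init(&sq->stats, now_ns);
	sq->stats.start_time |= DEMO_HAVE_QUANTILES;
	memset(&sq->quantiles, 0, sizeof(sq->quantiles));
}

/* Returns NULL unless @s is embedded in a struct demo_stats_quantiles. */
static struct demo_quantiles *demo_stats_to_quantiles(struct demo_stats *s)
{
	struct demo_stats_quantiles *sq;

	if (!(s->start_time & DEMO_HAVE_QUANTILES))
		return NULL;

	/* open-coded container_of(): step back from member to container */
	sq = (struct demo_stats_quantiles *)
		((char *)s - offsetof(struct demo_stats_quantiles, stats));
	return &sq->quantiles;
}

/* Readers of the timestamp must mask the flag bit back out. */
static uint64_t demo_stats_start(const struct demo_stats *s)
{
	return s->start_time & ~DEMO_HAVE_QUANTILES;
}
```

The cost of the trick is one nanosecond of timestamp resolution, in
exchange for not adding a separate flag field to every counter.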


diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 92547d6fd2d95..04e4a65909a4f 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -594,7 +594,7 @@ struct bch_dev {
 
 	/* The rest of this all shows up in sysfs */
 	atomic64_t		cur_latency[2];
-	struct time_stats	io_latency[2];
+	struct time_stats_quantiles	io_latency[2];
 
 #define CONGESTED_MAX		1024
 	atomic_t		congested;
diff --git a/fs/bcachefs/io_write.c b/fs/bcachefs/io_write.c
index 8123a84320e3f..3fa2cb1d5b13a 100644
--- a/fs/bcachefs/io_write.c
+++ b/fs/bcachefs/io_write.c
@@ -88,7 +88,7 @@ void bch2_latency_acct(struct bch_dev *ca, u64 submit_time, int rw)
 
 	bch2_congested_acct(ca, io_latency, now, rw);
 
-	__time_stats_update(&ca->io_latency[rw], submit_time, now);
+	__time_stats_update(&ca->io_latency[rw].stats, submit_time, now);
 }
 
 #endif
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index c491c5e102287..2c238030fb5d7 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -1168,8 +1168,8 @@ static void bch2_dev_free(struct bch_dev *ca)
 	bch2_dev_buckets_free(ca);
 	free_page((unsigned long) ca->sb_read_scratch);
 
-	time_stats_exit(&ca->io_latency[WRITE]);
-	time_stats_exit(&ca->io_latency[READ]);
+	time_stats_quantiles_exit(&ca->io_latency[WRITE]);
+	time_stats_quantiles_exit(&ca->io_latency[READ]);
 
 	percpu_ref_exit(&ca->io_ref);
 	percpu_ref_exit(&ca->ref);
@@ -1260,10 +1260,8 @@ static struct bch_dev *__bch2_dev_alloc(struct bch_fs *c,
 
 	INIT_WORK(&ca->io_error_work, bch2_io_error_work);
 
-	time_stats_init(&ca->io_latency[READ]);
-	time_stats_init(&ca->io_latency[WRITE]);
-	ca->io_latency[READ].quantiles_enabled = true;
-	ca->io_latency[WRITE].quantiles_enabled = true;
+	time_stats_quantiles_init(&ca->io_latency[READ]);
+	time_stats_quantiles_init(&ca->io_latency[WRITE]);
 
 	ca->mi = bch2_mi_to_cpu(member);
 
diff --git a/fs/bcachefs/sysfs.c b/fs/bcachefs/sysfs.c
index cee80c47feea2..c86a93a8d8fc8 100644
--- a/fs/bcachefs/sysfs.c
+++ b/fs/bcachefs/sysfs.c
@@ -930,10 +930,10 @@ SHOW(bch2_dev)
 	sysfs_print(io_latency_write,		atomic64_read(&ca->cur_latency[WRITE]));
 
 	if (attr == &sysfs_io_latency_stats_read)
-		bch2_time_stats_to_text(out, &ca->io_latency[READ]);
+		bch2_time_stats_to_text(out, &ca->io_latency[READ].stats);
 
 	if (attr == &sysfs_io_latency_stats_write)
-		bch2_time_stats_to_text(out, &ca->io_latency[WRITE]);
+		bch2_time_stats_to_text(out, &ca->io_latency[WRITE].stats);
 
 	sysfs_printf(congested,			"%u%%",
 		     clamp(atomic_read(&ca->congested), 0, CONGESTED_MAX)
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index 88853513a15fa..ef620bfe76cd2 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -365,6 +365,7 @@ static inline void pr_name_and_units(struct printbuf *out, const char *name, u64
 
 void bch2_time_stats_to_text(struct printbuf *out, struct time_stats *stats)
 {
+	struct quantiles *quantiles = time_stats_to_quantiles(stats);
 	s64 f_mean = 0, d_mean = 0;
 	u64 f_stddev = 0, d_stddev = 0;
 
@@ -465,17 +466,17 @@ void bch2_time_stats_to_text(struct printbuf *out, struct time_stats *stats)
 
 	printbuf_tabstops_reset(out);
 
-	if (stats->quantiles_enabled) {
+	if (quantiles) {
 		int i = eytzinger0_first(NR_QUANTILES);
 		const struct time_unit *u =
-			pick_time_units(stats->quantiles.entries[i].m);
+			pick_time_units(quantiles->entries[i].m);
 		u64 last_q = 0;
 
 		prt_printf(out, "quantiles (%s):\t", u->name);
 		eytzinger0_for_each(i, NR_QUANTILES) {
 			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
 
-			u64 q = max(stats->quantiles.entries[i].m, last_q);
+			u64 q = max(quantiles->entries[i].m, last_q);
 			prt_printf(out, "%llu ", div_u64(q, u->nsecs));
 			if (is_last)
 				prt_newline(out);
diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index eb1957cb77c0d..c05490101d197 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -27,6 +27,7 @@
 #include <linux/mean_and_variance.h>
 #include <linux/sched/clock.h>
 #include <linux/spinlock_types.h>
+#include <linux/string.h>
 
 struct time_unit {
 	const char	*name;
@@ -67,7 +68,6 @@ struct time_stat_buffer {
 
 struct time_stats {
 	spinlock_t	lock;
-	bool		quantiles_enabled;
 	/* all fields are in nanoseconds */
 	u64             min_duration;
 	u64		max_duration;
@@ -76,7 +76,12 @@ struct time_stats {
 	u64             min_freq;
 	u64		last_event;
 	u64		last_event_start;
-	struct quantiles quantiles;
+
+/*
+ * Is this really a struct time_stats_quantiled?  Hide this flag in the least
+ * significant bit of the start time to avoid blowing up the structure size.
+ */
+#define TIME_STATS_HAVE_QUANTILES	(1ULL << 0)
 
 	u64		start_time;
 
@@ -87,6 +92,22 @@ struct time_stats {
 	struct time_stat_buffer __percpu *buffer;
 };
 
+struct time_stats_quantiles {
+	struct time_stats	stats;
+	struct quantiles	quantiles;
+};
+
+static inline struct quantiles *time_stats_to_quantiles(struct time_stats *stats)
+{
+	struct time_stats_quantiles *statq;
+
+	if (!(stats->start_time & TIME_STATS_HAVE_QUANTILES))
+		return NULL;
+
+	statq = container_of(stats, struct time_stats_quantiles, stats);
+	return &statq->quantiles;
+}
+
 void __time_stats_clear_buffer(struct time_stats *, struct time_stat_buffer *);
 void __time_stats_update(struct time_stats *stats, u64, u64);
 
@@ -133,4 +154,15 @@ void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *);
 void time_stats_exit(struct time_stats *);
 void time_stats_init(struct time_stats *);
 
+static inline void time_stats_quantiles_exit(struct time_stats_quantiles *statq)
+{
+	time_stats_exit(&statq->stats);
+}
+static inline void time_stats_quantiles_init(struct time_stats_quantiles *statq)
+{
+	time_stats_init(&statq->stats);
+	statq->stats.start_time |= TIME_STATS_HAVE_QUANTILES;
+	memset(&statq->quantiles, 0, sizeof(statq->quantiles));
+}
+
 #endif /* _LINUX_TIME_STATS_H */
diff --git a/lib/time_stats.c b/lib/time_stats.c
index 8df4b55fc6337..767b1a340e805 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -69,6 +69,8 @@ static inline void time_stats_update_one(struct time_stats *stats,
 	u64 duration, freq;
 
 	if (time_after64(end, start)) {
+		struct quantiles *quantiles = time_stats_to_quantiles(stats);
+
 		duration = end - start;
 		mean_and_variance_update(&stats->duration_stats, duration);
 		mean_and_variance_weighted_update(&stats->duration_stats_weighted, duration);
@@ -76,8 +78,8 @@ static inline void time_stats_update_one(struct time_stats *stats,
 		stats->min_duration = min(stats->min_duration, duration);
 		stats->total_duration += duration;
 
-		if (stats->quantiles_enabled)
-			quantiles_update(&stats->quantiles, duration);
+		if (quantiles)
+			quantiles_update(quantiles, duration);
 	}
 
 	if (stats->last_event && time_after64(end, stats->last_event)) {
@@ -160,11 +162,12 @@ static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
 
 static inline u64 time_stats_lifetime(const struct time_stats *stats)
 {
-	return local_clock() - stats->start_time;
+	return local_clock() - (stats->start_time & ~TIME_STATS_HAVE_QUANTILES);
 }
 
 void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
 {
+	struct quantiles *quantiles = time_stats_to_quantiles(stats);
 	s64 f_mean = 0, d_mean = 0;
 	u64 f_stddev = 0, d_stddev = 0;
 	u64 lifetime = time_stats_lifetime(stats);
@@ -239,17 +242,17 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
 	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
 	seq_buf_printf(out, "\n");
 
-	if (stats->quantiles_enabled) {
+	if (quantiles) {
 		int i = eytzinger0_first(NR_QUANTILES);
 		const struct time_unit *u =
-			pick_time_units(stats->quantiles.entries[i].m);
+			pick_time_units(quantiles->entries[i].m);
 		u64 last_q = 0;
 
 		seq_buf_printf(out, "quantiles (%s):\t", u->name);
 		eytzinger0_for_each(i, NR_QUANTILES) {
 			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
 
-			u64 q = max(stats->quantiles.entries[i].m, last_q);
+			u64 q = max(quantiles->entries[i].m, last_q);
 			seq_buf_printf(out, "%llu ", div_u64(q, u->nsecs));
 			if (is_last)
 				seq_buf_printf(out, "\n");
@@ -272,7 +275,7 @@ void time_stats_init(struct time_stats *stats)
 	stats->freq_stats_weighted.weight = 8;
 	stats->min_duration = U64_MAX;
 	stats->min_freq = U64_MAX;
-	stats->start_time = local_clock();
+	stats->start_time = local_clock() & ~TIME_STATS_HAVE_QUANTILES;
 	spin_lock_init(&stats->lock);
 }
 EXPORT_SYMBOL_GPL(time_stats_init);


* [PATCH 03/10] time_stats: fix struct layout bloat
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 01/10] time_stats: report lifetime of the stats object Darrick J. Wong
  2024-02-24  1:10   ` [PATCH 02/10] time_stats: split stats-with-quantiles into a separate structure Darrick J. Wong
@ 2024-02-24  1:10   ` Darrick J. Wong
  2024-02-24  1:11   ` [PATCH 04/10] time_stats: add larger units Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:10 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Reorder the fields of struct time_stats to get rid of the padding holes.
This reduces the structure size from 224 bytes to 208 bytes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)
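
As a generic illustration of why field order matters (this is not the
actual time_stats layout; the demo_* structs are made up), compare two
structs holding the same members.  On an ABI where uint64_t is 8-byte
aligned, the first needs padding after each one-byte field and the
second does not.

```c
#include <stddef.h>
#include <stdint.h>

/* Small fields interleaved with 8-byte fields: padding after each flag. */
struct demo_loose {
	uint8_t		flag_a;
	uint64_t	counter_a;
	uint8_t		flag_b;
	uint64_t	counter_b;
};

/* Same members, large ones first, small ones packed together at the end. */
struct demo_tight {
	uint64_t	counter_a;
	uint64_t	counter_b;
	uint8_t		flag_a;
	uint8_t		flag_b;
};

/* Padding bytes = struct size minus the sum of the member sizes. */
static size_t demo_padding(size_t struct_size)
{
	return struct_size - (2 * sizeof(uint64_t) + 2 * sizeof(uint8_t));
}
```

With 8-byte alignment for uint64_t this works out to 32 bytes for
struct demo_loose and 24 bytes for struct demo_tight; other ABIs may
give different numbers, but the reordered struct should never be larger.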


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index c05490101d197..1c1ba8efa7bfe 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -77,19 +77,19 @@ struct time_stats {
 	u64		last_event;
 	u64		last_event_start;
 
-/*
- * Is this really a struct time_stats_quantiled?  Hide this flag in the least
- * significant bit of the start time to avoid blowing up the structure size.
- */
-#define TIME_STATS_HAVE_QUANTILES	(1ULL << 0)
-
-	u64		start_time;
-
 	struct mean_and_variance	  duration_stats;
-	struct mean_and_variance_weighted duration_stats_weighted;
 	struct mean_and_variance	  freq_stats;
+	struct mean_and_variance_weighted duration_stats_weighted;
 	struct mean_and_variance_weighted freq_stats_weighted;
 	struct time_stat_buffer __percpu *buffer;
+
+/*
+ * Is this really a struct time_stats_quantiled?  Hide this flag in the least
+ * significant bit of the start time to avoid blowing up the structure size.
+ */
+#define TIME_STATS_HAVE_QUANTILES	(1ULL << 0)
+
+	u64		start_time;
 };
 
 struct time_stats_quantiles {


* [PATCH 04/10] time_stats: add larger units
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:10   ` [PATCH 03/10] time_stats: fix struct layout bloat Darrick J. Wong
@ 2024-02-24  1:11   ` Darrick J. Wong
  2024-02-24  1:11   ` [PATCH 05/10] time_stats: don't print any output if event count is zero Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:11 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Filesystems can stay mounted for a very long time, so add some larger
units: days, weeks, and years.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 lib/time_stats.c |    3 +++
 1 file changed, 3 insertions(+)
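
A quick userspace sketch of a unit table and a lookup, to make the
arithmetic concrete.  The table contents and the selection rule (largest
unit that does not exceed the value) are assumptions for illustration
and are not copied from pick_time_units(); the year is taken as 365.25
days, i.e. 31,557,600 seconds.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC	1000000000ULL

struct demo_time_unit {
	const char	*name;
	uint64_t	nsecs;
};

/* Ascending order; the last entry is 365.25 days. */
static const struct demo_time_unit demo_units[] = {
	{ "ns", 1 },
	{ "us", 1000 },
	{ "ms", 1000000 },
	{ "s",  DEMO_NSEC_PER_SEC },
	{ "m",  DEMO_NSEC_PER_SEC * 60 },
	{ "h",  DEMO_NSEC_PER_SEC * 3600 },
	{ "d",  DEMO_NSEC_PER_SEC * 3600 * 24 },
	{ "w",  DEMO_NSEC_PER_SEC * 3600 * 24 * 7 },
	{ "y",  DEMO_NSEC_PER_SEC * (3600 * 24 * 365 + 3600 * 6) },
};

/* Assumed rule: pick the largest unit that does not exceed @ns. */
static const struct demo_time_unit *demo_pick_unit(uint64_t ns)
{
	size_t i, n = sizeof(demo_units) / sizeof(demo_units[0]);
	const struct demo_time_unit *best = &demo_units[0];

	for (i = 1; i < n; i++)
		if (ns >= demo_units[i].nsecs)
			best = &demo_units[i];
	return best;
}
```

With this rule a 90 second interval is reported in minutes and a three
day interval in days.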


diff --git a/lib/time_stats.c b/lib/time_stats.c
index 767b1a340e805..43106bda43a92 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -16,6 +16,9 @@ static const struct time_unit time_units[] = {
 	{ "s",		NSEC_PER_SEC	 },
 	{ "m",          (u64) NSEC_PER_SEC * 60},
 	{ "h",          (u64) NSEC_PER_SEC * 3600},
+	{ "d",          (u64) NSEC_PER_SEC * 3600 * 24},
+	{ "w",          (u64) NSEC_PER_SEC * 3600 * 24 * 7},
+	{ "y",          (u64) NSEC_PER_SEC * ((3600 * 24 * 365) + (3600 * (24 / 4)))}, /* 365.25d */
 	{ "eon",        U64_MAX          },
 };
 


* [PATCH 05/10] time_stats: don't print any output if event count is zero
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-02-24  1:11   ` [PATCH 04/10] time_stats: add larger units Darrick J. Wong
@ 2024-02-24  1:11   ` Darrick J. Wong
  2024-02-24  1:11   ` [PATCH 06/10] time_stats: allow custom epoch names Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:11 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

There's no point in printing an empty report when there is no data, so
add a flag that lets callers suppress the output in that case.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |    4 +++-
 lib/time_stats.c           |   10 ++++++----
 2 files changed, 9 insertions(+), 5 deletions(-)
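
The control flow is small enough to sketch in userspace C.  The names
below are hypothetical, and the sketch keys off a single event count
and writes into a plain char buffer, whereas the patch checks
freq_stats.n and writes into a seq_buf.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_PRINT_NO_ZEROES	(1U << 0)	/* print nothing if zero count */

/*
 * Format a tiny report into @buf.  Returns the number of characters
 * snprintf() produced, or 0 when the report is suppressed.
 */
static int demo_report(char *buf, size_t len, uint64_t count,
		       uint64_t total_ns, unsigned int flags)
{
	uint64_t mean = 0;

	if (len)
		buf[0] = '\0';

	if (count)
		mean = total_ns / count;	/* safe: count is nonzero */
	else if (flags & DEMO_PRINT_NO_ZEROES)
		return 0;			/* caller asked for silence */

	return snprintf(buf, len, "count: %llu\nmean: %llu ns\n",
			(unsigned long long)count, (unsigned long long)mean);
}
```

Callers that pass 0 for the flags keep the old behaviour and still get
an all-zero report.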


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index 1c1ba8efa7bfe..994823c17bca9 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -148,8 +148,10 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
 	return false;
 }
 
+#define TIME_STATS_PRINT_NO_ZEROES	(1U << 0)	/* print nothing if zero count */
 struct seq_buf;
-void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *);
+void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
+		unsigned int flags);
 
 void time_stats_exit(struct time_stats *);
 void time_stats_init(struct time_stats *);
diff --git a/lib/time_stats.c b/lib/time_stats.c
index 43106bda43a92..382935979f8f7 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -168,7 +168,8 @@ static inline u64 time_stats_lifetime(const struct time_stats *stats)
 	return local_clock() - (stats->start_time & ~TIME_STATS_HAVE_QUANTILES);
 }
 
-void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
+void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
+		unsigned int flags)
 {
 	struct quantiles *quantiles = time_stats_to_quantiles(stats);
 	s64 f_mean = 0, d_mean = 0;
@@ -184,14 +185,15 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats)
 		spin_unlock_irq(&stats->lock);
 	}
 
-	/*
-	 * avoid divide by zero
-	 */
 	if (stats->freq_stats.n) {
+		/* avoid divide by zero */
 		f_mean = mean_and_variance_get_mean(stats->freq_stats);
 		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
 		d_mean = mean_and_variance_get_mean(stats->duration_stats);
 		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
+	} else if (flags & TIME_STATS_PRINT_NO_ZEROES) {
+		/* unless we didn't want zeroes anyway */
+		return;
 	}
 
 	seq_buf_printf(out, "count: %llu\n", stats->duration_stats.n);


* [PATCH 06/10] time_stats: allow custom epoch names
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-02-24  1:11   ` [PATCH 05/10] time_stats: don't print any output if event count is zero Darrick J. Wong
@ 2024-02-24  1:11   ` Darrick J. Wong
  2024-02-24  1:11   ` [PATCH 07/10] mean_and_variance: put struct mean_and_variance_weighted on a diet Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:11 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Let callers of time_stats_to_seq_buf() define the epoch name; "mount"
doesn't make sense in general.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |    2 +-
 lib/time_stats.c           |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
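
The %-12s conversion is what keeps the "recent" column where it was.  A
minimal userspace sketch, with a hypothetical demo_header() helper and
the leading indentation of the real header dropped for brevity:

```c
#include <stddef.h>
#include <stdio.h>

/* Writes the column header into @buf; returns snprintf()'s result. */
static int demo_header(char *buf, size_t len, const char *epoch_name)
{
	/* %-12s left-justifies and pads to 12 columns; it does not truncate */
	return snprintf(buf, len, "since %-12s recent\n", epoch_name);
}
```

With "mount" as the epoch name this produces the same spacing as the
old hardcoded string.  A name longer than 12 characters is printed in
full and pushes the "recent" column to the right.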


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index 994823c17bca9..b2f71e3862c0f 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -151,7 +151,7 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
 #define TIME_STATS_PRINT_NO_ZEROES	(1U << 0)	/* print nothing if zero count */
 struct seq_buf;
 void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
-		unsigned int flags);
+		const char *epoch_name, unsigned int flags);
 
 void time_stats_exit(struct time_stats *);
 void time_stats_init(struct time_stats *);
diff --git a/lib/time_stats.c b/lib/time_stats.c
index 382935979f8f7..f4a21409006bd 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -169,7 +169,7 @@ static inline u64 time_stats_lifetime(const struct time_stats *stats)
 }
 
 void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
-		unsigned int flags)
+		const char *epoch_name, unsigned int flags)
 {
 	struct quantiles *quantiles = time_stats_to_quantiles(stats);
 	s64 f_mean = 0, d_mean = 0;
@@ -201,7 +201,7 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
 	seq_buf_time_units_aligned(out, lifetime);
 	seq_buf_printf(out, "\n");
 
-	seq_buf_printf(out, "                       since mount        recent\n");
+	seq_buf_printf(out, "                       since %-12s recent\n", epoch_name);
 
 	seq_buf_printf(out, "duration of events\n");
 


* [PATCH 07/10] mean_and_variance: put struct mean_and_variance_weighted on a diet
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-02-24  1:11   ` [PATCH 06/10] time_stats: allow custom epoch names Darrick J. Wong
@ 2024-02-24  1:11   ` Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 08/10] time_stats: shrink time_stat_buffer for better alignment Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:11 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

The only caller of this code (time_stats) always knows the weights and
whether or not any information has been collected.  Pass this
information into the mean and variance code so that it doesn't have to
store that information.  This reduces the structure size from 24 to 16
bytes, which shrinks each time_stats counter to 192 bytes from 208.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/util.c                |    8 ++--
 include/linux/mean_and_variance.h |   14 ++++--
 include/linux/time_stats.h        |    4 ++
 lib/math/mean_and_variance.c      |   28 ++++++++-----
 lib/math/mean_and_variance_test.c |   80 ++++++++++++++++++++-----------------
 lib/time_stats.c                  |   23 +++++------
 6 files changed, 87 insertions(+), 70 deletions(-)
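
To show what "the caller tracks the weight and the initialized state"
looks like in practice, here is a userspace sketch of the weighted
update with both passed in as arguments.  The demo_* names are made up.
The diff_w and diff computations are not visible in the hunks of this
patch, so they are an assumption here, and plain division stands in for
fast_divpow2().

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_mv_weighted {
	int64_t		mean;		/* stored scaled by 2^weight */
	uint64_t	variance;	/* stored scaled by 2^weight */
};

/* Divide by 2^d, rounding toward zero. */
static int64_t demo_divpow2(int64_t n, uint8_t d)
{
	return n / ((int64_t)1 << d);
}

static void demo_mv_weighted_update(struct demo_mv_weighted *s, int64_t x,
				    bool initted, uint8_t weight)
{
	uint64_t var_w0 = s->variance;			/* previous variance */
	int64_t x_w = x * ((int64_t)1 << weight);	/* new value, scaled */
	int64_t diff_w = x_w - s->mean;			/* assumed, see above */
	int64_t diff = demo_divpow2(diff_w, weight);	/* assumed, see above */
	int64_t u_w1 = s->mean + diff;			/* new mean, scaled */

	if (!initted) {
		/* first sample: the mean is the sample, no spread yet */
		s->mean = x_w;
		s->variance = 0;
		return;
	}

	s->mean = u_w1;
	s->variance = ((var_w0 << weight) - var_w0 +
		       ((uint64_t)(diff_w * (x_w - u_w1)) >> weight)) >> weight;
}

static int64_t demo_mv_weighted_mean(struct demo_mv_weighted s, uint8_t weight)
{
	return demo_divpow2(s.mean, weight);
}

static uint64_t demo_mv_weighted_variance(struct demo_mv_weighted s,
					  uint8_t weight)
{
	return s.variance >> weight;
}
```

Feeding this sketch the samples 10 and then 20 with weight 2 gives a
mean of 10 and variance 0 after the first update, and a mean of 12 and
variance 18 after the second, which matches the expectations in the
KUnit test updated by this patch.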


diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index ef620bfe76cd2..4c3e19d562852 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -429,14 +429,14 @@ void bch2_time_stats_to_text(struct printbuf *out, struct time_stats *stats)
 	prt_tab(out);
 	bch2_pr_time_units_aligned(out, d_mean);
 	prt_tab(out);
-	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted));
+	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT));
 	prt_newline(out);
 
 	prt_printf(out, "stddev:");
 	prt_tab(out);
 	bch2_pr_time_units_aligned(out, d_stddev);
 	prt_tab(out);
-	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted));
+	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT));
 
 	printbuf_indent_sub(out, 2);
 	prt_newline(out);
@@ -452,14 +452,14 @@ void bch2_time_stats_to_text(struct printbuf *out, struct time_stats *stats)
 	prt_tab(out);
 	bch2_pr_time_units_aligned(out, f_mean);
 	prt_tab(out);
-	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted));
+	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT));
 	prt_newline(out);
 
 	prt_printf(out, "stddev:");
 	prt_tab(out);
 	bch2_pr_time_units_aligned(out, f_stddev);
 	prt_tab(out);
-	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
+	bch2_pr_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT));
 
 	printbuf_indent_sub(out, 2);
 	prt_newline(out);
diff --git a/include/linux/mean_and_variance.h b/include/linux/mean_and_variance.h
index 64df11ab422bf..4fcf062dd22c7 100644
--- a/include/linux/mean_and_variance.h
+++ b/include/linux/mean_and_variance.h
@@ -154,8 +154,6 @@ struct mean_and_variance {
 
 /* expontentially weighted variant */
 struct mean_and_variance_weighted {
-	bool	init;
-	u8	weight;	/* base 2 logarithim */
 	s64	mean;
 	u64	variance;
 };
@@ -192,10 +190,14 @@ s64 mean_and_variance_get_mean(struct mean_and_variance s);
 u64 mean_and_variance_get_variance(struct mean_and_variance s1);
 u32 mean_and_variance_get_stddev(struct mean_and_variance s);
 
-void mean_and_variance_weighted_update(struct mean_and_variance_weighted *s, s64 v);
+void mean_and_variance_weighted_update(struct mean_and_variance_weighted *s,
+		s64 v, bool initted, u8 weight);
 
-s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s);
-u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s);
-u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s);
+s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s,
+		u8 weight);
+u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s,
+		u8 weight);
+u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s,
+		u8 weight);
 
 #endif // MEAN_AND_VAIRANCE_H_
diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index b2f71e3862c0f..dc539123f7997 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -79,6 +79,10 @@ struct time_stats {
 
 	struct mean_and_variance	  duration_stats;
 	struct mean_and_variance	  freq_stats;
+
+/* default weight for weighted mean and variance calculations */
+#define TIME_STATS_MV_WEIGHT	8
+
 	struct mean_and_variance_weighted duration_stats_weighted;
 	struct mean_and_variance_weighted freq_stats_weighted;
 	struct time_stat_buffer __percpu *buffer;
diff --git a/lib/math/mean_and_variance.c b/lib/math/mean_and_variance.c
index ba90293204bae..21ec6afc67884 100644
--- a/lib/math/mean_and_variance.c
+++ b/lib/math/mean_and_variance.c
@@ -102,14 +102,17 @@ EXPORT_SYMBOL_GPL(mean_and_variance_get_stddev);
  * mean_and_variance_weighted_update() - exponentially weighted variant of mean_and_variance_update()
  * @s: mean and variance number of samples and their sums
  * @x: new value to include in the &mean_and_variance_weighted
+ * @initted: caller must track whether this is the first use or not
+ * @weight: ewma weight
  *
  * see linked pdf: function derived from equations 140-143 where alpha = 2^w.
  * values are stored bitshifted for performance and added precision.
  */
-void mean_and_variance_weighted_update(struct mean_and_variance_weighted *s, s64 x)
+void mean_and_variance_weighted_update(struct mean_and_variance_weighted *s,
+		s64 x, bool initted, u8 weight)
 {
 	// previous weighted variance.
-	u8 w		= s->weight;
+	u8 w		= weight;
 	u64 var_w0	= s->variance;
 	// new value weighted.
 	s64 x_w		= x << w;
@@ -118,45 +121,50 @@ void mean_and_variance_weighted_update(struct mean_and_variance_weighted *s, s64
 	// new mean weighted.
 	s64 u_w1	= s->mean + diff;
 
-	if (!s->init) {
+	if (!initted) {
 		s->mean = x_w;
 		s->variance = 0;
 	} else {
 		s->mean = u_w1;
 		s->variance = ((var_w0 << w) - var_w0 + ((diff_w * (x_w - u_w1)) >> w)) >> w;
 	}
-	s->init = true;
 }
 EXPORT_SYMBOL_GPL(mean_and_variance_weighted_update);
 
 /**
  * mean_and_variance_weighted_get_mean() - get mean from @s
  * @s: mean and variance number of samples and their sums
+ * @weight: ewma weight
  */
-s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s)
+s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s,
+		u8 weight)
 {
-	return fast_divpow2(s.mean, s.weight);
+	return fast_divpow2(s.mean, weight);
 }
 EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_mean);
 
 /**
  * mean_and_variance_weighted_get_variance() -- get variance from @s
  * @s: mean and variance number of samples and their sums
+ * @weight: ewma weight
  */
-u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s)
+u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s,
+		u8 weight)
 {
 	// always positive don't need fast divpow2
-	return s.variance >> s.weight;
+	return s.variance >> weight;
 }
 EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_variance);
 
 /**
  * mean_and_variance_weighted_get_stddev() - get standard deviation from @s
  * @s: mean and variance number of samples and their sums
+ * @weight: ewma weight
  */
-u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s)
+u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s,
+		u8 weight)
 {
-	return int_sqrt64(mean_and_variance_weighted_get_variance(s));
+	return int_sqrt64(mean_and_variance_weighted_get_variance(s, weight));
 }
 EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_stddev);
 
diff --git a/lib/math/mean_and_variance_test.c b/lib/math/mean_and_variance_test.c
index f45591a169d87..0d8c2451a8588 100644
--- a/lib/math/mean_and_variance_test.c
+++ b/lib/math/mean_and_variance_test.c
@@ -30,53 +30,59 @@ static void mean_and_variance_basic_test(struct kunit *test)
 
 static void mean_and_variance_weighted_test(struct kunit *test)
 {
-	struct mean_and_variance_weighted s = { .weight = 2 };
+	struct mean_and_variance_weighted s = { };
 
-	mean_and_variance_weighted_update(&s, 10);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 10);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+	mean_and_variance_weighted_update(&s, 10, false, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), 10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 0);
 
-	mean_and_variance_weighted_update(&s, 20);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 12);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+	mean_and_variance_weighted_update(&s, 20, true, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), 12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 18);
 
-	mean_and_variance_weighted_update(&s, 30);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 16);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+	mean_and_variance_weighted_update(&s, 30, true, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), 16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 72);
 
-	s = (struct mean_and_variance_weighted) { .weight = 2 };
+	s = (struct mean_and_variance_weighted) { };
 
-	mean_and_variance_weighted_update(&s, -10);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -10);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+	mean_and_variance_weighted_update(&s, -10, false, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), -10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 0);
 
-	mean_and_variance_weighted_update(&s, -20);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -12);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+	mean_and_variance_weighted_update(&s, -20, true, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), -12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 18);
 
-	mean_and_variance_weighted_update(&s, -30);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -16);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+	mean_and_variance_weighted_update(&s, -30, true, 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 2), -16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 2), 72);
 }
 
 static void mean_and_variance_weighted_advanced_test(struct kunit *test)
 {
-	struct mean_and_variance_weighted s = { .weight = 8 };
+	struct mean_and_variance_weighted s = { };
+	bool initted = false;
 	s64 i;
 
-	for (i = 10; i <= 100; i += 10)
-		mean_and_variance_weighted_update(&s, i);
+	for (i = 10; i <= 100; i += 10) {
+		mean_and_variance_weighted_update(&s, i, initted, 8);
+		initted = true;
+	}
 
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 11);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 8), 11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 8), 107);
 
-	s = (struct mean_and_variance_weighted) { .weight = 8 };
+	s = (struct mean_and_variance_weighted) { };
+	initted = false;
 
-	for (i = -10; i >= -100; i -= 10)
-		mean_and_variance_weighted_update(&s, i);
+	for (i = -10; i >= -100; i -= 10) {
+		mean_and_variance_weighted_update(&s, i, initted, 8);
+		initted = true;
+	}
 
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -11);
-	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s, 8), -11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s, 8), 107);
 }
 
 static void do_mean_and_variance_test(struct kunit *test,
@@ -91,26 +97,26 @@ static void do_mean_and_variance_test(struct kunit *test,
 				      s64 *weighted_stddev)
 {
 	struct mean_and_variance mv = {};
-	struct mean_and_variance_weighted vw = { .weight = weight };
+	struct mean_and_variance_weighted vw = { };
 
 	for (unsigned i = 0; i < initial_n; i++) {
 		mean_and_variance_update(&mv, initial_value);
-		mean_and_variance_weighted_update(&vw, initial_value);
+		mean_and_variance_weighted_update(&vw, initial_value, false, weight);
 
 		KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(mv),		initial_value);
 		KUNIT_EXPECT_EQ(test, mean_and_variance_get_stddev(mv),		0);
-		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(vw),	initial_value);
-		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_stddev(vw),0);
+		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(vw, weight),	initial_value);
+		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_stddev(vw, weight),0);
 	}
 
 	for (unsigned i = 0; i < n; i++) {
 		mean_and_variance_update(&mv, data[i]);
-		mean_and_variance_weighted_update(&vw, data[i]);
+		mean_and_variance_weighted_update(&vw, data[i], true, weight);
 
 		KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(mv),		mean[i]);
 		KUNIT_EXPECT_EQ(test, mean_and_variance_get_stddev(mv),		stddev[i]);
-		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(vw),	weighted_mean[i]);
-		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_stddev(vw),weighted_stddev[i]);
+		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(vw, weight),	weighted_mean[i]);
+		KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_stddev(vw, weight),weighted_stddev[i]);
 	}
 
 	KUNIT_EXPECT_EQ(test, mv.n, initial_n + n);
diff --git a/lib/time_stats.c b/lib/time_stats.c
index f4a21409006bd..0fb3d854e503b 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -70,13 +70,15 @@ static inline void time_stats_update_one(struct time_stats *stats,
 					      u64 start, u64 end)
 {
 	u64 duration, freq;
+	bool initted = stats->last_event != 0;
 
 	if (time_after64(end, start)) {
 		struct quantiles *quantiles = time_stats_to_quantiles(stats);
 
 		duration = end - start;
 		mean_and_variance_update(&stats->duration_stats, duration);
-		mean_and_variance_weighted_update(&stats->duration_stats_weighted, duration);
+		mean_and_variance_weighted_update(&stats->duration_stats_weighted,
+				duration, initted, TIME_STATS_MV_WEIGHT);
 		stats->max_duration = max(stats->max_duration, duration);
 		stats->min_duration = min(stats->min_duration, duration);
 		stats->total_duration += duration;
@@ -88,7 +90,8 @@ static inline void time_stats_update_one(struct time_stats *stats,
 	if (stats->last_event && time_after64(end, stats->last_event)) {
 		freq = end - stats->last_event;
 		mean_and_variance_update(&stats->freq_stats, freq);
-		mean_and_variance_weighted_update(&stats->freq_stats_weighted, freq);
+		mean_and_variance_weighted_update(&stats->freq_stats_weighted,
+				freq, initted, TIME_STATS_MV_WEIGHT);
 		stats->max_freq = max(stats->max_freq, freq);
 		stats->min_freq = min(stats->min_freq, freq);
 	}
@@ -121,15 +124,11 @@ void __time_stats_update(struct time_stats *stats, u64 start, u64 end)
 {
 	unsigned long flags;
 
-	WARN_ONCE(!stats->duration_stats_weighted.weight ||
-		  !stats->freq_stats_weighted.weight,
-		  "uninitialized time_stats");
-
 	if (!stats->buffer) {
 		spin_lock_irqsave(&stats->lock, flags);
 		time_stats_update_one(stats, start, end);
 
-		if (mean_and_variance_weighted_get_mean(stats->freq_stats_weighted) < 32 &&
+		if (mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT) < 32 &&
 		    stats->duration_stats.n > 1024)
 			stats->buffer =
 				alloc_percpu_gfp(struct time_stat_buffer,
@@ -219,12 +218,12 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
 
 	seq_buf_printf(out, "  mean:                    ");
 	seq_buf_time_units_aligned(out, d_mean);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted));
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT));
 	seq_buf_printf(out, "\n");
 
 	seq_buf_printf(out, "  stddev:                  ");
 	seq_buf_time_units_aligned(out, d_stddev);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted));
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT));
 	seq_buf_printf(out, "\n");
 
 	seq_buf_printf(out, "time between events\n");
@@ -239,12 +238,12 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
 
 	seq_buf_printf(out, "  mean:                    ");
 	seq_buf_time_units_aligned(out, f_mean);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted));
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT));
 	seq_buf_printf(out, "\n");
 
 	seq_buf_printf(out, "  stddev:                  ");
 	seq_buf_time_units_aligned(out, f_stddev);
-	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted));
+	seq_buf_time_units_aligned(out, mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT));
 	seq_buf_printf(out, "\n");
 
 	if (quantiles) {
@@ -276,8 +275,6 @@ EXPORT_SYMBOL_GPL(time_stats_exit);
 void time_stats_init(struct time_stats *stats)
 {
 	memset(stats, 0, sizeof(*stats));
-	stats->duration_stats_weighted.weight = 8;
-	stats->freq_stats_weighted.weight = 8;
 	stats->min_duration = U64_MAX;
 	stats->min_freq = U64_MAX;
 	stats->start_time = local_clock() & ~TIME_STATS_HAVE_QUANTILES;
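
For readers who want to try the reworked calling convention outside the
kernel, here is a minimal userspace sketch of the weighted update after
this patch: the struct keeps only the two scaled accumulators, and the
caller passes both the weight and the "initted" state.  The ewma_mv_*
names are hypothetical.  The diff_w/diff lines and the rounding helper
are not visible in the hunks above, so they are assumptions here, chosen
so that the sketch reproduces the values expected by the KUnit test in
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mean_and_variance_weighted after the
 * patch: both fields are stored scaled by 2^weight. */
struct ewma_mv {
	int64_t  mean;
	uint64_t variance;
};

/* Signed divide by 2^d, rounding toward zero (assumed to match what the
 * kernel's fast_divpow2() computes). */
static int64_t divpow2(int64_t n, uint8_t d)
{
	return n / ((int64_t)1 << d);
}

static void ewma_mv_update(struct ewma_mv *s, int64_t x, bool initted,
			   uint8_t w)
{
	assert(w < 32);

	uint64_t var_w0 = s->variance;          /* previous scaled variance */
	int64_t x_w = x * ((int64_t)1 << w);    /* new value, scaled */
	int64_t diff_w = x_w - s->mean;         /* assumption: elided above */
	int64_t diff = divpow2(diff_w, w);      /* assumption: elided above */
	int64_t u_w1 = s->mean + diff;          /* new scaled mean */

	if (!initted) {
		/* First sample: the caller said no prior state exists. */
		s->mean = x_w;
		s->variance = 0;
	} else {
		s->mean = u_w1;
		/* diff_w and (x_w - u_w1) share a sign, so the product is >= 0. */
		s->variance = ((var_w0 << w) - var_w0 +
			       ((uint64_t)(diff_w * (x_w - u_w1)) >> w)) >> w;
	}
}

static int64_t ewma_mv_mean(struct ewma_mv s, uint8_t w)
{
	return divpow2(s.mean, w);
}

static uint64_t ewma_mv_variance(struct ewma_mv s, uint8_t w)
{
	return s.variance >> w;
}
```

Note that the same weight has to be passed to the update and to every
getter, since the struct no longer remembers it.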


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 08/10] time_stats: shrink time_stat_buffer for better alignment
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-02-24  1:11   ` [PATCH 07/10] mean_and_variance: put struct mean_and_variance_weighted on a diet Darrick J. Wong
@ 2024-02-24  1:12   ` Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 09/10] time_stats: report information in json format Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 10/10] time_stats: Kill TIME_STATS_HAVE_QUANTILES Darrick J. Wong
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:12 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Shrink this percpu object by one array element so that the object size
becomes exactly 512 bytes.  This should lead to more efficient memory
use.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index dc539123f7997..b3c810fff963a 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -63,7 +63,7 @@ struct time_stat_buffer {
 	struct time_stat_buffer_entry {
 		u64	start;
 		u64	end;
-	}		entries[32];
+	}		entries[31];
 };
 
 struct time_stats {


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 09/10] time_stats: report information in json format
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-02-24  1:12   ` [PATCH 08/10] time_stats: shrink time_stat_buffer for better alignment Darrick J. Wong
@ 2024-02-24  1:12   ` Darrick J. Wong
  2024-02-24  4:15     ` Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 10/10] time_stats: Kill TIME_STATS_HAVE_QUANTILES Darrick J. Wong
  9 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:12 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Export JSON versions of the time statistics.  Given the tabular nature
of the numbers exposed, this makes it much easier for higher-level
languages (e.g. Python) to import the information without needing yet
another clumsy string parser.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/time_stats.h |    2 +
 lib/time_stats.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index b3c810fff963a..4e1f5485ed039 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -156,6 +156,8 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
 struct seq_buf;
 void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
 		const char *epoch_name, unsigned int flags);
+void time_stats_to_json(struct seq_buf *, struct time_stats *,
+		const char *epoch_name, unsigned int flags);
 
 void time_stats_exit(struct time_stats *);
 void time_stats_init(struct time_stats *);
diff --git a/lib/time_stats.c b/lib/time_stats.c
index 0fb3d854e503b..c0f209dd9f6dd 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -266,6 +266,93 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
 }
 EXPORT_SYMBOL_GPL(time_stats_to_seq_buf);
 
+void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
+		const char *epoch_name, unsigned int flags)
+{
+	struct quantiles *quantiles = time_stats_to_quantiles(stats);
+	s64 f_mean = 0, d_mean = 0;
+	u64 f_stddev = 0, d_stddev = 0;
+
+	if (stats->buffer) {
+		int cpu;
+
+		spin_lock_irq(&stats->lock);
+		for_each_possible_cpu(cpu)
+			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
+		spin_unlock_irq(&stats->lock);
+	}
+
+	if (stats->freq_stats.n) {
+		/* avoid divide by zero */
+		f_mean = mean_and_variance_get_mean(stats->freq_stats);
+		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
+		d_mean = mean_and_variance_get_mean(stats->duration_stats);
+		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
+	} else if (flags & TIME_STATS_PRINT_NO_ZEROES) {
+		/* unless we didn't want zeroes anyway */
+		return;
+	}
+
+	seq_buf_printf(out, "{\n");
+	seq_buf_printf(out, "  \"epoch\":       \"%s\",\n", epoch_name);
+	seq_buf_printf(out, "  \"count\":       %llu,\n", stats->duration_stats.n);
+
+	seq_buf_printf(out, "  \"duration_ns\": {\n");
+	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_duration);
+	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_duration);
+	seq_buf_printf(out, "    \"total\":     %llu,\n", stats->total_duration);
+	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
+	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
+	seq_buf_printf(out, "  },\n");
+
+	d_mean = mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
+	d_stddev = mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
+
+	seq_buf_printf(out, "  \"duration_ewma_ns\": {\n");
+	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
+	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
+	seq_buf_printf(out, "  },\n");
+
+	seq_buf_printf(out, "  \"frequency_ns\": {\n");
+	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
+	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
+	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
+	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
+	seq_buf_printf(out, "  },\n");
+
+	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
+	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
+
+	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
+	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
+	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
+
+	if (quantiles) {
+		u64 last_q = 0;
+
+		/* close frequency_ewma_ns but signal more items */
+		seq_buf_printf(out, "  },\n");
+
+		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
+		eytzinger0_for_each(i, NR_QUANTILES) {
+			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
+
+			u64 q = max(quantiles->entries[i].m, last_q);
+			seq_buf_printf(out, "    %llu", q);
+			if (!is_last)
+				seq_buf_printf(out, ", ");
+			last_q = q;
+		}
+		seq_buf_printf(out, "  ]\n");
+	} else {
+		/* close frequency_ewma_ns without dumping further */
+		seq_buf_printf(out, "  }\n");
+	}
+
+	seq_buf_printf(out, "}\n");
+}
+EXPORT_SYMBOL_GPL(time_stats_to_json);
+
 void time_stats_exit(struct time_stats *stats)
 {
 	free_percpu(stats->buffer);
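
The delicate part of the function above is comma placement: the last
member of an object must not be followed by a comma, and whether
"frequency_ewma_ns" is the last member depends on whether quantiles
exist.  Here is a small userspace sketch of that pattern.  struct
out_buf and out_printf() are hypothetical stand-ins for struct seq_buf
and seq_buf_printf(), and the quantiles are walked in plain index order
rather than eytzinger order.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct seq_buf: a bounded append buffer. */
struct out_buf {
	char   *data;
	size_t  size;
	size_t  len;
};

static void out_printf(struct out_buf *o, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(o->len < o->size);
	va_start(ap, fmt);
	n = vsnprintf(o->data + o->len, o->size - o->len, fmt, ap);
	va_end(ap);
	if (n < 0)
		return;
	/* On truncation, clamp to the last usable byte. */
	if ((size_t)n >= o->size - o->len)
		o->len = o->size - 1;
	else
		o->len += (size_t)n;
}

/* Emit one nested object plus an optional array, with valid commas. */
static void emit_json(struct out_buf *o, uint64_t mean, uint64_t stddev,
		      const uint64_t *q, size_t nr_q)
{
	uint64_t last_q = 0;
	size_t i;

	out_printf(o, "{\n");
	out_printf(o, "  \"ewma_ns\": {\n");
	out_printf(o, "    \"mean\": %llu,\n", (unsigned long long)mean);
	out_printf(o, "    \"stddev\": %llu\n", (unsigned long long)stddev);

	if (nr_q) {
		/* Another member follows, so close with a comma. */
		out_printf(o, "  },\n");
		out_printf(o, "  \"quantiles_ns\": [\n    ");
		for (i = 0; i < nr_q; i++) {
			/* Clamp so reported quantiles never decrease. */
			uint64_t v = q[i] > last_q ? q[i] : last_q;

			out_printf(o, "%llu%s", (unsigned long long)v,
				   i + 1 < nr_q ? ", " : "\n");
			last_q = v;
		}
		out_printf(o, "  ]\n");
	} else {
		/* Last member of the outer object: no trailing comma. */
		out_printf(o, "  }\n");
	}
	out_printf(o, "}\n");
}
```

Calling emit_json() with nr_q == 0 exercises the "no quantiles" branch
and should still yield a well-formed object.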


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 10/10] time_stats: Kill TIME_STATS_HAVE_QUANTILES
  2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-02-24  1:12   ` [PATCH 09/10] time_stats: report information in json format Darrick J. Wong
@ 2024-02-24  1:12   ` Darrick J. Wong
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:12 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

We have 4 spare bytes next to the spinlock, so there is no need for bit
stuffing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 include/linux/time_stats.h |   19 +++++--------------
 lib/time_stats.c           |    4 ++--
 2 files changed, 7 insertions(+), 16 deletions(-)


diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
index 4e1f5485ed039..6df2b34aa274b 100644
--- a/include/linux/time_stats.h
+++ b/include/linux/time_stats.h
@@ -68,6 +68,7 @@ struct time_stat_buffer {
 
 struct time_stats {
 	spinlock_t	lock;
+	bool		have_quantiles;
 	/* all fields are in nanoseconds */
 	u64             min_duration;
 	u64		max_duration;
@@ -87,12 +88,6 @@ struct time_stats {
 	struct mean_and_variance_weighted freq_stats_weighted;
 	struct time_stat_buffer __percpu *buffer;
 
-/*
- * Is this really a struct time_stats_quantiled?  Hide this flag in the least
- * significant bit of the start time to avoid blowing up the structure size.
- */
-#define TIME_STATS_HAVE_QUANTILES	(1ULL << 0)
-
 	u64		start_time;
 };
 
@@ -103,13 +98,9 @@ struct time_stats_quantiles {
 
 static inline struct quantiles *time_stats_to_quantiles(struct time_stats *stats)
 {
-	struct time_stats_quantiles *statq;
-
-	if (!(stats->start_time & TIME_STATS_HAVE_QUANTILES))
-		return NULL;
-
-	statq = container_of(stats, struct time_stats_quantiles, stats);
-	return &statq->quantiles;
+	return stats->have_quantiles
+		? &container_of(stats, struct time_stats_quantiles, stats)->quantiles
+		: NULL;
 }
 
 void __time_stats_clear_buffer(struct time_stats *, struct time_stat_buffer *);
@@ -169,7 +160,7 @@ static inline void time_stats_quantiles_exit(struct time_stats_quantiles *statq)
 static inline void time_stats_quantiles_init(struct time_stats_quantiles *statq)
 {
 	time_stats_init(&statq->stats);
-	statq->stats.start_time |= TIME_STATS_HAVE_QUANTILES;
+	statq->stats.have_quantiles = true;
 	memset(&statq->quantiles, 0, sizeof(statq->quantiles));
 }
 
diff --git a/lib/time_stats.c b/lib/time_stats.c
index c0f209dd9f6dd..0b90c80cba9f1 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -164,7 +164,7 @@ static void seq_buf_time_units_aligned(struct seq_buf *out, u64 ns)
 
 static inline u64 time_stats_lifetime(const struct time_stats *stats)
 {
-	return local_clock() - (stats->start_time & ~TIME_STATS_HAVE_QUANTILES);
+	return local_clock() - stats->start_time;
 }
 
 void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
@@ -364,7 +364,7 @@ void time_stats_init(struct time_stats *stats)
 	memset(stats, 0, sizeof(*stats));
 	stats->min_duration = U64_MAX;
 	stats->min_freq = U64_MAX;
-	stats->start_time = local_clock() & ~TIME_STATS_HAVE_QUANTILES;
+	stats->start_time = local_clock();
 	spin_lock_init(&stats->lock);
 }
 EXPORT_SYMBOL_GPL(time_stats_init);
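
As an illustration of the flag-plus-container_of() pattern this patch
switches to, here is a minimal userspace sketch.  The struct layouts are
simplified assumptions: an int stands in for spinlock_t, the quantile
array length is arbitrary, and container_of() is redefined locally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct stats {
	int      lock;            /* stand-in for spinlock_t */
	bool     have_quantiles;  /* sits in padding before the u64 below */
	uint64_t start_time;      /* no longer carries a flag in bit 0 */
};

struct quantiles {
	uint64_t entries[15];     /* arbitrary length for this sketch */
};

struct stats_quantiles {
	struct stats     stats;
	struct quantiles quantiles;
};

/* Return the enclosing quantiles, or NULL for a plain struct stats. */
static struct quantiles *stats_to_quantiles(struct stats *s)
{
	assert(s != NULL);
	return s->have_quantiles
		? &container_of(s, struct stats_quantiles, stats)->quantiles
		: NULL;
}
```

The flag must only be set on a struct stats that really is embedded in a
struct stats_quantiles; otherwise container_of() yields a pointer into
unrelated memory.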


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 1/4] xfs: present wait time statistics
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
@ 2024-02-24  1:12   ` Darrick J. Wong
  2024-02-24  1:13   ` [PATCH 2/4] xfs: present time stats for scrubbers Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:12 UTC (permalink / raw
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Plumb in Kent's timestats code so we can observe wait times for log
grant heads, buffer, inode, and dquot locks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
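
The pattern used throughout this patch is: sample a clock, block on the
lock or grant head, then record the elapsed time.  A minimal userspace
sketch of the recording half follows.  struct wait_stats and the
wait_stats_*() helpers are hypothetical, much-simplified stand-ins for
struct time_stats and its update path, keeping only the count, min, max
and total.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct time_stats. */
struct wait_stats {
	uint64_t count;
	uint64_t min_ns;
	uint64_t max_ns;
	uint64_t total_ns;
};

static void wait_stats_init(struct wait_stats *ws)
{
	assert(ws);
	ws->count = 0;
	ws->min_ns = UINT64_MAX;  /* so the first sample becomes the minimum */
	ws->max_ns = 0;
	ws->total_ns = 0;
}

/* Record one wait that began at start_ns and finished at end_ns. */
static void wait_stats_end(struct wait_stats *ws, uint64_t start_ns,
			   uint64_t end_ns)
{
	uint64_t d;

	assert(ws);
	if (end_ns <= start_ns)   /* skip empty or backwards intervals */
		return;

	d = end_ns - start_ns;
	ws->count++;
	if (d < ws->min_ns)
		ws->min_ns = d;
	if (d > ws->max_ns)
		ws->max_ns = d;
	ws->total_ns += d;
}
```

A caller would read a monotonic clock before blocking, take the lock,
and then pass both timestamps to wait_stats_end().
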
 fs/xfs/Kconfig         |    8 +++
 fs/xfs/Makefile        |    1 
 fs/xfs/xfs_buf.c       |    4 ++
 fs/xfs/xfs_dquot.c     |   11 +++++
 fs/xfs/xfs_dquot.h     |    4 ++
 fs/xfs/xfs_inode.c     |   12 ++++-
 fs/xfs/xfs_linux.h     |    4 ++
 fs/xfs/xfs_log.c       |    9 ++++
 fs/xfs/xfs_mount.h     |   13 +++++
 fs/xfs/xfs_super.c     |    6 +++
 fs/xfs/xfs_timestats.c |  115 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_timestats.h |   34 ++++++++++++++
 12 files changed, 219 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_timestats.c
 create mode 100644 fs/xfs/xfs_timestats.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index b0cac77c90572..e0fa9b382fbeb 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -5,6 +5,7 @@ config XFS_FS
 	select EXPORTFS
 	select LIBCRC32C
 	select FS_IOMAP
+	select TIME_STATS if XFS_TIME_STATS
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform.  It is completely multi-threaded, can
@@ -120,6 +121,13 @@ config XFS_RT
 
 	  If unsure, say N.
 
+config XFS_TIME_STATS
+	bool "Collect time statistics for XFS filesystems"
+	depends on XFS_FS
+	default y
+	help
+	  Collects time statistics on various operations in the filesystem.
+
 config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 0c3b4cd4f9c84..bf3bacfb7afff 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -153,6 +153,7 @@ xfs-$(CONFIG_XFS_DRAIN_INTENTS)	+= xfs_drain.o
 xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
 xfs-$(CONFIG_XFS_BTREE_IN_MEM)	+= libxfs/xfs_btree_mem.o
+xfs-$(CONFIG_XFS_TIME_STATS)	+= xfs_timestats.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 503ce7aff0c30..b11515f7f270f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -22,6 +22,7 @@
 #include "xfs_error.h"
 #include "xfs_ag.h"
 #include "xfs_buf_mem.h"
+#include "xfs_timestats.h"
 
 struct kmem_cache *xfs_buf_cache;
 
@@ -1183,11 +1184,14 @@ void
 xfs_buf_lock(
 	struct xfs_buf		*bp)
 {
+	DEFINE_XFS_TIMESTAT(start_time);
+
 	trace_xfs_buf_lock(bp, _RET_IP_);
 
 	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
 		xfs_log_force(bp->b_mount, 0);
 	down(&bp->b_sema);
+	xfs_timestats_end(&bp->b_mount->m_timestats.ts_buflock, start_time);
 
 	trace_xfs_buf_lock_done(bp, _RET_IP_);
 }
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 2919b9bdf0cb0..515ffe0fcfe29 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -25,6 +25,7 @@
 #include "xfs_bmap_btree.h"
 #include "xfs_error.h"
 #include "xfs_health.h"
+#include "xfs_timestats.h"
 
 /*
  * Lock order:
@@ -45,6 +46,16 @@ static struct kmem_cache	*xfs_dquot_cache;
 static struct lock_class_key xfs_dquot_group_class;
 static struct lock_class_key xfs_dquot_project_class;
 
+#ifdef CONFIG_XFS_TIME_STATS
+void xfs_dqlock(struct xfs_dquot *dqp)
+{
+	DEFINE_XFS_TIMESTAT(start_time);
+
+	mutex_lock(&dqp->q_qlock);
+	xfs_timestats_end(&dqp->q_mount->m_timestats.ts_dqlock, start_time);
+}
+#endif
+
 /* Record observations of quota corruption with the health tracking system. */
 static void
 xfs_dquot_mark_sick(
diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h
index 677bb2dc9ac91..6523a1f713139 100644
--- a/fs/xfs/xfs_dquot.h
+++ b/fs/xfs/xfs_dquot.h
@@ -120,10 +120,14 @@ static inline int xfs_dqlock_nowait(struct xfs_dquot *dqp)
 	return mutex_trylock(&dqp->q_qlock);
 }
 
+#ifdef CONFIG_XFS_TIME_STATS
+void xfs_dqlock(struct xfs_dquot *dqp);
+#else
 static inline void xfs_dqlock(struct xfs_dquot *dqp)
 {
 	mutex_lock(&dqp->q_qlock);
 }
+#endif
 
 static inline void xfs_dqunlock(struct xfs_dquot *dqp)
 {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d5ce9bd85a111..8d81e6ac77397 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -44,6 +44,7 @@
 #include "xfs_xattr.h"
 #include "xfs_inode_util.h"
 #include "xfs_imeta.h"
+#include "xfs_timestats.h"
 
 struct kmem_cache *xfs_inode_cache;
 
@@ -161,10 +162,17 @@ xfs_ilock(
 				 XFS_MMAPLOCK_DEP(lock_flags));
 	}
 
-	if (lock_flags & XFS_ILOCK_EXCL)
+	if (lock_flags & XFS_ILOCK_EXCL) {
+		DEFINE_XFS_TIMESTAT(start_time);
+
 		mrupdate_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
-	else if (lock_flags & XFS_ILOCK_SHARED)
+		xfs_timestats_end(&ip->i_mount->m_timestats.ts_ilock, start_time);
+	} else if (lock_flags & XFS_ILOCK_SHARED) {
+		DEFINE_XFS_TIMESTAT(start_time);
+
 		mraccess_nested(&ip->i_lock, XFS_ILOCK_DEP(lock_flags));
+		xfs_timestats_end(&ip->i_mount->m_timestats.ts_ilock, start_time);
+	}
 }
 
 /*
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 953466922ddf7..27f9ec7721a93 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -64,6 +64,10 @@ typedef __u32			xfs_nlink_t;
 #include <linux/xattr.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/debugfs.h>
+#ifdef CONFIG_XFS_TIME_STATS
+# include <linux/seq_buf.h>
+# include <linux/time_stats.h>
+#endif
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a604eac68ea9e..a30be4ab780bb 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -20,6 +20,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_sb.h"
 #include "xfs_health.h"
+#include "xfs_timestats.h"
 
 struct kmem_cache	*xfs_log_ticket_cache;
 
@@ -403,6 +404,7 @@ xfs_log_regrant(
 	struct xlog_ticket	*tic)
 {
 	struct xlog		*log = mp->m_log;
+	DECLARE_XFS_TIMESTAT(start_time);
 	int			need_bytes;
 	int			error = 0;
 
@@ -427,12 +429,15 @@ xfs_log_regrant(
 
 	trace_xfs_log_regrant(log, tic);
 
+	xfs_timestats_start(&start_time);
 	error = xlog_grant_head_check(log, &log->l_write_head, tic,
 				      &need_bytes);
 	if (error)
 		goto out_error;
 
 	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
+	xfs_timestats_end(&mp->m_timestats.ts_log_regrant, start_time);
+
 	trace_xfs_log_regrant_exit(log, tic);
 	xlog_verify_grant_tail(log);
 	return 0;
@@ -466,6 +471,7 @@ xfs_log_reserve(
 {
 	struct xlog		*log = mp->m_log;
 	struct xlog_ticket	*tic;
+	DECLARE_XFS_TIMESTAT(start_time);
 	int			need_bytes;
 	int			error = 0;
 
@@ -483,6 +489,7 @@ xfs_log_reserve(
 
 	trace_xfs_log_reserve(log, tic);
 
+	xfs_timestats_start(&start_time);
 	error = xlog_grant_head_check(log, &log->l_reserve_head, tic,
 				      &need_bytes);
 	if (error)
@@ -490,6 +497,8 @@ xfs_log_reserve(
 
 	xlog_grant_add_space(log, &log->l_reserve_head.grant, need_bytes);
 	xlog_grant_add_space(log, &log->l_write_head.grant, need_bytes);
+	xfs_timestats_end(&mp->m_timestats.ts_log_reserve, start_time);
+
 	trace_xfs_log_reserve_exit(log, tic);
 	xlog_verify_grant_tail(log);
 	return 0;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 01934c567f760..7cfd209404365 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -72,6 +72,17 @@ struct xfs_inodegc {
 	unsigned int		cpu;
 };
 
+struct xfs_timestats {
+#ifdef CONFIG_XFS_TIME_STATS
+	struct time_stats	ts_log_reserve;
+	struct time_stats	ts_log_regrant;
+	struct time_stats	ts_ilock;
+	struct time_stats	ts_buflock;
+	struct time_stats	ts_dqlock;
+	struct dentry		*ts_debugfs;
+#endif
+};
+
 /*
  * The struct xfsmount layout is optimised to separate read-mostly variables
  * from variables that are frequently modified. We put the read-mostly variables
@@ -271,6 +282,8 @@ typedef struct xfs_mount {
 
 	/* Hook to feed dirent updates to an active online repair. */
 	struct xfs_hooks	m_dir_update_hooks;
+
+	struct xfs_timestats	m_timestats;
 } xfs_mount_t;
 
 #define M_IGEO(mp)		(&(mp)->m_ino_geo)
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index fa4db490b74a0..69f1c1d85edf6 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -46,6 +46,7 @@
 #include "xfs_exchmaps_item.h"
 #include "xfs_parent.h"
 #include "xfs_rtalloc.h"
+#include "xfs_timestats.h"
 #include "scrub/stats.h"
 #include "scrub/rcbag_btree.h"
 
@@ -768,6 +769,7 @@ xfs_mount_free(
 		xfs_free_buftarg(mp->m_ddev_targp);
 
 	debugfs_remove(mp->m_debugfs);
+	xfs_timestats_destroy(mp);
 	kfree(mp->m_rtname);
 	kfree(mp->m_logname);
 	kmem_free(mp);
@@ -1146,6 +1148,7 @@ xfs_fs_put_super(
 	xfs_rtmount_freesb(mp);
 	xfs_freesb(mp);
 	xchk_mount_stats_free(mp);
+	xfs_timestats_unexport(mp);
 	free_percpu(mp->m_stats.xs_stats);
 	xfs_inodegc_free_percpu(mp);
 	xfs_destroy_percpu_counters(mp);
@@ -1580,6 +1583,7 @@ xfs_fs_fill_super(
 		goto out_destroy_inodegc;
 	}
 
+	xfs_timestats_export(mp);
 	error = xchk_mount_stats_alloc(mp);
 	if (error)
 		goto out_free_stats;
@@ -1805,6 +1809,7 @@ xfs_fs_fill_super(
 	xfs_freesb(mp);
  out_free_scrub_stats:
 	xchk_mount_stats_free(mp);
+	xfs_timestats_unexport(mp);
  out_free_stats:
 	free_percpu(mp->m_stats.xs_stats);
  out_destroy_inodegc:
@@ -2065,6 +2070,7 @@ static int xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_timestats_init(mp);
 
 	fc->s_fs_info = mp;
 	fc->ops = &xfs_context_ops;
diff --git a/fs/xfs/xfs_timestats.c b/fs/xfs/xfs_timestats.c
new file mode 100644
index 0000000000000..163a37e6717f7
--- /dev/null
+++ b/fs/xfs/xfs_timestats.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_sb.h"
+#include "xfs_mount.h"
+#include "xfs_timestats.h"
+
+/* Format a timestats report into a buffer. */
+static ssize_t
+xfs_timestats_read(
+	struct file		*file,
+	char __user		*ubuf,
+	size_t			count,
+	loff_t			*ppos)
+{
+	struct seq_buf		s;
+	struct time_stats	*ts = file->private_data;
+	char			*buf;
+	ssize_t			ret;
+
+	/*
+	 * This generates a stringly snapshot of a timestats report, so we
+	 * do not want userspace to receive garbled text from multiple calls.
+	 * If the file position is greater than 0, return 0 to signal EOF.
+	 */
+	if (*ppos > 0)
+		return 0;
+
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	seq_buf_init(&s, buf, PAGE_SIZE);
+	time_stats_to_seq_buf(&s, ts, "mount", TIME_STATS_PRINT_NO_ZEROES);
+	ret = simple_read_from_buffer(ubuf, count, ppos, buf, seq_buf_used(&s));
+	kfree(buf);
+	return ret;
+}
+
+const struct file_operations xfs_timestats_fops = {
+	.open			= simple_open,
+	.read			= xfs_timestats_read,
+};
+
+/* Set up timestats collection. */
+void
+xfs_timestats_init(
+	struct xfs_mount	*mp)
+{
+	struct xfs_timestats	*ts = &mp->m_timestats;
+
+	time_stats_init(&ts->ts_log_reserve);
+	time_stats_init(&ts->ts_log_regrant);
+	time_stats_init(&ts->ts_ilock);
+	time_stats_init(&ts->ts_buflock);
+	time_stats_init(&ts->ts_dqlock);
+}
+
+/* Free all resources used by timestats collection. */
+void
+xfs_timestats_destroy(
+	struct xfs_mount	*mp)
+{
+	struct xfs_timestats	*ts = &mp->m_timestats;
+
+	time_stats_exit(&ts->ts_log_reserve);
+	time_stats_exit(&ts->ts_log_regrant);
+	time_stats_exit(&ts->ts_ilock);
+	time_stats_exit(&ts->ts_buflock);
+	time_stats_exit(&ts->ts_dqlock);
+}
+
+/* Export timestats via debugfs */
+#define X(p, ts, name) \
+	debugfs_create_file("blocked::" #name, 0444, (p), &(ts)->ts_##name, \
+			&xfs_timestats_fops)
+void
+xfs_timestats_export(
+	struct xfs_mount	*mp)
+{
+	struct dentry		*parent;
+	struct xfs_timestats	*ts = &mp->m_timestats;
+
+	if (!mp->m_debugfs)
+		return;
+
+	parent = xfs_debugfs_mkdir("time_stats", mp->m_debugfs);
+	if (!parent)
+		return;
+	ts->ts_debugfs = parent;
+
+	X(parent, ts, log_reserve);
+	X(parent, ts, log_regrant);
+	X(parent, ts, ilock);
+	X(parent, ts, buflock);
+	X(parent, ts, dqlock);
+}
+#undef X
+
+/* Delete debugfs entries for timestats */
+void
+xfs_timestats_unexport(
+	struct xfs_mount	*mp)
+{
+	struct xfs_timestats	*ts = &mp->m_timestats;
+
+	debugfs_remove(ts->ts_debugfs);
+}
diff --git a/fs/xfs/xfs_timestats.h b/fs/xfs/xfs_timestats.h
new file mode 100644
index 0000000000000..e53dbb40c8fff
--- /dev/null
+++ b/fs/xfs/xfs_timestats.h
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_TIMESTATS_H__
+#define __XFS_TIMESTATS_H__
+
+#ifdef CONFIG_XFS_TIME_STATS
+extern const struct file_operations xfs_timestats_fops;
+
+void xfs_timestats_init(struct xfs_mount *mp);
+void xfs_timestats_export(struct xfs_mount *mp);
+void xfs_timestats_unexport(struct xfs_mount *mp);
+void xfs_timestats_destroy(struct xfs_mount *mp);
+
+# define DECLARE_XFS_TIMESTAT(name)	u64 name
+# define DEFINE_XFS_TIMESTAT(name)	u64 name = local_clock()
+# define xfs_timestats_start(b)		do { *(b) = local_clock(); } while (0)
+# define xfs_timestats_end(a, b)	time_stats_update((a), (b))
+#else
+# define xfs_timestats_init(mp)		((void)0)
+# define xfs_timestats_export(mp)	((void)0)
+# define xfs_timestats_unexport(mp)	((void)0)
+# define xfs_timestats_destroy(mp)	((void)0)
+
+# define DECLARE_XFS_TIMESTAT(name)
+# define DEFINE_XFS_TIMESTAT(name)
+# define xfs_timestats_start(t)		((void)0)
+# define xfs_timestats_end(s, t)	((void)0)
+#endif /* CONFIG_XFS_TIME_STATS */
+
+#endif /* __XFS_TIMESTATS_H__ */
+



* [PATCH 2/4] xfs: present time stats for scrubbers
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 1/4] xfs: present wait time statistics Darrick J. Wong
@ 2024-02-24  1:13   ` Darrick J. Wong
  2024-02-24  1:13   ` [PATCH 3/4] xfs: present timestats in json format Darrick J. Wong
  2024-02-24  1:13   ` [PATCH 4/4] xfs: create debugfs uuid aliases Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:13 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Use the timestats code to report statistical information about how much
time we spend in scrub and repair.  This augments the existing raw scrub
counters.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/repair.c  |    6 +-
 fs/xfs/scrub/scrub.c   |    6 +-
 fs/xfs/scrub/stats.c   |  128 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/scrub/stats.h   |   21 +-------
 fs/xfs/xfs_linux.h     |    1 
 fs/xfs/xfs_timestats.h |    2 +
 6 files changed, 137 insertions(+), 27 deletions(-)
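
Reader's note: the stats.c hunk below drops the accumulated scrub_ns and
repair_ns fields in favor of start/stop timestamps, and derives the
microsecond totals at merge time.  A minimal userspace sketch of that
arithmetic follows.  It assumes howmany_64() rounds up to the next whole
unit; howmany_u64() and elapsed_us() are hypothetical names used only for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC	1000ULL

/* Hypothetical helper: round x up to a whole number of y-sized units. */
static uint64_t howmany_u64(uint64_t x, uint64_t y)
{
	assert(y != 0);
	return (x + y - 1) / y;
}

/*
 * Convert a start/stop timestamp pair (in ns) to microseconds, charging
 * at least 1ns so that a run never reports an instantaneous runtime.
 */
static uint64_t elapsed_us(uint64_t start_ns, uint64_t stop_ns)
{
	uint64_t delta = stop_ns - start_ns;

	if (delta < 1)
		delta = 1;
	return howmany_u64(delta, NSEC_PER_USEC);
}
```

Under this arithmetic a pair of zeroed timestamps also yields 1, so a
caller that wants to charge nothing for a repair that was never attempted
would have to check for that case before converting.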


diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 77db28830ce9e..81955d0a188cf 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -42,6 +42,7 @@
 #include "xfs_rtalloc.h"
 #include "xfs_imeta.h"
 #include "xfs_rtrefcount_btree.h"
+#include "xfs_timestats.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -61,7 +62,6 @@ xrep_attempt(
 	struct xfs_scrub	*sc,
 	struct xchk_stats_run	*run)
 {
-	u64			repair_start;
 	int			error = 0;
 
 	trace_xrep_attempt(XFS_I(file_inode(sc->file)), sc->sm, error);
@@ -72,10 +72,10 @@ xrep_attempt(
 	/* Repair whatever's broken. */
 	ASSERT(sc->ops->repair);
 	run->repair_attempted = true;
-	repair_start = xchk_stats_now();
+	run->repair_start = xchk_stats_now();
 	error = sc->ops->repair(sc);
+	run->repair_stop = xchk_stats_now();
 	trace_xrep_done(XFS_I(file_inode(sc->file)), sc->sm, error);
-	run->repair_ns += xchk_stats_elapsed_ns(repair_start);
 	switch (error) {
 	case 0:
 		/*
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 4322743aa5578..fc4a71fab51e6 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -22,6 +22,7 @@
 #include "xfs_dir2.h"
 #include "xfs_parent.h"
 #include "xfs_icache.h"
+#include "xfs_timestats.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
@@ -677,7 +678,6 @@ xfs_scrub_metadata(
 	struct xchk_stats_run		run = { };
 	struct xfs_scrub		*sc;
 	struct xfs_mount		*mp = XFS_I(file_inode(file))->i_mount;
-	u64				check_start;
 	int				error = 0;
 
 	BUILD_BUG_ON(sizeof(meta_scrub_ops) !=
@@ -735,12 +735,12 @@ xfs_scrub_metadata(
 		goto out_teardown;
 
 	/* Scrub for errors. */
-	check_start = xchk_stats_now();
+	run.scrub_start = xchk_stats_now();
 	if ((sc->flags & XREP_ALREADY_FIXED) && sc->ops->repair_eval != NULL)
 		error = sc->ops->repair_eval(sc);
 	else
 		error = sc->ops->scrub(sc);
-	run.scrub_ns += xchk_stats_elapsed_ns(check_start);
+	run.scrub_stop = xchk_stats_now();
 	if (error == -EDEADLOCK && !(sc->flags & XCHK_TRY_HARDER))
 		goto try_harder;
 	if (error == -ECHRNG && !(sc->flags & XCHK_NEED_DRAIN))
diff --git a/fs/xfs/scrub/stats.c b/fs/xfs/scrub/stats.c
index 0e0be23adfcb4..b9e6ace59e572 100644
--- a/fs/xfs/scrub/stats.c
+++ b/fs/xfs/scrub/stats.c
@@ -12,6 +12,7 @@
 #include "xfs_sysfs.h"
 #include "xfs_btree.h"
 #include "xfs_super.h"
+#include "xfs_timestats.h"
 #include "scrub/scrub.h"
 #include "scrub/stats.h"
 #include "scrub/trace.h"
@@ -44,12 +45,24 @@ struct xchk_scrub_stats {
 	spinlock_t		css_lock;
 };
 
+struct xchk_timestats {
+#ifdef CONFIG_XFS_TIME_STATS
+	struct dentry		*parent;
+	struct {
+		struct time_stats	scrub;
+		struct time_stats	repair;
+	} scrub[XFS_SCRUB_TYPE_NR];
+#endif
+};
+
 struct xchk_stats {
 	struct dentry		*cs_debugfs;
+#ifdef CONFIG_XFS_TIME_STATS
+	struct xchk_timestats	*cs_timestats;
+#endif
 	struct xchk_scrub_stats	cs_stats[XFS_SCRUB_TYPE_NR];
 };
 
-
 static struct xchk_stats	global_stats;
 
 static const char *name_map[XFS_SCRUB_TYPE_NR] = {
@@ -86,6 +99,107 @@ static const char *name_map[XFS_SCRUB_TYPE_NR] = {
 	[XFS_SCRUB_TYPE_RTRMAPBT]	= "rtrmapbt",
 	[XFS_SCRUB_TYPE_RTREFCBT]	= "rtrefcountbt",
 };
+#ifdef CONFIG_XFS_TIME_STATS
+static inline void
+xchk_timestats_init(
+	struct xchk_stats	*cs,
+	struct xfs_mount	*mp)
+{
+	struct xchk_timestats	*ts;
+	unsigned int		i;
+
+	/* Only individual mounts have timestats so far */
+	if (!mp) {
+		cs->cs_timestats = NULL;
+		return;
+	}
+
+	/* timestats are optional */
+	ts = kmalloc(sizeof(struct xchk_timestats), GFP_KERNEL);
+	if (!ts) {
+		cs->cs_timestats = NULL;
+		return;
+	}
+
+	for (i = 0; i < XFS_SCRUB_TYPE_NR; i++) {
+		time_stats_init(&ts->scrub[i].scrub);
+		time_stats_init(&ts->scrub[i].repair);
+	}
+
+	ts->parent = mp->m_timestats.ts_debugfs;
+	cs->cs_timestats = ts;
+}
+
+static inline void
+xchk_timestats_teardown(
+	struct xchk_stats	*cs)
+{
+	struct xchk_timestats	*ts = cs->cs_timestats;
+	unsigned int		i;
+
+	if (!ts)
+		return;
+
+	for (i = 0; i < XFS_SCRUB_TYPE_NR; i++) {
+		time_stats_exit(&ts->scrub[i].scrub);
+		time_stats_exit(&ts->scrub[i].repair);
+	}
+	kfree(ts);
+	cs->cs_timestats = NULL;
+}
+
+static inline void
+xchk_timestats_register(
+	struct xchk_stats	*cs)
+{
+	char			name[32];
+	struct xchk_timestats	*ts = cs->cs_timestats;
+	unsigned int		i;
+
+	if (!ts)
+		return;
+
+	for (i = 0; i < XFS_SCRUB_TYPE_NR; i++) {
+		if (!name_map[i])
+			continue;
+
+		snprintf(name, 32, "scrub::%s", name_map[i]);
+		debugfs_create_file(name, 0444, ts->parent,
+				&ts->scrub[i].scrub, &xfs_timestats_fops);
+
+		snprintf(name, 32, "repair::%s", name_map[i]);
+		debugfs_create_file(name, 0444, ts->parent,
+				&ts->scrub[i].repair, &xfs_timestats_fops);
+	}
+}
+
+STATIC void
+xchk_timestats_merge_one(
+	struct xchk_stats		*cs,
+	const struct xfs_scrub_metadata	*sm,
+	const struct xchk_stats_run	*run)
+{
+	struct xchk_timestats		*ts = cs->cs_timestats;
+
+	if (sm->sm_type >= XFS_SCRUB_TYPE_NR) {
+		ASSERT(sm->sm_type < XFS_SCRUB_TYPE_NR);
+		return;
+	}
+	if (!ts)
+		return;
+
+	xfs_timestats_interval(&ts->scrub[sm->sm_type].scrub,
+			run->scrub_start, run->scrub_stop);
+	xfs_timestats_interval(&ts->scrub[sm->sm_type].repair,
+			run->repair_start, run->repair_stop);
+}
+
+#else
+# define xchk_timestats_init(cs, mp)	((void)0)
+# define xchk_timestats_teardown(cs)	((void)0)
+# define xchk_timestats_register(cs)	((void)0)
+# define xchk_timestats_merge_one(...)	((void)0)
+#endif
 
 /* Format the scrub stats into a text buffer, similar to pcp style. */
 STATIC ssize_t
@@ -192,6 +306,7 @@ xchk_stats_merge_one(
 	const struct xchk_stats_run	*run)
 {
 	struct xchk_scrub_stats		*css;
+	u64				delta;
 
 	if (sm->sm_type >= XFS_SCRUB_TYPE_NR) {
 		ASSERT(sm->sm_type < XFS_SCRUB_TYPE_NR);
@@ -216,13 +331,15 @@ xchk_stats_merge_one(
 	if (sm->sm_flags & XFS_SCRUB_OFLAG_WARNING)
 		css->warning++;
 	css->retries += run->retries;
-	css->checktime_us += howmany_64(run->scrub_ns, NSEC_PER_USEC);
+	delta = max(1, run->scrub_stop - run->scrub_start);
+	css->checktime_us += howmany_64(delta, NSEC_PER_USEC);
 
 	if (run->repair_attempted)
 		css->repair_invocations++;
 	if (run->repair_succeeded)
 		css->repair_success++;
-	css->repairtime_us += howmany_64(run->repair_ns, NSEC_PER_USEC);
+	delta = max(1, run->repair_stop - run->repair_start);
+	css->repairtime_us += howmany_64(delta, NSEC_PER_USEC);
 	spin_unlock(&css->css_lock);
 }
 
@@ -235,6 +352,7 @@ xchk_stats_merge(
 {
 	xchk_stats_merge_one(&global_stats, sm, run);
 	xchk_stats_merge_one(mp->m_scrub_stats, sm, run);
+	xchk_timestats_merge_one(mp->m_scrub_stats, sm, run);
 }
 
 /* debugfs boilerplate */
@@ -321,6 +439,7 @@ xchk_stats_init(
 	for (i = 0; i < XFS_SCRUB_TYPE_NR; i++, css++)
 		spin_lock_init(&css->css_lock);
 
+	xchk_timestats_init(cs, mp);
 	return 0;
 }
 
@@ -341,6 +460,8 @@ xchk_stats_register(
 			&scrub_stats_fops);
 	debugfs_create_file("clear_stats", 0400, cs->cs_debugfs, cs,
 			&clear_scrub_stats_fops);
+
+	xchk_timestats_register(cs);
 }
 
 /* Free all resources related to the stats object. */
@@ -348,6 +469,7 @@ STATIC int
 xchk_stats_teardown(
 	struct xchk_stats	*cs)
 {
+	xchk_timestats_teardown(cs);
 	return 0;
 }
 
diff --git a/fs/xfs/scrub/stats.h b/fs/xfs/scrub/stats.h
index b358ad8d8b90a..f615bff22dd22 100644
--- a/fs/xfs/scrub/stats.h
+++ b/fs/xfs/scrub/stats.h
@@ -7,8 +7,8 @@
 #define __XFS_SCRUB_STATS_H__
 
 struct xchk_stats_run {
-	u64			scrub_ns;
-	u64			repair_ns;
+	u64			scrub_start, scrub_stop;
+	u64			repair_start, repair_stop;
 	unsigned int		retries;
 	bool			repair_attempted;
 	bool			repair_succeeded;
@@ -29,21 +29,7 @@ void xchk_stats_unregister(struct xchk_stats *cs);
 void xchk_stats_merge(struct xfs_mount *mp, const struct xfs_scrub_metadata *sm,
 		const struct xchk_stats_run *run);
 
-static inline u64 xchk_stats_now(void) { return ktime_get_ns(); }
-static inline u64 xchk_stats_elapsed_ns(u64 since)
-{
-	u64 now = xchk_stats_now();
-
-	/*
-	 * If the system doesn't have a high enough resolution clock, charge at
-	 * least one nanosecond so that our stats don't report instantaneous
-	 * runtimes.
-	 */
-	if (now == since)
-		return 1;
-
-	return now - since;
-}
+static inline u64 xchk_stats_now(void) { return local_clock(); }
 #else
 # define xchk_global_stats_setup(parent)	(0)
 # define xchk_global_stats_teardown()		((void)0)
@@ -52,7 +38,6 @@ static inline u64 xchk_stats_elapsed_ns(u64 since)
 # define xchk_stats_register(cs, parent)	((void)0)
 # define xchk_stats_unregister(cs)		((void)0)
 # define xchk_stats_now()			(0)
-# define xchk_stats_elapsed_ns(x)		(0 * (x))
 # define xchk_stats_merge(mp, sm, run)		((void)0)
 #endif /* CONFIG_XFS_ONLINE_SCRUB_STATS */
 
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 27f9ec7721a93..8598294514aa3 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -68,6 +68,7 @@ typedef __u32			xfs_nlink_t;
 # include <linux/seq_buf.h>
 # include <linux/time_stats.h>
 #endif
+#include <linux/sched/clock.h>
 
 #include <asm/page.h>
 #include <asm/div64.h>
diff --git a/fs/xfs/xfs_timestats.h b/fs/xfs/xfs_timestats.h
index e53dbb40c8fff..418e5abf2cf12 100644
--- a/fs/xfs/xfs_timestats.h
+++ b/fs/xfs/xfs_timestats.h
@@ -18,6 +18,7 @@ void xfs_timestats_destroy(struct xfs_mount *mp);
 # define DEFINE_XFS_TIMESTAT(name)	u64 name = local_clock()
 # define xfs_timestats_start(b)		do { *(b) = local_clock(); } while (0)
 # define xfs_timestats_end(a, b)	time_stats_update((a), (b))
+# define xfs_timestats_interval(a,b,c)	__time_stats_update((a), (b), (c))
 #else
 # define xfs_timestats_init(mp)		((void)0)
 # define xfs_timestats_export(mp)	((void)0)
@@ -28,6 +29,7 @@ void xfs_timestats_destroy(struct xfs_mount *mp);
 # define DEFINE_XFS_TIMESTAT(name)
 # define xfs_timestats_start(t)		((void)0)
 # define xfs_timestats_end(s, t)	((void)0)
+# define xfs_timestats_interval(...)	((void)0)
 #endif /* CONFIG_XFS_TIME_STATS */
 
 #endif /* __XFS_TIMESTATS_H__ */



* [PATCH 3/4] xfs: present timestats in json format
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
  2024-02-24  1:12   ` [PATCH 1/4] xfs: present wait time statistics Darrick J. Wong
  2024-02-24  1:13   ` [PATCH 2/4] xfs: present time stats for scrubbers Darrick J. Wong
@ 2024-02-24  1:13   ` Darrick J. Wong
  2024-02-24  1:13   ` [PATCH 4/4] xfs: create debugfs uuid aliases Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:13 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Export JSON versions of the XFS time statistics.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/stats.c   |   12 ++++++++++--
 fs/xfs/xfs_timestats.c |   45 +++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_timestats.h |    1 +
 3 files changed, 54 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/scrub/stats.c b/fs/xfs/scrub/stats.c
index b9e6ace59e572..12f6ebbda3758 100644
--- a/fs/xfs/scrub/stats.c
+++ b/fs/xfs/scrub/stats.c
@@ -163,13 +163,21 @@ xchk_timestats_register(
 		if (!name_map[i])
 			continue;
 
-		snprintf(name, 32, "scrub::%s", name_map[i]);
+		snprintf(name, 32, "scrub::%s.txt", name_map[i]);
 		debugfs_create_file(name, 0444, ts->parent,
 				&ts->scrub[i].scrub, &xfs_timestats_fops);
 
-		snprintf(name, 32, "repair::%s", name_map[i]);
+		snprintf(name, 32, "repair::%s.txt", name_map[i]);
 		debugfs_create_file(name, 0444, ts->parent,
 				&ts->scrub[i].repair, &xfs_timestats_fops);
+
+		snprintf(name, 32, "scrub::%s.json", name_map[i]);
+		debugfs_create_file(name, 0444, ts->parent,
+				&ts->scrub[i].scrub, &xfs_timestats_json_fops);
+
+		snprintf(name, 32, "repair::%s.json", name_map[i]);
+		debugfs_create_file(name, 0444, ts->parent,
+				&ts->scrub[i].repair, &xfs_timestats_json_fops);
 	}
 }
 
diff --git a/fs/xfs/xfs_timestats.c b/fs/xfs/xfs_timestats.c
index 163a37e6717f7..dccecbe1ad922 100644
--- a/fs/xfs/xfs_timestats.c
+++ b/fs/xfs/xfs_timestats.c
@@ -49,6 +49,43 @@ const struct file_operations xfs_timestats_fops = {
 	.read			= xfs_timestats_read,
 };
 
+/* Format a timestats report into a buffer as json. */
+static ssize_t
+xfs_timestats_read_json(
+	struct file		*file,
+	char __user		*ubuf,
+	size_t			count,
+	loff_t			*ppos)
+{
+	struct seq_buf		s;
+	struct time_stats	*ts = file->private_data;
+	char			*buf;
+	ssize_t			ret;
+
+	/*
+	 * This generates a stringly snapshot of a timestats report, so we
+	 * do not want userspace to receive garbled text from multiple calls.
+	 * If the file position is greater than 0, return 0 to signal EOF.
+	 */
+	if (*ppos > 0)
+		return 0;
+
+	buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	seq_buf_init(&s, buf, PAGE_SIZE);
+	time_stats_to_json(&s, ts, "mount", TIME_STATS_PRINT_NO_ZEROES);
+	ret = simple_read_from_buffer(ubuf, count, ppos, buf, seq_buf_used(&s));
+	kfree(buf);
+	return ret;
+}
+
+const struct file_operations xfs_timestats_json_fops = {
+	.open			= simple_open,
+	.read			= xfs_timestats_read_json,
+};
+
 /* Set up timestats collection. */
 void
 xfs_timestats_init(
@@ -79,8 +116,12 @@ xfs_timestats_destroy(
 
 /* Export timestats via debugfs */
 #define X(p, ts, name) \
-	debugfs_create_file("blocked::" #name, 0444, (p), &(ts)->ts_##name, \
-			&xfs_timestats_fops)
+	do { \
+		debugfs_create_file("blocked::" #name ".txt", 0444, (p), \
+				&(ts)->ts_##name, &xfs_timestats_fops); \
+		debugfs_create_file("blocked::" #name ".json", 0444, (p), \
+				&(ts)->ts_##name, &xfs_timestats_json_fops); \
+	} while (0)
 void
 xfs_timestats_export(
 	struct xfs_mount	*mp)
diff --git a/fs/xfs/xfs_timestats.h b/fs/xfs/xfs_timestats.h
index 418e5abf2cf12..33ea794bdabce 100644
--- a/fs/xfs/xfs_timestats.h
+++ b/fs/xfs/xfs_timestats.h
@@ -8,6 +8,7 @@
 
 #ifdef CONFIG_XFS_TIME_STATS
 extern const struct file_operations xfs_timestats_fops;
+extern const struct file_operations xfs_timestats_json_fops;
 
 void xfs_timestats_init(struct xfs_mount *mp);
 void xfs_timestats_export(struct xfs_mount *mp);



* [PATCH 4/4] xfs: create debugfs uuid aliases
  2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:13   ` [PATCH 3/4] xfs: present timestats in json format Darrick J. Wong
@ 2024-02-24  1:13   ` Darrick J. Wong
  3 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:13 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Create an alias for the debugfs dir so that we can find a filesystem by
uuid, unless it's mounted nouuid.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_mount.h |    1 +
 fs/xfs/xfs_super.c |   11 +++++++++++
 2 files changed, 12 insertions(+)
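
Reader's note: the xfs_super.c hunk below sizes its name buffer as
UUID_STRING_LEN + 1 and fills it with the kernel's %pU format.  The
userspace sketch here only illustrates why 36 characters plus a NUL are
enough for the canonical 8-4-4-4-12 lower-case hex form.  format_uuid()
is a hypothetical helper; the kernel does the real formatting inside
vsnprintf().

```c
#include <assert.h>
#include <stdio.h>

#define UUID_STRING_LEN	36

/*
 * Format 16 raw bytes as xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx.
 * Returns the number of characters written, not counting the NUL.
 */
static int format_uuid(const unsigned char u[16], char *out, size_t outlen)
{
	assert(outlen >= UUID_STRING_LEN + 1);
	return snprintf(out, outlen,
			"%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
			"%02x%02x%02x%02x%02x%02x",
			u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
			u[8], u[9], u[10], u[11], u[12], u[13], u[14], u[15]);
}
```

The resulting string is what would appear as the symlink name next to
the per-mount debugfs directory.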


diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 7cfd209404365..63649c259b9c5 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -235,6 +235,7 @@ typedef struct xfs_mount {
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	struct delayed_work	m_reclaim_work;	/* background inode reclaim */
 	struct dentry		*m_debugfs;	/* debugfs parent */
+	struct dentry		*m_debugfs_uuid; /* debugfs symlink */
 	struct xfs_kobj		m_kobj;
 	struct xfs_kobj		m_error_kobj;
 	struct xfs_kobj		m_error_meta_kobj;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 69f1c1d85edf6..29a53874490cc 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -768,6 +768,7 @@ xfs_mount_free(
 	if (mp->m_ddev_targp)
 		xfs_free_buftarg(mp->m_ddev_targp);
 
+	debugfs_remove(mp->m_debugfs_uuid);
 	debugfs_remove(mp->m_debugfs);
 	xfs_timestats_destroy(mp);
 	kfree(mp->m_rtname);
@@ -1799,6 +1800,16 @@ xfs_fs_fill_super(
 		goto out_unmount;
 	}
 
+	if (xfs_debugfs && mp->m_debugfs && !xfs_has_nouuid(mp)) {
+		char	name[UUID_STRING_LEN + 1];
+
+		snprintf(name, UUID_STRING_LEN + 1, "%pU", &mp->m_sb.sb_uuid);
+		mp->m_debugfs_uuid = debugfs_create_symlink(name, xfs_debugfs,
+				mp->m_super->s_id);
+	} else {
+		mp->m_debugfs_uuid = NULL;
+	}
+
 	return 0;
 
  out_filestream_unmount:



* [PATCH 01/10] bcachefs: thread_with_stdio: eliminate double buffering
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
@ 2024-02-24  1:14   ` Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 02/10] bcachefs: thread_with_stdio: convert to darray Darrick J. Wong
                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:14 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

The output buffer lock has to be a spinlock so that we can write to it
from interrupt context, so we can't use a direct copy_to_user(); this
switches thread_with_stdio_read() to use fault_in_writeable() and
copy_to_user_nofault(), similar to how thread_with_stdio_write() works.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/thread_with_file.c |   56 ++++++++++++----------------------------
 fs/bcachefs/thread_with_file.h |    1 -
 2 files changed, 17 insertions(+), 40 deletions(-)
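
Reader's note: with the intermediate output2 darray gone, the read path
below copies straight out of the output buffer and then slides the unread
remainder to the front.  A minimal userspace sketch of just that buffer
arithmetic follows.  It leaves out the spinlock, fault_in_writeable() and
copy_to_user_nofault(), and struct outbuf and outbuf_drain() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the output buffer; pos counts valid bytes. */
struct outbuf {
	char	data[64];
	size_t	pos;
};

/*
 * Copy up to len bytes from the front of the buffer into dst, then
 * slide the unread remainder down to offset 0.  Returns bytes copied.
 */
static size_t outbuf_drain(struct outbuf *buf, char *dst, size_t len)
{
	size_t b = len < buf->pos ? len : buf->pos;

	assert(buf->pos <= sizeof(buf->data));
	memcpy(dst, buf->data, b);
	memmove(buf->data, buf->data + b, buf->pos - b);
	buf->pos -= b;
	return b;
}
```

In the kernel version the copy can fail while the lock is held, in which
case nothing is consumed and the loop retries after faulting the user
pages in again.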


diff --git a/fs/bcachefs/thread_with_file.c b/fs/bcachefs/thread_with_file.c
index 9220d7de10db6..8c3afb4c3204f 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/fs/bcachefs/thread_with_file.c
@@ -67,16 +67,15 @@ int bch2_run_thread_with_file(struct thread_with_file *thr,
 
 static inline bool thread_with_stdio_has_output(struct thread_with_stdio *thr)
 {
-	return thr->stdio.output_buf.pos ||
-		thr->output2.nr ||
-		thr->thr.done;
+	return thr->stdio.output_buf.pos || thr->thr.done;
 }
 
-static ssize_t thread_with_stdio_read(struct file *file, char __user *buf,
+static ssize_t thread_with_stdio_read(struct file *file, char __user *ubuf,
 				      size_t len, loff_t *ppos)
 {
 	struct thread_with_stdio *thr =
 		container_of(file->private_data, struct thread_with_stdio, thr);
+	struct printbuf *buf = &thr->stdio.output_buf;
 	size_t copied = 0, b;
 	int ret = 0;
 
@@ -89,44 +88,25 @@ static ssize_t thread_with_stdio_read(struct file *file, char __user *buf,
 	if (ret)
 		return ret;
 
-	if (thr->thr.done)
-		return 0;
-
-	while (len) {
-		ret = darray_make_room(&thr->output2, thr->stdio.output_buf.pos);
-		if (ret)
+	while (len && buf->pos) {
+		if (fault_in_writeable(ubuf, len) == len) {
+			ret = -EFAULT;
 			break;
+		}
 
 		spin_lock_irq(&thr->stdio.output_lock);
-		b = min_t(size_t, darray_room(thr->output2), thr->stdio.output_buf.pos);
+		b = min_t(size_t, len, buf->pos);
 
-		memcpy(&darray_top(thr->output2), thr->stdio.output_buf.buf, b);
-		memmove(thr->stdio.output_buf.buf,
-			thr->stdio.output_buf.buf + b,
-			thr->stdio.output_buf.pos - b);
-
-		thr->output2.nr += b;
-		thr->stdio.output_buf.pos -= b;
+		if (b && !copy_to_user_nofault(ubuf, buf->buf, b)) {
+			memmove(buf->buf,
+				buf->buf + b,
+				buf->pos - b);
+			buf->pos -= b;
+			ubuf	+= b;
+			len	-= b;
+			copied	+= b;
+		}
 		spin_unlock_irq(&thr->stdio.output_lock);
-
-		b = min(len, thr->output2.nr);
-		if (!b)
-			break;
-
-		b -= copy_to_user(buf, thr->output2.data, b);
-		if (!b) {
-			ret = -EFAULT;
-			break;
-		}
-
-		copied	+= b;
-		buf	+= b;
-		len	-= b;
-
-		memmove(thr->output2.data,
-			thr->output2.data + b,
-			thr->output2.nr - b);
-		thr->output2.nr -= b;
 	}
 
 	return copied ?: ret;
@@ -140,7 +120,6 @@ static int thread_with_stdio_release(struct inode *inode, struct file *file)
 	bch2_thread_with_file_exit(&thr->thr);
 	printbuf_exit(&thr->stdio.input_buf);
 	printbuf_exit(&thr->stdio.output_buf);
-	darray_exit(&thr->output2);
 	thr->exit(thr);
 	return 0;
 }
@@ -245,7 +224,6 @@ int bch2_run_thread_with_stdio(struct thread_with_stdio *thr,
 	spin_lock_init(&thr->stdio.output_lock);
 	init_waitqueue_head(&thr->stdio.output_wait);
 
-	darray_init(&thr->output2);
 	thr->exit = exit;
 
 	return bch2_run_thread_with_file(&thr->thr, &thread_with_stdio_fops, fn);
diff --git a/fs/bcachefs/thread_with_file.h b/fs/bcachefs/thread_with_file.h
index 05879c5048c87..b5098b52db709 100644
--- a/fs/bcachefs/thread_with_file.h
+++ b/fs/bcachefs/thread_with_file.h
@@ -20,7 +20,6 @@ int bch2_run_thread_with_file(struct thread_with_file *,
 struct thread_with_stdio {
 	struct thread_with_file	thr;
 	struct stdio_redirect	stdio;
-	DARRAY(char)		output2;
 	void			(*exit)(struct thread_with_stdio *);
 };
 



* [PATCH 02/10] bcachefs: thread_with_stdio: convert to darray
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 01/10] bcachefs: thread_with_stdio: eliminate double buffering Darrick J. Wong
@ 2024-02-24  1:14   ` Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 03/10] bcachefs: thread_with_stdio: kill thread_with_stdio_done() Darrick J. Wong
                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:14 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

 - eliminate the dependency on printbufs, so that we can lift
   thread_with_file for use in xfs
 - add a nonblocking parameter to stdio_redirect_printf(), and either
   block if the buffer is full or drop it on the floor - don't buffer
   infinitely

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/super.c                  |    9 -
 fs/bcachefs/thread_with_file.c       |  229 +++++++++++++++++++++-------------
 fs/bcachefs/thread_with_file.h       |    7 +
 fs/bcachefs/thread_with_file_types.h |   15 ++
 4 files changed, 160 insertions(+), 100 deletions(-)


diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 2c238030fb5d7..0cff8c5f3c104 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -56,6 +56,7 @@
 #include "super.h"
 #include "super-io.h"
 #include "sysfs.h"
+#include "thread_with_file.h"
 #include "trace.h"
 
 #include <linux/backing-dev.h>
@@ -95,16 +96,10 @@ void __bch2_print(struct bch_fs *c, const char *fmt, ...)
 	if (likely(!stdio)) {
 		vprintk(fmt, args);
 	} else {
-		unsigned long flags;
-
 		if (fmt[0] == KERN_SOH[0])
 			fmt += 2;
 
-		spin_lock_irqsave(&stdio->output_lock, flags);
-		prt_vprintf(&stdio->output_buf, fmt, args);
-		spin_unlock_irqrestore(&stdio->output_lock, flags);
-
-		wake_up(&stdio->output_wait);
+		bch2_stdio_redirect_vprintf(stdio, true, fmt, args);
 	}
 	va_end(args);
 }
diff --git a/fs/bcachefs/thread_with_file.c b/fs/bcachefs/thread_with_file.c
index 8c3afb4c3204f..ca81d3fec3eef 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/fs/bcachefs/thread_with_file.c
@@ -2,7 +2,6 @@
 #ifndef NO_BCACHEFS_FS
 
 #include "bcachefs.h"
-#include "printbuf.h"
 #include "thread_with_file.h"
 
 #include <linux/anon_inodes.h>
@@ -65,48 +64,74 @@ int bch2_run_thread_with_file(struct thread_with_file *thr,
 	return ret;
 }
 
-static inline bool thread_with_stdio_has_output(struct thread_with_stdio *thr)
+/* stdio_redirect */
+
+static bool stdio_redirect_has_input(struct stdio_redirect *stdio)
 {
-	return thr->stdio.output_buf.pos || thr->thr.done;
+	return stdio->input.buf.nr || stdio->done;
 }
 
+static bool stdio_redirect_has_output(struct stdio_redirect *stdio)
+{
+	return stdio->output.buf.nr || stdio->done;
+}
+
+#define WRITE_BUFFER		4096
+
+static bool stdio_redirect_has_input_space(struct stdio_redirect *stdio)
+{
+	return stdio->input.buf.nr < WRITE_BUFFER || stdio->done;
+}
+
+static bool stdio_redirect_has_output_space(struct stdio_redirect *stdio)
+{
+	return stdio->output.buf.nr < WRITE_BUFFER || stdio->done;
+}
+
+static void stdio_buf_init(struct stdio_buf *buf)
+{
+	spin_lock_init(&buf->lock);
+	init_waitqueue_head(&buf->wait);
+	darray_init(&buf->buf);
+}
+
+/* thread_with_stdio */
+
 static ssize_t thread_with_stdio_read(struct file *file, char __user *ubuf,
 				      size_t len, loff_t *ppos)
 {
 	struct thread_with_stdio *thr =
 		container_of(file->private_data, struct thread_with_stdio, thr);
-	struct printbuf *buf = &thr->stdio.output_buf;
+	struct stdio_buf *buf = &thr->stdio.output;
 	size_t copied = 0, b;
 	int ret = 0;
 
-	if ((file->f_flags & O_NONBLOCK) &&
-	    !thread_with_stdio_has_output(thr))
+	if (!(file->f_flags & O_NONBLOCK)) {
+		ret = wait_event_interruptible(buf->wait, stdio_redirect_has_output(&thr->stdio));
+		if (ret)
+			return ret;
+	} else if (!stdio_redirect_has_output(&thr->stdio))
 		return -EAGAIN;
 
-	ret = wait_event_interruptible(thr->stdio.output_wait,
-		thread_with_stdio_has_output(thr));
-	if (ret)
-		return ret;
-
-	while (len && buf->pos) {
+	while (len && buf->buf.nr) {
 		if (fault_in_writeable(ubuf, len) == len) {
 			ret = -EFAULT;
 			break;
 		}
 
-		spin_lock_irq(&thr->stdio.output_lock);
-		b = min_t(size_t, len, buf->pos);
+		spin_lock_irq(&buf->lock);
+		b = min_t(size_t, len, buf->buf.nr);
 
-		if (b && !copy_to_user_nofault(ubuf, buf->buf, b)) {
-			memmove(buf->buf,
-				buf->buf + b,
-				buf->pos - b);
-			buf->pos -= b;
+		if (b && !copy_to_user_nofault(ubuf, buf->buf.data, b)) {
 			ubuf	+= b;
 			len	-= b;
 			copied	+= b;
+			buf->buf.nr -= b;
+			memmove(buf->buf.data,
+				buf->buf.data + b,
+				buf->buf.nr);
 		}
-		spin_unlock_irq(&thr->stdio.output_lock);
+		spin_unlock_irq(&buf->lock);
 	}
 
 	return copied ?: ret;
@@ -118,25 +143,18 @@ static int thread_with_stdio_release(struct inode *inode, struct file *file)
 		container_of(file->private_data, struct thread_with_stdio, thr);
 
 	bch2_thread_with_file_exit(&thr->thr);
-	printbuf_exit(&thr->stdio.input_buf);
-	printbuf_exit(&thr->stdio.output_buf);
+	darray_exit(&thr->stdio.input.buf);
+	darray_exit(&thr->stdio.output.buf);
 	thr->exit(thr);
 	return 0;
 }
 
-#define WRITE_BUFFER		4096
-
-static inline bool thread_with_stdio_has_input_space(struct thread_with_stdio *thr)
-{
-	return thr->stdio.input_buf.pos < WRITE_BUFFER || thr->thr.done;
-}
-
 static ssize_t thread_with_stdio_write(struct file *file, const char __user *ubuf,
 				       size_t len, loff_t *ppos)
 {
 	struct thread_with_stdio *thr =
 		container_of(file->private_data, struct thread_with_stdio, thr);
-	struct printbuf *buf = &thr->stdio.input_buf;
+	struct stdio_buf *buf = &thr->stdio.input;
 	size_t copied = 0;
 	ssize_t ret = 0;
 
@@ -152,29 +170,29 @@ static ssize_t thread_with_stdio_write(struct file *file, const char __user *ubu
 			break;
 		}
 
-		spin_lock(&thr->stdio.input_lock);
-		if (buf->pos < WRITE_BUFFER)
-			bch2_printbuf_make_room(buf, min(b, WRITE_BUFFER - buf->pos));
-		b = min(len, printbuf_remaining_size(buf));
+		spin_lock(&buf->lock);
+		if (buf->buf.nr < WRITE_BUFFER)
+			darray_make_room_gfp(&buf->buf, min(b, WRITE_BUFFER - buf->buf.nr), __GFP_NOWARN);
+		b = min(len, darray_room(buf->buf));
 
-		if (b && !copy_from_user_nofault(&buf->buf[buf->pos], ubuf, b)) {
-			ubuf += b;
-			len -= b;
-			copied += b;
-			buf->pos += b;
+		if (b && !copy_from_user_nofault(&buf->buf.data[buf->buf.nr], ubuf, b)) {
+			buf->buf.nr += b;
+			ubuf	+= b;
+			len	-= b;
+			copied	+= b;
 		}
-		spin_unlock(&thr->stdio.input_lock);
+		spin_unlock(&buf->lock);
 
 		if (b) {
-			wake_up(&thr->stdio.input_wait);
+			wake_up(&buf->wait);
 		} else {
 			if ((file->f_flags & O_NONBLOCK)) {
 				ret = -EAGAIN;
 				break;
 			}
 
-			ret = wait_event_interruptible(thr->stdio.input_wait,
-					thread_with_stdio_has_input_space(thr));
+			ret = wait_event_interruptible(buf->wait,
+					stdio_redirect_has_input_space(&thr->stdio));
 			if (ret)
 				break;
 		}
@@ -188,14 +206,14 @@ static __poll_t thread_with_stdio_poll(struct file *file, struct poll_table_stru
 	struct thread_with_stdio *thr =
 		container_of(file->private_data, struct thread_with_stdio, thr);
 
-	poll_wait(file, &thr->stdio.output_wait, wait);
-	poll_wait(file, &thr->stdio.input_wait, wait);
+	poll_wait(file, &thr->stdio.output.wait, wait);
+	poll_wait(file, &thr->stdio.input.wait, wait);
 
 	__poll_t mask = 0;
 
-	if (thread_with_stdio_has_output(thr))
+	if (stdio_redirect_has_output(&thr->stdio))
 		mask |= EPOLLIN;
-	if (thread_with_stdio_has_input_space(thr))
+	if (stdio_redirect_has_input_space(&thr->stdio))
 		mask |= EPOLLOUT;
 	if (thr->thr.done)
 		mask |= EPOLLHUP|EPOLLERR;
@@ -203,75 +221,112 @@ static __poll_t thread_with_stdio_poll(struct file *file, struct poll_table_stru
 }
 
 static const struct file_operations thread_with_stdio_fops = {
-	.release	= thread_with_stdio_release,
+	.llseek		= no_llseek,
 	.read		= thread_with_stdio_read,
 	.write		= thread_with_stdio_write,
 	.poll		= thread_with_stdio_poll,
-	.llseek		= no_llseek,
+	.release	= thread_with_stdio_release,
 };
 
 int bch2_run_thread_with_stdio(struct thread_with_stdio *thr,
 			       void (*exit)(struct thread_with_stdio *),
 			       int (*fn)(void *))
 {
-	thr->stdio.input_buf = PRINTBUF;
-	thr->stdio.input_buf.atomic++;
-	spin_lock_init(&thr->stdio.input_lock);
-	init_waitqueue_head(&thr->stdio.input_wait);
-
-	thr->stdio.output_buf = PRINTBUF;
-	thr->stdio.output_buf.atomic++;
-	spin_lock_init(&thr->stdio.output_lock);
-	init_waitqueue_head(&thr->stdio.output_wait);
-
+	stdio_buf_init(&thr->stdio.input);
+	stdio_buf_init(&thr->stdio.output);
 	thr->exit = exit;
 
 	return bch2_run_thread_with_file(&thr->thr, &thread_with_stdio_fops, fn);
 }
 
-int bch2_stdio_redirect_read(struct stdio_redirect *stdio, char *buf, size_t len)
+int bch2_stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t len)
 {
-	wait_event(stdio->input_wait,
-		   stdio->input_buf.pos || stdio->done);
+	struct stdio_buf *buf = &stdio->input;
 
+	wait_event(buf->wait, stdio_redirect_has_input(stdio));
 	if (stdio->done)
 		return -1;
 
-	spin_lock(&stdio->input_lock);
-	int ret = min(len, stdio->input_buf.pos);
-	stdio->input_buf.pos -= ret;
-	memcpy(buf, stdio->input_buf.buf, ret);
-	memmove(stdio->input_buf.buf,
-		stdio->input_buf.buf + ret,
-		stdio->input_buf.pos);
-	spin_unlock(&stdio->input_lock);
+	spin_lock(&buf->lock);
+	int ret = min(len, buf->buf.nr);
+	buf->buf.nr -= ret;
+	memcpy(ubuf, buf->buf.data, ret);
+	memmove(buf->buf.data,
+		buf->buf.data + ret,
+		buf->buf.nr);
+	spin_unlock(&buf->lock);
 
-	wake_up(&stdio->input_wait);
+	wake_up(&buf->wait);
 	return ret;
 }
 
-int bch2_stdio_redirect_readline(struct stdio_redirect *stdio, char *buf, size_t len)
+int bch2_stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len)
 {
-	wait_event(stdio->input_wait,
-		   stdio->input_buf.pos || stdio->done);
+	struct stdio_buf *buf = &stdio->input;
 
+	wait_event(buf->wait, stdio_redirect_has_input(stdio));
 	if (stdio->done)
 		return -1;
 
-	spin_lock(&stdio->input_lock);
-	int ret = min(len, stdio->input_buf.pos);
-	char *n = memchr(stdio->input_buf.buf, '\n', ret);
-	if (n)
-		ret = min(ret, n + 1 - stdio->input_buf.buf);
-	stdio->input_buf.pos -= ret;
-	memcpy(buf, stdio->input_buf.buf, ret);
-	memmove(stdio->input_buf.buf,
-		stdio->input_buf.buf + ret,
-		stdio->input_buf.pos);
-	spin_unlock(&stdio->input_lock);
-
-	wake_up(&stdio->input_wait);
+	spin_lock(&buf->lock);
+	int ret = min(len, buf->buf.nr);
+	char *n = memchr(buf->buf.data, '\n', ret);
+	if (!n)
+		ret = min(ret, n + 1 - buf->buf.data);
+	buf->buf.nr -= ret;
+	memcpy(ubuf, buf->buf.data, ret);
+	memmove(buf->buf.data,
+		buf->buf.data + ret,
+		buf->buf.nr);
+	spin_unlock(&buf->lock);
+
+	wake_up(&buf->wait);
 	return ret;
 }
 
+__printf(3, 0)
+static void bch2_darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_list args)
+{
+	size_t len;
+
+	do {
+		va_list args2;
+		va_copy(args2, args);
+
+		len = vsnprintf(out->data + out->nr, darray_room(*out), fmt, args2);
+	} while (len + 1 > darray_room(*out) && !darray_make_room_gfp(out, len + 1, gfp));
+
+	out->nr += min(len, darray_room(*out));
+}
+
+void bch2_stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
+				 const char *fmt, va_list args)
+{
+	struct stdio_buf *buf = &stdio->output;
+	unsigned long flags;
+
+	if (!nonblocking)
+		wait_event(buf->wait, stdio_redirect_has_output_space(stdio));
+	else if (!stdio_redirect_has_output_space(stdio))
+		return;
+	if (stdio->done)
+		return;
+
+	spin_lock_irqsave(&buf->lock, flags);
+	bch2_darray_vprintf(&buf->buf, nonblocking ? __GFP_NOWARN : GFP_KERNEL, fmt, args);
+	spin_unlock_irqrestore(&buf->lock, flags);
+
+	wake_up(&buf->wait);
+}
+
+void bch2_stdio_redirect_printf(struct stdio_redirect *stdio, bool nonblocking,
+				const char *fmt, ...)
+{
+
+	va_list args;
+	va_start(args, fmt);
+	bch2_stdio_redirect_vprintf(stdio, nonblocking, fmt, args);
+	va_end(args);
+}
+
 #endif /* NO_BCACHEFS_FS */
diff --git a/fs/bcachefs/thread_with_file.h b/fs/bcachefs/thread_with_file.h
index b5098b52db709..4243c7c5ad3f3 100644
--- a/fs/bcachefs/thread_with_file.h
+++ b/fs/bcachefs/thread_with_file.h
@@ -27,8 +27,8 @@ static inline void thread_with_stdio_done(struct thread_with_stdio *thr)
 {
 	thr->thr.done = true;
 	thr->stdio.done = true;
-	wake_up(&thr->stdio.input_wait);
-	wake_up(&thr->stdio.output_wait);
+	wake_up(&thr->stdio.input.wait);
+	wake_up(&thr->stdio.output.wait);
 }
 
 int bch2_run_thread_with_stdio(struct thread_with_stdio *,
@@ -37,4 +37,7 @@ int bch2_run_thread_with_stdio(struct thread_with_stdio *,
 int bch2_stdio_redirect_read(struct stdio_redirect *, char *, size_t);
 int bch2_stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 
+__printf(3, 0) void bch2_stdio_redirect_vprintf(struct stdio_redirect *, bool, const char *, va_list);
+__printf(3, 4) void bch2_stdio_redirect_printf(struct stdio_redirect *, bool, const char *, ...);
+
 #endif /* _BCACHEFS_THREAD_WITH_FILE_H */
diff --git a/fs/bcachefs/thread_with_file_types.h b/fs/bcachefs/thread_with_file_types.h
index 90b5e645e98ce..e0daf4eec341e 100644
--- a/fs/bcachefs/thread_with_file_types.h
+++ b/fs/bcachefs/thread_with_file_types.h
@@ -2,14 +2,21 @@
 #ifndef _BCACHEFS_THREAD_WITH_FILE_TYPES_H
 #define _BCACHEFS_THREAD_WITH_FILE_TYPES_H
 
+#include "darray.h"
+
+struct stdio_buf {
+	spinlock_t		lock;
+	wait_queue_head_t	wait;
+	darray_char		buf;
+};
+
 struct stdio_redirect {
-	spinlock_t		output_lock;
-	wait_queue_head_t	output_wait;
-	struct printbuf		output_buf;
+	struct stdio_buf	input;
+	struct stdio_buf	output;
 
 	spinlock_t		input_lock;
 	wait_queue_head_t	input_wait;
-	struct printbuf		input_buf;
+	darray_char		input_buf;
 	bool			done;
 };
 

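Note for readers outside the kernel tree: the bch2_darray_vprintf() helper
added above formats into a growable buffer by calling vsnprintf(), growing
the buffer when the result does not fit, and retrying. Below is a minimal
userspace sketch of that grow-and-retry pattern. struct strbuf,
strbuf_make_room(), strbuf_vprintf() and strbuf_printf() are hypothetical
stand-ins invented for illustration (realloc() replaces the darray
allocator), and the doubling growth policy is an assumption.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for darray_char. */
struct strbuf {
	char	*data;
	size_t	nr;	/* bytes in use, not counting the trailing NUL */
	size_t	size;	/* bytes allocated */
};

/* Ensure at least @more free bytes; the doubling policy is an assumption. */
static int strbuf_make_room(struct strbuf *b, size_t more)
{
	size_t new_size = b->size ? b->size : 16;
	char *p;

	if (b->nr + more <= b->size)
		return 0;
	while (new_size < b->nr + more)
		new_size *= 2;
	p = realloc(b->data, new_size);
	if (!p)
		return -1;
	b->data = p;
	b->size = new_size;
	return 0;
}

static void strbuf_vprintf(struct strbuf *b, const char *fmt, va_list args)
{
	int len;

	for (;;) {
		size_t room = b->size - b->nr;
		va_list args2;

		/* args may be consumed more than once, so format from a copy */
		va_copy(args2, args);
		len = vsnprintf(room ? b->data + b->nr : NULL, room, fmt, args2);
		va_end(args2);

		if (len < 0)
			return;			/* formatting error */
		if ((size_t)len + 1 <= room)
			break;			/* output and NUL both fit */
		if (strbuf_make_room(b, (size_t)len + 1)) {
			/* could not grow: keep the truncated output */
			len = room ? (int)(room - 1) : 0;
			break;
		}
	}
	b->nr += (size_t)len;
}

static void strbuf_printf(struct strbuf *b, const char *fmt, ...)
{
	va_list args;

	va_start(args, fmt);
	strbuf_vprintf(b, fmt, args);
	va_end(args);
}
```

The first vsnprintf() call doubles as a length probe: its return value says
how much room the second attempt needs, so the loop runs at most twice when
the allocation succeeds.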


* [PATCH 03/10] bcachefs: thread_with_stdio: kill thread_with_stdio_done()
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 01/10] bcachefs: thread_with_stdio: eliminate double buffering Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 02/10] bcachefs: thread_with_stdio: convert to darray Darrick J. Wong
@ 2024-02-24  1:14   ` Darrick J. Wong
  2024-02-24  1:14   ` [PATCH 04/10] bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline() Darrick J. Wong
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:14 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Move the cleanup code to a wrapper function, where we can call it after
the thread_with_stdio fn exits.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/chardev.c          |   14 ++++----------
 fs/bcachefs/thread_with_file.c |   20 +++++++++++++++++---
 fs/bcachefs/thread_with_file.h |   11 ++---------
 3 files changed, 23 insertions(+), 22 deletions(-)
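
Illustration only, not part of the patch: a minimal single-threaded
userspace sketch of the wrapper pattern described in the commit message.
The caller's function no longer returns a status or signals completion
itself; an entry wrapper runs it and then performs the shared cleanup.
struct worker, worker_entry() and sample_fn() are hypothetical names, and
the wakeups counter merely stands in for the wake_up() calls.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct thread_with_stdio. */
struct worker {
	void	(*fn)(struct worker *);	/* caller-supplied body */
	bool	done;			/* set by the wrapper, never by fn */
	int	ret;			/* result recorded by fn */
	int	wakeups;		/* stand-in for wake_up() on the waitqueues */
};

/* Entry point handed to the thread: run fn, then do the common cleanup. */
static int worker_entry(void *arg)
{
	struct worker *w = arg;

	w->fn(w);

	w->done = true;
	w->wakeups++;
	return 0;
}

/* Example body: note that it has no completion handling of its own. */
static void sample_fn(struct worker *w)
{
	w->ret = 42;
}
```

Because completion is signalled in exactly one place, a new thread function
cannot forget to do it, which is the failure mode the old
thread_with_stdio_done() helper left open.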


diff --git a/fs/bcachefs/chardev.c b/fs/bcachefs/chardev.c
index 226b39c176673..11711f54057e1 100644
--- a/fs/bcachefs/chardev.c
+++ b/fs/bcachefs/chardev.c
@@ -155,17 +155,14 @@ static void bch2_fsck_thread_exit(struct thread_with_stdio *_thr)
 	kfree(thr);
 }
 
-static int bch2_fsck_offline_thread_fn(void *arg)
+static void bch2_fsck_offline_thread_fn(struct thread_with_stdio *stdio)
 {
-	struct fsck_thread *thr = container_of(arg, struct fsck_thread, thr);
+	struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr);
 	struct bch_fs *c = bch2_fs_open(thr->devs, thr->nr_devs, thr->opts);
 
 	thr->thr.thr.ret = PTR_ERR_OR_ZERO(c);
 	if (!thr->thr.thr.ret)
 		bch2_fs_stop(c);
-
-	thread_with_stdio_done(&thr->thr);
-	return 0;
 }
 
 static long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_arg)
@@ -763,9 +760,9 @@ static long bch2_ioctl_disk_resize_journal(struct bch_fs *c,
 	return ret;
 }
 
-static int bch2_fsck_online_thread_fn(void *arg)
+static void bch2_fsck_online_thread_fn(struct thread_with_stdio *stdio)
 {
-	struct fsck_thread *thr = container_of(arg, struct fsck_thread, thr);
+	struct fsck_thread *thr = container_of(stdio, struct fsck_thread, thr);
 	struct bch_fs *c = thr->c;
 
 	c->stdio_filter = current;
@@ -793,11 +790,8 @@ static int bch2_fsck_online_thread_fn(void *arg)
 	c->stdio_filter = NULL;
 	c->opts.fix_errors = old_fix_errors;
 
-	thread_with_stdio_done(&thr->thr);
-
 	up(&c->online_fsck_mutex);
 	bch2_ro_ref_put(c);
-	return 0;
 }
 
 static long bch2_ioctl_fsck_online(struct bch_fs *c,
diff --git a/fs/bcachefs/thread_with_file.c b/fs/bcachefs/thread_with_file.c
index ca81d3fec3eef..eb8ab4c47a94b 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/fs/bcachefs/thread_with_file.c
@@ -228,15 +228,29 @@ static const struct file_operations thread_with_stdio_fops = {
 	.release	= thread_with_stdio_release,
 };
 
+static int thread_with_stdio_fn(void *arg)
+{
+	struct thread_with_stdio *thr = arg;
+
+	thr->fn(thr);
+
+	thr->thr.done = true;
+	thr->stdio.done = true;
+	wake_up(&thr->stdio.input.wait);
+	wake_up(&thr->stdio.output.wait);
+	return 0;
+}
+
 int bch2_run_thread_with_stdio(struct thread_with_stdio *thr,
 			       void (*exit)(struct thread_with_stdio *),
-			       int (*fn)(void *))
+			       void (*fn)(struct thread_with_stdio *))
 {
 	stdio_buf_init(&thr->stdio.input);
 	stdio_buf_init(&thr->stdio.output);
-	thr->exit = exit;
+	thr->exit	= exit;
+	thr->fn		= fn;
 
-	return bch2_run_thread_with_file(&thr->thr, &thread_with_stdio_fops, fn);
+	return bch2_run_thread_with_file(&thr->thr, &thread_with_stdio_fops, thread_with_stdio_fn);
 }
 
 int bch2_stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t len)
diff --git a/fs/bcachefs/thread_with_file.h b/fs/bcachefs/thread_with_file.h
index 4243c7c5ad3f3..66212fcae226a 100644
--- a/fs/bcachefs/thread_with_file.h
+++ b/fs/bcachefs/thread_with_file.h
@@ -21,19 +21,12 @@ struct thread_with_stdio {
 	struct thread_with_file	thr;
 	struct stdio_redirect	stdio;
 	void			(*exit)(struct thread_with_stdio *);
+	void			(*fn)(struct thread_with_stdio *);
 };
 
-static inline void thread_with_stdio_done(struct thread_with_stdio *thr)
-{
-	thr->thr.done = true;
-	thr->stdio.done = true;
-	wake_up(&thr->stdio.input.wait);
-	wake_up(&thr->stdio.output.wait);
-}
-
 int bch2_run_thread_with_stdio(struct thread_with_stdio *,
 			       void (*exit)(struct thread_with_stdio *),
-			       int (*fn)(void *));
+			       void (*fn)(struct thread_with_stdio *));
 int bch2_stdio_redirect_read(struct stdio_redirect *, char *, size_t);
 int bch2_stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 



* [PATCH 04/10] bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline()
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:14   ` [PATCH 03/10] bcachefs: thread_with_stdio: kill thread_with_stdio_done() Darrick J. Wong
@ 2024-02-24  1:14   ` Darrick J. Wong
  2024-02-24  1:15   ` [PATCH 05/10] bcachefs: Thread with file documentation Darrick J. Wong
                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:14 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

This fixes a bug where we'd return data without waiting for a newline,
if data was present but a newline was not.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/thread_with_file.c |   33 ++++++++++++++++++++++-----------
 1 file changed, 22 insertions(+), 11 deletions(-)
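
Illustration only, not part of the patch: a userspace sketch of the
per-iteration step in the fixed readline. Each pass takes at most len
bytes, stops just after the first newline if one is present, and reports
whether a newline was seen so that the caller knows whether to wait for
more input and try again. struct linebuf and linebuf_take() are
hypothetical names, and locking and waiting are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical fixed-size stand-in for the input buffer. */
struct linebuf {
	char	data[64];
	size_t	nr;		/* bytes currently buffered */
};

/*
 * Move up to @len bytes from @buf to @out, stopping after the first '\n'.
 * Returns the number of bytes moved and sets *@saw_newline accordingly.
 */
static size_t linebuf_take(struct linebuf *buf, char *out, size_t len,
			   bool *saw_newline)
{
	size_t b = len < buf->nr ? len : buf->nr;
	char *n = memchr(buf->data, '\n', b);

	if (n)
		b = (size_t)(n + 1 - buf->data);	/* include the newline */

	memcpy(out, buf->data, b);
	buf->nr -= b;
	memmove(buf->data, buf->data + b, buf->nr);	/* shift the remainder down */

	*saw_newline = n != NULL;
	return b;
}
```

A caller modelled on the patch would keep calling this, advancing out and
shrinking len, until saw_newline is true or len reaches zero.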


diff --git a/fs/bcachefs/thread_with_file.c b/fs/bcachefs/thread_with_file.c
index eb8ab4c47a94b..830efb06ef0be 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/fs/bcachefs/thread_with_file.c
@@ -277,25 +277,36 @@ int bch2_stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t le
 int bch2_stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len)
 {
 	struct stdio_buf *buf = &stdio->input;
-
+	size_t copied = 0;
+	ssize_t ret = 0;
+again:
 	wait_event(buf->wait, stdio_redirect_has_input(stdio));
-	if (stdio->done)
-		return -1;
+	if (stdio->done) {
+		ret = -1;
+		goto out;
+	}
 
 	spin_lock(&buf->lock);
-	int ret = min(len, buf->buf.nr);
-	char *n = memchr(buf->buf.data, '\n', ret);
-	if (!n)
-		ret = min(ret, n + 1 - buf->buf.data);
-	buf->buf.nr -= ret;
-	memcpy(ubuf, buf->buf.data, ret);
+	size_t b = min(len, buf->buf.nr);
+	char *n = memchr(buf->buf.data, '\n', b);
+	if (n)
+		b = min_t(size_t, b, n + 1 - buf->buf.data);
+	buf->buf.nr -= b;
+	memcpy(ubuf, buf->buf.data, b);
 	memmove(buf->buf.data,
-		buf->buf.data + ret,
+		buf->buf.data + b,
 		buf->buf.nr);
+	ubuf += b;
+	len -= b;
+	copied += b;
 	spin_unlock(&buf->lock);
 
 	wake_up(&buf->wait);
-	return ret;
+
+	if (!n && len)
+		goto again;
+out:
+	return copied ?: ret;
 }
 
 __printf(3, 0)



* [PATCH 05/10] bcachefs: Thread with file documentation
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-02-24  1:14   ` [PATCH 04/10] bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline() Darrick J. Wong
@ 2024-02-24  1:15   ` Darrick J. Wong
  2024-02-24  1:15   ` [PATCH 06/10] darray: lift from bcachefs Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:15 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/bcachefs/thread_with_file.c |   15 ++++++++-------
 fs/bcachefs/thread_with_file.h |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 40 insertions(+), 7 deletions(-)
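
Illustration only, not part of the patch: a userspace sketch of the
nonblocking output policy that the new header comment describes. The
4096-byte limit is checked before a message is queued, so a message that
starts under the limit is stored whole even if it pushes the buffer past
it, and a message arriving once the limit is reached is dropped whole.
struct outbuf, outbuf_has_space() and outbuf_put_nonblocking() are
hypothetical names, realloc() stands in for the GFP_NOWAIT allocation, and
dropping the message on allocation failure is a simplification of this
sketch.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SOFT_LIMIT	4096	/* mirrors STDIO_REDIRECT_BUFSIZE */

/* Hypothetical stand-in for the output stdio_buf. */
struct outbuf {
	char	*data;
	size_t	nr;	/* bytes queued */
	size_t	size;	/* bytes allocated */
};

static bool outbuf_has_space(const struct outbuf *b)
{
	return b->nr < SOFT_LIMIT;
}

/* Returns true if @msg was queued in full, false if it was dropped. */
static bool outbuf_put_nonblocking(struct outbuf *b, const char *msg)
{
	size_t len = strlen(msg);

	if (!outbuf_has_space(b))
		return false;		/* at or over the limit: drop it all */
	if (!len)
		return true;

	if (b->nr + len > b->size) {
		char *p = realloc(b->data, b->nr + len);

		if (!p)
			return false;	/* simplification: drop on failure */
		b->data = p;
		b->size = b->nr + len;
	}

	memcpy(b->data + b->nr, msg, len);	/* never a partial message */
	b->nr += len;
	return true;
}
```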


diff --git a/fs/bcachefs/thread_with_file.c b/fs/bcachefs/thread_with_file.c
index 830efb06ef0be..dde9679b68b42 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/fs/bcachefs/thread_with_file.c
@@ -76,16 +76,16 @@ static bool stdio_redirect_has_output(struct stdio_redirect *stdio)
 	return stdio->output.buf.nr || stdio->done;
 }
 
-#define WRITE_BUFFER		4096
+#define STDIO_REDIRECT_BUFSIZE		4096
 
 static bool stdio_redirect_has_input_space(struct stdio_redirect *stdio)
 {
-	return stdio->input.buf.nr < WRITE_BUFFER || stdio->done;
+	return stdio->input.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
 }
 
 static bool stdio_redirect_has_output_space(struct stdio_redirect *stdio)
 {
-	return stdio->output.buf.nr < WRITE_BUFFER || stdio->done;
+	return stdio->output.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
 }
 
 static void stdio_buf_init(struct stdio_buf *buf)
@@ -171,11 +171,12 @@ static ssize_t thread_with_stdio_write(struct file *file, const char __user *ubu
 		}
 
 		spin_lock(&buf->lock);
-		if (buf->buf.nr < WRITE_BUFFER)
-			darray_make_room_gfp(&buf->buf, min(b, WRITE_BUFFER - buf->buf.nr), __GFP_NOWARN);
+		if (buf->buf.nr < STDIO_REDIRECT_BUFSIZE)
+			darray_make_room_gfp(&buf->buf,
+				min(b, STDIO_REDIRECT_BUFSIZE - buf->buf.nr), GFP_NOWAIT);
 		b = min(len, darray_room(buf->buf));
 
-		if (b && !copy_from_user_nofault(&buf->buf.data[buf->buf.nr], ubuf, b)) {
+		if (b && !copy_from_user_nofault(&darray_top(buf->buf), ubuf, b)) {
 			buf->buf.nr += b;
 			ubuf	+= b;
 			len	-= b;
@@ -338,7 +339,7 @@ void bch2_stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
 		return;
 
 	spin_lock_irqsave(&buf->lock, flags);
-	bch2_darray_vprintf(&buf->buf, nonblocking ? __GFP_NOWARN : GFP_KERNEL, fmt, args);
+	bch2_darray_vprintf(&buf->buf, nonblocking ? GFP_NOWAIT : GFP_KERNEL, fmt, args);
 	spin_unlock_irqrestore(&buf->lock, flags);
 
 	wake_up(&buf->wait);
diff --git a/fs/bcachefs/thread_with_file.h b/fs/bcachefs/thread_with_file.h
index 66212fcae226a..f06f8ff19a790 100644
--- a/fs/bcachefs/thread_with_file.h
+++ b/fs/bcachefs/thread_with_file.h
@@ -4,6 +4,38 @@
 
 #include "thread_with_file_types.h"
 
+/*
+ * Thread with file: Run a kthread and connect it to a file descriptor, so that
+ * it can be interacted with via fd read/write methods and closing the file
+ * descriptor stops the kthread.
+ *
+ * We have two different APIs:
+ *
+ * thread_with_file, the low level version.
+ * You get to define the full file_operations, including your release function,
+ * which means that you must call bch2_thread_with_file_exit() from your
+ * .release method
+ *
+ * thread_with_stdio, the higher level version
+ * This implements full piping of input and output, including .poll.
+ *
+ * Notes on behaviour:
+ *  - kthread shutdown behaves like writing or reading from a pipe that has been
+ *    closed
+ *  - Input and output buffers are 4096 bytes, although buffers may in some
+ *    situations slightly exceed that limit so as to avoid chopping off a
+ *    message in the middle in nonblocking mode.
+ *  - Input/output buffers are lazily allocated, with GFP_NOWAIT allocations -
+ *    should be fine but might change in future revisions.
+ *  - Output buffer may grow past 4096 bytes to deal with messages that are
+ *    bigger than 4096 bytes
+ *  - Writing may be done blocking or nonblocking; in nonblocking mode, we only
+ *    drop entire messages.
+ *
+ * To write, use bch2_stdio_redirect_printf()
+ * To read, use bch2_stdio_redirect_read() or bch2_stdio_redirect_readline()
+ */
+
 struct task_struct;
 
 struct thread_with_file {



* [PATCH 06/10] darray: lift from bcachefs
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-02-24  1:15   ` [PATCH 05/10] bcachefs: Thread with file documentation Darrick J. Wong
@ 2024-02-24  1:15   ` Darrick J. Wong
  2024-02-24  1:15   ` [PATCH 07/10] thread_with_file: Lift " Darrick J. Wong
                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:15 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Dynamic arrays, inspired by CCAN's darray; essentially C++ STL vectors.

Used by thread_with_stdio, which is also being lifted from bcachefs for
xfs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                            |    7 ++++
 fs/bcachefs/Makefile                   |    1 -
 fs/bcachefs/btree_types.h              |    2 +
 fs/bcachefs/btree_update.c             |    2 +
 fs/bcachefs/btree_write_buffer_types.h |    2 +
 fs/bcachefs/fsck.c                     |    2 +
 fs/bcachefs/journal_sb.c               |    2 +
 fs/bcachefs/sb-downgrade.c             |    3 +-
 fs/bcachefs/sb-errors_types.h          |    2 +
 fs/bcachefs/sb-members.h               |    2 +
 fs/bcachefs/subvolume.h                |    1 -
 fs/bcachefs/subvolume_types.h          |    2 +
 fs/bcachefs/thread_with_file_types.h   |    2 +
 fs/bcachefs/util.h                     |   29 +---------------
 include/linux/darray.h                 |   59 +++++++++++++++++++++-----------
 include/linux/darray_types.h           |   22 ++++++++++++
 lib/Makefile                           |    2 +
 lib/darray.c                           |   12 +++++--
 18 files changed, 93 insertions(+), 61 deletions(-)
 rename fs/bcachefs/darray.h => include/linux/darray.h (66%)
 create mode 100644 include/linux/darray_types.h
 rename fs/bcachefs/darray.c => lib/darray.c (56%)
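
Illustration only, not part of the patch: a userspace sketch of the
vector-like behaviour the commit message refers to. The DARRAY() macro
here is simplified (it omits the preallocated[] member), and
darray_int_push() with its doubling growth policy is a hypothetical helper
built on realloc(); array_remove_items() follows the macro that this patch
moves into darray.h.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified: the kernel macro also carries a preallocated[] member. */
#define DARRAY(_type)							\
struct {								\
	size_t nr, size;						\
	_type *data;							\
}

typedef DARRAY(int) darray_int;

/* Hypothetical push helper; doubling the allocation is an assumption. */
static int darray_int_push(darray_int *d, int v)
{
	if (d->nr == d->size) {
		size_t new_size = d->size ? d->size * 2 : 8;
		int *p = realloc(d->data, new_size * sizeof(*p));

		if (!p)
			return -1;
		d->data = p;
		d->size = new_size;
	}

	d->data[d->nr++] = v;
	return 0;
}

/* Close the gap left by removing _nr_to_remove items starting at _pos. */
#define array_remove_items(_array, _nr, _pos, _nr_to_remove)		\
do {									\
	(_nr) -= (_nr_to_remove);					\
	memmove(&(_array)[(_pos)],					\
		&(_array)[(_pos) + (_nr_to_remove)],			\
		sizeof((_array)[0]) * ((_nr) - (_pos)));		\
} while (0)
```

Pushing 1 through 5 and then removing two items at index 1 leaves the
array holding 1, 4, 5 with nr equal to 3.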


diff --git a/MAINTAINERS b/MAINTAINERS
index aa762fe654e3e..97905e0d57a52 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5810,6 +5810,13 @@ F:	net/ax25/ax25_out.c
 F:	net/ax25/ax25_timer.c
 F:	net/ax25/sysctl_net_ax25.c
 
+DARRAY
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+L:	linux-bcachefs@vger.kernel.org
+S:	Maintained
+F:	include/linux/darray.h
+F:	include/linux/darray_types.h
+
 DATA ACCESS MONITOR
 M:	SeongJae Park <sj@kernel.org>
 L:	damon@lists.linux.dev
diff --git a/fs/bcachefs/Makefile b/fs/bcachefs/Makefile
index b11ba74b8ad41..bb17d146b0900 100644
--- a/fs/bcachefs/Makefile
+++ b/fs/bcachefs/Makefile
@@ -27,7 +27,6 @@ bcachefs-y		:=	\
 	checksum.o		\
 	clock.o			\
 	compress.o		\
-	darray.o		\
 	debug.o			\
 	dirent.o		\
 	disk_groups.o		\
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index 4a5a64499eb76..0d5eecbd3e9cf 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -2,12 +2,12 @@
 #ifndef _BCACHEFS_BTREE_TYPES_H
 #define _BCACHEFS_BTREE_TYPES_H
 
+#include <linux/darray_types.h>
 #include <linux/list.h>
 #include <linux/rhashtable.h>
 
 #include "btree_key_cache_types.h"
 #include "buckets_types.h"
-#include "darray.h"
 #include "errcode.h"
 #include "journal_types.h"
 #include "replicas_types.h"
diff --git a/fs/bcachefs/btree_update.c b/fs/bcachefs/btree_update.c
index c3ff365acce9a..e5193116b092f 100644
--- a/fs/bcachefs/btree_update.c
+++ b/fs/bcachefs/btree_update.c
@@ -14,6 +14,8 @@
 #include "snapshot.h"
 #include "trace.h"
 
+#include <linux/darray.h>
+
 static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
 					 const struct btree_insert_entry *r)
 {
diff --git a/fs/bcachefs/btree_write_buffer_types.h b/fs/bcachefs/btree_write_buffer_types.h
index 9b9433de9c368..5f248873087c3 100644
--- a/fs/bcachefs/btree_write_buffer_types.h
+++ b/fs/bcachefs/btree_write_buffer_types.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
 #define _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
 
-#include "darray.h"
+#include <linux/darray_types.h>
 #include "journal_types.h"
 
 #define BTREE_WRITE_BUFERED_VAL_U64s_MAX	4
diff --git a/fs/bcachefs/fsck.c b/fs/bcachefs/fsck.c
index 6a760777bafb0..04d3d9957a203 100644
--- a/fs/bcachefs/fsck.c
+++ b/fs/bcachefs/fsck.c
@@ -5,7 +5,6 @@
 #include "btree_cache.h"
 #include "btree_update.h"
 #include "buckets.h"
-#include "darray.h"
 #include "dirent.h"
 #include "error.h"
 #include "fs-common.h"
@@ -18,6 +17,7 @@
 #include "xattr.h"
 
 #include <linux/bsearch.h>
+#include <linux/darray.h>
 #include <linux/dcache.h> /* struct qstr */
 
 /*
diff --git a/fs/bcachefs/journal_sb.c b/fs/bcachefs/journal_sb.c
index ae4fb8c3a2bc2..156691c203bef 100644
--- a/fs/bcachefs/journal_sb.c
+++ b/fs/bcachefs/journal_sb.c
@@ -2,8 +2,8 @@
 
 #include "bcachefs.h"
 #include "journal_sb.h"
-#include "darray.h"
 
+#include <linux/darray.h>
 #include <linux/sort.h>
 
 /* BCH_SB_FIELD_journal: */
diff --git a/fs/bcachefs/sb-downgrade.c b/fs/bcachefs/sb-downgrade.c
index 441dcb1bf160e..626eaaea5b01d 100644
--- a/fs/bcachefs/sb-downgrade.c
+++ b/fs/bcachefs/sb-downgrade.c
@@ -6,12 +6,13 @@
  */
 
 #include "bcachefs.h"
-#include "darray.h"
 #include "recovery.h"
 #include "sb-downgrade.h"
 #include "sb-errors.h"
 #include "super-io.h"
 
+#include <linux/darray.h>
+
 #define RECOVERY_PASS_ALL_FSCK		BIT_ULL(63)
 
 /*
diff --git a/fs/bcachefs/sb-errors_types.h b/fs/bcachefs/sb-errors_types.h
index c08aacdfd073c..9a3a74ca0806b 100644
--- a/fs/bcachefs/sb-errors_types.h
+++ b/fs/bcachefs/sb-errors_types.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_SB_ERRORS_TYPES_H
 #define _BCACHEFS_SB_ERRORS_TYPES_H
 
-#include "darray.h"
+#include <linux/darray_types.h>
 
 #define BCH_SB_ERRS()							\
 	x(clean_but_journal_not_empty,				0)	\
diff --git a/fs/bcachefs/sb-members.h b/fs/bcachefs/sb-members.h
index be0a941832715..e4d4d842229a6 100644
--- a/fs/bcachefs/sb-members.h
+++ b/fs/bcachefs/sb-members.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_SB_MEMBERS_H
 #define _BCACHEFS_SB_MEMBERS_H
 
-#include "darray.h"
+#include <linux/darray.h>
 
 extern char * const bch2_member_error_strs[];
 
diff --git a/fs/bcachefs/subvolume.h b/fs/bcachefs/subvolume.h
index a6f56f66e27cb..3ca1d183369c5 100644
--- a/fs/bcachefs/subvolume.h
+++ b/fs/bcachefs/subvolume.h
@@ -2,7 +2,6 @@
 #ifndef _BCACHEFS_SUBVOLUME_H
 #define _BCACHEFS_SUBVOLUME_H
 
-#include "darray.h"
 #include "subvolume_types.h"
 
 enum bkey_invalid_flags;
diff --git a/fs/bcachefs/subvolume_types.h b/fs/bcachefs/subvolume_types.h
index ae644adfc3916..40f16e3a6dd04 100644
--- a/fs/bcachefs/subvolume_types.h
+++ b/fs/bcachefs/subvolume_types.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_SUBVOLUME_TYPES_H
 #define _BCACHEFS_SUBVOLUME_TYPES_H
 
-#include "darray.h"
+#include <linux/darray_types.h>
 
 typedef DARRAY(u32) snapshot_id_list;
 
diff --git a/fs/bcachefs/thread_with_file_types.h b/fs/bcachefs/thread_with_file_types.h
index e0daf4eec341e..41990756aa261 100644
--- a/fs/bcachefs/thread_with_file_types.h
+++ b/fs/bcachefs/thread_with_file_types.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_THREAD_WITH_FILE_TYPES_H
 #define _BCACHEFS_THREAD_WITH_FILE_TYPES_H
 
-#include "darray.h"
+#include <linux/darray_types.h>
 
 struct stdio_buf {
 	spinlock_t		lock;
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index cf8d16a911622..b354307903057 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -5,23 +5,22 @@
 #include <linux/bio.h>
 #include <linux/blkdev.h>
 #include <linux/closure.h>
+#include <linux/darray.h>
 #include <linux/errno.h>
 #include <linux/freezer.h>
 #include <linux/kernel.h>
-#include <linux/sched/clock.h>
 #include <linux/llist.h>
 #include <linux/log2.h>
 #include <linux/percpu.h>
 #include <linux/preempt.h>
 #include <linux/ratelimit.h>
+#include <linux/sched/clock.h>
 #include <linux/slab.h>
 #include <linux/time_stats.h>
 #include <linux/vmalloc.h>
 #include <linux/workqueue.h>
 #include <linux/mean_and_variance.h>
 
-#include "darray.h"
-
 struct closure;
 
 #ifdef CONFIG_BCACHEFS_DEBUG
@@ -662,30 +661,6 @@ static inline void memset_u64s_tail(void *s, int c, unsigned bytes)
 	memset(s + bytes, c, rem);
 }
 
-/* just the memmove, doesn't update @_nr */
-#define __array_insert_item(_array, _nr, _pos)				\
-	memmove(&(_array)[(_pos) + 1],					\
-		&(_array)[(_pos)],					\
-		sizeof((_array)[0]) * ((_nr) - (_pos)))
-
-#define array_insert_item(_array, _nr, _pos, _new_item)			\
-do {									\
-	__array_insert_item(_array, _nr, _pos);				\
-	(_nr)++;							\
-	(_array)[(_pos)] = (_new_item);					\
-} while (0)
-
-#define array_remove_items(_array, _nr, _pos, _nr_to_remove)		\
-do {									\
-	(_nr) -= (_nr_to_remove);					\
-	memmove(&(_array)[(_pos)],					\
-		&(_array)[(_pos) + (_nr_to_remove)],			\
-		sizeof((_array)[0]) * ((_nr) - (_pos)));		\
-} while (0)
-
-#define array_remove_item(_array, _nr, _pos)				\
-	array_remove_items(_array, _nr, _pos, 1)
-
 static inline void __move_gap(void *array, size_t element_size,
 			      size_t nr, size_t size,
 			      size_t old_gap, size_t new_gap)
diff --git a/fs/bcachefs/darray.h b/include/linux/darray.h
similarity index 66%
rename from fs/bcachefs/darray.h
rename to include/linux/darray.h
index 4b340d13caace..ff167eb795f22 100644
--- a/fs/bcachefs/darray.h
+++ b/include/linux/darray.h
@@ -1,34 +1,26 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_DARRAY_H
-#define _BCACHEFS_DARRAY_H
+/*
+ * (C) 2022-2024 Kent Overstreet <kent.overstreet@linux.dev>
+ */
+#ifndef _LINUX_DARRAY_H
+#define _LINUX_DARRAY_H
 
 /*
- * Dynamic arrays:
+ * Dynamic arrays
  *
  * Inspired by CCAN's darray
  */
 
+#include <linux/darray_types.h>
 #include <linux/slab.h>
 
-#define DARRAY_PREALLOCATED(_type, _nr)					\
-struct {								\
-	size_t nr, size;						\
-	_type *data;							\
-	_type preallocated[_nr];					\
-}
-
-#define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
-
-typedef DARRAY(char)	darray_char;
-typedef DARRAY(char *) darray_str;
-
-int __bch2_darray_resize(darray_char *, size_t, size_t, gfp_t);
+int __darray_resize_slowpath(darray_char *, size_t, size_t, gfp_t);
 
 static inline int __darray_resize(darray_char *d, size_t element_size,
 				  size_t new_size, gfp_t gfp)
 {
 	return unlikely(new_size > d->size)
-		? __bch2_darray_resize(d, element_size, new_size, gfp)
+		? __darray_resize_slowpath(d, element_size, new_size, gfp)
 		: 0;
 }
 
@@ -69,6 +61,28 @@ static inline int __darray_make_room(darray_char *d, size_t t_size, size_t more,
 #define darray_first(_d)	((_d).data[0])
 #define darray_last(_d)		((_d).data[(_d).nr - 1])
 
+/* Insert/remove items into the middle of a darray: */
+
+#define array_insert_item(_array, _nr, _pos, _new_item)			\
+do {									\
+	memmove(&(_array)[(_pos) + 1],					\
+		&(_array)[(_pos)],					\
+		sizeof((_array)[0]) * ((_nr) - (_pos)));		\
+	(_nr)++;							\
+	(_array)[(_pos)] = (_new_item);					\
+} while (0)
+
+#define array_remove_items(_array, _nr, _pos, _nr_to_remove)		\
+do {									\
+	(_nr) -= (_nr_to_remove);					\
+	memmove(&(_array)[(_pos)],					\
+		&(_array)[(_pos) + (_nr_to_remove)],			\
+		sizeof((_array)[0]) * ((_nr) - (_pos)));		\
+} while (0)
+
+#define array_remove_item(_array, _nr, _pos)				\
+	array_remove_items(_array, _nr, _pos, 1)
+
 #define darray_insert_item(_d, pos, _item)				\
 ({									\
 	size_t _pos = (pos);						\
@@ -79,10 +93,15 @@ static inline int __darray_make_room(darray_char *d, size_t t_size, size_t more,
 	_ret;								\
 })
 
+#define darray_remove_items(_d, _pos, _nr_to_remove)			\
+	array_remove_items((_d)->data, (_d)->nr, (_pos) - (_d)->data, _nr_to_remove)
+
 #define darray_remove_item(_d, _pos)					\
-	array_remove_item((_d)->data, (_d)->nr, (_pos) - (_d)->data)
+	darray_remove_items(_d, _pos, 1)
 
-#define __darray_for_each(_d, _i)						\
+/* Iteration: */
+
+#define __darray_for_each(_d, _i)					\
 	for ((_i) = (_d).data; _i < (_d).data + (_d).nr; _i++)
 
 #define darray_for_each(_d, _i)						\
@@ -106,4 +125,4 @@ do {									\
 	darray_init(_d);						\
 } while (0)
 
-#endif /* _BCACHEFS_DARRAY_H */
+#endif /* _LINUX_DARRAY_H */
diff --git a/include/linux/darray_types.h b/include/linux/darray_types.h
new file mode 100644
index 0000000000000..a400a0c3600d8
--- /dev/null
+++ b/include/linux/darray_types.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * (C) 2022-2024 Kent Overstreet <kent.overstreet@linux.dev>
+ */
+#ifndef _LINUX_DARRAY_TYPES_H
+#define _LINUX_DARRAY_TYPES_H
+
+#include <linux/types.h>
+
+#define DARRAY_PREALLOCATED(_type, _nr)					\
+struct {								\
+	size_t nr, size;						\
+	_type *data;							\
+	_type preallocated[_nr];					\
+}
+
+#define DARRAY(_type) DARRAY_PREALLOCATED(_type, 0)
+
+typedef DARRAY(char)	darray_char;
+typedef DARRAY(char *)	darray_str;
+
+#endif /* _LINUX_DARRAY_TYPES_H */
diff --git a/lib/Makefile b/lib/Makefile
index 57858997c87aa..830907bb8fc85 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -48,7 +48,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
 	 bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
 	 percpu-refcount.o rhashtable.o base64.o \
 	 once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
-	 generic-radix-tree.o bitmap-str.o
+	 generic-radix-tree.o bitmap-str.o darray.o
 obj-$(CONFIG_STRING_SELFTEST) += test_string.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/fs/bcachefs/darray.c b/lib/darray.c
similarity index 56%
rename from fs/bcachefs/darray.c
rename to lib/darray.c
index ac35b8b705ae1..7cb064f14b391 100644
--- a/fs/bcachefs/darray.c
+++ b/lib/darray.c
@@ -1,10 +1,14 @@
 // SPDX-License-Identifier: GPL-2.0
+/*
+ * (C) 2022-2024 Kent Overstreet <kent.overstreet@linux.dev>
+ */
 
+#include <linux/darray.h>
 #include <linux/log2.h>
+#include <linux/module.h>
 #include <linux/slab.h>
-#include "darray.h"
 
-int __bch2_darray_resize(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
+int __darray_resize_slowpath(darray_char *d, size_t element_size, size_t new_size, gfp_t gfp)
 {
 	if (new_size > d->size) {
 		new_size = roundup_pow_of_two(new_size);
@@ -22,3 +26,7 @@ int __bch2_darray_resize(darray_char *d, size_t element_size, size_t new_size, g
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(__darray_resize_slowpath);
+
+MODULE_AUTHOR("Kent Overstreet");
+MODULE_LICENSE("GPL");



* [PATCH 07/10] thread_with_file: Lift from bcachefs
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-02-24  1:15   ` [PATCH 06/10] darray: lift from bcachefs Darrick J. Wong
@ 2024-02-24  1:15   ` Darrick J. Wong
  2024-02-24  1:15   ` [PATCH 08/10] thread_with_stdio: Mark completed in ->release() Darrick J. Wong
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:15 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

thread_with_file and thread_with_stdio are abstractions for connecting
kthreads to file descriptors, which is handy for all sorts of things -
the running kthread has its lifetime connected to the file descriptor,
which means an asynchronous job running in the kernel can easily exit in
response to a ctrl-c, and the file descriptor also provides a
communications channel.
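
As a rough userspace analogy of the lifetime coupling described above
(not part of this patch and not the kernel API): a forked child stands
in for the kthread and a pipe stands in for the file descriptor. All
names below are hypothetical. The worker keeps reporting progress until
the consumer drops its end of the descriptor, at which point the worker
notices and exits.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Userspace analogy only (hypothetical names): the forked child plays the
 * kthread, the pipe plays the file descriptor handed to userspace.
 */
static void worker(int wfd)
{
	static const char msg[] = "progress\n";

	/* Report progress until the consumer goes away. */
	for (;;) {
		if (write(wfd, msg, sizeof(msg) - 1) < 0)
			_exit(errno == EPIPE ? 0 : 1);
	}
}

/* Returns 1 if the worker exited because the descriptor was closed. */
int worker_exits_when_fd_closed(void)
{
	int fds[2], status;
	char buf[64];
	ssize_t n;
	pid_t pid;

	if (pipe(fds))
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		signal(SIGPIPE, SIG_IGN);	/* get EPIPE instead of a signal */
		close(fds[0]);			/* child must not hold a read end */
		worker(fds[1]);
	}
	close(fds[1]);

	/* Consume some output, then drop the descriptor ("ctrl-c"). */
	n = read(fds[0], buf, sizeof(buf));
	assert(n > 0);
	close(fds[0]);

	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

In the kernel version the roles are reversed in one respect: the
->release() method actively stops the kthread rather than waiting for
the thread to notice a failed write.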

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 MAINTAINERS                            |    9 +
 fs/bcachefs/Kconfig                    |    1 
 fs/bcachefs/Makefile                   |    1 
 fs/bcachefs/bcachefs.h                 |    2 
 fs/bcachefs/chardev.c                  |   10 -
 fs/bcachefs/error.c                    |    4 
 fs/bcachefs/super.c                    |    4 
 include/linux/thread_with_file.h       |   35 ++-
 include/linux/thread_with_file_types.h |    8 -
 lib/Kconfig                            |    3 
 lib/Makefile                           |    1 
 lib/thread_with_file.c                 |  326 ++++++++++++++++----------------
 12 files changed, 212 insertions(+), 192 deletions(-)
 rename fs/bcachefs/thread_with_file.h => include/linux/thread_with_file.h (63%)
 rename fs/bcachefs/thread_with_file_types.h => include/linux/thread_with_file_types.h (64%)
 rename fs/bcachefs/thread_with_file.c => lib/thread_with_file.c (79%)


diff --git a/MAINTAINERS b/MAINTAINERS
index 97905e0d57a52..5799134b24737 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21888,6 +21888,15 @@ F:	Documentation/userspace-api/media/drivers/thp7312.rst
 F:	drivers/media/i2c/thp7312.c
 F:	include/uapi/linux/thp7312.h
 
+THREAD WITH FILE
+M:	Kent Overstreet <kent.overstreet@linux.dev>
+M:	Darrick J. Wong <djwong@kernel.org>
+L:	linux-bcachefs@vger.kernel.org
+S:	Maintained
+F:	include/linux/thread_with_file.h
+F:	include/linux/thread_with_file_types.h
+F:	lib/thread_with_file.c
+
 THUNDERBOLT DMA TRAFFIC TEST DRIVER
 M:	Isaac Hazan <isaac.hazan@intel.com>
 L:	linux-usb@vger.kernel.org
diff --git a/fs/bcachefs/Kconfig b/fs/bcachefs/Kconfig
index 8c587ddd2f85e..08073d76e5a42 100644
--- a/fs/bcachefs/Kconfig
+++ b/fs/bcachefs/Kconfig
@@ -25,6 +25,7 @@ config BCACHEFS_FS
 	select SRCU
 	select SYMBOLIC_ERRNAME
 	select TIME_STATS
+	select THREAD_WITH_FILE
 	help
 	The bcachefs filesystem - a modern, copy on write filesystem, with
 	support for multiple devices, compression, checksumming, etc.
diff --git a/fs/bcachefs/Makefile b/fs/bcachefs/Makefile
index bb17d146b0900..d335b6572d72d 100644
--- a/fs/bcachefs/Makefile
+++ b/fs/bcachefs/Makefile
@@ -80,7 +80,6 @@ bcachefs-y		:=	\
 	super-io.o		\
 	sysfs.o			\
 	tests.o			\
-	thread_with_file.o	\
 	trace.o			\
 	two_state_shared_lock.o	\
 	util.o			\
diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 04e4a65909a4f..5f801256e8740 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -200,6 +200,7 @@
 #include <linux/seqlock.h>
 #include <linux/shrinker.h>
 #include <linux/srcu.h>
+#include <linux/thread_with_file_types.h>
 #include <linux/time_stats.h>
 #include <linux/types.h>
 #include <linux/workqueue.h>
@@ -466,7 +467,6 @@ enum bch_time_stats {
 #include "replicas_types.h"
 #include "subvolume_types.h"
 #include "super_types.h"
-#include "thread_with_file_types.h"
 
 /* Number of nodes btree coalesce will try to coalesce at once */
 #define GC_MERGE_NODES		4U
diff --git a/fs/bcachefs/chardev.c b/fs/bcachefs/chardev.c
index 11711f54057e1..4cbda66bb6e0f 100644
--- a/fs/bcachefs/chardev.c
+++ b/fs/bcachefs/chardev.c
@@ -11,7 +11,6 @@
 #include "replicas.h"
 #include "super.h"
 #include "super-io.h"
-#include "thread_with_file.h"
 
 #include <linux/cdev.h>
 #include <linux/device.h>
@@ -20,6 +19,7 @@
 #include <linux/major.h>
 #include <linux/sched/task.h>
 #include <linux/slab.h>
+#include <linux/thread_with_file.h>
 #include <linux/uaccess.h>
 
 __must_check
@@ -217,7 +217,7 @@ static long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_a
 
 	opt_set(thr->opts, stdio, (u64)(unsigned long)&thr->thr.stdio);
 
-	ret = bch2_run_thread_with_stdio(&thr->thr,
+	ret = run_thread_with_stdio(&thr->thr,
 			bch2_fsck_thread_exit,
 			bch2_fsck_offline_thread_fn);
 err:
@@ -422,7 +422,7 @@ static int bch2_data_job_release(struct inode *inode, struct file *file)
 {
 	struct bch_data_ctx *ctx = container_of(file->private_data, struct bch_data_ctx, thr);
 
-	bch2_thread_with_file_exit(&ctx->thr);
+	thread_with_file_exit(&ctx->thr);
 	kfree(ctx);
 	return 0;
 }
@@ -472,7 +472,7 @@ static long bch2_ioctl_data(struct bch_fs *c,
 	ctx->c = c;
 	ctx->arg = arg;
 
-	ret = bch2_run_thread_with_file(&ctx->thr,
+	ret = run_thread_with_file(&ctx->thr,
 			&bcachefs_data_ops,
 			bch2_data_thread);
 	if (ret < 0)
@@ -834,7 +834,7 @@ static long bch2_ioctl_fsck_online(struct bch_fs *c,
 			goto err;
 	}
 
-	ret = bch2_run_thread_with_stdio(&thr->thr,
+	ret = run_thread_with_stdio(&thr->thr,
 			bch2_fsck_thread_exit,
 			bch2_fsck_online_thread_fn);
 err:
diff --git a/fs/bcachefs/error.c b/fs/bcachefs/error.c
index d32c8bebe46c3..70a1253959740 100644
--- a/fs/bcachefs/error.c
+++ b/fs/bcachefs/error.c
@@ -2,7 +2,7 @@
 #include "bcachefs.h"
 #include "error.h"
 #include "super.h"
-#include "thread_with_file.h"
+#include <linux/thread_with_file.h>
 
 #define FSCK_ERR_RATELIMIT_NR	10
 
@@ -105,7 +105,7 @@ static enum ask_yn bch2_fsck_ask_yn(struct bch_fs *c)
 	do {
 		bch2_print(c, " (y,n, or Y,N for all errors of this type) ");
 
-		int r = bch2_stdio_redirect_readline(stdio, buf, sizeof(buf) - 1);
+		int r = stdio_redirect_readline(stdio, buf, sizeof(buf) - 1);
 		if (r < 0)
 			return YN_NO;
 		buf[r] = '\0';
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 0cff8c5f3c104..38a87c8fc8235 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -56,7 +56,6 @@
 #include "super.h"
 #include "super-io.h"
 #include "sysfs.h"
-#include "thread_with_file.h"
 #include "trace.h"
 
 #include <linux/backing-dev.h>
@@ -68,6 +67,7 @@
 #include <linux/percpu.h>
 #include <linux/random.h>
 #include <linux/sysfs.h>
+#include <linux/thread_with_file.h>
 #include <crypto/hash.h>
 
 MODULE_LICENSE("GPL");
@@ -99,7 +99,7 @@ void __bch2_print(struct bch_fs *c, const char *fmt, ...)
 		if (fmt[0] == KERN_SOH[0])
 			fmt += 2;
 
-		bch2_stdio_redirect_vprintf(stdio, true, fmt, args);
+		stdio_redirect_vprintf(stdio, true, fmt, args);
 	}
 	va_end(args);
 }
diff --git a/fs/bcachefs/thread_with_file.h b/include/linux/thread_with_file.h
similarity index 63%
rename from fs/bcachefs/thread_with_file.h
rename to include/linux/thread_with_file.h
index f06f8ff19a790..54091f7ff3383 100644
--- a/fs/bcachefs/thread_with_file.h
+++ b/include/linux/thread_with_file.h
@@ -1,8 +1,11 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_THREAD_WITH_FILE_H
-#define _BCACHEFS_THREAD_WITH_FILE_H
+/*
+ * (C) 2022-2024 Kent Overstreet <kent.overstreet@linux.dev>
+ */
+#ifndef _LINUX_THREAD_WITH_FILE_H
+#define _LINUX_THREAD_WITH_FILE_H
 
-#include "thread_with_file_types.h"
+#include <linux/thread_with_file_types.h>
 
 /*
  * Thread with file: Run a kthread and connect it to a file descriptor, so that
@@ -13,7 +16,7 @@
  *
  * thread_with_file, the low level version.
  * You get to define the full file_operations, including your release function,
- * which means that you must call bch2_thread_with_file_exit() from your
+ * which means that you must call thread_with_file_exit() from your
  * .release method
  *
  * thread_with_stdio, the higher level version
@@ -44,10 +47,10 @@ struct thread_with_file {
 	bool			done;
 };
 
-void bch2_thread_with_file_exit(struct thread_with_file *);
-int bch2_run_thread_with_file(struct thread_with_file *,
-			      const struct file_operations *,
-			      int (*fn)(void *));
+void thread_with_file_exit(struct thread_with_file *);
+int run_thread_with_file(struct thread_with_file *,
+			 const struct file_operations *,
+			 int (*fn)(void *));
 
 struct thread_with_stdio {
 	struct thread_with_file	thr;
@@ -56,13 +59,13 @@ struct thread_with_stdio {
 	void			(*fn)(struct thread_with_stdio *);
 };
 
-int bch2_run_thread_with_stdio(struct thread_with_stdio *,
-			       void (*exit)(struct thread_with_stdio *),
-			       void (*fn)(struct thread_with_stdio *));
-int bch2_stdio_redirect_read(struct stdio_redirect *, char *, size_t);
-int bch2_stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
+int run_thread_with_stdio(struct thread_with_stdio *,
+			  void (*exit)(struct thread_with_stdio *),
+			  void (*fn)(struct thread_with_stdio *));
+int stdio_redirect_read(struct stdio_redirect *, char *, size_t);
+int stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 
-__printf(3, 0) void bch2_stdio_redirect_vprintf(struct stdio_redirect *, bool, const char *, va_list);
-__printf(3, 4) void bch2_stdio_redirect_printf(struct stdio_redirect *, bool, const char *, ...);
+__printf(3, 0) void stdio_redirect_vprintf(struct stdio_redirect *, bool, const char *, va_list);
+__printf(3, 4) void stdio_redirect_printf(struct stdio_redirect *, bool, const char *, ...);
 
-#endif /* _BCACHEFS_THREAD_WITH_FILE_H */
+#endif /* _LINUX_THREAD_WITH_FILE_H */
diff --git a/fs/bcachefs/thread_with_file_types.h b/include/linux/thread_with_file_types.h
similarity index 64%
rename from fs/bcachefs/thread_with_file_types.h
rename to include/linux/thread_with_file_types.h
index 41990756aa261..98d0ad1253221 100644
--- a/fs/bcachefs/thread_with_file_types.h
+++ b/include/linux/thread_with_file_types.h
@@ -1,8 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _BCACHEFS_THREAD_WITH_FILE_TYPES_H
-#define _BCACHEFS_THREAD_WITH_FILE_TYPES_H
+#ifndef _LINUX_THREAD_WITH_FILE_TYPES_H
+#define _LINUX_THREAD_WITH_FILE_TYPES_H
 
 #include <linux/darray_types.h>
+#include <linux/spinlock_types.h>
+#include <linux/wait.h>
 
 struct stdio_buf {
 	spinlock_t		lock;
@@ -20,4 +22,4 @@ struct stdio_redirect {
 	bool			done;
 };
 
-#endif /* _BCACHEFS_THREAD_WITH_FILE_TYPES_H */
+#endif /* _LINUX_THREAD_WITH_FILE_TYPES_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 3ba8b965f8c7e..9258d04e939db 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -789,3 +789,6 @@ config FIRMWARE_TABLE
 config TIME_STATS
 	tristate
 	select MEAN_AND_VARIANCE
+
+config THREAD_WITH_FILE
+	tristate
diff --git a/lib/Makefile b/lib/Makefile
index 830907bb8fc85..e77304f69df03 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -371,6 +371,7 @@ obj-$(CONFIG_SBITMAP) += sbitmap.o
 obj-$(CONFIG_PARMAN) += parman.o
 
 obj-$(CONFIG_TIME_STATS) += time_stats.o
+obj-$(CONFIG_THREAD_WITH_FILE) += thread_with_file.o
 
 obj-y += group_cpus.o
 
diff --git a/fs/bcachefs/thread_with_file.c b/lib/thread_with_file.c
similarity index 79%
rename from fs/bcachefs/thread_with_file.c
rename to lib/thread_with_file.c
index dde9679b68b42..092996ca43fe7 100644
--- a/fs/bcachefs/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -1,26 +1,160 @@
 // SPDX-License-Identifier: GPL-2.0
-#ifndef NO_BCACHEFS_FS
-
-#include "bcachefs.h"
-#include "thread_with_file.h"
-
+/*
+ * (C) 2022-2024 Kent Overstreet <kent.overstreet@linux.dev>
+ */
 #include <linux/anon_inodes.h>
+#include <linux/darray.h>
 #include <linux/file.h>
 #include <linux/kthread.h>
+#include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/poll.h>
+#include <linux/thread_with_file.h>
 
-void bch2_thread_with_file_exit(struct thread_with_file *thr)
+/* stdio_redirect */
+
+#define STDIO_REDIRECT_BUFSIZE		4096
+
+static bool stdio_redirect_has_input(struct stdio_redirect *stdio)
+{
+	return stdio->input.buf.nr || stdio->done;
+}
+
+static bool stdio_redirect_has_output(struct stdio_redirect *stdio)
+{
+	return stdio->output.buf.nr || stdio->done;
+}
+
+static bool stdio_redirect_has_input_space(struct stdio_redirect *stdio)
+{
+	return stdio->input.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
+}
+
+static bool stdio_redirect_has_output_space(struct stdio_redirect *stdio)
+{
+	return stdio->output.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
+}
+
+static void stdio_buf_init(struct stdio_buf *buf)
+{
+	spin_lock_init(&buf->lock);
+	init_waitqueue_head(&buf->wait);
+	darray_init(&buf->buf);
+}
+
+int stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t len)
+{
+	struct stdio_buf *buf = &stdio->input;
+
+	wait_event(buf->wait, stdio_redirect_has_input(stdio));
+	if (stdio->done)
+		return -1;
+
+	spin_lock(&buf->lock);
+	int ret = min(len, buf->buf.nr);
+	memcpy(ubuf, buf->buf.data, ret);
+	darray_remove_items(&buf->buf, buf->buf.data, ret);
+	spin_unlock(&buf->lock);
+
+	wake_up(&buf->wait);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(stdio_redirect_read);
+
+int stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len)
+{
+	struct stdio_buf *buf = &stdio->input;
+	size_t copied = 0;
+	ssize_t ret = 0;
+again:
+	wait_event(buf->wait, stdio_redirect_has_input(stdio));
+	if (stdio->done) {
+		ret = -1;
+		goto out;
+	}
+
+	spin_lock(&buf->lock);
+	size_t b = min(len, buf->buf.nr);
+	char *n = memchr(buf->buf.data, '\n', b);
+	if (n)
+		b = min_t(size_t, b, n + 1 - buf->buf.data);
+	memcpy(ubuf, buf->buf.data, b);
+	darray_remove_items(&buf->buf, buf->buf.data, b);
+	ubuf += b;
+	len -= b;
+	copied += b;
+	spin_unlock(&buf->lock);
+
+	wake_up(&buf->wait);
+
+	if (!n && len)
+		goto again;
+out:
+	return copied ?: ret;
+}
+EXPORT_SYMBOL_GPL(stdio_redirect_readline);
+
+__printf(3, 0)
+static void darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_list args)
+{
+	size_t len;
+
+	do {
+		va_list args2;
+		va_copy(args2, args);
+
+		len = vsnprintf(out->data + out->nr, darray_room(*out), fmt, args2);
+	} while (len + 1 > darray_room(*out) && !darray_make_room_gfp(out, len + 1, gfp));
+
+	out->nr += min(len, darray_room(*out));
+}
+
+void stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
+			    const char *fmt, va_list args)
+{
+	struct stdio_buf *buf = &stdio->output;
+	unsigned long flags;
+
+	if (!nonblocking)
+		wait_event(buf->wait, stdio_redirect_has_output_space(stdio));
+	else if (!stdio_redirect_has_output_space(stdio))
+		return;
+	if (stdio->done)
+		return;
+
+	spin_lock_irqsave(&buf->lock, flags);
+	darray_vprintf(&buf->buf, nonblocking ? GFP_NOWAIT : GFP_KERNEL, fmt, args);
+	spin_unlock_irqrestore(&buf->lock, flags);
+
+	wake_up(&buf->wait);
+}
+EXPORT_SYMBOL_GPL(stdio_redirect_vprintf);
+
+void stdio_redirect_printf(struct stdio_redirect *stdio, bool nonblocking,
+			   const char *fmt, ...)
+{
+
+	va_list args;
+	va_start(args, fmt);
+	stdio_redirect_vprintf(stdio, nonblocking, fmt, args);
+	va_end(args);
+}
+EXPORT_SYMBOL_GPL(stdio_redirect_printf);
+
+/* thread with file: */
+
+void thread_with_file_exit(struct thread_with_file *thr)
 {
 	if (thr->task) {
 		kthread_stop(thr->task);
 		put_task_struct(thr->task);
 	}
 }
+EXPORT_SYMBOL_GPL(thread_with_file_exit);
 
-int bch2_run_thread_with_file(struct thread_with_file *thr,
-			      const struct file_operations *fops,
-			      int (*fn)(void *))
+int run_thread_with_file(struct thread_with_file *thr,
+			 const struct file_operations *fops,
+			 int (*fn)(void *))
 {
 	struct file *file = NULL;
 	int ret, fd = -1;
@@ -63,37 +197,7 @@ int bch2_run_thread_with_file(struct thread_with_file *thr,
 		kthread_stop(thr->task);
 	return ret;
 }
-
-/* stdio_redirect */
-
-static bool stdio_redirect_has_input(struct stdio_redirect *stdio)
-{
-	return stdio->input.buf.nr || stdio->done;
-}
-
-static bool stdio_redirect_has_output(struct stdio_redirect *stdio)
-{
-	return stdio->output.buf.nr || stdio->done;
-}
-
-#define STDIO_REDIRECT_BUFSIZE		4096
-
-static bool stdio_redirect_has_input_space(struct stdio_redirect *stdio)
-{
-	return stdio->input.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
-}
-
-static bool stdio_redirect_has_output_space(struct stdio_redirect *stdio)
-{
-	return stdio->output.buf.nr < STDIO_REDIRECT_BUFSIZE || stdio->done;
-}
-
-static void stdio_buf_init(struct stdio_buf *buf)
-{
-	spin_lock_init(&buf->lock);
-	init_waitqueue_head(&buf->wait);
-	darray_init(&buf->buf);
-}
+EXPORT_SYMBOL_GPL(run_thread_with_file);
 
 /* thread_with_stdio */
 
@@ -126,10 +230,7 @@ static ssize_t thread_with_stdio_read(struct file *file, char __user *ubuf,
 			ubuf	+= b;
 			len	-= b;
 			copied	+= b;
-			buf->buf.nr -= b;
-			memmove(buf->buf.data,
-				buf->buf.data + b,
-				buf->buf.nr);
+			darray_remove_items(&buf->buf, buf->buf.data, b);
 		}
 		spin_unlock_irq(&buf->lock);
 	}
@@ -137,18 +238,6 @@ static ssize_t thread_with_stdio_read(struct file *file, char __user *ubuf,
 	return copied ?: ret;
 }
 
-static int thread_with_stdio_release(struct inode *inode, struct file *file)
-{
-	struct thread_with_stdio *thr =
-		container_of(file->private_data, struct thread_with_stdio, thr);
-
-	bch2_thread_with_file_exit(&thr->thr);
-	darray_exit(&thr->stdio.input.buf);
-	darray_exit(&thr->stdio.output.buf);
-	thr->exit(thr);
-	return 0;
-}
-
 static ssize_t thread_with_stdio_write(struct file *file, const char __user *ubuf,
 				       size_t len, loff_t *ppos)
 {
@@ -221,6 +310,18 @@ static __poll_t thread_with_stdio_poll(struct file *file, struct poll_table_stru
 	return mask;
 }
 
+static int thread_with_stdio_release(struct inode *inode, struct file *file)
+{
+	struct thread_with_stdio *thr =
+		container_of(file->private_data, struct thread_with_stdio, thr);
+
+	thread_with_file_exit(&thr->thr);
+	darray_exit(&thr->stdio.input.buf);
+	darray_exit(&thr->stdio.output.buf);
+	thr->exit(thr);
+	return 0;
+}
+
 static const struct file_operations thread_with_stdio_fops = {
 	.llseek		= no_llseek,
 	.read		= thread_with_stdio_read,
@@ -242,117 +343,18 @@ static int thread_with_stdio_fn(void *arg)
 	return 0;
 }
 
-int bch2_run_thread_with_stdio(struct thread_with_stdio *thr,
-			       void (*exit)(struct thread_with_stdio *),
-			       void (*fn)(struct thread_with_stdio *))
+int run_thread_with_stdio(struct thread_with_stdio *thr,
+			  void (*exit)(struct thread_with_stdio *),
+			  void (*fn)(struct thread_with_stdio *))
 {
 	stdio_buf_init(&thr->stdio.input);
 	stdio_buf_init(&thr->stdio.output);
 	thr->exit	= exit;
 	thr->fn		= fn;
 
-	return bch2_run_thread_with_file(&thr->thr, &thread_with_stdio_fops, thread_with_stdio_fn);
+	return run_thread_with_file(&thr->thr, &thread_with_stdio_fops, thread_with_stdio_fn);
 }
+EXPORT_SYMBOL_GPL(run_thread_with_stdio);
 
-int bch2_stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t len)
-{
-	struct stdio_buf *buf = &stdio->input;
-
-	wait_event(buf->wait, stdio_redirect_has_input(stdio));
-	if (stdio->done)
-		return -1;
-
-	spin_lock(&buf->lock);
-	int ret = min(len, buf->buf.nr);
-	buf->buf.nr -= ret;
-	memcpy(ubuf, buf->buf.data, ret);
-	memmove(buf->buf.data,
-		buf->buf.data + ret,
-		buf->buf.nr);
-	spin_unlock(&buf->lock);
-
-	wake_up(&buf->wait);
-	return ret;
-}
-
-int bch2_stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len)
-{
-	struct stdio_buf *buf = &stdio->input;
-	size_t copied = 0;
-	ssize_t ret = 0;
-again:
-	wait_event(buf->wait, stdio_redirect_has_input(stdio));
-	if (stdio->done) {
-		ret = -1;
-		goto out;
-	}
-
-	spin_lock(&buf->lock);
-	size_t b = min(len, buf->buf.nr);
-	char *n = memchr(buf->buf.data, '\n', b);
-	if (n)
-		b = min_t(size_t, b, n + 1 - buf->buf.data);
-	buf->buf.nr -= b;
-	memcpy(ubuf, buf->buf.data, b);
-	memmove(buf->buf.data,
-		buf->buf.data + b,
-		buf->buf.nr);
-	ubuf += b;
-	len -= b;
-	copied += b;
-	spin_unlock(&buf->lock);
-
-	wake_up(&buf->wait);
-
-	if (!n && len)
-		goto again;
-out:
-	return copied ?: ret;
-}
-
-__printf(3, 0)
-static void bch2_darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_list args)
-{
-	size_t len;
-
-	do {
-		va_list args2;
-		va_copy(args2, args);
-
-		len = vsnprintf(out->data + out->nr, darray_room(*out), fmt, args2);
-	} while (len + 1 > darray_room(*out) && !darray_make_room_gfp(out, len + 1, gfp));
-
-	out->nr += min(len, darray_room(*out));
-}
-
-void bch2_stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
-				 const char *fmt, va_list args)
-{
-	struct stdio_buf *buf = &stdio->output;
-	unsigned long flags;
-
-	if (!nonblocking)
-		wait_event(buf->wait, stdio_redirect_has_output_space(stdio));
-	else if (!stdio_redirect_has_output_space(stdio))
-		return;
-	if (stdio->done)
-		return;
-
-	spin_lock_irqsave(&buf->lock, flags);
-	bch2_darray_vprintf(&buf->buf, nonblocking ? GFP_NOWAIT : GFP_KERNEL, fmt, args);
-	spin_unlock_irqrestore(&buf->lock, flags);
-
-	wake_up(&buf->wait);
-}
-
-void bch2_stdio_redirect_printf(struct stdio_redirect *stdio, bool nonblocking,
-				const char *fmt, ...)
-{
-
-	va_list args;
-	va_start(args, fmt);
-	bch2_stdio_redirect_vprintf(stdio, nonblocking, fmt, args);
-	va_end(args);
-}
-
-#endif /* NO_BCACHEFS_FS */
+MODULE_AUTHOR("Kent Overstreet");
+MODULE_LICENSE("GPL");



* [PATCH 08/10] thread_with_stdio: Mark completed in ->release()
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-02-24  1:15   ` [PATCH 07/10] thread_with_file: Lift " Darrick J. Wong
@ 2024-02-24  1:15   ` Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 09/10] kernel/hung_task.c: export sysctl_hung_task_timeout_secs Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 10/10] thread_with_stdio: suppress hung task warning Darrick J. Wong
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:15 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

This fixes stdio_redirect_read() getting stuck, not noticing that the
pipe has been closed.
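
A single-threaded model of the wake condition may make the failure mode
easier to see. This is only a sketch with hypothetical names, not the
kernel code: the reader sleeps until "input queued or done" holds, so
if the descriptor is closed while the input buffer is empty and nothing
sets done, the condition never becomes true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct stdio_redirect. */
struct redirect_model {
	size_t	input_nr;	/* bytes queued for the kernel-side reader */
	bool	done;		/* set when either side is finished */
};

/* Same shape as stdio_redirect_has_input(): the reader's wake condition. */
static bool model_has_input(const struct redirect_model *r)
{
	return r->input_nr || r->done;
}

/*
 * What the thread function does on return and, with this patch, what
 * ->release() does as well.  The real code also wakes both wait queues.
 */
void model_mark_done(struct redirect_model *r)
{
	assert(r);
	r->done = true;
}

/* Returns 1 if a reader would still be asleep, 0 if it would proceed. */
int model_reader_blocked(const struct redirect_model *r)
{
	assert(r);
	return !model_has_input(r);
}
```

With an empty buffer, model_reader_blocked() reports 1 until
model_mark_done() runs, which is why marking completion only when the
thread function returns is not enough for a thread that is itself
blocked in the read.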

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 lib/thread_with_file.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)


diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index 092996ca43fe7..f4946a437332a 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -201,6 +201,14 @@ EXPORT_SYMBOL_GPL(run_thread_with_file);
 
 /* thread_with_stdio */
 
+static void thread_with_stdio_done(struct thread_with_stdio *thr)
+{
+	thr->thr.done = true;
+	thr->stdio.done = true;
+	wake_up(&thr->stdio.input.wait);
+	wake_up(&thr->stdio.output.wait);
+}
+
 static ssize_t thread_with_stdio_read(struct file *file, char __user *ubuf,
 				      size_t len, loff_t *ppos)
 {
@@ -315,6 +323,7 @@ static int thread_with_stdio_release(struct inode *inode, struct file *file)
 	struct thread_with_stdio *thr =
 		container_of(file->private_data, struct thread_with_stdio, thr);
 
+	thread_with_stdio_done(thr);
 	thread_with_file_exit(&thr->thr);
 	darray_exit(&thr->stdio.input.buf);
 	darray_exit(&thr->stdio.output.buf);
@@ -336,10 +345,7 @@ static int thread_with_stdio_fn(void *arg)
 
 	thr->fn(thr);
 
-	thr->thr.done = true;
-	thr->stdio.done = true;
-	wake_up(&thr->stdio.input.wait);
-	wake_up(&thr->stdio.output.wait);
+	thread_with_stdio_done(thr);
 	return 0;
 }
 



* [PATCH 09/10] kernel/hung_task.c: export sysctl_hung_task_timeout_secs
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (7 preceding siblings ...)
  2024-02-24  1:15   ` [PATCH 08/10] thread_with_stdio: Mark completed in ->release() Darrick J. Wong
@ 2024-02-24  1:16   ` Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 10/10] thread_with_stdio: suppress hung task warning Darrick J. Wong
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:16 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: fuyuanli, linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

thread_with_file needs sysctl_hung_task_timeout_secs.  It is also rare,
but not unheard of, for module code to need it when blocking on user
input.

One workaround used by some code is wait_event_interruptible(), but
that can be buggy if the outer context isn't expecting unwinding.
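
For illustration, the alternative to an interruptible wait is to sleep
in bounded slices and re-check the condition after each one, so that no
single sleep lasts long enough to trip the hung task detector. The
userspace sketch below shows that loop shape with poll(); the function
name and both tuning parameters are hypothetical, and max_slices exists
only so the example terminates.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Wait until fd is readable (or hung up), sleeping at most slice_ms at a
 * time.  max_slices bounds the loop for demonstration; pass a negative
 * value to wait indefinitely.  Both parameters are hypothetical.
 *
 * Returns the number of slices that timed out before the fd became ready,
 * or -1 if max_slices ran out or poll() failed.
 */
int wait_readable_in_slices(int fd, int slice_ms, int max_slices)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int timed_out = 0;

	assert(slice_ms > 0);

	while (max_slices < 0 || timed_out < max_slices) {
		int r = poll(&pfd, 1, slice_ms);

		if (r < 0)
			return -1;
		if (r > 0)
			return timed_out;	/* readable, error, or hangup */
		timed_out++;			/* woke up only to re-check */
	}
	return -1;
}
```

A kernel caller would pick the slice from the exported timeout (for
example half of it) instead of a fixed number of milliseconds.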

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: fuyuanli <fuyuanli@didiglobal.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 kernel/hung_task.c |    1 +
 1 file changed, 1 insertion(+)


diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 9a24574988d23..b2fc2727d6544 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -43,6 +43,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
  * Zero means infinite timeout - no checking done:
  */
 unsigned long __read_mostly sysctl_hung_task_timeout_secs = CONFIG_DEFAULT_HUNG_TASK_TIMEOUT;
+EXPORT_SYMBOL_GPL(sysctl_hung_task_timeout_secs);
 
 /*
  * Zero (default value) means use sysctl_hung_task_timeout_secs:



* [PATCH 10/10] thread_with_stdio: suppress hung task warning
  2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
                     ` (8 preceding siblings ...)
  2024-02-24  1:16   ` [PATCH 09/10] kernel/hung_task.c: export sysctl_hung_task_timeout_secs Darrick J. Wong
@ 2024-02-24  1:16   ` Darrick J. Wong
  9 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:16 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 lib/thread_with_file.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)


diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index f4946a437332a..b09dc60ba6280 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -9,6 +9,7 @@
 #include <linux/module.h>
 #include <linux/pagemap.h>
 #include <linux/poll.h>
+#include <linux/sched/sysctl.h>
 #include <linux/thread_with_file.h>
 
 /* stdio_redirect */
@@ -46,7 +47,15 @@ int stdio_redirect_read(struct stdio_redirect *stdio, char *ubuf, size_t len)
 {
 	struct stdio_buf *buf = &stdio->input;
 
-	wait_event(buf->wait, stdio_redirect_has_input(stdio));
+	/*
+	 * we're waiting on user input (or for the file descriptor to be
+	 * closed), don't want a hung task warning:
+	 */
+	do {
+		wait_event_timeout(buf->wait, stdio_redirect_has_input(stdio),
+				   sysctl_hung_task_timeout_secs * HZ / 2);
+	} while (!stdio_redirect_has_input(stdio));
+
 	if (stdio->done)
 		return -1;
 
@@ -67,7 +76,11 @@ int stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len
 	size_t copied = 0;
 	ssize_t ret = 0;
 again:
-	wait_event(buf->wait, stdio_redirect_has_input(stdio));
+	do {
+		wait_event_timeout(buf->wait, stdio_redirect_has_input(stdio),
+				   sysctl_hung_task_timeout_secs * HZ / 2);
+	} while (!stdio_redirect_has_input(stdio));
+
 	if (stdio->done) {
 		ret = -1;
 		goto out;



* [PATCH 1/5] thread_with_file: allow creation of readonly files
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
@ 2024-02-24  1:16   ` Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 2/5] thread_with_file: fix various printf problems Darrick J. Wong
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:16 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Create a new run_thread_with_stdout function that opens a file in
O_RDONLY mode so that the kernel can write things to userspace but
userspace cannot write to the kernel.  This will be used to convey xfs
health event information to userspace.
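
For illustration only (this is not part of the patch): a userspace consumer
of such a read-only file would presumably poll() for POLLIN, read whatever
is available, and stop once the kernel side reports POLLHUP.  The helper
name drain_events() and its buffer-based interface are assumptions made for
this sketch.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace helper: copy bytes from a read-only event fd into
 * out[] until the buffer is full or the producer hangs up.  Returns the
 * number of bytes collected, or a negative errno.
 */
ssize_t drain_events(int fd, char *out, size_t cap)
{
	size_t total = 0;

	while (total < cap) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };
		ssize_t n;

		if (poll(&pfd, 1, -1) < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}

		/* Consume pending data first, even if a hangup is also set. */
		if (pfd.revents & POLLIN) {
			n = read(fd, out + total, cap - total);
			if (n < 0) {
				if (errno == EINTR)
					continue;
				return -errno;
			}
			if (n == 0)
				break;		/* end of stream */
			total += (size_t)n;
			continue;
		}

		/* Nothing left to read and the other side is gone. */
		if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
			break;
	}

	return (ssize_t)total;
}
```

Reading before checking for hangup matters because the poll handler in this
patch can report EPOLLIN and EPOLLHUP in the same mask.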

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/thread_with_file.h |    3 +++
 lib/thread_with_file.c           |   36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 39 insertions(+)


diff --git a/include/linux/thread_with_file.h b/include/linux/thread_with_file.h
index 54091f7ff3383..5f7e85bc8322b 100644
--- a/include/linux/thread_with_file.h
+++ b/include/linux/thread_with_file.h
@@ -62,6 +62,9 @@ struct thread_with_stdio {
 int run_thread_with_stdio(struct thread_with_stdio *,
 			  void (*exit)(struct thread_with_stdio *),
 			  void (*fn)(struct thread_with_stdio *));
+int run_thread_with_stdout(struct thread_with_stdio *,
+			  void (*exit)(struct thread_with_stdio *),
+			  void (*fn)(struct thread_with_stdio *));
 int stdio_redirect_read(struct stdio_redirect *, char *, size_t);
 int stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 
diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index b09dc60ba6280..71028611b8d59 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -344,6 +344,22 @@ static int thread_with_stdio_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static __poll_t thread_with_stdout_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct thread_with_stdio *thr =
+		container_of(file->private_data, struct thread_with_stdio, thr);
+
+	poll_wait(file, &thr->stdio.output.wait, wait);
+
+	__poll_t mask = 0;
+
+	if (stdio_redirect_has_output(&thr->stdio))
+		mask |= EPOLLIN;
+	if (thr->thr.done)
+		mask |= EPOLLHUP|EPOLLERR;
+	return mask;
+}
+
 static const struct file_operations thread_with_stdio_fops = {
 	.llseek		= no_llseek,
 	.read		= thread_with_stdio_read,
@@ -352,6 +368,13 @@ static const struct file_operations thread_with_stdio_fops = {
 	.release	= thread_with_stdio_release,
 };
 
+static const struct file_operations thread_with_stdout_fops = {
+	.llseek		= no_llseek,
+	.read		= thread_with_stdio_read,
+	.poll		= thread_with_stdout_poll,
+	.release	= thread_with_stdio_release,
+};
+
 static int thread_with_stdio_fn(void *arg)
 {
 	struct thread_with_stdio *thr = arg;
@@ -375,5 +398,18 @@ int run_thread_with_stdio(struct thread_with_stdio *thr,
 }
 EXPORT_SYMBOL_GPL(run_thread_with_stdio);
 
+int run_thread_with_stdout(struct thread_with_stdio *thr,
+			  void (*exit)(struct thread_with_stdio *),
+			  void (*fn)(struct thread_with_stdio *))
+{
+	stdio_buf_init(&thr->stdio.input);
+	stdio_buf_init(&thr->stdio.output);
+	thr->exit	= exit;
+	thr->fn		= fn;
+
+	return run_thread_with_file(&thr->thr, &thread_with_stdout_fops, thread_with_stdio_fn);
+}
+EXPORT_SYMBOL_GPL(run_thread_with_stdout);
+
 MODULE_AUTHOR("Kent Overstreet");
 MODULE_LICENSE("GPL");



* [PATCH 2/5] thread_with_file: fix various printf problems
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 1/5] thread_with_file: allow creation of readonly files Darrick J. Wong
@ 2024-02-24  1:16   ` Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 3/5] thread_with_file: create ops structure for thread_with_stdio Darrick J. Wong
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:16 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Fix some problems with stdio_redirect_vprintf.  We can't do a GFP_KERNEL
allocation while holding the spinlock, and I don't like how the printf
function can silently truncate the output if memory allocation fails.
Always allocate with GFP_NOWAIT under the lock; if that fails in blocking
mode, wait for output space and retry.  Return the number of bytes printed
or a negative errno so that callers can detect failures.
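
For illustration only (not part of the patch): the "format, and if it did
not fit, grow and retry" loop can be modelled in userspace roughly as
follows.  struct strbuf and the strbuf_*() helpers are hypothetical
stand-ins for darray_char and darray_make_room_gfp(); the sketch already
includes the va_end() that a later patch in this series adds.

```c
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical growable buffer standing in for darray_char. */
struct strbuf {
	char	*data;
	size_t	nr;	/* bytes used, not counting the trailing NUL */
	size_t	size;	/* bytes allocated */
};

/* Ensure at least @more free bytes; returns 0 or -ENOMEM. */
int strbuf_make_room(struct strbuf *b, size_t more)
{
	size_t want = b->nr + more;
	size_t new_size = b->size ? b->size : 16;
	char *p;

	if (want <= b->size)
		return 0;
	while (new_size < want)
		new_size *= 2;
	p = realloc(b->data, new_size);
	if (!p)
		return -ENOMEM;
	b->data = p;
	b->size = new_size;
	return 0;
}

/* Append formatted text; returns bytes added or a negative errno. */
ssize_t strbuf_vprintf(struct strbuf *b, const char *fmt, va_list args)
{
	int ret;

	do {
		size_t room = b->size - b->nr;
		va_list args2;
		int len;

		/* args may be consumed more than once, so work on a copy. */
		va_copy(args2, args);
		len = vsnprintf(b->data ? b->data + b->nr : NULL, room,
				fmt, args2);
		va_end(args2);

		if (len < 0)
			return -EINVAL;
		if ((size_t)len + 1 <= room) {
			b->nr += (size_t)len;
			return len;
		}

		/* Did not fit: grow the buffer and format again. */
		ret = strbuf_make_room(b, (size_t)len + 1);
	} while (ret == 0);

	/* Allocation failed: report it instead of truncating. */
	return ret;
}

ssize_t strbuf_printf(struct strbuf *b, const char *fmt, ...)
{
	va_list args;
	ssize_t ret;

	va_start(args, fmt);
	ret = strbuf_vprintf(b, fmt, args);
	va_end(args);
	return ret;
}
```

A caller starts from a zeroed struct strbuf and checks each return value for
a negative errno; nothing is committed to the buffer length on failure.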

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/thread_with_file.h |    4 +--
 lib/thread_with_file.c           |   55 ++++++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 20 deletions(-)


diff --git a/include/linux/thread_with_file.h b/include/linux/thread_with_file.h
index 5f7e85bc8322b..7b133a15d3540 100644
--- a/include/linux/thread_with_file.h
+++ b/include/linux/thread_with_file.h
@@ -68,7 +68,7 @@ int run_thread_with_stdout(struct thread_with_stdio *,
 int stdio_redirect_read(struct stdio_redirect *, char *, size_t);
 int stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 
-__printf(3, 0) void stdio_redirect_vprintf(struct stdio_redirect *, bool, const char *, va_list);
-__printf(3, 4) void stdio_redirect_printf(struct stdio_redirect *, bool, const char *, ...);
+__printf(3, 0) ssize_t stdio_redirect_vprintf(struct stdio_redirect *, bool, const char *, va_list);
+__printf(3, 4) ssize_t stdio_redirect_printf(struct stdio_redirect *, bool, const char *, ...);
 
 #endif /* _LINUX_THREAD_WITH_FILE_H */
diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index 71028611b8d59..70a805ef017f9 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -108,49 +108,68 @@ int stdio_redirect_readline(struct stdio_redirect *stdio, char *ubuf, size_t len
 EXPORT_SYMBOL_GPL(stdio_redirect_readline);
 
 __printf(3, 0)
-static void darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_list args)
+static ssize_t darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_list args)
 {
-	size_t len;
+	ssize_t ret;
 
 	do {
 		va_list args2;
+		size_t len;
+
 		va_copy(args2, args);
-
 		len = vsnprintf(out->data + out->nr, darray_room(*out), fmt, args2);
-	} while (len + 1 > darray_room(*out) && !darray_make_room_gfp(out, len + 1, gfp));
+		if (len + 1 <= darray_room(*out)) {
+			out->nr += len;
+			return len;
+		}
 
-	out->nr += min(len, darray_room(*out));
+		ret = darray_make_room_gfp(out, len + 1, gfp);
+	} while (ret == 0);
+
+	return ret;
 }
 
-void stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
-			    const char *fmt, va_list args)
+ssize_t stdio_redirect_vprintf(struct stdio_redirect *stdio, bool nonblocking,
+			       const char *fmt, va_list args)
 {
 	struct stdio_buf *buf = &stdio->output;
 	unsigned long flags;
+	ssize_t ret;
 
-	if (!nonblocking)
-		wait_event(buf->wait, stdio_redirect_has_output_space(stdio));
-	else if (!stdio_redirect_has_output_space(stdio))
-		return;
-	if (stdio->done)
-		return;
-
+again:
 	spin_lock_irqsave(&buf->lock, flags);
-	darray_vprintf(&buf->buf, nonblocking ? GFP_NOWAIT : GFP_KERNEL, fmt, args);
+	ret = darray_vprintf(&buf->buf, GFP_NOWAIT, fmt, args);
 	spin_unlock_irqrestore(&buf->lock, flags);
 
+	if (ret < 0) {
+		if (nonblocking)
+			return -EAGAIN;
+
+		ret = wait_event_interruptible(buf->wait,
+				stdio_redirect_has_output_space(stdio));
+		if (ret)
+			return ret;
+		goto again;
+	}
+
 	wake_up(&buf->wait);
+	return ret;
+
 }
 EXPORT_SYMBOL_GPL(stdio_redirect_vprintf);
 
-void stdio_redirect_printf(struct stdio_redirect *stdio, bool nonblocking,
-			   const char *fmt, ...)
+ssize_t stdio_redirect_printf(struct stdio_redirect *stdio, bool nonblocking,
+			      const char *fmt, ...)
 {
 
 	va_list args;
+	ssize_t ret;
+
 	va_start(args, fmt);
-	stdio_redirect_vprintf(stdio, nonblocking, fmt, args);
+	ret = stdio_redirect_vprintf(stdio, nonblocking, fmt, args);
 	va_end(args);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(stdio_redirect_printf);
 



* [PATCH 3/5] thread_with_file: create ops structure for thread_with_stdio
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 1/5] thread_with_file: allow creation of readonly files Darrick J. Wong
  2024-02-24  1:16   ` [PATCH 2/5] thread_with_file: fix various printf problems Darrick J. Wong
@ 2024-02-24  1:17   ` Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 4/5] thread_with_file: allow ioctls against these files Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 5/5] thread_with_file: Fix missing va_end() Darrick J. Wong
  4 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:17 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Create an ops structure so we can add more file-based functionality in
the next few patches.
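
For illustration only (not part of the patch): the shape of the change is
the usual "const ops table instead of per-object function pointers"
pattern.  A simplified userspace model follows; struct worker, run_worker()
and struct fsck_job are hypothetical names, and the model calls exit
directly after fn, whereas the real code calls exit from the file release
path.

```c
#include <stddef.h>

struct worker;

/* One shared, read-only table of callbacks per kind of worker. */
struct worker_ops {
	void (*exit)(struct worker *);
	void (*fn)(struct worker *);
};

/* Each object carries a single pointer to its table. */
struct worker {
	const struct worker_ops	*ops;
};

/* Simplified: run the body, then the teardown. */
void run_worker(struct worker *w, const struct worker_ops *ops)
{
	w->ops = ops;
	w->ops->fn(w);
	w->ops->exit(w);
}

/* Example user embedding the generic object in its own structure. */
struct fsck_job {
	struct worker	w;
	int		ran;
	int		exited;
};

#define job_of(ptr) \
	((struct fsck_job *)((char *)(ptr) - offsetof(struct fsck_job, w)))

void fsck_job_fn(struct worker *w)
{
	job_of(w)->ran++;
}

void fsck_job_exit(struct worker *w)
{
	job_of(w)->exited++;
}

const struct worker_ops fsck_job_ops = {
	.exit	= fsck_job_exit,
	.fn	= fsck_job_fn,
};
```

Adding a new callback later only touches the ops structure and the users
that care about it; every run_*() signature stays the same.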

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/chardev.c            |   18 ++++++++++++------
 include/linux/thread_with_file.h |   16 ++++++++++------
 lib/thread_with_file.c           |   16 ++++++----------
 3 files changed, 28 insertions(+), 22 deletions(-)


diff --git a/fs/bcachefs/chardev.c b/fs/bcachefs/chardev.c
index 4cbda66bb6e0f..a2f30f45f93f7 100644
--- a/fs/bcachefs/chardev.c
+++ b/fs/bcachefs/chardev.c
@@ -165,6 +165,11 @@ static void bch2_fsck_offline_thread_fn(struct thread_with_stdio *stdio)
 		bch2_fs_stop(c);
 }
 
+static const struct thread_with_stdio_ops bch2_offline_fsck_ops = {
+	.exit		= bch2_fsck_thread_exit,
+	.fn		= bch2_fsck_offline_thread_fn,
+};
+
 static long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_arg)
 {
 	struct bch_ioctl_fsck_offline arg;
@@ -217,9 +222,7 @@ static long bch2_ioctl_fsck_offline(struct bch_ioctl_fsck_offline __user *user_a
 
 	opt_set(thr->opts, stdio, (u64)(unsigned long)&thr->thr.stdio);
 
-	ret = run_thread_with_stdio(&thr->thr,
-			bch2_fsck_thread_exit,
-			bch2_fsck_offline_thread_fn);
+	ret = run_thread_with_stdio(&thr->thr, &bch2_offline_fsck_ops);
 err:
 	if (ret < 0) {
 		if (thr)
@@ -794,6 +797,11 @@ static void bch2_fsck_online_thread_fn(struct thread_with_stdio *stdio)
 	bch2_ro_ref_put(c);
 }
 
+static const struct thread_with_stdio_ops bch2_online_fsck_ops = {
+	.exit		= bch2_fsck_thread_exit,
+	.fn		= bch2_fsck_online_thread_fn,
+};
+
 static long bch2_ioctl_fsck_online(struct bch_fs *c,
 				   struct bch_ioctl_fsck_online arg)
 {
@@ -834,9 +842,7 @@ static long bch2_ioctl_fsck_online(struct bch_fs *c,
 			goto err;
 	}
 
-	ret = run_thread_with_stdio(&thr->thr,
-			bch2_fsck_thread_exit,
-			bch2_fsck_online_thread_fn);
+	ret = run_thread_with_stdio(&thr->thr, &bch2_online_fsck_ops);
 err:
 	if (ret < 0) {
 		bch_err_fn(c, ret);
diff --git a/include/linux/thread_with_file.h b/include/linux/thread_with_file.h
index 7b133a15d3540..445b1b12a0bd6 100644
--- a/include/linux/thread_with_file.h
+++ b/include/linux/thread_with_file.h
@@ -52,19 +52,23 @@ int run_thread_with_file(struct thread_with_file *,
 			 const struct file_operations *,
 			 int (*fn)(void *));
 
+struct thread_with_stdio;
+
+struct thread_with_stdio_ops {
+	void (*exit)(struct thread_with_stdio *);
+	void (*fn)(struct thread_with_stdio *);
+};
+
 struct thread_with_stdio {
 	struct thread_with_file	thr;
 	struct stdio_redirect	stdio;
-	void			(*exit)(struct thread_with_stdio *);
-	void			(*fn)(struct thread_with_stdio *);
+	const struct thread_with_stdio_ops	*ops;
 };
 
 int run_thread_with_stdio(struct thread_with_stdio *,
-			  void (*exit)(struct thread_with_stdio *),
-			  void (*fn)(struct thread_with_stdio *));
+			  const struct thread_with_stdio_ops *);
 int run_thread_with_stdout(struct thread_with_stdio *,
-			  void (*exit)(struct thread_with_stdio *),
-			  void (*fn)(struct thread_with_stdio *));
+			  const struct thread_with_stdio_ops *);
 int stdio_redirect_read(struct stdio_redirect *, char *, size_t);
 int stdio_redirect_readline(struct stdio_redirect *, char *, size_t);
 
diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index 70a805ef017f9..2edf33c3e7dc5 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -359,7 +359,7 @@ static int thread_with_stdio_release(struct inode *inode, struct file *file)
 	thread_with_file_exit(&thr->thr);
 	darray_exit(&thr->stdio.input.buf);
 	darray_exit(&thr->stdio.output.buf);
-	thr->exit(thr);
+	thr->ops->exit(thr);
 	return 0;
 }
 
@@ -398,33 +398,29 @@ static int thread_with_stdio_fn(void *arg)
 {
 	struct thread_with_stdio *thr = arg;
 
-	thr->fn(thr);
+	thr->ops->fn(thr);
 
 	thread_with_stdio_done(thr);
 	return 0;
 }
 
 int run_thread_with_stdio(struct thread_with_stdio *thr,
-			  void (*exit)(struct thread_with_stdio *),
-			  void (*fn)(struct thread_with_stdio *))
+			  const struct thread_with_stdio_ops *ops)
 {
 	stdio_buf_init(&thr->stdio.input);
 	stdio_buf_init(&thr->stdio.output);
-	thr->exit	= exit;
-	thr->fn		= fn;
+	thr->ops = ops;
 
 	return run_thread_with_file(&thr->thr, &thread_with_stdio_fops, thread_with_stdio_fn);
 }
 EXPORT_SYMBOL_GPL(run_thread_with_stdio);
 
 int run_thread_with_stdout(struct thread_with_stdio *thr,
-			  void (*exit)(struct thread_with_stdio *),
-			  void (*fn)(struct thread_with_stdio *))
+			   const struct thread_with_stdio_ops *ops)
 {
 	stdio_buf_init(&thr->stdio.input);
 	stdio_buf_init(&thr->stdio.output);
-	thr->exit	= exit;
-	thr->fn		= fn;
+	thr->ops = ops;
 
 	return run_thread_with_file(&thr->thr, &thread_with_stdout_fops, thread_with_stdio_fn);
 }



* [PATCH 4/5] thread_with_file: allow ioctls against these files
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:17   ` [PATCH 3/5] thread_with_file: create ops structure for thread_with_stdio Darrick J. Wong
@ 2024-02-24  1:17   ` Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 5/5] thread_with_file: Fix missing va_end() Darrick J. Wong
  4 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:17 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Darrick J. Wong <djwong@kernel.org>

Make it so that a thread_with_stdio user can handle ioctls against the
file descriptor.
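
For illustration only (not part of the patch): the dispatch rule is "call
the user's handler if one was supplied, otherwise fail with -ENOTTY".  A
simplified userspace model follows; struct chan, chan_ioctl() and
DEMO_CMD_SET are hypothetical names for this sketch.

```c
#include <errno.h>
#include <stddef.h>

struct chan;

struct chan_ops {
	/* Optional: may be left NULL by users that take no ioctls. */
	long (*unlocked_ioctl)(struct chan *, unsigned int, unsigned long);
};

struct chan {
	const struct chan_ops	*ops;
	unsigned long		last_arg;
};

/* Generic entry point: forward to the user's handler if there is one. */
long chan_ioctl(struct chan *c, unsigned int cmd, unsigned long arg)
{
	if (c->ops->unlocked_ioctl)
		return c->ops->unlocked_ioctl(c, cmd, arg);
	return -ENOTTY;
}

#define DEMO_CMD_SET	1U	/* hypothetical command number */

/* Example handler: understands one command, rejects everything else. */
long demo_ioctl(struct chan *c, unsigned int cmd, unsigned long arg)
{
	switch (cmd) {
	case DEMO_CMD_SET:
		c->last_arg = arg;
		return 0;
	default:
		return -ENOTTY;
	}
}

const struct chan_ops demo_ops = {
	.unlocked_ioctl	= demo_ioctl,
};

const struct chan_ops no_ioctl_ops = {
	.unlocked_ioctl	= NULL,
};
```

Existing users that do not set the new callback keep their current
behaviour, since an unset handler yields the same -ENOTTY they would get
from a file without an ioctl method.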

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 include/linux/thread_with_file.h |    1 +
 lib/thread_with_file.c           |   12 ++++++++++++
 2 files changed, 13 insertions(+)


diff --git a/include/linux/thread_with_file.h b/include/linux/thread_with_file.h
index 445b1b12a0bd6..33770938d5d9a 100644
--- a/include/linux/thread_with_file.h
+++ b/include/linux/thread_with_file.h
@@ -57,6 +57,7 @@ struct thread_with_stdio;
 struct thread_with_stdio_ops {
 	void (*exit)(struct thread_with_stdio *);
 	void (*fn)(struct thread_with_stdio *);
+	long (*unlocked_ioctl)(struct thread_with_stdio *, unsigned int, unsigned long);
 };
 
 struct thread_with_stdio {
diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index 2edf33c3e7dc5..8b129744a48a3 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -379,12 +379,23 @@ static __poll_t thread_with_stdout_poll(struct file *file, struct poll_table_str
 	return mask;
 }
 
+static long thread_with_stdio_ioctl(struct file *file, unsigned int cmd, unsigned long p)
+{
+	struct thread_with_stdio *thr =
+		container_of(file->private_data, struct thread_with_stdio, thr);
+
+	if (thr->ops->unlocked_ioctl)
+		return thr->ops->unlocked_ioctl(thr, cmd, p);
+	return -ENOTTY;
+}
+
 static const struct file_operations thread_with_stdio_fops = {
 	.llseek		= no_llseek,
 	.read		= thread_with_stdio_read,
 	.write		= thread_with_stdio_write,
 	.poll		= thread_with_stdio_poll,
 	.release	= thread_with_stdio_release,
+	.unlocked_ioctl	= thread_with_stdio_ioctl,
 };
 
 static const struct file_operations thread_with_stdout_fops = {
@@ -392,6 +403,7 @@ static const struct file_operations thread_with_stdout_fops = {
 	.read		= thread_with_stdio_read,
 	.poll		= thread_with_stdout_poll,
 	.release	= thread_with_stdio_release,
+	.unlocked_ioctl	= thread_with_stdio_ioctl,
 };
 
 static int thread_with_stdio_fn(void *arg)



* [PATCH 5/5] thread_with_file: Fix missing va_end()
  2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-02-24  1:17   ` [PATCH 4/5] thread_with_file: allow ioctls against these files Darrick J. Wong
@ 2024-02-24  1:17   ` Darrick J. Wong
  4 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:17 UTC
  To: akpm, daniel, kent.overstreet, djwong
  Cc: linux-xfs, linux-bcachefs, linux-kernel

From: Kent Overstreet <kent.overstreet@linux.dev>

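darray_vprintf() calls va_copy() on every pass through its retry loop but
never calls va_end() on the copy.  Add the missing va_end().

Fixes: https://lore.kernel.org/linux-bcachefs/202402131603.E953E2CF@keescook/T/#u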
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 lib/thread_with_file.c |    2 ++
 1 file changed, 2 insertions(+)


diff --git a/lib/thread_with_file.c b/lib/thread_with_file.c
index 8b129744a48a3..37a1ea22823ca 100644
--- a/lib/thread_with_file.c
+++ b/lib/thread_with_file.c
@@ -118,6 +118,8 @@ static ssize_t darray_vprintf(darray_char *out, gfp_t gfp, const char *fmt, va_l
 
 		va_copy(args2, args);
 		len = vsnprintf(out->data + out->nr, darray_room(*out), fmt, args2);
+		va_end(args2);
+
 		if (len + 1 <= darray_room(*out)) {
 			out->nr += len;
 			return len;



* [PATCH 1/8] xfs: use thread_with_file to create a monitoring file
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-02-24  1:17   ` Darrick J. Wong
  2024-02-24  1:18   ` [PATCH 2/8] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:17 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Use Kent Overstreet's thread_with_file abstraction to provide a magic
file from which we can read filesystem health events.
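
For illustration only (not part of the patch): from userspace, obtaining
the monitoring file would presumably look like the sketch below.  The
structure layout and ioctl number are copied from xfs_fs_staging.h in this
patch; the helper name open_health_monitor() is an assumption.  Per the
patch, the call needs CAP_SYS_ADMIN, all fields must currently be zero, and
the kernel must be built with CONFIG_XFS_EXPERIMENTAL_IOCTLS.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Userspace mirror of struct xfs_health_monitor from this patch. */
struct xfs_health_monitor {
	uint64_t	flags;		/* flags */
	uint8_t		format;		/* output format */
	uint8_t		pad1[7];	/* zeroes */
	uint64_t	pad2[2];	/* zeroes */
};

#define XFS_IOC_HEALTH_MONITOR	_IOR('X', 48, struct xfs_health_monitor)

/*
 * Hypothetical helper: ask the filesystem behind @fs_fd for a health
 * monitoring fd.  Returns the new fd, or a negative errno.
 */
int open_health_monitor(int fs_fd)
{
	struct xfs_health_monitor hmo;
	int ret;

	/* The kernel rejects nonzero flags, format, or padding. */
	memset(&hmo, 0, sizeof(hmo));

	ret = ioctl(fs_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
	return ret < 0 ? -errno : ret;
}
```

On success the returned descriptor is read-only; events are consumed with
read() and poll().  On a file that does not implement the ioctl, the call
is expected to fail with a negative errno such as -ENOTTY.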

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Kconfig                 |    9 +++
 fs/xfs/Makefile                |    1 
 fs/xfs/libxfs/xfs_fs.h         |    1 
 fs/xfs/libxfs/xfs_fs_staging.h |   10 +++
 fs/xfs/xfs_healthmon.c         |  129 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h         |   15 +++++
 fs/xfs/xfs_ioctl.c             |   21 +++++++
 fs/xfs/xfs_linux.h             |    3 +
 8 files changed, 189 insertions(+)
 create mode 100644 fs/xfs/xfs_healthmon.c
 create mode 100644 fs/xfs/xfs_healthmon.h


diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index e0fa9b382fbeb..dd22cf799328a 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -6,6 +6,7 @@ config XFS_FS
 	select LIBCRC32C
 	select FS_IOMAP
 	select TIME_STATS if XFS_TIME_STATS
+	select THREAD_WITH_FILE if XFS_HEALTH_MONITOR
 	help
 	  XFS is a high performance journaling filesystem which originated
 	  on the SGI IRIX platform.  It is completely multi-threaded, can
@@ -128,6 +129,14 @@ config XFS_TIME_STATS
 	help
 	  Collects time statistics on various operations in the filesystem.
 
+config XFS_HEALTH_MONITOR
+	bool "Report filesystem health events to userspace"
+	depends on XFS_FS
+	select XFS_LIVE_HOOKS
+	default y
+	help
+	  Report health events to userspace programs.
+
 config XFS_DRAIN_INTENTS
 	bool
 	select JUMP_LABEL if HAVE_ARCH_JUMP_LABEL
diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index bf3bacfb7afff..563936e48ab39 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -154,6 +154,7 @@ xfs-$(CONFIG_XFS_LIVE_HOOKS)	+= xfs_hooks.o
 xfs-$(CONFIG_XFS_MEMORY_BUFS)	+= xfs_buf_mem.o
 xfs-$(CONFIG_XFS_BTREE_IN_MEM)	+= libxfs/xfs_btree_mem.o
 xfs-$(CONFIG_XFS_TIME_STATS)	+= xfs_timestats.o
+xfs-$(CONFIG_XFS_HEALTH_MONITOR) += xfs_healthmon.o
 
 # online scrub/repair
 ifeq ($(CONFIG_XFS_ONLINE_SCRUB),y)
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 246c2582abbe5..b9d9bc511475d 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -855,6 +855,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGETXATTRA	_IOR ('X', 45, struct fsxattr)
 /*	XFS_IOC_SETBIOSIZE ---- deprecated 46	   */
 /*	XFS_IOC_GETBIOSIZE ---- deprecated 47	   */
+/*	XFS_IOC_HEALTHMON -------- staging 48	   */
 #define XFS_IOC_GETBMAPX	_IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
diff --git a/fs/xfs/libxfs/xfs_fs_staging.h b/fs/xfs/libxfs/xfs_fs_staging.h
index 1da182c77934d..84b99816eec2e 100644
--- a/fs/xfs/libxfs/xfs_fs_staging.h
+++ b/fs/xfs/libxfs/xfs_fs_staging.h
@@ -303,4 +303,14 @@ struct xfs_map_freesp {
  */
 #define XFS_IOC_MAP_FREESP	_IOWR('X', 64, struct xfs_map_freesp)
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
+/* Monitor for health events. */
+#define XFS_IOC_HEALTH_MONITOR		_IOR ('X', 48, struct xfs_health_monitor)
+
 #endif /* __XFS_FS_STAGING_H__ */
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
new file mode 100644
index 0000000000000..9b4da8d1e5173
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.c
@@ -0,0 +1,129 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_trace.h"
+#include "xfs_health.h"
+#include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_quota_defs.h"
+#include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
+
+/*
+ * Live Health Monitoring
+ * ======================
+ *
+ * Autonomous self-healing of XFS filesystems requires a means for the kernel
+ * to send filesystem health events to a monitoring daemon in userspace.  To
+ * accomplish this, we establish a thread_with_file kthread object to handle
+ * translating internal events about filesystem health into a format that can
+ * be parsed easily by userspace.  Then we hook various parts of the filesystem
+ * to supply those internal events to the kthread.  Userspace reads events
+ * from the file descriptor returned by the ioctl.
+ *
+ * The healthmon abstraction has a weak reference to the host filesystem mount
+ * so that the queueing and processing of the events do not pin the mount and
+ * cannot slow down the main filesystem.  The healthmon object can exist past
+ * the end of the filesystem mount.
+ */
+
+struct xfs_healthmon {
+	/* thread with stdio redirection */
+	struct thread_with_stdio	thread;
+};
+
+static inline struct xfs_healthmon *
+to_healthmon(struct thread_with_stdio	*thr)
+{
+	return container_of(thr, struct xfs_healthmon, thread);
+}
+
+/* Free the health monitoring information. */
+STATIC void
+xfs_healthmon_exit(
+	struct thread_with_stdio	*thr)
+{
+	struct xfs_healthmon		*hm = to_healthmon(thr);
+
+	kfree(hm);
+	module_put(THIS_MODULE);
+}
+
+/* Pipe health monitoring information to userspace. */
+STATIC void
+xfs_healthmon_run(
+	struct thread_with_stdio	*thr)
+{
+}
+
+/* Validate ioctl parameters. */
+static inline bool
+xfs_healthmon_validate(
+	const struct xfs_health_monitor	*hmo)
+{
+	if (hmo->flags)
+		return false;
+	if (hmo->format)
+		return false;
+	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
+		return false;
+	if (memchr_inv(&hmo->pad2, 0, sizeof(hmo->pad2)))
+		return false;
+	return true;
+}
+
+static const struct thread_with_stdio_ops xfs_healthmon_ops = {
+	.exit		= xfs_healthmon_exit,
+	.fn		= xfs_healthmon_run,
+};
+
+/*
+ * Create a health monitoring file.  Returns an index to the fd table or a
+ * negative errno.
+ */
+int
+xfs_healthmon_create(
+	struct xfs_mount		*mp,
+	struct xfs_health_monitor	*hmo)
+{
+	struct xfs_healthmon		*hm;
+	int				ret;
+
+	if (!xfs_healthmon_validate(hmo))
+		return -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENOMEM;
+
+	hm = kzalloc(sizeof(*hm), GFP_KERNEL);
+	if (!hm) {
+		ret = -ENOMEM;
+		goto out_mod;
+	}
+
+	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
+	if (ret < 0)
+		goto out_hm;
+
+	return ret;
+out_hm:
+	kfree(hm);
+out_mod:
+	module_put(THIS_MODULE);
+	return ret;
+}
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
new file mode 100644
index 0000000000000..a9a8115ec770b
--- /dev/null
+++ b/fs/xfs/xfs_healthmon.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_HEALTHMON_H__
+#define __XFS_HEALTHMON_H__
+
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+int xfs_healthmon_create(struct xfs_mount *mp, struct xfs_health_monitor *hmo);
+#else
+# define xfs_healthmon_create(mp, hmo)		(-EOPNOTSUPP)
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
+#endif /* __XFS_HEALTHMON_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d592ceb26c3e5..270127300ba02 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -44,6 +44,7 @@
 #include "xfs_file.h"
 #include "xfs_exchrange.h"
 #include "xfs_rtgroup.h"
+#include "xfs_healthmon.h"
 
 #include <linux/mount.h>
 #include <linux/namei.h>
@@ -2429,6 +2430,23 @@ xfs_ioc_map_freesp(
 # define xfs_ioc_map_freesp(...)		(-ENOTTY)
 #endif
 
+#ifdef CONFIG_XFS_EXPERIMENTAL_IOCTLS
+STATIC int
+xfs_ioc_health_monitor(
+	struct xfs_mount		*mp,
+	struct xfs_health_monitor __user *arg)
+{
+	struct xfs_health_monitor	hmo;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	return xfs_healthmon_create(mp, &hmo);
+}
+#else
+# define xfs_ioc_health_monitor(...)		(-ENOTTY)
+#endif
+
 /*
  * These long-unused ioctls were removed from the official ioctl API in 5.17,
  * but retain these definitions so that we can log warnings about them.
@@ -2685,6 +2703,9 @@ xfs_file_ioctl(
 	case XFS_IOC_MAP_FREESP:
 		return xfs_ioc_map_freesp(filp, arg);
 
+	case XFS_IOC_HEALTH_MONITOR:
+		return xfs_ioc_health_monitor(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index 8598294514aa3..02dc0aba4e728 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -69,6 +69,9 @@ typedef __u32			xfs_nlink_t;
 # include <linux/time_stats.h>
 #endif
 #include <linux/sched/clock.h>
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+# include <linux/thread_with_file.h>
+#endif
 
 #include <asm/page.h>
 #include <asm/div64.h>



* [PATCH 2/8] xfs: create hooks for monitoring health updates
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 1/8] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
@ 2024-02-24  1:18   ` Darrick J. Wong
  2024-02-24  1:18   ` [PATCH 3/8] xfs: create a filesystem shutdown hook Darrick J. Wong
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:18 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.
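
For illustration only (not part of the patch): the hook call sites below
follow one rule, "do nothing unless the switch is on, and skip updates
whose new mask is empty".  A simplified userspace model follows; a plain
counter stands in for the static key, a linked list stands in for the
xfs_hooks notifier chain, and every name in the sketch is hypothetical.

```c
#include <stddef.h>

enum health_update_type {
	HEALTHUP_SICK = 1,	/* runtime corruption observed */
	HEALTHUP_CORRUPT,	/* fsck reported corruption */
	HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
	HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
};

struct health_update_params {
	unsigned int	old_mask;
	unsigned int	new_mask;
};

struct health_hook {
	struct health_hook	*next;
	void			(*fn)(struct health_hook *hook,
				      enum health_update_type op,
				      const struct health_update_params *p);
};

struct hook_chain {
	struct health_hook	*head;
};

/* Stand-in for the static key: nonzero while any user wants events. */
static int health_hooks_enabled;

void health_hook_enable(void)	{ health_hooks_enabled++; }
void health_hook_disable(void)	{ health_hooks_enabled--; }

void health_hook_add(struct hook_chain *chain, struct health_hook *hook)
{
	hook->next = chain->head;
	chain->head = hook;
}

/* Call every registered hook for one health update. */
void health_update(struct hook_chain *chain, enum health_update_type op,
		   unsigned int old_mask, unsigned int new_mask)
{
	struct health_update_params p = {
		.old_mask	= old_mask,
		.new_mask	= new_mask,
	};
	struct health_hook *hook;

	if (!health_hooks_enabled)
		return;		/* fast path when nobody is listening */
	if (!new_mask)
		return;		/* nothing changed, nothing to report */

	for (hook = chain->head; hook; hook = hook->next)
		hook->fn(hook, op, &p);
}

/* Example listener that counts the updates it sees. */
struct counting_hook {
	struct health_hook	hook;
	unsigned int		calls;
	unsigned int		last_new_mask;
};

void count_update(struct health_hook *hook, enum health_update_type op,
		  const struct health_update_params *p)
{
	struct counting_hook *ch = (struct counting_hook *)
		((char *)hook - offsetof(struct counting_hook, hook));

	(void)op;
	ch->calls++;
	ch->last_new_mask = p->new_mask;
}
```

In the patch itself the unmount notification is the one exception to the
second check: it is delivered whenever the switch is on, with no mask test.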

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_health.h |   48 ++++++++
 fs/xfs/xfs_health.c        |  266 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h         |    3 
 fs/xfs/xfs_super.c         |    1 
 4 files changed, 317 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_health.h b/fs/xfs/libxfs/xfs_health.h
index 89b80e957917e..3c508a71ec91e 100644
--- a/fs/xfs/libxfs/xfs_health.h
+++ b/fs/xfs/libxfs/xfs_health.h
@@ -331,4 +331,52 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_RT,	/* realtime */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */
diff --git a/fs/xfs/xfs_health.c b/fs/xfs/xfs_health.c
index 33059d979857a..7e6cde66ef23a 100644
--- a/fs/xfs/xfs_health.c
+++ b/fs/xfs/xfs_health.c
@@ -20,6 +20,189 @@
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+/*
+ * Use a static key here to reduce the overhead of health updates.  If
+ * the compiler supports jump labels, the static branch will be replaced by a
+ * nop sled when there are no hook users.  Online fsck is currently the only
+ * caller, so this is a reasonable tradeoff.
+ *
+ * Note: Patching the kernel code requires taking the cpu hotplug lock.  Other
+ * parts of the kernel allocate memory with that lock held, which means that
+ * XFS callers cannot hold any locks that might be used by memory reclaim or
+ * writeback when calling the static_branch_{inc,dec} functions.
+ */
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_health_hooks_switch);
+
+void
+xfs_health_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_health_hooks_switch);
+}
+
+void
+xfs_health_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_health_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem unmount health update. */
+static inline void
+xfs_health_unmount_hook(
+	struct xfs_mount		*mp)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+		};
+
+		xfs_hooks_call(&mp->m_health_update_hooks,
+				XFS_HEALTHUP_UNMOUNT, &p);
+	}
+}
+
+/* Call downstream hooks for a filesystem health update. */
+static inline void
+xfs_fs_health_update_hook(
+	struct xfs_mount		*mp,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_FS,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+		};
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a realtime health update. */
+static inline void
+xfs_rt_health_update_hook(
+	struct xfs_mount		*mp,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_RT,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+		};
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a perag health update. */
+static inline void
+xfs_ag_health_update_hook(
+	struct xfs_perag		*pag,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_AG,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.group		= pag->pag_agno,
+		};
+		struct xfs_mount	*mp = pag->pag_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for an inode health update. */
+static inline void
+xfs_inode_health_update_hook(
+	struct xfs_inode		*ip,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_INODE,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.ino		= ip->i_ino,
+			.gen		= VFS_I(ip)->i_generation,
+		};
+		struct xfs_mount	*mp = ip->i_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call downstream hooks for a realtime group health update. */
+static inline void
+xfs_rtgroup_health_update_hook(
+	struct xfs_rtgroup		*rtg,
+	enum xfs_health_update_type	op,
+	unsigned int			old_mask,
+	unsigned int			new_mask)
+{
+	if (xfs_hooks_switched_on(&xfs_health_hooks_switch)) {
+		struct xfs_health_update_params	p = {
+			.domain		= XFS_HEALTHUP_RTGROUP,
+			.old_mask	= old_mask,
+			.new_mask	= new_mask,
+			.group		= rtg->rtg_rgno,
+		};
+		struct xfs_mount	*mp = rtg->rtg_mount;
+
+		if (new_mask)
+			xfs_hooks_call(&mp->m_health_update_hooks, op, &p);
+	}
+}
+
+/* Call the specified function during a health update. */
+int
+xfs_health_hook_add(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Stop calling the specified function during a health update. */
+void
+xfs_health_hook_del(
+	struct xfs_mount	*mp,
+	struct xfs_health_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_health_update_hooks, &hook->health_hook);
+}
+
+/* Configure health update hook functions. */
+void
+xfs_health_hook_setup(
+	struct xfs_health_hook	*hook,
+	notifier_fn_t		mod_fn)
+{
+	xfs_hook_setup(&hook->health_hook, mod_fn);
+}
+#else
+# define xfs_health_unmount_hook(...)		((void)0)
+# define xfs_fs_health_update_hook(...)		((void)0)
+# define xfs_rt_health_update_hook(...)		((void)0)
+# define xfs_ag_health_update_hook(...)		((void)0)
+# define xfs_inode_health_update_hook(...)	((void)0)
+# define xfs_rtgroup_health_update_hook(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Warn about metadata corruption that we detected but haven't fixed, and
  * make sure we're not sitting on anything that would get in the way of
@@ -37,8 +220,10 @@ xfs_health_unmount(
 	unsigned int		checked = 0;
 	bool			warn = false;
 
-	if (xfs_is_shutdown(mp))
+	if (xfs_is_shutdown(mp)) {
+		xfs_health_unmount_hook(mp);
 		return;
+	}
 
 	/* Measure AG corruption levels. */
 	for_each_perag(mp, agno, pag) {
@@ -101,6 +286,8 @@ xfs_health_unmount(
 		if (sick & XFS_SICK_FS_COUNTERS)
 			xfs_fs_mark_healthy(mp, XFS_SICK_FS_COUNTERS);
 	}
+
+	xfs_health_unmount_hook(mp);
 }
 
 /* Mark unhealthy per-fs metadata. */
@@ -109,12 +296,17 @@ xfs_fs_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark per-fs metadata as having been checked and found unhealthy by fsck. */
@@ -123,13 +315,18 @@ xfs_fs_mark_corrupt(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_corrupt(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick |= mask;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark a per-fs metadata healed. */
@@ -138,15 +335,20 @@ xfs_fs_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_FS_ALL));
 	trace_xfs_fs_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_fs_sick;
 	mp->m_fs_sick &= ~mask;
 	if (!(mp->m_fs_sick & XFS_SICK_FS_PRIMARY))
 		mp->m_fs_sick &= ~XFS_SICK_FS_SECONDARY;
 	mp->m_fs_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_fs_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-fs metadata are unhealthy. */
@@ -168,12 +370,17 @@ xfs_rt_mark_sick(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RT_ALL));
 	trace_xfs_rt_mark_sick(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_rt_sick;
 	mp->m_rt_sick |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_rt_health_update_hook(mp, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark realtime metadata as having been checked and found unhealthy by fsck. */
@@ -182,13 +389,18 @@ xfs_rt_mark_corrupt(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RT_ALL));
 	trace_xfs_rt_mark_corrupt(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_rt_sick;
 	mp->m_rt_sick |= mask;
 	mp->m_rt_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_rt_health_update_hook(mp, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark a realtime metadata healed. */
@@ -197,15 +409,20 @@ xfs_rt_mark_healthy(
 	struct xfs_mount	*mp,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RT_ALL));
 	trace_xfs_rt_mark_healthy(mp, mask);
 
 	spin_lock(&mp->m_sb_lock);
+	old_mask = mp->m_rt_sick;
 	mp->m_rt_sick &= ~mask;
 	if (!(mp->m_rt_sick & XFS_SICK_RT_PRIMARY))
 		mp->m_rt_sick &= ~XFS_SICK_RT_SECONDARY;
 	mp->m_rt_checked |= mask;
 	spin_unlock(&mp->m_sb_lock);
+
+	xfs_rt_health_update_hook(mp, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which realtime metadata are unhealthy. */
@@ -244,12 +461,17 @@ xfs_ag_mark_sick(
 	struct xfs_perag	*pag,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_AG_ALL));
 	trace_xfs_ag_mark_sick(pag->pag_mount, pag->pag_agno, mask);
 
 	spin_lock(&pag->pag_state_lock);
+	old_mask = pag->pag_sick;
 	pag->pag_sick |= mask;
 	spin_unlock(&pag->pag_state_lock);
+
+	xfs_ag_health_update_hook(pag, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark per-ag metadata as having been checked and found unhealthy by fsck. */
@@ -258,13 +480,18 @@ xfs_ag_mark_corrupt(
 	struct xfs_perag	*pag,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_AG_ALL));
 	trace_xfs_ag_mark_corrupt(pag->pag_mount, pag->pag_agno, mask);
 
 	spin_lock(&pag->pag_state_lock);
+	old_mask = pag->pag_sick;
 	pag->pag_sick |= mask;
 	pag->pag_checked |= mask;
 	spin_unlock(&pag->pag_state_lock);
+
+	xfs_ag_health_update_hook(pag, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark per-ag metadata ok. */
@@ -273,15 +500,20 @@ xfs_ag_mark_healthy(
 	struct xfs_perag	*pag,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_AG_ALL));
 	trace_xfs_ag_mark_healthy(pag->pag_mount, pag->pag_agno, mask);
 
 	spin_lock(&pag->pag_state_lock);
+	old_mask = pag->pag_sick;
 	pag->pag_sick &= ~mask;
 	if (!(pag->pag_sick & XFS_SICK_AG_PRIMARY))
 		pag->pag_sick &= ~XFS_SICK_AG_SECONDARY;
 	pag->pag_checked |= mask;
 	spin_unlock(&pag->pag_state_lock);
+
+	xfs_ag_health_update_hook(pag, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which per-ag metadata are unhealthy. */
@@ -320,12 +552,17 @@ xfs_rtgroup_mark_sick(
 	struct xfs_rtgroup	*rtg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RG_ALL));
 	trace_xfs_rtgroup_mark_sick(rtg, mask);
 
 	spin_lock(&rtg->rtg_state_lock);
+	old_mask = rtg->rtg_sick;
 	rtg->rtg_sick |= mask;
 	spin_unlock(&rtg->rtg_state_lock);
+
+	xfs_rtgroup_health_update_hook(rtg, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark rtgroup metadata as having been checked and found unhealthy by fsck. */
@@ -334,13 +571,19 @@ xfs_rtgroup_mark_corrupt(
 	struct xfs_rtgroup	*rtg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RG_ALL));
 	trace_xfs_rtgroup_mark_corrupt(rtg, mask);
 
 	spin_lock(&rtg->rtg_state_lock);
+	old_mask = rtg->rtg_sick;
 	rtg->rtg_sick |= mask;
 	rtg->rtg_checked |= mask;
 	spin_unlock(&rtg->rtg_state_lock);
+
+	xfs_rtgroup_health_update_hook(rtg, XFS_HEALTHUP_CORRUPT, old_mask,
+			mask);
 }
 
 /* Mark per-rtgroup metadata ok. */
@@ -349,15 +592,21 @@ xfs_rtgroup_mark_healthy(
 	struct xfs_rtgroup	*rtg,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_RG_ALL));
 	trace_xfs_rtgroup_mark_healthy(rtg, mask);
 
 	spin_lock(&rtg->rtg_state_lock);
+	old_mask = rtg->rtg_sick;
 	rtg->rtg_sick &= ~mask;
 	if (!(rtg->rtg_sick & XFS_SICK_RT_PRIMARY))
 		rtg->rtg_sick &= ~XFS_SICK_RT_SECONDARY;
 	rtg->rtg_checked |= mask;
 	spin_unlock(&rtg->rtg_state_lock);
+
+	xfs_rtgroup_health_update_hook(rtg, XFS_HEALTHUP_HEALTHY, old_mask,
+			mask);
 }
 
 /* Sample which per-rtgroup metadata are unhealthy. */
@@ -379,10 +628,13 @@ xfs_inode_mark_sick(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_sick(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	spin_unlock(&ip->i_flags_lock);
 
@@ -394,6 +646,8 @@ xfs_inode_mark_sick(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_SICK, old_mask, mask);
 }
 
 /* Mark inode metadata as having been checked and found unhealthy by fsck. */
@@ -402,10 +656,13 @@ xfs_inode_mark_corrupt(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_corrupt(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick |= mask;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
@@ -418,6 +675,8 @@ xfs_inode_mark_corrupt(
 	spin_lock(&VFS_I(ip)->i_lock);
 	VFS_I(ip)->i_state &= ~I_DONTCACHE;
 	spin_unlock(&VFS_I(ip)->i_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_CORRUPT, old_mask, mask);
 }
 
 /* Mark parts of an inode healed. */
@@ -426,15 +685,20 @@ xfs_inode_mark_healthy(
 	struct xfs_inode	*ip,
 	unsigned int		mask)
 {
+	unsigned int		old_mask;
+
 	ASSERT(!(mask & ~XFS_SICK_INO_ALL));
 	trace_xfs_inode_mark_healthy(ip, mask);
 
 	spin_lock(&ip->i_flags_lock);
+	old_mask = ip->i_sick;
 	ip->i_sick &= ~mask;
 	if (!(ip->i_sick & XFS_SICK_INO_PRIMARY))
 		ip->i_sick &= ~XFS_SICK_INO_SECONDARY;
 	ip->i_checked |= mask;
 	spin_unlock(&ip->i_flags_lock);
+
+	xfs_inode_health_update_hook(ip, XFS_HEALTHUP_HEALTHY, old_mask, mask);
 }
 
 /* Sample which parts of an inode are unhealthy. */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 63649c259b9c5..316240b79a1e9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -284,6 +284,9 @@ typedef struct xfs_mount {
 	/* Hook to feed dirent updates to an active online repair. */
 	struct xfs_hooks	m_dir_update_hooks;
 
+	/* Hook to feed health events to a daemon. */
+	struct xfs_hooks	m_health_update_hooks;
+
 	struct xfs_timestats	m_timestats;
 } xfs_mount_t;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 29a53874490cc..23dbb67a1344d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2081,6 +2081,7 @@ static int xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_timestats_init(mp);
 
 	fc->s_fs_info = mp;

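Reader's note, not part of the patch: the hook helpers above pass the caller's `mask` argument through as `new_mask`, next to the sick state sampled under the lock as `old_mask`. Under that reading, a listener that only cares about state changes could work out which bits actually flipped. A minimal userspace sketch follows; the enum and the function name are hypothetical and do not exist in the kernel.

```c
/* Hypothetical stand-ins for XFS_HEALTHUP_SICK/CORRUPT/HEALTHY. */
enum healthup_op {
	HEALTHUP_SICK,
	HEALTHUP_CORRUPT,
	HEALTHUP_HEALTHY,
};

/*
 * Return the bits whose sick state changes in this update.
 *
 * old_mask: sick bits sampled before the update was applied
 * mask:     bits the caller asked to mark (new_mask in the hook params)
 */
static unsigned int
health_changed_bits(
	enum healthup_op	op,
	unsigned int		old_mask,
	unsigned int		mask)
{
	switch (op) {
	case HEALTHUP_SICK:
	case HEALTHUP_CORRUPT:
		/* Newly sick: requested bits that were not already set. */
		return mask & ~old_mask;
	case HEALTHUP_HEALTHY:
		/* Newly healed: requested bits that were previously set. */
		return mask & old_mask;
	}
	return 0;
}
```

The sketch ignores the secondary bits that the mark_healthy functions clear once no primary bits remain; a real listener would have to account for those separately.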


* [PATCH 3/8] xfs: create a filesystem shutdown hook
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
  2024-02-24  1:17   ` [PATCH 1/8] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
  2024-02-24  1:18   ` [PATCH 2/8] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2024-02-24  1:18   ` Darrick J. Wong
  2024-02-24  1:18   ` [PATCH 4/8] xfs: report shutdown events through healthmon Darrick J. Wong
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:18 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Create a hook so that health monitoring can report filesystem shutdown
events to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |   57 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_fsops.h |   14 +++++++++++++
 fs/xfs/xfs_mount.h |    3 +++
 fs/xfs/xfs_super.c |    1 +
 4 files changed, 75 insertions(+)

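Reader's note, not part of the patch: the shutdown hook pairs a per-mount chain of callbacks with a global switch, so that the call site costs almost nothing while nobody is listening. The sketch below is a userspace model of that shape only. Every name in it is hypothetical, a plain counter stands in for the static key, and a singly linked list stands in for whatever `xfs_hooks` uses underneath (the `notifier_fn_t` in the signatures suggests a notifier chain).

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
typedef int (*hook_fn)(unsigned long action, void *data);

struct hook {
	hook_fn		fn;
	struct hook	*next;
};

struct hook_chain {
	struct hook	*head;
};

/* Stand-in for the static key: counts how many users want hooks. */
static int hooks_enabled;

static void hooks_switch_on(void)
{
	hooks_enabled++;
}

static void hooks_switch_off(void)
{
	if (hooks_enabled > 0)
		hooks_enabled--;
}

/* Add a hook to the front of the chain. */
static void hook_add(struct hook_chain *chain, struct hook *h)
{
	h->next = chain->head;
	chain->head = h;
}

/* Remove a hook from the chain if it is present. */
static void hook_del(struct hook_chain *chain, struct hook *h)
{
	struct hook	**pp;

	for (pp = &chain->head; *pp; pp = &(*pp)->next) {
		if (*pp == h) {
			*pp = h->next;
			h->next = NULL;
			return;
		}
	}
}

/* Call every hook if the switch is on; return how many were called. */
static int hooks_call(struct hook_chain *chain, unsigned long action,
		void *data)
{
	struct hook	*h;
	int		called = 0;

	if (!hooks_enabled)
		return 0;

	for (h = chain->head; h; h = h->next) {
		h->fn(action, data);
		called++;
	}
	return called;
}

/* Example listener: remember the last action it was told about. */
static unsigned long last_action;

static int record_action(unsigned long action, void *data)
{
	(void)data;
	last_action = action;
	return 0;
}
```

A consumer would switch hooks on, add its listener, and reverse both steps on teardown, which is the order the healthmon patch later in this series follows.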

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index a2929a0e0367e..ac2960c44bb84 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -483,6 +483,61 @@ xfs_fs_goingdown(
 	return 0;
 }
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_shutdown_hooks_switch);
+
+void
+xfs_shutdown_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_shutdown_hooks_switch);
+}
+
+void
+xfs_shutdown_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_shutdown_hooks_switch);
+}
+
+/* Call downstream hooks for a filesystem shutdown. */
+static inline void
+xfs_shutdown_hook(
+	struct xfs_mount		*mp,
+	uint32_t			flags)
+{
+	if (xfs_hooks_switched_on(&xfs_shutdown_hooks_switch))
+		xfs_hooks_call(&mp->m_shutdown_hooks, flags, NULL);
+}
+
+/* Call the specified function during a shutdown update. */
+int
+xfs_shutdown_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Stop calling the specified function during a shutdown update. */
+void
+xfs_shutdown_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_shutdown_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_shutdown_hooks, &hook->shutdown_hook);
+}
+
+/* Configure shutdown update hook functions. */
+void
+xfs_shutdown_hook_setup(
+	struct xfs_shutdown_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->shutdown_hook, mod_fn);
+}
+#else
+# define xfs_shutdown_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 /*
  * Force a shutdown of the filesystem instantly while keeping the filesystem
  * consistent. We don't do an unmount here; just shutdown the shop, make sure
@@ -541,6 +596,8 @@ xfs_do_force_shutdown(
 		"Please unmount the filesystem and rectify the problem(s)");
 	if (xfs_error_level >= XFS_ERRLEVEL_HIGH)
 		xfs_stack_trace();
+
+	xfs_shutdown_hook(mp, flags);
 }
 
 /*
diff --git a/fs/xfs/xfs_fsops.h b/fs/xfs/xfs_fsops.h
index 3e2f73bcf8314..59df17decfbbf 100644
--- a/fs/xfs/xfs_fsops.h
+++ b/fs/xfs/xfs_fsops.h
@@ -14,4 +14,18 @@ int xfs_fs_goingdown(struct xfs_mount *mp, uint32_t inflags);
 int xfs_fs_reserve_ag_blocks(struct xfs_mount *mp);
 void xfs_fs_unreserve_ag_blocks(struct xfs_mount *mp);
 
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_shutdown_hook {
+	struct xfs_hook			shutdown_hook;
+};
+
+void xfs_shutdown_hook_disable(void);
+void xfs_shutdown_hook_enable(void);
+
+int xfs_shutdown_hook_add(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_del(struct xfs_mount *mp, struct xfs_shutdown_hook *hook);
+void xfs_shutdown_hook_setup(struct xfs_shutdown_hook *hook,
+		notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_FSOPS_H__ */
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 316240b79a1e9..f1db647b94871 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -287,6 +287,9 @@ typedef struct xfs_mount {
 	/* Hook to feed health events to a daemon. */
 	struct xfs_hooks	m_health_update_hooks;
 
+	/* Hook to feed shutdown events to a daemon. */
+	struct xfs_hooks	m_shutdown_hooks;
+
 	struct xfs_timestats	m_timestats;
 } xfs_mount_t;
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 23dbb67a1344d..1ed848a3706be 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2081,6 +2081,7 @@ static int xfs_init_fs_context(
 	mp->m_allocsize_log = 16; /* 64k */
 
 	xfs_hooks_init(&mp->m_dir_update_hooks);
+	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
 	xfs_timestats_init(mp);
 



* [PATCH 4/8] xfs: report shutdown events through healthmon
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:18   ` [PATCH 3/8] xfs: create a filesystem shutdown hook Darrick J. Wong
@ 2024-02-24  1:18   ` Darrick J. Wong
  2024-02-24  1:18   ` [PATCH 5/8] xfs: report metadata health " Darrick J. Wong
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:18 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs_staging.h |    8 +
 fs/xfs/xfs_healthmon.c         |  458 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h         |   27 ++
 fs/xfs/xfs_trace.c             |    2 
 fs/xfs/xfs_trace.h             |  107 +++++++++
 5 files changed, 597 insertions(+), 5 deletions(-)

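Reader's note, not part of the patch: the healthmon queue in this patch records dropped events by placing a single "lost" marker at the tail of the list before it accepts anything new. The userspace sketch below models that idea with hypothetical names. It also refuses new events once the queue is full, which is an assumption about the intent; the posted hook does not appear to enforce that limit itself.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of the event list; not kernel code. */
enum ev_type { EV_LOST, EV_SHUTDOWN };

struct event {
	struct event	*next;
	enum ev_type	type;
};

struct evqueue {
	struct event	*first;
	struct event	*last;
	unsigned int	nr;		/* events currently queued */
	unsigned int	max;		/* soft limit on queued events */
	bool		lost_prev;	/* an earlier allocation failed */
};

/* Append an event to the tail of the list. */
static void evq_push(struct evqueue *q, struct event *ev)
{
	ev->next = NULL;
	if (!q->first)
		q->first = ev;
	if (q->last)
		q->last->next = ev;
	q->last = ev;
	q->nr++;
}

/* Detach and return the head of the list, or NULL if it is empty. */
static struct event *evq_pop(struct evqueue *q)
{
	struct event	*ev = q->first;

	if (!ev)
		return NULL;
	if (q->last == ev)
		q->last = NULL;
	q->first = ev->next;
	q->nr--;
	return ev;
}

/* Allocate an event, or remember that we could not. */
static struct event *evq_new(struct evqueue *q, enum ev_type type)
{
	struct event	*ev = calloc(1, sizeof(*ev));

	if (!ev) {
		q->lost_prev = true;
		return NULL;
	}
	ev->type = type;
	return ev;
}

/* Queue an event; return false if it had to be dropped. */
static bool evq_report(struct evqueue *q, enum ev_type type)
{
	struct event	*ev;
	bool		full = q->nr >= q->max;

	if (q->lost_prev || full) {
		/* One marker at the tail is enough to flag the gap. */
		if (!q->last || q->last->type != EV_LOST) {
			ev = evq_new(q, EV_LOST);
			if (!ev)
				return false;
			evq_push(q, ev);
		}
		q->lost_prev = false;
	}
	if (full)
		return false;

	ev = evq_new(q, type);
	if (!ev)
		return false;
	evq_push(q, ev);
	return true;
}
```

The marker is allowed to exceed the limit by one entry so that the reader always learns about the gap before it sees any newer event.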

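Reader's note, not part of the patch: the "reasons" array in a shutdown event is built by walking a table of flag/string pairs and then emitting any leftover bits as decimal integers, since JSON has no hexadecimal notation. The sketch below shows the same approach writing into a caller-supplied buffer with snprintf instead of a stdio_redirect. The bit values in the table and the function name are assumptions made for illustration.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

struct flag_string {
	unsigned int	mask;
	const char	*str;
};

/* Bit values are assumed here; the real ones are the SHUTDOWN_* flags. */
static const struct flag_string shutdown_strings[] = {
	{ 1U << 0,	"meta_ioerr" },
	{ 1U << 1,	"log_ioerr" },
	{ 1U << 2,	"force_umount" },
};

/*
 * Write the body of a JSON array describing @flags into @buf.  Known bits
 * become quoted strings; unknown bits become decimal integers.  Returns 0
 * on success or -1 if the buffer is too small.
 */
static int
format_flags(char *buf, size_t len, unsigned int flags)
{
	size_t	nr = sizeof(shutdown_strings) / sizeof(shutdown_strings[0]);
	size_t	used = 0;
	size_t	i;
	int	first = 1;
	int	n;

	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < nr; i++) {
		const struct flag_string *p = &shutdown_strings[i];

		if (!(p->mask & flags))
			continue;

		n = snprintf(buf + used, len - used, "%s\"%s\"",
				first ? "" : ", ", p->str);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += n;
		first = 0;
		flags &= ~p->mask;
	}

	/* Anything left over has no name; print the bit value instead. */
	for (i = 0; flags != 0 && i < sizeof(flags) * CHAR_BIT; i++) {
		unsigned int bit = 1U << i;

		if (!(flags & bit))
			continue;

		n = snprintf(buf + used, len - used, "%s%u",
				first ? "" : ", ", bit);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += n;
		first = 0;
		flags &= ~bit;
	}

	return 0;
}
```

A complete event would wrap this output between `"reasons": [` and `],` and follow it with the `time_ns` field, which has to come last because it carries no trailing comma.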
diff --git a/fs/xfs/libxfs/xfs_fs_staging.h b/fs/xfs/libxfs/xfs_fs_staging.h
index 84b99816eec2e..684d6d22cc8dd 100644
--- a/fs/xfs/libxfs/xfs_fs_staging.h
+++ b/fs/xfs/libxfs/xfs_fs_staging.h
@@ -310,6 +310,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /* Monitor for health events. */
 #define XFS_IOC_HEALTH_MONITOR		_IOR ('X', 48, struct xfs_health_monitor)
 
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 9b4da8d1e5173..b215ded0fda8b 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -12,13 +12,13 @@
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trace.h"
-#include "xfs_health.h"
 #include "xfs_ag.h"
 #include "xfs_btree.h"
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
+#include "xfs_fsops.h"
 #include "xfs_healthmon.h"
 
 /*
@@ -37,11 +37,79 @@
  * so that the queueing and processing of the events do not pin the mount and
  * cannot slow down the main filesystem.  The healthmon object can exist past
  * the end of the filesystem mount.
+ *
+ * The easily parseable event format is a stream of json objects as follows:
+ *
+ * Queue Management
+ * ----------------
+ *
+ * {
+ *	"type": "lost" or "shutdown",
+ *	"domain": "mount",
+ *	"time_ns": integer
+ * }
+ *
+ * "lost" indicates that the kthread dropped events due to memory allocation
+ * failures or queue limits.
+ *
+ * "mount" means that the event affects the entire filesystem mount.
+ *
+ * "time_ns" is the time stamp of when the event originated in the kernel,
+ * expressed in nanoseconds.
+ *
+ * Abnormal Shutdowns
+ * ------------------
+ *
+ * {
+ *	"type": "shutdown",
+ *	"domain": "mount",
+ *	"reasons": [reason_string list...],
+ *	"time_ns": integer
+ * }
+ *
+ * "shutdown" indicates that the filesystem shut down either due to errors or
+ * due to an explicit request from userspace.
+ *
+ * "reasons" are a list of strings describing why the filesystem went down.
+ * They correspond to the SHUTDOWN_* flags.
  */
 
+#define XFS_HEALTHMON_MAX_EVENTS \
+		(32768 / sizeof(struct xfs_healthmon_event))
+
+struct flag_string {
+	unsigned int	mask;
+	const char	*str;
+};
+
 struct xfs_healthmon {
 	/* thread with stdio redirection */
 	struct thread_with_stdio	thread;
+
+	/* lock for mp and eventlist */
+	struct mutex			lock;
+
+	/* waiter for signalling the arrival of events */
+	struct wait_queue_head		wait;
+
+	/* list of event objects */
+	struct xfs_healthmon_event	*first_event;
+	struct xfs_healthmon_event	*last_event;
+
+	/* live update hooks */
+	struct xfs_shutdown_hook	shook;
+
+	/* filesystem mount, or NULL if we've unmounted */
+	struct xfs_mount		*mp;
+
+	/* number of events */
+	unsigned int			events;
+
+	/* do we want all events? */
+	bool				verbose;
+
+	/* did we lose an event? */
+	bool				lost_prev_event;
 };
 
 static inline struct xfs_healthmon *
@@ -50,6 +118,23 @@ to_healthmon(struct thread_with_stdio	*thr)
 	return container_of(thr, struct xfs_healthmon, thread);
 }
 
+/* Free all events */
+STATIC void
+xfs_healthmon_free_events(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event, *next;
+
+	event = hm->first_event;
+	while (event != NULL) {
+		trace_xfs_healthmon_drop(hm->mp, event);
+		next = event->next;
+		kfree(event);
+		event = next;
+	}
+	hm->first_event = hm->last_event = NULL;
+}
+
 /* Free the health monitoring information. */
 STATIC void
 xfs_healthmon_exit(
@@ -57,15 +142,357 @@ xfs_healthmon_exit(
 {
 	struct xfs_healthmon		*hm = to_healthmon(thr);
 
+	trace_xfs_healthmon_exit(hm->mp, hm->events, hm->lost_prev_event);
+
+	if (hm->mp) {
+		xfs_shutdown_hook_del(hm->mp, &hm->shook);
+	}
+	xfs_shutdown_hook_disable();
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
 	kfree(hm);
 	module_put(THIS_MODULE);
 }
 
+/* Remove an event from the head of the list. */
+static inline struct xfs_healthmon_event *
+xfs_healthmon_pop(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*ret = hm->first_event;
+
+	if (!ret)
+		return NULL;
+
+	if (hm->last_event == ret)
+		hm->last_event = NULL;
+	hm->first_event = ret->next;
+	hm->events--;
+
+	trace_xfs_healthmon_pop(hm->mp, ret);
+	return ret;
+}
+
+/* Push an event onto the end of the list. */
+static inline void
+xfs_healthmon_push(
+	struct xfs_healthmon		*hm,
+	struct xfs_healthmon_event	*event)
+{
+	if (!hm->first_event)
+		hm->first_event = event;
+	if (hm->last_event)
+		hm->last_event->next = event;
+	hm->last_event = event;
+	event->next = NULL;
+	hm->events++;
+	wake_up(&hm->wait);
+
+	trace_xfs_healthmon_push(hm->mp, event);
+}
+
+/* Create a new event or record that we failed. */
+static struct xfs_healthmon_event *
+new_event(
+	struct xfs_healthmon		*hm,
+	enum xfs_healthmon_type		type,
+	enum xfs_healthmon_domain	domain)
+{
+	struct timespec64		now;
+	struct xfs_healthmon_event	*event;
+
+	event = kzalloc(sizeof(*event), GFP_KERNEL);
+	if (!event) {
+		hm->lost_prev_event = true;
+		return NULL;
+	}
+
+	event->type = type;
+	event->domain = domain;
+	ktime_get_coarse_real_ts64(&now);
+	event->time_ns = (now.tv_sec * NSEC_PER_SEC) + now.tv_nsec;
+
+	return event;
+}
+
+/*
+ * Before we accept an event notification from a live update hook, we need to
+ * clear out any previously lost events.
+ */
+STATIC int
+xfs_healthmon_start_live_update(
+	struct xfs_healthmon		*hm)
+{
+	struct xfs_healthmon_event	*event;
+
+	/* Already unmounted filesystem, do nothing. */
+	if (!hm->mp)
+		return -ESHUTDOWN;
+
+	/*
+	 * If we previously lost an event or the queue is full, try to queue
+	 * a notification about lost events.
+	 */
+	if (!hm->lost_prev_event && hm->events != XFS_HEALTHMON_MAX_EVENTS)
+		return 0;
+
+	/*
+	 * A previous invocation of the live update hook could not allocate
+	 * any memory at all.  If the last event on the list is already a
+	 * notification of lost events, we're done.
+	 */
+	if (hm->last_event && hm->last_event->type == XFS_HEALTHMON_LOST)
+		return 0;
+
+	/*
+	 * There are no events or the last one wasn't about lost events.  Try
+	 * to allocate a new one to note the lost events.
+	 */
+	event = new_event(hm, XFS_HEALTHMON_LOST, XFS_HEALTHMON_MOUNT);
+	if (!event)
+		return -ENOMEM;
+
+	hm->lost_prev_event = false;
+	xfs_healthmon_push(hm, event);
+	return 0;
+}
+
+/* Add a shutdown event to the reporting queue. */
+STATIC int
+xfs_healthmon_shutdown_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, shook.shutdown_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_shutdown_hook(hm->mp, action, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	event = new_event(hm, XFS_HEALTHMON_SHUTDOWN, XFS_HEALTHMON_MOUNT);
+	if (!event)
+		goto out_unlock;
+
+	event->flags = action;
+	xfs_healthmon_push(hm, event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
+/* Render the health update type as a string. */
+STATIC const char *
+xfs_healthmon_typestring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *type_strings[] = {
+		[XFS_HEALTHMON_LOST]		= "lost",
+		[XFS_HEALTHMON_SHUTDOWN]	= "shutdown",
+	};
+
+	if (event->type >= ARRAY_SIZE(type_strings))
+		return "?";
+
+	return type_strings[event->type];
+}
+
+/* Render the health domain as a string. */
+STATIC const char *
+xfs_healthmon_domstring(
+	const struct xfs_healthmon_event	*event)
+{
+	static const char *dom_strings[] = {
+		[XFS_HEALTHMON_MOUNT]		= "mount",
+	};
+
+	if (event->domain >= ARRAY_SIZE(dom_strings))
+		return "?";
+
+	return dom_strings[event->domain];
+}
+
+/* Convert a flags bitmap into a jsonable string. */
+static inline int
+xfs_healthmon_format_flags(
+	struct stdio_redirect		*out,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			flags)
+{
+	const struct flag_string	*p;
+	ssize_t				ret;
+	unsigned int			i;
+	bool				first = true;
+
+	for (i = 0, p = strings; i < nr_strings; i++, p++) {
+		if (!(p->mask & flags))
+			continue;
+
+		ret = stdio_redirect_printf(out, false, "%s\"%s\"",
+				first ? "" : ", ", p->str);
+		if (ret < 0)
+			return ret;
+
+		first = false;
+		flags &= ~p->mask;
+	}
+
+	for (i = 0; flags != 0 && i < sizeof(flags) * NBBY; i++) {
+		if (!(flags & (1U << i)))
+			continue;
+
+		/* json doesn't support hexadecimal notation */
+		ret = stdio_redirect_printf(out, false, "%s%u",
+				first ? "" : ", ", (1U << i));
+		if (ret < 0)
+			return ret;
+
+		first = false;
+	}
+
+	return 0;
+}
+
+/* Convert the event mask into a jsonable string. */
+static inline int
+__xfs_healthmon_format_mask(
+	struct stdio_redirect		*out,
+	const char			*descr,
+	const struct flag_string	*strings,
+	size_t				nr_strings,
+	unsigned int			mask)
+{
+	ssize_t				ret;
+
+	ret = stdio_redirect_printf(out, false, "  \"%s\":  [", descr);
+	if (ret < 0)
+		return ret;
+
+	ret = xfs_healthmon_format_flags(out, strings, nr_strings, mask);
+	if (ret < 0)
+		return ret;
+
+	return stdio_redirect_printf(out, false, "],\n");
+}
+
+#define xfs_healthmon_format_mask(o, d, s, m) \
+	__xfs_healthmon_format_mask((o), (d), (s), ARRAY_SIZE(s), (m))
+
+/* Render shutdown mask as a string set */
+static ssize_t
+xfs_healthmon_format_shutdown(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ SHUTDOWN_META_IO_ERROR,	"meta_ioerr" },
+		{ SHUTDOWN_LOG_IO_ERROR,	"log_ioerr" },
+		{ SHUTDOWN_FORCE_UMOUNT,	"force_umount" },
+		{ SHUTDOWN_CORRUPT_INCORE,	"corrupt_incore" },
+		{ SHUTDOWN_CORRUPT_ONDISK,	"corrupt_ondisk" },
+		{ SHUTDOWN_DEVICE_REMOVED,	"device_removed" },
+	};
+
+	return xfs_healthmon_format_mask(out, "reasons", mask_strings,
+			event->flags);
+}
+
+/* Format an event into json. */
+STATIC int
+xfs_healthmon_format(
+	struct xfs_healthmon		*hm,
+	const struct xfs_healthmon_event *event)
+{
+	struct stdio_redirect		*out = &hm->thread.stdio;
+	ssize_t				ret;
+
+	ret = stdio_redirect_printf(out, false, "{\n");
+	if (ret < 0)
+		return ret;
+
+	ret = stdio_redirect_printf(out, false, "  \"type\":       \"%s\",\n",
+			xfs_healthmon_typestring(event));
+	if (ret < 0)
+		return ret;
+
+	ret = stdio_redirect_printf(out, false, "  \"domain\":     \"%s\",\n",
+			xfs_healthmon_domstring(event));
+	if (ret < 0)
+		return ret;
+
+	switch (event->type) {
+	case XFS_HEALTHMON_SHUTDOWN:
+		ret = xfs_healthmon_format_shutdown(out, event);
+		break;
+	case XFS_HEALTHMON_LOST:
+		/* empty */
+		break;
+	default:
+		break;
+	}
+
+	switch (event->domain) {
+	case XFS_HEALTHMON_MOUNT:
+		/* empty */
+		break;
+	}
+	if (ret < 0)
+		return ret;
+
+	trace_xfs_healthmon_format(hm->mp, event);
+
+	/* The last element in the json must not have a trailing comma. */
+	ret = stdio_redirect_printf(out, false, "  \"time_ns\":    %llu\n",
+			event->time_ns);
+	if (ret < 0)
+		return ret;
+
+	return stdio_redirect_printf(out, false, "}\n");
+}
+
 /* Pipe health monitoring information to userspace. */
 STATIC void
 xfs_healthmon_run(
 	struct thread_with_stdio	*thr)
 {
+	struct xfs_healthmon		*hm = to_healthmon(thr);
+	struct xfs_healthmon_event	*event;
+	bool				unmounted = false;
+
+	while (!kthread_should_stop() && !unmounted &&
+	       wait_event_interruptible(hm->wait,
+				hm->events > 0 || hm->mp == NULL) == 0) {
+
+		trace_xfs_healthmon_run(hm->mp, hm->events, hm->lost_prev_event);
+
+		mutex_lock(&hm->lock);
+		while (!kthread_should_stop() &&
+		       (event = xfs_healthmon_pop(hm)) != NULL) {
+			mutex_unlock(&hm->lock);
+
+			xfs_healthmon_format(hm, event);
+			kfree(event);
+			cond_resched();
+
+			mutex_lock(&hm->lock);
+		}
+		if (!hm->mp)
+			unmounted = true;
+		mutex_unlock(&hm->lock);
+	}
+
+	trace_xfs_healthmon_stop(hm->mp, hm->events, hm->lost_prev_event);
 }
 
 /* Validate ioctl parameters. */
@@ -73,9 +500,9 @@ static inline bool
 xfs_healthmon_validate(
 	const struct xfs_health_monitor	*hmo)
 {
-	if (hmo->flags)
+	if (hmo->flags & ~XFS_HEALTH_MONITOR_ALL)
 		return false;
-	if (hmo->format)
+	if (hmo->format != XFS_HEALTH_MONITOR_FMT_JSON)
 		return false;
 	if (memchr_inv(&hmo->pad1, 0, sizeof(hmo->pad1)))
 		return false;
@@ -116,12 +543,33 @@ xfs_healthmon_create(
 		goto out_mod;
 	}
 
+	hm->mp = mp;
+	mutex_init(&hm->lock);
+	init_waitqueue_head(&hm->wait);
+
+	if (hmo->flags & XFS_HEALTH_MONITOR_VERBOSE)
+		hm->verbose = true;
+
+	xfs_shutdown_hook_enable();
+
+	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
+	ret = xfs_shutdown_hook_add(mp, &hm->shook);
+	if (ret)
+		goto out_hooks;
+
 	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
 	if (ret < 0)
-		goto out_hm;
+		goto out_shutdown;
+
+	trace_xfs_healthmon_create(mp, hmo->flags, hmo->format);
 
 	return ret;
-out_hm:
+out_shutdown:
+	xfs_shutdown_hook_del(mp, &hm->shook);
+out_hooks:
+	xfs_shutdown_hook_disable();
+	mutex_destroy(&hm->lock);
+	xfs_healthmon_free_events(hm);
 	kfree(hm);
 out_mod:
 	module_put(THIS_MODULE);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index a9a8115ec770b..f67e2f1b8f947 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -6,6 +6,33 @@
 #ifndef __XFS_HEALTHMON_H__
 #define __XFS_HEALTHMON_H__
 
+enum xfs_healthmon_type {
+	XFS_HEALTHMON_LOST,	/* message lost */
+
+	/* filesystem shutdown */
+	XFS_HEALTHMON_SHUTDOWN,
+};
+
+enum xfs_healthmon_domain {
+	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+};
+
+struct xfs_healthmon_event {
+	struct xfs_healthmon_event	*next;
+
+	enum xfs_healthmon_type		type;
+	enum xfs_healthmon_domain	domain;
+
+	uint64_t			time_ns;
+
+	union {
+		/* mount */
+		struct {
+			unsigned int	flags;
+		};
+	};
+};
+
 #ifdef CONFIG_XFS_HEALTH_MONITOR
 int xfs_healthmon_create(struct xfs_mount *mp, struct xfs_health_monitor *hmo);
 #else
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index b40f01cb0fe8d..14bc3f8cf306d 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -47,6 +47,8 @@
 #include "xfs_rmap.h"
 #include "xfs_refcount.h"
 #include "xfs_fsrefs.h"
+#include "xfs_health.h"
+#include "xfs_healthmon.h"
 
 static inline void
 xfs_rmapbt_crack_agno_opdev(
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3c2a2410b17d2..54e3d6d549ec1 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -102,6 +102,8 @@ struct xfs_extent_free_item;
 struct xfs_rmap_intent;
 struct xfs_refcount_intent;
 struct xfs_fsrefs;
+struct xfs_healthmon_event;
+struct xfs_health_update_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -5923,6 +5925,111 @@ TRACE_EVENT(xfs_growfs_check_rtgeom,
 );
 #endif /* CONFIG_XFS_RT */
 
+#ifdef CONFIG_XFS_HEALTH_MONITOR
+TRACE_EVENT(xfs_healthmon_shutdown_hook,
+	TP_PROTO(const struct xfs_mount *mp, uint32_t shutdown_flags,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, shutdown_flags, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, shutdown_flags)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->shutdown_flags = shutdown_flags;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d shutdown_flags %s events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags(__entry->shutdown_flags, "|", XFS_SHUTDOWN_STRINGS),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_class,
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev),
+	TP_ARGS(mp, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#define DEFINE_HEALTHMON_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev), \
+	TP_ARGS(mp, events, lost_prev))
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_create);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_run);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_stop);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_exit);
+DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
+
+#define XFS_HEALTHMON_TYPE_STRINGS \
+	{ XFS_HEALTHMON_LOST,		"lost" }, \
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }
+
+#define XFS_HEALTHMON_DOMAIN_STRINGS \
+	{ XFS_HEALTHMON_MOUNT,		"mount" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+
+DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
+	TP_ARGS(mp, event),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = event->type;
+		__entry->domain = event->domain;
+		switch (__entry->domain) {
+		case XFS_HEALTHMON_MOUNT:
+			__entry->mask = event->flags;
+			break;
+		}
+	),
+	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHMON_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHMON_DOMAIN_STRINGS),
+		  __entry->mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group)
+);
+#define DEFINE_HEALTHMONEVENT_EVENT(name) \
+DEFINE_EVENT(xfs_healthmon_event_class, name, \
+	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event), \
+	TP_ARGS(mp, event))
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_push);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_pop);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_format);
+DEFINE_HEALTHMONEVENT_EVENT(xfs_healthmon_drop);
+#endif /* CONFIG_XFS_HEALTH_MONITOR */
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
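
Note (not part of the patch): the xfs_healthmon_validate() change above
rejects unknown flag bits, any format other than JSON, and nonzero padding.
The following is a minimal userspace sketch of that check.  The DEMO_*
constants, their values, and the struct layout are placeholders invented for
illustration; the real XFS_HEALTH_MONITOR_* definitions are not shown in this
patch, and demo_all_zero() merely stands in for the kernel's memchr_inv().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values; the real XFS_HEALTH_MONITOR_* constants are not shown. */
#define DEMO_MONITOR_VERBOSE	(1ULL << 0)
#define DEMO_MONITOR_ALL	(DEMO_MONITOR_VERBOSE)
#define DEMO_MONITOR_FMT_JSON	0

/* Hypothetical layout standing in for struct xfs_health_monitor. */
struct demo_health_monitor {
	uint64_t	flags;
	uint8_t		format;
	uint8_t		pad1[7];
};

/* Userspace stand-in for !memchr_inv(p, 0, len). */
static bool demo_all_zero(const void *p, size_t len)
{
	const unsigned char	*c = p;
	size_t			i;

	for (i = 0; i < len; i++)
		if (c[i] != 0)
			return false;
	return true;
}

/* Accept only known flag bits, the JSON format, and zeroed padding. */
bool demo_validate(const struct demo_health_monitor *hmo)
{
	assert(hmo != NULL);

	if (hmo->flags & ~DEMO_MONITOR_ALL)
		return false;
	if (hmo->format != DEMO_MONITOR_FMT_JSON)
		return false;
	if (!demo_all_zero(hmo->pad1, sizeof(hmo->pad1)))
		return false;
	return true;
}
```

Masking with ~ALL instead of testing for nonzero flags is what lets new flag
bits be added later without older kernels silently accepting them.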


* [PATCH 5/8] xfs: report metadata health events through healthmon
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-02-24  1:18   ` [PATCH 4/8] xfs: report shutdown events through healthmon Darrick J. Wong
@ 2024-02-24  1:18   ` Darrick J. Wong
  2024-02-24  1:19   ` [PATCH 6/8] xfs: report media errors " Darrick J. Wong
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:18 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Set up a metadata health event hook so that we can send events to
userspace as we collect information.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
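Note (not part of the commit): the filtering rule in
xfs_healthmon_event_mask() below decides which health updates reach
userspace.  Here is a minimal userspace sketch of that rule.  The demo_* names
are invented for illustration, and the reading of old_mask/new_mask (state
before the update versus the bits named by the update) is an assumption taken
from the comments in the patch, not from the xfs_health.h definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for enum xfs_health_update_type. */
enum demo_update_type {
	DEMO_SICK,	/* runtime corruption observed */
	DEMO_CORRUPT,	/* fsck reported corruption */
	DEMO_HEALTHY,	/* fsck reported healthy structure */
	DEMO_UNMOUNT,	/* filesystem is unmounting */
};

/* Hypothetical stand-in for struct xfs_health_update_params. */
struct demo_update {
	unsigned int	old_mask;
	unsigned int	new_mask;
};

/* Returns true if the update should be queued; *mask is what to report. */
bool demo_event_mask(bool verbose, enum demo_update_type type,
		     const struct demo_update *hup, unsigned int *mask)
{
	assert(hup != NULL && mask != NULL);
	*mask = 0;

	/* Unmounts are always reported and carry no mask. */
	if (type == DEMO_UNMOUNT)
		return true;

	/* Verbose mode forwards every update unfiltered. */
	if (verbose) {
		*mask = hup->new_mask;
		return true;
	}

	switch (type) {
	case DEMO_SICK:
		*mask = hup->new_mask;			/* always report */
		break;
	case DEMO_CORRUPT:
		*mask = hup->new_mask & ~hup->old_mask;	/* newly bad only */
		break;
	case DEMO_HEALTHY:
		*mask = hup->new_mask & hup->old_mask;	/* was bad, now ok */
		break;
	case DEMO_UNMOUNT:
		break;
	}

	/* Outside verbose mode, an empty mask means nothing changed. */
	return *mask != 0;
}
```

With old_mask 0x3 and new_mask 0x6, a "corrupt" update reports 0x4 and a
"healthy" update reports 0x2; repeating an already-known corruption reports
nothing unless verbose mode is on.
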
 fs/xfs/xfs_healthmon.c |  403 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_healthmon.h |   31 ++++
 fs/xfs/xfs_trace.h     |  102 ++++++++++++
 3 files changed, 532 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index b215ded0fda8b..d3b548a63f0b9 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -19,6 +19,7 @@
 #include "xfs_quota_defs.h"
 #include "xfs_rtgroup.h"
 #include "xfs_fsops.h"
+#include "xfs_health.h"
 #include "xfs_healthmon.h"
 
 /*
@@ -72,6 +73,86 @@
  *
  * "reasons" are a list of strings describing why the filesystem went down.
  * They correspond to the SHUTDOWN_* flags.
+ *
+ * Metadata Health Events
+ * ----------------------
+ *
+ * {
+ *	"type": "sick" | "corrupt" | "healthy",
+ *	"domain": "fs" | "realtime" | "ag" | "inode" | "rtgroup",
+ *	"structures": [structure string list...],
+ *
+ *	"group": integer,      (if domain is "ag" or "rtgroup")
+ *
+ *	"inode": integer,      (if domain is "inode")
+ *	"generation": integer, (if domain is "inode")
+ *
+ *	"time_ns": integer
+ * }
+ *
+ * "sick" means that metadata corruption was discovered during a runtime
+ * operation.
+ *
+ * "corrupt" means that corruption was discovered during an xfs_scrub run.
+ *
+ * "healthy" means that a metadata object was found to be ok by xfs_scrub.
+ *
+ * The domain item indicates where in the filesystem to find the metadata
+ * object(s) that are the target of the event.
+ *
+ * "fs" means whole-filesystem metadata.  Structures are as follows:
+ *
+ *     "fscounters": summary counters
+ *     "usrquota":   user quota records
+ *     "grpquota":   group quota records
+ *     "prjquota":   project quota records
+ *     "quotacheck": quota counters
+ *     "nlinks":     file link counts
+ *     "metadir":    metadata directory
+ *     "metapath":   metadata inode paths
+ *
+ * "realtime" means realtime volume metadata:
+ *
+ *     "bitmap":     realtime bitmap file
+ *     "summary":    realtime free space summary file
+ *
+ * "ag" means allocation group metadata on the data device:
+ *
+ *     "super":      superblock
+ *     "agf":        group space header
+ *     "agfl":       per-group free block list
+ *     "agi":        group inode header
+ *     "bnobt":      free space by position btree
+ *     "cntbt":      free space by length btree
+ *     "inobt":      inode btree
+ *     "finobt":     free inode btree
+ *     "rmapbt":     reverse mapping btree
+ *     "refcountbt": reference count btree
+ *     "inodes":     problems were recorded for this group's inodes, but the
+ *                   inodes themselves had to be reclaimed
+ *
+ * "inode" means inode metadata:
+ *
+ *     "core":       inode record
+ *     "bmapbtd":    data fork
+ *     "bmapbta":    attr fork
+ *     "bmapbtc":    cow fork
+ *     "directory":  directory entries and index
+ *     "xattr":      extended attributes and index
+ *     "symlink":    symbolic link target
+ *     "parent":     directory parent pointer
+ *     "bmapbtd_zapped":  these are set when an inode record repair had to drop
+ *     "bmapbta_zapped"   the corresponding data structure to get the inode
+ *     "directory_zapped" back to a consistent state
+ *     "symlink_zapped"
+ *     "dirtree":    directory tree problems detected
+ *
+ * "rtgroup" means realtime group metadata for the realtime volume:
+ *
+ *     "super":      group superblock
+ *     "bitmap":     free space bitmap contents for this group
+ *     "rmapbt":     reverse mapping btree
+ *     "refcountbt": reference count btree
  */
 
 #define XFS_HEALTHMON_MAX_EVENTS \
@@ -98,6 +179,7 @@ struct xfs_healthmon {
 
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
+	struct xfs_health_hook		hhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -145,8 +227,10 @@ xfs_healthmon_exit(
 	trace_xfs_healthmon_exit(hm->mp, hm->events, hm->lost_prev_event);
 
 	if (hm->mp) {
+		xfs_health_hook_del(hm->mp, &hm->hhook);
 		xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	}
+	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
@@ -291,6 +375,157 @@ xfs_healthmon_shutdown_hook(
 	return NOTIFY_DONE;
 }
 
+/* Compute the reporting mask. */
+static inline bool
+xfs_healthmon_event_mask(
+	struct xfs_healthmon			*hm,
+	enum xfs_health_update_type		type,
+	const struct xfs_health_update_params	*hup,
+	unsigned int				*mask)
+{
+	/* Always report unmounts. */
+	if (type == XFS_HEALTHUP_UNMOUNT)
+		return true;
+
+	/* If we want all events, return all events. */
+	if (hm->verbose) {
+		*mask = hup->new_mask;
+		return true;
+	}
+
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		/* Always report runtime corruptions */
+		*mask = hup->new_mask;
+		break;
+	case XFS_HEALTHUP_CORRUPT:
+		/* Only report new fsck errors */
+		*mask = hup->new_mask & ~hup->old_mask;
+		break;
+	case XFS_HEALTHUP_HEALTHY:
+		/* Only report healthy metadata that got fixed */
+		*mask = hup->new_mask & hup->old_mask;
+		break;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* This is here for static enum checking */
+		break;
+	}
+
+	/* If not in verbose mode, mask state has to change. */
+	return *mask != 0;
+}
+
+static inline enum xfs_healthmon_type
+health_update_to_type(
+	enum xfs_health_update_type	type)
+{
+	switch (type) {
+	case XFS_HEALTHUP_SICK:
+		return XFS_HEALTHMON_SICK;
+	case XFS_HEALTHUP_CORRUPT:
+		return XFS_HEALTHMON_CORRUPT;
+	case XFS_HEALTHUP_HEALTHY:
+		return XFS_HEALTHMON_HEALTHY;
+	case XFS_HEALTHUP_UNMOUNT:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_UNMOUNT;
+}
+
+static inline enum xfs_healthmon_domain
+health_update_to_domain(
+	enum xfs_health_update_domain	domain)
+{
+	switch (domain) {
+	case XFS_HEALTHUP_FS:
+		return XFS_HEALTHMON_FS;
+	case XFS_HEALTHUP_RT:
+		return XFS_HEALTHMON_RT;
+	case XFS_HEALTHUP_AG:
+		return XFS_HEALTHMON_AG;
+	case XFS_HEALTHUP_RTGROUP:
+		return XFS_HEALTHMON_RTGROUP;
+	case XFS_HEALTHUP_INODE:
+		/* static checking */
+		break;
+	}
+	return XFS_HEALTHMON_INODE;
+}
+
+/* Add a health event to the reporting queue. */
+STATIC int
+xfs_healthmon_metadata_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_health_update_params	*hup = data;
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	enum xfs_health_update_type	type = action;
+	unsigned int			mask = 0;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, hhook.health_hook.nb);
+
+	/* Decode event mask and skip events we don't care about. */
+	if (!xfs_healthmon_event_mask(hm, type, hup, &mask))
+		return NOTIFY_DONE;
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_metadata_hook(hm->mp, action, hup, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (type == XFS_HEALTHUP_UNMOUNT) {
+		/*
+		 * The filesystem is unmounting, so we must detach from the
+		 * mount.  After this point, the healthmon thread has no
+		 * connection to the mounted filesystem.
+		 */
+		trace_xfs_healthmon_unmount(hm->mp, hm->events,
+				hm->lost_prev_event);
+		hm->mp = NULL;
+		wake_up(&hm->wait);
+		goto out_unlock;
+	}
+
+	event = new_event(hm, health_update_to_type(type),
+			  health_update_to_domain(hup->domain));
+	if (!event)
+		goto out_unlock;
+
+	switch (event->domain) {
+	case XFS_HEALTHMON_FS:
+	case XFS_HEALTHMON_RT:
+		event->fsmask = mask;
+		break;
+	case XFS_HEALTHMON_AG:
+	case XFS_HEALTHMON_RTGROUP:
+		event->grpmask = mask;
+		event->group = hup->group;
+		break;
+	case XFS_HEALTHMON_INODE:
+		event->imask = mask;
+		event->ino = hup->ino;
+		event->gen = hup->gen;
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
+	xfs_healthmon_push(hm, event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -299,6 +534,10 @@ xfs_healthmon_typestring(
 	static const char *type_strings[] = {
 		[XFS_HEALTHMON_LOST]		= "lost",
 		[XFS_HEALTHMON_SHUTDOWN]	= "shutdown",
+		[XFS_HEALTHMON_UNMOUNT]		= "unmount",
+		[XFS_HEALTHMON_SICK]		= "sick",
+		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
+		[XFS_HEALTHMON_HEALTHY]		= "healthy",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -314,6 +553,11 @@ xfs_healthmon_domstring(
 {
 	static const char *dom_strings[] = {
 		[XFS_HEALTHMON_MOUNT]		= "mount",
+		[XFS_HEALTHMON_FS]		= "fs",
+		[XFS_HEALTHMON_RT]		= "realtime",
+		[XFS_HEALTHMON_AG]		= "ag",
+		[XFS_HEALTHMON_INODE]		= "inode",
+		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -339,6 +583,11 @@ xfs_healthmon_format_flags(
 		if (!(p->mask & flags))
 			continue;
 
+		if (!p->str) {
+			flags &= ~p->mask;
+			continue;
+		}
+
 		ret = stdio_redirect_printf(out, false, "%s\"%s\"",
 				first ? "" : ", ", p->str);
 		if (ret < 0)
@@ -408,6 +657,132 @@ xfs_healthmon_format_shutdown(
 			event->flags);
 }
 
+/* Render fs sickness mask as a string set */
+static ssize_t
+xfs_healthmon_format_fs(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_FS_COUNTERS,		"fscounters" },
+		{ XFS_SICK_FS_UQUOTA,		"usrquota" },
+		{ XFS_SICK_FS_GQUOTA,		"grpquota" },
+		{ XFS_SICK_FS_PQUOTA,		"prjquota" },
+		{ XFS_SICK_FS_QUOTACHECK,	"quotacheck" },
+		{ XFS_SICK_FS_NLINKS,		"nlinks" },
+		{ XFS_SICK_FS_METADIR,		"metadir" },
+		{ XFS_SICK_FS_METAPATH,		"metapath" },
+	};
+
+	return xfs_healthmon_format_mask(out, "structures", mask_strings,
+			event->fsmask);
+}
+
+/* Render rt sickness mask as a string set */
+static ssize_t
+xfs_healthmon_format_rt(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_RT_BITMAP,		"bitmap" },
+		{ XFS_SICK_RT_SUMMARY,		"summary" },
+	};
+
+	return xfs_healthmon_format_mask(out, "structures", mask_strings,
+			event->fsmask);
+}
+
+/* Render rtgroup sickness mask as a string set */
+static ssize_t
+xfs_healthmon_format_rtgroup(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_RG_SUPER,		"super" },
+		{ XFS_SICK_RG_BITMAP,		"bitmap" },
+		{ XFS_SICK_RG_RMAPBT,		"rmapbt" },
+		{ XFS_SICK_RG_REFCNTBT,		"refcountbt" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(out, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return stdio_redirect_printf(out, false, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render perag sickness mask as a string set */
+static ssize_t
+xfs_healthmon_format_ag(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_AG_SB,		"super" },
+		{ XFS_SICK_AG_AGF,		"agf" },
+		{ XFS_SICK_AG_AGFL,		"agfl" },
+		{ XFS_SICK_AG_AGI,		"agi" },
+		{ XFS_SICK_AG_BNOBT,		"bnobt" },
+		{ XFS_SICK_AG_CNTBT,		"cntbt" },
+		{ XFS_SICK_AG_INOBT,		"inobt" },
+		{ XFS_SICK_AG_FINOBT,		"finobt" },
+		{ XFS_SICK_AG_RMAPBT,		"rmapbt" },
+		{ XFS_SICK_AG_REFCNTBT,		"refcountbt" },
+		{ XFS_SICK_AG_INODES,		"inodes" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(out, "structures", mask_strings,
+			event->grpmask);
+	if (ret < 0)
+		return ret;
+
+	return stdio_redirect_printf(out, false, "  \"group\":      %u,\n",
+			event->group);
+}
+
+/* Render inode sickness mask as a string set */
+static ssize_t
+xfs_healthmon_format_inode(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	static const struct flag_string	mask_strings[] = {
+		{ XFS_SICK_INO_CORE,		"core" },
+		{ XFS_SICK_INO_BMBTD,		"bmapbtd" },
+		{ XFS_SICK_INO_BMBTA,		"bmapbta" },
+		{ XFS_SICK_INO_BMBTC,		"bmapbtc" },
+		{ XFS_SICK_INO_DIR,		"directory" },
+		{ XFS_SICK_INO_XATTR,		"xattr" },
+		{ XFS_SICK_INO_SYMLINK,		"symlink" },
+		{ XFS_SICK_INO_PARENT,		"parent" },
+		{ XFS_SICK_INO_BMBTD_ZAPPED,	"bmapbtd_zapped" },
+		{ XFS_SICK_INO_BMBTA_ZAPPED,	"bmapbta_zapped" },
+		{ XFS_SICK_INO_DIR_ZAPPED,	"directory_zapped" },
+		{ XFS_SICK_INO_SYMLINK_ZAPPED,	"symlink_zapped" },
+		{ XFS_SICK_INO_FORGET,		NULL, },
+		{ XFS_SICK_INO_DIRTREE,		"dirtree" },
+	};
+	ssize_t				ret;
+
+	ret = xfs_healthmon_format_mask(out, "structures", mask_strings,
+			event->imask);
+	if (ret < 0)
+		return ret;
+
+	ret = stdio_redirect_printf(out, false, "  \"inode\":      %llu,\n",
+			event->ino);
+	if (ret < 0)
+		return ret;
+	return stdio_redirect_printf(out, false, "  \"generation\": %u,\n",
+			event->gen);
+}
+
 /* Format an event into json. */
 STATIC int
 xfs_healthmon_format(
@@ -446,6 +821,21 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_MOUNT:
 		/* empty */
 		break;
+	case XFS_HEALTHMON_FS:
+		ret = xfs_healthmon_format_fs(out, event);
+		break;
+	case XFS_HEALTHMON_RT:
+		ret = xfs_healthmon_format_rt(out, event);
+		break;
+	case XFS_HEALTHMON_RTGROUP:
+		ret = xfs_healthmon_format_rtgroup(out, event);
+		break;
+	case XFS_HEALTHMON_AG:
+		ret = xfs_healthmon_format_ag(out, event);
+		break;
+	case XFS_HEALTHMON_INODE:
+		ret = xfs_healthmon_format_inode(out, event);
+		break;
 	}
 	if (ret < 0)
 		return ret;
@@ -551,22 +941,31 @@ xfs_healthmon_create(
 		hm->verbose = true;
 
 	xfs_shutdown_hook_enable();
+	xfs_health_hook_enable();
 
 	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
 	ret = xfs_shutdown_hook_add(mp, &hm->shook);
 	if (ret)
 		goto out_hooks;
 
-	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
-	if (ret < 0)
+	xfs_health_hook_setup(&hm->hhook, xfs_healthmon_metadata_hook);
+	ret = xfs_health_hook_add(mp, &hm->hhook);
+	if (ret)
 		goto out_shutdown;
 
+	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
+	if (ret < 0)
+		goto out_health;
+
 	trace_xfs_healthmon_create(mp, hmo->flags, hmo->format);
 
 	return ret;
+out_health:
+	xfs_health_hook_del(mp, &hm->hhook);
 out_shutdown:
 	xfs_shutdown_hook_del(mp, &hm->shook);
 out_hooks:
+	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
 	xfs_healthmon_free_events(hm);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index f67e2f1b8f947..e445a89decc57 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -11,10 +11,23 @@ enum xfs_healthmon_type {
 
 	/* filesystem shutdown */
 	XFS_HEALTHMON_SHUTDOWN,
+
+	/* metadata health events */
+	XFS_HEALTHMON_SICK,	/* runtime corruption observed */
+	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
 };
 
 enum xfs_healthmon_domain {
 	XFS_HEALTHMON_MOUNT,	/* affects the whole fs */
+
+	/* metadata health events */
+	XFS_HEALTHMON_FS,	/* main filesystem metadata */
+	XFS_HEALTHMON_RT,	/* realtime metadata */
+	XFS_HEALTHMON_AG,	/* allocation group metadata */
+	XFS_HEALTHMON_INODE,	/* inode metadata */
+	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
 };
 
 struct xfs_healthmon_event {
@@ -30,6 +43,24 @@ struct xfs_healthmon_event {
 		struct {
 			unsigned int	flags;
 		};
+		/* fs/rt metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	fsmask;
+		};
+		/* ag/rtgroup metadata */
+		struct {
+			/* XFS_SICK_* flags */
+			unsigned int	grpmask;
+			unsigned int	group;
+		};
+		/* inode metadata */
+		struct {
+			/* XFS_SICK_INO_* flags */
+			unsigned int	imask;
+			uint32_t	gen;
+			xfs_ino_t	ino;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 54e3d6d549ec1..2f296ba1db822 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -5949,6 +5949,72 @@ TRACE_EVENT(xfs_healthmon_shutdown_hook,
 		  __entry->lost_prev)
 );
 
+#define XFS_HEALTHUP_TYPE_STRINGS \
+	{ XFS_HEALTHUP_UNMOUNT,		"unmount" }, \
+	{ XFS_HEALTHUP_SICK,		"sick" }, \
+	{ XFS_HEALTHUP_CORRUPT,		"corrupt" }, \
+	{ XFS_HEALTHUP_HEALTHY,		"healthy" }
+
+#define XFS_HEALTHUP_DOMAIN_STRINGS \
+	{ XFS_HEALTHUP_FS,		"fs" }, \
+	{ XFS_HEALTHUP_RT,		"realtime" }, \
+	{ XFS_HEALTHUP_AG,		"ag" }, \
+	{ XFS_HEALTHUP_INODE,		"inode" }, \
+	{ XFS_HEALTHUP_RTGROUP,		"rtgroup" }
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_HEALTHY);
+
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RT);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHUP_RTGROUP);
+
+TRACE_EVENT(xfs_healthmon_metadata_hook,
+	TP_PROTO(const struct xfs_mount *mp, unsigned long type,
+		 const struct xfs_health_update_params *update,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, type, update, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long, type)
+		__field(unsigned int, domain)
+		__field(unsigned int, old_mask)
+		__field(unsigned int, new_mask)
+		__field(unsigned long long, ino)
+		__field(unsigned int, gen)
+		__field(unsigned int, group)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		__entry->dev = mp ? mp->m_super->s_dev : 0;
+		__entry->type = type;
+		__entry->domain = update->domain;
+		__entry->old_mask = update->old_mask;
+		__entry->new_mask = update->new_mask;
+		__entry->ino = update->ino;
+		__entry->gen = update->gen;
+		__entry->group = update->group;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d type %s domain %s oldmask 0x%x newmask 0x%x ino 0x%llx gen 0x%x group 0x%x events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_symbolic(__entry->type, XFS_HEALTHUP_TYPE_STRINGS),
+		  __print_symbolic(__entry->domain, XFS_HEALTHUP_DOMAIN_STRINGS),
+		  __entry->old_mask,
+		  __entry->new_mask,
+		  __entry->ino,
+		  __entry->gen,
+		  __entry->group,
+		  __entry->events,
+		  __entry->lost_prev)
+);
+
 DECLARE_EVENT_CLASS(xfs_healthmon_class,
 	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev),
 	TP_ARGS(mp, events, lost_prev),
@@ -5979,15 +6045,33 @@ DEFINE_HEALTHMON_EVENT(xfs_healthmon_unmount);
 
 #define XFS_HEALTHMON_TYPE_STRINGS \
 	{ XFS_HEALTHMON_LOST,		"lost" }, \
-	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }
+	{ XFS_HEALTHMON_SHUTDOWN,	"shutdown" }, \
+	{ XFS_HEALTHMON_UNMOUNT,	"unmount" }, \
+	{ XFS_HEALTHMON_SICK,		"sick" }, \
+	{ XFS_HEALTHMON_CORRUPT,	"corrupt" }, \
+	{ XFS_HEALTHMON_HEALTHY,	"healthy" }
 
 #define XFS_HEALTHMON_DOMAIN_STRINGS \
-	{ XFS_HEALTHMON_MOUNT,		"mount" }
+	{ XFS_HEALTHMON_MOUNT,		"mount" }, \
+	{ XFS_HEALTHMON_FS,		"fs" }, \
+	{ XFS_HEALTHMON_RT,		"realtime" }, \
+	{ XFS_HEALTHMON_AG,		"ag" }, \
+	{ XFS_HEALTHMON_INODE,		"inode" }, \
+	{ XFS_HEALTHMON_RTGROUP,	"rtgroup" }
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_LOST);
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_SHUTDOWN);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_UNMOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_SICK);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_CORRUPT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_HEALTHY);
 
 TRACE_DEFINE_ENUM(XFS_HEALTHMON_MOUNT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_FS);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RT);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_AG);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_INODE);
+TRACE_DEFINE_ENUM(XFS_HEALTHMON_RTGROUP);
 
 DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 	TP_PROTO(const struct xfs_mount *mp, const struct xfs_healthmon_event *event),
@@ -6009,6 +6093,20 @@ DECLARE_EVENT_CLASS(xfs_healthmon_event_class,
 		case XFS_HEALTHMON_MOUNT:
 			__entry->mask = event->flags;
 			break;
+		case XFS_HEALTHMON_FS:
+		case XFS_HEALTHMON_RT:
+			__entry->mask = event->fsmask;
+			break;
+		case XFS_HEALTHMON_AG:
+		case XFS_HEALTHMON_RTGROUP:
+			__entry->mask = event->grpmask;
+			__entry->group = event->group;
+			break;
+		case XFS_HEALTHMON_INODE:
+			__entry->mask = event->imask;
+			__entry->ino = event->ino;
+			__entry->gen = event->gen;
+			break;
 		}
 	),
 	TP_printk("dev %d:%d type %s domain %s mask 0x%x ino 0x%llx gen 0x%x group 0x%x",
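
Note (not part of the patch): the xfs_healthmon_format_flags() hunk above
teaches the flag renderer to skip table entries whose string is NULL, which is
how XFS_SICK_INO_FORGET stays out of the "structures" list.  Below is a minimal
userspace sketch of that idea which writes into a caller-supplied buffer with
snprintf() instead of stdio_redirect_printf().  All demo_* names are invented
for illustration, and the handling of bits that match no table entry is left
out because the patch context does not show it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the patch's struct flag_string. */
struct demo_flag_string {
	unsigned int	mask;
	const char	*str;	/* NULL: internal flag, never rendered */
};

/*
 * Render the set bits of @flags as a comma-separated list of quoted names.
 * Returns the number of characters written, or -1 if @buf is too small.
 */
int demo_format_flags(char *buf, size_t len,
		      const struct demo_flag_string *strings, size_t nr,
		      unsigned int flags)
{
	size_t	off = 0;
	size_t	i;
	int	first = 1;

	assert(buf != NULL && strings != NULL);
	if (len == 0)
		return -1;
	buf[0] = '\0';

	for (i = 0; i < nr; i++) {
		int	n;

		if (!(strings[i].mask & flags))
			continue;
		flags &= ~strings[i].mask;

		/* Consume internal-only bits without emitting anything. */
		if (!strings[i].str)
			continue;

		n = snprintf(buf + off, len - off, "%s\"%s\"",
				first ? "" : ", ", strings[i].str);
		if (n < 0 || (size_t)n >= len - off)
			return -1;
		off += (size_t)n;
		first = 0;
	}
	return (int)off;
}
```

Keeping the NULL entry in the table, rather than masking the bit off before
the call, keeps the list of known bits in one place.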


* [PATCH 6/8] xfs: report media errors through healthmon
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-02-24  1:18   ` [PATCH 5/8] xfs: report metadata health " Darrick J. Wong
@ 2024-02-24  1:19   ` Darrick J. Wong
  2024-02-24  1:19   ` [PATCH 7/8] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
  2024-02-24  1:19   ` [PATCH 8/8] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:19 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Set up a media error event hook so that we can send events to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
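Note (not part of the commit): xfs_healthmon_media_error_hook() below maps the
failing buffer target to a "datadev", "logdev" or "rtdev" domain.  A minimal
userspace sketch of that mapping follows.  The demo_* types are invented for
illustration, and the comment about an internal log sharing the data device's
target is an assumption drawn from the pointer comparison in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the XFS_HEALTHMON_*DEV domains. */
enum demo_domain {
	DEMO_DATADEV,
	DEMO_LOGDEV,
	DEMO_RTDEV,
};

/* Hypothetical stand-in for struct xfs_buftarg. */
struct demo_buftarg {
	int	id;
};

/* Hypothetical stand-in for the three target pointers in struct xfs_mount. */
struct demo_mount {
	struct demo_buftarg	*ddev;
	struct demo_buftarg	*logdev;
	struct demo_buftarg	*rtdev;
};

/* Decide which device a media error on @btp belongs to. */
enum demo_domain demo_classify(const struct demo_mount *mp,
			       const struct demo_buftarg *btp)
{
	assert(mp != NULL && btp != NULL);

	/*
	 * Assumed: with an internal log, logdev aliases ddev, so only an
	 * external log counts as its own device.
	 */
	if (mp->logdev != mp->ddev && mp->logdev == btp)
		return DEMO_LOGDEV;
	if (mp->rtdev == btp)
		return DEMO_RTDEV;
	return DEMO_DATADEV;
}
```

The order of the tests matters: checking the log target without the aliasing
guard would report every data device error as a log device error on
filesystems with an internal log.
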
 fs/xfs/xfs_buf.c            |    1 
 fs/xfs/xfs_healthmon.c      |  107 ++++++++++++++++++++++++++++-
 fs/xfs/xfs_healthmon.h      |   13 +++
 fs/xfs/xfs_mount.h          |    3 +
 fs/xfs/xfs_notify_failure.c |  161 ++++++++++++++++++++++++++++++++++---------
 fs/xfs/xfs_notify_failure.h |   42 +++++++++++
 fs/xfs/xfs_super.c          |    1 
 fs/xfs/xfs_super.h          |    1 
 fs/xfs/xfs_trace.c          |    1 
 fs/xfs/xfs_trace.h          |   35 +++++++++
 10 files changed, 330 insertions(+), 35 deletions(-)
 create mode 100644 fs/xfs/xfs_notify_failure.h


diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index b11515f7f270f..1e21c508e4982 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -23,6 +23,7 @@
 #include "xfs_ag.h"
 #include "xfs_buf_mem.h"
 #include "xfs_timestats.h"
+#include "xfs_notify_failure.h"
 
 struct kmem_cache *xfs_buf_cache;
 
diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index d3b548a63f0b9..34efc5b5d85e3 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -21,6 +21,7 @@
 #include "xfs_fsops.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 /*
  * Live Health Monitoring
@@ -153,6 +154,21 @@
  *     "bitmap":     free space bitmap contents for this group
  *     "rmapbt":     reverse mapping btree
  *     "refcountbt": reference count btree
+ *
+ * Media Failures
+ * --------------
+ *
+ * {
+ *	"type": "media",
+ *	"domain": "datadev" | "logdev" | "rtdev",
+ *	"daddr": integer,
+ *	"bbcount": integer,
+ *	"time_ns": integer
+ * }
+ *
+ * The domain element tells us which device reported a media failure.  The
+ * daddr and bbcount elements tell us where inside that device the failure was
+ * observed.
  */
 
 #define XFS_HEALTHMON_MAX_EVENTS \
@@ -180,6 +196,7 @@ struct xfs_healthmon {
 	/* live update hooks */
 	struct xfs_shutdown_hook	shook;
 	struct xfs_health_hook		hhook;
+	struct xfs_media_error_hook	mhook;
 
 	/* filesystem mount, or NULL if we've unmounted */
 	struct xfs_mount		*mp;
@@ -227,9 +244,11 @@ xfs_healthmon_exit(
 	trace_xfs_healthmon_exit(hm->mp, hm->events, hm->lost_prev_event);
 
 	if (hm->mp) {
+		xfs_media_error_hook_del(hm->mp, &hm->mhook);
 		xfs_health_hook_del(hm->mp, &hm->hhook);
 		xfs_shutdown_hook_del(hm->mp, &hm->shook);
 	}
+	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
@@ -526,6 +545,55 @@ xfs_healthmon_metadata_hook(
 	return NOTIFY_DONE;
 }
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+/* Add a media error event to the reporting queue. */
+STATIC int
+xfs_healthmon_media_error_hook(
+	struct notifier_block		*nb,
+	unsigned long			action,
+	void				*data)
+{
+	struct xfs_healthmon		*hm;
+	struct xfs_healthmon_event	*event;
+	struct xfs_media_error_params	*p = data;
+	struct xfs_mount		*mp = p->btp->bt_mount;
+	enum xfs_healthmon_domain	domain;
+	int				error;
+
+	hm = container_of(nb, struct xfs_healthmon, mhook.error_hook.nb);
+
+	mutex_lock(&hm->lock);
+
+	trace_xfs_healthmon_media_error_hook(hm->mp, p, hm->events,
+			hm->lost_prev_event);
+
+	error = xfs_healthmon_start_live_update(hm);
+	if (error)
+		goto out_unlock;
+
+	if (mp->m_logdev_targp != mp->m_ddev_targp &&
+	    mp->m_logdev_targp == p->btp) {
+		domain = XFS_HEALTHMON_LOGDEV;
+	} else if (mp->m_rtdev_targp == p->btp) {
+		domain = XFS_HEALTHMON_RTDEV;
+	} else {
+		domain = XFS_HEALTHMON_DATADEV;
+	}
+
+	event = new_event(hm, XFS_HEALTHMON_MEDIA_ERROR, domain);
+	if (!event)
+		goto out_unlock;
+
+	event->daddr = p->daddr;
+	event->bbcount = p->bbcount;
+	xfs_healthmon_push(hm, event);
+
+out_unlock:
+	mutex_unlock(&hm->lock);
+	return NOTIFY_DONE;
+}
+#endif
+
 /* Render the health update type as a string. */
 STATIC const char *
 xfs_healthmon_typestring(
@@ -538,6 +606,7 @@ xfs_healthmon_typestring(
 		[XFS_HEALTHMON_SICK]		= "sick",
 		[XFS_HEALTHMON_CORRUPT]		= "corrupt",
 		[XFS_HEALTHMON_HEALTHY]		= "healthy",
+		[XFS_HEALTHMON_MEDIA_ERROR]	= "media",
 	};
 
 	if (event->type >= ARRAY_SIZE(type_strings))
@@ -558,6 +627,9 @@ xfs_healthmon_domstring(
 		[XFS_HEALTHMON_AG]		= "ag",
 		[XFS_HEALTHMON_INODE]		= "inode",
 		[XFS_HEALTHMON_RTGROUP]		= "rtgroup",
+		[XFS_HEALTHMON_DATADEV]		= "datadev",
+		[XFS_HEALTHMON_LOGDEV]		= "logdev",
+		[XFS_HEALTHMON_RTDEV]		= "rtdev",
 	};
 
 	if (event->domain >= ARRAY_SIZE(dom_strings))
@@ -783,6 +855,23 @@ xfs_healthmon_format_inode(
 			event->gen);
 }
 
+/* Render the media error location (daddr, bbcount) as json fields */
+static ssize_t
+xfs_healthmon_format_media_error(
+	struct stdio_redirect		*out,
+	const struct xfs_healthmon_event *event)
+{
+	ssize_t				ret;
+
+	ret = stdio_redirect_printf(out, false, "  \"daddr\":      %llu,\n",
+			event->daddr);
+	if (ret < 0)
+		return ret;
+
+	return stdio_redirect_printf(out, false, "  \"bbcount\":    %llu,\n",
+			event->bbcount);
+}
+
 /* Format an event into json. */
 STATIC int
 xfs_healthmon_format(
@@ -836,6 +925,11 @@ xfs_healthmon_format(
 	case XFS_HEALTHMON_INODE:
 		ret = xfs_healthmon_format_inode(out, event);
 		break;
+	case XFS_HEALTHMON_DATADEV:
+	case XFS_HEALTHMON_LOGDEV:
+	case XFS_HEALTHMON_RTDEV:
+		ret = xfs_healthmon_format_media_error(out, event);
+		break;
 	}
 	if (ret < 0)
 		return ret;
@@ -942,6 +1036,7 @@ xfs_healthmon_create(
 
 	xfs_shutdown_hook_enable();
 	xfs_health_hook_enable();
+	xfs_media_error_hook_enable();
 
 	xfs_shutdown_hook_setup(&hm->shook, xfs_healthmon_shutdown_hook);
 	ret = xfs_shutdown_hook_add(mp, &hm->shook);
@@ -953,18 +1048,26 @@ xfs_healthmon_create(
 	if (ret)
 		goto out_shutdown;
 
-	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
-	if (ret < 0)
+	xfs_media_error_hook_setup(&hm->mhook, xfs_healthmon_media_error_hook);
+	ret = xfs_media_error_hook_add(mp, &hm->mhook);
+	if (ret)
 		goto out_health;
 
+	ret = run_thread_with_stdout(&hm->thread, &xfs_healthmon_ops);
+	if (ret < 0)
+		goto out_media;
+
 	trace_xfs_healthmon_create(mp, hmo->flags, hmo->format);
 
 	return ret;
+out_media:
+	xfs_media_error_hook_del(mp, &hm->mhook);
 out_health:
 	xfs_health_hook_del(mp, &hm->hhook);
 out_shutdown:
 	xfs_shutdown_hook_del(mp, &hm->shook);
 out_hooks:
+	xfs_media_error_hook_disable();
 	xfs_health_hook_disable();
 	xfs_shutdown_hook_disable();
 	mutex_destroy(&hm->lock);
diff --git a/fs/xfs/xfs_healthmon.h b/fs/xfs/xfs_healthmon.h
index e445a89decc57..97d77ea9285f6 100644
--- a/fs/xfs/xfs_healthmon.h
+++ b/fs/xfs/xfs_healthmon.h
@@ -17,6 +17,9 @@ enum xfs_healthmon_type {
 	XFS_HEALTHMON_CORRUPT,	/* fsck reported corruption */
 	XFS_HEALTHMON_HEALTHY,	/* fsck reported healthy structure */
 	XFS_HEALTHMON_UNMOUNT,	/* filesystem is unmounting */
+
+	/* media errors */
+	XFS_HEALTHMON_MEDIA_ERROR,
 };
 
 enum xfs_healthmon_domain {
@@ -28,6 +31,11 @@ enum xfs_healthmon_domain {
 	XFS_HEALTHMON_AG,	/* allocation group metadata */
 	XFS_HEALTHMON_INODE,	/* inode metadata */
 	XFS_HEALTHMON_RTGROUP,	/* realtime group metadata */
+
+	/* media errors */
+	XFS_HEALTHMON_DATADEV,
+	XFS_HEALTHMON_RTDEV,
+	XFS_HEALTHMON_LOGDEV,
 };
 
 struct xfs_healthmon_event {
@@ -61,6 +69,11 @@ struct xfs_healthmon_event {
 			uint32_t	gen;
 			xfs_ino_t	ino;
 		};
+		/* media errors */
+		struct {
+			xfs_daddr_t	daddr;
+			uint64_t	bbcount;
+		};
 	};
 };
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index f1db647b94871..4bfe9c80d8abd 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -290,6 +290,9 @@ typedef struct xfs_mount {
 	/* Hook to feed shutdown events to a daemon. */
 	struct xfs_hooks	m_shutdown_hooks;
 
+	/* Hook to feed media error events to a daemon. */
+	struct xfs_hooks	m_media_error_hooks;
+
 	struct xfs_timestats	m_timestats;
 } xfs_mount_t;
 
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index fa50e5308292d..db15db7650c26 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -19,6 +19,7 @@
 #include "xfs_rtalloc.h"
 #include "xfs_trans.h"
 #include "xfs_ag.h"
+#include "xfs_notify_failure.h"
 
 #include <linux/mm.h>
 #include <linux/dax.h>
@@ -255,6 +256,112 @@ xfs_dax_notify_ddev_failure(
 	return error;
 }
 
+static int
+xfs_dax_translate_range(
+	struct xfs_buftarg	*btp,
+	u64			offset,
+	u64			len,
+	xfs_daddr_t		*daddr,
+	uint64_t		*bbcount)
+{
+	u64			ddev_start;
+	u64			ddev_end;
+
+	ddev_start = btp->bt_dax_part_off;
+	ddev_end = ddev_start + bdev_nr_bytes(btp->bt_bdev) - 1;
+
+	/* Notify failure on the whole device. */
+	if (offset == 0 && len == U64_MAX) {
+		offset = ddev_start;
+		len = bdev_nr_bytes(btp->bt_bdev);
+	}
+
+	/* Ignore the range out of filesystem area */
+	if (offset + len - 1 < ddev_start)
+		return -ENXIO;
+	if (offset > ddev_end)
+		return -ENXIO;
+
+	/* Calculate the real range when it touches the boundary */
+	if (offset > ddev_start)
+		offset -= ddev_start;
+	else {
+		len -= ddev_start - offset;
+		offset = 0;
+	}
+	if (offset + len - 1 > ddev_end)
+		len = ddev_end - offset + 1;
+
+	*daddr = BTOBB(offset);
+	*bbcount = BTOBB(len);
+	return 0;
+}
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+DEFINE_STATIC_XFS_HOOK_SWITCH(xfs_media_error_hooks_switch);
+
+void
+xfs_media_error_hook_disable(void)
+{
+	xfs_hooks_switch_off(&xfs_media_error_hooks_switch);
+}
+
+void
+xfs_media_error_hook_enable(void)
+{
+	xfs_hooks_switch_on(&xfs_media_error_hooks_switch);
+}
+
+/* Call downstream hooks for a media error. */
+static inline void
+xfs_media_error_hook(
+	struct xfs_mount		*mp,
+	struct xfs_buftarg		*btp,
+	xfs_daddr_t			daddr,
+	uint64_t			bbcount,
+	int				mf_flags)
+{
+	if (xfs_hooks_switched_on(&xfs_media_error_hooks_switch)) {
+		struct xfs_media_error_params p = {
+			.btp		= btp,
+			.daddr		= daddr,
+			.bbcount	= bbcount,
+		};
+
+		xfs_hooks_call(&mp->m_media_error_hooks, 0, &p);
+	}
+}
+
+/* Call the specified function during a media error. */
+int
+xfs_media_error_hook_add(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	return xfs_hooks_add(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Stop calling the specified function during a media error. */
+void
+xfs_media_error_hook_del(
+	struct xfs_mount		*mp,
+	struct xfs_media_error_hook	*hook)
+{
+	xfs_hooks_del(&mp->m_media_error_hooks, &hook->error_hook);
+}
+
+/* Configure media error hook functions. */
+void
+xfs_media_error_hook_setup(
+	struct xfs_media_error_hook	*hook,
+	notifier_fn_t			mod_fn)
+{
+	xfs_hook_setup(&hook->error_hook, mod_fn);
+}
+#else
+# define xfs_media_error_hook(...)		((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 static int
 xfs_dax_notify_failure(
 	struct dax_device	*dax_dev,
@@ -263,22 +370,38 @@ xfs_dax_notify_failure(
 	int			mf_flags)
 {
 	struct xfs_mount	*mp = dax_holder(dax_dev);
-	u64			ddev_start;
-	u64			ddev_end;
+	struct xfs_buftarg	*btp;
+	xfs_daddr_t		daddr;
+	uint64_t		bbcount;
+	int			error;
 
 	if (!(mp->m_super->s_flags & SB_BORN)) {
 		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
 		return -EIO;
 	}
 
-	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev) {
+	if (mp->m_rtdev_targp && mp->m_rtdev_targp->bt_daxdev == dax_dev)
+		btp = mp->m_rtdev_targp;
+	else if (mp->m_logdev_targp != mp->m_ddev_targp &&
+		 mp->m_logdev_targp->bt_daxdev == dax_dev)
+		btp = mp->m_logdev_targp;
+	else
+		btp = mp->m_ddev_targp;
+
+	error = xfs_dax_translate_range(btp, offset, len, &daddr, &bbcount);
+	if (error)
+		return error;
+
+	xfs_media_error_hook(mp, btp, daddr, bbcount, mf_flags);
+
+	if (mp->m_rtdev_targp == btp) {
 		xfs_debug(mp,
 			 "notify_failure() not supported on realtime device!");
 		return -EOPNOTSUPP;
 	}
 
-	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
-	    mp->m_logdev_targp != mp->m_ddev_targp) {
+	if (mp->m_logdev_targp != mp->m_ddev_targp &&
+	    mp->m_logdev_targp == btp) {
 		/*
 		 * In the pre-remove case the failure notification is attempting
 		 * to trigger a force unmount.  The expectation is that the
@@ -297,33 +420,7 @@ xfs_dax_notify_failure(
 		return -EOPNOTSUPP;
 	}
 
-	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
-	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
-
-	/* Notify failure on the whole device. */
-	if (offset == 0 && len == U64_MAX) {
-		offset = ddev_start;
-		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
-	}
-
-	/* Ignore the range out of filesystem area */
-	if (offset + len - 1 < ddev_start)
-		return -ENXIO;
-	if (offset > ddev_end)
-		return -ENXIO;
-
-	/* Calculate the real range when it touches the boundary */
-	if (offset > ddev_start)
-		offset -= ddev_start;
-	else {
-		len -= ddev_start - offset;
-		offset = 0;
-	}
-	if (offset + len - 1 > ddev_end)
-		len = ddev_end - offset + 1;
-
-	return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
-			mf_flags);
+	return xfs_dax_notify_ddev_failure(mp, daddr, bbcount, mf_flags);
 }
 
 const struct dax_holder_operations xfs_dax_holder_operations = {
diff --git a/fs/xfs/xfs_notify_failure.h b/fs/xfs/xfs_notify_failure.h
new file mode 100644
index 0000000000000..71dc6e4766c57
--- /dev/null
+++ b/fs/xfs/xfs_notify_failure.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_NOTIFY_FAILURE_H__
+#define __XFS_NOTIFY_FAILURE_H__
+
+extern const struct dax_holder_operations xfs_dax_holder_operations;
+
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+struct xfs_media_error_params {
+	struct xfs_buftarg		*btp;
+	xfs_daddr_t			daddr;
+	uint64_t			bbcount;
+	int				mf_flags;
+};
+
+struct xfs_media_error_hook {
+	struct xfs_hook			error_hook;
+};
+
+void xfs_media_error_hook_disable(void);
+void xfs_media_error_hook_enable(void);
+
+int xfs_media_error_hook_add(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_del(struct xfs_mount *mp,
+		struct xfs_media_error_hook *hook);
+void xfs_media_error_hook_setup(struct xfs_media_error_hook *hook,
+		notifier_fn_t mod_fn);
+#else
+struct xfs_media_error_params { };
+struct xfs_media_error_hook { };
+# define xfs_media_error_hook_disable()		((void)0)
+# define xfs_media_error_hook_enable()		((void)0)
+# define xfs_media_error_hook_add(...)		(0)
+# define xfs_media_error_hook_del(...)		((void)0)
+# define xfs_media_error_hook_setup(...)	((void)0)
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
+#endif /* __XFS_NOTIFY_FAILURE_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1ed848a3706be..5aa51d5402809 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -2083,6 +2083,7 @@ static int xfs_init_fs_context(
 	xfs_hooks_init(&mp->m_dir_update_hooks);
 	xfs_hooks_init(&mp->m_shutdown_hooks);
 	xfs_hooks_init(&mp->m_health_update_hooks);
+	xfs_hooks_init(&mp->m_media_error_hooks);
 	xfs_timestats_init(mp);
 
 	fc->s_fs_info = mp;
diff --git a/fs/xfs/xfs_super.h b/fs/xfs/xfs_super.h
index 302e6e5d6c7e2..c0e85c1e42f27 100644
--- a/fs/xfs/xfs_super.h
+++ b/fs/xfs/xfs_super.h
@@ -92,7 +92,6 @@ extern xfs_agnumber_t xfs_set_inode_alloc(struct xfs_mount *,
 
 extern const struct export_operations xfs_export_operations;
 extern const struct quotactl_ops xfs_quotactl_operations;
-extern const struct dax_holder_operations xfs_dax_holder_operations;
 
 extern void xfs_reinit_percpu_counters(struct xfs_mount *mp);
 
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 14bc3f8cf306d..8e0bddaa2df2c 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -49,6 +49,7 @@
 #include "xfs_fsrefs.h"
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
+#include "xfs_notify_failure.h"
 
 static inline void
 xfs_rmapbt_crack_agno_opdev(
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2f296ba1db822..f5be973be5433 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -104,6 +104,7 @@ struct xfs_refcount_intent;
 struct xfs_fsrefs;
 struct xfs_healthmon_event;
 struct xfs_health_update_params;
+struct xfs_media_error_params;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -6015,6 +6016,40 @@ TRACE_EVENT(xfs_healthmon_metadata_hook,
 		  __entry->lost_prev)
 );
 
+#if defined(CONFIG_XFS_LIVE_HOOKS) && defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_FS_DAX)
+TRACE_EVENT(xfs_healthmon_media_error_hook,
+	TP_PROTO(const struct xfs_mount *mp,
+		 const struct xfs_media_error_params *p,
+		 unsigned int events, bool lost_prev),
+	TP_ARGS(mp, p, events, lost_prev),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(dev_t, error_dev)
+		__field(uint64_t, daddr)
+		__field(uint64_t, bbcount)
+		__field(unsigned int, events)
+		__field(bool, lost_prev)
+	),
+	TP_fast_assign(
+		if (mp) {
+			__entry->dev = mp->m_super->s_dev;
+			__entry->error_dev = p->btp->bt_dev;
+		}
+		__entry->daddr = p->daddr;
+		__entry->bbcount = p->bbcount;
+		__entry->events = events;
+		__entry->lost_prev = lost_prev;
+	),
+	TP_printk("dev %d:%d error_dev %d:%d daddr 0x%llx bbcount 0x%llx events %u lost_prev? %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  MAJOR(__entry->error_dev), MINOR(__entry->error_dev),
+		  __entry->daddr,
+		  __entry->bbcount,
+		  __entry->events,
+		  __entry->lost_prev)
+);
+#endif
+
 DECLARE_EVENT_CLASS(xfs_healthmon_class,
 	TP_PROTO(const struct xfs_mount *mp, unsigned int events, bool lost_prev),
 	TP_ARGS(mp, events, lost_prev),



* [PATCH 7/8] xfs: allow reconfiguration of the health monitoring device
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-02-24  1:19   ` [PATCH 6/8] xfs: report media errors " Darrick J. Wong
@ 2024-02-24  1:19   ` Darrick J. Wong
  2024-02-24  1:19   ` [PATCH 8/8] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:19 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that we can reconfigure the health monitoring device by
calling the XFS_IOC_HEALTH_MONITOR ioctl on it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
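A minimal userspace sketch of the reconfiguration call, for illustration
only.  It assumes the staging definitions from xfs_fs_staging.h in the
companion xfsprogs series (copied locally here) and that mon_fd is the fd
returned by the original XFS_IOC_HEALTH_MONITOR call.  The helper name
healthmon_set_verbose() is made up, and setting the JSON format is an
assumption about what xfs_healthmon_validate() accepts.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Local copies of the staging definitions from xfs_fs_staging.h. */
struct xfs_health_monitor {
	uint64_t	flags;		/* flags */
	uint8_t		format;		/* output format */
	uint8_t		pad1[7];	/* zeroes */
	uint64_t	pad2[2];	/* zeroes */
};

#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
#define XFS_IOC_HEALTH_MONITOR	_IOR('X', 48, struct xfs_health_monitor)

/* Catch accidental layout changes in the local copy of the struct. */
_Static_assert(sizeof(struct xfs_health_monitor) == 32,
		"unexpected xfs_health_monitor size");

/*
 * Hypothetical helper: toggle verbose reporting on an already open
 * monitor fd.  Returns 0 on success or -1 with errno set, like ioctl().
 */
static int
healthmon_set_verbose(
	int			mon_fd,
	bool			verbose)
{
	struct xfs_health_monitor	hmo;

	/* Padding must be zeroed before the struct goes to the kernel. */
	memset(&hmo, 0, sizeof(hmo));
	hmo.format = XFS_HEALTH_MONITOR_FMT_JSON;
	if (verbose)
		hmo.flags |= XFS_HEALTH_MONITOR_VERBOSE;

	return ioctl(mon_fd, XFS_IOC_HEALTH_MONITOR, &hmo);
}
```

Note that the handler in this patch only applies the verbose flag; the
format field is validated but not changed on reconfiguration.
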
 fs/xfs/xfs_healthmon.c |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)


diff --git a/fs/xfs/xfs_healthmon.c b/fs/xfs/xfs_healthmon.c
index 34efc5b5d85e3..27cfca98164eb 100644
--- a/fs/xfs/xfs_healthmon.c
+++ b/fs/xfs/xfs_healthmon.c
@@ -22,6 +22,8 @@
 #include "xfs_health.h"
 #include "xfs_healthmon.h"
 #include "xfs_notify_failure.h"
+#include "xfs_fs.h"
+#include "xfs_ioctl.h"
 
 /*
  * Live Health Monitoring
@@ -995,9 +997,36 @@ xfs_healthmon_validate(
 	return true;
 }
 
+/* Handle ioctls for the health monitoring thread. */
+STATIC long
+xfs_healthmon_ioctl(
+	struct thread_with_stdio	*thr,
+	unsigned int			cmd,
+	unsigned long			p)
+{
+	struct xfs_health_monitor	hmo;
+	struct xfs_healthmon		*hm = to_healthmon(thr);
+	void			__user *arg = (void __user *)p;
+
+	if (cmd != XFS_IOC_HEALTH_MONITOR)
+		return -ENOTTY;
+
+	if (copy_from_user(&hmo, arg, sizeof(hmo)))
+		return -EFAULT;
+
+	if (!xfs_healthmon_validate(&hmo))
+		return -EINVAL;
+
+	mutex_lock(&hm->lock);
+	hm->verbose = !!(hmo.flags & XFS_HEALTH_MONITOR_VERBOSE);
+	mutex_unlock(&hm->lock);
+	return 0;
+}
+
 static const struct thread_with_stdio_ops xfs_healthmon_ops = {
 	.exit		= xfs_healthmon_exit,
 	.fn		= xfs_healthmon_run,
+	.unlocked_ioctl	= xfs_healthmon_ioctl,
 };
 
 /*



* [PATCH 8/8] xfs: send uevents when mounting and unmounting a filesystem
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
                     ` (6 preceding siblings ...)
  2024-02-24  1:19   ` [PATCH 7/8] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
@ 2024-02-24  1:19   ` Darrick J. Wong
  7 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:19 UTC
  To: kent.overstreet, djwong; +Cc: linux-xfs, linux-bcachefs

From: Darrick J. Wong <djwong@kernel.org>

Send uevents when we mount and unmount the filesystem, so that we can
trigger systemd services.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
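For anyone wiring up a listener by hand instead of going through udev,
here is a rough sketch of how the TYPE/SOURCE/SID variables added by this
patch could be picked out of a raw uevent.  It assumes the usual kernel
netlink uevent payload (an "action@devpath" header followed by
NUL-separated KEY=value strings).  The helper name uevent_getenv() is
hypothetical and not part of this series.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: look up "key" in a NUL-separated uevent payload
 * of "len" bytes.  Returns a pointer to the NUL-terminated value, or
 * NULL if the key is not present.
 */
static const char *
uevent_getenv(
	const char		*buf,
	size_t			len,
	const char		*key)
{
	size_t			klen = strlen(key);
	size_t			off = 0;

	while (off < len) {
		const char	*s = buf + off;
		const char	*end = memchr(s, '\0', len - off);
		size_t		slen;

		/* Ignore a trailing string that lacks its terminator. */
		if (!end)
			break;
		slen = end - s;

		/* Match "key=" exactly, so "S" does not match "SID=". */
		if (slen > klen && s[klen] == '=' && !strncmp(s, key, klen))
			return s + klen + 1;

		off += slen + 1;
	}

	return NULL;
}
```

A listener would call this on each message received from a
NETLINK_KOBJECT_UEVENT socket and check for TYPE=mount before looking at
SID (and SOURCE, which is only sent on mount).
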
 fs/xfs/xfs_super.c |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)


diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5aa51d5402809..06f0c00988fc8 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1136,12 +1136,28 @@ xfs_inodegc_free_percpu(
 	free_percpu(mp->m_inodegc);
 }
 
+static void
+xfs_send_unmount_uevent(
+	struct xfs_mount	*mp)
+{
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		sid,
+		NULL,
+	};
+
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_REMOVE, env);
+}
+
 static void
 xfs_fs_put_super(
 	struct super_block	*sb)
 {
 	struct xfs_mount	*mp = XFS_M(sb);
 
+	xfs_send_unmount_uevent(mp);
 	xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);
 	xfs_filestream_unmount(mp);
 	xfs_unmountfs(mp);
@@ -1504,6 +1520,29 @@ xfs_debugfs_mkdir(
 	return child;
 }
 
+/*
+ * Send a uevent signalling that the mount succeeded so we can use udev rules
+ * to start background services.
+ */
+static void
+xfs_send_mount_uevent(
+	struct fs_context	*fc,
+	struct xfs_mount	*mp)
+{
+	char			source[256] = "";
+	char			sid[256] = "";
+	char			*env[] = {
+		"TYPE=mount",
+		source,
+		sid,
+		NULL,
+	};
+
+	snprintf(source, sizeof(source), "SOURCE=%s", fc->source);
+	snprintf(sid, sizeof(sid), "SID=%s", mp->m_super->s_id);
+	kobject_uevent_env(&mp->m_kobj.kobject, KOBJ_ADD, env);
+}
+
 static int
 xfs_fs_fill_super(
 	struct super_block	*sb,
@@ -1810,6 +1849,7 @@ xfs_fs_fill_super(
 		mp->m_debugfs_uuid = NULL;
 	}
 
+	xfs_send_mount_uevent(fc, mp);
 	return 0;
 
  out_filestream_unmount:



* [PATCHSET RFC] xfsprogs: live health monitoring of filesystems
  2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
                   ` (5 preceding siblings ...)
  2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
@ 2024-02-24  1:34 ` Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 1/7] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
                     ` (6 more replies)
  6 siblings, 7 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:34 UTC
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

Hi all,

This patchset builds off of Kent Overstreet's thread_with_file (twf) code
to deliver live information about filesystem health events to userspace.
This is done by creating a twf file and hooking internal operations so
that the event information can be queued to the twf without stalling the
kernel if the twf client program is nonresponsive.  Because this is a
private ioctl, events are expressed as simple JSON objects, which lets us
enrich the output later on without having to rev a ton of C structs.
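
To illustrate the "queue without stalling" idea, here is a toy sketch of
a bounded event queue that drops new events when the reader falls behind
and remembers that something was lost.  This is not the code in
xfs_healthmon.c; the evq_* names and the queue depth are invented, and
lost_prev only loosely mirrors the lost_prev_event field that shows up in
the healthmon tracepoints.

```c
#include <stdbool.h>
#include <stddef.h>

#define EVQ_DEPTH	4	/* hypothetical queue depth */

struct evq {
	int		events[EVQ_DEPTH];
	size_t		head;		/* index of the oldest event */
	size_t		count;		/* number of queued events */
	bool		lost_prev;	/* a push was dropped since last pop */
};

/*
 * Queue an event without ever blocking.  If the queue is full, drop
 * the event and remember the loss so the reader can be told about it.
 */
static bool
evq_push(
	struct evq		*q,
	int			ev)
{
	if (q->count == EVQ_DEPTH) {
		q->lost_prev = true;
		return false;
	}

	q->events[(q->head + q->count) % EVQ_DEPTH] = ev;
	q->count++;
	return true;
}

/*
 * Pop the oldest event.  *lost reports whether any push was dropped
 * since the previous pop.  Returns false if the queue is empty.
 */
static bool
evq_pop(
	struct evq		*q,
	int			*ev,
	bool			*lost)
{
	if (q->count == 0)
		return false;

	*ev = q->events[q->head];
	q->head = (q->head + 1) % EVQ_DEPTH;
	q->count--;
	*lost = q->lost_prev;
	q->lost_prev = false;
	return true;
}
```

The point of the design is that the producer side (the hook) only ever
does a bounded amount of work, so a stuck reader costs events, not
kernel progress.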

In userspace, we create a new daemon program that will read the json
event objects and initiate repairs automatically.  This daemon is
managed entirely by systemd and will not block unmounting of the
filesystem unless repairs are ongoing.  It is autostarted via some
horrible udev rules.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=health-monitoring

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=health-monitoring

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=health-monitoring
---
Commits in this patchset:
 * xfs: use thread_with_file to create a monitoring file
 * xfs: create hooks for monitoring health updates
 * xfs: report shutdown events through healthmon
 * xfs_io: monitor filesystem health events
 * xfs_scrubbed: create daemon to listen for health events
 * xfs_scrubbed: enable repairing filesystems
 * xfs_scrubbed: create a background monitoring service
---
 io/Makefile                    |    1 
 io/healthmon.c                 |  172 +++++++++++++
 io/init.c                      |    1 
 io/io.h                        |    1 
 libxfs/xfs_fs.h                |    1 
 libxfs/xfs_fs_staging.h        |   18 +
 libxfs/xfs_health.h            |   48 ++++
 man/man8/xfs_io.8              |   22 ++
 scrub/Makefile                 |   23 +-
 scrub/xfs.rules                |    3 
 scrub/xfs_scrubbed.in          |  519 ++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrubbed@.service.in |   95 +++++++
 scrub/xfs_scrubbed_start       |   17 +
 13 files changed, 916 insertions(+), 5 deletions(-)
 create mode 100644 io/healthmon.c
 create mode 100644 scrub/xfs_scrubbed.in
 create mode 100644 scrub/xfs_scrubbed@.service.in
 create mode 100755 scrub/xfs_scrubbed_start



* [PATCH 1/7] xfs: use thread_with_file to create a monitoring file
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
@ 2024-02-24  1:34   ` Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 2/7] xfs: create hooks for monitoring health updates Darrick J. Wong
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:34 UTC
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use Kent Overstreet's thread_with_file abstraction to provide a magic
file from which we can read filesystem health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs.h         |    1 +
 libxfs/xfs_fs_staging.h |   10 ++++++++++
 2 files changed, 11 insertions(+)


diff --git a/libxfs/xfs_fs.h b/libxfs/xfs_fs.h
index 246c2582abbe..b9d9bc511475 100644
--- a/libxfs/xfs_fs.h
+++ b/libxfs/xfs_fs.h
@@ -855,6 +855,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGETXATTRA	_IOR ('X', 45, struct fsxattr)
 /*	XFS_IOC_SETBIOSIZE ---- deprecated 46	   */
 /*	XFS_IOC_GETBIOSIZE ---- deprecated 47	   */
+/*	XFS_IOC_HEALTH_MONITOR --- staging 48	   */
 #define XFS_IOC_GETBMAPX	_IOWR('X', 56, struct getbmap)
 #define XFS_IOC_ZERO_RANGE	_IOW ('X', 57, struct xfs_flock64)
 #define XFS_IOC_FREE_EOFBLOCKS	_IOR ('X', 58, struct xfs_fs_eofblocks)
diff --git a/libxfs/xfs_fs_staging.h b/libxfs/xfs_fs_staging.h
index 1da182c77934..84b99816eec2 100644
--- a/libxfs/xfs_fs_staging.h
+++ b/libxfs/xfs_fs_staging.h
@@ -303,4 +303,14 @@ struct xfs_map_freesp {
  */
 #define XFS_IOC_MAP_FREESP	_IOWR('X', 64, struct xfs_map_freesp)
 
+struct xfs_health_monitor {
+	__u64	flags;		/* flags */
+	__u8	format;		/* output format */
+	__u8	pad1[7];	/* zeroes */
+	__u64	pad2[2];	/* zeroes */
+};
+
+/* Monitor for health events. */
+#define XFS_IOC_HEALTH_MONITOR		_IOR ('X', 48, struct xfs_health_monitor)
+
 #endif /* __XFS_FS_STAGING_H__ */



* [PATCH 2/7] xfs: create hooks for monitoring health updates
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 1/7] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
@ 2024-02-24  1:34   ` Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 3/7] xfs: report shutdown events through healthmon Darrick J. Wong
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:34 UTC
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create hooks for monitoring health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_health.h |   48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)


diff --git a/libxfs/xfs_health.h b/libxfs/xfs_health.h
index 89b80e957917..3c508a71ec91 100644
--- a/libxfs/xfs_health.h
+++ b/libxfs/xfs_health.h
@@ -331,4 +331,52 @@ void xfs_bulkstat_health(struct xfs_inode *ip, struct xfs_bulkstat *bs);
 #define xfs_metadata_is_sick(error) \
 	(unlikely((error) == -EFSCORRUPTED || (error) == -EFSBADCRC))
 
+/*
+ * Parameters for tracking health updates.  The enum below is passed as the
+ * hook function argument.
+ */
+enum xfs_health_update_type {
+	XFS_HEALTHUP_SICK = 1,	/* runtime corruption observed */
+	XFS_HEALTHUP_CORRUPT,	/* fsck reported corruption */
+	XFS_HEALTHUP_HEALTHY,	/* fsck reported healthy structure */
+	XFS_HEALTHUP_UNMOUNT,	/* filesystem is unmounting */
+};
+
+/* Where in the filesystem was the event observed? */
+enum xfs_health_update_domain {
+	XFS_HEALTHUP_FS = 1,	/* main filesystem */
+	XFS_HEALTHUP_RT,	/* realtime */
+	XFS_HEALTHUP_AG,	/* allocation group */
+	XFS_HEALTHUP_INODE,	/* inode */
+	XFS_HEALTHUP_RTGROUP,	/* realtime group */
+};
+
+struct xfs_health_update_params {
+	/* XFS_HEALTHUP_INODE */
+	xfs_ino_t			ino;
+	uint32_t			gen;
+
+	/* XFS_HEALTHUP_AG/RTGROUP */
+	uint32_t			group;
+
+	/* XFS_SICK_* flags */
+	unsigned int			old_mask;
+	unsigned int			new_mask;
+
+	enum xfs_health_update_domain	domain;
+};
+
+#ifdef CONFIG_XFS_LIVE_HOOKS
+struct xfs_health_hook {
+	struct xfs_hook			health_hook;
+};
+
+void xfs_health_hook_disable(void);
+void xfs_health_hook_enable(void);
+
+int xfs_health_hook_add(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_del(struct xfs_mount *mp, struct xfs_health_hook *hook);
+void xfs_health_hook_setup(struct xfs_health_hook *hook, notifier_fn_t mod_fn);
+#endif /* CONFIG_XFS_LIVE_HOOKS */
+
 #endif	/* __XFS_HEALTH_H__ */



* [PATCH 3/7] xfs: report shutdown events through healthmon
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 1/7] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
  2024-02-24  1:34   ` [PATCH 2/7] xfs: create hooks for monitoring health updates Darrick J. Wong
@ 2024-02-24  1:34   ` Darrick J. Wong
  2024-02-24  1:35   ` [PATCH 4/7] xfs_io: monitor filesystem health events Darrick J. Wong
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:34 UTC
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Set up a shutdown hook so that we can send notifications to userspace.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 libxfs/xfs_fs_staging.h |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/libxfs/xfs_fs_staging.h b/libxfs/xfs_fs_staging.h
index 84b99816eec2..684d6d22cc8d 100644
--- a/libxfs/xfs_fs_staging.h
+++ b/libxfs/xfs_fs_staging.h
@@ -310,6 +310,14 @@ struct xfs_health_monitor {
 	__u64	pad2[2];	/* zeroes */
 };
 
+/* Return all health status events, not just deltas */
+#define XFS_HEALTH_MONITOR_VERBOSE	(1ULL << 0)
+
+#define XFS_HEALTH_MONITOR_ALL		(XFS_HEALTH_MONITOR_VERBOSE)
+
+/* Return events in JSON format */
+#define XFS_HEALTH_MONITOR_FMT_JSON	(1)
+
 /* Monitor for health events. */
 #define XFS_IOC_HEALTH_MONITOR		_IOR ('X', 48, struct xfs_health_monitor)
 



* [PATCH 4/7] xfs_io: monitor filesystem health events
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-02-24  1:34   ` [PATCH 3/7] xfs: report shutdown events through healthmon Darrick J. Wong
@ 2024-02-24  1:35   ` Darrick J. Wong
  2024-02-24  1:35   ` [PATCH 5/7] xfs_scrubbed: create daemon to listen for " Darrick J. Wong
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:35 UTC
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a subcommand to monitor for health events generated by the kernel.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 io/Makefile       |    1 
 io/healthmon.c    |  172 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 io/init.c         |    1 
 io/io.h           |    1 
 man/man8/xfs_io.8 |   22 +++++++
 5 files changed, 197 insertions(+)
 create mode 100644 io/healthmon.c


diff --git a/io/Makefile b/io/Makefile
index 787027fe10ed..b1f9cebd63b0 100644
--- a/io/Makefile
+++ b/io/Makefile
@@ -24,6 +24,7 @@ CFILES = \
 	fsuuid.c \
 	fsync.c \
 	getrusage.c \
+	healthmon.c \
 	imap.c \
 	inject.c \
 	label.c \
diff --git a/io/healthmon.c b/io/healthmon.c
new file mode 100644
index 000000000000..7db8c52c96c0
--- /dev/null
+++ b/io/healthmon.c
@@ -0,0 +1,172 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "libxfs.h"
+#include "libfrog/fsgeom.h"
+#include "libfrog/paths.h"
+#include "command.h"
+#include "init.h"
+#include "io.h"
+
+static void
+healthmon_help(void)
+{
+	printf(_(
+"Monitor filesystem health events"
+"\n"
+"-c             Replace the open file with the monitor file.\n"
+"-d delay_ms    Sleep this many milliseconds between reads.\n"
+"-p             Only probe for the existence of the ioctl.\n"
+"-v             Request all events.\n"
+"\n"));
+}
+
+static inline int
+monitor_sleep(
+	int			delay_ms)
+{
+	struct timespec		ts;
+
+	if (!delay_ms)
+		return 0;
+
+	ts.tv_sec = delay_ms / 1000;
+	ts.tv_nsec = (delay_ms % 1000) * 1000000;
+
+	return nanosleep(&ts, NULL);
+}
+
+#define BUFSIZE			(4096)
+
+static int
+monitor(
+	bool			consume,
+	int			delay_ms,
+	bool			verbose,
+	bool			only_probe)
+{
+	struct xfs_health_monitor	hmo = {
+		.format		= XFS_HEALTH_MONITOR_FMT_JSON,
+	};
+	char			*buf;
+	ssize_t			bytes_read;
+	int			mon_fd;
+	int			ret = 1;
+
+	if (verbose)
+		hmo.flags |= XFS_HEALTH_MONITOR_ALL;
+
+	mon_fd = ioctl(file->fd, XFS_IOC_HEALTH_MONITOR, &hmo);
+	if (mon_fd < 0) {
+		perror("XFS_IOC_HEALTH_MONITOR");
+		return 1;
+	}
+
+	if (only_probe) {
+		ret = 0;
+		goto out_mon;
+	}
+
+	buf = malloc(BUFSIZE);
+	if (!buf) {
+		perror("malloc");
+		goto out_mon;
+	}
+
+	if (consume) {
+		close(file->fd);
+		file->fd = mon_fd;
+	}
+
+	monitor_sleep(delay_ms);
+	while ((bytes_read = read(mon_fd, buf, BUFSIZE)) > 0) {
+		char		*write_ptr = buf;
+		ssize_t		bytes_written;
+		size_t		to_write = bytes_read;
+
+		while ((bytes_written = write(STDOUT_FILENO, write_ptr, to_write)) > 0) {
+			write_ptr += bytes_written;
+			to_write -= bytes_written;
+		}
+		if (bytes_written < 0) {
+			perror("healthmon");
+			goto out_buf;
+		}
+
+		monitor_sleep(delay_ms);
+	}
+	if (bytes_read < 0) {
+		perror("healthmon");
+		goto out_buf;
+	}
+
+	ret = 0;
+
+out_buf:
+	free(buf);
+out_mon:
+	close(mon_fd);
+	return ret;
+}
+
+static int
+healthmon_f(
+	int			argc,
+	char			**argv)
+{
+	bool			consume = false;
+	bool			verbose = false;
+	bool			only_probe = false;
+	int			delay_ms = 0;
+	int			c;
+
+	while ((c = getopt(argc, argv, "cd:pv")) != EOF) {
+		switch (c) {
+		case 'c':
+			consume = true;
+			break;
+		case 'd':
+			errno = 0;
+			delay_ms = atoi(optarg);
+			if (delay_ms < 0 || errno) {
+				printf("%s: delay must be positive msecs\n",
+						optarg);
+				exitcode = 1;
+				return 0;
+			}
+			break;
+		case 'p':
+			only_probe = true;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		default:
+			exitcode = 1;
+			healthmon_help();
+			return 0;
+		}
+	}
+
+	return monitor(consume, delay_ms, verbose, only_probe);
+}
+
+static struct cmdinfo healthmon_cmd = {
+	.name		= "healthmon",
+	.cfunc		= healthmon_f,
+	.argmin		= 0,
+	.argmax		= -1,
+	.flags		= CMD_FLAG_ONESHOT | CMD_NOMAP_OK,
+	.args		= "[-c] [-d delay_ms] [-p] [-v]",
+	.help		= healthmon_help,
+};
+
+void
+healthmon_init(void)
+{
+	healthmon_cmd.oneline = _("monitor filesystem health events");
+
+	add_command(&healthmon_cmd);
+}
diff --git a/io/init.c b/io/init.c
index 452f4cfc898c..ef32e74bc744 100644
--- a/io/init.c
+++ b/io/init.c
@@ -91,6 +91,7 @@ init_commands(void)
 	utimes_init();
 	crc32cselftest_init();
 	exchrange_init();
+	healthmon_init();
 }
 
 /*
diff --git a/io/io.h b/io/io.h
index 06a8ae1db496..b8bed3b66171 100644
--- a/io/io.h
+++ b/io/io.h
@@ -192,3 +192,4 @@ extern void		bulkstat_init(void);
 extern void		exchrange_init(void);
 extern void		aginfo_init(void);
 extern void		fsrefcounts_init(void);
+extern void		healthmon_init(void);
diff --git a/man/man8/xfs_io.8 b/man/man8/xfs_io.8
index 93a4f0790d8e..9f00d26a0b49 100644
--- a/man/man8/xfs_io.8
+++ b/man/man8/xfs_io.8
@@ -1407,6 +1407,28 @@ flag.
 .RE
 .PD
 
+.TP
+.BI "healthmon [ \-c ] [ \-d " delay_ms " ] [ \-p ] [ \-v ]"
+Watch for filesystem health events and write them to the console.
+.RE
+.RS 1.0i
+.PD 0
+.TP
+.BI \-c
+Close the open file and replace it with the monitor file.
+.TP
+.BI "\-d " delay_ms
+Sleep for this long between read attempts.
+.TP
+.B \-p
+Probe for the existence of the functionality by opening the monitoring fd and
+closing it immediately.
+.TP
+.BI \-v
+Request all health events, even if nothing changed.
+.PD
+.RE
+
 .TP
 .BI "inject [ " tag " ]"
 Inject errors into a filesystem to observe filesystem behavior at


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 5/7] xfs_scrubbed: create daemon to listen for health events
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-02-24  1:35   ` [PATCH 4/7] xfs_io: monitor filesystem health events Darrick J. Wong
@ 2024-02-24  1:35   ` Darrick J. Wong
  2024-02-24  1:35   ` [PATCH 6/7] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
  2024-02-24  1:36   ` [PATCH 7/7] xfs_scrubbed: create a background monitoring service Darrick J. Wong
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:35 UTC (permalink / raw
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a daemon program that can listen for and log health events.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile        |   15 +++
 scrub/xfs_scrubbed.in |  217 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 230 insertions(+), 2 deletions(-)
 create mode 100644 scrub/xfs_scrubbed.in
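
For anyone who wants to poke at the event parsing without a patched
kernel, here is a minimal sketch of the loop in health_reports(), fed
from an in-memory stream.  The sample events and their values are made
up for illustration; only the "type", "time_ns" and "reasons" keys come
from the script in this patch.  Like the script, it treats any line
containing '}' as the end of an event, so it assumes that events are
flat json objects.

```python
import io
import json

def parse_events(stream):
    """Yield one dict per json object, mirroring health_reports()."""
    lines = []
    for buf in iter(stream.readline, ''):
        # The daemon splits each read on NUL bytes, so do the same here.
        for line in buf.split('\0'):
            line = line.strip()
            if line == '':
                continue
            lines.append(line)
            # A line containing '}' is taken to close the current event.
            if '}' not in line:
                continue
            text = ''.join(lines)
            lines = []
            try:
                yield json.loads(text)
            except json.JSONDecodeError:
                # Skip anything that does not parse and keep going.
                continue

# Hypothetical sample input: a multi-line event, a junk line, and a
# single-line event.
SAMPLE = (
    '{\n'
    '  "type": "lost",\n'
    '  "time_ns": 1708738500000000000\n'
    '}\n'
    'garbage }\n'
    '{"type": "shutdown", "time_ns": 1708738501000000000, '
    '"reasons": ["meta_ioerr"]}\n'
)

events = list(parse_events(io.StringIO(SAMPLE)))
for event in events:
    print(event['type'])
```

The junk line is dropped rather than aborting the stream, which matches
how the daemon logs a decode error and moves on to the next event.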


diff --git a/scrub/Makefile b/scrub/Makefile
index c0fc927f4278..cf112018376b 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -18,6 +18,7 @@ XFS_SCRUB_ALL_PROG = xfs_scrub_all
 XFS_SCRUB_FAIL_PROG = xfs_scrub_fail
 XFS_SCRUB_ARGS = -p
 XFS_SCRUB_SERVICE_ARGS = -b
+XFS_SCRUBBED_PROG = xfs_scrubbed
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
@@ -124,9 +125,9 @@ endif
 # Automatically trigger a media scan once per month
 XFS_SCRUB_ALL_AUTO_MEDIA_SCAN_INTERVAL=1mo
 
-LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) *.service *.cron
+LDIRT = $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) *.service *.cron
 
-default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(OPTIONAL_TARGETS)
+default: depend $(LTCOMMAND) $(XFS_SCRUB_ALL_PROG) $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_PROG) $(OPTIONAL_TARGETS)
 
 xfs_scrub_all: xfs_scrub_all.in $(builddefs)
 	@echo "    [SED]    $@"
@@ -139,6 +140,14 @@ xfs_scrub_all: xfs_scrub_all.in $(builddefs)
 		   -e "s|@scrub_args@|$(XFS_SCRUB_ARGS)|g" < $< > $@
 	$(Q)chmod a+x $@
 
+xfs_scrubbed: xfs_scrubbed.in $(builddefs)
+	@echo "    [SED]    $@"
+	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
+		   -e "s|@scrub_svcname@|$(scrub_svcname)|g" \
+		   -e "s|@pkg_version@|$(PKG_VERSION)|g" \
+		   < $< > $@
+	$(Q)chmod a+x $@
+
 xfs_scrub_fail: xfs_scrub_fail.in $(builddefs)
 	@echo "    [SED]    $@"
 	$(Q)$(SED) -e "s|@sbindir@|$(PKG_SBIN_DIR)|g" \
@@ -182,6 +191,8 @@ install-scrub: default
 	$(INSTALL) -m 755 -d $(PKG_SBIN_DIR)
 	$(LTINSTALL) -m 755 $(LTCOMMAND) $(PKG_SBIN_DIR)
 	$(INSTALL) -m 755 $(XFS_SCRUB_ALL_PROG) $(PKG_SBIN_DIR)
+	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUBBED_PROG) $(PKG_LIBEXEC_DIR)
 	$(INSTALL) -m 755 -d $(PKG_STATE_DIR)
 
 install-udev: $(UDEV_RULES)
diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
new file mode 100644
index 000000000000..0c72f5c54a78
--- /dev/null
+++ b/scrub/xfs_scrubbed.in
@@ -0,0 +1,217 @@
+#!/usr/bin/python3
+
+# SPDX-License-Identifier: GPL-2.0-or-later
+# Copyright (C) 2024 Oracle.  All rights reserved.
+#
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+# Daemon to listen for and react to filesystem health events
+
+import sys
+import os
+import argparse
+import fcntl
+import struct
+import json
+import datetime
+import errno
+
+debug = False
+log = False
+everything = False
+printf_prefix = ''
+
+# ioctl encoding stuff
+_IOC_NRBITS   =  8
+_IOC_TYPEBITS =  8
+_IOC_SIZEBITS = 14
+_IOC_DIRBITS  =  2
+
+_IOC_NRSHIFT   = 0
+_IOC_TYPESHIFT = (_IOC_NRSHIFT   + _IOC_NRBITS)
+_IOC_SIZESHIFT = (_IOC_TYPESHIFT + _IOC_TYPEBITS)
+_IOC_DIRSHIFT  = (_IOC_SIZESHIFT + _IOC_SIZEBITS)
+
+_IOC_NONE  = 0
+_IOC_WRITE = 1
+_IOC_READ  = 2
+
+def _IOC(direction, type, nr, size):
+	return (((direction)  << _IOC_DIRSHIFT) |
+		((type) << _IOC_TYPESHIFT) |
+		((nr)   << _IOC_NRSHIFT) |
+		((size) << _IOC_SIZESHIFT))
+
+def _IOR(type, number, size):
+	return _IOC(_IOC_READ, type, number, size)
+
+# xfs health monitoring ioctl stuff
+XFS_HEALTH_MONITOR_FMT_JSON = 1
+XFS_HEALTH_MONITOR_VERBOSE = 1 << 0
+xfs_health_monitor = struct.Struct('QB' + ('x' * 23))
+XFS_IOC_HEALTH_MONITOR = _IOR(0x58, 48, xfs_health_monitor.size)
+
+def open_health_monitor(fd, verbose = False):
+	'''Return a health monitoring fd.'''
+	assert xfs_health_monitor.size == 32
+
+	flags = 0
+	fmt = XFS_HEALTH_MONITOR_FMT_JSON
+
+	if verbose:
+		flags |= XFS_HEALTH_MONITOR_VERBOSE
+
+	# Create an immutable byte array representation of struct args, then
+	# pass it to the ioctl function as a mutable byte array so that the
+	# return value is the kernel fd and /not/ the post-syscall byte array
+	# contents.
+	arg = xfs_health_monitor.pack(flags, fmt)
+	ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, bytearray(arg))
+	return ret
+
+# main program
+
+def health_reports(mon_fp):
+	'''Generate python objects describing health events.'''
+	global debug, printf_prefix
+
+	lines = []
+	buf = mon_fp.readline()
+	while buf != '':
+		for line in buf.split('\0'):
+			line = line.strip()
+			if debug:
+				print(f'new line: {line}')
+			if line == '':
+				continue
+
+			lines.append(line)
+			if not '}' in line:
+				continue
+
+			s = ''.join(lines)
+			if debug:
+				print(f'new event: {s}')
+			try:
+				yield json.loads(s)
+			except json.decoder.JSONDecodeError as e:
+				print(f"{printf_prefix}: {e} from {s}",
+						file = sys.stderr)
+				pass
+			lines = []
+		buf = mon_fp.readline()
+
+def log_event(event):
+	global printf_prefix
+
+	print(f"{printf_prefix}: {event}")
+	sys.stdout.flush()
+
+def report_lost(event):
+	'''Report that the kernel lost events.'''
+	global printf_prefix
+
+	print(f"{printf_prefix}: Events were lost.")
+	sys.stdout.flush()
+
+def report_shutdown(event):
+	'''Report an abortive shutdown of the filesystem.'''
+	global printf_prefix
+	REASONS = {
+		"meta_ioerr":		"metadata IO error",
+		"log_ioerr":		"log IO error",
+		"force_umount":		"forced unmount",
+		"corrupt_incore":	"in-memory state corruption",
+		"corrupt_ondisk":	"ondisk metadata corruption",
+		"device_removed":	"device removal",
+	}
+
+	reasons = []
+	for reason in event['reasons']:
+		if reason in REASONS:
+			reasons.append(REASONS[reason])
+		else:
+			reasons.append(reason)
+
+	print(f"{printf_prefix}: Filesystem shut down due to {', '.join(reasons)}.")
+	sys.stdout.flush()
+
+def monitor(mountpoint):
+	'''Monitor the given mountpoint for health events.'''
+	global log, everything
+
+	fd = os.open(mountpoint, os.O_RDONLY)
+	try:
+		mon_fd = open_health_monitor(fd, verbose = everything)
+	except OSError as e:
+		if e.errno != errno.ENOTTY:
+			raise e
+		print(f"{mountpoint}: XFS health monitoring not supported.",
+				file = sys.stderr)
+		return 1
+	finally:
+		# Close the mountpoint if opening the health monitor fails
+		os.close(fd)
+
+	# Ownership of mon_fd (and hence responsibility for closing it) is
+	# transferred to the mon_fp object.
+	with os.fdopen(mon_fd) as mon_fp:
+		for event in health_reports(mon_fp):
+			try:
+				ts = datetime.datetime.fromtimestamp(event['time_ns'] / 1e9).astimezone()
+				event['time'] = str(ts)
+				del event['time_ns']
+			except Exception as e:
+				print(e)
+				pass
+			if log:
+				log_event(event)
+			if event['type'] == 'lost':
+				report_lost(event)
+			elif event['type'] == 'shutdown':
+				report_shutdown(event)
+
+	return 0
+
+def main():
+	global debug, log, printf_prefix, everything
+	ret = 0
+
+	parser = argparse.ArgumentParser( \
+			description = "XFS filesystem health monitoring daemon.")
+	parser.add_argument("--debug", help = "Enable debugging messages.", \
+			action = "store_true")
+	parser.add_argument("--log", help = "Log health events to stdout.", \
+			action = "store_true")
+	parser.add_argument("--everything", help = "Capture all events.", \
+			action = "store_true")
+	parser.add_argument("-V", help = "Report version and exit.", \
+			action = "store_true")
+	parser.add_argument('mountpoint', default = None, nargs = '?',
+			help = 'XFS filesystem mountpoint to target.')
+	args = parser.parse_args()
+
+	if args.V:
+		print("xfs_scrubbed version @pkg_version@")
+		sys.exit(0)
+
+	if args.mountpoint is None:
+		parser.error("the following arguments are required: mountpoint")
+		sys.exit(1)
+
+	if args.debug:
+		debug = True
+	if args.log:
+		log = True
+	if args.everything:
+		everything = True
+
+	printf_prefix = args.mountpoint
+	try:
+		ret = monitor(args.mountpoint)
+	except KeyboardInterrupt:
+		ret = 0
+	sys.exit(ret)
+
+if __name__ == '__main__':
+	main()


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 6/7] xfs_scrubbed: enable repairing filesystems
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-02-24  1:35   ` [PATCH 5/7] xfs_scrubbed: create daemon to listen for " Darrick J. Wong
@ 2024-02-24  1:35   ` Darrick J. Wong
  2024-02-24  1:36   ` [PATCH 7/7] xfs_scrubbed: create a background monitoring service Darrick J. Wong
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:35 UTC (permalink / raw
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Make it so that our health monitoring daemon can initiate repairs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/xfs_scrubbed.in |  300 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 297 insertions(+), 3 deletions(-)
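
A note on the scrub ioctl plumbing, as a rough sketch rather than
anything authoritative: the output flags come back in the same buffer
that was passed in, so the helper has to unpack the very bytearray it
handed to the ioctl.  The snippet below uses the struct layout and
ioctl encoding from the script, but fake_scrub_ioctl() is a
hypothetical stand-in for the kernel so that it runs anywhere, and the
inode and generation numbers are invented.

```python
import struct

# ioctl number encoding, using the same shifts as the script.
_IOC_NRSHIFT, _IOC_TYPESHIFT, _IOC_SIZESHIFT, _IOC_DIRSHIFT = 0, 8, 16, 30
_IOC_WRITE, _IOC_READ = 1, 2

def _IOWR(type, nr, size):
    return (((_IOC_READ | _IOC_WRITE) << _IOC_DIRSHIFT) |
            (type << _IOC_TYPESHIFT) |
            (nr << _IOC_NRSHIFT) |
            (size << _IOC_SIZESHIFT))

XFS_SCRUB_TYPE_INODE = 11
XFS_SCRUB_IFLAG_REPAIR = 1 << 0
XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED = 1 << 7

# type, flags, ino, gen, group, then 40 bytes of padding
xfs_scrub_metadata = struct.Struct('IIQII' + ('x' * 40))
XFS_IOC_SCRUB_METADATA = _IOWR(0x58, 60, xfs_scrub_metadata.size)

def fake_scrub_ioctl(buf):
    """Hypothetical kernel stand-in: set one output flag in place."""
    type, flags, ino, gen, group = xfs_scrub_metadata.unpack(buf)
    flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED
    xfs_scrub_metadata.pack_into(buf, 0, type, flags, ino, gen, group)

arg = bytearray(xfs_scrub_metadata.pack(XFS_SCRUB_TYPE_INODE,
                                        XFS_SCRUB_IFLAG_REPAIR, 128, 1, 0))

# Passing a copy: the result lands in the temporary and is lost.
fake_scrub_ioctl(bytearray(arg))
flags_after_copy = xfs_scrub_metadata.unpack(arg)[1]

# Passing the buffer itself: the output flag is visible afterwards.
fake_scrub_ioctl(arg)
flags_in_place = xfs_scrub_metadata.unpack(arg)[1]

print(hex(XFS_IOC_SCRUB_METADATA), hex(flags_after_copy), hex(flags_in_place))
```

The hardcoded shifts match the common ioctl layout; architectures that
encode the direction bits differently would need other values.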


diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 0c72f5c54a78..5458d39486bc 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -15,11 +15,16 @@ import struct
 import json
 import datetime
 import errno
+import ctypes
+import ctypes.util
 
 debug = False
 log = False
 everything = False
 printf_prefix = ''
+want_repair = False
+libhandle = None
+libc = None
 
 # ioctl encoding stuff
 _IOC_NRBITS   =  8
@@ -45,6 +50,9 @@ def _IOC(direction, type, nr, size):
 def _IOR(type, number, size):
 	return _IOC(_IOC_READ, type, number, size)
 
+def _IOWR(type, number, size):
+	return _IOC(_IOC_READ | _IOC_WRITE, type, number, size)
+
 # xfs health monitoring ioctl stuff
 XFS_HEALTH_MONITOR_FMT_JSON = 1
 XFS_HEALTH_MONITOR_VERBOSE = 1 << 0
@@ -69,6 +77,159 @@ def open_health_monitor(fd, verbose = False):
 	ret = fcntl.ioctl(fd, XFS_IOC_HEALTH_MONITOR, bytearray(arg))
 	return ret
 
+# libhandle stuff
+class xfs_weak_handle(object):
+	def __init__(self, fd, mountpoint):
+		global libhandle, printf_prefix
+
+		self.mountpoint = mountpoint
+		self.hanp = ctypes.c_void_p()
+		self.hlen = ctypes.c_size_t()
+		self.has_handle = False
+
+		# Create the file and fs handles for the open mountpoint
+		# so that we can compare them later
+		ret = libhandle.fd_to_handle(fd, self.hanp, self.hlen)
+		if ret != 0:
+			raise OSError(ctypes.get_errno(),
+					f"{printf_prefix}: cannot create handle")
+		self.has_handle = True
+
+	def __del__(self):
+		if self.has_handle:
+			libhandle.free_handle(self.hanp, self.hlen)
+
+	def open(self):
+		'''Reopen a file handle obtained via weak reference.'''
+		global libhandle, libc, printf_prefix
+
+		nhanp = ctypes.c_void_p()
+		nhlen = ctypes.c_size_t()
+
+		fd = os.open(self.mountpoint, os.O_RDONLY)
+
+		# Create the file and fs handles for the open mountpoint
+		# so that we can compare them later
+		ret = libhandle.fd_to_handle(fd, nhanp, nhlen)
+		if ret != 0:
+			raise OSError(ctypes.get_errno(),
+					f"{printf_prefix}: cannot resample handle")
+
+		# Did we get the same handle?
+		if nhlen.value != self.hlen.value or \
+		   libc.memcmp(self.hanp, nhanp, nhlen) != 0:
+			os.close(fd)
+			libhandle.free_handle(nhanp, nhlen)
+			raise OSError(errno.ENOENT,
+					f"{printf_prefix}: filesystem has changed")
+
+		libhandle.free_handle(nhanp, nhlen)
+		return fd
+
+def libc_load():
+	'''Load libc and set things up.'''
+	global libc
+
+	libc_name = ctypes.util.find_library("c")
+	libc = ctypes.cdll.LoadLibrary(libc_name)
+	libc.memcmp.argtypes = (
+			ctypes.c_void_p,
+			ctypes.c_void_p,
+			ctypes.c_size_t)
+	libc.errno
+
+def libhandle_load():
+	'''Load libhandle and set things up.'''
+	global libhandle
+
+	libhandle = ctypes.cdll.LoadLibrary('libhandle.so')
+	libhandle.fd_to_handle.argtypes = (
+			ctypes.c_int,
+			ctypes.POINTER(ctypes.c_void_p),
+			ctypes.POINTER(ctypes.c_size_t))
+	libhandle.handle_to_fshandle.argtypes = (
+			ctypes.c_void_p,
+			ctypes.c_size_t,
+			ctypes.POINTER(ctypes.c_void_p),
+			ctypes.POINTER(ctypes.c_size_t))
+	libhandle.path_to_fshandle.argtypes = (
+			ctypes.c_char_p,
+			ctypes.c_void_p,
+			ctypes.c_size_t)
+	libhandle.free_handle.argtypes = (
+			ctypes.c_void_p,
+			ctypes.c_size_t)
+
+# metadata scrubbing stuff
+XFS_SCRUB_TYPE_PROBE		= 0
+XFS_SCRUB_TYPE_SB		= 1
+XFS_SCRUB_TYPE_AGF		= 2
+XFS_SCRUB_TYPE_AGFL		= 3
+XFS_SCRUB_TYPE_AGI		= 4
+XFS_SCRUB_TYPE_BNOBT		= 5
+XFS_SCRUB_TYPE_CNTBT		= 6
+XFS_SCRUB_TYPE_INOBT		= 7
+XFS_SCRUB_TYPE_FINOBT		= 8
+XFS_SCRUB_TYPE_RMAPBT		= 9
+XFS_SCRUB_TYPE_REFCNTBT		= 10
+XFS_SCRUB_TYPE_INODE		= 11
+XFS_SCRUB_TYPE_BMBTD		= 12
+XFS_SCRUB_TYPE_BMBTA		= 13
+XFS_SCRUB_TYPE_BMBTC		= 14
+XFS_SCRUB_TYPE_DIR		= 15
+XFS_SCRUB_TYPE_XATTR		= 16
+XFS_SCRUB_TYPE_SYMLINK		= 17
+XFS_SCRUB_TYPE_PARENT		= 18
+XFS_SCRUB_TYPE_RTBITMAP		= 19
+XFS_SCRUB_TYPE_RTSUM		= 20
+XFS_SCRUB_TYPE_UQUOTA		= 21
+XFS_SCRUB_TYPE_GQUOTA		= 22
+XFS_SCRUB_TYPE_PQUOTA		= 23
+XFS_SCRUB_TYPE_FSCOUNTERS	= 24
+XFS_SCRUB_TYPE_QUOTACHECK	= 25
+XFS_SCRUB_TYPE_NLINKS		= 26
+XFS_SCRUB_TYPE_HEALTHY		= 27
+XFS_SCRUB_TYPE_DIRTREE		= 28
+XFS_SCRUB_TYPE_METAPATH		= 29
+XFS_SCRUB_TYPE_RGSUPER		= 30
+XFS_SCRUB_TYPE_RGBITMAP		= 31
+XFS_SCRUB_TYPE_RTRMAPBT		= 32
+XFS_SCRUB_TYPE_RTREFCBT		= 33
+
+XFS_SCRUB_IFLAG_REPAIR			= 1 << 0
+XFS_SCRUB_OFLAG_CORRUPT			= 1 << 1
+XFS_SCRUB_OFLAG_PREEN			= 1 << 2
+XFS_SCRUB_OFLAG_XFAIL			= 1 << 3
+XFS_SCRUB_OFLAG_XCORRUPT		= 1 << 4
+XFS_SCRUB_OFLAG_INCOMPLETE		= 1 << 5
+XFS_SCRUB_OFLAG_WARNING			= 1 << 6
+XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED	= 1 << 7
+XFS_SCRUB_IFLAG_FORCE_REBUILD		= 1 << 8
+
+xfs_scrub_metadata = struct.Struct('IIQII' + ('x' * 40))
+XFS_IOC_SCRUB_METADATA		= _IOWR(0x58, 60, xfs_scrub_metadata.size)
+
+def xfs_repair_fs_metadata(fd, type):
+	'''Call the kernel to repair some whole-fs metadata.'''
+	arg = bytearray(xfs_scrub_metadata.pack(type, XFS_SCRUB_IFLAG_REPAIR,
+					0, 0, 0))
+	fcntl.ioctl(fd, XFS_IOC_SCRUB_METADATA, arg)
+	return xfs_scrub_metadata.unpack(arg)[1]
+
+def xfs_repair_group_metadata(fd, type, group):
+	'''Call the kernel to repair some group metadata.'''
+	arg = bytearray(xfs_scrub_metadata.pack(type, XFS_SCRUB_IFLAG_REPAIR,
+					 0, 0, group))
+	fcntl.ioctl(fd, XFS_IOC_SCRUB_METADATA, arg)
+	return xfs_scrub_metadata.unpack(arg)[1]
+
+def xfs_repair_inode_metadata(fd, type, ino, gen):
+	'''Call the kernel to repair some inode metadata.'''
+	arg = bytearray(xfs_scrub_metadata.pack(type, XFS_SCRUB_IFLAG_REPAIR,
+					 ino, gen, 0))
+	fcntl.ioctl(fd, XFS_IOC_SCRUB_METADATA, arg)
+	return xfs_scrub_metadata.unpack(arg)[1]
+
 # main program
 
 def health_reports(mon_fp):
@@ -138,10 +299,12 @@ def report_shutdown(event):
 
 def monitor(mountpoint):
 	'''Monitor the given mountpoint for health events.'''
-	global log, everything
+	global log, printf_prefix, everything, want_repair
 
 	fd = os.open(mountpoint, os.O_RDONLY)
 	try:
+		if want_repair:
+			handle = xfs_weak_handle(fd, mountpoint)
 		mon_fd = open_health_monitor(fd, verbose = everything)
 	except OSError as e:
 		if e.errno != errno.ENOTTY:
@@ -150,7 +313,8 @@ def monitor(mountpoint):
 				file = sys.stderr)
 		return 1
 	finally:
-		# Close the mountpoint if opening the health monitor fails
+		# Close the mountpoint if opening the health monitor fails;
+		# the handle object will free its own memory.
 		os.close(fd)
 
 	# Ownership of mon_fd (and hence responsibility for closing it) is
@@ -170,11 +334,131 @@ def monitor(mountpoint):
 				report_lost(event)
 			elif event['type'] == 'shutdown':
 				report_shutdown(event)
+			elif want_repair and event['type'] == 'sick':
+				repair_metadata(event, handle)
 
 	return 0
 
+def __scrub_type(code):
+	'''Convert one entry of a "structures" json list to a scrub type code.'''
+	SCRUB_TYPES = {
+		"probe":	XFS_SCRUB_TYPE_PROBE,
+		"sb":		XFS_SCRUB_TYPE_SB,
+		"agf":		XFS_SCRUB_TYPE_AGF,
+		"agfl":		XFS_SCRUB_TYPE_AGFL,
+		"agi":		XFS_SCRUB_TYPE_AGI,
+		"bnobt":	XFS_SCRUB_TYPE_BNOBT,
+		"cntbt":	XFS_SCRUB_TYPE_CNTBT,
+		"inobt":	XFS_SCRUB_TYPE_INOBT,
+		"finobt":	XFS_SCRUB_TYPE_FINOBT,
+		"rmapbt":	XFS_SCRUB_TYPE_RMAPBT,
+		"refcountbt":	XFS_SCRUB_TYPE_REFCNTBT,
+		"inode":	XFS_SCRUB_TYPE_INODE,
+		"bmapbtd":	XFS_SCRUB_TYPE_BMBTD,
+		"bmapbta":	XFS_SCRUB_TYPE_BMBTA,
+		"bmapbtc":	XFS_SCRUB_TYPE_BMBTC,
+		"directory":	XFS_SCRUB_TYPE_DIR,
+		"xattr":	XFS_SCRUB_TYPE_XATTR,
+		"symlink":	XFS_SCRUB_TYPE_SYMLINK,
+		"parent":	XFS_SCRUB_TYPE_PARENT,
+		"rtbitmap":	XFS_SCRUB_TYPE_RTBITMAP,
+		"rtsummary":	XFS_SCRUB_TYPE_RTSUM,
+		"usrquota":	XFS_SCRUB_TYPE_UQUOTA,
+		"grpquota":	XFS_SCRUB_TYPE_GQUOTA,
+		"prjquota":	XFS_SCRUB_TYPE_PQUOTA,
+		"fscounters":	XFS_SCRUB_TYPE_FSCOUNTERS,
+		"quotacheck":	XFS_SCRUB_TYPE_QUOTACHECK,
+		"nlinks":	XFS_SCRUB_TYPE_NLINKS,
+		"healthy":	XFS_SCRUB_TYPE_HEALTHY,
+		"dirtree":	XFS_SCRUB_TYPE_DIRTREE,
+		"metapath":	XFS_SCRUB_TYPE_METAPATH,
+		"rgsuper":	XFS_SCRUB_TYPE_RGSUPER,
+		"rgbitmap":	XFS_SCRUB_TYPE_RGBITMAP,
+		"rtrmapbt":	XFS_SCRUB_TYPE_RTRMAPBT,
+		"rtrefcountbt":	XFS_SCRUB_TYPE_RTREFCBT,
+	}
+
+	if code not in SCRUB_TYPES:
+		return None
+
+	return SCRUB_TYPES[code]
+
+def report_outcome(oflags):
+	if oflags & (XFS_SCRUB_OFLAG_CORRUPT | \
+		     XFS_SCRUB_OFLAG_XCORRUPT | \
+		     XFS_SCRUB_OFLAG_INCOMPLETE):
+		return "Repair unsuccessful; offline repair required."
+
+	if oflags & XFS_SCRUB_OFLAG_XFAIL:
+		return "Seems correct but cross-referencing failed; offline repair recommended."
+
+	if oflags & XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED:
+		return "No modification needed."
+
+	return "Repairs successful."
+
+def repair_wholefs(event, fd):
+	'''React to a fs-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_fs_metadata(fd, type)
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {e}", file = sys.stderr)
+
+def repair_group(event, fd, group_type):
+	'''React to a group-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_group_metadata(fd, type, event['group'])
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {e}", file = sys.stderr)
+
+def repair_inode(event, fd):
+	'''React to an inode-domain corruption event by repairing it.'''
+	for s in event['structures']:
+		type = __scrub_type(s)
+		if type is None:
+			continue
+		try:
+			oflags = xfs_repair_inode_metadata(fd, type,
+				      event['inode'], event['generation'])
+			print(f"{printf_prefix}: {s}: {report_outcome(oflags)}")
+			sys.stdout.flush()
+		except Exception as e:
+			print(f"{printf_prefix}: {e}", file = sys.stderr)
+
+def repair_metadata(event, handle):
+	'''Repair a metadata corruption.'''
+	global debug, printf_prefix
+
+	if debug:
+		print(f'repair {event}')
+	fd = handle.open()
+
+	if event['domain'] in ['fs', 'realtime']:
+		repair_wholefs(event, fd)
+	elif event['domain'] in ['ag', 'rtgroup']:
+		repair_group(event, fd, event['domain'])
+	elif event['domain'] == 'inode':
+		repair_inode(event, fd)
+	else:
+		raise Exception(f"{printf_prefix}: Unknown metadata domain \"{event['domain']}\".")
+
+	os.close(fd)
+	return
+
 def main():
-	global debug, log, printf_prefix, everything
+	global debug, log, printf_prefix, everything, want_repair
 	ret = 0
 
 	parser = argparse.ArgumentParser( \
@@ -185,6 +469,8 @@ def main():
 			action = "store_true")
 	parser.add_argument("--everything", help = "Capture all events.", \
 			action = "store_true")
+	parser.add_argument("--repair", help = "Automatically repair corrupt metadata.", \
+			action = "store_true")
 	parser.add_argument("-V", help = "Report version and exit.", \
 			action = "store_true")
 	parser.add_argument('mountpoint', default = None, nargs = '?',
@@ -205,6 +491,14 @@ def main():
 		log = True
 	if args.everything:
 		everything = True
+	if args.repair:
+		try:
+			libc_load()
+			libhandle_load()
+			want_repair = True
+		except OSError as e:
+			print(e, file = sys.stderr)
+			sys.exit(1)
 
 	printf_prefix = args.mountpoint
 	try:


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* [PATCH 7/7] xfs_scrubbed: create a background monitoring service
  2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-02-24  1:35   ` [PATCH 6/7] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
@ 2024-02-24  1:36   ` Darrick J. Wong
  6 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  1:36 UTC (permalink / raw
  To: cem, kent.overstreet, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Create a systemd service and activate it automatically.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/Makefile                 |    8 ++-
 scrub/xfs.rules                |    3 +
 scrub/xfs_scrubbed.in          |    8 +++
 scrub/xfs_scrubbed@.service.in |   95 ++++++++++++++++++++++++++++++++++++++++
 scrub/xfs_scrubbed_start       |   17 +++++++
 5 files changed, 128 insertions(+), 3 deletions(-)
 create mode 100644 scrub/xfs_scrubbed@.service.in
 create mode 100755 scrub/xfs_scrubbed_start
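
In case it helps to see how a mountpoint turns into a service instance
name, here is a rough Python approximation of what
"systemd-escape --path" does in the helper script.  It is only a sketch
of the escaping rules as I understand them (strip and collapse slashes,
turn '/' into '-', hex-escape everything outside letters, digits, ':',
'_' and '.', and escape a leading '.'); it does not handle "." or ".."
path components, and systemd_escape_path() is a made-up name.  The
mountpoints are examples only.

```python
import string

# Characters that are copied through unchanged.
KEEP = set(string.ascii_letters + string.digits + ':_.')

def systemd_escape_path(path):
    """Approximate `systemd-escape --path` for simple absolute paths."""
    parts = [p for p in path.split('/') if p]
    if not parts:
        # The root directory is special-cased to a single dash.
        return '-'
    joined = '/'.join(parts)
    out = []
    for i, byte in enumerate(joined.encode()):
        ch = chr(byte)
        if ch == '/':
            out.append('-')
        elif (i == 0 and ch == '.') or ch not in KEEP:
            # Everything else, including '-' itself, becomes \xXX.
            out.append('\\x%02x' % byte)
        else:
            out.append(ch)
    return ''.join(out)

for mntpt in ('/', '/mnt/data', '/mnt/my-disk'):
    print(f"xfs_scrubbed@{systemd_escape_path(mntpt)}.service")
```

Because a literal '-' in the path is hex-escaped, the '-' characters
left in the instance name always stand for path separators, which is
what lets %f in the unit file map back to the mountpoint.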


diff --git a/scrub/Makefile b/scrub/Makefile
index cf112018376b..7c1b5c742be2 100644
--- a/scrub/Makefile
+++ b/scrub/Makefile
@@ -19,6 +19,7 @@ XFS_SCRUB_FAIL_PROG = xfs_scrub_fail
 XFS_SCRUB_ARGS = -p
 XFS_SCRUB_SERVICE_ARGS = -b
 XFS_SCRUBBED_PROG = xfs_scrubbed
+XFS_SCRUBBED_HELPER = xfs_scrubbed_start
 ifeq ($(HAVE_SYSTEMD),yes)
 INSTALL_SCRUB += install-systemd
 SYSTEMD_SERVICES=\
@@ -29,8 +30,9 @@ SYSTEMD_SERVICES=\
 	xfs_scrub_all.service \
 	xfs_scrub_all_fail.service \
 	xfs_scrub_all.timer \
-	system-xfs_scrub.slice
-OPTIONAL_TARGETS += $(SYSTEMD_SERVICES)
+	system-xfs_scrub.slice \
+	xfs_scrubbed@.service
+OPTIONAL_TARGETS += $(SYSTEMD_SERVICES) $(XFS_SCRUBBED_HELPER)
 endif
 ifeq ($(HAVE_CROND),yes)
 INSTALL_SCRUB += install-crond
@@ -181,7 +183,7 @@ install-systemd: default $(SYSTEMD_SERVICES)
 	$(INSTALL) -m 755 -d $(SYSTEMD_SYSTEM_UNIT_DIR)
 	$(INSTALL) -m 644 $(SYSTEMD_SERVICES) $(SYSTEMD_SYSTEM_UNIT_DIR)
 	$(INSTALL) -m 755 -d $(PKG_LIBEXEC_DIR)
-	$(INSTALL) -m 755 $(XFS_SCRUB_FAIL_PROG) $(PKG_LIBEXEC_DIR)
+	$(INSTALL) -m 755 $(XFS_SCRUB_FAIL_PROG) $(XFS_SCRUBBED_HELPER) $(PKG_LIBEXEC_DIR)
 
 install-crond: default $(CRONTABS)
 	$(INSTALL) -m 755 -d $(CROND_DIR)
diff --git a/scrub/xfs.rules b/scrub/xfs.rules
index c3f69b3ab909..f3ec21c322fe 100644
--- a/scrub/xfs.rules
+++ b/scrub/xfs.rules
@@ -11,3 +11,6 @@
 # supplying UDISKS_AUTO=0 here changes the HintAuto property of the block
 # device abstraction to mean "do not automatically start" (e.g. mount).
 SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="xfs|xfs_external_log", ENV{UDISKS_AUTO}="0"
+
+# Start the background scrubber automatically
+ACTION=="add", SUBSYSTEM=="xfs", ENV{TYPE}=="mount", RUN+="xfs_scrubbed_start"
diff --git a/scrub/xfs_scrubbed.in b/scrub/xfs_scrubbed.in
index 5458d39486bc..6d12efc2998b 100644
--- a/scrub/xfs_scrubbed.in
+++ b/scrub/xfs_scrubbed.in
@@ -17,6 +17,7 @@ import datetime
 import errno
 import ctypes
 import ctypes.util
+import time
 
 debug = False
 log = False
@@ -505,6 +506,13 @@ def main():
 		ret = monitor(args.mountpoint)
 	except KeyboardInterrupt:
 		ret = 0
+
+	# See the service mode comments in xfs_scrub.c for why we do this.
+	if 'SERVICE_MODE' in os.environ:
+		time.sleep(2)
+		if ret != 0:
+			ret = 1
+
 	sys.exit(ret)
 
 if __name__ == '__main__':
diff --git a/scrub/xfs_scrubbed@.service.in b/scrub/xfs_scrubbed@.service.in
new file mode 100644
index 000000000000..c33efbbbc7e5
--- /dev/null
+++ b/scrub/xfs_scrubbed@.service.in
@@ -0,0 +1,95 @@
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (C) 2024 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+[Unit]
+Description=Self Healing of XFS Metadata for %f
+Documentation=man:xfs_scrubbed(8)
+
+# Explicitly require the capabilities that this program needs
+ConditionCapability=CAP_SYS_ADMIN
+
+# Must be a mountpoint
+ConditionPathIsMountPoint=%f
+RequiresMountsFor=%f
+
+[Service]
+Type=oneshot
+Environment=SERVICE_MODE=1
+ExecStart=@pkg_libexec_dir@/xfs_scrubbed --repair --log %f
+SyslogIdentifier=%N
+
+# Run scrub with minimal CPU and IO priority so that nothing else will starve.
+IOSchedulingClass=idle
+CPUSchedulingPolicy=idle
+CPUAccounting=true
+Nice=19
+
+# Create the service underneath the scrub background service slice so that we
+# can control resource usage.
+Slice=system-xfs_scrub.slice
+
+# No realtime CPU scheduling
+RestrictRealtime=true
+
+# Dynamically create a user that isn't root
+DynamicUser=true
+
+# Make the entire filesystem readonly, but don't hide /home and don't use a
+# private bind mount like xfs_scrub.  We don't want to pin the filesystem,
+# because we want umount to work correctly and this service to stop
+# automatically.
+ProtectSystem=strict
+ProtectHome=no
+PrivateTmp=true
+PrivateDevices=true
+
+# Don't let scrub complain about paths in /etc/projects that have been hidden
+# by our sandboxing.  scrub doesn't care about project ids anyway.
+InaccessiblePaths=-/etc/projects
+
+# No network access
+PrivateNetwork=true
+ProtectHostname=true
+RestrictAddressFamilies=none
+IPAddressDeny=any
+
+# Don't let the program mess with the kernel configuration at all
+ProtectKernelLogs=true
+ProtectKernelModules=true
+ProtectKernelTunables=true
+ProtectControlGroups=true
+ProtectProc=invisible
+RestrictNamespaces=true
+
+# Hide everything in /proc, even /proc/mounts
+ProcSubset=pid
+
+# Only allow the default personality Linux
+LockPersonality=true
+
+# No memory pages that are both writable and executable
+MemoryDenyWriteExecute=true
+
+# Don't let our mounts leak out to the host
+PrivateMounts=true
+
+# Restrict system calls to the native arch and only enough to get things going
+SystemCallArchitectures=native
+SystemCallFilter=@system-service
+SystemCallFilter=~@privileged
+SystemCallFilter=~@resources
+SystemCallFilter=~@mount
+
+# xfs_scrubbed needs this capability to run, and no others
+CapabilityBoundingSet=CAP_SYS_ADMIN
+AmbientCapabilities=CAP_SYS_ADMIN
+NoNewPrivileges=true
+
+# xfs_scrubbed doesn't create files
+UMask=7777
+
+# No access to hardware /dev files except for block devices
+ProtectClock=true
+DevicePolicy=closed
diff --git a/scrub/xfs_scrubbed_start b/scrub/xfs_scrubbed_start
new file mode 100755
index 000000000000..471fdc99eb16
--- /dev/null
+++ b/scrub/xfs_scrubbed_start
@@ -0,0 +1,17 @@
+#!/bin/sh
+
+# SPDX-License-Identifier: GPL-2.0-or-later
+#
+# Copyright (C) 2024 Oracle.  All Rights Reserved.
+# Author: Darrick J. Wong <djwong@kernel.org>
+
+# Start the xfs_scrubbed service when the filesystem is mounted
+
+command -v systemctl || exit 0
+
+grep "^$SOURCE[[:space:]]" /proc/mounts | while read source mntpt therest; do
+	inst="$(systemd-escape --path "$mntpt")"
+	systemctl restart --no-block "xfs_scrubbed@$inst" && break
+done
+
+exit 0


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH 09/10] time_stats: report information in json format
  2024-02-24  1:12   ` [PATCH 09/10] time_stats: report information in json format Darrick J. Wong
@ 2024-02-24  4:15     ` Darrick J. Wong
  2024-02-24  5:10       ` Kent Overstreet
  0 siblings, 1 reply; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  4:15 UTC (permalink / raw
  To: akpm, daniel, kent.overstreet; +Cc: linux-xfs, linux-bcachefs, linux-kernel

On Fri, Feb 23, 2024 at 05:12:26PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Export json versions of time statistics information.  Given the tabular
> nature of the numbers exposed, this will make it a lot easier for higher
> (than C) level languages (e.g. python) to import information without
> needing to write yet another clumsy string parser.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  include/linux/time_stats.h |    2 +
>  lib/time_stats.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 89 insertions(+)
> 
> 
> diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
> index b3c810fff963a..4e1f5485ed039 100644
> --- a/include/linux/time_stats.h
> +++ b/include/linux/time_stats.h
> @@ -156,6 +156,8 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
>  struct seq_buf;
>  void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
>  		const char *epoch_name, unsigned int flags);
> +void time_stats_to_json(struct seq_buf *, struct time_stats *,
> +		const char *epoch_name, unsigned int flags);
>  
>  void time_stats_exit(struct time_stats *);
>  void time_stats_init(struct time_stats *);
> diff --git a/lib/time_stats.c b/lib/time_stats.c
> index 0fb3d854e503b..c0f209dd9f6dd 100644
> --- a/lib/time_stats.c
> +++ b/lib/time_stats.c
> @@ -266,6 +266,93 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
>  }
>  EXPORT_SYMBOL_GPL(time_stats_to_seq_buf);
>  
> +void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
> +		const char *epoch_name, unsigned int flags)
> +{
> +	struct quantiles *quantiles = time_stats_to_quantiles(stats);
> +	s64 f_mean = 0, d_mean = 0;
> +	u64 f_stddev = 0, d_stddev = 0;
> +
> +	if (stats->buffer) {
> +		int cpu;
> +
> +		spin_lock_irq(&stats->lock);
> +		for_each_possible_cpu(cpu)
> +			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
> +		spin_unlock_irq(&stats->lock);
> +	}
> +
> +	if (stats->freq_stats.n) {
> +		/* avoid divide by zero */
> +		f_mean = mean_and_variance_get_mean(stats->freq_stats);
> +		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
> +		d_mean = mean_and_variance_get_mean(stats->duration_stats);
> +		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
> +	} else if (flags & TIME_STATS_PRINT_NO_ZEROES) {
> +		/* unless we didn't want zeroes anyway */
> +		return;
> +	}
> +
> +	seq_buf_printf(out, "{\n");
> +	seq_buf_printf(out, "  \"epoch\":       \"%s\",\n", epoch_name);
> +	seq_buf_printf(out, "  \"count\":       %llu,\n", stats->duration_stats.n);
> +
> +	seq_buf_printf(out, "  \"duration_ns\": {\n");
> +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_duration);
> +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_duration);
> +	seq_buf_printf(out, "    \"total\":     %llu,\n", stats->total_duration);
> +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> +	seq_buf_printf(out, "  },\n");
> +
> +	d_mean = mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> +	d_stddev = mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> +
> +	seq_buf_printf(out, "  \"duration_ewma_ns\": {\n");
> +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> +	seq_buf_printf(out, "  },\n");
> +
> +	seq_buf_printf(out, "  \"frequency_ns\": {\n");

I took the variable names too literally here; these labels really ought
to be "between_ns" and "between_ewma_ns" to maintain consistency with
the labels in the table format.

> +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
> +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
> +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> +	seq_buf_printf(out, "  },\n");
> +
> +	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> +	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> +
> +	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
> +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> +
> +	if (quantiles) {
> +		u64 last_q = 0;
> +
> +		/* close frequency_ewma_ns but signal more items */

(also this comment)

> +		seq_buf_printf(out, "  },\n");
> +
> +		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
> +		eytzinger0_for_each(i, NR_QUANTILES) {
> +			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
> +
> +			u64 q = max(quantiles->entries[i].m, last_q);
> +			seq_buf_printf(out, "    %llu", q);
> +			if (!is_last)
> +				seq_buf_printf(out, ", ");
> +			last_q = q;
> +		}
> +		seq_buf_printf(out, "  ]\n");
> +	} else {
> +		/* close frequency_ewma_ns without dumping further */

(this one too)

Kent, would you mind making that edit the next time you reflow your
branch?

--D

> +		seq_buf_printf(out, "  }\n");
> +	}
> +
> +	seq_buf_printf(out, "}\n");
> +}
> +EXPORT_SYMBOL_GPL(time_stats_to_json);
> +
>  void time_stats_exit(struct time_stats *stats)
>  {
>  	free_percpu(stats->buffer);
> 
> 
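
For anyone following along from userspace, here is a minimal Python sketch of what consuming this output might look like. The key names follow the seq_buf_printf() calls in the quoted patch, with the "between_ns" and "between_ewma_ns" labels proposed above. The SAMPLE document, its values, and the "mount" epoch name are invented for illustration and do not come from a real system. summarize() is a hypothetical helper, not part of any existing tool.

```python
import json

# Hypothetical output: the layout mirrors the patch, the numbers are made up.
SAMPLE = """{
  "epoch":       "mount",
  "count":       3,
  "duration_ns": {
    "min":       100,
    "max":       900,
    "total":     1500,
    "mean":      500,
    "stddev":    400
  },
  "duration_ewma_ns": {
    "mean":      480,
    "stddev":    390
  },
  "between_ns": {
    "min":       1000,
    "max":       5000,
    "mean":      3000,
    "stddev":    2000
  },
  "between_ewma_ns": {
    "mean":      2900,
    "stddev":    1900
  }
}
"""


def summarize(text):
    """Parse one time_stats JSON document and pull out a few fields."""
    stats = json.loads(text)
    duration = stats["duration_ns"]
    return {
        "epoch": stats["epoch"],
        "count": stats["count"],
        # Values are reported in nanoseconds; convert the mean to microseconds.
        "mean_us": duration["mean"] / 1000,
        # "quantiles_ns" is only emitted when the stats object has quantiles.
        "has_quantiles": "quantiles_ns" in stats,
    }
```

Calling summarize(SAMPLE) needs nothing beyond the standard json module, which is the point the commit message makes about avoiding a hand-written string parser.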


* Re: [PATCH 09/10] time_stats: report information in json format
  2024-02-24  4:15     ` Darrick J. Wong
@ 2024-02-24  5:10       ` Kent Overstreet
  2024-02-24  6:02         ` Darrick J. Wong
  0 siblings, 1 reply; 59+ messages in thread
From: Kent Overstreet @ 2024-02-24  5:10 UTC
  To: Darrick J. Wong; +Cc: akpm, daniel, linux-xfs, linux-bcachefs, linux-kernel

On Fri, Feb 23, 2024 at 08:15:45PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 23, 2024 at 05:12:26PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Export json versions of time statistics information.  Given the tabular
> > nature of the numbers exposed, this will make it a lot easier for higher
> > (than C) level languages (e.g. python) to import information without
> > needing to write yet another clumsy string parser.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > ---
> >  include/linux/time_stats.h |    2 +
> >  lib/time_stats.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 89 insertions(+)
> > 
> > 
> > diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
> > index b3c810fff963a..4e1f5485ed039 100644
> > --- a/include/linux/time_stats.h
> > +++ b/include/linux/time_stats.h
> > @@ -156,6 +156,8 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
> >  struct seq_buf;
> >  void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
> >  		const char *epoch_name, unsigned int flags);
> > +void time_stats_to_json(struct seq_buf *, struct time_stats *,
> > +		const char *epoch_name, unsigned int flags);
> >  
> >  void time_stats_exit(struct time_stats *);
> >  void time_stats_init(struct time_stats *);
> > diff --git a/lib/time_stats.c b/lib/time_stats.c
> > index 0fb3d854e503b..c0f209dd9f6dd 100644
> > --- a/lib/time_stats.c
> > +++ b/lib/time_stats.c
> > @@ -266,6 +266,93 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
> >  }
> >  EXPORT_SYMBOL_GPL(time_stats_to_seq_buf);
> >  
> > +void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
> > +		const char *epoch_name, unsigned int flags)
> > +{
> > +	struct quantiles *quantiles = time_stats_to_quantiles(stats);
> > +	s64 f_mean = 0, d_mean = 0;
> > +	u64 f_stddev = 0, d_stddev = 0;
> > +
> > +	if (stats->buffer) {
> > +		int cpu;
> > +
> > +		spin_lock_irq(&stats->lock);
> > +		for_each_possible_cpu(cpu)
> > +			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
> > +		spin_unlock_irq(&stats->lock);
> > +	}
> > +
> > +	if (stats->freq_stats.n) {
> > +		/* avoid divide by zero */
> > +		f_mean = mean_and_variance_get_mean(stats->freq_stats);
> > +		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
> > +		d_mean = mean_and_variance_get_mean(stats->duration_stats);
> > +		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
> > +	} else if (flags & TIME_STATS_PRINT_NO_ZEROES) {
> > +		/* unless we didn't want zeroes anyway */
> > +		return;
> > +	}
> > +
> > +	seq_buf_printf(out, "{\n");
> > +	seq_buf_printf(out, "  \"epoch\":       \"%s\",\n", epoch_name);
> > +	seq_buf_printf(out, "  \"count\":       %llu,\n", stats->duration_stats.n);
> > +
> > +	seq_buf_printf(out, "  \"duration_ns\": {\n");
> > +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_duration);
> > +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_duration);
> > +	seq_buf_printf(out, "    \"total\":     %llu,\n", stats->total_duration);
> > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> > +	seq_buf_printf(out, "  },\n");
> > +
> > +	d_mean = mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> > +	d_stddev = mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> > +
> > +	seq_buf_printf(out, "  \"duration_ewma_ns\": {\n");
> > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> > +	seq_buf_printf(out, "  },\n");
> > +
> > +	seq_buf_printf(out, "  \"frequency_ns\": {\n");
> 
> I took the variable names too literally here; these labels really ought
> to be "between_ns" and "between_ewma_ns" to maintain consistency with
> the labels in the table format.
> 
> > +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
> > +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
> > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> > +	seq_buf_printf(out, "  },\n");
> > +
> > +	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> > +	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> > +
> > +	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
> > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> > +
> > +	if (quantiles) {
> > +		u64 last_q = 0;
> > +
> > +		/* close frequency_ewma_ns but signal more items */
> 
> (also this comment)
> 
> > +		seq_buf_printf(out, "  },\n");
> > +
> > +		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
> > +		eytzinger0_for_each(i, NR_QUANTILES) {
> > +			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
> > +
> > +			u64 q = max(quantiles->entries[i].m, last_q);
> > +			seq_buf_printf(out, "    %llu", q);
> > +			if (!is_last)
> > +				seq_buf_printf(out, ", ");
> > +			last_q = q;
> > +		}
> > +		seq_buf_printf(out, "  ]\n");
> > +	} else {
> > +		/* close frequency_ewma_ns without dumping further */
> 
> (this one too)
> 
> Kent, would you mind making that edit the next time you reflow your
> branch?
> 
> --D
> 
> > +		seq_buf_printf(out, "  }\n");
> > +	}
> > +
> > +	seq_buf_printf(out, "}\n");
> > +}
> > +EXPORT_SYMBOL_GPL(time_stats_to_json);
> > +
> >  void time_stats_exit(struct time_stats *stats)
> >  {
> >  	free_percpu(stats->buffer);
> > 
> > 


From 5885a65fa5a0aace7bdf1a8fa58ac2bca3b15900 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: Sat, 24 Feb 2024 00:10:06 -0500
Subject: [PATCH] fixup! time_stats: report information in json format


diff --git a/lib/time_stats.c b/lib/time_stats.c
index 0b90c80cba9f..d7dd64baebb8 100644
--- a/lib/time_stats.c
+++ b/lib/time_stats.c
@@ -313,7 +313,7 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
 	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
 	seq_buf_printf(out, "  },\n");
 
-	seq_buf_printf(out, "  \"frequency_ns\": {\n");
+	seq_buf_printf(out, "  \"between_ns\": {\n");
 	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
 	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
 	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
@@ -323,14 +323,14 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
 	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
 	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
 
-	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
+	seq_buf_printf(out, "  \"between_ewma_ns\": {\n");
 	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
 	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
 
 	if (quantiles) {
 		u64 last_q = 0;
 
-		/* close frequency_ewma_ns but signal more items */
+		/* close between_ewma_ns but signal more items */
 		seq_buf_printf(out, "  },\n");
 
 		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
@@ -345,7 +345,7 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
 		}
 		seq_buf_printf(out, "  ]\n");
 	} else {
-		/* close frequency_ewma_ns without dumping further */
+		/* close between_ewma_ns without dumping further */
 		seq_buf_printf(out, "  }\n");
 	}
 

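
The two comments touched by the fixup describe how the emitter keeps the JSON well-formed: the last object is closed with "}," only when a quantiles array follows, and with a bare "}" otherwise. The sketch below models that branch in Python so that both outputs can be checked with a real parser. It is an illustration of the comma handling only, not the kernel code; emit_tail() and the zeroed values are hypothetical.

```python
import json


def emit_tail(quantiles=None):
    """Model the end of the document: the final object, then optional quantiles."""
    out = [
        '{\n',
        '  "between_ewma_ns": {\n',
        '    "mean":      0,\n',
        '    "stddev":    0\n',
    ]
    if quantiles is not None:
        # close between_ewma_ns but signal more items
        out.append('  },\n')
        out.append('  "quantiles_ns": [\n')
        for idx, q in enumerate(quantiles):
            out.append('    %d' % q)
            # JSON forbids a trailing comma, so skip it after the last entry.
            if idx != len(quantiles) - 1:
                out.append(', ')
        out.append('  ]\n')
    else:
        # close between_ewma_ns without dumping further
        out.append('  }\n')
    out.append('}\n')
    return ''.join(out)
```

Feeding both emit_tail() and emit_tail([1, 2, 3]) to json.loads() is a quick way to confirm that neither branch leaves a dangling comma.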

* Re: [PATCH 09/10] time_stats: report information in json format
  2024-02-24  5:10       ` Kent Overstreet
@ 2024-02-24  6:02         ` Darrick J. Wong
  0 siblings, 0 replies; 59+ messages in thread
From: Darrick J. Wong @ 2024-02-24  6:02 UTC
  To: Kent Overstreet; +Cc: akpm, daniel, linux-xfs, linux-bcachefs, linux-kernel

On Sat, Feb 24, 2024 at 12:10:33AM -0500, Kent Overstreet wrote:
> On Fri, Feb 23, 2024 at 08:15:45PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 23, 2024 at 05:12:26PM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Export json versions of time statistics information.  Given the tabular
> > > nature of the numbers exposed, this will make it a lot easier for higher
> > > (than C) level languages (e.g. python) to import information without
> > > needing to write yet another clumsy string parser.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > ---
> > >  include/linux/time_stats.h |    2 +
> > >  lib/time_stats.c           |   87 ++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 89 insertions(+)
> > > 
> > > 
> > > diff --git a/include/linux/time_stats.h b/include/linux/time_stats.h
> > > index b3c810fff963a..4e1f5485ed039 100644
> > > --- a/include/linux/time_stats.h
> > > +++ b/include/linux/time_stats.h
> > > @@ -156,6 +156,8 @@ static inline bool track_event_change(struct time_stats *stats, bool v)
> > >  struct seq_buf;
> > >  void time_stats_to_seq_buf(struct seq_buf *, struct time_stats *,
> > >  		const char *epoch_name, unsigned int flags);
> > > +void time_stats_to_json(struct seq_buf *, struct time_stats *,
> > > +		const char *epoch_name, unsigned int flags);
> > >  
> > >  void time_stats_exit(struct time_stats *);
> > >  void time_stats_init(struct time_stats *);
> > > diff --git a/lib/time_stats.c b/lib/time_stats.c
> > > index 0fb3d854e503b..c0f209dd9f6dd 100644
> > > --- a/lib/time_stats.c
> > > +++ b/lib/time_stats.c
> > > @@ -266,6 +266,93 @@ void time_stats_to_seq_buf(struct seq_buf *out, struct time_stats *stats,
> > >  }
> > >  EXPORT_SYMBOL_GPL(time_stats_to_seq_buf);
> > >  
> > > +void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
> > > +		const char *epoch_name, unsigned int flags)
> > > +{
> > > +	struct quantiles *quantiles = time_stats_to_quantiles(stats);
> > > +	s64 f_mean = 0, d_mean = 0;
> > > +	u64 f_stddev = 0, d_stddev = 0;
> > > +
> > > +	if (stats->buffer) {
> > > +		int cpu;
> > > +
> > > +		spin_lock_irq(&stats->lock);
> > > +		for_each_possible_cpu(cpu)
> > > +			__time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
> > > +		spin_unlock_irq(&stats->lock);
> > > +	}
> > > +
> > > +	if (stats->freq_stats.n) {
> > > +		/* avoid divide by zero */
> > > +		f_mean = mean_and_variance_get_mean(stats->freq_stats);
> > > +		f_stddev = mean_and_variance_get_stddev(stats->freq_stats);
> > > +		d_mean = mean_and_variance_get_mean(stats->duration_stats);
> > > +		d_stddev = mean_and_variance_get_stddev(stats->duration_stats);
> > > +	} else if (flags & TIME_STATS_PRINT_NO_ZEROES) {
> > > +		/* unless we didn't want zeroes anyway */
> > > +		return;
> > > +	}
> > > +
> > > +	seq_buf_printf(out, "{\n");
> > > +	seq_buf_printf(out, "  \"epoch\":       \"%s\",\n", epoch_name);
> > > +	seq_buf_printf(out, "  \"count\":       %llu,\n", stats->duration_stats.n);
> > > +
> > > +	seq_buf_printf(out, "  \"duration_ns\": {\n");
> > > +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_duration);
> > > +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_duration);
> > > +	seq_buf_printf(out, "    \"total\":     %llu,\n", stats->total_duration);
> > > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> > > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> > > +	seq_buf_printf(out, "  },\n");
> > > +
> > > +	d_mean = mean_and_variance_weighted_get_mean(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> > > +	d_stddev = mean_and_variance_weighted_get_stddev(stats->duration_stats_weighted, TIME_STATS_MV_WEIGHT);
> > > +
> > > +	seq_buf_printf(out, "  \"duration_ewma_ns\": {\n");
> > > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", d_mean);
> > > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
> > > +	seq_buf_printf(out, "  },\n");
> > > +
> > > +	seq_buf_printf(out, "  \"frequency_ns\": {\n");
> > 
> > I took the variable names too literally here; these labels really ought
> > to be "between_ns" and "between_ewma_ns" to maintain consistency with
> > the labels in the table format.
> > 
> > > +	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
> > > +	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
> > > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> > > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> > > +	seq_buf_printf(out, "  },\n");
> > > +
> > > +	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> > > +	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
> > > +
> > > +	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
> > > +	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> > > +	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
> > > +
> > > +	if (quantiles) {
> > > +		u64 last_q = 0;
> > > +
> > > +		/* close frequency_ewma_ns but signal more items */
> > 
> > (also this comment)
> > 
> > > +		seq_buf_printf(out, "  },\n");
> > > +
> > > +		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
> > > +		eytzinger0_for_each(i, NR_QUANTILES) {
> > > +			bool is_last = eytzinger0_next(i, NR_QUANTILES) == -1;
> > > +
> > > +			u64 q = max(quantiles->entries[i].m, last_q);
> > > +			seq_buf_printf(out, "    %llu", q);
> > > +			if (!is_last)
> > > +				seq_buf_printf(out, ", ");
> > > +			last_q = q;
> > > +		}
> > > +		seq_buf_printf(out, "  ]\n");
> > > +	} else {
> > > +		/* close frequency_ewma_ns without dumping further */
> > 
> > (this one too)
> > 
> > Kent, would you mind making that edit the next time you reflow your
> > branch?
> > 
> > --D
> > 
> > > +		seq_buf_printf(out, "  }\n");
> > > +	}
> > > +
> > > +	seq_buf_printf(out, "}\n");
> > > +}
> > > +EXPORT_SYMBOL_GPL(time_stats_to_json);
> > > +
> > >  void time_stats_exit(struct time_stats *stats)
> > >  {
> > >  	free_percpu(stats->buffer);
> > > 
> > > 
> 
> 
> From 5885a65fa5a0aace7bdf1a8fa58ac2bca3b15900 Mon Sep 17 00:00:00 2001
> From: Kent Overstreet <kent.overstreet@linux.dev>
> Date: Sat, 24 Feb 2024 00:10:06 -0500
> Subject: [PATCH] fixup! time_stats: report information in json format
> 
> 
> diff --git a/lib/time_stats.c b/lib/time_stats.c
> index 0b90c80cba9f..d7dd64baebb8 100644
> --- a/lib/time_stats.c
> +++ b/lib/time_stats.c
> @@ -313,7 +313,7 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
>  	seq_buf_printf(out, "    \"stddev\":    %llu\n", d_stddev);
>  	seq_buf_printf(out, "  },\n");
>  
> -	seq_buf_printf(out, "  \"frequency_ns\": {\n");
> +	seq_buf_printf(out, "  \"between_ns\": {\n");
>  	seq_buf_printf(out, "    \"min\":       %llu,\n", stats->min_freq);
>  	seq_buf_printf(out, "    \"max\":       %llu,\n", stats->max_freq);
>  	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
> @@ -323,14 +323,14 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
>  	f_mean = mean_and_variance_weighted_get_mean(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
>  	f_stddev = mean_and_variance_weighted_get_stddev(stats->freq_stats_weighted, TIME_STATS_MV_WEIGHT);
>  
> -	seq_buf_printf(out, "  \"frequency_ewma_ns\": {\n");
> +	seq_buf_printf(out, "  \"between_ewma_ns\": {\n");

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

>  	seq_buf_printf(out, "    \"mean\":      %llu,\n", f_mean);
>  	seq_buf_printf(out, "    \"stddev\":    %llu\n", f_stddev);
>  
>  	if (quantiles) {
>  		u64 last_q = 0;
>  
> -		/* close frequency_ewma_ns but signal more items */
> +		/* close between_ewma_ns but signal more items */
>  		seq_buf_printf(out, "  },\n");
>  
>  		seq_buf_printf(out, "  \"quantiles_ns\": [\n");
> @@ -345,7 +345,7 @@ void time_stats_to_json(struct seq_buf *out, struct time_stats *stats,
>  		}
>  		seq_buf_printf(out, "  ]\n");
>  	} else {
> -		/* close frequency_ewma_ns without dumping further */
> +		/* close between_ewma_ns without dumping further */
>  		seq_buf_printf(out, "  }\n");
>  	}
>  

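
One detail of the quoted quantiles loop that a consumer may want to know: each value is printed as max(entry, last_q), so the emitted "quantiles_ns" array never decreases, even if a stored estimate is lower than the one before it. The reason is presumably that the stored quantiles are estimates and can come out of order; that motivation is an assumption, not something stated in the thread. A small Python sketch of the same running-maximum clamp, with a hypothetical helper name:

```python
def clamp_monotonic(values):
    """Return values with each entry raised to at least the previous output."""
    out = []
    last = 0
    for v in values:
        q = max(v, last)  # same clamp as `max(quantiles->entries[i].m, last_q)`
        out.append(q)
        last = q
    return out
```

With this clamp an input such as [5, 3, 8, 7, 10] comes out as [5, 5, 8, 8, 10], so a reader of the JSON can rely on the array being sorted in non-decreasing order.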

Thread overview: 59+ messages
2024-02-24  1:00 [PATCHBOMB] time_stats, thread_with_file: lifting generic code to lib Darrick J. Wong
2024-02-24  1:07 ` [PATCHSET 1/6] time_stats: promote to lib/ Darrick J. Wong
2024-02-24  1:09   ` [PATCH 1/4] mean and variance: Promote to lib/math Darrick J. Wong
2024-02-24  1:09   ` [PATCH 2/4] eytzinger: Promote to include/linux/ Darrick J. Wong
2024-02-24  1:09   ` [PATCH 3/4] bcachefs: bch2_time_stats_to_seq_buf() Darrick J. Wong
2024-02-24  1:10   ` [PATCH 4/4] time_stats: Promote to lib/ Darrick J. Wong
2024-02-24  1:08 ` [PATCHSET 2/6] time_stats: cleanups and fixes Darrick J. Wong
2024-02-24  1:10   ` [PATCH 01/10] time_stats: report lifetime of the stats object Darrick J. Wong
2024-02-24  1:10   ` [PATCH 02/10] time_stats: split stats-with-quantiles into a separate structure Darrick J. Wong
2024-02-24  1:10   ` [PATCH 03/10] time_stats: fix struct layout bloat Darrick J. Wong
2024-02-24  1:11   ` [PATCH 04/10] time_stats: add larger units Darrick J. Wong
2024-02-24  1:11   ` [PATCH 05/10] time_stats: don't print any output if event count is zero Darrick J. Wong
2024-02-24  1:11   ` [PATCH 06/10] time_stats: allow custom epoch names Darrick J. Wong
2024-02-24  1:11   ` [PATCH 07/10] mean_and_variance: put struct mean_and_variance_weighted on a diet Darrick J. Wong
2024-02-24  1:12   ` [PATCH 08/10] time_stats: shrink time_stat_buffer for better alignment Darrick J. Wong
2024-02-24  1:12   ` [PATCH 09/10] time_stats: report information in json format Darrick J. Wong
2024-02-24  4:15     ` Darrick J. Wong
2024-02-24  5:10       ` Kent Overstreet
2024-02-24  6:02         ` Darrick J. Wong
2024-02-24  1:12   ` [PATCH 10/10] time_stats: Kill TIME_STATS_HAVE_QUANTILES Darrick J. Wong
2024-02-24  1:08 ` [PATCHSET RFC 3/6] xfs: capture statistics about wait times Darrick J. Wong
2024-02-24  1:12   ` [PATCH 1/4] xfs: present wait time statistics Darrick J. Wong
2024-02-24  1:13   ` [PATCH 2/4] xfs: present time stats for scrubbers Darrick J. Wong
2024-02-24  1:13   ` [PATCH 3/4] xfs: present timestats in json format Darrick J. Wong
2024-02-24  1:13   ` [PATCH 4/4] xfs: create debugfs uuid aliases Darrick J. Wong
2024-02-24  1:08 ` [PATCHSET 4/6] thread_with_file: promote to lib/ Darrick J. Wong
2024-02-24  1:14   ` [PATCH 01/10] bcachefs: thread_with_stdio: eliminate double buffering Darrick J. Wong
2024-02-24  1:14   ` [PATCH 02/10] bcachefs: thread_with_stdio: convert to darray Darrick J. Wong
2024-02-24  1:14   ` [PATCH 03/10] bcachefs: thread_with_stdio: kill thread_with_stdio_done() Darrick J. Wong
2024-02-24  1:14   ` [PATCH 04/10] bcachefs: thread_with_stdio: fix bch2_stdio_redirect_readline() Darrick J. Wong
2024-02-24  1:15   ` [PATCH 05/10] bcachefs: Thread with file documentation Darrick J. Wong
2024-02-24  1:15   ` [PATCH 06/10] darray: lift from bcachefs Darrick J. Wong
2024-02-24  1:15   ` [PATCH 07/10] thread_with_file: Lift " Darrick J. Wong
2024-02-24  1:15   ` [PATCH 08/10] thread_with_stdio: Mark completed in ->release() Darrick J. Wong
2024-02-24  1:16   ` [PATCH 09/10] kernel/hung_task.c: export sysctl_hung_task_timeout_secs Darrick J. Wong
2024-02-24  1:16   ` [PATCH 10/10] thread_with_stdio: suppress hung task warning Darrick J. Wong
2024-02-24  1:08 ` [PATCHSET 5/6] thread_with_file: cleanups and fixes Darrick J. Wong
2024-02-24  1:16   ` [PATCH 1/5] thread_with_file: allow creation of readonly files Darrick J. Wong
2024-02-24  1:16   ` [PATCH 2/5] thread_with_file: fix various printf problems Darrick J. Wong
2024-02-24  1:17   ` [PATCH 3/5] thread_with_file: create ops structure for thread_with_stdio Darrick J. Wong
2024-02-24  1:17   ` [PATCH 4/5] thread_with_file: allow ioctls against these files Darrick J. Wong
2024-02-24  1:17   ` [PATCH 5/5] thread_with_file: Fix missing va_end() Darrick J. Wong
2024-02-24  1:09 ` [PATCHSET RFC 6/6] xfs: live health monitoring of filesystems Darrick J. Wong
2024-02-24  1:17   ` [PATCH 1/8] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
2024-02-24  1:18   ` [PATCH 2/8] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-02-24  1:18   ` [PATCH 3/8] xfs: create a filesystem shutdown hook Darrick J. Wong
2024-02-24  1:18   ` [PATCH 4/8] xfs: report shutdown events through healthmon Darrick J. Wong
2024-02-24  1:18   ` [PATCH 5/8] xfs: report metadata health " Darrick J. Wong
2024-02-24  1:19   ` [PATCH 6/8] xfs: report media errors " Darrick J. Wong
2024-02-24  1:19   ` [PATCH 7/8] xfs: allow reconfiguration of the health monitoring device Darrick J. Wong
2024-02-24  1:19   ` [PATCH 8/8] xfs: send uevents when mounting and unmounting a filesystem Darrick J. Wong
2024-02-24  1:34 ` [PATCHSET RFC] xfsprogs: live health monitoring of filesystems Darrick J. Wong
2024-02-24  1:34   ` [PATCH 1/7] xfs: use thread_with_file to create a monitoring file Darrick J. Wong
2024-02-24  1:34   ` [PATCH 2/7] xfs: create hooks for monitoring health updates Darrick J. Wong
2024-02-24  1:34   ` [PATCH 3/7] xfs: report shutdown events through healthmon Darrick J. Wong
2024-02-24  1:35   ` [PATCH 4/7] xfs_io: monitor filesystem health events Darrick J. Wong
2024-02-24  1:35   ` [PATCH 5/7] xfs_scrubbed: create daemon to listen for " Darrick J. Wong
2024-02-24  1:35   ` [PATCH 6/7] xfs_scrubbed: enable repairing filesystems Darrick J. Wong
2024-02-24  1:36   ` [PATCH 7/7] xfs_scrubbed: create a background monitoring service Darrick J. Wong
