Linux-mm Archive mirror
* [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities
@ 2024-02-13 21:37 Chuck Lever
  2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
                   ` (7 more replies)
  0 siblings, 8 replies; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

In an effort to address slab fragmentation issues reported a few
months ago, I've replaced the use of xarrays for the directory
offset map in "simple" file systems (including tmpfs).

This patch set passes functional testing and is ready for code
review. But I don't have the facilities to re-run the performance
tests that identified the regression. We expect the performance of
this implementation will need additional improvement.

Thanks to Liam Howlett for helping me get this working.

---

Chuck Lever (6):
      libfs: Rename "so_ctx"
      libfs: Define a minimum directory offset
      libfs: Add simple_offset_empty()
      maple_tree: Add mtree_alloc_cyclic()
      libfs: Convert simple directory offsets to use a Maple Tree
      libfs: Re-arrange locking in offset_iterate_dir()

Liam R. Howlett (1):
      test_maple_tree: testing the cyclic allocation


 fs/libfs.c                 | 125 +++++++++++++++++++++++--------------
 include/linux/fs.h         |   6 +-
 include/linux/maple_tree.h |   7 +++
 lib/maple_tree.c           |  93 +++++++++++++++++++++++++++
 lib/test_maple_tree.c      |  44 +++++++++++++
 mm/shmem.c                 |   4 +-
 6 files changed, 227 insertions(+), 52 deletions(-)

--
Chuck Lever



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH RFC 1/7] libfs: Rename "so_ctx"
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
@ 2024-02-13 21:37 ` Chuck Lever
  2024-02-15 12:42   ` Jan Kara
  2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

Most instances of "so_ctx" were renamed before the simple offset
work was merged, but there were a few that were missed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index eec6031b0155..bfbe1a8c5d2d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -271,7 +271,7 @@ void simple_offset_init(struct offset_ctx *octx)
  * @octx: directory offset ctx to be updated
  * @dentry: new dentry being added
  *
- * Returns zero on success. @so_ctx and the dentry offset are updated.
+ * Returns zero on success. @octx and the dentry's offset are updated.
  * Otherwise, a negative errno value is returned.
  */
 int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
@@ -430,8 +430,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 
 static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
-	struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
-	XA_STATE(xas, &so_ctx->xa, ctx->pos);
+	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
+	XA_STATE(xas, &octx->xa, ctx->pos);
 	struct dentry *dentry;
 
 	while (true) {




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 2/7] libfs: Define a minimum directory offset
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
  2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
@ 2024-02-13 21:37 ` Chuck Lever
  2024-02-15 12:47   ` Jan Kara
  2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

This value is used in several places, so make it a symbolic
constant.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index bfbe1a8c5d2d..a38af72f4719 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -240,6 +240,11 @@ const struct inode_operations simple_dir_inode_operations = {
 };
 EXPORT_SYMBOL(simple_dir_inode_operations);
 
+/* 0 is '.', 1 is '..', so always start with offset 2 or more */
+enum {
+	DIR_OFFSET_MIN	= 2,
+};
+
 static void offset_set(struct dentry *dentry, u32 offset)
 {
 	dentry->d_fsdata = (void *)((uintptr_t)(offset));
@@ -261,9 +266,7 @@ void simple_offset_init(struct offset_ctx *octx)
 {
 	xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
 	lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
-
-	/* 0 is '.', 1 is '..', so always start with offset 2 */
-	octx->next_offset = 2;
+	octx->next_offset = DIR_OFFSET_MIN;
 }
 
 /**
@@ -276,7 +279,7 @@ void simple_offset_init(struct offset_ctx *octx)
  */
 int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
 {
-	static const struct xa_limit limit = XA_LIMIT(2, U32_MAX);
+	static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
 	u32 offset;
 	int ret;
 
@@ -481,7 +484,7 @@ static int offset_readdir(struct file *file, struct dir_context *ctx)
 		return 0;
 
 	/* In this case, ->private_data is protected by f_pos_lock */
-	if (ctx->pos == 2)
+	if (ctx->pos == DIR_OFFSET_MIN)
 		file->private_data = NULL;
 	else if (file->private_data == ERR_PTR(-ENOENT))
 		return 0;




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 3/7] libfs: Add simple_offset_empty()
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
  2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
  2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
@ 2024-02-13 21:37 ` Chuck Lever
  2024-02-15 12:53   ` Jan Kara
  2024-02-13 21:37 ` [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic() Chuck Lever
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

For simple filesystems that use directory offset mapping, rely
strictly on the directory offset map to tell when a directory has
no children.

After this patch is applied, the emptiness test holds only the RCU
read lock when the directory being tested has no children.

In addition, this adds another layer of confirmation that
simple_offset_add/remove() are working as expected.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c         |   32 ++++++++++++++++++++++++++++++++
 include/linux/fs.h |    1 +
 mm/shmem.c         |    4 ++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index a38af72f4719..3cf773950f93 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -313,6 +313,38 @@ void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
 	offset_set(dentry, 0);
 }
 
+/**
+ * simple_offset_empty - Check if a dentry can be unlinked
+ * @dentry: dentry to be tested
+ *
+ * Returns 0 if @dentry is a non-empty directory; otherwise returns 1.
+ */
+int simple_offset_empty(struct dentry *dentry)
+{
+	struct inode *inode = d_inode(dentry);
+	struct offset_ctx *octx;
+	struct dentry *child;
+	unsigned long index;
+	int ret = 1;
+
+	if (!inode || !S_ISDIR(inode->i_mode))
+		return ret;
+
+	index = 2;
+	octx = inode->i_op->get_offset_ctx(inode);
+	xa_for_each(&octx->xa, index, child) {
+		spin_lock(&child->d_lock);
+		if (simple_positive(child)) {
+			spin_unlock(&child->d_lock);
+			ret = 0;
+			break;
+		}
+		spin_unlock(&child->d_lock);
+	}
+
+	return ret;
+}
+
 /**
  * simple_offset_rename_exchange - exchange rename with directory offsets
  * @old_dir: parent of dentry being moved
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..03d141809a2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3267,6 +3267,7 @@ struct offset_ctx {
 void simple_offset_init(struct offset_ctx *octx);
 int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
 void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
+int simple_offset_empty(struct dentry *dentry);
 int simple_offset_rename_exchange(struct inode *old_dir,
 				  struct dentry *old_dentry,
 				  struct inode *new_dir,
diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..6fed524343cb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3374,7 +3374,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
 
 static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
 {
-	if (!simple_empty(dentry))
+	if (!simple_offset_empty(dentry))
 		return -ENOTEMPTY;
 
 	drop_nlink(d_inode(dentry));
@@ -3431,7 +3431,7 @@ static int shmem_rename2(struct mnt_idmap *idmap,
 		return simple_offset_rename_exchange(old_dir, old_dentry,
 						     new_dir, new_dentry);
 
-	if (!simple_empty(new_dentry))
+	if (!simple_offset_empty(new_dentry))
 		return -ENOTEMPTY;
 
 	if (flags & RENAME_WHITEOUT) {




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic()
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
                   ` (2 preceding siblings ...)
  2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
@ 2024-02-13 21:37 ` Chuck Lever
  2024-02-13 21:37 ` [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation Chuck Lever
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

I need a cyclic allocator for the simple_offset implementation in
fs/libfs.c.
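
As a rough sketch of the intended caller (the conversion itself is in
patch 6/7; DIR_OFFSET_MIN, the offset_ctx mt/next_offset fields, and
offset_set() are the names used there),
simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
would allocate a directory offset roughly like this:

	unsigned long offset;
	int ret;

	/* Find a free index at or above octx->next_offset, wrapping
	 * back to DIR_OFFSET_MIN when the upper bound is reached, and
	 * store the dentry at that index. */
	ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
				 ULONG_MAX, &octx->next_offset, GFP_KERNEL);
	if (ret < 0)
		return ret;

	offset_set(dentry, offset);
	return 0;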

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/maple_tree.h |    7 +++
 lib/maple_tree.c           |   93 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 100 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index b3d63123b945..a53ad4dabd7e 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -171,6 +171,7 @@ enum maple_type {
 #define MT_FLAGS_LOCK_IRQ	0x100
 #define MT_FLAGS_LOCK_BH	0x200
 #define MT_FLAGS_LOCK_EXTERN	0x300
+#define MT_FLAGS_ALLOC_WRAPPED	0x0800
 
 #define MAPLE_HEIGHT_MAX	31
 
@@ -319,6 +320,9 @@ int mtree_insert_range(struct maple_tree *mt, unsigned long first,
 int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp,
 		void *entry, unsigned long size, unsigned long min,
 		unsigned long max, gfp_t gfp);
+int mtree_alloc_cyclic(struct maple_tree *mt, unsigned long *startp,
+		void *entry, unsigned long range_lo, unsigned long range_hi,
+		unsigned long *next, gfp_t gfp);
 int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp,
 		void *entry, unsigned long size, unsigned long min,
 		unsigned long max, gfp_t gfp);
@@ -499,6 +503,9 @@ void *mas_find_range(struct ma_state *mas, unsigned long max);
 void *mas_find_rev(struct ma_state *mas, unsigned long min);
 void *mas_find_range_rev(struct ma_state *mas, unsigned long max);
 int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp);
+int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp,
+		void *entry, unsigned long range_lo, unsigned long range_hi,
+		unsigned long *next, gfp_t gfp);
 
 bool mas_nomem(struct ma_state *mas, gfp_t gfp);
 void mas_pause(struct ma_state *mas);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6f241bb38799..af0970288727 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4290,6 +4290,56 @@ static inline void *mas_insert(struct ma_state *mas, void *entry)
 
 }
 
+/**
+ * mas_alloc_cyclic() - Internal call to find somewhere to store an entry
+ * @mas: The maple state.
+ * @startp: Pointer to ID.
+ * @range_lo: Lower bound of range to search.
+ * @range_hi: Upper bound of range to search.
+ * @entry: The entry to store.
+ * @next: Pointer to next ID to allocate.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Return: 0 if the allocation succeeded without wrapping, 1 if the
+ * allocation succeeded after wrapping, or -EBUSY if there are no
+ * free entries.
+ */
+int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp,
+		void *entry, unsigned long range_lo, unsigned long range_hi,
+		unsigned long *next, gfp_t gfp)
+{
+	unsigned long min = range_lo;
+	int ret = 0;
+
+	range_lo = max(min, *next);
+	ret = mas_empty_area(mas, range_lo, range_hi, 1);
+	if ((mas->tree->ma_flags & MT_FLAGS_ALLOC_WRAPPED) && ret == 0) {
+		mas->tree->ma_flags &= ~MT_FLAGS_ALLOC_WRAPPED;
+		ret = 1;
+	}
+	if (ret < 0 && range_lo > min) {
+		ret = mas_empty_area(mas, min, range_hi, 1);
+		if (ret == 0)
+			ret = 1;
+	}
+	if (ret < 0)
+		return ret;
+
+	do {
+		mas_insert(mas, entry);
+	} while (mas_nomem(mas, gfp));
+	if (mas_is_err(mas))
+		return xa_err(mas->node);
+
+	*startp = mas->index;
+	*next = *startp + 1;
+	if (*next == 0)
+		mas->tree->ma_flags |= MT_FLAGS_ALLOC_WRAPPED;
+
+	return ret;
+}
+EXPORT_SYMBOL(mas_alloc_cyclic);
+
 static __always_inline void mas_rewalk(struct ma_state *mas, unsigned long index)
 {
 retry:
@@ -6443,6 +6493,49 @@ int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp,
 }
 EXPORT_SYMBOL(mtree_alloc_range);
 
+/**
+ * mtree_alloc_cyclic() - Find somewhere to store this entry in the tree.
+ * @mt: The maple tree.
+ * @startp: Pointer to ID.
+ * @range_lo: Lower bound of range to search.
+ * @range_hi: Upper bound of range to search.
+ * @entry: The entry to store.
+ * @next: Pointer to next ID to allocate.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Finds an empty entry in @mt after @next, stores the new index into
+ * the @startp pointer, stores the entry at that index, then updates @next.
+ *
+ * @mt must be initialized with the MT_FLAGS_ALLOC_RANGE flag.
+ *
+ * Context: Any context.  Takes and releases the mt.lock.  May sleep if
+ * the @gfp flags permit.
+ *
+ * Return: 0 if the allocation succeeded without wrapping, 1 if the
+ * allocation succeeded after wrapping, -ENOMEM if memory could not be
+ * allocated, -EINVAL if @mt cannot be used, or -EBUSY if there are no
+ * free entries.
+ */
+int mtree_alloc_cyclic(struct maple_tree *mt, unsigned long *startp,
+		void *entry, unsigned long range_lo, unsigned long range_hi,
+		unsigned long *next, gfp_t gfp)
+{
+	int ret;
+
+	MA_STATE(mas, mt, 0, 0);
+
+	if (!mt_is_alloc(mt))
+		return -EINVAL;
+	if (WARN_ON_ONCE(mt_is_reserved(entry)))
+		return -EINVAL;
+	mtree_lock(mt);
+	ret = mas_alloc_cyclic(&mas, startp, entry, range_lo, range_hi,
+			       next, gfp);
+	mtree_unlock(mt);
+	return ret;
+}
+EXPORT_SYMBOL(mtree_alloc_cyclic);
+
 int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp,
 		void *entry, unsigned long size, unsigned long min,
 		unsigned long max, gfp_t gfp)




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
                   ` (3 preceding siblings ...)
  2024-02-13 21:37 ` [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic() Chuck Lever
@ 2024-02-13 21:37 ` Chuck Lever
  2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:37 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Liam R. Howlett <Liam.Howlett@oracle.com>

This tests the interactions of the cyclic allocations, the maple state
index and last, and overflow.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 lib/test_maple_tree.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 29185ac5c727..399380db449c 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -3599,6 +3599,45 @@ static noinline void __init check_state_handling(struct maple_tree *mt)
 	mas_unlock(&mas);
 }
 
+static noinline void __init alloc_cyclic_testing(struct maple_tree *mt)
+{
+	unsigned long location;
+	unsigned long next;
+	int ret = 0;
+	MA_STATE(mas, mt, 0, 0);
+
+	next = 0;
+	mtree_lock(mt);
+	for (int i = 0; i < 100; i++) {
+		mas_alloc_cyclic(&mas, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+		MAS_BUG_ON(&mas, i != location - 2);
+		MAS_BUG_ON(&mas, mas.index != location);
+		MAS_BUG_ON(&mas, mas.last != location);
+		MAS_BUG_ON(&mas, i != next - 3);
+	}
+
+	mtree_unlock(mt);
+	mtree_destroy(mt);
+	next = 0;
+	mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
+	for (int i = 0; i < 100; i++) {
+		mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+		MT_BUG_ON(mt, i != location - 2);
+		MT_BUG_ON(mt, i != next - 3);
+		MT_BUG_ON(mt, mtree_load(mt, location) != mt);
+	}
+
+	mtree_destroy(mt);
+	/* Overflow test */
+	next = ULONG_MAX - 1;
+	ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+	MT_BUG_ON(mt, ret != 0);
+	ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+	MT_BUG_ON(mt, ret != 0);
+	ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+	MT_BUG_ON(mt, ret != 1);
+}
+
 static DEFINE_MTREE(tree);
 static int __init maple_tree_seed(void)
 {
@@ -3880,6 +3919,11 @@ static int __init maple_tree_seed(void)
 	check_state_handling(&tree);
 	mtree_destroy(&tree);
 
+	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+	alloc_cyclic_testing(&tree);
+	mtree_destroy(&tree);
+
+
 #if defined(BENCH)
 skip:
 #endif




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
                   ` (4 preceding siblings ...)
  2024-02-13 21:37 ` [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation Chuck Lever
@ 2024-02-13 21:38 ` Chuck Lever
  2024-02-15 13:06   ` Jan Kara
  2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
  2024-02-13 21:40 ` [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever III
  7 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:38 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

Test robot reports:
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Feng Tang further clarifies that:
> ... the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.

Willy's analysis is that, over time, the test workload causes
xa_alloc_cyclic() to fragment the underlying SLAB cache.

This patch replaces the offset_ctx's xarray with a Maple Tree in the
hope that Maple Tree's dense node mode will handle this scenario
more scalably.

In addition, we can widen the directory offset to an unsigned long
everywhere.
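
For quick reference, the core data-structure change (a summary of the
include/linux/fs.h hunk below) is:

	struct offset_ctx {
		struct maple_tree	mt;		/* was: struct xarray xa */
		unsigned long		next_offset;	/* was: u32 next_offset */
	};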

Suggested-by: Matthew Wilcox <willy@infradead.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c         |   53 ++++++++++++++++++++++++++--------------------------
 include/linux/fs.h |    5 +++--
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3cf773950f93..f073e9aeb2bf 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -245,17 +245,17 @@ enum {
 	DIR_OFFSET_MIN	= 2,
 };
 
-static void offset_set(struct dentry *dentry, u32 offset)
+static void offset_set(struct dentry *dentry, unsigned long offset)
 {
-	dentry->d_fsdata = (void *)((uintptr_t)(offset));
+	dentry->d_fsdata = (void *)offset;
 }
 
-static u32 dentry2offset(struct dentry *dentry)
+static unsigned long dentry2offset(struct dentry *dentry)
 {
-	return (u32)((uintptr_t)(dentry->d_fsdata));
+	return (unsigned long)dentry->d_fsdata;
 }
 
-static struct lock_class_key simple_offset_xa_lock;
+static struct lock_class_key simple_offset_lock_class;
 
 /**
  * simple_offset_init - initialize an offset_ctx
@@ -264,8 +264,8 @@ static struct lock_class_key simple_offset_xa_lock;
  */
 void simple_offset_init(struct offset_ctx *octx)
 {
-	xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
-	lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
+	mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
+	lockdep_set_class(&octx->mt.ma_lock, &simple_offset_lock_class);
 	octx->next_offset = DIR_OFFSET_MIN;
 }
 
@@ -279,15 +279,14 @@ void simple_offset_init(struct offset_ctx *octx)
  */
 int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
 {
-	static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
-	u32 offset;
+	unsigned long offset;
 	int ret;
 
 	if (dentry2offset(dentry) != 0)
 		return -EBUSY;
 
-	ret = xa_alloc_cyclic(&octx->xa, &offset, dentry, limit,
-			      &octx->next_offset, GFP_KERNEL);
+	ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
+				 ULONG_MAX, &octx->next_offset, GFP_KERNEL);
 	if (ret < 0)
 		return ret;
 
@@ -303,13 +302,13 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
  */
 void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
 {
-	u32 offset;
+	unsigned long offset;
 
 	offset = dentry2offset(dentry);
 	if (offset == 0)
 		return;
 
-	xa_erase(&octx->xa, offset);
+	mtree_erase(&octx->mt, offset);
 	offset_set(dentry, 0);
 }
 
@@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
 	if (!inode || !S_ISDIR(inode->i_mode))
 		return ret;
 
-	index = 2;
+	index = DIR_OFFSET_MIN;
 	octx = inode->i_op->get_offset_ctx(inode);
-	xa_for_each(&octx->xa, index, child) {
+	mt_for_each(&octx->mt, child, index, ULONG_MAX) {
 		spin_lock(&child->d_lock);
 		if (simple_positive(child)) {
 			spin_unlock(&child->d_lock);
@@ -362,8 +361,8 @@ int simple_offset_rename_exchange(struct inode *old_dir,
 {
 	struct offset_ctx *old_ctx = old_dir->i_op->get_offset_ctx(old_dir);
 	struct offset_ctx *new_ctx = new_dir->i_op->get_offset_ctx(new_dir);
-	u32 old_index = dentry2offset(old_dentry);
-	u32 new_index = dentry2offset(new_dentry);
+	unsigned long old_index = dentry2offset(old_dentry);
+	unsigned long new_index = dentry2offset(new_dentry);
 	int ret;
 
 	simple_offset_remove(old_ctx, old_dentry);
@@ -389,9 +388,9 @@ int simple_offset_rename_exchange(struct inode *old_dir,
 
 out_restore:
 	offset_set(old_dentry, old_index);
-	xa_store(&old_ctx->xa, old_index, old_dentry, GFP_KERNEL);
+	mtree_store(&old_ctx->mt, old_index, old_dentry, GFP_KERNEL);
 	offset_set(new_dentry, new_index);
-	xa_store(&new_ctx->xa, new_index, new_dentry, GFP_KERNEL);
+	mtree_store(&new_ctx->mt, new_index, new_dentry, GFP_KERNEL);
 	return ret;
 }
 
@@ -404,7 +403,7 @@ int simple_offset_rename_exchange(struct inode *old_dir,
  */
 void simple_offset_destroy(struct offset_ctx *octx)
 {
-	xa_destroy(&octx->xa);
+	mtree_destroy(&octx->mt);
 }
 
 /**
@@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 
 	/* In this case, ->private_data is protected by f_pos_lock */
 	file->private_data = NULL;
-	return vfs_setpos(file, offset, U32_MAX);
+	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
 }
 
-static struct dentry *offset_find_next(struct xa_state *xas)
+static struct dentry *offset_find_next(struct ma_state *mas)
 {
 	struct dentry *child, *found = NULL;
 
 	rcu_read_lock();
-	child = xas_next_entry(xas, U32_MAX);
+	child = mas_find(mas, ULONG_MAX);
 	if (!child)
 		goto out;
 	spin_lock(&child->d_lock);
@@ -456,7 +455,7 @@ static struct dentry *offset_find_next(struct xa_state *xas)
 
 static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 {
-	u32 offset = dentry2offset(dentry);
+	unsigned long offset = dentry2offset(dentry);
 	struct inode *inode = d_inode(dentry);
 
 	return ctx->actor(ctx, dentry->d_name.name, dentry->d_name.len, offset,
@@ -466,11 +465,11 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
-	XA_STATE(xas, &octx->xa, ctx->pos);
+	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
 	struct dentry *dentry;
 
 	while (true) {
-		dentry = offset_find_next(&xas);
+		dentry = offset_find_next(&mas);
 		if (!dentry)
 			return ERR_PTR(-ENOENT);
 
@@ -480,7 +479,7 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 		}
 
 		dput(dentry);
-		ctx->pos = xas.xa_index + 1;
+		ctx->pos = mas.index + 1;
 	}
 	return NULL;
 }
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03d141809a2c..55144c12ee0f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -43,6 +43,7 @@
 #include <linux/cred.h>
 #include <linux/mnt_idmapping.h>
 #include <linux/slab.h>
+#include <linux/maple_tree.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -3260,8 +3261,8 @@ extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
 		const void __user *from, size_t count);
 
 struct offset_ctx {
-	struct xarray		xa;
-	u32			next_offset;
+	struct maple_tree	mt;
+	unsigned long		next_offset;
 };
 
 void simple_offset_init(struct offset_ctx *octx);




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
                   ` (5 preceding siblings ...)
  2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
@ 2024-02-13 21:38 ` Chuck Lever
  2024-02-15 13:16   ` Jan Kara
  2024-02-13 21:40 ` [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever III
  7 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-13 21:38 UTC (permalink / raw
  To: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang
  Cc: linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

From: Chuck Lever <chuck.lever@oracle.com>

Liam says that, unlike with xarray, once the RCU read lock is
released ma_state is not safe to re-use for the next mas_find() call.
But the RCU read lock has to be released on each loop iteration so
that dput() can be called safely.

Thus we are forced to walk the offset tree with fresh state for each
directory entry. mt_find() can do this for us, though it might be a
little less efficient than maintaining ma_state locally.

Since offset_iterate_dir() doesn't build ma_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/libfs.c |   39 +++++++++++++++++----------------------
 1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index f073e9aeb2bf..6e01fde1cf95 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
 	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
 }
 
-static struct dentry *offset_find_next(struct ma_state *mas)
-{
-	struct dentry *child, *found = NULL;
-
-	rcu_read_lock();
-	child = mas_find(mas, ULONG_MAX);
-	if (!child)
-		goto out;
-	spin_lock(&child->d_lock);
-	if (simple_positive(child))
-		found = dget_dlock(child);
-	spin_unlock(&child->d_lock);
-out:
-	rcu_read_unlock();
-	return found;
-}
-
 static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 {
 	unsigned long offset = dentry2offset(dentry);
@@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
 static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 {
 	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
-	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
-	struct dentry *dentry;
+	struct dentry *dentry, *found;
+	unsigned long offset;
 
+	offset = ctx->pos;
 	while (true) {
-		dentry = offset_find_next(&mas);
+		found = mt_find(&octx->mt, &offset, ULONG_MAX);
+		if (!found)
+			goto out_noent;
+
+		dentry = NULL;
+		spin_lock(&found->d_lock);
+		if (simple_positive(found))
+			dentry = dget_dlock(found);
+		spin_unlock(&found->d_lock);
 		if (!dentry)
-			return ERR_PTR(-ENOENT);
+			goto out_noent;
 
 		if (!offset_dir_emit(ctx, dentry)) {
 			dput(dentry);
@@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
 		}
 
 		dput(dentry);
-		ctx->pos = mas.index + 1;
+		ctx->pos = offset;
 	}
 	return NULL;
+
+out_noent:
+	return ERR_PTR(-ENOENT);
 }
 
 /**




^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities
  2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
                   ` (6 preceding siblings ...)
  2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
@ 2024-02-13 21:40 ` Chuck Lever III
  7 siblings, 0 replies; 29+ messages in thread
From: Chuck Lever III @ 2024-02-13 21:40 UTC (permalink / raw
  To: Al Viro, Christian Brauner, Jan Kara, Hugh Dickins, Andrew Morton,
	Liam Howlett, kernel test robot, Feng Tang
  Cc: Linux Kernel Mailing List, linux-fsdevel,
	maple-tree@lists.infradead.org, linux-mm, kernel test robot



> On Feb 13, 2024, at 4:37 PM, Chuck Lever <cel@kernel.org> wrote:
> 
> In an effort to address slab fragmentation issues reported a few
> months ago, I've replaced the use of xarrays for the directory
> offset map in "simple" file systems (including tmpfs).
> 
> This patch set passes functional testing and is ready for code
> review. But I don't have the facilities to re-run the performance
> tests that identified the regression. We expect the performance of
> this implementation will need additional improvement.
> 
> Thanks to Liam Howlett for helping me get this working.

And note the patches are also available from:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

in the simple-offset-maple branch.


> ---
> 
> Chuck Lever (6):
>      libfs: Rename "so_ctx"
>      libfs: Define a minimum directory offset
>      libfs: Add simple_offset_empty()
>      maple_tree: Add mtree_alloc_cyclic()
>      libfs: Convert simple directory offsets to use a Maple Tree
>      libfs: Re-arrange locking in offset_iterate_dir()
> 
> Liam R. Howlett (1):
>      test_maple_tree: testing the cyclic allocation
> 
> 
> fs/libfs.c                 | 125 +++++++++++++++++++++++--------------
> include/linux/fs.h         |   6 +-
> include/linux/maple_tree.h |   7 +++
> lib/maple_tree.c           |  93 +++++++++++++++++++++++++++
> lib/test_maple_tree.c      |  44 +++++++++++++
> mm/shmem.c                 |   4 +-
> 6 files changed, 227 insertions(+), 52 deletions(-)
> 
> --
> Chuck Lever
> 
> 

--
Chuck Lever



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 1/7] libfs: Rename "so_ctx"
  2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
@ 2024-02-15 12:42   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-15 12:42 UTC (permalink / raw
  To: Chuck Lever
  Cc: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Tue 13-02-24 16:37:25, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Most instances of "so_ctx" were renamed before the simple offset
> work was merged, but there were a few that were missed.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index eec6031b0155..bfbe1a8c5d2d 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -271,7 +271,7 @@ void simple_offset_init(struct offset_ctx *octx)
>   * @octx: directory offset ctx to be updated
>   * @dentry: new dentry being added
>   *
> - * Returns zero on success. @so_ctx and the dentry offset are updated.
> + * Returns zero on success. @octx and the dentry's offset are updated.
>   * Otherwise, a negative errno value is returned.
>   */
>  int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> @@ -430,8 +430,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  
>  static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  {
> -	struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
> -	XA_STATE(xas, &so_ctx->xa, ctx->pos);
> +	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> +	XA_STATE(xas, &octx->xa, ctx->pos);
>  	struct dentry *dentry;
>  
>  	while (true) {
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 2/7] libfs: Define a minimum directory offset
  2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
@ 2024-02-15 12:47   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-15 12:47 UTC (permalink / raw
  To: Chuck Lever
  Cc: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Tue 13-02-24 16:37:32, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> This value is used in several places, so make it a symbolic
> constant.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c |   13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index bfbe1a8c5d2d..a38af72f4719 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -240,6 +240,11 @@ const struct inode_operations simple_dir_inode_operations = {
>  };
>  EXPORT_SYMBOL(simple_dir_inode_operations);
>  
> +/* 0 is '.', 1 is '..', so always start with offset 2 or more */
> +enum {
> +	DIR_OFFSET_MIN	= 2,
> +};
> +
>  static void offset_set(struct dentry *dentry, u32 offset)
>  {
>  	dentry->d_fsdata = (void *)((uintptr_t)(offset));
> @@ -261,9 +266,7 @@ void simple_offset_init(struct offset_ctx *octx)
>  {
>  	xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
>  	lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
> -
> -	/* 0 is '.', 1 is '..', so always start with offset 2 */
> -	octx->next_offset = 2;
> +	octx->next_offset = DIR_OFFSET_MIN;
>  }
>  
>  /**
> @@ -276,7 +279,7 @@ void simple_offset_init(struct offset_ctx *octx)
>   */
>  int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
>  {
> -	static const struct xa_limit limit = XA_LIMIT(2, U32_MAX);
> +	static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
>  	u32 offset;
>  	int ret;
>  
> @@ -481,7 +484,7 @@ static int offset_readdir(struct file *file, struct dir_context *ctx)
>  		return 0;
>  
>  	/* In this case, ->private_data is protected by f_pos_lock */
> -	if (ctx->pos == 2)
> +	if (ctx->pos == DIR_OFFSET_MIN)
>  		file->private_data = NULL;
>  	else if (file->private_data == ERR_PTR(-ENOENT))
>  		return 0;
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 3/7] libfs: Add simple_offset_empty()
  2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
@ 2024-02-15 12:53   ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-15 12:53 UTC (permalink / raw
  To: Chuck Lever
  Cc: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Tue 13-02-24 16:37:39, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> For simple filesystems that use directory offset mapping, rely
> strictly on the directory offset map to tell when a directory has
> no children.
> 
> After this patch is applied, the emptiness test holds only the RCU
> read lock when the directory being tested has no children.
> 
> In addition, this adds another layer of confirmation that
> simple_offset_add/remove() are working as expected.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c         |   32 ++++++++++++++++++++++++++++++++
>  include/linux/fs.h |    1 +
>  mm/shmem.c         |    4 ++--
>  3 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a38af72f4719..3cf773950f93 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -313,6 +313,38 @@ void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
>  	offset_set(dentry, 0);
>  }
>  
> +/**
> + * simple_offset_empty - Check if a dentry can be unlinked
> + * @dentry: dentry to be tested
> + *
> + * Returns 0 if @dentry is a non-empty directory; otherwise returns 1.
> + */
> +int simple_offset_empty(struct dentry *dentry)
> +{
> +	struct inode *inode = d_inode(dentry);
> +	struct offset_ctx *octx;
> +	struct dentry *child;
> +	unsigned long index;
> +	int ret = 1;
> +
> +	if (!inode || !S_ISDIR(inode->i_mode))
> +		return ret;
> +
> +	index = 2;
> +	octx = inode->i_op->get_offset_ctx(inode);
> +	xa_for_each(&octx->xa, index, child) {
> +		spin_lock(&child->d_lock);
> +		if (simple_positive(child)) {
> +			spin_unlock(&child->d_lock);
> +			ret = 0;
> +			break;
> +		}
> +		spin_unlock(&child->d_lock);
> +	}
> +
> +	return ret;
> +}
> +
>  /**
>   * simple_offset_rename_exchange - exchange rename with directory offsets
>   * @old_dir: parent of dentry being moved
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ed5966a70495..03d141809a2c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3267,6 +3267,7 @@ struct offset_ctx {
>  void simple_offset_init(struct offset_ctx *octx);
>  int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
>  void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
> +int simple_offset_empty(struct dentry *dentry);
>  int simple_offset_rename_exchange(struct inode *old_dir,
>  				  struct dentry *old_dentry,
>  				  struct inode *new_dir,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d7c84ff62186..6fed524343cb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3374,7 +3374,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
>  
>  static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
>  {
> -	if (!simple_empty(dentry))
> +	if (!simple_offset_empty(dentry))
>  		return -ENOTEMPTY;
>  
>  	drop_nlink(d_inode(dentry));
> @@ -3431,7 +3431,7 @@ static int shmem_rename2(struct mnt_idmap *idmap,
>  		return simple_offset_rename_exchange(old_dir, old_dentry,
>  						     new_dir, new_dentry);
>  
> -	if (!simple_empty(new_dentry))
> +	if (!simple_offset_empty(new_dentry))
>  		return -ENOTEMPTY;
>  
>  	if (flags & RENAME_WHITEOUT) {
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
@ 2024-02-15 13:06   ` Jan Kara
  2024-02-15 13:45     ` Chuck Lever
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2024-02-15 13:06 UTC (permalink / raw
  To: Chuck Lever
  Cc: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Test robot reports:
> > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >
> > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> Feng Tang further clarifies that:
> > ... the new simple_offset_add()
> > called by shmem_mknod() brings extra cost related with slab,
> > specifically the 'radix_tree_node', which cause the regression.
> 
> Willy's analysis is that, over time, the test workload causes
> xa_alloc_cyclic() to fragment the underlying SLAB cache.
> 
> This patch replaces the offset_ctx's xarray with a Maple Tree in the
> hope that Maple Tree's dense node mode will handle this scenario
> more scalably.
> 
> In addition, we can widen the directory offset to an unsigned long
> everywhere.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

OK, but this will need the performance numbers. Otherwise we have no idea
whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
0-day guys are quite helpful.

> @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
>  	if (!inode || !S_ISDIR(inode->i_mode))
>  		return ret;
>  
> -	index = 2;
> +	index = DIR_OFFSET_MIN;

This bit should go into the simple_offset_empty() patch...

> @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  
>  	/* In this case, ->private_data is protected by f_pos_lock */
>  	file->private_data = NULL;
> -	return vfs_setpos(file, offset, U32_MAX);
> +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
					^^^
Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
quite right? Why not use ULONG_MAX here directly?

Otherwise the patch looks good to me.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
@ 2024-02-15 13:16   ` Jan Kara
  2024-02-15 17:00     ` Liam R. Howlett
  0 siblings, 1 reply; 29+ messages in thread
From: Jan Kara @ 2024-02-15 13:16 UTC (permalink / raw
  To: Chuck Lever
  Cc: viro, brauner, jack, hughd, akpm, Liam.Howlett, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Liam says that, unlike with xarray, once the RCU read lock is
> released ma_state is not safe to re-use for the next mas_find() call.
> But the RCU read lock has to be released on each loop iteration so
> that dput() can be called safely.
> 
> Thus we are forced to walk the offset tree with fresh state for each
> directory entry. mt_find() can do this for us, though it might be a
> little less efficient than maintaining ma_state locally.
> 
> Since offset_iterate_dir() doesn't build ma_state locally any more,
> there's no longer a strong need for offset_find_next(). Clean up by
> rolling these two helpers together.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

Well, in general I think even xas_next_entry() is not safe to use how
offset_find_next() was using it. Once you drop rcu_read_lock(),
xas->xa_node could go stale. But since you're holding inode->i_rwsem when
using offset_find_next() you should be protected from concurrent
modifications of the mapping (whatever the underlying data structure is) -
that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
maple tree? Am I missing something?

								Honza

> ---
>  fs/libfs.c |   39 +++++++++++++++++----------------------
>  1 file changed, 17 insertions(+), 22 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index f073e9aeb2bf..6e01fde1cf95 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>  	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
>  }
>  
> -static struct dentry *offset_find_next(struct ma_state *mas)
> -{
> -	struct dentry *child, *found = NULL;
> -
> -	rcu_read_lock();
> -	child = mas_find(mas, ULONG_MAX);
> -	if (!child)
> -		goto out;
> -	spin_lock(&child->d_lock);
> -	if (simple_positive(child))
> -		found = dget_dlock(child);
> -	spin_unlock(&child->d_lock);
> -out:
> -	rcu_read_unlock();
> -	return found;
> -}
> -
>  static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  {
>  	unsigned long offset = dentry2offset(dentry);
> @@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>  static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  {
>  	struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> -	MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
> -	struct dentry *dentry;
> +	struct dentry *dentry, *found;
> +	unsigned long offset;
>  
> +	offset = ctx->pos;
>  	while (true) {
> -		dentry = offset_find_next(&mas);
> +		found = mt_find(&octx->mt, &offset, ULONG_MAX);
> +		if (!found)
> +			goto out_noent;
> +
> +		dentry = NULL;
> +		spin_lock(&found->d_lock);
> +		if (simple_positive(found))
> +			dentry = dget_dlock(found);
> +		spin_unlock(&found->d_lock);
>  		if (!dentry)
> -			return ERR_PTR(-ENOENT);
> +			goto out_noent;
>  
>  		if (!offset_dir_emit(ctx, dentry)) {
>  			dput(dentry);
> @@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
>  		}
>  
>  		dput(dentry);
> -		ctx->pos = mas.index + 1;
> +		ctx->pos = offset;
>  	}
>  	return NULL;
> +
> +out_noent:
> +	return ERR_PTR(-ENOENT);
>  }
>  
>  /**
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-15 13:06   ` Jan Kara
@ 2024-02-15 13:45     ` Chuck Lever
  2024-02-15 14:02       ` Jan Kara
                         ` (2 more replies)
  0 siblings, 3 replies; 29+ messages in thread
From: Chuck Lever @ 2024-02-15 13:45 UTC (permalink / raw
  To: Jan Kara
  Cc: Chuck Lever, viro, brauner, hughd, akpm, Liam.Howlett,
	oliver.sang, feng.tang, linux-kernel, linux-fsdevel, maple-tree,
	linux-mm, lkp

On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Test robot reports:
> > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > >
> > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > 
> > Feng Tang further clarifies that:
> > > ... the new simple_offset_add()
> > > called by shmem_mknod() brings extra cost related with slab,
> > > specifically the 'radix_tree_node', which cause the regression.
> > 
> > Willy's analysis is that, over time, the test workload causes
> > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > 
> > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > hope that Maple Tree's dense node mode will handle this scenario
> > more scalably.
> > 
> > In addition, we can widen the directory offset to an unsigned long
> > everywhere.
> > 
> > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> OK, but this will need the performance numbers.

Yes, I totally concur. The point of this posting was to get some
early review and start the ball rolling.

Actually we expect roughly the same performance numbers now. "Dense
node" support in Maple Tree is supposed to be the real win, but
I'm not sure it's ready yet.


> Otherwise we have no idea
> whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> 0-day guys are quite helpful.

Oliver and Feng were copied on this series.


> > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> >  	if (!inode || !S_ISDIR(inode->i_mode))
> >  		return ret;
> >  
> > -	index = 2;
> > +	index = DIR_OFFSET_MIN;
> 
> This bit should go into the simple_offset_empty() patch...
> 
> > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >  
> >  	/* In this case, ->private_data is protected by f_pos_lock */
> >  	file->private_data = NULL;
> > -	return vfs_setpos(file, offset, U32_MAX);
> > +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> 					^^^
> Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> quite right? Why not use ULONG_MAX here directly?

I initially changed U32_MAX to ULONG_MAX, but for some reason, the
length checking in vfs_setpos() fails. There is probably a sign
extension thing happening here that I don't understand.


> Otherwise the patch looks good to me.

As always, thank you for your review.


-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-15 13:45     ` Chuck Lever
@ 2024-02-15 14:02       ` Jan Kara
  2024-02-16 15:15       ` Christian Brauner
  2024-02-18  2:02       ` Oliver Sang
  2 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-15 14:02 UTC (permalink / raw
  To: Chuck Lever
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, Liam.Howlett,
	oliver.sang, feng.tang, linux-kernel, linux-fsdevel, maple-tree,
	linux-mm, lkp

On Thu 15-02-24 08:45:33, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > > 
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > 
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > > 
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > > 
> > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > OK, but this will need the performance numbers.
> 
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
> 
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.
> 
> 
> > Otherwise we have no idea
> > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > 0-day guys are quite helpful.
> 
> Oliver and Feng were copied on this series.
> 
> 
> > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > >  	if (!inode || !S_ISDIR(inode->i_mode))
> > >  		return ret;
> > >  
> > > -	index = 2;
> > > +	index = DIR_OFFSET_MIN;
> > 
> > This bit should go into the simple_offset_empty() patch...
> > 
> > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >  
> > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > >  	file->private_data = NULL;
> > > -	return vfs_setpos(file, offset, U32_MAX);
> > > +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > 					^^^
> > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > quite right? Why not use ULONG_MAX here directly?
> 
> I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> length checking in vfs_setpos() fails. There is probably a sign
> extension thing happening here that I don't understand.

Right. loff_t is signed (long long). So I think you should make the
'offset' be long instead of unsigned long and allow values 0..LONG_MAX?
Then you can pass LONG_MAX here. You potentially lose half of the usable
offsets on 32-bit userspace with 64-bit file offsets but who cares I guess?
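
To illustrate the sign issue (a minimal userspace sketch, not kernel
code; vfs_setpos() applies an "offset > maxsize" style bound check):

	#include <stdio.h>
	#include <limits.h>

	int main(void)
	{
		/* loff_t is a signed 64-bit type; ULONG_MAX converts to
		 * -1 on 64-bit, so every non-negative offset then fails
		 * the bound check. */
		long long maxsize = (long long)ULONG_MAX;
		long long offset = 2;

		printf("maxsize = %lld\n", maxsize);
		printf("rejected: %s\n", offset > maxsize ? "yes" : "no");
		return 0;
	}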

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 13:16   ` Jan Kara
@ 2024-02-15 17:00     ` Liam R. Howlett
  2024-02-15 17:16       ` Jan Kara
  2024-02-15 17:40       ` Chuck Lever
  0 siblings, 2 replies; 29+ messages in thread
From: Liam R. Howlett @ 2024-02-15 17:00 UTC (permalink / raw
  To: Jan Kara
  Cc: Chuck Lever, viro, brauner, hughd, akpm, oliver.sang, feng.tang,
	linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

* Jan Kara <jack@suse.cz> [240215 08:16]:
> On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Liam says that, unlike with xarray, once the RCU read lock is
> > released ma_state is not safe to re-use for the next mas_find() call.
> > But the RCU read lock has to be released on each loop iteration so
> > that dput() can be called safely.
> > 
> > Thus we are forced to walk the offset tree with fresh state for each
> > directory entry. mt_find() can do this for us, though it might be a
> > little less efficient than maintaining ma_state locally.
> > 
> > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > there's no longer a strong need for offset_find_next(). Clean up by
> > rolling these two helpers together.
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> 
> Well, in general I think even xas_next_entry() is not safe to use how
> offset_find_next() was using it. Once you drop rcu_read_lock(),
> xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> using offset_find_next() you should be protected from concurrent
> modifications of the mapping (whatever the underlying data structure is) -
> that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> maple tree? Am I missing something?

If you are stopping, you should be pausing the iteration.  Although this
works today, it's not how it should be used because if we make changes
(ie: compaction requires movement of data), then you may end up with a
UAF issue.  We'd have no way of knowing you are depending on the tree
structure to remain consistent.
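
Roughly, the expected pattern (just a sketch) is to pause the state
before dropping the RCU read lock, so the next mas_find() re-walks the
tree from the recorded index instead of trusting a possibly stale node:

	rcu_read_lock();
	child = mas_find(&mas, ULONG_MAX);
	...
	mas_pause(&mas);	/* remember the position, forget the node */
	rcu_read_unlock();

	dput(dentry);		/* safe to sleep / take other locks here */

	rcu_read_lock();
	child = mas_find(&mas, ULONG_MAX);	/* restarts from the pause */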

IOW the inode->i_rwsem is protecting writes of data but not the
structure holding the data.

This is true for both xarray and maple tree.
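
Roughly, the pattern that implies (a sketch with made-up names, not the
actual libfs code - visit() stands in for whatever must run outside of
RCU, e.g. dput()):

	#include <linux/maple_tree.h>
	#include <linux/rcupdate.h>
	#include <linux/limits.h>

	extern void visit(void *entry);		/* hypothetical, may sleep */

	static void walk_and_sleep(struct maple_tree *mt)
	{
		MA_STATE(mas, mt, 0, 0);
		void *entry;

		rcu_read_lock();
		while ((entry = mas_find(&mas, ULONG_MAX)) != NULL) {
			mas_pause(&mas);	/* park before dropping RCU */
			rcu_read_unlock();

			visit(entry);		/* safe to sleep here */

			rcu_read_lock();	/* next mas_find() re-walks */
		}
		rcu_read_unlock();
	}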

Thanks,
Liam



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 17:00     ` Liam R. Howlett
@ 2024-02-15 17:16       ` Jan Kara
  2024-02-15 21:07         ` Liam R. Howlett
  2024-02-15 17:40       ` Chuck Lever
  1 sibling, 1 reply; 29+ messages in thread
From: Jan Kara @ 2024-02-15 17:16 UTC (permalink / raw
  To: Liam R. Howlett
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > > 
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > > 
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
> 
> If you are stopping, you should be pausing the iteration.  Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue.  We'd have no way of knowing you are depending on the tree
> structure to remain consistent.

I see. But we have versions of these structures that have locking external
to the structure itself, don't we? Then how do you imagine serializing the
background operations like compaction? As much as I agree your argument is
"theoretically clean", it seems a bit like a trap and there are definitely
xarray users that are going to be broken by this (e.g.
tag_pages_for_writeback())...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 17:00     ` Liam R. Howlett
  2024-02-15 17:16       ` Jan Kara
@ 2024-02-15 17:40       ` Chuck Lever
  2024-02-15 21:08         ` Liam R. Howlett
  1 sibling, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-15 17:40 UTC (permalink / raw
  To: Liam R. Howlett, Jan Kara, Chuck Lever, viro, brauner, hughd,
	akpm, oliver.sang, feng.tang, linux-kernel, linux-fsdevel,
	maple-tree, linux-mm, lkp

On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > > 
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > > 
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
> 
> If you are stopping, you should be pausing the iteration.  Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue.  We'd have no way of knowing you are depending on the tree
> structure to remain consistent.
> 
> IOW the inode->i_rwsem is protecting writes of data but not the
> structure holding the data.
> 
> This is true for both xarray and maple tree.

Would it be appropriate to reorder this series so 7/7 comes before
the transition to use Maple Tree?


-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 17:16       ` Jan Kara
@ 2024-02-15 21:07         ` Liam R. Howlett
  2024-02-16 10:15           ` Jan Kara
  0 siblings, 1 reply; 29+ messages in thread
From: Liam R. Howlett @ 2024-02-15 21:07 UTC (permalink / raw
  To: Jan Kara
  Cc: Chuck Lever, viro, brauner, hughd, akpm, oliver.sang, feng.tang,
	linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

* Jan Kara <jack@suse.cz> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > > 
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > > 
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > > 
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> > 
> > If you are stopping, you should be pausing the iteration.  Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue.  We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
> 
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?

Ah, I do have them - but I don't want to propagate their use, as the
dream is that they can be removed.


> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...

I'm not sure I follow the trap logic.  There are locks for the data
structure that need to be followed for reading (rcu) and writing
(spinlock for the maple tree).  If you don't correctly lock the data
structure then you really are setting yourself up for potential issues
in the future.

The limitations are outlined in the documentation as to how and when to
lock.  I'm not familiar with the xarray users, but it does check for
locking with lockdep, but the way this is written bypasses the lockdep
checking as the locks are taken and dropped without the proper scope.

If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?

Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state - before it unlocks it calls xas_pause().  We have the
same on maple tree; mas_pause().  This will restart the next operation
from the root of the tree (the root can also change), to ensure that it
is safe.
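
The shape of that pattern, sketched with placeholder names rather than
the real function:

	#include <linux/xarray.h>
	#include <linux/sched.h>

	static void walk_with_breaks(struct xarray *xa, unsigned long end)
	{
		XA_STATE(xas, xa, 0);
		unsigned int done = 0;
		void *entry;

		xas_lock_irq(&xas);
		xas_for_each(&xas, entry, end) {
			/* ... process entry ... */
			if (++done % 128)
				continue;

			xas_pause(&xas);	/* park before unlocking */
			xas_unlock_irq(&xas);
			cond_resched();
			xas_lock_irq(&xas);
		}
		xas_unlock_irq(&xas);
	}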

If you have other examples you think are unsafe then I can have a look
at them as well.

You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().

Regarding compaction, I would expect the write lock to be taken to avoid
any writes happening while compaction occurs.  Readers would use rcu
locking to ensure they return either the old or new value.  During the
write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.

If external locks are used, then compaction would be sub-optimal as it
may unnecessarily hold up readers, or partial write work (before the
point of a store into the maple tree).
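
As a sketch of that division of labour (placeholder names, plain mtree_*
API): writers funnel through the tree's internal spinlock, readers only
need RCU and observe either the old or the new entry:

	#include <linux/maple_tree.h>
	#include <linux/gfp.h>

	static int writer(struct maple_tree *mt, unsigned long index, void *item)
	{
		/* takes and drops the tree's spinlock internally */
		return mtree_store(mt, index, item, GFP_KERNEL);
	}

	static void *reader(struct maple_tree *mt, unsigned long index)
	{
		/* takes rcu_read_lock() internally */
		return mtree_load(mt, index);
	}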

Thanks,
Liam




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 17:40       ` Chuck Lever
@ 2024-02-15 21:08         ` Liam R. Howlett
  0 siblings, 0 replies; 29+ messages in thread
From: Liam R. Howlett @ 2024-02-15 21:08 UTC (permalink / raw
  To: Chuck Lever
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

* Chuck Lever <chuck.lever@oracle.com> [240215 12:40]:
> On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > > 
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > > 
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > > 
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> > 
> > If you are stopping, you should be pausing the iteration.  Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue.  We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
> > 
> > IOW the inode->i_rwsem is protecting writes of data but not the
> > structure holding the data.
> > 
> > This is true for both xarray and maple tree.
> 
> Would it be appropriate to reorder this series so 7/7 comes before
> the transition to use Maple Tree?

I think it would, yes.

Thanks,
Liam


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-15 21:07         ` Liam R. Howlett
@ 2024-02-16 10:15           ` Jan Kara
  2024-02-16 15:57             ` Matthew Wilcox
  2024-02-16 16:33             ` Liam R. Howlett
  0 siblings, 2 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-16 10:15 UTC (permalink / raw
  To: Liam R. Howlett
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > 
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > > 
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > > 
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > > 
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > > 
> > > If you are stopping, you should be pausing the iteration.  Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue.  We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> > 
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
> 
> Ah, I do have them - but I don't want to propagate their use, as the
> dream is that they can be removed.
> 
> 
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
> 
> I'm not sure I follow the trap logic.  There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree).  If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
> 
> The limitations are outlined in the documentation as to how and when to
> lock.  I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
> 
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?
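
Something like these purely hypothetical wrappers (they do not exist
today), just to sketch the "automagic" idea:

	#include <linux/maple_tree.h>
	#include <linux/rcupdate.h>

	static inline void mas_unlock_paused(struct ma_state *mas)
	{
		mas_pause(mas);		/* park the state ... */
		mas_unlock(mas);	/* ... then drop the tree's spinlock */
	}

	static inline void mas_rcu_read_unlock(struct ma_state *mas)
	{
		mas_pause(mas);
		rcu_read_unlock();
	}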

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause().  We have the
> same on maple tree; mas_pause().  This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-15 13:45     ` Chuck Lever
  2024-02-15 14:02       ` Jan Kara
@ 2024-02-16 15:15       ` Christian Brauner
  2024-02-18  2:02       ` Oliver Sang
  2 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2024-02-16 15:15 UTC (permalink / raw
  To: Chuck Lever
  Cc: Jan Kara, Chuck Lever, viro, hughd, akpm, Liam.Howlett,
	oliver.sang, feng.tang, linux-kernel, linux-fsdevel, maple-tree,
	linux-mm, lkp

On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > > 
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > 
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > > 
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > > 
> > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > OK, but this will need the performance numbers.
> 
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
> 
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.

I keep repeating this, but we need a better way to request performance
tests for specific series/branches. Maybe I can add a vfs.perf branch
where we can put patches that we suspect have a positive/negative perf
impact and that the perf bot can pull in. I know that's a fuzzy
boundary, but for stuff like this, where we already know there's a perf
impact that matters to us, it would really help.

Because it is royally annoying to get a perf regression report after a
patch has been in -next for a long time already and the merge window is
coming up or we already merged that stuff.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-16 10:15           ` Jan Kara
@ 2024-02-16 15:57             ` Matthew Wilcox
  2024-02-16 16:33             ` Liam R. Howlett
  1 sibling, 0 replies; 29+ messages in thread
From: Matthew Wilcox @ 2024-02-16 15:57 UTC (permalink / raw
  To: Jan Kara
  Cc: Liam R. Howlett, Chuck Lever, viro, brauner, hughd, akpm,
	oliver.sang, feng.tang, linux-kernel, linux-fsdevel, maple-tree,
	linux-mm, lkp

On Fri, Feb 16, 2024 at 11:15:46AM +0100, Jan Kara wrote:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > The limitations are outlined in the documentation as to how and when to
> > lock.  I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> > 
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
> 
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact.

hm, no.  The radix tree never guaranteed that to you; I just documented
that it wasn't guaranteed for the XArray.

> Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?

If you're unlocking the lock that protects a data structure while still
using that data structure, you should always be aware that you're doing
something very dangerous!  It's no different from calling inode_unlock()
inside a filesystem.  Sure, you can do it, but you'd better be ready to
deal with the consequences.

The question is, do we want to be able to defragment slabs or not?
My thinking is "yes", for objects where we can ensure there are no
current users (at least after an RCU grace period), we want to be able
to move them.  That does impose certain costs (and subtleties), but just
like fast-GUP and lockless page-cache, I think it's worth doing.

Of course, we don't have slab defragmentation yet, so we're not getting
any benefit from this.  The most recent attempt was in 2019:
https://lore.kernel.org/linux-mm/20190603042637.2018-1-tobin@kernel.org/
but there were earlier attempts in 2017:
https://lore.kernel.org/linux-mm/20171227220636.361857279@linux.com/
and 2008:
https://lore.kernel.org/linux-mm/20080216004526.763643520@sgi.com/

so I have to believe it's something we want, just haven't been able to
push across the "merge this now" line.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-16 10:15           ` Jan Kara
  2024-02-16 15:57             ` Matthew Wilcox
@ 2024-02-16 16:33             ` Liam R. Howlett
  2024-02-19 18:06               ` Jan Kara
  1 sibling, 1 reply; 29+ messages in thread
From: Liam R. Howlett @ 2024-02-16 16:33 UTC (permalink / raw
  To: Jan Kara
  Cc: Chuck Lever, viro, brauner, hughd, akpm, oliver.sang, feng.tang,
	linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

* Jan Kara <jack@suse.cz> [240216 05:15]:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > * Jan Kara <jack@suse.cz> [240215 12:16]:
> > > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > > * Jan Kara <jack@suse.cz> [240215 08:16]:
> > > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > > 
> > > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > > that dput() can be called safely.
> > > > > > 

...

> > 
> > > Then how do you imagine serializing the
> > > background operations like compaction? As much as I agree your argument is
> > > "theoretically clean", it seems a bit like a trap and there are definitely
> > > xarray users that are going to be broken by this (e.g.
> > > tag_pages_for_writeback())...
> > 
> > I'm not sure I follow the trap logic.  There are locks for the data
> > structure that need to be followed for reading (rcu) and writing
> > (spinlock for the maple tree).  If you don't correctly lock the data
> > structure then you really are setting yourself up for potential issues
> > in the future.
> > 
> > The limitations are outlined in the documentation as to how and when to
> > lock.  I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> > 
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
> 
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact. Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?
> 
...

Fair enough.  You can think of the radix-tree and xarray as black boxes
as well; not everyone knows what's going on in there.  The rbtree and
list are embedded in your own data structure and you have to do a lot of
work for your operations.

Right now, it is the way you have described; it's a data structure API.
It will work this way, but you lose the really neat and performant part
of the tree.  If we do this right, we can compact at the same time as
reading data.  We cannot do that when depending on the locks you use
today. 

You don't have to use the mas_pause() functions; there are less
error-prone methods, such as mt_find(), which Chuck used here.  If you want
to do really neat stuff though, you should be looking at the mas_*
interface.  And yes, we totally hand you enough rope to hang yourself.
I'm not sure we can have an advanced interface without doing this.
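
For reference, the mt_find() style in sketch form (again with made-up
names): each call re-walks from the root and advances the caller's
index, so nothing has to stay valid across the RCU unlock:

	#include <linux/maple_tree.h>
	#include <linux/limits.h>

	extern void visit(void *entry);		/* hypothetical, may sleep */

	static void walk_with_mt_find(struct maple_tree *mt)
	{
		unsigned long index = 0;
		void *entry;

		while ((entry = mt_find(mt, &index, ULONG_MAX)) != NULL) {
			/*
			 * mt_find() took and dropped RCU itself and moved
			 * 'index' past the entry, so sleeping here is fine.
			 */
			visit(entry);
		}
	}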

Using the correct calls will allow for us to smoothly transition to a
model where you don't depend on the data remaining in the same place in
the tree (or xarray).

> 
> > If you have other examples you think are unsafe then I can have a look
> > at them as well.
> 
> I'm currently not aware of any but I'll let you know if I find some.
> Missing xas/mas_pause() seems really easy.

What if we convert the rcu_read_lock() to a mas_read_lock() or
xas_read_lock() so we can check that the state is in the 'parked' state
(paused or otherwise) before it is locked again?

static inline void mas_read_lock(struct ma_state *mas)
{
	/* the state must be parked (paused or not yet started) here */
	WARN_ON(mas_active(mas));
	rcu_read_lock();
}

Would that be a reasonable resolution to your concern?  Unfortunately,
what was done with the locking in this case would not be detected with
this change unless the rcu_read_lock() was replaced.  IOW, people could
still use the rcu_read_lock() and skip the detection.

Doing the same in the mas_unlock() doesn't make as much sense since that
may be called without the intent to reuse the state at all.  So we'd be
doing more work than necessary at the end of each loop or use.

Thanks,
Liam




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-15 13:45     ` Chuck Lever
  2024-02-15 14:02       ` Jan Kara
  2024-02-16 15:15       ` Christian Brauner
@ 2024-02-18  2:02       ` Oliver Sang
  2024-02-18 15:57         ` Chuck Lever
  2 siblings, 1 reply; 29+ messages in thread
From: Oliver Sang @ 2024-02-18  2:02 UTC (permalink / raw
  To: Chuck Lever
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, Liam.Howlett,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp,
	oliver.sang

hi, Chuck Lever,

On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > 
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > > 
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > 
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > > 
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > > 
> > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > 
> > OK, but this will need the performance numbers.
> 
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
> 
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.
> 
> 
> > Otherwise we have no idea
> > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > 0-day guys are quite helpful.
> 
> Oliver and Feng were copied on this series.

we were on holiday last week; now we are back.

I noticed there is v2 for this patch set
https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/

and you also put it in a branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
"simple-offset-maple" branch.

we will test aim9 performance based on this branch. Thanks

> 
> 
> > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > >  	if (!inode || !S_ISDIR(inode->i_mode))
> > >  		return ret;
> > >  
> > > -	index = 2;
> > > +	index = DIR_OFFSET_MIN;
> > 
> > This bit should go into the simple_offset_empty() patch...
> > 
> > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >  
> > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > >  	file->private_data = NULL;
> > > -	return vfs_setpos(file, offset, U32_MAX);
> > > +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > 					^^^
> > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > quite right? Why not use ULONG_MAX here directly?
> 
> I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> length checking in vfs_setpos() fails. There is probably a sign
> extension thing happening here that I don't understand.
> 
> 
> > Otherwise the patch looks good to me.
> 
> As always, thank you for your review.
> 
> 
> -- 
> Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-18  2:02       ` Oliver Sang
@ 2024-02-18 15:57         ` Chuck Lever
  2024-02-19  6:00           ` Oliver Sang
  0 siblings, 1 reply; 29+ messages in thread
From: Chuck Lever @ 2024-02-18 15:57 UTC (permalink / raw
  To: Oliver Sang
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, Liam.Howlett,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Sun, Feb 18, 2024 at 10:02:37AM +0800, Oliver Sang wrote:
> hi, Chuck Lever,
> 
> On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> > On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > Test robot reports:
> > > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > > >
> > > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > 
> > > > Feng Tang further clarifies that:
> > > > > ... the new simple_offset_add()
> > > > > called by shmem_mknod() brings extra cost related with slab,
> > > > > specifically the 'radix_tree_node', which cause the regression.
> > > > 
> > > > Willy's analysis is that, over time, the test workload causes
> > > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > > 
> > > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > > hope that Maple Tree's dense node mode will handle this scenario
> > > > more scalably.
> > > > 
> > > > In addition, we can widen the directory offset to an unsigned long
> > > > everywhere.
> > > > 
> > > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > OK, but this will need the performance numbers.
> > 
> > Yes, I totally concur. The point of this posting was to get some
> > early review and start the ball rolling.
> > 
> > Actually we expect roughly the same performance numbers now. "Dense
> > node" support in Maple Tree is supposed to be the real win, but
> > I'm not sure it's ready yet.
> > 
> > 
> > > Otherwise we have no idea
> > > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > > 0-day guys are quite helpful.
> > 
> > Oliver and Feng were copied on this series.
> 
> we were on holiday last week; now we are back.
> 
> I noticed there is v2 for this patch set
> https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/
> 
> and you also put it in a branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> "simple-offset-maple" branch.
> 
> we will test aim9 performance based on this branch. Thanks

Very much appreciated!


> > > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > >  	if (!inode || !S_ISDIR(inode->i_mode))
> > > >  		return ret;
> > > >  
> > > > -	index = 2;
> > > > +	index = DIR_OFFSET_MIN;
> > > 
> > > This bit should go into the simple_offset_empty() patch...
> > > 
> > > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > >  
> > > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > > >  	file->private_data = NULL;
> > > > -	return vfs_setpos(file, offset, U32_MAX);
> > > > +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > > 					^^^
> > > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > > quite right? Why not use ULONG_MAX here directly?
> > 
> > I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> > length checking in vfs_setpos() fails. There is probably a sign
> > extension thing happening here that I don't understand.
> > 
> > 
> > > Otherwise the patch looks good to me.
> > 
> > As always, thank you for your review.
> > 
> > 
> > -- 
> > Chuck Lever

-- 
Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree
  2024-02-18 15:57         ` Chuck Lever
@ 2024-02-19  6:00           ` Oliver Sang
  0 siblings, 0 replies; 29+ messages in thread
From: Oliver Sang @ 2024-02-19  6:00 UTC (permalink / raw
  To: Chuck Lever
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, Liam.Howlett,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp,
	oliver.sang

hi, Chuck Lever,

On Sun, Feb 18, 2024 at 10:57:07AM -0500, Chuck Lever wrote:
> On Sun, Feb 18, 2024 at 10:02:37AM +0800, Oliver Sang wrote:
> > hi, Chuck Lever,
> > 
> > On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> > > On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > > > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > > > From: Chuck Lever <chuck.lever@oracle.com>
> > > > > 
> > > > > Test robot reports:
> > > > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > > > >
> > > > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > > 
> > > > > Feng Tang further clarifies that:
> > > > > > ... the new simple_offset_add()
> > > > > > called by shmem_mknod() brings extra cost related with slab,
> > > > > > specifically the 'radix_tree_node', which cause the regression.
> > > > > 
> > > > > Willy's analysis is that, over time, the test workload causes
> > > > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > > > 
> > > > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > > > hope that Maple Tree's dense node mode will handle this scenario
> > > > > more scalably.
> > > > > 
> > > > > In addition, we can widen the directory offset to an unsigned long
> > > > > everywhere.
> > > > > 
> > > > > Suggested-by: Matthew Wilcox <willy@infradead.org>
> > > > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > > > Closes: https://lore.kernel.org/oe-lkp/202309081306.3ecb3734-oliver.sang@intel.com
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > 
> > > > OK, but this will need the performance numbers.
> > > 
> > > Yes, I totally concur. The point of this posting was to get some
> > > early review and start the ball rolling.
> > > 
> > > Actually we expect roughly the same performance numbers now. "Dense
> > > node" support in Maple Tree is supposed to be the real win, but
> > > I'm not sure it's ready yet.
> > > 
> > > 
> > > > Otherwise we have no idea
> > > > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > > > 0-day guys are quite helpful.
> > > 
> > > Oliver and Feng were copied on this series.
> > 
> > we were on holiday last week; now we are back.
> > 
> > I noticed there is v2 for this patch set
> > https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/
> > 
> > and you also put it in a branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> > "simple-offset-maple" branch.
> > 
> > we will test aim9 performance based on this branch. Thanks
> 
> Very much appreciated!

always our pleasure!

we've already sent out a report [1] to you for commit a616bc6667 in the
new branch.
we saw an 11.8% improvement in aim9.disk_src.ops_per_sec on it compared
to its parent.

so the regression we saw on a2e459555c is 'half' recovered.

since I noticed the performance for a2e459555c, v6.8-rc4 and f3f24869a1
(parent of a616bc6667) is very similar, I ignored the results from
v6.8-rc4 and f3f24869a1 in the tables below for brevity. if you want a
full table, please let me know. Thanks!


summary for aim9.disk_src.ops_per_sec:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s

commit:
  23a31d8764 ("shmem: Refactor shmem_symlink()")
  a2e459555c ("shmem: stable directory offsets")
  a616bc6667 ("libfs: Convert simple directory offsets to use a Maple Tree")


23a31d87645c6527 a2e459555c5f9da3e619b7e47a6 a616bc666748063733c62e15ea4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    202424           -19.0%     163868            -9.3%     183678        aim9.disk_src.ops_per_sec


full data:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s

commit:
  23a31d8764 ("shmem: Refactor shmem_symlink()")
  a2e459555c ("shmem: stable directory offsets")
  a616bc6667 ("libfs: Convert simple directory offsets to use a Maple Tree")


23a31d87645c6527 a2e459555c5f9da3e619b7e47a6 a616bc666748063733c62e15ea4
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
      1404            +0.6%       1412            +3.3%       1450        boot-time.idle
      0.26 ±  9%      +0.1        0.36 ±  2%      -0.1        0.20 ±  4%  mpstat.cpu.all.soft%
      0.61            -0.1        0.52            -0.0        0.58        mpstat.cpu.all.usr%
      1.00            +0.0%       1.00           +19.5%       1.19 ±  2%  vmstat.procs.r
      1525 ±  6%      -5.6%       1440            +9.4%       1668 ±  4%  vmstat.system.cs
     52670            +5.0%      55323 ±  3%     +12.7%      59381 ±  3%  turbostat.C1
      0.02            +0.0        0.02            +0.0        0.03 ± 10%  turbostat.C1%
     12949 ±  9%      -1.2%      12795 ±  6%     -19.6%      10412 ±  6%  turbostat.POLL
    115468 ± 24%     +11.9%     129258 ±  7%     +16.5%     134545 ±  5%  numa-meminfo.node0.AnonPages.max
      2015 ± 12%      -7.5%       1864 ±  5%     +30.3%       2624 ± 10%  numa-meminfo.node0.PageTables
      4795 ± 30%      +1.1%       4846 ± 37%    +117.0%      10405 ± 21%  numa-meminfo.node0.Shmem
      6442 ±  5%      -0.6%       6401 ±  4%     -19.6%       5180 ±  7%  numa-meminfo.node1.KernelStack
     13731 ± 92%     -88.7%       1546 ±  6%    +161.6%      35915 ± 23%  time.involuntary_context_switches
     94.83            -4.2%      90.83            +1.2%      96.00        time.percent_of_cpu_this_job_got
    211.64            +0.5%     212.70            +4.0%     220.06        time.system_time
     73.62           -17.6%      60.69            -6.4%      68.94        time.user_time
    202424           -19.0%     163868            -9.3%     183678        aim9.disk_src.ops_per_sec
     13731 ± 92%     -88.7%       1546 ±  6%    +161.6%      35915 ± 23%  aim9.time.involuntary_context_switches
     94.83            -4.2%      90.83            +1.2%      96.00        aim9.time.percent_of_cpu_this_job_got
    211.64            +0.5%     212.70            +4.0%     220.06        aim9.time.system_time
     73.62           -17.6%      60.69            -6.4%      68.94        aim9.time.user_time
    174558 ±  4%      -1.0%     172852 ±  7%     -19.7%     140084 ±  7%  meminfo.DirectMap4k
     94166            +6.6%     100388           -14.4%      80579        meminfo.KReclaimable
     12941            +0.4%      12989           -14.4%      11078        meminfo.KernelStack
      3769            +0.2%       3775           +31.5%       4955        meminfo.PageTables
     79298            +0.0%      79298           -70.6%      23298        meminfo.Percpu
     94166            +6.6%     100388           -14.4%      80579        meminfo.SReclaimable
    204209            +2.9%     210111            -9.6%     184661        meminfo.Slab
    503.33 ± 12%      -7.5%     465.67 ±  5%     +30.4%     656.39 ± 10%  numa-vmstat.node0.nr_page_table_pages
      1198 ± 30%      +1.1%       1211 ± 37%    +117.0%       2601 ± 21%  numa-vmstat.node0.nr_shmem
    220.33            +0.2%     220.83 ±  2%     -97.3%       6.00 ±100%  numa-vmstat.node1.nr_active_file
    167.00            +0.2%     167.33 ±  2%     -87.7%      20.48 ±100%  numa-vmstat.node1.nr_inactive_file
      6443 ±  5%      -0.6%       6405 ±  4%     -19.5%       5184 ±  7%  numa-vmstat.node1.nr_kernel_stack
    220.33            +0.2%     220.83 ±  2%     -97.3%       6.00 ±100%  numa-vmstat.node1.nr_zone_active_file
    167.00            +0.2%     167.33 ±  2%     -87.7%      20.48 ±100%  numa-vmstat.node1.nr_zone_inactive_file
      0.04 ± 25%      +4.1%       0.05 ± 33%     -40.1%       0.03 ± 34%  perf-sched.sch_delay.avg.ms.syslog_print.do_syslog.kmsg_read.vfs_read
      0.01 ±  4%      +1.6%       0.01 ±  8%    +136.9%       0.02 ± 29%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.16 ± 10%     -18.9%       0.13 ± 12%     -16.4%       0.13 ± 15%  perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
      0.14 ± 15%      -6.0%       0.14 ± 20%     -30.4%       0.10 ± 26%  perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
      0.04 ±  7%   +1802.4%       0.78 ±115%   +7258.6%       3.00 ± 56%  perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.05 ± 40%     -14.5%       0.04 ± 31%   +6155.4%       3.09        perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.01 ±  8%     +25.5%       0.01 ± 21%    +117.0%       0.02 ± 30%  perf-sched.total_sch_delay.average.ms
      0.16 ± 11%   +1204.9%       2.12 ± 90%   +2225.8%       3.77 ± 10%  perf-sched.total_sch_delay.max.ms
     58.50 ± 28%      -4.8%      55.67 ± 14%     -23.9%      44.50 ±  4%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.usleep_range_state.ipmi_thread.kthread
    277.83 ±  3%     +10.3%     306.50 ±  5%     +17.9%     327.62 ±  4%  perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.10 ± 75%     -47.1%       0.05 ± 86%    -100.0%       0.00        perf-sched.wait_time.avg.ms.__cond_resched.dput.path_put.vfs_statx.vfs_fstatat
      0.10 ± 72%     -10.8%       0.09 ± 67%    -100.0%       0.00        perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      2.00 ± 27%      -4.8%       1.90 ± 14%     -23.7%       1.53 ±  4%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.16 ± 81%     +20.1%       0.19 ±104%    -100.0%       0.00        perf-sched.wait_time.max.ms.__cond_resched.dput.path_put.vfs_statx.vfs_fstatat
      0.22 ± 77%     +14.2%       0.25 ± 54%    -100.0%       0.00        perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
      7.32 ± 43%     -10.3%       6.57 ± 20%     -33.2%       4.89 ±  6%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
    220.33            +0.2%     220.83 ±  2%     -94.6%      12.00        proc-vmstat.nr_active_file
    676766            -0.0%     676706            +4.1%     704377        proc-vmstat.nr_file_pages
    167.00            +0.2%     167.33 ±  2%     -75.5%      40.98        proc-vmstat.nr_inactive_file
     12941            +0.4%      12988           -14.4%      11083        proc-vmstat.nr_kernel_stack
    942.50            +0.1%     943.50           +31.4%       1238        proc-vmstat.nr_page_table_pages
      7915 ±  3%      -0.8%       7855            +8.6%       8599        proc-vmstat.nr_shmem
     23541            +6.5%      25074           -14.4%      20144        proc-vmstat.nr_slab_reclaimable
     27499            -0.3%      27422            -5.4%      26016        proc-vmstat.nr_slab_unreclaimable
    668461            +0.0%     668461            +4.2%     696493        proc-vmstat.nr_unevictable
    220.33            +0.2%     220.83 ±  2%     -94.6%      12.00        proc-vmstat.nr_zone_active_file
    167.00            +0.2%     167.33 ±  2%     -75.5%      40.98        proc-vmstat.nr_zone_inactive_file
    668461            +0.0%     668461            +4.2%     696493        proc-vmstat.nr_zone_unevictable
    722.17 ± 71%     -34.7%     471.67 ±136%     -98.3%      12.62 ±129%  proc-vmstat.numa_hint_faults_local
   1437319 ± 24%    +377.6%    6864201           -47.8%     750673 ±  7%  proc-vmstat.numa_hit
   1387016 ± 25%    +391.4%    6815486           -49.5%     700947 ±  7%  proc-vmstat.numa_local
     50329            -0.0%      50324            -1.2%      49704        proc-vmstat.numa_other
   4864362 ± 34%    +453.6%   26931180           -66.1%    1648373 ± 17%  proc-vmstat.pgalloc_normal
   4835960 ± 34%    +455.4%   26856610           -66.3%    1628178 ± 18%  proc-vmstat.pgfree
     11.21           +23.7%      13.87           -81.8%       2.04        perf-stat.i.MPKI
 7.223e+08            -4.4%  6.907e+08            -3.9%   6.94e+08        perf-stat.i.branch-instructions
      2.67            +0.2        2.88            +0.0        2.70        perf-stat.i.branch-miss-rate%
  19988363            +2.8%   20539702            -3.1%   19363031        perf-stat.i.branch-misses
     17.36            -2.8       14.59            +0.4       17.77        perf-stat.i.cache-miss-rate%
  40733859           +19.5%   48659982            -1.9%   39962840        perf-stat.i.cache-references
      1482 ±  7%      -5.7%       1398            +9.6%       1623 ±  5%  perf-stat.i.context-switches
      1.76            +3.5%       1.82            +5.1%       1.85        perf-stat.i.cpi
     55.21            +5.4%      58.21 ±  2%      +0.4%      55.45        perf-stat.i.cpu-migrations
  16524721 ±  8%      -4.8%   15726404 ±  4%     -13.1%   14367627 ±  4%  perf-stat.i.dTLB-load-misses
  1.01e+09            -3.8%  9.719e+08            -4.4%  9.659e+08        perf-stat.i.dTLB-loads
      0.26 ±  4%      -0.0        0.23 ±  3%      -0.0        0.25 ±  3%  perf-stat.i.dTLB-store-miss-rate%
   2166022 ±  4%      -6.9%    2015917 ±  3%      -7.0%    2014037 ±  3%  perf-stat.i.dTLB-store-misses
 8.503e+08            +5.5%  8.968e+08            -3.5%  8.205e+08        perf-stat.i.dTLB-stores
     69.22 ±  4%      +6.4       75.60           +14.4       83.60 ±  3%  perf-stat.i.iTLB-load-miss-rate%
    709457 ±  5%      -5.4%     670950 ±  2%    +133.6%    1657233        perf-stat.i.iTLB-load-misses
    316455 ± 12%     -31.6%     216531 ±  3%      +3.5%     327592 ± 19%  perf-stat.i.iTLB-loads
 3.722e+09            -3.1%  3.608e+09            -4.6%  3.553e+09        perf-stat.i.instructions
      5243 ±  5%      +2.2%       5357 ±  2%     -59.1%       2142        perf-stat.i.instructions-per-iTLB-miss
      0.57            -3.3%       0.55            -4.8%       0.54        perf-stat.i.ipc
    865.04           -10.4%     775.02 ±  3%      -2.5%     843.12        perf-stat.i.metric.K/sec
     53.84            -0.4%      53.61            -4.0%      51.71        perf-stat.i.metric.M/sec
     47.51            -2.1       45.37            +0.7       48.17        perf-stat.i.node-load-miss-rate%
     88195 ±  3%      +5.2%      92745 ±  4%     +16.4%     102647 ±  6%  perf-stat.i.node-load-misses
    106705 ±  3%     +14.8%     122490 ±  5%     +13.2%     120774 ±  5%  perf-stat.i.node-loads
    107169 ±  4%     +29.0%     138208 ±  7%      +7.5%     115217 ±  5%  perf-stat.i.node-stores
     10.94           +23.3%      13.49           -81.8%       1.99        perf-stat.overall.MPKI
      2.77            +0.2        2.97            +0.0        2.79        perf-stat.overall.branch-miss-rate%
     17.28            -2.7       14.56            +0.4       17.67        perf-stat.overall.cache-miss-rate%
      1.73            +3.4%       1.79            +5.0%       1.82        perf-stat.overall.cpi
      0.25 ±  4%      -0.0        0.22 ±  3%      -0.0        0.24 ±  3%  perf-stat.overall.dTLB-store-miss-rate%
     69.20 ±  4%      +6.4       75.60           +14.4       83.58 ±  3%  perf-stat.overall.iTLB-load-miss-rate%
      5260 ±  5%      +2.3%       5380 ±  2%     -59.2%       2144        perf-stat.overall.instructions-per-iTLB-miss
      0.58            -3.2%       0.56            -4.7%       0.55        perf-stat.overall.ipc
     45.25            -2.2       43.10            +0.7       45.93        perf-stat.overall.node-load-miss-rate%
 7.199e+08            -4.4%  6.883e+08            -3.9%  6.917e+08        perf-stat.ps.branch-instructions
  19919808            +2.8%   20469001            -3.1%   19299968        perf-stat.ps.branch-misses
  40597326           +19.5%   48497201            -1.9%   39829580        perf-stat.ps.cache-references
      1477 ±  7%      -5.7%       1393            +9.6%       1618 ±  5%  perf-stat.ps.context-switches
     55.06            +5.4%      58.03 ±  2%      +0.5%      55.32        perf-stat.ps.cpu-migrations
  16469488 ±  8%      -4.8%   15673772 ±  4%     -13.1%   14319828 ±  4%  perf-stat.ps.dTLB-load-misses
 1.007e+09            -3.8%  9.686e+08            -4.4%  9.627e+08        perf-stat.ps.dTLB-loads
   2158768 ±  4%      -6.9%    2009174 ±  3%      -7.0%    2007326 ±  3%  perf-stat.ps.dTLB-store-misses
 8.475e+08            +5.5%  8.937e+08            -3.5%  8.178e+08        perf-stat.ps.dTLB-stores
    707081 ±  5%      -5.4%     668703 ±  2%    +133.6%    1651678        perf-stat.ps.iTLB-load-misses
    315394 ± 12%     -31.6%     215816 ±  3%      +3.5%     326463 ± 19%  perf-stat.ps.iTLB-loads
  3.71e+09            -3.1%  3.595e+09            -4.5%  3.541e+09        perf-stat.ps.instructions
     87895 ±  3%      +5.2%      92424 ±  4%     +16.4%     102341 ±  6%  perf-stat.ps.node-load-misses
    106351 ±  3%     +14.8%     122083 ±  5%     +13.2%     120439 ±  5%  perf-stat.ps.node-loads
    106728 ±  4%     +29.1%     137740 ±  7%      +7.6%     114824 ±  5%  perf-stat.ps.node-stores
 1.117e+12            -3.0%  1.084e+12            -4.5%  1.067e+12        perf-stat.total.instructions
     10.55 ± 95%    -100.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.MIN_vruntime.avg
    506.30 ± 95%    -100.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.MIN_vruntime.max
      0.00            +0.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.MIN_vruntime.min
      1.08 ±  7%      -7.7%       1.00           -46.2%       0.58 ± 27%  sched_debug.cfs_rq:/.h_nr_running.max
    538959 ± 24%     -23.2%     414090           -58.3%     224671 ± 31%  sched_debug.cfs_rq:/.load.max
    130191 ± 14%     -13.3%     112846 ±  6%     -56.6%      56505 ± 39%  sched_debug.cfs_rq:/.load.stddev
     10.56 ± 95%    -100.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.max_vruntime.avg
    506.64 ± 95%    -100.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.max_vruntime.max
      0.00            +0.0%       0.00          -100.0%       0.00        sched_debug.cfs_rq:/.max_vruntime.min
    116849 ± 27%     -51.2%      56995 ± 20%     -66.7%      38966 ± 39%  sched_debug.cfs_rq:/.min_vruntime.max
      1248 ± 14%      +9.8%       1370 ± 15%     -29.8%     876.86 ± 16%  sched_debug.cfs_rq:/.min_vruntime.min
     20484 ± 22%     -37.0%      12910 ± 15%     -60.7%       8059 ± 32%  sched_debug.cfs_rq:/.min_vruntime.stddev
      1.08 ±  7%      -7.7%       1.00           -46.2%       0.58 ± 27%  sched_debug.cfs_rq:/.nr_running.max
      1223 ±191%    -897.4%      -9754          -100.0%       0.00        sched_debug.cfs_rq:/.spread0.avg
    107969 ± 29%     -65.3%      37448 ± 39%    -100.0%       0.00        sched_debug.cfs_rq:/.spread0.max
     -7628          +138.2%     -18173          -100.0%       0.00        sched_debug.cfs_rq:/.spread0.min
     20484 ± 22%     -37.0%      12910 ± 15%    -100.0%       0.00        sched_debug.cfs_rq:/.spread0.stddev
     29.84 ± 19%      -1.1%      29.52 ± 14%    -100.0%       0.00        sched_debug.cfs_rq:/.util_est_enqueued.avg
    569.91 ± 11%      +4.5%     595.58          -100.0%       0.00        sched_debug.cfs_rq:/.util_est_enqueued.max
    109.06 ± 13%      +3.0%     112.32 ±  5%    -100.0%       0.00        sched_debug.cfs_rq:/.util_est_enqueued.stddev
    910320            +0.0%     910388           -44.2%     507736 ± 28%  sched_debug.cpu.avg_idle.avg
   1052715 ±  5%      -3.8%    1012910           -45.8%     570219 ± 28%  sched_debug.cpu.avg_idle.max
    175426 ±  8%      +2.8%     180422 ±  6%     -42.5%     100839 ± 20%  sched_debug.cpu.clock.avg
    175428 ±  8%      +2.8%     180424 ±  6%     -42.5%     100840 ± 20%  sched_debug.cpu.clock.max
    175424 ±  8%      +2.8%     180421 ±  6%     -42.5%     100838 ± 20%  sched_debug.cpu.clock.min
    170979 ±  8%      +2.8%     175682 ±  6%     -42.5%      98373 ± 20%  sched_debug.cpu.clock_task.avg
    173398 ±  8%      +2.6%     177937 ±  6%     -42.6%      99568 ± 20%  sched_debug.cpu.clock_task.max
    165789 ±  8%      +3.2%     171014 ±  6%     -42.4%      95513 ± 20%  sched_debug.cpu.clock_task.min
      2178 ±  8%     -13.1%       1893 ± 18%     -49.7%       1094 ± 33%  sched_debug.cpu.clock_task.stddev
      7443 ±  4%      +1.7%       7566 ±  3%     -43.0%       4246 ± 24%  sched_debug.cpu.curr->pid.max
      1409 ±  2%      +1.1%       1426 ±  2%     -42.0%     818.46 ± 25%  sched_debug.cpu.curr->pid.stddev
    502082            +0.0%     502206           -43.9%     281834 ± 29%  sched_debug.cpu.max_idle_balance_cost.avg
    567767 ±  5%      -3.3%     548996 ±  2%     -46.6%     303439 ± 27%  sched_debug.cpu.max_idle_balance_cost.max
      4294            +0.0%       4294           -43.7%       2415 ± 29%  sched_debug.cpu.next_balance.avg
      4294            +0.0%       4294           -43.7%       2415 ± 29%  sched_debug.cpu.next_balance.max
      4294            +0.0%       4294           -43.7%       2415 ± 29%  sched_debug.cpu.next_balance.min
      0.29 ±  3%      -1.2%       0.28           -43.0%       0.16 ± 29%  sched_debug.cpu.nr_running.stddev
      8212 ±  7%      -2.2%       8032 ±  4%     -40.7%       4867 ± 20%  sched_debug.cpu.nr_switches.avg
     55209 ± 14%     -21.8%      43154 ± 14%     -57.0%      23755 ± 20%  sched_debug.cpu.nr_switches.max
      1272 ± 23%     +10.2%       1402 ±  8%     -39.1%     775.51 ± 29%  sched_debug.cpu.nr_switches.min
      9805 ± 13%     -13.7%       8459 ±  8%     -50.8%       4825 ± 20%  sched_debug.cpu.nr_switches.stddev
    -15.73           -14.1%     -13.52           -64.8%      -5.53        sched_debug.cpu.nr_uninterruptible.min
      6.08 ± 27%      -1.0%       6.02 ± 13%     -53.7%       2.82 ± 24%  sched_debug.cpu.nr_uninterruptible.stddev
    175425 ±  8%      +2.8%     180421 ±  6%     -42.5%     100838 ± 20%  sched_debug.cpu_clk
 4.295e+09            +0.0%  4.295e+09           -43.7%  2.416e+09 ± 29%  sched_debug.jiffies
    174815 ±  8%      +2.9%     179811 ±  6%     -42.5%     100505 ± 20%  sched_debug.ktime
    175972 ±  8%      +2.8%     180955 ±  6%     -42.5%     101150 ± 20%  sched_debug.sched_clk
  58611259            +0.0%   58611259           -94.0%    3508734 ± 29%  sched_debug.sysctl_sched.sysctl_sched_features
      0.75            +0.0%       0.75          -100.0%       0.00        sched_debug.sysctl_sched.sysctl_sched_idle_min_granularity
     24.00            +0.0%      24.00          -100.0%       0.00        sched_debug.sysctl_sched.sysctl_sched_latency
      3.00            +0.0%       3.00          -100.0%       0.00        sched_debug.sysctl_sched.sysctl_sched_min_granularity
      4.00            +0.0%       4.00          -100.0%       0.00        sched_debug.sysctl_sched.sysctl_sched_wakeup_granularity
      0.26 ±100%      -0.3        0.00            +1.2        1.44 ±  9%  perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
      2.08 ± 26%      -0.2        1.87 ± 12%      -1.4        0.63 ± 11%  perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
      0.00            +0.0        0.00            +0.7        0.67 ± 11%  perf-profile.calltrace.cycles-pp.rcu_sched_clock_irq.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues
      0.00            +0.0        0.00            +0.7        0.71 ± 18%  perf-profile.calltrace.cycles-pp.rebalance_domains.__do_softirq.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
      0.00            +0.0        0.00            +0.8        0.80 ± 14%  perf-profile.calltrace.cycles-pp.getname_flags.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
      0.00            +0.0        0.00            +0.8        0.84 ± 12%  perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
      0.00            +0.0        0.00            +1.0        0.98 ± 13%  perf-profile.calltrace.cycles-pp.mas_alloc_cyclic.mtree_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
      0.00            +0.0        0.00            +1.1        1.10 ± 14%  perf-profile.calltrace.cycles-pp.mtree_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
      0.00            +0.0        0.00            +1.2        1.20 ± 13%  perf-profile.calltrace.cycles-pp.mas_erase.mtree_erase.simple_offset_remove.shmem_unlink.vfs_unlink
      0.00            +0.0        0.00            +1.3        1.34 ± 15%  perf-profile.calltrace.cycles-pp.link_path_walk.path_lookupat.filename_lookup.vfs_statx.__do_sys_newstat
      0.00            +0.0        0.00            +1.4        1.35 ± 12%  perf-profile.calltrace.cycles-pp.mtree_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
      0.00            +0.0        0.00            +1.6        1.56 ± 18%  perf-profile.calltrace.cycles-pp.__do_softirq.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
      0.00            +0.0        0.00            +1.7        1.73 ±  6%  perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues
      0.00            +0.0        0.00            +1.8        1.80 ± 16%  perf-profile.calltrace.cycles-pp.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
      0.00            +0.0        0.00            +2.0        2.03 ± 15%  perf-profile.calltrace.cycles-pp.path_lookupat.filename_lookup.vfs_statx.__do_sys_newstat.do_syscall_64
      0.00            +0.0        0.00            +2.2        2.16 ± 15%  perf-profile.calltrace.cycles-pp.filename_lookup.vfs_statx.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +0.0        0.00            +2.9        2.94 ± 14%  perf-profile.calltrace.cycles-pp.vfs_statx.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
      0.00            +0.0        0.00            +3.2        3.19 ±  8%  perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt
      0.00            +0.0        0.00            +3.4        3.35 ±  8%  perf-profile.calltrace.cycles-pp.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
      0.00            +0.0        0.00            +3.9        3.88 ±  8%  perf-profile.calltrace.cycles-pp.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
      0.00            +0.8        0.75 ± 12%      +0.0        0.00        perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
      0.00            +0.8        0.78 ± 34%      +0.0        0.00        perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
      0.00            +0.8        0.83 ± 29%      +0.0        0.00        perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
      0.00            +0.9        0.92 ± 26%      +0.0        0.00        perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
      0.00            +1.0        0.99 ± 27%      +0.0        0.00        perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
      0.00            +1.0        1.04 ± 28%      +0.0        0.00        perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
      0.00            +1.1        1.11 ± 26%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
      1.51 ± 24%      +1.2        2.73 ± 10%      +1.2        2.75 ± 10%  perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +1.2        1.24 ± 20%      +0.0        0.00        perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
      0.00            +1.3        1.27 ± 10%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
      0.00            +1.3        1.30 ± 10%      +0.0        0.00        perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
      0.00            +1.3        1.33 ± 19%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
      0.00            +1.4        1.36 ± 10%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
      0.00            +1.4        1.37 ± 10%      +1.4        1.37 ± 12%  perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
      0.00            +1.5        1.51 ± 17%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
      0.00            +1.6        1.62 ± 12%      +1.6        1.65 ± 12%  perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
      0.00            +2.8        2.80 ± 13%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
      0.00            +2.9        2.94 ± 13%      +0.0        0.00        perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
      5.38 ± 24%      +3.1        8.51 ± 11%      +0.6        5.95 ± 11%  perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
      6.08 ± 24%      +3.2        9.24 ± 12%      +0.5        6.59 ± 10%  perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
      0.00            +3.2        3.20 ± 13%      +0.0        0.00        perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
      0.00            +3.2        3.24 ± 13%      +0.0        0.00        perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
      0.00            +3.4        3.36 ± 14%      +1.2        1.16 ± 13%  perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
      2.78 ± 25%      +3.4        6.17 ± 12%      +0.9        3.69 ± 12%  perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
      0.16 ± 30%      -0.1        0.08 ± 20%      -0.0        0.13 ± 42%  perf-profile.children.cycles-pp.map_id_up
      0.22 ± 18%      -0.0        0.16 ± 17%      -0.1        0.12 ±  8%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.47 ± 17%      -0.0        0.43 ± 16%      +1.0        1.49 ± 10%  perf-profile.children.cycles-pp.__x64_sys_close
      0.02 ±142%      -0.0        0.00            +0.1        0.08 ± 22%  perf-profile.children.cycles-pp._find_next_zero_bit
      0.01 ±223%      -0.0        0.00            +2.0        1.97 ± 15%  perf-profile.children.cycles-pp.irq_exit_rcu
      0.00            +0.0        0.00            +0.1        0.08 ± 30%  perf-profile.children.cycles-pp.__wake_up
      0.00            +0.0        0.00            +0.1        0.08 ± 21%  perf-profile.children.cycles-pp.should_we_balance
      0.00            +0.0        0.00            +0.1        0.09 ± 34%  perf-profile.children.cycles-pp.amd_clear_divider
      0.00            +0.0        0.00            +0.1        0.10 ± 36%  perf-profile.children.cycles-pp.apparmor_current_getsecid_subj
      0.00            +0.0        0.00            +0.1        0.10 ± 32%  perf-profile.children.cycles-pp.filp_flush
      0.00            +0.0        0.00            +0.1        0.10 ± 27%  perf-profile.children.cycles-pp.mas_wr_end_piv
      0.00            +0.0        0.00            +0.1        0.12 ± 27%  perf-profile.children.cycles-pp.mnt_get_write_access
      0.00            +0.0        0.00            +0.1        0.12 ± 23%  perf-profile.children.cycles-pp.file_close_fd
      0.00            +0.0        0.00            +0.1        0.13 ± 30%  perf-profile.children.cycles-pp.security_current_getsecid_subj
      0.00            +0.0        0.00            +0.1        0.13 ± 18%  perf-profile.children.cycles-pp.native_apic_mem_eoi
      0.00            +0.0        0.00            +0.2        0.17 ± 22%  perf-profile.children.cycles-pp.mas_leaf_max_gap
      0.00            +0.0        0.00            +0.2        0.18 ± 24%  perf-profile.children.cycles-pp.mtree_range_walk
      0.00            +0.0        0.00            +0.2        0.20 ± 19%  perf-profile.children.cycles-pp.inode_set_ctime_current
      0.00            +0.0        0.00            +0.2        0.24 ± 14%  perf-profile.children.cycles-pp.ima_file_check
      0.00            +0.0        0.00            +0.2        0.24 ± 22%  perf-profile.children.cycles-pp.mas_anode_descend
      0.00            +0.0        0.00            +0.3        0.26 ± 18%  perf-profile.children.cycles-pp.lockref_put_return
      0.00            +0.0        0.00            +0.3        0.29 ± 16%  perf-profile.children.cycles-pp.mas_wr_walk
      0.00            +0.0        0.00            +0.3        0.31 ± 23%  perf-profile.children.cycles-pp.mas_update_gap
      0.00            +0.0        0.00            +0.3        0.32 ± 17%  perf-profile.children.cycles-pp.mas_wr_append
      0.00            +0.0        0.00            +0.4        0.37 ± 15%  perf-profile.children.cycles-pp.mas_empty_area
      0.00            +0.0        0.00            +0.5        0.47 ± 18%  perf-profile.children.cycles-pp.mas_wr_node_store
      0.00            +0.0        0.00            +0.5        0.53 ± 18%  perf-profile.children.cycles-pp.__memcg_slab_post_alloc_hook
      0.00            +0.0        0.00            +0.7        0.65 ± 12%  perf-profile.children.cycles-pp.__memcg_slab_pre_alloc_hook
      0.00            +0.0        0.00            +0.7        0.65 ± 28%  perf-profile.children.cycles-pp.__memcg_slab_free_hook
      0.00            +0.0        0.00            +1.0        0.99 ± 13%  perf-profile.children.cycles-pp.mas_alloc_cyclic
      0.00            +0.0        0.00            +1.1        1.11 ± 14%  perf-profile.children.cycles-pp.mtree_alloc_cyclic
      0.00            +0.0        0.00            +1.2        1.21 ± 14%  perf-profile.children.cycles-pp.mas_erase
      0.00            +0.0        0.00            +1.4        1.35 ± 12%  perf-profile.children.cycles-pp.mtree_erase
      0.00            +0.0        0.00            +1.8        1.78 ±  9%  perf-profile.children.cycles-pp.entry_SYSCALL_64
      0.00            +0.0        0.00            +4.1        4.06 ±  8%  perf-profile.children.cycles-pp.tick_nohz_highres_handler
      0.02 ±146%      +0.1        0.08 ± 13%      +0.0        0.06 ± 81%  perf-profile.children.cycles-pp.shmem_is_huge
      0.02 ±141%      +0.1        0.09 ± 16%      -0.0        0.00        perf-profile.children.cycles-pp.__list_del_entry_valid
      0.00            +0.1        0.08 ± 11%      +0.0        0.00        perf-profile.children.cycles-pp.free_unref_page
      0.00            +0.1        0.08 ± 13%      +0.1        0.08 ± 45%  perf-profile.children.cycles-pp.shmem_destroy_inode
      0.04 ±101%      +0.1        0.14 ± 25%      +0.0        0.05 ± 65%  perf-profile.children.cycles-pp.rcu_nocb_try_bypass
      0.00            +0.1        0.12 ± 27%      +0.0        0.00        perf-profile.children.cycles-pp.xas_find_marked
      0.02 ±144%      +0.1        0.16 ± 14%      -0.0        0.00        perf-profile.children.cycles-pp.__unfreeze_partials
      0.03 ±106%      +0.2        0.19 ± 26%      -0.0        0.03 ±136%  perf-profile.children.cycles-pp.xas_descend
      0.01 ±223%      +0.2        0.17 ± 15%      -0.0        0.00        perf-profile.children.cycles-pp.get_page_from_freelist
      0.11 ± 22%      +0.2        0.29 ± 16%      -0.0        0.08 ± 30%  perf-profile.children.cycles-pp.rcu_segcblist_enqueue
      0.02 ±146%      +0.2        0.24 ± 13%      -0.0        0.01 ±174%  perf-profile.children.cycles-pp.__alloc_pages
      0.36 ± 79%      +0.6        0.98 ± 15%      -0.0        0.31 ± 43%  perf-profile.children.cycles-pp.__slab_free
      0.50 ± 26%      +0.7        1.23 ± 14%      -0.2        0.31 ± 19%  perf-profile.children.cycles-pp.__call_rcu_common
      0.00            +0.8        0.82 ± 13%      +0.0        0.00        perf-profile.children.cycles-pp.radix_tree_node_rcu_free
      0.00            +1.1        1.14 ± 17%      +0.0        0.00        perf-profile.children.cycles-pp.radix_tree_node_ctor
      0.16 ± 86%      +1.2        1.38 ± 16%      -0.1        0.02 ±174%  perf-profile.children.cycles-pp.setup_object
      1.52 ± 25%      +1.2        2.75 ± 10%      +1.2        2.76 ± 11%  perf-profile.children.cycles-pp.vfs_unlink
      0.36 ± 22%      +1.3        1.63 ± 12%      +1.3        1.65 ± 12%  perf-profile.children.cycles-pp.shmem_unlink
      0.00            +1.3        1.30 ± 10%      +0.0        0.00        perf-profile.children.cycles-pp.__xa_erase
      0.20 ± 79%      +1.3        1.53 ± 15%      -0.2        0.02 ±173%  perf-profile.children.cycles-pp.shuffle_freelist
      0.00            +1.4        1.36 ± 10%      +0.0        0.00        perf-profile.children.cycles-pp.xa_erase
      0.00            +1.4        1.38 ± 10%      +1.4        1.37 ± 12%  perf-profile.children.cycles-pp.simple_offset_remove
      0.00            +1.5        1.51 ± 17%      +0.0        0.00        perf-profile.children.cycles-pp.xas_expand
      0.26 ± 78%      +1.6        1.87 ± 13%      -0.2        0.05 ± 68%  perf-profile.children.cycles-pp.allocate_slab
      0.40 ± 49%      +1.7        2.10 ± 13%      -0.3        0.14 ± 28%  perf-profile.children.cycles-pp.___slab_alloc
      1.30 ± 85%      +2.1        3.42 ± 12%      -0.1        1.15 ± 41%  perf-profile.children.cycles-pp.rcu_do_batch
      1.56 ± 27%      +2.4        3.93 ± 11%      -0.2        1.41 ± 12%  perf-profile.children.cycles-pp.kmem_cache_alloc_lru
      0.00            +2.4        2.44 ± 12%      +0.0        0.00        perf-profile.children.cycles-pp.xas_alloc
      2.66 ± 13%      +2.5        5.14 ±  5%      -2.7        0.00        perf-profile.children.cycles-pp.__irq_exit_rcu
     11.16 ± 10%      +2.7       13.88 ±  8%      +0.6       11.72 ±  8%  perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
     11.77 ± 10%      +2.7       14.49 ±  8%      +0.6       12.40 ±  8%  perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
      0.00            +2.8        2.82 ± 13%      +0.0        0.00        perf-profile.children.cycles-pp.xas_create
      5.40 ± 24%      +3.1        8.52 ± 11%      +0.6        5.97 ± 11%  perf-profile.children.cycles-pp.lookup_open
      6.12 ± 24%      +3.1        9.27 ± 12%      +0.5        6.62 ± 10%  perf-profile.children.cycles-pp.open_last_lookups
      0.00            +3.2        3.22 ± 13%      +0.0        0.00        perf-profile.children.cycles-pp.__xa_alloc
      0.00            +3.2        3.24 ± 13%      +0.0        0.00        perf-profile.children.cycles-pp.__xa_alloc_cyclic
      0.00            +3.4        3.36 ± 14%      +1.2        1.16 ± 13%  perf-profile.children.cycles-pp.simple_offset_add
      2.78 ± 25%      +3.4        6.18 ± 12%      +0.9        3.70 ± 12%  perf-profile.children.cycles-pp.shmem_mknod
      0.00            +4.2        4.24 ± 12%      +0.0        0.00        perf-profile.children.cycles-pp.xas_store
      0.14 ± 27%      -0.1        0.08 ± 21%      -0.0        0.12 ± 42%  perf-profile.self.cycles-pp.map_id_up
      0.09 ± 18%      -0.0        0.06 ± 52%      +0.1        0.14 ± 12%  perf-profile.self.cycles-pp.obj_cgroup_charge
      0.18 ± 22%      -0.0        0.15 ± 17%      -0.1        0.10 ±  9%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      0.02 ±141%      -0.0        0.00            +0.1        0.08 ± 22%  perf-profile.self.cycles-pp._find_next_zero_bit
      0.16 ± 24%      -0.0        0.16 ± 24%      -0.1        0.07 ± 64%  perf-profile.self.cycles-pp.__sysvec_apic_timer_interrupt
      0.02 ±146%      +0.0        0.02 ±146%      +0.1        0.08 ± 18%  perf-profile.self.cycles-pp.shmem_mknod
      0.00            +0.0        0.00            +0.1        0.09 ± 36%  perf-profile.self.cycles-pp.irq_exit_rcu
      0.00            +0.0        0.00            +0.1        0.09 ± 28%  perf-profile.self.cycles-pp.tick_nohz_highres_handler
      0.00            +0.0        0.00            +0.1        0.09 ± 36%  perf-profile.self.cycles-pp.apparmor_current_getsecid_subj
      0.00            +0.0        0.00            +0.1        0.09 ± 30%  perf-profile.self.cycles-pp.mtree_erase
      0.00            +0.0        0.00            +0.1        0.10 ± 26%  perf-profile.self.cycles-pp.mtree_alloc_cyclic
      0.00            +0.0        0.00            +0.1        0.10 ± 27%  perf-profile.self.cycles-pp.mas_wr_end_piv
      0.00            +0.0        0.00            +0.1        0.12 ± 28%  perf-profile.self.cycles-pp.mnt_get_write_access
      0.00            +0.0        0.00            +0.1        0.12 ± 29%  perf-profile.self.cycles-pp.inode_set_ctime_current
      0.00            +0.0        0.00            +0.1        0.12 ± 38%  perf-profile.self.cycles-pp.mas_empty_area
      0.00            +0.0        0.00            +0.1        0.13 ± 18%  perf-profile.self.cycles-pp.native_apic_mem_eoi
      0.00            +0.0        0.00            +0.1        0.14 ± 38%  perf-profile.self.cycles-pp.mas_update_gap
      0.00            +0.0        0.00            +0.1        0.14 ± 20%  perf-profile.self.cycles-pp.mas_wr_append
      0.00            +0.0        0.00            +0.2        0.16 ± 23%  perf-profile.self.cycles-pp.mas_leaf_max_gap
      0.00            +0.0        0.00            +0.2        0.18 ± 24%  perf-profile.self.cycles-pp.mtree_range_walk
      0.00            +0.0        0.00            +0.2        0.18 ± 29%  perf-profile.self.cycles-pp.mas_alloc_cyclic
      0.00            +0.0        0.00            +0.2        0.20 ± 14%  perf-profile.self.cycles-pp.__memcg_slab_pre_alloc_hook
      0.00            +0.0        0.00            +0.2        0.22 ± 32%  perf-profile.self.cycles-pp.mas_erase
      0.00            +0.0        0.00            +0.2        0.24 ± 35%  perf-profile.self.cycles-pp.__memcg_slab_free_hook
      0.00            +0.0        0.00            +0.2        0.24 ± 22%  perf-profile.self.cycles-pp.mas_anode_descend
      0.00            +0.0        0.00            +0.3        0.26 ± 17%  perf-profile.self.cycles-pp.lockref_put_return
      0.00            +0.0        0.00            +0.3        0.27 ± 16%  perf-profile.self.cycles-pp.mas_wr_walk
      0.00            +0.0        0.00            +0.3        0.34 ± 20%  perf-profile.self.cycles-pp.mas_wr_node_store
      0.00            +0.0        0.00            +0.4        0.35 ± 20%  perf-profile.self.cycles-pp.__memcg_slab_post_alloc_hook
      0.00            +0.0        0.00            +1.6        1.59 ±  8%  perf-profile.self.cycles-pp.entry_SYSCALL_64
      0.05 ± 85%      +0.0        0.06 ± 47%      +0.1        0.15 ± 54%  perf-profile.self.cycles-pp.check_heap_object
      0.05 ± 72%      +0.0        0.06 ± 81%      +0.0        0.10 ± 15%  perf-profile.self.cycles-pp.call_cpuidle
      0.04 ±105%      +0.0        0.04 ± 75%      +0.1        0.10 ± 33%  perf-profile.self.cycles-pp.putname
      0.00            +0.1        0.06 ± 24%      +0.0        0.05 ± 66%  perf-profile.self.cycles-pp.shmem_destroy_inode
      0.00            +0.1        0.07 ±  8%      +0.0        0.00        perf-profile.self.cycles-pp.__xa_alloc
      0.02 ±146%      +0.1        0.11 ± 28%      +0.0        0.02 ±133%  perf-profile.self.cycles-pp.rcu_nocb_try_bypass
      0.01 ±223%      +0.1        0.10 ± 28%      -0.0        0.00        perf-profile.self.cycles-pp.shuffle_freelist
      0.00            +0.1        0.11 ± 40%      +0.0        0.00        perf-profile.self.cycles-pp.xas_create
      0.00            +0.1        0.12 ± 27%      +0.0        0.00        perf-profile.self.cycles-pp.xas_find_marked
      0.00            +0.1        0.14 ± 18%      +0.0        0.00        perf-profile.self.cycles-pp.xas_alloc
      0.03 ±103%      +0.1        0.17 ± 29%      -0.0        0.03 ±136%  perf-profile.self.cycles-pp.xas_descend
      0.00            +0.2        0.16 ± 23%      +0.0        0.00        perf-profile.self.cycles-pp.xas_expand
      0.10 ± 22%      +0.2        0.27 ± 16%      -0.0        0.06 ± 65%  perf-profile.self.cycles-pp.rcu_segcblist_enqueue
      0.92 ± 35%      +0.3        1.22 ± 11%      -0.3        0.59 ± 15%  perf-profile.self.cycles-pp.kmem_cache_free
      0.00            +0.4        0.36 ± 16%      +0.0        0.00        perf-profile.self.cycles-pp.xas_store
      0.32 ± 30%      +0.4        0.71 ± 12%      -0.1        0.18 ± 23%  perf-profile.self.cycles-pp.__call_rcu_common
      0.18 ± 27%      +0.5        0.65 ±  8%      +0.1        0.26 ± 21%  perf-profile.self.cycles-pp.kmem_cache_alloc_lru
      0.36 ± 79%      +0.6        0.96 ± 15%      -0.0        0.31 ± 42%  perf-profile.self.cycles-pp.__slab_free
      0.00            +0.8        0.80 ± 14%      +0.0        0.00        perf-profile.self.cycles-pp.radix_tree_node_rcu_free
      0.00            +1.0        1.01 ± 16%      +0.0        0.00        perf-profile.self.cycles-pp.radix_tree_node_ctor



[1] https://lore.kernel.org/all/202402191308.8e7ee8c7-oliver.sang@intel.com/

> 
> 
> > > > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > > >  	if (!inode || !S_ISDIR(inode->i_mode))
> > > > >  		return ret;
> > > > >  
> > > > > -	index = 2;
> > > > > +	index = DIR_OFFSET_MIN;
> > > > 
> > > > This bit should go into the simple_offset_empty() patch...
> > > > 
> > > > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > > >  
> > > > >  	/* In this case, ->private_data is protected by f_pos_lock */
> > > > >  	file->private_data = NULL;
> > > > > -	return vfs_setpos(file, offset, U32_MAX);
> > > > > +	return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > > > 					^^^
> > > > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > > > quite right? Why not use ULONG_MAX here directly?
> > > 
> > > I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> > > length checking in vfs_setpos() fails. There is probably a sign
> > > extension thing happening here that I don't understand.
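
For reference, the bounds check involved is roughly the following (a simplified
sketch modeled on mainline vfs_setpos(); the _sketch suffix is illustrative
only, and the FMODE_UNSIGNED_OFFSET special case is elided):

#include <linux/fs.h>

/* Sketch of the maxsize check that rejects the seek target. */
static loff_t vfs_setpos_sketch(struct file *file, loff_t offset, loff_t maxsize)
{
	if (offset < 0)
		return -EINVAL;
	if (offset > maxsize)		/* signed comparison: maxsize is loff_t */
		return -EINVAL;
	file->f_pos = offset;
	return offset;
}

Because maxsize is a signed loff_t, passing ULONG_MAX on a 64-bit build makes
maxsize equal to -1, so "offset > maxsize" is true for every non-negative
offset and the seek fails with -EINVAL. MAX_LFS_FILESIZE is (loff_t)LLONG_MAX
on 64-bit, which keeps maxsize positive and would explain why it behaves where
ULONG_MAX does not.
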
> > > 
> > > 
> > > > Otherwise the patch looks good to me.
> > > 
> > > As always, thank you for your review.
> > > 
> > > 
> > > -- 
> > > Chuck Lever
> 
> -- 
> Chuck Lever


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()
  2024-02-16 16:33             ` Liam R. Howlett
@ 2024-02-19 18:06               ` Jan Kara
  0 siblings, 0 replies; 29+ messages in thread
From: Jan Kara @ 2024-02-19 18:06 UTC (permalink / raw
  To: Liam R. Howlett
  Cc: Jan Kara, Chuck Lever, viro, brauner, hughd, akpm, oliver.sang,
	feng.tang, linux-kernel, linux-fsdevel, maple-tree, linux-mm, lkp

On Fri 16-02-24 11:33:18, Liam R. Howlett wrote:
> * Jan Kara <jack@suse.cz> [240216 05:15]:
> > > If you have other examples you think are unsafe then I can have a look
> > > at them as well.
> > 
> > I'm currently not aware of any but I'll let you know if I find some.
> > Missing xas/mas_pause() seems really easy.
> 
> What if we convert the rcu_read_lock() to a mas_read_lock() or
> xas_read_lock() and we can check to ensure the state isn't being locked
> without being in the 'parked' state (paused or otherwise)?
> 
> mas_read_lock(struct ma_state *mas) {
> 	assert(!mas_active(mas));
> 	rcu_read_lock();
> }
> 
> Would that be a reasonable resolution to your concern?  Unfortunately,
> what was done with the locking in this case would not be detected with
> this change unless the rcu_read_lock() was replaced.  IOW, people could
> still use the rcu_read_lock() and skip the detection.

Yes, I guess this is still better than nothing.
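
A rough sketch of what such a helper might look like (hypothetical, not a
mainline interface; the mas_read_lock()/mas_read_unlock() names follow the
suggestion above, WARN_ON_ONCE() is used in place of a hard assert, and
mas_is_active() stands in for whatever "walk still in progress" predicate
the maple tree ends up exposing):

#include <linux/maple_tree.h>
#include <linux/rcupdate.h>
#include <linux/bug.h>

/*
 * Take the RCU read lock for a maple tree walk, but complain if the
 * ma_state was left mid-walk, i.e. the caller is resuming iteration
 * without first parking the state with mas_pause() (or equivalent).
 */
static inline void mas_read_lock(struct ma_state *mas)
{
	WARN_ON_ONCE(mas_is_active(mas));
	rcu_read_lock();
}

/* No state check on unlock: the state may simply be done with at this point. */
static inline void mas_read_unlock(struct ma_state *mas)
{
	rcu_read_unlock();
}

A caller such as offset_iterate_dir() would then bracket each walk with
mas_read_lock()/mas_read_unlock() and call mas_pause() before dropping the
lock mid-iteration; forgetting the pause would trip the warning on the next
mas_read_lock() instead of going unnoticed.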

> Doing the same in the mas_unlock() doesn't make as much sense since that
> may be called without the intent to reuse the state at all.  So we'd be
> doing more work than necessary at the end of each loop or use.

Yes, understood.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-02-19 18:06 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-13 21:37 [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 1/7] libfs: Rename "so_ctx" Chuck Lever
2024-02-15 12:42   ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 2/7] libfs: Define a minimum directory offset Chuck Lever
2024-02-15 12:47   ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 3/7] libfs: Add simple_offset_empty() Chuck Lever
2024-02-15 12:53   ` Jan Kara
2024-02-13 21:37 ` [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic() Chuck Lever
2024-02-13 21:37 ` [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation Chuck Lever
2024-02-13 21:38 ` [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree Chuck Lever
2024-02-15 13:06   ` Jan Kara
2024-02-15 13:45     ` Chuck Lever
2024-02-15 14:02       ` Jan Kara
2024-02-16 15:15       ` Christian Brauner
2024-02-18  2:02       ` Oliver Sang
2024-02-18 15:57         ` Chuck Lever
2024-02-19  6:00           ` Oliver Sang
2024-02-13 21:38 ` [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir() Chuck Lever
2024-02-15 13:16   ` Jan Kara
2024-02-15 17:00     ` Liam R. Howlett
2024-02-15 17:16       ` Jan Kara
2024-02-15 21:07         ` Liam R. Howlett
2024-02-16 10:15           ` Jan Kara
2024-02-16 15:57             ` Matthew Wilcox
2024-02-16 16:33             ` Liam R. Howlett
2024-02-19 18:06               ` Jan Kara
2024-02-15 17:40       ` Chuck Lever
2024-02-15 21:08         ` Liam R. Howlett
2024-02-13 21:40 ` [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities Chuck Lever III

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).