Linux-mm Archive mirror
* [PATCH v3 0/2] mm/zswap: optimize for dynamic zswap_pools
@ 2024-02-16  8:55 Chengming Zhou
  2024-02-16  8:55 ` [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools Chengming Zhou
  2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
  0 siblings, 2 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-16  8:55 UTC (permalink / raw
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton
  Cc: linux-mm, Yosry Ahmed, linux-kernel, Chengming Zhou

Changes in v3:
- Improve the commit messages and comments, per Yosry.
- Use percpu_ref_is_zero() for debug purposes, per Yosry.
- Collect tag.
- Link to v2: https://lore.kernel.org/r/20240210-zswap-global-lru-v2-0-fbee3b11a62e@bytedance.com

Changes in v2:
- fix build error when !CONFIG_MEMCG_KMEM.
- make zswap struct static and fix some error paths, per Yosry.
- add another shrink_lock to protect zswap.next_shrink, per Yosry.
- keep "WARN_ON(percpu_ref_tryget(&pool->ref))" in pool release path
  for debug, per Nhat.
- improve the commit messages.
- Link to v1: https://lore.kernel.org/r/20240210-zswap-global-lru-v1-0-853473d7b0da@bytedance.com

Dynamic pool creation has been supported for a long time, although it
may not be used much in practice. But with the per-memcg lru merged,
the current structure of per-zswap_pool lru and shrinker has become
less optimal.

In the current structure, each zswap_pool has its own lru, shrinker and
shrink_work, but only the most recently created zswap_pool is actually
the current one in use.

1. Under memory pressure, the shrinkers of all zswap_pools try to
   shrink their own lru lists, with no ordering between them.

2. When the zswap limit is hit, only the last zswap_pool's shrink_work
   tries to shrink its own lru, which is inefficient.

A more natural design is a single global zswap lru shared by all
zswap_pools, along with a single shared shrinker. The code also becomes
much simpler.
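
Roughly, the shared state that patch 1/2 switches to has the following
shape (a sketch only; see the patch for the exact layout):

static struct {
	struct list_lru list_lru;
	atomic_t nr_stored;
	struct shrinker *shrinker;
	struct work_struct shrink_work;
	struct mem_cgroup *next_shrink;
	/* The lock protects next_shrink. */
	spinlock_t shrink_lock;
} zswap;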

Another optimization is changing the zswap_pool kref to a percpu_ref,
since every zswap entry takes a reference on its pool; this improves
scalability.
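
For reference, a minimal sketch of the percpu_ref lifecycle used by
patch 2/2 (the pool_release name below is illustrative; the real
callback in the patch is __zswap_pool_empty()):

/* release callback: runs once the count hits zero after percpu_ref_kill() */
static void pool_release(struct percpu_ref *ref)
{
	struct zswap_pool *pool = container_of(ref, struct zswap_pool, ref);
	/* schedule the actual pool teardown ... */
}

/* at pool creation: the ref starts out in fast percpu mode */
ret = percpu_ref_init(&pool->ref, pool_release,
		      PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);

/* zswap_store()/load hot path: per-CPU inc/dec, no shared atomic */
if (percpu_ref_tryget(&pool->ref)) {
	/* ... use pool ... */
	percpu_ref_put(&pool->ref);
}

/* when a pool is retired: switch to atomic mode and drop the initial ref */
percpu_ref_kill(&pool->ref);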

Testing kernel build (32 threads) in tmpfs with memory.max=2GB.
(zswap shrinker and writeback enabled with one 50GB swapfile,
on a 128-CPU x86-64 machine; below is the average of 5 runs)

        mm-unstable  zswap-global-lru
real    63.20        63.12
user    1061.75      1062.95
sys     268.74       264.44

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
Chengming Zhou (2):
      mm/zswap: global lru and shrinker shared by all zswap_pools
      mm/zswap: change zswap_pool kref to percpu_ref

 mm/zswap.c | 207 +++++++++++++++++++++++++++----------------------------------
 1 file changed, 93 insertions(+), 114 deletions(-)
---
base-commit: 191d97734e41a5c9f90a2f6636fdd335ae1d435d
change-id: 20240210-zswap-global-lru-94d49316178b

Best regards,
-- 
Chengming Zhou <zhouchengming@bytedance.com>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools
  2024-02-16  8:55 [PATCH v3 0/2] mm/zswap: optimize for dynamic zswap_pools Chengming Zhou
@ 2024-02-16  8:55 ` Chengming Zhou
  2024-02-20  1:28   ` Nhat Pham
  2024-03-05  7:53   ` [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix Chengming Zhou
  2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
  1 sibling, 2 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-16  8:55 UTC (permalink / raw
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton
  Cc: linux-mm, Yosry Ahmed, linux-kernel, Chengming Zhou

Dynamic zswap_pool creation may create or reuse pools, so that multiple
zswap_pools end up on a list, but only the first one is actually in
current use.

Each zswap_pool has its own lru and shrinker, which is unnecessary
and has its problems:

1. Under memory pressure, the shrinkers of all zswap_pools try to
   shrink their own lrus, with no ordering between them.

2. When the zswap limit is hit, only the last zswap_pool's shrink_work
   tries to shrink its lru list. The rationale here was to
   try and empty the old pool first so that we can completely
   drop it. However, since we only support exclusive loads now,
   the LRU ordering should be entirely decided by the order of
   stores, so the oldest entries on the LRU will naturally be
   from the oldest pool.

So having a global lru and shrinker shared by all zswap_pools is
better and more efficient.

Acked-by: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 171 ++++++++++++++++++++++++-------------------------------------
 1 file changed, 66 insertions(+), 105 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 62fe307521c9..d275eb523fc4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -176,14 +176,19 @@ struct zswap_pool {
 	struct kref kref;
 	struct list_head list;
 	struct work_struct release_work;
-	struct work_struct shrink_work;
 	struct hlist_node node;
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
+};
+
+static struct {
 	struct list_lru list_lru;
-	struct mem_cgroup *next_shrink;
-	struct shrinker *shrinker;
 	atomic_t nr_stored;
-};
+	struct shrinker *shrinker;
+	struct work_struct shrink_work;
+	struct mem_cgroup *next_shrink;
+	/* The lock protects next_shrink. */
+	spinlock_t shrink_lock;
+} zswap;
 
 /*
  * struct zswap_entry
@@ -301,9 +306,6 @@ static void zswap_update_total_size(void)
 * pool functions
 **********************************/
 
-static void zswap_alloc_shrinker(struct zswap_pool *pool);
-static void shrink_worker(struct work_struct *w);
-
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	int i;
@@ -353,30 +355,16 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	if (ret)
 		goto error;
 
-	zswap_alloc_shrinker(pool);
-	if (!pool->shrinker)
-		goto error;
-
-	pr_debug("using %s compressor\n", pool->tfm_name);
-
 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
 	 */
 	kref_init(&pool->kref);
 	INIT_LIST_HEAD(&pool->list);
-	if (list_lru_init_memcg(&pool->list_lru, pool->shrinker))
-		goto lru_fail;
-	shrinker_register(pool->shrinker);
-	INIT_WORK(&pool->shrink_work, shrink_worker);
-	atomic_set(&pool->nr_stored, 0);
 
 	zswap_pool_debug("created", pool);
 
 	return pool;
 
-lru_fail:
-	list_lru_destroy(&pool->list_lru);
-	shrinker_free(pool->shrinker);
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -434,15 +422,8 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
 
 	zswap_pool_debug("destroying", pool);
 
-	shrinker_free(pool->shrinker);
 	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 	free_percpu(pool->acomp_ctx);
-	list_lru_destroy(&pool->list_lru);
-
-	spin_lock(&zswap_pools_lock);
-	mem_cgroup_iter_break(NULL, pool->next_shrink);
-	pool->next_shrink = NULL;
-	spin_unlock(&zswap_pools_lock);
 
 	for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
 		zpool_destroy_pool(pool->zpools[i]);
@@ -529,24 +510,6 @@ static struct zswap_pool *zswap_pool_current_get(void)
 	return pool;
 }
 
-static struct zswap_pool *zswap_pool_last_get(void)
-{
-	struct zswap_pool *pool, *last = NULL;
-
-	rcu_read_lock();
-
-	list_for_each_entry_rcu(pool, &zswap_pools, list)
-		last = pool;
-	WARN_ONCE(!last && zswap_has_pool,
-		  "%s: no page storage pool!\n", __func__);
-	if (!zswap_pool_get(last))
-		last = NULL;
-
-	rcu_read_unlock();
-
-	return last;
-}
-
 /* type and compressor must be null-terminated */
 static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 {
@@ -816,15 +779,11 @@ void zswap_folio_swapin(struct folio *folio)
 
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 {
-	struct zswap_pool *pool;
-
-	/* lock out zswap pools list modification */
-	spin_lock(&zswap_pools_lock);
-	list_for_each_entry(pool, &zswap_pools, list) {
-		if (pool->next_shrink == memcg)
-			pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
-	}
-	spin_unlock(&zswap_pools_lock);
+	/* lock out zswap shrinker walking memcg tree */
+	spin_lock(&zswap.shrink_lock);
+	if (zswap.next_shrink == memcg)
+		zswap.next_shrink = mem_cgroup_iter(NULL, zswap.next_shrink, NULL);
+	spin_unlock(&zswap.shrink_lock);
 }
 
 /*********************************
@@ -923,9 +882,9 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	if (!entry->length)
 		atomic_dec(&zswap_same_filled_pages);
 	else {
-		zswap_lru_del(&entry->pool->list_lru, entry);
+		zswap_lru_del(&zswap.list_lru, entry);
 		zpool_free(zswap_find_zpool(entry), entry->handle);
-		atomic_dec(&entry->pool->nr_stored);
+		atomic_dec(&zswap.nr_stored);
 		zswap_pool_put(entry->pool);
 	}
 	if (entry->objcg) {
@@ -1288,7 +1247,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 {
 	struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 	unsigned long shrink_ret, nr_protected, lru_size;
-	struct zswap_pool *pool = shrinker->private_data;
 	bool encountered_page_in_swapcache = false;
 
 	if (!zswap_shrinker_enabled ||
@@ -1299,7 +1257,7 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 
 	nr_protected =
 		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&pool->list_lru, sc);
+	lru_size = list_lru_shrink_count(&zswap.list_lru, sc);
 
 	/*
 	 * Abort if we are shrinking into the protected region.
@@ -1316,7 +1274,7 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
+	shrink_ret = list_lru_shrink_walk(&zswap.list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
 	if (encountered_page_in_swapcache)
@@ -1328,7 +1286,6 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 		struct shrink_control *sc)
 {
-	struct zswap_pool *pool = shrinker->private_data;
 	struct mem_cgroup *memcg = sc->memcg;
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
 	unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
@@ -1342,8 +1299,8 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
 #else
 	/* use pool stats instead of memcg stats */
-	nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
-	nr_stored = atomic_read(&pool->nr_stored);
+	nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
+	nr_stored = atomic_read(&zswap.nr_stored);
 #endif
 
 	if (!nr_stored)
@@ -1351,7 +1308,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 
 	nr_protected =
 		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
+	nr_freeable = list_lru_shrink_count(&zswap.list_lru, sc);
 	/*
 	 * Subtract the lru size by an estimate of the number of pages
 	 * that should be protected.
@@ -1367,23 +1324,24 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	return mult_frac(nr_freeable, nr_backing, nr_stored);
 }
 
-static void zswap_alloc_shrinker(struct zswap_pool *pool)
+static struct shrinker *zswap_alloc_shrinker(void)
 {
-	pool->shrinker =
+	struct shrinker *shrinker;
+
+	shrinker =
 		shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
-	if (!pool->shrinker)
-		return;
+	if (!shrinker)
+		return NULL;
 
-	pool->shrinker->private_data = pool;
-	pool->shrinker->scan_objects = zswap_shrinker_scan;
-	pool->shrinker->count_objects = zswap_shrinker_count;
-	pool->shrinker->batch = 0;
-	pool->shrinker->seeks = DEFAULT_SEEKS;
+	shrinker->scan_objects = zswap_shrinker_scan;
+	shrinker->count_objects = zswap_shrinker_count;
+	shrinker->batch = 0;
+	shrinker->seeks = DEFAULT_SEEKS;
+	return shrinker;
 }
 
 static int shrink_memcg(struct mem_cgroup *memcg)
 {
-	struct zswap_pool *pool;
 	int nid, shrunk = 0;
 
 	if (!mem_cgroup_zswap_writeback_enabled(memcg))
@@ -1396,32 +1354,25 @@ static int shrink_memcg(struct mem_cgroup *memcg)
 	if (memcg && !mem_cgroup_online(memcg))
 		return -ENOENT;
 
-	pool = zswap_pool_current_get();
-	if (!pool)
-		return -EINVAL;
-
 	for_each_node_state(nid, N_NORMAL_MEMORY) {
 		unsigned long nr_to_walk = 1;
 
-		shrunk += list_lru_walk_one(&pool->list_lru, nid, memcg,
+		shrunk += list_lru_walk_one(&zswap.list_lru, nid, memcg,
 					    &shrink_memcg_cb, NULL, &nr_to_walk);
 	}
-	zswap_pool_put(pool);
 	return shrunk ? 0 : -EAGAIN;
 }
 
 static void shrink_worker(struct work_struct *w)
 {
-	struct zswap_pool *pool = container_of(w, typeof(*pool),
-						shrink_work);
 	struct mem_cgroup *memcg;
 	int ret, failures = 0;
 
 	/* global reclaim will select cgroup in a round-robin fashion. */
 	do {
-		spin_lock(&zswap_pools_lock);
-		pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
-		memcg = pool->next_shrink;
+		spin_lock(&zswap.shrink_lock);
+		zswap.next_shrink = mem_cgroup_iter(NULL, zswap.next_shrink, NULL);
+		memcg = zswap.next_shrink;
 
 		/*
 		 * We need to retry if we have gone through a full round trip, or if we
@@ -1435,7 +1386,7 @@ static void shrink_worker(struct work_struct *w)
 		 * memcg is not killed when we are reclaiming.
 		 */
 		if (!memcg) {
-			spin_unlock(&zswap_pools_lock);
+			spin_unlock(&zswap.shrink_lock);
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 
@@ -1445,15 +1396,15 @@ static void shrink_worker(struct work_struct *w)
 		if (!mem_cgroup_tryget_online(memcg)) {
 			/* drop the reference from mem_cgroup_iter() */
 			mem_cgroup_iter_break(NULL, memcg);
-			pool->next_shrink = NULL;
-			spin_unlock(&zswap_pools_lock);
+			zswap.next_shrink = NULL;
+			spin_unlock(&zswap.shrink_lock);
 
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 
 			goto resched;
 		}
-		spin_unlock(&zswap_pools_lock);
+		spin_unlock(&zswap.shrink_lock);
 
 		ret = shrink_memcg(memcg);
 		/* drop the extra reference */
@@ -1467,7 +1418,6 @@ static void shrink_worker(struct work_struct *w)
 resched:
 		cond_resched();
 	} while (!zswap_can_accept());
-	zswap_pool_put(pool);
 }
 
 static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
@@ -1508,7 +1458,6 @@ bool zswap_store(struct folio *folio)
 	struct zswap_entry *entry, *dupentry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
-	struct zswap_pool *shrink_pool;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1576,7 +1525,7 @@ bool zswap_store(struct folio *folio)
 
 	if (objcg) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &entry->pool->list_lru, GFP_KERNEL)) {
+		if (memcg_list_lru_alloc(memcg, &zswap.list_lru, GFP_KERNEL)) {
 			mem_cgroup_put(memcg);
 			goto put_pool;
 		}
@@ -1607,8 +1556,8 @@ bool zswap_store(struct folio *folio)
 	}
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&entry->pool->list_lru, entry);
-		atomic_inc(&entry->pool->nr_stored);
+		zswap_lru_add(&zswap.list_lru, entry);
+		atomic_inc(&zswap.nr_stored);
 	}
 	spin_unlock(&tree->lock);
 
@@ -1640,9 +1589,7 @@ bool zswap_store(struct folio *folio)
 	return false;
 
 shrink:
-	shrink_pool = zswap_pool_last_get();
-	if (shrink_pool && !queue_work(shrink_wq, &shrink_pool->shrink_work))
-		zswap_pool_put(shrink_pool);
+	queue_work(shrink_wq, &zswap.shrink_work);
 	goto reject;
 }
 
@@ -1804,6 +1751,22 @@ static int zswap_setup(void)
 	if (ret)
 		goto hp_fail;
 
+	shrink_wq = alloc_workqueue("zswap-shrink",
+			WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
+	if (!shrink_wq)
+		goto shrink_wq_fail;
+
+	zswap.shrinker = zswap_alloc_shrinker();
+	if (!zswap.shrinker)
+		goto shrinker_fail;
+	if (list_lru_init_memcg(&zswap.list_lru, zswap.shrinker))
+		goto lru_fail;
+	shrinker_register(zswap.shrinker);
+
+	INIT_WORK(&zswap.shrink_work, shrink_worker);
+	atomic_set(&zswap.nr_stored, 0);
+	spin_lock_init(&zswap.shrink_lock);
+
 	pool = __zswap_pool_create_fallback();
 	if (pool) {
 		pr_info("loaded using pool %s/%s\n", pool->tfm_name,
@@ -1815,19 +1778,17 @@ static int zswap_setup(void)
 		zswap_enabled = false;
 	}
 
-	shrink_wq = alloc_workqueue("zswap-shrink",
-			WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
-	if (!shrink_wq)
-		goto fallback_fail;
-
 	if (zswap_debugfs_init())
 		pr_warn("debugfs initialization failed\n");
 	zswap_init_state = ZSWAP_INIT_SUCCEED;
 	return 0;
 
-fallback_fail:
-	if (pool)
-		zswap_pool_destroy(pool);
+lru_fail:
+	shrinker_free(zswap.shrinker);
+shrinker_fail:
+	destroy_workqueue(shrink_wq);
+shrink_wq_fail:
+	cpuhp_remove_multi_state(CPUHP_MM_ZSWP_POOL_PREPARE);
 hp_fail:
 	kmem_cache_destroy(zswap_entry_cache);
 cache_fail:

-- 
b4 0.10.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref
  2024-02-16  8:55 [PATCH v3 0/2] mm/zswap: optimize for dynamic zswap_pools Chengming Zhou
  2024-02-16  8:55 ` [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools Chengming Zhou
@ 2024-02-16  8:55 ` Chengming Zhou
  2024-02-16 10:41   ` Nhat Pham
                     ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-16  8:55 UTC (permalink / raw
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton
  Cc: linux-mm, Yosry Ahmed, linux-kernel, Chengming Zhou

All zswap entries take a reference on the zswap_pool in
zswap_store(), and drop it when they are freed. Changing the kref to
a percpu_ref is better for scalability.

Although a percpu_ref uses a bit more memory, that should be fine for
our use case, since we almost always have only one zswap_pool in use.
The performance gain is in the zswap_store/load hot path.

Testing kernel build (32 threads) in tmpfs with memory.max=2GB.
(zswap shrinker and writeback enabled with one 50GB swapfile,
on a 128-CPU x86-64 machine; below is the average of 5 runs)

        mm-unstable  zswap-global-lru
real    63.20        63.12
user    1061.75      1062.95
sys     268.74       264.44

Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 mm/zswap.c | 36 +++++++++++++++++++++++++++---------
 1 file changed, 27 insertions(+), 9 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index d275eb523fc4..961349162997 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -173,7 +173,7 @@ struct crypto_acomp_ctx {
 struct zswap_pool {
 	struct zpool *zpools[ZSWAP_NR_ZPOOLS];
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
-	struct kref kref;
+	struct percpu_ref ref;
 	struct list_head list;
 	struct work_struct release_work;
 	struct hlist_node node;
@@ -305,6 +305,7 @@ static void zswap_update_total_size(void)
 /*********************************
 * pool functions
 **********************************/
+static void __zswap_pool_empty(struct percpu_ref *ref);
 
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
@@ -358,13 +359,18 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 	/* being the current pool takes 1 ref; this func expects the
 	 * caller to always add the new pool as the current pool
 	 */
-	kref_init(&pool->kref);
+	ret = percpu_ref_init(&pool->ref, __zswap_pool_empty,
+			      PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);
+	if (ret)
+		goto ref_fail;
 	INIT_LIST_HEAD(&pool->list);
 
 	zswap_pool_debug("created", pool);
 
 	return pool;
 
+ref_fail:
+	cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
 error:
 	if (pool->acomp_ctx)
 		free_percpu(pool->acomp_ctx);
@@ -437,8 +443,9 @@ static void __zswap_pool_release(struct work_struct *work)
 
 	synchronize_rcu();
 
-	/* nobody should have been able to get a kref... */
-	WARN_ON(kref_get_unless_zero(&pool->kref));
+	/* nobody should have been able to get a ref... */
+	WARN_ON(!percpu_ref_is_zero(&pool->ref));
+	percpu_ref_exit(&pool->ref);
 
 	/* pool is now off zswap_pools list and has no references. */
 	zswap_pool_destroy(pool);
@@ -446,11 +453,11 @@ static void __zswap_pool_release(struct work_struct *work)
 
 static struct zswap_pool *zswap_pool_current(void);
 
-static void __zswap_pool_empty(struct kref *kref)
+static void __zswap_pool_empty(struct percpu_ref *ref)
 {
 	struct zswap_pool *pool;
 
-	pool = container_of(kref, typeof(*pool), kref);
+	pool = container_of(ref, typeof(*pool), ref);
 
 	spin_lock(&zswap_pools_lock);
 
@@ -469,12 +476,12 @@ static int __must_check zswap_pool_get(struct zswap_pool *pool)
 	if (!pool)
 		return 0;
 
-	return kref_get_unless_zero(&pool->kref);
+	return percpu_ref_tryget(&pool->ref);
 }
 
 static void zswap_pool_put(struct zswap_pool *pool)
 {
-	kref_put(&pool->kref, __zswap_pool_empty);
+	percpu_ref_put(&pool->ref);
 }
 
 static struct zswap_pool *__zswap_pool_current(void)
@@ -604,6 +611,17 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 
 	if (!pool)
 		pool = zswap_pool_create(type, compressor);
+	else {
+		/*
+		 * Restore the initial ref dropped by percpu_ref_kill()
+		 * when the pool was decommissioned and switch it again
+		 * to percpu mode.
+		 */
+		percpu_ref_resurrect(&pool->ref);
+
+		/* Drop the ref from zswap_pool_find_get(). */
+		zswap_pool_put(pool);
+	}
 
 	if (pool)
 		ret = param_set_charp(s, kp);
@@ -642,7 +660,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 	 * or the new pool we failed to add
 	 */
 	if (put_pool)
-		zswap_pool_put(put_pool);
+		percpu_ref_kill(&put_pool->ref);
 
 	return ret;
 }

-- 
b4 0.10.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref
  2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
@ 2024-02-16 10:41   ` Nhat Pham
  2024-02-28 15:18   ` [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing " Chengming Zhou
  2024-02-28 15:49   ` [PATCH mm-unstable v2] " Chengming Zhou
  2 siblings, 0 replies; 12+ messages in thread
From: Nhat Pham @ 2024-02-16 10:41 UTC (permalink / raw
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm,
	linux-kernel

On Fri, Feb 16, 2024 at 12:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> All zswap entries take a reference on the zswap_pool in
> zswap_store(), and drop it when they are freed. Changing the kref to
> a percpu_ref is better for scalability.
>
> Although a percpu_ref uses a bit more memory, that should be fine for
> our use case, since we almost always have only one zswap_pool in use.
> The performance gain is in the zswap_store/load hot path.
>
> Testing kernel build (32 threads) in tmpfs with memory.max=2GB.
> (zswap shrinker and writeback enabled with one 50GB swapfile,
> on a 128-CPU x86-64 machine; below is the average of 5 runs)
>
>         mm-unstable  zswap-global-lru
> real    63.20        63.12
> user    1061.75      1062.95
> sys     268.74       264.44

Idea is straightforward + code looks solid to me FWIW, so:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 36 +++++++++++++++++++++++++++---------
>  1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index d275eb523fc4..961349162997 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -173,7 +173,7 @@ struct crypto_acomp_ctx {
>  struct zswap_pool {
>         struct zpool *zpools[ZSWAP_NR_ZPOOLS];
>         struct crypto_acomp_ctx __percpu *acomp_ctx;
> -       struct kref kref;
> +       struct percpu_ref ref;
>         struct list_head list;
>         struct work_struct release_work;
>         struct hlist_node node;
> @@ -305,6 +305,7 @@ static void zswap_update_total_size(void)
>  /*********************************
>  * pool functions
>  **********************************/
> +static void __zswap_pool_empty(struct percpu_ref *ref);
>
>  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>  {
> @@ -358,13 +359,18 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>         /* being the current pool takes 1 ref; this func expects the
>          * caller to always add the new pool as the current pool
>          */
> -       kref_init(&pool->kref);
> +       ret = percpu_ref_init(&pool->ref, __zswap_pool_empty,
> +                             PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);
> +       if (ret)
> +               goto ref_fail;
>         INIT_LIST_HEAD(&pool->list);
>
>         zswap_pool_debug("created", pool);
>
>         return pool;
>
> +ref_fail:
> +       cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
>  error:
>         if (pool->acomp_ctx)
>                 free_percpu(pool->acomp_ctx);
> @@ -437,8 +443,9 @@ static void __zswap_pool_release(struct work_struct *work)
>
>         synchronize_rcu();
>
> -       /* nobody should have been able to get a kref... */
> -       WARN_ON(kref_get_unless_zero(&pool->kref));
> +       /* nobody should have been able to get a ref... */
> +       WARN_ON(!percpu_ref_is_zero(&pool->ref));

Ah nice - this is actually even clearer :) For some reason I missed
it, my apologies.

> +       percpu_ref_exit(&pool->ref);
>
>         /* pool is now off zswap_pools list and has no references. */
>         zswap_pool_destroy(pool);
> @@ -446,11 +453,11 @@ static void __zswap_pool_release(struct work_struct *work)
>
>  static struct zswap_pool *zswap_pool_current(void);
>
> -static void __zswap_pool_empty(struct kref *kref)
> +static void __zswap_pool_empty(struct percpu_ref *ref)
>  {
>         struct zswap_pool *pool;
>
> -       pool = container_of(kref, typeof(*pool), kref);
> +       pool = container_of(ref, typeof(*pool), ref);
>
>         spin_lock(&zswap_pools_lock);
>
> @@ -469,12 +476,12 @@ static int __must_check zswap_pool_get(struct zswap_pool *pool)
>         if (!pool)
>                 return 0;
>
> -       return kref_get_unless_zero(&pool->kref);
> +       return percpu_ref_tryget(&pool->ref);
>  }
>
>  static void zswap_pool_put(struct zswap_pool *pool)
>  {
> -       kref_put(&pool->kref, __zswap_pool_empty);
> +       percpu_ref_put(&pool->ref);
>  }
>
>  static struct zswap_pool *__zswap_pool_current(void)
> @@ -604,6 +611,17 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
>
>         if (!pool)
>                 pool = zswap_pool_create(type, compressor);
> +       else {
> +               /*
> +                * Restore the initial ref dropped by percpu_ref_kill()
> +                * when the pool was decommissioned and switch it again
> +                * to percpu mode.
> +                */
> +               percpu_ref_resurrect(&pool->ref);
> +
> +               /* Drop the ref from zswap_pool_find_get(). */
> +               zswap_pool_put(pool);
> +       }
>
>         if (pool)
>                 ret = param_set_charp(s, kp);
> @@ -642,7 +660,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
>          * or the new pool we failed to add
>          */
>         if (put_pool)
> -               zswap_pool_put(put_pool);
> +               percpu_ref_kill(&put_pool->ref);
>
>         return ret;
>  }
>
> --
> b4 0.10.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools
  2024-02-16  8:55 ` [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools Chengming Zhou
@ 2024-02-20  1:28   ` Nhat Pham
  2024-02-20  3:24     ` Chengming Zhou
  2024-03-05  7:53   ` [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix Chengming Zhou
  1 sibling, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2024-02-20  1:28 UTC (permalink / raw
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm,
	linux-kernel

On Fri, Feb 16, 2024 at 12:55 AM Chengming Zhou
<zhouchengming@bytedance.com> wrote:
>
> Dynamic zswap_pool creation may create or reuse pools, so that multiple
> zswap_pools end up on a list, but only the first one is actually in
> current use.
>
> Each zswap_pool has its own lru and shrinker, which is unnecessary
> and has its problems:
>
> 1. Under memory pressure, the shrinkers of all zswap_pools try to
>    shrink their own lrus, with no ordering between them.
>
> 2. When the zswap limit is hit, only the last zswap_pool's shrink_work
>    tries to shrink its lru list. The rationale here was to
>    try and empty the old pool first so that we can completely
>    drop it. However, since we only support exclusive loads now,
>    the LRU ordering should be entirely decided by the order of
>    stores, so the oldest entries on the LRU will naturally be
>    from the oldest pool.
>
> So having a global lru and shrinker shared by all zswap_pools is
> better and more efficient.
>
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
> ---
>  mm/zswap.c | 171 ++++++++++++++++++++++++-------------------------------------
>  1 file changed, 66 insertions(+), 105 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 62fe307521c9..d275eb523fc4 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -176,14 +176,19 @@ struct zswap_pool {
>         struct kref kref;
>         struct list_head list;
>         struct work_struct release_work;
> -       struct work_struct shrink_work;
>         struct hlist_node node;
>         char tfm_name[CRYPTO_MAX_ALG_NAME];
> +};
> +
> +static struct {
>         struct list_lru list_lru;
> -       struct mem_cgroup *next_shrink;
> -       struct shrinker *shrinker;
>         atomic_t nr_stored;
> -};
> +       struct shrinker *shrinker;
> +       struct work_struct shrink_work;
> +       struct mem_cgroup *next_shrink;
> +       /* The lock protects next_shrink. */
> +       spinlock_t shrink_lock;
> +} zswap;

nit: Is there a reason why we're putting these in a struct instead of
just a bunch of static variables (perhaps prefixed with zswap?)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools
  2024-02-20  1:28   ` Nhat Pham
@ 2024-02-20  3:24     ` Chengming Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-20  3:24 UTC (permalink / raw
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm,
	linux-kernel

On 2024/2/20 09:28, Nhat Pham wrote:
> On Fri, Feb 16, 2024 at 12:55 AM Chengming Zhou
> <zhouchengming@bytedance.com> wrote:
>>
>> Dynamic zswap_pool creation may create or reuse pools, so that multiple
>> zswap_pools end up on a list, but only the first one is actually in
>> current use.
>>
>> Each zswap_pool has its own lru and shrinker, which is unnecessary
>> and has its problems:
>>
>> 1. Under memory pressure, the shrinkers of all zswap_pools try to
>>    shrink their own lrus, with no ordering between them.
>>
>> 2. When the zswap limit is hit, only the last zswap_pool's shrink_work
>>    tries to shrink its lru list. The rationale here was to
>>    try and empty the old pool first so that we can completely
>>    drop it. However, since we only support exclusive loads now,
>>    the LRU ordering should be entirely decided by the order of
>>    stores, so the oldest entries on the LRU will naturally be
>>    from the oldest pool.
>>
>> So having a global lru and shrinker shared by all zswap_pools is
>> better and more efficient.
>>
>> Acked-by: Yosry Ahmed <yosryahmed@google.com>
>> Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
>> ---
>>  mm/zswap.c | 171 ++++++++++++++++++++++++-------------------------------------
>>  1 file changed, 66 insertions(+), 105 deletions(-)
>>
>> diff --git a/mm/zswap.c b/mm/zswap.c
>> index 62fe307521c9..d275eb523fc4 100644
>> --- a/mm/zswap.c
>> +++ b/mm/zswap.c
>> @@ -176,14 +176,19 @@ struct zswap_pool {
>>         struct kref kref;
>>         struct list_head list;
>>         struct work_struct release_work;
>> -       struct work_struct shrink_work;
>>         struct hlist_node node;
>>         char tfm_name[CRYPTO_MAX_ALG_NAME];
>> +};
>> +
>> +static struct {
>>         struct list_lru list_lru;
>> -       struct mem_cgroup *next_shrink;
>> -       struct shrinker *shrinker;
>>         atomic_t nr_stored;
>> -};
>> +       struct shrinker *shrinker;
>> +       struct work_struct shrink_work;
>> +       struct mem_cgroup *next_shrink;
>> +       /* The lock protects next_shrink. */
>> +       spinlock_t shrink_lock;
>> +} zswap;
> 
> nit: Is there a reason why we're putting these in a struct instead of
> just a bunch of static variables (perhaps prefixed with zswap?)

No reason, both are ok for me. I thought there should be no difference.
But I can change to static variables if that's preferred in the kernel. :)
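
For example, a minimal sketch of the two forms (only the lru and the
lock shown here):

/* current: one anonymous file-scope struct */
static struct {
	struct list_lru list_lru;
	/* The lock protects next_shrink. */
	spinlock_t shrink_lock;
} zswap;

/* alternative: individual zswap_-prefixed static variables */
static struct list_lru zswap_list_lru;
static DEFINE_SPINLOCK(zswap_shrink_lock);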



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing to percpu_ref
  2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
  2024-02-16 10:41   ` Nhat Pham
@ 2024-02-28 15:18   ` Chengming Zhou
  2024-02-28 15:24     ` Matthew Wilcox
  2024-02-28 15:49   ` [PATCH mm-unstable v2] " Chengming Zhou
  2 siblings, 1 reply; 12+ messages in thread
From: Chengming Zhou @ 2024-02-28 15:18 UTC (permalink / raw
  To: akpm; +Cc: hannes, yosryahmed, nphamcs, linux-mm, linux-kernel,
	Chengming Zhou

Now that the release of a zswap pool is controlled by percpu_ref, its
release callback (__zswap_pool_empty()) is called when the percpu_ref
hits 0. But this release callback may potentially be called from RCU
callback context by percpu_ref_kill(), which may be interrupt context.

So we need to use spin_lock_irqsave() and spin_unlock_irqrestore() in
the release callback __zswap_pool_empty(). In the other, task-context
call sites, spin_lock_irq() and spin_unlock_irq() are enough to avoid
a potential deadlock.

This problem was introduced by commit f3da427e82c4 ("mm/zswap: change
zswap_pool kref to percpu_ref"), which is in the mm-unstable branch now.
It can be reproduced by running a kernel build in tmpfs with zswap and
CONFIG_LOCKDEP enabled, while changing the zswap compressor setting
dynamically.

Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
 mm/zswap.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 011e068eb355..894bd184f78e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -456,10 +456,11 @@ static struct zswap_pool *zswap_pool_current(void);
 static void __zswap_pool_empty(struct percpu_ref *ref)
 {
 	struct zswap_pool *pool;
+	unsigned long flags;
 
 	pool = container_of(ref, typeof(*pool), ref);
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_irqsave(&zswap_pools_lock, flags);
 
 	WARN_ON(pool == zswap_pool_current());
 
@@ -468,7 +469,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
 	INIT_WORK(&pool->release_work, __zswap_pool_release);
 	schedule_work(&pool->release_work);
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_irqrestore(&zswap_pools_lock, flags);
 }
 
 static int __must_check zswap_pool_get(struct zswap_pool *pool)
@@ -598,7 +599,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		return -EINVAL;
 	}
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_irq(&zswap_pools_lock);
 
 	pool = zswap_pool_find_get(type, compressor);
 	if (pool) {
@@ -607,7 +608,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		list_del_rcu(&pool->list);
 	}
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_irq(&zswap_pools_lock);
 
 	if (!pool)
 		pool = zswap_pool_create(type, compressor);
@@ -628,7 +629,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 	else
 		ret = -EINVAL;
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_irq(&zswap_pools_lock);
 
 	if (!ret) {
 		put_pool = zswap_pool_current();
@@ -643,7 +644,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		put_pool = pool;
 	}
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_irq(&zswap_pools_lock);
 
 	if (!zswap_has_pool && !pool) {
 		/* if initial pool creation failed, and this pool creation also
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing to percpu_ref
  2024-02-28 15:18   ` [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing " Chengming Zhou
@ 2024-02-28 15:24     ` Matthew Wilcox
  2024-02-28 15:37       ` Chengming Zhou
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-02-28 15:24 UTC (permalink / raw
  To: Chengming Zhou; +Cc: akpm, hannes, yosryahmed, nphamcs, linux-mm, linux-kernel

On Wed, Feb 28, 2024 at 03:18:32PM +0000, Chengming Zhou wrote:
> Now that the release of a zswap pool is controlled by percpu_ref, its
> release callback (__zswap_pool_empty()) is called when the percpu_ref
> hits 0. But this release callback may potentially be called from RCU
> callback context by percpu_ref_kill(), which may be interrupt context.
> 
> So we need to use spin_lock_irqsave() and spin_unlock_irqrestore() in
> the release callback __zswap_pool_empty(). In the other, task-context
> call sites, spin_lock_irq() and spin_unlock_irq() are enough to avoid
> a potential deadlock.

RCU callback context is BH, not IRQ, so it's enough to use
spin_lock_bh(), no?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing to percpu_ref
  2024-02-28 15:24     ` Matthew Wilcox
@ 2024-02-28 15:37       ` Chengming Zhou
  0 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-28 15:37 UTC (permalink / raw
  To: Matthew Wilcox; +Cc: akpm, hannes, yosryahmed, nphamcs, linux-mm, linux-kernel

On 2024/2/28 23:24, Matthew Wilcox wrote:
> On Wed, Feb 28, 2024 at 03:18:32PM +0000, Chengming Zhou wrote:
>> Now that the release of a zswap pool is controlled by percpu_ref, its
>> release callback (__zswap_pool_empty()) is called when the percpu_ref
>> hits 0. But this release callback may potentially be called from RCU
>> callback context by percpu_ref_kill(), which may be interrupt context.
>>
>> So we need to use spin_lock_irqsave() and spin_unlock_irqrestore() in
>> the release callback __zswap_pool_empty(). In the other, task-context
>> call sites, spin_lock_irq() and spin_unlock_irq() are enough to avoid
>> a potential deadlock.
> 
> RCU callback context is BH, not IRQ, so it's enough to use
> spin_lock_bh(), no?

You're right, it's the softirq context, so spin_lock_bh() is enough.

Thanks!


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2] mm/zswap: fix zswap_pools_lock usages after changing to percpu_ref
  2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
  2024-02-16 10:41   ` Nhat Pham
  2024-02-28 15:18   ` [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing " Chengming Zhou
@ 2024-02-28 15:49   ` Chengming Zhou
  2 siblings, 0 replies; 12+ messages in thread
From: Chengming Zhou @ 2024-02-28 15:49 UTC (permalink / raw
  To: akpm
  Cc: hannes, yosryahmed, nphamcs, willy, linux-mm, linux-kernel,
	Chengming Zhou

Now that the release of a zswap pool is controlled by percpu_ref, its
release callback (__zswap_pool_empty()) is called when the percpu_ref
hits 0. But this release callback may potentially be called from RCU
callback context by percpu_ref_kill(), which may be softirq context.

So we need to use spin_lock/unlock_bh() to avoid a potential deadlock.

This problem was introduced by commit f3da427e82c4 ("mm/zswap: change
zswap_pool kref to percpu_ref"), which is in the mm-unstable branch now.
It can be reproduced by running a kernel build in tmpfs with zswap and
CONFIG_LOCKDEP enabled, while changing the zswap compressor setting
dynamically.

Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
v2:
 - Change to use spin_lock/unlock_bh(), per Matthew.
---
 mm/zswap.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 011e068eb355..da90933c6d20 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -459,7 +459,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
 
 	pool = container_of(ref, typeof(*pool), ref);
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_bh(&zswap_pools_lock);
 
 	WARN_ON(pool == zswap_pool_current());
 
@@ -468,7 +468,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
 	INIT_WORK(&pool->release_work, __zswap_pool_release);
 	schedule_work(&pool->release_work);
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_bh(&zswap_pools_lock);
 }
 
 static int __must_check zswap_pool_get(struct zswap_pool *pool)
@@ -598,7 +598,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		return -EINVAL;
 	}
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_bh(&zswap_pools_lock);
 
 	pool = zswap_pool_find_get(type, compressor);
 	if (pool) {
@@ -607,7 +607,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		list_del_rcu(&pool->list);
 	}
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_bh(&zswap_pools_lock);
 
 	if (!pool)
 		pool = zswap_pool_create(type, compressor);
@@ -628,7 +628,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 	else
 		ret = -EINVAL;
 
-	spin_lock(&zswap_pools_lock);
+	spin_lock_bh(&zswap_pools_lock);
 
 	if (!ret) {
 		put_pool = zswap_pool_current();
@@ -643,7 +643,7 @@ static int __zswap_param_set(const char *val, const struct kernel_param *kp,
 		put_pool = pool;
 	}
 
-	spin_unlock(&zswap_pools_lock);
+	spin_unlock_bh(&zswap_pools_lock);
 
 	if (!zswap_has_pool && !pool) {
 		/* if initial pool creation failed, and this pool creation also
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix
  2024-02-16  8:55 ` [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools Chengming Zhou
  2024-02-20  1:28   ` Nhat Pham
@ 2024-03-05  7:53   ` Chengming Zhou
  2024-03-05 15:09     ` Johannes Weiner
  1 sibling, 1 reply; 12+ messages in thread
From: Chengming Zhou @ 2024-03-05  7:53 UTC (permalink / raw
  To: akpm
  Cc: hannes, yosryahmed, nphamcs, linux-mm, linux-kernel,
	Chengming Zhou, kernel test robot

Commit bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by
all zswap_pools") introduced a new lock to protect zswap_next_shrink,
instead of reusing zswap_pools_lock.

But the problem is that the lock is initialized only when zswap is
enabled, which causes a bug if zswap_memcg_offline_cleanup() is called
without zswap enabled.

Fix it by using DEFINE_SPINLOCK() to statically initialize the lock,
and define the pieces of shared state as individual static variables
to stay consistent with the existing global variables in zswap.

Fixes: bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all zswap_pools")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202403051008.a8cf8a94-lkp@intel.com
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
 mm/zswap.c | 77 +++++++++++++++++++++++++++---------------------------
 1 file changed, 38 insertions(+), 39 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index da90933c6d20..9a3237752082 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -180,15 +180,16 @@ struct zswap_pool {
 	char tfm_name[CRYPTO_MAX_ALG_NAME];
 };
 
-static struct {
-	struct list_lru list_lru;
-	atomic_t nr_stored;
-	struct shrinker *shrinker;
-	struct work_struct shrink_work;
-	struct mem_cgroup *next_shrink;
-	/* The lock protects next_shrink. */
-	spinlock_t shrink_lock;
-} zswap;
+/* Global LRU lists shared by all zswap pools. */
+static struct list_lru zswap_list_lru;
+/* counter of pages stored in all zswap pools. */
+static atomic_t zswap_nr_stored = ATOMIC_INIT(0);
+
+/* The lock protects zswap_next_shrink updates. */
+static DEFINE_SPINLOCK(zswap_shrink_lock);
+static struct mem_cgroup *zswap_next_shrink;
+static struct work_struct zswap_shrink_work;
+static struct shrinker *zswap_shrinker;
 
 /*
  * struct zswap_entry
@@ -798,10 +799,10 @@ void zswap_folio_swapin(struct folio *folio)
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 {
 	/* lock out zswap shrinker walking memcg tree */
-	spin_lock(&zswap.shrink_lock);
-	if (zswap.next_shrink == memcg)
-		zswap.next_shrink = mem_cgroup_iter(NULL, zswap.next_shrink, NULL);
-	spin_unlock(&zswap.shrink_lock);
+	spin_lock(&zswap_shrink_lock);
+	if (zswap_next_shrink == memcg)
+		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+	spin_unlock(&zswap_shrink_lock);
 }
 
 /*********************************
@@ -900,9 +901,9 @@ static void zswap_entry_free(struct zswap_entry *entry)
 	if (!entry->length)
 		atomic_dec(&zswap_same_filled_pages);
 	else {
-		zswap_lru_del(&zswap.list_lru, entry);
+		zswap_lru_del(&zswap_list_lru, entry);
 		zpool_free(zswap_find_zpool(entry), entry->handle);
-		atomic_dec(&zswap.nr_stored);
+		atomic_dec(&zswap_nr_stored);
 		zswap_pool_put(entry->pool);
 	}
 	if (entry->objcg) {
@@ -1274,7 +1275,7 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 
 	nr_protected =
 		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	lru_size = list_lru_shrink_count(&zswap.list_lru, sc);
+	lru_size = list_lru_shrink_count(&zswap_list_lru, sc);
 
 	/*
 	 * Abort if we are shrinking into the protected region.
@@ -1291,7 +1292,7 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
 		return SHRINK_STOP;
 	}
 
-	shrink_ret = list_lru_shrink_walk(&zswap.list_lru, sc, &shrink_memcg_cb,
+	shrink_ret = list_lru_shrink_walk(&zswap_list_lru, sc, &shrink_memcg_cb,
 		&encountered_page_in_swapcache);
 
 	if (encountered_page_in_swapcache)
@@ -1317,7 +1318,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 #else
 	/* use pool stats instead of memcg stats */
 	nr_backing = zswap_pool_total_size >> PAGE_SHIFT;
-	nr_stored = atomic_read(&zswap.nr_stored);
+	nr_stored = atomic_read(&zswap_nr_stored);
 #endif
 
 	if (!nr_stored)
@@ -1325,7 +1326,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 
 	nr_protected =
 		atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
-	nr_freeable = list_lru_shrink_count(&zswap.list_lru, sc);
+	nr_freeable = list_lru_shrink_count(&zswap_list_lru, sc);
 	/*
 	 * Subtract the lru size by an estimate of the number of pages
 	 * that should be protected.
@@ -1374,7 +1375,7 @@ static int shrink_memcg(struct mem_cgroup *memcg)
 	for_each_node_state(nid, N_NORMAL_MEMORY) {
 		unsigned long nr_to_walk = 1;
 
-		shrunk += list_lru_walk_one(&zswap.list_lru, nid, memcg,
+		shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
 					    &shrink_memcg_cb, NULL, &nr_to_walk);
 	}
 	return shrunk ? 0 : -EAGAIN;
@@ -1387,9 +1388,9 @@ static void shrink_worker(struct work_struct *w)
 
 	/* global reclaim will select cgroup in a round-robin fashion. */
 	do {
-		spin_lock(&zswap.shrink_lock);
-		zswap.next_shrink = mem_cgroup_iter(NULL, zswap.next_shrink, NULL);
-		memcg = zswap.next_shrink;
+		spin_lock(&zswap_shrink_lock);
+		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+		memcg = zswap_next_shrink;
 
 		/*
 		 * We need to retry if we have gone through a full round trip, or if we
@@ -1403,7 +1404,7 @@ static void shrink_worker(struct work_struct *w)
 		 * memcg is not killed when we are reclaiming.
 		 */
 		if (!memcg) {
-			spin_unlock(&zswap.shrink_lock);
+			spin_unlock(&zswap_shrink_lock);
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 
@@ -1413,15 +1414,15 @@ static void shrink_worker(struct work_struct *w)
 		if (!mem_cgroup_tryget_online(memcg)) {
 			/* drop the reference from mem_cgroup_iter() */
 			mem_cgroup_iter_break(NULL, memcg);
-			zswap.next_shrink = NULL;
-			spin_unlock(&zswap.shrink_lock);
+			zswap_next_shrink = NULL;
+			spin_unlock(&zswap_shrink_lock);
 
 			if (++failures == MAX_RECLAIM_RETRIES)
 				break;
 
 			goto resched;
 		}
-		spin_unlock(&zswap.shrink_lock);
+		spin_unlock(&zswap_shrink_lock);
 
 		ret = shrink_memcg(memcg);
 		/* drop the extra reference */
@@ -1542,7 +1543,7 @@ bool zswap_store(struct folio *folio)
 
 	if (objcg) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
-		if (memcg_list_lru_alloc(memcg, &zswap.list_lru, GFP_KERNEL)) {
+		if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
 			mem_cgroup_put(memcg);
 			goto put_pool;
 		}
@@ -1573,8 +1574,8 @@ bool zswap_store(struct folio *folio)
 	}
 	if (entry->length) {
 		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap.list_lru, entry);
-		atomic_inc(&zswap.nr_stored);
+		zswap_lru_add(&zswap_list_lru, entry);
+		atomic_inc(&zswap_nr_stored);
 	}
 	spin_unlock(&tree->lock);
 
@@ -1606,7 +1607,7 @@ bool zswap_store(struct folio *folio)
 	return false;
 
 shrink:
-	queue_work(shrink_wq, &zswap.shrink_work);
+	queue_work(shrink_wq, &zswap_shrink_work);
 	goto reject;
 }
 
@@ -1773,16 +1774,14 @@ static int zswap_setup(void)
 	if (!shrink_wq)
 		goto shrink_wq_fail;
 
-	zswap.shrinker = zswap_alloc_shrinker();
-	if (!zswap.shrinker)
+	zswap_shrinker = zswap_alloc_shrinker();
+	if (!zswap_shrinker)
 		goto shrinker_fail;
-	if (list_lru_init_memcg(&zswap.list_lru, zswap.shrinker))
+	if (list_lru_init_memcg(&zswap_list_lru, zswap_shrinker))
 		goto lru_fail;
-	shrinker_register(zswap.shrinker);
+	shrinker_register(zswap_shrinker);
 
-	INIT_WORK(&zswap.shrink_work, shrink_worker);
-	atomic_set(&zswap.nr_stored, 0);
-	spin_lock_init(&zswap.shrink_lock);
+	INIT_WORK(&zswap_shrink_work, shrink_worker);
 
 	pool = __zswap_pool_create_fallback();
 	if (pool) {
@@ -1801,7 +1800,7 @@ static int zswap_setup(void)
 	return 0;
 
 lru_fail:
-	shrinker_free(zswap.shrinker);
+	shrinker_free(zswap_shrinker);
 shrinker_fail:
 	destroy_workqueue(shrink_wq);
 shrink_wq_fail:
-- 
2.40.1



^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix
  2024-03-05  7:53   ` [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix Chengming Zhou
@ 2024-03-05 15:09     ` Johannes Weiner
  0 siblings, 0 replies; 12+ messages in thread
From: Johannes Weiner @ 2024-03-05 15:09 UTC (permalink / raw
  To: Chengming Zhou
  Cc: akpm, yosryahmed, nphamcs, linux-mm, linux-kernel,
	kernel test robot

On Tue, Mar 05, 2024 at 07:53:45AM +0000, Chengming Zhou wrote:
> Commit bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by
> all zswap_pools") introduced a new lock to protect zswap_next_shrink,
> instead of reusing zswap_pools_lock.
> 
> But the problem is that the lock is initialized only when zswap is
> enabled, which causes a bug if zswap_memcg_offline_cleanup() is called
> without zswap enabled.
> 
> Fix it by using DEFINE_SPINLOCK() to statically initialize the lock,
> and define the pieces of shared state as individual static variables
> to stay consistent with the existing global variables in zswap.
> 
> Fixes: bf9b7df23cb3 ("mm/zswap: global lru and shrinker shared by all zswap_pools")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202403051008.a8cf8a94-lkp@intel.com
> Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-03-05 15:09 UTC | newest]

Thread overview: 12+ messages
2024-02-16  8:55 [PATCH v3 0/2] mm/zswap: optimize for dynamic zswap_pools Chengming Zhou
2024-02-16  8:55 ` [PATCH v3 1/2] mm/zswap: global lru and shrinker shared by all zswap_pools Chengming Zhou
2024-02-20  1:28   ` Nhat Pham
2024-02-20  3:24     ` Chengming Zhou
2024-03-05  7:53   ` [PATCH mm-unstable] mm/zswap: global lru and shrinker shared by all zswap_pools fix Chengming Zhou
2024-03-05 15:09     ` Johannes Weiner
2024-02-16  8:55 ` [PATCH v3 2/2] mm/zswap: change zswap_pool kref to percpu_ref Chengming Zhou
2024-02-16 10:41   ` Nhat Pham
2024-02-28 15:18   ` [PATCH mm-unstable hotfix] mm/zswap: fix zswap_pools_lock usages after changing " Chengming Zhou
2024-02-28 15:24     ` Matthew Wilcox
2024-02-28 15:37       ` Chengming Zhou
2024-02-28 15:49   ` [PATCH mm-unstable v2] " Chengming Zhou
