From: Qi Zheng <zhengqi.arch@bytedance.com>
To: paulmck@kernel.org, Dave Chinner <david@fromorbit.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
akpm@linux-foundation.org, tkhai@ya.ru, roman.gushchin@linux.dev,
djwong@kernel.org, brauner@kernel.org, tytso@mit.edu,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
linux-arm-msm@vger.kernel.org, dm-devel@redhat.com,
linux-raid@vger.kernel.org, linux-bcache@vger.kernel.org,
virtualization@lists.linux-foundation.org,
linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
linux-nfs@vger.kernel.org, linux-xfs@vger.kernel.org,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 24/29] mm: vmscan: make global slab shrink lockless
Date: Wed, 5 Jul 2023 11:27:28 +0800
Message-ID: <733af312-fb2d-3ec4-54c8-f154447c2051@bytedance.com>
In-Reply-To: <3efa68e0-b04f-5c11-4fe2-2db0784064fc@bytedance.com>

On 2023/7/4 11:45, Qi Zheng wrote:
>
>
> On 2023/7/4 00:39, Paul E. McKenney wrote:
>> On Fri, Jun 23, 2023 at 04:29:39PM +1000, Dave Chinner wrote:
>>> On Thu, Jun 22, 2023 at 05:12:02PM +0200, Vlastimil Babka wrote:
>>>> On 6/22/23 10:53, Qi Zheng wrote:
>>>>> @@ -1067,33 +1068,27 @@ static unsigned long shrink_slab(gfp_t gfp_mask, int nid,
>>>>> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
>>>>> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>>>>> - if (!down_read_trylock(&shrinker_rwsem))
>>>>> - goto out;
>>>>> -
>>>>> - list_for_each_entry(shrinker, &shrinker_list, list) {
>>>>> + rcu_read_lock();
>>>>> + list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
>>>>> struct shrink_control sc = {
>>>>> .gfp_mask = gfp_mask,
>>>>> .nid = nid,
>>>>> .memcg = memcg,
>>>>> };
>>>>> + if (!shrinker_try_get(shrinker))
>>>>> + continue;
>>>>> + rcu_read_unlock();
>>>>
>>>> I don't think you can do this unlock?
>>
>> Sorry to be slow to respond here, this one fell through the cracks.
>> And thank you to Qi for reminding me!
>>
>> If you do this unlock, you had jolly well better nail down the current
>> element (the one referenced by shrinker), for example, by acquiring an
>> explicit reference count on the object. And presumably this is exactly
>> what shrinker_try_get() is doing. And a look at your 24/29 confirms
>> this, at least assuming that shrinker->refcount is set to zero before
>> the call to synchronize_rcu() in free_module() *and* that
>> synchronize_rcu() doesn't start until *after* shrinker_put() calls
>> complete(). Plus, as always, the object must be removed from the list
>> before the synchronize_rcu() starts. (On these parts of the puzzle, I
>> defer to those more familiar with this code path. And I strongly
>> suggest carefully commenting this type of action-at-a-distance design
>> pattern.)
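For readers following along, here is a minimal userspace sketch of the
"nail down the current element" idea: a try-get that refuses to take a
reference once the count has reached zero, and a put that notices the
last drop. The names obj, obj_try_get() and obj_put() are hypothetical
stand-ins, not the helpers from the patch series, and C11 atomics are
used in place of the kernel's refcount_t. Treat it as an illustration
under those assumptions only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a refcounted shrinker. */
struct obj {
	atomic_int refcount;	/* starts at 1: the "initial" reference */
	bool released;		/* set by whoever drops the last reference */
};

static void obj_init(struct obj *o)
{
	atomic_init(&o->refcount, 1);
	o->released = false;
}

/* Take a reference unless the count has already reached zero. */
static bool obj_try_get(struct obj *o)
{
	int old = atomic_load(&o->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the last dropper marks the object as released. */
static void obj_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->refcount, 1);

	assert(old > 0);		/* a put without a matching get is a bug */
	if (old == 1)
		o->released = true;	/* kernel code might call complete() here */
}
```

Once the count has dropped to zero, obj_try_get() fails, so a list
walker that finds such an object simply skips it instead of reviving it.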
>
> Yeah, I think I've done it like above. A more detailed timing diagram is
> below.
>
>>
>> Why is this important? Because otherwise that object might be freed
>> before you get to the call to rcu_read_lock() at the end of this loop.
>> And if that happens, list_for_each_entry_rcu() will be walking the
>> freelist, which is quite bad for the health and well-being of your
>> kernel.
>>
>> There are a few other ways to make this sort of thing work:
>>
>> 1. Defer the shrinker_put() to the beginning of the loop.
>> You would need a flag initially set to zero, and then set to
>> one just before (or just after) the rcu_read_lock() above.
>> You would also need another shrinker_old pointer to track the
>> old pointer. Then at the top of the loop, if the flag is set,
>> invoke shrinker_put() on shrinker_old. This ensures that the
>> previous shrinker structure stays around long enough to allow
>> the loop to find the next shrinker structure in the list.
>>
>> This approach is attractive when the removal code path
>> can invoke shrinker_put() after the grace period ends.
>>
>> 2. Make shrinker_put() invoke call_rcu() when ->refcount reaches
>> zero, and have the callback function free the object. This of
>> course requires adding an rcu_head structure to the shrinker
>> structure, which might or might not be a reasonable course of
>> action. If adding that rcu_head is reasonable, this simplifies
>> the logic quite a bit.
>>
>> 3. For the shrinker-structure-removal code path, remove the shrinker
>> structure, then remove the initial count from ->refcount,
>> and then keep doing grace periods until ->refcount is zero,
>> then do one more. Of course, if the result of removing the
>> initial count was zero, then only a single additional grace
>> period is required.
>>
>> This would need to be carefully commented, as it is a bit
>> unconventional.
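As an illustration of option 1 above, here is a single-threaded
userspace sketch of a list walk that defers the put of the previous
element to the top of the next iteration. Everything in it (struct node,
try_get(), put(), walk_deferred_put()) is hypothetical, a NULL "old"
pointer doubles as the flag, and the RCU calls appear only as comments
because nothing here runs concurrently. It shows the bookkeeping, not a
proof of safety.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int refcount;		/* 0 means "being torn down, skip it" */
	struct node *next;
	int visits;		/* how often the loop body ran on this node */
};

static int try_get(struct node *n)
{
	if (n->refcount == 0)
		return 0;
	n->refcount++;
	return 1;
}

static void put(struct node *n)
{
	assert(n->refcount > 0);
	n->refcount--;
}

/* Returns the number of nodes the loop body ran on. */
static int walk_deferred_put(struct node *head)
{
	struct node *old = NULL;	/* previous node, still pinned */
	int visited = 0;

	/* rcu_read_lock() would go here */
	for (struct node *n = head; n; n = n->next) {
		if (old) {		/* deferred put of the previous node */
			put(old);
			old = NULL;
		}
		if (!try_get(n))
			continue;
		/* rcu_read_unlock(): the body below may sleep */
		n->visits++;		/* stand-in for do_shrink_slab() */
		visited++;
		/* rcu_read_lock() again before touching the list */
		old = n;		/* keep n pinned until we have stepped past it */
	}
	if (old)
		put(old);
	/* rcu_read_unlock() */
	return visited;
}
```

The point of the deferral is that n->next is read while n is still
pinned, so the walk never steps through an element that could already
have been freed.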
>
> Thanks for such a detailed addition!
>
>>
>> There are probably many other ways, but just to give an idea of a few
>> other ways to do this.
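Option 2 above can be modeled in userspace as follows. This is only a
sketch under loud assumptions: struct cb_head is a hypothetical stand-in
for rcu_head, defer_call() stands in for call_rcu(), and
grace_period_elapsed() pretends that a grace period has ended by running
the queued callbacks. None of this is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rcu_head: one queued callback. */
struct cb_head {
	struct cb_head *next;
	void (*func)(struct cb_head *);
};

struct obj {
	int refcount;
	bool freed;
	struct cb_head rcu;	/* the extra field option 2 requires */
};

static struct cb_head *pending;	/* callbacks waiting for a grace period */

/* Stand-in for call_rcu(): queue the callback, do not run it yet. */
static void defer_call(struct cb_head *h, void (*func)(struct cb_head *))
{
	h->func = func;
	h->next = pending;
	pending = h;
}

/* Pretend a grace period has ended: run every queued callback. */
static void grace_period_elapsed(void)
{
	while (pending) {
		struct cb_head *h = pending;

		pending = h->next;
		h->func(h);
	}
}

static void obj_free_cb(struct cb_head *h)
{
	/* Recover the enclosing object, as container_of() would. */
	struct obj *o = (struct obj *)((char *)h - offsetof(struct obj, rcu));

	o->freed = true;	/* real code would free the memory here */
}

/* The final put only queues the free; readers may still hold pointers. */
static void obj_put(struct obj *o)
{
	assert(o->refcount > 0);
	if (--o->refcount == 0)
		defer_call(&o->rcu, obj_free_cb);
}
```

With this shape the object outlives the last put by one grace period, so
a walker that is still inside its read-side section can safely read the
next pointer.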
>>
>>>>> +
>>>>> ret = do_shrink_slab(&sc, shrinker, priority);
>>>>> if (ret == SHRINK_EMPTY)
>>>>> ret = 0;
>>>>> freed += ret;
>>>>> - /*
>>>>> - * Bail out if someone want to register a new shrinker to
>>>>> - * prevent the registration from being stalled for long periods
>>>>> - * by parallel ongoing shrinking.
>>>>> - */
>>>>> - if (rwsem_is_contended(&shrinker_rwsem)) {
>>>>> - freed = freed ? : 1;
>>>>> - break;
>>>>> - }
>>>>> - }
>>>>> - up_read(&shrinker_rwsem);
>>>>> -out:
>>>>> + rcu_read_lock();
>>>>
>>>> That new rcu_read_lock() won't help AFAIK, the whole
>>>> list_for_each_entry_rcu() needs to be under the single
>>>> rcu_read_lock() to be safe.
>>>
>>> Yeah, that's the pattern we've been taught and the one we can look
>>> at and immediately say "this is safe".
>>>
>>> This is a different pattern, as has been explained by Qi, and I
>>> think it *might* be safe.
>>>
>>> *However.*
>>>
>>> Right now I don't have time to go through a novel RCU list iteration
>>> pattern one step at a time to determine the correctness of the
>>> algorithm. I'm mostly worried about list manipulations that can
>>> occur outside rcu_read_lock() section bleeding into the RCU
>>> critical section because rcu_read_lock() by itself is not a memory
>>> barrier.
>>>
>>> Maybe Paul has seen this pattern often enough he could simply tell
>>> us what conditions it is safe in. But for me to work that out from
>>> first principles? I just don't have the time to do that right now.
>>
>> If the code does just the right sequence of things on the removal path
>> (remove, decrement reference, wait for reference to go to zero, wait for
>> grace period, free), then it would work. If this is what is happening,
>> I would argue for more comments. ;-)
>
> The order of the removal path is slightly different from this:
>
> shrink_slab                 unregister_shrinker
> ===========                 ===================
>
> shrinker_try_get()
> rcu_read_unlock()
>                             1. decrement initial reference
>                                shrinker_put()
>                             2. wait for reference to go to zero
>                                wait_for_completion()
> rcu_read_lock()
>
> shrinker_put()
>                             3. remove the shrinker from list
>                                list_del_rcu()
>                             4. wait for grace period
>                                kfree_rcu()/synchronize_rcu()
>
>
> list_for_each_entry()
>
> shrinker_try_get()
> rcu_read_unlock()
>                             5. free the shrinker
> So the order is: decrement reference, wait for reference to go to zero,
> remove, wait for grace period, free.
>
> I think this can work. And we can only do the *step 3* after we hold the
> RCU read lock again, right? Please let me know if I missed something.
Oh, you are right. It would be better to move step 3 to step 1. We
should first remove the shrinker from the shrinker_list to prevent
other traversers from finding it again; otherwise the following
situation could theoretically occur:
CPU 0                 CPU 1

shrinker_try_get()
                      shrinker_try_get()
shrinker_put()
shrinker_try_get()
                      shrinker_put()
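The interleaving above can be replayed in a small single-threaded model.
This is a sketch under assumptions: the calls are taken to alternate as
get (CPU 0), get (CPU 1), put (CPU 0), get (CPU 0), put (CPU 1), the
unregister path is assumed to drop the initial reference right after the
first two gets, and struct shr, find_and_get(), put() and replay() are
hypothetical names. The on_list flag models whether a traverser can
still find the shrinker.

```c
#include <assert.h>
#include <stdbool.h>

struct shr {
	int refcount;	/* starts at 1: the initial reference */
	bool on_list;	/* true while traversers can still find it */
};

/* A traverser can only pin what it can still find on the list. */
static bool find_and_get(struct shr *s)
{
	if (!s->on_list || s->refcount == 0)
		return false;
	s->refcount++;
	return true;
}

static void put(struct shr *s)
{
	assert(s->refcount > 0);
	s->refcount--;
}

/* Replay the interleaving; returns the refcount left at the end. */
static int replay(bool unlink_first)
{
	struct shr s = { .refcount = 1, .on_list = true };
	bool cpu0, cpu1;

	cpu0 = find_and_get(&s);	/* CPU 0: shrinker_try_get() */
	cpu1 = find_and_get(&s);	/* CPU 1: shrinker_try_get() */

	if (unlink_first)
		s.on_list = false;	/* list_del_rcu() before the put */
	put(&s);			/* unregister drops the initial reference */

	if (cpu0)
		put(&s);		/* CPU 0: shrinker_put() */
	cpu0 = find_and_get(&s);	/* CPU 0 comes around again */
	if (cpu1)
		put(&s);		/* CPU 1: shrinker_put() */

	return s.refcount;
}
```

In this model replay(false) ends with a reference still held, so a
waiter in unregister would keep waiting and the cycle could repeat,
while replay(true) reaches zero because the second lookup on CPU 0 no
longer finds the shrinker.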
Thanks,
Qi
>
> Thanks,
> Qi
>
>>
>> Thanx, Paul
>>
>>>> IIUC this is why Dave in [4] suggests unifying shrink_slab() with
>>>> shrink_slab_memcg(), as the latter doesn't iterate the list but uses
>>>> IDR.
>>>
>>> Yes, I suggested the IDR route because radix tree lookups under RCU
>>> with reference counted objects are a known safe pattern that we can
>>> easily confirm is correct or not. Hence I suggested the unification
>>> + IDR route because it makes the life of reviewers so, so much
>>> easier...
>>>
>>> Cheers,
>>>
>>> Dave.
>>> --
>>> Dave Chinner
>>> david@fromorbit.com