* [bug report] mm/zswap :memory corruption after zswap_load().
@ 2024-03-21  4:34 Zhongkun He
  2024-03-21  4:42 ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Zhongkun He @ 2024-03-21  4:34 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed, Andrew Morton
  Cc: linux-mm, wuyun.abel, zhouchengming, Nhat Pham

Hey folks,

Recently, I tested zswap with memory reclaiming on the mainline kernel
(6.8) and found a memory corruption issue related to exclusive loads.


root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
stress: WARN: [31753] (396) now reaping child worker processes
stress: FAIL: [31753] (451) failed run completed in 14s
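
For context, the "memory corruption" FAIL above is stress's own data-integrity
check tripping: the vm worker fills its buffer with a known byte, sleeps for
--vm-hang seconds (long enough for the pages to be reclaimed), then re-reads
the buffer. A minimal sketch of that loop (based on my reading of stress 1.x,
simplified, not the exact source):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	size_t bytes = 1UL << 30;		/* --vm-bytes 1g */
	size_t stride = 4096;			/* one byte per page */
	char *buf = malloc(bytes);

	if (!buf)
		return 1;
	for (;;) {				/* --vm-keep: reuse the buffer */
		for (size_t i = 0; i < bytes; i += stride)
			buf[i] = 'Z';
		sleep(3);			/* --vm-hang 3: pages get reclaimed here */
		for (size_t i = 0; i < bytes; i += stride) {
			if (buf[i] != 'Z') {	/* page came back with wrong data */
				fprintf(stderr, "memory corruption at: %p\n",
					(void *)(buf + i));
				return 1;
			}
		}
	}
}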


1. Test steps (the frequency of memory reclaiming has been accelerated):
-------------------------
a. set up zswap, zram and cgroup v2
b. echo 0 > /sys/kernel/mm/lru_gen/enabled
      (increases the probability of the problem occurring)
c.  mkdir /sys/fs/cgroup/zz
     echo $$  > /sys/fs/cgroup/zz/cgroup.procs
     cd  /sys/fs/cgroup/zz/
     stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep

e. in another shell:
   while :; do
       for i in {1..5}; do
           echo 20g > /sys/fs/cgroup/zz/memory.reclaim &
       done
       sleep 1
   done

2. Root cause:
--------------------------
With a small probability, the page fault occurs twice with the original pte,
even if a new pte has been successfully set. Unfortunately, with exclusive
loads the zswap_entry has already been released during the first page fault,
so zswap_load() fails on the second one, and since there is no corresponding
data in the swap space either, memory corruption occurs.
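
To make the window clearer, here is a heavily trimmed sketch of the
swapcache-bypass path in do_swap_page() (mm/memory.c), as I read the 6.8
code; this is not the actual source, error handling and locking are omitted:

	/* Swap slot has a single user, so the swapcache is skipped and the
	 * data is read into a private folio. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1) {
		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
					vmf->address, false);
		/* ... */
		swap_read_folio(folio, true, NULL);
		/* zswap_load() runs in here; with exclusive loads it also
		 * invalidates the zswap_entry, so this private folio now
		 * holds the only copy of the data. */
		/* ... */
		/* If the fault bails out before the new pte takes effect,
		 * the folio is dropped, the pte still points at the old swap
		 * entry, and the retried fault finds the data neither in
		 * zswap nor on the swap device. */
	}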

bpftrace -e 'k:zswap_load {
        printf("%lld, %lld\n", ((struct page *)arg0)->private, nsecs)
    }' --include linux/mm_types.h > a.txt

Look for the same index appearing twice:

index            nsecs
1318876, 8976040736819
1318876, 8976040746078

4123110, 8976234682970
4123110, 8976234689736

2268896, 8976660124792
2268896, 8976660130607

4634105, 8976662117938
4634105, 8976662127596
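
(A trivial filter over a.txt, illustrative only; it assumes, as in the
excerpts above, that the two loads of the same offset end up on adjacent
lines:)

#include <stdio.h>

int main(void)
{
	long long off, ns, prev_off = -1, prev_ns = 0;

	/* each input line is "offset, nsecs" as printed by the bpftrace above */
	while (scanf("%lld, %lld", &off, &ns) == 2) {
		if (off == prev_off)
			printf("%lld loaded twice, %lld ns apart\n",
			       off, ns - prev_ns);
		prev_off = off;
		prev_ns = ns;
	}
	return 0;
}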

3. Solution
--------------------------
Should we free zswap_entry objects in batches, so that the zswap_entry is
still valid when the next page fault occurs with the original pte? It would
be great if there are other, better solutions.



* Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  4:34 [bug report] mm/zswap :memory corruption after zswap_load() Zhongkun He
@ 2024-03-21  4:42 ` Chengming Zhou
  2024-03-21  5:09   ` [External] " Zhongkun He
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2024-03-21  4:42 UTC (permalink / raw)
  To: Zhongkun He, Johannes Weiner, Yosry Ahmed, Andrew Morton
  Cc: linux-mm, wuyun.abel, zhouchengming, Nhat Pham

On 2024/3/21 12:34, Zhongkun He wrote:
> Hey folks,
> 
> Recently, I tested the zswap with memory reclaiming in the mainline
> (6.8) and found a memory corruption issue related to exclusive loads.

Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
This fix avoids concurrent swapin using the same swap entry.

Thanks.

> 
> 
> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
> stress: WARN: [31753] (396) now reaping child worker processes
> stress: FAIL: [31753] (451) failed run completed in 14s
> 
> 
> 1. Test step(the frequency of memory reclaiming has been accelerated):
> -------------------------
> a. set up the zswap, zram and cgroup V2
> b. echo 0 > /sys/kernel/mm/lru_gen/enabled
>       (Increase the probability of problems occurring)
> c.  mkdir /sys/fs/cgroup/zz
>      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
>      cd  /sys/fs/cgroup/zz/
>      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> 
> e. in other shell:
>    while :;do for i in {1..5};do echo 20g >
> /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
> 
> 2. Root cause:
> --------------------------
> With a small probability, the page fault will occur twice with the
> original pte, even if a new pte has been successfully set.
> Unfortunately, zswap_entry has been released during the first page fault
> with exclusive loads, so zswap_load will fail, and there is no corresponding
> data in swap space, memory corruption occurs.
> 
> bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
> *)arg0)->private,nsecs)}'
> --include linux/mm_types.h  > a.txt
> 
> look up the same index:
> 
> index            nsecs
> 1318876, 8976040736819
> 1318876, 8976040746078
> 
> 4123110, 8976234682970
> 4123110, 8976234689736
> 
> 2268896, 8976660124792
> 2268896, 8976660130607
> 
> 4634105, 8976662117938
> 4634105, 8976662127596
> 
> 3. Solution
> 
> Should we free zswap_entry in batches so that zswap_entry will be
> valid when the next page fault occurs with the
> original pte? It would be great if there are other better solutions.
> 




* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  4:42 ` Chengming Zhou
@ 2024-03-21  5:09   ` Zhongkun He
  2024-03-21  5:24     ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Zhongkun He @ 2024-03-21  5:09 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming, Nhat Pham

On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
<chengming.zhou@linux.dev> wrote:
>
> On 2024/3/21 12:34, Zhongkun He wrote:
> > Hey folks,
> >
> > Recently, I tested the zswap with memory reclaiming in the mainline
> > (6.8) and found a memory corruption issue related to exclusive loads.
>
> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> This fix avoids concurrent swapin using the same swap entry.
>

Yes. This fix avoids concurrent swapin from different CPUs, but the
reported issue occurs on the same CPU.

Thanks.

> Thanks.
>
> >
> >
> > root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> > stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
> > stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
> > stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
> > stress: WARN: [31753] (396) now reaping child worker processes
> > stress: FAIL: [31753] (451) failed run completed in 14s
> >
> >
> > 1. Test step(the frequency of memory reclaiming has been accelerated):
> > -------------------------
> > a. set up the zswap, zram and cgroup V2
> > b. echo 0 > /sys/kernel/mm/lru_gen/enabled
> >       (Increase the probability of problems occurring)
> > c.  mkdir /sys/fs/cgroup/zz
> >      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
> >      cd  /sys/fs/cgroup/zz/
> >      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >
> > e. in other shell:
> >    while :;do for i in {1..5};do echo 20g >
> > /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
> >
> > 2. Root cause:
> > --------------------------
> > With a small probability, the page fault will occur twice with the
> > original pte, even if a new pte has been successfully set.
> > Unfortunately, zswap_entry has been released during the first page fault
> > with exclusive loads, so zswap_load will fail, and there is no corresponding
> > data in swap space, memory corruption occurs.
> >
> > bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
> > *)arg0)->private,nsecs)}'
> > --include linux/mm_types.h  > a.txt
> >
> > look up the same index:
> >
> > index            nsecs
> > 1318876, 8976040736819
> > 1318876, 8976040746078
> >
> > 4123110, 8976234682970
> > 4123110, 8976234689736
> >
> > 2268896, 8976660124792
> > 2268896, 8976660130607
> >
> > 4634105, 8976662117938
> > 4634105, 8976662127596
> >
> > 3. Solution
> >
> > Should we free zswap_entry in batches so that zswap_entry will be
> > valid when the next page fault occurs with the
> > original pte? It would be great if there are other better solutions.
> >
>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  5:09   ` [External] " Zhongkun He
@ 2024-03-21  5:24     ` Chengming Zhou
  2024-03-21  6:36       ` Zhongkun He
  0 siblings, 1 reply; 29+ messages in thread
From: Chengming Zhou @ 2024-03-21  5:24 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming, Nhat Pham

On 2024/3/21 13:09, Zhongkun He wrote:
> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/3/21 12:34, Zhongkun He wrote:
>>> Hey folks,
>>>
>>> Recently, I tested the zswap with memory reclaiming in the mainline
>>> (6.8) and found a memory corruption issue related to exclusive loads.
>>
>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>> This fix avoids concurrent swapin using the same swap entry.
>>
> 
> Yes, This fix avoids concurrent swapin from different cpu, but the
> reported issue occurs
> on the same cpu.

I think you may have misunderstood the race description in that fix's
changelog: CPU0 and CPU1 just mean two concurrent threads, not two physical CPUs.

Could you verify if the problem still exists with this fix?

> 
> Thanks.
> 
>> Thanks.
>>
>>>
>>>
>>> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
>>> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
>>> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
>>> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
>>> stress: WARN: [31753] (396) now reaping child worker processes
>>> stress: FAIL: [31753] (451) failed run completed in 14s
>>>
>>>
>>> 1. Test step(the frequency of memory reclaiming has been accelerated):
>>> -------------------------
>>> a. set up the zswap, zram and cgroup V2
>>> b. echo 0 > /sys/kernel/mm/lru_gen/enabled
>>>       (Increase the probability of problems occurring)
>>> c.  mkdir /sys/fs/cgroup/zz
>>>      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
>>>      cd  /sys/fs/cgroup/zz/
>>>      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
>>>
>>> e. in other shell:
>>>    while :;do for i in {1..5};do echo 20g >
>>> /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
>>>
>>> 2. Root cause:
>>> --------------------------
>>> With a small probability, the page fault will occur twice with the
>>> original pte, even if a new pte has been successfully set.
>>> Unfortunately, zswap_entry has been released during the first page fault
>>> with exclusive loads, so zswap_load will fail, and there is no corresponding
>>> data in swap space, memory corruption occurs.
>>>
>>> bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
>>> *)arg0)->private,nsecs)}'
>>> --include linux/mm_types.h  > a.txt
>>>
>>> look up the same index:
>>>
>>> index            nsecs
>>> 1318876, 8976040736819
>>> 1318876, 8976040746078
>>>
>>> 4123110, 8976234682970
>>> 4123110, 8976234689736
>>>
>>> 2268896, 8976660124792
>>> 2268896, 8976660130607
>>>
>>> 4634105, 8976662117938
>>> 4634105, 8976662127596
>>>
>>> 3. Solution
>>>
>>> Should we free zswap_entry in batches so that zswap_entry will be
>>> valid when the next page fault occurs with the
>>> original pte? It would be great if there are other better solutions.
>>>
>>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  5:24     ` Chengming Zhou
@ 2024-03-21  6:36       ` Zhongkun He
  2024-03-21  9:28         ` Chengming Zhou
  0 siblings, 1 reply; 29+ messages in thread
From: Zhongkun He @ 2024-03-21  6:36 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming, Nhat Pham

On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/3/21 13:09, Zhongkun He wrote:
> > On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > <chengming.zhou@linux.dev> wrote:
> >>
> >> On 2024/3/21 12:34, Zhongkun He wrote:
> >>> Hey folks,
> >>>
> >>> Recently, I tested the zswap with memory reclaiming in the mainline
> >>> (6.8) and found a memory corruption issue related to exclusive loads.
> >>
> >> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >> This fix avoids concurrent swapin using the same swap entry.
> >>
> >
> > Yes, This fix avoids concurrent swapin from different cpu, but the
> > reported issue occurs
> > on the same cpu.
>
> I think you may misunderstand the race description in this fix changelog,
> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
>
> Could you verify if the problem still exists with this fix?

Yes, I'm sure the problem still exists with this patch.
Here is some debug info (the kernel is not mainline):

bpftrace -e 'k:swap_readpage {
        printf("%lld, %lld, %ld, %ld, %ld\n%s",
               ((struct page *)arg0)->private, nsecs, tid, pid, cpu, kstack)
    }' --include linux/mm_types.h

offset nsecs tid pid cpu
2140659, 595771411052,15045,15045,6
swap_readpage+1
do_swap_page+2135
handle_mm_fault+2426
do_user_addr_fault+462
do_page_fault+48
async_page_fault+62

offset nsecs tid pid cpu
2140659, 595771424445,15045,15045,6
swap_readpage+1
do_swap_page+2135
handle_mm_fault+2426
do_user_addr_fault+462
do_page_fault+48
async_page_fault+62

-------------------------------
There are two page faults with the same tid and offset within 13393 nsecs.

>
> >
> > Thanks.
> >
> >> Thanks.
> >>
> >>>
> >>>
> >>> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
> >>> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
> >>> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
> >>> stress: WARN: [31753] (396) now reaping child worker processes
> >>> stress: FAIL: [31753] (451) failed run completed in 14s
> >>>
> >>>
> >>> 1. Test step(the frequency of memory reclaiming has been accelerated):
> >>> -------------------------
> >>> a. set up the zswap, zram and cgroup V2
> >>> b. echo 0 > /sys/kernel/mm/lru_gen/enabled
> >>>       (Increase the probability of problems occurring)
> >>> c.  mkdir /sys/fs/cgroup/zz
> >>>      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
> >>>      cd  /sys/fs/cgroup/zz/
> >>>      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>>
> >>> e. in other shell:
> >>>    while :;do for i in {1..5};do echo 20g >
> >>> /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
> >>>
> >>> 2. Root cause:
> >>> --------------------------
> >>> With a small probability, the page fault will occur twice with the
> >>> original pte, even if a new pte has been successfully set.
> >>> Unfortunately, zswap_entry has been released during the first page fault
> >>> with exclusive loads, so zswap_load will fail, and there is no corresponding
> >>> data in swap space, memory corruption occurs.
> >>>
> >>> bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
> >>> *)arg0)->private,nsecs)}'
> >>> --include linux/mm_types.h  > a.txt
> >>>
> >>> look up the same index:
> >>>
> >>> index            nsecs
> >>> 1318876, 8976040736819
> >>> 1318876, 8976040746078
> >>>
> >>> 4123110, 8976234682970
> >>> 4123110, 8976234689736
> >>>
> >>> 2268896, 8976660124792
> >>> 2268896, 8976660130607
> >>>
> >>> 4634105, 8976662117938
> >>> 4634105, 8976662127596
> >>>
> >>> 3. Solution
> >>>
> >>> Should we free zswap_entry in batches so that zswap_entry will be
> >>> valid when the next page fault occurs with the
> >>> original pte? It would be great if there are other better solutions.
> >>>
> >>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  6:36       ` Zhongkun He
@ 2024-03-21  9:28         ` Chengming Zhou
  2024-03-21 15:25           ` Nhat Pham
  2024-03-22  3:04           ` Zhongkun He
  0 siblings, 2 replies; 29+ messages in thread
From: Chengming Zhou @ 2024-03-21  9:28 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming, Nhat Pham

On 2024/3/21 14:36, Zhongkun He wrote:
> On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>
>> On 2024/3/21 13:09, Zhongkun He wrote:
>>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
>>> <chengming.zhou@linux.dev> wrote:
>>>>
>>>> On 2024/3/21 12:34, Zhongkun He wrote:
>>>>> Hey folks,
>>>>>
>>>>> Recently, I tested the zswap with memory reclaiming in the mainline
>>>>> (6.8) and found a memory corruption issue related to exclusive loads.
>>>>
>>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>>>> This fix avoids concurrent swapin using the same swap entry.
>>>>
>>>
>>> Yes, This fix avoids concurrent swapin from different cpu, but the
>>> reported issue occurs
>>> on the same cpu.
>>
>> I think you may misunderstand the race description in this fix changelog,
>> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
>>
>> Could you verify if the problem still exists with this fix?
> 
> Yes,I'm sure the problem still exists with this patch.
> There is some debug info, not mainline.
> 
> bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> linux/mm_types.h

Ok, this problem seems to only happen on SWP_SYNCHRONOUS_IO swap backends,
which currently include zram, ramdisk, pmem and nvdimm.

Maybe it's not a good idea to use zswap on these swap backends?

The problem here is that the page fault handler tries to skip the swapcache
to swap in the folio (swap entry count == 1), but then it can't install the
folio into the pte entry because something changed in the meantime, such as
a concurrent fork referencing the entry.

Maybe we should write that folio back in this special case.

> 
> offset nsecs tid pid cpu
> 2140659, 595771411052,15045,15045,6
> swap_readpage+1
> do_swap_page+2135
> handle_mm_fault+2426
> do_user_addr_fault+462
> do_page_fault+48
> async_page_fault+62
> 
> offset nsecs tid pid cpu
> 2140659, 595771424445,15045,15045,6
> swap_readpage+1
> do_swap_page+2135
> handle_mm_fault+2426
> do_user_addr_fault+462
> do_page_fault+48
> async_page_fault+62
> 
> -------------------------------
> There are two page faults with the same tid and offset  in 13393 nsecs.
> 
>>
>>>
>>> Thanks.
>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>
>>>>> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
>>>>> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
>>>>> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
>>>>> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
>>>>> stress: WARN: [31753] (396) now reaping child worker processes
>>>>> stress: FAIL: [31753] (451) failed run completed in 14s
>>>>>
>>>>>
>>>>> 1. Test step(the frequency of memory reclaiming has been accelerated):
>>>>> -------------------------
>>>>> a. set up the zswap, zram and cgroup V2
>>>>> b. echo 0 > /sys/kernel/mm/lru_gen/enabled
>>>>>       (Increase the probability of problems occurring)
>>>>> c.  mkdir /sys/fs/cgroup/zz
>>>>>      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
>>>>>      cd  /sys/fs/cgroup/zz/
>>>>>      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
>>>>>
>>>>> e. in other shell:
>>>>>    while :;do for i in {1..5};do echo 20g >
>>>>> /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
>>>>>
>>>>> 2. Root cause:
>>>>> --------------------------
>>>>> With a small probability, the page fault will occur twice with the
>>>>> original pte, even if a new pte has been successfully set.
>>>>> Unfortunately, zswap_entry has been released during the first page fault
>>>>> with exclusive loads, so zswap_load will fail, and there is no corresponding
>>>>> data in swap space, memory corruption occurs.
>>>>>
>>>>> bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
>>>>> *)arg0)->private,nsecs)}'
>>>>> --include linux/mm_types.h  > a.txt
>>>>>
>>>>> look up the same index:
>>>>>
>>>>> index            nsecs
>>>>> 1318876, 8976040736819
>>>>> 1318876, 8976040746078
>>>>>
>>>>> 4123110, 8976234682970
>>>>> 4123110, 8976234689736
>>>>>
>>>>> 2268896, 8976660124792
>>>>> 2268896, 8976660130607
>>>>>
>>>>> 4634105, 8976662117938
>>>>> 4634105, 8976662127596
>>>>>
>>>>> 3. Solution
>>>>>
>>>>> Should we free zswap_entry in batches so that zswap_entry will be
>>>>> valid when the next page fault occurs with the
>>>>> original pte? It would be great if there are other better solutions.
>>>>>
>>>>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  9:28         ` Chengming Zhou
@ 2024-03-21 15:25           ` Nhat Pham
  2024-03-21 18:32             ` Yosry Ahmed
  2024-03-22  3:16             ` Zhongkun He
  2024-03-22  3:04           ` Zhongkun He
  1 sibling, 2 replies; 29+ messages in thread
From: Nhat Pham @ 2024-03-21 15:25 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Zhongkun He, Johannes Weiner, Yosry Ahmed, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming

On Thu, Mar 21, 2024 at 2:28 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/3/21 14:36, Zhongkun He wrote:
> > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >>
> >> On 2024/3/21 13:09, Zhongkun He wrote:
> >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> >>> <chengming.zhou@linux.dev> wrote:
> >>>>
> >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> >>>>> Hey folks,
> >>>>>
> >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> >>>>
> >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >>>> This fix avoids concurrent swapin using the same swap entry.
> >>>>
> >>>
> >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> >>> reported issue occurs
> >>> on the same cpu.
> >>
> >> I think you may misunderstand the race description in this fix changelog,
> >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> >>
> >> Could you verify if the problem still exists with this fix?
> >
> > Yes,I'm sure the problem still exists with this patch.
> > There is some debug info, not mainline.
> >
> > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > linux/mm_types.h
>
> Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> which now include zram, ramdisk, pmem, nvdimm.
>
> It maybe not good to use zswap on these swap backends?

My gut reaction is to say yes, but I'll refrain from making sweeping
statements about backends I'm not too familiar with. Let's see:

1. zram: I don't even know why we're putting a compressed cache... in
front of a compressed faux swap device? Ramdisk == other in-memory
swap backend right?
2. I looked it up, and it seemed SWP_SYNCHRONOUS_IO was introduced for
fast swap storage (see the original patch series [1]). If this is the
case, one could argue there are diminishing returns for applying zswap
on top of this.

[1]: https://lore.kernel.org/linux-mm/1505886205-9671-1-git-send-email-minchan@kernel.org/

>
> The problem here is the page fault handler tries to skip swapcache to
> swapin the folio (swap entry count == 1), but then it can't install folio
> to pte entry since some changes happened such as concurrent fork of entry.
>
> Maybe we should writeback that folio in this special case.

But yes, if this is simple maybe we can do this first to fix the bug?

>
> >
> > offset nsecs tid pid cpu
> > 2140659, 595771411052,15045,15045,6
> > swap_readpage+1
> > do_swap_page+2135
> > handle_mm_fault+2426
> > do_user_addr_fault+462
> > do_page_fault+48
> > async_page_fault+62
> >
> > offset nsecs tid pid cpu
> > 2140659, 595771424445,15045,15045,6
> > swap_readpage+1
> > do_swap_page+2135
> > handle_mm_fault+2426
> > do_user_addr_fault+462
> > do_page_fault+48
> > async_page_fault+62
> >
> > -------------------------------
> > There are two page faults with the same tid and offset  in 13393 nsecs.
> >
> >>
> >>>
> >>> Thanks.
> >>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>>
> >>>>> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>>>> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
> >>>>> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
> >>>>> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
> >>>>> stress: WARN: [31753] (396) now reaping child worker processes
> >>>>> stress: FAIL: [31753] (451) failed run completed in 14s
> >>>>>
> >>>>>
> >>>>> 1. Test step(the frequency of memory reclaiming has been accelerated):
> >>>>> -------------------------
> >>>>> a. set up the zswap, zram and cgroup V2
> >>>>> b. echo 0 > /sys/kernel/mm/lru_gen/enabled
> >>>>>       (Increase the probability of problems occurring)
> >>>>> c.  mkdir /sys/fs/cgroup/zz
> >>>>>      echo $$  > /sys/fs/cgroup/zz/cgroup.procs
> >>>>>      cd  /sys/fs/cgroup/zz/
> >>>>>      stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>>>>
> >>>>> e. in other shell:
> >>>>>    while :;do for i in {1..5};do echo 20g >
> >>>>> /sys/fs/cgroup/zz/memory.reclaim & done;sleep 1;done
> >>>>>
> >>>>> 2. Root cause:
> >>>>> --------------------------
> >>>>> With a small probability, the page fault will occur twice with the
> >>>>> original pte, even if a new pte has been successfully set.
> >>>>> Unfortunately, zswap_entry has been released during the first page fault
> >>>>> with exclusive loads, so zswap_load will fail, and there is no corresponding
> >>>>> data in swap space, memory corruption occurs.
> >>>>>
> >>>>> bpftrace -e'k:zswap_load {printf("%lld, %lld\n", ((struct page
> >>>>> *)arg0)->private,nsecs)}'
> >>>>> --include linux/mm_types.h  > a.txt
> >>>>>
> >>>>> look up the same index:
> >>>>>
> >>>>> index            nsecs
> >>>>> 1318876, 8976040736819
> >>>>> 1318876, 8976040746078
> >>>>>
> >>>>> 4123110, 8976234682970
> >>>>> 4123110, 8976234689736
> >>>>>
> >>>>> 2268896, 8976660124792
> >>>>> 2268896, 8976660130607
> >>>>>
> >>>>> 4634105, 8976662117938
> >>>>> 4634105, 8976662127596
> >>>>>
> >>>>> 3. Solution
> >>>>>
> >>>>> Should we free zswap_entry in batches so that zswap_entry will be
> >>>>> valid when the next page fault occurs with the
> >>>>> original pte? It would be great if there are other better solutions.
> >>>>>
> >>>>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21 15:25           ` Nhat Pham
@ 2024-03-21 18:32             ` Yosry Ahmed
  2024-03-22  3:27               ` Chengming Zhou
  2024-03-22  3:16             ` Zhongkun He
  1 sibling, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-21 18:32 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Chengming Zhou, Zhongkun He, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming

On Thu, Mar 21, 2024 at 08:25:26AM -0700, Nhat Pham wrote:
> On Thu, Mar 21, 2024 at 2:28 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >
> > On 2024/3/21 14:36, Zhongkun He wrote:
> > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > >>
> > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > >>> <chengming.zhou@linux.dev> wrote:
> > >>>>
> > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > >>>>> Hey folks,
> > >>>>>
> > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > >>>>
> > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >>>> This fix avoids concurrent swapin using the same swap entry.
> > >>>>
> > >>>
> > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > >>> reported issue occurs
> > >>> on the same cpu.
> > >>
> > >> I think you may misunderstand the race description in this fix changelog,
> > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > >>
> > >> Could you verify if the problem still exists with this fix?
> > >
> > > Yes,I'm sure the problem still exists with this patch.
> > > There is some debug info, not mainline.
> > >
> > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > linux/mm_types.h
> >
> > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > which now include zram, ramdisk, pmem, nvdimm.
> >
> > It maybe not good to use zswap on these swap backends?
> 
> My gut reaction is to say yes, but I'll refrain from making sweeping
> statements about backends I'm not too familiar with. Let's see:
> 
> 1. zram: I don't even know why we're putting a compressed cache... in
> front of a compressed faux swap device? Ramdisk == other in-memory
> swap backend right?

I personally use it for testing because it's easy, but I doubt any prod
setups actually do that. That being said, I don't think we need to
disable zswap completely for these swap backends just to address this
bug.

> 2. I looked it up, and it seemed SWP_SYNCHRONOUS_IO was introduced for
> fast swap storage (see the original patch series [1]). If this is the
> case, one could argue there are diminishing returns for applying zswap
> on top of this.
> 
> [1]: https://lore.kernel.org/linux-mm/1505886205-9671-1-git-send-email-minchan@kernel.org/
> 
> >
> > The problem here is the page fault handler tries to skip swapcache to
> > swapin the folio (swap entry count == 1), but then it can't install folio
> > to pte entry since some changes happened such as concurrent fork of entry.
> >
> > Maybe we should writeback that folio in this special case.
> 
> But yes, if this is simple maybe we can do this first to fix the bug?

Can we just enforce using the swapcache if zswap is in use? We cannot
simply check whether zswap is enabled, because it could be the case that we
stored some pages into zswap and then disabled it.

Perhaps we could keep track of whether zswap was ever enabled, or whether
any pages were ever stored in zswap, and skip the swapcache-bypass
optimization in that case?
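
For illustration, that could look roughly like the following (a hypothetical
sketch, not a patch; the flag and helper names are made up):

/* mm/zswap.c */
static bool zswap_ever_used;		/* set once, never cleared */

bool zswap_was_ever_used(void)
{
	return READ_ONCE(zswap_ever_used);
}

/* in zswap_store(), after a successful store: */
	WRITE_ONCE(zswap_ever_used, true);

/* mm/memory.c, do_swap_page(): only skip the swapcache when zswap can
 * never have held this entry. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
	    __swap_count(entry) == 1 &&
	    !zswap_was_ever_used()) {
		/* skip swapcache ... */
	}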



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21  9:28         ` Chengming Zhou
  2024-03-21 15:25           ` Nhat Pham
@ 2024-03-22  3:04           ` Zhongkun He
  2024-03-22 19:34             ` Yosry Ahmed
  1 sibling, 1 reply; 29+ messages in thread
From: Zhongkun He @ 2024-03-22  3:04 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming, Nhat Pham

On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> On 2024/3/21 14:36, Zhongkun He wrote:
> > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >>
> >> On 2024/3/21 13:09, Zhongkun He wrote:
> >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> >>> <chengming.zhou@linux.dev> wrote:
> >>>>
> >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> >>>>> Hey folks,
> >>>>>
> >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> >>>>
> >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >>>> This fix avoids concurrent swapin using the same swap entry.
> >>>>
> >>>
> >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> >>> reported issue occurs
> >>> on the same cpu.
> >>
> >> I think you may misunderstand the race description in this fix changelog,
> >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> >>
> >> Could you verify if the problem still exists with this fix?
> >
> > Yes,I'm sure the problem still exists with this patch.
> > There is some debug info, not mainline.
> >
> > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > linux/mm_types.h
>
> Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> which now include zram, ramdisk, pmem, nvdimm.

Yes.

>
> It maybe not good to use zswap on these swap backends?
>
> The problem here is the page fault handler tries to skip swapcache to
> swapin the folio (swap entry count == 1), but then it can't install folio
> to pte entry since some changes happened such as concurrent fork of entry.
>

The first page fault returned VM_FAULT_RETRY because
folio_lock_or_retry() failed.

> Maybe we should writeback that folio in this special case.
>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21 15:25           ` Nhat Pham
  2024-03-21 18:32             ` Yosry Ahmed
@ 2024-03-22  3:16             ` Zhongkun He
  1 sibling, 0 replies; 29+ messages in thread
From: Zhongkun He @ 2024-03-22  3:16 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Chengming Zhou, Johannes Weiner, Yosry Ahmed, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming

On Thu, Mar 21, 2024 at 11:25 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Mar 21, 2024 at 2:28 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >
> > On 2024/3/21 14:36, Zhongkun He wrote:
> > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > >>
> > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > >>> <chengming.zhou@linux.dev> wrote:
> > >>>>
> > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > >>>>> Hey folks,
> > >>>>>
> > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > >>>>
> > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >>>> This fix avoids concurrent swapin using the same swap entry.
> > >>>>
> > >>>
> > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > >>> reported issue occurs
> > >>> on the same cpu.
> > >>
> > >> I think you may misunderstand the race description in this fix changelog,
> > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > >>
> > >> Could you verify if the problem still exists with this fix?
> > >
> > > Yes,I'm sure the problem still exists with this patch.
> > > There is some debug info, not mainline.
> > >
> > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > linux/mm_types.h
> >
> > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > which now include zram, ramdisk, pmem, nvdimm.
> >
> > It maybe not good to use zswap on these swap backends?

Hi Nhat,

>
> My gut reaction is to say yes, but I'll refrain from making sweeping
> statements about backends I'm not too familiar with. Let's see:
>
> 1. zram: I don't even know why we're putting a compressed cache... in
> front of a compressed faux swap device? Ramdisk == other in-memory
> swap backend right?

It is currently just for testing, and will later be applied in production as
a temporary solution to prevent performance jitter.

> 2. I looked it up, and it seemed SWP_SYNCHRONOUS_IO was introduced for
> fast swap storage (see the original patch series [1]). If this is the
> case, one could argue there are diminishing returns for applying zswap
> on top of this.
>

sounds good.

> [1]: https://lore.kernel.org/linux-mm/1505886205-9671-1-git-send-email-minchan@kernel.org/
>
> >
> > The problem here is the page fault handler tries to skip swapcache to
> > swapin the folio (swap entry count == 1), but then it can't install folio
> > to pte entry since some changes happened such as concurrent fork of entry.
> >
> > Maybe we should writeback that folio in this special case.
>
> But yes, if this is simple maybe we can do this first to fix the bug?
>



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-21 18:32             ` Yosry Ahmed
@ 2024-03-22  3:27               ` Chengming Zhou
  0 siblings, 0 replies; 29+ messages in thread
From: Chengming Zhou @ 2024-03-22  3:27 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham
  Cc: Zhongkun He, Johannes Weiner, Andrew Morton, linux-mm, wuyun.abel,
	zhouchengming

On 2024/3/22 02:32, Yosry Ahmed wrote:
> On Thu, Mar 21, 2024 at 08:25:26AM -0700, Nhat Pham wrote:
>> On Thu, Mar 21, 2024 at 2:28 AM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>
>>> On 2024/3/21 14:36, Zhongkun He wrote:
>>>> On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>>>
>>>>> On 2024/3/21 13:09, Zhongkun He wrote:
>>>>>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
>>>>>> <chengming.zhou@linux.dev> wrote:
>>>>>>>
>>>>>>> On 2024/3/21 12:34, Zhongkun He wrote:
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> Recently, I tested the zswap with memory reclaiming in the mainline
>>>>>>>> (6.8) and found a memory corruption issue related to exclusive loads.
>>>>>>>
>>>>>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
>>>>>>> This fix avoids concurrent swapin using the same swap entry.
>>>>>>>
>>>>>>
>>>>>> Yes, This fix avoids concurrent swapin from different cpu, but the
>>>>>> reported issue occurs
>>>>>> on the same cpu.
>>>>>
>>>>> I think you may misunderstand the race description in this fix changelog,
>>>>> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
>>>>>
>>>>> Could you verify if the problem still exists with this fix?
>>>>
>>>> Yes,I'm sure the problem still exists with this patch.
>>>> There is some debug info, not mainline.
>>>>
>>>> bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
>>>> ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
>>>> linux/mm_types.h
>>>
>>> Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
>>> which now include zram, ramdisk, pmem, nvdimm.
>>>
>>> It maybe not good to use zswap on these swap backends?
>>
>> My gut reaction is to say yes, but I'll refrain from making sweeping
>> statements about backends I'm not too familiar with. Let's see:
>>
>> 1. zram: I don't even know why we're putting a compressed cache... in
>> front of a compressed faux swap device? Ramdisk == other in-memory
>> swap backend right?
> 
> I personally use it for testing because it's easy, but I doubt any prod
> setups actually do that. That being said, I don't think we need to
> disable zswap completely for these swap backends just to address this
> bug.

Right, agree! We'd better fix it.

> 
>> 2. I looked it up, and it seemed SWP_SYNCHRONOUS_IO was introduced for
>> fast swap storage (see the original patch series [1]). If this is the
>> case, one could argue there are diminishing returns for applying zswap
>> on top of this.
>>
>> [1]: https://lore.kernel.org/linux-mm/1505886205-9671-1-git-send-email-minchan@kernel.org/
>>
>>>
>>> The problem here is the page fault handler tries to skip swapcache to
>>> swapin the folio (swap entry count == 1), but then it can't install folio
>>> to pte entry since some changes happened such as concurrent fork of entry.
>>>
>>> Maybe we should writeback that folio in this special case.
>>
>> But yes, if this is simple maybe we can do this first to fix the bug?
> 
> Can we just enforce using the swapcache if zswap is in-use? We cannot
> simply check if zswap is enabled, because it could be the case that we
> stored some pages into zswap then disabled it.
> 
> Perhaps we could keep track of whether zswap was ever enabled or if any
> pages were ever stored in zswap, and skip the no swap cache optimization
> then?

Hmm, this way we would have to add something to the swap_info_struct, to
record whether it has ever used zswap or not.

Another way I can think of is to add that folio to the swapcache if we can't
install it successfully, so the next fault can find it there.




* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22  3:04           ` Zhongkun He
@ 2024-03-22 19:34             ` Yosry Ahmed
  2024-03-22 23:04               ` Barry Song
  2024-03-23  1:34               ` Zhongkun He
  0 siblings, 2 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-22 19:34 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, linux-mm,
	wuyun.abel, zhouchengming, Nhat Pham, Kairui Song, Minchan Kim,
	David Hildenbrand, Barry Song, Chris Li, Ying

On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> >
> > On 2024/3/21 14:36, Zhongkun He wrote:
> > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > >>
> > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > >>> <chengming.zhou@linux.dev> wrote:
> > >>>>
> > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > >>>>> Hey folks,
> > >>>>>
> > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > >>>>
> > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > >>>> This fix avoids concurrent swapin using the same swap entry.
> > >>>>
> > >>>
> > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > >>> reported issue occurs
> > >>> on the same cpu.
> > >>
> > >> I think you may misunderstand the race description in this fix changelog,
> > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > >>
> > >> Could you verify if the problem still exists with this fix?
> > >
> > > Yes,I'm sure the problem still exists with this patch.
> > > There is some debug info, not mainline.
> > >
> > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > linux/mm_types.h
> >
> > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > which now include zram, ramdisk, pmem, nvdimm.
>
> Yes.
>
> >
> > It maybe not good to use zswap on these swap backends?
> >
> > The problem here is the page fault handler tries to skip swapcache to
> > swapin the folio (swap entry count == 1), but then it can't install folio
> > to pte entry since some changes happened such as concurrent fork of entry.
> >
>
> The first page fault returned VM_FAULT_RETRY because
> folio_lock_or_retry() failed.

How so? The folio is newly allocated and not visible to any other
threads or CPUs. swap_read_folio() unlocks it and then returns and we
immediately try to lock it again with folio_lock_or_retry(). How does
this fail?

Let's go over what happens after swap_read_folio():
- The 'if (!folio)' code block will be skipped.
- folio_lock_or_retry() should succeed as I mentioned earlier.
- The 'if (swapcache)' code block will be skipped.
- The pte_same() check should succeed on first look because other
concurrent faulting threads should be held off by the newly introduced
swapcache_prepare() logic. But looking deeper I think this one may
fail due to a concurrent MADV_WILLNEED.
- The 'if (unlikely(!folio_test_uptodate(folio)))' part will be
skipped because swap_read_folio() marks the folio up-to-date.
- After that point there is no possible failure until we install the
pte, at which point concurrent faults will fail on !pte_same() and
retry.

So the only failure I think is possible is the pte_same() check. I see
how a concurrent MADV_WILLNEED could cause that check to fail. A
concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
the fault resolves it will go ahead and read the folio again into the
swapcache. It seems like we will end up with two copies of the same
folio? Maybe this is harmless because the folio in the swapcache will
never be used, but it is essentially leaked at that point, right?

I feel like I am missing something. Adding other folks that were
involved in the recent swapcache_prepare() synchronization thread.

Anyway, I agree that at least in theory the data corruption could
happen because of exclusive loads when skipping the swapcache, and we
should fix that.

Perhaps the right thing to do may be to write the folio again to zswap
before unlocking it and before calling swapcache_clear(). The need for
the write can be detected by checking if the folio is dirty; I think
this will only be true if the folio was loaded from zswap.
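
For illustration, a rough sketch of that idea (hypothetical and untested;
"pte_installed" is a made-up flag meaning the new pte was successfully set):

	/* do_swap_page(), swapcache-bypass path, before releasing the folio
	 * and before swapcache_clear(): if we are bailing out and the folio
	 * is dirty, it was exclusively loaded from zswap and is now the only
	 * copy of the data, so push it back out first. */
	if (need_clear_cache) {
		if (!pte_installed && folio_test_dirty(folio))
			zswap_store(folio);	/* or a regular writeback */
		swapcache_clear(si, entry);
	}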



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 19:34             ` Yosry Ahmed
@ 2024-03-22 23:04               ` Barry Song
  2024-03-22 23:08                 ` Yosry Ahmed
  2024-03-23  1:34               ` Zhongkun He
  1 sibling, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-03-22 23:04 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > >
> > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > >>
> > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > >>> <chengming.zhou@linux.dev> wrote:
> > > >>>>
> > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > >>>>> Hey folks,
> > > >>>>>
> > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > >>>>
> > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > >>>>
> > > >>>
> > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > >>> reported issue occurs
> > > >>> on the same cpu.
> > > >>
> > > >> I think you may misunderstand the race description in this fix changelog,
> > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > >>
> > > >> Could you verify if the problem still exists with this fix?
> > > >
> > > > Yes,I'm sure the problem still exists with this patch.
> > > > There is some debug info, not mainline.
> > > >
> > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > linux/mm_types.h
> > >
> > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > which now include zram, ramdisk, pmem, nvdimm.
> >
> > Yes.
> >
> > >
> > > It maybe not good to use zswap on these swap backends?
> > >
> > > The problem here is the page fault handler tries to skip swapcache to
> > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > to pte entry since some changes happened such as concurrent fork of entry.
> > >
> >
> > The first page fault returned VM_FAULT_RETRY because
> > folio_lock_or_retry() failed.
>
> How so? The folio is newly allocated and not visible to any other
> threads or CPUs. swap_read_folio() unlocks it and then returns and we
> immediately try to lock it again with folio_lock_or_retry(). How does
> this fail?
>
> Let's go over what happens after swap_read_folio():
> - The 'if (!folio)' code block will be skipped.
> - folio_lock_or_retry() should succeed as I mentioned earlier.
> - The 'if (swapcache)' code block will be skipped.
> - The pte_same() check should succeed on first look because other
> concurrent faulting threads should be held off by the newly introduced
> swapcache_prepare() logic. But looking deeper I think this one may
> fail due to a concurrent MADV_WILLNEED.
> - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> skipped because swap_read_folio() marks the folio up-to-date.
> - After that point there is no possible failure until we install the
> pte, at which point concurrent faults will fail on !pte_same() and
> retry.
>
> So the only failure I think is possible is the pte_same() check. I see
> how a concurrent MADV_WILLNEED could cause that check to fail. A
> concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> the fault resolves it will go ahead and read the folio again into the
> swapcache. It seems like we will end up with two copies of the same

But zswap has already freed the object by the time do_swap_page() finishes
swap_read_folio(), due to zswap's exclusive load feature?

So WILLNEED will get corrupted data and put it into the swapcache. Some other
concurrently forked process might then pick up the data WILLNEED put in the
swapcache when that newly forked process goes into do_swap_page().

So very likely a new process is forked right after do_swap_page() finishes
swap_read_folio() and before swapcache_clear().

> folio? Maybe this is harmless because the folio in the swacache will
> never be used, but it is essentially leaked at that point, right?
>
> I feel like I am missing something. Adding other folks that were
> involved in the recent swapcache_prepare() synchronization thread.
>
> Anyway, I agree that at least in theory the data corruption could
> happen because of exclusive loads when skipping the swapcache, and we
> should fix that.
>
> Perhaps the right thing to do may be to write the folio again to zswap
> before unlocking it and before calling swapcache_clear(). The need for
> the write can be detected by checking if the folio is dirty, I think
> this will only be true if the folio was loaded from zswap.

We only need to write it back when we know swap_read_folio() got the data
from zswap rather than from the swapfile. Is there a quick way to tell?



* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:04               ` Barry Song
@ 2024-03-22 23:08                 ` Yosry Ahmed
  2024-03-22 23:18                   ` Barry Song
  0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-22 23:08 UTC (permalink / raw)
  To: Barry Song
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > <hezhongkun.hzk@bytedance.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > >
> > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > >>
> > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > >>>>
> > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > >>>>> Hey folks,
> > > > >>>>>
> > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > >>>>
> > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > >>>>
> > > > >>>
> > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > >>> reported issue occurs
> > > > >>> on the same cpu.
> > > > >>
> > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > >>
> > > > >> Could you verify if the problem still exists with this fix?
> > > > >
> > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > There is some debug info, not mainline.
> > > > >
> > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > linux/mm_types.h
> > > >
> > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > which now include zram, ramdisk, pmem, nvdimm.
> > >
> > > Yes.
> > >
> > > >
> > > > It maybe not good to use zswap on these swap backends?
> > > >
> > > > The problem here is the page fault handler tries to skip swapcache to
> > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > >
> > >
> > > The first page fault returned VM_FAULT_RETRY because
> > > folio_lock_or_retry() failed.
> >
> > How so? The folio is newly allocated and not visible to any other
> > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > immediately try to lock it again with folio_lock_or_retry(). How does
> > this fail?
> >
> > Let's go over what happens after swap_read_folio():
> > - The 'if (!folio)' code block will be skipped.
> > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > - The 'if (swapcache)' code block will be skipped.
> > - The pte_same() check should succeed on first look because other
> > concurrent faulting threads should be held off by the newly introduced
> > swapcache_prepare() logic. But looking deeper I think this one may
> > fail due to a concurrent MADV_WILLNEED.
> > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > skipped because swap_read_folio() marks the folio up-to-date.
> > - After that point there is no possible failure until we install the
> > pte, at which point concurrent faults will fail on !pte_same() and
> > retry.
> >
> > So the only failure I think is possible is the pte_same() check. I see
> > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > the fault resolves it will go ahead and read the folio again into the
> > swapcache. It seems like we will end up with two copies of the same
>
> but zswap has freed the object when the do_swap_page finishes swap_read_folio
> due to exclusive load feature of zswap?
>
> so WILLNEED will get corrupted data and put it into swapcache.
> some other concurrent new forked process might get the new data
> from the swapcache WILLNEED puts when the new-forked process
> goes into do_swap_page.

Oh I was wondering how synchronization with WILLNEED happens without
zswap. It seems like we could end up with two copies of the same folio
and one of them will be leaked unless I am missing something.

>
> so very likely a new process is forked right after do_swap_page finishes
> swap_read_folio and before swapcache_clear.
>
> > folio? Maybe this is harmless because the folio in the swacache will
> > never be used, but it is essentially leaked at that point, right?
> >
> > I feel like I am missing something. Adding other folks that were
> > involved in the recent swapcache_prepare() synchronization thread.
> >
> > Anyway, I agree that at least in theory the data corruption could
> > happen because of exclusive loads when skipping the swapcache, and we
> > should fix that.
> >
> > Perhaps the right thing to do may be to write the folio again to zswap
> > before unlocking it and before calling swapcache_clear(). The need for
> > the write can be detected by checking if the folio is dirty, I think
> > this will only be true if the folio was loaded from zswap.
>
> we only need to write when we know swap_read_folio() gets data
> from zswap but not swapfile. is there a quick way to do this?

The folio will be dirty when loaded from zswap, so we can check whether
the folio is dirty and write the page back if we fail after
swap_read_folio().
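
A rough sketch of that idea in the swapcache-skipping (SWP_SYNCHRONOUS_IO)
path of do_swap_page(), untested and only to show the shape of it; whether
zswap_store() can simply be called at this point is an assumption, and the
fallback for a failed store is omitted:

        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
                /*
                 * Dirty here means the data was decompressed out of zswap
                 * and the compressed copy is already gone (exclusive load),
                 * so it only exists in this folio.  Push it back before
                 * swapcache_clear() lets someone else read the stale slot.
                 */
                if (folio_test_dirty(folio))
                        zswap_store(folio);     /* sketch: ignores failure */
                goto out_nomap;
        }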


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:08                 ` Yosry Ahmed
@ 2024-03-22 23:18                   ` Barry Song
  2024-03-22 23:22                     ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-03-22 23:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > <hezhongkun.hzk@bytedance.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > >
> > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > >>
> > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > >>>>
> > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > >>>>> Hey folks,
> > > > > >>>>>
> > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > >>>>
> > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > >>>>
> > > > > >>>
> > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > >>> reported issue occurs
> > > > > >>> on the same cpu.
> > > > > >>
> > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > >>
> > > > > >> Could you verify if the problem still exists with this fix?
> > > > > >
> > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > There is some debug info, not mainline.
> > > > > >
> > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > linux/mm_types.h
> > > > >
> > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > It maybe not good to use zswap on these swap backends?
> > > > >
> > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > >
> > > >
> > > > The first page fault returned VM_FAULT_RETRY because
> > > > folio_lock_or_retry() failed.
> > >
> > > How so? The folio is newly allocated and not visible to any other
> > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > this fail?
> > >
> > > Let's go over what happens after swap_read_folio():
> > > - The 'if (!folio)' code block will be skipped.
> > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > - The 'if (swapcache)' code block will be skipped.
> > > - The pte_same() check should succeed on first look because other
> > > concurrent faulting threads should be held off by the newly introduced
> > > swapcache_prepare() logic. But looking deeper I think this one may
> > > fail due to a concurrent MADV_WILLNEED.
> > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > skipped because swap_read_folio() marks the folio up-to-date.
> > > - After that point there is no possible failure until we install the
> > > pte, at which point concurrent faults will fail on !pte_same() and
> > > retry.
> > >
> > > So the only failure I think is possible is the pte_same() check. I see
> > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > the fault resolves it will go ahead and read the folio again into the
> > > swapcache. It seems like we will end up with two copies of the same
> >
> > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > due to exclusive load feature of zswap?
> >
> > so WILLNEED will get corrupted data and put it into swapcache.
> > some other concurrent new forked process might get the new data
> > from the swapcache WILLNEED puts when the new-forked process
> > goes into do_swap_page.
>
> Oh I was wondering how synchronization with WILLNEED happens without
> zswap. It seems like we could end up with two copies of the same folio
> and one of them will be leaked unless I am missing something.
>
> >
> > so very likely a new process is forked right after do_swap_page finishes
> > swap_read_folio and before swapcache_clear.
> >
> > > folio? Maybe this is harmless because the folio in the swacache will
> > > never be used, but it is essentially leaked at that point, right?
> > >
> > > I feel like I am missing something. Adding other folks that were
> > > involved in the recent swapcache_prepare() synchronization thread.
> > >
> > > Anyway, I agree that at least in theory the data corruption could
> > > happen because of exclusive loads when skipping the swapcache, and we
> > > should fix that.
> > >
> > > Perhaps the right thing to do may be to write the folio again to zswap
> > > before unlocking it and before calling swapcache_clear(). The need for
> > > the write can be detected by checking if the folio is dirty, I think
> > > this will only be true if the folio was loaded from zswap.
> >
> > we only need to write when we know swap_read_folio() gets data
> > from zswap but not swapfile. is there a quick way to do this?
>
> The folio will be dirty when loaded from zswap, so we can check if the
> folio is dirty and write the page if fail after swap_read_folio().

Is this actually a bug in swapin_walk_pmd_entry()? It only checks the pte
before read_swap_cache_async(). But when read_swap_cache_async() is
blocked by swapcache_prepare(), by the time swapcache_prepare() finally
succeeds someone else may already have set the pte and freed the swap
slot, even if this is not zswap?

static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
                unsigned long end, struct mm_walk *walk)
{
        struct vm_area_struct *vma = walk->private;
        struct swap_iocb *splug = NULL;
        pte_t *ptep = NULL;
        spinlock_t *ptl;
        unsigned long addr;

        for (addr = start; addr < end; addr += PAGE_SIZE) {
                pte_t pte;
                swp_entry_t entry;
                struct folio *folio;

                if (!ptep++) {
                        ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
                        if (!ptep)
                                break;
                }

                pte = ptep_get(ptep);
                if (!is_swap_pte(pte))
                        continue;
                entry = pte_to_swp_entry(pte);
                if (unlikely(non_swap_entry(entry)))
                        continue;

                pte_unmap_unlock(ptep, ptl);
                ptep = NULL;

                folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                                             vma, addr, &splug);
                if (folio)
                        folio_put(folio);
        }

        if (ptep)
                pte_unmap_unlock(ptep, ptl);
        swap_read_unplug(splug);
        cond_resched();

        return 0;
}

I mean the pte can become non-swap within read_swap_cache_async(), so we
have the bug whether or not zswap is involved.
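
To spell out the window, the relevant part of the loop above with the race
point annotated (same code as above, comments only):

                pte = ptep_get(ptep);
                if (!is_swap_pte(pte))
                        continue;
                entry = pte_to_swp_entry(pte);  /* entry sampled under the ptl */
                if (unlikely(non_swap_entry(entry)))
                        continue;

                pte_unmap_unlock(ptep, ptl);    /* ptl dropped here */
                ptep = NULL;

                /*
                 * Nothing pins the pte from here on: a concurrent fault can
                 * swap the page in, install a present pte and free (or let
                 * someone recycle) the slot, so by the time
                 * read_swap_cache_async() gets past swapcache_prepare() the
                 * entry it was handed may no longer match the pte.
                 */
                folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                                             vma, addr, &splug);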


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:18                   ` Barry Song
@ 2024-03-22 23:22                     ` Yosry Ahmed
  2024-03-22 23:32                       ` Barry Song
  0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-22 23:22 UTC (permalink / raw)
  To: Barry Song
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > >
> > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > >>
> > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > >>>>
> > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > >>>>> Hey folks,
> > > > > > >>>>>
> > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > >>>>
> > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > >>> reported issue occurs
> > > > > > >>> on the same cpu.
> > > > > > >>
> > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > >>
> > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > >
> > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > There is some debug info, not mainline.
> > > > > > >
> > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > linux/mm_types.h
> > > > > >
> > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > >
> > > > > Yes.
> > > > >
> > > > > >
> > > > > > It maybe not good to use zswap on these swap backends?
> > > > > >
> > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > >
> > > > >
> > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > folio_lock_or_retry() failed.
> > > >
> > > > How so? The folio is newly allocated and not visible to any other
> > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > this fail?
> > > >
> > > > Let's go over what happens after swap_read_folio():
> > > > - The 'if (!folio)' code block will be skipped.
> > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > - The 'if (swapcache)' code block will be skipped.
> > > > - The pte_same() check should succeed on first look because other
> > > > concurrent faulting threads should be held off by the newly introduced
> > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > fail due to a concurrent MADV_WILLNEED.
> > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > - After that point there is no possible failure until we install the
> > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > retry.
> > > >
> > > > So the only failure I think is possible is the pte_same() check. I see
> > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > the fault resolves it will go ahead and read the folio again into the
> > > > swapcache. It seems like we will end up with two copies of the same
> > >
> > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > due to exclusive load feature of zswap?
> > >
> > > so WILLNEED will get corrupted data and put it into swapcache.
> > > some other concurrent new forked process might get the new data
> > > from the swapcache WILLNEED puts when the new-forked process
> > > goes into do_swap_page.
> >
> > Oh I was wondering how synchronization with WILLNEED happens without
> > zswap. It seems like we could end up with two copies of the same folio
> > and one of them will be leaked unless I am missing something.
> >
> > >
> > > so very likely a new process is forked right after do_swap_page finishes
> > > swap_read_folio and before swapcache_clear.
> > >
> > > > folio? Maybe this is harmless because the folio in the swacache will
> > > > never be used, but it is essentially leaked at that point, right?
> > > >
> > > > I feel like I am missing something. Adding other folks that were
> > > > involved in the recent swapcache_prepare() synchronization thread.
> > > >
> > > > Anyway, I agree that at least in theory the data corruption could
> > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > should fix that.
> > > >
> > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > the write can be detected by checking if the folio is dirty, I think
> > > > this will only be true if the folio was loaded from zswap.
> > >
> > > we only need to write when we know swap_read_folio() gets data
> > > from zswap but not swapfile. is there a quick way to do this?
> >
> > The folio will be dirty when loaded from zswap, so we can check if the
> > folio is dirty and write the page if fail after swap_read_folio().
>
> Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> before read_swap_cache_async. but when read_swap_cache_async
> is blocked by swapcache_prepare, after it gets the swapcache_prepare
> successfully , someone else should have already set the pte and freed
> the swap slot even if this is not zswap?

If someone freed the swap slot then swapcache_prepare() should fail,
but the swap entry could have been recycled after we dropped the pte
lock, right?

Anyway, yeah, I think there might be a bug here that is unrelated to zswap.

>
> static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
>                 unsigned long end, struct mm_walk *walk)
> {
>         struct vm_area_struct *vma = walk->private;
>         struct swap_iocb *splug = NULL;
>         pte_t *ptep = NULL;
>         spinlock_t *ptl;
>         unsigned long addr;
>
>         for (addr = start; addr < end; addr += PAGE_SIZE) {
>                 pte_t pte;
>                 swp_entry_t entry;
>                 struct folio *folio;
>
>                 if (!ptep++) {
>                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>                         if (!ptep)
>                                 break;
>                 }
>
>                 pte = ptep_get(ptep);
>                 if (!is_swap_pte(pte))
>                         continue;
>                 entry = pte_to_swp_entry(pte);
>                 if (unlikely(non_swap_entry(entry)))
>                         continue;
>
>                 pte_unmap_unlock(ptep, ptl);
>                 ptep = NULL;
>
>                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
>                                              vma, addr, &splug);
>                 if (folio)
>                         folio_put(folio);
>         }
>
>         if (ptep)
>                 pte_unmap_unlock(ptep, ptl);
>         swap_read_unplug(splug);
>         cond_resched();
>
>         return 0;
> }
>
> I mean pte can become non-swap within read_swap_cache_async(),
> so no matter if it is zswap, we have the bug.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:22                     ` Yosry Ahmed
@ 2024-03-22 23:32                       ` Barry Song
  2024-03-22 23:34                         ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-03-22 23:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > >
> > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > >>
> > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > >>>>
> > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > >>>>> Hey folks,
> > > > > > > >>>>>
> > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > >>>>
> > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > >>> reported issue occurs
> > > > > > > >>> on the same cpu.
> > > > > > > >>
> > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > >>
> > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > >
> > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > There is some debug info, not mainline.
> > > > > > > >
> > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > linux/mm_types.h
> > > > > > >
> > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > >
> > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > >
> > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > >
> > > > > >
> > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > folio_lock_or_retry() failed.
> > > > >
> > > > > How so? The folio is newly allocated and not visible to any other
> > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > this fail?
> > > > >
> > > > > Let's go over what happens after swap_read_folio():
> > > > > - The 'if (!folio)' code block will be skipped.
> > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > - The pte_same() check should succeed on first look because other
> > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > - After that point there is no possible failure until we install the
> > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > retry.
> > > > >
> > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > swapcache. It seems like we will end up with two copies of the same
> > > >
> > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > due to exclusive load feature of zswap?
> > > >
> > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > some other concurrent new forked process might get the new data
> > > > from the swapcache WILLNEED puts when the new-forked process
> > > > goes into do_swap_page.
> > >
> > > Oh I was wondering how synchronization with WILLNEED happens without
> > > zswap. It seems like we could end up with two copies of the same folio
> > > and one of them will be leaked unless I am missing something.
> > >
> > > >
> > > > so very likely a new process is forked right after do_swap_page finishes
> > > > swap_read_folio and before swapcache_clear.
> > > >
> > > > > folio? Maybe this is harmless because the folio in the swacache will
> > > > > never be used, but it is essentially leaked at that point, right?
> > > > >
> > > > > I feel like I am missing something. Adding other folks that were
> > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > >
> > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > should fix that.
> > > > >
> > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > this will only be true if the folio was loaded from zswap.
> > > >
> > > > we only need to write when we know swap_read_folio() gets data
> > > > from zswap but not swapfile. is there a quick way to do this?
> > >
> > > The folio will be dirty when loaded from zswap, so we can check if the
> > > folio is dirty and write the page if fail after swap_read_folio().
> >
> > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > before read_swap_cache_async. but when read_swap_cache_async
> > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > successfully , someone else should have already set the pte and freed
> > the swap slot even if this is not zswap?
>
> If someone freed the swap slot then swapcache_prepare() should fail,
> but the swap entry could have been recycled after we dropped the pte
> lock, right?
>
> Anyway, yeah, I think there might be a bug here irrelevant to zswap.
>
> >
> > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> >                 unsigned long end, struct mm_walk *walk)
> > {
> >         struct vm_area_struct *vma = walk->private;
> >         struct swap_iocb *splug = NULL;
> >         pte_t *ptep = NULL;
> >         spinlock_t *ptl;
> >         unsigned long addr;
> >
> >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> >                 pte_t pte;
> >                 swp_entry_t entry;
> >                 struct folio *folio;
> >
> >                 if (!ptep++) {
> >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >                         if (!ptep)
> >                                 break;
> >                 }
> >
> >                 pte = ptep_get(ptep);
> >                 if (!is_swap_pte(pte))
> >                         continue;
> >                 entry = pte_to_swp_entry(pte);
> >                 if (unlikely(non_swap_entry(entry)))
> >                         continue;
> >
> >                 pte_unmap_unlock(ptep, ptl);
> >                 ptep = NULL;
> >
> >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> >                                              vma, addr, &splug);
> >                 if (folio)
> >                         folio_put(folio);
> >         }
> >
> >         if (ptep)
> >                 pte_unmap_unlock(ptep, ptl);
> >         swap_read_unplug(splug);
> >         cond_resched();
> >
> >         return 0;
> > }
> >
> > I mean pte can become non-swap within read_swap_cache_async(),
> > so no matter if it is zswap, we have the bug.

Checked again, probably still a zswap issue, as swapcache_prepare() can
detect a real swap slot free :-)

                /*
                 * Swap entry may have been freed since our caller observed it.
                 */
                err = swapcache_prepare(entry);
                if (!err)
                        break;


A zswap exclusive load isn't a real swap free.

But probably we have found the timing that causes the issue, at least :-)
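
For reference, swapcache_prepare() is (roughly, as of v6.8) just a swap_map
update, so it can only notice a real free of the slot:

/* mm/swapfile.c, approximately */
int swapcache_prepare(swp_entry_t entry)
{
        return __swap_duplicate(entry, SWAP_HAS_CACHE);
}

/*
 * __swap_duplicate() essentially only returns -ENOENT when no users of the
 * slot remain, i.e. the slot itself was freed.  A zswap exclusive load only
 * drops the compressed object; the slot stays allocated, so
 * swapcache_prepare() succeeds and the following swap_read_folio() reads
 * whatever stale data sits on the backing device.
 */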


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:32                       ` Barry Song
@ 2024-03-22 23:34                         ` Yosry Ahmed
  2024-03-22 23:38                           ` Barry Song
  0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-22 23:34 UTC (permalink / raw)
  To: Barry Song
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > >
> > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > >>
> > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > >>>>> Hey folks,
> > > > > > > > >>>>>
> > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > >>>>
> > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > >>>>
> > > > > > > > >>>
> > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > >>> reported issue occurs
> > > > > > > > >>> on the same cpu.
> > > > > > > > >>
> > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > >>
> > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > >
> > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > There is some debug info, not mainline.
> > > > > > > > >
> > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > linux/mm_types.h
> > > > > > > >
> > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > >
> > > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > > >
> > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > >
> > > > > > >
> > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > folio_lock_or_retry() failed.
> > > > > >
> > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > this fail?
> > > > > >
> > > > > > Let's go over what happens after swap_read_folio():
> > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > - The pte_same() check should succeed on first look because other
> > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > - After that point there is no possible failure until we install the
> > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > retry.
> > > > > >
> > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > >
> > > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > > due to exclusive load feature of zswap?
> > > > >
> > > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > > some other concurrent new forked process might get the new data
> > > > > from the swapcache WILLNEED puts when the new-forked process
> > > > > goes into do_swap_page.
> > > >
> > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > zswap. It seems like we could end up with two copies of the same folio
> > > > and one of them will be leaked unless I am missing something.
> > > >
> > > > >
> > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > swap_read_folio and before swapcache_clear.
> > > > >
> > > > > > folio? Maybe this is harmless because the folio in the swacache will
> > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > >
> > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > >
> > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > should fix that.
> > > > > >
> > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > this will only be true if the folio was loaded from zswap.
> > > > >
> > > > > we only need to write when we know swap_read_folio() gets data
> > > > > from zswap but not swapfile. is there a quick way to do this?
> > > >
> > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > folio is dirty and write the page if fail after swap_read_folio().
> > >
> > > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > > before read_swap_cache_async. but when read_swap_cache_async
> > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > successfully , someone else should have already set the pte and freed
> > > the swap slot even if this is not zswap?
> >
> > If someone freed the swap slot then swapcache_prepare() should fail,
> > but the swap entry could have been recycled after we dropped the pte
> > lock, right?
> >
> > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> >
> > >
> > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > >                 unsigned long end, struct mm_walk *walk)
> > > {
> > >         struct vm_area_struct *vma = walk->private;
> > >         struct swap_iocb *splug = NULL;
> > >         pte_t *ptep = NULL;
> > >         spinlock_t *ptl;
> > >         unsigned long addr;
> > >
> > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > >                 pte_t pte;
> > >                 swp_entry_t entry;
> > >                 struct folio *folio;
> > >
> > >                 if (!ptep++) {
> > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > >                         if (!ptep)
> > >                                 break;
> > >                 }
> > >
> > >                 pte = ptep_get(ptep);
> > >                 if (!is_swap_pte(pte))
> > >                         continue;
> > >                 entry = pte_to_swp_entry(pte);
> > >                 if (unlikely(non_swap_entry(entry)))
> > >                         continue;
> > >
> > >                 pte_unmap_unlock(ptep, ptl);
> > >                 ptep = NULL;
> > >
> > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > >                                              vma, addr, &splug);
> > >                 if (folio)
> > >                         folio_put(folio);
> > >         }
> > >
> > >         if (ptep)
> > >                 pte_unmap_unlock(ptep, ptl);
> > >         swap_read_unplug(splug);
> > >         cond_resched();
> > >
> > >         return 0;
> > > }
> > >
> > > I mean pte can become non-swap within read_swap_cache_async(),
> > > so no matter if it is zswap, we have the bug.
>
> checked again,  probably still a zswap issue, as swapcache_prepare can detect
> real swap slot free :-)
>
>                 /*
>                  * Swap entry may have been freed since our caller observed it.
>                  */
>                 err = swapcache_prepare(entry);
>                 if (!err)
>                         break;
>
>
> zswap exslusive load isn't a real swap free.
>
> But probably we have found the timing which causes the issue at least :-)

The problem I was referring to is the swapin fault path that skips the
swapcache vs. MADV_WILLNEED. The fault path could swap the page in and
skip the swapcache, and MADV_WILLNEED could swap it in again into the
swapcache. We would end up with two copies of the folio.
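
Roughly, the interleaving being described:

/*
 *   faulting thread                        madvise(MADV_WILLNEED)
 *   -------------------------------        -----------------------------
 *   do_swap_page() takes the
 *   swapcache-skipping path and reads
 *   the entry into a private folio
 *                                          read_swap_cache_async() on the
 *                                          same entry reads it again, this
 *                                          time into a swapcache folio
 *
 * Two folios now hold the same data; the swapcache one is never the folio
 * mapped by the fault, so it just sits there, effectively leaked (and with
 * zswap exclusive loads its contents are stale on top of that).
 */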


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:34                         ` Yosry Ahmed
@ 2024-03-22 23:38                           ` Barry Song
  2024-03-22 23:41                             ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-03-22 23:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > >
> > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > >>
> > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > >>>>
> > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > >>>>> Hey folks,
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > >>>>
> > > > > > > > > >>>
> > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > >>> reported issue occurs
> > > > > > > > > >>> on the same cpu.
> > > > > > > > > >>
> > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > >>
> > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > >
> > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > >
> > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > linux/mm_types.h
> > > > > > > > >
> > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > > > >
> > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > >
> > > > > > > >
> > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > folio_lock_or_retry() failed.
> > > > > > >
> > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > this fail?
> > > > > > >
> > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > - After that point there is no possible failure until we install the
> > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > retry.
> > > > > > >
> > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > >
> > > > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > > > due to exclusive load feature of zswap?
> > > > > >
> > > > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > > > some other concurrent new forked process might get the new data
> > > > > > from the swapcache WILLNEED puts when the new-forked process
> > > > > > goes into do_swap_page.
> > > > >
> > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > and one of them will be leaked unless I am missing something.
> > > > >
> > > > > >
> > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > swap_read_folio and before swapcache_clear.
> > > > > >
> > > > > > > folio? Maybe this is harmless because the folio in the swacache will
> > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > >
> > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > >
> > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > should fix that.
> > > > > > >
> > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > >
> > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > >
> > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > folio is dirty and write the page if fail after swap_read_folio().
> > > >
> > > > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > > > before read_swap_cache_async. but when read_swap_cache_async
> > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > successfully , someone else should have already set the pte and freed
> > > > the swap slot even if this is not zswap?
> > >
> > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > but the swap entry could have been recycled after we dropped the pte
> > > lock, right?
> > >
> > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > >
> > > >
> > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > >                 unsigned long end, struct mm_walk *walk)
> > > > {
> > > >         struct vm_area_struct *vma = walk->private;
> > > >         struct swap_iocb *splug = NULL;
> > > >         pte_t *ptep = NULL;
> > > >         spinlock_t *ptl;
> > > >         unsigned long addr;
> > > >
> > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > >                 pte_t pte;
> > > >                 swp_entry_t entry;
> > > >                 struct folio *folio;
> > > >
> > > >                 if (!ptep++) {
> > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > >                         if (!ptep)
> > > >                                 break;
> > > >                 }
> > > >
> > > >                 pte = ptep_get(ptep);
> > > >                 if (!is_swap_pte(pte))
> > > >                         continue;
> > > >                 entry = pte_to_swp_entry(pte);
> > > >                 if (unlikely(non_swap_entry(entry)))
> > > >                         continue;
> > > >
> > > >                 pte_unmap_unlock(ptep, ptl);
> > > >                 ptep = NULL;
> > > >
> > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > >                                              vma, addr, &splug);
> > > >                 if (folio)
> > > >                         folio_put(folio);
> > > >         }
> > > >
> > > >         if (ptep)
> > > >                 pte_unmap_unlock(ptep, ptl);
> > > >         swap_read_unplug(splug);
> > > >         cond_resched();
> > > >
> > > >         return 0;
> > > > }
> > > >
> > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > so no matter if it is zswap, we have the bug.
> >
> > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > real swap slot free :-)
> >
> >                 /*
> >                  * Swap entry may have been freed since our caller observed it.
> >                  */
> >                 err = swapcache_prepare(entry);
> >                 if (!err)
> >                         break;
> >
> >
> > zswap exslusive load isn't a real swap free.
> >
> > But probably we have found the timing which causes the issue at least :-)
>
> The problem I was referring to is with the swapin fault path that
> skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> page and skip the swapcache, and MADV_WILLNEED could swap it in again
> into the swapcache. We would end up with two copies of the folio.

Right. I feel like we have to re-check that the pte has not changed
within __read_swap_cache_async() after swapcache_prepare() succeeds,
since we may have been blocked for a while and the previous entry could
have been freed and re-allocated by someone else, a completely different
process. Then we would end up reading another process's data.
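
Very roughly, the shape of that check, shown at the call site in
swapin_walk_pmd_entry() since __read_swap_cache_async() does not see the
pte today (purely illustrative; the back-out path is hand-waved):

                folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
                                             vma, addr, &splug);
                if (folio) {
                        pte_t *p = pte_offset_map_lock(vma->vm_mm, pmd,
                                                       addr, &ptl);

                        if (p) {
                                pte_t cur = ptep_get(p);

                                /*
                                 * While we were blocked in swapcache_prepare()
                                 * the entry may have been freed and handed to
                                 * another process; if the pte no longer holds
                                 * it, what we just read is not this mapping's
                                 * data.
                                 */
                                if (!is_swap_pte(cur) ||
                                    pte_to_swp_entry(cur).val != entry.val) {
                                        /* back out, e.g. drop the stale
                                         * swapcache folio */
                                }
                                pte_unmap_unlock(p, ptl);
                        }
                        folio_put(folio);
                }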


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:38                           ` Barry Song
@ 2024-03-22 23:41                             ` Yosry Ahmed
  2024-03-23  0:34                               ` Barry Song
  0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-22 23:41 UTC (permalink / raw)
  To: Barry Song
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > >
> > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > >>>>>
> > > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > >>> on the same cpu.
> > > > > > > > > > >>
> > > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > > >>
> > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > >
> > > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > > >
> > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > linux/mm_types.h
> > > > > > > > > >
> > > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > > >
> > > > > > > > > Yes.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > > > > >
> > > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > >
> > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > this fail?
> > > > > > > >
> > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > retry.
> > > > > > > >
> > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > >
> > > > > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > > > > due to exclusive load feature of zswap?
> > > > > > >
> > > > > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > > > > some other concurrent new forked process might get the new data
> > > > > > > from the swapcache WILLNEED puts when the new-forked process
> > > > > > > goes into do_swap_page.
> > > > > >
> > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > and one of them will be leaked unless I am missing something.
> > > > > >
> > > > > > >
> > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > >
> > > > > > > > folio? Maybe this is harmless because the folio in the swacache will
> > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > >
> > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > >
> > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > should fix that.
> > > > > > > >
> > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > >
> > > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > > >
> > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > folio is dirty and write the page if fail after swap_read_folio().
> > > > >
> > > > > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > > > > before read_swap_cache_async. but when read_swap_cache_async
> > > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > > successfully , someone else should have already set the pte and freed
> > > > > the swap slot even if this is not zswap?
> > > >
> > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > but the swap entry could have been recycled after we dropped the pte
> > > > lock, right?
> > > >
> > > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > > >
> > > > >
> > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > {
> > > > >         struct vm_area_struct *vma = walk->private;
> > > > >         struct swap_iocb *splug = NULL;
> > > > >         pte_t *ptep = NULL;
> > > > >         spinlock_t *ptl;
> > > > >         unsigned long addr;
> > > > >
> > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > >                 pte_t pte;
> > > > >                 swp_entry_t entry;
> > > > >                 struct folio *folio;
> > > > >
> > > > >                 if (!ptep++) {
> > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > >                         if (!ptep)
> > > > >                                 break;
> > > > >                 }
> > > > >
> > > > >                 pte = ptep_get(ptep);
> > > > >                 if (!is_swap_pte(pte))
> > > > >                         continue;
> > > > >                 entry = pte_to_swp_entry(pte);
> > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > >                         continue;
> > > > >
> > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > >                 ptep = NULL;
> > > > >
> > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > >                                              vma, addr, &splug);
> > > > >                 if (folio)
> > > > >                         folio_put(folio);
> > > > >         }
> > > > >
> > > > >         if (ptep)
> > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > >         swap_read_unplug(splug);
> > > > >         cond_resched();
> > > > >
> > > > >         return 0;
> > > > > }
> > > > >
> > > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > > so no matter if it is zswap, we have the bug.
> > >
> > > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > > real swap slot free :-)
> > >
> > >                 /*
> > >                  * Swap entry may have been freed since our caller observed it.
> > >                  */
> > >                 err = swapcache_prepare(entry);
> > >                 if (!err)
> > >                         break;
> > >
> > >
> > > zswap exclusive load isn't a real swap free.
> > >
> > > But probably we have found the timing which causes the issue at least :-)
> >
> > The problem I was referring to is with the swapin fault path that
> > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > into the swapcache. We would end up with two copies of the folio.
>
> right. i feel like we have to re-check pte is not changed within
> __read_swap_cache_async after swapcache_prepare succeed
> after being blocked for a while as the previous entry could have
> been freed and re-allocated by someone else - a completely
> different process. then we get read other processes' data.

This is only a problem when we skip the swapcache during swapin.
Otherwise the swapcache synchronizes this. I wonder how much does
skipping the swapcache buy us on recent kernels? This optimization was
introduced a long time ago.
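
For readers following along, this is the swapcache-bypass path in do_swap_page()
being discussed. The sketch below is paraphrased from mm/memory.c around v6.8 and
deliberately simplified (allocation-failure handling, cgroup charging and several
other steps are omitted), so treat it as an illustration rather than verbatim
kernel code:

        /* Simplified sketch of do_swap_page(), not verbatim. */
        if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
            __swap_count(entry) == 1) {
                /* bypass the swapcache: private folio, never added to the cache */
                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
                __folio_set_locked(folio);
                __folio_set_swapbacked(folio);

                /* 13ddaf26be32: hold off concurrent swapin of the same entry */
                if (swapcache_prepare(entry))
                        goto out;                       /* raced, retry the fault */

                folio_add_lru(folio);                   /* folio becomes visible to reclaim */
                folio->swap = entry;
                swap_read_folio(folio, true, NULL);     /* may be served by zswap_load(),
                                                         * which frees the zswap entry
                                                         * (exclusive load) and unlocks
                                                         * the folio */
        } else {
                /* regular path: the folio goes through the swapcache */
                folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
        }
        ...
        ret |= folio_lock_or_retry(folio, vmf);         /* reclaim may already hold the lock here */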


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 23:41                             ` Yosry Ahmed
@ 2024-03-23  0:34                               ` Barry Song
  2024-03-23  0:42                                 ` Yosry Ahmed
  2024-03-23 10:48                                 ` Chris Li
  0 siblings, 2 replies; 29+ messages in thread
From: Barry Song @ 2024-03-23  0:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

On Sat, Mar 23, 2024 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > > >>> on the same cpu.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > > > >
> > > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > > linux/mm_types.h
> > > > > > > > > > >
> > > > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > > > >
> > > > > > > > > > Yes.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > > > > > >
> > > > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > > >
> > > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > > this fail?
> > > > > > > > >
> > > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > > retry.
> > > > > > > > >
> > > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > > >
> > > > > > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > > > > > due to exclusive load feature of zswap?
> > > > > > > >
> > > > > > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > > > > > some other concurrent new forked process might get the new data
> > > > > > > > from the swapcache WILLNEED puts when the new-forked process
> > > > > > > > goes into do_swap_page.
> > > > > > >
> > > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > > and one of them will be leaked unless I am missing something.
> > > > > > >
> > > > > > > >
> > > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > > >
> > > > > > > > > folio? Maybe this is harmless because the folio in the swapcache will
> > > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > > >
> > > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > > >
> > > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > > should fix that.
> > > > > > > > >
> > > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > > >
> > > > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > > > >
> > > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > > folio is dirty and write the page back if we fail after swap_read_folio().
> > > > > >
> > > > > > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > > > > > before read_swap_cache_async. but when read_swap_cache_async
> > > > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > > > successfully , someone else should have already set the pte and freed
> > > > > > the swap slot even if this is not zswap?
> > > > >
> > > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > > but the swap entry could have been recycled after we dropped the pte
> > > > > lock, right?
> > > > >
> > > > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > > > >
> > > > > >
> > > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > > {
> > > > > >         struct vm_area_struct *vma = walk->private;
> > > > > >         struct swap_iocb *splug = NULL;
> > > > > >         pte_t *ptep = NULL;
> > > > > >         spinlock_t *ptl;
> > > > > >         unsigned long addr;
> > > > > >
> > > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > > >                 pte_t pte;
> > > > > >                 swp_entry_t entry;
> > > > > >                 struct folio *folio;
> > > > > >
> > > > > >                 if (!ptep++) {
> > > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > >                         if (!ptep)
> > > > > >                                 break;
> > > > > >                 }
> > > > > >
> > > > > >                 pte = ptep_get(ptep);
> > > > > >                 if (!is_swap_pte(pte))
> > > > > >                         continue;
> > > > > >                 entry = pte_to_swp_entry(pte);
> > > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > > >                         continue;
> > > > > >
> > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > >                 ptep = NULL;
> > > > > >
> > > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > > >                                              vma, addr, &splug);
> > > > > >                 if (folio)
> > > > > >                         folio_put(folio);
> > > > > >         }
> > > > > >
> > > > > >         if (ptep)
> > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > >         swap_read_unplug(splug);
> > > > > >         cond_resched();
> > > > > >
> > > > > >         return 0;
> > > > > > }
> > > > > >
> > > > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > > > so no matter if it is zswap, we have the bug.
> > > >
> > > > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > > > real swap slot free :-)
> > > >
> > > >                 /*
> > > >                  * Swap entry may have been freed since our caller observed it.
> > > >                  */
> > > >                 err = swapcache_prepare(entry);
> > > >                 if (!err)
> > > >                         break;
> > > >
> > > >
> > > > zswap exclusive load isn't a real swap free.
> > > >
> > > > But probably we have found the timing which causes the issue at least :-)
> > >
> > > The problem I was referring to is with the swapin fault path that
> > > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > > into the swapcache. We would end up with two copies of the folio.
> >
> > right. i feel like we have to re-check pte is not changed within
> > __read_swap_cache_async after swapcache_prepare succeed
> > after being blocked for a while as the previous entry could have
> > been freed and re-allocated by someone else - a completely
> > different process. then we get read other processes' data.

>
> This is only a problem when we skip the swapcache during swapin.
> Otherwise the swapcache synchronizes this. I wonder how much does
> skipping the swapcache buy us on recent kernels? This optimization was
> introduced a long time ago.

Still performs quite well, according to Kairui's data:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458

Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

BTW, zram+zswap seems pointless to begin with; it looks like a wrong
configuration for users. If this case is really happening, could we
simply fix it
by:

diff --git a/mm/memory.c b/mm/memory.c
index b7cab8be8632..6742d1428373 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        swapcache = folio;

        if (!folio) {
-               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
                    __swap_count(entry) == 1) {
                        /*
                         * Prevent parallel swapin from proceeding with


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23  0:34                               ` Barry Song
@ 2024-03-23  0:42                                 ` Yosry Ahmed
  2024-03-23 10:48                                 ` Chris Li
  1 sibling, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-23  0:42 UTC (permalink / raw)
  To: Barry Song
  Cc: Zhongkun He, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Chris Li, Ying

[..]
>
> > This is only a problem when we skip the swapcache during swapin.
> > Otherwise the swapcache synchronizes this. I wonder how much does
> > skipping the swapcache buy us on recent kernels? This optimization was
> > introduced a long time ago.
>
> Still performs quite well, according to Kairui's data:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458
>
> Before: 10934698 us
> After: 11157121 us
> Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

That's 17.9% slower if we use the swapcache. Commit 0bcac06f27d7 ("mm,
swap: skip swapcache for swapin of synchronous device") says:
"""
This patch aims to reduce swap-in latency by skipping swapcache if the
swap device is synchronous device like rw_page based device.  It
enhances 45% my swapin test(5G sequential swapin, no readahead, from
2.41sec to 1.64sec).
"""

So it went from 45% to 17.9%; we are getting close :)
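
(For reference, the 17.9% follows directly from the numbers quoted above:
(13155355 - 11157121) / 11157121 ≈ 0.179, i.e. the cached path is about 17.9%
slower than the swapcache-bypass path in that test.)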

>
> BTW, zram+zswap seems pointless to begin with; it looks like a wrong
> configuration for users. If this case is really happening, could we
> simply fix it

Yeah we can do that. Zhongkun said they have a use case for zram+zswap
tho, but I don't know what it is.

If we can find a better (yet still simple) fix, then I would prefer
that. Otherwise enforcing the use of the swapcache when zswap is
enabled sounds like a viable option.

> by:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b7cab8be8632..6742d1428373 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         swapcache = folio;
>
>         if (!folio) {
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> +               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>                     __swap_count(entry) == 1) {
>                         /*
>                          * Prevent parallel swapin from proceeding with


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-22 19:34             ` Yosry Ahmed
  2024-03-22 23:04               ` Barry Song
@ 2024-03-23  1:34               ` Zhongkun He
  2024-03-23  1:36                 ` Yosry Ahmed
  2024-03-23 10:52                 ` Chris Li
  1 sibling, 2 replies; 29+ messages in thread
From: Zhongkun He @ 2024-03-23  1:34 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, linux-mm,
	wuyun.abel, zhouchengming, Nhat Pham, Kairui Song, Minchan Kim,
	David Hildenbrand, Barry Song, Chris Li, Ying

On Sat, Mar 23, 2024 at 3:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > >
> > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > >>
> > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > >>> <chengming.zhou@linux.dev> wrote:
> > > >>>>
> > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > >>>>> Hey folks,
> > > >>>>>
> > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > >>>>
> > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > >>>>
> > > >>>
> > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > >>> reported issue occurs
> > > >>> on the same cpu.
> > > >>
> > > >> I think you may misunderstand the race description in this fix changelog,
> > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > >>
> > > >> Could you verify if the problem still exists with this fix?
> > > >
> > > > Yes,I'm sure the problem still exists with this patch.
> > > > There is some debug info, not mainline.
> > > >
> > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > linux/mm_types.h
> > >
> > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > which now include zram, ramdisk, pmem, nvdimm.
> >
> > Yes.
> >
> > >
> > > It maybe not good to use zswap on these swap backends?
> > >
> > > The problem here is the page fault handler tries to skip swapcache to
> > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > to pte entry since some changes happened such as concurrent fork of entry.
> > >
> >
> > The first page fault returned VM_FAULT_RETRY because
> > folio_lock_or_retry() failed.
>

Hi Yosry,

> How so? The folio is newly allocated and not visible to any other
> threads or CPUs. swap_read_folio() unlocks it and then returns and we
> immediately try to lock it again with folio_lock_or_retry(). How does
> this fail?

Haha, it makes me very confused. Based on the steps to reproduce the problem,
I think the page is locked by shrink_folio_list(). Please see the
following situation.

do_swap_page
      __folio_set_locked(folio);
      swap_readpage(page, true, NULL);
          zswap_load(folio)
          folio_unlock(folio);

shrink_folio_list                                /* reclaim kicks in here */
      if (!folio_trylock(folio))                 /* trylock succeeds, reclaim now holds the lock */

do_swap_page (continued)
      ret |= folio_lock_or_retry(folio, vmf);    /* fails and returns VM_FAULT_RETRY */
      if (ret & VM_FAULT_RETRY)
           goto out_release;

Thanks.

>
> Let's go over what happens after swap_read_folio():
> - The 'if (!folio)' code block will be skipped.
> - folio_lock_or_retry() should succeed as I mentioned earlier.
> - The 'if (swapcache)' code block will be skipped.
> - The pte_same() check should succeed on first look because other
> concurrent faulting threads should be held off by the newly introduced
> swapcache_prepare() logic. But looking deeper I think this one may
> fail due to a concurrent MADV_WILLNEED.
> - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> skipped because swap_read_folio() marks the folio up-to-date.
> - After that point there is no possible failure until we install the
> pte, at which point concurrent faults will fail on !pte_same() and
> retry.
>
> So the only failure I think is possible is the pte_same() check. I see
> how a concurrent MADV_WILLNEED could cause that check to fail. A
> concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> the fault resolves it will go ahead and read the folio again into the
> swapcache. It seems like we will end up with two copies of the same
> folio? Maybe this is harmless because the folio in the swapcache will
> never be used, but it is essentially leaked at that point, right?
>
> I feel like I am missing something. Adding other folks that were
> involved in the recent swapcache_prepare() synchronization thread.
>
> Anyway, I agree that at least in theory the data corruption could
> happen because of exclusive loads when skipping the swapcache, and we
> should fix that.
>
> Perhaps the right thing to do may be to write the folio again to zswap
> before unlocking it and before calling swapcache_clear(). The need for
> the write can be detected by checking if the folio is dirty, I think
> this will only be true if the folio was loaded from zswap.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23  1:34               ` Zhongkun He
@ 2024-03-23  1:36                 ` Yosry Ahmed
  2024-03-23 10:52                 ` Chris Li
  1 sibling, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-03-23  1:36 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Chengming Zhou, Johannes Weiner, Andrew Morton, linux-mm,
	wuyun.abel, zhouchengming, Nhat Pham, Kairui Song, Minchan Kim,
	David Hildenbrand, Barry Song, Chris Li, Ying

On Fri, Mar 22, 2024 at 6:35 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> On Sat, Mar 23, 2024 at 3:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > <hezhongkun.hzk@bytedance.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > >
> > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > >>
> > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > >>>>
> > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > >>>>> Hey folks,
> > > > >>>>>
> > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > >>>>
> > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > >>>>
> > > > >>>
> > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > >>> reported issue occurs
> > > > >>> on the same cpu.
> > > > >>
> > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > >>
> > > > >> Could you verify if the problem still exists with this fix?
> > > > >
> > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > There is some debug info, not mainline.
> > > > >
> > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > linux/mm_types.h
> > > >
> > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > which now include zram, ramdisk, pmem, nvdimm.
> > >
> > > Yes.
> > >
> > > >
> > > > It maybe not good to use zswap on these swap backends?
> > > >
> > > > The problem here is the page fault handler tries to skip swapcache to
> > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > >
> > >
> > > The first page fault returned VM_FAULT_RETRY because
> > > folio_lock_or_retry() failed.
> >
>
> Hi Yosry,
>
> > How so? The folio is newly allocated and not visible to any other
> > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > immediately try to lock it again with folio_lock_or_retry(). How does
> > this fail?
>
> Haha, it makes me very confused. Based on the steps to reproduce the problem,
> I think the page is locked by shrink_folio_list(). Please see the
> following situation.

I missed the call to folio_add_lru() before swap_read_folio(). Reclaim
would be able to lock the folio in this case once it's unlocked by
swap_read_folio().
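
For reference, a condensed view of the ordering in question (simplified, not
verbatim kernel code):

        __folio_set_locked(folio);
        ...
        folio_add_lru(folio);                   /* folio is now reachable by reclaim */
        swap_read_folio(folio, true, NULL);     /* zswap_load() + folio_unlock() inside */
        ...                                     /* shrink_folio_list() can trylock it here */
        ret |= folio_lock_or_retry(folio, vmf); /* may now fail with VM_FAULT_RETRY */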

Thanks for elaborating.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23  0:34                               ` Barry Song
  2024-03-23  0:42                                 ` Yosry Ahmed
@ 2024-03-23 10:48                                 ` Chris Li
  2024-03-23 11:27                                   ` Chris Li
  2024-03-23 12:41                                   ` Zhongkun He
  1 sibling, 2 replies; 29+ messages in thread
From: Chris Li @ 2024-03-23 10:48 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Zhongkun He, Chengming Zhou, Johannes Weiner,
	Andrew Morton, linux-mm, wuyun.abel, zhouchengming, Nhat Pham,
	Kairui Song, Minchan Kim, David Hildenbrand, Ying

On Fri, Mar 22, 2024 at 5:35 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Mar 23, 2024 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > > > >>>>>
> > > > > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>
> > > > > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > > > >>> on the same cpu.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > > > linux/mm_types.h
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > > > > >
> > > > > > > > > > > Yes.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > It maybe not good to use zswap on these swap backends?
> > > > > > > > > > > >
> > > > > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > > > >
> > > > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > > > this fail?
> > > > > > > > > >
> > > > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > > > retry.
> > > > > > > > > >
> > > > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > > > >
> > > > > > > > > but zswap has freed the object when the do_swap_page finishes swap_read_folio
> > > > > > > > > due to exclusive load feature of zswap?
> > > > > > > > >
> > > > > > > > > so WILLNEED will get corrupted data and put it into swapcache.
> > > > > > > > > some other concurrent new forked process might get the new data
> > > > > > > > > from the swapcache WILLNEED puts when the new-forked process
> > > > > > > > > goes into do_swap_page.
> > > > > > > >
> > > > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > > > and one of them will be leaked unless I am missing something.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > > > >
> > > > > > > > > > folio? Maybe this is harmless because the folio in the swapcache will
> > > > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > > > >
> > > > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > > > >
> > > > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > > > should fix that.
> > > > > > > > > >
> > > > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > > > >
> > > > > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > > > > >
> > > > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > > > folio is dirty and write the page back if we fail after swap_read_folio().
> > > > > > >
> > > > > > > Is it actually a bug of swapin_walk_pmd_entry? it only check pte
> > > > > > > before read_swap_cache_async. but when read_swap_cache_async
> > > > > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > > > > successfully , someone else should have already set the pte and freed
> > > > > > > the swap slot even if this is not zswap?
> > > > > >
> > > > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > > > but the swap entry could have been recycled after we dropped the pte
> > > > > > lock, right?
> > > > > >
> > > > > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > > > > >
> > > > > > >
> > > > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > > > {
> > > > > > >         struct vm_area_struct *vma = walk->private;
> > > > > > >         struct swap_iocb *splug = NULL;
> > > > > > >         pte_t *ptep = NULL;
> > > > > > >         spinlock_t *ptl;
> > > > > > >         unsigned long addr;
> > > > > > >
> > > > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > > > >                 pte_t pte;
> > > > > > >                 swp_entry_t entry;
> > > > > > >                 struct folio *folio;
> > > > > > >
> > > > > > >                 if (!ptep++) {
> > > > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > >                         if (!ptep)
> > > > > > >                                 break;
> > > > > > >                 }
> > > > > > >
> > > > > > >                 pte = ptep_get(ptep);
> > > > > > >                 if (!is_swap_pte(pte))
> > > > > > >                         continue;
> > > > > > >                 entry = pte_to_swp_entry(pte);
> > > > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > > > >                         continue;
> > > > > > >
> > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > >                 ptep = NULL;
> > > > > > >
> > > > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > > > >                                              vma, addr, &splug);
> > > > > > >                 if (folio)
> > > > > > >                         folio_put(folio);
> > > > > > >         }
> > > > > > >
> > > > > > >         if (ptep)
> > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > >         swap_read_unplug(splug);
> > > > > > >         cond_resched();
> > > > > > >
> > > > > > >         return 0;
> > > > > > > }
> > > > > > >
> > > > > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > > > > so no matter if it is zswap, we have the bug.
> > > > >
> > > > > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > > > > real swap slot free :-)
> > > > >
> > > > >                 /*
> > > > >                  * Swap entry may have been freed since our caller observed it.
> > > > >                  */
> > > > >                 err = swapcache_prepare(entry);
> > > > >                 if (!err)
> > > > >                         break;
> > > > >
> > > > >
> > > > > zswap exclusive load isn't a real swap free.
> > > > >
> > > > > But probably we have found the timing which causes the issue at least :-)
> > > >
> > > > The problem I was referring to is with the swapin fault path that
> > > > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > > > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > > > into the swapcache. We would end up with two copies of the folio.
> > >
> > > right. i feel like we have to re-check pte is not changed within
> > > __read_swap_cache_async after swapcache_prepare succeed
> > > after being blocked for a while as the previous entry could have
> > > been freed and re-allocated by someone else - a completely
> > > different process. then we get read other processes' data.
>
> >
> > This is only a problem when we skip the swapcache during swapin.
> > Otherwise the swapcache synchronizes this. I wonder how much does
> > skipping the swapcache buy us on recent kernels? This optimization was
> > introduced a long time ago.
>
> Still performs quite well, according to Kairui's data:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458
>
> Before: 10934698 us
> After: 11157121 us
> Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
>
> BTW, zram+zswap seems pointless to begin with; it looks like a wrong
> configuration for users. If this case is really happening, could we
> simply fix it
> by:
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b7cab8be8632..6742d1428373 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         swapcache = folio;
>
>         if (!folio) {
> -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> +               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&

Because zswap_enabled can change at run time due to the delayed setup of zswap,
this has a time-of-check-to-time-of-use issue.

Maybe moving the check into zswap_store() is better.

Something like this.

Zhongkun, can you verify with this change the bug will go away?

Chris


    zswap: disable SWP_SYNCHRONOUS_IO in zswap_store

diff --git a/mm/zswap.c b/mm/zswap.c
index f04a75a36236..f40778adefa3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1472,6 +1472,7 @@ bool zswap_store(struct folio *folio)
        struct obj_cgroup *objcg = NULL;
        struct mem_cgroup *memcg = NULL;
        unsigned long max_pages, cur_pages;
+       struct swap_info_struct *si = NULL;

        VM_WARN_ON_ONCE(!folio_test_locked(folio));
        VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
@@ -1483,6 +1484,18 @@ bool zswap_store(struct folio *folio)
        if (!zswap_enabled)
                goto check_old;

+       /* Prevent swapoff from happening to us. */
+       si = get_swap_device(swp);
+       if (si) {
+               /*
+                * SWP_SYNCHRONOUS_IO bypasses the swap cache, which is not
+                * compatible with zswap exclusive loads.
+                */
+               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+                       si->flags &= ~SWP_SYNCHRONOUS_IO;
+               put_swap_device(si);
+       }
+
        /* Check cgroup limits */
        objcg = get_obj_cgroup_from_folio(folio);
        if (objcg && !obj_cgroup_may_zswap(objcg)) {


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23  1:34               ` Zhongkun He
  2024-03-23  1:36                 ` Yosry Ahmed
@ 2024-03-23 10:52                 ` Chris Li
  2024-03-23 10:55                   ` Barry Song
  1 sibling, 1 reply; 29+ messages in thread
From: Chris Li @ 2024-03-23 10:52 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Yosry Ahmed, Chengming Zhou, Johannes Weiner, Andrew Morton,
	linux-mm, wuyun.abel, zhouchengming, Nhat Pham, Kairui Song,
	Minchan Kim, David Hildenbrand, Barry Song, Ying

On Fri, Mar 22, 2024 at 6:35 PM Zhongkun He
<hezhongkun.hzk@bytedance.com> wrote:
>
> On Sat, Mar 23, 2024 at 3:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > <hezhongkun.hzk@bytedance.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > >
> > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > >>
> > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > >>>>
> > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > >>>>> Hey folks,
> > > > >>>>>
> > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > >>>>
> > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > >>>>
> > > > >>>
> > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > >>> reported issue occurs
> > > > >>> on the same cpu.
> > > > >>
> > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > >>
> > > > >> Could you verify if the problem still exists with this fix?
> > > > >
> > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > There is some debug info, not mainline.
> > > > >
> > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > linux/mm_types.h
> > > >
> > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > which now include zram, ramdisk, pmem, nvdimm.
> > >
> > > Yes.
> > >
> > > >
> > > > It maybe not good to use zswap on these swap backends?
> > > >
> > > > The problem here is the page fault handler tries to skip swapcache to
> > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > >
> > >
> > > The first page fault returned VM_FAULT_RETRY because
> > > folio_lock_or_retry() failed.
> >
>
> Hi Yosry,
>
> > How so? The folio is newly allocated and not visible to any other
> > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > immediately try to lock it again with folio_lock_or_retry(). How does
> > this fail?
>
> Haha, it makes me very confused. Based on the steps to reproduce the problem,
> I think the page is locked by shrink_folio_list(). Please see the
> following situation.
>
> do_swap_page
>       __folio_set_locked(folio);
>      swap_readpage(page, true, NULL);
>           zswap_load(folio)
>           folio_unlock(folio);
>
>     shrink_folio_list
>
>     if (!folio_trylock(folio))
>       ret |= folio_lock_or_retry(folio, vmf);
>       if (ret & VM_FAULT_RETRY)
>            goto out_release;

Thanks for the detailed bug report. So this means the folio
immediately gets reclaimed after zswap_load(), before do_swap_page
returns, right?

We also need to audit if there is any other code path in the
do_swap_page that can fail a swap fault and not store the folio into
the swap cache.

Chris

>
> Thanks.
>
> >
> > Let's go over what happens after swap_read_folio():
> > - The 'if (!folio)' code block will be skipped.
> > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > - The 'if (swapcache)' code block will be skipped.
> > - The pte_same() check should succeed on first look because other
> > concurrent faulting threads should be held off by the newly introduced
> > swapcache_prepare() logic. But looking deeper I think this one may
> > fail due to a concurrent MADV_WILLNEED.
> > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > skipped because swap_read_folio() marks the folio up-to-date.
> > - After that point there is no possible failure until we install the
> > pte, at which point concurrent faults will fail on !pte_same() and
> > retry.
> >
> > So the only failure I think is possible is the pte_same() check. I see
> > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > the fault resolves it will go ahead and read the folio again into the
> > swapcache. It seems like we will end up with two copies of the same
> > folio? Maybe this is harmless because the folio in the swapcache will
> > never be used, but it is essentially leaked at that point, right?
> >
> > I feel like I am missing something. Adding other folks that were
> > involved in the recent swapcache_prepare() synchronization thread.
> >
> > Anyway, I agree that at least in theory the data corruption could
> > happen because of exclusive loads when skipping the swapcache, and we
> > should fix that.
> >
> > Perhaps the right thing to do may be to write the folio again to zswap
> > before unlocking it and before calling swapcache_clear(). The need for
> > the write can be detected by checking if the folio is dirty, I think
> > this will only be true if the folio was loaded from zswap.
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23 10:52                 ` Chris Li
@ 2024-03-23 10:55                   ` Barry Song
  0 siblings, 0 replies; 29+ messages in thread
From: Barry Song @ 2024-03-23 10:55 UTC (permalink / raw)
  To: Chris Li
  Cc: Zhongkun He, Yosry Ahmed, Chengming Zhou, Johannes Weiner,
	Andrew Morton, linux-mm, wuyun.abel, zhouchengming, Nhat Pham,
	Kairui Song, Minchan Kim, David Hildenbrand, Ying

On Sat, Mar 23, 2024 at 6:52 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Mar 22, 2024 at 6:35 PM Zhongkun He
> <hezhongkun.hzk@bytedance.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 3:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > <hezhongkun.hzk@bytedance.com> wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > >
> > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > >>
> > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > >>>>
> > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > >>>>> Hey folks,
> > > > > >>>>>
> > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > >>>>
> > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > >>>>
> > > > > >>>
> > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > >>> reported issue occurs
> > > > > >>> on the same cpu.
> > > > > >>
> > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > >>
> > > > > >> Could you verify if the problem still exists with this fix?
> > > > > >
> > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > There is some debug info, not mainline.
> > > > > >
> > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > linux/mm_types.h
> > > > >
> > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > >
> > > > Yes.
> > > >
> > > > >
> > > > > It maybe not good to use zswap on these swap backends?
> > > > >
> > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > >
> > > >
> > > > The first page fault returned VM_FAULT_RETRY because
> > > > folio_lock_or_retry() failed.
> > >
> >
> > Hi Yosry,
> >
> > > How so? The folio is newly allocated and not visible to any other
> > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > this fail?
> >
> > Haha, it makes me very confused. Based on the steps to reproduce the problem,
> > I think the page is locked by shrink_folio_list(). Please see the
> > following situation.
> >
> > do_swap_page
> >       __folio_set_locked(folio);
> >      swap_readpage(page, true, NULL);
> >           zswap_load(folio)
> >           folio_unlock(folio);
> >
> >     shrink_folio_list
> >
> >     if (!folio_trylock(folio))
> >       ret |= folio_lock_or_retry(folio, vmf);
> >       if (ret & VM_FAULT_RETRY)
> >            goto out_release;
>
> Thanks for the detailed bug report. So this means the folio
> immediately gets reclaimed after zswap_load(), before do_swap_page
> returns, right?
>
> We also need to audit if there is any other code path in the
> do_swap_page that can fail a swap fault and not store the folio into
> the swap cache.
>
> Chris
>
> >
> > Thanks.
> >
> > >
> > > Let's go over what happens after swap_read_folio():
> > > - The 'if (!folio)' code block will be skipped.
> > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > - The 'if (swapcache)' code block will be skipped.
> > > - The pte_same() check should succeed on first look because other
> > > concurrent faulting threads should be held off by the newly introduced
> > > swapcache_prepare() logic. But looking deeper I think this one may
> > > fail due to a concurrent MADV_WILLNEED.
> > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > skipped because swap_read_folio() marks the folio up-to-date.
> > > - After that point there is no possible failure until we install the
> > > pte, at which point concurrent faults will fail on !pte_same() and
> > > retry.
> > >
> > > So the only failure I think is possible is the pte_same() check. I see
> > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > the fault resolves it will go ahead and read the folio again into the
> > > swapcache. It seems like we will end up with two copies of the same
> > > folio? Maybe this is harmless because the folio in the swapcache will
> > > never be used, but it is essentially leaked at that point, right?

right. it has got a good fix here to avoid immediate release by zswap:
https://lore.kernel.org/linux-mm/20240322234826.GA448621@cmpxchg.org/

> > >
> > > I feel like I am missing something. Adding other folks that were
> > > involved in the recent swapcache_prepare() synchronization thread.
> > >
> > > Anyway, I agree that at least in theory the data corruption could
> > > happen because of exclusive loads when skipping the swapcache, and we
> > > should fix that.
> > >
> > > Perhaps the right thing to do may be to write the folio again to zswap
> > > before unlocking it and before calling swapcache_clear(). The need for
> > > the write can be detected by checking if the folio is dirty, I think
> > > this will only be true if the folio was loaded from zswap.
> >


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23 10:48                                 ` Chris Li
@ 2024-03-23 11:27                                   ` Chris Li
  2024-03-23 12:41                                   ` Zhongkun He
  1 sibling, 0 replies; 29+ messages in thread
From: Chris Li @ 2024-03-23 11:27 UTC (permalink / raw)
  To: Barry Song
  Cc: Yosry Ahmed, Zhongkun He, Chengming Zhou, Johannes Weiner,
	Andrew Morton, linux-mm, wuyun.abel, zhouchengming, Nhat Pham,
	Kairui Song, Minchan Kim, David Hildenbrand, Ying

On Sat, Mar 23, 2024 at 3:48 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Mar 22, 2024 at 5:35 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > > > > >>> on the same cpu.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > > > > linux/mm_types.h
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > It may not be good to use zswap on these swap backends?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > > > > >
> > > > > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > > > > this fail?
> > > > > > > > > > >
> > > > > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > > > > retry.
> > > > > > > > > > >
> > > > > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > > > > >
> > > > > > > > > > But zswap has freed the object by the time do_swap_page finishes swap_read_folio,
> > > > > > > > > > due to the exclusive load feature of zswap?
> > > > > > > > > >
> > > > > > > > > > So WILLNEED will get corrupted data and put it into the swapcache.
> > > > > > > > > > Some other concurrently forked process might then get that data
> > > > > > > > > > from the swapcache entry WILLNEED put there, when the newly forked
> > > > > > > > > > process goes into do_swap_page.
> > > > > > > > >
> > > > > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > > > > and one of them will be leaked unless I am missing something.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > > > > >
> > > > > > > > > > > folio? Maybe this is harmless because the folio in the swapcache will
> > > > > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > > > > >
> > > > > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > > > > >
> > > > > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > > > > should fix that.
> > > > > > > > > > >
> > > > > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > > > > >
> > > > > > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > > > > > >
> > > > > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > > > > folio is dirty and write the page back if the fault fails after swap_read_folio().
> > > > > > > >
> > > > > > > > Is it actually a bug in swapin_walk_pmd_entry? It only checks the pte
> > > > > > > > before read_swap_cache_async. But when read_swap_cache_async
> > > > > > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > > > > > successfully, someone else may have already set the pte and freed
> > > > > > > > the swap slot even if this is not zswap?
> > > > > > >
> > > > > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > > > > but the swap entry could have been recycled after we dropped the pte
> > > > > > > lock, right?
> > > > > > >
> > > > > > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > > > > > >
> > > > > > > >
> > > > > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > > > > {
> > > > > > > >         struct vm_area_struct *vma = walk->private;
> > > > > > > >         struct swap_iocb *splug = NULL;
> > > > > > > >         pte_t *ptep = NULL;
> > > > > > > >         spinlock_t *ptl;
> > > > > > > >         unsigned long addr;
> > > > > > > >
> > > > > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > > > > >                 pte_t pte;
> > > > > > > >                 swp_entry_t entry;
> > > > > > > >                 struct folio *folio;
> > > > > > > >
> > > > > > > >                 if (!ptep++) {
> > > > > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > > >                         if (!ptep)
> > > > > > > >                                 break;
> > > > > > > >                 }
> > > > > > > >
> > > > > > > >                 pte = ptep_get(ptep);
> > > > > > > >                 if (!is_swap_pte(pte))
> > > > > > > >                         continue;
> > > > > > > >                 entry = pte_to_swp_entry(pte);
> > > > > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > > > > >                         continue;
> > > > > > > >
> > > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > > >                 ptep = NULL;
> > > > > > > >
> > > > > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > > > > >                                              vma, addr, &splug);
> > > > > > > >                 if (folio)
> > > > > > > >                         folio_put(folio);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         if (ptep)
> > > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > > >         swap_read_unplug(splug);
> > > > > > > >         cond_resched();
> > > > > > > >
> > > > > > > >         return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > > > > > so no matter if it is zswap, we have the bug.
> > > > > >
> > > > > > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > > > > > real swap slot free :-)
> > > > > >
> > > > > >                 /*
> > > > > >                  * Swap entry may have been freed since our caller observed it.
> > > > > >                  */
> > > > > >                 err = swapcache_prepare(entry);
> > > > > >                 if (!err)
> > > > > >                         break;
> > > > > >
> > > > > >
> > > > > > zswap exclusive load isn't a real swap free.
> > > > > >
> > > > > > But probably we have found the timing which causes the issue at least :-)
> > > > >
> > > > > The problem I was referring to is with the swapin fault path that
> > > > > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > > > > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > > > > into the swapcache. We would end up with two copies of the folio.
> > > >
> > > > Right. I feel like we have to re-check that the pte has not changed within
> > > > __read_swap_cache_async after swapcache_prepare succeeds,
> > > > after being blocked for a while, as the previous entry could have
> > > > been freed and re-allocated by someone else - a completely
> > > > different process. Then we end up reading another process's data.
> >
> > >
> > > This is only a problem when we skip the swapcache during swapin.
> > > Otherwise the swapcache synchronizes this. I wonder how much
> > > skipping the swapcache buys us on recent kernels? This optimization was
> > > introduced a long time ago.
> >
> > Still performs quite well, according to Kairui's data:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458
> >
> > Before: 10934698 us
> > After: 11157121 us
> > Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
> >
> > BTW, zram+zswap seems pointless from the very beginning; it seems like a
> > wrong configuration for users. If this case is really happening, could we
> > simply fix it by:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b7cab8be8632..6742d1428373 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         swapcache = folio;
> >
> >         if (!folio) {
> > -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > +               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>
> Because zswap_enabled can change at run time due to the delayed setup of zswap.
>
> This has the time-of-check to time-of-use issue.

Never mind that, I just realized that even if zswap were enabled, the
data race does not affect the current swap entry, which was already
swapped out before zswap_enabled changed.
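
To make that last point concrete: where a given swap entry's data ends up
(zswap pool or backing device) is decided when the page is stored, not when it
is faulted back in, so a fault-time check that sees a newer zswap_enabled value
cannot misplace data that was written out under the old one. A user-space toy
of that ordering (all names invented, not kernel code):

/* toy_store_time.c - the flag that matters is the one sampled at store time */
#include <stdio.h>
#include <stdbool.h>

enum where { BACKING_DEVICE, ZSWAP_POOL };

static bool zswap_on;                   /* can be toggled at runtime */

/* the store samples the flag once; the data's location is fixed from then on */
static enum where toy_store(void)
{
        return zswap_on ? ZSWAP_POOL : BACKING_DEVICE;
}

/* a fault-time check may see a newer flag value, but the data is still
 * wherever toy_store() put it */
static void toy_fault(enum where stored_in)
{
        printf("fault sees zswap_on=%d, data lives in %s\n", zswap_on,
               stored_in == ZSWAP_POOL ? "the zswap pool" : "the backing device");
}

int main(void)
{
        zswap_on = false;
        enum where loc = toy_store();   /* swapped out before zswap was enabled */

        zswap_on = true;                /* delayed setup of zswap happens here */
        toy_fault(loc);                 /* the check looks stale, but is harmless:
                                         * this entry was never in zswap at all */
        return 0;
}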

Chris


>
> Maybe moving to the zswap_store() is better.
>
> Something like this.
>
> Zhongkun, can you verify with this change the bug will go away?
>
> Chris
>
>
>     zswap: disable SWP_SYNCHRONOUS_IO in zswap_store
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f04a75a36236..f40778adefa3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1472,6 +1472,7 @@ bool zswap_store(struct folio *folio)
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
>         unsigned long max_pages, cur_pages;
> +       struct swap_info_struct *si = NULL;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1483,6 +1484,18 @@ bool zswap_store(struct folio *folio)
>         if (!zswap_enabled)
>                 goto check_old;
>
> +       /* Prevent swapoff from happening to us. */
> +       si = get_swap_device(swp);
> +       if (si) {
> +               /*
> +                * SWP_SYNCHRONOUS_IO bypasses the swap cache, which is not
> +                * compatible with zswap exclusive loads.
> +                */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +                       si->flags &= ~ SWP_SYNCHRONOUS_IO;
> +               put_swap_device(si);
> +       }
> +
>         /* Check cgroup limits */
>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
  2024-03-23 10:48                                 ` Chris Li
  2024-03-23 11:27                                   ` Chris Li
@ 2024-03-23 12:41                                   ` Zhongkun He
  1 sibling, 0 replies; 29+ messages in thread
From: Zhongkun He @ 2024-03-23 12:41 UTC (permalink / raw)
  To: Chris Li
  Cc: Barry Song, Yosry Ahmed, Chengming Zhou, Johannes Weiner,
	Andrew Morton, linux-mm, wuyun.abel, zhouchengming, Nhat Pham,
	Kairui Song, Minchan Kim, David Hildenbrand, Ying

On Sat, Mar 23, 2024 at 6:49 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Mar 22, 2024 at 5:35 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > > > > >>>>>
> > > > > > > > > > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > >>>
> > > > > > > > > > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > > > > >>> on the same cpu.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > > > > > > > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes,I'm sure the problem still exists with this patch.
> > > > > > > > > > > > > > There is some debug info, not mainline.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > > > > linux/mm_types.h
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > > > > which now include zram, ramdisk, pmem, nvdimm.
> > > > > > > > > > > >
> > > > > > > > > > > > Yes.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > It may not be good to use zswap on these swap backends?
> > > > > > > > > > > > >
> > > > > > > > > > > > > The problem here is the page fault handler tries to skip swapcache to
> > > > > > > > > > > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > > > > > > > > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > > > > >
> > > > > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > > > > this fail?
> > > > > > > > > > >
> > > > > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be
> > > > > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > > > > retry.
> > > > > > > > > > >
> > > > > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > > > > >
> > > > > > > > > > But zswap has freed the object by the time do_swap_page finishes swap_read_folio,
> > > > > > > > > > due to the exclusive load feature of zswap?
> > > > > > > > > >
> > > > > > > > > > So WILLNEED will get corrupted data and put it into the swapcache.
> > > > > > > > > > Some other concurrently forked process might then get that data
> > > > > > > > > > from the swapcache entry WILLNEED put there, when the newly forked
> > > > > > > > > > process goes into do_swap_page.
> > > > > > > > >
> > > > > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > > > > and one of them will be leaked unless I am missing something.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > > > > >
> > > > > > > > > > > folio? Maybe this is harmless because the folio in the swapcache will
> > > > > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > > > > >
> > > > > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > > > > >
> > > > > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > > > > should fix that.
> > > > > > > > > > >
> > > > > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > > > > the write can be detected by checking if the folio is dirty, I think
> > > > > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > > > > >
> > > > > > > > > > we only need to write when we know swap_read_folio() gets data
> > > > > > > > > > from zswap but not swapfile. is there a quick way to do this?
> > > > > > > > >
> > > > > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > > > > folio is dirty and write the page back if the fault fails after swap_read_folio().
> > > > > > > >
> > > > > > > > Is it actually a bug in swapin_walk_pmd_entry? It only checks the pte
> > > > > > > > before read_swap_cache_async. But when read_swap_cache_async
> > > > > > > > is blocked by swapcache_prepare, after it gets the swapcache_prepare
> > > > > > > > successfully, someone else may have already set the pte and freed
> > > > > > > > the swap slot even if this is not zswap?
> > > > > > >
> > > > > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > > > > but the swap entry could have been recycled after we dropped the pte
> > > > > > > lock, right?
> > > > > > >
> > > > > > > Anyway, yeah, I think there might be a bug here irrelevant to zswap.
> > > > > > >
> > > > > > > >
> > > > > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > > > > {
> > > > > > > >         struct vm_area_struct *vma = walk->private;
> > > > > > > >         struct swap_iocb *splug = NULL;
> > > > > > > >         pte_t *ptep = NULL;
> > > > > > > >         spinlock_t *ptl;
> > > > > > > >         unsigned long addr;
> > > > > > > >
> > > > > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > > > > >                 pte_t pte;
> > > > > > > >                 swp_entry_t entry;
> > > > > > > >                 struct folio *folio;
> > > > > > > >
> > > > > > > >                 if (!ptep++) {
> > > > > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > > >                         if (!ptep)
> > > > > > > >                                 break;
> > > > > > > >                 }
> > > > > > > >
> > > > > > > >                 pte = ptep_get(ptep);
> > > > > > > >                 if (!is_swap_pte(pte))
> > > > > > > >                         continue;
> > > > > > > >                 entry = pte_to_swp_entry(pte);
> > > > > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > > > > >                         continue;
> > > > > > > >
> > > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > > >                 ptep = NULL;
> > > > > > > >
> > > > > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > > > > >                                              vma, addr, &splug);
> > > > > > > >                 if (folio)
> > > > > > > >                         folio_put(folio);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         if (ptep)
> > > > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > > > >         swap_read_unplug(splug);
> > > > > > > >         cond_resched();
> > > > > > > >
> > > > > > > >         return 0;
> > > > > > > > }
> > > > > > > >
> > > > > > > > I mean pte can become non-swap within read_swap_cache_async(),
> > > > > > > > so no matter if it is zswap, we have the bug.
> > > > > >
> > > > > > checked again,  probably still a zswap issue, as swapcache_prepare can detect
> > > > > > real swap slot free :-)
> > > > > >
> > > > > >                 /*
> > > > > >                  * Swap entry may have been freed since our caller observed it.
> > > > > >                  */
> > > > > >                 err = swapcache_prepare(entry);
> > > > > >                 if (!err)
> > > > > >                         break;
> > > > > >
> > > > > >
> > > > > > zswap exclusive load isn't a real swap free.
> > > > > >
> > > > > > But probably we have found the timing which causes the issue at least :-)
> > > > >
> > > > > The problem I was referring to is with the swapin fault path that
> > > > > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > > > > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > > > > into the swapcache. We would end up with two copies of the folio.
> > > >
> > > > Right. I feel like we have to re-check that the pte has not changed within
> > > > __read_swap_cache_async after swapcache_prepare succeeds,
> > > > after being blocked for a while, as the previous entry could have
> > > > been freed and re-allocated by someone else - a completely
> > > > different process. Then we end up reading another process's data.
> >
> > >
> > > This is only a problem when we skip the swapcache during swapin.
> > > Otherwise the swapcache synchronizes this. I wonder how much
> > > skipping the swapcache buys us on recent kernels? This optimization was
> > > introduced a long time ago.
> >
> > Still performs quite well, according to Kairui's data:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458
> >
> > Before: 10934698 us
> > After: 11157121 us
> > Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)
> >
> > BTW, zram+zswap seems pointless from the very beginning; it seems like a
> > wrong configuration for users. If this case is really happening, could we
> > simply fix it by:
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index b7cab8be8632..6742d1428373 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >         swapcache = folio;
> >
> >         if (!folio) {
> > -               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > +               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>
> Because zswap_enabled can change at run time due to the delayed setup of zswap.
>
> This has the time-of-check to time-of-use issue.
>
> Maybe moving to the zswap_store() is better.
>
> Something like this.
>
> Zhongkun, can you verify with this change the bug will go away?
>
> Chris
>

Hi Chris, thanks for your analysis.

There is a good fix from Johannes, and I have tested it:
https://lore.kernel.org/linux-mm/20240322234826.GA448621@cmpxchg.org/


>
>     zswap: disable SWP_SYNCHRONOUS_IO in zswap_store
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index f04a75a36236..f40778adefa3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1472,6 +1472,7 @@ bool zswap_store(struct folio *folio)
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
>         unsigned long max_pages, cur_pages;
> +       struct swap_info_struct *si = NULL;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> @@ -1483,6 +1484,18 @@ bool zswap_store(struct folio *folio)
>         if (!zswap_enabled)
>                 goto check_old;
>
> +       /* Prevent swapoff from happening to us. */
> +       si = get_swap_device(swp);
> +       if (si) {
> +               /*
> +                * SWP_SYNCHRONOUS_IO bypasses the swap cache, which is not
> +                * compatible with zswap exclusive loads.
> +                */
> +               if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
> +                       si->flags &= ~ SWP_SYNCHRONOUS_IO;
> +               put_swap_device(si);
> +       }
> +
>         /* Check cgroup limits */
>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2024-03-23 12:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-21  4:34 [bug report] mm/zswap :memory corruption after zswap_load() Zhongkun He
2024-03-21  4:42 ` Chengming Zhou
2024-03-21  5:09   ` [External] " Zhongkun He
2024-03-21  5:24     ` Chengming Zhou
2024-03-21  6:36       ` Zhongkun He
2024-03-21  9:28         ` Chengming Zhou
2024-03-21 15:25           ` Nhat Pham
2024-03-21 18:32             ` Yosry Ahmed
2024-03-22  3:27               ` Chengming Zhou
2024-03-22  3:16             ` Zhongkun He
2024-03-22  3:04           ` Zhongkun He
2024-03-22 19:34             ` Yosry Ahmed
2024-03-22 23:04               ` Barry Song
2024-03-22 23:08                 ` Yosry Ahmed
2024-03-22 23:18                   ` Barry Song
2024-03-22 23:22                     ` Yosry Ahmed
2024-03-22 23:32                       ` Barry Song
2024-03-22 23:34                         ` Yosry Ahmed
2024-03-22 23:38                           ` Barry Song
2024-03-22 23:41                             ` Yosry Ahmed
2024-03-23  0:34                               ` Barry Song
2024-03-23  0:42                                 ` Yosry Ahmed
2024-03-23 10:48                                 ` Chris Li
2024-03-23 11:27                                   ` Chris Li
2024-03-23 12:41                                   ` Zhongkun He
2024-03-23  1:34               ` Zhongkun He
2024-03-23  1:36                 ` Yosry Ahmed
2024-03-23 10:52                 ` Chris Li
2024-03-23 10:55                   ` Barry Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.