From: Barry Song <21cnbao@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Zhongkun He <hezhongkun.hzk@bytedance.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	 linux-mm <linux-mm@kvack.org>,
	wuyun.abel@bytedance.com, zhouchengming@bytedance.com,
	 Nhat Pham <nphamcs@gmail.com>, Kairui Song <kasong@tencent.com>,
	Minchan Kim <minchan@kernel.org>,
	 David Hildenbrand <david@redhat.com>,
	Chris Li <chrisl@kernel.org>, Ying <ying.huang@intel.com>
Subject: Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
Date: Sat, 23 Mar 2024 13:34:49 +1300	[thread overview]
Message-ID: <CAGsJ_4wuiVgdX0PU3HaGjiy3nE1qDN_eHzkyP4974ZShB-WTiQ@mail.gmail.com> (raw)
In-Reply-To: <CAJD7tkaiwd1YULSYAJu=s-uwd-nw8CAVz12tNyjS-xfQ3S6FEw@mail.gmail.com>

On Sat, Mar 23, 2024 at 12:42 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 22, 2024 at 4:38 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Mar 23, 2024 at 12:35 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Fri, Mar 22, 2024 at 4:32 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, Mar 23, 2024 at 12:23 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Fri, Mar 22, 2024 at 4:18 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Sat, Mar 23, 2024 at 12:09 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Fri, Mar 22, 2024 at 4:04 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > On Sat, Mar 23, 2024 at 8:35 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He
> > > > > > > > > <hezhongkun.hzk@bytedance.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > > > > > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > > > > > > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou
> > > > > > > > > > > >>> <chengming.zhou@linux.dev> wrote:
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > > > > > > > > >>>>> Hey folks,
> > > > > > > > > > > >>>>>
> > > > > > > > > > > >>>>> Recently, I tested zswap with memory reclaiming on mainline
> > > > > > > > > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > > > > > > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Yes, this fix avoids concurrent swapin from different CPUs, but the
> > > > > > > > > > > >>> reported issue occurs
> > > > > > > > > > > >>> on the same CPU.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think you may have misunderstood the race description in this fix's changelog;
> > > > > > > > > > > >> CPU0 and CPU1 just mean two concurrent threads, not two real CPUs.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Could you verify if the problem still exists with this fix?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, I'm sure the problem still exists with this patch.
> > > > > > > > > > > > Here is some debug info; the kernel is not pure mainline.
> > > > > > > > > > > >
> > > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > > > > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > > > > > > > > linux/mm_types.h
> > > > > > > > > > >
> > > > > > > > > > > Ok, this problem seems to only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > > > > > > > > which currently include zram, ramdisk, pmem, and nvdimm.
> > > > > > > > > >
> > > > > > > > > > Yes.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Maybe it is not a good idea to use zswap on these swap backends?
> > > > > > > > > > >
> > > > > > > > > > > The problem here is that the page fault handler tries to skip the swapcache to
> > > > > > > > > > > swap in the folio (swap entry count == 1), but then it can't install the folio
> > > > > > > > > > > into the pte since something changed in the meantime, such as a concurrent fork of the entry.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > The first page fault returned VM_FAULT_RETRY because
> > > > > > > > > > folio_lock_or_retry() failed.
> > > > > > > > >
> > > > > > > > > How so? The folio is newly allocated and not visible to any other
> > > > > > > > > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > > > > > > > > immediately try to lock it again with folio_lock_or_retry(). How does
> > > > > > > > > this fail?
> > > > > > > > >
> > > > > > > > > Let's go over what happens after swap_read_folio():
> > > > > > > > > - The 'if (!folio)' code block will be skipped.
> > > > > > > > > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > > > > > > > > - The 'if (swapcache)' code block will be skipped.
> > > > > > > > > - The pte_same() check should succeed on first look because other
> > > > > > > > > concurrent faulting threads should be held off by the newly introduced
> > > > > > > > > swapcache_prepare() logic. But looking deeper I think this one may
> > > > > > > > > fail due to a concurrent MADV_WILLNEED.
> > > > > > > > > - The 'if (unlikely(!folio_test_uptodate(folio)))' part will be
> > > > > > > > > skipped because swap_read_folio() marks the folio up-to-date.
> > > > > > > > > - After that point there is no possible failure until we install the
> > > > > > > > > pte, at which point concurrent faults will fail on !pte_same() and
> > > > > > > > > retry.
> > > > > > > > >
> > > > > > > > > So the only failure I think is possible is the pte_same() check. I see
> > > > > > > > > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > > > > > > > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > > > > > > > > the fault resolves it will go ahead and read the folio again into the
> > > > > > > > > swapcache. It seems like we will end up with two copies of the same
> > > > > > > >
> > > > > > > > but zswap has already freed the object by the time do_swap_page finishes swap_read_folio,
> > > > > > > > due to the exclusive load feature of zswap?
> > > > > > > >
> > > > > > > > so WILLNEED will get corrupted data and put it into the swapcache.
> > > > > > > > some other concurrently forked process might then get that data
> > > > > > > > from the swapcache entry WILLNEED put there when the newly forked process
> > > > > > > > goes into do_swap_page.
> > > > > > >
> > > > > > > Oh I was wondering how synchronization with WILLNEED happens without
> > > > > > > zswap. It seems like we could end up with two copies of the same folio
> > > > > > > and one of them will be leaked unless I am missing something.
> > > > > > >
> > > > > > > >
> > > > > > > > so very likely a new process is forked right after do_swap_page finishes
> > > > > > > > swap_read_folio and before swapcache_clear.
> > > > > > > >
> > > > > > > > > folio? Maybe this is harmless because the folio in the swapcache will
> > > > > > > > > never be used, but it is essentially leaked at that point, right?
> > > > > > > > >
> > > > > > > > > I feel like I am missing something. Adding other folks that were
> > > > > > > > > involved in the recent swapcache_prepare() synchronization thread.
> > > > > > > > >
> > > > > > > > > Anyway, I agree that at least in theory the data corruption could
> > > > > > > > > happen because of exclusive loads when skipping the swapcache, and we
> > > > > > > > > should fix that.
> > > > > > > > >
> > > > > > > > > Perhaps the right thing to do may be to write the folio again to zswap
> > > > > > > > > before unlocking it and before calling swapcache_clear(). The need for
> > > > > > > > > the write can be detected by checking if the folio is dirty; I think
> > > > > > > > > this will only be true if the folio was loaded from zswap.
> > > > > > > >
> > > > > > > > we only need to write when we know swap_read_folio() got the data
> > > > > > > > from zswap and not from the swapfile. Is there a quick way to detect this?
> > > > > > >
> > > > > > > The folio will be dirty when loaded from zswap, so we can check if the
> > > > > > > folio is dirty after swap_read_folio() and write the page back if it is.
> > > > > >
> > > > > > Is it actually a bug in swapin_walk_pmd_entry? It only checks the pte
> > > > > > before read_swap_cache_async. But when read_swap_cache_async
> > > > > > is blocked by swapcache_prepare, by the time it gets swapcache_prepare
> > > > > > successfully, someone else could have already set the pte and freed
> > > > > > the swap slot, even if this is not zswap?
> > > > >
> > > > > If someone freed the swap slot then swapcache_prepare() should fail,
> > > > > but the swap entry could have been recycled after we dropped the pte
> > > > > lock, right?
> > > > >
> > > > > Anyway, yeah, I think there might be a bug here unrelated to zswap.
> > > > >
> > > > > >
> > > > > > static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
> > > > > >                 unsigned long end, struct mm_walk *walk)
> > > > > > {
> > > > > >         struct vm_area_struct *vma = walk->private;
> > > > > >         struct swap_iocb *splug = NULL;
> > > > > >         pte_t *ptep = NULL;
> > > > > >         spinlock_t *ptl;
> > > > > >         unsigned long addr;
> > > > > >
> > > > > >         for (addr = start; addr < end; addr += PAGE_SIZE) {
> > > > > >                 pte_t pte;
> > > > > >                 swp_entry_t entry;
> > > > > >                 struct folio *folio;
> > > > > >
> > > > > >                 if (!ptep++) {
> > > > > >                         ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > >                         if (!ptep)
> > > > > >                                 break;
> > > > > >                 }
> > > > > >
> > > > > >                 pte = ptep_get(ptep);
> > > > > >                 if (!is_swap_pte(pte))
> > > > > >                         continue;
> > > > > >                 entry = pte_to_swp_entry(pte);
> > > > > >                 if (unlikely(non_swap_entry(entry)))
> > > > > >                         continue;
> > > > > >
> > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > >                 ptep = NULL;
> > > > > >
> > > > > >                 folio = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
> > > > > >                                              vma, addr, &splug);
> > > > > >                 if (folio)
> > > > > >                         folio_put(folio);
> > > > > >         }
> > > > > >
> > > > > >         if (ptep)
> > > > > >                 pte_unmap_unlock(ptep, ptl);
> > > > > >         swap_read_unplug(splug);
> > > > > >         cond_resched();
> > > > > >
> > > > > >         return 0;
> > > > > > }
> > > > > >
> > > > > > I mean the pte can become non-swap within read_swap_cache_async(),
> > > > > > so we have the bug regardless of zswap.
> > > >
> > > > Checked again, it is probably still a zswap issue, as swapcache_prepare can detect
> > > > a real swap slot free :-)
> > > >
> > > >                 /*
> > > >                  * Swap entry may have been freed since our caller observed it.
> > > >                  */
> > > >                 err = swapcache_prepare(entry);
> > > >                 if (!err)
> > > >                         break;
> > > >
> > > >
> > > > zswap exclusive load isn't a real swap free.
> > > >
> > > > But we have probably found the timing which causes the issue, at least :-)
> > >
> > > The problem I was referring to is with the swapin fault path that
> > > skips the swapcache vs. MADV_WILLNEED. The fault path could swapin the
> > > page and skip the swapcache, and MADV_WILLNEED could swap it in again
> > > into the swapcache. We would end up with two copies of the folio.
> >
> > Right. I feel like we have to re-check that the pte has not changed within
> > __read_swap_cache_async after swapcache_prepare succeeds,
> > after being blocked for a while, as the previous entry could have
> > been freed and re-allocated by someone else - a completely
> > different process. Then we would read another process's data.

>
> This is only a problem when we skip the swapcache during swapin.
> Otherwise the swapcache synchronizes this. I wonder how much
> skipping the swapcache buys us on recent kernels? This optimization was
> introduced a long time ago.

It still performs quite well, according to Kairui's data:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=13ddaf26be324a7f951891ecd9ccd04466d27458

Before: 10934698 us
After: 11157121 us
Cached: 13155355 us (Dropping SWP_SYNCHRONOUS_IO flag)

BTW, zram+zswap seems pointless from the very beginning; it seems like a wrong
configuration for users. If this case is really happening, could we
simply fix it by:

diff --git a/mm/memory.c b/mm/memory.c
index b7cab8be8632..6742d1428373 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3999,7 +3999,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
        swapcache = folio;

        if (!folio) {
-               if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+               if (!is_zswap_enabled() && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
                    __swap_count(entry) == 1) {
                        /*
                         * Prevent parallel swapin from proceeding with

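A note on is_zswap_enabled(): it does not exist in mainline at this point and is
assumed here as a new helper. A minimal sketch, assuming it simply exposes zswap's
existing "enabled" module parameter from mm/zswap.c (with the static dropped),
might look like:

/* include/linux/zswap.h -- hypothetical helper, not a mainline API */
#ifdef CONFIG_ZSWAP
bool is_zswap_enabled(void);
#else
static inline bool is_zswap_enabled(void)
{
	return false;
}
#endif

/* mm/zswap.c */
bool is_zswap_enabled(void)
{
	/* zswap_enabled backs the existing "enabled" module parameter */
	return zswap_enabled;
}

With that, do_swap_page() would simply fall back to the swapcache path whenever
zswap is on, so the copy freed by an exclusive load can never be lost while the
swapcache is bypassed.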
