From: Chris Li
Date: Sat, 23 Mar 2024 03:52:41 -0700
Subject: Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load().
To: Zhongkun He
Cc: Yosry Ahmed, Chengming Zhou, Johannes Weiner, Andrew Morton, linux-mm, wuyun.abel@bytedance.com, zhouchengming@bytedance.com, Nhat Pham, Kairui Song, Minchan Kim, David Hildenbrand, Barry Song <21cnbao@gmail.com>, Ying
References: <01b0b8e8-af1d-4fbe-951e-278e882283fd@linux.dev>

On Fri, Mar 22, 2024 at 6:35 PM Zhongkun He wrote:
>
> On Sat, Mar 23, 2024 at 3:35 AM Yosry Ahmed wrote:
> >
> > On Thu, Mar 21, 2024 at 8:04 PM Zhongkun He wrote:
> > >
> > > On Thu, Mar 21, 2024 at 5:29 PM Chengming Zhou wrote:
> > > >
> > > > On 2024/3/21 14:36, Zhongkun He wrote:
> > > > > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou wrote:
> > > > >>
> > > > >> On 2024/3/21 13:09, Zhongkun He wrote:
> > > > >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou wrote:
> > > > >>>>
> > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> > > > >>>>> Hey folks,
> > > > >>>>>
> > > > >>>>> Recently, I tested the zswap with memory reclaiming in the mainline
> > > > >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> > > > >>>>
> > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > > > >>>> This fix avoids concurrent swapin using the same swap entry.
> > > > >>>>
> > > > >>>
> > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but the
> > > > >>> reported issue occurs on the same cpu.
> > > > >>
> > > > >> I think you may misunderstand the race description in this fix changelog,
> > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two CPUs.
> > > > >>
> > > > >> Could you verify if the problem still exists with this fix?
> > > > >
> > > > > Yes, I'm sure the problem still exists with this patch.
> > > > > There is some debug info, not mainline.
> > > > >
> > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --include
> > > > > linux/mm_types.h
> > > >
> > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backends,
> > > > which now include zram, ramdisk, pmem, nvdimm.
> > >
> > > Yes.
> > >
> > > >
> > > > It maybe not good to use zswap on these swap backends?
> > > >
> > > > The problem here is the page fault handler tries to skip swapcache to
> > > > swapin the folio (swap entry count == 1), but then it can't install folio
> > > > to pte entry since some changes happened such as concurrent fork of entry.
> > > >
> > >
> > > The first page fault returned VM_FAULT_RETRY because
> > > folio_lock_or_retry() failed.
>
> Hi Yosry,
>
> > How so? The folio is newly allocated and not visible to any other
> > threads or CPUs. swap_read_folio() unlocks it and then returns and we
> > immediately try to lock it again with folio_lock_or_retry(). How does
> > this fail?
>
> Haha, it makes me very confused. Based on the steps to reproduce the problem,
> I think the page is locked by shrink_folio_list(). Please see the
> following situation.
>
> do_swap_page
>   __folio_set_locked(folio);
>   swap_readpage(page, true, NULL);
>     zswap_load(folio)
>     folio_unlock(folio);
>
>                                         shrink_folio_list
>
>                                           if (!folio_trylock(folio))
>   ret |= folio_lock_or_retry(folio, vmf);
>   if (ret & VM_FAULT_RETRY)
>     goto out_release;

Thanks for the detailed bug report. So this means the folio immediately
gets reclaimed after zswap_load(), before do_swap_page returns, right?

We also need to audit if there is any other code path in the
do_swap_page that can fail a swap fault and not store the folio into
the swap cache.

Chris

>
> Thanks.
>
> >
> > Let's go over what happens after swap_read_folio():
> > - The 'if (!folio)' code block will be skipped.
> > - folio_lock_or_retry() should succeed as I mentioned earlier.
> > - The 'if (swapcache)' code block will be skipped.
> > - The pte_same() check should succeed on first look because other
> > concurrent faulting threads should be held off by the newly introduced
> > swapcache_prepare() logic. But looking deeper I think this one may
> > fail due to a concurrent MADV_WILLNEED.
> > - The 'if (unlikely(!folio_test_uptodate(folio)))' part will be
> > skipped because swap_read_folio() marks the folio up-to-date.
> > - After that point there is no possible failure until we install the
> > pte, at which point concurrent faults will fail on !pte_same() and
> > retry.
> >
> > So the only failure I think is possible is the pte_same() check. I see
> > how a concurrent MADV_WILLNEED could cause that check to fail. A
> > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once
> > the fault resolves it will go ahead and read the folio again into the
> > swapcache. It seems like we will end up with two copies of the same
> > folio? Maybe this is harmless because the folio in the swapcache will
> > never be used, but it is essentially leaked at that point, right?
> >
> > I feel like I am missing something. Adding other folks that were
> > involved in the recent swapcache_prepare() synchronization thread.
> >
> > Anyway, I agree that at least in theory the data corruption could
> > happen because of exclusive loads when skipping the swapcache, and we
> > should fix that.
> >
> > Perhaps the right thing to do may be to write the folio again to zswap
> > before unlocking it and before calling swapcache_clear(). The need for
> > the write can be detected by checking if the folio is dirty, I think
> > this will only be true if the folio was loaded from zswap.
>
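The failure mode discussed in this thread can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `toy_zswap`, `toy_folio`, `toy_load_exclusive()` and `toy_fault()` are hypothetical names, and "exclusive load" is modeled simply as invalidating the compressed copy once it has been read. The `lock_fails` flag stands in for folio_lock_or_retry() returning VM_FAULT_RETRY.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Toy stand-in for a zswap entry: one slot holding one "page" of data. */
struct toy_zswap {
    bool present;               /* true while zswap still holds a copy */
    char data[TOY_PAGE_SIZE];
};

/* Toy stand-in for a freshly allocated folio that skips the swap cache. */
struct toy_folio {
    bool uptodate;
    char data[TOY_PAGE_SIZE];
};

/*
 * Exclusive load (assumption of this model): the entry is invalidated
 * as soon as it is read, so the folio becomes the only copy of the data.
 */
bool toy_load_exclusive(struct toy_zswap *z, struct toy_folio *f)
{
    if (!z->present)
        return false;
    memcpy(f->data, z->data, TOY_PAGE_SIZE);
    f->uptodate = true;
    z->present = false;
    memset(z->data, 0, TOY_PAGE_SIZE);
    return true;
}

/*
 * One fault attempt. 'out' must point to at least TOY_PAGE_SIZE bytes.
 * Returns true only if 'out' received valid data.
 */
bool toy_fault(struct toy_zswap *z, bool lock_fails, char *out)
{
    struct toy_folio f = { .uptodate = false };

    if (!toy_load_exclusive(z, &f))
        return false;   /* no copy left; the model treats this as data loss */
    if (lock_fails)
        return false;   /* retry path: the private folio is dropped here */
    memcpy(out, f.data, TOY_PAGE_SIZE);
    return true;
}
```

Calling `toy_fault(&z, true, out)` followed by `toy_fault(&z, false, out)` models the retry: the second call fails because the first one consumed the only copy and then discarded the folio. In this model, the idea of writing the folio back before dropping it would correspond to restoring `z->present` and `z->data` on the `lock_fails` path.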