From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 27DFDC47DD9 for ; Fri, 22 Mar 2024 23:09:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 87FE86B0085; Fri, 22 Mar 2024 19:09:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 830D26B0087; Fri, 22 Mar 2024 19:09:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6F79A6B0088; Fri, 22 Mar 2024 19:09:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 612856B0085 for ; Fri, 22 Mar 2024 19:09:19 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id CE2B2A140D for ; Fri, 22 Mar 2024 23:09:18 +0000 (UTC) X-FDA: 81926217996.05.33D74C7 Received: from mail-ej1-f45.google.com (mail-ej1-f45.google.com [209.85.218.45]) by imf28.hostedemail.com (Postfix) with ESMTP id EAB76C0015 for ; Fri, 22 Mar 2024 23:09:16 +0000 (UTC) Authentication-Results: imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=P8IM2K7X; spf=pass (imf28.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.45 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1711148957; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=C5WTy6MAGUlBCZ4R+m2+UUrHm7qP19VX8v4AZHCZdws=; b=avzKyR4wxR30UI3ss0XkWnrwf+zXyrjMdlvdHiNl2abH5LDSzhm0HfZQflevubppioyQng nZGMag0Y5fJ6rA6m4VMMLn70NZp0u4syDoQ3+rAC/Cpfnrpr2vR44hy0dQckoGJ+L0NvgN 6a/gz4IK8w2DNKdLtvylC3o45aACmoM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1711148957; a=rsa-sha256; cv=none; b=BtbGYHLl8c0OHofE7vgYE2oeynA7hHYSNc7GBQhM28yWOJgl5S3SYqfMD0zdhy/NTRdcYi 01irheWHVlRuQGOAlLE6tJn97abxPcqHZYoeglEc9b8zuDhlYmCrJ4xCMLPRgw5Sb2q/Ha LGLLrYn8Bj2Zp3a29YvPMEt6WscjaZs= ARC-Authentication-Results: i=1; imf28.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=P8IM2K7X; spf=pass (imf28.hostedemail.com: domain of yosryahmed@google.com designates 209.85.218.45 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com Received: by mail-ej1-f45.google.com with SMTP id a640c23a62f3a-a45f257b81fso323861266b.0 for ; Fri, 22 Mar 2024 16:09:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1711148955; x=1711753755; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=C5WTy6MAGUlBCZ4R+m2+UUrHm7qP19VX8v4AZHCZdws=; b=P8IM2K7X4siK92CGs2NEI7dTf1GFJ9d4ABNKHgQLNaXoZDjLQw9L2YynTJwRR+uL2w vlEQ59A3I3rh4j1vHEKcVdIQxz/5DR8/ekOUS1Jdow0mDCnOjqKUkYptvIA+dFyQ+fEf Mxj6pErM2+hgZ70kvsfWOfkWTfe2pso9AlFTIxqW8iCTUoddui49W/ePBucPItjC1dXZ cC2TNRfaydhqPLVtpViuIfCUPce7q9kgjojFTJNs7qu4JCtHbdElB5MiiG5HDeuUfK3b glb5pAxAAu51CFr8dH+Ik1E6MY9K1OB1Jxxtq1dG/DboCQ7f8wu3U/0fu1YaAm2p56Fv j8Ug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1711148955; x=1711753755; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=C5WTy6MAGUlBCZ4R+m2+UUrHm7qP19VX8v4AZHCZdws=; b=gzjH/hVw7fguhriXzCxgHLSsriTXcKDF+s/HuYE1FseG/H67J/nNPvbV94pltX/FCS cRcKVYDA1+KrSlB7mltY14RPYZigABSPXEjPEcw7AzwU1zlms4Ji7U67m+V+c6Q97Cz6 0EnQPCI5l7R4yBB0uBgHoNuKJYdqN/JYgBgpW7K3MTEX2jf3Mzpr8gWv7vWTkSBLWWFa guaXBUbqOdo/ZWuwlIZUqsSQQ/yHla3G+s5UyGxvRJjd1h++smey879uKDToDNArm4Tm 4RlQCmDqIfxy4geT9niVaeOZAzTbAC+TMhXa2anN+VddIcKzl4sg0E04cDiKjoH1o+DH bRSA== X-Forwarded-Encrypted: i=1; AJvYcCWIu+1HXtJSsIIhAPz05H7hfZf0ZBaiIrcHxuoHPCgyzoJHsge9VwCX5hLPs0r92cC9BN5q6d6o5E1F/1YY0yqjiPo= X-Gm-Message-State: AOJu0YzUXCHNyhalO3XuGBa7e501LQ1fsRCS/ViAby1qIUlP6e/RRpbz n0r8zwsqOfkBmsBkPCgHQdDvObDmww3j2mQ5RgKkQZpV0CPjPvhE2ml7abrHUcQg7ICHnEjpa4V lk3G+v8KfQ1YT0tHqRs1DNl5LOBp8vvGvMUJk X-Google-Smtp-Source: AGHT+IHslifyo9Jh9JVJ3c4mQuhGHUKWexRltEeIC0NskGEPUuonuCvTlK32vpZL5FckCNgJS3D+FC3kxtXEaEEmCI0= X-Received: by 2002:a17:906:1916:b0:a47:3409:2946 with SMTP id a22-20020a170906191600b00a4734092946mr823886eje.1.1711148955081; Fri, 22 Mar 2024 16:09:15 -0700 (PDT) MIME-Version: 1.0 References: <01b0b8e8-af1d-4fbe-951e-278e882283fd@linux.dev> In-Reply-To: From: Yosry Ahmed Date: Fri, 22 Mar 2024 16:08:37 -0700 Message-ID: Subject: Re: [External] Re: [bug report] mm/zswap :memory corruption after zswap_load(). To: Barry Song <21cnbao@gmail.com> Cc: Zhongkun He , Chengming Zhou , Johannes Weiner , Andrew Morton , linux-mm , wuyun.abel@bytedance.com, zhouchengming@bytedance.com, Nhat Pham , Kairui Song , Minchan Kim , David Hildenbrand , Chris Li , Ying Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: EAB76C0015 X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: hpsy5iz4rhjqorn4wor1xey7eyuf3c84 X-HE-Tag: 1711148956-476158 X-HE-Meta: U2FsdGVkX1/KibRP31AhdAFVeMxYRfwjzgNFd99akf0sCLi0mqqQz7IX2lNKg6Pu02GprvashxWKTMmm+shu1e6mlvIt/XjJCu14XSDWYhx08SkgcZj8VpNZqnff4ixr0WjVcQDL08dGAfUug4IQfk/cFzXeo5qg0/QhIxfzxNkO177LAVcpPmABz7Xugf8fnJvK9kaVTXxcMCPMgBrk073Az21HAFi5busHk2fd062ypLK4DWmdlaHzIDbBhftuL59TNrf2cEJWnIQe/J2drKcevEd9YLlHyGJHCJyODmHDPkOPYcRIYPt3EzO8mD0Fu0eES/XcjDl2KGyqMoqVtE2eFkkYeIIooZwxyjnJqr0HdqDL9BpL1iwrvkDZnHYLhIyhXpCaDASbm+StcsCPEaB17YBHpgDjGsXXSCGF563vDTgRZzVzO5KiGbnHrI3sS1a/Mk26tjUFnypbL31S8Sss0m0T3x0+6ZeXVAcrRXzhZMEpP81+JCYi2pKfWItuSRa7bzXDAyfmARxVZh8xwaGTT85eo3KBWF1uKsH614L+dGGMeoAH9tGwKeY/zLjABBsTJlIEzlasB6uVlu4kt0g0eoMvtNSHrOpBLEeDD73xDKXwWGxIvVehE8TgorIzzTOMBiMbQDrpyU9G5CObntVggRPKtZTxrHlT0V+QItZH9LcdJt9424j14psh/EpdxkiXUCt+6CCnIfxVtmFOjEwf8n/d3qd5RZRQQGua01nuXj42JwmXukBwY3Isr/aTWGC4geTcWKPw+G1D2XHUcigCHG5ReGG3W71Nov4/iu48R8eybBLXtGaNTIM7aEizUSU3iiJMIFaZQby8sYoN3mnLg6Esflyb8Vx0bG0aIQIBMkBxycd7iOPi1vekW9U3NlgzHA/fv4h1+RkAJcMGzeclEiEJ08PZwgMbYWEITPgGHBRnAEouGNc3jp+shRBFe08xAWwYTalmM8lgBmv 7Za7/06B hdza6rSAuWutUgDnRcCmjEBUDjSUChHLx+nqXdRSdxbUT3+sONJpAZ6nW8H7uUmprU1HJAFUNxpRV/2nPdyTORoH+6muHEWRH88VU27QpirHjnbXbg8KyC2L7a1xZobSbs+zC1qwfT1TqCR2J61ifXvJepJSMmkFDv/I4T6kAypkywYbYHqjJRDVkTo2itJzEla9I+zCj+QspXovJqfXFucyGuu2r8EDIfp3F7yJn0hwSCNpoYuz8FeFJx84i7aA/NH5rqfvRMp5grFuVGrO1ZN1OkA3tn68C23aLAbOEPp/U0xAte3hk9kbfYm9MXl8fRa9Y3CJHS0vy2e2+34Pe4zpPgi3NcAALUSrwcSoYY0WIq2crJlb/XlLT4O19xW/+bfBw X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Mar 22, 2024 at 4:04=E2=80=AFPM Barry Song <21cnbao@gmail.com> wrot= e: > > On Sat, Mar 23, 2024 at 8:35=E2=80=AFAM Yosry Ahmed wrote: > > > > On Thu, Mar 21, 2024 at 8:04=E2=80=AFPM Zhongkun He > > wrote: > > > > > > On Thu, Mar 21, 2024 at 5:29=E2=80=AFPM Chengming Zhou wrote: > > > > > > > > On 2024/3/21 14:36, Zhongkun He wrote: > > > > > On Thu, Mar 21, 2024 at 1:24=E2=80=AFPM Chengming Zhou wrote: > > > > >> > > > > >> On 2024/3/21 13:09, Zhongkun He wrote: > > > > >>> On Thu, Mar 21, 2024 at 12:42=E2=80=AFPM Chengming Zhou > > > > >>> wrote: > > > > >>>> > > > > >>>> On 2024/3/21 12:34, Zhongkun He wrote: > > > > >>>>> Hey folks, > > > > >>>>> > > > > >>>>> Recently, I tested the zswap with memory reclaiming in the ma= inline > > > > >>>>> (6.8) and found a memory corruption issue related to exclusiv= e loads. > > > > >>>> > > > > >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when sk= ipping swapcache") > > > > >>>> This fix avoids concurrent swapin using the same swap entry. > > > > >>>> > > > > >>> > > > > >>> Yes, This fix avoids concurrent swapin from different cpu, but = the > > > > >>> reported issue occurs > > > > >>> on the same cpu. > > > > >> > > > > >> I think you may misunderstand the race description in this fix c= hangelog, > > > > >> the CPU0 and CPU1 just mean two concurrent threads, not real two= CPUs. > > > > >> > > > > >> Could you verify if the problem still exists with this fix? > > > > > > > > > > Yes=EF=BC=8CI'm sure the problem still exists with this patch. > > > > > There is some debug info, not mainline. > > > > > > > > > > bpftrace -e'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s", > > > > > ((struct page *)arg0)->private,nsecs,tid,pid,cpu,kstack)}' --incl= ude > > > > > linux/mm_types.h > > > > > > > > Ok, this problem seems only happen on SWP_SYNCHRONOUS_IO swap backe= nds, > > > > which now include zram, ramdisk, pmem, nvdimm. > > > > > > Yes. > > > > > > > > > > > It maybe not good to use zswap on these swap backends? > > > > > > > > The problem here is the page fault handler tries to skip swapcache = to > > > > swapin the folio (swap entry count =3D=3D 1), but then it can't ins= tall folio > > > > to pte entry since some changes happened such as concurrent fork of= entry. > > > > > > > > > > The first page fault returned VM_FAULT_RETRY because > > > folio_lock_or_retry() failed. > > > > How so? The folio is newly allocated and not visible to any other > > threads or CPUs. swap_read_folio() unlocks it and then returns and we > > immediately try to lock it again with folio_lock_or_retry(). How does > > this fail? > > > > Let's go over what happens after swap_read_folio(): > > - The 'if (!folio)' code block will be skipped. > > - folio_lock_or_retry() should succeed as I mentioned earlier. > > - The 'if (swapcache)' code block will be skipped. > > - The pte_same() check should succeed on first look because other > > concurrent faulting threads should be held off by the newly introduced > > swapcache_prepare() logic. But looking deeper I think this one may > > fail due to a concurrent MADV_WILLNEED. > > - The 'if (unlikely(!folio_test_uptodate(folio)))` part will be > > skipped because swap_read_folio() marks the folio up-to-date. > > - After that point there is no possible failure until we install the > > pte, at which point concurrent faults will fail on !pte_same() and > > retry. > > > > So the only failure I think is possible is the pte_same() check. I see > > how a concurrent MADV_WILLNEED could cause that check to fail. A > > concurrent MADV_WILLNEED will block on swapcache_prepare(), but once > > the fault resolves it will go ahead and read the folio again into the > > swapcache. It seems like we will end up with two copies of the same > > but zswap has freed the object when the do_swap_page finishes swap_read_f= olio > due to exclusive load feature of zswap? > > so WILLNEED will get corrupted data and put it into swapcache. > some other concurrent new forked process might get the new data > from the swapcache WILLNEED puts when the new-forked process > goes into do_swap_page. Oh I was wondering how synchronization with WILLNEED happens without zswap. It seems like we could end up with two copies of the same folio and one of them will be leaked unless I am missing something. > > so very likely a new process is forked right after do_swap_page finishes > swap_read_folio and before swapcache_clear. > > > folio? Maybe this is harmless because the folio in the swacache will > > never be used, but it is essentially leaked at that point, right? > > > > I feel like I am missing something. Adding other folks that were > > involved in the recent swapcache_prepare() synchronization thread. > > > > Anyway, I agree that at least in theory the data corruption could > > happen because of exclusive loads when skipping the swapcache, and we > > should fix that. > > > > Perhaps the right thing to do may be to write the folio again to zswap > > before unlocking it and before calling swapcache_clear(). The need for > > the write can be detected by checking if the folio is dirty, I think > > this will only be true if the folio was loaded from zswap. > > we only need to write when we know swap_read_folio() gets data > from zswap but not swapfile. is there a quick way to do this? The folio will be dirty when loaded from zswap, so we can check if the folio is dirty and write the page if fail after swap_read_folio().