From: Nhat Pham <nphamcs@gmail.com>
Date: Thu, 21 Mar 2024 08:25:26 -0700
Subject: Re: [External] Re: [bug report] mm/zswap: memory corruption after zswap_load()
To: Chengming Zhou
Cc: Zhongkun He, Johannes Weiner, Yosry Ahmed, Andrew Morton,
    linux-mm@kvack.org, wuyun.abel@bytedance.com, zhouchengming@bytedance.com
On Thu, Mar 21, 2024 at 2:28 AM Chengming Zhou wrote:
>
> On 2024/3/21 14:36, Zhongkun He wrote:
> > On Thu, Mar 21, 2024 at 1:24 PM Chengming Zhou wrote:
> >>
> >> On 2024/3/21 13:09, Zhongkun He wrote:
> >>> On Thu, Mar 21, 2024 at 12:42 PM Chengming Zhou wrote:
> >>>>
> >>>> On 2024/3/21 12:34, Zhongkun He wrote:
> >>>>> Hey folks,
> >>>>>
> >>>>> Recently, I tested zswap with memory reclaim on the mainline kernel
> >>>>> (6.8) and found a memory corruption issue related to exclusive loads.
> >>>>
> >>>> Is this fix included? 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> >>>> This fix avoids concurrent swapin using the same swap entry.
> >>>>
> >>>
> >>> Yes, this fix avoids concurrent swapin from different CPUs, but the
> >>> reported issue occurs on the same CPU.
> >>
> >> I think you may have misunderstood the race description in that fix's
> >> changelog: CPU0 and CPU1 just mean two concurrent threads, not two
> >> physical CPUs.
> >>
> >> Could you verify whether the problem still exists with this fix?
> >
> > Yes, I'm sure the problem still exists with this patch.
> > Here is some debug info (not mainline):
> >
> > bpftrace -e 'k:swap_readpage {printf("%lld, %lld,%ld,%ld,%ld\n%s",
> > ((struct page *)arg0)->private, nsecs, tid, pid, cpu, kstack)}' --include
> > linux/mm_types.h
>
> OK, this problem seems to happen only on SWP_SYNCHRONOUS_IO swap backends,
> which currently include zram, ramdisk, pmem, and nvdimm.
>
> Maybe it is not a good idea to use zswap on these swap backends?

My gut reaction is to say yes, but I'll refrain from making sweeping
statements about backends I'm not too familiar with. Let's see:

1. zram: I don't even know why we're putting a compressed cache... in
front of a compressed faux swap device? Ramdisk == other in-memory swap
backend, right?

2.
I looked it up, and it seems SWP_SYNCHRONOUS_IO was introduced for fast
swap storage (see the original patch series [1]). If this is the case, one
could argue there are diminishing returns for applying zswap on top of it.

[1]: https://lore.kernel.org/linux-mm/1505886205-9671-1-git-send-email-minchan@kernel.org/

>
> The problem here is that the page fault handler tries to skip the swapcache
> to swap in the folio (swap entry count == 1), but then it can't install the
> folio into the pte, since some changes happened in the meantime, such as a
> concurrent fork of the entry.
>
> Maybe we should write back that folio in this special case.

But yes, if this is simple, maybe we can do this first to fix the bug?

>
> >
> > offset   nsecs          tid    pid    cpu
> > 2140659, 595771411052, 15045, 15045, 6
> >     swap_readpage+1
> >     do_swap_page+2135
> >     handle_mm_fault+2426
> >     do_user_addr_fault+462
> >     do_page_fault+48
> >     async_page_fault+62
> >
> > offset   nsecs          tid    pid    cpu
> > 2140659, 595771424445, 15045, 15045, 6
> >     swap_readpage+1
> >     do_swap_page+2135
> >     handle_mm_fault+2426
> >     do_user_addr_fault+462
> >     do_page_fault+48
> >     async_page_fault+62
> >
> > -------------------------------
> > There are two page faults with the same tid and offset within 13393 nsecs.
> >
> >>
> >>>
> >>> Thanks.
> >>>
> >>>> Thanks.
> >>>>
> >>>>>
> >>>>>
> >>>>> root@**:/sys/fs/cgroup/zz# stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>>>> stress: info: [31753] dispatching hogs: 0 cpu, 0 io, 5 vm, 0 hdd
> >>>>> stress: FAIL: [31758] (522) memory corruption at: 0x7f347ed1a010
> >>>>> stress: FAIL: [31753] (394) <-- worker 31758 returned error 1
> >>>>> stress: WARN: [31753] (396) now reaping child worker processes
> >>>>> stress: FAIL: [31753] (451) failed run completed in 14s
> >>>>>
> >>>>>
> >>>>> 1. Test steps (the frequency of memory reclaim has been accelerated):
> >>>>> -------------------------
> >>>>> a. set up zswap, zram, and cgroup v2
> >>>>> b.
echo 0 > /sys/kernel/mm/lru_gen/enabled
> >>>>> (increases the probability of the problem occurring)
> >>>>> c. mkdir /sys/fs/cgroup/zz
> >>>>>    echo $$ > /sys/fs/cgroup/zz/cgroup.procs
> >>>>>    cd /sys/fs/cgroup/zz/
> >>>>>    stress --vm 5 --vm-bytes 1g --vm-hang 3 --vm-keep
> >>>>>
> >>>>> e. in another shell:
> >>>>>    while :; do for i in {1..5}; do echo 20g >
> >>>>>    /sys/fs/cgroup/zz/memory.reclaim & done; sleep 1; done
> >>>>>
> >>>>> 2. Root cause:
> >>>>> --------------------------
> >>>>> With a small probability, the page fault will occur twice with the
> >>>>> original pte, even if a new pte has been successfully set.
> >>>>> Unfortunately, the zswap_entry has been released during the first
> >>>>> page fault (with exclusive loads), so zswap_load will fail, and there
> >>>>> is no corresponding data in swap space: memory corruption occurs.
> >>>>>
> >>>>> bpftrace -e 'k:zswap_load {printf("%lld, %lld\n", ((struct page
> >>>>> *)arg0)->private, nsecs)}'
> >>>>> --include linux/mm_types.h > a.txt
> >>>>>
> >>>>> Look up the same index:
> >>>>>
> >>>>> index    nsecs
> >>>>> 1318876, 8976040736819
> >>>>> 1318876, 8976040746078
> >>>>>
> >>>>> 4123110, 8976234682970
> >>>>> 4123110, 8976234689736
> >>>>>
> >>>>> 2268896, 8976660124792
> >>>>> 2268896, 8976660130607
> >>>>>
> >>>>> 4634105, 8976662117938
> >>>>> 4634105, 8976662127596
> >>>>>
> >>>>> 3. Solution
> >>>>>
> >>>>> Should we free zswap_entry in batches, so that the zswap_entry is
> >>>>> still valid when the next page fault occurs with the original pte?
> >>>>> It would be great if there are other, better solutions.
> >>>>>
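For anyone following along, the root cause described in the thread can be
sketched as a toy model (Python, purely illustrative: the dictionary pool
and the `zswap_store`/`zswap_load_exclusive` names are assumptions for this
sketch, not the kernel's actual API). The point is simply that an exclusive
load destroys the stored copy, so a retried fault on the same pte finds
nothing:

```python
# Toy model of the exclusive-load race (not kernel code).
zswap_pool = {}

def zswap_store(offset, data):
    # Page is compressed and stored, keyed by its swap offset.
    zswap_pool[offset] = data

def zswap_load_exclusive(offset):
    # Exclusive load: the entry is removed from the pool as it is read,
    # on the assumption that the page now lives in memory and the pte
    # will be updated.
    return zswap_pool.pop(offset, None)

# Page at swap offset 2140659 (the offset seen in the bpftrace output
# above) is swapped out to zswap.
zswap_store(2140659, b"user data")

# First page fault: the load succeeds, but suppose the fault handler then
# fails to install the folio into the pte (e.g. a concurrent fork changed
# the entry) and the page content is dropped.
first = zswap_load_exclusive(2140659)

# Retried page fault with the original pte: the entry is already gone,
# and there is no backing data in the swap device either.
second = zswap_load_exclusive(2140659)
print("second load:", second)  # -> second load: None
```

This is why the thread discusses either writing the folio back on the
failed-install path or delaying the free (e.g. freeing entries in batches),
so that a retried fault can still find the entry.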