Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Ian Kent <raven@themaw.net>
Cc: Jinliang Zheng <alexjlzheng@gmail.com>,
	alexjlzheng@tencent.com, bfoster@redhat.com, djwong@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	rcu@vger.kernel.org
Subject: Re: About the conflict between XFS inode recycle and VFS rcu-walk
Date: Mon, 27 May 2024 19:41:18 +1000	[thread overview]
Message-ID: <ZlRVPv0EGIu5q7l9@dread.disaster.area> (raw)
In-Reply-To: <3545f78c-5e1c-4328-8ab0-19227005f4b7@themaw.net>

On Thu, May 16, 2024 at 03:23:40PM +0800, Ian Kent wrote:
> On 16/5/24 15:08, Ian Kent wrote:
> > On 16/5/24 12:56, Jinliang Zheng wrote:
> > > > I encountered the following calltrace:
> > > > 
> > > > [20213.578756] BUG: kernel NULL pointer dereference, address:
> > > > 0000000000000000
> > > > [20213.578785] #PF: supervisor instruction fetch in kernel mode
> > > > [20213.578799] #PF: error_code(0x0010) - not-present page
> > > > [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
> > > > [20213.578828] Oops: 0010 [#1] SMP NOPTI
> > > > [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump:
> > > > loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
> > > > [20213.578860] Hardware name: New H3C Technologies Co., Ltd.
> > > > UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
> > > > [20213.578884] RIP: 0010:0x0
> > > > [20213.578894] Code: Bad RIP value.
> > > > [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
> > > > [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX:
> > > > 0000000000000000
> > > > [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI:
> > > > 0000000000000000
> > > > [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09:
> > > > ffff889b9eeae380
> > > > [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12:
> > > > 0000000000000000
> > > > [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15:
> > > > ffffc90021ebfd48
> > > > [20213.578998] FS:  00007f89c534e740(0000)
> > > > GS:ffff88c07fd00000(0000) knlGS:0000000000000000
> > > > [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4:
> > > > 00000000007706e0
> > > > [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> > > > 0000000000000000
> > > > [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> > > > 0000000000000400
> > > > [20213.579079] PKRU: 55555554
> > > > [20213.579087] Call Trace:
> > > > [20213.579099]  trailing_symlink+0x1da/0x260
> > > > [20213.579112]  path_lookupat.isra.53+0x79/0x220
> > > > [20213.579125]  filename_lookup.part.69+0xa0/0x170
> > > > [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
> > > > [20213.579151]  ? getname_flags+0x4f/0x1e0
> > > > [20213.579161]  user_path_at_empty+0x3e/0x50
> > > > [20213.579172]  vfs_statx+0x76/0xe0
> > > > [20213.579182]  __do_sys_newstat+0x3d/0x70
> > > > [20213.579194]  ? fput+0x13/0x20
> > > > [20213.579203]  ? ksys_ioctl+0xb0/0x300
> > > > [20213.579213]  ? generic_file_llseek+0x24/0x30
> > > > [20213.579225]  ? fput+0x13/0x20
> > > > [20213.579233]  ? ksys_lseek+0x8d/0xb0
> > > > [20213.579243]  __x64_sys_newstat+0x16/0x20
> > > > [20213.579256]  do_syscall_64+0x4d/0x140
> > > > [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
> > > > 
> > > > <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> > > > 
> > > Please note that the kernel version I use is the one maintained by
> > > Tencent.Inc,
> > > and the baseline is v5.4. But in fact, in the latest upstream source
> > > tree,
> > > although the trailing_symlink() function has been removed, its logic
> > > has been
> > > moved to pick_link(), so the problem still exists.
> > > 
> > > Ian Kent pointed out that try_to_unlazy() was introduced in
> > > pick_link() in the
> > > latest upstream source tree, but I don't understand why this can
> > > solve the NULL
> > > ->get_link pointer dereference problem, because ->get_link pointer
> > > will be
> > > dereferenced before try_to_unlazy().
> > > 
> > > (I don't understand why Ian Kent's email didn't appear on the
> > > mailing list.)
> > 
> > It was something about html mail and I think my mail client was at fault.
> > 
> > In any case what you say is indeed correct, so the comment isn't
> > important.
> > 
> > 
> > Fact is it is still a race between the lockless path walk and inode
> > eviction
> > 
> > and xfs recycling. I believe that the xfs recycling code is very hard to
> > fix.

Not really for this case. This is simply concurrent pathwalk lookups
occurring just after the inode has been evicted from the VFS inode
cache. The first lookup hits the XFS inode cache, sees
XFS_IRECLAIMABLE, and it then enters xfs_reinit_inode() to
reinstantiate the VFS inode to an initial state. The Xfs inode
itself is still valid as it hasn't reached the XFS_IRECLAIM state
where it will be torn down and freed.

Whilst we are running xfs_reinit_inode(), a second RCU pathwalk has
been run and that it is trying to call ->get_link on that same
inode. Unfortunately, the first lookup has just set inode->f_ops =
&empty_fops as part of the VFS inode reinit, and that then triggers
the null pointer deref.

Once the first lookup has finished the inode_init_always(),
xfs_reinit_inode() resets inode->f_ops back to 
xfs_symlink_file_ops and get_link calls work again.

Fundamentally, the problem is that we are completely reinitialising
the VFS inode within the RCU grace period. i.e. while concurrent RCU
pathwalks can still be in progress and find the VFS inode whilst the
XFS inode cache is manipulating it.

What we should be doing here is a subset of inode_init_always(),
which only reinitialises the bits of the VFS inode we need to
initialise rather than the entire inode. The identity of the inode
is not changing and so we don't need to go through a transient state
where the VFS inode goes xfs symlink -> empty initialised inode ->
xfs symlink.

i.e. We need to re-initialise the non-identity related parts of the
VFS inode so the identity parts that the RCU pathwalks rely on never
change within the RCU grace period where lookups can find the VFS
inode after it has been evicted.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2024-05-27  9:41 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng
2023-12-08  0:14 ` Dave Chinner
2024-01-31  6:35   ` Jinliang Zheng
2024-01-31 19:30     ` Darrick J. Wong
2024-05-15 15:54       ` alexjlzheng
2024-05-16  4:56         ` Jinliang Zheng
2024-05-16  7:08           ` Ian Kent
2024-05-16  7:23             ` Ian Kent
2024-05-20 17:36               ` Darrick J. Wong
2024-05-21  1:35                 ` Ian Kent
2024-05-21  2:13                   ` Ian Kent
2024-05-26 15:04                     ` Jinliang Zheng
2024-05-26 17:21                       ` Paul E. McKenney
2024-05-26 23:51                     ` Ian Kent
2024-05-27  0:18                       ` Al Viro
2024-05-28 15:51                         ` Brian Foster
2024-05-27  9:41               ` Dave Chinner [this message]
2024-05-27 13:56                 ` Jinliang Zheng
2024-05-28  2:10                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZlRVPv0EGIu5q7l9@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=alexjlzheng@gmail.com \
    --cc=alexjlzheng@tencent.com \
    --cc=bfoster@redhat.com \
    --cc=djwong@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=raven@themaw.net \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).