Linux-XFS Archive mirror
 help / color / mirror / Atom feed
From: Ian Kent <raven@themaw.net>
To: Jinliang Zheng <alexjlzheng@gmail.com>, alexjlzheng@tencent.com
Cc: bfoster@redhat.com, david@fromorbit.com, djwong@kernel.org,
	linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org,
	rcu@vger.kernel.org
Subject: Re: About the conflict between XFS inode recycle and VFS rcu-walk
Date: Thu, 16 May 2024 15:08:43 +0800	[thread overview]
Message-ID: <7f744bf5-5f6d-4031-8a4f-91be2cd45147@themaw.net> (raw)
In-Reply-To: <20240516045655.40122-1-alexjlzheng@tencent.com>

On 16/5/24 12:56, Jinliang Zheng wrote:
> On Wed, 15 May 2024 at 23:54:41 +0800, Jinliang Zheng wrote:
>> On Wed, 31 Jan 2024 at 11:30:18 -0800, djwong@kernel.org wrote:
>>> On Wed, Jan 31, 2024 at 02:35:17PM +0800, Jinliang Zheng wrote:
>>>> On Fri, 8 Dec 2023 11:14:32 +1100, david@fromorbit.com wrote:
>>>>> On Tue, Dec 05, 2023 at 07:38:33PM +0800, alexjlzheng@gmail.com wrote:
>>>>>> Hi, all
>>>>>>
>>>>>> I would like to ask if the conflict between xfs inode recycle and vfs rcu-walk
>>>>>> which can lead to null pointer references has been resolved?
>>>>>>
>>>>>> I browsed through emails about the following patches and their discussions:
>>>>>> - https://lore.kernel.org/linux-xfs/20220217172518.3842951-2-bfoster@redhat.com/
>>>>>> - https://lore.kernel.org/linux-xfs/20220121142454.1994916-1-bfoster@redhat.com/
>>>>>> - https://lore.kernel.org/linux-xfs/164180589176.86426.501271559065590169.stgit@mickey.themaw.net/
>>>>>>
>>>>>> And then came to the conclusion that this problem has not been solved, am I
>>>>>> right? Did I miss some patch that could solve this problem?
>>>>> We fixed the known problems this caused by turning off the VFS
>>>>> functionality that the rcu pathwalks kept tripping over. See commit
>>>>> 7b7820b83f23 ("xfs: don't expose internal symlink metadata buffers to
>>>>> the vfs").
>>>> Sorry for the delay.
>>>>
>>>> The problem I encountered in the production environment was that during the
>>>> rcu walk process the ->get_link() pointer was NULL, which caused a crash.
>>>>
>>>> As far as I know, commit 7b7820b83f23 ("xfs: don't expose internal symlink
>>>> metadata buffers to the vfs") first appeared in:
>>>> - https://lore.kernel.org/linux-fsdevel/YZvvP9RFXi3%2FjX0q@bfoster/
>>>>
>>>> Does this commit solve the problem of NULL ->get_link()? And how?
>>> I suggest reading the call stack from wherever the VFS enters the XFS
>>> readlink code.  If you have a reliable reproducer, then apply this patch
>>> to your kernel (you haven't mentioned which one it is) and see if the
>>> bad dereference goes away.
>>>
>>> --D
>> Sorry for the delay.
>>
>> I encountered the following calltrace:
>>
>> [20213.578756] BUG: kernel NULL pointer dereference, address: 0000000000000000
>> [20213.578785] #PF: supervisor instruction fetch in kernel mode
>> [20213.578799] #PF: error_code(0x0010) - not-present page
>> [20213.578812] PGD 3f01d64067 P4D 3f01d64067 PUD 3f01d65067 PMD 0
>> [20213.578828] Oops: 0010 [#1] SMP NOPTI
>> [20213.578839] CPU: 92 PID: 766 Comm: /usr/local/serv Kdump: loaded Not tainted 5.4.241-1-tlinux4-0017.3 #1
>> [20213.578860] Hardware name: New H3C Technologies Co., Ltd. UniServer R4900 G3/RS33M2C9SA, BIOS 2.00.38P02 04/14/2020
>> [20213.578884] RIP: 0010:0x0
>> [20213.578894] Code: Bad RIP value.
>> [20213.578903] RSP: 0018:ffffc90021ebfc38 EFLAGS: 00010246
>> [20213.578916] RAX: ffffffff82081f40 RBX: ffffc90021ebfce0 RCX: 0000000000000000
>> [20213.578932] RDX: ffffc90021ebfd48 RSI: ffff88bfad8d3890 RDI: 0000000000000000
>> [20213.578948] RBP: ffffc90021ebfc70 R08: 0000000000000001 R09: ffff889b9eeae380
>> [20213.578965] R10: 302d343200000067 R11: 0000000000000001 R12: 0000000000000000
>> [20213.578981] R13: ffff88bfad8d3890 R14: ffff889b9eeae380 R15: ffffc90021ebfd48
>> [20213.578998] FS:  00007f89c534e740(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000
>> [20213.579016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [20213.579030] CR2: ffffffffffffffd6 CR3: 0000003f01d90001 CR4: 00000000007706e0
>> [20213.579046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [20213.579062] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [20213.579079] PKRU: 55555554
>> [20213.579087] Call Trace:
>> [20213.579099]  trailing_symlink+0x1da/0x260
>> [20213.579112]  path_lookupat.isra.53+0x79/0x220
>> [20213.579125]  filename_lookup.part.69+0xa0/0x170
>> [20213.579138]  ? kmem_cache_alloc+0x3f/0x3f0
>> [20213.579151]  ? getname_flags+0x4f/0x1e0
>> [20213.579161]  user_path_at_empty+0x3e/0x50
>> [20213.579172]  vfs_statx+0x76/0xe0
>> [20213.579182]  __do_sys_newstat+0x3d/0x70
>> [20213.579194]  ? fput+0x13/0x20
>> [20213.579203]  ? ksys_ioctl+0xb0/0x300
>> [20213.579213]  ? generic_file_llseek+0x24/0x30
>> [20213.579225]  ? fput+0x13/0x20
>> [20213.579233]  ? ksys_lseek+0x8d/0xb0
>> [20213.579243]  __x64_sys_newstat+0x16/0x20
>> [20213.579256]  do_syscall_64+0x4d/0x140
>> [20213.579268]  entry_SYSCALL_64_after_hwframe+0x5c/0xc1
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> Please note that the kernel version I use is the one maintained by Tencent.Inc,
> and the baseline is v5.4. But in fact, in the latest upstream source tree,
> although the trailing_symlink() function has been removed, its logic has been
> moved to pick_link(), so the problem still exists.
>
> Ian Kent pointed out that try_to_unlazy() was introduced in pick_link() in the
> latest upstream source tree, but I don't understand why this can solve the NULL
> ->get_link pointer dereference problem, because ->get_link pointer will be
> dereferenced before try_to_unlazy().
>
> (I don't understand why Ian Kent's email didn't appear on the mailing list.)

It was something about html mail and I think my mail client was at fault.

In any case what you say is indeed correct, so the comment isn't important.


Fact is it is still a race between the lockless path walk and inode eviction

and xfs recycling. I believe that the xfs recycling code is very hard to 
fix.


IIRC correctly putting a NULL check in pick_link() was not considered 
acceptable

but there must be a way that is acceptable to check this and restart the 
walk.

Maybe there was a reluctance to suffer the overhead of restarting the 
walk when

it shouldn't be needed.


The alternative would be to find some way to identify when it's unsafe 
to reuse

an inode marked for re-cycle before dropping rcu read, perhaps with the 
reference

count plus the seqlock. Basically, to reuse inodes xfs will need to 
identify when

the race occurs and let the inode go away under rcu and create a new one 
if a race

is detected. But possibly that isn't nearly as simple as it sounds?


>
> Thanks,
> Jinliang Zheng
>
>> And I analyzed the disassembly of trailing_symlink() and confirmed that a NULL
>> ->get_link() happened here:
>>
>> 0xffffffff812e4850 <trailing_symlink>:	nopl   0x0(%rax,%rax,1) [FTRACE NOP]
>> 0xffffffff812e4855 <trailing_symlink+0x5>:	push   %rbp
>> 0xffffffff812e4856 <trailing_symlink+0x6>:	mov    %rsp,%rbp
>> 0xffffffff812e4859 <trailing_symlink+0x9>:	push   %r15
>> 0xffffffff812e485b <trailing_symlink+0xb>:	push   %r14
>> 0xffffffff812e485d <trailing_symlink+0xd>:	push   %r13
>> 0xffffffff812e485f <trailing_symlink+0xf>:	push   %r12
>> 0xffffffff812e4861 <trailing_symlink+0x11>:	push   %rbx
>> 0xffffffff812e4862 <trailing_symlink+0x12>:	mov    %rdi,%rbx		# rbx = &nameidate
>> 0xffffffff812e4865 <trailing_symlink+0x15>:	sub    $0x8,%rsp
>> 0xffffffff812e4869 <trailing_symlink+0x19>:	mov    0x1765845(%rip),%edx     # 0xffffffff82a4a0b4 <sysctl_protected_symlinks>
>> 0xffffffff812e486f <trailing_symlink+0x1f>:	mov    0x38(%rdi),%eax
>> 0xffffffff812e4872 <trailing_symlink+0x22>:	test   %edx,%edx
>> 0xffffffff812e4874 <trailing_symlink+0x24>:	je     0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e4876 <trailing_symlink+0x26>:	mov    %gs:0x1ad00,%rdx
>> 0xffffffff812e487f <trailing_symlink+0x2f>:	mov    0xc8(%rdi),%rcx		# rcx = nameidata->link_inode
>> 0xffffffff812e4886 <trailing_symlink+0x36>:	mov    0xc18(%rdx),%rdx
>> 0xffffffff812e488d <trailing_symlink+0x3d>:	mov    0x4(%rcx),%ecx		# ecx = link_inode->uid
>> 0xffffffff812e4890 <trailing_symlink+0x40>:	cmp    %ecx,0x1c(%rdx)
>> 0xffffffff812e4893 <trailing_symlink+0x43>:	je     0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e4895 <trailing_symlink+0x45>:	mov    0x30(%rdi),%rsi
>> 0xffffffff812e4899 <trailing_symlink+0x49>:	movzwl (%rsi),%edx
>> 0xffffffff812e489c <trailing_symlink+0x4c>:	and    $0x202,%dx
>> 0xffffffff812e48a1 <trailing_symlink+0x51>:	cmp    $0x202,%dx
>> 0xffffffff812e48a6 <trailing_symlink+0x56>:	je     0xffffffff812e495f <trailing_symlink+0x10f>
>> 0xffffffff812e48ac <trailing_symlink+0x5c>:	or     $0x10,%eax
>> 0xffffffff812e48af <trailing_symlink+0x5f>:	mov    %eax,0x38(%rbx)		# nd->flags |= LOOKUP_PARENT
>> 0xffffffff812e48b2 <trailing_symlink+0x62>:	mov    0x50(%rbx),%rax		# rax = nd->stack
>> 0xffffffff812e48b6 <trailing_symlink+0x66>:	movq   $0x0,0x20(%rax)		# stack[0].name = NULL
>> 0xffffffff812e48be <trailing_symlink+0x6e>:	mov    0x48(%rbx),%eax		# nd->depth
>> 0xffffffff812e48c1 <trailing_symlink+0x71>:	mov    0x50(%rbx),%rdx		# nd->stack
>> 0xffffffff812e48c5 <trailing_symlink+0x75>:	mov    0xc8(%rbx),%r13		# nd->link_inode
>> 0xffffffff812e48cc <trailing_symlink+0x7c>:	lea    (%rax,%rax,2),%rax	# rax = depth * 3
>> 0xffffffff812e48d0 <trailing_symlink+0x80>:	shl    $0x4,%rax		# rax = rax << 4, sizeof(saved):0x30
>> 0xffffffff812e48d4 <trailing_symlink+0x84>:	lea    -0x30(%rdx,%rax,1),%r15	# r15 = last
>> 0xffffffff812e48d9 <trailing_symlink+0x89>:	mov    0x8(%r15),%r14		# r14 = last->link.dentry
>> 0xffffffff812e48dd <trailing_symlink+0x8d>:	testb  $0x40,0x38(%rbx)
>> 0xffffffff812e48e1 <trailing_symlink+0x91>:	je     0xffffffff812e4950 <trailing_symlink+0x100>
>> 0xffffffff812e48e3 <trailing_symlink+0x93>:	mov    %r13,%rsi
>> 0xffffffff812e48e6 <trailing_symlink+0x96>:	mov    %r15,%rdi
>> 0xffffffff812e48e9 <trailing_symlink+0x99>:	callq  0xffffffff812f8a00 <atime_needs_update>
>> 0xffffffff812e48ee <trailing_symlink+0x9e>:	test   %al,%al
>> 0xffffffff812e48f0 <trailing_symlink+0xa0>:	jne    0xffffffff812e4a56 <trailing_symlink+0x206>
>> 0xffffffff812e48f6 <trailing_symlink+0xa6>:	mov    0x38(%rbx),%edx
>> 0xffffffff812e48f9 <trailing_symlink+0xa9>:	mov    %r13,%rsi
>> 0xffffffff812e48fc <trailing_symlink+0xac>:	mov    %r14,%rdi
>> 0xffffffff812e48ff <trailing_symlink+0xaf>:	shr    $0x6,%edx
>> 0xffffffff812e4902 <trailing_symlink+0xb2>:	and    $0x1,%edx
>> 0xffffffff812e4905 <trailing_symlink+0xb5>:	callq  0xffffffff81424310 <security_inode_follow_link>
>> 0xffffffff812e490a <trailing_symlink+0xba>:	movslq %eax,%r12
>> 0xffffffff812e490d <trailing_symlink+0xbd>:	test   %eax,%eax
>> 0xffffffff812e490f <trailing_symlink+0xbf>:	jne    0xffffffff812e4939 <trailing_symlink+0xe9>
>> 0xffffffff812e4911 <trailing_symlink+0xc1>:	movl   $0x4,0x44(%rbx)
>> 0xffffffff812e4918 <trailing_symlink+0xc8>:	mov    0x248(%r13),%r12
>> 0xffffffff812e491f <trailing_symlink+0xcf>:	test   %r12,%r12
>> 0xffffffff812e4922 <trailing_symlink+0xd2>:	je     0xffffffff812e49e5 <trailing_symlink+0x195>
>> 0xffffffff812e4928 <trailing_symlink+0xd8>:	movzbl (%r12),%eax
>> 0xffffffff812e492d <trailing_symlink+0xdd>:	cmp    $0x2f,%al
>> 0xffffffff812e492f <trailing_symlink+0xdf>:	je     0xffffffff812e49b7 <trailing_symlink+0x167>
>> 0xffffffff812e4935 <trailing_symlink+0xe5>:	test   %al,%al
>> 0xffffffff812e4937 <trailing_symlink+0xe7>:	je     0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e4939 <trailing_symlink+0xe9>:	test   %r12,%r12
>> 0xffffffff812e493c <trailing_symlink+0xec>:	je     0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e493e <trailing_symlink+0xee>:	add    $0x8,%rsp
>> 0xffffffff812e4942 <trailing_symlink+0xf2>:	mov    %r12,%rax
>> 0xffffffff812e4945 <trailing_symlink+0xf5>:	pop    %rbx
>> 0xffffffff812e4946 <trailing_symlink+0xf6>:	pop    %r12
>> 0xffffffff812e4948 <trailing_symlink+0xf8>:	pop    %r13
>> 0xffffffff812e494a <trailing_symlink+0xfa>:	pop    %r14
>> 0xffffffff812e494c <trailing_symlink+0xfc>:	pop    %r15
>> 0xffffffff812e494e <trailing_symlink+0xfe>:	pop    %rbp
>> 0xffffffff812e494f <trailing_symlink+0xff>:	retq
>> 0xffffffff812e4950 <trailing_symlink+0x100>:	mov    %r15,%rdi
>> 0xffffffff812e4953 <trailing_symlink+0x103>:	callq  0xffffffff812f8ae0 <touch_atime>
>> 0xffffffff812e4958 <trailing_symlink+0x108>:	callq  0xffffffff81a26410 <_cond_resched>
>> 0xffffffff812e495d <trailing_symlink+0x10d>:	jmp    0xffffffff812e48f6 <trailing_symlink+0xa6>
>> 0xffffffff812e495f <trailing_symlink+0x10f>:	mov    0x4(%rsi),%edx
>> 0xffffffff812e4962 <trailing_symlink+0x112>:	cmp    $0xffffffff,%edx
>> 0xffffffff812e4965 <trailing_symlink+0x115>:	je     0xffffffff812e496f <trailing_symlink+0x11f>
>> 0xffffffff812e4967 <trailing_symlink+0x117>:	cmp    %edx,%ecx
>> 0xffffffff812e4969 <trailing_symlink+0x119>:	je     0xffffffff812e48ac <trailing_symlink+0x5c>
>> 0xffffffff812e496f <trailing_symlink+0x11f>:	mov    $0xfffffffffffffff6,%r12
>> 0xffffffff812e4976 <trailing_symlink+0x126>:	test   $0x40,%al
>> 0xffffffff812e4978 <trailing_symlink+0x128>:	jne    0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e497a <trailing_symlink+0x12a>:	mov    %gs:0x1ad00,%rax
>> 0xffffffff812e4983 <trailing_symlink+0x133>:	mov    0xce0(%rax),%rax
>> 0xffffffff812e498a <trailing_symlink+0x13a>:	test   %rax,%rax
>> 0xffffffff812e498d <trailing_symlink+0x13d>:	je     0xffffffff812e4999 <trailing_symlink+0x149>
>> 0xffffffff812e498f <trailing_symlink+0x13f>:	mov    (%rax),%eax
>> 0xffffffff812e4991 <trailing_symlink+0x141>:	test   %eax,%eax
>> 0xffffffff812e4993 <trailing_symlink+0x143>:	je     0xffffffff812e4a6f <trailing_symlink+0x21f>
>> 0xffffffff812e4999 <trailing_symlink+0x149>:	mov    $0xffffffff82319b4f,%rdi
>> 0xffffffff812e49a0 <trailing_symlink+0x150>:	mov    $0xfffffffffffffff3,%r12
>> 0xffffffff812e49a7 <trailing_symlink+0x157>:	callq  0xffffffff81161310 <audit_log_link_denied>
>> 0xffffffff812e49ac <trailing_symlink+0x15c>:	jmp    0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e49ae <trailing_symlink+0x15e>:	mov    $0xffffffff8230164d,%r12
>> 0xffffffff812e49b5 <trailing_symlink+0x165>:	jmp    0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e49b7 <trailing_symlink+0x167>:	cmpq   $0x0,0x20(%rbx)
>> 0xffffffff812e49bc <trailing_symlink+0x16c>:	je     0xffffffff812e4a8a <trailing_symlink+0x23a>
>> 0xffffffff812e49c2 <trailing_symlink+0x172>:	mov    %rbx,%rdi
>> 0xffffffff812e49c5 <trailing_symlink+0x175>:	callq  0xffffffff812e2da0 <nd_jump_root>
>> 0xffffffff812e49ca <trailing_symlink+0x17a>:	test   %eax,%eax
>> 0xffffffff812e49cc <trailing_symlink+0x17c>:	jne    0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e49d2 <trailing_symlink+0x182>:	add    $0x1,%r12
>> 0xffffffff812e49d6 <trailing_symlink+0x186>:	movzbl (%r12),%eax
>> 0xffffffff812e49db <trailing_symlink+0x18b>:	cmp    $0x2f,%al
>> 0xffffffff812e49dd <trailing_symlink+0x18d>:	jne    0xffffffff812e4935 <trailing_symlink+0xe5>
>> 0xffffffff812e49e3 <trailing_symlink+0x193>:	jmp    0xffffffff812e49d2 <trailing_symlink+0x182>
>> 0xffffffff812e49e5 <trailing_symlink+0x195>:	mov    0x20(%r13),%rax		# inode->i_op
>> 0xffffffff812e49e9 <trailing_symlink+0x199>:	add    $0x10,%r15
>> 0xffffffff812e49ed <trailing_symlink+0x19d>:	mov    %r13,%rsi
>> 0xffffffff812e49f0 <trailing_symlink+0x1a0>:	mov    %r15,%rdx
>> 0xffffffff812e49f3 <trailing_symlink+0x1a3>:	mov    0x8(%rax),%rcx		# inode_operations->get_link
>> 0xffffffff812e49f7 <trailing_symlink+0x1a7>:	testb  $0x40,0x38(%rbx)
>> 0xffffffff812e49fb <trailing_symlink+0x1ab>:	jne    0xffffffff812e4a1f <trailing_symlink+0x1cf>
>> 0xffffffff812e49fd <trailing_symlink+0x1ad>:	mov    %r14,%rdi		# nd->flags & LOOKUP_RCU == 0
>> 0xffffffff812e4a00 <trailing_symlink+0x1b0>:	callq  0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>> 0xffffffff812e4a05 <trailing_symlink+0x1b5>:	mov    %rax,%r12
>> 0xffffffff812e4a08 <trailing_symlink+0x1b8>:	test   %r12,%r12
>> 0xffffffff812e4a0b <trailing_symlink+0x1bb>:	je     0xffffffff812e49ae <trailing_symlink+0x15e>
>> 0xffffffff812e4a0d <trailing_symlink+0x1bd>:	cmp    $0xfffffffffffff000,%r12
>> 0xffffffff812e4a14 <trailing_symlink+0x1c4>:	jbe    0xffffffff812e4928 <trailing_symlink+0xd8>
>> 0xffffffff812e4a1a <trailing_symlink+0x1ca>:	jmpq   0xffffffff812e493e <trailing_symlink+0xee>
>> 0xffffffff812e4a1f <trailing_symlink+0x1cf>:	xor    %edi,%edi		# nd->flags & LOOKUP_RCU != 0
>> 0xffffffff812e4a21 <trailing_symlink+0x1d1>:	mov    %rcx,-0x30(%rbp)
>> 0xffffffff812e4a25 <trailing_symlink+0x1d5>:	callq  0xffffffff81e00f70 <__x86_indirect_thunk_rcx> # jmpq *%rcx
>> 0xffffffff812e4a2a <trailing_symlink+0x1da>:	mov    %rax,%r12
>> 0xffffffff812e4a2d <trailing_symlink+0x1dd>:	cmp    $0xfffffffffffffff6,%rax
>> 0xffffffff812e4a31 <trailing_symlink+0x1e1>:	jne    0xffffffff812e4a08 <trailing_symlink+0x1b8>
>> 0xffffffff812e4a33 <trailing_symlink+0x1e3>:	mov    %rbx,%rdi
>> 0xffffffff812e4a36 <trailing_symlink+0x1e6>:	callq  0xffffffff812e3840 <unlazy_walk>
>> 0xffffffff812e4a3b <trailing_symlink+0x1eb>:	test   %eax,%eax
>> 0xffffffff812e4a3d <trailing_symlink+0x1ed>:	jne    0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e4a3f <trailing_symlink+0x1ef>:	mov    %r15,%rdx
>> 0xffffffff812e4a42 <trailing_symlink+0x1f2>:	mov    %r13,%rsi
>> 0xffffffff812e4a45 <trailing_symlink+0x1f5>:	mov    %r14,%rdi
>> 0xffffffff812e4a48 <trailing_symlink+0x1f8>:	mov    -0x30(%rbp),%rcx
>> 0xffffffff812e4a4c <trailing_symlink+0x1fc>:	callq  0xffffffff81e00f70 <__x86_indirect_thunk_rcx>
>> 0xffffffff812e4a51 <trailing_symlink+0x201>:	mov    %rax,%r12
>> 0xffffffff812e4a54 <trailing_symlink+0x204>:	jmp    0xffffffff812e4a08 <trailing_symlink+0x1b8>
>> 0xffffffff812e4a56 <trailing_symlink+0x206>:	mov    %rbx,%rdi
>> 0xffffffff812e4a59 <trailing_symlink+0x209>:	callq  0xffffffff812e3840 <unlazy_walk>
>> 0xffffffff812e4a5e <trailing_symlink+0x20e>:	test   %eax,%eax
>> 0xffffffff812e4a60 <trailing_symlink+0x210>:	jne    0xffffffff812e4a97 <trailing_symlink+0x247>
>> 0xffffffff812e4a62 <trailing_symlink+0x212>:	mov    %r15,%rdi
>> 0xffffffff812e4a65 <trailing_symlink+0x215>:	callq  0xffffffff812f8ae0 <touch_atime>
>> 0xffffffff812e4a6a <trailing_symlink+0x21a>:	jmpq   0xffffffff812e48f6 <trailing_symlink+0xa6>
>> 0xffffffff812e4a6f <trailing_symlink+0x21f>:	mov    0x50(%rbx),%rax
>> 0xffffffff812e4a73 <trailing_symlink+0x223>:	mov    0xb8(%rbx),%rdi
>> 0xffffffff812e4a7a <trailing_symlink+0x22a>:	xor    %edx,%edx
>> 0xffffffff812e4a7c <trailing_symlink+0x22c>:	mov    0x8(%rax),%rsi
>> 0xffffffff812e4a80 <trailing_symlink+0x230>:	callq  0xffffffff811673f0 <__audit_inode>
>> 0xffffffff812e4a85 <trailing_symlink+0x235>:	jmpq   0xffffffff812e4999 <trailing_symlink+0x149>
>> 0xffffffff812e4a8a <trailing_symlink+0x23a>:	mov    %rbx,%rdi
>> 0xffffffff812e4a8d <trailing_symlink+0x23d>:	callq  0xffffffff812e4790 <set_root>
>> 0xffffffff812e4a92 <trailing_symlink+0x242>:	jmpq   0xffffffff812e49c2 <trailing_symlink+0x172>
>> 0xffffffff812e4a97 <trailing_symlink+0x247>:	mov    $0xfffffffffffffff6,%r12
>> 0xffffffff812e4a9e <trailing_symlink+0x24e>:	jmpq   0xffffffff812e493e <trailing_symlink+0xee>
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>> According to my understanding, the problem solved by commit 7b7820b83f23 ("xfs:
>> don't expose internal symlink metadata buffers to the vfs") is a data NULL
>> pointer dereference, but the problem here is an instruction NULL pointer
>> dereference.
>>
>> Further, I analyzed the possible triggering process as follows:
>>
>> rcu_walk            do_unlinkat ~~> prune_dcache_sb         create
>>                                                              
>>                                                              
>> rcu_read_lock
>>                                                                     
>> read_seqcount_retry
>> (the last check)      iput_final
>>                          evict
>>                            destroy_inode
>>                              xfs_fs_destroy_inode
>>                                xfs_inode_set_reclaim_tag       xfs_ialloc
>>                                  spin_lock(ip->i_flags_lock)     xfs_dialloc
>>                                  set(ip, XFS_IRECLAIMABLE)       xfs_iget
>>                                  wakeup(xfs_reclaim_worker)        rcu_read_lock
>>                                  spin_unlock(ip->i_flags_lock)     xfs_iget_cache_hit
>>                                                                      spin_lock(ip->i_flags_lock)
>>                                                                      if (XFS_IRECLAIMABLE && !XFS_IRECLAIM)
>>                                                                        set(ip, XFS_IRECLAIM)
>>                                                                      spin_unlock(ip->i_flags_lock)
>>                                                                      rcu_read_unlock
>>                                                                      < ------------ >
>>                                                                      // miss synchronize_rcu()
>>                                                                      xfs_reinit_inode
>>                                                                        ->get_link = NULL
>> get_link() // NULL
>>
>> rcu_read_unlock
>>
>> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>>
>> Therefore, I think that after commit 7b7820b83f23 ("xfs: don't expose internal
>> symlink metadata buffers to the vfs"), we should start processing this NULL
>> ->get_link pointer dereference.
>>
>> Or, am I thinking wrong somewhere?
>>
>> Thanks,
>> Jinliang Zheng
>>
>>>>> Apart from that issue, I'm not aware of any other issues that the
>>>>> XFS inode recycling directly exposes.
>>>>>
>>>>>> According to my understanding, the essence of this problem is that XFS reuses
>>>>>> the inode evicted by VFS, but VFS rcu-walk assumes that this will not happen.
>>>>> It assumes that the inode will not change identity during the RCU
>>>>> grace period after the inode has been evicted from cache. We can
>>>>> safely reinstantiate an evicted inode without waiting for an RCU
>>>>> grace period as long as it is the same inode with the same content
>>>>> and same state.
>>>>>
>>>>> Problems *may* arise when we unlink the inode, then evict it, then a
>>>>> new file is created and the old slab cache memory address is used
>>>>> for the new inode. I describe the issue here:
>>>>>
>>>>> https://lore.kernel.org/linux-xfs/20220118232547.GD59729@dread.disaster.area/
>>>> And judging from the relevant emails, the main reason why ->get_link() is set
>>>> to NULL should be the lack of synchronize_rcu() before xfs_reinit_inode() when
>>>> the inode is chosen to be reused.
>>>>
>>>> However, perhaps due to performance reasons, this solution has not been merged
>>>> for a long time. How is it now?
>>>>
>>>> Maybe I am missing something in the threads of mail?
>>>>
>>>> Thank you very much. :)
>>>> Jinliang Zheng
>>>>
>>>>> That said, we have exactly zero evidence that this is actually a
>>>>> problem in production systems. We did get systems tripping over the
>>>>> symlink issue, but there's no evidence that the
>>>>> unlink->close->open(O_CREAT) issues are manifesting in the wild and
>>>>> hence there hasn't been any particular urgency to address it.
>>>>>
>>>>>> Are there any recommended workarounds until an elegant and efficient solution
>>>>>> can be proposed? After all, causing a crash is extremely unacceptable in a
>>>>>> production environment.
>>>>> What crashes are you seeing in your production environment?
>>>>>
>>>>> -Dave.
>>>>> -- 
>>>>> Dave Chinner
>>>>> david@fromorbit.com

  reply	other threads:[~2024-05-16  7:08 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-05 11:38 About the conflict between XFS inode recycle and VFS rcu-walk alexjlzheng
2023-12-08  0:14 ` Dave Chinner
2024-01-31  6:35   ` Jinliang Zheng
2024-01-31 19:30     ` Darrick J. Wong
2024-05-15 15:54       ` alexjlzheng
2024-05-16  4:56         ` Jinliang Zheng
2024-05-16  7:08           ` Ian Kent [this message]
2024-05-16  7:23             ` Ian Kent
2024-05-20 17:36               ` Darrick J. Wong
2024-05-21  1:35                 ` Ian Kent
2024-05-21  2:13                   ` Ian Kent
2024-05-26 15:04                     ` Jinliang Zheng
2024-05-26 17:21                       ` Paul E. McKenney
2024-05-26 23:51                     ` Ian Kent
2024-05-27  0:18                       ` Al Viro
2024-05-28 15:51                         ` Brian Foster
2024-05-27  9:41               ` Dave Chinner
2024-05-27 13:56                 ` Jinliang Zheng
2024-05-28  2:10                   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7f744bf5-5f6d-4031-8a4f-91be2cd45147@themaw.net \
    --to=raven@themaw.net \
    --cc=alexjlzheng@gmail.com \
    --cc=alexjlzheng@tencent.com \
    --cc=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).