v9fs.lists.linux.dev archive mirror
From: Dominique Martinet <asmadeus@codewreck.org>
To: Eric Van Hensbergen <eric.vanhensbergen@linux.dev>
Cc: v9fs@lists.linux.dev
Subject: recap of 9p problems in 6.9
Date: Mon, 8 Apr 2024 13:17:07 +0900	[thread overview]
Message-ID: <ZhNvwwYRN4-EczYi@codewreck.org> (raw)

Hi Eric (& anyone else reading on the list),

I've spent quite a bit of time testing after Kent's report last week, and
we have a few problems, so I'm trying to recap a bit before I blow up.
(I'm really busy right now, so this is eating up work hours I'm not
comfortable spending on it, and I'll need to tone it down quite fast.)


* Kent's weird error when running xfstests from 9p:
https://lkml.kernel.org/r/f6upxoxa6d2c6cbh4ka775msggvuduigiu7xgvfx7qsufg2lo6@2ellaad6b2on
He says reverting the new patches that landed in 6.9 fixes it, but he
hasn't had time to bisect yet as the reproduction rate is too low.
I couldn't reproduce it on my end either, but we need to either confirm
a single patch to revert, or revert the whole series for another cycle
if we don't figure it out in a few weeks; I don't like the idea of a
stable release with this bug.


* open / unlink / fstat|ftruncate etc fail
https://lkml.kernel.org/r/E7D462A2-EE93-4A57-9F15-8565EE1567F3@linux.dev

I haven't confirmed it yet, but I think this is a new bug? Possibly the
'fix dups even in uncached mode' patch dropping v9fs_drop_inode().
If it is just a new bug it's easy enough to test, but I haven't looked yet.


* Running apt install in a VM with a 9p rootfs (default cache=none)
got me this warning once:
```
[   64.291867] ------------[ cut here ]------------
[   64.292458] WARNING: CPU: 0 PID: 161 at fs/inode.c:332 drop_nlink+0x2a/0x40
[   64.293380] Modules linked in: 9p netfs
[   64.293818] CPU: 0 PID: 161 Comm: dpkg Not tainted 6.9.0-rc2+ #20
[   64.294583] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   64.296043] RIP: 0010:drop_nlink+0x2a/0x40
[   64.296518] Code: 0f 1f 44 00 00 55 8b 47 38 48 89 e5 85 c0 74 1a 83 e8 01 89 47 38 75 0c 48 8b 47 18 3e 48 ff 80 30 04 00 00 5d c3 cc cc cc cc <0f> 0b c7 47 38 ff ff ff ff 5d c3 cc cc cc cc 0f 1f 80 00 00 00 00
[   64.298896] RSP: 0018:ffff9e3c8072be18 EFLAGS: 00010246
[   64.299440] RAX: 0000000000000000 RBX: ffff9c49414c4540 RCX: 000000000002bd46
[   64.300239] RDX: ffff9c4941bdac0c RSI: 0000000000033990 RDI: ffff9c49414e8dc0
[   64.301034] RBP: ffff9e3c8072be18 R08: 0000000000000000 R09: ffff9e3c8072bd78
[   64.302328] R10: ffff9e3c8072bd80 R11: ffff9c4942997298 R12: ffff9c49414e8b00
[   64.303610] R13: 0000000000000000 R14: ffff9c49414e8dc0 R15: 0000000000000000
[   64.304904] FS:  00007fdf1a979d00(0000) GS:ffff9c495f200000(0000) knlGS:0000000000000000
[   64.306407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   64.307390] CR2: 0000561961b47000 CR3: 0000000002580000 CR4: 00000000000006b0
[   64.308679] Call Trace:
[   64.308956]  <TASK>
[   64.309152]  ? show_regs+0x64/0x70
[   64.309644]  ? __warn+0x84/0x120
[   64.310121]  ? drop_nlink+0x2a/0x40
[   64.310620]  ? report_bug+0x15d/0x180
[   64.311182]  ? handle_bug+0x44/0x90
[   64.311684]  ? exc_invalid_op+0x18/0x70
[   64.312268]  ? asm_exc_invalid_op+0x1b/0x20
[   64.312928]  ? drop_nlink+0x2a/0x40
[   64.313429]  v9fs_remove+0x132/0x280 [9p]
[   64.314077]  v9fs_vfs_unlink+0x10/0x20 [9p]
[   64.314732]  vfs_unlink+0x135/0x2c0
[   64.315245]  do_unlinkat+0x231/0x2b0
[   64.315764]  __x64_sys_unlink+0x23/0x30
[   64.316340]  do_syscall_64+0x5f/0x130
[   64.316883]  entry_SYSCALL_64_after_hwframe+0x71/0x79
[   64.317734] RIP: 0033:0x7fdf1ab0ea07
[   64.318257] Code: f0 ff ff 73 01 c3 48 8b 0d f6 83 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 57 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d c9 83 0d 00 f7 d8 64 89 01 48
[   64.322093] RSP: 002b:00007fff0b13fc58 EFLAGS: 00000202 ORIG_RAX: 0000000000000057
[   64.323478] RAX: ffffffffffffffda RBX: 00005619848be5e0 RCX: 00007fdf1ab0ea07
[   64.324768] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000056198492ad40
[   64.326081] RBP: 000056198492ad40 R08: 0000000000000007 R09: 00005619848bdfb0
[   64.327371] R10: ed3dec18d54433a1 R11: 0000000000000202 R12: 00005619848bdef0
[   64.328657] R13: 0000561961b30040 R14: 0000561961b431d8 R15: 00007fdf1ac6e020
[   64.329969]  </TASK>
[   64.330185] ---[ end trace 0000000000000000 ]---
```

It doesn't seem to happen every time. I don't think it happened before,
but since I haven't looked into reproducing it I can't say for sure.


* Some refcounting bug when running the script below (creating/removing
the same files many times in parallel).

This doesn't seem new, as I can reproduce it even without the 6.9
patches; I'll need to dig into it.

```
mkdir tmp
echo test > tmp/test
for i in 1 2 3 4 5; do
	seq 1 1000 | while read _; do
		cp -a tmp copy 2>/dev/null
		rm -rf copy 2>/dev/null 
	done &
done
wait
```

I also had a qemu segfault after a similar script (copying/removing a
larger tree), but looking at the code and locals I don't see how it
could fail there... I'm giving up on this for now. It happened after a
refcount UAF error on the Linux side, so it might be caused by garbage
we sent, but it's still a problem, obviously.
```
(gdb) bt
#0  __memcmp_avx2_movbe () at ../sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S:415
#1  0x000064d9dd7a5716 in v9fs_mark_fids_unreclaim (pdu=pdu@entry=0x64d9e16ffe58, path=path@entry=0x74d9888c8f70) at ../hw/9pfs/9p.c:545
#2  0x000064d9dd7aa7ad in v9fs_unlinkat (opaque=0x64d9e16ffe58) at ../hw/9pfs/9p.c:3189
#3  0x000064d9ddd4d64b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:175
#4  0x000074da302bf510 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:90 from /nix/store/cyrrf49i2hm1w7vn2j945ic3rrzgxbqs-glibc-2.38-44/lib/libc.so.6
#5  0x00007fff7b8c4e30 in ?? ()
#6  0x0000000000000000 in ?? ()
(gdb) p fidp->path
$1 = {size = 29, data = 0x74d96c001010 "./copy/1/tests/btrfs/007.out"}
(gdb) p *path
$2 = {size = 29, data = 0x74da1c001e00 "./copy/4/tests/btrfs/049.out"}
# failing on this line (../hw/9pfs/9p.c:545):
	!memcmp(fidp->path.data, path->data, path->size)) {
```

-- 
Dominique Martinet | Asmadeus

Thread overview: 7+ messages
2024-04-08  4:17 Dominique Martinet [this message]
2024-04-08 23:14 ` recap of 9p problems in 6.9 Eric Van Hensbergen
2024-04-09  0:11 ` Eric Van Hensbergen
2024-04-09  2:07   ` Dominique Martinet
2024-04-09 11:40     ` Eric Van Hensbergen
2024-04-09 20:57       ` Eric Van Hensbergen
2024-04-09  2:15   ` Eric Van Hensbergen
