Regressions List Tracking
* Re: rm hanging, v6.1.35
       [not found] <20230710215354.GA679018@onthe.net.au>
@ 2023-07-11  0:53 ` Bagas Sanjaya
  2023-07-11  2:29   ` Dave Chinner
  0 siblings, 1 reply; 2+ messages in thread
From: Bagas Sanjaya @ 2023-07-11  0:53 UTC (permalink / raw)
  To: Chris Dunlop, Linux XFS, Dave Chinner, Darrick J. Wong
  Cc: Linux Stable, Linux Kernel Mailing List, Linux Regressions

On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
> Hi,
> 
> This box is newly booted into linux v6.1.35 (2 days ago), it was previously
> running v5.15.118 without any problems (other than that fixed by
> "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the
> upgrade).
> 
> I have rm operations on two files that have been stuck for in excess of 22
> hours and 18 hours respectively:
> 
> $ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm
>     PID                  STARTED S WCHAN-xxxxxxxxxxxxxxx CMD
> 2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink            /bin/rm -rf /aaa/5539_tmp
> 2392421 Mon Jul 10 09:18:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> 2485728 Mon Jul 10 09:28:57 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> 2488254 Mon Jul 10 09:39:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> 2491180 Mon Jul 10 09:49:58 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> 3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink            /bin/rm -rf /bbb/5541_tmp
> 3095893 Mon Jul 10 13:11:03 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> 3098809 Mon Jul 10 13:21:35 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> 3101387 Mon Jul 10 13:32:06 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> 3195017 Mon Jul 10 13:42:37 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> 
> The "rm"s are run from a process that's obviously tried a few times to get
> rid of these files.
> 
> There's nothing extraordinary about the files in terms of size:
> 
> $ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
> -rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
> -rw-rw-rw- 1 1482 1482  701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
> 
> As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the
> first one run on each file) stack trace looks like:
> 
> [<0>] vfs_unlink+0x48/0x270
> [<0>] do_unlinkat+0x1f5/0x290
> [<0>] __x64_sys_unlinkat+0x3b/0x60
> [<0>] do_syscall_64+0x34/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> And each "secondary" rm (i.e. the subsequent ones on each file) stack trace
> looks like:
> 
> [<0>] down_write_nested+0xdc/0x100
> [<0>] do_unlinkat+0x10d/0x290
> [<0>] __x64_sys_unlinkat+0x3b/0x60
> [<0>] do_syscall_64+0x34/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> Multiple kernel stack traces don't show vfs_unlink or anything related that
> I can see, or anything else consistent or otherwise interesting. Most cores
> are idle.
> 
> Each of /aaa and /bbb are separate XFS filesystems:
> 
> $ xfs_info /aaa
> meta-data=/dev/mapper/aaa        isize=512    agcount=2, agsize=268434432 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1
> data     =                       bsize=4096   blocks=536868864, imaxpct=5
>          =                       sunit=256    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=262143, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> $ xfs_info /bbb
> meta-data=/dev/mapper/bbb        isize=512    agcount=8, agsize=268434432 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1    bigtime=1 inobtcount=1
> data     =                       bsize=4096   blocks=1879047168, imaxpct=5
>          =                       sunit=256    swidth=256 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> There's plenty of free space at the fs level:
> 
> $ df -h /aaa /bbb
> Filesystem          Size  Used Avail Use% Mounted on
> /dev/mapper/aaa     2.0T  551G  1.5T  27% /aaa
> /dev/mapper/bbb     7.0T  3.6T  3.5T  52% /bbb
> 
> The fses are on sparse ceph/rbd volumes; the underlying storage tells me
> they're 50-60% utilised:
> 
> aaa: provisioned="2048G" used="1015.9G"
> bbb: provisioned="7168G" used="4925.3G"
> 
> Where to from here?
> 
> I'm guessing only a reboot is going to unstick this. Anything I should be
> looking at before reverting to v5.15.118?
> 
> ...subsequent to starting to write all this down I have another two sets of
> rms stuck, again on unremarkable files, and on two more separate
> filesystems.
> 
> ...oh. And an 'ls' on those files is hanging. The reboot has become more
> urgent.
> 

Smells like the regression has resurfaced, right? I mean, does
5e672cd69f0a53 not completely fix the blocking regression you reported
earlier?

I'm kinda confused...

-- 
An old man doll... just what I always wanted! - Clara


* Re: rm hanging, v6.1.35
  2023-07-11  0:53 ` rm hanging, v6.1.35 Bagas Sanjaya
@ 2023-07-11  2:29   ` Dave Chinner
  0 siblings, 0 replies; 2+ messages in thread
From: Dave Chinner @ 2023-07-11  2:29 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Chris Dunlop, Linux XFS, Dave Chinner, Darrick J. Wong,
	Linux Stable, Linux Kernel Mailing List, Linux Regressions

On Tue, Jul 11, 2023 at 07:53:35AM +0700, Bagas Sanjaya wrote:
> On Tue, Jul 11, 2023 at 07:53:54AM +1000, Chris Dunlop wrote:
> > Hi,
> > 
> > This box is newly booted into linux v6.1.35 (2 days ago), it was previously
> > running v5.15.118 without any problems (other than that fixed by
> > "5e672cd69f0a xfs: non-blocking inodegc pushes", the reason for the
> > upgrade).
> > 
> > I have rm operations on two files that have been stuck for in excess of 22
> > hours and 18 hours respectively:
> > 
> > $ ps -opid,lstart,state,wchan=WCHAN-xxxxxxxxxxxxxxx,cmd -C rm
> >     PID                  STARTED S WCHAN-xxxxxxxxxxxxxxx CMD
> > 2379355 Mon Jul 10 09:07:57 2023 D vfs_unlink            /bin/rm -rf /aaa/5539_tmp
> > 2392421 Mon Jul 10 09:18:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> > 2485728 Mon Jul 10 09:28:57 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> > 2488254 Mon Jul 10 09:39:27 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> > 2491180 Mon Jul 10 09:49:58 2023 D down_write_nested     /bin/rm -rf /aaa/5539_tmp
> > 3014914 Mon Jul 10 13:00:33 2023 D vfs_unlink            /bin/rm -rf /bbb/5541_tmp
> > 3095893 Mon Jul 10 13:11:03 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> > 3098809 Mon Jul 10 13:21:35 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> > 3101387 Mon Jul 10 13:32:06 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> > 3195017 Mon Jul 10 13:42:37 2023 D down_write_nested     /bin/rm -rf /bbb/5541_tmp
> > 
> > The "rm"s are run from a process that's obviously tried a few times to get
> > rid of these files.
>
> > There's nothing extraordinary about the files in terms of size:
> > 
> > $ ls -ltrn --full-time /aaa/5539_tmp /bbb/5541_tmp
> > -rw-rw-rw- 1 1482 1482 7870643 2023-07-10 06:07:58.684036505 +1000 /aaa/5539_tmp
> > -rw-rw-rw- 1 1482 1482  701240 2023-07-10 10:00:34.181064549 +1000 /bbb/5541_tmp
> > 
> > As hinted by the WCHAN in the ps output above, each "primary" rm (i.e. the
> > first one run on each file) stack trace looks like:
> > 
> > [<0>] vfs_unlink+0x48/0x270
> > [<0>] do_unlinkat+0x1f5/0x290
> > [<0>] __x64_sys_unlinkat+0x3b/0x60
> > [<0>] do_syscall_64+0x34/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

This looks to be stuck on the target inode lock (i.e. the locks for
the inodes at /aaa/5539_tmp and /bbb/5541_tmp themselves):
do_unlinkat() takes the parent directory lock first, then vfs_unlink()
takes the victim inode's lock, and that second lock is where these
tasks are waiting.

What's holding these inode locks? The unlink hasn't even reached XFS
at this point, so something else must be going on in the background.
Can you attach the full output of 'echo w > /proc/sysrq-trigger' and
'echo t > /proc/sysrq-trigger', please?
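
For reference, one way to capture both dumps, run as root (the 't' dump
is large, so the kernel log buffer may need to be big enough, or pull
the log from 'journalctl -k' instead of dmesg):

echo w > /proc/sysrq-trigger    # stacks of all blocked (D state) tasks
echo t > /proc/sysrq-trigger    # stacks of all tasks
dmesg > sysrq-dumps.txt         # kernel log to attach (name is arbitrary)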

> > 
> > And each "secondary" rm (i.e. the subsequent ones on each file) stack trace
> > looks like:
> > 
> > [<0>] down_write_nested+0xdc/0x100
> > [<0>] do_unlinkat+0x10d/0x290
> > [<0>] __x64_sys_unlinkat+0x3b/0x60
> > [<0>] do_syscall_64+0x34/0x80
> > [<0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

These are likely all stuck on the parent directory inode lock (i.e.
/aaa and /bbb).
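
For what it's worth, the traces quoted above can be re-checked at any
time via /proc/<pid>/stack (root only, and only if the kernel exposes
it), using the PIDs from the earlier ps listing, e.g.:

cat /proc/2392421/stack    # one of the "secondary" rm tasks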


> > Where to from here?
> > 
> > I'm guessing only a reboot is going to unstick this. Anything I should be
> > looking at before reverting to v5.15.118?
> > 
> > ...subsequent to starting to write all this down I have another two sets of
> > rms stuck, again on unremarkable files, and on two more separate
> > filesystems.

What does an "unremarkable file" look like? Is it a reflink copy of
something else, a hard link, a small or large regular data file, or
something else entirely?
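
For instance, a quick way to check both possibilities (assuming GNU
stat and e2fsprogs' filefrag are available; paths as in the report):

$ stat -c '%h hard links, %s bytes' /aaa/5539_tmp  # link count > 1 means hard-linked
$ filefrag -v /aaa/5539_tmp                        # "shared" in FLAGS means reflinked extents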

> > 
> > ...oh. And an 'ls' on those files is hanging. The reboot has become more
> > urgent.

Yup, that's most likely getting stuck on the directory locks that
the unlinks are holding....
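
Incidentally, if CONFIG_DETECT_HUNG_TASK is enabled, the kernel should
already have been logging these tasks with full stacks; something like

$ dmesg | grep 'blocked for more than'

would confirm, and that output is worth attaching as well.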

-Dave.
-- 
Dave Chinner
david@fromorbit.com

