* Re: NFS workload leaves nfsd threads in D state
[not found] <7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com>
@ 2023-07-09 6:58 ` Linux regression tracking (Thorsten Leemhuis)
[not found] ` <20230710075634.GA30120@lst.de>
1 sibling, 0 replies; 4+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-09 6:58 UTC
To: Chuck Lever III, Jens Axboe, Christoph Hellwig
Cc: linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
Linux kernel regressions list
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 08.07.23 20:30, Chuck Lever III wrote:
>
> I have a "standard" test of running the git regression suite with
> many threads against an NFS mount. I found that with 6.5-rc, the
> test stalled and several nfsd threads on the server were stuck
> in D state.
>
> I can reproduce this stall 100% with both an xfs and an ext4
> export, so I bisected with both, and both bisects landed on the
> same commit:
>
> 615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
> commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
> Author: Christoph Hellwig <hch@lst.de>
> Date: Fri May 19 06:40:48 2023 +0200
>
> blk-mq: defer to the normal submission path for post-flush requests
>
> Requests with the FUA bit on hardware without FUA support need a post
> flush before returning to the caller, but they can still be sent using
> the normal I/O path after initializing the flush-related fields and
> end I/O handler.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> block/blk-flush.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> On system 1: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2BA400G3" SATA device.
>
> On system 2: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2KB240G8" SATA device.
>
> System 1 was where I discovered the stall. System 2 is where I ran
> the bisects.
>
> The call stacks vary a little. I've seen stalls in both the WRITE
> and SETATTR paths. Here's a sample from system 1:
>
> INFO: task nfsd:1237 blocked for more than 122 seconds.
> Tainted: G W 6.4.0-08699-g9e268189cb14 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nfsd state:D stack:0 pid:1237 ppid:2 flags:0x00004000
> Call Trace:
> <TASK>
> __schedule+0x78f/0x7db
> schedule+0x93/0xc8
> jbd2_log_wait_commit+0xb4/0xf4
> ? __pfx_autoremove_wake_function+0x10/0x10
> jbd2_complete_transaction+0x85/0x97
> ext4_fc_commit+0x118/0x70a
> ? _raw_spin_unlock+0x18/0x2e
> ? __mark_inode_dirty+0x282/0x302
> ext4_write_inode+0x94/0x121
> ext4_nfs_commit_metadata+0x72/0x7d
> commit_inode_metadata+0x1f/0x31 [nfsd]
> commit_metadata+0x26/0x33 [nfsd]
> nfsd_setattr+0x2f2/0x30e [nfsd]
> nfsd_create_setattr+0x4e/0x87 [nfsd]
> nfsd4_open+0x604/0x8fa [nfsd]
> nfsd4_proc_compound+0x4a8/0x5e3 [nfsd]
> ? nfs4svc_decode_compoundargs+0x291/0x2de [nfsd]
> nfsd_dispatch+0xb3/0x164 [nfsd]
> svc_process_common+0x3c7/0x53a [sunrpc]
> ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
> svc_process+0xc6/0xe3 [sunrpc]
> nfsd+0xf2/0x18c [nfsd]
> ? __pfx_nfsd+0x10/0x10 [nfsd]
> kthread+0x10d/0x115
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x2c/0x50
> </TASK>
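For anyone triaging similar reports: the "task:" line in the hung-task
output quoted above is a run of key:value fields, so it can be split
mechanically. A minimal sketch in Python follows; the helper names are
hypothetical and not part of any kernel tooling, and it assumes the line
has the same shape as the one in this report.

```python
import re

# Matches "key:value" pairs such as "state:D" or "pid:1237".
# Assumption: values contain no whitespace, as in the report above.
FIELD = re.compile(r"(\w+):(\S+)")


def parse_hung_task_line(line):
    """Return the key:value fields of a hung-task 'task:' line as a dict."""
    return dict(FIELD.findall(line))


def is_uninterruptible(fields):
    """True if the parsed line describes a task in D state."""
    return fields.get("state") == "D"


# Line copied from the report (leading quote marker is harmless).
line = "> task:nfsd state:D stack:0 pid:1237 ppid:2 flags:0x00004000"
fields = parse_hung_task_line(line)
print(fields["task"], fields["pid"], is_uninterruptible(fields))
```

This only inspects the log text; it says nothing about why the task is
blocked, which is what the call trace is for.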
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced 615939a2ae734e
#regzbot title blk-mq: NFS workload leaves nfsd threads in D state
#regzbot ignore-activity
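In case the format is unfamiliar: the tracking commands above are plain
lines that start with "#regzbot", followed by a command word and an
optional argument. A minimal sketch in Python of pulling them out of a
mail body follows. This is an illustration only, not regzbot's actual
parser; the function name is hypothetical, and ignoring quoted lines is
an assumption of this sketch.

```python
import re

# A command line is "#regzbot", a command word, then an optional argument.
# Lines quoted with ">" do not match because they do not start with "#".
REGZBOT_LINE = re.compile(r"^#regzbot\s+(\S+)\s*(.*)$")


def extract_regzbot_commands(body):
    """Return (command, argument) pairs for each unquoted #regzbot line."""
    commands = []
    for line in body.splitlines():
        match = REGZBOT_LINE.match(line.strip())
        if match:
            commands.append((match.group(1), match.group(2).strip()))
    return commands


sample = """\
#regzbot ^introduced 615939a2ae734e
#regzbot title blk-mq: NFS workload leaves nfsd threads in D state
#regzbot ignore-activity
"""

for command, argument in extract_regzbot_commands(sample):
    print(command, "|", argument)
```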
This isn't a regression? This issue or a fix for it is already being
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
* Re: NFS workload leaves nfsd threads in D state
[not found] ` <92CC9151-0309-41E9-920E-A549E2A73BE4@oracle.com>
@ 2023-07-25 9:57 ` Linux regression tracking (Thorsten Leemhuis)
2023-07-25 13:21 ` Chuck Lever III
0 siblings, 1 reply; 4+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-25 9:57 UTC
To: Chuck Lever III, Chengming Zhou
Cc: Christoph Hellwig, ross.lagerwall@citrix.com, Jens Axboe,
linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
Linux kernel regressions list
On 12.07.23 15:29, Chuck Lever III wrote:
>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>> blk_rq_init_flush(rq);
>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>
>>>> Thanks for the quick response. No change.
>>> I'm a bit lost and still can't reproduce. Below is a patch with the
>>> only behavior differences I can find. It has two "#if 1" blocks,
>>> which I'll need to bisect to find out which made it work (if any,
>>> but I hope so).
>>
>> I tried today to reproduce, but can't unfortunately.
>>
>> Could you please also try the fix patch [1] from Ross Lagerwall that fixes
>> IO hung problem of plug recursive flush?
>>
>> (Since the main difference is that post-flush requests now can go into plug.)
>>
>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
>
> Thanks for the suggestion. No change, unfortunately.
Chuck, what's the status here? This thread looks stalled, that's why I
wonder.
FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
requests")). But unless I missed something you are not CCed, so I guess
that's a different issue?
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot poke
* Re: NFS workload leaves nfsd threads in D state
2023-07-25 9:57 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-07-25 13:21 ` Chuck Lever III
2023-07-25 13:34 ` Linux regression tracking #update (Thorsten Leemhuis)
0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever III @ 2023-07-25 13:21 UTC
To: Linux regressions mailing list
Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
Chuck Lever
> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>
> On 12.07.23 15:29, Chuck Lever III wrote:
>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>>> blk_rq_init_flush(rq);
>>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>>
>>>>> Thanks for the quick response. No change.
>>>> I'm a bit lost and still can't reproduce. Below is a patch with the
>>>> only behavior differences I can find. It has two "#if 1" blocks,
>>>> which I'll need to bisect to find out which made it work (if any,
>>>> but I hope so).
>>>
>>> I tried today to reproduce, but can't unfortunately.
>>>
>>> Could you please also try the fix patch [1] from Ross Lagerwall that fixes
>>> IO hung problem of plug recursive flush?
>>>
>>> (Since the main difference is that post-flush requests now can go into plug.)
>>>
>>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
>>
>> Thanks for the suggestion. No change, unfortunately.
>
> Chuck, what's the status here? This thread looks stalled, that's why I
> wonder.
>
> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
> requests")). But unless I missed something you are not CCed, so I guess
> that's a different issue?
Hi Thorsten-
This issue was fixed in 6.5-rc2 by commit
9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")
--
Chuck Lever
* Re: NFS workload leaves nfsd threads in D state
2023-07-25 13:21 ` Chuck Lever III
@ 2023-07-25 13:34 ` Linux regression tracking #update (Thorsten Leemhuis)
0 siblings, 0 replies; 4+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-07-25 13:34 UTC
To: Chuck Lever III, Linux regressions mailing list
Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
Chuck Lever
On 25.07.23 15:21, Chuck Lever III wrote:
>> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 12.07.23 15:29, Chuck Lever III wrote:
>>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>
>> Chuck, what's the status here? This thread looks stalled, that's why I
>> wonder.
>>
>> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
>> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
>> requests")). But unless I missed something you are not CCed, so I guess
>> that's a different issue?
>
> This issue was fixed in 6.5-rc2 by commit
>
> 9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")
Ahh, many thx for the update, it lacked a proper Link:/Closes: tag and
mentioned another culprit, so I had missed that!
#regzbot fix: 9f87fc4d72f5
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
end of thread, newest message 2023-07-25 13:34 UTC
Thread overview: 4+ messages
[not found] <7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com>
2023-07-09 6:58 ` NFS workload leaves nfsd threads in D state Linux regression tracking (Thorsten Leemhuis)
[not found] ` <20230710075634.GA30120@lst.de>
[not found] ` <3F16A14B-F854-41CC-A3CA-87C7946FC277@oracle.com>
[not found] ` <F610D6B3-876F-4E5D-A3C4-A30F1B81D9B5@oracle.com>
[not found] ` <20230710172839.GA7190@lst.de>
[not found] ` <0F9A70B1-C6AE-4A8B-8A4B-8DC9ADED73AB@oracle.com>
[not found] ` <20230711120137.GA27050@lst.de>
[not found] ` <82cb9937-bd11-64a9-2520-bf3cf81ec720@linux.dev>
[not found] ` <92CC9151-0309-41E9-920E-A549E2A73BE4@oracle.com>
2023-07-25 9:57 ` Linux regression tracking (Thorsten Leemhuis)
2023-07-25 13:21 ` Chuck Lever III
2023-07-25 13:34 ` Linux regression tracking #update (Thorsten Leemhuis)