Regressions List Tracking
* Re: NFS workload leaves nfsd threads in D state
       [not found] <7A57C7AE-A51A-4254-888B-FE15CA21F9E9@oracle.com>
@ 2023-07-09  6:58 ` Linux regression tracking (Thorsten Leemhuis)
       [not found] ` <20230710075634.GA30120@lst.de>
  1 sibling, 0 replies; 4+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-09  6:58 UTC
  To: Chuck Lever III, Jens Axboe, Christoph Hellwig
  Cc: linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
	Linux kernel regressions list

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 08.07.23 20:30, Chuck Lever III wrote:
> 
> I have a "standard" test of running the git regression suite with
> many threads against an NFS mount. I found that with 6.5-rc, the
> test stalled and several nfsd threads on the server were stuck
> in D state.
> 
> I can reproduce this stall 100% with both an xfs and an ext4
> export, so I bisected with both, and both bisects landed on the
> same commit:
> 
> 615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
> commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Fri May 19 06:40:48 2023 +0200
> 
>     blk-mq: defer to the normal submission path for post-flush requests
> 
>     Requests with the FUA bit on hardware without FUA support need a post
>     flush before returning to the caller, but they can still be sent using
>     the normal I/O path after initializing the flush-related fields and
>     end I/O handler.
> 
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Bart Van Assche <bvanassche@acm.org>
>     Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
>  block/blk-flush.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> On system 1: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2BA400G3" SATA device.
> 
> On system 2: the exports are on top of /dev/mapper and reside on
> an "INTEL SSDSC2KB240G8" SATA device.
> 
> System 1 was where I discovered the stall. System 2 is where I ran
> the bisects.
> 
> The call stacks vary a little. I've seen stalls in both the WRITE
> and SETATTR paths. Here's a sample from system 1:
> 
> INFO: task nfsd:1237 blocked for more than 122 seconds.
>       Tainted: G        W          6.4.0-08699-g9e268189cb14 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:nfsd            state:D stack:0     pid:1237  ppid:2      flags:0x00004000
> Call Trace:
>  <TASK>
>  __schedule+0x78f/0x7db
>  schedule+0x93/0xc8
>  jbd2_log_wait_commit+0xb4/0xf4
>  ? __pfx_autoremove_wake_function+0x10/0x10
>  jbd2_complete_transaction+0x85/0x97
>  ext4_fc_commit+0x118/0x70a
>  ? _raw_spin_unlock+0x18/0x2e
>  ? __mark_inode_dirty+0x282/0x302
>  ext4_write_inode+0x94/0x121
>  ext4_nfs_commit_metadata+0x72/0x7d
>  commit_inode_metadata+0x1f/0x31 [nfsd]
>  commit_metadata+0x26/0x33 [nfsd]
>  nfsd_setattr+0x2f2/0x30e [nfsd]
>  nfsd_create_setattr+0x4e/0x87 [nfsd]
>  nfsd4_open+0x604/0x8fa [nfsd]
>  nfsd4_proc_compound+0x4a8/0x5e3 [nfsd]
>  ? nfs4svc_decode_compoundargs+0x291/0x2de [nfsd]
>  nfsd_dispatch+0xb3/0x164 [nfsd]
>  svc_process_common+0x3c7/0x53a [sunrpc]
>  ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
>  svc_process+0xc6/0xe3 [sunrpc]
>  nfsd+0xf2/0x18c [nfsd]
>  ? __pfx_nfsd+0x10/0x10 [nfsd]
>  kthread+0x10d/0x115
>  ? __pfx_kthread+0x10/0x10
>  ret_from_fork+0x2c/0x50
>  </TASK>

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 615939a2ae734e
#regzbot title blk-mq: NFS workload leaves nfsd threads in D state
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
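
For readers who want to summarise a hung-task trace like the one quoted in the message above, the following is a minimal Python sketch; it is not part of the original thread. It assumes the frame layout seen in that trace (`symbol+0xOFF/0xSIZE`, optionally followed by `[module]`) and treats frames prefixed with `? ` as unreliable, which is how the kernel marks addresses it found on the stack but could not confirm as part of the call chain. The names `FRAME` and `parse_call_trace` are hypothetical, and the sample lines are copied from the quoted trace.

```python
import re

# Assumed frame layout: optional "? " marker, symbol+0xOFFSET/0xSIZE, optional " [module]".
FRAME = re.compile(r"^(\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?$")


def parse_call_trace(lines):
    """Return one dict per stack frame; lines that are not frames are skipped."""
    frames = []
    for line in lines:
        # Drop mail quoting (">") and indentation before matching.
        text = line.lstrip("> ").rstrip()
        match = FRAME.match(text)
        if match is None:
            continue
        frames.append({
            "symbol": match.group(2),
            "module": match.group(3),            # None for code built into the kernel image
            "reliable": match.group(1) is None,  # "? " marks an unverified frame
        })
    return frames


SAMPLE = [
    "> Call Trace:",
    ">  <TASK>",
    ">  jbd2_log_wait_commit+0xb4/0xf4",
    ">  ? __pfx_autoremove_wake_function+0x10/0x10",
    ">  ext4_nfs_commit_metadata+0x72/0x7d",
    ">  commit_inode_metadata+0x1f/0x31 [nfsd]",
]

if __name__ == "__main__":
    for frame in parse_call_trace(SAMPLE):
        if frame["reliable"]:
            print(frame["symbol"], frame["module"] or "(built-in)")
```

Listing only the reliable frames makes the chain easier to follow: in the quoted trace, the nfsd thread is waiting in `jbd2_log_wait_commit`, reached through the ext4 metadata commit path.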


* Re: NFS workload leaves nfsd threads in D state
       [not found]               ` <92CC9151-0309-41E9-920E-A549E2A73BE4@oracle.com>
@ 2023-07-25  9:57                 ` Linux regression tracking (Thorsten Leemhuis)
  2023-07-25 13:21                   ` Chuck Lever III
  0 siblings, 1 reply; 4+ messages in thread
From: Linux regression tracking (Thorsten Leemhuis) @ 2023-07-25  9:57 UTC
  To: Chuck Lever III, Chengming Zhou
  Cc: Christoph Hellwig, ross.lagerwall@citrix.com, Jens Axboe,
	linux-block@vger.kernel.org, Linux NFS Mailing List, Chuck Lever,
	Linux kernel regressions list

On 12.07.23 15:29, Chuck Lever III wrote:
>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>> blk_rq_init_flush(rq);
>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>
>>>> Thanks for the quick response. No change.
>>> I'm a bit lost and still can't reproduce.  Below is a patch with the
>>> only behavior differences I can find.  It has two "#if 1" blocks,
>>> which I'll need to bisect to find out which made it work (if any,
>>> but I hope so).
>>
>> I tried to reproduce this today, but unfortunately could not.
>>
>> Could you please also try the patch [1] from Ross Lagerwall, which fixes an
>> I/O hang caused by recursive plug flushing?
>>
>> (Since the main difference is that post-flush requests can now go into the plug.)
>>
>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
> 
> Thanks for the suggestion. No change, unfortunately.

Chuck, what's the status here? This thread looks stalled, that's why I
wonder.

FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
requests")). But unless I missed something you are not CCed, so I guess
that's a different issue?

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke


* Re: NFS workload leaves nfsd threads in D state
  2023-07-25  9:57                 ` Linux regression tracking (Thorsten Leemhuis)
@ 2023-07-25 13:21                   ` Chuck Lever III
  2023-07-25 13:34                     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 1 reply; 4+ messages in thread
From: Chuck Lever III @ 2023-07-25 13:21 UTC
  To: Linux regressions mailing list
  Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
	Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever



> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
> 
> On 12.07.23 15:29, Chuck Lever III wrote:
>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>>>>> blk_rq_init_flush(rq);
>>>>>> - rq->flush.seq |= REQ_FSEQ_POSTFLUSH;
>>>>>> + rq->flush.seq |= REQ_FSEQ_PREFLUSH;
>>>>>> spin_lock_irq(&fq->mq_flush_lock);
>>>>>> list_move_tail(&rq->flush.list, &fq->flush_data_in_flight);
>>>>>> spin_unlock_irq(&fq->mq_flush_lock);
>>>>> 
>>>>> Thanks for the quick response. No change.
>>>> I'm a bit lost and still can't reproduce.  Below is a patch with the
>>>> only behavior differences I can find.  It has two "#if 1" blocks,
>>>> which I'll need to bisect to find out which made it work (if any,
>>>> but I hope so).
>>> 
>>> I tried to reproduce this today, but unfortunately could not.
>>>
>>> Could you please also try the patch [1] from Ross Lagerwall, which fixes an
>>> I/O hang caused by recursive plug flushing?
>>>
>>> (Since the main difference is that post-flush requests can now go into the plug.)
>>> 
>>> [1] https://lore.kernel.org/all/20230711160434.248868-1-ross.lagerwall@citrix.com/
>> 
>> Thanks for the suggestion. No change, unfortunately.
> 
> Chuck, what's the status here? This thread looks stalled, that's why I
> wonder.
> 
> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
> requests")). But unless I missed something you are not CCed, so I guess
> that's a different issue?

Hi Thorsten-

This issue was fixed in 6.5-rc2 by commit

9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")


--
Chuck Lever




* Re: NFS workload leaves nfsd threads in D state
  2023-07-25 13:21                   ` Chuck Lever III
@ 2023-07-25 13:34                     ` Linux regression tracking #update (Thorsten Leemhuis)
  0 siblings, 0 replies; 4+ messages in thread
From: Linux regression tracking #update (Thorsten Leemhuis) @ 2023-07-25 13:34 UTC
  To: Chuck Lever III, Linux regressions mailing list
  Cc: Chengming Zhou, Christoph Hellwig, ross.lagerwall@citrix.com,
	Jens Axboe, linux-block@vger.kernel.org, Linux NFS Mailing List,
	Chuck Lever

On 25.07.23 15:21, Chuck Lever III wrote:
>> On Jul 25, 2023, at 5:57 AM, Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote:
>>
>> On 12.07.23 15:29, Chuck Lever III wrote:
>>>> On Jul 12, 2023, at 7:34 AM, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>>>> On 2023/7/11 20:01, Christoph Hellwig wrote:
>>>>> On Mon, Jul 10, 2023 at 05:40:42PM +0000, Chuck Lever III wrote:
>>
>> Chuck, what's the status here? This thread looks stalled, that's why I
>> wonder.
>>
>> FWIW, I noticed a commit with a Fixes: tag for your culprit in next (see
>> 28b24123747098 ("blk-flush: fix rq->flush.seq for post-flush
>> requests")). But unless I missed something you are not CCed, so I guess
>> that's a different issue?
> 
> This issue was fixed in 6.5-rc2 by commit
> 
> 9f87fc4d72f5 ("block: queue data commands from the flush state machine at the head")

Ahh, many thx for the update, it lacked a proper Link:/Closes: tag and
mentioned another culprit, so I had missed that!

#regzbot fix: 9f87fc4d72f5
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
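
The messages in this thread steer the tracking bot with lines that begin with `#regzbot` (`^introduced`, `title`, `ignore-activity`, `poke`, `fix:`). As a rough illustration only, the Python sketch below collects such lines from a mail body. It assumes that a command is simply the first word after `#regzbot` at the start of an unquoted line and that everything after it is the argument; regzbot's real parsing rules may differ. The name `extract_regzbot_commands` is hypothetical, and the sample body is assembled from lines that appear in this thread.

```python
def extract_regzbot_commands(body):
    """Return (command, argument) pairs for unquoted '#regzbot' lines in a mail body."""
    commands = []
    for line in body.splitlines():
        # Split into marker, command word, and the rest of the line as the argument.
        parts = line.split(None, 2)
        # Quoted lines start with ">" and so never have "#regzbot" as their first word.
        if len(parts) < 2 or parts[0] != "#regzbot" or line[0] != "#":
            continue
        argument = parts[2].strip() if len(parts) == 3 else ""
        commands.append((parts[1], argument))
    return commands


SAMPLE_BODY = """\
Thanks for the report.

#regzbot ^introduced 615939a2ae734e
#regzbot title blk-mq: NFS workload leaves nfsd threads in D state
#regzbot ignore-activity
> #regzbot poke
#regzbot fix: 9f87fc4d72f5
"""

if __name__ == "__main__":
    for command, argument in extract_regzbot_commands(SAMPLE_BODY):
        print(command, argument)
```

The quoted `> #regzbot poke` line is skipped on purpose, so that commands repeated inside a reply's quoted text are not counted twice.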
