IO-Uring Archive mirror
From: Ming Lei <ming.lei@redhat.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Pavel Begunkov <asml.silence@gmail.com>,
	io-uring@vger.kernel.org, linux-block@vger.kernel.org,
	Kanchan Joshi <joshi.k@samsung.com>
Subject: Re: (subset) [PATCH 00/11] remove aux CQE caches
Date: Mon, 18 Mar 2024 09:49:00 +0800
Message-ID: <ZfedjMPDXp7q8t/D@fedora>
In-Reply-To: <1e05aee5-4166-4e5d-9b76-94e1d833ab17@kernel.dk>

On Sun, Mar 17, 2024 at 07:34:30PM -0600, Jens Axboe wrote:
> On 3/17/24 6:15 PM, Ming Lei wrote:
> > On Sun, Mar 17, 2024 at 04:24:07PM -0600, Jens Axboe wrote:
> >> On 3/17/24 4:07 PM, Jens Axboe wrote:
> >>> On 3/17/24 3:51 PM, Jens Axboe wrote:
> >>>> On 3/17/24 3:47 PM, Pavel Begunkov wrote:
> >>>>> On 3/17/24 21:34, Pavel Begunkov wrote:
> >>>>>> On 3/17/24 21:32, Jens Axboe wrote:
> >>>>>>> On 3/17/24 3:29 PM, Pavel Begunkov wrote:
> >>>>>>>> On 3/17/24 21:24, Jens Axboe wrote:
> >>>>>>>>> On 3/17/24 2:55 PM, Pavel Begunkov wrote:
> >>>>>>>>>> On 3/16/24 13:56, Ming Lei wrote:
> >>>>>>>>>>> On Sat, Mar 16, 2024 at 01:27:17PM +0000, Pavel Begunkov wrote:
> >>>>>>>>>>>> On 3/16/24 11:52, Ming Lei wrote:
> >>>>>>>>>>>>> On Fri, Mar 15, 2024 at 04:53:21PM -0600, Jens Axboe wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>>>> The following two errors can be triggered with this patchset
> >>>>>>>>>>>>> when running some ublk stress tests (I/O vs. deletion). I don't see
> >>>>>>>>>>>>> such failures after reverting the 11 patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I suppose it's with the fix from yesterday. How can I
> >>>>>>>>>>>> reproduce it, blktests?
> >>>>>>>>>>>
> >>>>>>>>>>> Yeah, it needs yesterday's fix.
> >>>>>>>>>>>
> >>>>>>>>>>> You may need to run this test multiple times to trigger the problem:
> >>>>>>>>>>
> >>>>>>>>>> Thanks for all the testing. I've tried it: all ublk/generic tests hang
> >>>>>>>>>> in userspace waiting for CQEs, but there are no complaints from the kernel.
> >>>>>>>>>> However, it seems the branch is buggy even without my patches: I
> >>>>>>>>>> consistently (5-15 minutes of running in a slow VM) hit a page underflow
> >>>>>>>>>> when running liburing tests. Not sure what that is yet, but it might also
> >>>>>>>>>> be the reason.
> >>>>>>>>>
> >>>>>>>>> Hmm odd, there's nothing in there but your series and then the
> >>>>>>>>> io_uring-6.9 bits pulled in. Maybe it hit an unfortunate point in the
> >>>>>>>>> merge window -git cycle? Does it happen with io_uring-6.9 as well? I
> >>>>>>>>> haven't seen anything odd.
> >>>>>>>>
> >>>>>>>> Need to test io_uring-6.9. I actually checked the branch twice, both
> >>>>>>>> times with the issue, and given the full recompilation and config
> >>>>>>>> prompts, I assumed you pulled something in between (maybe not).
> >>>>>>>>
> >>>>>>>> And yeah, I can't confirm it's specifically an io_uring bug; the
> >>>>>>>> stack trace is usually some unmap or task exit, and sometimes it only
> >>>>>>>> shows when you try to shut down the VM after tests.
> >>>>>>>
> >>>>>>> Funky. I just ran a bunch of loops of liburing tests and Ming's ublksrv
> >>>>>>> test case as well on io_uring-6.9 and it all worked fine. Trying
> >>>>>>> liburing tests on for-6.10/io_uring as well now, but didn't see anything
> >>>>>>> the other times I ran it. In any case, once you repost I'll rebase and
> >>>>>>> then let's see if it hits again.
> >>>>>>>
> >>>>>>> Did you run with KASAN enabled?
> >>>>>>
> >>>>>> Yes, it's a debug kernel, full-on KASAN, lockdep and so on.
> >>>>>
> >>>>> One more note: I triggered it once (IIRC on shutdown) with ublk
> >>>>> tests only, w/o liburing/tests, which likely limits it to either the
> >>>>> core io_uring infra or non-io_uring bugs.
> >>>>
> >>>> Been running on for-6.10/io_uring, and the only odd thing I see is that
> >>>> the test output tends to stall here:
> >>>>
> >>>> Running test read-before-exit.t
> >>>>
> >>>> which then either leads to a connection disconnect from my ssh into that
> >>>> vm, or just a long delay and then it picks up again. This did not happen
> >>>> with io_uring-6.9.
> >>>>
> >>>> Maybe related? At least it's something new. Just checked again, and yeah
> >>>> it seems to totally lock up the vm while that is running. Will try a
> >>>> quick bisect of that series.
> >>>
> >>> Seems to be triggered by the top of branch patch in there, my poll and
> >>> timeout special casing. While the above test case runs with that commit,
> >>> it'll freeze the host.
> >>
> >> Had a feeling this was the busy looping of cancelations, and flushing
> >> the fallback task_work seems to fix it. I'll check more tomorrow.
> >>
> >>
> >> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> >> index a2cb8da3cc33..f1d3c5e065e9 100644
> >> --- a/io_uring/io_uring.c
> >> +++ b/io_uring/io_uring.c
> >> @@ -3242,6 +3242,8 @@ static __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
> >>  	ret |= io_kill_timeouts(ctx, task, cancel_all);
> >>  	if (task)
> >>  		ret |= io_run_task_work() > 0;
> >> +	else if (ret)
> >> +		flush_delayed_work(&ctx->fallback_work);
> >>  	return ret;
> >>  }
> > 
> > I can still trigger the warning with the above patch:
> > 
> > [  446.275975] ------------[ cut here ]------------
> > [  446.276340] WARNING: CPU: 8 PID: 731 at kernel/fork.c:969 __put_task_struct+0x10c/0x180
> 
> And this is running that test case you referenced? I'll take a look, as
> it seems related to the poll kill rather than the other patchset.

Yeah, and now I am running 'git bisect' on Pavel's V2.

thanks,
Ming


