From: "Paul E. McKenney" <paulmck@kernel.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, rcu@vger.kernel.org
Subject: Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
Date: Sun, 17 Sep 2023 23:05:33 -0700
Message-ID: <3e0bcf79-f088-40c0-89b7-aef36f9af4ba@paulmck-laptop>
In-Reply-To: <20230915163711.GA3116200@google.com>
On Fri, Sep 15, 2023 at 04:37:11PM +0000, Joel Fernandes wrote:
> Hi Paul,
> Thanks! I merged replies to 3 threads into this one to make it easier to follow.
>
> On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> [...]
> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> >
> > This don't-schedule-on-dying-CPUs approach looks quite promising
> > to me!
> >
> > Then again, I cannot claim to be a scheduler expert. And I am a bit
> > surprised that this does not already happen, which makes me wonder
> > (admittedly without evidence either way) whether there is some CPU-hotplug
> > race that it might induce. But then again, figuring this sort of thing
> > out is part of what the scheduler folks are there for, right? ;-)
>
> Yes, it looks promising. Actually, this sort of thing already seems to be
> done in CFS; it's just not there in RT, so maybe it is OK. Testing so far
> has shown pretty good results, even with hotplug testing. Here's hoping!
>
> > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > by halving the total rcutorture boost threads.
> > >
> > > So that fixed it without any changes to RCU. Below is the updated patch,
> > > also for the archives, though I'm rewriting it slightly differently and
> > > testing that more. The main change in the new patch is that RT should
> > > not select !cpu_active() CPUs, since those have the scheduler turned
> > > off. Checking for cpu_dying() also works; I could not find any instance
> > > where cpu_dying() != cpu_active(), but there could be a tiny window
> > > where that is true. Anyway, I'll make some noise with the scheduler
> > > folks once I have the new version of the patch tested.
> > >
> > > Also, halving the number of RT boost threads makes the failure less
> > > likely but does not fix it. Not too surprising, since the issue may
> > > not be related to too many RT threads, but rather to a lockup between
> > > hotplug and RT.
> >
> > Again, looks promising! When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
>
> Thanks a lot! I have enclosed a simpler updated patch below which similarly
> shows very good results. This is the one I would like to test more and send
> to the scheduler folks. I'll send it out once I have tested it more, and
> possibly after seeing your results (I am on vacation next week, so there's
> time).
>
> > > We could run them on just the odd, or even ones and still be able to get
> > > sufficient boost testing. This may be especially important without RT
> > > throttling. I'll go ahead and queue a test like that.
> > >
> > > Thoughts?
> >
> > The problem with this is that it will often render RCU priority boosting
> > unnecessary. Any kthread preempted within an RCU read-side critical
> > section will with high probability quickly be resumed on one of the
> > even-numbered CPUs.
> >
> > Or were you also planning to bind the rcu_torture_reader() kthreads to
> > a specific CPU, preventing such migration? Or am I missing something
> > here?
>
> You are right, I see now why you were running them on all CPUs. One note
> though: if the RCU reader threads are CFS, they will not be immediately
> pushed out, so it might work (unlike if the preempted RCU reader threads
> were RT). However, testing shows that halving the thread count does not
> fix the issue anyway, so we can scratch that. Instead, below is the updated
> patch for the don't-schedule-on-dying/inactive-CPUs approach, which is
> showing really good results!
>
> And I'll most likely see you the week after, after entering into a 4-day
> quiescent state. ;-)
And I queued this as an experimental patch and am starting testing.
I am assuming that the path to mainline is through the scheduler tree.
Thanx, Paul
> thanks,
>
> - Joel
>
> ---8<-----------------------
>
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Subject: [PATCH] RT: Alternative fix for livelock with hotplug
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> kernel/sched/cpupri.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index a286e726eb4b..42c40cfdf836 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
>
> if (lowest_mask) {
> cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> + cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
>
> /*
> * We have to ensure that we have at least one bit
> --
> 2.42.0.459.ge4e396fd5e-goog
>