From: "Paul E. McKenney" <paulmck@kernel.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, rcu@vger.kernel.org
Subject: Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
Date: Sun, 17 Sep 2023 23:05:33 -0700
Message-ID: <3e0bcf79-f088-40c0-89b7-aef36f9af4ba@paulmck-laptop>
In-Reply-To: <20230915163711.GA3116200@google.com>
On Fri, Sep 15, 2023 at 04:37:11PM +0000, Joel Fernandes wrote:
> Hi Paul,
> Thanks! I merged replies to 3 threads into this one to make it easier to follow.
>
> On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> [...]
> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a live lock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide it
> > > by halving the total rcutorture boost threads.
> >
> > This don't-schedule-on-dying-CPUs approach looks quite promising
> > to me!
> >
> > Then again, I cannot claim to be a scheduler expert. And I am a bit
> > surprised that this does not already happen, which makes me wonder
> > (admittedly without evidence either way) whether there is some CPU-hotplug
> > race that it might induce. But then again, figuring this sort of thing
> > out is part of what the scheduler folks are there for, right? ;-)
>
> Yes, it looks promising. Actually, this sort of thing already seems to be
> done in CFS; it's just not there in RT, so maybe it is OK. Testing so far
> has shown pretty good results, even with hotplug testing. Here's hoping!
>
> > > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > > This is to fix a live lock between RT push and CPU hotplug's
> > > > select_fallback_rq()-induced push. I am not sure if the fix works but I have
> > > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > > the real fix is needed to prevent these issues even if we're able to hide it
> > > > by halving the total rcutorture boost threads.
> > >
> > > So that fixed it without any changes to RCU. Below is the updated patch,
> > > also for the archives, though I'm rewriting it slightly differently and
> > > testing that more. The main change in the new patch is that RT should
> > > not select !cpu_active() CPUs, since those have the scheduler turned
> > > off. Checking for cpu_dying() also works; I could not find any instance
> > > where cpu_dying() != cpu_active(), but there could be a tiny window
> > > where that is true. Anyway, I'll make some noise with the scheduler
> > > folks once I have the new version of the patch tested.
> > >
> > > Also, halving the number of RT boost threads makes the failure less
> > > likely but does not fix it. Not too surprising, since the issue may
> > > not be related to too many RT threads, but rather to a lockup between
> > > hotplug and RT.
> >
> > Again, looks promising! When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
>
> Thanks a lot! I have enclosed a simpler updated patch below which similarly
> shows very good results. This is the one I would like to test more and send
> to the scheduler folks. I'll send it out once I have tested it more, and
> possibly after seeing your results (I am on vacation next week, so there's
> time).
>
> > > We could run them on just the odd, or even ones and still be able to get
> > > sufficient boost testing. This may be especially important without RT
> > > throttling. I'll go ahead and queue a test like that.
> > >
> > > Thoughts?
> >
> > The problem with this is that it will often render RCU priority boosting
> > unnecessary. Any kthread preempted within an RCU read-side critical
> > section will with high probability quickly be resumed on one of the
> > even-numbered CPUs.
> >
> > Or were you also planning to bind the rcu_torture_reader() kthreads to
> > a specific CPU, preventing such migration? Or am I missing something
> > here?
>
> You are right, I see now why you were running them on all CPUs. One note
> though: if the RCU reader threads are CFS, they will not be immediately
> pushed out, so it might work (unlike if the preempted RCU reader threads
> were RT). However, testing shows that halving the thread count does not
> fix the issue anyway, so we can scratch that. Instead, below is the updated
> patch for the don't-schedule-on-dying/inactive-CPUs approach, which is
> showing really good results!
>
> And I'll most likely see you the week after, after entering into a 4-day
> quiescent state. ;-)
And I queued this as an experimental patch and am starting testing.
I am assuming that the path to mainline is through the scheduler tree.
Thanx, Paul
> thanks,
>
> - Joel
>
> ---8<-----------------------
>
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Subject: [PATCH] RT: Alternative fix for livelock with hotplug
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> kernel/sched/cpupri.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index a286e726eb4b..42c40cfdf836 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
>
> if (lowest_mask) {
> cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> + cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
>
> /*
> * We have to ensure that we have at least one bit
> --
> 2.42.0.459.ge4e396fd5e-goog
>