From: "Paul E. McKenney" <paulmck@kernel.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, rcu@vger.kernel.org
Subject: Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
Date: Sun, 17 Sep 2023 23:05:33 -0700
Message-ID: <3e0bcf79-f088-40c0-89b7-aef36f9af4ba@paulmck-laptop>
In-Reply-To: <20230915163711.GA3116200@google.com>

On Fri, Sep 15, 2023 at 04:37:11PM +0000, Joel Fernandes wrote:
> Hi Paul,
> Thanks! I merged replies to 3 threads into this one to make it easier to follow.
> 
> On Fri, Sep 15, 2023 at 07:53:15AM -0700, Paul E. McKenney wrote:
> > On Fri, Sep 15, 2023 at 11:33:13AM +0000, Joel Fernandes wrote:
> > > On Fri, Sep 15, 2023 at 12:13:31AM +0000, Joel Fernandes wrote:
> [...]
> > > On the other hand, I came up with a real fix [1] and I am currently testing it.
> > > This is to fix a livelock between RT push and CPU hotplug's
> > > select_fallback_rq()-induced push. I am not sure if the fix works, but I have
> > > some faith based on what I'm seeing in traces. Fingers crossed. I also feel
> > > the real fix is needed to prevent these issues even if we're able to hide them
> > > by halving the total rcutorture boost threads.
> > 
> > This don't-schedule-on-dying-CPUs approach looks quite promising
> > to me!
> > 
> > Then again, I cannot claim to be a scheduler expert.  And I am a bit
> > surprised that this does not already happen.  Which makes me wonder
> > (admittedly without evidence either way) whether there is some CPU-hotplug
> > race that it might induce.  But then again, figuring this sort of thing
> > out is part of what the scheduler folks are there for, right?  ;-)
> 
> Yes, it looks promising. Actually, this sort of thing already seems to be
> done in CFS; it is just not there in RT. So maybe it is OK. Testing so far
> has shown pretty good results even with hotplug testing. Here is hoping!
> 
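To make the failure mode concrete, here is a toy userspace model of that
livelock (illustration only: the CPU numbering and the find_lowest_rq_model()
helper are invented for this sketch and are not kernel code):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Pretend CPU 2 is in the middle of hot-unplug. */
static bool cpu_is_dying[NR_CPUS] = { false, false, true, false };

/* Stand-in for the RT push target choice: pretend CPU 2 always holds the
 * lowest-priority task, so it is always selected. */
static int find_lowest_rq_model(void)
{
	return 2;
}

int main(void)
{
	int task_cpu = 0;

	for (int round = 0; round < 3; round++) {
		task_cpu = find_lowest_rq_model();	/* RT push */
		printf("push: task -> CPU %d\n", task_cpu);
		if (cpu_is_dying[task_cpu]) {
			/* hotplug's fallback path evicts the task again */
			task_cpu = 0;
			printf("hotplug: task evicted back to CPU %d\n",
			       task_cpu);
		}
	}
	/* Masking !cpu_active() CPUs out of the push target set, as in the
	 * patch below, prevents this ping-pong from ever starting. */
	return 0;
}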
> > > > [...]
> > > 
> > > So that fixed it without any changes to RCU. Below is the updated patch,
> > > also for the archives, though I'm rewriting it slightly differently and
> > > testing that more. The main idea in the new patch is that RT should not
> > > select !cpu_active() CPUs, since those have the scheduler turned off.
> > > Checking for cpu_dying() also works; I could not find any instance where
> > > cpu_dying() != cpu_active(), but there could be a tiny window where that
> > > is true. Anyway, I'll make some noise with the scheduler folks once I
> > > have the new version of the patch tested.
> > > 
> > > Also, halving the number of RT boost threads makes the failure less likely
> > > but does not fix it. Not too surprising, since the issue may not be related
> > > to too many RT threads but rather to a livelock between hotplug and RT.
> > 
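As a side note on the cpu_active()/cpu_dying() distinction above, here is a
small self-contained model (the masks are invented, and the helpers are
userspace stand-ins for the kernel's real cpu_active()/cpu_dying(), which
test per-CPU bits in cpu_active_mask and cpu_dying_mask):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented example state: CPU 2 is partway through hot-unplug. */
static uint64_t active_mask = 0xb;	/* 0b1011: CPU 2 deactivated  */
static uint64_t dying_mask  = 0x4;	/* 0b0100: CPU 2 marked dying */

static bool cpu_active(int cpu) { return active_mask & (1ULL << cpu); }
static bool cpu_dying(int cpu)  { return dying_mask  & (1ULL << cpu); }

int main(void)
{
	for (int cpu = 0; cpu < 4; cpu++)
		printf("CPU %d: active=%d dying=%d -> RT push target? %s\n",
		       cpu, cpu_active(cpu), cpu_dying(cpu),
		       cpu_active(cpu) && !cpu_dying(cpu) ? "yes" : "no");
	return 0;
}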
> > Again, looks promising!  When I get the non-RCU -rcu stuff moved to
> > v6.6-rc1 and appropriately branched and tested, I will give it a go on
> > the test setup here.
> 
> Thanks a lot, and I have enclosed a simpler updated patch below which
> similarly shows very good results. This is the one I would like to test
> further and send to the scheduler folks. I'll send it out once it is better
> tested, and possibly after seeing your results (I am on vacation next week,
> so there's time).
> 
> > > We could run them on just the odd or even CPUs and still get sufficient
> > > boost testing. This may be especially important without RT throttling.
> > > I'll go ahead and queue a test like that.
> > > 
> > > Thoughts?
> > 
> > The problem with this is that it will often render RCU priority boosting
> > unnecessary.  Any kthread preempted within an RCU read-side critical
> > section will with high probability quickly be resumed on one of the
> > even-numbered CPUs.
> > 
> > Or were you also planning to bind the rcu_torture_reader() kthreads to
> > a specific CPU, preventing such migration?  Or am I missing something
> > here?
> 
> You are right; I see now why you were running them on all CPUs. One note,
> though: as the RCU reader threads are CFS, they will not be immediately
> pushed out, so it might work (unlike if the preempted RCU reader threads
> were RT). However, testing shows thread-halving does not fix the issue
> anyway, so we can scratch that. Instead, below is the updated patch for the
> don't-schedule-on-dying/inactive-CPUs approach, which is showing really
> good results!
> 
> And I'll most likely see you the week after, after entering into a 4-day
> quiescent state. ;-)

And I queued this as an experimental patch and am starting testing.
I am assuming that the path to mainline is through the scheduler tree.

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes (Google) <joel@joelfernandes.org>
> Subject: [PATCH] RT: Alternative fix for livelock with hotplug
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/sched/cpupri.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/cpupri.c b/kernel/sched/cpupri.c
> index a286e726eb4b..42c40cfdf836 100644
> --- a/kernel/sched/cpupri.c
> +++ b/kernel/sched/cpupri.c
> @@ -101,6 +101,7 @@ static inline int __cpupri_find(struct cpupri *cp, struct task_struct *p,
>  
>  	if (lowest_mask) {
>  		cpumask_and(lowest_mask, &p->cpus_mask, vec->mask);
> +		cpumask_and(lowest_mask, lowest_mask, cpu_active_mask);
>  
>  		/*
>  		 * We have to ensure that we have at least one bit
> -- 
> 2.42.0.459.ge4e396fd5e-goog
> 
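To spell out what that one added line buys, here is a rough userspace
illustration with invented masks (the real code operates on struct cpumask):
the candidate set is additionally intersected with the active mask, so a
deactivated CPU can never come back as a push target.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t cpus_mask   = 0xf;	/* p->cpus_mask: task may use CPUs 0-3  */
	uint64_t vec_mask    = 0xe;	/* vec->mask: CPUs 1-3 in this prio vec */
	uint64_t active_mask = 0xb;	/* cpu_active_mask: CPU 2 deactivated   */

	uint64_t lowest_mask = cpus_mask & vec_mask;	/* existing cpumask_and */
	lowest_mask &= active_mask;			/* the patch's new line */

	for (int cpu = 0; cpu < 4; cpu++)
		if (lowest_mask & (1ULL << cpu))
			printf("CPU %d remains a push candidate\n", cpu);
	/* Prints CPUs 1 and 3: CPU 0 is not in this priority vector, and
	 * CPU 2 is no longer active. */
	return 0;
}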
