From: Joel Fernandes <joel@joelfernandes.org>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, rcu@vger.kernel.org
Subject: Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
Date: Mon, 11 Sep 2023 13:17:30 +0000
Message-ID: <20230911131730.GA2291108@google.com>
In-Reply-To: <1f12ffe6-4cb0-4364-8c4c-3393ca5368c2@paulmck-laptop>

On Mon, Sep 11, 2023 at 01:16:21AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 11, 2023 at 02:27:25AM +0000, Joel Fernandes wrote:
> > On Sun, Sep 10, 2023 at 07:37:13PM -0400, Joel Fernandes wrote:
> > > On Sun, Sep 10, 2023 at 5:16 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > >
> > > > On Sun, Sep 10, 2023 at 08:14:45PM +0000, Joel Fernandes wrote:
> > > [...]
> > > > > > I have been running into another intermittent one as well, the boost
> > > > > > failure, which happens once in 10-15 runs or so.
> > > > > >
> > > > > > I was thinking of running the following configuration on a regular,
> > > > > > automated basis to at least provide better clues on the lucky run that
> > > > > > catches the issue. But the concern is that it would change timing enough
> > > > > > to maybe hide bugs. I could also make it submit logs automatically to the
> > > > > > list on such occurrences, but one step at a time and all that. I do need
> > > > > > to add (hopefully less noisy) tick/timer-related trace events.
> > > > > >
> > > > > > # Define the bootargs array
> > > > > > bootargs=(
> > > > > >     "ftrace_dump_on_oops"
> > > > > >     "panic_on_warn=1"
> > > > > >     "sysctl.kernel.panic_on_rcu_stall=1"
> > > > > >     "sysctl.kernel.max_rcu_stall_to_panic=1"
> > > > > >     "trace_buf_size=10K"
> > > > > >     "traceoff_on_warning=1"
> > > > > >     "panic_print=0x1f"      # To dump held locks, mem and other info.
> > > > > > )
> > > > > > # Define the trace events array passed to bootargs.
> > > > > > trace_events=(
> > > > > >     "sched:sched_switch"
> > > > > >     "sched:sched_waking"
> > > > > >     "rcu:rcu_callback"
> > > > > >     "rcu:rcu_fqs"
> > > > > >     "rcu:rcu_quiescent_state_report"
> > > > > >     "rcu:rcu_grace_period"
> > > > > > )
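
To make the above concrete, the idea is to expand those arrays into a single
kernel command line and hand it to the rcutorture scripts. A minimal sketch,
assuming the usual kvm.sh from tools/testing/selftests/rcutorture and the
trace_event= boot parameter (the exact wrapper script is not shown here):

    # Join the boot arguments with spaces and the trace events with commas.
    bootargs_str="${bootargs[*]}"
    trace_events_str=$(IFS=,; echo "${trace_events[*]}")

    # Hand the whole thing to the TREE03 scenario via --bootargs.
    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
        --configs "TREE03" \
        --bootargs "$bootargs_str trace_event=$trace_events_str"
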
> > > > >
> > > > > So some insight on this boost failure. Just before the boost failures are
> > > > > reported, I see the migration thread interfering with the rcu_preempt thread
> > > > > (aka the GP kthread). See the trace below. Of note is that the rcu_preempt
> > > > > thread is still runnable when it is context-switched out, which means its
> > > > > execution is being interfered with. As can be seen, the rcu_preempt thread
> > > > > is at RT prio 2 (shown as prio 97 in the trace).
> > > > >
> > > > > So some open-ended questions: what exactly does the migration thread want?
> > > > > Is this something related to CPU hotplug? And if the migration thread had
> > > > > to run, why did the rcu_preempt thread not get pushed to another CPU by the
> > > > > scheduler? We have 16 vCPUs for this test.
> > > >
> > > > Maybe we need a cpus_read_lock() before doing a given boost-test interval
> > > > and a cpus_read_unlock() after finishing one?  But much depends on
> > > > exactly what is starting those migration threads.
> > > 
> > > But in the field, a real RT task can preempt a reader without doing
> > > cpus_read_lock() and may run into a similar boost issue?
> 
> The sysctl_sched_rt_runtime should prevent a livelock in most
> configurations.  Here, rcutorture explicitly disables this.

I see. Though RT throttling will actually stall the rcu_preempt thread as
well in the real world. RT throttling is a bit broken and we're trying to fix
it in scheduler land: even if there are idle CPUs, RT throttling will starve
not just the offending RT task but all RT tasks, essentially causing a
priority inversion between RT and CFS tasks.
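
For reference, the knobs involved are the usual RT-throttling sysctls. A
minimal sketch with the default values (setting the runtime to -1 is
effectively what rcutorture forces so that its boost kthreads are never
throttled):

    # By default, RT tasks may consume at most 950ms of every 1s period;
    # the remaining 50ms is reserved for non-RT tasks.
    sysctl kernel.sched_rt_period_us    # 1000000
    sysctl kernel.sched_rt_runtime_us   # 950000

    # -1 disables RT throttling entirely.
    sysctl -w kernel.sched_rt_runtime_us=-1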

> > > > Then again, TREE03 is pretty aggressive about doing CPU hotplug.
> > > 
> > > Ok. I put a trace_printk() in the stopper thread to see what the
> > > ->fn() is. I'm doing another run to see what falls out.
> > 
> > I think I am getting somewhere. It looks like the migration thread tries to
> > push the task off the CPU via __balance_push_cpu_stop(). Clearly it fails,
> > because the rcu_preempt task is still on CPU 3, and the whole thing is
> > livelocked retrying the same push.
> > 
> > I am not sure if the migration thread is trying to migrate rcu_preempt
> > because the CPU is going down or because a higher priority RT task is
> > running.
> > 
> > rcu_pree-13        3d..4. 2936723336us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723338us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723341us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > migratio-34        3d..2. 2936723345us : sched_switch: prev_comm=migration/3 prev_pid=34 prev_prio=0 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> > <idle>-0         3d..2. 2936723676us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=13 next_prio=97
> > rcu_pree-13        3d..4. 2936723678us : sched_waking: comm=migration/3 pid=34 prio=0 target_cpu=003
> > rcu_pree-13        3d..2. 2936723680us : sched_switch: prev_comm=rcu_preempt prev_pid=13 prev_prio=97 prev_state=R ==> next_comm=migration/3 next_pid=34 next_prio=0
> > migratio-34        3..... 2936723683us : cpu_stopper_thread: cpu_stop: caller=balance_push+0xc9/0x150 fn=__balance_push_cpu_stop+0x0/0x1c0 on cpu 3
> > 
> > I am digging deeper to see why the rcu_preempt thread cannot be pushed out,
> > and then I'll also look at why it is being pushed out in the first place.
> > 
> > At least I have a strong repro now, running 5 instances of TREE03 in
> > parallel for several hours.
> 
> Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> the problem still occurs?  If yes, then there is definitely some reason
> other than CPU hotplug that makes this happen.

Sure, I'll do this now. If the problem disappears, I'll try the
cpus_read_lock() suggestion you mentioned. IMHO this is increasingly looking
like an RT scheduling issue: with too many RT tasks running at the same time,
the rcu_preempt thread spends more time migrating around looking for a CPU
than doing any actual work.
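
Concretely, the repro plus this experiment amounts to something like the
following, assuming the standard kvm.sh from the rcutorture scripts (the
5-instance multiplier matches the parallel run above; the duration, in
minutes, is arbitrary):

    tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus \
        --configs "5*TREE03" --duration 360 \
        --bootargs "rcutorture.onoff_interval=0"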

This morning's traces show that __balance_push_cpu_stop() actually
successfully moves the rcu_preempt task off the CPU. But a few ms later, it
is back on the same CPU again, only to have the migration thread try to move
it out again. Maybe the cpuhp threads are fighting with it. There are also
all the rcutorture boost kthreads running at the same time during the
failure, but those are lower priority than the rcu_preempt thread in this
test.
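
As a quick sanity check on those relative priorities, something like the
following inside the guest prints the RT policy and priority of the competing
kthreads (the comm patterns are from memory and may need adjusting; kthread
comms are truncated to 15 characters, hence "rcu_torture_boo"):

    # Show scheduling policy and RT priority of the GP kthread, the per-CPU
    # stopper threads, and the rcutorture boost kthreads.
    for p in $(pgrep 'rcu_preempt|migration/|rcu_torture_boo'); do
            chrt -p "$p"
    done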

Also, I am switching to some day-job things while the onoff_interval=0 test
runs. Thanks a lot, Paul, for your help! I'll continue to look at this as I
can.

thanks,

 - Joel

