Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: "Paul E. McKenney" <paulmck@kernel.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Frederic Weisbecker <frederic@kernel.org>, rcu@vger.kernel.org
Subject: Re: [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..)
Date: Thu, 14 Sep 2023 08:23:38 -0700	[thread overview]
Message-ID: <885bb95b-9068-45f9-ba46-3feb650a3c45@paulmck-laptop> (raw)
In-Reply-To: <20230914131351.GA2274683@google.com>

On Thu, Sep 14, 2023 at 01:13:51PM +0000, Joel Fernandes wrote:
> On Thu, Sep 14, 2023 at 04:11:26AM -0700, Paul E. McKenney wrote:
> > On Wed, Sep 13, 2023 at 04:30:20PM -0400, Joel Fernandes wrote:
> > > On Mon, Sep 11, 2023 at 4:16 AM Paul E. McKenney <paulmck@kernel.org> wrote:
> > > [..]
> > > > > I am digging deeper to see why the rcu_preempt thread cannot be pushed out
> > > > > and then I'll also look at why is it being pushed out in the first place.
> > > > >
> > > > > At least I have a strong repro now running 5 instances of TREE03 in parallel
> > > > > for several hours.
> > > >
> > > > Very good!  Then why not boot with rcutorture.onoff_interval=0 and see if
> > > > the problem still occurs?  If yes, then there is definitely some reason
> > > > other than CPU hotplug that makes this happen.
> > > 
> > > Hi Paul,
> > > So looks so far like onoff_interval=0 makes the issue disappear. So
> > > likely hotplug related. I am ok with doing the cpus_read_lock during
> > > boost testing and seeing if that fixes it. If it does, I can move on
> > > to the next thing in my backlog.
> > > 
> > > What do you think? Or should I spend more time root-causing it? It is
> > > most like runaway RT threads combined with the CPU hotplug threads,
> > > making scheduling of the rcu_preempt thread not happen. But I can't
> > > say for sure without more/better tracing (Speaking of better tracing,
> > > I am adding core-dump support to rcutorture, but it is not there yet).
> > 
> > This would not be the first time rcutorture has had trouble with those
> > threads, so I am for adding the cpus_read_lock().
> > 
> > Additional root-causing might be helpful, but then again, you might
> > have higher priority things to worry about.  ;-)
> 
> No worries. Unfortunately putting cpus_read_lock() around the boost test
> causes hangs. I tried something like the following [1]. If you have a diff, I can
> quickly try something to see if the issue goes away as well.

The other approaches that occur to me are:

1.	Synchronize with the torture.c CPU-hotplug code.  This is a bit
	tricky as well.

2.	Rearrange the testing to convert one of the TREE0* scenarios that
	is not in CFLIST (TREE06 or TREE08) to a real-time configuration,
	with boosting but without CPU hotplug.	Then remove boosting
	from TREE04.

Of these, #2 seems most productive.  But is there a better way?

> I am also interested in tracing further the state of CPUs when the back and
> forth migration happens. That code is really hairy (select_fallback_rq())
> with some scary code comments about more audit being needed, so I would not
> be surprised if something is off there. I left some trace prints in there [2]
> to see if I can gather more information. I'll let you know if I find
> anything there.

Looking forward to seeing what you come up with!

> [ btw I'm officially OOO from Mon-Thurs next week, that's mostly Ok since the
> last set stable releases just got done. Though I might cheat and attend to
> low-hanging fruit type of tasks like documentation during the OOO :-D ]

Have a good time away!

							Thanx, Paul

> [1]
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 9c028b118abc..4354f87f522a 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -1115,6 +1115,7 @@ static int rcu_torture_boost(void *arg)
>  		}
>  
>  		// Do one boost-test interval.
> +		cpus_read_lock();
>  		endtime = oldstarttime + test_boost_duration * HZ;
>  		while (time_before(jiffies, endtime)) {
>  			// Has current GP gone too long?
> @@ -1164,7 +1165,7 @@ static int rcu_torture_boost(void *arg)
>  			}
>  			schedule_timeout_uninterruptible(1);
>  		}
> -
> +		cpus_read_unlock();
>  		/* Go do the stutter. */
>  checkwait:	if (stutter_wait("rcu_torture_boost"))
>  			sched_set_fifo_low(current);
> -- 
> 
> [2]
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1569e9591654..1e79e059e578 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2552,6 +2552,12 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
>  	rq = cpu_rq(new_cpu);
>  
>  	rq_lock(rq, rf);
> +	if (rq->curr) {
> +		trace_printk("Curr on new_cpu (%d) is: %s[%d]\n",
> +				new_cpu, rq->curr->comm, task_pid_nr(rq->curr));
> +	} else {
> +		trace_printk("Curr on new_cpu (%d) is NULL\n", new_cpu);
> +	}
>  	WARN_ON_ONCE(task_cpu(p) != new_cpu);
>  	activate_task(rq, p, 0);
>  	check_preempt_curr(rq, p, 0);
> @@ -9490,6 +9496,11 @@ static int __balance_push_cpu_stop(void *arg)
>  		trace_printk("attempt push task %d currently on cpu %d to...\n", task_pid_nr(p), rq->cpu);
>  		cpu = select_fallback_rq(rq->cpu, p);
>  		trace_printk("to new cpu %d\n", cpu);
> +		trace_printk("is_cpu_allowed(%d), md(p)=%d, pf_kthread=%d, kthread_is_per_cpu=%d, cpu_online=%d, cpu_dying=%d\n",
> +				cpu,
> +				is_migration_disabled(p),
> +				p->flags & PF_KTHREAD, kthread_is_per_cpu(p),
> +				cpu_online(cpu), cpu_dying(cpu));
>  		rq = __migrate_task(rq, &rf, p, cpu);
>  		trace_printk("After __migrate_task, new cpu = %d\n", rq->cpu);
>  	}
> -- 
> 2.42.0.459.ge4e396fd5e-goog
>

next prev parent reply	other threads:[~2023-09-14 15:23 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-10 20:14 [BUG] Random intermittent boost failures (Was Re: [BUG] TREE04..) Joel Fernandes
2023-09-10 21:16 ` Paul E. McKenney
2023-09-10 23:37   ` Joel Fernandes
2023-09-11  2:27     ` Joel Fernandes
2023-09-11  8:16       ` Paul E. McKenney
2023-09-11 13:17         ` Joel Fernandes
2023-09-11 13:49           ` Paul E. McKenney
2023-09-11 16:18             ` Joel Fernandes
2023-09-13 20:30         ` Joel Fernandes
2023-09-14 11:11           ` Paul E. McKenney
2023-09-14 13:13             ` Joel Fernandes
2023-09-14 15:23               ` Paul E. McKenney [this message]
2023-09-14 18:56                 ` Joel Fernandes
2023-09-14 21:53                   ` Joel Fernandes
2023-09-15  0:13                     ` Joel Fernandes
2023-09-15 11:33                       ` Joel Fernandes
2023-09-15 14:53                         ` Paul E. McKenney
2023-09-15 16:37                           ` Joel Fernandes
2023-09-15 16:57                             ` Paul E. McKenney
2023-09-15 21:14                               ` Joel Fernandes
2023-09-18  6:05                             ` Paul E. McKenney
2023-09-15 14:48                       ` Paul E. McKenney
2023-09-15 14:45                     ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=885bb95b-9068-45f9-ba46-3feb650a3c45@paulmck-laptop \
    --to=paulmck@kernel.org \
    --cc=frederic@kernel.org \
    --cc=joel@joelfernandes.org \
    --cc=rcu@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.