From: Vincent Guittot <vincent.guittot@linaro.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
Paul McKenney <paulmck@kernel.org>,
Frederic Weisbecker <fweisbec@gmail.com>,
Dietmar Eggeman <dietmar.eggemann@arm.com>,
Qais Yousef <qais.yousef@arm.com>,
Ben Segall <bsegall@google.com>,
Daniel Bristot de Oliveira <bristot@redhat.com>,
Ingo Molnar <mingo@redhat.com>,
Juri Lelli <juri.lelli@redhat.com>, Mel Gorman <mgorman@suse.de>,
Peter Zijlstra <peterz@infradead.org>,
Steven Rostedt <rostedt@goodmis.org>
Subject: Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Date: Mon, 25 Jan 2021 15:42:41 +0100 [thread overview]
Message-ID: <CAKfTPtBWoRuwwkaqQKNgHTnQBE4fevyYqEoeGc5RpCsBbOS1sQ@mail.gmail.com> (raw)
In-Reply-To: <YAsjOqmo7TEeXjoj@google.com>
On Fri, 22 Jan 2021 at 20:10, Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Vincent,
>
> Thanks for reply. Please see the replies below:
>
> On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> > On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > > that there are a lot of calls to update_blocked_averages(). This causes
> > > the schedule loop to slow down to taking upto 500 micro seconds at
> > > times (due to newidle load balance). I have also seen this manifest in
> > > the periodic balancer.
> > >
> > > Closer look shows that the problem is caused by the following
> > > ingredients:
> > > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > > suggesting to inspect /proc/sched_debug for this), this can make
> > > __update_blocked_fair() take a long time.
> >
> > Inactive cgroups are removed from the list so they should not impact
> > the duration
>
> I meant blocked CGroups. According to this code, a cfs_rq can be partially
> decayed and not have any tasks running on it but its load needs to be
> decayed, correct? That's what I meant by 'inactive'. I can reword it to
> 'blocked'.
How many blocked cgroups have you got ?
>
> * There can be a lot of idle CPU cgroups. Don't let fully
> * decayed cfs_rqs linger on the list.
> */
> if (cfs_rq_is_decayed(cfs_rq))
> list_del_leaf_cfs_rq(cfs_rq);
>
> > > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > > shared frequency domain configuration to be slower than usual. (the load
> >
> > What do you mean exactly by it causes schedutil to be slower than usual ?
>
> sugov_next_freq_shared() is order number of CPUs in the a cluster. This
> system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
> frequency update needs to go through utilization of other CPUs in the
> cluster. I believe this could be adding to the problem but is not really
> needed to optimize if we can rate limit the calls to update_blocked_averages
> to begin with.
Qais mentioned half of the time being used by
sugov_next_freq_shared(). Are there any frequency changes resulting in
this call ?
>
> > > average updates also try to update the frequency in schedutil).
> > >
> > > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > > code paths to take longer than when running at a high CPU frequency.
> >
> > Low frequency usually means low utilization so it should happen that much.
>
> It happens a lot as can be seen with schbench. It is super easy to reproduce.
Happening a lot in itself is not a problem if there is nothing else to
do so it's not a argument in itself
So why is it a problem for you ? You are mentioning newly idle load
balance so I assume that your root problem is the scheduling delay
generated by the newly idle load balance which then calls
update_blocked_averages
rate limiting the call to update_blocked_averages() will only reduce
the number of time it happens but it will not prevent it to happen.
>
> schedule() can result in new idle balance with the CFS pick call happening
> often. Here is a function graph trace. The tracer shows
> update_blocked_averages taking a lot of time.
>
> sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
> sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
> sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
> sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
> sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
> sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
> sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
> sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
> sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
> sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
> sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
> sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
> sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
> sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
> sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
> sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
> sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
> sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
> sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
> sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
> sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
> sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
> sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
> sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458
>
>
> > > The fix is simply rate limit the calls to update_blocked_averages to 20
> > > times per second. It appears that updating the blocked average less
> > > often is sufficient. Currently I see about 200 calls per second
> >
> > Would be good to explain why updating less often is sufficient ?
>
> I don't know this code that well, intuitively it seems to me updating blocked
> averages at such a high rate seems pointless. But I defer to your expertise
> on that. Why do you feel an update is needed at least HZ times per second?
> What about system with HZ=1000 or 300, that seems to be an insane rate of
> updating (not to mention all the complexity of going through the leaf cgroup
> list and doing the frequency updates).
Your trace above is a full update of blocked load which is a normal
sequence that happen regularly but you don't show that this happens
too often
The call to update_blocked _average is rate limited by load_balance
interval for idle and busy load balance so those 2 cases should not be
a problem. There is no rate limit in case of newly_idle because it is
assumed that the latter is normally called only if it has been
estimated that there is enough time to run it. That's why
update_blocked_load_average is called all the time during newly idle
load balance
see in update_sd_lb_stats()
if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
env->flags |= LBF_NOHZ_STATS;
We could consider removing this condition but I need time to remind
all the side effect of this
IIUC, your real problem is that newidle_balance is running whereas a
task is about to wake up on the cpu and we don't abort quickly during
this load_balance
so we could also try to abort earlier in case of newly idle load balance
>
> > > sometimes, which seems overkill.
> > >
> > > schbench shows a clear improvement with the change:
> >
> > Have you got more details about your test setup ?
> > which platform ?
> > which kernel ?
>
> I mentioned in the commit message it is a v5.4 kernel.
I was not sure if the tests results done with this kernel because we
usually ask for results against mainline to make sure you are not
facing a problem that has solved since v5.4 has been released
>
> The platform is Snapdragon 7c. The platform is:
> https://www.qualcomm.com/products/snapdragon-7c-compute-platform
>
> >
> > >
> > > Without patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > > 50.0th: 210 (106 samples)
> > > 75.0th: 619 (53 samples)
> > > 90.0th: 665 (32 samples)
> > > 95.0th: 703 (11 samples)
> > > *99.0th: 12656 (8 samples)
> > > 99.5th: 12784 (1 samples)
> > > 99.9th: 13424 (1 samples)
> > > min=15, max=13424
> > >
> > > With patch:
> > > ~/schbench -t 2 -m 2 -r 5
> > > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > > 50.0th: 188 (108 samples)
> > > 75.0th: 238 (53 samples)
> > > 90.0th: 623 (32 samples)
> > > 95.0th: 657 (12 samples)
> > > *99.0th: 717 (7 samples)
> > > 99.5th: 725 (2 samples)
> > > 99.9th: 725 (0 samples)
> > >
> > > Cc: Paul McKenney <paulmck@kernel.org>
> > > Cc: Frederic Weisbecker <fweisbec@gmail.com>
> > > Suggested-by: Dietmar Eggeman <dietmar.eggemann@arm.com>
> > > Co-developed-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Qais Yousef <qais.yousef@arm.com>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >
> > > ---
> > > kernel/sched/fair.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 04a3ce20da67..fe2dc0024db5 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > > return false;
> > >
> > > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
> >
> > This condition is there to make sure to update blocked load at most
> > once a tick in order to filter newly idle case otherwise the rate
> > limit is already done by load balance interval
>
> What prevents newidle_balance from hitting this code? Are you suggesting rate
> limit that as well?
>
> > This hard coded (HZ/20) looks really like an ugly hack
>
> Yes, it was not a well researched number. If you have a suggestion for better
> fix, let me know.
>
> thanks,
>
> - Joel
>
>
> > > return true;
> > >
> > > update_blocked_averages(cpu);
> > > --
> > > 2.30.0.280.ga3ce27912f-goog
> > >
next prev parent reply other threads:[~2021-01-25 14:46 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-01-22 15:46 [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ Joel Fernandes (Google)
2021-01-22 16:56 ` Vincent Guittot
2021-01-22 18:39 ` Qais Yousef
2021-01-22 19:14 ` Joel Fernandes
2021-01-25 13:23 ` Vincent Guittot
2021-01-26 16:36 ` Qais Yousef
2021-01-22 19:10 ` Joel Fernandes
2021-01-25 10:44 ` Dietmar Eggemann
2021-01-25 17:30 ` Vincent Guittot
2021-01-25 17:53 ` Dietmar Eggemann
2021-01-25 14:42 ` Vincent Guittot [this message]
2021-01-27 18:43 ` Joel Fernandes
2021-01-28 13:57 ` Vincent Guittot
2021-01-28 15:09 ` Joel Fernandes
2021-01-28 16:57 ` Qais Yousef
[not found] ` <CAKfTPtBvwm9vZb5C=2oTF6N-Ht6Rvip4Lv18yi7O3G8e-_ZWdg@mail.gmail.com>
2021-01-29 17:27 ` Vincent Guittot
2021-02-03 11:54 ` Dietmar Eggemann
2021-02-03 13:12 ` Vincent Guittot
2021-02-04 9:47 ` Dietmar Eggemann
2021-02-03 17:09 ` Qais Yousef
2021-02-03 17:35 ` Vincent Guittot
2021-02-04 10:45 ` Qais Yousef
2021-02-03 19:56 ` Joel Fernandes
2021-03-23 21:37 ` Tim Chen
2021-03-24 13:44 ` Vincent Guittot
2021-03-24 16:05 ` Tim Chen
2021-04-07 14:02 ` Vincent Guittot
2021-04-07 17:19 ` Tim Chen
2021-04-08 14:51 ` Vincent Guittot
2021-04-08 23:05 ` Tim Chen
2021-04-09 15:26 ` Vincent Guittot
2021-04-09 17:59 ` Tim Chen
2021-05-10 21:59 ` Tim Chen
2021-05-11 15:25 ` Vincent Guittot
2021-05-11 17:25 ` Tim Chen
2021-05-11 17:56 ` Vincent Guittot
2021-05-12 13:59 ` Qais Yousef
2021-05-13 18:45 ` Tim Chen
2021-05-17 16:14 ` Qais Yousef
2021-06-11 20:00 ` Tim Chen
2021-06-18 10:28 ` Vincent Guittot
2021-06-18 16:14 ` Tim Chen
2021-06-25 8:50 ` Vincent Guittot
2021-02-01 15:13 ` Joel Fernandes
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAKfTPtBWoRuwwkaqQKNgHTnQBE4fevyYqEoeGc5RpCsBbOS1sQ@mail.gmail.com \
--to=vincent.guittot@linaro.org \
--cc=bristot@redhat.com \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=fweisbec@gmail.com \
--cc=joel@joelfernandes.org \
--cc=juri.lelli@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=qais.yousef@arm.com \
--cc=rostedt@goodmis.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).