From: Ravi Bangoria <ravi.bangoria@amd.com>
To: <peterz@infradead.org>, <mingo@redhat.com>, <acme@kernel.org>,
	<namhyung@kernel.org>, <irogers@google.com>
Cc: <ravi.bangoria@amd.com>, <swapnil.sapkal@amd.com>,
	<mark.rutland@arm.com>, <alexander.shishkin@linux.intel.com>,
	<jolsa@kernel.org>, <rostedt@goodmis.org>,
	<vincent.guittot@linaro.org>, <bristot@redhat.com>,
	<adrian.hunter@intel.com>, <james.clark@arm.com>,
	<kan.liang@linux.intel.com>, <gautham.shenoy@amd.com>,
	<kprateek.nayak@amd.com>, <juri.lelli@redhat.com>,
	<yangjihong@bytedance.com>, <linux-kernel@vger.kernel.org>,
	<linux-perf-users@vger.kernel.org>, <santosh.shukla@amd.com>,
	<ananth.narayan@amd.com>, <sandipan.das@amd.com>
Subject: [RFC 0/4] perf sched: Introduce schedstat tool
Date: Wed, 8 May 2024 11:34:23 +0530
Message-ID: <20240508060427.417-1-ravi.bangoria@amd.com>

MOTIVATION
----------

Existing `perf sched` is quite exhaustive and provides a lot of insight
into scheduler behavior, but it quickly becomes impractical to use for
long running or scheduler-intensive workloads. For example, `perf sched
record` has ~7.77% overhead on hackbench (with 25 groups each running
700K loops on a 2-socket 128 Cores 256 Threads 3rd Generation EPYC
Server), and it generates a huge 56G perf.data for which perf takes
~137 mins to prepare and write it to disk [1].

Unlike `perf sched record`, which hooks onto a set of scheduler
tracepoints and generates samples on each tracepoint hit, `perf sched
schedstat record` takes a snapshot of the /proc/schedstat file before
and after the workload, i.e. there is zero interference with the
workload run. Also, it takes very little time to parse /proc/schedstat,
convert it into perf samples and save those samples into the perf.data
file. The resulting perf.data file is also much smaller. So, overall,
`perf sched schedstat record` is much more lightweight compared to
`perf sched record`.
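
The snapshot-and-diff idea can be illustrated with a minimal,
standalone sketch (this is not the actual patch code; it assumes the
nine-counter cpu line of the v15 layout and diffs only cpu0 for
brevity, whereas the real tool converts the snapshots into perf samples
and writes them to perf.data):

  /* sketch: snapshot /proc/schedstat before/after a workload, diff cpu0 */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  #define NR_CPU_FIELDS 9
  #define SNAP_SIZE (1 << 20)

  static void snapshot(char *buf, size_t size)
  {
      FILE *f = fopen("/proc/schedstat", "r");
      size_t n;

      if (!f) {
          perror("/proc/schedstat");
          exit(1);
      }
      n = fread(buf, 1, size - 1, f);
      buf[n] = '\0';
      fclose(f);
  }

  /* Locate the "cpu0 " line in a snapshot and parse its counters. */
  static int parse_cpu0(char *snap, unsigned long long *vals)
  {
      char *p = strstr(snap, "cpu0 ");
      int i;

      if (!p)
          return -1;
      p += strlen("cpu0 ");
      for (i = 0; i < NR_CPU_FIELDS; i++)
          vals[i] = strtoull(p, &p, 10);
      return 0;
  }

  int main(int argc, char **argv)
  {
      static char before[SNAP_SIZE], after[SNAP_SIZE];
      unsigned long long b[NR_CPU_FIELDS], a[NR_CPU_FIELDS];
      int i;

      snapshot(before, sizeof(before));
      if (argc > 1)
          system(argv[1]);    /* run the workload in between */
      snapshot(after, sizeof(after));

      if (parse_cpu0(before, b) || parse_cpu0(after, a))
          return 1;

      for (i = 0; i < NR_CPU_FIELDS; i++)
          printf("cpu0 counter %d delta: %llu\n", i, a[i] - b[i]);
      return 0;
  }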

We, internally at AMD, have been using this (a variant of it, known as
"sched-scoreboard"[2]) and have found it very useful for analysing the
impact of scheduler code changes[3][4].

Please note that this is not a replacement for perf sched record/report.
The intended users of the new tool are scheduler developers, not regular
users.

USAGE
-----

  # perf sched schedstat record
  # perf sched schedstat report

Note: Although the perf schedstat tool supports the workload profiling
syntax (i.e. -- <workload>), the recorded profile is still system-wide
since /proc/schedstat is a system-wide file.
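
For example (the workload command below is only illustrative):

  # perf sched schedstat record -- perf bench sched messaging -g 25
  # perf sched schedstat report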

HOW TO INTERPRET THE REPORT
---------------------------

The schedstat report starts with the total time profiling was active,
in jiffies:

  ----------------------------------------------------------------------------------------------------
  Time elapsed (in jiffies)                                  :       24537
  ----------------------------------------------------------------------------------------------------

Next come the cpu scheduling statistics. These are simple diffs of the
/proc/schedstat cpu lines, along with a description of each field. The
report also prints percentages relative to the base stat.

In the example below, schedule() left cpu0 idle 98.19% of the time,
16.54% of all try_to_wake_up() calls were to wake up the local cpu, and
the total waittime of tasks on cpu0 is 0.49% of the total runtime of
tasks on the same cpu.

  ----------------------------------------------------------------------------------------------------
  cpu:  cpu0
  ----------------------------------------------------------------------------------------------------
  sched_yield() count                                         :           0
  Legacy counter can be ignored                               :           0
  schedule() called                                           :       17138
  schedule() left the processor idle                          :       16827 ( 98.19% )
  try_to_wake_up() was called                                 :         508
  try_to_wake_up() was called to wake up the local cpu        :          84 ( 16.54% )
  total runtime by tasks on this processor (in jiffies)       :  2408959243
  total waittime by tasks on this processor (in jiffies)      :    11731825 ( 0.49% )
  total timeslices run on this cpu                            :         311
  ----------------------------------------------------------------------------------------------------
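
As a small worked example, the percentages above follow directly from
the raw counter deltas (the variable names below are only descriptive,
not taken from the patch):

  /* Reproduce the cpu0 percentages shown in the report above. */
  #include <stdio.h>

  int main(void)
  {
      /* counter deltas taken from the cpu0 example above */
      unsigned long long sched_count = 17138, sched_goidle = 16827;
      unsigned long long ttwu_count  = 508,   ttwu_local   = 84;
      unsigned long long run_time    = 2408959243ULL;
      unsigned long long wait_time   = 11731825ULL;

      printf("left idle     : %.2f%%\n", 100.0 * sched_goidle / sched_count);
      printf("local wakeups : %.2f%%\n", 100.0 * ttwu_local / ttwu_count);
      printf("wait vs run   : %.2f%%\n", 100.0 * wait_time / run_time);
      return 0;
  }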

Next come the load-balancing statistics. For each of the sched domains
(e.g. `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
the following three categories:

  1) Idle Load Balance: Load balancing performed on behalf of a long
                        idling cpu by some other cpu.
  2) Busy Load Balance: Load balancing performed when the cpu was busy.
  3) New Idle Balance : Load balancing performed when a cpu just became
                        idle.

Under each of these three categories, the schedstat report provides
different load-balancing statistics. Along with the direct stats, the
report also contains derived metrics, prefixed with *. Example:

  ----------------------------------------------------------------------------------------------------
  CPU 0 DOMAIN 0
  ----------------------------------------------------------------------------------------------------
  ------------------------------------------<Category idle>-------------------------------------------
  load_balance() count on cpu idle                                 :          50   $      490.74 $
  load_balance() found balanced on cpu idle                        :          42   $      584.21 $
  load_balance() move task failed on cpu idle                      :           8   $     3067.12 $
  imbalance sum on cpu idle                                        :           8
  pull_task() count on cpu idle                                    :           0
  pull_task() when target task was cache-hot on cpu idle           :           0
  load_balance() failed to find busier queue on cpu idle           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu idle           :          42   $      584.21 $
  *load_balance() success count on cpu idle                        :           0
  *avg task pulled per successful lb attempt (cpu idle)            :        0.00
  ------------------------------------------<Category busy>-------------------------------------------
  load_balance() count on cpu busy                                 :           2   $    12268.50 $
  load_balance() found balanced on cpu busy                        :           2   $    12268.50 $
  load_balance() move task failed on cpu busy                      :           0   $        0.00 $
  imbalance sum on cpu busy                                        :           0
  pull_task() count on cpu busy                                    :           0
  pull_task() when target task was cache-hot on cpu busy           :           0
  load_balance() failed to find busier queue on cpu busy           :           0   $        0.00 $
  load_balance() failed to find busier group on cpu busy           :           1   $    24537.00 $
  *load_balance() success count on cpu busy                        :           0
  *avg task pulled per successful lb attempt (cpu busy)            :        0.00
  -----------------------------------------<Category newidle>-----------------------------------------
  load_balance() count on cpu newly idle                           :         427   $       57.46 $
  load_balance() found balanced on cpu newly idle                  :         382   $       64.23 $
  load_balance() move task failed on cpu newly idle                :          45   $      545.27 $
  imbalance sum on cpu newly idle                                  :          48
  pull_task() count on cpu newly idle                              :           0
  pull_task() when target task was cache-hot on cpu newly idle     :           0
  load_balance() failed to find busier queue on cpu newly idle     :           0   $        0.00 $
  load_balance() failed to find busier group on cpu newly idle     :         382   $       64.23 $
  *load_balance() success count on cpu newly idle                  :           0
  *avg task pulled per successful lb attempt (cpu newly idle)      :        0.00
  ----------------------------------------------------------------------------------------------------

Consider the following line:

  load_balance() found balanced on cpu newly idle                  :         382    $      64.23 $

While profiling was active, load_balance() on the newly idle CPU 0
found the load already balanced 382 times. The value enclosed in $ is
the average number of jiffies between two such events
(24537 / 382 = 64.23).
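
The metrics prefixed with * are derived from the raw counters. A
plausible derivation, inferred from the numbers in the example above
rather than taken from the patch, is sketched below for the newidle
category:

  /* Sketch of how the derived (starred) metrics could be computed. */
  #include <stdio.h>

  int main(void)
  {
      /* "newly idle" counters from the example above */
      unsigned long long lb_count = 427, lb_balanced = 382, lb_failed = 45;
      unsigned long long pull_task = 0;

      /* attempts that neither found the load balanced nor failed to move */
      unsigned long long success = lb_count - lb_balanced - lb_failed;

      printf("*load_balance() success count           : %llu\n", success);
      printf("*avg task pulled per successful attempt : %.2f\n",
             success ? (double)pull_task / success : 0.0);
      return 0;
  }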

Next come the active_load_balance() stats. active_load_balance() did
not trigger while the workload was running, so they are all 0s.

  ----------------------------------<Category active_load_balance()>----------------------------------
  active_load_balance() count                                      :           0
  active_load_balance() move task failed                           :           0
  active_load_balance() successfully moved a task                  :           0
  ----------------------------------------------------------------------------------------------------

Next come the sched_balance_exec() and sched_balance_fork() stats. They
are not used, but we have kept them in the RFC for legacy purposes.
Unless someone objects, we plan to remove them in the next revision.

Next come the wakeup statistics. For every domain, the report also
shows task-wakeup statistics. Example:

  --------------------------------------------<Wakeup Info>-------------------------------------------
  try_to_wake_up() awoke a task that last ran on a diff cpu       :       12070
  try_to_wake_up() moved task because cache-cold on own cpu       :        3324
  try_to_wake_up() started passive balancing                      :           0
  ----------------------------------------------------------------------------------------------------

The same set of stats is reported for each cpu and at each domain level.


TODO:
 - This RFC series supports the v15 layout of /proc/schedstat, but the
   v16 layout is already being pushed upstream. We are planning to
   include v16 as well in the next revision.
 - Currently the schedstat tool provides statistics for only one run,
   but we are planning to add `perf sched schedstat diff`, which can
   compare the data of two different runs (possibly good and bad) and
   highlight where scheduler decisions are impacting workload
   performance.
 - perf sched schedstat records /proc/schedstat, which provides cpu-
   and domain-level scheduler statistics. We are planning to add a
   taskstat tool which reads task stats from procfs and generates a
   scheduler statistics report at task granularity. This will probably
   be a standalone tool, something like `perf sched taskstat
   record/report`.
 - /proc/schedstat shows a cpumask in the domain line to indicate the
   group of cpus that belong to the domain. Since we are not using the
   domain<->cpumask data anywhere, we are not capturing it as part of
   the perf sample. But we are planning to include it in the next
   revision.
 - We have tested the patches only on AMD machines, not on other
   platforms.
 - Some of the perf-related features can be added to the schedstat
   subcommand as well: for example, pipe mode, or a -p <pid> option to
   profile a running process. We are not planning to add TUI support,
   however.
 - The code changes are for the RFC and thus not optimal. Some of the
   places where code changes are not required for the RFC are marked
   with /* FIXME */.
 - Documenting the details of the schedstat tool in the perf-sched man
   page will also be done in the next revision.
 - Checkpatch warnings are ignored for now.

Patches are prepared on perf-tools-next/perf-tools-next (37862d6fdced).
Although all changes are in tools/perf/, the kernel version is
important since the tool expects the v15 layout of /proc/schedstat.

[1] https://youtu.be/lg-9aG2ajA0?t=283
[2] https://github.com/AMDESE/sched-scoreboard
[3] https://lore.kernel.org/lkml/c50bdbfe-02ce-c1bc-c761-c95f8e216ca0@amd.com/
[4] https://lore.kernel.org/lkml/3e32bec6-5e59-c66a-7676-7d15df2c961c@amd.com/

Ravi Bangoria (2):
  perf sched: Make `struct perf_sched sched` a global variable
  perf sched: Add template code for schedstat subcommand

Swapnil Sapkal (2):
  perf sched schedstat: Add record and rawdump support
  perf sched schedstat: Add support for report subcommand

 tools/lib/perf/Documentation/libperf.txt      |   2 +
 tools/lib/perf/Makefile                       |   2 +-
 tools/lib/perf/include/perf/event.h           |  37 ++
 .../lib/perf/include/perf/schedstat-cpu-v15.h |  22 +
 .../perf/include/perf/schedstat-domain-v15.h  | 121 ++++
 tools/perf/builtin-inject.c                   |   2 +
 tools/perf/builtin-sched.c                    | 558 ++++++++++++++++--
 tools/perf/util/event.c                       |  54 ++
 tools/perf/util/event.h                       |   2 +
 tools/perf/util/session.c                     |  44 ++
 tools/perf/util/synthetic-events.c            | 170 ++++++
 tools/perf/util/synthetic-events.h            |   4 +
 tools/perf/util/tool.h                        |   4 +-
 13 files changed, 965 insertions(+), 57 deletions(-)
 create mode 100644 tools/lib/perf/include/perf/schedstat-cpu-v15.h
 create mode 100644 tools/lib/perf/include/perf/schedstat-domain-v15.h

-- 
2.44.0

