From: Tobias Huschle <huschle@linux.ibm.com>
To: Abel Wu <wuyun.abel@bytedance.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Linux Kernel <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, mst@redhat.com, jasowang@redhat.com
Subject: Re: Re: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add lag based placement)
Date: Thu, 7 Dec 2023 07:22:12 +0100	[thread overview]
Message-ID: <ZXFklCgF3EjeKXDC@DESKTOP-2CCOB1S.> (raw)
In-Reply-To: <d4110c79-d64f-49bd-9f69-0a94369b5e86@bytedance.com>

On Tue, Nov 28, 2023 at 04:55:11PM +0800, Abel Wu wrote:
> On 11/27/23 9:56 PM, Tobias Huschle Wrote:
> > On Wed, Nov 22, 2023 at 11:00:16AM +0100, Peter Zijlstra wrote:
> > > On Tue, Nov 21, 2023 at 02:17:21PM +0100, Tobias Huschle wrote:
[...]
> 
> What are the weights of the two entities?
> 

Both entities have the same weight (I saw 1048576 for both of them).
The story looks different when we look at the cgroup hierarchy though:

sew := weight of the sched entity (se->load.weight)

     CPU 6/KVM-2360    [011] d....  1158.884473: sched_place: comm=vhost-2961 pid=2984 sev=3595548386 sed=3598548386 sel=0 sew=1048576 avg=3595548386 min=3595548386 cpu=11 nr=0 vru=3595548386 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884473: sched_place: comm= pid=0 sev=19998138425 sed=20007532920 sel=0 sew=335754 avg=19998138425 min=19998138425 cpu=11 nr=0 vru=19998138425 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=37794158943 sed=37807515464 sel=0 sew=236146 avg=37794158943 min=37794158943 cpu=11 nr=0 vru=37794158943 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=50387168150 sed=50394482435 sel=0 sew=430665 avg=50387168150 min=50387168150 cpu=11 nr=0 vru=50387168150 lag=0
     CPU 6/KVM-2360    [011] d....  1158.884474: sched_place: comm= pid=0 sev=76600751247 sed=77624751246 sel=0 sew=3876 avg=76600751247 min=76600751247 cpu=11 nr=0 vru=76600751247 lag=0
<...>
    vhost-2961-2984    [011] d....  1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0

Here we can see the following weights:
kworker     -> 1048576
vhost       -> 1048576
cgroup root ->    3876

The kworker and vhost weights remain the same, while the weights of the nodes in the cgroup hierarchy vary.
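
Just to put the cgroup-root weight into perspective, a back-of-the-envelope
calculation (assuming only the cgroup root se and the kworker are runnable
on that cfs_rq at this instant; the group weights shift over time):

    cgroup root share ~= 3876    / (3876 + 1048576) ~=  0.4 %
    kworker share     ~= 1048576 / (3876 + 1048576) ~= 99.6 %

So at this snapshot the whole vhost cgroup hierarchy would only be entitled
to a tiny fraction of the CPU compared to the kworker.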


I have also given this some more thought and have a few more observations:

1. kworker lag after short runtime

    vhost-2961-2984    [011] d....  1158.884486: sched_waking: comm=kworker/11:2 pid=202 prio=120 target_cpu=011
    vhost-2961-2984    [011] d....  1158.884487: sched_place: comm=kworker/11:2 pid=202 sev=76603905961 sed=76606905961 sel=0 sew=1048576 avg=76603905961 min=76603905961 cpu=11 nr=1 vru=76603905961 lag=0
<...>                                                                                                                   ^^^^^
    vhost-2961-2984    [011] d....  1158.884490: sched_switch: prev_comm=vhost-2961 prev_pid=2984 prev_prio=120 prev_state=R+ ==> next_comm=kworker/11:2 next_pid=202 next_prio=120
   kworker/11:2-202    [011] d....  1158.884491: sched_waking: comm=CPU 0/KVM pid=2988 prio=120 target_cpu=009
   kworker/11:2-202    [011] d....  1158.884492: sched_stat_runtime: comm=kworker/11:2 pid=202 runtime=5150 [ns] vruntime=76603911111 [ns] deadline=76606905961 [ns] lag=76606905961
                                                                                               ^^^^^^^^^^^^^^^^
   kworker/11:2-202    [011] d....  1158.884492: sched_update: comm=kworker/11:2 pid=202 sev=76603911111 sed=76606905961 sel=-1128 sew=1048576 avg=76603909983 min=76603905961 cpu=11 nr=2 lag=-1128 lim=10000000
                                                                                                                         ^^^^^^^^^
   kworker/11:2-202    [011] d....  1158.884494: sched_stat_wait: comm=vhost-2961 pid=2984 delay=5150 [ns]
   kworker/11:2-202    [011] d....  1158.884494: sched_switch: prev_comm=kworker/11:2 prev_pid=202 prev_prio=120 prev_state=I ==> next_comm=vhost-2961 next_pid=2984 next_prio=120

In the sequence above, the kworker gets woken up by the vhost and placed on the timeline with 0 lag.
The kworker then executes for 5150 ns and returns control to the vhost.
Unfortunately, this short runtime earns the kworker a negative lag of -1128.
This, in turn, causes the kworker not to be selected by check_preempt_wakeup_fair.

My naive understanding of lag is that only entities which consume more time than
they should get negative lag. Why is the kworker being punished for running for
only a tiny portion of time?

In the majority of cases, the kworker finishes after a 4-digit number of ns.
There are occasional outliers with 5-digit numbers. I would therefore not
expect negative lag for the kworker.

It is fair to say that the kworker was executing while the vhost was not.
The kworker gets put on the queue with no lag, so its vruntime is essentially
set to avg_vruntime.
After giving up its timeslice, the kworker's vruntime is larger than
avg_vruntime. Hence, the negative lag might make sense here from an
algorithmic standpoint.
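
Plugging in the numbers from the trace above (just re-deriving what the
sched_update line already reports, assuming lag is computed as
avg_vruntime - vruntime):

    placed:  vruntime = avg_vruntime = 76603905961, lag = 0
    runs for 5150 ns at weight 1048576 (== NICE_0_LOAD), so vruntime
    advances 1:1 with runtime:
             vruntime = 76603905961 + 5150 = 76603911111
    update:  avg_vruntime = 76603909983
             lag = 76603909983 - 76603911111 = -1128

which matches sel=-1128 in the sched_update line. The kworker only goes
negative because the load-weighted average did not advance as far as its
own vruntime while it was the one running.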


2a/b. vhost deadlines keep increasing over time, no call to pick_eevdf

    vhost-2961-2984    [011] d.h..  1158.892878: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=8385872 [ns] vruntime=3603948448 [ns] deadline=3606948448 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.892879: sched_stat_runtime: comm= pid=0 runtime=8385872 [ns] vruntime=76604158567 [ns] deadline=77624751246 [ns] lag=77624751246
<..>
    vhost-2961-2984    [011] d.h..  1158.902877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999435 [ns] vruntime=3613947883 [ns] deadline=3616947883 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.902878: sched_stat_runtime: comm= pid=0 runtime=9999435 [ns] vruntime=76633826282 [ns] deadline=78137144356 [ns] lag=77624751246
<..>
    vhost-2961-2984    [011] d.h..  1158.912877: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=9999824 [ns] vruntime=3623947707 [ns] deadline=3626947707 [ns] lag=3598548386
    vhost-2961-2984    [011] d.h..  1158.912878: sched_stat_runtime: comm= pid=0 runtime=9999824 [ns] vruntime=76688003113 [ns] deadline=78161723086 [ns] lag=77624751246
<..>
<..>
    vhost-2961-2984    [011] dN...  1159.152927: sched_stat_runtime: comm=vhost-2961 pid=2984 runtime=40402 [ns] vruntime=3863988069 [ns] deadline=3866947667 [ns] lag=3598548386
    vhost-2961-2984    [011] dN...  1159.152928: sched_stat_runtime: comm= pid=0 runtime=40402 [ns] vruntime=78355923791 [ns] deadline=78393801472 [ns] lag=77624751246

In the sequence above, I extended the tracing of sched_stat_runtime to walk
for_each_sched_entity and also output the values for the cgroup hierarchy.
The first entry of each grouping represents the actual task, the second entry
represents the root of that cgroup hierarchy. I dropped the levels in between
for readability.

The first three groupings happen in direct sequence. The fourth grouping
is the last sched_stat_runtime update before the vhost gets migrated off
the CPU; the updates in between repeat the same pattern.

Interestingly, the vruntime of the cgroup root grows faster than that of the
actual task. I assume this is intended.
At the same time, the deadlines keep growing for both vhost and the cgroup
root, while the kworker is left starving with its negative lag.
At no point in this sequence is pick_eevdf called.
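
As far as I understand, the faster growth of the root's vruntime is just the
weight scaling of calc_delta_fair at work (this is only my reading; the exact
factors vary since the group weights change over time):

    delta_vruntime = delta_exec * NICE_0_LOAD / se->load.weight

    task se  (weight 1048576 == NICE_0_LOAD): vruntime advances 1:1 with runtime
    group se (weight  335754):                roughly 3x faster
    group se (weight    3876):                roughly 270x faster

The lighter the group se, the faster its vruntime (and with it its deadline)
runs ahead.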

The only time pick_eevdf is called is right when the kworker is woken up.
So check_preempt_wakeup_fair seems to be the kworker's only chance to get
scheduled in time.

For reference:
    vhost-2961-2984    [011] d....  1158.884563: sched_place: comm=kworker/11:2 pid=202 sev=76604163719 sed=76607163719 sel=-1128 sew=1048576 avg=76604158567 min=76604158567 cpu=11 nr=1 vru=76604158567 lag=-5152

The kworker's deadline is definitely smaller than vhost's in the later stages,
so I would assume it should get scheduled at some point.
If vhost is running in kernel space and is therefore not preemptible,
this would be expected behavior though.
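
To convince myself why the smaller deadline does not help the kworker at
wakeup, I hacked together a small userspace toy model of how I read the
EEVDF pick: only entities whose vruntime has not passed avg_vruntime
(i.e. non-negative lag) are eligible, and among those the earliest virtual
deadline wins. This is only my interpretation fed with numbers from the
traces above, not the actual kernel code:

/* Userspace toy model, not kernel code. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct toy_se {
	const char *name;
	int64_t vruntime;
	int64_t deadline;
};

static bool eligible(const struct toy_se *se, int64_t avg_vruntime)
{
	/* lag = avg_vruntime - vruntime; negative lag => not eligible */
	return avg_vruntime - se->vruntime >= 0;
}

int main(void)
{
	/* values taken from the traces around 1158.884563 */
	struct toy_se kworker    = { "kworker/11:2",      76604163719LL, 76607163719LL };
	struct toy_se vhost_root = { "vhost cgroup root", 76604158567LL, 77624751246LL };
	const struct toy_se *rq[] = { &kworker, &vhost_root };
	int64_t avg_vruntime = 76604158567LL;
	const struct toy_se *pick = NULL;
	unsigned int i;

	/* among eligible entities, pick the earliest virtual deadline */
	for (i = 0; i < sizeof(rq) / sizeof(rq[0]); i++) {
		if (!eligible(rq[i], avg_vruntime))
			continue;
		if (!pick || rq[i]->deadline < pick->deadline)
			pick = rq[i];
	}

	/*
	 * Prints "vhost cgroup root": the kworker has the much earlier
	 * deadline, but its negative lag makes it ineligible, so the
	 * running vhost hierarchy is kept.
	 */
	printf("picked: %s\n", pick ? pick->name : "(none)");
	return 0;
}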


3. vhost looping endlessly, waiting for kworker to be scheduled

I dug a little deeper into what the vhost is doing. I'm not an expert on
virtio whatsoever, so these are just educated guesses that maybe
someone can verify/correct. Please bear with me if I mess up
the terminology.

- vhost is looping through available queues.
- vhost wants to wake up a kworker to process a found queue.
- kworker does something with that queue and terminates quickly.

What I found by throwing in some very noisy trace statements was that,
if the kworker is not woken up, the vhost just keeps looping across
all available queues (and seems to repeat itself). So it essentially
relies on the scheduler to schedule the kworker fast enough; otherwise
it will just keep looping until it is migrated off the CPU.
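
Schematically, the pattern I think I am seeing looks roughly like this
(pseudo-code derived from my noisy tracing, not vhost source; all names
are made up):

	struct vq;                                        /* stand-in for a queue */

	struct vq *find_next_queue_with_work(void);       /* hypothetical */
	void wake_up_worker(struct vq *vq);               /* hypothetical */

	static void vhost_like_loop(void)
	{
		for (;;) {
			struct vq *vq = find_next_queue_with_work();
			if (vq)
				wake_up_worker(vq);
			/*
			 * No explicit yield as far as I can tell: progress
			 * depends on the scheduler actually running the woken
			 * kworker; otherwise this just spins until the vhost
			 * is migrated off the CPU.
			 */
		}
	}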


SUMMARY 

1 and 2a/b have more or less plausible explanations, where the EEVDF
scheduler might just be doing what it is designed to do.

3 is more tricky since I'm not familiar with the topic. If the vhost just
relies on the kworker preempting the vhost, then this sounds a bit
counter-intuitive. But there might also be a valid design decision
behind this.

If 1 and 2 are indeed plausible, path 3 is probably the one to pursue
in order to figure out whether we have a problem there.
[...]
