From: "Maciej S. Szmigiero" <mail@maciej.szmigiero.name>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: "Peter Xu" <peterx@redhat.com>, "Fabiano Rosas" <farosas@suse.de>,
	"Alex Williamson" <alex.williamson@redhat.com>,
	"Cédric Le Goater" <clg@redhat.com>,
	"Eric Blake" <eblake@redhat.com>,
	"Markus Armbruster" <armbru@redhat.com>,
	"Avihai Horon" <avihaih@nvidia.com>,
	"Joao Martins" <joao.m.martins@oracle.com>,
	qemu-devel@nongnu.org
Subject: Re: [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer
Date: Thu, 18 Apr 2024 20:14:15 +0200	[thread overview]
Message-ID: <aebcd78e-b8b6-44db-b2be-0bbd5acccf3f@maciej.szmigiero.name> (raw)
In-Reply-To: <ZiD4aLSre6qubuHr@redhat.com>

On 18.04.2024 12:39, Daniel P. Berrangé wrote:
> On Thu, Apr 18, 2024 at 11:50:12AM +0200, Maciej S. Szmigiero wrote:
>> On 17.04.2024 18:35, Daniel P. Berrangé wrote:
>>> On Wed, Apr 17, 2024 at 02:11:37PM +0200, Maciej S. Szmigiero wrote:
>>>> On 17.04.2024 10:36, Daniel P. Berrangé wrote:
>>>>> On Tue, Apr 16, 2024 at 04:42:39PM +0200, Maciej S. Szmigiero wrote:
>>>>>> From: "Maciej S. Szmigiero" <maciej.szmigiero@oracle.com>
(..)
>>>>> That said, the idea of reserving channels specifically for VFIO doesn't
>>>>> make a whole lot of sense to me either.
>>>>>
>>>>> Once we've done the RAM transfer, and are in the switchover phase
>>>>> doing device state transfer, all the multifd channels are idle.
>>>>> We should just use all those channels to transfer the device state,
>>>>> in parallel.  Reserving channels just guarantees many idle channels
>>>>> during RAM transfer, and further idle channels during vmstate
>>>>> transfer.
>>>>>
>>>>> IMHO it is more flexible to just use all available multifd channel
>>>>> resources all the time.
>>>>
>>>> The reason for having dedicated device state channels is that they
>>>> provide lower downtime in my tests.
>>>>
>>>> With either 15 or 11 mixed multifd channels (no dedicated device state
>>>> channels) I get a downtime of about 1250 msec.
>>>>
>>>> Comparing that with the 15 total multifd channels / 4 dedicated device
>>>> state channels config, which gives a downtime of about 1100 ms, using
>>>> dedicated channels yields roughly a 14% downtime improvement.
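(Spelling out the arithmetic behind that figure - it seems to be computed
relative to the 1100 ms dedicated-channel result:

    (1250 ms - 1100 ms) / 1100 ms ≈ 13.6%, i.e. roughly 14%;

computed relative to the 1250 ms baseline it would be about 12%.)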
>>>
>>> Hmm, can you clarify /when/ the VFIO vmstate transfer takes
>>> place? Is it transferred concurrently with the RAM ? I had thought
>>> this series still has the RAM transfer iterations running first,
>>> and then the VFIO VMstate at the end, simply making use of multifd
>>> channels for parallelism of the end phase. Your reply, though, makes
>>> me question my interpretation.
>>>
>>> Let me try to illustrate channel flow in various scenarios, time
>>> flowing left to right:
>>>
>>> 1. serialized RAM, then serialized VM state  (ie historical migration)
>>>
>>>         main: | Init | RAM iter 1 | RAM iter 2 | ... | RAM iter N | VM State |
>>>
>>>
>>> 2. parallel RAM, then serialized VM state (ie today's multifd)
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>
>>>
>>> 3. parallel RAM, then parallel VM state
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd4:                                                     | VFIO VM state |
>>>     multifd5:                                                     | VFIO VM state |
>>>
>>>
>>> 4. parallel RAM and VFIO VM state, then remaining VM state
>>>
>>>         main: | Init |                                            | VM state |
>>>     multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N |
>>>     multifd4:        | VFIO VM state                                         |
>>>     multifd5:        | VFIO VM state                                         |
>>>
>>>
>>> I thought this series was implementing approx (3), but are you actually
>>> implementing (4), or something else entirely ?
>>
>> You are right that this series approximately implements the scheme
>> described as number 3 in your diagrams.
> 
>> However, there are some additional details worth mentioning:
>> * There's a relatively small amount of VFIO data being transferred
>> from the "save_live_iterate" SaveVMHandler while the VM is still
>> running.
>>
>> This is still happening via the main migration channel.
>> Parallelizing this transfer in the future might make sense too,
>> although obviously this doesn't impact the downtime.
>>
>> * After the VM is stopped and downtime starts, the main (~400 MiB)
>> VFIO device state gets transferred via multifd channels.
>>
>> However, these multifd channels (if they are not dedicated to device
>> state transfer) aren't idle during that time.
>> Rather, they seem to be transferring the residual RAM data.
>>
>> That's most likely what causes the additional observed downtime
>> when dedicated device state transfer multifd channels aren't used.
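To make the two policies being compared here concrete, below is a purely
illustrative sketch (invented names, not the actual code from this patch
set) of how the sender side could pick a channel for an outgoing packet,
either from a reserved subset or from a fully shared pool:

/* Illustrative only -- invented names, not this patch set's implementation. */

typedef enum {
    PACKET_RAM,
    PACKET_DEVICE_STATE,
} PacketKind;

/*
 * Pick a multifd channel for an outgoing packet.
 * n_channels:  total multifd channels
 * n_dedicated: channels reserved for device state (0 = fully shared pool)
 * next_rr:     round-robin cursor kept by the caller
 */
static unsigned pick_channel(PacketKind kind, unsigned n_channels,
                             unsigned n_dedicated, unsigned *next_rr)
{
    unsigned first, count;

    if (n_dedicated == 0) {
        /* Shared pool: RAM and device state round-robin over all channels. */
        first = 0;
        count = n_channels;
    } else if (kind == PACKET_DEVICE_STATE) {
        /* Reserved subset: device state only uses the last n_dedicated. */
        first = n_channels - n_dedicated;
        count = n_dedicated;
    } else {
        /* RAM (including residual RAM at switchover) keeps the rest. */
        first = 0;
        count = n_channels - n_dedicated;
    }

    return first + (*next_rr)++ % count;
}

With n_dedicated == 0 this collapses into the fully shared pool you
describe below.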
> 
> Ahh yes, I forgot about the residual dirty RAM; that makes sense as
> an explanation. Allow me to work through the scenarios though, as I
> still think my suggestion to not have separate dedicated channels is
> better...
> 
> 
> Let's say hypothetically we have an existing deployment today that
> uses 6 multifd channels for RAM, ie:
>   
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
> 
> That value of 6 was chosen because it corresponds to the amount
> of network & CPU utilization the admin wants to allow for this
> VM to migrate. All 6 channels are fully utilized at all times.
> 
> 
> If we now want to parallelize VFIO VM state, the peak network
> and CPU utilization the admin wants to reserve for the VM should
> not change. Thus the admin will still want to configure only 6
> channels total.
> 
> With your proposal the admin has to reduce the RAM transfer to 4 of the
> channels in order to reserve 2 channels for VFIO VM state, so we
> get a flow like:
> 
>   
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:                                                     | VFIO VM state |
>      multifd6:                                                     | VFIO VM state |
> 
> This is bad, as it reduces the performance of the RAM transfer. VFIO VM
> state transfer gets better, but that's not a net win overall.
> 
> 
> 
> So let's say the admin was happy to increase the number of multifd
> channels from 6 to 8.
> 
> This series proposes that they would leave RAM using 6 channels as
> before, and now reserve the 2 extra ones for VFIO VM state:
> 
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM |
>      multifd7:                                                     | VFIO VM state |
>      multifd8:                                                     | VFIO VM state |
> 
> 
> RAM would perform as well as it did historically, and VM state would
> improve thanks to the 2 parallel channels and to not competing with the
> residual RAM transfer.
> 
> This is what your latency comparison numbers show as a benefit for
> this channel reservation design.
> 
> I believe this comparison is inappropriate / unfair though, as it is
> comparing a situation with 6 total channels against a situation with
> 8 total channels.
> 
> If the admin was happy to increase the total channels to 8, then they
> should allow RAM to use all 8 channels, and then VFIO VM state +
> residual RAM to also use the very same set of 8 channels:
> 
>          main: | Init |                                            | VM state |
>      multifd1:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd2:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd3:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd4:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd5:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd6:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd7:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
>      multifd8:        | RAM iter 1 | RAM iter 2 | ... | RAM iter N | Residual RAM + VFIO VM state|
> 
> This will speed up the initial RAM iterations still further & the final
> switchover phase even more. If residual RAM is larger than VFIO VM state,
> then it will dominate the switchover latency, so having VFIO VM state
> compete is not a problem. If VFIO VM state is larger than residual RAM,
> then allowing it access to all 8 channels instead of only 2 channels
> will be a clear win.
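As a back-of-envelope way of stating that argument (idealized: per-channel
overheads and sync costs are ignored, and each channel is assumed to get
roughly the same bandwidth B):

    R = residual dirty RAM at switchover, V = VFIO device state size,
    N = total multifd channels, k = channels reserved for device state

    shared pool:      T_shared ≈ (R + V) / (N * B)
    reserved subset:  T_split  ≈ max( R / ((N - k) * B), V / (k * B) )

Since (R + V) / N <= max( R / (N - k), V / k ) for any split k, the shared
pool never does worse under these (idealized) assumptions.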

I re-did the measurement with an increased number of multifd channels,
first to (total count/dedicated count) 25/0, then to 100/0.

The results did not improve:
With the 25/0 mixed multifd channel config I still get around 1250 msec
of downtime - the same as with the 15/0 or 11/0 mixed configs I measured
earlier.

But with the (pretty insane) 100/0 mixed channel config the whole setup
gets so far into the law of diminishing returns that the results actually
get worse: the downtime is now about 1450 msec.
I guess that's due to all the extra overhead of switching between 100
multifd channels.

I think one of the reasons for these results is that mixed (RAM + device
state) multifd channels participate in the RAM sync process
(MULTIFD_FLAG_SYNC) whereas dedicated device state channels don't.
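To illustrate that sync asymmetry, here is a simplified sketch (generic
code with invented helpers, not QEMU's actual multifd implementation):
a RAM sync point has to queue a flush on, and then wait for, every
channel that carries RAM, so each extra mixed channel adds to that wait,
while a dedicated device state channel is skipped entirely:

/* Simplified illustration only -- invented helpers, not QEMU's multifd code. */

#include <stdbool.h>
#include <semaphore.h>

typedef struct {
    bool  carries_ram;   /* mixed channel: participates in RAM sync          */
    sem_t sem_sync;      /* posted by the channel thread once it has flushed */
} Channel;

/* Hypothetical helper: ask the channel thread to flush and post sem_sync. */
void queue_sync_packet(Channel *c);

static void ram_sync_all(Channel *ch, unsigned n)
{
    /* Ask every RAM-carrying channel to flush... */
    for (unsigned i = 0; i < n; i++) {
        if (ch[i].carries_ram) {
            queue_sync_packet(&ch[i]);
        }
    }
    /* ...and wait for each of them; the cost scales with the number of
     * channels that carry RAM, not with the total amount of data. */
    for (unsigned i = 0; i < n; i++) {
        if (ch[i].carries_ram) {
            sem_wait(&ch[i].sem_sync);
        }
    }
}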

It is possible that there are other subtle performance interactions too,
but I am not 100% sure about that.

> With regards,
> Daniel

Best regards,
Maciej



Thread overview: 54+ messages
2024-04-16 14:42 [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 01/26] migration: Add x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 02/26] migration: Add migration channel header send/receive Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 03/26] migration: Add send/receive header for main channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 04/26] multifd: change multifd_new_send_channel_create() param type Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 05/26] migration: Add a DestroyNotify parameter to socket_send_channel_create() Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 06/26] multifd: pass MFDSendChannelConnectData when connecting sending socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 07/26] migration/postcopy: pass PostcopyPChannelConnectData when connecting sending preempt socket Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 08/26] migration: Allow passing migration header in migration channel creation Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 09/26] migration: Add send/receive header for postcopy preempt channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 10/26] migration: Add send/receive header for multifd channel Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 11/26] migration/options: Mapped-ram is not channel header compatible Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 12/26] migration: Enable x-channel-header pseudo-capability Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 13/26] vfio/migration: Add save_{iterate, complete_precopy}_started trace events Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 14/26] migration/ram: Add load start trace event Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 15/26] migration/multifd: Zero p->flags before starting filling a packet Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 16/26] migration: Add save_live_complete_precopy_async{, wait} handlers Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 17/26] migration: Add qemu_loadvm_load_state_buffer() and its handler Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 18/26] migration: Add load_finish handler and associated functions Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 19/26] migration: Add x-multifd-channels-device-state parameter Maciej S. Szmigiero
2024-04-16 14:42 ` [PATCH RFC 20/26] migration: Add MULTIFD_DEVICE_STATE migration channel type Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 21/26] migration/multifd: Device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 22/26] migration/multifd: Convert multifd_send_pages::next_channel to atomic Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 23/26] migration/multifd: Device state transfer support - send side Maciej S. Szmigiero
2024-04-29 20:04   ` Peter Xu
2024-05-06 16:25     ` Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 24/26] migration/multifd: Add migration_has_device_state_support() Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 25/26] vfio/migration: Multifd device state transfer support - receive side Maciej S. Szmigiero
2024-04-16 14:43 ` [PATCH RFC 26/26] vfio/migration: Multifd device state transfer support - send side Maciej S. Szmigiero
2024-04-17  8:36 ` [PATCH RFC 00/26] Multifd 🔀 device state transfer support with VFIO consumer Daniel P. Berrangé
2024-04-17 12:11   ` Maciej S. Szmigiero
2024-04-17 16:35     ` Daniel P. Berrangé
2024-04-18  9:50       ` Maciej S. Szmigiero
2024-04-18 10:39         ` Daniel P. Berrangé
2024-04-18 18:14           ` Maciej S. Szmigiero [this message]
2024-04-18 20:02             ` Peter Xu
2024-04-19 10:07               ` Daniel P. Berrangé
2024-04-19 15:31                 ` Peter Xu
2024-04-23 16:15                   ` Maciej S. Szmigiero
2024-04-23 22:20                     ` Peter Xu
2024-04-23 22:25                       ` Maciej S. Szmigiero
2024-04-23 22:35                         ` Peter Xu
2024-04-26 17:34                           ` Maciej S. Szmigiero
2024-04-29 15:09                             ` Peter Xu
2024-05-06 16:26                               ` Maciej S. Szmigiero
2024-05-06 17:56                                 ` Peter Xu
2024-05-07  8:41                                   ` Avihai Horon
2024-05-07 16:13                                     ` Peter Xu
2024-05-07 17:23                                       ` Avihai Horon
2024-04-23 16:14               ` Maciej S. Szmigiero
2024-04-23 22:27                 ` Peter Xu
2024-04-26 17:35                   ` Maciej S. Szmigiero
2024-04-29 20:34                     ` Peter Xu
2024-04-19 10:20             ` Daniel P. Berrangé
