KVM Archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: liulongfang <liulongfang@huawei.com>,
	"jgg@nvidia.com" <jgg@nvidia.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linuxarm@openeuler.org" <linuxarm@openeuler.org>
Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of the XQC address
Date: Thu, 9 May 2024 08:29:40 -0600	[thread overview]
Message-ID: <20240509082940.51a69feb.alex.williamson@redhat.com> (raw)
In-Reply-To: <ed07017d74f147b28a069660100e3ad1@huawei.com>

On Thu, 9 May 2024 09:37:51 +0000
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote:

> > -----Original Message-----
> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, May 8, 2024 7:00 PM
> > To: liulongfang <liulongfang@huawei.com>
> > Cc: jgg@nvidia.com; Shameerali Kolothum Thodi
> > <shameerali.kolothum.thodi@huawei.com>; Jonathan Cameron
> > <jonathan.cameron@huawei.com>; kvm@vger.kernel.org; linux-
> > kernel@vger.kernel.org; linuxarm@openeuler.org
> > Subject: Re: [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of
> > the XQC address  
> 
> [...]
>  
> > > HiSilicon accelerator equipment can perform general services after  
> > completing live migration.  
> > > This kind of business is executed through the user mode driver and only  
> > needs to use SQE and CQE.  
> > >
> > > At the same time, this device can also perform kernel-mode services in the  
> > VM through the crypto  
> > > subsystem. This kind of service requires the use of EQE.
> > >
> > > Finally, if the device is abnormal, the driver needs to perform a device  
> > reset, and AEQE needs to  
> > > be used in this case.
> > >
> > > Therefore, a complete device live migration function needs to ensure that  
> > device functions are  
> > > normal in all these scenarios.
> > > Therefore, this data still needs to be migrated.  
> > 
> > Ok, I had jumped to an in-kernel host driver in reference to "kernel
> > mode" rather than a guest kernel.  Migrating with bad data only affects
> > the current configuration of the device, reloading a guest driver to
> > update these registers or a reset of the device would allow proper
> > operation of the device, correct?  
> 
> Yes, after talking to Longfang, the device RAS will trigger a reset and
> would function after reset.
> 
> > 
> > But I think this still isn't really a complete solution, we know
> > there's a bug in the migration data stream, so not only would we fix
> > the data stream, but I think we should also take measures to prevent
> > loading a known bad data stream.  AIUI migration of this device while
> > running in kernel mode (ie. a kernel driver within a guest VM) is
> > broken.  Therefore, the least we can do in a new kernel, knowing that
> > there was previously a bug in the migration data stream, is to fail to
> > load that migration data because it risks this scenario where the
> > device is broken after migration.  Shouldn't we then also increment a
> > migration version field in the data stream to block migrations that
> > risk this breakage, or barring that, change the magic data field to
> > prevent the migration?  Thanks,  
> 
> Ok. We could add a new ACC_DEV_MAGIC_V2 and prevent the migration
> in vf_qm_check_match(). The only concern here is that, it will completely
> block old kernel to new kernel migration and since we can recover the
> device after the reset whether it is too restrictive or not.

What's the impact to the running driver, kernel or userspace, if the
device is reset?  Migration is intended to be effectively transparent
to the driver.  If the driver stalls and needs to reset the device,
what has the migration driver accomplished versus an offline migration?

If there's a way to detect from the migration data if the device is
running in kernel mode or user mode then you could potentially accept
and send v1 magic conditional that the device is in user mode and
require v2 magic for any migration where the device is in kernel mode.
This all adds complication though and seems like it has corner cases
where we might allow migration to an old kernel that might trap the
device there if the use case changes.

Essentially it comes down to what should the migration experience be
and while restricting old->new and new->old migration is undesirable,
it seems old->old migration is effectively already broken anyway.  As
you consider a v2 magic, perhaps consider how the migration data
structure might be improved overall to better handle new features and
bugs.  Thanks,

Alex


  reply	other threads:[~2024-05-09 14:29 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-25 13:23 [PATCH v6 0/5] add debugfs to hisilicon migration driver Longfang Liu
2024-04-25 13:23 ` [PATCH v6 1/5] hisi_acc_vfio_pci: extract public functions for container_of Longfang Liu
2024-04-25 13:23 ` [PATCH v6 2/5] hisi_acc_vfio_pci: modify the register location of the XQC address Longfang Liu
2024-05-03 16:11   ` Alex Williamson
2024-05-07  8:29     ` liulongfang
2024-05-07 11:10       ` Shameerali Kolothum Thodi
2024-05-07 12:35       ` Alex Williamson
2024-05-08  7:18         ` liulongfang
2024-05-08 17:59           ` Alex Williamson
2024-05-09  9:37             ` Shameerali Kolothum Thodi
2024-05-09 14:29               ` Alex Williamson [this message]
     [not found]                 ` <e63c0c85-7f3a-100c-5059-322268b3f517@huawei.com>
2024-05-13  8:35                   ` Shameerali Kolothum Thodi
2024-04-25 13:23 ` [PATCH v6 3/5] hisi_acc_vfio_pci: create subfunction for data reading Longfang Liu
2024-05-03 16:25   ` Alex Williamson
2024-05-07  8:30     ` liulongfang
2024-04-25 13:23 ` [PATCH v6 4/5] hisi_acc_vfio_pci: register debugfs for hisilicon migration driver Longfang Liu
2024-05-03 17:21   ` Alex Williamson
2024-05-07  8:06     ` Shameerali Kolothum Thodi
2024-04-25 13:23 ` [PATCH v6 5/5] Documentation: add debugfs description for hisi migration Longfang Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240509082940.51a69feb.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@openeuler.org \
    --cc=liulongfang@huawei.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).