QEMU-Devel Archive mirror
 help / color / mirror / Atom feed
From: Eugenio Perez Martin <eperezma@redhat.com>
To: Si-Wei Liu <si-wei.liu@oracle.com>
Cc: Jonah Palmer <jonah.palmer@oracle.com>,
	qemu-devel@nongnu.org,  "Michael S. Tsirkin" <mst@redhat.com>,
	Lei Yang <leiyang@redhat.com>, Peter Xu <peterx@redhat.com>,
	 Dragos Tatulea <dtatulea@nvidia.com>,
	Jason Wang <jasowang@redhat.com>
Subject: Re: [RFC 1/2] iova_tree: add an id member to DMAMap
Date: Wed, 8 May 2024 17:25:34 +0200	[thread overview]
Message-ID: <CAJaqyWeZSfWEsmBjJE2LfCAe4wqE7FercZEwJzaUzoNoqAuPuw@mail.gmail.com> (raw)
In-Reply-To: <6cb3e772-17fc-43c3-99ec-cfd074241112@oracle.com>

On Wed, May 8, 2024 at 2:52 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
>
>
>
> On 5/1/2024 11:44 PM, Eugenio Perez Martin wrote:
> > On Thu, May 2, 2024 at 1:16 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>
> >>
> >> On 4/30/2024 10:19 AM, Eugenio Perez Martin wrote:
> >>> On Tue, Apr 30, 2024 at 7:55 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>
> >>>> On 4/29/2024 1:14 AM, Eugenio Perez Martin wrote:
> >>>>> On Thu, Apr 25, 2024 at 7:44 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>> On 4/24/2024 12:33 AM, Eugenio Perez Martin wrote:
> >>>>>>> On Wed, Apr 24, 2024 at 12:21 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>> On 4/22/2024 1:49 AM, Eugenio Perez Martin wrote:
> >>>>>>>>> On Sat, Apr 20, 2024 at 1:50 AM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>> On 4/19/2024 1:29 AM, Eugenio Perez Martin wrote:
> >>>>>>>>>>> On Thu, Apr 18, 2024 at 10:46 PM Si-Wei Liu <si-wei.liu@oracle.com> wrote:
> >>>>>>>>>>>> On 4/10/2024 3:03 AM, Eugenio Pérez wrote:
> >>>>>>>>>>>>> IOVA tree is also used to track the mappings of virtio-net shadow
> >>>>>>>>>>>>> virtqueue.  This mappings may not match with the GPA->HVA ones.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This causes a problem when overlapped regions (different GPA but same
> >>>>>>>>>>>>> translated HVA) exists in the tree, as looking them by HVA will return
> >>>>>>>>>>>>> them twice.  To solve this, create an id member so we can assign unique
> >>>>>>>>>>>>> identifiers (GPA) to the maps.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>>>>>>>>>>>> ---
> >>>>>>>>>>>>>         include/qemu/iova-tree.h | 5 +++--
> >>>>>>>>>>>>>         util/iova-tree.c         | 3 ++-
> >>>>>>>>>>>>>         2 files changed, 5 insertions(+), 3 deletions(-)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> diff --git a/include/qemu/iova-tree.h b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> index 2a10a7052e..34ee230e7d 100644
> >>>>>>>>>>>>> --- a/include/qemu/iova-tree.h
> >>>>>>>>>>>>> +++ b/include/qemu/iova-tree.h
> >>>>>>>>>>>>> @@ -36,6 +36,7 @@ typedef struct DMAMap {
> >>>>>>>>>>>>>             hwaddr iova;
> >>>>>>>>>>>>>             hwaddr translated_addr;
> >>>>>>>>>>>>>             hwaddr size;                /* Inclusive */
> >>>>>>>>>>>>> +    uint64_t id;
> >>>>>>>>>>>>>             IOMMUAccessFlags perm;
> >>>>>>>>>>>>>         } QEMU_PACKED DMAMap;
> >>>>>>>>>>>>>         typedef gboolean (*iova_tree_iterator)(DMAMap *map);
> >>>>>>>>>>>>> @@ -100,8 +101,8 @@ const DMAMap *iova_tree_find(const IOVATree *tree, const DMAMap *map);
> >>>>>>>>>>>>>          * @map: the mapping to search
> >>>>>>>>>>>>>          *
> >>>>>>>>>>>>>          * Search for a mapping in the iova tree that translated_addr overlaps with the
> >>>>>>>>>>>>> - * mapping range specified.  Only the first found mapping will be
> >>>>>>>>>>>>> - * returned.
> >>>>>>>>>>>>> + * mapping range specified and map->id is equal.  Only the first found
> >>>>>>>>>>>>> + * mapping will be returned.
> >>>>>>>>>>>>>          *
> >>>>>>>>>>>>>          * Return: DMAMap pointer if found, or NULL if not found.  Note that
> >>>>>>>>>>>>>          * the returned DMAMap pointer is maintained internally.  User should
> >>>>>>>>>>>>> diff --git a/util/iova-tree.c b/util/iova-tree.c
> >>>>>>>>>>>>> index 536789797e..0863e0a3b8 100644
> >>>>>>>>>>>>> --- a/util/iova-tree.c
> >>>>>>>>>>>>> +++ b/util/iova-tree.c
> >>>>>>>>>>>>> @@ -97,7 +97,8 @@ static gboolean iova_tree_find_address_iterator(gpointer key, gpointer value,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>             needle = args->needle;
> >>>>>>>>>>>>>             if (map->translated_addr + map->size < needle->translated_addr ||
> >>>>>>>>>>>>> -        needle->translated_addr + needle->size < map->translated_addr) {
> >>>>>>>>>>>>> +        needle->translated_addr + needle->size < map->translated_addr ||
> >>>>>>>>>>>>> +        needle->id != map->id) {
> >>>>>>>>>>>> It looks this iterator can also be invoked by SVQ from
> >>>>>>>>>>>> vhost_svq_translate_addr() -> iova_tree_find_iova(), where guest GPA
> >>>>>>>>>>>> space will be searched on without passing in the ID (GPA), and exact
> >>>>>>>>>>>> match for the same GPA range is not actually needed unlike the mapping
> >>>>>>>>>>>> removal case. Could we create an API variant, for the SVQ lookup case
> >>>>>>>>>>>> specifically? Or alternatively, add a special flag, say skip_id_match to
> >>>>>>>>>>>> DMAMap, and the id match check may look like below:
> >>>>>>>>>>>>
> >>>>>>>>>>>> (!needle->skip_id_match && needle->id != map->id)
> >>>>>>>>>>>>
> >>>>>>>>>>>> I think vhost_svq_translate_addr() could just call the API variant or
> >>>>>>>>>>>> pass DMAmap with skip_id_match set to true to svq_iova_tree_find_iova().
> >>>>>>>>>>>>
> >>>>>>>>>>> I think you're totally right. But I'd really like to not complicate
> >>>>>>>>>>> the API of the iova_tree more.
> >>>>>>>>>>>
> >>>>>>>>>>> I think we can look for the hwaddr using memory_region_from_host and
> >>>>>>>>>>> then get the hwaddr. It is another lookup though...
> >>>>>>>>>> Yeah, that will be another means of doing translation without having to
> >>>>>>>>>> complicate the API around iova_tree. I wonder how the lookup through
> >>>>>>>>>> memory_region_from_host() may perform compared to the iova tree one, the
> >>>>>>>>>> former looks to be an O(N) linear search on a linked list while the
> >>>>>>>>>> latter would be roughly O(log N) on an AVL tree?
> >>>>>>>>> Even worse, as the reverse lookup (from QEMU vaddr to SVQ IOVA) is
> >>>>>>>>> linear too. It is not even ordered.
> >>>>>>>> Oh Sorry, I misread the code and I should look for g_tree_foreach ()
> >>>>>>>> instead of g_tree_search_node(). So the former is indeed linear
> >>>>>>>> iteration, but it looks to be ordered?
> >>>>>>>>
> >>>>>>>> https://github.com/GNOME/glib/blob/main/glib/gtree.c#L1115
> >>>>>>> The GPA / IOVA are ordered but we're looking by QEMU's vaddr.
> >>>>>>>
> >>>>>>> If we have these translations:
> >>>>>>> [0x1000, 0x2000] -> [0x10000, 0x11000]
> >>>>>>> [0x2000, 0x3000] -> [0x6000, 0x7000]
> >>>>>>>
> >>>>>>> We will see them in this order, so we cannot stop the search at the first node.
> >>>>>> Yeah, reverse lookup is unordered indeed, anyway.
> >>>>>>
> >>>>>>>>> But apart from this detail you're right, I have the same concerns with
> >>>>>>>>> this solution too. If we see a hard performance regression we could go
> >>>>>>>>> to more complicated solutions, like maintaining a reverse IOVATree in
> >>>>>>>>> vhost-iova-tree too. First RFCs of SVQ did that actually.
> >>>>>>>> Agreed, yeap we can use memory_region_from_host for now.  Any reason why
> >>>>>>>> reverse IOVATree was dropped, lack of users? But now we have one!
> >>>>>>>>
> >>>>>>> No, it is just simplicity. We already have an user in the hot patch in
> >>>>>>> the master branch, vhost_svq_vring_write_descs. But I never profiled
> >>>>>>> enough to find if it is a bottleneck or not to be honest.
> >>>>>> Right, without vIOMMU or a lot of virtqueues / mappings, it's hard to
> >>>>>> profile and see the difference.
> >>>>>>> I'll send the new series by today, thank you for finding these issues!
> >>>>>> Thanks! In case you don't have bandwidth to add back reverse IOVA tree,
> >>>>>> Jonah (cc'ed) may have interest in looking into it.
> >>>>>>
> >>>>> Actually, yes. I've tried to solve it using:
> >>>>> memory_region_get_ram_ptr -> It's hard to get this pointer to work
> >>>>> without messing a lot with IOVATree.
> >>>>> memory_region_find -> I'm totally unable to make it return sections
> >>>>> that make sense
> >>>>> flatview_for_each_range -> It does not return the same
> >>>>> MemoryRegionsection as the listener, not sure why.
> >>>> Ouch, thank you for the summary of attempts that were done earlier.
> >>>>> The only advance I have is that memory_region_from_host is able to
> >>>>> tell if the vaddr is from the guest or not.
> >>>> Hmmm, then it won't be too useful without a direct means to identifying
> >>>> the exact memory region associated with the iova that is being mapped.
> >>>> And, this additional indirection seems introduce a tiny bit of more
> >>>> latency in the reverse lookup routine (should not be a scalability issue
> >>>> though if it's a linear search)?
> >>>>
> >>> I didn't measure, but I guess yes it might. OTOH these structs may be
> >>> cached because virtqueue_pop just looked for them.
> >> Oh, right, that's a good point.
> >>>>> So I'm convinced there must be a way to do it with the memory
> >>>>> subsystem, but I think the best way to do it ATM is to store a
> >>>>> parallel tree with GPA-> SVQ IOVA translations. At removal time, if we
> >>>>> find the entry in this new tree, we can directly remove it by GPA. If
> >>>>> not, assume it is a host-only address like SVQ vrings, and remove by
> >>>>> iterating on vaddr as we do now.
> >>>> Yeah, this could work I think. On the other hand, given that we are now
> >>>> trying to improve it, I wonder if possible to come up with a fast
> >>>> version for the SVQ (host-only address) case without having to look up
> >>>> twice? SVQ callers should be able to tell apart from the guest case
> >>>> where GPA -> IOVA translation doesn't exist? Or just maintain a parallel
> >>>> tree with HVA -> IOVA translations for SVQ reverse lookup only? I feel
> >>>> SVQ mappings may be worth a separate fast lookup path - unlike guest
> >>>> mappings, the insertion, lookup and removal for SVQ mappings seem
> >>>> unavoidable during the migration downtime path.
> >>>>
> >>> I think the ideal order is the opposite actually. So:
> >>> 1) Try for the NIC to support _F_VRING_ASID, no translation needed by QEMU
> >> Right, that's the case for _F_VRING_ASID, which is simple and easy to
> >> deal with. Though I think this is an edge case across all vendor
> >> devices, as most likely only those no-chip IOMMU parents may support it.
> >> It's a luxury for normal device to steal another VF for this ASID feature...
> >>
> >>> 2) Try reverse lookup from HVA to GPA. Since dataplane should fit
> >>> this, we should test this first
> >> So instead of a direct lookup from HVA to IOVA, the purpose of the extra
> >> reverse lookup from HVA to GPA is to verify the validity of GPA (avoid
> >> from being mistakenly picked from the overlapped region)? But this would
> >> seem require scanning the entire GPA space to identify possible GPA
> >> ranges that are potentially overlapped? I wonder if there exists
> >> possibility to simplify this assumption, could we go this extra layer of
> >> GPA wide scan and validation, *only* when overlap is indeed detected
> >> during memory listerner's region_add (say during which we try to insert
> >> a duplicate / overlapped HVA into the HVA -> IOVA tree)? Or simply put,
> >> the first match on the reverse lookup would mostly suffice, since we
> >> know virtio driver can't use guest memory from these overlapped regions?
> > The first match should be enough, but maybe we need more than one
> > match. Let me put an example:
> >
> > The buffer is (vaddr = 0x1000, size=0x3000). Now the tree contains two
> > overlapped entries: (vaddr=0x1000, size=0x2000), and (vaddr=0x1000,
> > size=0x3000).
> In this case, assume the overlap can be detected via certain structs,
> for e.g. a HVA->IOVA reverse tree, then a full and slow lookup needs to
> be performed. Here we can try to match using the size, but I feel its
> best to identify the exact IOVA range by the GPA. This can be done
> through a tree storing the GPA->HVA mappings, and the reverse lookup
> from HVA->GPA will help identify if the HVA falls into certain GPA range.
>

It is possible somehow, but multiple searches are already used in
other areas where the full range is not found in the first attempt.
First one may return a partial result, but the second one can look for
the missing part of the key (vaddr=0x2000, size=0x1000). Isn't that
simpler?

> >
> > Assuming we go through the reverse IOVA tree, we had bad luck and we
> > stored the small entry plus the big entry. The first search returns
> > the small entry then, (vaddr=0x1000, size=0x2000),. Calling code must
> > detect it, and then look for vaddr = 0x1000 + 0x2000. That gives us
> > the next entry.
> Is there any reason why the first search can't pass in the GPA to
> further help identify? Suppose it's verified that the specific GPA range
> does exists via the HVA->GPA lookup.

The only problem is that IOVATree is shared with intel_iommu. How to
keep modifying it without affecting IOVATree usage by intel_iommu
might be a problem.

> >
> > You can see that virtqueue_map_desc translates this way if
> > dma_memory_map returns a translation shorter than the length of the
> > buffer, for example.
> >
> >> You may say this assumption is too bold, but do we have other means to
> >> guarantee the first match will always hit under SVQ lookup? Given that
> >> we don't receive an instance of issue report until we move the memory
> >> listener registration upfront to device initialization, I guess there
> >> should be some point or under certain condition that the non-overlapped
> >> 1:1 translation and lookup can be satisfied. Don't you agree?
> >>
> > To be able to build the shorter is desirable, yes. Maybe it can be
> > done in this series, but I find it hard to solve some situations. For
> > example, is it possible to have three overlapping regions (A, B, C)
> > where regions A and B do not overlap but C overlaps both of them?
> Does C map to a different GPA range than where region A and B reside
> originally? The flatten guest view should guarantee that, right? Then it
> shouldn't be a problem by passing in the GPA as the secondary ID for the
> reverse HVA->IOVA lookup.
>

Right. But in this RFC the id is not searched in full range, only the
first GPA of each region.

> Regards,
> -Siwei
> >
> > That's why I think it is better to delay that to a future series, but
> > we can do it with one shot if it is simple enough for sure.
> >
> > Thanks!
> >
> >> Thanks,
> >> -Siwei
> >>> 3) Look in SVQ host-only entries (SVQ, current shadow CVQ). It is the
> >>> control VQ, speed is not so important.
> >>>
> >>> Overlapping regions may return the wrong SVQ IOVA though. We should
> >>> take extra care to make sure these are correctly handled. I mean,
> >>> there are valid translations in the tree unless the driver is buggy,
> >>> just may need to span many translations.
> >>>
> >>>>>     It is guaranteed the guest does not
> >>>>> translate to that vaddr and that that vaddr is unique in the tree
> >>>>> anyway.
> >>>>>
> >>>>> Does it sound reasonable? Jonah, would you be interested in moving this forward?
> >>>> My thought would be that the reverse IOVA tree stuff can be added as a
> >>>> follow-up optimization right after for extended scalability, but for now
> >>>> as the interim, we may still need some form of simple fix, so as to
> >>>> quickly unblock the other dependent work built on top of this one and
> >>>> the early pinning series [1]. With it said, I'm completely fine if
> >>>> performing the reverse lookup through linear tree walk e.g.
> >>>> g_tree_foreach(), that should suffice small VM configs with just a
> >>>> couple of queues and limited number of memory regions. Going forward, to
> >>>> address the scalability bottleneck, Jonah could just replace the
> >>>> corresponding API call with the one built on top of reverse IOVA tree (I
> >>>> presume the use of these iova tree APIs is kind of internal that only
> >>>> limits to SVQ and vhost-vdpa subsystems) once he gets there, and then
> >>>> eliminate the other API variants that will no longer be in use. What do
> >>>> you think about this idea / plan?
> >>>>
> >>> Yeah it makes sense to me. Hopefully we can even get rid of the id member.
> >>>
> >>>> Thanks,
> >>>> -Siwei
> >>>>
> >>>> [1] https://lists.nongnu.org/archive/html/qemu-devel/2024-04/msg00079.html
> >>>>
> >>>>> Thanks!
> >>>>>
> >>>>>> -Siwei
> >>>>>>
> >>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -Siwei
> >>>>>>>>> Thanks!
> >>>>>>>>>
> >>>>>>>>>> Of course,
> >>>>>>>>>> memory_region_from_host() won't search out of the guest memory space for
> >>>>>>>>>> sure. As this could be on the hot data path I have a little bit
> >>>>>>>>>> hesitance over the potential cost or performance regression this change
> >>>>>>>>>> could bring in, but maybe I'm overthinking it too much...
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> -Siwei
> >>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> -Siwei
> >>>>>>>>>>>>>                 return false;
> >>>>>>>>>>>>>             }
> >>>>>>>>>>>>>
>



  reply	other threads:[~2024-05-08 15:26 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-10 10:03 [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Eugenio Pérez
2024-04-10 10:03 ` [RFC 1/2] iova_tree: add an id member to DMAMap Eugenio Pérez
2024-04-18 20:46   ` Si-Wei Liu
2024-04-19  8:29     ` Eugenio Perez Martin
2024-04-19 23:49       ` Si-Wei Liu
2024-04-22  8:49         ` Eugenio Perez Martin
2024-04-23 22:20           ` Si-Wei Liu
2024-04-24  7:33             ` Eugenio Perez Martin
2024-04-25 17:43               ` Si-Wei Liu
2024-04-29  8:14                 ` Eugenio Perez Martin
2024-04-29 11:19                   ` Jonah Palmer
2024-04-30 18:11                     ` Eugenio Perez Martin
2024-05-01 22:08                       ` Si-Wei Liu
2024-05-02  6:18                         ` Eugenio Perez Martin
2024-05-07  9:12                           ` Si-Wei Liu
2024-04-30  5:54                   ` Si-Wei Liu
2024-04-30 17:19                     ` Eugenio Perez Martin
2024-05-01 23:13                       ` Si-Wei Liu
2024-05-02  6:44                         ` Eugenio Perez Martin
2024-05-08  0:52                           ` Si-Wei Liu
2024-05-08 15:25                             ` Eugenio Perez Martin [this message]
2024-04-10 10:03 ` [RFC 2/2] vdpa: identify aliased maps in iova_tree Eugenio Pérez
2024-04-12  6:46 ` [RFC 0/2] Identify aliased maps in vdpa SVQ iova_tree Jason Wang
2024-04-12  7:56   ` Eugenio Perez Martin
2024-05-07  7:29     ` Jason Wang
2024-05-07 10:56       ` Eugenio Perez Martin
2024-05-08  2:29         ` Jason Wang
2024-05-08 17:15           ` Eugenio Perez Martin
2024-05-09  6:27             ` Jason Wang
2024-05-09  7:10               ` Eugenio Perez Martin
2024-05-10  4:28                 ` Jason Wang
2024-05-10  7:16                   ` Eugenio Perez Martin
2024-05-11  4:00                     ` Jason Wang
2024-05-13  6:27                       ` Eugenio Perez Martin
2024-05-13  8:28                         ` Jason Wang
2024-05-13  9:56                           ` Eugenio Perez Martin
2024-05-14  3:56                             ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAJaqyWeZSfWEsmBjJE2LfCAe4wqE7FercZEwJzaUzoNoqAuPuw@mail.gmail.com \
    --to=eperezma@redhat.com \
    --cc=dtatulea@nvidia.com \
    --cc=jasowang@redhat.com \
    --cc=jonah.palmer@oracle.com \
    --cc=leiyang@redhat.com \
    --cc=mst@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=si-wei.liu@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).