Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support

virtio-fs.lists.linux.dev archive mirror
 help / color / mirror / Atom feed

From: Jonah Palmer <jonah.palmer@oracle.com>
To: Eugenio Perez Martin <eperezma@redhat.com>
Cc: qemu-devel@nongnu.org, mst@redhat.com, raphael@enfabrica.net,
	kwolf@redhat.com, hreitz@redhat.com, jasowang@redhat.com,
	pbonzini@redhat.com, fam@euphon.net, stefanha@redhat.com,
	qemu-block@nongnu.org, schalla@marvell.com, leiyang@redhat.com,
	virtio-fs@lists.linux.dev, si-wei.liu@oracle.com,
	boris.ostrovsky@oracle.com
Subject: Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support
Date: Tue, 26 Mar 2024 15:01:57 -0400	[thread overview]
Message-ID: <ad9fb016-bd2b-4e75-8bcc-d361676aef5f@oracle.com> (raw)
In-Reply-To: <CAJaqyWcHokNf97uwE0=S56CK9cBpB54sF8cdW7+BhFc5VzBRUQ@mail.gmail.com>



On 3/26/24 2:34 PM, Eugenio Perez Martin wrote:
> On Tue, Mar 26, 2024 at 5:49 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>
>>
>>
>> On 3/25/24 4:33 PM, Eugenio Perez Martin wrote:
>>> On Mon, Mar 25, 2024 at 5:52 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/22/24 7:18 AM, Eugenio Perez Martin wrote:
>>>>> On Thu, Mar 21, 2024 at 4:57 PM Jonah Palmer <jonah.palmer@oracle.com> wrote:
>>>>>>
>>>>>> The goal of these patches is to add support to a variety of virtio and
>>>>>> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
>>>>>> indicates that all buffers are used by the device in the same order in
>>>>>> which they were made available by the driver.
>>>>>>
>>>>>> These patches attempt to implement a generalized, non-device-specific
>>>>>> solution to support this feature.
>>>>>>
>>>>>> The core feature behind this solution is a buffer mechanism in the form
>>>>>> of GLib's GHashTable. The decision behind using a hash table was to
>>>>>> leverage their ability for quick lookup, insertion, and removal
>>>>>> operations. Given that our keys are simply numbers of an ordered
>>>>>> sequence, a hash table seemed like the best choice for a buffer
>>>>>> mechanism.
>>>>>>
>>>>>> ---------------------
>>>>>>
>>>>>> The strategy behind this implementation is as follows:
>>>>>>
>>>>>> We know that buffers that are popped from the available ring and enqueued
>>>>>> for further processing will always done in the same order in which they
>>>>>> were made available by the driver. Given this, we can note their order
>>>>>> by assigning the resulting VirtQueueElement a key. This key is a number
>>>>>> in a sequence that represents the order in which they were popped from
>>>>>> the available ring, relative to the other VirtQueueElements.
>>>>>>
>>>>>> For example, given 3 "elements" that were popped from the available
>>>>>> ring, we assign a key value to them which represents their order (elem0
>>>>>> is popped first, then elem1, then lastly elem2):
>>>>>>
>>>>>>         elem2   --  elem1   --  elem0   ---> Enqueue for processing
>>>>>>        (key: 2)    (key: 1)    (key: 0)
>>>>>>
>>>>>> Then these elements are enqueued for further processing by the host.
>>>>>>
>>>>>> While most devices will return these completed elements in the same
>>>>>> order in which they were enqueued, some devices may not (e.g.
>>>>>> virtio-blk). To guarantee that these elements are put on the used ring
>>>>>> in the same order in which they were enqueued, we can use a buffering
>>>>>> mechanism that keeps track of the next expected sequence number of an
>>>>>> element.
>>>>>>
>>>>>> In other words, if the completed element does not have a key value that
>>>>>> matches the next expected sequence number, then we know this element is
>>>>>> not in-order and we must stash it away in a hash table until an order
>>>>>> can be made. The element's key value is used as the key for placing it
>>>>>> in the hash table.
>>>>>>
>>>>>> If the completed element has a key value that matches the next expected
>>>>>> sequence number, then we know this element is in-order and we can push
>>>>>> it on the used ring. Then we increment the next expected sequence number
>>>>>> and check if the hash table contains an element at this key location.
>>>>>>
>>>>>> If so, we retrieve this element, push it to the used ring, delete the
>>>>>> key-value pair from the hash table, increment the next expected sequence
>>>>>> number, and check the hash table again for an element at this new key
>>>>>> location. This process is repeated until we're unable to find an element
>>>>>> in the hash table to continue the order.
>>>>>>
>>>>>> So, for example, say the 3 elements we enqueued were completed in the
>>>>>> following order: elem1, elem2, elem0. The next expected sequence number
>>>>>> is 0:
>>>>>>
>>>>>>        exp-seq-num = 0:
>>>>>>
>>>>>>         elem1   --> elem1.key == exp-seq-num ? --> No, stash it
>>>>>>        (key: 1)                                         |
>>>>>>                                                         |
>>>>>>                                                         v
>>>>>>                                                   ================
>>>>>>                                                   |key: 1 - elem1|
>>>>>>                                                   ================
>>>>>>        ---------------------
>>>>>>        exp-seq-num = 0:
>>>>>>
>>>>>>         elem2   --> elem2.key == exp-seq-num ? --> No, stash it
>>>>>>        (key: 2)                                         |
>>>>>>                                                         |
>>>>>>                                                         v
>>>>>>                                                   ================
>>>>>>                                                   |key: 1 - elem1|
>>>>>>                                                   |--------------|
>>>>>>                                                   |key: 2 - elem2|
>>>>>>                                                   ================
>>>>>>        ---------------------
>>>>>>        exp-seq-num = 0:
>>>>>>
>>>>>>         elem0   --> elem0.key == exp-seq-num ? --> Yes, push to used ring
>>>>>>        (key: 0)
>>>>>>
>>>>>>        exp-seq-num = 1:
>>>>>>
>>>>>>        lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
>>>>>>                                                 remove elem from table
>>>>>>                                                         |
>>>>>>                                                         v
>>>>>>                                                   ================
>>>>>>                                                   |key: 2 - elem2|
>>>>>>                                                   ================
>>>>>>
>>>>>>        exp-seq-num = 2:
>>>>>>
>>>>>>        lookup(table, exp-seq-num) != NULL ? --> Yes, push to used ring,
>>>>>>                                                 remove elem from table
>>>>>>                                                         |
>>>>>>                                                         v
>>>>>>                                                   ================
>>>>>>                                                   |   *empty*    |
>>>>>>                                                   ================
>>>>>>
>>>>>>        exp-seq-num = 3:
>>>>>>
>>>>>>        lookup(table, exp-seq-num) != NULL ? --> No, done
>>>>>>        ---------------------
>>>>>>
>>>>>
>>>>> I think to use a hashtable to handle this has an important drawback:
>>>>> it hurts performance on the devices that are using right in-order
>>>>> because of hash calculus, to benefit devices that are using it badly
>>>>> by using descriptors out of order. We should use data structs that are
>>>>> as free as possible for the first, and we don't care to worse the
>>>>> experience of the devices that enable in_order and they shouldn't.
>>>>>
>>>>
>>>> Right, because if descriptors are coming in in-order, we still search
>>>> the (empty) hash table.
>>>>
>>>> Hmm... what if we introduced a flag to see if we actually should bother
>>>> searching the hash table? That way we avoid the cost of searching when
>>>> we really don't need to.
>>>>
>>>>> So I suggest reusing vq->used_elems array vq. At each used descriptor
>>>>> written in the used ring, you know the next head is elem->index +
>>>>> elem->ndescs, so you can check if that element has been filled or not.
>>>>> If used, it needs to be flushed too. If not used, just return.
>>>>>
>>>>> Of course virtqueue_flush also needs to take this into account.
>>>>>
>>>>> What do you think, does it make sense to you?
>>>>>
>>>>
>>>> I'm having a bit of trouble understanding the suggestion here. Would you
>>>> mind elaborating a bit more for me on this?
>>>>
>>>> For example, say elem0, elem1, and elem2 were enqueued in-order (elem0
>>>> being first, elem2 last) and then elem2 finishes first, elem1 second,
>>>> and elem0 third. Given that these elements finish out-of-order, how
>>>> would you handle these out-of-order elements using your suggestion?
>>>>
>>>
>>> virtqueue_fill is called first with elem2. So vq->used_elems[2 %
>>> vq->num] is filled with the needed information of the descriptor:
>>> index, len and ndescs. idx function parameter is ignored.
>>>
>>> Optionally, virtqueue_push is called. It checks if
>>> vq->used_elems[vq->used_idx] is valid. valid can be elem->in_num +
>>> elem->out_num > 0, and reset them on every used ring write. If it is
>>> not valid, this is a no-op. Currently, it is not valid.
>>>
>>> Same process for elem1.
>>>
>>> virtqueue_fill is the same for elem0. But now virtqueue_flush gets
>>> interesting, as it detects vq->used_elems[0] is used. It scans for the
>>> first not-used element, and it finds it is vq->used_elems[3]. So it
>>> needs to write an used elem with id = 2 and the corresponding length.
>>>
>>> Maybe it is interesting to implement ways to improve the look for the
>>> last used descriptor, but if any I'd go for a bitmap and always on top
>>> of the basis series.
>>>
>>> The algorithm has not been tested, so maybe I've missed something.
>>>
>>> Thanks!
>>>
>>
>> Thank you for taking the time to clarify for this for me, I appreciate it.
>>
>> I spent some time yesterday and this morning working this over in my
>> head. I believe I understand what you're trying to do here and it makes
>> more sense than employing a data structure like a hash table for this
>> kind of job. However, I have a few questions regarding this implementation.
>>
>> So, one question is on the reuse of the VirtQueue's used_elems array.
>> Wont reusing this array cause issues with packed VQ operations, since it
>> also uses this array? If we want to stick with using this array
>> specifically, perhaps we may need to rewrite its logic if the device has
>> negotiated the in_order feature? E.g.
>>
>> virtqueue_packed_flush (...) {
>>      if (virtio_vdev_has_feature(vdev, VIRTIO_F_IN_ORDER) {
>>         // new logic
>>      } else {
>>        // current logic
>>      }
>> }
>> -----------
>>
> 
> That's right.
> 
>> Regarding this paragraph:
>>
>> "virtqueue_fill is called first with elem2. So vq->used_elems[2 %
>> vq->num] is filled with the needed information of the descriptor:
>> index, len and ndescs. idx function parameter is ignored."
>>
>> This looks exactly like virtqueue_packed_fill except for the idx
>> parameter we'd pass in (sequence_num % vq->vring.num).
>>
>> In any case, regardless of whether this element being passed in is
>> considered to be in-order or not, we still add this element to
>> vq->used_elems in virtqueue_fill. Ok, got it.
>>
>> Then you say "Optionally, virtqueue_push is called". I assume by
>> "optionally" you mean we need to know if this is a single-shot operation
>> or a batched operation. A single-shot operation would call for
>> virtqueue_push whereas a batched operation would just use
>> virtqueue_fill. If this is what you meant by that then ok, I understand
>> that too.
>>
> 
> Totally correct.
> 
>> However, I think before we start considering whether or not we need to
>> call virtqueue_push or continue with virtqueue_fill, we first should
>> know whether or not this element is in-order. And I think to do that we
>> should use the check you mentioned:
>>
>> if (vq->used_elems[vq->used_idx].in_num +
>> vq->used_elems[vq->used_idx].out_num > 0)
>>
>> or perhaps:
>>
>> if (vq->used_elems[vq->used_idx] != NULL)
>>
>> If the element is found not to be in-order, I assume we return and we
>> are done with the handling of this element for now.
>>
>> Now my confusion with this part comes from calling virtqueue_push inside
>> of the virtqueue_fill function. Wouldn't calling virtqueue_push inside
>> of virtqueue_fill present some kind of recursive execution path? Unless
>> I'm missing something here, this probably isn't something we need to do,
>> right?
> 
> Maybe I explained something wrong, but virtqueue_fill should not call
> virtqueue_push. It's up to the caller (virtio-net, virtio-blk, etc) to
> call one or another. Can you elaborate?
> 

Oh I see, my apologies! I misunderstood and thought you were suggesting 
to call virtqueue_push inside of virtqueue_fill.

>> -----------
>>
>> Lastly, when execution reaches virtqueue_flush, what would define an
>> element as unused? Perhaps...
>>
>> if (vq->used_elems[i] == NULL)
>>
> 
> used_elems is not an array of pointers but an array of
> VirtQueueElement so we cannot do this way.
> 
>> or
>>
>> if (vq->used_elems[i].in_num + vq->used_elems[i].out_num > 0)
>>
> 
> Right, I propose to reset both in_num = out_num = 0.
> 
> Thanks!
> 

Gotcha. I'll take this and work on getting a v2 RFC out.

Thank you for the re-clarifying things again Eugenio!

>> Thanks Eugenio!
>>
>>>> Thanks :)
>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>> Jonah Palmer (8):
>>>>>>      virtio: Define InOrderVQElement
>>>>>>      virtio: Create/destroy/reset VirtQueue In-Order hash table
>>>>>>      virtio: Define order variables
>>>>>>      virtio: Implement in-order handling for virtio devices
>>>>>>      virtio-net: in-order handling
>>>>>>      vhost-svq: in-order handling
>>>>>>      vhost/vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits
>>>>>>      virtio: Add VIRTIO_F_IN_ORDER property definition
>>>>>>
>>>>>>     hw/block/vhost-user-blk.c          |   1 +
>>>>>>     hw/net/vhost_net.c                 |   2 +
>>>>>>     hw/net/virtio-net.c                |   6 +-
>>>>>>     hw/scsi/vhost-scsi.c               |   1 +
>>>>>>     hw/scsi/vhost-user-scsi.c          |   1 +
>>>>>>     hw/virtio/vhost-shadow-virtqueue.c |  15 ++++-
>>>>>>     hw/virtio/vhost-user-fs.c          |   1 +
>>>>>>     hw/virtio/vhost-user-vsock.c       |   1 +
>>>>>>     hw/virtio/virtio.c                 | 103 ++++++++++++++++++++++++++++-
>>>>>>     include/hw/virtio/virtio.h         |  20 +++++-
>>>>>>     net/vhost-vdpa.c                   |   1 +
>>>>>>     11 files changed, 145 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 2.39.3
>>>>>>
>>>>>
>>>>
>>>
>>
>

     prev parent reply	other threads:[~2024-03-26 19:02 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-21 15:57 [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support Jonah Palmer
2024-03-21 15:57 ` [RFC 1/8] virtio: Define InOrderVQElement Jonah Palmer
2024-03-22  9:45   ` Eugenio Perez Martin
2024-03-25 17:08     ` Jonah Palmer
2024-03-25 19:12       ` Eugenio Perez Martin
2024-03-21 15:57 ` [RFC 2/8] virtio: Create/destroy/reset VirtQueue In-Order hash table Jonah Palmer
2024-03-21 15:57 ` [RFC 3/8] virtio: Define order variables Jonah Palmer
2024-03-21 15:57 ` [RFC 4/8] virtio: Implement in-order handling for virtio devices Jonah Palmer
2024-03-22 10:46   ` Eugenio Perez Martin
2024-03-25 17:34     ` Jonah Palmer
2024-03-25 19:45       ` Eugenio Perez Martin
2024-03-21 15:57 ` [RFC 5/8] virtio-net: in-order handling Jonah Palmer
2024-03-21 15:57 ` [RFC 6/8] vhost-svq: " Jonah Palmer
2024-03-21 15:57 ` [RFC 7/8] vhost/vhost-user: Add VIRTIO_F_IN_ORDER to vhost feature bits Jonah Palmer
2024-03-22 10:47   ` Eugenio Perez Martin
2024-03-21 15:57 ` [RFC 8/8] virtio: Add VIRTIO_F_IN_ORDER property definition Jonah Palmer
2024-03-22 10:48   ` Eugenio Perez Martin
2024-03-21 19:48 ` [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support Dongli Zhang
2024-03-21 21:25   ` Jonah Palmer
2024-03-22 11:18 ` Eugenio Perez Martin
2024-03-25 16:52   ` Jonah Palmer
2024-03-25 20:33     ` Eugenio Perez Martin
2024-03-26 16:49       ` Jonah Palmer
2024-03-26 18:34         ` Eugenio Perez Martin
2024-03-26 19:01           ` Jonah Palmer [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ad9fb016-bd2b-4e75-8bcc-d361676aef5f@oracle.com \
    --to=jonah.palmer@oracle.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=eperezma@redhat.com \
    --cc=fam@euphon.net \
    --cc=hreitz@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=leiyang@redhat.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael@enfabrica.net \
    --cc=schalla@marvell.com \
    --cc=si-wei.liu@oracle.com \
    --cc=stefanha@redhat.com \
    --cc=virtio-fs@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).