2008-12-18 17:13:03

by Mark McLoughlin

Subject: [PATCH 0/3] virtio: indirect ring entries


Hi Rusty,
Here's something that has (apparently) been kicked around for
a while now and I think makes some sense.

Avi has been especially pushing for it lately in the context
of high performance block I/O - I'll let him explain his thinking
there.

The patches themselves are fairly trivial; there shouldn't be
anything too surprising here.

Cheers,
Mark.


2008-12-20 11:42:19

by Ingo Oeser

Subject: Re: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

Hi Mark,

On Thursday 18 December 2008, Mark McLoughlin wrote:
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 5777196..2330c4b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -70,6 +73,55 @@ struct vring_virtqueue
>
> #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>
> +/* Set up an indirect table of descriptors and add it to the queue. */
> +static int vring_add_indirect(struct vring_virtqueue *vq,
> + struct scatterlist sg[],
> + unsigned int out,
> + unsigned int in)
> +{
> + struct vring_desc *desc;
> + unsigned head;
> + int i;
> +
> + desc = kmalloc((out + in) * sizeof(struct vring_desc), GFP_ATOMIC);

kmalloc() returns ZERO_SIZE_PTR if (out + in) == 0.

> + if (!desc)
> + return vq->vring.num;
> +
> + /* Transfer entries from the sg list into the indirect page */
> + for (i = 0; i < out; i++) {
> + desc[i].flags = VRING_DESC_F_NEXT;
> + desc[i].addr = sg_phys(sg);
> + desc[i].len = sg->length;
> + desc[i].next = i+1;
> + sg++;
> + }
> + for (; i < (out + in); i++) {
> + desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> + desc[i].addr = sg_phys(sg);
> + desc[i].len = sg->length;
> + desc[i].next = i+1;
> + sg++;
> + }
> +
> + /* Last one doesn't continue. */
> + desc[i-1].flags &= ~VRING_DESC_F_NEXT;
> + desc[i-1].next = 0;

So this array index can be -1 in that case.
Please check for and avoid that within this function.


Best Regards

Ingo Oeser

2008-12-22 10:17:44

by Mark McLoughlin

Subject: Re: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

Hi Ingo,

On Sat, 2008-12-20 at 12:38 +0100, Ingo Oeser wrote:
> Hi Mark,
>
> On Thursday 18 December 2008, Mark McLoughlin wrote:
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 5777196..2330c4b 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -70,6 +73,55 @@ struct vring_virtqueue
> >
> > #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
> >
> > +/* Set up an indirect table of descriptors and add it to the queue. */
> > +static int vring_add_indirect(struct vring_virtqueue *vq,
> > + struct scatterlist sg[],
> > + unsigned int out,
> > + unsigned int in)
> > +{
> > + struct vring_desc *desc;
> > + unsigned head;
> > + int i;
> > +
> > + desc = kmalloc((out + in) * sizeof(struct vring_desc), GFP_ATOMIC);
>
> kmalloc() returns ZERO_SIZE_PTR if (out + in) == 0.

vring_add_buf() has:

BUG_ON(out + in == 0);

I should just add that here too before the kmalloc() call.
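
A minimal sketch of that guard at the top of vring_add_indirect() (not
the final patch): since kmalloc(0) returns the non-NULL ZERO_SIZE_PTR,
the !desc check alone wouldn't catch out + in == 0, and the later
desc[i-1] writes would then land before the allocation.

    BUG_ON(out + in == 0);

    desc = kmalloc((out + in) * sizeof(struct vring_desc), GFP_ATOMIC);
    if (!desc)
            return vq->vring.num;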

Thanks,
Mark.

2008-12-18 17:12:01

by Mark McLoughlin

Subject: [PATCH 1/3] virtio: teach virtio_has_feature() about transport features

Drivers don't add transport features to their table, so we
shouldn't check these with virtio_check_driver_offered_feature().

We could perhaps add an ->offered_feature() virtio_config_op,
but perhaps that would be overkill for a consistency check
like this.

Signed-off-by: Mark McLoughlin <[email protected]>
---
include/linux/virtio_config.h | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index bf8ec28..e4ba694 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -99,7 +99,9 @@ static inline bool virtio_has_feature(const struct virtio_device *vdev,
if (__builtin_constant_p(fbit))
BUILD_BUG_ON(fbit >= 32);

- virtio_check_driver_offered_feature(vdev, fbit);
+ if (fbit < VIRTIO_TRANSPORT_F_START)
+ virtio_check_driver_offered_feature(vdev, fbit);
+
return test_bit(fbit, vdev->features);
}

--
1.6.0.5

2008-12-18 17:12:45

by Mark McLoughlin

Subject: [PATCH 2/3] virtio: indirect ring entries (VIRTIO_RING_F_INDIRECT_DESC)

Add a new feature flag for indirect ring entries. These are ring
entries which point to a table of buffer descriptors.

The idea here is to increase the ring capacity by allowing a larger
effective ring size whereby the ring size dictates the number of
requests that may be outstanding, rather than the size of those
requests.

This should be most effective in the case of block I/O where we can
potentially benefit by concurrently dispatching a large number of
large requests. Even in the simple case of single segment block
requests, this results in a threefold increase in ring capacity.
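
To put numbers on that: a single-segment block request needs three
descriptors (an out header, the data segment and an in status byte), so
a direct ring of, say, 128 entries tops out at roughly 42 outstanding
requests. With indirect descriptors each request occupies exactly one
ring entry pointing at its own small table, so the same ring holds 128
requests, which is where the threefold figure comes from. (The 128-entry
ring is only an example size for illustration.)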

Signed-off-by: Mark McLoughlin <[email protected]>
---
drivers/virtio/virtio_ring.c | 75 ++++++++++++++++++++++++++++++++++++++++-
include/linux/virtio_ring.h | 5 +++
2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5777196..2330c4b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -46,6 +46,9 @@ struct vring_virtqueue
/* Other side has made a mess, don't try any more. */
bool broken;

+ /* Host supports indirect buffers */
+ bool indirect;
+
/* Number of free buffers */
unsigned int num_free;
/* Head of free buffer list. */
@@ -70,6 +73,55 @@ struct vring_virtqueue

#define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)

+/* Set up an indirect table of descriptors and add it to the queue. */
+static int vring_add_indirect(struct vring_virtqueue *vq,
+ struct scatterlist sg[],
+ unsigned int out,
+ unsigned int in)
+{
+ struct vring_desc *desc;
+ unsigned head;
+ int i;
+
+ desc = kmalloc((out + in) * sizeof(struct vring_desc), GFP_ATOMIC);
+ if (!desc)
+ return vq->vring.num;
+
+ /* Transfer entries from the sg list into the indirect page */
+ for (i = 0; i < out; i++) {
+ desc[i].flags = VRING_DESC_F_NEXT;
+ desc[i].addr = sg_phys(sg);
+ desc[i].len = sg->length;
+ desc[i].next = i+1;
+ sg++;
+ }
+ for (; i < (out + in); i++) {
+ desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+ desc[i].addr = sg_phys(sg);
+ desc[i].len = sg->length;
+ desc[i].next = i+1;
+ sg++;
+ }
+
+ /* Last one doesn't continue. */
+ desc[i-1].flags &= ~VRING_DESC_F_NEXT;
+ desc[i-1].next = 0;
+
+ /* We're about to use a buffer */
+ vq->num_free--;
+
+ /* Use a single buffer which doesn't continue */
+ head = vq->free_head;
+ vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+ vq->vring.desc[head].addr = virt_to_phys(desc);
+ vq->vring.desc[head].len = i * sizeof(struct vring_desc);
+
+ /* Update free pointer */
+ vq->free_head = vq->vring.desc[head].next;
+
+ return head;
+}
+
static int vring_add_buf(struct virtqueue *_vq,
struct scatterlist sg[],
unsigned int out,
@@ -79,12 +131,21 @@ static int vring_add_buf(struct virtqueue *_vq,
struct vring_virtqueue *vq = to_vvq(_vq);
unsigned int i, avail, head, uninitialized_var(prev);

+ START_USE(vq);
+
BUG_ON(data == NULL);
+
+ /* If the host supports indirect descriptor tables, and we have multiple
+ * buffers, then go indirect. FIXME: tune this threshold */
+ if (vq->indirect && (out + in) > 1 && vq->num_free) {
+ head = vring_add_indirect(vq, sg, out, in);
+ if (head != vq->vring.num)
+ goto add_head;
+ }
+
BUG_ON(out + in > vq->vring.num);
BUG_ON(out + in == 0);

- START_USE(vq);
-
if (vq->num_free < out + in) {
pr_debug("Can't add buf len %i - avail = %i\n",
out + in, vq->num_free);
@@ -121,6 +182,7 @@ static int vring_add_buf(struct virtqueue *_vq,
/* Update free pointer */
vq->free_head = i;

+add_head:
/* Set token. */
vq->data[head] = data;

@@ -164,6 +226,11 @@ static void detach_buf(struct vring_virtqueue *vq, unsigned int head)

/* Put back on free list: find end */
i = head;
+
+ /* Free the indirect table */
+ if (vq->vring.desc[i].flags & VRING_DESC_F_INDIRECT)
+ kfree(phys_to_virt(vq->vring.desc[i].addr));
+
while (vq->vring.desc[i].flags & VRING_DESC_F_NEXT) {
i = vq->vring.desc[i].next;
vq->num_free++;
@@ -305,6 +372,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
vq->in_use = false;
#endif

+ vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
+
/* No callback? Tell other side not to bother us. */
if (!callback)
vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -332,6 +401,8 @@ void vring_transport_features(struct virtio_device *vdev)

for (i = VIRTIO_TRANSPORT_F_START; i < VIRTIO_TRANSPORT_F_END; i++) {
switch (i) {
+ case VIRTIO_RING_F_INDIRECT_DESC:
+ break;
default:
/* We don't understand this bit. */
clear_bit(i, vdev->features);
diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 71e0372..3828ae2 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -14,6 +14,8 @@
#define VRING_DESC_F_NEXT 1
/* This marks a buffer as write-only (otherwise read-only). */
#define VRING_DESC_F_WRITE 2
+/* This means the buffer contains a list of buffer descriptors. */
+#define VRING_DESC_F_INDIRECT 4

/* The Host uses this in used->flags to advise the Guest: don't kick me when
* you add a buffer. It's unreliable, so it's simply an optimization. Guest
@@ -24,6 +26,9 @@
* optimization. */
#define VRING_AVAIL_F_NO_INTERRUPT 1

+/* We support indirect buffer descriptors */
+#define VIRTIO_RING_F_INDIRECT_DESC 28
+
/* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
struct vring_desc
{
--
1.6.0.5
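
For context, here is a rough sketch of how a request submission looks
from a driver's point of view with this change: nothing on the driver
side needs to change, since the decision to go indirect is taken inside
vring_add_buf(). It assumes the vq_ops interface of this era
(add_buf()/kick(), returning 0 on success), and example_submit(),
struct my_hdr, token etc. are made-up names purely for illustration:

#include <linux/types.h>
#include <linux/errno.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* Stand-in for a real request header such as struct virtio_blk_outhdr. */
struct my_hdr {
        u32 type;
        u64 sector;
};

static int example_submit(struct virtqueue *vq, void *token,
                          struct my_hdr *hdr, void *data, unsigned int len,
                          u8 *status)
{
        struct scatterlist sg[3];

        sg_init_table(sg, 3);
        sg_set_buf(&sg[0], hdr, sizeof(*hdr));       /* out: request header */
        sg_set_buf(&sg[1], data, len);               /* out: payload */
        sg_set_buf(&sg[2], status, sizeof(*status)); /* in: status byte */

        /* With VIRTIO_RING_F_INDIRECT_DESC negotiated this consumes one
         * ring entry instead of three; the call itself is unchanged. */
        if (vq->vq_ops->add_buf(vq, sg, 2, 1, token))
                return -ENOSPC; /* ring full, retry after reclaiming buffers */

        vq->vq_ops->kick(vq);
        return 0;
}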

2008-12-18 17:12:26

by Mark McLoughlin

Subject: [PATCH 3/3] lguest: add support for indirect ring entries

Support the VIRTIO_RING_F_INDIRECT_DESC feature.

This is a simple matter of changing the descriptor walking
code to operate on a struct vring_desc* and supplying it
with an indirect table if detected.

Signed-off-by: Mark McLoughlin <[email protected]>
---
Documentation/lguest/lguest.c | 41 +++++++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index f2dbbf3..3cd388f 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -623,20 +623,21 @@ static void *_check_pointer(unsigned long addr, unsigned int size,
/* Each buffer in the virtqueues is actually a chain of descriptors. This
* function returns the next descriptor in the chain, or vq->vring.num if we're
* at the end. */
-static unsigned next_desc(struct virtqueue *vq, unsigned int i)
+static unsigned next_desc(struct vring_desc *desc,
+ unsigned int i, unsigned int max)
{
unsigned int next;

/* If this descriptor says it doesn't chain, we're done. */
- if (!(vq->vring.desc[i].flags & VRING_DESC_F_NEXT))
- return vq->vring.num;
+ if (!(desc[i].flags & VRING_DESC_F_NEXT))
+ return max;

/* Check they're not leading us off end of descriptors. */
- next = vq->vring.desc[i].next;
+ next = desc[i].next;
/* Make sure compiler knows to grab that: we don't want it changing! */
wmb();

- if (next >= vq->vring.num)
+ if (next >= max)
errx(1, "Desc next is %u", next);

return next;
@@ -653,7 +654,8 @@ static unsigned get_vq_desc(struct virtqueue *vq,
struct iovec iov[],
unsigned int *out_num, unsigned int *in_num)
{
- unsigned int i, head;
+ struct vring_desc *desc;
+ unsigned int i, head, max;
u16 last_avail;

/* Check it isn't doing very strange things with descriptor numbers. */
@@ -678,15 +680,28 @@ static unsigned get_vq_desc(struct virtqueue *vq,
/* When we start there are none of either input nor output. */
*out_num = *in_num = 0;

+ max = vq->vring.num;
+ desc = vq->vring.desc;
i = head;
+
+ /* If this is an indirect entry, then this buffer contains a descriptor
+ * table which we handle as if it's any normal descriptor chain. */
+ if (desc[i].flags & VRING_DESC_F_INDIRECT) {
+ if (desc[i].len % sizeof(struct vring_desc))
+ errx(1, "Invalid size for indirect buffer table");
+
+ max = desc[i].len / sizeof(struct vring_desc);
+ desc = check_pointer(desc[i].addr, desc[i].len);
+ i = 0;
+ }
+
do {
/* Grab the first descriptor, and check it's OK. */
- iov[*out_num + *in_num].iov_len = vq->vring.desc[i].len;
+ iov[*out_num + *in_num].iov_len = desc[i].len;
iov[*out_num + *in_num].iov_base
- = check_pointer(vq->vring.desc[i].addr,
- vq->vring.desc[i].len);
+ = check_pointer(desc[i].addr, desc[i].len);
/* If this is an input descriptor, increment that count. */
- if (vq->vring.desc[i].flags & VRING_DESC_F_WRITE)
+ if (desc[i].flags & VRING_DESC_F_WRITE)
(*in_num)++;
else {
/* If it's an output descriptor, they're all supposed
@@ -697,9 +712,9 @@ static unsigned get_vq_desc(struct virtqueue *vq,
}

/* If we've got too many, that implies a descriptor loop. */
- if (*out_num + *in_num > vq->vring.num)
+ if (*out_num + *in_num > max)
errx(1, "Looped descriptor");
- } while ((i = next_desc(vq, i)) != vq->vring.num);
+ } while ((i = next_desc(desc, i, max)) != max);

vq->inflight++;
return head;
@@ -1502,6 +1517,8 @@ static void setup_tun_net(char *arg)
add_feature(dev, VIRTIO_NET_F_HOST_TSO4);
add_feature(dev, VIRTIO_NET_F_HOST_TSO6);
add_feature(dev, VIRTIO_NET_F_HOST_ECN);
+ /* we handle indirect ring entries */
+ add_feature(dev, VIRTIO_RING_F_INDIRECT_DESC);
set_config(dev, sizeof(conf), &conf);

/* We don't need the socket any more; setup is done. */
--
1.6.0.5