This intentionally leaves the "fixup" changes separate - hopefully
they are enough to fix the vhost-net crashes reported here, and
keeping them separate helps me keep track of what changed.
I will squash them later once we are done.
This adds infrastructure required for supporting
multiple ring formats.
The idea is as follows: we first convert descriptors to an
independent format, and only convert that to an iov later.
The used ring is handled similarly: we fetch into an independent
struct first, and convert that to an IOV later.
The point is that we have a tight loop that fetches
descriptors, which is good for cache utilization.
This will also allow all kinds of batching tricks -
e.g. it seems possible to keep SMAP disabled while
we are fetching multiple descriptors.
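To illustrate the SMAP point - a hypothetical sketch, not code from
this series; it assumes the generic user_access_begin()/
unsafe_get_user() helpers and ignores chain following:

	if (!user_access_begin(vq->desc, vq->num * sizeof(*vq->desc)))
		return -EFAULT;
	for (i = 0; i < count; i++) {
		/* stac/clac stays hoisted out of this loop on x86 */
		unsafe_get_user(desc.addr, &vq->desc[i].addr, efault);
		unsafe_get_user(desc.len, &vq->desc[i].len, efault);
		unsafe_get_user(desc.flags, &vq->desc[i].flags, efault);
		/* ... save into the independent descriptor array ... */
	}
	user_access_end();
	return 0;
efault:
	user_access_end();
	return -EFAULT;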
For used descriptors, this allows keeping track of the buffer length
without the need to rescan the IOV.
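For reference, these are the format-independent structures the series
introduces (copied from the vhost.h hunks in the patches below);
vhost_buf's in_len is what lets the used-ring path report the written
length without rescanning the IOV:

struct vhost_desc {
	u64 addr;
	u32 len;
	u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
	u16 id;
};

struct vhost_buf {
	u32 out_len;
	u32 in_len;
	u16 descs;
	u16 id;
};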
Based on a microbenchmark, this seems to perform exactly the same
as the original code.
Lightly tested.
More testing would be very much appreciated.
Changes from v6:
- fixed some bugs introduced in v6 and v5
Changes from v5:
- addressed comments by Jason: squashed API changes, fixed up discard
Changes from v4:
- added used descriptor format independence
- addressed comments by Jason
- fixed a crash detected by the lkp robot
Changes from v3:
- fixed error handling in case of indirect descriptors
- added a BUG_ON to detect buffer overflows caused by bugs,
  in response to a comment by Jason Wang
- minor code tweaks
Changes from v2:
- fixed indirect descriptor batching
reported by Jason Wang
Changes from v1:
- typo fixes
Michael S. Tsirkin (14):
vhost: option to fetch descriptors through an independent struct
fixup! vhost: option to fetch descriptors through an independent
struct
vhost: use batched get_vq_desc version
vhost/net: pass net specific struct pointer
vhost: reorder functions
vhost: format-independent API for used buffers
fixup! vhost: format-independent API for used buffers
fixup! vhost: use batched get_vq_desc version
vhost/net: convert to new API: heads->bufs
vhost/net: avoid iov length math
vhost/test: convert to the buf API
vhost/scsi: switch to buf APIs
vhost/vsock: switch to the buf API
vhost: drop head based APIs
drivers/vhost/net.c | 174 +++++++++----------
drivers/vhost/scsi.c | 73 ++++----
drivers/vhost/test.c | 22 +--
drivers/vhost/vhost.c | 378 +++++++++++++++++++++++++++---------------
drivers/vhost/vhost.h | 44 +++--
drivers/vhost/vsock.c | 30 ++--
6 files changed, 439 insertions(+), 282 deletions(-)
--
MST
In preparation for further cleanup, pass a net-specific struct pointer
to the ubuf callbacks so we can move net-specific fields
out to the net structures.
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index bf5e1d81ae25..ff594eec8ae3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -94,7 +94,7 @@ struct vhost_net_ubuf_ref {
*/
atomic_t refcount;
wait_queue_head_t wait;
- struct vhost_virtqueue *vq;
+ struct vhost_net_virtqueue *nvq;
};
#define VHOST_NET_BATCH 64
@@ -231,7 +231,7 @@ static void vhost_net_enable_zcopy(int vq)
}
static struct vhost_net_ubuf_ref *
-vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
+vhost_net_ubuf_alloc(struct vhost_net_virtqueue *nvq, bool zcopy)
{
struct vhost_net_ubuf_ref *ubufs;
/* No zero copy backend? Nothing to count. */
@@ -242,7 +242,7 @@ vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
return ERR_PTR(-ENOMEM);
atomic_set(&ubufs->refcount, 1);
init_waitqueue_head(&ubufs->wait);
- ubufs->vq = vq;
+ ubufs->nvq = nvq;
return ubufs;
}
@@ -384,13 +384,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
{
struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
- struct vhost_virtqueue *vq = ubufs->vq;
+ struct vhost_net_virtqueue *nvq = ubufs->nvq;
int cnt;
rcu_read_lock_bh();
/* set len to mark this desc buffers done DMA */
- vq->heads[ubuf->desc].len = success ?
+ nvq->vq.heads[ubuf->desc].in_len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
@@ -402,7 +402,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
* less than 10% of times).
*/
if (cnt <= 1 || !(cnt % 16))
- vhost_poll_queue(&vq->poll);
+ vhost_poll_queue(&nvq->vq.poll);
rcu_read_unlock_bh();
}
@@ -1525,7 +1525,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
/* start polling new socket */
oldsock = vhost_vq_get_backend(vq);
if (sock != oldsock) {
- ubufs = vhost_net_ubuf_alloc(vq,
+ ubufs = vhost_net_ubuf_alloc(nvq,
sock && vhost_sock_zcopy(sock));
if (IS_ERR(ubufs)) {
r = PTR_ERR(ubufs);
--
MST
As testing shows no performance change, switch to that now.
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Eugenio Pérez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/test.c | 2 +-
drivers/vhost/vhost.c | 318 ++++++++----------------------------------
drivers/vhost/vhost.h | 7 +-
3 files changed, 65 insertions(+), 262 deletions(-)
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 0466921f4772..7d69778aaa26 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
dev = &n->dev;
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
- vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+ vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
f->private_data = n;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 11433d709651..28f324fd77df 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
{
vq->num = 1;
vq->ndescs = 0;
+ vq->first_desc = 0;
vq->desc = NULL;
vq->avail = NULL;
vq->used = NULL;
@@ -372,6 +373,11 @@ static int vhost_worker(void *data)
return 0;
}
+static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
+{
+ return vq->max_descs - UIO_MAXIOV;
+}
+
static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
{
kfree(vq->descs);
@@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
vq->max_descs = dev->iov_limit;
+ if (vhost_vq_num_batch_descs(vq) < 0) {
+ return -EINVAL;
+ }
vq->descs = kmalloc_array(vq->max_descs,
sizeof(*vq->descs),
GFP_KERNEL);
@@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
vq->last_avail_idx = s.num;
/* Forget the cached index value. */
vq->avail_idx = vq->last_avail_idx;
+ vq->ndescs = vq->first_desc = 0;
break;
case VHOST_GET_VRING_BASE:
s.index = idx;
@@ -2078,253 +2088,6 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
return next;
}
-static int get_indirect(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num,
- struct vring_desc *indirect)
-{
- struct vring_desc desc;
- unsigned int i = 0, count, found = 0;
- u32 len = vhost32_to_cpu(vq, indirect->len);
- struct iov_iter from;
- int ret, access;
-
- /* Sanity check */
- if (unlikely(len % sizeof desc)) {
- vq_err(vq, "Invalid length in indirect descriptor: "
- "len 0x%llx not multiple of 0x%zx\n",
- (unsigned long long)len,
- sizeof desc);
- return -EINVAL;
- }
-
- ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
- UIO_MAXIOV, VHOST_ACCESS_RO);
- if (unlikely(ret < 0)) {
- if (ret != -EAGAIN)
- vq_err(vq, "Translation failure %d in indirect.\n", ret);
- return ret;
- }
- iov_iter_init(&from, READ, vq->indirect, ret, len);
-
- /* We will use the result as an address to read from, so most
- * architectures only need a compiler barrier here. */
- read_barrier_depends();
-
- count = len / sizeof desc;
- /* Buffers are chained via a 16 bit next field, so
- * we can have at most 2^16 of these. */
- if (unlikely(count > USHRT_MAX + 1)) {
- vq_err(vq, "Indirect buffer length too big: %d\n",
- indirect->len);
- return -E2BIG;
- }
-
- do {
- unsigned iov_count = *in_num + *out_num;
- if (unlikely(++found > count)) {
- vq_err(vq, "Loop detected: last one at %u "
- "indirect size %u\n",
- i, count);
- return -EINVAL;
- }
- if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
- vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
- i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
- return -EINVAL;
- }
- if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
- vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
- i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
- return -EINVAL;
- }
-
- if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
- access = VHOST_ACCESS_WO;
- else
- access = VHOST_ACCESS_RO;
-
- ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
- vhost32_to_cpu(vq, desc.len), iov + iov_count,
- iov_size - iov_count, access);
- if (unlikely(ret < 0)) {
- if (ret != -EAGAIN)
- vq_err(vq, "Translation failure %d indirect idx %d\n",
- ret, i);
- return ret;
- }
- /* If this is an input descriptor, increment that count. */
- if (access == VHOST_ACCESS_WO) {
- *in_num += ret;
- if (unlikely(log && ret)) {
- log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
- log[*log_num].len = vhost32_to_cpu(vq, desc.len);
- ++*log_num;
- }
- } else {
- /* If it's an output descriptor, they're all supposed
- * to come before any input descriptors. */
- if (unlikely(*in_num)) {
- vq_err(vq, "Indirect descriptor "
- "has out after in: idx %d\n", i);
- return -EINVAL;
- }
- *out_num += ret;
- }
- } while ((i = next_desc(vq, &desc)) != -1);
- return 0;
-}
-
-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found. A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
-{
- struct vring_desc desc;
- unsigned int i, head, found = 0;
- u16 last_avail_idx;
- __virtio16 avail_idx;
- __virtio16 ring_head;
- int ret, access;
-
- /* Check it isn't doing very strange things with descriptor numbers. */
- last_avail_idx = vq->last_avail_idx;
-
- if (vq->avail_idx == vq->last_avail_idx) {
- if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
- vq_err(vq, "Failed to access avail idx at %p\n",
- &vq->avail->idx);
- return -EFAULT;
- }
- vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
-
- if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
- vq_err(vq, "Guest moved used index from %u to %u",
- last_avail_idx, vq->avail_idx);
- return -EFAULT;
- }
-
- /* If there's nothing new since last we looked, return
- * invalid.
- */
- if (vq->avail_idx == last_avail_idx)
- return vq->num;
-
- /* Only get avail ring entries after they have been
- * exposed by guest.
- */
- smp_rmb();
- }
-
- /* Grab the next descriptor number they're advertising, and increment
- * the index we've seen. */
- if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
- vq_err(vq, "Failed to read head: idx %d address %p\n",
- last_avail_idx,
- &vq->avail->ring[last_avail_idx % vq->num]);
- return -EFAULT;
- }
-
- head = vhost16_to_cpu(vq, ring_head);
-
- /* If their number is silly, that's an error. */
- if (unlikely(head >= vq->num)) {
- vq_err(vq, "Guest says index %u > %u is available",
- head, vq->num);
- return -EINVAL;
- }
-
- /* When we start there are none of either input nor output. */
- *out_num = *in_num = 0;
- if (unlikely(log))
- *log_num = 0;
-
- i = head;
- do {
- unsigned iov_count = *in_num + *out_num;
- if (unlikely(i >= vq->num)) {
- vq_err(vq, "Desc index is %u > %u, head = %u",
- i, vq->num, head);
- return -EINVAL;
- }
- if (unlikely(++found > vq->num)) {
- vq_err(vq, "Loop detected: last one at %u "
- "vq size %u head %u\n",
- i, vq->num, head);
- return -EINVAL;
- }
- ret = vhost_get_desc(vq, &desc, i);
- if (unlikely(ret)) {
- vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
- i, vq->desc + i);
- return -EFAULT;
- }
- if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
- ret = get_indirect(vq, iov, iov_size,
- out_num, in_num,
- log, log_num, &desc);
- if (unlikely(ret < 0)) {
- if (ret != -EAGAIN)
- vq_err(vq, "Failure detected "
- "in indirect descriptor at idx %d\n", i);
- return ret;
- }
- continue;
- }
-
- if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
- access = VHOST_ACCESS_WO;
- else
- access = VHOST_ACCESS_RO;
- ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
- vhost32_to_cpu(vq, desc.len), iov + iov_count,
- iov_size - iov_count, access);
- if (unlikely(ret < 0)) {
- if (ret != -EAGAIN)
- vq_err(vq, "Translation failure %d descriptor idx %d\n",
- ret, i);
- return ret;
- }
- if (access == VHOST_ACCESS_WO) {
- /* If this is an input descriptor,
- * increment that count. */
- *in_num += ret;
- if (unlikely(log && ret)) {
- log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
- log[*log_num].len = vhost32_to_cpu(vq, desc.len);
- ++*log_num;
- }
- } else {
- /* If it's an output descriptor, they're all supposed
- * to come before any input descriptors. */
- if (unlikely(*in_num)) {
- vq_err(vq, "Descriptor has out after in: "
- "idx %d\n", i);
- return -EINVAL;
- }
- *out_num += ret;
- }
- } while ((i = next_desc(vq, &desc)) != -1);
-
- /* On success, increment avail index. */
- vq->last_avail_idx++;
-
- /* Assume notifications from guest are disabled at this point,
- * if they aren't we would need to update avail_event index. */
- BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
- return head;
-}
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
-
static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
{
BUG_ON(!vq->ndescs);
@@ -2428,7 +2191,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
* A negative code is returned on error. */
-static int fetch_descs(struct vhost_virtqueue *vq)
+static int fetch_buf(struct vhost_virtqueue *vq)
{
unsigned int i, head, found = 0;
struct vhost_desc *last;
@@ -2441,7 +2204,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
/* Check it isn't doing very strange things with descriptor numbers. */
last_avail_idx = vq->last_avail_idx;
- if (vq->avail_idx == vq->last_avail_idx) {
+ if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
+ /* If we already have work to do, don't bother re-checking. */
+ if (likely(vq->ndescs))
+ return 1;
+
if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
@@ -2532,6 +2299,41 @@ static int fetch_descs(struct vhost_virtqueue *vq)
return 1;
}
+/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
+static int fetch_descs(struct vhost_virtqueue *vq)
+{
+ int ret;
+
+ if (unlikely(vq->first_desc >= vq->ndescs)) {
+ vq->first_desc = 0;
+ vq->ndescs = 0;
+ }
+
+ if (vq->ndescs)
+ return 1;
+
+ for (ret = 1;
+ ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
+ ret = fetch_buf(vq))
+ ;
+
+ /* On success we expect some descs */
+ BUG_ON(ret > 0 && !vq->ndescs);
+ return ret;
+}
+
+/* Reverse the effects of fetch_descs */
+static void unfetch_descs(struct vhost_virtqueue *vq)
+{
+ int i;
+
+ for (i = vq->first_desc; i < vq->ndescs; ++i)
+ if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
+ vq->last_avail_idx -= 1;
+ vq->ndescs = 0;
+}
+
/* This looks in the virtqueue and for the first available buffer, and converts
* it to an iovec for convenient access. Since descriptors consist of some
* number of output then some number of input descriptors, it's actually two
@@ -2540,7 +2342,7 @@ static int fetch_descs(struct vhost_virtqueue *vq)
* This function returns the descriptor number found, or vq->num (which is
* never a valid descriptor number) if none was found. A negative code is
* returned on error. */
-int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
+int vhost_get_vq_desc(struct vhost_virtqueue *vq,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num)
@@ -2549,7 +2351,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
int i;
if (ret <= 0)
- goto err_fetch;
+ goto err;
/* Now convert to IOV */
/* When we start there are none of either input nor output. */
@@ -2557,7 +2359,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
if (unlikely(log))
*log_num = 0;
- for (i = 0; i < vq->ndescs; ++i) {
+ for (i = vq->first_desc; i < vq->ndescs; ++i) {
unsigned iov_count = *in_num + *out_num;
struct vhost_desc *desc = &vq->descs[i];
int access;
@@ -2603,24 +2405,26 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
}
ret = desc->id;
+
+ if (!(desc->flags & VRING_DESC_F_NEXT))
+ break;
}
- vq->ndescs = 0;
+ vq->first_desc = i + 1;
return ret;
err:
- vhost_discard_vq_desc(vq, 1);
-err_fetch:
- vq->ndescs = 0;
+ unfetch_descs(vq);
return ret ? ret : vq->num;
}
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
+EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
{
+ unfetch_descs(vq);
vq->last_avail_idx -= n;
}
EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 87089d51490d..fed36af5c444 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -81,6 +81,7 @@ struct vhost_virtqueue {
struct vhost_desc *descs;
int ndescs;
+ int first_desc;
int max_descs;
struct file *kick;
@@ -189,10 +190,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
-int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
- struct iovec iov[], unsigned int iov_count,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num);
int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
@@ -261,6 +258,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
void *private_data)
{
vq->private_data = private_data;
+ vq->ndescs = 0;
+ vq->first_desc = 0;
}
/**
--
MST
---
drivers/vhost/vhost.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7a587b13095c..03e6bca02288 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2205,10 +2205,6 @@ static int fetch_buf(struct vhost_virtqueue *vq)
last_avail_idx = vq->last_avail_idx;
if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
- /* If we already have work to do, don't bother re-checking. */
- if (likely(vq->ndescs))
- return 1;
-
if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
vq_err(vq, "Failed to access avail idx at %p\n",
&vq->avail->idx);
--
MST
On Wed, Jun 10, 2020 at 1:36 PM Michael S. Tsirkin <[email protected]> wrote:
>
> As testing shows no performance change, switch to that now.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Eugenio Pérez <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> drivers/vhost/test.c | 2 +-
> drivers/vhost/vhost.c | 318 ++++++++----------------------------------
> drivers/vhost/vhost.h | 7 +-
> 3 files changed, 65 insertions(+), 262 deletions(-)
>
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index 0466921f4772..7d69778aaa26 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> dev = &n->dev;
> vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> - vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> + vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
>
> f->private_data = n;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 11433d709651..28f324fd77df 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> {
> vq->num = 1;
> vq->ndescs = 0;
> + vq->first_desc = 0;
> vq->desc = NULL;
> vq->avail = NULL;
> vq->used = NULL;
> @@ -372,6 +373,11 @@ static int vhost_worker(void *data)
> return 0;
> }
>
> +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> +{
> + return vq->max_descs - UIO_MAXIOV;
> +}
> +
> static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> {
> kfree(vq->descs);
> @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> vq->max_descs = dev->iov_limit;
> + if (vhost_vq_num_batch_descs(vq) < 0) {
> + return -EINVAL;
> + }
> vq->descs = kmalloc_array(vq->max_descs,
> sizeof(*vq->descs),
> GFP_KERNEL);
> @@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> vq->last_avail_idx = s.num;
> /* Forget the cached index value. */
> vq->avail_idx = vq->last_avail_idx;
> + vq->ndescs = vq->first_desc = 0;
> break;
> case VHOST_GET_VRING_BASE:
> s.index = idx;
> @@ -2078,253 +2088,6 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> return next;
> }
>
> -static int get_indirect(struct vhost_virtqueue *vq,
> - struct iovec iov[], unsigned int iov_size,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num,
> - struct vring_desc *indirect)
> -{
> - struct vring_desc desc;
> - unsigned int i = 0, count, found = 0;
> - u32 len = vhost32_to_cpu(vq, indirect->len);
> - struct iov_iter from;
> - int ret, access;
> -
> - /* Sanity check */
> - if (unlikely(len % sizeof desc)) {
> - vq_err(vq, "Invalid length in indirect descriptor: "
> - "len 0x%llx not multiple of 0x%zx\n",
> - (unsigned long long)len,
> - sizeof desc);
> - return -EINVAL;
> - }
> -
> - ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
> - UIO_MAXIOV, VHOST_ACCESS_RO);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d in indirect.\n", ret);
> - return ret;
> - }
> - iov_iter_init(&from, READ, vq->indirect, ret, len);
> -
> - /* We will use the result as an address to read from, so most
> - * architectures only need a compiler barrier here. */
> - read_barrier_depends();
> -
> - count = len / sizeof desc;
> - /* Buffers are chained via a 16 bit next field, so
> - * we can have at most 2^16 of these. */
> - if (unlikely(count > USHRT_MAX + 1)) {
> - vq_err(vq, "Indirect buffer length too big: %d\n",
> - indirect->len);
> - return -E2BIG;
> - }
> -
> - do {
> - unsigned iov_count = *in_num + *out_num;
> - if (unlikely(++found > count)) {
> - vq_err(vq, "Loop detected: last one at %u "
> - "indirect size %u\n",
> - i, count);
> - return -EINVAL;
> - }
> - if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
> - vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
> - i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> - return -EINVAL;
> - }
> - if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
> - vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
> - i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> - return -EINVAL;
> - }
> -
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
> - access = VHOST_ACCESS_WO;
> - else
> - access = VHOST_ACCESS_RO;
> -
> - ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
> - vhost32_to_cpu(vq, desc.len), iov + iov_count,
> - iov_size - iov_count, access);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d indirect idx %d\n",
> - ret, i);
> - return ret;
> - }
> - /* If this is an input descriptor, increment that count. */
> - if (access == VHOST_ACCESS_WO) {
> - *in_num += ret;
> - if (unlikely(log && ret)) {
> - log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
> - log[*log_num].len = vhost32_to_cpu(vq, desc.len);
> - ++*log_num;
> - }
> - } else {
> - /* If it's an output descriptor, they're all supposed
> - * to come before any input descriptors. */
> - if (unlikely(*in_num)) {
> - vq_err(vq, "Indirect descriptor "
> - "has out after in: idx %d\n", i);
> - return -EINVAL;
> - }
> - *out_num += ret;
> - }
> - } while ((i = next_desc(vq, &desc)) != -1);
> - return 0;
> -}
> -
> -/* This looks in the virtqueue and for the first available buffer, and converts
> - * it to an iovec for convenient access. Since descriptors consist of some
> - * number of output then some number of input descriptors, it's actually two
> - * iovecs, but we pack them into one and note how many of each there were.
> - *
> - * This function returns the descriptor number found, or vq->num (which is
> - * never a valid descriptor number) if none was found. A negative code is
> - * returned on error. */
> -int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> - struct iovec iov[], unsigned int iov_size,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num)
> -{
> - struct vring_desc desc;
> - unsigned int i, head, found = 0;
> - u16 last_avail_idx;
> - __virtio16 avail_idx;
> - __virtio16 ring_head;
> - int ret, access;
> -
> - /* Check it isn't doing very strange things with descriptor numbers. */
> - last_avail_idx = vq->last_avail_idx;
> -
> - if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> - &vq->avail->idx);
> - return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> -
> - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> - vq_err(vq, "Guest moved used index from %u to %u",
> - last_avail_idx, vq->avail_idx);
> - return -EFAULT;
> - }
> -
> - /* If there's nothing new since last we looked, return
> - * invalid.
> - */
> - if (vq->avail_idx == last_avail_idx)
> - return vq->num;
> -
> - /* Only get avail ring entries after they have been
> - * exposed by guest.
> - */
> - smp_rmb();
> - }
> -
> - /* Grab the next descriptor number they're advertising, and increment
> - * the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> - vq_err(vq, "Failed to read head: idx %d address %p\n",
> - last_avail_idx,
> - &vq->avail->ring[last_avail_idx % vq->num]);
> - return -EFAULT;
> - }
> -
> - head = vhost16_to_cpu(vq, ring_head);
> -
> - /* If their number is silly, that's an error. */
> - if (unlikely(head >= vq->num)) {
> - vq_err(vq, "Guest says index %u > %u is available",
> - head, vq->num);
> - return -EINVAL;
> - }
> -
> - /* When we start there are none of either input nor output. */
> - *out_num = *in_num = 0;
> - if (unlikely(log))
> - *log_num = 0;
> -
> - i = head;
> - do {
> - unsigned iov_count = *in_num + *out_num;
> - if (unlikely(i >= vq->num)) {
> - vq_err(vq, "Desc index is %u > %u, head = %u",
> - i, vq->num, head);
> - return -EINVAL;
> - }
> - if (unlikely(++found > vq->num)) {
> - vq_err(vq, "Loop detected: last one at %u "
> - "vq size %u head %u\n",
> - i, vq->num, head);
> - return -EINVAL;
> - }
> - ret = vhost_get_desc(vq, &desc, i);
> - if (unlikely(ret)) {
> - vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> - i, vq->desc + i);
> - return -EFAULT;
> - }
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
> - ret = get_indirect(vq, iov, iov_size,
> - out_num, in_num,
> - log, log_num, &desc);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Failure detected "
> - "in indirect descriptor at idx %d\n", i);
> - return ret;
> - }
> - continue;
> - }
> -
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
> - access = VHOST_ACCESS_WO;
> - else
> - access = VHOST_ACCESS_RO;
> - ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
> - vhost32_to_cpu(vq, desc.len), iov + iov_count,
> - iov_size - iov_count, access);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d descriptor idx %d\n",
> - ret, i);
> - return ret;
> - }
> - if (access == VHOST_ACCESS_WO) {
> - /* If this is an input descriptor,
> - * increment that count. */
> - *in_num += ret;
> - if (unlikely(log && ret)) {
> - log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
> - log[*log_num].len = vhost32_to_cpu(vq, desc.len);
> - ++*log_num;
> - }
> - } else {
> - /* If it's an output descriptor, they're all supposed
> - * to come before any input descriptors. */
> - if (unlikely(*in_num)) {
> - vq_err(vq, "Descriptor has out after in: "
> - "idx %d\n", i);
> - return -EINVAL;
> - }
> - *out_num += ret;
> - }
> - } while ((i = next_desc(vq, &desc)) != -1);
> -
> - /* On success, increment avail index. */
> - vq->last_avail_idx++;
> -
> - /* Assume notifications from guest are disabled at this point,
> - * if they aren't we would need to update avail_event index. */
> - BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
> - return head;
> -}
> -EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
> -
> static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
> {
> BUG_ON(!vq->ndescs);
> @@ -2428,7 +2191,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
>
> /* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> * A negative code is returned on error. */
> -static int fetch_descs(struct vhost_virtqueue *vq)
> +static int fetch_buf(struct vhost_virtqueue *vq)
> {
> unsigned int i, head, found = 0;
> struct vhost_desc *last;
> @@ -2441,7 +2204,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> /* Check it isn't doing very strange things with descriptor numbers. */
> last_avail_idx = vq->last_avail_idx;
>
> - if (vq->avail_idx == vq->last_avail_idx) {
> + if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
> + /* If we already have work to do, don't bother re-checking. */
> + if (likely(vq->ndescs))
> + return 1;
> +
> if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> vq_err(vq, "Failed to access avail idx at %p\n",
> &vq->avail->idx);
> @@ -2532,6 +2299,41 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> return 1;
> }
>
> +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> + * A negative code is returned on error. */
> +static int fetch_descs(struct vhost_virtqueue *vq)
> +{
> + int ret;
> +
> + if (unlikely(vq->first_desc >= vq->ndescs)) {
> + vq->first_desc = 0;
> + vq->ndescs = 0;
> + }
> +
> + if (vq->ndescs)
> + return 1;
> +
> + for (ret = 1;
> + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> + ret = fetch_buf(vq))
> + ;
(Expanding on my comment on v6):
We get an infinite loop this way:
* vq->ndescs == 0, so we call fetch_buf() here
* fetch_buf() gets fewer than vhost_vq_num_batch_descs(vq) descriptors; ret = 1
* This loop calls fetch_buf() again, but vq->ndescs > 0 (and avail_idx ==
last_avail_idx), so it just returns 1
I think we should check vq->ndescs == 0 instead of ndescs <=
vhost_vq_num_batch_descs(vq). However, this could cause fewer
descriptors to be fetched/cached, and thus lower throughput. Another
possibility is to compare against vhost_vq_num_batch_descs(vq) for the
"early return", or to find a sensible limit (at the cost of higher latency?).
In my local version:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 28f324fd77df..50b258a46cef 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2310,12 +2310,7 @@ static int fetch_descs(struct vhost_virtqueue *vq)
vq->ndescs = 0;
}
- if (vq->ndescs)
- return 1;
-
- for (ret = 1;
- ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
- ret = fetch_buf(vq))
+ for (ret = 1; ret > 0 && vq->ndescs == 0; ret = fetch_buf(vq))
;
/* On success we expect some descs */
---
> +
> + /* On success we expect some descs */
> + BUG_ON(ret > 0 && !vq->ndescs);
> + return ret;
> +}
> +
> +/* Reverse the effects of fetch_descs */
> +static void unfetch_descs(struct vhost_virtqueue *vq)
> +{
> + int i;
> +
> + for (i = vq->first_desc; i < vq->ndescs; ++i)
> + if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
> + vq->last_avail_idx -= 1;
> + vq->ndescs = 0;
> +}
> +
> /* This looks in the virtqueue and for the first available buffer, and converts
> * it to an iovec for convenient access. Since descriptors consist of some
> * number of output then some number of input descriptors, it's actually two
> @@ -2540,7 +2342,7 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> * This function returns the descriptor number found, or vq->num (which is
> * never a valid descriptor number) if none was found. A negative code is
> * returned on error. */
> -int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> +int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> struct iovec iov[], unsigned int iov_size,
> unsigned int *out_num, unsigned int *in_num,
> struct vhost_log *log, unsigned int *log_num)
> @@ -2549,7 +2351,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> int i;
>
> if (ret <= 0)
> - goto err_fetch;
> + goto err;
>
> /* Now convert to IOV */
> /* When we start there are none of either input nor output. */
> @@ -2557,7 +2359,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> if (unlikely(log))
> *log_num = 0;
>
> - for (i = 0; i < vq->ndescs; ++i) {
> + for (i = vq->first_desc; i < vq->ndescs; ++i) {
> unsigned iov_count = *in_num + *out_num;
> struct vhost_desc *desc = &vq->descs[i];
> int access;
> @@ -2603,24 +2405,26 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> }
>
> ret = desc->id;
> +
> + if (!(desc->flags & VRING_DESC_F_NEXT))
> + break;
> }
>
> - vq->ndescs = 0;
> + vq->first_desc = i + 1;
>
> return ret;
>
> err:
> - vhost_discard_vq_desc(vq, 1);
> -err_fetch:
> - vq->ndescs = 0;
> + unfetch_descs(vq);
>
> return ret ? ret : vq->num;
> }
> -EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
> +EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
>
> /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
> {
> + unfetch_descs(vq);
> vq->last_avail_idx -= n;
> }
> EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 87089d51490d..fed36af5c444 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -81,6 +81,7 @@ struct vhost_virtqueue {
>
> struct vhost_desc *descs;
> int ndescs;
> + int first_desc;
> int max_descs;
>
> struct file *kick;
> @@ -189,10 +190,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> bool vhost_log_access_ok(struct vhost_dev *);
>
> -int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
> - struct iovec iov[], unsigned int iov_count,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num);
> int vhost_get_vq_desc(struct vhost_virtqueue *,
> struct iovec iov[], unsigned int iov_count,
> unsigned int *out_num, unsigned int *in_num,
> @@ -261,6 +258,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
> void *private_data)
> {
> vq->private_data = private_data;
> + vq->ndescs = 0;
> + vq->first_desc = 0;
> }
>
> /**
> --
> MST
>
Add a new API that doesn't assume used ring, heads, etc.
For now, we keep the old APIs around to make it easier
to convert drivers.
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
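For illustration, a converted driver's handler loop would look roughly
like this - a sketch only, based on the signatures added below; the
dev/vq names, the NULL log arguments and the error handling are
assumptions, not code from any converted driver:

	struct vhost_buf buf;
	unsigned int out, in;
	int ret;

	for (;;) {
		/* fetch one buffer as a packed out-then-in iovec array */
		ret = vhost_get_avail_buf(vq, &buf, vq->iov, ARRAY_SIZE(vq->iov),
					  &out, &in, NULL, NULL);
		if (ret <= 0)
			break; /* 0: ring empty; < 0: error */
		/* ... consume 'out' output and 'in' input iovecs ... */
		/* report buf.in_len written bytes, keyed by buf.id */
		vhost_put_used_buf(vq, &buf);
	}
	vhost_signal(dev, vq);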
drivers/vhost/vhost.c | 52 ++++++++++++++++++++++++++++++++++---------
drivers/vhost/vhost.h | 17 +++++++++++++-
2 files changed, 58 insertions(+), 11 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 506208b63126..e5763d81bf0f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2339,13 +2339,12 @@ static void unfetch_descs(struct vhost_virtqueue *vq)
* number of output then some number of input descriptors, it's actually two
* iovecs, but we pack them into one and note how many of each there were.
*
- * This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found. A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+ * This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
+int vhost_get_avail_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
{
int ret = fetch_descs(vq);
int i;
@@ -2358,6 +2357,8 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
*out_num = *in_num = 0;
if (unlikely(log))
*log_num = 0;
+ buf->in_len = buf->out_len = 0;
+ buf->descs = 0;
for (i = vq->first_desc; i < vq->ndescs; ++i) {
unsigned iov_count = *in_num + *out_num;
@@ -2387,6 +2388,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
/* If this is an input descriptor,
* increment that count. */
*in_num += ret;
+ buf->in_len += desc->len;
if (unlikely(log && ret)) {
log[*log_num].addr = desc->addr;
log[*log_num].len = desc->len;
@@ -2402,9 +2404,11 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
goto err;
}
*out_num += ret;
+ buf->out_len += desc->len;
}
- ret = desc->id;
+ buf->id = desc->id;
+ ++buf->descs;
if (!(desc->flags & VRING_DESC_F_NEXT))
break;
@@ -2412,14 +2416,22 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
vq->first_desc = i + 1;
- return ret;
+ return 1;
err:
unfetch_descs(vq);
return ret ? ret : vq->num;
}
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+EXPORT_SYMBOL_GPL(vhost_get_avail_buf);
+
+/* Reverse the effect of vhost_get_avail_buf. Useful for error handling. */
+void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
+ struct vhost_buf *buf, unsigned count)
+{
+ vhost_discard_vq_desc(vq, count);
+}
+EXPORT_SYMBOL_GPL(vhost_discard_avail_bufs);
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
@@ -2511,6 +2523,26 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
}
EXPORT_SYMBOL_GPL(vhost_add_used);
+int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
+{
+ return vhost_add_used(vq, buf->id, buf->in_len);
+}
+EXPORT_SYMBOL_GPL(vhost_put_used_buf);
+
+int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
+ struct vhost_buf *bufs, unsigned count)
+{
+ unsigned i;
+
+ for (i = 0; i < count; ++i) {
+ vq->heads[i].id = cpu_to_vhost32(vq, bufs[i].id);
+ vq->heads[i].len = cpu_to_vhost32(vq, bufs[i].in_len);
+ }
+
+ return vhost_add_used_n(vq, vq->heads, count);
+}
+EXPORT_SYMBOL_GPL(vhost_put_used_n_bufs);
+
static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
__u16 old, new;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index fed36af5c444..28eea0155efb 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -67,6 +67,13 @@ struct vhost_desc {
u16 id;
};
+struct vhost_buf {
+ u32 out_len;
+ u32 in_len;
+ u16 descs;
+ u16 id;
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -195,7 +202,12 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
-
+int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num);
+void vhost_discard_avail_bufs(struct vhost_virtqueue *,
+ struct vhost_buf *, unsigned count);
int vhost_vq_init_access(struct vhost_virtqueue *);
int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
@@ -204,6 +216,9 @@ void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
unsigned int id, int len);
void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
struct vring_used_elem *heads, unsigned count);
+int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
+int vhost_put_used_n_bufs(struct vhost_virtqueue *,
+ struct vhost_buf *bufs, unsigned count);
void vhost_signal(struct vhost_dev *, struct vhost_virtqueue *);
void vhost_disable_notify(struct vhost_dev *, struct vhost_virtqueue *);
bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
--
MST
Everyone's using the buf APIs, so there's no need for the head-based
ones anymore.
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 64 ++++++-------------------------------------
drivers/vhost/vhost.h | 12 --------
2 files changed, 8 insertions(+), 68 deletions(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 03e6bca02288..9096bd291c91 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2425,39 +2425,11 @@ EXPORT_SYMBOL_GPL(vhost_get_avail_buf);
void vhost_discard_avail_bufs(struct vhost_virtqueue *vq,
struct vhost_buf *buf, unsigned count)
{
- vhost_discard_vq_desc(vq, count);
+ unfetch_descs(vq);
+ vq->last_avail_idx -= count;
}
EXPORT_SYMBOL_GPL(vhost_discard_avail_bufs);
-/* This function returns the descriptor number found, or vq->num (which is
- * never a valid descriptor number) if none was found. A negative code is
- * returned on error. */
-int vhost_get_vq_desc(struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
-{
- struct vhost_buf buf;
- int ret = vhost_get_avail_buf(vq, &buf,
- iov, iov_size, out_num, in_num,
- log, log_num);
-
- if (likely(ret > 0))
- return buf->id;
- if (likely(!ret))
- return vq->num;
- return ret;
-}
-EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
-
-/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
-void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
-{
- unfetch_descs(vq);
- vq->last_avail_idx -= n;
-}
-EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
-
static int __vhost_add_used_n(struct vhost_virtqueue *vq,
struct vring_used_elem *heads,
unsigned count)
@@ -2490,8 +2462,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
return 0;
}
-/* After we've used one of their buffers, we tell them about it. We'll then
- * want to notify the guest, using eventfd. */
+static
int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
unsigned count)
{
@@ -2525,10 +2496,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
}
return r;
}
-EXPORT_SYMBOL_GPL(vhost_add_used_n);
-/* After we've used one of their buffers, we tell them about it. We'll then
- * want to notify the guest, using eventfd. */
+static
int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
{
struct vring_used_elem heads = {
@@ -2538,14 +2507,17 @@ int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
return vhost_add_used_n(vq, &heads, 1);
}
-EXPORT_SYMBOL_GPL(vhost_add_used);
+/* After we've used one of their buffers, we tell them about it. We'll then
+ * want to notify the guest, using vhost_signal. */
int vhost_put_used_buf(struct vhost_virtqueue *vq, struct vhost_buf *buf)
{
return vhost_add_used(vq, buf->id, buf->in_len);
}
EXPORT_SYMBOL_GPL(vhost_put_used_buf);
+/* After we've used one of their buffers, we tell them about it. We'll then
+ * want to notify the guest, using vhost_signal. */
int vhost_put_used_n_bufs(struct vhost_virtqueue *vq,
struct vhost_buf *bufs, unsigned count)
{
@@ -2606,26 +2578,6 @@ void vhost_signal(struct vhost_dev *dev, struct vhost_virtqueue *vq)
}
EXPORT_SYMBOL_GPL(vhost_signal);
-/* And here's the combo meal deal. Supersize me! */
-void vhost_add_used_and_signal(struct vhost_dev *dev,
- struct vhost_virtqueue *vq,
- unsigned int head, int len)
-{
- vhost_add_used(vq, head, len);
- vhost_signal(dev, vq);
-}
-EXPORT_SYMBOL_GPL(vhost_add_used_and_signal);
-
-/* multi-buffer version of vhost_add_used_and_signal */
-void vhost_add_used_and_signal_n(struct vhost_dev *dev,
- struct vhost_virtqueue *vq,
- struct vring_used_elem *heads, unsigned count)
-{
- vhost_add_used_n(vq, heads, count);
- vhost_signal(dev, vq);
-}
-EXPORT_SYMBOL_GPL(vhost_add_used_and_signal_n);
-
/* return true if we're sure that avaiable ring is empty */
bool vhost_vq_avail_empty(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 28eea0155efb..264a2a2fae97 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -197,11 +197,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
-int vhost_get_vq_desc(struct vhost_virtqueue *,
- struct iovec iov[], unsigned int iov_count,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num);
-void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
@@ -209,13 +204,6 @@ int vhost_get_avail_buf(struct vhost_virtqueue *, struct vhost_buf *buf,
void vhost_discard_avail_bufs(struct vhost_virtqueue *,
struct vhost_buf *, unsigned count);
int vhost_vq_init_access(struct vhost_virtqueue *);
-int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
- unsigned count);
-void vhost_add_used_and_signal(struct vhost_dev *, struct vhost_virtqueue *,
- unsigned int id, int len);
-void vhost_add_used_and_signal_n(struct vhost_dev *, struct vhost_virtqueue *,
- struct vring_used_elem *heads, unsigned count);
int vhost_put_used_buf(struct vhost_virtqueue *, struct vhost_buf *buf);
int vhost_put_used_n_bufs(struct vhost_virtqueue *,
struct vhost_buf *bufs, unsigned count);
--
MST
The idea is to support multiple ring formats by converting
to a format-independent array of descriptors.
This costs extra cycles, but we gain the ability
to fetch a batch of descriptors in one go, which
is good for code cache locality.
When used on its own, this causes a minor performance degradation;
the code has been kept as simple as possible for ease of review.
A follow-up patch gets us back the performance by adding batching.
To simplify benchmarking, I kept the old code around so one can switch
back and forth between the old and new code. This will go away in the
final submission.
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Eugenio Pérez <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
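In outline - a reading aid, not a literal excerpt; all names are from
the diff below:

	/* fetch_descs(): tight loop over the guest ring; each
	 * vring_desc (direct, or indirect via fetch_indirect_descs())
	 * is normalized by push_split_desc() into the
	 * format-independent vq->descs[] array of struct vhost_desc.
	 *
	 * vhost_get_vq_desc_batch(): walks vq->descs[] and calls
	 * translate_desc() to build the iovec arrays, exactly as the
	 * old one-pass vhost_get_vq_desc() did.
	 */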
drivers/vhost/vhost.c | 305 +++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.h | 16 +++
2 files changed, 320 insertions(+), 1 deletion(-)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 172da092107e..180b7b58c76b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -303,6 +303,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
struct vhost_virtqueue *vq)
{
vq->num = 1;
+ vq->ndescs = 0;
vq->desc = NULL;
vq->avail = NULL;
vq->used = NULL;
@@ -373,6 +374,9 @@ static int vhost_worker(void *data)
static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
{
+ kfree(vq->descs);
+ vq->descs = NULL;
+ vq->max_descs = 0;
kfree(vq->indirect);
vq->indirect = NULL;
kfree(vq->log);
@@ -389,6 +393,10 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
+ vq->max_descs = dev->iov_limit;
+ vq->descs = kmalloc_array(vq->max_descs,
+ sizeof(*vq->descs),
+ GFP_KERNEL);
vq->indirect = kmalloc_array(UIO_MAXIOV,
sizeof(*vq->indirect),
GFP_KERNEL);
@@ -396,7 +404,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
GFP_KERNEL);
vq->heads = kmalloc_array(dev->iov_limit, sizeof(*vq->heads),
GFP_KERNEL);
- if (!vq->indirect || !vq->log || !vq->heads)
+ if (!vq->indirect || !vq->log || !vq->heads || !vq->descs)
goto err_nomem;
}
return 0;
@@ -488,6 +496,8 @@ void vhost_dev_init(struct vhost_dev *dev,
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
+ vq->descs = NULL;
+ vq->max_descs = 0;
vq->log = NULL;
vq->indirect = NULL;
vq->heads = NULL;
@@ -2315,6 +2325,299 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
}
EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
+static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
+{
+ BUG_ON(!vq->ndescs);
+ return &vq->descs[vq->ndescs - 1];
+}
+
+static void pop_split_desc(struct vhost_virtqueue *vq)
+{
+ BUG_ON(!vq->ndescs);
+ --vq->ndescs;
+}
+
+#define VHOST_DESC_FLAGS (VRING_DESC_F_INDIRECT | VRING_DESC_F_WRITE | \
+ VRING_DESC_F_NEXT)
+static int push_split_desc(struct vhost_virtqueue *vq, struct vring_desc *desc, u16 id)
+{
+ struct vhost_desc *h;
+
+ if (unlikely(vq->ndescs >= vq->max_descs))
+ return -EINVAL;
+ h = &vq->descs[vq->ndescs++];
+ h->addr = vhost64_to_cpu(vq, desc->addr);
+ h->len = vhost32_to_cpu(vq, desc->len);
+ h->flags = vhost16_to_cpu(vq, desc->flags) & VHOST_DESC_FLAGS;
+ h->id = id;
+
+ return 0;
+}
+
+static int fetch_indirect_descs(struct vhost_virtqueue *vq,
+ struct vhost_desc *indirect,
+ u16 head)
+{
+ struct vring_desc desc;
+ unsigned int i = 0, count, found = 0;
+ u32 len = indirect->len;
+ struct iov_iter from;
+ int ret;
+
+ /* Sanity check */
+ if (unlikely(len % sizeof desc)) {
+ vq_err(vq, "Invalid length in indirect descriptor: "
+ "len 0x%llx not multiple of 0x%zx\n",
+ (unsigned long long)len,
+ sizeof desc);
+ return -EINVAL;
+ }
+
+ ret = translate_desc(vq, indirect->addr, len, vq->indirect,
+ UIO_MAXIOV, VHOST_ACCESS_RO);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d in indirect.\n", ret);
+ return ret;
+ }
+ iov_iter_init(&from, READ, vq->indirect, ret, len);
+
+ /* We will use the result as an address to read from, so most
+ * architectures only need a compiler barrier here. */
+ read_barrier_depends();
+
+ count = len / sizeof desc;
+ /* Buffers are chained via a 16 bit next field, so
+ * we can have at most 2^16 of these. */
+ if (unlikely(count > USHRT_MAX + 1)) {
+ vq_err(vq, "Indirect buffer length too big: %d\n",
+ indirect->len);
+ return -E2BIG;
+ }
+ if (unlikely(vq->ndescs + count > vq->max_descs)) {
+ vq_err(vq, "Too many indirect + direct descs: %d + %d\n",
+ vq->ndescs, indirect->len);
+ return -E2BIG;
+ }
+
+ do {
+ if (unlikely(++found > count)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "indirect size %u\n",
+ i, count);
+ return -EINVAL;
+ }
+ if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
+ vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
+ i, (size_t)indirect->addr + i * sizeof desc);
+ return -EINVAL;
+ }
+ if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
+ vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
+ i, (size_t)indirect->addr + i * sizeof desc);
+ return -EINVAL;
+ }
+
+ /* Note: push_split_desc can't fail here:
+ * we never fetch unless there's space. */
+ ret = push_split_desc(vq, &desc, head);
+ WARN_ON(ret);
+ } while ((i = next_desc(vq, &desc)) != -1);
+ return 0;
+}
+
+/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
+ * A negative code is returned on error. */
+static int fetch_descs(struct vhost_virtqueue *vq)
+{
+ unsigned int i, head, found = 0;
+ struct vhost_desc *last;
+ struct vring_desc desc;
+ __virtio16 avail_idx;
+ __virtio16 ring_head;
+ u16 last_avail_idx;
+ int ret;
+
+ /* Check it isn't doing very strange things with descriptor numbers. */
+ last_avail_idx = vq->last_avail_idx;
+
+ if (vq->avail_idx == vq->last_avail_idx) {
+ if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
+ vq_err(vq, "Failed to access avail idx at %p\n",
+ &vq->avail->idx);
+ return -EFAULT;
+ }
+ vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
+
+ if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
+ vq_err(vq, "Guest moved used index from %u to %u",
+ last_avail_idx, vq->avail_idx);
+ return -EFAULT;
+ }
+
+ /* If there's nothing new since last we looked, return
+ * invalid.
+ */
+ if (vq->avail_idx == last_avail_idx)
+ return 0;
+
+ /* Only get avail ring entries after they have been
+ * exposed by guest.
+ */
+ smp_rmb();
+ }
+
+ /* Grab the next descriptor number they're advertising */
+ if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
+ vq_err(vq, "Failed to read head: idx %d address %p\n",
+ last_avail_idx,
+ &vq->avail->ring[last_avail_idx % vq->num]);
+ return -EFAULT;
+ }
+
+ head = vhost16_to_cpu(vq, ring_head);
+
+ /* If their number is silly, that's an error. */
+ if (unlikely(head >= vq->num)) {
+ vq_err(vq, "Guest says index %u > %u is available",
+ head, vq->num);
+ return -EINVAL;
+ }
+
+ i = head;
+ do {
+ if (unlikely(i >= vq->num)) {
+ vq_err(vq, "Desc index is %u > %u, head = %u",
+ i, vq->num, head);
+ return -EINVAL;
+ }
+ if (unlikely(++found > vq->num)) {
+ vq_err(vq, "Loop detected: last one at %u "
+ "vq size %u head %u\n",
+ i, vq->num, head);
+ return -EINVAL;
+ }
+ ret = vhost_get_desc(vq, &desc, i);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
+ i, vq->desc + i);
+ return -EFAULT;
+ }
+ ret = push_split_desc(vq, &desc, head);
+ if (unlikely(ret)) {
+ vq_err(vq, "Failed to save descriptor: idx %d\n", i);
+ return -EINVAL;
+ }
+ } while ((i = next_desc(vq, &desc)) != -1);
+
+ last = peek_split_desc(vq);
+ if (unlikely(last->flags & VRING_DESC_F_INDIRECT)) {
+ pop_split_desc(vq);
+ ret = fetch_indirect_descs(vq, last, head);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Failure detected "
+ "in indirect descriptor at idx %d\n", head);
+ return ret;
+ }
+ }
+
+ /* Assume notifications from guest are disabled at this point,
+ * if they aren't we would need to update avail_event index. */
+ BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
+
+ /* On success, increment avail index. */
+ vq->last_avail_idx++;
+
+ return 1;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which is
+ * never a valid descriptor number) if none was found. A negative code is
+ * returned on error. */
+int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ int ret = fetch_descs(vq);
+ int i;
+
+ if (ret <= 0)
+ goto err_fetch;
+
+ /* Now convert to IOV */
+ /* When we start there are none of either input nor output. */
+ *out_num = *in_num = 0;
+ if (unlikely(log))
+ *log_num = 0;
+
+ for (i = 0; i < vq->ndescs; ++i) {
+ unsigned iov_count = *in_num + *out_num;
+ struct vhost_desc *desc = &vq->descs[i];
+ int access;
+
+ if (desc->flags & ~VHOST_DESC_FLAGS) {
+ vq_err(vq, "Unexpected flags: 0x%x at descriptor id 0x%x\n",
+ desc->flags, desc->id);
+ ret = -EINVAL;
+ goto err;
+ }
+ if (desc->flags & VRING_DESC_F_WRITE)
+ access = VHOST_ACCESS_WO;
+ else
+ access = VHOST_ACCESS_RO;
+ ret = translate_desc(vq, desc->addr,
+ desc->len, iov + iov_count,
+ iov_size - iov_count, access);
+ if (unlikely(ret < 0)) {
+ if (ret != -EAGAIN)
+ vq_err(vq, "Translation failure %d descriptor idx %d\n",
+ ret, i);
+ goto err;
+ }
+ if (access == VHOST_ACCESS_WO) {
+ /* If this is an input descriptor,
+ * increment that count. */
+ *in_num += ret;
+ if (unlikely(log && ret)) {
+ log[*log_num].addr = desc->addr;
+ log[*log_num].len = desc->len;
+ ++*log_num;
+ }
+ } else {
+ /* If it's an output descriptor, they're all supposed
+ * to come before any input descriptors. */
+ if (unlikely(*in_num)) {
+ vq_err(vq, "Descriptor has out after in: "
+ "idx %d\n", i);
+ ret = -EINVAL;
+ goto err;
+ }
+ *out_num += ret;
+ }
+
+ ret = desc->id;
+ }
+
+ vq->ndescs = 0;
+
+ return ret;
+
+err:
+ vhost_discard_vq_desc(vq, 1);
+err_fetch:
+ vq->ndescs = 0;
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
+
/* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
{
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index c8e96a095d3b..87089d51490d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -60,6 +60,13 @@ enum vhost_uaddr_type {
VHOST_NUM_ADDRS = 3,
};
+struct vhost_desc {
+ u64 addr;
+ u32 len;
+ u16 flags; /* VRING_DESC_F_WRITE, VRING_DESC_F_NEXT */
+ u16 id;
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -71,6 +78,11 @@ struct vhost_virtqueue {
vring_avail_t __user *avail;
vring_used_t __user *used;
const struct vhost_iotlb_map *meta_iotlb[VHOST_NUM_ADDRS];
+
+ struct vhost_desc *descs;
+ int ndescs;
+ int max_descs;
+
struct file *kick;
struct eventfd_ctx *call_ctx;
struct eventfd_ctx *error_ctx;
@@ -177,6 +189,10 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
bool vhost_log_access_ok(struct vhost_dev *);
+int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num);
int vhost_get_vq_desc(struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
--
MST
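For reference, here is a minimal, hypothetical sketch of a handle_kick-style caller of the new batched API. The names my_handle_vq() and my_consume() are made up for illustration, locking and most error handling are stripped, and it assumes the documented contract that a return value of vq->num means no buffer is currently available:

static void my_handle_vq(struct vhost_virtqueue *vq)
{
	unsigned int out, in;
	int head;

	for (;;) {
		/* Refills vq->descs in a batch when it runs empty, then
		 * converts the next descriptor chain into vq->iov. */
		head = vhost_get_vq_desc_batch(vq, vq->iov,
					       ARRAY_SIZE(vq->iov),
					       &out, &in, NULL, NULL);
		if (head < 0)
			break;	/* error; most paths log via vq_err() */
		if (head == vq->num)
			break;	/* ring empty, nothing to do for now */

		my_consume(vq->iov, out, in);	/* hypothetical backend work */
		vhost_add_used(vq, head, 0);
	}
	vhost_signal(vq->dev, vq);
}

The only caller-visible difference from vhost_get_vq_desc() is the function name; the batching into vq->descs happens internally, in fetch_descs().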
On Wed, 2020-06-10 at 07:36 -0400, Michael S. Tsirkin wrote:
> As testing shows no performance change, switch to that now.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Eugenio Pérez <[email protected]>
> Link: https://lore.kernel.org/r/[email protected]
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> drivers/vhost/test.c | 2 +-
> drivers/vhost/vhost.c | 318 ++++++++----------------------------------
> drivers/vhost/vhost.h | 7 +-
> 3 files changed, 65 insertions(+), 262 deletions(-)
>
> diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> index 0466921f4772..7d69778aaa26 100644
> --- a/drivers/vhost/test.c
> +++ b/drivers/vhost/test.c
> @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> dev = &n->dev;
> vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> - vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> + vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
>
> f->private_data = n;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 11433d709651..28f324fd77df 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> {
> vq->num = 1;
> vq->ndescs = 0;
> + vq->first_desc = 0;
> vq->desc = NULL;
> vq->avail = NULL;
> vq->used = NULL;
> @@ -372,6 +373,11 @@ static int vhost_worker(void *data)
> return 0;
> }
>
> +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> +{
> + return vq->max_descs - UIO_MAXIOV;
> +}
> +
> static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> {
> kfree(vq->descs);
> @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> vq->max_descs = dev->iov_limit;
> + if (vhost_vq_num_batch_descs(vq) < 0) {
> + return -EINVAL;
> + }
> vq->descs = kmalloc_array(vq->max_descs,
> sizeof(*vq->descs),
> GFP_KERNEL);
> @@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> vq->last_avail_idx = s.num;
> /* Forget the cached index value. */
> vq->avail_idx = vq->last_avail_idx;
> + vq->ndescs = vq->first_desc = 0;
This is not needed if it is done in vhost_vq_set_backend, as far as I can tell.
Actually, maybe it is even better to move the `vq->avail_idx = vq->last_avail_idx;` line to vhost_vq_set_backend, since it is part
of the backend "set up" procedure, isn't it?
I tested with virtio_test + batch tests sent in
https://lkml.kernel.org/lkml/[email protected]/T/.
I append here what I'm proposing in case it is clearer this way.
Thanks!
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 4d198994e7be..809ad2cd2879 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1617,9 +1617,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
break;
}
vq->last_avail_idx = s.num;
- /* Forget the cached index value. */
- vq->avail_idx = vq->last_avail_idx;
- vq->ndescs = vq->first_desc = 0;
break;
case VHOST_GET_VRING_BASE:
s.index = idx;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index fed36af5c444..f4902dc808e4 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -258,6 +258,7 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
void *private_data)
{
vq->private_data = private_data;
+ vq->avail_idx = vq->last_avail_idx;
vq->ndescs = 0;
vq->first_desc = 0;
}
> break;
> case VHOST_GET_VRING_BASE:
> s.index = idx;
> @@ -2078,253 +2088,6 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
> return next;
> }
>
> -static int get_indirect(struct vhost_virtqueue *vq,
> - struct iovec iov[], unsigned int iov_size,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num,
> - struct vring_desc *indirect)
> -{
> - struct vring_desc desc;
> - unsigned int i = 0, count, found = 0;
> - u32 len = vhost32_to_cpu(vq, indirect->len);
> - struct iov_iter from;
> - int ret, access;
> -
> - /* Sanity check */
> - if (unlikely(len % sizeof desc)) {
> - vq_err(vq, "Invalid length in indirect descriptor: "
> - "len 0x%llx not multiple of 0x%zx\n",
> - (unsigned long long)len,
> - sizeof desc);
> - return -EINVAL;
> - }
> -
> - ret = translate_desc(vq, vhost64_to_cpu(vq, indirect->addr), len, vq->indirect,
> - UIO_MAXIOV, VHOST_ACCESS_RO);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d in indirect.\n", ret);
> - return ret;
> - }
> - iov_iter_init(&from, READ, vq->indirect, ret, len);
> -
> - /* We will use the result as an address to read from, so most
> - * architectures only need a compiler barrier here. */
> - read_barrier_depends();
> -
> - count = len / sizeof desc;
> - /* Buffers are chained via a 16 bit next field, so
> - * we can have at most 2^16 of these. */
> - if (unlikely(count > USHRT_MAX + 1)) {
> - vq_err(vq, "Indirect buffer length too big: %d\n",
> - indirect->len);
> - return -E2BIG;
> - }
> -
> - do {
> - unsigned iov_count = *in_num + *out_num;
> - if (unlikely(++found > count)) {
> - vq_err(vq, "Loop detected: last one at %u "
> - "indirect size %u\n",
> - i, count);
> - return -EINVAL;
> - }
> - if (unlikely(!copy_from_iter_full(&desc, sizeof(desc), &from))) {
> - vq_err(vq, "Failed indirect descriptor: idx %d, %zx\n",
> - i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> - return -EINVAL;
> - }
> - if (unlikely(desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT))) {
> - vq_err(vq, "Nested indirect descriptor: idx %d, %zx\n",
> - i, (size_t)vhost64_to_cpu(vq, indirect->addr) + i * sizeof desc);
> - return -EINVAL;
> - }
> -
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
> - access = VHOST_ACCESS_WO;
> - else
> - access = VHOST_ACCESS_RO;
> -
> - ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
> - vhost32_to_cpu(vq, desc.len), iov + iov_count,
> - iov_size - iov_count, access);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d indirect idx %d\n",
> - ret, i);
> - return ret;
> - }
> - /* If this is an input descriptor, increment that count. */
> - if (access == VHOST_ACCESS_WO) {
> - *in_num += ret;
> - if (unlikely(log && ret)) {
> - log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
> - log[*log_num].len = vhost32_to_cpu(vq, desc.len);
> - ++*log_num;
> - }
> - } else {
> - /* If it's an output descriptor, they're all supposed
> - * to come before any input descriptors. */
> - if (unlikely(*in_num)) {
> - vq_err(vq, "Indirect descriptor "
> - "has out after in: idx %d\n", i);
> - return -EINVAL;
> - }
> - *out_num += ret;
> - }
> - } while ((i = next_desc(vq, &desc)) != -1);
> - return 0;
> -}
> -
> -/* This looks in the virtqueue and for the first available buffer, and converts
> - * it to an iovec for convenient access. Since descriptors consist of some
> - * number of output then some number of input descriptors, it's actually two
> - * iovecs, but we pack them into one and note how many of each there were.
> - *
> - * This function returns the descriptor number found, or vq->num (which is
> - * never a valid descriptor number) if none was found. A negative code is
> - * returned on error. */
> -int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> - struct iovec iov[], unsigned int iov_size,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num)
> -{
> - struct vring_desc desc;
> - unsigned int i, head, found = 0;
> - u16 last_avail_idx;
> - __virtio16 avail_idx;
> - __virtio16 ring_head;
> - int ret, access;
> -
> - /* Check it isn't doing very strange things with descriptor numbers. */
> - last_avail_idx = vq->last_avail_idx;
> -
> - if (vq->avail_idx == vq->last_avail_idx) {
> - if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> - vq_err(vq, "Failed to access avail idx at %p\n",
> - &vq->avail->idx);
> - return -EFAULT;
> - }
> - vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> -
> - if (unlikely((u16)(vq->avail_idx - last_avail_idx) > vq->num)) {
> - vq_err(vq, "Guest moved used index from %u to %u",
> - last_avail_idx, vq->avail_idx);
> - return -EFAULT;
> - }
> -
> - /* If there's nothing new since last we looked, return
> - * invalid.
> - */
> - if (vq->avail_idx == last_avail_idx)
> - return vq->num;
> -
> - /* Only get avail ring entries after they have been
> - * exposed by guest.
> - */
> - smp_rmb();
> - }
> -
> - /* Grab the next descriptor number they're advertising, and increment
> - * the index we've seen. */
> - if (unlikely(vhost_get_avail_head(vq, &ring_head, last_avail_idx))) {
> - vq_err(vq, "Failed to read head: idx %d address %p\n",
> - last_avail_idx,
> - &vq->avail->ring[last_avail_idx % vq->num]);
> - return -EFAULT;
> - }
> -
> - head = vhost16_to_cpu(vq, ring_head);
> -
> - /* If their number is silly, that's an error. */
> - if (unlikely(head >= vq->num)) {
> - vq_err(vq, "Guest says index %u > %u is available",
> - head, vq->num);
> - return -EINVAL;
> - }
> -
> - /* When we start there are none of either input nor output. */
> - *out_num = *in_num = 0;
> - if (unlikely(log))
> - *log_num = 0;
> -
> - i = head;
> - do {
> - unsigned iov_count = *in_num + *out_num;
> - if (unlikely(i >= vq->num)) {
> - vq_err(vq, "Desc index is %u > %u, head = %u",
> - i, vq->num, head);
> - return -EINVAL;
> - }
> - if (unlikely(++found > vq->num)) {
> - vq_err(vq, "Loop detected: last one at %u "
> - "vq size %u head %u\n",
> - i, vq->num, head);
> - return -EINVAL;
> - }
> - ret = vhost_get_desc(vq, &desc, i);
> - if (unlikely(ret)) {
> - vq_err(vq, "Failed to get descriptor: idx %d addr %p\n",
> - i, vq->desc + i);
> - return -EFAULT;
> - }
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_INDIRECT)) {
> - ret = get_indirect(vq, iov, iov_size,
> - out_num, in_num,
> - log, log_num, &desc);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Failure detected "
> - "in indirect descriptor at idx %d\n", i);
> - return ret;
> - }
> - continue;
> - }
> -
> - if (desc.flags & cpu_to_vhost16(vq, VRING_DESC_F_WRITE))
> - access = VHOST_ACCESS_WO;
> - else
> - access = VHOST_ACCESS_RO;
> - ret = translate_desc(vq, vhost64_to_cpu(vq, desc.addr),
> - vhost32_to_cpu(vq, desc.len), iov + iov_count,
> - iov_size - iov_count, access);
> - if (unlikely(ret < 0)) {
> - if (ret != -EAGAIN)
> - vq_err(vq, "Translation failure %d descriptor idx %d\n",
> - ret, i);
> - return ret;
> - }
> - if (access == VHOST_ACCESS_WO) {
> - /* If this is an input descriptor,
> - * increment that count. */
> - *in_num += ret;
> - if (unlikely(log && ret)) {
> - log[*log_num].addr = vhost64_to_cpu(vq, desc.addr);
> - log[*log_num].len = vhost32_to_cpu(vq, desc.len);
> - ++*log_num;
> - }
> - } else {
> - /* If it's an output descriptor, they're all supposed
> - * to come before any input descriptors. */
> - if (unlikely(*in_num)) {
> - vq_err(vq, "Descriptor has out after in: "
> - "idx %d\n", i);
> - return -EINVAL;
> - }
> - *out_num += ret;
> - }
> - } while ((i = next_desc(vq, &desc)) != -1);
> -
> - /* On success, increment avail index. */
> - vq->last_avail_idx++;
> -
> - /* Assume notifications from guest are disabled at this point,
> - * if they aren't we would need to update avail_event index. */
> - BUG_ON(!(vq->used_flags & VRING_USED_F_NO_NOTIFY));
> - return head;
> -}
> -EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
> -
> static struct vhost_desc *peek_split_desc(struct vhost_virtqueue *vq)
> {
> BUG_ON(!vq->ndescs);
> @@ -2428,7 +2191,7 @@ static int fetch_indirect_descs(struct vhost_virtqueue *vq,
>
> /* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> * A negative code is returned on error. */
> -static int fetch_descs(struct vhost_virtqueue *vq)
> +static int fetch_buf(struct vhost_virtqueue *vq)
> {
> unsigned int i, head, found = 0;
> struct vhost_desc *last;
> @@ -2441,7 +2204,11 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> /* Check it isn't doing very strange things with descriptor numbers. */
> last_avail_idx = vq->last_avail_idx;
>
> - if (vq->avail_idx == vq->last_avail_idx) {
> + if (unlikely(vq->avail_idx == vq->last_avail_idx)) {
> + /* If we already have work to do, don't bother re-checking. */
> + if (likely(vq->ndescs))
> + return 1;
> +
> if (unlikely(vhost_get_avail_idx(vq, &avail_idx))) {
> vq_err(vq, "Failed to access avail idx at %p\n",
> &vq->avail->idx);
> @@ -2532,6 +2299,41 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> return 1;
> }
>
> +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> + * A negative code is returned on error. */
> +static int fetch_descs(struct vhost_virtqueue *vq)
> +{
> + int ret;
> +
> + if (unlikely(vq->first_desc >= vq->ndescs)) {
> + vq->first_desc = 0;
> + vq->ndescs = 0;
> + }
> +
> + if (vq->ndescs)
> + return 1;
> +
> + for (ret = 1;
> + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> + ret = fetch_buf(vq))
> + ;
> +
> + /* On success we expect some descs */
> + BUG_ON(ret > 0 && !vq->ndescs);
> + return ret;
> +}
> +
> +/* Reverse the effects of fetch_descs */
> +static void unfetch_descs(struct vhost_virtqueue *vq)
> +{
> + int i;
> +
> + for (i = vq->first_desc; i < vq->ndescs; ++i)
> + if (!(vq->descs[i].flags & VRING_DESC_F_NEXT))
> + vq->last_avail_idx -= 1;
> + vq->ndescs = 0;
> +}
> +
> /* This looks in the virtqueue and for the first available buffer, and converts
> * it to an iovec for convenient access. Since descriptors consist of some
> * number of output then some number of input descriptors, it's actually two
> @@ -2540,7 +2342,7 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> * This function returns the descriptor number found, or vq->num (which is
> * never a valid descriptor number) if none was found. A negative code is
> * returned on error. */
> -int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> +int vhost_get_vq_desc(struct vhost_virtqueue *vq,
> struct iovec iov[], unsigned int iov_size,
> unsigned int *out_num, unsigned int *in_num,
> struct vhost_log *log, unsigned int *log_num)
> @@ -2549,7 +2351,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> int i;
>
> if (ret <= 0)
> - goto err_fetch;
> + goto err;
>
> /* Now convert to IOV */
> /* When we start there are none of either input nor output. */
> @@ -2557,7 +2359,7 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> if (unlikely(log))
> *log_num = 0;
>
> - for (i = 0; i < vq->ndescs; ++i) {
> + for (i = vq->first_desc; i < vq->ndescs; ++i) {
> unsigned iov_count = *in_num + *out_num;
> struct vhost_desc *desc = &vq->descs[i];
> int access;
> @@ -2603,24 +2405,26 @@ int vhost_get_vq_desc_batch(struct vhost_virtqueue *vq,
> }
>
> ret = desc->id;
> +
> + if (!(desc->flags & VRING_DESC_F_NEXT))
> + break;
> }
>
> - vq->ndescs = 0;
> + vq->first_desc = i + 1;
>
> return ret;
>
> err:
> - vhost_discard_vq_desc(vq, 1);
> -err_fetch:
> - vq->ndescs = 0;
> + unfetch_descs(vq);
>
> return ret ? ret : vq->num;
> }
> -EXPORT_SYMBOL_GPL(vhost_get_vq_desc_batch);
> +EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
>
> /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
> void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
> {
> + unfetch_descs(vq);
> vq->last_avail_idx -= n;
> }
> EXPORT_SYMBOL_GPL(vhost_discard_vq_desc);
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 87089d51490d..fed36af5c444 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -81,6 +81,7 @@ struct vhost_virtqueue {
>
> struct vhost_desc *descs;
> int ndescs;
> + int first_desc;
> int max_descs;
>
> struct file *kick;
> @@ -189,10 +190,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> bool vhost_log_access_ok(struct vhost_dev *);
>
> -int vhost_get_vq_desc_batch(struct vhost_virtqueue *,
> - struct iovec iov[], unsigned int iov_count,
> - unsigned int *out_num, unsigned int *in_num,
> - struct vhost_log *log, unsigned int *log_num);
> int vhost_get_vq_desc(struct vhost_virtqueue *,
> struct iovec iov[], unsigned int iov_count,
> unsigned int *out_num, unsigned int *in_num,
> @@ -261,6 +258,8 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
> void *private_data)
> {
> vq->private_data = private_data;
> + vq->ndescs = 0;
> + vq->first_desc = 0;
> }
>
> /**
On Wed, Jun 10, 2020 at 5:13 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > > + * A negative code is returned on error. */
> > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > +{
> > > + int ret;
> > > +
> > > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > > + vq->first_desc = 0;
> > > + vq->ndescs = 0;
> > > + }
> > > +
> > > + if (vq->ndescs)
> > > + return 1;
> > > +
> > > + for (ret = 1;
> > > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > > + ret = fetch_buf(vq))
> > > + ;
> >
> > (Expanding comment in V6):
> >
> > We get an infinite loop this way:
> > * vq->ndescs == 0, so we call fetch_buf() here
> > * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> > * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> > last_avail_vq), so it just returns 1
>
> That's what
> [PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
> is supposed to fix.
>
Sorry, I forgot to include that fixup.
With it I don't see CPU stalls, but with that version latency has
increased a lot and I see packet loss:
+ ping -c 5 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
From 10.200.0.2 icmp_seq=1 Destination Host Unreachable
From 10.200.0.2 icmp_seq=2 Destination Host Unreachable
From 10.200.0.2 icmp_seq=3 Destination Host Unreachable
64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=6848 ms
--- 10.200.0.1 ping statistics ---
5 packets transmitted, 1 received, +3 errors, 80% packet loss, time 76ms
rtt min/avg/max/mdev = 6848.316/6848.316/6848.316/0.000 ms, pipe 4
--
I cannot even use netperf.
If I modify with my proposed version:
+ ping -c 5 10.200.0.1
PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
64 bytes from 10.200.0.1: icmp_seq=1 ttl=64 time=7.07 ms
64 bytes from 10.200.0.1: icmp_seq=2 ttl=64 time=0.358 ms
64 bytes from 10.200.0.1: icmp_seq=3 ttl=64 time=5.35 ms
64 bytes from 10.200.0.1: icmp_seq=4 ttl=64 time=2.27 ms
64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=0.426 ms
[root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t TCP_STREAM
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.200.0.1 () port 0 AF_INET
Recv Send Send
Socket Socket Message Elapsed
Size Size Size Time Throughput
bytes bytes bytes secs. 10^6bits/sec
131072 16384 16384 10.01 4742.36
[root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t UDP_STREAM
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.200.0.1 () port 0 AF_INET
Socket Message Elapsed Messages
Size Size Time Okay Errors Throughput
bytes bytes secs # # 10^6bits/sec
212992 65507 10.00 9214 0 482.83
212992 10.00 9214 482.83
I will compare with the non-batch version for reference, but the
difference between the two is noticeable. Maybe it's worth finding a
good value for the if() inside fetch_buf?
Thanks!
> --
> MST
>
On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > + * A negative code is returned on error. */
> > +static int fetch_descs(struct vhost_virtqueue *vq)
> > +{
> > + int ret;
> > +
> > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > + vq->first_desc = 0;
> > + vq->ndescs = 0;
> > + }
> > +
> > + if (vq->ndescs)
> > + return 1;
> > +
> > + for (ret = 1;
> > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > + ret = fetch_buf(vq))
> > + ;
>
> (Expanding comment in V6):
>
> We get an infinite loop this way:
> * vq->ndescs == 0, so we call fetch_buf() here
> * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> last_avail_vq), so it just returns 1
That's what
[PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
is supposed to fix.
--
MST
On Wed, Jun 10, 2020 at 5:08 PM Michael S. Tsirkin <[email protected]> wrote:
>
> On Wed, Jun 10, 2020 at 04:29:29PM +0200, Eugenio Pérez wrote:
> > On Wed, 2020-06-10 at 07:36 -0400, Michael S. Tsirkin wrote:
> > > As testing shows no performance change, switch to that now.
> > >
> > > Signed-off-by: Michael S. Tsirkin <[email protected]>
> > > Signed-off-by: Eugenio Pérez <[email protected]>
> > > Link: https://lore.kernel.org/r/[email protected]
> > > Signed-off-by: Michael S. Tsirkin <[email protected]>
> > > ---
> > > drivers/vhost/test.c | 2 +-
> > > drivers/vhost/vhost.c | 318 ++++++++----------------------------------
> > > drivers/vhost/vhost.h | 7 +-
> > > 3 files changed, 65 insertions(+), 262 deletions(-)
> > >
> > > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > > index 0466921f4772..7d69778aaa26 100644
> > > --- a/drivers/vhost/test.c
> > > +++ b/drivers/vhost/test.c
> > > @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> > > dev = &n->dev;
> > > vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> > > n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> > > - vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> > > + vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> > > VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
> > >
> > > f->private_data = n;
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 11433d709651..28f324fd77df 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > > {
> > > vq->num = 1;
> > > vq->ndescs = 0;
> > > + vq->first_desc = 0;
> > > vq->desc = NULL;
> > > vq->avail = NULL;
> > > vq->used = NULL;
> > > @@ -372,6 +373,11 @@ static int vhost_worker(void *data)
> > > return 0;
> > > }
> > >
> > > +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> > > +{
> > > + return vq->max_descs - UIO_MAXIOV;
> > > +}
> > > +
> > > static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> > > {
> > > kfree(vq->descs);
> > > @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> > > for (i = 0; i < dev->nvqs; ++i) {
> > > vq = dev->vqs[i];
> > > vq->max_descs = dev->iov_limit;
> > > + if (vhost_vq_num_batch_descs(vq) < 0) {
> > > + return -EINVAL;
> > > + }
> > > vq->descs = kmalloc_array(vq->max_descs,
> > > sizeof(*vq->descs),
> > > GFP_KERNEL);
> > > @@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> > > vq->last_avail_idx = s.num;
> > > /* Forget the cached index value. */
> > > vq->avail_idx = vq->last_avail_idx;
> > > + vq->ndescs = vq->first_desc = 0;
> >
> > This is not needed if it is done in vhost_vq_set_backend, as far as I can tell.
> >
> > Actually, maybe it is even better to move the `vq->avail_idx = vq->last_avail_idx;` line to vhost_vq_set_backend, since it is part
> > of the backend "set up" procedure, isn't it?
> >
> > I tested with virtio_test + batch tests sent in
> > https://lkml.kernel.org/lkml/[email protected]/T/.
>
> Ow did I forget to merge them for rc1? Should I have? Maybe Linus won't
> yell too hard at me if I merge them after rc1.
>
>
> > I append here what I'm proposing in case it is clearer this way.
> >
> > Thanks!
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 4d198994e7be..809ad2cd2879 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1617,9 +1617,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> > break;
> > }
> > vq->last_avail_idx = s.num;
> > - /* Forget the cached index value. */
> > - vq->avail_idx = vq->last_avail_idx;
> > - vq->ndescs = vq->first_desc = 0;
> > break;
> > case VHOST_GET_VRING_BASE:
> > s.index = idx;
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index fed36af5c444..f4902dc808e4 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -258,6 +258,7 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
> > void *private_data)
> > {
> > vq->private_data = private_data;
> > + vq->avail_idx = vq->last_avail_idx;
> > vq->ndescs = 0;
> > vq->first_desc = 0;
> > }
> >
>
> Seems like a nice cleanup, though it's harmless, right?
>
Fields ndescs and first_desc are supposed to be updated outside; that
was the intention, but maybe I forgot to delete it here, not sure.
Regarding avail_idx, the whole change has been tested with vhost_test,
and vhost already needs to have a backend before the index can be
modified, so it seems safe to me.
On Wed, Jun 10, 2020 at 04:29:29PM +0200, Eugenio Pérez wrote:
> On Wed, 2020-06-10 at 07:36 -0400, Michael S. Tsirkin wrote:
> > As testing shows no performance change, switch to that now.
> >
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
> > Signed-off-by: Eugenio Pérez <[email protected]>
> > Link: https://lore.kernel.org/r/[email protected]
> > Signed-off-by: Michael S. Tsirkin <[email protected]>
> > ---
> > drivers/vhost/test.c | 2 +-
> > drivers/vhost/vhost.c | 318 ++++++++----------------------------------
> > drivers/vhost/vhost.h | 7 +-
> > 3 files changed, 65 insertions(+), 262 deletions(-)
> >
> > diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
> > index 0466921f4772..7d69778aaa26 100644
> > --- a/drivers/vhost/test.c
> > +++ b/drivers/vhost/test.c
> > @@ -119,7 +119,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
> > dev = &n->dev;
> > vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
> > n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
> > - vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
> > + vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV + 64,
> > VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
> >
> > f->private_data = n;
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 11433d709651..28f324fd77df 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -304,6 +304,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
> > {
> > vq->num = 1;
> > vq->ndescs = 0;
> > + vq->first_desc = 0;
> > vq->desc = NULL;
> > vq->avail = NULL;
> > vq->used = NULL;
> > @@ -372,6 +373,11 @@ static int vhost_worker(void *data)
> > return 0;
> > }
> >
> > +static int vhost_vq_num_batch_descs(struct vhost_virtqueue *vq)
> > +{
> > + return vq->max_descs - UIO_MAXIOV;
> > +}
> > +
> > static void vhost_vq_free_iovecs(struct vhost_virtqueue *vq)
> > {
> > kfree(vq->descs);
> > @@ -394,6 +400,9 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
> > for (i = 0; i < dev->nvqs; ++i) {
> > vq = dev->vqs[i];
> > vq->max_descs = dev->iov_limit;
> > + if (vhost_vq_num_batch_descs(vq) < 0) {
> > + return -EINVAL;
> > + }
> > vq->descs = kmalloc_array(vq->max_descs,
> > sizeof(*vq->descs),
> > GFP_KERNEL);
> > @@ -1610,6 +1619,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> > vq->last_avail_idx = s.num;
> > /* Forget the cached index value. */
> > vq->avail_idx = vq->last_avail_idx;
> > + vq->ndescs = vq->first_desc = 0;
>
> This is not needed if it is done in vhost_vq_set_backend, as far as I can tell.
>
> Actually, maybe it is even better to move the `vq->avail_idx = vq->last_avail_idx;` line to vhost_vq_set_backend, since it is part
> of the backend "set up" procedure, isn't it?
>
> I tested with virtio_test + batch tests sent in
> https://lkml.kernel.org/lkml/[email protected]/T/.
Ow did I forget to merge them for rc1? Should I have? Maybe Linus won't
yell too hard at me if I merge them after rc1.
> I append here what I'm proposing in case it is clearer this way.
>
> Thanks!
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 4d198994e7be..809ad2cd2879 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1617,9 +1617,6 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> break;
> }
> vq->last_avail_idx = s.num;
> - /* Forget the cached index value. */
> - vq->avail_idx = vq->last_avail_idx;
> - vq->ndescs = vq->first_desc = 0;
> break;
> case VHOST_GET_VRING_BASE:
> s.index = idx;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index fed36af5c444..f4902dc808e4 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -258,6 +258,7 @@ static inline void vhost_vq_set_backend(struct vhost_virtqueue *vq,
> void *private_data)
> {
> vq->private_data = private_data;
> + vq->avail_idx = vq->last_avail_idx;
> vq->ndescs = 0;
> vq->first_desc = 0;
> }
>
Seems like a nice cleanup, though it's harmless, right?
--
MST
On Wed, 2020-06-10 at 07:36 -0400, Michael S. Tsirkin wrote:
> In preparation for further cleanup, pass net specific pointer
> to ubuf callbacks so we can move net specific fields
> out to net structures.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> drivers/vhost/net.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index bf5e1d81ae25..ff594eec8ae3 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -94,7 +94,7 @@ struct vhost_net_ubuf_ref {
> */
> atomic_t refcount;
> wait_queue_head_t wait;
> - struct vhost_virtqueue *vq;
> + struct vhost_net_virtqueue *nvq;
> };
>
> #define VHOST_NET_BATCH 64
> @@ -231,7 +231,7 @@ static void vhost_net_enable_zcopy(int vq)
> }
>
> static struct vhost_net_ubuf_ref *
> -vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
> +vhost_net_ubuf_alloc(struct vhost_net_virtqueue *nvq, bool zcopy)
> {
> struct vhost_net_ubuf_ref *ubufs;
> /* No zero copy backend? Nothing to count. */
> @@ -242,7 +242,7 @@ vhost_net_ubuf_alloc(struct vhost_virtqueue *vq, bool zcopy)
> return ERR_PTR(-ENOMEM);
> atomic_set(&ubufs->refcount, 1);
> init_waitqueue_head(&ubufs->wait);
> - ubufs->vq = vq;
> + ubufs->nvq = nvq;
> return ubufs;
> }
>
> @@ -384,13 +384,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
> static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
> {
> struct vhost_net_ubuf_ref *ubufs = ubuf->ctx;
> - struct vhost_virtqueue *vq = ubufs->vq;
> + struct vhost_net_virtqueue *nvq = ubufs->nvq;
> int cnt;
>
> rcu_read_lock_bh();
>
> /* set len to mark this desc buffers done DMA */
> - vq->heads[ubuf->desc].len = success ?
> + nvq->vq.heads[ubuf->desc].in_len = success ?
Not that this matters a lot, because it will be overridden in later patches of the series, but `.len` has been replaced by
`.in_len`, making the compiler complain. This fixes it:
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ff594eec8ae3..fdecf39c9ac9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -390,7 +390,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_lock_bh();
/* set len to mark this desc buffers done DMA */
- nvq->vq.heads[ubuf->desc].in_len = success ?
+ nvq->vq.heads[ubuf->desc].len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
> VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
> cnt = vhost_net_ubuf_put(ubufs);
>
> @@ -402,7 +402,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
> * less than 10% of times).
> */
> if (cnt <= 1 || !(cnt % 16))
> - vhost_poll_queue(&vq->poll);
> + vhost_poll_queue(&nvq->vq.poll);
>
> rcu_read_unlock_bh();
> }
> @@ -1525,7 +1525,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
> /* start polling new socket */
> oldsock = vhost_vq_get_backend(vq);
> if (sock != oldsock) {
> - ubufs = vhost_net_ubuf_alloc(vq,
> + ubufs = vhost_net_ubuf_alloc(nvq,
> sock && vhost_sock_zcopy(sock));
> if (IS_ERR(ubufs)) {
> r = PTR_ERR(ubufs);
On Wed, Jun 10, 2020 at 06:18:32PM +0200, Eugenio Perez Martin wrote:
> On Wed, Jun 10, 2020 at 5:13 PM Michael S. Tsirkin <[email protected]> wrote:
> >
> > On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > > > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > > > + * A negative code is returned on error. */
> > > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > > +{
> > > > + int ret;
> > > > +
> > > > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > > > + vq->first_desc = 0;
> > > > + vq->ndescs = 0;
> > > > + }
> > > > +
> > > > + if (vq->ndescs)
> > > > + return 1;
> > > > +
> > > > + for (ret = 1;
> > > > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > > > + ret = fetch_buf(vq))
> > > > + ;
> > >
> > > (Expanding comment in V6):
> > >
> > > We get an infinite loop this way:
> > > * vq->ndescs == 0, so we call fetch_buf() here
> > > * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> > > * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> > > last_avail_vq), so it just returns 1
> >
> > That's what
> > [PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
> > is supposed to fix.
> >
>
> Sorry, I forgot to include that fixup.
>
> With it I don't see CPU stalls, but with that version latency has
> increased a lot and I see packet loss:
> + ping -c 5 10.200.0.1
> PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> From 10.200.0.2 icmp_seq=1 Destination Host Unreachable
> From 10.200.0.2 icmp_seq=2 Destination Host Unreachable
> From 10.200.0.2 icmp_seq=3 Destination Host Unreachable
> 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=6848 ms
>
> --- 10.200.0.1 ping statistics ---
> 5 packets transmitted, 1 received, +3 errors, 80% packet loss, time 76ms
> rtt min/avg/max/mdev = 6848.316/6848.316/6848.316/0.000 ms, pipe 4
> --
>
> I cannot even use netperf.
OK so that's the bug to try to find and fix I think.
> If I modify with my proposed version:
> + ping -c 5 10.200.0.1
> PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> 64 bytes from 10.200.0.1: icmp_seq=1 ttl=64 time=7.07 ms
> 64 bytes from 10.200.0.1: icmp_seq=2 ttl=64 time=0.358 ms
> 64 bytes from 10.200.0.1: icmp_seq=3 ttl=64 time=5.35 ms
> 64 bytes from 10.200.0.1: icmp_seq=4 ttl=64 time=2.27 ms
> 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=0.426 ms
Not sure which version this is.
> [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t TCP_STREAM
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.200.0.1 () port 0 AF_INET
> Recv Send Send
> Socket Socket Message Elapsed
> Size Size Size Time Throughput
> bytes bytes bytes secs. 10^6bits/sec
>
> 131072 16384 16384 10.01 4742.36
> [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t UDP_STREAM
> MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> 10.200.0.1 () port 0 AF_INET
> Socket Message Elapsed Messages
> Size Size Time Okay Errors Throughput
> bytes bytes secs # # 10^6bits/sec
>
> 212992 65507 10.00 9214 0 482.83
> 212992 10.00 9214 482.83
>
> I will compare with the non-batch version for reference, but the
> difference between the two is noticeable. Maybe it's worth finding a
> good value for the if() inside fetch_buf?
>
> Thanks!
>
I don't think it's performance, I think it's a bug somewhere,
e.g. maybe we corrupt a packet, or stall the queue, or
something like this.
Let's do this, I will squash the fixups and post v8 so you can bisect
and then debug cleanly.
> > --
> > MST
> >
On Thu, 2020-06-11 at 07:30 -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 10, 2020 at 06:18:32PM +0200, Eugenio Perez Martin wrote:
> > On Wed, Jun 10, 2020 at 5:13 PM Michael S. Tsirkin <[email protected]> wrote:
> > > On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > > > > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > > > > + * A negative code is returned on error. */
> > > > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > > > +{
> > > > > + int ret;
> > > > > +
> > > > > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > > > > + vq->first_desc = 0;
> > > > > + vq->ndescs = 0;
> > > > > + }
> > > > > +
> > > > > + if (vq->ndescs)
> > > > > + return 1;
> > > > > +
> > > > > + for (ret = 1;
> > > > > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > > > > + ret = fetch_buf(vq))
> > > > > + ;
> > > >
> > > > (Expanding comment in V6):
> > > >
> > > > We get an infinite loop this way:
> > > > * vq->ndescs == 0, so we call fetch_buf() here
> > > > * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> > > > * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> > > > last_avail_vq), so it just returns 1
> > >
> > > That's what
> > > [PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
> > > is supposed to fix.
> > >
> >
> > Sorry, I forgot to include that fixup.
> >
> > With it I don't see CPU stalls, but with that version latency has
> > increased a lot and I see packet loss:
> > + ping -c 5 10.200.0.1
> > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > From 10.200.0.2 icmp_seq=1 Destination Host Unreachable
> > From 10.200.0.2 icmp_seq=2 Destination Host Unreachable
> > From 10.200.0.2 icmp_seq=3 Destination Host Unreachable
> > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=6848 ms
> >
> > --- 10.200.0.1 ping statistics ---
> > 5 packets transmitted, 1 received, +3 errors, 80% packet loss, time 76ms
> > rtt min/avg/max/mdev = 6848.316/6848.316/6848.316/0.000 ms, pipe 4
> > --
> >
> > I cannot even use netperf.
>
> OK so that's the bug to try to find and fix I think.
>
>
> > If I modify with my proposed version:
> > + ping -c 5 10.200.0.1
> > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > 64 bytes from 10.200.0.1: icmp_seq=1 ttl=64 time=7.07 ms
> > 64 bytes from 10.200.0.1: icmp_seq=2 ttl=64 time=0.358 ms
> > 64 bytes from 10.200.0.1: icmp_seq=3 ttl=64 time=5.35 ms
> > 64 bytes from 10.200.0.1: icmp_seq=4 ttl=64 time=2.27 ms
> > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=0.426 ms
>
> Not sure which version this is.
>
> > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t TCP_STREAM
> > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 10.200.0.1 () port 0 AF_INET
> > Recv Send Send
> > Socket Socket Message Elapsed
> > Size Size Size Time Throughput
> > bytes bytes bytes secs. 10^6bits/sec
> >
> > 131072 16384 16384 10.01 4742.36
> > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t UDP_STREAM
> > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > 10.200.0.1 () port 0 AF_INET
> > Socket Message Elapsed Messages
> > Size Size Time Okay Errors Throughput
> > bytes bytes secs # # 10^6bits/sec
> >
> > 212992 65507 10.00 9214 0 482.83
> > 212992 10.00 9214 482.83
> >
> > I will compare with the non-batch version for reference, but the
> > difference between the two is noticeable. Maybe it's worth finding a
> > good value for the if() inside fetch_buf?
> >
> > Thanks!
> >
>
> I don't think it's performance, I think it's a bug somewhere,
> e.g. maybe we corrupt a packet, or stall the queue, or
> something like this.
>
> Let's do this, I will squash the fixups and post v8 so you can bisect
> and then debug cleanly.
Ok, so if we apply the patch proposed in v7 08/14 (or version 8 of the patchset as sent), this is what happens:
1. Userland (virtio_test in my case) puts just one buffer in the vq, and it kicks.
2. The vhost module reaches fetch_descs, called from vhost_get_vq_desc. From there we call fetch_buf in a for loop.
3. The first time we call fetch_buf, it properly returns one buffer. However, the second time we call it, it returns 0
because it takes the vq->avail_idx == vq->last_avail_idx and vq->avail_idx == last_avail_idx code path.
4. fetch_descs assigns ret = 0, so it returns 0. vhost_get_vq_desc will goto err, and it will signal no new buffer
(returning vq->num).
So to fix it and maintain the batching maybe we could return vq->ndescs in case ret == 0:
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c0dfb5e3d2af..5993d4f34ca9 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2315,7 +2327,8 @@ static int fetch_descs(struct vhost_virtqueue *vq)
/* On success we expect some descs */
BUG_ON(ret > 0 && !vq->ndescs);
- return ret;
+ return ret ?: vq->ndescs;
}
/* Reverse the effects of fetch_descs */
--
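For clarity, this is how the whole function would read with that one-liner applied (just a sketch; note that ?: with an omitted middle operand is a GNU C extension):

static int fetch_descs(struct vhost_virtqueue *vq)
{
	int ret;

	if (unlikely(vq->first_desc >= vq->ndescs)) {
		vq->first_desc = 0;
		vq->ndescs = 0;
	}

	if (vq->ndescs)
		return 1;

	for (ret = 1;
	     ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
	     ret = fetch_buf(vq))
		;

	/* On success we expect some descs */
	BUG_ON(ret > 0 && !vq->ndescs);

	/* A partially filled batch still counts as success: if the last
	 * fetch_buf() returned 0 but earlier iterations fetched buffers,
	 * report those instead of discarding them. */
	return ret ?: vq->ndescs;
}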
Another possibility could be to return different codes from fetch_buf, but I find the suggested modification easier.
What do you think?
Thanks!
On Mon, Jun 15, 2020 at 6:05 PM Eugenio Pérez <[email protected]> wrote:
>
> On Thu, 2020-06-11 at 07:30 -0400, Michael S. Tsirkin wrote:
> > On Wed, Jun 10, 2020 at 06:18:32PM +0200, Eugenio Perez Martin wrote:
> > > On Wed, Jun 10, 2020 at 5:13 PM Michael S. Tsirkin <[email protected]> wrote:
> > > > On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > > > > > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > > > > > + * A negative code is returned on error. */
> > > > > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > > > > +{
> > > > > > + int ret;
> > > > > > +
> > > > > > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > > > > > + vq->first_desc = 0;
> > > > > > + vq->ndescs = 0;
> > > > > > + }
> > > > > > +
> > > > > > + if (vq->ndescs)
> > > > > > + return 1;
> > > > > > +
> > > > > > + for (ret = 1;
> > > > > > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > > > > > + ret = fetch_buf(vq))
> > > > > > + ;
> > > > >
> > > > > (Expanding comment in V6):
> > > > >
> > > > > We get an infinite loop this way:
> > > > > * vq->ndescs == 0, so we call fetch_buf() here
> > > > > * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> > > > > * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> > > > > last_avail_vq), so it just returns 1
> > > >
> > > > That's what
> > > > [PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
> > > > is supposed to fix.
> > > >
> > >
> > > Sorry, I forgot to include that fixup.
> > >
> > > With it I don't see CPU stalls, but with that version latency has
> > > increased a lot and I see packet loss:
> > > + ping -c 5 10.200.0.1
> > > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > > From 10.200.0.2 icmp_seq=1 Destination Host Unreachable
> > > From 10.200.0.2 icmp_seq=2 Destination Host Unreachable
> > > From 10.200.0.2 icmp_seq=3 Destination Host Unreachable
> > > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=6848 ms
> > >
> > > --- 10.200.0.1 ping statistics ---
> > > 5 packets transmitted, 1 received, +3 errors, 80% packet loss, time 76ms
> > > rtt min/avg/max/mdev = 6848.316/6848.316/6848.316/0.000 ms, pipe 4
> > > --
> > >
> > > I cannot even use netperf.
> >
> > OK so that's the bug to try to find and fix I think.
> >
> >
> > > If I modify with my proposed version:
> > > + ping -c 5 10.200.0.1
> > > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > > 64 bytes from 10.200.0.1: icmp_seq=1 ttl=64 time=7.07 ms
> > > 64 bytes from 10.200.0.1: icmp_seq=2 ttl=64 time=0.358 ms
> > > 64 bytes from 10.200.0.1: icmp_seq=3 ttl=64 time=5.35 ms
> > > 64 bytes from 10.200.0.1: icmp_seq=4 ttl=64 time=2.27 ms
> > > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=0.426 ms
> >
> > Not sure which version this is.
> >
> > > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t TCP_STREAM
> > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > > 10.200.0.1 () port 0 AF_INET
> > > Recv Send Send
> > > Socket Socket Message Elapsed
> > > Size Size Size Time Throughput
> > > bytes bytes bytes secs. 10^6bits/sec
> > >
> > > 131072 16384 16384 10.01 4742.36
> > > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t UDP_STREAM
> > > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > > 10.200.0.1 () port 0 AF_INET
> > > Socket Message Elapsed Messages
> > > Size Size Time Okay Errors Throughput
> > > bytes bytes secs # # 10^6bits/sec
> > >
> > > 212992 65507 10.00 9214 0 482.83
> > > 212992 10.00 9214 482.83
> > >
> > > I will compare with the non-batch version for reference, but the
> > > difference between the two is noticeable. Maybe it's worth finding a
> > > good value for the if() inside fetch_buf?
> > >
> > > Thanks!
> > >
> >
> > I don't think it's performance, I think it's a bug somewhere,
> > e.g. maybe we corrupt a packet, or stall the queue, or
> > something like this.
> >
> > Let's do this, I will squash the fixups and post v8 so you can bisect
> > and then debug cleanly.
>
> Ok, so if we apply the patch proposed in v7 08/14 (or version 8 of the patchset as sent), this is what happens:
>
> 1. Userland (virtio_test in my case) puts just one buffer in the vq, and it kicks.
> 2. The vhost module reaches fetch_descs, called from vhost_get_vq_desc. From there we call fetch_buf in a for loop.
> 3. The first time we call fetch_buf, it properly returns one buffer. However, the second time we call it, it returns 0
> because it takes the vq->avail_idx == vq->last_avail_idx and vq->avail_idx == last_avail_idx code path.
> 4. fetch_descs assigns ret = 0, so it returns 0. vhost_get_vq_desc will goto err, and it will signal no new buffer
> (returning vq->num).
>
> So to fix it and maintain the batching maybe we could return vq->ndescs in case ret == 0:
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c0dfb5e3d2af..5993d4f34ca9 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2315,7 +2327,8 @@ static int fetch_descs(struct vhost_virtqueue *vq)
>
> /* On success we expect some descs */
> BUG_ON(ret > 0 && !vq->ndescs);
> - return ret;
> + return ret ?: vq->ndescs;
> }
>
> /* Reverse the effects of fetch_descs */
> --
>
> Another possibility could be to return different codes from fetch_buf, but I find the suggested modification easier.
>
> What do you think?
>
> Thanks!
>
Hi!
I can send a proposed RFC v9 in case it is more convenient for you.
Thanks!
On Tue, Jun 16, 2020 at 05:23:43PM +0200, Eugenio Perez Martin wrote:
> On Mon, Jun 15, 2020 at 6:05 PM Eugenio Pérez <[email protected]> wrote:
> >
> > On Thu, 2020-06-11 at 07:30 -0400, Michael S. Tsirkin wrote:
> > > On Wed, Jun 10, 2020 at 06:18:32PM +0200, Eugenio Perez Martin wrote:
> > > > On Wed, Jun 10, 2020 at 5:13 PM Michael S. Tsirkin <[email protected]> wrote:
> > > > > On Wed, Jun 10, 2020 at 02:37:50PM +0200, Eugenio Perez Martin wrote:
> > > > > > > +/* This function returns a value > 0 if a descriptor was found, or 0 if none were found.
> > > > > > > + * A negative code is returned on error. */
> > > > > > > +static int fetch_descs(struct vhost_virtqueue *vq)
> > > > > > > +{
> > > > > > > + int ret;
> > > > > > > +
> > > > > > > + if (unlikely(vq->first_desc >= vq->ndescs)) {
> > > > > > > + vq->first_desc = 0;
> > > > > > > + vq->ndescs = 0;
> > > > > > > + }
> > > > > > > +
> > > > > > > + if (vq->ndescs)
> > > > > > > + return 1;
> > > > > > > +
> > > > > > > + for (ret = 1;
> > > > > > > + ret > 0 && vq->ndescs <= vhost_vq_num_batch_descs(vq);
> > > > > > > + ret = fetch_buf(vq))
> > > > > > > + ;
> > > > > >
> > > > > > (Expanding comment in V6):
> > > > > >
> > > > > > We get an infinite loop this way:
> > > > > > * vq->ndescs == 0, so we call fetch_buf() here
> > > > > > * fetch_buf gets less than vhost_vq_num_batch_descs(vq) descriptors. ret = 1
> > > > > > * This loop calls fetch_buf again, but vq->ndescs > 0 (and avail_vq ==
> > > > > > last_avail_vq), so it just returns 1
> > > > >
> > > > > That's what
> > > > > [PATCH RFC v7 08/14] fixup! vhost: use batched get_vq_desc version
> > > > > is supposed to fix.
> > > > >
> > > >
> > > > Sorry, I forgot to include that fixup.
> > > >
> > > > With it I don't see CPU stalls, but with that version latency has
> > > > increased a lot and I see packet loss:
> > > > + ping -c 5 10.200.0.1
> > > > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > > > From 10.200.0.2 icmp_seq=1 Destination Host Unreachable
> > > > From 10.200.0.2 icmp_seq=2 Destination Host Unreachable
> > > > From 10.200.0.2 icmp_seq=3 Destination Host Unreachable
> > > > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=6848 ms
> > > >
> > > > --- 10.200.0.1 ping statistics ---
> > > > 5 packets transmitted, 1 received, +3 errors, 80% packet loss, time 76ms
> > > > rtt min/avg/max/mdev = 6848.316/6848.316/6848.316/0.000 ms, pipe 4
> > > > --
> > > >
> > > > I cannot even use netperf.
> > >
> > > OK so that's the bug to try to find and fix I think.
> > >
> > >
> > > > If I modify with my proposed version:
> > > > + ping -c 5 10.200.0.1
> > > > PING 10.200.0.1 (10.200.0.1) 56(84) bytes of data.
> > > > 64 bytes from 10.200.0.1: icmp_seq=1 ttl=64 time=7.07 ms
> > > > 64 bytes from 10.200.0.1: icmp_seq=2 ttl=64 time=0.358 ms
> > > > 64 bytes from 10.200.0.1: icmp_seq=3 ttl=64 time=5.35 ms
> > > > 64 bytes from 10.200.0.1: icmp_seq=4 ttl=64 time=2.27 ms
> > > > 64 bytes from 10.200.0.1: icmp_seq=5 ttl=64 time=0.426 ms
> > >
> > > Not sure which version this is.
> > >
> > > > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t TCP_STREAM
> > > > MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > > > 10.200.0.1 () port 0 AF_INET
> > > > Recv Send Send
> > > > Socket Socket Message Elapsed
> > > > Size Size Size Time Throughput
> > > > bytes bytes bytes secs. 10^6bits/sec
> > > >
> > > > 131072 16384 16384 10.01 4742.36
> > > > [root@localhost ~]# netperf -H 10.200.0.1 -p 12865 -l 10 -t UDP_STREAM
> > > > MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> > > > 10.200.0.1 () port 0 AF_INET
> > > > Socket Message Elapsed Messages
> > > > Size Size Time Okay Errors Throughput
> > > > bytes bytes secs # # 10^6bits/sec
> > > >
> > > > 212992 65507 10.00 9214 0 482.83
> > > > 212992 10.00 9214 482.83
> > > >
> > > > I will compare with the non-batch version for reference, but the
> > > > difference between the two is noticeable. Maybe it's worth finding a
> > > > good value for the if() inside fetch_buf?
> > > >
> > > > Thanks!
> > > >
> > >
> > > I don't think it's performance, I think it's a bug somewhere,
> > > e.g. maybe we corrupt a packet, or stall the queue, or
> > > something like this.
> > >
> > > Let's do this, I will squash the fixups and post v8 so you can bisect
> > > and then debug cleanly.
> >
> > Ok, so if we apply the patch proposed in v7 08/14 (or version 8 of the patchset as sent), this is what happens:
> >
> > 1. Userland (virtio_test in my case) puts just one buffer in the vq, and it kicks.
> > 2. The vhost module reaches fetch_descs, called from vhost_get_vq_desc. From there we call fetch_buf in a for loop.
> > 3. The first time we call fetch_buf, it properly returns one buffer. However, the second time we call it, it returns 0
> > because it takes the vq->avail_idx == vq->last_avail_idx and vq->avail_idx == last_avail_idx code path.
> > 4. fetch_descs assigns ret = 0, so it returns 0. vhost_get_vq_desc will goto err, and it will signal no new buffer
> > (returning vq->num).
> >
> > So to fix it and maintain the batching maybe we could return vq->ndescs in case ret == 0:
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index c0dfb5e3d2af..5993d4f34ca9 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2315,7 +2327,8 @@ static int fetch_descs(struct vhost_virtqueue *vq)
> >
> > /* On success we expect some descs */
> > BUG_ON(ret > 0 && !vq->ndescs);
> > - return ret;
> > + return ret ?: vq->ndescs;
I'd rather we used standard C. Also, ret < 0 needs
to be handled. And what if fetching some descs fails
but some succeed?
What do we want to do?
Maybe:
return vq->ndescs ? vq->ndescs : ret;
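Spelled out, the tail of the function would then read (just a sketch):

	/* On success we expect some descs */
	BUG_ON(ret > 0 && !vq->ndescs);
	/* Prefer reporting the descs we did manage to fetch; a bad ret
	 * only propagates when nothing at all was fetched. */
	return vq->ndescs ? vq->ndescs : ret;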
> > }
> >
> > /* Reverse the effects of fetch_descs */
> > --
> >
> > Another possibility could be to return different codes from fetch_buf, but I find the suggested modification easier.
> >
> > What do you think?
> >
> > Thanks!
> >
>
> Hi!
>
> I can send a proposed RFC v9 in case it is more convenient for you.
>
> Thanks!
Excellent, pls go ahead!
And can you include the performance numbers?
It's enough to test the final version.
--
MST