LinuxLists.cc - [PATCH net-next 0/8] enable/disable zero copy tx dynamically

2012-10-29 15:47:22

Subject: [PATCH net-next 0/8] enable/disable zero copy tx dynamically

tun supports zero copy transmit since 0690899b4d4501b3505be069b9a687e68ccbe15b,
however you can only enable this mode if you know your workload does not
trigger heavy guest to host/host to guest traffic - otherwise you
get a (minor) performance regression.
This patchset addresses this problem by notifying the owner
device when callback is invoked because of a data copy.
This makes it possible to detect whether zero copy is appropriate
dynamically: we start in zero copy mode, when we detect
data copied we disable zero copy for a while.

With this patch applied, I get the same performance for
guest to host and guest to guest both with and without zero copy tx.

Michael S. Tsirkin (8):
skb: report completion status for zero copy skbs
skb: api to report errors for zero copy skbs
tun: report orphan frags errors to zero copy callback
vhost-net: cleanup macros for DMA status tracking
vhost: track zero copy failures using DMA length
vhost: move -net specific code out
vhost-net: select tx zero copy dynamically
vhost-net: reduce vq polling on tx zerocopy

drivers/net/tun.c | 1 +
drivers/vhost/net.c | 109 +++++++++++++++++++++++++++++++++++++++++++---
drivers/vhost/tcm_vhost.c | 1 +
drivers/vhost/vhost.c | 52 +++-------------------
drivers/vhost/vhost.h | 11 ++---
include/linux/skbuff.h | 5 ++-
net/core/skbuff.c | 23 +++++++++-
7 files changed, 141 insertions(+), 61 deletions(-)

--
MST

2012-10-29 15:47:27

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 1/8] skb: report completion status for zero copy skbs

Even if skb is marked for zero copy, net core might still decide
to copy it later which is somewhat slower than a copy in user context:
besides copying the data we need to pin/unpin the pages.

Add a parameter reporting such cases through zero copy callback:
if this happens a lot, device can take this into account
and switch to copying in user context.

This patch updates all users but ignores the passed value for now:
it will be used by follow-up patches.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 2 +-
drivers/vhost/vhost.h | 2 +-
include/linux/skbuff.h | 4 +++-
net/core/skbuff.c | 4 ++--
4 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 99ac2cb..92308b6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1600,7 +1600,7 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
kfree(ubufs);
}

-void vhost_zerocopy_callback(struct ubuf_info *ubuf)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, int zerocopy_status)
{
struct vhost_ubuf_ref *ubufs = ubuf->ctx;
struct vhost_virtqueue *vq = ubufs->vq;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 1125af3..eb7263c3 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -191,7 +191,7 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);

int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *);
+void vhost_zerocopy_callback(struct ubuf_info *, int);
int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);

#define vq_err(vq, fmt, ...) do { \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6a2c34e..8bac11b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -235,11 +235,13 @@ enum {
/*
* The callback notifies userspace to release buffers when skb DMA is done in
* lower device, the skb last reference should be 0 when calling this.
+ * The zerocopy_status argument is 0 if zero copy transmit occurred,
+ * 1 on successful data copy; < 0 on out of memory error.
* The ctx field is used to track device context.
* The desc field is used to track userspace buffer index.
*/
struct ubuf_info {
- void (*callback)(struct ubuf_info *);
+ void (*callback)(struct ubuf_info *, int zerocopy_status);
void *ctx;
unsigned long desc;
};
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6e04b1f..eb31f6e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -519,7 +519,7 @@ static void skb_release_data(struct sk_buff *skb)

uarg = skb_shinfo(skb)->destructor_arg;
if (uarg->callback)
- uarg->callback(uarg);
+ uarg->callback(uarg, 0);
}

if (skb_has_frag_list(skb))
@@ -797,7 +797,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
for (i = 0; i < num_frags; i++)
skb_frag_unref(skb, i);

- uarg->callback(uarg);
+ uarg->callback(uarg, 1);

/* skb frags point to kernel buffers */
for (i = num_frags - 1; i >= 0; i--) {
--
MST

2012-10-29 15:47:30

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 2/8] skb: api to report errors for zero copy skbs

Orphaning frags for zero copy skbs needs to allocate data in atomic
context so is has a chance to fail. If it does we currently discard
the skb which is safe, but we don't report anything to the caller,
so it can not recover by e.g. disabling zero copy.

Add an API to free skb reporting such errors: this is used
by tun in case orphaning frags fails.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
include/linux/skbuff.h | 1 +
net/core/skbuff.c | 19 +++++++++++++++++++
2 files changed, 20 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 8bac11b..0644432 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -568,6 +568,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
}

extern void kfree_skb(struct sk_buff *skb);
+extern void skb_tx_error(struct sk_buff *skb, int err);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct kmem_cache *skbuff_head_cache;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index eb31f6e..ad99c64 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -635,6 +635,25 @@ void kfree_skb(struct sk_buff *skb)
EXPORT_SYMBOL(kfree_skb);

/**
+ * kfree_skb_on_error - report an sk_buff xmit error
+ * @skb: buffer that triggered an error
+ *
+ * Report xmit error if a device callback is tracking this skb.
+ */
+void skb_tx_error(struct sk_buff *skb, int err)
+{
+ if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
+ struct ubuf_info *uarg;
+
+ uarg = skb_shinfo(skb)->destructor_arg;
+ if (uarg->callback)
+ uarg->callback(uarg, err);
+ skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+ }
+}
+EXPORT_SYMBOL(skb_tx_error);
+
+/**
* consume_skb - free an skbuff
* @skb: buffer to free
*
--
MST

2012-10-29 15:47:43

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 8/8] vhost-net: reduce vq polling on tx zerocopy

It seems that to avoid deadlocks it is enough to poll vq before
we are going to use the last buffer. This should be faster than
c70aa540c7a9f67add11ad3161096fb95233aa2e.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8e9de79..3967f82 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,8 +197,16 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
{
struct vhost_ubuf_ref *ubufs = ubuf->ctx;
struct vhost_virtqueue *vq = ubufs->vq;
-
- vhost_poll_queue(&vq->poll);
+ int cnt = atomic_read(&ubufs->kref.refcount);
+
+ /*
+ * Trigger polling thread if guest stopped submitting new buffers:
+ * in this case, the refcount after decrement will eventually reach 1
+ * so here it is 2.
+ * We also trigger polling periodically after each 16 packets.
+ */
+ if (cnt <= 2 || !(cnt % 16))
+ vhost_poll_queue(&vq->poll);
/* set len to mark this desc buffers done DMA */
vq->heads[ubuf->desc].len = status ?
VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
--
MST

2012-10-29 15:47:50

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 4/8] vhost-net: cleanup macros for DMA status tracking

Better document macros for DMA tracking. Add an
explicit one for DMA in progress instead of
relying on user supplying len != 1.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 3 ++-
drivers/vhost/vhost.c | 2 +-
drivers/vhost/vhost.h | 12 +++++++++---
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 072cbba..f80ae5f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -237,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
} else {
struct ubuf_info *ubuf = &vq->ubuf_info[head];

- vq->heads[vq->upend_idx].len = len;
+ vq->heads[vq->upend_idx].len =
+ VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = vq->ubufs;
ubuf->desc = vq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 92308b6..906fd9f 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1606,7 +1606,7 @@ void vhost_zerocopy_callback(struct ubuf_info *ubuf, int zerocopy_status)
struct vhost_virtqueue *vq = ubufs->vq;

vhost_poll_queue(&vq->poll);
- /* set len = 1 to mark this desc buffers done DMA */
+ /* set len to mark this desc buffers done DMA */
vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index eb7263c3..ad72a1f 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,9 +13,15 @@
#include <linux/virtio_ring.h>
#include <linux/atomic.h>

-/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
- * done */
-#define VHOST_DMA_DONE_LEN 1
+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN 2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS 1
+/* Buffer unused */
#define VHOST_DMA_CLEAR_LEN 0

struct vhost_device;
--
MST

2012-10-29 15:47:58

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 6/8] vhost: move -net specific code out

Zerocopy handling code is vhost-net specific.
Move it from vhost.c/vhost.h out to net.c

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 45 ++++++++++++++++++++++++++++++++++++++++
drivers/vhost/tcm_vhost.c | 1 +
drivers/vhost/vhost.c | 53 +++++++----------------------------------------
drivers/vhost/vhost.h | 21 +++----------------
4 files changed, 56 insertions(+), 64 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f80ae5f..532fc88 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -126,6 +126,42 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
}

+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+{
+ int i;
+ int j = 0;
+
+ for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+ if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+ vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+ vhost_add_used_and_signal(vq->dev, vq,
+ vq->heads[i].id, 0);
+ ++j;
+ } else
+ break;
+ }
+ if (j)
+ vq->done_idx = i;
+ return j;
+}
+
+static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
+{
+ struct vhost_ubuf_ref *ubufs = ubuf->ctx;
+ struct vhost_virtqueue *vq = ubufs->vq;
+
+ vhost_poll_queue(&vq->poll);
+ /* set len to mark this desc buffers done DMA */
+ vq->heads[ubuf->desc].len = status ?
+ VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
+ vhost_ubuf_put(ubufs);
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
@@ -594,9 +630,18 @@ static int vhost_net_release(struct inode *inode, struct file *f)
struct vhost_net *n = f->private_data;
struct socket *tx_sock;
struct socket *rx_sock;
+ int i;

vhost_net_stop(n, &tx_sock, &rx_sock);
vhost_net_flush(n);
+ vhost_dev_stop(&n->dev);
+ for (i = 0; i < n->dev.nvqs; ++i) {
+ /* Wait for all lower device DMAs done. */
+ if (n->dev.vqs[i].ubufs)
+ vhost_ubuf_put_and_wait(n->dev.vqs[i].ubufs);
+
+ vhost_zerocopy_signal_used(n, &n->dev.vqs[i]);
+ }
vhost_dev_cleanup(&n->dev, false);
if (tx_sock)
fput(tx_sock->file);
diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index aa31692..23c138f 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -895,6 +895,7 @@ static int vhost_scsi_release(struct inode *inode, struct file *f)
vhost_scsi_clear_endpoint(s, &backend);
}

+ vhost_dev_stop(&s->dev);
vhost_dev_cleanup(&s->dev, false);
kfree(s);
return 0;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5affce3..ef8f598 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -26,10 +26,6 @@
#include <linux/kthread.h>
#include <linux/cgroup.h>

-#include <linux/net.h>
-#include <linux/if_packet.h>
-#include <linux/if_arp.h>
-
#include "vhost.h"

enum {
@@ -414,28 +410,16 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
return 0;
}

-/* In case of DMA done not in order in lower device driver for some reason.
- * upend_idx is used to track end of used idx, done_idx is used to track head
- * of used idx. Once lower device DMA done contiguously, we will signal KVM
- * guest used idx.
- */
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+void vhost_dev_stop(struct vhost_dev *dev)
{
int i;
- int j = 0;
-
- for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
- vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
- vhost_add_used_and_signal(vq->dev, vq,
- vq->heads[i].id, 0);
- ++j;
- } else
- break;
+
+ for (i = 0; i < dev->nvqs; ++i) {
+ if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
+ vhost_poll_stop(&dev->vqs[i].poll);
+ vhost_poll_flush(&dev->vqs[i].poll);
+ }
}
- if (j)
- vq->done_idx = i;
- return j;
}

/* Caller should have device mutex if and only if locked is set */
@@ -444,17 +428,6 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
int i;

for (i = 0; i < dev->nvqs; ++i) {
- if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
- vhost_poll_stop(&dev->vqs[i].poll);
- vhost_poll_flush(&dev->vqs[i].poll);
- }
- /* Wait for all lower device DMAs done. */
- if (dev->vqs[i].ubufs)
- vhost_ubuf_put_and_wait(dev->vqs[i].ubufs);
-
- /* Signal guest as appropriate. */
- vhost_zerocopy_signal_used(&dev->vqs[i]);
-
if (dev->vqs[i].error_ctx)
eventfd_ctx_put(dev->vqs[i].error_ctx);
if (dev->vqs[i].error)
@@ -1599,15 +1572,3 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
wait_event(ubufs->wait, !atomic_read(&ubufs->kref.refcount));
kfree(ubufs);
}
-
-void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
-{
- struct vhost_ubuf_ref *ubufs = ubuf->ctx;
- struct vhost_virtqueue *vq = ubufs->vq;
-
- vhost_poll_queue(&vq->poll);
- /* set len to mark this desc buffers done DMA */
- vq->heads[ubuf->desc].len = status ?
- VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
- kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
-}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6fdf31d..5e19e3d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -7,27 +7,11 @@
#include <linux/mutex.h>
#include <linux/poll.h>
#include <linux/file.h>
-#include <linux/skbuff.h>
#include <linux/uio.h>
#include <linux/virtio_config.h>
#include <linux/virtio_ring.h>
#include <linux/atomic.h>

-/*
- * For transmit, used buffer len is unused; we override it to track buffer
- * status internally; used for zerocopy tx only.
- */
-/* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN 3
-/* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN 2
-/* Lower device DMA in progress */
-#define VHOST_DMA_IN_PROGRESS 1
-/* Buffer unused */
-#define VHOST_DMA_CLEAR_LEN 0
-
-#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
-
struct vhost_device;

struct vhost_work;
@@ -80,6 +64,8 @@ struct vhost_ubuf_ref *vhost_ubuf_alloc(struct vhost_virtqueue *, bool zcopy);
void vhost_ubuf_put(struct vhost_ubuf_ref *);
void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *);

+struct ubuf_info;
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -177,6 +163,7 @@ long vhost_dev_init(struct vhost_dev *, struct vhost_virtqueue *vqs, int nvqs);
long vhost_dev_check_owner(struct vhost_dev *);
long vhost_dev_reset_owner(struct vhost_dev *);
void vhost_dev_cleanup(struct vhost_dev *, bool locked);
+void vhost_dev_stop(struct vhost_dev *);
long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, unsigned long arg);
int vhost_vq_access_ok(struct vhost_virtqueue *vq);
int vhost_log_access_ok(struct vhost_dev *);
@@ -201,8 +188,6 @@ bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);

int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
-void vhost_zerocopy_callback(struct ubuf_info *, int);
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);

#define vq_err(vq, fmt, ...) do { \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \
--
MST

2012-10-29 15:48:05

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 7/8] vhost-net: select tx zero copy dynamically

Even when vhost-net is in zero-copy transmit mode,
net core might still decide to copy the skb later
which is somewhat slower than a copy in user
context: data copy overhead is added to the cost of
page pin/unpin. The result is that enabling tx zero copy
option leads to higher CPU utilization for guest to guest
and guest to host traffic.

To fix this, suppress zero copy tx after a given number of
packets triggered late data copy. Re-enable periodically
to detect workload changes.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/net.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 50 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 532fc88..8e9de79 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -42,6 +42,21 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Experimental Zero Copy TX");
#define VHOST_MAX_PEND 128
#define VHOST_GOODCOPY_LEN 256

+/*
+ * For transmit, used buffer len is unused; we override it to track buffer
+ * status internally; used for zerocopy tx only.
+ */
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN 3
+/* Lower device DMA done */
+#define VHOST_DMA_DONE_LEN 2
+/* Lower device DMA in progress */
+#define VHOST_DMA_IN_PROGRESS 1
+/* Buffer unused */
+#define VHOST_DMA_CLEAR_LEN 0
+
+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
enum {
VHOST_NET_VQ_RX = 0,
VHOST_NET_VQ_TX = 1,
@@ -62,8 +77,33 @@ struct vhost_net {
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
enum vhost_net_poll_state tx_poll_state;
+ /* Number of TX recently submitted.
+ * Protected by tx vq lock. */
+ unsigned tx_packets;
+ /* Number of times zerocopy TX recently failed.
+ * Protected by tx vq lock. */
+ unsigned tx_zcopy_err;
};

+static void vhost_net_tx_packet(struct vhost_net *net)
+{
+ ++net->tx_packets;
+ if (net->tx_packets < 1024)
+ return;
+ net->tx_packets = 0;
+ net->tx_zcopy_err = 0;
+}
+
+static void vhost_net_tx_err(struct vhost_net *net)
+{
+ ++net->tx_zcopy_err;
+}
+
+static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
+{
+ return net->tx_packets / 64 >= net->tx_zcopy_err;
+}
+
static bool vhost_sock_zcopy(struct socket *sock)
{
return unlikely(experimental_zcopytx) &&
@@ -131,12 +171,15 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
* of used idx. Once lower device DMA done contiguously, we will signal KVM
* guest used idx.
*/
-int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+static int vhost_zerocopy_signal_used(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
{
int i;
int j = 0;

for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
+ if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+ vhost_net_tx_err(net);
if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
vhost_add_used_and_signal(vq->dev, vq,
@@ -208,7 +251,7 @@ static void handle_tx(struct vhost_net *net)
for (;;) {
/* Release DMAs done buffers first */
if (zcopy)
- vhost_zerocopy_signal_used(vq);
+ vhost_zerocopy_signal_used(net, vq);

head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -263,7 +306,8 @@ static void handle_tx(struct vhost_net *net)
/* use msg_control to pass vhost zerocopy ubuf info to skb */
if (zcopy) {
vq->heads[vq->upend_idx].id = head;
- if (len < VHOST_GOODCOPY_LEN) {
+ if (!vhost_net_tx_select_zcopy(net) ||
+ len < VHOST_GOODCOPY_LEN) {
/* copy don't need to wait for DMA done */
vq->heads[vq->upend_idx].len =
VHOST_DMA_DONE_LEN;
@@ -305,8 +349,9 @@ static void handle_tx(struct vhost_net *net)
if (!zcopy)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
else
- vhost_zerocopy_signal_used(vq);
+ vhost_zerocopy_signal_used(net, vq);
total_len += len;
+ vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
@@ -774,7 +819,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
if (oldubufs) {
vhost_ubuf_put_and_wait(oldubufs);
mutex_lock(&vq->mutex);
- vhost_zerocopy_signal_used(vq);
+ vhost_zerocopy_signal_used(n, vq);
mutex_unlock(&vq->mutex);
}

--
MST

2012-10-29 15:52:43

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 3/8] tun: report orphan frags errors to zero copy callback

When tun transmits a zero copy skb, it orphans the frags
which might need to allocate extra memory, in atomic context.
If that fails, notify ubufs callback before freeing the skb
as a hint that device should disable zerocopy mode.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/net/tun.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3157519..613f826 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -433,6 +433,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)

drop:
dev->stats.tx_dropped++;
+ skb_tx_error(skb, -ENOMEM);
kfree_skb(skb);
return NETDEV_TX_OK;
}
--
MST

2012-10-29 15:56:03

by Michael S. Tsirkin

[permalink] [raw]

Subject: [PATCH net-next 5/8] vhost: track zero copy failures using DMA length

This will be used to disable zerocopy when error rate
is high.

Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/vhost/vhost.c | 7 ++++---
drivers/vhost/vhost.h | 4 ++++
2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 906fd9f..5affce3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -425,7 +425,7 @@ int vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
int j = 0;

for (i = vq->done_idx; i != vq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
- if ((vq->heads[i].len == VHOST_DMA_DONE_LEN)) {
+ if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
vhost_add_used_and_signal(vq->dev, vq,
vq->heads[i].id, 0);
@@ -1600,13 +1600,14 @@ void vhost_ubuf_put_and_wait(struct vhost_ubuf_ref *ubufs)
kfree(ubufs);
}

-void vhost_zerocopy_callback(struct ubuf_info *ubuf, int zerocopy_status)
+void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
{
struct vhost_ubuf_ref *ubufs = ubuf->ctx;
struct vhost_virtqueue *vq = ubufs->vq;

vhost_poll_queue(&vq->poll);
/* set len to mark this desc buffers done DMA */
- vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+ vq->heads[ubuf->desc].len = status ?
+ VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ad72a1f..6fdf31d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -17,6 +17,8 @@
* For transmit, used buffer len is unused; we override it to track buffer
* status internally; used for zerocopy tx only.
*/
+/* Lower device DMA failed */
+#define VHOST_DMA_FAILED_LEN 3
/* Lower device DMA done */
#define VHOST_DMA_DONE_LEN 2
/* Lower device DMA in progress */
@@ -24,6 +26,8 @@
/* Buffer unused */
#define VHOST_DMA_CLEAR_LEN 0

+#define VHOST_DMA_IS_DONE(len) ((len) >= VHOST_DMA_DONE_LEN)
+
struct vhost_device;

struct vhost_work;
--
MST

2012-10-30 15:44:26

by Vlad Yasevich

[permalink] [raw]

Subject: Re: [PATCH net-next 2/8] skb: api to report errors for zero copy skbs

On 10/29/2012 11:49 AM, Michael S. Tsirkin wrote:
> Orphaning frags for zero copy skbs needs to allocate data in atomic
> context so is has a chance to fail. If it does we currently discard
> the skb which is safe, but we don't report anything to the caller,
> so it can not recover by e.g. disabling zero copy.
>
> Add an API to free skb reporting such errors: this is used
> by tun in case orphaning frags fails.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> include/linux/skbuff.h | 1 +
> net/core/skbuff.c | 19 +++++++++++++++++++
> 2 files changed, 20 insertions(+)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 8bac11b..0644432 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -568,6 +568,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
> }
>
> extern void kfree_skb(struct sk_buff *skb);
> +extern void skb_tx_error(struct sk_buff *skb, int err);
> extern void consume_skb(struct sk_buff *skb);
> extern void __kfree_skb(struct sk_buff *skb);
> extern struct kmem_cache *skbuff_head_cache;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index eb31f6e..ad99c64 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -635,6 +635,25 @@ void kfree_skb(struct sk_buff *skb)
> EXPORT_SYMBOL(kfree_skb);
>
> /**
> + * kfree_skb_on_error - report an sk_buff xmit error
> + * @skb: buffer that triggered an error
> + *
> + * Report xmit error if a device callback is tracking this skb.
> + */

Nit: Comment doesn't match new function.

-vlad

> +void skb_tx_error(struct sk_buff *skb, int err)
> +{
> + if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> + struct ubuf_info *uarg;
> +
> + uarg = skb_shinfo(skb)->destructor_arg;
> + if (uarg->callback)
> + uarg->callback(uarg, err);
> + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> + }
> +}
> +EXPORT_SYMBOL(skb_tx_error);
> +
> +/**
> * consume_skb - free an skbuff
> * @skb: buffer to free
> *
>

2012-10-30 15:47:55

by Vlad Yasevich

[permalink] [raw]

Subject: Re: [PATCH net-next 8/8] vhost-net: reduce vq polling on tx zerocopy

On 10/29/2012 11:49 AM, Michael S. Tsirkin wrote:
> It seems that to avoid deadlocks it is enough to poll vq before
> we are going to use the last buffer. This should be faster than
> c70aa540c7a9f67add11ad3161096fb95233aa2e.
>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> ---
> drivers/vhost/net.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 8e9de79..3967f82 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -197,8 +197,16 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
> {
> struct vhost_ubuf_ref *ubufs = ubuf->ctx;
> struct vhost_virtqueue *vq = ubufs->vq;
> -
> - vhost_poll_queue(&vq->poll);
> + int cnt = atomic_read(&ubufs->kref.refcount);
> +
> + /*
> + * Trigger polling thread if guest stopped submitting new buffers:
> + * in this case, the refcount after decrement will eventually reach 1
> + * so here it is 2.
> + * We also trigger polling periodically after each 16 packets.
> + */
> + if (cnt <= 2 || !(cnt % 16))

Why 16? Does it make sense to make it configurable?

-vlad

> + vhost_poll_queue(&vq->poll);
> /* set len to mark this desc buffers done DMA */
> vq->heads[ubuf->desc].len = status ?
> VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
>

2012-10-30 15:52:06

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH net-next 8/8] vhost-net: reduce vq polling on tx zerocopy

On Tue, Oct 30, 2012 at 11:47:45AM -0400, Vlad Yasevich wrote:
> On 10/29/2012 11:49 AM, Michael S. Tsirkin wrote:
> >It seems that to avoid deadlocks it is enough to poll vq before
> > we are going to use the last buffer. This should be faster than
> >c70aa540c7a9f67add11ad3161096fb95233aa2e.
> >
> >Signed-off-by: Michael S. Tsirkin <[email protected]>
> >---
> > drivers/vhost/net.c | 12 ++++++++++--
> > 1 file changed, 10 insertions(+), 2 deletions(-)
> >
> >diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >index 8e9de79..3967f82 100644
> >--- a/drivers/vhost/net.c
> >+++ b/drivers/vhost/net.c
> >@@ -197,8 +197,16 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, int status)
> > {
> > struct vhost_ubuf_ref *ubufs = ubuf->ctx;
> > struct vhost_virtqueue *vq = ubufs->vq;
> >-
> >- vhost_poll_queue(&vq->poll);
> >+ int cnt = atomic_read(&ubufs->kref.refcount);
> >+
> >+ /*
> >+ * Trigger polling thread if guest stopped submitting new buffers:
> >+ * in this case, the refcount after decrement will eventually reach 1
> >+ * so here it is 2.
> >+ * We also trigger polling periodically after each 16 packets.
> >+ */
> >+ if (cnt <= 2 || !(cnt % 16))
>
> Why 16? Does it make sense to make it configurable?
>
> -vlad

Probably not but I'll add a comment explaining why.

> >+ vhost_poll_queue(&vq->poll);
> > /* set len to mark this desc buffers done DMA */
> > vq->heads[ubuf->desc].len = status ?
> > VHOST_DMA_FAILED_LEN : VHOST_DMA_DONE_LEN;
> >

2012-10-30 15:52:28

by Michael S. Tsirkin

[permalink] [raw]

Subject: Re: [PATCH net-next 2/8] skb: api to report errors for zero copy skbs

On Tue, Oct 30, 2012 at 11:44:16AM -0400, Vlad Yasevich wrote:
> On 10/29/2012 11:49 AM, Michael S. Tsirkin wrote:
> >Orphaning frags for zero copy skbs needs to allocate data in atomic
> >context so is has a chance to fail. If it does we currently discard
> >the skb which is safe, but we don't report anything to the caller,
> >so it can not recover by e.g. disabling zero copy.
> >
> >Add an API to free skb reporting such errors: this is used
> >by tun in case orphaning frags fails.
> >
> >Signed-off-by: Michael S. Tsirkin <[email protected]>
> >---
> > include/linux/skbuff.h | 1 +
> > net/core/skbuff.c | 19 +++++++++++++++++++
> > 2 files changed, 20 insertions(+)
> >
> >diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >index 8bac11b..0644432 100644
> >--- a/include/linux/skbuff.h
> >+++ b/include/linux/skbuff.h
> >@@ -568,6 +568,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
> > }
> >
> > extern void kfree_skb(struct sk_buff *skb);
> >+extern void skb_tx_error(struct sk_buff *skb, int err);
> > extern void consume_skb(struct sk_buff *skb);
> > extern void __kfree_skb(struct sk_buff *skb);
> > extern struct kmem_cache *skbuff_head_cache;
> >diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >index eb31f6e..ad99c64 100644
> >--- a/net/core/skbuff.c
> >+++ b/net/core/skbuff.c
> >@@ -635,6 +635,25 @@ void kfree_skb(struct sk_buff *skb)
> > EXPORT_SYMBOL(kfree_skb);
> >
> > /**
> >+ * kfree_skb_on_error - report an sk_buff xmit error
> >+ * @skb: buffer that triggered an error
> >+ *
> >+ * Report xmit error if a device callback is tracking this skb.
> >+ */
>
> Nit: Comment doesn't match new function.
>
> -vlad

Good catch, thanks.

> >+void skb_tx_error(struct sk_buff *skb, int err)
> >+{
> >+ if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
> >+ struct ubuf_info *uarg;
> >+
> >+ uarg = skb_shinfo(skb)->destructor_arg;
> >+ if (uarg->callback)
> >+ uarg->callback(uarg, err);
> >+ skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
> >+ }
> >+}
> >+EXPORT_SYMBOL(skb_tx_error);
> >+
> >+/**
> > * consume_skb - free an skbuff
> > * @skb: buffer to free
> > *
> >