2011-04-20 19:37:13

by Shirley Ma

Subject: [PATCH V3 0/8] macvtap/vhost TX zero copy support

This patchset adds support for TX zero-copy between guest and host
kernel through vhost. It significantly reduces CPU utilization on the
local host on which the guest is located (it reduced vhost thread CPU
usage by 30-50% in single-stream tests). The patchset is based on the
previous submission and on comments from the community regarding when
and how guest kernel buffers should be released. This is the simplest
approach I could come up with after comparing several other solutions.
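At a high level, the resulting TX path looks like this (a rough outline
for orientation only; the names follow the structures and helpers
introduced in the individual patches):

  guest virtio-net TX
    -> vhost handle_tx(): records the descriptor head and passes a
       struct skb_ubuf_info to the backend via msg_control
    -> macvtap_get_user(): pins the guest pages with
       get_user_pages_fast() and attaches them to the skb as frags,
       copying the ubuf info into skb_shinfo()
    -> lower device (cxgb3/ixgbe/benet): DMAs directly from the pinned
       guest pages
    -> skb_release_data(), on the last reference to the skb: invokes
       skb_shinfo(skb)->ubuf.callback
    -> vhost_zerocopy_callback()/vhost_zerocopy_signal_used(): mark the
       descriptor done, add it to the used ring and signal the guest
       that the buffer can be reused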

This patchset includes:

1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;

2/8: Add a new device flag, NETIF_F_ZEROCOPY, for lower-level devices
that support zero-copy;

3/8: Add a new struct skb_ubuf_info to skb_shared_info, carrying a
callback used to release the userspace buffers once the lower device
DMA has completed for that skb;

4/8: Add a vhost zero-copy callback, invoked when the skb's last refcnt
is gone, and vhost_zerocopy_signal_used() to notify the guest that it
can release its TX skb buffers;

5/8: Add macvtap zero-copy to the lower device when the packet being
sent is larger than 128 bytes;

6/8: Add the zero-copy feature flag to the Chelsio 10Gb NIC;

7/8: Add the zero-copy feature flag to the Intel 10Gb NIC;

8/8: Add the zero-copy feature flag to the Emulex 10Gb NIC.

The patchset is built against most recent linux 2.6.git. It has passed
netperf/netserver multiple streams stress test on above NICs.

Single-stream test results on the 2.6.37 kernel with the Chelsio NIC:

64K message size: copy_from_user dropped from 40% to 5%; vhost thread
CPU utilization dropped from 76% to 28%.

I am collecting more test results against the 2.6.39-rc3 kernel and will
provide the test matrix later.

Thanks
Shirley


2011-04-20 19:43:10

by Shirley Ma

Subject: [PATCH V3 1/8] Add a new sock zerocopy flag

This sock zerocopy flag is used to indicate that the lower-level device
can DMA directly from userspace buffers.
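As a usage sketch (illustrative only, not part of this patch), the flag
is meant to be set by the socket's owner and tested in the transmit
path, which is what the macvtap and vhost patches later in this series
do:

#include <net/sock.h>

/* Illustrative sketch only: a socket owner opts in to zero copy, and
 * the transmit path checks the flag per packet. */
static void example_enable_zerocopy(struct sock *sk)
{
	sock_set_flag(sk, SOCK_ZEROCOPY);
}

static int example_wants_zerocopy(struct sock *sk)
{
	return sock_flag(sk, SOCK_ZEROCOPY);
}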

Signed-off-by: Shirley Ma <[email protected]>
---

include/net/sock.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 01810a3..daa0a80 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -562,6 +562,7 @@ enum sock_flags {
SOCK_TIMESTAMPING_SYS_HARDWARE, /* %SOF_TIMESTAMPING_SYS_HARDWARE */
SOCK_FASYNC, /* fasync() active */
SOCK_RXQ_OVFL,
+ SOCK_ZEROCOPY,
};

static inline void sock_copy_flags(struct sock *nsk, struct sock *osk)

2011-04-20 19:44:17

by Shirley Ma

Subject: [PATCH V3 2/8] Add a new zerocopy device flag

This zerocopy flag indicates that the device can DMA directly from
userspace buffers.
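As an illustrative sketch (not part of the patch), a driver would
advertise the bit at probe time and macvtap would propagate it from the
lower device into the per-queue socket, which is what the driver patches
and the macvtap patch in this series do:

#include <linux/netdevice.h>
#include <net/sock.h>

/* Illustrative only: driver probe advertises the capability. */
static void example_advertise_zerocopy(struct net_device *netdev)
{
	netdev->features |= NETIF_F_HIGHDMA | NETIF_F_ZEROCOPY;
}

/* ...and macvtap_open() enables SOCK_ZEROCOPY on the queue's socket
 * only when the lower device has the feature bit set. */
static void example_propagate_zerocopy(struct sock *sk,
				       struct net_device *lowerdev)
{
	if (lowerdev->features & NETIF_F_ZEROCOPY)
		sock_set_flag(sk, SOCK_ZEROCOPY);
}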

Signed-off-by: Shirley Ma <[email protected]>
---

include/linux/netdevice.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */

+/* bit 29 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY (1 << 29)
+
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK 0x00ff0000

2011-04-20 19:48:31

by Ben Hutchings

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Wed, 2011-04-20 at 12:44 -0700, Shirley Ma wrote:
> This zerocopy flag is used to support device DMA userspace buffers.
>
> Signed-off-by: Shirley Ma <[email protected]>
> ---
>
> include/linux/netdevice.h | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
> #define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
> #define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
>
> +/* bit 29 is for device to map userspace buffers -- zerocopy */
> +#define NETIF_F_ZEROCOPY (1 << 29)

Look above.

Ben.

> /* Segmentation offload features */
> #define NETIF_F_GSO_SHIFT 16
> #define NETIF_F_GSO_MASK 0x00ff0000
>
>

--
Ben Hutchings, Senior Software Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

2011-04-20 20:07:52

by Shirley Ma

Subject: [PATCH V3 4/8] vhost TX zero copy support

This patch maintains the outstanding userspace buffers in the sequence
in which they are delivered to vhost. An outstanding userspace buffer is
marked as done once the lower device has finished its DMA; this is
detected through the callback invoked when the skb's last reference is
dropped (kfree_skb). Two buffer indices are used for this purpose.

vhost passes the userspace buffer info to the lower device's skb through
msg_control. There may already be completed DMAs when entering vhost
handle_tx; in the worst case all buffers in the vq are in pending/done
status, so we need to notify the guest to release the DMA-done buffers
before getting any new buffers from the vq.
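To make the two-index bookkeeping concrete, here is a small userspace
model (illustrative only; RING stands in for UIO_MAXIOV and the array
entries play the role of heads[i].len in the patch below):

#include <stdio.h>

#define RING 8			/* stands in for UIO_MAXIOV */

static int len[RING];		/* models heads[i].len: 1 == DMA done */
static int upend_idx, done_idx;

/* handle_tx(): hand one buffer to the NIC */
static void submit(void)
{
	len[upend_idx] = 0;
	upend_idx = (upend_idx + 1) % RING;
}

/* vhost_zerocopy_callback(): the skb's last reference is gone */
static void dma_done(int idx)
{
	len[idx] = 1;
}

/* vhost_zerocopy_signal_used(): report a contiguous run of finished
 * buffers back to the guest, in submission order */
static int signal_used(void)
{
	int n = 0;

	while (done_idx != upend_idx && len[done_idx] == 1) {
		len[done_idx] = 0;
		done_idx = (done_idx + 1) % RING;
		n++;
	}
	return n;
}

int main(void)
{
	submit(); submit(); submit();		/* three packets in flight */
	dma_done(1);				/* out-of-order completion */
	printf("reclaimed %d\n", signal_used());	/* 0: slot 0 still pending */
	dma_done(0);
	printf("reclaimed %d\n", signal_used());	/* 2: slots 0 and 1 */
	return 0;
}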

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/vhost/net.c | 30 +++++++++++++++++++++++++++-
drivers/vhost/vhost.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.h | 10 +++++++++
3 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..1bc4536 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,8 @@
* Using this limit prevents one virtqueue from starving others. */
#define VHOST_NET_WEIGHT 0x80000

+#define MAX_ZEROCOPY_PEND 64
+
enum {
VHOST_NET_VQ_RX = 0,
VHOST_NET_VQ_TX = 1,
@@ -129,6 +131,7 @@ static void handle_tx(struct vhost_net *net)
int err, wmem;
size_t hdr_size;
struct socket *sock;
+ struct skb_ubuf_info pend;

/* TODO: check that we are running from vhost_worker? */
sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +154,10 @@ static void handle_tx(struct vhost_net *net)
hdr_size = vq->vhost_hlen;

for (;;) {
+ /* Release DMAs done buffers first */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_zerocopy_signal_used(vq);
+
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
&out, &in,
@@ -166,6 +173,12 @@ static void handle_tx(struct vhost_net *net)
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
+ /* If more outstanding DMAs, queue the work */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
+ (atomic_read(&vq->refcnt) > MAX_ZEROCOPY_PEND)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
if (unlikely(vhost_enable_notify(vq))) {
vhost_disable_notify(vq);
continue;
@@ -188,17 +201,30 @@ static void handle_tx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
+ /* use msg_control to pass vhost zerocopy ubuf info to skb */
+ if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+ pend.callback = vhost_zerocopy_callback;
+ pend.arg = vq;
+ pend.desc = vq->upend_idx;
+ msg.msg_control = &pend;
+ msg.msg_controllen = sizeof(pend);
+ vq->heads[vq->upend_idx].id = head;
+ vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+ atomic_inc(&vq->refcnt);
+ }
/* TODO: Check specific error and bomb out unless ENOBUFS? */
err = sock->ops->sendmsg(NULL, sock, &msg, len);
if (unlikely(err < 0)) {
- vhost_discard_vq_desc(vq, 1);
+ if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_discard_vq_desc(vq, 1);
tx_poll_start(net, sock);
break;
}
if (err != len)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+ vhost_add_used_and_signal(&net->dev, vq, head, 0);
total_len += len;
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..09bcb1d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->call_ctx = NULL;
vq->call = NULL;
vq->log_ctx = NULL;
+ vq->upend_idx = 0;
+ vq->done_idx = 0;
+ atomic_set(&vq->refcnt, 0);
}

static int vhost_worker(void *data)
@@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct vhost_dev *dev)
UIO_MAXIOV, GFP_KERNEL);
dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log * UIO_MAXIOV,
GFP_KERNEL);
- dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads *
+ dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads *
UIO_MAXIOV, GFP_KERNEL);

if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
@@ -385,10 +388,41 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
return 0;
}

+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq)
+{
+ int i, j = 0;
+
+ i = vq->done_idx;
+ while (i != vq->upend_idx) {
+ /* len = 1 means DMA done */
+ if (vq->heads[i].len == 1) {
+ /* reset len = 0 */
+ vq->heads[i].len = 0;
+ i = (i + 1) % UIO_MAXIOV;
+ ++j;
+ } else
+ break;
+ }
+ if (j) {
+ if (i > vq->done_idx)
+ vhost_add_used_n(vq, &vq->heads[vq->done_idx], j);
+ else {
+ vhost_add_used_n(vq, &vq->heads[vq->done_idx],
+ UIO_MAXIOV - vq->done_idx);
+ vhost_add_used_n(vq, vq->heads, i);
+ }
+ vq->done_idx = i;
+ vhost_signal(vq->dev, vq);
+ atomic_sub(j, &vq->refcnt);
+ }
+}
+
/* Caller should have device mutex */
void vhost_dev_cleanup(struct vhost_dev *dev)
{
int i;
+ unsigned long begin = jiffies;
+

for (i = 0; i < dev->nvqs; ++i) {
if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
@@ -405,6 +439,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
eventfd_ctx_put(dev->vqs[i].call_ctx);
if (dev->vqs[i].call)
fput(dev->vqs[i].call);
+ /* wait for all lower device DMAs done, then notify guest */
+ while (atomic_read(&dev->vqs[i].refcnt)) {
+ if (time_after(jiffies, begin + 5 * HZ))
+ vhost_zerocopy_signal_used(&dev->vqs[i]);
+ }
vhost_vq_reset(dev, dev->vqs + i);
}
vhost_dev_free_iovecs(dev);
@@ -1416,3 +1455,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
vq_err(vq, "Failed to enable notification at %p: %d\n",
&vq->used->flags, r);
}
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+ int idx = skb_shinfo(skb)->ubuf.desc;
+ struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+ /* set len = 1 to mark this desc buffers done DMA */
+ vq->heads[idx].len = 1;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..cd2febb 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -108,6 +108,14 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log *log;
+ /* vhost zerocopy support */
+ atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+ /* index of zerocopy pending DMA buffers */
+ int upend_idx;
+ /* index of zerocopy done DMA buffers, but not notify guest yet */
+ int done_idx;
+ /* notify vhost zerocopy DMA buffers has done in lower device */
+ void (*callback)(struct sk_buff *);
};

struct vhost_dev {
@@ -154,6 +162,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);

int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq);

#define vq_err(vq, fmt, ...) do { \
pr_debug(pr_fmt(fmt), ##__VA_ARGS__); \

2011-04-20 20:10:23

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

Resubmitting this patch with the new bit.

Signed-off-by: Shirley Ma <[email protected]>
---

include/linux/netdevice.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */

+/* Bit 30 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY (1 << 30)
+
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK 0x00ff0000

2011-04-20 20:14:18

by Shirley Ma

Subject: [PATCH V3 5/8] Enable cxgb3 to support zerocopy

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/cxgb3/cxgb3_main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 9108931..93a1101 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -3313,7 +3313,7 @@ static int __devinit init_one(struct pci_dev *pdev,
netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
netdev->features |= NETIF_F_GRO;
if (pci_using_dac)
- netdev->features |= NETIF_F_HIGHDMA;
+ netdev->features |= NETIF_F_HIGHDMA | NETIF_F_ZEROCOPY;

netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
netdev->netdev_ops = &cxgb_netdev_ops;

2011-04-20 20:17:34

by Shirley Ma

Subject: [PATCH V3 8/8] Enable benet to support zerocopy

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/benet/be_main.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/benet/be_main.c b/drivers/net/benet/be_main.c
index 7cb5a11..d7b7254 100644
--- a/drivers/net/benet/be_main.c
+++ b/drivers/net/benet/be_main.c
@@ -2982,6 +2982,7 @@ static int __devinit be_probe(struct pci_dev *pdev,
status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(64));
if (!status) {
netdev->features |= NETIF_F_HIGHDMA;
+ netdev->features |= NETIF_F_ZEROCOPY;
} else {
status = dma_set_mask(&pdev->dev, DMA_BIT_MASK(32));
if (status) {

2011-04-20 20:21:10

by Shirley Ma

Subject: Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128 bytes).

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/macvtap.c | 124 ++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..b4e6656 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
static struct class *macvtap_class;
static struct cdev macvtap_cdev;

@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;

@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);

+ /*
+ * so far only VM uses macvtap, enable zero copy between guest
+ * kernel and host kernel when lower device supports high memory
+ * DMA
+ */
+ if (vlan) {
+ if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}

+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (offset >= from->iov_len) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (copy > 0) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ skb->data_len += len;
+ skb->len += len;
+ skb->truesize += len;
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,


/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int copylen, zerocopy;

+ zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;

@@ -552,12 +640,24 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;

- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
- noblock, &err);
+ if (zerocopy)
+ copylen = vnet_hdr.hdr_len;
+ else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+ vnet_hdr.hdr_len, noblock, &err);
if (!skb)
goto err;
-
- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
+ if (sock_flag(&q->sk, SOCK_ZEROCOPY))
+ memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+ sizeof(struct skb_ubuf_info));
if (err)
goto err_kfree;

@@ -579,7 +679,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
kfree_skb(skb);
rcu_read_unlock_bh();

- return count;
+ return total_len;

err_kfree:
kfree_skb(skb);
@@ -601,8 +701,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;

- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}

@@ -815,7 +915,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}


2011-04-20 20:24:30

by Dimitrios Michailidis

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On 04/20/2011 01:09 PM, Shirley Ma wrote:
> Resubmit this patch with the new bit.

Bit 30 is also taken in net-next.

>
> Signed-off-by: Shirley Ma <[email protected]>
> ---
>
> include/linux/netdevice.h | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
> #define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
> #define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
>
> +/* Bit 30 is for device to map userspace buffers -- zerocopy */
> +#define NETIF_F_ZEROCOPY (1 << 30)
> +
> /* Segmentation offload features */
> #define NETIF_F_GSO_SHIFT 16
> #define NETIF_F_GSO_MASK 0x00ff0000
>
>

2011-04-20 20:24:55

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

Thanks. I need to update it to bit 30.

Shirley

2011-04-20 20:27:23

by Shirley Ma

Subject: [PATCH V3 6/8] macvtap/vhost TX zero copy support

macvtap enables zero-copy only when the buffer size is greater than
GOODCOPY_LEN (128 bytes).

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/macvtap.c | 124 ++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 112 insertions(+), 12 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..b4e6656 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
static struct class *macvtap_class;
static struct cdev macvtap_cdev;

@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;

@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);

+ /*
+ * so far only VM uses macvtap, enable zero copy between guest
+ * kernel and host kernel when lower device supports high memory
+ * DMA
+ */
+ if (vlan) {
+ if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}

+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (offset >= from->iov_len) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (copy > 0) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ skb->data_len += len;
+ skb->len += len;
+ skb->truesize += len;
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,


/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int copylen, zerocopy;

+ zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;

@@ -552,12 +640,24 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;

- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
- noblock, &err);
+ if (zerocopy)
+ copylen = vnet_hdr.hdr_len;
+ else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen,
+ vnet_hdr.hdr_len, noblock, &err);
if (!skb)
goto err;
-
- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
+ if (sock_flag(&q->sk, SOCK_ZEROCOPY))
+ memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+ sizeof(struct skb_ubuf_info));
if (err)
goto err_kfree;

@@ -579,7 +679,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
kfree_skb(skb);
rcu_read_unlock_bh();

- return count;
+ return total_len;

err_kfree:
kfree_skb(skb);
@@ -601,8 +701,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;

- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}

@@ -815,7 +915,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}


2011-04-20 20:29:54

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Wed, 2011-04-20 at 13:24 -0700, Dimitris Michailidis wrote:
> Bit 30 is also taken in net-next.

How about 31?

Thanks
Shirley

2011-04-20 20:30:38

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

Resubmitting it with bit 31.

Signed-off-by: Shirley Ma <[email protected]>
---

include/linux/netdevice.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0249fe7..0998d3d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1067,6 +1067,9 @@ struct net_device {
#define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
#define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */

+/* Bit 31 is for device to map userspace buffers -- zerocopy */
+#define NETIF_F_ZEROCOPY (1 << 31)
+
/* Segmentation offload features */
#define NETIF_F_GSO_SHIFT 16
#define NETIF_F_GSO_MASK 0x00ff0000

2011-04-20 20:53:04

by Dimitrios Michailidis

Subject: Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy

On 04/20/2011 01:13 PM, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <[email protected]>
> ---
>
> drivers/net/cxgb3/cxgb3_main.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
> index 9108931..93a1101 100644
> --- a/drivers/net/cxgb3/cxgb3_main.c
> +++ b/drivers/net/cxgb3/cxgb3_main.c
> @@ -3313,7 +3313,7 @@ static int __devinit init_one(struct pci_dev *pdev,
> netdev->features |= NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO;
> netdev->features |= NETIF_F_GRO;
> if (pci_using_dac)
> - netdev->features |= NETIF_F_HIGHDMA;
> + netdev->features |= NETIF_F_HIGHDMA | NETIF_F_ZEROCOPY;
>
> netdev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
> netdev->netdev_ops = &cxgb_netdev_ops;

The features handling has been reworked in net-next and patches like this
won't apply as the code you're patching has changed. Also core code now
does a lot of the related work and you'll need to tell it what to do with
any new flags.

What properties does a device or driver need to meet to set this flag? It
seems to be set a bit too unconditionally. For example, does it work if one
disables scatter/gather?

2011-04-20 20:59:05

by Shirley Ma

Subject: Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy

On Wed, 2011-04-20 at 13:52 -0700, Dimitris Michailidis wrote:
> The features handling has been reworked in net-next and patches like this
> won't apply as the code you're patching has changed. Also core code now
> does a lot of the related work and you'll need to tell it what to do with
> any new flags.
Ok, will do.

> What properties does a device or driver need to meet to set this flag? It
> seems to be set a bit too unconditionally. For example, does it work if one
> disables scatter/gather?

This flag can be ON when HIGHDMA and scatter/gather are supported. I will
modify the patch to set it conditionally.

thanks
Shirley

2011-04-20 21:15:26

by Shirley Ma

Subject: Re: [PATCH V3 5/8] Enable cxgb3 to support zerocopy

On Wed, 2011-04-20 at 13:58 -0700, Shirley Ma wrote:
> This flag can be ON when HIGHDMA and scatter/gather support. I will
> modify the patch to make it conditionally.

Double-checked: it only needs the HIGHDMA condition, not scatter/gather.

thanks
Shirley

2011-04-21 01:21:43

by Shirley Ma

Subject: [PATCH V3 3/8] Add userspace buffers support in skb

This patch adds userspace buffer support to the skb. A new struct
skb_ubuf_info is needed to maintain the userspace buffer argument and
index; a callback is used to notify the caller to release the buffers
once the lower device has finished DMA (i.e. when the last reference
to that skb is gone).
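As a usage sketch (illustrative only; my_release_tx_buffer below is a
hypothetical completion handler standing in for the vhost callback added
later in this series), a zero-copy sender would wire the new field up
like this:

#include <linux/skbuff.h>

/* Hypothetical completion handler: skb_release_data() invokes it once,
 * when the last reference to the skb is dropped after the device DMA. */
static void my_release_tx_buffer(struct sk_buff *skb)
{
	size_t desc = skb_shinfo(skb)->ubuf.desc;

	/* ubuf.arg and desc identify the userspace buffer; tell its owner
	 * (e.g. mark a vhost descriptor as done) that it can be reused. */
	(void)desc;
}

/* A zero-copy sender attaches the buffer info before handing the skb
 * to the lower device. */
static void example_attach_ubuf(struct sk_buff *skb, void *arg, size_t desc)
{
	skb_shinfo(skb)->ubuf.callback = my_release_tx_buffer;
	skb_shinfo(skb)->ubuf.arg = arg;
	skb_shinfo(skb)->ubuf.desc = desc;
}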

Signed-off-by: Shirley Ma <[email protected]>
---

include/linux/skbuff.h | 14 ++++++++++++++
net/core/skbuff.c | 15 ++++++++++++++-
2 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index d0ae90a..47a187b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,6 +189,16 @@ enum {
SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
};

+/* The callback notifies userspace to release buffers when skb DMA is done in
+ * lower device, the desc is used to track userspace buffer index.
+ */
+struct skb_ubuf_info {
+ /* support buffers allocation from userspace */
+ void (*callback)(struct sk_buff *);
+ void *arg;
+ size_t desc;
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -211,6 +221,10 @@ struct skb_shared_info {
/* Intermediate layers must ensure that destructor_arg
* remains valid until skb destructor */
void * destructor_arg;
+
+ /* DMA mapping from/to userspace buffers */
+ struct skb_ubuf_info ubuf;
+
/* must be last field, see pskb_expand_head() */
skb_frag_t frags[MAX_SKB_FRAGS];
};
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7ebeed0..822c07d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
shinfo = skb_shinfo(skb);
memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
atomic_set(&shinfo->dataref, 1);
+ shinfo->ubuf.callback = NULL;
+ shinfo->ubuf.arg = NULL;
kmemcheck_annotate_variable(shinfo->destructor_arg);

if (fclone) {
@@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff *skb)
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
put_page(skb_shinfo(skb)->frags[i].page);
}
-
+ /*
+ * if skb buf is from userspace, we need to notify the caller
+ * the lower device DMA has done;
+ */
+ if (skb_shinfo(skb)->ubuf.callback) {
+ skb_shinfo(skb)->ubuf.callback(skb);
+ skb_shinfo(skb)->ubuf.callback = NULL;
+ skb_shinfo(skb)->ubuf.arg = NULL;
+ }
if (skb_has_frag_list(skb))
skb_drop_fraglist(skb);

@@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
if (irqs_disabled())
return false;

+ if (shinfo->ubuf.callback)
+ return false;
+
if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
return false;


2011-04-21 04:54:23

by Shirley Ma

Subject: [PATCH V3 7/8] Enable ixgbe to support zerocopy

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/ixgbe/ixgbe_main.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index 6f8adc7..68f1e93 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -7395,6 +7395,7 @@ static int __devinit ixgbe_probe(struct pci_dev *pdev,
#endif /* IXGBE_FCOE */
if (pci_using_dac) {
netdev->features |= NETIF_F_HIGHDMA;
+ netdev->features |= NETIF_F_ZEROCOPY;
netdev->vlan_features |= NETIF_F_HIGHDMA;
}


2011-04-21 14:29:08

by Jon Mason

Subject: Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

On Wed, Apr 20, 2011 at 3:36 PM, Shirley Ma <[email protected]> wrote:
> This patchset add supports for TX zero-copy between guest and host
> kernel through vhost. It significantly reduces CPU utilization on the
> local host on which the guest is located (It reduced 30-50% CPU usage
> for vhost thread for single stream test). The patchset is based on
> previous submission and comments from the community regarding when/how
> to handle guest kernel buffers to be released. This is the simplest
> approach I can think of after comparing with several other solutions.
>
> This patchset includes:
>
> 1/8: Add a new sock zero-copy flag, SOCK_ZEROCOPY;
>
> 2/8: Add a new device flag, NETIF_F_ZEROCOPY for lower level device
> support zero-copy;
>
> 3/8: Add a new struct skb_ubuf_info in skb_share_info for userspace
> buffers release callback when lower device DMA has done for that skb;
>
> 4/8: Add vhost zero-copy callback in vhost when skb last refcnt is gone;
> add vhost_zerocopy_add_used_and_signal to notify guest to release TX skb
> buffers.
>
> 5/8: Add macvtap zero-copy in lower device when sending packet is
> greater than 128 bytes.
>
> 6/8: Add Chelsio 10Gb NIC to zero copy feature flag
>
> 7/8: Add Intel 10Gb NIC zero copy feature flag
>
> 8/8: Add Emulex 10Gb NIC zero copy feature flag

Why are only these 3 drivers getting support? As far as I can tell,
the only requirement is HIGHDMA. If this is the case, is there really
a need for an additional flag to support this? If you can key off of
HIGHDMA, all devices that support this would get the benefit.



> The patchset is built against most recent linux 2.6.git. It has passed
> netperf/netserver multiple streams stress test on above NICs.
>
> The single stream test results from 2.6.37 kernel on Chelsio:
>
> 64K message size: copy_from_user dropped from 40% to 5%; vhost thread
> cpu utilization dropped from 76% to 28%
>
> I am collecting more test results against 2.6.39-rc3 kernel and will
> provide the test matrix later.
>
> Thanks
> Shirley
>
>

2011-04-22 17:32:02

by Shirley Ma

Subject: Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

On Thu, 2011-04-21 at 10:29 -0400, Jon Mason wrote:
> Why are only these 3 drivers getting support? As far as I can tell,
> the only requirement is HIGHDMA. If this is the case, is there really
> a need for an additional flag to support this? If you can key off of
> HIGHDMA, all devices that support this would get the benefit.

Agreed. So far, I have only verified these three 10Gb NICs we have in
our lab. We can enable all devices which support HIGHDMA.

Thanks
Shirley

2011-04-22 17:53:04

by Shirley Ma

Subject: Re: [PATCH V3 0/8] macvtap/vhost TX zero copy support

On Wed, 2011-04-20 at 12:36 -0700, Shirley Ma wrote:
> I am collecting more test results against 2.6.39-rc3 kernel and will
> provide the test matrix later.

Single TCP_STREAM 120-second test results over the ixgbe 10Gb NIC:

Message     BW(Gb/s)   qemu-kvm (NumCPU)   vhost-net (NumCPU)   PerfTop irq/s
4K          7408.57    92.1%               22.6%                1229
4K (Orig)   4913.17    118.1%              84.1%                2086
8K          9129.90    89.3%               23.3%                1141
8K (Orig)   7094.55    115.9%              84.7%                2157
16K         9178.81    89.1%               23.3%                1139
16K (Orig)  8927.1     118.7%              83.4%                2262
64K         9171.43    88.4%               24.9%                1253
64K (Orig)  9085.85    115.9%              82.4%                2229

For message sizes less than or equal to 2K, there is a known KVM guest TX
overrun issue. With this zerocopy patch the issue becomes more severe:
guest io_exits have tripled compared to before, so the performance is not
good. Once the TX overrun problem has been addressed, I will retest the
small message size performance.

Thanks
Shirley

2011-05-02 10:43:38

by Michael S. Tsirkin

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Wed, Apr 20, 2011 at 12:44:08PM -0700, Shirley Ma wrote:
> This zerocopy flag is used to support device DMA userspace buffers.
>
> Signed-off-by: Shirley Ma <[email protected]>
> ---
>
> include/linux/netdevice.h | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0249fe7..0998d3d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1067,6 +1067,9 @@ struct net_device {
> #define NETIF_F_RXHASH (1 << 28) /* Receive hashing offload */
> #define NETIF_F_RXCSUM (1 << 29) /* Receive checksumming offload */
>
> +/* bit 29 is for device to map userspace buffers -- zerocopy */

This comment should specify what exactly is the promise the
device makes by setting this flag. Specifically, the
condition is that no skb fragments are used
after the uinfo callback has been called.

The way it's implemented, it probably means the device
should not use any of skb_clone, expand head etc.

> +#define NETIF_F_ZEROCOPY (1 << 29)
> +
> /* Segmentation offload features */
> #define NETIF_F_GSO_SHIFT 16
> #define NETIF_F_GSO_MASK 0x00ff0000
>

2011-05-02 10:54:30

by Michael S. Tsirkin

Subject: Re: [PATCH V3 3/8] Add userspace buffers support in skb

On Wed, Apr 20, 2011 at 12:47:57PM -0700, Shirley Ma wrote:
> This patch adds userspace buffers support in skb. A new struct
> skb_ubuf_info is needed to maintain the userspace buffers argument
> and index, a callback is used to notify userspace to release the
> buffers once lower device has done DMA (Last reference to that skb
> has gone).
>
> Signed-off-by: Shirley Ma <[email protected]>
> ---
>
> include/linux/skbuff.h | 14 ++++++++++++++
> net/core/skbuff.c | 15 ++++++++++++++-
> 2 files changed, 28 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index d0ae90a..47a187b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -189,6 +189,16 @@ enum {
> SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
> };
>
> +/* The callback notifies userspace to release buffers when skb DMA is done in
> + * lower device, the desc is used to track userspace buffer index.
> + */
> +struct skb_ubuf_info {
> + /* support buffers allocation from userspace */
> + void (*callback)(struct sk_buff *);
> + void *arg;
> + size_t desc;
> +};
> +
> /* This data is invariant across clones and lives at
> * the end of the header data, ie. at skb->end.
> */
> @@ -211,6 +221,10 @@ struct skb_shared_info {
> /* Intermediate layers must ensure that destructor_arg
> * remains valid until skb destructor */
> void * destructor_arg;
> +
> + /* DMA mapping from/to userspace buffers */
> + struct skb_ubuf_info ubuf;
> +
> /* must be last field, see pskb_expand_head() */
> skb_frag_t frags[MAX_SKB_FRAGS];
> };
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7ebeed0..822c07d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> shinfo = skb_shinfo(skb);
> memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> atomic_set(&shinfo->dataref, 1);
> + shinfo->ubuf.callback = NULL;
> + shinfo->ubuf.arg = NULL;
> kmemcheck_annotate_variable(shinfo->destructor_arg);
>
> if (fclone) {
> @@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff *skb)
> for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> put_page(skb_shinfo(skb)->frags[i].page);
> }
> -
> + /*
> + * if skb buf is from userspace, we need to notify the caller
> + * the lower device DMA has done;
> + */
> + if (skb_shinfo(skb)->ubuf.callback) {
> + skb_shinfo(skb)->ubuf.callback(skb);
> + skb_shinfo(skb)->ubuf.callback = NULL;
> + skb_shinfo(skb)->ubuf.arg = NULL;
> + }
> if (skb_has_frag_list(skb))
> skb_drop_fraglist(skb);
>

We probably don't need to touch arg if callback is NULL?

> @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
> if (irqs_disabled())
> return false;
>
> + if (shinfo->ubuf.callback)
> + return false;
> +
> if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
> return false;

This is not the only API unsupported for these skbs, is it?
Probably need to check and fail there as well.

>

2011-05-02 18:36:28

by Shirley Ma

Subject: Re: [PATCH V3 6/8] macvtap/vhost TX zero copy support

Resubmitting the patch with a fix for a bug found by <[email protected]>
while testing the 2.6.38.2 kernel.

When virtio_net doesn't enable GSO, vnet hdr_len is 0, so we can't use
vnet hdr_len to size the linear part of the skb in the zerocopy case.

Signed-off-by: Shirley Ma <[email protected]>
---

drivers/net/macvtap.c | 124 +++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 6696e56..e8dac4d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,6 +60,7 @@ static struct proto macvtap_proto = {
*/
static dev_t macvtap_major;
#define MACVTAP_NUM_DEVS 65536
+#define GOODCOPY_LEN (L1_CACHE_BYTES < 128 ? 128 : L1_CACHE_BYTES)
static struct class *macvtap_class;
static struct cdev macvtap_cdev;

@@ -340,6 +341,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
{
struct net *net = current->nsproxy->net_ns;
struct net_device *dev = dev_get_by_index(net, iminor(inode));
+ struct macvlan_dev *vlan = netdev_priv(dev);
struct macvtap_queue *q;
int err;

@@ -369,6 +371,16 @@ static int macvtap_open(struct inode *inode, struct file *file)
q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
q->vnet_hdr_sz = sizeof(struct virtio_net_hdr);

+ /*
+ * so far only VM uses macvtap, enable zero copy between guest
+ * kernel and host kernel when lower device supports high memory
+ * DMA
+ */
+ if (vlan) {
+ if (vlan->lowerdev->features & NETIF_F_ZEROCOPY)
+ sock_set_flag(&q->sk, SOCK_ZEROCOPY);
+ }
+
err = macvtap_set_queue(dev, file, q);
if (err)
sock_put(&q->sk);
@@ -433,6 +445,80 @@ static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
return skb;
}

+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+ int offset, size_t count)
+{
+ int len = iov_length(from, count) - offset;
+ int copy = skb_headlen(skb);
+ int size, offset1 = 0;
+ int i = 0;
+ skb_frag_t *f;
+
+ /* Skip over from offset */
+ while (offset >= from->iov_len) {
+ offset -= from->iov_len;
+ ++from;
+ --count;
+ }
+
+ /* copy up to skb headlen */
+ while (copy > 0) {
+ size = min_t(unsigned int, copy, from->iov_len - offset);
+ if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+ size))
+ return -EFAULT;
+ if (copy > size) {
+ ++from;
+ --count;
+ }
+ copy -= size;
+ offset1 += size;
+ offset = 0;
+ }
+
+ if (len == offset1)
+ return 0;
+
+ while (count--) {
+ struct page *page[MAX_SKB_FRAGS];
+ int num_pages;
+ unsigned long base;
+
+ len = from->iov_len - offset1;
+ if (!len) {
+ offset1 = 0;
+ ++from;
+ continue;
+ }
+ base = (unsigned long)from->iov_base + offset1;
+ size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+ if ((num_pages != size) ||
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+ /* put_page is in skb free */
+ return -EFAULT;
+ skb->data_len += len;
+ skb->len += len;
+ skb->truesize += len;
+ while (len) {
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page[i];
+ f->page_offset = base & ~PAGE_MASK;
+ f->size = min_t(int, len, PAGE_SIZE - f->page_offset);
+ skb_shinfo(skb)->nr_frags++;
+ /* increase sk_wmem_alloc */
+ atomic_add(f->size, &skb->sk->sk_wmem_alloc);
+ base += f->size;
+ len -= f->size;
+ i++;
+ }
+ offset1 = 0;
+ ++from;
+ }
+ return 0;
+}
+
/*
* macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
* be shared with the tun/tap driver.
@@ -515,17 +601,19 @@ static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,


/* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
- const struct iovec *iv, size_t count,
- int noblock)
+static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
+ const struct iovec *iv, unsigned long total_len,
+ size_t count, int noblock)
{
struct sk_buff *skb;
struct macvlan_dev *vlan;
- size_t len = count;
+ unsigned long len = total_len;
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
+ int headlen, copylen, zerocopy;

+ zerocopy = sock_flag(&q->sk, SOCK_ZEROCOPY) && (len > GOODCOPY_LEN);
if (q->flags & IFF_VNET_HDR) {
vnet_hdr_len = q->vnet_hdr_sz;

@@ -552,12 +640,28 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
if (unlikely(len < ETH_HLEN))
goto err;

- skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
+ headlen = vnet_hdr.hdr_len;
+ if (zerocopy) {
+ /* create linear skb when vnet doesn't enable gso */
+ if (!vnet_hdr.hdr_len) {
+ headlen = GOODCOPY_LEN;
+ copylen = headlen;
+ } else
+ copylen = len;
+
+ skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, copylen, headlen,
noblock, &err);
if (!skb)
goto err;

- err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
+ if (zerocopy)
+ err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
+ else
+ err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
+ len);
+ if (sock_flag(&q->sk, SOCK_ZEROCOPY))
+ memcpy(&skb_shinfo(skb)->ubuf, m->msg_control,
+ sizeof(struct skb_ubuf_info));
if (err)
goto err_kfree;

@@ -579,7 +683,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
kfree_skb(skb);
rcu_read_unlock_bh();

- return count;
+ return total_len;

err_kfree:
kfree_skb(skb);
@@ -601,8 +705,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
ssize_t result = -ENOLINK;
struct macvtap_queue *q = file->private_data;

- result = macvtap_get_user(q, iv, iov_length(iv, count),
- file->f_flags & O_NONBLOCK);
+ result = macvtap_get_user(q, NULL, iv, iov_length(iv, count), count,
+ file->f_flags & O_NONBLOCK);
return result;
}

@@ -815,7 +919,7 @@ static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
struct msghdr *m, size_t total_len)
{
struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
- return macvtap_get_user(q, m->msg_iov, total_len,
+ return macvtap_get_user(q, m, m->msg_iov, total_len, m->msg_iovlen,
m->msg_flags & MSG_DONTWAIT);
}


2011-05-02 18:47:24

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> This comment should specify what exactly is the promise the
> device makes by setting this flag. Specifically, the
> condition is that no skb fragments are used
> after the uinfo callback has been called.
>
> The way it's implemented, it probably means the device
> should not use any of skb_clone, expand head etc.

Agreed. Or maybe force a copy when the device uses skb_clone, expand
head, etc.?

Thanks
Shirley

2011-05-02 19:53:25

by Michael S. Tsirkin

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Mon, May 02, 2011 at 11:47:08AM -0700, Shirley Ma wrote:
> On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> > This comment should specify what exactly is the promise the
> > device makes by setting this flag. Specifically, the
> > condition is that no skb fragments are used
> > after the uinfo callback has been called.
> >
> > The way it's implemented, it probably means the device
> > should not use any of skb_clone, expand head etc.
>
> Agree. Or maybe force a copy when device uses skb_clone, expand
> head ...?
>
> Thanks
> Shirley

Copying from userspace upfront without locking is probably cheaper?

--
MST

2011-05-03 17:36:55

by Shirley Ma

Subject: Re: [PATCH V3 3/8] Add userspace buffers support in skb

On Mon, 2011-05-02 at 13:53 +0300, Michael S. Tsirkin wrote:
> On Wed, Apr 20, 2011 at 12:47:57PM -0700, Shirley Ma wrote:
> > This patch adds userspace buffers support in skb. A new struct
> > skb_ubuf_info is needed to maintain the userspace buffers argument
> > and index, a callback is used to notify userspace to release the
> > buffers once lower device has done DMA (Last reference to that skb
> > has gone).
> >
> > Signed-off-by: Shirley Ma <[email protected]>
> > ---
> >
> > include/linux/skbuff.h | 14 ++++++++++++++
> > net/core/skbuff.c | 15 ++++++++++++++-
> > 2 files changed, 28 insertions(+), 1 deletions(-)
> >
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index d0ae90a..47a187b 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -189,6 +189,16 @@ enum {
> > SKBTX_DRV_NEEDS_SK_REF = 1 << 3,
> > };
> >
> > +/* The callback notifies userspace to release buffers when skb DMA is done in
> > + * lower device, the desc is used to track userspace buffer index.
> > + */
> > +struct skb_ubuf_info {
> > + /* support buffers allocation from userspace */
> > + void (*callback)(struct sk_buff *);
> > + void *arg;
> > + size_t desc;
> > +};
> > +
> > /* This data is invariant across clones and lives at
> > * the end of the header data, ie. at skb->end.
> > */
> > @@ -211,6 +221,10 @@ struct skb_shared_info {
> > /* Intermediate layers must ensure that destructor_arg
> > * remains valid until skb destructor */
> > void * destructor_arg;
> > +
> > + /* DMA mapping from/to userspace buffers */
> > + struct skb_ubuf_info ubuf;
> > +
> > /* must be last field, see pskb_expand_head() */
> > skb_frag_t frags[MAX_SKB_FRAGS];
> > };
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 7ebeed0..822c07d 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -210,6 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > shinfo = skb_shinfo(skb);
> > memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > atomic_set(&shinfo->dataref, 1);
> > + shinfo->ubuf.callback = NULL;
> > + shinfo->ubuf.arg = NULL;
> > kmemcheck_annotate_variable(shinfo->destructor_arg);
> >
> > if (fclone) {
> > @@ -327,7 +329,15 @@ static void skb_release_data(struct sk_buff *skb)
> > for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> > put_page(skb_shinfo(skb)->frags[i].page);
> > }
> > -
> > + /*
> > + * if skb buf is from userspace, we need to notify the caller
> > + * the lower device DMA has done;
> > + */
> > + if (skb_shinfo(skb)->ubuf.callback) {
> > + skb_shinfo(skb)->ubuf.callback(skb);
> > + skb_shinfo(skb)->ubuf.callback = NULL;
> > + skb_shinfo(skb)->ubuf.arg = NULL;
> > + }
> > if (skb_has_frag_list(skb))
> > skb_drop_fraglist(skb);
> >
>
> We probably don't need to touch arg if callback is NULL?

Yes.

> > @@ -480,6 +490,9 @@ bool skb_recycle_check(struct sk_buff *skb, int skb_size)
> > if (irqs_disabled())
> > return false;
> >
> > + if (shinfo->ubuf.callback)
> > + return false;
> > +
> > if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
> > return false;
>
> This is not the only API unsupported for these skbs, is it?
> Probably need to check and fail there as well.

Yes, I am going through all these skb code paths to make sure all of
them are covered.

Thanks
Shirley

2011-05-03 17:43:31

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Mon, 2011-05-02 at 22:53 +0300, Michael S. Tsirkin wrote:
> On Mon, May 02, 2011 at 11:47:08AM -0700, Shirley Ma wrote:
> > On Mon, 2011-05-02 at 13:42 +0300, Michael S. Tsirkin wrote:
> > > This comment should specify what exactly is the promise the
> > > device makes by setting this flag. Specifically, the
> > > condition is that no skb fragments are used
> > > after the uinfo callback has been called.
> > >
> > > The way it's implemented, it probably means the device
> > > should not use any of skb_clone, expand head etc.
> >
> > Agree. Or maybe force a copy when device uses skb_clone, expand
> > head ...?
> >
> > Thanks
> > Shirley
>
> Copy from userspace upfront without locking is probably cheaper?

Better to prevent this kind of skb from being used in skb_clone, expand
head, etc. for now.

Thanks
Shirley

2011-05-03 20:11:35

by Shirley Ma

Subject: Re: [PATCH V3 2/8] Add a new zerocopy device flag

On Tue, 2011-05-03 at 10:42 -0700, Shirley Ma wrote:
> Better to prevent this kind of skbs to be used in skb_clone, expand head
> for now.

I looked at the code; skb_clone shouldn't have any problem, since the
ubuf callback is only called after the lower device DMA has finished. I
can raise the zerocopy length to 256 bytes so that expanding the head
should be OK as well. So we only need to prevent skb recycling. I also
checked the device drivers; only a few of them recycle RX buffers, so
there shouldn't be any problem.

I will add more comments here to make it clear that when the ZEROCOPY
flag is set, the ubuf callback must only be called once the last
reference to the skb is gone.


Thanks
Shirley