2010-06-05 10:06:55

by Xin, Xiaohui

Subject: [RFC PATCH v7 01/19] Add a new structure for skb buffers from external sources.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/skbuff.h | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 124f90c..cf309c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -203,6 +203,18 @@ struct skb_shared_info {
void * destructor_arg;
};

+/* This structure is for an skb whose skb->data may point to
+ * an external buffer, which is not allocated from kernel space.
+ * Since the buffer is external, the shinfo and frags are
+ * external too. It also contains a destructor for itself.
+ */
+struct skb_external_page {
+ u8 *start;
+ int size;
+ struct skb_frag_struct *frags;
+ struct skb_shared_info *ushinfo;
+ void (*dtor)(struct skb_external_page *);
+};
/* We divide dataref into two halves. The higher 16 bits hold references
* to the payload part of skb->data. The lower 16 bits hold references to
* the entire skb->data. A clone of a headerless skb holds the length of
--
1.5.4.4


2010-06-05 10:07:13

by Xin, Xiaohui

Subject: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.

From: Xin Xiaohui <[email protected]>

If the buffer is external, use the callback to destruct
the buffers.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
net/core/skbuff.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 37587f0..418457c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)

static void skb_release_data(struct sk_buff *skb)
{
+ /* Check whether the skb has external buffers; we use
+ * destructor_arg here to indicate that.
+ */
+ struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
+
if (!skb->cloned ||
!atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
&skb_shinfo(skb)->dataref)) {
@@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
if (skb_has_frags(skb))
skb_drop_fraglist(skb);

+ /* If the skb has external buffers, call the destructor here,
+ * since skb->head will be kfreed right after; an skb->head from
+ * an external buffer cannot be destroyed with kfree.
+ */
+ if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
+ ext_page->dtor(ext_page);
kfree(skb->head);
}
}
--
1.5.4.4

2010-06-05 10:07:20

by Xin, Xiaohui

Subject: [RFC PATCH v7 14/19] Add header file for mp device.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/mpassthru.h | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)
create mode 100644 include/linux/mpassthru.h

diff --git a/include/linux/mpassthru.h b/include/linux/mpassthru.h
new file mode 100644
index 0000000..ba8f320
--- /dev/null
+++ b/include/linux/mpassthru.h
@@ -0,0 +1,25 @@
+#ifndef __MPASSTHRU_H
+#define __MPASSTHRU_H
+
+#include <linux/types.h>
+#include <linux/if_ether.h>
+
+/* ioctl defines */
+#define MPASSTHRU_BINDDEV _IOW('M', 213, int)
+#define MPASSTHRU_UNBINDDEV _IO('M', 214)
+
+#ifdef __KERNEL__
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+struct socket *mp_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *mp_get_socket(struct file *f)
+{
+ return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MEDIATE_PASSTHRU */
+#endif /* __KERNEL__ */
+#endif /* __MPASSTHRU_H */
--
1.5.4.4

2010-06-05 10:07:35

by Xin, Xiaohui

Subject: [RFC PATCH v7 19/19] Provides multiple submits and asynchronous notifications.

From: Xin Xiaohui <[email protected]>

The vhost-net backend currently supports only synchronous send/recv
operations. This patch provides multiple submits and asynchronous
notifications, which are needed for the zero-copy case.

Signed-off-by: Xin Xiaohui <[email protected]>
---
drivers/vhost/net.c | 255 ++++++++++++++++++++++++++++++++++++++++++++++++-
drivers/vhost/vhost.c | 120 +++++++++++++----------
drivers/vhost/vhost.h | 14 +++
3 files changed, 333 insertions(+), 56 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9777583..9a0d162 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -24,6 +24,8 @@
#include <linux/if_arp.h>
#include <linux/if_tun.h>
#include <linux/if_macvlan.h>
+#include <linux/mpassthru.h>
+#include <linux/aio.h>

#include <net/sock.h>

@@ -45,10 +47,13 @@ enum vhost_net_poll_state {
VHOST_NET_POLL_STOPPED = 2,
};

+static struct kmem_cache *notify_cache;
+
struct vhost_net {
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
struct vhost_poll poll[VHOST_NET_VQ_MAX];
+ struct kmem_cache *cache;
/* Tells us whether we are polling a socket for TX.
* We only do this when socket buffer fills up.
* Protected by tx vq lock. */
@@ -93,11 +98,146 @@ static void tx_poll_start(struct vhost_net *net, struct socket *sock)
net->tx_poll_state = VHOST_NET_POLL_STARTED;
}

+struct kiocb *notify_dequeue(struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ if (!list_empty(&vq->notifier)) {
+ iocb = list_first_entry(&vq->notifier,
+ struct kiocb, ki_list);
+ list_del(&iocb->ki_list);
+ }
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+ return iocb;
+}
+
+static void handle_iocb(struct kiocb *iocb)
+{
+ struct vhost_virtqueue *vq = iocb->private;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_add_tail(&iocb->ki_list, &vq->notifier);
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static int is_async_vq(struct vhost_virtqueue *vq)
+{
+ return (vq->link_state == VHOST_VQ_LINK_ASYNC);
+}
+
+static void handle_async_rx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ struct socket *sock)
+{
+ struct kiocb *iocb = NULL;
+ struct vhost_log *vq_log = NULL;
+ int rx_total_len = 0;
+ unsigned int head, log, in, out;
+ int size;
+
+ if (!is_async_vq(vq))
+ return;
+
+ if (sock->sk->sk_data_ready)
+ sock->sk->sk_data_ready(sock->sk, 0);
+
+ vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
+ vq->log : NULL;
+
+ while ((iocb = notify_dequeue(vq)) != NULL) {
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, iocb->ki_nbytes);
+ size = iocb->ki_nbytes;
+ head = iocb->ki_pos;
+ rx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+ kmem_cache_free(net->cache, iocb);
+
+ /* When logging is enabled, the log info must be recomputed,
+ * since these buffers sat in the async queue and may not have
+ * obtained their log info earlier.
+ */
+ if (unlikely(vq_log)) {
+ __vhost_get_vq_desc(&net->dev, vq,
+ vq->iov,
+ ARRAY_SIZE(vq->iov),
+ &out, &in, vq_log,
+ &log, head);
+ vhost_log_write(vq, vq_log, log, size);
+ }
+ if (unlikely(rx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+}
+
+static void handle_async_tx_events_notify(struct vhost_net *net,
+ struct vhost_virtqueue *vq)
+{
+ struct kiocb *iocb = NULL;
+ struct list_head *entry, *tmp;
+ unsigned long flags;
+ int tx_total_len = 0;
+
+ if (!is_async_vq(vq))
+ return;
+ spin_lock_irqsave(&vq->notify_lock, flags);
+ list_for_each_safe(entry, tmp, &vq->notifier) {
+ iocb = list_entry(entry,
+ struct kiocb, ki_list);
+ if (!iocb->ki_flags)
+ continue;
+ list_del(&iocb->ki_list);
+ vhost_add_used_and_signal(&net->dev, vq,
+ iocb->ki_pos, 0);
+ tx_total_len += iocb->ki_nbytes;
+
+ if (iocb->ki_dtor)
+ iocb->ki_dtor(iocb);
+
+ kmem_cache_free(net->cache, iocb);
+ if (unlikely(tx_total_len >= VHOST_NET_WEIGHT)) {
+ vhost_poll_queue(&vq->poll);
+ break;
+ }
+ }
+ spin_unlock_irqrestore(&vq->notify_lock, flags);
+}
+
+static struct kiocb *create_iocb(struct vhost_net *net,
+ struct vhost_virtqueue *vq,
+ unsigned head)
+{
+ struct kiocb *iocb = NULL;
+
+ if (!is_async_vq(vq))
+ return NULL;
+
+ iocb = kmem_cache_zalloc(net->cache, GFP_KERNEL);
+ if (!iocb)
+ return NULL;
+ iocb->private = vq;
+ iocb->ki_pos = head;
+ iocb->ki_dtor = handle_iocb;
+ if (vq == &net->dev.vqs[VHOST_NET_VQ_RX]) {
+ iocb->ki_user_data = vq->num;
+ iocb->ki_iovec = vq->hdr;
+ }
+ return iocb;
+}
+
/* Expects to be always run from workqueue - which acts as
* read-size critical section for our kind of RCU. */
static void handle_tx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_TX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, s;
struct msghdr msg = {
.msg_name = NULL,
@@ -130,6 +270,8 @@ static void handle_tx(struct vhost_net *net)
tx_poll_stop(net);
hdr_size = vq->hdr_size;

+ handle_async_tx_events_notify(net, vq);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -157,6 +299,13 @@ static void handle_tx(struct vhost_net *net)
/* Skip header. TODO: support TSO. */
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, out);
msg.msg_iovlen = out;
+
+ if (is_async_vq(vq)) {
+ iocb = create_iocb(net, vq, head);
+ if (!iocb)
+ break;
+ }
+
len = iov_length(vq->iov, out);
/* Sanity check */
if (!len) {
@@ -166,12 +315,18 @@ static void handle_tx(struct vhost_net *net)
break;
}
/* TODO: Check specific error and bomb out unless ENOBUFS? */
- err = sock->ops->sendmsg(NULL, sock, &msg, len);
+ err = sock->ops->sendmsg(iocb, sock, &msg, len);
if (unlikely(err < 0)) {
+ if (is_async_vq(vq))
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
tx_poll_start(net, sock);
break;
}
+
+ if (is_async_vq(vq))
+ continue;
+
if (err != len)
pr_err("Truncated TX packet: "
" len %d != %zd\n", err, len);
@@ -183,6 +338,8 @@ static void handle_tx(struct vhost_net *net)
}
}

+ handle_async_tx_events_notify(net, vq);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
@@ -192,6 +349,7 @@ static void handle_tx(struct vhost_net *net)
static void handle_rx(struct vhost_net *net)
{
struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
+ struct kiocb *iocb = NULL;
unsigned head, out, in, log, s;
struct vhost_log *vq_log;
struct msghdr msg = {
@@ -212,7 +370,8 @@ static void handle_rx(struct vhost_net *net)
int err;
size_t hdr_size;
struct socket *sock = rcu_dereference(vq->private_data);
- if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
+ if (!sock || (skb_queue_empty(&sock->sk->sk_receive_queue) &&
+ vq->link_state == VHOST_VQ_LINK_SYNC))
return;

use_mm(net->dev.mm);
@@ -220,9 +379,17 @@ static void handle_rx(struct vhost_net *net)
vhost_disable_notify(vq);
hdr_size = vq->hdr_size;

+ /* In the async case, when write logging is enabled, buffers
+ * submitted before logging was enabled did not get log info,
+ * so we recompute the log info when needed. We do this in
+ * handle_async_rx_events_notify().
+ */
+
vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ?
vq->log : NULL;

+ handle_async_rx_events_notify(net, vq, sock);
+
for (;;) {
head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
ARRAY_SIZE(vq->iov),
@@ -251,6 +418,13 @@ static void handle_rx(struct vhost_net *net)
s = move_iovec_hdr(vq->iov, vq->hdr, hdr_size, in);
msg.msg_iovlen = in;
len = iov_length(vq->iov, in);
+
+ if (is_async_vq(vq)) {
+ iocb = create_iocb(net, vq, head);
+ if (!iocb)
+ break;
+ }
+
/* Sanity check */
if (!len) {
vq_err(vq, "Unexpected header len for RX: "
@@ -258,13 +432,20 @@ static void handle_rx(struct vhost_net *net)
iov_length(vq->hdr, s), hdr_size);
break;
}
- err = sock->ops->recvmsg(NULL, sock, &msg,
+
+ err = sock->ops->recvmsg(iocb, sock, &msg,
len, MSG_DONTWAIT | MSG_TRUNC);
/* TODO: Check specific error and bomb out unless EAGAIN? */
if (err < 0) {
+ if (is_async_vq(vq))
+ kmem_cache_free(net->cache, iocb);
vhost_discard_vq_desc(vq);
break;
}
+
+ if (is_async_vq(vq))
+ continue;
+
/* TODO: Should check and handle checksum. */
if (err > len) {
pr_err("Discarded truncated rx packet: "
@@ -290,6 +471,8 @@ static void handle_rx(struct vhost_net *net)
}
}

+ handle_async_rx_events_notify(net, vq, sock);
+
mutex_unlock(&vq->mutex);
unuse_mm(net->dev.mm);
}
@@ -343,6 +526,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, POLLOUT);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, POLLIN);
n->tx_poll_state = VHOST_NET_POLL_DISABLED;
+ n->cache = NULL;

f->private_data = n;

@@ -406,6 +590,21 @@ static void vhost_net_flush(struct vhost_net *n)
vhost_net_flush_vq(n, VHOST_NET_VQ_RX);
}

+static void vhost_async_cleanup(struct vhost_net *n)
+{
+ /* clean the notifier */
+ struct vhost_virtqueue *vq;
+ struct kiocb *iocb = NULL;
+ if (n->cache) {
+ vq = &n->dev.vqs[VHOST_NET_VQ_RX];
+ while ((iocb = notify_dequeue(vq)) != NULL)
+ kmem_cache_free(n->cache, iocb);
+ vq = &n->dev.vqs[VHOST_NET_VQ_TX];
+ while ((iocb = notify_dequeue(vq)) != NULL)
+ kmem_cache_free(n->cache, iocb);
+ }
+}
+
static int vhost_net_release(struct inode *inode, struct file *f)
{
struct vhost_net *n = f->private_data;
@@ -422,6 +621,7 @@ static int vhost_net_release(struct inode *inode, struct file *f)
/* We do an extra flush before freeing memory,
* since jobs can re-queue themselves. */
vhost_net_flush(n);
+ vhost_async_cleanup(n);
kfree(n);
return 0;
}
@@ -473,21 +673,58 @@ static struct socket *get_tap_socket(int fd)
return sock;
}

-static struct socket *get_socket(int fd)
+static struct socket *get_mp_socket(int fd)
+{
+ struct file *file = fget(fd);
+ struct socket *sock;
+ if (!file)
+ return ERR_PTR(-EBADF);
+ sock = mp_get_socket(file);
+ if (IS_ERR(sock))
+ fput(file);
+ return sock;
+}
+
+static struct socket *get_socket(struct vhost_virtqueue *vq, int fd,
+ enum vhost_vq_link_state *state)
{
struct socket *sock;
/* special case to disable backend */
if (fd == -1)
return NULL;
+
+ *state = VHOST_VQ_LINK_SYNC;
+
sock = get_raw_socket(fd);
if (!IS_ERR(sock))
return sock;
sock = get_tap_socket(fd);
if (!IS_ERR(sock))
return sock;
+ /* If we don't have a notify_cache, don't do mpassthru */
+ if (!notify_cache)
+ return ERR_PTR(-ENOTSOCK);
+ sock = get_mp_socket(fd);
+ if (!IS_ERR(sock)) {
+ *state = VHOST_VQ_LINK_ASYNC;
+ return sock;
+ }
return ERR_PTR(-ENOTSOCK);
}

+static void vhost_init_link_state(struct vhost_net *n, int index)
+{
+ struct vhost_virtqueue *vq = n->vqs + index;
+
+ WARN_ON(!mutex_is_locked(&vq->mutex));
+ if (vq->link_state == VHOST_VQ_LINK_ASYNC) {
+ INIT_LIST_HEAD(&vq->notifier);
+ spin_lock_init(&vq->notify_lock);
+ if (!n->cache)
+ n->cache = notify_cache;
+ }
+}
+
static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
{
struct socket *sock, *oldsock;
@@ -511,12 +748,14 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
r = -EFAULT;
goto err_vq;
}
- sock = get_socket(fd);
+ sock = get_socket(vq, fd, &vq->link_state);
if (IS_ERR(sock)) {
r = PTR_ERR(sock);
goto err_vq;
}

+ vhost_init_link_state(n, index);
+
/* start polling new socket */
oldsock = vq->private_data;
if (sock == oldsock)
@@ -650,6 +889,10 @@ int vhost_net_init(void)
r = misc_register(&vhost_net_misc);
if (r)
goto err_reg;
+
+ notify_cache = kmem_cache_create("vhost_kiocb",
+ sizeof(struct kiocb), 0,
+ SLAB_HWCACHE_ALIGN, NULL);
return 0;
err_reg:
vhost_cleanup();
@@ -663,6 +906,8 @@ void vhost_net_exit(void)
{
misc_deregister(&vhost_net_misc);
vhost_cleanup();
+ if (notify_cache)
+ kmem_cache_destroy(notify_cache);
}
module_exit(vhost_net_exit);

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e69d238..ad3779c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -861,61 +861,17 @@ static unsigned get_indirect(struct vhost_dev *dev, struct vhost_virtqueue *vq,
return 0;
}

-/* This looks in the virtqueue and for the first available buffer, and converts
- * it to an iovec for convenient access. Since descriptors consist of some
- * number of output then some number of input descriptors, it's actually two
- * iovecs, but we pack them into one and note how many of each there were.
- *
- * This function returns the descriptor number found, or vq->num (which
- * is never a valid descriptor number) if none was found. */
-unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
- struct iovec iov[], unsigned int iov_size,
- unsigned int *out_num, unsigned int *in_num,
- struct vhost_log *log, unsigned int *log_num)
+/* This computes the log info according to the index of buffer */
+unsigned __vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head)
{
struct vring_desc desc;
- unsigned int i, head, found = 0;
- u16 last_avail_idx;
+ unsigned int i = head, found = 0;
int ret;

- /* Check it isn't doing very strange things with descriptor numbers. */
- last_avail_idx = vq->last_avail_idx;
- if (get_user(vq->avail_idx, &vq->avail->idx)) {
- vq_err(vq, "Failed to access avail idx at %p\n",
- &vq->avail->idx);
- return vq->num;
- }
-
- if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
- vq_err(vq, "Guest moved used index from %u to %u",
- last_avail_idx, vq->avail_idx);
- return vq->num;
- }
-
- /* If there's nothing new since last we looked, return invalid. */
- if (vq->avail_idx == last_avail_idx)
- return vq->num;
-
- /* Only get avail ring entries after they have been exposed by guest. */
- smp_rmb();
-
- /* Grab the next descriptor number they're advertising, and increment
- * the index we've seen. */
- if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
- vq_err(vq, "Failed to read head: idx %d address %p\n",
- last_avail_idx,
- &vq->avail->ring[last_avail_idx % vq->num]);
- return vq->num;
- }
-
- /* If their number is silly, that's an error. */
- if (head >= vq->num) {
- vq_err(vq, "Guest says index %u > %u is available",
- head, vq->num);
- return vq->num;
- }
-
- /* When we start there are none of either input nor output. */
*out_num = *in_num = 0;
if (unlikely(log))
*log_num = 0;
@@ -979,8 +935,70 @@ unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
*out_num += ret;
}
} while ((i = next_desc(&desc)) != -1);
+ return head;
+}
+
+/* This looks in the virtqueue and for the first available buffer, and converts
+ * it to an iovec for convenient access. Since descriptors consist of some
+ * number of output then some number of input descriptors, it's actually two
+ * iovecs, but we pack them into one and note how many of each there were.
+ *
+ * This function returns the descriptor number found, or vq->num (which
+ * is never a valid descriptor number) if none was found. */
+unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct vhost_virtqueue *vq,
+ struct iovec iov[], unsigned int iov_size,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num)
+{
+ struct vring_desc desc;
+ unsigned int i, head, found = 0;
+ u16 last_avail_idx;
+ int ret;
+
+ /* Check it isn't doing very strange things with descriptor numbers. */
+ last_avail_idx = vq->last_avail_idx;
+ if (get_user(vq->avail_idx, &vq->avail->idx)) {
+ vq_err(vq, "Failed to access avail idx at %p\n",
+ &vq->avail->idx);
+ return vq->num;
+ }
+
+ if ((u16)(vq->avail_idx - last_avail_idx) > vq->num) {
+ vq_err(vq, "Guest moved used index from %u to %u",
+ last_avail_idx, vq->avail_idx);
+ return vq->num;
+ }
+
+ /* If there's nothing new since last we looked, return invalid. */
+ if (vq->avail_idx == last_avail_idx)
+ return vq->num;
+
+ /* Only get avail ring entries after they have been exposed by guest. */
+ smp_rmb();
+
+ /* Grab the next descriptor number they're advertising, and increment
+ * the index we've seen. */
+ if (get_user(head, &vq->avail->ring[last_avail_idx % vq->num])) {
+ vq_err(vq, "Failed to read head: idx %d address %p\n",
+ last_avail_idx,
+ &vq->avail->ring[last_avail_idx % vq->num]);
+ return vq->num;
+ }
+
+ /* If their number is silly, that's an error. */
+ if (head >= vq->num) {
+ vq_err(vq, "Guest says index %u > %u is available",
+ head, vq->num);
+ return vq->num;
+ }
+
+ ret = __vhost_get_vq_desc(dev, vq, iov, iov_size,
+ out_num, in_num,
+ log, log_num, head);

/* On success, increment avail index. */
+ if (ret == vq->num)
+ return ret;
vq->last_avail_idx++;
return head;
}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 44591ba..3c9cbce 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -43,6 +43,11 @@ struct vhost_log {
u64 len;
};

+enum vhost_vq_link_state {
+ VHOST_VQ_LINK_SYNC = 0,
+ VHOST_VQ_LINK_ASYNC = 1,
+};
+
/* The virtqueue structure describes a queue attached to a device. */
struct vhost_virtqueue {
struct vhost_dev *dev;
@@ -96,6 +101,10 @@ struct vhost_virtqueue {
/* Log write descriptors */
void __user *log_base;
struct vhost_log log[VHOST_NET_MAX_SG];
+ /* Differentiate async zero-copy sockets from normal ones */
+ enum vhost_vq_link_state link_state;
+ struct list_head notifier;
+ spinlock_t notify_lock;
};

struct vhost_dev {
@@ -124,6 +133,11 @@ unsigned vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
struct iovec iov[], unsigned int iov_count,
unsigned int *out_num, unsigned int *in_num,
struct vhost_log *log, unsigned int *log_num);
+unsigned __vhost_get_vq_desc(struct vhost_dev *, struct vhost_virtqueue *,
+ struct iovec iov[], unsigned int iov_count,
+ unsigned int *out_num, unsigned int *in_num,
+ struct vhost_log *log, unsigned int *log_num,
+ unsigned int head);
void vhost_discard_vq_desc(struct vhost_virtqueue *);

int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
--
1.5.4.4

2010-06-05 10:07:29

by Xin, Xiaohui

Subject: [RFC PATCH v7 18/19] Add a Kconfig entry and Makefile entry for the mp device.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
drivers/vhost/Kconfig | 10 ++++++++++
drivers/vhost/Makefile | 2 ++
2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index e4e2fd1..a6b8cbf 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -9,3 +9,13 @@ config VHOST_NET
To compile this driver as a module, choose M here: the module will
be called vhost_net.

+config MEDIATE_PASSTHRU
+ tristate "mediate passthru network driver (EXPERIMENTAL)"
+ depends on VHOST_NET
+ ---help---
+ Zero-copy network I/O support; we call it mediate passthru to
+ distinguish it from hardware passthru.
+
+ To compile this driver as a module, choose M here: the module will
+ be called mpassthru.
+
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 72dd020..c18b9fc 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -1,2 +1,4 @@
obj-$(CONFIG_VHOST_NET) += vhost_net.o
vhost_net-y := vhost.o net.o
+
+obj-$(CONFIG_MEDIATE_PASSTHRU) += mpassthru.o
--
1.5.4.4

2010-06-05 10:08:24

by Xin, Xiaohui

Subject: [RFC PATCH v7 00/19] Provide a zero-copy method on KVM virtio-net.

We provide a zero-copy method with which the driver side may get external
buffers to DMA into. Here, external means the driver does not use kernel
space to allocate skb buffers. Currently the external buffers can come
from the guest virtio-net driver.

The idea is simple: pin the guest VM user space and then let the
host NIC driver DMA to it directly.
The patches are based on the vhost-net backend driver. We add a device
which provides proto_ops such as sendmsg/recvmsg to vhost-net to
send/recv directly to/from the NIC driver. A KVM guest using the
vhost-net backend may bind any ethX interface on the host side to
get copyless data transfer through the guest virtio-net frontend.

patch 01-13: net core changes.
patch 14-18: new device as an interface to manipulate external buffers.
patch 19: for vhost-net.

The guest virtio-net driver submits multiple requests through the
vhost-net backend driver to the kernel. The requests are queued and
then completed after the corresponding actions in h/w are done.

For read, user-space buffers are dispensed to the NIC driver for rx when
a page constructor API is invoked; this means NICs can allocate user
buffers from a page constructor. We add a hook in the netif_receive_skb()
function to intercept the incoming packets and notify the zero-copy device.

For write, the zero-copy device may allocate a new host skb, put the
payload on skb_shinfo(skb)->frags, and copy the header to skb->data.
The request remains pending until the skb is transmitted by h/w.

We have considered two ways to utilize the page constructor
API to dispense the user buffers.

One: Modify the __alloc_skb() function a bit so that it only allocates
the sk_buff structure, with the data pointer pointing to a
user buffer that comes from a page constructor API.
The shinfo of the skb then also comes from the guest.
When a packet is received from hardware, skb->data is filled
directly by h/w. This is the way we have implemented it.

Pros: We can avoid any copy here.
Cons: The guest virtio-net driver needs to allocate the skb in
almost the same way as the host NIC drivers, i.e. the size
used by netdev_alloc_skb() and the same reserved space at
the head of the skb. Many NIC drivers match the guest and
are fine with this, but some of the latest NIC drivers
reserve special room in the skb head. To deal with that, we
suggest providing a method in the guest virtio-net driver to
ask the NIC driver for the parameters we are interested in,
once we know which device is bound for zero-copy, and then
have the guest follow them.


Two: Modify the driver to get user buffers allocated from a page
constructor API (substituting for alloc_page()); the user buffers are
used as payload buffers and filled by h/w directly when a packet is
received. The driver should associate the pages with the skb
(skb_shinfo(skb)->frags). For the head buffer, let the host allocate
the skb and let h/w fill it. After that, the data filled into the host
skb header is copied into the guest header buffer, which is submitted
together with the payload buffer.

Pros: We care less about how the guest or host allocates its buffers.
Cons: We still need a small copy here for the skb header.

We are not sure which way is better here; this is the first thing we
want comments on from the community. We wish the modification to the
network part to be generic, used not only by the vhost-net backend but
also by user applications, once the zero-copy device provides async
read/write operations later.

We have received comments from Michael, who said the first method will
break the compatibility of the virtio-net driver and may complicate
qemu live migration. Currently, we ignore the skb_reserve() if the
device is doing zero-copy, so the guest virtio-net driver is not
changed, and we continue to go with the first way.
Comments about the two ways are still appreciated.

We provide multiple submits and asynchronous notification to
vhost-net too.

Our goal is to improve the bandwidth and reduce the CPU usage.
Exact performance data will be provided later. But in a simple
test with netperf, we found bandwidth up and CPU % up too,
though the bandwidth increase is much larger than the CPU % increase.

What we have not done yet:
packet split support
GRO support
performance tuning

what we have done in v1:
polish the RCU usage
deal with write logging in asynchronous mode in vhost
add notifier block for mp device
rename page_ctor to mp_port in netdevice.h to make it look generic
add mp_dev_change_flags() for mp device to change NIC state
add CONFIG_VHOST_MPASSTHRU to limit the usage when the module is not loaded
a small fix for a missing dev_put on failure
use dynamic minor instead of static minor number
a __KERNEL__ guard for mp_get_sock()

what we have done in v2:

remove most of the RCU usage, since the ctor pointer is only
changed by the BIND/UNBIND ioctls, and during that time the NIC is
stopped for a clean teardown (all outstanding requests are finished),
so the ctor pointer cannot race into a wrong state.

Replace struct vhost_notifier with struct kiocb.
Let the vhost-net backend alloc/free the kiocbs and transfer them
via sendmsg/recvmsg.

use get_user_pages_fast() and set_page_dirty_lock() when reading.

Add some comments for netdev_mp_port_prep() and handle_mpassthru().

what we have done in v3:
the async write logging is rewritten
a drafted synchronous write function for qemu live migration
a limit on locked pages from get_user_pages_fast() to prevent DoS,
by using RLIMIT_MEMLOCK


what we have done in v4:
add iocb completion callback from vhost-net to queue iocbs in the mp device
replace vq->receiver with mp_sock_data_ready()
remove stuff in the mp device which accessed structures from vhost-net
modify skb_reserve() to ignore host NIC driver reserved space
rebase to the latest vhost tree
split large patches into small pieces, especially for the net core part.


what we have done in v5:
address Arnd Bergmann's comments
-remove IFF_MPASSTHRU_EXCL flag in mp device
-Add CONFIG_COMPAT macro
-remove mp_release ops
make dev_is_mpassthru() an inline function
fix a bug in memory relinquish
Apply to current git (2.6.34-rc6) tree.

what we have done in v6:
move create_iocb() out of page_dtor, which may run in interrupt context
-This removes the potential issue of taking a lock in interrupt context
make the cache used by mp and vhost static, created/destroyed in the
modules' init/exit functions
-This allows multiple mp guests to be created at the same time.

what we have done in v7:
some cleanup in preparation to support PS mode

performance:
using netperf with GSO/TSO disabled, on a 10G NIC,
with packet split mode disabled, raw socket case compared to vhost:

bandwidth goes from 1.1Gbps to 1.7Gbps
CPU % from 120%-140% to 140%-160%

We have retested the performance based on 2.6.34-rc6 in the above situation.
                    BW        CPU %
vhost               1.4Gbps   120% ~ 130%
vhost + zero-copy   2.7Gbps   160% ~ 180%

2010-06-05 10:08:45

by Xin, Xiaohui

Subject: [RFC PATCH v7 17/19] Export proto_ops to vhost-net driver.

From: Xin Xiaohui <[email protected]>

Currently, vhost-net is the only user of the mp device.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
drivers/vhost/mpassthru.c | 295 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 290 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 8c48898..23755ba 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -414,6 +414,11 @@ static void mp_put(struct mp_file *mfile)
mp_detach(mfile->mp);
}

+static void iocb_tag(struct kiocb *iocb)
+{
+ iocb->ki_flags = 1;
+}
+
/* The callback to destruct the external buffers or skb */
static void page_dtor(struct skb_external_page *ext_page)
{
@@ -449,7 +454,7 @@ static void page_dtor(struct skb_external_page *ext_page)
* Queue the notifier to wake up the backend driver
*/

- create_iocb(info, info->total);
+ iocb_tag(info->iocb);

sk = ctor->port.sock->sk;
sk->sk_write_space(sk);
@@ -569,8 +574,288 @@ failed:
return NULL;
}

+static void mp_sock_destruct(struct sock *sk)
+{
+ struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+ kfree(mp);
+}
+
+static void mp_sock_state_change(struct sock *sk)
+{
+ if (sk_has_sleeper(sk))
+ wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN);
+}
+
+static void mp_sock_write_space(struct sock *sk)
+{
+ if (sk_has_sleeper(sk))
+ wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT);
+}
+
+static void mp_sock_data_ready(struct sock *sk, int coming)
+{
+ struct mp_struct *mp = container_of(sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor = NULL;
+ struct sk_buff *skb = NULL;
+ struct page_info *info = NULL;
+ struct ethhdr *eth;
+ struct kiocb *iocb = NULL;
+ int len, i;
+
+ struct virtio_net_hdr hdr = {
+ .flags = 0,
+ .gso_type = VIRTIO_NET_HDR_GSO_NONE
+ };
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return;
+
+ while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
+ if (skb_shinfo(skb)->destructor_arg) {
+ info = container_of(skb_shinfo(skb)->destructor_arg,
+ struct page_info, ext_page);
+ info->skb = skb;
+ if (skb->len > info->len) {
+ mp->dev->stats.rx_dropped++;
+ DBG(KERN_INFO "Discarded truncated rx packet: "
+ " len %d > %zd\n", skb->len, info->len);
+ info->total = skb->len;
+ goto clean;
+ } else {
+ eth = eth_hdr(skb);
+ skb_push(skb, ETH_HLEN);
+ info->total = skb->len;
+ }
+ } else {
+ /* The skb is composed of kernel buffers,
+ * in case the external buffers are not sufficient.
+ * This case should be rare.
+ */
+ unsigned long flags;
+ int i;
+ info = NULL;
+
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq,
+ struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ if (!info) {
+ DBG(KERN_INFO
+ "No external buffer available %p\n",
+ skb);
+ skb_queue_head(&sk->sk_receive_queue,
+ skb);
+ break;
+ }
+ info->skb = skb;
+ eth = eth_hdr(skb);
+ skb_push(skb, ETH_HLEN);
+ info->total = skb->len;
+ skb_copy_datagram_iovec(skb, 0, info->iov, skb->len);
+ }
+
+ len = memcpy_toiovec(info->hdr, (unsigned char *)&hdr,
+ sizeof hdr);
+ if (len) {
+ DBG(KERN_INFO
+ "Unable to write vnet_hdr at addr %p: %d\n",
+ info->hdr->iov_base, len);
+ goto clean;
+ }
+
+ iocb = create_iocb(info, skb->len + sizeof(hdr));
+ continue;
+
+clean:
+ kfree_skb(skb);
+ for (i = 0; info->pages[i]; i++)
+ put_page(info->pages[i]);
+ kmem_cache_free(ext_page_info_cache, info);
+ }
+ return;
+}
+
+static int mp_sendmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *m, size_t total_len)
+{
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor;
+ struct iovec *iov = m->msg_iov;
+ struct page_info *info = NULL;
+ struct frag frags[MAX_SKB_FRAGS];
+ struct sk_buff *skb;
+ int count = m->msg_iovlen;
+ int total = 0, header, n, i, len, rc;
+ unsigned long base;
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -ENODEV;
+
+ total = iov_length(iov, count);
+
+ if (total < ETH_HLEN)
+ return -EINVAL;
+
+ if (total <= COPY_THRESHOLD)
+ goto copy;
+
+ n = 0;
+ for (i = 0; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base;
+ len = iov[i].iov_len;
+ if (!len)
+ continue;
+ n += ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ if (n > MAX_SKB_FRAGS)
+ return -EINVAL;
+ }
+
+copy:
+ header = total > COPY_THRESHOLD ? COPY_HDR_LEN : total;
+
+ skb = alloc_skb(header + NET_IP_ALIGN, GFP_ATOMIC);
+ if (!skb)
+ goto drop;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+
+ skb_set_network_header(skb, ETH_HLEN);
+
+ memcpy_fromiovec(skb->data, iov, header);
+ skb_put(skb, header);
+ skb->protocol = *((__be16 *)(skb->data) + ETH_ALEN);
+
+ if (header == total) {
+ rc = total;
+ info = alloc_small_page_info(ctor, iocb, total);
+ } else {
+ info = alloc_page_info(ctor, iocb, iov, count, frags, 0, total);
+ if (info)
+ for (i = 0; info->pages[i]; i++) {
+ skb_add_rx_frag(skb, i, info->pages[i],
+ frags[i].offset, frags[i].size);
+ info->pages[i] = NULL;
+ }
+ }
+ if (info != NULL) {
+ info->desc_pos = iocb->ki_pos;
+ info->total = total;
+ info->skb = skb;
+ skb_shinfo(skb)->destructor_arg = &info->ext_page;
+ skb->dev = mp->dev;
+ ctor->wq_len++;
+ create_iocb(info, info->total);
+ dev_queue_xmit(skb);
+ return 0;
+ }
+drop:
+ kfree_skb(skb);
+ if (info) {
+ for (i = 0; info->pages[i]; i++)
+ put_page(info->pages[i]);
+ kmem_cache_free(ext_page_info_cache, info);
+ }
+ mp->dev->stats.tx_dropped++;
+ return -ENOMEM;
+}
+
+static int mp_recvmsg(struct kiocb *iocb, struct socket *sock,
+ struct msghdr *m, size_t total_len,
+ int flags)
+{
+ struct mp_struct *mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ struct page_ctor *ctor;
+ struct iovec *iov = m->msg_iov;
+ int count = m->msg_iovlen;
+ int npages, payload;
+ struct page_info *info;
+ struct frag frags[MAX_SKB_FRAGS];
+ unsigned long base;
+ int i, len;
+ unsigned long flag;
+
+ if (!(flags & MSG_DONTWAIT))
+ return -EINVAL;
+
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -EINVAL;
+
+ /* Error detection in case of an invalid external buffer */
+ if (count > 2 && iov[1].iov_len < ctor->port.hdr_len &&
+ mp->dev->features & NETIF_F_SG) {
+ return -EINVAL;
+ }
+
+ npages = ctor->port.npages;
+ payload = ctor->port.data_len;
+
+ /* If the KVM guest virtio-net FE driver uses the SG feature */
+ if (count > 2) {
+ for (i = 2; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base & ~PAGE_MASK;
+ len = iov[i].iov_len;
+ if (npages == 1)
+ len = min_t(int, len, PAGE_SIZE - base);
+ else if (base)
+ break;
+ payload -= len;
+ if (payload <= 0)
+ goto proceed;
+ if (npages == 1 || (len & ~PAGE_MASK))
+ break;
+ }
+ }
+
+ if (((long)((unsigned long)iov[1].iov_base & ~PAGE_MASK)
+ - NET_SKB_PAD - NET_IP_ALIGN) >= 0)
+ goto proceed;
+
+ return -EINVAL;
+
+proceed:
+ /* skip the virtio-net header */
+ iov++;
+ count--;
+
+ if (!ctor->lock_pages || !ctor->rq_len)
+ set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+ iocb->ki_user_data * 4096 * 2,
+ iocb->ki_user_data * 4096 * 2);
+
+ /* Translate address to kernel */
+ info = alloc_page_info(ctor, iocb, iov, count, frags, npages, 0);
+ if (!info)
+ return -ENOMEM;
+ info->len = total_len;
+ info->hdr[0].iov_base = iocb->ki_iovec[0].iov_base;
+ info->hdr[0].iov_len = iocb->ki_iovec[0].iov_len;
+ info->offset = frags[0].offset;
+ info->desc_pos = iocb->ki_pos;
+
+ iov--;
+ count++;
+
+ memcpy(info->iov, iov, sizeof(struct iovec) * count);
+
+ spin_lock_irqsave(&ctor->read_lock, flag);
+ list_add_tail(&info->list, &ctor->readq);
+ spin_unlock_irqrestore(&ctor->read_lock, flag);
+
+ ctor->rq_len++;
+
+ return 0;
+}
+
/* Ops structure to mimic raw sockets with mp device */
static const struct proto_ops mp_socket_ops = {
+ .sendmsg = mp_sendmsg,
+ .recvmsg = mp_recvmsg,
};

static struct proto mp_proto = {
@@ -693,10 +978,10 @@ static long mp_chr_ioctl(struct file *file, unsigned int cmd,
sk->sk_sndbuf = INT_MAX;
container_of(sk, struct mp_sock, sk)->mp = mp;

- sk->sk_destruct = NULL;
- sk->sk_data_ready = NULL;
- sk->sk_write_space = NULL;
- sk->sk_state_change = NULL;
+ sk->sk_destruct = mp_sock_destruct;
+ sk->sk_data_ready = mp_sock_data_ready;
+ sk->sk_write_space = mp_sock_write_space;
+ sk->sk_state_change = mp_sock_state_change;
ret = mp_attach(mp, file);
if (ret < 0)
goto err_free_sk;
--
1.5.4.4

2010-06-05 10:09:05

by Xin, Xiaohui

[permalink] [raw]
Subject: [RFC PATCH v7 16/19] Manipulate external buffers in mp device.

From: Xiaohui Xin <[email protected]>

Where external buffers come from and how they are destroyed.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
drivers/vhost/mpassthru.c | 253 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 251 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
index 25e2f3e..8c48898 100644
--- a/drivers/vhost/mpassthru.c
+++ b/drivers/vhost/mpassthru.c
@@ -161,6 +161,39 @@ static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
return ret;
}

+/* The main function to allocate external buffers */
+static struct skb_external_page *page_ctor(struct mpassthru_port *port,
+ struct sk_buff *skb, int npages)
+{
+ int i;
+ unsigned long flags;
+ struct page_ctor *ctor;
+ struct page_info *info = NULL;
+
+ ctor = container_of(port, struct page_ctor, port);
+
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq, struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ if (!info)
+ return NULL;
+
+ for (i = 0; i < info->pnum; i++) {
+ get_page(info->pages[i]);
+ info->frag[i].page = info->pages[i];
+ info->frag[i].page_offset = i ? 0 : info->offset;
+ info->frag[i].size = port->npages > 1 ? PAGE_SIZE :
+ port->data_len;
+ }
+ info->skb = skb;
+ info->ext_page.frags = info->frag;
+ info->ext_page.ushinfo = &info->ushinfo;
+ return &info->ext_page;
+}
+
static int page_ctor_attach(struct mp_struct *mp)
{
int rc;
@@ -186,7 +219,7 @@ static int page_ctor_attach(struct mp_struct *mp)

dev_hold(dev);
ctor->dev = dev;
- ctor->port.ctor = NULL;
+ ctor->port.ctor = page_ctor;
ctor->port.sock = &mp->socket;
ctor->lock_pages = 0;
rc = netdev_mp_port_attach(dev, &ctor->port);
@@ -252,11 +285,66 @@ static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
return 0;
}

+static void relinquish_resource(struct page_ctor *ctor)
+{
+ if (!(ctor->dev->flags & IFF_UP) &&
+ !(ctor->wq_len + ctor->rq_len))
+ printk(KERN_INFO "relinquish_resource\n");
+}
+
+static void mp_ki_dtor(struct kiocb *iocb)
+{
+ struct page_info *info = (struct page_info *)(iocb->private);
+ int i;
+
+ if (info->flags == INFO_READ) {
+ for (i = 0; i < info->pnum; i++) {
+ if (info->pages[i]) {
+ set_page_dirty_lock(info->pages[i]);
+ put_page(info->pages[i]);
+ }
+ }
+ info->skb->destructor = NULL;
+ kfree_skb(info->skb);
+ info->ctor->rq_len--;
+ } else
+ info->ctor->wq_len--;
+ /* Decrement the number of locked pages */
+ info->ctor->lock_pages -= info->pnum;
+ kmem_cache_free(ext_page_info_cache, info);
+ relinquish_resource(info->ctor);
+
+ return;
+}
+
+static struct kiocb *create_iocb(struct page_info *info, int size)
+{
+ struct kiocb *iocb = NULL;
+
+ iocb = info->iocb;
+ if (!iocb)
+ return iocb;
+ iocb->ki_flags = 0;
+ iocb->ki_users = 1;
+ iocb->ki_key = 0;
+ iocb->ki_ctx = NULL;
+ iocb->ki_cancel = NULL;
+ iocb->ki_retry = NULL;
+ iocb->ki_iovec = NULL;
+ iocb->ki_eventfd = NULL;
+ iocb->ki_pos = info->desc_pos;
+ iocb->ki_nbytes = size;
+ iocb->ki_dtor(iocb);
+ iocb->private = (void *)info;
+ iocb->ki_dtor = mp_ki_dtor;
+
+ return iocb;
+}
+
static int page_ctor_detach(struct mp_struct *mp)
{
struct page_ctor *ctor;
struct page_info *info;
- struct kiocb *iocb = NULL;
int i;

/* locked by mp_mutex */
@@ -268,11 +356,17 @@ static int page_ctor_detach(struct mp_struct *mp)
for (i = 0; i < info->pnum; i++)
if (info->pages[i])
put_page(info->pages[i]);
+ create_iocb(info, 0);
+ ctor->rq_len--;
kmem_cache_free(ext_page_info_cache, info);
}
+
+ relinquish_resource(ctor);
+
set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
ctor->o_rlim.rlim_cur,
ctor->o_rlim.rlim_max);
+
netdev_mp_port_detach(ctor->dev);
dev_put(ctor->dev);

@@ -320,6 +414,161 @@ static void mp_put(struct mp_file *mfile)
mp_detach(mfile->mp);
}

+/* The callback to destruct the external buffers or skb */
+static void page_dtor(struct skb_external_page *ext_page)
+{
+ struct page_info *info;
+ struct page_ctor *ctor;
+ struct sock *sk;
+ struct sk_buff *skb;
+ unsigned long flags;
+
+ if (!ext_page)
+ return;
+ info = container_of(ext_page, struct page_info, ext_page);
+ if (!info)
+ return;
+ ctor = info->ctor;
+ skb = info->skb;
+
+ if ((info->flags == INFO_READ) && info->skb)
+ info->skb->head = NULL;
+
+ /* If info->total is 0, queue the info for reuse */
+ if (!info->total) {
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ list_add(&info->list, &ctor->readq);
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ return;
+ }
+
+ if (info->flags == INFO_READ)
+ return;
+
+ /* For transmit, we should wait for the hardware to finish the DMA.
+ * Queue the notifier to wake up the backend driver.
+ */
+
+ create_iocb(info, info->total);
+
+ sk = ctor->port.sock->sk;
+ sk->sk_write_space(sk);
+
+ return;
+}
+
+/* For transmitting small external buffers, we don't need to call
+ * get_user_pages().
+ */
+static struct page_info *alloc_small_page_info(struct page_ctor *ctor,
+ struct kiocb *iocb, int total)
+{
+ struct page_info *info =
+ kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+ if (!info)
+ return NULL;
+ info->total = total;
+ info->ext_page.dtor = page_dtor;
+ info->ctor = ctor;
+ info->flags = INFO_WRITE;
+ info->iocb = iocb;
+ return info;
+}
+
+/* The main function to translate guest user space addresses
+ * to host kernel addresses via get_user_pages(), so that the
+ * hardware can DMA directly into the external buffer.
+ */
+static struct page_info *alloc_page_info(struct page_ctor *ctor,
+ struct kiocb *iocb, struct iovec *iov,
+ int count, struct frag *frags,
+ int npages, int total)
+{
+ int rc;
+ int i, j, n = 0;
+ int len;
+ unsigned long base, lock_limit;
+ struct page_info *info = NULL;
+
+ lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
+ lock_limit >>= PAGE_SHIFT;
+
+ if (ctor->lock_pages + count > lock_limit && npages) {
+ printk(KERN_INFO "exceeded the locked memory rlimit\n");
+ return NULL;
+ }
+
+ info = kmem_cache_zalloc(ext_page_info_cache, GFP_KERNEL);
+
+ if (!info)
+ return NULL;
+
+ for (i = j = 0; i < count; i++) {
+ base = (unsigned long)iov[i].iov_base;
+ len = iov[i].iov_len;
+
+ if (!len)
+ continue;
+ n = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+
+ rc = get_user_pages_fast(base, n, npages ? 1 : 0,
+ &info->pages[j]);
+ if (rc != n)
+ goto failed;
+
+ while (n--) {
+ frags[j].offset = base & ~PAGE_MASK;
+ frags[j].size = min_t(int, len,
+ PAGE_SIZE - frags[j].offset);
+ len -= frags[j].size;
+ base += frags[j].size;
+ j++;
+ }
+ }
+
+#ifdef CONFIG_HIGHMEM
+ if (npages && !(ctor->dev->features & NETIF_F_HIGHDMA)) {
+ for (i = 0; i < j; i++) {
+ if (PageHighMem(info->pages[i]))
+ goto failed;
+ }
+ }
+#endif
+
+ info->total = total;
+ info->ext_page.dtor = page_dtor;
+ info->ctor = ctor;
+ info->pnum = j;
+ info->iocb = iocb;
+ if (!npages)
+ info->flags = INFO_WRITE;
+ if (info->flags == INFO_READ) {
+ info->ext_page.start = (u8 *)(((unsigned long)
+ (pfn_to_kaddr(page_to_pfn(info->pages[0]))) +
+ frags[0].offset));
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+ info->ext_page.size = SKB_DATA_ALIGN(
+ iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD);
+#else
+ info->ext_page.size = SKB_DATA_ALIGN(
+ iov[0].iov_len + NET_IP_ALIGN + NET_SKB_PAD) -
+ NET_IP_ALIGN - NET_SKB_PAD;
+#endif
+ }
+ /* increment the number of locked pages */
+ ctor->lock_pages += j;
+ return info;
+
+failed:
+ for (i = 0; i < j; i++)
+ put_page(info->pages[i]);
+
+ kmem_cache_free(ext_page_info_cache, info);
+
+ return NULL;
+}
+
/* Ops structure to mimic raw sockets with mp device */
static const struct proto_ops mp_socket_ops = {
};
--
1.5.4.4

2010-06-05 10:09:32

by Xin, Xiaohui

[permalink] [raw]
Subject: [RFC PATCH v7 15/19] Add basic funcs and ioctl to mp device.

From: Xin Xiaohui <[email protected]>

The ioctl is used by the mp device to bind an underlying
NIC; it queries the hardware capability and declares that
the NIC uses external buffers.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---

memory leak fixed,
kconfig made,
do_unbind() made,
mp_chr_ioctl() cleanup

by Jeff Dike <[email protected]>


drivers/vhost/mpassthru.c | 681 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 681 insertions(+), 0 deletions(-)
create mode 100644 drivers/vhost/mpassthru.c

diff --git a/drivers/vhost/mpassthru.c b/drivers/vhost/mpassthru.c
new file mode 100644
index 0000000..25e2f3e
--- /dev/null
+++ b/drivers/vhost/mpassthru.c
@@ -0,0 +1,681 @@
+/*
+ * MPASSTHRU - Mediate passthrough device.
+ * Copyright (C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ */
+
+#define DRV_NAME "mpassthru"
+#define DRV_DESCRIPTION "Mediate passthru device driver"
+#define DRV_COPYRIGHT "(C) 2009 ZhaoYu, XinXiaohui, Dike, Jeffery G"
+
+#include <linux/compat.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/major.h>
+#include <linux/slab.h>
+#include <linux/smp_lock.h>
+#include <linux/poll.h>
+#include <linux/fcntl.h>
+#include <linux/init.h>
+#include <linux/aio.h>
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/miscdevice.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <linux/if.h>
+#include <linux/if_arp.h>
+#include <linux/if_ether.h>
+#include <linux/crc32.h>
+#include <linux/nsproxy.h>
+#include <linux/uaccess.h>
+#include <linux/virtio_net.h>
+#include <linux/mpassthru.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+#include <asm/system.h>
+
+/* Uncomment to enable debugging */
+/* #define MPASSTHRU_DEBUG 1 */
+
+#ifdef MPASSTHRU_DEBUG
+static int debug;
+
+#define DBG if (mp->debug) printk
+#define DBG1 if (debug == 2) printk
+#else
+#define DBG(a...)
+#define DBG1(a...)
+#endif
+
+#define COPY_THRESHOLD (L1_CACHE_BYTES * 4)
+#define COPY_HDR_LEN (L1_CACHE_BYTES < 64 ? 64 : L1_CACHE_BYTES)
+
+struct frag {
+ u16 offset;
+ u16 size;
+};
+
+struct page_info {
+ struct list_head list;
+ int header;
+ /* indicates the actual number of bytes
+ * sent/received in the external buffers
+ */
+ int total;
+ int offset;
+ struct page *pages[MAX_SKB_FRAGS+1];
+ struct skb_frag_struct frag[MAX_SKB_FRAGS+1];
+ struct sk_buff *skb;
+ struct page_ctor *ctor;
+
+ /* The pointer relayed to the skb, to indicate
+ * whether the skb is externally allocated or a kernel one
+ */
+ struct skb_external_page ext_page;
+ struct skb_shared_info ushinfo;
+
+#define INFO_READ 0
+#define INFO_WRITE 1
+ unsigned flags;
+ unsigned pnum;
+
+ /* Only meaningful for receive: the
+ * maximum length allowed
+ */
+ size_t len;
+
+ /* The fields below are for the backend
+ * driver, currently vhost-net.
+ */
+
+ struct kiocb *iocb;
+ unsigned int desc_pos;
+ struct iovec hdr[MAX_SKB_FRAGS + 2];
+ struct iovec iov[MAX_SKB_FRAGS + 2];
+};
+
+static struct kmem_cache *ext_page_info_cache;
+
+struct page_ctor {
+ struct list_head readq;
+ int wq_len;
+ int rq_len;
+ spinlock_t read_lock;
+ /* record the locked pages */
+ int lock_pages;
+ struct rlimit o_rlim;
+ struct net_device *dev;
+ struct mpassthru_port port;
+};
+
+struct mp_struct {
+ struct mp_file *mfile;
+ struct net_device *dev;
+ struct page_ctor *ctor;
+ struct socket socket;
+
+#ifdef MPASSTHRU_DEBUG
+ int debug;
+#endif
+};
+
+struct mp_file {
+ atomic_t count;
+ struct mp_struct *mp;
+ struct net *net;
+};
+
+struct mp_sock {
+ struct sock sk;
+ struct mp_struct *mp;
+};
+
+static int mp_dev_change_flags(struct net_device *dev, unsigned flags)
+{
+ int ret = 0;
+
+ rtnl_lock();
+ ret = dev_change_flags(dev, flags);
+ rtnl_unlock();
+
+ if (ret < 0)
+ printk(KERN_ERR "failed to change dev state of %s", dev->name);
+
+ return ret;
+}
+
+static int page_ctor_attach(struct mp_struct *mp)
+{
+ int rc;
+ struct page_ctor *ctor;
+ struct net_device *dev = mp->dev;
+
+ /* locked by mp_mutex */
+ if (rcu_dereference(mp->ctor))
+ return -EBUSY;
+
+ ctor = kzalloc(sizeof(*ctor), GFP_KERNEL);
+ if (!ctor)
+ return -ENOMEM;
+ rc = netdev_mp_port_prep(dev, &ctor->port);
+ if (rc)
+ goto fail;
+
+ INIT_LIST_HEAD(&ctor->readq);
+ spin_lock_init(&ctor->read_lock);
+
+ ctor->rq_len = 0;
+ ctor->wq_len = 0;
+
+ dev_hold(dev);
+ ctor->dev = dev;
+ ctor->port.ctor = NULL;
+ ctor->port.sock = &mp->socket;
+ ctor->lock_pages = 0;
+ rc = netdev_mp_port_attach(dev, &ctor->port);
+ if (rc)
+ goto fail;
+
+ /* locked by mp_mutex */
+ rcu_assign_pointer(mp->ctor, ctor);
+
+ /* XXX: Do we need to call set_offload here? */
+
+ return 0;
+
+fail:
+ kfree(ctor);
+ dev_put(dev);
+
+ return rc;
+}
+
+struct page_info *info_dequeue(struct page_ctor *ctor)
+{
+ unsigned long flags;
+ struct page_info *info = NULL;
+ spin_lock_irqsave(&ctor->read_lock, flags);
+ if (!list_empty(&ctor->readq)) {
+ info = list_first_entry(&ctor->readq,
+ struct page_info, list);
+ list_del(&info->list);
+ }
+ spin_unlock_irqrestore(&ctor->read_lock, flags);
+ return info;
+}
+
+static int set_memlock_rlimit(struct page_ctor *ctor, int resource,
+ unsigned long cur, unsigned long max)
+{
+ struct rlimit new_rlim, *old_rlim;
+ int retval;
+
+ if (resource != RLIMIT_MEMLOCK)
+ return -EINVAL;
+ new_rlim.rlim_cur = cur;
+ new_rlim.rlim_max = max;
+
+ old_rlim = current->signal->rlim + resource;
+
+ /* remember the old rlimit value when backend enabled */
+ ctor->o_rlim.rlim_cur = old_rlim->rlim_cur;
+ ctor->o_rlim.rlim_max = old_rlim->rlim_max;
+
+ if ((new_rlim.rlim_max > old_rlim->rlim_max) &&
+ !capable(CAP_SYS_RESOURCE))
+ return -EPERM;
+
+ retval = security_task_setrlimit(resource, &new_rlim);
+ if (retval)
+ return retval;
+
+ task_lock(current->group_leader);
+ *old_rlim = new_rlim;
+ task_unlock(current->group_leader);
+ return 0;
+}
+
+static int page_ctor_detach(struct mp_struct *mp)
+{
+ struct page_ctor *ctor;
+ struct page_info *info;
+ struct kiocb *iocb = NULL;
+ int i;
+
+ /* locked by mp_mutex */
+ ctor = rcu_dereference(mp->ctor);
+ if (!ctor)
+ return -ENODEV;
+
+ while ((info = info_dequeue(ctor))) {
+ for (i = 0; i < info->pnum; i++)
+ if (info->pages[i])
+ put_page(info->pages[i]);
+ kmem_cache_free(ext_page_info_cache, info);
+ }
+ set_memlock_rlimit(ctor, RLIMIT_MEMLOCK,
+ ctor->o_rlim.rlim_cur,
+ ctor->o_rlim.rlim_max);
+ netdev_mp_port_detach(ctor->dev);
+ dev_put(ctor->dev);
+
+ /* locked by mp_mutex */
+ rcu_assign_pointer(mp->ctor, NULL);
+ synchronize_rcu();
+
+ kfree(ctor);
+ return 0;
+}
+
+static void __mp_detach(struct mp_struct *mp)
+{
+ mp->mfile = NULL;
+
+ mp_dev_change_flags(mp->dev, mp->dev->flags & ~IFF_UP);
+ page_ctor_detach(mp);
+ mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+
+ /* Drop the extra count on the net device */
+ dev_put(mp->dev);
+}
+
+static DEFINE_MUTEX(mp_mutex);
+
+static void mp_detach(struct mp_struct *mp)
+{
+ mutex_lock(&mp_mutex);
+ __mp_detach(mp);
+ mutex_unlock(&mp_mutex);
+}
+
+static struct mp_struct *mp_get(struct mp_file *mfile)
+{
+ struct mp_struct *mp = NULL;
+ if (atomic_inc_not_zero(&mfile->count))
+ mp = mfile->mp;
+
+ return mp;
+}
+
+static void mp_put(struct mp_file *mfile)
+{
+ if (atomic_dec_and_test(&mfile->count))
+ mp_detach(mfile->mp);
+}
+
+/* Ops structure to mimic raw sockets with mp device */
+static const struct proto_ops mp_socket_ops = {
+};
+
+static struct proto mp_proto = {
+ .name = "mp",
+ .owner = THIS_MODULE,
+ .obj_size = sizeof(struct mp_sock),
+};
+
+static int mp_chr_open(struct inode *inode, struct file * file)
+{
+ struct mp_file *mfile;
+ cycle_kernel_lock();
+ DBG1(KERN_INFO "mp: mp_chr_open\n");
+
+ mfile = kzalloc(sizeof(*mfile), GFP_KERNEL);
+ if (!mfile)
+ return -ENOMEM;
+ atomic_set(&mfile->count, 0);
+ mfile->mp = NULL;
+ mfile->net = get_net(current->nsproxy->net_ns);
+ file->private_data = mfile;
+ return 0;
+}
+
+static int mp_attach(struct mp_struct *mp, struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+ int err;
+
+ netif_tx_lock_bh(mp->dev);
+
+ err = -EINVAL;
+
+ if (mfile->mp)
+ goto out;
+
+ err = -EBUSY;
+ if (mp->mfile)
+ goto out;
+
+ err = 0;
+ mfile->mp = mp;
+ mp->mfile = mfile;
+ mp->socket.file = file;
+ dev_hold(mp->dev);
+ sock_hold(mp->socket.sk);
+ atomic_inc(&mfile->count);
+
+out:
+ netif_tx_unlock_bh(mp->dev);
+ return err;
+}
+
+static int do_unbind(struct mp_file *mfile)
+{
+ struct mp_struct *mp = mp_get(mfile);
+
+ if (!mp)
+ return -EINVAL;
+
+ mp_detach(mp);
+ sock_put(mp->socket.sk);
+ mp_put(mfile);
+ return 0;
+}
+
+static long mp_chr_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp;
+ struct net_device *dev;
+ void __user* argp = (void __user *)arg;
+ struct ifreq ifr;
+ struct sock *sk;
+ int ret;
+
+ ret = -EINVAL;
+
+ switch (cmd) {
+ case MPASSTHRU_BINDDEV:
+ ret = -EFAULT;
+ if (copy_from_user(&ifr, argp, sizeof ifr))
+ break;
+
+ ifr.ifr_name[IFNAMSIZ-1] = '\0';
+
+ ret = -ENODEV;
+ dev = dev_get_by_name(mfile->net, ifr.ifr_name);
+ if (!dev)
+ break;
+
+ mutex_lock(&mp_mutex);
+
+ ret = -EBUSY;
+
+ /* the device can only be bound once */
+ if (dev_is_mpassthru(dev))
+ goto err_dev_put;
+
+ mp = mfile->mp;
+ if (mp)
+ goto err_dev_put;
+
+ mp = kzalloc(sizeof(*mp), GFP_KERNEL);
+ if (!mp) {
+ ret = -ENOMEM;
+ goto err_dev_put;
+ }
+ mp->dev = dev;
+ ret = -ENOMEM;
+
+ sk = sk_alloc(mfile->net, AF_UNSPEC, GFP_KERNEL, &mp_proto);
+ if (!sk)
+ goto err_free_mp;
+
+ init_waitqueue_head(&mp->socket.wait);
+ mp->socket.ops = &mp_socket_ops;
+ sock_init_data(&mp->socket, sk);
+ sk->sk_sndbuf = INT_MAX;
+ container_of(sk, struct mp_sock, sk)->mp = mp;
+
+ sk->sk_destruct = NULL;
+ sk->sk_data_ready = NULL;
+ sk->sk_write_space = NULL;
+ sk->sk_state_change = NULL;
+ ret = mp_attach(mp, file);
+ if (ret < 0)
+ goto err_free_sk;
+
+ ret = page_ctor_attach(mp);
+ if (ret < 0)
+ goto err_free_sk;
+
+ mp_dev_change_flags(mp->dev, mp->dev->flags | IFF_UP);
+out:
+ mutex_unlock(&mp_mutex);
+ break;
+err_free_sk:
+ sk_free(sk);
+err_free_mp:
+ kfree(mp);
+err_dev_put:
+ dev_put(dev);
+ goto out;
+
+ case MPASSTHRU_UNBINDDEV:
+ ret = do_unbind(mfile);
+ break;
+
+ default:
+ break;
+ }
+ return ret;
+}
+
+static unsigned int mp_chr_poll(struct file *file, poll_table * wait)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp = mp_get(mfile);
+ struct sock *sk;
+ unsigned int mask = 0;
+
+ if (!mp)
+ return POLLERR;
+
+ sk = mp->socket.sk;
+
+ poll_wait(file, &mp->socket.wait, wait);
+
+ if (!skb_queue_empty(&sk->sk_receive_queue))
+ mask |= POLLIN | POLLRDNORM;
+
+ if (sock_writeable(sk) ||
+ (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags) &&
+ sock_writeable(sk)))
+ mask |= POLLOUT | POLLWRNORM;
+
+ if (mp->dev->reg_state != NETREG_REGISTERED)
+ mask = POLLERR;
+
+ mp_put(mfile);
+ return mask;
+}
+
+static ssize_t mp_chr_aio_write(struct kiocb *iocb, const struct iovec *iov,
+ unsigned long count, loff_t pos)
+{
+ struct file *file = iocb->ki_filp;
+ struct mp_struct *mp = mp_get(file->private_data);
+ struct sock *sk;
+ struct sk_buff *skb;
+ int len, err;
+ ssize_t result = 0;
+
+ if (!mp)
+ return -EBADFD;
+
+ sk = mp->socket.sk;
+
+ /* Currently, async is not supported.
+ * But we may support real async AIO from a user application,
+ * for example a qemu virtio-net backend.
+ */
+ if (!is_sync_kiocb(iocb))
+ return -EFAULT;
+
+ len = iov_length(iov, count);
+
+ if (unlikely(len < ETH_HLEN))
+ return -EINVAL;
+
+ skb = sock_alloc_send_skb(sk, len + NET_IP_ALIGN,
+ file->f_flags & O_NONBLOCK, &err);
+
+ if (!skb)
+ return -EFAULT;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, len);
+
+ if (skb_copy_datagram_from_iovec(skb, 0, iov, 0, len)) {
+ kfree_skb(skb);
+ return -EAGAIN;
+ }
+
+ skb->protocol = eth_type_trans(skb, mp->dev);
+ skb->dev = mp->dev;
+
+ dev_queue_xmit(skb);
+
+ mp_put(file->private_data);
+ return result;
+}
+
+static int mp_chr_close(struct inode *inode, struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+
+ /*
+ * Ignore return value since an error only means there was nothing to
+ * do
+ */
+ do_unbind(mfile);
+
+ put_net(mfile->net);
+ kfree(mfile);
+
+ return 0;
+}
+
+#ifdef CONFIG_COMPAT
+static long mp_chr_compat_ioctl(struct file *f, unsigned int ioctl,
+ unsigned long arg)
+{
+ return mp_chr_ioctl(f, ioctl, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations mp_fops = {
+ .owner = THIS_MODULE,
+ .llseek = no_llseek,
+ .write = do_sync_write,
+ .aio_write = mp_chr_aio_write,
+ .poll = mp_chr_poll,
+ .unlocked_ioctl = mp_chr_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = mp_chr_compat_ioctl,
+#endif
+ .open = mp_chr_open,
+ .release = mp_chr_close,
+};
+
+static struct miscdevice mp_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mp",
+ .nodename = "net/mp",
+ .fops = &mp_fops,
+};
+
+static int mp_device_event(struct notifier_block *unused,
+ unsigned long event, void *ptr)
+{
+ struct net_device *dev = ptr;
+ struct mpassthru_port *port;
+ struct mp_struct *mp = NULL;
+ struct socket *sock = NULL;
+
+ port = dev->mp_port;
+ if (port == NULL)
+ return NOTIFY_DONE;
+
+ switch (event) {
+ case NETDEV_UNREGISTER:
+ sock = dev->mp_port->sock;
+ mp = container_of(sock->sk, struct mp_sock, sk)->mp;
+ do_unbind(mp->mfile);
+ break;
+ }
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block mp_notifier_block __read_mostly = {
+ .notifier_call = mp_device_event,
+};
+
+static int mp_init(void)
+{
+ int err = 0;
+
+ ext_page_info_cache = kmem_cache_create("skb_page_info",
+ sizeof(struct page_info),
+ 0, SLAB_HWCACHE_ALIGN, NULL);
+ if (!ext_page_info_cache)
+ return -ENOMEM;
+
+ err = misc_register(&mp_miscdev);
+ if (err) {
+ printk(KERN_ERR "mp: Can't register misc device\n");
+ kmem_cache_destroy(ext_page_info_cache);
+ } else {
+ printk(KERN_INFO "Registering mp misc device - minor = %d\n",
+ mp_miscdev.minor);
+ register_netdevice_notifier(&mp_notifier_block);
+ }
+ return err;
+}
+
+void mp_exit(void)
+{
+ unregister_netdevice_notifier(&mp_notifier_block);
+ misc_deregister(&mp_miscdev);
+ kmem_cache_destroy(ext_page_info_cache);
+}
+
+/* Get an underlying socket object from mp file. Returns error unless file is
+ * attached to a device. The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg. The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *mp_get_socket(struct file *file)
+{
+ struct mp_file *mfile = file->private_data;
+ struct mp_struct *mp;
+
+ if (file->f_op != &mp_fops)
+ return ERR_PTR(-EINVAL);
+ mp = mp_get(mfile);
+ if (!mp)
+ return ERR_PTR(-EBADFD);
+ mp_put(mfile);
+ return &mp->socket;
+}
+EXPORT_SYMBOL_GPL(mp_get_socket);
+
+module_init(mp_init);
+module_exit(mp_exit);
+MODULE_AUTHOR(DRV_COPYRIGHT);
+MODULE_DESCRIPTION(DRV_DESCRIPTION);
+MODULE_LICENSE("GPL v2");
--
1.5.4.4

2010-06-05 10:09:53

by Xin, Xiaohui

[permalink] [raw]
Subject: [RFC PATCH v7 13/19] To skip GRO if buffer is external currently.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
net/core/dev.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index dc2f225..6c6b2fe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2787,6 +2787,10 @@ enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
if (skb_is_gso(skb) || skb_has_frags(skb))
goto normal;

+ /* currently GRO is not supported by mediate passthru */
+ if (dev_is_mpassthru(skb->dev))
+ goto normal;
+
rcu_read_lock();
list_for_each_entry_rcu(ptype, head, list) {
if (ptype->type != type || ptype->dev || !ptype->gro_receive)
--
1.5.4.4

2010-06-05 10:07:09

by Xin, Xiaohui

[permalink] [raw]
Subject: [RFC PATCH v7 07/19] Add interface to get external buffers.

From: Xin Xiaohui <[email protected]>

Currently, external buffers can be obtained from the mp device.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/skbuff.h | 12 ++++++++++++
net/core/skbuff.c | 16 ++++++++++++++++
2 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index cf309c9..281a1c0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1519,6 +1519,18 @@ static inline void netdev_free_page(struct net_device *dev, struct page *page)
__free_page(page);
}

+extern struct skb_external_page *netdev_alloc_external_pages(
+ struct net_device *dev,
+ struct sk_buff *skb, int npages);
+
+static inline struct skb_external_page *netdev_alloc_external_page(
+ struct net_device *dev,
+ struct sk_buff *skb, unsigned int size)
+{
+ return netdev_alloc_external_pages(dev, skb,
+ DIV_ROUND_UP(size, PAGE_SIZE));
+}
+
/**
* skb_clone_writable - is the header of a clone writable
* @skb: buffer to check
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 93c4e06..fbdb1f1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -278,6 +278,22 @@ struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
}
EXPORT_SYMBOL(__netdev_alloc_page);

+struct skb_external_page *netdev_alloc_external_pages(struct net_device *dev,
+ struct sk_buff *skb, int npages)
+{
+ struct mpassthru_port *port;
+ struct skb_external_page *ext_page = NULL;
+
+ port = rcu_dereference(dev->mp_port);
+ if (!port)
+ goto out;
+ WARN_ON(npages > port->npages);
+ ext_page = port->ctor(port, skb, npages);
+out:
+ return ext_page;
+}
+EXPORT_SYMBOL(netdev_alloc_external_pages);
+
void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
int size)
{
--
1.5.4.4

2010-06-05 10:10:29

by Xin, Xiaohui

Subject: [RFC PATCH v7 12/19] Add a hook to intercept external buffers from NIC driver.

From: Xin Xiaohui <[email protected]>

The hook is called in netif_receive_skb().
Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
net/core/dev.c | 35 +++++++++++++++++++++++++++++++++++
1 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 37b389a..dc2f225 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2548,6 +2548,37 @@ err:
EXPORT_SYMBOL(netdev_mp_port_prep);
#endif

+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+/* Add a hook to intercept mediate passthru (zero-copy) packets,
+ * and insert them into the socket queue owned by the mp_port.
+ */
+static inline struct sk_buff *handle_mpassthru(struct sk_buff *skb,
+ struct packet_type **pt_prev,
+ int *ret,
+ struct net_device *orig_dev)
+{
+ struct mpassthru_port *mp_port = NULL;
+ struct sock *sk = NULL;
+
+ if (!dev_is_mpassthru(skb->dev))
+ return skb;
+ mp_port = skb->dev->mp_port;
+
+ if (*pt_prev) {
+ *ret = deliver_skb(skb, *pt_prev, orig_dev);
+ *pt_prev = NULL;
+ }
+
+ sk = mp_port->sock->sk;
+ skb_queue_tail(&sk->sk_receive_queue, skb);
+ sk->sk_state_change(sk);
+
+ return NULL;
+}
+#else
+#define handle_mpassthru(skb, pt_prev, ret, orig_dev) (skb)
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
@@ -2629,6 +2660,10 @@ int netif_receive_skb(struct sk_buff *skb)
ncls:
#endif

+ /* To intercept mediate passthru(zero-copy) packets here */
+ skb = handle_mpassthru(skb, &pt_prev, &ret, orig_dev);
+ if (!skb)
+ goto out;
skb = handle_bridge(skb, &pt_prev, &ret, orig_dev);
if (!skb)
goto out;
--
1.5.4.4

2010-06-05 10:10:55

by Xin, Xiaohui

Subject: [RFC PATCH v7 10/19] Don't do skb recycle, if device use external buffer.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
net/core/skbuff.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 38d19d0..37587f0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -553,6 +553,12 @@ int skb_recycle_check(struct sk_buff *skb, int skb_size)
if (skb_shared(skb) || skb_cloned(skb))
return 0;

+ /* If the device wants to do mediate passthru, the skb may
+ * have got an external buffer, so don't recycle it.
+ */
+ if (dev_is_mpassthru(skb->dev))
+ return 0;
+
skb_release_head_state(skb);
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
--
1.5.4.4

2010-06-05 10:11:18

by Xin, Xiaohui

Subject: [RFC PATCH v7 09/19] Ignore room skb_reserve() when device is using external buffer.

From: Xin Xiaohui <[email protected]>

To make skb->data and skb->head from an external buffer
consistent, we ignore the room the driver reserves for a
kernel skb.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/skbuff.h | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5ff8c27..193b259 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1200,6 +1200,15 @@ static inline int skb_tailroom(const struct sk_buff *skb)
*/
static inline void skb_reserve(struct sk_buff *skb, int len)
{
+ /* skb_reserve() is only for an empty buffer. When the skb
+ * gets an external buffer, we cannot guarantee the external
+ * buffer has the same reserved headroom that a kernel
+ * allocated skb would have, so we have to ignore the
+ * reserve here. The external buffer info is recorded in
+ * the destructor_arg field, so use it as the indicator.
+ */
+ if (skb_shinfo(skb)->destructor_arg)
+ return;
skb->data += len;
skb->tail += len;
}
--
1.5.4.4

2010-06-05 10:11:44

by Xin, Xiaohui

Subject: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

From: Xin Xiaohui <[email protected]>

Add a dev parameter to __alloc_skb(). If an external buffer
is available, point skb->data to it, recompute skb->head,
maintain the shinfo of the external buffer, and record the
external buffer info in the destructor_arg field.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---

__alloc_skb() cleanup by

Jeff Dike <[email protected]>

include/linux/skbuff.h | 7 ++++---
net/core/skbuff.c | 43 +++++++++++++++++++++++++++++++++++++------
2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 281a1c0..5ff8c27 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -442,17 +442,18 @@ extern void kfree_skb(struct sk_buff *skb);
extern void consume_skb(struct sk_buff *skb);
extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *__alloc_skb(unsigned int size,
- gfp_t priority, int fclone, int node);
+ gfp_t priority, int fclone,
+ int node, struct net_device *dev);
static inline struct sk_buff *alloc_skb(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 0, -1);
+ return __alloc_skb(size, priority, 0, -1, NULL);
}

static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
- return __alloc_skb(size, priority, 1, -1);
+ return __alloc_skb(size, priority, 1, -1, NULL);
}

extern int skb_recycle_check(struct sk_buff *skb, int skb_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fbdb1f1..38d19d0 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -161,7 +161,8 @@ EXPORT_SYMBOL(skb_under_panic);
* @fclone: allocate from fclone cache instead of head cache
* and allocate a cloned (child) skb
* @node: numa node to allocate memory on
- *
+ * @dev: the device that owns the skb if the skb tries to get an
+ * external buffer; otherwise NULL.
* Allocate a new &sk_buff. The returned buffer has no headroom and a
* tail room of size bytes. The object has a reference count of one.
* The return is the buffer. On a failure the return is %NULL.
@@ -170,12 +171,13 @@ EXPORT_SYMBOL(skb_under_panic);
* %GFP_ATOMIC.
*/
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int fclone, int node)
+ int fclone, int node, struct net_device *dev)
{
struct kmem_cache *cache;
struct skb_shared_info *shinfo;
struct sk_buff *skb;
- u8 *data;
+ u8 *data = NULL;
+ struct skb_external_page *ext_page = NULL;

cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

@@ -185,8 +187,23 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
goto out;

size = SKB_DATA_ALIGN(size);
- data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
- gfp_mask, node);
+
+ /* If the device wants to do mediate passthru (zero-copy),
+ * the skb may try to get external buffers from outside.
+ * If that fails, fall back to allocating buffers from kernel.
+ */
+ if (dev && dev->mp_port) {
+ ext_page = netdev_alloc_external_page(dev, skb, size);
+ if (ext_page) {
+ data = ext_page->start;
+ size = ext_page->size;
+ }
+ }
+
+ if (!data)
+ data = kmalloc_node_track_caller(
+ size + sizeof(struct skb_shared_info),
+ gfp_mask, node);
if (!data)
goto nodata;

@@ -208,6 +225,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
skb->mac_header = ~0U;
#endif

+ /* If the skb got external buffers successfully, save a copy
+ * of the shinfo, which sits at the end of the buffer, so that
+ * we can restore it when needed.
+ */
+ if (ext_page) {
+ skb->head = skb->data - NET_IP_ALIGN - NET_SKB_PAD;
+ memcpy(ext_page->ushinfo, skb_shinfo(skb),
+ sizeof(struct skb_shared_info));
+ }
/* make sure we initialize shinfo sequentially */
shinfo = skb_shinfo(skb);
atomic_set(&shinfo->dataref, 1);
@@ -231,6 +257,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,

child->fclone = SKB_FCLONE_UNAVAILABLE;
}
+ /* Record the external buffer info in this field. It's not so good,
+ * but we cannot find another place easily.
+ */
+ shinfo->destructor_arg = ext_page;
+
out:
return skb;
nodata:
@@ -259,7 +290,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
struct sk_buff *skb;

- skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+ skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node, dev);
if (likely(skb)) {
skb_reserve(skb, NET_SKB_PAD);
skb->dev = dev;
--
1.5.4.4

2010-06-05 10:07:06

by Xin, Xiaohui

Subject: [RFC PATCH v7 02/19] Add a new struct for device to manipulate external buffer.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/netdevice.h | 19 ++++++++++++++++++-
1 files changed, 18 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fa8b476..bae725c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -530,6 +530,22 @@ struct netdev_queue {
unsigned long tx_dropped;
} ____cacheline_aligned_in_smp;

+/* Add a structure to struct net_device; the new field is
+ * named mp_port. It's for mediate passthru (zero-copy).
+ * It contains the capabilities of the net device driver,
+ * a socket, and an external buffer constructor. "External"
+ * means the skb buffer belonging to the device may not be
+ * allocated from kernel space.
+ */
+struct mpassthru_port {
+ int hdr_len;
+ int data_len;
+ int npages;
+ unsigned flags;
+ struct socket *sock;
+ struct skb_external_page *(*ctor)(struct mpassthru_port *,
+ struct sk_buff *, int);
+};

/*
* This structure defines the management hooks for network devices.
@@ -952,7 +968,8 @@ struct net_device {
struct macvlan_port *macvlan_port;
/* GARP */
struct garp_port *garp_port;
-
+ /* mpassthru */
+ struct mpassthru_port *mp_port;
/* class/net/name entry */
struct device dev;
/* space for optional device, statistics, and wireless sysfs groups */
--
1.5.4.4

2010-06-05 10:12:07

by Xin, Xiaohui

Subject: [RFC PATCH v7 06/19] Add a function to indicate if device use external buffer.

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/netdevice.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 31d9c4a..0cb78f4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1602,6 +1602,11 @@ extern void netdev_mp_port_detach(struct net_device *dev);
extern int netdev_mp_port_prep(struct net_device *dev,
struct mpassthru_port *port);

+static inline bool dev_is_mpassthru(struct net_device *dev)
+{
+ return (dev && dev->mp_port);
+}
+
static inline void napi_free_frags(struct napi_struct *napi)
{
kfree_skb(napi->skb);
--
1.5.4.4

2010-06-05 10:07:00

by Xin, Xiaohui

Subject: [RFC PATCH v7 03/19] Export 2 func for device to assign/deassign new strucure

From: Xin Xiaohui <[email protected]>

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/netdevice.h | 3 +++
net/core/dev.c | 28 ++++++++++++++++++++++++++++
2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bae725c..efb575a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1592,6 +1592,9 @@ extern gro_result_t napi_frags_finish(struct napi_struct *napi,
gro_result_t ret);
extern struct sk_buff * napi_frags_skb(struct napi_struct *napi);
extern gro_result_t napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_attach(struct net_device *dev,
+ struct mpassthru_port *port);
+extern void netdev_mp_port_detach(struct net_device *dev);

static inline void napi_free_frags(struct napi_struct *napi)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index f769098..ecbb6b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
}

+/* Export two functions to assign/de-assign mp_port pointer
+ * to a net device.
+ */
+
+int netdev_mp_port_attach(struct net_device *dev,
+ struct mpassthru_port *port)
+{
+ /* locked by mp_mutex */
+ if (rcu_dereference(dev->mp_port))
+ return -EBUSY;
+
+ rcu_assign_pointer(dev->mp_port, port);
+
+ return 0;
+}
+EXPORT_SYMBOL(netdev_mp_port_attach);
+
+void netdev_mp_port_detach(struct net_device *dev)
+{
+ /* locked by mp_mutex */
+ if (!rcu_dereference(dev->mp_port))
+ return;
+
+ rcu_assign_pointer(dev->mp_port, NULL);
+ synchronize_rcu();
+}
+EXPORT_SYMBOL(netdev_mp_port_detach);
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
--
1.5.4.4

2010-06-05 10:12:32

by Xin, Xiaohui

Subject: [RFC PATCH v7 05/19] Add a function make external buffer owner to query capability.

From: Xin Xiaohui <[email protected]>

The external buffer owner can use the functions to get
the capability of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/netdevice.h | 2 +
net/core/dev.c | 51 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 53 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 183c786..31d9c4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t napi_gro_frags(struct napi_struct *napi);
extern int netdev_mp_port_attach(struct net_device *dev,
struct mpassthru_port *port);
extern void netdev_mp_port_detach(struct net_device *dev);
+extern int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port);

static inline void napi_free_frags(struct napi_struct *napi)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index ecbb6b1..37b389a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2497,6 +2497,57 @@ void netdev_mp_port_detach(struct net_device *dev)
}
EXPORT_SYMBOL(netdev_mp_port_detach);

+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we'd better query the NIC driver for the capabilities it
+ * can provide, especially for packet split mode. For now we
+ * only query the header size and the payload size a
+ * descriptor may carry. If a driver does not export these
+ * via the API, we fall back to default values; currently we
+ * use the defaults from the IGB driver. For now this is
+ * only called by the mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port)
+{
+ int rc;
+ int npages, data_len;
+ const struct net_device_ops *ops = dev->netdev_ops;
+
+ /* needed by packet split */
+
+ if (ops->ndo_mp_port_prep) {
+ rc = ops->ndo_mp_port_prep(dev, port);
+ if (rc)
+ return rc;
+ } else {
+ /* If the NIC driver did not report this,
+ * then we try to use default value.
+ */
+ port->hdr_len = 128;
+ port->data_len = 2048;
+ port->npages = 1;
+ }
+
+ if (port->hdr_len <= 0)
+ goto err;
+
+ npages = port->npages;
+ data_len = port->data_len;
+ if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+ (data_len < PAGE_SIZE * (npages - 1) ||
+ data_len > PAGE_SIZE * npages))
+ goto err;
+
+ return 0;
+err:
+ dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
--
1.5.4.4

2010-06-05 10:13:10

by Xin, Xiaohui

Subject: [RFC PATCH v7 04/19] Add a ndo_mp_port_prep pointer to net_device_ops.

From: Xin Xiaohui <[email protected]>

If the driver wants to allocate external buffers,
it can export its capabilities, such as the skb
buffer header length, the page length that can be DMA'd, etc.
The external buffer owner may utilize this.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>
---
include/linux/netdevice.h | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index efb575a..183c786 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -707,6 +707,10 @@ struct net_device_ops {
int (*ndo_fcoe_get_wwn)(struct net_device *dev,
u64 *wwn, int type);
#endif
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+ int (*ndo_mp_port_prep)(struct net_device *dev,
+ struct mpassthru_port *port);
+#endif
};

/*
--
1.5.4.4

2010-06-05 14:51:44

by Eric Dumazet

Subject: Re: [RFC PATCH v7 03/19] Export 2 func for device to assign/deassign new strucure

Le samedi 05 juin 2010 à 18:14 +0800, [email protected] a écrit :
> From: Xin Xiaohui <[email protected]>
>
> Signed-off-by: Xin Xiaohui <[email protected]>
> Signed-off-by: Zhao Yu <[email protected]>
> Reviewed-by: Jeff Dike <[email protected]>
> ---
> include/linux/netdevice.h | 3 +++
> net/core/dev.c | 28 ++++++++++++++++++++++++++++
> 2 files changed, 31 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index bae725c..efb575a 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1592,6 +1592,9 @@ extern gro_result_t napi_frags_finish(struct napi_struct *napi,
> gro_result_t ret);
> extern struct sk_buff * napi_frags_skb(struct napi_struct *napi);
> extern gro_result_t napi_gro_frags(struct napi_struct *napi);
> +extern int netdev_mp_port_attach(struct net_device *dev,
> + struct mpassthru_port *port);
> +extern void netdev_mp_port_detach(struct net_device *dev);
>
> static inline void napi_free_frags(struct napi_struct *napi)
> {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index f769098..ecbb6b1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2469,6 +2469,34 @@ void netif_nit_deliver(struct sk_buff *skb)
> rcu_read_unlock();
> }
>
> +/* Export two functions to assign/de-assign mp_port pointer
> + * to a net device.
> + */
> +
> +int netdev_mp_port_attach(struct net_device *dev,
> + struct mpassthru_port *port)
> +{
> + /* locked by mp_mutex */
> + if (rcu_dereference(dev->mp_port))
> + return -EBUSY;
> +

Please... this is bogus...

Try with following config settings :

CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y




> + rcu_assign_pointer(dev->mp_port, port);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(netdev_mp_port_attach);
> +
> +void netdev_mp_port_detach(struct net_device *dev)
> +{
> + /* locked by mp_mutex */
> + if (!rcu_dereference(dev->mp_port))
> + return;

same problem here

> +
> + rcu_assign_pointer(dev->mp_port, NULL);
> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL(netdev_mp_port_detach);
> +
> /**
> * netif_receive_skb - process receive buffer from network
> * @skb: buffer to process

2010-06-05 14:53:32

by Eric Dumazet

Subject: Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

Le samedi 05 juin 2010 à 18:14 +0800, [email protected] a écrit :
> From: Xin Xiaohui <[email protected]>
> child->fclone = SKB_FCLONE_UNAVAILABLE;
> }
> + /* Record the external buffer info in this field. It's not so good,
> + * but we cannot find another place easily.
> + */
> + shinfo->destructor_arg = ext_page;
> +


Yes this is a big problem, its basically using a cache line that was not
touched before.

2010-06-05 14:56:21

by Eric Dumazet

Subject: Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.

Le samedi 05 juin 2010 à 18:14 +0800, [email protected] a écrit :
> From: Xin Xiaohui <[email protected]>
>
> If buffer is external, then use the callback to destruct
> buffers.
>
> Signed-off-by: Xin Xiaohui <[email protected]>
> Signed-off-by: Zhao Yu <[email protected]>
> Reviewed-by: Jeff Dike <[email protected]>
> ---
> net/core/skbuff.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 37587f0..418457c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>
> static void skb_release_data(struct sk_buff *skb)
> {
> + /* check if the skb has external buffers, we have use destructor_arg
> + * here to indicate
> + */
> + struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
> +

Oh well. This is v7 of your series, and nobody complained yet ?

This is a new cache miss on a _critical_ path.


> if (!skb->cloned ||
> !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
> &skb_shinfo(skb)->dataref)) {
> @@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
> if (skb_has_frags(skb))
> skb_drop_fraglist(skb);
>
> + /* if the skb has external buffers, use destructor here,
> + * since after that skb->head will be kfree, in case skb->head
> + * from external buffer cannot use kfree to destroy.
> + */

Why not deferring here the access to skb_shinfo(skb)->destructor_arg ?

> + if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
> + ext_page->dtor(ext_page);
> kfree(skb->head);
> }
> }

if (dev_is_mpassthru(skb->dev)) {
struct skb_external_page *ext_page =
skb_shinfo(skb)->destructor_arg;
if (ext_page && ext_page->dtor)
ext_page->dtor(ext_page);
}

destructor_arg should be moved before frags[] if you really want to use it.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..b136d90 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -202,10 +202,11 @@ struct skb_shared_info {
*/
atomic_t dataref;

- skb_frag_t frags[MAX_SKB_FRAGS];
/* Intermediate layers must ensure that destructor_arg
* remains valid until skb destructor */
void * destructor_arg;
+
+ skb_frag_t frags[MAX_SKB_FRAGS];
};

/* We divide dataref into two halves. The higher 16 bits hold references


2010-06-06 23:13:57

by Stephen Hemminger

Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

Still not sure this is a good idea for a couple of reasons:

1. We already have lots of special cases with skb's (frags and fraglist),
and skb's travel through a lot of different parts of the kernel. So any
new change like this creates lots of exposed points for new bugs. Look
at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
and ppp and ...

2. SKB's can have infinite lifetime in the kernel. If these buffers come from
a fixed size pool in an external device, they can easily all get tied up
if you have a slow listener. What happens then?

2010-06-07 07:51:20

by Andi Kleen

Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

Stephen Hemminger <[email protected]> writes:

> Still not sure this is a good idea for a couple of reasons:
>
> 1. We already have lots of special cases with skb's (frags and fraglist),
> and skb's travel through a lot of different parts of the kernel. So any
> new change like this creates lots of exposed points for new bugs. Look
> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
> and ppp and ...
>
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
> a fixed size pool in an external device, they can easily all get tied up
> if you have a slow listener. What happens then?

3. If they come from an internal pool what happens when the kernel runs
low on memory? How is that pool balanced against other kernel
memory users?

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-07 08:17:11

by Mitchell Erblich

Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.


On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:

> Stephen Hemminger <[email protected]> writes:
>
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>> and skb's travel through a lot of different parts of the kernel. So any
>> new change like this creates lots of exposed points for new bugs. Look
>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>> and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>> a fixed size pool in an external device, they can easily all get tied up
>> if you have a slow listener. What happens then?
>
> 3. If they come from an internal pool what happens when the kernel runs
> low on memory? How is that pool balanced against other kernel
> memory users?
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.

In general,

When an internal pool is created/used, there SHOULD be a reason.
Maybe, to keep allocation latency to a min, OR ...

Now IMO,

internal pool objects should have a ref count and
if that count is 0, then under memory pressure and/or num
of objects are above a high water mark, then they are freed,

OR if there is a last reference age field, then the object is to be
cleaned if dirty, then freed,

Else, the pool is allowed to grow if the number of objects in the
pool is below a set max (max COULD equal Infinity).



Mitchell Erblich-

2010-06-08 05:27:58

by Herbert Xu

Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
> Still not sure this is a good idea for a couple of reasons:
>
> 1. We already have lots of special cases with skb's (frags and fraglist),
> and skb's travel through a lot of different parts of the kernel. So any
> new change like this creates lots of exposed points for new bugs. Look
> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
> and ppp and ...
>
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
> a fixed size pool in an external device, they can easily all get tied up
> if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying. I'm claiming that this can be done by simply making
skb->head empty.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-09 07:31:19

by Xin, Xiaohui

Subject: RE: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.

>-----Original Message-----
>From: Eric Dumazet [mailto:[email protected]]
>Sent: Saturday, June 05, 2010 10:56 PM
>To: Xin, Xiaohui
>Cc: [email protected]; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 11/19] Use callback to deal with skb_release_data() specially.
>
>Le samedi 05 juin 2010 ? 18:14 +0800, [email protected] a ?crit :
>> From: Xin Xiaohui <[email protected]>
>>
>> If buffer is external, then use the callback to destruct
>> buffers.
>>
>> Signed-off-by: Xin Xiaohui <[email protected]>
>> Signed-off-by: Zhao Yu <[email protected]>
>> Reviewed-by: Jeff Dike <[email protected]>
>> ---
>> net/core/skbuff.c | 11 +++++++++++
>> 1 files changed, 11 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 37587f0..418457c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -385,6 +385,11 @@ static void skb_clone_fraglist(struct sk_buff *skb)
>>
>> static void skb_release_data(struct sk_buff *skb)
>> {
>> + /* check if the skb has external buffers, we have use destructor_arg
>> + * here to indicate
>> + */
>> + struct skb_external_page *ext_page = skb_shinfo(skb)->destructor_arg;
>> +
>
>Oh well. This is v7 of your series, and nobody complained yet ?
>
>This is a new cache miss on a _critical_ path.

Ok, I would remove the declaration here to avoid the new cache miss.

>
>
>> if (!skb->cloned ||
>> !atomic_sub_return(skb->nohdr ? (1 << SKB_DATAREF_SHIFT) + 1 : 1,
>> &skb_shinfo(skb)->dataref)) {
>> @@ -397,6 +402,12 @@ static void skb_release_data(struct sk_buff *skb)
>> if (skb_has_frags(skb))
>> skb_drop_fraglist(skb);
>>
>> + /* if the skb has external buffers, use destructor here,
>> + * since after that skb->head will be kfree, in case skb->head
>> + * from external buffer cannot use kfree to destroy.
>> + */
>
>Why not deferring here the access to skb_shinfo(skb)->destructor_arg ?

And references skb_shinfo(skb)->destructor_arg here.

>
>> + if (dev_is_mpassthru(skb->dev) && ext_page && ext_page->dtor)
>> + ext_page->dtor(ext_page);
>> kfree(skb->head);
>> }
>> }
>
>if (dev_is_mpassthru(skb->dev)) {
> struct skb_external_page *ext_page =
> skb_shinfo(skb)->destructor_arg;
> if (ext_page && ext_page->dtor)
> ext_page->dtor(ext_page);
>}
>
>destructor_arg should me moved before frags[] if you really want to use it.

Thanks for the patch. But why is placing destructor_arg before frags[] better than after frags[]?
skb_release_data() will reference both of them...

>
>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>index bf243fc..b136d90 100644
>--- a/include/linux/skbuff.h
>+++ b/include/linux/skbuff.h
>@@ -202,10 +202,11 @@ struct skb_shared_info {
> */
> atomic_t dataref;
>
>- skb_frag_t frags[MAX_SKB_FRAGS];
> /* Intermediate layers must ensure that destructor_arg
> * remains valid until skb destructor */
> void * destructor_arg;
>+
>+ skb_frag_t frags[MAX_SKB_FRAGS];
> };
>
> /* We divide dataref into two halves. The higher 16 bits hold references
>
>

2010-06-09 07:36:33

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.

>-----Original Message-----
>From: Eric Dumazet [mailto:[email protected]]
>Sent: Saturday, June 05, 2010 10:53 PM
>To: Xin, Xiaohui
>Cc: [email protected]; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 08/19] Make __alloc_skb() to get external buffer.
>
>Le samedi 05 juin 2010 à 18:14 +0800, [email protected] a écrit :
>> From: Xin Xiaohui <[email protected]>
>> child->fclone = SKB_FCLONE_UNAVAILABLE;
>> }
>> + /* Record the external buffer info in this field. It's not so good,
>> + * but we cannot find another place easily.
>> + */
>> + shinfo->destructor_arg = ext_page;
>> +
>
>
>Yes this is a big problem, its basically using a cache line that was not
>touched before.
>

Did your patch which moves destructor_arg before frags[] also fix this?

Thanks
Xiaohui

2010-06-09 08:29:50

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Stephen Hemminger [mailto:[email protected]]
>Sent: Monday, June 07, 2010 7:14 AM
>To: Xin, Xiaohui
>Cc: [email protected]; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Still not sure this is a good idea for a couple of reasons:
>
>1. We already have lots of special cases with skb's (frags and fraglist),
> and skb's travel through a lot of different parts of the kernel. So any
> new change like this creates lots of exposed points for new bugs. Look
> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
> and ppp and ...
>

Yes, I agree with your concern to some extent. But in our case the skbs that use external pages have a short journey: they start in the NIC driver and are forwarded to the guest immediately. They do not travel into the host kernel stack or any filters, which avoids many of the problems you mention here.

Another point is that we try to make the solution generic across different NIC drivers, so many drivers can use it without modification.

>2. SKB's can have infinite lifetime in the kernel. If these buffers come from
> a fixed size pool in an external device, they can easily all get tied up
> if you have a slow listener. What happens then?

The pool is not fixed in size; it is the set of usable buffers submitted by the guest virtio-net driver. Guest virtio-net checks how many buffers have been filled and resubmits more. What does a slow listener mean here? A slow NIC driver?

Thanks
Xiaohui

2010-06-09 08:48:59

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: [email protected] [mailto:[email protected]] On Behalf Of Andi
>Kleen
>Sent: Monday, June 07, 2010 3:51 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]; [email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>Stephen Hemminger <[email protected]> writes:
>
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>> and skb's travel through a lot of different parts of the kernel. So any
>> new change like this creates lots of exposed points for new bugs. Look
>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>> and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>> a fixed size pool in an external device, they can easily all get tied up
>> if you have a slow listener. What happens then?
>
>3. If they come from an internal pool what happens when the kernel runs
>low on memory? How is that pool balanced against other kernel
>memory users?
>
The size of the pool is currently limited by the virtqueue capacity.
If the virtqueue is really huge, then how to balance the memory is a problem.
I have not yet thought through how to tune it dynamically...

>-Andi
>
>--
>[email protected] -- Speaking for myself only.
>--
>To unsubscribe from this list: send the line "unsubscribe kvm" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-06-09 09:22:44

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Mitchell Erblich [mailto:[email protected]]
>Sent: Monday, June 07, 2010 4:17 PM
>To: Andi Kleen
>Cc: Stephen Hemminger; Xin, Xiaohui; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]; [email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>
>On Jun 7, 2010, at 12:51 AM, Andi Kleen wrote:
>
>> Stephen Hemminger <[email protected]> writes:
>>
>>> Still not sure this is a good idea for a couple of reasons:
>>>
>>> 1. We already have lots of special cases with skb's (frags and fraglist),
>>> and skb's travel through a lot of different parts of the kernel. So any
>>> new change like this creates lots of exposed points for new bugs. Look
>>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>>> and ppp and ...
>>>
>>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>>> a fixed size pool in an external device, they can easily all get tied up
>>> if you have a slow listener. What happens then?
>>
>> 3. If they come from an internal pool what happens when the kernel runs
>> low on memory? How is that pool balanced against other kernel
>> memory users?
>>
>> -Andi
>>
>> --
>> [email protected] -- Speaking for myself only.
>
>In general,
>
>When an internal pool is created/used, there SHOULD be a reason.
>Maybe to keep allocation latency to a minimum, OR ...
>
The internal pool here is a collection of user buffers submitted
by the guest virtio-net driver. The guest puts buffers in and the
driver takes buffers out. If the guest submits more buffers than
the driver needs, we need somewhere to keep them; that is what the
internal pool is for.

>Now IMO,
>
>internal pool objects should have a ref count and
>if that count is 0, then under memory pressure and/or num
>of objects are above a high water mark, then they are freed,
>
>OR if there is a last reference age field, then the object is to be
>cleaned if dirty, then freed,
>
>Else, the pool is allowed to grow if the number of objects in the
>pool is below a set max (max COULD equal Infinity).

Thanks for the thoughts.

Basically, the size of the internal pool is not decided by the pool itself,
and adding/deleting objects in the pool is not a task of the pool itself either.
It is decided by the guest virtio-net driver and the vhost-net driver together,
and by the guest's receive and submit speeds.
The max size of the pool is limited by the number of virtqueue buffers.

Thanks
Xiaohui

>
>
>
>Mitchell Erblich
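Mitchell's ref-count plus high-water-mark policy can be sketched in userspace as follows. All names and the constant are hypothetical illustrations of the policy he describes, not code from the patchset:

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_HIGH_WATER 4	/* hypothetical shrink threshold */

struct pool_obj {
	int refcnt;
	struct pool_obj *next;
};

struct buf_pool {
	struct pool_obj *free_list;
	int nr_free;
};

/* Take an object from the pool, growing on demand. */
static struct pool_obj *pool_get(struct buf_pool *p)
{
	struct pool_obj *obj = p->free_list;

	if (obj) {
		p->free_list = obj->next;
		p->nr_free--;
	} else {
		obj = calloc(1, sizeof(*obj));	/* grow when empty */
	}
	obj->refcnt = 1;
	return obj;
}

/* Drop a reference; at refcount zero, recycle the object unless the
 * pool already holds POOL_HIGH_WATER objects, in which case free it
 * back to the allocator (the memory-pressure rule). */
static void pool_put(struct buf_pool *p, struct pool_obj *obj)
{
	if (--obj->refcnt > 0)
		return;		/* still referenced elsewhere */
	if (p->nr_free >= POOL_HIGH_WATER) {
		free(obj);	/* shrink under pressure */
		return;
	}
	obj->next = p->free_list;
	p->free_list = obj;
	p->nr_free++;
}
```

Note that Xiaohui's reply still applies: in the patchset the pool is bounded by the virtqueue, so a policy like this would sit on top of that bound rather than replace it.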

2010-06-09 09:54:10

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Tuesday, June 08, 2010 1:28 PM
>To: Stephen Hemminger
>Cc: Xin, Xiaohui; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
>> Still not sure this is a good idea for a couple of reasons:
>>
>> 1. We already have lots of special cases with skb's (frags and fraglist),
>> and skb's travel through a lot of different parts of the kernel. So any
>> new change like this creates lots of exposed points for new bugs. Look
>> at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>> and ppp and ...
>>
>> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>> a fixed size pool in an external device, they can easily all get tied up
>> if you have a slow listener. What happens then?
>
>I agree with Stephen on this.
>
>FWIW I don't think we even need the external pages concept in
>order to implement zero-copy receive (which I gather is the intent
>here).
>
>Here is one way to do it, simply construct a completely non-linear
>packet in the driver, as you would if you were using the GRO frags
>interface (grep for napi_gro_frags under drivers/net for examples).
>
I'm not sure I understand your approach correctly:
1) Does it only work for drivers with the SG feature, since the packet
is non-linear?

2) Is skb->data still pointing to guest user buffers?
If yes, how do we avoid modifying the skb handling in the net core?

3) In our approach only parts of the drivers need to be modified to
support zero-copy; in yours, do we need to modify all the drivers?

If I have missed your idea, could you explain it in more detail?

>This way you can transfer the entire contents of the packet without
>copying through to the other side, provided that the host stack does
>not modify the packet.
>

>If the host side did modify the packet then we have to incur the
>memory cost anyway.
>
>IOW I think the only feature provided by the external pages
>construct is allowing the skb->head area to be shared without
>copying. I'm claiming that this can be done by simply making
>skb->head empty.
>
I think making skb->head empty from the start would take more effort to pass the
skb header checks. Have I missed something here? I actually set skb->head to NULL
only just before kfree(skb) in skb_release_data(); it is done by the callback we added for the skb.

Thanks
Xiaohui

>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-11 05:21:22

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>
> I'm not sure if I understand your way correctly:
> 1) Does the way only deal with driver with SG feature? Since packet
> is non-linear...

No the hardware doesn't have to support SG. You just need to
place the entire packet contents in a page instead of skb->head.

> 2) Is skb->data still pointing to guest user buffers?
> If yes, how to avoid the modifications to net core change to skb?

skb->data would not point to guest user buffers. In the common
case the packet is not modified on its way to the guest so this
is not an issue.

In the rare case where it is modified, you only have to copy the
bits which are modified and the cost of that is inconsequential
since you have to write to that memory anyway.

> 3) In our way only parts of drivers need be modified to support zero-copy.
> and here, need we modify all the drivers?

If you're asking the portion of each driver supporting zero-copy
that needs to be modified, then AFAICS this doesn't change that
very much at all.

> I think to make skb->head empty at first will cause more effort to pass the check with
> skb header. Have I missed something here? I really make the skb->head NULL
> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.

No I'm not suggesting you set it to NULL. It should have some
memory allocated, but skb_headlen(skb) should be zero.

Please have a look at how the napi_gro_frags interface works (e.g.,
in drivers/net/cxgb3/sge.c). This is exactly the model that I am
suggesting.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
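For reference, the skb-less receive model Herbert points to (napi_gro_frags(), as used by cxgb3's sge.c in that era) looks roughly like the following pseudocode; names are from that interface, with details elided where uncertain:

```
poll():
    for each completed RX descriptor:
        skb = napi_get_frags(napi)       /* small skb with empty head */
        if (!skb):
            drop the packet, recycle the page
        /* attach the DMA'd page as a frag; skb_headlen() stays 0 */
        skb_fill_page_desc(skb, 0, page, offset, len)
        update skb->len, skb->data_len, skb->truesize
        napi_gro_frags(napi)             /* GRO pulls the headers itself */
```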

2010-06-12 09:31:17

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Friday, June 11, 2010 1:21 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>
>> I'm not sure if I understand your way correctly:
>> 1) Does the way only deal with driver with SG feature? Since packet
>> is non-linear...
>
>No the hardware doesn't have to support SG. You just need to
>place the entire packet contents in a page instead of skb->head.
>
>> 2) Is skb->data still pointing to guest user buffers?
>> If yes, how to avoid the modifications to net core change to skb?
>
>skb->data would not point to guest user buffers. In the common
>case the packet is not modified on its way to the guest so this
>is not an issue.
>
>In the rare case where it is modified, you only have to copy the
>bits which are modified and the cost of that is inconsequential
>since you have to write to that memory anyway.
>
>> 3) In our way only parts of drivers need be modified to support zero-copy.
>> and here, need we modify all the drivers?
>
>If you're asking the portion of each driver supporting zero-copy
>that needs to be modified, then AFAICS this doesn't change that
>very much at all.
>
>> I think to make skb->head empty at first will cause more effort to pass the check with
>> skb header. Have I missed something here? I really make the skb->head NULL
>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>
>No I'm not suggesting you set it to NULL. It should have some
>memory allocated, but skb_headlen(skb) should be zero.
>
>Please have a look at how the napi_gro_frags interface works (e.g.,
>in drivers/net/cxgb3/sge.c). This is exactly the model that I am
>suggesting.
>
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

Herbert,
I explained below what I think your idea is; please clarify if I have
missed something.

1) Modify the driver so that netdev_alloc_skb() allocates user pages if
the dev is zero-copy. If the driver supports PS mode, then modify
alloc_page() too.
2) Use napi_gro_frags() in the driver to receive the user pages instead
of the driver's own receive function.
3) napi_gro_frags() will allocate a small skb and pull the header data
from the first page into skb->data.

Is the above what you have suggested?
I have thought about the approach in some detail.

1) The first page will have an offset after the header is copied into the
allocated kernel skb. The offset must be recalculated when the user page
data is transferred to the guest. This may require modifying some of the
gro code.

2) napi_gro_frags() may remove a page when its data has been completely
pulled, but we cannot put a user page in the normal way. This may require
modifying the gro code too.

3) When the user buffers are returned to the guest, some of them need a
vnet header appended. That means room for the vnet header should be
reserved in some pages when they are allocated. But we cannot know at
allocation time which page will be used as the first page. If we reserve
a vnet header in every page, then, since set_skb_frag() in the guest
driver only uses offset 0 for the second and later pages, the page data
will be wrong.

4) Since the user buffer pages must be released, we still need a dtor
callback to do that, and I still need a place to hold it. What do you
think about putting it in skb_shinfo?

Currently this is all I can think of.
What do you think?

Thanks
Xiaohui


2010-06-13 08:59:01

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: [email protected] [mailto:[email protected]] On Behalf Of
>Xin, Xiaohui
>Sent: Saturday, June 12, 2010 5:31 PM
>To: Herbert Xu
>Cc: Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>>-----Original Message-----
>>From: Herbert Xu [mailto:[email protected]]
>>Sent: Friday, June 11, 2010 1:21 PM
>>To: Xin, Xiaohui
>>Cc: Stephen Hemminger; [email protected]; [email protected];
>>[email protected]; [email protected]; [email protected]; [email protected];
>>[email protected]
>>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>>
>>On Wed, Jun 09, 2010 at 05:54:02PM +0800, Xin, Xiaohui wrote:
>>>
>>> I'm not sure if I understand your way correctly:
>>> 1) Does the way only deal with driver with SG feature? Since packet
>>> is non-linear...
>>
>>No the hardware doesn't have to support SG. You just need to
>>place the entire packet contents in a page instead of skb->head.
>>
>>> 2) Is skb->data still pointing to guest user buffers?
>>> If yes, how to avoid the modifications to net core change to skb?
>>
>>skb->data would not point to guest user buffers. In the common
>>case the packet is not modified on its way to the guest so this
>>is not an issue.
>>
>>In the rare case where it is modified, you only have to copy the
>>bits which are modified and the cost of that is inconsequential
>>since you have to write to that memory anyway.
>>
>>> 3) In our way only parts of drivers need be modified to support zero-copy.
>>> and here, need we modify all the drivers?
>>
>>If you're asking the portion of each driver supporting zero-copy
>>that needs to be modified, then AFAICS this doesn't change that
>>very much at all.
>>
>>> I think to make skb->head empty at first will cause more effort to pass the check with
>>> skb header. Have I missed something here? I really make the skb->head NULL
>>> just before kfree(skb) in skb_release_data(), it's done by callback we have made for skb.
>>
>>No I'm not suggesting you set it to NULL. It should have some
>>memory allocated, but skb_headlen(skb) should be zero.
>>
>>Please have a look at how the napi_gro_frags interface works (e.g.,
>>in drivers/net/cxgb3/sge.c). This is exactly the model that I am
>>suggesting.
>>
>>Cheers,
>>--
>>Visit Openswan at http://www.openswan.org/
>>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>>Home Page: http://gondor.apana.org.au/~herbert/
>>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
>
>Herbert,
>I explained what I think the thought in your mind here, please clarify if
>something missed.
>
>1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
> If the driver support PS mode, then modify alloc_page() too.
>2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving
>function.
>3) napi_gro_frags() will allocate small skb and pull the header data from
>the first page to skb->data.
>
>Is above the way what you have suggested?
>I have thought something in detail about the way.
>
>1) The first page will have an offset after the header is copied into allocated kernel skb.
>The offset should be recalculated when the user page data is transferred to guest. This
>may modify some of the gro code.
>
>2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot
>put a user page as normally. This may modify the gro code too.
>
>3) When the user buffer returned to guest, some of them need to be appended a vnet header.
>That means for some pages, the vnet header room should be reserved when allocated.
>But we cannot know which one will be used as the first page when allocated. If we reserved
>vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for
>second pages, then page data will be wrong.
>
>4) Since the user buffer pages should be released, so we still need a dtor callback to do that,
>and then I still need a place to hold it. How do you think about to put it in skb_shinfo?
>
>Currently I can only think of this.
>How do you think about then?
>
>Thanks
>Xiaohui

Herbert,
In this approach, I think we need to create at least three functions in
drivers: one to allocate rx buffers, one to receive the rx buffers, and
one to clean the rx buffers.

We can also take another approach here. We can provide a function that
only substitutes for alloc_page(), and a function to release the pages
when cleaning the rx buffers. The skb for the rx buffer can be allocated
in the original way, and when pushing the data to the guest, the header
data is copied into the guest buffer. In this approach we must reserve
sufficient room for the header in the first guest user buffer. That
needs modifications to the guest virtio-net kernel driver, and it is
only suitable for drivers with PS mode support. Considering that
advanced drivers mostly have PS mode, this should not be a critical
issue.

Thanks
Xiaohui
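The offset bookkeeping implied by this second approach can be sketched as follows. The 512-byte reserve matches the example discussed later in this thread, and all names here are hypothetical:

```c
#include <assert.h>

#define GUEST_PAGE_SIZE 4096
#define HDR_RESERVE      512	/* room for vnet header + copied skb header */

/* Guest-side view of buffer i in a mergeable receive: the first buffer
 * reserves HDR_RESERVE bytes for the vnet header plus the copied packet
 * header, so its frag data starts at a non-zero offset; later buffers
 * carry frag data from offset 0, as set_skb_frag() expects. */
static unsigned int frag_offset(int i)
{
	return i == 0 ? HDR_RESERVE : 0;
}

/* Payload bytes a given buffer can actually hold. */
static unsigned int frag_capacity(int i)
{
	return GUEST_PAGE_SIZE - frag_offset(i);
}
```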

2010-06-17 11:20:34

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sat, Jun 12, 2010 at 05:31:10PM +0800, Xin, Xiaohui wrote:
>
> 1) Modify driver from netdev_alloc_skb() to alloc user pages if dev is zero-copyed.
> If the driver support PS mode, then modify alloc_page() too.

Well if you were doing this then the driver won't be generating
skbs at all. So you don't need to change netdev_alloc_skb.

The driver would currently do alloc_page, so you would replace
that with netdev_alloc_page, which can call your new function
to allocate an external page where appropriate.

IOW you just need one change in the driver if it already uses
the skbless interface: replacing alloc_page.

If the driver doesn't use the skbless interface then you need
to make a few more changes but it isn't too hard either, it'll
also mean that the driver will have less overhead even for normal
use which is a win-win situation.

> 2) Add napi_gro_frags() in driver to receive the user pages instead of driver's receiving
> function.
>
> 3) napi_gro_frags() will allocate small skb and pull the header data from
> the first page to skb->data.
>
> Is above the way what you have suggested?

Yes.

> I have thought something in detail about the way.
>
> 1) The first page will have an offset after the header is copied into allocated kernel skb.
> The offset should be recalculated when the user page data is transferred to guest. This
> may modify some of the gro code.

We could keep track whether the stack has modified the header,
since you can simply ignore it if it doesn't modify it, which
should be the common case for virt.

> 2) napi_gro_frags() may remove a page when it's data is totally be pulled, but we cannot
> put a user page as normally. This may modify the gro code too.

If it does anything like that, then we're not in the fast-path
case so you can just fall back to copying.

> 3) When the user buffer returned to guest, some of them need to be appended a vnet header.
> That means for some pages, the vnet header room should be reserved when allocated.
> But we cannot know which one will be used as the first page when allocated. If we reserved vnet header for each page, since the set_skb_frag() in guest driver only use the offset 0 for second pages, then page data will be wrong.

I don't see why this would be a problem, since as far as what
the driver is putting onto the physical RX ring nothing has
changed. IOW if you want to designate a certain page as special,
or the first page, you can still do so.

So can you explain which bits of your patches would be affected
by this?

> 4) Since the user buffer pages should be released, so we still need a dtor callback to do that, and then I still need a place to hold it. How do you think about to put it in skb_shinfo?

While I don't like that very much I guess I can live with that
if nobody else objects.

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
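The single substitution Herbert describes (a netdev_alloc_page()-style hook that hands back an external page when the device is in passthrough mode) can be mocked in userspace like this; the struct fields and function names are illustrative, not the real kernel API:

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative mock; not the kernel's struct net_device. */
struct net_device {
	int mpassthru;			/* device is in mp passthrough mode */
	void *(*alloc_ext_page)(void);	/* supplies a guest-owned page */
};

static char g_ext_page[4096];		/* stand-in for a guest page */

static void *guest_page_mock(void)
{
	return g_ext_page;
}

/* Stand-in for the one change: the driver calls this instead of
 * alloc_page(), and gets an external page when appropriate. */
static void *netdev_alloc_page_sketch(struct net_device *dev)
{
	if (dev->mpassthru && dev->alloc_ext_page)
		return dev->alloc_ext_page();	/* guest-supplied page */
	return malloc(4096);			/* ordinary host page */
}
```

The corresponding release function would symmetrically hand external pages back through the dtor callback rather than freeing them.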

2010-06-17 11:21:27

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
>
> Herbert,
> In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to receive the rx buffers, and to clean the rx buffers.
>
> We can also have another way here. We can provide a function to only substitute
> alloc_page(), and a function to release the pages when cleaning the rx buffers.

Yes that's exactly what I had in mind.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-18 05:26:57

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Thursday, June 17, 2010 7:21 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Sun, Jun 13, 2010 at 04:58:36PM +0800, Xin, Xiaohui wrote:
>>
>> Herbert,
>> In this way, I think we should create 3 functions at least in drivers to allocate rx buffer, to
>receive the rx buffers, and to clean the rx buffers.
>>
>> We can also have another way here. We can provide a function to only substitute
>> alloc_page(), and a function to release the pages when cleaning the rx buffers.
>
>Yes that's exactly what I had in mind.
>
Herbert,
I have questions about the idea above:
1) Since netdev_alloc_skb() is still there, and we only modify
alloc_page(), we don't need napi_gro_frags() any more; the driver's
original receive function is fine. Right?

2) Is napi_gro_frags() only suitable for TCP packets?
I did a small test with the ixgbe driver, letting it allocate only paged
buffers, and found that the kernel hangs when napi_gro_frags() receives
an ARP packet.

3) As I mentioned above, with this idea netdev_alloc_skb() will allocate
as usual, and the data pointed to by skb->data will be copied into the
first guest buffer. That means we should reserve sufficient room in the
guest buffer. For a driver with PS mode support (for example ixgbe), the
room will be more than 128 bytes. After those 128 bytes we will put the
first frag data. Looking at page_to_skb() and receive_mergeable() in
virtio-net.c, this means we should modify the guest virtio-net driver to
compute the offset as the parameter for skb_set_frag().

What do you think about this? Attached is a patch showing how to modify
the guest driver. I reserve 512 bytes as an example and transfer the
header length of the skb in hdr->hdr_len.

Thanks
Xiaohui
>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Attachments:
mod-guest.diff (1.98 kB)

2010-06-18 05:59:40

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
>
> Herbert,
> I have questions about the idea above:
> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
> then we don't need napi_gro_frags() any more, the driver's original receiving
> function is ok. Right?

Well I was actually thinking about converting all drivers that
need this to napi_gro_frags. But now that you mention it, yes
we could still keep the old interface to minimise the work.

> 2) Is napi_gro_frags() only suitable for TCP protocol packet?
> I have done a small test for ixgbe driver to let it only allocate paged buffers
> and found kernel hangs when napi_gro_frags() receives an ARP packet.

It should work with any packet. In fact, I'm pretty sure the
other drivers (e.g., cxgb3) use that interface for all packets.

> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
> as usual, the data pointed by skb->data will be copied into the first guest buffer.
> That means we should reserve sufficient room in guest buffer. For PS mode
> supported driver (for example ixgbe), the room will be more than 128. After 128bytes,
> we will put the first frag data. Look into virtio-net.c the function page_to_skb()
> and receive_mergeable(), that means we should modify guest virtio-net driver to
> compute the offset as the parameter for skb_set_frag().
>
> How do you think about this? Attached is a patch to how to modify the guest driver.
> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.

Expanding the buffer size to 512 bytes to accomodate PS mode
looks reasonable to me.

However, I don't think we should increase the copy threshold to
512 bytes at the same time. I don't have any figures myself but
I think if we are to make such a change it should be a separate
one and come with supporting numbers.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-18 07:14:26

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Friday, June 18, 2010 1:59 PM
>To: Xin, Xiaohui
>Cc: Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]; Rusty Russell
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
>>
>> Herbert,
>> I have questions about the idea above:
>> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
>> then we don't need napi_gro_frags() any more, the driver's original receiving
>> function is ok. Right?
>
>Well I was actually thinking about converting all drivers that
>need this to napi_gro_frags. But now that you mention it, yes
>we could still keep the old interface to minimise the work.
>
>> 2) Is napi_gro_frags() only suitable for TCP protocol packet?
>> I have done a small test for ixgbe driver to let it only allocate paged buffers
>> and found kernel hangs when napi_gro_frags() receives an ARP packet.
>
>It should work with any packet. In fact, I'm pretty sure the
>other drivers (e.g., cxgb3) use that interface for all packets.
>
Thanks for the verification. By the way, does that mean that nearly all drivers can use the
same napi_gro_frags() to receive buffers, even though currently each driver has its own
receive function?

>> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
>> as usual, the data pointed by skb->data will be copied into the first guest buffer.
>> That means we should reserve sufficient room in guest buffer. For PS mode
>> supported driver (for example ixgbe), the room will be more than 128. After 128bytes,
>> we will put the first frag data. Look into virtio-net.c the function page_to_skb()
>> and receive_mergeable(), that means we should modify guest virtio-net driver to
>> compute the offset as the parameter for skb_set_frag().
>>
>> What do you think about this? Attached is a patch showing how to modify the guest driver.
>> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.
>
>Expanding the buffer size to 512 bytes to accommodate PS mode
>looks reasonable to me.
>
>However, I don't think we should increase the copy threshold to
>512 bytes at the same time. I don't have any figures myself but
>I think if we are to make such a change it should be a separate
>one and come with supporting numbers.
>
Let me have a look to see if I can retain the copy threshold at 128 bytes
and copy the header data safely.

>Cheers,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-18 07:45:12

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:
>
> Thanks for the verification. By the way, does that mean that nearly all drivers can use the
> same napi_gro_frags() to receive buffers though currently each driver has its own receiving
> function?

There is no reason why the napi_gro_frags can't be used by any
driver that supports receiving data into pages.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 10:11:56

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Fri, Jun 18, 2010 at 03:14:18PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Herbert Xu [mailto:[email protected]]
> >Sent: Friday, June 18, 2010 1:59 PM
> >To: Xin, Xiaohui
> >Cc: Stephen Hemminger; [email protected]; [email protected];
> >[email protected]; [email protected]; [email protected]; [email protected];
> >[email protected]; Rusty Russell
> >Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> >
> >On Fri, Jun 18, 2010 at 01:26:49PM +0800, Xin, Xiaohui wrote:
> >>
> >> Herbert,
> >> I have questions about the idea above:
> >> 1) Since netdev_alloc_skb() is still there, and we only modify alloc_page(),
> >> then we don't need napi_gro_frags() any more, the driver's original receiving
> >> function is ok. Right?
> >
> >Well I was actually thinking about converting all drivers that
> >need this to napi_gro_frags. But now that you mention it, yes
> >we could still keep the old interface to minimise the work.
> >
> >> 2) Is napi_gro_frags() only suitable for TCP protocol packet?
> >> I have done a small test for ixgbe driver to let it only allocate paged buffers
> >> and found kernel hangs when napi_gro_frags() receives an ARP packet.
> >
> >It should work with any packet. In fact, I'm pretty sure the
> >other drivers (e.g., cxgb3) use that interface for all packets.
> >
> Thanks for the verification. By the way, does that mean that nearly all drivers can use the
> same napi_gro_frags() to receive buffers though currently each driver has its own receiving
> function?
>
> >> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
> >> as usual, the data pointed by skb->data will be copied into the first guest buffer.
> >> That means we should reserve sufficient room in guest buffer. For PS mode
> >> supported driver (for example ixgbe), the room will be more than 128. After 128bytes,
> >> we will put the first frag data. Look into virtio-net.c the function page_to_skb()
> >> and receive_mergeable(), that means we should modify guest virtio-net driver to
> >> compute the offset as the parameter for skb_set_frag().
> >>
> >> What do you think about this? Attached is a patch showing how to modify the guest driver.
> >> I reserve 512 bytes as an example, and transfer the header len of the skb in hdr->hdr_len.
> >
> >Expanding the buffer size to 512 bytes to accommodate PS mode
> >looks reasonable to me.
> >
> >However, I don't think we should increase the copy threshold to
> >512 bytes at the same time. I don't have any figures myself but
> >I think if we are to make such a change it should be a separate
> >one and come with supporting numbers.
> >
> Let me have a look to see if I can retain the copy threshold as 128 bytes
> and copy the header data safely.

Changing the guest virtio to match the backend is a problem;
it breaks migration, etc.


> >Cheers,
> >--
> >Visit Openswan at http://www.openswan.org/
> >Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> >Home Page: http://gondor.apana.org.au/~herbert/
> >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 10:32:53

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
>
> Changing the guest virtio to match the backend is a problem,
> this breaks migration etc.

As long as it's done in a backwards compatible way it should be
fine. It's just like migrating from a backend that supports TSO
to one that doesn't.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 10:44:24

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 08:32:35PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 01:06:32PM +0300, Michael S. Tsirkin wrote:
> >
> > Changing the guest virtio to match the backend is a problem,
> > this breaks migration etc.
>
> As long as it's done in a backwards compatible way it should be
> fine.

Possibly, but to me the need to do this implies that
we'll need another change with different hardware at the backend.

> It's just like migrating from a backend that supports TSO
> to one that doesn't.
>
> Cheers,

Exactly. We don't support such migration.

> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 11:03:11

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:
>
> > It's just like migrating from a backend that supports TSO
> > to one that doesn't.
>
> Exactly. We don't support such migration.

Well that's something that has to be addressed in the virtio_net.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 11:16:38

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 09:02:54PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 01:39:09PM +0300, Michael S. Tsirkin wrote:
> >
> > > It's just like migrating from a backend that supports TSO
> > > to one that doesn't.
> >
> > Exactly. We don't support such migration.
>
> Well that's something that has to be addressed in the virtio_net.

Rather than modifying all guests, it seems much easier not to assume a
specific buffer layout in the host. Copying the network header around seems
a small cost.

> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 11:36:23

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:
>
> Rather than modifying all guests, it seems much easier not to assume
> specific buffer layout in host. Copying network header around seems a
> small cost.

Well sure we can debate the specifics of this implementation detail.

However, the fact that virtio_net doesn't support feature renegotiation
on live migration is not a valid reason against this.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 11:52:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 09:36:09PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:11:24PM +0300, Michael S. Tsirkin wrote:
> >
> > Rather than modifying all guests, it seems much easier not to assume
> > specific buffer layout in host. Copying network header around seems a
> > small cost.
>
> Well sure we can debate the specifics of this implementation detail.

Let's do this then. So far the virtio spec has avoided making layout
assumptions, leaving guests to lay out data as they see fit.
Isn't it possible to keep supporting this with zero copy for hardware
that can issue DMA at arbitrary addresses?

> However, the fact that virtio_net doesn't support feature renegotiation
> on live migration is not a valid reason against this.
>
> Cheers,

--
MST

2010-06-20 11:59:40

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
>
> Let's do this then. So far the virtio spec avoided making layout
> assumptions, leaving guests lay out data as they see fit.
> Isn't it possible to keep supporting this with zero copy for hardware
> that can issue DMA at arbitrary addresses?

I think you're mistaken with respect to what is being proposed.
Raising the buffer size to 512 bytes isn't a hard constraint; it is merely
an optimisation for Intel NICs, because their PS mode can produce
a head fragment of up to 512 bytes.

If the guest didn't allocate 512 bytes it wouldn't be the end of
the world, it'd just mean that we'd either copy whatever is in
the head fragment, or we waste 4096-X bytes of memory where X
is the number of bytes in the head.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 12:54:10

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, Jun 20, 2010 at 09:59:26PM +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
> >
> > Let's do this then. So far the virtio spec avoided making layout
> > assumptions, leaving guests lay out data as they see fit.
> > Isn't it possible to keep supporting this with zero copy for hardware
> > that can issue DMA at arbitrary addresses?
>
> I think you're mistaken with respect to what is being proposed.
> Raising 512 bytes isn't a hard constraint, it is merely an
> optimisation for Intel NICs because their PS mode can produce
> a head fragment of up to 512 bytes.
>
Thanks for the clarification. So what is being discussed here are the API
changes that will enable this optimization?
Of course, it makes sense to consider this to try to avoid code churn
in the future.

As a side note, I hope to see a basic zero-copy implementation with
GSO/GRO that convincingly beats copying in the host before work is started
on further optimizations, though.

> If the guest didn't allocate 512 bytes it wouldn't be the end of
> the world, it'd just mean that we'd either copy whatever is in
> the head fragment,
I don't know how much copying the head will cost.

> or we waste 4096-X bytes of memory where X
> is the number of bytes in the head.

This seems mostly harmless, and the guest can always do a copy internally
to save memory, correct?
Note also that we lock a full page to allow DMA anyway.

> Cheers,
> --
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <[email protected]>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-20 15:19:24

by Ben Hutchings

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Sun, 2010-06-20 at 21:59 +1000, Herbert Xu wrote:
> On Sun, Jun 20, 2010 at 02:47:19PM +0300, Michael S. Tsirkin wrote:
> >
> > Let's do this then. So far the virtio spec avoided making layout
> > assumptions, leaving guests lay out data as they see fit.
> > Isn't it possible to keep supporting this with zero copy for hardware
> > that can issue DMA at arbitrary addresses?
>
> I think you're mistaken with respect to what is being proposed.
> Raising 512 bytes isn't a hard constraint, it is merely an
> optimisation for Intel NICs because their PS mode can produce
> a head fragment of up to 512 bytes.
>
> If the guest didn't allocate 512 bytes it wouldn't be the end of
> the world, it'd just mean that we'd either copy whatever is in
> the head fragment, or we waste 4096-X bytes of memory where X
> is the number of bytes in the head.

If I understand correctly what this 'PS mode' is (I haven't seen the
documentation for it), it is a feature that Microsoft requested from
hardware vendors for use in Hyper-V. As a result, the SFC9000 family
and presumably other controllers also implement something similar.

Ben.

--
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

2010-06-23 08:13:20

by Dong, Eddie

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.


> 3) As I have mentioned above, with this idea, netdev_alloc_skb() will allocate
> as usual, and the data pointed to by skb->data will be copied into the first
> guest buffer. That means we should reserve sufficient room in the guest buffer.
> For a driver with PS-mode support (for example ixgbe), the room will be more
> than 128 bytes. After 128 bytes, we will put the first frag data. Looking into
> virtio-net.c at the functions page_to_skb() and receive_mergeable(), that means
> we should modify the guest virtio-net driver to compute the offset as the
> parameter for skb_set_frag().
>
> What do you think about this? Attached is a patch showing how to modify the
> guest driver. I reserve 512 bytes as an example, and transfer the header len
> of the skb in hdr->hdr_len.
>
Xiaohui & Herbert:
Mixing copy of the head & zero-copy of the bulk data imposes an additional challenge: finding the guest buffer. The backend driver may be unable to find a spare guest buffer in the virtqueue at that time, which may then block the receiving process.
Can't we completely eliminate netdev_alloc_skb here? Assigning a guest buffer at this time makes life much easier.
Thx, Eddie

2010-06-23 09:53:06

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>
> Xiaohui & Herbert:
> Mixing copy of head & 0-copy of bulk data imposes additional challenge to find the guest buffer. The backend driver may be unable to find a spare guest buffer from virtqueue at that time which may block the receiving process then.
> Can't we completely eliminate netdev_alloc_skb here? Assigning guest buffer at this time makes life much easier.

I'm not sure I understand your concern. If you mean that when
the guest doesn't give enough pages to the host, the host
can't receive on behalf of the guest, then isn't that already
the case with the original patch-set?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-23 10:08:39

by Dong, Eddie

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 04:09:40PM +0800, Dong, Eddie wrote:
>>
>> Xiaohui & Herbert:
>> Mixing copy of head & 0-copy of bulk data imposes additional
>> challenge to find the guest buffer. The backend driver may be
>> unable to find a spare guest buffer from virtqueue at that time
>> which may block the receiving process then. Can't we completely
>> eliminate netdev_alloc_skb here? Assigning guest buffer at this time
>> makes life much easier.
>
> I'm not sure I understand your concern. If you mean that when
> the guest doesn't give enough pages to the host and the host
> can't receive on behalf of the guest then isn't that already
> the case with the original patch-set?
>

I mean that once the frontend driver posts the buffers to the backend driver, the backend driver will "immediately" use those buffers to compose skbs or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver receives a packet from the NIC that requires a copy, it may be unable to find an additional free guest buffer, because all of them are already in use by the NIC driver. We have to reserve some guest buffers for the possible copy even if the buffer address is not identified by the original skb :(

Thx, Eddie

2010-06-24 10:08:48

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>
> I mean once the frontend side driver post the buffers to the backend driver, the backend driver will "immediately" use that buffers to compose skb or gro_frags and post them to the assigned host NIC driver as receive buffers. In that case, if the backend driver receives a packet from the NIC that requires to do copy, it may be unable to find additional free guest buffer because all of them are already used by the NIC driver. We have to reserve some guest buffers for the possible copy even if the buffer address is not identified by original skb :(

OK, I see what you mean. Can you tell me how Xiaohui's
previous patch-set deals with this problem?

Thanks,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-25 01:06:40

by Dong, Eddie

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

Herbert Xu wrote:
> On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>>
>> I mean once the frontend side driver post the buffers to the backend
>> driver, the backend driver will "immediately" use that buffers to
>> compose skb or gro_frags and post them to the assigned host NIC
>> driver as receive buffers. In that case, if the backend driver
>> receives a packet from the NIC that requires to do copy, it may be
>> unable to find additional free guest buffer because all of them are
>> already used by the NIC driver. We have to reserve some guest
>> buffers for the possible copy even if the buffer address is not
>> identified by original skb :(
>
> OK, I see what you mean. Can you tell me how Xiaohui's
> previous patch-set deals with this problem?
>
> Thanks,

In the current patch, each skb for the assigned device (SR-IOV VF, NIC, or a complete queue pair) uses a buffer from the guest, so it eliminates the copy completely in software and requires the hardware to do the same. If we could have an additional place to store a buffer per skb (which may cause a copy later on), we could do the copy later on or re-post the buffer to the assigned NIC driver later on. But that may not be very clean either :(

BTW, some hardware may require a certain level of packet copying, such as for broadcast packets on very old VMDq devices, which is not addressed in Xiaohui's previous patch yet. We may address this by implementing an additional virtqueue between guest and host for the slow path (broadcast packets only here), with additional complexity in the FE/BE drivers.

Thx, Eddie

2010-06-25 02:08:44

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

Herbert,
That's why I sent you the patch for the guest virtio-net driver. I reserved 512 bytes in each page, so I always have space to copy into, which avoids the issue of the backend using up all the guest buffers.

Thanks
Xiaohui

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Thursday, June 24, 2010 6:09 PM
>To: Dong, Eddie
>Cc: Xin, Xiaohui; Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
>>
>> I mean once the frontend side driver post the buffers to the backend driver, the backend
>driver will "immediately" use that buffers to compose skb or gro_frags and post them to the
>assigned host NIC driver as receive buffers. In that case, if the backend driver receives a
>packet from the NIC that requires to do copy, it may be unable to find additional free guest
>buffer because all of them are already used by the NIC driver. We have to reserve some guest
>buffers for the possible copy even if the buffer address is not identified by original skb :(
>
>OK, I see what you mean. Can you tell me how Xiaohui's
>previous patch-set deals with this problem?
>
>Thanks,
>--
>Visit Openswan at http://www.openswan.org/
>Email: Herbert Xu ~{PmV>HI~} <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-25 11:11:45

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> Herbert Xu wrote:
> > On Wed, Jun 23, 2010 at 06:05:41PM +0800, Dong, Eddie wrote:
> >>
> >> I mean once the frontend side driver post the buffers to the backend
> >> driver, the backend driver will "immediately" use that buffers to
> >> compose skb or gro_frags and post them to the assigned host NIC
> >> driver as receive buffers. In that case, if the backend driver
> >> receives a packet from the NIC that requires to do copy, it may be
> >> unable to find additional free guest buffer because all of them are
> >> already used by the NIC driver. We have to reserve some guest
> >> buffers for the possible copy even if the buffer address is not
> >> identified by original skb :(
> >
> > OK, I see what you mean. Can you tell me how Xiaohui's
> > previous patch-set deals with this problem?
> >
> > Thanks,
>
> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pairs) uses the buffer from guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additional place to store the buffer per skb (may cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver later on. But that may be not very clean either :(
> BTW, some hardware may require certain level of packet copy such as for broadcast packets in very old VMDq device, which is not addressed in previous Xiaohui's patch yet. We may address this by implementing an additional virtqueue between guest and host for slow path (broadcast packets only here) with additional complexity in FE/BE driver.
>
> Thx, Eddie

The guest posts a large number of buffers to the host.
The host can use them any way it wants to, and in any order;
for example, it could reserve half the buffers for the copy.

This might waste some memory if buffers are used
only partially, but let's worry about that later.

--
MST

2010-06-27 06:15:07

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
>
> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete queue pairs) uses the buffer from guest, so it eliminates copy completely in software and requires hardware to do so. If we can have an additional place to store the buffer per skb (may cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver later on. But that may be not very clean either :(

OK, if I understand you correctly then I don't think we have a
problem. With your current patch-set you have exactly the same
situation when the skb->data is reallocated as a kernel buffer.

This is OK because as you correctly argue, it is a rare situation.

With my proposal you will need to get this extra external buffer
in even less cases, because you'd only need to do it if the skb
head grows, which only happens if it becomes encapsulated.

So let me explain it in a bit more detail:

Our packet starts out as a purely non-linear skb, i.e., skb->head
contains nothing and all the page frags come from the guest.

During host processing we may pull data into skb->head but the
first frag will remain unless we pull all of it. If we did do
that then you would have a free external buffer anyway.

Now in the common case the header may be modified or pulled, but
it very rarely grows. So you can just copy the header back into
the first frag just before we give it to the guest.

Only in the case where the packet header grows (e.g., encapsulation)
would you need to get an extra external buffer.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-28 09:56:16

by Xin, Xiaohui

[permalink] [raw]
Subject: RE: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

>-----Original Message-----
>From: Herbert Xu [mailto:[email protected]]
>Sent: Sunday, June 27, 2010 2:15 PM
>To: Dong, Eddie
>Cc: Xin, Xiaohui; Stephen Hemminger; [email protected]; [email protected];
>[email protected]; [email protected]; [email protected]; [email protected];
>[email protected]
>Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
>
>On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
>>
>> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
>queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
>requires hardware to do so. If we can have an additional place to store the buffer per skb (may
>cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
>later on. But that may be not very clean either :(
>
>OK, if I understand you correctly then I don't think we have a
>problem. With your current patch-set you have exactly the same
>situation when the skb->data is reallocated as a kernel buffer.
>

When will skb->data be reallocated? Could you point me to the code path?

>This is OK because as you correctly argue, it is a rare situation.
>
>With my proposal you will need to get this extra external buffer
>in even less cases, because you'd only need to do it if the skb
>head grows, which only happens if it becomes encapsulated.
>So let me explain it in a bit more detail:
>
>Our packet starts out as a purely non-linear skb, i.e., skb->head
>contains nothing and all the page frags come from the guest.
>
>During host processing we may pull data into skb->head but the
>first frag will remain unless we pull all of it. If we did do
>that then you would have a free external buffer anyway.
>
>Now in the common case the header may be modified or pulled, but
>it very rarely grows. So you can just copy the header back into
>the first frag just before we give it to the guest.
>
Since the data is still there, recomputing the page offset and size is OK, right?

>Only in the case where the packet header grows (e.g., encapsulation)
>would you need to get an extra external buffer.
>
>Cheers,
>--
>Email: Herbert Xu <[email protected]>
>Home Page: http://gondor.apana.org.au/~herbert/
>PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-06-28 10:05:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
> >-----Original Message-----
> >From: Herbert Xu [mailto:[email protected]]
> >Sent: Sunday, June 27, 2010 2:15 PM
> >To: Dong, Eddie
> >Cc: Xin, Xiaohui; Stephen Hemminger; [email protected]; [email protected];
> >[email protected]; [email protected]; [email protected]; [email protected];
> >[email protected]
> >Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
> >
> >On Fri, Jun 25, 2010 at 09:03:46AM +0800, Dong, Eddie wrote:
> >>
> >> In current patch, each SKB for the assigned device (SRIOV VF or NIC or a complete
> >queue pairs) uses the buffer from guest, so it eliminates copy completely in software and
> >requires hardware to do so. If we can have an additional place to store the buffer per skb (may
> >cause copy later on), we can do copy later on or re-post the buffer to assigned NIC driver
> >later on. But that may be not very clean either :(
> >
> >OK, if I understand you correctly then I don't think we have a
> >problem. With your current patch-set you have exactly the same
> >situation when the skb->data is reallocated as a kernel buffer.
> >
>
> When will skb->data be reallocated? Could you point me to the code path?
>
> >This is OK because as you correctly argue, it is a rare situation.
> >
> >With my proposal you will need to get this extra external buffer
> >in even less cases, because you'd only need to do it if the skb
> >head grows, which only happens if it becomes encapsulated.
> >So let me explain it in a bit more detail:
> >
> >Our packet starts out as a purely non-linear skb, i.e., skb->head
> >contains nothing and all the page frags come from the guest.
> >
> >During host processing we may pull data into skb->head but the
> >first frag will remain unless we pull all of it. If we did do
> >that then you would have a free external buffer anyway.
> >
> >Now in the common case the header may be modified or pulled, but
> >it very rarely grows. So you can just copy the header back into
> >the first frag just before we give it to the guest.
> >
> Since the data is still there, recomputing the page offset and size is OK, right?

Question: can devices use parts of the same page
in frags of different skbs (or for other purposes)? If they do,
we'll corrupt that memory if we try to stick the header there.

We have another option: reserve some buffers
posted by the guest and use them if we need to copy
the header. This seems the most straightforward to me.

> >Only in the case where the packet header grows (e.g., encapsulation)
> >would you need to get an extra external buffer.
> >
> >Cheers,
> >--
> >Email: Herbert Xu <[email protected]>
> >Home Page: http://gondor.apana.org.au/~herbert/
> >PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2010-07-03 09:12:51

by Herbert Xu

[permalink] [raw]
Subject: Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.

On Mon, Jun 28, 2010 at 05:56:07PM +0800, Xin, Xiaohui wrote:
>
> >OK, if I understand you correctly then I don't think we have a
> >problem. With your current patch-set you have exactly the same
> >situation when the skb->data is reallocated as a kernel buffer.
>
> When will skb->data be reallocated? Could you point me to the code path?

Anything that calls pskb_expand_head.

> >This is OK because as you correctly argue, it is a rare situation.
> >
> >With my proposal you will need to get this extra external buffer
> >in even less cases, because you'd only need to do it if the skb
> >head grows, which only happens if it becomes encapsulated.
> >So let me explain it in a bit more detail:
> >
> >Our packet starts out as a purely non-linear skb, i.e., skb->head
> >contains nothing and all the page frags come from the guest.
> >
> >During host processing we may pull data into skb->head but the
> >first frag will remain unless we pull all of it. If we did do
> >that then you would have a free external buffer anyway.
> >
> >Now in the common case the header may be modified or pulled, but
> >it very rarely grows. So you can just copy the header back into
> >the first frag just before we give it to the guest.
> >
> Since the data is still there, recomputing the page offset and size is OK, right?

Right, you just move the page offset back and increase the size.
However, to do this safely we'd need to have a way of knowing
whether the skb head has been modified.

It may well turn out to be just as effective to do something like

    if (memcmp(skb->data, page frag head, skb_headlen))
        memcpy(page frag head, skb->data, skb_headlen)

As the page frag head should be in cache since it would've been
used to populate skb->data.

It'd be good to run some benchmarks with this to see whether
adding a bit to sk_buff to avoid the memcmp is worth it or not.

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt