Subject: [RFC PATCH v9 04/16] Add a function to let the external buffer owner query capability.

From: Xin Xiaohui <[email protected]>

The external buffer owner can use this function to query
the capabilities of the underlying NIC driver.

Signed-off-by: Xin Xiaohui <[email protected]>
Signed-off-by: Zhao Yu <[email protected]>
Reviewed-by: Jeff Dike <[email protected]>

---
include/linux/netdevice.h | 2 +
net/core/dev.c | 49 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+), 0 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aba0308..5f192de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1599,6 +1599,8 @@ extern gro_result_t napi_frags_finish(struct napi_struct *napi,
gro_result_t ret);
extern struct sk_buff * napi_frags_skb(struct napi_struct *napi);
extern gro_result_t napi_gro_frags(struct napi_struct *napi);
+extern int netdev_mp_port_prep(struct net_device *dev,
+ struct mpassthru_port *port);

static inline void napi_free_frags(struct napi_struct *napi)
{
diff --git a/net/core/dev.c b/net/core/dev.c
index 264137f..636f11b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2468,6 +2468,55 @@ void netif_nit_deliver(struct sk_buff *skb)
rcu_read_unlock();
}

+/* To support mediate passthru (zero-copy) with a NIC driver,
+ * we query the driver for the capabilities it can provide,
+ * especially for packet split mode. For now we only query
+ * the header size and the payload a descriptor may carry.
+ * If a driver does not export these through the API, we
+ * fall back to default values, currently taken from the
+ * IGB driver. For now, this is only called by the
+ * mpassthru device.
+ */
+#if defined(CONFIG_MEDIATE_PASSTHRU) || defined(CONFIG_MEDIATE_PASSTHRU_MODULE)
+int netdev_mp_port_prep(struct net_device *dev,
+		struct mpassthru_port *port)
+{
+	int rc;
+	int npages, data_len;
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_mp_port_prep) {
+		rc = ops->ndo_mp_port_prep(dev, port);
+		if (rc)
+			return rc;
+	} else {
+		/* If the NIC driver did not report this,
+		 * then we try to use default value.
+		 */
+		port->hdr_len = 128;
+		port->data_len = 2048;
+		port->npages = 1;
+	}
+
+	if (port->hdr_len <= 0)
+		goto err;
+
+	npages = port->npages;
+	data_len = port->data_len;
+	if (npages <= 0 || npages > MAX_SKB_FRAGS ||
+			(data_len < PAGE_SIZE * (npages - 1) ||
+			 data_len > PAGE_SIZE * npages))
+		goto err;
+
+	return 0;
+err:
+	dev_warn(&dev->dev, "invalid page constructor parameters\n");
+
+	return -EINVAL;
+}
+EXPORT_SYMBOL(netdev_mp_port_prep);
+#endif
+
/**
* netif_receive_skb - process receive buffer from network
* @skb: buffer to process
--
1.5.4.4
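
For readers who want to experiment with the checks outside the kernel, here is a minimal userspace sketch of the fallback and validation logic in netdev_mp_port_prep(). It is not part of the patch. struct port_caps, port_prep_fn, the two example driver callbacks, and the SKETCH_* constants are hypothetical stand-ins; in particular, the 4096-byte page size and the limit of 18 fragments are assumed values, not taken from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed values for illustration only; the kernel supplies its own
 * PAGE_SIZE and MAX_SKB_FRAGS. */
#define SKETCH_PAGE_SIZE     4096
#define SKETCH_MAX_SKB_FRAGS 18

/* Hypothetical stand-in for the three fields of struct mpassthru_port
 * that the patch reads and writes. */
struct port_caps {
	int hdr_len;   /* header bytes per packet */
	int data_len;  /* payload bytes one descriptor may carry */
	int npages;    /* pages backing that payload */
};

/* Hypothetical stand-in for ops->ndo_mp_port_prep. */
typedef int (*port_prep_fn)(struct port_caps *caps);

/* Mirrors the patch: ask the driver if it has a hook, otherwise use the
 * defaults, then validate whatever values we ended up with. */
static int port_prep_sketch(port_prep_fn driver_prep, struct port_caps *caps)
{
	if (driver_prep) {
		int rc = driver_prep(caps);
		if (rc)
			return rc;
	} else {
		caps->hdr_len = 128;
		caps->data_len = 2048;
		caps->npages = 1;
	}

	if (caps->hdr_len <= 0)
		return -EINVAL;
	if (caps->npages <= 0 || caps->npages > SKETCH_MAX_SKB_FRAGS)
		return -EINVAL;
	/* data_len must fit in npages pages and need at least npages - 1. */
	if (caps->data_len < SKETCH_PAGE_SIZE * (caps->npages - 1) ||
	    caps->data_len > SKETCH_PAGE_SIZE * caps->npages)
		return -EINVAL;
	return 0;
}

/* Example driver reporting a payload that exactly fills two pages. */
static int two_page_driver(struct port_caps *caps)
{
	caps->hdr_len = 256;
	caps->data_len = 2 * SKETCH_PAGE_SIZE;
	caps->npages = 2;
	return 0;
}

/* Example driver reporting more payload than its two pages can hold. */
static int oversized_driver(struct port_caps *caps)
{
	caps->hdr_len = 256;
	caps->data_len = 2 * SKETCH_PAGE_SIZE + 1;
	caps->npages = 2;
	return 0;
}
```

Passing NULL as the callback exercises the default path (128/2048/1), which passes validation under the assumed page size. The oversized driver is rejected with -EINVAL, the same outcome the patch reports through dev_warn().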

2010-08-06 09:09:06

by Xin, Xiaohui

On Tue, 2010-08-10 at 23:01 -0700, Shirley Ma wrote:
> On Tue, 2010-08-10 at 18:43 -0700, Shirley Ma wrote:
> > > I also found some vhost performance regression on the new
> > > kernel with tuning. I used to get 9.4Gb/s, but now I can't reach it.
> >
> > I forgot to mention that the kernel I used is 2.6.36. I also found
> > that the native host BW is limited to 8.0Gb/s, so the regression
> > might come from the device driver, not vhost.
>
> Here is something interesting: when binding ixgbe interrupts to cpu1
> and running netperf/netserver on cpu0, the native host-to-host
> performance is still around 8.0Gb/s; however, the macvtap zero copy
> result is 9.0Gb/s.
>
> [root@localhost ~]# netperf -H 192.168.10.74 -c -C -l60 -T0,0 -- -m 64K
> TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  16384  65536    60.00      9013.59   53.01    8.21     0.963   0.597
>
> Below is perf top output:
>
> 578.00 6.5% copy_user_generic_string
> 381.00 4.3% vmx_vcpu_run
> 250.00 2.8% schedule
> 207.00 2.3% vhost_get_vq_desc
> 204.00 2.3% _raw_spin_lock_irqsave
> 197.00 2.2% translate_desc
> 193.00 2.2% memcpy_fromiovec
> 162.00 1.8% gup_pte_range
>
> We can compare your results with mine to see any difference.

When binding the vhost thread to cpu3 and the qemu I/O thread to cpu2,
the macvtap zero copy patch reaches 9.4Gb/s.

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.10.74 (192.168.10.74) port 0 AF_INET : cpu bind
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  65536    60.00      9408.19   55.69    8.45     0.970   0.589
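
For anyone who wants to reproduce the binding programmatically, a task can be pinned to one CPU with sched_setaffinity(). The sketch below is only an illustration of that call and is not the tooling used for the runs above. pin_task_to_cpu() and first_allowed_cpu() are hypothetical helper names, and the IDs of the vhost and qemu I/O threads would have to be looked up separately.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes the CPU_* macros and sched_*affinity() */
#endif
#include <errno.h>
#include <sched.h>
#include <sys/types.h>

/* Pin task `pid` (0 means the calling thread) to a single CPU.
 * Returns 0 on success, -1 with errno set on failure. */
static int pin_task_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	if (cpu < 0 || cpu >= CPU_SETSIZE) {
		errno = EINVAL;
		return -1;
	}

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(pid, sizeof(set), &set);
}

/* Return the lowest-numbered CPU that task `pid` may currently run on,
 * or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(pid_t pid)
{
	cpu_set_t set;
	int cpu;

	if (sched_getaffinity(pid, sizeof(set), &set) != 0)
		return -1;

	for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
		if (CPU_ISSET(cpu, &set))
			return cpu;
	return -1;
}
```

With the thread IDs in hand, the setup described above would correspond to one pin_task_to_cpu() call with CPU 3 for the vhost thread and another with CPU 2 for the qemu I/O thread. Pinning another user's task normally needs the appropriate privileges.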

Shirley