2014-12-01 10:17:18

by Jason Wang

Subject: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

Hello:

We used to orphan packets before transmission in virtio-net. This breaks
socket accounting and prevents several functions from working, e.g.:

- Byte Queue Limits (BQL) depends on tx completion notifications to work.
- Packet Generator depends on tx completion notifications for the last
transmitted packet to complete.
- TCP Small Queues (TSQ) depends on proper accounting of sk_wmem_alloc to work.

This series tries to solve the issue by enabling tx interrupts. To minimize
the performance impact, several optimizations were used:

- On the guest side, virtqueue_enable_cb_delayed() is used to delay the tx
interrupt until 3/4 of the pending packets have been sent.
- On the host side, interrupt coalescing is used to reduce tx interrupts.

Performance test results[1] (tx-frames 16 tx-usecs 16) show:

- For guest receiving: no obvious throughput regression was noticed. Higher
cpu utilization was noticed in a few cases.
- For guest transmission: a very large throughput improvement was noticed for
small packets. This is expected since TSQ and other optimizations for small
packet transmission only start working once tx interrupts are enabled, but
large packets use more cpu.
- For TCP_RR: a regression (10% on transaction rate and cpu utilization) was
found. Tx interrupts don't help and only add overhead in this case. Using
more aggressive coalescing parameters may help reduce the regression.

Changes from RFC V3:
- Don't free tx packets in ndo_start_xmit()
- Add interrupt coalescing support for virtio-net
Changes from RFC v2:
- clean up code, address issues raised by Jason
Changes from RFC v1:
- address comments by Jason Wang, use delayed cb everywhere
- rebased Jason's patch on top of mine and include it (with some tweaks)

Please review. Comments are more than welcome.

[1] Performance Test result:

Environment:
- Two Intel(R) Xeon(R) CPU E5620 @ 2.40GHz machines connected back to back
with 82599ES cards.
- Both host and guest were running net-next.git plus this series
- Coalescing parameters for the card:
Adaptive RX: off TX: off
rx-usecs: 1
rx-frames: 0
tx-usecs: 0
tx-frames: 0
- Vhost_net was enabled and zerocopy was disabled
- Tests were done with netperf-2.6
- Guest has 2 vcpus with single queue virtio-net

Results:
- Numbers in square brackets are those whose significance is greater than 95%

Guest RX:

size/sessions/+throughput/+cpu/+per_cpu_throughput/
64/1/+2.0326/[+6.2807%]/-3.9970%/
64/2/-0.2104%/[+3.2012%]/[-3.3058%]/
64/4/+1.5956%/+2.2451%/-0.6353%/
64/8/+1.1732%/+3.5123%/-2.2598%/
256/1/+3.7619%/[+5.8117%]/-1.9372%/
256/2/-0.0661%/[+3.2511%]/-3.2127%/
256/4/+1.1435%/[-8.1842%]/[+10.1591%]/
256/8/[+2.2447%]/[+6.2044%]/[-3.7283%]/
1024/1/+9.1479%/[+12.0997%]/[-2.6332%]/
1024/2/[-17.3341%]/[+0.0000%]/[-17.3341%]/
1024/4/[-0.6284%]/-1.0376%/+0.4135%/
1024/8/+1.1444%/-1.6069%/+2.7961%/
4096/1/+0.0401%/-0.5993%/+0.6433%/
4096/2/[-0.5894%]/-2.2071%/+1.6542%/
4096/4/[-0.5560%]/-1.4969%/+0.9553%/
4096/8/-0.3362%/+2.7086%/-2.9645%/
16384/1/-0.0285%/+0.7247%/-0.7478%/
16384/2/-0.5286%/+0.3287%/-0.8545%/
16384/4/-0.3297%/-2.0543%/+1.7608%/
16384/8/+1.0932%/+4.0253%/-2.8187%/
65535/1/+0.0003%/-0.1502%/+0.1508%/
65535/2/[-0.6065%]/+0.2309%/-0.8355%/
65535/4/[-0.6861%]/[+3.9451%]/[-4.4554%]/
65535/8/+1.8359%/+3.1590%/-1.2825%/

Guest TX:
size/sessions/+throughput/+cpu/+per_cpu_throughput/
64/1/[+65.0961%]/[-8.6807%]/[+80.7900%]/
64/2/[+6.0288%]/[-2.2823%]/[+8.5052%]/
64/4/[+5.9038%]/[-2.1834%]/[+8.2677%]/
64/8/[+5.4154%]/[-2.1804%]/[+7.7651%]/
256/1/[+184.6462%]/[+4.8906%]/[+171.3742%]/
256/2/[+46.0731%]/[-8.9626%]/[+60.4539%]/
256/4/[+45.8547%]/[-8.3027%]/[+59.0612%]/
256/8/[+45.3486%]/[-8.4024%]/[+58.6817%]/
1024/1/[+432.5372%]/[+3.9566%]/[+412.2689%]/
1024/2/[-1.4207%]/[-23.6426%]/[+29.1025%]/
1024/4/-0.1003%/[-13.6416%]/[+15.6804%]/
1024/8/[+0.2200%]/[+2.0634%]/[-1.8061%]/
4096/1/[+18.4835%]/[-46.1508%]/[+120.0283%]/
4096/2/+0.1770%/[-26.2780%]/[+35.8848%]/
4096/4/-0.1012%/-0.7353%/+0.6388%/
4096/8/-0.6091%/+1.4159%/-1.9968%/
16384/1/-0.0424%/[+11.9373%]/[-10.7021%]/
16384/2/+0.0482%/+2.4685%/-2.3620%/
16384/4/+0.0840%/[+5.3587%]/[-5.0064%]/
16384/8/+0.0048%/[+5.0176%]/[-4.7733%]/
65535/1/-0.0095%/[+10.9408%]/[-9.8705%]/
65535/2/+0.1515%/[+8.1709%]/[-7.4137%]/
65535/4/+0.0203%/[+5.4316%]/[-5.1325%]/
65535/8/+0.1427%/[+6.2753%]/[-5.7705%]/

TCP_RR:
size/sessions/+trans.rate/+cpu/+per_cpu_trans.rate/
64/1/+0.2346%/[+11.5080%]/[-10.1099%]/
64/25/[-10.7893%]/-0.5791%/[-10.2697%]/
64/50/[-11.5997%]/-0.3429%/[-11.2956%]/
256/1/+0.7219%/[+13.2374%]/[-11.0524%]/
256/25/-6.9567%/+0.8887%/[-7.7763%]/
256/50/[-4.8814%]/-0.0338%/[-4.8492%]/
4096/1/-1.6061%/-0.7561%/-0.8565%/
4096/25/[+2.2120%]/[+1.0839%]/+1.1161%/
4096/50/[+5.6180%]/[+3.2116%]/[+2.3315%]/

Jason Wang (4):
virtio_net: enable tx interrupt
virtio-net: optimize free_old_xmit_skbs stats
virtio-net: add basic interrupt coalescing support
vhost_net: interrupt coalescing support

Michael S. Tsirkin (1):
virtio_net: bql

drivers/net/virtio_net.c | 211 ++++++++++++++++++++++++++++++++--------
drivers/vhost/net.c | 200 +++++++++++++++++++++++++++++++++++--
include/uapi/linux/vhost.h | 12 +++
include/uapi/linux/virtio_net.h | 12 +++
4 files changed, 383 insertions(+), 52 deletions(-)

--
1.8.3.1


2014-12-01 10:17:26

by Jason Wang

Subject: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt

On newer hosts that support delayed tx interrupts,
we probably don't have much to gain from orphaning
packets early.

Note: this might degrade performance for
hosts without event idx support.
Should be addressed by the next patch.

Cc: Rusty Russell <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/virtio_net.c | 132 +++++++++++++++++++++++++++++++----------------
1 file changed, 88 insertions(+), 44 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ec2a8b4..f68114e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {

/* Name of the send queue: output.$index */
char name[40];
+
+ struct napi_struct napi;
};

/* Internal representation of a receive virtqueue */
@@ -137,6 +139,9 @@ struct virtnet_info {

/* CPU hot plug notifier */
struct notifier_block nb;
+
+ /* Budget for polling tx completion */
+ u32 tx_work_limit;
};

struct skb_vnet_hdr {
@@ -211,15 +216,41 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
return p;
}

+static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
+ struct send_queue *sq, int budget)
+{
+ struct sk_buff *skb;
+ unsigned int len;
+ struct virtnet_info *vi = sq->vq->vdev->priv;
+ struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+ unsigned int packets = 0;
+
+ while (packets < budget &&
+ (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+ pr_debug("Sent skb %p\n", skb);
+
+ u64_stats_update_begin(&stats->tx_syncp);
+ stats->tx_bytes += skb->len;
+ stats->tx_packets++;
+ u64_stats_update_end(&stats->tx_syncp);
+
+ dev_kfree_skb_any(skb);
+ packets++;
+ }
+
+ if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
+ netif_tx_start_queue(txq);
+
+ return packets;
+}
+
static void skb_xmit_done(struct virtqueue *vq)
{
struct virtnet_info *vi = vq->vdev->priv;
+ struct send_queue *sq = &vi->sq[vq2txq(vq)];

- /* Suppress further interrupts. */
- virtqueue_disable_cb(vq);
-
- /* We were probably waiting for more output buffers. */
- netif_wake_subqueue(vi->dev, vq2txq(vq));
+ virtqueue_disable_cb(sq->vq);
+ napi_schedule(&sq->napi);
}

static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -777,6 +808,32 @@ again:
return received;
}

+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+ struct send_queue *sq =
+ container_of(napi, struct send_queue, napi);
+ struct virtnet_info *vi = sq->vq->vdev->priv;
+ struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+ u32 limit = vi->tx_work_limit;
+ unsigned int sent;
+
+ __netif_tx_lock(txq, smp_processor_id());
+ sent = free_old_xmit_skbs(txq, sq, limit);
+ if (sent < limit) {
+ napi_complete(napi);
+ /* Note: we must enable cb *after* napi_complete, because
+ * napi_schedule calls from callbacks that trigger before
+ * napi_complete are ignored.
+ */
+ if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
+ virtqueue_disable_cb(sq->vq);
+ napi_schedule(&sq->napi);
+ }
+ }
+ __netif_tx_unlock(txq);
+ return sent < limit ? 0 : budget;
+}
+
#ifdef CONFIG_NET_RX_BUSY_POLL
/* must be called with local_bh_disable()d */
static int virtnet_busy_poll(struct napi_struct *napi)
@@ -825,30 +882,12 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(&vi->rq[i]);
+ napi_enable(&vi->sq[i].napi);
}

return 0;
}

-static void free_old_xmit_skbs(struct send_queue *sq)
-{
- struct sk_buff *skb;
- unsigned int len;
- struct virtnet_info *vi = sq->vq->vdev->priv;
- struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
- while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
- pr_debug("Sent skb %p\n", skb);
-
- u64_stats_update_begin(&stats->tx_syncp);
- stats->tx_bytes += skb->len;
- stats->tx_packets++;
- u64_stats_update_end(&stats->tx_syncp);
-
- dev_kfree_skb_any(skb);
- }
-}
-
static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
{
struct skb_vnet_hdr *hdr;
@@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
sg_set_buf(sq->sg, hdr, hdr_len);
num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
}
- return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
+
+ return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
+ GFP_ATOMIC);
}

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
@@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;

- /* Free up any pending old buffers before queueing new ones. */
- free_old_xmit_skbs(sq);
+ virtqueue_disable_cb(sq->vq);

/* Try to transmit */
err = xmit_skb(sq, skb);
@@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}

- /* Don't wait up for transmitted skbs to be freed. */
- skb_orphan(skb);
- nf_reset(skb);
-
/* Apparently nice girls don't return TX_BUSY; stop the queue
* before it gets out of hand. Naturally, this wastes entries. */
- if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
+ if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
netif_stop_subqueue(dev, qnum);
- if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
- /* More just got used, free them then recheck. */
- free_old_xmit_skbs(sq);
- if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
- netif_start_subqueue(dev, qnum);
- virtqueue_disable_cb(sq->vq);
- }
- }
- }

if (kick || netif_xmit_stopped(txq))
virtqueue_kick(sq->vq);

+ if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
+ virtqueue_disable_cb(sq->vq);
+ napi_schedule(&sq->napi);
+ }
+
return NETDEV_TX_OK;
}

@@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device *dev)
/* Make sure refill_work doesn't re-enable napi! */
cancel_delayed_work_sync(&vi->refill);

- for (i = 0; i < vi->max_queue_pairs; i++)
+ for (i = 0; i < vi->max_queue_pairs; i++) {
napi_disable(&vi->rq[i].napi);
+ napi_disable(&vi->sq[i].napi);
+ }

return 0;
}
@@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
{
int i;

- for (i = 0; i < vi->max_queue_pairs; i++)
+ for (i = 0; i < vi->max_queue_pairs; i++) {
netif_napi_del(&vi->rq[i].napi);
+ netif_napi_del(&vi->sq[i].napi);
+ }

kfree(vi->rq);
kfree(vi->sq);
@@ -1607,6 +1643,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
napi_weight);
napi_hash_add(&vi->rq[i].napi);
+ netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
+ napi_weight);

sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
@@ -1790,6 +1828,8 @@ static int virtnet_probe(struct virtio_device *vdev)
if (err)
goto free_stats;

+ vi->tx_work_limit = napi_weight;
+
#ifdef CONFIG_SYSFS
if (vi->mergeable_rx_bufs)
dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
@@ -1904,8 +1944,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
if (netif_running(vi->dev)) {
for (i = 0; i < vi->max_queue_pairs; i++) {
napi_disable(&vi->rq[i].napi);
+ napi_disable(&vi->sq[i].napi);
napi_hash_del(&vi->rq[i].napi);
netif_napi_del(&vi->rq[i].napi);
+ netif_napi_del(&vi->sq[i].napi);
}
}

@@ -1930,8 +1972,10 @@ static int virtnet_restore(struct virtio_device *vdev)
if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);

- for (i = 0; i < vi->max_queue_pairs; i++)
+ for (i = 0; i < vi->max_queue_pairs; i++) {
virtnet_napi_enable(&vi->rq[i]);
+ napi_enable(&vi->sq[i].napi);
+ }
}

netif_device_attach(vi->dev);
--
1.8.3.1

2014-12-01 10:17:35

by Jason Wang

Subject: [PATCH RFC v4 net-next 4/5] virtio-net: add basic interrupt coalescing support

This patch enables the interrupt coalescing setting through ethtool.
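
For illustration, a hedged userspace sketch (not part of this patch) of how
the coalescing parameters used in the cover letter ("tx-frames 16 tx-usecs 16")
could be programmed through the standard SIOCETHTOOL interface that this patch
hooks up via .set_coalesce; the interface name "eth0" is a placeholder:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr;
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	/* Read the current settings first, so untouched fields are kept. */
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_GCOALESCE");

	/* Equivalent to: ethtool -C eth0 tx-usecs 16 tx-frames 16.
	 * ec.tx_max_coalesced_frames_irq would set the tx napi budget
	 * (tx_work_limit) instead. */
	ec.cmd = ETHTOOL_SCOALESCE;
	ec.tx_coalesce_usecs = 16;
	ec.tx_max_coalesced_frames = 16;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		perror("ETHTOOL_SCOALESCE");

	close(fd);
	return 0;
}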

Cc: Rusty Russell <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/virtio_net.c | 67 +++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/virtio_net.h | 12 ++++++++
2 files changed, 79 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9f420c9..2a3551a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -142,6 +142,11 @@ struct virtnet_info {

/* Budget for polling tx completion */
u32 tx_work_limit;
+
+ __u32 rx_coalesce_usecs;
+ __u32 rx_max_coalesced_frames;
+ __u32 tx_coalesce_usecs;
+ __u32 tx_max_coalesced_frames;
};

struct skb_vnet_hdr {
@@ -1419,12 +1424,73 @@ static void virtnet_get_channels(struct net_device *dev,
channels->other_count = 0;
}

+static int virtnet_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+ struct scatterlist sg;
+ struct virtio_net_ctrl_coalesce c;
+
+ if (!vi->has_cvq ||
+ !virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_COALESCE))
+ return -EOPNOTSUPP;
+ if (vi->rx_coalesce_usecs != ec->rx_coalesce_usecs ||
+ vi->rx_max_coalesced_frames != ec->rx_max_coalesced_frames) {
+ c.coalesce_usecs = ec->rx_coalesce_usecs;
+ c.max_coalesced_frames = ec->rx_max_coalesced_frames;
+ sg_init_one(&sg, &c, sizeof(c));
+ if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_COALESCE,
+ VIRTIO_NET_CTRL_COALESCE_RX_SET,
+ &sg)) {
+ dev_warn(&dev->dev, "Fail to set rx coalescing\n");
+ return -EINVAL;
+ }
+ vi->rx_coalesce_usecs = ec->rx_coalesce_usecs;
+ vi->rx_max_coalesced_frames = ec->rx_max_coalesced_frames;
+ }
+
+ if (vi->tx_coalesce_usecs != ec->tx_coalesce_usecs ||
+ vi->tx_max_coalesced_frames != ec->tx_max_coalesced_frames) {
+ c.coalesce_usecs = ec->tx_coalesce_usecs;
+ c.max_coalesced_frames = ec->tx_max_coalesced_frames;
+ sg_init_one(&sg, &c, sizeof(c));
+ if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_COALESCE,
+ VIRTIO_NET_CTRL_COALESCE_TX_SET,
+ &sg)) {
+ dev_warn(&dev->dev, "Fail to set tx coalescing\n");
+ return -EINVAL;
+ }
+ vi->tx_coalesce_usecs = ec->tx_coalesce_usecs;
+ vi->tx_max_coalesced_frames = ec->tx_max_coalesced_frames;
+ }
+
+ vi->tx_work_limit = ec->tx_max_coalesced_frames_irq;
+
+ return 0;
+}
+
+static int virtnet_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec)
+{
+ struct virtnet_info *vi = netdev_priv(dev);
+
+ ec->rx_coalesce_usecs = vi->rx_coalesce_usecs;
+ ec->rx_max_coalesced_frames = vi->rx_max_coalesced_frames;
+ ec->tx_coalesce_usecs = vi->tx_coalesce_usecs;
+ ec->tx_max_coalesced_frames = vi->tx_max_coalesced_frames;
+ ec->tx_max_coalesced_frames_irq = vi->tx_work_limit;
+
+ return 0;
+}
+
static const struct ethtool_ops virtnet_ethtool_ops = {
.get_drvinfo = virtnet_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ringparam = virtnet_get_ringparam,
.set_channels = virtnet_set_channels,
.get_channels = virtnet_get_channels,
+ .set_coalesce = virtnet_set_coalesce,
+ .get_coalesce = virtnet_get_coalesce,
};

#define MIN_MTU 68
@@ -2022,6 +2088,7 @@ static unsigned int features[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ,
VIRTIO_NET_F_CTRL_MAC_ADDR,
VIRTIO_F_ANY_LAYOUT,
+ VIRTIO_NET_F_CTRL_COALESCE,
};

static struct virtio_driver virtio_net_driver = {
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 172a7f0..cdafb57 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -33,6 +33,7 @@
/* The feature bitmap for virtio net */
#define VIRTIO_NET_F_CSUM 0 /* Host handles pkts w/ partial csum */
#define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */
+#define VIRTIO_NET_F_CTRL_COALESCE 3 /* Set coalescing */
#define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
#define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */
#define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */
@@ -201,4 +202,15 @@ struct virtio_net_ctrl_mq {
#define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MIN 1
#define VIRTIO_NET_CTRL_MQ_VQ_PAIRS_MAX 0x8000

+struct virtio_net_ctrl_coalesce {
+ __u32 coalesce_usecs;
+ __u32 max_coalesced_frames;
+};
+
+#define VIRTIO_NET_CTRL_COALESCE 6
+ #define VIRTIO_NET_CTRL_COALESCE_TX_SET 0
+ #define VIRTIO_NET_CTRL_COALESCE_TX_GET 1
+ #define VIRTIO_NET_CTRL_COALESCE_RX_SET 2
+ #define VIRTIO_NET_CTRL_COALESCE_RX_GET 3
+
#endif /* _LINUX_VIRTIO_NET_H */
--
1.8.3.1

2014-12-01 10:17:30

by Jason Wang

Subject: [PATCH RFC v4 net-next 3/5] virtio-net: optimize free_old_xmit_skbs stats

We already have counters for sent packets and sent bytes.
Use them to reduce the number of u64_stats_update_begin/end().

Take care not to bother with stats update when called
speculatively.

Cc: Rusty Russell <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/net/virtio_net.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0ed24ff..9f420c9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -230,16 +230,23 @@ static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
(skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
pr_debug("Sent skb %p\n", skb);

- u64_stats_update_begin(&stats->tx_syncp);
- stats->tx_bytes += skb->len;
bytes += skb->len;
- stats->tx_packets++;
- u64_stats_update_end(&stats->tx_syncp);
+ packets++;

dev_kfree_skb_any(skb);
- packets++;
}

+ /* Avoid overhead when no packets have been processed
+ * happens when called speculatively from start_xmit.
+ */
+ if (!packets)
+ return 0;
+
+ u64_stats_update_begin(&stats->tx_syncp);
+ stats->tx_bytes += bytes;
+ stats->tx_packets += packets;
+ u64_stats_update_end(&stats->tx_syncp);
+
netdev_tx_completed_queue(txq, packets, bytes);

if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
--
1.8.3.1

2014-12-01 10:18:09

by Jason Wang

Subject: [PATCH RFC v4 net-next 5/5] vhost_net: interrupt coalescing support

This patch implements interrupt coalescing support for vhost_net and provides
ioctl()s for userspace to get and set the coalescing parameters. Two kinds of
parameters can be set:

- max_coalesced_frames: the maximum number of packets allowed before issuing
an irq.
- coalesce_usecs: the maximum number of microseconds allowed before issuing
an irq if at least one packet is pending.

A per-virtqueue hrtimer is used for coalesce_usecs.
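
For illustration, a hedged userspace sketch (not part of this patch) of
driving the new ioctls on an already-configured vhost-net fd; the ioctl
numbers and struct come from the include/uapi/linux/vhost.h hunk below, and
the fd/virtqueue setup is assumed to have been done elsewhere (e.g. by QEMU):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

/* Set and read back coalescing for one virtqueue of a vhost-net device. */
static int set_tx_coalescing(int vhost_fd, unsigned int vq_index)
{
	struct vhost_net_vring_coalesce c = {
		.index = vq_index,
		.coalesce_usecs = 16,
		.max_coalesced_frames = 16,
	};

	if (ioctl(vhost_fd, VHOST_NET_SET_VRING_COALESCE, &c) < 0) {
		perror("VHOST_NET_SET_VRING_COALESCE");
		return -1;
	}

	/* The kernel fills the struct with the current values. */
	if (ioctl(vhost_fd, VHOST_NET_GET_VRING_COALESCE, &c) < 0) {
		perror("VHOST_NET_GET_VRING_COALESCE");
		return -1;
	}
	printf("vq %u: coalesce_usecs=%u max_coalesced_frames=%u\n",
	       c.index, c.coalesce_usecs, c.max_coalesced_frames);
	return 0;
}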

Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/net.c | 200 +++++++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/vhost.h | 12 +++
2 files changed, 203 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..c416aa7 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -18,6 +18,7 @@
#include <linux/file.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
+#include <linux/timer.h>

#include <linux/net.h>
#include <linux/if_packet.h>
@@ -61,7 +62,8 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
(1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
- (1ULL << VIRTIO_NET_F_MRG_RXBUF),
+ (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
+ (1ULL << VIRTIO_NET_F_CTRL_COALESCE)
};

enum {
@@ -99,6 +101,15 @@ struct vhost_net_virtqueue {
/* Reference counting for outstanding ubufs.
* Protected by vq mutex. Writers must also take device mutex. */
struct vhost_net_ubuf_ref *ubufs;
+ /* Microseconds after at least 1 packet is processed before
+ * generating an interrupt.
+ */
+ __u32 coalesce_usecs;
+ /* Packets are processed before generating an interrupt. */
+ __u32 max_coalesced_frames;
+ __u32 coalesced;
+ ktime_t last_signal;
+ struct hrtimer c_timer;
};

struct vhost_net {
@@ -196,11 +207,16 @@ static void vhost_net_vq_reset(struct vhost_net *n)
vhost_net_clear_ubuf_info(n);

for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
+ hrtimer_cancel(&n->vqs[i].c_timer);
n->vqs[i].done_idx = 0;
n->vqs[i].upend_idx = 0;
n->vqs[i].ubufs = NULL;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+ n->vqs[i].max_coalesced_frames = 0;
+ n->vqs[i].coalesce_usecs = 0;
+ n->vqs[i].last_signal = ktime_get();
+ n->vqs[i].coalesced = 0;
}

}
@@ -272,6 +288,56 @@ static void copy_iovec_hdr(const struct iovec *from, struct iovec *to,
}
}

+static int vhost_net_check_coalesce_and_signal(struct vhost_dev *dev,
+ struct vhost_net_virtqueue *nvq)
+{
+ struct vhost_virtqueue *vq = &nvq->vq;
+ int left = 0;
+ ktime_t now;
+
+ if (nvq->coalesced) {
+ now = ktime_get();
+ left = nvq->coalesce_usecs -
+ ktime_to_us(ktime_sub(now, nvq->last_signal));
+ if (left <= 0) {
+ vhost_signal(dev, vq);
+ nvq->last_signal = now;
+ nvq->coalesced = 0;
+ }
+ }
+
+ return left;
+}
+
+static bool vhost_net_add_used_and_signal_n(struct vhost_dev *dev,
+ struct vhost_net_virtqueue *nvq,
+ struct vring_used_elem *heads,
+ unsigned count)
+{
+ struct vhost_virtqueue *vq = &nvq->vq;
+ bool can_coalesce = nvq->max_coalesced_frames && nvq->coalesce_usecs;
+ bool ret = false;
+
+ vhost_add_used_n(vq, heads, count);
+
+ if (can_coalesce) {
+ ktime_t now = ktime_get();
+
+ nvq->coalesced += count;
+ if ((nvq->coalesced >= nvq->max_coalesced_frames) ||
+ (ktime_to_us(ktime_sub(now, nvq->last_signal)) >=
+ nvq->coalesce_usecs)) {
+ vhost_signal(dev, vq);
+ nvq->coalesced = 0;
+ nvq->last_signal = now;
+ ret = true;
+ }
+ } else {
+ vhost_signal(dev, vq);
+ }
+ return ret;
+}
+
/* In case of DMA done not in order in lower device driver for some reason.
* upend_idx is used to track end of used idx, done_idx is used to track head
* of used idx. Once lower device DMA done contiguously, we will signal KVM
@@ -296,8 +362,8 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net,
}
while (j) {
add = min(UIO_MAXIOV - nvq->done_idx, j);
- vhost_add_used_and_signal_n(vq->dev, vq,
- &vq->heads[nvq->done_idx], add);
+ vhost_net_add_used_and_signal_n(vq->dev, nvq,
+ &vq->heads[nvq->done_idx], add);
nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
j -= add;
}
@@ -351,6 +417,7 @@ static void handle_tx(struct vhost_net *net)
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
bool zcopy, zcopy_used;
+ int left;

mutex_lock(&vq->mutex);
sock = vq->private_data;
@@ -362,6 +429,8 @@ static void handle_tx(struct vhost_net *net)
hdr_size = nvq->vhost_hlen;
zcopy = nvq->ubufs;

+ vhost_net_check_coalesce_and_signal(&net->dev, nvq);
+
for (;;) {
/* Release DMAs done buffers first */
if (zcopy)
@@ -444,10 +513,15 @@ static void handle_tx(struct vhost_net *net)
if (err != len)
pr_debug("Truncated TX packet: "
" len %d != %zd\n", err, len);
- if (!zcopy_used)
- vhost_add_used_and_signal(&net->dev, vq, head, 0);
- else
+
+ if (!zcopy_used) {
+ struct vring_used_elem heads = { head, 0 };
+
+ vhost_net_add_used_and_signal_n(&net->dev,
+ nvq, &heads, 1);
+ } else {
vhost_zerocopy_signal_used(net, vq);
+ }
total_len += len;
vhost_net_tx_packet(net);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
@@ -455,6 +529,12 @@ static void handle_tx(struct vhost_net *net)
break;
}
}
+
+ left = vhost_net_check_coalesce_and_signal(&net->dev, nvq);
+ if (left > 0)
+ hrtimer_start(&nvq->c_timer, ms_to_ktime(left),
+ HRTIMER_MODE_REL);
+
out:
mutex_unlock(&vq->mutex);
}
@@ -570,7 +650,7 @@ static void handle_rx(struct vhost_net *net)
.hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE
};
size_t total_len = 0;
- int err, mergeable;
+ int err, mergeable, left;
s16 headcount;
size_t vhost_hlen, sock_hlen;
size_t vhost_len, sock_len;
@@ -589,6 +669,8 @@ static void handle_rx(struct vhost_net *net)
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);

+ vhost_net_check_coalesce_and_signal(&net->dev, nvq);
+
while ((sock_len = peek_head_len(sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
@@ -654,8 +736,10 @@ static void handle_rx(struct vhost_net *net)
vhost_discard_vq_desc(vq, headcount);
break;
}
- vhost_add_used_and_signal_n(&net->dev, vq, vq->heads,
- headcount);
+
+ vhost_net_add_used_and_signal_n(&net->dev, nvq,
+ vq->heads, headcount);
+
if (unlikely(vq_log))
vhost_log_write(vq, vq_log, log, vhost_len);
total_len += vhost_len;
@@ -664,6 +748,12 @@ static void handle_rx(struct vhost_net *net)
break;
}
}
+
+ left = vhost_net_check_coalesce_and_signal(&net->dev, nvq);
+ if (left > 0)
+ hrtimer_start(&nvq->c_timer, ms_to_ktime(left),
+ HRTIMER_MODE_REL);
+
out:
mutex_unlock(&vq->mutex);
}
@@ -700,6 +790,18 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
}

+static enum hrtimer_restart vhost_net_timer_handler(struct hrtimer *timer)
+{
+ struct vhost_net_virtqueue *nvq = container_of(timer,
+ struct vhost_net_virtqueue,
+ c_timer);
+ struct vhost_virtqueue *vq = &nvq->vq;
+
+ vhost_poll_queue(&vq->poll);
+
+ return HRTIMER_NORESTART;
+}
+
static int vhost_net_open(struct inode *inode, struct file *f)
{
struct vhost_net *n;
@@ -731,6 +833,13 @@ static int vhost_net_open(struct inode *inode, struct file *f)
n->vqs[i].done_idx = 0;
n->vqs[i].vhost_hlen = 0;
n->vqs[i].sock_hlen = 0;
+ n->vqs[i].max_coalesced_frames = 0;
+ n->vqs[i].coalesce_usecs = 0;
+ n->vqs[i].last_signal = ktime_get();
+ n->vqs[i].coalesced = 0;
+ hrtimer_init(&n->vqs[i].c_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ n->vqs[i].c_timer.function = vhost_net_timer_handler;
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX);

@@ -907,6 +1016,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
struct vhost_virtqueue *vq;
struct vhost_net_virtqueue *nvq;
struct vhost_net_ubuf_ref *ubufs, *oldubufs = NULL;
+ unsigned int coalesced;
int r;

mutex_lock(&n->dev.mutex);
@@ -935,6 +1045,7 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)

/* start polling new socket */
oldsock = vq->private_data;
+ coalesced = nvq->coalesced;
if (sock != oldsock) {
ubufs = vhost_net_ubuf_alloc(vq,
sock && vhost_sock_zcopy(sock));
@@ -969,6 +1080,12 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
mutex_unlock(&vq->mutex);
}

+ if (coalesced) {
+ mutex_lock(&vq->mutex);
+ vhost_signal(&n->dev, vq);
+ mutex_unlock(&vq->mutex);
+ }
+
if (oldsock) {
vhost_net_flush_vq(n, index);
sockfd_put(oldsock);
@@ -1075,6 +1192,67 @@ out:
return r;
}

+static long vhost_net_set_vring_coalesce(struct vhost_dev *d, void __user *argp)
+{
+ u32 __user *idxp = argp;
+ u32 idx;
+ int r;
+ struct vhost_virtqueue *vq;
+ struct vhost_net_vring_coalesce c;
+ struct vhost_net_virtqueue *nvq;
+
+ r = get_user(idx, idxp);
+ if (r < 0)
+ return r;
+ if (idx >= d->nvqs)
+ return -ENOBUFS;
+
+ vq = d->vqs[idx];
+ nvq = container_of(vq, struct vhost_net_virtqueue, vq);
+
+ r = copy_from_user(&c, argp, sizeof(c));
+ if (r < 0)
+ return r;
+
+ mutex_lock(&vq->mutex);
+ nvq->coalesce_usecs = c.coalesce_usecs;
+ nvq->max_coalesced_frames = c.max_coalesced_frames;
+ mutex_unlock(&vq->mutex);
+
+ return 0;
+}
+
+static long vhost_net_get_vring_coalesce(struct vhost_dev *d, void __user *argp)
+{
+ u32 __user *idxp = argp;
+ u32 idx;
+ int r;
+ struct vhost_virtqueue *vq;
+ struct vhost_net_vring_coalesce c;
+ struct vhost_net_virtqueue *nvq;
+
+ r = get_user(idx, idxp);
+ if (r < 0)
+ return r;
+ if (idx >= d->nvqs)
+ return -ENOBUFS;
+
+ vq = d->vqs[idx];
+ nvq = container_of(vq, struct vhost_net_virtqueue, vq);
+
+ mutex_lock(&vq->mutex);
+ c.index = idx;
+ c.coalesce_usecs = nvq->coalesce_usecs;
+ c.max_coalesced_frames = nvq->max_coalesced_frames;
+ mutex_unlock(&vq->mutex);
+
+ r = copy_to_user(argp, &c, sizeof(c));
+ if (r < 0)
+ return r;
+
+ return 0;
+}
+
static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
unsigned long arg)
{
@@ -1105,6 +1283,10 @@ static long vhost_net_ioctl(struct file *f, unsigned int ioctl,
return vhost_net_reset_owner(n);
case VHOST_SET_OWNER:
return vhost_net_set_owner(n);
+ case VHOST_NET_SET_VRING_COALESCE:
+ return vhost_net_set_vring_coalesce(&n->dev, argp);
+ case VHOST_NET_GET_VRING_COALESCE:
+ return vhost_net_get_vring_coalesce(&n->dev, argp);
default:
mutex_lock(&n->dev.mutex);
r = vhost_dev_ioctl(&n->dev, ioctl, argp);
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index bb6a5b4..6799cc1 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -27,6 +27,12 @@ struct vhost_vring_file {

};

+struct vhost_net_vring_coalesce {
+ unsigned int index;
+ __u32 coalesce_usecs;
+ __u32 max_coalesced_frames;
+};
+
struct vhost_vring_addr {
unsigned int index;
/* Option flags. */
@@ -121,6 +127,12 @@ struct vhost_memory {
* device. This can be used to stop the ring (e.g. for migration). */
#define VHOST_NET_SET_BACKEND _IOW(VHOST_VIRTIO, 0x30, struct vhost_vring_file)

+/* Setting interrupt coalescing parameters. */
+#define VHOST_NET_SET_VRING_COALESCE \
+ _IOW(VHOST_VIRTIO, 0x31, struct vhost_net_vring_coalesce)
+/* Getting interrupt coalescing parameters. */
+#define VHOST_NET_GET_VRING_COALESCE \
+ _IOW(VHOST_VIRTIO, 0x32, struct vhost_net_vring_coalesce)
/* Feature bits */
/* Log all write descriptors. Can be changed while device is active. */
#define VHOST_F_LOG_ALL 26
--
1.8.3.1

2014-12-01 10:19:05

by Jason Wang

Subject: [PATCH RFC v4 net-next 2/5] virtio_net: bql

From: "Michael S. Tsirkin" <[email protected]>

Improve tx batching using byte queue limits.
Should be especially effective for MQ.
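
For readers unfamiliar with BQL, a minimal sketch of the contract this patch
follows (illustration only; the two example functions are assumptions, not
driver code): every byte reported at enqueue time must eventually be reported
back at completion time, and the core uses the pair to stop and wake the
queue so only a bounded amount of data sits queued in the device:

/* On the transmit path, after the buffer is handed to the virtqueue. */
static void example_tx_enqueue(struct netdev_queue *txq, struct sk_buff *skb)
{
	netdev_tx_sent_queue(txq, skb->len);		/* may stop the queue */
}

/* On the completion path, once the device has consumed the buffers. */
static void example_tx_complete(struct netdev_queue *txq,
				unsigned int pkts, unsigned int bytes)
{
	netdev_tx_completed_queue(txq, pkts, bytes);	/* may wake the queue */
}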

Cc: Rusty Russell <[email protected]>
Cc: Michael S. Tsirkin <[email protected]>
Signed-off-by: Michael S. Tsirkin <[email protected]>
---
drivers/net/virtio_net.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f68114e..0ed24ff 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -224,6 +224,7 @@ static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
struct virtnet_info *vi = sq->vq->vdev->priv;
struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
unsigned int packets = 0;
+ unsigned int bytes = 0;

while (packets < budget &&
(skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
@@ -231,6 +232,7 @@ static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,

u64_stats_update_begin(&stats->tx_syncp);
stats->tx_bytes += skb->len;
+ bytes += skb->len;
stats->tx_packets++;
u64_stats_update_end(&stats->tx_syncp);

@@ -238,6 +240,8 @@ static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
packets++;
}

+ netdev_tx_completed_queue(txq, packets, bytes);
+
if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
netif_tx_start_queue(txq);

@@ -964,6 +968,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
int err;
struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
bool kick = !skb->xmit_more;
+ unsigned int bytes = skb->len;

virtqueue_disable_cb(sq->vq);

@@ -981,6 +986,8 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}

+ netdev_tx_sent_queue(txq, bytes);
+
/* Apparently nice girls don't return TX_BUSY; stop the queue
* before it gets out of hand. Naturally, this wastes entries. */
if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
--
1.8.3.1

2014-12-01 10:35:48

by Michael S. Tsirkin

Subject: Re: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt

On Mon, Dec 01, 2014 at 06:17:04PM +0800, Jason Wang wrote:
> On newer hosts that support delayed tx interrupts,
> we probably don't have much to gain from orphaning
> packets early.
>
> Note: this might degrade performance for
> hosts without event idx support.
> Should be addressed by the next patch.
>
> Cc: Rusty Russell <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Jason Wang <[email protected]>

Could you document the changes from the RFC I sent, please?
Are there optimizations?
If yes, it might be easier to review (at least for me) if you refactor this,
e.g. apply the straightforward RFC patch first and then any optimizations
on top. If it's taking a different approach, please feel free to
disregard this.


> ---
> drivers/net/virtio_net.c | 132 +++++++++++++++++++++++++++++++----------------
> 1 file changed, 88 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index ec2a8b4..f68114e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>
> /* Name of the send queue: output.$index */
> char name[40];
> +
> + struct napi_struct napi;
> };
>
> /* Internal representation of a receive virtqueue */
> @@ -137,6 +139,9 @@ struct virtnet_info {
>
> /* CPU hot plug notifier */
> struct notifier_block nb;
> +
> + /* Budget for polling tx completion */
> + u32 tx_work_limit;
> };
>
> struct skb_vnet_hdr {
> @@ -211,15 +216,41 @@ static struct page *get_a_page(struct receive_queue *rq, gfp_t gfp_mask)
> return p;
> }
>
> +static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
> + struct send_queue *sq, int budget)
> +{
> + struct sk_buff *skb;
> + unsigned int len;
> + struct virtnet_info *vi = sq->vq->vdev->priv;
> + struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> + unsigned int packets = 0;
> +
> + while (packets < budget &&
> + (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> + pr_debug("Sent skb %p\n", skb);
> +
> + u64_stats_update_begin(&stats->tx_syncp);
> + stats->tx_bytes += skb->len;
> + stats->tx_packets++;
> + u64_stats_update_end(&stats->tx_syncp);
> +
> + dev_kfree_skb_any(skb);
> + packets++;
> + }
> +
> + if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
> + netif_tx_start_queue(txq);
> +
> + return packets;
> +}
> +
> static void skb_xmit_done(struct virtqueue *vq)
> {
> struct virtnet_info *vi = vq->vdev->priv;
> + struct send_queue *sq = &vi->sq[vq2txq(vq)];
>
> - /* Suppress further interrupts. */
> - virtqueue_disable_cb(vq);
> -
> - /* We were probably waiting for more output buffers. */
> - netif_wake_subqueue(vi->dev, vq2txq(vq));
> + virtqueue_disable_cb(sq->vq);
> + napi_schedule(&sq->napi);
> }
>
> static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
> @@ -777,6 +808,32 @@ again:
> return received;
> }
>
> +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
> +{
> + struct send_queue *sq =
> + container_of(napi, struct send_queue, napi);
> + struct virtnet_info *vi = sq->vq->vdev->priv;
> + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
> + u32 limit = vi->tx_work_limit;
> + unsigned int sent;
> +
> + __netif_tx_lock(txq, smp_processor_id());
> + sent = free_old_xmit_skbs(txq, sq, limit);
> + if (sent < limit) {
> + napi_complete(napi);
> + /* Note: we must enable cb *after* napi_complete, because
> + * napi_schedule calls from callbacks that trigger before
> + * napi_complete are ignored.
> + */
> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> + virtqueue_disable_cb(sq->vq);
> + napi_schedule(&sq->napi);
> + }
> + }
> + __netif_tx_unlock(txq);
> + return sent < limit ? 0 : budget;
> +}
> +

Unlike the patch I sent, this seems to ignore the budget
and always polls up to the full napi_weight.
Seems strange. What is the reason for this?



> #ifdef CONFIG_NET_RX_BUSY_POLL
> /* must be called with local_bh_disable()d */
> static int virtnet_busy_poll(struct napi_struct *napi)
> @@ -825,30 +882,12 @@ static int virtnet_open(struct net_device *dev)
> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> schedule_delayed_work(&vi->refill, 0);
> virtnet_napi_enable(&vi->rq[i]);
> + napi_enable(&vi->sq[i].napi);
> }
>
> return 0;
> }
>
> -static void free_old_xmit_skbs(struct send_queue *sq)
> -{
> - struct sk_buff *skb;
> - unsigned int len;
> - struct virtnet_info *vi = sq->vq->vdev->priv;
> - struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> -
> - while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - pr_debug("Sent skb %p\n", skb);
> -
> - u64_stats_update_begin(&stats->tx_syncp);
> - stats->tx_bytes += skb->len;
> - stats->tx_packets++;
> - u64_stats_update_end(&stats->tx_syncp);
> -
> - dev_kfree_skb_any(skb);
> - }
> -}
> -
> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> {
> struct skb_vnet_hdr *hdr;
> @@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> sg_set_buf(sq->sg, hdr, hdr_len);
> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
> }
> - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> +
> + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
> + GFP_ATOMIC);
> }
>
> static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = !skb->xmit_more;
>
> - /* Free up any pending old buffers before queueing new ones. */
> - free_old_xmit_skbs(sq);
> + virtqueue_disable_cb(sq->vq);
>
> /* Try to transmit */
> err = xmit_skb(sq, skb);
> @@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> return NETDEV_TX_OK;
> }
>
> - /* Don't wait up for transmitted skbs to be freed. */
> - skb_orphan(skb);
> - nf_reset(skb);
> -
> /* Apparently nice girls don't return TX_BUSY; stop the queue
> * before it gets out of hand. Naturally, this wastes entries. */
> - if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> + if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
> netif_stop_subqueue(dev, qnum);
> - if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> - /* More just got used, free them then recheck. */
> - free_old_xmit_skbs(sq);
> - if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> - netif_start_subqueue(dev, qnum);
> - virtqueue_disable_cb(sq->vq);
> - }
> - }
> - }
>
> if (kick || netif_xmit_stopped(txq))
> virtqueue_kick(sq->vq);
>
> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> + virtqueue_disable_cb(sq->vq);
> + napi_schedule(&sq->napi);
> + }
> +
> return NETDEV_TX_OK;
> }
>
> @@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device *dev)
> /* Make sure refill_work doesn't re-enable napi! */
> cancel_delayed_work_sync(&vi->refill);
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> napi_disable(&vi->rq[i].napi);
> + napi_disable(&vi->sq[i].napi);
> + }
>
> return 0;
> }
> @@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
> {
> int i;
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> netif_napi_del(&vi->rq[i].napi);
> + netif_napi_del(&vi->sq[i].napi);
> + }
>
> kfree(vi->rq);
> kfree(vi->sq);
> @@ -1607,6 +1643,8 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
> napi_weight);
> napi_hash_add(&vi->rq[i].napi);
> + netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
> + napi_weight);
>
> sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
> ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
> @@ -1790,6 +1828,8 @@ static int virtnet_probe(struct virtio_device *vdev)
> if (err)
> goto free_stats;
>
> + vi->tx_work_limit = napi_weight;
> +
> #ifdef CONFIG_SYSFS
> if (vi->mergeable_rx_bufs)
> dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
> @@ -1904,8 +1944,10 @@ static int virtnet_freeze(struct virtio_device *vdev)
> if (netif_running(vi->dev)) {
> for (i = 0; i < vi->max_queue_pairs; i++) {
> napi_disable(&vi->rq[i].napi);
> + napi_disable(&vi->sq[i].napi);
> napi_hash_del(&vi->rq[i].napi);
> netif_napi_del(&vi->rq[i].napi);
> + netif_napi_del(&vi->sq[i].napi);
> }
> }
>
> @@ -1930,8 +1972,10 @@ static int virtnet_restore(struct virtio_device *vdev)
> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
> schedule_delayed_work(&vi->refill, 0);
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> virtnet_napi_enable(&vi->rq[i]);
> + napi_enable(&vi->sq[i].napi);
> + }
> }
>
> netif_device_attach(vi->dev);
> --
> 1.8.3.1

2014-12-01 10:42:31

by Michael S. Tsirkin

Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> Hello:
>
> We used to orphan packets before transmission for virtio-net. This breaks
> socket accounting and can lead serveral functions won't work, e.g:
>
> - Byte Queue Limit depends on tx completion nofication to work.
> - Packet Generator depends on tx completion nofication for the last
> transmitted packet to complete.
> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to work.
>
> This series tries to solve the issue by enabling tx interrupts. To minize
> the performance impacts of this, several optimizations were used:
>
> - In guest side, virtqueue_enable_cb_delayed() was used to delay the tx
> interrupt untile 3/4 pending packets were sent.
> - In host side, interrupt coalescing were used to reduce tx interrupts.
>
> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
>
> - For guest receiving. No obvious regression on throughput were
> noticed. More cpu utilization were noticed in few cases.
> - For guest transmission. Very huge improvement on througput for small
> packet transmission were noticed. This is expected since TSQ and other
> optimization for small packet transmission work after tx interrupt. But
> will use more cpu for large packets.
> - For TCP_RR, regression (10% on transaction rate and cpu utilization) were
> found. Tx interrupt won't help but cause overhead in this case. Using
> more aggressive coalescing parameters may help to reduce the regression.

OK, you have posted coalescing patches - do they help any?

I'm not sure the regression is due to interrupts.
It would make sense for CPU but why would it
hurt transaction rate?

It's possible that we are deferring kicks too much due to BQL.

As an experiment: do we get any of it back if we do
- if (kick || netif_xmit_stopped(txq))
- virtqueue_kick(sq->vq);
+ virtqueue_kick(sq->vq);
?

If yes, we can just kick periodically, e.g. after queueing every X bytes.
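
A hypothetical sketch of the "kick every X bytes" idea (illustration only;
the 16384-byte threshold and the bytes_since_kick field do not exist in this
series):

/* Hypothetical helper (not in this series): kick at most once per
 * threshold's worth of queued data, regardless of xmit_more. */
static void kick_every_x_bytes(struct send_queue *sq,
			       struct netdev_queue *txq,
			       unsigned int len, bool kick)
{
	sq->bytes_since_kick += len;	/* hypothetical field */
	if (kick || netif_xmit_stopped(txq) ||
	    sq->bytes_since_kick >= 16384) {
		virtqueue_kick(sq->vq);
		sq->bytes_since_kick = 0;
	}
}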

> Changes from RFC V3:
> - Don't free tx packets in ndo_start_xmit()
> - Add interrupt coalescing support for virtio-net
> Changes from RFC v2:
> - clean up code, address issues raised by Jason
> Changes from RFC v1:
> - address comments by Jason Wang, use delayed cb everywhere
> - rebased Jason's patch on top of mine and include it (with some tweaks)
>
> Please reivew. Comments were more than welcomed.
>
> [1] Performance Test result:
>
> Environment:
> - Two Intel(R) Xeon(R) CPU E5620 @ 2.40GHz machines connected back to back
> with 82599ES cards.
> - Both host and guest were net-next.git plus the patch
> - Coalescing parameters for the card:
> Adaptive RX: off TX: off
> rx-usecs: 1
> rx-frames: 0
> tx-usecs: 0
> tx-frames: 0
> - Vhost_net was enabled and zerocopy was disabled
> - Tests was done by netperf-2.6
> - Guest has 2 vcpus with single queue virtio-net
>
> Results:
> - Numbers of square brackets are whose significance is grater than 95%
>
> Guest RX:
>
> size/sessions/+throughput/+cpu/+per_cpu_throughput/
> 64/1/+2.0326/[+6.2807%]/-3.9970%/
> 64/2/-0.2104%/[+3.2012%]/[-3.3058%]/
> 64/4/+1.5956%/+2.2451%/-0.6353%/
> 64/8/+1.1732%/+3.5123%/-2.2598%/
> 256/1/+3.7619%/[+5.8117%]/-1.9372%/
> 256/2/-0.0661%/[+3.2511%]/-3.2127%/
> 256/4/+1.1435%/[-8.1842%]/[+10.1591%]/
> 256/8/[+2.2447%]/[+6.2044%]/[-3.7283%]/
> 1024/1/+9.1479%/[+12.0997%]/[-2.6332%]/
> 1024/2/[-17.3341%]/[+0.0000%]/[-17.3341%]/
> 1024/4/[-0.6284%]/-1.0376%/+0.4135%/
> 1024/8/+1.1444%/-1.6069%/+2.7961%/
> 4096/1/+0.0401%/-0.5993%/+0.6433%/
> 4096/2/[-0.5894%]/-2.2071%/+1.6542%/
> 4096/4/[-0.5560%]/-1.4969%/+0.9553%/
> 4096/8/-0.3362%/+2.7086%/-2.9645%/
> 16384/1/-0.0285%/+0.7247%/-0.7478%/
> 16384/2/-0.5286%/+0.3287%/-0.8545%/
> 16384/4/-0.3297%/-2.0543%/+1.7608%/
> 16384/8/+1.0932%/+4.0253%/-2.8187%/
> 65535/1/+0.0003%/-0.1502%/+0.1508%/
> 65535/2/[-0.6065%]/+0.2309%/-0.8355%/
> 65535/4/[-0.6861%]/[+3.9451%]/[-4.4554%]/
> 65535/8/+1.8359%/+3.1590%/-1.2825%/
>
> Guest RX:
> size/sessions/+throughput/+cpu/+per_cpu_throughput/
> 64/1/[+65.0961%]/[-8.6807%]/[+80.7900%]/
> 64/2/[+6.0288%]/[-2.2823%]/[+8.5052%]/
> 64/4/[+5.9038%]/[-2.1834%]/[+8.2677%]/
> 64/8/[+5.4154%]/[-2.1804%]/[+7.7651%]/
> 256/1/[+184.6462%]/[+4.8906%]/[+171.3742%]/
> 256/2/[+46.0731%]/[-8.9626%]/[+60.4539%]/
> 256/4/[+45.8547%]/[-8.3027%]/[+59.0612%]/
> 256/8/[+45.3486%]/[-8.4024%]/[+58.6817%]/
> 1024/1/[+432.5372%]/[+3.9566%]/[+412.2689%]/
> 1024/2/[-1.4207%]/[-23.6426%]/[+29.1025%]/
> 1024/4/-0.1003%/[-13.6416%]/[+15.6804%]/
> 1024/8/[+0.2200%]/[+2.0634%]/[-1.8061%]/
> 4096/1/[+18.4835%]/[-46.1508%]/[+120.0283%]/
> 4096/2/+0.1770%/[-26.2780%]/[+35.8848%]/
> 4096/4/-0.1012%/-0.7353%/+0.6388%/
> 4096/8/-0.6091%/+1.4159%/-1.9968%/
> 16384/1/-0.0424%/[+11.9373%]/[-10.7021%]/
> 16384/2/+0.0482%/+2.4685%/-2.3620%/
> 16384/4/+0.0840%/[+5.3587%]/[-5.0064%]/
> 16384/8/+0.0048%/[+5.0176%]/[-4.7733%]/
> 65535/1/-0.0095%/[+10.9408%]/[-9.8705%]/
> 65535/2/+0.1515%/[+8.1709%]/[-7.4137%]/
> 65535/4/+0.0203%/[+5.4316%]/[-5.1325%]/
> 65535/8/+0.1427%/[+6.2753%]/[-5.7705%]/
>
> size/sessions/+trans.rate/+cpu/+per_cpu_trans.rate/
> 64/1/+0.2346%/[+11.5080%]/[-10.1099%]/
> 64/25/[-10.7893%]/-0.5791%/[-10.2697%]/
> 64/50/[-11.5997%]/-0.3429%/[-11.2956%]/
> 256/1/+0.7219%/[+13.2374%]/[-11.0524%]/
> 256/25/-6.9567%/+0.8887%/[-7.7763%]/
> 256/50/[-4.8814%]/-0.0338%/[-4.8492%]/
> 4096/1/-1.6061%/-0.7561%/-0.8565%/
> 4096/25/[+2.2120%]/[+1.0839%]/+1.1161%/
> 4096/50/[+5.6180%]/[+3.2116%]/[+2.3315%]/
>
> Jason Wang (4):
> virtio_net: enable tx interrupt
> virtio-net: optimize free_old_xmit_skbs stats
> virtio-net: add basic interrupt coalescing support
> vhost_net: interrupt coalescing support
>
> Michael S. Tsirkin (1):
> virtio_net: bql
>
> drivers/net/virtio_net.c | 211 ++++++++++++++++++++++++++++++++--------
> drivers/vhost/net.c | 200 +++++++++++++++++++++++++++++++++++--
> include/uapi/linux/vhost.h | 12 +++
> include/uapi/linux/virtio_net.h | 12 +++
> 4 files changed, 383 insertions(+), 52 deletions(-)
>
> --
> 1.8.3.1

2014-12-02 03:09:36

by Jason Wang

Subject: Re: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt



On Mon, Dec 1, 2014 at 6:35 PM, Michael S. Tsirkin <[email protected]>
wrote:
> On Mon, Dec 01, 2014 at 06:17:04PM +0800, Jason Wang wrote:
>> On newer hosts that support delayed tx interrupts,
>> we probably don't have much to gain from orphaning
>> packets early.
>>
>> Note: this might degrade performance for
>> hosts without event idx support.
>> Should be addressed by the next patch.
>>
>> Cc: Rusty Russell <[email protected]>
>> Cc: Michael S. Tsirkin <[email protected]>
>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>> Signed-off-by: Jason Wang <[email protected]>
>
> Could you document the changes from the RFC I sent please?

Okay, I will send a V5 and document the changes in more detail.
>
> Are there optimizations?

Two optimizations:

- Don't free tx packets in ndo_start_xmit(); tests show this reduces the
effect of coalescing.
- Let ethtool change the number of packets freed in one tx napi run
through tx-frames-irq. This is necessary since users may want to coalesce more
than 64 packets per irq.

>
> If yes, it might be easier to review (at least for me), if you
> refactor this,
> e.g. applying the straight-forward rfc patch and then optimizations if
> any on top. If it's taking a different approach, pls feel free to
> disregard this.
>
>
>> ---
>> drivers/net/virtio_net.c | 132
>> +++++++++++++++++++++++++++++++----------------
>> 1 file changed, 88 insertions(+), 44 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index ec2a8b4..f68114e 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -72,6 +72,8 @@ struct send_queue {
>>
>> /* Name of the send queue: output.$index */
>> char name[40];
>> +
>> + struct napi_struct napi;
>> };
>>
>> /* Internal representation of a receive virtqueue */
>> @@ -137,6 +139,9 @@ struct virtnet_info {
>>
>> /* CPU hot plug notifier */
>> struct notifier_block nb;
>> +
>> + /* Budget for polling tx completion */
>> + u32 tx_work_limit;
>> };
>>
>> struct skb_vnet_hdr {
>> @@ -211,15 +216,41 @@ static struct page *get_a_page(struct
>> receive_queue *rq, gfp_t gfp_mask)
>> return p;
>> }
>>
>> +static unsigned int free_old_xmit_skbs(struct netdev_queue *txq,
>> + struct send_queue *sq, int budget)
>> +{
>> + struct sk_buff *skb;
>> + unsigned int len;
>> + struct virtnet_info *vi = sq->vq->vdev->priv;
>> + struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>> + unsigned int packets = 0;
>> +
>> + while (packets < budget &&
>> + (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>> + pr_debug("Sent skb %p\n", skb);
>> +
>> + u64_stats_update_begin(&stats->tx_syncp);
>> + stats->tx_bytes += skb->len;
>> + stats->tx_packets++;
>> + u64_stats_update_end(&stats->tx_syncp);
>> +
>> + dev_kfree_skb_any(skb);
>> + packets++;
>> + }
>> +
>> + if (sq->vq->num_free >= 2+MAX_SKB_FRAGS)
>> + netif_tx_start_queue(txq);
>> +
>> + return packets;
>> +}
>> +
>> static void skb_xmit_done(struct virtqueue *vq)
>> {
>> struct virtnet_info *vi = vq->vdev->priv;
>> + struct send_queue *sq = &vi->sq[vq2txq(vq)];
>>
>> - /* Suppress further interrupts. */
>> - virtqueue_disable_cb(vq);
>> -
>> - /* We were probably waiting for more output buffers. */
>> - netif_wake_subqueue(vi->dev, vq2txq(vq));
>> + virtqueue_disable_cb(sq->vq);
>> + napi_schedule(&sq->napi);
>> }
>>
>> static unsigned int mergeable_ctx_to_buf_truesize(unsigned long
>> mrg_ctx)
>> @@ -777,6 +808,32 @@ again:
>> return received;
>> }
>>
>> +static int virtnet_poll_tx(struct napi_struct *napi, int budget)
>> +{
>> + struct send_queue *sq =
>> + container_of(napi, struct send_queue, napi);
>> + struct virtnet_info *vi = sq->vq->vdev->priv;
>> + struct netdev_queue *txq = netdev_get_tx_queue(vi->dev,
>> vq2txq(sq->vq));
>> + u32 limit = vi->tx_work_limit;
>> + unsigned int sent;
>> +
>> + __netif_tx_lock(txq, smp_processor_id());
>> + sent = free_old_xmit_skbs(txq, sq, limit);
>> + if (sent < limit) {
>> + napi_complete(napi);
>> + /* Note: we must enable cb *after* napi_complete, because
>> + * napi_schedule calls from callbacks that trigger before
>> + * napi_complete are ignored.
>> + */
>> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>> + virtqueue_disable_cb(sq->vq);
>> + napi_schedule(&sq->napi);
>> + }
>> + }
>> + __netif_tx_unlock(txq);
>> + return sent < limit ? 0 : budget;
>> +}
>> +
>
> Unlike the patch I sent, this seems to ignore the budget,
> and always poll the full napi_weight.
> Seems strange. What is the reason for this?
>

The budget is in fact the tx_work_limit (64 by default).
It can be tuned through ethtool tx-frames-irq to control
how many packets at most can be processed in one tx napi run.

Users may want to coalesce more than 64 packets per irq, so
something like this is necessary.
>
>
>> #ifdef CONFIG_NET_RX_BUSY_POLL
>> /* must be called with local_bh_disable()d */
>> static int virtnet_busy_poll(struct napi_struct *napi)
>> @@ -825,30 +882,12 @@ static int virtnet_open(struct net_device
>> *dev)
>> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>> schedule_delayed_work(&vi->refill, 0);
>> virtnet_napi_enable(&vi->rq[i]);
>> + napi_enable(&vi->sq[i].napi);
>> }
>>
>> return 0;
>> }
>>
>> -static void free_old_xmit_skbs(struct send_queue *sq)
>> -{
>> - struct sk_buff *skb;
>> - unsigned int len;
>> - struct virtnet_info *vi = sq->vq->vdev->priv;
>> - struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
>> -
>> - while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>> - pr_debug("Sent skb %p\n", skb);
>> -
>> - u64_stats_update_begin(&stats->tx_syncp);
>> - stats->tx_bytes += skb->len;
>> - stats->tx_packets++;
>> - u64_stats_update_end(&stats->tx_syncp);
>> -
>> - dev_kfree_skb_any(skb);
>> - }
>> -}
>> -
>> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>> {
>> struct skb_vnet_hdr *hdr;
>> @@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq,
>> struct sk_buff *skb)
>> sg_set_buf(sq->sg, hdr, hdr_len);
>> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
>> }
>> - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
>> GFP_ATOMIC);
>> +
>> + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
>> + GFP_ATOMIC);
>> }
>>
>> static netdev_tx_t start_xmit(struct sk_buff *skb, struct
>> net_device *dev)
>> @@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>> bool kick = !skb->xmit_more;
>>
>> - /* Free up any pending old buffers before queueing new ones. */
>> - free_old_xmit_skbs(sq);
>> + virtqueue_disable_cb(sq->vq);
>>
>> /* Try to transmit */
>> err = xmit_skb(sq, skb);
>> @@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> return NETDEV_TX_OK;
>> }
>>
>> - /* Don't wait up for transmitted skbs to be freed. */
>> - skb_orphan(skb);
>> - nf_reset(skb);
>> -
>> /* Apparently nice girls don't return TX_BUSY; stop the queue
>> * before it gets out of hand. Naturally, this wastes entries. */
>> - if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>> + if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
>> netif_stop_subqueue(dev, qnum);
>> - if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>> - /* More just got used, free them then recheck. */
>> - free_old_xmit_skbs(sq);
>> - if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>> - netif_start_subqueue(dev, qnum);
>> - virtqueue_disable_cb(sq->vq);
>> - }
>> - }
>> - }
>>
>> if (kick || netif_xmit_stopped(txq))
>> virtqueue_kick(sq->vq);
>>
>> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>> + virtqueue_disable_cb(sq->vq);
>> + napi_schedule(&sq->napi);
>> + }
>> +
>> return NETDEV_TX_OK;
>> }
>>
>> @@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device
>> *dev)
>> /* Make sure refill_work doesn't re-enable napi! */
>> cancel_delayed_work_sync(&vi->refill);
>>
>> - for (i = 0; i < vi->max_queue_pairs; i++)
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> napi_disable(&vi->rq[i].napi);
>> + napi_disable(&vi->sq[i].napi);
>> + }
>>
>> return 0;
>> }
>> @@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct
>> virtnet_info *vi)
>> {
>> int i;
>>
>> - for (i = 0; i < vi->max_queue_pairs; i++)
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> netif_napi_del(&vi->rq[i].napi);
>> + netif_napi_del(&vi->sq[i].napi);
>> + }
>>
>> kfree(vi->rq);
>> kfree(vi->sq);
>> @@ -1607,6 +1643,8 @@ static int virtnet_alloc_queues(struct
>> virtnet_info *vi)
>> netif_napi_add(vi->dev, &vi->rq[i].napi, virtnet_poll,
>> napi_weight);
>> napi_hash_add(&vi->rq[i].napi);
>> + netif_napi_add(vi->dev, &vi->sq[i].napi, virtnet_poll_tx,
>> + napi_weight);
>>
>> sg_init_table(vi->rq[i].sg, ARRAY_SIZE(vi->rq[i].sg));
>> ewma_init(&vi->rq[i].mrg_avg_pkt_len, 1, RECEIVE_AVG_WEIGHT);
>> @@ -1790,6 +1828,8 @@ static int virtnet_probe(struct virtio_device
>> *vdev)
>> if (err)
>> goto free_stats;
>>
>> + vi->tx_work_limit = napi_weight;
>> +
>> #ifdef CONFIG_SYSFS
>> if (vi->mergeable_rx_bufs)
>> dev->sysfs_rx_queue_group = &virtio_net_mrg_rx_group;
>> @@ -1904,8 +1944,10 @@ static int virtnet_freeze(struct
>> virtio_device *vdev)
>> if (netif_running(vi->dev)) {
>> for (i = 0; i < vi->max_queue_pairs; i++) {
>> napi_disable(&vi->rq[i].napi);
>> + napi_disable(&vi->sq[i].napi);
>> napi_hash_del(&vi->rq[i].napi);
>> netif_napi_del(&vi->rq[i].napi);
>> + netif_napi_del(&vi->sq[i].napi);
>> }
>> }
>>
>> @@ -1930,8 +1972,10 @@ static int virtnet_restore(struct
>> virtio_device *vdev)
>> if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
>> schedule_delayed_work(&vi->refill, 0);
>>
>> - for (i = 0; i < vi->max_queue_pairs; i++)
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> virtnet_napi_enable(&vi->rq[i]);
>> + napi_enable(&vi->sq[i].napi);
>> + }
>> }
>>
>> netif_device_attach(vi->dev);
>> --
>> 1.8.3.1

2014-12-02 03:15:33

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts



On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]>
wrote:
> On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
>> Hello:
>>
>> We used to orphan packets before transmission for virtio-net. This
>> breaks
>> socket accounting and can lead serveral functions won't work, e.g:
>>
>> - Byte Queue Limit depends on tx completion nofication to work.
>> - Packet Generator depends on tx completion nofication for the last
>> transmitted packet to complete.
>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
>> work.
>>
>> This series tries to solve the issue by enabling tx interrupts. To
>> minize
>> the performance impacts of this, several optimizations were used:
>>
>> - In guest side, virtqueue_enable_cb_delayed() was used to delay
>> the tx
>> interrupt untile 3/4 pending packets were sent.
>> - In host side, interrupt coalescing were used to reduce tx
>> interrupts.
>>
>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
>>
>> - For guest receiving. No obvious regression on throughput were
>> noticed. More cpu utilization were noticed in few cases.
>> - For guest transmission. Very huge improvement on througput for
>> small
>> packet transmission were noticed. This is expected since TSQ and
>> other
>> optimization for small packet transmission work after tx
>> interrupt. But
>> will use more cpu for large packets.
>> - For TCP_RR, regression (10% on transaction rate and cpu
>> utilization) were
>> found. Tx interrupt won't help but cause overhead in this case.
>> Using
>> more aggressive coalescing parameters may help to reduce the
>> regression.
>
> OK, you do have posted coalescing patches - does it help any?

Helps a lot.

For RX, it saves about 5% - 10% cpu. (reduces tx intrs by 60%-90%)
For small packet TX, it increases throughput by 33% - 245%. (reduces
tx intrs by about 60%)
For TCP_RR, it increases the trans.rate by 3%-10%. (reduces tx intrs by 40%-80%)

>
> I'm not sure the regression is due to interrupts.
> It would make sense for CPU but why would it
> hurt transaction rate?

Anyway, the guest needs to take some cycles to handle tx interrupts.
And the transaction rate does increase if we coalesce more tx interrupts.
>
>
> It's possible that we are deferring kicks too much due to BQL.
>
> As an experiment: do we get any of it back if we do
> - if (kick || netif_xmit_stopped(txq))
> - virtqueue_kick(sq->vq);
> + virtqueue_kick(sq->vq);
> ?


I will try, but during TCP_RR at most 1 packet was pending,
so I doubt BQL can help in this case.
>
>
> If yes, we can just kick e.g. periodically, e.g. after queueing each
> X bytes.

Okay, let me try to see if this helps.
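
For concreteness, "kick after queueing each X bytes" could look roughly
like the sketch below. virtnet_maybe_kick(), sq->bytes_since_kick and
VIRTNET_KICK_BYTES are hypothetical names used only for illustration,
not part of the posted patches:

#define VIRTNET_KICK_BYTES	(64 * 1024)	/* made-up threshold */

/* Hypothetical helper called at the end of start_xmit() in place of
 * the plain "if (kick || netif_xmit_stopped(txq)) virtqueue_kick()":
 * kick the device whenever xmit_more asks for it, the queue stopped,
 * or at least VIRTNET_KICK_BYTES were queued since the last kick.
 */
static void virtnet_maybe_kick(struct send_queue *sq,
			       struct netdev_queue *txq,
			       unsigned int len, bool kick)
{
	sq->bytes_since_kick += len;	/* hypothetical per-queue counter */

	if (kick || netif_xmit_stopped(txq) ||
	    sq->bytes_since_kick >= VIRTNET_KICK_BYTES) {
		virtqueue_kick(sq->vq);
		sq->bytes_since_kick = 0;
	}
}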
>
>> Changes from RFC V3:
>> - Don't free tx packets in ndo_start_xmit()
>> - Add interrupt coalescing support for virtio-net
>> Changes from RFC v2:
>> - clean up code, address issues raised by Jason
>> Changes from RFC v1:
>> - address comments by Jason Wang, use delayed cb everywhere
>> - rebased Jason's patch on top of mine and include it (with some
>> tweaks)
>>
>> Please reivew. Comments were more than welcomed.
>>
>> [1] Performance Test result:
>>
>> Environment:
>> - Two Intel(R) Xeon(R) CPU E5620 @ 2.40GHz machines connected back
>> to back
>> with 82599ES cards.
>> - Both host and guest were net-next.git plus the patch
>> - Coalescing parameters for the card:
>> Adaptive RX: off TX: off
>> rx-usecs: 1
>> rx-frames: 0
>> tx-usecs: 0
>> tx-frames: 0
>> - Vhost_net was enabled and zerocopy was disabled
>> - Tests was done by netperf-2.6
>> - Guest has 2 vcpus with single queue virtio-net
>>
>> Results:
>> - Numbers of square brackets are whose significance is grater than
>> 95%
>>
>> Guest RX:
>>
>> size/sessions/+throughput/+cpu/+per_cpu_throughput/
>> 64/1/+2.0326/[+6.2807%]/-3.9970%/
>> 64/2/-0.2104%/[+3.2012%]/[-3.3058%]/
>> 64/4/+1.5956%/+2.2451%/-0.6353%/
>> 64/8/+1.1732%/+3.5123%/-2.2598%/
>> 256/1/+3.7619%/[+5.8117%]/-1.9372%/
>> 256/2/-0.0661%/[+3.2511%]/-3.2127%/
>> 256/4/+1.1435%/[-8.1842%]/[+10.1591%]/
>> 256/8/[+2.2447%]/[+6.2044%]/[-3.7283%]/
>> 1024/1/+9.1479%/[+12.0997%]/[-2.6332%]/
>> 1024/2/[-17.3341%]/[+0.0000%]/[-17.3341%]/
>> 1024/4/[-0.6284%]/-1.0376%/+0.4135%/
>> 1024/8/+1.1444%/-1.6069%/+2.7961%/
>> 4096/1/+0.0401%/-0.5993%/+0.6433%/
>> 4096/2/[-0.5894%]/-2.2071%/+1.6542%/
>> 4096/4/[-0.5560%]/-1.4969%/+0.9553%/
>> 4096/8/-0.3362%/+2.7086%/-2.9645%/
>> 16384/1/-0.0285%/+0.7247%/-0.7478%/
>> 16384/2/-0.5286%/+0.3287%/-0.8545%/
>> 16384/4/-0.3297%/-2.0543%/+1.7608%/
>> 16384/8/+1.0932%/+4.0253%/-2.8187%/
>> 65535/1/+0.0003%/-0.1502%/+0.1508%/
>> 65535/2/[-0.6065%]/+0.2309%/-0.8355%/
>> 65535/4/[-0.6861%]/[+3.9451%]/[-4.4554%]/
>> 65535/8/+1.8359%/+3.1590%/-1.2825%/
>>
>> Guest RX:
>> size/sessions/+throughput/+cpu/+per_cpu_throughput/
>> 64/1/[+65.0961%]/[-8.6807%]/[+80.7900%]/
>> 64/2/[+6.0288%]/[-2.2823%]/[+8.5052%]/
>> 64/4/[+5.9038%]/[-2.1834%]/[+8.2677%]/
>> 64/8/[+5.4154%]/[-2.1804%]/[+7.7651%]/
>> 256/1/[+184.6462%]/[+4.8906%]/[+171.3742%]/
>> 256/2/[+46.0731%]/[-8.9626%]/[+60.4539%]/
>> 256/4/[+45.8547%]/[-8.3027%]/[+59.0612%]/
>> 256/8/[+45.3486%]/[-8.4024%]/[+58.6817%]/
>> 1024/1/[+432.5372%]/[+3.9566%]/[+412.2689%]/
>> 1024/2/[-1.4207%]/[-23.6426%]/[+29.1025%]/
>> 1024/4/-0.1003%/[-13.6416%]/[+15.6804%]/
>> 1024/8/[+0.2200%]/[+2.0634%]/[-1.8061%]/
>> 4096/1/[+18.4835%]/[-46.1508%]/[+120.0283%]/
>> 4096/2/+0.1770%/[-26.2780%]/[+35.8848%]/
>> 4096/4/-0.1012%/-0.7353%/+0.6388%/
>> 4096/8/-0.6091%/+1.4159%/-1.9968%/
>> 16384/1/-0.0424%/[+11.9373%]/[-10.7021%]/
>> 16384/2/+0.0482%/+2.4685%/-2.3620%/
>> 16384/4/+0.0840%/[+5.3587%]/[-5.0064%]/
>> 16384/8/+0.0048%/[+5.0176%]/[-4.7733%]/
>> 65535/1/-0.0095%/[+10.9408%]/[-9.8705%]/
>> 65535/2/+0.1515%/[+8.1709%]/[-7.4137%]/
>> 65535/4/+0.0203%/[+5.4316%]/[-5.1325%]/
>> 65535/8/+0.1427%/[+6.2753%]/[-5.7705%]/
>>
>> size/sessions/+trans.rate/+cpu/+per_cpu_trans.rate/
>> 64/1/+0.2346%/[+11.5080%]/[-10.1099%]/
>> 64/25/[-10.7893%]/-0.5791%/[-10.2697%]/
>> 64/50/[-11.5997%]/-0.3429%/[-11.2956%]/
>> 256/1/+0.7219%/[+13.2374%]/[-11.0524%]/
>> 256/25/-6.9567%/+0.8887%/[-7.7763%]/
>> 256/50/[-4.8814%]/-0.0338%/[-4.8492%]/
>> 4096/1/-1.6061%/-0.7561%/-0.8565%/
>> 4096/25/[+2.2120%]/[+1.0839%]/+1.1161%/
>> 4096/50/[+5.6180%]/[+3.2116%]/[+2.3315%]/
>>
>> Jason Wang (4):
>> virtio_net: enable tx interrupt
>> virtio-net: optimize free_old_xmit_skbs stats
>> virtio-net: add basic interrupt coalescing support
>> vhost_net: interrupt coalescing support
>>
>> Michael S. Tsirkin (1):
>> virtio_net: bql
>>
>> drivers/net/virtio_net.c | 211
>> ++++++++++++++++++++++++++++++++--------
>> drivers/vhost/net.c | 200
>> +++++++++++++++++++++++++++++++++++--
>> include/uapi/linux/vhost.h | 12 +++
>> include/uapi/linux/virtio_net.h | 12 +++
>> 4 files changed, 383 insertions(+), 52 deletions(-)
>>
>> --
>> 1.8.3.1

2014-12-02 08:07:17

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts



On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]> wrote:
>
>
> On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]>
> wrote:
>> On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
>>> Hello:
>>> We used to orphan packets before transmission for virtio-net.
>>> This breaks
>>> socket accounting and can lead serveral functions won't work, e.g:
>>> - Byte Queue Limit depends on tx completion nofication to work.
>>> - Packet Generator depends on tx completion nofication for the last
>>> transmitted packet to complete.
>>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
>>> work.
>>> This series tries to solve the issue by enabling tx interrupts.
>>> To minize
>>> the performance impacts of this, several optimizations were used:
>>> - In guest side, virtqueue_enable_cb_delayed() was used to delay
>>> the tx
>>> interrupt untile 3/4 pending packets were sent.
>>> - In host side, interrupt coalescing were used to reduce tx
>>> interrupts.
>>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
>>> - For guest receiving. No obvious regression on throughput were
>>> noticed. More cpu utilization were noticed in few cases.
>>> - For guest transmission. Very huge improvement on througput for
>>> small
>>> packet transmission were noticed. This is expected since TSQ and
>>> other
>>> optimization for small packet transmission work after tx
>>> interrupt. But
>>> will use more cpu for large packets.
>>> - For TCP_RR, regression (10% on transaction rate and cpu
>>> utilization) were
>>> found. Tx interrupt won't help but cause overhead in this case.
>>> Using
>>> more aggressive coalescing parameters may help to reduce the
>>> regression.
>>
>> OK, you do have posted coalescing patches - does it help any?
>
> Helps a lot.
>
> For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> For small packet TX, it increases 33% - 245% throughput. (reduce
> about 60% inters)
> For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx
> intrs)
>
>>
>> I'm not sure the regression is due to interrupts.
>> It would make sense for CPU but why would it
>> hurt transaction rate?
>
> Anyway guest need to take some cycles to handle tx interrupts.
> And transaction rate does increase if we coalesces more tx
> interurpts.
>>
>>
>> It's possible that we are deferring kicks too much due to BQL.
>>
>> As an experiment: do we get any of it back if we do
>> - if (kick || netif_xmit_stopped(txq))
>> - virtqueue_kick(sq->vq);
>> + virtqueue_kick(sq->vq);
>> ?
>
>
> I will try, but during TCP_RR, at most 1 packets were pending,
> I suspect if BQL can help in this case.

Looks like this helps a lot with multiple sessions of TCP_RR.

How about moving the BQL patch out of this series?

Let's first converge on tx interrupts and then introduce it
(e.g. with kicking after queueing X bytes)?

>
>>
>>
>> If yes, we can just kick e.g. periodically, e.g. after queueing each
>> X bytes.
>
> Okay, let me try to see if this help.
>>
>>> Changes from RFC V3:
>>> - Don't free tx packets in ndo_start_xmit()
>>> - Add interrupt coalescing support for virtio-net
>>> Changes from RFC v2:
>>> - clean up code, address issues raised by Jason
>>> Changes from RFC v1:
>>> - address comments by Jason Wang, use delayed cb everywhere
>>> - rebased Jason's patch on top of mine and include it (with some
>>> tweaks)
>>> Please reivew. Comments were more than welcomed.
>>> [1] Performance Test result:
>>> Environment:
>>> - Two Intel(R) Xeon(R) CPU E5620 @ 2.40GHz machines connected back
>>> to back
>>> with 82599ES cards.
>>> - Both host and guest were net-next.git plus the patch
>>> - Coalescing parameters for the card:
>>> Adaptive RX: off TX: off
>>> rx-usecs: 1
>>> rx-frames: 0
>>> tx-usecs: 0
>>> tx-frames: 0
>>> - Vhost_net was enabled and zerocopy was disabled
>>> - Tests was done by netperf-2.6
>>> - Guest has 2 vcpus with single queue virtio-net
>>> Results:
>>> - Numbers of square brackets are whose significance is grater than
>>> 95%
>>> Guest RX:
>>> size/sessions/+throughput/+cpu/+per_cpu_throughput/
>>> 64/1/+2.0326/[+6.2807%]/-3.9970%/
>>> 64/2/-0.2104%/[+3.2012%]/[-3.3058%]/
>>> 64/4/+1.5956%/+2.2451%/-0.6353%/
>>> 64/8/+1.1732%/+3.5123%/-2.2598%/
>>> 256/1/+3.7619%/[+5.8117%]/-1.9372%/
>>> 256/2/-0.0661%/[+3.2511%]/-3.2127%/
>>> 256/4/+1.1435%/[-8.1842%]/[+10.1591%]/
>>> 256/8/[+2.2447%]/[+6.2044%]/[-3.7283%]/
>>> 1024/1/+9.1479%/[+12.0997%]/[-2.6332%]/
>>> 1024/2/[-17.3341%]/[+0.0000%]/[-17.3341%]/
>>> 1024/4/[-0.6284%]/-1.0376%/+0.4135%/
>>> 1024/8/+1.1444%/-1.6069%/+2.7961%/
>>> 4096/1/+0.0401%/-0.5993%/+0.6433%/
>>> 4096/2/[-0.5894%]/-2.2071%/+1.6542%/
>>> 4096/4/[-0.5560%]/-1.4969%/+0.9553%/
>>> 4096/8/-0.3362%/+2.7086%/-2.9645%/
>>> 16384/1/-0.0285%/+0.7247%/-0.7478%/
>>> 16384/2/-0.5286%/+0.3287%/-0.8545%/
>>> 16384/4/-0.3297%/-2.0543%/+1.7608%/
>>> 16384/8/+1.0932%/+4.0253%/-2.8187%/
>>> 65535/1/+0.0003%/-0.1502%/+0.1508%/
>>> 65535/2/[-0.6065%]/+0.2309%/-0.8355%/
>>> 65535/4/[-0.6861%]/[+3.9451%]/[-4.4554%]/
>>> 65535/8/+1.8359%/+3.1590%/-1.2825%/
>>> Guest RX:
>>> size/sessions/+throughput/+cpu/+per_cpu_throughput/
>>> 64/1/[+65.0961%]/[-8.6807%]/[+80.7900%]/
>>> 64/2/[+6.0288%]/[-2.2823%]/[+8.5052%]/
>>> 64/4/[+5.9038%]/[-2.1834%]/[+8.2677%]/
>>> 64/8/[+5.4154%]/[-2.1804%]/[+7.7651%]/
>>> 256/1/[+184.6462%]/[+4.8906%]/[+171.3742%]/
>>> 256/2/[+46.0731%]/[-8.9626%]/[+60.4539%]/
>>> 256/4/[+45.8547%]/[-8.3027%]/[+59.0612%]/
>>> 256/8/[+45.3486%]/[-8.4024%]/[+58.6817%]/
>>> 1024/1/[+432.5372%]/[+3.9566%]/[+412.2689%]/
>>> 1024/2/[-1.4207%]/[-23.6426%]/[+29.1025%]/
>>> 1024/4/-0.1003%/[-13.6416%]/[+15.6804%]/
>>> 1024/8/[+0.2200%]/[+2.0634%]/[-1.8061%]/
>>> 4096/1/[+18.4835%]/[-46.1508%]/[+120.0283%]/
>>> 4096/2/+0.1770%/[-26.2780%]/[+35.8848%]/
>>> 4096/4/-0.1012%/-0.7353%/+0.6388%/
>>> 4096/8/-0.6091%/+1.4159%/-1.9968%/
>>> 16384/1/-0.0424%/[+11.9373%]/[-10.7021%]/
>>> 16384/2/+0.0482%/+2.4685%/-2.3620%/
>>> 16384/4/+0.0840%/[+5.3587%]/[-5.0064%]/
>>> 16384/8/+0.0048%/[+5.0176%]/[-4.7733%]/
>>> 65535/1/-0.0095%/[+10.9408%]/[-9.8705%]/
>>> 65535/2/+0.1515%/[+8.1709%]/[-7.4137%]/
>>> 65535/4/+0.0203%/[+5.4316%]/[-5.1325%]/
>>> 65535/8/+0.1427%/[+6.2753%]/[-5.7705%]/
>>> size/sessions/+trans.rate/+cpu/+per_cpu_trans.rate/
>>> 64/1/+0.2346%/[+11.5080%]/[-10.1099%]/
>>> 64/25/[-10.7893%]/-0.5791%/[-10.2697%]/
>>> 64/50/[-11.5997%]/-0.3429%/[-11.2956%]/
>>> 256/1/+0.7219%/[+13.2374%]/[-11.0524%]/
>>> 256/25/-6.9567%/+0.8887%/[-7.7763%]/
>>> 256/50/[-4.8814%]/-0.0338%/[-4.8492%]/
>>> 4096/1/-1.6061%/-0.7561%/-0.8565%/
>>> 4096/25/[+2.2120%]/[+1.0839%]/+1.1161%/
>>> 4096/50/[+5.6180%]/[+3.2116%]/[+2.3315%]/
>>> Jason Wang (4):
>>> virtio_net: enable tx interrupt
>>> virtio-net: optimize free_old_xmit_skbs stats
>>> virtio-net: add basic interrupt coalescing support
>>> vhost_net: interrupt coalescing support
>>> Michael S. Tsirkin (1):
>>> virtio_net: bql
>>> drivers/net/virtio_net.c | 211
>>> ++++++++++++++++++++++++++++++++--------
>>> drivers/vhost/net.c | 200
>>> +++++++++++++++++++++++++++++++++++--
>>> include/uapi/linux/vhost.h | 12 +++
>>> include/uapi/linux/virtio_net.h | 12 +++
>>> 4 files changed, 383 insertions(+), 52 deletions(-)
>>> -- 1.8.3.1
>

2014-12-02 09:43:30

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
>
>
> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]> wrote:
> >
> >
> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]> wrote:
> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> >>> Hello:
> >>> We used to orphan packets before transmission for virtio-net. This
> >>>breaks
> >>> socket accounting and can lead serveral functions won't work, e.g:
> >>> - Byte Queue Limit depends on tx completion nofication to work.
> >>> - Packet Generator depends on tx completion nofication for the last
> >>> transmitted packet to complete.
> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> >>>work.
> >>> This series tries to solve the issue by enabling tx interrupts. To
> >>>minize
> >>> the performance impacts of this, several optimizations were used:
> >>> - In guest side, virtqueue_enable_cb_delayed() was used to delay the
> >>>tx
> >>> interrupt untile 3/4 pending packets were sent.
> >>> - In host side, interrupt coalescing were used to reduce tx
> >>>interrupts.
> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
> >>> - For guest receiving. No obvious regression on throughput were
> >>> noticed. More cpu utilization were noticed in few cases.
> >>> - For guest transmission. Very huge improvement on througput for
> >>>small
> >>> packet transmission were noticed. This is expected since TSQ and
> >>>other
> >>> optimization for small packet transmission work after tx interrupt.
> >>>But
> >>> will use more cpu for large packets.
> >>> - For TCP_RR, regression (10% on transaction rate and cpu
> >>>utilization) were
> >>> found. Tx interrupt won't help but cause overhead in this case.
> >>>Using
> >>> more aggressive coalescing parameters may help to reduce the
> >>>regression.
> >>
> >>OK, you do have posted coalescing patches - does it help any?
> >
> >Helps a lot.
> >
> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> >For small packet TX, it increases 33% - 245% throughput. (reduce about 60%
> >inters)
> >For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx intrs)
> >
> >>
> >>I'm not sure the regression is due to interrupts.
> >>It would make sense for CPU but why would it
> >>hurt transaction rate?
> >
> >Anyway guest need to take some cycles to handle tx interrupts.
> >And transaction rate does increase if we coalesces more tx interurpts.
> >>
> >>
> >>It's possible that we are deferring kicks too much due to BQL.
> >>
> >>As an experiment: do we get any of it back if we do
> >>- if (kick || netif_xmit_stopped(txq))
> >>- virtqueue_kick(sq->vq);
> >>+ virtqueue_kick(sq->vq);
> >>?
> >
> >
> >I will try, but during TCP_RR, at most 1 packets were pending,
> >I suspect if BQL can help in this case.
>
> Looks like this helps a lot in multiple sessions of TCP_RR.

so what's faster
BQL + kick each packet
no BQL
?

> How about move the BQL patch out of this series?
>
> Let's first converge tx interrupt and then introduce it?
> (e.g with kicking after queuing X bytes?)

Sounds good.

2014-12-02 09:52:00

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts



On Tue, Dec 2, 2014 at 5:43 PM, Michael S. Tsirkin <[email protected]>
wrote:
> On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
>>
>>
>> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]>
>> wrote:
>> >
>> >
>> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin
>> <[email protected]> wrote:
>> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
>> >>> Hello:
>> >>> We used to orphan packets before transmission for virtio-net.
>> This
>> >>>breaks
>> >>> socket accounting and can lead serveral functions won't work,
>> e.g:
>> >>> - Byte Queue Limit depends on tx completion nofication to work.
>> >>> - Packet Generator depends on tx completion nofication for the
>> last
>> >>> transmitted packet to complete.
>> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc
>> to
>> >>>work.
>> >>> This series tries to solve the issue by enabling tx
>> interrupts. To
>> >>>minize
>> >>> the performance impacts of this, several optimizations were
>> used:
>> >>> - In guest side, virtqueue_enable_cb_delayed() was used to
>> delay the
>> >>>tx
>> >>> interrupt untile 3/4 pending packets were sent.
>> >>> - In host side, interrupt coalescing were used to reduce tx
>> >>>interrupts.
>> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
>> >>> - For guest receiving. No obvious regression on throughput were
>> >>> noticed. More cpu utilization were noticed in few cases.
>> >>> - For guest transmission. Very huge improvement on througput for
>> >>>small
>> >>> packet transmission were noticed. This is expected since TSQ
>> and
>> >>>other
>> >>> optimization for small packet transmission work after tx
>> interrupt.
>> >>>But
>> >>> will use more cpu for large packets.
>> >>> - For TCP_RR, regression (10% on transaction rate and cpu
>> >>>utilization) were
>> >>> found. Tx interrupt won't help but cause overhead in this
>> case.
>> >>>Using
>> >>> more aggressive coalescing parameters may help to reduce the
>> >>>regression.
>> >>
>> >>OK, you do have posted coalescing patches - does it help any?
>> >
>> >Helps a lot.
>> >
>> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
>> >For small packet TX, it increases 33% - 245% throughput. (reduce
>> about 60%
>> >inters)
>> >For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx
>> intrs)
>> >
>> >>
>> >>I'm not sure the regression is due to interrupts.
>> >>It would make sense for CPU but why would it
>> >>hurt transaction rate?
>> >
>> >Anyway guest need to take some cycles to handle tx interrupts.
>> >And transaction rate does increase if we coalesces more tx
>> interurpts.
>> >>
>> >>
>> >>It's possible that we are deferring kicks too much due to BQL.
>> >>
>> >>As an experiment: do we get any of it back if we do
>> >>- if (kick || netif_xmit_stopped(txq))
>> >>- virtqueue_kick(sq->vq);
>> >>+ virtqueue_kick(sq->vq);
>> >>?
>> >
>> >
>> >I will try, but during TCP_RR, at most 1 packets were pending,
>> >I suspect if BQL can help in this case.
>>
>> Looks like this helps a lot in multiple sessions of TCP_RR.
>
> so what's faster
> BQL + kick each packet
> no BQL
> ?

Quick and manual tests (TCP_RR 64, TCP_STREAM 512) do not
show obvious differences.

A complete benchmark may be needed to tell.
>
>
>> How about move the BQL patch out of this series?
>>
>> Let's first converge tx interrupt and then introduce it?
>> (e.g with kicking after queuing X bytes?)
>
> Sounds good.

2014-12-02 09:55:32

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

On Tue, Dec 02, 2014 at 09:59:48AM +0008, Jason Wang wrote:
>
>
> On Tue, Dec 2, 2014 at 5:43 PM, Michael S. Tsirkin <[email protected]> wrote:
> >On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
> >> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]>
> >>wrote:
> >> >
> >> >
> >> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]>
> >>wrote:
> >> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> >> >>> Hello:
> >> >>> We used to orphan packets before transmission for virtio-net. This
> >> >>>breaks
> >> >>> socket accounting and can lead serveral functions won't work, e.g:
> >> >>> - Byte Queue Limit depends on tx completion nofication to work.
> >> >>> - Packet Generator depends on tx completion nofication for the last
> >> >>> transmitted packet to complete.
> >> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> >> >>>work.
> >> >>> This series tries to solve the issue by enabling tx interrupts. To
> >> >>>minize
> >> >>> the performance impacts of this, several optimizations were used:
> >> >>> - In guest side, virtqueue_enable_cb_delayed() was used to delay
> >>the
> >> >>>tx
> >> >>> interrupt untile 3/4 pending packets were sent.
> >> >>> - In host side, interrupt coalescing were used to reduce tx
> >> >>>interrupts.
> >> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
> >> >>> - For guest receiving. No obvious regression on throughput were
> >> >>> noticed. More cpu utilization were noticed in few cases.
> >> >>> - For guest transmission. Very huge improvement on througput for
> >> >>>small
> >> >>> packet transmission were noticed. This is expected since TSQ and
> >> >>>other
> >> >>> optimization for small packet transmission work after tx
> >>interrupt.
> >> >>>But
> >> >>> will use more cpu for large packets.
> >> >>> - For TCP_RR, regression (10% on transaction rate and cpu
> >> >>>utilization) were
> >> >>> found. Tx interrupt won't help but cause overhead in this case.
> >> >>>Using
> >> >>> more aggressive coalescing parameters may help to reduce the
> >> >>>regression.
> >> >>
> >> >>OK, you do have posted coalescing patches - does it help any?
> >> >
> >> >Helps a lot.
> >> >
> >> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> >> >For small packet TX, it increases 33% - 245% throughput. (reduce about
> >>60%
> >> >inters)
> >> >For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx
> >>intrs)
> >> >
> >> >>
> >> >>I'm not sure the regression is due to interrupts.
> >> >>It would make sense for CPU but why would it
> >> >>hurt transaction rate?
> >> >
> >> >Anyway guest need to take some cycles to handle tx interrupts.
> >> >And transaction rate does increase if we coalesces more tx interurpts.
> >> >>
> >> >>
> >> >>It's possible that we are deferring kicks too much due to BQL.
> >> >>
> >> >>As an experiment: do we get any of it back if we do
> >> >>- if (kick || netif_xmit_stopped(txq))
> >> >>- virtqueue_kick(sq->vq);
> >> >>+ virtqueue_kick(sq->vq);
> >> >>?
> >> >
> >> >
> >> >I will try, but during TCP_RR, at most 1 packets were pending,
> >> >I suspect if BQL can help in this case.
> >> Looks like this helps a lot in multiple sessions of TCP_RR.
> >
> >so what's faster
> > BQL + kick each packet
> > no BQL
> >?
>
> Quick and manual tests (TCP_RR 64, TCP_STREAM 512) does not show obvious
> differences.
>
> May need a complete benchmark to see.

Okay, so going forward something like BQL + kicking each packet
might be a good solution.
The advantage of BQL is that it works without GSO.
For example, now that we don't do UFO, you might
see significant gains with UDP.
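
For reference, and independent of the exact bql patch in this series,
BQL accounting in a driver boils down to two calls, one on the transmit
path and one on the completion path. The foo_* names below are
placeholders for illustration, not virtio_net code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch of generic BQL accounting: report bytes when a packet is
 * queued and again when its tx completion is observed, so the stack
 * can bound the amount of not-yet-completed tx data per queue.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct netdev_queue *txq =
		netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

	/* ... hand the skb to the hardware / virtqueue here ... */

	netdev_tx_sent_queue(txq, skb->len);	/* BQL: bytes queued */
	return NETDEV_TX_OK;
}

/* Called from the tx completion path (tx irq / napi poll). */
static void foo_tx_complete(struct net_device *dev, unsigned int queue,
			    unsigned int pkts, unsigned int bytes)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, queue);

	netdev_tx_completed_queue(txq, pkts, bytes);	/* BQL: bytes done */
}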


> >
> >
> >> How about move the BQL patch out of this series?
> >> Let's first converge tx interrupt and then introduce it?
> >> (e.g with kicking after queuing X bytes?)
> >
> >Sounds good.

2014-12-02 10:00:40

by David Laight

[permalink] [raw]
Subject: RE: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

From: Jason Wang
> > On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> >> Hello:
> >>
> >> We used to orphan packets before transmission for virtio-net. This
> >> breaks
> >> socket accounting and can lead serveral functions won't work, e.g:
> >>
> >> - Byte Queue Limit depends on tx completion nofication to work.
> >> - Packet Generator depends on tx completion nofication for the last
> >> transmitted packet to complete.
> >> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> >> work.
> >>
> >> This series tries to solve the issue by enabling tx interrupts. To
> >> minize
> >> the performance impacts of this, several optimizations were used:
> >>
> >> - In guest side, virtqueue_enable_cb_delayed() was used to delay the tx
> >> interrupt untile 3/4 pending packets were sent.

Doesn't that give problems for intermittent transmits?

...

David


2014-12-02 10:08:21

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

On Tue, Dec 02, 2014 at 10:00:06AM +0000, David Laight wrote:
> From: Jason Wang
> > > On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> > >> Hello:
> > >>
> > >> We used to orphan packets before transmission for virtio-net. This
> > >> breaks
> > >> socket accounting and can lead serveral functions won't work, e.g:
> > >>
> > >> - Byte Queue Limit depends on tx completion nofication to work.
> > >> - Packet Generator depends on tx completion nofication for the last
> > >> transmitted packet to complete.
> > >> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> > >> work.
> > >>
> > >> This series tries to solve the issue by enabling tx interrupts. To
> > >> minize
> > >> the performance impacts of this, several optimizations were used:
> > >>
> > >> - In guest side, virtqueue_enable_cb_delayed() was used to delay the tx
> > >> interrupt untile 3/4 pending packets were sent.
>
> Doesn't that give problems for intermittent transmits?
>
> ...
>
> David
>

No, because it has no effect in that case.
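
(To spell that out: the delayed-callback threshold scales with what is
currently pending, so a lone packet is not delayed. Very roughly, and
with names local to this sketch rather than taken from virtio_ring.c:)

	/* Simplified sketch of the event-idx logic behind
	 * virtqueue_enable_cb_delayed(): ask for an interrupt only after
	 * ~3/4 of the currently pending buffers have been used.
	 */
	u16 pending = avail_idx - last_used_idx;	/* buffers in flight */
	u16 event   = last_used_idx + pending * 3 / 4;	/* interrupt point */

	/* With a single pending buffer, pending * 3 / 4 is 0, so the
	 * interrupt fires as soon as that buffer completes; intermittent
	 * transmits see no extra delay.
	 */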

--
MST

2014-12-02 10:08:42

by Pankaj Gupta

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts


>
> On Tue, Dec 02, 2014 at 09:59:48AM +0008, Jason Wang wrote:
> >
> >
> > On Tue, Dec 2, 2014 at 5:43 PM, Michael S. Tsirkin <[email protected]> wrote:
> > >On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
> > >> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]>
> > >>wrote:
> > >> >
> > >> >
> > >> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]>
> > >>wrote:
> > >> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> > >> >>> Hello:
> > >> >>> We used to orphan packets before transmission for virtio-net. This
> > >> >>>breaks
> > >> >>> socket accounting and can lead serveral functions won't work, e.g:
> > >> >>> - Byte Queue Limit depends on tx completion nofication to work.
> > >> >>> - Packet Generator depends on tx completion nofication for the last
> > >> >>> transmitted packet to complete.
> > >> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> > >> >>>work.
> > >> >>> This series tries to solve the issue by enabling tx interrupts. To
> > >> >>>minize
> > >> >>> the performance impacts of this, several optimizations were used:
> > >> >>> - In guest side, virtqueue_enable_cb_delayed() was used to delay
> > >>the
> > >> >>>tx
> > >> >>> interrupt untile 3/4 pending packets were sent.
> > >> >>> - In host side, interrupt coalescing were used to reduce tx
> > >> >>>interrupts.
> > >> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
> > >> >>> - For guest receiving. No obvious regression on throughput were
> > >> >>> noticed. More cpu utilization were noticed in few cases.
> > >> >>> - For guest transmission. Very huge improvement on througput for
> > >> >>>small
> > >> >>> packet transmission were noticed. This is expected since TSQ and
> > >> >>>other
> > >> >>> optimization for small packet transmission work after tx
> > >>interrupt.
> > >> >>>But
> > >> >>> will use more cpu for large packets.
> > >> >>> - For TCP_RR, regression (10% on transaction rate and cpu
> > >> >>>utilization) were
> > >> >>> found. Tx interrupt won't help but cause overhead in this case.
> > >> >>>Using
> > >> >>> more aggressive coalescing parameters may help to reduce the
> > >> >>>regression.
> > >> >>
> > >> >>OK, you do have posted coalescing patches - does it help any?
> > >> >
> > >> >Helps a lot.
> > >> >
> > >> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> > >> >For small packet TX, it increases 33% - 245% throughput. (reduce about
> > >>60%
> > >> >inters)
> > >> >For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx
> > >>intrs)
> > >> >
> > >> >>
> > >> >>I'm not sure the regression is due to interrupts.
> > >> >>It would make sense for CPU but why would it
> > >> >>hurt transaction rate?
> > >> >
> > >> >Anyway guest need to take some cycles to handle tx interrupts.
> > >> >And transaction rate does increase if we coalesces more tx interurpts.
> > >> >>
> > >> >>
> > >> >>It's possible that we are deferring kicks too much due to BQL.
> > >> >>
> > >> >>As an experiment: do we get any of it back if we do
> > >> >>- if (kick || netif_xmit_stopped(txq))
> > >> >>- virtqueue_kick(sq->vq);
> > >> >>+ virtqueue_kick(sq->vq);
> > >> >>?
> > >> >
> > >> >
> > >> >I will try, but during TCP_RR, at most 1 packets were pending,
> > >> >I suspect if BQL can help in this case.
> > >> Looks like this helps a lot in multiple sessions of TCP_RR.
> > >
> > >so what's faster
> > > BQL + kick each packet
> > > no BQL
> > >?
> >
> > Quick and manual tests (TCP_RR 64, TCP_STREAM 512) does not show obvious
> > differences.
> >
> > May need a complete benchmark to see.
>
> Okay so going forward something like BQL + kick each packet
> might be a good solution.
> The advantage of BQL is that it works without GSO.
> For example, now that we don't do UFO, you might
> see significant gains with UDP.

If I understand correctly, it can also help with the small packet
regression in the multiqueue scenario? It would be nice to see the
performance numbers with multi-queue for small-packet streams.
>
>
> > >
> > >
> > >> How about move the BQL patch out of this series?
> > >> Let's first converge tx interrupt and then introduce it?
> > >> (e.g with kicking after queuing X bytes?)
> > >
> > >Sounds good.
>

2014-12-02 10:11:38

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 0/5] virtio_net: enabling tx interrupts

On Tue, Dec 02, 2014 at 05:08:35AM -0500, Pankaj Gupta wrote:
>
> >
> > On Tue, Dec 02, 2014 at 09:59:48AM +0008, Jason Wang wrote:
> > >
> > >
> > > On Tue, Dec 2, 2014 at 5:43 PM, Michael S. Tsirkin <[email protected]> wrote:
> > > >On Tue, Dec 02, 2014 at 08:15:02AM +0008, Jason Wang wrote:
> > > >> On Tue, Dec 2, 2014 at 11:15 AM, Jason Wang <[email protected]>
> > > >>wrote:
> > > >> >
> > > >> >
> > > >> >On Mon, Dec 1, 2014 at 6:42 PM, Michael S. Tsirkin <[email protected]>
> > > >>wrote:
> > > >> >>On Mon, Dec 01, 2014 at 06:17:03PM +0800, Jason Wang wrote:
> > > >> >>> Hello:
> > > >> >>> We used to orphan packets before transmission for virtio-net. This
> > > >> >>>breaks
> > > >> >>> socket accounting and can lead serveral functions won't work, e.g:
> > > >> >>> - Byte Queue Limit depends on tx completion nofication to work.
> > > >> >>> - Packet Generator depends on tx completion nofication for the last
> > > >> >>> transmitted packet to complete.
> > > >> >>> - TCP Small Queue depends on proper accounting of sk_wmem_alloc to
> > > >> >>>work.
> > > >> >>> This series tries to solve the issue by enabling tx interrupts. To
> > > >> >>>minize
> > > >> >>> the performance impacts of this, several optimizations were used:
> > > >> >>> - In guest side, virtqueue_enable_cb_delayed() was used to delay
> > > >>the
> > > >> >>>tx
> > > >> >>> interrupt untile 3/4 pending packets were sent.
> > > >> >>> - In host side, interrupt coalescing were used to reduce tx
> > > >> >>>interrupts.
> > > >> >>> Performance test results[1] (tx-frames 16 tx-usecs 16) shows:
> > > >> >>> - For guest receiving. No obvious regression on throughput were
> > > >> >>> noticed. More cpu utilization were noticed in few cases.
> > > >> >>> - For guest transmission. Very huge improvement on througput for
> > > >> >>>small
> > > >> >>> packet transmission were noticed. This is expected since TSQ and
> > > >> >>>other
> > > >> >>> optimization for small packet transmission work after tx
> > > >>interrupt.
> > > >> >>>But
> > > >> >>> will use more cpu for large packets.
> > > >> >>> - For TCP_RR, regression (10% on transaction rate and cpu
> > > >> >>>utilization) were
> > > >> >>> found. Tx interrupt won't help but cause overhead in this case.
> > > >> >>>Using
> > > >> >>> more aggressive coalescing parameters may help to reduce the
> > > >> >>>regression.
> > > >> >>
> > > >> >>OK, you do have posted coalescing patches - does it help any?
> > > >> >
> > > >> >Helps a lot.
> > > >> >
> > > >> >For RX, it saves about 5% - 10% cpu. (reduce 60%-90% tx intrs)
> > > >> >For small packet TX, it increases 33% - 245% throughput. (reduce about
> > > >>60%
> > > >> >inters)
> > > >> >For TCP_RR, it increase the 3%-10% trans.rate. (reduce 40%-80% tx
> > > >>intrs)
> > > >> >
> > > >> >>
> > > >> >>I'm not sure the regression is due to interrupts.
> > > >> >>It would make sense for CPU but why would it
> > > >> >>hurt transaction rate?
> > > >> >
> > > >> >Anyway guest need to take some cycles to handle tx interrupts.
> > > >> >And transaction rate does increase if we coalesces more tx interurpts.
> > > >> >>
> > > >> >>
> > > >> >>It's possible that we are deferring kicks too much due to BQL.
> > > >> >>
> > > >> >>As an experiment: do we get any of it back if we do
> > > >> >>- if (kick || netif_xmit_stopped(txq))
> > > >> >>- virtqueue_kick(sq->vq);
> > > >> >>+ virtqueue_kick(sq->vq);
> > > >> >>?
> > > >> >
> > > >> >
> > > >> >I will try, but during TCP_RR, at most 1 packets were pending,
> > > >> >I suspect if BQL can help in this case.
> > > >> Looks like this helps a lot in multiple sessions of TCP_RR.
> > > >
> > > >so what's faster
> > > > BQL + kick each packet
> > > > no BQL
> > > >?
> > >
> > > Quick and manual tests (TCP_RR 64, TCP_STREAM 512) does not show obvious
> > > differences.
> > >
> > > May need a complete benchmark to see.
> >
> > Okay so going forward something like BQL + kick each packet
> > might be a good solution.
> > The advantage of BQL is that it works without GSO.
> > For example, now that we don't do UFO, you might
> > see significant gains with UDP.
>
> If I understand correctly, it can also help for small packet
> regr. in multiqueue scenario?

Well BQL generally should only be active for 1:1 mappings.

> Would be nice to see the perf. numbers
> with multi-queue for small packets streams.
> >
> >
> > > >
> > > >
> > > >> How about move the BQL patch out of this series?
> > > >> Let's first converge tx interrupt and then introduce it?
> > > >> (e.g with kicking after queuing X bytes?)
> > > >
> > > >Sounds good.
> >

2014-12-19 07:32:51

by Qin Chuanyu

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt

On 2014/12/1 18:17, Jason Wang wrote:
> On newer hosts that support delayed tx interrupts,
> we probably don't have much to gain from orphaning
> packets early.
>
> Note: this might degrade performance for
> hosts without event idx support.
> Should be addressed by the next patch.
>
> Cc: Rusty Russell <[email protected]>
> Cc: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Michael S. Tsirkin <[email protected]>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/net/virtio_net.c | 132 +++++++++++++++++++++++++++++++----------------
> 1 file changed, 88 insertions(+), 44 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index ec2a8b4..f68114e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> {
> struct skb_vnet_hdr *hdr;
> @@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> sg_set_buf(sq->sg, hdr, hdr_len);
> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
> }
> - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> +
> + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
> + GFP_ATOMIC);
> }
>
> static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> @@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = !skb->xmit_more;
>
> - /* Free up any pending old buffers before queueing new ones. */
> - free_old_xmit_skbs(sq);

I think there is no need to remove free_old_xmit_skbs() here.
You could add free_old_xmit_skbs() to the tx irq's napi function,
and also keep calling it in start_xmit() if you handle the race well.

I have done the same thing in the ixgbe driver (freeing skbs both in
ndo_start_xmit and in the tx irq's poll function), and it seems to
work well :)

I think there would not be so many interrupts this way, and tx
interrupt coalescing would not be needed either.

> + virtqueue_disable_cb(sq->vq);
>
> /* Try to transmit */
> err = xmit_skb(sq, skb);
> @@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> return NETDEV_TX_OK;
> }
>
> - /* Don't wait up for transmitted skbs to be freed. */
> - skb_orphan(skb);
> - nf_reset(skb);
> -
> /* Apparently nice girls don't return TX_BUSY; stop the queue
> * before it gets out of hand. Naturally, this wastes entries. */
> - if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> + if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
> netif_stop_subqueue(dev, qnum);
> - if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> - /* More just got used, free them then recheck. */
> - free_old_xmit_skbs(sq);
> - if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
> - netif_start_subqueue(dev, qnum);
> - virtqueue_disable_cb(sq->vq);
> - }
> - }
> - }
>
> if (kick || netif_xmit_stopped(txq))
> virtqueue_kick(sq->vq);
>
> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
> + virtqueue_disable_cb(sq->vq);
> + napi_schedule(&sq->napi);
> + }
> +
> return NETDEV_TX_OK;
> }
>
> @@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device *dev)
> /* Make sure refill_work doesn't re-enable napi! */
> cancel_delayed_work_sync(&vi->refill);
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> napi_disable(&vi->rq[i].napi);
> + napi_disable(&vi->sq[i].napi);
> + }
>
> return 0;
> }
> @@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct virtnet_info *vi)
> {
> int i;
>
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> netif_napi_del(&vi->rq[i].napi);
> + netif_napi_del(&vi->sq[i].napi);
> + }
>
> kfree(vi->rq);
> kfree(vi->sq);

2014-12-19 10:02:30

by Jason Wang

[permalink] [raw]
Subject: Re: [PATCH RFC v4 net-next 1/5] virtio_net: enable tx interrupt



On Fri, Dec 19, 2014 at 3:32 PM, Qin Chuanyu <[email protected]>
wrote:
> On 2014/12/1 18:17, Jason Wang wrote:
>> On newer hosts that support delayed tx interrupts,
>> we probably don't have much to gain from orphaning
>> packets early.
>>
>> Note: this might degrade performance for
>> hosts without event idx support.
>> Should be addressed by the next patch.
>>
>> Cc: Rusty Russell <[email protected]>
>> Cc: Michael S. Tsirkin <[email protected]>
>> Signed-off-by: Michael S. Tsirkin <[email protected]>
>> Signed-off-by: Jason Wang <[email protected]>
>> ---
>> drivers/net/virtio_net.c | 132
>> +++++++++++++++++++++++++++++++----------------
>> 1 file changed, 88 insertions(+), 44 deletions(-)
>>
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index ec2a8b4..f68114e 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
>> {
>> struct skb_vnet_hdr *hdr;
>> @@ -912,7 +951,9 @@ static int xmit_skb(struct send_queue *sq,
>> struct sk_buff *skb)
>> sg_set_buf(sq->sg, hdr, hdr_len);
>> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
>> }
>> - return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
>> GFP_ATOMIC);
>> +
>> + return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb,
>> + GFP_ATOMIC);
>> }
>>
>> static netdev_tx_t start_xmit(struct sk_buff *skb, struct
>> net_device *dev)
>> @@ -924,8 +965,7 @@ static netdev_tx_t start_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
>> bool kick = !skb->xmit_more;
>>
>> - /* Free up any pending old buffers before queueing new ones. */
>> - free_old_xmit_skbs(sq);
>
> I think there is no need to remove free_old_xmit_skbs here.
> you could add free_old_xmit_skbs in tx_irq's napi func.
> also could do this in start_xmit if you handle the race well.

Note that free_old_xmit_skbs() is already called in the tx napi
handler (see the sketch below). It is a must after tx interrupts are
enabled.
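
For context, the tx napi handler added by this series is roughly shaped
like the sketch below (simplified for illustration; the real handler
presumably also wakes a stopped tx queue and honours the tx_work_limit
set in virtnet_probe()):

/* Rough sketch of a tx napi poll handler, not the literal patch:
 * reclaim completed tx skbs, then re-enable the delayed callback,
 * rescheduling if more completions raced in meanwhile.
 */
static int virtnet_poll_tx(struct napi_struct *napi, int budget)
{
	struct send_queue *sq = container_of(napi, struct send_queue, napi);

	free_old_xmit_skbs(sq);		/* free skbs whose tx completed */

	napi_complete(napi);
	if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
		/* more buffers were used while polling; poll again */
		virtqueue_disable_cb(sq->vq);
		napi_schedule(napi);
	}

	return 0;
}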
>
>
> I have done the same thing in ixgbe driver(free skb in ndo_start_xmit
> and both in tx_irq's poll func), and it seems work well:)

Any performance numbers for this change?
I suspect it reduces the effect of interrupt coalescing.
>
> I think there would be no so much interrupts in this way, also tx
> interrupt coalesce is not needed.

Tests (multiple sessions of TCP_RR) do not support this.
Calling free_old_xmit_skbs() there in fact hurts performance.

What is the justification for thinking it may reduce the number of interrupts?

Thanks
>
>
>> + virtqueue_disable_cb(sq->vq);
>>
>> /* Try to transmit */
>> err = xmit_skb(sq, skb);
>> @@ -941,27 +981,19 @@ static netdev_tx_t start_xmit(struct sk_buff
>> *skb, struct net_device *dev)
>> return NETDEV_TX_OK;
>> }
>>
>> - /* Don't wait up for transmitted skbs to be freed. */
>> - skb_orphan(skb);
>> - nf_reset(skb);
>> -
>> /* Apparently nice girls don't return TX_BUSY; stop the queue
>> * before it gets out of hand. Naturally, this wastes entries. */
>> - if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
>> + if (sq->vq->num_free < 2+MAX_SKB_FRAGS)
>> netif_stop_subqueue(dev, qnum);
>> - if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>> - /* More just got used, free them then recheck. */
>> - free_old_xmit_skbs(sq);
>> - if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>> - netif_start_subqueue(dev, qnum);
>> - virtqueue_disable_cb(sq->vq);
>> - }
>> - }
>> - }
>>
>> if (kick || netif_xmit_stopped(txq))
>> virtqueue_kick(sq->vq);
>>
>> + if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>> + virtqueue_disable_cb(sq->vq);
>> + napi_schedule(&sq->napi);
>> + }
>> +
>> return NETDEV_TX_OK;
>> }
>>
>> @@ -1138,8 +1170,10 @@ static int virtnet_close(struct net_device
>> *dev)
>> /* Make sure refill_work doesn't re-enable napi! */
>> cancel_delayed_work_sync(&vi->refill);
>>
>> - for (i = 0; i < vi->max_queue_pairs; i++)
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> napi_disable(&vi->rq[i].napi);
>> + napi_disable(&vi->sq[i].napi);
>> + }
>>
>> return 0;
>> }
>> @@ -1452,8 +1486,10 @@ static void virtnet_free_queues(struct
>> virtnet_info *vi)
>> {
>> int i;
>>
>> - for (i = 0; i < vi->max_queue_pairs; i++)
>> + for (i = 0; i < vi->max_queue_pairs; i++) {
>> netif_napi_del(&vi->rq[i].napi);
>> + netif_napi_del(&vi->sq[i].napi);
>> + }
>>
>> kfree(vi->rq);
>> kfree(vi->sq);
>
>