2008-10-08 19:35:56

by Mark McLoughlin

Subject: [PATCH 1/2] virtio_net: Recycle some more rx buffer pages

Each time we re-fill the recv queue with buffers, we allocate
one too many skbs and free it again when adding fails. We should
recycle the pages allocated in this case.

A previous version of this patch made trim_pages() trim trailing
unused pages from skbs with some paged data, but this actually
caused a barely measurable slowdown.

Signed-off-by: Mark McLoughlin <[email protected]>
---
drivers/net/virtio_net.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0196a0d..985271a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -82,6 +82,16 @@ static void give_a_page(struct virtnet_info *vi, struct page *page)
vi->pages = page;
}

+static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb)
+{
+ unsigned int i;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ give_a_page(vi, skb_shinfo(skb)->frags[i].page);
+ skb_shinfo(skb)->nr_frags = 0;
+ skb->data_len = 0;
+}
+
static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
{
struct page *p = vi->pages;
@@ -121,14 +131,8 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
}
len -= sizeof(struct virtio_net_hdr);

- if (len <= MAX_PACKET_LEN) {
- unsigned int i;
-
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- give_a_page(dev->priv, skb_shinfo(skb)->frags[i].page);
- skb->data_len = 0;
- skb_shinfo(skb)->nr_frags = 0;
- }
+ if (len <= MAX_PACKET_LEN)
+ trim_pages(dev->priv, skb);

err = pskb_trim(skb, len);
if (err) {
@@ -232,6 +236,7 @@ static void try_fill_recv(struct virtnet_info *vi)
err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, num, skb);
if (err) {
skb_unlink(skb, &vi->recv);
+ trim_pages(vi, skb);
kfree_skb(skb);
break;
}
--
1.5.4.3


2008-10-08 19:36:25

by Mark McLoughlin

Subject: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

From: Herbert Xu <[email protected]>

If segmentation offload is enabled by the host, we currently allocate
maximum sized packet buffers and pass them to the host. This uses up
20 ring entries, allowing us to supply only 12 packet buffers to the
host with a 256 entry ring. This is a huge overhead when receiving
small packets, and is most keenly felt when receiving MTU sized
packets from off-host.

The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
using receive buffers which are smaller than the maximum packet size.
Buffers must be at least a page in size. In order to transfer large
packets to the guest, the host merges together multiple receive buffers
to form a larger logical buffer. The size of the logical buffer is
returned to the guest rather than the size of the individual smaller
buffers.

Make use of this support by supplying single page receive buffers to
the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
the payload to the skb's linear data buffer and adjust the fragment
offset to point to the remaining data. This ensures proper alignment
and allows us to not use any paged data for small packets. If the
payload occupies multiple pages, we simply append those pages as
fragments and free the associated skbs.
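The copy/fragment split described above can be modelled with a small standalone sketch (not the driver code itself; PAGE_SIZE, the 10-byte virtio_net_hdr and the 128-byte copy threshold are assumed from the patch below):

```c
#include <assert.h>

#define PAGE_SIZE     4096
#define HDR_SIZE      10     /* sizeof(struct virtio_net_hdr), assumed */
#define GOOD_COPY_LEN 128    /* linear-copy threshold from the patch */

/* Model of the receive-side split: for a payload of 'len' bytes, report
 * how much lands in the skb's linear area and how many page fragments
 * remain attached. */
static void rx_split(unsigned len, unsigned *linear, unsigned *nr_frags)
{
    /* Payload available in the first buffer, after the header. */
    unsigned plen = PAGE_SIZE - HDR_SIZE;
    unsigned copy;

    if (plen > len)
        plen = len;

    /* Copy up to GOOD_COPY_LEN bytes into the linear area... */
    copy = plen;
    if (copy > GOOD_COPY_LEN)
        copy = GOOD_COPY_LEN;
    *linear = copy;

    /* ...any remainder of the first page stays as fragment 0. */
    *nr_frags = (plen > copy) ? 1 : 0;
    len -= plen;

    /* Further merged buffers are appended as whole-page fragments. */
    while (len) {
        unsigned chunk = len < PAGE_SIZE ? len : PAGE_SIZE;
        (*nr_frags)++;
        len -= chunk;
    }
}
```

A 100-byte packet fits entirely in the linear area with no pages kept; a 1500-byte packet keeps one fragment; larger packets add one fragment per extra merged page.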

This scheme allows us to be efficient in our use of ring entries
while still supporting large packets. Benchmarking using netperf from
an external machine to a guest over a 10Gb/s network shows a 100%
improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
with GSO disabled on the host side, throughput was seen to increase
from 700Mb/s to 1.7Gb/s.

Slightly modified version of Herbert's original patch, mostly just
renaming the feature from "datahead".

Signed-off-by: Herbert Xu <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
---
drivers/net/virtio_net.c | 146 +++++++++++++++++++++++++++++++++++++++++---
include/linux/virtio_net.h | 1 +
2 files changed, 139 insertions(+), 8 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 985271a..1780d6d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -34,6 +34,7 @@ module_param(gso, bool, 0444);

/* FIXME: MTU in config. */
#define MAX_PACKET_LEN (ETH_HLEN+ETH_DATA_LEN)
+#define GOOD_COPY_LEN 128

struct virtnet_info
{
@@ -58,6 +59,9 @@ struct virtnet_info
/* I like... big packets and I cannot lie! */
bool big_packets;

+ /* Host will merge rx buffers for big packets (shake it! shake it!) */
+ bool mergeable_rx_bufs;
+
/* Receive & send queues. */
struct sk_buff_head recv;
struct sk_buff_head send;
@@ -121,8 +125,10 @@ static void skb_xmit_done(struct virtqueue *svq)
static void receive_skb(struct net_device *dev, struct sk_buff *skb,
unsigned len)
{
+ struct virtnet_info *vi = dev->priv;
struct virtio_net_hdr *hdr = skb_vnet_hdr(skb);
int err;
+ int i;

if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
pr_debug("%s: short packet %i\n", dev->name, len);
@@ -131,15 +137,87 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
}
len -= sizeof(struct virtio_net_hdr);

- if (len <= MAX_PACKET_LEN)
- trim_pages(dev->priv, skb);
+ if (vi->mergeable_rx_bufs) {
+ unsigned int copy;
+ unsigned int plen;
+ char *p = page_address(skb_shinfo(skb)->frags[0].page);

- err = pskb_trim(skb, len);
- if (err) {
- pr_debug("%s: pskb_trim failed %i %d\n", dev->name, len, err);
- dev->stats.rx_dropped++;
- goto drop;
+ memcpy(hdr, p, sizeof(*hdr));
+ p += sizeof(*hdr);
+
+ plen = PAGE_SIZE - sizeof(*hdr);
+ if (plen > len)
+ plen = len;
+
+ copy = plen;
+ if (copy > skb_tailroom(skb))
+ copy = skb_tailroom(skb);
+
+ memcpy(skb_put(skb, copy), p, copy);
+
+ len -= copy;
+ plen -= copy;
+
+ if (!plen) {
+ give_a_page(vi, skb_shinfo(skb)->frags[0].page);
+ skb_shinfo(skb)->nr_frags--;
+ } else {
+ skb_shinfo(skb)->frags[0].page_offset +=
+ sizeof(*hdr) + copy;
+ skb_shinfo(skb)->frags[0].size = plen;
+ skb->data_len += plen;
+ skb->len += plen;
+ }
+
+ while ((len -= plen)) {
+ struct sk_buff *nskb;
+ unsigned nlen;
+
+ i = skb_shinfo(skb)->nr_frags;
+ if (i >= MAX_SKB_FRAGS) {
+ pr_debug("%s: packet too long %d\n", dev->name,
+ len);
+ dev->stats.rx_length_errors++;
+ goto drop;
+ }
+
+ nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &nlen);
+ if (!nskb) {
+ pr_debug("%s: packet length error %d < %d\n",
+ dev->name, skb->len, len);
+ dev->stats.rx_length_errors++;
+ goto drop;
+ }
+
+ __skb_unlink(nskb, &vi->recv);
+ vi->num--;
+
+ skb_shinfo(skb)->frags[i] = skb_shinfo(nskb)->frags[0];
+ skb_shinfo(nskb)->nr_frags = 0;
+ kfree_skb(nskb);
+
+ plen = PAGE_SIZE;
+ if (plen > len)
+ plen = len;
+
+ skb_shinfo(skb)->frags[i].size = plen;
+ skb_shinfo(skb)->nr_frags++;
+ skb->data_len += plen;
+ skb->len += plen;
+ }
+ } else {
+ if (len <= MAX_PACKET_LEN)
+ trim_pages(vi, skb);
+
+ err = pskb_trim(skb, len);
+ if (err) {
+ pr_debug("%s: pskb_trim failed %i %d\n", dev->name,
+ len, err);
+ dev->stats.rx_dropped++;
+ goto drop;
+ }
}
+
skb->truesize += skb->data_len;
dev->stats.rx_bytes += skb->len;
dev->stats.rx_packets++;
@@ -198,7 +276,7 @@ drop:
dev_kfree_skb(skb);
}

-static void try_fill_recv(struct virtnet_info *vi)
+static void try_fill_recv_maxbufs(struct virtnet_info *vi)
{
struct sk_buff *skb;
struct scatterlist sg[2+MAX_SKB_FRAGS];
@@ -247,6 +325,54 @@ static void try_fill_recv(struct virtnet_info *vi)
vi->rvq->vq_ops->kick(vi->rvq);
}

+static void try_fill_recv(struct virtnet_info *vi)
+{
+ struct sk_buff *skb;
+ struct scatterlist sg[1];
+ int err;
+
+ if (!vi->mergeable_rx_bufs) {
+ try_fill_recv_maxbufs(vi);
+ return;
+ }
+
+ for (;;) {
+ skb_frag_t *f;
+
+ skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+ if (unlikely(!skb))
+ break;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+
+ f = &skb_shinfo(skb)->frags[0];
+ f->page = get_a_page(vi, GFP_ATOMIC);
+ if (!f->page) {
+ kfree_skb(skb);
+ break;
+ }
+
+ f->page_offset = 0;
+ f->size = PAGE_SIZE;
+
+ skb_shinfo(skb)->nr_frags++;
+
+ sg_init_one(sg, page_address(f->page), PAGE_SIZE);
+ skb_queue_head(&vi->recv, skb);
+
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 1, skb);
+ if (err) {
+ skb_unlink(skb, &vi->recv);
+ kfree_skb(skb);
+ break;
+ }
+ vi->num++;
+ }
+ if (unlikely(vi->num > vi->max))
+ vi->max = vi->num;
+ vi->rvq->vq_ops->kick(vi->rvq);
+}
+
static void skb_recv_done(struct virtqueue *rvq)
{
struct virtnet_info *vi = rvq->vdev->priv;
@@ -552,6 +678,9 @@ static int virtnet_probe(struct virtio_device *vdev)
|| virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_ECN))
vi->big_packets = true;

+ if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+ vi->mergeable_rx_bufs = true;
+
/* We expect two virtqueues, receive then send. */
vi->rvq = vdev->config->find_vq(vdev, 0, skb_recv_done);
if (IS_ERR(vi->rvq)) {
@@ -644,6 +773,7 @@ static unsigned int features[] = {
VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6,
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
VIRTIO_NET_F_GUEST_ECN, /* We don't yet handle UFO input. */
+ VIRTIO_NET_F_MRG_RXBUF,
VIRTIO_F_NOTIFY_ON_EMPTY,
};

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 5e33761..8f376a7 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -20,6 +20,7 @@
#define VIRTIO_NET_F_HOST_TSO6 12 /* Host can handle TSOv6 in. */
#define VIRTIO_NET_F_HOST_ECN 13 /* Host can handle TSO[6] w/ ECN in. */
#define VIRTIO_NET_F_HOST_UFO 14 /* Host can handle UFO in. */
+#define VIRTIO_NET_F_MRG_RXBUF 15 /* Host can merge receive buffers. */

struct virtio_net_config
{
--
1.5.4.3

2008-10-09 00:57:17

by Rusty Russell

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Thursday 09 October 2008 06:34:59 Mark McLoughlin wrote:
> From: Herbert Xu <[email protected]>
>
> If segmentation offload is enabled by the host, we currently allocate
> maximum sized packet buffers and pass them to the host. This uses up
> 20 ring entries, allowing us to supply only 12 packet buffers to the
> host with a 256 entry ring. This is a huge overhead when receiving
> small packets, and is most keenly felt when receiving MTU sized
> packets from off-host.

Hi Mark!

There are three approaches we should investigate before adding YA feature.
Obviously, we can simply increase the number of ring entries.

Secondly, we can put the virtio_net_hdr at the head of the skb data (this is
also worth considering for xmit I think if we have headroom) and drop
MAX_SKB_FRAGS which contains a gratuitous +2.

Thirdly, we can try to coalesce contiguous buffers. The page caching scheme
we have might help here, I don't know. Maybe we should be explicitly trying
to allocate higher orders.

Now, that said, we might need this anyway. But let's try the easy things
first? (Or as well...)

> The size of the logical buffer is
> returned to the guest rather than the size of the individual smaller
> buffers.

That's a virtio transport breakage: can you use the standard virtio mechanism,
just put the extended length or number of extra buffers inside the
virtio_net_hdr?

That makes more sense to me.

> Make use of this support by supplying single page receive buffers to
> the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
> the payload to the skb's linear data buffer and adjust the fragment
> offset to point to the remaining data. This ensures proper alignment
> and allows us to not use any paged data for small packets. If the
> payload occupies multiple pages, we simply append those pages as
> fragments and free the associated skbs.

> + char *p = page_address(skb_shinfo(skb)->frags[0].page);
...
> + memcpy(hdr, p, sizeof(*hdr));
> + p += sizeof(*hdr);

I think you need kmap_atomic() here to access the page. And yes, that will
affect performance :(

A few more comments moved from the patch header into the source wouldn't go
astray, but I'm happy to do that myself (it's been on my TODO for a while).

Thanks!
Rusty.

2008-10-09 15:30:59

by Herbert Xu

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
>
> There are three approaches we should investigate before adding YA feature.
> Obviously, we can simply increase the number of ring entries.

That's not going to work so well as you need to increase the ring
size by MAX_SKB_FRAGS times to achieve the same level of effect.

Basically the current scheme is either going to suck at non-TSO
traffic or it's going to chew too much resources.

> Secondly, we can put the virtio_net_hdr at the head of the skb data (this is
> also worth considering for xmit I think if we have headroom) and drop
> MAX_SKB_FRAGS which contains a gratuitous +2.

That's fine but having skb->data in the ring still means two
different kinds of memory in there and it sucks when you only
have 1500-byte packets.

> Thirdly, we can try to coalesce contiguous buffers. The page caching scheme
> we have might help here, I don't know. Maybe we should be explicitly trying
> to allocate higher orders.

That's not really the key problem here. The problem here is
that the scheme we're currently using in virtio-net is simply
broken when it comes to 1500-byte sized packets. Most of the
entries on the ring buffer go to waste.

We need a scheme that handles both 1500-byte packets as well
as 64K-byte size ones, and without holding down 16M of memory
per guest.

> > The size of the logical buffer is
> > returned to the guest rather than the size of the individual smaller
> > buffers.
>
> That's a virtio transport breakage: can you use the standard virtio mechanism,
> just put the extended length or number of extra buffers inside the
> virtio_net_hdr?

Sure that sounds reasonable.

> > Make use of this support by supplying single page receive buffers to
> > the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
> > the payload to the skb's linear data buffer and adjust the fragment
> > offset to point to the remaining data. This ensures proper alignment
> > and allows us to not use any paged data for small packets. If the
> > payload occupies multiple pages, we simply append those pages as
> > fragments and free the associated skbs.
>
> > + char *p = page_address(skb_shinfo(skb)->frags[0].page);
> ...
> > + memcpy(hdr, p, sizeof(*hdr));
> > + p += sizeof(*hdr);
>
> I think you need kmap_atomic() here to access the page. And yes, that will
> affect performance :(

No we don't. kmap would only be necessary for highmem which we
did not request.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2008-10-09 15:36:17

by Chris Wright

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

* Rusty Russell ([email protected]) wrote:
> On Thursday 09 October 2008 06:34:59 Mark McLoughlin wrote:
> > From: Herbert Xu <[email protected]>
> >
> > If segmentation offload is enabled by the host, we currently allocate
> > maximum sized packet buffers and pass them to the host. This uses up
> > 20 ring entries, allowing us to supply only 12 packet buffers to the
> > host with a 256 entry ring. This is a huge overhead when receiving
> > small packets, and is most keenly felt when receiving MTU sized
> > packets from off-host.
>
> There are three approaches we should investigate before adding YA feature.
> Obviously, we can simply increase the number of ring entries.

Tried that, it didn't help much. I don't have my numbers handy, but
throughput levelled off at about 512 entries and the gain was modest.
It's still wasteful to preallocate like that on the off-chance it's a
large packet.

thanks,
-chris

2008-10-09 17:41:28

by Mark McLoughlin

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote:
> On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
> >
> > There are three approaches we should investigate before adding YA feature.
> > Obviously, we can simply increase the number of ring entries.
>
> That's not going to work so well as you need to increase the ring
> size by MAX_SKB_FRAGS times to achieve the same level of effect.
>
> Basically the current scheme is either going to suck at non-TSO
> traffic or it's going to chew too much resources.

Yeah ... to put some numbers on it, assume we have a 256 entry ring now.

Currently, with GSO enabled in the host the guest will fill this with 12
buffer heads with 20 buffers per head (a 10 byte buffer, an MTU sized
buffer and 18 page sized buffers).

That means we allocate ~900k for receive buffers, 12k for the ring, fail
to use 16 ring entries and the ring ends up with a capacity of 12
packets. In the case of MTU sized packets from an off-host source,
that's a huge amount of overhead for ~17k of data.

If we wanted to match the packet capacity that Herbert's suggestion
enables (i.e. 256 packets), we'd need to bump the ring size to 4k
entries (assuming we reduce it to 19 entries per packet). This would
mean we'd need to allocate ~200k for the ring and ~18M in receive
buffers. Again, assuming MTU sized packets, that's massive overhead for
~400k of data.
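The arithmetic above can be reproduced with a trivial standalone model (figures as in this mail: a 20-entry buffer head made up of a 10 byte header buffer, an MTU sized buffer and 18 page sized buffers):

```c
#include <assert.h>

#define RING_SIZE        256
#define ENTRIES_PER_HEAD 20                      /* hdr + MTU buf + 18 pages */
#define BYTES_PER_HEAD   (10 + 1500 + 18 * 4096) /* memory per buffer head */

/* How many packets the ring can hold under the current scheme. */
static int packet_capacity(int ring_size)
{
    return ring_size / ENTRIES_PER_HEAD;
}

/* Ring entries left over that cannot form a complete buffer head. */
static int wasted_entries(int ring_size)
{
    return ring_size % ENTRIES_PER_HEAD;
}

/* Receive buffer memory committed for that capacity. */
static long rx_buffer_bytes(int ring_size)
{
    return (long)packet_capacity(ring_size) * BYTES_PER_HEAD;
}
```

This gives 12 packets of capacity, 16 unusable entries and roughly 900k of receive buffer memory for a 256 entry ring, matching the figures above.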

> > Secondly, we can put the virtio_net_hdr at the head of the skb data (this is
> > also worth considering for xmit I think if we have headroom) and drop
> > MAX_SKB_FRAGS which contains a gratuitous +2.
>
> That's fine but having skb->data in the ring still means two
> different kinds of memory in there and it sucks when you only
> have 1500-byte packets.

Also, including virtio_net_hdr in the data buffer would need another
feature flag. Rightly or wrongly, KVM's implementation requires
virtio_net_hdr to be the first buffer:

if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
    fprintf(stderr, "virtio-net header not in first element\n");
    exit(1);
}

i.e. it's part of the ABI ... at least as KVM sees it :-)

> > > The size of the logical buffer is
> > > returned to the guest rather than the size of the individual smaller
> > > buffers.
> >
> > That's a virtio transport breakage: can you use the standard virtio mechanism,
> > just put the extended length or number of extra buffers inside the
> > virtio_net_hdr?
>
> Sure that sounds reasonable.


I'll give that a shot.

Cheers,
Mark.

2008-10-09 19:26:47

by Anthony Liguori

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

Mark McLoughlin wrote:
>
> Also, including virtio_net_hdr in the data buffer would need another
> feature flag. Rightly or wrongly, KVM's implementation requires
> virtio_net_hdr to be the first buffer:
>
> if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
>     fprintf(stderr, "virtio-net header not in first element\n");
>     exit(1);
> }
>
> i.e. it's part of the ABI ... at least as KVM sees it :-)

This is actually something that's broken in a nasty way. Having the
header in the first element is not supposed to be part of the ABI but it
sort of has to be ATM.

If an older version of QEMU were to use a newer kernel, and the newer
kernel had a larger header size, then if we just made the header be the
first X bytes, QEMU has no way of knowing how many bytes that should be.
Instead, the guest actually has to allocate the virtio-net header in
such a way that it only presents the size depending on the features that
the host supports. We don't use a simple versioning scheme, so you'd
have to check for a combination of features advertised by the host but
that's not good enough because the host may disable certain features.

Perhaps the header size should be that of the longest element that has
been commonly negotiated?

So that's why this aggressive check is here. Not to necessarily cement
this into the ABI but as a way to make someone figure out how to
sanitize this all.

Regards,

Anthony Liguori

2008-10-10 08:31:41

by Mark McLoughlin

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Thu, 2008-10-09 at 14:26 -0500, Anthony Liguori wrote:
> Mark McLoughlin wrote:
> >
> > Also, including virtio_net_hdr in the data buffer would need another
> > feature flag. Rightly or wrongly, KVM's implementation requires
> > virtio_net_hdr to be the first buffer:
> >
> > if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
> >     fprintf(stderr, "virtio-net header not in first element\n");
> >     exit(1);
> > }
> >
> > i.e. it's part of the ABI ... at least as KVM sees it :-)
>
> This is actually something that's broken in a nasty way. Having the
> header in the first element is not supposed to be part of the ABI but it
> sort of has to be ATM.
>
> If an older version of QEMU were to use a newer kernel, and the newer
> kernel had a larger header size, then if we just made the header be the
> first X bytes, QEMU has no way of knowing how many bytes that should be.
> Instead, the guest actually has to allocate the virtio-net header in
> such a way that it only presents the size depending on the features that
> the host supports. We don't use a simple versioning scheme, so you'd
> have to check for a combination of features advertised by the host but
> that's not good enough because the host may disable certain features.
>
> Perhaps the header size is whatever the longest element that has been
> commonly negotiated?
>
> So that's why this aggressive check is here. Not to necessarily cement
> this into the ABI but as a way to make someone figure out how to
> sanitize this all.

Well, features may be orthogonal but they are still added sequentially
to the ABI. So, you would have a kind of implicit ABI versioning, while
still allowing individual selection of features.

e.g. if NET_F_FOO adds "int foo" to the header and then NET_F_BAR adds
"int bar" to the header then if NET_F_FOO is negotiated, the guest
should only send a header with "foo" and if NET_F_FOO|NET_F_BAR or
NET_F_BAR is negotiated, then the guest sends a header with both "foo"
and "bar".

Or put it another way, a host or guest may not implement NET_F_FOO but
knowledge of the "foo" header field is part of the ABI of NET_F_BAR.
That knowledge would be as simple as knowing that the field exists and
that it should be ignored if the feature isn't used.

Cheers,
Mark.

2008-10-10 12:58:15

by Mark McLoughlin

Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Thu, 2008-10-09 at 23:30 +0800, Herbert Xu wrote:
> On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:

> > > The size of the logical buffer is
> > > returned to the guest rather than the size of the individual smaller
> > > buffers.
> >
> > That's a virtio transport breakage: can you use the standard virtio mechanism,
> > just put the extended length or number of extra buffers inside the
> > virtio_net_hdr?
>
> Sure that sounds reasonable.

Okay, here we go.

The new header is lamely called virtio_net_hdr2 - I've added some
padding in there so we can extend it further in future.

It gets messy for lguest because tun/tap isn't using the same header
format anymore.

Rusty - let me know if this looks reasonable and, if so, I'll merge it
back into the original patches and resend.

Cheers,
Mark.

diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index da934c2..0f840f2 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -940,14 +940,21 @@ static void handle_net_output(int fd, struct virtqueue *vq, bool timeout)
{
unsigned int head, out, in, num = 0;
int len;
- struct iovec iov[vq->vring.num];
+ struct iovec iov[vq->vring.num + 1];
static int last_timeout_num;

/* Keep getting output buffers from the Guest until we run out. */
- while ((head = get_vq_desc(vq, iov, &out, &in)) != vq->vring.num) {
+ while ((head = get_vq_desc(vq, &iov[1], &out, &in)) != vq->vring.num) {
if (in)
errx(1, "Input buffers in output queue?");
- len = writev(vq->dev->fd, iov, out);
+
+ /* tapfd needs a virtio_net_hdr, not virtio_net_hdr2 */
+ iov[0].iov_base = iov[1].iov_base;
+ iov[0].iov_len = sizeof(struct virtio_net_hdr);
+ iov[1].iov_base += sizeof(struct virtio_net_hdr2);
+ iov[1].iov_len -= sizeof(struct virtio_net_hdr2);
+
+ len = writev(vq->dev->fd, iov, out + 1);
if (len < 0)
err(1, "Writing network packet to tun");
add_used_and_trigger(fd, vq, head, len);
@@ -998,18 +1005,24 @@ static unsigned int get_net_recv_head(struct device *dev, struct iovec *iov,

/* Here we add used recv buffers to the used queue but, also, return unused
* buffers to the avail queue. */
-static void add_net_recv_used(struct device *dev, unsigned int *heads,
- int *bufsizes, int nheads, int used_len)
+static void add_net_recv_used(struct device *dev, struct virtio_net_hdr2 *hdr2,
+ unsigned int *heads, int *bufsizes,
+ int nheads, int used_len)
{
int len, idx;

/* Add the buffers we've actually used to the used queue */
len = idx = 0;
while (len < used_len) {
- add_used(dev->vq, heads[idx], used_len, idx);
+ if (bufsizes[idx] > (used_len - len))
+ bufsizes[idx] = used_len - len;
+ add_used(dev->vq, heads[idx], bufsizes[idx], idx);
len += bufsizes[idx++];
}

+ /* The guest needs to know how many buffers to fetch */
+ hdr2->num_buffers = idx;
+
/* Return the rest of them back to the avail queue */
lg_last_avail(dev->vq) -= nheads - idx;
dev->vq->inflight -= nheads - idx;
@@ -1022,12 +1035,17 @@ static void add_net_recv_used(struct device *dev, unsigned int *heads,
* Guest. */
static bool handle_tun_input(int fd, struct device *dev)
{
- struct iovec iov[dev->vq->vring.num];
+ struct virtio_net_hdr hdr;
+ struct virtio_net_hdr2 *hdr2;
+ struct iovec iov[dev->vq->vring.num + 1];
unsigned int heads[NET_MAX_RECV_PAGES];
int bufsizes[NET_MAX_RECV_PAGES];
int nheads, len, iovcnt;

- nheads = len = iovcnt = 0;
+ nheads = len = 0;
+
+ /* First iov is for the header */
+ iovcnt = 1;

/* First we need enough network buffers from the Guests's recv
* virtqueue for the largest possible packet. */
@@ -1056,13 +1074,26 @@ static bool handle_tun_input(int fd, struct device *dev)
len += bufsizes[nheads++];
}

+ /* Read virtio_net_hdr from tapfd */
+ iov[0].iov_base = &hdr;
+ iov[0].iov_len = sizeof(hdr);
+
+ /* Read data into buffer after virtio_net_hdr2 */
+ hdr2 = iov[1].iov_base;
+ iov[1].iov_base += sizeof(*hdr2);
+ iov[1].iov_len -= sizeof(*hdr2);
+
/* Read the packet from the device directly into the Guest's buffer. */
len = readv(dev->fd, iov, iovcnt);
if (len <= 0)
err(1, "reading network");

+ /* Copy the virtio_net_hdr into the virtio_net_hdr2 */
+ hdr2->hdr = hdr;
+ len += sizeof(*hdr2) - sizeof(hdr);
+
/* Return unused buffers to the recv queue */
- add_net_recv_used(dev, heads, bufsizes, nheads, len);
+ add_net_recv_used(dev, hdr2, heads, bufsizes, nheads, len);

/* Fire in the hole ! */
trigger_irq(fd, dev->vq);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1780d6d..719e9dc 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -61,6 +61,7 @@ struct virtnet_info

/* Host will merge rx buffers for big packets (shake it! shake it!) */
bool mergeable_rx_bufs;
+ bool use_vnet_hdr2;

/* Receive & send queues. */
struct sk_buff_head recv;
@@ -70,11 +71,21 @@ struct virtnet_info
struct page *pages;
};

+static inline struct virtio_net_hdr2 *skb_vnet_hdr2(struct sk_buff *skb)
+{
+ return (struct virtio_net_hdr2 *)skb->cb;
+}
+
static inline struct virtio_net_hdr *skb_vnet_hdr(struct sk_buff *skb)
{
return (struct virtio_net_hdr *)skb->cb;
}

+static inline void vnet_hdr2_to_sg(struct scatterlist *sg, struct sk_buff *skb)
+{
+ sg_init_one(sg, skb_vnet_hdr2(skb), sizeof(struct virtio_net_hdr2));
+}
+
static inline void vnet_hdr_to_sg(struct scatterlist *sg, struct sk_buff *skb)
{
sg_init_one(sg, skb_vnet_hdr(skb), sizeof(struct virtio_net_hdr));
@@ -135,43 +146,40 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
dev->stats.rx_length_errors++;
goto drop;
}
- len -= sizeof(struct virtio_net_hdr);

if (vi->mergeable_rx_bufs) {
+ struct virtio_net_hdr2 *hdr2 = skb_vnet_hdr2(skb);
unsigned int copy;
- unsigned int plen;
char *p = page_address(skb_shinfo(skb)->frags[0].page);

- memcpy(hdr, p, sizeof(*hdr));
- p += sizeof(*hdr);
+ if (len > PAGE_SIZE)
+ len = PAGE_SIZE;
+ len -= sizeof(struct virtio_net_hdr2);

- plen = PAGE_SIZE - sizeof(*hdr);
- if (plen > len)
- plen = len;
+ memcpy(hdr2, p, sizeof(*hdr2));
+ p += sizeof(*hdr2);

- copy = plen;
+ copy = len;
if (copy > skb_tailroom(skb))
copy = skb_tailroom(skb);

memcpy(skb_put(skb, copy), p, copy);

len -= copy;
- plen -= copy;

- if (!plen) {
+ if (!len) {
give_a_page(vi, skb_shinfo(skb)->frags[0].page);
skb_shinfo(skb)->nr_frags--;
} else {
skb_shinfo(skb)->frags[0].page_offset +=
- sizeof(*hdr) + copy;
- skb_shinfo(skb)->frags[0].size = plen;
- skb->data_len += plen;
- skb->len += plen;
+ sizeof(*hdr2) + copy;
+ skb_shinfo(skb)->frags[0].size = len;
+ skb->data_len += len;
+ skb->len += len;
}

- while ((len -= plen)) {
+ while (--hdr2->num_buffers) {
struct sk_buff *nskb;
- unsigned nlen;

i = skb_shinfo(skb)->nr_frags;
if (i >= MAX_SKB_FRAGS) {
@@ -181,10 +189,10 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
goto drop;
}

- nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &nlen);
+ nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &len);
if (!nskb) {
- pr_debug("%s: packet length error %d < %d\n",
- dev->name, skb->len, len);
+ pr_debug("%s: rx error: %d buffers missing\n",
+ dev->name, hdr2->num_buffers);
dev->stats.rx_length_errors++;
goto drop;
}
@@ -196,16 +204,17 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
skb_shinfo(nskb)->nr_frags = 0;
kfree_skb(nskb);

- plen = PAGE_SIZE;
- if (plen > len)
- plen = len;
+ if (len > PAGE_SIZE)
+ len = PAGE_SIZE;

- skb_shinfo(skb)->frags[i].size = plen;
+ skb_shinfo(skb)->frags[i].size = len;
skb_shinfo(skb)->nr_frags++;
- skb->data_len += plen;
- skb->len += plen;
+ skb->data_len += len;
+ skb->len += len;
}
} else {
+ len -= sizeof(struct virtio_net_hdr);
+
if (len <= MAX_PACKET_LEN)
trim_pages(vi, skb);

@@ -451,6 +460,7 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
{
int num, err;
struct scatterlist sg[2+MAX_SKB_FRAGS];
+ struct virtio_net_hdr2 *hdr2;
struct virtio_net_hdr *hdr;
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;

@@ -461,7 +471,9 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
dest[3], dest[4], dest[5]);

/* Encode metadata header at front. */
- hdr = skb_vnet_hdr(skb);
+ hdr2 = skb_vnet_hdr2(skb);
+ hdr = &hdr2->hdr;
+
if (skb->ip_summed == CHECKSUM_PARTIAL) {
hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
hdr->csum_start = skb->csum_start - skb_headroom(skb);
@@ -489,7 +501,13 @@ static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
hdr->gso_size = hdr->hdr_len = 0;
}

- vnet_hdr_to_sg(sg, skb);
+ hdr2->num_buffers = 0;
+
+ if (vi->use_vnet_hdr2)
+ vnet_hdr2_to_sg(sg, skb);
+ else
+ vnet_hdr_to_sg(sg, skb);
+
num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;

err = vi->svq->vq_ops->add_buf(vi->svq, sg, num, 0, skb);
@@ -678,8 +696,10 @@ static int virtnet_probe(struct virtio_device *vdev)
|| virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_ECN))
vi->big_packets = true;

- if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
+ if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF)) {
vi->mergeable_rx_bufs = true;
+ vi->use_vnet_hdr2 = true;
+ }

/* We expect two virtqueues, receive then send. */
vi->rvq = vdev->config->find_vq(vdev, 0, skb_recv_done);
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 8f376a7..59a5079 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -45,4 +45,14 @@ struct virtio_net_hdr
__u16 csum_start; /* Position to start checksumming from */
__u16 csum_offset; /* Offset after that to place checksum */
};
+
+/* This is the version of the header to use when the MRG_RXBUF
+ * feature (or any later feature) has been negotiated. */
+struct virtio_net_hdr2
+{
+ struct virtio_net_hdr hdr;
+ __u8 num_buffers; /* Number of merged rx buffers */
+ __u8 pad[21]; /* Pad to 32 bytes */
+};
+
#endif /* _LINUX_VIRTIO_NET_H */

2008-10-16 04:44:23

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Friday 10 October 2008 02:30:35 Herbert Xu wrote:
> On Thu, Oct 09, 2008 at 11:55:59AM +1100, Rusty Russell wrote:
> > Secondly, we can put the virtio_net_hdr at the head of the skb data (this
> > is also worth considering for xmit I think if we have headroom) and drop
> > MAX_SKB_FRAGS which contains a gratuitous +2.
>
> That's fine but having skb->data in the ring still means two
> different kinds of memory in there and it sucks when you only
> have 1500-byte packets.

No, you really want to do this for 1500 byte packets since it increases the
effective space in the ring. Unfortunately, Mark points out that kvm assumes
the header is standalone: Anthony and I discussed this a while back and
decided it *wasn't* a good assumption.

TODO: YA feature bit...

> We need a scheme that handles both 1500-byte packets as well
> as 64K-byte size ones, and without holding down 16M of memory
> per guest.

Ah, thanks for that. It's not so much ring entries, as guest memory you're
trying to save. That makes much more sense.

> > > + char *p = page_address(skb_shinfo(skb)->frags[0].page);
> >
> > ...
> >
> > > + memcpy(hdr, p, sizeof(*hdr));
> > > + p += sizeof(*hdr);
> >
> > I think you need kmap_atomic() here to access the page. And yes, that
> > will affect performance :(
>
> No we don't. kmap would only be necessary for highmem which we
> did not request.

Good point. Could you humor me with a comment to that effect? Prevents me
making the same mistake again.

Thanks!
Rusty.
PS. Laptop broke, was MIA for a week. Working overtime now.

2008-10-16 05:11:17

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 1/2] virtio_net: Recycle some more rx buffer pages

On Thursday 09 October 2008 06:34:58 Mark McLoughlin wrote:
> Each time we re-fill the recv queue with buffers, we allocate
> one too many skbs and free it again when adding fails. We should
> recycle the pages allocated in this case.
>
> A previous version of this patch made trim_pages() trim trailing
> unused pages from skbs with some paged data, but this actually
> caused a barely measurable slowdown.

Yes, I noticed a similar effect. Not quite sure why though.

Applied.

Thanks!
Rusty.

2008-10-16 09:16:22

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 2/2] virtio_net: Improve the recv buffer allocation scheme

On Friday 10 October 2008 06:26:25 Anthony Liguori wrote:
> Mark McLoughlin wrote:
> > Also, including virtio_net_hdr in the data buffer would need another
> > feature flag. Rightly or wrongly, KVM's implementation requires
> > virtio_net_hdr to be the first buffer:
> >
> > if (elem.in_num < 1 || elem.in_sg[0].iov_len != sizeof(*hdr)) {
> > fprintf(stderr, "virtio-net header not in first element\n");
> > exit(1);
> > }
> >
> > i.e. it's part of the ABI ... at least as KVM sees it :-)
>
> This is actually something that's broken in a nasty way. Having the
> header in the first element is not supposed to be part of the ABI but it
> sort of has to be ATM.
>
> If an older version of QEMU were to use a newer kernel, and the newer
> kernel had a larger header size, then if we just made the header be the
> first X bytes, QEMU has no way of knowing how many bytes that should be.
> Instead, the guest actually has to allocate the virtio-net header in
> such a way that it only presents the size depending on the features that
> the host supports. We don't use a simple versioning scheme, so you'd
> have to check for a combination of features advertised by the host but
> that's not good enough because the host may disable certain features.
>
> Perhaps the header size is whatever the longest element that has been
> commonly negotiated?

Yes. The feature implies the header extension. Not knowing implies no
extension is possible.

Rusty.