2009-11-20 06:09:24

by Shirley Ma

[permalink] [raw]
Subject: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net

Guest virtio_net receives packets from its pre-allocated vring
buffers, then it delivers these packets to upper layer protocols
as skb buffs. So it's not necessary to pre-allocate skb for each
mergable buffer, then frees it when it's useless.

This patch has deferred skb allocation when receiving packets for
both big packets and mergeable buffers. It reduces skb pre-allocations
and skb_frees.

Based on Mickael & Avi's suggestion. A destroy function has been created
to push virtio free buffs to vring for unused pages, and used page private
to maintain page list.

I didn't touch small packet skb allocation to avoid extra copies for small
packets.

This patch has tested and measured against 2.6.32-rc5 git. It is built again
2.6.32-rc7 kernel. Tests have been done for small packets, big packets and
mergeable buffers.

The single netperf TCP_STREAM performance improved for host to guest.
It also reduces UDP packets drop rate.

The netperf laptop results were:

mtu=1500
netperf -H xxx -l 120

w/o patch w/i patch (two runs)
guest to host: 3336.84Mb/s 3730.14Mb/s ~ 3582.88Mb/s

host to guest: 3165.10Mb/s 3370.39Mb/s ~ 3407.96Mb/s


2009-11-23 00:53:18

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net

On Fri, 20 Nov 2009 04:39:19 pm Shirley Ma wrote:
> Guest virtio_net receives packets from its pre-allocated vring
> buffers, then it delivers these packets to upper layer protocols
> as skb buffs. So it's not necessary to pre-allocate skb for each
> mergable buffer, then frees it when it's useless.
>
> This patch has deferred skb allocation when receiving packets for
> both big packets and mergeable buffers. It reduces skb pre-allocations
> and skb_frees.
>
> Based on Mickael & Avi's suggestion. A destroy function has been created
> to push virtio free buffs to vring for unused pages, and used page private
> to maintain page list.
>
> I didn't touch small packet skb allocation to avoid extra copies for small
> packets.
>
> This patch has tested and measured against 2.6.32-rc5 git. It is built again
> 2.6.32-rc7 kernel. Tests have been done for small packets, big packets and
> mergeable buffers.
>
> The single netperf TCP_STREAM performance improved for host to guest.
> It also reduces UDP packets drop rate.
>
> The netperf laptop results were:
>
> mtu=1500
> netperf -H xxx -l 120
>
> w/o patch w/i patch (two runs)
> guest to host: 3336.84Mb/s 3730.14Mb/s ~ 3582.88Mb/s
>
> host to guest: 3165.10Mb/s 3370.39Mb/s ~ 3407.96Mb/s

Nice!

Is this using mergeable_rx_bufs? Or just big_packets?

I'd like to drop big packet support from our driver, but I don't know
how many kvm hosts will not offer mergable rx bufs yet. Anthony?

Thanks,
Rusty.

2009-11-23 08:54:42

by Mark McLoughlin

[permalink] [raw]
Subject: Re: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net

On Mon, 2009-11-23 at 11:23 +1030, Rusty Russell wrote:
> I'd like to drop big packet support from our driver, but I don't know
> how many kvm hosts will not offer mergable rx bufs yet. Anthony?

Mergeable rx bufs were first added in kvm-80 and qemu-0.10.0

So e.g., it's in Fedora since Fedora 11 and RHEL since 5.4

Cheers,
Mark.

2009-12-08 12:24:18

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 0/1] Defer skb allocation for both mergeable buffers and big packets in virtio_net

On Mon, Nov 23, 2009 at 11:23:18AM +1030, Rusty Russell wrote:
> On Fri, 20 Nov 2009 04:39:19 pm Shirley Ma wrote:
> > Guest virtio_net receives packets from its pre-allocated vring
> > buffers, then it delivers these packets to upper layer protocols
> > as skb buffs. So it's not necessary to pre-allocate skb for each
> > mergable buffer, then frees it when it's useless.
> >
> > This patch has deferred skb allocation when receiving packets for
> > both big packets and mergeable buffers. It reduces skb pre-allocations
> > and skb_frees.
> >
> > Based on Mickael & Avi's suggestion. A destroy function has been created
> > to push virtio free buffs to vring for unused pages, and used page private
> > to maintain page list.
> >
> > I didn't touch small packet skb allocation to avoid extra copies for small
> > packets.
> >
> > This patch has tested and measured against 2.6.32-rc5 git. It is built again
> > 2.6.32-rc7 kernel. Tests have been done for small packets, big packets and
> > mergeable buffers.
> >
> > The single netperf TCP_STREAM performance improved for host to guest.
> > It also reduces UDP packets drop rate.
> >
> > The netperf laptop results were:
> >
> > mtu=1500
> > netperf -H xxx -l 120
> >
> > w/o patch w/i patch (two runs)
> > guest to host: 3336.84Mb/s 3730.14Mb/s ~ 3582.88Mb/s
> >
> > host to guest: 3165.10Mb/s 3370.39Mb/s ~ 3407.96Mb/s
>
> Nice!
>
> Is this using mergeable_rx_bufs? Or just big_packets?
>
> I'd like to drop big packet support from our driver, but I don't know
> how many kvm hosts will not offer mergable rx bufs yet.


One issue with mergeable buffers is that they can work well with zero
copy RX, but only if hardware merges RX buffers in the same way virtio
does.

However, I suspect that most hardware can not do this, and simply
scatters data to a buffer you give it and then reports completion.
For such simple hardware, RX can work without extra data copies when guest
uses big packets (just scatter data into guest buffers), but when guest
tries to use mergeable buffers, host will have to merge them and post a
single big packet to hardware. Thus host will be able to only post
about 16 buffers to hardware, and performance will suffer.


--
MST

2009-12-11 12:28:28

by Shirley Ma

[permalink] [raw]
Subject: [PATCH v2 0/4] Defer skb allocation for both mergeable buffers and big packets in virtio_net

This is a patch-set for deferring skb allocation based on Rusty and
Michael's inputs.

Guest virtio_net receives packets from its pre-allocated vring
buffers, then it delivers these packets to upper layer protocols
as skb buffs. So it's not necessary to pre-allocate skb for each
mergable buffer, then frees it when it's useless.

This patch has deferred skb allocation when receiving packets for
both big packets and mergeable buffers. It reduces skb pre-allocations
and skb_frees.

A destroy function has been created to push virtio free buffs to vring
for unused pages, and used page private to maintain page list.

This patch has tested and measured against 2.6.32-rc7 git. It is built
again 2.6.32 kernel. Tests have been done for small packets, big packets
and mergeable buffers.

The single netperf TCP_STREAM performance improved for host to guest.
It also reduces UDP packets drop rate.

Thanks
Shirley

2009-12-11 12:33:25

by Shirley Ma

[permalink] [raw]
Subject: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

Signed-off-by: <[email protected]>
-------------

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c708ecc..bb5eb7b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -107,6 +107,16 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
return p;
}

+static void virtio_net_free_pages(void *buf)
+{
+ struct page *page, *next;
+
+ for (page = buf; page; page = next) {
+ next = (struct page *)page->private;
+ __free_pages(page, 0);
+ }
+}
+
static void skb_xmit_done(struct virtqueue *svq)
{
struct virtnet_info *vi = svq->vdev->priv;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index fbd2ecd..db83465 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -334,6 +334,29 @@ static bool vring_enable_cb(struct virtqueue *_vq)
return true;
}

+static int vring_destroy_bufs(struct virtqueue *_vq, void (*destroy)(void *))
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ void *buf;
+ unsigned int i;
+ int freed = 0;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring.num; i++) {
+ if (vq->data[i]) {
+ /* detach_buf clears data, so grab it now. */
+ buf = vq->data[i];
+ detach_buf(vq, i);
+ destroy(buf);
+ freed++;
+ }
+ }
+
+ END_USE(vq);
+ return freed;
+}
+
irqreturn_t vring_interrupt(int irq, void *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
@@ -360,6 +383,7 @@ static struct virtqueue_ops vring_vq_ops = {
.kick = vring_kick,
.disable_cb = vring_disable_cb,
.enable_cb = vring_enable_cb,
+ .destroy_bufs = vring_destroy_bufs,
};

struct virtqueue *vring_new_virtqueue(unsigned int num,
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 057a2e0..e6d7d77 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -71,6 +71,7 @@ struct virtqueue_ops {

void (*disable_cb)(struct virtqueue *vq);
bool (*enable_cb)(struct virtqueue *vq);
+ int (*destroy_bufs)(struct virtqueue *vq, void (*destory)(void *));
};

/**

2009-12-11 12:43:12

by Shirley Ma

[permalink] [raw]
Subject: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

Signed-off-by: Shirley Ma <[email protected]>
--------------

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bb5eb7b..100b4b9 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -80,29 +80,25 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
return (struct skb_vnet_hdr *)skb->cb;
}

-static void give_a_page(struct virtnet_info *vi, struct page *page)
+static void give_pages(struct virtnet_info *vi, struct page *page)
{
- page->private = (unsigned long)vi->pages;
- vi->pages = page;
-}
+ struct page *end;

-static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb)
-{
- unsigned int i;
-
- for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
- give_a_page(vi, skb_shinfo(skb)->frags[i].page);
- skb_shinfo(skb)->nr_frags = 0;
- skb->data_len = 0;
+ /* Find end of list, sew whole thing into vi->pages. */
+ for (end = page; end->private; end = (struct page *)end->private);
+ end->private = (unsigned long)vi->pages;
+ vi->pages = page;
}

static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
{
struct page *p = vi->pages;

- if (p)
+ if (p) {
vi->pages = (struct page *)p->private;
- else
+ /* use private to chain pages for big packets */
+ p->private = 0;
+ } else
p = alloc_page(gfp_mask);
return p;
}
@@ -128,6 +124,85 @@ static void skb_xmit_done(struct virtqueue *svq)
netif_wake_queue(vi->dev);
}

+static int skb_set_frag(struct sk_buff *skb, struct page *page,
+ int offset, int len)
+{
+ int i = skb_shinfo(skb)->nr_frags;
+ skb_frag_t *f;
+
+ i = skb_shinfo(skb)->nr_frags;
+ f = &skb_shinfo(skb)->frags[i];
+ f->page = page;
+ f->page_offset = offset;
+
+ if (len > PAGE_SIZE - f->page_offset)
+ f->size = PAGE_SIZE - f->page_offset;
+ else
+ f->size = len;
+
+ skb_shinfo(skb)->nr_frags++;
+ skb->data_len += f->size;
+ skb->len += f->size;
+
+ len -= f->size;
+ return len;
+}
+
+static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
+ unsigned int *len)
+{
+ struct sk_buff *skb;
+ struct skb_vnet_hdr *hdr;
+ int copy, hdr_len, offset;
+ char *p;
+
+ p = page_address(*page);
+
+ skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
+ if (unlikely(!skb))
+ return NULL;
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ hdr = skb_vnet_hdr(skb);
+
+ if (vi->mergeable_rx_bufs) {
+ hdr_len = sizeof(hdr->mhdr);
+ offset = hdr_len;
+ } else {
+ /* share one page between virtio_net header and data */
+ struct padded_vnet_hdr {
+ struct virtio_net_hdr hdr;
+ /* This padding makes our data 16 byte aligned */
+ char padding[6];
+ };
+ hdr_len = sizeof(hdr->hdr);
+ offset = sizeof(struct padded_vnet_hdr);
+ }
+
+ memcpy(hdr, p, hdr_len);
+
+ *len -= hdr_len;
+ p += offset;
+
+ copy = *len;
+ if (copy > skb_tailroom(skb))
+ copy = skb_tailroom(skb);
+ memcpy(skb_put(skb, copy), p, copy);
+
+ *len -= copy;
+ offset += copy;
+
+ if (*len) {
+ *len = skb_set_frag(skb, *page, offset, *len);
+ *page = (struct page *)(*page)->private;
+ } else {
+ give_pages(vi, *page);
+ *page = NULL;
+ }
+
+ return skb;
+}
+
static void receive_skb(struct net_device *dev, struct sk_buff *skb,
unsigned len)
{
@@ -162,7 +237,7 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
len -= copy;

if (!len) {
- give_a_page(vi, skb_shinfo(skb)->frags[0].page);
+ give_pages(vi, skb_shinfo(skb)->frags[0].page);
skb_shinfo(skb)->nr_frags--;
} else {
skb_shinfo(skb)->frags[0].page_offset +=

2009-12-11 12:46:57

by Shirley Ma

[permalink] [raw]
Subject: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

Signed-off-by: Shirley Ma <[email protected]>
-------------

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 100b4b9..dde8060 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -203,6 +203,73 @@ static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
return skb;
}

+static struct sk_buff *receive_big(struct virtnet_info *vi, struct page *page,
+ unsigned int len)
+{
+ struct sk_buff *skb;
+
+ skb = skb_goodcopy(vi, &page, &len);
+ if (unlikely(!skb))
+ return NULL;
+
+ while (len > 0) {
+ len = skb_set_frag(skb, page, 0, len);
+ page = (struct page *)page->private;
+ }
+
+ if (page)
+ give_pages(vi, page);
+
+ return skb;
+}
+
+static struct sk_buff *receive_mergeable(struct virtnet_info *vi,
+ struct page *page, unsigned int len)
+{
+ struct sk_buff *skb;
+ struct skb_vnet_hdr *hdr;
+ int num_buf, i;
+
+ if (len > PAGE_SIZE)
+ len = PAGE_SIZE;
+
+ skb = skb_goodcopy(vi, &page, &len);
+
+ if (unlikely(!skb))
+ return NULL;
+
+ hdr = skb_vnet_hdr(skb);
+ num_buf = hdr->mhdr.num_buffers;
+ while (--num_buf) {
+ struct page *page;
+
+ i = skb_shinfo(skb)->nr_frags;
+ if (i >= MAX_SKB_FRAGS) {
+ pr_debug("%s: packet too long %d\n", skb->dev->name,
+ len);
+ skb->dev->stats.rx_length_errors++;
+ return skb;
+ }
+
+ page = vi->rvq->vq_ops->get_buf(vi->rvq, &len);
+ if (!page) {
+ pr_debug("%s: rx error: %d buffers missing\n",
+ skb->dev->name, hdr->mhdr.num_buffers);
+ skb->dev->stats.rx_length_errors++;
+ return skb;
+ }
+
+ if (len > PAGE_SIZE)
+ len = PAGE_SIZE;
+
+ skb_set_frag(skb, page, 0, len);
+
+ vi->num--;
+ }
+
+ return skb;
+}
+
static void receive_skb(struct net_device *dev, struct sk_buff *skb,
unsigned len)
{
@@ -356,6 +423,103 @@ drop:
dev_kfree_skb(skb);
}

+static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp, bool *oom)
+{
+ struct sk_buff *skb;
+ struct skb_vnet_hdr *hdr;
+ struct scatterlist sg[2];
+ int err = 0;
+
+ skb = netdev_alloc_skb(vi->dev, MAX_PACKET_LEN + NET_IP_ALIGN);
+ if (unlikely(!skb)) {
+ *oom = true;
+ return err;
+ }
+
+ skb_reserve(skb, NET_IP_ALIGN);
+ skb_put(skb, MAX_PACKET_LEN);
+
+ hdr = skb_vnet_hdr(skb);
+ sg_set_buf(sg, &hdr->hdr, sizeof(hdr->hdr));
+
+ skb_to_sgvec(skb, sg+1, 0, skb->len);
+
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
+ if (err < 0)
+ kfree_skb(skb);
+ else
+ skb_queue_head(&vi->recv, skb);
+
+ return err;
+}
+
+static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp, bool *oom)
+{
+ struct scatterlist sg[2 + MAX_SKB_FRAGS];
+ int total = MAX_SKB_FRAGS + 2;
+ char *p;
+ int err = 0;
+ int i, offset;
+ struct page *first = NULL;
+ struct page *page;
+ /* share one page between virtio_net header and data */
+ struct padded_vnet_hdr {
+ struct virtio_net_hdr hdr;
+ /* This padding makes our data 16 byte aligned */
+ char padding[6];
+ };
+
+ offset = sizeof(struct padded_vnet_hdr);
+
+ for (i = total - 1; i > 0; i--) {
+ page = get_a_page(vi, gfp);
+ if (!page) {
+ if (first)
+ give_pages(vi, first);
+ *oom = true;
+ break;
+ }
+
+ p = page_address(page);
+ page->private = (unsigned long)first;
+ first = page;
+
+ /* allocate MAX_SKB_FRAGS + 1 pages for big packets */
+ if (i == 1) {
+ sg_set_buf(&sg[i-1], p, sizeof(struct virtio_net_hdr));
+ sg_set_buf(&sg[i], p + offset, PAGE_SIZE - offset);
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, total,
+ first);
+ if (err < 0)
+ give_pages(vi, first);
+ } else
+ sg_set_buf(&sg[i], p, PAGE_SIZE);
+ }
+
+ return err;
+}
+
+static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp, bool *oom)
+{
+ struct page *page;
+ struct scatterlist sg;
+ int err = 0;
+
+ page = get_a_page(vi, gfp);
+ if (!page) {
+ *oom = true;
+ return err;
+ }
+
+ sg_init_one(&sg, page_address(page), PAGE_SIZE);
+
+ err = vi->rvq->vq_ops->add_buf(vi->rvq, &sg, 0, 1, page);
+ if (err < 0)
+ give_pages(vi, page);
+
+ return err;
+}
+
static bool try_fill_recv_maxbufs(struct virtnet_info *vi, gfp_t gfp)
{
struct sk_buff *skb;

2009-12-11 12:49:56

by Shirley Ma

[permalink] [raw]
Subject: [PATCH v2 4/4] Defer skb allocation -- change allocation & receiving in recv path

Signed-off-by: Shirley Ma <[email protected]>
-------------

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index dde8060..b919169 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -270,99 +270,44 @@ static struct sk_buff *receive_mergeable(struct virtnet_info *vi,
return skb;
}

-static void receive_skb(struct net_device *dev, struct sk_buff *skb,
- unsigned len)
+static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
{
struct virtnet_info *vi = netdev_priv(dev);
- struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
- int err;
- int i;
+ struct sk_buff *skb = (struct sk_buff *)buf;
+ struct page *page = (struct page *)buf;
+ struct skb_vnet_hdr *hdr;

if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
pr_debug("%s: short packet %i\n", dev->name, len);
dev->stats.rx_length_errors++;
- goto drop;
- }
-
- if (vi->mergeable_rx_bufs) {
- unsigned int copy;
- char *p = page_address(skb_shinfo(skb)->frags[0].page);
-
- if (len > PAGE_SIZE)
- len = PAGE_SIZE;
- len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
-
- memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr));
- p += sizeof(hdr->mhdr);
-
- copy = len;
- if (copy > skb_tailroom(skb))
- copy = skb_tailroom(skb);
-
- memcpy(skb_put(skb, copy), p, copy);
-
- len -= copy;
-
- if (!len) {
- give_pages(vi, skb_shinfo(skb)->frags[0].page);
- skb_shinfo(skb)->nr_frags--;
+ if (vi->mergeable_rx_bufs || vi->big_packets) {
+ give_pages(vi, page);
+ return;
} else {
- skb_shinfo(skb)->frags[0].page_offset +=
- sizeof(hdr->mhdr) + copy;
- skb_shinfo(skb)->frags[0].size = len;
- skb->data_len += len;
- skb->len += len;
- }
-
- while (--hdr->mhdr.num_buffers) {
- struct sk_buff *nskb;
-
- i = skb_shinfo(skb)->nr_frags;
- if (i >= MAX_SKB_FRAGS) {
- pr_debug("%s: packet too long %d\n", dev->name,
- len);
- dev->stats.rx_length_errors++;
- goto drop;
- }
-
- nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &len);
- if (!nskb) {
- pr_debug("%s: rx error: %d buffers missing\n",
- dev->name, hdr->mhdr.num_buffers);
- dev->stats.rx_length_errors++;
- goto drop;
- }
-
- __skb_unlink(nskb, &vi->recv);
- vi->num--;
-
- skb_shinfo(skb)->frags[i] = skb_shinfo(nskb)->frags[0];
- skb_shinfo(nskb)->nr_frags = 0;
- kfree_skb(nskb);
-
- if (len > PAGE_SIZE)
- len = PAGE_SIZE;
-
- skb_shinfo(skb)->frags[i].size = len;
- skb_shinfo(skb)->nr_frags++;
- skb->data_len += len;
- skb->len += len;
+ skb_unlink(skb, &vi->recv);
+ goto drop;
}
- } else {
- len -= sizeof(hdr->hdr);
+ }

- if (len <= MAX_PACKET_LEN)
- trim_pages(vi, skb);
+ if (vi->mergeable_rx_bufs)
+ skb = receive_mergeable(vi, page, len);
+ else if (vi->big_packets)
+ skb = receive_big(vi, page, len);
+ else {
+ __skb_unlink(skb, &vi->recv);
+ len -= sizeof(struct virtio_net_hdr);
+ skb_trim(skb, len);
+ }

- err = pskb_trim(skb, len);
- if (err) {
- pr_debug("%s: pskb_trim failed %i %d\n", dev->name,
- len, err);
- dev->stats.rx_dropped++;
- goto drop;
- }
+ if (unlikely(!skb)) {
+ dev->stats.rx_dropped++;
+ /* only for mergeable buf and big packets */
+ give_pages(vi, page);
+ return;
}

+ hdr = skb_vnet_hdr(skb);
+
skb->truesize += skb->data_len;
dev->stats.rx_bytes += skb->len;
dev->stats.rx_packets++;
@@ -520,105 +465,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp, bool *oom)
return err;
}

-static bool try_fill_recv_maxbufs(struct virtnet_info *vi, gfp_t gfp)
-{
- struct sk_buff *skb;
- struct scatterlist sg[2+MAX_SKB_FRAGS];
- int num, err, i;
- bool oom = false;
-
- sg_init_table(sg, 2+MAX_SKB_FRAGS);
- do {
- struct skb_vnet_hdr *hdr;
-
- skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
- if (unlikely(!skb)) {
- oom = true;
- break;
- }
-
- skb_put(skb, MAX_PACKET_LEN);
-
- hdr = skb_vnet_hdr(skb);
- sg_set_buf(sg, &hdr->hdr, sizeof(hdr->hdr));
-
- if (vi->big_packets) {
- for (i = 0; i < MAX_SKB_FRAGS; i++) {
- skb_frag_t *f = &skb_shinfo(skb)->frags[i];
- f->page = get_a_page(vi, gfp);
- if (!f->page)
- break;
-
- f->page_offset = 0;
- f->size = PAGE_SIZE;
-
- skb->data_len += PAGE_SIZE;
- skb->len += PAGE_SIZE;
-
- skb_shinfo(skb)->nr_frags++;
- }
- }
-
- num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
- skb_queue_head(&vi->recv, skb);
-
- err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, num, skb);
- if (err < 0) {
- skb_unlink(skb, &vi->recv);
- trim_pages(vi, skb);
- kfree_skb(skb);
- break;
- }
- vi->num++;
- } while (err >= num);
- if (unlikely(vi->num > vi->max))
- vi->max = vi->num;
- vi->rvq->vq_ops->kick(vi->rvq);
- return !oom;
-}
-
/* Returns false if we couldn't fill entirely (OOM). */
static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
{
- struct sk_buff *skb;
- struct scatterlist sg[1];
int err;
bool oom = false;

- if (!vi->mergeable_rx_bufs)
- return try_fill_recv_maxbufs(vi, gfp);
-
do {
- skb_frag_t *f;
-
- skb = netdev_alloc_skb_ip_align(vi->dev, GOOD_COPY_LEN);
- if (unlikely(!skb)) {
- oom = true;
- break;
- }
-
- f = &skb_shinfo(skb)->frags[0];
- f->page = get_a_page(vi, gfp);
- if (!f->page) {
- oom = true;
- kfree_skb(skb);
- break;
- }
-
- f->page_offset = 0;
- f->size = PAGE_SIZE;
-
- skb_shinfo(skb)->nr_frags++;
-
- sg_init_one(sg, page_address(f->page), PAGE_SIZE);
- skb_queue_head(&vi->recv, skb);
+ if (vi->mergeable_rx_bufs)
+ err = add_recvbuf_mergeable(vi, gfp, &oom);
+ else if (vi->big_packets)
+ err = add_recvbuf_big(vi, gfp, &oom);
+ else
+ err = add_recvbuf_small(vi, gfp, &oom);

- err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 1, skb);
- if (err < 0) {
- skb_unlink(skb, &vi->recv);
- kfree_skb(skb);
+ if (oom || err < 0)
break;
- }
vi->num++;
} while (err > 0);
if (unlikely(vi->num > vi->max))
@@ -657,14 +519,13 @@ static void refill_work(struct work_struct *work)
static int virtnet_poll(struct napi_struct *napi, int budget)
{
struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
- struct sk_buff *skb = NULL;
+ void *buf = NULL;
unsigned int len, received = 0;

again:
while (received < budget &&
- (skb = vi->rvq->vq_ops->get_buf(vi->rvq, &len)) != NULL) {
- __skb_unlink(skb, &vi->recv);
- receive_skb(vi->dev, skb, len);
+ (buf = vi->rvq->vq_ops->get_buf(vi->rvq, &len)) != NULL) {
+ receive_buf(vi->dev, buf, len);
vi->num--;
received++;
}
@@ -1205,6 +1066,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
{
struct virtnet_info *vi = vdev->priv;
struct sk_buff *skb;
+ int freed;

/* Stop all the virtqueues. */
vdev->config->reset(vdev);
@@ -1216,11 +1078,14 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
}
__skb_queue_purge(&vi->send);

- BUG_ON(vi->num != 0);
-
unregister_netdev(vi->dev);
cancel_delayed_work_sync(&vi->refill);

+ freed = vi->rvq->vq_ops->destroy_bufs(vi->rvq, virtio_net_free_pages);
+ vi->num -= freed;
+
+ BUG_ON(vi->num != 0);
+
vdev->config->del_vqs(vi->vdev);

while (vi->pages)

2009-12-13 10:22:10

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Defer skb allocation for both mergeable buffers and big packets in virtio_net

On Fri, Dec 11, 2009 at 04:28:26AM -0800, Shirley Ma wrote:
> This is a patch-set for deferring skb allocation based on Rusty and
> Michael's inputs.

Shirley, some advice on packaging patches
that I hope will be helpful:

You did try to split up the patch logically,
and it's better than a single huge patch, but it
can be better. For example, you add static functions
in one patch and use them in another patch,
this works well for new APIs which are documented
so you can understand from the documentation
what function should do, but not for internal, static functions:
one ends up looking at usage, going back to implementation,
back to usage, each time switching between patches.

One idea on how to split up the patch set better:
- add new "destroy" API and supply documentation
- switch current implementation over to use destroy API
- split current implementation into subfunctions
handling mergeable/big cases
- convert functions to use deferred allocation
[would be nice to convert mergeable/big separately,
but I am not sure this is possible while keeping
patchset bisectable]

Some suggestions on formatting:
- keep patch names short, and prefix with module name,
not patchset name, so that git summaries look nicer. E.g.
Defer skb allocation -- add destroy buffers function for virtio
Would be better "virtio: add destroy buffers function".
- please supply commit message with some explanation
and motivation that will help someone looking
at git history, without background from mailing list.
E.g.
"Add "destroy" vq API that returns all posted
buffers back to caller. This makes it possible
to avoid tracking buffers in callers to free
them on vq teardown. Will be used by deferred
skb allocation patch.".
- Use "---" to separate description from text,
and generally make patch acceptable to git am.
It is a good idea to use git to generate patches,
for example with git format-patch.
I usually use it with --numbered --thread --cover-letter.


> Guest virtio_net receives packets from its pre-allocated vring
> buffers, then it delivers these packets to upper layer protocols
> as skb buffs. So it's not necessary to pre-allocate skb for each
> mergable buffer, then frees it when it's useless.
> This patch has deferred skb allocation when receiving packets for
> both big packets and mergeable buffers. It reduces skb pre-allocations
> and skb_frees.

E.g. the above should go into commit message for the virtio net
part of patchset.

>
> A destroy function has been created to push virtio free buffs to vring
> for unused pages, and used page private to maintain page list.
>
> This patch has tested and measured against 2.6.32-rc7 git. It is built
> again 2.6.32 kernel.

I think you need to base your patch on Dave's net-next,
it's too late to put it in 2.6.32, or even 2.6.33.
It also should probably go in through Dave's tree because virtio part of
patch is very small, while most of it deals with net/virtio_net.

> Tests have been done for small packets, big packets
> and mergeable buffers.
>
> The single netperf TCP_STREAM performance improved for host to guest.
> It also reduces UDP packets drop rate.


BTW, any numbers? Also, 2.6.32 has regressed as compared to 2.6.31.
Did you try with Sridhar Samudrala's patch from net-next applied
that reportedly fixes this?

> Thanks
> Shirley

2009-12-13 10:29:33

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Fri, Dec 11, 2009 at 04:33:25AM -0800, Shirley Ma wrote:
> Signed-off-by: <[email protected]>
> -------------
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c708ecc..bb5eb7b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -107,6 +107,16 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> return p;
> }
>
> +static void virtio_net_free_pages(void *buf)
> +{
> + struct page *page, *next;
> +
> + for (page = buf; page; page = next) {
> + next = (struct page *)page->private;
> + __free_pages(page, 0);
> + }
> +}
> +
> static void skb_xmit_done(struct virtqueue *svq)
> {
> struct virtnet_info *vi = svq->vdev->priv;

This is not the best way to split the patch,
as I commented separately: this adds static function
but no way to see whether it does what it should do.
If you think about it, free should always go into same patch
that does alloc.

> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index fbd2ecd..db83465 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -334,6 +334,29 @@ static bool vring_enable_cb(struct virtqueue *_vq)
> return true;
> }
>
> +static int vring_destroy_bufs(struct virtqueue *_vq, void (*destroy)(void *))
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + void *buf;
> + unsigned int i;
> + int freed = 0;
> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring.num; i++) {

I prefer ++i in such code personally.

> + if (vq->data[i]) {
> + /* detach_buf clears data, so grab it now. */
> + buf = vq->data[i];
> + detach_buf(vq, i);

Hmm, you are calling detach on a buffer which was not used.
It's not safe to call this while host might use buffers, is it?
Please document this and other assumptions you are making.

> + destroy(buf);
> + freed++;

++freed here as well.

> + }
> + }
> +

It's usually better not to put {} around single statement blocks.

> + END_USE(vq);
> + return freed;
> +}
> +
> irqreturn_t vring_interrupt(int irq, void *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -360,6 +383,7 @@ static struct virtqueue_ops vring_vq_ops = {
> .kick = vring_kick,
> .disable_cb = vring_disable_cb,
> .enable_cb = vring_enable_cb,
> + .destroy_bufs = vring_destroy_bufs,
> };
>
> struct virtqueue *vring_new_virtqueue(unsigned int num,
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 057a2e0..e6d7d77 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -71,6 +71,7 @@ struct virtqueue_ops {
>
> void (*disable_cb)(struct virtqueue *vq);
> bool (*enable_cb)(struct virtqueue *vq);
> + int (*destroy_bufs)(struct virtqueue *vq, void (*destory)(void *));

Typo above. Please document the new field.
Also please document assumptions about usage.

I think callback should get struct virtqueue *vq, and decrement num.
And destroy_bufs should return void, which is much more natural for a
destructor.


That said - do we have to use a callback?
I think destroy_buf which returns data pointer,
and which we call repeatedly until we get NULL
or error, would be an a better, more flexible API.
This is not critical though.


> };
>
> /**
>

2009-12-13 11:11:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] Defer skb allocation -- change allocation & receiving in recv path

On Fri, Dec 11, 2009 at 04:49:53AM -0800, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <[email protected]>
> -------------

Comments about splitting up this patch apply here.


> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index dde8060..b919169 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -270,99 +270,44 @@ static struct sk_buff *receive_mergeable(struct virtnet_info *vi,
> return skb;
> }
>
> -static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> - unsigned len)
> +static void receive_buf(struct net_device *dev, void *buf, unsigned int len)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> - struct skb_vnet_hdr *hdr = skb_vnet_hdr(skb);
> - int err;
> - int i;
> + struct sk_buff *skb = (struct sk_buff *)buf;
> + struct page *page = (struct page *)buf;
> + struct skb_vnet_hdr *hdr;

Do not cast away void*.
This initialization above looks very strange: in
fact only one of skb, page makes sense.
So I think you should either get rid of both page and
skb variables (routines such as give_pages get page *
so they will accept void* just as happily) or
move initialization or even use of these variables to
where they are used.

>
> if (unlikely(len < sizeof(struct virtio_net_hdr) + ETH_HLEN)) {
> pr_debug("%s: short packet %i\n", dev->name, len);
> dev->stats.rx_length_errors++;
> - goto drop;
> - }
> -
> - if (vi->mergeable_rx_bufs) {
> - unsigned int copy;
> - char *p = page_address(skb_shinfo(skb)->frags[0].page);
> -
> - if (len > PAGE_SIZE)
> - len = PAGE_SIZE;
> - len -= sizeof(struct virtio_net_hdr_mrg_rxbuf);
> -
> - memcpy(&hdr->mhdr, p, sizeof(hdr->mhdr));
> - p += sizeof(hdr->mhdr);
> -
> - copy = len;
> - if (copy > skb_tailroom(skb))
> - copy = skb_tailroom(skb);
> -
> - memcpy(skb_put(skb, copy), p, copy);
> -
> - len -= copy;
> -
> - if (!len) {
> - give_pages(vi, skb_shinfo(skb)->frags[0].page);
> - skb_shinfo(skb)->nr_frags--;
> + if (vi->mergeable_rx_bufs || vi->big_packets) {
> + give_pages(vi, page);
> + return;
> } else {
> - skb_shinfo(skb)->frags[0].page_offset +=
> - sizeof(hdr->mhdr) + copy;
> - skb_shinfo(skb)->frags[0].size = len;
> - skb->data_len += len;
> - skb->len += len;
> - }
> -
> - while (--hdr->mhdr.num_buffers) {
> - struct sk_buff *nskb;
> -
> - i = skb_shinfo(skb)->nr_frags;
> - if (i >= MAX_SKB_FRAGS) {
> - pr_debug("%s: packet too long %d\n", dev->name,
> - len);
> - dev->stats.rx_length_errors++;
> - goto drop;
> - }
> -
> - nskb = vi->rvq->vq_ops->get_buf(vi->rvq, &len);
> - if (!nskb) {
> - pr_debug("%s: rx error: %d buffers missing\n",
> - dev->name, hdr->mhdr.num_buffers);
> - dev->stats.rx_length_errors++;
> - goto drop;
> - }
> -
> - __skb_unlink(nskb, &vi->recv);
> - vi->num--;
> -
> - skb_shinfo(skb)->frags[i] = skb_shinfo(nskb)->frags[0];
> - skb_shinfo(nskb)->nr_frags = 0;
> - kfree_skb(nskb);
> -
> - if (len > PAGE_SIZE)
> - len = PAGE_SIZE;
> -
> - skb_shinfo(skb)->frags[i].size = len;
> - skb_shinfo(skb)->nr_frags++;
> - skb->data_len += len;
> - skb->len += len;
> + skb_unlink(skb, &vi->recv);
> + goto drop;
> }
> - } else {
> - len -= sizeof(hdr->hdr);
> + }
>
> - if (len <= MAX_PACKET_LEN)
> - trim_pages(vi, skb);
> + if (vi->mergeable_rx_bufs)
> + skb = receive_mergeable(vi, page, len);
> + else if (vi->big_packets)
> + skb = receive_big(vi, page, len);
> + else {
> + __skb_unlink(skb, &vi->recv);
> + len -= sizeof(struct virtio_net_hdr);
> + skb_trim(skb, len);

So above you override skb that you initialized
at the start of function. It would be better
to do in the 3'd case:
skb = buf;
as well. And then it would be clear why
" Only for mergeable and big" comment below
is true.

> + }
>
> - err = pskb_trim(skb, len);
> - if (err) {
> - pr_debug("%s: pskb_trim failed %i %d\n", dev->name,
> - len, err);
> - dev->stats.rx_dropped++;
> - goto drop;
> - }
> + if (unlikely(!skb)) {
> + dev->stats.rx_dropped++;
> + /* only for mergeable buf and big packets */
> + give_pages(vi, page);
> + return;

Did you remove drop: label? If no, is it unused now?

> }
>
> + hdr = skb_vnet_hdr(skb);
> +
> skb->truesize += skb->data_len;
> dev->stats.rx_bytes += skb->len;
> dev->stats.rx_packets++;
> @@ -520,105 +465,22 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp, bool *oom)
> return err;
> }
>
> -static bool try_fill_recv_maxbufs(struct virtnet_info *vi, gfp_t gfp)
> -{
> - struct sk_buff *skb;
> - struct scatterlist sg[2+MAX_SKB_FRAGS];
> - int num, err, i;
> - bool oom = false;
> -
> - sg_init_table(sg, 2+MAX_SKB_FRAGS);
> - do {
> - struct skb_vnet_hdr *hdr;
> -
> - skb = netdev_alloc_skb_ip_align(vi->dev, MAX_PACKET_LEN);
> - if (unlikely(!skb)) {
> - oom = true;
> - break;
> - }
> -
> - skb_put(skb, MAX_PACKET_LEN);
> -
> - hdr = skb_vnet_hdr(skb);
> - sg_set_buf(sg, &hdr->hdr, sizeof(hdr->hdr));
> -
> - if (vi->big_packets) {
> - for (i = 0; i < MAX_SKB_FRAGS; i++) {
> - skb_frag_t *f = &skb_shinfo(skb)->frags[i];
> - f->page = get_a_page(vi, gfp);
> - if (!f->page)
> - break;
> -
> - f->page_offset = 0;
> - f->size = PAGE_SIZE;
> -
> - skb->data_len += PAGE_SIZE;
> - skb->len += PAGE_SIZE;
> -
> - skb_shinfo(skb)->nr_frags++;
> - }
> - }
> -
> - num = skb_to_sgvec(skb, sg+1, 0, skb->len) + 1;
> - skb_queue_head(&vi->recv, skb);
> -
> - err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, num, skb);
> - if (err < 0) {
> - skb_unlink(skb, &vi->recv);
> - trim_pages(vi, skb);
> - kfree_skb(skb);
> - break;
> - }
> - vi->num++;
> - } while (err >= num);
> - if (unlikely(vi->num > vi->max))
> - vi->max = vi->num;
> - vi->rvq->vq_ops->kick(vi->rvq);
> - return !oom;
> -}
> -
> /* Returns false if we couldn't fill entirely (OOM). */
> static bool try_fill_recv(struct virtnet_info *vi, gfp_t gfp)
> {
> - struct sk_buff *skb;
> - struct scatterlist sg[1];
> int err;
> bool oom = false;
>
> - if (!vi->mergeable_rx_bufs)
> - return try_fill_recv_maxbufs(vi, gfp);
> -
> do {
> - skb_frag_t *f;
> -
> - skb = netdev_alloc_skb_ip_align(vi->dev, GOOD_COPY_LEN);
> - if (unlikely(!skb)) {
> - oom = true;
> - break;
> - }
> -
> - f = &skb_shinfo(skb)->frags[0];
> - f->page = get_a_page(vi, gfp);
> - if (!f->page) {
> - oom = true;
> - kfree_skb(skb);
> - break;
> - }
> -
> - f->page_offset = 0;
> - f->size = PAGE_SIZE;
> -
> - skb_shinfo(skb)->nr_frags++;
> -
> - sg_init_one(sg, page_address(f->page), PAGE_SIZE);
> - skb_queue_head(&vi->recv, skb);
> + if (vi->mergeable_rx_bufs)
> + err = add_recvbuf_mergeable(vi, gfp, &oom);
> + else if (vi->big_packets)
> + err = add_recvbuf_big(vi, gfp, &oom);
> + else
> + err = add_recvbuf_small(vi, gfp, &oom);
>
> - err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 1, skb);
> - if (err < 0) {
> - skb_unlink(skb, &vi->recv);
> - kfree_skb(skb);
> + if (oom || err < 0)
> break;
> - }
> vi->num++;
> } while (err > 0);
> if (unlikely(vi->num > vi->max))
> @@ -657,14 +519,13 @@ static void refill_work(struct work_struct *work)
> static int virtnet_poll(struct napi_struct *napi, int budget)
> {
> struct virtnet_info *vi = container_of(napi, struct virtnet_info, napi);
> - struct sk_buff *skb = NULL;
> + void *buf = NULL;

Does this have to be initialized? If not (as it seems) better not do it.

> unsigned int len, received = 0;
>
> again:
> while (received < budget &&
> - (skb = vi->rvq->vq_ops->get_buf(vi->rvq, &len)) != NULL) {
> - __skb_unlink(skb, &vi->recv);
> - receive_skb(vi->dev, skb, len);
> + (buf = vi->rvq->vq_ops->get_buf(vi->rvq, &len)) != NULL) {
> + receive_buf(vi->dev, buf, len);
> vi->num--;
> received++;
> }
> @@ -1205,6 +1066,7 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
> {
> struct virtnet_info *vi = vdev->priv;
> struct sk_buff *skb;
> + int freed;
>
> /* Stop all the virtqueues. */
> vdev->config->reset(vdev);
> @@ -1216,11 +1078,14 @@ static void __devexit virtnet_remove(struct virtio_device *vdev)
> }
> __skb_queue_purge(&vi->send);
>
> - BUG_ON(vi->num != 0);
> -
> unregister_netdev(vi->dev);
> cancel_delayed_work_sync(&vi->refill);
>
> + freed = vi->rvq->vq_ops->destroy_bufs(vi->rvq, virtio_net_free_pages);

This looks like double free to me: should not you remove code that does
__skb_dequeue on recv above?

> + vi->num -= freed;
> +
> + BUG_ON(vi->num != 0);
> +
> vdev->config->del_vqs(vi->vdev);
>
> while (vi->pages)
>

2009-12-13 11:27:44

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

On Fri, Dec 11, 2009 at 04:43:02AM -0800, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <[email protected]>
> --------------
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index bb5eb7b..100b4b9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -80,29 +80,25 @@ static inline struct skb_vnet_hdr *skb_vnet_hdr(struct sk_buff *skb)
> return (struct skb_vnet_hdr *)skb->cb;
> }
>
> -static void give_a_page(struct virtnet_info *vi, struct page *page)
> +static void give_pages(struct virtnet_info *vi, struct page *page)
> {
> - page->private = (unsigned long)vi->pages;
> - vi->pages = page;
> -}
> + struct page *end;
>
> -static void trim_pages(struct virtnet_info *vi, struct sk_buff *skb)
> -{
> - unsigned int i;
> -
> - for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
> - give_a_page(vi, skb_shinfo(skb)->frags[i].page);
> - skb_shinfo(skb)->nr_frags = 0;
> - skb->data_len = 0;
> + /* Find end of list, sew whole thing into vi->pages. */
> + for (end = page; end->private; end = (struct page *)end->private);

Hmm, this scans the whole list each time.
OTOH, the caller probably can easily get list tail as well as head.
If we ask caller to give us list tail, and chain them at head, then
1. we won't have to scan the list each time
2. we get better memory locality reusing same pages over and over again


> + end->private = (unsigned long)vi->pages;
> + vi->pages = page;
> }
>
> static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> {
> struct page *p = vi->pages;
>
> - if (p)
> + if (p) {
> vi->pages = (struct page *)p->private;
> - else
> + /* use private to chain pages for big packets */
> + p->private = 0;

So this comment does not explain why this = 0 is here.
clearly = 0 does not chain anything.
Please add a bigger comment.

I think you also want to extend the comment at top of
file, where datastructure is, that explains the deferred
alogorigthm and how pages are chained.

> + } else
> p = alloc_page(gfp_mask);
> return p;
> }
> @@ -128,6 +124,85 @@ static void skb_xmit_done(struct virtqueue *svq)
> netif_wake_queue(vi->dev);
> }
>
> +static int skb_set_frag(struct sk_buff *skb, struct page *page,
> + int offset, int len)

Maybe it's better not to prefix functions with skb_, one
tries to look them up in skbuff.h immediately.

> +{
> + int i = skb_shinfo(skb)->nr_frags;
> + skb_frag_t *f;
> +
> + i = skb_shinfo(skb)->nr_frags;
> + f = &skb_shinfo(skb)->frags[i];

i seems unused below, so opencode it.

> + f->page = page;
> + f->page_offset = offset;
> +
> + if (len > PAGE_SIZE - f->page_offset)
> + f->size = PAGE_SIZE - f->page_offset;

Will be a bit clearer with s/f->page_offset/offset/.

> + else
> + f->size = len;

Use min for clarity instead of opencoded if.
This will make it obvious that len won't ever become
negative below.

> +
> + skb_shinfo(skb)->nr_frags++;
> + skb->data_len += f->size;
> + skb->len += f->size;


> +
> + len -= f->size;
> + return len;
> +}
> +
> +static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,

I know you got this name from GOOD_COPY_LEN, but it's not
very good for a function :) and skb_ prefix is also confusing.
Just copy_small_skb or something like that?

> + unsigned int *len)

Comments about splitting patches apply here as well.
No way to understand what this should do and whether it
does it correctly just by looking at patch.

> +{
> + struct sk_buff *skb;
> + struct skb_vnet_hdr *hdr;
> + int copy, hdr_len, offset;
> + char *p;
> +
> + p = page_address(*page);
> +
> + skb = netdev_alloc_skb(vi->dev, GOOD_COPY_LEN + NET_IP_ALIGN);
> + if (unlikely(!skb))
> + return NULL;
> +
> + skb_reserve(skb, NET_IP_ALIGN);
> + hdr = skb_vnet_hdr(skb);
> +
> + if (vi->mergeable_rx_bufs) {
> + hdr_len = sizeof(hdr->mhdr);

sizeof(hdr->mhdr) -> sizeof hdr->mhdr

> + offset = hdr_len;
> + } else {
> + /* share one page between virtio_net header and data */
> + struct padded_vnet_hdr {
> + struct virtio_net_hdr hdr;
> + /* This padding makes our data 16 byte aligned */

I think reader will still wonder about is "why does it
need to be 16 byte aligned?". And this is what
comment should explain I think.

> + char padding[6];
> + };
> + hdr_len = sizeof(hdr->hdr);
> + offset = sizeof(struct padded_vnet_hdr);
> + }
> +
> + memcpy(hdr, p, hdr_len);
> +
> + *len -= hdr_len;
> + p += offset;
> +
> + copy = *len;
> + if (copy > skb_tailroom(skb))
> + copy = skb_tailroom(skb);
> + memcpy(skb_put(skb, copy), p, copy);
> +
> + *len -= copy;
> + offset += copy;
> +
> + if (*len) {
> + *len = skb_set_frag(skb, *page, offset, *len);

So you are overriding *len here? why bother calculating it
then?

Also - this does *not* always copy all of data, and *page
tells us whether it did a copy or not? This is pretty confusing,
as APIs go. Also, if we have scatter/gather anyway,
why do we bother copying the head?
Also, before skb_set_frag skb is linear, right?
So in fact you do not need generic skb_set_frag,
you can just put stuff in the first fragment.
For example, pass the fragment number to skb_set_frag,
compiler will be able to better optimize.

> + *page = (struct page *)(*page)->private;
> + } else {
> + give_pages(vi, *page);
> + *page = NULL;
> + }
> +
> + return skb;
> +}
> +
> static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> unsigned len)
> {
> @@ -162,7 +237,7 @@ static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> len -= copy;
>
> if (!len) {
> - give_a_page(vi, skb_shinfo(skb)->frags[0].page);
> + give_pages(vi, skb_shinfo(skb)->frags[0].page);
> skb_shinfo(skb)->nr_frags--;
> } else {
> skb_shinfo(skb)->frags[0].page_offset +=
>

2009-12-13 11:46:20

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

On Fri, Dec 11, 2009 at 04:46:53AM -0800, Shirley Ma wrote:
> Signed-off-by: Shirley Ma <[email protected]>
> -------------
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 100b4b9..dde8060 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -203,6 +203,73 @@ static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
> return skb;
> }
>
> +static struct sk_buff *receive_big(struct virtnet_info *vi, struct page *page,
> + unsigned int len)
> +{
> + struct sk_buff *skb;
> +
> + skb = skb_goodcopy(vi, &page, &len);
> + if (unlikely(!skb))
> + return NULL;
> +
> + while (len > 0) {
> + len = skb_set_frag(skb, page, 0, len);
> + page = (struct page *)page->private;

Interesting. I think skb_goodcopy will sometimes
set *page to NULL. Will the above crash then?

> + }
> +
> + if (page)
> + give_pages(vi, page);
> +
> + return skb;
> +}
> +
> +static struct sk_buff *receive_mergeable(struct virtnet_info *vi,
> + struct page *page, unsigned int len)
> +{
> + struct sk_buff *skb;
> + struct skb_vnet_hdr *hdr;
> + int num_buf, i;
> +
> + if (len > PAGE_SIZE)
> + len = PAGE_SIZE;
> +
> + skb = skb_goodcopy(vi, &page, &len);
> +

don't put empty line here. if below is part of same logical block as
skb_goodcopy.

> + if (unlikely(!skb))
> + return NULL;

don't we care that *page might not be NULL? why not?

> +
> + hdr = skb_vnet_hdr(skb);
> + num_buf = hdr->mhdr.num_buffers;
> + while (--num_buf) {
> + struct page *page;

Local variable shadows a parameter.
It seems gcc will let you get away with a warning,
but this is not legal C.

> +
> + i = skb_shinfo(skb)->nr_frags;
> + if (i >= MAX_SKB_FRAGS) {
> + pr_debug("%s: packet too long %d\n", skb->dev->name,
> + len);


If this happens, we have corrupted memory already.
We do need this check, but please put is before you increment
nr_frags.

> + skb->dev->stats.rx_length_errors++;
> + return skb;

This will propagate the error up the stack and corrupt
more memory.

> + }
> +
> + page = vi->rvq->vq_ops->get_buf(vi->rvq, &len);
> + if (!page) {
> + pr_debug("%s: rx error: %d buffers missing\n",
> + skb->dev->name, hdr->mhdr.num_buffers);
> + skb->dev->stats.rx_length_errors++;
> + return skb;

Here, skb is some random part of packet, don't propagate
it up the stack.

> + }
> +
> + if (len > PAGE_SIZE)
> + len = PAGE_SIZE;
> +
> + skb_set_frag(skb, page, 0, len);
> +
> + vi->num--;
> + }
> +
> + return skb;
> +}
> +
> static void receive_skb(struct net_device *dev, struct sk_buff *skb,
> unsigned len)
> {
> @@ -356,6 +423,103 @@ drop:
> dev_kfree_skb(skb);
> }
>
> +static int add_recvbuf_small(struct virtnet_info *vi, gfp_t gfp, bool *oom)
> +{
> + struct sk_buff *skb;
> + struct skb_vnet_hdr *hdr;
> + struct scatterlist sg[2];
> + int err = 0;
> +
> + skb = netdev_alloc_skb(vi->dev, MAX_PACKET_LEN + NET_IP_ALIGN);
> + if (unlikely(!skb)) {
> + *oom = true;
> + return err;
> + }
> +
> + skb_reserve(skb, NET_IP_ALIGN);
> + skb_put(skb, MAX_PACKET_LEN);
> +
> + hdr = skb_vnet_hdr(skb);
> + sg_set_buf(sg, &hdr->hdr, sizeof(hdr->hdr));

sizeof hdr->hdr

> +
> + skb_to_sgvec(skb, sg+1, 0, skb->len);

space around +

> +
> + err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
> + if (err < 0)
> + kfree_skb(skb);
> + else
> + skb_queue_head(&vi->recv, skb);

So why are we queueing this still?

> +
> + return err;
> +}
> +
> +static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp, bool *oom)
> +{
> + struct scatterlist sg[2 + MAX_SKB_FRAGS];

MAX_SKB_FRAGS + 2 will be more readable.
Also, create a macro for this constant and document
why does +2 make sense?

> + int total = MAX_SKB_FRAGS + 2;
> + char *p;
> + int err = 0;
> + int i, offset;
> + struct page *first = NULL;
> + struct page *page;
> + /* share one page between virtio_net header and data */
> + struct padded_vnet_hdr {
> + struct virtio_net_hdr hdr;
> + /* This padding makes our data 16 byte aligned */
> + char padding[6];

Again, pls explain *why* do we want 16 byte alignment.
Also this code seems duplicated?
Please put structs at top of file where they
can be found.

> + };
> +
> + offset = sizeof(struct padded_vnet_hdr);
> +
> + for (i = total - 1; i > 0; i--) {

I prefer --i.
Also, total is just a constant.
So simply MAX_SKB_FRAGS + 1 will be clearer.
Why do we scan last to first?
If there's reason, please add a comment.

> + page = get_a_page(vi, gfp);
> + if (!page) {
> + if (first)
> + give_pages(vi, first);
> + *oom = true;
> + break;
> + }
> +
> + p = page_address(page);
> + page->private = (unsigned long)first;
> + first = page;
> +
> + /* allocate MAX_SKB_FRAGS + 1 pages for big packets */
> + if (i == 1) {
> + sg_set_buf(&sg[i-1], p, sizeof(struct virtio_net_hdr));

space around - .
All the if (i == 1) handling on exit is really hard to grok.
How about moving common code out of this loop
into a function, and then you can
for (i = total - 1; i > 1; i--) {
handle(i);
}
handle(1);
handle(0);
add_buf




> + sg_set_buf(&sg[i], p + offset, PAGE_SIZE - offset);
> + err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, total,
> + first);
> + if (err < 0)
> + give_pages(vi, first);
> + } else
> + sg_set_buf(&sg[i], p, PAGE_SIZE);
> + }
> +
> + return err;
> +}
> +
> +static int add_recvbuf_mergeable(struct virtnet_info *vi, gfp_t gfp, bool *oom)


do we really need *oom here and below?
We can just set err to ENOMEM, no?

> +{
> + struct page *page;
> + struct scatterlist sg;
> + int err = 0;
> +
> + page = get_a_page(vi, gfp);
> + if (!page) {
> + *oom = true;
> + return err;

Please do not return 0 on failure.

> + }
> +
> + sg_init_one(&sg, page_address(page), PAGE_SIZE);
> +
> + err = vi->rvq->vq_ops->add_buf(vi->rvq, &sg, 0, 1, page);
> + if (err < 0)
> + give_pages(vi, page);
> +
> + return err;
> +}
> +
> static bool try_fill_recv_maxbufs(struct virtnet_info *vi, gfp_t gfp)
> {
> struct sk_buff *skb;
>

2009-12-14 03:26:13

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Fri, 11 Dec 2009 11:03:25 pm Shirley Ma wrote:
> Signed-off-by: <[email protected]>

Hi Shirley,

These patches look quite close. More review to follow :)

This title needs revision. It should start with virtio: (all the virtio
patches do, for easy identification after merge), eg:

Subject: virtio: Add destroy callback for unused buffers

It also needs a commit message which explains the problem and the solution.
Something like this:

There's currently no way for a virtio driver to ask for unused
buffers, so it has to keep a list itself to reclaim them at shutdown.
This is redundant, since virtio_ring stores that information. So
add a new hook to do this: virtio_net will be the first user.

> -------------
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c708ecc..bb5eb7b 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -107,6 +107,16 @@ static struct page *get_a_page(struct virtnet_info *vi, gfp_t gfp_mask)
> return p;
> }
>
> +static void virtio_net_free_pages(void *buf)
> +{
> + struct page *page, *next;
> +
> + for (page = buf; page; page = next) {
> + next = (struct page *)page->private;
> + __free_pages(page, 0);
> + }
> +}
> +
> static void skb_xmit_done(struct virtqueue *svq)
> {
> struct virtnet_info *vi = svq->vdev->priv;

This belongs in one of the future patches: it will cause an unused warning
and is logically not part of this patch anyway.

> +static int vring_destroy_bufs(struct virtqueue *_vq, void (*destroy)(void *))
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + void *buf;
> + unsigned int i;
> + int freed = 0;

unsigned int for return and freed counter? Means changing prototype, but
negative return isn't useful here.

> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring.num; i++) {
> + if (vq->data[i]) {
> + /* detach_buf clears data, so grab it now. */
> + buf = vq->data[i];
> + detach_buf(vq, i);
> + destroy(buf);
> + freed++;

You could simplify this a bit, since the queue must be inactive:

destroy(vq->data[i]);
detach_buf(vq, i);
freed++;

> + }
> + }
> +
> + END_USE(vq);
> + return freed;

Perhaps add a:
/* That should have freed everything. */
BUG_ON(vq->num_free != vq->vring.num)

> void (*disable_cb)(struct virtqueue *vq);
> bool (*enable_cb)(struct virtqueue *vq);
> + int (*destroy_bufs)(struct virtqueue *vq, void (*destory)(void *));

destory typo :)

Cheers,
Rusty.

2009-12-14 06:54:57

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

On Fri, 11 Dec 2009 11:13:02 pm Shirley Ma wrote:
> Signed-off-by: Shirley Ma <[email protected]>

I don't think there's a good way of splitting this change across multiple
patches. And I don't think this patch will compile; I don't think we can
get rid of trim_pages yet.

We *could* first split the receive paths into multiple parts as you have
(eg. add_recvbuf_big etc), then actually rewrite them, but that's too much
work to refactor the code twice.

So just roll all the driver changes into one patch; so you will have two
patches: one which creates the destroy_bufs API for virtio, and one which
uses it in the virtio_net driver.

> +static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct page **page,
> + unsigned int *len)

This actually allocates an skb, and transfers the first page to it (via copy
and possibly the first fragment). skb_from_page() perhaps?

> + hdr_len = sizeof(hdr->hdr);
> + offset = sizeof(struct padded_vnet_hdr);
> + }
> +
> + memcpy(hdr, p, hdr_len);
> +
> + *len -= hdr_len;

Perhaps you should return NULL here as an error if *len was < hdr_len?

Thanks,
Rusty.

2009-12-14 20:00:10

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 0/4] Defer skb allocation for both mergeable buffers and big packets in virtio_net

On Sun, 2009-12-13 at 12:19 +0200, Michael S. Tsirkin wrote:
>
> Shirley, some advice on packaging patches
> that I hope will be helpful:
>
> You did try to split up the patch logically,
> and it's better than a single huge patch, but it
> can be better. For example, you add static functions
> in one patch and use them in another patch,
> this works well for new APIs which are documented
> so you can understand from the documentation
> what function should do, but not for internal, static functions:
> one ends up looking at usage, going back to implementation,
> back to usage, each time switching between patches.
>
> One idea on how to split up the patch set better:
> - add new "destroy" API and supply documentation
> - switch current implementation over to use destroy API
> - split current implementation into subfunctions
> handling mergeable/big cases
> - convert functions to use deferred allocation
> [would be nice to convert mergeable/big separately,
> but I am not sure this is possible while keeping
> patchset bisectable]
>
> Some suggestions on formatting:
> - keep patch names short, and prefix with module name,
> not patchset name, so that git summaries look nicer. E.g.
> Defer skb allocation -- add destroy buffers function for virtio
> Would be better "virtio: add destroy buffers function".
> - please supply commit message with some explanation
> and motivation that will help someone looking
> at git history, without background from mailing list.
> E.g.
> "Add "destroy" vq API that returns all posted
> buffers back to caller. This makes it possible
> to avoid tracking buffers in callers to free
> them on vq teardown. Will be used by deferred
> skb allocation patch.".
> - Use "---" to separate description from text,
> and generally make patch acceptable to git am.
> It is a good idea to use git to generate patches,
> for example with git format-patch.
> I usually use it with --numbered --thread --cover-letter.
>
> > Guest virtio_net receives packets from its pre-allocated vring
> > buffers, then it delivers these packets to upper layer protocols
> > as skb buffs. So it's not necessary to pre-allocate skb for each
> > mergable buffer, then frees it when it's useless.
> > This patch has deferred skb allocation when receiving packets for
> > both big packets and mergeable buffers. It reduces skb
> pre-allocations
> > and skb_frees.
>
> E.g. the above should go into commit message for the virtio net
> part of patchset.

Nice comments, will include them.


> I think you need to base your patch on Dave's net-next,
> it's too late to put it in 2.6.32, or even 2.6.33.
> It also should probably go in through Dave's tree because virtio part
> of
> patch is very small, while most of it deals with net/virtio_net.

> > Tests have been done for small packets, big packets
> > and mergeable buffers.
> >
> > The single netperf TCP_STREAM performance improved for host to
> guest.
> > It also reduces UDP packets drop rate.
>
>
> BTW, any numbers? Also, 2.6.32 has regressed as compared to 2.6.31.
> Did you try with Sridhar Samudrala's patch from net-next applied
> that reportedly fixes this?

Ok, I will run Dave's net-next tree.

Thanks
Shirley

2009-12-14 20:08:13

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

Hello Michael,

I agree with the comments (will have two patches instead of 4 based on
Rusty's comments) except below one.

On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote:
> That said - do we have to use a callback?
> I think destroy_buf which returns data pointer,
> and which we call repeatedly until we get NULL
> or error, would be an a better, more flexible API.
> This is not critical though.

The reason to use this is because in virtio_net remove, it has
BUG_ON(vi->num != 0), which will be consistent with small skb packet. If
we use NULL, error then we lose the track for vi->num, since we don't
know how many buffers have been passed to ULPs or still unused.

Thanks
Shirley

2009-12-14 20:25:43

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote:
> Hello Michael,
>
> I agree with the comments (will have two patches instead of 4 based on
> Rusty's comments) except below one.
>
> On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote:
> > That said - do we have to use a callback?
> > I think destroy_buf which returns data pointer,
> > and which we call repeatedly until we get NULL
> > or error, would be an a better, more flexible API.
> > This is not critical though.
>
> The reason to use this is because in virtio_net remove, it has
> BUG_ON(vi->num != 0), which will be consistent with small skb packet. If
> we use NULL, error then we lose the track for vi->num, since we don't
> know how many buffers have been passed to ULPs or still unused.
>
> Thanks
> Shirley

I dont insist, but my idea was

for (;;) {
b = vq->destroy(vq);
if (!b)
break;
--vi->num;
put_page(b);
}

so we do not have to lose track of the counter.

--
MST

2009-12-14 21:24:15

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

On Sun, 2009-12-13 at 13:24 +0200, Michael S. Tsirkin wrote:
> Hmm, this scans the whole list each time.
> OTOH, the caller probably can easily get list tail as well as head.
> If we ask caller to give us list tail, and chain them at head, then
> 1. we won't have to scan the list each time
> 2. we get better memory locality reusing same pages over and over
> again

I could use page private to point to a list_head which will have a head
and a tail, but it will induce more alloc, and free, when this page is
passed to ULPs as a part of skb frags, it would induce more overhead.

> So this comment does not explain why this = 0 is here.
> clearly = 0 does not chain anything.
> Please add a bigger comment.
> I think you also want to extend the comment at top of
> file, where datastructure is, that explains the deferred
> alogorigthm and how pages are chained.
Ok, will do.

> Use min for clarity instead of opencoded if.
> This will make it obvious that len won't ever become
> negative below.
Ok.

> > +static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct
> page **page,
>
> I know you got this name from GOOD_COPY_LEN, but it's not
> very good for a function :) and skb_ prefix is also confusing.
> Just copy_small_skb or something like that?
>
> > + unsigned int *len)
Ok.

> Comments about splitting patches apply here as well.
> No way to understand what this should do and whether it
> does it correctly just by looking at patch.
> I think reader will still wonder about is "why does it
> need to be 16 byte aligned?". And this is what
> comment should explain I think.

Ok, will put more comments.

> So you are overriding *len here? why bother calculating it
> then?
I can remove the overriding part.

> Also - this does *not* always copy all of data, and *page
> tells us whether it did a copy or not? This is pretty confusing,
> as APIs go. Also, if we have scatter/gather anyway,
> why do we bother copying the head?

If receiving buffer in mergeable buf and big packets, the packet is
small, then there is no scatter/gather, we can release the page for new
receiving, that was the reason to copy skb head. *page will be only used
by big packets path to indicate whether/where to start next skb frag if
any.

> Also, before skb_set_frag skb is linear, right?
> So in fact you do not need generic skb_set_frag,
> you can just put stuff in the first fragment.
> For example, pass the fragment number to skb_set_frag,
> compiler will be able to better optimize.

You meant to reuse skb_put_frags() in ipoib_cm.c?

Thanks
Shirley

2009-12-14 22:09:18

by Shirley Ma

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

On Sun, 2009-12-13 at 13:43 +0200, Michael S. Tsirkin wrote:
> Interesting. I think skb_goodcopy will sometimes
> set *page to NULL. Will the above crash then?

Nope, when *page is NULL, *len is 0.

> don't put empty line here. if below is part of same logical block as
> skb_goodcopy.
Ok.

> Local variable shadows a parameter.
> It seems gcc will let you get away with a warning,
> but this is not legal C.
Ok.

> > +
> > + i = skb_shinfo(skb)->nr_frags;
> > + if (i >= MAX_SKB_FRAGS) {
> > + pr_debug("%s: packet too long %d\n",
> skb->dev->name,
> > + len);
>
> If this happens, we have corrupted memory already.
> We do need this check, but please put is before you increment
> nr_frags.

It is before increase for mergeable buffer case. Only one page(one frag)
per get_buf.

> > + skb->dev->stats.rx_length_errors++;
> > + return skb;
>
> This will propagate the error up the stack and corrupt
> more memory.

I just copied the code from original code. There might not be a problem
for mergeable buffer. I will double check.

> sizeof hdr->hdr
Ok.

> > +
> > + skb_to_sgvec(skb, sg+1, 0, skb->len);
>
> space around +
Ok.

> > +
> > + err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
> > + if (err < 0)
> > + kfree_skb(skb);
> > + else
> > + skb_queue_head(&vi->recv, skb);
>
> So why are we queueing this still?
This is for small packet. I didn't change that code since it will
involve extra copy by using page.

> > +
> > + return err;
> > +}
> > +
> > +static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp, bool
> *oom)
> > +{
> > + struct scatterlist sg[2 + MAX_SKB_FRAGS];
>
> MAX_SKB_FRAGS + 2 will be more readable.
> Also, create a macro for this constant and document
> why does +2 make sense?

One is for big packet virtio_net_hdr, one is for goodcopy skb.

> Again, pls explain *why* do we want 16 byte alignment.
> Also this code seems duplicated?
> Please put structs at top of file where they
> can be found.
Ok.

> > + };
> > +
> > + offset = sizeof(struct padded_vnet_hdr);
> > +
> > + for (i = total - 1; i > 0; i--) {
>
> I prefer --i.
Ok.

> Also, total is just a constant.
> So simply MAX_SKB_FRAGS + 1 will be clearer.
Ok.

> Why do we scan last to first?
> If there's reason, please add a comment.
We use page private to maintain next page, here there is no scan last to
first, just add the new page in list head instead of list tail, which
will require scan the list.

> space around - .
Ok.

> All the if (i == 1) handling on exit is really hard to grok.
> How about moving common code out of this loop
> into a function, and then you can
> for (i = total - 1; i > 1; i--) {
> handle(i);
> }
> handle(1);
> handle(0);
> add_buf
That works.

> do we really need *oom here and below?
> We can just set err to ENOMEM, no?
We could.

> Please do not return 0 on failure.

Ok.

2009-12-14 22:10:00

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

Thanks Rusty.

I agree with all these comments, will work on them.

Shirley

2009-12-14 22:10:49

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

Thanks Rusty, agree with you, working on them.

Shirley

2009-12-14 23:22:28

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

Hello Michael,

On Mon, 2009-12-14 at 22:22 +0200, Michael S. Tsirkin wrote:
> I dont insist, but my idea was
>
> for (;;) {
> b = vq->destroy(vq);
> if (!b)
> break;
> --vi->num;
> put_page(b);
> }
>
> so we do not have to lose track of the counter.

That's not a bad idea, but to do this, then this fuction will be
specific for virtio_net device (vi is virtnet_info) only in a generic
virtqueue API.

Thanks
Shirley

2009-12-15 00:37:30

by Shirley Ma

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

Hello Michael,

On Mon, 2009-12-14 at 14:08 -0800, Shirley Ma wrote:
> > > +
> > > + err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
> > > + if (err < 0)
> > > + kfree_skb(skb);
> > > + else
> > > + skb_queue_head(&vi->recv, skb);
> >
> > So why are we queueing this still?
> This is for small packet. I didn't change that code since it will
> involve extra copy by using page.

I think I can remove skb link for small packet as well by adding
kfree_skb() in virtio_net_free_bufs for small packet.

Thanks
Shirley

2009-12-15 08:43:13

by Shirley Ma

[permalink] [raw]
Subject: Re: [PATCH v2 4/4] Defer skb allocation -- change allocation & receiving in recv path

On Sun, 2009-12-13 at 13:08 +0200, Michael S. Tsirkin wrote:
> Do not cast away void*.
> This initialization above looks very strange: in
> fact only one of skb, page makes sense.
> So I think you should either get rid of both page and
> skb variables (routines such as give_pages get page *
> so they will accept void* just as happily) or
> move initialization or even use of these variables to
> where they are used.
Ok.


> So above you override skb that you initialized
> at the start of function. It would be better
> to do in the 3'd case:
> skb = buf;
> as well. And then it would be clear why
> " Only for mergeable and big" comment below
> is true.

Ok.

> > + }
> >
> > - err = pskb_trim(skb, len);
> > - if (err) {
> > - pr_debug("%s: pskb_trim failed %i %d\n",
> dev->name,
> > - len, err);
> > - dev->stats.rx_dropped++;
> > - goto drop;
> > - }
> > + if (unlikely(!skb)) {
> > + dev->stats.rx_dropped++;
> > + /* only for mergeable buf and big packets */
> > + give_pages(vi, page);
> > + return;
>
> Did you remove drop: label? If no, is it unused now?

The label is used when checking buf len, but I will remove drop in the
update patch.

> > + void *buf = NULL;
>
> Does this have to be initialized? If not (as it seems) better not do
> it.
I remembered there was a compile warning, I will double check.

> > + freed = vi->rvq->vq_ops->destroy_bufs(vi->rvq,
> virtio_net_free_pages);
>
> This looks like double free to me: should not you remove code that
> does
> __skb_dequeue on recv above?

Nope, this is not a double free, the queue recv skb is still there for
small packet size. But I will remove it in the update patch.

Thanks
Shirley

2009-12-15 11:00:37

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Mon, Dec 14, 2009 at 03:22:22PM -0800, Shirley Ma wrote:
> Hello Michael,
>
> On Mon, 2009-12-14 at 22:22 +0200, Michael S. Tsirkin wrote:
> > I dont insist, but my idea was
> >
> > for (;;) {
> > b = vq->destroy(vq);
> > if (!b)
> > break;
> > --vi->num;
> > put_page(b);
> > }
> >
> > so we do not have to lose track of the counter.
>
> That's not a bad idea, but to do this, then this fuction will be
> specific for virtio_net device (vi is virtnet_info) only in a generic
> virtqueue API.
>
> Thanks
> Shirley

No, this code would be in virtio net.
destroy would simply be the virtqueue API that returns
data pointer.

--
MST

2009-12-15 11:24:14

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] Defer skb allocation -- new skb_set calls & chain pages in virtio_net

On Mon, Dec 14, 2009 at 01:23:45PM -0800, Shirley Ma wrote:
> On Sun, 2009-12-13 at 13:24 +0200, Michael S. Tsirkin wrote:
> > Hmm, this scans the whole list each time.
> > OTOH, the caller probably can easily get list tail as well as head.
> > If we ask caller to give us list tail, and chain them at head, then
> > 1. we won't have to scan the list each time
> > 2. we get better memory locality reusing same pages over and over
> > again
>
> I could use page private to point to a list_head which will have a head
> and a tail, but it will induce more alloc, and free, when this page is
> passed to ULPs as a part of skb frags, it would induce more overhead.

Yes, we don't want that. But I think caller already has
list tail available because he built up the list,
so it should be possible to chain our pages
at head: head -> new pages -> old pages.

Is this call a rare one? Maybe we do not need
to optimize this list scan, but if so let's
add a comment explaining why it does not matter.

If we are going to change data structures,
I think we should replace the linked list
simply with an array acting as a circular buffer.
But I am not asking you to implement it as
part of this patchset.

> > So this comment does not explain why this = 0 is here.
> > clearly = 0 does not chain anything.
> > Please add a bigger comment.
> > I think you also want to extend the comment at top of
> > file, where datastructure is, that explains the deferred
> > alogorigthm and how pages are chained.
> Ok, will do.
>
> > Use min for clarity instead of opencoded if.
> > This will make it obvious that len won't ever become
> > negative below.
> Ok.
>
> > > +static struct sk_buff *skb_goodcopy(struct virtnet_info *vi, struct
> > page **page,
> >
> > I know you got this name from GOOD_COPY_LEN, but it's not
> > very good for a function :) and skb_ prefix is also confusing.
> > Just copy_small_skb or something like that?
> >
> > > + unsigned int *len)
> Ok.
>
> > Comments about splitting patches apply here as well.
> > No way to understand what this should do and whether it
> > does it correctly just by looking at patch.
> > I think reader will still wonder about is "why does it
> > need to be 16 byte aligned?". And this is what
> > comment should explain I think.
>
> Ok, will put more comments.
>
> > So you are overriding *len here? why bother calculating it
> > then?
> I can remove the overriding part.
>
> > Also - this does *not* always copy all of data, and *page
> > tells us whether it did a copy or not? This is pretty confusing,
> > as APIs go. Also, if we have scatter/gather anyway,
> > why do we bother copying the head?
>
> If receiving buffer in mergeable buf and big packets, the packet is
> small, then there is no scatter/gather, we can release the page for new
> receiving, that was the reason to copy skb head. *page will be only used
> by big packets path to indicate whether/where to start next skb frag if
> any.

I guess the main complaint is that if a function
has copy in the name, one expects it to copy data.
Maybe split it up to functions that copy
different packet types?

> > Also, before skb_set_frag skb is linear, right?
> > So in fact you do not need generic skb_set_frag,
> > you can just put stuff in the first fragment.
> > For example, pass the fragment number to skb_set_frag,
> > compiler will be able to better optimize.
>
> You meant to reuse skb_put_frags() in ipoib_cm.c?
>
> Thanks
> Shirley

I just mean we can pass fragment number to skb_set_frag.
In your code nr_fragments is always 0, right?

--
MST

2009-12-15 11:36:17

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

On Mon, Dec 14, 2009 at 02:08:38PM -0800, Shirley Ma wrote:
> On Sun, 2009-12-13 at 13:43 +0200, Michael S. Tsirkin wrote:
> > Interesting. I think skb_goodcopy will sometimes
> > set *page to NULL. Will the above crash then?
>
> Nope, when *page is NULL, *len is 0.

Hmm. Yes, I see, it is here:
+ if (*len) {
+ *len = skb_set_frag(skb, *page, offset, *len);
+ *page = (struct page *)(*page)->private;
+ } else {
+ give_pages(vi, *page);
+ *page = NULL;
+ }

So what I would suggest is, have function
that just copies part of skb, and have
caller open-code allocating the skb and free up
pages as necessary.

> > don't put empty line here. if below is part of same logical block as
> > skb_goodcopy.
> Ok.
>
> > Local variable shadows a parameter.
> > It seems gcc will let you get away with a warning,
> > but this is not legal C.
> Ok.
>
> > > +
> > > + i = skb_shinfo(skb)->nr_frags;
> > > + if (i >= MAX_SKB_FRAGS) {
> > > + pr_debug("%s: packet too long %d\n",
> > skb->dev->name,
> > > + len);
> >
> > If this happens, we have corrupted memory already.
> > We do need this check, but please put is before you increment
> > nr_frags.
>
> It is before increase for mergeable buffer case. Only one page(one frag)
> per get_buf.
>
> > > + skb->dev->stats.rx_length_errors++;
> > > + return skb;
> >
> > This will propagate the error up the stack and corrupt
> > more memory.
>
> I just copied the code from original code. There might not be a problem
> for mergeable buffer. I will double check.
>
> > sizeof hdr->hdr
> Ok.
>
> > > +
> > > + skb_to_sgvec(skb, sg+1, 0, skb->len);
> >
> > space around +
> Ok.
>
> > > +
> > > + err = vi->rvq->vq_ops->add_buf(vi->rvq, sg, 0, 2, skb);
> > > + if (err < 0)
> > > + kfree_skb(skb);
> > > + else
> > > + skb_queue_head(&vi->recv, skb);
> >
> > So why are we queueing this still?
> This is for small packet. I didn't change that code since it will
> involve extra copy by using page.

What I am asking is why do we add skb in vi->recv.
Can't we use vq destoy hack here as well?

> > > +
> > > + return err;
> > > +}
> > > +
> > > +static int add_recvbuf_big(struct virtnet_info *vi, gfp_t gfp, bool
> > *oom)
> > > +{
> > > + struct scatterlist sg[2 + MAX_SKB_FRAGS];
> >
> > MAX_SKB_FRAGS + 2 will be more readable.
> > Also, create a macro for this constant and document
> > why does +2 make sense?
>
> One is for big packet virtio_net_hdr, one is for goodcopy skb.


Maybe put this in a comment then.

> > Again, pls explain *why* do we want 16 byte alignment.
> > Also this code seems duplicated?
> > Please put structs at top of file where they
> > can be found.
> Ok.
>
> > > + };
> > > +
> > > + offset = sizeof(struct padded_vnet_hdr);
> > > +
> > > + for (i = total - 1; i > 0; i--) {
> >
> > I prefer --i.
> Ok.
>
> > Also, total is just a constant.
> > So simply MAX_SKB_FRAGS + 1 will be clearer.
> Ok.
>
> > Why do we scan last to first?
> > If there's reason, please add a comment.
> We use page private to maintain next page, here there is no scan last to
> first, just add the new page in list head instead of list tail, which
> will require scan the list.

I mean the for loop: can it be for(i = 0, ..., ++i) just as well?
Why do you start at the end of buffer and decrement?

> > space around - .
> Ok.
>
> > All the if (i == 1) handling on exit is really hard to grok.
> > How about moving common code out of this loop
> > into a function, and then you can
> > for (i = total - 1; i > 1; i--) {
> > handle(i);
> > }
> > handle(1);
> > handle(0);
> > add_buf
> That works.
>
> > do we really need *oom here and below?
> > We can just set err to ENOMEM, no?
> We could.
>
> > Please do not return 0 on failure.
>
> Ok.

2009-12-15 16:25:30

by Shirley Ma

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

On Tue, 2009-12-15 at 13:33 +0200, Michael S. Tsirkin wrote:
> So what I would suggest is, have function
> that just copies part of skb, and have
> caller open-code allocating the skb and free up
> pages as necessary.
Yes, the updated patch has changed the function.

> What I am asking is why do we add skb in vi->recv.
> Can't we use vq destoy hack here as well?
Yes, I removed recv queue skb link totally in the updated patch.

> > One is for big packet virtio_net_hdr, one is for goodcopy skb.
>
>
> Maybe put this in a comment then.
Ok, will do.

>
> I mean the for loop: can i be for(i = 0, ..., ++i) just as well?
> Why do you start at the end of buffer and decrement?

Are asking why reverse order for new page to sg? The reason is we link
the new page in first, and only maintain the first pointer. So the most
recent new page should be related to sg[0], if we put the new page in
the last, then we need to travel the page list to get last pointer. Am I
missing your point here?

Thanks
Shirley

2009-12-15 16:42:01

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: PATCH v2 3/4] Defer skb allocation -- new recvbuf alloc & receive calls

On Tue, Dec 15, 2009 at 08:25:20AM -0800, Shirley Ma wrote:
> On Tue, 2009-12-15 at 13:33 +0200, Michael S. Tsirkin wrote:
> > So what I would suggest is, have function
> > that just copies part of skb, and have
> > caller open-code allocating the skb and free up
> > pages as necessary.
> Yes, the updated patch has changed the function.
>
> > What I am asking is why do we add skb in vi->recv.
> > Can't we use vq destoy hack here as well?
> Yes, I removed recv queue skb link totally in the updated patch.
>
> > > One is for big packet virtio_net_hdr, one is for goodcopy skb.
> >
> >
> > Maybe put this in a comment then.
> Ok, will do.
>
> >
> > I mean the for loop: can i be for(i = 0, ..., ++i) just as well?
> > Why do you start at the end of buffer and decrement?
>
> Are asking why reverse order for new page to sg? The reason is we link
> the new page in first, and only maintain the first pointer. So the most
> recent new page should be related to sg[0], if we put the new page in
> the last, then we need to travel the page list to get last pointer. Am I
> missing your point here?
>
> Thanks
> Shirley


No, that was what I was looking for.

2009-12-15 18:43:05

by Shirley Ma

[permalink] [raw]
Subject: [RFC PATCH] Subject: virtio: Add unused buffers detach from vring

I submit one split patch for review to make sure that's the right format.
I copied Rusty's comment for the commit message, and change destroy
to detach since we destroy the buffers in caller. This patch is built
against Dave's net-next tree.

There's currently no way for a virtio driver to ask for unused
buffers, so it has to keep a list itself to reclaim them at shutdown.
This is redundant, since virtio_ring stores that information. So
add a new hook to do this: virtio_net will be the first user.

Signed-off-by: Shirley Ma <[email protected]>
---
drivers/virtio/virtio_ring.c | 24 ++++++++++++++++++++++++
include/linux/virtio.h | 1 +
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index fbd2ecd..f847bc3 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -334,6 +334,29 @@ static bool vring_enable_cb(struct virtqueue *_vq)
return true;
}

+/* This function is used to return vring unused buffers to caller for free */
+static void *vring_detach_bufs(struct virtqueue *_vq)
+{
+ struct vring_virtqueue *vq = to_vvq(_vq);
+ unsigned int i;
+
+ START_USE(vq);
+
+ for (i = 0; i < vq->vring.num; ++i) {
+ if (vq->data[i]) {
+ /* detach_buf clears data, so grab it now. */
+ detach_buf(vq, i);
+ END_USE(vq);
+ return vq->data[i];
+ }
+ }
+ /* That should have freed everything. */
+ BUG_ON(vq->num_free != vq->vring.num);
+
+ END_USE(vq);
+ return NULL;
+}
+
irqreturn_t vring_interrupt(int irq, void *_vq)
{
struct vring_virtqueue *vq = to_vvq(_vq);
@@ -360,6 +383,7 @@ static struct virtqueue_ops vring_vq_ops = {
.kick = vring_kick,
.disable_cb = vring_disable_cb,
.enable_cb = vring_enable_cb,
+ .detach_bufs = vring_detach_bufs,
};

struct virtqueue *vring_new_virtqueue(unsigned int num,
diff --git a/include/linux/virtio.h b/include/linux/virtio.h
index 057a2e0..d7da456 100644
--- a/include/linux/virtio.h
+++ b/include/linux/virtio.h
@@ -71,6 +71,7 @@ struct virtqueue_ops {

void (*disable_cb)(struct virtqueue *vq);
bool (*enable_cb)(struct virtqueue *vq);
+ void *(*detach_bufs)(struct virtqueue *vq);
};

/**

Thanks
Shirley

2009-12-15 18:50:30

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH] Subject: virtio: Add unused buffers detach from vring

On Tue, Dec 15, 2009 at 10:42:53AM -0800, Shirley Ma wrote:
> I submit one split patch for review to make sure that's the right format.
> I copied Rusty's comment for the commit message, and change destroy
> to detach since we destroy the buffers in caller. This patch is built
> against Dave's net-next tree.

Almost :) text not intended for git commit logs like the above
should go after ---, this way git am knows to skip it.

> There's currently no way for a virtio driver to ask for unused
> buffers, so it has to keep a list itself to reclaim them at shutdown.
> This is redundant, since virtio_ring stores that information. So
> add a new hook to do this: virtio_net will be the first user.
>
> Signed-off-by: Shirley Ma <[email protected]>
> ---
> drivers/virtio/virtio_ring.c | 24 ++++++++++++++++++++++++
> include/linux/virtio.h | 1 +
> 2 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index fbd2ecd..f847bc3 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -334,6 +334,29 @@ static bool vring_enable_cb(struct virtqueue *_vq)
> return true;
> }
>
> +/* This function is used to return vring unused buffers to caller for free */
> +static void *vring_detach_bufs(struct virtqueue *_vq)
> +{
> + struct vring_virtqueue *vq = to_vvq(_vq);
> + unsigned int i;
> +
> + START_USE(vq);
> +
> + for (i = 0; i < vq->vring.num; ++i) {

This is a single statement loop, so you do not need {}
around it. Or even better:
for (i = 0; i < vq->vring.num; ++i) {
if (!vq->data[i])
continue;
...
}
which has less nesting.

> + if (vq->data[i]) {
> + /* detach_buf clears data, so grab it now. */

You wrote that comment, but did you read it :)

> + detach_buf(vq, i);
> + END_USE(vq);
> + return vq->data[i];

In fact, this will return NULL always, won't it?

> + }
> + }
> + /* That should have freed everything. */
> + BUG_ON(vq->num_free != vq->vring.num);
> +
> + END_USE(vq);
> + return NULL;
> +}
> +
> irqreturn_t vring_interrupt(int irq, void *_vq)
> {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -360,6 +383,7 @@ static struct virtqueue_ops vring_vq_ops = {
> .kick = vring_kick,
> .disable_cb = vring_disable_cb,
> .enable_cb = vring_enable_cb,
> + .detach_bufs = vring_detach_bufs,
> };
>
> struct virtqueue *vring_new_virtqueue(unsigned int num,
> diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> index 057a2e0..d7da456 100644
> --- a/include/linux/virtio.h
> +++ b/include/linux/virtio.h
> @@ -71,6 +71,7 @@ struct virtqueue_ops {
>
> void (*disable_cb)(struct virtqueue *vq);
> bool (*enable_cb)(struct virtqueue *vq);
> + void *(*detach_bufs)(struct virtqueue *vq);
> };


Please add documentation in virtio.h
Thanks!

>
> /**
>
> Thanks
> Shirley

2009-12-15 19:08:14

by Shirley Ma

[permalink] [raw]
Subject: Re: [RFC PATCH] Subject: virtio: Add unused buffers detach from vring

Thanks Michael, will fix them all.

Shirley

2009-12-15 19:14:23

by Shirley Ma

[permalink] [raw]
Subject: Re: [RFC PATCH] Subject: virtio: Add unused buffers detach from vring

Hello Michael,

On Tue, 2009-12-15 at 20:47 +0200, Michael S. Tsirkin wrote:
> > + detach_buf(vq, i);
> > + END_USE(vq);
> > + return vq->data[i];
>
> In fact, this will return NULL always, won't it?

Nope, I changed the destroy to detach and return the buffers without
destroying them within the call. I thought it might be useful in some
other case.

Maybe I should put destroy call back?

Thanks
Shirley

2009-12-15 21:17:41

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [RFC PATCH] Subject: virtio: Add unused buffers detach from vring

On Tue, Dec 15, 2009 at 11:14:07AM -0800, Shirley Ma wrote:
> Hello Michael,
>
> On Tue, 2009-12-15 at 20:47 +0200, Michael S. Tsirkin wrote:
> > > + detach_buf(vq, i);
> > > + END_USE(vq);
> > > + return vq->data[i];
> >
> > In fact, this will return NULL always, won't it?
>
> Nope, I changed the destroy to detach and return the buffers without
> destroying them within the call. I thought it might be useful in some
> other case.
>
> Maybe I should put destroy call back?
>
> Thanks
> Shirley

No I think it's good as is, we do not need a callback.
I was simply saying that detach_buf sets data to NULL,
so return vq->data[i] after detach does not make sense.
You need to save data as comment below says.c

--
MST

2009-12-15 22:36:20

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Tue, 15 Dec 2009 06:52:53 am Michael S. Tsirkin wrote:
> On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote:
> > Hello Michael,
> >
> > I agree with the comments (will have two patches instead of 4 based on
> > Rusty's comments) except below one.
> >
> > On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote:
> > > That said - do we have to use a callback?
> > > I think destroy_buf which returns data pointer,
> > > and which we call repeatedly until we get NULL
> > > or error, would be an a better, more flexible API.
> > > This is not critical though.
> >
> > The reason to use this is because in virtio_net remove, it has
> > BUG_ON(vi->num != 0), which will be consistent with small skb packet. If
> > we use NULL, error then we lose the track for vi->num, since we don't
> > know how many buffers have been passed to ULPs or still unused.
> >
> > Thanks
> > Shirley
>
> I dont insist, but my idea was
>
> for (;;) {
> b = vq->destroy(vq);
> if (!b)
> break;
> --vi->num;
> put_page(b);
> }

In this case it should be called "get_unused_buf" or something. But I like
Shirley's approach here; destroy (with callback) accurately reflects the only
time this can be validly used.

Cheers,
Rusty.

2009-12-15 22:43:12

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Wed, Dec 16, 2009 at 09:06:12AM +1030, Rusty Russell wrote:
> On Tue, 15 Dec 2009 06:52:53 am Michael S. Tsirkin wrote:
> > On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote:
> > > Hello Michael,
> > >
> > > I agree with the comments (will have two patches instead of 4 based on
> > > Rusty's comments) except below one.
> > >
> > > On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote:
> > > > That said - do we have to use a callback?
> > > > I think destroy_buf which returns data pointer,
> > > > and which we call repeatedly until we get NULL
> > > > or error, would be an a better, more flexible API.
> > > > This is not critical though.
> > >
> > > The reason to use this is because in virtio_net remove, it has
> > > BUG_ON(vi->num != 0), which will be consistent with small skb packet. If
> > > we use NULL, error then we lose the track for vi->num, since we don't
> > > know how many buffers have been passed to ULPs or still unused.
> > >
> > > Thanks
> > > Shirley
> >
> > I dont insist, but my idea was
> >
> > for (;;) {
> > b = vq->destroy(vq);
> > if (!b)
> > break;
> > --vi->num;
> > put_page(b);
> > }
>
> In this case it should be called "get_unused_buf" or something. But I like
> Shirley's approach here; destroy (with callback) accurately reflects the only
> time this can be validly used.
>
> Cheers,
> Rusty.

I guess the actual requirement is that device must be
inactive.

As I said this is fine with me as well.
But I think the callback should get vq pointer besides the
data pointer, so that it can e.g. find the device if it needs to.
In case of virtio net this makes it possible
to decrement the outstanding skb counter in the callback.
Makes sense?

--
MST

2009-12-16 05:04:13

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH v2 1/4] Defer skb allocation -- add destroy buffers function for virtio

On Wed, 16 Dec 2009 09:10:02 am Michael S. Tsirkin wrote:
> On Wed, Dec 16, 2009 at 09:06:12AM +1030, Rusty Russell wrote:
> > On Tue, 15 Dec 2009 06:52:53 am Michael S. Tsirkin wrote:
> > > On Mon, Dec 14, 2009 at 12:08:05PM -0800, Shirley Ma wrote:
> > > > Hello Michael,
> > > >
> > > > I agree with the comments (will have two patches instead of 4 based on
> > > > Rusty's comments) except below one.
> > > >
> > > > On Sun, 2009-12-13 at 12:26 +0200, Michael S. Tsirkin wrote:
> > > > > That said - do we have to use a callback?
> > > > > I think destroy_buf which returns data pointer,
> > > > > and which we call repeatedly until we get NULL
> > > > > or error, would be an a better, more flexible API.
> > > > > This is not critical though.
> > > >
> > > > The reason to use this is because in virtio_net remove, it has
> > > > BUG_ON(vi->num != 0), which will be consistent with small skb packet. If
> > > > we use NULL, error then we lose the track for vi->num, since we don't
> > > > know how many buffers have been passed to ULPs or still unused.
> > > >
> > > > Thanks
> > > > Shirley
> > >
> > > I dont insist, but my idea was
> > >
> > > for (;;) {
> > > b = vq->destroy(vq);
> > > if (!b)
> > > break;
> > > --vi->num;
> > > put_page(b);
> > > }
> >
> > In this case it should be called "get_unused_buf" or something. But I like
> > Shirley's approach here; destroy (with callback) accurately reflects the only
> > time this can be validly used.
> >
> > Cheers,
> > Rusty.
>
> I guess the actual requirement is that device must be
> inactive.

Technically, the vq has to be inactive. (This distinction may matter for
the multiport virtio_console work).

>
> As I said this is fine with me as well.
> But I think the callback should get vq pointer besides the
> data pointer, so that it can e.g. find the device if it needs to.
> In case of virtio net this makes it possible
> to decrement the outstanding skb counter in the callback.
> Makes sense?

Sure. I don't really mind either way, and I'm warming to the name
detach_buf :)

Cheers,
Rusty.