LinuxLists.cc - [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

[permalink] [raw]

Subject: [V2 PATCH 6/9] vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs

When we want to disable vhost_net backend while there's a tx work, a possible
NULL pointer defernece may happen we we try to deference the vq->bufs after
vhost_net_set_backend() assign a NULL to it.

As suggested by Michael, fix this by checking the vq->bufs instead of
vhost_sock_zcopy().
---
drivers/vhost/net.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f0da2c3..ffdc0d8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -166,7 +166,7 @@ static void handle_tx(struct vhost_net *net)
if (wmem < sock->sk->sk_sndbuf / 2)
tx_poll_stop(net);
hdr_size = vq->vhost_hlen;
- zcopy = vhost_sock_zcopy(sock);
+ zcopy = vq->ubufs;

for (;;) {
/* Release DMAs done buffers first */

2012-05-02 03:47:51

[permalink] [raw]

Subject: [V2 PATCH 7/9] vhost_net: re-poll only on EAGAIN or ENOBUFS

Currently, we restart tx polling unconditionally when sendmsg()
fails. This would cause unnecessary wakeups of vhost wokers and waste
cpu utlization when evil userspace(guest driver) is able to hit EFAULT or
EINVAL.

The polling is only needed when the socket send buffer were exceeded or not
enough memory. So fix this by restarting polling only when sendmsg() returns
EAGAIN/ENOBUFS.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/net.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ffdc0d8..62828aa 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -257,7 +257,8 @@ static void handle_tx(struct vhost_net *net)
UIO_MAXIOV;
}
vhost_discard_vq_desc(vq, 1);
- tx_poll_start(net, sock);
+ if (err == -EAGAIN || err == -ENOBUFS)
+ tx_poll_start(net, sock);
break;
}
if (err != len)

2012-05-02 03:47:26

[permalink] [raw]

Subject: [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb

This patch fixes the offset calculation when building skb:

- offset1 were used as skb data offset not vector offset
- reset offset to zero only when we advance to next vector

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvtap.c | 13 +++++++------
1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 0427c65..bd4a70d 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -505,10 +505,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
if (copy > size) {
++from;
--count;
- }
+ offset = 0;
+ } else
+ offset += size;
copy -= size;
offset1 += size;
- offset = 0;
}

if (len == offset1)
@@ -519,13 +520,13 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
int num_pages;
unsigned long base;

- len = from->iov_len - offset1;
+ len = from->iov_len - offset;
if (!len) {
- offset1 = 0;
+ offset = 0;
++from;
continue;
}
- base = (unsigned long)from->iov_base + offset1;
+ base = (unsigned long)from->iov_base + offset;
size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
num_pages = get_user_pages_fast(base, size, 0, &page[i]);
if ((num_pages != size) ||
@@ -546,7 +547,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
len -= size;
i++;
}
- offset1 = 0;
+ offset = 0;
++from;
}
return 0;

2012-05-02 03:47:39

[permalink] [raw]

Subject: [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully

Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
fails to build zerocopy skb because destructor_arg was not
initialized. Solve this by set this flag after the skb were built
successfully.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvtap.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 9ab182a..a4ff694 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -699,10 +699,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
if (!skb)
goto err;

- if (zerocopy) {
+ if (zerocopy)
err = zerocopy_sg_from_iovec(skb, iv, vnet_hdr_len, count);
- skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
- } else
+ else
err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len,
len);
if (err)
@@ -721,8 +720,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
rcu_read_lock_bh();
vlan = rcu_dereference_bh(q->vlan);
/* copy skb_ubuf_info for callback when skb has no error */
- if (zerocopy)
+ if (zerocopy) {
skb_shinfo(skb)->destructor_arg = m->msg_control;
+ skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+ }
if (vlan)
macvlan_start_xmit(skb, vlan->dev);
else

2012-05-02 03:48:13

[permalink] [raw]

Subject: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

We add used and signal guest in worker thread but did not poll the virtqueue
during the zero copy callback. This may lead the missing of adding and
signalling during zerocopy. Solve this by polling the virtqueue and let it
wakeup the worker during callback.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/vhost.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 947f00d..7b75fdf 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
struct vhost_ubuf_ref *ubufs = ubuf->arg;
struct vhost_virtqueue *vq = ubufs->vq;

+ vhost_poll_queue(&vq->poll);
/* set len = 1 to mark this desc buffers done DMA */
vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
kref_put(&ubufs->kref, vhost_zerocopy_done_signal);

2012-05-02 03:48:05

[permalink] [raw]

Subject: [V2 PATCH 8/9] vhost_net: zerocopy: adding and signalling immediately when fully copied

When a packet were fully copied in zerocopy, we don't wait for the DMA done to
mark the done flag, so after the packet were passed to lower device, we need to
add used and signal guest immediately.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/vhost/net.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 62828aa..1dc2aeb 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -266,6 +266,8 @@ static void handle_tx(struct vhost_net *net)
" len %d != %zd\n", err, len);
if (!zcopy)
vhost_add_used_and_signal(&net->dev, vq, head, 0);
+ else
+ vhost_zerocopy_signal_used(vq);
total_len += len;
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);

2012-05-02 03:47:38

[permalink] [raw]

Subject: [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages

When get_user_pages_fast() fails to get all requested pages, we could not use
kfree_skb() to free it as it has not been put in the skb fragments. So we need
to call put_page() instead.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvtap.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7cb2684..9ab182a 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -531,9 +531,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
num_pages = get_user_pages_fast(base, size, 0, &page[i]);
if ((num_pages != size) ||
- (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
- /* put_page is in skb free */
+ (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) {
+ for (i = 0; i < num_pages; i++)
+ put_page(page[i]);
return -EFAULT;
+ }
truesize = size * PAGE_SIZE;
skb->data_len += len;
skb->len += len;

2012-05-02 03:47:36

[permalink] [raw]

Subject: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

As the skb fragment were pinned/built from user pages, we should
account the page instead of length for truesize.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvtap.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index bd4a70d..7cb2684 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
struct page *page[MAX_SKB_FRAGS];
int num_pages;
unsigned long base;
+ unsigned long truesize;

len = from->iov_len - offset;
if (!len) {
@@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
(num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
/* put_page is in skb free */
return -EFAULT;
+ truesize = size * PAGE_SIZE;
skb->data_len += len;
skb->len += len;
- skb->truesize += len;
- atomic_add(len, &skb->sk->sk_wmem_alloc);
+ skb->truesize += truesize;
+ atomic_add(truesize, &skb->sk->sk_wmem_alloc);
while (len) {
int off = base & ~PAGE_MASK;
int size = min_t(int, len, PAGE_SIZE - off);

2012-05-02 03:49:38

[permalink] [raw]

Subject: [V2 PATCH 5/9] macvtap: zerocopy: validate vectors before building skb

There're several reasons that the vectors need to be validated:

- Return error when caller provides vectors whose num is greater than UIO_MAXIOV.
- Linearize part of skb when userspace provides vectors grater than MAX_SKB_FRAGS.
- Return error when userspace provides vectors whose total length may exceed
- MAX_SKB_FRAGS * PAGE_SIZE.

Signed-off-by: Jason Wang <[email protected]>
---
drivers/net/macvtap.c | 25 +++++++++++++++++++++----
1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index a4ff694..163559c 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -529,9 +529,10 @@ static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
}
base = (unsigned long)from->iov_base + offset;
size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+ if (i + size > MAX_SKB_FRAGS)
+ return -EMSGSIZE;
num_pages = get_user_pages_fast(base, size, 0, &page[i]);
- if ((num_pages != size) ||
- (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags)) {
+ if (num_pages != size) {
for (i = 0; i < num_pages; i++)
put_page(page[i]);
return -EFAULT;
@@ -651,7 +652,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
int err;
struct virtio_net_hdr vnet_hdr = { 0 };
int vnet_hdr_len = 0;
- int copylen;
+ int copylen = 0;
bool zerocopy = false;

if (q->flags & IFF_VNET_HDR) {
@@ -680,15 +681,31 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
if (unlikely(len < ETH_HLEN))
goto err;

+ err = -EMSGSIZE;
+ if (unlikely(count > UIO_MAXIOV))
+ goto err;
+
if (m && m->msg_control && sock_flag(&q->sk, SOCK_ZEROCOPY))
zerocopy = true;

if (zerocopy) {
+ /* Userspace may produce vectors with count greater than
+ * MAX_SKB_FRAGS, so we need to linearize parts of the skb
+ * to let the rest of data to be fit in the frags.
+ */
+ if (count > MAX_SKB_FRAGS) {
+ copylen = iov_length(iv, count - MAX_SKB_FRAGS);
+ if (copylen < vnet_hdr_len)
+ copylen = 0;
+ else
+ copylen -= vnet_hdr_len;
+ }
/* There are 256 bytes to be copied in skb, so there is enough
* room for skb expand head in case it is used.
* The rest buffer is mapped from userspace.
*/
- copylen = vnet_hdr.hdr_len;
+ if (copylen < vnet_hdr.hdr_len)
+ copylen = vnet_hdr.hdr_len;
if (!copylen)
copylen = GOODCOPY_LEN;
} else

2012-05-02 05:50:08

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
> This is an updated since the last series of vhost/macvtap zerocopy fixes which
> fixes the the possible transmission stall, host kernel stack overflow and other
> misc fixes.
>
> Changes from V1:
> - Addressing comments from Eric and Michael.
> - Adding more fixes into the seires.

Thanks for fixing this.
Acked-by: Michael S. Tsirkin <[email protected]>

Dave, can you merge this for 3.4 please?
Thanks!

> ---
>
> Jason Wang (9):
> macvtap: zerocopy: fix offset calculation when building skb
> macvtap: zerocopy: fix truesize underestimation
> macvtap: zerocopy: put page when fail to get all requested user pages
> macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully
> macvtap: zerocopy: validate vectors before building skb
> vhost_net: zerocopy: fix possible NULL pointer dereference of vq->bufs
> vhost_net: re-poll only on EAGAIN or ENOBUFS
> vhost_net: zerocopy: adding and signalling immediately when fully copied
> vhost: zerocopy: poll vq in zerocopy callback
>
>
> drivers/net/macvtap.c | 57 ++++++++++++++++++++++++++++++++++---------------
> drivers/vhost/net.c | 7 ++++--
> drivers/vhost/vhost.c | 1 +
> 3 files changed, 46 insertions(+), 19 deletions(-)
>
> --
> Jason Wang

2012-05-02 06:44:36

by David Miller

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

From: "Michael S. Tsirkin" <[email protected]>
Date: Wed, 2 May 2012 08:50:09 +0300

> On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
>> This is an updated since the last series of vhost/macvtap zerocopy fixes which
>> fixes the the possible transmission stall, host kernel stack overflow and other
>> misc fixes.
>>
>> Changes from V1:
>> - Addressing comments from Eric and Michael.
>> - Adding more fixes into the seires.
>
> Thanks for fixing this.
> Acked-by: Michael S. Tsirkin <[email protected]>
>
> Dave, can you merge this for 3.4 please?

It's rather late in the -RC for such a large patch set.

2012-05-02 08:11:41

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

On Wed, May 02, 2012 at 02:44:27AM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <[email protected]>
> Date: Wed, 2 May 2012 08:50:09 +0300
>
> > On Wed, May 02, 2012 at 11:41:21AM +0800, Jason Wang wrote:
> >> This is an updated since the last series of vhost/macvtap zerocopy fixes which
> >> fixes the the possible transmission stall, host kernel stack overflow and other
> >> misc fixes.
> >>
> >> Changes from V1:
> >> - Addressing comments from Eric and Michael.
> >> - Adding more fixes into the seires.
> >
> > Thanks for fixing this.
> > Acked-by: Michael S. Tsirkin <[email protected]>
> >
> > Dave, can you merge this for 3.4 please?
>
> It's rather late in the -RC for such a large patch set.

I was in doubt that's why I asked instead of just merging through my tree.
OK, I'll apply to my tree and send pull request for net-next a bit later.
--
MST

2012-05-02 19:41:06

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

Side question. Are you aware that macvtap/vhost net is broken
in the presence of vlan accelleration and that vlan accelleration
is not optional if you are using vlan headers?

Eric

2012-05-02 21:31:26

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
>
> Side question. Are you aware that macvtap/vhost net is broken
> in the presence of vlan accelleration and that vlan accelleration
> is not optional if you are using vlan headers?
>
> Eric

I didn't know. Could you explain please?

2012-05-02 21:54:49

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [V2 PATCH 0/9] vhost/macvtap zeropcopy fixes

"Michael S. Tsirkin" <[email protected]> writes:

> On Wed, May 02, 2012 at 12:40:55PM -0700, Eric W. Biederman wrote:
>>
>> Side question. Are you aware that macvtap/vhost net is broken
>> in the presence of vlan accelleration and that vlan accelleration
>> is not optional if you are using vlan headers?
>>
>> Eric
>
> I didn't know. Could you explain please?

It is worth looking at the netdev history for some recent patches by
Basil Gor, as he has been hit by this issue and has been trying to come
up with a clean fix. I was burned by this issue in other parts of the
networking stack and so have been doing some basic review.

The short version is that on any normal path through the networking
stack we implement vlan header accelleration in hardware or emulation
in software. The result is that the ethernet vlan header is not
on the packet and is instead in the skb->tci field.

When coming on out the pf_packet sockets we don't include the vlan
header in the packet but instead the vlan header is put in aux data.

The result of all of this is that vhost/net.c looses the vlan
header when coming from pf_packet socket.

Furthermore macvtap_recvmsg does not act like pf_packet sockets
when there is a vlan header persent so weirdness ensues. Especially
when the packet is coming from a path where the vlan header has
been stripped and placed in skb->tci.

Eric

2012-05-15 16:51:41

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
> We add used and signal guest in worker thread but did not poll the
> virtqueue
> during the zero copy callback. This may lead the missing of adding and
> signalling during zerocopy. Solve this by polling the virtqueue and
> let it
> wakeup the worker during callback.
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/vhost/vhost.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 947f00d..7b75fdf 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> struct vhost_virtqueue *vq = ubufs->vq;
>
> + vhost_poll_queue(&vq->poll);
> /* set len = 1 to mark this desc buffers done DMA */
> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);

Doing so, we might have redundant vhost_poll_queue(). Do you know in
which scenario there might be missing of adding and signaling during
zerocopy?

Thanks
Shirley

2012-05-15 17:24:46

[permalink] [raw]

Subject: Re: [V2 PATCH 1/9] macvtap: zerocopy: fix offset calculation when building skb

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> This patch fixes the offset calculation when building skb:
>
> - offset1 were used as skb data offset not vector offset
> - reset offset to zero only when we advance to next vector

I tested the original code in all scenario, it worked well.

However this patch makes the code more clear.

Thanks
Shirley

2012-05-15 17:30:55

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> As the skb fragment were pinned/built from user pages, we should
> account the page instead of length for truesize.
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/net/macvtap.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index bd4a70d..7cb2684 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
> struct page *page[MAX_SKB_FRAGS];
> int num_pages;
> unsigned long base;
> + unsigned long truesize;
>
> len = from->iov_len - offset;
> if (!len) {
> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
> (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
> /* put_page is in skb free */
> return -EFAULT;
> + truesize = size * PAGE_SIZE;

Here should be truesize = size * PAGE_SIZE - offset, right?

> skb->data_len += len;
> skb->len += len;
> - skb->truesize += len;
> - atomic_add(len, &skb->sk->sk_wmem_alloc);
> + skb->truesize += truesize;
> + atomic_add(truesize, &skb->sk->sk_wmem_alloc);
> while (len) {
> int off = base & ~PAGE_MASK;
> int size = min_t(int, len, PAGE_SIZE - off);
>
>

2012-05-15 17:34:28

[permalink] [raw]

Subject: Re: [V2 PATCH 3/9] macvtap: zerocopy: put page when fail to get all requested user pages

On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
> When get_user_pages_fast() fails to get all requested pages, we could
> not use
> kfree_skb() to free it as it has not been put in the skb fragments. So
> we need
> to call put_page() instead.
>
> Signed-off-by: Jason Wang <[email protected]>
> ---
> drivers/net/macvtap.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7cb2684..9ab182a 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -531,9 +531,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
> *skb, const struct iovec *from,
> size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >>
> PAGE_SHIFT;
> num_pages = get_user_pages_fast(base, size, 0,
> &page[i]);
> if ((num_pages != size) ||
> - (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags))
> - /* put_page is in skb free */
> + (num_pages > MAX_SKB_FRAGS -
> skb_shinfo(skb)->nr_frags)) {
> + for (i = 0; i < num_pages; i++)
> + put_page(page[i]);
> return -EFAULT;
> + }
> truesize = size * PAGE_SIZE;
> skb->data_len += len;
> skb->len += len;

Good catch. I don't know why I thought put_page would be called in
skb_free for these pages which hadn't been added to skb frags before. :(

thanks
Shirley

2012-05-15 17:49:10

[permalink] [raw]

Subject: Re: [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully

On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
> Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
> zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
> fails to build zerocopy skb because destructor_arg was not
> initialized. Solve this by set this flag after the skb were built
> successfully.

I thought this flag was needed for zerocopy skb free even in err case.
I've checked it back again, it's OK to move the flag after the skb
successfully built. Or we can fix it by modify skb free with
destructor_arg NULL check as below:
...
skb_release_data() {
...
if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY) {
struct ubuf_info *uarg;

uarg = skb_shinfo(skb)->destructor_arg;
if (uarg && uarg->callback)
uarg->callback(uarg);
}

...
}
Thanks
Shirley

2012-05-16 02:58:28

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/16/2012 12:50 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
>> We add used and signal guest in worker thread but did not poll the
>> virtqueue
>> during the zero copy callback. This may lead the missing of adding and
>> signalling during zerocopy. Solve this by polling the virtqueue and
>> let it
>> wakeup the worker during callback.
>>
>> Signed-off-by: Jason Wang<[email protected]>
>> ---
>> drivers/vhost/vhost.c | 1 +
>> 1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 947f00d..7b75fdf 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>> struct vhost_ubuf_ref *ubufs = ubuf->arg;
>> struct vhost_virtqueue *vq = ubufs->vq;
>>
>> + vhost_poll_queue(&vq->poll);
>> /* set len = 1 to mark this desc buffers done DMA */
>> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> Doing so, we might have redundant vhost_poll_queue(). Do you know in
> which scenario there might be missing of adding and signaling during
> zerocopy?

Yes, as we only do signaling and adding during tx work, if there's no tx
work when the skb were sent, we may lose the opportunity to let guest
know about the completion. It's easy to be reproduced with netperf test.

Thanks
> Thanks
> Shirley
>

2012-05-16 03:04:25

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On 05/16/2012 01:26 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:41 +0800, Jason Wang wrote:
>> As the skb fragment were pinned/built from user pages, we should
>> account the page instead of length for truesize.
>>
>> Signed-off-by: Jason Wang<[email protected]>
>> ---
>> drivers/net/macvtap.c | 6 ++++--
>> 1 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>> index bd4a70d..7cb2684 100644
>> --- a/drivers/net/macvtap.c
>> +++ b/drivers/net/macvtap.c
>> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct sk_buff
>> *skb, const struct iovec *from,
>> struct page *page[MAX_SKB_FRAGS];
>> int num_pages;
>> unsigned long base;
>> + unsigned long truesize;
>>
>> len = from->iov_len - offset;
>> if (!len) {
>> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct sk_buff
>> *skb, const struct iovec *from,
>> (num_pages> MAX_SKB_FRAGS -
>> skb_shinfo(skb)->nr_frags))
>> /* put_page is in skb free */
>> return -EFAULT;
>> + truesize = size * PAGE_SIZE;
> Here should be truesize = size * PAGE_SIZE - offset, right?
>

We get the whole user page, so need to account them all. Also this is
aligned with skb_copy_ubufs().
>> skb->data_len += len;
>> skb->len += len;
>> - skb->truesize += len;
>> - atomic_add(len,&skb->sk->sk_wmem_alloc);
>> + skb->truesize += truesize;
>> + atomic_add(truesize,&skb->sk->sk_wmem_alloc);
>> while (len) {
>> int off = base& ~PAGE_MASK;
>> int size = min_t(int, len, PAGE_SIZE - off);
>>
>>

2012-05-16 03:18:04

[permalink] [raw]

Subject: Re: [V2 PATCH 4/9] macvtap: zerocopy: set SKBTX_DEV_ZEROCOPY only when skb is built successfully

On 05/16/2012 01:44 AM, Shirley Ma wrote:
> On Wed, 2012-05-02 at 11:42 +0800, Jason Wang wrote:
>> Current the SKBTX_DEV_ZEROCOPY is set unconditionally after
>> zerocopy_sg_from_iovec(), this would lead NULL pointer when macvtap
>> fails to build zerocopy skb because destructor_arg was not
>> initialized. Solve this by set this flag after the skb were built
>> successfully.
> I thought this flag was needed for zerocopy skb free even in err case.
> I've checked it back again, it's OK to move the flag after the skb
> successfully built. Or we can fix it by modify skb free with
> destructor_arg NULL check as below:
> ...
> skb_release_data() {
> ...
> if (skb_shinfo(skb)->tx_flags& SKBTX_DEV_ZEROCOPY) {
> struct ubuf_info *uarg;
>
> uarg = skb_shinfo(skb)->destructor_arg;
> if (uarg&& uarg->callback)
> uarg->callback(uarg);
> }
>
> ...
> }
> Thanks
> Shirley
>
Yes, both are ok. Since the code were merged, let's just use current method.

2012-05-16 15:04:24

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On Wed, 2012-05-16 at 11:04 +0800, Jason Wang wrote:
> >> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> >> index bd4a70d..7cb2684 100644
> >> --- a/drivers/net/macvtap.c
> >> +++ b/drivers/net/macvtap.c
> >> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct
> sk_buff
> >> *skb, const struct iovec *from,
> >> struct page *page[MAX_SKB_FRAGS];
> >> int num_pages;
> >> unsigned long base;
> >> + unsigned long truesize;
> >>
> >> len = from->iov_len - offset;
> >> if (!len) {
> >> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct
> sk_buff
> >> *skb, const struct iovec *from,
> >> (num_pages> MAX_SKB_FRAGS -
> >> skb_shinfo(skb)->nr_frags))
> >> /* put_page is in skb free */
> >> return -EFAULT;
> >> + truesize = size * PAGE_SIZE;
> > Here should be truesize = size * PAGE_SIZE - offset, right?
> >
>
> We get the whole user page, so need to account them all. Also this is
> aligned with skb_copy_ubufs().

Then this would double count the size of "first" offset left from
previous copy, both skb->len and truesize.

Thanks
Shirley

2012-05-16 15:10:47

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> >> drivers/vhost/vhost.c | 1 +
> >> 1 files changed, 1 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >> index 947f00d..7b75fdf 100644
> >> --- a/drivers/vhost/vhost.c
> >> +++ b/drivers/vhost/vhost.c
> >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> >> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> >> struct vhost_virtqueue *vq = ubufs->vq;
> >>
> >> + vhost_poll_queue(&vq->poll);
> >> /* set len = 1 to mark this desc buffers done DMA */
> >> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> >> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > Doing so, we might have redundant vhost_poll_queue(). Do you know in
> > which scenario there might be missing of adding and signaling during
> > zerocopy?
>
> Yes, as we only do signaling and adding during tx work, if there's no
> tx
> work when the skb were sent, we may lose the opportunity to let guest
> know about the completion. It's easy to be reproduced with netperf
> test.

The reason which host signals guest is to free guest tx buffers, if
there is no tx work, then it's not necessary to signal the guest unless
guest runs out of memory. The pending buffers will be released
virtio_net device gone.

What's the behavior of netperf test when you hit this situation?

Thanks
Shirley

2012-05-16 15:14:59

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > >> drivers/vhost/vhost.c | 1 +
> > >> 1 files changed, 1 insertions(+), 0 deletions(-)
> > >>
> > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > >> index 947f00d..7b75fdf 100644
> > >> --- a/drivers/vhost/vhost.c
> > >> +++ b/drivers/vhost/vhost.c
> > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > >> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > >> struct vhost_virtqueue *vq = ubufs->vq;
> > >>
> > >> + vhost_poll_queue(&vq->poll);
> > >> /* set len = 1 to mark this desc buffers done DMA */
> > >> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > >> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > Doing so, we might have redundant vhost_poll_queue(). Do you know in
> > > which scenario there might be missing of adding and signaling during
> > > zerocopy?
> >
> > Yes, as we only do signaling and adding during tx work, if there's no
> > tx
> > work when the skb were sent, we may lose the opportunity to let guest
> > know about the completion. It's easy to be reproduced with netperf
> > test.
>
> The reason which host signals guest is to free guest tx buffers, if
> there is no tx work, then it's not necessary to signal the guest unless
> guest runs out of memory. The pending buffers will be released
> virtio_net device gone.
>
> What's the behavior of netperf test when you hit this situation?
>
> Thanks
> Shirley

IIRC guest networking seems to be lost.

--
MST

2012-05-16 17:34:12

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > >> drivers/vhost/vhost.c | 1 +
> > > >> 1 files changed, 1 insertions(+), 0 deletions(-)
> > > >>
> > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > >> index 947f00d..7b75fdf 100644
> > > >> --- a/drivers/vhost/vhost.c
> > > >> +++ b/drivers/vhost/vhost.c
> > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > > >> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > >> struct vhost_virtqueue *vq = ubufs->vq;
> > > >>
> > > >> + vhost_poll_queue(&vq->poll);
> > > >> /* set len = 1 to mark this desc buffers done DMA */
> > > >> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > >> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> know in
> > > > which scenario there might be missing of adding and signaling
> during
> > > > zerocopy?
> > >
> > > Yes, as we only do signaling and adding during tx work, if there's
> no
> > > tx
> > > work when the skb were sent, we may lose the opportunity to let
> guest
> > > know about the completion. It's easy to be reproduced with netperf
> > > test.
> >
> > The reason which host signals guest is to free guest tx buffers, if
> > there is no tx work, then it's not necessary to signal the guest
> unless
> > guest runs out of memory. The pending buffers will be released
> > virtio_net device gone.
> >
> > What's the behavior of netperf test when you hit this situation?
> >
> > Thanks
> > Shirley
>
> IIRC guest networking seems to be lost.

It seems vhost_enable_notify is missing in somewhere else?

Thanks
Shirley

2012-05-16 18:36:35

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> > On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > > >> drivers/vhost/vhost.c | 1 +
> > > > >> 1 files changed, 1 insertions(+), 0 deletions(-)
> > > > >>
> > > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > >> index 947f00d..7b75fdf 100644
> > > > >> --- a/drivers/vhost/vhost.c
> > > > >> +++ b/drivers/vhost/vhost.c
> > > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
> > > > >> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > > >> struct vhost_virtqueue *vq = ubufs->vq;
> > > > >>
> > > > >> + vhost_poll_queue(&vq->poll);
> > > > >> /* set len = 1 to mark this desc buffers done DMA */
> > > > >> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > > >> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
> > > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> > know in
> > > > > which scenario there might be missing of adding and signaling
> > during
> > > > > zerocopy?
> > > >
> > > > Yes, as we only do signaling and adding during tx work, if there's
> > no
> > > > tx
> > > > work when the skb were sent, we may lose the opportunity to let
> > guest
> > > > know about the completion. It's easy to be reproduced with netperf
> > > > test.
> > >
> > > The reason which host signals guest is to free guest tx buffers, if
> > > there is no tx work, then it's not necessary to signal the guest
> > unless
> > > guest runs out of memory. The pending buffers will be released
> > > virtio_net device gone.
> > >
> > > What's the behavior of netperf test when you hit this situation?
> > >
> > > Thanks
> > > Shirley
> >
> > IIRC guest networking seems to be lost.
>
> It seems vhost_enable_notify is missing in somewhere else?
>
> Thanks
> Shirley

Donnu. I see virtio sending packets but they do not get
to tun on host. debugging.

2012-05-16 19:09:10

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Wed, 2012-05-16 at 21:36 +0300, Michael S. Tsirkin wrote:
> On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
> > On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
> > > On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
> > > > On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
> > > > > >> drivers/vhost/vhost.c | 1 +
> > > > > >> 1 files changed, 1 insertions(+), 0 deletions(-)
> > > > > >>
> > > > > >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > >> index 947f00d..7b75fdf 100644
> > > > > >> --- a/drivers/vhost/vhost.c
> > > > > >> +++ b/drivers/vhost/vhost.c
> > > > > >> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void
> *arg)
> > > > > >> struct vhost_ubuf_ref *ubufs = ubuf->arg;
> > > > > >> struct vhost_virtqueue *vq = ubufs->vq;
> > > > > >>
> > > > > >> + vhost_poll_queue(&vq->poll);
> > > > > >> /* set len = 1 to mark this desc buffers done DMA
> */
> > > > > >> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
> > > > > >> kref_put(&ubufs->kref,
> vhost_zerocopy_done_signal);
> > > > > > Doing so, we might have redundant vhost_poll_queue(). Do you
> > > know in
> > > > > > which scenario there might be missing of adding and
> signaling
> > > during
> > > > > > zerocopy?
> > > > >
> > > > > Yes, as we only do signaling and adding during tx work, if
> there's
> > > no
> > > > > tx
> > > > > work when the skb were sent, we may lose the opportunity to
> let
> > > guest
> > > > > know about the completion. It's easy to be reproduced with
> netperf
> > > > > test.
> > > >
> > > > The reason which host signals guest is to free guest tx buffers,
> if
> > > > there is no tx work, then it's not necessary to signal the guest
> > > unless
> > > > guest runs out of memory. The pending buffers will be released
> > > > virtio_net device gone.
> > > >
> > > > What's the behavior of netperf test when you hit this situation?
> > > >
> > > > Thanks
> > > > Shirley
> > >
> > > IIRC guest networking seems to be lost.
> >
> > It seems vhost_enable_notify is missing in somewhere else?
> >
> > Thanks
> > Shirley
>
> Donnu. I see virtio sending packets but they do not get
> to tun on host. debugging.

I looked at the code, if (zerocopy) check is needed for below code:

+ if (zerocopy) {
num_pends = likely(vq->upend_idx >= vq->done_idx) ?
(vq->upend_idx - vq->done_idx) :
(vq->upend_idx + UIO_MAXIOV - vq->done_idx);
if (unlikely(num_pends > VHOST_MAX_PEND)) {
tx_poll_start(net, sock);
vhost_poll_queue
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
+ }
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;

Second, whether it's possible the problem comes from tx_poll_start()? In
some situation, vhost_poll_wakeup() is not being called?

Thanks
Shirley

2012-05-17 02:50:48

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/17/2012 01:32 AM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
>> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
>>> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
>>>>>> drivers/vhost/vhost.c | 1 +
>>>>>> 1 files changed, 1 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>> index 947f00d..7b75fdf 100644
>>>>>> --- a/drivers/vhost/vhost.c
>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void *arg)
>>>>>> struct vhost_ubuf_ref *ubufs = ubuf->arg;
>>>>>> struct vhost_virtqueue *vq = ubufs->vq;
>>>>>>
>>>>>> + vhost_poll_queue(&vq->poll);
>>>>>> /* set len = 1 to mark this desc buffers done DMA */
>>>>>> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>>>>>> kref_put(&ubufs->kref, vhost_zerocopy_done_signal);
>>>>> Doing so, we might have redundant vhost_poll_queue(). Do you
>> know in
>>>>> which scenario there might be missing of adding and signaling
>> during
>>>>> zerocopy?
>>>> Yes, as we only do signaling and adding during tx work, if there's
>> no
>>>> tx
>>>> work when the skb were sent, we may lose the opportunity to let
>> guest
>>>> know about the completion. It's easy to be reproduced with netperf
>>>> test.
>>> The reason which host signals guest is to free guest tx buffers, if
>>> there is no tx work, then it's not necessary to signal the guest
>> unless
>>> guest runs out of memory. The pending buffers will be released
>>> virtio_net device gone.

Looks like we only free the skbs in .ndo_start_xmit().
>>>
>>> What's the behavior of netperf test when you hit this situation?
>>>
>>> Thanks
>>> Shirley
>> IIRC guest networking seems to be lost.
> It seems vhost_enable_notify is missing in somewhere else?
>
> Thanks
> Shirley
>

The problem is we may stop the tx queue when there no enough capacity to
place packets, at this moment we depends on the tx interrupt to
re-enable the tx queue. So if we didn't poll the vhost during callback,
guest may lose the tx interrupt to re-enable the tx queue which could
stall the whole tx queue.

Thanks

2012-05-17 02:59:57

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On 05/16/2012 11:03 PM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 11:04 +0800, Jason Wang wrote:
>>>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
>>>> index bd4a70d..7cb2684 100644
>>>> --- a/drivers/net/macvtap.c
>>>> +++ b/drivers/net/macvtap.c
>>>> @@ -519,6 +519,7 @@ static int zerocopy_sg_from_iovec(struct
>> sk_buff
>>>> *skb, const struct iovec *from,
>>>> struct page *page[MAX_SKB_FRAGS];
>>>> int num_pages;
>>>> unsigned long base;
>>>> + unsigned long truesize;
>>>>
>>>> len = from->iov_len - offset;
>>>> if (!len) {
>>>> @@ -533,10 +534,11 @@ static int zerocopy_sg_from_iovec(struct
>> sk_buff
>>>> *skb, const struct iovec *from,
>>>> (num_pages> MAX_SKB_FRAGS -
>>>> skb_shinfo(skb)->nr_frags))
>>>> /* put_page is in skb free */
>>>> return -EFAULT;
>>>> + truesize = size * PAGE_SIZE;
>>> Here should be truesize = size * PAGE_SIZE - offset, right?
>>>
>> We get the whole user page, so need to account them all. Also this is
>> aligned with skb_copy_ubufs().
> Then this would double count the size of "first" offset left from
> previous copy, both skb->len and truesize.
>
> Thanks
> Shirley
>

Didn't see how this affact skb->len. And for truesize, I think they are
different, when the offset were not zero, the data in this vector were
divided into two parts. First part is copied into skb directly, and the
second were pinned from a whole userspace page by get_user_pages_fast(),
so we need count the whole page to the socket limit to prevent evil
application.

Thanks

2012-05-17 15:28:48

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
> Didn't see how this affact skb->len. And for truesize, I think they
> are
> different, when the offset were not zero, the data in this vector
> were
> divided into two parts. First part is copied into skb directly, and
> the
> second were pinned from a whole userspace page by
> get_user_pages_fast(),
> so we need count the whole page to the socket limit to prevent evil
> application.

What I meant that the code for skb->truesize has double added the first
offset if any left from that vector (partically copied into skb
directly, and then count pagesize which includes the offset (truesize +=
PAGE_SIZE)).

Thanks
Shirley

2012-05-17 15:34:46

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
> The problem is we may stop the tx queue when there no enough capacity
> to
> place packets, at this moment we depends on the tx interrupt to
> re-enable the tx queue. So if we didn't poll the vhost during
> callback,
> guest may lose the tx interrupt to re-enable the tx queue which could
> stall the whole tx queue.

VHOST_MAX_PEND should handle the capacity.

Hasn't the above situation been handled in handle_tx() code?:
...
if (unlikely(num_pends > VHOST_MAX_PEND)) {
tx_poll_start(net, sock);
set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
break;
}
...

Thanks
Shirley

2012-05-18 09:58:59

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/17/2012 11:34 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
>> The problem is we may stop the tx queue when there no enough capacity
>> to
>> place packets, at this moment we depends on the tx interrupt to
>> re-enable the tx queue. So if we didn't poll the vhost during
>> callback,
>> guest may lose the tx interrupt to re-enable the tx queue which could
>> stall the whole tx queue.
> VHOST_MAX_PEND should handle the capacity.
>
> Hasn't the above situation been handled in handle_tx() code?:
> ...
> if (unlikely(num_pends> VHOST_MAX_PEND)) {
> tx_poll_start(net, sock);
> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> break;
> }
> ...
>
> Thanks
> Shirley

It may not help in because:

- tx polling depends on skb_orphan() which is often called by device
driver when it place the packet into the queue of the devices instead
of when the packets were sent. So it was too early for vhost to be
notified.

- it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
highly possible that guest needs to be notified when the pending packets
isn't so much.

So this piece of code may not help and could be removed and we need to
poll the virt-queue during zerocopy callback ( through it could be
further optimized but may not be easy).
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2012-05-18 10:10:28

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On 05/17/2012 11:28 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
>> Didn't see how this affact skb->len. And for truesize, I think they
>> are
>> different, when the offset were not zero, the data in this vector
>> were
>> divided into two parts. First part is copied into skb directly, and
>> the
>> second were pinned from a whole userspace page by
>> get_user_pages_fast(),
>> so we need count the whole page to the socket limit to prevent evil
>> application.
> What I meant that the code for skb->truesize has double added the first
> offset if any left from that vector (partically copied into skb
> directly, and then count pagesize which includes the offset (truesize +=
> PAGE_SIZE)).

Yes, I get you mean. There's no difference between first frag and
others: it's also possible for other frags that didn't occupy the whole
page. Since we pin the whole user page, better to count the whole page
size to prevent evil application.
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2012-05-18 15:22:44

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On Fri, 2012-05-18 at 18:10 +0800, Jason Wang wrote:
> > On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
> >> Didn't see how this affact skb->len. And for truesize, I think they
> >> are
> >> different, when the offset were not zero, the data in this vector
> >> were
> >> divided into two parts. First part is copied into skb directly, and
> >> the
> >> second were pinned from a whole userspace page by
> >> get_user_pages_fast(),
> >> so we need count the whole page to the socket limit to prevent evil
> >> application.
> > What I meant that the code for skb->truesize has double added the
> first
> > offset if any left from that vector (partically copied into skb
> > directly, and then count pagesize which includes the offset
> (truesize +=
> > PAGE_SIZE)).
>
> Yes, I get you mean. There's no difference between first frag and
> others: it's also possible for other frags that didn't occupy the
> whole
> page. Since we pin the whole user page, better to count the whole
> page
> size to prevent evil application.

The difference between first frags and others is other frags might not
occupy the whole page, but the first frags extra offset was doubled
added in skb truesize.

So it's ok for skb->truesize to be bigger than all the skb pinned pages
here?

Thanks
Shirley

2012-05-18 15:29:45

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Fri, 2012-05-18 at 17:58 +0800, Jason Wang wrote:
> On 05/17/2012 11:34 PM, Shirley Ma wrote:
> > On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
> >> The problem is we may stop the tx queue when there no enough
> capacity
> >> to
> >> place packets, at this moment we depends on the tx interrupt to
> >> re-enable the tx queue. So if we didn't poll the vhost during
> >> callback,
> >> guest may lose the tx interrupt to re-enable the tx queue which
> could
> >> stall the whole tx queue.
> > VHOST_MAX_PEND should handle the capacity.
> >
> > Hasn't the above situation been handled in handle_tx() code?:
> > ...
> > if (unlikely(num_pends> VHOST_MAX_PEND)) {
> > tx_poll_start(net, sock);
> >
> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> > break;
> > }
> > ...
> >
> > Thanks
> > Shirley
>
> It may not help in because:
>
> - tx polling depends on skb_orphan() which is often called by device
> driver when it place the packet into the queue of the devices instead
> of when the packets were sent. So it was too early for vhost to be
> notified.
Then do you think it's better to replace with vhost_poll_queue here
instead?

> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
> highly possible that guest needs to be notified when the pending
> packets
> isn't so much.
In which situation the guest needs to be notified when there is no TX
besides buffers run out?

> So this piece of code may not help and could be removed and we need
> to
> poll the virt-queue during zerocopy callback ( through it could be
> further optimized but may not be easy).

Thanks
Shirley

2012-05-21 05:22:54

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/17/2012 03:08 AM, Shirley Ma wrote:
> On Wed, 2012-05-16 at 21:36 +0300, Michael S. Tsirkin wrote:
>> On Wed, May 16, 2012 at 10:32:05AM -0700, Shirley Ma wrote:
>>> On Wed, 2012-05-16 at 18:14 +0300, Michael S. Tsirkin wrote:
>>>> On Wed, May 16, 2012 at 08:10:27AM -0700, Shirley Ma wrote:
>>>>> On Wed, 2012-05-16 at 10:58 +0800, Jason Wang wrote:
>>>>>>>> drivers/vhost/vhost.c | 1 +
>>>>>>>> 1 files changed, 1 insertions(+), 0 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>>>>>>>> index 947f00d..7b75fdf 100644
>>>>>>>> --- a/drivers/vhost/vhost.c
>>>>>>>> +++ b/drivers/vhost/vhost.c
>>>>>>>> @@ -1604,6 +1604,7 @@ void vhost_zerocopy_callback(void
>> *arg)
>>>>>>>> struct vhost_ubuf_ref *ubufs = ubuf->arg;
>>>>>>>> struct vhost_virtqueue *vq = ubufs->vq;
>>>>>>>>
>>>>>>>> + vhost_poll_queue(&vq->poll);
>>>>>>>> /* set len = 1 to mark this desc buffers done DMA
>> */
>>>>>>>> vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
>>>>>>>> kref_put(&ubufs->kref,
>> vhost_zerocopy_done_signal);
>>>>>>> Doing so, we might have redundant vhost_poll_queue(). Do you
>>>> know in
>>>>>>> which scenario there might be missing of adding and
>> signaling
>>>> during
>>>>>>> zerocopy?
>>>>>> Yes, as we only do signaling and adding during tx work, if
>> there's
>>>> no
>>>>>> tx
>>>>>> work when the skb were sent, we may lose the opportunity to
>> let
>>>> guest
>>>>>> know about the completion. It's easy to be reproduced with
>> netperf
>>>>>> test.
>>>>> The reason which host signals guest is to free guest tx buffers,
>> if
>>>>> there is no tx work, then it's not necessary to signal the guest
>>>> unless
>>>>> guest runs out of memory. The pending buffers will be released
>>>>> virtio_net device gone.
>>>>>
>>>>> What's the behavior of netperf test when you hit this situation?
>>>>>
>>>>> Thanks
>>>>> Shirley
>>>> IIRC guest networking seems to be lost.
>>> It seems vhost_enable_notify is missing in somewhere else?
>>>
>>> Thanks
>>> Shirley
>> Donnu. I see virtio sending packets but they do not get
>> to tun on host. debugging.
> I looked at the code, if (zerocopy) check is needed for below code:
>
> + if (zerocopy) {
> num_pends = likely(vq->upend_idx>= vq->done_idx) ?
> (vq->upend_idx - vq->done_idx) :
> (vq->upend_idx + UIO_MAXIOV - vq->done_idx);
> if (unlikely(num_pends> VHOST_MAX_PEND)) {
> tx_poll_start(net, sock);
> vhost_poll_queue
> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> break;
> }
> + }
> if (unlikely(vhost_enable_notify(&net->dev, vq))) {
> vhost_disable_notify(&net->dev, vq);
> continue;
> }
> break;

No objection to add this but looks no difference to me as we won't touch
upend_idx and done_idx when zercocopy is not used.

>
> Second, whether it's possible the problem comes from tx_poll_start()? In
> some situation, vhost_poll_wakeup() is not being called?
>
> Thanks
> Shirley
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-21 06:05:36

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/18/2012 11:29 PM, Shirley Ma wrote:
> On Fri, 2012-05-18 at 17:58 +0800, Jason Wang wrote:
>> On 05/17/2012 11:34 PM, Shirley Ma wrote:
>>> On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
>>>> The problem is we may stop the tx queue when there no enough
>> capacity
>>>> to
>>>> place packets, at this moment we depends on the tx interrupt to
>>>> re-enable the tx queue. So if we didn't poll the vhost during
>>>> callback,
>>>> guest may lose the tx interrupt to re-enable the tx queue which
>> could
>>>> stall the whole tx queue.
>>> VHOST_MAX_PEND should handle the capacity.
>>>
>>> Hasn't the above situation been handled in handle_tx() code?:
>>> ...
>>> if (unlikely(num_pends> VHOST_MAX_PEND)) {
>>> tx_poll_start(net, sock);
>>>
>> set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
>>> break;
>>> }
>>> ...
>>>
>>> Thanks
>>> Shirley
>> It may not help in because:
>>
>> - tx polling depends on skb_orphan() which is often called by device
>> driver when it place the packet into the queue of the devices instead
>> of when the packets were sent. So it was too early for vhost to be
>> notified.
> Then do you think it's better to replace with vhost_poll_queue here
> instead?

Just like what does this patch do - calling vhost_poll_queue() in
vhost_zerocopy_callback().
>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
>> highly possible that guest needs to be notified when the pending
>> packets
>> isn't so much.
> In which situation the guest needs to be notified when there is no TX
> besides buffers run out?

Consider guest call virtqueue_enable_cb_delayed() which means it only
need to be notified when 3/4 of pending buffers ( about 178 buffers
(256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would notify
guest when about 60 buffers were pending. Since tx polling is only
enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
would not be notified to run and guest would never get the interrupt it
expected to re-enable the queue.

And just like what we've discussed, tx polling based adding and
signaling is too early for vhost.
>> So this piece of code may not help and could be removed and we need
>> to
>> poll the virt-queue during zerocopy callback ( through it could be
>> further optimized but may not be easy).
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-21 06:15:46

[permalink] [raw]

Subject: Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation

On 05/18/2012 11:22 PM, Shirley Ma wrote:
> On Fri, 2012-05-18 at 18:10 +0800, Jason Wang wrote:
>>> On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
>>>> Didn't see how this affact skb->len. And for truesize, I think they
>>>> are
>>>> different, when the offset were not zero, the data in this vector
>>>> were
>>>> divided into two parts. First part is copied into skb directly, and
>>>> the
>>>> second were pinned from a whole userspace page by
>>>> get_user_pages_fast(),
>>>> so we need count the whole page to the socket limit to prevent evil
>>>> application.
>>> What I meant that the code for skb->truesize has double added the
>> first
>>> offset if any left from that vector (partically copied into skb
>>> directly, and then count pagesize which includes the offset
>> (truesize +=
>>> PAGE_SIZE)).
>> Yes, I get you mean. There's no difference between first frag and
>> others: it's also possible for other frags that didn't occupy the
>> whole
>> page. Since we pin the whole user page, better to count the whole
>> page
>> size to prevent evil application.
> The difference between first frags and others is other frags might not
> occupy the whole page, but the first frags extra offset was doubled
> added in skb truesize.
>
> So it's ok for skb->truesize to be bigger than all the skb pinned pages
> here?

I think it's ok here and we could find other example such as virtio_net
driver.
>
> Thanks
> Shirley
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-05-21 15:43:32

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> >> - tx polling depends on skb_orphan() which is often called by
> device
> >> driver when it place the packet into the queue of the devices
> instead
> >> of when the packets were sent. So it was too early for vhost to be
> >> notified.
> > Then do you think it's better to replace with vhost_poll_queue here
> > instead?
>
> Just like what does this patch do - calling vhost_poll_queue() in
> vhost_zerocopy_callback().
> >> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
> >> highly possible that guest needs to be notified when the pending
> >> packets
> >> isn't so much.
> > In which situation the guest needs to be notified when there is no
> TX
> > besides buffers run out?
>
> Consider guest call virtqueue_enable_cb_delayed() which means it only
> need to be notified when 3/4 of pending buffers ( about 178 buffers
> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> notify
> guest when about 60 buffers were pending. Since tx polling is only
> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
> would not be notified to run and guest would never get the interrupt
> it
> expected to re-enable the queue.

So it seems we still need vhost_enable_notify() in handle_tx when there
is no tx in zerocopy case.

Do you know which one is more expensive: the cost of vhost_poll_queue()
in each zerocopy callback or calling vhost_enable_notify()?

Have you compared the results by removing below code in handle_tx()?

- if (unlikely(num_pends > VHOST_MAX_PEND)) {
- tx_poll_start(net, sock);
- set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- break;
- }

>
> And just like what we've discussed, tx polling based adding and
> signaling is too early for vhost.

Thanks
Shirley

2012-05-21 16:13:22

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Mon, 2012-05-21 at 08:42 -0700, Shirley Ma wrote:
> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> > >> - tx polling depends on skb_orphan() which is often called by
> > device
> > >> driver when it place the packet into the queue of the devices
> > instead
> > >> of when the packets were sent. So it was too early for vhost to
> be
> > >> notified.
> > > Then do you think it's better to replace with vhost_poll_queue
> here
> > > instead?
> >
> > Just like what does this patch do - calling vhost_poll_queue() in
> > vhost_zerocopy_callback().
> > >> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
> it's
> > >> highly possible that guest needs to be notified when the pending
> > >> packets
> > >> isn't so much.
> > > In which situation the guest needs to be notified when there is no
> > TX
> > > besides buffers run out?
> >
> > Consider guest call virtqueue_enable_cb_delayed() which means it
> only
> > need to be notified when 3/4 of pending buffers ( about 178 buffers
> > (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> > notify
> > guest when about 60 buffers were pending. Since tx polling is only
> > enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
> > would not be notified to run and guest would never get the interrupt
> > it
> > expected to re-enable the queue.
>
> So it seems we still need vhost_enable_notify() in handle_tx when there
> is no tx in zerocopy case.
>
> Do you know which one is more expensive: the cost of
> vhost_poll_queue()
> in each zerocopy callback or calling vhost_enable_notify()?
>
> Have you compared the results by removing below code in handle_tx()?
>
> - if (unlikely(num_pends > VHOST_MAX_PEND)) {
> - tx_poll_start(net, sock);
> - set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> - break;
> - }
> >
> > And just like what we've discussed, tx polling based adding and
> > signaling is too early for vhost.
>

Then it could be too early for vhost to notify guest anywhere in
handle_tx for zerocopy. Then we might need to remove any notification in
handle_tx for zerocopy to vhost zerocopy callback instead.

Adding vhost_poll_queue in vhost zerocopy callback unconditionally would
consume unnecessary cpu.

We need to think about a better solution here.

Thanks
Shirley

2012-05-22 10:06:09

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/21/2012 11:42 PM, Shirley Ma wrote:
> On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
>>>> - tx polling depends on skb_orphan() which is often called by
>> device
>>>> driver when it place the packet into the queue of the devices
>> instead
>>>> of when the packets were sent. So it was too early for vhost to be
>>>> notified.
>>> Then do you think it's better to replace with vhost_poll_queue here
>>> instead?
>> Just like what does this patch do - calling vhost_poll_queue() in
>> vhost_zerocopy_callback().
>>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND, it's
>>>> highly possible that guest needs to be notified when the pending
>>>> packets
>>>> isn't so much.
>>> In which situation the guest needs to be notified when there is no
>> TX
>>> besides buffers run out?
>> Consider guest call virtqueue_enable_cb_delayed() which means it only
>> need to be notified when 3/4 of pending buffers ( about 178 buffers
>> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
>> notify
>> guest when about 60 buffers were pending. Since tx polling is only
>> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
>> would not be notified to run and guest would never get the interrupt
>> it
>> expected to re-enable the queue.
> So it seems we still need vhost_enable_notify() in handle_tx when there
> is no tx in zerocopy case.
>
> Do you know which one is more expensive: the cost of vhost_poll_queue()
> in each zerocopy callback or calling vhost_enable_notify()?

Didn't follow here, do you mean vhost_signal() here?
>
> Have you compared the results by removing below code in handle_tx()?
>
> - if (unlikely(num_pends> VHOST_MAX_PEND)) {
> - tx_poll_start(net, sock);
> - set_bit(SOCK_ASYNC_NOSPACE,&sock->flags);
> - break;
> - }

I remember I've done some basic test when I send this patch, there's no
much increasing of cpu utilization. Would double check this again.
>> And just like what we've discussed, tx polling based adding and
>> signaling is too early for vhost.
> Thanks
> Shirley
>

2012-05-22 10:13:37

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On 05/22/2012 12:12 AM, Shirley Ma wrote:
> Then it could be too early for vhost to notify guest anywhere in
> handle_tx for zerocopy. Then we might need to remove any notification in
> handle_tx for zerocopy to vhost zerocopy callback instead.
>
> Adding vhost_poll_queue in vhost zerocopy callback unconditionally would
> consume unnecessary cpu.
>
> We need to think about a better solution here.

A possible idea is only call vhost_poll_queue only when the packet of
used_event were sent: if there's no out-of-order completion vhost could
signal the guest; if there is, let the pending packets before used_event
call vhost_poll_queue.
> Thanks
> Shirley

2012-05-22 15:56:18

[permalink] [raw]

Subject: Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback

On Tue, 2012-05-22 at 18:05 +0800, Jason Wang wrote:
> On 05/21/2012 11:42 PM, Shirley Ma wrote:
> > On Mon, 2012-05-21 at 14:05 +0800, Jason Wang wrote:
> >>>> - tx polling depends on skb_orphan() which is often called by
> >> device
> >>>> driver when it place the packet into the queue of the devices
> >> instead
> >>>> of when the packets were sent. So it was too early for vhost to
> be
> >>>> notified.
> >>> Then do you think it's better to replace with vhost_poll_queue
> here
> >>> instead?
> >> Just like what does this patch do - calling vhost_poll_queue() in
> >> vhost_zerocopy_callback().
> >>>> - it only works when the pending DMAs exceeds VHOST_MAX_PEND,
> it's
> >>>> highly possible that guest needs to be notified when the pending
> >>>> packets
> >>>> isn't so much.
> >>> In which situation the guest needs to be notified when there is no
> >> TX
> >>> besides buffers run out?
> >> Consider guest call virtqueue_enable_cb_delayed() which means it
> only
> >> need to be notified when 3/4 of pending buffers ( about 178 buffers
> >> (256-MAX_SKB_FRAGS-2)*3/4 ) were sent by host. So vhost_net would
> >> notify
> >> guest when about 60 buffers were pending. Since tx polling is only
> >> enabled when pending packets exceeds VHOST_MAX_PEND 128, so tx work
> >> would not be notified to run and guest would never get the
> interrupt
> >> it
> >> expected to re-enable the queue.
> > So it seems we still need vhost_enable_notify() in handle_tx when
> there
> > is no tx in zerocopy case.
> >
> > Do you know which one is more expensive: the cost of
> vhost_poll_queue()
> > in each zerocopy callback or calling vhost_enable_notify()?
>
> Didn't follow here, do you mean vhost_signal() here?

I meant removing the code in handle_tx for zerocopy as below:

+ if (zcopy) {
/* If more outstanding DMAs, queue the work.
* Handle upend_idx wrap around
*/
num_pends = likely(vq->upend_idx >= vq->done_idx) ?
(vq->upend_idx - vq->done_idx) :
(vq->upend_idx + UIO_MAXIOV - vq->done_idx);
+ /* zerocopy vhost_enable_notify is under zerocopy callback
+ * since it could be too early to notify here */
+ break;
+ }
- if (unlikely(num_pends > VHOST_MAX_PEND)) {
- tx_poll_start(net, sock);
- set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
- break;
- }
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;

Thanks
Shirley

2012-05-23 10:30:16