Hello,
DESCRIPTION
this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
current implementation for TCP as much as possible:
1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
flag will be ignored (e.g. without completion).
2) Kernel uses completions from socket's error queue. Single completion
for single tx syscall (or it can merge several completions to single
one). I used already implemented logic for MSG_ZEROCOPY support:
'msg_zerocopy_realloc()' etc.
Difference with copy way is not significant. During packet allocation,
non-linear skb is created, then I call 'get_user_pages()' for each page
from user's iov iterator (I think i don't need 'pin_user_pages()' as
there is no backing storage for these pages) and add each returned page
to the skb as fragment. There are also some updates for vhost and guest
parts of transport - in both cases i've added handling of non-linear skb
for virtio part. vhost copies data from such skb to the guest's rx virtio
buffers. In the guest, virtio transport fills virtio queue with pages
from skb.
I think doc in Documentation/networking/msg_zerocopy.rst could be also
updated in next versions.
This version has several limits/problems:
1) As this feature totally depends on transport, there is no way (or it
is difficult) to check whether transport is able to handle it or not
during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
setsockopt callback from setsockopt callback for SOL_SOCKET, but this
leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
are not considered to be called from each other. So in current version
SO_ZEROCOPY is set successfully to any type (e.g. transport) of
AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
tx routine will fail with EOPNOTSUPP.
2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
one completion. In each completion there is flag which shows how tx
was performed: zerocopy or copy. This leads that whole message must
be send in zerocopy or copy way - we can't send part of message with
copying and rest of message with zerocopy mode (or vice versa). Now,
we need to account vsock credit logic, e.g. we can't send whole data
once - only allowed number of bytes could sent at any moment. In case
of copying way there is no problem as in worst case we can send single
bytes, but zerocopy is more complex because smallest transmission
unit is single page. So if there is not enough space at peer's side
to send integer number of pages (at least one) - we will wait, thus
stalling tx side. To overcome this problem i've added simple rule -
zerocopy is possible only when there is enough space at another side
for whole message (to check, that current 'msghdr' was already used
in previous tx iterations i use 'iov_offset' field of it's iov iter).
3) loopback transport is not supported, because it requires to implement
non-linear skb handling in dequeue logic (as we "send" fragged skb
and "receive" it from the same queue). I'm going to implement it in
next versions.
4) Current implementation sets max length of packet to 64KB. IIUC this
is due to 'kmalloc()' allocated data buffers. I think, in case of
MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
not touched for data - user space pages are used as buffers. Also
this limit trims every message which is > 64KB, thus such messages
will be send in copy mode due to 'iov_offset' check in 2).
PERFORMANCE
Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In zerocopy way we need to wait when
user buffers will be released by kernel, so it something like synchronous
path (wait until device driver will process it), while in copy way we
can feed data to kernel as many as we want, don't care about device
driver. So I compared only time which we spend in 'sendmsg()' syscall.
Also there is limit from 4) above so max buffer size is 64KB. I've
tested this patchset in the nested VM, but i think for V1 it is not a
big deal.
Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc]
Receiver:
./vsock_perf --vsk-size 256M
Number in cell is seconds which senders spends inside tx syscall.
Guest to host transmission:
*-------------------------------*
| | | |
| buf size | copy | zerocopy |
| | | |
*-------------------------------*
| 4KB | 0.26 | 0.042 |
*-------------------------------*
| 16KB | 0.11 | 0.014 |
*-------------------------------*
| 32KB | 0.05 | 0.009 |
*-------------------------------*
| 64KB | 0.04 | 0.005 |
*-------------------------------*
Host to guest transmission:
*--------------------------------*
| | | |
| buf size | copy | zerocopy |
| | | |
*--------------------------------*
| 4KB | 0.049 | 0.034 |
*--------------------------------*
| 16KB | 0.03 | 0.024 |
*--------------------------------*
| 32KB | 0.025 | 0.01 |
*--------------------------------*
| 64KB | 0.028 | 0.01 |
*--------------------------------*
If host fails to send data with "Cannot allocate memory", check value
/proc/sys/net/core/optmem_max - it is accounted during completion skb
allocation.
Zerocopy is faster than classic copy mode, but of course it requires
specific architecture of application due to user pages pinning, buffer
size and alignment. In next versions i'm going to fix 64KB barrier to
perform tests with bigger buffer sizes.
TESTING
This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
cover new code as much as possible so there are different cases for
MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
vector types (different sizes, alignments, with unmapped pages).
Thanks, Arseniy
Arseniy Krasnov(12):
vsock: check error queue to set EPOLLERR
vsock: read from socket's error queue
vsock: check for MSG_ZEROCOPY support
vhost/vsock: non-linear skb handling support
vsock/virtio: non-linear skb support
vsock/virtio: non-linear skb handling for TAP dev
vsock/virtio: MGS_ZEROCOPY flag support
vhost/vsock: support MSG_ZEROCOPY for transport
vsock/virtio: support MSG_ZEROCOPY for transport
net/sock: enable setting SO_ZEROCOPY for PF_VSOCK
test/vsock: MSG_ZEROCOPY flag tests
test/vsock: MSG_ZEROCOPY support for vsock_perf
drivers/vhost/vsock.c | 62 +++-
include/linux/socket.h | 1 +
include/linux/virtio_vsock.h | 12 +
include/net/af_vsock.h | 2 +
net/core/sock.c | 4 +-
net/vmw_vsock/af_vsock.c | 35 ++-
net/vmw_vsock/virtio_transport.c | 38 ++-
net/vmw_vsock/virtio_transport_common.c | 255 ++++++++++++++--
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.h | 1 +
tools/testing/vsock/vsock_perf.c | 127 +++++++-
tools/testing/vsock/vsock_test.c | 11 +
tools/testing/vsock/vsock_test_zerocopy.c | 470 ++++++++++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 12 +
14 files changed, 991 insertions(+), 41 deletions(-)
--
2.25.1
If socket's error queue is not empty, EPOLLERR must be set.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/af_vsock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 19aea7cba26e..b5e51ef4a74c 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1026,7 +1026,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
poll_wait(file, sk_sleep(sk), wait);
mask = 0;
- if (sk->sk_err)
+ if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
/* Signify that there has been an error on this socket. */
mask |= EPOLLERR;
--
2.25.1
This adds handling of MSG_ERRQUEUE input flag for receive call, thus
skb from socket's error queue is read.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/linux/socket.h | 1 +
net/vmw_vsock/af_vsock.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..19a6f39fa014 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -379,6 +379,7 @@ struct ucred {
#define SOL_MPTCP 284
#define SOL_MCTP 285
#define SOL_SMC 286
+#define SOL_VSOCK 287
/* IPX options */
#define IPX_TYPE 1
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index b5e51ef4a74c..f752b30b71d6 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -110,6 +110,7 @@
#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <linux/errqueue.h>
static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
static void vsock_sk_destruct(struct sock *sk);
@@ -2086,6 +2087,27 @@ static int __vsock_seqpacket_recvmsg(struct sock *sk, struct msghdr *msg,
return err;
}
+static int vsock_err_recvmsg(struct sock *sk, struct msghdr *msg)
+{
+ struct sock_extended_err *ee;
+ struct sk_buff *skb;
+ int err;
+
+ lock_sock(sk);
+ skb = sock_dequeue_err_skb(sk);
+ release_sock(sk);
+
+ if (!skb)
+ return -EAGAIN;
+
+ ee = &SKB_EXT_ERR(skb)->ee;
+ err = put_cmsg(msg, SOL_VSOCK, 0, sizeof(*ee), ee);
+ msg->msg_flags |= MSG_ERRQUEUE;
+ consume_skb(skb);
+
+ return err;
+}
+
static int
vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int flags)
@@ -2096,6 +2118,10 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int err;
sk = sock->sk;
+
+ if (unlikely(flags & MSG_ERRQUEUE))
+ return vsock_err_recvmsg(sk, msg);
+
vsk = vsock_sk(sk);
err = 0;
--
2.25.1
This feature totally depends on transport, so if transport doesn't
support it, return error.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/net/af_vsock.h | 2 ++
net/vmw_vsock/af_vsock.c | 7 +++++++
2 files changed, 9 insertions(+)
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 568a87c5e0d0..96d829004c81 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -173,6 +173,8 @@ struct vsock_transport {
/* Addressing. */
u32 (*get_local_cid)(void);
+
+ bool (*msgzerocopy_allow)(void);
};
/**** CORE ****/
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index f752b30b71d6..fb0fcb390113 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1788,6 +1788,13 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
+ if (msg->msg_flags & MSG_ZEROCOPY &&
+ (!transport->msgzerocopy_allow ||
+ !transport->msgzerocopy_allow())) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
/* Wait for room in the produce queue to enqueue our user's data. */
timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
--
2.25.1
This adds copying to guest's virtio buffers from non-linear skbs. Such
skbs are created by protocol layer when MSG_ZEROCOPY flags is used.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 56 ++++++++++++++++++++++++++++++++----
include/linux/virtio_vsock.h | 12 ++++++++
2 files changed, 63 insertions(+), 5 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 1f3b89c885cc..60b9cafa3e31 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -86,6 +86,44 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
return NULL;
}
+static int vhost_transport_copy_nonlinear_skb(struct sk_buff *skb,
+ struct iov_iter *iov_iter,
+ size_t len)
+{
+ size_t rest_len = len;
+
+ while (rest_len && virtio_vsock_skb_has_frags(skb)) {
+ struct bio_vec *curr_vec;
+ size_t curr_vec_end;
+ size_t to_copy;
+ int curr_frag;
+ int curr_offs;
+
+ curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
+ curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+ curr_vec = &skb_shinfo(skb)->frags[curr_frag];
+
+ curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
+ to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
+
+ if (copy_page_to_iter(curr_vec->bv_page, curr_offs,
+ to_copy, iov_iter) != to_copy)
+ return -1;
+
+ rest_len -= to_copy;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += to_copy;
+
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off == (curr_vec_end)) {
+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag++;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
+ }
+ }
+
+ skb->data_len -= len;
+
+ return 0;
+}
+
static void
vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct vhost_virtqueue *vq)
@@ -197,11 +235,19 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
- kfree_skb(skb);
- vq_err(vq, "Faulted on copying pkt buf\n");
- break;
+ if (skb_is_nonlinear(skb)) {
+ if (vhost_transport_copy_nonlinear_skb(skb, &iov_iter,
+ payload_len)) {
+ vq_err(vq, "Faulted on copying pkt buf from page\n");
+ break;
+ }
+ } else {
+ nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
+ if (nbytes != payload_len) {
+ kfree_skb(skb);
+ vq_err(vq, "Faulted on copying pkt buf\n");
+ break;
+ }
}
/* Deliver to monitoring devices all packets that we
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 3f9c16611306..e7efdb78ce6e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,10 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+ /* Current fragment in 'frags' of skb. */
+ u32 curr_frag;
+ /* Offset from 0 in current fragment. */
+ u32 frag_off;
};
#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
@@ -46,6 +50,14 @@ static inline void virtio_vsock_skb_clear_tap_delivered(struct sk_buff *skb)
VIRTIO_VSOCK_SKB_CB(skb)->tap_delivered = false;
}
+static inline bool virtio_vsock_skb_has_frags(struct sk_buff *skb)
+{
+ if (!skb_is_nonlinear(skb))
+ return false;
+
+ return VIRTIO_VSOCK_SKB_CB(skb)->curr_frag != skb_shinfo(skb)->nr_frags;
+}
+
static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
{
u32 len;
--
2.25.1
Use pages of non-linear skb as buffers in virtio tx queue.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 28b5a8e8e094..b8a7d6dc9f46 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
- struct scatterlist hdr, buf, *sgs[2];
+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +112,30 @@ virtio_transport_send_pkt_work(struct work_struct *work)
virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+ sg_init_one(&bufs[0], virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
+ sgs[out_sg++] = &bufs[0];
+
+ if (skb_is_nonlinear(skb)) {
+ int i;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ struct page *data_page = skb_shinfo(skb)->frags[i].bv_page;
+
+ /* We will use 'page_to_virt()' for userspace page here,
+ * because virtio layer will call 'virt_to_phys()' later
+ * to fill buffer descriptor. We don't touch memory at
+ * "virtual" address of this page.
+ */
+ sg_init_one(&bufs[i + 1],
+ page_to_virt(data_page), PAGE_SIZE);
+ sgs[out_sg++] = &bufs[i + 1];
+ }
+ } else {
+ if (skb->len > 0) {
+ sg_init_one(&bufs[1], skb->data, skb->len);
+ sgs[out_sg++] = &bufs[1];
+ }
- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
- sgs[out_sg++] = &hdr;
- if (skb->len > 0) {
- sg_init_one(&buf, skb->data, skb->len);
- sgs[out_sg++] = &buf;
}
ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
--
2.25.1
For TAP device new skb is created and data from the current skb is
copied to it. This adds copying data from non-linear skb to new
the skb.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 43 +++++++++++++++++++++++--
1 file changed, 40 insertions(+), 3 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index a1581c77cf84..05ce97b967ad 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -101,6 +101,39 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}
+static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ size_t rest_len = len;
+
+ while (rest_len && virtio_vsock_skb_has_frags(skb)) {
+ struct bio_vec *curr_vec;
+ size_t curr_vec_end;
+ size_t to_copy;
+ int curr_frag;
+ int curr_offs;
+
+ curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
+ curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+ curr_vec = &skb_shinfo(skb)->frags[curr_frag];
+
+ curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
+ to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
+
+ memcpy(dst, page_to_virt(curr_vec->bv_page) + curr_offs,
+ to_copy);
+
+ rest_len -= to_copy;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += to_copy;
+
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off == (curr_vec_end)) {
+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag++;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
+ }
+ }
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -109,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;
/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -117,7 +149,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;
skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -160,7 +191,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(skb)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(skb, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}
return skb;
--
2.25.1
This adds main logic of MSG_ZEROCOPY flag processing for packet
creation. When this flag is set and user's iov iterator fits for
zerocopy transmission, call 'get_user_pages()' and add returned
pages to the newly created skb.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 212 ++++++++++++++++++++++--
1 file changed, 195 insertions(+), 17 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 05ce97b967ad..69e37f8a68a6 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,6 +37,169 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}
+static int virtio_transport_can_zcopy(struct iov_iter *iov_iter,
+ size_t free_space)
+{
+ size_t pages;
+ int i;
+
+ if (!iter_is_iovec(iov_iter))
+ return -1;
+
+ if (iov_iter->iov_offset)
+ return -1;
+
+ /* We can't send whole iov. */
+ if (free_space < iov_iter->count)
+ return -1;
+
+ for (pages = 0, i = 0; i < iov_iter->nr_segs; i++) {
+ const struct iovec *iovec;
+ int pages_in_elem;
+
+ iovec = &iov_iter->iov[i];
+
+ /* Base must be page aligned. */
+ if (offset_in_page(iovec->iov_base))
+ return -1;
+
+ /* Only last element could have not page aligned size. */
+ if (i != (iov_iter->nr_segs - 1)) {
+ if (offset_in_page(iovec->iov_len))
+ return -1;
+
+ pages_in_elem = iovec->iov_len >> PAGE_SHIFT;
+ } else {
+ pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE);
+ pages_in_elem >>= PAGE_SHIFT;
+ }
+
+ /* In case of user's pages - one page is one frag. */
+ if (pages + pages_in_elem > MAX_SKB_FRAGS)
+ return -1;
+
+ pages += pages_in_elem;
+ }
+
+ return 0;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+ struct sk_buff *skb,
+ struct iov_iter *iter,
+ bool zerocopy)
+{
+ struct ubuf_info_msgzc *uarg_zc;
+ struct ubuf_info *uarg;
+
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+ iov_length(iter->iov, iter->nr_segs),
+ NULL);
+
+ if (!uarg)
+ return -1;
+
+ uarg_zc = uarg_to_msgzc(uarg);
+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
+
+ skb_zcopy_init(skb, uarg);
+
+ return 0;
+}
+
+static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb,
+ struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info)
+{
+ struct iov_iter *iter;
+ int frag_idx;
+ int seg_idx;
+
+ iter = &info->msg->msg_iter;
+ frag_idx = 0;
+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
+
+ /* At this moment:
+ * 1) 'iov_offset' is zero.
+ * 2) Every 'iov_base' and 'iov_len' are also page aligned
+ * (except length of the last element).
+ * 3) Number of pages in this iov <= MAX_SKB_FRAGS.
+ * 4) Length of the data fits in current credit space.
+ */
+ for (seg_idx = 0; seg_idx < iter->nr_segs; seg_idx++) {
+ struct page *user_pages[MAX_SKB_FRAGS];
+ const struct iovec *iovec;
+ size_t last_frag_len;
+ size_t pages_in_seg;
+ int page_idx;
+
+ iovec = &iter->iov[seg_idx];
+ pages_in_seg = iovec->iov_len >> PAGE_SHIFT;
+
+ if (iovec->iov_len % PAGE_SIZE) {
+ last_frag_len = iovec->iov_len % PAGE_SIZE;
+ pages_in_seg++;
+ } else {
+ last_frag_len = PAGE_SIZE;
+ }
+
+ if (get_user_pages((unsigned long)iovec->iov_base,
+ pages_in_seg, FOLL_GET, user_pages,
+ NULL) != pages_in_seg)
+ return -1;
+
+ for (page_idx = 0; page_idx < pages_in_seg; page_idx++) {
+ int frag_len = PAGE_SIZE;
+
+ if (page_idx == (pages_in_seg - 1))
+ frag_len = last_frag_len;
+
+ skb_fill_page_desc(skb, frag_idx++,
+ user_pages[page_idx], 0,
+ frag_len);
+ skb_len_add(skb, frag_len);
+ }
+ }
+
+ return virtio_transport_init_zcopy_skb(vsk, skb, iter, true);
+}
+
+static int virtio_transport_copy_payload(struct sk_buff *skb,
+ struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t len)
+{
+ void *payload;
+ int err;
+
+ payload = skb_put(skb, len);
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err)
+ return -1;
+
+ if (msg_data_left(info->msg))
+ return 0;
+
+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+ struct virtio_vsock_hdr *hdr;
+
+ hdr = virtio_vsock_hdr(skb);
+
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+ if (info->msg->msg_flags & MSG_EOR)
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ }
+
+ if (info->flags & MSG_ZEROCOPY)
+ return virtio_transport_init_zcopy_skb(vsk, skb,
+ &info->msg->msg_iter,
+ false);
+
+ return 0;
+}
+
/* Returns a new packet on success, otherwise returns NULL.
*
* If NULL is returned, errp is set to a negative errno.
@@ -47,15 +210,31 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
u32 src_cid,
u32 src_port,
u32 dst_cid,
- u32 dst_port)
+ u32 dst_port,
+ struct vsock_sock *vsk)
{
- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
+ const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
struct virtio_vsock_hdr *hdr;
struct sk_buff *skb;
- void *payload;
- int err;
+ bool use_zcopy = false;
+
+ if (info->msg) {
+ /* If SOCK_ZEROCOPY is not enabled, ignore MSG_ZEROCOPY
+ * flag later and continue in classic way(e.g. without
+ * completion).
+ */
+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
+ info->flags &= ~MSG_ZEROCOPY;
+ } else {
+ if ((info->flags & MSG_ZEROCOPY) &&
+ !virtio_transport_can_zcopy(&info->msg->msg_iter, len)) {
+ use_zcopy = true;
+ }
+ }
+ }
- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+ /* For MSG_ZEROCOPY length will be added later. */
+ skb = virtio_vsock_alloc_skb(skb_len + (use_zcopy ? 0 : len), GFP_KERNEL);
if (!skb)
return NULL;
@@ -70,18 +249,15 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
hdr->len = cpu_to_le32(len);
if (info->msg && len > 0) {
- payload = skb_put(skb, len);
- err = memcpy_from_msg(payload, info->msg, len);
- if (err)
- goto out;
+ int err;
- if (msg_data_left(info->msg) == 0 &&
- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+ if (use_zcopy)
+ err = virtio_transport_fill_nonlinear_skb(skb, vsk, info);
+ else
+ err = virtio_transport_copy_payload(skb, vsk, info, len);
- if (info->msg->msg_flags & MSG_EOR)
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
- }
+ if (err)
+ goto out;
}
if (info->reply)
@@ -266,7 +442,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
skb = virtio_transport_alloc_skb(info, pkt_len,
src_cid, src_port,
- dst_cid, dst_port);
+ dst_cid, dst_port,
+ vsk);
if (!skb) {
virtio_transport_put_credit(vvs, pkt_len);
return -ENOMEM;
@@ -842,6 +1019,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
+ .flags = msg->msg_flags,
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -894,7 +1072,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
le64_to_cpu(hdr->dst_cid),
le32_to_cpu(hdr->dst_port),
le64_to_cpu(hdr->src_cid),
- le32_to_cpu(hdr->src_port));
+ le32_to_cpu(hdr->src_port), NULL);
if (!reply)
return -ENOMEM;
--
2.25.1
Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 60b9cafa3e31..afaa80203528 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -440,6 +440,11 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
return val < vq->num;
}
+static bool vhost_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool vhost_transport_seqpacket_allow(u32 remote_cid);
static struct virtio_transport vhost_transport = {
@@ -485,6 +490,7 @@ static struct virtio_transport vhost_transport = {
.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
.notify_buffer_size = virtio_transport_notify_buffer_size,
+ .msgzerocopy_allow = vhost_transport_msgzerocopy_allow,
},
.send_pkt = vhost_transport_send_pkt,
--
2.25.1
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index b8a7d6dc9f46..4c5f19015df7 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -432,6 +432,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}
+static bool virtio_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);
static struct virtio_transport virtio_transport = {
@@ -476,6 +481,8 @@ static struct virtio_transport virtio_transport = {
.notify_send_pre_enqueue = virtio_transport_notify_send_pre_enqueue,
.notify_send_post_enqueue = virtio_transport_notify_send_post_enqueue,
.notify_buffer_size = virtio_transport_notify_buffer_size,
+
+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,
},
.send_pkt = virtio_transport_send_pkt,
--
2.25.1
PF_VSOCK supports MSG_ZEROCOPY transmission, so SO_ZEROCOPY could
be enabled.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/core/sock.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7ba4891460ad..61f15de84e82 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1457,9 +1457,11 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
(sk->sk_type == SOCK_DGRAM &&
sk->sk_protocol == IPPROTO_UDP)))
ret = -EOPNOTSUPP;
- } else if (sk->sk_family != PF_RDS) {
+ } else if (sk->sk_family != PF_RDS &&
+ sk->sk_family != PF_VSOCK) {
ret = -EOPNOTSUPP;
}
+
if (!ret) {
if (val < 0 || val > 1)
ret = -EINVAL;
--
2.25.1
This adds set of tests for MSG_ZEROCOPY flag.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.h | 1 +
tools/testing/vsock/vsock_test.c | 11 +
tools/testing/vsock/vsock_test_zerocopy.c | 470 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 12 +
5 files changed, 495 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 43a254f0e14d..0a78787d1d92 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
all: test vsock_perf
test: vsock_test vsock_diag_test
-vsock_test: vsock_test.o timeout.o control.o util.o
+vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
vsock_perf: vsock_perf.o
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..46ba1d3202b8 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -2,6 +2,7 @@
#ifndef UTIL_H
#define UTIL_H
+#include <stdbool.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index 67e9f9df3a8c..25373cd18ef1 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -20,6 +20,7 @@
#include <sys/mman.h>
#include <poll.h>
+#include "vsock_test_zerocopy.h"
#include "timeout.h"
#include "control.h"
#include "util.h"
@@ -920,6 +921,16 @@ static struct test_case test_cases[] = {
.run_client = test_seqpacket_bigmsg_client,
.run_server = test_seqpacket_bigmsg_server,
},
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY",
+ .run_client = test_stream_msg_zcopy_client,
+ .run_server = test_stream_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY empty MSG_ERRQUEUE",
+ .run_client = test_stream_msg_zcopy_empty_errq_client,
+ .run_server = test_stream_msg_zcopy_empty_errq_server,
+ },
{},
};
diff --git a/tools/testing/vsock/vsock_test_zerocopy.c b/tools/testing/vsock/vsock_test_zerocopy.c
new file mode 100644
index 000000000000..1214391d3c99
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.c
@@ -0,0 +1,470 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* MSG_ZEROCOPY feature tests for vsock
+ *
+ * Copyright (C) 2023 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include <linux/errqueue.h>
+#include <linux/kernel.h>
+#include <error.h>
+#include <errno.h>
+
+#include "control.h"
+#include "vsock_test_zerocopy.h"
+
+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
+#define PAGE_SIZE 4096
+#define POLL_TIMEOUT_MS 2000
+#define SENDMSG_RES_IOV_LEN (-2)
+
+struct zerocopy_test_data {
+ bool zerocopied;
+ bool completion;
+ int sendmsg_errno;
+ ssize_t sendmsg_res;
+ int vecs_cnt;
+ struct iovec vecs[3];
+};
+
+static void do_recv_completion(int fd, bool zerocopied, bool completion)
+{
+ struct sock_extended_err *serr;
+ struct msghdr msg = { 0 };
+ struct pollfd fds = { 0 };
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ uint32_t hi, lo;
+
+ fds.fd = fd;
+ fds.events = 0;
+
+ if (poll(&fds, 1, POLL_TIMEOUT_MS) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ if (completion) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ } else {
+ return;
+ }
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ recvmsg(fd, &msg, MSG_ERRQUEUE);
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_type != 0) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ printf("serr: wrong origin: %u\n", serr->ee_origin);
+ exit(EXIT_FAILURE);
+ }
+
+ if (serr->ee_errno) {
+ printf("serr: wrong error code: %u\n", serr->ee_errno);
+ exit(EXIT_FAILURE);
+ }
+
+ hi = serr->ee_data;
+ lo = serr->ee_info;
+ if (hi != lo) {
+ fprintf(stderr, "serr: expected hi == lo\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (hi) {
+ fprintf(stderr, "serr: expected hi == lo == 0\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (zerocopied && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was copy instead of zerocopy\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!zerocopied && !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was zerocopy instead of copy\n");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error(1, errno, "setsockopt");
+}
+
+static void *mmap_no_fail(size_t bytes)
+{
+ void *res;
+
+ res = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (res == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ return res;
+}
+
+static size_t iovec_bytes(const struct iovec *iov, size_t iovnum)
+{
+ size_t bytes;
+ int i;
+
+ for (bytes = 0, i = 0; i < iovnum; i++)
+ bytes += iov[i].iov_len;
+
+ return bytes;
+}
+
+static void iovec_random_init(struct iovec *iov,
+ const struct zerocopy_test_data *test_data)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ int j;
+
+ if (test_data->vecs[i].iov_base == MAP_FAILED)
+ continue;
+
+ for (j = 0; j < iov[i].iov_len; j++)
+ ((uint8_t *)iov[i].iov_base)[j] = rand() & 0xff;
+ }
+}
+
+static unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum)
+{
+ unsigned long hash;
+ size_t iov_bytes;
+ size_t offs;
+ void *tmp;
+ int i;
+
+ iov_bytes = iovec_bytes(iov, iovnum);
+
+ tmp = malloc(iov_bytes);
+ if (!tmp) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (offs = 0, i = 0; i < iovnum; i++) {
+ memcpy(tmp + offs, iov[i].iov_base, iov[i].iov_len);
+ offs += iov[i].iov_len;
+ }
+
+ hash = hash_djb2(tmp, iov_bytes);
+ free(tmp);
+
+ return hash;
+}
+
+static struct zerocopy_test_data test_data_array[] = {
+ /* Last element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, 200 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE * 2 },
+ { NULL, PAGE_SIZE * 3 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has non-page aligned size. */
+ {
+ .zerocopied = false,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has both non-page aligned base and size. */
+ {
+ .zerocopied = false,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { (void *)1, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* One element has invalid base. */
+ {
+ .zerocopied = false,
+ .completion = false,
+ .sendmsg_errno = ENOMEM,
+ .sendmsg_res = -1,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { MAP_FAILED, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Valid data, but SO_ZEROCOPY is off. */
+ {
+ .zerocopied = true,
+ .completion = false,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 1,
+ {
+ { NULL, PAGE_SIZE }
+ }
+ },
+};
+
+static void __test_stream_msg_zerocopy_client(const struct test_opts *opts,
+ const struct zerocopy_test_data *test_data)
+{
+ struct msghdr msg = { 0 };
+ ssize_t sendmsg_res;
+ struct iovec *iovec;
+ int fd;
+ int i;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (test_data->completion)
+ enable_so_zerocopy(fd);
+
+ iovec = malloc(sizeof(*iovec) * test_data->vecs_cnt);
+ if (!iovec) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ iovec[i].iov_len = test_data->vecs[i].iov_len;
+ iovec[i].iov_base = mmap_no_fail(test_data->vecs[i].iov_len);
+ }
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_data->vecs[i].iov_base == MAP_FAILED) {
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ iovec_random_init(iovec, test_data);
+
+ msg.msg_iov = iovec;
+ msg.msg_iovlen = test_data->vecs_cnt;
+
+ errno = 0;
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ sendmsg_res = iovec_bytes(iovec, test_data->vecs_cnt);
+ else
+ sendmsg_res = test_data->sendmsg_res;
+
+ if (sendmsg(fd, &msg, MSG_ZEROCOPY) != sendmsg_res) {
+ perror("send");
+ exit(EXIT_FAILURE);
+ }
+
+ if (errno != test_data->sendmsg_errno) {
+ fprintf(stderr, "expected 'errno' == %i, got %i\n",
+ test_data->sendmsg_errno, errno);
+ exit(EXIT_FAILURE);
+ }
+
+ do_recv_completion(fd, test_data->zerocopied, test_data->completion);
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
+ else
+ control_writeulong(0);
+
+ free(iovec);
+ close(fd);
+ control_writeln("DONE");
+}
+
+static void __test_stream_msg_zerocopy_server(const struct test_opts *opts,
+ const struct zerocopy_test_data *test_data)
+{
+ unsigned long remote_hash;
+ unsigned long local_hash;
+ ssize_t total_bytes_rec;
+ unsigned char *data;
+ size_t data_len;
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ data_len = iovec_bytes(test_data->vecs, test_data->vecs_cnt);
+
+ data = malloc(data_len);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ total_bytes_rec = 0;
+
+ while (total_bytes_rec != data_len) {
+ ssize_t bytes_rec;
+
+ bytes_rec = read(fd, data + total_bytes_rec,
+ data_len - total_bytes_rec);
+ if (bytes_rec <= 0)
+ break;
+
+ total_bytes_rec += bytes_rec;
+ }
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ local_hash = hash_djb2(data, data_len);
+ else
+ local_hash = 0;
+
+ /* Waiting for some result. */
+ remote_hash = control_readulong();
+ if (remote_hash != local_hash) {
+ fprintf(stderr, "hash mismatch\n");
+ exit(EXIT_FAILURE);
+ }
+
+ close(fd);
+ control_expectln("DONE");
+}
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_msg_zerocopy_client(opts, &test_data_array[i]);
+}
+
+void test_stream_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_msg_zerocopy_server(opts, &test_data_array[i]);
+}
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts)
+{
+ struct msghdr msg = { 0 };
+ char cmsg_data[128];
+ ssize_t res;
+ int fd;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res != -1) {
+ fprintf(stderr, "expected 'recvmsg(2)' failure, got %zi\n",
+ res);
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
diff --git a/tools/testing/vsock/vsock_test_zerocopy.h b/tools/testing/vsock/vsock_test_zerocopy.h
new file mode 100644
index 000000000000..705a1e90f41a
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VSOCK_TEST_ZEROCOPY_H
+#define VSOCK_TEST_ZEROCOPY_H
+#include "util.h"
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_server(const struct test_opts *opts);
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts);
+
+#endif /* VSOCK_TEST_ZEROCOPY_H */
--
2.25.1
To use this option pass '--zc' parameter:
./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>
With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/vsock_perf.c | 127 +++++++++++++++++++++++++++++--
1 file changed, 120 insertions(+), 7 deletions(-)
diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
index a72520338f84..1d435be9b48e 100644
--- a/tools/testing/vsock/vsock_perf.c
+++ b/tools/testing/vsock/vsock_perf.c
@@ -18,6 +18,8 @@
#include <poll.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>
#define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
#define DEFAULT_TO_SEND_BYTES (64 * 1024)
@@ -28,9 +30,14 @@
#define BYTES_PER_GB (1024 * 1024 * 1024ULL)
#define NSEC_PER_SEC (1000000000ULL)
+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
static unsigned int port = DEFAULT_PORT;
static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
+static bool zerocopy;
static void error(const char *s)
{
@@ -247,15 +254,74 @@ static void run_receiver(unsigned long rcvlowat_bytes)
close(fd);
}
+static void recv_completion(int fd)
+{
+ struct sock_extended_err *serr;
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ struct msghdr msg;
+ int ret;
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret == -1)
+ return;
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ return;
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ return;
+ }
+
+ if (cm->cmsg_type) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ return;
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin\n");
+ return;
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code\n");
+ return;
+ }
+
+ if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
+ fprintf(stderr, "warning: copy instead of zerocopy\n");
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error("setsockopt(SO_ZEROCOPY)");
+}
+
static void run_sender(int peer_cid, unsigned long to_send_bytes)
{
time_t tx_begin_ns;
time_t tx_total_ns;
size_t total_send;
+ time_t time_in_send;
void *data;
int fd;
- printf("Run as sender\n");
+ if (zerocopy)
+ printf("Run as sender MSG_ZEROCOPY\n");
+ else
+ printf("Run as sender\n");
+
printf("Connect to %i:%u\n", peer_cid, port);
printf("Send %lu bytes\n", to_send_bytes);
printf("TX buffer %lu bytes\n", buf_size_bytes);
@@ -265,25 +331,58 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
if (fd < 0)
exit(EXIT_FAILURE);
- data = malloc(buf_size_bytes);
+ if (zerocopy) {
+ enable_so_zerocopy(fd);
- if (!data) {
- fprintf(stderr, "'malloc()' failed\n");
- exit(EXIT_FAILURE);
+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (data == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+ } else {
+ data = malloc(buf_size_bytes);
+
+ if (!data) {
+ fprintf(stderr, "'malloc()' failed\n");
+ exit(EXIT_FAILURE);
+ }
}
memset(data, 0, buf_size_bytes);
total_send = 0;
+ time_in_send = 0;
tx_begin_ns = current_nsec();
while (total_send < to_send_bytes) {
ssize_t sent;
+ size_t rest_bytes;
+ time_t before;
+
+ rest_bytes = to_send_bytes - total_send;
- sent = write(fd, data, buf_size_bytes);
+ before = current_nsec();
+ sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
+ buf_size_bytes : rest_bytes,
+ zerocopy ? MSG_ZEROCOPY : 0);
+ time_in_send += (current_nsec() - before);
if (sent <= 0)
error("write");
+ if (zerocopy) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ recv_completion(fd);
+ }
+
total_send += sent;
}
@@ -294,9 +393,14 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
get_gbps(total_send * 8, tx_total_ns));
printf("total time in 'write()': %f sec\n",
(float)tx_total_ns / NSEC_PER_SEC);
+ printf("time in send %f\n", (float)time_in_send / NSEC_PER_SEC);
close(fd);
- free(data);
+
+ if (zerocopy)
+ munmap(data, buf_size_bytes);
+ else
+ free(data);
}
static const char optstring[] = "";
@@ -336,6 +440,11 @@ static const struct option longopts[] = {
.has_arg = required_argument,
.val = 'R',
},
+ {
+ .name = "zc",
+ .has_arg = no_argument,
+ .val = 'Z',
+ },
{},
};
@@ -351,6 +460,7 @@ static void usage(void)
" --help This message\n"
" --sender <cid> Sender mode (receiver default)\n"
" <cid> of the receiver to connect to\n"
+ " --zc Enable zerocopy\n"
" --port <port> Port (default %d)\n"
" --bytes <bytes>KMG Bytes to send (default %d)\n"
" --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
@@ -413,6 +523,9 @@ int main(int argc, char **argv)
case 'H': /* Help. */
usage();
break;
+ case 'Z': /* Zerocopy. */
+ zerocopy = true;
+ break;
default:
usage();
}
--
2.25.1
Hi Arseniy,
sorry for the delay, but I was offline.
On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote:
>Hello,
>
> DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> flag will be ignored (e.g. without completion).
>
>2) Kernel uses completions from socket's error queue. Single completion
> for single tx syscall (or it can merge several completions to single
> one). I used already implemented logic for MSG_ZEROCOPY support:
> 'msg_zerocopy_realloc()' etc.
I will review for the vsock point of view. Hope some net maintainers can
comment about SO_ZEROCOPY.
Anyway I think is a good idea to keep it as close as possible to the TCP
implementation.
>
>Difference with copy way is not significant. During packet allocation,
>non-linear skb is created, then I call 'get_user_pages()' for each page
>from user's iov iterator (I think i don't need 'pin_user_pages()' as
Are these pages exposed to the host via virtqueues? If so, I think we
should pin them. What happens if the host accesses them but these pages
have been unmapped?
>there is no backing storage for these pages) and add each returned page
>to the skb as fragment. There are also some updates for vhost and guest
>parts of transport - in both cases i've added handling of non-linear skb
>for virtio part. vhost copies data from such skb to the guest's rx virtio
>buffers. In the guest, virtio transport fills virtio queue with pages
>from skb.
>
>I think doc in Documentation/networking/msg_zerocopy.rst could be also
>updated in next versions.
Yep, good idea.
>
>This version has several limits/problems:
>
>1) As this feature totally depends on transport, there is no way (or it
> is difficult) to check whether transport is able to handle it or not
> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> are not considered to be called from each other. So in current version
> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> tx routine will fail with EOPNOTSUPP.
I'll take a look, but if we have no alternative, I think it's okay to
make tx fail.
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> one completion. In each completion there is flag which shows how tx
> was performed: zerocopy or copy. This leads that whole message must
> be send in zerocopy or copy way - we can't send part of message with
> copying and rest of message with zerocopy mode (or vice versa). Now,
> we need to account vsock credit logic, e.g. we can't send whole data
> once - only allowed number of bytes could sent at any moment. In case
> of copying way there is no problem as in worst case we can send single
> bytes, but zerocopy is more complex because smallest transmission
> unit is single page. So if there is not enough space at peer's side
> to send integer number of pages (at least one) - we will wait, thus
> stalling tx side. To overcome this problem i've added simple rule -
> zerocopy is possible only when there is enough space at another side
> for whole message (to check, that current 'msghdr' was already used
> in previous tx iterations i use 'iov_offset' field of it's iov iter).
I see the problem and I think your approach is the right one.
>
>3) loopback transport is not supported, because it requires to implement
> non-linear skb handling in dequeue logic (as we "send" fragged skb
> and "receive" it from the same queue). I'm going to implement it in
> next versions.
loopback is useful for testing and debugging, so it would be great to
have the support, but if it's too complicated, we can do it later.
>
>4) Current implementation sets max length of packet to 64KB. IIUC this
> is due to 'kmalloc()' allocated data buffers. I think, in case of
Yep, I think so.
When I started touching this code, the limit was already there.
As you said, I think it was introduced to have a limit on (host/device
side?) allocation, but buf_alloc might be enough, so maybe we could
also remove it for copy mode.
The only problem I see is compatibility with old devices/drivers, so
maybe we need a feature in the spec to say the limit is gone, or have a
field in the virtio config space where the device specifies its limit
(for the driver, the limit is implicitly that of the buffer allocated
and put in the virtqueue).
This reminded me that Laura had proposed something similar before,
maybe we should take it up again:
https://markmail.org/message/3el4ckeakfilg5wo
> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> not touched for data - user space pages are used as buffers. Also
> this limit trims every message which is > 64KB, thus such messages
> will be send in copy mode due to 'iov_offset' check in 2).
The host still needs to allocate and copy, so maybe the limitation
could be to avoid large allocations in the host, but actually the host
can use vmalloc because it doesn't need them to be contiguous.
>
> PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In zerocopy way we need to wait when
>user buffers will be released by kernel, so it something like synchronous
>path (wait until device driver will process it), while in copy way we
>can feed data to kernel as many as we want, don't care about device
>driver. So I compared only time which we spend in 'sendmsg()' syscall.
>Also there is limit from 4) above so max buffer size is 64KB. I've
>tested this patchset in the nested VM, but i think for V1 it is not a
>big deal.
>
>Sender:
>./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>Number in cell is seconds which senders spends inside tx syscall.
>
>Guest to host transmission:
>
>*-------------------------------*
>| | | |
>| buf size | copy | zerocopy |
>| | | |
>*-------------------------------*
>| 4KB | 0.26 | 0.042 |
>*-------------------------------*
>| 16KB | 0.11 | 0.014 |
>*-------------------------------*
>| 32KB | 0.05 | 0.009 |
>*-------------------------------*
>| 64KB | 0.04 | 0.005 |
>*-------------------------------*
>
>Host to guest transmission:
>
>*--------------------------------*
>| | | |
>| buf size | copy | zerocopy |
>| | | |
>*--------------------------------*
>| 4KB | 0.049 | 0.034 |
>*--------------------------------*
>| 16KB | 0.03 | 0.024 |
>*--------------------------------*
>| 32KB | 0.025 | 0.01 |
>*--------------------------------*
>| 64KB | 0.028 | 0.01 |
>*--------------------------------*
>
>If host fails to send data with "Cannot allocate memory", check value
>/proc/sys/net/core/optmem_max - it is accounted during completion skb
>allocation.
>
>Zerocopy is faster than classic copy mode, but of course it requires
>specific architecture of application due to user pages pinning, buffer
>size and alignment. In next versions i'm going to fix 64KB barrier to
>perform tests with bigger buffer sizes.
Yep, I see.
Adjusting vsock_perf to compare also Gbps (can io_uring helps in this
case to have more requests in-flight?) would be great.
>
> TESTING
>
>This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>cover new code as much as possible so there are different cases for
>MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>vector types (different sizes, alignments, with unmapped pages).
Great! Thanks for adding the tests!
I'll go through the patches between today and Monday.
Sorry again for taking so long!
Thanks,
Stefano
On Mon, Feb 06, 2023 at 06:53:22AM +0000, Arseniy Krasnov wrote:
>If socket's error queue is not empty, EPOLLERR must be set.
Could this patch go regardless of this series?
Can you explain (even in the commit message) what happens without this
patch?
Thanks,
Stefano
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/af_vsock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index 19aea7cba26e..b5e51ef4a74c 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1026,7 +1026,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> poll_wait(file, sk_sleep(sk), wait);
> mask = 0;
>
>- if (sk->sk_err)
>+ if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
> /* Signify that there has been an error on this socket. */
> mask |= EPOLLERR;
>
>--
>2.25.1
On Mon, Feb 06, 2023 at 06:54:51AM +0000, Arseniy Krasnov wrote:
>This adds handling of MSG_ERRQUEUE input flag for receive call, thus
>skb from socket's error queue is read.
A general tip, add a little more description in the commit messages,
especially to explain why these changes are necessary.
Otherwise, even review becomes difficult because one has to look at all
the patches to understand what the previous ones are for.
I know it's boring, but it's very useful for those who review and even
then if we have to bisect ;-)
Thanks,
Stefano
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> include/linux/socket.h | 1 +
> net/vmw_vsock/af_vsock.c | 26 ++++++++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
>diff --git a/include/linux/socket.h b/include/linux/socket.h
>index 13c3a237b9c9..19a6f39fa014 100644
>--- a/include/linux/socket.h
>+++ b/include/linux/socket.h
>@@ -379,6 +379,7 @@ struct ucred {
> #define SOL_MPTCP 284
> #define SOL_MCTP 285
> #define SOL_SMC 286
>+#define SOL_VSOCK 287
>
> /* IPX options */
> #define IPX_TYPE 1
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index b5e51ef4a74c..f752b30b71d6 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -110,6 +110,7 @@
> #include <linux/workqueue.h>
> #include <net/sock.h>
> #include <net/af_vsock.h>
>+#include <linux/errqueue.h>
>
> static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
> static void vsock_sk_destruct(struct sock *sk);
>@@ -2086,6 +2087,27 @@ static int __vsock_seqpacket_recvmsg(struct sock *sk, struct msghdr *msg,
> return err;
> }
>
>+static int vsock_err_recvmsg(struct sock *sk, struct msghdr *msg)
>+{
>+ struct sock_extended_err *ee;
>+ struct sk_buff *skb;
>+ int err;
>+
>+ lock_sock(sk);
>+ skb = sock_dequeue_err_skb(sk);
>+ release_sock(sk);
>+
>+ if (!skb)
>+ return -EAGAIN;
>+
>+ ee = &SKB_EXT_ERR(skb)->ee;
>+ err = put_cmsg(msg, SOL_VSOCK, 0, sizeof(*ee), ee);
>+ msg->msg_flags |= MSG_ERRQUEUE;
>+ consume_skb(skb);
>+
>+ return err;
>+}
>+
> static int
> vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> int flags)
>@@ -2096,6 +2118,10 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
> int err;
>
> sk = sock->sk;
>+
>+ if (unlikely(flags & MSG_ERRQUEUE))
>+ return vsock_err_recvmsg(sk, msg);
>+
> vsk = vsock_sk(sk);
> err = 0;
>
>--
>2.25.1
On Mon, Feb 06, 2023 at 06:55:46AM +0000, Arseniy Krasnov wrote:
>This feature totally depends on transport, so if transport doesn't
>support it, return error.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> include/net/af_vsock.h | 2 ++
> net/vmw_vsock/af_vsock.c | 7 +++++++
> 2 files changed, 9 insertions(+)
>
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index 568a87c5e0d0..96d829004c81 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -173,6 +173,8 @@ struct vsock_transport {
>
> /* Addressing. */
> u32 (*get_local_cid)(void);
>+
LGTM, just add comment here for a new section following what we did for
other callaback, e.g.:
/* Zero-copy. */
>+ bool (*msgzerocopy_allow)(void);
> };
>
> /**** CORE ****/
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index f752b30b71d6..fb0fcb390113 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1788,6 +1788,13 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> goto out;
> }
>
>+ if (msg->msg_flags & MSG_ZEROCOPY &&
>+ (!transport->msgzerocopy_allow ||
>+ !transport->msgzerocopy_allow())) {
>+ err = -EOPNOTSUPP;
>+ goto out;
>+ }
>+
> /* Wait for room in the produce queue to enqueue our user's data. */
> timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
>
>--
>2.25.1
On Mon, Feb 06, 2023 at 06:57:16AM +0000, Arseniy Krasnov wrote:
>This adds copying to guest's virtio buffers from non-linear skbs. Such
>skbs are created by protocol layer when MSG_ZEROCOPY flags is used.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> drivers/vhost/vsock.c | 56 ++++++++++++++++++++++++++++++++----
> include/linux/virtio_vsock.h | 12 ++++++++
> 2 files changed, 63 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 1f3b89c885cc..60b9cafa3e31 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -86,6 +86,44 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
> return NULL;
> }
>
>+static int vhost_transport_copy_nonlinear_skb(struct sk_buff *skb,
>+ struct iov_iter *iov_iter,
>+ size_t len)
>+{
>+ size_t rest_len = len;
>+
>+ while (rest_len && virtio_vsock_skb_has_frags(skb)) {
>+ struct bio_vec *curr_vec;
>+ size_t curr_vec_end;
>+ size_t to_copy;
>+ int curr_frag;
>+ int curr_offs;
>+
>+ curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
>+ curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>+ curr_vec = &skb_shinfo(skb)->frags[curr_frag];
>+
>+ curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
>+ to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
>+
>+ if (copy_page_to_iter(curr_vec->bv_page, curr_offs,
>+ to_copy, iov_iter) != to_copy)
>+ return -1;
>+
>+ rest_len -= to_copy;
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += to_copy;
>+
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off == (curr_vec_end)) {
>+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag++;
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>+ }
>+ }
Can it happen that we exit this loop and rest_len is not 0?
In this case, is it correct to decrement data_len by len?
Thanks,
Stefano
>+
>+ skb->data_len -= len;
>+
>+ return 0;
>+}
>+
> static void
> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> struct vhost_virtqueue *vq)
>@@ -197,11 +235,19 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> break;
> }
>
>- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>- if (nbytes != payload_len) {
>- kfree_skb(skb);
>- vq_err(vq, "Faulted on copying pkt buf\n");
>- break;
>+ if (skb_is_nonlinear(skb)) {
>+ if (vhost_transport_copy_nonlinear_skb(skb, &iov_iter,
>+ payload_len)) {
>+ vq_err(vq, "Faulted on copying pkt buf from page\n");
>+ break;
>+ }
>+ } else {
>+ nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>+ if (nbytes != payload_len) {
>+ kfree_skb(skb);
>+ vq_err(vq, "Faulted on copying pkt buf\n");
>+ break;
>+ }
> }
>
> /* Deliver to monitoring devices all packets that we
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index 3f9c16611306..e7efdb78ce6e 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -12,6 +12,10 @@
> struct virtio_vsock_skb_cb {
> bool reply;
> bool tap_delivered;
>+ /* Current fragment in 'frags' of skb. */
>+ u32 curr_frag;
>+ /* Offset from 0 in current fragment. */
>+ u32 frag_off;
> };
>
> #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
>@@ -46,6 +50,14 @@ static inline void virtio_vsock_skb_clear_tap_delivered(struct sk_buff *skb)
> VIRTIO_VSOCK_SKB_CB(skb)->tap_delivered = false;
> }
>
>+static inline bool virtio_vsock_skb_has_frags(struct sk_buff *skb)
>+{
>+ if (!skb_is_nonlinear(skb))
>+ return false;
>+
>+ return VIRTIO_VSOCK_SKB_CB(skb)->curr_frag != skb_shinfo(skb)->nr_frags;
>+}
>+
> static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
> {
> u32 len;
>--
>2.25.1
On Mon, Feb 06, 2023 at 06:58:24AM +0000, Arseniy Krasnov wrote:
>Use pages of non-linear skb as buffers in virtio tx queue.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport.c | 31 +++++++++++++++++++++++++------
> 1 file changed, 25 insertions(+), 6 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 28b5a8e8e094..b8a7d6dc9f46 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -100,7 +100,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> vq = vsock->vqs[VSOCK_VQ_TX];
>
> for (;;) {
>- struct scatterlist hdr, buf, *sgs[2];
>+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
>+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
+ 1 is for the header, right?
I'd add a comment just to be clear ;-)
> int ret, in_sg = 0, out_sg = 0;
> struct sk_buff *skb;
> bool reply;
>@@ -111,12 +112,30 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>
> virtio_transport_deliver_tap_pkt(skb);
> reply = virtio_vsock_skb_reply(skb);
>+ sg_init_one(&bufs[0], virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>+ sgs[out_sg++] = &bufs[0];
>+
>+ if (skb_is_nonlinear(skb)) {
>+ int i;
>+
>+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>+ struct page *data_page = skb_shinfo(skb)->frags[i].bv_page;
>+
>+ /* We will use 'page_to_virt()' for userspace page here,
>+ * because virtio layer will call 'virt_to_phys()' later
>+ * to fill buffer descriptor. We don't touch memory at
>+ * "virtual" address of this page.
>+ */
IIUC data_page is a user page, so since we are exposing it to the host,
I think we should pin it.
Is data_page always a user page, or can it be a kernel page when skb is
nonlinear?
Thanks,
Stefano
>+ sg_init_one(&bufs[i + 1],
>+ page_to_virt(data_page), PAGE_SIZE);
>+ sgs[out_sg++] = &bufs[i + 1];
>+ }
>+ } else {
>+ if (skb->len > 0) {
>+ sg_init_one(&bufs[1], skb->data, skb->len);
>+ sgs[out_sg++] = &bufs[1];
>+ }
>
>- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>- sgs[out_sg++] = &hdr;
>- if (skb->len > 0) {
>- sg_init_one(&buf, skb->data, skb->len);
>- sgs[out_sg++] = &buf;
> }
>
> ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
>--
>2.25.1
On Mon, Feb 06, 2023 at 06:59:21AM +0000, Arseniy Krasnov wrote:
>For TAP device new skb is created and data from the current skb is
>copied to it. This adds copying data from non-linear skb to new
>the skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport_common.c | 43 +++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 3 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index a1581c77cf84..05ce97b967ad 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -101,6 +101,39 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> return NULL;
> }
>
>+static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>+ void *dst,
>+ size_t len)
>+{
>+ size_t rest_len = len;
>+
>+ while (rest_len && virtio_vsock_skb_has_frags(skb)) {
>+ struct bio_vec *curr_vec;
>+ size_t curr_vec_end;
>+ size_t to_copy;
>+ int curr_frag;
>+ int curr_offs;
>+
>+ curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
>+ curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>+ curr_vec = &skb_shinfo(skb)->frags[curr_frag];
>+
>+ curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
>+ to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
>+
>+ memcpy(dst, page_to_virt(curr_vec->bv_page) + curr_offs,
>+ to_copy);
>+
>+ rest_len -= to_copy;
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += to_copy;
>+
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off == (curr_vec_end)) {
>+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag++;
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>+ }
>+ }
>+}
>+
> /* Packet capture */
> static struct sk_buff *virtio_transport_build_skb(void *opaque)
> {
>@@ -109,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> struct af_vsockmon_hdr *hdr;
> struct sk_buff *skb;
> size_t payload_len;
>- void *payload_buf;
>
> /* A packet could be split to fit the RX buffer, so we can retrieve
> * the payload length from the header and the buffer pointer taking
>@@ -117,7 +149,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> */
> pkt_hdr = virtio_vsock_hdr(pkt);
> payload_len = pkt->len;
>- payload_buf = pkt->data;
>
> skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
> GFP_ATOMIC);
>@@ -160,7 +191,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
>
> if (payload_len) {
>- skb_put_data(skb, payload_buf, payload_len);
>+ if (skb_is_nonlinear(skb)) {
>+ void *data = skb_put(skb, payload_len);
>+
>+ virtio_transport_copy_nonlinear_skb(skb, data, payload_len);
>+ } else {
>+ skb_put_data(skb, pkt->data, payload_len);
>+ }
Ehm I'm a bit confused. Maybe we need to rename the sk_buffs involved in
this function (pre-existing).
We have `pkt` that is the original sk_buff, and `skb` that it is
allocated in this function, so IIUC we should check if `pkt` is
nonlinear and copy its payload into `skb`, so we should do this
(untested) chage:
@@ -367,10 +367,10 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
if (payload_len) {
- if (skb_is_nonlinear(skb)) {
+ if (skb_is_nonlinear(pkt)) {
void *data = skb_put(skb, payload_len);
- virtio_transport_copy_nonlinear_skb(skb, data, payload_len);
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
} else {
skb_put_data(skb, pkt->data, payload_len);
}
Thanks,
Stefano
> }
>
> return skb;
>--
>2.25.1
On Mon, Feb 06, 2023 at 07:00:35AM +0000, Arseniy Krasnov wrote:
>This adds main logic of MSG_ZEROCOPY flag processing for packet
>creation. When this flag is set and user's iov iterator fits for
>zerocopy transmission, call 'get_user_pages()' and add returned
>pages to the newly created skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport_common.c | 212 ++++++++++++++++++++++--
> 1 file changed, 195 insertions(+), 17 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 05ce97b967ad..69e37f8a68a6 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -37,6 +37,169 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> return container_of(t, struct virtio_transport, transport);
> }
>
I'd use bool if we don't need to return an error value in the following
new functions.
>+static int virtio_transport_can_zcopy(struct iov_iter *iov_iter,
>+ size_t free_space)
>+{
>+ size_t pages;
>+ int i;
>+
>+ if (!iter_is_iovec(iov_iter))
>+ return -1;
>+
>+ if (iov_iter->iov_offset)
>+ return -1;
>+
>+ /* We can't send whole iov. */
>+ if (free_space < iov_iter->count)
>+ return -1;
>+
>+ for (pages = 0, i = 0; i < iov_iter->nr_segs; i++) {
>+ const struct iovec *iovec;
>+ int pages_in_elem;
>+
>+ iovec = &iov_iter->iov[i];
>+
>+ /* Base must be page aligned. */
>+ if (offset_in_page(iovec->iov_base))
>+ return -1;
>+
>+ /* Only last element could have not page aligned size. */
>+ if (i != (iov_iter->nr_segs - 1)) {
>+ if (offset_in_page(iovec->iov_len))
>+ return -1;
>+
>+ pages_in_elem = iovec->iov_len >> PAGE_SHIFT;
>+ } else {
>+ pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE);
>+ pages_in_elem >>= PAGE_SHIFT;
>+ }
>+
>+ /* In case of user's pages - one page is one frag. */
>+ if (pages + pages_in_elem > MAX_SKB_FRAGS)
>+ return -1;
>+
>+ pages += pages_in_elem;
>+ }
>+
>+ return 0;
>+}
>+
>+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>+ struct sk_buff *skb,
>+ struct iov_iter *iter,
>+ bool zerocopy)
>+{
>+ struct ubuf_info_msgzc *uarg_zc;
>+ struct ubuf_info *uarg;
>+
>+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>+ iov_length(iter->iov, iter->nr_segs),
>+ NULL);
>+
>+ if (!uarg)
>+ return -1;
>+
>+ uarg_zc = uarg_to_msgzc(uarg);
>+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
>+
>+ skb_zcopy_init(skb, uarg);
>+
>+ return 0;
>+}
>+
>+static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb,
>+ struct vsock_sock *vsk,
>+ struct virtio_vsock_pkt_info *info)
>+{
>+ struct iov_iter *iter;
>+ int frag_idx;
>+ int seg_idx;
>+
>+ iter = &info->msg->msg_iter;
>+ frag_idx = 0;
>+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0;
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>+
>+ /* At this moment:
>+ * 1) 'iov_offset' is zero.
>+ * 2) Every 'iov_base' and 'iov_len' are also page aligned
>+ * (except length of the last element).
>+ * 3) Number of pages in this iov <= MAX_SKB_FRAGS.
>+ * 4) Length of the data fits in current credit space.
>+ */
>+ for (seg_idx = 0; seg_idx < iter->nr_segs; seg_idx++) {
>+ struct page *user_pages[MAX_SKB_FRAGS];
>+ const struct iovec *iovec;
>+ size_t last_frag_len;
>+ size_t pages_in_seg;
>+ int page_idx;
>+
>+ iovec = &iter->iov[seg_idx];
>+ pages_in_seg = iovec->iov_len >> PAGE_SHIFT;
>+
>+ if (iovec->iov_len % PAGE_SIZE) {
>+ last_frag_len = iovec->iov_len % PAGE_SIZE;
>+ pages_in_seg++;
>+ } else {
>+ last_frag_len = PAGE_SIZE;
>+ }
>+
>+ if (get_user_pages((unsigned long)iovec->iov_base,
>+ pages_in_seg, FOLL_GET, user_pages,
>+ NULL) != pages_in_seg)
>+ return -1;
Reading the get_user_pages() documentation, this should pin the user
pages, so we should be fine if we then expose them in the virtqueue.
But reading Documentation/core-api/pin_user_pages.rst it seems that
drivers should use "pin_user_pages*() for DMA-pinned pages", so I'm not
sure what we should do.
Additional advice would be great!
Anyway, when we are done using the pages, we should call put_page() or
unpin_user_page() depending on how we pin them.
>+
>+ for (page_idx = 0; page_idx < pages_in_seg; page_idx++) {
>+ int frag_len = PAGE_SIZE;
>+
>+ if (page_idx == (pages_in_seg - 1))
>+ frag_len = last_frag_len;
>+
>+ skb_fill_page_desc(skb, frag_idx++,
>+ user_pages[page_idx], 0,
>+ frag_len);
>+ skb_len_add(skb, frag_len);
>+ }
>+ }
>+
>+ return virtio_transport_init_zcopy_skb(vsk, skb, iter, true);
>+}
>+
>+static int virtio_transport_copy_payload(struct sk_buff *skb,
>+ struct vsock_sock *vsk,
>+ struct virtio_vsock_pkt_info *info,
>+ size_t len)
>+{
>+ void *payload;
>+ int err;
>+
>+ payload = skb_put(skb, len);
>+ err = memcpy_from_msg(payload, info->msg, len);
>+ if (err)
>+ return -1;
>+
>+ if (msg_data_left(info->msg))
>+ return 0;
>+
>+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>+ struct virtio_vsock_hdr *hdr;
>+
>+ hdr = virtio_vsock_hdr(skb);
>+
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>+
>+ if (info->msg->msg_flags & MSG_EOR)
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>+ }
>+
A comment here explaining why this is necessary would be helpful.
>+ if (info->flags & MSG_ZEROCOPY)
>+ return virtio_transport_init_zcopy_skb(vsk, skb,
>+ &info->msg->msg_iter,
>+ false);
>+
>+ return 0;
>+}
>+
> /* Returns a new packet on success, otherwise returns NULL.
> *
> * If NULL is returned, errp is set to a negative errno.
>@@ -47,15 +210,31 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> u32 src_cid,
> u32 src_port,
> u32 dst_cid,
>- u32 dst_port)
>+ u32 dst_port,
>+ struct vsock_sock *vsk)
> {
>- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>+ const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
> struct virtio_vsock_hdr *hdr;
> struct sk_buff *skb;
>- void *payload;
>- int err;
>+ bool use_zcopy = false;
>+
>+ if (info->msg) {
>+ /* If SOCK_ZEROCOPY is not enabled, ignore MSG_ZEROCOPY
>+ * flag later and continue in classic way(e.g. without
>+ * completion).
>+ */
>+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
`vsk` can be null, should we check it?
Otherwise, what about passing only a flag?
So the caller will check it.
>+ info->flags &= ~MSG_ZEROCOPY;
>+ } else {
>+ if ((info->flags & MSG_ZEROCOPY) &&
>+ !virtio_transport_can_zcopy(&info->msg->msg_iter, len)) {
This part is not very clear, I think virtio_transport_can_zcopy()
should return `true` if "can_zcopy".
>+ use_zcopy = true;
>+ }
>+ }
>+ }
>
>- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>+ /* For MSG_ZEROCOPY length will be added later. */
>+ skb = virtio_vsock_alloc_skb(skb_len + (use_zcopy ? 0 : len), GFP_KERNEL);
I think is better to adsjut `skb_len` in the previous block, when we set
`use_zcopy = true`, we can do `skb_len -= len` (with the comment);
> if (!skb)
> return NULL;
>
>@@ -70,18 +249,15 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> hdr->len = cpu_to_le32(len);
>
> if (info->msg && len > 0) {
>- payload = skb_put(skb, len);
>- err = memcpy_from_msg(payload, info->msg, len);
>- if (err)
>- goto out;
>+ int err;
>
>- if (msg_data_left(info->msg) == 0 &&
>- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>+ if (use_zcopy)
>+ err = virtio_transport_fill_nonlinear_skb(skb, vsk, info);
>+ else
>+ err = virtio_transport_copy_payload(skb, vsk, info, len);
>
>- if (info->msg->msg_flags & MSG_EOR)
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>- }
>+ if (err)
>+ goto out;
> }
>
> if (info->reply)
>@@ -266,7 +442,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>
> skb = virtio_transport_alloc_skb(info, pkt_len,
> src_cid, src_port,
>- dst_cid, dst_port);
>+ dst_cid, dst_port,
>+ vsk);
> if (!skb) {
> virtio_transport_put_credit(vvs, pkt_len);
> return -ENOMEM;
>@@ -842,6 +1019,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> .msg = msg,
> .pkt_len = len,
> .vsk = vsk,
>+ .flags = msg->msg_flags,
> };
>
> return virtio_transport_send_pkt_info(vsk, &info);
>@@ -894,7 +1072,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> le64_to_cpu(hdr->dst_cid),
> le32_to_cpu(hdr->dst_port),
> le64_to_cpu(hdr->src_cid),
>- le32_to_cpu(hdr->src_port));
>+ le32_to_cpu(hdr->src_port), NULL);
> if (!reply)
> return -ENOMEM;
>
>--
>2.25.1
On Mon, Feb 06, 2023 at 07:06:32AM +0000, Arseniy Krasnov wrote:
>To use this option pass '--zc' parameter:
--zerocopy or --zero-copy maybe better follow what we did with the other
parameters :-)
>
>./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>
>
>With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> tools/testing/vsock/vsock_perf.c | 127 +++++++++++++++++++++++++++++--
> 1 file changed, 120 insertions(+), 7 deletions(-)
>
>diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
>index a72520338f84..1d435be9b48e 100644
>--- a/tools/testing/vsock/vsock_perf.c
>+++ b/tools/testing/vsock/vsock_perf.c
>@@ -18,6 +18,8 @@
> #include <poll.h>
> #include <sys/socket.h>
> #include <linux/vm_sockets.h>
>+#include <sys/mman.h>
>+#include <linux/errqueue.h>
>
> #define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
> #define DEFAULT_TO_SEND_BYTES (64 * 1024)
>@@ -28,9 +30,14 @@
> #define BYTES_PER_GB (1024 * 1024 * 1024ULL)
> #define NSEC_PER_SEC (1000000000ULL)
>
>+#ifndef SOL_VSOCK
>+#define SOL_VSOCK 287
>+#endif
I thought we use the current kernel headers when we compile the tests,
do we need to fix something in the makefile?
>+
> static unsigned int port = DEFAULT_PORT;
> static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
> static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
>+static bool zerocopy;
>
> static void error(const char *s)
> {
>@@ -247,15 +254,74 @@ static void run_receiver(unsigned long rcvlowat_bytes)
> close(fd);
> }
>
>+static void recv_completion(int fd)
>+{
>+ struct sock_extended_err *serr;
>+ char cmsg_data[128];
>+ struct cmsghdr *cm;
>+ struct msghdr msg;
>+ int ret;
>+
>+ msg.msg_control = cmsg_data;
>+ msg.msg_controllen = sizeof(cmsg_data);
>+
>+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
>+ if (ret == -1)
>+ return;
>+
>+ cm = CMSG_FIRSTHDR(&msg);
>+ if (!cm) {
>+ fprintf(stderr, "cmsg: no cmsg\n");
>+ return;
>+ }
>+
>+ if (cm->cmsg_level != SOL_VSOCK) {
>+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
>+ return;
>+ }
>+
>+ if (cm->cmsg_type) {
>+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
>+ return;
>+ }
>+
>+ serr = (void *)CMSG_DATA(cm);
>+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
>+ fprintf(stderr, "serr: wrong origin\n");
>+ return;
>+ }
>+
>+ if (serr->ee_errno) {
>+ fprintf(stderr, "serr: wrong error code\n");
>+ return;
>+ }
>+
>+ if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
>+ fprintf(stderr, "warning: copy instead of zerocopy\n");
>+}
>+
>+static void enable_so_zerocopy(int fd)
>+{
>+ int val = 1;
>+
>+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
>+ error("setsockopt(SO_ZEROCOPY)");
>+}
>+
> static void run_sender(int peer_cid, unsigned long to_send_bytes)
> {
> time_t tx_begin_ns;
> time_t tx_total_ns;
> size_t total_send;
>+ time_t time_in_send;
> void *data;
> int fd;
>
>- printf("Run as sender\n");
>+ if (zerocopy)
>+ printf("Run as sender MSG_ZEROCOPY\n");
>+ else
>+ printf("Run as sender\n");
>+
> printf("Connect to %i:%u\n", peer_cid, port);
> printf("Send %lu bytes\n", to_send_bytes);
> printf("TX buffer %lu bytes\n", buf_size_bytes);
>@@ -265,25 +331,58 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
> if (fd < 0)
> exit(EXIT_FAILURE);
>
>- data = malloc(buf_size_bytes);
>+ if (zerocopy) {
>+ enable_so_zerocopy(fd);
>
>- if (!data) {
>- fprintf(stderr, "'malloc()' failed\n");
>- exit(EXIT_FAILURE);
>+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
>+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>+ if (data == MAP_FAILED) {
>+ perror("mmap");
>+ exit(EXIT_FAILURE);
>+ }
>+ } else {
>+ data = malloc(buf_size_bytes);
>+
>+ if (!data) {
>+ fprintf(stderr, "'malloc()' failed\n");
>+ exit(EXIT_FAILURE);
>+ }
> }
Eventually to simplify the code I think we can use the mmaped buffer in
both cases.
>
> memset(data, 0, buf_size_bytes);
> total_send = 0;
>+ time_in_send = 0;
> tx_begin_ns = current_nsec();
>
> while (total_send < to_send_bytes) {
> ssize_t sent;
>+ size_t rest_bytes;
>+ time_t before;
>+
>+ rest_bytes = to_send_bytes - total_send;
>
>- sent = write(fd, data, buf_size_bytes);
>+ before = current_nsec();
>+ sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
>+ buf_size_bytes : rest_bytes,
>+ zerocopy ? MSG_ZEROCOPY : 0);
>+ time_in_send += (current_nsec() - before);
>
> if (sent <= 0)
> error("write");
>
>+ if (zerocopy) {
>+ struct pollfd fds = { 0 };
>+
>+ fds.fd = fd;
Which event are we waiting for here?
>+
>+ if (poll(&fds, 1, -1) < 0) {
>+ perror("poll");
>+ exit(EXIT_FAILURE);
>+ }
We need this because we use only one buffer, but if we use more than
one, we could take full advantage of zerocopy, right?
Otherwise, I don't think it's a fair comparison with non-zerocopy.
Thanks,
Stefano
>+
>+ recv_completion(fd);
>+ }
>+
> total_send += sent;
> }
>
>@@ -294,9 +393,14 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
> get_gbps(total_send * 8, tx_total_ns));
> printf("total time in 'write()': %f sec\n",
> (float)tx_total_ns / NSEC_PER_SEC);
>+ printf("time in send %f\n", (float)time_in_send / NSEC_PER_SEC);
>
> close(fd);
>- free(data);
>+
>+ if (zerocopy)
>+ munmap(data, buf_size_bytes);
>+ else
>+ free(data);
> }
>
> static const char optstring[] = "";
>@@ -336,6 +440,11 @@ static const struct option longopts[] = {
> .has_arg = required_argument,
> .val = 'R',
> },
>+ {
>+ .name = "zc",
>+ .has_arg = no_argument,
>+ .val = 'Z',
>+ },
> {},
> };
>
>@@ -351,6 +460,7 @@ static void usage(void)
> " --help This message\n"
> " --sender <cid> Sender mode (receiver default)\n"
> " <cid> of the receiver to connect to\n"
>+ " --zc Enable zerocopy\n"
> " --port <port> Port (default %d)\n"
> " --bytes <bytes>KMG Bytes to send (default %d)\n"
> " --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
>@@ -413,6 +523,9 @@ int main(int argc, char **argv)
> case 'H': /* Help. */
> usage();
> break;
>+ case 'Z': /* Zerocopy. */
>+ zerocopy = true;
>+ break;
> default:
> usage();
> }
>--
>2.25.1
>
On 16.02.2023 16:33, Stefano Garzarella wrote:
> Hi Arseniy,
> sorry for the delay, but I was offline.
Hello! Sure no problem!, i was also offline a little bit so i'm replying
just now
>
> On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>> flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>> for single tx syscall (or it can merge several completions to single
>> one). I used already implemented logic for MSG_ZEROCOPY support:
>> 'msg_zerocopy_realloc()' etc.
>
> I will review for the vsock point of view. Hope some net maintainers can
> comment about SO_ZEROCOPY.
>
> Anyway I think is a good idea to keep it as close as possible to the TCP
> implementation.
>
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created, then I call 'get_user_pages()' for each page
>> from user's iov iterator (I think i don't need 'pin_user_pages()' as
>
> Are these pages exposed to the host via virtqueues? If so, I think we
> should pin them. What happens if the host accesses them but these pages
> have been unmapped?
Yes, user pages with data will be used by the virtio device.
'pin' - You mean use 'pin_user_pages()'? Unmapped - You mean guest
will unmap it, while host must access it to copy packet? Such pages
have incremented refcount by 'get_user_pages()', so page is locked
and memory and will be locked until host finishes copying data.
I think it is better to discuss things related to 'get/pin_user_pages()'
in one place, for example in 07/12 where this function is called.
>
>> there is no backing storage for these pages) and add each returned page
>> to the skb as fragment. There are also some updates for vhost and guest
>> parts of transport - in both cases i've added handling of non-linear skb
>> for virtio part. vhost copies data from such skb to the guest's rx virtio
>> buffers. In the guest, virtio transport fills virtio queue with pages
>> from skb.
>>
>> I think doc in Documentation/networking/msg_zerocopy.rst could be also
>> updated in next versions.
>
> Yep, good idea.
Ack, i'll do it in v2.
>
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>> is difficult) to check whether transport is able to handle it or not
>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>> are not considered to be called from each other. So in current version
>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>> tx routine will fail with EOPNOTSUPP.
>
> I'll take a look, but if we have no alternative, I think it's okay to
> make tx fail.>
Thanks
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>> one completion. In each completion there is flag which shows how tx
>> was performed: zerocopy or copy. This leads that whole message must
>> be send in zerocopy or copy way - we can't send part of message with
>> copying and rest of message with zerocopy mode (or vice versa). Now,
>> we need to account vsock credit logic, e.g. we can't send whole data
>> once - only allowed number of bytes could sent at any moment. In case
>> of copying way there is no problem as in worst case we can send single
>> bytes, but zerocopy is more complex because smallest transmission
>> unit is single page. So if there is not enough space at peer's side
>> to send integer number of pages (at least one) - we will wait, thus
>> stalling tx side. To overcome this problem i've added simple rule -
>> zerocopy is possible only when there is enough space at another side
>> for whole message (to check, that current 'msghdr' was already used
>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
> I see the problem and I think your approach is the right one.
>
>>
>> 3) loopback transport is not supported, because it requires to implement
>> non-linear skb handling in dequeue logic (as we "send" fragged skb
>> and "receive" it from the same queue). I'm going to implement it in
>> next versions.
>
> loopback is useful for testing and debugging, so it would be great to
> have the support, but if it's too complicated, we can do it later.
>
Ok, i'll implement it in v2.
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>> is due to 'kmalloc()' allocated data buffers. I think, in case of
>
> Yep, I think so.
> When I started touching this code, the limit was already there.
> As you said, I think it was introduced to have a limit on (host/device
> side?) allocation, but buf_alloc might be enough, so maybe we could
> also remove it for copy mode.
> The only problem I see is compatibility with old devices/drivers, so
> maybe we need a feature in the spec to say the limit is gone, or have a
> field in the virtio config space where the device specifies its limit
> (for the driver, the limit is implicitly that of the buffer allocated
> and put in the virtqueue).
>
> This reminded me that Laura had proposed something similar before,
> maybe we should take it up again:
> https://markmail.org/message/3el4ckeakfilg5wo
>
>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>> not touched for data - user space pages are used as buffers. Also
>> this limit trims every message which is > 64KB, thus such messages
>> will be send in copy mode due to 'iov_offset' check in 2).
>
> The host still needs to allocate and copy, so maybe the limitation
> could be to avoid large allocations in the host, but actually the host
> can use vmalloc because it doesn't need them to be contiguous.
>
Hmmm, I think it is possible to solve this situation in the following
way - i can keep limitation for 64KB for copy mode, and remove it for
zero copy, but I'll limit each packet size to 64KB(of course technically
it is not exactly 64KB, it is min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE)
where max packet size is 64Kb, but for simplicity let's call this limit 64Kb:) ).
E.g. when zerocopy transmission needs to send for example 129Kb(of course
peer's free space is big enough and this check is passed), I'll won't trim
129Kb to 64Kb + 64Kb + 1Kb in the current manner - by returning to af_vsock.c
after sending every skb. I'll alloc several skbs, (3 in this case - 64Kb + 64Kb + 1Kb)
in single call to the transport. Completion arg will be attached
only to the last one skb, and send these 3 skbs. Host still processes
64Kb(let it be 64Kb for simplicity again :) ) packets - no big allocations.
Moreover, i think that this logic could be a little optimization for
copy mode - why we allocate single skb and always return to af_vsock.c?
May be we can iterate needed number of skbs in the loop and send them.
Also about vmalloc(), IIUC there is already this idea which replaces 'kmalloc()'
to 'kvmalloc()'.
>>
>> PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it something like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in 'sendmsg()' syscall.
>> Also there is limit from 4) above so max buffer size is 64KB. I've
>> tested this patchset in the nested VM, but i think for V1 it is not a
>> big deal.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> Number in cell is seconds which senders spends inside tx syscall.
>>
>> Guest to host transmission:
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 0.26 | 0.042 |
>> *-------------------------------*
>> | 16KB | 0.11 | 0.014 |
>> *-------------------------------*
>> | 32KB | 0.05 | 0.009 |
>> *-------------------------------*
>> | 64KB | 0.04 | 0.005 |
>> *-------------------------------*
>>
>> Host to guest transmission:
>>
>> *--------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *--------------------------------*
>> | 4KB | 0.049 | 0.034 |
>> *--------------------------------*
>> | 16KB | 0.03 | 0.024 |
>> *--------------------------------*
>> | 32KB | 0.025 | 0.01 |
>> *--------------------------------*
>> | 64KB | 0.028 | 0.01 |
>> *--------------------------------*
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation.
>>
>> Zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment. In next versions i'm going to fix 64KB barrier to
>> perform tests with bigger buffer sizes.
>
> Yep, I see.
> Adjusting vsock_perf to compare also Gbps (can io_uring helps in this
> case to have more requests in-flight?) would be great.
>
Yes, i'll add Gbps. Also I thought about adding io_uring test to
the current test suite because io_uring also have MSG_ZEROCOPY
type of request, and interesting thing is that io_uring uses it's
own already allocated uarg.
>>
>> TESTING
>>
>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>> cover new code as much as possible so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages).
>
> Great! Thanks for adding the tests!
>
> I'll go through the patches between today and Monday.
> Sorry again for taking so long!
Sure, Thanks for review! I think we are not hurry :)
>
> Thanks,
> Stefano
>
On 16.02.2023 16:40, Stefano Garzarella wrote:
> On Mon, Feb 06, 2023 at 06:53:22AM +0000, Arseniy Krasnov wrote:
>> If socket's error queue is not empty, EPOLLERR must be set.
>
> Could this patch go regardless of this series?
>
> Can you explain (even in the commit message) what happens without this
> patch?
Sure! Thanks
>
> Thanks,
> Stefano
>
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> net/vmw_vsock/af_vsock.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index 19aea7cba26e..b5e51ef4a74c 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -1026,7 +1026,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
>> poll_wait(file, sk_sleep(sk), wait);
>> mask = 0;
>>
>> - if (sk->sk_err)
>> + if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
>> /* Signify that there has been an error on this socket. */
>> mask |= EPOLLERR;
>>
>> --
>> 2.25.1
>
On 16.02.2023 17:09, Stefano Garzarella wrote:
> On Mon, Feb 06, 2023 at 06:57:16AM +0000, Arseniy Krasnov wrote:
>> This adds copying to guest's virtio buffers from non-linear skbs. Such
>> skbs are created by protocol layer when MSG_ZEROCOPY flags is used.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> drivers/vhost/vsock.c | 56 ++++++++++++++++++++++++++++++++----
>> include/linux/virtio_vsock.h | 12 ++++++++
>> 2 files changed, 63 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 1f3b89c885cc..60b9cafa3e31 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -86,6 +86,44 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>> return NULL;
>> }
>>
>> +static int vhost_transport_copy_nonlinear_skb(struct sk_buff *skb,
>> + struct iov_iter *iov_iter,
>> + size_t len)
>> +{
>> + size_t rest_len = len;
>> +
>> + while (rest_len && virtio_vsock_skb_has_frags(skb)) {
>> + struct bio_vec *curr_vec;
>> + size_t curr_vec_end;
>> + size_t to_copy;
>> + int curr_frag;
>> + int curr_offs;
>> +
>> + curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
>> + curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>> + curr_vec = &skb_shinfo(skb)->frags[curr_frag];
>> +
>> + curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
>> + to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
>> +
>> + if (copy_page_to_iter(curr_vec->bv_page, curr_offs,
>> + to_copy, iov_iter) != to_copy)
>> + return -1;
>> +
>> + rest_len -= to_copy;
>> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off += to_copy;
>> +
>> + if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off == (curr_vec_end)) {
>> + VIRTIO_VSOCK_SKB_CB(skb)->curr_frag++;
>> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>> + }
>> + }
>
> Can it happen that we exit this loop and rest_len is not 0?
>
> In this case, is it correct to decrement data_len by len?
I see
>
> Thanks,
> Stefano
>
>> +
>> + skb->data_len -= len;
>> +
>> + return 0;
>> +}
>> +
>> static void
>> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>> struct vhost_virtqueue *vq)
>> @@ -197,11 +235,19 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>> break;
>> }
>>
>> - nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>> - if (nbytes != payload_len) {
>> - kfree_skb(skb);
>> - vq_err(vq, "Faulted on copying pkt buf\n");
>> - break;
>> + if (skb_is_nonlinear(skb)) {
>> + if (vhost_transport_copy_nonlinear_skb(skb, &iov_iter,
>> + payload_len)) {
>> + vq_err(vq, "Faulted on copying pkt buf from page\n");
>> + break;
>> + }
>> + } else {
>> + nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>> + if (nbytes != payload_len) {
>> + kfree_skb(skb);
>> + vq_err(vq, "Faulted on copying pkt buf\n");
>> + break;
>> + }
>> }
>>
>> /* Deliver to monitoring devices all packets that we
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index 3f9c16611306..e7efdb78ce6e 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -12,6 +12,10 @@
>> struct virtio_vsock_skb_cb {
>> bool reply;
>> bool tap_delivered;
>> + /* Current fragment in 'frags' of skb. */
>> + u32 curr_frag;
>> + /* Offset from 0 in current fragment. */
>> + u32 frag_off;
>> };
>>
>> #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
>> @@ -46,6 +50,14 @@ static inline void virtio_vsock_skb_clear_tap_delivered(struct sk_buff *skb)
>> VIRTIO_VSOCK_SKB_CB(skb)->tap_delivered = false;
>> }
>>
>> +static inline bool virtio_vsock_skb_has_frags(struct sk_buff *skb)
>> +{
>> + if (!skb_is_nonlinear(skb))
>> + return false;
>> +
>> + return VIRTIO_VSOCK_SKB_CB(skb)->curr_frag != skb_shinfo(skb)->nr_frags;
>> +}
>> +
>> static inline void virtio_vsock_skb_rx_put(struct sk_buff *skb)
>> {
>> u32 len;
>> --
>> 2.25.1
>
On 16.02.2023 17:18, Stefano Garzarella wrote:
> On Mon, Feb 06, 2023 at 06:58:24AM +0000, Arseniy Krasnov wrote:
>> Use pages of non-linear skb as buffers in virtio tx queue.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> net/vmw_vsock/virtio_transport.c | 31 +++++++++++++++++++++++++------
>> 1 file changed, 25 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index 28b5a8e8e094..b8a7d6dc9f46 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -100,7 +100,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>> vq = vsock->vqs[VSOCK_VQ_TX];
>>
>> for (;;) {
>> - struct scatterlist hdr, buf, *sgs[2];
>> + struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
>> + struct scatterlist bufs[MAX_SKB_FRAGS + 1];
>
> + 1 is for the header, right?
> I'd add a comment just to be clear ;-)
>
>> int ret, in_sg = 0, out_sg = 0;
>> struct sk_buff *skb;
>> bool reply;
>> @@ -111,12 +112,30 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>>
>> virtio_transport_deliver_tap_pkt(skb);
>> reply = virtio_vsock_skb_reply(skb);
>> + sg_init_one(&bufs[0], virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>> + sgs[out_sg++] = &bufs[0];
>> +
>> + if (skb_is_nonlinear(skb)) {
>> + int i;
>> +
>> + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
>> + struct page *data_page = skb_shinfo(skb)->frags[i].bv_page;
>> +
>> + /* We will use 'page_to_virt()' for userspace page here,
>> + * because virtio layer will call 'virt_to_phys()' later
>> + * to fill buffer descriptor. We don't touch memory at
>> + * "virtual" address of this page.
>> + */
>
> IIUC data_page is a user page, so since we are exposing it to the host,
> I think we should pin it.
>
> Is data_page always a user page, or can it be a kernel page when skb is nonlinear?
By default it is user page, but...may be it is possible to send page of mapped
file with MAP_SHARED flags(i think it this case it will be page from page cache)
>
> Thanks,
> Stefano
>
>> + sg_init_one(&bufs[i + 1],
>> + page_to_virt(data_page), PAGE_SIZE);
>> + sgs[out_sg++] = &bufs[i + 1];
>> + }
>> + } else {
>> + if (skb->len > 0) {
>> + sg_init_one(&bufs[1], skb->data, skb->len);
>> + sgs[out_sg++] = &bufs[1];
>> + }
>>
>> - sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>> - sgs[out_sg++] = &hdr;
>> - if (skb->len > 0) {
>> - sg_init_one(&buf, skb->data, skb->len);
>> - sgs[out_sg++] = &buf;
>> }
>>
>> ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
>> --
>> 2.25.1
>
On 16.02.2023 18:16, Stefano Garzarella wrote:
> On Mon, Feb 06, 2023 at 07:00:35AM +0000, Arseniy Krasnov wrote:
>> This adds main logic of MSG_ZEROCOPY flag processing for packet
>> creation. When this flag is set and user's iov iterator fits for
>> zerocopy transmission, call 'get_user_pages()' and add returned
>> pages to the newly created skb.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> net/vmw_vsock/virtio_transport_common.c | 212 ++++++++++++++++++++++--
>> 1 file changed, 195 insertions(+), 17 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 05ce97b967ad..69e37f8a68a6 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -37,6 +37,169 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>> return container_of(t, struct virtio_transport, transport);
>> }
>>
>
> I'd use bool if we don't need to return an error value in the following
> new functions.
>
>> +static int virtio_transport_can_zcopy(struct iov_iter *iov_iter,
>> + size_t free_space)
>> +{
>> + size_t pages;
>> + int i;
>> +
>> + if (!iter_is_iovec(iov_iter))
>> + return -1;
>> +
>> + if (iov_iter->iov_offset)
>> + return -1;
>> +
>> + /* We can't send whole iov. */
>> + if (free_space < iov_iter->count)
>> + return -1;
>> +
>> + for (pages = 0, i = 0; i < iov_iter->nr_segs; i++) {
>> + const struct iovec *iovec;
>> + int pages_in_elem;
>> +
>> + iovec = &iov_iter->iov[i];
>> +
>> + /* Base must be page aligned. */
>> + if (offset_in_page(iovec->iov_base))
>> + return -1;
>> +
>> + /* Only last element could have not page aligned size. */
>> + if (i != (iov_iter->nr_segs - 1)) {
>> + if (offset_in_page(iovec->iov_len))
>> + return -1;
>> +
>> + pages_in_elem = iovec->iov_len >> PAGE_SHIFT;
>> + } else {
>> + pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE);
>> + pages_in_elem >>= PAGE_SHIFT;
>> + }
>> +
>> + /* In case of user's pages - one page is one frag. */
>> + if (pages + pages_in_elem > MAX_SKB_FRAGS)
>> + return -1;
>> +
>> + pages += pages_in_elem;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>> + struct sk_buff *skb,
>> + struct iov_iter *iter,
>> + bool zerocopy)
>> +{
>> + struct ubuf_info_msgzc *uarg_zc;
>> + struct ubuf_info *uarg;
>> +
>> + uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> + iov_length(iter->iov, iter->nr_segs),
>> + NULL);
>> +
>> + if (!uarg)
>> + return -1;
>> +
>> + uarg_zc = uarg_to_msgzc(uarg);
>> + uarg_zc->zerocopy = zerocopy ? 1 : 0;
>> +
>> + skb_zcopy_init(skb, uarg);
>> +
>> + return 0;
>> +}
>> +
>> +static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb,
>> + struct vsock_sock *vsk,
>> + struct virtio_vsock_pkt_info *info)
>> +{
>> + struct iov_iter *iter;
>> + int frag_idx;
>> + int seg_idx;
>> +
>> + iter = &info->msg->msg_iter;
>> + frag_idx = 0;
>> + VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0;
>> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>> +
>> + /* At this moment:
>> + * 1) 'iov_offset' is zero.
>> + * 2) Every 'iov_base' and 'iov_len' are also page aligned
>> + * (except length of the last element).
>> + * 3) Number of pages in this iov <= MAX_SKB_FRAGS.
>> + * 4) Length of the data fits in current credit space.
>> + */
>> + for (seg_idx = 0; seg_idx < iter->nr_segs; seg_idx++) {
>> + struct page *user_pages[MAX_SKB_FRAGS];
>> + const struct iovec *iovec;
>> + size_t last_frag_len;
>> + size_t pages_in_seg;
>> + int page_idx;
>> +
>> + iovec = &iter->iov[seg_idx];
>> + pages_in_seg = iovec->iov_len >> PAGE_SHIFT;
>> +
>> + if (iovec->iov_len % PAGE_SIZE) {
>> + last_frag_len = iovec->iov_len % PAGE_SIZE;
>> + pages_in_seg++;
>> + } else {
>> + last_frag_len = PAGE_SIZE;
>> + }
>> +
>> + if (get_user_pages((unsigned long)iovec->iov_base,
>> + pages_in_seg, FOLL_GET, user_pages,
>> + NULL) != pages_in_seg)
>> + return -1;
>
> Reading the get_user_pages() documentation, this should pin the user
> pages, so we should be fine if we then expose them in the virtqueue.
>
> But reading Documentation/core-api/pin_user_pages.rst it seems that
> drivers should use "pin_user_pages*() for DMA-pinned pages", so I'm not
> sure what we should do.
>
That is really interesting question for me too. IIUC 'pin_user_pages()'
sets special value to ref counter of page, so we can distinguish such
pages from the others. I've grepped for pinned pages check and found,
the it is used in mm/vmscan.c by calling 'folio_maybe_dma_pinned()' during
page lists processing. Seems 'pin_user_pages()' is more strict version of
'get_user_pages()' and it is recommended to use 'pin_' when data on these
pages will be accessed.
I think, i'll check which API is used in the TCP implementation for zerocopy
transmission.
> Additional advice would be great!
>
> Anyway, when we are done using the pages, we should call put_page() or
> unpin_user_page() depending on how we pin them.
>
In case of 'get_user_pages()' everything is ok here: when such skb
will be released, 'put_page()' will be called for every frag page
of it, so there is no page leak. But in case of 'pin_user_pages()',
i will need to unpin in manually before calling 'consume_skb()'
after it is processed by virtio device. But anyway - it is not a
problem.
>> +
>> + for (page_idx = 0; page_idx < pages_in_seg; page_idx++) {
>> + int frag_len = PAGE_SIZE;
>> +
>> + if (page_idx == (pages_in_seg - 1))
>> + frag_len = last_frag_len;
>> +
>> + skb_fill_page_desc(skb, frag_idx++,
>> + user_pages[page_idx], 0,
>> + frag_len);
>> + skb_len_add(skb, frag_len);
>> + }
>> + }
>> +
>> + return virtio_transport_init_zcopy_skb(vsk, skb, iter, true);
>> +}
>> +
>> +static int virtio_transport_copy_payload(struct sk_buff *skb,
>> + struct vsock_sock *vsk,
>> + struct virtio_vsock_pkt_info *info,
>> + size_t len)
>> +{
>> + void *payload;
>> + int err;
>> +
>> + payload = skb_put(skb, len);
>> + err = memcpy_from_msg(payload, info->msg, len);
>> + if (err)
>> + return -1;
>> +
>> + if (msg_data_left(info->msg))
>> + return 0;
>> +
>> + if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> + struct virtio_vsock_hdr *hdr;
>> +
>> + hdr = virtio_vsock_hdr(skb);
>> +
>> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +
>> + if (info->msg->msg_flags & MSG_EOR)
>> + hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> + }
>> +
>
> A comment here explaining why this is necessary would be helpful.
>
>> + if (info->flags & MSG_ZEROCOPY)
>> + return virtio_transport_init_zcopy_skb(vsk, skb,
>> + &info->msg->msg_iter,
>> + false);
>> +
>> + return 0;
>> +}
>> +
>> /* Returns a new packet on success, otherwise returns NULL.
>> *
>> * If NULL is returned, errp is set to a negative errno.
>> @@ -47,15 +210,31 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> u32 src_cid,
>> u32 src_port,
>> u32 dst_cid,
>> - u32 dst_port)
>> + u32 dst_port,
>> + struct vsock_sock *vsk)
>> {
>> - const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>> + const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>> struct virtio_vsock_hdr *hdr;
>> struct sk_buff *skb;
>> - void *payload;
>> - int err;
>> + bool use_zcopy = false;
>> +
>> + if (info->msg) {
>> + /* If SOCK_ZEROCOPY is not enabled, ignore MSG_ZEROCOPY
>> + * flag later and continue in classic way(e.g. without
>> + * completion).
>> + */
>> + if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY)) {
>
> `vsk` can be null, should we check it?
> Otherwise, what about passing only a flag?
> So the caller will check it.
>
>> + info->flags &= ~MSG_ZEROCOPY;
>> + } else {
>> + if ((info->flags & MSG_ZEROCOPY) &&
>> + !virtio_transport_can_zcopy(&info->msg->msg_iter, len)) {
>
> This part is not very clear, I think virtio_transport_can_zcopy()
> should return `true` if "can_zcopy".
>
>> + use_zcopy = true;
>> + }
>> + }
>> + }
>>
>> - skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> + /* For MSG_ZEROCOPY length will be added later. */
>> + skb = virtio_vsock_alloc_skb(skb_len + (use_zcopy ? 0 : len), GFP_KERNEL);
>
> I think is better to adsjut `skb_len` in the previous block, when we set
> `use_zcopy = true`, we can do `skb_len -= len` (with the comment);
>
>> if (!skb)
>> return NULL;
>>
>> @@ -70,18 +249,15 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> hdr->len = cpu_to_le32(len);
>>
>> if (info->msg && len > 0) {
>> - payload = skb_put(skb, len);
>> - err = memcpy_from_msg(payload, info->msg, len);
>> - if (err)
>> - goto out;
>> + int err;
>>
>> - if (msg_data_left(info->msg) == 0 &&
>> - info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> - hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> + if (use_zcopy)
>> + err = virtio_transport_fill_nonlinear_skb(skb, vsk, info);
>> + else
>> + err = virtio_transport_copy_payload(skb, vsk, info, len);
>>
>> - if (info->msg->msg_flags & MSG_EOR)
>> - hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> - }
>> + if (err)
>> + goto out;
>> }
>>
>> if (info->reply)
>> @@ -266,7 +442,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>
>> skb = virtio_transport_alloc_skb(info, pkt_len,
>> src_cid, src_port,
>> - dst_cid, dst_port);
>> + dst_cid, dst_port,
>> + vsk);
>> if (!skb) {
>> virtio_transport_put_credit(vvs, pkt_len);
>> return -ENOMEM;
>> @@ -842,6 +1019,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
>> .msg = msg,
>> .pkt_len = len,
>> .vsk = vsk,
>> + .flags = msg->msg_flags,
>> };
>>
>> return virtio_transport_send_pkt_info(vsk, &info);
>> @@ -894,7 +1072,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>> le64_to_cpu(hdr->dst_cid),
>> le32_to_cpu(hdr->dst_port),
>> le64_to_cpu(hdr->src_cid),
>> - le32_to_cpu(hdr->src_port));
>> + le32_to_cpu(hdr->src_port), NULL);
>> if (!reply)
>> return -ENOMEM;
>>
>> --
>> 2.25.1
>
On 16.02.2023 18:29, Stefano Garzarella wrote:
> On Mon, Feb 06, 2023 at 07:06:32AM +0000, Arseniy Krasnov wrote:
>> To use this option pass '--zc' parameter:
>
> --zerocopy or --zero-copy maybe better follow what we did with the other parameters :-)
>
>>
>> ./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>
>>
>> With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> tools/testing/vsock/vsock_perf.c | 127 +++++++++++++++++++++++++++++--
>> 1 file changed, 120 insertions(+), 7 deletions(-)
>>
>> diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
>> index a72520338f84..1d435be9b48e 100644
>> --- a/tools/testing/vsock/vsock_perf.c
>> +++ b/tools/testing/vsock/vsock_perf.c
>> @@ -18,6 +18,8 @@
>> #include <poll.h>
>> #include <sys/socket.h>
>> #include <linux/vm_sockets.h>
>> +#include <sys/mman.h>
>> +#include <linux/errqueue.h>
>>
>> #define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
>> #define DEFAULT_TO_SEND_BYTES (64 * 1024)
>> @@ -28,9 +30,14 @@
>> #define BYTES_PER_GB (1024 * 1024 * 1024ULL)
>> #define NSEC_PER_SEC (1000000000ULL)
>>
>> +#ifndef SOL_VSOCK
>> +#define SOL_VSOCK 287
>> +#endif
>
> I thought we use the current kernel headers when we compile the tests,
> do we need to fix something in the makefile?
Not sure, of course we are using uapi. But i see, that defines like SOL_XXX is not
defined in uapi headers. For example SOL_IP is defined in include/linux/socket.h,
but userspace app uses SOL_IP from in.h (at least on my machine). E.g. SOL_XXX is
not exported to user.
>
>> +
>> static unsigned int port = DEFAULT_PORT;
>> static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
>> static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
>> +static bool zerocopy;
>>
>> static void error(const char *s)
>> {
>> @@ -247,15 +254,74 @@ static void run_receiver(unsigned long rcvlowat_bytes)
>> close(fd);
>> }
>>
>> +static void recv_completion(int fd)
>> +{
>> + struct sock_extended_err *serr;
>> + char cmsg_data[128];
>> + struct cmsghdr *cm;
>> + struct msghdr msg;
>> + int ret;
>> +
>> + msg.msg_control = cmsg_data;
>> + msg.msg_controllen = sizeof(cmsg_data);
>> +
>> + ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
>> + if (ret == -1)
>> + return;
>> +
>> + cm = CMSG_FIRSTHDR(&msg);
>> + if (!cm) {
>> + fprintf(stderr, "cmsg: no cmsg\n");
>> + return;
>> + }
>> +
>> + if (cm->cmsg_level != SOL_VSOCK) {
>> + fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
>> + return;
>> + }
>> +
>> + if (cm->cmsg_type) {
>> + fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
>> + return;
>> + }
>> +
>> + serr = (void *)CMSG_DATA(cm);
>> + if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
>> + fprintf(stderr, "serr: wrong origin\n");
>> + return;
>> + }
>> +
>> + if (serr->ee_errno) {
>> + fprintf(stderr, "serr: wrong error code\n");
>> + return;
>> + }
>> +
>> + if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
>> + fprintf(stderr, "warning: copy instead of zerocopy\n");
>> +}
>> +
>> +static void enable_so_zerocopy(int fd)
>> +{
>> + int val = 1;
>> +
>> + if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
>> + error("setsockopt(SO_ZEROCOPY)");
>> +}
>> +
>> static void run_sender(int peer_cid, unsigned long to_send_bytes)
>> {
>> time_t tx_begin_ns;
>> time_t tx_total_ns;
>> size_t total_send;
>> + time_t time_in_send;
>> void *data;
>> int fd;
>>
>> - printf("Run as sender\n");
>> + if (zerocopy)
>> + printf("Run as sender MSG_ZEROCOPY\n");
>> + else
>> + printf("Run as sender\n");
>> +
>> printf("Connect to %i:%u\n", peer_cid, port);
>> printf("Send %lu bytes\n", to_send_bytes);
>> printf("TX buffer %lu bytes\n", buf_size_bytes);
>> @@ -265,25 +331,58 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
>> if (fd < 0)
>> exit(EXIT_FAILURE);
>>
>> - data = malloc(buf_size_bytes);
>> + if (zerocopy) {
>> + enable_so_zerocopy(fd);
>>
>> - if (!data) {
>> - fprintf(stderr, "'malloc()' failed\n");
>> - exit(EXIT_FAILURE);
>> + data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
>> + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> + if (data == MAP_FAILED) {
>> + perror("mmap");
>> + exit(EXIT_FAILURE);
>> + }
>> + } else {
>> + data = malloc(buf_size_bytes);
>> +
>> + if (!data) {
>> + fprintf(stderr, "'malloc()' failed\n");
>> + exit(EXIT_FAILURE);
>> + }
>> }
>
> Eventually to simplify the code I think we can use the mmaped buffer in
> both cases.
>
>>
>> memset(data, 0, buf_size_bytes);
>> total_send = 0;
>> + time_in_send = 0;
>> tx_begin_ns = current_nsec();
>>
>> while (total_send < to_send_bytes) {
>> ssize_t sent;
>> + size_t rest_bytes;
>> + time_t before;
>> +
>> + rest_bytes = to_send_bytes - total_send;
>>
>> - sent = write(fd, data, buf_size_bytes);
>> + before = current_nsec();
>> + sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
>> + buf_size_bytes : rest_bytes,
>> + zerocopy ? MSG_ZEROCOPY : 0);
>> + time_in_send += (current_nsec() - before);
>>
>> if (sent <= 0)
>> error("write");
>>
>> + if (zerocopy) {
>> + struct pollfd fds = { 0 };
>> +
>> + fds.fd = fd;
>
> Which event are we waiting for here?
>
>> +
>> + if (poll(&fds, 1, -1) < 0) {
>> + perror("poll");
>> + exit(EXIT_FAILURE);
>> + }
>
> We need this because we use only one buffer, but if we use more than
> one, we could take full advantage of zerocopy, right?
>
> Otherwise, I don't think it's a fair comparison with non-zerocopy.
Yes or course, in next versions i'll update this test for using
more long iovs.
>
> Thanks,
> Stefano
>
>> +
>> + recv_completion(fd);
>> + }
>> +
>> total_send += sent;
>> }
>>
>> @@ -294,9 +393,14 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
>> get_gbps(total_send * 8, tx_total_ns));
>> printf("total time in 'write()': %f sec\n",
>> (float)tx_total_ns / NSEC_PER_SEC);
>> + printf("time in send %f\n", (float)time_in_send / NSEC_PER_SEC);
>>
>> close(fd);
>> - free(data);
>> +
>> + if (zerocopy)
>> + munmap(data, buf_size_bytes);
>> + else
>> + free(data);
>> }
>>
>> static const char optstring[] = "";
>> @@ -336,6 +440,11 @@ static const struct option longopts[] = {
>> .has_arg = required_argument,
>> .val = 'R',
>> },
>> + {
>> + .name = "zc",
>> + .has_arg = no_argument,
>> + .val = 'Z',
>> + },
>> {},
>> };
>>
>> @@ -351,6 +460,7 @@ static void usage(void)
>> " --help This message\n"
>> " --sender <cid> Sender mode (receiver default)\n"
>> " <cid> of the receiver to connect to\n"
>> + " --zc Enable zerocopy\n"
>> " --port <port> Port (default %d)\n"
>> " --bytes <bytes>KMG Bytes to send (default %d)\n"
>> " --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
>> @@ -413,6 +523,9 @@ int main(int argc, char **argv)
>> case 'H': /* Help. */
>> usage();
>> break;
>> + case 'Z': /* Zerocopy. */
>> + zerocopy = true;
>> + break;
>> default:
>> usage();
>> }
>> --
>> 2.25.1
>>
>
On Mon, Feb 20, 2023 at 08:59:32AM +0000, Krasnov Arseniy wrote:
>On 16.02.2023 16:33, Stefano Garzarella wrote:
>> Hi Arseniy,
>> sorry for the delay, but I was offline.
>
>Hello! Sure no problem!, i was also offline a little bit so i'm replying
>just now
>
>>
>> On Mon, Feb 06, 2023 at 06:51:55AM +0000, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>> ????????????????????????? DESCRIPTION
>>>
>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>>> current implementation for TCP as much as possible:
>>>
>>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>>> ? flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>>> ? flag will be ignored (e.g. without completion).
>>>
>>> 2) Kernel uses completions from socket's error queue. Single completion
>>> ? for single tx syscall (or it can merge several completions to single
>>> ? one). I used already implemented logic for MSG_ZEROCOPY support:
>>> ? 'msg_zerocopy_realloc()' etc.
>>
>> I will review for the vsock point of view. Hope some net maintainers can
>> comment about SO_ZEROCOPY.
>>
>> Anyway I think is a good idea to keep it as close as possible to the TCP
>> implementation.
>>
>>>
>>> Difference with copy way is not significant. During packet allocation,
>>> non-linear skb is created, then I call 'get_user_pages()' for each page
>>> from user's iov iterator (I think i don't need 'pin_user_pages()' as
>>
>> Are these pages exposed to the host via virtqueues? If so, I think we
>> should pin them. What happens if the host accesses them but these pages
>> have been unmapped?
>
>Yes, user pages with data will be used by the virtio device.
>'pin' - You mean use 'pin_user_pages()'? Unmapped - You mean guest
> will unmap it, while host must access it to copy packet? Such pages
> have incremented refcount by 'get_user_pages()', so page is locked
> and memory and will be locked until host finishes copying data.
Yep, I got it while reviewing patch 7 ;-)
> I think it is better to discuss things related to 'get/pin_user_pages()'
> in one place, for example in 07/12 where this function is called.
Agree!
>
>>
>>> there is no backing storage for these pages) and add each returned page
>>> to the skb as fragment. There are also some updates for vhost and guest
>>> parts of transport - in both cases i've added handling of non-linear skb
>>> for virtio part. vhost copies data from such skb to the guest's rx virtio
>>> buffers. In the guest, virtio transport fills virtio queue with pages
>>> from skb.
>>>
>>> I think doc in Documentation/networking/msg_zerocopy.rst could be also
>>> updated in next versions.
>>
>> Yep, good idea.
>
>Ack, i'll do it in v2.
>
>>
>>>
>>> This version has several limits/problems:
>>>
>>> 1) As this feature totally depends on transport, there is no way (or it
>>> ? is difficult) to check whether transport is able to handle it or not
>>> ? during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>>> ? setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>>> ? leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>>> ? are not considered to be called from each other. So in current version
>>> ? SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>>> ? AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>>> ? tx routine will fail with EOPNOTSUPP.
>>
>> I'll take a look, but if we have no alternative, I think it's okay to
>> make tx fail.>
>
>Thanks
>
>>>
>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>>> ? one completion. In each completion there is flag which shows how tx
>>> ? was performed: zerocopy or copy. This leads that whole message must
>>> ? be send in zerocopy or copy way - we can't send part of message with
>>> ? copying and rest of message with zerocopy mode (or vice versa). Now,
>>> ? we need to account vsock credit logic, e.g. we can't send whole data
>>> ? once - only allowed number of bytes could sent at any moment. In case
>>> ? of copying way there is no problem as in worst case we can send single
>>> ? bytes, but zerocopy is more complex because smallest transmission
>>> ? unit is single page. So if there is not enough space at peer's side
>>> ? to send integer number of pages (at least one) - we will wait, thus
>>> ? stalling tx side. To overcome this problem i've added simple rule -
>>> ? zerocopy is possible only when there is enough space at another side
>>> ? for whole message (to check, that current 'msghdr' was already used
>>> ? in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>> I see the problem and I think your approach is the right one.
>>
>>>
>>> 3) loopback transport is not supported, because it requires to implement
>>> ? non-linear skb handling in dequeue logic (as we "send" fragged skb
>>> ? and "receive" it from the same queue). I'm going to implement it in
>>> ? next versions.
>>
>> loopback is useful for testing and debugging, so it would be great to
>> have the support, but if it's too complicated, we can do it later.
>>
>
>Ok, i'll implement it in v2.
>
>>>
>>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>>> ? is due to 'kmalloc()' allocated data buffers. I think, in case of
>>
>> Yep, I think so.
>> When I started touching this code, the limit was already there.
>> As you said, I think it was introduced to have a limit on (host/device
>> side?) allocation, but buf_alloc might be enough, so maybe we could
>> also remove it for copy mode.
>> The only problem I see is compatibility with old devices/drivers, so
>> maybe we need a feature in the spec to say the limit is gone, or have a
>> field in the virtio config space where the device specifies its limit
>> (for the driver, the limit is implicitly that of the buffer allocated
>> and put in the virtqueue).
>>
>> This reminded me that Laura had proposed something similar before,
>> maybe we should take it up again:
>> https://markmail.org/message/3el4ckeakfilg5wo
>>
>>> ? MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>> ? not touched for data - user space pages are used as buffers. Also
>>> ? this limit trims every message which is > 64KB, thus such messages
>>> ? will be send in copy mode due to 'iov_offset' check in 2).
>>
>> The host still needs to allocate and copy, so maybe the limitation
>> could be to avoid large allocations in the host, but actually the host
>> can use vmalloc because it doesn't need them to be contiguous.
>>
>
>Hmmm, I think it is possible to solve this situation in the following
>way - i can keep limitation for 64KB for copy mode, and remove it for
>zero copy, but I'll limit each packet size to 64KB(of course technically
>it is not exactly 64KB, it is min(max packet size, MAX_SKB_FRAGS * PAGE_SIZE)
>where max packet size is 64Kb, but for simplicity let's call this limit 64Kb:) ).
>E.g. when zerocopy transmission needs to send for example 129Kb(of course
>peer's free space is big enough and this check is passed), I'll won't trim
>129Kb to 64Kb + 64Kb + 1Kb in the current manner - by returning to af_vsock.c
>after sending every skb. I'll alloc several skbs, (3 in this case - 64Kb + 64Kb + 1Kb)
>in single call to the transport. Completion arg will be attached
>only to the last one skb, and send these 3 skbs. Host still processes
>64Kb(let it be 64Kb for simplicity again :) ) packets - no big allocations.
Make sense to me!
>Moreover, i think that this logic could be a little optimization for
>copy mode - why we allocate single skb and always return to af_vsock.c?
@Bobby, can you help us here?
>May be we can iterate needed number of skbs in the loop and send them.
Yep, I think is doable.
>
>Also about vmalloc(), IIUC there is already this idea which replaces 'kmalloc()'
>to 'kvmalloc()'.
Yep, I think it is already merged:
0e3f72931fc4 ("vhost/vsock: Use kvmalloc/kvfree for larger packets.")
But this is in the vhost transport (device emulation running in the
host), where we don't need that the pages are pinned.
>
>>>
>>> ?????????????????????????? PERFORMANCE
>>>
>>> Performance: it is a little bit tricky to compare performance between
>>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>>> user buffers will be released by kernel, so it something like synchronous
>>> path (wait until device driver will process it), while in copy way we
>>> can feed data to kernel as many as we want, don't care about device
>>> driver. So I compared only time which we spend in 'sendmsg()' syscall.
>>> Also there is limit from 4) above so max buffer size is 64KB. I've
>>> tested this patchset in the nested VM, but i think for V1 it is not a
>>> big deal.
>>>
>>> Sender:
>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 60M [--zc]
>>>
>>> Receiver:
>>> ./vsock_perf --vsk-size 256M
>>>
>>> Number in cell is seconds which senders spends inside tx syscall.
>>>
>>> Guest to host transmission:
>>>
>>> *-------------------------------*
>>> |????????? |???????? |????????? |
>>> | buf size |?? copy? | zerocopy |
>>> |????????? |???????? |????????? |
>>> *-------------------------------*
>>> |?? 4KB??? |? 0.26?? |?? 0.042? |
>>> *-------------------------------*
>>> |?? 16KB?? |? 0.11?? |?? 0.014? |
>>> *-------------------------------*
>>> |?? 32KB?? |? 0.05?? |?? 0.009? |
>>> *-------------------------------*
>>> |?? 64KB?? |? 0.04?? |?? 0.005? |
>>> *-------------------------------*
>>>
>>> Host to guest transmission:
>>>
>>> *--------------------------------*
>>> |????????? |????????? |????????? |
>>> | buf size |?? copy?? | zerocopy |
>>> |????????? |????????? |????????? |
>>> *--------------------------------*
>>> |?? 4KB??? |?? 0.049? |?? 0.034? |
>>> *--------------------------------*
>>> |?? 16KB?? |?? 0.03?? |?? 0.024? |
>>> *--------------------------------*
>>> |?? 32KB?? |?? 0.025? |?? 0.01?? |
>>> *--------------------------------*
>>> |?? 64KB?? |?? 0.028? |?? 0.01?? |
>>> *--------------------------------*
>>>
>>> If host fails to send data with "Cannot allocate memory", check value
>>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>>> allocation.
>>>
>>> Zerocopy is faster than classic copy mode, but of course it requires
>>> specific architecture of application due to user pages pinning, buffer
>>> size and alignment. In next versions i'm going to fix 64KB barrier to
>>> perform tests with bigger buffer sizes.
>>
>> Yep, I see.
>> Adjusting vsock_perf to compare also Gbps (can io_uring helps in this
>> case to have more requests in-flight?) would be great.
>>
>
>Yes, i'll add Gbps. Also I thought about adding io_uring test to
>the current test suite because io_uring also have MSG_ZEROCOPY
>type of request, and interesting thing is that io_uring uses it's
>own already allocated uarg.
Cool!
>
>>>
>>> ?????????????????????????? TESTING
>>>
>>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>>> cover new code as much as possible so there are different cases for
>>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>>> vector types (different sizes, alignments, with unmapped pages).
>>
>> Great! Thanks for adding the tests!
>>
>> I'll go through the patches between today and Monday.
>> Sorry again for taking so long!
>
>Sure, Thanks for review! I think we are not hurry :)
Yep, thank you for this work!
Stefano
On Mon, Feb 20, 2023 at 09:04:04AM +0000, Krasnov Arseniy wrote:
>On 16.02.2023 18:16, Stefano Garzarella wrote:
>> On Mon, Feb 06, 2023 at 07:00:35AM +0000, Arseniy Krasnov wrote:
>>> This adds main logic of MSG_ZEROCOPY flag processing for packet
>>> creation. When this flag is set and user's iov iterator fits for
>>> zerocopy transmission, call 'get_user_pages()' and add returned
>>> pages to the newly created skb.
>>>
>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>> ---
>>> net/vmw_vsock/virtio_transport_common.c | 212 ++++++++++++++++++++++--
>>> 1 file changed, 195 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index 05ce97b967ad..69e37f8a68a6 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -37,6 +37,169 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>> ????return container_of(t, struct virtio_transport, transport);
>>> }
>>>
>>
>> I'd use bool if we don't need to return an error value in the following
>> new functions.
>>
>>> +static int virtio_transport_can_zcopy(struct iov_iter *iov_iter,
>>> +????????????????????? size_t free_space)
>>> +{
>>> +??? size_t pages;
>>> +??? int i;
>>> +
>>> +??? if (!iter_is_iovec(iov_iter))
>>> +??????? return -1;
>>> +
>>> +??? if (iov_iter->iov_offset)
>>> +??????? return -1;
>>> +
>>> +??? /* We can't send whole iov. */
>>> +??? if (free_space < iov_iter->count)
>>> +??????? return -1;
>>> +
>>> +??? for (pages = 0, i = 0; i < iov_iter->nr_segs; i++) {
>>> +??????? const struct iovec *iovec;
>>> +??????? int pages_in_elem;
>>> +
>>> +??????? iovec = &iov_iter->iov[i];
>>> +
>>> +??????? /* Base must be page aligned. */
>>> +??????? if (offset_in_page(iovec->iov_base))
>>> +??????????? return -1;
>>> +
>>> +??????? /* Only last element could have not page aligned size.? */
>>> +??????? if (i != (iov_iter->nr_segs - 1)) {
>>> +??????????? if (offset_in_page(iovec->iov_len))
>>> +??????????????? return -1;
>>> +
>>> +??????????? pages_in_elem = iovec->iov_len >> PAGE_SHIFT;
>>> +??????? } else {
>>> +??????????? pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE);
>>> +??????????? pages_in_elem >>= PAGE_SHIFT;
>>> +??????? }
>>> +
>>> +??????? /* In case of user's pages - one page is one frag. */
>>> +??????? if (pages + pages_in_elem > MAX_SKB_FRAGS)
>>> +??????????? return -1;
>>> +
>>> +??????? pages += pages_in_elem;
>>> +??? }
>>> +
>>> +??? return 0;
>>> +}
>>> +
>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> +?????????????????????? struct sk_buff *skb,
>>> +?????????????????????? struct iov_iter *iter,
>>> +?????????????????????? bool zerocopy)
>>> +{
>>> +??? struct ubuf_info_msgzc *uarg_zc;
>>> +??? struct ubuf_info *uarg;
>>> +
>>> +??? uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> +??????????????????? iov_length(iter->iov, iter->nr_segs),
>>> +??????????????????? NULL);
>>> +
>>> +??? if (!uarg)
>>> +??????? return -1;
>>> +
>>> +??? uarg_zc = uarg_to_msgzc(uarg);
>>> +??? uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>> +
>>> +??? skb_zcopy_init(skb, uarg);
>>> +
>>> +??? return 0;
>>> +}
>>> +
>>> +static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb,
>>> +?????????????????????????? struct vsock_sock *vsk,
>>> +?????????????????????????? struct virtio_vsock_pkt_info *info)
>>> +{
>>> +??? struct iov_iter *iter;
>>> +??? int frag_idx;
>>> +??? int seg_idx;
>>> +
>>> +??? iter = &info->msg->msg_iter;
>>> +??? frag_idx = 0;
>>> +??? VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0;
>>> +??? VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>>> +
>>> +??? /* At this moment:
>>> +???? * 1) 'iov_offset' is zero.
>>> +???? * 2) Every 'iov_base' and 'iov_len' are also page aligned
>>> +???? *??? (except length of the last element).
>>> +???? * 3) Number of pages in this iov <= MAX_SKB_FRAGS.
>>> +???? * 4) Length of the data fits in current credit space.
>>> +???? */
>>> +??? for (seg_idx = 0; seg_idx < iter->nr_segs; seg_idx++) {
>>> +??????? struct page *user_pages[MAX_SKB_FRAGS];
>>> +??????? const struct iovec *iovec;
>>> +??????? size_t last_frag_len;
>>> +??????? size_t pages_in_seg;
>>> +??????? int page_idx;
>>> +
>>> +??????? iovec = &iter->iov[seg_idx];
>>> +??????? pages_in_seg = iovec->iov_len >> PAGE_SHIFT;
>>> +
>>> +??????? if (iovec->iov_len % PAGE_SIZE) {
>>> +??????????? last_frag_len = iovec->iov_len % PAGE_SIZE;
>>> +??????????? pages_in_seg++;
>>> +??????? } else {
>>> +??????????? last_frag_len = PAGE_SIZE;
>>> +??????? }
>>> +
>>> +??????? if (get_user_pages((unsigned long)iovec->iov_base,
>>> +?????????????????? pages_in_seg, FOLL_GET, user_pages,
>>> +?????????????????? NULL) != pages_in_seg)
>>> +??????????? return -1;
>>
>> Reading the get_user_pages() documentation, this should pin the user
>> pages, so we should be fine if we then expose them in the virtqueue.
>>
>> But reading Documentation/core-api/pin_user_pages.rst it seems that
>> drivers should use "pin_user_pages*() for DMA-pinned pages", so I'm not
>> sure what we should do.
>>
>That is really interesting question for me too. IIUC 'pin_user_pages()'
>sets special value to ref counter of page, so we can distinguish such
>pages from the others. I've grepped for pinned pages check and found,
>the it is used in mm/vmscan.c by calling 'folio_maybe_dma_pinned()' during
>page lists processing. Seems 'pin_user_pages()' is more strict version of
>'get_user_pages()' and it is recommended to use 'pin_' when data on these
>pages will be accessed.
>I think, i'll check which API is used in the TCP implementation for zerocopy
>transmission.
>
>> Additional advice would be great!
>>
>> Anyway, when we are done using the pages, we should call put_page() or
>> unpin_user_page() depending on how we pin them.
>>
>In case of 'get_user_pages()' everything is ok here: when such skb
>will be released, 'put_page()' will be called for every frag page
>of it, so there is no page leak.
Got it!
>But in case of 'pin_user_pages()',
>i will need to unpin in manually before calling 'consume_skb()'
>after it is processed by virtio device. But anyway - it is not a
>problem.
Yep.
Thanks,
Stefano
On Mon, Feb 20, 2023 at 09:05:12AM +0000, Krasnov Arseniy wrote:
>On 16.02.2023 18:29, Stefano Garzarella wrote:
>> On Mon, Feb 06, 2023 at 07:06:32AM +0000, Arseniy Krasnov wrote:
>>> To use this option pass '--zc' parameter:
>>
>> --zerocopy or --zero-copy maybe better follow what we did with the other parameters :-)
>>
>>>
>>> ./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>
>>>
>>> With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.
>>>
>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>> ---
>>> tools/testing/vsock/vsock_perf.c | 127 +++++++++++++++++++++++++++++--
>>> 1 file changed, 120 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
>>> index a72520338f84..1d435be9b48e 100644
>>> --- a/tools/testing/vsock/vsock_perf.c
>>> +++ b/tools/testing/vsock/vsock_perf.c
>>> @@ -18,6 +18,8 @@
>>> #include <poll.h>
>>> #include <sys/socket.h>
>>> #include <linux/vm_sockets.h>
>>> +#include <sys/mman.h>
>>> +#include <linux/errqueue.h>
>>>
>>> #define DEFAULT_BUF_SIZE_BYTES??? (128 * 1024)
>>> #define DEFAULT_TO_SEND_BYTES??? (64 * 1024)
>>> @@ -28,9 +30,14 @@
>>> #define BYTES_PER_GB??????? (1024 * 1024 * 1024ULL)
>>> #define NSEC_PER_SEC??????? (1000000000ULL)
>>>
>>> +#ifndef SOL_VSOCK
>>> +#define SOL_VSOCK 287
>>> +#endif
>>
>> I thought we use the current kernel headers when we compile the tests,
>> do we need to fix something in the makefile?
>Not sure, of course we are using uapi. But i see, that defines like SOL_XXX is not
>defined in uapi headers. For example SOL_IP is defined in include/linux/socket.h,
>but userspace app uses SOL_IP from in.h (at least on my machine). E.g. SOL_XXX is
>not exported to user.
Right, I see only few SOL_* in the uapi, e.g. SOL_TIPC in
uapi/linux/tipc.h
So it's fine for now, otherwise we can define it in
uapi/linux/vm_sockets.h
Thanks,
Stefano