Hello,
DESCRIPTION
This patchset adds MSG_ZEROCOPY support for virtio/vsock. I tried to follow
the current TCP implementation as closely as possible:
1) The sender must set the SO_ZEROCOPY socket option to use this feature.
Without it, data is sent in the "classic" copy manner and the
MSG_ZEROCOPY flag is ignored (i.e. no completion is generated).
2) The kernel reports completions through the socket's error queue: one
completion per tx syscall (or several completions merged into a single
one). I reused the existing MSG_ZEROCOPY logic:
'msg_zerocopy_realloc()' etc.
The difference from the copy path is not significant. During packet
allocation a non-linear skb is created, then I call 'pin_user_pages()' for
each page from the user's iov iterator and add each returned page to the skb
as a fragment.
There are also some updates for the vhost and guest parts of the transport:
in both cases I've added handling of non-linear skbs for the virtio part.
vhost copies data from such an skb to the guest's rx virtio buffers. In the
guest, the virtio transport fills the tx virtio queue with pages from the skb.
This version has several limitations/problems:
1) As this feature depends entirely on the transport, there is no way (or it
is difficult) to check whether the transport is able to handle it
when SO_ZEROCOPY is set. It seems I would need to call the AF_VSOCK-specific
setsockopt callback from the SOL_SOCKET setsockopt callback, but this
leads to a locking problem, because the AF_VSOCK and SOL_SOCKET callbacks
are not designed to be called from each other. So in the current version
SO_ZEROCOPY is set successfully on an AF_VSOCK socket of any type (i.e.
with any transport), but if the transport does not support MSG_ZEROCOPY,
the tx routine fails with EOPNOTSUPP.
2) When MSG_ZEROCOPY is used, we need to enqueue one completion for each
tx system call. Each completion carries a flag which shows how the tx
was performed: zerocopy or copy. This means that the whole message must
be sent either in zerocopy or in copy mode - we can't send part of a
message with copying and the rest in zerocopy mode (or vice versa). Now
we need to account for the vsock credit logic, i.e. we can't send all the
data at once - only the allowed number of bytes can be sent at any moment.
In copy mode there is no problem, as in the worst case we can send single
bytes, but zerocopy is more complex because the smallest transmission
unit is a single page. So if there is not enough space on the peer's side
to send an integer number of pages (at least one), we will wait, thus
stalling the tx side. To overcome this problem I've added a simple rule:
zerocopy is possible only when there is enough space on the other side
for the whole message (to check whether the current 'msghdr' was already
used in previous tx iterations I use the 'iov_offset' field of its iov iter).
3) The loopback transport is not supported, because it requires
non-linear skb handling in the dequeue logic (as we "send" a fragged skb
and "receive" it from the same queue). I'm going to implement it in
the next versions.
^^^ fixed in v2
4) The current implementation sets the max packet length to 64KB. IIUC this
is due to 'kmalloc()'-allocated data buffers. I think that in the case of
MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
not used for data - user space pages are used as buffers. Also
this limit trims every message which is > 64KB, thus such messages
will be sent in copy mode due to the 'iov_offset' check in 2).
^^^ fixed in v2
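The eligibility rule from 2) can be modelled in user space roughly as
follows. This is only a sketch, not the kernel code: 'zc_eligible()' and its
parameters are hypothetical names, 'peer_free' stands for the credit
currently available at the peer, and the alignment checks are modelled on
those the patch performs on the iovec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Returns true if the whole iovec could be sent in zerocopy mode. */
static inline bool zc_eligible(const struct iovec *iov, size_t nr_segs,
			       size_t iov_offset, size_t peer_free,
			       size_t page_size)
{
	size_t total = 0;
	size_t i;

	/* Page size must be a non-zero power of two. */
	assert(page_size && !(page_size & (page_size - 1)));

	/* A non-zero offset means this msghdr was already partially sent. */
	if (iov_offset)
		return false;

	for (i = 0; i < nr_segs; i++) {
		/* Every segment must start on a page boundary. */
		if ((uintptr_t)iov[i].iov_base & (page_size - 1))
			return false;

		/* Only the last segment may have a partial page. */
		if (i != nr_segs - 1 && (iov[i].iov_len & (page_size - 1)))
			return false;

		total += iov[i].iov_len;
	}

	/* The peer must have room for the whole message at once. */
	return total <= peer_free;
}
```

If the function returns false, the message would be sent in copy mode and
the completion would report that fallback.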
PATCHSET STRUCTURE
Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.
PERFORMANCE
Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In zerocopy mode we need to wait until
the user buffers are released by the kernel, so it is something like a
synchronous path (wait until the device driver has processed them), while
in copy mode we can feed as much data to the kernel as we want, without
caring about the device driver. So I compared only the time spent in the
'send()' syscall. Combining this value with the total number of transmitted
bytes gives a Gbit/s figure. Also, to avoid tx stalls due to insufficient
credit, the receiver allocates the same amount of space as the sender needs.
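For reference, the conversion from "time spent in 'send()'" and "bytes
transmitted" to Gbit/s can be sketched as below. The function name is
hypothetical and this is not taken from vsock_perf; it only shows the
arithmetic (bits per nanosecond is numerically equal to Gbit/s).

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in Gbit/s given total bytes and nanoseconds spent in send(). */
static inline double send_gbits_per_sec(uint64_t bytes, uint64_t ns_in_send)
{
	assert(ns_in_send > 0);

	/* bits / ns == 1e9 bits / s == Gbit/s */
	return (double)bytes * 8.0 / (double)ns_in_send;
}
```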
Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
Receiver:
./vsock_perf --vsk-size 256M
G2H transmission (values are Gbit/s):
*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |    3    |    10    |
*-------------------------------*
|   32KB   |    9    |    45    |
*-------------------------------*
|  256KB   |   24    |   195    |
*-------------------------------*
|    1M    |   27    |   270    |
*-------------------------------*
|    8M    |   22    |   277    |
*-------------------------------*
H2G:
*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |   17    |    11    |
*-------------------------------*
|   32KB   |   30    |    66    |
*-------------------------------*
|  256KB   |   38    |   179    |
*-------------------------------*
|    1M    |   38    |   234    |
*-------------------------------*
|    8M    |   28    |   279    |
*-------------------------------*
Loopback:
*-------------------------------*
|          |         |          |
| buf size |   copy  | zerocopy |
|          |         |          |
*-------------------------------*
|   4KB    |    8    |    7     |
*-------------------------------*
|   32KB   |   34    |    42    |
*-------------------------------*
|  256KB   |   43    |    83    |
*-------------------------------*
|    1M    |   40    |   109    |
*-------------------------------*
|    8M    |   40    |   171    |
*-------------------------------*
I suppose that the huge difference between the two modes has two reasons:
1) We don't need to copy data.
2) We don't need to allocate a buffer for data, only for the header.
For buffers of 32KB and larger zerocopy is faster than classic copy mode
(at 4KB it is slower for H2G and loopback), but of course it requires
a specific application architecture due to user page pinning, buffer
size and alignment.
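One way an application might satisfy the alignment requirement is to
allocate its tx buffers page aligned and rounded up to whole pages. The
sketch below uses C11 'aligned_alloc()'; the helper name 'zc_alloc_buffer()'
is hypothetical, and whether this is sufficient for a given workload is an
assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Allocate a page-aligned buffer of at least 'len' bytes, rounded up to a
 * whole number of pages. Returns NULL on failure or if 'len' is zero.
 * The caller releases it with free().
 */
static inline void *zc_alloc_buffer(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t rounded;

	assert(page > 0);

	if (len == 0)
		return NULL;

	/* aligned_alloc() wants the size to be a multiple of the alignment. */
	rounded = (len + (size_t)page - 1) / (size_t)page * (size_t)page;

	return aligned_alloc((size_t)page, rounded);
}
```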
If the host fails to send data with "Cannot allocate memory", check the
value of /proc/sys/net/core/optmem_max - it is accounted for during
completion skb allocation.
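A test program could read that limit before starting a long zerocopy run.
The helper below is a hypothetical sketch that reads a single integer from a
procfs-style file; the path is passed in by the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer value from 'path'; returns -1 if the file cannot be
 * opened or does not start with a number.
 */
static inline long read_sysctl_long(const char *path)
{
	long val = -1;
	FILE *f;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (fscanf(f, "%ld", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}
```

For example, 'read_sysctl_long("/proc/sys/net/core/optmem_max")' would
return the current limit in bytes on a system where that file exists.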
TESTING
This patchset includes a set of tests for the MSG_ZEROCOPY feature. I tried
to cover the new code as much as possible, so there are different cases for
MSG_ZEROCOPY transmissions: with SO_ZEROCOPY disabled and with several io
vector types (different sizes, alignments, with unmapped pages). I also
ran the tests with the loopback transport and with vsockmon running.
Thanks, Arseniy
Link to v1:
https://lore.kernel.org/netdev/[email protected]/
Changelog:
v1 -> v2:
- Replace 'get_user_pages()' with 'pin_user_pages()'.
- Loopback transport support.
Arseniy Krasnov (15):
vsock/virtio: prepare for non-linear skb support
vhost/vsock: non-linear skb handling support
vsock/virtio: non-linear skb handling support
vsock/virtio: non-linear skb handling for tap
vsock/virtio: MSG_ZEROCOPY flag support
vsock: check error queue to set EPOLLERR
vsock: read from socket's error queue
vsock: check for MSG_ZEROCOPY support
vhost/vsock: support MSG_ZEROCOPY for transport
vsock/virtio: support MSG_ZEROCOPY for transport
vsock/loopback: support MSG_ZEROCOPY for transport
net/sock: enable setting SO_ZEROCOPY for PF_VSOCK
test/vsock: MSG_ZEROCOPY flag tests
test/vsock: MSG_ZEROCOPY support for vsock_perf
docs: net: description of MSG_ZEROCOPY for AF_VSOCK
Documentation/networking/msg_zerocopy.rst | 12 +-
drivers/vhost/vsock.c | 29 +-
include/linux/socket.h | 1 +
include/linux/virtio_vsock.h | 7 +
include/net/af_vsock.h | 3 +
net/core/sock.c | 4 +-
net/vmw_vsock/af_vsock.c | 16 +-
net/vmw_vsock/virtio_transport.c | 39 +-
net/vmw_vsock/virtio_transport_common.c | 497 ++++++++++++++++++---
net/vmw_vsock/vsock_loopback.c | 8 +
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.h | 1 +
tools/testing/vsock/vsock_perf.c | 139 +++++-
tools/testing/vsock/vsock_test.c | 11 +
tools/testing/vsock/vsock_test_zerocopy.c | 501 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 12 +
16 files changed, 1194 insertions(+), 88 deletions(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
--
2.25.1
Use pages of non-linear skb as buffers in virtio tx queue.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..1c269c3f010d 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];
for (;;) {
- struct scatterlist hdr, buf, *sgs[2];
+ /* +1 is for packet header. */
+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +113,30 @@ virtio_transport_send_pkt_work(struct work_struct *work)
virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+ sg_init_one(&bufs[0], virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
+ sgs[out_sg++] = &bufs[0];
+
+ if (skb_is_nonlinear(skb)) {
+ int i;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ struct page *data_page = skb_shinfo(skb)->frags[i].bv_page;
+
+ /* We will use 'page_to_virt()' for userspace page here,
+ * because virtio layer will call 'virt_to_phys()' later
+ * to fill buffer descriptor. We don't touch memory at
+ * "virtual" address of this page.
+ */
+ sg_init_one(&bufs[i + 1],
+ page_to_virt(data_page), PAGE_SIZE);
+ sgs[out_sg++] = &bufs[i + 1];
+ }
+ } else {
+ if (skb->len > 0) {
+ sg_init_one(&bufs[1], skb->data, skb->len);
+ sgs[out_sg++] = &bufs[1];
+ }
- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
- sgs[out_sg++] = &hdr;
- if (skb->len > 0) {
- sg_init_one(&buf, skb->data, skb->len);
- sgs[out_sg++] = &buf;
}
ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
--
2.25.1
This is a preparation patch for non-linear skbuff handling. It does two
things:
1) Handles freeing of non-linear skbuffs.
2) Adds copying from non-linear skbuffs to the user's buffer.
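The copy in 2) walks the fragments with a (current fragment, offset) cursor
that is advanced only when data is really consumed. A simplified user-space
model of that idea is sketched below; 'struct frag', 'struct frag_cursor'
and 'frags_copy()' are hypothetical stand-ins, and every fragment is assumed
to start at offset 0 of its page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for an skb fragment and the per-skb cursor. */
struct frag {
	const char *data;
	size_t len;
};

struct frag_cursor {
	size_t curr_frag;	/* index of the fragment being read */
	size_t frag_off;	/* offset inside that fragment */
};

/* Copy up to 'len' bytes starting at the cursor and return the number of
 * bytes copied. With 'peek' set the cursor is left untouched, so the same
 * data can be read again.
 */
static inline size_t frags_copy(const struct frag *frags, size_t nr_frags,
				struct frag_cursor *cur, char *dst,
				size_t len, bool peek)
{
	size_t frag = cur->curr_frag;
	size_t off = cur->frag_off;
	size_t done = 0;

	assert(frag <= nr_frags);

	while (done < len && frag < nr_frags) {
		size_t chunk = frags[frag].len - off;

		if (chunk > len - done)
			chunk = len - done;

		memcpy(dst + done, frags[frag].data + off, chunk);
		done += chunk;
		off += chunk;

		/* Fragment exhausted: move to the start of the next one. */
		if (off == frags[frag].len) {
			frag++;
			off = 0;
		}
	}

	if (!peek) {
		cur->curr_frag = frag;
		cur->frag_off = off;
	}

	return done;
}
```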
Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/linux/virtio_vsock.h | 7 +++
net/vmw_vsock/virtio_transport_common.c | 84 +++++++++++++++++++++++--
2 files changed, 87 insertions(+), 4 deletions(-)
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..848ec255e665 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,10 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+ /* Current fragment in 'frags' of skb. */
+ u32 curr_frag;
+ /* Offset from 0 in current fragment. */
+ u32 frag_off;
};
#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
@@ -246,4 +250,7 @@ void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
void virtio_transport_deliver_tap_pkt(struct sk_buff *skb);
int virtio_transport_purge_skbs(void *vsk, struct sk_buff_head *list);
int virtio_transport_read_skb(struct vsock_sock *vsk, skb_read_actor_t read_actor);
+int virtio_transport_nl_skb_to_iov(struct sk_buff *skb,
+ struct iov_iter *iov_iter, size_t len,
+ bool peek);
#endif /* _LINUX_VIRTIO_VSOCK_H */
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index dde3c870bddd..b901017b9f92 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -337,6 +337,60 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk)
return virtio_transport_send_pkt_info(vsk, &info);
}
+int virtio_transport_nl_skb_to_iov(struct sk_buff *skb,
+ struct iov_iter *iov_iter,
+ size_t len,
+ bool peek)
+{
+ unsigned int skb_len;
+ size_t rest_len = len;
+ int curr_frag;
+ int curr_offs;
+ int err = 0;
+
+ skb_len = skb->len;
+ curr_frag = VIRTIO_VSOCK_SKB_CB(skb)->curr_frag;
+ curr_offs = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+
+ while (rest_len && skb->len) {
+ struct bio_vec *curr_vec;
+ size_t curr_vec_end;
+ size_t to_copy;
+ void *data;
+
+ curr_vec = &skb_shinfo(skb)->frags[curr_frag];
+ curr_vec_end = curr_vec->bv_offset + curr_vec->bv_len;
+ to_copy = min(rest_len, (size_t)(curr_vec_end - curr_offs));
+ data = kmap_local_page(curr_vec->bv_page);
+
+ if (copy_to_iter(data + curr_offs, to_copy, iov_iter) != to_copy)
+ err = -EFAULT;
+
+ kunmap_local(data);
+
+ if (err)
+ break;
+
+ rest_len -= to_copy;
+ skb_len -= to_copy;
+ curr_offs += to_copy;
+
+ if (curr_offs == (curr_vec_end)) {
+ curr_frag++;
+ curr_offs = 0;
+ }
+ }
+
+ if (!peek) {
+ skb->len = skb_len;
+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = curr_frag;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = curr_offs;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_nl_skb_to_iov);
+
static ssize_t
virtio_transport_stream_do_peek(struct vsock_sock *vsk,
struct msghdr *msg,
@@ -365,7 +419,14 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, skb->data + off, bytes);
+ if (skb_is_nonlinear(skb)) {
+ err = virtio_transport_nl_skb_to_iov(skb,
+ &msg->msg_iter,
+ bytes,
+ true);
+ } else {
+ err = memcpy_to_msg(msg, skb->data + off, bytes);
+ }
if (err)
goto out;
@@ -417,14 +478,22 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, skb->data, bytes);
+ if (skb_is_nonlinear(skb)) {
+ err = virtio_transport_nl_skb_to_iov(skb, &msg->msg_iter,
+ bytes, false);
+ } else {
+ err = memcpy_to_msg(msg, skb->data, bytes);
+ }
+
if (err)
goto out;
spin_lock_bh(&vvs->rx_lock);
total += bytes;
- skb_pull(skb, bytes);
+
+ if (!skb_is_nonlinear(skb))
+ skb_pull(skb, bytes);
if (skb->len == 0) {
u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
@@ -498,7 +567,14 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);
- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+ if (skb_is_nonlinear(skb)) {
+ err = virtio_transport_nl_skb_to_iov(skb,
+ &msg->msg_iter,
+ bytes_to_copy,
+ false);
+ } else {
+ err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+ }
if (err) {
/* Copy of message failed. Rest of
* fragments will be freed without copy.
--
2.25.1
This adds copying to the guest's virtio buffers from non-linear skbs. Such
skbs are created by the protocol layer when the MSG_ZEROCOPY flag is used.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..1e70aa390e44 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -197,11 +197,20 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}
- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
- kfree_skb(skb);
- vq_err(vq, "Faulted on copying pkt buf\n");
- break;
+ if (skb_is_nonlinear(skb)) {
+ if (virtio_transport_nl_skb_to_iov(skb, &iov_iter,
+ payload_len,
+ false)) {
+ vq_err(vq, "Faulted on copying pkt buf from page\n");
+ break;
+ }
+ } else {
+ nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
+ if (nbytes != payload_len) {
+ kfree_skb(skb);
+ vq_err(vq, "Faulted on copying pkt buf\n");
+ break;
+ }
}
/* Deliver to monitoring devices all packets that we
@@ -212,7 +221,9 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;
- skb_pull(skb, payload_len);
+ if (!skb_is_nonlinear(skb))
+ skb_pull(skb, payload_len);
+
total_len += payload_len;
/* If we didn't send all the payload we can requeue the packet
--
2.25.1
This feature depends entirely on the transport, so if the transport doesn't
support it, return an error.
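The check treats a missing callback the same as a callback that returns
false. A reduced user-space model of that decision, with hypothetical names
('struct transport_ops', 'check_zerocopy()', 'always_allow()'), could look
like this:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000	/* fallback for older libc headers */
#endif

/* Hypothetical, reduced stand-in for 'struct vsock_transport'. */
struct transport_ops {
	bool (*msgzerocopy_allow)(void);	/* optional, may be NULL */
};

/* Example callback for a transport that supports zerocopy. */
static inline bool always_allow(void)
{
	return true;
}

/* Returns 0 if the send may proceed, -EOPNOTSUPP if MSG_ZEROCOPY was
 * requested on a transport that cannot handle it.
 */
static inline int check_zerocopy(const struct transport_ops *t, int msg_flags)
{
	assert(t != NULL);

	if (!(msg_flags & MSG_ZEROCOPY))
		return 0;

	if (!t->msgzerocopy_allow || !t->msgzerocopy_allow())
		return -EOPNOTSUPP;

	return 0;
}
```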
Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/net/af_vsock.h | 3 +++
net/vmw_vsock/af_vsock.c | 7 +++++++
2 files changed, 10 insertions(+)
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..270ca54cfab8 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -177,6 +177,9 @@ struct vsock_transport {
/* Read a single skb */
int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
+
+ /* Zero-copy. */
+ bool (*msgzerocopy_allow)(void);
};
/**** CORE ****/
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index c50d2632a75f..eac8c1affd6a 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1824,6 +1824,13 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
+ if (msg->msg_flags & MSG_ZEROCOPY &&
+ (!transport->msgzerocopy_allow ||
+ !transport->msgzerocopy_allow())) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
/* Wait for room in the produce queue to enqueue our user's data. */
timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
--
2.25.1
For the tap device a new skb is created and data from the current skb is
copied to it. This adds copying of data from a non-linear skb to the
new skb.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b901017b9f92..280497d97076 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -101,6 +101,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}
+int virtio_transport_nl_skb_to_iov(struct sk_buff *skb,
+ struct iov_iter *iov_iter,
+ size_t len,
+ bool peek);
+static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ struct iov_iter iov_iter = { 0 };
+ struct iovec iovec;
+
+ iovec.iov_base = dst;
+ iovec.iov_len = len;
+
+ iov_iter.iter_type = ITER_IOVEC;
+ iov_iter.iov = &iovec;
+ iov_iter.nr_segs = 1;
+
+ virtio_transport_nl_skb_to_iov(skb, &iov_iter, len, false);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -109,7 +130,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;
/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -117,7 +137,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;
skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -160,7 +179,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(pkt)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}
return skb;
--
2.25.1
This adds handling of the MSG_ERRQUEUE input flag in the receive call, so
that an skb from the socket's error queue is read.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/linux/socket.h | 1 +
net/vmw_vsock/af_vsock.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 13c3a237b9c9..19a6f39fa014 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -379,6 +379,7 @@ struct ucred {
#define SOL_MPTCP 284
#define SOL_MCTP 285
#define SOL_SMC 286
+#define SOL_VSOCK 287
/* IPX options */
#define IPX_TYPE 1
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 137a0db6eaac..c50d2632a75f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -110,6 +110,7 @@
#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <linux/errqueue.h>
static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
static void vsock_sk_destruct(struct sock *sk);
@@ -2135,6 +2136,10 @@ vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
int err;
sk = sock->sk;
+
+ if (unlikely(flags & MSG_ERRQUEUE))
+ return sock_recv_errqueue(sk, msg, len, SOL_VSOCK, 0);
+
vsk = vsock_sk(sk);
err = 0;
--
2.25.1
Add 'msgzerocopy_allow()' callback for loopback transport.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/vsock_loopback.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index e3afc0c866f5..0de1436c7d4f 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -48,6 +48,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
}
static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_msgzerocopy_allow(void);
static struct virtio_transport loopback_transport = {
.transport = {
@@ -93,11 +94,18 @@ static struct virtio_transport loopback_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,
.read_skb = virtio_transport_read_skb,
+
+ .msgzerocopy_allow = vsock_loopback_msgzerocopy_allow,
},
.send_pkt = vsock_loopback_send_pkt,
};
+static bool vsock_loopback_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
{
return true;
--
2.25.1
Add 'msgzerocopy_allow()' callback for vhost transport.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 1e70aa390e44..4a33940a6020 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -405,6 +405,11 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
return val < vq->num;
}
+static bool vhost_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool vhost_transport_seqpacket_allow(u32 remote_cid);
static struct virtio_transport vhost_transport = {
@@ -451,6 +456,7 @@ static struct virtio_transport vhost_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,
.read_skb = virtio_transport_read_skb,
+ .msgzerocopy_allow = vhost_transport_msgzerocopy_allow,
},
.send_pkt = vhost_transport_send_pkt,
--
2.25.1
Add 'msgzerocopy_allow()' callback for virtio transport.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 1c269c3f010d..ca12db84e053 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -433,6 +433,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}
+static bool virtio_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);
static struct virtio_transport virtio_transport = {
@@ -479,6 +484,8 @@ static struct virtio_transport virtio_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,
.read_skb = virtio_transport_read_skb,
+
+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,
},
.send_pkt = virtio_transport_send_pkt,
--
2.25.1
This adds handling of the MSG_ZEROCOPY flag on the transmission path, by
allocating non-linear skbuffs and filling them with the user's pages.
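When a user segment is turned into page fragments, every fragment is a full
page except possibly the last one. The arithmetic can be sketched in user
space as below; 'struct seg_layout' and 'seg_pages()' are hypothetical names
and the segment is assumed to start on a page boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how a segment maps onto page fragments. */
struct seg_layout {
	size_t pages;		/* number of pages (fragments) needed */
	size_t last_frag_len;	/* bytes used in the last fragment */
};

/* Split a page-aligned segment of 'len' bytes into page fragments. */
static inline struct seg_layout seg_pages(size_t len, size_t page_size)
{
	struct seg_layout l;

	assert(len > 0 && page_size > 0);

	l.pages = len / page_size;
	l.last_frag_len = len % page_size;

	if (l.last_frag_len)
		l.pages++;			/* trailing partial page */
	else
		l.last_frag_len = page_size;	/* last page is full */

	return l;
}
```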
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 390 ++++++++++++++++++++----
1 file changed, 332 insertions(+), 58 deletions(-)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 280497d97076..3c024d0d795c 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,27 +37,227 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}
-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
- size_t len,
- u32 src_cid,
- u32 src_port,
- u32 dst_cid,
- u32 dst_port)
-{
- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
- struct virtio_vsock_hdr *hdr;
- struct sk_buff *skb;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+ size_t max_to_send)
+{
+ struct iov_iter *iov_iter;
+ size_t max_skb_cap;
+ size_t bytes;
+ int i;
+
+ if (!(info->flags & MSG_ZEROCOPY))
+ return false;
+
+ if (!info->msg)
+ return false;
+
+ iov_iter = &info->msg->msg_iter;
+
+ if (!iter_is_iovec(iov_iter))
+ return false;
+
+ if (iov_iter->iov_offset)
+ return false;
+
+ /* We can't send whole iov. */
+ if (iov_iter->count > max_to_send)
+ return false;
+
+ for (bytes = 0, i = 0; i < iov_iter->nr_segs; i++) {
+ const struct iovec *iovec;
+ int pages_in_elem;
+
+ iovec = &iov_iter->iov[i];
+
+ /* Base must be page aligned. */
+ if (offset_in_page(iovec->iov_base))
+ return false;
+
+ /* Only last element could have non page aligned size. */
+ if (i != (iov_iter->nr_segs - 1)) {
+ if (offset_in_page(iovec->iov_len))
+ return false;
+
+ pages_in_elem = iovec->iov_len >> PAGE_SHIFT;
+ } else {
+ pages_in_elem = round_up(iovec->iov_len, PAGE_SIZE);
+ pages_in_elem >>= PAGE_SHIFT;
+ }
+
+ bytes += (pages_in_elem * PAGE_SIZE);
+ }
+
+ /* How many bytes we can pack to single skb. Maximum packet
+ * buffer size is needed to allow vhost handle such packets,
+ * otherwise they will be dropped.
+ */
+ max_skb_cap = min((unsigned int)(MAX_SKB_FRAGS * PAGE_SIZE),
+ (unsigned int)VIRTIO_VSOCK_MAX_PKT_BUF_SIZE);
+
+ return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+ struct sk_buff *skb,
+ struct iov_iter *iter,
+ bool zerocopy)
+{
+ struct ubuf_info_msgzc *uarg_zc;
+ struct ubuf_info *uarg;
+
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+ iov_length(iter->iov, iter->nr_segs),
+ NULL);
+
+ if (!uarg)
+ return -1;
+
+ uarg_zc = uarg_to_msgzc(uarg);
+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
+
+ skb_zcopy_init(skb, uarg);
+
+ return 0;
+}
+
+static int get_iov_elem(struct iov_iter *iter, ssize_t *offset)
+{
+ int i;
+
+ *offset = iter->iov_offset;
+
+ for (i = 0; i < iter->nr_segs; i++) {
+ if (*offset - (ssize_t)iter->iov[i].iov_len < 0)
+ return i;
+
+ *offset -= iter->iov[i].iov_len;
+ }
+
+ return -1;
+}
+
+static int virtio_transport_fill_nonlinear_skb(struct sk_buff *skb,
+ struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t payload_len)
+{
+ size_t payload_rest_len;
+ int frag_idx;
+ int err = 0;
+
+ if (!info->msg)
+ return 0;
+
+ frag_idx = 0;
+ VIRTIO_VSOCK_SKB_CB(skb)->curr_frag = 0;
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
+ payload_rest_len = payload_len;
+
+ while (payload_rest_len) {
+ struct page *user_pages[MAX_SKB_FRAGS];
+ const struct iovec *iovec;
+ struct iov_iter *iter;
+ size_t last_frag_len;
+ ssize_t offs_in_iov;
+ size_t curr_iov_len;
+ size_t pages_in_seg;
+ long pinned_pages;
+ int page_idx;
+ int seg_idx;
+
+ iter = &info->msg->msg_iter;
+ seg_idx = get_iov_elem(iter, &offs_in_iov);
+ if (seg_idx < 0) {
+ err = -1;
+ break;
+ }
+
+ iovec = &iter->iov[seg_idx];
+ curr_iov_len = min(iovec->iov_len - offs_in_iov,
+ payload_rest_len);
+ pages_in_seg = curr_iov_len >> PAGE_SHIFT;
+
+ if (curr_iov_len % PAGE_SIZE) {
+ last_frag_len = curr_iov_len % PAGE_SIZE;
+ pages_in_seg++;
+ } else {
+ last_frag_len = PAGE_SIZE;
+ }
+
+ pinned_pages = pin_user_pages((unsigned long)iovec->iov_base +
+ offs_in_iov, pages_in_seg,
+ FOLL_ANON, user_pages, NULL);
+
+ if (pinned_pages != pages_in_seg) {
+ /* Unpin partially pinned pages. */
+ unpin_user_pages(user_pages, pinned_pages);
+ err = -1;
+ break;
+ }
+
+ for (page_idx = 0; page_idx < pages_in_seg; page_idx++) {
+ int frag_len = PAGE_SIZE;
+
+ if (page_idx == (pages_in_seg - 1))
+ frag_len = last_frag_len;
+
+ /* 'get_page()' as pair to 'put_page()' during
+ * this non-linear skbuff deallocation.
+ */
+ get_page(user_pages[page_idx]);
+ skb_fill_page_desc(skb, frag_idx,
+ user_pages[page_idx], 0,
+ frag_len);
+ skb_len_add(skb, frag_len);
+ frag_idx++;
+ }
+
+ iter->iov_offset += curr_iov_len;
+ payload_rest_len -= curr_iov_len;
+ }
+
+ return err;
+}
+
+static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
+ struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t len)
+{
void *payload;
int err;
- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
- if (!skb)
- return NULL;
+ payload = skb_put(skb, len);
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err)
+ return -1;
+
+ if (msg_data_left(info->msg))
+ return 0;
+
+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+ struct virtio_vsock_hdr *hdr;
+
+ hdr = virtio_vsock_hdr(skb);
+
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+ if (info->msg->msg_flags & MSG_EOR)
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ }
+
+ return 0;
+}
+
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ u32 src_cid,
+ u32 src_port,
+ u32 dst_cid,
+ u32 dst_port,
+ size_t len)
+{
+ struct virtio_vsock_hdr *hdr;
hdr = virtio_vsock_hdr(skb);
hdr->type = cpu_to_le16(info->type);
@@ -68,37 +268,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
hdr->dst_port = cpu_to_le32(dst_port);
hdr->flags = cpu_to_le32(info->flags);
hdr->len = cpu_to_le32(len);
-
- if (info->msg && len > 0) {
- payload = skb_put(skb, len);
- err = memcpy_from_msg(payload, info->msg, len);
- if (err)
- goto out;
-
- if (msg_data_left(info->msg) == 0 &&
- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
-
- if (info->msg->msg_flags & MSG_EOR)
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
- }
- }
-
- if (info->reply)
- virtio_vsock_skb_set_reply(skb);
-
- trace_virtio_transport_alloc_pkt(src_cid, src_port,
- dst_cid, dst_port,
- len,
- info->type,
- info->op,
- info->flags);
-
- return skb;
-
-out:
- kfree_skb(skb);
- return NULL;
}
int virtio_transport_nl_skb_to_iov(struct sk_buff *skb,
@@ -209,6 +378,88 @@ static u16 virtio_transport_get_type(struct sock *sk)
return VIRTIO_VSOCK_TYPE_SEQPACKET;
}
+static void virtio_transport_unpin_skb(struct sk_buff *skb)
+{
+ int i;
+
+ if (!skb_is_nonlinear(skb))
+ return;
+
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ struct bio_vec *frag;
+ int nr_pages;
+ int page;
+
+ frag = &skb_shinfo(skb)->frags[i];
+ nr_pages = frag->bv_len / PAGE_SIZE;
+
+ if (frag->bv_len % PAGE_SIZE)
+ nr_pages++;
+
+ for (page = 0; page < nr_pages; page++)
+ unpin_user_page(&frag->bv_page[page]);
+ }
+}
+
+/* Returns a new packet on success, otherwise returns NULL.
+ *
+ * If NULL is returned, errp is set to a negative errno.
+ */
+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t payload_len,
+ bool zcopy,
+ u32 dst_cid,
+ u32 dst_port,
+ u32 src_cid,
+ u32 src_port)
+{
+ struct sk_buff *skb;
+ size_t skb_len;
+
+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+ if (!zcopy)
+ skb_len += payload_len;
+
+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+ if (!skb)
+ return NULL;
+
+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len);
+
+ if (info->msg && payload_len > 0) {
+ int err;
+
+ if (zcopy) {
+ skb->destructor = virtio_transport_unpin_skb;
+ err = virtio_transport_fill_nonlinear_skb(skb, vsk, info, payload_len);
+ } else {
+ err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
+ }
+
+ if (err)
+ goto out;
+ }
+
+ if (info->reply)
+ virtio_vsock_skb_set_reply(skb);
+
+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len,
+ info->type,
+ info->op,
+ info->flags);
+
+ return skb;
+out:
+ kfree_skb(skb);
+ return NULL;
+}
+
/* This function can only be used on connecting/connected sockets,
* since a socket assigned to a transport is required.
*
@@ -221,6 +472,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
const struct virtio_transport *t_ops;
struct virtio_vsock_sock *vvs;
u32 pkt_len = info->pkt_len;
+ bool can_zcopy = false;
+ u32 max_skb_cap;
u32 rest_len;
int ret;
@@ -230,6 +483,9 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (unlikely(!t_ops))
return -EFAULT;
+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+ info->flags &= ~MSG_ZEROCOPY;
+
src_cid = t_ops->transport.get_local_cid();
src_port = vsk->local_addr.svm_port;
if (!info->remote_cid) {
@@ -249,22 +505,36 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;
+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+ if (can_zcopy)
+ max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+ (MAX_SKB_FRAGS * PAGE_SIZE));
+ else
+ max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+
rest_len = pkt_len;
do {
struct sk_buff *skb;
size_t skb_len;
- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+ skb_len = min(max_skb_cap, rest_len);
- skb = virtio_transport_alloc_skb(info, skb_len,
- src_cid, src_port,
- dst_cid, dst_port);
+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
+ dst_cid, dst_port,
+ src_cid, src_port);
if (!skb) {
ret = -ENOMEM;
break;
}
+ if (skb_len == rest_len &&
+ info->flags & MSG_ZEROCOPY &&
+ info->op == VIRTIO_VSOCK_OP_RW)
+ virtio_transport_init_zcopy_skb(vsk, skb,
+ &info->msg->msg_iter,
+ can_zcopy);
+
virtio_transport_inc_tx_pkt(vvs, skb);
ret = t_ops->send_pkt(skb);
@@ -945,6 +1215,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
+ .flags = msg->msg_flags,
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -988,6 +1259,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
.reply = true,
};
struct sk_buff *reply;
+ int res;
/* Send RST only if the original pkt is not a RST pkt */
if (le16_to_cpu(hdr->op) == VIRTIO_VSOCK_OP_RST)
@@ -996,15 +1268,17 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
if (!t)
return -ENOTCONN;
- reply = virtio_transport_alloc_skb(&info, 0,
- le64_to_cpu(hdr->dst_cid),
- le32_to_cpu(hdr->dst_port),
+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
le64_to_cpu(hdr->src_cid),
- le32_to_cpu(hdr->src_port));
+ le32_to_cpu(hdr->src_port),
+ le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port));
if (!reply)
return -ENOMEM;
- return t->send_pkt(reply);
+ res = t->send_pkt(reply);
+
+ return res;
}
/* This function should be called with sk_lock held and SOCK_DONE set */
--
2.25.1
If socket's error queue is not empty, EPOLLERR must be set.
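As a rough user-space illustration of the new condition (a sketch only: `model_poll_err()` and its arguments are hypothetical stand-ins for `sk->sk_err` and the number of skbs in `sk->sk_error_queue`, not kernel code):

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical model of the error check in vsock_poll().
 * sk_err mirrors sk->sk_err, errq_len is the number of entries
 * waiting in the socket's error queue.
 */
static short model_poll_err(int sk_err, size_t errq_len)
{
	short mask = 0;

	assert(sk_err >= 0);

	/* Checking sk_err alone misses a queued MSG_ZEROCOPY completion,
	 * because such a completion is queued with sk_err == 0.
	 */
	if (sk_err || errq_len > 0)
		mask |= POLLERR;

	return mask;
}
```

With only the old `sk_err` test, the second case below (no error, one queued completion) would not report POLLERR.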
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/af_vsock.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 413407bb646c..137a0db6eaac 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1030,8 +1030,8 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
poll_wait(file, sk_sleep(sk), wait);
mask = 0;
- if (sk->sk_err)
- /* Signify that there has been an error on this socket. */
+ /* Signify that there has been an error on this socket. */
+ if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
mask |= EPOLLERR;
/* INET sockets treat local write shutdown and peer write shutdown as a
--
2.25.1
This adds a set of tests for the MSG_ZEROCOPY flag.
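The tests compare a hash of the sent io vector with a hash of the received bytes. As a sketch of that idea, the hash can also be computed over the vector directly, without a temporary copy. This assumes `hash_djb2()` from the test utilities is the classic djb2 function (seed 5381, multiply by 33); `djb2_update()` and `iovec_hash_stream()` are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* One djb2 pass over a buffer: hash = hash * 33 + byte. */
static unsigned long djb2_update(unsigned long hash, const void *data,
				 size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ((hash << 5) + hash) + p[i];

	return hash;
}

/* Hash all elements of an io vector as if they were one buffer. */
static unsigned long iovec_hash_stream(const struct iovec *iov,
					size_t iovnum)
{
	unsigned long hash = 5381; /* djb2 seed */
	size_t i;

	assert(iov || !iovnum);

	for (i = 0; i < iovnum; i++)
		hash = djb2_update(hash, iov[i].iov_base, iov[i].iov_len);

	return hash;
}
```

Because djb2 is a simple running hash, feeding the elements one after another gives the same value as hashing the concatenated data.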
Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.h | 1 +
tools/testing/vsock/vsock_test.c | 11 +
tools/testing/vsock/vsock_test_zerocopy.c | 501 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 12 +
5 files changed, 526 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 43a254f0e14d..0a78787d1d92 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
all: test vsock_perf
test: vsock_test vsock_diag_test
-vsock_test: vsock_test.o timeout.o control.o util.o
+vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
vsock_perf: vsock_perf.o
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..46ba1d3202b8 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -2,6 +2,7 @@
#ifndef UTIL_H
#define UTIL_H
+#include <stdbool.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index ac1bd3ac1533..d9bddb643794 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -20,6 +20,7 @@
#include <sys/mman.h>
#include <poll.h>
+#include "vsock_test_zerocopy.h"
#include "timeout.h"
#include "control.h"
#include "util.h"
@@ -1128,6 +1129,16 @@ static struct test_case test_cases[] = {
.run_client = test_stream_virtio_skb_merge_client,
.run_server = test_stream_virtio_skb_merge_server,
},
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY",
+ .run_client = test_stream_msg_zcopy_client,
+ .run_server = test_stream_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY empty MSG_ERRQUEUE",
+ .run_client = test_stream_msg_zcopy_empty_errq_client,
+ .run_server = test_stream_msg_zcopy_empty_errq_server,
+ },
{},
};
diff --git a/tools/testing/vsock/vsock_test_zerocopy.c b/tools/testing/vsock/vsock_test_zerocopy.c
new file mode 100644
index 000000000000..de44587bff26
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.c
@@ -0,0 +1,501 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* MSG_ZEROCOPY feature tests for vsock
+ *
+ * Copyright (C) 2023 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include <linux/errqueue.h>
+#include <linux/kernel.h>
+#include <error.h>
+#include <errno.h>
+
+#include "control.h"
+#include "vsock_test_zerocopy.h"
+
+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
+#define PAGE_SIZE 4096
+#define POLL_TIMEOUT_MS 100
+#define SENDMSG_RES_IOV_LEN (-2)
+
+struct zerocopy_test_data {
+ bool zerocopied;
+ bool completion;
+ int sendmsg_errno;
+ ssize_t sendmsg_res;
+ int vecs_cnt;
+ struct iovec vecs[3];
+};
+
+static void do_recv_completion(int fd, bool zerocopied, bool completion)
+{
+ struct sock_extended_err *serr;
+ struct msghdr msg = { 0 };
+ struct pollfd fds = { 0 };
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ uint32_t hi, lo;
+ ssize_t res;
+
+ fds.fd = fd;
+ fds.events = 0;
+
+ if (poll(&fds, 1, POLL_TIMEOUT_MS) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ if (completion) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ } else {
+ return;
+ }
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res) {
+ fprintf(stderr, "failed to read error queue: %zi\n", res);
+ exit(EXIT_FAILURE);
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_type != 0) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin: %u\n", serr->ee_origin);
+ exit(EXIT_FAILURE);
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code: %u\n", serr->ee_errno);
+ exit(EXIT_FAILURE);
+ }
+
+ hi = serr->ee_data;
+ lo = serr->ee_info;
+ if (hi != lo) {
+ fprintf(stderr, "serr: expected hi == lo\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (hi) {
+ fprintf(stderr, "serr: expected hi == lo == 0\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (zerocopied && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was copy instead of zerocopy\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!zerocopied && !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was zerocopy instead of copy\n");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error(1, errno, "setsockopt");
+}
+
+static void *mmap_no_fail(size_t bytes)
+{
+ void *res;
+
+ res = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (res == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ return res;
+}
+
+static size_t iovec_bytes(const struct iovec *iov, size_t iovnum)
+{
+ size_t bytes;
+ int i;
+
+ for (bytes = 0, i = 0; i < iovnum; i++)
+ bytes += iov[i].iov_len;
+
+ return bytes;
+}
+
+static void iovec_random_init(struct iovec *iov,
+ const struct zerocopy_test_data *test_data)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ int j;
+
+ if (test_data->vecs[i].iov_base == MAP_FAILED)
+ continue;
+
+ for (j = 0; j < iov[i].iov_len; j++)
+ ((uint8_t *)iov[i].iov_base)[j] = rand() & 0xff;
+ }
+}
+
+static unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum)
+{
+ unsigned long hash;
+ size_t iov_bytes;
+ size_t offs;
+ void *tmp;
+ int i;
+
+ iov_bytes = iovec_bytes(iov, iovnum);
+
+ tmp = malloc(iov_bytes);
+ if (!tmp) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (offs = 0, i = 0; i < iovnum; i++) {
+ memcpy(tmp + offs, iov[i].iov_base, iov[i].iov_len);
+ offs += iov[i].iov_len;
+ }
+
+ hash = hash_djb2(tmp, iov_bytes);
+ free(tmp);
+
+ return hash;
+}
+
+static struct zerocopy_test_data test_data_array[] = {
+ /* Last element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, 200 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE * 2 },
+ { NULL, PAGE_SIZE * 3 }
+ }
+ },
+ /* All elements have page aligned base and size. But
+ * data length is bigger than 64KB.
+ */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has non-page aligned size. */
+ {
+ .zerocopied = false,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has both non-page aligned base and size. */
+ {
+ .zerocopied = false,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { (void *)1, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* One element has invalid base. */
+ {
+ .zerocopied = false,
+ .completion = false,
+ .sendmsg_errno = ENOMEM,
+ .sendmsg_res = -1,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { MAP_FAILED, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Valid data, but SO_ZEROCOPY is off. */
+ {
+ .zerocopied = true,
+ .completion = false,
+ .sendmsg_errno = 0,
+ .sendmsg_res = SENDMSG_RES_IOV_LEN,
+ .vecs_cnt = 1,
+ {
+ { NULL, PAGE_SIZE }
+ }
+ },
+};
+
+static void __test_stream_msg_zerocopy_client(const struct test_opts *opts,
+ const struct zerocopy_test_data *test_data)
+{
+ struct msghdr msg = { 0 };
+ ssize_t sendmsg_res;
+ struct iovec *iovec;
+ int fd;
+ int i;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (test_data->completion)
+ enable_so_zerocopy(fd);
+
+ iovec = malloc(sizeof(*iovec) * test_data->vecs_cnt);
+ if (!iovec) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ iovec[i].iov_len = test_data->vecs[i].iov_len;
+ iovec[i].iov_base = mmap_no_fail(test_data->vecs[i].iov_len);
+ }
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_data->vecs[i].iov_base == MAP_FAILED) {
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ iovec_random_init(iovec, test_data);
+
+ msg.msg_iov = iovec;
+ msg.msg_iovlen = test_data->vecs_cnt;
+
+ errno = 0;
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ sendmsg_res = iovec_bytes(iovec, test_data->vecs_cnt);
+ else
+ sendmsg_res = test_data->sendmsg_res;
+
+ if (sendmsg(fd, &msg, MSG_ZEROCOPY) != sendmsg_res) {
+ perror("send");
+ exit(EXIT_FAILURE);
+ }
+
+ if (errno != test_data->sendmsg_errno) {
+ fprintf(stderr, "expected 'errno' == %i, got %i\n",
+ test_data->sendmsg_errno, errno);
+ exit(EXIT_FAILURE);
+ }
+
+ do_recv_completion(fd, test_data->zerocopied, test_data->completion);
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
+ else
+ control_writeulong(0);
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_data->vecs[i].iov_base != MAP_FAILED) {
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ free(iovec);
+ close(fd);
+ control_writeln("DONE");
+}
+
+static void __test_stream_msg_zerocopy_server(const struct test_opts *opts,
+ const struct zerocopy_test_data *test_data)
+{
+ unsigned long remote_hash;
+ unsigned long local_hash;
+ ssize_t total_bytes_rec;
+ unsigned char *data;
+ size_t data_len;
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ data_len = iovec_bytes(test_data->vecs, test_data->vecs_cnt);
+
+ data = malloc(data_len);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ total_bytes_rec = 0;
+
+ while (total_bytes_rec != data_len) {
+ ssize_t bytes_rec;
+
+ bytes_rec = read(fd, data + total_bytes_rec,
+ data_len - total_bytes_rec);
+ if (bytes_rec <= 0)
+ break;
+
+ total_bytes_rec += bytes_rec;
+ }
+
+ if (test_data->sendmsg_res == SENDMSG_RES_IOV_LEN)
+ local_hash = hash_djb2(data, data_len);
+ else
+ local_hash = 0;
+
+ free(data);
+
+ /* Waiting for some result. */
+ remote_hash = control_readulong();
+ if (remote_hash != local_hash) {
+ fprintf(stderr, "hash mismatch\n");
+ exit(EXIT_FAILURE);
+ }
+
+ close(fd);
+ control_expectln("DONE");
+}
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_msg_zerocopy_client(opts, &test_data_array[i]);
+}
+
+void test_stream_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_msg_zerocopy_server(opts, &test_data_array[i]);
+}
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts)
+{
+ struct msghdr msg = { 0 };
+ char cmsg_data[128];
+ ssize_t res;
+ int fd;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res != -1) {
+ fprintf(stderr, "expected 'recvmsg(2)' failure, got %zi\n",
+ res);
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
diff --git a/tools/testing/vsock/vsock_test_zerocopy.h b/tools/testing/vsock/vsock_test_zerocopy.h
new file mode 100644
index 000000000000..705a1e90f41a
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VSOCK_TEST_ZEROCOPY_H
+#define VSOCK_TEST_ZEROCOPY_H
+#include "util.h"
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_server(const struct test_opts *opts);
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts);
+
+#endif /* VSOCK_TEST_ZEROCOPY_H */
--
2.25.1
To use this option, pass the '--zc' parameter:
./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>
With this option, the MSG_ZEROCOPY flag will be passed to the 'send()' call.
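The per-iteration arguments of that 'send()' call can be sketched as two small helpers. This is an illustration only: `next_chunk_len()` and `send_flags()` are hypothetical names, not functions from vsock_perf.c, and the fallback MSG_ZEROCOPY definition is used only if the libc headers do not provide it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Length of the next send(): a full buffer, or whatever is left. */
static size_t next_chunk_len(size_t to_send, size_t total_sent,
			     size_t buf_size)
{
	size_t rest;

	assert(total_sent <= to_send);

	rest = to_send - total_sent;

	return rest > buf_size ? buf_size : rest;
}

/* Flags for send(): MSG_ZEROCOPY only when '--zc' was given. */
static int send_flags(bool zerocopy)
{
	return zerocopy ? MSG_ZEROCOPY : 0;
}
```

Clamping the length this way means the sender never transmits more than the requested number of bytes, even when that number is not a multiple of the buffer size.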
Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/vsock_perf.c | 139 +++++++++++++++++++++++++++++--
1 file changed, 130 insertions(+), 9 deletions(-)
diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
index a72520338f84..7fd76f7a3c16 100644
--- a/tools/testing/vsock/vsock_perf.c
+++ b/tools/testing/vsock/vsock_perf.c
@@ -18,6 +18,8 @@
#include <poll.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>
#define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
#define DEFAULT_TO_SEND_BYTES (64 * 1024)
@@ -28,9 +30,14 @@
#define BYTES_PER_GB (1024 * 1024 * 1024ULL)
#define NSEC_PER_SEC (1000000000ULL)
+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
static unsigned int port = DEFAULT_PORT;
static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
+static bool zerocopy;
static void error(const char *s)
{
@@ -247,15 +254,76 @@ static void run_receiver(unsigned long rcvlowat_bytes)
close(fd);
}
+static void recv_completion(int fd)
+{
+ struct sock_extended_err *serr;
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ struct msghdr msg = { 0 };
+ ssize_t ret;
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret) {
+ fprintf(stderr, "recvmsg: failed to read err: %zi\n", ret);
+ return;
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ return;
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ return;
+ }
+
+ if (cm->cmsg_type) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ return;
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin\n");
+ return;
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code\n");
+ return;
+ }
+
+ if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
+ fprintf(stderr, "warning: copy instead of zerocopy\n");
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error("setsockopt(SO_ZEROCOPY)");
+}
+
static void run_sender(int peer_cid, unsigned long to_send_bytes)
{
time_t tx_begin_ns;
time_t tx_total_ns;
size_t total_send;
+ time_t time_in_send;
void *data;
int fd;
- printf("Run as sender\n");
+ if (zerocopy)
+ printf("Run as sender MSG_ZEROCOPY\n");
+ else
+ printf("Run as sender\n");
+
printf("Connect to %i:%u\n", peer_cid, port);
printf("Send %lu bytes\n", to_send_bytes);
printf("TX buffer %lu bytes\n", buf_size_bytes);
@@ -265,38 +333,82 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
if (fd < 0)
exit(EXIT_FAILURE);
- data = malloc(buf_size_bytes);
+ if (zerocopy) {
+ enable_so_zerocopy(fd);
- if (!data) {
- fprintf(stderr, "'malloc()' failed\n");
- exit(EXIT_FAILURE);
+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (data == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+ } else {
+ data = malloc(buf_size_bytes);
+
+ if (!data) {
+ fprintf(stderr, "'malloc()' failed\n");
+ exit(EXIT_FAILURE);
+ }
}
memset(data, 0, buf_size_bytes);
total_send = 0;
+ time_in_send = 0;
tx_begin_ns = current_nsec();
while (total_send < to_send_bytes) {
ssize_t sent;
+ size_t rest_bytes;
+ time_t before;
- sent = write(fd, data, buf_size_bytes);
+ rest_bytes = to_send_bytes - total_send;
+
+ before = current_nsec();
+ sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
+ buf_size_bytes : rest_bytes,
+ zerocopy ? MSG_ZEROCOPY : 0);
+ time_in_send += (current_nsec() - before);
if (sent <= 0)
error("write");
total_send += sent;
+
+ if (zerocopy) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ }
+
+ recv_completion(fd);
+ }
}
tx_total_ns = current_nsec() - tx_begin_ns;
printf("total bytes sent: %zu\n", total_send);
printf("tx performance: %f Gbits/s\n",
- get_gbps(total_send * 8, tx_total_ns));
- printf("total time in 'write()': %f sec\n",
+ get_gbps(total_send * 8, time_in_send));
+ printf("total time in tx loop: %f sec\n",
(float)tx_total_ns / NSEC_PER_SEC);
+ printf("time in 'send()': %f sec\n",
+ (float)time_in_send / NSEC_PER_SEC);
close(fd);
- free(data);
+
+ if (zerocopy)
+ munmap(data, buf_size_bytes);
+ else
+ free(data);
}
static const char optstring[] = "";
@@ -336,6 +448,11 @@ static const struct option longopts[] = {
.has_arg = required_argument,
.val = 'R',
},
+ {
+ .name = "zc",
+ .has_arg = no_argument,
+ .val = 'Z',
+ },
{},
};
@@ -351,6 +468,7 @@ static void usage(void)
" --help This message\n"
" --sender <cid> Sender mode (receiver default)\n"
" <cid> of the receiver to connect to\n"
+ " --zc Enable zerocopy\n"
" --port <port> Port (default %d)\n"
" --bytes <bytes>KMG Bytes to send (default %d)\n"
" --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
@@ -413,6 +531,9 @@ int main(int argc, char **argv)
case 'H': /* Help. */
usage();
break;
+ case 'Z': /* Zerocopy. */
+ zerocopy = true;
+ break;
default:
usage();
}
--
2.25.1
This adds a description of MSG_ZEROCOPY flag support for AF_VSOCK
sockets.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
Documentation/networking/msg_zerocopy.rst | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index b3ea96af9b49..34bc7ff411ce 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,8 @@ Intro
=====
The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP and UDP sockets.
+The feature is currently implemented for TCP, UDP and VSOCK (with
+virtio transport) sockets.
Opportunity and Caveats
@@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.
The level and type fields in the control data are protocol family
-specific, IP_RECVERR or IPV6_RECVERR.
+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).
Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
@@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.
printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+For a VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.
Deferred copies
~~~~~~~~~~~~~~~
@@ -235,12 +237,15 @@ Implementation
Loopback
--------
+For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbound notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+For VSOCK:
+The data path for local sockets is the same as for non-local sockets.
Testing
=======
@@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
+
+For VSOCK sockets, an example can be found in tools/testing/vsock/
+vsock_test_zerocopy.c.
--
2.25.1
PF_VSOCK supports MSG_ZEROCOPY transmission, so SO_ZEROCOPY can
be enabled for it.
Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/core/sock.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/net/core/sock.c b/net/core/sock.c
index c25888795390..13a89c6cbfb8 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1457,9 +1457,11 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
(sk->sk_type == SOCK_DGRAM &&
sk->sk_protocol == IPPROTO_UDP)))
ret = -EOPNOTSUPP;
- } else if (sk->sk_family != PF_RDS) {
+ } else if (sk->sk_family != PF_RDS &&
+ sk->sk_family != PF_VSOCK) {
ret = -EOPNOTSUPP;
}
+
if (!ret) {
if (val < 0 || val > 1)
ret = -EINVAL;
--
2.25.1
Hi Arseniy,
Sorry for the delay, but I have been very busy.
I can't apply this series on master or net-next, can you share with me
the base commit?
On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>Hello,
>
> DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> flag will be ignored (e.g. without completion).
>
>2) Kernel uses completions from socket's error queue. Single completion
> for single tx syscall (or it can merge several completions to single
> one). I used already implemented logic for MSG_ZEROCOPY support:
> 'msg_zerocopy_realloc()' etc.
>
>Difference with copy way is not significant. During packet allocation,
>non-linear skb is created, then I call 'pin_user_pages()' for each page
>from user's iov iterator and add each returned page to the skb as fragment.
>There are also some updates for vhost and guest parts of transport - in
>both cases i've added handling of non-linear skb for virtio part. vhost
>copies data from such skb to the guest's rx virtio buffers. In the guest,
>virtio transport fills tx virtio queue with pages from skb.
>
>This version has several limits/problems:
>
>1) As this feature totally depends on transport, there is no way (or it
> is difficult) to check whether transport is able to handle it or not
> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> are not considered to be called from each other. So in current version
> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> tx routine will fail with EOPNOTSUPP.
Do you plan to fix this in the next versions?
If it is too complicated, I think we can have this limitation until we
find a good solution.
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> one completion. In each completion there is flag which shows how tx
> was performed: zerocopy or copy. This leads that whole message must
> be send in zerocopy or copy way - we can't send part of message with
> copying and rest of message with zerocopy mode (or vice versa). Now,
> we need to account vsock credit logic, e.g. we can't send whole data
> once - only allowed number of bytes could sent at any moment. In case
> of copying way there is no problem as in worst case we can send single
> bytes, but zerocopy is more complex because smallest transmission
> unit is single page. So if there is not enough space at peer's side
> to send integer number of pages (at least one) - we will wait, thus
> stalling tx side. To overcome this problem i've added simple rule -
> zerocopy is possible only when there is enough space at another side
> for whole message (to check, that current 'msghdr' was already used
> in previous tx iterations i use 'iov_offset' field of it's iov iter).
So, IIUC, if MSG_ZEROCOPY is set but there isn't enough space at the
destination, we temporarily disable zerocopy, even though MSG_ZEROCOPY
is set. Right?
If that is the case, it seems reasonable to me.
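To make the rule concrete, here is a minimal user-space model of it as I read the description in 2). It is only a sketch: `model_can_zcopy()` and its parameters are hypothetical and do not correspond to the actual `virtio_transport_can_zcopy()` signature, and other conditions, such as buffer alignment, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the "whole message or nothing" rule.
 *
 * zc_requested:    MSG_ZEROCOPY passed and SO_ZEROCOPY enabled
 * msg_len:         total length of the user's message
 * iov_offset:      bytes of this message consumed by earlier iterations
 * peer_free_space: credit currently available at the peer
 */
static bool model_can_zcopy(bool zc_requested, size_t msg_len,
			    size_t iov_offset, size_t peer_free_space)
{
	assert(iov_offset <= msg_len);

	if (!zc_requested)
		return false;

	/* Part of the message already went out in copy mode, and one
	 * completion describes the whole message, so stay in copy mode.
	 */
	if (iov_offset != 0)
		return false;

	/* Zerocopy only if the peer can take the whole message now. */
	return msg_len <= peer_free_space;
}
```

For example, a 64KB message against 32KB of free space at the peer falls back to copy mode for the whole message rather than stalling until page-sized credit appears.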
>
>3) loopback transport is not supported, because it requires to implement
> non-linear skb handling in dequeue logic (as we "send" fragged skb
> and "receive" it from the same queue). I'm going to implement it in
> next versions.
>
> ^^^ fixed in v2
>
>4) Current implementation sets max length of packet to 64KB. IIUC this
> is due to 'kmalloc()' allocated data buffers. I think, in case of
> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> not touched for data - user space pages are used as buffers. Also
> this limit trims every message which is > 64KB, thus such messages
> will be send in copy mode due to 'iov_offset' check in 2).
>
> ^^^ fixed in v2
>
> PATCHSET STRUCTURE
>
>Patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
> PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In zerocopy way we need to wait when
>user buffers will be released by kernel, so it something like synchronous
>path (wait until device driver will process it), while in copy way we
>can feed data to kernel as many as we want, don't care about device
>driver. So I compared only time which we spend in the 'send()' syscall.
>Then if this value will be combined with total number of transmitted
>bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>enough credit, receiver allocates same amount of space as sender needs.
>
>Sender:
>./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>G2H transmission (values are Gbit/s):
>
>*-------------------------------*
>| | | |
>| buf size | copy | zerocopy |
>| | | |
>*-------------------------------*
>| 4KB | 3 | 10 |
>*-------------------------------*
>| 32KB | 9 | 45 |
>*-------------------------------*
>| 256KB | 24 | 195 |
>*-------------------------------*
>| 1M | 27 | 270 |
>*-------------------------------*
>| 8M | 22 | 277 |
>*-------------------------------*
>
>H2G:
>
>*-------------------------------*
>| | | |
>| buf size | copy | zerocopy |
>| | | |
>*-------------------------------*
>| 4KB | 17 | 11 |
Do you know why zerocopy is slower in this case?
Could it be the cost of pinning/unpinning pages?
>*-------------------------------*
>| 32KB | 30 | 66 |
>*-------------------------------*
>| 256KB | 38 | 179 |
>*-------------------------------*
>| 1M | 38 | 234 |
>*-------------------------------*
>| 8M | 28 | 279 |
>*-------------------------------*
>
>Loopback:
>
>*-------------------------------*
>| | | |
>| buf size | copy | zerocopy |
>| | | |
>*-------------------------------*
>| 4KB | 8 | 7 |
>*-------------------------------*
>| 32KB | 34 | 42 |
>*-------------------------------*
>| 256KB | 43 | 83 |
>*-------------------------------*
>| 1M | 40 | 109 |
>*-------------------------------*
>| 8M | 40 | 171 |
>*-------------------------------*
>
>I suppose that huge difference above between both modes has two reasons:
>1) We don't need to copy data.
>2) We don't need to allocate buffer for data, only for header.
>
>Zerocopy is faster than classic copy mode, but of course it requires
>specific architecture of application due to user pages pinning, buffer
>size and alignment.
>
>If host fails to send data with "Cannot allocate memory", check value
>/proc/sys/net/core/optmem_max - it is accounted during completion skb
>allocation.
What does the user need to do? Increase it?
>
> TESTING
>
>This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>cover new code as much as possible so there are different cases for
>MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>vector types (different sizes, alignments, with unmapped pages). I also
>run tests with loopback transport and running vsockmon.
Thanks for the test again :-)
This cover letter is very good, with a lot of details, but please add
more details in each single patch, explaining the reason for the changes,
otherwise it is very difficult to review, because it is a very big
change.
I'll do a per-patch review in the next few days.
Thanks,
Stefano
On 03.05.2023 15:52, Stefano Garzarella wrote:
> Hi Arseniy,
> Sorry for the delay, but I have been very busy.
Hello, no problem!
>
> I can't apply this series on master or net-next, can you share with me
> the base commit?
Here is my base:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
>
> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>> flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>> for single tx syscall (or it can merge several completions to single
>> one). I used already implemented logic for MSG_ZEROCOPY support:
>> 'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created, then I call 'pin_user_pages()' for each page
>> from user's iov iterator and add each returned page to the skb as fragment.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>> is difficult) to check whether transport is able to handle it or not
>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>> are not considered to be called from each other. So in current version
>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>> tx routine will fail with EOPNOTSUPP.
>
> Do you plan to fix this in the next versions?
>
> If it is too complicated, I think we can have this limitation until we
> find a good solution.
>
I'll try to fix it again; I just didn't pay attention to it in v2.
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>> one completion. In each completion there is flag which shows how tx
>> was performed: zerocopy or copy. This leads that whole message must
>> be send in zerocopy or copy way - we can't send part of message with
>> copying and rest of message with zerocopy mode (or vice versa). Now,
>> we need to account vsock credit logic, e.g. we can't send whole data
>> once - only allowed number of bytes could sent at any moment. In case
>> of copying way there is no problem as in worst case we can send single
>> bytes, but zerocopy is more complex because smallest transmission
>> unit is single page. So if there is not enough space at peer's side
>> to send integer number of pages (at least one) - we will wait, thus
>> stalling tx side. To overcome this problem i've added simple rule -
>> zerocopy is possible only when there is enough space at another side
>> for whole message (to check, that current 'msghdr' was already used
>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
> So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
> destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.
> Right?
Exactly, the user still needs to get a completion (because SO_ZEROCOPY is enabled
and the MSG_ZEROCOPY flag was used). But the completion structure contains
information that there was copying instead of zerocopying.
>
> If it is the case it seems reasonable to me.
>
>>
>> 3) loopback transport is not supported, because it requires to implement
>> non-linear skb handling in dequeue logic (as we "send" fragged skb
>> and "receive" it from the same queue). I'm going to implement it in
>> next versions.
>>
>> ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>> is due to 'kmalloc()' allocated data buffers. I think, in case of
>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>> not touched for data - user space pages are used as buffers. Also
>> this limit trims every message which is > 64KB, thus such messages
>> will be send in copy mode due to 'iov_offset' check in 2).
>>
>> ^^^ fixed in v2
>>
>> PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>> PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it something like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> G2H transmission (values are Gbit/s):
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 3 | 10 |
>> *-------------------------------*
>> | 32KB | 9 | 45 |
>> *-------------------------------*
>> | 256KB | 24 | 195 |
>> *-------------------------------*
>> | 1M | 27 | 270 |
>> *-------------------------------*
>> | 8M | 22 | 277 |
>> *-------------------------------*
>>
>> H2G:
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 17 | 11 |
>
> Do you know why in this case zerocopy is slower in this case?
> Could be the cost of pin/unpin pages?
Maybe, I think I need to analyze this difference more. Also about pin/unpin:
I found that there is an already implemented function to fill a non-linear
skb with pages from the user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
It uses 'get_user_pages()' instead of 'pin_user_pages()'. Maybe in my case it
is also valid to use 'get_XXX()' instead of 'pin_XXX()', because it is used by
TCP MSG_ZEROCOPY and io_uring MSG_ZEROCOPY.
>
>> *-------------------------------*
>> | 32KB | 30 | 66 |
>> *-------------------------------*
>> | 256KB | 38 | 179 |
>> *-------------------------------*
>> | 1M | 38 | 234 |
>> *-------------------------------*
>> | 8M | 28 | 279 |
>> *-------------------------------*
>>
>> Loopback:
>>
>> *-------------------------------*
>> | | | |
>> | buf size | copy | zerocopy |
>> | | | |
>> *-------------------------------*
>> | 4KB | 8 | 7 |
>> *-------------------------------*
>> | 32KB | 34 | 42 |
>> *-------------------------------*
>> | 256KB | 43 | 83 |
>> *-------------------------------*
>> | 1M | 40 | 109 |
>> *-------------------------------*
>> | 8M | 40 | 171 |
>> *-------------------------------*
>>
>> I suppose that huge difference above between both modes has two reasons:
>> 1) We don't need to copy data.
>> 2) We don't need to allocate buffer for data, only for header.
>>
>> Zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment.
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation.
>
> What the user needs to do? Increase it?
>
Yes, i'll update it.
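
For reference, a minimal sketch of checking the current limit from a test program. The helper name read_sysctl_long() is hypothetical; only the procfs path comes from the cover letter:

```c
#include <stdio.h>

/*
 * Hypothetical helper: read a single integer from a procfs/sysctl
 * file. Returns the value, or -1 if the file cannot be opened or
 * does not start with a number.
 */
static long read_sysctl_long(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f)
		return -1;

	if (fscanf(f, "%ld", &val) != 1)
		val = -1;

	fclose(f);
	return val;
}
```

Calling read_sysctl_long("/proc/sys/net/core/optmem_max") returns the current limit; raising it needs root, e.g. by writing a larger value to the same file.
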
>>
>> TESTING
>>
>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>> cover new code as much as possible so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages). I also
>> run tests with loopback transport and running vsockmon.
>
> Thanks for the test again :-)
>
> This cover letter is very good, with a lot of details, but please add
> more details in each single patch, explaining the reason of the changes,
> otherwise it is very difficult to review, because it is a very big
> change.
>
> I'll do a per-patch review in the next days.
Sure, thanks! For v3 I'm also working on an io_uring test, because io_uring also
supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
Thanks, Arseniy
>
> Thanks,
> Stefano
>
On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
>
>
>On 03.05.2023 15:52, Stefano Garzarella wrote:
>> Hi Arseniy,
>> Sorry for the delay, but I have been very busy.
>
>Hello, no problem!
>
>>
>> I can't apply this series on master or net-next, can you share with me
>> the base commit?
>
>Here is my base:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
>
Thanks, it worked!
>>
>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>>> Hello,
>>>
>>>                           DESCRIPTION
>>>
>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>>> current implementation for TCP as much as possible:
>>>
>>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>>>    flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>>>    flag will be ignored (e.g. without completion).
>>>
>>> 2) Kernel uses completions from socket's error queue. Single completion
>>>    for single tx syscall (or it can merge several completions to single
>>>    one). I used already implemented logic for MSG_ZEROCOPY support:
>>>    'msg_zerocopy_realloc()' etc.
>>>
>>> Difference with copy way is not significant. During packet allocation,
>>> non-linear skb is created, then I call 'pin_user_pages()' for each page
>>> from user's iov iterator and add each returned page to the skb as fragment.
>>> There are also some updates for vhost and guest parts of transport - in
>>> both cases i've added handling of non-linear skb for virtio part. vhost
>>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>>> virtio transport fills tx virtio queue with pages from skb.
>>>
>>> This version has several limits/problems:
>>>
>>> 1) As this feature totally depends on transport, there is no way (or it
>>>    is difficult) to check whether transport is able to handle it or not
>>>    during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>>>    setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>>>    leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>>>    are not considered to be called from each other. So in current version
>>>    SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>>>    AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>>>    tx routine will fail with EOPNOTSUPP.
>>
>> Do you plan to fix this in the next versions?
>>
>> If it is too complicated, I think we can have this limitation until we
>> find a good solution.
>>
>
>I'll try to fix it again, but just didn't pay attention on it in v2.
>
>>>
>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>>>    one completion. In each completion there is flag which shows how tx
>>>    was performed: zerocopy or copy. This leads that whole message must
>>>    be send in zerocopy or copy way - we can't send part of message with
>>>    copying and rest of message with zerocopy mode (or vice versa). Now,
>>>    we need to account vsock credit logic, e.g. we can't send whole data
>>>    once - only allowed number of bytes could sent at any moment. In case
>>>    of copying way there is no problem as in worst case we can send single
>>>    bytes, but zerocopy is more complex because smallest transmission
>>>    unit is single page. So if there is not enough space at peer's side
>>>    to send integer number of pages (at least one) - we will wait, thus
>>>    stalling tx side. To overcome this problem i've added simple rule -
>>>    zerocopy is possible only when there is enough space at another side
>>>    for whole message (to check, that current 'msghdr' was already used
>>>    in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>> So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
>> destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.
>> Right?
>
>Exactly, user still needs to get completion (because SO_ZEROCOPY is enabled and
>MSG_ZEROCOPY flag as used). But completion structure contains information that
>there was copying instead of zerocopying.
Got it.
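
Just to write down my understanding, the rule can be sketched as a small predicate. This is only an illustration with hypothetical names and parameters, not the code from the patches:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical restatement of the rule discussed above: zerocopy is
 * used only when the socket and the call both ask for it, the message
 * has not been partially sent already (iov_offset == 0), and the peer
 * has enough free space for the whole message.
 */
static bool can_send_zerocopy(bool so_zerocopy, bool msg_zerocopy,
			      size_t msg_len, size_t peer_free,
			      size_t iov_offset)
{
	if (!so_zerocopy || !msg_zerocopy)
		return false;

	/* Part of this message already went out in copy mode. */
	if (iov_offset != 0)
		return false;

	return peer_free >= msg_len;
}
```

When the predicate is false but both flags are set, the data is copied and the completion reports that.
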
>
>>
>> If it is the case it seems reasonable to me.
>>
>>>
>>> 3) loopback transport is not supported, because it requires to implement
>>>    non-linear skb handling in dequeue logic (as we "send" fragged skb
>>>    and "receive" it from the same queue). I'm going to implement it in
>>>    next versions.
>>>
>>>    ^^^ fixed in v2
>>>
>>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>>>    is due to 'kmalloc()' allocated data buffers. I think, in case of
>>>    MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>>    not touched for data - user space pages are used as buffers. Also
>>>    this limit trims every message which is > 64KB, thus such messages
>>>    will be send in copy mode due to 'iov_offset' check in 2).
>>>
>>>    ^^^ fixed in v2
>>>
>>>                         PATCHSET STRUCTURE
>>>
>>> Patchset has the following structure:
>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>>> 2) Handle non-linear skbuff on send in virtio/vhost.
>>> 3) Updates for AF_VSOCK.
>>> 4) Enable MSG_ZEROCOPY support on transports.
>>> 5) Tests/tools/docs updates.
>>>
>>>                            PERFORMANCE
>>>
>>> Performance: it is a little bit tricky to compare performance between
>>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>>> user buffers will be released by kernel, so it something like synchronous
>>> path (wait until device driver will process it), while in copy way we
>>> can feed data to kernel as many as we want, don't care about device
>>> driver. So I compared only time which we spend in the 'send()' syscall.
>>> Then if this value will be combined with total number of transmitted
>>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>>> enough credit, receiver allocates same amount of space as sender needs.
>>>
>>> Sender:
>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>>
>>> Receiver:
>>> ./vsock_perf --vsk-size 256M
>>>
>>> G2H transmission (values are Gbit/s):
>>>
>>> *-------------------------------*
>>> |          |         |          |
>>> | buf size |   copy  | zerocopy |
>>> |          |         |          |
>>> *-------------------------------*
>>> |   4KB    |    3    |    10    |
>>> *-------------------------------*
>>> |   32KB   |    9    |    45    |
>>> *-------------------------------*
>>> |   256KB  |    24   |    195   |
>>> *-------------------------------*
>>> |    1M    |    27   |    270   |
>>> *-------------------------------*
>>> |    8M    |    22   |    277   |
>>> *-------------------------------*
>>>
>>> H2G:
>>>
>>> *-------------------------------*
>>> |          |         |          |
>>> | buf size |   copy  | zerocopy |
>>> |          |         |          |
>>> *-------------------------------*
>>> |   4KB    |    17   |    11    |
>>
>> Do you know why in this case zerocopy is slower in this case?
>> Could be the cost of pin/unpin pages?
>May be, i think i need to analyze such enormous difference more. Also about
>pin/unpin: i found that there is already implemented function to fill non-linear
>skb with pages from user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
>It uses 'get_user_pages()' instead of 'pin_user_pages()'. May be in my case it
>is also valid to user 'get_XXX()' instead of 'pin_XXX()', because it is used by
>TCP MSG_ZEROCOPY and iouring MSG_ZEROCOPY.
If we can reuse them, it will be great!
>
>>
>>> *-------------------------------*
>>> |   32KB   |    30   |    66    |
>>> *-------------------------------*
>>> |   256KB  |    38   |    179   |
>>> *-------------------------------*
>>> |    1M    |    38   |    234   |
>>> *-------------------------------*
>>> |    8M    |    28   |    279   |
>>> *-------------------------------*
>>>
>>> Loopback:
>>>
>>> *-------------------------------*
>>> |          |         |          |
>>> | buf size |   copy  | zerocopy |
>>> |          |         |          |
>>> *-------------------------------*
>>> |   4KB    |    8    |    7     |
>>> *-------------------------------*
>>> |   32KB   |    34   |    42    |
>>> *-------------------------------*
>>> |   256KB  |    43   |    83    |
>>> *-------------------------------*
>>> |    1M    |    40   |    109   |
>>> *-------------------------------*
>>> |    8M    |    40   |    171   |
>>> *-------------------------------*
>>>
>>> I suppose that huge difference above between both modes has two reasons:
>>> 1) We don't need to copy data.
>>> 2) We don't need to allocate buffer for data, only for header.
>>>
>>> Zerocopy is faster than classic copy mode, but of course it requires
>>> specific architecture of application due to user pages pinning, buffer
>>> size and alignment.
>>>
>>> If host fails to send data with "Cannot allocate memory", check value
>>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>>> allocation.
>>
>> What the user needs to do? Increase it?
>>
>Yes, i'll update it.
>>>
>>>                            TESTING
>>>
>>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>>> cover new code as much as possible so there are different cases for
>>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>>> vector types (different sizes, alignments, with unmapped pages). I also
>>> run tests with loopback transport and running vsockmon.
>>
>> Thanks for the test again :-)
>>
>> This cover letter is very good, with a lot of details, but please add
>> more details in each single patch, explaining the reason of the changes,
>> otherwise it is very difficult to review, because it is a very big
>> change.
>>
>> I'll do a per-patch review in the next days.
>
>Sure, thanks! In v3 i'm also working on io_uring test, because this thing also
>supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
That would be cool!
Do you want me to review these patches, or is it better to wait for
v3?
Thanks,
Stefano
On Wed, May 3, 2023 at 3:50 PM Arseniy Krasnov wrote:
>
>
>
> On 03.05.2023 16:47, Stefano Garzarella wrote:
> > On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
> >>
> >>
> >> On 03.05.2023 15:52, Stefano Garzarella wrote:
> >>> Hi Arseniy,
> >>> Sorry for the delay, but I have been very busy.
> >>
> >> Hello, no problem!
> >>
> >>>
> >>> I can't apply this series on master or net-next, can you share with me
> >>> the base commit?
> >>
> >> Here is my base:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
> >>
> >
> > Thanks, it worked!
> >
> >>>
> >>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
> >>>> Hello,
> >>>>
> >>>> DESCRIPTION
> >>>>
> >>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
> >>>> current implementation for TCP as much as possible:
> >>>>
> >>>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> >>>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> >>>> flag will be ignored (e.g. without completion).
> >>>>
> >>>> 2) Kernel uses completions from socket's error queue. Single completion
> >>>> for single tx syscall (or it can merge several completions to single
> >>>> one). I used already implemented logic for MSG_ZEROCOPY support:
> >>>> 'msg_zerocopy_realloc()' etc.
> >>>>
> >>>> Difference with copy way is not significant. During packet allocation,
> >>>> non-linear skb is created, then I call 'pin_user_pages()' for each page
> >>>> from user's iov iterator and add each returned page to the skb as fragment.
> >>>> There are also some updates for vhost and guest parts of transport - in
> >>>> both cases i've added handling of non-linear skb for virtio part. vhost
> >>>> copies data from such skb to the guest's rx virtio buffers. In the guest,
> >>>> virtio transport fills tx virtio queue with pages from skb.
> >>>>
> >>>> This version has several limits/problems:
> >>>>
> >>>> 1) As this feature totally depends on transport, there is no way (or it
> >>>> is difficult) to check whether transport is able to handle it or not
> >>>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> >>>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> >>>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> >>>> are not considered to be called from each other. So in current version
> >>>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> >>>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> >>>> tx routine will fail with EOPNOTSUPP.
> >>>
> >>> Do you plan to fix this in the next versions?
> >>>
> >>> If it is too complicated, I think we can have this limitation until we
> >>> find a good solution.
> >>>
> >>
> >> I'll try to fix it again, but just didn't pay attention on it in v2.
> >>
> >>>>
> >>>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> >>>> one completion. In each completion there is flag which shows how tx
> >>>> was performed: zerocopy or copy. This leads that whole message must
> >>>> be send in zerocopy or copy way - we can't send part of message with
> >>>> copying and rest of message with zerocopy mode (or vice versa). Now,
> >>>> we need to account vsock credit logic, e.g. we can't send whole data
> >>>> once - only allowed number of bytes could sent at any moment. In case
> >>>> of copying way there is no problem as in worst case we can send single
> >>>> bytes, but zerocopy is more complex because smallest transmission
> >>>> unit is single page. So if there is not enough space at peer's side
> >>>> to send integer number of pages (at least one) - we will wait, thus
> >>>> stalling tx side. To overcome this problem i've added simple rule -
> >>>> zerocopy is possible only when there is enough space at another side
> >>>> for whole message (to check, that current 'msghdr' was already used
> >>>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
> >>>
> >>> So, IIUC if MSG_ZEROCOPY is set, but there isn't enough space in the
> >>> destination we temporarily disable zerocopy, also if MSG_ZEROCOPY is set.
> >>> Right?
> >>
> >> Exactly, user still needs to get completion (because SO_ZEROCOPY is enabled and
> >> MSG_ZEROCOPY flag as used). But completion structure contains information that
> >> there was copying instead of zerocopying.
> >
> > Got it.
> >
> >>
> >>>
> >>> If it is the case it seems reasonable to me.
> >>>
> >>>>
> >>>> 3) loopback transport is not supported, because it requires to implement
> >>>> non-linear skb handling in dequeue logic (as we "send" fragged skb
> >>>> and "receive" it from the same queue). I'm going to implement it in
> >>>> next versions.
> >>>>
> >>>> ^^^ fixed in v2
> >>>>
> >>>> 4) Current implementation sets max length of packet to 64KB. IIUC this
> >>>> is due to 'kmalloc()' allocated data buffers. I think, in case of
> >>>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> >>>> not touched for data - user space pages are used as buffers. Also
> >>>> this limit trims every message which is > 64KB, thus such messages
> >>>> will be send in copy mode due to 'iov_offset' check in 2).
> >>>>
> >>>> ^^^ fixed in v2
> >>>>
> >>>> PATCHSET STRUCTURE
> >>>>
> >>>> Patchset has the following structure:
> >>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
> >>>> 2) Handle non-linear skbuff on send in virtio/vhost.
> >>>> 3) Updates for AF_VSOCK.
> >>>> 4) Enable MSG_ZEROCOPY support on transports.
> >>>> 5) Tests/tools/docs updates.
> >>>>
> >>>> PERFORMANCE
> >>>>
> >>>> Performance: it is a little bit tricky to compare performance between
> >>>> copy and zerocopy transmissions. In zerocopy way we need to wait when
> >>>> user buffers will be released by kernel, so it something like synchronous
> >>>> path (wait until device driver will process it), while in copy way we
> >>>> can feed data to kernel as many as we want, don't care about device
> >>>> driver. So I compared only time which we spend in the 'send()' syscall.
> >>>> Then if this value will be combined with total number of transmitted
> >>>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
> >>>> enough credit, receiver allocates same amount of space as sender needs.
> >>>>
> >>>> Sender:
> >>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
> >>>>
> >>>> Receiver:
> >>>> ./vsock_perf --vsk-size 256M
> >>>>
> >>>> G2H transmission (values are Gbit/s):
> >>>>
> >>>> *-------------------------------*
> >>>> | | | |
> >>>> | buf size | copy | zerocopy |
> >>>> | | | |
> >>>> *-------------------------------*
> >>>> | 4KB | 3 | 10 |
> >>>> *-------------------------------*
> >>>> | 32KB | 9 | 45 |
> >>>> *-------------------------------*
> >>>> | 256KB | 24 | 195 |
> >>>> *-------------------------------*
> >>>> | 1M | 27 | 270 |
> >>>> *-------------------------------*
> >>>> | 8M | 22 | 277 |
> >>>> *-------------------------------*
> >>>>
> >>>> H2G:
> >>>>
> >>>> *-------------------------------*
> >>>> | | | |
> >>>> | buf size | copy | zerocopy |
> >>>> | | | |
> >>>> *-------------------------------*
> >>>> | 4KB | 17 | 11 |
> >>>
> >>> Do you know why in this case zerocopy is slower in this case?
> >>> Could be the cost of pin/unpin pages?
> >> May be, i think i need to analyze such enormous difference more. Also about
> >> pin/unpin: i found that there is already implemented function to fill non-linear
> >> skb with pages from user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
> >> It uses 'get_user_pages()' instead of 'pin_user_pages()'. May be in my case it
> >> is also valid to user 'get_XXX()' instead of 'pin_XXX()', because it is used by
> >> TCP MSG_ZEROCOPY and iouring MSG_ZEROCOPY.
> >
> > If we can reuse them, it will be great!
> >
> >>
> >>>
> >>>> *-------------------------------*
> >>>> | 32KB | 30 | 66 |
> >>>> *-------------------------------*
> >>>> | 256KB | 38 | 179 |
> >>>> *-------------------------------*
> >>>> | 1M | 38 | 234 |
> >>>> *-------------------------------*
> >>>> | 8M | 28 | 279 |
> >>>> *-------------------------------*
> >>>>
> >>>> Loopback:
> >>>>
> >>>> *-------------------------------*
> >>>> | | | |
> >>>> | buf size | copy | zerocopy |
> >>>> | | | |
> >>>> *-------------------------------*
> >>>> | 4KB | 8 | 7 |
> >>>> *-------------------------------*
> >>>> | 32KB | 34 | 42 |
> >>>> *-------------------------------*
> >>>> | 256KB | 43 | 83 |
> >>>> *-------------------------------*
> >>>> | 1M | 40 | 109 |
> >>>> *-------------------------------*
> >>>> | 8M | 40 | 171 |
> >>>> *-------------------------------*
> >>>>
> >>>> I suppose that huge difference above between both modes has two reasons:
> >>>> 1) We don't need to copy data.
> >>>> 2) We don't need to allocate buffer for data, only for header.
> >>>>
> >>>> Zerocopy is faster than classic copy mode, but of course it requires
> >>>> specific architecture of application due to user pages pinning, buffer
> >>>> size and alignment.
> >>>>
> >>>> If host fails to send data with "Cannot allocate memory", check value
> >>>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
> >>>> allocation.
> >>>
> >>> What the user needs to do? Increase it?
> >>>
> >> Yes, i'll update it.
> >>>>
> >>>> TESTING
> >>>>
> >>>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
> >>>> cover new code as much as possible so there are different cases for
> >>>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
> >>>> vector types (different sizes, alignments, with unmapped pages). I also
> >>>> run tests with loopback transport and running vsockmon.
> >>>
> >>> Thanks for the test again :-)
> >>>
> >>> This cover letter is very good, with a lot of details, but please add
> >>> more details in each single patch, explaining the reason of the changes,
> >>> otherwise it is very difficult to review, because it is a very big
> >>> change.
> >>>
> >>> I'll do a per-patch review in the next days.
> >>
> >> Sure, thanks! In v3 i'm also working on io_uring test, because this thing also
> >> supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
> >
> > That would be cool!
> >
> > Do you want to me to review these patches or it is better to wait for v3?
>
> I think it is ok to wait for v3, as i'm going to reduce size of new kernel source code,
> especially by reusing already implemented functions instead of my own.
Okay, great! I'll wait for it ;-)
Thanks,
Stefano
On 03.05.2023 16:47, Stefano Garzarella wrote:
> On Wed, May 03, 2023 at 04:11:59PM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 03.05.2023 15:52, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>> Sorry for the delay, but I have been very busy.
>>
>> Hello, no problem!
>>
>>>
>>> I can't apply this series on master or net-next, can you share with me
>>> the base commit?
>>
>> Here is my base:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=b103bab0944be030954e5de23851b37980218f54
>>
>
> Thanks, it worked!
>
>>>
>>> On Sun, Apr 23, 2023 at 10:26:28PM +0300, Arseniy Krasnov wrote:
>>>> Hello,
>>>>
>>>> DESCRIPTION
>>>>
>>>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>>>> current implementation for TCP as much as possible:
>>>>
>>>> 1) Sender must enable the SO_ZEROCOPY flag to use this feature. Without this
>>>> flag, data will be sent in the "classic" copy manner and the MSG_ZEROCOPY
>>>> flag will be ignored (i.e. no completion is generated).
>>>>
>>>> 2) Kernel uses completions from the socket's error queue: a single completion
>>>> per tx syscall (or several completions may be merged into a single
>>>> one). I used the already implemented logic for MSG_ZEROCOPY support:
>>>> 'msg_zerocopy_realloc()' etc.
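
From user space, the two steps in the list above could look roughly like the sketch below. It is only an illustration: the helper names 'enable_zerocopy()' and 'send_zerocopy()' are made up for this example, and the fallback constants are the asm-generic Linux values, which may differ on some architectures.

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older libc headers (asm-generic Linux UAPI values). */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/*
 * Hypothetical helper: opt the socket in to zerocopy transmission.
 * Returns 0 on success, -1 with errno set on failure.
 */
int enable_zerocopy(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/*
 * Hypothetical helper: send one buffer with MSG_ZEROCOPY.
 * The caller must not modify or free 'buf' until the matching
 * completion has been read from the socket's error queue.
 */
ssize_t send_zerocopy(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, MSG_ZEROCOPY);
}
```

Without the 'enable_zerocopy()' step, the flag passed by 'send_zerocopy()' is ignored and the data is copied, as described in 1).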
>>>>
>>>> The difference from the copy path is not significant. During packet allocation,
>>>> a non-linear skb is created, then I call 'pin_user_pages()' for each page
>>>> from the user's iov iterator and add each returned page to the skb as a fragment.
>>>> There are also some updates for the vhost and guest parts of the transport: in
>>>> both cases I've added handling of non-linear skbs for the virtio part. vhost
>>>> copies data from such an skb to the guest's rx virtio buffers. In the guest,
>>>> the virtio transport fills the tx virtio queue with pages from the skb.
>>>>
>>>> This version has several limits/problems:
>>>>
>>>> 1) As this feature totally depends on the transport, there is no way (or it
>>>> is difficult) to check whether the transport is able to handle it
>>>> when SO_ZEROCOPY is set. It seems I need to call the AF_VSOCK specific
>>>> setsockopt callback from the setsockopt callback for SOL_SOCKET, but this
>>>> leads to a locking problem, because the AF_VSOCK and SOL_SOCKET callbacks
>>>> are not designed to be called from each other. So in the current version
>>>> SO_ZEROCOPY is set successfully on any type (i.e. transport) of
>>>> AF_VSOCK socket, but if the transport does not support MSG_ZEROCOPY,
>>>> the tx routine will fail with EOPNOTSUPP.
>>>
>>> Do you plan to fix this in the next versions?
>>>
>>> If it is too complicated, I think we can have this limitation until we
>>> find a good solution.
>>>
>>
>> I'll try to fix it again; I just didn't pay attention to it in v2.
>>
>>>>
>>>> 2) When MSG_ZEROCOPY is used, we need to enqueue one completion for each
>>>> tx system call. Each completion carries a flag which shows how the tx
>>>> was performed: zerocopy or copy. This means that the whole message must
>>>> be sent either in zerocopy or in copy mode - we can't send part of a message
>>>> with copying and the rest with zerocopy (or vice versa). Now,
>>>> we need to account for the vsock credit logic, i.e. we can't send all the data
>>>> at once - only the allowed number of bytes can be sent at any moment. In the
>>>> copy case there is no problem, as in the worst case we can send single
>>>> bytes, but zerocopy is more complex because the smallest transmission
>>>> unit is a single page. So if there is not enough space at the peer's side
>>>> to send an integer number of pages (at least one), we will wait, thus
>>>> stalling the tx side. To overcome this problem I've added a simple rule:
>>>> zerocopy is possible only when there is enough space at the other side
>>>> for the whole message (to check whether the current 'msghdr' was already used
>>>> in previous tx iterations I use the 'iov_offset' field of its iov iter).
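
The rule in 2) can be written down as a small predicate. This is a sketch only, not the code from the patchset: the function name and its parameters are hypothetical, with 'peer_free' standing for the credit currently available at the peer.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: decide whether a message may go out in zerocopy mode.
 *
 *   msg_len    - total number of bytes in the message
 *   iov_offset - bytes of this msghdr already consumed by earlier tx
 *                iterations (non-zero means part was already sent)
 *   peer_free  - bytes the peer can currently accept (vsock credit)
 */
bool zerocopy_allowed(size_t msg_len, size_t iov_offset, size_t peer_free)
{
	/* A partially sent message must continue in copy mode. */
	if (iov_offset != 0)
		return false;

	/* Zerocopy only if the whole message fits into the peer's space. */
	return msg_len <= peer_free;
}
```

When the predicate is false the message is copied, so a single completion can still describe the whole tx call.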
>>>
>>> So, IIUC, if MSG_ZEROCOPY is set but there isn't enough space in the
>>> destination, we temporarily disable zerocopy, even though the flag is set.
>>> Right?
>>
>> Exactly, the user still needs to get a completion (because SO_ZEROCOPY is enabled and
>> the MSG_ZEROCOPY flag was used). But the completion structure contains information that
>> the data was copied instead of being sent with zerocopy.
>
> Got it.
>
>>
>>>
>>> If it is the case it seems reasonable to me.
>>>
>>>>
>>>> 3) loopback transport is not supported, because it requires implementing
>>>> non-linear skb handling in the dequeue logic (as we "send" a fragged skb
>>>> and "receive" it from the same queue). I'm going to implement it in
>>>> the next versions.
>>>>
>>>> ^^^ fixed in v2
>>>>
>>>> 4) Current implementation sets the max length of a packet to 64KB. IIUC this
>>>> is due to 'kmalloc()' allocated data buffers. I think that in the case of
>>>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>>> not used for data - user space pages are used as buffers. Also
>>>> this limit trims every message which is > 64KB, thus such messages
>>>> will be sent in copy mode due to the 'iov_offset' check in 2).
>>>>
>>>> ^^^ fixed in v2
>>>>
>>>> PATCHSET STRUCTURE
>>>>
>>>> Patchset has the following structure:
>>>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>>>> 2) Handle non-linear skbuff on send in virtio/vhost.
>>>> 3) Updates for AF_VSOCK.
>>>> 4) Enable MSG_ZEROCOPY support on transports.
>>>> 5) Tests/tools/docs updates.
>>>>
>>>> PERFORMANCE
>>>>
>>>> Performance: it is a little bit tricky to compare performance between
>>>> copy and zerocopy transmissions. In the zerocopy case we need to wait until
>>>> the user buffers are released by the kernel, so it is something like a
>>>> synchronous path (wait until the device driver has processed it), while in the
>>>> copy case we can feed as much data to the kernel as we want, without caring
>>>> about the device driver. So I compared only the time spent in the 'send()'
>>>> syscall. Combining this value with the total number of transmitted
>>>> bytes gives the Gbit/s figure. Also, to avoid tx stalls due to insufficient
>>>> credit, the receiver allocates the same amount of space as the sender needs.
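
The Gbit/s figure described above is plain arithmetic. A sketch of it follows, assuming the time in 'send()' is accumulated in nanoseconds and that 1 Gbit means 10^9 bits; the function name is hypothetical and this is not necessarily how vsock_perf computes it.

```c
#include <stdint.h>

/*
 * Hypothetical helper: throughput from total bytes sent and the total
 * time spent inside send(), in nanoseconds.
 * bits / ns is numerically equal to Gbit/s (10^9 bits per second).
 */
double gbit_per_sec(uint64_t bytes, uint64_t send_ns)
{
	if (send_ns == 0)
		return 0.0;	/* nothing measured, avoid division by zero */

	return (double)bytes * 8.0 / (double)send_ns;
}
```

Because only the time inside 'send()' is counted, the result says how fast user space can hand data to the kernel, not how fast the data reaches the peer.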
>>>>
>>>> Sender:
>>>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>>>
>>>> Receiver:
>>>> ./vsock_perf --vsk-size 256M
>>>>
>>>> G2H transmission (values are Gbit/s):
>>>>
>>>> *-------------------------------*
>>>> | buf size | copy   | zerocopy  |
>>>> *-------------------------------*
>>>> | 4KB      | 3      | 10        |
>>>> *-------------------------------*
>>>> | 32KB     | 9      | 45        |
>>>> *-------------------------------*
>>>> | 256KB    | 24     | 195       |
>>>> *-------------------------------*
>>>> | 1M       | 27     | 270       |
>>>> *-------------------------------*
>>>> | 8M       | 22     | 277       |
>>>> *-------------------------------*
>>>>
>>>> H2G:
>>>>
>>>> *-------------------------------*
>>>> | buf size | copy   | zerocopy  |
>>>> *-------------------------------*
>>>> | 4KB      | 17     | 11        |
>>>
>>> Do you know why zerocopy is slower in this case?
>>> Could be the cost of pin/unpin pages?
>> Maybe, I think I need to analyze such an enormous difference more. Also about
>> pin/unpin: I found that there is an already implemented function to fill a non-linear
>> skb with pages from the user's iov: __zerocopy_sg_from_iter() in net/core/datagram.c.
>> It uses 'get_user_pages()' instead of 'pin_user_pages()'. Maybe in my case it
>> is also valid to use 'get_XXX()' instead of 'pin_XXX()', because it is used by
>> TCP MSG_ZEROCOPY and io_uring MSG_ZEROCOPY.
>
> If we can reuse them, it will be great!
>
>>
>>>
>>>> *-------------------------------*
>>>> | 32KB     | 30     | 66        |
>>>> *-------------------------------*
>>>> | 256KB    | 38     | 179       |
>>>> *-------------------------------*
>>>> | 1M       | 38     | 234       |
>>>> *-------------------------------*
>>>> | 8M       | 28     | 279       |
>>>> *-------------------------------*
>>>>
>>>> Loopback:
>>>>
>>>> *-------------------------------*
>>>> | buf size | copy   | zerocopy  |
>>>> *-------------------------------*
>>>> | 4KB      | 8      | 7         |
>>>> *-------------------------------*
>>>> | 32KB     | 34     | 42        |
>>>> *-------------------------------*
>>>> | 256KB    | 43     | 83        |
>>>> *-------------------------------*
>>>> | 1M       | 40     | 109       |
>>>> *-------------------------------*
>>>> | 8M       | 40     | 171       |
>>>> *-------------------------------*
>>>>
>>>> I suppose that the huge difference above between the two modes has two reasons:
>>>> 1) We don't need to copy data.
>>>> 2) We don't need to allocate buffer for data, only for header.
>>>>
>>>> Zerocopy is faster than classic copy mode, but of course it requires
>>>> a specific application architecture due to user page pinning, buffer
>>>> size and alignment.
>>>>
>>>> If the host fails to send data with "Cannot allocate memory", check the value of
>>>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>>>> allocation.
>>>
>>> What does the user need to do? Increase it?
>>>
>> Yes, I'll update it.
>>>>
>>>> TESTING
>>>>
>>>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I tried to
>>>> cover new code as much as possible, so there are different cases for
>>>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>>>> vector types (different sizes, alignments, with unmapped pages). I also
>>>> ran tests with the loopback transport and with vsockmon running.
>>>
>>> Thanks for the test again :-)
>>>
>>> This cover letter is very good, with a lot of details, but please add
>>> more details in each single patch, explaining the reason of the changes,
>>> otherwise it is very difficult to review, because it is a very big
>>> change.
>>>
>>> I'll do a per-patch review in the next days.
>>
>> Sure, thanks! For v3 I'm also working on an io_uring test, because io_uring also
>> supports MSG_ZEROCOPY, so we can do virtio/vsock + MSG_ZEROCOPY + io_uring.
>
> That would be cool!
>
> Do you want me to review these patches, or is it better to wait for v3?
I think it is OK to wait for v3, as I'm going to reduce the size of the new kernel code,
mainly by reusing already implemented functions instead of my own.
Thanks, Arseniy
>
> Thanks,
> Stefano
>