2023-07-01 07:13:36

by Arseniy Krasnov

Subject: [RFC PATCH v5 00/17] vsock: MSG_ZEROCOPY flag support

Hello,

DESCRIPTION

This patchset adds MSG_ZEROCOPY support for virtio/vsock. I tried to
follow the current TCP implementation as closely as possible:

1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
flag, data will be sent in the "classic" copy manner and the
MSG_ZEROCOPY flag will be ignored (i.e. no completion is generated).

2) The kernel reports completions through the socket's error queue: one
completion per tx syscall (several completions may be merged into a
single one). I reused the already implemented MSG_ZEROCOPY logic:
'msg_zerocopy_realloc()' etc.

The difference from the copy path is not significant. During packet
allocation, a non-linear skb is created and filled with pinned user pages.
There are also some updates for the vhost and guest parts of the
transport - in both cases I've added handling of non-linear skbs for the
virtio part. vhost copies data from such an skb to the guest's rx virtio
buffers. In the guest, the virtio transport fills the tx virtio queue
with pages from the skb.

Head of this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b


This version has several limits/problems (all resolved at v5):

1) As this feature totally depends on transport, there is no way (or it
is difficult) to check whether the transport is able to handle it
while SO_ZEROCOPY is being set. It seems I need to call the AF_VSOCK
specific setsockopt callback from the setsockopt callback for
SOL_SOCKET, but this leads to a locking problem, because the AF_VSOCK
and SOL_SOCKET callbacks are not designed to be called from each
other. So in the current version SO_ZEROCOPY is set successfully on
any type (i.e. transport) of AF_VSOCK socket, but if the transport
does not support MSG_ZEROCOPY, the tx routine will fail with
EOPNOTSUPP.

^^^ fixed in v5. Thanks to Bobby Eshleman.

2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
one completion. Each completion carries a flag which shows how tx
was performed: zerocopy or copy. This means that the whole message
must be sent either in zerocopy or in copy mode - we can't send part
of a message with copying and the rest with zerocopy (or vice versa).
Now, we need to account for vsock credit logic, i.e. we can't send
all data at once - only the allowed number of bytes can be sent at
any moment. In copy mode there is no problem, as in the worst case we
can send single bytes, but zerocopy is more complex because the
smallest transmission unit is a single page. So if there is not
enough space at the peer's side to send an integer number of pages
(at least one), we will wait, thus stalling the tx side. To overcome
this problem I've added a simple rule: zerocopy is possible only when
there is enough space at the other side for the whole message (to
check whether the current 'msghdr' was already used in previous tx
iterations I use the 'iov_offset' field of its iov iter).

^^^
Discussed as ok during v2. Link:
https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/

3) Loopback transport is not supported, because it requires implementing
non-linear skb handling in the dequeue logic (as we "send" a fragged
skb and "receive" it from the same queue). I'm going to implement it
in the next versions.

^^^ fixed in v2

4) Current implementation sets max length of packet to 64KB. IIUC this
is due to 'kmalloc()' allocated data buffers. I think that in case of
MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
not used for data - user space pages are used as buffers. Also
this limit trims every message which is > 64KB, thus such messages
will be sent in copy mode due to the 'iov_offset' check in 2).

^^^ fixed in v2
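
The rule from item 2) can be sketched as a small predicate. This is only
my illustrative model of the described behaviour, not code from the
patchset: the function name 'zc_tx_allowed()' and its parameters are
hypothetical, and 'peer_free' stands for the credit currently available
at the peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative model of the rule from item 2); all names are hypothetical.
 *
 * msg_len:    length of the message passed to this tx call
 * iov_offset: bytes of this message consumed by earlier tx iterations
 * peer_free:  free space (credit) currently available at the peer
 */
bool zc_tx_allowed(size_t msg_len, size_t iov_offset, size_t peer_free)
{
	assert(msg_len > 0);

	/* Part of the message was already sent, so the mode is fixed:
	 * the rest must go the same (copy) way.
	 */
	if (iov_offset != 0)
		return false;

	/* Zerocopy only if the whole message fits at the peer's side. */
	return msg_len <= peer_free;
}
```

With this model a 64KB message is sent in zerocopy mode only if the peer
has at least 64KB of free space at the first tx iteration; otherwise the
whole message falls back to copy mode.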

PATCHSET STRUCTURE

Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.

PERFORMANCE

Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In zerocopy mode we need to wait until
the user buffers are released by the kernel, so it is like a synchronous
path (wait until the device driver has processed them), while in copy
mode we can feed as much data to the kernel as we want, without caring
about the device driver. So I compared only the time which we spend in
the 'send()' syscall. Combining this value with the total number of
transmitted bytes gives the Gbit/s parameter. Also, to avoid tx stalls
due to insufficient credit, the receiver allocates the same amount of
space as the sender needs.

Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]

Receiver:
./vsock_perf --vsk-size 256M
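
The Gbit/s value described above is derived from the number of bytes and
the time spent in 'send()'. A minimal sketch of that calculation, assuming
the time is accumulated in nanoseconds ('tx_gbits_per_sec()' is a
hypothetical name, not a function of 'vsock_perf'):

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in Gbit/s from bytes sent and ns spent inside send().
 * One bit per nanosecond equals 10^9 bit/s, i.e. exactly 1 Gbit/s.
 */
double tx_gbits_per_sec(uint64_t bytes, uint64_t send_ns)
{
	assert(send_ns > 0);

	return (double)bytes * 8.0 / (double)send_ns;
}
```

For example, 256M (268435456 bytes) sent with one second spent in
'send()' corresponds to about 2.15 Gbit/s.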

I ran tests on two setups. The first is a desktop with Core i7 - I use
this PC for development, and in this case the guest is a nested guest and
the host is a normal guest. The second is an embedded board with Atom -
here I don't have nested virtualization: the host runs on hardware and
the guest is a normal guest.

G2H transmission (values are Gbit/s):

Core i7 with nested guest.            Atom with normal guest.

*--------------------------------*    *--------------------------------*
| buf size | copy     | zerocopy |    | buf size | copy     | zerocopy |
*--------------------------------*    *--------------------------------*
| 4KB      | 3        | 10       |    | 4KB      | 0.8      | 1.9      |
| 32KB     | 20       | 61       |    | 32KB     | 6.8      | 20.2     |
| 256KB    | 33       | 244      |    | 256KB    | 7.8      | 55       |
| 1M       | 30       | 373      |    | 1M       | 7        | 95       |
| 8M       | 22       | 475      |    | 8M       | 7        | 114      |
*--------------------------------*    *--------------------------------*

H2G transmission (values are Gbit/s):

Core i7 with nested guest.            Atom with normal guest.

*--------------------------------*    *--------------------------------*
| buf size | copy     | zerocopy |    | buf size | copy     | zerocopy |
*--------------------------------*    *--------------------------------*
| 4KB      | 20       | 10       |    | 4KB      | 4.37     | 3        |
| 32KB     | 37       | 75       |    | 32KB     | 11       | 18       |
| 256KB    | 44       | 299      |    | 256KB    | 11       | 62       |
| 1M       | 28       | 335      |    | 1M       | 9        | 77       |
| 8M       | 27       | 417      |    | 8M       | 9.35     | 115      |
*--------------------------------*    *--------------------------------*

* Let's look at the first line of the H2G table - where copy is better
than zerocopy. I analyzed this case more deeply and found that the
bottleneck is the function 'vhost_work_queue()'. With 4K buffer size,
the caller spends too much time in it in zerocopy mode (compared to
copy mode). This happens only with 4K buffer size. This function just
calls 'wake_up_process()' and its internal logic does not depend on
the skb, so I think a potential reason is the interval between two
calls of this function (i.e. how often it is called). Note that
'vhost_work_queue()' differs from the same function at the guest's
side of the transport: 'virtio_transport_send_pkt()' uses
'queue_work()', which I think is more optimized for worker purposes
than a direct call to 'wake_up_process()'. But again - this is just my
assumption.

Loopback transmission (values are Gbit/s):

Core i7 with nested guest.            Atom with normal guest.

*--------------------------------*    *--------------------------------*
| buf size | copy     | zerocopy |    | buf size | copy     | zerocopy |
*--------------------------------*    *--------------------------------*
| 4KB      | 8        | 7        |    | 4KB      | 1.8      | 1.3      |
| 32KB     | 38       | 44       |    | 32KB     | 10       | 10       |
| 256KB    | 55       | 168      |    | 256KB    | 15       | 36       |
| 1M       | 53       | 250      |    | 1M       | 12       | 45       |
| 8M       | 40       | 344      |    | 8M       | 11       | 74       |
*--------------------------------*    *--------------------------------*

I analyzed the performance difference more deeply for the following setup:
server: ./vsock_perf --vsk-size 16M
client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]

In other words I send 16M of data from guest to host in copy/zerocopy
modes and with two different sizes of buffer - 4K and 64K. Let's look
at the tx path for both modes - it consists of two steps:

copy:
1) Allocate skb of buffer's length.
2) Copy data to skb from buffer.

zerocopy:
1) Allocate skb with header space only.
2) Pin pages of the buffer and insert them to skb.

I measured average number of ns (returned by 'ktime_get()') for each
step above:
1) Skb allocation (for both copy and zerocopy modes).
2) For copy mode in 'memcpy_to_msg()' - copying.
3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.

Here are results for copy mode:
*-------------------------------------*
| buf | skb alloc | 'memcpy_to_msg()' |
*-------------------------------------*
| 64K | 5000ns    | 25000ns           |
| 4K  | 800ns     | 2200ns            |
*-------------------------------------*

Here are results for zerocopy mode:
*-----------------------------------------------*
| buf | skb alloc | '__zerocopy_sg_from_iter()' |
*-----------------------------------------------*
| 64K | 250ns     | 3500ns                      |
| 4K  | 250ns     | 3000ns                      |
*-----------------------------------------------*

I guess that the reason for the zerocopy performance is the low overhead
of page pinning: there is a big difference between 64K and 4K in the
copying case (25000ns vs 2200ns), but in the pinning case it is just
3500ns vs 3000ns.
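
To make the comparison concrete, the two measured steps can be summed per
'send()' call. The sketch below only re-uses the numbers from the two
tables above; the type 'tx_cost' and the helper names are hypothetical.

```c
#include <assert.h>

/* Average per-call timings in ns, copied from the two tables above. */
struct tx_cost {
	unsigned int alloc_ns;  /* skb allocation */
	unsigned int fill_ns;   /* memcpy_to_msg() or __zerocopy_sg_from_iter() */
};

const struct tx_cost copy_64k = { 5000, 25000 };
const struct tx_cost zc_64k   = {  250,  3500 };
const struct tx_cost copy_4k  = {  800,  2200 };
const struct tx_cost zc_4k    = {  250,  3000 };

/* Total cost of the two measured tx steps for one send() call. */
unsigned int tx_cost_total(struct tx_cost c)
{
	return c.alloc_ns + c.fill_ns;
}

/* How many times cheaper 'b' is than 'a' (integer division). */
unsigned int tx_cost_speedup(struct tx_cost a, struct tx_cost b)
{
	assert(tx_cost_total(b) > 0);

	return tx_cost_total(a) / tx_cost_total(b);
}
```

With these numbers the 64K case costs 30000ns in copy mode against 3750ns
in zerocopy mode (8 times less), while for 4K the totals are 3000ns and
3250ns, so the measured steps alone give no zerocopy advantage at 4K.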

So, zerocopy is faster than classic copy mode, but of course it requires
specific architecture of application due to user pages pinning, buffer
size and alignment.

NOTES

If the host fails to send data with "Cannot allocate memory", check the
value of /proc/sys/net/core/optmem_max - it is accounted during
completion skb allocation. Try to increase it, for example to 1M, and
send again: "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).

TESTING

This patchset includes a set of tests for the MSG_ZEROCOPY feature. I
tried to cover the new code as much as possible, so there are different
cases for MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and
several io vector types (different sizes, alignments, with unmapped
pages). I also ran the tests with loopback transport and with vsockmon
running. In v3 I've added an io_uring test as a separate application.

LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER

In v3 Stefano Garzarella <[email protected]> asked to split this patchset
into several parts, because it looks too big for review. I think in this
version (v4) we can do it in the following way:

[0001 - 0005] - this is preparation for virtio/vhost part.
[0006 - 0009] - this is preparation for AF_VSOCK part.
[0010 - 0014] - these patches allow triggering the logic from the
previous two parts. In addition, 0014 is a patch for Documentation.
[0015 - rest] - updates for tests, utils. This part doesn't touch kernel
code and does not look critical.

Thanks, Arseniy

Link to v1:
https://lore.kernel.org/netdev/[email protected]/
Link to v2:
https://lore.kernel.org/netdev/[email protected]/
Link to v3:
https://lore.kernel.org/netdev/[email protected]/
Link to v4:
https://lore.kernel.org/netdev/[email protected]/

Changelog:
v1 -> v2:
- Replace 'get_user_pages()' with 'pin_user_pages()'.
- Loopback transport support.

v2 -> v3
- Use 'get_user_pages()' instead of 'pin_user_pages()'. I think this
is right approach, because i'm using '__zerocopy_sg_from_iter()'
function. It is already implemented and used by io_uring zerocopy
tx logic to 'pin' pages of user's buffer.

- Use 'skb_copy_datagram_iter()' to copy data from both linear and
non-linear skb to user's iov iter. It already has support for copying
data from paged part of skb (by calling 'kmap()'). In v2 i used my
own "from scratch" implemented function. With this and previous thing
I significantly reduced LOC number in kernel part.

- Add io_uring test for AF_VSOCK. It is implemented as separated util,
because it depends on liburing (i think there is no need to link
'vsock_test' with liburing, because io_uring functionality depends
on environment - both in kernel and userspace).

- Values from PERFORMANCE section are updated for all transports, but
I didn't find any significant difference with v2.

- More details in commit messages.

v3 -> v4:
- Requirement for buffers to have page aligned base and size is removed,
because virtio can handle such buffers.

- Crash with SOCK_SEQPACKET is fixed. This is done by setting owner of
new 'skb' before passing it to '__zerocopy_sg_from_iter()'. Last one
dereferences owner of the passed skb without any checks (it was NULL).

- Type of "owning" of the newly created skb is also changed: in v3 and
before it was 'skb_set_owner_sk_safe()'. I replace it with this one:
'skb_set_owner_w()'. This is because '__zerocopy_sg_from_iter()'
increments 'sk_wmem_alloc' of socket which owns skb, thus we need a
proper destructor which decrements it back - it is 'sock_wfree()'.
This destructor is set by 'skb_set_owner_w()'. Otherwise we get leak
of resource - such socket will be never deallocated.

- Use ITER_KVEC instead of ITER_IOVEC when skb is copied to another one
for passing to TAP device. Reason of this update is that ITER_IOVEC
considered as userspace memory, while we have only kernel memory here.

v4 -> v5:
- Problem 1) with dependency of SO_ZEROCOPY from the current transport
is fixed.

- See per patch changelog (after ---).

Arseniy Krasnov (17):
vsock/virtio: read data from non-linear skb
vhost/vsock: read data from non-linear skb
vsock/virtio: support to send non-linear skb
vsock/virtio: non-linear skb handling for tap
vsock/virtio: MSG_ZEROCOPY flag support
vsock: fix EPOLLERR set on non-empty error queue
vsock: read from socket's error queue
vsock: check for MSG_ZEROCOPY support on send
vsock: enable SOCK_SUPPORT_ZC bit
vhost/vsock: support MSG_ZEROCOPY for transport
vsock/virtio: support MSG_ZEROCOPY for transport
vsock/loopback: support MSG_ZEROCOPY for transport
vsock: enable setting SO_ZEROCOPY
docs: net: description of MSG_ZEROCOPY for AF_VSOCK
test/vsock: MSG_ZEROCOPY flag tests
test/vsock: MSG_ZEROCOPY support for vsock_perf
test/vsock: io_uring rx/tx tests

Documentation/networking/msg_zerocopy.rst | 12 +-
drivers/vhost/vsock.c | 21 +-
include/linux/socket.h | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 7 +
net/vmw_vsock/af_vsock.c | 61 +++-
net/vmw_vsock/virtio_transport.c | 47 +++-
net/vmw_vsock/virtio_transport_common.c | 313 ++++++++++++++++-----
net/vmw_vsock/vsock_loopback.c | 6 +
tools/testing/vsock/Makefile | 9 +-
tools/testing/vsock/util.c | 218 +++++++++++++++
tools/testing/vsock/util.h | 18 ++
tools/testing/vsock/vsock_perf.c | 139 +++++++++-
tools/testing/vsock/vsock_test.c | 16 ++
tools/testing/vsock/vsock_test_zerocopy.c | 312 +++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 15 +
tools/testing/vsock/vsock_uring_test.c | 321 ++++++++++++++++++++++
17 files changed, 1423 insertions(+), 94 deletions(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
create mode 100644 tools/testing/vsock/vsock_uring_test.c

--
2.25.1



2023-07-01 07:13:48

by Arseniy Krasnov

Subject: [RFC PATCH v5 02/17] vhost/vsock: read data from non-linear skb

This adds copying to guest's virtio buffers from non-linear skbs. Such
skbs are created by the protocol layer when the MSG_ZEROCOPY flag is
used. It replaces the call of 'copy_to_iter()' with
'skb_copy_datagram_iter()' - the latter can read data from a non-linear
skb. Also this patch uses the 'frag_off' field from the skb control
block. This field holds the current offset to read data from the skb,
which may be either linear or non-linear.
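
The 'frag_off' bookkeeping can be modelled in userspace as follows. This
is a simplified sketch under my own assumptions, not the vhost code:
'pkt_model' stands in for the skb plus its control block, and 'space' for
the room available in the guest's rx buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for an skb whose payload is sent in several chunks. */
struct pkt_model {
	size_t len;       /* total payload length; never changes */
	size_t frag_off;  /* offset of the first byte not yet sent */
};

/* Send as much as fits in 'space'; returns the number of bytes consumed. */
size_t pkt_send_chunk(struct pkt_model *p, size_t space)
{
	size_t payload_len;

	assert(p->frag_off <= p->len);

	payload_len = p->len - p->frag_off;
	if (payload_len > space)
		payload_len = space;

	/* Advance the offset instead of trimming the data (no skb_pull()). */
	p->frag_off += payload_len;
	return payload_len;
}

/* Nonzero while payload remains, i.e. the packet must be requeued. */
int pkt_needs_requeue(const struct pkt_model *p)
{
	return p->frag_off < p->len;
}
```

A 10000 byte packet sent through 4096 byte buffers is consumed as
4096 + 4096 + 1808 bytes, and 'len' stays untouched the whole time.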

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v4 -> v5:
* Use local variable for 'frag_off'.
* Update commit message by adding some details about 'frag_off' field.
* R-b from Bobby Eshleman removed due to patch update.

drivers/vhost/vsock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..cb00e0e059e4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct sk_buff *skb;
unsigned out, in;
size_t nbytes;
+ u32 frag_off;
int head;

skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}

iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
- payload_len = skb->len;
+ frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+ payload_len = skb->len - frag_off;
hdr = virtio_vsock_hdr(skb);

/* If the packet is greater than the space available in the
@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}

- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
+ if (skb_copy_datagram_iter(skb,
+ frag_off,
+ &iov_iter,
+ payload_len)) {
kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;

- skb_pull(skb, payload_len);
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
total_len += payload_len;

/* If we didn't send all the payload we can requeue the packet
* to send it with the next available buffer.
*/
- if (skb->len > 0) {
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
hdr->flags |= cpu_to_le32(flags_to_restore);

/* We are queueing the same skb to handle
--
2.25.1


2023-07-01 07:13:50

by Arseniy Krasnov

Subject: [RFC PATCH v5 14/17] docs: net: description of MSG_ZEROCOPY for AF_VSOCK

This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
socket.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Documentation/networking/msg_zerocopy.rst | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index b3ea96af9b49..34bc7ff411ce 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,8 @@ Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP and UDP sockets.
+The feature is currently implemented for TCP, UDP and VSOCK (with
+virtio transport) sockets.


Opportunity and Caveats
@@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
-specific, IP_RECVERR or IPV6_RECVERR.
+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
@@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.

printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);

+For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.

Deferred copies
~~~~~~~~~~~~~~~
@@ -235,12 +237,15 @@ Implementation
Loopback
--------

+For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbound notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.

+For VSOCK:
+Data path sent to local sockets is the same as for non-local sockets.

Testing
=======
@@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
+
+For VSOCK type of socket example can be found in tools/testing/vsock/
+vsock_test_zerocopy.c.
--
2.25.1


2023-07-01 07:14:07

by Arseniy Krasnov

Subject: [RFC PATCH v5 08/17] vsock: check for MSG_ZEROCOPY support on send

This feature totally depends on transport, so if transport doesn't
support it, return error.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/net/af_vsock.h | 7 +++++++
net/vmw_vsock/af_vsock.c | 6 ++++++
2 files changed, 13 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..ec09edc5f3a0 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -177,6 +177,9 @@ struct vsock_transport {

/* Read a single skb */
int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
+
+ /* Zero-copy. */
+ bool (*msgzerocopy_allow)(void);
};

/**** CORE ****/
@@ -243,4 +246,8 @@ static inline void __init vsock_bpf_build_proto(void)
{}
#endif

+static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
+{
+ return t->msgzerocopy_allow && t->msgzerocopy_allow();
+}
#endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 07803d9fbf6d..033006e1b5ad 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1824,6 +1824,12 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}

+ if (msg->msg_flags & MSG_ZEROCOPY &&
+ !vsock_msgzerocopy_allow(transport)) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
/* Wait for room in the produce queue to enqueue our user's data. */
timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

--
2.25.1


2023-07-01 07:14:32

by Arseniy Krasnov

Subject: [RFC PATCH v5 13/17] vsock: enable setting SO_ZEROCOPY

For AF_VSOCK, zerocopy tx mode depends on transport, so this option must
be set in the AF_VSOCK implementation where the transport is accessible.
If the transport is not yet assigned when SO_ZEROCOPY is set (for
example, the socket is not connected), SO_ZEROCOPY will be enabled, and
support for this type of transmission will be checked once the transport
is assigned.

To handle SO_ZEROCOPY, AF_VSOCK implementation uses SOCK_CUSTOM_SOCKOPT
bit, thus handling SOL_SOCKET option operations, but all of them except
SO_ZEROCOPY will be forwarded to the generic handler by calling
'sock_setsockopt()'.
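
The two check points (at 'setsockopt()' and later at 'connect()') can be
modelled in userspace like this. It is only a sketch of the behaviour
described above: 'transport_model', 'sock_model' and the 'model_*()'
functions are hypothetical stand-ins for the kernel structures and paths.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct transport_model {
	bool (*msgzerocopy_allow)(void);  /* optional callback, may be NULL */
};

struct sock_model {
	const struct transport_model *transport;  /* NULL until assigned */
	bool zerocopy;                             /* models SOCK_ZEROCOPY */
};

/* Example callback for a transport which supports MSG_ZEROCOPY. */
bool always_allow(void)
{
	return true;
}

bool transport_allows_zc(const struct transport_model *t)
{
	return t->msgzerocopy_allow && t->msgzerocopy_allow();
}

/* Models setsockopt(SOL_SOCKET, SO_ZEROCOPY, val). */
int model_set_zerocopy(struct sock_model *sk, int val)
{
	assert(sk);

	if (val < 0 || val > 1)
		return -EINVAL;

	/* Transport already known: it can be checked right away. */
	if (sk->transport && !transport_allows_zc(sk->transport))
		return -EOPNOTSUPP;

	sk->zerocopy = val;
	return 0;
}

/* Models the deferred check done when connect() assigns a transport. */
int model_connect(struct sock_model *sk, const struct transport_model *t)
{
	sk->transport = t;

	if (!transport_allows_zc(t) && sk->zerocopy)
		return -EOPNOTSUPP;

	return 0;
}
```

In this model, setting the option on an unconnected socket always
succeeds, and a transport without the callback is rejected only later,
in 'model_connect()'.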

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v4 -> v5:
* This patch is totally reworked. Previous version added check for
PF_VSOCK directly to 'net/core/sock.c', thus allowing to set
SO_ZEROCOPY for AF_VSOCK type of socket. This new version catches
attempt to set SO_ZEROCOPY in 'af_vsock.c'. All other options
except SO_ZEROCOPY are forwarded to generic handler. Only this
option is processed in 'af_vsock.c'. Handling this option includes
access to transport to check that MSG_ZEROCOPY transmission is
supported by the current transport (if it is set, if not - transport
will be checked during 'connect()').

net/vmw_vsock/af_vsock.c | 44 ++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index da22ae0ef477..8acc77981d01 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1406,8 +1406,18 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
goto out;
}

- if (vsock_msgzerocopy_allow(transport))
+ if (!vsock_msgzerocopy_allow(transport)) {
+ /* If this option was set before 'connect()',
+ * when transport was unknown, check that this
+ * feature is supported here.
+ */
+ if (sock_flag(sk, SOCK_ZEROCOPY)) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+ } else {
set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
+ }

err = vsock_auto_bind(vsk);
if (err)
@@ -1643,7 +1653,7 @@ static int vsock_connectible_setsockopt(struct socket *sock,
const struct vsock_transport *transport;
u64 val;

- if (level != AF_VSOCK)
+ if (level != AF_VSOCK && level != SOL_SOCKET)
return -ENOPROTOOPT;

#define COPY_IN(_v) \
@@ -1666,6 +1676,34 @@ static int vsock_connectible_setsockopt(struct socket *sock,

transport = vsk->transport;

+ if (level == SOL_SOCKET) {
+ if (optname == SO_ZEROCOPY) {
+ int zc_val;
+
+ /* Use 'int' type here, because variable to
+ * set this option usually has this type.
+ */
+ COPY_IN(zc_val);
+
+ if (zc_val < 0 || zc_val > 1) {
+ err = -EINVAL;
+ goto exit;
+ }
+
+ if (transport && !vsock_msgzerocopy_allow(transport)) {
+ err = -EOPNOTSUPP;
+ goto exit;
+ }
+
+ sock_valbool_flag(sk, SOCK_ZEROCOPY,
+ zc_val ? true : false);
+ goto exit;
+ }
+
+ release_sock(sk);
+ return sock_setsockopt(sock, level, optname, optval, optlen);
+ }
+
switch (optname) {
case SO_VM_SOCKETS_BUFFER_SIZE:
COPY_IN(val);
@@ -2321,6 +2359,8 @@ static int vsock_create(struct net *net, struct socket *sock,
}
}

+ set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
+
vsock_insert_unbound(vsk);

return 0;
--
2.25.1


2023-07-01 07:14:49

by Arseniy Krasnov

Subject: [RFC PATCH v5 03/17] vsock/virtio: support to send non-linear skb

For non-linear skb use its pages from fragment array as buffers in
virtio tx queue. These pages are already pinned by 'get_user_pages()'
during such skb creation.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v4 -> v5:
* Use 'out_sgs' variable to index 'bufs', not only 'sgs'.
* Move smaller branch above, see 'if (!skb_is_nonlinear(skb)').
* Remove blank line.
* R-b from Bobby Eshleman removed due to patch update.

net/vmw_vsock/virtio_transport.c | 40 +++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..6cbb45bb12d2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];

for (;;) {
- struct scatterlist hdr, buf, *sgs[2];
+ /* +1 is for packet header. */
+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +113,38 @@ virtio_transport_send_pkt_work(struct work_struct *work)

virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+ sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
+ sizeof(*virtio_vsock_hdr(skb)));
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+
+ if (!skb_is_nonlinear(skb)) {
+ if (skb->len > 0) {
+ sg_init_one(&bufs[out_sg], skb->data, skb->len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
+ } else {
+ struct skb_shared_info *si;
+ int i;
+
+ si = skb_shinfo(skb);
+
+ for (i = 0; i < si->nr_frags; i++) {
+ skb_frag_t *skb_frag = &si->frags[i];
+ void *va = page_to_virt(skb_frag->bv_page);

- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
- sgs[out_sg++] = &hdr;
- if (skb->len > 0) {
- sg_init_one(&buf, skb->data, skb->len);
- sgs[out_sg++] = &buf;
+ /* We will use 'page_to_virt()' for userspace page here,
+ * because virtio layer will call 'virt_to_phys()' later
+ * to fill buffer descriptor. We don't touch memory at
+ * "virtual" address of this page.
+ */
+ sg_init_one(&bufs[out_sg],
+ va + skb_frag->bv_offset,
+ skb_frag->bv_len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
}

ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
--
2.25.1


2023-07-01 07:15:06

by Arseniy Krasnov

Subject: [RFC PATCH v5 11/17] vsock/virtio: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for virtio transport.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v4 -> v5:
* Move 'msgzerocopy_allow' right after seqpacket callbacks.

net/vmw_vsock/virtio_transport.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 6cbb45bb12d2..8d3e9f441fa1 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -441,6 +441,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+static bool virtio_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport virtio_transport = {
@@ -474,6 +479,8 @@ static struct virtio_transport virtio_transport = {
.seqpacket_allow = virtio_transport_seqpacket_allow,
.seqpacket_has_data = virtio_transport_seqpacket_has_data,

+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,
+
.notify_poll_in = virtio_transport_notify_poll_in,
.notify_poll_out = virtio_transport_notify_poll_out,
.notify_recv_init = virtio_transport_notify_recv_init,
--
2.25.1


2023-07-01 07:15:10

by Arseniy Krasnov

Subject: [RFC PATCH v5 04/17] vsock/virtio: non-linear skb handling for tap

For the tap device a new skb is created and data from the current skb is
copied to it. This adds copying of data from a non-linear skb to the
new skb.
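
Conceptually, copying from a non-linear skb means gathering bytes from a
list of fragments into one flat buffer, starting at a given offset. The
userspace sketch below illustrates that idea with 'struct iovec' entries
standing in for skb fragments; 'gather_frags()' is a hypothetical helper
and not the kernel's 'skb_copy_datagram_iter()'.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Copy up to 'len' bytes, starting 'off' bytes into the fragment list,
 * into the flat buffer 'dst'. Returns the number of bytes copied.
 */
size_t gather_frags(const struct iovec *frags, int nr_frags,
		    size_t off, void *dst, size_t len)
{
	size_t copied = 0;
	int i;

	assert(frags && dst);

	for (i = 0; i < nr_frags && copied < len; i++) {
		size_t flen = frags[i].iov_len;
		size_t n;

		/* Fragment lies entirely before the start offset. */
		if (off >= flen) {
			off -= flen;
			continue;
		}

		n = flen - off;
		if (n > len - copied)
			n = len - copied;

		memcpy((char *)dst + copied,
		       (const char *)frags[i].iov_base + off, n);
		copied += n;
		off = 0;  /* later fragments are read from their start */
	}

	return copied;
}
```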

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v4 -> v5:
* Make 'skb' pointer constant because it is source.

net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index e5683af23e60..dfc48b56d0a2 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}

+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ struct iov_iter iov_iter = { 0 };
+ struct kvec kvec;
+ size_t to_copy;
+
+ kvec.iov_base = dst;
+ kvec.iov_len = len;
+
+ iov_iter.iter_type = ITER_KVEC;
+ iov_iter.kvec = &kvec;
+ iov_iter.nr_segs = 1;
+
+ to_copy = min_t(size_t, len, skb->len);
+
+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &iov_iter, to_copy);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;

/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;

skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));

if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(pkt)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}

return skb;
--
2.25.1


2023-07-01 07:15:13

by Arseniy Krasnov

Subject: [RFC PATCH v5 15/17] test/vsock: MSG_ZEROCOPY flag tests

This adds three tests for the MSG_ZEROCOPY feature:
1) SOCK_STREAM tx with different buffers.
2) SOCK_SEQPACKET tx with different buffers.
3) SOCK_STREAM test to read empty error queue of the socket.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.c | 218 +++++++++++++++
tools/testing/vsock/util.h | 18 ++
tools/testing/vsock/vsock_test.c | 16 ++
tools/testing/vsock/vsock_test_zerocopy.c | 312 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 15 ++
6 files changed, 580 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 43a254f0e14d..0a78787d1d92 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
all: test vsock_perf
test: vsock_test vsock_diag_test
-vsock_test: vsock_test.o timeout.o control.o util.o
+vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
vsock_perf: vsock_perf.o

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 01b636d3039a..6397bfe661b6 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -11,15 +11,23 @@
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
+#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <assert.h>
#include <sys/epoll.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>
+#include <poll.h>

#include "timeout.h"
#include "control.h"
#include "util.h"

+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
/* Install signal handlers */
void init_signals(void)
{
@@ -408,3 +416,213 @@ unsigned long hash_djb2(const void *data, size_t len)

return hash;
}
+
+void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val))) {
+ perror("setsockopt");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void *mmap_no_fail(size_t bytes)
+{
+ void *res;
+
+ res = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (res == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ return res;
+}
+
+size_t iovec_bytes(const struct iovec *iov, size_t iovnum)
+{
+ size_t bytes;
+ int i;
+
+ for (bytes = 0, i = 0; i < iovnum; i++)
+ bytes += iov[i].iov_len;
+
+ return bytes;
+}
+
+static void iovec_random_init(struct iovec *iov,
+ const struct vsock_test_data *test_data)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ int j;
+
+ if (test_data->vecs[i].iov_base == MAP_FAILED)
+ continue;
+
+ for (j = 0; j < iov[i].iov_len; j++)
+ ((uint8_t *)iov[i].iov_base)[j] = rand() & 0xff;
+ }
+}
+
+unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum)
+{
+ unsigned long hash;
+ size_t iov_bytes;
+ size_t offs;
+ void *tmp;
+ int i;
+
+ iov_bytes = iovec_bytes(iov, iovnum);
+
+ tmp = malloc(iov_bytes);
+ if (!tmp) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (offs = 0, i = 0; i < iovnum; i++) {
+ memcpy(tmp + offs, iov[i].iov_base, iov[i].iov_len);
+ offs += iov[i].iov_len;
+ }
+
+ hash = hash_djb2(tmp, iov_bytes);
+ free(tmp);
+
+ return hash;
+}
+
+struct iovec *iovec_from_test_data(const struct vsock_test_data *test_data)
+{
+ const struct iovec *test_iovec;
+ struct iovec *iovec;
+ int i;
+
+ iovec = malloc(sizeof(*iovec) * test_data->vecs_cnt);
+ if (!iovec) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ test_iovec = test_data->vecs;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ iovec[i].iov_len = test_iovec[i].iov_len;
+ iovec[i].iov_base = mmap_no_fail(test_iovec[i].iov_len);
+
+ if (test_iovec[i].iov_base != MAP_FAILED &&
+ test_iovec[i].iov_base)
+ iovec[i].iov_base += (uintptr_t)test_iovec[i].iov_base;
+ }
+
+ /* Unmap "invalid" elements. */
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_iovec[i].iov_base == MAP_FAILED) {
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ iovec_random_init(iovec, test_data);
+
+ return iovec;
+}
+
+void free_iovec_test_data(const struct vsock_test_data *test_data,
+ struct iovec *iovec)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_data->vecs[i].iov_base != MAP_FAILED) {
+ if (test_data->vecs[i].iov_base)
+ iovec[i].iov_base -= (uintptr_t)test_data->vecs[i].iov_base;
+
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ free(iovec);
+}
+
+#define POLL_TIMEOUT_MS 100
+void vsock_recv_completion(int fd, bool zerocopied, bool completion)
+{
+ struct sock_extended_err *serr;
+ struct msghdr msg = { 0 };
+ struct pollfd fds = { 0 };
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ ssize_t res;
+
+ fds.fd = fd;
+ fds.events = 0;
+
+ if (poll(&fds, 1, POLL_TIMEOUT_MS) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ if (completion) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ } else {
+ return;
+ }
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res) {
+ fprintf(stderr, "failed to read error queue: %zi\n", res);
+ exit(EXIT_FAILURE);
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_type != 0) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin: %u\n", serr->ee_origin);
+ exit(EXIT_FAILURE);
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code: %u\n", serr->ee_errno);
+ exit(EXIT_FAILURE);
+ }
+
+ if (zerocopied && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was copy instead of zerocopy\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!zerocopied && !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was zerocopy instead of copy\n");
+ exit(EXIT_FAILURE);
+ }
+}
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..d07c1a9c2e0a 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -2,6 +2,7 @@
#ifndef UTIL_H
#define UTIL_H

+#include <stdbool.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

@@ -18,6 +19,16 @@ struct test_opts {
unsigned int peer_cid;
};

+#define VSOCK_TEST_DATA_MAX_IOV 4
+
+struct vsock_test_data {
+ bool zerocopied; /* Data must be zerocopied. */
+ bool completion; /* Must dequeue completion. */
+ int sendmsg_errno; /* 'errno' after 'sendmsg()'. */
+ int vecs_cnt; /* Number of elements in 'vecs'. */
+ struct iovec vecs[VSOCK_TEST_DATA_MAX_IOV];
+};
+
/* A test case definition. Test functions must print failures to stderr and
* terminate with exit(EXIT_FAILURE).
*/
@@ -50,4 +61,11 @@ void list_tests(const struct test_case *test_cases);
void skip_test(struct test_case *test_cases, size_t test_cases_len,
const char *test_id_str);
unsigned long hash_djb2(const void *data, size_t len);
+void enable_so_zerocopy(int fd);
+size_t iovec_bytes(const struct iovec *iov, size_t iovnum);
+unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum);
+struct iovec *iovec_from_test_data(const struct vsock_test_data *test_data);
+void free_iovec_test_data(const struct vsock_test_data *test_data,
+ struct iovec *iovec);
+void vsock_recv_completion(int fd, bool zerocopied, bool completion);
#endif /* UTIL_H */
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index ac1bd3ac1533..d576b18bd357 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -20,6 +20,7 @@
#include <sys/mman.h>
#include <poll.h>

+#include "vsock_test_zerocopy.h"
#include "timeout.h"
#include "control.h"
#include "util.h"
@@ -1128,6 +1129,21 @@ static struct test_case test_cases[] = {
.run_client = test_stream_virtio_skb_merge_client,
.run_server = test_stream_virtio_skb_merge_server,
},
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY",
+ .run_client = test_stream_msg_zcopy_client,
+ .run_server = test_stream_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_SEQPACKET MSG_ZEROCOPY",
+ .run_client = test_seqpacket_msg_zcopy_client,
+ .run_server = test_seqpacket_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY empty MSG_ERRQUEUE",
+ .run_client = test_stream_msg_zcopy_empty_errq_client,
+ .run_server = test_stream_msg_zcopy_empty_errq_server,
+ },
{},
};

diff --git a/tools/testing/vsock/vsock_test_zerocopy.c b/tools/testing/vsock/vsock_test_zerocopy.c
new file mode 100644
index 000000000000..c5539c5dbded
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* MSG_ZEROCOPY feature tests for vsock
+ *
+ * Copyright (C) 2023 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include <linux/errqueue.h>
+#include <linux/kernel.h>
+#include <error.h>
+#include <errno.h>
+
+#include "control.h"
+#include "vsock_test_zerocopy.h"
+
+#define PAGE_SIZE 4096
+
+static struct vsock_test_data test_data_array[] = {
+ /* Last element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, 200 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE * 2 },
+ { NULL, PAGE_SIZE * 3 }
+ }
+ },
+ /* All elements have page aligned base and size, but the
+ * total data length is bigger than 64 KiB.
+ */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has both non-page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { (void *)1, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element is unmapped. */
+ {
+ .zerocopied = false,
+ .completion = false,
+ .sendmsg_errno = ENOMEM,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { MAP_FAILED, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Valid data, but SO_ZEROCOPY is off. */
+ {
+ .zerocopied = true,
+ .completion = false,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 1,
+ {
+ { NULL, PAGE_SIZE }
+ }
+ },
+};
+
+static void __test_msg_zerocopy_client(const struct test_opts *opts,
+ const struct vsock_test_data *test_data,
+ bool sock_seqpacket)
+{
+ struct msghdr msg = { 0 };
+ ssize_t sendmsg_res;
+ struct iovec *iovec;
+ int fd;
+
+ if (sock_seqpacket)
+ fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
+ else
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (test_data->completion)
+ enable_so_zerocopy(fd);
+
+ iovec = iovec_from_test_data(test_data);
+
+ msg.msg_iov = iovec;
+ msg.msg_iovlen = test_data->vecs_cnt;
+
+ errno = 0;
+
+ sendmsg_res = sendmsg(fd, &msg, MSG_ZEROCOPY);
+ if (errno != test_data->sendmsg_errno) {
+ fprintf(stderr, "expected 'errno' == %i, got %i\n",
+ test_data->sendmsg_errno, errno);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!errno) {
+ if (sendmsg_res != iovec_bytes(iovec, test_data->vecs_cnt)) {
+ fprintf(stderr, "expected 'sendmsg()' == %zu, got %zi\n",
+ iovec_bytes(iovec, test_data->vecs_cnt),
+ sendmsg_res);
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ vsock_recv_completion(fd, test_data->zerocopied, test_data->completion);
+
+ if (test_data->sendmsg_errno == 0)
+ control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
+ else
+ control_writeulong(0);
+
+ control_writeln("DONE");
+ free_iovec_test_data(test_data, iovec);
+ close(fd);
+}
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_msg_zerocopy_client(opts, &test_data_array[i], false);
+}
+
+void test_seqpacket_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_msg_zerocopy_client(opts, &test_data_array[i], true);
+}
+
+static void __test_stream_server(const struct test_opts *opts,
+ const struct vsock_test_data *test_data,
+ bool sock_seqpacket)
+{
+ unsigned long remote_hash;
+ unsigned long local_hash;
+ ssize_t total_bytes_rec;
+ unsigned char *data;
+ size_t data_len;
+ int fd;
+
+ if (sock_seqpacket)
+ fd = vsock_seqpacket_accept(VMADDR_CID_ANY, 1234, NULL);
+ else
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ data_len = iovec_bytes(test_data->vecs, test_data->vecs_cnt);
+
+ data = malloc(data_len);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ total_bytes_rec = 0;
+
+ while (total_bytes_rec != data_len) {
+ ssize_t bytes_rec;
+
+ bytes_rec = read(fd, data + total_bytes_rec,
+ data_len - total_bytes_rec);
+ if (bytes_rec <= 0)
+ break;
+
+ total_bytes_rec += bytes_rec;
+ }
+
+ if (test_data->sendmsg_errno == 0)
+ local_hash = hash_djb2(data, data_len);
+ else
+ local_hash = 0;
+
+ free(data);
+
+ /* Waiting for some result. */
+ remote_hash = control_readulong();
+ if (remote_hash != local_hash) {
+ fprintf(stderr, "hash mismatch\n");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_server(opts, &test_data_array[i], false);
+}
+
+void test_seqpacket_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_server(opts, &test_data_array[i], true);
+}
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts)
+{
+ struct msghdr msg = { 0 };
+ char cmsg_data[128];
+ ssize_t res;
+ int fd;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res != -1) {
+ fprintf(stderr, "expected 'recvmsg(2)' failure, got %zi\n",
+ res);
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
diff --git a/tools/testing/vsock/vsock_test_zerocopy.h b/tools/testing/vsock/vsock_test_zerocopy.h
new file mode 100644
index 000000000000..220b4f94f042
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VSOCK_TEST_ZEROCOPY_H
+#define VSOCK_TEST_ZEROCOPY_H
+#include "util.h"
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_server(const struct test_opts *opts);
+
+void test_seqpacket_msg_zcopy_client(const struct test_opts *opts);
+void test_seqpacket_msg_zcopy_server(const struct test_opts *opts);
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts);
+
+#endif /* VSOCK_TEST_ZEROCOPY_H */
--
2.25.1
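
The tests in this patch compare a djb2 hash computed by the sender over its iovec with the hash the receiver computes over the bytes it read. As a minimal sketch, the same value can be obtained by feeding the iovec elements into the hash one after another, without the temporary linear copy. 'djb2_update()' and 'iovec_hash_incremental()' are hypothetical names, and the 5381 seed with the 'hash * 33 + byte' step is the conventional djb2 definition, assumed here to match 'hash_djb2()' in util.c.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Conventional djb2 step: hash = hash * 33 + byte. */
static unsigned long djb2_update(unsigned long hash, const void *data,
				 size_t len)
{
	const unsigned char *p = data;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ((hash << 5) + hash) + p[i];

	return hash;
}

/* Hash all iovec elements as if they were one contiguous buffer. */
static unsigned long iovec_hash_incremental(const struct iovec *iov,
					    size_t iovnum)
{
	unsigned long hash = 5381;	/* Assumed djb2 seed. */
	size_t i;

	assert(iov || !iovnum);

	for (i = 0; i < iovnum; i++)
		hash = djb2_update(hash, iov[i].iov_base, iov[i].iov_len);

	return hash;
}
```

Because djb2 only carries a running value between bytes, splitting the input at iovec boundaries does not change the result.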


2023-07-01 07:29:03

by Arseniy Krasnov

Subject: [RFC PATCH v5 16/17] test/vsock: MSG_ZEROCOPY support for vsock_perf

To use this option, pass the '--zc' parameter:

./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>

With this option, the MSG_ZEROCOPY flag is passed to the 'send()' call.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/vsock_perf.c | 139 +++++++++++++++++++++++++++++--
1 file changed, 130 insertions(+), 9 deletions(-)

diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
index a72520338f84..7fd76f7a3c16 100644
--- a/tools/testing/vsock/vsock_perf.c
+++ b/tools/testing/vsock/vsock_perf.c
@@ -18,6 +18,8 @@
#include <poll.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>

#define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
#define DEFAULT_TO_SEND_BYTES (64 * 1024)
@@ -28,9 +30,14 @@
#define BYTES_PER_GB (1024 * 1024 * 1024ULL)
#define NSEC_PER_SEC (1000000000ULL)

+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
static unsigned int port = DEFAULT_PORT;
static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
+static bool zerocopy;

static void error(const char *s)
{
@@ -247,15 +254,76 @@ static void run_receiver(unsigned long rcvlowat_bytes)
close(fd);
}

+static void recv_completion(int fd)
+{
+ struct sock_extended_err *serr;
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ struct msghdr msg = { 0 };
+ ssize_t ret;
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret) {
+ fprintf(stderr, "recvmsg: failed to read err: %zi\n", ret);
+ return;
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ return;
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ return;
+ }
+
+ if (cm->cmsg_type) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ return;
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin\n");
+ return;
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code\n");
+ return;
+ }
+
+ if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
+ fprintf(stderr, "warning: copy instead of zerocopy\n");
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error("setsockopt(SO_ZEROCOPY)");
+}
+
static void run_sender(int peer_cid, unsigned long to_send_bytes)
{
time_t tx_begin_ns;
time_t tx_total_ns;
size_t total_send;
+ time_t time_in_send;
void *data;
int fd;

- printf("Run as sender\n");
+ if (zerocopy)
+ printf("Run as sender MSG_ZEROCOPY\n");
+ else
+ printf("Run as sender\n");
+
printf("Connect to %i:%u\n", peer_cid, port);
printf("Send %lu bytes\n", to_send_bytes);
printf("TX buffer %lu bytes\n", buf_size_bytes);
@@ -265,38 +333,82 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
if (fd < 0)
exit(EXIT_FAILURE);

- data = malloc(buf_size_bytes);
+ if (zerocopy) {
+ enable_so_zerocopy(fd);

- if (!data) {
- fprintf(stderr, "'malloc()' failed\n");
- exit(EXIT_FAILURE);
+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (data == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+ } else {
+ data = malloc(buf_size_bytes);
+
+ if (!data) {
+ fprintf(stderr, "'malloc()' failed\n");
+ exit(EXIT_FAILURE);
+ }
}

memset(data, 0, buf_size_bytes);
total_send = 0;
+ time_in_send = 0;
tx_begin_ns = current_nsec();

while (total_send < to_send_bytes) {
ssize_t sent;
+ size_t rest_bytes;
+ time_t before;

- sent = write(fd, data, buf_size_bytes);
+ rest_bytes = to_send_bytes - total_send;
+
+ before = current_nsec();
+ sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
+ buf_size_bytes : rest_bytes,
+ zerocopy ? MSG_ZEROCOPY : 0);
+ time_in_send += (current_nsec() - before);

if (sent <= 0)
error("write");

total_send += sent;
+
+ if (zerocopy) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ }
+
+ recv_completion(fd);
+ }
}

tx_total_ns = current_nsec() - tx_begin_ns;

printf("total bytes sent: %zu\n", total_send);
printf("tx performance: %f Gbits/s\n",
- get_gbps(total_send * 8, tx_total_ns));
- printf("total time in 'write()': %f sec\n",
+ get_gbps(total_send * 8, time_in_send));
+ printf("total time in tx loop: %f sec\n",
(float)tx_total_ns / NSEC_PER_SEC);
+ printf("time in 'send()': %f sec\n",
+ (float)time_in_send / NSEC_PER_SEC);

close(fd);
- free(data);
+
+ if (zerocopy)
+ munmap(data, buf_size_bytes);
+ else
+ free(data);
}

static const char optstring[] = "";
@@ -336,6 +448,11 @@ static const struct option longopts[] = {
.has_arg = required_argument,
.val = 'R',
},
+ {
+ .name = "zc",
+ .has_arg = no_argument,
+ .val = 'Z',
+ },
{},
};

@@ -351,6 +468,7 @@ static void usage(void)
" --help This message\n"
" --sender <cid> Sender mode (receiver default)\n"
" <cid> of the receiver to connect to\n"
+ " --zc Enable zerocopy\n"
" --port <port> Port (default %d)\n"
" --bytes <bytes>KMG Bytes to send (default %d)\n"
" --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
@@ -413,6 +531,9 @@ int main(int argc, char **argv)
case 'H': /* Help. */
usage();
break;
+ case 'Z': /* Zerocopy. */
+ zerocopy = true;
+ break;
default:
usage();
}
--
2.25.1


2023-07-06 17:06:22

by Stefano Garzarella

Subject: Re: [RFC PATCH v5 08/17] vsock: check for MSG_ZEROCOPY support on send

On Sat, Jul 01, 2023 at 09:39:38AM +0300, Arseniy Krasnov wrote:
>This feature totally depends on transport, so if transport doesn't
>support it, return error.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> include/net/af_vsock.h | 7 +++++++
> net/vmw_vsock/af_vsock.c | 6 ++++++
> 2 files changed, 13 insertions(+)

Reviewed-by: Stefano Garzarella <[email protected]>

>
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index 0e7504a42925..ec09edc5f3a0 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -177,6 +177,9 @@ struct vsock_transport {
>
> /* Read a single skb */
> int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
>+
>+ /* Zero-copy. */
>+ bool (*msgzerocopy_allow)(void);
> };
>
> /**** CORE ****/
>@@ -243,4 +246,8 @@ static inline void __init vsock_bpf_build_proto(void)
> {}
> #endif
>
>+static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
>+{
>+ return t->msgzerocopy_allow && t->msgzerocopy_allow();
>+}
> #endif /* __AF_VSOCK_H__ */
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index 07803d9fbf6d..033006e1b5ad 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1824,6 +1824,12 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
> goto out;
> }
>
>+ if (msg->msg_flags & MSG_ZEROCOPY &&
>+ !vsock_msgzerocopy_allow(transport)) {
>+ err = -EOPNOTSUPP;
>+ goto out;
>+ }
>+
> /* Wait for room in the produce queue to enqueue our user's data. */
> timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
>
>--
>2.25.1
>
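
The check in this patch follows a common optional-callback pattern: a transport that does not provide 'msgzerocopy_allow' is treated as not supporting the feature. Below is a userspace sketch of that pattern, not the kernel code itself; 'struct transport', 'transport_zerocopy_allowed()', 'check_send_flags()' and the two sample callbacks are hypothetical names, and the fallback MSG_ZEROCOPY definition is only there for older libc headers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Hypothetical, trimmed-down transport with one optional callback. */
struct transport {
	bool (*msgzerocopy_allow)(void);
};

static bool transport_zerocopy_allowed(const struct transport *t)
{
	assert(t);

	/* A missing callback means "not supported". */
	return t->msgzerocopy_allow && t->msgzerocopy_allow();
}

/* Reject MSG_ZEROCOPY early when the transport cannot honour it. */
static int check_send_flags(const struct transport *t, int flags)
{
	if ((flags & MSG_ZEROCOPY) && !transport_zerocopy_allowed(t))
		return -EOPNOTSUPP;

	return 0;
}

static bool always_allow(void) { return true; }
static bool never_allow(void) { return false; }
```

With this shape, existing transports need no change: leaving the pointer NULL keeps them on the copy path and makes MSG_ZEROCOPY sends fail cleanly.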


2023-07-06 17:07:38

by Stefano Garzarella

Subject: Re: [RFC PATCH v5 04/17] vsock/virtio: non-linear skb handling for tap

On Sat, Jul 01, 2023 at 09:39:34AM +0300, Arseniy Krasnov wrote:
>For the tap device, a new skb is created and data from the current skb
>is copied to it. This adds copying data from a non-linear skb to the
>new skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v4 -> v5:
> * Make 'skb' pointer constant because it is source.
>
> net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
> 1 file changed, 28 insertions(+), 3 deletions(-)

Reviewed-by: Stefano Garzarella <[email protected]>

>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index e5683af23e60..dfc48b56d0a2 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> return NULL;
> }
>
>+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>+ void *dst,
>+ size_t len)
>+{
>+ struct iov_iter iov_iter = { 0 };
>+ struct kvec kvec;
>+ size_t to_copy;
>+
>+ kvec.iov_base = dst;
>+ kvec.iov_len = len;
>+
>+ iov_iter.iter_type = ITER_KVEC;
>+ iov_iter.kvec = &kvec;
>+ iov_iter.nr_segs = 1;
>+
>+ to_copy = min_t(size_t, len, skb->len);
>+
>+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
>+ &iov_iter, to_copy);
>+}
>+
> /* Packet capture */
> static struct sk_buff *virtio_transport_build_skb(void *opaque)
> {
>@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> struct af_vsockmon_hdr *hdr;
> struct sk_buff *skb;
> size_t payload_len;
>- void *payload_buf;
>
> /* A packet could be split to fit the RX buffer, so we can retrieve
> * the payload length from the header and the buffer pointer taking
>@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> */
> pkt_hdr = virtio_vsock_hdr(pkt);
> payload_len = pkt->len;
>- payload_buf = pkt->data;
>
> skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
> GFP_ATOMIC);
>@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
>
> if (payload_len) {
>- skb_put_data(skb, payload_buf, payload_len);
>+ if (skb_is_nonlinear(pkt)) {
>+ void *data = skb_put(skb, payload_len);
>+
>+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
>+ } else {
>+ skb_put_data(skb, pkt->data, payload_len);
>+ }
> }
>
> return skb;
>--
>2.25.1
>


2023-07-06 17:15:15

by Stefano Garzarella

Subject: Re: [RFC PATCH v5 03/17] vsock/virtio: support to send non-linear skb

On Sat, Jul 01, 2023 at 09:39:33AM +0300, Arseniy Krasnov wrote:
>For non-linear skb use its pages from fragment array as buffers in
>virtio tx queue. These pages are already pinned by 'get_user_pages()'
>during such skb creation.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v4 -> v5:
> * Use 'out_sgs' variable to index 'bufs', not only 'sgs'.
> * Move smaller branch above, see 'if (!skb_is_nonlinear(skb)').
> * Remove blank line.
> * R-b from Bobby Eshleman removed due to patch update.
>
> net/vmw_vsock/virtio_transport.c | 40 +++++++++++++++++++++++++++-----
> 1 file changed, 34 insertions(+), 6 deletions(-)

Reviewed-by: Stefano Garzarella <[email protected]>

>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index e95df847176b..6cbb45bb12d2 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
> vq = vsock->vqs[VSOCK_VQ_TX];
>
> for (;;) {
>- struct scatterlist hdr, buf, *sgs[2];
>+ /* +1 is for packet header. */
>+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
>+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
> int ret, in_sg = 0, out_sg = 0;
> struct sk_buff *skb;
> bool reply;
>@@ -111,12 +113,38 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>
> virtio_transport_deliver_tap_pkt(skb);
> reply = virtio_vsock_skb_reply(skb);
>+ sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
>+ sizeof(*virtio_vsock_hdr(skb)));
>+ sgs[out_sg] = &bufs[out_sg];
>+ out_sg++;
>+
>+ if (!skb_is_nonlinear(skb)) {
>+ if (skb->len > 0) {
>+ sg_init_one(&bufs[out_sg], skb->data, skb->len);
>+ sgs[out_sg] = &bufs[out_sg];
>+ out_sg++;
>+ }
>+ } else {
>+ struct skb_shared_info *si;
>+ int i;
>+
>+ si = skb_shinfo(skb);
>+
>+ for (i = 0; i < si->nr_frags; i++) {
>+ skb_frag_t *skb_frag = &si->frags[i];
>+ void *va = page_to_virt(skb_frag->bv_page);
>
>- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
>- sgs[out_sg++] = &hdr;
>- if (skb->len > 0) {
>- sg_init_one(&buf, skb->data, skb->len);
>- sgs[out_sg++] = &buf;
>+ /* We will use 'page_to_virt()' for userspace page here,
>+ * because virtio layer will call 'virt_to_phys()' later
>+ * to fill buffer descriptor. We don't touch memory at
>+ * "virtual" address of this page.
>+ */
>+ sg_init_one(&bufs[out_sg],
>+ va + skb_frag->bv_offset,
>+ skb_frag->bv_len);
>+ sgs[out_sg] = &bufs[out_sg];
>+ out_sg++;
>+ }
> }
>
> ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
>--
>2.25.1
>
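
The buffer layout built in this patch can be modelled in userspace as follows: the header always occupies the first slot, a linear packet adds at most one payload slot, and a non-linear packet adds one slot per fragment. This is a sketch under assumptions, not the kernel code; 'struct model_pkt', 'MODEL_MAX_FRAGS' and 'model_build_out_bufs()' are hypothetical, and 'struct iovec' stands in for 'struct scatterlist'.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define MODEL_MAX_FRAGS 17	/* Illustrative bound, not MAX_SKB_FRAGS. */

struct model_pkt {
	void *hdr;
	size_t hdr_len;
	void *linear;		/* Linear payload, used when nr_frags == 0. */
	size_t linear_len;
	struct iovec frags[MODEL_MAX_FRAGS];
	size_t nr_frags;	/* Non-zero marks a "non-linear" packet. */
};

/* Fill 'out' (at least MODEL_MAX_FRAGS + 1 entries) and return the
 * number of entries used.
 */
static size_t model_build_out_bufs(const struct model_pkt *pkt,
				   struct iovec *out)
{
	size_t n = 0;
	size_t i;

	assert(pkt && out && pkt->nr_frags <= MODEL_MAX_FRAGS);

	/* The header always goes first. */
	out[n].iov_base = pkt->hdr;
	out[n].iov_len = pkt->hdr_len;
	n++;

	if (!pkt->nr_frags) {
		/* Linear packet: at most one payload buffer. */
		if (pkt->linear_len > 0) {
			out[n].iov_base = pkt->linear;
			out[n].iov_len = pkt->linear_len;
			n++;
		}
	} else {
		/* Non-linear packet: one buffer per fragment. */
		for (i = 0; i < pkt->nr_frags; i++)
			out[n++] = pkt->frags[i];
	}

	return n;
}
```

The "+ 1" in the array sizes of the patch corresponds to the header slot in this model.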


2023-07-06 17:20:25

by Stefano Garzarella

Subject: Re: [RFC PATCH v5 02/17] vhost/vsock: read data from non-linear skb

On Sat, Jul 01, 2023 at 09:39:32AM +0300, Arseniy Krasnov wrote:
>This adds copying to guest's virtio buffers from non-linear skbs. Such
>skbs are created by the protocol layer when the MSG_ZEROCOPY flag is
>used. It replaces the call to 'copy_to_iter()' with
>'skb_copy_datagram_iter()', which can read data from a non-linear skb.
>This patch also uses the 'frag_off' field from the skb control block.
>This field holds the current offset to read data from in the skb, which
>may be linear or non-linear.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v4 -> v5:
> * Use local variable for 'frag_off'.
> * Update commit message by adding some details about 'frag_off' field.
> * R-b from Bobby Eshleman removed due to patch update.

I think we should merge this patch with the previous one: vhost-vsock,
for example, uses virtio_transport_stream_do_dequeue(), which we change
in the previous commit, so keeping them separate would break bisection.

The patch LGTM!

Stefano

>
> drivers/vhost/vsock.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 6578db78f0ae..cb00e0e059e4 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> struct sk_buff *skb;
> unsigned out, in;
> size_t nbytes;
>+ u32 frag_off;
> int head;
>
> skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> }
>
> iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
>- payload_len = skb->len;
>+ frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>+ payload_len = skb->len - frag_off;
> hdr = virtio_vsock_hdr(skb);
>
> /* If the packet is greater than the space available in the
>@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> break;
> }
>
>- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>- if (nbytes != payload_len) {
>+ if (skb_copy_datagram_iter(skb,
>+ frag_off,
>+ &iov_iter,
>+ payload_len)) {
> kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt buf\n");
> break;
>@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
> added = true;
>
>- skb_pull(skb, payload_len);
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
> total_len += payload_len;
>
> /* If we didn't send all the payload we can requeue the packet
> * to send it with the next available buffer.
> */
>- if (skb->len > 0) {
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
> hdr->flags |= cpu_to_le32(flags_to_restore);
>
> /* We are queueing the same skb to handle
>--
>2.25.1
>
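
The key change in this patch is that progress through a packet is tracked with an offset ('frag_off') instead of shrinking the skb with 'skb_pull()'. A userspace sketch of that bookkeeping follows, assuming each call may send at most 'buf_space' bytes; 'struct model_skb', 'model_send_chunk()' and 'model_needs_requeue()' are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct model_skb {
	size_t len;		/* Total payload length; never modified. */
	size_t frag_off;	/* Bytes already handed to the peer. */
};

/* Returns the size of the next chunk (0 when done) and advances
 * 'frag_off' by that amount.
 */
static size_t model_send_chunk(struct model_skb *skb, size_t buf_space)
{
	size_t payload_len;

	assert(skb && skb->frag_off <= skb->len);

	payload_len = skb->len - skb->frag_off;
	if (payload_len > buf_space)
		payload_len = buf_space;

	skb->frag_off += payload_len;
	return payload_len;
}

/* The packet goes back on the queue while some payload is left. */
static int model_needs_requeue(const struct model_skb *skb)
{
	return skb->frag_off < skb->len;
}
```

Since 'len' stays constant, the same logic works whether the payload is linear or spread across fragments.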


2023-07-06 17:26:17

by Stefano Garzarella

Subject: Re: [RFC PATCH v5 13/17] vsock: enable setting SO_ZEROCOPY

On Sat, Jul 01, 2023 at 09:39:43AM +0300, Arseniy Krasnov wrote:
>For AF_VSOCK, zerocopy tx mode depends on the transport, so this option
>must be set in the AF_VSOCK implementation, where the transport is
>accessible. If the transport is not yet set when SO_ZEROCOPY is set (for
>example, the socket is not connected), SO_ZEROCOPY will be enabled, and
>support for this type of transmission will be checked once the transport
>is assigned.
>
>To handle SO_ZEROCOPY, the AF_VSOCK implementation uses the
>SOCK_CUSTOM_SOCKOPT bit and thus handles SOL_SOCKET option operations
>itself, but all of them except SO_ZEROCOPY are forwarded to the generic
>handler by calling 'sock_setsockopt()'.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v4 -> v5:
> * This patch is totally reworked. Previous version added check for
> PF_VSOCK directly to 'net/core/sock.c', thus allowing to set
> SO_ZEROCOPY for AF_VSOCK type of socket. This new version catches
> attempt to set SO_ZEROCOPY in 'af_vsock.c'. All other options
> except SO_ZEROCOPY are forwarded to generic handler. Only this
> option is processed in 'af_vsock.c'. Handling this option includes
> access to transport to check that MSG_ZEROCOPY transmission is
> supported by the current transport (if it is set, if not - transport
> will be checked during 'connect()').

Yeah, great, this is much better!

>
> net/vmw_vsock/af_vsock.c | 44 ++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index da22ae0ef477..8acc77981d01 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1406,8 +1406,18 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
> goto out;
> }
>
>- if (vsock_msgzerocopy_allow(transport))
>+ if (!vsock_msgzerocopy_allow(transport)) {

Can you leave `if (vsock_msgzerocopy_allow(transport))` and just add
the else branch with this new check?

if (vsock_msgzerocopy_allow(transport)) {
...
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
...
}
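
For illustration, the suggested branch order can be written as a small user-space model. This is a sketch under assumptions: connect_zerocopy_check() and its parameters are hypothetical names that stand in for the result of vsock_msgzerocopy_allow(), the SOCK_ZEROCOPY flag and the SOCK_SUPPORT_ZC bit.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the check in vsock_connect().
 * transport_allows_zc: result of vsock_msgzerocopy_allow(transport)
 * zerocopy_requested:  SOCK_ZEROCOPY was already set on the socket
 * support_zc:          set to true where the patch sets SOCK_SUPPORT_ZC
 * Returns 0 on success or -EOPNOTSUPP.
 */
static int connect_zerocopy_check(bool transport_allows_zc,
				  bool zerocopy_requested,
				  bool *support_zc)
{
	if (transport_allows_zc) {
		*support_zc = true;
	} else if (zerocopy_requested) {
		/* Option was set before connect(), when the transport
		 * was unknown, and this transport cannot honour it.
		 */
		return -EOPNOTSUPP;
	}

	return 0;
}
```

The positive test comes first and the error path is the 'else if', so the behaviour is the same as in the patch with one less level of nesting.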

>+ /* If this option was set before 'connect()',
>+ * when transport was unknown, check that this
>+ * feature is supported here.
>+ */
>+ if (sock_flag(sk, SOCK_ZEROCOPY)) {
>+ err = -EOPNOTSUPP;
>+ goto out;
>+ }
>+ } else {
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>+ }
>
> err = vsock_auto_bind(vsk);
> if (err)
>@@ -1643,7 +1653,7 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> const struct vsock_transport *transport;
> u64 val;
>
>- if (level != AF_VSOCK)
>+ if (level != AF_VSOCK && level != SOL_SOCKET)
> return -ENOPROTOOPT;
>
> #define COPY_IN(_v) \
>@@ -1666,6 +1676,34 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>
> transport = vsk->transport;
>
>+ if (level == SOL_SOCKET) {

We could reduce the indentation here:
if (optname != SO_ZEROCOPY) {
release_sock(sk);
return sock_setsockopt(sock, level, optname, optval, optlen);
}

Then remove the next indentation.

>+ if (optname == SO_ZEROCOPY) {
>+ int zc_val;

`zerocopy` is more readable.
>+
>+ /* Use 'int' type here, because variable to
>+ * set this option usually has this type.
>+ */
>+ COPY_IN(zc_val);
>+
>+ if (zc_val < 0 || zc_val > 1) {
>+ err = -EINVAL;
>+ goto exit;
>+ }
>+
>+ if (transport && !vsock_msgzerocopy_allow(transport)) {
>+ err = -EOPNOTSUPP;
>+ goto exit;
>+ }
>+
>+ sock_valbool_flag(sk, SOCK_ZEROCOPY,
>+ zc_val ? true : false);

Why not use `zc_val` directly?
The 3rd param of sock_valbool_flag() is an int.
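
For illustration, the three comments on this hunk (early return for other options, the name `zerocopy`, passing the int directly) could combine roughly as in the user-space model below. This is a sketch under assumptions: 'struct sock_model', 'enum sockopt_action' and sol_socket_setsockopt() are hypothetical names, and forwarding to sock_setsockopt() is represented only by a returned action.

```c
#include <errno.h>
#include <stdbool.h>

/* What the caller should do after the SOL_SOCKET branch (hypothetical). */
enum sockopt_action { HANDLED, FORWARD_TO_GENERIC };

/* Hypothetical stand-in for the socket state used by the patch. */
struct sock_model {
	bool transport_assigned;	/* vsk->transport != NULL */
	bool transport_allows_zc;	/* vsock_msgzerocopy_allow(transport) */
	bool zerocopy;			/* SOCK_ZEROCOPY flag */
};

static int sol_socket_setsockopt(struct sock_model *sk, bool is_so_zerocopy,
				 int zerocopy, enum sockopt_action *action)
{
	/* Early exit keeps the SO_ZEROCOPY path at one indentation level. */
	if (!is_so_zerocopy) {
		*action = FORWARD_TO_GENERIC;	/* patch: sock_setsockopt() */
		return 0;
	}

	*action = HANDLED;

	if (zerocopy < 0 || zerocopy > 1)
		return -EINVAL;

	if (sk->transport_assigned && !sk->transport_allows_zc)
		return -EOPNOTSUPP;

	/* The validated int (0 or 1) is used as is, no "? true : false". */
	sk->zerocopy = zerocopy;
	return 0;
}
```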

>+ goto exit;
>+ }
>+
>+ release_sock(sk);
>+ return sock_setsockopt(sock, level, optname, optval, optlen);
>+ }
>+
> switch (optname) {
> case SO_VM_SOCKETS_BUFFER_SIZE:
> COPY_IN(val);
>@@ -2321,6 +2359,8 @@ static int vsock_create(struct net *net, struct socket *sock,
> }
> }
>
>+ set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>+
> vsock_insert_unbound(vsk);
>
> return 0;
>--
>2.25.1
>


2023-07-06 17:29:20

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v5 14/17] docs: net: description of MSG_ZEROCOPY for AF_VSOCK

On Sat, Jul 01, 2023 at 09:39:44AM +0300, Arseniy Krasnov wrote:
>This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
>socket.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Documentation/networking/msg_zerocopy.rst | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
>diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
>index b3ea96af9b49..34bc7ff411ce 100644
>--- a/Documentation/networking/msg_zerocopy.rst
>+++ b/Documentation/networking/msg_zerocopy.rst
>@@ -7,7 +7,8 @@ Intro
> =====
>
> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>-The feature is currently implemented for TCP and UDP sockets.
>+The feature is currently implemented for TCP, UDP and VSOCK (with
>+virtio transport) sockets.
>
>
> Opportunity and Caveats
>@@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A notification
> is encoded in the standard error format, sock_extended_err.
>
> The level and type fields in the control data are protocol family
>-specific, IP_RECVERR or IPV6_RECVERR.
>+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).
>
> Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
> as explained before, to avoid blocking read and write system calls on
>@@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.
>
> printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
>
>+For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.

Maybe better to move this up, just under the previous change.

By the way, should we define a valid type value for vsock
(e.g. VSOCK_RECVERR)?

>
> Deferred copies
> ~~~~~~~~~~~~~~~
>@@ -235,12 +237,15 @@ Implementation
> Loopback
> --------
>
>+For TCP and UDP:
> Data sent to local sockets can be queued indefinitely if the receive
> process does not read its socket. Unbound notification latency is not
> acceptable. For this reason all packets generated with MSG_ZEROCOPY
> that are looped to a local socket will incur a deferred copy. This
> includes looping onto packet sockets (e.g., tcpdump) and tun devices.
>
>+For VSOCK:
>+Data path sent to local sockets is the same as for non-local sockets.
>
> Testing
> =======
>@@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
> namespaces, the test will not show any improvement. For testing, the
> loopback restriction can be temporarily relaxed by making
> skb_orphan_frags_rx identical to skb_orphan_frags.
>+
>+For VSOCK type of socket example can be found in tools/testing/vsock/
>+vsock_test_zerocopy.c.

For VSOCK socket, example can be found in
tools/testing/vsock/vsock_test_zerocopy.c

(we should leave the entire path on the same line)

>--
>2.25.1
>


2023-07-06 17:29:52

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v5 00/17] vsock: MSG_ZEROCOPY flag support

On Sat, Jul 01, 2023 at 09:39:30AM +0300, Arseniy Krasnov wrote:
>Hello,
>
> DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> flag will be ignored (e.g. without completion).
>
>2) Kernel uses completions from socket's error queue. Single completion
> for single tx syscall (or it can merge several completions to single
> one). I used already implemented logic for MSG_ZEROCOPY support:
> 'msg_zerocopy_realloc()' etc.
>
>Difference with copy way is not significant. During packet allocation,
>non-linear skb is created and filled with pinned user pages.
>There are also some updates for vhost and guest parts of transport - in
>both cases i've added handling of non-linear skb for virtio part. vhost
>copies data from such skb to the guest's rx virtio buffers. In the guest,
>virtio transport fills tx virtio queue with pages from skb.
>
>Head of this patchset is:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
>This version has several limits/problems (all resolved at v5):
>
>1) As this feature totally depends on transport, there is no way (or it
> is difficult) to check whether transport is able to handle it or not
> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> are not considered to be called from each other. So in current version
> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> tx routine will fail with EOPNOTSUPP.
>
> ^^^ fixed in v5. Thanks to Bobby Eshleman.
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> one completion. Each completion carries a flag which shows how the tx
> was performed: zerocopy or copy. This means that the whole message must
> be sent either in zerocopy or in copy mode - we can't send part of the
> message with copying and the rest with zerocopy (or vice versa). Now,
> we need to account for vsock credit logic, i.e. we can't send all data
> at once - only the allowed number of bytes can be sent at any moment.
> In copy mode there is no problem, as in the worst case we can send
> single bytes, but zerocopy is more complex because the smallest
> transmission unit is a single page. So if there is not enough space at
> the peer's side to send an integer number of pages (at least one), we
> will wait, thus stalling the tx side. To overcome this problem I've
> added a simple rule - zerocopy is possible only when there is enough
> space at the other side for the whole message (to check that the
> current 'msghdr' was already used in previous tx iterations I use the
> 'iov_offset' field of its iov iter).
>
> ^^^
> Discussed as ok during v2. Link:
> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
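
For illustration, the rule from item 2 reduces to a small predicate. This is a sketch under assumptions: can_send_zerocopy() and its parameter names are hypothetical, and the real check in the patchset may take more state into account.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the rule from item 2.
 * msg_len:         bytes requested by this tx syscall
 * peer_free_space: credit currently available at the peer
 * iov_offset:      non-zero if part of this msghdr was already consumed
 *                  by a previous tx iteration
 */
static bool can_send_zerocopy(size_t msg_len, size_t peer_free_space,
			      size_t iov_offset)
{
	/* A message that was already partially sent stays in copy mode,
	 * because one completion describes the whole syscall.
	 */
	if (iov_offset != 0)
		return false;

	/* Zerocopy only if the whole message fits into the peer's space. */
	return msg_len <= peer_free_space;
}
```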
>
>3) loopback transport is not supported, because it requires to implement
> non-linear skb handling in dequeue logic (as we "send" fragged skb
> and "receive" it from the same queue). I'm going to implement it in
> next versions.
>
> ^^^ fixed in v2
>
>4) Current implementation sets max length of packet to 64KB. IIUC this
> is due to 'kmalloc()' allocated data buffers. I think, in case of
> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> not touched for data - user space pages are used as buffers. Also
> this limit trims every message which is > 64KB, thus such messages
> will be send in copy mode due to 'iov_offset' check in 2).
>
> ^^^ fixed in v2
>
> PATCHSET STRUCTURE
>
>Patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
> PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In zerocopy mode we need to wait until
>the user buffers are released by the kernel, so it is like a synchronous
>path (wait until the device driver processes them), while in copy mode we
>can feed as much data to the kernel as we want, without caring about the
>device driver. So I compared only the time spent in the 'send()' syscall.
>Combining this value with the total number of transmitted bytes gives a
>Gbit/s figure. Also, to avoid tx stalls due to insufficient credit, the
>receiver allocates the same amount of space as the sender needs.
>
>Sender:
>./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>I run tests on two setups: desktop with Core i7 - I use this PC for
>development and in this case guest is nested guest, and host is normal
>guest. Another hardware is some embedded board with Atom - here I don't
>have nested virtualization - host runs on hw, and guest is normal guest.
>
>G2H transmission (values are Gbit/s):
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |        |           |   |          |        |           |
>| buf size |  copy  | zerocopy  |   | buf size |  copy  | zerocopy  |
>|          |        |           |   |          |        |           |
>*-------------------------------*   *-------------------------------*
>|   4KB    |   3    |    10     |   |   4KB    |  0.8   |    1.9    |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   20   |    61     |   |   32KB   |  6.8   |   20.2    |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   33   |    244    |   |  256KB   |  7.8   |    55     |
>*-------------------------------*   *-------------------------------*
>|    1M    |   30   |    373    |   |    1M    |   7    |    95     |
>*-------------------------------*   *-------------------------------*
>|    8M    |   22   |    475    |   |    8M    |   7    |    114    |
>*-------------------------------*   *-------------------------------*
>
>H2G:
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |        |           |   |          |        |           |
>| buf size |  copy  | zerocopy  |   | buf size |  copy  | zerocopy  |
>|          |        |           |   |          |        |           |
>*-------------------------------*   *-------------------------------*
>|   4KB    |   20   |    10     |   |   4KB    |  4.37  |     3     |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   37   |    75     |   |   32KB   |   11   |    18     |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   44   |    299    |   |  256KB   |   11   |    62     |
>*-------------------------------*   *-------------------------------*
>|    1M    |   28   |    335    |   |    1M    |   9    |    77     |
>*-------------------------------*   *-------------------------------*
>|    8M    |   27   |    417    |   |    8M    |  9.35  |    115    |
>*-------------------------------*   *-------------------------------*
>
> * Let's look at the first line of both tables - where copy is better
> than zerocopy. I analyzed this case more deeply and found that the
> bottleneck is the function 'vhost_work_queue()'. With 4K buffer size,
> the caller spends too much time in it in zerocopy mode (compared to
> copy mode). This happens only with 4K buffer size. This function just
> calls 'wake_up_process()' and its internal logic does not depend on
> the skb, so I think a potential reason (maybe) is the interval between
> two calls of this function (i.e. how often it is called). Note that
> 'vhost_work_queue()' differs from the same function at the guest's side
> of the transport: 'virtio_transport_send_pkt()' uses 'queue_work()',
> which I think is more optimized for worker purposes than a direct call
> to 'wake_up_process()'. But again - this is just my assumption.
>
>Loopback:
>
>   Core i7 with nested guest.            Atom with normal guest.
>
>*-------------------------------*   *-------------------------------*
>|          |        |           |   |          |        |           |
>| buf size |  copy  | zerocopy  |   | buf size |  copy  | zerocopy  |
>|          |        |           |   |          |        |           |
>*-------------------------------*   *-------------------------------*
>|   4KB    |   8    |     7     |   |   4KB    |  1.8   |    1.3    |
>*-------------------------------*   *-------------------------------*
>|   32KB   |   38   |    44     |   |   32KB   |   10   |    10     |
>*-------------------------------*   *-------------------------------*
>|  256KB   |   55   |    168    |   |  256KB   |   15   |    36     |
>*-------------------------------*   *-------------------------------*
>|    1M    |   53   |    250    |   |    1M    |   12   |    45     |
>*-------------------------------*   *-------------------------------*
>|    8M    |   40   |    344    |   |    8M    |   11   |    74     |
>*-------------------------------*   *-------------------------------*
>
>I analyzed the performance difference more deeply for the following setup:
>server: ./vsock_perf --vsk-size 16M
>client: ./vsock_perf --sender 2 --bytes 16M --buf-size 64K/4K [--zc]
>
>In other words I send 16M of data from guest to host in copy/zerocopy
>modes and with two different sizes of buffer - 4K and 64K. Let's look
>at the tx path for both modes - it consists of two steps:
>
>copy:
>1) Allocate skb of buffer's length.
>2) Copy data to skb from buffer.
>
>zerocopy:
>1) Allocate skb with header space only.
>2) Pin pages of the buffer and insert them to skb.
>
>I measured average number of ns (returned by 'ktime_get()') for each
>step above:
>1) Skb allocation (for both copy and zerocopy modes).
>2) For copy mode in 'memcpy_to_msg()' - copying.
>3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>
>Here are results for copy mode:
>*-------------------------------------*
>| buf | skb alloc | 'memcpy_to_msg()' |
>*-------------------------------------*
>|     |           |                   |
>| 64K |  5000ns   |      25000ns      |
>|     |           |                   |
>*-------------------------------------*
>|     |           |                   |
>| 4K  |   800ns   |      2200ns       |
>|     |           |                   |
>*-------------------------------------*
>
>Here are results for zerocopy mode:
>*-----------------------------------------------*
>| buf | skb alloc | '__zerocopy_sg_from_iter()' |
>*-----------------------------------------------*
>|     |           |                             |
>| 64K |   250ns   |           3500ns            |
>|     |           |                             |
>*-----------------------------------------------*
>|     |           |                             |
>| 4K  |   250ns   |           3000ns            |
>|     |           |                             |
>*-----------------------------------------------*
>
>I guess the reason for zerocopy's performance is the low overhead of page
>pinning: there is a big difference between 4K and 64K in case of copying
>(2200 vs 25000), but in the pinning case just 3000 vs 3500.
>
>So, zerocopy is faster than classic copy mode, but of course it requires
>a specific application architecture due to user page pinning, buffer
>size and alignment.
>
> NOTES
>
>If the host fails to send data with "Cannot allocate memory", check the
>value of /proc/sys/net/core/optmem_max - it is accounted during
>completion skb allocation. Try raising it, for example to 1M, and send
>again: "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>
> TESTING
>
>This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>cover new code as much as possible so there are different cases for
>MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>vector types (different sizes, alignments, with unmapped pages). I also
>run tests with loopback transport and run vsockmon. In v3 i've added
>io_uring test as separated application.
>
> LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>
>In v3 Stefano Garzarella <[email protected]> asked to split this patchset
>for several parts, because it looks too big for review. I think in this
>version (v4) we can do it in the following way:
>
>[0001 - 0005] - this is preparation for virtio/vhost part.
>[0006 - 0009] - this is preparation for AF_VSOCK part.
>[0010 - 0014] - these patches allows to trigger logic from the previous
> two parts. In addition 0014 is patch for Documentation.
>[0015 - rest] - updates for tests, utils. This part doesn't touch kernel
> code and looks not critical.

Great!

So IIUC all the issues are fixed. I left some comments, but I think
you can start sending the virtio/vhost preparation patches to net-next
(when it re-opens).

I just pointed out something to fix, and that maybe we can merge
the first 2 patches.

I think you can restart with v0, describing in the cover letter that
the patches were part of this RFC.

Thanks,
Stefano


2023-07-06 17:30:34

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v5 11/17] vsock/virtio: support MSG_ZEROCOPY for transport

On Sat, Jul 01, 2023 at 09:39:41AM +0300, Arseniy Krasnov wrote:
>Add 'msgzerocopy_allow()' callback for virtio transport.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v4 -> v5:
> * Move 'msgzerocopy_allow' right after seqpacket callbacks.
>
> net/vmw_vsock/virtio_transport.c | 7 +++++++
> 1 file changed, 7 insertions(+)

Reviewed-by: Stefano Garzarella <[email protected]>

>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 6cbb45bb12d2..8d3e9f441fa1 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -441,6 +441,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
>+static bool virtio_transport_msgzerocopy_allow(void)
>+{
>+ return true;
>+}
>+
> static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>
> static struct virtio_transport virtio_transport = {
>@@ -474,6 +479,8 @@ static struct virtio_transport virtio_transport = {
> .seqpacket_allow = virtio_transport_seqpacket_allow,
> .seqpacket_has_data = virtio_transport_seqpacket_has_data,
>
>+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,
>+
> .notify_poll_in = virtio_transport_notify_poll_in,
> .notify_poll_out = virtio_transport_notify_poll_out,
> .notify_recv_init = virtio_transport_notify_recv_init,
>--
>2.25.1
>


2023-07-07 05:11:30

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v5 00/17] vsock: MSG_ZEROCOPY flag support



On 06.07.2023 20:07, Stefano Garzarella wrote:
> On Sat, Jul 01, 2023 at 09:39:30AM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>> flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>> for single tx syscall (or it can merge several completions to single
>> one). I used already implemented logic for MSG_ZEROCOPY support:
>> 'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created and filled with pinned user pages.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> Head of this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>>
>> This version has several limits/problems (all resolved at v5):
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>> is difficult) to check whether transport is able to handle it or not
>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>> are not considered to be called from each other. So in current version
>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>> tx routine will fail with EOPNOTSUPP.
>>
>> ^^^ fixed in v5. Thanks to Bobby Eshleman.
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>> one completion. In each completion there is flag which shows how tx
>> was performed: zerocopy or copy. This leads that whole message must
>> be send in zerocopy or copy way - we can't send part of message with
>> copying and rest of message with zerocopy mode (or vice versa). Now,
>> we need to account vsock credit logic, e.g. we can't send whole data
>> once - only allowed number of bytes could sent at any moment. In case
>> of copying way there is no problem as in worst case we can send single
>> bytes, but zerocopy is more complex because smallest transmission
>> unit is single page. So if there is not enough space at peer's side
>> to send integer number of pages (at least one) - we will wait, thus
>> stalling tx side. To overcome this problem i've added simple rule -
>> zerocopy is possible only when there is enough space at another side
>> for whole message (to check, that current 'msghdr' was already used
>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>> ^^^
>> Discussed as ok during v2. Link:
>> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>>
>> 3) loopback transport is not supported, because it requires to implement
>> non-linear skb handling in dequeue logic (as we "send" fragged skb
>> and "receive" it from the same queue). I'm going to implement it in
>> next versions.
>>
>> ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>> is due to 'kmalloc()' allocated data buffers. I think, in case of
>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>> not touched for data - user space pages are used as buffers. Also
>> this limit trims every message which is > 64KB, thus such messages
>> will be send in copy mode due to 'iov_offset' check in 2).
>>
>> ^^^ fixed in v2
>>
>> PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>> PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it is like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> I run tests on two setups: desktop with Core i7 - I use this PC for
>> development and in this case guest is nested guest, and host is normal
>> guest. Another hardware is some embedded board with Atom - here I don't
>> have nested virtualization - host runs on hw, and guest is normal guest.
>>
>> G2H transmission (values are Gbit/s):
>>
>> Core i7 with nested guest. Atom with normal guest.
>>
>> *-------------------------------* *-------------------------------*
>> | | | | | | | |
>> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
>> | | | | | | | |
>> *-------------------------------* *-------------------------------*
>> | 4KB | 3 | 10 | | 4KB | 0.8 | 1.9 |
>> *-------------------------------* *-------------------------------*
>> | 32KB | 20 | 61 | | 32KB | 6.8 | 20.2 |
>> *-------------------------------* *-------------------------------*
>> | 256KB | 33 | 244 | | 256KB | 7.8 | 55 |
>> *-------------------------------* *-------------------------------*
>> | 1M | 30 | 373 | | 1M | 7 | 95 |
>> *-------------------------------* *-------------------------------*
>> | 8M | 22 | 475 | | 8M | 7 | 114 |
>> *-------------------------------* *-------------------------------*
>>
>> H2G:
>>
>> Core i7 with nested guest. Atom with normal guest.
>>
>> *-------------------------------* *-------------------------------*
>> | | | | | | | |
>> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
>> | | | | | | | |
>> *-------------------------------* *-------------------------------*
>> | 4KB | 20 | 10 | | 4KB | 4.37 | 3 |
>> *-------------------------------* *-------------------------------*
>> | 32KB | 37 | 75 | | 32KB | 11 | 18 |
>> *-------------------------------* *-------------------------------*
>> | 256KB | 44 | 299 | | 256KB | 11 | 62 |
>> *-------------------------------* *-------------------------------*
>> | 1M | 28 | 335 | | 1M | 9 | 77 |
>> *-------------------------------* *-------------------------------*
>> | 8M | 27 | 417 | | 8M | 9.35 | 115 |
>> *-------------------------------* *-------------------------------*
>>
>> * Let's look to the first line of both tables - where copy is better
>> than zerocopy. I analyzed this case more deeply and found that
>> bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
>> caller spends too much time in it with zerocopy mode (comparing to
>> copy mode). This happens only with 4K buffer size. This function just
>> calls 'wake_up_process()' and its internal logic does not depends on
>> skb, so i think potential reason (may be) is interval between two
>> calls of this function (e.g. how often it is called). Note, that
>> 'vhost_work_queue()' differs from the same function at guest's side of
>> transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
>> i think is more optimized for worker purposes, than direct call to
>> 'wake_up_process()'. But again - this is just my assumption.
>>
>> Loopback:
>>
>> Core i7 with nested guest. Atom with normal guest.
>>
>> *-------------------------------* *-------------------------------*
>> | | | | | | | |
>> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
>> | | | | | | | |
>> *-------------------------------* *-------------------------------*
>> | 4KB | 8 | 7 | | 4KB | 1.8 | 1.3 |
>> *-------------------------------* *-------------------------------*
>> | 32KB | 38 | 44 | | 32KB | 10 | 10 |
>> *-------------------------------* *-------------------------------*
>> | 256KB | 55 | 168 | | 256KB | 15 | 36 |
>> *-------------------------------* *-------------------------------*
>> | 1M | 53 | 250 | | 1M | 12 | 45 |
>> *-------------------------------* *-------------------------------*
>> | 8M | 40 | 344 | | 8M | 11 | 74 |
>> *-------------------------------* *-------------------------------*
>>
>> I analyzed performace difference more deeply for the following setup:
>> server: ./vsock_perf --vsk-size 16M
>> client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>>
>> In other words I send 16M of data from guest to host in copy/zerocopy
>> modes and with two different sizes of buffer - 4K and 64K. Let's see
>> to tx path for both modes - it consists of two steps:
>>
>> copy:
>> 1) Allocate skb of buffer's length.
>> 2) Copy data to skb from buffer.
>>
>> zerocopy:
>> 1) Allocate skb with header space only.
>> 2) Pin pages of the buffer and insert them to skb.
>>
>> I measured average number of ns (returned by 'ktime_get()') for each
>> step above:
>> 1) Skb allocation (for both copy and zerocopy modes).
>> 2) For copy mode in 'memcpy_to_msg()' - copying.
>> 3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>>
>> Here are results for copy mode:
>> *-------------------------------------*
>> | buf | skb alloc | 'memcpy_to_msg()' |
>> *-------------------------------------*
>> | | | |
>> | 64K | 5000ns | 25000ns |
>> | | | |
>> *-------------------------------------*
>> | | | |
>> | 4K | 800ns | 2200ns |
>> | | | |
>> *-------------------------------------*
>>
>> Here are results for zerocopy mode:
>> *-----------------------------------------------*
>> | buf | skb alloc | '__zerocopy_sg_from_iter()' |
>> *-----------------------------------------------*
>> | | | |
>> | 64K | 250ns | 3500ns |
>> | | | |
>> *-----------------------------------------------*
>> | | | |
>> | 4K | 250ns | 3000ns |
>> | | | |
>> *-----------------------------------------------*
>>
>> I guess that reason of zerocopy performance is low overhead for page
>> pinning: there is big difference between 4K and 64K in case of copying
>> (25000 vs 2200), but in pinning case - just 3000 vs 3500.
>>
>> So, zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment.
>>
>> NOTES
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation. Try to update it to for example 1M and try send again:
>> "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>>
>> TESTING
>>
>> This patchset includes a set of tests for the MSG_ZEROCOPY feature. I tried
>> to cover new code as much as possible, so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages). I also
>> ran the tests with loopback transport and ran vsockmon. In v3 I've added
>> an io_uring test as a separate application.
>>
>> LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>
>> In v3 Stefano Garzarella <[email protected]> asked to split this patchset
>> into several parts, because it looks too big for review. I think in this
>> version (v4) we can do it in the following way:
>>
>> [0001 - 0005] - this is preparation for virtio/vhost part.
>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>> [0010 - 0014] - these patches allow triggering the logic from the previous
>> two parts. In addition, 0014 is the patch for Documentation.
>> [0015 - rest] - updates for tests, utils. This part doesn't touch kernel
>> code and looks not critical.
>
> Great!

Thanks for the review! All comments are clear to me.

>
> So IIUC all the issues are fixed. I left some comments, but I think
> you can start sending the virtio/vhost preparation patches to net-next
> (when it re-opens).
>
> I just pointed out something to fix, and that maybe we can merge
> the first 2 patches.
>
> I think you can restart with v0, describing in the cover letter that
> the patches were part of this RFC.

Ok, I'll fix comments and send 0001-0005 (with first two merged) in a single
net-next patchset!

Thanks, Arseniy

>
> Thanks,
> Stefano
>
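
The per-step averages quoted in the cover letter can be added up to compare the two modes for each buffer size. What follows is only a small arithmetic sketch in C: the numbers are copied from the two tables, while the struct and function names are made up for this illustration. It covers skb construction only (allocation plus copy or pinning), not completion handling.

```c
#include <stdio.h>

/* Average per-step timings in nanoseconds, copied from the tables in the
 * quoted cover letter. "fill_ns" is the copy step in copy mode and the
 * page pinning step in zerocopy mode. All names here are illustrative.
 */
struct tx_cost {
	unsigned int buf_kib;
	unsigned int alloc_ns;
	unsigned int fill_ns;
};

const struct tx_cost copy_mode[] = {
	{ 64, 5000, 25000 },
	{  4,  800,  2200 },
};

const struct tx_cost zerocopy_mode[] = {
	{ 64, 250, 3500 },
	{  4, 250, 3000 },
};

/* Total time spent building one skb: allocation plus copy or pinning. */
unsigned int tx_total_ns(const struct tx_cost *c)
{
	return c->alloc_ns + c->fill_ns;
}

/* Print copy vs zerocopy totals for each measured buffer size. */
void print_tx_totals(void)
{
	for (int i = 0; i < 2; i++)
		printf("%2uK: copy %5u ns, zerocopy %4u ns\n",
		       copy_mode[i].buf_kib,
		       tx_total_ns(&copy_mode[i]),
		       tx_total_ns(&zerocopy_mode[i]));
}
```

Taking the quoted averages at face value, the 64K case totals 30000 ns for copy against 3750 ns for zerocopy, while the 4K case totals 3000 ns against 3250 ns, so the gain in these measurements comes from large buffers.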

2023-07-07 05:14:35

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v5 14/17] docs: net: description of MSG_ZEROCOPY for AF_VSOCK



On 06.07.2023 20:06, Stefano Garzarella wrote:
> On Sat, Jul 01, 2023 at 09:39:44AM +0300, Arseniy Krasnov wrote:
>> This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
>> socket.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> Documentation/networking/msg_zerocopy.rst | 12 ++++++++++--
>> 1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
>> index b3ea96af9b49..34bc7ff411ce 100644
>> --- a/Documentation/networking/msg_zerocopy.rst
>> +++ b/Documentation/networking/msg_zerocopy.rst
>> @@ -7,7 +7,8 @@ Intro
>> =====
>>
>> The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
>> -The feature is currently implemented for TCP and UDP sockets.
>> +The feature is currently implemented for TCP, UDP and VSOCK (with
>> +virtio transport) sockets.
>>
>>
>> Opportunity and Caveats
>> @@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A notification
>> is encoded in the standard error format, sock_extended_err.
>>
>> The level and type fields in the control data are protocol family
>> -specific, IP_RECVERR or IPV6_RECVERR.
>> +specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).
>>
>> Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
>> as explained before, to avoid blocking read and write system calls on
>> @@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.
>>
>> printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
>>
>> +For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.
>
> Maybe better to move up, just under the previous change.
>
> By the way, should we define a valid type value for vsock
> (e.g. VSOCK_RECVERR)?

Yes, I think so. I'll add it in the same patch that adds SOL_VSOCK.

Thanks, Arseniy

>
>>
>> Deferred copies
>> ~~~~~~~~~~~~~~~
>> @@ -235,12 +237,15 @@ Implementation
>> Loopback
>> --------
>>
>> +For TCP and UDP:
>> Data sent to local sockets can be queued indefinitely if the receive
>> process does not read its socket. Unbound notification latency is not
>> acceptable. For this reason all packets generated with MSG_ZEROCOPY
>> that are looped to a local socket will incur a deferred copy. This
>> includes looping onto packet sockets (e.g., tcpdump) and tun devices.
>>
>> +For VSOCK:
>> +Data path sent to local sockets is the same as for non-local sockets.
>>
>> Testing
>> =======
>> @@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
>> namespaces, the test will not show any improvement. For testing, the
>> loopback restriction can be temporarily relaxed by making
>> skb_orphan_frags_rx identical to skb_orphan_frags.
>> +
>> +For VSOCK type of socket example can be found in tools/testing/vsock/
>> +vsock_test_zerocopy.c.
>
> For VSOCK socket, example can be found in
> tools/testing/vsock/vsock_test_zerocopy.c
>
> (we should leave the entire path on the same line)
>
>> --
>> 2.25.1
>>
>

2023-07-12 22:52:47

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v5 13/17] vsock: enable setting SO_ZEROCOPY

On Sat, Jul 01, 2023 at 09:39:43AM +0300, Arseniy Krasnov wrote:
> For AF_VSOCK, zerocopy tx mode depends on transport, so this option must
> be set in AF_VSOCK implementation where transport is accessible (if
> transport is not set during setting SO_ZEROCOPY: for example socket is
> not connected, then SO_ZEROCOPY will be enabled, but once transport will
> be assigned, support of this type of transmission will be checked).
>
> To handle SO_ZEROCOPY, AF_VSOCK implementation uses SOCK_CUSTOM_SOCKOPT
> bit, thus handling SOL_SOCKET option operations, but all of them except
> SO_ZEROCOPY will be forwarded to the generic handler by calling
> 'sock_setsockopt()'.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>
> ---
> Changelog:
> v4 -> v5:
> * This patch is totally reworked. Previous version added check for
> PF_VSOCK directly to 'net/core/sock.c', thus allowing to set
> SO_ZEROCOPY for AF_VSOCK type of socket. This new version catches
> attempt to set SO_ZEROCOPY in 'af_vsock.c'. All other options
> except SO_ZEROCOPY are forwarded to generic handler. Only this
> option is processed in 'af_vsock.c'. Handling this option includes
> access to transport to check that MSG_ZEROCOPY transmission is
> supported by the current transport (if it is set, if not - transport
> will be checked during 'connect()').
>
> net/vmw_vsock/af_vsock.c | 44 ++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index da22ae0ef477..8acc77981d01 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1406,8 +1406,18 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
> goto out;
> }
>
> - if (vsock_msgzerocopy_allow(transport))
> + if (!vsock_msgzerocopy_allow(transport)) {
> + /* If this option was set before 'connect()',
> + * when transport was unknown, check that this
> + * feature is supported here.
> + */
> + if (sock_flag(sk, SOCK_ZEROCOPY)) {
> + err = -EOPNOTSUPP;
> + goto out;
> + }
> + } else {
> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
> + }
>
> err = vsock_auto_bind(vsk);
> if (err)
> @@ -1643,7 +1653,7 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> const struct vsock_transport *transport;
> u64 val;
>
> - if (level != AF_VSOCK)
> + if (level != AF_VSOCK && level != SOL_SOCKET)
> return -ENOPROTOOPT;
>
> #define COPY_IN(_v) \
> @@ -1666,6 +1676,34 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>
> transport = vsk->transport;
>
> + if (level == SOL_SOCKET) {
> + if (optname == SO_ZEROCOPY) {
> + int zc_val;
> +
> + /* Use 'int' type here, because variable to
> + * set this option usually has this type.
> + */
> + COPY_IN(zc_val);
> +
> + if (zc_val < 0 || zc_val > 1) {
> + err = -EINVAL;
> + goto exit;
> + }
> +
> + if (transport && !vsock_msgzerocopy_allow(transport)) {
> + err = -EOPNOTSUPP;
> + goto exit;
> + }
> +
> + sock_valbool_flag(sk, SOCK_ZEROCOPY,
> + zc_val ? true : false);
> + goto exit;
> + }
> +
> + release_sock(sk);
> + return sock_setsockopt(sock, level, optname, optval, optlen);
> + }
> +
> switch (optname) {
> case SO_VM_SOCKETS_BUFFER_SIZE:
> COPY_IN(val);
> @@ -2321,6 +2359,8 @@ static int vsock_create(struct net *net, struct socket *sock,
> }
> }
>
> + set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
> +

I found that because datagrams have !ops->setsockopt this bit causes
setsockopt() to fail (the related logic can be found in
__sys_setsockopt). Maybe we should only set this for connectibles?

Best,
Bobby

> vsock_insert_unbound(vsk);
>
> return 0;
> --
> 2.25.1
>
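
The order of checks in the patch above (validate the value, then consult the transport if one is already assigned, then re-check at connect time) may be easier to follow as a small userspace model. This is only a sketch, not kernel code: the struct and the model_* functions are invented for illustration, with 'transport_zc' standing in for the result of vsock_msgzerocopy_allow().

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace model of the checks in the patch above; not kernel code.
 * 'transport_known' mirrors "vsk->transport != NULL" and 'transport_zc'
 * mirrors the result of vsock_msgzerocopy_allow(transport).
 */
struct vsock_model {
	bool transport_known;
	bool transport_zc;
	bool zerocopy_flag;	/* models SOCK_ZEROCOPY */
};

/* Models the SO_ZEROCOPY branch of vsock_connectible_setsockopt(). */
int model_set_zerocopy(struct vsock_model *s, int val)
{
	if (val < 0 || val > 1)
		return -EINVAL;
	/* Only checked when a transport is already assigned. */
	if (s->transport_known && !s->transport_zc)
		return -EOPNOTSUPP;
	s->zerocopy_flag = val;
	return 0;
}

/* Models the check added to vsock_connect() once a transport is assigned. */
int model_connect(struct vsock_model *s, bool transport_zc)
{
	s->transport_known = true;
	s->transport_zc = transport_zc;
	/* The option may have been set earlier, when transport was unknown. */
	if (!transport_zc && s->zerocopy_flag)
		return -EOPNOTSUPP;
	return 0;
}
```

In this model, setting the option before connect always succeeds for a valid value, and the failure is deferred to connect if the chosen transport turns out not to support zerocopy.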

2023-07-13 05:23:28

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v5 13/17] vsock: enable setting SO_ZEROCOPY



On 13.07.2023 01:31, Bobby Eshleman wrote:
> On Sat, Jul 01, 2023 at 09:39:43AM +0300, Arseniy Krasnov wrote:
>> For AF_VSOCK, zerocopy tx mode depends on transport, so this option must
>> be set in AF_VSOCK implementation where transport is accessible (if
>> transport is not set during setting SO_ZEROCOPY: for example socket is
>> not connected, then SO_ZEROCOPY will be enabled, but once transport will
>> be assigned, support of this type of transmission will be checked).
>>
>> To handle SO_ZEROCOPY, AF_VSOCK implementation uses SOCK_CUSTOM_SOCKOPT
>> bit, thus handling SOL_SOCKET option operations, but all of them except
>> SO_ZEROCOPY will be forwarded to the generic handler by calling
>> 'sock_setsockopt()'.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> Changelog:
>> v4 -> v5:
>> * This patch is totally reworked. Previous version added check for
>> PF_VSOCK directly to 'net/core/sock.c', thus allowing to set
>> SO_ZEROCOPY for AF_VSOCK type of socket. This new version catches
>> attempt to set SO_ZEROCOPY in 'af_vsock.c'. All other options
>> except SO_ZEROCOPY are forwarded to generic handler. Only this
>> option is processed in 'af_vsock.c'. Handling this option includes
>> access to transport to check that MSG_ZEROCOPY transmission is
>> supported by the current transport (if it is set, if not - transport
>> will be checked during 'connect()').
>>
>> net/vmw_vsock/af_vsock.c | 44 ++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index da22ae0ef477..8acc77981d01 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -1406,8 +1406,18 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
>> goto out;
>> }
>>
>> - if (vsock_msgzerocopy_allow(transport))
>> + if (!vsock_msgzerocopy_allow(transport)) {
>> + /* If this option was set before 'connect()',
>> + * when transport was unknown, check that this
>> + * feature is supported here.
>> + */
>> + if (sock_flag(sk, SOCK_ZEROCOPY)) {
>> + err = -EOPNOTSUPP;
>> + goto out;
>> + }
>> + } else {
>> set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
>> + }
>>
>> err = vsock_auto_bind(vsk);
>> if (err)
>> @@ -1643,7 +1653,7 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>> const struct vsock_transport *transport;
>> u64 val;
>>
>> - if (level != AF_VSOCK)
>> + if (level != AF_VSOCK && level != SOL_SOCKET)
>> return -ENOPROTOOPT;
>>
>> #define COPY_IN(_v) \
>> @@ -1666,6 +1676,34 @@ static int vsock_connectible_setsockopt(struct socket *sock,
>>
>> transport = vsk->transport;
>>
>> + if (level == SOL_SOCKET) {
>> + if (optname == SO_ZEROCOPY) {
>> + int zc_val;
>> +
>> + /* Use 'int' type here, because variable to
>> + * set this option usually has this type.
>> + */
>> + COPY_IN(zc_val);
>> +
>> + if (zc_val < 0 || zc_val > 1) {
>> + err = -EINVAL;
>> + goto exit;
>> + }
>> +
>> + if (transport && !vsock_msgzerocopy_allow(transport)) {
>> + err = -EOPNOTSUPP;
>> + goto exit;
>> + }
>> +
>> + sock_valbool_flag(sk, SOCK_ZEROCOPY,
>> + zc_val ? true : false);
>> + goto exit;
>> + }
>> +
>> + release_sock(sk);
>> + return sock_setsockopt(sock, level, optname, optval, optlen);
>> + }
>> +
>> switch (optname) {
>> case SO_VM_SOCKETS_BUFFER_SIZE:
>> COPY_IN(val);
>> @@ -2321,6 +2359,8 @@ static int vsock_create(struct net *net, struct socket *sock,
>> }
>> }
>>
>> + set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
>> +
>
> I found that because datagrams have !ops->setsockopt this bit causes
> setsockopt() to fail (the related logic can be found in
> __sys_setsockopt). Maybe we should only set this for connectibles?

Agreed! I'll add this check in the next version.

Thanks, Arseniy

>
> Best,
> Bobby
>
>> vsock_insert_unbound(vsk);
>>
>> return 0;
>> --
>> 2.25.1
>>
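
From userspace, the behaviour discussed in this subthread shows up as the return value of setsockopt(). A minimal sketch of a caller that tries to enable SO_ZEROCOPY and falls back to plain copy sends otherwise is given below. The helper names are hypothetical, and the fallback constants are the asm-generic UAPI values, which is an assumption for toolchains whose headers do not define them.

```c
#include <errno.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60		/* asm-generic value; some architectures differ */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

/* Try to enable SO_ZEROCOPY on 'fd'.
 * Returns 0 on success or a negative errno value, for example -EOPNOTSUPP
 * when the already assigned transport cannot do zerocopy transmission.
 */
int enable_zerocopy(int fd)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -errno;
	return 0;
}

/* Pick send() flags: MSG_ZEROCOPY if the option could be enabled, else 0
 * so that the caller falls back to ordinary copy mode.
 */
int tx_flags(int fd)
{
	return enable_zerocopy(fd) == 0 ? MSG_ZEROCOPY : 0;
}
```

With the patch as posted, a socket that sets the option before connect() would also need to handle -EOPNOTSUPP from connect() itself, since that is where the transport is checked.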