2023-06-03 20:56:16

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

Hello,

DESCRIPTION

this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
current implementation for TCP as much as possible:

1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
flag will be ignored (e.g. without completion).

2) Kernel uses completions from socket's error queue. Single completion
for single tx syscall (or it can merge several completions to single
one). I used already implemented logic for MSG_ZEROCOPY support:
'msg_zerocopy_realloc()' etc.

Difference with copy way is not significant. During packet allocation,
non-linear skb is created and filled with pinned user pages.
There are also some updates for vhost and guest parts of transport - in
both cases i've added handling of non-linear skb for virtio part. vhost
copies data from such skb to the guest's rx virtio buffers. In the guest,
virtio transport fills tx virtio queue with pages from skb.

Head of this patchset is:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b


This version has several limits/problems:

1) As this feature totally depends on transport, there is no way (or it
is difficult) to check whether transport is able to handle it or not
during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
setsockopt callback from setsockopt callback for SOL_SOCKET, but this
leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
are not considered to be called from each other. So in current version
SO_ZEROCOPY is set successfully to any type (e.g. transport) of
AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
tx routine will fail with EOPNOTSUPP.

^^^
This is still no resolved :(

2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
one completion. In each completion there is flag which shows how tx
was performed: zerocopy or copy. This leads that whole message must
be send in zerocopy or copy way - we can't send part of message with
copying and rest of message with zerocopy mode (or vice versa). Now,
we need to account vsock credit logic, e.g. we can't send whole data
once - only allowed number of bytes could sent at any moment. In case
of copying way there is no problem as in worst case we can send single
bytes, but zerocopy is more complex because smallest transmission
unit is single page. So if there is not enough space at peer's side
to send integer number of pages (at least one) - we will wait, thus
stalling tx side. To overcome this problem i've added simple rule -
zerocopy is possible only when there is enough space at another side
for whole message (to check, that current 'msghdr' was already used
in previous tx iterations i use 'iov_offset' field of it's iov iter).

^^^
Discussed as ok during v2. Link:
https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/

3) loopback transport is not supported, because it requires to implement
non-linear skb handling in dequeue logic (as we "send" fragged skb
and "receive" it from the same queue). I'm going to implement it in
next versions.

^^^ fixed in v2

4) Current implementation sets max length of packet to 64KB. IIUC this
is due to 'kmalloc()' allocated data buffers. I think, in case of
MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
not touched for data - user space pages are used as buffers. Also
this limit trims every message which is > 64KB, thus such messages
will be send in copy mode due to 'iov_offset' check in 2).

^^^ fixed in v2

PATCHSET STRUCTURE

Patchset has the following structure:
1) Handle non-linear skbuff on receive in virtio/vhost.
2) Handle non-linear skbuff on send in virtio/vhost.
3) Updates for AF_VSOCK.
4) Enable MSG_ZEROCOPY support on transports.
5) Tests/tools/docs updates.

PERFORMANCE

Performance: it is a little bit tricky to compare performance between
copy and zerocopy transmissions. In zerocopy way we need to wait when
user buffers will be released by kernel, so it is like synchronous
path (wait until device driver will process it), while in copy way we
can feed data to kernel as many as we want, don't care about device
driver. So I compared only time which we spend in the 'send()' syscall.
Then if this value will be combined with total number of transmitted
bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
enough credit, receiver allocates same amount of space as sender needs.

Sender:
./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]

Receiver:
./vsock_perf --vsk-size 256M

I run tests on two setups: desktop with Core i7 - I use this PC for
development and in this case guest is nested guest, and host is normal
guest. Another hardware is some embedded board with Atom - here I don't
have nested virtualization - host runs on hw, and guest is normal guest.

G2H transmission (values are Gbit/s):

Core i7 with nested guest. Atom with normal guest.

*-------------------------------* *-------------------------------*
| | | | | | | |
| buf size | copy | zerocopy | | buf size | copy | zerocopy |
| | | | | | | |
*-------------------------------* *-------------------------------*
| 4KB | 3 | 10 | | 4KB | 0.8 | 1.9 |
*-------------------------------* *-------------------------------*
| 32KB | 20 | 61 | | 32KB | 6.8 | 20.2 |
*-------------------------------* *-------------------------------*
| 256KB | 33 | 244 | | 256KB | 7.8 | 55 |
*-------------------------------* *-------------------------------*
| 1M | 30 | 373 | | 1M | 7 | 95 |
*-------------------------------* *-------------------------------*
| 8M | 22 | 475 | | 8M | 7 | 114 |
*-------------------------------* *-------------------------------*

H2G:

Core i7 with nested guest. Atom with normal guest.

*-------------------------------* *-------------------------------*
| | | | | | | |
| buf size | copy | zerocopy | | buf size | copy | zerocopy |
| | | | | | | |
*-------------------------------* *-------------------------------*
| 4KB | 20 | 10 | | 4KB | 4.37 | 3 |
*-------------------------------* *-------------------------------*
| 32KB | 37 | 75 | | 32KB | 11 | 18 |
*-------------------------------* *-------------------------------*
| 256KB | 44 | 299 | | 256KB | 11 | 62 |
*-------------------------------* *-------------------------------*
| 1M | 28 | 335 | | 1M | 9 | 77 |
*-------------------------------* *-------------------------------*
| 8M | 27 | 417 | | 8M | 9.35 | 115 |
*-------------------------------* *-------------------------------*

* Let's look to the first line of both tables - where copy is better
than zerocopy. I analyzed this case more deeply and found that
bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
caller spends too much time in it with zerocopy mode (comparing to
copy mode). This happens only with 4K buffer size. This function just
calls 'wake_up_process()' and its internal logic does not depends on
skb, so i think potential reason (may be) is interval between two
calls of this function (e.g. how often it is called). Note, that
'vhost_work_queue()' differs from the same function at guest's side of
transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
i think is more optimized for worker purposes, than direct call to
'wake_up_process()'. But again - this is just my assumption.

Loopback:

Core i7 with nested guest. Atom with normal guest.

*-------------------------------* *-------------------------------*
| | | | | | | |
| buf size | copy | zerocopy | | buf size | copy | zerocopy |
| | | | | | | |
*-------------------------------* *-------------------------------*
| 4KB | 8 | 7 | | 4KB | 1.8 | 1.3 |
*-------------------------------* *-------------------------------*
| 32KB | 38 | 44 | | 32KB | 10 | 10 |
*-------------------------------* *-------------------------------*
| 256KB | 55 | 168 | | 256KB | 15 | 36 |
*-------------------------------* *-------------------------------*
| 1M | 53 | 250 | | 1M | 12 | 45 |
*-------------------------------* *-------------------------------*
| 8M | 40 | 344 | | 8M | 11 | 74 |
*-------------------------------* *-------------------------------*

I analyzed performace difference more deeply for the following setup:
server: ./vsock_perf --vsk-size 16M
client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]

In other words I send 16M of data from guest to host in copy/zerocopy
modes and with two different sizes of buffer - 4K and 64K. Let's see
to tx path for both modes - it consists of two steps:

copy:
1) Allocate skb of buffer's length.
2) Copy data to skb from buffer.

zerocopy:
1) Allocate skb with header space only.
2) Pin pages of the buffer and insert them to skb.

I measured average number of ns (returned by 'ktime_get()') for each
step above:
1) Skb allocation (for both copy and zerocopy modes).
2) For copy mode in 'memcpy_to_msg()' - copying.
3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.

Here are results for copy mode:
*-------------------------------------*
| buf | skb alloc | 'memcpy_to_msg()' |
*-------------------------------------*
| | | |
| 64K | 5000ns | 25000ns |
| | | |
*-------------------------------------*
| | | |
| 4K | 800ns | 2200ns |
| | | |
*-------------------------------------*

Here are results for zerocopy mode:
*-----------------------------------------------*
| buf | skb alloc | '__zerocopy_sg_from_iter()' |
*-----------------------------------------------*
| | | |
| 64K | 250ns | 3500ns |
| | | |
*-----------------------------------------------*
| | | |
| 4K | 250ns | 3000ns |
| | | |
*-----------------------------------------------*

I guess that reason of zerocopy performance is low overhead for page
pinning: there is big difference between 4K and 64K in case of copying
(25000 vs 2200), but in pinning case - just 3000 vs 3500.

So, zerocopy is faster than classic copy mode, but of course it requires
specific architecture of application due to user pages pinning, buffer
size and alignment.

NOTES

If host fails to send data with "Cannot allocate memory", check value
/proc/sys/net/core/optmem_max - it is accounted during completion skb
allocation. Try to update it to for example 1M and try send again:
"echo 1048576 > /proc/sys/net/core/optmem_max" (as root).

TESTING

This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
cover new code as much as possible so there are different cases for
MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
vector types (different sizes, alignments, with unmapped pages). I also
run tests with loopback transport and run vsockmon. In v3 i've added
io_uring test as separated application.

LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER

In v3 Stefano Garzarella <[email protected]> asked to split this patchset
for several parts, because it looks too big for review. I think in this
version (v4) we can do it in the following way:

[0001 - 0005] - this is preparation for virtio/vhost part.
[0006 - 0009] - this is preparation for AF_VSOCK part.
[0010 - 0013] - these patches allows to trigger logic from the previous
two parts.
[0014 - rest] - updates for doc, tests, utils. This part doesn't touch
kernel code and looks not critical.

Thanks, Arseniy

Link to v1:
https://lore.kernel.org/netdev/[email protected]/
Link to v2:
https://lore.kernel.org/netdev/[email protected]/
Link to v3:
https://lore.kernel.org/netdev/[email protected]/

Changelog:
v1 -> v2:
- Replace 'get_user_pages()' with 'pin_user_pages()'.
- Loopback transport support.

v2 -> v3
- Use 'get_user_pages()' instead of 'pin_user_pages()'. I think this
is right approach, because i'm using '__zerocopy_sg_from_iter()'
function. It is already implemented and used by io_uring zerocopy
tx logic to 'pin' pages of user's buffer.

- Use 'skb_copy_datagram_iter()' to copy data from both linear and
non-linear skb to user's iov iter. It already has support for copying
data from paged part of skb (by calling 'kmap()'). In v2 i used my
own "from scratch" implemented function. With this and previous thing
I significantly reduced LOC number in kernel part.

- Add io_uring test for AF_VSOCK. It is implemented as separated util,
because it depends on liburing (i think there is no need to link
'vsock_test' with liburing, because io_uring functionality depends
on environment - both in kernel and userspace).

- Values from PERFORMANCE section are updated for all transports, but
I didn't found any significant difference with v2.

- More details in commit messages.

v3 -> v4:
- Requirement for buffers to have page aligned base and size is removed,
because virtio can handle such buffers.

- Crash with SOCK_SEQPACKET is fixed. This is done by setting owner of
new 'skb' before passing it to '__zerocopy_sg_from_iter()'. Last one
dereferences owner of the passed skb without any checks (it was NULL).

- Type of "owning" of the newly created skb is also changed: in v3 and
before it was 'skb_set_owner_sk_safe()'. I replace it with this one:
'skb_set_owner_w()'. This is because '__zerocopy_sg_from_iter()'
increments 'sk_wmem_alloc' of socket which owns skb, thus we need a
proper destructor which decrements it back - it is 'sock_wfree()'.
This destructor is set by 'skb_set_owner_w()'. Otherwise we get leak
of resource - such socket will be never deallocated.

- Use ITER_KVEC instead of ITER_IOVEC when skb is copied to another one
for passing to TAP device. Reason of this update is that ITER_IOVEC
considered as userspace memory, while we have only kernel memory here.

Arseniy Krasnov (17):
vsock/virtio: read data from non-linear skb
vhost/vsock: read data from non-linear skb
vsock/virtio: support to send non-linear skb
vsock/virtio: non-linear skb handling for tap
vsock/virtio: MSG_ZEROCOPY flag support
vsock: check error queue to set EPOLLERR
vsock: read from socket's error queue
vsock: check for MSG_ZEROCOPY support on send
vsock: enable SOCK_SUPPORT_ZC bit
vhost/vsock: support MSG_ZEROCOPY for transport
vsock/virtio: support MSG_ZEROCOPY for transport
vsock/loopback: support MSG_ZEROCOPY for transport
net/sock: enable setting SO_ZEROCOPY for PF_VSOCK
docs: net: description of MSG_ZEROCOPY for AF_VSOCK
test/vsock: MSG_ZEROCOPY flag tests
test/vsock: MSG_ZEROCOPY support for vsock_perf
test/vsock: io_uring rx/tx tests

Documentation/networking/msg_zerocopy.rst | 12 +-
drivers/vhost/vsock.c | 18 +-
include/linux/socket.h | 1 +
include/linux/virtio_vsock.h | 1 +
include/net/af_vsock.h | 7 +
net/core/sock.c | 4 +-
net/vmw_vsock/af_vsock.c | 19 +-
net/vmw_vsock/virtio_transport.c | 44 ++-
net/vmw_vsock/virtio_transport_common.c | 317 ++++++++++++++++-----
net/vmw_vsock/vsock_loopback.c | 8 +
tools/testing/vsock/Makefile | 9 +-
tools/testing/vsock/util.c | 218 +++++++++++++++
tools/testing/vsock/util.h | 18 ++
tools/testing/vsock/vsock_perf.c | 139 +++++++++-
tools/testing/vsock/vsock_test.c | 16 ++
tools/testing/vsock/vsock_test_zerocopy.c | 312 +++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 15 +
tools/testing/vsock/vsock_uring_test.c | 321 ++++++++++++++++++++++
18 files changed, 1385 insertions(+), 94 deletions(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h
create mode 100644 tools/testing/vsock/vsock_uring_test.c

--
2.25.1



2023-06-03 20:56:20

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 10/17] vhost/vsock: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for vhost transport.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index b254aa4b756a..318866713ef7 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -396,6 +396,11 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
return val < vq->num;
}

+static bool vhost_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool vhost_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport vhost_transport = {
@@ -442,6 +447,7 @@ static struct virtio_transport vhost_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+ .msgzerocopy_allow = vhost_transport_msgzerocopy_allow,
},

.send_pkt = vhost_transport_send_pkt,
--
2.25.1


2023-06-03 20:56:34

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 12/17] vsock/loopback: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for loopback transport.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/vsock_loopback.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 5c6360df1f31..a2e4aeda2d92 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -47,6 +47,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
}

static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_msgzerocopy_allow(void);

static struct virtio_transport loopback_transport = {
.transport = {
@@ -92,11 +93,18 @@ static struct virtio_transport loopback_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+
+ .msgzerocopy_allow = vsock_loopback_msgzerocopy_allow,
},

.send_pkt = vsock_loopback_send_pkt,
};

+static bool vsock_loopback_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
{
return true;
--
2.25.1


2023-06-03 20:56:35

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 11/17] vsock/virtio: support MSG_ZEROCOPY for transport

Add 'msgzerocopy_allow()' callback for virtio transport.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 6053d8341091..d9ffa16dda69 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -438,6 +438,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+static bool virtio_transport_msgzerocopy_allow(void)
+{
+ return true;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport virtio_transport = {
@@ -484,6 +489,8 @@ static struct virtio_transport virtio_transport = {
.notify_buffer_size = virtio_transport_notify_buffer_size,

.read_skb = virtio_transport_read_skb,
+
+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,
},

.send_pkt = virtio_transport_send_pkt,
--
2.25.1


2023-06-03 20:56:46

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 13/17] net/sock: enable setting SO_ZEROCOPY for PF_VSOCK

PF_VSOCK supports MSG_ZEROCOPY transmission, so SO_ZEROCOPY could
be enabled. PF_VSOCK implementation is a little bit special comparing to
PF_INET - MSG_ZEROCOPY support depends on transport layer of PF_VSOCK,
but here we can't "ask" its transport, so setting of this option is
always allowed, but if some transport doesn't support zerocopy tx, send
callback of PF_VSOCK will return -EOPNOTSUPP.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/core/sock.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 5440e67bcfe3..d558e541e6d7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1452,9 +1452,11 @@ int sk_setsockopt(struct sock *sk, int level, int optname,
(sk->sk_type == SOCK_DGRAM &&
sk->sk_protocol == IPPROTO_UDP)))
ret = -EOPNOTSUPP;
- } else if (sk->sk_family != PF_RDS) {
+ } else if (sk->sk_family != PF_RDS &&
+ sk->sk_family != PF_VSOCK) {
ret = -EOPNOTSUPP;
}
+
if (!ret) {
if (val < 0 || val > 1)
ret = -EINVAL;
--
2.25.1


2023-06-03 20:57:08

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 17/17] test/vsock: io_uring rx/tx tests

This adds set of tests which use io_uring for rx/tx. This test suite is
implemented as separated util like 'vsock_test' and has the same set of
input arguments as 'vsock_test'. These tests only cover cases of data
transmission (no connect/bind/accept etc).

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 7 +-
tools/testing/vsock/vsock_uring_test.c | 321 +++++++++++++++++++++++++
2 files changed, 327 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/vsock/vsock_uring_test.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 0a78787d1d92..8621ae73051d 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,12 +1,17 @@
# SPDX-License-Identifier: GPL-2.0-only
+ifeq ($(MAKECMDGOALS),vsock_uring_test)
+LDFLAGS = -luring
+endif
+
all: test vsock_perf
test: vsock_test vsock_diag_test
vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
vsock_perf: vsock_perf.o
+vsock_uring_test: control.o util.o vsock_uring_test.o timeout.o $(LDFLAGS)

CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
.PHONY: all test clean
clean:
- ${RM} *.o *.d vsock_test vsock_diag_test
+ ${RM} *.o *.d vsock_test vsock_diag_test vsock_uring_test
-include *.d
diff --git a/tools/testing/vsock/vsock_uring_test.c b/tools/testing/vsock/vsock_uring_test.c
new file mode 100644
index 000000000000..7637ff510490
--- /dev/null
+++ b/tools/testing/vsock/vsock_uring_test.c
@@ -0,0 +1,321 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* io_uring tests for vsock
+ *
+ * Copyright (C) 2023 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <liburing.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <linux/kernel.h>
+#include <error.h>
+
+#include "util.h"
+#include "control.h"
+
+#define PAGE_SIZE 4096
+#define RING_ENTRIES_NUM 4
+
+static struct vsock_test_data test_data_array[] = {
+ /* All elements have page aligned base and size. */
+ {
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, 2 * PAGE_SIZE },
+ { NULL, 3 * PAGE_SIZE },
+ }
+ },
+ /* Middle element has both non-page aligned base and size. */
+ {
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { (void *)1, 200 },
+ { NULL, 3 * PAGE_SIZE },
+ }
+ }
+};
+
+static void vsock_io_uring_client(const struct test_opts *opts,
+ const struct vsock_test_data *test_data,
+ bool msg_zerocopy)
+{
+ struct io_uring_sqe *sqe;
+ struct io_uring_cqe *cqe;
+ struct io_uring ring;
+ struct iovec *iovec;
+ struct msghdr msg;
+ int fd;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (msg_zerocopy)
+ enable_so_zerocopy(fd);
+
+ iovec = iovec_from_test_data(test_data);
+
+ if (io_uring_queue_init(RING_ENTRIES_NUM, &ring, 0))
+ error(1, errno, "io_uring_queue_init");
+
+ if (io_uring_register_buffers(&ring, iovec, test_data->vecs_cnt))
+ error(1, errno, "io_uring_register_buffers");
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = iovec;
+ msg.msg_iovlen = test_data->vecs_cnt;
+ sqe = io_uring_get_sqe(&ring);
+
+ if (msg_zerocopy)
+ io_uring_prep_sendmsg_zc(sqe, fd, &msg, 0);
+ else
+ io_uring_prep_sendmsg(sqe, fd, &msg, 0);
+
+ if (io_uring_submit(&ring) != 1)
+ error(1, errno, "io_uring_submit");
+
+ if (io_uring_wait_cqe(&ring, &cqe))
+ error(1, errno, "io_uring_wait_cqe");
+
+ io_uring_cqe_seen(&ring, cqe);
+
+ control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
+
+ control_writeln("DONE");
+ io_uring_queue_exit(&ring);
+ free_iovec_test_data(test_data, iovec);
+ close(fd);
+}
+
+static void vsock_io_uring_server(const struct test_opts *opts,
+ const struct vsock_test_data *test_data)
+{
+ unsigned long remote_hash;
+ unsigned long local_hash;
+ struct io_uring_sqe *sqe;
+ struct io_uring_cqe *cqe;
+ struct io_uring ring;
+ struct iovec iovec;
+ size_t data_len;
+ void *data;
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ data_len = iovec_bytes(test_data->vecs, test_data->vecs_cnt);
+
+ data = malloc(data_len);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ if (io_uring_queue_init(RING_ENTRIES_NUM, &ring, 0))
+ error(1, errno, "io_uring_queue_init");
+
+ sqe = io_uring_get_sqe(&ring);
+ iovec.iov_base = data;
+ iovec.iov_len = data_len;
+
+ io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
+
+ if (io_uring_submit(&ring) != 1)
+ error(1, errno, "io_uring_submit");
+
+ if (io_uring_wait_cqe(&ring, &cqe))
+ error(1, errno, "io_uring_wait_cqe");
+
+ if (cqe->res != data_len) {
+ fprintf(stderr, "expected %zu, got %u\n", data_len,
+ cqe->res);
+ exit(EXIT_FAILURE);
+ }
+
+ local_hash = hash_djb2(data, data_len);
+
+ remote_hash = control_readulong();
+ if (remote_hash != local_hash) {
+ fprintf(stderr, "hash mismatch\n");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ io_uring_queue_exit(&ring);
+ free(data);
+}
+
+void test_stream_uring_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ vsock_io_uring_server(opts, &test_data_array[i]);
+}
+
+void test_stream_uring_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ vsock_io_uring_client(opts, &test_data_array[i], false);
+}
+
+void test_stream_uring_msg_zc_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ vsock_io_uring_server(opts, &test_data_array[i]);
+}
+
+void test_stream_uring_msg_zc_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ vsock_io_uring_client(opts, &test_data_array[i], true);
+}
+
+static struct test_case test_cases[] = {
+ {
+ .name = "SOCK_STREAM io_uring test",
+ .run_server = test_stream_uring_server,
+ .run_client = test_stream_uring_client,
+ },
+ {
+ .name = "SOCK_STREAM io_uring MSG_ZEROCOPY test",
+ .run_server = test_stream_uring_msg_zc_server,
+ .run_client = test_stream_uring_msg_zc_client,
+ },
+ {},
+};
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+ {
+ .name = "control-host",
+ .has_arg = required_argument,
+ .val = 'H',
+ },
+ {
+ .name = "control-port",
+ .has_arg = required_argument,
+ .val = 'P',
+ },
+ {
+ .name = "mode",
+ .has_arg = required_argument,
+ .val = 'm',
+ },
+ {
+ .name = "peer-cid",
+ .has_arg = required_argument,
+ .val = 'p',
+ },
+ {
+ .name = "help",
+ .has_arg = no_argument,
+ .val = '?',
+ },
+ {},
+};
+
+static void usage(void)
+{
+ fprintf(stderr, "Usage: vsock_uring_test [--help] [--control-host=<host>] --control-port=<port> --mode=client|server --peer-cid=<cid>\n"
+ "\n"
+ " Server: vsock_uring_test --control-port=1234 --mode=server --peer-cid=3\n"
+ " Client: vsock_uring_test --control-host=192.168.0.1 --control-port=1234 --mode=client --peer-cid=2\n"
+ "\n"
+ "Run transmission tests using io_uring. Usage is the same as\n"
+ "in ./vsock_test\n"
+ "\n"
+ "Options:\n"
+ " --help This help message\n"
+ " --control-host <host> Server IP address to connect to\n"
+ " --control-port <port> Server port to listen on/connect to\n"
+ " --mode client|server Server or client mode\n"
+ " --peer-cid <cid> CID of the other side\n"
+ );
+ exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+ const char *control_host = NULL;
+ const char *control_port = NULL;
+ struct test_opts opts = {
+ .mode = TEST_MODE_UNSET,
+ .peer_cid = VMADDR_CID_ANY,
+ };
+
+ init_signals();
+
+ for (;;) {
+ int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+ if (opt == -1)
+ break;
+
+ switch (opt) {
+ case 'H':
+ control_host = optarg;
+ break;
+ case 'm':
+ if (strcmp(optarg, "client") == 0) {
+ opts.mode = TEST_MODE_CLIENT;
+ } else if (strcmp(optarg, "server") == 0) {
+ opts.mode = TEST_MODE_SERVER;
+ } else {
+ fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+ return EXIT_FAILURE;
+ }
+ break;
+ case 'p':
+ opts.peer_cid = parse_cid(optarg);
+ break;
+ case 'P':
+ control_port = optarg;
+ break;
+ case '?':
+ default:
+ usage();
+ }
+ }
+
+ if (!control_port)
+ usage();
+ if (opts.mode == TEST_MODE_UNSET)
+ usage();
+ if (opts.peer_cid == VMADDR_CID_ANY)
+ usage();
+
+ if (!control_host) {
+ if (opts.mode != TEST_MODE_SERVER)
+ usage();
+ control_host = "0.0.0.0";
+ }
+
+ control_init(control_host, control_port,
+ opts.mode == TEST_MODE_SERVER);
+
+ run_tests(test_cases, &opts);
+
+ control_cleanup();
+
+ return 0;
+}
--
2.25.1


2023-06-03 20:57:12

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 14/17] docs: net: description of MSG_ZEROCOPY for AF_VSOCK

This adds description of MSG_ZEROCOPY flag support for AF_VSOCK type of
socket.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Documentation/networking/msg_zerocopy.rst | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
index b3ea96af9b49..34bc7ff411ce 100644
--- a/Documentation/networking/msg_zerocopy.rst
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -7,7 +7,8 @@ Intro
=====

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
-The feature is currently implemented for TCP and UDP sockets.
+The feature is currently implemented for TCP, UDP and VSOCK (with
+virtio transport) sockets.


Opportunity and Caveats
@@ -174,7 +175,7 @@ read_notification() call in the previous snippet. A notification
is encoded in the standard error format, sock_extended_err.

The level and type fields in the control data are protocol family
-specific, IP_RECVERR or IPV6_RECVERR.
+specific, IP_RECVERR or IPV6_RECVERR (for TCP or UDP socket).

Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
as explained before, to avoid blocking read and write system calls on
@@ -201,6 +202,7 @@ undefined, bar for ee_code, as discussed below.

printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);

+For VSOCK socket, cmsg_level will be SOL_VSOCK and cmsg_type will be 0.

Deferred copies
~~~~~~~~~~~~~~~
@@ -235,12 +237,15 @@ Implementation
Loopback
--------

+For TCP and UDP:
Data sent to local sockets can be queued indefinitely if the receive
process does not read its socket. Unbound notification latency is not
acceptable. For this reason all packets generated with MSG_ZEROCOPY
that are looped to a local socket will incur a deferred copy. This
includes looping onto packet sockets (e.g., tcpdump) and tun devices.

+For VSOCK:
+Data path sent to local sockets is the same as for non-local sockets.

Testing
=======
@@ -254,3 +259,6 @@ instance when run with msg_zerocopy.sh between a veth pair across
namespaces, the test will not show any improvement. For testing, the
loopback restriction can be temporarily relaxed by making
skb_orphan_frags_rx identical to skb_orphan_frags.
+
+For VSOCK type of socket example can be found in tools/testing/vsock/
+vsock_test_zerocopy.c.
--
2.25.1


2023-06-03 20:57:23

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 06/17] vsock: check error queue to set EPOLLERR

If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
reader of error queue won't detect data in it using EPOLLERR bit.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/af_vsock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index efb8a0937a13..45fd20c4ed50 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1030,7 +1030,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
poll_wait(file, sk_sleep(sk), wait);
mask = 0;

- if (sk->sk_err)
+ if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
/* Signify that there has been an error on this socket. */
mask |= EPOLLERR;

--
2.25.1


2023-06-03 20:57:23

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 16/17] test/vsock: MSG_ZEROCOPY support for vsock_perf

To use this option pass '--zc' parameter:

./vsock_perf --zc --sender <cid> --port <port> --bytes <bytes to send>

With this option MSG_ZEROCOPY flag will be passed to the 'send()' call.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/vsock_perf.c | 139 +++++++++++++++++++++++++++++--
1 file changed, 130 insertions(+), 9 deletions(-)

diff --git a/tools/testing/vsock/vsock_perf.c b/tools/testing/vsock/vsock_perf.c
index a72520338f84..7fd76f7a3c16 100644
--- a/tools/testing/vsock/vsock_perf.c
+++ b/tools/testing/vsock/vsock_perf.c
@@ -18,6 +18,8 @@
#include <poll.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>

#define DEFAULT_BUF_SIZE_BYTES (128 * 1024)
#define DEFAULT_TO_SEND_BYTES (64 * 1024)
@@ -28,9 +30,14 @@
#define BYTES_PER_GB (1024 * 1024 * 1024ULL)
#define NSEC_PER_SEC (1000000000ULL)

+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
static unsigned int port = DEFAULT_PORT;
static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
+static bool zerocopy;

static void error(const char *s)
{
@@ -247,15 +254,76 @@ static void run_receiver(unsigned long rcvlowat_bytes)
close(fd);
}

+static void recv_completion(int fd)
+{
+ struct sock_extended_err *serr;
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ struct msghdr msg = { 0 };
+ ssize_t ret;
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret) {
+ fprintf(stderr, "recvmsg: failed to read err: %zi\n", ret);
+ return;
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ return;
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ return;
+ }
+
+ if (cm->cmsg_type) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ return;
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin\n");
+ return;
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code\n");
+ return;
+ }
+
+ if (zerocopy && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED))
+ fprintf(stderr, "warning: copy instead of zerocopy\n");
+}
+
+static void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val)))
+ error("setsockopt(SO_ZEROCOPY)");
+}
+
static void run_sender(int peer_cid, unsigned long to_send_bytes)
{
time_t tx_begin_ns;
time_t tx_total_ns;
size_t total_send;
+ time_t time_in_send;
void *data;
int fd;

- printf("Run as sender\n");
+ if (zerocopy)
+ printf("Run as sender MSG_ZEROCOPY\n");
+ else
+ printf("Run as sender\n");
+
printf("Connect to %i:%u\n", peer_cid, port);
printf("Send %lu bytes\n", to_send_bytes);
printf("TX buffer %lu bytes\n", buf_size_bytes);
@@ -265,38 +333,82 @@ static void run_sender(int peer_cid, unsigned long to_send_bytes)
if (fd < 0)
exit(EXIT_FAILURE);

- data = malloc(buf_size_bytes);
+ if (zerocopy) {
+ enable_so_zerocopy(fd);

- if (!data) {
- fprintf(stderr, "'malloc()' failed\n");
- exit(EXIT_FAILURE);
+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+ if (data == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+ } else {
+ data = malloc(buf_size_bytes);
+
+ if (!data) {
+ fprintf(stderr, "'malloc()' failed\n");
+ exit(EXIT_FAILURE);
+ }
}

memset(data, 0, buf_size_bytes);
total_send = 0;
+ time_in_send = 0;
tx_begin_ns = current_nsec();

while (total_send < to_send_bytes) {
ssize_t sent;
+ size_t rest_bytes;
+ time_t before;

- sent = write(fd, data, buf_size_bytes);
+ rest_bytes = to_send_bytes - total_send;
+
+ before = current_nsec();
+ sent = send(fd, data, (rest_bytes > buf_size_bytes) ?
+ buf_size_bytes : rest_bytes,
+ zerocopy ? MSG_ZEROCOPY : 0);
+ time_in_send += (current_nsec() - before);

if (sent <= 0)
error("write");

total_send += sent;
+
+ if (zerocopy) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ }
+
+ recv_completion(fd);
+ }
}

tx_total_ns = current_nsec() - tx_begin_ns;

printf("total bytes sent: %zu\n", total_send);
printf("tx performance: %f Gbits/s\n",
- get_gbps(total_send * 8, tx_total_ns));
- printf("total time in 'write()': %f sec\n",
+ get_gbps(total_send * 8, time_in_send));
+ printf("total time in tx loop: %f sec\n",
(float)tx_total_ns / NSEC_PER_SEC);
+ printf("time in 'send()': %f sec\n",
+ (float)time_in_send / NSEC_PER_SEC);

close(fd);
- free(data);
+
+ if (zerocopy)
+ munmap(data, buf_size_bytes);
+ else
+ free(data);
}

static const char optstring[] = "";
@@ -336,6 +448,11 @@ static const struct option longopts[] = {
.has_arg = required_argument,
.val = 'R',
},
+ {
+ .name = "zc",
+ .has_arg = no_argument,
+ .val = 'Z',
+ },
{},
};

@@ -351,6 +468,7 @@ static void usage(void)
" --help This message\n"
" --sender <cid> Sender mode (receiver default)\n"
" <cid> of the receiver to connect to\n"
+ " --zc Enable zerocopy\n"
" --port <port> Port (default %d)\n"
" --bytes <bytes>KMG Bytes to send (default %d)\n"
" --buf-size <bytes>KMG Data buffer size (default %d). In sender mode\n"
@@ -413,6 +531,9 @@ int main(int argc, char **argv)
case 'H': /* Help. */
usage();
break;
+ case 'Z': /* Zerocopy. */
+ zerocopy = true;
+ break;
default:
usage();
}
--
2.25.1


2023-06-03 20:57:29

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 15/17] test/vsock: MSG_ZEROCOPY flag tests

This adds three tests for MSG_ZEROCOPY feature:
1) SOCK_STREAM tx with different buffers.
1) SOCK_SEQPACKET tx with different buffers.
1) SOCK_STREAM test to read empty error queue of the socket.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.c | 218 +++++++++++++++
tools/testing/vsock/util.h | 18 ++
tools/testing/vsock/vsock_test.c | 16 ++
tools/testing/vsock/vsock_test_zerocopy.c | 312 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 15 ++
6 files changed, 580 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 43a254f0e14d..0a78787d1d92 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
all: test vsock_perf
test: vsock_test vsock_diag_test
-vsock_test: vsock_test.o timeout.o control.o util.o
+vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
vsock_perf: vsock_perf.o

diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 01b636d3039a..81bc7fdafe03 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -11,15 +11,23 @@
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
+#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <assert.h>
#include <sys/epoll.h>
+#include <sys/mman.h>
+#include <linux/errqueue.h>
+#include <poll.h>

#include "timeout.h"
#include "control.h"
#include "util.h"

+#ifndef SOL_VSOCK
+#define SOL_VSOCK 287
+#endif
+
/* Install signal handlers */
void init_signals(void)
{
@@ -408,3 +416,213 @@ unsigned long hash_djb2(const void *data, size_t len)

return hash;
}
+
+void enable_so_zerocopy(int fd)
+{
+ int val = 1;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &val, sizeof(val))) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void *mmap_no_fail(size_t bytes)
+{
+ void *res;
+
+ res = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
+ MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
+ if (res == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ return res;
+}
+
+size_t iovec_bytes(const struct iovec *iov, size_t iovnum)
+{
+ size_t bytes;
+ int i;
+
+ for (bytes = 0, i = 0; i < iovnum; i++)
+ bytes += iov[i].iov_len;
+
+ return bytes;
+}
+
+static void iovec_random_init(struct iovec *iov,
+ const struct vsock_test_data *test_data)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ int j;
+
+ if (test_data->vecs[i].iov_base == MAP_FAILED)
+ continue;
+
+ for (j = 0; j < iov[i].iov_len; j++)
+ ((uint8_t *)iov[i].iov_base)[j] = rand() & 0xff;
+ }
+}
+
+unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum)
+{
+ unsigned long hash;
+ size_t iov_bytes;
+ size_t offs;
+ void *tmp;
+ int i;
+
+ iov_bytes = iovec_bytes(iov, iovnum);
+
+ tmp = malloc(iov_bytes);
+ if (!tmp) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (offs = 0, i = 0; i < iovnum; i++) {
+ memcpy(tmp + offs, iov[i].iov_base, iov[i].iov_len);
+ offs += iov[i].iov_len;
+ }
+
+ hash = hash_djb2(tmp, iov_bytes);
+ free(tmp);
+
+ return hash;
+}
+
+struct iovec *iovec_from_test_data(const struct vsock_test_data *test_data)
+{
+ const struct iovec *test_iovec;
+ struct iovec *iovec;
+ int i;
+
+ iovec = malloc(sizeof(*iovec) * test_data->vecs_cnt);
+ if (!iovec) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ test_iovec = test_data->vecs;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ iovec[i].iov_len = test_iovec[i].iov_len;
+ iovec[i].iov_base = mmap_no_fail(test_iovec[i].iov_len);
+
+ if (test_iovec[i].iov_base != MAP_FAILED &&
+ test_iovec[i].iov_base)
+ iovec[i].iov_base += (uintptr_t)test_iovec[i].iov_base;
+ }
+
+ /* Unmap "invalid" elements. */
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_iovec[i].iov_base == MAP_FAILED) {
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ iovec_random_init(iovec, test_data);
+
+ return iovec;
+}
+
+void free_iovec_test_data(const struct vsock_test_data *test_data,
+ struct iovec *iovec)
+{
+ int i;
+
+ for (i = 0; i < test_data->vecs_cnt; i++) {
+ if (test_data->vecs[i].iov_base != MAP_FAILED) {
+ if (test_data->vecs[i].iov_base)
+ iovec[i].iov_base -= (uintptr_t)test_data->vecs[i].iov_base;
+
+ if (munmap(iovec[i].iov_base, iovec[i].iov_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ free(iovec);
+}
+
+#define POLL_TIMEOUT_MS 100
+void vsock_recv_completion(int fd, bool zerocopied, bool completion)
+{
+ struct sock_extended_err *serr;
+ struct msghdr msg = { 0 };
+ struct pollfd fds = { 0 };
+ char cmsg_data[128];
+ struct cmsghdr *cm;
+ ssize_t res;
+
+ fds.fd = fd;
+ fds.events = 0;
+
+ if (poll(&fds, 1, POLL_TIMEOUT_MS) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!(fds.revents & POLLERR)) {
+ if (completion) {
+ fprintf(stderr, "POLLERR expected\n");
+ exit(EXIT_FAILURE);
+ } else {
+ return;
+ }
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res) {
+ fprintf(stderr, "failed to read error queue: %zi\n", res);
+ exit(EXIT_FAILURE);
+ }
+
+ cm = CMSG_FIRSTHDR(&msg);
+ if (!cm) {
+ fprintf(stderr, "cmsg: no cmsg\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_level != SOL_VSOCK) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_level'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (cm->cmsg_type != 0) {
+ fprintf(stderr, "cmsg: unexpected 'cmsg_type'\n");
+ exit(EXIT_FAILURE);
+ }
+
+ serr = (void *)CMSG_DATA(cm);
+ if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) {
+ fprintf(stderr, "serr: wrong origin: %u\n", serr->ee_origin);
+ exit(EXIT_FAILURE);
+ }
+
+ if (serr->ee_errno) {
+ fprintf(stderr, "serr: wrong error code: %u\n", serr->ee_errno);
+ exit(EXIT_FAILURE);
+ }
+
+ if (zerocopied && (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was copy instead of zerocopy\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!zerocopied && !(serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED)) {
+ fprintf(stderr, "serr: was zerocopy instead of copy\n");
+ exit(EXIT_FAILURE);
+ }
+}
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index fb99208a95ea..d07c1a9c2e0a 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -2,6 +2,7 @@
#ifndef UTIL_H
#define UTIL_H

+#include <stdbool.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

@@ -18,6 +19,16 @@ struct test_opts {
unsigned int peer_cid;
};

+#define VSOCK_TEST_DATA_MAX_IOV 4
+
+struct vsock_test_data {
+ bool zerocopied; /* Data must be zerocopied. */
+ bool completion; /* Must dequeue completion. */
+ int sendmsg_errno; /* 'errno' after 'sendmsg()'. */
+ int vecs_cnt; /* Number of elements in 'vecs'. */
+ struct iovec vecs[VSOCK_TEST_DATA_MAX_IOV];
+};
+
/* A test case definition. Test functions must print failures to stderr and
* terminate with exit(EXIT_FAILURE).
*/
@@ -50,4 +61,11 @@ void list_tests(const struct test_case *test_cases);
void skip_test(struct test_case *test_cases, size_t test_cases_len,
const char *test_id_str);
unsigned long hash_djb2(const void *data, size_t len);
+void enable_so_zerocopy(int fd);
+size_t iovec_bytes(const struct iovec *iov, size_t iovnum);
+unsigned long iovec_hash_djb2(struct iovec *iov, size_t iovnum);
+struct iovec *iovec_from_test_data(const struct vsock_test_data *test_data);
+void free_iovec_test_data(const struct vsock_test_data *test_data,
+ struct iovec *iovec);
+void vsock_recv_completion(int fd, bool zerocopied, bool completion);
#endif /* UTIL_H */
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index ac1bd3ac1533..d576b18bd357 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -20,6 +20,7 @@
#include <sys/mman.h>
#include <poll.h>

+#include "vsock_test_zerocopy.h"
#include "timeout.h"
#include "control.h"
#include "util.h"
@@ -1128,6 +1129,21 @@ static struct test_case test_cases[] = {
.run_client = test_stream_virtio_skb_merge_client,
.run_server = test_stream_virtio_skb_merge_server,
},
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY",
+ .run_client = test_stream_msg_zcopy_client,
+ .run_server = test_stream_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_SEQPACKET MSG_ZEROCOPY",
+ .run_client = test_seqpacket_msg_zcopy_client,
+ .run_server = test_seqpacket_msg_zcopy_server,
+ },
+ {
+ .name = "SOCK_STREAM MSG_ZEROCOPY empty MSG_ERRQUEUE",
+ .run_client = test_stream_msg_zcopy_empty_errq_client,
+ .run_server = test_stream_msg_zcopy_empty_errq_server,
+ },
{},
};

diff --git a/tools/testing/vsock/vsock_test_zerocopy.c b/tools/testing/vsock/vsock_test_zerocopy.c
new file mode 100644
index 000000000000..c5539c5dbded
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.c
@@ -0,0 +1,312 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* MSG_ZEROCOPY feature tests for vsock
+ *
+ * Copyright (C) 2023 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include <poll.h>
+#include <linux/errqueue.h>
+#include <linux/kernel.h>
+#include <error.h>
+#include <errno.h>
+
+#include "control.h"
+#include "vsock_test_zerocopy.h"
+
+#define PAGE_SIZE 4096
+
+static struct vsock_test_data test_data_array[] = {
+ /* Last element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, 200 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE * 2 },
+ { NULL, PAGE_SIZE * 3 }
+ }
+ },
+ /* All elements have page aligned base and size. But
+ * data length is bigger than 64Kb.
+ */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 },
+ { NULL, PAGE_SIZE * 16 }
+ }
+ },
+ /* All elements have page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has non-page aligned size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { NULL, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element has both non-page aligned base and size. */
+ {
+ .zerocopied = true,
+ .completion = true,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { (void *)1, 100 },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Middle element is unmapped. */
+ {
+ .zerocopied = false,
+ .completion = false,
+ .sendmsg_errno = ENOMEM,
+ .vecs_cnt = 3,
+ {
+ { NULL, PAGE_SIZE },
+ { MAP_FAILED, PAGE_SIZE },
+ { NULL, PAGE_SIZE }
+ }
+ },
+ /* Valid data, but SO_ZEROCOPY is off. */
+ {
+ .zerocopied = true,
+ .completion = false,
+ .sendmsg_errno = 0,
+ .vecs_cnt = 1,
+ {
+ { NULL, PAGE_SIZE }
+ }
+ },
+};
+
+static void __test_msg_zerocopy_client(const struct test_opts *opts,
+ const struct vsock_test_data *test_data,
+ bool sock_seqpacket)
+{
+ struct msghdr msg = { 0 };
+ ssize_t sendmsg_res;
+ struct iovec *iovec;
+ int fd;
+
+ if (sock_seqpacket)
+ fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
+ else
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (test_data->completion)
+ enable_so_zerocopy(fd);
+
+ iovec = iovec_from_test_data(test_data);
+
+ msg.msg_iov = iovec;
+ msg.msg_iovlen = test_data->vecs_cnt;
+
+ errno = 0;
+
+ sendmsg_res = sendmsg(fd, &msg, MSG_ZEROCOPY);
+ if (errno != test_data->sendmsg_errno) {
+ fprintf(stderr, "expected 'errno' == %i, got %i\n",
+ test_data->sendmsg_errno, errno);
+ exit(EXIT_FAILURE);
+ }
+
+ if (!errno) {
+ if (sendmsg_res != iovec_bytes(iovec, test_data->vecs_cnt)) {
+ fprintf(stderr, "expected 'sendmsg()' == %li, got %li\n",
+ iovec_bytes(iovec, test_data->vecs_cnt),
+ sendmsg_res);
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ vsock_recv_completion(fd, test_data->zerocopied, test_data->completion);
+
+ if (test_data->sendmsg_errno == 0)
+ control_writeulong(iovec_hash_djb2(iovec, test_data->vecs_cnt));
+ else
+ control_writeulong(0);
+
+ control_writeln("DONE");
+ free_iovec_test_data(test_data, iovec);
+ close(fd);
+}
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_msg_zerocopy_client(opts, &test_data_array[i], false);
+}
+
+void test_seqpacket_msg_zcopy_client(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_msg_zerocopy_client(opts, &test_data_array[i], true);
+}
+
+static void __test_stream_server(const struct test_opts *opts,
+ const struct vsock_test_data *test_data,
+ bool sock_seqpacket)
+{
+ unsigned long remote_hash;
+ unsigned long local_hash;
+ ssize_t total_bytes_rec;
+ unsigned char *data;
+ size_t data_len;
+ int fd;
+
+ if (sock_seqpacket)
+ fd = vsock_seqpacket_accept(VMADDR_CID_ANY, 1234, NULL);
+ else
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ data_len = iovec_bytes(test_data->vecs, test_data->vecs_cnt);
+
+ data = malloc(data_len);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ total_bytes_rec = 0;
+
+ while (total_bytes_rec != data_len) {
+ ssize_t bytes_rec;
+
+ bytes_rec = read(fd, data + total_bytes_rec,
+ data_len - total_bytes_rec);
+ if (bytes_rec <= 0)
+ break;
+
+ total_bytes_rec += bytes_rec;
+ }
+
+ if (test_data->sendmsg_errno == 0)
+ local_hash = hash_djb2(data, data_len);
+ else
+ local_hash = 0;
+
+ free(data);
+
+ /* Waiting for some result. */
+ remote_hash = control_readulong();
+ if (remote_hash != local_hash) {
+ fprintf(stderr, "hash mismatch\n");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_server(opts, &test_data_array[i], false);
+}
+
+void test_seqpacket_msg_zcopy_server(const struct test_opts *opts)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(test_data_array); i++)
+ __test_stream_server(opts, &test_data_array[i], true);
+}
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts)
+{
+ struct msghdr msg = { 0 };
+ char cmsg_data[128];
+ ssize_t res;
+ int fd;
+
+ fd = vsock_stream_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ msg.msg_control = cmsg_data;
+ msg.msg_controllen = sizeof(cmsg_data);
+
+ res = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (res != -1) {
+ fprintf(stderr, "expected 'recvmsg(2)' failure, got %zi\n",
+ res);
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("DONE");
+ close(fd);
+}
+
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+ close(fd);
+}
diff --git a/tools/testing/vsock/vsock_test_zerocopy.h b/tools/testing/vsock/vsock_test_zerocopy.h
new file mode 100644
index 000000000000..220b4f94f042
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VSOCK_TEST_ZEROCOPY_H
+#define VSOCK_TEST_ZEROCOPY_H
+#include "util.h"
+
+void test_stream_msg_zcopy_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_server(const struct test_opts *opts);
+
+void test_seqpacket_msg_zcopy_client(const struct test_opts *opts);
+void test_seqpacket_msg_zcopy_server(const struct test_opts *opts);
+
+void test_stream_msg_zcopy_empty_errq_client(const struct test_opts *opts);
+void test_stream_msg_zcopy_empty_errq_server(const struct test_opts *opts);
+
+#endif /* VSOCK_TEST_ZEROCOPY_H */
--
2.25.1


2023-06-03 20:57:40

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 08/17] vsock: check for MSG_ZEROCOPY support on send

This feature totally depends on transport, so if transport doesn't
support it, return error.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/net/af_vsock.h | 7 +++++++
net/vmw_vsock/af_vsock.c | 6 ++++++
2 files changed, 13 insertions(+)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 0e7504a42925..ec09edc5f3a0 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -177,6 +177,9 @@ struct vsock_transport {

/* Read a single skb */
int (*read_skb)(struct vsock_sock *, skb_read_actor_t);
+
+ /* Zero-copy. */
+ bool (*msgzerocopy_allow)(void);
};

/**** CORE ****/
@@ -243,4 +246,8 @@ static inline void __init vsock_bpf_build_proto(void)
{}
#endif

+static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
+{
+ return t->msgzerocopy_allow && t->msgzerocopy_allow();
+}
#endif /* __AF_VSOCK_H__ */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 07803d9fbf6d..033006e1b5ad 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1824,6 +1824,12 @@ static int vsock_connectible_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}

+ if (msg->msg_flags & MSG_ZEROCOPY &&
+ !vsock_msgzerocopy_allow(transport)) {
+ err = -EOPNOTSUPP;
+ goto out;
+ }
+
/* Wait for room in the produce queue to enqueue our user's data. */
timeout = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

--
2.25.1


2023-06-03 21:00:25

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 02/17] vhost/vsock: read data from non-linear skb

This adds copying to guest's virtio buffers from non-linear skbs. Such
skbs are created by protocol layer when MSG_ZEROCOPY flags is used. It
changes call of 'copy_to_iter()' to 'skb_copy_datagram_iter()'. Second
function can read data from non-linear skb.

See commit to 'net/vmw_vsock/virtio_transport_common.c' with the same
name for more details.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..b254aa4b756a 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -156,7 +156,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}

iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
- payload_len = skb->len;
+ payload_len = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
hdr = virtio_vsock_hdr(skb);

/* If the packet is greater than the space available in the
@@ -197,8 +197,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}

- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
+ if (skb_copy_datagram_iter(skb,
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &iov_iter,
+ payload_len)) {
kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
@@ -212,13 +214,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;

- skb_pull(skb, payload_len);
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
total_len += payload_len;

/* If we didn't send all the payload we can requeue the packet
* to send it with the next available buffer.
*/
- if (skb->len > 0) {
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
hdr->flags |= cpu_to_le32(flags_to_restore);

/* We are queueing the same skb to handle
--
2.25.1


2023-06-03 21:00:50

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 01/17] vsock/virtio: read data from non-linear skb

This is preparation patch for non-linear skbuff handling. It replaces
direct calls of 'memcpy_to_msg()' with 'skb_copy_datagram_iter()'. Main
advantage of the second one is that is can handle paged part of the skb
by using 'kmap()' on each page, but if there are no pages in the skb,
it behaves like simple copying to iov iterator. This patch also adds
new field to the control block of skb - this value shows current offset
in the skb to read next portion of data (it doesn't matter linear it or
not). Idea is that 'skb_copy_datagram_iter()' handles both types of
skb internally - it just needs an offset from which to copy data from
the given skb. This offset is incremented on each read from skb. This
approach allows to avoid special handling of non-linear skbs:
1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
2) We need to update 'data_len' also on each read from this skb.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport_common.c | 26 +++++++++++++++++--------
2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..17dbb7176e37 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,7 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+ u32 frag_off;
};

#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..5819a9cd4515 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
spin_lock_bh(&vvs->rx_lock);

skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
- off = 0;
+ off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

if (total == len)
break;
@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data + off, bytes);
+ err = skb_copy_datagram_iter(skb, off,
+ &msg->msg_iter,
+ bytes);
+
if (err)
goto out;

@@ -414,24 +417,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
skb = skb_peek(&vvs->rx_queue);

bytes = len - total;
- if (bytes > skb->len)
- bytes = skb->len;
+ if (bytes > skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off)
+ bytes = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data, bytes);
+ err = skb_copy_datagram_iter(skb,
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &msg->msg_iter, bytes);
+
if (err)
goto out;

spin_lock_bh(&vvs->rx_lock);

total += bytes;
- skb_pull(skb, bytes);

- if (skb->len == 0) {
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
+
+ if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);

virtio_transport_dec_rx_pkt(vvs, pkt_len);
@@ -503,7 +510,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+ err = skb_copy_datagram_iter(skb, 0,
+ &msg->msg_iter,
+ bytes_to_copy);
+
if (err) {
/* Copy of message failed. Rest of
* fragments will be freed without copy.
--
2.25.1


2023-06-03 21:05:19

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support

This adds handling of MSG_ZEROCOPY flag on transmission path: if this
flag is set and zerocopy transmission is possible, then non-linear skb
will be created and filled with the pages of user's buffer. Pages of
user's buffer are locked in memory by 'get_user_pages()'.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
1 file changed, 208 insertions(+), 62 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 0de562c1dc4b..f1ec38c72db7 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}

-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
- size_t len,
- u32 src_cid,
- u32 src_port,
- u32 dst_cid,
- u32 dst_port)
-{
- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
- struct virtio_vsock_hdr *hdr;
- struct sk_buff *skb;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+ size_t max_to_send)
+{
+ struct iov_iter *iov_iter;
+
+ if (!info->msg)
+ return false;
+
+ iov_iter = &info->msg->msg_iter;
+
+ /* Data is simple buffer. */
+ if (iter_is_ubuf(iov_iter))
+ return true;
+
+ if (!iter_is_iovec(iov_iter))
+ return false;
+
+ if (iov_iter->iov_offset)
+ return false;
+
+ /* We can't send whole iov. */
+ if (iov_iter->count > max_to_send)
+ return false;
+
+ return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+ struct sk_buff *skb,
+ struct msghdr *msg,
+ bool zerocopy)
+{
+ struct ubuf_info *uarg;
+
+ if (msg->msg_ubuf) {
+ uarg = msg->msg_ubuf;
+ net_zcopy_get(uarg);
+ } else {
+ struct iov_iter *iter = &msg->msg_iter;
+ struct ubuf_info_msgzc *uarg_zc;
+ int len;
+
+ /* Only ITER_IOVEC or ITER_UBUF are allowed and
+ * checked before.
+ */
+ if (iter_is_iovec(iter))
+ len = iov_length(iter->__iov, iter->nr_segs);
+ else
+ len = iter->count;
+
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+ len,
+ NULL);
+
+ if (!uarg)
+ return -1;
+
+ uarg_zc = uarg_to_msgzc(uarg);
+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
+ }
+
+ skb_zcopy_init(skb, uarg);
+
+ return 0;
+}
+
+static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
+ struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t len)
+{
void *payload;
int err;

- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
- if (!skb)
- return NULL;
+ payload = skb_put(skb, len);
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err)
+ return -1;
+
+ if (msg_data_left(info->msg))
+ return 0;
+
+ return 0;
+}
+
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ u32 src_cid,
+ u32 src_port,
+ u32 dst_cid,
+ u32 dst_port,
+ size_t len)
+{
+ struct virtio_vsock_hdr *hdr;

hdr = virtio_vsock_hdr(skb);
hdr->type = cpu_to_le16(info->type);
@@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
hdr->dst_port = cpu_to_le32(dst_port);
hdr->flags = cpu_to_le32(info->flags);
hdr->len = cpu_to_le32(len);
-
- if (info->msg && len > 0) {
- payload = skb_put(skb, len);
- err = memcpy_from_msg(payload, info->msg, len);
- if (err)
- goto out;
-
- if (msg_data_left(info->msg) == 0 &&
- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
-
- if (info->msg->msg_flags & MSG_EOR)
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
- }
- }
-
- if (info->reply)
- virtio_vsock_skb_set_reply(skb);
-
- trace_virtio_transport_alloc_pkt(src_cid, src_port,
- dst_cid, dst_port,
- len,
- info->type,
- info->op,
- info->flags);
-
- if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
- WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
- goto out;
- }
-
- return skb;
-
-out:
- kfree_skb(skb);
- return NULL;
}

static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
@@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
return VIRTIO_VSOCK_TYPE_SEQPACKET;
}

+/* Returns a new packet on success, otherwise returns NULL.
+ *
+ * If NULL is returned, errp is set to a negative errno.
+ */
+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t payload_len,
+ bool zcopy,
+ u32 dst_cid,
+ u32 dst_port,
+ u32 src_cid,
+ u32 src_port)
+{
+ struct sk_buff *skb;
+ size_t skb_len;
+
+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+ if (!zcopy)
+ skb_len += payload_len;
+
+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+ if (!skb)
+ return NULL;
+
+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len);
+
+ /* Set owner here, because '__zerocopy_sg_from_iter()' uses
+ * owner of skb without check to update 'sk_wmem_alloc'.
+ */
+ if (vsk)
+ skb_set_owner_w(skb, sk_vsock(vsk));
+
+ if (info->msg && payload_len > 0) {
+ int err;
+
+ if (zcopy) {
+ err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
+ &info->msg->msg_iter,
+ payload_len);
+ } else {
+ err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
+ }
+
+ if (err)
+ goto out;
+
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
+
+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+ struct virtio_vsock_hdr *hdr;
+
+ hdr = virtio_vsock_hdr(skb);
+
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+ if (info->msg->msg_flags & MSG_EOR)
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ }
+ }
+
+ if (info->reply)
+ virtio_vsock_skb_set_reply(skb);
+
+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len,
+ info->type,
+ info->op,
+ info->flags);
+
+ return skb;
+out:
+ kfree_skb(skb);
+ return NULL;
+}
+
/* This function can only be used on connecting/connected sockets,
* since a socket assigned to a transport is required.
*
@@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
const struct virtio_transport *t_ops;
struct virtio_vsock_sock *vvs;
u32 pkt_len = info->pkt_len;
+ bool can_zcopy = false;
+ u32 max_skb_cap;
u32 rest_len;
int ret;

@@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;

+ /* If zerocopy is not enabled by 'setsockopt()', we behave as
+ * there is no MSG_ZEROCOPY flag set.
+ */
+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+ info->flags &= ~MSG_ZEROCOPY;
+
+ if (info->flags & MSG_ZEROCOPY)
+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+
+ if (can_zcopy)
+ max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+ (MAX_SKB_FRAGS * PAGE_SIZE));
+ else
+ max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
+
rest_len = pkt_len;

do {
struct sk_buff *skb;
size_t skb_len;

- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+ skb_len = min(max_skb_cap, rest_len);

- skb = virtio_transport_alloc_skb(info, skb_len,
- src_cid, src_port,
- dst_cid, dst_port);
+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
+ dst_cid, dst_port,
+ src_cid, src_port);
if (!skb) {
ret = -ENOMEM;
break;
}

+ /* This is last skb to send this portion of data. */
+ if (skb_len == rest_len &&
+ info->flags & MSG_ZEROCOPY &&
+ info->op == VIRTIO_VSOCK_OP_RW) {
+ if (virtio_transport_init_zcopy_skb(vsk, skb,
+ info->msg,
+ can_zcopy)) {
+ ret = -ENOMEM;
+ break;
+ }
+ }
+
virtio_transport_inc_tx_pkt(vvs, skb);

ret = t_ops->send_pkt(skb);
@@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
+ .flags = msg->msg_flags,
};

return virtio_transport_send_pkt_info(vsk, &info);
@@ -935,11 +1081,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
if (!t)
return -ENOTCONN;

- reply = virtio_transport_alloc_skb(&info, 0,
- le64_to_cpu(hdr->dst_cid),
- le32_to_cpu(hdr->dst_port),
+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
le64_to_cpu(hdr->src_cid),
- le32_to_cpu(hdr->src_port));
+ le32_to_cpu(hdr->src_port),
+ le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port));
if (!reply)
return -ENOMEM;

--
2.25.1


2023-06-03 21:06:28

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 04/17] vsock/virtio: non-linear skb handling for tap

For tap device new skb is created and data from the current skb is
copied to it. This adds copying data from non-linear skb to new
the skb.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 5819a9cd4515..0de562c1dc4b 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}

+static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ struct iov_iter iov_iter = { 0 };
+ struct kvec kvec;
+ size_t to_copy;
+
+ kvec.iov_base = dst;
+ kvec.iov_len = len;
+
+ iov_iter.iter_type = ITER_KVEC;
+ iov_iter.kvec = &kvec;
+ iov_iter.nr_segs = 1;
+
+ to_copy = min_t(size_t, len, skb->len);
+
+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &iov_iter, to_copy);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;

/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;

skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));

if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(pkt)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}

return skb;
--
2.25.1


2023-06-03 21:07:10

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v4 09/17] vsock: enable SOCK_SUPPORT_ZC bit

This bit is used by io_uring in case of zerocopy tx mode. io_uring code
checks, that socket has this feature. This patch sets it in two places:
1) For socket in 'connect()' call.
2) For new socket which is returned by 'accept()' call.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/af_vsock.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 033006e1b5ad..da22ae0ef477 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1406,6 +1406,9 @@ static int vsock_connect(struct socket *sock, struct sockaddr *addr,
goto out;
}

+ if (vsock_msgzerocopy_allow(transport))
+ set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
+
err = vsock_auto_bind(vsk);
if (err)
goto out;
@@ -1560,6 +1563,9 @@ static int vsock_accept(struct socket *sock, struct socket *newsock, int flags,
} else {
newsock->state = SS_CONNECTED;
sock_graft(connected, newsock);
+ if (vsock_msgzerocopy_allow(vconnected->transport))
+ set_bit(SOCK_SUPPORT_ZC,
+ &connected->sk_socket->flags);
}

release_sock(connected);
--
2.25.1


2023-06-12 17:29:18

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

Hey Arseniy,

Thanks for this series, very good stuff!

On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
> Hello,
>
> DESCRIPTION
>
> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
> current implementation for TCP as much as possible:
>
> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> flag will be ignored (e.g. without completion).
>
> 2) Kernel uses completions from socket's error queue. Single completion
> for single tx syscall (or it can merge several completions to single
> one). I used already implemented logic for MSG_ZEROCOPY support:
> 'msg_zerocopy_realloc()' etc.
>
> Difference with copy way is not significant. During packet allocation,
> non-linear skb is created and filled with pinned user pages.
> There are also some updates for vhost and guest parts of transport - in
> both cases i've added handling of non-linear skb for virtio part. vhost
> copies data from such skb to the guest's rx virtio buffers. In the guest,
> virtio transport fills tx virtio queue with pages from skb.
>
> Head of this patchset is:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
> This version has several limits/problems:
>
> 1) As this feature totally depends on transport, there is no way (or it
> is difficult) to check whether transport is able to handle it or not
> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> are not considered to be called from each other. So in current version
> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> tx routine will fail with EOPNOTSUPP.
>
> ^^^
> This is still no resolved :(
>

I think to get around this you could use set SOCK_CUSTOM_SOCKOPT in the
vsock create function, handle SO_ZEROCOPY in the vsock handler, but pass
the rest of the common options to sock_setsockopt().

I think the next issue you would run into though is that users may call
setsockopt() before connect(), and so the transport will still be
unknown (except for dgrams, which are weird for reasons).

What do you think about EOPNOTSUPP being returned when the user selects
an incompatible transport with connect() instead of returning it later
in the tx path?

> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> one completion. In each completion there is flag which shows how tx
> was performed: zerocopy or copy. This leads that whole message must
> be send in zerocopy or copy way - we can't send part of message with
> copying and rest of message with zerocopy mode (or vice versa). Now,
> we need to account vsock credit logic, e.g. we can't send whole data
> once - only allowed number of bytes could sent at any moment. In case
> of copying way there is no problem as in worst case we can send single
> bytes, but zerocopy is more complex because smallest transmission
> unit is single page. So if there is not enough space at peer's side
> to send integer number of pages (at least one) - we will wait, thus
> stalling tx side. To overcome this problem i've added simple rule -
> zerocopy is possible only when there is enough space at another side
> for whole message (to check, that current 'msghdr' was already used
> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
> ^^^
> Discussed as ok during v2. Link:
> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>
> 3) loopback transport is not supported, because it requires to implement
> non-linear skb handling in dequeue logic (as we "send" fragged skb
> and "receive" it from the same queue). I'm going to implement it in
> next versions.
>
> ^^^ fixed in v2
>
> 4) Current implementation sets max length of packet to 64KB. IIUC this
> is due to 'kmalloc()' allocated data buffers. I think, in case of
> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> not touched for data - user space pages are used as buffers. Also
> this limit trims every message which is > 64KB, thus such messages
> will be send in copy mode due to 'iov_offset' check in 2).
>
> ^^^ fixed in v2
>
> PATCHSET STRUCTURE
>
> Patchset has the following structure:
> 1) Handle non-linear skbuff on receive in virtio/vhost.
> 2) Handle non-linear skbuff on send in virtio/vhost.
> 3) Updates for AF_VSOCK.
> 4) Enable MSG_ZEROCOPY support on transports.
> 5) Tests/tools/docs updates.
>
> PERFORMANCE
>
> Performance: it is a little bit tricky to compare performance between
> copy and zerocopy transmissions. In zerocopy way we need to wait when
> user buffers will be released by kernel, so it is like synchronous
> path (wait until device driver will process it), while in copy way we
> can feed data to kernel as many as we want, don't care about device
> driver. So I compared only time which we spend in the 'send()' syscall.
> Then if this value will be combined with total number of transmitted
> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
> enough credit, receiver allocates same amount of space as sender needs.
>
> Sender:
> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>
> Receiver:
> ./vsock_perf --vsk-size 256M
>
> I run tests on two setups: desktop with Core i7 - I use this PC for
> development and in this case guest is nested guest, and host is normal
> guest. Another hardware is some embedded board with Atom - here I don't
> have nested virtualization - host runs on hw, and guest is normal guest.
>
> G2H transmission (values are Gbit/s):
>
> Core i7 with nested guest. Atom with normal guest.
>
> *-------------------------------* *-------------------------------*
> | | | | | | | |
> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
> | | | | | | | |
> *-------------------------------* *-------------------------------*
> | 4KB | 3 | 10 | | 4KB | 0.8 | 1.9 |
> *-------------------------------* *-------------------------------*
> | 32KB | 20 | 61 | | 32KB | 6.8 | 20.2 |
> *-------------------------------* *-------------------------------*
> | 256KB | 33 | 244 | | 256KB | 7.8 | 55 |
> *-------------------------------* *-------------------------------*
> | 1M | 30 | 373 | | 1M | 7 | 95 |
> *-------------------------------* *-------------------------------*
> | 8M | 22 | 475 | | 8M | 7 | 114 |
> *-------------------------------* *-------------------------------*
>
> H2G:
>
> Core i7 with nested guest. Atom with normal guest.
>
> *-------------------------------* *-------------------------------*
> | | | | | | | |
> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
> | | | | | | | |
> *-------------------------------* *-------------------------------*
> | 4KB | 20 | 10 | | 4KB | 4.37 | 3 |
> *-------------------------------* *-------------------------------*
> | 32KB | 37 | 75 | | 32KB | 11 | 18 |
> *-------------------------------* *-------------------------------*
> | 256KB | 44 | 299 | | 256KB | 11 | 62 |
> *-------------------------------* *-------------------------------*
> | 1M | 28 | 335 | | 1M | 9 | 77 |
> *-------------------------------* *-------------------------------*
> | 8M | 27 | 417 | | 8M | 9.35 | 115 |
> *-------------------------------* *-------------------------------*
>

Nice!


[...]

Thanks,
Bobby

2023-06-12 19:59:53

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v4 02/17] vhost/vsock: read data from non-linear skb

On Sat, Jun 03, 2023 at 11:49:24PM +0300, Arseniy Krasnov wrote:
> This adds copying to guest's virtio buffers from non-linear skbs. Such
> skbs are created by protocol layer when MSG_ZEROCOPY flags is used. It
> changes call of 'copy_to_iter()' to 'skb_copy_datagram_iter()'. Second
> function can read data from non-linear skb.
>
> See commit to 'net/vmw_vsock/virtio_transport_common.c' with the same
> name for more details.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>
> ---
> drivers/vhost/vsock.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 6578db78f0ae..b254aa4b756a 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -156,7 +156,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> }
>
> iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
> - payload_len = skb->len;
> + payload_len = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
> hdr = virtio_vsock_hdr(skb);
>
> /* If the packet is greater than the space available in the
> @@ -197,8 +197,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> break;
> }
>
> - nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
> - if (nbytes != payload_len) {
> + if (skb_copy_datagram_iter(skb,
> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
> + &iov_iter,
> + payload_len)) {
> kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt buf\n");
> break;
> @@ -212,13 +214,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
> added = true;
>
> - skb_pull(skb, payload_len);
> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
> total_len += payload_len;
>
> /* If we didn't send all the payload we can requeue the packet
> * to send it with the next available buffer.
> */
> - if (skb->len > 0) {
> + if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
> hdr->flags |= cpu_to_le32(flags_to_restore);
>
> /* We are queueing the same skb to handle
> --
> 2.25.1
>

LGTM.

Reviewed-by: Bobby Eshleman <[email protected]>

2023-06-12 20:15:15

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/17] vsock/virtio: read data from non-linear skb

On Sat, Jun 03, 2023 at 11:49:23PM +0300, Arseniy Krasnov wrote:
> This is preparation patch for non-linear skbuff handling. It replaces
> direct calls of 'memcpy_to_msg()' with 'skb_copy_datagram_iter()'. Main
> advantage of the second one is that is can handle paged part of the skb
> by using 'kmap()' on each page, but if there are no pages in the skb,
> it behaves like simple copying to iov iterator. This patch also adds
> new field to the control block of skb - this value shows current offset
> in the skb to read next portion of data (it doesn't matter linear it or
> not). Idea is that 'skb_copy_datagram_iter()' handles both types of
> skb internally - it just needs an offset from which to copy data from
> the given skb. This offset is incremented on each read from skb. This
> approach allows to avoid special handling of non-linear skbs:
> 1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
> 2) We need to update 'data_len' also on each read from this skb.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>
> ---
> include/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport_common.c | 26 +++++++++++++++++--------
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index c58453699ee9..17dbb7176e37 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -12,6 +12,7 @@
> struct virtio_vsock_skb_cb {
> bool reply;
> bool tap_delivered;
> + u32 frag_off;
> };
>
> #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index b769fc258931..5819a9cd4515 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> spin_lock_bh(&vvs->rx_lock);
>
> skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
> - off = 0;
> + off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>
> if (total == len)
> break;
> @@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, skb->data + off, bytes);
> + err = skb_copy_datagram_iter(skb, off,
> + &msg->msg_iter,
> + bytes);
> +
> if (err)
> goto out;
>
> @@ -414,24 +417,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> skb = skb_peek(&vvs->rx_queue);
>
> bytes = len - total;
> - if (bytes > skb->len)
> - bytes = skb->len;
> + if (bytes > skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off)
> + bytes = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>
> /* sk_lock is held by caller so no one else can dequeue.
> * Unlock rx_lock since memcpy_to_msg() may sleep.
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, skb->data, bytes);
> + err = skb_copy_datagram_iter(skb,
> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
> + &msg->msg_iter, bytes);
> +
> if (err)
> goto out;
>
> spin_lock_bh(&vvs->rx_lock);
>
> total += bytes;
> - skb_pull(skb, bytes);
>
> - if (skb->len == 0) {
> + VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
> +
> + if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
> u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
>
> virtio_transport_dec_rx_pkt(vvs, pkt_len);
> @@ -503,7 +510,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
> - err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
> + err = skb_copy_datagram_iter(skb, 0,
> + &msg->msg_iter,
> + bytes_to_copy);
> +
> if (err) {
> /* Copy of message failed. Rest of
> * fragments will be freed without copy.
> --
> 2.25.1
>

LGTM.

Reviewed-by: Bobby Eshleman <[email protected]>

2023-06-14 06:09:55

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

Hello Bobby! Sorry for a little bit late reply.

On 12.06.2023 20:20, Bobby Eshleman wrote:
> Hey Arseniy,
>
> Thanks for this series, very good stuff!
>
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>> DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>> flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>> for single tx syscall (or it can merge several completions to single
>> one). I used already implemented logic for MSG_ZEROCOPY support:
>> 'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created and filled with pinned user pages.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> Head of this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>> is difficult) to check whether transport is able to handle it or not
>> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>> are not considered to be called from each other. So in current version
>> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>> tx routine will fail with EOPNOTSUPP.
>>
>> ^^^
>> This is still no resolved :(
>>
>
> I think to get around this you could use set SOCK_CUSTOM_SOCKOPT in the
> vsock create function, handle SO_ZEROCOPY in the vsock handler, but pass
> the rest of the common options to sock_setsockopt().

Ah yes, I really forget about this way, thanks!

>
> I think the next issue you would run into though is that users may call
> setsockopt() before connect(), and so the transport will still be
> unknown (except for dgrams, which are weird for reasons).
>
> What do you think about EOPNOTSUPP being returned when the user selects
> an incompatible transport with connect() instead of returning it later
> in the tx path?

Yes, I think it is ok, in 'vsock_assign_transport()' which was called from
'connect()' I will check that if zerocopy transmission is enabled, I will
check that transport supports it (seqpacket mode works in the same way -
if transports doesn't support it -> connect failed).

So if 'setsockopt()' is called before 'connect()' (e.g. transport is unknown),
I just set this option and thats all. Later in 'connect()' during transport
assignment I'll check that selected transport supports this feature if needed.

If 'setsockopt()' is called after 'connect()' everything is simple - transport
is already known.

Thanks for this clue, I'll include it in v5!

>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>> one completion. In each completion there is flag which shows how tx
>> was performed: zerocopy or copy. This leads that whole message must
>> be send in zerocopy or copy way - we can't send part of message with
>> copying and rest of message with zerocopy mode (or vice versa). Now,
>> we need to account vsock credit logic, e.g. we can't send whole data
>> once - only allowed number of bytes could sent at any moment. In case
>> of copying way there is no problem as in worst case we can send single
>> bytes, but zerocopy is more complex because smallest transmission
>> unit is single page. So if there is not enough space at peer's side
>> to send integer number of pages (at least one) - we will wait, thus
>> stalling tx side. To overcome this problem i've added simple rule -
>> zerocopy is possible only when there is enough space at another side
>> for whole message (to check, that current 'msghdr' was already used
>> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>> ^^^
>> Discussed as ok during v2. Link:
>> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>>
>> 3) loopback transport is not supported, because it requires to implement
>> non-linear skb handling in dequeue logic (as we "send" fragged skb
>> and "receive" it from the same queue). I'm going to implement it in
>> next versions.
>>
>> ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>> is due to 'kmalloc()' allocated data buffers. I think, in case of
>> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>> not touched for data - user space pages are used as buffers. Also
>> this limit trims every message which is > 64KB, thus such messages
>> will be send in copy mode due to 'iov_offset' check in 2).
>>
>> ^^^ fixed in v2
>>
>> PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>> PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it is like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> I run tests on two setups: desktop with Core i7 - I use this PC for
>> development and in this case guest is nested guest, and host is normal
>> guest. Another hardware is some embedded board with Atom - here I don't
>> have nested virtualization - host runs on hw, and guest is normal guest.
>>
>> G2H transmission (values are Gbit/s):
>>
>> Core i7 with nested guest. Atom with normal guest.
>>
>> *-------------------------------* *-------------------------------*
>> | | | | | | | |
>> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
>> | | | | | | | |
>> *-------------------------------* *-------------------------------*
>> | 4KB | 3 | 10 | | 4KB | 0.8 | 1.9 |
>> *-------------------------------* *-------------------------------*
>> | 32KB | 20 | 61 | | 32KB | 6.8 | 20.2 |
>> *-------------------------------* *-------------------------------*
>> | 256KB | 33 | 244 | | 256KB | 7.8 | 55 |
>> *-------------------------------* *-------------------------------*
>> | 1M | 30 | 373 | | 1M | 7 | 95 |
>> *-------------------------------* *-------------------------------*
>> | 8M | 22 | 475 | | 8M | 7 | 114 |
>> *-------------------------------* *-------------------------------*
>>
>> H2G:
>>
>> Core i7 with nested guest. Atom with normal guest.
>>
>> *-------------------------------* *-------------------------------*
>> | | | | | | | |
>> | buf size | copy | zerocopy | | buf size | copy | zerocopy |
>> | | | | | | | |
>> *-------------------------------* *-------------------------------*
>> | 4KB | 20 | 10 | | 4KB | 4.37 | 3 |
>> *-------------------------------* *-------------------------------*
>> | 32KB | 37 | 75 | | 32KB | 11 | 18 |
>> *-------------------------------* *-------------------------------*
>> | 256KB | 44 | 299 | | 256KB | 11 | 62 |
>> *-------------------------------* *-------------------------------*
>> | 1M | 28 | 335 | | 1M | 9 | 77 |
>> *-------------------------------* *-------------------------------*
>> | 8M | 27 | 417 | | 8M | 9.35 | 115 |
>> *-------------------------------* *-------------------------------*
>>
>
> Nice!
>
>
> [...]
>
> Thanks,
> Bobby

Thanks, Arseniy

2023-06-26 15:42:27

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 01/17] vsock/virtio: read data from non-linear skb

On Sat, Jun 03, 2023 at 11:49:23PM +0300, Arseniy Krasnov wrote:
>This is preparation patch for non-linear skbuff handling. It replaces
>direct calls of 'memcpy_to_msg()' with 'skb_copy_datagram_iter()'. Main
>advantage of the second one is that is can handle paged part of the skb
>by using 'kmap()' on each page, but if there are no pages in the skb,
>it behaves like simple copying to iov iterator. This patch also adds
>new field to the control block of skb - this value shows current offset
>in the skb to read next portion of data (it doesn't matter linear it or
>not). Idea is that 'skb_copy_datagram_iter()' handles both types of
>skb internally - it just needs an offset from which to copy data from
>the given skb. This offset is incremented on each read from skb. This
>approach allows to avoid special handling of non-linear skbs:
>1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
>2) We need to update 'data_len' also on each read from this skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> include/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport_common.c | 26 +++++++++++++++++--------
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index c58453699ee9..17dbb7176e37 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -12,6 +12,7 @@
> struct virtio_vsock_skb_cb {
> bool reply;
> bool tap_delivered;
>+ u32 frag_off;
> };
>
> #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index b769fc258931..5819a9cd4515 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> spin_lock_bh(&vvs->rx_lock);
>
> skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
>- off = 0;
>+ off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>
> if (total == len)
> break;
>@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data + off, bytes);
>+ err = skb_copy_datagram_iter(skb, off,
>+ &msg->msg_iter,
>+ bytes);
>+
> if (err)
> goto out;
>
>@@ -414,24 +417,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> skb = skb_peek(&vvs->rx_queue);
>
> bytes = len - total;
>- if (bytes > skb->len)
>- bytes = skb->len;
>+ if (bytes > skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off)
>+ bytes = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

What about storing `VIRTIO_VSOCK_SKB_CB(skb)->frag_off` in a variable?
More for readability than optimization, which I hope the compiler
already does on its own.

The rest LGTM.

Stefano

>
> /* sk_lock is held by caller so no one else can dequeue.
> * Unlock rx_lock since memcpy_to_msg() may sleep.
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data, bytes);
>+ err = skb_copy_datagram_iter(skb,
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
>+ &msg->msg_iter, bytes);
>+
> if (err)
> goto out;
>
> spin_lock_bh(&vvs->rx_lock);
>
> total += bytes;
>- skb_pull(skb, bytes);
>
>- if (skb->len == 0) {
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
>+
>+ if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
> u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
>
> virtio_transport_dec_rx_pkt(vvs, pkt_len);
>@@ -503,7 +510,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
>+ err = skb_copy_datagram_iter(skb, 0,
>+ &msg->msg_iter,
>+ bytes_to_copy);
>+
> if (err) {
> /* Copy of message failed. Rest of
> * fragments will be freed without copy.
>--
>2.25.1
>


2023-06-26 15:44:30

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 02/17] vhost/vsock: read data from non-linear skb

On Sat, Jun 03, 2023 at 11:49:24PM +0300, Arseniy Krasnov wrote:
>This adds copying to guest's virtio buffers from non-linear skbs. Such
>skbs are created by protocol layer when MSG_ZEROCOPY flags is used. It
>changes call of 'copy_to_iter()' to 'skb_copy_datagram_iter()'. Second
>function can read data from non-linear skb.
>
>See commit to 'net/vmw_vsock/virtio_transport_common.c' with the same
>name for more details.

I think it's okay if we report the same details here.

>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> drivers/vhost/vsock.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 6578db78f0ae..b254aa4b756a 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -156,7 +156,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> }
>
> iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
>- payload_len = skb->len;
>+ payload_len = skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

Also here a variable should make the code more readable.

Stefano

> hdr = virtio_vsock_hdr(skb);
>
> /* If the packet is greater than the space available in the
>@@ -197,8 +197,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> break;
> }
>
>- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>- if (nbytes != payload_len) {
>+ if (skb_copy_datagram_iter(skb,
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
>+ &iov_iter,
>+ payload_len)) {
> kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt buf\n");
> break;
>@@ -212,13 +214,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
> added = true;
>
>- skb_pull(skb, payload_len);
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
> total_len += payload_len;
>
> /* If we didn't send all the payload we can requeue the packet
> * to send it with the next available buffer.
> */
>- if (skb->len > 0) {
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
> hdr->flags |= cpu_to_le32(flags_to_restore);
>
> /* We are queueing the same skb to handle
>--
>2.25.1
>


2023-06-26 16:12:14

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 04/17] vsock/virtio: non-linear skb handling for tap

On Sat, Jun 03, 2023 at 11:49:26PM +0300, Arseniy Krasnov wrote:
>For tap device new skb is created and data from the current skb is
>copied to it. This adds copying data from non-linear skb to new
>the skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
> 1 file changed, 28 insertions(+), 3 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 5819a9cd4515..0de562c1dc4b 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> return NULL;
> }
>
>+static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,

`const struct sk_buff *skb` should be better also to understand that
the function copy data from *skb to *dst.

>+ void *dst,
>+ size_t len)
>+{
>+ struct iov_iter iov_iter = { 0 };
>+ struct kvec kvec;
>+ size_t to_copy;
>+
>+ kvec.iov_base = dst;
>+ kvec.iov_len = len;
>+
>+ iov_iter.iter_type = ITER_KVEC;
>+ iov_iter.kvec = &kvec;
>+ iov_iter.nr_segs = 1;
>+
>+ to_copy = min_t(size_t, len, skb->len);
>+
>+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
>+ &iov_iter, to_copy);
>+}
>+
> /* Packet capture */
> static struct sk_buff *virtio_transport_build_skb(void *opaque)
> {
>@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> struct af_vsockmon_hdr *hdr;
> struct sk_buff *skb;
> size_t payload_len;
>- void *payload_buf;
>
> /* A packet could be split to fit the RX buffer, so we can retrieve
> * the payload length from the header and the buffer pointer taking
>@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> */
> pkt_hdr = virtio_vsock_hdr(pkt);
> payload_len = pkt->len;
>- payload_buf = pkt->data;
>
> skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
> GFP_ATOMIC);
>@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
> skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));
>
> if (payload_len) {
>- skb_put_data(skb, payload_buf, payload_len);
>+ if (skb_is_nonlinear(pkt)) {
>+ void *data = skb_put(skb, payload_len);
>+
>+ virtio_transport_copy_nonlinear_skb(pkt, data,
>payload_len);
>+ } else {
>+ skb_put_data(skb, pkt->data, payload_len);
>+ }
> }
>
> return skb;
>--
>2.25.1
>


2023-06-26 16:16:33

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support

On Sat, Jun 03, 2023 at 11:49:27PM +0300, Arseniy Krasnov wrote:
>This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>flag is set and zerocopy transmission is possible, then non-linear skb
>will be created and filled with the pages of user's buffer. Pages of
>user's buffer are locked in memory by 'get_user_pages()'.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
> 1 file changed, 208 insertions(+), 62 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 0de562c1dc4b..f1ec38c72db7 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> return container_of(t, struct virtio_transport, transport);
> }
>
>-/* Returns a new packet on success, otherwise returns NULL.
>- *
>- * If NULL is returned, errp is set to a negative errno.
>- */
>-static struct sk_buff *
>-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>- size_t len,
>- u32 src_cid,
>- u32 src_port,
>- u32 dst_cid,
>- u32 dst_port)
>-{
>- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>- struct virtio_vsock_hdr *hdr;
>- struct sk_buff *skb;
>+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>+ size_t max_to_send)
>+{
>+ struct iov_iter *iov_iter;
>+
>+ if (!info->msg)
>+ return false;
>+
>+ iov_iter = &info->msg->msg_iter;
>+
>+ /* Data is simple buffer. */
>+ if (iter_is_ubuf(iov_iter))
>+ return true;
>+
>+ if (!iter_is_iovec(iov_iter))
>+ return false;
>+
>+ if (iov_iter->iov_offset)
>+ return false;
>+
>+ /* We can't send whole iov. */
>+ if (iov_iter->count > max_to_send)
>+ return false;
>+
>+ return true;
>+}
>+
>+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>+ struct sk_buff *skb,
>+ struct msghdr *msg,
>+ bool zerocopy)
>+{
>+ struct ubuf_info *uarg;
>+
>+ if (msg->msg_ubuf) {
>+ uarg = msg->msg_ubuf;
>+ net_zcopy_get(uarg);
>+ } else {
>+ struct iov_iter *iter = &msg->msg_iter;
>+ struct ubuf_info_msgzc *uarg_zc;
>+ int len;
>+
>+ /* Only ITER_IOVEC or ITER_UBUF are allowed and
>+ * checked before.
>+ */
>+ if (iter_is_iovec(iter))
>+ len = iov_length(iter->__iov, iter->nr_segs);
>+ else
>+ len = iter->count;
>+
>+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>+ len,
>+ NULL);
>+
>+ if (!uarg)
>+ return -1;
>+
>+ uarg_zc = uarg_to_msgzc(uarg);
>+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
>+ }
>+
>+ skb_zcopy_init(skb, uarg);
>+
>+ return 0;
>+}
>+
>+static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
>+ struct vsock_sock *vsk,

`vsk` seems unused

>+ struct virtio_vsock_pkt_info *info,
>+ size_t len)
>+{
> void *payload;
> int err;
>
>- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>- if (!skb)
>- return NULL;
>+ payload = skb_put(skb, len);
>+ err = memcpy_from_msg(payload, info->msg, len);
>+ if (err)
>+ return -1;
>+
>+ if (msg_data_left(info->msg))
>+ return 0;
>+
>+ return 0;
>+}
>+
>+static void virtio_transport_init_hdr(struct sk_buff *skb,
>+ struct virtio_vsock_pkt_info *info,
>+ u32 src_cid,
>+ u32 src_port,
>+ u32 dst_cid,
>+ u32 dst_port,
>+ size_t len)
>+{
>+ struct virtio_vsock_hdr *hdr;
>
> hdr = virtio_vsock_hdr(skb);
> hdr->type = cpu_to_le16(info->type);
>@@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
> hdr->dst_port = cpu_to_le32(dst_port);
> hdr->flags = cpu_to_le32(info->flags);
> hdr->len = cpu_to_le32(len);
>-
>- if (info->msg && len > 0) {
>- payload = skb_put(skb, len);
>- err = memcpy_from_msg(payload, info->msg, len);
>- if (err)
>- goto out;
>-
>- if (msg_data_left(info->msg) == 0 &&
>- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>-
>- if (info->msg->msg_flags & MSG_EOR)
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>- }
>- }
>-
>- if (info->reply)
>- virtio_vsock_skb_set_reply(skb);
>-
>- trace_virtio_transport_alloc_pkt(src_cid, src_port,
>- dst_cid, dst_port,
>- len,
>- info->type,
>- info->op,
>- info->flags);
>-
>- if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>- WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>- goto out;
>- }
>-
>- return skb;
>-
>-out:
>- kfree_skb(skb);
>- return NULL;
> }
>
> static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>@@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> }
>
>+/* Returns a new packet on success, otherwise returns NULL.
>+ *
>+ * If NULL is returned, errp is set to a negative errno.

I had noticed this in Bobby's patches, I think it's an old comment we
left around.

>+ */
>+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>+ struct virtio_vsock_pkt_info *info,
>+ size_t payload_len,
>+ bool zcopy,
>+ u32 dst_cid,
>+ u32 dst_port,
>+ u32 src_cid,
>+ u32 src_port)
>+{
>+ struct sk_buff *skb;
>+ size_t skb_len;
>+
>+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>+
>+ if (!zcopy)
>+ skb_len += payload_len;
>+
>+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>+ if (!skb)
>+ return NULL;
>+
>+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
>+ dst_cid, dst_port,
>+ payload_len);
>+
>+ /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>+ * owner of skb without check to update 'sk_wmem_alloc'.
>+ */
>+ if (vsk)
>+ skb_set_owner_w(skb, sk_vsock(vsk));

why we are moving from skb_set_owner_sk_safe() to skb_set_owner_w()?

We should mention this in the commit description.

>+
>+ if (info->msg && payload_len > 0) {
>+ int err;
>+
>+ if (zcopy) {
>+ err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
>+ &info->msg->msg_iter,
>+ payload_len);
>+ } else {
>+ err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
>+ }
>+
>+ if (err)
>+ goto out;
>+
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>+
>+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>+ struct virtio_vsock_hdr *hdr;
>+
>+ hdr = virtio_vsock_hdr(skb);

Just `struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);` should be
fine.

>+
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>+
>+ if (info->msg->msg_flags & MSG_EOR)
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>+ }
>+ }
>+
>+ if (info->reply)
>+ virtio_vsock_skb_set_reply(skb);
>+
>+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
>+ dst_cid, dst_port,
>+ payload_len,
>+ info->type,
>+ info->op,
>+ info->flags);
>+
>+ return skb;
>+out:
>+ kfree_skb(skb);
>+ return NULL;
>+}
>+
> /* This function can only be used on connecting/connected sockets,
> * since a socket assigned to a transport is required.
> *
>@@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> const struct virtio_transport *t_ops;
> struct virtio_vsock_sock *vvs;
> u32 pkt_len = info->pkt_len;
>+ bool can_zcopy = false;
>+ u32 max_skb_cap;
> u32 rest_len;
> int ret;
>
>@@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> return pkt_len;
>
>+ /* If zerocopy is not enabled by 'setsockopt()', we behave as
>+ * there is no MSG_ZEROCOPY flag set.
>+ */
>+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>+ info->flags &= ~MSG_ZEROCOPY;
>+
>+ if (info->flags & MSG_ZEROCOPY)
>+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>+
>+ if (can_zcopy)
>+ max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>+ (MAX_SKB_FRAGS * PAGE_SIZE));
>+ else
>+ max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>+

We use `len` very often, what about `max_skb_len`?

> rest_len = pkt_len;
>
> do {
> struct sk_buff *skb;
> size_t skb_len;
>
>- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>+ skb_len = min(max_skb_cap, rest_len);
>
>- skb = virtio_transport_alloc_skb(info, skb_len,
>- src_cid, src_port,
>- dst_cid, dst_port);
>+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>+ dst_cid, dst_port,
>+ src_cid, src_port);
> if (!skb) {
> ret = -ENOMEM;
> break;
> }
>
>+ /* This is last skb to send this portion of data. */
>+ if (skb_len == rest_len &&
>+ info->flags & MSG_ZEROCOPY &&
>+ info->op == VIRTIO_VSOCK_OP_RW) {
>+ if (virtio_transport_init_zcopy_skb(vsk, skb,
>+ info->msg,
>+ can_zcopy)) {
>+ ret = -ENOMEM;
>+ break;
>+ }
>+ }
>+
> virtio_transport_inc_tx_pkt(vvs, skb);
>
> ret = t_ops->send_pkt(skb);
>@@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
> .msg = msg,
> .pkt_len = len,
> .vsk = vsk,
>+ .flags = msg->msg_flags,

These flags then get copied into the virtio_vsock_hdr, which I don't
think is a good idea.

Why not using directly info->msg->msg_flags?

> };
>
> return virtio_transport_send_pkt_info(vsk, &info);
>@@ -935,11 +1081,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> if (!t)
> return -ENOTCONN;
>
>- reply = virtio_transport_alloc_skb(&info, 0,
>- le64_to_cpu(hdr->dst_cid),
>- le32_to_cpu(hdr->dst_port),
>+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> le64_to_cpu(hdr->src_cid),
>- le32_to_cpu(hdr->src_port));
>+ le32_to_cpu(hdr->src_port),
>+ le64_to_cpu(hdr->dst_cid),
>+ le32_to_cpu(hdr->dst_port));
> if (!reply)
> return -ENOMEM;
>
>--
>2.25.1
>


2023-06-26 16:16:36

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 06/17] vsock: check error queue to set EPOLLERR

On Sat, Jun 03, 2023 at 11:49:28PM +0300, Arseniy Krasnov wrote:
>If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
>reader of error queue won't detect data in it using EPOLLERR bit.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/af_vsock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)

This patch looks like it can go even without this series.

Is it a fix? Should we add a fixes tag?

Thanks,
Stefano

>
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index efb8a0937a13..45fd20c4ed50 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1030,7 +1030,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
> poll_wait(file, sk_sleep(sk), wait);
> mask = 0;
>
>- if (sk->sk_err)
>+ if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
> /* Signify that there has been an error on this socket. */
> mask |= EPOLLERR;
>
>--
>2.25.1
>


2023-06-26 16:17:05

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 11/17] vsock/virtio: support MSG_ZEROCOPY for transport

On Sat, Jun 03, 2023 at 11:49:33PM +0300, Arseniy Krasnov wrote:
>Add 'msgzerocopy_allow()' callback for virtio transport.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/virtio_transport.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>index 6053d8341091..d9ffa16dda69 100644
>--- a/net/vmw_vsock/virtio_transport.c
>+++ b/net/vmw_vsock/virtio_transport.c
>@@ -438,6 +438,11 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
> queue_work(virtio_vsock_workqueue, &vsock->rx_work);
> }
>
>+static bool virtio_transport_msgzerocopy_allow(void)
>+{
>+ return true;
>+}
>+
> static bool virtio_transport_seqpacket_allow(u32 remote_cid);
>
> static struct virtio_transport virtio_transport = {
>@@ -484,6 +489,8 @@ static struct virtio_transport virtio_transport = {
> .notify_buffer_size = virtio_transport_notify_buffer_size,
>
> .read_skb = virtio_transport_read_skb,
>+
>+ .msgzerocopy_allow = virtio_transport_msgzerocopy_allow,

Ditto.

> },
>
> .send_pkt = virtio_transport_send_pkt,
>--
>2.25.1
>


2023-06-26 16:28:09

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 12/17] vsock/loopback: support MSG_ZEROCOPY for transport

On Sat, Jun 03, 2023 at 11:49:34PM +0300, Arseniy Krasnov wrote:
>Add 'msgzerocopy_allow()' callback for loopback transport.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> net/vmw_vsock/vsock_loopback.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
>diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
>index 5c6360df1f31..a2e4aeda2d92 100644
>--- a/net/vmw_vsock/vsock_loopback.c
>+++ b/net/vmw_vsock/vsock_loopback.c
>@@ -47,6 +47,7 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
> }
>
> static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
>+static bool vsock_loopback_msgzerocopy_allow(void);

I don't know why we did this for `vsock_loopback_seqpacket_allow`, but
can we just put the implementation here?

>
> static struct virtio_transport loopback_transport = {
> .transport = {
>@@ -92,11 +93,18 @@ static struct virtio_transport loopback_transport = {
> .notify_buffer_size = virtio_transport_notify_buffer_size,
>
> .read_skb = virtio_transport_read_skb,
>+
>+ .msgzerocopy_allow = vsock_loopback_msgzerocopy_allow,

Ditto the moving.

> },
>
> .send_pkt = vsock_loopback_send_pkt,
> };
>
>+static bool vsock_loopback_msgzerocopy_allow(void)
>+{
>+ return true;
>+}
>+
> static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
> {
> return true;
>--
>2.25.1
>


2023-06-26 16:28:19

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>Hello,
>
> DESCRIPTION
>
>this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>current implementation for TCP as much as possible:
>
>1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
> flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
> flag will be ignored (e.g. without completion).
>
>2) Kernel uses completions from socket's error queue. Single completion
> for single tx syscall (or it can merge several completions to single
> one). I used already implemented logic for MSG_ZEROCOPY support:
> 'msg_zerocopy_realloc()' etc.
>
>Difference with copy way is not significant. During packet allocation,
>non-linear skb is created and filled with pinned user pages.
>There are also some updates for vhost and guest parts of transport - in
>both cases i've added handling of non-linear skb for virtio part. vhost
>copies data from such skb to the guest's rx virtio buffers. In the guest,
>virtio transport fills tx virtio queue with pages from skb.
>
>Head of this patchset is:
>https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>
>
>This version has several limits/problems:
>
>1) As this feature totally depends on transport, there is no way (or it
> is difficult) to check whether transport is able to handle it or not
> during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
> setsockopt callback from setsockopt callback for SOL_SOCKET, but this
> leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
> are not considered to be called from each other. So in current version
> SO_ZEROCOPY is set successfully to any type (e.g. transport) of
> AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
> tx routine will fail with EOPNOTSUPP.
>
> ^^^
> This is still no resolved :(
>
>2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
> one completion. In each completion there is flag which shows how tx
> was performed: zerocopy or copy. This leads that whole message must
> be send in zerocopy or copy way - we can't send part of message with
> copying and rest of message with zerocopy mode (or vice versa). Now,
> we need to account vsock credit logic, e.g. we can't send whole data
> once - only allowed number of bytes could sent at any moment. In case
> of copying way there is no problem as in worst case we can send single
> bytes, but zerocopy is more complex because smallest transmission
> unit is single page. So if there is not enough space at peer's side
> to send integer number of pages (at least one) - we will wait, thus
> stalling tx side. To overcome this problem i've added simple rule -
> zerocopy is possible only when there is enough space at another side
> for whole message (to check, that current 'msghdr' was already used
> in previous tx iterations i use 'iov_offset' field of it's iov iter).
>
> ^^^
> Discussed as ok during v2. Link:
> https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>
>3) loopback transport is not supported, because it requires to implement
> non-linear skb handling in dequeue logic (as we "send" fragged skb
> and "receive" it from the same queue). I'm going to implement it in
> next versions.
>
> ^^^ fixed in v2
>
>4) Current implementation sets max length of packet to 64KB. IIUC this
> is due to 'kmalloc()' allocated data buffers. I think, in case of
> MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
> not touched for data - user space pages are used as buffers. Also
> this limit trims every message which is > 64KB, thus such messages
> will be send in copy mode due to 'iov_offset' check in 2).
>
> ^^^ fixed in v2
>
> PATCHSET STRUCTURE
>
>Patchset has the following structure:
>1) Handle non-linear skbuff on receive in virtio/vhost.
>2) Handle non-linear skbuff on send in virtio/vhost.
>3) Updates for AF_VSOCK.
>4) Enable MSG_ZEROCOPY support on transports.
>5) Tests/tools/docs updates.
>
> PERFORMANCE
>
>Performance: it is a little bit tricky to compare performance between
>copy and zerocopy transmissions. In zerocopy way we need to wait when
>user buffers will be released by kernel, so it is like synchronous
>path (wait until device driver will process it), while in copy way we
>can feed data to kernel as many as we want, don't care about device
>driver. So I compared only time which we spend in the 'send()' syscall.
>Then if this value will be combined with total number of transmitted
>bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>enough credit, receiver allocates same amount of space as sender needs.
>
>Sender:
>./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>
>Receiver:
>./vsock_perf --vsk-size 256M
>
>I run tests on two setups: desktop with Core i7 - I use this PC for
>development and in this case guest is nested guest, and host is normal
>guest. Another hardware is some embedded board with Atom - here I don't
>have nested virtualization - host runs on hw, and guest is normal guest.
>
>G2H transmission (values are Gbit/s):
>
> Core i7 with nested guest. Atom with normal guest.
>
>*-------------------------------* *-------------------------------*
>| | | | | | | |
>| buf size | copy | zerocopy | | buf size | copy | zerocopy |
>| | | | | | | |
>*-------------------------------* *-------------------------------*
>| 4KB | 3 | 10 | | 4KB | 0.8 | 1.9 |
>*-------------------------------* *-------------------------------*
>| 32KB | 20 | 61 | | 32KB | 6.8 | 20.2 |
>*-------------------------------* *-------------------------------*
>| 256KB | 33 | 244 | | 256KB | 7.8 | 55 |
>*-------------------------------* *-------------------------------*
>| 1M | 30 | 373 | | 1M | 7 | 95 |
>*-------------------------------* *-------------------------------*
>| 8M | 22 | 475 | | 8M | 7 | 114 |
>*-------------------------------* *-------------------------------*
>
>H2G:
>
> Core i7 with nested guest. Atom with normal guest.
>
>*-------------------------------* *-------------------------------*
>| | | | | | | |
>| buf size | copy | zerocopy | | buf size | copy | zerocopy |
>| | | | | | | |
>*-------------------------------* *-------------------------------*
>| 4KB | 20 | 10 | | 4KB | 4.37 | 3 |
>*-------------------------------* *-------------------------------*
>| 32KB | 37 | 75 | | 32KB | 11 | 18 |
>*-------------------------------* *-------------------------------*
>| 256KB | 44 | 299 | | 256KB | 11 | 62 |
>*-------------------------------* *-------------------------------*
>| 1M | 28 | 335 | | 1M | 9 | 77 |
>*-------------------------------* *-------------------------------*
>| 8M | 27 | 417 | | 8M | 9.35 | 115 |
>*-------------------------------* *-------------------------------*
>
> * Let's look to the first line of both tables - where copy is better
> than zerocopy. I analyzed this case more deeply and found that
> bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
> caller spends too much time in it with zerocopy mode (comparing to
> copy mode). This happens only with 4K buffer size. This function just
> calls 'wake_up_process()' and its internal logic does not depends on
> skb, so i think potential reason (may be) is interval between two
> calls of this function (e.g. how often it is called). Note, that
> 'vhost_work_queue()' differs from the same function at guest's side of
> transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
> i think is more optimized for worker purposes, than direct call to
> 'wake_up_process()'. But again - this is just my assumption.

Thanks for the analysis, however for small payloads it makes sense that
the cost might be too high that optimization does not bring benefits.

>
>Loopback:
>
> Core i7 with nested guest. Atom with normal guest.
>
>*-------------------------------* *-------------------------------*
>| | | | | | | |
>| buf size | copy | zerocopy | | buf size | copy | zerocopy |
>| | | | | | | |
>*-------------------------------* *-------------------------------*
>| 4KB | 8 | 7 | | 4KB | 1.8 | 1.3 |
>*-------------------------------* *-------------------------------*
>| 32KB | 38 | 44 | | 32KB | 10 | 10 |
>*-------------------------------* *-------------------------------*
>| 256KB | 55 | 168 | | 256KB | 15 | 36 |
>*-------------------------------* *-------------------------------*
>| 1M | 53 | 250 | | 1M | 12 | 45 |
>*-------------------------------* *-------------------------------*
>| 8M | 40 | 344 | | 8M | 11 | 74 |
>*-------------------------------* *-------------------------------*
>
>I analyzed performace difference more deeply for the following setup:
>server: ./vsock_perf --vsk-size 16M
>client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>
>In other words I send 16M of data from guest to host in copy/zerocopy
>modes and with two different sizes of buffer - 4K and 64K. Let's see
>to tx path for both modes - it consists of two steps:
>
>copy:
>1) Allocate skb of buffer's length.
>2) Copy data to skb from buffer.
>
>zerocopy:
>1) Allocate skb with header space only.
>2) Pin pages of the buffer and insert them to skb.
>
>I measured average number of ns (returned by 'ktime_get()') for each
>step above:
>1) Skb allocation (for both copy and zerocopy modes).
>2) For copy mode in 'memcpy_to_msg()' - copying.
>3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>
>Here are results for copy mode:
>*-------------------------------------*
>| buf | skb alloc | 'memcpy_to_msg()' |
>*-------------------------------------*
>| | | |
>| 64K | 5000ns | 25000ns |
>| | | |
>*-------------------------------------*
>| | | |
>| 4K | 800ns | 2200ns |
>| | | |
>*-------------------------------------*
>
>Here are results for zerocopy mode:
>*-----------------------------------------------*
>| buf | skb alloc | '__zerocopy_sg_from_iter()' |
>*-----------------------------------------------*
>| | | |
>| 64K | 250ns | 3500ns |
>| | | |
>*-----------------------------------------------*
>| | | |
>| 4K | 250ns | 3000ns |
>| | | |
>*-----------------------------------------------*
>
>I guess that reason of zerocopy performance is low overhead for page
>pinning: there is big difference between 4K and 64K in case of copying
>(25000 vs 2200), but in pinning case - just 3000 vs 3500.
>
>So, zerocopy is faster than classic copy mode, but of course it requires
>specific architecture of application due to user pages pinning, buffer
>size and alignment.

Makes sense!

>
> NOTES
>
>If host fails to send data with "Cannot allocate memory", check value
>/proc/sys/net/core/optmem_max - it is accounted during completion skb
>allocation. Try to update it to for example 1M and try send again:
>"echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>
> TESTING
>
>This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>cover new code as much as possible so there are different cases for
>MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>vector types (different sizes, alignments, with unmapped pages). I also
>run tests with loopback transport and run vsockmon. In v3 i've added
>io_uring test as separated application.
>
> LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>
>In v3 Stefano Garzarella <[email protected]> asked to split this patchset
>for several parts, because it looks too big for review. I think in this
>version (v4) we can do it in the following way:
>
>[0001 - 0005] - this is preparation for virtio/vhost part.
>[0006 - 0009] - this is preparation for AF_VSOCK part.
>[0010 - 0013] - these patches allows to trigger logic from the previous
> two parts.
>[0014 - rest] - updates for doc, tests, utils. This part doesn't touch
> kernel code and looks not critical.

Yeah, I like this split, but I'd include 14 in the (10, 13) group.

I have reviewed most of them and I think we are well on our way :-)
I've already seen that Bobby suggested changes for v5, so I'll review
that version better.

Great work so far!

Thanks,
Stefano


2023-06-26 16:28:57

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 10/17] vhost/vsock: support MSG_ZEROCOPY for transport

On Sat, Jun 03, 2023 at 11:49:32PM +0300, Arseniy Krasnov wrote:
>Add 'msgzerocopy_allow()' callback for vhost transport.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> drivers/vhost/vsock.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index b254aa4b756a..318866713ef7 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -396,6 +396,11 @@ static bool vhost_vsock_more_replies(struct vhost_vsock *vsock)
> return val < vq->num;
> }
>
>+static bool vhost_transport_msgzerocopy_allow(void)
>+{
>+ return true;
>+}
>+
> static bool vhost_transport_seqpacket_allow(u32 remote_cid);
>
> static struct virtio_transport vhost_transport = {
>@@ -442,6 +447,7 @@ static struct virtio_transport vhost_transport = {
> .notify_buffer_size = virtio_transport_notify_buffer_size,
>
> .read_skb = virtio_transport_read_skb,
>+ .msgzerocopy_allow = vhost_transport_msgzerocopy_allow,

Can we move this after .seqpacket section?

> },
>
> .send_pkt = vhost_transport_send_pkt,
>--
>2.25.1
>


2023-06-27 05:32:37

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support



On 26.06.2023 19:15, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:
>> Hello,
>>
>>                           DESCRIPTION
>>
>> this is MSG_ZEROCOPY feature support for virtio/vsock. I tried to follow
>> current implementation for TCP as much as possible:
>>
>> 1) Sender must enable SO_ZEROCOPY flag to use this feature. Without this
>>   flag, data will be sent in "classic" copy manner and MSG_ZEROCOPY
>>   flag will be ignored (e.g. without completion).
>>
>> 2) Kernel uses completions from socket's error queue. Single completion
>>   for single tx syscall (or it can merge several completions to single
>>   one). I used already implemented logic for MSG_ZEROCOPY support:
>>   'msg_zerocopy_realloc()' etc.
>>
>> Difference with copy way is not significant. During packet allocation,
>> non-linear skb is created and filled with pinned user pages.
>> There are also some updates for vhost and guest parts of transport - in
>> both cases i've added handling of non-linear skb for virtio part. vhost
>> copies data from such skb to the guest's rx virtio buffers. In the guest,
>> virtio transport fills tx virtio queue with pages from skb.
>>
>> Head of this patchset is:
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=d20dd0ea14072e8a90ff864b2c1603bd68920b4b
>>
>>
>> This version has several limits/problems:
>>
>> 1) As this feature totally depends on transport, there is no way (or it
>>   is difficult) to check whether transport is able to handle it or not
>>   during SO_ZEROCOPY setting. Seems I need to call AF_VSOCK specific
>>   setsockopt callback from setsockopt callback for SOL_SOCKET, but this
>>   leads to lock problem, because both AF_VSOCK and SOL_SOCKET callback
>>   are not considered to be called from each other. So in current version
>>   SO_ZEROCOPY is set successfully to any type (e.g. transport) of
>>   AF_VSOCK socket, but if transport does not support MSG_ZEROCOPY,
>>   tx routine will fail with EOPNOTSUPP.
>>
>>   ^^^
>>   This is still no resolved :(
>>
>> 2) When MSG_ZEROCOPY is used, for each tx system call we need to enqueue
>>   one completion. In each completion there is flag which shows how tx
>>   was performed: zerocopy or copy. This leads that whole message must
>>   be send in zerocopy or copy way - we can't send part of message with
>>   copying and rest of message with zerocopy mode (or vice versa). Now,
>>   we need to account vsock credit logic, e.g. we can't send whole data
>>   once - only allowed number of bytes could sent at any moment. In case
>>   of copying way there is no problem as in worst case we can send single
>>   bytes, but zerocopy is more complex because smallest transmission
>>   unit is single page. So if there is not enough space at peer's side
>>   to send integer number of pages (at least one) - we will wait, thus
>>   stalling tx side. To overcome this problem i've added simple rule -
>>   zerocopy is possible only when there is enough space at another side
>>   for whole message (to check, that current 'msghdr' was already used
>>   in previous tx iterations i use 'iov_offset' field of it's iov iter).
>>
>>   ^^^
>>   Discussed as ok during v2. Link:
>>   https://lore.kernel.org/netdev/23guh3txkghxpgcrcjx7h62qsoj3xgjhfzgtbmqp2slrz3rxr4@zya2z7kwt75l/
>>
>> 3) loopback transport is not supported, because it requires to implement
>>   non-linear skb handling in dequeue logic (as we "send" fragged skb
>>   and "receive" it from the same queue). I'm going to implement it in
>>   next versions.
>>
>>   ^^^ fixed in v2
>>
>> 4) Current implementation sets max length of packet to 64KB. IIUC this
>>   is due to 'kmalloc()' allocated data buffers. I think, in case of
>>   MSG_ZEROCOPY this value could be increased, because 'kmalloc()' is
>>   not touched for data - user space pages are used as buffers. Also
>>   this limit trims every message which is > 64KB, thus such messages
>>   will be send in copy mode due to 'iov_offset' check in 2).
>>
>>   ^^^ fixed in v2
>>
>>                         PATCHSET STRUCTURE
>>
>> Patchset has the following structure:
>> 1) Handle non-linear skbuff on receive in virtio/vhost.
>> 2) Handle non-linear skbuff on send in virtio/vhost.
>> 3) Updates for AF_VSOCK.
>> 4) Enable MSG_ZEROCOPY support on transports.
>> 5) Tests/tools/docs updates.
>>
>>                            PERFORMANCE
>>
>> Performance: it is a little bit tricky to compare performance between
>> copy and zerocopy transmissions. In zerocopy way we need to wait when
>> user buffers will be released by kernel, so it is like synchronous
>> path (wait until device driver will process it), while in copy way we
>> can feed data to kernel as many as we want, don't care about device
>> driver. So I compared only time which we spend in the 'send()' syscall.
>> Then if this value will be combined with total number of transmitted
>> bytes, we can get Gbit/s parameter. Also to avoid tx stalls due to not
>> enough credit, receiver allocates same amount of space as sender needs.
>>
>> Sender:
>> ./vsock_perf --sender <CID> --buf-size <buf size> --bytes 256M [--zc]
>>
>> Receiver:
>> ./vsock_perf --vsk-size 256M
>>
>> I run tests on two setups: desktop with Core i7 - I use this PC for
>> development and in this case guest is nested guest, and host is normal
>> guest. Another hardware is some embedded board with Atom - here I don't
>> have nested virtualization - host runs on hw, and guest is normal guest.
>>
>> G2H transmission (values are Gbit/s):
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |    3    |    10    |   |   4KB    |   0.8   |   1.9    |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   20    |    61    |   |   32KB   |   6.8   |   20.2   |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   33    |   244    |   |   256KB  |   7.8   |   55     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   30    |   373    |   |    1M    |   7     |   95     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   22    |   475    |   |    8M    |   7     |   114    |
>> *-------------------------------*   *-------------------------------*
>>
>> H2G:
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |   20    |    10    |   |   4KB    |   4.37  |    3     |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   37    |    75    |   |   32KB   |   11    |   18     |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   44    |   299    |   |   256KB  |   11    |   62     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   28    |   335    |   |    1M    |   9     |   77     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   27    |   417    |   |    8M    |  9.35   |  115     |
>> *-------------------------------*   *-------------------------------*
>>
>> * Let's look to the first line of both tables - where copy is better
>>   than zerocopy. I analyzed this case more deeply and found that
>>   bottleneck is function 'vhost_work_queue()'. With 4K buffer size,
>>   caller spends too much time in it with zerocopy mode (comparing to
>>   copy mode). This happens only with 4K buffer size. This function just
>>   calls 'wake_up_process()' and its internal logic does not depends on
>>   skb, so i think potential reason (may be) is interval between two
>>   calls of this function (e.g. how often it is called). Note, that
>>   'vhost_work_queue()' differs from the same function at guest's side of
>>   transport: 'virtio_transport_send_pkt()' uses 'queue_work()' which
>>   i think is more optimized for worker purposes, than direct call to
>>   'wake_up_process()'. But again - this is just my assumption.
>
> Thanks for the analysis, however for small payloads it makes sense that
> the cost might be too high that optimization does not bring benefits.
>
>>
>> Loopback:
>>
>>   Core i7 with nested guest.            Atom with normal guest.
>>
>> *-------------------------------*   *-------------------------------*
>> |          |         |          |   |          |         |          |
>> | buf size |   copy  | zerocopy |   | buf size |   copy  | zerocopy |
>> |          |         |          |   |          |         |          |
>> *-------------------------------*   *-------------------------------*
>> |   4KB    |    8    |     7    |   |   4KB    |   1.8   |   1.3    |
>> *-------------------------------*   *-------------------------------*
>> |   32KB   |   38    |    44    |   |   32KB   |   10    |   10     |
>> *-------------------------------*   *-------------------------------*
>> |   256KB  |   55    |   168    |   |   256KB  |   15    |   36     |
>> *-------------------------------*   *-------------------------------*
>> |    1M    |   53    |   250    |   |    1M    |   12    |   45     |
>> *-------------------------------*   *-------------------------------*
>> |    8M    |   40    |   344    |   |    8M    |   11    |   74     |
>> *-------------------------------*   *-------------------------------*
>>
>> I analyzed performace difference more deeply for the following setup:
>> server: ./vsock_perf --vsk-size 16M
>> client: ./vsock_perf --sender 2 --bytes 16M --buf-size 16K/4K [--zc]
>>
>> In other words I send 16M of data from guest to host in copy/zerocopy
>> modes and with two different sizes of buffer - 4K and 64K. Let's see
>> to tx path for both modes - it consists of two steps:
>>
>> copy:
>> 1) Allocate skb of buffer's length.
>> 2) Copy data to skb from buffer.
>>
>> zerocopy:
>> 1) Allocate skb with header space only.
>> 2) Pin pages of the buffer and insert them to skb.
>>
>> I measured average number of ns (returned by 'ktime_get()') for each
>> step above:
>> 1) Skb allocation (for both copy and zerocopy modes).
>> 2) For copy mode in 'memcpy_to_msg()' - copying.
>> 3) For zerocopy mode in '__zerocopy_sg_from_iter()' - pinning.
>>
>> Here are results for copy mode:
>> *-------------------------------------*
>> | buf | skb alloc | 'memcpy_to_msg()' |
>> *-------------------------------------*
>> |     |           |                   |
>> | 64K |  5000ns   |      25000ns      |
>> |     |           |                   |
>> *-------------------------------------*
>> |     |           |                   |
>> | 4K  |  800ns    |      2200ns       |
>> |     |           |                   |
>> *-------------------------------------*
>>
>> Here are results for zerocopy mode:
>> *-----------------------------------------------*
>> | buf | skb alloc | '__zerocopy_sg_from_iter()' |
>> *-----------------------------------------------*
>> |     |           |                             |
>> | 64K |  250ns    |          3500ns             |
>> |     |           |                             |
>> *-----------------------------------------------*
>> |     |           |                             |
>> | 4K  |  250ns    |          3000ns             |
>> |     |           |                             |
>> *-----------------------------------------------*
>>
>> I guess that reason of zerocopy performance is low overhead for page
>> pinning: there is big difference between 4K and 64K in case of copying
>> (25000 vs 2200), but in pinning case - just 3000 vs 3500.
>>
>> So, zerocopy is faster than classic copy mode, but of course it requires
>> specific architecture of application due to user pages pinning, buffer
>> size and alignment.
>
> Makes sense!
>
>>
>>                             NOTES
>>
>> If host fails to send data with "Cannot allocate memory", check value
>> /proc/sys/net/core/optmem_max - it is accounted during completion skb
>> allocation. Try to update it to for example 1M and try send again:
>> "echo 1048576 > /proc/sys/net/core/optmem_max" (as root).
>>
>>                            TESTING
>>
>> This patchset includes set of tests for MSG_ZEROCOPY feature. I tried to
>> cover new code as much as possible so there are different cases for
>> MSG_ZEROCOPY transmissions: with disabled SO_ZEROCOPY and several io
>> vector types (different sizes, alignments, with unmapped pages). I also
>> run tests with loopback transport and run vsockmon. In v3 i've added
>> io_uring test as separated application.
>>
>>           LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>
>> In v3 Stefano Garzarella <[email protected]> asked to split this patchset
>> for several parts, because it looks too big for review. I think in this
>> version (v4) we can do it in the following way:
>>
>> [0001 - 0005] - this is preparation for virtio/vhost part.
>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>> [0010 - 0013] - these patches allows to trigger logic from the previous
>>                two parts.
>> [0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>>                kernel code and looks not critical.
>
> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>
> I have reviewed most of them and I think we are well on our way :-)
> I've already seen that Bobby suggested changes for v5, so I'll review
> that version better.
>
> Great work so far!

Hello Stefano!

Thanks for review! I left some questions, but most of comments are clear
for me. So I guess that idea of split is that I still keep all patches in
a big single patchset, but preserve structure described above and we will
do review process step by step according split?

Or I should split this patchset for 3 separated sets? I guess this will be
more complex to review...

Thanks, Arseniy

>
> Thanks,
> Stefano
>

2023-06-27 05:38:07

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support



On 26.06.2023 19:03, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:27PM +0300, Arseniy Krasnov wrote:
>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> flag is set and zerocopy transmission is possible, then non-linear skb
>> will be created and filled with the pages of user's buffer. Pages of
>> user's buffer are locked in memory by 'get_user_pages()'.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
>> 1 file changed, 208 insertions(+), 62 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 0de562c1dc4b..f1ec38c72db7 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>     return container_of(t, struct virtio_transport, transport);
>> }
>>
>> -/* Returns a new packet on success, otherwise returns NULL.
>> - *
>> - * If NULL is returned, errp is set to a negative errno.
>> - */
>> -static struct sk_buff *
>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> -               size_t len,
>> -               u32 src_cid,
>> -               u32 src_port,
>> -               u32 dst_cid,
>> -               u32 dst_port)
>> -{
>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>> -    struct virtio_vsock_hdr *hdr;
>> -    struct sk_buff *skb;
>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>> +                       size_t max_to_send)
>> +{
>> +    struct iov_iter *iov_iter;
>> +
>> +    if (!info->msg)
>> +        return false;
>> +
>> +    iov_iter = &info->msg->msg_iter;
>> +
>> +    /* Data is simple buffer. */
>> +    if (iter_is_ubuf(iov_iter))
>> +        return true;
>> +
>> +    if (!iter_is_iovec(iov_iter))
>> +        return false;
>> +
>> +    if (iov_iter->iov_offset)
>> +        return false;
>> +
>> +    /* We can't send whole iov. */
>> +    if (iov_iter->count > max_to_send)
>> +        return false;
>> +
>> +    return true;
>> +}
>> +
>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>> +                       struct sk_buff *skb,
>> +                       struct msghdr *msg,
>> +                       bool zerocopy)
>> +{
>> +    struct ubuf_info *uarg;
>> +
>> +    if (msg->msg_ubuf) {
>> +        uarg = msg->msg_ubuf;
>> +        net_zcopy_get(uarg);
>> +    } else {
>> +        struct iov_iter *iter = &msg->msg_iter;
>> +        struct ubuf_info_msgzc *uarg_zc;
>> +        int len;
>> +
>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>> +         * checked before.
>> +         */
>> +        if (iter_is_iovec(iter))
>> +            len = iov_length(iter->__iov, iter->nr_segs);
>> +        else
>> +            len = iter->count;
>> +
>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> +                        len,
>> +                        NULL);
>> +
>> +        if (!uarg)
>> +            return -1;
>> +
>> +        uarg_zc = uarg_to_msgzc(uarg);
>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>> +    }
>> +
>> +    skb_zcopy_init(skb, uarg);
>> +
>> +    return 0;
>> +}
>> +
>> +static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
>> +                        struct vsock_sock *vsk,
>
> `vsk` seems unused
>
>> +                        struct virtio_vsock_pkt_info *info,
>> +                        size_t len)
>> +{
>>     void *payload;
>>     int err;
>>
>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> -    if (!skb)
>> -        return NULL;
>> +    payload = skb_put(skb, len);
>> +    err = memcpy_from_msg(payload, info->msg, len);
>> +    if (err)
>> +        return -1;
>> +
>> +    if (msg_data_left(info->msg))
>> +        return 0;
>> +
>> +    return 0;
>> +}
>> +
>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>> +                      struct virtio_vsock_pkt_info *info,
>> +                      u32 src_cid,
>> +                      u32 src_port,
>> +                      u32 dst_cid,
>> +                      u32 dst_port,
>> +                      size_t len)
>> +{
>> +    struct virtio_vsock_hdr *hdr;
>>
>>     hdr = virtio_vsock_hdr(skb);
>>     hdr->type    = cpu_to_le16(info->type);
>> @@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>     hdr->dst_port    = cpu_to_le32(dst_port);
>>     hdr->flags    = cpu_to_le32(info->flags);
>>     hdr->len    = cpu_to_le32(len);
>> -
>> -    if (info->msg && len > 0) {
>> -        payload = skb_put(skb, len);
>> -        err = memcpy_from_msg(payload, info->msg, len);
>> -        if (err)
>> -            goto out;
>> -
>> -        if (msg_data_left(info->msg) == 0 &&
>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> -
>> -            if (info->msg->msg_flags & MSG_EOR)
>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> -        }
>> -    }
>> -
>> -    if (info->reply)
>> -        virtio_vsock_skb_set_reply(skb);
>> -
>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> -                     dst_cid, dst_port,
>> -                     len,
>> -                     info->type,
>> -                     info->op,
>> -                     info->flags);
>> -
>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>> -        goto out;
>> -    }
>> -
>> -    return skb;
>> -
>> -out:
>> -    kfree_skb(skb);
>> -    return NULL;
>> }
>>
>> static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>> @@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>         return VIRTIO_VSOCK_TYPE_SEQPACKET;
>> }
>>
>> +/* Returns a new packet on success, otherwise returns NULL.
>> + *
>> + * If NULL is returned, errp is set to a negative errno.
>
> I had noticed this in Bobby's patches, I think it's an old comment we
> left around.
>
>> + */
>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>> +                          struct virtio_vsock_pkt_info *info,
>> +                          size_t payload_len,
>> +                          bool zcopy,
>> +                          u32 dst_cid,
>> +                          u32 dst_port,
>> +                          u32 src_cid,
>> +                          u32 src_port)
>> +{
>> +    struct sk_buff *skb;
>> +    size_t skb_len;
>> +
>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>> +
>> +    if (!zcopy)
>> +        skb_len += payload_len;
>> +
>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> +    if (!skb)
>> +        return NULL;
>> +
>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>> +                  dst_cid, dst_port,
>> +                  payload_len);
>> +
>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>> +     */
>> +    if (vsk)
>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>
> why we are moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
>
> We should mention this in the commit description.
>
>> +
>> +    if (info->msg && payload_len > 0) {
>> +        int err;
>> +
>> +        if (zcopy) {
>> +            err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
>> +                              &info->msg->msg_iter,
>> +                              payload_len);
>> +        } else {
>> +            err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
>> +        }
>> +
>> +        if (err)
>> +            goto out;
>> +
>> +        VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>> +
>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> +            struct virtio_vsock_hdr *hdr;
>> +
>> +            hdr = virtio_vsock_hdr(skb);
>
> Just `struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);` should be
> fine.
>
>> +
>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +
>> +            if (info->msg->msg_flags & MSG_EOR)
>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> +        }
>> +    }
>> +
>> +    if (info->reply)
>> +        virtio_vsock_skb_set_reply(skb);
>> +
>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> +                     dst_cid, dst_port,
>> +                     payload_len,
>> +                     info->type,
>> +                     info->op,
>> +                     info->flags);
>> +
>> +    return skb;
>> +out:
>> +    kfree_skb(skb);
>> +    return NULL;
>> +}
>> +
>> /* This function can only be used on connecting/connected sockets,
>>  * since a socket assigned to a transport is required.
>>  *
>> @@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>     const struct virtio_transport *t_ops;
>>     struct virtio_vsock_sock *vvs;
>>     u32 pkt_len = info->pkt_len;
>> +    bool can_zcopy = false;
>> +    u32 max_skb_cap;
>>     u32 rest_len;
>>     int ret;
>>
>> @@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>     if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>         return pkt_len;
>>
>> +    /* If zerocopy is not enabled by 'setsockopt()', we behave as
>> +     * there is no MSG_ZEROCOPY flag set.
>> +     */
>> +    if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> +        info->flags &= ~MSG_ZEROCOPY;
>> +
>> +    if (info->flags & MSG_ZEROCOPY)
>> +        can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>> +
>> +    if (can_zcopy)
>> +        max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>> +                    (MAX_SKB_FRAGS * PAGE_SIZE));
>> +    else
>> +        max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>> +
>
> We use `len` very often, what about `max_skb_len`?
>
>>     rest_len = pkt_len;
>>
>>     do {
>>         struct sk_buff *skb;
>>         size_t skb_len;
>>
>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>> +        skb_len = min(max_skb_cap, rest_len);
>>
>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>> -                         src_cid, src_port,
>> -                         dst_cid, dst_port);
>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>> +                         dst_cid, dst_port,
>> +                         src_cid, src_port);
>>         if (!skb) {
>>             ret = -ENOMEM;
>>             break;
>>         }
>>
>> +        /* This is last skb to send this portion of data. */
>> +        if (skb_len == rest_len &&
>> +            info->flags & MSG_ZEROCOPY &&
>> +            info->op == VIRTIO_VSOCK_OP_RW) {
>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>> +                                info->msg,
>> +                                can_zcopy)) {
>> +                ret = -ENOMEM;
>> +                break;
>> +            }
>> +        }
>> +
>>         virtio_transport_inc_tx_pkt(vvs, skb);
>>
>>         ret = t_ops->send_pkt(skb);
>> @@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
>>         .msg = msg,
>>         .pkt_len = len,
>>         .vsk = vsk,
>> +        .flags = msg->msg_flags,
>
> These flags then get copied into the virtio_vsock_hdr, which I don't
> think is a good idea.
>
> Why not using directly info->msg->msg_flags?

Ops, yes, it's a bug, You're right, this is really wrong as there are two different
sets of flags - MSG_XXX passed to syscall and flags in the header of packet.

Thanks, Arseniy

>
>>     };
>>
>>     return virtio_transport_send_pkt_info(vsk, &info);
>> @@ -935,11 +1081,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>     if (!t)
>>         return -ENOTCONN;
>>
>> -    reply = virtio_transport_alloc_skb(&info, 0,
>> -                       le64_to_cpu(hdr->dst_cid),
>> -                       le32_to_cpu(hdr->dst_port),
>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>                        le64_to_cpu(hdr->src_cid),
>> -                       le32_to_cpu(hdr->src_port));
>> +                       le32_to_cpu(hdr->src_port),
>> +                       le64_to_cpu(hdr->dst_cid),
>> +                       le32_to_cpu(hdr->dst_port));
>>     if (!reply)
>>         return -ENOMEM;
>>
>> -- 
>> 2.25.1
>>
>

2023-06-27 05:47:12

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v4 06/17] vsock: check error queue to set EPOLLERR



On 26.06.2023 19:04, Stefano Garzarella wrote:
> On Sat, Jun 03, 2023 at 11:49:28PM +0300, Arseniy Krasnov wrote:
>> If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
>> reader of error queue won't detect data in it using EPOLLERR bit.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> net/vmw_vsock/af_vsock.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> This patch looks like it can go even without this series.
>
> Is it a fix? Should we add a fixes tag?

Yes, it is fix and I can exclude it from this set to reduce number
of patches, but there is no reproducer for this without MSG_ZEROCOPY
support - at this moment this feature is the only user of error queue
for AF_VSOCK.

Thanks, Arseniy

>
> Thanks,
> Stefano
>
>>
>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> index efb8a0937a13..45fd20c4ed50 100644
>> --- a/net/vmw_vsock/af_vsock.c
>> +++ b/net/vmw_vsock/af_vsock.c
>> @@ -1030,7 +1030,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
>>     poll_wait(file, sk_sleep(sk), wait);
>>     mask = 0;
>>
>> -    if (sk->sk_err)
>> +    if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
>>         /* Signify that there has been an error on this socket. */
>>         mask |= EPOLLERR;
>>
>> -- 
>> 2.25.1
>>
>

2023-06-27 08:09:42

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support

On Tue, Jun 27, 2023 at 07:41:51AM +0300, Arseniy Krasnov wrote:
>
>
>On 26.06.2023 19:03, Stefano Garzarella wrote:
>> On Sat, Jun 03, 2023 at 11:49:27PM +0300, Arseniy Krasnov wrote:
>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>> flag is set and zerocopy transmission is possible, then non-linear skb
>>> will be created and filled with the pages of user's buffer. Pages of
>>> user's buffer are locked in memory by 'get_user_pages()'.
>>>
>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>> ---
>>> net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
>>> 1 file changed, 208 insertions(+), 62 deletions(-)
>>>
>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index 0de562c1dc4b..f1ec38c72db7 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>> ????return container_of(t, struct virtio_transport, transport);
>>> }
>>>
>>> -/* Returns a new packet on success, otherwise returns NULL.
>>> - *
>>> - * If NULL is returned, errp is set to a negative errno.
>>> - */
>>> -static struct sk_buff *
>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>> -?????????????? size_t len,
>>> -?????????????? u32 src_cid,
>>> -?????????????? u32 src_port,
>>> -?????????????? u32 dst_cid,
>>> -?????????????? u32 dst_port)
>>> -{
>>> -??? const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>> -??? struct virtio_vsock_hdr *hdr;
>>> -??? struct sk_buff *skb;
>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>> +?????????????????????? size_t max_to_send)
>>> +{
>>> +??? struct iov_iter *iov_iter;
>>> +
>>> +??? if (!info->msg)
>>> +??????? return false;
>>> +
>>> +??? iov_iter = &info->msg->msg_iter;
>>> +
>>> +??? /* Data is simple buffer. */
>>> +??? if (iter_is_ubuf(iov_iter))
>>> +??????? return true;
>>> +
>>> +??? if (!iter_is_iovec(iov_iter))
>>> +??????? return false;
>>> +
>>> +??? if (iov_iter->iov_offset)
>>> +??????? return false;
>>> +
>>> +??? /* We can't send whole iov. */
>>> +??? if (iov_iter->count > max_to_send)
>>> +??????? return false;
>>> +
>>> +??? return true;
>>> +}
>>> +
>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> +?????????????????????? struct sk_buff *skb,
>>> +?????????????????????? struct msghdr *msg,
>>> +?????????????????????? bool zerocopy)
>>> +{
>>> +??? struct ubuf_info *uarg;
>>> +
>>> +??? if (msg->msg_ubuf) {
>>> +??????? uarg = msg->msg_ubuf;
>>> +??????? net_zcopy_get(uarg);
>>> +??? } else {
>>> +??????? struct iov_iter *iter = &msg->msg_iter;
>>> +??????? struct ubuf_info_msgzc *uarg_zc;
>>> +??????? int len;
>>> +
>>> +??????? /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>> +???????? * checked before.
>>> +???????? */
>>> +??????? if (iter_is_iovec(iter))
>>> +??????????? len = iov_length(iter->__iov, iter->nr_segs);
>>> +??????? else
>>> +??????????? len = iter->count;
>>> +
>>> +??????? uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> +??????????????????????? len,
>>> +??????????????????????? NULL);
>>> +
>>> +??????? if (!uarg)
>>> +??????????? return -1;
>>> +
>>> +??????? uarg_zc = uarg_to_msgzc(uarg);
>>> +??????? uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>> +??? }
>>> +
>>> +??? skb_zcopy_init(skb, uarg);
>>> +
>>> +??? return 0;
>>> +}
>>> +
>>> +static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
>>> +??????????????????????? struct vsock_sock *vsk,
>>
>> `vsk` seems unused
>>
>>> +??????????????????????? struct virtio_vsock_pkt_info *info,
>>> +??????????????????????? size_t len)
>>> +{
>>> ????void *payload;
>>> ????int err;
>>>
>>> -??? skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> -??? if (!skb)
>>> -??????? return NULL;
>>> +??? payload = skb_put(skb, len);
>>> +??? err = memcpy_from_msg(payload, info->msg, len);
>>> +??? if (err)
>>> +??????? return -1;
>>> +
>>> +??? if (msg_data_left(info->msg))
>>> +??????? return 0;
>>> +
>>> +??? return 0;
>>> +}
>>> +
>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>> +????????????????????? struct virtio_vsock_pkt_info *info,
>>> +????????????????????? u32 src_cid,
>>> +????????????????????? u32 src_port,
>>> +????????????????????? u32 dst_cid,
>>> +????????????????????? u32 dst_port,
>>> +????????????????????? size_t len)
>>> +{
>>> +??? struct virtio_vsock_hdr *hdr;
>>>
>>> ????hdr = virtio_vsock_hdr(skb);
>>> ????hdr->type??? = cpu_to_le16(info->type);
>>> @@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>> ????hdr->dst_port??? = cpu_to_le32(dst_port);
>>> ????hdr->flags??? = cpu_to_le32(info->flags);
>>> ????hdr->len??? = cpu_to_le32(len);
>>> -
>>> -??? if (info->msg && len > 0) {
>>> -??????? payload = skb_put(skb, len);
>>> -??????? err = memcpy_from_msg(payload, info->msg, len);
>>> -??????? if (err)
>>> -??????????? goto out;
>>> -
>>> -??????? if (msg_data_left(info->msg) == 0 &&
>>> -??????????? info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> -??????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> -
>>> -??????????? if (info->msg->msg_flags & MSG_EOR)
>>> -??????????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> -??????? }
>>> -??? }
>>> -
>>> -??? if (info->reply)
>>> -??????? virtio_vsock_skb_set_reply(skb);
>>> -
>>> -??? trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> -???????????????????? dst_cid, dst_port,
>>> -???????????????????? len,
>>> -???????????????????? info->type,
>>> -???????????????????? info->op,
>>> -???????????????????? info->flags);
>>> -
>>> -??? if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>> -??????? WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>> -??????? goto out;
>>> -??? }
>>> -
>>> -??? return skb;
>>> -
>>> -out:
>>> -??? kfree_skb(skb);
>>> -??? return NULL;
>>> }
>>>
>>> static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>>> @@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>> ??????? return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>> }
>>>
>>> +/* Returns a new packet on success, otherwise returns NULL.
>>> + *
>>> + * If NULL is returned, errp is set to a negative errno.
>>
>> I had noticed this in Bobby's patches, I think it's an old comment we
>> left around.
>>
>>> + */
>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>> +????????????????????????? struct virtio_vsock_pkt_info *info,
>>> +????????????????????????? size_t payload_len,
>>> +????????????????????????? bool zcopy,
>>> +????????????????????????? u32 dst_cid,
>>> +????????????????????????? u32 dst_port,
>>> +????????????????????????? u32 src_cid,
>>> +????????????????????????? u32 src_port)
>>> +{
>>> +??? struct sk_buff *skb;
>>> +??? size_t skb_len;
>>> +
>>> +??? skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>> +
>>> +??? if (!zcopy)
>>> +??????? skb_len += payload_len;
>>> +
>>> +??? skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>> +??? if (!skb)
>>> +??????? return NULL;
>>> +
>>> +??? virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>> +????????????????? dst_cid, dst_port,
>>> +????????????????? payload_len);
>>> +
>>> +??? /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>> +???? * owner of skb without check to update 'sk_wmem_alloc'.
>>> +???? */
>>> +??? if (vsk)
>>> +??????? skb_set_owner_w(skb, sk_vsock(vsk));
>>
>> why we are moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
>>
>> We should mention this in the commit description.
>>
>>> +
>>> +??? if (info->msg && payload_len > 0) {
>>> +??????? int err;
>>> +
>>> +??????? if (zcopy) {
>>> +??????????? err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>> +????????????????????????????? &info->msg->msg_iter,
>>> +????????????????????????????? payload_len);
>>> +??????? } else {
>>> +??????????? err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
>>> +??????? }
>>> +
>>> +??????? if (err)
>>> +??????????? goto out;
>>> +
>>> +??????? VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>>> +
>>> +??????? if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>> +??????????? struct virtio_vsock_hdr *hdr;
>>> +
>>> +??????????? hdr = virtio_vsock_hdr(skb);
>>
>> Just `struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);` should be
>> fine.
>>
>>> +
>>> +??????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>> +
>>> +??????????? if (info->msg->msg_flags & MSG_EOR)
>>> +??????????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>> +??????? }
>>> +??? }
>>> +
>>> +??? if (info->reply)
>>> +??????? virtio_vsock_skb_set_reply(skb);
>>> +
>>> +??? trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>> +???????????????????? dst_cid, dst_port,
>>> +???????????????????? payload_len,
>>> +???????????????????? info->type,
>>> +???????????????????? info->op,
>>> +???????????????????? info->flags);
>>> +
>>> +??? return skb;
>>> +out:
>>> +??? kfree_skb(skb);
>>> +??? return NULL;
>>> +}
>>> +
>>> /* This function can only be used on connecting/connected sockets,
>>> ?* since a socket assigned to a transport is required.
>>> ?*
>>> @@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> ????const struct virtio_transport *t_ops;
>>> ????struct virtio_vsock_sock *vvs;
>>> ????u32 pkt_len = info->pkt_len;
>>> +??? bool can_zcopy = false;
>>> +??? u32 max_skb_cap;
>>> ????u32 rest_len;
>>> ????int ret;
>>>
>>> @@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> ????if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>> ??????? return pkt_len;
>>>
>>> +??? /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>> +???? * there is no MSG_ZEROCOPY flag set.
>>> +???? */
>>> +??? if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>> +??????? info->flags &= ~MSG_ZEROCOPY;
>>> +
>>> +??? if (info->flags & MSG_ZEROCOPY)
>>> +??????? can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>> +
>>> +??? if (can_zcopy)
>>> +??????? max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>> +??????????????????? (MAX_SKB_FRAGS * PAGE_SIZE));
>>> +??? else
>>> +??????? max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>> +
>>
>> We use `len` very often, what about `max_skb_len`?
>>
>>> ????rest_len = pkt_len;
>>>
>>> ????do {
>>> ??????? struct sk_buff *skb;
>>> ??????? size_t skb_len;
>>>
>>> -??????? skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>> +??????? skb_len = min(max_skb_cap, rest_len);
>>>
>>> -??????? skb = virtio_transport_alloc_skb(info, skb_len,
>>> -???????????????????????? src_cid, src_port,
>>> -???????????????????????? dst_cid, dst_port);
>>> +??????? skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>> +???????????????????????? dst_cid, dst_port,
>>> +???????????????????????? src_cid, src_port);
>>> ??????? if (!skb) {
>>> ??????????? ret = -ENOMEM;
>>> ??????????? break;
>>> ??????? }
>>>
>>> +??????? /* This is last skb to send this portion of data. */
>>> +??????? if (skb_len == rest_len &&
>>> +??????????? info->flags & MSG_ZEROCOPY &&
>>> +??????????? info->op == VIRTIO_VSOCK_OP_RW) {
>>> +??????????? if (virtio_transport_init_zcopy_skb(vsk, skb,
>>> +??????????????????????????????? info->msg,
>>> +??????????????????????????????? can_zcopy)) {
>>> +??????????????? ret = -ENOMEM;
>>> +??????????????? break;
>>> +??????????? }
>>> +??????? }
>>> +
>>> ??????? virtio_transport_inc_tx_pkt(vvs, skb);
>>>
>>> ??????? ret = t_ops->send_pkt(skb);
>>> @@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
>>> ??????? .msg = msg,
>>> ??????? .pkt_len = len,
>>> ??????? .vsk = vsk,
>>> +??????? .flags = msg->msg_flags,
>>
>> These flags then get copied into the virtio_vsock_hdr, which I don't
>> think is a good idea.
>>
>> Why not using directly info->msg->msg_flags?
>
>Ops, yes, it's a bug, You're right, this is really wrong as there are two different
>sets of flags - MSG_XXX passed to syscall and flags in the header of packet.

Yep.

What about the moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
Was it voluntary? If so, can you explain why?


Thanks,
Stefano


2023-06-27 08:16:52

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 00/17] vsock: MSG_ZEROCOPY flag support

On Tue, Jun 27, 2023 at 07:55:58AM +0300, Arseniy Krasnov wrote:
>
>
>On 26.06.2023 19:15, Stefano Garzarella wrote:
>> On Sat, Jun 03, 2023 at 11:49:22PM +0300, Arseniy Krasnov wrote:

[...]

>>>
>>> ????????? LET'S SPLIT PATCHSET TO MAKE REVIEW EASIER
>>>
>>> In v3 Stefano Garzarella <[email protected]> asked to split this patchset
>>> for several parts, because it looks too big for review. I think in this
>>> version (v4) we can do it in the following way:
>>>
>>> [0001 - 0005] - this is preparation for virtio/vhost part.
>>> [0006 - 0009] - this is preparation for AF_VSOCK part.
>>> [0010 - 0013] - these patches allows to trigger logic from the previous
>>> ?????????????? two parts.
>>> [0014 - rest] - updates for doc, tests, utils. This part doesn't touch
>>> ?????????????? kernel code and looks not critical.
>>
>> Yeah, I like this split, but I'd include 14 in the (10, 13) group.
>>
>> I have reviewed most of them and I think we are well on our way :-)
>> I've already seen that Bobby suggested changes for v5, so I'll review
>> that version better.
>>
>> Great work so far!
>
>Hello Stefano!

Hi Arseniy :-)

>
>Thanks for review! I left some questions, but most of comments are clear
>for me. So I guess that idea of split is that I still keep all patches in
>a big single patchset, but preserve structure described above and we will
>do review process step by step according split?
>
>Or I should split this patchset for 3 separated sets? I guess this will be
>more complex to review...

If the next is still RFC, a single series is fine.

Thanks,
Stefano


2023-06-27 08:29:29

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 06/17] vsock: check error queue to set EPOLLERR

On Tue, Jun 27, 2023 at 07:44:25AM +0300, Arseniy Krasnov wrote:
>
>
>On 26.06.2023 19:04, Stefano Garzarella wrote:
>> On Sat, Jun 03, 2023 at 11:49:28PM +0300, Arseniy Krasnov wrote:
>>> If socket's error queue is not empty, EPOLLERR must be set. Otherwise,
>>> reader of error queue won't detect data in it using EPOLLERR bit.
>>>
>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>> ---
>>> net/vmw_vsock/af_vsock.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> This patch looks like it can go even without this series.
>>
>> Is it a fix? Should we add a fixes tag?
>
>Yes, it is fix and I can exclude it from this set to reduce number
>of patches, but there is no reproducer for this without MSG_ZEROCOPY
>support - at this moment this feature is the only user of error queue
>for AF_VSOCK.

Okay, so it's fine to keep it here, but please mention in the comment
that without MSG_ZEROCOPY it can't be reproduced.

That way we know that we don't have to backport into the stable
branches.

Thanks,
Stefano

>
>Thanks, Arseniy
>
>>
>> Thanks,
>> Stefano
>>
>>>
>>> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>>> index efb8a0937a13..45fd20c4ed50 100644
>>> --- a/net/vmw_vsock/af_vsock.c
>>> +++ b/net/vmw_vsock/af_vsock.c
>>> @@ -1030,7 +1030,7 @@ static __poll_t vsock_poll(struct file *file, struct socket *sock,
>>> ????poll_wait(file, sk_sleep(sk), wait);
>>> ????mask = 0;
>>>
>>> -??? if (sk->sk_err)
>>> +??? if (sk->sk_err || !skb_queue_empty_lockless(&sk->sk_error_queue))
>>> ??????? /* Signify that there has been an error on this socket. */
>>> ??????? mask |= EPOLLERR;
>>>
>>> --?
>>> 2.25.1
>>>
>>
>


2023-06-27 08:38:03

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support



On 27.06.2023 10:50, Stefano Garzarella wrote:
> On Tue, Jun 27, 2023 at 07:41:51AM +0300, Arseniy Krasnov wrote:
>>
>>
>> On 26.06.2023 19:03, Stefano Garzarella wrote:
>>> On Sat, Jun 03, 2023 at 11:49:27PM +0300, Arseniy Krasnov wrote:
>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>> flag is set and zerocopy transmission is possible, then non-linear skb
>>>> will be created and filled with the pages of user's buffer. Pages of
>>>> user's buffer are locked in memory by 'get_user_pages()'.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>>> ---
>>>> net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
>>>> 1 file changed, 208 insertions(+), 62 deletions(-)
>>>>
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 0de562c1dc4b..f1ec38c72db7 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>     return container_of(t, struct virtio_transport, transport);
>>>> }
>>>>
>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>> - *
>>>> - * If NULL is returned, errp is set to a negative errno.
>>>> - */
>>>> -static struct sk_buff *
>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>> -               size_t len,
>>>> -               u32 src_cid,
>>>> -               u32 src_port,
>>>> -               u32 dst_cid,
>>>> -               u32 dst_port)
>>>> -{
>>>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>> -    struct virtio_vsock_hdr *hdr;
>>>> -    struct sk_buff *skb;
>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>> +                       size_t max_to_send)
>>>> +{
>>>> +    struct iov_iter *iov_iter;
>>>> +
>>>> +    if (!info->msg)
>>>> +        return false;
>>>> +
>>>> +    iov_iter = &info->msg->msg_iter;
>>>> +
>>>> +    /* Data is simple buffer. */
>>>> +    if (iter_is_ubuf(iov_iter))
>>>> +        return true;
>>>> +
>>>> +    if (!iter_is_iovec(iov_iter))
>>>> +        return false;
>>>> +
>>>> +    if (iov_iter->iov_offset)
>>>> +        return false;
>>>> +
>>>> +    /* We can't send whole iov. */
>>>> +    if (iov_iter->count > max_to_send)
>>>> +        return false;
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>> +                       struct sk_buff *skb,
>>>> +                       struct msghdr *msg,
>>>> +                       bool zerocopy)
>>>> +{
>>>> +    struct ubuf_info *uarg;
>>>> +
>>>> +    if (msg->msg_ubuf) {
>>>> +        uarg = msg->msg_ubuf;
>>>> +        net_zcopy_get(uarg);
>>>> +    } else {
>>>> +        struct iov_iter *iter = &msg->msg_iter;
>>>> +        struct ubuf_info_msgzc *uarg_zc;
>>>> +        int len;
>>>> +
>>>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>> +         * checked before.
>>>> +         */
>>>> +        if (iter_is_iovec(iter))
>>>> +            len = iov_length(iter->__iov, iter->nr_segs);
>>>> +        else
>>>> +            len = iter->count;
>>>> +
>>>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>> +                        len,
>>>> +                        NULL);
>>>> +
>>>> +        if (!uarg)
>>>> +            return -1;
>>>> +
>>>> +        uarg_zc = uarg_to_msgzc(uarg);
>>>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>> +    }
>>>> +
>>>> +    skb_zcopy_init(skb, uarg);
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
>>>> +                        struct vsock_sock *vsk,
>>>
>>> `vsk` seems unused
>>>
>>>> +                        struct virtio_vsock_pkt_info *info,
>>>> +                        size_t len)
>>>> +{
>>>>     void *payload;
>>>>     int err;
>>>>
>>>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> -    if (!skb)
>>>> -        return NULL;
>>>> +    payload = skb_put(skb, len);
>>>> +    err = memcpy_from_msg(payload, info->msg, len);
>>>> +    if (err)
>>>> +        return -1;
>>>> +
>>>> +    if (msg_data_left(info->msg))
>>>> +        return 0;
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>> +                      struct virtio_vsock_pkt_info *info,
>>>> +                      u32 src_cid,
>>>> +                      u32 src_port,
>>>> +                      u32 dst_cid,
>>>> +                      u32 dst_port,
>>>> +                      size_t len)
>>>> +{
>>>> +    struct virtio_vsock_hdr *hdr;
>>>>
>>>>     hdr = virtio_vsock_hdr(skb);
>>>>     hdr->type    = cpu_to_le16(info->type);
>>>> @@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>     hdr->dst_port    = cpu_to_le32(dst_port);
>>>>     hdr->flags    = cpu_to_le32(info->flags);
>>>>     hdr->len    = cpu_to_le32(len);
>>>> -
>>>> -    if (info->msg && len > 0) {
>>>> -        payload = skb_put(skb, len);
>>>> -        err = memcpy_from_msg(payload, info->msg, len);
>>>> -        if (err)
>>>> -            goto out;
>>>> -
>>>> -        if (msg_data_left(info->msg) == 0 &&
>>>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> -
>>>> -            if (info->msg->msg_flags & MSG_EOR)
>>>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> -        }
>>>> -    }
>>>> -
>>>> -    if (info->reply)
>>>> -        virtio_vsock_skb_set_reply(skb);
>>>> -
>>>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> -                     dst_cid, dst_port,
>>>> -                     len,
>>>> -                     info->type,
>>>> -                     info->op,
>>>> -                     info->flags);
>>>> -
>>>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>> -        goto out;
>>>> -    }
>>>> -
>>>> -    return skb;
>>>> -
>>>> -out:
>>>> -    kfree_skb(skb);
>>>> -    return NULL;
>>>> }
>>>>
>>>> static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>>>> @@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>         return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>> }
>>>>
>>>> +/* Returns a new packet on success, otherwise returns NULL.
>>>> + *
>>>> + * If NULL is returned, errp is set to a negative errno.
>>>
>>> I had noticed this in Bobby's patches, I think it's an old comment we
>>> left around.
>>>
>>>> + */
>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>> +                          struct virtio_vsock_pkt_info *info,
>>>> +                          size_t payload_len,
>>>> +                          bool zcopy,
>>>> +                          u32 dst_cid,
>>>> +                          u32 dst_port,
>>>> +                          u32 src_cid,
>>>> +                          u32 src_port)
>>>> +{
>>>> +    struct sk_buff *skb;
>>>> +    size_t skb_len;
>>>> +
>>>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>> +
>>>> +    if (!zcopy)
>>>> +        skb_len += payload_len;
>>>> +
>>>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>> +    if (!skb)
>>>> +        return NULL;
>>>> +
>>>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>> +                  dst_cid, dst_port,
>>>> +                  payload_len);
>>>> +
>>>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>>>> +     */
>>>> +    if (vsk)
>>>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>>>
>>> why we are moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
>>>
>>> We should mention this in the commit description.
>>>
>>>> +
>>>> +    if (info->msg && payload_len > 0) {
>>>> +        int err;
>>>> +
>>>> +        if (zcopy) {
>>>> +            err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>> +                              &info->msg->msg_iter,
>>>> +                              payload_len);
>>>> +        } else {
>>>> +            err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
>>>> +        }
>>>> +
>>>> +        if (err)
>>>> +            goto out;
>>>> +
>>>> +        VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>>>> +
>>>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>> +            struct virtio_vsock_hdr *hdr;
>>>> +
>>>> +            hdr = virtio_vsock_hdr(skb);
>>>
>>> Just `struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);` should be
>>> fine.
>>>
>>>> +
>>>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>> +
>>>> +            if (info->msg->msg_flags & MSG_EOR)
>>>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (info->reply)
>>>> +        virtio_vsock_skb_set_reply(skb);
>>>> +
>>>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> +                     dst_cid, dst_port,
>>>> +                     payload_len,
>>>> +                     info->type,
>>>> +                     info->op,
>>>> +                     info->flags);
>>>> +
>>>> +    return skb;
>>>> +out:
>>>> +    kfree_skb(skb);
>>>> +    return NULL;
>>>> +}
>>>> +
>>>> /* This function can only be used on connecting/connected sockets,
>>>>  * since a socket assigned to a transport is required.
>>>>  *
>>>> @@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>     const struct virtio_transport *t_ops;
>>>>     struct virtio_vsock_sock *vvs;
>>>>     u32 pkt_len = info->pkt_len;
>>>> +    bool can_zcopy = false;
>>>> +    u32 max_skb_cap;
>>>>     u32 rest_len;
>>>>     int ret;
>>>>
>>>> @@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>     if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>         return pkt_len;
>>>>
>>>> +    /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>> +     * there is no MSG_ZEROCOPY flag set.
>>>> +     */
>>>> +    if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>> +        info->flags &= ~MSG_ZEROCOPY;
>>>> +
>>>> +    if (info->flags & MSG_ZEROCOPY)
>>>> +        can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>> +
>>>> +    if (can_zcopy)
>>>> +        max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>> +                    (MAX_SKB_FRAGS * PAGE_SIZE));
>>>> +    else
>>>> +        max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>> +
>>>
>>> We use `len` very often, what about `max_skb_len`?
>>>
>>>>     rest_len = pkt_len;
>>>>
>>>>     do {
>>>>         struct sk_buff *skb;
>>>>         size_t skb_len;
>>>>
>>>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>> +        skb_len = min(max_skb_cap, rest_len);
>>>>
>>>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>>>> -                         src_cid, src_port,
>>>> -                         dst_cid, dst_port);
>>>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>> +                         dst_cid, dst_port,
>>>> +                         src_cid, src_port);
>>>>         if (!skb) {
>>>>             ret = -ENOMEM;
>>>>             break;
>>>>         }
>>>>
>>>> +        /* This is last skb to send this portion of data. */
>>>> +        if (skb_len == rest_len &&
>>>> +            info->flags & MSG_ZEROCOPY &&
>>>> +            info->op == VIRTIO_VSOCK_OP_RW) {
>>>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>> +                                info->msg,
>>>> +                                can_zcopy)) {
>>>> +                ret = -ENOMEM;
>>>> +                break;
>>>> +            }
>>>> +        }
>>>> +
>>>>         virtio_transport_inc_tx_pkt(vvs, skb);
>>>>
>>>>         ret = t_ops->send_pkt(skb);
>>>> @@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
>>>>         .msg = msg,
>>>>         .pkt_len = len,
>>>>         .vsk = vsk,
>>>> +        .flags = msg->msg_flags,
>>>
>>> These flags then get copied into the virtio_vsock_hdr, which I don't
>>> think is a good idea.
>>>
>>> Why not using directly info->msg->msg_flags?
>>
>> Ops, yes, it's a bug, You're right, this is really wrong as there are two different
>> sets of flags - MSG_XXX passed to syscall and flags in the header of packet.
>
> Yep.
>
> What about the moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
> Was it voluntary? If so, can you explain why?

Sure! here is what happens: difference between these functions is that 'skb_set_owner_w()'
sets another destructor - 'sock_wfree()':

https://elixir.bootlin.com/linux/v6.4/source/net/core/sock.c#L2408

This destructor also updates 'sk->sk_wmem_alloc' during skb freeing. Why this is needed?
Because '__zerocopy_sg_from_iter()' updates 'sk_wmem_alloc' if socket is NOT SOCK_STREAM
or socket was NOT passed as input argument (it is not obvious why, but it is so):

https://elixir.bootlin.com/linux/latest/source/net/core/datagram.c#L646

So to support both types of sockets (seqpacket and stream) I decided to try pass NULL always
as socket argument to '__zerocopy_sg_from_iter()', with this 'sk_wmem_alloc' will be always
updated in '__zerocopy_sg_from_iter()' and then we decrease 'sk_wmem_alloc' in 'sock_wfree()'.

May be this looks strange - I tried to understand logic of this code in '__zerocopy_sg_from_iter()',
but it requires more deep analysis. Anyway it works:)

Thanks, Arseniy

>
>
> Thanks,
> Stefano
>

2023-06-29 12:44:51

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v4 05/17] vsock/virtio: MSG_ZEROCOPY flag support

On Tue, Jun 27, 2023 at 11:22:52AM +0300, Arseniy Krasnov wrote:
>
>
>On 27.06.2023 10:50, Stefano Garzarella wrote:
>> On Tue, Jun 27, 2023 at 07:41:51AM +0300, Arseniy Krasnov wrote:
>>>
>>>
>>> On 26.06.2023 19:03, Stefano Garzarella wrote:
>>>> On Sat, Jun 03, 2023 at 11:49:27PM +0300, Arseniy Krasnov wrote:
>>>>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>>>>> flag is set and zerocopy transmission is possible, then non-linear skb
>>>>> will be created and filled with the pages of user's buffer. Pages of
>>>>> user's buffer are locked in memory by 'get_user_pages()'.
>>>>>
>>>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>>>> ---
>>>>> net/vmw_vsock/virtio_transport_common.c | 270 ++++++++++++++++++------
>>>>> 1 file changed, 208 insertions(+), 62 deletions(-)
>>>>>
>>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>>> index 0de562c1dc4b..f1ec38c72db7 100644
>>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>>> @@ -37,27 +37,100 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>>>> ????return container_of(t, struct virtio_transport, transport);
>>>>> }
>>>>>
>>>>> -/* Returns a new packet on success, otherwise returns NULL.
>>>>> - *
>>>>> - * If NULL is returned, errp is set to a negative errno.
>>>>> - */
>>>>> -static struct sk_buff *
>>>>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>> -?????????????? size_t len,
>>>>> -?????????????? u32 src_cid,
>>>>> -?????????????? u32 src_port,
>>>>> -?????????????? u32 dst_cid,
>>>>> -?????????????? u32 dst_port)
>>>>> -{
>>>>> -??? const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>>>>> -??? struct virtio_vsock_hdr *hdr;
>>>>> -??? struct sk_buff *skb;
>>>>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>>>>> +?????????????????????? size_t max_to_send)
>>>>> +{
>>>>> +??? struct iov_iter *iov_iter;
>>>>> +
>>>>> +??? if (!info->msg)
>>>>> +??????? return false;
>>>>> +
>>>>> +??? iov_iter = &info->msg->msg_iter;
>>>>> +
>>>>> +??? /* Data is simple buffer. */
>>>>> +??? if (iter_is_ubuf(iov_iter))
>>>>> +??????? return true;
>>>>> +
>>>>> +??? if (!iter_is_iovec(iov_iter))
>>>>> +??????? return false;
>>>>> +
>>>>> +??? if (iov_iter->iov_offset)
>>>>> +??????? return false;
>>>>> +
>>>>> +??? /* We can't send whole iov. */
>>>>> +??? if (iov_iter->count > max_to_send)
>>>>> +??????? return false;
>>>>> +
>>>>> +??? return true;
>>>>> +}
>>>>> +
>>>>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>>>> +?????????????????????? struct sk_buff *skb,
>>>>> +?????????????????????? struct msghdr *msg,
>>>>> +?????????????????????? bool zerocopy)
>>>>> +{
>>>>> +??? struct ubuf_info *uarg;
>>>>> +
>>>>> +??? if (msg->msg_ubuf) {
>>>>> +??????? uarg = msg->msg_ubuf;
>>>>> +??????? net_zcopy_get(uarg);
>>>>> +??? } else {
>>>>> +??????? struct iov_iter *iter = &msg->msg_iter;
>>>>> +??????? struct ubuf_info_msgzc *uarg_zc;
>>>>> +??????? int len;
>>>>> +
>>>>> +??????? /* Only ITER_IOVEC or ITER_UBUF are allowed and
>>>>> +???????? * checked before.
>>>>> +???????? */
>>>>> +??????? if (iter_is_iovec(iter))
>>>>> +??????????? len = iov_length(iter->__iov, iter->nr_segs);
>>>>> +??????? else
>>>>> +??????????? len = iter->count;
>>>>> +
>>>>> +??????? uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>>>> +??????????????????????? len,
>>>>> +??????????????????????? NULL);
>>>>> +
>>>>> +??????? if (!uarg)
>>>>> +??????????? return -1;
>>>>> +
>>>>> +??????? uarg_zc = uarg_to_msgzc(uarg);
>>>>> +??????? uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>>>> +??? }
>>>>> +
>>>>> +??? skb_zcopy_init(skb, uarg);
>>>>> +
>>>>> +??? return 0;
>>>>> +}
>>>>> +
>>>>> +static int virtio_transport_fill_linear_skb(struct sk_buff *skb,
>>>>> +??????????????????????? struct vsock_sock *vsk,
>>>>
>>>> `vsk` seems unused
>>>>
>>>>> +??????????????????????? struct virtio_vsock_pkt_info *info,
>>>>> +??????????????????????? size_t len)
>>>>> +{
>>>>> ????void *payload;
>>>>> ????int err;
>>>>>
>>>>> -??? skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> -??? if (!skb)
>>>>> -??????? return NULL;
>>>>> +??? payload = skb_put(skb, len);
>>>>> +??? err = memcpy_from_msg(payload, info->msg, len);
>>>>> +??? if (err)
>>>>> +??????? return -1;
>>>>> +
>>>>> +??? if (msg_data_left(info->msg))
>>>>> +??????? return 0;
>>>>> +
>>>>> +??? return 0;
>>>>> +}
>>>>> +
>>>>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>>>>> +????????????????????? struct virtio_vsock_pkt_info *info,
>>>>> +????????????????????? u32 src_cid,
>>>>> +????????????????????? u32 src_port,
>>>>> +????????????????????? u32 dst_cid,
>>>>> +????????????????????? u32 dst_port,
>>>>> +????????????????????? size_t len)
>>>>> +{
>>>>> +??? struct virtio_vsock_hdr *hdr;
>>>>>
>>>>> ????hdr = virtio_vsock_hdr(skb);
>>>>> ????hdr->type??? = cpu_to_le16(info->type);
>>>>> @@ -68,42 +141,6 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>>>>> ????hdr->dst_port??? = cpu_to_le32(dst_port);
>>>>> ????hdr->flags??? = cpu_to_le32(info->flags);
>>>>> ????hdr->len??? = cpu_to_le32(len);
>>>>> -
>>>>> -??? if (info->msg && len > 0) {
>>>>> -??????? payload = skb_put(skb, len);
>>>>> -??????? err = memcpy_from_msg(payload, info->msg, len);
>>>>> -??????? if (err)
>>>>> -??????????? goto out;
>>>>> -
>>>>> -??????? if (msg_data_left(info->msg) == 0 &&
>>>>> -??????????? info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> -??????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> -
>>>>> -??????????? if (info->msg->msg_flags & MSG_EOR)
>>>>> -??????????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> -??????? }
>>>>> -??? }
>>>>> -
>>>>> -??? if (info->reply)
>>>>> -??????? virtio_vsock_skb_set_reply(skb);
>>>>> -
>>>>> -??? trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> -???????????????????? dst_cid, dst_port,
>>>>> -???????????????????? len,
>>>>> -???????????????????? info->type,
>>>>> -???????????????????? info->op,
>>>>> -???????????????????? info->flags);
>>>>> -
>>>>> -??? if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>>>>> -??????? WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>>>>> -??????? goto out;
>>>>> -??? }
>>>>> -
>>>>> -??? return skb;
>>>>> -
>>>>> -out:
>>>>> -??? kfree_skb(skb);
>>>>> -??? return NULL;
>>>>> }
>>>>>
>>>>> static void virtio_transport_copy_nonlinear_skb(struct sk_buff *skb,
>>>>> @@ -214,6 +251,85 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>>>> ??????? return VIRTIO_VSOCK_TYPE_SEQPACKET;
>>>>> }
>>>>>
>>>>> +/* Returns a new packet on success, otherwise returns NULL.
>>>>> + *
>>>>> + * If NULL is returned, errp is set to a negative errno.
>>>>
>>>> I had noticed this in Bobby's patches, I think it's an old comment we
>>>> left around.
>>>>
>>>>> + */
>>>>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>>>>> +????????????????????????? struct virtio_vsock_pkt_info *info,
>>>>> +????????????????????????? size_t payload_len,
>>>>> +????????????????????????? bool zcopy,
>>>>> +????????????????????????? u32 dst_cid,
>>>>> +????????????????????????? u32 dst_port,
>>>>> +????????????????????????? u32 src_cid,
>>>>> +????????????????????????? u32 src_port)
>>>>> +{
>>>>> +??? struct sk_buff *skb;
>>>>> +??? size_t skb_len;
>>>>> +
>>>>> +??? skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>>>>> +
>>>>> +??? if (!zcopy)
>>>>> +??????? skb_len += payload_len;
>>>>> +
>>>>> +??? skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>>>>> +??? if (!skb)
>>>>> +??????? return NULL;
>>>>> +
>>>>> +??? virtio_transport_init_hdr(skb, info, src_cid, src_port,
>>>>> +????????????????? dst_cid, dst_port,
>>>>> +????????????????? payload_len);
>>>>> +
>>>>> +??? /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>>>>> +???? * owner of skb without check to update 'sk_wmem_alloc'.
>>>>> +???? */
>>>>> +??? if (vsk)
>>>>> +??????? skb_set_owner_w(skb, sk_vsock(vsk));
>>>>
>>>> why we are moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
>>>>
>>>> We should mention this in the commit description.
>>>>
>>>>> +
>>>>> +??? if (info->msg && payload_len > 0) {
>>>>> +??????? int err;
>>>>> +
>>>>> +??????? if (zcopy) {
>>>>> +??????????? err = __zerocopy_sg_from_iter(info->msg, NULL, skb,
>>>>> +????????????????????????????? &info->msg->msg_iter,
>>>>> +????????????????????????????? payload_len);
>>>>> +??????? } else {
>>>>> +??????????? err = virtio_transport_fill_linear_skb(skb, vsk, info, payload_len);
>>>>> +??????? }
>>>>> +
>>>>> +??????? if (err)
>>>>> +??????????? goto out;
>>>>> +
>>>>> +??????? VIRTIO_VSOCK_SKB_CB(skb)->frag_off = 0;
>>>>> +
>>>>> +??????? if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>>>>> +??????????? struct virtio_vsock_hdr *hdr;
>>>>> +
>>>>> +??????????? hdr = virtio_vsock_hdr(skb);
>>>>
>>>> Just `struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);` should be
>>>> fine.
>>>>
>>>>> +
>>>>> +??????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>>>>> +
>>>>> +??????????? if (info->msg->msg_flags & MSG_EOR)
>>>>> +??????????????? hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>>>>> +??????? }
>>>>> +??? }
>>>>> +
>>>>> +??? if (info->reply)
>>>>> +??????? virtio_vsock_skb_set_reply(skb);
>>>>> +
>>>>> +??? trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>>> +???????????????????? dst_cid, dst_port,
>>>>> +???????????????????? payload_len,
>>>>> +???????????????????? info->type,
>>>>> +???????????????????? info->op,
>>>>> +???????????????????? info->flags);
>>>>> +
>>>>> +??? return skb;
>>>>> +out:
>>>>> +??? kfree_skb(skb);
>>>>> +??? return NULL;
>>>>> +}
>>>>> +
>>>>> /* This function can only be used on connecting/connected sockets,
>>>>> ?* since a socket assigned to a transport is required.
>>>>> ?*
>>>>> @@ -226,6 +342,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>> ????const struct virtio_transport *t_ops;
>>>>> ????struct virtio_vsock_sock *vvs;
>>>>> ????u32 pkt_len = info->pkt_len;
>>>>> +??? bool can_zcopy = false;
>>>>> +??? u32 max_skb_cap;
>>>>> ????u32 rest_len;
>>>>> ????int ret;
>>>>>
>>>>> @@ -254,22 +372,49 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>>>> ????if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>>>> ??????? return pkt_len;
>>>>>
>>>>> +??? /* If zerocopy is not enabled by 'setsockopt()', we behave as
>>>>> +???? * there is no MSG_ZEROCOPY flag set.
>>>>> +???? */
>>>>> +??? if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>>>>> +??????? info->flags &= ~MSG_ZEROCOPY;
>>>>> +
>>>>> +??? if (info->flags & MSG_ZEROCOPY)
>>>>> +??????? can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>>>>> +
>>>>> +??? if (can_zcopy)
>>>>> +??????? max_skb_cap = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>>>> +??????????????????? (MAX_SKB_FRAGS * PAGE_SIZE));
>>>>> +??? else
>>>>> +??????? max_skb_cap = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>>>> +
>>>>
>>>> We use `len` very often, what about `max_skb_len`?
>>>>
>>>>> ????rest_len = pkt_len;
>>>>>
>>>>> ????do {
>>>>> ??????? struct sk_buff *skb;
>>>>> ??????? size_t skb_len;
>>>>>
>>>>> -??????? skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>>>>> +??????? skb_len = min(max_skb_cap, rest_len);
>>>>>
>>>>> -??????? skb = virtio_transport_alloc_skb(info, skb_len,
>>>>> -???????????????????????? src_cid, src_port,
>>>>> -???????????????????????? dst_cid, dst_port);
>>>>> +??????? skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>>>>> +???????????????????????? dst_cid, dst_port,
>>>>> +???????????????????????? src_cid, src_port);
>>>>> ??????? if (!skb) {
>>>>> ??????????? ret = -ENOMEM;
>>>>> ??????????? break;
>>>>> ??????? }
>>>>>
>>>>> +??????? /* This is last skb to send this portion of data. */
>>>>> +??????? if (skb_len == rest_len &&
>>>>> +??????????? info->flags & MSG_ZEROCOPY &&
>>>>> +??????????? info->op == VIRTIO_VSOCK_OP_RW) {
>>>>> +??????????? if (virtio_transport_init_zcopy_skb(vsk, skb,
>>>>> +??????????????????????????????? info->msg,
>>>>> +??????????????????????????????? can_zcopy)) {
>>>>> +??????????????? ret = -ENOMEM;
>>>>> +??????????????? break;
>>>>> +??????????? }
>>>>> +??????? }
>>>>> +
>>>>> ??????? virtio_transport_inc_tx_pkt(vvs, skb);
>>>>>
>>>>> ??????? ret = t_ops->send_pkt(skb);
>>>>> @@ -884,6 +1029,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
>>>>> ??????? .msg = msg,
>>>>> ??????? .pkt_len = len,
>>>>> ??????? .vsk = vsk,
>>>>> +??????? .flags = msg->msg_flags,
>>>>
>>>> These flags then get copied into the virtio_vsock_hdr, which I don't
>>>> think is a good idea.
>>>>
>>>> Why not using directly info->msg->msg_flags?
>>>
>>> Ops, yes, it's a bug, You're right, this is really wrong as there are two different
>>> sets of flags - MSG_XXX passed to syscall and flags in the header of packet.
>>
>> Yep.
>>
>> What about the moving from skb_set_owner_sk_safe() to skb_set_owner_w()?
>> Was it voluntary? If so, can you explain why?
>
>Sure! here is what happens: difference between these functions is that 'skb_set_owner_w()'
>sets another destructor - 'sock_wfree()':
>
>https://elixir.bootlin.com/linux/v6.4/source/net/core/sock.c#L2408
>
>This destructor also updates 'sk->sk_wmem_alloc' during skb freeing. Why this is needed?
>Because '__zerocopy_sg_from_iter()' updates 'sk_wmem_alloc' if socket is NOT SOCK_STREAM
>or socket was NOT passed as input argument (it is not obvious why, but it is so):
>
>https://elixir.bootlin.com/linux/latest/source/net/core/datagram.c#L646
>
>So to support both types of sockets (seqpacket and stream) I decided to try pass NULL always
>as socket argument to '__zerocopy_sg_from_iter()', with this 'sk_wmem_alloc' will be always
>updated in '__zerocopy_sg_from_iter()' and then we decrease 'sk_wmem_alloc' in 'sock_wfree()'.
>
>May be this looks strange - I tried to understand logic of this code in '__zerocopy_sg_from_iter()',
>but it requires more deep analysis. Anyway it works:)

Yep, it should require also more analysis on my side :-)

Please document it in the commit description, so if we will find any
issue in the future, we know why we did this change.

Thanks,
Stefano