2022-11-06 19:38:55

by Arseniy Krasnov

Subject: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive



INTRODUCTION

Hello,

This is an experimental patchset for virtio vsock zerocopy receive.
It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
the same idea: call 'mmap()' on the socket's descriptor, then call
'getsockopt()' to fill the provided vma area with pages of virtio
receive buffers. After the received data has been processed by the
user, the pages must be freed with an 'madvise()' call with the
MADV_DONTNEED flag set (if the user does not call 'madvise()', the
next 'getsockopt()' will fail).

DETAILS

Here is what the mapping with mapped pages looks like exactly: the
first page contains information about the mapped data buffers. At zero
offset the mapping contains a special data structure:

struct virtio_vsock_usr_hdr_pref {
	u32 poll_value;
	u32 hdr_num;
};

This structure contains two fields:
'poll_value' - shows whether the current socket has data to read. When
the socket's input queue is empty, 'poll_value' is set to 0 by the
kernel. When the input queue has some data, 'poll_value' is set to 1
by the kernel. When the socket is closed for data receive, 'poll_value'
is ~0. This tells the user that "there will be no more data, continue
to call 'getsockopt()' until you find 'hdr_num' == 0". The user spins
on it in userspace, without calling the 'poll()' system call (of
course, 'poll()' still works).
'hdr_num' - shows the number of mapped pages with data, which start
from the second page of this mapping.
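As a rough illustration of the 'poll_value' contract described above,
here is a minimal user-space sketch (the structure layout follows this
cover letter; the helper name and the state enum are illustrative, not
part of the patchset):

```c
#include <stdint.h>

/* First-page prefix, as described in this cover letter. */
struct virtio_vsock_usr_hdr_pref {
	uint32_t poll_value;
	uint32_t hdr_num;
};

enum rx_state {
	RX_EMPTY,	/* poll_value == 0: input queue is empty */
	RX_READY,	/* poll_value == 1: there is data to read */
	RX_CLOSED,	/* poll_value == ~0: no more data will arrive */
};

/* Interpret the prefix which the kernel publishes on page 0 of the
 * mapping; a user-polling receiver would spin on this instead of
 * calling 'poll()'.
 */
static enum rx_state rx_state(const struct virtio_vsock_usr_hdr_pref *pref)
{
	if (pref->poll_value == 0)
		return RX_EMPTY;
	if (pref->poll_value == (uint32_t)~0)
		return RX_CLOSED;
	return RX_READY;
}
```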

NOTE:
This version has two limitations:

1) One mapping per socket is supported. It is implemented by adding
a 'struct page *' pointer to the 'struct virtio_vsock' structure (the
first page of the mapping, which contains 'virtio_vsock_usr_hdr_pref').
But, I think, support for multiple mappings could be implemented by
using something like a hash table of such pages, or, more simply, by
just using the first page of each mapping as the headers page by
default. Also, I think, the number of such pages could be controlled
by 'setsockopt()'.

2) After the 'mmap()' call, it is impossible to call 'mmap()' again,
even after calling 'madvise()'/'munmap()' on the whole mapping. This
is because the socket can't handle 'munmap()' calls (as there is no
such callback in 'proto_ops'), thus the polling page exists for as
long as the socket is open.

After the 'virtio_vsock_usr_hdr_pref' object, the first page contains
an array of trimmed virtio vsock packet headers (each contains only
the length of data on the corresponding page and a 'flags' field):

struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
};

The 'length' field allows the user to know the exact size of the
payload within each sequence of pages, and the 'flags' field allows
processing of SOCK_SEQPACKET flags (such as message bounds or record
bounds). All other pages are data pages from the virtio queue.

       Page 0         Page 1      Page N

[ pref hdr0 .. hdrN ][ data ] .. [ data ]
        |       |       ^           ^
        |       |       |           |
        |       *-------|-----------*
        |               |
        *---------------*

Of course, a single header could represent an array of pages (when a
packet's buffer is bigger than one page). So here is an example of a
detailed mapping layout for some set of packets. Let's consider that we
have the following sequence of packets: 56 bytes, 4096 bytes and 8200
bytes. All pages: 0, 1, 2, 3, 4 and 5 will be inserted into the user's
vma.

Page 0: [[ pref ][ hdr0 ][ hdr1 ][ hdr2 ][ hdr3 ] ... ]
Page 1: [ 56 ]
Page 2: [ 4096 ]
Page 3: [ 4096 ]
Page 4: [ 4096 ]
Page 5: [ 8 ]

Page 0 contains only the array of headers:
'pref' is 'struct virtio_vsock_usr_hdr_pref'.
'hdr0' has 56 in its length field.
'hdr1' has 4096 in its length field.
'hdr2' has 8200 in its length field.
'hdr3' has 0 in its length field (this is the end-of-data marker).

Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
Page 2 corresponds to 'hdr1' and filled with data.
Page 3 corresponds to 'hdr2' and filled with data.
Page 4 corresponds to 'hdr2' and filled with data.
Page 5 corresponds to 'hdr2' and has only 8 bytes of data.

pref will be the following: poll_value = 1, hdr_num = 5
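To make the layout above concrete, here is a small sketch that walks
such a header page and sums the payload sizes (the structures follow
this cover letter; the page here is a hand-filled mock, not a real
mapping):

```c
#include <stddef.h>
#include <stdint.h>

struct virtio_vsock_usr_hdr_pref {
	uint32_t poll_value;
	uint32_t hdr_num;
};

struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
};

/* Walk the header array on page 0 until the zero-length end-of-data
 * marker and return the total number of payload bytes it describes.
 */
static size_t total_payload(const void *page0)
{
	const struct virtio_vsock_usr_hdr *hdr =
		(const void *)((const char *)page0 +
			       sizeof(struct virtio_vsock_usr_hdr_pref));
	size_t total = 0;

	while (hdr->length != 0) {
		total += hdr->length;
		hdr++;
	}

	return total;
}
```

For the 56/4096/8200 example above this walk would return 12352 bytes,
spread over five data pages.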

This patchset also changes the packet allocation way: the current
code uses only 'kmalloc()' to create data buffers. A problem happens
when we try to map such buffers into the user's vma - the kernel
restricts mapping slab pages into the user's vma (as pages of "not
large" 'kmalloc()' allocations have the PageSlab flag set, and "not
large" could be bigger than one page). So to avoid this, data buffers
are now allocated using the 'alloc_pages()' call.

DIFFERENCES FROM TCP

As this feature uses the same approach as the TCP protocol, here
are some differences between both versions (from the user's POV):

1) For 'getsockopt()':
- This version passes only the address of the mapping.
- TCP passes a special structure to 'getsockopt()'. In addition to the
address of the mapping, it contains 'length' and 'recv_skip_hint'. The
first means the size of data inside the mapping (an out parameter, set
by the kernel). The second has bool type; if it is true, then the user
must dequeue the rest of the data using the 'read()' syscall (i.e. it
is an out parameter also).
2) Mapping structure:
- This version uses the first page of the mapping for metadata and the
rest of the pages for data.
- The TCP version uses the whole mapping for data only.
3) Data layout:
- This version inserts virtio buffers into the mapping, so each buffer
may be filled partially. To get the size of the payload in every
buffer, the first page of the mapping must be used (see 2).
- The TCP version inserts pages of a single skb.

*Please correct me if I made a mistake in the TCP zerocopy description.

TESTS

This patchset updates the 'vsock_test' utility: two tests for the
new feature were added. The first test covers invalid cases. The
second checks a valid transmission case.

BENCHMARKING

For benchmarking I've created a small test utility 'vsock_rx_perf'.
It works in client/server mode. When a client connects to the server,
the server starts sending the specified amount of data to the client
(the size is set as an input argument). The client reads the data and
waits for the next portion of it. In client mode, dequeue logic works
in three modes: copy, zerocopy and zerocopy with user polling.

1) In copy mode the client uses the 'read()' system call.
2) In zerocopy mode the client uses 'mmap()'/'getsockopt()' to dequeue
data and 'poll()' to wait for data.
3) In zerocopy mode + user polling the client uses
'mmap()'/'getsockopt()', but to wait for data it polls the shared page
(i.e. a busyloop).

Here is the usage:
 -c <cid>       Peer CID to connect to (if run in client mode).
 -m <megabytes> Number of megabytes to send.
 -b <bytes>     Size of RX/TX buffer (or mapping) in bytes.
 -r <bytes>     SO_RCVLOWAT value in bytes (if run in client mode).
 -v <bytes>     Peer credit.
 -s             Run as server.
 -z [n|y|u]     Dequeue mode:
                n - copy mode, 1) above.
                y - zerocopy mode, 2) above.
                u - zerocopy mode + user poll, 3) above.

The utility produces the following output:
1) In server mode it prints the number of seconds spent in the whole
tx loop.
2) In client mode it prints several values:
* Number of seconds spent in the whole rx loop (including 'poll()').
* Number of seconds spent in dequeue system calls:
In case of '-z n' it will be the time in 'read()'.
In case of '-z y|u' it will be the time in 'getsockopt()'+'madvise()'.
* Number of wake ups with the POLLIN flag set (except in '-z u' mode).
* Average time (ns) in a single dequeue iteration (i.e. the second
value divided by the third).

The idea of the test is to compare the zerocopy approach and classic
copy. It is clear that to dequeue some "small" amount of data, copy
must be better, because a syscall with 'memcpy()' for 1 byte (for
example) is nothing compared to two system calls, where the first must
map at least one page, while the second will unmap it.

The test was performed with the following settings:
1) The total size of data to send is 2G (-m argument).

2) The peer's buffer size is changed to 2G (-v argument) - this is
   needed to avoid stalls of the sender waiting for enough credit.

3) Both the buffer size (-b) and SO_RCVLOWAT (-r) are used to control
   the number of bytes to dequeue in a single loop iteration. The
   buffer size limits the max number of bytes to read, while
   SO_RCVLOWAT won't let the user get too small a number of bytes.

4) For the sender, the tx buffer (which is passed to 'write()') size
   is 16Mb. Of course, we could set it to the peer's buffer size, and
   as we are in STREAM mode, 'write()' would then be called only once.

Designations here and below:
H2G - host to guest transmission. Server is host, client is guest.
G2H - guest to host transmission. Server is guest, client is host.
C - copy mode.
ZC - zerocopy mode.
ZU - zerocopy with user poll mode. This mode is excluded from the test
at the moment, because I need to support SO_RCVLOWAT logic in it.

So, the rows correspond to the dequeue mode, while the columns show
the number of bytes dequeued in a single loop iteration. Each cell
contains several values in the following format:
*---------*
|  A / B  |
|    C    |
|    D    |
*---------*

A - number of seconds which the server spent in the tx loop.
B - number of seconds which the client spent in the rx loop.
C - number of seconds which the client spent in the rx loop, excluding
the 'poll()' system call (i.e. only in dequeue system calls).
D - average number of ns for each POLLIN wake up (in other words, an
average value derived from C).

G2H:

#0 #1 #2 #3 #4 #5
*----*---------*---------*---------*---------*---------*---------*
| | | | | | | |
| | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
| | | | | | | |
*----*---------*---------*---------*---------*---------*---------*
| | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
| | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 |
*----*---------*---------*---------*---------*---------*---------*
| |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
| | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 |
*----*---------*---------*---------*---------*---------*---------*

H2G:

#0 #1 #2 #3 #4 #5
*----*---------*---------*---------*---------*---------*---------*
| | | | | | | |
| | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
| | | | | | | |
*----*---------*---------*---------*---------*---------*---------*
| | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
| | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 |
*----*---------*---------*---------*---------*---------*---------*
| |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
| | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 |
*----*---------*---------*---------*---------*---------*---------*

Here are my thoughts about these numbers (the most obvious ones):

1) Let's check the C and D values. We see that zerocopy dequeue is
faster on big buffers (in G2H it starts from 64Kb, in H2G - from
128Kb). I think this is the main result of this test (at the moment),
as it shows the performance difference between copy and zerocopy.

2) In G2H mode both server and client spend almost the same time in
the rx/tx loops (see A / B in the G2H table) - it looks good. In H2G
mode, there is a significant difference between server and client. I
think there are some side effects which produce this (I continue to
analyze it).

3) Let's check the C value. We can see that G2H is always faster than
H2G, in both copy and zerocopy mode.

4) Another interesting thing could be seen, for example, in the H2G
table, row #0, col #4 (the 256Kb case). The number of seconds in
zerocopy mode is smaller than in copy mode (1.25 vs 2.42), but the
whole rx loop was faster in copy mode (5 seconds vs 5.35 seconds).
I.e. if we account for the time spent in 'poll()', copy mode looks
faster (even though it spends more time in 'read()' than the zerocopy
loop does in 'getsockopt()' + 'madvise()'). I think this is also a
non-obvious effect.

So, according to 1), it is better to use zerocopy if you need to
process big buffers with few rx waits (for example, it may be a video
stream). In other cases it is better to use the classic copy way, as
it will be more lightweight.

All tests were performed on an x86 board with a 4-core Celeron N2930
CPU (of course it is not a mainframe, but better than testing with a
nested guest) and 8Gb of RAM.

Anyway, this is not the final version, and I will continue to improve
both the kernel logic and the performance tests.

SUGGESTIONS

1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. Maybe I
can merge both patchsets into a single one?
2) This version works with a single header page. Maybe I can add a new
'getsockopt()' feature to allow the use of multiple pages for headers.

CHANGE LOG
v1 -> v2:
1) Zerocopy receive mode must be enabled/disabled (off by default). I
did not use the generic SO_ZEROCOPY flag, because in the virtio-vsock
case this feature depends on transport support. Instead of
SO_ZEROCOPY, an AF_VSOCK layer flag was added: SO_VM_SOCKETS_ZEROCOPY,
while the previous meaning of SO_VM_SOCKETS_ZEROCOPY (insert receive
buffers into the user's vm area) was renamed to SO_VM_SOCKETS_MAP_RX.
2) The packet header which is exported to user space gets a new extra
field: 'copy_len'. This field handles a special case: the user reads
data from the socket in a non-zerocopy way (with zerocopy disabled)
and then enables zerocopy.
In this case, the vhost part will switch the buffer allocation logic
from 'kmalloc()' to direct calls to the buddy allocator. But there
could be some pending 'kmalloc()'-allocated packets in the socket's rx
list; if the user then tries to read such packets in a zerocopy way,
dequeue will fail, because SLAB pages cannot be inserted into the
user's vm area. So when such a packet is found during zerocopy
dequeue, the dequeue loop will break and 'copy_len' will show the size
of such a "bad" packet. After the user detects this case, it must use
'read()'/'recv()' calls to dequeue such a packet.
3) Also, maybe move this feature under a config option? <<<<< DECLINED

v2 -> v3:
1) Zerocopy could be enabled only for a socket in SS_UNCONNECTED
state, so the 'copy_len' field from v2 was removed.
2) The mapping layout was updated. The first page of the mapping now
contains the following structure: 'struct virtio_vsock_usr_hdr_pref'
starting from the first byte, followed by an array of
'struct virtio_vsock_usr_hdr'.
3) Transport get/set callbacks for zerocopy were removed; the flag
which shows whether zerocopy receive is on/off is now stored in
'vsock_sock'.
4) The 'virtio_transport_recv_pkt()' interface changed. This was done
because the vhost driver needs to check whether zerocopy is enabled on
a socket or not. So the socket lookup is performed before packet
allocation and the socket structure is passed to this function.

void virtio_transport_recv_pkt(struct virtio_transport *,
                               struct virtio_vsock_pkt *);

changed to

void virtio_transport_recv_pkt(struct virtio_transport *,
                               struct sock *,
                               struct virtio_vsock_pkt *);

If the 'struct sock *' argument is NULL, this function works as
before, otherwise it skips the socket lookup, using the input 'sock'
as the destination socket.

5) A test for userspace polling was added.
6) Zerocopy tests were moved to a dedicated '.c' file.
7) More benchmark results.
8) Two tests were updated: the message bound test was reworked and a
test for big message transmission was added.

Arseniy Krasnov (11):
virtio/vsock: rework packet allocation logic
virtio/vsock: update 'virtio_transport_recv_pkt()'
af_vsock: add zerocopy receive logic
virtio/vsock: add transport zerocopy callback
vhost/vsock: switch packet's buffer allocation
vhost/vsock: enable zerocopy callback
virtio/vsock: enable zerocopy callback
test/vsock: rework message bound test
test/vsock: add big message test
test/vsock: add receive zerocopy tests
test/vsock: vsock_rx_perf utility

drivers/vhost/vsock.c | 58 ++-
include/linux/virtio_vsock.h | 8 +
include/net/af_vsock.h | 8 +
include/uapi/linux/virtio_vsock.h | 14 +
include/uapi/linux/vm_sockets.h | 3 +
net/vmw_vsock/af_vsock.c | 187 ++++++++-
net/vmw_vsock/virtio_transport.c | 4 +-
net/vmw_vsock/virtio_transport_common.c | 256 ++++++++++++-
net/vmw_vsock/vsock_loopback.c | 2 +-
tools/include/uapi/linux/virtio_vsock.h | 15 +
tools/include/uapi/linux/vm_sockets.h | 8 +
tools/testing/vsock/Makefile | 3 +-
tools/testing/vsock/control.c | 34 ++
tools/testing/vsock/control.h | 2 +
tools/testing/vsock/util.c | 40 +-
tools/testing/vsock/util.h | 5 +
tools/testing/vsock/vsock_rx_perf.c | 604 ++++++++++++++++++++++++++++++
tools/testing/vsock/vsock_test.c | 198 +++++++++-
tools/testing/vsock/vsock_test_zerocopy.c | 530 ++++++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 14 +
20 files changed, 1953 insertions(+), 40 deletions(-)

--
2.35.0


2022-11-06 19:47:11

by Arseniy Krasnov

Subject: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

To support zerocopy receive, the packet's buffer allocation is
changed: for buffers which could be mapped into the user's vma we
can't use 'kmalloc()' (as the kernel restricts mapping slab pages into
the user's vma), so the raw buddy allocator is now called. But for tx
packets (such packets won't be mapped to user space), the previous
'kmalloc()' way is kept, with a special flag in the packet's structure
which allows distinguishing between 'kmalloc()' and raw pages buffers.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 1 +
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport.c | 8 ++++++--
net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
4 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 5703775af129..65475d128a1d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}

+ pkt->slab_buf = true;
pkt->buf_len = pkt->len;

nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 35d7eedb5e8e..d02cb7aa922f 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
u32 off;
bool reply;
bool tap_delivered;
+ bool slab_buf;
};

struct virtio_vsock_pkt_info {
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index ad64f403536a..19909c1e9ba3 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
vq = vsock->vqs[VSOCK_VQ_RX];

do {
+ struct page *buf_page;
+
pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
if (!pkt)
break;

- pkt->buf = kmalloc(buf_len, GFP_KERNEL);
- if (!pkt->buf) {
+ buf_page = alloc_page(GFP_KERNEL);
+
+ if (!buf_page) {
virtio_transport_free_pkt(pkt);
break;
}

+ pkt->buf = page_to_virt(buf_page);
pkt->buf_len = buf_len;
pkt->len = buf_len;

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index a9980e9b9304..37e8dbfe2f5d 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
if (!pkt->buf)
goto out_pkt;

+ pkt->slab_buf = true;
pkt->buf_len = len;

err = memcpy_from_msg(pkt->buf, info->msg, len);
@@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);

void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
{
- kvfree(pkt->buf);
+ if (pkt->buf_len) {
+ if (pkt->slab_buf)
+ kvfree(pkt->buf);
+ else
+ free_pages((unsigned long)pkt->buf,
+ get_order(pkt->buf_len));
+ }
+
kfree(pkt);
}
EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
--
2.35.0

2022-11-06 20:00:39

by Arseniy Krasnov

Subject: [RFC PATCH v3 08/11] test/vsock: rework message bound test

This updates the message bound test, making it more complex. Instead
of sending 1-byte messages with one MSG_EOR bit, it sends messages of
random length (one half of the messages are smaller than the page
size, the second half are bigger) with a random number of MSG_EOR bits
set. The receiver also doesn't know the total number of messages.
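The checksum scheme used by the reworked test can be sketched on its
own: per message, both sides fold the MSG_EOR flag and the message
size into a rolling hash mixed with djb2 (djb2 matches the helper this
series adds to util.c; 'fold_msg' and the sample sizes below are
illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* djb2 hash over a byte buffer, as added to util.c by this series. */
static unsigned long djb2(const void *data, size_t len)
{
	unsigned long hash = 5381;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ((hash << 5) + hash) +
		       ((const unsigned char *)data)[i];

	return hash;
}

/* Fold one message event (payload size + MSG_EOR presence) into the
 * rolling hash, mirroring what both sender and receiver do per
 * message; equal final hashes mean message bounds were preserved.
 */
static unsigned long fold_msg(unsigned long hash, size_t size, bool eor)
{
	if (eor)
		hash++;
	hash += size;
	return djb2(&hash, sizeof(hash));
}
```

A lost or merged MSG_EOR changes the folded value, so comparing the
sender's and receiver's final hashes catches broken message bounds.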

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/control.c | 34 +++++++++
tools/testing/vsock/control.h | 2 +
tools/testing/vsock/util.c | 13 ++++
tools/testing/vsock/util.h | 1 +
tools/testing/vsock/vsock_test.c | 115 +++++++++++++++++++++++++++----
5 files changed, 152 insertions(+), 13 deletions(-)

diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
index 4874872fc5a3..bed1649bdf3d 100644
--- a/tools/testing/vsock/control.c
+++ b/tools/testing/vsock/control.c
@@ -141,6 +141,40 @@ void control_writeln(const char *str)
timeout_end();
}

+void control_writeulong(unsigned long value)
+{
+ char str[32];
+
+ if (snprintf(str, sizeof(str), "%lu", value) >= sizeof(str)) {
+ perror("snprintf");
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln(str);
+}
+
+unsigned long control_readulong(bool *ok)
+{
+ unsigned long value;
+ char *str;
+
+ if (ok)
+ *ok = false;
+
+ str = control_readln();
+
+ if (str == NULL)
+ return 0;
+
+ value = strtoul(str, NULL, 10);
+ free(str);
+
+ if (ok)
+ *ok = true;
+
+ return value;
+}
+
/* Return the next line from the control socket (without the trailing newline).
*
* The program terminates if a timeout occurs.
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
index 51814b4f9ac1..cdd922dfea68 100644
--- a/tools/testing/vsock/control.h
+++ b/tools/testing/vsock/control.h
@@ -9,7 +9,9 @@ void control_init(const char *control_host, const char *control_port,
void control_cleanup(void);
void control_writeln(const char *str);
char *control_readln(void);
+unsigned long control_readulong(bool *ok);
void control_expectln(const char *str);
bool control_cmpln(char *line, const char *str, bool fail);
+void control_writeulong(unsigned long value);

#endif /* CONTROL_H */
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 2acbb7703c6a..351903836774 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -395,3 +395,16 @@ void skip_test(struct test_case *test_cases, size_t test_cases_len,

test_cases[test_id].skip = true;
}
+
+unsigned long djb2(const void *data, size_t len)
+{
+ unsigned long hash = 5381;
+ int i = 0;
+
+ while (i < len) {
+ hash = ((hash << 5) + hash) + ((unsigned char *)data)[i];
+ i++;
+ }
+
+ return hash;
+}
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index a3375ad2fb7f..988cc69a4642 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -49,4 +49,5 @@ void run_tests(const struct test_case *test_cases,
void list_tests(const struct test_case *test_cases);
void skip_test(struct test_case *test_cases, size_t test_cases_len,
const char *test_id_str);
+unsigned long djb2(const void *data, size_t len);
#endif /* UTIL_H */
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index bb6d691cb30d..107c11165887 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -284,10 +284,14 @@ static void test_stream_msg_peek_server(const struct test_opts *opts)
close(fd);
}

-#define MESSAGES_CNT 7
-#define MSG_EOR_IDX (MESSAGES_CNT / 2)
+#define SOCK_BUF_SIZE (2 * 1024 * 1024)
+#define MAX_MSG_SIZE (32 * 1024)
+
static void test_seqpacket_msg_bounds_client(const struct test_opts *opts)
{
+ unsigned long curr_hash;
+ int page_size;
+ int msg_count;
int fd;

fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
@@ -296,18 +300,69 @@ static void test_seqpacket_msg_bounds_client(const struct test_opts *opts)
exit(EXIT_FAILURE);
}

- /* Send several messages, one with MSG_EOR flag */
- for (int i = 0; i < MESSAGES_CNT; i++)
- send_byte(fd, 1, (i == MSG_EOR_IDX) ? MSG_EOR : 0);
+ /* Wait, until receiver sets buffer size. */
+ control_expectln("SRVREADY");
+
+ curr_hash = 0;
+ page_size = getpagesize();
+ msg_count = SOCK_BUF_SIZE / MAX_MSG_SIZE;
+
+ for (int i = 0; i < msg_count; i++) {
+ ssize_t send_size;
+ size_t buf_size;
+ int flags;
+ void *buf;
+
+ /* Use "small" buffers and "big" buffers. */
+ if (i & 1)
+ buf_size = page_size +
+ (rand() % (MAX_MSG_SIZE - page_size));
+ else
+ buf_size = 1 + (rand() % page_size);
+
+ buf = malloc(buf_size);
+
+ if (!buf) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Set at least one MSG_EOR + some random. */
+ if (i == (msg_count / 2) || (rand() & 1)) {
+ flags = MSG_EOR;
+ curr_hash++;
+ } else {
+ flags = 0;
+ }
+
+ send_size = send(fd, buf, buf_size, flags);
+
+ if (send_size < 0) {
+ perror("send");
+ exit(EXIT_FAILURE);
+ }
+
+ if (send_size != buf_size) {
+ fprintf(stderr, "Invalid send size\n");
+ exit(EXIT_FAILURE);
+ }
+
+ curr_hash += send_size;
+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
+ }

control_writeln("SENDDONE");
+ control_writeulong(curr_hash);
close(fd);
}

static void test_seqpacket_msg_bounds_server(const struct test_opts *opts)
{
+ unsigned long sock_buf_size;
+ unsigned long remote_hash;
+ unsigned long curr_hash;
int fd;
- char buf[16];
+ char buf[MAX_MSG_SIZE];
struct msghdr msg = {0};
struct iovec iov = {0};

@@ -317,25 +372,58 @@ static void test_seqpacket_msg_bounds_server(const struct test_opts *opts)
exit(EXIT_FAILURE);
}

+ sock_buf_size = SOCK_BUF_SIZE;
+
+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
+ &sock_buf_size, sizeof(sock_buf_size))) {
+ perror("setsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
+ &sock_buf_size, sizeof(sock_buf_size))) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Ready to receive data. */
+ control_writeln("SRVREADY");
+ /* Wait, until peer sends whole data. */
control_expectln("SENDDONE");
iov.iov_base = buf;
iov.iov_len = sizeof(buf);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;

- for (int i = 0; i < MESSAGES_CNT; i++) {
- if (recvmsg(fd, &msg, 0) != 1) {
- perror("message bound violated");
- exit(EXIT_FAILURE);
- }
+ curr_hash = 0;

- if ((i == MSG_EOR_IDX) ^ !!(msg.msg_flags & MSG_EOR)) {
- perror("MSG_EOR");
+ while (1) {
+ ssize_t recv_size;
+
+ recv_size = recvmsg(fd, &msg, 0);
+
+ if (!recv_size)
+ break;
+
+ if (recv_size < 0) {
+ perror("recvmsg");
exit(EXIT_FAILURE);
}
+
+ if (msg.msg_flags & MSG_EOR)
+ curr_hash++;
+
+ curr_hash += recv_size;
+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
}

close(fd);
+ remote_hash = control_readulong(NULL);
+
+ if (curr_hash != remote_hash) {
+ fprintf(stderr, "Message bounds broken\n");
+ exit(EXIT_FAILURE);
+ }
}

#define MESSAGE_TRUNC_SZ 32
@@ -837,6 +925,7 @@ int main(int argc, char **argv)
.peer_cid = VMADDR_CID_ANY,
};

+ srand(time(NULL));
init_signals();

for (;;) {
--
2.35.0

2022-11-06 20:30:14

by Arseniy Krasnov

Subject: [RFC PATCH v3 06/11] vhost/vsock: enable zerocopy callback

This adds the zerocopy callbacks to the vhost transport.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 191a5b94aa7c..fc6c6e667e36 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -486,6 +486,8 @@ static struct virtio_transport vhost_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .zerocopy_dequeue = virtio_transport_zerocopy_dequeue,
+ .zerocopy_rx_mmap = virtio_transport_zerocopy_init,

.seqpacket_dequeue = virtio_transport_seqpacket_dequeue,
.seqpacket_enqueue = virtio_transport_seqpacket_enqueue,
--
2.35.0

2022-11-06 20:31:35

by Arseniy Krasnov

Subject: [RFC PATCH v3 11/11] test/vsock: vsock_rx_perf utility

This adds a utility to measure vsock receive throughput.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/Makefile | 1 +
tools/testing/vsock/vsock_rx_perf.c | 604 ++++++++++++++++++++++++++++
2 files changed, 605 insertions(+)
create mode 100644 tools/testing/vsock/vsock_rx_perf.c

diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index 7172c21fbd8d..5aea346ba2bc 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -3,6 +3,7 @@ all: test
test: vsock_test vsock_diag_test
vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o
+vsock_rx_perf: vsock_rx_perf.o

CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
.PHONY: all test clean
diff --git a/tools/testing/vsock/vsock_rx_perf.c b/tools/testing/vsock/vsock_rx_perf.c
new file mode 100644
index 000000000000..323626089043
--- /dev/null
+++ b/tools/testing/vsock/vsock_rx_perf.c
@@ -0,0 +1,604 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vsock_rx_perf - benchmark utility for zerocopy receive.
+ *
+ * Copyright (C) 2022 SberDevices.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+#include <getopt.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdbool.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <time.h>
+#include <sys/mman.h>
+#include <stdint.h>
+#include <poll.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
+#include <sys/socket.h>
+#include <linux/vm_sockets.h>
+
+#define PAGE_SIZE 4096
+#define DEFAULT_BUF_SIZE_BYTES (128*1024)
+#define DEFAULT_TO_SEND_BYTES (65*1024)
+#define DEFAULT_VSOCK_BUF_BYTES (256*1024)
+#define DEFAULT_RCVLOWAT_BYTES 1
+#define DEFAULT_PORT 1234
+
+static bool client_mode = true;
+static int peer_cid = -1;
+static int port = DEFAULT_PORT;
+static unsigned long rcvlowat_bytes = DEFAULT_RCVLOWAT_BYTES;
+static unsigned long buf_size_bytes = DEFAULT_BUF_SIZE_BYTES;
+static unsigned long to_send_bytes = DEFAULT_TO_SEND_BYTES;
+static unsigned long vsock_buf_bytes = DEFAULT_VSOCK_BUF_BYTES;
+
+#define ZEROCOPY_MODE_NO 0
+#define ZEROCOPY_MODE_POLL 1
+#define ZEROCOPY_MODE_USER_POLL 2
+
+static int zerocopy_mode = ZEROCOPY_MODE_NO;
+
+static inline time_t current_nsec(void)
+{
+ struct timespec ts;
+
+ if (clock_gettime(CLOCK_REALTIME, &ts)) {
+ perror("clock_gettime");
+ exit(EXIT_FAILURE);
+ }
+
+ return (ts.tv_sec * 1000000000ULL) + ts.tv_nsec;
+}
+
+/* From lib/cmdline.c. */
+static unsigned long memparse(const char *ptr)
+{
+ char *endptr;
+
+ unsigned long long ret = strtoull(ptr, &endptr, 0);
+
+ switch (*endptr) {
+ case 'E':
+ case 'e':
+ ret <<= 10;
+ case 'P':
+ case 'p':
+ ret <<= 10;
+ case 'T':
+ case 't':
+ ret <<= 10;
+ case 'G':
+ case 'g':
+ ret <<= 10;
+ case 'M':
+ case 'm':
+ ret <<= 10;
+ case 'K':
+ case 'k':
+ ret <<= 10;
+ endptr++;
+ default:
+ break;
+ }
+
+ return ret;
+}
+
+static void map_rx(int fd, void *va)
+{
+ socklen_t len = sizeof(va);
+
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, &va, &len) < 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static void vsock_increase_buf_size(int fd)
+{
+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
+ &vsock_buf_bytes, sizeof(vsock_buf_bytes))) {
+ perror("setsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
+ &vsock_buf_bytes, sizeof(vsock_buf_bytes))) {
+ perror("setsockopt");
+ exit(EXIT_FAILURE);
+ }
+}
+
+static int vsock_connect(unsigned int cid, unsigned int port)
+{
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = port,
+ .svm_cid = cid,
+ },
+ };
+ unsigned long zc_on;
+ int fd;
+
+ fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+ if (fd < 0)
+ return -1;
+
+ vsock_increase_buf_size(fd);
+
+ zc_on = 1;
+
+ if (zerocopy_mode != ZEROCOPY_MODE_NO) {
+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY,
+ (void *)&zc_on, sizeof(zc_on))) {
+ close(fd);
+ return -1;
+ }
+ }
+
+ if (connect(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+ close(fd);
+ return -1;
+ }
+
+ return fd;
+}
+
+static void run_server(void)
+{
+ time_t tx_begin_ns;
+ size_t total_send;
+ int client_fd;
+ char *data;
+ int fd;
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } addr = {
+ .svm = {
+ .svm_family = AF_VSOCK,
+ .svm_port = port,
+ .svm_cid = VMADDR_CID_ANY,
+ },
+ };
+ union {
+ struct sockaddr sa;
+ struct sockaddr_vm svm;
+ } clientaddr;
+
+ socklen_t clientaddr_len = sizeof(clientaddr.svm);
+
+ fprintf(stderr, "Run as server, listen %i, send %lu, tx buf %lu vsock buf %lu\n",
+ port, to_send_bytes, buf_size_bytes,
+ vsock_buf_bytes);
+
+ fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+ if (fd < 0) {
+ perror("socket");
+ exit(EXIT_FAILURE);
+ }
+
+ if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+ perror("bind");
+ exit(EXIT_FAILURE);
+ }
+
+ if (listen(fd, 1) < 0) {
+ perror("listen");
+ exit(EXIT_FAILURE);
+ }
+
+ client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+
+ if (client_fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ vsock_increase_buf_size(client_fd);
+ vsock_increase_buf_size(fd);
+
+ data = malloc(buf_size_bytes);
+
+ if (!data) {
+ fprintf(stderr, "malloc failed\n");
+ close(client_fd);
+ close(fd);
+ exit(EXIT_FAILURE);
+ }
+
+ memset(data, 0, buf_size_bytes);
+ total_send = 0;
+ tx_begin_ns = current_nsec();
+
+ while (total_send < to_send_bytes) {
+ ssize_t sent;
+
+ sent = write(client_fd, data, buf_size_bytes);
+
+ if (sent <= 0) {
+ perror("write");
+ exit(EXIT_FAILURE);
+ }
+
+ total_send += sent;
+ }
+
+ fprintf(stderr, "Tx loop time %f sec\n",
+ (float)(current_nsec() - tx_begin_ns) / (1000000000.0));
+
+ close(client_fd);
+ close(fd);
+
+ free(data);
+}
+
+static void run_client(void)
+{
+ unsigned int read_cnt;
+ time_t rx_begin_ns;
+ time_t in_read_ns;
+ void *data;
+ int fd;
+
+ fprintf(stderr, "Running client, copy, peer %i:%i, rx buf %lu, vsock buf %lu\n",
+ peer_cid, port, buf_size_bytes, vsock_buf_bytes);
+
+ fd = vsock_connect(peer_cid, port);
+
+ if (fd < 0) {
+ perror("socket");
+ exit(EXIT_FAILURE);
+ }
+
+ if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
+ &rcvlowat_bytes,
+ sizeof(rcvlowat_bytes))) {
+ perror("setsockopt 1");
+ exit(EXIT_FAILURE);
+ }
+
+ data = mmap(NULL, buf_size_bytes, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
+
+ if (data == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ in_read_ns = 0;
+ read_cnt = 0;
+ rx_begin_ns = current_nsec();
+
+ while (1) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+ fds.events = POLLIN | POLLERR | POLLHUP |
+ POLLRDHUP | POLLNVAL;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fds.revents & POLLERR) {
+ fprintf(stderr, "Poll error\n");
+ break;
+ }
+
+ if (fds.revents & (POLLHUP | POLLRDHUP))
+ break;
+
+ if (fds.revents & POLLIN) {
+ ssize_t bytes_read;
+ time_t t;
+
+ t = current_nsec();
+ bytes_read = read(fd, data, buf_size_bytes);
+ in_read_ns += (current_nsec() - t);
+ read_cnt++;
+
+ if (!bytes_read)
+ break;
+
+ if (bytes_read < 0) {
+ perror("recv");
+ exit(EXIT_FAILURE);
+ }
+ }
+ }
+
+ fprintf(stderr, "Rx loop time %f sec\n",
+ (float)(current_nsec() - rx_begin_ns) / (1000000000.0));
+ fprintf(stderr, "Total in 'read()' %f sec\n", in_read_ns / 1000000000.0);
+ fprintf(stderr, "POLLIN wakeups: %i\n", read_cnt);
+ fprintf(stderr, "Average in 'read()' %f ns\n",
+ (float)in_read_ns / read_cnt);
+
+ munmap(data, buf_size_bytes);
+ close(fd);
+}
+
+static void run_client_zerocopy(void)
+{
+ unsigned int rx_cnt;
+ time_t rx_begin_ns;
+ time_t in_rx_ns;
+ void *rx_va;
+ int done;
+ int fd;
+
+ fprintf(stderr, "Running client, zerocopy, peer %i:%i, rx buf %lu, vsock buf %lu\n",
+ peer_cid, port, buf_size_bytes, vsock_buf_bytes);
+
+ fd = vsock_connect(peer_cid, port);
+
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
+ &rcvlowat_bytes,
+ sizeof(rcvlowat_bytes))) {
+ perror("setsockopt 1");
+ exit(EXIT_FAILURE);
+ }
+
+ rx_va = mmap(NULL, buf_size_bytes, PROT_READ, MAP_SHARED, fd, 0);
+
+ if (rx_va == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ done = 0;
+
+ in_rx_ns = 0;
+ rx_cnt = 0;
+ rx_begin_ns = current_nsec();
+
+ while (1) {
+ struct pollfd fds = { 0 };
+
+ fds.fd = fd;
+ fds.events = POLLIN | POLLERR | POLLHUP |
+ POLLRDHUP | POLLNVAL;
+
+ if (poll(&fds, 1, -1) < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fds.revents & (POLLHUP | POLLRDHUP))
+ done = 1;
+
+ if (fds.revents & POLLERR) {
+ fprintf(stderr, "Poll error\n");
+ break;
+ }
+
+ if (fds.revents & POLLIN) {
+ struct virtio_vsock_usr_hdr_pref *hdr_pref;
+ size_t t;
+
+ t = current_nsec();
+ map_rx(fd, rx_va);
+ in_rx_ns += (current_nsec() - t);
+
+ hdr_pref = (struct virtio_vsock_usr_hdr_pref *)rx_va;
+
+ if (!hdr_pref->hdr_num && done)
+ break;
+
+ t = current_nsec();
+ if (madvise((void *)rx_va + PAGE_SIZE, buf_size_bytes - PAGE_SIZE,
+ MADV_DONTNEED)) {
+ perror("madvise");
+ exit(EXIT_FAILURE);
+ }
+ in_rx_ns += (current_nsec() - t);
+ rx_cnt++;
+ }
+
+ if (done)
+ break;
+ }
+
+ fprintf(stderr, "Rx loop time %f sec\n",
+ (float)(current_nsec() - rx_begin_ns) / (1000000000.0));
+ fprintf(stderr, "Total in 'getsockopt()' + 'madvise()' %f sec\n",
+ in_rx_ns / 1000000000.0);
+ fprintf(stderr, "POLLIN wakeups: %i\n", rx_cnt);
+ fprintf(stderr, "Average in 'getsockopt()' + 'madvise()' %f ns\n",
+ (float)in_rx_ns / rx_cnt);
+
+ close(fd);
+}
+
+static void run_client_user_poll(void)
+{
+ unsigned int rx_cnt;
+ time_t rx_begin_ns;
+ time_t in_rx_ns;
+ u32 poll_value;
+ void *rx_va;
+ int fd;
+
+ fprintf(stderr, "Running client, user poll, peer %i:%i, rx buf %lu, vsock buf %lu\n",
+ peer_cid, port, buf_size_bytes, vsock_buf_bytes);
+
+ fd = vsock_connect(peer_cid, port);
+
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ rx_va = mmap(NULL, buf_size_bytes, PROT_READ, MAP_SHARED, fd, 0);
+
+ if (rx_va == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ poll_value = 0;
+ in_rx_ns = 0;
+ rx_cnt = 0;
+ rx_begin_ns = current_nsec();
+
+ while (1) {
+ volatile struct virtio_vsock_usr_hdr_pref *poll_hdr;
+ int leave_loop = 0;
+ size_t t;
+
+ poll_hdr = (struct virtio_vsock_usr_hdr_pref *)rx_va;
+
+ if (poll_value != ~0) {
+ do {
+ poll_value = poll_hdr->poll_value;
+ } while (!poll_value);
+ }
+
+ t = current_nsec();
+ map_rx(fd, rx_va);
+ in_rx_ns += (current_nsec() - t);
+
+ if (!poll_hdr->hdr_num && (poll_value == ~0))
+ leave_loop = 1;
+
+ t = current_nsec();
+ if (madvise((void *)rx_va + PAGE_SIZE,
+ buf_size_bytes - PAGE_SIZE,
+ MADV_DONTNEED)) {
+ perror("madvise");
+ exit(EXIT_FAILURE);
+ }
+ in_rx_ns += (current_nsec() - t);
+ rx_cnt++;
+
+ if (leave_loop)
+ break;
+ }
+
+ fprintf(stderr, "Rx loop time %f sec\n",
+ (float)(current_nsec() - rx_begin_ns) / (1000000000.0));
+ fprintf(stderr, "Total in 'getsockopt()' + 'madvise()' %f sec\n",
+ in_rx_ns / 1000000000.0);
+ fprintf(stderr, "Busyloop wakeups: %i\n", rx_cnt);
+ fprintf(stderr, "Average in 'getsockopt()' + 'madvise()' %f ns\n",
+ (float)in_rx_ns / rx_cnt);
+}
+
+static void usage(void)
+{
+ fprintf(stderr, "Usage: vsock_rx_perf [-h] [-s]\n"
+ "-z n|y|u -c <cid> -m <megabytes to send>\n"
+ "-b <buffer size> -v <vsock buffer size> -r <rcvlowat bytes>\n"
+ "\n"
+ "Server: vsock_rx_perf -s\n"
+ "This is a benchmarking utility to test vsock receive performance.\n"
+ "It runs in two modes: server or client. In server mode, it waits for\n"
+ "a connection from the client and, once established, starts data\n"
+ "transmission. The total size of data to send is set by the '-m' option.\n"
+ "\n"
+ "The client can read this data in three different modes:\n"
+ "1) Using the 'read()' system call. Default mode.\n"
+ "2) Zerocopy mode, using 'poll()' to wait for data.\n"
+ "3) Zerocopy mode, using a busyloop in userspace to wait for data.\n"
+ "\n"
+ "The meaning of '-b' depends on server or client mode. In server\n"
+ "mode, it is the size of the tx buffer passed to 'write()'. In client\n"
+ "mode, it is the size of the rx buffer (without zerocopy) passed to\n"
+ "'read()'. With zerocopy enabled, it is the size of the rx mapping.\n"
+ "\n"
+ "Options:\n"
+ " -h This help message\n"
+ " -s Run as server (client is the default)\n"
+ " -p <port> Port\n"
+ " -z n|y|u Data waiting mode\n"
+ " -c <cid> CID of the peer\n"
+ " -m <megabytes to send> Megabytes to send\n"
+ " -b <buffer size> Buffer size, depends on server/client mode\n"
+ " -v <vsock buffer size> Peer socket buffer size\n"
+ " -r <rcvlowat bytes> SO_RCVLOWAT value for the client socket\n"
+ "\n");
+ exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+ int c;
+
+ while ((c = getopt(argc, argv, "sc:p:m:b:z:r:hv:")) != -1) {
+ switch (c) {
+ case 'v':
+ vsock_buf_bytes = memparse(optarg);
+ break;
+ case 'r':
+ rcvlowat_bytes = memparse(optarg);
+ break;
+ case 's':
+ client_mode = false;
+ break;
+ case 'c':
+ peer_cid = atoi(optarg);
+ break;
+ case 'p':
+ port = atoi(optarg);
+ break;
+ case 'm':
+ to_send_bytes = memparse(optarg);
+ break;
+ case 'b':
+ buf_size_bytes = memparse(optarg);
+ break;
+ case 'z':
+ if (!strcmp(optarg, "n"))
+ zerocopy_mode = ZEROCOPY_MODE_NO;
+ else if (!strcmp(optarg, "y"))
+ zerocopy_mode = ZEROCOPY_MODE_POLL;
+ else if (!strcmp(optarg, "u"))
+ zerocopy_mode = ZEROCOPY_MODE_USER_POLL;
+ else
+ usage();
+ break;
+ case 'h':
+ usage();
+ break;
+ default:
+ usage();
+
+ }
+ }
+
+ if (client_mode) {
+ switch (zerocopy_mode) {
+ case ZEROCOPY_MODE_NO:
+ run_client();
+ break;
+ case ZEROCOPY_MODE_POLL:
+ run_client_zerocopy();
+ break;
+ case ZEROCOPY_MODE_USER_POLL:
+ run_client_user_poll();
+ break;
+ }
+ } else {
+ run_server();
+ }
+
+ return 0;
+}
--
2.35.0

2022-11-06 20:31:39

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v3 09/11] test/vsock: add big message test

This adds a test for sending a message bigger than the peer's buffer
size. For a SOCK_SEQPACKET socket this must fail, as this socket type
has a message size limit.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/testing/vsock/vsock_test.c | 62 ++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)

diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index 107c11165887..bb4e8657f1d6 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -560,6 +560,63 @@ static void test_seqpacket_timeout_server(const struct test_opts *opts)
close(fd);
}

+static void test_seqpacket_bigmsg_client(const struct test_opts *opts)
+{
+ unsigned long sock_buf_size;
+ ssize_t send_size;
+ socklen_t len;
+ void *data;
+ int fd;
+
+ len = sizeof(sock_buf_size);
+
+ fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
+ &sock_buf_size, &len)) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ sock_buf_size++;
+
+ data = malloc(sock_buf_size);
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ send_size = send(fd, data, sock_buf_size, 0);
+ if (send_size != -1) {
+ fprintf(stderr, "expected 'send(2)' failure, got %zi\n",
+ send_size);
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("CLISENT");
+
+ free(data);
+ close(fd);
+}
+
+static void test_seqpacket_bigmsg_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_seqpacket_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("CLISENT");
+
+ close(fd);
+}
+
#define BUF_PATTERN_1 'a'
#define BUF_PATTERN_2 'b'

@@ -832,6 +889,11 @@ static struct test_case test_cases[] = {
.run_client = test_seqpacket_timeout_client,
.run_server = test_seqpacket_timeout_server,
},
+ {
+ .name = "SOCK_SEQPACKET big message",
+ .run_client = test_seqpacket_bigmsg_client,
+ .run_server = test_seqpacket_bigmsg_server,
+ },
{
.name = "SOCK_SEQPACKET invalid receive buffer",
.run_client = test_seqpacket_invalid_rec_buffer_client,
--
2.35.0

2022-11-06 20:31:46

by Christophe JAILLET

[permalink] [raw]
Subject: Re: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

On 06/11/2022 at 20:36, Arseniy Krasnov wrote:
> To support zerocopy receive, packet's buffer allocation is changed: for
> buffers which could be mapped to user's vma we can't use 'kmalloc()'(as
> kernel restricts to map slab pages to user's vma) and raw buddy
> allocator now called. But, for tx packets(such packets won't be mapped
> to user), previous 'kmalloc()' way is used, but with special flag in
> packet's structure which allows to distinguish between 'kmalloc()' and
> raw pages buffers.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>
> ---
> drivers/vhost/vsock.c | 1 +
> include/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c | 8 ++++++--
> net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
> 4 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 5703775af129..65475d128a1d 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> return NULL;
> }
>
> + pkt->slab_buf = true;
> pkt->buf_len = pkt->len;
>
> nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 35d7eedb5e8e..d02cb7aa922f 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
> u32 off;
> bool reply;
> bool tap_delivered;
> + bool slab_buf;
> };
>
> struct virtio_vsock_pkt_info {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index ad64f403536a..19909c1e9ba3 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
> vq = vsock->vqs[VSOCK_VQ_RX];
>
> do {
> + struct page *buf_page;
> +
> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> if (!pkt)
> break;
>
> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> - if (!pkt->buf) {
> + buf_page = alloc_page(GFP_KERNEL);
> +
> + if (!buf_page) {
> virtio_transport_free_pkt(pkt);
> break;
> }
>
> + pkt->buf = page_to_virt(buf_page);
> pkt->buf_len = buf_len;
> pkt->len = buf_len;
>
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index a9980e9b9304..37e8dbfe2f5d 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> if (!pkt->buf)
> goto out_pkt;
>
> + pkt->slab_buf = true;
> pkt->buf_len = len;
>
> err = memcpy_from_msg(pkt->buf, info->msg, len);
> @@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> - kvfree(pkt->buf);
> + if (pkt->buf_len) {
> + if (pkt->slab_buf)
> + kvfree(pkt->buf);

Hi,

kfree()? (according to virtio_transport_alloc_pkt() in -next) or
something else need to be changed.

> + else
> + free_pages((unsigned long)pkt->buf,
> + get_order(pkt->buf_len));

In virtio_vsock_rx_fill(), only alloc_page() is used.

Should this be alloc_pages(.., get_order(buf_len)) or free_page()
(without an 's') here?

Just my 2c,

CJ

> + }
> +
> kfree(pkt);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);


2022-11-06 20:32:27

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v3 05/11] vhost/vsock: switch packet's buffer allocation

This changes the packet buffer allocation logic: it now depends on
whether rx zerocopy is enabled or disabled on the destination socket.
Thus, the socket lookup is now performed here, not in
'virtio_transport_common.c', and for the zerocopy case the buffer is
allocated with raw calls to the buddy allocator. If zerocopy is
disabled, buffers are allocated by 'kvmalloc()' (as before this patch).

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 56 +++++++++++++++++++++++++++++++++++--------
1 file changed, 46 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6f3d9f02cc1d..191a5b94aa7c 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -354,10 +354,14 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)

static struct virtio_vsock_pkt *
vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
- unsigned int out, unsigned int in)
+ unsigned int out, unsigned int in,
+ struct sock **sk)
{
struct virtio_vsock_pkt *pkt;
+ struct sockaddr_vm src, dst;
+ struct vhost_vsock *vsock;
struct iov_iter iov_iter;
+ struct vsock_sock *vsk;
size_t nbytes;
size_t len;

@@ -381,6 +385,18 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}

+ vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
+ le32_to_cpu(pkt->hdr.src_port));
+ vsock_addr_init(&dst, le64_to_cpu(pkt->hdr.dst_cid),
+ le32_to_cpu(pkt->hdr.dst_port));
+
+ *sk = vsock_find_connected_socket(&src, &dst);
+ if (!(*sk)) {
+ *sk = vsock_find_bound_socket(&dst);
+ if (!(*sk))
+ return pkt;
+ }
+
pkt->len = le32_to_cpu(pkt->hdr.len);

/* No payload */
@@ -393,14 +409,32 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
return NULL;
}

- pkt->buf = kvmalloc(pkt->len, GFP_KERNEL);
- if (!pkt->buf) {
- kfree(pkt);
- return NULL;
- }
-
- pkt->slab_buf = true;
pkt->buf_len = pkt->len;
+ vsock = container_of(vq->dev, struct vhost_vsock, dev);
+
+ vsk = vsock_sk(*sk);
+
+ if (!vsk->rx_zerocopy_on) {
+ pkt->buf = kvmalloc(pkt->len, GFP_KERNEL);
+
+ if (!pkt->buf) {
+ kfree(pkt);
+ return NULL;
+ }
+
+ pkt->slab_buf = true;
+ } else {
+ struct page *buf_page;
+
+ buf_page = alloc_pages(GFP_KERNEL, get_order(pkt->len));
+
+ if (buf_page == NULL) {
+ kfree(pkt);
+ return NULL;
+ }
+
+ pkt->buf = page_to_virt(buf_page);
+ }

nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
if (nbytes != pkt->len) {
@@ -512,6 +546,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)

vhost_disable_notify(&vsock->dev, vq);
do {
+ struct sock *sk = NULL;
+
if (!vhost_vsock_more_replies(vsock)) {
/* Stop tx until the device processes already
* pending replies. Leave tx virtqueue
@@ -533,7 +569,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
break;
}

- pkt = vhost_vsock_alloc_pkt(vq, out, in);
+ pkt = vhost_vsock_alloc_pkt(vq, out, in, &sk);
if (!pkt) {
vq_err(vq, "Faulted on pkt\n");
continue;
@@ -548,7 +584,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
le64_to_cpu(pkt->hdr.dst_cid) ==
vhost_transport_get_local_cid())
- virtio_transport_recv_pkt(&vhost_transport, NULL, pkt);
+ virtio_transport_recv_pkt(&vhost_transport, sk, pkt);
else
virtio_transport_free_pkt(pkt);

--
2.35.0

2022-11-06 20:32:27

by Arseniy Krasnov

[permalink] [raw]
Subject: [RFC PATCH v3 10/11] test/vsock: add receive zerocopy tests

This adds tests for the zerocopy feature: one test checks data
transmission with a simple integrity check, and a second test covers
'error' branches in the zerocopy logic (to check invalid argument
handling).

Signed-off-by: Arseniy Krasnov <[email protected]>
---
tools/include/uapi/linux/virtio_vsock.h | 15 +
tools/include/uapi/linux/vm_sockets.h | 8 +
tools/testing/vsock/Makefile | 2 +-
tools/testing/vsock/util.c | 27 +-
tools/testing/vsock/util.h | 4 +
tools/testing/vsock/vsock_test.c | 21 +
tools/testing/vsock/vsock_test_zerocopy.c | 530 ++++++++++++++++++++++
tools/testing/vsock/vsock_test_zerocopy.h | 14 +
8 files changed, 617 insertions(+), 4 deletions(-)
create mode 100644 tools/include/uapi/linux/virtio_vsock.h
create mode 100644 tools/include/uapi/linux/vm_sockets.h
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.c
create mode 100644 tools/testing/vsock/vsock_test_zerocopy.h

diff --git a/tools/include/uapi/linux/virtio_vsock.h b/tools/include/uapi/linux/virtio_vsock.h
new file mode 100644
index 000000000000..f393062e0394
--- /dev/null
+++ b/tools/include/uapi/linux/virtio_vsock.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VIRTIO_VSOCK_H
+#define _UAPI_LINUX_VIRTIO_VSOCK_H
+#include <linux/types.h>
+
+struct virtio_vsock_usr_hdr {
+ u32 flags;
+ u32 len;
+} __attribute__((packed));
+
+struct virtio_vsock_usr_hdr_pref {
+ u32 poll_value;
+ u32 hdr_num;
+} __attribute__((packed));
+#endif /* _UAPI_LINUX_VIRTIO_VSOCK_H */
diff --git a/tools/include/uapi/linux/vm_sockets.h b/tools/include/uapi/linux/vm_sockets.h
new file mode 100644
index 000000000000..cac0bc3a7041
--- /dev/null
+++ b/tools/include/uapi/linux/vm_sockets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _UAPI_LINUX_VM_SOCKETS_H
+#define _UAPI_LINUX_VM_SOCKETS_H
+
+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
+#endif /* _UAPI_LINUX_VM_SOCKETS_H */
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
index f8293c6910c9..7172c21fbd8d 100644
--- a/tools/testing/vsock/Makefile
+++ b/tools/testing/vsock/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
all: test
test: vsock_test vsock_diag_test
-vsock_test: vsock_test.o timeout.o control.o util.o
+vsock_test: vsock_test.o vsock_test_zerocopy.o timeout.o control.o util.o
vsock_diag_test: vsock_diag_test.o timeout.o control.o util.o

CFLAGS += -g -O2 -Werror -Wall -I. -I../../include -I../../../usr/include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
index 351903836774..7ae5219d0fe8 100644
--- a/tools/testing/vsock/util.c
+++ b/tools/testing/vsock/util.c
@@ -84,7 +84,8 @@ void vsock_wait_remote_close(int fd)
}

/* Connect to <cid, port> and return the file descriptor. */
-static int vsock_connect(unsigned int cid, unsigned int port, int type)
+static int vsock_connect(unsigned int cid, unsigned int port, int type,
+ int optname, void *optval, socklen_t optlen)
{
union {
struct sockaddr sa;
@@ -103,6 +104,13 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)

fd = socket(AF_VSOCK, type, 0);

+ if (optval) {
+ if (setsockopt(fd, AF_VSOCK, optname, optval, optlen)) {
+ close(fd);
+ return -1;
+ }
+ }
+
timeout_begin(TIMEOUT);
do {
ret = connect(fd, &addr.sa, sizeof(addr.svm));
@@ -122,12 +130,25 @@ static int vsock_connect(unsigned int cid, unsigned int port, int type)

int vsock_stream_connect(unsigned int cid, unsigned int port)
{
- return vsock_connect(cid, port, SOCK_STREAM);
+ return vsock_connect(cid, port, SOCK_STREAM, 0, NULL, 0);
+}
+
+int vsock_stream_connect_opt(unsigned int cid, unsigned int port,
+ int optname, void *optval, socklen_t optlen)
+{
+ return vsock_connect(cid, port, SOCK_STREAM, optname, optval, optlen);
}

int vsock_seqpacket_connect(unsigned int cid, unsigned int port)
{
- return vsock_connect(cid, port, SOCK_SEQPACKET);
+ return vsock_connect(cid, port, SOCK_SEQPACKET, 0, NULL, 0);
+}
+
+int vsock_seqpacket_connect_opt(unsigned int cid, unsigned int port,
+ int optname, void *optval, socklen_t optlen)
+{
+ return vsock_connect(cid, port, SOCK_SEQPACKET, optname, optval,
+ optlen);
}

/* Listen on <cid, port> and return the first incoming connection. The remote
diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
index 988cc69a4642..23f9f0c6853d 100644
--- a/tools/testing/vsock/util.h
+++ b/tools/testing/vsock/util.h
@@ -36,7 +36,11 @@ struct test_case {
void init_signals(void);
unsigned int parse_cid(const char *str);
int vsock_stream_connect(unsigned int cid, unsigned int port);
+int vsock_stream_connect_opt(unsigned int cid, unsigned int port,
+ int optname, void *optval, socklen_t optlen);
int vsock_seqpacket_connect(unsigned int cid, unsigned int port);
+int vsock_seqpacket_connect_opt(unsigned int cid, unsigned int port,
+ int optname, void *optval, socklen_t optlen);
int vsock_stream_accept(unsigned int cid, unsigned int port,
struct sockaddr_vm *clientaddrp);
int vsock_seqpacket_accept(unsigned int cid, unsigned int port,
diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
index bb4e8657f1d6..a6ed076e42f4 100644
--- a/tools/testing/vsock/vsock_test.c
+++ b/tools/testing/vsock/vsock_test.c
@@ -23,6 +23,7 @@
#include "timeout.h"
#include "control.h"
#include "util.h"
+#include "vsock_test_zerocopy.h"

static void test_stream_connection_reset(const struct test_opts *opts)
{
@@ -904,6 +905,26 @@ static struct test_case test_cases[] = {
.run_client = test_stream_poll_rcvlowat_client,
.run_server = test_stream_poll_rcvlowat_server,
},
+ {
+ .name = "SOCK_STREAM zerocopy receive",
+ .run_client = test_stream_zerocopy_rx_client,
+ .run_server = test_stream_zerocopy_rx_server,
+ },
+ {
+ .name = "SOCK_SEQPACKET zerocopy receive",
+ .run_client = test_seqpacket_zerocopy_rx_client,
+ .run_server = test_seqpacket_zerocopy_rx_server,
+ },
+ {
+ .name = "SOCK_STREAM zerocopy receive loop poll",
+ .run_client = test_stream_zerocopy_rx_client_loop_poll,
+ .run_server = test_stream_zerocopy_rx_server_loop_poll,
+ },
+ {
+ .name = "SOCK_STREAM zerocopy invalid",
+ .run_client = test_stream_zerocopy_rx_inv_client,
+ .run_server = test_stream_zerocopy_rx_inv_server,
+ },
{},
};

diff --git a/tools/testing/vsock/vsock_test_zerocopy.c b/tools/testing/vsock/vsock_test_zerocopy.c
new file mode 100644
index 000000000000..1876f14c0cec
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.c
@@ -0,0 +1,530 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Zerocopy receive tests for vsock.
+ *
+ * Copyright (C) 2022 Sberdevices, Inc.
+ *
+ * Author: Arseniy Krasnov <[email protected]>
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <uapi/linux/virtio_vsock.h>
+#include <uapi/linux/vm_sockets.h>
+#include <sys/mman.h>
+#include <poll.h>
+#include <unistd.h>
+
+#include "timeout.h"
+#include "control.h"
+#include "util.h"
+#include "vsock_test_zerocopy.h"
+
+#define PAGE_SIZE 4096
+#define RX_MAPPING_PAGES 3
+
+#define TX_BUF_SIZE 40000
+#define TX_SEND_LOOPS 3
+
+static void test_connectible_zerocopy_rx_client(const struct test_opts *opts,
+ bool stream)
+{
+ unsigned long remote_hash;
+ unsigned long curr_hash;
+ unsigned long total_sum;
+ unsigned long msg_bytes;
+ unsigned long zc_on;
+ size_t rx_map_len;
+ void *rx_va;
+ int fd;
+
+ zc_on = 1;
+
+ if (stream)
+ fd = vsock_stream_connect_opt(opts->peer_cid, 1234,
+ SO_VM_SOCKETS_ZEROCOPY,
+ (void *)&zc_on, sizeof(zc_on));
+ else
+ fd = vsock_seqpacket_connect_opt(opts->peer_cid, 1234,
+ SO_VM_SOCKETS_ZEROCOPY,
+ (void *)&zc_on, sizeof(zc_on));
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ rx_map_len = PAGE_SIZE * RX_MAPPING_PAGES;
+
+ rx_va = mmap(NULL, rx_map_len, PROT_READ, MAP_SHARED, fd, 0);
+ if (rx_va == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ total_sum = 0;
+ msg_bytes = 0;
+ curr_hash = 0;
+
+ while (1) {
+ struct pollfd fds = { 0 };
+ int leave_loop;
+ int res;
+
+ fds.fd = fd;
+ fds.events = POLLIN | POLLERR | POLLHUP |
+ POLLRDHUP | POLLNVAL;
+
+ res = poll(&fds, 1, -1);
+
+ if (res < 0) {
+ perror("poll");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fds.revents & POLLERR) {
+ perror("poll error");
+ exit(EXIT_FAILURE);
+ }
+
+ leave_loop = 0;
+
+ if (fds.revents & POLLIN) {
+ struct virtio_vsock_usr_hdr_pref *poll_hdr;
+ struct virtio_vsock_usr_hdr *data_hdr;
+ unsigned char *data_va;
+ unsigned char *end_va;
+ socklen_t len;
+ int hdr_cnt;
+
+ poll_hdr = (struct virtio_vsock_usr_hdr_pref *)rx_va;
+ len = sizeof(rx_va);
+
+ if (getsockopt(fd, AF_VSOCK,
+ SO_VM_SOCKETS_MAP_RX,
+ &rx_va, &len) < 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ data_hdr = (struct virtio_vsock_usr_hdr *)(poll_hdr + 1);
+ /* Skip headers page for data. */
+ data_va = rx_va + PAGE_SIZE;
+ end_va = (unsigned char *)(rx_va + rx_map_len);
+ hdr_cnt = 0;
+
+ while (data_va != end_va) {
+ int data_len = data_hdr->len;
+
+ if (!data_hdr->len) {
+ if (fds.revents & (POLLHUP | POLLRDHUP) &&
+ !hdr_cnt)
+ leave_loop = 1;
+
+ break;
+ }
+
+ while (data_len > 0) {
+ int i;
+ int to_read = (data_len < PAGE_SIZE) ?
+ data_len : PAGE_SIZE;
+
+ for (i = 0; i < to_read; i++)
+ total_sum += data_va[i];
+
+ data_va += PAGE_SIZE;
+ data_len -= PAGE_SIZE;
+ }
+
+ if (!stream) {
+ msg_bytes += data_hdr->len;
+
+ if (data_hdr->flags) {
+ curr_hash += msg_bytes;
+ curr_hash = djb2(&curr_hash,
+ sizeof(curr_hash));
+ msg_bytes = 0;
+ }
+ }
+
+ data_hdr++;
+ hdr_cnt++;
+ }
+
+ if (madvise((void *)rx_va,
+ rx_map_len,
+ MADV_DONTNEED)) {
+ perror("madvise");
+ exit(EXIT_FAILURE);
+ }
+
+ if (leave_loop)
+ break;
+ }
+ }
+
+ curr_hash += total_sum;
+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
+
+ if (munmap(rx_va, rx_map_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+
+ remote_hash = control_readulong(NULL);
+
+ if (curr_hash != remote_hash) {
+ fprintf(stderr, "sum mismatch %lu != %lu\n",
+ curr_hash, remote_hash);
+ exit(EXIT_FAILURE);
+ }
+
+ close(fd);
+}
+
+void test_seqpacket_zerocopy_rx_client(const struct test_opts *opts)
+{
+ test_connectible_zerocopy_rx_client(opts, false);
+}
+
+void test_stream_zerocopy_rx_client(const struct test_opts *opts)
+{
+ test_connectible_zerocopy_rx_client(opts, true);
+}
+
+static void test_connectible_zerocopy_rx_server(const struct test_opts *opts,
+ bool stream)
+{
+ size_t max_buf_size = TX_BUF_SIZE;
+ unsigned long curr_hash;
+ long total_sum = 0;
+ int n = TX_SEND_LOOPS;
+ int fd;
+
+ if (stream)
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ else
+ fd = vsock_seqpacket_accept(VMADDR_CID_ANY, 1234, NULL);
+
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ curr_hash = 0;
+
+ while (n) {
+ unsigned char *data;
+ size_t send_size;
+ size_t buf_size;
+ int i;
+
+ buf_size = 1 + (rand() % max_buf_size);
+
+ data = malloc(buf_size);
+
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (i = 0; i < buf_size; i++) {
+ data[i] = rand() & 0xff;
+ total_sum += data[i];
+ }
+
+ send_size = write(fd, data, buf_size);
+
+ if (send_size != buf_size) {
+ perror("write");
+ exit(EXIT_FAILURE);
+ }
+
+ if (!stream) {
+ curr_hash += send_size;
+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
+ }
+
+ free(data);
+ n--;
+ }
+
+ curr_hash += total_sum;
+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
+ control_writeulong(curr_hash);
+
+ close(fd);
+}
+
+void test_stream_zerocopy_rx_server(const struct test_opts *opts)
+{
+ test_connectible_zerocopy_rx_server(opts, true);
+}
+
+void test_seqpacket_zerocopy_rx_server(const struct test_opts *opts)
+{
+ test_connectible_zerocopy_rx_server(opts, false);
+}
+
+void test_stream_zerocopy_rx_client_loop_poll(const struct test_opts *opts)
+{
+ unsigned long remote_sum;
+ unsigned long total_sum;
+ unsigned long zc_on = 1;
+ size_t rx_map_len;
+ u32 poll_value = 0;
+ void *rx_va;
+ int fd;
+
+ fd = vsock_stream_connect_opt(opts->peer_cid, 1234,
+ SO_VM_SOCKETS_ZEROCOPY,
+ (void *)&zc_on, sizeof(zc_on));
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ rx_map_len = PAGE_SIZE * RX_MAPPING_PAGES;
+
+ rx_va = mmap(NULL, rx_map_len, PROT_READ, MAP_SHARED, fd, 0);
+ if (rx_va == MAP_FAILED) {
+ perror("mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ total_sum = 0;
+
+ while (1) {
+ volatile struct virtio_vsock_usr_hdr_pref *poll_hdr;
+ struct virtio_vsock_usr_hdr *data_hdr;
+ unsigned char *data_va;
+ unsigned char *end_va;
+ int leave_loop;
+ socklen_t len;
+ int hdr_cnt;
+
+ poll_hdr = (struct virtio_vsock_usr_hdr_pref *)rx_va;
+
+ if (poll_value != ~0) {
+ do {
+ poll_value = poll_hdr->poll_value;
+ } while (!poll_value);
+ }
+
+ len = sizeof(rx_va);
+
+ if (getsockopt(fd, AF_VSOCK,
+ SO_VM_SOCKETS_MAP_RX,
+ &rx_va, &len) < 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ data_va = rx_va + PAGE_SIZE;
+ end_va = (unsigned char *)(rx_va + rx_map_len);
+ data_hdr = (struct virtio_vsock_usr_hdr *)(poll_hdr + 1);
+ hdr_cnt = 0;
+ leave_loop = 0;
+
+ while (data_va != end_va) {
+ int data_len = data_hdr->len;
+
+ if (!data_hdr->len) {
+ /* Zero length in first header and there will
+ * be no more data, leave processing loop.
+ */
+ if (!hdr_cnt && (poll_value == ~0))
+ leave_loop = 1;
+
+ break;
+ }
+
+ while (data_len > 0) {
+ int i;
+ int to_read = (data_len < PAGE_SIZE) ?
+ data_len : PAGE_SIZE;
+
+ for (i = 0; i < to_read; i++)
+ total_sum += data_va[i];
+
+ data_va += PAGE_SIZE;
+ data_len -= PAGE_SIZE;
+ }
+
+ data_hdr++;
+ hdr_cnt++;
+ }
+
+ if (madvise((void *)rx_va + PAGE_SIZE,
+ rx_map_len - PAGE_SIZE,
+ MADV_DONTNEED)) {
+ perror("madvise");
+ exit(EXIT_FAILURE);
+ }
+
+ if (leave_loop)
+ break;
+ }
+
+ if (munmap(rx_va, rx_map_len)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+
+ remote_sum = control_readulong(NULL);
+
+ if (total_sum != remote_sum) {
+ fprintf(stderr, "loop sum mismatch %lu != %lu\n",
+ total_sum, remote_sum);
+ exit(EXIT_FAILURE);
+ }
+
+ close(fd);
+}
+
+void test_stream_zerocopy_rx_server_loop_poll(const struct test_opts *opts)
+{
+ size_t max_buf_size = TX_BUF_SIZE;
+ unsigned long total_sum;
+ int n = TX_SEND_LOOPS;
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ total_sum = 0;
+
+ while (n) {
+ unsigned char *data;
+ size_t buf_size;
+ int i;
+
+ buf_size = 1 + (rand() % max_buf_size);
+
+ data = malloc(buf_size);
+
+ if (!data) {
+ perror("malloc");
+ exit(EXIT_FAILURE);
+ }
+
+ for (i = 0; i < buf_size; i++) {
+ data[i] = rand() & 0xff;
+ total_sum += data[i];
+ }
+
+ if (write(fd, data, buf_size) != buf_size) {
+ perror("write");
+ exit(EXIT_FAILURE);
+ }
+
+ free(data);
+ n--;
+ }
+
+ control_writeulong(total_sum);
+
+ close(fd);
+}
+
+void test_stream_zerocopy_rx_inv_client(const struct test_opts *opts)
+{
+ size_t map_size = PAGE_SIZE * 5;
+ unsigned long zc_on = 1;
+ socklen_t len;
+ void *map_va;
+ int fd;
+
+ /* Connect with the zerocopy option enabled. */
+ fd = vsock_stream_connect_opt(opts->peer_cid, 1234, SO_VM_SOCKETS_ZEROCOPY,
+ (void *)&zc_on, sizeof(zc_on));
+ if (fd < 0) {
+ perror("connect");
+ exit(EXIT_FAILURE);
+ }
+
+ len = sizeof(map_va);
+ map_va = 0;
+
+ /* Try zerocopy with invalid mapping address. */
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+ &map_va, &len) == 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Try zerocopy with valid, but not socket mapping. */
+ map_va = mmap(NULL, map_size, PROT_READ,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (map_va == MAP_FAILED) {
+ perror("anon mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+ &map_va, &len) == 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ if (munmap(map_va, map_size)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Try zerocopy with valid, but too small mapping. */
+ map_va = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+ if (map_va != MAP_FAILED) {
+ perror("socket mmap small");
+ exit(EXIT_FAILURE);
+ }
+
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+ &map_va, &len) == 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ /* Try zerocopy with valid mapping, but not from first byte. */
+ map_va = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, 0);
+ if (map_va == MAP_FAILED) {
+ perror("socket mmap");
+ exit(EXIT_FAILURE);
+ }
+
+ map_va += PAGE_SIZE;
+
+ if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX,
+ &map_va, &len) == 0) {
+ perror("getsockopt");
+ exit(EXIT_FAILURE);
+ }
+
+ if (munmap(map_va - PAGE_SIZE, map_size)) {
+ perror("munmap");
+ exit(EXIT_FAILURE);
+ }
+
+ control_writeln("DONE");
+
+ close(fd);
+}
+
+void test_stream_zerocopy_rx_inv_server(const struct test_opts *opts)
+{
+ int fd;
+
+ fd = vsock_stream_accept(VMADDR_CID_ANY, 1234, NULL);
+
+ if (fd < 0) {
+ perror("accept");
+ exit(EXIT_FAILURE);
+ }
+
+ control_expectln("DONE");
+
+ close(fd);
+}
+
diff --git a/tools/testing/vsock/vsock_test_zerocopy.h b/tools/testing/vsock/vsock_test_zerocopy.h
new file mode 100644
index 000000000000..a818bd65b376
--- /dev/null
+++ b/tools/testing/vsock/vsock_test_zerocopy.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef VSOCK_TEST_ZEROCOPY_H
+#define VSOCK_TEST_ZEROCOPY_H
+
+void test_stream_zerocopy_rx_client(const struct test_opts *opts);
+void test_stream_zerocopy_rx_server(const struct test_opts *opts);
+void test_seqpacket_zerocopy_rx_client(const struct test_opts *opts);
+void test_seqpacket_zerocopy_rx_server(const struct test_opts *opts);
+void test_stream_zerocopy_rx_client_loop_poll(const struct test_opts *opts);
+void test_stream_zerocopy_rx_server_loop_poll(const struct test_opts *opts);
+void test_stream_zerocopy_rx_inv_client(const struct test_opts *opts);
+void test_stream_zerocopy_rx_inv_server(const struct test_opts *opts);
+
+#endif /* VSOCK_TEST_ZEROCOPY_H */
--
2.35.0
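
The user-visible layout that the test loop above consumes can be exercised without any kernel support. The sketch below is purely illustrative: it re-declares hypothetical mirrors of the patch's 'virtio_vsock_usr_hdr_pref'/'virtio_vsock_usr_hdr' structures, assumes a 4 KiB page, builds a fake mapping in plain memory, and walks the header page the same way the receive loop does:

```c
#include <stddef.h>
#include <stdint.h>

#define PG 4096u

/* Mirrors of the uapi structures from the patch (assumed layout). */
struct usr_hdr_pref { uint32_t poll_value; uint32_t hdr_num; };
struct usr_hdr { uint32_t flags; uint32_t len; };

/* Walk the header page like the test loop does and return the
 * total number of data bytes it describes. A zero-length header
 * terminates the walk. */
static size_t walk_headers(const unsigned char *map)
{
	const struct usr_hdr_pref *pref = (const struct usr_hdr_pref *)map;
	const struct usr_hdr *hdr = (const struct usr_hdr *)(pref + 1);
	size_t total = 0;
	uint32_t i;

	for (i = 0; i < pref->hdr_num && hdr[i].len; i++)
		total += hdr[i].len;

	return total;
}

/* Build a fake mapping: one header page followed by two data pages. */
static size_t demo(void)
{
	static unsigned char map[PG * 3];
	struct usr_hdr_pref *pref = (struct usr_hdr_pref *)map;
	struct usr_hdr *hdr = (struct usr_hdr *)(pref + 1);

	pref->poll_value = 1; /* "has data" */
	pref->hdr_num = 2;
	hdr[0].len = PG;      /* first packet fills a whole page */
	hdr[1].len = 100;     /* second packet is partial */
	hdr[2].len = 0;       /* terminator */

	return walk_headers(map);
}
```

In the real test, each 'len' entry tells the reader how many bytes of the following data pages belong to one packet before it must advance to the next page boundary.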

2022-11-06 20:32:57

by Arseniy Krasnov

Subject: [RFC PATCH v3 07/11] virtio/vsock: enable zerocopy callback

This enables zerocopy receive for the virtio transport by installing
its zerocopy callbacks.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 0305f94c98bb..cbdc91acd9d4 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -480,6 +480,8 @@ static struct virtio_transport virtio_transport = {
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
.stream_allow = virtio_transport_stream_allow,
+ .zerocopy_dequeue = virtio_transport_zerocopy_dequeue,
+ .zerocopy_rx_mmap = virtio_transport_zerocopy_init,

.seqpacket_dequeue = virtio_transport_seqpacket_dequeue,
.seqpacket_enqueue = virtio_transport_seqpacket_enqueue,
--
2.35.0

2022-11-06 20:33:15

by Arseniy Krasnov

Subject: [RFC PATCH v3 04/11] virtio/vsock: add transport zerocopy callback

This adds a transport callback that processes the socket's rx queue
and, instead of copying data to a user-provided buffer, inserts the
data pages of each packet into the user's vm area.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/linux/virtio_vsock.h | 7 +
include/uapi/linux/virtio_vsock.h | 14 ++
net/vmw_vsock/virtio_transport_common.c | 244 +++++++++++++++++++++++-
3 files changed, 261 insertions(+), 4 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c1be40f89a89..d10fdfd8d144 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -37,6 +37,7 @@ struct virtio_vsock_sock {
u32 buf_alloc;
struct list_head rx_queue;
u32 msg_count;
+ struct page *usr_poll_page;
};

struct virtio_vsock_pkt {
@@ -51,6 +52,7 @@ struct virtio_vsock_pkt {
bool reply;
bool tap_delivered;
bool slab_buf;
+ bool split;
};

struct virtio_vsock_pkt_info {
@@ -131,6 +133,11 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
bool virtio_transport_dgram_allow(u32 cid, u32 port);

+int virtio_transport_zerocopy_init(struct vsock_sock *vsk,
+ struct vm_area_struct *vma);
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+ struct page **pages,
+ unsigned long *pages_num);
int virtio_transport_connect(struct vsock_sock *vsk);

int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
index 64738838bee5..2a0e4f309918 100644
--- a/include/uapi/linux/virtio_vsock.h
+++ b/include/uapi/linux/virtio_vsock.h
@@ -66,6 +66,20 @@ struct virtio_vsock_hdr {
__le32 fwd_cnt;
} __attribute__((packed));

+struct virtio_vsock_usr_hdr {
+ u32 flags;
+ u32 len;
+} __attribute__((packed));
+
+#define VIRTIO_VSOCK_USR_POLL_NO_DATA 0
+#define VIRTIO_VSOCK_USR_POLL_HAS_DATA 1
+#define VIRTIO_VSOCK_USR_POLL_SHUTDOWN ~0
+
+struct virtio_vsock_usr_hdr_pref {
+ u32 poll_value;
+ u32 hdr_num;
+} __attribute__((packed));
+
enum virtio_vsock_type {
VIRTIO_VSOCK_TYPE_STREAM = 1,
VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 444764869670..fa4a2688a5d5 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -12,6 +12,7 @@
#include <linux/ctype.h>
#include <linux/list.h>
#include <linux/virtio_vsock.h>
+#include <linux/mm.h>
#include <uapi/linux/vsockmon.h>

#include <net/sock.h>
@@ -241,6 +242,14 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
struct virtio_vsock_pkt *pkt)
{
+ if (vvs->usr_poll_page) {
+ struct virtio_vsock_usr_hdr_pref *hdr;
+
+ hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
+
+ hdr->poll_value = VIRTIO_VSOCK_USR_POLL_HAS_DATA;
+ }
+
if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
return false;

@@ -253,6 +262,14 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
{
vvs->rx_bytes -= pkt->len;
vvs->fwd_cnt += pkt->len;
+
+ if (!vvs->rx_bytes && vvs->usr_poll_page) {
+ struct virtio_vsock_usr_hdr_pref *hdr;
+
+ hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
+
+ hdr->poll_value = VIRTIO_VSOCK_USR_POLL_NO_DATA;
+ }
}

void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
@@ -347,6 +364,191 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
return err;
}

+int virtio_transport_zerocopy_init(struct vsock_sock *vsk,
+ struct vm_area_struct *vma)
+{
+ struct virtio_vsock_sock *vvs;
+ int err = 0;
+
+ if (vma->vm_end - vma->vm_start < 2 * PAGE_SIZE)
+ return -EINVAL;
+
+ vvs = vsk->trans;
+
+ spin_lock_bh(&vvs->rx_lock);
+
+ if (!vvs->usr_poll_page) {
+ /* GFP_ATOMIC because of spinlock. */
+ vvs->usr_poll_page = alloc_page(GFP_ATOMIC);
+
+ if (!vvs->usr_poll_page) {
+ err = -ENOMEM;
+ } else {
+ struct virtio_vsock_usr_hdr_pref *usr_hdr_pref;
+ unsigned long one_page = 1;
+
+ usr_hdr_pref = page_to_virt(vvs->usr_poll_page);
+
+ if (vsk->peer_shutdown & SHUTDOWN_MASK) {
+ usr_hdr_pref->poll_value = VIRTIO_VSOCK_USR_POLL_SHUTDOWN;
+ } else {
+ usr_hdr_pref->poll_value = vvs->rx_bytes ?
+ VIRTIO_VSOCK_USR_POLL_HAS_DATA :
+ VIRTIO_VSOCK_USR_POLL_NO_DATA;
+ }
+
+ usr_hdr_pref->hdr_num = 0;
+
+ err = vm_insert_pages(vma, vma->vm_start,
+ &vvs->usr_poll_page,
+ &one_page);
+
+ if (one_page)
+ err = -EINVAL;
+ }
+ } else {
+ err = -EINVAL;
+ }
+
+ spin_unlock_bh(&vvs->rx_lock);
+
+ return err;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_init);
+
+int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
+ struct page **pages,
+ unsigned long *pages_num)
+{
+ struct virtio_vsock_usr_hdr_pref *usr_hdr_pref;
+ struct virtio_vsock_usr_hdr *usr_hdr_buffer;
+ struct virtio_vsock_sock *vvs;
+ unsigned long max_usr_hdrs;
+ struct page *usr_hdr_page;
+ int pages_cnt;
+
+ if (*pages_num < 2)
+ return -EINVAL;
+
+ vvs = vsk->trans;
+
+ max_usr_hdrs = (PAGE_SIZE - sizeof(*usr_hdr_pref)) / sizeof(*usr_hdr_buffer);
+ *pages_num = min(max_usr_hdrs, *pages_num);
+ pages_cnt = 0;
+
+ spin_lock_bh(&vvs->rx_lock);
+
+ if (!vvs->usr_poll_page) {
+ spin_unlock_bh(&vvs->rx_lock);
+ return -EINVAL;
+ }
+
+ usr_hdr_page = vvs->usr_poll_page;
+ usr_hdr_pref = page_to_virt(usr_hdr_page);
+ usr_hdr_buffer = (struct virtio_vsock_usr_hdr *)(usr_hdr_pref + 1);
+ usr_hdr_pref->hdr_num = 0;
+
+ /* If the ref counter is 1, the page is owned only
+ * by the allocation and is not mapped yet, so insert
+ * it into the output array; it will be mapped.
+ */
+ if (page_ref_count(usr_hdr_page) == 1) {
+ pages[pages_cnt++] = usr_hdr_page;
+ /* Take one more reference, as the AF_VSOCK layer
+ * calls 'put_page()' on each returned page.
+ */
+ get_page(usr_hdr_page);
+ } else {
+ pages[pages_cnt++] = NULL;
+ }
+
+ /* Polling page is already mapped. */
+ while (!list_empty(&vvs->rx_queue) &&
+ pages_cnt < *pages_num) {
+ struct virtio_vsock_pkt *pkt;
+ ssize_t rest_data_bytes;
+ size_t moved_data_bytes;
+ unsigned long pg_offs;
+
+ pkt = list_first_entry(&vvs->rx_queue,
+ struct virtio_vsock_pkt, list);
+
+ rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
+
+ /* For packets bigger than one page, split the
+ * high-order buffer into order-0 pages. Otherwise
+ * 'vm_insert_pages()' will fail for all pages
+ * except the first.
+ */
+ if (rest_data_bytes > PAGE_SIZE) {
+ /* High order buffer not split yet. */
+ if (!pkt->split) {
+ split_page(virt_to_page(pkt->buf),
+ get_order(le32_to_cpu(pkt->hdr.len)));
+ pkt->split = true;
+ }
+ }
+
+ pg_offs = pkt->off;
+ moved_data_bytes = 0;
+
+ while (rest_data_bytes &&
+ pages_cnt < *pages_num) {
+ struct page *buf_page;
+
+ buf_page = virt_to_page(pkt->buf + pg_offs);
+
+ pages[pages_cnt++] = buf_page;
+ /* Take a reference to prevent this page from being
+ * returned to the page allocator when the packet is
+ * freed. The ref count becomes 2.
+ */
+ get_page(buf_page);
+ pg_offs += PAGE_SIZE;
+
+ if (rest_data_bytes >= PAGE_SIZE) {
+ moved_data_bytes += PAGE_SIZE;
+ rest_data_bytes -= PAGE_SIZE;
+ } else {
+ moved_data_bytes += rest_data_bytes;
+ rest_data_bytes = 0;
+ }
+ }
+
+ if (!rest_data_bytes)
+ usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
+ else
+ usr_hdr_buffer->flags = 0;
+
+ usr_hdr_buffer->len = moved_data_bytes;
+
+ usr_hdr_buffer++;
+ usr_hdr_pref->hdr_num++;
+
+ pkt->off = pg_offs;
+
+ if (rest_data_bytes == 0) {
+ list_del(&pkt->list);
+ virtio_transport_dec_rx_pkt(vvs, pkt);
+ virtio_transport_free_pkt(pkt);
+ }
+
+ /* Now ref count for all pages of packet is 1. */
+ }
+
+ if (*pages_num - 1 < max_usr_hdrs)
+ usr_hdr_buffer->len = 0;
+
+ spin_unlock_bh(&vvs->rx_lock);
+
+ virtio_transport_send_credit_update(vsk);
+
+ *pages_num = pages_cnt;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
+
static ssize_t
virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
struct msghdr *msg,
@@ -969,11 +1171,21 @@ void virtio_transport_release(struct vsock_sock *vsk)
{
struct sock *sk = &vsk->sk;
bool remove_sock = true;
+ struct virtio_vsock_sock *vvs = vsk->trans;

if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET)
remove_sock = virtio_transport_close(vsk);

if (remove_sock) {
+ spin_lock_bh(&vvs->rx_lock);
+
+ if (vvs->usr_poll_page) {
+ __free_page(vvs->usr_poll_page);
+ vvs->usr_poll_page = NULL;
+ }
+
+ spin_unlock_bh(&vvs->rx_lock);
+
sock_set_flag(sk, SOCK_DONE);
virtio_transport_remove_sock(vsk);
}
@@ -1077,6 +1289,7 @@ virtio_transport_recv_connected(struct sock *sk,
struct virtio_vsock_pkt *pkt)
{
struct vsock_sock *vsk = vsock_sk(sk);
+ struct virtio_vsock_sock *vvs = vsk->trans;
int err = 0;

switch (le16_to_cpu(pkt->hdr.op)) {
@@ -1095,6 +1308,19 @@ virtio_transport_recv_connected(struct sock *sk,
vsk->peer_shutdown |= RCV_SHUTDOWN;
if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
vsk->peer_shutdown |= SEND_SHUTDOWN;
+
+ spin_lock_bh(&vvs->rx_lock);
+
+ if (vvs->usr_poll_page) {
+ struct virtio_vsock_usr_hdr_pref *hdr;
+
+ hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
+
+ hdr->poll_value = VIRTIO_VSOCK_USR_POLL_SHUTDOWN;
+ }
+
+ spin_unlock_bh(&vvs->rx_lock);
+
if (vsk->peer_shutdown == SHUTDOWN_MASK &&
vsock_stream_has_data(vsk) <= 0 &&
!sock_flag(sk, SOCK_DONE)) {
@@ -1343,11 +1569,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
{
if (pkt->buf_len) {
- if (pkt->slab_buf)
+ if (pkt->slab_buf) {
kvfree(pkt->buf);
- else
- free_pages((unsigned long)pkt->buf,
- get_order(pkt->buf_len));
+ } else {
+ unsigned int order = get_order(pkt->buf_len);
+ unsigned long buf = (unsigned long)pkt->buf;
+
+ if (pkt->split) {
+ int i;
+
+ for (i = 0; i < (1 << order); i++)
+ free_page(buf + i * PAGE_SIZE);
+ } else {
+ free_pages(buf, order);
+ }
+ }
}

kfree(pkt);
--
2.35.0
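
A small host-side sketch of the capacity math in 'virtio_transport_zerocopy_dequeue()' above. It assumes a 4 KiB PAGE_SIZE and the 8-byte structure sizes implied by the uapi header; with those assumptions, the first page holds one prefix plus up to 511 per-packet headers, and each packet consumes one output-array slot per order-0 page after 'split_page()'. Illustration only, not kernel code:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Assumed mirrors of the uapi structures (8 bytes each). */
struct usr_hdr_pref { uint32_t poll_value; uint32_t hdr_num; };
struct usr_hdr { uint32_t flags; uint32_t len; };

/* Same capacity computation as in the dequeue callback: how many
 * per-packet headers fit in the first page after the prefix. */
static size_t max_usr_hdrs(void)
{
	return (PAGE_SIZE - sizeof(struct usr_hdr_pref)) /
	       sizeof(struct usr_hdr);
}

/* Number of order-0 pages a packet of 'len' bytes occupies, i.e.
 * how many slots it consumes in the output pages array. */
static size_t pkt_pages(uint32_t len)
{
	return (len + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

So a single 'getsockopt()' call can describe at most 511 packets, and a mapping of N pages can carry at most N - 1 pages of payload behind the header page.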

2022-11-06 20:35:00

by Arseniy Krasnov

Subject: [RFC PATCH v3 02/11] virtio/vsock: update, 'virtio_transport_recv_pkt()'

Pass the destination socket to 'virtio_transport_recv_pkt()'. The
caller may already hold the socket structure, so passing it avoids an
extra lookup in the connected/bound lists (if it is NULL, the lookup
is performed in the same way as before).

Signed-off-by: Arseniy Krasnov <[email protected]>
---
drivers/vhost/vsock.c | 2 +-
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport.c | 2 +-
net/vmw_vsock/virtio_transport_common.c | 12 +++++++-----
net/vmw_vsock/vsock_loopback.c | 2 +-
5 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 65475d128a1d..6f3d9f02cc1d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -548,7 +548,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid &&
le64_to_cpu(pkt->hdr.dst_cid) ==
vhost_transport_get_local_cid())
- virtio_transport_recv_pkt(&vhost_transport, pkt);
+ virtio_transport_recv_pkt(&vhost_transport, NULL, pkt);
else
virtio_transport_free_pkt(pkt);

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index d02cb7aa922f..c1be40f89a89 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -150,6 +150,7 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
void virtio_transport_destruct(struct vsock_sock *vsk);

void virtio_transport_recv_pkt(struct virtio_transport *t,
+ struct sock *sk,
struct virtio_vsock_pkt *pkt);
void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt);
void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt);
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 19909c1e9ba3..0305f94c98bb 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -560,7 +560,7 @@ static void virtio_transport_rx_work(struct work_struct *work)

pkt->len = len - sizeof(pkt->hdr);
virtio_transport_deliver_tap_pkt(pkt);
- virtio_transport_recv_pkt(&virtio_transport, pkt);
+ virtio_transport_recv_pkt(&virtio_transport, NULL, pkt);
}
} while (!virtqueue_enable_cb(vq));

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 37e8dbfe2f5d..444764869670 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1238,11 +1238,11 @@ static bool virtio_transport_valid_type(u16 type)
* lock.
*/
void virtio_transport_recv_pkt(struct virtio_transport *t,
+ struct sock *sk,
struct virtio_vsock_pkt *pkt)
{
struct sockaddr_vm src, dst;
struct vsock_sock *vsk;
- struct sock *sk;
bool space_available;

vsock_addr_init(&src, le64_to_cpu(pkt->hdr.src_cid),
@@ -1267,12 +1267,14 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
/* The socket must be in connected or bound table
* otherwise send reset back
*/
- sk = vsock_find_connected_socket(&src, &dst);
if (!sk) {
- sk = vsock_find_bound_socket(&dst);
+ sk = vsock_find_connected_socket(&src, &dst);
if (!sk) {
- (void)virtio_transport_reset_no_sock(t, pkt);
- goto free_pkt;
+ sk = vsock_find_bound_socket(&dst);
+ if (!sk) {
+ (void)virtio_transport_reset_no_sock(t, pkt);
+ goto free_pkt;
+ }
}
}

diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index 169a8cf65b39..37c28bbeafaf 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -134,7 +134,7 @@ static void vsock_loopback_work(struct work_struct *work)
list_del_init(&pkt->list);

virtio_transport_deliver_tap_pkt(pkt);
- virtio_transport_recv_pkt(&loopback_transport, pkt);
+ virtio_transport_recv_pkt(&loopback_transport, NULL, pkt);
}
}

--
2.35.0
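
The control flow this change introduces can be sketched in a few lines of host C (stub types, purely illustrative): the list lookup runs only when the caller did not already resolve the socket, otherwise it is skipped entirely:

```c
#include <stddef.h>

struct sock { int id; };

static int lookups; /* counts simulated connected/bound-list lookups */

static struct sock *find_socket(void)
{
	static struct sock bound = { 1 };

	lookups++;
	return &bound;
}

/* Mirror of the patched control flow: look the socket up only
 * when the caller passed NULL. */
static struct sock *recv_pkt(struct sock *sk)
{
	if (!sk)
		sk = find_socket();
	return sk;
}

/* Helper: how many lookups happen for one delivery, with or
 * without a caller-provided socket. */
static int lookup_count_for(int have_sk)
{
	struct sock s = { 2 };

	lookups = 0;
	recv_pkt(have_sk ? &s : NULL);
	return lookups;
}
```

This is why the zerocopy path, which already holds the socket, can deliver packets without touching the global socket tables.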

2022-11-06 20:49:31

by Arseniy Krasnov

Subject: [RFC PATCH v3 03/11] af_vsock: add zerocopy receive logic

This:
1) Adds a callback for the 'mmap()' call on a socket. It checks the
vm area flags and sets the vm area ops.
2) Adds a special 'getsockopt()' case which calls the transport
zerocopy callback. The input argument is the vm area address.
3) Adds 'getsockopt()'/'setsockopt()' for switching rx zerocopy
mode on/off.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
include/net/af_vsock.h | 8 ++
include/uapi/linux/vm_sockets.h | 3 +
net/vmw_vsock/af_vsock.c | 187 +++++++++++++++++++++++++++++++-
3 files changed, 196 insertions(+), 2 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 568a87c5e0d0..e4f12ef8e623 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -73,6 +73,8 @@ struct vsock_sock {

/* Private to transport. */
void *trans;
+
+ bool rx_zerocopy_on;
};

s64 vsock_stream_has_data(struct vsock_sock *vsk);
@@ -138,6 +140,12 @@ struct vsock_transport {
bool (*stream_allow)(u32 cid, u32 port);
int (*set_rcvlowat)(struct vsock_sock *vsk, int val);

+ int (*zerocopy_rx_mmap)(struct vsock_sock *vsk,
+ struct vm_area_struct *vma);
+ int (*zerocopy_dequeue)(struct vsock_sock *vsk,
+ struct page **pages,
+ unsigned long *pages_num);
+
/* SEQ_PACKET. */
ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
int flags);
diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
index c60ca33eac59..d1f792bed1a7 100644
--- a/include/uapi/linux/vm_sockets.h
+++ b/include/uapi/linux/vm_sockets.h
@@ -83,6 +83,9 @@

#define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8

+#define SO_VM_SOCKETS_MAP_RX 9
+#define SO_VM_SOCKETS_ZEROCOPY 10
+
#if !defined(__KERNEL__)
#if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
#define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ee418701cdee..21a915eb0820 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1663,6 +1663,16 @@ static int vsock_connectible_setsockopt(struct socket *sock,
}
break;
}
+ case SO_VM_SOCKETS_ZEROCOPY: {
+ if (sock->state != SS_UNCONNECTED) {
+ err = -EOPNOTSUPP;
+ break;
+ }
+
+ COPY_IN(val);
+ vsk->rx_zerocopy_on = val;
+ break;
+ }

default:
err = -ENOPROTOOPT;
@@ -1676,6 +1686,124 @@ static int vsock_connectible_setsockopt(struct socket *sock,
return err;
}

+static const struct vm_operations_struct afvsock_vm_ops = {
+};
+
+static int vsock_recv_zerocopy(struct socket *sock,
+ unsigned long address)
+{
+ const struct vsock_transport *transport;
+ struct vm_area_struct *vma;
+ unsigned long vma_pages;
+ struct vsock_sock *vsk;
+ struct page **pages;
+ struct sock *sk;
+ int err;
+ int i;
+
+ sk = sock->sk;
+ vsk = vsock_sk(sk);
+ err = 0;
+
+ lock_sock(sk);
+
+ if (!vsk->rx_zerocopy_on) {
+ err = -EOPNOTSUPP;
+ goto out_unlock_sock;
+ }
+
+ transport = vsk->transport;
+
+ if (!transport->zerocopy_dequeue) {
+ err = -EOPNOTSUPP;
+ goto out_unlock_sock;
+ }
+
+ mmap_write_lock(current->mm);
+
+ vma = vma_lookup(current->mm, address);
+
+ if (!vma || vma->vm_ops != &afvsock_vm_ops) {
+ err = -EINVAL;
+ goto out_unlock_vma;
+ }
+
+ /* Allow to use vm area only from the first page. */
+ if (vma->vm_start != address) {
+ err = -EINVAL;
+ goto out_unlock_vma;
+ }
+
+ vma_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
+ pages = kmalloc_array(vma_pages, sizeof(pages[0]),
+ GFP_KERNEL | __GFP_ZERO);
+
+ if (!pages) {
+ err = -EINVAL;
+ goto out_unlock_vma;
+ }
+
+ err = transport->zerocopy_dequeue(vsk, pages, &vma_pages);
+
+ if (err)
+ goto out_unlock_vma;
+
+ /* Now 'vma_pages' contains the number of pages in the
+ * array. If an array element is NULL, skip it and go
+ * to the next page.
+ */
+ for (i = 0; i < vma_pages; i++) {
+ if (pages[i]) {
+ unsigned long pages_inserted;
+
+ pages_inserted = 1;
+ err = vm_insert_pages(vma, address, &pages[i], &pages_inserted);
+
+ if (err || pages_inserted) {
+ /* Failed to insert some pages, so the vma is only
+ * partially mapped. Do not return yet; set an error
+ * code to be returned to the user, who must call
+ * 'madvise()'/'mmap()' to clear this vma. Anyway,
+ * references to all pages will be dropped below.
+ */
+ if (!err) {
+ err = -EFAULT;
+ break;
+ }
+ }
+ }
+
+ address += PAGE_SIZE;
+ }
+
+ i = 0;
+
+ while (i < vma_pages) {
+ /* Drop the ref count for all pages returned by the
+ * transport. We call 'put_page()' only once, as the
+ * transport must have called 'get_page()' at least
+ * once to prevent the pages from being freed. If the
+ * transport takes more than one reference per page,
+ * we don't care; if it takes exactly one, each page
+ * had a ref count of 1, which 'vm_insert_pages()'
+ * increments to 2. After this loop the ref count is
+ * 1 again, and the page will be returned to the
+ * allocator by the user.
+ */
+ if (pages[i])
+ put_page(pages[i]);
+ i++;
+ }
+
+ kfree(pages);
+
+out_unlock_vma:
+ mmap_write_unlock(current->mm);
+out_unlock_sock:
+ release_sock(sk);
+
+ return err;
+}
+
static int vsock_connectible_getsockopt(struct socket *sock,
int level, int optname,
char __user *optval,
@@ -1720,6 +1848,26 @@ static int vsock_connectible_getsockopt(struct socket *sock,
lv = sock_get_timeout(vsk->connect_timeout, &v,
optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
break;
+ case SO_VM_SOCKETS_ZEROCOPY: {
+ lock_sock(sk);
+
+ v.val64 = vsk->rx_zerocopy_on;
+
+ release_sock(sk);
+
+ break;
+ }
+ case SO_VM_SOCKETS_MAP_RX: {
+ unsigned long vma_addr;
+
+ if (len < sizeof(vma_addr))
+ return -EINVAL;
+
+ if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
+ return -EFAULT;
+
+ return vsock_recv_zerocopy(sock, vma_addr);
+ }

default:
return -ENOPROTOOPT;
@@ -2167,6 +2315,41 @@ static int vsock_set_rcvlowat(struct sock *sk, int val)
return 0;
}

+static int afvsock_mmap(struct file *file, struct socket *sock,
+ struct vm_area_struct *vma)
+{
+ const struct vsock_transport *transport;
+ struct vsock_sock *vsk;
+ struct sock *sk;
+ int err;
+
+ if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+ return -EPERM;
+
+ vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+ vma->vm_flags |= (VM_MIXEDMAP);
+ vma->vm_ops = &afvsock_vm_ops;
+
+ sk = sock->sk;
+ vsk = vsock_sk(sk);
+
+ lock_sock(sk);
+
+ transport = vsk->transport;
+
+ if (!transport || !transport->zerocopy_rx_mmap) {
+ err = -EOPNOTSUPP;
+ goto out_unlock;
+ }
+
+ err = transport->zerocopy_rx_mmap(vsk, vma);
+
+out_unlock:
+ release_sock(sk);
+
+ return err;
+}
+
static const struct proto_ops vsock_stream_ops = {
.family = PF_VSOCK,
.owner = THIS_MODULE,
@@ -2184,7 +2367,7 @@ static const struct proto_ops vsock_stream_ops = {
.getsockopt = vsock_connectible_getsockopt,
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
- .mmap = sock_no_mmap,
+ .mmap = afvsock_mmap,
.sendpage = sock_no_sendpage,
.set_rcvlowat = vsock_set_rcvlowat,
};
@@ -2206,7 +2389,7 @@ static const struct proto_ops vsock_seqpacket_ops = {
.getsockopt = vsock_connectible_getsockopt,
.sendmsg = vsock_connectible_sendmsg,
.recvmsg = vsock_connectible_recvmsg,
- .mmap = sock_no_mmap,
+ .mmap = afvsock_mmap,
.sendpage = sock_no_sendpage,
};

--
2.35.0
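
The reference-count reasoning in 'vsock_recv_zerocopy()' can be checked with a trivial host-side simulation. The event order below is an assumption reconstructed from the comments in the patches (allocation, transport 'get_page()', packet free, 'vm_insert_pages()', AF_VSOCK 'put_page()', user 'madvise()'):

```c
/* Each step mirrors one reference-count event on a data page that
 * travels from the rx virtqueue into the user's vma. */
static int ref_after_kernel_path(void)
{
	int ref = 1; /* alloc_page() when the rx buffer was filled */

	ref++;       /* get_page() in the transport's zerocopy dequeue */
	ref--;       /* free_page() when the drained packet is freed */
	ref++;       /* vm_insert_pages() maps the page into the vma */
	ref--;       /* put_page() in the AF_VSOCK layer's loop */

	return ref;  /* one reference left: the user's mapping */
}

/* madvise(MADV_DONTNEED) unmaps the page and drops the last
 * reference, returning the page to the allocator. */
static int ref_after_madvise(void)
{
	return ref_after_kernel_path() - 1;
}
```

If the arithmetic is right, exactly one reference survives the kernel path, which is why a user who skips 'madvise()' leaks the pages and makes the next 'getsockopt()' fail.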

2022-11-07 05:33:59

by Arseniy Krasnov

Subject: Re: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

On 06.11.2022 22:50, Christophe JAILLET wrote:
> Le 06/11/2022 à 20:36, Arseniy Krasnov a écrit :
>> To support zerocopy receive, packet's buffer allocation is changed: for
>> buffers which could be mapped to user's vma we can't use 'kmalloc()'(as
>> kernel restricts to map slab pages to user's vma) and raw buddy
>> allocator now called. But, for tx packets(such packets won't be mapped
>> to user), previous 'kmalloc()' way is used, but with special flag in
>> packet's structure which allows to distinguish between 'kmalloc()' and
>> raw pages buffers.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>>   drivers/vhost/vsock.c                   |  1 +
>>   include/linux/virtio_vsock.h            |  1 +
>>   net/vmw_vsock/virtio_transport.c        |  8 ++++++--
>>   net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
>>   4 files changed, 17 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 5703775af129..65475d128a1d 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>>           return NULL;
>>       }
>>   +    pkt->slab_buf = true;
>>       pkt->buf_len = pkt->len;
>>         nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index 35d7eedb5e8e..d02cb7aa922f 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
>>       u32 off;
>>       bool reply;
>>       bool tap_delivered;
>> +    bool slab_buf;
>>   };
>>     struct virtio_vsock_pkt_info {
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index ad64f403536a..19909c1e9ba3 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>>       vq = vsock->vqs[VSOCK_VQ_RX];
>>         do {
>> +        struct page *buf_page;
>> +
>>           pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>           if (!pkt)
>>               break;
>>   -        pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>> -        if (!pkt->buf) {
>> +        buf_page = alloc_page(GFP_KERNEL);
>> +
>> +        if (!buf_page) {
>>               virtio_transport_free_pkt(pkt);
>>               break;
>>           }
>>   +        pkt->buf = page_to_virt(buf_page);
>>           pkt->buf_len = buf_len;
>>           pkt->len = buf_len;
>>   diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index a9980e9b9304..37e8dbfe2f5d 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
>>           if (!pkt->buf)
>>               goto out_pkt;
>>   +        pkt->slab_buf = true;
>>           pkt->buf_len = len;
>>             err = memcpy_from_msg(pkt->buf, info->msg, len);
>> @@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>>     void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>>   {
>> -    kvfree(pkt->buf);
>> +    if (pkt->buf_len) {
>> +        if (pkt->slab_buf)
>> +            kvfree(pkt->buf);
>
> Hi,
>
> kfree()? (according to virtio_transport_alloc_pkt() in -next) or something else need to be changed.
>
Hello Christophe,

I think 'kvfree()' is still needed here, because the buffer for a packet could be allocated by
'kvmalloc()' in drivers/vhost/vsock.c. Correct me if I'm wrong.

>> +        else
>> +            free_pages((unsigned long)pkt->buf,
>> +                   get_order(pkt->buf_len));
>
> In virtio_vsock_rx_fill(), only alloc_page() is used.
>
> Should this be alloc_pages(.., get_order(buf_len)) or free_page() (without an 's') here?
This function also frees packets which were allocated in the vhost path. In vhost,
alloc_pages() is used for zerocopy packets.

Thank you, Arseniy
>
> Just my 2c,
>
> CJ
>
>> +    }
>> +
>>       kfree(pkt);
>>   }
>>   EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>

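The order math behind the 'free_pages()'/'free_page()' question in this exchange can be sketched in userspace. Below is a re-implementation of the kernel's 'get_order()' under the assumption of a 4 KiB page: a buffer of 'buf_len' bytes lives at order n, and once 'split_page()' has run, it must be released as 1 << n independent order-0 pages, which is what the reworked 'virtio_transport_free_pkt()' does:

```c
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Userspace re-implementation of the kernel's get_order():
 * the smallest order n such that (PAGE_SIZE << n) >= size. */
static unsigned int order_of(size_t size)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < size)
		order++;

	return order;
}

/* After split_page(), a former order-n buffer must be freed as
 * this many independent order-0 pages via free_page(). */
static unsigned int pages_to_free(size_t buf_len)
{
	return 1u << order_of(buf_len);
}
```

This also shows why the non-split branch can keep a single 'free_pages(buf, order)' call: it frees the same 1 << order pages, just as one compound unit.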
2022-11-07 21:52:20

by Christophe JAILLET

Subject: Re: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

Le 07/11/2022 à 06:23, Arseniy Krasnov a écrit :
> On 06.11.2022 22:50, Christophe JAILLET wrote:
>> Le 06/11/2022 à 20:36, Arseniy Krasnov a écrit :
>>> To support zerocopy receive, packet's buffer allocation is changed: for
>>> buffers which could be mapped to user's vma we can't use 'kmalloc()'(as
>>> kernel restricts to map slab pages to user's vma) and raw buddy
>>> allocator now called. But, for tx packets(such packets won't be mapped
>>> to user), previous 'kmalloc()' way is used, but with special flag in
>>> packet's structure which allows to distinguish between 'kmalloc()' and
>>> raw pages buffers.
>>>
>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>> ---
>>>   drivers/vhost/vsock.c                   |  1 +
>>>   include/linux/virtio_vsock.h            |  1 +
>>>   net/vmw_vsock/virtio_transport.c        |  8 ++++++--
>>>   net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
>>>   4 files changed, 17 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>> index 5703775af129..65475d128a1d 100644
>>> --- a/drivers/vhost/vsock.c
>>> +++ b/drivers/vhost/vsock.c
>>> @@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>>>           return NULL;
>>>       }
>>>   +    pkt->slab_buf = true;
>>>       pkt->buf_len = pkt->len;
>>>         nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>>> index 35d7eedb5e8e..d02cb7aa922f 100644
>>> --- a/include/linux/virtio_vsock.h
>>> +++ b/include/linux/virtio_vsock.h
>>> @@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
>>>       u32 off;
>>>       bool reply;
>>>       bool tap_delivered;
>>> +    bool slab_buf;
>>>   };
>>>     struct virtio_vsock_pkt_info {
>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>> index ad64f403536a..19909c1e9ba3 100644
>>> --- a/net/vmw_vsock/virtio_transport.c
>>> +++ b/net/vmw_vsock/virtio_transport.c
>>> @@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>>>       vq = vsock->vqs[VSOCK_VQ_RX];
>>>         do {
>>> +        struct page *buf_page;
>>> +
>>>           pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>>           if (!pkt)
>>>               break;
>>>   -        pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>>> -        if (!pkt->buf) {
>>> +        buf_page = alloc_page(GFP_KERNEL);
>>> +
>>> +        if (!buf_page) {
>>>               virtio_transport_free_pkt(pkt);
>>>               break;
>>>           }
>>>   +        pkt->buf = page_to_virt(buf_page);
>>>           pkt->buf_len = buf_len;
>>>           pkt->len = buf_len;
>>>   diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>> index a9980e9b9304..37e8dbfe2f5d 100644
>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>> @@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
>>>           if (!pkt->buf)
>>>               goto out_pkt;
>>>   +        pkt->slab_buf = true;
>>>           pkt->buf_len = len;
>>>             err = memcpy_from_msg(pkt->buf, info->msg, len);
>>> @@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>>>     void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>>>   {
>>> -    kvfree(pkt->buf);
>>> +    if (pkt->buf_len) {
>>> +        if (pkt->slab_buf)
>>> +            kvfree(pkt->buf);
>>
>> Hi,
>>
>> kfree()? (according to virtio_transport_alloc_pkt() in -next) or something else needs to be changed.
>>
> Hello Cristophe,
>
> I think 'kvfree()' is still needed here, because the packet's buffer could be allocated by 'kvmalloc()'
> in drivers/vhost/vsock.c. Correct me if I'm wrong.

Agreed.

>
>>> +        else
>>> +            free_pages((unsigned long)pkt->buf,
>>> +                   get_order(pkt->buf_len));
>>
>> In virtio_vsock_rx_fill(), only alloc_page() is used.
>>
>> Should this be alloc_pages(.., get_order(buf_len)) or free_page() (without an 's') here?
> This function frees packets which were allocated in vhost path also. In vhost, for zerocopy
> packets alloc_pageS() is used.

Ok. Seen in patch 5/11.

But wouldn't it be cleaner and future-proof to also have alloc_pageS()
in virtio_vsock_rx_fill(), even if get_order(buf->len) is known to be 0?

What if, for some reason, PAGE_SIZE were < 4kb for a given arch, or
VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE increased?

CJ

>
> Thank You, Arseniy
>>
>> Just my 2c,
>>
>> CJ
>>
>>> +    }
>>> +
>>>       kfree(pkt);
>>>   }
>>>   EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>>
>


2022-11-07 22:02:06

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

On 08.11.2022 00:24, Christophe JAILLET wrote:
> Le 07/11/2022 à 06:23, Arseniy Krasnov a écrit :
>> On 06.11.2022 22:50, Christophe JAILLET wrote:
>>> Le 06/11/2022 à 20:36, Arseniy Krasnov a écrit :
>>>> To support zerocopy receive, the packet buffer allocation is changed: for
>>>> buffers which could be mapped into the user's vma we can't use 'kmalloc()'
>>>> (as the kernel refuses to map slab pages into a user's vma), so the raw
>>>> buddy allocator is now called. But for tx packets (such packets won't be
>>>> mapped to the user), the previous 'kmalloc()' way is used, with a special
>>>> flag in the packet's structure which allows distinguishing between
>>>> 'kmalloc()' and raw page buffers.
>>>>
>>>> Signed-off-by: Arseniy Krasnov <[email protected]>
>>>> ---
>>>>    drivers/vhost/vsock.c                   |  1 +
>>>>    include/linux/virtio_vsock.h            |  1 +
>>>>    net/vmw_vsock/virtio_transport.c        |  8 ++++++--
>>>>    net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
>>>>    4 files changed, 17 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>>> index 5703775af129..65475d128a1d 100644
>>>> --- a/drivers/vhost/vsock.c
>>>> +++ b/drivers/vhost/vsock.c
>>>> @@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
>>>>            return NULL;
>>>>        }
>>>>    +    pkt->slab_buf = true;
>>>>        pkt->buf_len = pkt->len;
>>>>          nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>>>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>>>> index 35d7eedb5e8e..d02cb7aa922f 100644
>>>> --- a/include/linux/virtio_vsock.h
>>>> +++ b/include/linux/virtio_vsock.h
>>>> @@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
>>>>        u32 off;
>>>>        bool reply;
>>>>        bool tap_delivered;
>>>> +    bool slab_buf;
>>>>    };
>>>>      struct virtio_vsock_pkt_info {
>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>> index ad64f403536a..19909c1e9ba3 100644
>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>> @@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
>>>>        vq = vsock->vqs[VSOCK_VQ_RX];
>>>>          do {
>>>> +        struct page *buf_page;
>>>> +
>>>>            pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>>>            if (!pkt)
>>>>                break;
>>>>    -        pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>>>> -        if (!pkt->buf) {
>>>> +        buf_page = alloc_page(GFP_KERNEL);
>>>> +
>>>> +        if (!buf_page) {
>>>>                virtio_transport_free_pkt(pkt);
>>>>                break;
>>>>            }
>>>>    +        pkt->buf = page_to_virt(buf_page);
>>>>            pkt->buf_len = buf_len;
>>>>            pkt->len = buf_len;
>>>>    diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index a9980e9b9304..37e8dbfe2f5d 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
>>>>            if (!pkt->buf)
>>>>                goto out_pkt;
>>>>    +        pkt->slab_buf = true;
>>>>            pkt->buf_len = len;
>>>>              err = memcpy_from_msg(pkt->buf, info->msg, len);
>>>> @@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>>>>      void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>>>>    {
>>>> -    kvfree(pkt->buf);
>>>> +    if (pkt->buf_len) {
>>>> +        if (pkt->slab_buf)
>>>> +            kvfree(pkt->buf);
>>>
>>> Hi,
>>>
>>> kfree()? (according to virtio_transport_alloc_pkt() in -next) or something else need to be changed.
>>>
>> Hello Cristophe,
>>
>> I think 'kvfree()' is still needed here, because the packet's buffer could be allocated by 'kvmalloc()'
>> in drivers/vhost/vsock.c. Correct me if I'm wrong.
>
> Agreed.
>
>>
>>>> +        else
>>>> +            free_pages((unsigned long)pkt->buf,
>>>> +                   get_order(pkt->buf_len));
>>>
>>> In virtio_vsock_rx_fill(), only alloc_page() is used.
>>>
>>> Should this be alloc_pages(.., get_order(buf_len)) or free_page() (without an 's') here?
>> This function frees packets which were allocated in vhost path also. In vhost, for zerocopy
>> packets alloc_pageS() is used.
>
> Ok. Seen in patch 5/11.
>
> But wouldn't it be cleaner and future-proof to also have alloc_pageS() in virtio_vsock_rx_fill(), even if get_order(buf->len) is known to be 0?
>
> What if, for some reason, PAGE_SIZE were < 4kb for a given arch, or VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE increased?
Yes, I see. You're right. It would be correct to use alloc_pages(get_order(buf->len)), because in the current version the real
length of the packet buffer (e.g. a single page) is completely unrelated to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE. I'll fix it in the next version.

Thank You
>
> CJ
>
>>
>> Thank You, Arseniy
>>>
>>> Just my 2c,
>>>
>>> CJ
>>>
>>>> +    }
>>>> +
>>>>        kfree(pkt);
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>>>
>>
>

2022-11-10 12:04:23

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 04/11] virtio/vsock: add transport zerocopy callback

On 06.11.2022 22:41, Arseniy Krasnov wrote:
> This adds transport callback which processes rx queue of socket and
> instead of copying data to user provided buffer, it inserts data pages
> of each packet to user's vm area.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>
> ---
> include/linux/virtio_vsock.h | 7 +
> include/uapi/linux/virtio_vsock.h | 14 ++
> net/vmw_vsock/virtio_transport_common.c | 244 +++++++++++++++++++++++-
> 3 files changed, 261 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index c1be40f89a89..d10fdfd8d144 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -37,6 +37,7 @@ struct virtio_vsock_sock {
> u32 buf_alloc;
> struct list_head rx_queue;
> u32 msg_count;
> + struct page *usr_poll_page;
> };
>
> struct virtio_vsock_pkt {
> @@ -51,6 +52,7 @@ struct virtio_vsock_pkt {
> bool reply;
> bool tap_delivered;
> bool slab_buf;
> + bool split;
> };
>
> struct virtio_vsock_pkt_info {
> @@ -131,6 +133,11 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
> struct sockaddr_vm *addr);
> bool virtio_transport_dgram_allow(u32 cid, u32 port);
>
> +int virtio_transport_zerocopy_init(struct vsock_sock *vsk,
> + struct vm_area_struct *vma);
> +int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
> + struct page **pages,
> + unsigned long *pages_num);
> int virtio_transport_connect(struct vsock_sock *vsk);
>
> int virtio_transport_shutdown(struct vsock_sock *vsk, int mode);
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 64738838bee5..2a0e4f309918 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -66,6 +66,20 @@ struct virtio_vsock_hdr {
> __le32 fwd_cnt;
> } __attribute__((packed));
>
> +struct virtio_vsock_usr_hdr {
> + u32 flags;
> + u32 len;
> +} __attribute__((packed));
> +
> +#define VIRTIO_VSOCK_USR_POLL_NO_DATA 0
> +#define VIRTIO_VSOCK_USR_POLL_HAS_DATA 1
> +#define VIRTIO_VSOCK_USR_POLL_SHUTDOWN ~0
> +
> +struct virtio_vsock_usr_hdr_pref {
> + u32 poll_value;
> + u32 hdr_num;
> +} __attribute__((packed));
> +
> enum virtio_vsock_type {
> VIRTIO_VSOCK_TYPE_STREAM = 1,
> VIRTIO_VSOCK_TYPE_SEQPACKET = 2,
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 444764869670..fa4a2688a5d5 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -12,6 +12,7 @@
> #include <linux/ctype.h>
> #include <linux/list.h>
> #include <linux/virtio_vsock.h>
> +#include <linux/mm.h>
> #include <uapi/linux/vsockmon.h>
>
> #include <net/sock.h>
> @@ -241,6 +242,14 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> static bool virtio_transport_inc_rx_pkt(struct virtio_vsock_sock *vvs,
> struct virtio_vsock_pkt *pkt)
> {
> + if (vvs->usr_poll_page) {
> + struct virtio_vsock_usr_hdr_pref *hdr;
> +
> + hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
> +
> + hdr->poll_value = VIRTIO_VSOCK_USR_POLL_HAS_DATA;
> + }
> +
> if (vvs->rx_bytes + pkt->len > vvs->buf_alloc)
> return false;
>
> @@ -253,6 +262,14 @@ static void virtio_transport_dec_rx_pkt(struct virtio_vsock_sock *vvs,
> {
> vvs->rx_bytes -= pkt->len;
> vvs->fwd_cnt += pkt->len;
> +
> + if (!vvs->rx_bytes && vvs->usr_poll_page) {
> + struct virtio_vsock_usr_hdr_pref *hdr;
> +
> + hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
> +
> + hdr->poll_value = VIRTIO_VSOCK_USR_POLL_NO_DATA;
> + }
> }
>
> void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct virtio_vsock_pkt *pkt)
> @@ -347,6 +364,191 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> return err;
> }
>
> +int virtio_transport_zerocopy_init(struct vsock_sock *vsk,
> + struct vm_area_struct *vma)
> +{
> + struct virtio_vsock_sock *vvs;
> + int err = 0;
> +
> + if (vma->vm_end - vma->vm_start < 2 * PAGE_SIZE)
> + return -EINVAL;
> +
> + vvs = vsk->trans;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + if (!vvs->usr_poll_page) {
> + /* GFP_ATOMIC because of spinlock. */
> + vvs->usr_poll_page = alloc_page(GFP_KERNEL | GFP_ATOMIC);
      ^^^ oops, only GFP_ATOMIC is needed here (GFP_KERNEL may sleep, which is not allowed under a spinlock)
> +
> + if (!vvs->usr_poll_page) {
> + err = -ENOMEM;
> + } else {
> + struct virtio_vsock_usr_hdr_pref *usr_hdr_pref;
> + unsigned long one_page = 1;
> +
> + usr_hdr_pref = page_to_virt(vvs->usr_poll_page);
> +
> + if (vsk->peer_shutdown & SHUTDOWN_MASK) {
> + usr_hdr_pref->poll_value = VIRTIO_VSOCK_USR_POLL_SHUTDOWN;
> + } else {
> + usr_hdr_pref->poll_value = vvs->rx_bytes ?
> + VIRTIO_VSOCK_USR_POLL_HAS_DATA :
> + VIRTIO_VSOCK_USR_POLL_NO_DATA;
> + }
> +
> + usr_hdr_pref->hdr_num = 0;
> +
> + err = vm_insert_pages(vma, vma->vm_start,
> + &vvs->usr_poll_page,
> + &one_page);
> +
> + if (one_page)
> + err = -EINVAL;
> + }
> + } else {
> + err = -EINVAL;
> + }
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + return err;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_init);
> +
> +int virtio_transport_zerocopy_dequeue(struct vsock_sock *vsk,
> + struct page **pages,
> + unsigned long *pages_num)
> +{
> + struct virtio_vsock_usr_hdr_pref *usr_hdr_pref;
> + struct virtio_vsock_usr_hdr *usr_hdr_buffer;
> + struct virtio_vsock_sock *vvs;
> + unsigned long max_usr_hdrs;
> + struct page *usr_hdr_page;
> + int pages_cnt;
> +
> + if (*pages_num < 2)
> + return -EINVAL;
> +
> + vvs = vsk->trans;
> +
> + max_usr_hdrs = (PAGE_SIZE - sizeof(*usr_hdr_pref)) / sizeof(*usr_hdr_buffer);
> + *pages_num = min(max_usr_hdrs, *pages_num);
> + pages_cnt = 0;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + if (!vvs->usr_poll_page) {
> + spin_unlock_bh(&vvs->rx_lock);
> + return -EINVAL;
> + }
> +
> + usr_hdr_page = vvs->usr_poll_page;
> + usr_hdr_pref = page_to_virt(usr_hdr_page);
> + usr_hdr_buffer = (struct virtio_vsock_usr_hdr *)(usr_hdr_pref + 1);
> + usr_hdr_pref->hdr_num = 0;
> +
> + /* If ref counter is 1, then page owned during
> + * allocation and not mapped, so insert it to
> + * the output array. It will be mapped.
> + */
> + if (page_ref_count(usr_hdr_page) == 1) {
> + pages[pages_cnt++] = usr_hdr_page;
> + /* Inc ref one more, as AF_VSOCK layer calls
> + * 'put_page()' for each returned page.
> + */
> + get_page(usr_hdr_page);
> + } else {
> + pages[pages_cnt++] = NULL;
> + }
> +
> + /* Polling page is already mapped. */
> + while (!list_empty(&vvs->rx_queue) &&
> + pages_cnt < *pages_num) {
> + struct virtio_vsock_pkt *pkt;
> + ssize_t rest_data_bytes;
> + size_t moved_data_bytes;
> + unsigned long pg_offs;
> +
> + pkt = list_first_entry(&vvs->rx_queue,
> + struct virtio_vsock_pkt, list);
> +
> + rest_data_bytes = le32_to_cpu(pkt->hdr.len) - pkt->off;
> +
> + /* For packets, bigger than one page, split it's
> + * high order allocated buffer to 0 order pages.
> + * Otherwise 'vm_insert_pages()' will fail, for
> + * all pages except first.
> + */
> + if (rest_data_bytes > PAGE_SIZE) {
> + /* High order buffer not split yet. */
> + if (!pkt->split) {
> + split_page(virt_to_page(pkt->buf),
> + get_order(le32_to_cpu(pkt->hdr.len)));
> + pkt->split = true;
> + }
> + }
> +
> + pg_offs = pkt->off;
> + moved_data_bytes = 0;
> +
> + while (rest_data_bytes &&
> + pages_cnt < *pages_num) {
> + struct page *buf_page;
> +
> + buf_page = virt_to_page(pkt->buf + pg_offs);
> +
> + pages[pages_cnt++] = buf_page;
> + /* Get reference to prevent this page being
> + * returned to page allocator when packet will
> + * be freed. Ref count will be 2.
> + */
> + get_page(buf_page);
> + pg_offs += PAGE_SIZE;
> +
> + if (rest_data_bytes >= PAGE_SIZE) {
> + moved_data_bytes += PAGE_SIZE;
> + rest_data_bytes -= PAGE_SIZE;
> + } else {
> + moved_data_bytes += rest_data_bytes;
> + rest_data_bytes = 0;
> + }
> + }
> +
> + if (!rest_data_bytes)
> + usr_hdr_buffer->flags = le32_to_cpu(pkt->hdr.flags);
> + else
> + usr_hdr_buffer->flags = 0;
> +
> + usr_hdr_buffer->len = moved_data_bytes;
> +
> + usr_hdr_buffer++;
> + usr_hdr_pref->hdr_num++;
> +
> + pkt->off = pg_offs;
> +
> + if (rest_data_bytes == 0) {
> + list_del(&pkt->list);
> + virtio_transport_dec_rx_pkt(vvs, pkt);
> + virtio_transport_free_pkt(pkt);
> + }
> +
> + /* Now ref count for all pages of packet is 1. */
> + }
> +
> + if (*pages_num - 1 < max_usr_hdrs)
> + usr_hdr_buffer->len = 0;
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> + virtio_transport_send_credit_update(vsk);
> +
> + *pages_num = pages_cnt;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_zerocopy_dequeue);
> +
> static ssize_t
> virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> struct msghdr *msg,
> @@ -969,11 +1171,21 @@ void virtio_transport_release(struct vsock_sock *vsk)
> {
> struct sock *sk = &vsk->sk;
> bool remove_sock = true;
> + struct virtio_vsock_sock *vvs = vsk->trans;
>
> if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET)
> remove_sock = virtio_transport_close(vsk);
>
> if (remove_sock) {
> + spin_lock_bh(&vvs->rx_lock);
> +
> + if (vvs->usr_poll_page) {
> + __free_page(vvs->usr_poll_page);
> + vvs->usr_poll_page = NULL;
> + }
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> sock_set_flag(sk, SOCK_DONE);
> virtio_transport_remove_sock(vsk);
> }
> @@ -1077,6 +1289,7 @@ virtio_transport_recv_connected(struct sock *sk,
> struct virtio_vsock_pkt *pkt)
> {
> struct vsock_sock *vsk = vsock_sk(sk);
> + struct virtio_vsock_sock *vvs = vsk->trans;
> int err = 0;
>
> switch (le16_to_cpu(pkt->hdr.op)) {
> @@ -1095,6 +1308,19 @@ virtio_transport_recv_connected(struct sock *sk,
> vsk->peer_shutdown |= RCV_SHUTDOWN;
> if (le32_to_cpu(pkt->hdr.flags) & VIRTIO_VSOCK_SHUTDOWN_SEND)
> vsk->peer_shutdown |= SEND_SHUTDOWN;
> +
> + spin_lock_bh(&vvs->rx_lock);
> +
> + if (vvs->usr_poll_page) {
> + struct virtio_vsock_usr_hdr_pref *hdr;
> +
> + hdr = (struct virtio_vsock_usr_hdr_pref *)page_to_virt(vvs->usr_poll_page);
> +
> + hdr->poll_value = 0xffffffff;
> + }
> +
> + spin_unlock_bh(&vvs->rx_lock);
> +
> if (vsk->peer_shutdown == SHUTDOWN_MASK &&
> vsock_stream_has_data(vsk) <= 0 &&
> !sock_flag(sk, SOCK_DONE)) {
> @@ -1343,11 +1569,21 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> if (pkt->buf_len) {
> - if (pkt->slab_buf)
> + if (pkt->slab_buf) {
> kvfree(pkt->buf);
> - else
> - free_pages((unsigned long)pkt->buf,
> - get_order(pkt->buf_len));
> + } else {
> + unsigned int order = get_order(pkt->buf_len);
> + unsigned long buf = (unsigned long)pkt->buf;
> +
> + if (pkt->split) {
> + int i;
> +
> + for (i = 0; i < (1 << order); i++)
> + free_page(buf + i * PAGE_SIZE);
> + } else {
> + free_pages(buf, order);
> + }
> + }
> }
>
> kfree(pkt);

2022-11-11 13:55:56

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

Hi Arseniy,
maybe we should start rebasing this series on the new support for
skbuff:
https://lore.kernel.org/lkml/[email protected]/

CCing Bobby to see if it's easy to integrate since you're both changing
the packet allocation.


On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote:
>
>
> INTRODUCTION
>
>Hello,
>
> This is an experimental patchset for virtio vsock zerocopy receive.
>It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses the
>same idea: call 'mmap()' on the socket's descriptor, then call 'getsockopt()'
>to fill the provided vma area with pages of virtio receive buffers. After
>the received data has been processed by the user, the pages must be freed by
>calling 'madvise()' with the MADV_DONTNEED flag set (but if the user does not
>call 'madvise()', the next 'getsockopt()' will fail).
>
> DETAILS
>
> Here is how mapping with mapped pages looks exactly: first page
>contains information about mapped data buffers. At zero offset mapping
>contains special data structure:
>
> struct virtio_vsock_usr_hdr_pref {
> u32 poll_value;
> u32 hdr_num;
> };
>
>This structure contains two fields:
>'poll_value' - shows whether the current socket has data to read. When the
>socket's input queue is empty, 'poll_value' is set to 0 by the kernel. When
>the input queue has some data, 'poll_value' is set to 1 by the kernel. When
>the socket is closed for data receive, 'poll_value' is ~0. This tells the
>user "there will be no more data; continue to call 'getsockopt()' until you
>find 'hdr_num' == 0". The user spins on it in userspace, without calling the
>'poll()' system call (of course, 'poll()' still works).
>'hdr_num' - shows the number of mapped pages with data, which start from the
>second page of this mapping.
>
>NOTE:
> This version has two limitations:
>
> 1) One mapping per socket is supported. It is implemented by adding
> a 'struct page *' pointer to the 'struct virtio_vsock' structure (the
> first page of the mapping, which contains 'virtio_vsock_usr_hdr_pref').
> But I think support for multiple pages could be implemented by using
> something like a hash table of such pages, or, more simply, by just
> using the first page of the mapping as the headers page by default.
> Also, the number of such pages could be controlled by 'setsockopt()'.
>
> 2) After the 'mmap()' call, it is impossible to call 'mmap()' again, even
> after calling 'madvise()'/'munmap()' on the whole mapping. This is
> because the socket can't handle 'munmap()' calls (as there is no such
> callback in 'proto_ops'), thus the polling page exists as long as the
> socket is open.
>
>After the 'virtio_vsock_usr_hdr_pref' object, the first page contains an array
>of trimmed virtio vsock packet headers (each contains only the length of data
>on the corresponding page and a 'flags' field):
>
> struct virtio_vsock_usr_hdr {
> uint32_t length;
> uint32_t flags;
> };
>
>The 'length' field lets the user know the exact size of the payload within
>each sequence of pages, and the 'flags' field allows processing of
>SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>pages are data pages from the virtio queue.
>
> Page 0 Page 1 Page N
>
> [ pref hdr0 .. hdrN ][ data ] .. [ data ]
> | | ^ ^
> | | | |
> | *-------|-----------*
> | |
> *----------------*
>
> Of course, a single header can represent an array of pages (when a
>packet's buffer is bigger than one page). So here is an example of the
>detailed mapping layout for some set of packets. Let's consider that we have
>the following sequence of packets: 56 bytes, 4096 bytes and 8200 bytes. All
>pages: 0, 1, 2, 3, 4 and 5 will be inserted into the user's vma.
>
> Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
> Page 1: [ 56 ]
> Page 2: [ 4096 ]
> Page 3: [ 4096 ]
> Page 4: [ 4096 ]
> Page 5: [ 8 ]
>
> Page 0 contains only array of headers:
> 'pref' is 'struct virtio_vsock_usr_hdr_pref'.
> 'hdr0' has 56 in length field.
> 'hdr1' has 4096 in length field.
> 'hdr2' has 8200 in length field.
> 'hdr3' has 0 in length field(this is end of data marker).
>
> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> Page 2 corresponds to 'hdr1' and filled with data.
> Page 3 corresponds to 'hdr2' and filled with data.
> Page 4 corresponds to 'hdr2' and filled with data.
> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
> pref will be the following: poll_value = 1, hdr_num = 5
>
> This patchset also changes the way packets are allocated: the current
>code uses only 'kmalloc()' to create the data buffer. A problem happens when
>we try to map such buffers into the user's vma - the kernel refuses to map
>slab pages into a user's vma (as pages of "not large" 'kmalloc()' allocations
>have the PageSlab flag set, and "not large" can be bigger than one page). So
>to avoid this, data buffers are now allocated using the 'alloc_pages()' call.
>
> DIFFERENCE WITH TCP
>
> As this feature uses the same approach as the TCP protocol, here are
>some differences between the two versions (from the user's POV):
>
>1) For 'getsockopt()':
> - This version passes only the address of the mapping.
> - TCP passes a special structure to 'getsockopt()'. In addition to the
> address of the mapping, it contains 'length' and 'recv_skip_hint'. The
> first is the size of data inside the mapping (an out param, set by the
> kernel). The second has bool type; if it is true, the user must dequeue
> the rest of the data using the 'read()' syscall (i.e. it is also an out
> parameter).
>2) Mapping structure:
> - This version uses the first page of the mapping for metadata and the
> rest of the pages for data.
> - The TCP version uses the whole mapping for data only.
>3) Data layout:
> - This version inserts virtio buffers into the mapping, so each buffer may
> be filled partially. To get the size of the payload in every buffer, the
> first page of the mapping must be used (see 2).
> - The TCP version inserts the pages of a single skb.
>
>*Please, correct me if I made some mistake in TCP zerocopy description.


Thank you for the description. Do you think it would be possible to try
to do the same as TCP?
Especially now that we should support skbuff.

>
> TESTS
>
> This patchset updates 'vsock_test' utility: two tests for new
>feature were added. First test covers invalid cases.Second checks valid
>transmission case.

Thank you, I really appreciate you adding new tests each time! Great
job!

>
> BENCHMARKING
>
> For benchmarking I've created a small test utility, 'vsock_rx_perf'.
>It works in client/server mode. When the client connects to the server, the
>server starts sending a specified amount of data to the client (the size is
>set as an input argument). The client reads the data and waits for the next
>portion of it. In client mode, dequeue logic works in three modes: copy,
>zerocopy and zerocopy with user polling.

Cool, thanks for adding it in this series.

>
>1) In copy mode client uses 'read()' system call.
>2) In zerocopy mode client uses 'mmap()'/'getsockopt()' to dequeue data
> and 'poll()' to wait data.
>3) In zerocopy mode + user polling client uses 'mmap()'/'getsockopt()',
> but to wait data it polls shared page(e.g. busyloop).
>
>Here is usage:
>-c <cid> Peer CID to connect to(if run in client mode).
>-m <megabytes> Number of megabytes to send.
>-b <bytes> Size of RX/TX buffer(or mapping) in pages.
>-r <bytes> SO_RCVLOWAT value in bytes(if run in client mode).
>-v <bytes> peer credit.
>-s Run as server.
>-z [n|y|u] Dequeue mode.
> n - copy mode. 1) above.
> y - zero copy mode. 2) above.
> u - zero copy mode + user poll. 3) above.
>
>Utility produces the following output:
>1) In server mode it prints number of sec spent for whole tx loop.
>2) In client mode it prints several values:
> * Number of sec, spent for whole rx loop(including 'poll()').
> * Number of sec, spend in dequeue system calls:
> In case of '-z n' it will be time in 'read()'.
> In case of '-z y|u' it will be time in 'getsockopt()' + 'madvise()'.
> * Number of wake ups with POLLIN flag set(except '-z u' mode).
> * Average time(ns) in single dequeue iteration(e.g. divide second
> value by third).
>
>The idea of the test is to compare the zerocopy approach and classic copy: it
>is clear that to dequeue some "small" amount of data, copy must be better,
>because a syscall with 'memcpy()' for 1 byte (for example) is nothing compared
>to two system calls, where the first must map at least one page while the
>second unmaps it.
>
>Test was performed with the following settings:
>1) Total size of data to send is 2G(-m argument).
>
>2) Peer's buffer size is changed to 2G(-v argument) - this is needed to
> avoid stalls of sender to wait for enough credit.
>
>3) Both buffer size(-b) and SO_RCVLOWAT(-r) are used to control number
> of bytes to dequeue in single loop iteration. Buffer size limits max
> number of bytes to read, while SO_RCVLOWAT won't allow user to get
> too small number of bytes.
>
>4) For sender, tx buffer(which is passed to 'write()') size is 16Mb. Of
> course we can set it to peer's buffer size and as we are in STREAM
> mode it leads to 'write()' will be called once.
>
>Designations here and below:
>H2G - host to guest transmission. Server is host, client is guest.
>G2H - guest to host transmission. Server is guest, client is host.
>C - copy mode.
>ZC - zerocopy mode.
>ZU - zerocopy with user poll mode. This mode is removed from the test at
> the moment, because I need to support SO_RCVLOWAT logic in it.
>
>So, rows correspond to the dequeue mode, while columns show the number of

Maybe it would be better to label the rows; I guess the first one is C
and the second one ZC?

Maybe it would be better to report Gbps, so that if we change the amount of
data exchanged, we always have a way to compare.

>bytes to dequeue in each mode. Each cell contains several values in the
>following format:
>*------------*
>| A / B |
>| C |
>| D |
>*------------*
>
>A - the number of seconds the server spent in the tx loop.
>B - the number of seconds the client spent in the rx loop.
>C - the number of seconds the client spent in the rx loop, excluding the
> 'poll()' system call (i.e. only in dequeue system calls).
>D - the average number of ns for each POLLIN wake-up (in other words,
> the average value for C).

I see only 3 values in the cells; I missed which one is C and which one
is D.

>
>G2H:
>
> #0 #1 #2 #3 #4 #5
> *----*---------*---------*---------*---------*---------*---------*
> | | | | | | | |
> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
> | | | | | | | |
> *----*---------*---------*---------*---------*---------*---------*
> | | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
> | | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 |
> *----*---------*---------*---------*---------*---------*---------*
> | |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
> | | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 |
> *----*---------*---------*---------*---------*---------*---------*
>
>H2G:
>
> #0 #1 #2 #3 #4 #5
> *----*---------*---------*---------*---------*---------*---------*
> | | | | | | | |
> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
> | | | | | | | |
> *----*---------*---------*---------*---------*---------*---------*
> | | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
> | | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 |
> *----*---------*---------*---------*---------*---------*---------*
> | |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
> | | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 |
> *----*---------*---------*---------*---------*---------*---------*
>
>Here are my thoughts about these numbers (the most obvious ones):
>
>1) Let's check the C and D values. We see that zerocopy dequeue is faster
> for big buffers (in G2H it starts from 64Kb, in H2G - from 128Kb). I
> think this is the main result of this test (at this moment), as it shows
> the performance difference between copy and zerocopy.

Yes, I think this is expected.

>
>2) In G2H mode both server and client spend almost the same time in the
> rx/tx loops (see A / B in the G2H table) - this looks good. In H2G mode,
> there is a significant difference between server and client. I think
> there are some side effects which produce this (I'm continuing to analyze).

Perhaps a different cost to notify the receiver? I think it's better to
talk about transmitter and receiver instead of server and client; the
latter is confusing.

>
>3) Let's check the C value. We can see that G2H is always faster than H2G,
> in both copy and zerocopy mode.

This is expected because the guest queues buffers up to 64K entirely,
while the host has to split packets into the guest's preallocated
buffers, which are 4K.

>
>4) Another interesting thing can be seen, for example, in the H2G table,
> row #0, col #4 (the 256Kb case). The number of seconds in zerocopy mode
> is smaller than in copy mode (1.25 vs 2.42), but the whole rx loop was

I see 1.65 vs 1.10, are these the same data, or am I looking at it
wrong?

> faster in copy mode (5 seconds vs 5.35 seconds). I.e. if we account for
> the time spent in 'poll()', copy mode looks faster (even though it spends
> more time in 'read()' than the zerocopy loop spends in 'getsockopt()' +
> 'madvise()'). I think this is also a non-obvious effect.
>
>So, according to 1), it is better to use zerocopy if you need to process
>big buffers with short rx waits (for example, it may be a video stream).
>In other cases it is better to use the classic copy way, as it will be
>more lightweight.
>
>All tests were performed on x86 board with 4-core Celeron N2930 CPU(of
>course it is, not a mainframe, but better than test with nested guest)
>and 8Gb of RAM.
>
>Anyway, this is not final version, and I will continue to improve both
>kernel logic and performance tests.

Great work so far!

Maybe to avoid having to rebase everything later, it's already
worthwhile to start using Bobby's patch with skbuff.

>
> SUGGESTIONS
>
>1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. May be I
> can merge both patches into single one?

This is already very big, so I don't know if it's worth breaking into a
preparation series and then a series that adds both.

>2) This version works with single page headers. May be I can add new
> 'getsockopt()' feature to allow to use multiple pages for headers.

What would be the benefit?

A small suggestion, run checkpatch with --strict, because there are
several warnings that would be better resolved.

I'll take a quick look at the patches, but I'd rather do a more detailed
review with skbuffs.

Thanks,
Stefano


2022-11-11 15:24:31

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v3 08/11] test/vsock: rework message bound test

On Sun, Nov 06, 2022 at 07:48:56PM +0000, Arseniy Krasnov wrote:
>This updates the message bounds test, making it more complex. Instead
>of sending 1-byte messages with one MSG_EOR bit, it sends messages of
>random length (one half of the messages are smaller than the page
>size, the second half are bigger) with a random number of MSG_EOR
>bits set. The receiver also doesn't know the total number of messages.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> tools/testing/vsock/control.c | 34 +++++++++
> tools/testing/vsock/control.h | 2 +
> tools/testing/vsock/util.c | 13 ++++
> tools/testing/vsock/util.h | 1 +
> tools/testing/vsock/vsock_test.c | 115 +++++++++++++++++++++++++++----
> 5 files changed, 152 insertions(+), 13 deletions(-)
>
>diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
>index 4874872fc5a3..bed1649bdf3d 100644
>--- a/tools/testing/vsock/control.c
>+++ b/tools/testing/vsock/control.c
>@@ -141,6 +141,40 @@ void control_writeln(const char *str)
> timeout_end();
> }
>
>+void control_writeulong(unsigned long value)
>+{
>+ char str[32];
>+
>+ if (snprintf(str, sizeof(str), "%lu", value) >= sizeof(str)) {
>+ perror("snprintf");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ control_writeln(str);
>+}
>+
>+unsigned long control_readulong(bool *ok)
>+{
>+ unsigned long value;
>+ char *str;
>+
>+ if (ok)
>+ *ok = false;
>+
>+ str = control_readln();
>+
>+ if (str == NULL)
>+ return 0;
>+
>+ value = strtoul(str, NULL, 10);
>+ free(str);
>+
>+ if (ok)
>+ *ok = true;
>+
>+ return value;
>+}
>+
> /* Return the next line from the control socket (without the trailing newline).
> *
> * The program terminates if a timeout occurs.
>diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
>index 51814b4f9ac1..cdd922dfea68 100644
>--- a/tools/testing/vsock/control.h
>+++ b/tools/testing/vsock/control.h
>@@ -9,7 +9,9 @@ void control_init(const char *control_host, const char *control_port,
> void control_cleanup(void);
> void control_writeln(const char *str);
> char *control_readln(void);
>+unsigned long control_readulong(bool *ok);
> void control_expectln(const char *str);
> bool control_cmpln(char *line, const char *str, bool fail);
>+void control_writeulong(unsigned long value);
>
> #endif /* CONTROL_H */
>diff --git a/tools/testing/vsock/util.c b/tools/testing/vsock/util.c
>index 2acbb7703c6a..351903836774 100644
>--- a/tools/testing/vsock/util.c
>+++ b/tools/testing/vsock/util.c
>@@ -395,3 +395,16 @@ void skip_test(struct test_case *test_cases, size_t test_cases_len,
>
> test_cases[test_id].skip = true;
> }
>+
>+unsigned long djb2(const void *data, size_t len)

hash_djb2 maybe it's more understandable

>+{
>+ unsigned long hash = 5381;
>+ int i = 0;
>+
>+ while (i < len) {
>+ hash = ((hash << 5) + hash) + ((unsigned char *)data)[i];
>+ i++;
>+ }
>+
>+ return hash;
>+}
>diff --git a/tools/testing/vsock/util.h b/tools/testing/vsock/util.h
>index a3375ad2fb7f..988cc69a4642 100644
>--- a/tools/testing/vsock/util.h
>+++ b/tools/testing/vsock/util.h
>@@ -49,4 +49,5 @@ void run_tests(const struct test_case *test_cases,
> void list_tests(const struct test_case *test_cases);
> void skip_test(struct test_case *test_cases, size_t test_cases_len,
> const char *test_id_str);
>+unsigned long djb2(const void *data, size_t len);
> #endif /* UTIL_H */
>diff --git a/tools/testing/vsock/vsock_test.c b/tools/testing/vsock/vsock_test.c
>index bb6d691cb30d..107c11165887 100644
>--- a/tools/testing/vsock/vsock_test.c
>+++ b/tools/testing/vsock/vsock_test.c
>@@ -284,10 +284,14 @@ static void test_stream_msg_peek_server(const struct test_opts *opts)
> close(fd);
> }
>
>-#define MESSAGES_CNT 7
>-#define MSG_EOR_IDX (MESSAGES_CNT / 2)
>+#define SOCK_BUF_SIZE (2 * 1024 * 1024)
>+#define MAX_MSG_SIZE (32 * 1024)
>+
> static void test_seqpacket_msg_bounds_client(const struct test_opts *opts)
> {
>+ unsigned long curr_hash;
>+ int page_size;
>+ int msg_count;
> int fd;
>
> fd = vsock_seqpacket_connect(opts->peer_cid, 1234);
>@@ -296,18 +300,69 @@ static void test_seqpacket_msg_bounds_client(const struct test_opts *opts)
> exit(EXIT_FAILURE);
> }
>
>- /* Send several messages, one with MSG_EOR flag */
>- for (int i = 0; i < MESSAGES_CNT; i++)
>- send_byte(fd, 1, (i == MSG_EOR_IDX) ? MSG_EOR : 0);
>+ /* Wait, until receiver sets buffer size. */
>+ control_expectln("SRVREADY");
>+
>+ curr_hash = 0;
>+ page_size = getpagesize();
>+ msg_count = SOCK_BUF_SIZE / MAX_MSG_SIZE;
>+
>+ for (int i = 0; i < msg_count; i++) {
>+ ssize_t send_size;
>+ size_t buf_size;
>+ int flags;
>+ void *buf;
>+
>+ /* Use "small" buffers and "big" buffers. */
>+ if (i & 1)
>+ buf_size = page_size +
>+ (rand() % (MAX_MSG_SIZE - page_size));
>+ else
>+ buf_size = 1 + (rand() % page_size);
>+
>+ buf = malloc(buf_size);
>+
>+ if (!buf) {
>+ perror("malloc");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ /* Set at least one MSG_EOR + some random. */
>+ if (i == (msg_count / 2) || (rand() & 1)) {
>+ flags = MSG_EOR;
>+ curr_hash++;
>+ } else {
>+ flags = 0;
>+ }
>+
>+ send_size = send(fd, buf, buf_size, flags);
>+
>+ if (send_size < 0) {
>+ perror("send");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ if (send_size != buf_size) {
>+ fprintf(stderr, "Invalid send size\n");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ curr_hash += send_size;
>+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
>+ }
>
> control_writeln("SENDDONE");
>+ control_writeulong(curr_hash);
> close(fd);
> }
>
> static void test_seqpacket_msg_bounds_server(const struct test_opts *opts)
> {
>+ unsigned long sock_buf_size;
>+ unsigned long remote_hash;
>+ unsigned long curr_hash;
> int fd;
>- char buf[16];
>+ char buf[MAX_MSG_SIZE];
> struct msghdr msg = {0};
> struct iovec iov = {0};
>
>@@ -317,25 +372,58 @@ static void test_seqpacket_msg_bounds_server(const struct test_opts *opts)
> exit(EXIT_FAILURE);
> }
>
>+ sock_buf_size = SOCK_BUF_SIZE;
>+
>+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_MAX_SIZE,
>+ &sock_buf_size, sizeof(sock_buf_size))) {
>+ perror("setsockopt");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ if (setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_BUFFER_SIZE,
>+ &sock_buf_size, sizeof(sock_buf_size))) {
>+ perror("setsockopt");
>+ exit(EXIT_FAILURE);
>+ }
>+
>+ /* Ready to receive data. */
>+ control_writeln("SRVREADY");
>+ /* Wait, until peer sends whole data. */
> control_expectln("SENDDONE");
> iov.iov_base = buf;
> iov.iov_len = sizeof(buf);
> msg.msg_iov = &iov;
> msg.msg_iovlen = 1;
>
>- for (int i = 0; i < MESSAGES_CNT; i++) {
>- if (recvmsg(fd, &msg, 0) != 1) {
>- perror("message bound violated");
>- exit(EXIT_FAILURE);
>- }
>+ curr_hash = 0;
>
>- if ((i == MSG_EOR_IDX) ^ !!(msg.msg_flags & MSG_EOR)) {
>- perror("MSG_EOR");
>+ while (1) {
>+ ssize_t recv_size;
>+
>+ recv_size = recvmsg(fd, &msg, 0);
>+
>+ if (!recv_size)
>+ break;
>+
>+ if (recv_size < 0) {
>+ perror("recvmsg");
> exit(EXIT_FAILURE);
> }
>+
>+ if (msg.msg_flags & MSG_EOR)
>+ curr_hash++;
>+
>+ curr_hash += recv_size;
>+ curr_hash = djb2(&curr_hash, sizeof(curr_hash));
> }
>
> close(fd);
>+ remote_hash = control_readulong(NULL);
>+
>+ if (curr_hash != remote_hash) {
>+ fprintf(stderr, "Message bounds broken\n");
>+ exit(EXIT_FAILURE);
>+ }
> }
>
> #define MESSAGE_TRUNC_SZ 32
>@@ -837,6 +925,7 @@ int main(int argc, char **argv)
> .peer_cid = VMADDR_CID_ANY,
> };
>
>+ srand(time(NULL));
> init_signals();
>
> for (;;) {
>--
>2.35.0
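The bounds check in the patch above boils down to a rolling hash over (message size, MSG_EOR) pairs, computed independently by sender and receiver. A minimal standalone sketch of the scheme (the 'fold_msg' helper name is mine; 'hash_djb2' mirrors the util.c helper above):

```c
#include <assert.h>
#include <stddef.h>

/* djb2 over the bytes of the accumulator, as in util.c above. */
static unsigned long hash_djb2(const void *data, size_t len)
{
	const unsigned char *p = data;
	unsigned long hash = 5381;
	size_t i;

	for (i = 0; i < len; i++)
		hash = ((hash << 5) + hash) + p[i];

	return hash;
}

/* Fold one message into the running bounds hash, mirroring what both
 * sides of the test do: +1 for each MSG_EOR boundary, plus the message
 * size, then rehash the accumulator itself. */
static unsigned long fold_msg(unsigned long hash, size_t size, int eor)
{
	if (eor)
		hash++;
	hash += size;
	return hash_djb2(&hash, sizeof(hash));
}
```

If the transport preserves message bounds, both sides fold the same (size, EOR) sequence and end with the same value; a receiver that sees merged or split records folds a different sequence and the final hashes diverge.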


2022-11-11 15:28:28

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

On Fri, Nov 11, 2022 at 2:47 PM Stefano Garzarella <[email protected]> wrote:
>
> Hi Arseniy,
> maybe we should start rebasing this series on the new support for
> skbuff:
> https://lore.kernel.org/lkml/[email protected]/
>
> CCing Bobby to see if it's easy to integrate since you're both changing
> the packet allocation.
>
>
> On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote:
> >
> >
> > INTRODUCTION
> >
> >Hello,
> >
> > This is an experimental patchset for virtio vsock zerocopy receive.
> >It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
> >the same idea: call 'mmap()' on the socket's descriptor, then call
> >'getsockopt()' to fill the provided vma area with pages of virtio
> >receive buffers. After the received data has been processed by the
> >user, the pages must be freed by an 'madvise()' call with the
> >MADV_DONTNEED flag set (but if the user does not call 'madvise()',
> >the next 'getsockopt()' will fail).
> >
> > DETAILS
> >
> > Here is how the mapping with mapped pages looks exactly: the first
> >page contains information about the mapped data buffers. At zero
> >offset the mapping contains a special data structure:
> >
> > struct virtio_vsock_usr_hdr_pref {
> > u32 poll_value;
> > u32 hdr_num;
> > };
> >
> >This structure contains two fields:
> >'poll_value' - shows that the current socket has data to read. When
> >the socket's input queue is empty, 'poll_value' is set to 0 by the
> >kernel. When the input queue has some data, 'poll_value' is set to 1
> >by the kernel. When the socket is closed for data receive,
> >'poll_value' is ~0. This tells the user "there will be no more data,
> >continue to call 'getsockopt()' until you find 'hdr_num' == 0". The
> >user spins on it in userspace, without calling the 'poll()' system
> >call (of course, 'poll()' still works).
> >'hdr_num' - shows the number of mapped pages with data, which start
> >from the second page of this mapping.
> >
> >NOTE:
> > This version has two limitations:
> >
> > 1) One mapping per socket is supported. It is implemented by adding
> > a 'struct page *' pointer to the 'struct virtio_vsock' structure
> > (the first page of the mapping, which contains
> > 'virtio_vsock_usr_hdr_pref'). But I think support for multiple
> > pages could be implemented by using something like a hash table
> > of such pages, or, more simply, by just using the first page of
> > the mapping as the headers page by default. Also, I think the
> > number of such pages may be controlled by 'setsockopt()'.
> >
> > 2) After the 'mmap()' call, it is impossible to call 'mmap()'
> > again, even after calling 'madvise()'/'munmap()' on the whole
> > mapping. This is because the socket can't handle 'munmap()' calls
> > (as there is no such callback in 'proto_ops'), thus the polling
> > page exists as long as the socket is open.
> >
> >After the 'virtio_vsock_usr_hdr_pref' object, the first page contains
> >an array of trimmed virtio vsock packet headers (each contains only
> >the length of data on the corresponding page and a 'flags' field):
> >
> > struct virtio_vsock_usr_hdr {
> > uint32_t length;
> > uint32_t flags;
> > };
> >
> >The 'length' field allows the user to know the exact size of the
> >payload within each sequence of pages, and the 'flags' field allows
> >processing SOCK_SEQPACKET flags (such as message bounds or record
> >bounds). All other pages are data pages from the virtio queue.
> >
> > Page 0 Page 1 Page N
> >
> > [ pref hdr0 .. hdrN ][ data ] .. [ data ]
> > | | ^ ^
> > | | | |
> > | *-------|-----------*
> > | |
> > *----------------*
> >
> > Of course, a single header can represent an array of pages (when
> >a packet's buffer is bigger than one page). So here is an example of
> >a detailed mapping layout for some set of packets. Let's consider
> >that we have the following sequence of packets: 56 bytes, 4096 bytes
> >and 8200 bytes. All pages: 0, 1, 2, 3, 4 and 5 will be inserted into
> >the user's vma.
> >
> > Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
> > Page 1: [ 56 ]
> > Page 2: [ 4096 ]
> > Page 3: [ 4096 ]
> > Page 4: [ 4096 ]
> > Page 5: [ 8 ]
> >
> > Page 0 contains only array of headers:
> > 'pref' is 'struct virtio_vsock_usr_hdr_pref'.
> > 'hdr0' has 56 in length field.
> > 'hdr1' has 4096 in length field.
> > 'hdr2' has 8200 in length field.
> > 'hdr3' has 0 in its length field (this is the end-of-data marker).
> >
> > Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> > Page 2 corresponds to 'hdr1' and filled with data.
> > Page 3 corresponds to 'hdr2' and filled with data.
> > Page 4 corresponds to 'hdr2' and filled with data.
> > Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
> >
> > pref will be the following: poll_value = 1, hdr_num = 5
> >
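As a sanity check of this layout, here is a minimal userspace-style sketch that walks a mock headers page. The structure names are taken from the description above; the PAGE_SZ constant, the 'walk_hdr_page' helper name, and the round-up-to-whole-pages rule are my assumptions about the layout:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SZ 4096u

/* Structures as described in the cover letter (userspace view). */
struct virtio_vsock_usr_hdr_pref {
	uint32_t poll_value;
	uint32_t hdr_num;
};

struct virtio_vsock_usr_hdr {
	uint32_t length;
	uint32_t flags;
};

/* Walk the headers page: sum the payload bytes and count the data
 * pages each packet occupies (rounded up to whole pages). A header
 * with length == 0 is the end-of-data marker. */
static uint64_t walk_hdr_page(const void *page0, uint32_t *data_pages)
{
	const struct virtio_vsock_usr_hdr_pref *pref = page0;
	const struct virtio_vsock_usr_hdr *hdr = (const void *)(pref + 1);
	uint64_t total = 0;

	*data_pages = 0;
	for (; hdr->length; hdr++) {
		total += hdr->length;
		*data_pages += (hdr->length + PAGE_SZ - 1) / PAGE_SZ;
	}
	return total;
}
```

For the example above (56, 4096 and 8200 bytes), the walk yields 12352 payload bytes over 5 data pages, matching hdr_num = 5.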
> > This patchset also changes the packet allocation way: the current
> >code uses only 'kmalloc()' to create data buffers. A problem happens
> >when we try to map such buffers into the user's vma - the kernel
> >restricts mapping slab pages into a user's vma (as pages of "not
> >large" 'kmalloc()' allocations have the PageSlab flag set, and "not
> >large" can be bigger than one page). So to avoid this, data buffers
> >are now allocated using the 'alloc_pages()' call.
> >
> > DIFFERENCE WITH TCP
> >
> > As this feature uses the same approach as the TCP protocol, here
> >are some differences between the two versions (from the user's POV):
> >
> >1) For 'getsockopt()':
> > - This version passes only the address of the mapping.
> > - TCP passes a special structure to 'getsockopt()'. In addition to
> > the address of the mapping, it contains 'length' and
> > 'recv_skip_hint'. The first is the size of the data inside the
> > mapping (an out param, set by the kernel). The second has bool
> > type; if it is true, the user must dequeue the rest of the data
> > using the 'read()' syscall (i.e. it is an out parameter too).
> >2) Mapping structure:
> > - This version uses first page of mapping for meta data and rest of
> > pages for data.
> > - TCP version uses whole mapping for data only.
> >3) Data layout:
> > - This version inserts virtio buffers into the mapping, so each
> > buffer may be filled partially. To get the size of the payload
> > in every buffer, the first page of the mapping must be used
> > (see 2).
> > - The TCP version inserts the pages of a single skb.
> >
> >*Please, correct me if I made some mistake in TCP zerocopy description.
>
>
> Thank you for the description. Do you think it would be possible to try
> to do the same as TCP?
> Especially now that we should support skbuff.
>
> >
> > TESTS
> >
> > This patchset updates the 'vsock_test' utility: two tests for the
> >new feature were added. The first test covers invalid cases; the
> >second checks the valid transmission case.
>
> Thank you, I really appreciate you adding new tests each time! Great
> job!
>
> >
> > BENCHMARKING
> >
> > For benchmarking I've created a small test utility, 'vsock_rx_perf'.
> >It works in client/server mode. When a client connects to the server,
> >the server starts sending the specified amount of data to the client
> >(the size is set as an input argument). The client reads the data and
> >waits for the next portion of it. In client mode, the dequeue logic
> >works in three modes: copy, zerocopy and zerocopy with user polling.
>
> Cool, thanks for adding it in this series.
>
> >
> >1) In copy mode the client uses the 'read()' system call.
> >2) In zerocopy mode the client uses 'mmap()'/'getsockopt()' to
> > dequeue data and 'poll()' to wait for data.
> >3) In zerocopy mode + user polling the client uses
> > 'mmap()'/'getsockopt()', but to wait for data it polls the shared
> > page (i.e. a busyloop).
> >
> >Here is usage:
> >-c <cid> Peer CID to connect to (if run in client mode).
> >-m <megabytes> Number of megabytes to send.
> >-b <bytes> Size of the RX/TX buffer (or mapping) in pages.
> >-r <bytes> SO_RCVLOWAT value in bytes (if run in client mode).
> >-v <bytes> Peer credit.
> >-s Run as server.
> >-z [n|y|u] Dequeue mode.
> > n - copy mode, 1) above.
> > y - zerocopy mode, 2) above.
> > u - zerocopy mode + user poll, 3) above.
> >
> >Utility produces the following output:
> >1) In server mode it prints the number of seconds spent in the whole
> > tx loop.
> >2) In client mode it prints several values:
> > * Number of seconds spent in the whole rx loop (including 'poll()').
> > * Number of seconds spent in dequeue system calls:
> > In case of '-z n' it will be the time in 'read()'.
> > In case of '-z y|u' it will be the time in 'getsockopt()' +
> > 'madvise()'.
> > * Number of wake ups with the POLLIN flag set (except '-z u' mode).
> > * Average time (ns) in a single dequeue iteration (i.e. the second
> > value divided by the third).
> >
> >The idea of the test is to compare the zerocopy approach and the
> >classic copy, as it is clear that to dequeue some "small" amount of
> >data, copy must be better: a syscall with 'memcpy()' for 1 byte (for
> >example) is nothing compared to two system calls, where the first
> >must map at least one page, while the second will unmap it.
> >
> >Test was performed with the following settings:
> >1) The total size of data to send is 2G (-m argument).
> >
> >2) The peer's buffer size is changed to 2G (-v argument) - this is
> > needed to avoid the sender stalling while waiting for enough credit.
> >
> >3) Both the buffer size (-b) and SO_RCVLOWAT (-r) are used to control
> > the number of bytes to dequeue in a single loop iteration. The
> > buffer size limits the max number of bytes to read, while
> > SO_RCVLOWAT won't allow the user to get too small a number of
> > bytes.
> >
> >4) For the sender, the tx buffer (which is passed to 'write()') size
> > is 16Mb. Of course we can set it to the peer's buffer size, and as
> > we are in STREAM mode this means 'write()' will be called only
> > once.
> >
> >Designations here and below:
> >H2G - host to guest transmission. Server is host, client is guest.
> >G2H - guest to host transmission. Server is guest, client is host.
> >C - copy mode.
> >ZC - zerocopy mode.
> >ZU - zerocopy with user poll mode. This mode is removed from the test
> > at this moment, because I need to support SO_RCVLOWAT logic in it.
> >
> >So, the rows correspond to the dequeue mode, while the columns show the number of
>
> Maybe it would be better to label the rows, I guess the first one is C
> and the second one ZC?
>
> Maybe it would be better to report Gbps so if we change the amount of
> data exchanged, we always have a way to compare.
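One way to do that normalization, as a small sketch (the 'to_gbps' helper name is mine; the example value assumes the test's 2 GiB total transfer):

```c
#include <assert.h>
#include <math.h>

/* Convert "M mebibytes moved in S seconds" (the -m argument and a
 * reported loop time) into gigabits per second, so numbers stay
 * comparable when the amount of transferred data changes. */
static double to_gbps(double mebibytes, double seconds)
{
	return mebibytes * 1024.0 * 1024.0 * 8.0 / seconds / 1e9;
}
```

For example, 2048 MiB moved in 2.42 s (a copy-mode cell from the G2H table) comes out to roughly 7.1 Gbps.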
>
> >bytes
> >to dequeue in each mode. Each cell contains several values in the
> >following format:
> >*------------*
> >| A / B |
> >| C |
> >| D |
> >*------------*
> >
> >A - number of seconds the server spent in the tx loop.
> >B - number of seconds the client spent in the rx loop.
> >C - number of seconds the client spent in the rx loop, excluding the
> > 'poll()' system call (i.e. only in dequeue system calls).
> >D - average number of ns for each POLLIN wake up (in other words,
> > the average value for C).
>
> I see only 3 values in the cells, I missed which one is C and which one
> is D.
>
> >
> >G2H:
> >
> > #0 #1 #2 #3 #4 #5
> > *----*---------*---------*---------*---------*---------*---------*
> > | | | | | | | |
> > | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
> > | | | | | | | |
> > *----*---------*---------*---------*---------*---------*---------*
> > | | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
> > | | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 |
> > *----*---------*---------*---------*---------*---------*---------*
> > | |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
> > | | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 |
> > *----*---------*---------*---------*---------*---------*---------*
> >
> >H2G:
> >
> > #0 #1 #2 #3 #4 #5
> > *----*---------*---------*---------*---------*---------*---------*
> > | | | | | | | |
> > | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
> > | | | | | | | |
> > *----*---------*---------*---------*---------*---------*---------*
> > | | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
> > | | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 |
> > *----*---------*---------*---------*---------*---------*---------*
> > | |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
> > | | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 |
> > *----*---------*---------*---------*---------*---------*---------*
> >
> >Here are my thoughts about these numbers (most obvious ones):
> >
> >1) Let's check the C and D values. We see that zerocopy dequeue is
> > faster on big buffers (in G2H it starts from 64Kb, in H2G - from
> > 128Kb). I think this is the main result of this test (at this
> > moment), as it shows the performance difference between copy and
> > zerocopy.
>
> Yes, I think this is expected.
>
> >
> >2) In G2H mode both server and client spend almost the same time in
> > rx/tx loops (see A / B in the G2H table) - it looks good. In H2G
> > mode, there is a significant difference between server and client.
> > I think there are some side effects which produce this (I continue
> > to analyze).
>
> Perhaps a different cost to notify the receiver? Also, I think it's
> better to talk about transmitter and receiver instead of server and
> client; the current naming is confusing.
>
> >
> >3) Let's check the C value. We can see that G2H is always faster
> > than H2G, in both copy and zerocopy mode.
>
> This is expected because the guest queues buffers up to 64K entirely,
> while the host has to split packets into the guest's preallocated
> buffers, which are 4K.
>
> >
> >4) Another interesting thing can be seen, for example, in the H2G
> > table, row #0, col #4 (the 256Kb case). The number of seconds in
> > zerocopy mode is smaller than in copy mode (1.25 vs 2.42), but the
> > whole rx loop was
>
> I see 1.65 vs 1.10, are these the same data, or am I looking at it
> wrong?
>
> > faster in copy mode (5 seconds vs 5.35 seconds). I.e. if we
> > account for the time spent in 'poll()', copy mode looks faster
> > (even though it spends more time in 'read()' than the zerocopy
> > loop does in 'getsockopt()' + 'madvise()'). I think this is also
> > a non-obvious effect.
> >
> >So, according to 1), it is better to use zerocopy if You need to
> >process big buffers with small rx waits (for example, it may be a
> >video stream). In other cases it is better to use the classic copy
> >way, as it is more lightweight.
> >
> >All tests were performed on an x86 board with a 4-core Celeron N2930
> >CPU (of course it is not a mainframe, but better than a test with a
> >nested guest) and 8Gb of RAM.
> >
> >Anyway, this is not final version, and I will continue to improve both
> >kernel logic and performance tests.
>
> Great work so far!
>
> Maybe to avoid having to rebase everything later, it's already
> worthwhile to start using Bobby's patch with skbuff.
>
> >
> > SUGGESTIONS
> >
> >1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. May be I
> > can merge both patches into single one?
>
> This is already very big, so I don't know if it's worth breaking into a
> preparation series and then a series that adds both.

For example, some test patches not related to zerocopy could go
separately. Maybe even vsock_rx_perf, without the zerocopy part,
which is not definitive for now.

Too big a set is always scary, even if this one is divided well, but
some patches as mentioned could go separately.

I left some comments, but as said I prefer to review it after the
rebase with skbuff, because I think it changes enough. I'm sorry about
that, but having the skbuffs I think is very important.

Thanks,
Stefano


2022-11-11 15:31:33

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [RFC PATCH v3 03/11] af_vsock: add zerocopy receive logic

On Sun, Nov 06, 2022 at 07:40:12PM +0000, Arseniy Krasnov wrote:
>This:
>1) Adds callback for 'mmap()' call on socket. It checks vm area flags
> and sets vm area ops.
>2) Adds special 'getsockopt()' case which calls transport zerocopy
> callback. Input argument is vm area address.
>3) Adds 'getsockopt()/setsockopt()' for switching on/off rx zerocopy
> mode.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> include/net/af_vsock.h | 8 ++
> include/uapi/linux/vm_sockets.h | 3 +
> net/vmw_vsock/af_vsock.c | 187 +++++++++++++++++++++++++++++++-
> 3 files changed, 196 insertions(+), 2 deletions(-)
>
>diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
>index 568a87c5e0d0..e4f12ef8e623 100644
>--- a/include/net/af_vsock.h
>+++ b/include/net/af_vsock.h
>@@ -73,6 +73,8 @@ struct vsock_sock {
>
> /* Private to transport. */
> void *trans;
>+
>+ bool rx_zerocopy_on;

Maybe it's better to keep the transport-private fields last, so I
would say put it before 'trans';

> };
>
> s64 vsock_stream_has_data(struct vsock_sock *vsk);
>@@ -138,6 +140,12 @@ struct vsock_transport {
> bool (*stream_allow)(u32 cid, u32 port);
> int (*set_rcvlowat)(struct vsock_sock *vsk, int val);
>
>+ int (*zerocopy_rx_mmap)(struct vsock_sock *vsk,
>+ struct vm_area_struct *vma);
>+ int (*zerocopy_dequeue)(struct vsock_sock *vsk,
>+ struct page **pages,
>+ unsigned long *pages_num);
>+
> /* SEQ_PACKET. */
> ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
> int flags);
>diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
>index c60ca33eac59..d1f792bed1a7 100644
>--- a/include/uapi/linux/vm_sockets.h
>+++ b/include/uapi/linux/vm_sockets.h
>@@ -83,6 +83,9 @@
>
> #define SO_VM_SOCKETS_CONNECT_TIMEOUT_NEW 8
>
>+#define SO_VM_SOCKETS_MAP_RX 9
>+#define SO_VM_SOCKETS_ZEROCOPY 10

Before removing RFC, we should document these macros because they are
exposed to the user.

>+
> #if !defined(__KERNEL__)
> #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
> #define SO_VM_SOCKETS_CONNECT_TIMEOUT SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD
>diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>index ee418701cdee..21a915eb0820 100644
>--- a/net/vmw_vsock/af_vsock.c
>+++ b/net/vmw_vsock/af_vsock.c
>@@ -1663,6 +1663,16 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> }
> break;
> }
>+ case SO_VM_SOCKETS_ZEROCOPY: {
>+ if (sock->state != SS_UNCONNECTED) {
>+ err = -EOPNOTSUPP;
>+ break;
>+ }
>+
>+ COPY_IN(val);
>+ vsk->rx_zerocopy_on = val;
>+ break;
>+ }
>
> default:
> err = -ENOPROTOOPT;
>@@ -1676,6 +1686,124 @@ static int vsock_connectible_setsockopt(struct socket *sock,
> return err;
> }
>
>+static const struct vm_operations_struct afvsock_vm_ops = {
>+};
>+
>+static int vsock_recv_zerocopy(struct socket *sock,
>+ unsigned long address)
>+{
>+ const struct vsock_transport *transport;
>+ struct vm_area_struct *vma;
>+ unsigned long vma_pages;
>+ struct vsock_sock *vsk;
>+ struct page **pages;
>+ struct sock *sk;
>+ int err;
>+ int i;
>+
>+ sk = sock->sk;
>+ vsk = vsock_sk(sk);
>+ err = 0;
>+
>+ lock_sock(sk);
>+
>+ if (!vsk->rx_zerocopy_on) {
>+ err = -EOPNOTSUPP;
>+ goto out_unlock_sock;
>+ }
>+
>+ transport = vsk->transport;
>+
>+ if (!transport->zerocopy_dequeue) {
>+ err = -EOPNOTSUPP;
>+ goto out_unlock_sock;
>+ }
>+
>+ mmap_write_lock(current->mm);
>+
>+ vma = vma_lookup(current->mm, address);
>+
>+ if (!vma || vma->vm_ops != &afvsock_vm_ops) {
>+ err = -EINVAL;
>+ goto out_unlock_vma;
>+ }
>+
>+ /* Allow to use vm area only from the first page. */
>+ if (vma->vm_start != address) {
>+ err = -EINVAL;
>+ goto out_unlock_vma;
>+ }
>+
>+ vma_pages = (vma->vm_end - vma->vm_start) / PAGE_SIZE;
>+ pages = kmalloc_array(vma_pages, sizeof(pages[0]),
>+ GFP_KERNEL | __GFP_ZERO);
>+
>+ if (!pages) {
>+ err = -EINVAL;
>+ goto out_unlock_vma;
>+ }
>+
>+ err = transport->zerocopy_dequeue(vsk, pages, &vma_pages);
>+
>+ if (err)
>+ goto out_unlock_vma;
>+
>+ /* Now 'vma_pages' contains number of pages in array.
>+ * If array element is NULL, skip it, go to next page.
>+ */
>+ for (i = 0; i < vma_pages; i++) {
>+ if (pages[i]) {
>+ unsigned long pages_inserted;
>+
>+ pages_inserted = 1;
>+ err = vm_insert_pages(vma, address, &pages[i], &pages_inserted);
>+
>+ if (err || pages_inserted) {
>+ /* Failed to insert some pages, we have "partially"
>+ * mapped vma. Do not return, set error code. This
>+ * code will be returned to user. User needs to call
>+ * 'madvise()/mmap()' to clear this vma. Anyway,
>+ * references to all pages will be dropped below.
>+ */
>+ if (!err) {
>+ err = -EFAULT;
>+ break;
>+ }
>+ }
>+ }
>+
>+ address += PAGE_SIZE;
>+ }
>+
>+ i = 0;
>+
>+ while (i < vma_pages) {
>+ /* Drop the ref count for all pages returned by the
>+ * transport. We call 'put_page()' only once, as the
>+ * transport must have called 'get_page()' at least
>+ * once per page to prevent the pages from being
>+ * freed. If the transport takes two or more
>+ * references per page - we don't care; if it takes
>+ * exactly one, every page had a ref count equal to 1,
>+ * then 'vm_insert_pages()' increments it to 2. After
>+ * this loop, the ref count will be 1 again, and the
>+ * page will be returned to the allocator by the user.
>+ */
>+ if (pages[i])
>+ put_page(pages[i]);
>+ i++;
>+ }
>+
>+ kfree(pages);
>+
>+out_unlock_vma:
>+ mmap_write_unlock(current->mm);
>+out_unlock_sock:
>+ release_sock(sk);
>+
>+ return err;
>+}
>+
> static int vsock_connectible_getsockopt(struct socket *sock,
> int level, int optname,
> char __user *optval,
>@@ -1720,6 +1848,26 @@ static int vsock_connectible_getsockopt(struct socket *sock,
> lv = sock_get_timeout(vsk->connect_timeout, &v,
> optname == SO_VM_SOCKETS_CONNECT_TIMEOUT_OLD);
> break;
>+ case SO_VM_SOCKETS_ZEROCOPY: {
>+ lock_sock(sk);
>+
>+ v.val64 = vsk->rx_zerocopy_on;
>+
>+ release_sock(sk);
>+
>+ break;
>+ }
>+ case SO_VM_SOCKETS_MAP_RX: {
>+ unsigned long vma_addr;
>+
>+ if (len < sizeof(vma_addr))
>+ return -EINVAL;
>+
>+ if (copy_from_user(&vma_addr, optval, sizeof(vma_addr)))
>+ return -EFAULT;
>+
>+ return vsock_recv_zerocopy(sock, vma_addr);
>+ }
>
> default:
> return -ENOPROTOOPT;
>@@ -2167,6 +2315,41 @@ static int vsock_set_rcvlowat(struct sock *sk, int val)
> return 0;
> }
>
>+static int afvsock_mmap(struct file *file, struct socket *sock,
>+ struct vm_area_struct *vma)
>+{
>+ const struct vsock_transport *transport;
>+ struct vsock_sock *vsk;
>+ struct sock *sk;
>+ int err;
>+
>+ if (vma->vm_flags & (VM_WRITE | VM_EXEC))
>+ return -EPERM;
>+
>+ vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
>+ vma->vm_flags |= (VM_MIXEDMAP);
>+ vma->vm_ops = &afvsock_vm_ops;
>+
>+ sk = sock->sk;
>+ vsk = vsock_sk(sk);
>+
>+ lock_sock(sk);
>+
>+ transport = vsk->transport;
>+
>+ if (!transport || !transport->zerocopy_rx_mmap) {
>+ err = -EOPNOTSUPP;
>+ goto out_unlock;
>+ }
>+
>+ err = transport->zerocopy_rx_mmap(vsk, vma);
>+
>+out_unlock:
>+ release_sock(sk);
>+
>+ return err;
>+}
>+
> static const struct proto_ops vsock_stream_ops = {
> .family = PF_VSOCK,
> .owner = THIS_MODULE,
>@@ -2184,7 +2367,7 @@ static const struct proto_ops vsock_stream_ops = {
> .getsockopt = vsock_connectible_getsockopt,
> .sendmsg = vsock_connectible_sendmsg,
> .recvmsg = vsock_connectible_recvmsg,
>- .mmap = sock_no_mmap,
>+ .mmap = afvsock_mmap,
> .sendpage = sock_no_sendpage,
> .set_rcvlowat = vsock_set_rcvlowat,
> };
>@@ -2206,7 +2389,7 @@ static const struct proto_ops vsock_seqpacket_ops = {
> .getsockopt = vsock_connectible_getsockopt,
> .sendmsg = vsock_connectible_sendmsg,
> .recvmsg = vsock_connectible_recvmsg,
>- .mmap = sock_no_mmap,
>+ .mmap = afvsock_mmap,
> .sendpage = sock_no_sendpage,
> };
>
>--
>2.35.0


2022-11-11 18:50:59

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

On 11.11.2022 17:06, Stefano Garzarella wrote:
> On Fri, Nov 11, 2022 at 2:47 PM Stefano Garzarella <[email protected]> wrote:
>>
>> Hi Arseniy,
>> maybe we should start rebasing this series on the new support for
>> skbuff:
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> CCing Bobby to see if it's easy to integrate since you're both changing
>> the packet allocation.
Hello Stefano,

Sure! I was waiting for the next version of the skbuff support (the previous one was in August, as I remember):

1) It will be implemented anyway, as skbuff is a general, well-known thing for networking :)
2) It will simplify MSG_ZEROCOPY support, because that uses an API based on skbuff.
>>
>>
>> On Sun, Nov 06, 2022 at 07:33:41PM +0000, Arseniy Krasnov wrote:
>>>
>>>
>>> INTRODUCTION
>>>
>>> Hello,
>>>
>>> This is an experimental patchset for virtio vsock zerocopy receive.
>>> It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> the same idea: call 'mmap()' on the socket's descriptor, then call
>>> 'getsockopt()' to fill the provided vma area with pages of virtio
>>> receive buffers. After the received data has been processed by the
>>> user, the pages must be freed by an 'madvise()' call with the
>>> MADV_DONTNEED flag set (if the user does not call 'madvise()', the
>>> next 'getsockopt()' will fail).
>>>
>>> DETAILS
>>>
>>> Here is exactly how the mapping looks: the first page
>>> contains information about the mapped data buffers. At offset zero the
>>> mapping contains a special data structure:
>>>
>>> struct virtio_vsock_usr_hdr_pref {
>>> u32 poll_value;
>>> u32 hdr_num;
>>> };
>>>
>>> This structure contains two fields:
>>> 'poll_value' - shows that the current socket has data to read. When the
>>> socket's input queue is empty, 'poll_value' is set to 0 by the kernel.
>>> When the input queue has some data, 'poll_value' is set to 1 by the
>>> kernel. When the socket is closed for data receive, 'poll_value' is ~0.
>>> This tells the user that "there will be no more data, continue to call
>>> 'getsockopt()' until you find 'hdr_num' == 0". The user spins on it in
>>> userspace, without calling the 'poll()' system call (of course, 'poll()'
>>> still works).
>>> 'hdr_num' - shows the number of mapped pages with data, which start from
>>> the second page of this mapping.
>>>
>>> NOTE:
>>> This version has two limitations:
>>>
>>> 1) One mapping per socket is supported. It is implemented by adding
>>> a 'struct page *' pointer to the 'struct virtio_vsock' structure
>>> (the first page of the mapping, which contains
>>> 'virtio_vsock_usr_hdr_pref'). But I think support for multiple pages
>>> could be implemented by using something like a hash table of such
>>> pages, or, more simply, just using the first page of the mapping as
>>> the headers page by default. Also, I think the number of such pages
>>> could be controlled by 'setsockopt()'.
>>>
>>> 2) After the 'mmap()' call, it is impossible to call 'mmap()' again,
>>> even after calling 'madvise()'/'munmap()' on the whole mapping. This
>>> is because the socket can't handle 'munmap()' calls (as there is no
>>> such callback in 'proto_ops'), thus the polling page exists for as
>>> long as the socket is open.
>>>
>>> After the 'virtio_vsock_usr_hdr_pref' object, the first page contains an
>>> array of trimmed virtio vsock packet headers (each contains only the
>>> length of data on the corresponding page and a 'flags' field):
>>>
>>> struct virtio_vsock_usr_hdr {
>>> uint32_t length;
>>> uint32_t flags;
>>> };
>>>
>>> The 'length' field allows the user to know the exact size of the payload
>>> within each sequence of pages, and the 'flags' field allows processing of
>>> SOCK_SEQPACKET flags (such as message bounds or record bounds). All other
>>> pages are data pages from the virtio queue.
>>>
>>> Page 0 Page 1 Page N
>>>
>>> [ pref hdr0 .. hdrN ][ data ] .. [ data ]
>>> | | ^ ^
>>> | | | |
>>> | *-------|-----------*
>>> | |
>>> *----------------*
>>>
>>> Of course, a single header could represent an array of pages (when a
>>> packet's buffer is bigger than one page). So here is an example of the
>>> detailed mapping layout for some set of packets. Let's consider that we
>>> have the following sequence of packets: 56 bytes, 4096 bytes and 8200
>>> bytes. All pages: 0, 1, 2, 3, 4 and 5 will be inserted into the user's
>>> vma.
>>>
>>> Page 0: [[ pref ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>> Page 1: [ 56 ]
>>> Page 2: [ 4096 ]
>>> Page 3: [ 4096 ]
>>> Page 4: [ 4096 ]
>>> Page 5: [ 8 ]
>>>
>>> Page 0 contains only the array of headers:
>>> 'pref' is 'struct virtio_vsock_usr_hdr_pref'.
>>> 'hdr0' has 56 in its length field.
>>> 'hdr1' has 4096 in its length field.
>>> 'hdr2' has 8200 in its length field.
>>> 'hdr3' has 0 in its length field (this is the end-of-data marker).
>>>
>>> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>> Page 2 corresponds to 'hdr1' and filled with data.
>>> Page 3 corresponds to 'hdr2' and filled with data.
>>> Page 4 corresponds to 'hdr2' and filled with data.
>>> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>>
>>> pref will be the following: poll_value = 1, hdr_num = 5
>>>
>>> This patchset also changes the packet allocation: the current code
>>> uses only 'kmalloc()' to create the data buffer. A problem happens when
>>> we try to map such buffers into the user's vma - the kernel refuses to
>>> map slab pages into a user's vma (pages of "not large" 'kmalloc()'
>>> allocations have the PageSlab flag set, and "not large" could be bigger
>>> than one page). So to avoid this, data buffers are now allocated with
>>> 'alloc_pages()'.
>>>
>>> DIFFERENCE WITH TCP
>>>
>>> As this feature uses the same approach as the TCP protocol, here are
>>> some differences between the two versions (from the user's POV):
>>>
>>> 1) For 'getsockopt()':
>>> - This version passes only the address of the mapping.
>>> - TCP passes a special structure to 'getsockopt()'. In addition to the
>>> address of the mapping, it contains 'length' and 'recv_skip_hint'. The
>>> first is the size of the data inside the mapping (an out parameter, set
>>> by the kernel). The second has bool type; if it is true, the user must
>>> dequeue the rest of the data using the 'read()' syscall (i.e. it is an
>>> out parameter also).
>>> 2) Mapping structure:
>>> - This version uses the first page of the mapping for metadata and the
>>> rest of the pages for data.
>>> - The TCP version uses the whole mapping for data only.
>>> 3) Data layout:
>>> - This version inserts virtio buffers into the mapping, so each buffer
>>> may be filled partially. To get the size of the payload in every
>>> buffer, the first page of the mapping must be used (see 2).
>>> - The TCP version inserts pages of a single skb.
>>>
>>> *Please, correct me if I made some mistake in TCP zerocopy description.
>>
>>
>> Thank you for the description. Do you think it would be possible to try
>> to do the same as TCP?
>> Especially now that we should support skbuff.
Yes, I'll rework my patchset to use skbuff.
>>
>>>
>>> TESTS
>>>
>>> This patchset updates the 'vsock_test' utility: two tests for the new
>>> feature were added. The first test covers invalid cases. The second
>>> checks a valid transmission case.
>>
>> Thank you, I really appreciate you adding new tests each time! Great
>> job!
>>
>>>
>>> BENCHMARKING
>>>
>>> For benchmarking I've created a small test utility, 'vsock_rx_perf'.
>>> It works in client/server mode. When a client connects to the server, the
>>> server starts sending a specified amount of data to the client (the size
>>> is set as an input argument). The client reads the data and waits for the
>>> next portion of it. In client mode, the dequeue logic works in three
>>> modes: copy, zerocopy and zerocopy with user polling.
>>
>> Cool, thanks for adding it in this series.
>>
>>>
>>> 1) In copy mode the client uses the 'read()' system call.
>>> 2) In zerocopy mode the client uses 'mmap()'/'getsockopt()' to dequeue
>>> data and 'poll()' to wait for data.
>>> 3) In zerocopy mode + user polling the client uses
>>> 'mmap()'/'getsockopt()', but to wait for data it polls the shared page
>>> (i.e. a busy loop).
>>>
>>> Here is the usage:
>>> -c <cid> Peer CID to connect to (if run in client mode).
>>> -m <megabytes> Number of megabytes to send.
>>> -b <bytes> Size of RX/TX buffer (or mapping) in pages.
>>> -r <bytes> SO_RCVLOWAT value in bytes (if run in client mode).
>>> -v <bytes> Peer credit.
>>> -s Run as server.
>>> -z [n|y|u] Dequeue mode.
>>> n - copy mode; 1) above.
>>> y - zero copy mode; 2) above.
>>> u - zero copy mode + user poll; 3) above.
>>>
>>> The utility produces the following output:
>>> 1) In server mode it prints the number of seconds spent in the whole tx loop.
>>> 2) In client mode it prints several values:
>>> * Number of seconds spent in the whole rx loop (including 'poll()').
>>> * Number of seconds spent in the dequeue system calls:
>>> In case of '-z n' it will be the time in 'read()'.
>>> In case of '-z y|u' it will be the time in 'getsockopt()' + 'madvise()'.
>>> * Number of wake-ups with the POLLIN flag set (except in '-z u' mode).
>>> * Average time (ns) of a single dequeue iteration (i.e. the second
>>> value divided by the third).
>>>
>>> The idea of the test is to compare the zerocopy approach with classic
>>> copy. It is clear that to dequeue some "small" amount of data, copy must
>>> be better, because a syscall with 'memcpy()' for 1 byte (for example) is
>>> nothing compared to two system calls, where the first must map at least
>>> one page, while the second will unmap it.
>>>
>>> The test was performed with the following settings:
>>> 1) The total size of data to send is 2G (-m argument).
>>>
>>> 2) The peer's buffer size is set to 2G (-v argument) - this is needed to
>>> avoid stalls of the sender waiting for enough credit.
>>>
>>> 3) Both the buffer size (-b) and SO_RCVLOWAT (-r) are used to control the
>>> number of bytes to dequeue in a single loop iteration. The buffer size
>>> limits the max number of bytes to read, while SO_RCVLOWAT won't let
>>> the user get too few bytes.
>>>
>>> 4) For the sender, the tx buffer (which is passed to 'write()') size is
>>> 16Mb. Of course we could set it to the peer's buffer size, and as we
>>> are in STREAM mode that would mean 'write()' is called once.
>>>
>>> Designations here and below:
>>> H2G - host-to-guest transmission. The server is the host, the client is
>>> the guest.
>>> G2H - guest-to-host transmission. The server is the guest, the client is
>>> the host.
>>> C - copy mode.
>>> ZC - zerocopy mode.
>>> ZU - zerocopy with user poll mode. This mode is removed from the test at
>>> the moment, because I need to support the SO_RCVLOWAT logic in it.
>>>
>>> So, the rows correspond to the dequeue mode, while the columns show the number of
>>
>> Maybe it would be better to label the rows, I guess the first one is C
>> and the second one ZC?
Oops, it seems something went wrong when sending the patch; here is the valid table:

G2H:

#0 #1 #2 #3 #4 #5
*----*---------*---------*---------*---------*---------*---------*
| | | | | | | |
| | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
| | | | | | | |
*----*---------*---------*---------*---------*---------*---------*
| | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
#0| C | 1.44 | 1.81 | 1.14 | 1.47 | 1.32 | 1.78 |
| | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 |
*----*---------*---------*---------*---------*---------*---------*
| |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
#1| ZC | 1.7 | 1.76 | 0.96 | 0.7 | 0.58 | 0.61 |
| | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 |
*----*---------*---------*---------*---------*---------*---------*

H2G:

#0 #1 #2 #3 #4 #5
*----*---------*---------*---------*---------*---------*---------*
| | | | | | | |
| | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
| | | | | | | |
*----*---------*---------*---------*---------*---------*---------*
| | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
#0| C | 3.3 | 4.3 | 2.37 | 2.33 | 2.42 | 2.75 |
| | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 |
*----*---------*---------*---------*---------*---------*---------*
| |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
#1| ZC | 3.5 | 3.38 | 2.32 | 1.75 | 1.25 | 0.98 |
| | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 |
*----*---------*---------*---------*---------*---------*---------*

The lines with #0 and #1 are missing! I don't know why, sorry :)

>>
>> Maybe it would be better to report Gbps so if we change the amount of
>> data exchanged, we always have a way to compare.
>>
Ok
>>> bytes
>>> to dequeue in each mode. Each cell contains several values in the
>>> following format:
>>> *------------*
>>> | A / B |
>>> | C |
>>> | D |
>>> *------------*
>>>
>>> A - number of seconds the server spent in the tx loop.
>>> B - number of seconds the client spent in the rx loop.
>>> C - number of seconds the client spent in the rx loop, excluding the
>>> 'poll()' system call (i.e. only in the dequeue system calls).
>>> D - average number of ns for each POLLIN wake-up (in other words,
>>> an average value derived from C).
>>
>> I see only 3 values in the cells, I missed which one is C and which one
>> is D.
Yep, the correct version of the table is above.
>>
>>>
>>> G2H:
>>>
>>> #0 #1 #2 #3 #4 #5
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | | | | | | | |
>>> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
>>> | | | | | | | |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | | 2.3/2.4 |2.48/2.53|2.34/2.38|2.73/2.76|2.65/2.68|3.26/3.35|
>>> | | 7039 | 15074 | 34975 | 89938 | 162384 | 438278 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | |2.37/2.42|2.36/1.96|2.36/2.42|2.43/2.43|2.42/2.47|2.42/2.46|
>>> | | 13598 | 15821 | 29574 | 43265 | 71771 | 150927 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>>
>>> H2G:
>>>
>>> #0 #1 #2 #3 #4 #5
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | | | | | | | |
>>> | | 4Kb | 16Kb | 64Kb | 128Kb | 256Kb | 512Kb |
>>> | | | | | | | |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | | 1.5/5.3 |1.55/5.00|1.60/5.00|1.65/5.00|1.65/5.00|1.74/5.00|
>>> | | 17145 | 24172 | 72650 | 143496 | 295960 | 674146 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>> | |1.10/6.21|1.10/6.00|1.10/5.48|1.10/5.38|1.10/5.35|1.10/5.35|
>>> | | 41855 | 46339 | 71988 | 106000 | 153064 | 242036 |
>>> *----*---------*---------*---------*---------*---------*---------*
>>>
>>> Here are my thoughts on these numbers (the most obvious ones):
>>>
>>> 1) Let's check the C and D values. We see that zerocopy dequeue is faster
>>> on big buffers (in G2H it starts from 64Kb, in H2G from 128Kb). I
>>> think this is the main result of this test (at the moment), as it shows
>>> the performance difference between copy and zerocopy.
>>
>> Yes, I think this is expected.
>>
>>>
>>> 2) In G2H mode both the server and the client spend almost the same time
>>> in the rx/tx loops (see A / B in the G2H table) - it looks good. In H2G
>>> mode, there is a significant difference between server and client. I
>>> think there are some side effects which produce this (I continue to
>>> analyze it).
>>
>> Perhaps a different cost to notify the receiver? I think it's better to
>> talk about transmitter and receiver, instead of server and client, I
>> think it's confusing.
Ok
>>
>>>
>>> 3) Let's check the C value. We can see that G2H is always faster than
>>> H2G, in both copy and zerocopy mode.
>>
>> This is expected because the guest queues buffers up to 64K entirely,
>> while the host has to split packets into the guest's preallocated
>> buffers, which are 4K.
>>
>>>
>>> 4) Another interesting thing can be seen, for example, in the H2G table,
>>> row #0, col #4 (the 256Kb case). The number of seconds in zerocopy mode
>>> is smaller than in copy mode (1.25 vs 2.42), but the whole rx loop was
>>
>> I see 1.65 vs 1.10, are these the same data, or am I looking at it
>> wrong?
Aha, please see the valid version of the table above, my fault.
>>
>>> faster in copy mode (5 seconds vs 5.35 seconds). I.e., if we account for
>>> the time spent in 'poll()', copy mode looks faster (even though it spends
>>> more time in 'read()' than the zerocopy loop does in 'getsockopt()' +
>>> 'madvise()'). I think this is also a non-obvious effect.
>>>
>>> So, according to 1), it is better to use zerocopy if you need to process
>>> big buffers with short rx waits (for example, it may be a video stream).
>>> In other cases it is better to use the classic copy way, as it will be
>>> more lightweight.
>>>
>>> All tests were performed on an x86 board with a 4-core Celeron N2930 CPU
>>> (of course it is not a mainframe, but better than a test with a nested
>>> guest) and 8Gb of RAM.
>>>
>>> Anyway, this is not final version, and I will continue to improve both
>>> kernel logic and performance tests.
>>
>> Great work so far!
>>
>> Maybe to avoid having to rebase everything later, it's already
>> worthwhile to start using Bobby's patch with skbuff.
>>
>>>
>>> SUGGESTIONS
>>>
>>> 1) I'm also working on MSG_ZEROCOPY support for virtio/vsock. Maybe I
>>> can merge both patches into a single one?
>>
>> This is already very big, so I don't know if it's worth breaking into a
>> preparation series and then a series that adds both.
Hm, ok, I think I can send both series separately (MSG_ZEROCOPY is smaller).
That will make them simpler to test and review. Whichever of the two is ready
to merge first, the other will be rebased and retested on top of it.
>
> For example, some test patches not related to zerocopy could go
> separately. Maybe even vsock_rx_perf without the zerocopy part that is
> not definitive for now.
Ok, I think I can send the patch which updates the current tests as a single
one - it will be easy to review and merge. Also vsock_rx_perf: I can prepare
it as a dedicated patch. Of course, without zerocopy it has the same
functionality as iperf, but I think it would be good to have an independent
small tool which implements both rx and tx zerocopy support (vsock_rx_perf
will be renamed to vsock_perf, for example). It is a small tool (compared to
iperf), targeted at vsock only and located in the kernel source tree.
>
> Too big a set is always scary, even if this one is divided well, but
> some patches as mentioned could go separately.
>
> I left some comments, but as said I prefer to review it after the
> rebase with skbuff, because I think it changes enough. I'm sorry about
> that, but having the skbuffs I think is very important.
Sure, no problem. Of course, skbuff is the correct way! Thanks for the review!
>
> Thanks,
> Stefano
>

2022-11-11 21:00:27

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 01/11] virtio/vsock: rework packet allocation logic

On Sun, Nov 06, 2022 at 07:36:02PM +0000, Arseniy Krasnov wrote:
> To support zerocopy receive, the packet buffer allocation is changed: for
> buffers which could be mapped to a user's vma we can't use 'kmalloc()' (as
> the kernel refuses to map slab pages to a user's vma), so the raw buddy
> allocator is now called. But for tx packets (such packets won't be mapped
> to the user), the previous 'kmalloc()' way is used, with a special flag in
> the packet's structure which allows distinguishing between 'kmalloc()' and
> raw-pages buffers.
>
> Signed-off-by: Arseniy Krasnov <[email protected]>

Hey Arseniy, great work here!

> ---
> drivers/vhost/vsock.c | 1 +
> include/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport.c | 8 ++++++--
> net/vmw_vsock/virtio_transport_common.c | 10 +++++++++-
> 4 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 5703775af129..65475d128a1d 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -399,6 +399,7 @@ vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> return NULL;
> }
>
> + pkt->slab_buf = true;
> pkt->buf_len = pkt->len;
>
> nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 35d7eedb5e8e..d02cb7aa922f 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -50,6 +50,7 @@ struct virtio_vsock_pkt {
> u32 off;
> bool reply;
> bool tap_delivered;
> + bool slab_buf;
> };

WRT the sk_buff series, I bet this bool will not be needed after the
rebase.

>
> struct virtio_vsock_pkt_info {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index ad64f403536a..19909c1e9ba3 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -255,16 +255,20 @@ static void virtio_vsock_rx_fill(struct virtio_vsock *vsock)
> vq = vsock->vqs[VSOCK_VQ_RX];
>
> do {
> + struct page *buf_page;
> +
> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
> if (!pkt)
> break;
>
> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> - if (!pkt->buf) {
> + buf_page = alloc_page(GFP_KERNEL);
> +
> + if (!buf_page) {

I think it should not be too hard to use build_skb() around the page
here after rebasing onto the sk_buff series.

> virtio_transport_free_pkt(pkt);
> break;
> }
>
> + pkt->buf = page_to_virt(buf_page);
> pkt->buf_len = buf_len;
> pkt->len = buf_len;
>
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index a9980e9b9304..37e8dbfe2f5d 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -69,6 +69,7 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info *info,
> if (!pkt->buf)
> goto out_pkt;
>
> + pkt->slab_buf = true;
> pkt->buf_len = len;
>
> err = memcpy_from_msg(pkt->buf, info->msg, len);
> @@ -1339,7 +1340,14 @@ EXPORT_SYMBOL_GPL(virtio_transport_recv_pkt);
>
> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
> {
> - kvfree(pkt->buf);
> + if (pkt->buf_len) {
> + if (pkt->slab_buf)
> + kvfree(pkt->buf);
> + else
> + free_pages((unsigned long)pkt->buf,
> + get_order(pkt->buf_len));
> + }
> +
> kfree(pkt);
> }
> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> --
> 2.35.0

2022-11-11 21:27:35

by Bobby Eshleman

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
> Hi Arseniy,
> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/[email protected]/
>
> CCing Bobby to see if it's easy to integrate since you're both changing the
> packet allocation.
>

It looks like the packet allocation can be married in somewhat nicely, since
SKBs may be built from pages using build_skb(). There may be some tweaking
necessary, though, since build_skb() also uses the tail chunk of the page to
hold struct skb_shared_info, IIRC.

I left some comments on the patch with the allocator in it.

>
> Maybe to avoid having to rebase everything later, it's already worthwhile to
> start using Bobby's patch with skbuff.
>

I'll be waiting until Monday to see if some more feedback comes in
before sending out v4, so I expect v4 early next week, FWIW.

Best,
Bobby

2022-11-12 12:10:50

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

On 11.11.2022 23:45, Bobby Eshleman wrote:
> On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
>> Hi Arseniy,
>> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/[email protected]/
>>
>> CCing Bobby to see if it's easy to integrate since you're both changing the
>> packet allocation.
>>
>
> This looks like the packet allocation can be married somewhat nicely in
> since SKBs may be built from pages using build_skb(). There may be some
> tweaking necessary though, since it also uses the tail chunk of the page
> to hold struct skb_shared_info IIRC.
>
> I left some comments on the patch with the allocator in it.
Hello Bobby,

Thanks for the review. I'll rebase my patchset onto your skbuff support.
>
>>
>> Maybe to avoid having to rebase everything later, it's already worthwhile to
>> start using Bobby's patch with skbuff.
>>
>
> I'll be waiting until Monday to see if some more feedback comes in
> before sending out v4, so I expect v4 early next week, FWIW.
One request from me: could you please CC me on the next versions of
your patchset, because:
1) I'll always have the latest version of the skbuff support.
2) I'll see the review process as well.

My contacts:
[email protected]
[email protected]

Thanks, Arseniy

>
> Best,
> Bobby

2022-11-13 10:30:28

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [RFC PATCH v3 00/11] virtio/vsock: experimental zerocopy receive

On 12.11.2022 14:40, Arseniy Krasnov wrote:

Hello again Bobby,

I wasn't CCed on your patchset, but I reviewed it anyway and will write my comments here in this
manner :) I found a strange thing:

In 'virtio_transport_recv_enqueue()' a new packet can be copied into the last packet in
the rx queue (an skb in the current version). During the copy you update the last skb's
length by calling 'skb_put(last_skb, skb->len)' as part of the 'memcpy()'. So 'last_skb'
now has a new length, but the header of the packet is not updated.

Now let's look at 'virtio_transport_seqpacket_do_dequeue()': it uses the value from the
packet's header as 'pkt_len', not the one from the skb:

pkt_len = (size_t)le32_to_cpu(hdr->len);

I think we need to update the last packet's header when merging a new packet into the last packet
of the rx queue.

Thanks, Arseniy


> On 11.11.2022 23:45, Bobby Eshleman wrote:
>> On Fri, Nov 11, 2022 at 02:47:15PM +0100, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>> maybe we should start rebasing this series on the new support for skbuff: https://lore.kernel.org/lkml/[email protected]/
>>>
>>> CCing Bobby to see if it's easy to integrate since you're both changing the
>>> packet allocation.
>>>
>>
>> This looks like the packet allocation can be married somewhat nicely in
>> since SKBs may be built from pages using build_skb(). There may be some
>> tweaking necessary though, since it also uses the tail chunk of the page
>> to hold struct skb_shared_info IIRC.
>>
>> I left some comments on the patch with the allocator in it.
> Hello Bobby,
>
> Thanks for the review. I'll rebase my patchset onto your skbuff support.
>>
>>>
>>> Maybe to avoid having to rebase everything later, it's already worthwhile to
>>> start using Bobby's patch with skbuff.
>>>
>>
>> I'll be waiting until Monday to see if some more feedback comes in
>> before sending out v4, so I expect v4 early next week, FWIW.
> One request from me: could you please CC me on the next versions of
> your patchset, because:
> 1) I'll always have the latest version of the skbuff support.
> 2) I'll see the review process as well.
>
> My contacts:
> [email protected]
> [email protected]
>
> Thanks, Arseniy
>
>>
>> Best,
>> Bobby
>