2022-06-06 06:05:03

by Arseniy Krasnov

Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

INTRODUCTION

Hello, this is an experimental implementation of virtio vsock zerocopy
receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
the same idea: call 'mmap()' on the socket's descriptor, then every 'getsockopt()'
will fill the provided vma area with pages of virtio RX buffers. After the received
data has been processed by the user, the pages must be freed by an 'madvise()' call
with the MADV_DONTNEED flag set (if the user doesn't call 'madvise()', the next
'getsockopt()' will fail).

DETAILS

Here is exactly how the mapping with mapped pages looks: the first page of the
mapping contains an array of trimmed virtio vsock packet headers (each contains only
the length of the data on the corresponding page(s) and a 'flags' field):

    struct virtio_vsock_usr_hdr {
        uint32_t length;
        uint32_t flags;
        uint32_t copy_len;
    };

The 'length' field lets the user know the exact size of the payload within each
sequence of pages, and 'flags' lets the user handle SOCK_SEQPACKET flags (such as
message bounds or record bounds). The 'copy_len' field is described below in the
'v1 -> v2' part. All other pages are data pages from the RX queue.

            Page 0      Page 1      Page N

    [ hdr1 .. hdrN ][ data ] .. [ data ]
          |        |       ^           ^
          |        |       |           |
          |        *-------------------*
          |                |
          |                |
          *----------------*

Of course, a single header can represent an array of pages (when the packet's
buffer is bigger than one page). So here is an example of the detailed mapping
layout for a set of packets. Let's consider that we have the following sequence of
packets: 56 bytes, 4096 bytes and 8200 bytes. All pages 0, 1, 2, 3, 4 and 5 will
be inserted into the user's vma (the vma is large enough).

Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
Page 1: [ 56 ]
Page 2: [ 4096 ]
Page 3: [ 4096 ]
Page 4: [ 4096 ]
Page 5: [ 8 ]

Page 0 contains only the array of headers:
'hdr0' has 56 in its length field.
'hdr1' has 4096 in its length field.
'hdr2' has 8200 in its length field.
'hdr3' has 0 in its length field (this is the end-of-data marker).

Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
Page 2 corresponds to 'hdr1' and is filled with data.
Page 3 corresponds to 'hdr2' and is filled with data.
Page 4 corresponds to 'hdr2' and is filled with data.
Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
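
        To make the layout above more concrete, here is a rough userspace sketch of
one receive iteration (error handling trimmed). The mapping 'map' is assumed to be
the area previously returned by 'mmap()' on the socket, and the 'getsockopt()'
level/option used here ('AF_VSOCK'/'SO_VM_SOCKETS_MAP_RX', taken from the v1 -> v2
notes below) should be treated as illustrative, not as the final ABI:

    /* Illustrative only: SO_VM_SOCKETS_MAP_RX comes from the patched
     * vm_sockets.h headers of this series; the call details are assumptions.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    #define PG 4096UL

    struct virtio_vsock_usr_hdr {
            uint32_t length;
            uint32_t flags;
            uint32_t copy_len;
    };

    static void rx_once(int fd, void *map, size_t map_size)
    {
            socklen_t len = map_size;

            /* ask the kernel to fill the mapping with RX buffer pages */
            if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, map, &len))
                    return;

            struct virtio_vsock_usr_hdr *hdr = map;   /* page 0: header array */
            char *data = (char *)map + PG;            /* page 1: first data page */

            for (; hdr->length; hdr++) {
                    /* each payload occupies one or more whole pages */
                    size_t pages = (hdr->length + PG - 1) / PG;

                    fwrite(data, 1, hdr->length, stdout);
                    data += pages * PG;
            }

            /* give the pages back before the next getsockopt() */
            madvise(map, map_size, MADV_DONTNEED);
    }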

This patchset also changes the way packets are allocated: the current
implementation uses only 'kmalloc()' to create the data buffer. The problem appears
when we try to map such buffers into the user's vma - the kernel forbids mapping
slab pages into a user's vma (pages of "not large" 'kmalloc()' allocations are
marked with the PageSlab flag, and "not large" can still be bigger than one page).
To avoid this, data buffers are now allocated with 'alloc_pages()'.
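
A minimal sketch of what that change looks like on the allocation side (simplified
and purely illustrative; this is not the code from the patches):

    #include <linux/mm.h>

    /* Sketch only: buffers that may later be mapped into userspace must be
     * whole non-slab pages, because vm_insert_page() refuses slab pages.
     */
    static void *vsock_alloc_rx_buf(size_t len, struct page **pg)
    {
            /* previously: return kmalloc(len, GFP_KERNEL); */
            *pg = alloc_pages(GFP_KERNEL, get_order(len));
            if (!*pg)
                    return NULL;

            return page_address(*pg);
    }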

TESTS

This patchset updates the 'vsock_test' utility: two tests for the new feature
were added. The first test covers invalid cases, the second checks a valid
transmission case.

BENCHMARKING

For benchmarking I've added a small utility, 'rx_zerocopy'. It works in
client/server mode. When a client connects to the server, the server starts sending
a fixed amount of data to the client (the amount is set as an input argument). The
client reads the data and waits for the next portion of it. The client works in two
modes: copy and zero-copy. In copy mode the client uses the 'read()' call, while in
zerocopy mode the 'mmap()'/'getsockopt()'/'madvise()' sequence is used. A smaller
transmission time is better. For the server we can set the size of the tx buffer,
and for the client we can set the size of the rx buffer or the rx mapping size (in
zerocopy mode). Usage of this utility is quite simple:

For client mode:

./rx_zerocopy --mode client [--zerocopy] [--rx]

For server mode:

./rx_zerocopy --mode server [--mb] [--tx]

[--mb] sets the number of megabytes to transfer.
[--rx] sets the size of the receive buffer/mapping in pages.
[--tx] sets the size of the transmit buffer in pages.
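
For example, a 4000 megabyte transfer with 32-page buffers on both sides could be
run like this (the flag values are only an illustration of the options above):

./rx_zerocopy --mode server --mb 4000 --tx 32
./rx_zerocopy --mode client --zerocopy --rx 32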

I measured the transmission of 4000 MB of data. Here are some results:

                          size of rx/tx buffers in pages
               *---------------------------------------------------*
               |    8   |    32    |    64   |   256    |   512    |
*--------------*--------*----------*---------*----------*----------*
|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
*--------------*---------------------------------------------------- process
| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
*--------------*----------------------------------------------------

The result in the first column (where non-zerocopy works better than zerocopy)
happens because the time spent in the 'read()' system call is smaller than the time
spent in 'getsockopt()' + 'madvise()'. I've verified that.

I think the results are not very impressive, but at least zerocopy is not worse than
copy mode, and there is no need to allocate memory for processing the data.

PROBLEMS

The updated packet allocation logic creates a problem: when the host gets
data from the guest (in vhost-vsock), it allocates at least one page for each packet
(even if the packet has a 1 byte payload). I think this could be resolved in several
ways:
    1) Make zerocopy rx mode disabled by default, so that if the user didn't enable
it, the current 'kmalloc()' way is used. <<<<<<< (IMPLEMENTED IN V2)
    2) Use 'kmalloc()' for "small" packets and the page allocator otherwise. But
in this case we have a mix of packets allocated in two different ways, so during
zerocopying to the user (e.g. mapping pages into the vma), such small packets have
to be handled in a clumsy way: we need to allocate one page for the user, copy the
data into it and then insert the page into the user's vma.

v1 -> v2:
1) Zerocopy receive mode can be enabled/disabled (disabled by default). I
   didn't use the generic SO_ZEROCOPY flag, because in the virtio-vsock case this
   feature depends on transport support. Instead of SO_ZEROCOPY, an AF_VSOCK
   layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while the previous meaning of
   SO_VM_SOCKETS_ZEROCOPY (insert receive buffers into the user's vm area) is now
   renamed to SO_VM_SOCKETS_MAP_RX.
2) The packet header which is exported to the user now gets a new field: 'copy_len'.
   This field handles a special case: the user reads data from the socket in the
   non-zerocopy way (with zerocopy disabled) and then enables the zerocopy feature.
   In this case the vhost part will switch its data buffer allocation logic from
   'kmalloc()' to direct calls to the buddy allocator. But there could be some
   pending 'kmalloc()'-allocated packets in the socket's rx list, and when the user
   tries to read such packets in the zerocopy way, the dequeue will fail, because
   SLAB pages cannot be inserted into the user's vm area. So when such a packet is
   found during zerocopy dequeue, the dequeue loop breaks and 'copy_len' reports the
   size of this "bad" packet. After the user detects this case, it must use
   'read()'/'recv()' calls to dequeue such a packet (see the sketch after this
   list).
3) Maybe also move this feature under a config option?
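
A rough sketch of how userspace could handle 'copy_len' (it reuses the names and
includes from the sketch in the DETAILS section, plus <unistd.h>; which header slot
actually carries 'copy_len' is an assumption based on the description in item 2):

    /* Illustrative only: enable zerocopy, then fall back to a copying
     * read() whenever a kmalloc()-backed packet blocks zerocopy dequeue.
     */
    static void rx_with_fallback(int fd, void *map, size_t map_size)
    {
            int on = 1;
            socklen_t len = map_size;

            setsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_ZEROCOPY, &on, sizeof(on));

            if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, map, &len))
                    return;

            struct virtio_vsock_usr_hdr *hdr = map;

            while (hdr->length)     /* walk the headers as shown earlier */
                    hdr++;

            if (hdr->copy_len) {
                    /* "bad" packet: must be dequeued with read()/recv() */
                    char buf[8192];
                    size_t n = hdr->copy_len < sizeof(buf) ?
                               hdr->copy_len : sizeof(buf);

                    read(fd, buf, n);
            }

            madvise(map, map_size, MADV_DONTNEED);
    }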

Arseniy Krasnov (8):
virtio/vsock: rework packet allocation logic
vhost/vsock: rework packet allocation logic
af_vsock: add zerocopy receive logic
virtio/vsock: add transport zerocopy callback
vhost/vsock: enable zerocopy callback
virtio/vsock: enable zerocopy callback
test/vsock: add receive zerocopy tests
test/vsock: vsock rx zerocopy utility

drivers/vhost/vsock.c | 121 +++++++++--
include/linux/virtio_vsock.h | 5 +
include/net/af_vsock.h | 7 +
include/uapi/linux/virtio_vsock.h | 6 +
include/uapi/linux/vm_sockets.h | 3 +
net/vmw_vsock/af_vsock.c | 100 +++++++++
net/vmw_vsock/virtio_transport.c | 51 ++++-
net/vmw_vsock/virtio_transport_common.c | 211 ++++++++++++++++++-
tools/include/uapi/linux/virtio_vsock.h | 11 +
tools/include/uapi/linux/vm_sockets.h | 8 +
tools/testing/vsock/Makefile | 1 +
tools/testing/vsock/control.c | 34 +++
tools/testing/vsock/control.h | 2 +
tools/testing/vsock/rx_zerocopy.c | 356 ++++++++++++++++++++++++++++++++
tools/testing/vsock/vsock_test.c | 295 ++++++++++++++++++++++++++
15 files changed, 1196 insertions(+), 15 deletions(-)

--
2.25.1


2022-06-08 16:23:03

by Zhao Liu

Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
> Date: Fri, 3 Jun 2022 05:27:56 +0000
> From: Arseniy Krasnov <[email protected]>
> Subject: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive
>
> INTRODUCTION
>
> Hello, this is experimental implementation of virtio vsock zerocopy
> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
> fill provided vma area with pages of virtio RX buffers. After received data was
> processed by user, pages must be freed by 'madvise()' call with MADV_DONTNEED
> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>
> DETAILS
>
> Here is how mapping with mapped pages looks exactly: first page mapping
> contains array of trimmed virtio vsock packet headers (in contains only length
> of data on the corresponding page and 'flags' field):
>
> struct virtio_vsock_usr_hdr {
> uint32_t length;
> uint32_t flags;
> uint32_t copy_len;
> };
>
> Field 'length' allows user to know exact size of payload within each sequence
> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
> All other pages are data pages from RX queue.
>
>             Page 0      Page 1      Page N
>
>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>           |        |       ^           ^
>           |        |       |           |
>           |        *-------------------*
>           |                |
>           |                |
>           *----------------*
>
> Of course, single header could represent array of pages (when packet's
> buffer is bigger than one page).So here is example of detailed mapping layout
> for some set of packages. Lets consider that we have the following sequence of
> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
> be inserted to user's vma(vma is large enough).
>
> Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

Hi Arseniy, what about adding a `header` for `virtio_vsock_usr_hdr` in Page 0?

Page 0 can be like this:

Page 0: [[ header for hdrs ][ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]

We can store the header numbers/page numbers in this first general header:

struct virtio_vsock_general_hdr {
uint32_t usr_hdr_num;
};

This usr_hdr_num represents how many pages we used here.
At most 256 pages will be used here, and this kind of statistical information is
useful.
[...]

2022-06-09 09:13:03

by Stefano Garzarella

Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

Hi Arseniy,
I left some comments in the patches, and I'm adding something also here:

On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
> INTRODUCTION
>
> Hello, this is experimental implementation of virtio vsock zerocopy
>receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>fill provided vma area with pages of virtio RX buffers. After received data was
>processed by user, pages must be freed by 'madvise()' call with MADV_DONTNEED
>flag set(if user won't call 'madvise()', next 'getsockopt()' will
>fail).

If it is not too time-consuming, can we have a table/list to compare this
and the TCP zerocopy?

>
> DETAILS
>
> Here is how mapping with mapped pages looks exactly: first page mapping
>contains array of trimmed virtio vsock packet headers (in contains only length
>of data on the corresponding page and 'flags' field):
>
> struct virtio_vsock_usr_hdr {
> uint32_t length;
> uint32_t flags;
> uint32_t copy_len;
> };
>
>Field 'length' allows user to know exact size of payload within each sequence
>of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>All other pages are data pages from RX queue.
>
>             Page 0      Page 1      Page N
>
>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>           |        |       ^           ^
>           |        |       |           |
>           |        *-------------------*
>           |                |
>           |                |
>           *----------------*
>
> Of course, single header could represent array of pages (when packet's
>buffer is bigger than one page).So here is example of detailed mapping layout
>for some set of packages. Lets consider that we have the following sequence of
>packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>be inserted to user's vma(vma is large enough).

In order to have a "userspace polling-friendly approach" and reduce the
number of syscalls, can we allow, for example, the userspace to mmap at
least the first header before packets arrive?
Then the userspace can poll a flag or other fields in the header to
understand that there are new packets.

That would be cool, but in the meantime it would be nice to behave
similarly to TCP, which is why the comparison table I mentioned earlier
would be useful.

>
> Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
> Page 1: [ 56 ]
> Page 2: [ 4096 ]
> Page 3: [ 4096 ]
> Page 4: [ 4096 ]
> Page 5: [ 8 ]
>
> Page 0 contains only array of headers:
> 'hdr0' has 56 in length field.
> 'hdr1' has 4096 in length field.
> 'hdr2' has 8200 in length field.
> 'hdr3' has 0 in length field(this is end of data marker).
>
> Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
> Page 2 corresponds to 'hdr1' and filled with data.
> Page 3 corresponds to 'hdr2' and filled with data.
> Page 4 corresponds to 'hdr2' and filled with data.
> Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>
> This patchset also changes packets allocation way: today implementation
>uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>"not large" could be bigger than one page). So to avoid this, data buffers now
>allocated using 'alloc_pages()' call.
>
> TESTS
>
> This patchset updates 'vsock_test' utility: two tests for new feature
>were added. First test covers invalid cases. Second checks valid transmission
>case.
>
> BENCHMARKING
>
> For benchmakring I've added small utility 'rx_zerocopy'. It works in
>client/server mode. When client connects to server, server starts sending exact
>amount of data to client(amount is set as input argument).Client reads data and
>waits for next portion of it. Client works in two modes: copy and zero-copy. In
>copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>/'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>is better. For server, we can set size of tx buffer and for client we can set
>size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>is quiet simple:
>
>For client mode:
>
>./rx_zerocopy --mode client [--zerocopy] [--rx]
>
>For server mode:
>
>./rx_zerocopy --mode server [--mb] [--tx]
>
>[--mb] sets number of megabytes to transfer.
>[--rx] sets size of receive buffer/mapping in pages.
>[--tx] sets size of transmit buffer in pages.
>
>I checked for transmission of 4000mb of data. Here are some results:
>
>                          size of rx/tx buffers in pages
>               *---------------------------------------------------*
>               |    8   |    32    |    64   |   256    |   512    |
>*--------------*--------*----------*---------*----------*----------*
>|   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>*--------------*---------------------------------------------------- process
>| non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>*--------------*----------------------------------------------------
>
>Result in first column(where non-zerocopy works better than zerocopy) happens
>because time, spent in 'read()' system call is smaller that time in 'getsockopt'
>+ 'madvise'. I've checked that.
>
>I think, that results are not so impressive, but at least it is not worse than
>copy mode and there is no need to allocate memory for processing date.
>
> PROBLEMS
>
> Updated packet's allocation logic creates some problem: when host gets
>data from guest(in vhost-vsock), it allocates at least one page for each packet
>(even if packet has 1 byte payload). I think this could be resolved in several
>ways:
> 1) Make zerocopy rx mode disabled by default, so if user didn't enable
>it, current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)

Yep, but I think we should not allow changing it while we are connected
(see comments in the patches).

> 2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>in this case, we have mix of packets, allocated in two different ways thus
>during zerocopying to user(e.g. mapping pages to vma), such small packets will
>be handled in some stupid way: we need to allocate one page for user, copy data
>to it and then insert page to user's vma.
>
>v1 -> v2:
> 1) Zerocopy receive mode could be enabled/disabled(disabled by default). I
> didn't use generic SO_ZEROCOPY flag, because in virtio-vsock case this
> feature depends on transport support. Instead of SO_ZEROCOPY, AF_VSOCK
> layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while previous meaning of
> SO_VM_SOCKETS_ZEROCOPY(insert receive buffers to user's vm area) now
> renamed to SO_VM_SOCKETS_MAP_RX.
> 2) Packet header which is exported to user now get new field: 'copy_len'.
> This field handles special case: user reads data from socket in non
> zerocopy way(with disabled zerocopy) and then enables zerocopy feature.
> In this case vhost part will switch data buffer allocation logic from
> 'kmalloc()' to direct calls for buddy allocator. But, there could be
> some pending 'kmalloc()' allocated packets in socket's rx list, and then
> user tries to read such packets in zerocopy way, dequeue will fail,
> because SLAB pages could not be inserted to user's vm area. So when such
> packet is found during zerocopy dequeue, dequeue loop will break and
> 'copy_len' will show size of such "bad" packet. After user detects this
> case, it must use 'read()/recv()' calls to dequeue such packet.
> 3) Also may be move this features under config option?

Do you mean a build config like CONFIG_VSOCK_ZERO_COPY?

I'm not sure it's needed.

Thanks,
Stefano

2022-06-09 12:44:33

by Arseniy Krasnov

Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

On 09.06.2022 11:54, Stefano Garzarella wrote:
> Hi Arseniy,
> I left some comments in the patches, and I'm adding something also here:
Thanks for comments
>
> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>                              INTRODUCTION
>>
>>     Hello, this is experimental implementation of virtio vsock zerocopy
>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>> fill provided vma area with pages of virtio RX buffers. After received data was
>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>
> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
You mean compare the APIs in more detail?
>
>>
>>                                 DETAILS
>>
>>     Here is how mapping with mapped pages looks exactly: first page mapping
>> contains array of trimmed virtio vsock packet headers (in contains only length
>> of data on the corresponding page and 'flags' field):
>>
>>     struct virtio_vsock_usr_hdr {
>>         uint32_t length;
>>         uint32_t flags;
>>         uint32_t copy_len;
>>     };
>>
>> Field  'length' allows user to know exact size of payload within each sequence
>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>> All other pages are data pages from RX queue.
>>
>>             Page 0      Page 1      Page N
>>
>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>           |        |       ^           ^
>>           |        |       |           |
>>           |        *-------------------*
>>           |                |
>>           |                |
>>           *----------------*
>>
>>     Of course, single header could represent array of pages (when packet's
>> buffer is bigger than one page).So here is example of detailed mapping layout
>> for some set of packages. Lets consider that we have the following sequence  of
>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>> be inserted to user's vma(vma is large enough).
>
> In order to have a "userspace polling-friendly approach" and reduce number of syscall, can we allow for example the userspace to mmap at least the first header before packets arrive.
> Then the userspace can poll a flag or other fields in the header to understand that there are new packets.
You mean that to avoid the 'poll()' syscall, the user will spin on some flag provided by the kernel on some mapped page? I think yes, this is ok. Also I think that I can avoid the 'madvise()' call
that clears the memory mapping before each 'getsockopt()' - let 'getsockopt()' do the 'madvise()' job by removing the pages of the previous data. In this case only one system call is needed - 'getsockopt()'.
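Something like the following (purely illustrative of the idea being discussed:
'data_ready' is a hypothetical flag in the shared header page, not something this
patchset provides, and it reuses 'struct virtio_vsock_usr_hdr' from the cover
letter):

    struct usr_hdr_page {
            volatile uint32_t data_ready;          /* hypothetical flag */
            struct virtio_vsock_usr_hdr hdr[];
    };

    static void poll_rx(int fd, struct usr_hdr_page *map, size_t map_size)
    {
            while (!map->data_ready)
                    ;       /* spin instead of calling poll() */

            socklen_t len = map_size;

            /* getsockopt() would also drop the pages of the previous batch
             * here, so no separate madvise() call would be needed.
             */
            getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_MAP_RX, map, &len);
    }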
>
> That would be cool, but in the meantime it would be nice to behave similarly to TCP, which is why the comparison table I mentioned earlier would be useful.
>
>>
>>     Page 0: [[ hdr0 ][ hdr 1 ][ hdr 2 ][ hdr 3 ] ... ]
>>     Page 1: [ 56 ]
>>     Page 2: [ 4096 ]
>>     Page 3: [ 4096 ]
>>     Page 4: [ 4096 ]
>>     Page 5: [ 8 ]
>>
>>     Page 0 contains only array of headers:
>>     'hdr0' has 56 in length field.
>>     'hdr1' has 4096 in length field.
>>     'hdr2' has 8200 in length field.
>>     'hdr3' has 0 in length field(this is end of data marker).
>>
>>     Page 1 corresponds to 'hdr0' and has only 56 bytes of data.
>>     Page 2 corresponds to 'hdr1' and filled with data.
>>     Page 3 corresponds to 'hdr2' and filled with data.
>>     Page 4 corresponds to 'hdr2' and filled with data.
>>     Page 5 corresponds to 'hdr2' and has only 8 bytes of data.
>>
>>     This patchset also changes packets allocation way: today implementation
>> uses only 'kmalloc()' to create data buffer. Problem happens when we try to map
>> such buffers to user's vma - kernel forbids to map slab pages to user's vma(as
>> pages of "not large" 'kmalloc()' allocations are marked with PageSlab flag and
>> "not large" could be bigger than one page). So to avoid this, data buffers now
>> allocated using 'alloc_pages()' call.
>>
>>                                   TESTS
>>
>>     This patchset updates 'vsock_test' utility: two tests for new feature
>> were added. First test covers invalid cases. Second checks valid transmission
>> case.
>>
>>                                BENCHMARKING
>>
>>     For benchmakring I've added small utility 'rx_zerocopy'. It works in
>> client/server mode. When client connects to server, server starts sending exact
>> amount of data to client(amount is set as input argument).Client reads data and
>> waits for next portion of it. Client works in two modes: copy and zero-copy. In
>> copy mode client uses 'read()' call while in zerocopy mode sequence of 'mmap()'
>> /'getsockopt()'/'madvise()' are used. Smaller amount of time for transmission
>> is better. For server, we can set size of tx buffer and for client we can set
>> size of rx buffer or rx mapping size(in zerocopy mode). Usage of this utility
>> is quiet simple:
>>
>> For client mode:
>>
>> ./rx_zerocopy --mode client [--zerocopy] [--rx]
>>
>> For server mode:
>>
>> ./rx_zerocopy --mode server [--mb] [--tx]
>>
>> [--mb] sets number of megabytes to transfer.
>> [--rx] sets size of receive buffer/mapping in pages.
>> [--tx] sets size of transmit buffer in pages.
>>
>> I checked for transmission of 4000mb of data. Here are some results:
>>
>>                           size of rx/tx buffers in pages
>>               *---------------------------------------------------*
>>               |    8   |    32    |    64   |   256    |   512    |
>> *--------------*--------*----------*---------*----------*----------*
>> |   zerocopy   |   24   |   10.6   |  12.2   |   23.6   |    21    | secs to
>> *--------------*---------------------------------------------------- process
>> | non-zerocopy |   13   |   16.4   |  24.7   |   27.2   |   23.9   | 4000 mb
>> *--------------*----------------------------------------------------
>>
>> Result in first column(where non-zerocopy works better than zerocopy) happens
>> because time, spent in 'read()' system call is smaller that time in 'getsockopt'
>> + 'madvise'. I've checked that.
>>
>> I think, that results are not so impressive, but at least it is not worse than
>> copy mode and there is no need to allocate memory for processing date.
>>
>>                                 PROBLEMS
>>
>>     Updated packet's allocation logic creates some problem: when host gets
>> data from guest(in vhost-vsock), it allocates at least one page for each packet
>> (even if packet has 1 byte payload). I think this could be resolved in several
>> ways:
>>     1) Make zerocopy rx mode disabled by default, so if user didn't enable
>> it, current 'kmalloc()' way will be used. <<<<<<< (IMPLEMENTED IN V2)
>
> Yep, but I think we should not allow to change it while we are connected (see comments in the patches.)
>
>>     2) Use 'kmalloc()' for "small" packets, else call page allocator. But
>> in this case, we have mix of packets, allocated in two different ways thus
>> during zerocopying to user(e.g. mapping pages to vma), such small packets will
>> be handled in some stupid way: we need to allocate one page for user, copy data
>> to it and then insert page to user's vma.
>>
>> v1 -> v2:
>> 1) Zerocopy receive mode could be enabled/disabled(disabled by default). I
>>    didn't use generic SO_ZEROCOPY flag, because in virtio-vsock case this
>>    feature depends on transport support. Instead of SO_ZEROCOPY, AF_VSOCK
>>    layer flag was added: SO_VM_SOCKETS_ZEROCOPY, while previous meaning of
>>    SO_VM_SOCKETS_ZEROCOPY(insert receive buffers to user's vm area) now
>>    renamed to SO_VM_SOCKETS_MAP_RX.
>> 2) Packet header which is exported to user now get new field: 'copy_len'.
>>    This field handles special case:  user reads data from socket in non
>>    zerocopy way(with disabled zerocopy) and then enables zerocopy feature.
>>    In this case vhost part will switch data buffer allocation logic from
>>    'kmalloc()' to direct calls for buddy allocator. But, there could be
>>    some pending 'kmalloc()' allocated packets in socket's rx list, and then
>>    user tries to read such packets in zerocopy way, dequeue will fail,
>>    because SLAB pages could not be inserted to user's vm area. So when such
>>    packet is found during zerocopy dequeue, dequeue loop will break and
>>    'copy_len' will show size of such "bad" packet. After user detects this
>>    case, it must use 'read()/recv()' calls to dequeue such packet.
>> 3) Also may be move this features under config option?
>
> Do you mean a build config like CONFIG_VSOCK_ZERO_COPY?
>
> I'm not sure it's needed.
Ack
>
> Thanks,
> Stefano
>

2022-06-13 09:45:19

by Stefano Garzarella

Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

On Thu, Jun 09, 2022 at 12:33:32PM +0000, Arseniy Krasnov wrote:
>On 09.06.2022 11:54, Stefano Garzarella wrote:
>> Hi Arseniy,
>> I left some comments in the patches, and I'm adding something also here:
>Thanks for comments
>>
>> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>>                              INTRODUCTION
>>>
>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>> processed by user, pages must be freed by 'madvise()' call with MADV_DONTNEED
>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>
>> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
>You mean compare API with more details?

Yes, maybe a comparison from the user's point of view to do zero-copy
with TCP and VSOCK.

>>
>>>
>>>                                 DETAILS
>>>
>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>> of data on the corresponding page and 'flags' field):
>>>
>>>     struct virtio_vsock_usr_hdr {
>>>         uint32_t length;
>>>         uint32_t flags;
>>>         uint32_t copy_len;
>>>     };
>>>
>>> Field 'length' allows user to know exact size of payload within each sequence
>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>>> All other pages are data pages from RX queue.
>>>
>>>             Page 0      Page 1      Page N
>>>
>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>           |        |       ^           ^
>>>           |        |       |           |
>>>           |        *-------------------*
>>>           |                |
>>>           |                |
>>>           *----------------*
>>>
>>>     Of course, single header could represent array of pages (when packet's
>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>> for some set of packages. Lets consider that we have the following sequence of
>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>> be inserted to user's vma(vma is large enough).
>>
>> In order to have a "userspace polling-friendly approach" and reduce number of syscall, can we allow for example the userspace to mmap at least the first header before packets arrive.
>> Then the userspace can poll a flag or other fields in the header to understand that there are new packets.
>You mean to avoid 'poll()' syscall, user will spin on some flag, provided by kernel on some mapped page? I think yes. This is ok. Also i think, that i can avoid 'madvise' call
>to clear memory mapping before each 'getsockopt()' - let 'getsockopt()' do 'madvise()' job by removing pages from previous data. In this case only one system call is needed - 'getsockopt()'.

Yes, that's right. I mean to support both: poll() for interrupt-based
applications, and the ability to actively poll a variable in the shared
memory for applications that want to minimize latency.

Thanks,
Stefano

2022-06-14 04:27:11

by Arseniy Krasnov

Subject: Re: [RFC PATCH v2 0/8] virtio/vsock: experimental zerocopy receive

On 13.06.2022 11:54, Stefano Garzarella wrote:
> On Thu, Jun 09, 2022 at 12:33:32PM +0000, Arseniy Krasnov wrote:
>> On 09.06.2022 11:54, Stefano Garzarella wrote:
>>> Hi Arseniy,
>>> I left some comments in the patches, and I'm adding something also here:
>> Thanks for comments
>>>
>>> On Fri, Jun 03, 2022 at 05:27:56AM +0000, Arseniy Krasnov wrote:
>>>>                              INTRODUCTION
>>>>
>>>>     Hello, this is experimental implementation of virtio vsock zerocopy
>>>> receive. It was inspired by TCP zerocopy receive by Eric Dumazet. This API uses
>>>> same idea: call 'mmap()' on socket's descriptor, then every 'getsockopt()' will
>>>> fill provided vma area with pages of virtio RX buffers. After received data was
>>>> processed by user, pages must be freed by 'madvise()'  call with MADV_DONTNEED
>>>> flag set(if user won't call 'madvise()', next 'getsockopt()' will fail).
>>>
>>> If it is not too time-consuming, can we have a table/list to compare this and the TCP zerocopy?
>> You mean compare API with more details?
>
> Yes, maybe a comparison from the user's point of view to do zero-copy with TCP and VSOCK.
>
>>>
>>>>
>>>>                                 DETAILS
>>>>
>>>>     Here is how mapping with mapped pages looks exactly: first page mapping
>>>> contains array of trimmed virtio vsock packet headers (in contains only length
>>>> of data on the corresponding page and 'flags' field):
>>>>
>>>>     struct virtio_vsock_usr_hdr {
>>>>         uint32_t length;
>>>>         uint32_t flags;
>>>>         uint32_t copy_len;
>>>>     };
>>>>
>>>> Field  'length' allows user to know exact size of payload within each sequence
>>>> of pages and 'flags' allows user to handle SOCK_SEQPACKET flags(such as message
>>>> bounds or record bounds). Field 'copy_len' is described below in 'v1->v2' part.
>>>> All other pages are data pages from RX queue.
>>>>
>>>>             Page 0      Page 1      Page N
>>>>
>>>>     [ hdr1 .. hdrN ][ data ] .. [ data ]
>>>>           |        |       ^           ^
>>>>           |        |       |           |
>>>>           |        *-------------------*
>>>>           |                |
>>>>           |                |
>>>>           *----------------*
>>>>
>>>>     Of course, single header could represent array of pages (when packet's
>>>> buffer is bigger than one page).So here is example of detailed mapping layout
>>>> for some set of packages. Lets consider that we have the following sequence  of
>>>> packages: 56 bytes, 4096 bytes and 8200 bytes. All pages: 0,1,2,3,4 and 5 will
>>>> be inserted to user's vma(vma is large enough).
>>>
>>> In order to have a "userspace polling-friendly approach" and reduce number of syscall, can we allow for example the userspace to mmap at least the first header before packets arrive.
>>> Then the userspace can poll a flag or other fields in the header to understand that there are new packets.
>> You mean to avoid 'poll()' syscall, user will spin on some flag, provided by kernel on some mapped page? I think yes. This is ok. Also i think, that i can avoid 'madvise' call
>> to clear memory mapping before each 'getsockopt()' - let 'getsockopt()' do 'madvise()' job by removing pages from previous data. In this case only one system call is needed - 'getsockopt()'.
>
> Yes, that's right. I mean to support both, poll() for interrupt-based applications and the ability to actively poll a variable in the shared memory for applications that want to minimize latency.
I see. In this case it seems 'vsock_sock' will maintain a list of such shared pages, to update every page when new data is available, and will sometimes have to check that the mapping was removed
by the user (because we don't have a munmap callback in 'proto_ops', only mmap), for example by using a reference counter for each such shared page.

Thanks
>
> Thanks,
> Stefano
>