2023-07-17 21:09:52

by Arseniy Krasnov

[permalink] [raw]
Subject: [PATCH net-next v1 0/4] vsock/virtio/vhost: MSG_ZEROCOPY preparations

Hello,

this patchset is first of three parts of another big patchset for
MSG_ZEROCOPY flag support:
https://lore.kernel.org/netdev/[email protected]/

During review of this series, Stefano Garzarella <[email protected]>
suggested to split it for three parts to simplify review and merging:

1) virtio and vhost updates (for fragged skbs) <--- this patchset
2) AF_VSOCK updates (allows to enable MSG_ZEROCOPY mode and read
tx completions) and update for Documentation/.
3) Updates for tests and utils.

This series enables handling of fragged skbs in virtio and vhost parts.
Newly logic won't be triggered, because SO_ZEROCOPY options is still
impossible to enable at this moment (next bunch of patches from big
set above will enable it).

I've included changelog to some patches anyway, because there were some
comments during review of last big patchset from the link above.

Head for this patchset is 60cc1f7d0605598b47ee3c0c2b4b6fbd4da50a06

Arseniy Krasnov (4):
vsock/virtio/vhost: read data from non-linear skb
vsock/virtio: support to send non-linear skb
vsock/virtio: non-linear skb handling for tap
vsock/virtio: MSG_ZEROCOPY flag support

drivers/vhost/vsock.c | 14 +-
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport.c | 40 ++-
net/vmw_vsock/virtio_transport_common.c | 310 ++++++++++++++++++------
4 files changed, 283 insertions(+), 82 deletions(-)

--
2.25.1



2023-07-17 21:16:45

by Arseniy Krasnov

[permalink] [raw]
Subject: [PATCH net-next v1 2/4] vsock/virtio: support to send non-linear skb

For non-linear skb use its pages from fragment array as buffers in
virtio tx queue. These pages are already pinned by 'get_user_pages()'
during such skb creation.

Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
---
net/vmw_vsock/virtio_transport.c | 40 +++++++++++++++++++++++++++-----
1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index e95df847176b..6cbb45bb12d2 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -100,7 +100,9 @@ virtio_transport_send_pkt_work(struct work_struct *work)
vq = vsock->vqs[VSOCK_VQ_TX];

for (;;) {
- struct scatterlist hdr, buf, *sgs[2];
+ /* +1 is for packet header. */
+ struct scatterlist *sgs[MAX_SKB_FRAGS + 1];
+ struct scatterlist bufs[MAX_SKB_FRAGS + 1];
int ret, in_sg = 0, out_sg = 0;
struct sk_buff *skb;
bool reply;
@@ -111,12 +113,38 @@ virtio_transport_send_pkt_work(struct work_struct *work)

virtio_transport_deliver_tap_pkt(skb);
reply = virtio_vsock_skb_reply(skb);
+ sg_init_one(&bufs[out_sg], virtio_vsock_hdr(skb),
+ sizeof(*virtio_vsock_hdr(skb)));
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+
+ if (!skb_is_nonlinear(skb)) {
+ if (skb->len > 0) {
+ sg_init_one(&bufs[out_sg], skb->data, skb->len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
+ } else {
+ struct skb_shared_info *si;
+ int i;
+
+ si = skb_shinfo(skb);
+
+ for (i = 0; i < si->nr_frags; i++) {
+ skb_frag_t *skb_frag = &si->frags[i];
+ void *va = page_to_virt(skb_frag->bv_page);

- sg_init_one(&hdr, virtio_vsock_hdr(skb), sizeof(*virtio_vsock_hdr(skb)));
- sgs[out_sg++] = &hdr;
- if (skb->len > 0) {
- sg_init_one(&buf, skb->data, skb->len);
- sgs[out_sg++] = &buf;
+ /* We will use 'page_to_virt()' for userspace page here,
+ * because virtio layer will call 'virt_to_phys()' later
+ * to fill buffer descriptor. We don't touch memory at
+ * "virtual" address of this page.
+ */
+ sg_init_one(&bufs[out_sg],
+ va + skb_frag->bv_offset,
+ skb_frag->bv_len);
+ sgs[out_sg] = &bufs[out_sg];
+ out_sg++;
+ }
}

ret = virtqueue_add_sgs(vq, sgs, out_sg, in_sg, skb, GFP_KERNEL);
--
2.25.1


2023-07-17 21:17:15

by Arseniy Krasnov

[permalink] [raw]
Subject: [PATCH net-next v1 4/4] vsock/virtio: MSG_ZEROCOPY flag support

This adds handling of MSG_ZEROCOPY flag on transmission path: if this
flag is set and zerocopy transmission is possible, then non-linear skb
will be created and filled with the pages of user's buffer. Pages of
user's buffer are locked in memory by 'get_user_pages()'. Second thing
that this patch does is replace type of skb owning: instead of calling
'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
of socket, so to decrease this field correctly proper skb destructor is
needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v5(big patchset) -> v1:
* Refactorings of 'if' conditions.
* Remove extra blank line.
* Remove 'frag_off' field unneeded init.
* Add function 'virtio_transport_fill_skb()' which fills both linear
and non-linear skb with provided data.

net/vmw_vsock/virtio_transport_common.c | 260 ++++++++++++++++++------
1 file changed, 197 insertions(+), 63 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 26a4d10da205..1fb0a0f694c6 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,73 +37,115 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
return container_of(t, struct virtio_transport, transport);
}

-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */
-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
- size_t len,
- u32 src_cid,
- u32 src_port,
- u32 dst_cid,
- u32 dst_port)
-{
- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
- struct virtio_vsock_hdr *hdr;
- struct sk_buff *skb;
- void *payload;
- int err;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+ size_t max_to_send)
+{
+ struct iov_iter *iov_iter;

- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
- if (!skb)
- return NULL;
+ if (!info->msg)
+ return false;

- hdr = virtio_vsock_hdr(skb);
- hdr->type = cpu_to_le16(info->type);
- hdr->op = cpu_to_le16(info->op);
- hdr->src_cid = cpu_to_le64(src_cid);
- hdr->dst_cid = cpu_to_le64(dst_cid);
- hdr->src_port = cpu_to_le32(src_port);
- hdr->dst_port = cpu_to_le32(dst_port);
- hdr->flags = cpu_to_le32(info->flags);
- hdr->len = cpu_to_le32(len);
+ iov_iter = &info->msg->msg_iter;

- if (info->msg && len > 0) {
- payload = skb_put(skb, len);
- err = memcpy_from_msg(payload, info->msg, len);
- if (err)
- goto out;
+ /* Data is simple buffer. */
+ if (iter_is_ubuf(iov_iter))
+ return true;

- if (msg_data_left(info->msg) == 0 &&
- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+ if (!iter_is_iovec(iov_iter))
+ return false;

- if (info->msg->msg_flags & MSG_EOR)
- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
- }
+ if (iov_iter->iov_offset)
+ return false;
+
+ /* We can't send whole iov. */
+ if (iov_iter->count > max_to_send)
+ return false;
+
+ return true;
+}
+
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+ struct sk_buff *skb,
+ struct msghdr *msg,
+ bool zerocopy)
+{
+ struct ubuf_info *uarg;
+
+ if (msg->msg_ubuf) {
+ uarg = msg->msg_ubuf;
+ net_zcopy_get(uarg);
+ } else {
+ struct iov_iter *iter = &msg->msg_iter;
+ struct ubuf_info_msgzc *uarg_zc;
+ int len;
+
+ /* Only ITER_IOVEC or ITER_UBUF are allowed and
+ * checked before.
+ */
+ if (iter_is_iovec(iter))
+ len = iov_length(iter->__iov, iter->nr_segs);
+ else
+ len = iter->count;
+
+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+ len,
+ NULL);
+ if (!uarg)
+ return -1;
+
+ uarg_zc = uarg_to_msgzc(uarg);
+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
}

- if (info->reply)
- virtio_vsock_skb_set_reply(skb);
+ skb_zcopy_init(skb, uarg);

- trace_virtio_transport_alloc_pkt(src_cid, src_port,
- dst_cid, dst_port,
- len,
- info->type,
- info->op,
- info->flags);
+ return 0;
+}

- if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
- WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
- goto out;
+static int virtio_transport_fill_skb(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ size_t len,
+ bool zcopy)
+{
+ if (zcopy) {
+ return __zerocopy_sg_from_iter(info->msg, NULL, skb,
+ &info->msg->msg_iter,
+ len);
+ } else {
+ void *payload;
+ int err;
+
+ payload = skb_put(skb, len);
+ err = memcpy_from_msg(payload, info->msg, len);
+ if (err)
+ return -1;
+
+ if (msg_data_left(info->msg))
+ return 0;
+
+ return 0;
}
+}

- return skb;
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+ struct virtio_vsock_pkt_info *info,
+ u32 src_cid,
+ u32 src_port,
+ u32 dst_cid,
+ u32 dst_port,
+ size_t len)
+{
+ struct virtio_vsock_hdr *hdr;

-out:
- kfree_skb(skb);
- return NULL;
+ hdr = virtio_vsock_hdr(skb);
+ hdr->type = cpu_to_le16(info->type);
+ hdr->op = cpu_to_le16(info->op);
+ hdr->src_cid = cpu_to_le64(src_cid);
+ hdr->dst_cid = cpu_to_le64(dst_cid);
+ hdr->src_port = cpu_to_le32(src_port);
+ hdr->dst_port = cpu_to_le32(dst_port);
+ hdr->flags = cpu_to_le32(info->flags);
+ hdr->len = cpu_to_le32(len);
}

static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
@@ -214,6 +256,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
return VIRTIO_VSOCK_TYPE_SEQPACKET;
}

+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
+ struct virtio_vsock_pkt_info *info,
+ size_t payload_len,
+ bool zcopy,
+ u32 dst_cid,
+ u32 dst_port,
+ u32 src_cid,
+ u32 src_port)
+{
+ struct sk_buff *skb;
+ size_t skb_len;
+
+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+ if (!zcopy)
+ skb_len += payload_len;
+
+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+ if (!skb)
+ return NULL;
+
+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len);
+
+ /* Set owner here, because '__zerocopy_sg_from_iter()' uses
+ * owner of skb without check to update 'sk_wmem_alloc'.
+ */
+ if (vsk)
+ skb_set_owner_w(skb, sk_vsock(vsk));
+
+ if (info->msg && payload_len > 0) {
+ int err;
+
+ err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
+ if (err)
+ goto out;
+
+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
+ struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
+
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+ if (info->msg->msg_flags & MSG_EOR)
+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+ }
+ }
+
+ if (info->reply)
+ virtio_vsock_skb_set_reply(skb);
+
+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
+ dst_cid, dst_port,
+ payload_len,
+ info->type,
+ info->op,
+ info->flags);
+
+ return skb;
+out:
+ kfree_skb(skb);
+ return NULL;
+}
+
/* This function can only be used on connecting/connected sockets,
* since a socket assigned to a transport is required.
*
@@ -222,10 +328,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
struct virtio_vsock_pkt_info *info)
{
+ u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
u32 src_cid, src_port, dst_cid, dst_port;
const struct virtio_transport *t_ops;
struct virtio_vsock_sock *vvs;
u32 pkt_len = info->pkt_len;
+ bool can_zcopy = false;
u32 rest_len;
int ret;

@@ -254,22 +362,48 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
return pkt_len;

+ if (info->msg) {
+ /* If zerocopy is not enabled by 'setsockopt()', we behave as
+ * there is no MSG_ZEROCOPY flag set.
+ */
+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+ info->msg->msg_flags &= ~MSG_ZEROCOPY;
+
+ if (info->msg->msg_flags & MSG_ZEROCOPY)
+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+
+ if (can_zcopy)
+ max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+ (MAX_SKB_FRAGS * PAGE_SIZE));
+ }
+
rest_len = pkt_len;

do {
struct sk_buff *skb;
size_t skb_len;

- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+ skb_len = min(max_skb_len, rest_len);

- skb = virtio_transport_alloc_skb(info, skb_len,
- src_cid, src_port,
- dst_cid, dst_port);
+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
+ dst_cid, dst_port,
+ src_cid, src_port);
if (!skb) {
ret = -ENOMEM;
break;
}

+ /* This is last skb to send this portion of data. */
+ if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
+ skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
+ if (virtio_transport_init_zcopy_skb(vsk, skb,
+ info->msg,
+ can_zcopy)) {
+ ret = -ENOMEM;
+ break;
+ }
+ }
+
virtio_transport_inc_tx_pkt(vvs, skb);

ret = t_ops->send_pkt(skb);
@@ -934,11 +1068,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
if (!t)
return -ENOTCONN;

- reply = virtio_transport_alloc_skb(&info, 0,
- le64_to_cpu(hdr->dst_cid),
- le32_to_cpu(hdr->dst_port),
+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
le64_to_cpu(hdr->src_cid),
- le32_to_cpu(hdr->src_port));
+ le32_to_cpu(hdr->src_port),
+ le64_to_cpu(hdr->dst_cid),
+ le32_to_cpu(hdr->dst_port));
if (!reply)
return -ENOMEM;

--
2.25.1


2023-07-17 21:17:42

by Arseniy Krasnov

[permalink] [raw]
Subject: [PATCH net-next v1 3/4] vsock/virtio: non-linear skb handling for tap

For tap device new skb is created and data from the current skb is
copied to it. This adds copying data from non-linear skb to new
the skb.

Signed-off-by: Arseniy Krasnov <[email protected]>
Reviewed-by: Stefano Garzarella <[email protected]>
---
net/vmw_vsock/virtio_transport_common.c | 31 ++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 1a376f808ae6..26a4d10da205 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -106,6 +106,27 @@ virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
return NULL;
}

+static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
+ void *dst,
+ size_t len)
+{
+ struct iov_iter iov_iter = { 0 };
+ struct kvec kvec;
+ size_t to_copy;
+
+ kvec.iov_base = dst;
+ kvec.iov_len = len;
+
+ iov_iter.iter_type = ITER_KVEC;
+ iov_iter.kvec = &kvec;
+ iov_iter.nr_segs = 1;
+
+ to_copy = min_t(size_t, len, skb->len);
+
+ skb_copy_datagram_iter(skb, VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &iov_iter, to_copy);
+}
+
/* Packet capture */
static struct sk_buff *virtio_transport_build_skb(void *opaque)
{
@@ -114,7 +135,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
struct af_vsockmon_hdr *hdr;
struct sk_buff *skb;
size_t payload_len;
- void *payload_buf;

/* A packet could be split to fit the RX buffer, so we can retrieve
* the payload length from the header and the buffer pointer taking
@@ -122,7 +142,6 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
*/
pkt_hdr = virtio_vsock_hdr(pkt);
payload_len = pkt->len;
- payload_buf = pkt->data;

skb = alloc_skb(sizeof(*hdr) + sizeof(*pkt_hdr) + payload_len,
GFP_ATOMIC);
@@ -165,7 +184,13 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
skb_put_data(skb, pkt_hdr, sizeof(*pkt_hdr));

if (payload_len) {
- skb_put_data(skb, payload_buf, payload_len);
+ if (skb_is_nonlinear(pkt)) {
+ void *data = skb_put(skb, payload_len);
+
+ virtio_transport_copy_nonlinear_skb(pkt, data, payload_len);
+ } else {
+ skb_put_data(skb, pkt->data, payload_len);
+ }
}

return skb;
--
2.25.1


2023-07-17 21:21:28

by Arseniy Krasnov

[permalink] [raw]
Subject: [PATCH net-next v1 1/4] vsock/virtio/vhost: read data from non-linear skb

This is preparation patch for MSG_ZEROCOPY support. It adds handling of
non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
'skb_copy_datagram_iter()'. Main advantage of the second one is that it
can handle paged part of the skb by using 'kmap()' on each page, but if
there are no pages in the skb, it behaves like simple copying to iov
iterator. This patch also adds new field to the control block of skb -
this value shows current offset in the skb to read next portion of data
(it doesn't matter linear it or not). Idea behind this field is that
'skb_copy_datagram_iter()' handles both types of skb internally - it
just needs an offset from which to copy data from the given skb. This
offset is incremented on each read from skb. This approach allows to
avoid special handling of non-linear skbs:
1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
2) We need to update 'data_len' also on each read from this skb.

Signed-off-by: Arseniy Krasnov <[email protected]>
---
Changelog:
v5(big patchset) -> v1:
* Merge 'virtio_transport_common.c' and 'vhost/vsock.c' patches into
this single patch.
* Commit message update: grammar fix and remark that this patch is
MSG_ZEROCOPY preparation.
* Use 'min_t()' instead of comparison using '<>' operators.

drivers/vhost/vsock.c | 14 ++++++++-----
include/linux/virtio_vsock.h | 1 +
net/vmw_vsock/virtio_transport_common.c | 27 ++++++++++++++++---------
3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 817d377a3f36..8c917be32b5d 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
struct sk_buff *skb;
unsigned out, in;
size_t nbytes;
+ u32 frag_off;
int head;

skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
}

iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
- payload_len = skb->len;
+ frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
+ payload_len = skb->len - frag_off;
hdr = virtio_vsock_hdr(skb);

/* If the packet is greater than the space available in the
@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
break;
}

- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
- if (nbytes != payload_len) {
+ if (skb_copy_datagram_iter(skb,
+ frag_off,
+ &iov_iter,
+ payload_len)) {
kfree_skb(skb);
vq_err(vq, "Faulted on copying pkt buf\n");
break;
@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
added = true;

- skb_pull(skb, payload_len);
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
total_len += payload_len;

/* If we didn't send all the payload we can requeue the packet
* to send it with the next available buffer.
*/
- if (skb->len > 0) {
+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
hdr->flags |= cpu_to_le32(flags_to_restore);

/* We are queueing the same skb to handle
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c58453699ee9..17dbb7176e37 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -12,6 +12,7 @@
struct virtio_vsock_skb_cb {
bool reply;
bool tap_delivered;
+ u32 frag_off;
};

#define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b769fc258931..1a376f808ae6 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
spin_lock_bh(&vvs->rx_lock);

skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
- off = 0;
+ off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;

if (total == len)
break;
@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data + off, bytes);
+ err = skb_copy_datagram_iter(skb, off,
+ &msg->msg_iter,
+ bytes);
+
if (err)
goto out;

@@ -413,25 +416,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
while (total < len && !skb_queue_empty(&vvs->rx_queue)) {
skb = skb_peek(&vvs->rx_queue);

- bytes = len - total;
- if (bytes > skb->len)
- bytes = skb->len;
+ bytes = min_t(size_t, len - total,
+ skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off);

/* sk_lock is held by caller so no one else can dequeue.
* Unlock rx_lock since memcpy_to_msg() may sleep.
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data, bytes);
+ err = skb_copy_datagram_iter(skb,
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
+ &msg->msg_iter, bytes);
+
if (err)
goto out;

spin_lock_bh(&vvs->rx_lock);

total += bytes;
- skb_pull(skb, bytes);

- if (skb->len == 0) {
+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
+
+ if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);

virtio_transport_dec_rx_pkt(vvs, pkt_len);
@@ -503,7 +509,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
*/
spin_unlock_bh(&vvs->rx_lock);

- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
+ err = skb_copy_datagram_iter(skb, 0,
+ &msg->msg_iter,
+ bytes_to_copy);
+
if (err) {
/* Copy of message failed. Rest of
* fragments will be freed without copy.
--
2.25.1


2023-07-18 13:59:07

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [PATCH net-next v1 4/4] vsock/virtio: MSG_ZEROCOPY flag support

On Tue, Jul 18, 2023 at 12:00:51AM +0300, Arseniy Krasnov wrote:
>This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>flag is set and zerocopy transmission is possible, then non-linear skb
>will be created and filled with the pages of user's buffer. Pages of
>user's buffer are locked in memory by 'get_user_pages()'. Second thing
>that this patch does is replace type of skb owning: instead of calling
>'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>of socket, so to decrease this field correctly proper skb destructor is
>needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v5(big patchset) -> v1:
> * Refactorings of 'if' conditions.
> * Remove extra blank line.
> * Remove 'frag_off' field unneeded init.
> * Add function 'virtio_transport_fill_skb()' which fills both linear
> and non-linear skb with provided data.
>
> net/vmw_vsock/virtio_transport_common.c | 260 ++++++++++++++++++------
> 1 file changed, 197 insertions(+), 63 deletions(-)
>
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index 26a4d10da205..1fb0a0f694c6 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -37,73 +37,115 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
> return container_of(t, struct virtio_transport, transport);
> }
>
>-/* Returns a new packet on success, otherwise returns NULL.
>- *
>- * If NULL is returned, errp is set to a negative errno.
>- */
>-static struct sk_buff *
>-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>- size_t len,
>- u32 src_cid,
>- u32 src_port,
>- u32 dst_cid,
>- u32 dst_port)
>-{
>- const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>- struct virtio_vsock_hdr *hdr;
>- struct sk_buff *skb;
>- void *payload;
>- int err;
>+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>+ size_t max_to_send)
>+{
>+ struct iov_iter *iov_iter;
>
>- skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>- if (!skb)
>- return NULL;
>+ if (!info->msg)
>+ return false;
>
>- hdr = virtio_vsock_hdr(skb);
>- hdr->type = cpu_to_le16(info->type);
>- hdr->op = cpu_to_le16(info->op);
>- hdr->src_cid = cpu_to_le64(src_cid);
>- hdr->dst_cid = cpu_to_le64(dst_cid);
>- hdr->src_port = cpu_to_le32(src_port);
>- hdr->dst_port = cpu_to_le32(dst_port);
>- hdr->flags = cpu_to_le32(info->flags);
>- hdr->len = cpu_to_le32(len);
>+ iov_iter = &info->msg->msg_iter;
>
>- if (info->msg && len > 0) {
>- payload = skb_put(skb, len);
>- err = memcpy_from_msg(payload, info->msg, len);
>- if (err)
>- goto out;
>+ /* Data is simple buffer. */
>+ if (iter_is_ubuf(iov_iter))
>+ return true;
>
>- if (msg_data_left(info->msg) == 0 &&
>- info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>+ if (!iter_is_iovec(iov_iter))
>+ return false;
>
>- if (info->msg->msg_flags & MSG_EOR)
>- hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>- }
>+ if (iov_iter->iov_offset)
>+ return false;
>+
>+ /* We can't send whole iov. */
>+ if (iov_iter->count > max_to_send)
>+ return false;
>+
>+ return true;
>+}
>+
>+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>+ struct sk_buff *skb,
>+ struct msghdr *msg,
>+ bool zerocopy)
>+{
>+ struct ubuf_info *uarg;
>+
>+ if (msg->msg_ubuf) {
>+ uarg = msg->msg_ubuf;
>+ net_zcopy_get(uarg);
>+ } else {
>+ struct iov_iter *iter = &msg->msg_iter;
>+ struct ubuf_info_msgzc *uarg_zc;
>+ int len;
>+
>+ /* Only ITER_IOVEC or ITER_UBUF are allowed and
>+ * checked before.
>+ */
>+ if (iter_is_iovec(iter))
>+ len = iov_length(iter->__iov, iter->nr_segs);
>+ else
>+ len = iter->count;
>+
>+ uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>+ len,
>+ NULL);
>+ if (!uarg)
>+ return -1;
>+
>+ uarg_zc = uarg_to_msgzc(uarg);
>+ uarg_zc->zerocopy = zerocopy ? 1 : 0;
> }
>
>- if (info->reply)
>- virtio_vsock_skb_set_reply(skb);
>+ skb_zcopy_init(skb, uarg);
>
>- trace_virtio_transport_alloc_pkt(src_cid, src_port,
>- dst_cid, dst_port,
>- len,
>- info->type,
>- info->op,
>- info->flags);
>+ return 0;
>+}
>
>- if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>- WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>- goto out;
>+static int virtio_transport_fill_skb(struct sk_buff *skb,
>+ struct virtio_vsock_pkt_info *info,
>+ size_t len,
>+ bool zcopy)
>+{
>+ if (zcopy) {
>+ return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>+ &info->msg->msg_iter,
>+ len);
>+ } else {
>+ void *payload;
>+ int err;
>+
>+ payload = skb_put(skb, len);
>+ err = memcpy_from_msg(payload, info->msg, len);
>+ if (err)
>+ return -1;
>+
>+ if (msg_data_left(info->msg))
>+ return 0;
>+
>+ return 0;
> }
>+}
>
>- return skb;
>+static void virtio_transport_init_hdr(struct sk_buff *skb,
>+ struct virtio_vsock_pkt_info *info,
>+ u32 src_cid,
>+ u32 src_port,
>+ u32 dst_cid,
>+ u32 dst_port,
>+ size_t len)
>+{
>+ struct virtio_vsock_hdr *hdr;
>
>-out:
>- kfree_skb(skb);
>- return NULL;
>+ hdr = virtio_vsock_hdr(skb);
>+ hdr->type = cpu_to_le16(info->type);
>+ hdr->op = cpu_to_le16(info->op);
>+ hdr->src_cid = cpu_to_le64(src_cid);
>+ hdr->dst_cid = cpu_to_le64(dst_cid);
>+ hdr->src_port = cpu_to_le32(src_port);
>+ hdr->dst_port = cpu_to_le32(dst_port);
>+ hdr->flags = cpu_to_le32(info->flags);
>+ hdr->len = cpu_to_le32(len);
> }
>
> static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>@@ -214,6 +256,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
> return VIRTIO_VSOCK_TYPE_SEQPACKET;
> }
>
>+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>+ struct virtio_vsock_pkt_info *info,
>+ size_t payload_len,
>+ bool zcopy,
>+ u32 dst_cid,
>+ u32 dst_port,
>+ u32 src_cid,
>+ u32 src_port)

Before this patch the order of dst_* and src_* fields were reversed
for virtio_transport_alloc_skb(), why are we changing it?

I think putting them back as before (and following the same ordering in
the new functions as well) makes the patch easier to review and the code
easier to maintain.

Thanks,
Stefano

>+{
>+ struct sk_buff *skb;
>+ size_t skb_len;
>+
>+ skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>+
>+ if (!zcopy)
>+ skb_len += payload_len;
>+
>+ skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>+ if (!skb)
>+ return NULL;
>+
>+ virtio_transport_init_hdr(skb, info, src_cid, src_port,
>+ dst_cid, dst_port,
>+ payload_len);
>+
>+ /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>+ * owner of skb without check to update 'sk_wmem_alloc'.
>+ */
>+ if (vsk)
>+ skb_set_owner_w(skb, sk_vsock(vsk));
>+
>+ if (info->msg && payload_len > 0) {
>+ int err;
>+
>+ err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>+ if (err)
>+ goto out;
>+
>+ if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>+ struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>+
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>+
>+ if (info->msg->msg_flags & MSG_EOR)
>+ hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>+ }
>+ }
>+
>+ if (info->reply)
>+ virtio_vsock_skb_set_reply(skb);
>+
>+ trace_virtio_transport_alloc_pkt(src_cid, src_port,
>+ dst_cid, dst_port,
>+ payload_len,
>+ info->type,
>+ info->op,
>+ info->flags);
>+
>+ return skb;
>+out:
>+ kfree_skb(skb);
>+ return NULL;
>+}
>+
> /* This function can only be used on connecting/connected sockets,
> * since a socket assigned to a transport is required.
> *
>@@ -222,10 +328,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
> static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> struct virtio_vsock_pkt_info *info)
> {
>+ u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
> u32 src_cid, src_port, dst_cid, dst_port;
> const struct virtio_transport *t_ops;
> struct virtio_vsock_sock *vvs;
> u32 pkt_len = info->pkt_len;
>+ bool can_zcopy = false;
> u32 rest_len;
> int ret;
>
>@@ -254,22 +362,48 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
> return pkt_len;
>
>+ if (info->msg) {
>+ /* If zerocopy is not enabled by 'setsockopt()', we behave as
>+ * there is no MSG_ZEROCOPY flag set.
>+ */
>+ if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>+ info->msg->msg_flags &= ~MSG_ZEROCOPY;
>+
>+ if (info->msg->msg_flags & MSG_ZEROCOPY)
>+ can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>+
>+ if (can_zcopy)
>+ max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>+ (MAX_SKB_FRAGS * PAGE_SIZE));
>+ }
>+
> rest_len = pkt_len;
>
> do {
> struct sk_buff *skb;
> size_t skb_len;
>
>- skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>+ skb_len = min(max_skb_len, rest_len);
>
>- skb = virtio_transport_alloc_skb(info, skb_len,
>- src_cid, src_port,
>- dst_cid, dst_port);
>+ skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>+ dst_cid, dst_port,
>+ src_cid, src_port);
> if (!skb) {
> ret = -ENOMEM;
> break;
> }
>
>+ /* This is last skb to send this portion of data. */
>+ if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>+ skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>+ if (virtio_transport_init_zcopy_skb(vsk, skb,
>+ info->msg,
>+ can_zcopy)) {
>+ ret = -ENOMEM;
>+ break;
>+ }
>+ }
>+
> virtio_transport_inc_tx_pkt(vvs, skb);
>
> ret = t_ops->send_pkt(skb);
>@@ -934,11 +1068,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
> if (!t)
> return -ENOTCONN;
>
>- reply = virtio_transport_alloc_skb(&info, 0,
>- le64_to_cpu(hdr->dst_cid),
>- le32_to_cpu(hdr->dst_port),
>+ reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
> le64_to_cpu(hdr->src_cid),
>- le32_to_cpu(hdr->src_port));
>+ le32_to_cpu(hdr->src_port),
>+ le64_to_cpu(hdr->dst_cid),
>+ le32_to_cpu(hdr->dst_port));
> if (!reply)
> return -ENOMEM;
>
>--
>2.25.1
>


2023-07-18 13:59:34

by Stefano Garzarella

[permalink] [raw]
Subject: Re: [PATCH net-next v1 1/4] vsock/virtio/vhost: read data from non-linear skb

On Tue, Jul 18, 2023 at 12:00:48AM +0300, Arseniy Krasnov wrote:
>This is preparation patch for MSG_ZEROCOPY support. It adds handling of
>non-linear skbs by replacing direct calls of 'memcpy_to_msg()' with
>'skb_copy_datagram_iter()'. Main advantage of the second one is that it
>can handle paged part of the skb by using 'kmap()' on each page, but if
>there are no pages in the skb, it behaves like simple copying to iov
>iterator. This patch also adds new field to the control block of skb -
>this value shows current offset in the skb to read next portion of data
>(it doesn't matter linear it or not). Idea behind this field is that
>'skb_copy_datagram_iter()' handles both types of skb internally - it
>just needs an offset from which to copy data from the given skb. This
>offset is incremented on each read from skb. This approach allows to
>avoid special handling of non-linear skbs:
>1) We can't call 'skb_pull()' on it, because it updates 'data' pointer.
>2) We need to update 'data_len' also on each read from this skb.
>
>Signed-off-by: Arseniy Krasnov <[email protected]>
>---
> Changelog:
> v5(big patchset) -> v1:
> * Merge 'virtio_transport_common.c' and 'vhost/vsock.c' patches into
> this single patch.
> * Commit message update: grammar fix and remark that this patch is
> MSG_ZEROCOPY preparation.
> * Use 'min_t()' instead of comparison using '<>' operators.

Reviewed-by: Stefano Garzarella <[email protected]>

>
> drivers/vhost/vsock.c | 14 ++++++++-----
> include/linux/virtio_vsock.h | 1 +
> net/vmw_vsock/virtio_transport_common.c | 27 ++++++++++++++++---------
> 3 files changed, 28 insertions(+), 14 deletions(-)
>
>diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>index 817d377a3f36..8c917be32b5d 100644
>--- a/drivers/vhost/vsock.c
>+++ b/drivers/vhost/vsock.c
>@@ -114,6 +114,7 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> struct sk_buff *skb;
> unsigned out, in;
> size_t nbytes;
>+ u32 frag_off;
> int head;
>
> skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
>@@ -156,7 +157,8 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> }
>
> iov_iter_init(&iov_iter, ITER_DEST, &vq->iov[out], in, iov_len);
>- payload_len = skb->len;
>+ frag_off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>+ payload_len = skb->len - frag_off;
> hdr = virtio_vsock_hdr(skb);
>
> /* If the packet is greater than the space available in the
>@@ -197,8 +199,10 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> break;
> }
>
>- nbytes = copy_to_iter(skb->data, payload_len, &iov_iter);
>- if (nbytes != payload_len) {
>+ if (skb_copy_datagram_iter(skb,
>+ frag_off,
>+ &iov_iter,
>+ payload_len)) {
> kfree_skb(skb);
> vq_err(vq, "Faulted on copying pkt buf\n");
> break;
>@@ -212,13 +216,13 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_add_used(vq, head, sizeof(*hdr) + payload_len);
> added = true;
>
>- skb_pull(skb, payload_len);
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += payload_len;
> total_len += payload_len;
>
> /* If we didn't send all the payload we can requeue the packet
> * to send it with the next available buffer.
> */
>- if (skb->len > 0) {
>+ if (VIRTIO_VSOCK_SKB_CB(skb)->frag_off < skb->len) {
> hdr->flags |= cpu_to_le32(flags_to_restore);
>
> /* We are queueing the same skb to handle
>diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>index c58453699ee9..17dbb7176e37 100644
>--- a/include/linux/virtio_vsock.h
>+++ b/include/linux/virtio_vsock.h
>@@ -12,6 +12,7 @@
> struct virtio_vsock_skb_cb {
> bool reply;
> bool tap_delivered;
>+ u32 frag_off;
> };
>
> #define VIRTIO_VSOCK_SKB_CB(skb) ((struct virtio_vsock_skb_cb *)((skb)->cb))
>diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>index b769fc258931..1a376f808ae6 100644
>--- a/net/vmw_vsock/virtio_transport_common.c
>+++ b/net/vmw_vsock/virtio_transport_common.c
>@@ -355,7 +355,7 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> spin_lock_bh(&vvs->rx_lock);
>
> skb_queue_walk_safe(&vvs->rx_queue, skb, tmp) {
>- off = 0;
>+ off = VIRTIO_VSOCK_SKB_CB(skb)->frag_off;
>
> if (total == len)
> break;
>@@ -370,7 +370,10 @@ virtio_transport_stream_do_peek(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data + off, bytes);
>+ err = skb_copy_datagram_iter(skb, off,
>+ &msg->msg_iter,
>+ bytes);
>+
> if (err)
> goto out;
>
>@@ -413,25 +416,28 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
> while (total < len && !skb_queue_empty(&vvs->rx_queue)) {
> skb = skb_peek(&vvs->rx_queue);
>
>- bytes = len - total;
>- if (bytes > skb->len)
>- bytes = skb->len;
>+ bytes = min_t(size_t, len - total,
>+ skb->len - VIRTIO_VSOCK_SKB_CB(skb)->frag_off);
>
> /* sk_lock is held by caller so no one else can dequeue.
> * Unlock rx_lock since memcpy_to_msg() may sleep.
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data, bytes);
>+ err = skb_copy_datagram_iter(skb,
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off,
>+ &msg->msg_iter, bytes);
>+
> if (err)
> goto out;
>
> spin_lock_bh(&vvs->rx_lock);
>
> total += bytes;
>- skb_pull(skb, bytes);
>
>- if (skb->len == 0) {
>+ VIRTIO_VSOCK_SKB_CB(skb)->frag_off += bytes;
>+
>+ if (skb->len == VIRTIO_VSOCK_SKB_CB(skb)->frag_off) {
> u32 pkt_len = le32_to_cpu(virtio_vsock_hdr(skb)->len);
>
> virtio_transport_dec_rx_pkt(vvs, pkt_len);
>@@ -503,7 +509,10 @@ static int virtio_transport_seqpacket_do_dequeue(struct vsock_sock *vsk,
> */
> spin_unlock_bh(&vvs->rx_lock);
>
>- err = memcpy_to_msg(msg, skb->data, bytes_to_copy);
>+ err = skb_copy_datagram_iter(skb, 0,
>+ &msg->msg_iter,
>+ bytes_to_copy);
>+
> if (err) {
> /* Copy of message failed. Rest of
> * fragments will be freed without copy.
>--
>2.25.1
>


2023-07-18 14:49:27

by Arseniy Krasnov

[permalink] [raw]
Subject: Re: [PATCH net-next v1 4/4] vsock/virtio: MSG_ZEROCOPY flag support



On 18.07.2023 16:51, Stefano Garzarella wrote:
> On Tue, Jul 18, 2023 at 12:00:51AM +0300, Arseniy Krasnov wrote:
>> This adds handling of MSG_ZEROCOPY flag on transmission path: if this
>> flag is set and zerocopy transmission is possible, then non-linear skb
>> will be created and filled with the pages of user's buffer. Pages of
>> user's buffer are locked in memory by 'get_user_pages()'. Second thing
>> that this patch does is replace type of skb owning: instead of calling
>> 'skb_set_owner_sk_safe()' it calls 'skb_set_owner_w()'. Reason of this
>> change is that '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc'
>> of socket, so to decrease this field correctly proper skb destructor is
>> needed: 'sock_wfree()'. This destructor is set by 'skb_set_owner_w()'.
>>
>> Signed-off-by: Arseniy Krasnov <[email protected]>
>> ---
>> Changelog:
>> v5(big patchset) -> v1:
>>  * Refactorings of 'if' conditions.
>>  * Remove extra blank line.
>>  * Remove 'frag_off' field unneeded init.
>>  * Add function 'virtio_transport_fill_skb()' which fills both linear
>>    and non-linear skb with provided data.
>>
>> net/vmw_vsock/virtio_transport_common.c | 260 ++++++++++++++++++------
>> 1 file changed, 197 insertions(+), 63 deletions(-)
>>
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 26a4d10da205..1fb0a0f694c6 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -37,73 +37,115 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
>>     return container_of(t, struct virtio_transport, transport);
>> }
>>
>> -/* Returns a new packet on success, otherwise returns NULL.
>> - *
>> - * If NULL is returned, errp is set to a negative errno.
>> - */
>> -static struct sk_buff *
>> -virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
>> -               size_t len,
>> -               u32 src_cid,
>> -               u32 src_port,
>> -               u32 dst_cid,
>> -               u32 dst_port)
>> -{
>> -    const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
>> -    struct virtio_vsock_hdr *hdr;
>> -    struct sk_buff *skb;
>> -    void *payload;
>> -    int err;
>> +static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
>> +                       size_t max_to_send)
>> +{
>> +    struct iov_iter *iov_iter;
>>
>> -    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> -    if (!skb)
>> -        return NULL;
>> +    if (!info->msg)
>> +        return false;
>>
>> -    hdr = virtio_vsock_hdr(skb);
>> -    hdr->type    = cpu_to_le16(info->type);
>> -    hdr->op        = cpu_to_le16(info->op);
>> -    hdr->src_cid    = cpu_to_le64(src_cid);
>> -    hdr->dst_cid    = cpu_to_le64(dst_cid);
>> -    hdr->src_port    = cpu_to_le32(src_port);
>> -    hdr->dst_port    = cpu_to_le32(dst_port);
>> -    hdr->flags    = cpu_to_le32(info->flags);
>> -    hdr->len    = cpu_to_le32(len);
>> +    iov_iter = &info->msg->msg_iter;
>>
>> -    if (info->msg && len > 0) {
>> -        payload = skb_put(skb, len);
>> -        err = memcpy_from_msg(payload, info->msg, len);
>> -        if (err)
>> -            goto out;
>> +    /* Data is simple buffer. */
>> +    if (iter_is_ubuf(iov_iter))
>> +        return true;
>>
>> -        if (msg_data_left(info->msg) == 0 &&
>> -            info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> -            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +    if (!iter_is_iovec(iov_iter))
>> +        return false;
>>
>> -            if (info->msg->msg_flags & MSG_EOR)
>> -                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> -        }
>> +    if (iov_iter->iov_offset)
>> +        return false;
>> +
>> +    /* We can't send whole iov. */
>> +    if (iov_iter->count > max_to_send)
>> +        return false;
>> +
>> +    return true;
>> +}
>> +
>> +static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>> +                       struct sk_buff *skb,
>> +                       struct msghdr *msg,
>> +                       bool zerocopy)
>> +{
>> +    struct ubuf_info *uarg;
>> +
>> +    if (msg->msg_ubuf) {
>> +        uarg = msg->msg_ubuf;
>> +        net_zcopy_get(uarg);
>> +    } else {
>> +        struct iov_iter *iter = &msg->msg_iter;
>> +        struct ubuf_info_msgzc *uarg_zc;
>> +        int len;
>> +
>> +        /* Only ITER_IOVEC or ITER_UBUF are allowed and
>> +         * checked before.
>> +         */
>> +        if (iter_is_iovec(iter))
>> +            len = iov_length(iter->__iov, iter->nr_segs);
>> +        else
>> +            len = iter->count;
>> +
>> +        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>> +                        len,
>> +                        NULL);
>> +        if (!uarg)
>> +            return -1;
>> +
>> +        uarg_zc = uarg_to_msgzc(uarg);
>> +        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>     }
>>
>> -    if (info->reply)
>> -        virtio_vsock_skb_set_reply(skb);
>> +    skb_zcopy_init(skb, uarg);
>>
>> -    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> -                     dst_cid, dst_port,
>> -                     len,
>> -                     info->type,
>> -                     info->op,
>> -                     info->flags);
>> +    return 0;
>> +}
>>
>> -    if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
>> -        WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
>> -        goto out;
>> +static int virtio_transport_fill_skb(struct sk_buff *skb,
>> +                     struct virtio_vsock_pkt_info *info,
>> +                     size_t len,
>> +                     bool zcopy)
>> +{
>> +    if (zcopy) {
>> +        return __zerocopy_sg_from_iter(info->msg, NULL, skb,
>> +                          &info->msg->msg_iter,
>> +                          len);
>> +    } else {
>> +        void *payload;
>> +        int err;
>> +
>> +        payload = skb_put(skb, len);
>> +        err = memcpy_from_msg(payload, info->msg, len);
>> +        if (err)
>> +            return -1;
>> +
>> +        if (msg_data_left(info->msg))
>> +            return 0;
>> +
>> +        return 0;
>>     }
>> +}
>>
>> -    return skb;
>> +static void virtio_transport_init_hdr(struct sk_buff *skb,
>> +                      struct virtio_vsock_pkt_info *info,
>> +                      u32 src_cid,
>> +                      u32 src_port,
>> +                      u32 dst_cid,
>> +                      u32 dst_port,
>> +                      size_t len)
>> +{
>> +    struct virtio_vsock_hdr *hdr;
>>
>> -out:
>> -    kfree_skb(skb);
>> -    return NULL;
>> +    hdr = virtio_vsock_hdr(skb);
>> +    hdr->type    = cpu_to_le16(info->type);
>> +    hdr->op        = cpu_to_le16(info->op);
>> +    hdr->src_cid    = cpu_to_le64(src_cid);
>> +    hdr->dst_cid    = cpu_to_le64(dst_cid);
>> +    hdr->src_port    = cpu_to_le32(src_port);
>> +    hdr->dst_port    = cpu_to_le32(dst_port);
>> +    hdr->flags    = cpu_to_le32(info->flags);
>> +    hdr->len    = cpu_to_le32(len);
>> }
>>
>> static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
>> @@ -214,6 +256,70 @@ static u16 virtio_transport_get_type(struct sock *sk)
>>         return VIRTIO_VSOCK_TYPE_SEQPACKET;
>> }
>>
>> +static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,
>> +                          struct virtio_vsock_pkt_info *info,
>> +                          size_t payload_len,
>> +                          bool zcopy,
>> +                          u32 dst_cid,
>> +                          u32 dst_port,
>> +                          u32 src_cid,
>> +                          u32 src_port)
>
> Before this patch the order of dst_* and src_* fields were reversed
> for virtio_transport_alloc_skb(), why are we changing it?
>
> I think putting them back as before (and following the same ordering in
> the new functions as well) makes the patch easier to review and the code
> easier to maintain.

Yes, exactly! I'll fix it. Seems I just didn't pay attention on this during
coding...

Thanks, Arseniy

>
> Thanks,
> Stefano
>
>> +{
>> +    struct sk_buff *skb;
>> +    size_t skb_len;
>> +
>> +    skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
>> +
>> +    if (!zcopy)
>> +        skb_len += payload_len;
>> +
>> +    skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
>> +    if (!skb)
>> +        return NULL;
>> +
>> +    virtio_transport_init_hdr(skb, info, src_cid, src_port,
>> +                  dst_cid, dst_port,
>> +                  payload_len);
>> +
>> +    /* Set owner here, because '__zerocopy_sg_from_iter()' uses
>> +     * owner of skb without check to update 'sk_wmem_alloc'.
>> +     */
>> +    if (vsk)
>> +        skb_set_owner_w(skb, sk_vsock(vsk));
>> +
>> +    if (info->msg && payload_len > 0) {
>> +        int err;
>> +
>> +        err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
>> +        if (err)
>> +            goto out;
>> +
>> +        if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
>> +            struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
>> +
>> +            hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
>> +
>> +            if (info->msg->msg_flags & MSG_EOR)
>> +                hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
>> +        }
>> +    }
>> +
>> +    if (info->reply)
>> +        virtio_vsock_skb_set_reply(skb);
>> +
>> +    trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> +                     dst_cid, dst_port,
>> +                     payload_len,
>> +                     info->type,
>> +                     info->op,
>> +                     info->flags);
>> +
>> +    return skb;
>> +out:
>> +    kfree_skb(skb);
>> +    return NULL;
>> +}
>> +
>> /* This function can only be used on connecting/connected sockets,
>>  * since a socket assigned to a transport is required.
>>  *
>> @@ -222,10 +328,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
>> static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>                       struct virtio_vsock_pkt_info *info)
>> {
>> +    u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
>>     u32 src_cid, src_port, dst_cid, dst_port;
>>     const struct virtio_transport *t_ops;
>>     struct virtio_vsock_sock *vvs;
>>     u32 pkt_len = info->pkt_len;
>> +    bool can_zcopy = false;
>>     u32 rest_len;
>>     int ret;
>>
>> @@ -254,22 +362,48 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>     if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
>>         return pkt_len;
>>
>> +    if (info->msg) {
>> +        /* If zerocopy is not enabled by 'setsockopt()', we behave as
>> +         * there is no MSG_ZEROCOPY flag set.
>> +         */
>> +        if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
>> +            info->msg->msg_flags &= ~MSG_ZEROCOPY;
>> +
>> +        if (info->msg->msg_flags & MSG_ZEROCOPY)
>> +            can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
>> +
>> +        if (can_zcopy)
>> +            max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>> +                        (MAX_SKB_FRAGS * PAGE_SIZE));
>> +    }
>> +
>>     rest_len = pkt_len;
>>
>>     do {
>>         struct sk_buff *skb;
>>         size_t skb_len;
>>
>> -        skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
>> +        skb_len = min(max_skb_len, rest_len);
>>
>> -        skb = virtio_transport_alloc_skb(info, skb_len,
>> -                         src_cid, src_port,
>> -                         dst_cid, dst_port);
>> +        skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
>> +                         dst_cid, dst_port,
>> +                         src_cid, src_port);
>>         if (!skb) {
>>             ret = -ENOMEM;
>>             break;
>>         }
>>
>> +        /* This is last skb to send this portion of data. */
>> +        if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
>> +            skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
>> +            if (virtio_transport_init_zcopy_skb(vsk, skb,
>> +                                info->msg,
>> +                                can_zcopy)) {
>> +                ret = -ENOMEM;
>> +                break;
>> +            }
>> +        }
>> +
>>         virtio_transport_inc_tx_pkt(vvs, skb);
>>
>>         ret = t_ops->send_pkt(skb);
>> @@ -934,11 +1068,11 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
>>     if (!t)
>>         return -ENOTCONN;
>>
>> -    reply = virtio_transport_alloc_skb(&info, 0,
>> -                       le64_to_cpu(hdr->dst_cid),
>> -                       le32_to_cpu(hdr->dst_port),
>> +    reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
>>                        le64_to_cpu(hdr->src_cid),
>> -                       le32_to_cpu(hdr->src_port));
>> +                       le32_to_cpu(hdr->src_port),
>> +                       le64_to_cpu(hdr->dst_cid),
>> +                       le32_to_cpu(hdr->dst_port));
>>     if (!reply)
>>         return -ENOMEM;
>>
>> -- 
>> 2.25.1
>>
>