2021-11-30 15:20:44

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 00/12] io_uring zerocopy send

Early proof of concept for zerocopy send via io_uring. This is just
an RFC, there are details yet to be figured out, but hope to gather
some feedback.

Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
or 2.44 times faster.

№ | test: | BW (MB/s) | speedup
1 | msg_zerocopy (non-zc) | 18281 | 0.38
2 | msg_zerocopy -z (baseline) | 47421 | 1
3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03
4 | io_uring (@flush=true, nr_reqs=1) | 89310 | 1.88
5 | io_uring (@flush=false, nr_reqs=8) | 116079 | 2.44
6 | io_uring (@flush=true, nr_reqs=8) | 109722 | 2.31

Based on selftests/.../msg_zerocopy but more limited. You can use
msg_zerocopy -r as usual for receive side.

@nr_reqs controls how many io_uring requests we submit per syscall, e.g.
nr_reqs=1 avoids any io_uring batching, which is wasteful.
@flush controls whether to generate "buffer not used" event per request
or not at all. The API and implementation is more flexible, e.g. allows
to have one notification per N requests, but not added to the benchmark.

It's ipv4/udp only for now. The concept works well with tcp as well, but
omitted patches for now.

# design

The userspace design is briefly outlined in the message to 06/12. In
short, it allows to attach several io_uring send requests to one of
pre-registered notification contexts, and the userspace can "flush"
a notification from that context making it to post an event into
the io_uring's completion queue when buffers are no more in use.

From the net perspective, as ubuf_info's are controlled by io_uring, we
need a way to pass it from outside, which is currently done through a
new struct msghdr::msg_ubuf field.

Note: one great aspect ubufs being controlled from outside is that
it doesn't matter whether the net actually uses the ubuf or not, the
userspace gets the same API in terms of events delivery: it'll get
that additional event about buffers even it wasn't zerocopy.

Another big part is 5/12, which allows to skip get/put_page() for
io_uring's fixed buffers. io_uring ensures that the pages stay alive
until the corresponding ubuf_info is released.

# performance:

The worst case for io_uring is (4), still 1.88 times faster than
msg_zerocopy (2), and there are a couple of "easy" optimisations left
out from the patchset. For 4096 bytes payload zc is only slightly
outperforms non-zc version, the larger payload the wider gap.
I'll get more numbers next time.

Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
much. Notification posting is not a big problem for now, but need
to compare the performance for when io_uring_tx_zerocopy_callback()
is called from IRQ context, and possible rework it to use task_work.

It supports both, regular buffers and fixed ones, but there is a bunch of
optimisations exclusively for io_uring's fixed buffers. For comparison,
normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s

1) we pass a bvec, so no page table walks.
2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
still doing page get/put (see 4/12) slashed 4-5%.
3) avoiding get_page/put_page in 5/12
4) completion events are posted into io_uring's CQ, so no
extra recvmsg for getting events
5) no poll(2) in the code because of io_uring
6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
io_uring caches the structures reducing it to nearly zero-overhead.

io_uring adds some overhead but there are also benefits of using it,
e.g. fixed files and submission batching (@nr_reqs). We can look at
@nr_reqs=1 test cases for more of a "net-only" optimisations.

# discussion / questions

I haven't got a grasp on many aspects of the net stack yet, so would
appreciate feedback in general and there are a couple of questions
thoughts.

1) What are initialisation rules for adding a new field into
struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
filling all the fields.

2) I don't like too much ubuf_info propagation from udp_sendmsg() into
__ip_append_data() (see 3/12). Ideas how to do it better?

3) 5/12 is a quick'n'dirty patch for letting upper layer to manage page
references but likely needs more work. One problem is to prevent mixing
such pages managed by upper layer and normal referenced ones, that's
what SKBFL_MANAGED_FRAGS is about and many chunk trying to make the
accounting of it right. Any pitfalls I missed? Would really love some
ideas and advice here, e.g. how to make it cleaner and/or remove
overhead from non-zc path.

# references

Kernel git branch for convenience:
https://github.com/isilence/linux.git zc_v1

Benchmark:
https://github.com/isilence/liburing.git zc_v1

or this file in particular:
https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c

To run the benchmark:
```
cd <liburing_dir> && make && cd test
# ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
./send-zc -4 -D 127.0.0.1 udp
```

msg_zerocopy can be used for the server side, e.g.
```
cd <linux-kernel>/tools/testing/selftests/net && make
./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
```

Pavel Begunkov (12):
skbuff: add SKBFL_DONT_ORPHAN flag
skbuff: pass a struct ubuf_info in msghdr
net/udp: add support msgdr::msg_ubuf
net: add zerocopy_sg_from_iter for bvec
net: optimise page get/free for bvec zc
io_uring: add send notifiers registration
io_uring: infrastructure for send zc notifications
io_uring: wire send zc request type
io_uring: add an option to flush zc notifications
io_uring: opcode independent fixed buf import
io_uring: sendzc with fixed buffers
io_uring: cache struct ubuf_info

fs/io_uring.c | 397 +++++++++++++++++++++++++++++++++-
include/linux/skbuff.h | 15 +-
include/linux/socket.h | 1 +
include/net/ip.h | 3 +-
include/uapi/linux/io_uring.h | 14 ++
net/compat.c | 1 +
net/core/datagram.c | 59 +++++
net/core/skbuff.c | 18 +-
net/ipv4/ip_output.c | 35 ++-
net/ipv4/udp.c | 2 +-
net/socket.c | 3 +
11 files changed, 521 insertions(+), 27 deletions(-)

--
2.34.0



2021-11-30 15:20:55

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 08/12] io_uring: wire send zc request type

Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from
other send requests is that the user should specify a tx context index,
which will notifiy the userspace when the kernel doesn't need the
buffers anymore and it's safe to reuse them. So, overwriting data
buffers is racy before getting a separate notification even when the
request is already completed.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 120 +++++++++++++++++++++++++++++++++-
include/uapi/linux/io_uring.h | 2 +
2 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6ca02e60fa48..337eb91f0198 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -600,6 +600,16 @@ struct io_sr_msg {
size_t len;
};

+struct io_sendzc {
+ struct file *file;
+ void __user *buf;
+ size_t len;
+ struct io_tx_ctx *tx_ctx;
+ int msg_flags;
+ int addr_len;
+ void __user *addr;
+};
+
struct io_open {
struct file *file;
int dfd;
@@ -874,6 +884,7 @@ struct io_kiocb {
struct io_mkdir mkdir;
struct io_symlink symlink;
struct io_hardlink hardlink;
+ struct io_sendzc msgzc;
};

u8 opcode;
@@ -1123,6 +1134,12 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_MKDIRAT] = {},
[IORING_OP_SYMLINKAT] = {},
[IORING_OP_LINKAT] = {},
+ [IORING_OP_SENDZC] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ .audit_skip = 1,
+ },
};

/* requests with any of those set should undergo io_disarm_next() */
@@ -1999,7 +2016,6 @@ static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
return notifier;
}

-__attribute__((unused))
static inline struct io_tx_notifier *io_get_tx_notifier(struct io_ring_ctx *ctx,
struct io_tx_ctx *tx_ctx)
{
@@ -5025,6 +5041,102 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

+static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_sendzc *sr = &req->msgzc;
+ unsigned int idx;
+
+ if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
+ return -EINVAL;
+ if (READ_ONCE(sqe->ioprio))
+ return -EINVAL;
+
+ sr->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+ sr->len = READ_ONCE(sqe->len);
+ sr->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
+ if (sr->msg_flags & MSG_DONTWAIT)
+ req->flags |= REQ_F_NOWAIT;
+
+ idx = READ_ONCE(sqe->tx_ctx_idx);
+ if (idx > ctx->nr_tx_ctxs)
+ return -EINVAL;
+ idx = array_index_nospec(idx, ctx->nr_tx_ctxs);
+ req->msgzc.tx_ctx = &ctx->tx_ctxs[idx];
+
+ sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+ sr->addr_len = READ_ONCE(sqe->__pad2[0]);
+
+#ifdef CONFIG_COMPAT
+ if (req->ctx->compat)
+ sr->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+ return 0;
+}
+
+static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct sockaddr_storage address;
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_tx_notifier *notifier;
+ struct io_sendzc *sr = &req->msgzc;
+ struct msghdr msg;
+ struct iovec iov;
+ struct socket *sock;
+ unsigned flags;
+ int ret, min_ret = 0;
+
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+ ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+
+ msg.msg_name = NULL;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_namelen = 0;
+ if (sr->addr) {
+ ret = move_addr_to_kernel(sr->addr, sr->addr_len, &address);
+ if (ret < 0)
+ return ret;
+ msg.msg_name = (struct sockaddr *)&address;
+ msg.msg_namelen = sr->addr_len;
+ }
+
+ io_ring_submit_lock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+ notifier = io_get_tx_notifier(ctx, req->msgzc.tx_ctx);
+ if (!notifier) {
+ req_set_fail(req);
+ ret = -ENOMEM;
+ goto out;
+ }
+ msg.msg_ubuf = &notifier->uarg;
+
+ flags = sr->msg_flags;
+ if (issue_flags & IO_URING_F_NONBLOCK)
+ flags |= MSG_DONTWAIT;
+ if (flags & MSG_WAITALL)
+ min_ret = iov_iter_count(&msg.msg_iter);
+ msg.msg_flags = flags;
+ ret = sock_sendmsg(sock, &msg);
+
+ if (ret < min_ret) {
+ if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
+ goto out;
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+ req_set_fail(req);
+ }
+ io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+ __io_req_complete(req, issue_flags, ret, 0);
+ return 0;
+out:
+ io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
+ return ret;
+}
+
static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
struct io_async_msghdr *iomsg)
{
@@ -5428,6 +5540,7 @@ IO_NETOP_PREP_ASYNC(sendmsg);
IO_NETOP_PREP_ASYNC(recvmsg);
IO_NETOP_PREP_ASYNC(connect);
IO_NETOP_PREP(accept);
+IO_NETOP_PREP(sendzc);
IO_NETOP_FN(send);
IO_NETOP_FN(recv);
#endif /* CONFIG_NET */
@@ -6575,6 +6688,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
case IORING_OP_SENDMSG:
case IORING_OP_SEND:
return io_sendmsg_prep(req, sqe);
+ case IORING_OP_SENDZC:
+ return io_sendzc_prep(req, sqe);
case IORING_OP_RECVMSG:
case IORING_OP_RECV:
return io_recvmsg_prep(req, sqe);
@@ -6832,6 +6947,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_SEND:
ret = io_send(req, issue_flags);
break;
+ case IORING_OP_SENDZC:
+ ret = io_sendzc(req, issue_flags);
+ break;
case IORING_OP_RECVMSG:
ret = io_recvmsg(req, issue_flags);
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index f2e8d18e40e0..bbc78fe8ca77 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -59,6 +59,7 @@ struct io_uring_sqe {
union {
__s32 splice_fd_in;
__u32 file_index;
+ __u32 tx_ctx_idx;
};
__u64 __pad2[2];
};
@@ -143,6 +144,7 @@ enum {
IORING_OP_MKDIRAT,
IORING_OP_SYMLINKAT,
IORING_OP_LINKAT,
+ IORING_OP_SENDZC,

/* this goes last, obviously */
IORING_OP_LAST,
--
2.34.0


2021-11-30 15:21:00

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 02/12] skbuff: pass a struct ubuf_info in msghdr

Instead of the net stack managing ubuf_info, allow to pass it in from
outside in a struct msghdr (in-kernel structure), so io_uring can make
use of it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 2 ++
include/linux/socket.h | 1 +
net/compat.c | 1 +
net/socket.c | 3 +++
4 files changed, 7 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 72da3a75521a..59380e3454ad 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -4911,6 +4911,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
+ msg.msg_ubuf = NULL;

flags = req->sr_msg.msg_flags;
if (issue_flags & IO_URING_F_NONBLOCK)
@@ -5157,6 +5158,7 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_namelen = 0;
msg.msg_iocb = NULL;
msg.msg_flags = 0;
+ msg.msg_ubuf = NULL;

flags = req->sr_msg.msg_flags;
if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 8ef26d89ef49..6bd2c6b0c6f2 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -65,6 +65,7 @@ struct msghdr {
__kernel_size_t msg_controllen; /* ancillary data buffer length */
unsigned int msg_flags; /* flags on received message */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
+ struct ubuf_info *msg_ubuf;
};

struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,7 @@ int __get_compat_msghdr(struct msghdr *kmsg,
return -EMSGSIZE;

kmsg->msg_iocb = NULL;
+ kmsg->msg_ubuf = NULL;
*ptr = msg.msg_iov;
*len = msg.msg_iovlen;
return 0;
diff --git a/net/socket.c b/net/socket.c
index 7f64a6eccf63..0a29b616a38c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2023,6 +2023,7 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
+ msg.msg_ubuf = NULL;
if (addr) {
err = move_addr_to_kernel(addr, addr_len, &address);
if (err < 0)
@@ -2088,6 +2089,7 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
msg.msg_namelen = 0;
msg.msg_iocb = NULL;
msg.msg_flags = 0;
+ msg.msg_ubuf = NULL;
if (sock->file->f_flags & O_NONBLOCK)
flags |= MSG_DONTWAIT;
err = sock_recvmsg(sock, &msg, flags);
@@ -2326,6 +2328,7 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
return -EMSGSIZE;

kmsg->msg_iocb = NULL;
+ kmsg->msg_ubuf = NULL;
*uiov = msg.msg_iov;
*nsegs = msg.msg_iovlen;
return 0;
--
2.34.0


2021-11-30 15:21:08

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 03/12] net/udp: add support msgdr::msg_ubuf

Make ipv4/udp to use ubuf_info passed in struct msghdr if it was
specified.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/net/ip.h | 3 ++-
net/ipv4/ip_output.c | 31 ++++++++++++++++++++++++-------
net/ipv4/udp.c | 2 +-
3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index b71e88507c4a..e9c61b83a770 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -232,7 +232,8 @@ struct sk_buff *ip_make_skb(struct sock *sk, struct flowi4 *fl4,
int len, int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen,
struct ipcm_cookie *ipc, struct rtable **rtp,
- struct inet_cork *cork, unsigned int flags);
+ struct inet_cork *cork, unsigned int flags,
+ struct ubuf_info *uarg);

int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl);

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9bca57ef8b83..f9aab355d283 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -948,10 +948,10 @@ static int __ip_append_data(struct sock *sk,
int getfrag(void *from, char *to, int offset,
int len, int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen,
- unsigned int flags)
+ unsigned int flags,
+ struct ubuf_info *uarg)
{
struct inet_sock *inet = inet_sk(sk);
- struct ubuf_info *uarg = NULL;
struct sk_buff *skb;

struct ip_options *opt = cork->opt;
@@ -967,6 +967,7 @@ static int __ip_append_data(struct sock *sk,
unsigned int wmem_alloc_delta = 0;
bool paged, extra_uref = false;
u32 tskey = 0;
+ bool zc = false;

skb = skb_peek_tail(queue);

@@ -1001,7 +1002,21 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;

- if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+ if (uarg) {
+ if (skb_zcopy(skb) && uarg != skb_zcopy(skb))
+ return -EINVAL;
+
+ /* If it's not zerocopy, just drop uarg, the caller should
+ * be able to handle it.
+ */
+ if (rt->dst.dev->features & NETIF_F_SG &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ } else {
+ uarg = NULL;
+ }
+ } else if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
if (!uarg)
return -ENOBUFS;
@@ -1009,6 +1024,7 @@ static int __ip_append_data(struct sock *sk,
if (rt->dst.dev->features & NETIF_F_SG &&
csummode == CHECKSUM_PARTIAL) {
paged = true;
+ zc = true;
} else {
uarg->zerocopy = 0;
skb_zcopy_set(skb, uarg, &extra_uref);
@@ -1172,7 +1188,7 @@ static int __ip_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
- } else if (!uarg || !uarg->zerocopy) {
+ } else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;

err = -ENOMEM;
@@ -1309,7 +1325,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,

return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
sk_page_frag(sk), getfrag,
- from, length, transhdrlen, flags);
+ from, length, transhdrlen, flags, NULL);
}

ssize_t ip_append_page(struct sock *sk, struct flowi4 *fl4, struct page *page,
@@ -1601,7 +1617,8 @@ struct sk_buff *ip_make_skb(struct sock *sk,
int len, int odd, struct sk_buff *skb),
void *from, int length, int transhdrlen,
struct ipcm_cookie *ipc, struct rtable **rtp,
- struct inet_cork *cork, unsigned int flags)
+ struct inet_cork *cork, unsigned int flags,
+ struct ubuf_info *uarg)
{
struct sk_buff_head queue;
int err;
@@ -1620,7 +1637,7 @@ struct sk_buff *ip_make_skb(struct sock *sk,

err = __ip_append_data(sk, fl4, &queue, cork,
&current->task_frag, getfrag,
- from, length, transhdrlen, flags);
+ from, length, transhdrlen, flags, uarg);
if (err) {
__ip_flush_pending_frames(sk, &queue, cork);
return ERR_PTR(err);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8bcecdd6aeda..8c514bff48d4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1247,7 +1247,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)

skb = ip_make_skb(sk, fl4, getfrag, msg, ulen,
sizeof(struct udphdr), &ipc, &rt,
- &cork, msg->msg_flags);
+ &cork, msg->msg_flags, msg->msg_ubuf);
err = PTR_ERR(skb);
if (!IS_ERR_OR_NULL(skb))
err = udp_send_skb(skb, fl4, &cork);
--
2.34.0


2021-11-30 15:21:24

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 07/12] io_uring: infrastructure for send zc notifications

Add a new ubuf_info callback io_uring_tx_zerocopy_callback(), which
should post an CQE when it completes. Also, implement some
infrastructuire for allocating and managing struct ubuf_info.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 108 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a01f91e70fa5..6ca02e60fa48 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -329,6 +329,11 @@ struct io_submit_state {
};

struct io_tx_notifier {
+ struct ubuf_info uarg;
+ struct work_struct commit_work;
+ struct percpu_ref *fixed_rsrc_refs;
+ u64 tag;
+ u32 seq;
};

struct io_tx_ctx {
@@ -1275,15 +1280,20 @@ static void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
percpu_ref_get_many(&ctx->rsrc_node->refs, IO_RSRC_REF_BATCH);
}

+static inline void io_set_rsrc_node(struct percpu_ref **rsrc_refs,
+ struct io_ring_ctx *ctx)
+{
+ *rsrc_refs = &ctx->rsrc_node->refs;
+ ctx->rsrc_cached_refs--;
+ if (unlikely(ctx->rsrc_cached_refs < 0))
+ io_rsrc_refs_refill(ctx);
+}
+
static inline void io_req_set_rsrc_node(struct io_kiocb *req,
struct io_ring_ctx *ctx)
{
- if (!req->fixed_rsrc_refs) {
- req->fixed_rsrc_refs = &ctx->rsrc_node->refs;
- ctx->rsrc_cached_refs--;
- if (unlikely(ctx->rsrc_cached_refs < 0))
- io_rsrc_refs_refill(ctx);
- }
+ if (!req->fixed_rsrc_refs)
+ io_set_rsrc_node(&req->fixed_rsrc_refs, ctx);
}

static void io_refs_resurrect(struct percpu_ref *ref, struct completion *compl)
@@ -1930,6 +1940,76 @@ static noinline bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data,
return __io_fill_cqe(ctx, user_data, res, cflags);
}

+static void io_zc_tx_work_callback(struct work_struct *work)
+{
+ struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
+ commit_work);
+ struct io_ring_ctx *ctx = notifier->uarg.ctx;
+
+ spin_lock(&ctx->completion_lock);
+ io_fill_cqe_aux(ctx, notifier->tag, notifier->seq, 0);
+ io_commit_cqring(ctx);
+ spin_unlock(&ctx->completion_lock);
+ io_cqring_ev_posted(ctx);
+
+ percpu_ref_put(notifier->fixed_rsrc_refs);
+ percpu_ref_put(&ctx->refs);
+ kfree(notifier);
+}
+
+static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+ struct ubuf_info *uarg,
+ bool success)
+{
+ struct io_tx_notifier *notifier;
+
+ notifier = container_of(uarg, struct io_tx_notifier, uarg);
+ if (!refcount_dec_and_test(&uarg->refcnt))
+ return;
+
+ if (in_interrupt()) {
+ INIT_WORK(&notifier->commit_work, io_zc_tx_work_callback);
+ queue_work(system_unbound_wq, &notifier->commit_work);
+ } else {
+ io_zc_tx_work_callback(&notifier->commit_work);
+ }
+}
+
+static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
+ struct io_tx_ctx *tx_ctx)
+{
+ struct io_tx_notifier *notifier;
+ struct ubuf_info *uarg;
+
+ notifier = kmalloc(sizeof(*notifier), GFP_ATOMIC);
+ if (!notifier)
+ return NULL;
+
+ WARN_ON_ONCE(!current->io_uring);
+ notifier->seq = tx_ctx->seq++;
+ notifier->tag = tx_ctx->tag;
+ io_set_rsrc_node(&notifier->fixed_rsrc_refs, ctx);
+
+ uarg = &notifier->uarg;
+ uarg->ctx = ctx;
+ uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+ uarg->callback = io_uring_tx_zerocopy_callback;
+ refcount_set(&uarg->refcnt, 1);
+ percpu_ref_get(&ctx->refs);
+ return notifier;
+}
+
+__attribute__((unused))
+static inline struct io_tx_notifier *io_get_tx_notifier(struct io_ring_ctx *ctx,
+ struct io_tx_ctx *tx_ctx)
+{
+ if (tx_ctx->notifier)
+ return tx_ctx->notifier;
+
+ tx_ctx->notifier = io_alloc_tx_notifier(ctx, tx_ctx);
+ return tx_ctx->notifier;
+}
+
static void io_req_complete_post(struct io_kiocb *req, s32 res,
u32 cflags)
{
@@ -9212,11 +9292,27 @@ static int io_buffer_validate(struct iovec *iov)
return 0;
}

+static void io_sqe_tx_ctx_kill_ubufs(struct io_ring_ctx *ctx)
+{
+ struct io_tx_ctx *tx_ctx;
+ int i;
+
+ for (i = 0; i < ctx->nr_tx_ctxs; i++) {
+ tx_ctx = &ctx->tx_ctxs[i];
+ if (!tx_ctx->notifier)
+ continue;
+ io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg,
+ true);
+ tx_ctx->notifier = NULL;
+ }
+}
+
static int io_sqe_tx_ctx_unregister(struct io_ring_ctx *ctx)
{
if (!ctx->nr_tx_ctxs)
return -ENXIO;

+ io_sqe_tx_ctx_kill_ubufs(ctx);
kvfree(ctx->tx_ctxs);
ctx->tx_ctxs = NULL;
ctx->nr_tx_ctxs = 0;
@@ -9608,6 +9704,12 @@ static __cold void io_ring_exit_work(struct work_struct *work)
io_sq_thread_unpark(sqd);
}

+ if (READ_ONCE(ctx->nr_tx_ctxs)) {
+ mutex_lock(&ctx->uring_lock);
+ io_sqe_tx_ctx_kill_ubufs(ctx);
+ mutex_unlock(&ctx->uring_lock);
+ }
+
io_req_caches_free(ctx);

if (WARN_ON_ONCE(time_after(jiffies, timeout))) {
--
2.34.0


2021-11-30 15:21:31

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 06/12] io_uring: add send notifiers registration

Add IORING_REGISTER_TX_CTX and IORING_UNREGISTER_TX_CTX. Transmission
(i.e. send) context will serve be used to notify the userspace when
fixed buffers used for zerocopy sends are released by the kernel.

Notification of a single tx context lives in generations, where each
generation posts one CQE with ->user_data equal to the specified tag and
->res is a generation number starting from 0. All requests issued
against a ctx will get attached to the current generation of
notifications. Then, the userspace will be able to request to flush the
notification allowing it to post a CQE when all buffers of all requests
attached to it are released by the kernel. It'll also switch the
generation to a new one with a sequence number incremented by one.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 7 ++++
2 files changed, 79 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 59380e3454ad..a01f91e70fa5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -94,6 +94,8 @@
#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)
#define IORING_SQPOLL_CAP_ENTRIES_VALUE 8

+#define IORING_MAX_TX_NOTIFIERS (1U << 10)
+
/* only define max */
#define IORING_MAX_FIXED_FILES (1U << 15)
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
@@ -326,6 +328,15 @@ struct io_submit_state {
struct blk_plug plug;
};

+struct io_tx_notifier {
+};
+
+struct io_tx_ctx {
+ struct io_tx_notifier *notifier;
+ u64 tag;
+ u32 seq;
+};
+
struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
@@ -373,6 +384,8 @@ struct io_ring_ctx {
unsigned nr_user_files;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;
+ struct io_tx_ctx *tx_ctxs;
+ unsigned nr_tx_ctxs;

struct io_submit_state submit_state;
struct list_head timeout_list;
@@ -9199,6 +9212,55 @@ static int io_buffer_validate(struct iovec *iov)
return 0;
}

+static int io_sqe_tx_ctx_unregister(struct io_ring_ctx *ctx)
+{
+ if (!ctx->nr_tx_ctxs)
+ return -ENXIO;
+
+ kvfree(ctx->tx_ctxs);
+ ctx->tx_ctxs = NULL;
+ ctx->nr_tx_ctxs = 0;
+ return 0;
+}
+
+static int io_sqe_tx_ctx_register(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned int nr_args)
+{
+ struct io_uring_tx_ctx_register __user *tx_args = arg;
+ struct io_uring_tx_ctx_register tx_arg;
+ unsigned i;
+ int ret;
+
+ if (ctx->nr_tx_ctxs)
+ return -EBUSY;
+ if (!nr_args)
+ return -EINVAL;
+ if (nr_args > IORING_MAX_TX_NOTIFIERS)
+ return -EMFILE;
+
+ ctx->tx_ctxs = kvcalloc(nr_args, sizeof(ctx->tx_ctxs[0]),
+ GFP_KERNEL_ACCOUNT);
+ if (!ctx->tx_ctxs)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_args; i++, ctx->nr_tx_ctxs++) {
+ struct io_tx_ctx *tx_ctx = &ctx->tx_ctxs[i];
+
+ if (copy_from_user(&tx_arg, &tx_args[i], sizeof(tx_arg))) {
+ ret = -EFAULT;
+ goto out_fput;
+ }
+ tx_ctx->tag = tx_arg.tag;
+ }
+ return 0;
+
+out_fput:
+ kvfree(ctx->tx_ctxs);
+ ctx->tx_ctxs = NULL;
+ ctx->nr_tx_ctxs = 0;
+ return ret;
+}
+
static int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
unsigned int nr_args, u64 __user *tags)
{
@@ -9429,6 +9491,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));

+ io_sqe_tx_ctx_unregister(ctx);
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);

@@ -11104,6 +11167,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_iowq_max_workers(ctx, arg);
break;
+ case IORING_REGISTER_TX_CTX:
+ ret = io_sqe_tx_ctx_register(ctx, arg, nr_args);
+ break;
+ case IORING_UNREGISTER_TX_CTX:
+ ret = -EINVAL;
+ if (arg || nr_args)
+ break;
+ ret = io_sqe_tx_ctx_unregister(ctx);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 787f491f0d2a..f2e8d18e40e0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -325,6 +325,9 @@ enum {
/* set/get max number of io-wq workers */
IORING_REGISTER_IOWQ_MAX_WORKERS = 19,

+ IORING_REGISTER_TX_CTX = 20,
+ IORING_UNREGISTER_TX_CTX = 21,
+
/* this goes last */
IORING_REGISTER_LAST
};
@@ -365,6 +368,10 @@ struct io_uring_rsrc_update2 {
__u32 resv2;
};

+struct io_uring_tx_ctx_register {
+ __u64 tag;
+};
+
/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP (-2)

--
2.34.0


2021-11-30 15:21:38

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 10/12] io_uring: opcode independent fixed buf import

Extract an opcode independent helper from io_import_fixed for
initialising an iov_iter with a fixed buffer with

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e1360fde95d3..bb991f4cee7b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3152,11 +3152,11 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
}
}

-static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter,
- struct io_mapped_ubuf *imu)
+static int __io_import_fixed(int rw, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu,
+ u64 buf_addr, size_t len)
{
- size_t len = req->rw.len;
- u64 buf_end, buf_addr = req->rw.addr;
+ u64 buf_end;
size_t offset;

if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
@@ -3225,7 +3225,7 @@ static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter)
imu = READ_ONCE(ctx->user_bufs[index]);
req->imu = imu;
}
- return __io_import_fixed(req, rw, iter, imu);
+ return __io_import_fixed(rw, iter, imu, req->rw.addr, req->rw.len);
}

static void io_ring_submit_unlock(struct io_ring_ctx *ctx, bool needs_lock)
--
2.34.0


2021-11-30 15:21:40

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 09/12] io_uring: add an option to flush zc notifications

Add IORING_SENDZC_FLUSH flag. If specified, a send zc operation on
success should also flush a corresponding ubuf_info.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 26 +++++++++++++++++++-------
include/uapi/linux/io_uring.h | 4 ++++
2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 337eb91f0198..e1360fde95d3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -608,6 +608,7 @@ struct io_sendzc {
int msg_flags;
int addr_len;
void __user *addr;
+ unsigned int zc_flags;
};

struct io_open {
@@ -1992,6 +1993,12 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
}
}

+static void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
+{
+ io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg, true);
+ tx_ctx->notifier = NULL;
+}
+
static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
struct io_tx_ctx *tx_ctx)
{
@@ -5041,6 +5048,8 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

+#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FLUSH
+
static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -5049,8 +5058,6 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
return -EINVAL;
- if (READ_ONCE(sqe->ioprio))
- return -EINVAL;

sr->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
sr->len = READ_ONCE(sqe->len);
@@ -5067,6 +5074,10 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
sr->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
sr->addr_len = READ_ONCE(sqe->__pad2[0]);

+ req->msgzc.zc_flags = READ_ONCE(sqe->ioprio);
+ if (req->msgzc.zc_flags & ~IO_SENDZC_VALID_FLAGS)
+ return -EINVAL;
+
#ifdef CONFIG_COMPAT
if (req->ctx->compat)
sr->msg_flags |= MSG_CMSG_COMPAT;
@@ -5089,6 +5100,7 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
sock = sock_from_file(req->file);
if (unlikely(!sock))
return -ENOTSOCK;
+
ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
if (unlikely(ret))
return ret;
@@ -5128,6 +5140,8 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
if (ret == -ERESTARTSYS)
ret = -EINTR;
req_set_fail(req);
+ } else if (req->msgzc.zc_flags & IORING_SENDZC_FLUSH) {
+ io_tx_kill_notification(req->msgzc.tx_ctx);
}
io_ring_submit_unlock(ctx, issue_flags & IO_URING_F_UNLOCKED);
__io_req_complete(req, issue_flags, ret, 0);
@@ -9417,11 +9431,9 @@ static void io_sqe_tx_ctx_kill_ubufs(struct io_ring_ctx *ctx)

for (i = 0; i < ctx->nr_tx_ctxs; i++) {
tx_ctx = &ctx->tx_ctxs[i];
- if (!tx_ctx->notifier)
- continue;
- io_uring_tx_zerocopy_callback(NULL, &tx_ctx->notifier->uarg,
- true);
- tx_ctx->notifier = NULL;
+
+ if (tx_ctx->notifier)
+ io_tx_kill_notification(tx_ctx);
}
}

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index bbc78fe8ca77..ac18e8e6f86f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -187,6 +187,10 @@ enum {
#define IORING_POLL_UPDATE_EVENTS (1U << 1)
#define IORING_POLL_UPDATE_USER_DATA (1U << 2)

+enum {
+ IORING_SENDZC_FLUSH = (1U << 0),
+};
+
/*
* IO completion data structure (Completion Queue Entry)
*/
--
2.34.0


2021-11-30 15:22:24

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 04/12] net: add zerocopy_sg_from_iter for bvec

Add a separate path for bvec iterators in __zerocopy_sg_from_iter, first
it's quite faster but also will be needed to optimise out
get/put_page()

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/core/datagram.c | 54 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index ee290776c661..e00f7e0a7a0a 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -616,11 +616,65 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
}
EXPORT_SYMBOL(skb_copy_datagram_from_iter);

+static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
+ struct iov_iter *from, size_t length)
+{
+ int ret, frag = skb_shinfo(skb)->nr_frags;
+ struct bvec_iter bi;
+ struct bio_vec v;
+ ssize_t copied = 0;
+ unsigned long truesize = 0;
+
+ bi.bi_size = min(from->count, length);
+ bi.bi_bvec_done = from->iov_offset;
+ bi.bi_idx = 0;
+
+ while (bi.bi_size) {
+ if (frag == MAX_SKB_FRAGS) {
+ ret = -EMSGSIZE;
+ goto out;
+ }
+
+ /*
+ * TODO: ignore compound pages for now, all bvec from io_uring
+ * are within boundaries of a single page.
+ */
+ v = mp_bvec_iter_bvec(from->bvec, bi);
+ copied += v.bv_len;
+ truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+ get_page(v.bv_page);
+ skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+ bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+ }
+ ret = 0;
+out:
+ skb->data_len += copied;
+ skb->len += copied;
+ skb->truesize += truesize;
+
+ if (sk && sk->sk_type == SOCK_STREAM) {
+ sk_wmem_queued_add(sk, truesize);
+ if (!skb_zcopy_pure(skb))
+ sk_mem_charge(sk, truesize);
+ } else {
+ refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+ }
+
+ from->bvec += bi.bi_idx;
+ from->nr_segs -= bi.bi_idx;
+ from->count = bi.bi_size;
+ from->iov_offset = bi.bi_bvec_done;
+ return ret;
+}
+
int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, size_t length)
{
int frag = skb_shinfo(skb)->nr_frags;

+ if (iov_iter_is_bvec(from))
+ return __zerocopy_sg_from_bvec(sk, skb, from, length);
+
while (length && iov_iter_count(from)) {
struct page *pages[MAX_SKB_FRAGS];
struct page *last_head = NULL;
--
2.34.0


2021-11-30 15:22:25

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 05/12] net: optimise page get/free for bvec zc

get_page() in __zerocopy_sg_from_bvec() and matching put_page()s are
expensive. However, we can avoid it if the caller can guarantee that
pages stay alive until the corresponding ubuf_info is not released.
In particular, it targets io_uring with fixed buffers following the
described contract.

Assuming that nobody yet uses bvec together with zerocopy, make all
calls with bvec iterators follow this model.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 10 +++++++++-
net/core/datagram.c | 9 +++++++--
net/core/skbuff.c | 16 +++++++++++++---
net/ipv4/ip_output.c | 4 ++++
4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 750b7518d6e2..ebb12a7d386d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -461,6 +461,11 @@ enum {
SKBFL_PURE_ZEROCOPY = BIT(2),

SKBFL_DONT_ORPHAN = BIT(3),
+
+ /* page references are managed by the ubuf_info, so it's safe to
+ * use frags only up until ubuf_info is released
+ */
+ SKBFL_MANAGED_FRAGS = BIT(4),
};

#define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
@@ -3154,7 +3159,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
*/
static inline void skb_frag_unref(struct sk_buff *skb, int f)
{
- __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ if (!(shinfo->flags & SKBFL_MANAGED_FRAGS))
+ __skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
}

/**
diff --git a/net/core/datagram.c b/net/core/datagram.c
index e00f7e0a7a0a..5cf0672039d6 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -642,7 +642,6 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
v = mp_bvec_iter_bvec(from->bvec, bi);
copied += v.bv_len;
truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
- get_page(v.bv_page);
skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
}
@@ -671,9 +670,15 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, size_t length)
{
int frag = skb_shinfo(skb)->nr_frags;
+ bool managed = skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS;

- if (iov_iter_is_bvec(from))
+ if (iov_iter_is_bvec(from) && (managed || frag == 0)) {
+ skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAGS;
return __zerocopy_sg_from_bvec(sk, skb, from, length);
+ }
+
+ if (managed)
+ return -EFAULT;

while (length && iov_iter_count(from)) {
struct page *pages[MAX_SKB_FRAGS];
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b23db60ea6f9..b7b087815539 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,10 +666,14 @@ static void skb_release_data(struct sk_buff *skb)
&shinfo->dataref))
goto exit;

- skb_zcopy_clear(skb, true);
+ if (!(shinfo->flags & SKBFL_MANAGED_FRAGS)) {
+ for (i = 0; i < shinfo->nr_frags; i++)
+ __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
+ } else {
+ shinfo->flags &= ~SKBFL_MANAGED_FRAGS;
+ }

- for (i = 0; i < shinfo->nr_frags; i++)
- __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
+ skb_zcopy_clear(skb, true);

if (shinfo->frag_list)
kfree_skb_list(shinfo->frag_list);
@@ -1471,6 +1475,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
/* skb frags release userspace buffers */
for (i = 0; i < num_frags; i++)
skb_frag_unref(skb, i);
+ skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAGS;

/* skb frags point to kernel buffers */
for (i = 0; i < new_frags - 1; i++) {
@@ -1597,6 +1602,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));

skb_copy_header(n, skb);
+ skb_shinfo(n)->flags &= ~SKBFL_MANAGED_FRAGS;
return n;
}
EXPORT_SYMBOL(skb_copy);
@@ -1653,6 +1659,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
skb_frag_ref(skb, i);
}
skb_shinfo(n)->nr_frags = i;
+ skb_shinfo(n)->flags &= ~SKBFL_MANAGED_FRAGS;
}

if (skb_has_frag_list(skb)) {
@@ -1725,6 +1732,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
refcount_inc(&skb_uarg(skb)->refcnt);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
skb_frag_ref(skb, i);
+ skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAGS;

if (skb_has_frag_list(skb))
skb_clone_fraglist(skb);
@@ -3788,6 +3796,8 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
if (skb_can_coalesce(skb, i, page, offset)) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
} else if (i < MAX_SKB_FRAGS) {
+ if (skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS)
+ return -EMSGSIZE;
get_page(page);
skb_fill_page_desc(skb, i, page, offset, size);
} else {
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index f9aab355d283..e6adf96e5530 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1194,6 +1194,10 @@ static int __ip_append_data(struct sock *sk,
err = -ENOMEM;
if (!sk_page_frag_refill(sk, pfrag))
goto error;
+ if (skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS) {
+ err = -EMSGSIZE;
+ goto error;
+ }

if (!skb_can_coalesce(skb, i, pfrag->page,
pfrag->offset)) {
--
2.34.0


2021-11-30 15:22:26

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 01/12] skbuff: add SKBFL_DONT_ORPHAN flag

We don't want to list every single ubuf_info callback in
skb_orphan_frags(), add a flag controlling the behaviour.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 5 +++--
net/core/skbuff.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c8cb7e697d47..750b7518d6e2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -459,6 +459,8 @@ enum {
* charged to the kernel memory.
*/
SKBFL_PURE_ZEROCOPY = BIT(2),
+
+ SKBFL_DONT_ORPHAN = BIT(3),
};

#define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
@@ -2839,8 +2841,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
{
if (likely(!skb_zcopy(skb)))
return 0;
- if (!skb_zcopy_is_nouarg(skb) &&
- skb_uarg(skb)->callback == msg_zerocopy_callback)
+ if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
return 0;
return skb_copy_ubufs(skb, gfp_mask);
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ba2f38246f07..b23db60ea6f9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1191,7 +1191,7 @@ struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
- uarg->flags = SKBFL_ZEROCOPY_FRAG;
+ uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
refcount_set(&uarg->refcnt, 1);
sock_hold(sk);

--
2.34.0


2021-11-30 15:22:32

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 12/12] io_uring: cache struct ubuf_info

Allocation/deallocation of ubuf_info takes some time, add an
optimisation caching them. The implementation is alike to how we cache
requests in io_req_complete_post().

->ubuf_list is protected by ->uring_lock and requests try grab directly
from it, and there is also ->ubuf_list_locked list protected by
->completion_lock, which is eventually batch spliced to ->ubuf_list.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 64 insertions(+), 10 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a0adfadf759..8c81177395c3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -334,6 +334,7 @@ struct io_tx_notifier {
struct percpu_ref *fixed_rsrc_refs;
u64 tag;
u32 seq;
+ struct list_head cache_node;
};

struct io_tx_ctx {
@@ -393,6 +394,9 @@ struct io_ring_ctx {
unsigned nr_tx_ctxs;

struct io_submit_state submit_state;
+ struct list_head ubuf_list;
+ struct list_head ubuf_list_locked;
+ int ubuf_locked_nr;
struct list_head timeout_list;
struct list_head ltimeout_list;
struct list_head cq_overflow_list;
@@ -1491,6 +1495,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_WQ_LIST(&ctx->locked_free_list);
INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+ INIT_LIST_HEAD(&ctx->ubuf_list);
+ INIT_LIST_HEAD(&ctx->ubuf_list_locked);
return ctx;
err:
kfree(ctx->dummy_ubuf);
@@ -1963,16 +1969,20 @@ static void io_zc_tx_work_callback(struct work_struct *work)
struct io_tx_notifier *notifier = container_of(work, struct io_tx_notifier,
commit_work);
struct io_ring_ctx *ctx = notifier->uarg.ctx;
+ struct percpu_ref *rsrc_refs = notifier->fixed_rsrc_refs;

spin_lock(&ctx->completion_lock);
io_fill_cqe_aux(ctx, notifier->tag, notifier->seq, 0);
+
+ list_add(&notifier->cache_node, &ctx->ubuf_list_locked);
+ ctx->ubuf_locked_nr++;
+
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

- percpu_ref_put(notifier->fixed_rsrc_refs);
+ percpu_ref_put(rsrc_refs);
percpu_ref_put(&ctx->refs);
- kfree(notifier);
}

static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
@@ -1999,26 +2009,69 @@ static void io_tx_kill_notification(struct io_tx_ctx *tx_ctx)
tx_ctx->notifier = NULL;
}

+static void io_notifier_splice(struct io_ring_ctx *ctx)
+{
+ spin_lock(&ctx->completion_lock);
+ list_splice_init(&ctx->ubuf_list_locked, &ctx->ubuf_list);
+ ctx->ubuf_locked_nr = 0;
+ spin_unlock(&ctx->completion_lock);
+}
+
+static void io_notifier_free_cached(struct io_ring_ctx *ctx)
+{
+ struct io_tx_notifier *notifier;
+
+ io_notifier_splice(ctx);
+
+ while (!list_empty(&ctx->ubuf_list)) {
+ notifier = list_first_entry(&ctx->ubuf_list,
+ struct io_tx_notifier, cache_node);
+ list_del(&notifier->cache_node);
+ kfree(notifier);
+ }
+}
+
+static inline bool io_notifier_has_cached(struct io_ring_ctx *ctx)
+{
+ if (likely(!list_empty(&ctx->ubuf_list)))
+ return true;
+ if (READ_ONCE(ctx->ubuf_locked_nr) <= IO_REQ_ALLOC_BATCH)
+ return false;
+ io_notifier_splice(ctx);
+ return !list_empty(&ctx->ubuf_list);
+}
+
static struct io_tx_notifier *io_alloc_tx_notifier(struct io_ring_ctx *ctx,
struct io_tx_ctx *tx_ctx)
{
struct io_tx_notifier *notifier;
struct ubuf_info *uarg;

- notifier = kmalloc(sizeof(*notifier), GFP_ATOMIC);
- if (!notifier)
- return NULL;
+ if (likely(io_notifier_has_cached(ctx))) {
+ if (WARN_ON_ONCE(list_empty(&ctx->ubuf_list)))
+ return NULL;
+
+ notifier = list_first_entry(&ctx->ubuf_list,
+ struct io_tx_notifier, cache_node);
+ list_del(&notifier->cache_node);
+ } else {
+ gfp_t gfp_flags = GFP_ATOMIC|GFP_KERNEL_ACCOUNT;
+
+ notifier = kmalloc(sizeof(*notifier), gfp_flags);
+ if (!notifier)
+ return NULL;
+ uarg = &notifier->uarg;
+ uarg->ctx = ctx;
+ uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+ uarg->callback = io_uring_tx_zerocopy_callback;
+ }

WARN_ON_ONCE(!current->io_uring);
notifier->seq = tx_ctx->seq++;
notifier->tag = tx_ctx->tag;
io_set_rsrc_node(&notifier->fixed_rsrc_refs, ctx);

- uarg = &notifier->uarg;
- uarg->ctx = ctx;
- uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
- uarg->callback = io_uring_tx_zerocopy_callback;
- refcount_set(&uarg->refcnt, 1);
+ refcount_set(&notifier->uarg.refcnt, 1);
percpu_ref_get(&ctx->refs);
return notifier;
}
@@ -9732,6 +9785,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));

+ io_notifier_free_cached(ctx);
io_sqe_tx_ctx_unregister(ctx);
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
--
2.34.0


2021-11-30 15:22:35

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC 11/12] io_uring: sendzc with fixed buffers

Allow zerocopy sends to use fixed buffers.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 19 +++++++++++++++++--
include/uapi/linux/io_uring.h | 1 +
2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index bb991f4cee7b..5a0adfadf759 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5048,7 +5048,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

-#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FLUSH
+#define IO_SENDZC_VALID_FLAGS (IORING_SENDZC_FLUSH | IORING_SENDZC_FIXED_BUF)

static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
@@ -5078,6 +5078,15 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (req->msgzc.zc_flags & ~IO_SENDZC_VALID_FLAGS)
return -EINVAL;

+ if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+ idx = READ_ONCE(sqe->buf_index);
+ if (unlikely(idx >= ctx->nr_user_bufs))
+ return -EFAULT;
+ idx = array_index_nospec(idx, ctx->nr_user_bufs);
+ req->imu = READ_ONCE(ctx->user_bufs[idx]);
+ io_req_set_rsrc_node(req, ctx);
+ }
+
#ifdef CONFIG_COMPAT
if (req->ctx->compat)
sr->msg_flags |= MSG_CMSG_COMPAT;
@@ -5101,7 +5110,13 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(!sock))
return -ENOTSOCK;

- ret = import_single_range(WRITE, sr->buf, sr->len, &iov, &msg.msg_iter);
+ if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+ ret = __io_import_fixed(WRITE, &msg.msg_iter, req->imu,
+ (u64)sr->buf, sr->len);
+ } else {
+ ret = import_single_range(WRITE, sr->buf, sr->len, &iov,
+ &msg.msg_iter);
+ }
if (unlikely(ret))
return ret;

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index ac18e8e6f86f..740af1d0409f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -189,6 +189,7 @@ enum {

enum {
IORING_SENDZC_FLUSH = (1U << 0),
+ IORING_SENDZC_FIXED_BUF = (1U << 1),
};

/*
--
2.34.0


2021-12-01 03:10:58

by David Ahern

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 11/30/21 8:18 AM, Pavel Begunkov wrote:
> Early proof of concept for zerocopy send via io_uring. This is just
> an RFC, there are details yet to be figured out, but hope to gather
> some feedback.
>
> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
> or 2.44 times faster.
>
> № | test: | BW (MB/s) | speedup
> 1 | msg_zerocopy (non-zc) | 18281 | 0.38
> 2 | msg_zerocopy -z (baseline) | 47421 | 1
> 3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03
> 4 | io_uring (@flush=true, nr_reqs=1) | 89310 | 1.88
> 5 | io_uring (@flush=false, nr_reqs=8) | 116079 | 2.44
> 6 | io_uring (@flush=true, nr_reqs=8) | 109722 | 2.31
>
> Based on selftests/.../msg_zerocopy but more limited. You can use
> msg_zerocopy -r as usual for receive side.
>
...

Can you state the exact command lines you are running for all of the
commands? I tried this set (and commands referenced below) and my
mileage varies quite a bit.

Also, have you run this proposed change (and with TCP) across nodes
(ie., not just local process to local process via dummy interface)?

> Benchmark:
> https://github.com/isilence/liburing.git zc_v1
>
> or this file in particular:
> https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c
>
> To run the benchmark:
> ```
> cd <liburing_dir> && make && cd test
> # ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
> ./send-zc -4 -D 127.0.0.1 udp
> ```
>
> msg_zerocopy can be used for the server side, e.g.
> ```
> cd <linux-kernel>/tools/testing/selftests/net && make
> ./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
> ```
>




2021-12-01 14:32:46

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 11/30/21 15:18, Pavel Begunkov wrote:
> Early proof of concept for zerocopy send via io_uring. This is just
> an RFC, there are details yet to be figured out, but hope to gather
> some feedback.

For larger audience, registered buffers is an io_uring feature
where it pins in advance the userspace memory and keeps it as
an array of pages (bvec in particular).
There is extra logic to make sure the pages stay referenced when
there are inflight requests using it, so they can avoid grabbing
extra page references.

Also, as was asked, attaching a standalone .c version of the
benchmark. Requires any relatively up-to-date liburing installed.

gcc -luring -O2 -o send-zc ./send-zc.c

There are also a couple of options added:

./send-zc [options] [-r <nr_submit_batching>] [-f] udp

-r <nr_submit_batch> is @nr_reqs from the cover-letter, 8 by default,
-f sets @flush to true, false by default.


> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
> or 2.44 times faster.
>
> № | test: | BW (MB/s) | speedup
> 1 | msg_zerocopy (non-zc) | 18281 | 0.38
> 2 | msg_zerocopy -z (baseline) | 47421 | 1
> 3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03
> 4 | io_uring (@flush=true, nr_reqs=1) | 89310 | 1.88
> 5 | io_uring (@flush=false, nr_reqs=8) | 116079 | 2.44
> 6 | io_uring (@flush=true, nr_reqs=8) | 109722 | 2.31
>
> Based on selftests/.../msg_zerocopy but more limited. You can use
> msg_zerocopy -r as usual for receive side.
>
[...]

--
Pavel Begunkov


Attachments:
send-zc.c (7.93 kB)

2021-12-01 15:32:55

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 03:10, David Ahern wrote:
> On 11/30/21 8:18 AM, Pavel Begunkov wrote:
>> Early proof of concept for zerocopy send via io_uring. This is just
>> an RFC, there are details yet to be figured out, but hope to gather
>> some feedback.
>>
>> Benchmarking udp (65435 bytes) with a dummy net device (mtu=0xffff):
>> The best case io_uring=116079 MB/s vs msg_zerocopy=47421 MB/s,
>> or 2.44 times faster.
>>
>> № | test: | BW (MB/s) | speedup
>> 1 | msg_zerocopy (non-zc) | 18281 | 0.38
>> 2 | msg_zerocopy -z (baseline) | 47421 | 1
>> 3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03
>> 4 | io_uring (@flush=true, nr_reqs=1) | 89310 | 1.88
>> 5 | io_uring (@flush=false, nr_reqs=8) | 116079 | 2.44
>> 6 | io_uring (@flush=true, nr_reqs=8) | 109722 | 2.31
>>
>> Based on selftests/.../msg_zerocopy but more limited. You can use
>> msg_zerocopy -r as usual for receive side.
>>
> ...
>
> Can you state the exact command lines you are running for all of the
> commands? I tried this set (and commands referenced below) and my

Sure. First, for dummy I set mtu by hand, not sure can do it from
the userspace, can I? Without it __ip_append_data() falls into
non-zerocopy path.

diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
index f82ad7419508..5c5aeacdabd5 100644
--- a/drivers/net/dummy.c
+++ b/drivers/net/dummy.c
@@ -132,7 +132,8 @@ static void dummy_setup(struct net_device *dev)
eth_hw_addr_random(dev);

dev->min_mtu = 0;
- dev->max_mtu = 0;
+ dev->mtu = 0xffff;
+ dev->max_mtu = 0xffff;
}

# dummy configuration

modprobe dummy numdummies=1
ip link set dummy0 up
# force requests to <dummy_ip_addr> go through the dummy device
ip route add <dummy_ip_addr> dev dummy0


With dummy I was just sinking the traffic to the dummy device,
was good enough for me. Omitting "taskset" and "nice":

send-zc -4 -D <dummy_ip_addr> -t 10 udp

Similarly with msg_zerocopy:

<kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D <dummy_ip_addr> -t 10 -z udp


For loopback testing, as zerocopy is not allowed for it as Willem explained in
the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ebb12a7d386d..42df33b175ce 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
{
- if (likely(!skb_zcopy(skb)))
- return 0;
- return skb_copy_ubufs(skb, gfp_mask);
+ return skb_orphan_frags(skb, gfp_mask);
}

/**

Then running those two lines below in parallel and looking for the numbers
send shows. It was in favor of io_uring for me, but don't remember
exactly. perf shows that "send-zc" spends lot of time receiving, so
wasn't testing performance of it after some point.

msg_zerocopy -r -v -4 -t 20 udp
send-zc -4 -D 127.0.0.1 -t 10 udp


> mileage varies quite a bit.

Interesting, any brief notes on the setup and the results? Dummy
or something real? io_uring doesn't show if it was really zerocopied
or not, but I assume you checked it (e.g. with perf/bpftrace).

I expected that @flush=true might be worse with real devices,
there is one spot to be patched, but apart from that and
cycles spend in a real LLD offseting the overhead, didn't
anticipate any problems. I'll see once I try a real device.


> Also, have you run this proposed change (and with TCP) across nodes
> (ie., not just local process to local process via dummy interface)?

Not yet, I tried dummy, and localhost UDP as per above and similarly
TCP. Just need to grab a server with a proper NIC, will try it out
soon.

>> Benchmark:
>> https://github.com/isilence/liburing.git zc_v1
>>
>> or this file in particular:
>> https://github.com/isilence/liburing/blob/zc_v1/test/send-zc.c
>>
>> To run the benchmark:
>> ```
>> cd <liburing_dir> && make && cd test
>> # ./send-zc -4 [-p <port>] [-s <payload_size>] -D <destination> udp
>> ./send-zc -4 -D 127.0.0.1 udp
>> ```
>>
>> msg_zerocopy can be used for the server side, e.g.
>> ```
>> cd <linux-kernel>/tools/testing/selftests/net && make
>> ./msg_zerocopy -4 -r [-p <port>] [-t <sec>] udp
>> ```

--
Pavel Begunkov

2021-12-01 17:49:33

by David Ahern

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 7:31 AM, Pavel Begunkov wrote:
>
> Also, as was asked, attaching a standalone .c version of the
> benchmark. Requires any relatively up-to-date liburing installed.
>
attached command differs from the version mentioned in the cover letter:

https://github.com/isilence/liburing.git zc_v1

copying this version into that branch, removing the duplicate
definitions and it works.

2021-12-01 17:58:24

by David Ahern

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 8:32 AM, Pavel Begunkov wrote:
>
> Sure. First, for dummy I set mtu by hand, not sure can do it from
> the userspace, can I? Without it __ip_append_data() falls into
> non-zerocopy path.
>
> diff --git a/drivers/net/dummy.c b/drivers/net/dummy.c
> index f82ad7419508..5c5aeacdabd5 100644
> --- a/drivers/net/dummy.c
> +++ b/drivers/net/dummy.c
> @@ -132,7 +132,8 @@ static void dummy_setup(struct net_device *dev)
>      eth_hw_addr_random(dev);
>  
>      dev->min_mtu = 0;
> -    dev->max_mtu = 0;
> +    dev->mtu = 0xffff;
> +    dev->max_mtu = 0xffff;
>  }
>
> # dummy configuration
>
> modprobe dummy numdummies=1
> ip link set dummy0 up


No change is needed to the dummy driver:
ip li add dummy0 type dummy
ip li set dummy0 up mtu 65536


> # force requests to <dummy_ip_addr> go through the dummy device
> ip route add <dummy_ip_addr> dev dummy0

that command is not necessary.

>
>
> With dummy I was just sinking the traffic to the dummy device,
> was good enough for me. Omitting "taskset" and "nice":
>
> send-zc -4 -D <dummy_ip_addr> -t 10 udp
>
> Similarly with msg_zerocopy:
>
> <kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D
> <dummy_ip_addr> -t 10 -z udp

I get -ENOBUFS with '-z' and any local address.

>
>
> For loopback testing, as zerocopy is not allowed for it as Willem
> explained in
> the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index ebb12a7d386d..42df33b175ce 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff
> *skb, gfp_t gfp_mask)
>  /* Frags must be orphaned, even if refcounted, if skb might loop to rx
> path */
>  static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
>  {
> -    if (likely(!skb_zcopy(skb)))
> -        return 0;
> -    return skb_copy_ubufs(skb, gfp_mask);
> +    return skb_orphan_frags(skb, gfp_mask);
>  }
>  

that is the key change that is missing in your repo. All local traffic
(traffic to the address on a dummy device falls into this comment) goes
through loopback. That's just the way Linux works. If you look at the
dummy driver, it's xmit function just drops packets if any actually make
it there.


>> mileage varies quite a bit.
>
> Interesting, any brief notes on the setup and the results? Dummy

VM on Chromebook. I just cloned your repos, built, install and test. As
mentioned above, the skb_orphan_frags_rx change is missing from your
repo and that is the key to your reported performance gains.


2021-12-01 18:11:36

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

> # performance:
>
> The worst case for io_uring is (4), still 1.88 times faster than
> msg_zerocopy (2), and there are a couple of "easy" optimisations left
> out from the patchset. For 4096 bytes payload zc is only slightly
> outperforms non-zc version, the larger payload the wider gap.
> I'll get more numbers next time.

> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
> much. Notification posting is not a big problem for now, but need
> to compare the performance for when io_uring_tx_zerocopy_callback()
> is called from IRQ context, and possible rework it to use task_work.
>
> It supports both, regular buffers and fixed ones, but there is a bunch of
> optimisations exclusively for io_uring's fixed buffers. For comparison,
> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>
> 1) we pass a bvec, so no page table walks.
> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
> still doing page get/put (see 4/12) slashed 4-5%.
> 3) avoiding get_page/put_page in 5/12
> 4) completion events are posted into io_uring's CQ, so no
> extra recvmsg for getting events
> 5) no poll(2) in the code because of io_uring
> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
> io_uring caches the structures reducing it to nearly zero-overhead.

Nice set of complementary optimizations.

We have looked at adding some of those as independent additions to
msg_zerocopy before, such as long-term pinned regions. One issue with
that is that the pages must remain until the request completes,
regardless of whether the calling process is alive. So it cannot rely
on a pinned range held by a process only.

If feasible, it would be preferable if the optimizations can be added
to msg_zerocopy directly, rather than adding a dependency on io_uring
to make use of them. But not sure how feasible that is. For some, like
4 and 5, the answer is clearly it isn't. 6, it probably is?

> # discussion / questions
>
> I haven't got a grasp on many aspects of the net stack yet, so would
> appreciate feedback in general and there are a couple of questions
> thoughts.
>
> 1) What are initialisation rules for adding a new field into
> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
> filling all the fields.
>
> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
> __ip_append_data() (see 3/12). Ideas how to do it better?

Agreed that both of these are less than ideal.

I can't comment too much on the io_uring aspect of the patch series.
But msg_zerocopy is probably used in a small fraction of traffic (even
if a high fraction for users who care about its benefits). We have to
try to minimize the cost incurred on the general hot path.

I was going to suggest using the standard msg_zerocopy ubuf_info
alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
in the cycle profile.

It might still be possible to somehow signal to msg_zerocopy_alloc
that this is being called from within an io_uring request, and
therefore should use a pre-existing uarg with different
uarg->callback. If nothing else, some info can be passed as a cmsg.
But perhaps there is a more direct pointer path to follow from struct
sk, say? Here my limited knowledge of io_uring forces me to hand wave.

Probably also want to see how all this would integrate with TCP. In
some ways, that might be easier, as it does not have the indirection
through ip_make_skb, etc.

2021-12-01 19:11:57

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 17:57, David Ahern wrote:
> On 12/1/21 8:32 AM, Pavel Begunkov wrote:
>>
>> Sure. First, for dummy I set mtu by hand, not sure can do it from
>> the userspace, can I? Without it __ip_append_data() falls into
>> non-zerocopy path.
>>
[...]
>>
>> modprobe dummy numdummies=1
>> ip link set dummy0 up
>
>
> No change is needed to the dummy driver:
> ip li add dummy0 type dummy
> ip li set dummy0 up mtu 65536

awesome, thanks!

>> # force requests to <dummy_ip_addr> go through the dummy device
>> ip route add <dummy_ip_addr> dev dummy0
>
> that command is not necessary.
>
>>
>>
>> With dummy I was just sinking the traffic to the dummy device,
>> was good enough for me. Omitting "taskset" and "nice":
>>
>> send-zc -4 -D <dummy_ip_addr> -t 10 udp
>>
>> Similarly with msg_zerocopy:
>>
>> <kernel>/tools/testing/selftests/net/msg_zerocopy -4 -p 6666 -D
>> <dummy_ip_addr> -t 10 -z udp
>
> I get -ENOBUFS with '-z' and any local address.

Ah, right. Citing from Willem's MSG_ZEROCOPY letter:

"
Notification skbuffs are allocated from optmem. For sockets that
cannot effectively coalesce notifications, the optmem max may need
to be increased to avoid hitting -ENOBUFS:

sysctl -w net.core.optmem_max=1048576
"


>> For loopback testing, as zerocopy is not allowed for it as Willem
>> explained in
>> the original MSG_ZEROCOPY cover-letter, I used a hack to bypass it:
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index ebb12a7d386d..42df33b175ce 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -2854,9 +2854,7 @@ static inline int skb_orphan_frags(struct sk_buff
>> *skb, gfp_t gfp_mask)
>>  /* Frags must be orphaned, even if refcounted, if skb might loop to rx
>> path */
>>  static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
>>  {
>> -    if (likely(!skb_zcopy(skb)))
>> -        return 0;
>> -    return skb_copy_ubufs(skb, gfp_mask);
>> +    return skb_orphan_frags(skb, gfp_mask);
>>  }
>>
>
> that is the key change that is missing in your repo. All local traffic
> (traffic to the address on a dummy device falls into this comment) goes
> through loopback. That's just the way Linux works. If you look at the
> dummy driver, it's xmit function just drops packets if any actually make
> it there.

Not at all, the measurements were done without this patch. In case it
may shed some light, attaching a fresh flamegraph, same 115761.6 MB/s

btw, why a dummy device would ever go through loopback? It doesn't
seem to make sense, though may be missing something.


>>> mileage varies quite a bit.
>>
>> Interesting, any brief notes on the setup and the results? Dummy
>
> VM on Chromebook. I just cloned your repos, built, install and test. As
> mentioned above, the skb_orphan_frags_rx change is missing from your
> repo and that is the key to your reported performance gains.
>

--
Pavel Begunkov


Attachments:
perf.svg (115.77 kB)

2021-12-01 19:21:10

by David Ahern

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> btw, why a dummy device would ever go through loopback? It doesn't
> seem to make sense, though may be missing something.

You are sending to a local ip address, so the fib_lookup returns
RTN_LOCAL. The code makes dev_out the loopback:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773

(you are not using vrf so ignore the l3mdev reference). loopback device
has the logic to put the skb back in the stack for Rx processing:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/loopback.c#n68

2021-12-01 19:27:58

by Jonathan Lemon

[permalink] [raw]
Subject: Re: [RFC 05/12] net: optimise page get/free for bvec zc

On Tue, Nov 30, 2021 at 03:18:53PM +0000, Pavel Begunkov wrote:
> get_page() in __zerocopy_sg_from_bvec() and matching put_page()s are
> expensive. However, we can avoid it if the caller can guarantee that
> pages stay alive until the corresponding ubuf_info is not released.
> In particular, it targets io_uring with fixed buffers following the
> described contract.
>
> Assuming that nobody yet uses bvec together with zerocopy, make all
> calls with bvec iterators follow this model.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> include/linux/skbuff.h | 10 +++++++++-
> net/core/datagram.c | 9 +++++++--
> net/core/skbuff.c | 16 +++++++++++++---
> net/ipv4/ip_output.c | 4 ++++
> 4 files changed, 33 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 750b7518d6e2..ebb12a7d386d 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -461,6 +461,11 @@ enum {
> SKBFL_PURE_ZEROCOPY = BIT(2),
>
> SKBFL_DONT_ORPHAN = BIT(3),
> +
> + /* page references are managed by the ubuf_info, so it's safe to
> + * use frags only up until ubuf_info is released
> + */
> + SKBFL_MANAGED_FRAGS = BIT(4),
> };
>
> #define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
> @@ -3154,7 +3159,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
> */
> static inline void skb_frag_unref(struct sk_buff *skb, int f)
> {
> - __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
> + struct skb_shared_info *shinfo = skb_shinfo(skb);
> +
> + if (!(shinfo->flags & SKBFL_MANAGED_FRAGS))
> + __skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
> }
>
> /**
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index e00f7e0a7a0a..5cf0672039d6 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -642,7 +642,6 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
> v = mp_bvec_iter_bvec(from->bvec, bi);
> copied += v.bv_len;
> truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
> - get_page(v.bv_page);
> skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
> bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
> }
> @@ -671,9 +670,15 @@ int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
> struct iov_iter *from, size_t length)
> {
> int frag = skb_shinfo(skb)->nr_frags;
> + bool managed = skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAGS;
>
> - if (iov_iter_is_bvec(from))
> + if (iov_iter_is_bvec(from) && (managed || frag == 0)) {
> + skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAGS;
> return __zerocopy_sg_from_bvec(sk, skb, from, length);
> + }
> +
> + if (managed)
> + return -EFAULT;
>
> while (length && iov_iter_count(from)) {
> struct page *pages[MAX_SKB_FRAGS];
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b23db60ea6f9..b7b087815539 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -666,10 +666,14 @@ static void skb_release_data(struct sk_buff *skb)
> &shinfo->dataref))
> goto exit;
>
> - skb_zcopy_clear(skb, true);
> + if (!(shinfo->flags & SKBFL_MANAGED_FRAGS)) {
> + for (i = 0; i < shinfo->nr_frags; i++)
> + __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
> + } else {
> + shinfo->flags &= ~SKBFL_MANAGED_FRAGS;
> + }
>
> - for (i = 0; i < shinfo->nr_frags; i++)
> - __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
> + skb_zcopy_clear(skb, true);

It would be better if all of this logic was moved into the callback
under zcopy_clear. Note that this path is called for both TX and RX.
--
Jonathan

2021-12-01 19:59:17

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 18:10, Willem de Bruijn wrote:
>> # performance:
>>
>> The worst case for io_uring is (4), still 1.88 times faster than
>> msg_zerocopy (2), and there are a couple of "easy" optimisations left
>> out from the patchset. For 4096 bytes payload zc is only slightly
>> outperforms non-zc version, the larger payload the wider gap.
>> I'll get more numbers next time.
>
>> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
>> much. Notification posting is not a big problem for now, but need
>> to compare the performance for when io_uring_tx_zerocopy_callback()
>> is called from IRQ context, and possible rework it to use task_work.
>>
>> It supports both, regular buffers and fixed ones, but there is a bunch of
>> optimisations exclusively for io_uring's fixed buffers. For comparison,
>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>>
>> 1) we pass a bvec, so no page table walks.
>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>> still doing page get/put (see 4/12) slashed 4-5%.
>> 3) avoiding get_page/put_page in 5/12
>> 4) completion events are posted into io_uring's CQ, so no
>> extra recvmsg for getting events
>> 5) no poll(2) in the code because of io_uring
>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>> io_uring caches the structures reducing it to nearly zero-overhead.
>
> Nice set of complementary optimizations.
>
> We have looked at adding some of those as independent additions to
> msg_zerocopy before, such as long-term pinned regions. One issue with
> that is that the pages must remain until the request completes,
> regardless of whether the calling process is alive. So it cannot rely
> on a pinned range held by a process only.
>
> If feasible, it would be preferable if the optimizations can be added
> to msg_zerocopy directly, rather than adding a dependency on io_uring
> to make use of them. But not sure how feasible that is. For some, like
> 4 and 5, the answer is clearly it isn't. 6, it probably is?

And for 3), io_uring has a complex infra for keeping pages alive,
the additional overhead is one almost percpu_ref_put() per
request/notification, or even better in common cases. Not sure it's
feasible/possible with current msg_zerocopy. Also, io_uring's
ubufs are kept as a part of a larger structure, which may complicate
things.


>> # discussion / questions
>>
>> I haven't got a grasp on many aspects of the net stack yet, so would
>> appreciate feedback in general and there are a couple of questions
>> thoughts.
>>
>> 1) What are initialisation rules for adding a new field into
>> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
>> filling all the fields.
>>
>> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
>> __ip_append_data() (see 3/12). Ideas how to do it better?
>
> Agreed that both of these are less than ideal.
>
> I can't comment too much on the io_uring aspect of the patch series.
> But msg_zerocopy is probably used in a small fraction of traffic (even
> if a high fraction for users who care about its benefits). We have to
> try to minimize the cost incurred on the general hot path.

One thing, I can hide the initial ubuf check in the beginning of
__ip_append_data() under a common

if (sock_flag(sk, SOCK_ZEROCOPY)) {}

But as SOCK_ZEROCOPY is more of a design problem workaround,
tbh not sure I like from the API perspective. Thoughts? I hope
I can also shuffle some of the stuff in 5/12 out of the
hot path, need to dig a bit deeper.

> I was going to suggest using the standard msg_zerocopy ubuf_info
> alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
> in the cycle profile.
>
> It might still be possible to somehow signal to msg_zerocopy_alloc
> that this is being called from within an io_uring request, and
> therefore should use a pre-existing uarg with different
> uarg->callback. If nothing else, some info can be passed as a cmsg.
> But perhaps there is a more direct pointer path to follow from struct
> sk, say? Here my limited knowledge of io_uring forces me to hand wave.

One thing I consider important though is to be able to specify a
ubuf per request, but not somehow registering it in a socket. It's
more flexible from the userspace API perspective. It would also need
constant register/unregister, and there are concerns with
referencing/cancellations, that's where it came from in the first
place.

IOW, I'd really prefer to pass it down on a per request basis.

> Probably also want to see how all this would integrate with TCP. In
> some ways, that might be easier, as it does not have the indirection
> through ip_make_skb, etc.

Worked well in general, but patches I used should be a broken for
some input after adding 5/12, so need some work. will send next time.

--
Pavel Begunkov

2021-12-01 20:00:34

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 17:49, David Ahern wrote:
> On 12/1/21 7:31 AM, Pavel Begunkov wrote:
>>
>> Also, as was asked, attaching a standalone .c version of the
>> benchmark. Requires any relatively up-to-date liburing installed.
>>
> attached command differs from the version mentioned in the cover letter:

I guess you mean the new options mentioned in the message that
were added to that standalone script. They look useful, will
update the repo with a similar change later.


> https://github.com/isilence/liburing.git zc_v1
>
> copying this version into that branch, removing the duplicate
> definitions and it works.
>

--
Pavel Begunkov

2021-12-01 20:17:35

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 19:20, David Ahern wrote:
> On 12/1/21 12:11 PM, Pavel Begunkov wrote:
>> btw, why a dummy device would ever go through loopback? It doesn't
>> seem to make sense, though may be missing something.
>
> You are sending to a local ip address, so the fib_lookup returns
> RTN_LOCAL. The code makes dev_out the loopback:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773

I see, thanks. I still don't use the skb_orphan_frags_rx() hack
and it doesn't go through the loopback (for my dummy tests), just
dummy_xmit() and no mention of loopback in perf data, see the
flamegraph. Don't know what is the catch.

I'm illiterate of the routing paths. Can it be related to
the "ip route add"? How do you get an ipv4 address for the device?

>
> (you are not using vrf so ignore the l3mdev reference). loopback device
> has the logic to put the skb back in the stack for Rx processing:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/drivers/net/loopback.c#n68
>

--
Pavel Begunkov

2021-12-01 20:18:18

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 05/12] net: optimise page get/free for bvec zc

On 12/1/21 19:20, Jonathan Lemon wrote:
> On Tue, Nov 30, 2021 at 03:18:53PM +0000, Pavel Begunkov wrote:
>> get_page() in __zerocopy_sg_from_bvec() and matching put_page()s are
>> expensive. However, we can avoid it if the caller can guarantee that
>> pages stay alive until the corresponding ubuf_info is not released.
>> In particular, it targets io_uring with fixed buffers following the
>> described contract.
>>
>> Assuming that nobody yet uses bvec together with zerocopy, make all
>> calls with bvec iterators follow this model.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
[...]
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b23db60ea6f9..b7b087815539 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -666,10 +666,14 @@ static void skb_release_data(struct sk_buff *skb)
>> &shinfo->dataref))
>> goto exit;
>>
>> - skb_zcopy_clear(skb, true);
>> + if (!(shinfo->flags & SKBFL_MANAGED_FRAGS)) {
>> + for (i = 0; i < shinfo->nr_frags; i++)
>> + __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
>> + } else {
>> + shinfo->flags &= ~SKBFL_MANAGED_FRAGS;
>> + }
>>
>> - for (i = 0; i < shinfo->nr_frags; i++)
>> - __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);
>> + skb_zcopy_clear(skb, true);
>
> It would be better if all of this logic was moved into the callback
> under zcopy_clear. Note that this path is called for both TX and RX.

Yeah, can be done and removes the extra if from non-zc path.
Thanks!

--
Pavel Begunkov

2021-12-01 20:30:21

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 19:59, Pavel Begunkov wrote:
> On 12/1/21 18:10, Willem de Bruijn wrote:
>>> # performance:
>>>
>>> The worst case for io_uring is (4), still 1.88 times faster than
>>> msg_zerocopy (2), and there are a couple of "easy" optimisations left
>>> out from the patchset. For 4096 bytes payload zc is only slightly
>>> outperforms non-zc version, the larger payload the wider gap.
>>> I'll get more numbers next time.
>>
>>> Comparing (3) and (4), and (5) vs (6), @flush doesn't affect it too
>>> much. Notification posting is not a big problem for now, but need
>>> to compare the performance for when io_uring_tx_zerocopy_callback()
>>> is called from IRQ context, and possible rework it to use task_work.
>>>
>>> It supports both, regular buffers and fixed ones, but there is a bunch of
>>> optimisations exclusively for io_uring's fixed buffers. For comparison,
>>> normal vs fixed buffers (@nr_reqs=8, @flush=0): 75677 vs 116079 MB/s
>>>
>>> 1) we pass a bvec, so no page table walks.
>>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>>     still doing page get/put (see 4/12) slashed 4-5%.
>>> 3) avoiding get_page/put_page in 5/12
>>> 4) completion events are posted into io_uring's CQ, so no
>>>     extra recvmsg for getting events
>>> 5) no poll(2) in the code because of io_uring
>>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>     io_uring caches the structures reducing it to nearly zero-overhead.
>>
>> Nice set of complementary optimizations.
>>
>> We have looked at adding some of those as independent additions to
>> msg_zerocopy before, such as long-term pinned regions. One issue with
>> that is that the pages must remain until the request completes,
>> regardless of whether the calling process is alive. So it cannot rely
>> on a pinned range held by a process only.
>>
>> If feasible, it would be preferable if the optimizations can be added
>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>> to make use of them. But not sure how feasible that is. For some, like
>> 4 and 5, the answer is clearly it isn't.  6, it probably is?

Forgot about 6), io_uring uses the fact that submissions are
done under an per ring mutex, and completions are under a per
ring spinlock, so there are two lists for them and no extra
locking. Lists are spliced in a batched manner, so it's
1 spinlock per N (e.g. 32) cached ubuf_info's allocations.

Any similar guarantees for sockets?

[...]

--
Pavel Begunkov

2021-12-01 20:46:58

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 17:57, David Ahern wrote:
> On 12/1/21 8:32 AM, Pavel Begunkov wrote:
[...]
>>> mileage varies quite a bit.
>>
>> Interesting, any brief notes on the setup and the results? Dummy
>
> VM on Chromebook. I just cloned your repos, built, install and test. As
> mentioned above, the skb_orphan_frags_rx change is missing from your
> repo and that is the key to your reported performance gains.

Just to clear misunderstandings if any, all the numbers in the
cover-letter were measured on the same kernel and during the same
boot, and it doesn't include the skb_orphan_frags_rx() change.
All double checked by looking at the traces.

When it's routed through the loopback paths (e.g. -D 127.0.0.1),
it's indeed slow for both msg_zerocopy and send-zc, and both
"benefit" in raw throughput from the hack.

--
Pavel Begunkov

2021-12-01 21:53:04

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
> On 12/1/21 19:20, David Ahern wrote:
> > On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> > > btw, why a dummy device would ever go through loopback? It doesn't
> > > seem to make sense, though may be missing something.
> >
> > You are sending to a local ip address, so the fib_lookup returns
> > RTN_LOCAL. The code makes dev_out the loopback:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
>
> I see, thanks. I still don't use the skb_orphan_frags_rx() hack
> and it doesn't go through the loopback (for my dummy tests), just
> dummy_xmit() and no mention of loopback in perf data, see the
> flamegraph. Don't know what is the catch.
>
> I'm illiterate of the routing paths. Can it be related to
> the "ip route add"? How do you get an ipv4 address for the device?
I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
because I assumed no server is needed by using dummy. Then realized
the cover letter mentioned msg_zerocopy is used as the server.
Mentioning just in case someone hits it also.

To tx out dummy, I did:
#> ip a add 10.0.0.1/24 dev dummy0

#> ip -4 r
10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1

#> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
ip -s link show dev dummy0
2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped missed mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
140800890299 2150397 0 0 0 0

2021-12-01 22:36:00

by David Ahern

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
>
> To tx out dummy, I did:
> #> ip a add 10.0.0.1/24 dev dummy0
^^^^^^^^
>
> #> ip -4 r
> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
>
> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
^^^^^^^^^^

Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'

I read dummy_ip_addr as the address assigned to dummy0; that's an
important detail. You are sending to an address on that network, not the
address assigned to the device, in which case packets are created and
then dropped by the dummy driver - nothing actually makes it to the server.


> ip -s link show dev dummy0
> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> RX: bytes packets errors dropped missed mcast
> 0 0 0 0 0 0
> TX: bytes packets errors dropped carrier collsns
> 140800890299 2150397 0 0 0 0
>


2021-12-01 23:08:32

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On Wed, Dec 01, 2021 at 03:35:49PM -0700, David Ahern wrote:
> On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
> >
> > To tx out dummy, I did:
> > #> ip a add 10.0.0.1/24 dev dummy0
> ^^^^^^^^
> >
> > #> ip -4 r
> > 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> >
> > #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
> ^^^^^^^^^^
>
> Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'
>
> I read dummy_ip_addr as the address assigned to dummy0; that's an
> important detail. You are sending to an address on that network, not the
> address assigned to the device, in which case packets are created and
> then dropped by the dummy driver - nothing actually makes it to the server.
Right, I assumed "dropped by dummy driver" is the usual intention
for using dummy, so just in case if it was the intention for
testing tx only. You are right and it seems the intention
of this command is to have server receiving the packets.

>
>
> > ip -s link show dev dummy0
> > 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> > link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> > RX: bytes packets errors dropped missed mcast
> > 0 0 0 0 0 0
> > TX: bytes packets errors dropped carrier collsns
> > 140800890299 2150397 0 0 0 0
> >
>

2021-12-01 23:20:18

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 23:07, Martin KaFai Lau wrote:
> On Wed, Dec 01, 2021 at 03:35:49PM -0700, David Ahern wrote:
>> On 12/1/21 2:51 PM, Martin KaFai Lau wrote:
>>>
>>> To tx out dummy, I did:
>>> #> ip a add 10.0.0.1/24 dev dummy0
>> ^^^^^^^^
>>>
>>> #> ip -4 r
>>> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
>>>
>>> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
>> ^^^^^^^^^^
>>
>> Pavel's commands have: 'send-zc -4 -D <dummy_ip_addr> -t 10 udp'
>>
>> I read dummy_ip_addr as the address assigned to dummy0; that's an
>> important detail. You are sending to an address on that network, not the
>> address assigned to the device, in which case packets are created and
>> then dropped by the dummy driver - nothing actually makes it to the server.
> Right, I assumed "dropped by dummy driver" is the usual intention
> for using dummy, so just in case if it was the intention for
> testing tx only. You are right and it seems the intention
> of this command is to have server receiving the packets.

I see, it seems we found the misunderstanding:

For dummy device, I indeed was testing only tx part without
a server, in my understanding it better approximates an actual
fast NIC. I don't think dummy is even capable to pass the data,
so was thinking that it's given that there is no receive side,
and so no "msg_zerocopy -r".

For localhost testing (with the hack), there was a server
verifying data.


>>
>>> ip -s link show dev dummy0
>>> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
>>> link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
>>> RX: bytes packets errors dropped missed mcast
>>> 0 0 0 0 0 0
>>> TX: bytes packets errors dropped carrier collsns
>>> 140800890299 2150397 0 0 0 0
>>>
>>

--
Pavel Begunkov

2021-12-02 00:33:04

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

> >> # discussion / questions
> >>
> >> I haven't got a grasp on many aspects of the net stack yet, so would
> >> appreciate feedback in general and there are a couple of questions
> >> thoughts.
> >>
> >> 1) What are initialisation rules for adding a new field into
> >> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
> >> filling all the fields.
> >>
> >> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
> >> __ip_append_data() (see 3/12). Ideas how to do it better?
> >
> > Agreed that both of these are less than ideal.
> >
> > I can't comment too much on the io_uring aspect of the patch series.
> > But msg_zerocopy is probably used in a small fraction of traffic (even
> > if a high fraction for users who care about its benefits). We have to
> > try to minimize the cost incurred on the general hot path.
>
> One thing, I can hide the initial ubuf check in the beginning of
> __ip_append_data() under a common
>
> if (sock_flag(sk, SOCK_ZEROCOPY)) {}
>
> But as SOCK_ZEROCOPY is more of a design problem workaround,
> tbh not sure I like from the API perspective. Thoughts?

Agreed. io_uring does not have the legacy concerns that msg_zerocopy
had to resolve.

It is always possible to hide runtime overhead behind a static_branch,
if nothing else.

Or perhaps do pass the flag and use that:

- if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+ if (flags & MSG_ZEROCOPY && length) {
+ if (uarg) {

etc.


> I hope
> I can also shuffle some of the stuff in 5/12 out of the
> hot path, need to dig a bit deeper.
>
> > I was going to suggest using the standard msg_zerocopy ubuf_info
> > alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
> > in the cycle profile.
> >
> > It might still be possible to somehow signal to msg_zerocopy_alloc
> > that this is being called from within an io_uring request, and
> > therefore should use a pre-existing uarg with different
> > uarg->callback. If nothing else, some info can be passed as a cmsg.
> > But perhaps there is a more direct pointer path to follow from struct
> > sk, say? Here my limited knowledge of io_uring forces me to hand wave.
>
> One thing I consider important though is to be able to specify a
> ubuf per request, but not somehow registering it in a socket. It's
> more flexible from the userspace API perspective. It would also need
> constant register/unregister, and there are concerns with
> referencing/cancellations, that's where it came from in the first
> place.

What if the ubuf pool can be found from the sk, and the index in that
pool is passed as a cmsg?

2021-12-02 00:37:08

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

> >>> 1) we pass a bvec, so no page table walks.
> >>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
> >>> still doing page get/put (see 4/12) slashed 4-5%.
> >>> 3) avoiding get_page/put_page in 5/12
> >>> 4) completion events are posted into io_uring's CQ, so no
> >>> extra recvmsg for getting events
> >>> 5) no poll(2) in the code because of io_uring
> >>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
> >>> io_uring caches the structures reducing it to nearly zero-overhead.
> >>
> >> Nice set of complementary optimizations.
> >>
> >> We have looked at adding some of those as independent additions to
> >> msg_zerocopy before, such as long-term pinned regions. One issue with
> >> that is that the pages must remain until the request completes,
> >> regardless of whether the calling process is alive. So it cannot rely
> >> on a pinned range held by a process only.
> >>
> >> If feasible, it would be preferable if the optimizations can be added
> >> to msg_zerocopy directly, rather than adding a dependency on io_uring
> >> to make use of them. But not sure how feasible that is. For some, like
> >> 4 and 5, the answer is clearly it isn't. 6, it probably is?
>
> Forgot about 6), io_uring uses the fact that submissions are
> done under an per ring mutex, and completions are under a per
> ring spinlock, so there are two lists for them and no extra
> locking. Lists are spliced in a batched manner, so it's
> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>
> Any similar guarantees for sockets?

For datagrams it might matter, not sure if it would show up in a
profile. The current notification mechanism is quite a bit more
heavyweight than any form of fixed ubuf pool.

For TCP this matters less, as multiple sends are not needed and
completions are coalesced, because in order.

2021-12-02 15:48:38

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/1/21 21:51, Martin KaFai Lau wrote:
> On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
>> On 12/1/21 19:20, David Ahern wrote:
>>> On 12/1/21 12:11 PM, Pavel Begunkov wrote:
>>>> btw, why a dummy device would ever go through loopback? It doesn't
>>>> seem to make sense, though may be missing something.
>>>
>>> You are sending to a local ip address, so the fib_lookup returns
>>> RTN_LOCAL. The code makes dev_out the loopback:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
>>
>> I see, thanks. I still don't use the skb_orphan_frags_rx() hack
>> and it doesn't go through the loopback (for my dummy tests), just
>> dummy_xmit() and no mention of loopback in perf data, see the
>> flamegraph. Don't know what is the catch.
>>
>> I'm illiterate of the routing paths. Can it be related to
>> the "ip route add"? How do you get an ipv4 address for the device?
> I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
> because I assumed no server is needed by using dummy. Then realized
> the cover letter mentioned msg_zerocopy is used as the server.
> Mentioning just in case someone hits it also.
>
> To tx out dummy, I did:
> #> ip a add 10.0.0.1/24 dev dummy0

Works well for me, IOW getting the same behaviour as with my
ip route add <ip> dev dummy0

I'm curious what is the difference bw them?


> #> ip -4 r
> 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
>
> #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
> ip -s link show dev dummy0
> 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> RX: bytes packets errors dropped missed mcast
> 0 0 0 0 0 0
> TX: bytes packets errors dropped carrier collsns
> 140800890299 2150397 0 0 0 0
>

--
Pavel Begunkov

2021-12-02 16:26:19

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/2/21 00:36, Willem de Bruijn wrote:
>>>>> 1) we pass a bvec, so no page table walks.
>>>>> 2) zerocopy_sg_from_iter() is just slow, adding a bvec optimised version
>>>>> still doing page get/put (see 4/12) slashed 4-5%.
>>>>> 3) avoiding get_page/put_page in 5/12
>>>>> 4) completion events are posted into io_uring's CQ, so no
>>>>> extra recvmsg for getting events
>>>>> 5) no poll(2) in the code because of io_uring
>>>>> 6) lot of time is spent in sock_omalloc()/free allocating ubuf_info.
>>>>> io_uring caches the structures reducing it to nearly zero-overhead.
>>>>
>>>> Nice set of complementary optimizations.
>>>>
>>>> We have looked at adding some of those as independent additions to
>>>> msg_zerocopy before, such as long-term pinned regions. One issue with
>>>> that is that the pages must remain until the request completes,
>>>> regardless of whether the calling process is alive. So it cannot rely
>>>> on a pinned range held by a process only.
>>>>
>>>> If feasible, it would be preferable if the optimizations can be added
>>>> to msg_zerocopy directly, rather than adding a dependency on io_uring
>>>> to make use of them. But not sure how feasible that is. For some, like
>>>> 4 and 5, the answer is clearly it isn't. 6, it probably is?
>>
>> Forgot about 6), io_uring uses the fact that submissions are
>> done under an per ring mutex, and completions are under a per
>> ring spinlock, so there are two lists for them and no extra
>> locking. Lists are spliced in a batched manner, so it's
>> 1 spinlock per N (e.g. 32) cached ubuf_info's allocations.
>>
>> Any similar guarantees for sockets?
>
> For datagrams it might matter, not sure if it would show up in a
> profile. The current notification mechanism is quite a bit more
> heavyweight than any form of fixed ubuf pool.

Just to give an idea what I'm seeing in profiles: while testing

3 | io_uring (@flush=false, nr_reqs=1) | 96534 | 2.03

I found that removing one extra smb_mb() per request in io_uring
gave around +0.65% of t-put (quick testing). In profiles the
function where it was dropped from 0.93% to 0.09%.

From what I see, alloc+free takes 6-10% for 64KB UDP, it may be
great to have something for MSG_ZEROCOPY, but if that adds
additional locking/atomics, honestly I'd prefer to keep it separate
from io_uring's caching.

I also hope we can optimise generic paths at some point, and the
faster it gets the more such additional locking will hurt, pretty
much how it was with the block layer.

> For TCP this matters less, as multiple sends are not needed and
> completions are coalesced, because in order.
>

--
Pavel Begunkov

2021-12-02 16:45:59

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/2/21 00:32, Willem de Bruijn wrote:
>>>> # discussion / questions
>>>>
>>>> I haven't got a grasp on many aspects of the net stack yet, so would
>>>> appreciate feedback in general and there are a couple of questions
>>>> thoughts.
>>>>
>>>> 1) What are initialisation rules for adding a new field into
>>>> struct mshdr? E.g. many users (mainly LLD) hand code initialisation not
>>>> filling all the fields.
>>>>
>>>> 2) I don't like too much ubuf_info propagation from udp_sendmsg() into
>>>> __ip_append_data() (see 3/12). Ideas how to do it better?
>>>
>>> Agreed that both of these are less than ideal.
>>>
>>> I can't comment too much on the io_uring aspect of the patch series.
>>> But msg_zerocopy is probably used in a small fraction of traffic (even
>>> if a high fraction for users who care about its benefits). We have to
>>> try to minimize the cost incurred on the general hot path.
>>
>> One thing, I can hide the initial ubuf check in the beginning of
>> __ip_append_data() under a common
>>
>> if (sock_flag(sk, SOCK_ZEROCOPY)) {}
>>
>> But as SOCK_ZEROCOPY is more of a design problem workaround,
>> tbh not sure I like from the API perspective. Thoughts?
>
> Agreed. io_uring does not have the legacy concerns that msg_zerocopy
> had to resolve.
>
> It is always possible to hide runtime overhead behind a static_branch,
> if nothing else.
>
> Or perhaps do pass the flag and use that:
>
> - if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
> + if (flags & MSG_ZEROCOPY && length) {
> + if (uarg) {
>
> etc.

Good idea. Unfortunately, not going to work (SOCK_ZEROCOPY would neither)
because we pass ubuf as a parameter into the function, and e.g. we need
to NULL it if not used, but at least good for tcp_sendmsg_locked

>> I hope
>> I can also shuffle some of the stuff in 5/12 out of the
>> hot path, need to dig a bit deeper.
>>
>>> I was going to suggest using the standard msg_zerocopy ubuf_info
>>> alloc/free mechanism. But you explicitly mention seeing omalloc/ofree
>>> in the cycle profile.
>>>
>>> It might still be possible to somehow signal to msg_zerocopy_alloc
>>> that this is being called from within an io_uring request, and
>>> therefore should use a pre-existing uarg with different
>>> uarg->callback. If nothing else, some info can be passed as a cmsg.
>>> But perhaps there is a more direct pointer path to follow from struct
>>> sk, say? Here my limited knowledge of io_uring forces me to hand wave.
>>
>> One thing I consider important though is to be able to specify a
>> ubuf per request, but not somehow registering it in a socket. It's
>> more flexible from the userspace API perspective. It would also need
>> constant register/unregister, and there are concerns with
>> referencing/cancellations, that's where it came from in the first
>> place.
>
> What if the ubuf pool can be found from the sk, and the index in that
> pool is passed as a cmsg?

It looks to me that ubufs are by nature is something that is not
tightly bound to a socket (at least for io_uring API in the patchset),
it'll be pretty ugly:

1) io_uring'd need to care to register the pool in the socket. Having
multiple rings using the same socket would be horrible. It may be that
it doesn't make much sense to send in parallel from multiple rings, but
a per thread io_uring is a popular solution, and then someone would
want to pass a socket from one thread to another and we'd need to support
it.

2) And io_uring would also need to unregister it, so the pool would
store a list of sockets where it's used, and so referencing sockets
and then we need to bind it somehow to io_uring fixed files or
register all that for tracking referencing circular dependencies.

3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
but if so I don't like exposing basically io_uring's referencing through
cmsg. And it sounds io_uring would need to parse cmsg then.


A lot of nuances :) I'd really prefer to pass it on per-request basis,
it's much cleaner, but still haven't got what's up with msghdr
initialisation...

Maybe, it's better to add a flags field, which would include
"msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
and everything else that may be optional. Does it sound sane?

--
Pavel Begunkov

2021-12-02 17:40:46

by Martin KaFai Lau

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On Thu, Dec 02, 2021 at 03:48:14PM +0000, Pavel Begunkov wrote:
> On 12/1/21 21:51, Martin KaFai Lau wrote:
> > On Wed, Dec 01, 2021 at 08:15:28PM +0000, Pavel Begunkov wrote:
> > > On 12/1/21 19:20, David Ahern wrote:
> > > > On 12/1/21 12:11 PM, Pavel Begunkov wrote:
> > > > > btw, why a dummy device would ever go through loopback? It doesn't
> > > > > seem to make sense, though may be missing something.
> > > >
> > > > You are sending to a local ip address, so the fib_lookup returns
> > > > RTN_LOCAL. The code makes dev_out the loopback:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/tree/net/ipv4/route.c#n2773
> > >
> > > I see, thanks. I still don't use the skb_orphan_frags_rx() hack
> > > and it doesn't go through the loopback (for my dummy tests), just
> > > dummy_xmit() and no mention of loopback in perf data, see the
> > > flamegraph. Don't know what is the catch.
> > >
> > > I'm illiterate of the routing paths. Can it be related to
> > > the "ip route add"? How do you get an ipv4 address for the device?
> > I also bumped into the udp-connect() => ECONNREFUSED (111) error from send-zc.
> > because I assumed no server is needed by using dummy. Then realized
> > the cover letter mentioned msg_zerocopy is used as the server.
> > Mentioning just in case someone hits it also.
> >
> > To tx out dummy, I did:
> > #> ip a add 10.0.0.1/24 dev dummy0
>
> Works well for me, IOW getting the same behaviour as with my
> ip route add <ip> dev dummy0
>
> I'm curious what is the difference bw them?
No difference. It should be the same. The skb should still go out
of dummy (instead of lo) and then get drop/kfree. I think
the confusion is probably from the name "<dummy_ip_addr>" which
points to the intention that the dummy0 has this ip addr
instead of dummy having a route to this ip address.
The need for running msg_zerocopy as the server also
adds to this confusion. There should be no need for
server in dummy test. No skb can reach the server anyway.

>
>
> > #> ip -4 r
> > 10.0.0.0/24 dev dummy0 proto kernel scope link src 10.0.0.1
> >
> > #> ./send-zc -4 -D 10.0.0.(2) -t 10 udp
> > ip -s link show dev dummy0
> > 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 65535 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
> > link/ether 82:0f:e0:dc:f7:e6 brd ff:ff:ff:ff:ff:ff
> > RX: bytes packets errors dropped missed mcast
> > 0 0 0 0 0 0
> > TX: bytes packets errors dropped carrier collsns
> > 140800890299 2150397 0 0 0 0
> >
>
> --
> Pavel Begunkov

2021-12-02 21:26:25

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

> > What if the ubuf pool can be found from the sk, and the index in that
> > pool is passed as a cmsg?
>
> It looks to me that ubufs are by nature is something that is not
> tightly bound to a socket (at least for io_uring API in the patchset),
> it'll be pretty ugly:
>
> 1) io_uring'd need to care to register the pool in the socket. Having
> multiple rings using the same socket would be horrible. It may be that
> it doesn't make much sense to send in parallel from multiple rings, but
> a per thread io_uring is a popular solution, and then someone would
> want to pass a socket from one thread to another and we'd need to support
> it.
>
> 2) And io_uring would also need to unregister it, so the pool would
> store a list of sockets where it's used, and so referencing sockets
> and then we need to bind it somehow to io_uring fixed files or
> register all that for tracking referencing circular dependencies.
>
> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
> but if so I don't like exposing basically io_uring's referencing through
> cmsg. And it sounds io_uring would need to parse cmsg then.
>
>
> A lot of nuances :) I'd really prefer to pass it on per-request basis,

Ok

> it's much cleaner, but still haven't got what's up with msghdr
> initialisation...

And passing the struct through multiple layers of functions.

> Maybe, it's better to add a flags field, which would include
> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
> and everything else that may be optional. Does it sound sane?

If sendmsg takes the argument, it will just have to be initialized, I think.

Other functions are not aware of its existence so it can remain
uninitialized there.

2021-12-03 16:19:42

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On 12/2/21 21:25, Willem de Bruijn wrote:
>>> What if the ubuf pool can be found from the sk, and the index in that
>>> pool is passed as a cmsg?
>>
>> It looks to me that ubufs are by nature is something that is not
>> tightly bound to a socket (at least for io_uring API in the patchset),
>> it'll be pretty ugly:
>>
>> 1) io_uring'd need to care to register the pool in the socket. Having
>> multiple rings using the same socket would be horrible. It may be that
>> it doesn't make much sense to send in parallel from multiple rings, but
>> a per thread io_uring is a popular solution, and then someone would
>> want to pass a socket from one thread to another and we'd need to support
>> it.
>>
>> 2) And io_uring would also need to unregister it, so the pool would
>> store a list of sockets where it's used, and so referencing sockets
>> and then we need to bind it somehow to io_uring fixed files or
>> register all that for tracking referencing circular dependencies.
>>
>> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
>> but if so I don't like exposing basically io_uring's referencing through
>> cmsg. And it sounds io_uring would need to parse cmsg then.
>>
>>
>> A lot of nuances :) I'd really prefer to pass it on per-request basis,
>
> Ok
>
>> it's much cleaner, but still haven't got what's up with msghdr
>> initialisation...
>
> And passing the struct through multiple layers of functions.

If you refer to ip_make_skb(ubuf) -> __ip_append_data(ubuf), I agree
it's a bit messier, will see what can be done. If you're about
msghdr::msg_ubuf, for me it's more like passing a callback,
which sounds like a normal thing to do.


>> Maybe, it's better to add a flags field, which would include
>> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
>> and everything else that may be optional. Does it sound sane?
>
> If sendmsg takes the argument, it will just have to be initialized, I think.
>
> Other functions are not aware of its existence so it can remain
> uninitialized there.

Got it, need to double check, but looks something like 1/12 should
be as you outlined.

And if there will be multiple optional fields that have to be
initialised, we would be able to hide all the zeroing under a
single bitmask. E.g. instead of

msg->field1 = NULL;
...
msg->fieldN = NULL;

It may look like

msg->mask = 0; // HAS_FIELD1 | HAS_FIELDN;

--
Pavel Begunkov

2021-12-03 16:31:02

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC 00/12] io_uring zerocopy send

On Fri, Dec 3, 2021 at 11:19 AM Pavel Begunkov <[email protected]> wrote:
>
> On 12/2/21 21:25, Willem de Bruijn wrote:
> >>> What if the ubuf pool can be found from the sk, and the index in that
> >>> pool is passed as a cmsg?
> >>
> >> It looks to me that ubufs are by nature is something that is not
> >> tightly bound to a socket (at least for io_uring API in the patchset),
> >> it'll be pretty ugly:
> >>
> >> 1) io_uring'd need to care to register the pool in the socket. Having
> >> multiple rings using the same socket would be horrible. It may be that
> >> it doesn't make much sense to send in parallel from multiple rings, but
> >> a per thread io_uring is a popular solution, and then someone would
> >> want to pass a socket from one thread to another and we'd need to support
> >> it.
> >>
> >> 2) And io_uring would also need to unregister it, so the pool would
> >> store a list of sockets where it's used, and so referencing sockets
> >> and then we need to bind it somehow to io_uring fixed files or
> >> register all that for tracking referencing circular dependencies.
> >>
> >> 3) IIRC, we can't add a cmsg entry from the kernel, right? May be wrong,
> >> but if so I don't like exposing basically io_uring's referencing through
> >> cmsg. And it sounds io_uring would need to parse cmsg then.
> >>
> >>
> >> A lot of nuances :) I'd really prefer to pass it on per-request basis,
> >
> > Ok
> >
> >> it's much cleaner, but still haven't got what's up with msghdr
> >> initialisation...
> >
> > And passing the struct through multiple layers of functions.
>
> If you refer to ip_make_skb(ubuf) -> __ip_append_data(ubuf), I agree
> it's a bit messier, will see what can be done. If you're about
> msghdr::msg_ubuf, for me it's more like passing a callback,
> which sounds like a normal thing to do.

Thanks, I do mean the first.

Also, small nit now that it comes up again msghdr::msg_ubuf is not
plain C. I would avoid that pseudo C++ notation (in the subject line
of 3/12)
>
> >> Maybe, it's better to add a flags field, which would include
> >> "msg_control_is_user : 1" and whether msghdr includes msg_iocb, msg_ubuf,
> >> and everything else that may be optional. Does it sound sane?
> >
> > If sendmsg takes the argument, it will just have to be initialized, I think.
> >
> > Other functions are not aware of its existence so it can remain
> > uninitialized there.
>
> Got it, need to double check, but looks something like 1/12 should
> be as you outlined.
>
> And if there will be multiple optional fields that have to be
> initialised, we would be able to hide all the zeroing under a
> single bitmask. E.g. instead of
>
> msg->field1 = NULL;
> ...
> msg->fieldN = NULL;
>
> It may look like
>
> msg->mask = 0; // HAS_FIELD1 | HAS_FIELDN;

Makes sense to me. This patch series only adds one field, so you can
leave the optimization for a possible future separate patch series?