2022-07-12 21:04:19

by Pavel Begunkov

Subject: [PATCH net-next v5 00/27] io_uring zerocopy send

NOTE: Not to be picked directly. After getting necessary acks, I'll be
working out merging with Jakub and Jens.

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing is allowed but not recommended. Apart from the usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below).
These notifications are delivered into io_uring's Completion Queue. The
"buffer-free" notifications are not necessarily per request: the userspace has
control over it and should explicitly attach a number of requests to a single
notification. The series also adds some internal optimisations when used with
registered buffers, like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking UDP with an optimised version of the selftest (see [1]), which
sends a bunch of requests, waits for completions and repeats. "+ flush" column
posts one additional "buffer-free" notification per request, and just "zc"
doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc  | zc             | zc + flush
4000    | 495134  | 606420 (+22%)  | 558971 (+12%)
1500    | 551808  | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677  | 592088 (+1.2%) | 560885 (-4%)
600     | 596292  | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc  | zc             | zc + flush
8000    | 1299916 | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230 | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617 | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794 | 2381527 (+13%) | 2195295 (+4%)
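
The percentages in the tables above appear to be relative to the non-zc
column of the same row. A minimal sketch of that arithmetic follows; the
helper name rel_gain_pct() is hypothetical and not part of the series or the
benchmark.

```c
#include <assert.h>

/*
 * Relative change of `value` over `base`, in percent.
 * Hypothetical helper for reading the tables, not code from the patchset.
 */
static double rel_gain_pct(unsigned long base, unsigned long value)
{
	assert(base > 0);
	return ((double)value - (double)base) * 100.0 / (double)base;
}
```

For the first NIC row, rel_gain_pct(495134, 606420) comes out a little above
22, matching the "+22%" printed next to the zc figure.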

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting. There
is also an additional bunch of refcounting optimisations that were omitted from
the series for simplicity, as they don't change the picture drastically.
They will be sent as a follow-up, as well as flushing optimisations closing the
performance gap between the last two columns.

For TCP on localhost (with hacks enabling localhost zerocopy) and including
additional overhead for receive:

IO size | non-zc | zc
1200    | 4174   | 4148
4096    | 7597   | 11228

Using a real NIC with 1200 bytes, zc is worse than non-zc by ~5-10%. Maybe the
omitted optimisations will help somewhat. It should look better for 4000,
but I couldn't test it properly because of setup problems.

Links:

liburing (benchmark + tests):
[1] https://github.com/isilence/liburing/tree/zc_v4

kernel repo:
[2] https://github.com/isilence/linux/tree/zc_v4

RFC v1:
[3] https://lore.kernel.org/io-uring/[email protected]/

RFC v2:
https://lore.kernel.org/io-uring/[email protected]/

Net patches based:
git@github.com:isilence/linux.git zc_v4-net-base
or
https://github.com/isilence/linux/tree/zc_v4-net-base

API design overview:

The series introduces an io_uring concept of notifiers. From the userspace
perspective it's an entity to which it can bind one or more requests and then
request to flush it. Flushing a notifier makes it impossible to attach new
requests to it, and instructs the notifier to post a completion once all
requests attached to it are completed and the kernel doesn't need the buffers
anymore.

Notifiers are stored in notification slots, which should be registered as
an array in io_uring. Each slot stores only one notifier at any particular
moment. Flushing removes it from the slot and the slot automatically replaces
it with a new notifier. All operations with notifiers are done by specifying
the index of the slot it's currently in.
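
The slot and notifier lifecycle described above can be sketched as a toy
userspace model. Everything below (struct notif_model, struct slot_model and
the helpers) is hypothetical and only mirrors the description; it is not the
kernel's implementation, and the starting sequence value of 0 is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: these structs are hypothetical, not kernel definitions. */
struct notif_model {
	uint32_t seq;    /* position of this notifier within its slot */
	unsigned refs;   /* attached requests that have not completed yet */
	bool flushed;    /* once set, no new requests may attach */
};

struct slot_model {
	uint64_t tag;            /* copied to cqe::user_data on completion */
	struct notif_model cur;  /* the one notifier the slot holds right now */
};

static void slot_init(struct slot_model *s, uint64_t tag)
{
	s->tag = tag;
	s->cur = (struct notif_model){ .seq = 0 };  /* assumed starting value */
}

/* A request binds to whatever notifier currently sits in the slot. */
static void slot_attach(struct slot_model *s)
{
	assert(!s->cur.flushed);
	s->cur.refs++;
}

/* Detach the current notifier and let the slot install a fresh one. */
static struct notif_model slot_flush(struct slot_model *s)
{
	struct notif_model old = s->cur;

	old.flushed = true;
	s->cur = (struct notif_model){ .seq = old.seq + 1 };
	return old;
}

/* One attached request completed and its buffer is no longer needed. */
static void notif_request_done(struct notif_model *n)
{
	assert(n->refs > 0);
	n->refs--;
}

/* The notification is posted only after a flush and the last completion. */
static bool notif_ready(const struct notif_model *n)
{
	return n->flushed && n->refs == 0;
}
```

In this model a flushed notifier with two attached requests becomes ready
only after notif_request_done() has been called twice, while the slot is
immediately usable again with a new notifier.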

When registering notification slots the userspace specifies a u64 tag for each
slot, which will be copied into notification completion entries as
cqe::user_data. cqe::res is 0 and cqe::flags is equal to a wrap-around u32
sequence number counting the notifiers of a slot.
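
A minimal sketch of how userspace might recognise a notification completion
and compare sequence numbers across the u32 wrap. struct cqe_model is a
stand-in for the three struct io_uring_cqe fields mentioned above, and both
helpers are hypothetical. It assumes the slot tag never collides with the
user_data of ordinary requests, and relies on the usual two's complement
conversion to int32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the struct io_uring_cqe fields used here (illustration only). */
struct cqe_model {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/*
 * True if the CQE looks like a "buffer-free" notification of a slot
 * registered with `slot_tag`: the tag is echoed back and res is 0.
 */
static bool cqe_is_notif(const struct cqe_model *cqe, uint64_t slot_tag)
{
	assert(cqe);
	return cqe->user_data == slot_tag && cqe->res == 0;
}

/* Wrap-safe "a was issued before b" for u32 sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}
```

With seq_before(), a notifier numbered 0xffffffff still orders before the one
numbered 0 that follows it in the same slot.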

Changelog:

v4 -> v5
remove ubuf_info checks from custom iov_iter callbacks to
avoid disabling the page refs optimisations for TCP

v3 -> v4
custom iov_iter handling

RFC v2 -> v3:
mem accounting for non-registered buffers
allow mixing registered and normal requests per notifier
notification flushing via IORING_OP_RSRC_UPDATE
TCP support
fix buffer indexing
fix io-wq ->uring_lock locking
fix bugs when mixing with MSG_ZEROCOPY
fix managed refs bugs in skbuff.c

RFC -> RFC v2:
remove additional overhead for non-zc from skb_release_data()
avoid msg propagation, hide extra bits of non-zc overhead
task_work based "buffer free" notifications
improve io_uring's notification refcounting
added 5/19, (no pfmemalloc tracking)
added 8/19 and 9/19 preventing small copies with zc
misc small changes

David Ahern (1):
net: Allow custom iter handler in msghdr

Pavel Begunkov (26):
ipv4: avoid partial copy for zc
ipv6: avoid partial copy for zc
skbuff: don't mix ubuf_info from different sources
skbuff: add SKBFL_DONT_ORPHAN flag
skbuff: carry external ubuf_info in msghdr
net: introduce managed frags infrastructure
net: introduce __skb_fill_page_desc_noacc
ipv4/udp: support externally provided ubufs
ipv6/udp: support externally provided ubufs
tcp: support externally provided ubufs
io_uring: initialise msghdr::msg_ubuf
io_uring: export io_put_task()
io_uring: add zc notification infrastructure
io_uring: cache struct io_notif
io_uring: complete notifiers in tw
io_uring: add rsrc referencing for notifiers
io_uring: add notification slot registration
io_uring: wire send zc request type
io_uring: account locked pages for non-fixed zc
io_uring: allow to pass addr into sendzc
io_uring: sendzc with fixed buffers
io_uring: flush notifiers after sendzc
io_uring: rename IORING_OP_FILES_UPDATE
io_uring: add zc notification flush requests
io_uring: enable managed frags with register buffers
selftests/io_uring: test zerocopy send

include/linux/io_uring_types.h | 37 ++
include/linux/skbuff.h | 66 +-
include/linux/socket.h | 5 +
include/uapi/linux/io_uring.h | 45 +-
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 42 +-
io_uring/io_uring.h | 22 +
io_uring/net.c | 187 ++++++
io_uring/net.h | 4 +
io_uring/notif.c | 215 +++++++
io_uring/notif.h | 87 +++
io_uring/opdef.c | 24 +-
io_uring/rsrc.c | 55 +-
io_uring/rsrc.h | 16 +-
io_uring/tctx.h | 26 -
net/compat.c | 1 +
net/core/datagram.c | 14 +-
net/core/skbuff.c | 37 +-
net/ipv4/ip_output.c | 50 +-
net/ipv4/tcp.c | 32 +-
net/ipv6/ip6_output.c | 49 +-
net/socket.c | 3 +
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
25 files changed, 1628 insertions(+), 128 deletions(-)
create mode 100644 io_uring/notif.c
create mode 100644 io_uring/notif.h
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

--
2.37.0


2022-07-12 21:05:05

by Pavel Begunkov

Subject: [PATCH net-next v5 10/27] ipv6/udp: support externally provided ubufs

Teach ipv6/udp how to use external ubuf_info provided in msghdr and
also prepare it for managed frags by sprinkling
skb_zcopy_downgrade_managed() when it could mix managed and not managed
frags.
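
The decision made for an externally provided ubuf_info can be condensed into
the following sketch. It models only the msg->msg_ubuf branch of the hunk
below; struct zc_inputs, enum zc_path and pick_path_external() are
hypothetical names for illustration, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum zc_path { ZC_PATH_COPY, ZC_PATH_ZEROCOPY };

/* Condensed, hypothetical view of the inputs the branch looks at. */
struct zc_inputs {
	const void *msg_ubuf;  /* external ubuf_info from msghdr */
	const void *skb_ubuf;  /* ubuf_info already on the skb, or NULL */
	bool dev_sg;           /* device advertises NETIF_F_SG */
	bool csum_partial;     /* csummode == CHECKSUM_PARTIAL */
};

/*
 * Returns 0 and sets *path, or -EINVAL if the skb already carries a
 * different ubuf_info. Falling back to copy corresponds to leaving
 * uarg NULL in the patch.
 */
static int pick_path_external(const struct zc_inputs *in, enum zc_path *path)
{
	assert(in && path && in->msg_ubuf);

	if (in->skb_ubuf && in->skb_ubuf != in->msg_ubuf)
		return -EINVAL;

	*path = (in->dev_sg && in->csum_partial) ? ZC_PATH_ZEROCOPY
						 : ZC_PATH_COPY;
	return 0;
}
```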

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv6/ip6_output.c | 44 ++++++++++++++++++++++++++++++-------------
1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index fc74ce3ed8cc..897ca4f9b791 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1542,18 +1542,35 @@ static int __ip6_append_data(struct sock *sk,
rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
csummode = CHECKSUM_PARTIAL;

- if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
- uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
- if (!uarg)
- return -ENOBUFS;
- extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
- if (rt->dst.dev->features & NETIF_F_SG &&
- csummode == CHECKSUM_PARTIAL) {
- paged = true;
- zc = true;
- } else {
- uarg->zerocopy = 0;
- skb_zcopy_set(skb, uarg, &extra_uref);
+ if ((flags & MSG_ZEROCOPY) && length) {
+ struct msghdr *msg = from;
+
+ if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+ if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+ return -EINVAL;
+
+ /* Leave uarg NULL if can't zerocopy, callers should
+ * be able to handle it.
+ */
+ if ((rt->dst.dev->features & NETIF_F_SG) &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ uarg = msg->msg_ubuf;
+ }
+ } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+ uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+ if (!uarg)
+ return -ENOBUFS;
+ extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
+ if (rt->dst.dev->features & NETIF_F_SG &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ } else {
+ uarg->zerocopy = 0;
+ skb_zcopy_set(skb, uarg, &extra_uref);
+ }
}
}

@@ -1747,13 +1764,14 @@ static int __ip6_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
- } else if (!uarg || !uarg->zerocopy) {
+ } else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;

err = -ENOMEM;
if (!sk_page_frag_refill(sk, pfrag))
goto error;

+ skb_zcopy_downgrade_managed(skb);
if (!skb_can_coalesce(skb, i, pfrag->page,
pfrag->offset)) {
err = -EMSGSIZE;
--
2.37.0

2022-07-12 21:05:07

by Pavel Begunkov

Subject: [PATCH net-next v5 22/27] io_uring: sendzc with fixed buffers

Allow zerocopy sends to use fixed buffers. There is an optimisation for
this case: the network layer doesn't need to reference the pages (see
SKBFL_MANAGED_FRAG_REFS), so io_uring has to ensure validity of fixed
buffers until the notifier is released.
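
The prep-time checks added by this patch can be summarised in a small
userspace model. The MODEL_* constants mirror the bit positions of
IORING_RECVSEND_POLL_FIRST and IORING_RECVSEND_FIXED_BUF from the uapi hunk
below; sendzc_prep_check() itself is a hypothetical helper and leaves out the
rsrc node handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_POLL_FIRST (1U << 0)  /* mirrors IORING_RECVSEND_POLL_FIRST */
#define MODEL_FIXED_BUF  (1U << 2)  /* mirrors IORING_RECVSEND_FIXED_BUF */

static_assert((MODEL_POLL_FIRST & MODEL_FIXED_BUF) == 0,
	      "flag bits must not overlap");

/*
 * Model of the flag and buffer index validation: unknown flags are
 * rejected with -EINVAL, an out-of-range registered buffer index
 * with -EFAULT.
 */
static int sendzc_prep_check(uint32_t flags, uint32_t buf_index,
			     uint32_t nr_user_bufs)
{
	if (flags & ~(MODEL_POLL_FIRST | MODEL_FIXED_BUF))
		return -EINVAL;
	if ((flags & MODEL_FIXED_BUF) && buf_index >= nr_user_bufs)
		return -EFAULT;
	return 0;
}
```

Note that the buffer index is only looked at when the fixed-buffer flag is
set, so a request without it passes regardless of buf_index.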

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 6 +++++-
io_uring/net.c | 29 ++++++++++++++++++++++++-----
2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 9303bf5236f7..3f2305bc5c79 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -269,9 +269,13 @@ enum io_uring_op {
* IORING_RECV_MULTISHOT Multishot recv. Sets IORING_CQE_F_MORE if
* the handler will continue to report
* CQEs on behalf of the same SQE.
+ *
+ * IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
+ * the buf_index field.
*/
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
-#define IORING_RECV_MULTISHOT (1U << 1)
+#define IORING_RECV_MULTISHOT (1U << 1)
+#define IORING_RECVSEND_FIXED_BUF (1U << 2)

/*
* accept flags stored in sqe->ioprio
diff --git a/io_uring/net.c b/io_uring/net.c
index 2172cf3facd8..0259fbbad591 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -14,6 +14,7 @@
#include "kbuf.h"
#include "net.h"
#include "notif.h"
+#include "rsrc.h"

#if defined(CONFIG_NET)
struct io_shutdown {
@@ -667,13 +668,23 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_sendzc *zc = io_kiocb_to_cmd(req);
+ struct io_ring_ctx *ctx = req->ctx;

if (READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3))
return -EINVAL;

zc->flags = READ_ONCE(sqe->ioprio);
- if (zc->flags & ~IORING_RECVSEND_POLL_FIRST)
+ if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF))
return -EINVAL;
+ if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+ unsigned idx = READ_ONCE(sqe->buf_index);
+
+ if (unlikely(idx >= ctx->nr_user_bufs))
+ return -EFAULT;
+ idx = array_index_nospec(idx, ctx->nr_user_bufs);
+ req->imu = READ_ONCE(ctx->user_bufs[idx]);
+ io_req_set_rsrc_node(req, ctx, 0);
+ }

zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
zc->len = READ_ONCE(sqe->len);
@@ -727,10 +738,18 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_controllen = 0;
msg.msg_namelen = 0;

- ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
- if (unlikely(ret))
- return ret;
- mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+ if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
+ ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
+ (u64)zc->buf, zc->len);
+ if (unlikely(ret))
+ return ret;
+ } else {
+ ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
+ &msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+ mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+ }

if (zc->addr) {
ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
--
2.37.0

2022-07-12 21:05:14

by Pavel Begunkov

Subject: [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send

Add selftests for io_uring zerocopy sends and io_uring's notification
infrastructure. It's largely influenced by msg_zerocopy and uses it on
the receive side.

Signed-off-by: Pavel Begunkov <[email protected]>
---
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
3 files changed, 737 insertions(+)
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 7ea54af55490..51261483744e 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -59,6 +59,7 @@ TEST_GEN_FILES += toeplitz
TEST_GEN_FILES += cmsg_sender
TEST_GEN_FILES += stress_reuseport_listen
TEST_PROGS += test_vxlan_vnifiltering.sh
+TEST_GEN_FILES += io_uring_zerocopy_tx

TEST_FILES := settings

diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
new file mode 100644
index 000000000000..9d64c560a2d6
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
@@ -0,0 +1,605 @@
+/* SPDX-License-Identifier: MIT */
+/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/io_uring.h>
+#include <linux/ipv6.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+
+#define NOTIF_TAG 0xfffffffULL
+#define NONZC_TAG 0
+#define ZC_TAG 1
+
+enum {
+ MODE_NONZC = 0,
+ MODE_ZC = 1,
+ MODE_ZC_FIXED = 2,
+ MODE_MIXED = 3,
+};
+
+static bool cfg_flush = false;
+static bool cfg_cork = false;
+static int cfg_mode = MODE_ZC_FIXED;
+static int cfg_nr_reqs = 8;
+static int cfg_family = PF_UNSPEC;
+static int cfg_payload_len;
+static int cfg_port = 8000;
+static int cfg_runtime_ms = 4200;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_dst_addr;
+
+static char payload[IP_MAXPACKET] __attribute__((aligned(4096)));
+
+struct io_sq_ring {
+ unsigned *head;
+ unsigned *tail;
+ unsigned *ring_mask;
+ unsigned *ring_entries;
+ unsigned *flags;
+ unsigned *array;
+};
+
+struct io_cq_ring {
+ unsigned *head;
+ unsigned *tail;
+ unsigned *ring_mask;
+ unsigned *ring_entries;
+ struct io_uring_cqe *cqes;
+};
+
+struct io_uring_sq {
+ unsigned *khead;
+ unsigned *ktail;
+ unsigned *kring_mask;
+ unsigned *kring_entries;
+ unsigned *kflags;
+ unsigned *kdropped;
+ unsigned *array;
+ struct io_uring_sqe *sqes;
+
+ unsigned sqe_head;
+ unsigned sqe_tail;
+
+ size_t ring_sz;
+};
+
+struct io_uring_cq {
+ unsigned *khead;
+ unsigned *ktail;
+ unsigned *kring_mask;
+ unsigned *kring_entries;
+ unsigned *koverflow;
+ struct io_uring_cqe *cqes;
+
+ size_t ring_sz;
+};
+
+struct io_uring {
+ struct io_uring_sq sq;
+ struct io_uring_cq cq;
+ int ring_fd;
+};
+
+#ifdef __alpha__
+# ifndef __NR_io_uring_setup
+# define __NR_io_uring_setup 535
+# endif
+# ifndef __NR_io_uring_enter
+# define __NR_io_uring_enter 536
+# endif
+# ifndef __NR_io_uring_register
+# define __NR_io_uring_register 537
+# endif
+#else /* !__alpha__ */
+# ifndef __NR_io_uring_setup
+# define __NR_io_uring_setup 425
+# endif
+# ifndef __NR_io_uring_enter
+# define __NR_io_uring_enter 426
+# endif
+# ifndef __NR_io_uring_register
+# define __NR_io_uring_register 427
+# endif
+#endif
+
+#if defined(__x86_64) || defined(__i386__)
+#define read_barrier() __asm__ __volatile__("":::"memory")
+#define write_barrier() __asm__ __volatile__("":::"memory")
+#else
+
+#define read_barrier() __sync_synchronize()
+#define write_barrier() __sync_synchronize()
+#endif
+
+static int io_uring_setup(unsigned int entries, struct io_uring_params *p)
+{
+ return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_enter(int fd, unsigned int to_submit,
+ unsigned int min_complete,
+ unsigned int flags, sigset_t *sig)
+{
+ return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
+ flags, sig, _NSIG / 8);
+}
+
+static int io_uring_register_buffers(struct io_uring *ring,
+ const struct iovec *iovecs,
+ unsigned nr_iovecs)
+{
+ int ret;
+
+ ret = syscall(__NR_io_uring_register, ring->ring_fd,
+ IORING_REGISTER_BUFFERS, iovecs, nr_iovecs);
+ return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_register_notifications(struct io_uring *ring,
+ unsigned nr,
+ struct io_uring_notification_slot *slots)
+{
+ int ret;
+ struct io_uring_notification_register r = {
+ .nr_slots = nr,
+ .data = (unsigned long)slots,
+ };
+
+ ret = syscall(__NR_io_uring_register, ring->ring_fd,
+ IORING_REGISTER_NOTIFIERS, &r, sizeof(r));
+ return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_mmap(int fd, struct io_uring_params *p,
+ struct io_uring_sq *sq, struct io_uring_cq *cq)
+{
+ size_t size;
+ void *ptr;
+ int ret;
+
+ sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
+ ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
+ if (ptr == MAP_FAILED)
+ return -errno;
+ sq->khead = ptr + p->sq_off.head;
+ sq->ktail = ptr + p->sq_off.tail;
+ sq->kring_mask = ptr + p->sq_off.ring_mask;
+ sq->kring_entries = ptr + p->sq_off.ring_entries;
+ sq->kflags = ptr + p->sq_off.flags;
+ sq->kdropped = ptr + p->sq_off.dropped;
+ sq->array = ptr + p->sq_off.array;
+
+ size = p->sq_entries * sizeof(struct io_uring_sqe);
+ sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);
+ if (sq->sqes == MAP_FAILED) {
+ ret = -errno;
+err:
+ munmap(sq->khead, sq->ring_sz);
+ return ret;
+ }
+
+ cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
+ ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
+ if (ptr == MAP_FAILED) {
+ ret = -errno;
+ munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
+ goto err;
+ }
+ cq->khead = ptr + p->cq_off.head;
+ cq->ktail = ptr + p->cq_off.tail;
+ cq->kring_mask = ptr + p->cq_off.ring_mask;
+ cq->kring_entries = ptr + p->cq_off.ring_entries;
+ cq->koverflow = ptr + p->cq_off.overflow;
+ cq->cqes = ptr + p->cq_off.cqes;
+ return 0;
+}
+
+static int io_uring_queue_init(unsigned entries, struct io_uring *ring,
+ unsigned flags)
+{
+ struct io_uring_params p;
+ int fd, ret;
+
+ memset(ring, 0, sizeof(*ring));
+ memset(&p, 0, sizeof(p));
+ p.flags = flags;
+
+ fd = io_uring_setup(entries, &p);
+ if (fd < 0)
+ return fd;
+ ret = io_uring_mmap(fd, &p, &ring->sq, &ring->cq);
+ if (!ret)
+ ring->ring_fd = fd;
+ else
+ close(fd);
+ return ret;
+}
+
+static int io_uring_submit(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+ const unsigned mask = *sq->kring_mask;
+ unsigned ktail, submitted, to_submit;
+ int ret;
+
+ read_barrier();
+ if (*sq->khead != *sq->ktail) {
+ submitted = *sq->kring_entries;
+ goto submit;
+ }
+ if (sq->sqe_head == sq->sqe_tail)
+ return 0;
+
+ ktail = *sq->ktail;
+ to_submit = sq->sqe_tail - sq->sqe_head;
+ for (submitted = 0; submitted < to_submit; submitted++) {
+ read_barrier();
+ sq->array[ktail++ & mask] = sq->sqe_head++ & mask;
+ }
+ if (!submitted)
+ return 0;
+
+ if (*sq->ktail != ktail) {
+ write_barrier();
+ *sq->ktail = ktail;
+ write_barrier();
+ }
+submit:
+ ret = io_uring_enter(ring->ring_fd, submitted, 0,
+ IORING_ENTER_GETEVENTS, NULL);
+ return ret < 0 ? -errno : ret;
+}
+
+static inline void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags)
+{
+ memset(sqe, 0, sizeof(*sqe));
+ sqe->opcode = (__u8) IORING_OP_SEND;
+ sqe->fd = sockfd;
+ sqe->addr = (unsigned long) buf;
+ sqe->len = len;
+ sqe->msg_flags = (__u32) flags;
+}
+
+static inline void io_uring_prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags,
+ unsigned slot_idx, unsigned zc_flags)
+{
+ io_uring_prep_send(sqe, sockfd, buf, len, flags);
+ sqe->opcode = (__u8) IORING_OP_SENDZC_NOTIF;
+ sqe->notification_idx = slot_idx;
+ sqe->ioprio = zc_flags;
+}
+
+static struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+
+ if (sq->sqe_tail + 1 - sq->sqe_head > *sq->kring_entries)
+ return NULL;
+ return &sq->sqes[sq->sqe_tail++ & *sq->kring_mask];
+}
+
+static int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr)
+{
+ struct io_uring_cq *cq = &ring->cq;
+ const unsigned mask = *cq->kring_mask;
+ unsigned head = *cq->khead;
+ int ret;
+
+ *cqe_ptr = NULL;
+ do {
+ read_barrier();
+ if (head != *cq->ktail) {
+ *cqe_ptr = &cq->cqes[head & mask];
+ break;
+ }
+ ret = io_uring_enter(ring->ring_fd, 0, 1,
+ IORING_ENTER_GETEVENTS, NULL);
+ if (ret < 0)
+ return -errno;
+ } while (1);
+
+ return 0;
+}
+
+static inline void io_uring_cqe_seen(struct io_uring *ring)
+{
+ *(&ring->cq)->khead += 1;
+ write_barrier();
+}
+
+static unsigned long gettimeofday_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_setsockopt(int fd, int level, int optname, int val)
+{
+ if (setsockopt(fd, level, optname, &val, sizeof(val)))
+ error(1, errno, "setsockopt %d.%d: %d", level, optname, val);
+}
+
+static int do_setup_tx(int domain, int type, int protocol)
+{
+ int fd;
+
+ fd = socket(domain, type, protocol);
+ if (fd == -1)
+ error(1, errno, "socket t");
+
+ do_setsockopt(fd, SOL_SOCKET, SO_SNDBUF, 1 << 21);
+
+ if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
+ error(1, errno, "connect");
+ return fd;
+}
+
+static void do_tx(int domain, int type, int protocol)
+{
+ struct io_uring_notification_slot b[1] = {{.tag = NOTIF_TAG}};
+ struct io_uring_sqe *sqe;
+ struct io_uring_cqe *cqe;
+ unsigned long packets = 0, bytes = 0;
+ struct io_uring ring;
+ struct iovec iov;
+ uint64_t tstop;
+ int i, fd, ret;
+ int compl_cqes = 0;
+
+ fd = do_setup_tx(domain, type, protocol);
+
+ ret = io_uring_queue_init(512, &ring, 0);
+ if (ret)
+ error(1, ret, "io_uring: queue init");
+
+ ret = io_uring_register_notifications(&ring, 1, b);
+ if (ret)
+ error(1, ret, "io_uring: tx ctx registration");
+
+ iov.iov_base = payload;
+ iov.iov_len = cfg_payload_len;
+
+ ret = io_uring_register_buffers(&ring, &iov, 1);
+ if (ret)
+ error(1, ret, "io_uring: buffer registration");
+
+ tstop = gettimeofday_ms() + cfg_runtime_ms;
+ do {
+ if (cfg_cork)
+ do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 1);
+
+ for (i = 0; i < cfg_nr_reqs; i++) {
+ unsigned zc_flags = 0;
+ unsigned buf_idx = 0;
+ unsigned slot_idx = 0;
+ unsigned mode = cfg_mode;
+ unsigned msg_flags = 0;
+
+ if (cfg_mode == MODE_MIXED)
+ mode = rand() % 3;
+
+ sqe = io_uring_get_sqe(&ring);
+
+ if (mode == MODE_NONZC) {
+ io_uring_prep_send(sqe, fd, payload,
+ cfg_payload_len, msg_flags);
+ sqe->user_data = NONZC_TAG;
+ } else {
+ if (cfg_flush) {
+ zc_flags |= IORING_RECVSEND_NOTIF_FLUSH;
+ compl_cqes++;
+ }
+ io_uring_prep_sendzc(sqe, fd, payload,
+ cfg_payload_len,
+ msg_flags, slot_idx, zc_flags);
+ if (mode == MODE_ZC_FIXED) {
+ sqe->ioprio |= IORING_RECVSEND_FIXED_BUF;
+ sqe->buf_index = buf_idx;
+ }
+ sqe->user_data = ZC_TAG;
+ }
+ }
+
+ ret = io_uring_submit(&ring);
+ if (ret != cfg_nr_reqs)
+ error(1, ret, "submit");
+
+ for (i = 0; i < cfg_nr_reqs; i++) {
+ ret = io_uring_wait_cqe(&ring, &cqe);
+ if (ret)
+ error(1, ret, "wait cqe");
+
+ if (cqe->user_data == NOTIF_TAG) {
+ compl_cqes--;
+ i--;
+ } else if (cqe->user_data != NONZC_TAG &&
+ cqe->user_data != ZC_TAG) {
+ error(1, cqe->res, "invalid user_data");
+ } else if (cqe->res <= 0 && cqe->res != -EAGAIN) {
+ error(1, cqe->res, "send failed");
+ } else {
+ if (cqe->res > 0) {
+ packets++;
+ bytes += cqe->res;
+ }
+ /* failed requests don't flush */
+ if (cfg_flush &&
+ cqe->res <= 0 &&
+ cqe->user_data == ZC_TAG)
+ compl_cqes--;
+ }
+ io_uring_cqe_seen(&ring);
+ }
+ if (cfg_cork)
+ do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
+ } while (gettimeofday_ms() < tstop);
+
+ if (close(fd))
+ error(1, errno, "close");
+
+ fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n",
+ packets, bytes >> 20,
+ packets / (cfg_runtime_ms / 1000),
+ (bytes >> 20) / (cfg_runtime_ms / 1000));
+
+ while (compl_cqes) {
+ ret = io_uring_wait_cqe(&ring, &cqe);
+ if (ret)
+ error(1, ret, "wait cqe");
+ io_uring_cqe_seen(&ring);
+ compl_cqes--;
+ }
+}
+
+static void do_test(int domain, int type, int protocol)
+{
+ int i;
+
+ for (i = 0; i < IP_MAXPACKET; i++)
+ payload[i] = 'a' + (i % 26);
+ do_tx(domain, type, protocol);
+}
+
+static void usage(const char *filepath)
+{
+ error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
+ "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+ const int max_payload_len = sizeof(payload) -
+ sizeof(struct ipv6hdr) -
+ sizeof(struct tcphdr) -
+ 40 /* max tcp options */;
+ struct sockaddr_in6 *addr6 = (void *) &cfg_dst_addr;
+ struct sockaddr_in *addr4 = (void *) &cfg_dst_addr;
+ char *daddr = NULL;
+ int c;
+
+ if (argc <= 1)
+ usage(argv[0]);
+ cfg_payload_len = max_payload_len;
+
+ while ((c = getopt(argc, argv, "46D:p:s:t:n:fc:m:")) != -1) {
+ switch (c) {
+ case '4':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET;
+ cfg_alen = sizeof(struct sockaddr_in);
+ break;
+ case '6':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET6;
+ cfg_alen = sizeof(struct sockaddr_in6);
+ break;
+ case 'D':
+ daddr = optarg;
+ break;
+ case 'p':
+ cfg_port = strtoul(optarg, NULL, 0);
+ break;
+ case 's':
+ cfg_payload_len = strtoul(optarg, NULL, 0);
+ break;
+ case 't':
+ cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
+ break;
+ case 'n':
+ cfg_nr_reqs = strtoul(optarg, NULL, 0);
+ break;
+ case 'f':
+ cfg_flush = 1;
+ break;
+ case 'c':
+ cfg_cork = strtol(optarg, NULL, 0);
+ break;
+ case 'm':
+ cfg_mode = strtol(optarg, NULL, 0);
+ break;
+ }
+ }
+
+ switch (cfg_family) {
+ case PF_INET:
+ memset(addr4, 0, sizeof(*addr4));
+ addr4->sin_family = AF_INET;
+ addr4->sin_port = htons(cfg_port);
+ if (daddr &&
+ inet_pton(AF_INET, daddr, &(addr4->sin_addr)) != 1)
+ error(1, 0, "ipv4 parse error: %s", daddr);
+ break;
+ case PF_INET6:
+ memset(addr6, 0, sizeof(*addr6));
+ addr6->sin6_family = AF_INET6;
+ addr6->sin6_port = htons(cfg_port);
+ if (daddr &&
+ inet_pton(AF_INET6, daddr, &(addr6->sin6_addr)) != 1)
+ error(1, 0, "ipv6 parse error: %s", daddr);
+ break;
+ default:
+ error(1, 0, "illegal domain");
+ }
+
+ if (cfg_payload_len > max_payload_len)
+ error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
+ if (cfg_mode == MODE_NONZC && cfg_flush)
+ error(1, 0, "-f: only zerocopy modes support notifications");
+ if (optind != argc - 1)
+ usage(argv[0]);
+}
+
+int main(int argc, char **argv)
+{
+ const char *cfg_test = argv[argc - 1];
+
+ parse_opts(argc, argv);
+
+ if (!strcmp(cfg_test, "tcp"))
+ do_test(cfg_family, SOCK_STREAM, 0);
+ else if (!strcmp(cfg_test, "udp"))
+ do_test(cfg_family, SOCK_DGRAM, 0);
+ else
+ error(1, 0, "unknown cfg_test %s", cfg_test);
+ return 0;
+}
diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
new file mode 100755
index 000000000000..6a65e4437640
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+#
+# Send data between two processes across namespaces
+# Run twice: once without and once with zerocopy
+
+set -e
+
+readonly DEV="veth0"
+readonly DEV_MTU=65535
+readonly BIN_TX="./io_uring_zerocopy_tx"
+readonly BIN_RX="./msg_zerocopy"
+
+readonly RAND="$(mktemp -u XXXXXX)"
+readonly NSPREFIX="ns-${RAND}"
+readonly NS1="${NSPREFIX}1"
+readonly NS2="${NSPREFIX}2"
+
+readonly SADDR4='192.168.1.1'
+readonly DADDR4='192.168.1.2'
+readonly SADDR6='fd::1'
+readonly DADDR6='fd::2'
+
+readonly path_sysctl_mem="net.core.optmem_max"
+
+# No arguments: automated test
+if [[ "$#" -eq "0" ]]; then
+ IPs=( "4" "6" )
+ protocols=( "tcp" "udp" )
+
+ for IP in "${IPs[@]}"; do
+ for proto in "${protocols[@]}"; do
+ for mode in $(seq 1 3); do
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -f
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -c -f
+ done
+ done
+ done
+
+ echo "OK. All tests passed"
+ exit 0
+fi
+
+# Argument parsing
+if [[ "$#" -lt "2" ]]; then
+ echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
+ exit 1
+fi
+
+readonly IP="$1"
+shift
+readonly TXMODE="$1"
+shift
+readonly EXTRA_ARGS="$@"
+
+# Argument parsing: configure addresses
+if [[ "${IP}" == "4" ]]; then
+ readonly SADDR="${SADDR4}"
+ readonly DADDR="${DADDR4}"
+elif [[ "${IP}" == "6" ]]; then
+ readonly SADDR="${SADDR6}"
+ readonly DADDR="${DADDR6}"
+else
+ echo "Invalid IP version ${IP}"
+ exit 1
+fi
+
+# Argument parsing: select receive mode
+#
+# This differs from send mode for
+# - packet: use raw recv, because packet receives skb clones
+# - raw_hdrinc: use raw recv, because hdrincl is a tx-only option
+case "${TXMODE}" in
+'packet' | 'packet_dgram' | 'raw_hdrincl')
+ RXMODE='raw'
+ ;;
+*)
+ RXMODE="${TXMODE}"
+ ;;
+esac
+
+# Start of state changes: install cleanup handler
+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
+
+cleanup() {
+ ip netns del "${NS2}"
+ ip netns del "${NS1}"
+ sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
+}
+
+trap cleanup EXIT
+
+# Configure system settings
+sysctl -w -q "${path_sysctl_mem}=1000000"
+
+# Create virtual ethernet pair between network namespaces
+ip netns add "${NS1}"
+ip netns add "${NS2}"
+
+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
+ peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
+
+# Bring the devices up
+ip -netns "${NS1}" link set "${DEV}" up
+ip -netns "${NS2}" link set "${DEV}" up
+
+# Set fixed MAC addresses on the devices
+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
+
+# Add fixed IP addresses to the devices
+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
+ip -netns "${NS1}" addr add fd::1/64 dev "${DEV}" nodad
+ip -netns "${NS2}" addr add fd::2/64 dev "${DEV}" nodad
+
+# Optionally disable sg or csum offload to test edge cases
+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
+
+do_test() {
+ local -r ARGS="$1"
+
+ echo "ipv${IP} ${TXMODE} ${ARGS}"
+ ip netns exec "${NS2}" "${BIN_RX}" "-${IP}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" -r "${RXMODE}" &
+ sleep 0.2
+ ip netns exec "${NS1}" "${BIN_TX}" "-${IP}" -t 1 -D "${DADDR}" ${ARGS} "${TXMODE}"
+ wait
+}
+
+do_test "${EXTRA_ARGS}"
+echo ok
--
2.37.0

2022-07-12 21:06:11

by Pavel Begunkov

Subject: [PATCH net-next v5 18/27] io_uring: add notification slot registration

Let userspace register and unregister notification slots.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 17 ++++++++++++++
io_uring/io_uring.c | 9 ++++++++
io_uring/notif.c | 43 +++++++++++++++++++++++++++++++++++
io_uring/notif.h | 3 +++
4 files changed, 72 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e858dba2e6c9..f1ba8e934168 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -454,6 +454,10 @@ enum {
/* register a range of fixed file slots for automatic slot allocation */
IORING_REGISTER_FILE_ALLOC_RANGE = 25,

+ /* zerocopy notification API */
+ IORING_REGISTER_NOTIFIERS = 26,
+ IORING_UNREGISTER_NOTIFIERS = 27,
+
/* this goes last */
IORING_REGISTER_LAST
};
@@ -500,6 +504,19 @@ struct io_uring_rsrc_update2 {
__u32 resv2;
};

+struct io_uring_notification_slot {
+ __u64 tag;
+ __u64 resv[3];
+};
+
+struct io_uring_notification_register {
+ __u32 nr_slots;
+ __u32 resv;
+ __u64 resv2;
+ __u64 data;
+ __u64 resv3;
+};
+
/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP (-2)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bdc5a2839d94..41ef98a43d32 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,6 +3875,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_file_alloc_range(ctx, arg);
break;
+ case IORING_REGISTER_NOTIFIERS:
+ ret = io_notif_register(ctx, arg, nr_args);
+ break;
+ case IORING_UNREGISTER_NOTIFIERS:
+ ret = -EINVAL;
+ if (arg || nr_args)
+ break;
+ ret = io_notif_unregister(ctx);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 0a2e98bd74f6..e6d98dc208c7 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -162,5 +162,48 @@ __cold int io_notif_unregister(struct io_ring_ctx *ctx)
kvfree(ctx->notif_slots);
ctx->notif_slots = NULL;
ctx->nr_notif_slots = 0;
+ io_notif_cache_purge(ctx);
+ return 0;
+}
+
+__cold int io_notif_register(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned int size)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_uring_notification_slot __user *slots;
+ struct io_uring_notification_slot slot;
+ struct io_uring_notification_register reg;
+ unsigned i;
+
+ if (ctx->nr_notif_slots)
+ return -EBUSY;
+ if (size != sizeof(reg))
+ return -EINVAL;
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+ if (!reg.nr_slots || reg.nr_slots > IORING_MAX_NOTIF_SLOTS)
+ return -EINVAL;
+ if (reg.resv || reg.resv2 || reg.resv3)
+ return -EINVAL;
+
+ slots = u64_to_user_ptr(reg.data);
+ ctx->notif_slots = kvcalloc(reg.nr_slots, sizeof(ctx->notif_slots[0]),
+ GFP_KERNEL_ACCOUNT);
+ if (!ctx->notif_slots)
+ return -ENOMEM;
+
+ for (i = 0; i < reg.nr_slots; i++, ctx->nr_notif_slots++) {
+ struct io_notif_slot *notif_slot = &ctx->notif_slots[i];
+
+ if (copy_from_user(&slot, &slots[i], sizeof(slot))) {
+ io_notif_unregister(ctx);
+ return -EFAULT;
+ }
+ if (slot.resv[0] | slot.resv[1] | slot.resv[2]) {
+ io_notif_unregister(ctx);
+ return -EINVAL;
+ }
+ notif_slot->tag = slot.tag;
+ }
return 0;
}
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 1dd48efb7744..00efe164bdc4 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -6,6 +6,7 @@
#include <linux/nospec.h>

#define IO_NOTIF_SPLICE_BATCH 32
+#define IORING_MAX_NOTIF_SLOTS (1U << 10)

struct io_notif {
struct ubuf_info uarg;
@@ -48,6 +49,8 @@ struct io_notif_slot {
u32 seq;
};

+int io_notif_register(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned int size);
int io_notif_unregister(struct io_ring_ctx *ctx);
void io_notif_cache_purge(struct io_ring_ctx *ctx);

--
2.37.0
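
Reader's note: the registration argument layout and the checks in io_notif_register() above can be sketched in plain userspace C. This is only a model under stated assumptions: the struct layouts are copied from the uapi hunk in this patch, while the names notif_slot_desc, notif_register_desc, notif_register_fill() and notif_register_check() are hypothetical and exist only for illustration. The -EBUSY and -EFAULT paths of the kernel function are not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Mirrors IORING_MAX_NOTIF_SLOTS from io_uring/notif.h in this patch. */
#define MODEL_MAX_NOTIF_SLOTS (1U << 10)

/* Same layout as struct io_uring_notification_slot in the patch. */
struct notif_slot_desc {
	uint64_t tag;
	uint64_t resv[3];
};

/* Same layout as struct io_uring_notification_register in the patch. */
struct notif_register_desc {
	uint32_t nr_slots;
	uint32_t resv;
	uint64_t resv2;
	uint64_t data;
	uint64_t resv3;
};

/* Hypothetical helper: zero everything, then set tags and the slot pointer. */
static void notif_register_fill(struct notif_register_desc *reg,
				struct notif_slot_desc *slots,
				uint32_t nr_slots, uint64_t first_tag)
{
	uint32_t i;

	assert(reg && slots);
	memset(reg, 0, sizeof(*reg));
	memset(slots, 0, (size_t)nr_slots * sizeof(*slots));
	for (i = 0; i < nr_slots; i++)
		slots[i].tag = first_tag + i;
	reg->nr_slots = nr_slots;
	reg->data = (uint64_t)(uintptr_t)slots;
}

/* Hypothetical helper: the same field checks the kernel side performs. */
static int notif_register_check(const struct notif_register_desc *reg)
{
	const struct notif_slot_desc *slots;
	uint32_t i;

	assert(reg);
	if (!reg->nr_slots || reg->nr_slots > MODEL_MAX_NOTIF_SLOTS)
		return -EINVAL;
	if (reg->resv || reg->resv2 || reg->resv3)
		return -EINVAL;
	slots = (const struct notif_slot_desc *)(uintptr_t)reg->data;
	for (i = 0; i < reg->nr_slots; i++) {
		if (slots[i].resv[0] | slots[i].resv[1] | slots[i].resv[2])
			return -EINVAL;
	}
	return 0;
}
```

Going by the `size != sizeof(reg)` check in the patch, the filled structure would be passed to io_uring_register(2) with IORING_REGISTER_NOTIFIERS and nr_args set to sizeof(reg). Zeroing all reserved fields first is what keeps the call from failing with -EINVAL.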

2022-07-12 21:06:19

by Pavel Begunkov

Subject: [PATCH net-next v5 20/27] io_uring: account locked pages for non-fixed zc

Fixed buffers are accounted against RLIMIT_MEMLOCK, but that accounting does
not cover iovec-based zerocopy sends. Do the accounting on the io_uring side.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/net.c | 1 +
io_uring/notif.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index 399267e8f1ef..69273d4f4ef0 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -724,6 +724,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
if (unlikely(ret))
return ret;
+ mm_account_pinned_pages(&notif->uarg.mmp, zc->len);

msg_flags = zc->msg_flags | MSG_ZEROCOPY;
if (issue_flags & IO_URING_F_NONBLOCK)
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e6d98dc208c7..c5179e5c1cd6 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -14,7 +14,13 @@ static void __io_notif_complete_tw(struct callback_head *cb)
struct io_notif *notif = container_of(cb, struct io_notif, task_work);
struct io_rsrc_node *rsrc_node = notif->rsrc_node;
struct io_ring_ctx *ctx = notif->ctx;
+ struct mmpin *mmp = &notif->uarg.mmp;

+ if (mmp->user) {
+ atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+ free_uid(mmp->user);
+ mmp->user = NULL;
+ }
if (likely(notif->task)) {
io_put_task(notif->task, 1);
notif->task = NULL;
--
2.37.0
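
Reader's note: the charge/uncharge pairing in this patch (charge when the buffer is imported in io_sendzc(), uncharge when the notifier completes) can be modelled in userspace with C11 atomics. This is a simplified sketch, not the kernel helper: the names pin_account, pin_charge() and pin_uncharge() are hypothetical, a 4 KiB page size is assumed, and the worst-case estimate of "pages in the length plus two" reflects my reading of mm_account_pinned_pages() and should be checked against net/core/skbuff.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the per-user locked_vm counter and its limit. */
struct pin_account {
	atomic_long locked_pages;
	long max_pages;
};

/*
 * Charge a worst-case page count for a buffer of len bytes.
 * Returns the number of pages charged, or -ENOBUFS if the limit
 * would be exceeded. Lock-free: retries if another thread raced us.
 */
static long pin_charge(struct pin_account *acct, size_t len)
{
	long num_pg = (long)(len >> MODEL_PAGE_SHIFT) + 2;
	long old_pg;

	assert(acct);
	old_pg = atomic_load(&acct->locked_pages);
	do {
		if (old_pg + num_pg > acct->max_pages)
			return -ENOBUFS;
		/* On failure old_pg is reloaded with the current value. */
	} while (!atomic_compare_exchange_weak(&acct->locked_pages,
					       &old_pg, old_pg + num_pg));
	return num_pg;
}

/* Undo a previous charge, as the notifier completion path does. */
static void pin_uncharge(struct pin_account *acct, long num_pg)
{
	assert(acct && num_pg >= 0);
	atomic_fetch_sub(&acct->locked_pages, num_pg);
}
```

The design point the model tries to show is that the amount charged has to be remembered (here as the return value, in the patch as mmp->num_pg) so that completion subtracts exactly what submission added.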

2022-07-12 21:06:34

by Pavel Begunkov

Subject: [PATCH net-next v5 24/27] io_uring: rename IORING_OP_FILES_UPDATE

IORING_OP_FILES_UPDATE will become a more generic opcode serving different
resource types. Rename it to IORING_OP_RSRC_UPDATE, keep the old name as an
alias, and add subtype handling.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 12 +++++++++++-
io_uring/opdef.c | 9 +++++----
io_uring/rsrc.c | 17 +++++++++++++++--
io_uring/rsrc.h | 4 ++--
4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7d21fba54b62..37e8c104d31f 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -171,7 +171,8 @@ enum io_uring_op {
IORING_OP_FALLOCATE,
IORING_OP_OPENAT,
IORING_OP_CLOSE,
- IORING_OP_FILES_UPDATE,
+ IORING_OP_RSRC_UPDATE,
+ IORING_OP_FILES_UPDATE = IORING_OP_RSRC_UPDATE,
IORING_OP_STATX,
IORING_OP_READ,
IORING_OP_WRITE,
@@ -220,6 +221,7 @@ enum io_uring_op {
#define IORING_TIMEOUT_ETIME_SUCCESS (1U << 5)
#define IORING_TIMEOUT_CLOCK_MASK (IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
#define IORING_TIMEOUT_UPDATE_MASK (IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)
+
/*
* sqe->splice_flags
* extends splice(2) flags
@@ -286,6 +288,14 @@ enum io_uring_op {
*/
#define IORING_ACCEPT_MULTISHOT (1U << 0)

+
+/*
+ * IORING_OP_RSRC_UPDATE flags
+ */
+enum {
+ IORING_RSRC_UPDATE_FILES,
+};
+
/*
* IORING_OP_MSG_RING command types, stored in sqe->addr
*/
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 7ab19bbf3126..72dd2b2d8a9d 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -246,12 +246,13 @@ const struct io_op_def io_op_defs[] = {
.prep = io_close_prep,
.issue = io_close,
},
- [IORING_OP_FILES_UPDATE] = {
+ [IORING_OP_RSRC_UPDATE] = {
.audit_skip = 1,
.iopoll = 1,
- .name = "FILES_UPDATE",
- .prep = io_files_update_prep,
- .issue = io_files_update,
+ .name = "RSRC_UPDATE",
+ .prep = io_rsrc_update_prep,
+ .issue = io_rsrc_update,
+ .ioprio = 1,
},
[IORING_OP_STATX] = {
.audit_skip = 1,
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1182cf0ea1fc..98ce8a93a816 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -21,6 +21,7 @@ struct io_rsrc_update {
u64 arg;
u32 nr_args;
u32 offset;
+ int type;
};

static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
@@ -658,7 +659,7 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
}

-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_rsrc_update *up = io_kiocb_to_cmd(req);

@@ -672,6 +673,7 @@ int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!up->nr_args)
return -EINVAL;
up->arg = READ_ONCE(sqe->addr);
+ up->type = READ_ONCE(sqe->ioprio);
return 0;
}

@@ -711,7 +713,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req,
return ret;
}

-int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
+static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rsrc_update *up = io_kiocb_to_cmd(req);
struct io_ring_ctx *ctx = req->ctx;
@@ -740,6 +742,17 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
return IOU_OK;
}

+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_rsrc_update *up = io_kiocb_to_cmd(req);
+
+ switch (up->type) {
+ case IORING_RSRC_UPDATE_FILES:
+ return io_files_update(req, issue_flags);
+ }
+ return -EINVAL;
+}
+
int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
struct io_rsrc_node *node, void *rsrc)
{
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index af342fd239d0..21813a23215f 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,6 +167,6 @@ static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
return &data->tags[table_idx][off];
}

-int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags);
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
#endif
--
2.37.0
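
Reader's note: the subtype dispatch introduced here can be sketched as a small userspace model. The names sqe_model, rsrc_update_model and the *_model functions are hypothetical. The mapping of nr_args and offset to the SQE len and off fields is an assumption on my part, since the hunk above only shows arg being read from sqe->addr and type from sqe->ioprio.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the new enum: the only subtype at this point in the series. */
enum { MODEL_RSRC_UPDATE_FILES = 0 };

/* Hypothetical, cut-down view of the SQE fields the prep handler reads. */
struct sqe_model {
	uint16_t ioprio;	/* carries the resource subtype */
	uint64_t addr;		/* user pointer to the update array */
	uint32_t len;		/* assumed source of nr_args */
	uint64_t off;		/* assumed source of offset */
};

struct rsrc_update_model {
	uint64_t arg;
	uint32_t nr_args;
	uint32_t offset;
	int type;
};

static int rsrc_update_prep_model(struct rsrc_update_model *up,
				  const struct sqe_model *sqe)
{
	assert(up && sqe);
	up->offset = (uint32_t)sqe->off;
	up->nr_args = sqe->len;
	if (!up->nr_args)
		return -EINVAL;
	up->arg = sqe->addr;
	up->type = sqe->ioprio;
	return 0;
}

/* Stand-in for io_files_update(): just reports how many entries it saw. */
static int files_update_model(const struct rsrc_update_model *up)
{
	return (int)up->nr_args;
}

/* Dispatch on the subtype; unknown subtypes are rejected. */
static int rsrc_update_model(const struct rsrc_update_model *up)
{
	assert(up);
	switch (up->type) {
	case MODEL_RSRC_UPDATE_FILES:
		return files_update_model(up);
	}
	return -EINVAL;
}
```

Because IORING_RSRC_UPDATE_FILES is the first enumerator and therefore 0, a request submitted under the old IORING_OP_FILES_UPDATE name with ioprio left at zero takes the file-update path unchanged.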

2022-07-12 21:07:54

by Pavel Begunkov

Subject: [PATCH net-next v5 23/27] io_uring: flush notifiers after sendzc

Allow flushing notifiers as part of a sendzc request by setting the
IORING_RECVSEND_NOTIF_FLUSH flag. When the sendzc request succeeds, it
flushes the used [active] notifier.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 4 ++++
io_uring/io_uring.c | 11 +----------
io_uring/io_uring.h | 10 ++++++++++
io_uring/net.c | 5 ++++-
io_uring/notif.c | 2 +-
io_uring/notif.h | 11 +++++++++++
6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3f2305bc5c79..7d21fba54b62 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -272,10 +272,14 @@ enum io_uring_op {
*
* IORING_RECVSEND_FIXED_BUF Use registered buffers, the index is stored in
* the buf_index field.
+ *
+ * IORING_RECVSEND_NOTIF_FLUSH Flush a notification after a successful
+ * send. Only for zerocopy sends.
*/
#define IORING_RECVSEND_POLL_FIRST (1U << 0)
#define IORING_RECV_MULTISHOT (1U << 1)
#define IORING_RECVSEND_FIXED_BUF (1U << 2)
+#define IORING_RECVSEND_NOTIF_FLUSH (1U << 3)

/*
* accept flags stored in sqe->ioprio
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 41ef98a43d32..e4f3a1ede2f4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -615,7 +615,7 @@ void __io_put_task(struct task_struct *task, int nr)
put_task_struct_many(task, nr);
}

-static void io_task_refs_refill(struct io_uring_task *tctx)
+void io_task_refs_refill(struct io_uring_task *tctx)
{
unsigned int refill = -tctx->cached_refs + IO_TCTX_REFS_CACHE_NR;

@@ -624,15 +624,6 @@ static void io_task_refs_refill(struct io_uring_task *tctx)
tctx->cached_refs += refill;
}

-static inline void io_get_task_refs(int nr)
-{
- struct io_uring_task *tctx = current->io_uring;
-
- tctx->cached_refs -= nr;
- if (unlikely(tctx->cached_refs < 0))
- io_task_refs_refill(tctx);
-}
-
static __cold void io_uring_drop_tctx_refs(struct task_struct *task)
{
struct io_uring_task *tctx = task->io_uring;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index b8c858727dc8..d9f2f5c71481 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -69,6 +69,7 @@ void io_wq_submit_work(struct io_wq_work *work);
void io_free_req(struct io_kiocb *req);
void io_queue_next(struct io_kiocb *req);
void __io_put_task(struct task_struct *task, int nr);
+void io_task_refs_refill(struct io_uring_task *tctx);

bool io_match_task_safe(struct io_kiocb *head, struct task_struct *task,
bool cancel_all);
@@ -265,4 +266,13 @@ static inline void io_put_task(struct task_struct *task, int nr)
__io_put_task(task, nr);
}

+static inline void io_get_task_refs(int nr)
+{
+ struct io_uring_task *tctx = current->io_uring;
+
+ tctx->cached_refs -= nr;
+ if (unlikely(tctx->cached_refs < 0))
+ io_task_refs_refill(tctx);
+}
+
#endif
diff --git a/io_uring/net.c b/io_uring/net.c
index 0259fbbad591..bf9916d5e50c 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -674,7 +674,8 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return -EINVAL;

zc->flags = READ_ONCE(sqe->ioprio);
- if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF))
+ if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST |
+ IORING_RECVSEND_FIXED_BUF | IORING_RECVSEND_NOTIF_FLUSH))
return -EINVAL;
if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
unsigned idx = READ_ONCE(sqe->buf_index);
@@ -776,6 +777,8 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
return ret == -ERESTARTSYS ? -EINTR : ret;
}

+ if (zc->flags & IORING_RECVSEND_NOTIF_FLUSH)
+ io_notif_slot_flush_submit(notif_slot, 0);
io_req_set_res(req, ret, 0);
return IOU_OK;
}
diff --git a/io_uring/notif.c b/io_uring/notif.c
index c5179e5c1cd6..a93887451bbb 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -133,7 +133,7 @@ struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
return notif;
}

-static void io_notif_slot_flush(struct io_notif_slot *slot)
+void io_notif_slot_flush(struct io_notif_slot *slot)
__must_hold(&ctx->uring_lock)
{
struct io_notif *notif = slot->notif;
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 00efe164bdc4..6cd73d7b965b 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -54,6 +54,7 @@ int io_notif_register(struct io_ring_ctx *ctx,
int io_notif_unregister(struct io_ring_ctx *ctx);
void io_notif_cache_purge(struct io_ring_ctx *ctx);

+void io_notif_slot_flush(struct io_notif_slot *slot);
struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
struct io_notif_slot *slot);

@@ -74,3 +75,13 @@ static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
idx = array_index_nospec(idx, ctx->nr_notif_slots);
return &ctx->notif_slots[idx];
}
+
+static inline void io_notif_slot_flush_submit(struct io_notif_slot *slot,
+ unsigned int issue_flags)
+{
+ if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+ slot->notif->task = current;
+ io_get_task_refs(1);
+ }
+ io_notif_slot_flush(slot);
+}
--
2.37.0
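
Reader's note: the flag handling added by this patch comes down to two decisions, sketched below as a userspace model. The MODEL_* constants copy the bit values from the uapi hunk above; sendzc_check_flags() and sendzc_wants_flush() are hypothetical names. The second helper simplifies the issue path to "a failed send returns before the flush is reached", which is how I read the io_sendzc() hunk.

```c
#include <assert.h>
#include <errno.h>

/* Bit values as defined in the uapi hunk of this patch. */
#define MODEL_RECVSEND_POLL_FIRST	(1U << 0)
#define MODEL_RECV_MULTISHOT		(1U << 1)
#define MODEL_RECVSEND_FIXED_BUF	(1U << 2)
#define MODEL_RECVSEND_NOTIF_FLUSH	(1U << 3)

/* Prep-time check: reject any flag sendzc does not understand. */
static int sendzc_check_flags(unsigned int flags)
{
	const unsigned int allowed = MODEL_RECVSEND_POLL_FIRST |
				     MODEL_RECVSEND_FIXED_BUF |
				     MODEL_RECVSEND_NOTIF_FLUSH;

	return (flags & ~allowed) ? -EINVAL : 0;
}

/*
 * Issue-time decision: returns 1 if the notifier would be flushed.
 * ret is the send result, min_ret the minimum acceptable byte count.
 */
static int sendzc_wants_flush(unsigned int flags, int ret, int min_ret)
{
	assert(min_ret >= 0);
	if (ret < min_ret)
		return 0;	/* failure paths return before the flush */
	return !!(flags & MODEL_RECVSEND_NOTIF_FLUSH);
}
```

Note that IORING_RECV_MULTISHOT shares the flag space but is not in the allowed mask, so a zerocopy send carrying it is rejected at prep time.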

2022-07-12 21:24:22

by Pavel Begunkov

Subject: [PATCH net-next v5 26/27] io_uring: enable managed frags with register buffers

io_uring's registered buffer infrastructure already has an efficient way of
pinning pages, so use SKBFL_MANAGED_FRAG_REFS when a request is backed purely
by registered buffers.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/net.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index bf9916d5e50c..a4e863dce7ec 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -704,6 +704,60 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return 0;
}

+static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+ struct iov_iter *from, size_t length)
+{
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+ int frag = shinfo->nr_frags;
+ int ret = 0;
+ struct bvec_iter bi;
+ ssize_t copied = 0;
+ unsigned long truesize = 0;
+
+ if (!shinfo->nr_frags)
+ shinfo->flags |= SKBFL_MANAGED_FRAG_REFS;
+
+ if (!skb_zcopy_managed(skb) || !iov_iter_is_bvec(from)) {
+ skb_zcopy_downgrade_managed(skb);
+ return __zerocopy_sg_from_iter(NULL, sk, skb, from, length);
+ }
+
+ bi.bi_size = min(from->count, length);
+ bi.bi_bvec_done = from->iov_offset;
+ bi.bi_idx = 0;
+
+ while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+ struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+ copied += v.bv_len;
+ truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+ __skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+ v.bv_offset, v.bv_len);
+ bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+ }
+ if (bi.bi_size)
+ ret = -EMSGSIZE;
+
+ shinfo->nr_frags = frag;
+ from->bvec += bi.bi_idx;
+ from->nr_segs -= bi.bi_idx;
+ from->count = bi.bi_size;
+ from->iov_offset = bi.bi_bvec_done;
+
+ skb->data_len += copied;
+ skb->len += copied;
+ skb->truesize += truesize;
+
+ if (sk && sk->sk_type == SOCK_STREAM) {
+ sk_wmem_queued_add(sk, truesize);
+ if (!skb_zcopy_pure(skb))
+ sk_mem_charge(sk, truesize);
+ } else {
+ refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+ }
+ return ret;
+}
+
int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
{
struct sockaddr_storage address;
@@ -768,7 +822,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)

msg.msg_flags = msg_flags;
msg.msg_ubuf = &notif->uarg;
- msg.sg_from_iter = NULL;
+ msg.sg_from_iter = io_sg_from_iter;
ret = sock_sendmsg(sock, &msg);

if (unlikely(ret < min_ret)) {
--
2.37.0
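
Reader's note: the fragment-filling loop in io_sg_from_iter() is mostly bookkeeping, and that part can be modelled in userspace. The sketch below is an illustration under assumptions, not the kernel code: seg, fill_result and fill_frags() are hypothetical names, the page size is taken as 4 KiB, and the fragment limit of 17 is the value MAX_SKB_FRAGS commonly has with 4 KiB pages. Each segment contributes its length to copied and PAGE_ALIGN(len + offset) to truesize, as in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE	4096UL	/* assumption: 4 KiB pages */
#define MODEL_MAX_FRAGS	17	/* assumption: common MAX_SKB_FRAGS value */

/* Hypothetical stand-in for one bio_vec: offset into the page and length. */
struct seg {
	size_t offset;
	size_t len;
};

struct fill_result {
	size_t copied;		/* payload bytes attached */
	size_t truesize;	/* page-granular memory charged */
	int nr_frags;		/* fragments in use afterwards */
	int err;		/* 0 or -EMSGSIZE */
};

static size_t model_page_align(size_t n)
{
	return (n + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/* Attach up to length bytes from segs, starting with frags_used slots taken. */
static struct fill_result fill_frags(const struct seg *segs, int nr_segs,
				     size_t length, int frags_used)
{
	struct fill_result r = { 0, 0, frags_used, 0 };
	size_t total = 0, remaining;
	int i;

	assert(segs && nr_segs >= 0 && frags_used >= 0);
	for (i = 0; i < nr_segs; i++)
		total += segs[i].len;
	/* Like min(from->count, length) in the patch. */
	remaining = length < total ? length : total;

	for (i = 0; i < nr_segs && remaining && r.nr_frags < MODEL_MAX_FRAGS; i++) {
		size_t len = segs[i].len < remaining ? segs[i].len : remaining;

		r.copied += len;
		r.truesize += model_page_align(len + segs[i].offset);
		r.nr_frags++;
		remaining -= len;
	}
	if (remaining)
		r.err = -EMSGSIZE;	/* ran out of fragment slots */
	return r;
}
```

The model makes one property of the patch easy to see: truesize is charged per page span, so a 4000-byte segment that starts 100 bytes into a page is charged as two pages.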

2022-07-20 12:53:33

by Jens Axboe

Subject: Re: (subset) [PATCH net-next v5 00/27] io_uring zerocopy send

On Tue, 12 Jul 2022 21:52:24 +0100, Pavel Begunkov wrote:
> NOTE: Not to be picked directly. After getting necessary acks, I'll be
> working out merging with Jakub and Jens.
>
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attach a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
>
> [...]

Applied, thanks!

[12/27] io_uring: initialise msghdr::msg_ubuf
commit: 06f241e2bf4ba2a3e77269be25d21c0196a57a4f
[13/27] io_uring: export io_put_task()
commit: ba64c07a6ef9a05ca9eb09e13b70df7500e78cf8
[14/27] io_uring: add zc notification infrastructure
commit: 6f322c753daee4b9d4ad494d4e8b05da610d804c
[15/27] io_uring: cache struct io_notif
commit: cf49e2d47c49e547d4bc370efe73785fc82354e5
[16/27] io_uring: complete notifiers in tw
commit: 9cc16ae447db07d210175d2ad2419784dd20f784
[17/27] io_uring: add rsrc referencing for notifiers
commit: e133e289093ea35c1f7f940fe4c0ceb62037dc59
[18/27] io_uring: add notification slot registration
commit: f20b817fd29b64ef6de24b83ef23e1f3fb273967
[19/27] io_uring: wire send zc request type
commit: 480ec5ff9a5a75d68423c0bd02e57a9ee6325320
[20/27] io_uring: account locked pages for non-fixed zc
commit: fcb98e61d0232cff7dd14ae85ad1c88d68f98273
[21/27] io_uring: allow to pass addr into sendzc
commit: 7ab12997edc9aa3e2be4169f929c50a1fcd41004
[22/27] io_uring: sendzc with fixed buffers
commit: bb4019de9ea11d21137b4a8ff01d9e338071d633
[23/27] io_uring: flush notifiers after sendzc
commit: 95a70c191696da64a6ae235d52132a5c17866dae
[24/27] io_uring: rename IORING_OP_FILES_UPDATE
commit: d488e605a45192f9f60c7624d46ba0b8c4d93aab
[25/27] io_uring: add zc notification flush requests
commit: cb155defb9bf20a647c8825a085695f3f94fdb60
[26/27] io_uring: enable managed frags with register buffers
commit: 04ae3dbe8a027cf10ab759456ffc4fb119486f74
[27/27] selftests/io_uring: test zerocopy send
commit: 0c450de20ce7d6bc8a2f97c98387baf910454477

Best regards,
--
Jens Axboe


2022-07-27 08:09:23

by Dust Li

Subject: Re: [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send

On Tue, Jul 12, 2022 at 09:52:51PM +0100, Pavel Begunkov wrote:
>Add selftests for io_uring zerocopy sends and io_uring's notification
>infrastructure. It's largely influenced by msg_zerocopy and uses it on
>the receive side.
>
>Signed-off-by: Pavel Begunkov <[email protected]>
>---
> tools/testing/selftests/net/Makefile | 1 +
> .../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
> .../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
> 3 files changed, 737 insertions(+)
> create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
> create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>
>diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
>index 7ea54af55490..51261483744e 100644
>--- a/tools/testing/selftests/net/Makefile
>+++ b/tools/testing/selftests/net/Makefile
>@@ -59,6 +59,7 @@ TEST_GEN_FILES += toeplitz
> TEST_GEN_FILES += cmsg_sender
> TEST_GEN_FILES += stress_reuseport_listen
> TEST_PROGS += test_vxlan_vnifiltering.sh
>+TEST_GEN_FILES += io_uring_zerocopy_tx
>
> TEST_FILES := settings
>
>diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
>new file mode 100644
>index 000000000000..9d64c560a2d6
>--- /dev/null
>+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
>@@ -0,0 +1,605 @@
>+/* SPDX-License-Identifier: MIT */
>+/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
>+#include <assert.h>
>+#include <errno.h>
>+#include <error.h>
>+#include <fcntl.h>
>+#include <limits.h>
>+#include <stdbool.h>
>+#include <stdint.h>
>+#include <stdio.h>
>+#include <stdlib.h>
>+#include <string.h>
>+#include <unistd.h>
>+
>+#include <arpa/inet.h>
>+#include <linux/errqueue.h>
>+#include <linux/if_packet.h>
>+#include <linux/io_uring.h>
>+#include <linux/ipv6.h>
>+#include <linux/socket.h>
>+#include <linux/sockios.h>
>+#include <net/ethernet.h>
>+#include <net/if.h>
>+#include <netinet/in.h>
>+#include <netinet/ip.h>
>+#include <netinet/ip6.h>
>+#include <netinet/tcp.h>
>+#include <netinet/udp.h>
>+#include <sys/ioctl.h>
>+#include <sys/mman.h>
>+#include <sys/resource.h>
>+#include <sys/socket.h>
>+#include <sys/stat.h>
>+#include <sys/time.h>
>+#include <sys/types.h>
>+#include <sys/un.h>
>+#include <sys/wait.h>
>+
>+#define NOTIF_TAG 0xfffffffULL
>+#define NONZC_TAG 0
>+#define ZC_TAG 1
>+

<...>

>+static void do_test(int domain, int type, int protocol)
>+{
>+ int i;
>+
>+ for (i = 0; i < IP_MAXPACKET; i++)
>+ payload[i] = 'a' + (i % 26);
>+ do_tx(domain, type, protocol);
>+}
>+
>+static void usage(const char *filepath)
>+{
>+ error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
>+ "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);

A small flaw, the usage here doesn't match the real options in parse_opts().

Thanks

>+}
>+
>+static void parse_opts(int argc, char **argv)
>+{
>+ const int max_payload_len = sizeof(payload) -
>+ sizeof(struct ipv6hdr) -
>+ sizeof(struct tcphdr) -
>+ 40 /* max tcp options */;
>+ struct sockaddr_in6 *addr6 = (void *) &cfg_dst_addr;
>+ struct sockaddr_in *addr4 = (void *) &cfg_dst_addr;
>+ char *daddr = NULL;
>+ int c;
>+
>+ if (argc <= 1)
>+ usage(argv[0]);
>+ cfg_payload_len = max_payload_len;
>+
>+ while ((c = getopt(argc, argv, "46D:p:s:t:n:fc:m:")) != -1) {
>+ switch (c) {
>+ case '4':
>+ if (cfg_family != PF_UNSPEC)
>+ error(1, 0, "Pass one of -4 or -6");
>+ cfg_family = PF_INET;
>+ cfg_alen = sizeof(struct sockaddr_in);
>+ break;
>+ case '6':
>+ if (cfg_family != PF_UNSPEC)
>+ error(1, 0, "Pass one of -4 or -6");
>+ cfg_family = PF_INET6;
>+ cfg_alen = sizeof(struct sockaddr_in6);
>+ break;
>+ case 'D':
>+ daddr = optarg;
>+ break;
>+ case 'p':
>+ cfg_port = strtoul(optarg, NULL, 0);
>+ break;
>+ case 's':
>+ cfg_payload_len = strtoul(optarg, NULL, 0);
>+ break;
>+ case 't':
>+ cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
>+ break;
>+ case 'n':
>+ cfg_nr_reqs = strtoul(optarg, NULL, 0);
>+ break;
>+ case 'f':
>+ cfg_flush = 1;
>+ break;
>+ case 'c':
>+ cfg_cork = strtol(optarg, NULL, 0);
>+ break;
>+ case 'm':
>+ cfg_mode = strtol(optarg, NULL, 0);
>+ break;
>+ }
>+ }
>+
>+ switch (cfg_family) {
>+ case PF_INET:
>+ memset(addr4, 0, sizeof(*addr4));
>+ addr4->sin_family = AF_INET;
>+ addr4->sin_port = htons(cfg_port);
>+ if (daddr &&
>+ inet_pton(AF_INET, daddr, &(addr4->sin_addr)) != 1)
>+ error(1, 0, "ipv4 parse error: %s", daddr);
>+ break;
>+ case PF_INET6:
>+ memset(addr6, 0, sizeof(*addr6));
>+ addr6->sin6_family = AF_INET6;
>+ addr6->sin6_port = htons(cfg_port);
>+ if (daddr &&
>+ inet_pton(AF_INET6, daddr, &(addr6->sin6_addr)) != 1)
>+ error(1, 0, "ipv6 parse error: %s", daddr);
>+ break;
>+ default:
>+ error(1, 0, "illegal domain");
>+ }
>+
>+ if (cfg_payload_len > max_payload_len)
>+ error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
>+ if (cfg_mode == MODE_NONZC && cfg_flush)
>+ error(1, 0, "-f: only zerocopy modes support notifications");
>+ if (optind != argc - 1)
>+ usage(argv[0]);
>+}
>+
>+int main(int argc, char **argv)
>+{
>+ const char *cfg_test = argv[argc - 1];
>+
>+ parse_opts(argc, argv);
>+
>+ if (!strcmp(cfg_test, "tcp"))
>+ do_test(cfg_family, SOCK_STREAM, 0);
>+ else if (!strcmp(cfg_test, "udp"))
>+ do_test(cfg_family, SOCK_DGRAM, 0);
>+ else
>+ error(1, 0, "unknown cfg_test %s", cfg_test);
>+ return 0;
>+}
>diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>new file mode 100755
>index 000000000000..6a65e4437640
>--- /dev/null
>+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
>@@ -0,0 +1,131 @@
>+#!/bin/bash
>+#
>+# Send data between two processes across namespaces
>+# Run twice: once without and once with zerocopy
>+
>+set -e
>+
>+readonly DEV="veth0"
>+readonly DEV_MTU=65535
>+readonly BIN_TX="./io_uring_zerocopy_tx"
>+readonly BIN_RX="./msg_zerocopy"
>+
>+readonly RAND="$(mktemp -u XXXXXX)"
>+readonly NSPREFIX="ns-${RAND}"
>+readonly NS1="${NSPREFIX}1"
>+readonly NS2="${NSPREFIX}2"
>+
>+readonly SADDR4='192.168.1.1'
>+readonly DADDR4='192.168.1.2'
>+readonly SADDR6='fd::1'
>+readonly DADDR6='fd::2'
>+
>+readonly path_sysctl_mem="net.core.optmem_max"
>+
>+# No arguments: automated test
>+if [[ "$#" -eq "0" ]]; then
>+ IPs=( "4" "6" )
>+ protocols=( "tcp" "udp" )
>+
>+ for IP in "${IPs[@]}"; do
>+ for proto in "${protocols[@]}"; do
>+ for mode in $(seq 1 3); do
>+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32
>+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -f
>+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -c -f
>+ done
>+ done
>+ done
>+
>+ echo "OK. All tests passed"
>+ exit 0
>+fi
>+
>+# Argument parsing
>+if [[ "$#" -lt "2" ]]; then
>+ echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
>+ exit 1
>+fi
>+
>+readonly IP="$1"
>+shift
>+readonly TXMODE="$1"
>+shift
>+readonly EXTRA_ARGS="$@"
>+
>+# Argument parsing: configure addresses
>+if [[ "${IP}" == "4" ]]; then
>+ readonly SADDR="${SADDR4}"
>+ readonly DADDR="${DADDR4}"
>+elif [[ "${IP}" == "6" ]]; then
>+ readonly SADDR="${SADDR6}"
>+ readonly DADDR="${DADDR6}"
>+else
>+ echo "Invalid IP version ${IP}"
>+ exit 1
>+fi
>+
>+# Argument parsing: select receive mode
>+#
>+# This differs from send mode for
>+# - packet: use raw recv, because packet receives skb clones
>+# - raw_hdrincl: use raw recv, because hdrincl is a tx-only option
>+case "${TXMODE}" in
>+'packet' | 'packet_dgram' | 'raw_hdrincl')
>+ RXMODE='raw'
>+ ;;
>+*)
>+ RXMODE="${TXMODE}"
>+ ;;
>+esac
>+
>+# Start of state changes: install cleanup handler
>+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
>+
>+cleanup() {
>+ ip netns del "${NS2}"
>+ ip netns del "${NS1}"
>+ sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
>+}
>+
>+trap cleanup EXIT
>+
>+# Configure system settings
>+sysctl -w -q "${path_sysctl_mem}=1000000"
>+
>+# Create virtual ethernet pair between network namespaces
>+ip netns add "${NS1}"
>+ip netns add "${NS2}"
>+
>+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
>+ peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
>+
>+# Bring the devices up
>+ip -netns "${NS1}" link set "${DEV}" up
>+ip -netns "${NS2}" link set "${DEV}" up
>+
>+# Set fixed MAC addresses on the devices
>+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
>+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
>+
>+# Add fixed IP addresses to the devices
>+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
>+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
>+ip -netns "${NS1}" addr add fd::1/64 dev "${DEV}" nodad
>+ip -netns "${NS2}" addr add fd::2/64 dev "${DEV}" nodad
>+
>+# Optionally disable sg or csum offload to test edge cases
>+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
>+
>+do_test() {
>+ local -r ARGS="$1"
>+
>+ echo "ipv${IP} ${TXMODE} ${ARGS}"
>+ ip netns exec "${NS2}" "${BIN_RX}" "-${IP}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" -r "${RXMODE}" &
>+ sleep 0.2
>+ ip netns exec "${NS1}" "${BIN_TX}" "-${IP}" -t 1 -D "${DADDR}" ${ARGS} "${TXMODE}"
>+ wait
>+}
>+
>+do_test "${EXTRA_ARGS}"
>+echo ok
>--
>2.37.0

2022-07-27 09:38:56

by Pavel Begunkov

Subject: Re: [PATCH net-next v5 27/27] selftests/io_uring: test zerocopy send

On 7/27/22 09:01, dust.li wrote:

>> +static void do_test(int domain, int type, int protocol)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < IP_MAXPACKET; i++)
>> + payload[i] = 'a' + (i % 26);
>> + do_tx(domain, type, protocol);
>> +}
>> +
>> +static void usage(const char *filepath)
>> +{
>> + error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
>> + "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
>
> A small flaw, the usage here doesn't match the real options in parse_opts().

Indeed. I'll adjust it, thanks!

--
Pavel Begunkov
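
Reader's note: for reference, a usage string that lines up with the getopt string quoted above ("46D:p:s:t:n:fc:m:") might look like the sketch below. This is a hypothetical wording, not the adjustment that was eventually made; format_usage() and usage_mentions() are illustrative names, and the option descriptions (port, cork, mode) are inferred from the cfg_* variables each case assigns.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical usage text covering every option parse_opts() accepts:
 * -4/-6, -D, -p, -s, -t, -n, -f, -c and -m, followed by the test name.
 * Returns what snprintf() returns.
 */
static int format_usage(char *buf, size_t len, const char *prog)
{
	assert(buf && prog);
	return snprintf(buf, len,
			"Usage: %s (-4|-6) -D<dst_ip> [-p<port>] "
			"[-s<payload size>] [-t<time s>] [-n<N>] [-f] "
			"[-c<cork>] [-m<mode>] (tcp|udp)", prog);
}

/* Small helper to check whether a usage text documents a given option. */
static int usage_mentions(const char *usage, const char *opt)
{
	return strstr(usage, opt) != NULL;
}
```

The original text advertises -z0, which parse_opts() does not accept, and omits -p, -c and -m as well as the tcp test name; those are the mismatches pointed out in the review.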