2022-07-07 12:28:23

by Pavel Begunkov

Subject: [PATCH net-next v4 00/27] io_uring zerocopy send

NOTE: Not to be picked directly. After getting the necessary acks, I'll work
out merging with Jakub and Jens.

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing is allowed but not recommended. Apart from the usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below);
these notifications are delivered into io_uring's Completion Queue. The
"buffer-free" notifications are not necessarily per request: the userspace has
control over it and should explicitly attach a number of requests to a single
notification. The series also adds some internal optimisations for registered
buffers, like removing page referencing.

From the kernel networking perspective there are two main changes. The first
one is passing ubuf_info into the network layer from io_uring (inside of an
in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, but also helps to avoid cross-referencing
and synchronisation problems. The second part is an optional optimisation
removing page referencing for requests with registered buffers.

Benchmarking was done with an optimised version of the selftest (see [1]),
which sends a bunch of requests, waits for completions and repeats. The
"+ flush" column posts one additional "buffer-free" notification per request,
while plain "zc" doesn't post buffer notifications at all.

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting.

There is an additional bunch of refcounting optimisations that were omitted
from the series for simplicity and because they don't change the picture
drastically. They will be sent as a follow-up, as well as flushing
optimisations closing the performance gap between the two last columns.

Note: the series is based on net-next + for-5.20/io_uring, but as vanilla
net-next fails for me, the repo (see [2]) is on top of for-5.20/io_uring.

Links:

liburing (benchmark + tests):
[1] https://github.com/isilence/liburing/tree/zc_v4

kernel repo:
[2] https://github.com/isilence/linux/tree/zc_v4

RFC v1:
[3] https://lore.kernel.org/io-uring/[email protected]/

RFC v2:
https://lore.kernel.org/io-uring/[email protected]/

Net patches are based on:
[email protected]:isilence/linux.git zc_v4-net-base

API design overview:

The series introduces an io_uring concept of notifiers. From the userspace
perspective it's an entity to which one or more requests can be bound and
which can then be flushed. Flushing a notifier makes it impossible to attach
new requests to it, and instructs the notifier to post a completion once all
requests attached to it are completed and the kernel doesn't need the buffers
anymore.

Notifiers are stored in notification slots, which should be registered as
an array in io_uring. Each slot holds only one notifier at any particular
moment. Flushing removes the notifier from the slot, and the slot
automatically replaces it with a new one. All operations on notifiers are
done by specifying the index of the slot they are currently in.

When registering notification slots the userspace specifies a u64 tag for
each slot, which will be copied into notification completion entries as
cqe::user_data. cqe::res is 0, and cqe::flags holds a wrap-around u32
sequence number counting the notifiers of a slot.
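
For illustration only, a rough userspace sketch of telling regular request
completions apart from the "buffer-free" notification completions described
above. It is not part of the series; NOTIF_TAG stands for whatever tag the
application registered for a slot, and only stock liburing CQE helpers are
used.

#include <liburing.h>
#include <stdio.h>

#define NOTIF_TAG 0x1000ULL	/* hypothetical tag registered for a slot */

static void reap_cqes(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	while (io_uring_wait_cqe(ring, &cqe) == 0) {
		if (cqe->user_data == NOTIF_TAG) {
			/* Notification CQE: res is 0, flags carries the
			 * wrap-around sequence number of the flushed
			 * notifier; its buffers can be reused now.
			 */
			printf("buffers free, notifier seq %u\n", cqe->flags);
		} else {
			/* Ordinary request completion */
			printf("send done, res %d\n", cqe->res);
		}
		io_uring_cqe_seen(ring, cqe);
	}
}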

Changelog:

v3 -> v4
custom iov_iter handling

RFC v2 -> v3:
mem accounting for non-registered buffers
allow mixing registered and normal requests per notifier
notification flushing via IORING_OP_RSRC_UPDATE
TCP support
fix buffer indexing
fix io-wq ->uring_lock locking
fix bugs when mixing with MSG_ZEROCOPY
fix managed refs bugs in skbuff.c

RFC -> RFC v2:
remove additional overhead for non-zc from skb_release_data()
avoid msg propagation, hide extra bits of non-zc overhead
task_work based "buffer free" notifications
improve io_uring's notification refcounting
added 5/19, (no pfmemalloc tracking)
added 8/19 and 9/19 preventing small copies with zc
misc small changes

David Ahern (1):
net: Allow custom iter handler in msghdr

Pavel Begunkov (26):
ipv4: avoid partial copy for zc
ipv6: avoid partial copy for zc
skbuff: don't mix ubuf_info from different sources
skbuff: add SKBFL_DONT_ORPHAN flag
skbuff: carry external ubuf_info in msghdr
net: introduce managed frags infrastructure
net: introduce __skb_fill_page_desc_noacc
ipv4/udp: support externally provided ubufs
ipv6/udp: support externally provided ubufs
tcp: support externally provided ubufs
io_uring: initialise msghdr::msg_ubuf
io_uring: export io_put_task()
io_uring: add zc notification infrastructure
io_uring: cache struct io_notif
io_uring: complete notifiers in tw
io_uring: add rsrc referencing for notifiers
io_uring: add notification slot registration
io_uring: wire send zc request type
io_uring: account locked pages for non-fixed zc
io_uring: allow to pass addr into sendzc
io_uring: sendzc with fixed buffers
io_uring: flush notifiers after sendzc
io_uring: rename IORING_OP_FILES_UPDATE
io_uring: add zc notification flush requests
io_uring: enable managed frags with register buffers
selftests/io_uring: test zerocopy send

include/linux/io_uring_types.h | 37 ++
include/linux/skbuff.h | 66 +-
include/linux/socket.h | 5 +
include/uapi/linux/io_uring.h | 45 +-
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 42 +-
io_uring/io_uring.h | 22 +
io_uring/net.c | 187 ++++++
io_uring/net.h | 4 +
io_uring/notif.c | 215 +++++++
io_uring/notif.h | 87 +++
io_uring/opdef.c | 24 +-
io_uring/rsrc.c | 55 +-
io_uring/rsrc.h | 16 +-
io_uring/tctx.h | 26 -
net/compat.c | 1 +
net/core/datagram.c | 14 +-
net/core/skbuff.c | 37 +-
net/ipv4/ip_output.c | 50 +-
net/ipv4/tcp.c | 32 +-
net/ipv6/ip6_output.c | 49 +-
net/socket.c | 3 +
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
25 files changed, 1628 insertions(+), 128 deletions(-)
create mode 100644 io_uring/notif.c
create mode 100644 io_uring/notif.h
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

--
2.36.1


2022-07-07 12:28:35

by Pavel Begunkov

Subject: [PATCH net-next v4 02/27] ipv6: avoid partial copy for zc

Even when zerocopy transmission is requested and possible,
__ip_append_data() will still copy a small chunk of data just because it
allocated some extra linear space (e.g. 128 bytes). It wastes CPU cycles
on copy and iter manipulations and also misaligns potentially aligned
data. Avoid such copies. And as a bonus we can allocate a smaller skb.

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv6/ip6_output.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 77e3f5970ce4..fc74ce3ed8cc 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1464,6 +1464,7 @@ static int __ip6_append_data(struct sock *sk,
int copy;
int err;
int offset = 0;
+ bool zc = false;
u32 tskey = 0;
struct rt6_info *rt = (struct rt6_info *)cork->dst;
struct ipv6_txoptions *opt = v6_cork->opt;
@@ -1549,6 +1550,7 @@ static int __ip6_append_data(struct sock *sk,
if (rt->dst.dev->features & NETIF_F_SG &&
csummode == CHECKSUM_PARTIAL) {
paged = true;
+ zc = true;
} else {
uarg->zerocopy = 0;
skb_zcopy_set(skb, uarg, &extra_uref);
@@ -1630,9 +1632,12 @@ static int __ip6_append_data(struct sock *sk,
(fraglen + alloc_extra < SKB_MAX_ALLOC ||
!(rt->dst.dev->features & NETIF_F_SG)))
alloclen = fraglen;
- else {
+ else if (!zc) {
alloclen = min_t(int, fraglen, MAX_HEADER);
pagedlen = fraglen - alloclen;
+ } else {
+ alloclen = fragheaderlen + transhdrlen;
+ pagedlen = datalen - transhdrlen;
}
alloclen += alloc_extra;

--
2.36.1

2022-07-07 12:28:59

by Pavel Begunkov

Subject: [PATCH net-next v4 08/27] net: introduce __skb_fill_page_desc_noacc

Managed pages contain pinned userspace pages and are controlled by upper
layers; there is no need to track skb->pfmemalloc for them. Introduce
a helper for filling frags while ignoring page tracking, it'll be needed
later.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 07004593d7ca..1111adefd906 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2550,6 +2550,22 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
return skb_headlen(skb) + __skb_pagelen(skb);
}

+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+ int i, struct page *page,
+ int off, int size)
+{
+ skb_frag_t *frag = &shinfo->frags[i];
+
+ /*
+ * Propagate page pfmemalloc to the skb if we can. The problem is
+ * that not all callers have unique ownership of the page but rely
+ * on page_is_pfmemalloc doing the right thing(tm).
+ */
+ frag->bv_page = page;
+ frag->bv_offset = off;
+ skb_frag_size_set(frag, size);
+}
+
/**
* __skb_fill_page_desc - initialise a paged fragment in an skb
* @skb: buffer containing fragment to be initialised
@@ -2566,17 +2582,7 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
struct page *page, int off, int size)
{
- skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
- /*
- * Propagate page pfmemalloc to the skb if we can. The problem is
- * that not all callers have unique ownership of the page but rely
- * on page_is_pfmemalloc doing the right thing(tm).
- */
- frag->bv_page = page;
- frag->bv_offset = off;
- skb_frag_size_set(frag, size);
-
+ __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
page = compound_head(page);
if (page_is_pfmemalloc(page))
skb->pfmemalloc = true;
--
2.36.1

2022-07-07 12:29:35

by Pavel Begunkov

Subject: [PATCH net-next v4 20/27] io_uring: account locked pages for non-fixed zc

Fixed buffers are RLIMIT_MEMLOCK accounted, however that doesn't cover
iovec-based zerocopy sends. Do the accounting on the io_uring side.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/net.c | 1 +
io_uring/notif.c | 6 ++++++
2 files changed, 7 insertions(+)

diff --git a/io_uring/net.c b/io_uring/net.c
index 399267e8f1ef..69273d4f4ef0 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -724,6 +724,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
if (unlikely(ret))
return ret;
+ mm_account_pinned_pages(&notif->uarg.mmp, zc->len);

msg_flags = zc->msg_flags | MSG_ZEROCOPY;
if (issue_flags & IO_URING_F_NONBLOCK)
diff --git a/io_uring/notif.c b/io_uring/notif.c
index e6d98dc208c7..c5179e5c1cd6 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -14,7 +14,13 @@ static void __io_notif_complete_tw(struct callback_head *cb)
struct io_notif *notif = container_of(cb, struct io_notif, task_work);
struct io_rsrc_node *rsrc_node = notif->rsrc_node;
struct io_ring_ctx *ctx = notif->ctx;
+ struct mmpin *mmp = &notif->uarg.mmp;

+ if (mmp->user) {
+ atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+ free_uid(mmp->user);
+ mmp->user = NULL;
+ }
if (likely(notif->task)) {
io_put_task(notif->task, 1);
notif->task = NULL;
--
2.36.1

2022-07-07 12:30:24

by Pavel Begunkov

Subject: [PATCH net-next v4 14/27] io_uring: add zc notification infrastructure

Add the internal part of send zerocopy notifications. There are two main
structures. The first one is struct io_notif, which carries a struct
ubuf_info inside and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and then ask to complete (aka flush) it.
When flushed, and all attached requests and skbs complete, it'll generate
one and only one CQE. Notifiers are intended to be passed into the network
layer as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently address them by index in the
array. Slots are independent of each other. Each slot can have only one
notifier at a time (called the active notifier) but many notifiers during
its lifetime. While active, a notifier is not going to post any completion,
but the userspace can attach requests to it by specifying the corresponding
slot while issuing send zc requests. Eventually, the userspace will want to
"flush" the notifier, losing any way to attach new requests to it; however,
it can use the next automatically added notifier of this slot or of any
other slot.

When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the user data specified in them, the flushed
notifier will post a CQE.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 5 ++
io_uring/Makefile | 2 +-
io_uring/io_uring.c | 8 ++-
io_uring/io_uring.h | 2 +
io_uring/notif.c | 102 +++++++++++++++++++++++++++++++++
io_uring/notif.h | 64 +++++++++++++++++++++
6 files changed, 179 insertions(+), 4 deletions(-)
create mode 100644 io_uring/notif.c
create mode 100644 io_uring/notif.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index d876a0367081..95334e678586 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -34,6 +34,9 @@ struct io_file_table {
unsigned int alloc_hint;
};

+struct io_notif;
+struct io_notif_slot;
+
struct io_hash_bucket {
spinlock_t lock;
struct hlist_head list;
@@ -232,6 +235,8 @@ struct io_ring_ctx {
unsigned nr_user_files;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;
+ struct io_notif_slot *notif_slots;
+ unsigned nr_notif_slots;

struct io_submit_state submit_state;

diff --git a/io_uring/Makefile b/io_uring/Makefile
index 466639c289be..8cc8e5387a75 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -7,5 +7,5 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \
openclose.o uring_cmd.o epoll.o \
statx.o net.o msg_ring.o timeout.o \
sqpoll.o fdinfo.o tctx.o poll.o \
- cancel.o kbuf.o rsrc.o rw.o opdef.o
+ cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o
obj-$(CONFIG_IO_WQ) += io-wq.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index bb644b1b575a..ad816afe2345 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -89,6 +89,7 @@
#include "kbuf.h"
#include "rsrc.h"
#include "cancel.h"
+#include "notif.h"

#include "timeout.h"
#include "poll.h"
@@ -726,9 +727,8 @@ struct io_uring_cqe *__io_get_cqe(struct io_ring_ctx *ctx)
return &rings->cqes[off];
}

-static bool io_fill_cqe_aux(struct io_ring_ctx *ctx,
- u64 user_data, s32 res, u32 cflags,
- bool allow_overflow)
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+ bool allow_overflow)
{
struct io_uring_cqe *cqe;

@@ -2496,6 +2496,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
}
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
+ WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);

io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
@@ -2672,6 +2673,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
io_unregister_personality(ctx, index);
if (ctx->rings)
io_poll_remove_all(ctx, NULL, true);
+ io_notif_unregister(ctx);
mutex_unlock(&ctx->uring_lock);

/* failed during ring init, it couldn't have issued any requests */
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 2379d9e70c10..b8c858727dc8 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -33,6 +33,8 @@ void io_req_complete_post(struct io_kiocb *req);
void __io_req_complete_post(struct io_kiocb *req);
bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
bool allow_overflow);
+bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags,
+ bool allow_overflow);
void __io_commit_cqring_flush(struct io_ring_ctx *ctx);

struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
diff --git a/io_uring/notif.c b/io_uring/notif.c
new file mode 100644
index 000000000000..6ee948af6a49
--- /dev/null
+++ b/io_uring/notif.c
@@ -0,0 +1,102 @@
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/file.h>
+#include <linux/slab.h>
+#include <linux/net.h>
+#include <linux/io_uring.h>
+
+#include "io_uring.h"
+#include "notif.h"
+
+static void __io_notif_complete_tw(struct callback_head *cb)
+{
+ struct io_notif *notif = container_of(cb, struct io_notif, task_work);
+ struct io_ring_ctx *ctx = notif->ctx;
+
+ io_cq_lock(ctx);
+ io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq, true);
+ io_cq_unlock_post(ctx);
+
+ percpu_ref_put(&ctx->refs);
+ kfree(notif);
+}
+
+static inline void io_notif_complete(struct io_notif *notif)
+{
+ __io_notif_complete_tw(&notif->task_work);
+}
+
+static void io_notif_complete_wq(struct work_struct *work)
+{
+ struct io_notif *notif = container_of(work, struct io_notif, commit_work);
+
+ io_notif_complete(notif);
+}
+
+static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+ struct ubuf_info *uarg,
+ bool success)
+{
+ struct io_notif *notif = container_of(uarg, struct io_notif, uarg);
+
+ if (!refcount_dec_and_test(&uarg->refcnt))
+ return;
+ INIT_WORK(&notif->commit_work, io_notif_complete_wq);
+ queue_work(system_unbound_wq, &notif->commit_work);
+}
+
+struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
+ struct io_notif_slot *slot)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_notif *notif;
+
+ notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
+ if (!notif)
+ return NULL;
+
+ notif->seq = slot->seq++;
+ notif->tag = slot->tag;
+ notif->ctx = ctx;
+ notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+ notif->uarg.callback = io_uring_tx_zerocopy_callback;
+ /* master ref owned by io_notif_slot, will be dropped on flush */
+ refcount_set(&notif->uarg.refcnt, 1);
+ percpu_ref_get(&ctx->refs);
+ return notif;
+}
+
+static void io_notif_slot_flush(struct io_notif_slot *slot)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_notif *notif = slot->notif;
+
+ slot->notif = NULL;
+
+ if (WARN_ON_ONCE(in_interrupt()))
+ return;
+ /* drop slot's master ref */
+ if (refcount_dec_and_test(&notif->uarg.refcnt))
+ io_notif_complete(notif);
+}
+
+__cold int io_notif_unregister(struct io_ring_ctx *ctx)
+ __must_hold(&ctx->uring_lock)
+{
+ int i;
+
+ if (!ctx->notif_slots)
+ return -ENXIO;
+
+ for (i = 0; i < ctx->nr_notif_slots; i++) {
+ struct io_notif_slot *slot = &ctx->notif_slots[i];
+
+ if (slot->notif)
+ io_notif_slot_flush(slot);
+ }
+
+ kvfree(ctx->notif_slots);
+ ctx->notif_slots = NULL;
+ ctx->nr_notif_slots = 0;
+ return 0;
+}
\ No newline at end of file
diff --git a/io_uring/notif.h b/io_uring/notif.h
new file mode 100644
index 000000000000..3d7a1d242e17
--- /dev/null
+++ b/io_uring/notif.h
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/net.h>
+#include <linux/uio.h>
+#include <net/sock.h>
+#include <linux/nospec.h>
+
+struct io_notif {
+ struct ubuf_info uarg;
+ struct io_ring_ctx *ctx;
+
+ /* cqe->user_data, io_notif_slot::tag if not overridden */
+ u64 tag;
+ /* see struct io_notif_slot::seq */
+ u32 seq;
+
+ union {
+ struct callback_head task_work;
+ struct work_struct commit_work;
+ };
+};
+
+struct io_notif_slot {
+ /*
+ * Current/active notifier. A slot holds only one active notifier at a
+ * time and keeps one reference to it. Flush releases the reference and
+ * lazily replaces it with a new notifier.
+ */
+ struct io_notif *notif;
+
+ /*
+ * Default ->user_data for this slot notifiers CQEs
+ */
+ u64 tag;
+ /*
+ * Notifiers of a slot live in generations, we create a new notifier
+ * only after flushing the previous one. Track the sequential number
+ * for all notifiers and copy it into notifiers's cqe->cflags
+ */
+ u32 seq;
+};
+
+int io_notif_unregister(struct io_ring_ctx *ctx);
+
+struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
+ struct io_notif_slot *slot);
+
+static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
+ struct io_notif_slot *slot)
+{
+ if (!slot->notif)
+ slot->notif = io_alloc_notif(ctx, slot);
+ return slot->notif;
+}
+
+static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
+ int idx)
+ __must_hold(&ctx->uring_lock)
+{
+ if (idx >= ctx->nr_notif_slots)
+ return NULL;
+ idx = array_index_nospec(idx, ctx->nr_notif_slots);
+ return &ctx->notif_slots[idx];
+}
--
2.36.1

2022-07-07 12:30:48

by Pavel Begunkov

Subject: [PATCH net-next v4 24/27] io_uring: rename IORING_OP_FILES_UPDATE

IORING_OP_FILES_UPDATE will become a more generic opcode serving different
resource types, so rename it to IORING_OP_RSRC_UPDATE and add subtype
handling.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/uapi/linux/io_uring.h | 12 +++++++++++-
io_uring/opdef.c | 9 +++++----
io_uring/rsrc.c | 17 +++++++++++++++--
io_uring/rsrc.h | 4 ++--
4 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 37e0730733f9..9e325179a4f8 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -171,7 +171,8 @@ enum io_uring_op {
IORING_OP_FALLOCATE,
IORING_OP_OPENAT,
IORING_OP_CLOSE,
- IORING_OP_FILES_UPDATE,
+ IORING_OP_RSRC_UPDATE,
+ IORING_OP_FILES_UPDATE = IORING_OP_RSRC_UPDATE,
IORING_OP_STATX,
IORING_OP_READ,
IORING_OP_WRITE,
@@ -220,6 +221,7 @@ enum io_uring_op {
#define IORING_TIMEOUT_ETIME_SUCCESS (1U << 5)
#define IORING_TIMEOUT_CLOCK_MASK (IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
#define IORING_TIMEOUT_UPDATE_MASK (IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)
+
/*
* sqe->splice_flags
* extends splice(2) flags
@@ -286,6 +288,14 @@ enum io_uring_op {
*/
#define IORING_ACCEPT_MULTISHOT (1U << 0)

+
+/*
+ * IORING_OP_RSRC_UPDATE flags
+ */
+enum {
+ IORING_RSRC_UPDATE_FILES,
+};
+
/*
* IORING_OP_MSG_RING command types, stored in sqe->addr
*/
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 8419b50c1d3b..0fb347d1ec16 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -246,12 +246,13 @@ const struct io_op_def io_op_defs[] = {
.prep = io_close_prep,
.issue = io_close,
},
- [IORING_OP_FILES_UPDATE] = {
+ [IORING_OP_RSRC_UPDATE] = {
.audit_skip = 1,
.iopoll = 1,
- .name = "FILES_UPDATE",
- .prep = io_files_update_prep,
- .issue = io_files_update,
+ .name = "RSRC_UPDATE",
+ .prep = io_rsrc_update_prep,
+ .issue = io_rsrc_update,
+ .ioprio = 1,
},
[IORING_OP_STATX] = {
.audit_skip = 1,
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1182cf0ea1fc..98ce8a93a816 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -21,6 +21,7 @@ struct io_rsrc_update {
u64 arg;
u32 nr_args;
u32 offset;
+ int type;
};

static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
@@ -658,7 +659,7 @@ __cold int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
return -EINVAL;
}

-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_rsrc_update *up = io_kiocb_to_cmd(req);

@@ -672,6 +673,7 @@ int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!up->nr_args)
return -EINVAL;
up->arg = READ_ONCE(sqe->addr);
+ up->type = READ_ONCE(sqe->ioprio);
return 0;
}

@@ -711,7 +713,7 @@ static int io_files_update_with_index_alloc(struct io_kiocb *req,
return ret;
}

-int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
+static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_rsrc_update *up = io_kiocb_to_cmd(req);
struct io_ring_ctx *ctx = req->ctx;
@@ -740,6 +742,17 @@ int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
return IOU_OK;
}

+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_rsrc_update *up = io_kiocb_to_cmd(req);
+
+ switch (up->type) {
+ case IORING_RSRC_UPDATE_FILES:
+ return io_files_update(req, issue_flags);
+ }
+ return -EINVAL;
+}
+
int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
struct io_rsrc_node *node, void *rsrc)
{
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index af342fd239d0..21813a23215f 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -167,6 +167,6 @@ static inline u64 *io_get_tag_slot(struct io_rsrc_data *data, unsigned int idx)
return &data->tags[table_idx][off];
}

-int io_files_update(struct io_kiocb *req, unsigned int issue_flags);
-int io_files_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags);
+int io_rsrc_update_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
#endif
--
2.36.1

2022-07-08 04:27:25

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/7/22 5:49 AM, Pavel Begunkov wrote:
> NOTE: Not be picked directly. After getting necessary acks, I'll be working
> out merging with Jakub and Jens.
>
> The patchset implements io_uring zerocopy send. It works with both registered
> and normal buffers, mixing is allowed but not recommended. Apart from usual
> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
> the userspace when buffers are freed and can be reused (see API design below),
> which is delivered into io_uring's Completion Queue. Those "buffer-free"
> notifications are not necessarily per request, but the userspace has control
> over it and should explicitly attaching a number of requests to a single
> notification. The series also adds some internal optimisations when used with
> registered buffers like removing page referencing.
>
> From the kernel networking perspective there are two main changes. The first
> one is passing ubuf_info into the network layer from io_uring (inside of an
> in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
> caching on the io_uring side, but also helps to avoid cross-referencing
> and synchronisation problems. The second part is an optional optimisation
> removing page referencing for requests with registered buffers.
>
> Benchmarking with an optimised version of the selftest (see [1]), which sends
> a bunch of requests, waits for completions and repeats. "+ flush" column posts
> one additional "buffer-free" notification per request, and just "zc" doesn't
> post buffer notifications at all.
>
> NIC (requests / second):
> IO size | non-zc | zc | zc + flush
> 4000 | 495134 | 606420 (+22%) | 558971 (+12%)
> 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
> 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%)
> 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%)
>
> dummy (requests / second):
> IO size | non-zc | zc | zc + flush
> 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%)
> 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%)
> 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%)
> 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%)
>
> Previously it also brought a massive performance speedup compared to the
> msg_zerocopy tool (see [3]), which is probably not super interesting.
>

Can you add a comment that the above results are for UDP?

You dropped comments about TCP testing; any progress there? If not, can
you relay any issues you are hitting?

2022-07-08 14:39:40

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/8/22 05:10, David Ahern wrote:
> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>> NOTE: Not be picked directly. After getting necessary acks, I'll be working
>> out merging with Jakub and Jens.
>>
>> The patchset implements io_uring zerocopy send. It works with both registered
>> and normal buffers, mixing is allowed but not recommended. Apart from usual
>> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
>> the userspace when buffers are freed and can be reused (see API design below),
>> which is delivered into io_uring's Completion Queue. Those "buffer-free"
>> notifications are not necessarily per request, but the userspace has control
>> over it and should explicitly attaching a number of requests to a single
>> notification. The series also adds some internal optimisations when used with
>> registered buffers like removing page referencing.
>>
>> From the kernel networking perspective there are two main changes. The first
>> one is passing ubuf_info into the network layer from io_uring (inside of an
>> in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
>> caching on the io_uring side, but also helps to avoid cross-referencing
>> and synchronisation problems. The second part is an optional optimisation
>> removing page referencing for requests with registered buffers.
>>
>> Benchmarking with an optimised version of the selftest (see [1]), which sends
>> a bunch of requests, waits for completions and repeats. "+ flush" column posts
>> one additional "buffer-free" notification per request, and just "zc" doesn't
>> post buffer notifications at all.
>>
>> NIC (requests / second):
>> IO size | non-zc | zc | zc + flush
>> 4000 | 495134 | 606420 (+22%) | 558971 (+12%)
>> 1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
>> 1000 | 584677 | 592088 (+1.2%) | 560885 (-4%)
>> 600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%)
>>
>> dummy (requests / second):
>> IO size | non-zc | zc | zc + flush
>> 8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%)
>> 4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%)
>> 1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%)
>> 600 | 2106794 | 2381527 (+13%) | 2195295 (+4%)
>>
>> Previously it also brought a massive performance speedup compared to the
>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>
>
> can you add a comment that the above results are for UDP.

Oh, right, forgot to add it


> You dropped comments about TCP testing; any progress there? If not, can
> you relay any issues you are hitting?

Not really a problem, but for me it's bottlenecked at NIC bandwidth
(~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
It was actually benchmarked by my colleague quite a while ago, but I can't
find the numbers. I probably need to at least add localhost numbers or grab
a better server.

--
Pavel Begunkov

2022-07-11 13:03:21

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/8/22 15:26, Pavel Begunkov wrote:
> On 7/8/22 05:10, David Ahern wrote:
>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>> NOTE: Not be picked directly. After getting necessary acks, I'll be working
>>>        out merging with Jakub and Jens.
>>>
>>> The patchset implements io_uring zerocopy send. It works with both registered
>>> and normal buffers, mixing is allowed but not recommended. Apart from usual
>>> request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
>>> the userspace when buffers are freed and can be reused (see API design below),
>>> which is delivered into io_uring's Completion Queue. Those "buffer-free"
>>> notifications are not necessarily per request, but the userspace has control
>>> over it and should explicitly attaching a number of requests to a single
>>> notification. The series also adds some internal optimisations when used with
>>> registered buffers like removing page referencing.
>>>
>>>  From the kernel networking perspective there are two main changes. The first
>>> one is passing ubuf_info into the network layer from io_uring (inside of an
>>> in kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
>>> caching on the io_uring side, but also helps to avoid cross-referencing
>>> and synchronisation problems. The second part is an optional optimisation
>>> removing page referencing for requests with registered buffers.
>>>
>>> Benchmarking with an optimised version of the selftest (see [1]), which sends
>>> a bunch of requests, waits for completions and repeats. "+ flush" column posts
>>> one additional "buffer-free" notification per request, and just "zc" doesn't
>>> post buffer notifications at all.
>>>
>>> NIC (requests / second):
>>> IO size | non-zc    | zc             | zc + flush
>>> 4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
>>> 1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
>>> 1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
>>> 600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
>>>
>>> dummy (requests / second):
>>> IO size | non-zc    | zc             | zc + flush
>>> 8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
>>> 4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
>>> 1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
>>> 600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
>>>
>>> Previously it also brought a massive performance speedup compared to the
>>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>>
>>
>> can you add a comment that the above results are for UDP.
>
> Oh, right, forgot to add it
>
>
>> You dropped comments about TCP testing; any progress there? If not, can
>> you relay any issues you are hitting?
>
> Not really a problem, but for me it's bottle necked at NIC bandwidth
> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
> Was actually benchmarked by my colleague quite a while ago, but can't
> find numbers. Probably need to at least add localhost numbers or grab
> a better server.

Testing localhost TCP with a hack (see below); it doesn't include the
refcounting optimisations I was testing UDP with, which will be sent
afterwards. Numbers are in MB/s.

IO size | non-zc    | zc
1200    | 4174      | 4148
4096    | 7597      | 11228

Because it's localhost, we also spend cycles here for the recv side.
Using a real NIC with 1200-byte payloads, zc is worse than non-zc by
~5-10%; maybe the omitted optimisations will somewhat help. I don't
consider it to be a blocker, but it would be interesting to poke into
later. One thing helping non-zc is that it squeezes a number of requests
into a single page, whereas zerocopy adds a new frag for every request.

Can't say anything new for larger payloads, I'm still NIC-bound, but
looking at CPU utilisation zc doesn't drain as many cycles as non-zc.
Also, I don't remember if I mentioned it before, but another catch is
that with TCP it expects users to not flush notifications too often,
because flushing forces it to allocate a new skb and lose a good chunk
of the benefits of using TCP.


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1111adefd906..c4b781b2c3b1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3218,9 +3218,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
{
- if (likely(!skb_zcopy(skb)))
- return 0;
- return skb_copy_ubufs(skb, gfp_mask);
+ return skb_orphan_frags(skb, gfp_mask);
}

--
Pavel Begunkov

2022-07-14 00:23:29

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/11/22 5:56 AM, Pavel Begunkov wrote:
> On 7/8/22 15:26, Pavel Begunkov wrote:
>> On 7/8/22 05:10, David Ahern wrote:
>>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>>> NOTE: Not be picked directly. After getting necessary acks, I'll be
>>>> working
>>>>        out merging with Jakub and Jens.
>>>>
>>>> The patchset implements io_uring zerocopy send. It works with both
>>>> registered
>>>> and normal buffers, mixing is allowed but not recommended. Apart
>>>> from usual
>>>> request completions, just as with MSG_ZEROCOPY, io_uring separately
>>>> notifies
>>>> the userspace when buffers are freed and can be reused (see API
>>>> design below),
>>>> which is delivered into io_uring's Completion Queue. Those
>>>> "buffer-free"
>>>> notifications are not necessarily per request, but the userspace has
>>>> control
>>>> over it and should explicitly attaching a number of requests to a
>>>> single
>>>> notification. The series also adds some internal optimisations when
>>>> used with
>>>> registered buffers like removing page referencing.
>>>>
>>>>  From the kernel networking perspective there are two main changes.
>>>> The first
>>>> one is passing ubuf_info into the network layer from io_uring
>>>> (inside of an
>>>> in kernel struct msghdr). This allows extra optimisations, e.g.
>>>> ubuf_info
>>>> caching on the io_uring side, but also helps to avoid cross-referencing
>>>> and synchronisation problems. The second part is an optional
>>>> optimisation
>>>> removing page referencing for requests with registered buffers.
>>>>
>>>> Benchmarking with an optimised version of the selftest (see [1]),
>>>> which sends
>>>> a bunch of requests, waits for completions and repeats. "+ flush"
>>>> column posts
>>>> one additional "buffer-free" notification per request, and just "zc"
>>>> doesn't
>>>> post buffer notifications at all.
>>>>
>>>> NIC (requests / second):
>>>> IO size | non-zc    | zc             | zc + flush
>>>> 4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
>>>> 1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
>>>> 1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
>>>> 600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
>>>>
>>>> dummy (requests / second):
>>>> IO size | non-zc    | zc             | zc + flush
>>>> 8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
>>>> 4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
>>>> 1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
>>>> 600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
>>>>
>>>> Previously it also brought a massive performance speedup compared to
>>>> the
>>>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>>>
>>>
>>> can you add a comment that the above results are for UDP.
>>
>> Oh, right, forgot to add it
>>
>>
>>> You dropped comments about TCP testing; any progress there? If not, can
>>> you relay any issues you are hitting?
>>
>> Not really a problem, but for me it's bottle necked at NIC bandwidth
>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>> Was actually benchmarked by my colleague quite a while ago, but can't
>> find numbers. Probably need to at least add localhost numbers or grab
>> a better server.
>
> Testing localhost TCP with a hack (see below), it doesn't include
> refcounting optimisations I was testing UDP with and that will be
> sent afterwards. Numbers are in MB/s
>
> IO size | non-zc    | zc
> 1200    | 4174      | 4148
> 4096    | 7597      | 11228

I am surprised by the low numbers; you should be able to saturate a 100G
link with TCP and ZC TX API.

>
> Because it's localhost, we also spend cycles here for the recv side.
> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
> omitted optimisations will somewhat help. I don't consider it to be a
> blocker. but would be interesting to poke into later. One thing helping
> non-zc is that it squeezes a number of requests into a single page
> whenever zerocopy adds a new frag for every request.
>
> Can't say anything new for larger payloads, I'm still NIC-bound but
> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
> Also, I don't remember if mentioned before, but another catch is that
> with TCP it expects users to not be flushing notifications too much,
> because it forces it to allocate a new skb and lose a good chunk of
> benefits from using TCP.

I had issues with TCP sockets and io_uring at the end of 2020:
https://www.spinics.net/lists/io-uring/msg05125.html

have not tried anything recent (from 2022).

2022-07-14 19:38:09

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/14/22 00:45, David Ahern wrote:
> On 7/11/22 5:56 AM, Pavel Begunkov wrote:
>> On 7/8/22 15:26, Pavel Begunkov wrote:
>>> On 7/8/22 05:10, David Ahern wrote:
>>>> On 7/7/22 5:49 AM, Pavel Begunkov wrote:
>>>>> NOTE: Not be picked directly. After getting necessary acks, I'll be
>>>>> working
>>>>>        out merging with Jakub and Jens.
>>>>>
>>>>> The patchset implements io_uring zerocopy send. It works with both
>>>>> registered
>>>>> and normal buffers, mixing is allowed but not recommended. Apart
>>>>> from usual
>>>>> request completions, just as with MSG_ZEROCOPY, io_uring separately
>>>>> notifies
>>>>> the userspace when buffers are freed and can be reused (see API
>>>>> design below),
>>>>> which is delivered into io_uring's Completion Queue. Those
>>>>> "buffer-free"
>>>>> notifications are not necessarily per request, but the userspace has
>>>>> control
>>>>> over it and should explicitly attaching a number of requests to a
>>>>> single
>>>>> notification. The series also adds some internal optimisations when
>>>>> used with
>>>>> registered buffers like removing page referencing.
>>>>>
>>>>>  From the kernel networking perspective there are two main changes.
>>>>> The first
>>>>> one is passing ubuf_info into the network layer from io_uring
>>>>> (inside of an
>>>>> in kernel struct msghdr). This allows extra optimisations, e.g.
>>>>> ubuf_info
>>>>> caching on the io_uring side, but also helps to avoid cross-referencing
>>>>> and synchronisation problems. The second part is an optional
>>>>> optimisation
>>>>> removing page referencing for requests with registered buffers.
>>>>>
>>>>> Benchmarking with an optimised version of the selftest (see [1]),
>>>>> which sends
>>>>> a bunch of requests, waits for completions and repeats. "+ flush"
>>>>> column posts
>>>>> one additional "buffer-free" notification per request, and just "zc"
>>>>> doesn't
>>>>> post buffer notifications at all.
>>>>>
>>>>> NIC (requests / second):
>>>>> IO size | non-zc    | zc             | zc + flush
>>>>> 4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
>>>>> 1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
>>>>> 1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
>>>>> 600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)
>>>>>
>>>>> dummy (requests / second):
>>>>> IO size | non-zc    | zc             | zc + flush
>>>>> 8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
>>>>> 4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
>>>>> 1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
>>>>> 600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)
>>>>>
>>>>> Previously it also brought a massive performance speedup compared to
>>>>> the
>>>>> msg_zerocopy tool (see [3]), which is probably not super interesting.
>>>>>
>>>>
>>>> can you add a comment that the above results are for UDP.
>>>
>>> Oh, right, forgot to add it
>>>
>>>
>>>> You dropped comments about TCP testing; any progress there? If not, can
>>>> you relay any issues you are hitting?
>>>
>>> Not really a problem, but for me it's bottle necked at NIC bandwidth
>>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>>> Was actually benchmarked by my colleague quite a while ago, but can't
>>> find numbers. Probably need to at least add localhost numbers or grab
>>> a better server.
>>
>> Testing localhost TCP with a hack (see below), it doesn't include
>> refcounting optimisations I was testing UDP with and that will be
>> sent afterwards. Numbers are in MB/s
>>
>> IO size | non-zc    | zc
>> 1200    | 4174      | 4148
>> 4096    | 7597      | 11228
>
> I am surprised by the low numbers; you should be able to saturate a 100G
> link with TCP and ZC TX API.

It was a quick test with my laptop, not a super fast CPU, preemptible
kernel, etc., and considering that the fact that it processes receives
in the same send syscall roughly doubles the overhead, 87Gb/s looks ok.
It's not like MSG_ZEROCOPY would look much different; on top of that,
all sends here are executed sequentially in io_uring, so there is no
extra parallelism. As for 1200, I think 4GB/s is reasonable; it's just
that the kernel overhead per byte is too high, and it should be the same
with plain send(2).

>> Because it's localhost, we also spend cycles here for the recv side.
>> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
>> omitted optimisations will somewhat help. I don't consider it to be a
>> blocker. but would be interesting to poke into later. One thing helping
>> non-zc is that it squeezes a number of requests into a single page
>> whenever zerocopy adds a new frag for every request.
>>
>> Can't say anything new for larger payloads, I'm still NIC-bound but
>> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
>> Also, I don't remember if mentioned before, but another catch is that
>> with TCP it expects users to not be flushing notifications too much,
>> because it forces it to allocate a new skb and lose a good chunk of
>> benefits from using TCP.
>
> I had issues with TCP sockets and io_uring at the end of 2020:
> https://www.spinics.net/lists/io-uring/msg05125.html
>
> have not tried anything recent (from 2022).

Haven't seen it back then. In general io_uring doesn't stop submitting
requests if one request fails, at least because we're trying to execute
requests asynchronously. And in general, requests can get executed
out of order, so submitting a bunch of requests to a single TCP sock
without any ordering on the io_uring side is most likely a bug.

You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
execution ordering. And if you meant links in the message, I agree
that it was not the best decision to treat len < sqe->len as not
an error and not break links, but it was later added that
MSG_WAITALL would also change the success condition to
len == sqe->len. But all that is only relevant if you were using linking.
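
For reference, a minimal sketch of such linking with liburing, assuming an
initialised ring, a connected socket and buffers with space left in the SQ;
it only relies on IOSQE_IO_LINK and MSG_WAITALL as mentioned above.

#include <liburing.h>
#include <sys/socket.h>

/* Two ordered sends on one TCP socket. MSG_WAITALL makes a short send fail
 * the request, and IOSQE_IO_LINK then cancels the following linked request.
 */
static int send_two_linked(struct io_uring *ring, int sockfd,
			   const void *buf1, size_t len1,
			   const void *buf2, size_t len2)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sockfd, buf1, len1, MSG_WAITALL);
	sqe->flags |= IOSQE_IO_LINK;	/* next request runs only on success */
	sqe->user_data = 1;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_send(sqe, sockfd, buf2, len2, MSG_WAITALL);
	sqe->user_data = 2;

	return io_uring_submit(ring);
}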

--
Pavel Begunkov

2022-07-18 02:36:47

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/14/22 12:55 PM, Pavel Begunkov wrote:
>>>>> You dropped comments about TCP testing; any progress there? If not,
>>>>> can
>>>>> you relay any issues you are hitting?
>>>>
>>>> Not really a problem, but for me it's bottle necked at NIC bandwidth
>>>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>>>> Was actually benchmarked by my colleague quite a while ago, but can't
>>>> find numbers. Probably need to at least add localhost numbers or grab
>>>> a better server.
>>>
>>> Testing localhost TCP with a hack (see below), it doesn't include
>>> refcounting optimisations I was testing UDP with and that will be
>>> sent afterwards. Numbers are in MB/s
>>>
>>> IO size | non-zc    | zc
>>> 1200    | 4174      | 4148
>>> 4096    | 7597      | 11228
>>
>> I am surprised by the low numbers; you should be able to saturate a 100G
>> link with TCP and ZC TX API.
>
> It was a quick test with my laptop, not a super fast CPU, preemptible
> kernel, etc., and considering that the fact that it processes receives
> from in the same send syscall roughly doubles the overhead, 87Gb/s
> looks ok. It's not like MSG_ZEROCOPY would look much different, even
> more to that all sends here will be executed sequentially in io_uring,
> so no extra parallelism or so. As for 1200, I think 4GB/s is reasonable,
> it's just the kernel overhead per byte is too high, should be same with
> just send(2).

?
It's a stream socket, so those sends are coalesced into MTU-sized packets.

>
>>> Because it's localhost, we also spend cycles here for the recv side.
>>> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
>>> omitted optimisations will somewhat help. I don't consider it to be a
>>> blocker. but would be interesting to poke into later. One thing helping
>>> non-zc is that it squeezes a number of requests into a single page
>>> whenever zerocopy adds a new frag for every request.
>>>
>>> Can't say anything new for larger payloads, I'm still NIC-bound but
>>> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
>>> Also, I don't remember if mentioned before, but another catch is that
>>> with TCP it expects users to not be flushing notifications too much,
>>> because it forces it to allocate a new skb and lose a good chunk of
>>> benefits from using TCP.
>>
>> I had issues with TCP sockets and io_uring at the end of 2020:
>> https://www.spinics.net/lists/io-uring/msg05125.html
>>
>> have not tried anything recent (from 2022).
>
> Haven't seen it back then. In general io_uring doesn't stop submitting
> requests if one request fails, at least because we're trying to execute
> requests asynchronously. And in general, requests can get executed
> out of order, so most probably submitting a bunch of requests to a single
> TCP sock without any ordering on io_uring side is likely a bug.

TCP socket buffer fills resulting in a partial send (i.e, for a given
sqe submission only part of the write/send succeeded). io_uring was not
handling that case.

I'll try to find some time to resurrect the iperf3 patch and try top of
tree kernel.

>
> You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
> execution ordering. And if you meant links in the message, I agree
> that it was not the best decision to consider len < sqe->len not
> an error and not breaking links, but it was later added that
> MSG_WAITALL would also change the success condition to
> len==sqe->len. But all that is relevant if you was using linking.
>

2022-07-20 13:41:31

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/18/22 03:19, David Ahern wrote:
> On 7/14/22 12:55 PM, Pavel Begunkov wrote:
>>>>>> You dropped comments about TCP testing; any progress there? If not,
>>>>>> can
>>>>>> you relay any issues you are hitting?
>>>>>
>>>>> Not really a problem, but for me it's bottle necked at NIC bandwidth
>>>>> (~3GB/s) for both zc and non-zc and doesn't even nearly saturate a CPU.
>>>>> Was actually benchmarked by my colleague quite a while ago, but can't
>>>>> find numbers. Probably need to at least add localhost numbers or grab
>>>>> a better server.
>>>>
>>>> Testing localhost TCP with a hack (see below), it doesn't include
>>>> refcounting optimisations I was testing UDP with and that will be
>>>> sent afterwards. Numbers are in MB/s
>>>>
>>>> IO size | non-zc    | zc
>>>> 1200    | 4174      | 4148
>>>> 4096    | 7597      | 11228
>>>
>>> I am surprised by the low numbers; you should be able to saturate a 100G
>>> link with TCP and ZC TX API.
>>
>> It was a quick test with my laptop, not a super fast CPU, preemptible
>> kernel, etc., and considering that the fact that it processes receives
>> from in the same send syscall roughly doubles the overhead, 87Gb/s
>> looks ok. It's not like MSG_ZEROCOPY would look much different, even
>> more to that all sends here will be executed sequentially in io_uring,
>> so no extra parallelism or so. As for 1200, I think 4GB/s is reasonable,
>> it's just the kernel overhead per byte is too high, should be same with
>> just send(2).
>
> ?
> It's a stream socket so those sends are coalesced into MTU sized packets.

That leaves syscall and io_uring overhead, locking the socket, etc.,
which still requires more cycles than just copying 1200 bytes. And
the CPU used is not blazingly fast; it could be that a better CPU/setup
will saturate 100G.

>>>> Because it's localhost, we also spend cycles here for the recv side.
>>>> Using a real NIC 1200 bytes, zc is worse than non-zc ~5-10%, maybe the
>>>> omitted optimisations will somewhat help. I don't consider it to be a
>>>> blocker. but would be interesting to poke into later. One thing helping
>>>> non-zc is that it squeezes a number of requests into a single page
>>>> whenever zerocopy adds a new frag for every request.
>>>>
>>>> Can't say anything new for larger payloads, I'm still NIC-bound but
>>>> looking at CPU utilisation zc doesn't drain as much cycles as non-zc.
>>>> Also, I don't remember if mentioned before, but another catch is that
>>>> with TCP it expects users to not be flushing notifications too much,
>>>> because it forces it to allocate a new skb and lose a good chunk of
>>>> benefits from using TCP.
>>>
>>> I had issues with TCP sockets and io_uring at the end of 2020:
>>> https://www.spinics.net/lists/io-uring/msg05125.html
>>>
>>> have not tried anything recent (from 2022).
>>
>> Haven't seen it back then. In general io_uring doesn't stop submitting
>> requests if one request fails, at least because we're trying to execute
>> requests asynchronously. And in general, requests can get executed
>> out of order, so most probably submitting a bunch of requests to a single
>> TCP sock without any ordering on io_uring side is likely a bug.
>
> TCP socket buffer fills resulting in a partial send (i.e, for a given
> sqe submission only part of the write/send succeeded). io_uring was not
> handling that case.

It shouldn't have been different from send(2) with MSG_DONTWAIT: the result
can be short and the user should handle it. Also, I believe Jens recently
pushed in-kernel retries on the io_uring side for TCP for such cases.
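
For illustration, a rough sketch of how userspace might handle such a short
send completion by resubmitting the tail (plain, non-zc send case; everything
apart from the liburing calls is an assumption).

#include <liburing.h>

/* Resubmit the unsent tail when a send completion is short. The original
 * buf/len for the request are assumed to be tracked by the application.
 */
static void handle_send_cqe(struct io_uring *ring, struct io_uring_cqe *cqe,
			    int sockfd, const char *buf, size_t len)
{
	if (cqe->res > 0 && (size_t)cqe->res < len) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		/* queue the remainder of the buffer */
		io_uring_prep_send(sqe, sockfd, buf + cqe->res,
				   len - cqe->res, 0);
		io_uring_submit(ring);
	}
	io_uring_cqe_seen(ring, cqe);
}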

> I'll try to find some time to resurrect the iperf3 patch and try top of
> tree kernel.

Awesome


>> You can link io_uring requests, i.e. IOSQE_IO_LINK, guaranteeing
>> execution ordering. And if you meant links in the message, I agree
>> that it was not the best decision to consider len < sqe->len not
>> an error and not breaking links, but it was later added that
>> MSG_WAITALL would also change the success condition to
>> len==sqe->len. But all that is relevant if you was using linking.

--
Pavel Begunkov

2022-07-24 19:49:58

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/17/22 8:19 PM, David Ahern wrote:
>>
>> Haven't seen it back then. In general io_uring doesn't stop submitting
>> requests if one request fails, at least because we're trying to execute
>> requests asynchronously. And in general, requests can get executed
>> out of order, so most probably submitting a bunch of requests to a single
>> TCP sock without any ordering on io_uring side is likely a bug.
>
> TCP socket buffer fills resulting in a partial send (i.e, for a given
> sqe submission only part of the write/send succeeded). io_uring was not
> handling that case.
>
> I'll try to find some time to resurrect the iperf3 patch and try top of
> tree kernel.

With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
iperf3 with io_uring support (non-ZC case) no longer shows completions
with incomplete sends. So that is a good improvement over the last time I
tried it.

However, adding in the ZC support and that problem resurfaces - a lot of
completions are for an incomplete size.

liburing comes from your tree, zc_v4 branch. Upstream does not have
support for notifications yet, so I can not move to it.

Changes to iperf3 are here:
https://github.com/dsahern/iperf mods-3.10-io_uring

2022-07-27 10:58:32

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/24/22 19:28, David Ahern wrote:
> On 7/17/22 8:19 PM, David Ahern wrote:
>>>
>>> Haven't seen it back then. In general io_uring doesn't stop submitting
>>> requests if one request fails, at least because we're trying to execute
>>> requests asynchronously. And in general, requests can get executed
>>> out of order, so most probably submitting a bunch of requests to a single
>>> TCP sock without any ordering on io_uring side is likely a bug.
>>
>> TCP socket buffer fills resulting in a partial send (i.e, for a given
>> sqe submission only part of the write/send succeeded). io_uring was not
>> handling that case.
>>
>> I'll try to find some time to resurrect the iperf3 patch and try top of
>> tree kernel.
>
> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
> iperf3 with io_uring support (non-ZC case) no longer shows completions
> with incomplete sends. So that is good improvement over the last time I
> tried it.
>
> However, adding in the ZC support and that problem resurfaces - a lot of
> completions are for an incomplete size.

Makes sense, it explicitly retries with normal sends but I didn't
implement it for zc. Might be a good thing to add.

> liburing comes from your tree, zc_v4 branch. Upstream does not have
> support for notifications yet, so I can not move to it.

Upstreamed it

> Changes to iperf3 are here:
> https://github.com/dsahern/iperf mods-3.10-io_uring

--
Pavel Begunkov

2022-07-29 22:43:26

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/27/22 4:51 AM, Pavel Begunkov wrote:
>> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
>> iperf3 with io_uring support (non-ZC case) no longer shows completions
>> with incomplete sends. So that is good improvement over the last time I
>> tried it.
>>
>> However, adding in the ZC support and that problem resurfaces - a lot of
>> completions are for an incomplete size.
>
> Makes sense, it explicitly retries with normal sends but I didn't
> implement it for zc. Might be a good thing to add.
>

Yes, before this goes in. It will be confusing to users to get
incomplete completions when using the ZC option.

2022-09-26 21:28:25

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 7/24/22 19:28, David Ahern wrote:
> On 7/17/22 8:19 PM, David Ahern wrote:
>>>
>>> Haven't seen it back then. In general io_uring doesn't stop submitting
>>> requests if one request fails, at least because we're trying to execute
>>> requests asynchronously. And in general, requests can get executed
>>> out of order, so most probably submitting a bunch of requests to a single
>>> TCP sock without any ordering on io_uring side is likely a bug.
>>
>> TCP socket buffer fills resulting in a partial send (i.e, for a given
>> sqe submission only part of the write/send succeeded). io_uring was not
>> handling that case.
>>
>> I'll try to find some time to resurrect the iperf3 patch and try top of
>> tree kernel.
>
> With your zc_v5 branch (plus the init fix on using msg->sg_from_iter),
> iperf3 with io_uring support (non-ZC case) no longer shows completions
> with incomplete sends. So that is good improvement over the last time I
> tried it.
>
> However, adding in the ZC support and that problem resurfaces - a lot of
> completions are for an incomplete size.
>
> liburing comes from your tree, zc_v4 branch. Upstream does not have
> support for notifications yet, so I can not move to it.
>
> Changes to iperf3 are here:
> https://github.com/dsahern/iperf mods-3.10-io_uring

Tried it out, the branch below fixes a small problem, adds a couple
of extra optimisations and now it actually uses registered buffers.

https://github.com/isilence/iperf iou-sendzc

Still, the submission loop looked a bit weird, i.e. it submits I/O
to io_uring only when it exhausts sqes instead of sending right
away with some notion of QD and/or sending in batches. The approach
is good for batching (SQ size = 16 here), but not so much for latency.

I also see some CPU cycles being burnt in select(2). io_uring wait
would be more natural and perhaps more performant, but I didn't
spend enough time with iperf to say for sure.
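
As a sketch of that "io_uring wait" alternative (not taken from the iperf
branch), a submit-and-wait loop avoiding the extra select(2) could look
roughly like this:

#include <liburing.h>

/* Submit whatever is queued and sleep until at least one CQE arrives,
 * then drain the CQ in one pass.
 */
static int submit_and_reap(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;
	int ret;

	ret = io_uring_submit_and_wait(ring, 1);
	if (ret < 0)
		return ret;

	io_uring_for_each_cqe(ring, head, cqe) {
		/* process cqe->user_data / cqe->res here */
		seen++;
	}
	io_uring_cq_advance(ring, seen);
	return 0;
}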

--
Pavel Begunkov

2022-09-28 20:28:20

by Pavel Begunkov

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 9/28/22 20:31, David Ahern wrote:
> On 9/26/22 1:08 PM, Pavel Begunkov wrote:
>> Tried it out, the branch below fixes a small problem, adds a couple
>> of extra optimisations and now it actually uses registered buffers.
>>
>>     https://github.com/isilence/iperf iou-sendzc
>
> thanks for the patch; will it pull it in.
>
>> Still, the submission loop looked a bit weird, i.e. it submits I/O
>> to io_uring only when it exhausts sqes instead of sending right
>> away with some notion of QD and/or sending in batches. The approach
>> is good for batching (SQ size =16 here), but not so for latency.
>>
>> I also see some CPU cycles being burnt in select(2). io_uring wait
>> would be more natural and perhaps more performant, but I didn't
>> spend enough time with iperf to say for sure.
>
> ok. It will be a while before I have time to come back to it. In the
> meantime it seems like some io_uring changes happened between your dev
> branch and what was merged into liburing (compile worked on your branch
> but fails with upstream). Is the ZC support in liburing now?

It is. I forgot to put a note that I also adapted your patches to the
uapi changes. No more notification slots; a zc send request can now post
a second CQE if IORING_CQE_F_MORE is set in the first one. It's better
described in the io_uring_enter(2) man page, e.g.

https://git.kernel.dk/cgit/liburing/tree/man/io_uring_enter.2#n1063
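
For illustration, a minimal sketch of driving one zc send with the
upstreamed API, assuming liburing's io_uring_prep_send_zc helper and an
initialised ring and socket.

#include <liburing.h>

/* The first CQE carries the send result and has IORING_CQE_F_MORE set;
 * a second CQE with the same user_data arrives once the kernel is done
 * with the buffer, so it must not be reused before that.
 */
static int send_zc_once(struct io_uring *ring, int sockfd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret, pending = 1;

	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	sqe->user_data = 42;
	io_uring_submit(ring);

	while (pending--) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		if (cqe->flags & IORING_CQE_F_MORE)
			pending++;	/* a notification CQE will follow */
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}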

--
Pavel Begunkov

2022-09-28 20:47:13

by David Ahern

Subject: Re: [PATCH net-next v4 00/27] io_uring zerocopy send

On 9/26/22 1:08 PM, Pavel Begunkov wrote:
> Tried it out, the branch below fixes a small problem, adds a couple
> of extra optimisations and now it actually uses registered buffers.
>
>     https://github.com/isilence/iperf iou-sendzc

thanks for the patch; will pull it in.

>
> Still, the submission loop looked a bit weird, i.e. it submits I/O
> to io_uring only when it exhausts sqes instead of sending right
> away with some notion of QD and/or sending in batches. The approach
> is good for batching (SQ size =16 here), but not so for latency.
>
> I also see some CPU cycles being burnt in select(2). io_uring wait
> would be more natural and perhaps more performant, but I didn't
> spend enough time with iperf to say for sure.

ok. It will be a while before I have time to come back to it. In the
meantime it seems like some io_uring changes happened between your dev
branch and what was merged into liburing (compile worked on your branch
but fails with upstream). Is the ZC support in liburing now?