2022-06-28 19:03:12

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 00/29] io_uring zerocopy send

The third iteration of patches for zerocopy io_uring sends. I fixed
all known issues since the previous version and reshuffled io_uring
patches, but the net/ code didn't change much. I think it's ready
and will send it as a non-RFC soon.

All tests below are done using io_uring with all relevant performance
options turned on. Numbers look good, send + flush per request, which
is the worst case, is on par with non-zerocopy with the payload size
lower than 600 bytes with dummy netdev and b/w 1200-1500 for NIC tests.
Without "buffer-free" notification flushing at all it's on par with NIC
at around 600 bytes.

dummy:
IO size | non-zc (tx/s) | zc (tx/s) | zc + flush (tx/s)
8000 | 1299916 | 2396600 (+84%) | 2224219 (+71%)
4000 | 1869230 | 2344146 (+25%) | 2170069 (+16%)
1200 | 2071617 | 2361960 (+14%) | 2203052 (+6%)
600 | 2106794 | 2381527 (+13%) | 2195295 (+4%)

NIC:
IO size | non-zc (tx/s) | zc (tx/s) | zc + flush (tx/s)
4000 | 495134 | 606420 (+22%) | 558971 (+12%)
1500 | 551808 | 577116 (+4.5%) | 565803 (+2.5%)
1000 | 584677 | 592088 (+1.2%) | 560885 (-4%)
600 | 596292 | 598550 (+0.4%) | 555366 (-6.7%)

Apart from zerocopy, it also removes page referencing for reigstered
buffers (used in all zc tests). I'm experimenting with notificaiton
optimsation, which should improve the 3rd column, but that will go
separately from this series. I've also seen good CPU usage reduction
for TCP comparing to non-zc, but not posting numbers as had problems
saturating CPU.

Links:

RFC v1:
https://lore.kernel.org/io-uring/[email protected]/

RFC v2:
https://lore.kernel.org/io-uring/[email protected]/

liburing (copy of the benchmark + some tests):
https://github.com/isilence/liburing/tree/zc_v3

kernel repo:
https://github.com/isilence/linux/tree/zc_v3

API design overview:

First we take an internal zerocopy handler, aka struct ubuf_info, and let
io_uring to pass it into the network layer in struct msghdr. io_uring
stores them as wrapping into struct io_notif.

It also has an array of so called notification slots, each keeps one and
only one active notifier at a time, to which the userspace can bind requests
by specifying the slot index. Then the userspace can request to flush a
notifier, so when all buffers and requests used with this notifier
complete/freed it'll post one CQE.

The userspace can't bind new requests to a flushed notifier, however,
it can use the slot as flushing automatically replaces the notifier with
a new one.

Changelog:

RFC v2 -> RFC v3:
TCP support
accounting for normal (non-registered) buffers
allow to combine reg and normal requests within a notifier
notification flushing via IORING_OP_RSRC_UPDATE
overriding io_uring notification tag/user_data
add ubuf_info submmision side reference caching/batching
fix buffer indexing
fix io-wq ->uring_lock locking
fix bugs when mixing with MSG_ZEROCOPY
fix managed refs bugs in skbuff.c
numerous cleanups

RFC -> RFC v2:
remove additional overhead for non-zc from skb_release_data()
avoid msg propagation, hide extra bits of non-zc overhead
task_work based "buffer free" notifications
improve io_uring's notification refcounting
added 5/19, (no pfmemalloc tracking)
added 8/19 and 9/19 preventing small copies with zc
misc small changes

Pavel Begunkov (29):
ipv4: avoid partial copy for zc
ipv6: avoid partial copy for zc
skbuff: add SKBFL_DONT_ORPHAN flag
skbuff: carry external ubuf_info in msghdr
net: bvec specific path in zerocopy_sg_from_iter
net: optimise bvec-based zc page referencing
net: don't track pfmemalloc for managed frags
skbuff: don't mix ubuf_info of different types
ipv4/udp: support zc with managed data
ipv6/udp: support zc with managed data
tcp: support zc with managed data
tcp: kill extra io_uring's uarg refcounting
net: let callers provide extra ubuf_info refs
io_uring: opcode independent fixed buf import
io_uring: add zc notification infrastructure
io_uring: cache struct io_notif
io_uring: complete notifiers in tw
io_uring: add notification slot registration
io_uring: rename IORING_OP_FILES_UPDATE
io_uring: add zc notification flush requests
io_uring: wire send zc request type
io_uring: account locked pages for non-fixed zc
io_uring: allow to pass addr into sendzc
io_uring: add rsrc referencing for notifiers
io_uring: sendzc with fixed buffers
io_uring: flush notifiers after sendzc
io_uring: allow to override zc tag on flush
io_uring: batch submission notif referencing
selftests/io_uring: test zerocopy send

fs/io_uring.c | 566 +++++++++++++++-
include/linux/skbuff.h | 59 +-
include/linux/socket.h | 8 +
include/uapi/linux/io_uring.h | 43 +-
net/compat.c | 2 +
net/core/datagram.c | 53 +-
net/core/skbuff.c | 35 +-
net/ipv4/ip_output.c | 66 +-
net/ipv4/tcp.c | 56 +-
net/ipv6/ip6_output.c | 65 +-
net/socket.c | 6 +
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
14 files changed, 1613 insertions(+), 83 deletions(-)
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

--
2.36.1


2022-06-28 19:04:35

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
It'll be used later for {get,put}_page() optimisations.

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 50f4faeea76c..5237cb533bb4 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -613,11 +613,58 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
}
EXPORT_SYMBOL(skb_copy_datagram_from_iter);

+static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
+ struct iov_iter *from, size_t length)
+{
+ int frag = skb_shinfo(skb)->nr_frags;
+ int ret = 0;
+ struct bvec_iter bi;
+ ssize_t copied = 0;
+ unsigned long truesize = 0;
+
+ bi.bi_size = min(from->count, length);
+ bi.bi_bvec_done = from->iov_offset;
+ bi.bi_idx = 0;
+
+ while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+ struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+ copied += v.bv_len;
+ truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+ get_page(v.bv_page);
+ skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+ bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+ }
+ if (bi.bi_size)
+ ret = -EMSGSIZE;
+
+ from->bvec += bi.bi_idx;
+ from->nr_segs -= bi.bi_idx;
+ from->count = bi.bi_size;
+ from->iov_offset = bi.bi_bvec_done;
+
+ skb->data_len += copied;
+ skb->len += copied;
+ skb->truesize += truesize;
+
+ if (sk && sk->sk_type == SOCK_STREAM) {
+ sk_wmem_queued_add(sk, truesize);
+ if (!skb_zcopy_pure(skb))
+ sk_mem_charge(sk, truesize);
+ } else {
+ refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+ }
+ return ret;
+}
+
int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, size_t length)
{
int frag = skb_shinfo(skb)->nr_frags;

+ if (iov_iter_is_bvec(from))
+ return __zerocopy_sg_from_bvec(sk, skb, from, length);
+
while (length && iov_iter_count(from)) {
struct page *pages[MAX_SKB_FRAGS];
struct page *last_head = NULL;
--
2.36.1

2022-06-28 19:04:35

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 04/29] skbuff: carry external ubuf_info in msghdr

Make possible for network in-kernel callers like io_uring to pass in a
custom ubuf_info by setting it in a new field of struct msghdr.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 4 ++++
include/linux/socket.h | 7 +++++++
net/compat.c | 2 ++
net/socket.c | 6 ++++++
4 files changed, 19 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8e75539fdc1d..6a57a5ae18fb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6230,6 +6230,8 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
+ msg.msg_ubuf = NULL;
+ msg.msg_managed_data = false;

flags = sr->msg_flags;
if (issue_flags & IO_URING_F_NONBLOCK)
@@ -6500,6 +6502,8 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_flags = 0;
msg.msg_controllen = 0;
msg.msg_iocb = NULL;
+ msg.msg_ubuf = NULL;
+ msg.msg_managed_data = false;

flags = sr->msg_flags;
if (force_nonblock)
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 17311ad9f9af..ba84ee614d5a 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -66,9 +66,16 @@ struct msghdr {
};
bool msg_control_is_user : 1;
bool msg_get_inq : 1;/* return INQ after receive */
+ /*
+ * The data pages are pinned and won't be released before ->msg_ubuf
+ * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
+ * to be non-NULL.
+ */
+ bool msg_managed_data : 1;
unsigned int msg_flags; /* flags on received message */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
+ struct ubuf_info *msg_ubuf;
};

struct user_msghdr {
diff --git a/net/compat.c b/net/compat.c
index 210fc3b4d0d8..435846fa85e0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -80,6 +80,8 @@ int __get_compat_msghdr(struct msghdr *kmsg,
return -EMSGSIZE;

kmsg->msg_iocb = NULL;
+ kmsg->msg_ubuf = NULL;
+ kmsg->msg_managed_data = false;
*ptr = msg.msg_iov;
*len = msg.msg_iovlen;
return 0;
diff --git a/net/socket.c b/net/socket.c
index 2bc8773d9dc5..0963a02b1472 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2106,6 +2106,8 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
+ msg.msg_ubuf = NULL;
+ msg.msg_managed_data = false;
if (addr) {
err = move_addr_to_kernel(addr, addr_len, &address);
if (err < 0)
@@ -2171,6 +2173,8 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
msg.msg_namelen = 0;
msg.msg_iocb = NULL;
msg.msg_flags = 0;
+ msg.msg_ubuf = NULL;
+ msg.msg_managed_data = false;
if (sock->file->f_flags & O_NONBLOCK)
flags |= MSG_DONTWAIT;
err = sock_recvmsg(sock, &msg, flags);
@@ -2409,6 +2413,8 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,
return -EMSGSIZE;

kmsg->msg_iocb = NULL;
+ kmsg->msg_ubuf = NULL;
+ kmsg->msg_managed_data = false;
*uiov = msg.msg_iov;
*nsegs = msg.msg_iovlen;
return 0;
--
2.36.1

2022-06-28 19:05:47

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 14/29] io_uring: opcode independent fixed buf import

Extract an opcode independent helper from io_import_fixed for
initialising an iov_iter with a fixed buffer.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6a57a5ae18fb..e47629adf3f7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3728,11 +3728,11 @@ static void kiocb_done(struct io_kiocb *req, ssize_t ret,
}
}

-static int __io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter,
- struct io_mapped_ubuf *imu)
+static int __io_import_fixed(int rw, struct iov_iter *iter,
+ struct io_mapped_ubuf *imu,
+ u64 buf_addr, size_t len)
{
- size_t len = req->rw.len;
- u64 buf_end, buf_addr = req->rw.addr;
+ u64 buf_end;
size_t offset;

if (unlikely(check_add_overflow(buf_addr, (u64)len, &buf_end)))
@@ -3802,7 +3802,7 @@ static int io_import_fixed(struct io_kiocb *req, int rw, struct iov_iter *iter,
imu = READ_ONCE(ctx->user_bufs[index]);
req->imu = imu;
}
- return __io_import_fixed(req, rw, iter, imu);
+ return __io_import_fixed(rw, iter, imu, req->rw.addr, req->rw.len);
}

static int io_buffer_add_list(struct io_ring_ctx *ctx,
--
2.36.1

2022-06-28 19:05:49

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 16/29] io_uring: cache struct io_notif

kmalloc'ing struct io_notif is too expensive when done frequently, cache
them as many other resources in io_uring. Keep two list, the first one
is from where we're getting notifiers, it's protected by ->uring_lock.
The second is protected by ->completion_lock, to which we queue released
notifiers. Then we splice one list into another when needed.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 68 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7d058deb5f73..422ff835bf36 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -381,6 +381,8 @@ struct io_notif {
u64 tag;
/* see struct io_notif_slot::seq */
u32 seq;
+ /* hook into ctx->notif_list and ctx->notif_list_locked */
+ struct list_head cache_node;

union {
struct callback_head task_work;
@@ -469,6 +471,8 @@ struct io_ring_ctx {
struct xarray io_bl_xa;
struct list_head io_buffers_cache;

+ /* struct io_notif cache protected by uring_lock */
+ struct list_head notif_list;
struct list_head timeout_list;
struct list_head ltimeout_list;
struct list_head cq_overflow_list;
@@ -481,6 +485,9 @@ struct io_ring_ctx {
/* IRQ completion list, under ->completion_lock */
struct io_wq_work_list locked_free_list;
unsigned int locked_free_nr;
+ /* struct io_notif cache protected by completion_lock */
+ struct list_head notif_list_locked;
+ unsigned int notif_locked_nr;

const struct cred *sq_creds; /* cred used for __io_sq_thread() */
struct io_sq_data *sq_data; /* if using sq thread polling */
@@ -1932,6 +1939,8 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_WQ_LIST(&ctx->locked_free_list);
INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
+ INIT_LIST_HEAD(&ctx->notif_list);
+ INIT_LIST_HEAD(&ctx->notif_list_locked);
return ctx;
err:
kfree(ctx->dummy_ubuf);
@@ -2795,12 +2804,15 @@ static void __io_notif_complete_tw(struct callback_head *cb)

spin_lock(&ctx->completion_lock);
io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq);
+
+ list_add(&notif->cache_node, &ctx->notif_list_locked);
+ ctx->notif_locked_nr++;
+
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

percpu_ref_put(&ctx->refs);
- kfree(notif);
}

static inline void io_notif_complete(struct io_notif *notif)
@@ -2827,21 +2839,62 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
queue_work(system_unbound_wq, &notif->commit_work);
}

+static void io_notif_splice_cached(struct io_ring_ctx *ctx)
+ __must_hold(&ctx->uring_lock)
+{
+ spin_lock(&ctx->completion_lock);
+ list_splice_init(&ctx->notif_list_locked, &ctx->notif_list);
+ ctx->notif_locked_nr = 0;
+ spin_unlock(&ctx->completion_lock);
+}
+
+static void io_notif_cache_purge(struct io_ring_ctx *ctx)
+ __must_hold(&ctx->uring_lock)
+{
+ io_notif_splice_cached(ctx);
+
+ while (!list_empty(&ctx->notif_list)) {
+ struct io_notif *notif = list_first_entry(&ctx->notif_list,
+ struct io_notif, cache_node);
+
+ list_del(&notif->cache_node);
+ kfree(notif);
+ }
+}
+
+static inline bool io_notif_has_cached(struct io_ring_ctx *ctx)
+ __must_hold(&ctx->uring_lock)
+{
+ if (likely(!list_empty(&ctx->notif_list)))
+ return true;
+ if (data_race(READ_ONCE(ctx->notif_locked_nr) <= IO_COMPL_BATCH))
+ return false;
+ io_notif_splice_cached(ctx);
+ return !list_empty(&ctx->notif_list);
+}
+
static struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
struct io_notif_slot *slot)
__must_hold(&ctx->uring_lock)
{
struct io_notif *notif;

- notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
- if (!notif)
- return NULL;
+ if (likely(io_notif_has_cached(ctx))) {
+ notif = list_first_entry(&ctx->notif_list,
+ struct io_notif, cache_node);
+ list_del(&notif->cache_node);
+ } else {
+ notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
+ if (!notif)
+ return NULL;
+ /* pre-initialise some fields */
+ notif->ctx = ctx;
+ notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+ notif->uarg.callback = io_uring_tx_zerocopy_callback;
+ }

notif->seq = slot->seq++;
notif->tag = slot->tag;
- notif->ctx = ctx;
- notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
- notif->uarg.callback = io_uring_tx_zerocopy_callback;
/* master ref owned by io_notif_slot, will be dropped on flush */
refcount_set(&notif->uarg.refcnt, 1);
percpu_ref_get(&ctx->refs);
@@ -11330,6 +11383,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);

+ io_notif_cache_purge(ctx);
io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);

--
2.36.1

2022-06-28 19:05:51

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 22/29] io_uring: account locked pages for non-fixed zc

Fixed buffers are RLIMIT_MEMLOCK accounted, however it doesn't cover iovec
based zerocopy sends. Do the accounting on the io_uring side.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4a1a1d43e9b3..838030477456 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2825,7 +2825,13 @@ static void __io_notif_complete_tw(struct callback_head *cb)
{
struct io_notif *notif = container_of(cb, struct io_notif, task_work);
struct io_ring_ctx *ctx = notif->ctx;
+ struct mmpin *mmp = &notif->uarg.mmp;

+ if (unlikely(mmp->user)) {
+ atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+ free_uid(mmp->user);
+ mmp->user = NULL;
+ }
if (likely(notif->task)) {
io_put_task(notif->task, 1);
notif->task = NULL;
@@ -6616,6 +6622,7 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
if (unlikely(ret))
return ret;
+ mm_account_pinned_pages(&notif->uarg.mmp, zc->len);

msg_flags = zc->msg_flags | MSG_ZEROCOPY;
if (issue_flags & IO_URING_F_NONBLOCK)
--
2.36.1

2022-06-28 19:05:49

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 11/29] tcp: support zc with managed data

Also make tcp to use managed data and propagate SKBFL_MANAGED_FRAG_REFS
to optimise frag pages referencing.

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv4/tcp.c | 51 +++++++++++++++++++++++++++++++++-----------------
1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9984d23a7f3e..832c1afcdbe7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1202,17 +1202,23 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)

flags = msg->msg_flags;

- if (flags & MSG_ZEROCOPY && size && sock_flag(sk, SOCK_ZEROCOPY)) {
+ if ((flags & MSG_ZEROCOPY) && size) {
skb = tcp_write_queue_tail(sk);
- uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
- if (!uarg) {
- err = -ENOBUFS;
- goto out_err;
- }

- zc = sk->sk_route_caps & NETIF_F_SG;
- if (!zc)
- uarg->zerocopy = 0;
+ if (msg->msg_ubuf) {
+ uarg = msg->msg_ubuf;
+ net_zcopy_get(uarg);
+ zc = sk->sk_route_caps & NETIF_F_SG;
+ } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+ uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+ if (!uarg) {
+ err = -ENOBUFS;
+ goto out_err;
+ }
+ zc = sk->sk_route_caps & NETIF_F_SG;
+ if (!zc)
+ uarg->zerocopy = 0;
+ }
}

if (unlikely(flags & MSG_FASTOPEN || inet_sk(sk)->defer_connect) &&
@@ -1335,8 +1341,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)

copy = min_t(int, copy, pfrag->size - pfrag->offset);

- if (tcp_downgrade_zcopy_pure(sk, skb) ||
- !sk_wmem_schedule(sk, copy))
+ if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
+ if (tcp_downgrade_zcopy_pure(sk, skb))
+ goto wait_for_space;
+ skb_zcopy_downgrade_managed(skb);
+ }
+
+ if (!sk_wmem_schedule(sk, copy))
goto wait_for_space;

err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
@@ -1357,14 +1368,20 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
pfrag->offset += copy;
} else {
/* First append to a fragless skb builds initial
- * pure zerocopy skb
+ * zerocopy skb
*/
- if (!skb->len)
+ if (!skb->len) {
+ if (msg->msg_managed_data)
+ skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
skb_shinfo(skb)->flags |= SKBFL_PURE_ZEROCOPY;
-
- if (!skb_zcopy_pure(skb)) {
- if (!sk_wmem_schedule(sk, copy))
- goto wait_for_space;
+ } else {
+ /* appending, don't mix managed and unmanaged */
+ if (!msg->msg_managed_data)
+ skb_zcopy_downgrade_managed(skb);
+ if (!skb_zcopy_pure(skb)) {
+ if (!sk_wmem_schedule(sk, copy))
+ goto wait_for_space;
+ }
}

err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
--
2.36.1

2022-06-28 19:05:51

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 07/29] net: don't track pfmemalloc for managed frags

Managed pages contain pinned userspace pages and controlled by upper
layers, there is no need in tracking skb->pfmemalloc for them.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 28 +++++++++++++++++-----------
net/core/datagram.c | 7 +++++--
2 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5407cfd9cb89..6cca146be1f4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2549,6 +2549,22 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
return skb_headlen(skb) + __skb_pagelen(skb);
}

+static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
+ int i, struct page *page,
+ int off, int size)
+{
+ skb_frag_t *frag = &shinfo->frags[i];
+
+ /*
+ * Propagate page pfmemalloc to the skb if we can. The problem is
+ * that not all callers have unique ownership of the page but rely
+ * on page_is_pfmemalloc doing the right thing(tm).
+ */
+ frag->bv_page = page;
+ frag->bv_offset = off;
+ skb_frag_size_set(frag, size);
+}
+
/**
* __skb_fill_page_desc - initialise a paged fragment in an skb
* @skb: buffer containing fragment to be initialised
@@ -2565,17 +2581,7 @@ static inline unsigned int skb_pagelen(const struct sk_buff *skb)
static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
struct page *page, int off, int size)
{
- skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-
- /*
- * Propagate page pfmemalloc to the skb if we can. The problem is
- * that not all callers have unique ownership of the page but rely
- * on page_is_pfmemalloc doing the right thing(tm).
- */
- frag->bv_page = page;
- frag->bv_offset = off;
- skb_frag_size_set(frag, size);
-
+ __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
page = compound_head(page);
if (page_is_pfmemalloc(page))
skb->pfmemalloc = true;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index a93c05156f56..3c913a6342ad 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -616,7 +616,8 @@ EXPORT_SYMBOL(skb_copy_datagram_from_iter);
static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, size_t length)
{
- int frag = skb_shinfo(skb)->nr_frags;
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+ int frag = shinfo->nr_frags;
int ret = 0;
struct bvec_iter bi;
ssize_t copied = 0;
@@ -631,12 +632,14 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,

copied += v.bv_len;
truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
- skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
+ __skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+ v.bv_offset, v.bv_len);
bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
}
if (bi.bi_size)
ret = -EMSGSIZE;

+ shinfo->nr_frags = frag;
from->bvec += bi.bi_idx;
from->nr_segs -= bi.bi_idx;
from->count = bi.bi_size;
--
2.36.1

2022-06-28 19:05:51

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 29/29] selftests/io_uring: test zerocopy send

Add selftests for io_uring zerocopy sends and io_uring's notification
infrastructure. It's largely influenced by msg_zerocopy and uses it on
the receive side.

Signed-off-by: Pavel Begunkov <[email protected]>
---
tools/testing/selftests/net/Makefile | 1 +
.../selftests/net/io_uring_zerocopy_tx.c | 605 ++++++++++++++++++
.../selftests/net/io_uring_zerocopy_tx.sh | 131 ++++
3 files changed, 737 insertions(+)
create mode 100644 tools/testing/selftests/net/io_uring_zerocopy_tx.c
create mode 100755 tools/testing/selftests/net/io_uring_zerocopy_tx.sh

diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 464df13831f2..f33a626220eb 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -60,6 +60,7 @@ TEST_GEN_FILES += cmsg_sender
TEST_GEN_FILES += stress_reuseport_listen
TEST_PROGS += test_vxlan_vnifiltering.sh
TEST_GEN_FILES += bind_bhash_test
+TEST_GEN_FILES += io_uring_zerocopy_tx

TEST_FILES := settings

diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.c b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
new file mode 100644
index 000000000000..899ddc84f8a9
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.c
@@ -0,0 +1,605 @@
+/* SPDX-License-Identifier: MIT */
+/* based on linux-kernel/tools/testing/selftests/net/msg_zerocopy.c */
+#include <assert.h>
+#include <errno.h>
+#include <error.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <linux/errqueue.h>
+#include <linux/if_packet.h>
+#include <linux/io_uring.h>
+#include <linux/ipv6.h>
+#include <linux/socket.h>
+#include <linux/sockios.h>
+#include <net/ethernet.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/ip6.h>
+#include <netinet/tcp.h>
+#include <netinet/udp.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/un.h>
+#include <sys/wait.h>
+
+#define NOTIF_TAG 0xfffffffULL
+#define NONZC_TAG 0
+#define ZC_TAG 1
+
+enum {
+ MODE_NONZC = 0,
+ MODE_ZC = 1,
+ MODE_ZC_FIXED = 2,
+ MODE_MIXED = 3,
+};
+
+static bool cfg_flush = false;
+static bool cfg_cork = false;
+static int cfg_mode = MODE_ZC_FIXED;
+static int cfg_nr_reqs = 8;
+static int cfg_family = PF_UNSPEC;
+static int cfg_payload_len;
+static int cfg_port = 8000;
+static int cfg_runtime_ms = 4200;
+
+static socklen_t cfg_alen;
+static struct sockaddr_storage cfg_dst_addr;
+
+static char payload[IP_MAXPACKET] __attribute__((aligned(4096)));
+
+struct io_sq_ring {
+ unsigned *head;
+ unsigned *tail;
+ unsigned *ring_mask;
+ unsigned *ring_entries;
+ unsigned *flags;
+ unsigned *array;
+};
+
+struct io_cq_ring {
+ unsigned *head;
+ unsigned *tail;
+ unsigned *ring_mask;
+ unsigned *ring_entries;
+ struct io_uring_cqe *cqes;
+};
+
+struct io_uring_sq {
+ unsigned *khead;
+ unsigned *ktail;
+ unsigned *kring_mask;
+ unsigned *kring_entries;
+ unsigned *kflags;
+ unsigned *kdropped;
+ unsigned *array;
+ struct io_uring_sqe *sqes;
+
+ unsigned sqe_head;
+ unsigned sqe_tail;
+
+ size_t ring_sz;
+};
+
+struct io_uring_cq {
+ unsigned *khead;
+ unsigned *ktail;
+ unsigned *kring_mask;
+ unsigned *kring_entries;
+ unsigned *koverflow;
+ struct io_uring_cqe *cqes;
+
+ size_t ring_sz;
+};
+
+struct io_uring {
+ struct io_uring_sq sq;
+ struct io_uring_cq cq;
+ int ring_fd;
+};
+
+#ifdef __alpha__
+# ifndef __NR_io_uring_setup
+# define __NR_io_uring_setup 535
+# endif
+# ifndef __NR_io_uring_enter
+# define __NR_io_uring_enter 536
+# endif
+# ifndef __NR_io_uring_register
+# define __NR_io_uring_register 537
+# endif
+#else /* !__alpha__ */
+# ifndef __NR_io_uring_setup
+# define __NR_io_uring_setup 425
+# endif
+# ifndef __NR_io_uring_enter
+# define __NR_io_uring_enter 426
+# endif
+# ifndef __NR_io_uring_register
+# define __NR_io_uring_register 427
+# endif
+#endif
+
+#if defined(__x86_64) || defined(__i386__)
+#define read_barrier() __asm__ __volatile__("":::"memory")
+#define write_barrier() __asm__ __volatile__("":::"memory")
+#else
+
+#define read_barrier() __sync_synchronize()
+#define write_barrier() __sync_synchronize()
+#endif
+
+static int io_uring_setup(unsigned int entries, struct io_uring_params *p)
+{
+ return syscall(__NR_io_uring_setup, entries, p);
+}
+
+static int io_uring_enter(int fd, unsigned int to_submit,
+ unsigned int min_complete,
+ unsigned int flags, sigset_t *sig)
+{
+ return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
+ flags, sig, _NSIG / 8);
+}
+
+static int io_uring_register_buffers(struct io_uring *ring,
+ const struct iovec *iovecs,
+ unsigned nr_iovecs)
+{
+ int ret;
+
+ ret = syscall(__NR_io_uring_register, ring->ring_fd,
+ IORING_REGISTER_BUFFERS, iovecs, nr_iovecs);
+ return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_register_notifications(struct io_uring *ring,
+ unsigned nr,
+ struct io_uring_notification_slot *slots)
+{
+ int ret;
+ struct io_uring_notification_register r = {
+ .nr_slots = nr,
+ .data = (unsigned long)slots,
+ };
+
+ ret = syscall(__NR_io_uring_register, ring->ring_fd,
+ IORING_REGISTER_NOTIFIERS, &r, sizeof(r));
+ return (ret < 0) ? -errno : ret;
+}
+
+static int io_uring_mmap(int fd, struct io_uring_params *p,
+ struct io_uring_sq *sq, struct io_uring_cq *cq)
+{
+ size_t size;
+ void *ptr;
+ int ret;
+
+ sq->ring_sz = p->sq_off.array + p->sq_entries * sizeof(unsigned);
+ ptr = mmap(0, sq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQ_RING);
+ if (ptr == MAP_FAILED)
+ return -errno;
+ sq->khead = ptr + p->sq_off.head;
+ sq->ktail = ptr + p->sq_off.tail;
+ sq->kring_mask = ptr + p->sq_off.ring_mask;
+ sq->kring_entries = ptr + p->sq_off.ring_entries;
+ sq->kflags = ptr + p->sq_off.flags;
+ sq->kdropped = ptr + p->sq_off.dropped;
+ sq->array = ptr + p->sq_off.array;
+
+ size = p->sq_entries * sizeof(struct io_uring_sqe);
+ sq->sqes = mmap(0, size, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_SQES);
+ if (sq->sqes == MAP_FAILED) {
+ ret = -errno;
+err:
+ munmap(sq->khead, sq->ring_sz);
+ return ret;
+ }
+
+ cq->ring_sz = p->cq_off.cqes + p->cq_entries * sizeof(struct io_uring_cqe);
+ ptr = mmap(0, cq->ring_sz, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_POPULATE, fd, IORING_OFF_CQ_RING);
+ if (ptr == MAP_FAILED) {
+ ret = -errno;
+ munmap(sq->sqes, p->sq_entries * sizeof(struct io_uring_sqe));
+ goto err;
+ }
+ cq->khead = ptr + p->cq_off.head;
+ cq->ktail = ptr + p->cq_off.tail;
+ cq->kring_mask = ptr + p->cq_off.ring_mask;
+ cq->kring_entries = ptr + p->cq_off.ring_entries;
+ cq->koverflow = ptr + p->cq_off.overflow;
+ cq->cqes = ptr + p->cq_off.cqes;
+ return 0;
+}
+
+static int io_uring_queue_init(unsigned entries, struct io_uring *ring,
+ unsigned flags)
+{
+ struct io_uring_params p;
+ int fd, ret;
+
+ memset(ring, 0, sizeof(*ring));
+ memset(&p, 0, sizeof(p));
+ p.flags = flags;
+
+ fd = io_uring_setup(entries, &p);
+ if (fd < 0)
+ return fd;
+ ret = io_uring_mmap(fd, &p, &ring->sq, &ring->cq);
+ if (!ret)
+ ring->ring_fd = fd;
+ else
+ close(fd);
+ return ret;
+}
+
+static int io_uring_submit(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+ const unsigned mask = *sq->kring_mask;
+ unsigned ktail, submitted, to_submit;
+ int ret;
+
+ read_barrier();
+ if (*sq->khead != *sq->ktail) {
+ submitted = *sq->kring_entries;
+ goto submit;
+ }
+ if (sq->sqe_head == sq->sqe_tail)
+ return 0;
+
+ ktail = *sq->ktail;
+ to_submit = sq->sqe_tail - sq->sqe_head;
+ for (submitted = 0; submitted < to_submit; submitted++) {
+ read_barrier();
+ sq->array[ktail++ & mask] = sq->sqe_head++ & mask;
+ }
+ if (!submitted)
+ return 0;
+
+ if (*sq->ktail != ktail) {
+ write_barrier();
+ *sq->ktail = ktail;
+ write_barrier();
+ }
+submit:
+ ret = io_uring_enter(ring->ring_fd, submitted, 0,
+ IORING_ENTER_GETEVENTS, NULL);
+ return ret < 0 ? -errno : ret;
+}
+
+static inline void io_uring_prep_send(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags)
+{
+ memset(sqe, 0, sizeof(*sqe));
+ sqe->opcode = (__u8) IORING_OP_SEND;
+ sqe->fd = sockfd;
+ sqe->addr = (unsigned long) buf;
+ sqe->len = len;
+ sqe->msg_flags = (__u32) flags;
+}
+
+static inline void io_uring_prep_sendzc(struct io_uring_sqe *sqe, int sockfd,
+ const void *buf, size_t len, int flags,
+ unsigned slot_idx, unsigned zc_flags)
+{
+ io_uring_prep_send(sqe, sockfd, buf, len, flags);
+ sqe->opcode = (__u8) IORING_OP_SENDZC;
+ sqe->notification_idx = slot_idx;
+ sqe->ioprio = zc_flags;
+}
+
+static struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring)
+{
+ struct io_uring_sq *sq = &ring->sq;
+
+ if (sq->sqe_tail + 1 - sq->sqe_head > *sq->kring_entries)
+ return NULL;
+ return &sq->sqes[sq->sqe_tail++ & *sq->kring_mask];
+}
+
+static int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr)
+{
+ struct io_uring_cq *cq = &ring->cq;
+ const unsigned mask = *cq->kring_mask;
+ unsigned head = *cq->khead;
+ int ret;
+
+ *cqe_ptr = NULL;
+ do {
+ read_barrier();
+ if (head != *cq->ktail) {
+ *cqe_ptr = &cq->cqes[head & mask];
+ break;
+ }
+ ret = io_uring_enter(ring->ring_fd, 0, 1,
+ IORING_ENTER_GETEVENTS, NULL);
+ if (ret < 0)
+ return -errno;
+ } while (1);
+
+ return 0;
+}
+
+static inline void io_uring_cqe_seen(struct io_uring *ring)
+{
+ *(&ring->cq)->khead += 1;
+ write_barrier();
+}
+
+static unsigned long gettimeofday_ms(void)
+{
+ struct timeval tv;
+
+ gettimeofday(&tv, NULL);
+ return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_setsockopt(int fd, int level, int optname, int val)
+{
+ if (setsockopt(fd, level, optname, &val, sizeof(val)))
+ error(1, errno, "setsockopt %d.%d: %d", level, optname, val);
+}
+
+static int do_setup_tx(int domain, int type, int protocol)
+{
+ int fd;
+
+ fd = socket(domain, type, protocol);
+ if (fd == -1)
+ error(1, errno, "socket t");
+
+ do_setsockopt(fd, SOL_SOCKET, SO_SNDBUF, 1 << 21);
+
+ if (connect(fd, (void *) &cfg_dst_addr, cfg_alen))
+ error(1, errno, "connect");
+ return fd;
+}
+
+static void do_tx(int domain, int type, int protocol)
+{
+ struct io_uring_notification_slot b[1] = {{.tag = NOTIF_TAG}};
+ struct io_uring_sqe *sqe;
+ struct io_uring_cqe *cqe;
+ unsigned long packets = 0, bytes = 0;
+ struct io_uring ring;
+ struct iovec iov;
+ uint64_t tstop;
+ int i, fd, ret;
+ int compl_cqes = 0;
+
+ fd = do_setup_tx(domain, type, protocol);
+
+ ret = io_uring_queue_init(512, &ring, 0);
+ if (ret)
+ error(1, ret, "io_uring: queue init");
+
+ ret = io_uring_register_notifications(&ring, 1, b);
+ if (ret)
+ error(1, ret, "io_uring: tx ctx registration");
+
+ iov.iov_base = payload;
+ iov.iov_len = cfg_payload_len;
+
+ ret = io_uring_register_buffers(&ring, &iov, 1);
+ if (ret)
+ error(1, ret, "io_uring: buffer registration");
+
+ tstop = gettimeofday_ms() + cfg_runtime_ms;
+ do {
+ if (cfg_cork)
+ do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 1);
+
+ for (i = 0; i < cfg_nr_reqs; i++) {
+ unsigned zc_flags = 0;
+ unsigned buf_idx = 0;
+ unsigned slot_idx = 0;
+ unsigned mode = cfg_mode;
+ unsigned msg_flags = 0;
+
+ if (cfg_mode == MODE_MIXED)
+ mode = rand() % 3;
+
+ sqe = io_uring_get_sqe(&ring);
+
+ if (mode == MODE_NONZC) {
+ io_uring_prep_send(sqe, fd, payload,
+ cfg_payload_len, msg_flags);
+ sqe->user_data = NONZC_TAG;
+ } else {
+ if (cfg_flush) {
+ zc_flags |= IORING_SENDZC_FLUSH;
+ compl_cqes++;
+ }
+ io_uring_prep_sendzc(sqe, fd, payload,
+ cfg_payload_len,
+ msg_flags, slot_idx, zc_flags);
+ if (mode == MODE_ZC_FIXED) {
+ sqe->ioprio |= IORING_SENDZC_FIXED_BUF;
+ sqe->buf_index = buf_idx;
+ }
+ sqe->user_data = ZC_TAG;
+ }
+ }
+
+ ret = io_uring_submit(&ring);
+ if (ret != cfg_nr_reqs)
+ error(1, ret, "submit");
+
+ for (i = 0; i < cfg_nr_reqs; i++) {
+ ret = io_uring_wait_cqe(&ring, &cqe);
+ if (ret)
+ error(1, ret, "wait cqe");
+
+ if (cqe->user_data == NOTIF_TAG) {
+ compl_cqes--;
+ i--;
+ } else if (cqe->user_data != NONZC_TAG &&
+ cqe->user_data != ZC_TAG) {
+ error(1, cqe->res, "invalid user_data");
+ } else if (cqe->res <= 0 && cqe->res != -EAGAIN) {
+ error(1, cqe->res, "send failed");
+ } else {
+ if (cqe->res > 0) {
+ packets++;
+ bytes += cqe->res;
+ }
+ /* failed requests don't flush */
+ if (cfg_flush &&
+ cqe->res <= 0 &&
+ cqe->user_data == ZC_TAG)
+ compl_cqes--;
+ }
+ io_uring_cqe_seen(&ring);
+ }
+ if (cfg_cork)
+ do_setsockopt(fd, IPPROTO_UDP, UDP_CORK, 0);
+ } while (gettimeofday_ms() < tstop);
+
+ if (close(fd))
+ error(1, errno, "close");
+
+ fprintf(stderr, "tx=%lu (MB=%lu), tx/s=%lu (MB/s=%lu)\n",
+ packets, bytes >> 20,
+ packets / (cfg_runtime_ms / 1000),
+ (bytes >> 20) / (cfg_runtime_ms / 1000));
+
+ while (compl_cqes) {
+ ret = io_uring_wait_cqe(&ring, &cqe);
+ if (ret)
+ error(1, ret, "wait cqe");
+ io_uring_cqe_seen(&ring);
+ compl_cqes--;
+ }
+}
+
+static void do_test(int domain, int type, int protocol)
+{
+ int i;
+
+ for (i = 0; i < IP_MAXPACKET; i++)
+ payload[i] = 'a' + (i % 26);
+ do_tx(domain, type, protocol);
+}
+
+static void usage(const char *filepath)
+{
+ error(1, 0, "Usage: %s [-f] [-n<N>] [-z0] [-s<payload size>] "
+ "(-4|-6) [-t<time s>] -D<dst_ip> udp", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+ const int max_payload_len = sizeof(payload) -
+ sizeof(struct ipv6hdr) -
+ sizeof(struct tcphdr) -
+ 40 /* max tcp options */;
+ struct sockaddr_in6 *addr6 = (void *) &cfg_dst_addr;
+ struct sockaddr_in *addr4 = (void *) &cfg_dst_addr;
+ char *daddr = NULL;
+ int c;
+
+ if (argc <= 1)
+ usage(argv[0]);
+ cfg_payload_len = max_payload_len;
+
+ while ((c = getopt(argc, argv, "46D:p:s:t:n:fc:m:")) != -1) {
+ switch (c) {
+ case '4':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET;
+ cfg_alen = sizeof(struct sockaddr_in);
+ break;
+ case '6':
+ if (cfg_family != PF_UNSPEC)
+ error(1, 0, "Pass one of -4 or -6");
+ cfg_family = PF_INET6;
+ cfg_alen = sizeof(struct sockaddr_in6);
+ break;
+ case 'D':
+ daddr = optarg;
+ break;
+ case 'p':
+ cfg_port = strtoul(optarg, NULL, 0);
+ break;
+ case 's':
+ cfg_payload_len = strtoul(optarg, NULL, 0);
+ break;
+ case 't':
+ cfg_runtime_ms = 200 + strtoul(optarg, NULL, 10) * 1000;
+ break;
+ case 'n':
+ cfg_nr_reqs = strtoul(optarg, NULL, 0);
+ break;
+ case 'f':
+ cfg_flush = 1;
+ break;
+ case 'c':
+ cfg_cork = strtol(optarg, NULL, 0);
+ break;
+ case 'm':
+ cfg_mode = strtol(optarg, NULL, 0);
+ break;
+ }
+ }
+
+ switch (cfg_family) {
+ case PF_INET:
+ memset(addr4, 0, sizeof(*addr4));
+ addr4->sin_family = AF_INET;
+ addr4->sin_port = htons(cfg_port);
+ if (daddr &&
+ inet_pton(AF_INET, daddr, &(addr4->sin_addr)) != 1)
+ error(1, 0, "ipv4 parse error: %s", daddr);
+ break;
+ case PF_INET6:
+ memset(addr6, 0, sizeof(*addr6));
+ addr6->sin6_family = AF_INET6;
+ addr6->sin6_port = htons(cfg_port);
+ if (daddr &&
+ inet_pton(AF_INET6, daddr, &(addr6->sin6_addr)) != 1)
+ error(1, 0, "ipv6 parse error: %s", daddr);
+ break;
+ default:
+ error(1, 0, "illegal domain");
+ }
+
+ if (cfg_payload_len > max_payload_len)
+ error(1, 0, "-s: payload exceeds max (%d)", max_payload_len);
+ if (cfg_mode == MODE_NONZC && cfg_flush)
+ error(1, 0, "-f: only zerocopy modes support notifications");
+ if (optind != argc - 1)
+ usage(argv[0]);
+}
+
+int main(int argc, char **argv)
+{
+ const char *cfg_test = argv[argc - 1];
+
+ parse_opts(argc, argv);
+
+ if (!strcmp(cfg_test, "tcp"))
+ do_test(cfg_family, SOCK_STREAM, 0);
+ else if (!strcmp(cfg_test, "udp"))
+ do_test(cfg_family, SOCK_DGRAM, 0);
+ else
+ error(1, 0, "unknown cfg_test %s", cfg_test);
+ return 0;
+}
diff --git a/tools/testing/selftests/net/io_uring_zerocopy_tx.sh b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
new file mode 100755
index 000000000000..6a65e4437640
--- /dev/null
+++ b/tools/testing/selftests/net/io_uring_zerocopy_tx.sh
@@ -0,0 +1,131 @@
+#!/bin/bash
+#
+# Send data between two processes across namespaces
+# Run twice: once without and once with zerocopy
+
+set -e
+
+readonly DEV="veth0"
+readonly DEV_MTU=65535
+readonly BIN_TX="./io_uring_zerocopy_tx"
+readonly BIN_RX="./msg_zerocopy"
+
+readonly RAND="$(mktemp -u XXXXXX)"
+readonly NSPREFIX="ns-${RAND}"
+readonly NS1="${NSPREFIX}1"
+readonly NS2="${NSPREFIX}2"
+
+readonly SADDR4='192.168.1.1'
+readonly DADDR4='192.168.1.2'
+readonly SADDR6='fd::1'
+readonly DADDR6='fd::2'
+
+readonly path_sysctl_mem="net.core.optmem_max"
+
+# No arguments: automated test
+if [[ "$#" -eq "0" ]]; then
+ IPs=( "4" "6" )
+ protocols=( "tcp" "udp" )
+
+ for IP in "${IPs[@]}"; do
+ for proto in "${protocols[@]}"; do
+ for mode in $(seq 1 3); do
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -f
+ $0 "$IP" "$proto" -m "$mode" -t 1 -n 32 -c -f
+ done
+ done
+ done
+
+ echo "OK. All tests passed"
+ exit 0
+fi
+
+# Argument parsing
+if [[ "$#" -lt "2" ]]; then
+ echo "Usage: $0 [4|6] [tcp|udp|raw|raw_hdrincl|packet|packet_dgram] <args>"
+ exit 1
+fi
+
+readonly IP="$1"
+shift
+readonly TXMODE="$1"
+shift
+readonly EXTRA_ARGS="$@"
+
+# Argument parsing: configure addresses
+if [[ "${IP}" == "4" ]]; then
+ readonly SADDR="${SADDR4}"
+ readonly DADDR="${DADDR4}"
+elif [[ "${IP}" == "6" ]]; then
+ readonly SADDR="${SADDR6}"
+ readonly DADDR="${DADDR6}"
+else
+ echo "Invalid IP version ${IP}"
+ exit 1
+fi
+
+# Argument parsing: select receive mode
+#
+# This differs from send mode for
+# - packet: use raw recv, because packet receives skb clones
+# - raw_hdrinc: use raw recv, because hdrincl is a tx-only option
+case "${TXMODE}" in
+'packet' | 'packet_dgram' | 'raw_hdrincl')
+ RXMODE='raw'
+ ;;
+*)
+ RXMODE="${TXMODE}"
+ ;;
+esac
+
+# Start of state changes: install cleanup handler
+save_sysctl_mem="$(sysctl -n ${path_sysctl_mem})"
+
+cleanup() {
+ ip netns del "${NS2}"
+ ip netns del "${NS1}"
+ sysctl -w -q "${path_sysctl_mem}=${save_sysctl_mem}"
+}
+
+trap cleanup EXIT
+
+# Configure system settings
+sysctl -w -q "${path_sysctl_mem}=1000000"
+
+# Create virtual ethernet pair between network namespaces
+ip netns add "${NS1}"
+ip netns add "${NS2}"
+
+ip link add "${DEV}" mtu "${DEV_MTU}" netns "${NS1}" type veth \
+ peer name "${DEV}" mtu "${DEV_MTU}" netns "${NS2}"
+
+# Bring the devices up
+ip -netns "${NS1}" link set "${DEV}" up
+ip -netns "${NS2}" link set "${DEV}" up
+
+# Set fixed MAC addresses on the devices
+ip -netns "${NS1}" link set dev "${DEV}" address 02:02:02:02:02:02
+ip -netns "${NS2}" link set dev "${DEV}" address 06:06:06:06:06:06
+
+# Add fixed IP addresses to the devices
+ip -netns "${NS1}" addr add 192.168.1.1/24 dev "${DEV}"
+ip -netns "${NS2}" addr add 192.168.1.2/24 dev "${DEV}"
+ip -netns "${NS1}" addr add fd::1/64 dev "${DEV}" nodad
+ip -netns "${NS2}" addr add fd::2/64 dev "${DEV}" nodad
+
+# Optionally disable sg or csum offload to test edge cases
+# ip netns exec "${NS1}" ethtool -K "${DEV}" sg off
+
+do_test() {
+ local readonly ARGS="$1"
+
+ echo "ipv${IP} ${TXMODE} ${ARGS}"
+ ip netns exec "${NS2}" "${BIN_RX}" "-${IP}" -t 2 -C 2 -S "${SADDR}" -D "${DADDR}" -r "${RXMODE}" &
+ sleep 0.2
+ ip netns exec "${NS1}" "${BIN_TX}" "-${IP}" -t 1 -D "${DADDR}" ${ARGS} "${TXMODE}"
+ wait
+}
+
+do_test "${EXTRA_ARGS}"
+echo ok
--
2.36.1

2022-06-28 19:05:59

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 27/29] io_uring: allow to override zc tag on flush

Add a new sendzc flag, IORING_SENDZC_COPY_TAG. When set and the request
flushing a notification, it'll set the notification tag to
sqe->user_data. This adds a bit more flexibility allowing to specify
notification tags on per-request basis.

One use cases is combining the new flag with IOSQE_CQE_SKIP_SUCCESS,
so either the request fails and we expect an CQE with a failure and no
notification, or it succedees, then there will be no request completion
but only a zc notification with an overriden tag. In other words, in the
described scheme it posts only one CQE with user_data set to the current
requests sqe->user_data.

note 1: the flat has no effect if nothing is flushed, e.g. there was
no IORING_SENDZC_FLUSH or the request failed.

note 2: copying sqe->user_data may be not ideal, but we don't have
extra space in SQE to keep a second tag/user_data.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 9 +++++++--
include/uapi/linux/io_uring.h | 1 +
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f5fe2ab5622a..08c98a4d9bd2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6581,7 +6581,8 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

-#define IO_SENDZC_VALID_FLAGS (IORING_SENDZC_FIXED_BUF|IORING_SENDZC_FLUSH)
+#define IO_SENDZC_VALID_FLAGS (IORING_SENDZC_FIXED_BUF | IORING_SENDZC_FLUSH | \
+ IORING_SENDZC_OVERRIDE_TAG)

static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
@@ -6686,7 +6687,11 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
ret = sock_sendmsg(sock, &msg);

if (likely(ret >= min_ret)) {
- if (req->msgzc.zc_flags & IORING_SENDZC_FLUSH)
+ unsigned zc_flags = req->msgzc.zc_flags;
+
+ if (zc_flags & IORING_SENDZC_OVERRIDE_TAG)
+ notif->tag = req->cqe.user_data;
+ if (zc_flags & IORING_SENDZC_FLUSH)
io_notif_slot_flush_submit(notif_slot, 0);
} else {
if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7d77d90a5f8a..7533387f25d3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -280,6 +280,7 @@ enum {
enum {
IORING_SENDZC_FIXED_BUF = (1U << 0),
IORING_SENDZC_FLUSH = (1U << 1),
+ IORING_SENDZC_OVERRIDE_TAG = (1U << 2),
};

/*
--
2.36.1

2022-06-28 19:06:18

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 28/29] io_uring: batch submission notif referencing

Batch get notifier references and use ->msg_ubuf_ref to hand off one ref
per sendzc request to the network layer. This ammortises the submission
side net_zcopy_get() atomics. Note that we always keep at least one
reference in the cache because we do only post send checks on
whether ->msg_ubuf_ref was consumed or not.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 08c98a4d9bd2..78990a130b66 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -374,6 +374,7 @@ struct io_ev_fd {
};

#define IO_NOTIF_MAX_SLOTS (1U << 10)
+#define IO_NOTIF_REF_CACHE_NR 64

struct io_notif {
struct ubuf_info uarg;
@@ -384,6 +385,8 @@ struct io_notif {
u64 tag;
/* see struct io_notif_slot::seq */
u32 seq;
+ /* extra uarg->refcnt refs */
+ int cached_refs;
/* hook into ctx->notif_list and ctx->notif_list_locked */
struct list_head cache_node;

@@ -2949,14 +2952,30 @@ static struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,

notif->seq = slot->seq++;
notif->tag = slot->tag;
+ notif->cached_refs = IO_NOTIF_REF_CACHE_NR;
/* master ref owned by io_notif_slot, will be dropped on flush */
- refcount_set(&notif->uarg.refcnt, 1);
+ refcount_set(&notif->uarg.refcnt, IO_NOTIF_REF_CACHE_NR + 1);
percpu_ref_get(&ctx->refs);
notif->rsrc_node = ctx->rsrc_node;
io_charge_rsrc_node(ctx);
return notif;
}

+static inline void io_notif_consume_ref(struct io_notif *notif)
+ __must_hold(&ctx->uring_lock)
+{
+ notif->cached_refs--;
+
+ /*
+ * Issue sends without looking at notif->cached_refs first, so we
+ * always have to have at least one ref cached
+ */
+ if (unlikely(!notif->cached_refs)) {
+ refcount_add(IO_NOTIF_REF_CACHE_NR, &notif->uarg.refcnt);
+ notif->cached_refs += IO_NOTIF_REF_CACHE_NR;
+ }
+}
+
static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
struct io_notif_slot *slot)
{
@@ -2979,13 +2998,15 @@ static void io_notif_slot_flush(struct io_notif_slot *slot)
__must_hold(&ctx->uring_lock)
{
struct io_notif *notif = slot->notif;
+ int refs = notif->cached_refs + 1;

slot->notif = NULL;
+ notif->cached_refs = 0;

if (WARN_ON_ONCE(in_interrupt()))
return;
- /* drop slot's master ref */
- if (refcount_dec_and_test(&notif->uarg.refcnt))
+ /* drop all cached refs and the slot's master ref */
+ if (refcount_sub_and_test(refs, &notif->uarg.refcnt))
io_notif_complete(notif);
}

@@ -6653,6 +6674,7 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_controllen = 0;
msg.msg_namelen = 0;
msg.msg_managed_data = 1;
+ msg.msg_ubuf_ref = 1;

if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
ret = __io_import_fixed(WRITE, &msg.msg_iter, req->imu,
@@ -6686,6 +6708,10 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_ubuf = &notif->uarg;
ret = sock_sendmsg(sock, &msg);

+ /* check if the send consumed an additional ref */
+ if (likely(!msg.msg_ubuf_ref))
+ io_notif_consume_ref(notif);
+
if (likely(ret >= min_ret)) {
unsigned zc_flags = req->msgzc.zc_flags;

--
2.36.1

2022-06-28 19:09:41

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 00/29] io_uring zerocopy send

On 6/28/22 19:56, Pavel Begunkov wrote:
> The third iteration of patches for zerocopy io_uring sends. I fixed
> all known issues since the previous version and reshuffled io_uring
> patches, but the net/ code didn't change much. I think it's ready
> and will send it as a non-RFC soon.

Please ignore, it's a wrong version and shouldn't be RFC

--
Pavel Begunkov

2022-06-28 19:15:57

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 12/29] tcp: kill extra io_uring's uarg refcounting

io_uring guarantees that passed in uarg stays alive until we return from
sendmsg, so no need to temporarily refcount-pin it in
tcp_sendmsg_locked().

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv4/tcp.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 832c1afcdbe7..3482c934eec8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1207,7 +1207,6 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)

if (msg->msg_ubuf) {
uarg = msg->msg_ubuf;
- net_zcopy_get(uarg);
zc = sk->sk_route_caps & NETIF_F_SG;
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
@@ -1437,7 +1436,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
}
out_nopush:
- net_zcopy_put(uarg);
+ if (uarg && !msg->msg_ubuf)
+ net_zcopy_put(uarg);
return copied + copied_syn;

do_error:
@@ -1446,7 +1446,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copied + copied_syn)
goto out;
out_err:
- net_zcopy_put_abort(uarg, true);
+ if (uarg && !msg->msg_ubuf)
+ net_zcopy_put_abort(uarg, true);
err = sk_stream_error(sk, flags, err);
/* make sure we wake any epoll edge trigger waiter */
if (unlikely(tcp_rtx_and_write_queues_empty(sk) && err == -EAGAIN)) {
--
2.36.1

2022-06-28 19:16:39

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 08/29] skbuff: don't mix ubuf_info of different types

We should not append MSG_ZEROCOPY requests to skbuff with non
MSG_ZEROCOPY ubuf_info, they are not compatible.

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/core/skbuff.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 71870def129c..7e6fcb3cd817 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1222,6 +1222,10 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
const u32 byte_limit = 1 << 19; /* limit to a few TSO */
u32 bytelen, next;

+ /* there might be non MSG_ZEROCOPY users */
+ if (uarg->callback != msg_zerocopy_callback)
+ return NULL;
+
/* realloc only when socket is locked (TCP, UDP cork),
* so uarg->len and sk_zckey access is serialized
*/
--
2.36.1

2022-06-28 19:16:43

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 09/29] ipv4/udp: support zc with managed data

Teach ipv4/udp about managed data. Make it recognise and use
msg->msg_ubuf, and also set/propagate SKBFL_MANAGED_FRAG_REFS
down to skb_zerocopy_iter_dgram().

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv4/ip_output.c | 57 +++++++++++++++++++++++++++++++++-----------
1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 581d1e233260..3fd1bf675598 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1017,18 +1017,35 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;

- if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
- uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
- if (!uarg)
- return -ENOBUFS;
- extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
- if (rt->dst.dev->features & NETIF_F_SG &&
- csummode == CHECKSUM_PARTIAL) {
- paged = true;
- zc = true;
- } else {
- uarg->zerocopy = 0;
- skb_zcopy_set(skb, uarg, &extra_uref);
+ if ((flags & MSG_ZEROCOPY) && length) {
+ struct msghdr *msg = from;
+
+ if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+ if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+ return -EINVAL;
+
+ /* Leave uarg NULL if can't zerocopy, callers should
+ * be able to handle it.
+ */
+ if ((rt->dst.dev->features & NETIF_F_SG) &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ uarg = msg->msg_ubuf;
+ }
+ } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+ uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+ if (!uarg)
+ return -ENOBUFS;
+ extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
+ if (rt->dst.dev->features & NETIF_F_SG &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ } else {
+ uarg->zerocopy = 0;
+ skb_zcopy_set(skb, uarg, &extra_uref);
+ }
}
}

@@ -1192,13 +1209,14 @@ static int __ip_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
- } else if (!uarg || !uarg->zerocopy) {
+ } else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;

err = -ENOMEM;
if (!sk_page_frag_refill(sk, pfrag))
goto error;

+ skb_zcopy_downgrade_managed(skb);
if (!skb_can_coalesce(skb, i, pfrag->page,
pfrag->offset)) {
err = -EMSGSIZE;
@@ -1223,7 +1241,18 @@ static int __ip_append_data(struct sock *sk,
skb->truesize += copy;
wmem_alloc_delta += copy;
} else {
- err = skb_zerocopy_iter_dgram(skb, from, copy);
+ struct msghdr *msg = from;
+
+ if (!skb_shinfo(skb)->nr_frags) {
+ if (msg->msg_managed_data)
+ skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
+ } else {
+ /* appending, don't mix managed and unmanaged */
+ if (!msg->msg_managed_data)
+ skb_zcopy_downgrade_managed(skb);
+ }
+
+ err = skb_zerocopy_iter_dgram(skb, msg, copy);
if (err < 0)
goto error;
}
--
2.36.1

2022-06-28 19:16:49

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 25/29] io_uring: sendzc with fixed buffers

Allow zerocopy sends to use fixed buffers. There is an optimisation for
this case, the network layer don't need to reference the pages, see
SKBFL_MANAGED_FRAG_REFS, so io_uring have to ensure validity of fixed
buffers until the notifier is released.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 39 +++++++++++++++++++++++++++++------
include/uapi/linux/io_uring.h | 7 +++++++
2 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 07d09d06e8ab..70b1f77ac64e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -723,6 +723,7 @@ struct io_sendzc {
size_t len;
u16 slot_idx;
int msg_flags;
+ unsigned zc_flags;
int addr_len;
void __user *addr;
};
@@ -6580,11 +6581,14 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

+#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FIXED_BUF
+
static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_sendzc *zc = &req->msgzc;
+ struct io_ring_ctx *ctx = req->ctx;

- if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->__pad2[0]))
+ if (READ_ONCE(sqe->__pad2[0]))
return -EINVAL;

zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -6596,6 +6600,20 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
zc->addr_len = READ_ONCE(sqe->addr_len);

+ zc->zc_flags = READ_ONCE(sqe->ioprio);
+ if (req->msgzc.zc_flags & ~IO_SENDZC_VALID_FLAGS)
+ return -EINVAL;
+
+ if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+ unsigned idx = READ_ONCE(sqe->buf_index);
+
+ if (unlikely(idx >= ctx->nr_user_bufs))
+ return -EFAULT;
+ idx = array_index_nospec(idx, ctx->nr_user_bufs);
+ req->imu = READ_ONCE(ctx->user_bufs[idx]);
+ io_req_set_rsrc_node(req, ctx, 0);
+ }
+
#ifdef CONFIG_COMPAT
if (req->ctx->compat)
zc->msg_flags |= MSG_CMSG_COMPAT;
@@ -6633,12 +6651,21 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
- msg.msg_managed_data = 0;
+ msg.msg_managed_data = 1;

- ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
- if (unlikely(ret))
- return ret;
- mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+ if (req->msgzc.zc_flags & IORING_SENDZC_FIXED_BUF) {
+ ret = __io_import_fixed(WRITE, &msg.msg_iter, req->imu,
+ (u64)zc->buf, zc->len);
+ if (unlikely(ret))
+ return ret;
+ } else {
+ msg.msg_managed_data = 0;
+ ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
+ &msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+ mm_account_pinned_pages(&notif->uarg.mmp, zc->len);
+ }

if (zc->addr) {
ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 689aa1444cd4..69100aa71448 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -274,6 +274,13 @@ enum {
IORING_RSRC_UPDATE_NOTIF,
};

+/*
+ * IORING_OP_SENDZC flags
+ */
+enum {
+ IORING_SENDZC_FIXED_BUF = (1U << 0),
+};
+
/*
* IO completion data structure (Completion Queue Entry)
*/
--
2.36.1

2022-06-28 19:17:11

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 18/29] io_uring: add notification slot registration

Let the userspace to register and unregister notification slots.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 54 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 16 +++++++++++
2 files changed, 70 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9ade0ea8552b..22427893549a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -94,6 +94,8 @@
#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)
#define IORING_SQPOLL_CAP_ENTRIES_VALUE 8

+#define IORING_MAX_NOTIF_SLOTS (1U << 10)
+
/* only define max */
#define IORING_MAX_FIXED_FILES (1U << 20)
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
@@ -2972,6 +2974,49 @@ static __cold int io_notif_unregister(struct io_ring_ctx *ctx)
kvfree(ctx->notif_slots);
ctx->notif_slots = NULL;
ctx->nr_notif_slots = 0;
+ io_notif_cache_purge(ctx);
+ return 0;
+}
+
+static __cold int io_notif_register(struct io_ring_ctx *ctx,
+ void __user *arg, unsigned int size)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_uring_notification_slot __user *slots;
+ struct io_uring_notification_slot slot;
+ struct io_uring_notification_register reg;
+ unsigned i;
+
+ if (ctx->nr_notif_slots)
+ return -EBUSY;
+ if (size != sizeof(reg))
+ return -EINVAL;
+ if (copy_from_user(&reg, arg, sizeof(reg)))
+ return -EFAULT;
+ if (!reg.nr_slots || reg.nr_slots > IORING_MAX_NOTIF_SLOTS)
+ return -EINVAL;
+ if (reg.resv || reg.resv2 || reg.resv3)
+ return -EINVAL;
+
+ slots = u64_to_user_ptr(reg.data);
+ ctx->notif_slots = kvcalloc(reg.nr_slots, sizeof(ctx->notif_slots[0]),
+ GFP_KERNEL_ACCOUNT);
+ if (!ctx->notif_slots)
+ return -ENOMEM;
+
+ for (i = 0; i < reg.nr_slots; i++, ctx->nr_notif_slots++) {
+ struct io_notif_slot *notif_slot = &ctx->notif_slots[i];
+
+ if (copy_from_user(&slot, &slots[i], sizeof(slot))) {
+ io_notif_unregister(ctx);
+ return -EFAULT;
+ }
+ if (slot.resv[0] | slot.resv[1] | slot.resv[2]) {
+ io_notif_unregister(ctx);
+ return -EINVAL;
+ }
+ notif_slot->tag = slot.tag;
+ }
return 0;
}

@@ -13378,6 +13423,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_unregister_pbuf_ring(ctx, arg);
break;
+ case IORING_REGISTER_NOTIFIERS:
+ ret = io_notif_register(ctx, arg, nr_args);
+ break;
+ case IORING_UNREGISTER_NOTIFIERS:
+ ret = -EINVAL;
+ if (arg || nr_args)
+ break;
+ ret = io_notif_unregister(ctx);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 53e7dae92e42..96193bbda2e4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -417,6 +417,9 @@ enum {
IORING_REGISTER_PBUF_RING = 22,
IORING_UNREGISTER_PBUF_RING = 23,

+ IORING_REGISTER_NOTIFIERS = 24,
+ IORING_UNREGISTER_NOTIFIERS = 25,
+
/* this goes last */
IORING_REGISTER_LAST
};
@@ -463,6 +466,19 @@ struct io_uring_rsrc_update2 {
__u32 resv2;
};

+struct io_uring_notification_slot {
+ __u64 tag;
+ __u64 resv[3];
+};
+
+struct io_uring_notification_register {
+ __u32 nr_slots;
+ __u32 resv;
+ __u64 resv2;
+ __u64 data;
+ __u64 resv3;
+};
+
/* Skip updating fd indexes set to this value in the fd table */
#define IORING_REGISTER_FILES_SKIP (-2)

--
2.36.1

2022-06-28 19:17:33

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 13/29] net: let callers provide extra ubuf_info refs

Subsystems providing external ubufs to the net layer, i.e. ->msg_ubuf,
might have a better way to refcount it. For instance, io_uring can
ammortise ref allocation.

Add a way to pass one extra ref to ->msg_ubuf into the network stack by
setting struct msghdr::msg_ubuf_ref bit. Whoever consumes the ref should
clear the flat. If not consumed, it's the responsibility of the caller
to put it. Make __ip{,6}_append_data() to use it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/socket.h | 1 +
net/ipv4/ip_output.c | 3 +++
net/ipv6/ip6_output.c | 3 +++
3 files changed, 7 insertions(+)

diff --git a/include/linux/socket.h b/include/linux/socket.h
index ba84ee614d5a..ae869dee82de 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -72,6 +72,7 @@ struct msghdr {
* to be non-NULL.
*/
bool msg_managed_data : 1;
+ bool msg_ubuf_ref : 1;
unsigned int msg_flags; /* flags on received message */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3fd1bf675598..d73ec0a73bd2 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1032,6 +1032,9 @@ static int __ip_append_data(struct sock *sk,
paged = true;
zc = true;
uarg = msg->msg_ubuf;
+ /* we might've been given a free ref */
+ extra_uref = msg->msg_ubuf_ref;
+ msg->msg_ubuf_ref = false;
}
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index f4138ce6eda3..90bbaab21dbc 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1557,6 +1557,9 @@ static int __ip6_append_data(struct sock *sk,
paged = true;
zc = true;
uarg = msg->msg_ubuf;
+ /* we might've been given a free ref */
+ extra_uref = msg->msg_ubuf_ref;
+ msg->msg_ubuf_ref = false;
}
} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
--
2.36.1

2022-06-28 19:18:05

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 15/29] io_uring: add zc notification infrastructure

Add internal part of send zerocopy notifications. There are two main
structures, the first one is struct io_notif, which carries inside
struct ubuf_info and maps 1:1 to it. io_uring will be binding a number
of zerocopy send requests to it and ask to complete (aka flush) it. When
flushed and all attached requests and skbs complete, it'll generate one
and only one CQE. There are intended to be passed into the network layer
as struct msghdr::msg_ubuf.

The second concept is notification slots. The userspace will be able to
register an array of slots and subsequently addressing them by the index
in the array. Slots are independent of each other. Each slot can have
only one notifier at a time (called active notifier) but many notifiers
during the lifetime. When active, a notifier not going to post any
completion but the userspace can attach requests to it by specifying
the corresponding slot while issueing send zc requests. Eventually, the
userspace will want to "flush" the notifier losing any way to attach
new requests to it, however it can use the next atomatically added
notifier of this slot or of any other slot.

When the network layer is done with all enqueued skbs attached to a
notifier and doesn't need the specified in them user data, the flushed
notifier will post a CQE.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 156 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e47629adf3f7..7d058deb5f73 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -371,6 +371,43 @@ struct io_ev_fd {
struct rcu_head rcu;
};

+#define IO_NOTIF_MAX_SLOTS (1U << 10)
+
+struct io_notif {
+ struct ubuf_info uarg;
+ struct io_ring_ctx *ctx;
+
+ /* cqe->user_data, io_notif_slot::tag if not overridden */
+ u64 tag;
+ /* see struct io_notif_slot::seq */
+ u32 seq;
+
+ union {
+ struct callback_head task_work;
+ struct work_struct commit_work;
+ };
+};
+
+struct io_notif_slot {
+ /*
+ * Current/active notifier. A slot holds only one active notifier at a
+ * time and keeps one reference to it. Flush releases the reference and
+ * lazily replaces it with a new notifier.
+ */
+ struct io_notif *notif;
+
+ /*
+ * Default ->user_data for this slot notifiers CQEs
+ */
+ u64 tag;
+ /*
+ * Notifiers of a slot live in generations, we create a new notifier
+ * only after flushing the previous one. Track the sequential number
+ * for all notifiers and copy it into notifiers's cqe->cflags
+ */
+ u32 seq;
+};
+
#define BGID_ARRAY 64

struct io_ring_ctx {
@@ -423,6 +460,8 @@ struct io_ring_ctx {
unsigned nr_user_files;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;
+ struct io_notif_slot *notif_slots;
+ unsigned nr_notif_slots;

struct io_submit_state submit_state;

@@ -2749,6 +2788,121 @@ static __cold void io_free_req(struct io_kiocb *req)
spin_unlock(&ctx->completion_lock);
}

+static void __io_notif_complete_tw(struct callback_head *cb)
+{
+ struct io_notif *notif = container_of(cb, struct io_notif, task_work);
+ struct io_ring_ctx *ctx = notif->ctx;
+
+ spin_lock(&ctx->completion_lock);
+ io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq);
+ io_commit_cqring(ctx);
+ spin_unlock(&ctx->completion_lock);
+ io_cqring_ev_posted(ctx);
+
+ percpu_ref_put(&ctx->refs);
+ kfree(notif);
+}
+
+static inline void io_notif_complete(struct io_notif *notif)
+{
+ __io_notif_complete_tw(&notif->task_work);
+}
+
+static void io_notif_complete_wq(struct work_struct *work)
+{
+ struct io_notif *notif = container_of(work, struct io_notif, commit_work);
+
+ io_notif_complete(notif);
+}
+
+static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,
+ struct ubuf_info *uarg,
+ bool success)
+{
+ struct io_notif *notif = container_of(uarg, struct io_notif, uarg);
+
+ if (!refcount_dec_and_test(&uarg->refcnt))
+ return;
+ INIT_WORK(&notif->commit_work, io_notif_complete_wq);
+ queue_work(system_unbound_wq, &notif->commit_work);
+}
+
+static struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
+ struct io_notif_slot *slot)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_notif *notif;
+
+ notif = kzalloc(sizeof(*notif), GFP_ATOMIC | __GFP_ACCOUNT);
+ if (!notif)
+ return NULL;
+
+ notif->seq = slot->seq++;
+ notif->tag = slot->tag;
+ notif->ctx = ctx;
+ notif->uarg.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+ notif->uarg.callback = io_uring_tx_zerocopy_callback;
+ /* master ref owned by io_notif_slot, will be dropped on flush */
+ refcount_set(&notif->uarg.refcnt, 1);
+ percpu_ref_get(&ctx->refs);
+ return notif;
+}
+
+__attribute__((unused))
+static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
+ struct io_notif_slot *slot)
+{
+ if (!slot->notif)
+ slot->notif = io_alloc_notif(ctx, slot);
+ return slot->notif;
+}
+
+__attribute__((unused))
+static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
+ int idx)
+ __must_hold(&ctx->uring_lock)
+{
+ if (idx >= ctx->nr_notif_slots)
+ return NULL;
+ idx = array_index_nospec(idx, ctx->nr_notif_slots);
+ return &ctx->notif_slots[idx];
+}
+
+static void io_notif_slot_flush(struct io_notif_slot *slot)
+ __must_hold(&ctx->uring_lock)
+{
+ struct io_notif *notif = slot->notif;
+
+ slot->notif = NULL;
+
+ if (WARN_ON_ONCE(in_interrupt()))
+ return;
+ /* drop slot's master ref */
+ if (refcount_dec_and_test(&notif->uarg.refcnt))
+ io_notif_complete(notif);
+}
+
+static __cold int io_notif_unregister(struct io_ring_ctx *ctx)
+ __must_hold(&ctx->uring_lock)
+{
+ int i;
+
+ if (!ctx->notif_slots)
+ return -ENXIO;
+
+ for (i = 0; i < ctx->nr_notif_slots; i++) {
+ struct io_notif_slot *slot = &ctx->notif_slots[i];
+
+ if (slot->notif)
+ io_notif_slot_flush(slot);
+ }
+
+ kvfree(ctx->notif_slots);
+ ctx->notif_slots = NULL;
+ ctx->nr_notif_slots = 0;
+ return 0;
+}
+
static inline void io_remove_next_linked(struct io_kiocb *req)
{
struct io_kiocb *nxt = req->link;
@@ -11174,6 +11328,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
}
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
+ WARN_ON_ONCE(ctx->notif_slots || ctx->nr_notif_slots);

io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
@@ -11368,6 +11523,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
__io_cqring_overflow_flush(ctx, true);
xa_for_each(&ctx->personalities, index, creds)
io_unregister_personality(ctx, index);
+ io_notif_unregister(ctx);
mutex_unlock(&ctx->uring_lock);

/* failed during ring init, it couldn't have issued any requests */
--
2.36.1

2022-06-28 19:18:11

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 10/29] ipv6/udp: support zc with managed data

Just as with ipv4/udp make ipv6/udp to take advantage of managed data
and propagate SKBFL_MANAGED_FRAG_REFS to skb_zerocopy_iter_dgram().

Signed-off-by: Pavel Begunkov <[email protected]>
---
net/ipv6/ip6_output.c | 57 ++++++++++++++++++++++++++++++++-----------
1 file changed, 43 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6103cd9066ff..f4138ce6eda3 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1542,18 +1542,35 @@ static int __ip6_append_data(struct sock *sk,
rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
csummode = CHECKSUM_PARTIAL;

- if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
- uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
- if (!uarg)
- return -ENOBUFS;
- extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
- if (rt->dst.dev->features & NETIF_F_SG &&
- csummode == CHECKSUM_PARTIAL) {
- paged = true;
- zc = true;
- } else {
- uarg->zerocopy = 0;
- skb_zcopy_set(skb, uarg, &extra_uref);
+ if ((flags & MSG_ZEROCOPY) && length) {
+ struct msghdr *msg = from;
+
+ if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
+ if (skb_zcopy(skb) && msg->msg_ubuf != skb_zcopy(skb))
+ return -EINVAL;
+
+ /* Leave uarg NULL if can't zerocopy, callers should
+ * be able to handle it.
+ */
+ if ((rt->dst.dev->features & NETIF_F_SG) &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ uarg = msg->msg_ubuf;
+ }
+ } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+ uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+ if (!uarg)
+ return -ENOBUFS;
+ extra_uref = !skb_zcopy(skb); /* only ref on new uarg */
+ if (rt->dst.dev->features & NETIF_F_SG &&
+ csummode == CHECKSUM_PARTIAL) {
+ paged = true;
+ zc = true;
+ } else {
+ uarg->zerocopy = 0;
+ skb_zcopy_set(skb, uarg, &extra_uref);
+ }
}
}

@@ -1747,13 +1764,14 @@ static int __ip6_append_data(struct sock *sk,
err = -EFAULT;
goto error;
}
- } else if (!uarg || !uarg->zerocopy) {
+ } else if (!zc) {
int i = skb_shinfo(skb)->nr_frags;

err = -ENOMEM;
if (!sk_page_frag_refill(sk, pfrag))
goto error;

+ skb_zcopy_downgrade_managed(skb);
if (!skb_can_coalesce(skb, i, pfrag->page,
pfrag->offset)) {
err = -EMSGSIZE;
@@ -1778,7 +1796,18 @@ static int __ip6_append_data(struct sock *sk,
skb->truesize += copy;
wmem_alloc_delta += copy;
} else {
- err = skb_zerocopy_iter_dgram(skb, from, copy);
+ struct msghdr *msg = from;
+
+ if (!skb_shinfo(skb)->nr_frags) {
+ if (msg->msg_managed_data)
+ skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
+ } else {
+ /* appending, don't mix managed and unmanaged */
+ if (!msg->msg_managed_data)
+ skb_zcopy_downgrade_managed(skb);
+ }
+
+ err = skb_zerocopy_iter_dgram(skb, msg, copy);
if (err < 0)
goto error;
}
--
2.36.1

2022-06-28 19:18:44

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 26/29] io_uring: flush notifiers after sendzc

Allow to flush notifiers as a part of sendzc request by setting
IORING_SENDZC_FLUSH flag. When the sendzc request succeedes it will
flush the used [active] notifier.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 7 +++++--
include/uapi/linux/io_uring.h | 1 +
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 70b1f77ac64e..f5fe2ab5622a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -6581,7 +6581,7 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

-#define IO_SENDZC_VALID_FLAGS IORING_SENDZC_FIXED_BUF
+#define IO_SENDZC_VALID_FLAGS (IORING_SENDZC_FIXED_BUF|IORING_SENDZC_FLUSH)

static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
@@ -6685,7 +6685,10 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_ubuf = &notif->uarg;
ret = sock_sendmsg(sock, &msg);

- if (unlikely(ret < min_ret)) {
+ if (likely(ret >= min_ret)) {
+ if (req->msgzc.zc_flags & IORING_SENDZC_FLUSH)
+ io_notif_slot_flush_submit(notif_slot, 0);
+ } else {
if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
return -EAGAIN;
if (ret == -ERESTARTSYS)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 69100aa71448..7d77d90a5f8a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -279,6 +279,7 @@ enum {
*/
enum {
IORING_SENDZC_FIXED_BUF = (1U << 0),
+ IORING_SENDZC_FLUSH = (1U << 1),
};

/*
--
2.36.1

2022-06-28 19:18:54

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 19/29] io_uring: rename IORING_OP_FILES_UPDATE

IORING_OP_FILES_UPDATE will be a more generic opcode serving different
resource types, rename it into IORING_OP_RSRC_UPDATE and add subtype
handling.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 23 +++++++++++++++++------
include/uapi/linux/io_uring.h | 12 +++++++++++-
2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 22427893549a..e9fc7e076c7f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -730,6 +730,7 @@ struct io_rsrc_update {
u64 arg;
u32 nr_args;
u32 offset;
+ unsigned type;
};

struct io_fadvise {
@@ -1280,7 +1281,7 @@ static const struct io_op_def io_op_defs[] = {
},
[IORING_OP_OPENAT] = {},
[IORING_OP_CLOSE] = {},
- [IORING_OP_FILES_UPDATE] = {
+ [IORING_OP_RSRC_UPDATE] = {
.audit_skip = 1,
.iopoll = 1,
},
@@ -8268,7 +8269,7 @@ static int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

-static int io_files_update_prep(struct io_kiocb *req,
+static int io_rsrc_update_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
@@ -8280,6 +8281,7 @@ static int io_files_update_prep(struct io_kiocb *req,
req->rsrc_update.nr_args = READ_ONCE(sqe->len);
if (!req->rsrc_update.nr_args)
return -EINVAL;
+ req->rsrc_update.type = READ_ONCE(sqe->ioprio);
req->rsrc_update.arg = READ_ONCE(sqe->addr);
return 0;
}
@@ -8308,6 +8310,15 @@ static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

+static int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+ switch (req->rsrc_update.type) {
+ case IORING_RSRC_UPDATE_FILES:
+ return io_files_update(req, issue_flags);
+ }
+ return -EINVAL;
+}
+
static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
switch (req->opcode) {
@@ -8352,8 +8363,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_openat_prep(req, sqe);
case IORING_OP_CLOSE:
return io_close_prep(req, sqe);
- case IORING_OP_FILES_UPDATE:
- return io_files_update_prep(req, sqe);
+ case IORING_OP_RSRC_UPDATE:
+ return io_rsrc_update_prep(req, sqe);
case IORING_OP_STATX:
return io_statx_prep(req, sqe);
case IORING_OP_FADVISE:
@@ -8661,8 +8672,8 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_CLOSE:
ret = io_close(req, issue_flags);
break;
- case IORING_OP_FILES_UPDATE:
- ret = io_files_update(req, issue_flags);
+ case IORING_OP_RSRC_UPDATE:
+ ret = io_rsrc_update(req, issue_flags);
break;
case IORING_OP_STATX:
ret = io_statx(req, issue_flags);
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 96193bbda2e4..5f574558b96c 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -162,7 +162,8 @@ enum io_uring_op {
IORING_OP_FALLOCATE,
IORING_OP_OPENAT,
IORING_OP_CLOSE,
- IORING_OP_FILES_UPDATE,
+ IORING_OP_RSRC_UPDATE,
+ IORING_OP_FILES_UPDATE = IORING_OP_RSRC_UPDATE,
IORING_OP_STATX,
IORING_OP_READ,
IORING_OP_WRITE,
@@ -210,6 +211,7 @@ enum io_uring_op {
#define IORING_TIMEOUT_ETIME_SUCCESS (1U << 5)
#define IORING_TIMEOUT_CLOCK_MASK (IORING_TIMEOUT_BOOTTIME | IORING_TIMEOUT_REALTIME)
#define IORING_TIMEOUT_UPDATE_MASK (IORING_TIMEOUT_UPDATE | IORING_LINK_TIMEOUT_UPDATE)
+
/*
* sqe->splice_flags
* extends splice(2) flags
@@ -258,6 +260,14 @@ enum io_uring_op {
*/
#define IORING_ACCEPT_MULTISHOT (1U << 0)

+
+/*
+ * IORING_OP_RSRC_UPDATE flags
+ */
+enum {
+ IORING_RSRC_UPDATE_FILES,
+};
+
/*
* IO completion data structure (Completion Queue Entry)
*/
--
2.36.1

2022-06-28 19:19:13

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 03/29] skbuff: add SKBFL_DONT_ORPHAN flag

We don't want to list every single ubuf_info callback in
skb_orphan_frags(), add a flag controlling the behaviour.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 8 +++++---
net/core/skbuff.c | 2 +-
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index da96f0d3e753..eead3527bdaf 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -686,10 +686,13 @@ enum {
* charged to the kernel memory.
*/
SKBFL_PURE_ZEROCOPY = BIT(2),
+
+ SKBFL_DONT_ORPHAN = BIT(3),
};

#define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
-#define SKBFL_ALL_ZEROCOPY (SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY)
+#define SKBFL_ALL_ZEROCOPY (SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
+ SKBFL_DONT_ORPHAN)

/*
* The callback notifies userspace to release buffers when skb DMA is done in
@@ -3175,8 +3178,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
{
if (likely(!skb_zcopy(skb)))
return 0;
- if (!skb_zcopy_is_nouarg(skb) &&
- skb_uarg(skb)->callback == msg_zerocopy_callback)
+ if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
return 0;
return skb_copy_ubufs(skb, gfp_mask);
}
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b3559cb1d82..5b35791064d1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1193,7 +1193,7 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
uarg->len = 1;
uarg->bytelen = size;
uarg->zerocopy = 1;
- uarg->flags = SKBFL_ZEROCOPY_FRAG;
+ uarg->flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
refcount_set(&uarg->refcnt, 1);
sock_hold(sk);

--
2.36.1

2022-06-28 19:19:52

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

Allow to specify an address to zerocopy sends making it more like
sendto(2).

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 16 +++++++++++++++-
include/uapi/linux/io_uring.h | 2 +-
2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 838030477456..a1e9405a3f1b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -722,6 +722,8 @@ struct io_sendzc {
size_t len;
u16 slot_idx;
int msg_flags;
+ int addr_len;
+ void __user *addr;
};

struct io_open {
@@ -6572,7 +6574,7 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
struct io_sendzc *zc = &req->msgzc;

- if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->addr2) || READ_ONCE(sqe->__pad2[0]))
+ if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->__pad2[0]))
return -EINVAL;

zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
@@ -6581,6 +6583,9 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
zc->slot_idx = READ_ONCE(sqe->notification_idx);
if (zc->msg_flags & MSG_DONTWAIT)
req->flags |= REQ_F_NOWAIT;
+ zc->addr = u64_to_user_ptr(READ_ONCE(sqe->addr2));
+ zc->addr_len = READ_ONCE(sqe->addr_len);
+
#ifdef CONFIG_COMPAT
if (req->ctx->compat)
zc->msg_flags |= MSG_CMSG_COMPAT;
@@ -6590,6 +6595,7 @@ static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)

static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
{
+ struct sockaddr_storage address;
struct io_ring_ctx *ctx = req->ctx;
struct io_sendzc *zc = &req->msgzc;
struct io_notif_slot *notif_slot;
@@ -6624,6 +6630,14 @@ static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
return ret;
mm_account_pinned_pages(&notif->uarg.mmp, zc->len);

+ if (zc->addr) {
+ ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
+ if (unlikely(ret < 0))
+ return ret;
+ msg.msg_name = (struct sockaddr *)&address;
+ msg.msg_namelen = zc->addr_len;
+ }
+
msg_flags = zc->msg_flags | MSG_ZEROCOPY;
if (issue_flags & IO_URING_F_NONBLOCK)
msg_flags |= MSG_DONTWAIT;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6c6f20ae5a95..689aa1444cd4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -63,7 +63,7 @@ struct io_uring_sqe {
__u32 file_index;
struct {
__u16 notification_idx;
- __u16 __pad;
+ __u16 addr_len;
} __attribute__((packed));
};
union {
--
2.36.1

2022-06-28 19:19:54

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 21/29] io_uring: wire send zc request type

Add a new io_uring opcode IORING_OP_SENDZC. The main distinction from
IORING_OP_SEND is that the user should specify a notification slot
index in sqe::notification_idx and the buffers are safe to reuse only
when the used notification is flushed and completes.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 103 +++++++++++++++++++++++++++++++++-
include/uapi/linux/io_uring.h | 5 ++
2 files changed, 106 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a88c9c73ed1d..4a1a1d43e9b3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -716,6 +716,14 @@ struct io_sr_msg {
unsigned int flags;
};

+struct io_sendzc {
+ struct file *file;
+ void __user *buf;
+ size_t len;
+ u16 slot_idx;
+ int msg_flags;
+};
+
struct io_open {
struct file *file;
int dfd;
@@ -1044,6 +1052,7 @@ struct io_kiocb {
struct io_socket sock;
struct io_nop nop;
struct io_uring_cmd uring_cmd;
+ struct io_sendzc msgzc;
};

u8 opcode;
@@ -1384,6 +1393,13 @@ static const struct io_op_def io_op_defs[] = {
.needs_async_setup = 1,
.async_size = uring_cmd_pdu_size(1),
},
+ [IORING_OP_SENDZC] = {
+ .needs_file = 1,
+ .unbound_nonreg_file = 1,
+ .pollout = 1,
+ .audit_skip = 1,
+ .ioprio = 1,
+ },
};

/* requests with any of those set should undergo io_disarm_next() */
@@ -1525,6 +1541,8 @@ const char *io_uring_get_opcode(u8 opcode)
return "SOCKET";
case IORING_OP_URING_CMD:
return "URING_CMD";
+ case IORING_OP_SENDZC:
+ return "URING_SENDZC";
case IORING_OP_LAST:
return "INVALID";
}
@@ -2920,7 +2938,6 @@ static struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
return notif;
}

-__attribute__((unused))
static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
struct io_notif_slot *slot)
{
@@ -2929,7 +2946,6 @@ static inline struct io_notif *io_get_notif(struct io_ring_ctx *ctx,
return slot->notif;
}

-__attribute__((unused))
static inline struct io_notif_slot *io_get_notif_slot(struct io_ring_ctx *ctx,
int idx)
__must_hold(&ctx->uring_lock)
@@ -6546,6 +6562,83 @@ static int io_send(struct io_kiocb *req, unsigned int issue_flags)
return 0;
}

+static int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_sendzc *zc = &req->msgzc;
+
+ if (READ_ONCE(sqe->ioprio) || READ_ONCE(sqe->addr2) || READ_ONCE(sqe->__pad2[0]))
+ return -EINVAL;
+
+ zc->buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
+ zc->len = READ_ONCE(sqe->len);
+ zc->msg_flags = READ_ONCE(sqe->msg_flags) | MSG_NOSIGNAL;
+ zc->slot_idx = READ_ONCE(sqe->notification_idx);
+ if (zc->msg_flags & MSG_DONTWAIT)
+ req->flags |= REQ_F_NOWAIT;
+#ifdef CONFIG_COMPAT
+ if (req->ctx->compat)
+ zc->msg_flags |= MSG_CMSG_COMPAT;
+#endif
+ return 0;
+}
+
+static int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_sendzc *zc = &req->msgzc;
+ struct io_notif_slot *notif_slot;
+ struct io_notif *notif;
+ struct msghdr msg;
+ struct iovec iov;
+ struct socket *sock;
+ unsigned msg_flags;
+ int ret, min_ret = 0;
+
+ if (issue_flags & IO_URING_F_UNLOCKED)
+ return -EAGAIN;
+ sock = sock_from_file(req->file);
+ if (unlikely(!sock))
+ return -ENOTSOCK;
+
+ notif_slot = io_get_notif_slot(ctx, zc->slot_idx);
+ if (!notif_slot)
+ return -EINVAL;
+ notif = io_get_notif(ctx, notif_slot);
+ if (!notif)
+ return -ENOMEM;
+
+ msg.msg_name = NULL;
+ msg.msg_control = NULL;
+ msg.msg_controllen = 0;
+ msg.msg_namelen = 0;
+ msg.msg_managed_data = 0;
+
+ ret = import_single_range(WRITE, zc->buf, zc->len, &iov, &msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+
+ msg_flags = zc->msg_flags | MSG_ZEROCOPY;
+ if (issue_flags & IO_URING_F_NONBLOCK)
+ msg_flags |= MSG_DONTWAIT;
+ if (msg_flags & MSG_WAITALL)
+ min_ret = iov_iter_count(&msg.msg_iter);
+
+ msg.msg_flags = msg_flags;
+ msg.msg_ubuf = &notif->uarg;
+ ret = sock_sendmsg(sock, &msg);
+
+ if (unlikely(ret < min_ret)) {
+ if (ret == -EAGAIN && (issue_flags & IO_URING_F_NONBLOCK))
+ return -EAGAIN;
+ if (ret == -ERESTARTSYS)
+ ret = -EINTR;
+ req_set_fail(req);
+ }
+
+ __io_req_complete(req, issue_flags, ret, 0);
+ return 0;
+}
+
static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
struct io_async_msghdr *iomsg)
{
@@ -7064,6 +7157,7 @@ IO_NETOP_PREP_ASYNC(connect);
IO_NETOP_PREP(accept);
IO_NETOP_PREP(socket);
IO_NETOP_PREP(shutdown);
+IO_NETOP_PREP(sendzc);
IO_NETOP_FN(send);
IO_NETOP_FN(recv);
#endif /* CONFIG_NET */
@@ -8389,6 +8483,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
case IORING_OP_SENDMSG:
case IORING_OP_SEND:
return io_sendmsg_prep(req, sqe);
+ case IORING_OP_SENDZC:
+ return io_sendzc_prep(req, sqe);
case IORING_OP_RECVMSG:
case IORING_OP_RECV:
return io_recvmsg_prep(req, sqe);
@@ -8689,6 +8785,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_SEND:
ret = io_send(req, issue_flags);
break;
+ case IORING_OP_SENDZC:
+ ret = io_sendzc(req, issue_flags);
+ break;
case IORING_OP_RECVMSG:
ret = io_recvmsg(req, issue_flags);
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 19b9d7a2da29..6c6f20ae5a95 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -61,6 +61,10 @@ struct io_uring_sqe {
union {
__s32 splice_fd_in;
__u32 file_index;
+ struct {
+ __u16 notification_idx;
+ __u16 __pad;
+ } __attribute__((packed));
};
union {
struct {
@@ -190,6 +194,7 @@ enum io_uring_op {
IORING_OP_GETXATTR,
IORING_OP_SOCKET,
IORING_OP_URING_CMD,
+ IORING_OP_SENDZC,

/* this goes last, obviously */
IORING_OP_LAST,
--
2.36.1

2022-06-28 19:20:16

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 20/29] io_uring: add zc notification flush requests

Overlay notification control onto IORING_OP_RSRC_UPDATE (former
IORING_OP_FILES_UPDATE). It allows to flush a range of zc notifications
from slots with indexes [sqe->off, sqe->off+sqe->len). If sqe->arg is
not zero, it also copies sqe->arg as a new tag for all flushed
notifications.

Note, it doesn't flush a notification of a slot if there was no requests
attached to it (since last flush or registration).

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 47 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 1 +
2 files changed, 48 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e9fc7e076c7f..a88c9c73ed1d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1284,6 +1284,7 @@ static const struct io_op_def io_op_defs[] = {
[IORING_OP_RSRC_UPDATE] = {
.audit_skip = 1,
.iopoll = 1,
+ .ioprio = 1,
},
[IORING_OP_STATX] = {
.audit_skip = 1,
@@ -2953,6 +2954,16 @@ static void io_notif_slot_flush(struct io_notif_slot *slot)
io_notif_complete(notif);
}

+static inline void io_notif_slot_flush_submit(struct io_notif_slot *slot,
+ unsigned int issue_flags)
+{
+ if (!(issue_flags & IO_URING_F_UNLOCKED)) {
+ slot->notif->task = current;
+ io_get_task_refs(1);
+ }
+ io_notif_slot_flush(slot);
+}
+
static __cold int io_notif_unregister(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
@@ -8286,6 +8297,40 @@ static int io_rsrc_update_prep(struct io_kiocb *req,
return 0;
}

+static int io_notif_update(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ unsigned len = req->rsrc_update.nr_args;
+ unsigned idx_end, idx = req->rsrc_update.offset;
+ int ret = 0;
+
+ io_ring_submit_lock(ctx, issue_flags);
+ if (unlikely(check_add_overflow(idx, len, &idx_end))) {
+ ret = -EOVERFLOW;
+ goto out;
+ }
+ if (unlikely(idx_end > ctx->nr_notif_slots)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ for (; idx < idx_end; idx++) {
+ struct io_notif_slot *slot = &ctx->notif_slots[idx];
+
+ if (!slot->notif)
+ continue;
+ if (req->rsrc_update.arg)
+ slot->tag = req->rsrc_update.arg;
+ io_notif_slot_flush_submit(slot, issue_flags);
+ }
+out:
+ io_ring_submit_unlock(ctx, issue_flags);
+ if (ret < 0)
+ req_set_fail(req);
+ __io_req_complete(req, issue_flags, ret, 0);
+ return 0;
+}
+
static int io_files_update(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -8315,6 +8360,8 @@ static int io_rsrc_update(struct io_kiocb *req, unsigned int issue_flags)
switch (req->rsrc_update.type) {
case IORING_RSRC_UPDATE_FILES:
return io_files_update(req, issue_flags);
+ case IORING_RSRC_UPDATE_NOTIF:
+ return io_notif_update(req, issue_flags);
}
return -EINVAL;
}
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5f574558b96c..19b9d7a2da29 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -266,6 +266,7 @@ enum io_uring_op {
*/
enum {
IORING_RSRC_UPDATE_FILES,
+ IORING_RSRC_UPDATE_NOTIF,
};

/*
--
2.36.1

2022-06-28 19:20:31

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 24/29] io_uring: add rsrc referencing for notifiers

In preparation to zerocopy sends with fixed buffers make notifiers to
reference the rsrc node to protect the used fixed buffers. We can't just
grab it for a send request as notifiers can likely outlive requests that
used it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index a1e9405a3f1b..07d09d06e8ab 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -378,6 +378,7 @@ struct io_ev_fd {
struct io_notif {
struct ubuf_info uarg;
struct io_ring_ctx *ctx;
+ struct io_rsrc_node *rsrc_node;

/* cqe->user_data, io_notif_slot::tag if not overridden */
u64 tag;
@@ -1695,13 +1696,20 @@ static __cold void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
}
}

-static void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
+static __cold void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
ctx->rsrc_cached_refs += IO_RSRC_REF_BATCH;
percpu_ref_get_many(&ctx->rsrc_node->refs, IO_RSRC_REF_BATCH);
}

+static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx)
+{
+ ctx->rsrc_cached_refs--;
+ if (unlikely(ctx->rsrc_cached_refs < 0))
+ io_rsrc_refs_refill(ctx);
+}
+
static inline void io_req_set_rsrc_node(struct io_kiocb *req,
struct io_ring_ctx *ctx,
unsigned int issue_flags)
@@ -1711,9 +1719,7 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,

if (!(issue_flags & IO_URING_F_UNLOCKED)) {
lockdep_assert_held(&ctx->uring_lock);
- ctx->rsrc_cached_refs--;
- if (unlikely(ctx->rsrc_cached_refs < 0))
- io_rsrc_refs_refill(ctx);
+ io_charge_rsrc_node(ctx);
} else {
percpu_ref_get(&req->rsrc_node->refs);
}
@@ -2826,6 +2832,7 @@ static __cold void io_free_req(struct io_kiocb *req)
static void __io_notif_complete_tw(struct callback_head *cb)
{
struct io_notif *notif = container_of(cb, struct io_notif, task_work);
+ struct io_rsrc_node *rsrc_node = notif->rsrc_node;
struct io_ring_ctx *ctx = notif->ctx;
struct mmpin *mmp = &notif->uarg.mmp;

@@ -2849,6 +2856,7 @@ static void __io_notif_complete_tw(struct callback_head *cb)
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

+ io_rsrc_put_node(rsrc_node, 1);
percpu_ref_put(&ctx->refs);
}

@@ -2943,6 +2951,8 @@ static struct io_notif *io_alloc_notif(struct io_ring_ctx *ctx,
/* master ref owned by io_notif_slot, will be dropped on flush */
refcount_set(&notif->uarg.refcnt, 1);
percpu_ref_get(&ctx->refs);
+ notif->rsrc_node = ctx->rsrc_node;
+ io_charge_rsrc_node(ctx);
return notif;
}

--
2.36.1

2022-06-28 19:20:36

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 06/29] net: optimise bvec-based zc page referencing

Some users like io_uring can pass a bvec iterator to send and also can
implement page pinning more efficiently. Add a ->msg_managed_data toogle
in msghdr. When set, data pages are "managed" by upper layers, i.e.
refcounted and pinned by the caller and will live at least until
->msg_ubuf is released. msghdr has to have non-NULL ->msg_ubuf and
->msg_iter should point to a bvec.

Protocols supporting the feature will propagate it by setting
SKBFL_MANAGED_FRAG_REFS, which means that the skb doesn't hold refs to
its frag pages and only rely on ubuf_info lifetime gurantees. It should
only be used with zerocopy skbs with ubuf_info set.

It's allowed to convert skbs from managed to normal by calling
skb_zcopy_downgrade_managed(). The function will take all needed
page references and clear the flag.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/skbuff.h | 25 +++++++++++++++++++++++--
net/core/datagram.c | 7 ++++---
net/core/skbuff.c | 29 +++++++++++++++++++++++++++--
3 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eead3527bdaf..5407cfd9cb89 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -688,11 +688,16 @@ enum {
SKBFL_PURE_ZEROCOPY = BIT(2),

SKBFL_DONT_ORPHAN = BIT(3),
+
+ /* page references are managed by the ubuf_info, so it's safe to
+ * use frags only up until ubuf_info is released
+ */
+ SKBFL_MANAGED_FRAG_REFS = BIT(4),
};

#define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
#define SKBFL_ALL_ZEROCOPY (SKBFL_ZEROCOPY_FRAG | SKBFL_PURE_ZEROCOPY | \
- SKBFL_DONT_ORPHAN)
+ SKBFL_DONT_ORPHAN | SKBFL_MANAGED_FRAG_REFS)

/*
* The callback notifies userspace to release buffers when skb DMA is done in
@@ -1809,6 +1814,11 @@ static inline bool skb_zcopy_pure(const struct sk_buff *skb)
return skb_shinfo(skb)->flags & SKBFL_PURE_ZEROCOPY;
}

+static inline bool skb_zcopy_managed(const struct sk_buff *skb)
+{
+ return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS;
+}
+
static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1,
const struct sk_buff *skb2)
{
@@ -1883,6 +1893,14 @@ static inline void skb_zcopy_clear(struct sk_buff *skb, bool zerocopy_success)
}
}

+void __skb_zcopy_downgrade_managed(struct sk_buff *skb);
+
+static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+ if (unlikely(skb_zcopy_managed(skb)))
+ __skb_zcopy_downgrade_managed(skb);
+}
+
static inline void skb_mark_not_on_list(struct sk_buff *skb)
{
skb->next = NULL;
@@ -3491,7 +3509,10 @@ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle)
*/
static inline void skb_frag_unref(struct sk_buff *skb, int f)
{
- __skb_frag_unref(&skb_shinfo(skb)->frags[f], skb->pp_recycle);
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+ if (!skb_zcopy_managed(skb))
+ __skb_frag_unref(&shinfo->frags[f], skb->pp_recycle);
}

/**
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 5237cb533bb4..a93c05156f56 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -631,7 +631,6 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,

copied += v.bv_len;
truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
- get_page(v.bv_page);
skb_fill_page_desc(skb, frag++, v.bv_page, v.bv_offset, v.bv_len);
bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
}
@@ -660,11 +659,13 @@ static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, size_t length)
{
- int frag = skb_shinfo(skb)->nr_frags;
+ int frag;

- if (iov_iter_is_bvec(from))
+ if (skb_zcopy_managed(skb))
return __zerocopy_sg_from_bvec(sk, skb, from, length);

+ frag = skb_shinfo(skb)->nr_frags;
+
while (length && iov_iter_count(from)) {
struct page *pages[MAX_SKB_FRAGS];
struct page *last_head = NULL;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b35791064d1..71870def129c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -666,11 +666,18 @@ static void skb_release_data(struct sk_buff *skb)
&shinfo->dataref))
goto exit;

- skb_zcopy_clear(skb, true);
+ if (skb_zcopy(skb)) {
+ bool skip_unref = shinfo->flags & SKBFL_MANAGED_FRAG_REFS;
+
+ skb_zcopy_clear(skb, true);
+ if (skip_unref)
+ goto free_head;
+ }

for (i = 0; i < shinfo->nr_frags; i++)
__skb_frag_unref(&shinfo->frags[i], skb->pp_recycle);

+free_head:
if (shinfo->frag_list)
kfree_skb_list(shinfo->frag_list);

@@ -895,7 +902,10 @@ EXPORT_SYMBOL(skb_dump);
*/
void skb_tx_error(struct sk_buff *skb)
{
- skb_zcopy_clear(skb, true);
+ if (skb) {
+ skb_zcopy_downgrade_managed(skb);
+ skb_zcopy_clear(skb, true);
+ }
}
EXPORT_SYMBOL(skb_tx_error);

@@ -1371,6 +1381,16 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
}
EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);

+void __skb_zcopy_downgrade_managed(struct sk_buff *skb)
+{
+ int i;
+
+ skb_shinfo(skb)->flags &= ~SKBFL_MANAGED_FRAG_REFS;
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++)
+ skb_frag_ref(skb, i);
+}
+EXPORT_SYMBOL_GPL(__skb_zcopy_downgrade_managed);
+
static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
gfp_t gfp_mask)
{
@@ -1688,6 +1708,8 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,

BUG_ON(skb_shared(skb));

+ skb_zcopy_downgrade_managed(skb);
+
size = SKB_DATA_ALIGN(size);

if (skb_pfmemalloc(skb))
@@ -3484,6 +3506,8 @@ void skb_split(struct sk_buff *skb, struct sk_buff *skb1, const u32 len)
int pos = skb_headlen(skb);
const int zc_flags = SKBFL_SHARED_FRAG | SKBFL_PURE_ZEROCOPY;

+ skb_zcopy_downgrade_managed(skb);
+
skb_shinfo(skb1)->flags |= skb_shinfo(skb)->flags & zc_flags;
skb_zerocopy_clone(skb1, skb, 0);
if (len < pos) /* Split line is inside header. */
@@ -3837,6 +3861,7 @@ int skb_append_pagefrags(struct sk_buff *skb, struct page *page,
if (skb_can_coalesce(skb, i, page, offset)) {
skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], size);
} else if (i < MAX_SKB_FRAGS) {
+ skb_zcopy_downgrade_managed(skb);
get_page(page);
skb_fill_page_desc(skb, i, page, offset, size);
} else {
--
2.36.1

2022-06-28 19:27:21

by Pavel Begunkov

[permalink] [raw]
Subject: [RFC net-next v3 17/29] io_uring: complete notifiers in tw

We need a task context to post CQEs but using wq is too expensive.
Try to complete notifiers using task_work and fall back to wq if fails.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 422ff835bf36..9ade0ea8552b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -384,6 +384,8 @@ struct io_notif {
/* hook into ctx->notif_list and ctx->notif_list_locked */
struct list_head cache_node;

+ /* complete via tw if ->task is non-NULL, fallback to wq otherwise */
+ struct task_struct *task;
union {
struct callback_head task_work;
struct work_struct commit_work;
@@ -2802,6 +2804,11 @@ static void __io_notif_complete_tw(struct callback_head *cb)
struct io_notif *notif = container_of(cb, struct io_notif, task_work);
struct io_ring_ctx *ctx = notif->ctx;

+ if (likely(notif->task)) {
+ io_put_task(notif->task, 1);
+ notif->task = NULL;
+ }
+
spin_lock(&ctx->completion_lock);
io_fill_cqe_aux(ctx, notif->tag, 0, notif->seq);

@@ -2835,6 +2842,14 @@ static void io_uring_tx_zerocopy_callback(struct sk_buff *skb,

if (!refcount_dec_and_test(&uarg->refcnt))
return;
+
+ if (likely(notif->task)) {
+ init_task_work(&notif->task_work, __io_notif_complete_tw);
+ if (likely(!task_work_add(notif->task, &notif->task_work,
+ TWA_SIGNAL)))
+ return;
+ }
+
INIT_WORK(&notif->commit_work, io_notif_complete_wq);
queue_work(system_unbound_wq, &notif->commit_work);
}
@@ -2946,8 +2961,12 @@ static __cold int io_notif_unregister(struct io_ring_ctx *ctx)
for (i = 0; i < ctx->nr_notif_slots; i++) {
struct io_notif_slot *slot = &ctx->notif_slots[i];

- if (slot->notif)
+ if (slot->notif) {
+ WARN_ON_ONCE(slot->notif->task);
+
+ slot->notif->task = NULL;
io_notif_slot_flush(slot);
+ }
}

kvfree(ctx->notif_slots);
--
2.36.1

2022-06-28 20:17:19

by Al Viro

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
> It'll be used later for {get,put}_page() optimisations.

If you need a variant that would not grab page references for ITER_BVEC
(and presumably other non-userland ones), the natural thing to do would
be to provide just such a primitive, wouldn't it?

The fun question here is by which paths ITER_BVEC can be passed to that
function and which all of them are currently guaranteed to hold the
underlying pages pinned...

And AFAICS you quietly assume that only ITER_BVEC ones will ever have that
"managed" flag of your set. Or am I misreading the next patch in the
series?

2022-06-28 21:42:16

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 6/28/22 21:06, Al Viro wrote:
> On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
>> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
>> It'll be used later for {get,put}_page() optimisations.
>
> If you need a variant that would not grab page references for ITER_BVEC
> (and presumably other non-userland ones), the natural thing to do would

I don't see other iter types interesting in this context

> be to provide just such a primitive, wouldn't it?

A helper returning a page array sounds like overshot and waste of cycles
considering that it copies one bvec into another, and especially since
iov_iter_get_pages() parses only the first struct bio_vec and so returns
only 1 page at a time.

I can actually use for_each_bvec(), but still leaves updating the iter
from bvec_iter.

> The fun question here is by which paths ITER_BVEC can be passed to that
> function and which all of them are currently guaranteed to hold the
> underlying pages pinned...

It's the other way around, not all ITER_BVEC are managed but all users
asking to use managed frags (i.e. io_uring) should keep pages pinned and
provide ITER_BVEC. It's opt-in, both for users and protocols.

--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -66,9 +66,16 @@ struct msghdr {
};
bool msg_control_is_user : 1;
bool msg_get_inq : 1;/* return INQ after receive */
+ /*
+ * The data pages are pinned and won't be released before ->msg_ubuf
+ * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
+ * to be non-NULL.
+ */
+ bool msg_managed_data : 1;
unsigned int msg_flags; /* flags on received message */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
+ struct ubuf_info *msg_ubuf;
};

The user sets ->msg_managed_data, then protocols find it and set
SKBFL_MANAGED_FRAG_REFS. If either of the steps didn't happen the
feature is not used.
The ->msg_managed_data part makes io_uring the only user, and io_uring
ensures pages are pinned.


> And AFAICS you quietly assume that only ITER_BVEC ones will ever have that
> "managed" flag of your set. Or am I misreading the next patch in the
> series?

I hope a comment just above ->msg_managed_data should count as not quiet.

--
Pavel Begunkov

2022-06-28 23:38:38

by David Ahern

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
> It'll be used later for {get,put}_page() optimisations.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>

Rather than propagating iter functions, I have been using the attached
patch for a few months now. It leverages your ubuf_info in msghdr to
allow in kernel users to pass in their own iter handler.


Attachments:
(No filename) (583.00 B)
0001-net-Allow-custom-iter-handler-in-uarg.patch (6.70 kB)
Download all attachments

2022-06-29 08:00:22

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc


Hi Pavel,

>
> + if (zc->addr) {
> + ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
> + if (unlikely(ret < 0))
> + return ret;
> + msg.msg_name = (struct sockaddr *)&address;
> + msg.msg_namelen = zc->addr_len;
> + }
> +

Given that this fills in msg almost completely can we also have
a version of SENDMSGZC, it would be very useful to also allow
msg_control to be passed and as well as an iovec.

Would that be possible?

Do I understand it correctly, that the reason for the new opcode is,
that IO_OP_SEND would already work with existing MSG_ZEROCOPY behavior, together
with the recvmsg based completion?

In addition I wondering if a completion based on msg_iocb->ki_complete() (indicated by EIOCBQUEUED)
what have also worked, just deferring the whole sendmsg operation until all buffers are no longer used.
That way it would be possible to buffers are acked by the remote end when it comes back to the application
layer.

I'm also wondering if the ki_complete() based approach should always be provided to sock_sendmsg()
triggered by io_uring (independend of the new zerocopy stuff), it would basically work very simular to
the uring_cmd() completions, which are able to handle both true async operation indicated by EIOCBQUEUED
as well as EAGAIN triggered path via io-wq.

metze

2022-06-29 11:03:18

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

On 6/29/22 08:42, Stefan Metzmacher wrote:
>
> Hi Pavel,
>
>> +    if (zc->addr) {
>> +        ret = move_addr_to_kernel(zc->addr, zc->addr_len, &address);
>> +        if (unlikely(ret < 0))
>> +            return ret;
>> +        msg.msg_name = (struct sockaddr *)&address;
>> +        msg.msg_namelen = zc->addr_len;
>> +    }
>> +
>
> Given that this fills in msg almost completely can we also have
> a version of SENDMSGZC, it would be very useful to also allow
> msg_control to be passed and as well as an iovec.
>
> Would that be possible?

Right, I left it to follow ups as the series is already too long.

fwiw, I'm going to also add addr to IORING_OP_SEND.


> Do I understand it correctly, that the reason for the new opcode is,
> that IO_OP_SEND would already work with existing MSG_ZEROCOPY behavior, together
> with the recvmsg based completion?

Right, it should work with MSG_ZEROCOPY, but with a different notification
semantics, would need recvmsg from error queues, and with performance
implications.


> In addition I wondering if a completion based on msg_iocb->ki_complete() (indicated by EIOCBQUEUED)
> what have also worked, just deferring the whole sendmsg operation until all buffers are no longer used.
> That way it would be possible to buffers are acked by the remote end when it comes back to the application
> layer.

There is msg_iocb, but it's mostly unused by protocols, IIRC apart
from crypto sockets. And then we'd need to repeat the path of
ubuf_info to handle stuff like skb splitting and perhaps also
changing rules for ->ki_complete


> I'm also wondering if the ki_complete() based approach should always be provided to sock_sendmsg()
> triggered by io_uring (independend of the new zerocopy stuff), it would basically work very simular to
> the uring_cmd() completions, which are able to handle both true async operation indicated by EIOCBQUEUED
> as well as EAGAIN triggered path via io-wq.

Would be even more similar to how we has always been doing
read/write, and rw requests do pass in a msg_iocb, but again,
it's largely ignored internally.

--
Pavel Begunkov

2022-07-04 13:35:28

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 6/28/22 23:52, David Ahern wrote:
> On Tue, Jun 28, 2022 at 07:56:27PM +0100, Pavel Begunkov wrote:
>> Add an bvec specialised and optimised path in zerocopy_sg_from_iter.
>> It'll be used later for {get,put}_page() optimisations.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> net/core/datagram.c | 47 +++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 47 insertions(+)
>>
>
> Rather than propagating iter functions, I have been using the attached
> patch for a few months now. It leverages your ubuf_info in msghdr to
> allow in kernel users to pass in their own iter handler.

If the series is going to be picked up for 5.20, how about we delay
this one for 5.21? I'll have time to think about it (maybe moving
the skb managed flag setup inside?), and will anyway need to send
some omitted patches then.

I was also entertaining the idea of having a smaller ubuf_info to
fit it into io_kiocb, which is tight on space.

struct ubuf_info {
void *callback;
refcount_t refcnt;
u32 flags;
};

struct ubuf_info_msgzerocopy {
struct ubuf_info ubuf;
/* others fields */
};

48 bytes would be taking too much, but 16 looks nice. It might
make sense to move the callback into struct msghdr, I don't know.

--
Pavel Begunkov

2022-07-05 02:53:41

by David Ahern

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 7/4/22 7:31 AM, Pavel Begunkov wrote:
> If the series is going to be picked up for 5.20, how about we delay
> this one for 5.21? I'll have time to think about it (maybe moving
> the skb managed flag setup inside?), and will anyway need to send
> some omitted patches then.
>

I think it reads better for io_uring and future extensions for io_uring
to contain the optimized bvec iter handler and setting the managed flag.
Too many disjointed assumptions the way the code is now. By pulling that
into io_uring, core code does not make assumptions that "managed" means
bvec and no page references - rather that is embedded in the code that
cares.

2022-07-05 14:14:55

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 7/5/22 03:28, David Ahern wrote:
> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>> If the series is going to be picked up for 5.20, how about we delay
>> this one for 5.21? I'll have time to think about it (maybe moving
>> the skb managed flag setup inside?), and will anyway need to send
>> some omitted patches then.
>>
>
> I think it reads better for io_uring and future extensions for io_uring
> to contain the optimized bvec iter handler and setting the managed flag.
> Too many disjointed assumptions the way the code is now. By pulling that
> into io_uring, core code does not make assumptions that "managed" means
> bvec and no page references - rather that is embedded in the code that
> cares.

Core code would still need to know when to remove the skb's managed
flag, e.g. in case of mixing. Can be worked out but with assumptions,
which doesn't look better that it currently is. I'll post a 5.20
rebased version and will iron it out on the way then.

--
Pavel Begunkov

2022-07-05 22:43:21

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 7/5/22 15:03, Pavel Begunkov wrote:
> On 7/5/22 03:28, David Ahern wrote:
>> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>>> If the series is going to be picked up for 5.20, how about we delay
>>> this one for 5.21? I'll have time to think about it (maybe moving
>>> the skb managed flag setup inside?), and will anyway need to send
>>> some omitted patches then.
>>>
>>
>> I think it reads better for io_uring and future extensions for io_uring
>> to contain the optimized bvec iter handler and setting the managed flag.
>> Too many disjointed assumptions the way the code is now. By pulling that
>> into io_uring, core code does not make assumptions that "managed" means
>> bvec and no page references - rather that is embedded in the code that
>> cares.
>
> Core code would still need to know when to remove the skb's managed
> flag, e.g. in case of mixing. Can be worked out but with assumptions,
> which doesn't look better that it currently is. I'll post a 5.20
> rebased version and will iron it out on the way then.

Incremental looks like below. Probably looks better. What is slightly
dubious is that for zerocopy paths it leaves downgrading managed bit
to the callback unlike in most other places where it's done by core.
Also knowing upfront whether the user requests the feature or not
sounds less convoluted, but I guess it's not that important for now.

I can try to rebase and see how it goes


diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2d5badd4b9ff..2cc5b8850cb4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1782,12 +1782,13 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
bool success);

int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
- struct iov_iter *from, size_t length);
+ struct iov_iter *from, struct msghdr *msg,
+ size_t length);

static inline int skb_zerocopy_iter_dgram(struct sk_buff *skb,
struct msghdr *msg, int len)
{
- return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+ return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, msg, len);
}

int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ba84ee614d5a..59b0f47c1f5a 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -14,6 +14,8 @@ struct file;
struct pid;
struct cred;
struct socket;
+struct sock;
+struct sk_buff;

#define __sockaddr_check_size(size) \
BUILD_BUG_ON(((size) > sizeof(struct __kernel_sockaddr_storage)))
@@ -66,16 +68,13 @@ struct msghdr {
};
bool msg_control_is_user : 1;
bool msg_get_inq : 1;/* return INQ after receive */
- /*
- * The data pages are pinned and won't be released before ->msg_ubuf
- * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
- * to be non-NULL.
- */
- bool msg_managed_data : 1;
unsigned int msg_flags; /* flags on received message */
__kernel_size_t msg_controllen; /* ancillary data buffer length */
struct kiocb *msg_iocb; /* ptr to iocb for async requests */
struct ubuf_info *msg_ubuf;
+
+ int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
+ struct iov_iter *from, size_t length);
};

struct user_msghdr {
diff --git a/io_uring/net.c b/io_uring/net.c
index a142a609790d..b7643f267e20 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -269,7 +269,6 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_controllen = 0;
msg.msg_namelen = 0;
msg.msg_ubuf = NULL;
- msg.msg_managed_data = false;

flags = sr->msg_flags;
if (issue_flags & IO_URING_F_NONBLOCK)
@@ -617,7 +616,6 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_controllen = 0;
msg.msg_iocb = NULL;
msg.msg_ubuf = NULL;
- msg.msg_managed_data = false;

flags = sr->msg_flags;
if (force_nonblock)
@@ -706,6 +704,60 @@ int io_sendzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return 0;
}

+static int io_sg_from_iter(struct sock *sk, struct sk_buff *skb,
+ struct iov_iter *from, size_t length)
+{
+ struct skb_shared_info *shinfo = skb_shinfo(skb);
+ int frag = shinfo->nr_frags;
+ int ret = 0;
+ struct bvec_iter bi;
+ ssize_t copied = 0;
+ unsigned long truesize = 0;
+
+ if (!shinfo->nr_frags)
+ shinfo->flags |= SKBFL_MANAGED_FRAG_REFS;
+
+ if (!skb_zcopy_managed(skb) || !iov_iter_is_bvec(from)) {
+ skb_zcopy_downgrade_managed(skb);
+ return __zerocopy_sg_from_iter(sk, skb, from, NULL, length);
+ }
+
+ bi.bi_size = min(from->count, length);
+ bi.bi_bvec_done = from->iov_offset;
+ bi.bi_idx = 0;
+
+ while (bi.bi_size && frag < MAX_SKB_FRAGS) {
+ struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
+
+ copied += v.bv_len;
+ truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
+ __skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
+ v.bv_offset, v.bv_len);
+ bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
+ }
+ if (bi.bi_size)
+ ret = -EMSGSIZE;
+
+ shinfo->nr_frags = frag;
+ from->bvec += bi.bi_idx;
+ from->nr_segs -= bi.bi_idx;
+ from->count = bi.bi_size;
+ from->iov_offset = bi.bi_bvec_done;
+
+ skb->data_len += copied;
+ skb->len += copied;
+ skb->truesize += truesize;
+
+ if (sk && sk->sk_type == SOCK_STREAM) {
+ sk_wmem_queued_add(sk, truesize);
+ if (!skb_zcopy_pure(skb))
+ sk_mem_charge(sk, truesize);
+ } else {
+ refcount_add(truesize, &skb->sk->sk_wmem_alloc);
+ }
+ return ret;
+}
+
int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
{
struct sockaddr_storage address;
@@ -740,7 +792,7 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
- msg.msg_managed_data = 1;
+ msg.sg_from_iter = io_sg_from_iter;

if (zc->flags & IORING_RECVSEND_FIXED_BUF) {
ret = io_import_fixed(WRITE, &msg.msg_iter, req->imu,
@@ -748,7 +800,6 @@ int io_sendzc(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret))
return ret;
} else {
- msg.msg_managed_data = 0;
ret = import_single_range(WRITE, zc->buf, zc->len, &iov,
&msg.msg_iter);
if (unlikely(ret))
diff --git a/net/compat.c b/net/compat.c
index 435846fa85e0..6cd2e7683dd0 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -81,7 +81,6 @@ int __get_compat_msghdr(struct msghdr *kmsg,

kmsg->msg_iocb = NULL;
kmsg->msg_ubuf = NULL;
- kmsg->msg_managed_data = false;
*ptr = msg.msg_iov;
*len = msg.msg_iovlen;
return 0;
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 3c913a6342ad..6901dcb44d72 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -613,59 +613,14 @@ int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset,
}
EXPORT_SYMBOL(skb_copy_datagram_from_iter);

-static int __zerocopy_sg_from_bvec(struct sock *sk, struct sk_buff *skb,
- struct iov_iter *from, size_t length)
-{
- struct skb_shared_info *shinfo = skb_shinfo(skb);
- int frag = shinfo->nr_frags;
- int ret = 0;
- struct bvec_iter bi;
- ssize_t copied = 0;
- unsigned long truesize = 0;
-
- bi.bi_size = min(from->count, length);
- bi.bi_bvec_done = from->iov_offset;
- bi.bi_idx = 0;
-
- while (bi.bi_size && frag < MAX_SKB_FRAGS) {
- struct bio_vec v = mp_bvec_iter_bvec(from->bvec, bi);
-
- copied += v.bv_len;
- truesize += PAGE_ALIGN(v.bv_len + v.bv_offset);
- __skb_fill_page_desc_noacc(shinfo, frag++, v.bv_page,
- v.bv_offset, v.bv_len);
- bvec_iter_advance_single(from->bvec, &bi, v.bv_len);
- }
- if (bi.bi_size)
- ret = -EMSGSIZE;
-
- shinfo->nr_frags = frag;
- from->bvec += bi.bi_idx;
- from->nr_segs -= bi.bi_idx;
- from->count = bi.bi_size;
- from->iov_offset = bi.bi_bvec_done;
-
- skb->data_len += copied;
- skb->len += copied;
- skb->truesize += truesize;
-
- if (sk && sk->sk_type == SOCK_STREAM) {
- sk_wmem_queued_add(sk, truesize);
- if (!skb_zcopy_pure(skb))
- sk_mem_charge(sk, truesize);
- } else {
- refcount_add(truesize, &skb->sk->sk_wmem_alloc);
- }
- return ret;
-}
-
int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
- struct iov_iter *from, size_t length)
+ struct iov_iter *from, struct msghdr *msg,
+ size_t length)
{
int frag;

- if (skb_zcopy_managed(skb))
- return __zerocopy_sg_from_bvec(sk, skb, from, length);
+ if (unlikely(msg && msg->msg_ubuf && msg->sg_from_iter))
+ return msg->sg_from_iter(sk, skb, from, length);

frag = skb_shinfo(skb)->nr_frags;

@@ -753,7 +708,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *from)
if (skb_copy_datagram_from_iter(skb, 0, from, copy))
return -EFAULT;

- return __zerocopy_sg_from_iter(NULL, skb, from, ~0U);
+ return __zerocopy_sg_from_iter(NULL, skb, from, NULL, ~0U);
}
EXPORT_SYMBOL(zerocopy_sg_from_iter);

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7e6fcb3cd817..046ec3124835 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1368,7 +1368,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
if (orig_uarg && uarg != orig_uarg)
return -EEXIST;

- err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, len);
+ err = __zerocopy_sg_from_iter(sk, skb, &msg->msg_iter, msg, len);
if (err == -EFAULT || (err == -EMSGSIZE && skb->len == orig_len)) {
struct sock *save_sk = skb->sk;

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3fd1bf675598..df7f9dfbe8be 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1241,18 +1241,7 @@ static int __ip_append_data(struct sock *sk,
skb->truesize += copy;
wmem_alloc_delta += copy;
} else {
- struct msghdr *msg = from;
-
- if (!skb_shinfo(skb)->nr_frags) {
- if (msg->msg_managed_data)
- skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
- } else {
- /* appending, don't mix managed and unmanaged */
- if (!msg->msg_managed_data)
- skb_zcopy_downgrade_managed(skb);
- }
-
- err = skb_zerocopy_iter_dgram(skb, msg, copy);
+ err = skb_zerocopy_iter_dgram(skb, from, copy);
if (err < 0)
goto error;
}
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 05e2f6271f65..634c16fe8dcd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1392,18 +1392,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
* zerocopy skb
*/
if (!skb->len) {
- if (msg->msg_managed_data)
- skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
skb_shinfo(skb)->flags |= SKBFL_PURE_ZEROCOPY;
- } else {
- /* appending, don't mix managed and unmanaged */
- if (!msg->msg_managed_data)
- skb_zcopy_downgrade_managed(skb);
- if (!skb_zcopy_pure(skb)) {
- copy = tcp_wmem_schedule(sk, copy);
- if (!copy)
- goto wait_for_space;
- }
+ } else if (!skb_zcopy_pure(skb)) {
+ copy = tcp_wmem_schedule(sk, copy);
+ if (!copy)
+ goto wait_for_space;
}

err = skb_zerocopy_iter_stream(sk, skb, msg, copy, uarg);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 34eb3b5da5e2..897ca4f9b791 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1796,18 +1796,7 @@ static int __ip6_append_data(struct sock *sk,
skb->truesize += copy;
wmem_alloc_delta += copy;
} else {
- struct msghdr *msg = from;
-
- if (!skb_shinfo(skb)->nr_frags) {
- if (msg->msg_managed_data)
- skb_shinfo(skb)->flags |= SKBFL_MANAGED_FRAG_REFS;
- } else {
- /* appending, don't mix managed and unmanaged */
- if (!msg->msg_managed_data)
- skb_zcopy_downgrade_managed(skb);
- }
-
- err = skb_zerocopy_iter_dgram(skb, msg, copy);
+ err = skb_zerocopy_iter_dgram(skb, from, copy);
if (err < 0)
goto error;
}
diff --git a/net/socket.c b/net/socket.c
index 0963a02b1472..ed061609265e 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -2107,7 +2107,6 @@ int __sys_sendto(int fd, void __user *buff, size_t len, unsigned int flags,
msg.msg_controllen = 0;
msg.msg_namelen = 0;
msg.msg_ubuf = NULL;
- msg.msg_managed_data = false;
if (addr) {
err = move_addr_to_kernel(addr, addr_len, &address);
if (err < 0)
@@ -2174,7 +2173,6 @@ int __sys_recvfrom(int fd, void __user *ubuf, size_t size, unsigned int flags,
msg.msg_iocb = NULL;
msg.msg_flags = 0;
msg.msg_ubuf = NULL;
- msg.msg_managed_data = false;
if (sock->file->f_flags & O_NONBLOCK)
flags |= MSG_DONTWAIT;
err = sock_recvmsg(sock, &msg, flags);
@@ -2414,7 +2412,6 @@ int __copy_msghdr_from_user(struct msghdr *kmsg,

kmsg->msg_iocb = NULL;
kmsg->msg_ubuf = NULL;
- kmsg->msg_managed_data = false;
*uiov = msg.msg_iov;
*nsegs = msg.msg_iovlen;
return 0;

2022-07-06 15:26:05

by David Ahern

[permalink] [raw]
Subject: Re: [RFC net-next v3 05/29] net: bvec specific path in zerocopy_sg_from_iter

On 7/5/22 4:09 PM, Pavel Begunkov wrote:
> On 7/5/22 15:03, Pavel Begunkov wrote:
>> On 7/5/22 03:28, David Ahern wrote:
>>> On 7/4/22 7:31 AM, Pavel Begunkov wrote:
>>>> If the series is going to be picked up for 5.20, how about we delay
>>>> this one for 5.21? I'll have time to think about it (maybe moving
>>>> the skb managed flag setup inside?), and will anyway need to send
>>>> some omitted patches then.
>>>>
>>>
>>> I think it reads better for io_uring and future extensions for io_uring
>>> to contain the optimized bvec iter handler and setting the managed flag.
>>> Too many disjointed assumptions the way the code is now. By pulling that
>>> into io_uring, core code does not make assumptions that "managed" means
>>> bvec and no page references - rather that is embedded in the code that
>>> cares.
>>
>> Core code would still need to know when to remove the skb's managed
>> flag, e.g. in case of mixing. Can be worked out but with assumptions,
>> which doesn't look better that it currently is. I'll post a 5.20
>> rebased version and will iron it out on the way then.

Sure. My comment was that MANAGED means something else (not core code)
manages the page references on the skb frags. That flag does not need to
be linked to a customized bvec.

> @@ -66,16 +68,13 @@ struct msghdr {
>      };
>      bool        msg_control_is_user : 1;
>      bool        msg_get_inq : 1;/* return INQ after receive */
> -    /*
> -     * The data pages are pinned and won't be released before ->msg_ubuf
> -     * is released. ->msg_iter should point to a bvec and ->msg_ubuf has
> -     * to be non-NULL.
> -     */
> -    bool        msg_managed_data : 1;
>      unsigned int    msg_flags;    /* flags on received message */
>      __kernel_size_t    msg_controllen;    /* ancillary data buffer
> length */
>      struct kiocb    *msg_iocb;    /* ptr to iocb for async requests */
>      struct ubuf_info *msg_ubuf;
> +
> +    int (*sg_from_iter)(struct sock *sk, struct sk_buff *skb,
> +                struct iov_iter *from, size_t length);
>  };
>  

Putting in msghdr works too. I chose ubuf_info because it is directly
related to the ZC path, but that struct is getting tight on space.

2022-08-13 09:06:16

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

Hi Pavel,

>> Given that this fills in msg almost completely can we also have
>> a version of SENDMSGZC, it would be very useful to also allow
>> msg_control to be passed and as well as an iovec.
>>
>> Would that be possible?
>
> Right, I left it to follow ups as the series is already too long.
>
> fwiw, I'm going to also add addr to IORING_OP_SEND.


Given the minimal differences, which were left between
IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
to merge things to IORING_OP_SEND using a IORING_RECVSEND_ZC_NOTIF
as indication to use the notif slot.

It would means we don't need to waste two opcodes for
IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more)


I also noticed a problem in io_notif_update()

for (; idx < idx_end; idx++) {
struct io_notif_slot *slot = &ctx->notif_slots[idx];

if (!slot->notif)
continue;
if (up->arg)
slot->tag = up->arg;
io_notif_slot_flush_submit(slot, issue_flags);
}

slot->tag = up->arg is skipped if there is no notif already.

So you can't just use a 2 linked sqe's with

IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC(with IORING_RECVSEND_NOTIF_FLUSH)

I think the if (!slot->notif) should be moved down a bit.

It would somehow be nice to avoid the notif slots at all and somehow
use some kind of multishot request in order to generate two qces.

I'm also wondering what will happen if a notif will be referenced by the net layer
but the io_uring instance is already closed, wouldn't
io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
because notif->ctx is a stale pointer, of notif itself is already gone...

What do you think?

metze

2022-08-15 09:59:14

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

On 8/13/22 09:45, Stefan Metzmacher wrote:
> Hi Pavel,

Hi Stefan,

Thanks for giving a thought about the API, are you trying
to use it in samba?

>>> Given that this fills in msg almost completely can we also have
>>> a version of SENDMSGZC, it would be very useful to also allow
>>> msg_control to be passed and as well as an iovec.
>>>
>>> Would that be possible?
>>
>> Right, I left it to follow ups as the series is already too long.
>>
>> fwiw, I'm going to also add addr to IORING_OP_SEND.
>
>
> Given the minimal differences, which were left between
> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
> to merge things to IORING_OP_SEND using a IORING_RECVSEND_ZC_NOTIF
> as indication to use the notif slot.

And will be even more similar in for-next, but with notifications
I'd still prefer different opcodes to get a little bit more
flexibility and not making the normal io_uring send path messier.

> It would means we don't need to waste two opcodes for
> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more)
>
>
> I also noticed a problem in io_notif_update()
>
>         for (; idx < idx_end; idx++) {
>                 struct io_notif_slot *slot = &ctx->notif_slots[idx];
>
>                 if (!slot->notif)
>                         continue;
>                 if (up->arg)
>                         slot->tag = up->arg;
>                 io_notif_slot_flush_submit(slot, issue_flags);
>         }
>
>  slot->tag = up->arg is skipped if there is no notif already.
>
> So you can't just use a 2 linked sqe's with
>
> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC(with IORING_RECVSEND_NOTIF_FLUSH)

slot->notif is lazily initialised with the first send attached to it,
so in your example IORING_OP_SENDZC will first create a notification
to execute the send and then will flush it.

This "if" is there is only to have a more reliable API. We can
go over the range and allocate all empty slots and then flush
all of them, but allocation failures should be propagated to the
userspace when currently the function it can't fail.

> I think the if (!slot->notif) should be moved down a bit.

Not sure what you mean

> It would somehow be nice to avoid the notif slots at all and somehow
> use some kind of multishot request in order to generate two qces.

It is there first to ammortise overhead of zerocopy infra and bits
for second CQE posting. But more importantly, without it for TCP
the send payload size would need to be large enough or performance
would suffer, but all depends on the use case. TL;DR; it would be
forced to create a new SKB for each new send.

For something simpler, I'll push another zc variant that doesn't
have notifiers and posts only one CQE and only after the buffers
are no more in use by the kernel. This works well for UDP and for
some TCP scenarios, but doesn't cover all cases.
> I'm also wondering what will happen if a notif will be referenced by the net layer
> but the io_uring instance is already closed, wouldn't
> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
> because notif->ctx is a stale pointer, of notif itself is already gone...

io_uring will flush all slots and wait for all notifications
to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
problem.

--
Pavel Begunkov

2022-08-15 11:51:16

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

Hi Pavel,

> Thanks for giving a thought about the API, are you trying
> to use it in samba?

Yes, but I'd need SENDMSGZC and then I'd like to test,
which variant gives the best performance. It also depends
on the configured samba vfs module stack.

My current prototype uses IO_SENDMSG for the header < 250 bytes
followed by up to 8MBytes via IO_SPLICE if the storage backend also
supports splice, otherwise I'd try to use IO_SENDMSGZC for header + 8 MBytes payload
together. If there's encryption turned actice on the connection we would
most likely always use a bounce buffer and hit the IO_SENDMSGZC case.
So all in all I'd say we'll use it.

I guess it would be useful for userspace to notice if zero was possible or not.

__msg_zerocopy_callback() sets SO_EE_CODE_ZEROCOPY_COPIED, maybe
io_uring_tx_zerocopy_callback() should have something like:

if (!success)
notif->cqe.res = SO_EE_CODE_ZEROCOPY_COPIED;

This would make it a bit easier to judge if SENDZC is useful for the
application or not. Or at least have debug message, which would explain
be able to explain degraded performance to the admin/developer.

>>>> Given that this fills in msg almost completely can we also have
>>>> a version of SENDMSGZC, it would be very useful to also allow
>>>> msg_control to be passed and as well as an iovec.
>>>>
>>>> Would that be possible?
>>>
>>> Right, I left it to follow ups as the series is already too long.
>>>
>>> fwiw, I'm going to also add addr to IORING_OP_SEND.
>>
>>
>> Given the minimal differences, which were left between
>> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
>> to merge things to IORING_OP_SEND using a IORING_RECVSEND_ZC_NOTIF
>> as indication to use the notif slot.
>
> And will be even more similar in for-next, but with notifications
> I'd still prefer different opcodes to get a little bit more
> flexibility and not making the normal io_uring send path messier.

Ok, we should just remember the opcode is only u8
and we already have ~ 50 out of ~250 allocated in ~3 years
time.

>> It would means we don't need to waste two opcodes for
>> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more)
>>
>>
>> I also noticed a problem in io_notif_update()
>>
>>          for (; idx < idx_end; idx++) {
>>                  struct io_notif_slot *slot = &ctx->notif_slots[idx];
>>
>>                  if (!slot->notif)
>>                          continue;
>>                  if (up->arg)
>>                          slot->tag = up->arg;
>>                  io_notif_slot_flush_submit(slot, issue_flags);
>>          }
>>
>>   slot->tag = up->arg is skipped if there is no notif already.
>>
>> So you can't just use a 2 linked sqe's with
>>
>> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC(with IORING_RECVSEND_NOTIF_FLUSH)
>
> slot->notif is lazily initialised with the first send attached to it,
> so in your example IORING_OP_SENDZC will first create a notification
> to execute the send and then will flush it.
>
> This "if" is there is only to have a more reliable API. We can
> go over the range and allocate all empty slots and then flush
> all of them, but allocation failures should be propagated to the
> userspace when currently the function it can't fail.
>
>> I think the if (!slot->notif) should be moved down a bit.
>
> Not sure what you mean

I think it should be:

if (up->arg)
slot->tag = up->arg;
if (!slot->notif)
continue;
io_notif_slot_flush_submit(slot, issue_flags);

or even:

slot->tag = up->arg;
if (!slot->notif)
continue;
io_notif_slot_flush_submit(slot, issue_flags);

otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag,
if notif was never created or already be flushed.

>> It would somehow be nice to avoid the notif slots at all and somehow
>> use some kind of multishot request in order to generate two qces.
>
> It is there first to ammortise overhead of zerocopy infra and bits
> for second CQE posting. But more importantly, without it for TCP
> the send payload size would need to be large enough or performance
> would suffer, but all depends on the use case. TL;DR; it would be
> forced to create a new SKB for each new send.
>
> For something simpler, I'll push another zc variant that doesn't
> have notifiers and posts only one CQE and only after the buffers
> are no more in use by the kernel. This works well for UDP and for
> some TCP scenarios, but doesn't cover all cases.

I think (at least for stream sockets) it would be more useful to
get two CQEs:
1. The first signals userspace that it can
issue the next send-like operation (SEND,SENDZC,SENDMSG,SPLICE)
on the stream without the risk of byte ordering problem within the stream
and avoid too high latency (which would happen, if we wait for a send to
leave the hardware nic, before sending the next PDU).
2. The 2nd signals userspace that the buffer can be reused or released.

In that case it would be useful to also provide a separate 'user_data' element
for the 2nd CQE.

>> I'm also wondering what will happen if a notif will be referenced by the net layer
>> but the io_uring instance is already closed, wouldn't
>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>> because notif->ctx is a stale pointer, of notif itself is already gone...
>
> io_uring will flush all slots and wait for all notifications
> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
> problem.

I can't follow :-(

What I see is that io_notif_unregister():

nd = io_notif_to_data(notif);
slot->notif = NULL;
if (!refcount_dec_and_test(&nd->uarg.refcnt))
continue;

So if the net layer still has a reference we just go on.

Only a wild guess, is it something of:

io_alloc_notif():
...
notif->task = current;
io_get_task_refs(1);
notif->rsrc_node = NULL;
io_req_set_rsrc_node(notif, ctx, 0);
...

and

__io_req_complete_put():
...
io_req_put_rsrc(req);
/*
* Selected buffer deallocation in io_clean_op() assumes that
* we don't hold ->completion_lock. Clean them here to avoid
* deadlocks.
*/
io_put_kbuf_comp(req);
io_dismantle_req(req);
io_put_task(req->task, 1);
...

that causes io_ring_exit_work() to wait for it.
It would be great if you or someone else could explain this in detail
and maybe adding some comments into the code.

metze

2022-08-15 12:57:26

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

On 8/15/22 12:40, Stefan Metzmacher wrote:
> Hi Pavel,
>
>> Thanks for giving a thought about the API, are you trying
>> to use it in samba?
>
> Yes, but I'd need SENDMSGZC and then I'd like to test,
> which variant gives the best performance. It also depends
> on the configured samba vfs module stack.

I can send you a branch this week if you would be
willing to try it out as I'll be sending the "msg" variant
only for 5.21

> My current prototype uses IO_SENDMSG for the header < 250 bytes
> followed by up to 8MBytes via IO_SPLICE if the storage backend also
> supports splice, otherwise I'd try to use IO_SENDMSGZC for header + 8 MBytes payload
> together. If there's encryption turned actice on the connection we would
> most likely always use a bounce buffer and hit the IO_SENDMSGZC case.
> So all in all I'd say we'll use it.

Perfect

> I guess it would be useful for userspace to notice if zero was possible or not.
>
> __msg_zerocopy_callback() sets SO_EE_CODE_ZEROCOPY_COPIED, maybe
> io_uring_tx_zerocopy_callback() should have something like:
>
> if (!success)
>     notif->cqe.res = SO_EE_CODE_ZEROCOPY_COPIED;
>
> This would make it a bit easier to judge if SENDZC is useful for the
> application or not. Or at least have debug message, which would explain
> be able to explain degraded performance to the admin/developer.

Ok, let me think about it


>>>>> Given that this fills in msg almost completely can we also have
>>>>> a version of SENDMSGZC, it would be very useful to also allow
>>>>> msg_control to be passed and as well as an iovec.
>>>>>
>>>>> Would that be possible?
>>>>
>>>> Right, I left it to follow ups as the series is already too long.
>>>>
>>>> fwiw, I'm going to also add addr to IORING_OP_SEND.
>>>
>>>
>>> Given the minimal differences, which were left between
>>> IORING_OP_SENDZC and IORING_OP_SEND, wouldn't it be better
>>> to merge things to IORING_OP_SEND using a IORING_RECVSEND_ZC_NOTIF
>>> as indication to use the notif slot.
>>
>> And will be even more similar in for-next, but with notifications
>> I'd still prefer different opcodes to get a little bit more
>> flexibility and not making the normal io_uring send path messier.
>
> Ok, we should just remember the opcode is only u8
> and we already have ~ 50 out of ~250 allocated in ~3 years
> time.
>
>>> It would means we don't need to waste two opcodes for
>>> IORING_OP_SENDZC and IORING_OP_SENDMSGZC (and maybe more)
>>>
>>>
>>> I also noticed a problem in io_notif_update()
>>>
>>>          for (; idx < idx_end; idx++) {
>>>                  struct io_notif_slot *slot = &ctx->notif_slots[idx];
>>>
>>>                  if (!slot->notif)
>>>                          continue;
>>>                  if (up->arg)
>>>                          slot->tag = up->arg;
>>>                  io_notif_slot_flush_submit(slot, issue_flags);
>>>          }
>>>
>>>   slot->tag = up->arg is skipped if there is no notif already.
>>>
>>> So you can't just use a 2 linked sqe's with
>>>
>>> IORING_RSRC_UPDATE_NOTIF followed by IORING_OP_SENDZC(with IORING_RECVSEND_NOTIF_FLUSH)
>>
>> slot->notif is lazily initialised with the first send attached to it,
>> so in your example IORING_OP_SENDZC will first create a notification
>> to execute the send and then will flush it.
>>
>> This "if" is there is only to have a more reliable API. We can
>> go over the range and allocate all empty slots and then flush
>> all of them, but allocation failures should be propagated to the
>> userspace when currently the function it can't fail.
>>
>>> I think the if (!slot->notif) should be moved down a bit.
>>
>> Not sure what you mean
>
> I think it should be:
>
>                   if (up->arg)
>                           slot->tag = up->arg;
>                   if (!slot->notif)
>                           continue;
>                   io_notif_slot_flush_submit(slot, issue_flags);
>
> or even:
>
>                   slot->tag = up->arg;
>                   if (!slot->notif)
>                           continue;
>                   io_notif_slot_flush_submit(slot, issue_flags);
>
> otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag,
> if notif was never created or already be flushed.

Ah, you want to update it for later. The idea was to affect only
those notifiers that are flushed by this update.
...

>>> It would somehow be nice to avoid the notif slots at all and somehow
>>> use some kind of multishot request in order to generate two qces.
>>
>> It is there first to ammortise overhead of zerocopy infra and bits
>> for second CQE posting. But more importantly, without it for TCP
>> the send payload size would need to be large enough or performance
>> would suffer, but all depends on the use case. TL;DR; it would be
>> forced to create a new SKB for each new send.
>>
>> For something simpler, I'll push another zc variant that doesn't
>> have notifiers and posts only one CQE and only after the buffers
>> are no more in use by the kernel. This works well for UDP and for
>> some TCP scenarios, but doesn't cover all cases.
>
> I think (at least for stream sockets) it would be more useful to
> get two CQEs:
> 1. The first signals userspace that it can
>    issue the next send-like operation (SEND,SENDZC,SENDMSG,SPLICE)
>    on the stream without the risk of byte ordering problem within the stream
>    and avoid too high latency (which would happen, if we wait for a send to
>    leave the hardware nic, before sending the next PDU).
> 2. The 2nd signals userspace that the buffer can be reused or released.
>
> In that case it would be useful to also provide a separate 'user_data' element
> for the 2nd CQE.

...

I had a similar chat with Dylan last week. I'd rather not rob SQE of
additional u64 as there is only addr3 left and then we're fully packed,
but there is another option we were thinking about based on OVERRIDE_TAG
feature I scrapped from the final version of zerocopy patches.

Long story short, the idea is to copy req->cqe.user_data of a
send(+flush) request into the notification CQE, so you'll get 2 CQEs
with identical user_data but they can be distinguished by looking at
cqe->flags.

What do you think? Would it work for you?


>>> I'm also wondering what will happen if a notif will be referenced by the net layer
>>> but the io_uring instance is already closed, wouldn't
>>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>>> because notif->ctx is a stale pointer, of notif itself is already gone...
>>
>> io_uring will flush all slots and wait for all notifications
>> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
>> problem.
>
> I can't follow :-(
>
> What I see is that io_notif_unregister():
>
>                 nd = io_notif_to_data(notif);
>                 slot->notif = NULL;
>                 if (!refcount_dec_and_test(&nd->uarg.refcnt))
>                         continue;
>
> So if the net layer still has a reference we just go on.
>
> Only a wild guess, is it something of:
>
> io_alloc_notif():
>         ...
>         notif->task = current;
>         io_get_task_refs(1);
>         notif->rsrc_node = NULL;
>         io_req_set_rsrc_node(notif, ctx, 0);
>         ...
>
> and
>
> __io_req_complete_put():
>                 ...
>                 io_req_put_rsrc(req);
>                 /*
>                  * Selected buffer deallocation in io_clean_op() assumes that
>                  * we don't hold ->completion_lock. Clean them here to avoid
>                  * deadlocks.
>                  */
>                 io_put_kbuf_comp(req);
>                 io_dismantle_req(req);
>                 io_put_task(req->task, 1);
>                 ...
>
> that causes io_ring_exit_work() to wait for it.> It would be great if you or someone else could explain this in detail
> and maybe adding some comments into the code.

Almost, the mechanism is absolutely the same as with requests,
and notifiers are actually requests for internal purposes.

In __io_alloc_req_refill() we grab ctx->refs, which are waited
for in io_ring_exit_work(). We usually put requests into a cache,
so when a request is complete we don't put the ref and therefore
in io_ring_exit_work() we also have a call to io_req_caches_free(),
which puts ctx->refs.

--
Pavel Begunkov

2022-08-15 13:33:34

by Stefan Metzmacher

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

Hi Pavel,

>>> Thanks for giving a thought about the API, are you trying
>>> to use it in samba?
>>
>> Yes, but I'd need SENDMSGZC and then I'd like to test,
>> which variant gives the best performance. It also depends
>> on the configured samba vfs module stack.
>
> I can send you a branch this week if you would be
> willing to try it out as I'll be sending the "msg" variant
> only for 5.21

I'm not sure I'll have time to do runtime testing,
but it would be great to have a look at the code and give some comments
based on that.

>> I think it should be:
>>
>>                    if (up->arg)
>>                            slot->tag = up->arg;
>>                    if (!slot->notif)
>>                            continue;
>>                    io_notif_slot_flush_submit(slot, issue_flags);
>>
>> or even:
>>
>>                    slot->tag = up->arg;
>>                    if (!slot->notif)
>>                            continue;
>>                    io_notif_slot_flush_submit(slot, issue_flags);
>>
>> otherwise IORING_RSRC_UPDATE_NOTIF would not be able to reset the tag,
>> if notif was never created or already be flushed.
>
> Ah, you want to update it for later. The idea was to affect only
> those notifiers that are flushed by this update.
> ...

notif->cqe.user_data = slot->tag; happens in io_alloc_notif(),
so the slot->tag = up->arg; here is always for the next IO_SENDZC.

With IORING_RSRC_UPDATE_NOTIF linked to a IORING_OP_SENDZC(with IORING_RECVSEND_NOTIF_FLUSH)
I basically try to reset slot->tag to the same (or related) user_data as the SENDZC itself.

So that each SENDZC generates two CQEs with the same user_data belonging
to the same userspace buffer.


> I had a similar chat with Dylan last week. I'd rather not rob SQE of
> additional u64 as there is only addr3 left and then we're fully packed,
> but there is another option we were thinking about based on OVERRIDE_TAG
> feature I scrapped from the final version of zerocopy patches.
>
> Long story short, the idea is to copy req->cqe.user_data of a
> send(+flush) request into the notification CQE, so you'll get 2 CQEs
> with identical user_data but they can be distinguished by looking at
> cqe->flags.
>
> What do you think? Would it work for you?

I guess that would work.

>>>> I'm also wondering what will happen if a notif will be referenced by the net layer
>>>> but the io_uring instance is already closed, wouldn't
>>>> io_uring_tx_zerocopy_callback() or __io_notif_complete_tw() crash
>>>> because notif->ctx is a stale pointer, of notif itself is already gone...
>>>
>>> io_uring will flush all slots and wait for all notifications
>>> to fire, i.e. io_uring_tx_zerocopy_callback(), so it's not a
>>> problem.
>>
>> I can't follow :-(
>>
>> What I see is that io_notif_unregister():
>>
>>                  nd = io_notif_to_data(notif);
>>                  slot->notif = NULL;
>>                  if (!refcount_dec_and_test(&nd->uarg.refcnt))
>>                          continue;
>>
>> So if the net layer still has a reference we just go on.
>>
>> Only a wild guess, is it something of:
>>
>> io_alloc_notif():
>>          ...
>>          notif->task = current;
>>          io_get_task_refs(1);
>>          notif->rsrc_node = NULL;
>>          io_req_set_rsrc_node(notif, ctx, 0);
>>          ...
>>
>> and
>>
>> __io_req_complete_put():
>>                  ...
>>                  io_req_put_rsrc(req);
>>                  /*
>>                   * Selected buffer deallocation in io_clean_op() assumes that
>>                   * we don't hold ->completion_lock. Clean them here to avoid
>>                   * deadlocks.
>>                   */
>>                  io_put_kbuf_comp(req);
>>                  io_dismantle_req(req);
>>                  io_put_task(req->task, 1);
>>                  ...
>>
>> that causes io_ring_exit_work() to wait for it.> It would be great if you or someone else could explain this in detail
>> and maybe adding some comments into the code.
>
> Almost, the mechanism is absolutely the same as with requests,
> and notifiers are actually requests for internal purposes.
>
> In __io_alloc_req_refill() we grab ctx->refs, which are waited
> for in io_ring_exit_work(). We usually put requests into a cache,
> so when a request is complete we don't put the ref and therefore
> in io_ring_exit_work() we also have a call to io_req_caches_free(),
> which puts ctx->refs.

Ok, thanks.

Would a close() on the ring fd block? I guess not, but the exit_work may block, correct?
So a process would be a zombie until net released all references?

metze

2022-08-15 14:32:11

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC net-next v3 23/29] io_uring: allow to pass addr into sendzc

On 8/15/22 14:30, Stefan Metzmacher wrote:
[...]
>>> that causes io_ring_exit_work() to wait for it.> It would be great if you or someone else could explain this in detail
>>> and maybe adding some comments into the code.
>>
>> Almost, the mechanism is absolutely the same as with requests,
>> and notifiers are actually requests for internal purposes.
>>
>> In __io_alloc_req_refill() we grab ctx->refs, which are waited
>> for in io_ring_exit_work(). We usually put requests into a cache,
>> so when a request is complete we don't put the ref and therefore
>> in io_ring_exit_work() we also have a call to io_req_caches_free(),
>> which puts ctx->refs.
>
> Ok, thanks.
>
> Would a close() on the ring fd block? I guess not, but the exit_work may block, correct?

Right, close doesn't block, it queues exit_work to get executed async.
exit_work() will wait for ctx->refs and then will free most of
io_uring resources.

> So a process would be a zombie until net released all references?

The userspace process will exit normally, but you can find a kernel
thread (kworker) doing the job.

--
Pavel Begunkov