2021-05-19 20:00:52

by Pavel Begunkov

Subject: [RFC v2 00/23] io_uring BPF requests

The main problem solved is feeding completion information of other
requests, in the form of CQEs, back into BPF. I decided to wire up support
for multiple completion queues (aka CQs) and give BPF programs access to
them, leaving userspace in control over synchronisation, which should
be much more flexible than the link-based approach.

For instance, there can be a separate CQ for each BPF program, so no
extra sync is needed, and communication can be done by submitting a
request targeting a neighboring CQ or by submitting a CQE there directly
(see test3 below). The CQ is chosen by sqe->cq_idx, so everyone can
cross-fire if willing.

A bunch of other features were added to play around with (see the v1
changelog below or test1); some are experimental only. The interfaces are
not even close to settled.
Note: there are known problems, e.g. one may live-lock a task. That is
unlikely to happen, but better to be aware of it.

For convenience, a git branch with the kernel part is at [1], and
libbpf + examples at [2]. The examples are written in restricted C with
libbpf and live under examples/bpf/, see [3], with 4 BPF programs and 4
corresponding test cases in uring.c. It's already shaping up to be
interesting to play with; a rough sketch of such a program follows the
links below.

test1: just a set of usage examples for the features
test2/counting: reacts to N ticks using timeout reqs and CQ waiting
test3/pingpong: two BPF reqs do message-based communication by
                repeatedly writing a CQE to the other program's CQ and
                waiting for a response
test4/write_file: BPF writes N bytes to a file while keeping QD>1

[1] https://github.com/isilence/linux/tree/ebpf_v2
[2] https://github.com/isilence/liburing/tree/ebpf_v2
[3] https://github.com/isilence/liburing/tree/ebpf_v2/examples/bpf
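
For a flavour of what the restricted-C side looks like, below is a rough,
untested sketch in the spirit of test3. The section names, context layout
and helper declarations follow later patches in this series (libbpf
"iouring/" sections, struct io_uring_bpf_ctx, and the iouring_* helpers);
the actual code in the examples repo may differ.

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

/* helper prototypes inferred from the kernel-side protos in this series;
 * the ids come from the patched uapi helper list */
static long (*iouring_emit_cqe)(void *ctx, __u32 cq_idx, __u64 user_data,
                                __s32 res, __u32 flags) =
                        (void *)BPF_FUNC_iouring_emit_cqe;
static long (*iouring_reap_cqe)(void *ctx, __u32 cq_idx,
                                struct io_uring_cqe *cqe, __u32 cqe_len) =
                        (void *)BPF_FUNC_iouring_reap_cqe;

char LICENSE[] SEC("license") = "GPL";

SEC("iouring/pingpong")
int pingpong(struct io_uring_bpf_ctx *ctx)
{
        struct io_uring_cqe cqe;

        /* post a message CQE into the peer's CQ (index 1) ... */
        iouring_emit_cqe(ctx, 1, /*user_data*/ 0xcafe, /*res*/ 0, /*flags*/ 0);
        /* ... consume a reply from our own CQ (index 2), if one is there ... */
        iouring_reap_cqe(ctx, 2, &cqe, sizeof(cqe));
        /* ... and ask to be re-run once our CQ has another CQE */
        ctx->wait_idx = 2;
        ctx->wait_nr = 1;
        return 0;       /* the verifier only accepts a constant 0 */
}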

since v1:
- several bug fixes
- support multiple CQs
- allow BPF requests to wait on CQs
- BPF helpers for emit/reap CQE
- expose user_data to BPF program
- sleepable + let BPF read/write from userspace

Pavel Begunkov (23):
io_uring: shuffle rarely used ctx fields
io_uring: localise fixed resources fields
io_uring: remove dependency on ring->sq/cq_entries
io_uring: deduce cq_mask from cq_entries
io_uring: kill cached_cq_overflow
io_uring: rename io_get_cqring
io_uring: extract struct for CQ
io_uring: internally pass CQ indexes
io_uring: extract cq size helper
io_uring: add support for multiple CQs
io_uring: enable mmap'ing additional CQs
bpf: add IOURING program type
io_uring: implement bpf prog registration
io_uring: add support for bpf requests
io_uring: enable BPF to submit SQEs
io_uring: enable bpf to submit CQEs
io_uring: enable bpf to reap CQEs
libbpf: support io_uring
io_uring: pass user_data to bpf executor
bpf: Add bpf_copy_to_user() helper
io_uring: wire bpf copy to user
io_uring: don't wait on CQ exclusively
io_uring: enable bpf reqs to wait for CQs

fs/io_uring.c | 794 +++++++++++++++++++++++++++------
include/linux/bpf.h | 1 +
include/linux/bpf_types.h | 2 +
include/uapi/linux/bpf.h | 12 +
include/uapi/linux/io_uring.h | 15 +-
kernel/bpf/helpers.c | 17 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 5 +-
tools/include/uapi/linux/bpf.h | 7 +
tools/lib/bpf/libbpf.c | 7 +
10 files changed, 722 insertions(+), 139 deletions(-)

--
2.31.1



2021-05-19 20:03:21

by Pavel Begunkov

Subject: [PATCH 07/23] io_uring: extract struct for CQ

Extract a structure describing the internal completion queue state,
called struct io_cqring. We need it to support multi-CQ rings.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 47 +++++++++++++++++++++++++----------------------
1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 49a1b6b81d7d..4fecd9da689e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -335,6 +335,12 @@ struct io_submit_state {
unsigned int ios_left;
};

+struct io_cqring {
+ unsigned cached_tail;
+ unsigned entries;
+ struct io_rings *rings;
+};
+
struct io_ring_ctx {
struct {
struct percpu_ref refs;
@@ -402,17 +408,14 @@ struct io_ring_ctx {
struct xarray personalities;
u32 pers_next;

- struct {
- unsigned cached_cq_tail;
- unsigned cq_entries;
- atomic_t cq_timeouts;
- unsigned cq_last_tm_flush;
- unsigned cq_extra;
- unsigned long cq_check_overflow;
- struct wait_queue_head cq_wait;
- struct fasync_struct *cq_fasync;
- struct eventfd_ctx *cq_ev_fd;
- } ____cacheline_aligned_in_smp;
+ struct fasync_struct *cq_fasync;
+ struct eventfd_ctx *cq_ev_fd;
+ atomic_t cq_timeouts;
+ unsigned cq_last_tm_flush;
+ unsigned long cq_check_overflow;
+ unsigned cq_extra;
+ struct wait_queue_head cq_wait;
+ struct io_cqring cqs[1];

struct {
spinlock_t completion_lock;
@@ -1207,7 +1210,7 @@ static bool req_need_defer(struct io_kiocb *req, u32 seq)
if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
struct io_ring_ctx *ctx = req->ctx;

- return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
+ return seq + READ_ONCE(ctx->cq_extra) != ctx->cqs[0].cached_tail;
}

return false;
@@ -1312,7 +1315,7 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)
if (list_empty(&ctx->timeout_list))
return;

- seq = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
+ seq = ctx->cqs[0].cached_tail - atomic_read(&ctx->cq_timeouts);

do {
u32 events_needed, events_got;
@@ -1346,7 +1349,7 @@ static void io_commit_cqring(struct io_ring_ctx *ctx)
io_flush_timeouts(ctx);

/* order cqe stores with ring update */
- smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
+ smp_store_release(&ctx->rings->cq.tail, ctx->cqs[0].cached_tail);

if (unlikely(!list_empty(&ctx->defer_list)))
__io_queue_deferred(ctx);
@@ -1361,23 +1364,23 @@ static inline bool io_sqring_full(struct io_ring_ctx *ctx)

static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
{
- return ctx->cached_cq_tail - READ_ONCE(ctx->rings->cq.head);
+ return ctx->cqs[0].cached_tail - READ_ONCE(ctx->rings->cq.head);
}

static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
{
struct io_rings *rings = ctx->rings;
- unsigned tail, mask = ctx->cq_entries - 1;
+ unsigned tail, mask = ctx->cqs[0].entries - 1;

/*
* writes to the cq entry need to come after reading head; the
* control dependency is enough as we're using WRITE_ONCE to
* fill the cq entry
*/
- if (__io_cqring_events(ctx) == ctx->cq_entries)
+ if (__io_cqring_events(ctx) == ctx->cqs[0].entries)
return NULL;

- tail = ctx->cached_cq_tail++;
+ tail = ctx->cqs[0].cached_tail++;
return &rings->cqes[tail & mask];
}

@@ -1430,7 +1433,7 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
unsigned long flags;
bool all_flushed, posted;

- if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
+ if (!force && __io_cqring_events(ctx) == ctx->cqs[0].entries)
return false;

posted = false;
@@ -5670,7 +5673,7 @@ static int io_timeout(struct io_kiocb *req, unsigned int issue_flags)
goto add;
}

- tail = ctx->cached_cq_tail - atomic_read(&ctx->cq_timeouts);
+ tail = ctx->cqs[0].cached_tail - atomic_read(&ctx->cq_timeouts);
req->timeout.target_seq = tail + off;

/* Update the last seq here in case io_flush_timeouts() hasn't.
@@ -9331,7 +9334,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
if (unlikely(ret))
goto out;

- min_complete = min(min_complete, ctx->cq_entries);
+ min_complete = min(min_complete, ctx->cqs[0].entries);

/*
* When SETUP_IOPOLL and SETUP_SQPOLL are both enabled, user
@@ -9481,7 +9484,7 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,

/* make sure these are sane, as we already accounted them */
ctx->sq_entries = p->sq_entries;
- ctx->cq_entries = p->cq_entries;
+ ctx->cqs[0].entries = p->cq_entries;

size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset);
if (size == SIZE_MAX)
--
2.31.1


2021-05-19 20:03:21

by Pavel Begunkov

Subject: [PATCH 06/23] io_uring: rename io_get_cqring

Rename io_get_cqring() to io_get_cqe() for consistency with the SQ side,
and because the old name is not as clear.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b89a781b3f33..49a1b6b81d7d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11,7 +11,7 @@
* before writing the tail (using smp_load_acquire to read the tail will
* do). It also needs a smp_mb() before updating CQ head (ordering the
* entry load(s) with the head store), pairing with an implicit barrier
- * through a control-dependency in io_get_cqring (smp_store_release to
+ * through a control-dependency in io_get_cqe (smp_store_release to
* store head will do). Failure to do so could lead to reading invalid
* CQ entries.
*
@@ -1364,7 +1364,7 @@ static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
return ctx->cached_cq_tail - READ_ONCE(ctx->rings->cq.head);
}

-static inline struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
+static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
{
struct io_rings *rings = ctx->rings;
unsigned tail, mask = ctx->cq_entries - 1;
@@ -1436,7 +1436,7 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
posted = false;
spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&ctx->cq_overflow_list)) {
- struct io_uring_cqe *cqe = io_get_cqring(ctx);
+ struct io_uring_cqe *cqe = io_get_cqe(ctx);
struct io_overflow_cqe *ocqe;

if (!cqe && !force)
@@ -1558,7 +1558,7 @@ static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
* submission (by quite a lot). Increment the overflow count in
* the ring.
*/
- cqe = io_get_cqring(ctx);
+ cqe = io_get_cqe(ctx);
if (likely(cqe)) {
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
--
2.31.1


2021-05-19 20:03:25

by Pavel Begunkov

Subject: [PATCH 05/23] io_uring: kill cached_cq_overflow

There are two copies of cq_overflow: one shared with userspace and an
internal cached one. The cached copy was needed for DRAIN accounting, but
now we have yet another knob to tune the accounting, i.e. cq_extra, so we
can throw away the internal counter and just increment the one in the
shared ring.

If the user modifies it and so never gets the right overflow value ever
again, that's their problem, even though before we would have restored it
by the next overflow.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 067c89e63fea..b89a781b3f33 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -363,7 +363,6 @@ struct io_ring_ctx {
unsigned sq_entries;
unsigned sq_thread_idle;
unsigned cached_sq_dropped;
- unsigned cached_cq_overflow;
unsigned long sq_check_overflow;

struct list_head defer_list;
@@ -1195,13 +1194,20 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
return NULL;
}

+static void io_account_cq_overflow(struct io_ring_ctx *ctx)
+{
+ struct io_rings *r = ctx->rings;
+
+ WRITE_ONCE(r->cq_overflow, READ_ONCE(r->cq_overflow) + 1);
+ ctx->cq_extra--;
+}
+
static bool req_need_defer(struct io_kiocb *req, u32 seq)
{
if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
struct io_ring_ctx *ctx = req->ctx;

- return seq + ctx->cq_extra != ctx->cached_cq_tail
- + READ_ONCE(ctx->cached_cq_overflow);
+ return seq + READ_ONCE(ctx->cq_extra) != ctx->cached_cq_tail;
}

return false;
@@ -1440,8 +1446,8 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
if (cqe)
memcpy(cqe, &ocqe->cqe, sizeof(*cqe));
else
- WRITE_ONCE(ctx->rings->cq_overflow,
- ++ctx->cached_cq_overflow);
+ io_account_cq_overflow(ctx);
+
posted = true;
list_del(&ocqe->list);
kfree(ocqe);
@@ -1525,7 +1531,7 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
* or cannot allocate an overflow entry, then we need to drop it
* on the floor.
*/
- WRITE_ONCE(ctx->rings->cq_overflow, ++ctx->cached_cq_overflow);
+ io_account_cq_overflow(ctx);
return false;
}
if (list_empty(&ctx->cq_overflow_list)) {
--
2.31.1


2021-05-19 20:03:44

by Pavel Begunkov

Subject: [PATCH 12/23] bpf: add IOURING program type

Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
io_uring to execute BPF-based requests.
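
As an illustration only (not part of the patch), a trivial program of the
new type could be loaded via the raw bpf(2) syscall roughly as below. Note
the verifier hunk only accepts a constant 0 return value, and the
syscall.c hunk gates the type behind CAP_NET_ADMIN/CAP_SYS_ADMIN.

#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* hypothetical loader sketch: a "return 0" program of the new type */
static int load_iouring_prog(void)
{
        struct bpf_insn insns[] = {
                { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0 }, /* r0 = 0 */
                { .code = BPF_JMP | BPF_EXIT },                                /* exit */
        };
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.prog_type = BPF_PROG_TYPE_IOURING;         /* added by this patch */
        attr.insns = (unsigned long)insns;
        attr.insn_cnt = 2;
        attr.license = (unsigned long)"GPL";

        return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}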

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 21 +++++++++++++++++++++
include/linux/bpf_types.h | 2 ++
include/uapi/linux/bpf.h | 1 +
kernel/bpf/syscall.c | 1 +
kernel/bpf/verifier.c | 5 ++++-
5 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1a4c9e513ac9..882b16b5e5eb 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10201,6 +10201,27 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
return ret;
}

+static const struct bpf_func_proto *
+io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return bpf_base_func_proto(func_id);
+}
+
+static bool io_bpf_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return false;
+}
+
+const struct bpf_prog_ops bpf_io_uring_prog_ops = {};
+
+const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
+ .get_func_proto = io_bpf_func_proto,
+ .is_valid_access = io_bpf_is_valid_access,
+};
+
SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
void __user *, arg, unsigned int, nr_args)
{
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 99f7fd657d87..d0b7954887bd 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -77,6 +77,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
void *, void *)
#endif /* CONFIG_BPF_LSM */
#endif
+BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
+ void *, void *)

BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 4ba4ef0ff63a..de544f0fbeef 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -206,6 +206,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_EXT,
BPF_PROG_TYPE_LSM,
BPF_PROG_TYPE_SK_LOOKUP,
+ BPF_PROG_TYPE_IOURING,
};

enum bpf_attach_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 250503482cda..6ef7a26f4dc3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
case BPF_PROG_TYPE_CGROUP_SOCKOPT:
case BPF_PROG_TYPE_CGROUP_SYSCTL:
case BPF_PROG_TYPE_SOCK_OPS:
+ case BPF_PROG_TYPE_IOURING:
case BPF_PROG_TYPE_EXT: /* extends any prog */
return true;
case BPF_PROG_TYPE_CGROUP_SKB:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0399ac092b36..2a53f44618a7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
case BPF_PROG_TYPE_SK_LOOKUP:
range = tnum_range(SK_DROP, SK_PASS);
break;
+ case BPF_PROG_TYPE_IOURING:
+ range = tnum_const(0);
+ break;
case BPF_PROG_TYPE_EXT:
/* freplace program can return anything as its return value
* depends on the to-be-replaced kernel func or bpf program.
@@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
u64 key;

if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
- prog->type != BPF_PROG_TYPE_LSM) {
+ prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {
verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
return -EINVAL;
}
--
2.31.1


2021-05-19 20:03:45

by Pavel Begunkov

Subject: [PATCH 10/23] io_uring: add support for multiple CQs

TODO: don't rob all bits from params, use pointer to a struct
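
For illustration only (not part of the patch), the io_uring_params change
below lets userspace ask for extra CQs roughly as in this raw-syscall
sketch; liburing support lives in the branch from the cover letter.

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* hypothetical sketch: a ring with two extra 128-entry CQs; CQ 0 stays
 * the normal one sized from the entries/cq_entries arguments */
static int setup_multi_cq_ring(void)
{
        __u32 extra_cq_sizes[2] = { 128, 128 };
        struct io_uring_params p;

        memset(&p, 0, sizeof(p));
        p.nr_cq = 2;                                    /* number of additional CQs */
        p.cq_sizes = (unsigned long)extra_cq_sizes;     /* array of their sizes */

        return syscall(__NR_io_uring_setup, 8, &p);
}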

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 89 +++++++++++++++++++++++++++--------
include/uapi/linux/io_uring.h | 3 +-
2 files changed, 71 insertions(+), 21 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f05592ae5f41..067cfb3a6e4a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -91,6 +91,7 @@
#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)

#define IO_DEFAULT_CQ 0
+#define IO_MAX_CQRINGS 1024

/*
* Shift of 9 is 512 entries, or exactly one page on 64-bit archs
@@ -417,7 +418,7 @@ struct io_ring_ctx {
unsigned long cq_check_overflow;
unsigned cq_extra;
struct wait_queue_head cq_wait;
- struct io_cqring cqs[1];
+ struct io_cqring *cqs;
unsigned int cq_nr;

struct {
@@ -1166,6 +1167,9 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
if (!ctx)
return NULL;

+ ctx->cqs = kmalloc_array(p->nr_cq + 1, sizeof(ctx->cqs[0]), GFP_KERNEL);
+ if (!ctx->cqs)
+ goto err;
/*
* Use 5 bits less than the max cq entries, that should give us around
* 32 entries per hash list if totally full and uniformly spread.
@@ -8634,6 +8638,8 @@ static bool io_wait_rsrc_data(struct io_rsrc_data *data)

static void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
+ unsigned int i;
+
io_sq_thread_finish(ctx);

if (ctx->mm_account) {
@@ -8673,6 +8679,9 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)

io_mem_free(ctx->rings);
io_mem_free(ctx->sq_sqes);
+ for (i = 1; i < ctx->cq_nr; i++)
+ io_mem_free(ctx->cqs[i].rings);
+ kfree(ctx->cqs);

percpu_ref_exit(&ctx->refs);
free_uid(ctx->user);
@@ -9524,11 +9533,39 @@ static const struct file_operations io_uring_fops = {
#endif
};

+static void __io_init_cqring(struct io_cqring *cq, struct io_rings *rings,
+ unsigned int entries)
+{
+ WRITE_ONCE(rings->cq_ring_entries, entries);
+ WRITE_ONCE(rings->cq_ring_mask, entries - 1);
+
+ cq->cached_tail = 0;
+ cq->rings = rings;
+ cq->entries = entries;
+}
+
+static int io_init_cqring(struct io_cqring *cq, unsigned int entries)
+{
+ struct io_rings *rings;
+ size_t size;
+
+ size = rings_size(0, entries, NULL);
+ if (size == SIZE_MAX)
+ return -EOVERFLOW;
+ rings = io_mem_alloc(size);
+ if (!rings)
+ return -ENOMEM;
+ __io_init_cqring(cq, rings, entries);
+ return 0;
+}
+
static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
struct io_uring_params *p)
{
+ u32 __user *cq_sizes = u64_to_user_ptr(p->cq_sizes);
struct io_rings *rings;
size_t size, sq_array_offset;
+ int i, ret;

/* make sure these are sane, as we already accounted them */
ctx->sq_entries = p->sq_entries;
@@ -9544,30 +9581,43 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
ctx->rings = rings;
ctx->sq_array = (u32 *)((char *)rings + sq_array_offset);
rings->sq_ring_mask = p->sq_entries - 1;
- rings->cq_ring_mask = p->cq_entries - 1;
rings->sq_ring_entries = p->sq_entries;
- rings->cq_ring_entries = p->cq_entries;

- ctx->cqs[0].cached_tail = 0;
- ctx->cqs[0].rings = rings;
- ctx->cqs[0].entries = p->cq_entries;
+ __io_init_cqring(&ctx->cqs[0], rings, p->cq_entries);
ctx->cq_nr = 1;

size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
- if (size == SIZE_MAX) {
- io_mem_free(ctx->rings);
- ctx->rings = NULL;
- return -EOVERFLOW;
- }
+ ret = -EOVERFLOW;
+ if (unlikely(size == SIZE_MAX))
+ goto err;

ctx->sq_sqes = io_mem_alloc(size);
- if (!ctx->sq_sqes) {
- io_mem_free(ctx->rings);
- ctx->rings = NULL;
- return -ENOMEM;
+ ret = -ENOMEM;
+ if (unlikely(!ctx->sq_sqes))
+ goto err;
+
+ for (i = 0; i < p->nr_cq; i++, ctx->cq_nr++) {
+ u32 sz;
+ long entries;
+
+ ret = -EFAULT;
+ if (copy_from_user(&sz, &cq_sizes[i], sizeof(sz)))
+ goto err;
+ entries = io_get_cqring_size(p, sz);
+ if (entries < 0) {
+ ret = entries;
+ goto err;
+ }
+ ret = io_init_cqring(&ctx->cqs[i + 1], entries);
+ if (ret)
+ goto err;
}

return 0;
+err:
+ io_mem_free(ctx->rings);
+ ctx->rings = NULL;
+ return ret;
}

static int io_uring_install_fd(struct io_ring_ctx *ctx, struct file *file)
@@ -9653,6 +9703,10 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
} else {
p->cq_entries = 2 * p->sq_entries;
}
+ if (p->nr_cq > IO_MAX_CQRINGS)
+ return -EINVAL;
+ if (!p->nr_cq != !p->cq_sizes)
+ return -EINVAL;

ctx = io_ring_ctx_alloc(p);
if (!ctx)
@@ -9744,14 +9798,9 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
{
struct io_uring_params p;
- int i;

if (copy_from_user(&p, params, sizeof(p)))
return -EFAULT;
- for (i = 0; i < ARRAY_SIZE(p.resv); i++) {
- if (p.resv[i])
- return -EINVAL;
- }

if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index c2dfb179360a..92b61ca09ea5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -263,7 +263,8 @@ struct io_uring_params {
__u32 sq_thread_idle;
__u32 features;
__u32 wq_fd;
- __u32 resv[3];
+ __u32 nr_cq;
+ __u64 cq_sizes;
struct io_sqring_offsets sq_off;
struct io_cqring_offsets cq_off;
};
--
2.31.1


2021-05-19 20:04:27

by Pavel Begunkov

Subject: [PATCH 02/23] io_uring: localise fixed resources fields

The ring has two types of resource-related fields: those used for request
submission, and those needed for update/registration. Reshuffle them
into these two groups for better locality and readability. The second
group is not in the hot path, so it's natural to place it somewhere
towards the end. Also update an outdated comment.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 33 +++++++++++++++++----------------
1 file changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7e3410ce100a..31eca208f675 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -390,21 +390,17 @@ struct io_ring_ctx {
struct list_head sqd_list;

/*
- * If used, fixed file set. Writers must ensure that ->refs is dead,
- * readers must ensure that ->refs is alive as long as the file* is
- * used. Only updated through io_uring_register(2).
+ * Fixed resources fast path, should be accessed only under uring_lock,
+ * and updated through io_uring_register(2)
*/
- struct io_rsrc_data *file_data;
+ struct io_rsrc_node *rsrc_node;
+
struct io_file_table file_table;
unsigned nr_user_files;
-
- /* if used, fixed mapped user buffers */
- struct io_rsrc_data *buf_data;
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;

struct xarray io_buffers;
-
struct xarray personalities;
u32 pers_next;

@@ -436,16 +432,21 @@ struct io_ring_ctx {
bool poll_multi_file;
} ____cacheline_aligned_in_smp;

- struct delayed_work rsrc_put_work;
- struct llist_head rsrc_put_llist;
- struct list_head rsrc_ref_list;
- spinlock_t rsrc_ref_lock;
- struct io_rsrc_node *rsrc_node;
- struct io_rsrc_node *rsrc_backup_node;
- struct io_mapped_ubuf *dummy_ubuf;
-
struct io_restriction restrictions;

+ /* slow path rsrc auxilary data, used by update/register */
+ struct {
+ struct io_rsrc_node *rsrc_backup_node;
+ struct io_mapped_ubuf *dummy_ubuf;
+ struct io_rsrc_data *file_data;
+ struct io_rsrc_data *buf_data;
+
+ struct delayed_work rsrc_put_work;
+ struct llist_head rsrc_put_llist;
+ struct list_head rsrc_ref_list;
+ spinlock_t rsrc_ref_lock;
+ };
+
/* Keep this last, we don't need it for the fast path */
struct {
#if defined(CONFIG_UNIX)
--
2.31.1


2021-05-19 20:04:29

by Pavel Begunkov

Subject: [PATCH 13/23] io_uring: implement bpf prog registration

[De]register BPF programs through io_uring_register() with the new
IORING_REGISTER_BPF and IORING_UNREGISTER_BPF commands.
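
For illustration only (not from the patch), registration from userspace
would look roughly like this; the array index of each fd is what requests
will later use to pick a program.

#include <linux/io_uring.h>
#include <sys/syscall.h>
#include <unistd.h>

/* hypothetical sketch: attach two already-loaded BPF program fds */
static int ring_register_bpf(int ring_fd, int prog_fd0, int prog_fd1)
{
        __u32 fds[] = { prog_fd0, prog_fd1 };   /* an fd of -1 leaves a slot empty */

        return syscall(__NR_io_uring_register, ring_fd,
                       IORING_REGISTER_BPF, fds, 2);
}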

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 81 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 2 +
2 files changed, 83 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 882b16b5e5eb..b13cbcd5c47b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
#include <linux/task_work.h>
#include <linux/pagemap.h>
#include <linux/io_uring.h>
+#include <linux/bpf.h>

#define CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>
@@ -103,6 +104,8 @@
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)

+#define IORING_MAX_BPF_PROGS 100
+
#define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \
IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
IOSQE_BUFFER_SELECT)
@@ -266,6 +269,10 @@ struct io_restriction {
bool registered;
};

+struct io_bpf_prog {
+ struct bpf_prog *prog;
+};
+
enum {
IO_SQ_THREAD_SHOULD_STOP = 0,
IO_SQ_THREAD_SHOULD_PARK,
@@ -411,6 +418,10 @@ struct io_ring_ctx {
struct xarray personalities;
u32 pers_next;

+ /* bpf programs */
+ unsigned nr_bpf_progs;
+ struct io_bpf_prog *bpf_progs;
+
struct fasync_struct *cq_fasync;
struct eventfd_ctx *cq_ev_fd;
atomic_t cq_timeouts;
@@ -8627,6 +8638,66 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
mutex_unlock(&ctx->uring_lock);
}

+static int io_bpf_unregister(struct io_ring_ctx *ctx)
+{
+ int i;
+
+ if (!ctx->nr_bpf_progs)
+ return -ENXIO;
+
+ for (i = 0; i < ctx->nr_bpf_progs; ++i) {
+ struct bpf_prog *prog = ctx->bpf_progs[i].prog;
+
+ if (prog)
+ bpf_prog_put(prog);
+ }
+ kfree(ctx->bpf_progs);
+ ctx->bpf_progs = NULL;
+ ctx->nr_bpf_progs = 0;
+ return 0;
+}
+
+static int io_bpf_register(struct io_ring_ctx *ctx, void __user *arg,
+ unsigned int nr_args)
+{
+ u32 __user *fds = arg;
+ int i, ret = 0;
+
+ if (!nr_args || nr_args > IORING_MAX_BPF_PROGS)
+ return -EINVAL;
+ if (ctx->nr_bpf_progs)
+ return -EBUSY;
+
+ ctx->bpf_progs = kcalloc(nr_args, sizeof(ctx->bpf_progs[0]),
+ GFP_KERNEL);
+ if (!ctx->bpf_progs)
+ return -ENOMEM;
+
+ for (i = 0; i < nr_args; ++i) {
+ struct bpf_prog *prog;
+ u32 fd;
+
+ if (copy_from_user(&fd, &fds[i], sizeof(fd))) {
+ ret = -EFAULT;
+ break;
+ }
+ if (fd == -1)
+ continue;
+
+ prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_IOURING);
+ if (IS_ERR(prog)) {
+ ret = PTR_ERR(prog);
+ break;
+ }
+ ctx->bpf_progs[i].prog = prog;
+ }
+
+ ctx->nr_bpf_progs = i;
+ if (ret)
+ io_bpf_unregister(ctx);
+ return ret;
+}
+
static bool io_wait_rsrc_data(struct io_rsrc_data *data)
{
if (!data)
@@ -8657,6 +8728,7 @@ static void io_ring_ctx_free(struct io_ring_ctx *ctx)
mutex_unlock(&ctx->uring_lock);
io_eventfd_unregister(ctx);
io_destroy_buffers(ctx);
+ io_bpf_unregister(ctx);
if (ctx->sq_creds)
put_cred(ctx->sq_creds);

@@ -10188,6 +10260,15 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
case IORING_REGISTER_RSRC_UPDATE:
ret = io_register_rsrc_update(ctx, arg, nr_args);
break;
+ case IORING_REGISTER_BPF:
+ ret = io_bpf_register(ctx, arg, nr_args);
+ break;
+ case IORING_UNREGISTER_BPF:
+ ret = -EINVAL;
+ if (arg || nr_args)
+ break;
+ ret = io_bpf_unregister(ctx);
+ break;
default:
ret = -EINVAL;
break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 67a97c793de7..b450f41d7389 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -304,6 +304,8 @@ enum {
IORING_REGISTER_ENABLE_RINGS = 12,
IORING_REGISTER_RSRC = 13,
IORING_REGISTER_RSRC_UPDATE = 14,
+ IORING_REGISTER_BPF = 15,
+ IORING_UNREGISTER_BPF = 16,

/* this goes last */
IORING_REGISTER_LAST
--
2.31.1


2021-05-19 20:04:37

by Pavel Begunkov

Subject: [PATCH 14/23] io_uring: add support for bpf requests

Wire up a new io_uring operation type, IORING_OP_BPF, which executes a
specified BPF program from the registered prog table. It doesn't allow
doing anything useful for now; no BPF functions are allowed apart from
the basic ones.
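
A hypothetical sketch of the userspace side (not part of the patch): per
io_bpf_prep() below, the registered program's index travels in sqe->off
and most other SQE fields must stay zero.

#include <linux/io_uring.h>
#include <string.h>

/* hypothetical sketch: an SQE that runs registered BPF program prog_idx */
static void prep_bpf_sqe(struct io_uring_sqe *sqe, unsigned int prog_idx,
                         __u64 user_data)
{
        memset(sqe, 0, sizeof(*sqe));   /* io_bpf_prep() wants addr/len/ioprio etc. zeroed */
        sqe->opcode = IORING_OP_BPF;
        sqe->off = prog_idx;            /* index into the registered prog table */
        sqe->user_data = user_data;
}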

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 92 +++++++++++++++++++++++++++++++++++
include/uapi/linux/io_uring.h | 1 +
2 files changed, 93 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index b13cbcd5c47b..20fddc5945f2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -682,6 +682,11 @@ struct io_unlink {
struct filename *filename;
};

+struct io_bpf {
+ struct file *file;
+ struct bpf_prog *prog;
+};
+
struct io_completion {
struct file *file;
struct list_head list;
@@ -826,6 +831,7 @@ struct io_kiocb {
struct io_shutdown shutdown;
struct io_rename rename;
struct io_unlink unlink;
+ struct io_bpf bpf;
/* use only after cleaning per-op data, see io_clean_op() */
struct io_completion compl;
};
@@ -875,6 +881,9 @@ struct io_defer_entry {
u32 seq;
};

+struct io_bpf_ctx {
+};
+
struct io_op_def {
/* needs req->file assigned */
unsigned needs_file : 1;
@@ -1039,6 +1048,7 @@ static const struct io_op_def io_op_defs[] = {
},
[IORING_OP_RENAMEAT] = {},
[IORING_OP_UNLINKAT] = {},
+ [IORING_OP_BPF] = {},
};

static bool io_disarm_next(struct io_kiocb *req);
@@ -1070,6 +1080,7 @@ static void io_rsrc_put_work(struct work_struct *work);
static void io_req_task_queue(struct io_kiocb *req);
static void io_submit_flush_completions(struct io_comp_state *cs,
struct io_ring_ctx *ctx);
+static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags);
static bool io_poll_remove_waitqs(struct io_kiocb *req);
static int io_req_prep_async(struct io_kiocb *req);

@@ -3931,6 +3942,53 @@ static int io_openat(struct io_kiocb *req, unsigned int issue_flags)
return io_openat2(req, issue_flags);
}

+static int io_bpf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct bpf_prog *prog;
+ unsigned int idx;
+
+ if (unlikely(ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL)))
+ return -EINVAL;
+ if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
+ return -EINVAL;
+ if (sqe->ioprio || sqe->len || sqe->cancel_flags)
+ return -EINVAL;
+ if (sqe->addr)
+ return -EINVAL;
+
+ idx = READ_ONCE(sqe->off);
+ if (unlikely(idx >= ctx->nr_bpf_progs))
+ return -EFAULT;
+ idx = array_index_nospec(idx, ctx->nr_bpf_progs);
+ prog = ctx->bpf_progs[idx].prog;
+ if (!prog)
+ return -EFAULT;
+
+ req->bpf.prog = prog;
+ return 0;
+}
+
+static void io_bpf_run_task_work(struct callback_head *cb)
+{
+ struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
+ struct io_ring_ctx *ctx = req->ctx;
+
+ mutex_lock(&ctx->uring_lock);
+ io_bpf_run(req, 0);
+ mutex_unlock(&ctx->uring_lock);
+}
+
+static int io_bpf(struct io_kiocb *req, unsigned int issue_flags)
+{
+ init_task_work(&req->task_work, io_bpf_run_task_work);
+ if (unlikely(io_req_task_work_add(req))) {
+ req_ref_get(req);
+ io_req_task_queue_fail(req, -ECANCELED);
+ }
+ return 0;
+}
+
static int io_remove_buffers_prep(struct io_kiocb *req,
const struct io_uring_sqe *sqe)
{
@@ -6002,6 +6060,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_renameat_prep(req, sqe);
case IORING_OP_UNLINKAT:
return io_unlinkat_prep(req, sqe);
+ case IORING_OP_BPF:
+ return io_bpf_prep(req, sqe);
}

printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
@@ -6269,6 +6329,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
case IORING_OP_UNLINKAT:
ret = io_unlinkat(req, issue_flags);
break;
+ case IORING_OP_BPF:
+ ret = io_bpf(req, issue_flags);
+ break;
default:
ret = -EINVAL;
break;
@@ -10303,6 +10366,35 @@ const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
.is_valid_access = io_bpf_is_valid_access,
};

+static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct io_bpf_ctx bpf_ctx;
+ struct bpf_prog *prog;
+ int ret = -EAGAIN;
+
+ lockdep_assert_held(&req->ctx->uring_lock);
+
+ if (unlikely(percpu_ref_is_dying(&ctx->refs) ||
+ atomic_read(&req->task->io_uring->in_idle)))
+ goto done;
+
+ memset(&bpf_ctx, 0, sizeof(bpf_ctx));
+ prog = req->bpf.prog;
+
+ if (prog->aux->sleepable) {
+ rcu_read_lock();
+ bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
+ rcu_read_unlock();
+ } else {
+ bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
+ }
+
+ ret = 0;
+done:
+ __io_req_complete(req, issue_flags, ret, 0);
+}
+
SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
void __user *, arg, unsigned int, nr_args)
{
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b450f41d7389..25ab804670e1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -138,6 +138,7 @@ enum {
IORING_OP_SHUTDOWN,
IORING_OP_RENAMEAT,
IORING_OP_UNLINKAT,
+ IORING_OP_BPF,

/* this goes last, obviously */
IORING_OP_LAST,
--
2.31.1


2021-05-19 20:04:47

by Pavel Begunkov

Subject: [PATCH 16/23] io_uring: enable bpf to submit CQEs
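
The helper added below, exposed as BPF_FUNC_iouring_emit_cqe, lets a BPF
request post a CQE with arbitrary user_data/res/flags into any of the
ring's CQs by index. A hypothetical restricted-C usage sketch, with the
declaration inferred from io_bpf_emit_cqe_proto:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static long (*iouring_emit_cqe)(void *ctx, __u32 cq_idx, __u64 user_data,
                                __s32 res, __u32 flags) =
                        (void *)BPF_FUNC_iouring_emit_cqe;

SEC("iouring/emit")
int emit(void *ctx)
{
        /* post a completion with user_data 42 into CQ index 1 */
        iouring_emit_cqe(ctx, 1, 42, 0, 0);
        return 0;
}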

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 36 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/bpf.h | 1 +
2 files changed, 37 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index aae786291c57..464d630904e2 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10371,6 +10371,29 @@ BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
return !io_submit_sqe(ctx, req, sqe);
}

+BPF_CALL_5(io_bpf_emit_cqe, struct io_bpf_ctx *, bpf_ctx,
+ u32, cq_idx,
+ u64, user_data,
+ s32, res,
+ u32, flags)
+{
+ struct io_ring_ctx *ctx = bpf_ctx->ctx;
+ bool submitted;
+
+ if (unlikely(cq_idx >= ctx->cq_nr))
+ return -EINVAL;
+
+ spin_lock_irq(&ctx->completion_lock);
+ submitted = io_cqring_fill_event(ctx, user_data, res, flags, cq_idx);
+ io_commit_cqring(ctx);
+ ctx->cq_extra++;
+ spin_unlock_irq(&ctx->completion_lock);
+ if (submitted)
+ io_cqring_ev_posted(ctx);
+
+ return submitted ? 0 : -ENOMEM;
+}
+
const struct bpf_func_proto io_bpf_queue_sqe_proto = {
.func = io_bpf_queue_sqe,
.gpl_only = false,
@@ -10380,6 +10403,17 @@ const struct bpf_func_proto io_bpf_queue_sqe_proto = {
.arg3_type = ARG_CONST_SIZE,
};

+const struct bpf_func_proto io_bpf_emit_cqe_proto = {
+ .func = io_bpf_emit_cqe,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_ANYTHING,
+ .arg5_type = ARG_ANYTHING,
+};
+
static const struct bpf_func_proto *
io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -10388,6 +10422,8 @@ io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
case BPF_FUNC_iouring_queue_sqe:
return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
+ case BPF_FUNC_iouring_emit_cqe:
+ return &io_bpf_emit_cqe_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index cc268f749a7d..c6b023be7848 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4083,6 +4083,7 @@ union bpf_attr {
FN(sock_from_file), \
FN(check_mtu), \
FN(iouring_queue_sqe), \
+ FN(iouring_emit_cqe), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.31.1


2021-05-19 20:04:52

by Pavel Begunkov

Subject: [PATCH 17/23] io_uring: enable bpf to reap CQEs
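
The helper below, BPF_FUNC_iouring_reap_cqe, copies the oldest CQE of the
chosen CQ into a buffer supplied by the program and advances the CQ head,
returning -ENOENT if the CQ is empty. A hypothetical usage sketch:

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

static long (*iouring_reap_cqe)(void *ctx, __u32 cq_idx,
                                struct io_uring_cqe *cqe, __u32 cqe_len) =
                        (void *)BPF_FUNC_iouring_reap_cqe;

SEC("iouring/reap")
int reap(void *ctx)
{
        struct io_uring_cqe cqe;

        /* pop one CQE from CQ 1; -ENOENT means it was empty */
        iouring_reap_cqe(ctx, 1, &cqe, sizeof(cqe));
        return 0;
}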

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 48 ++++++++++++++++++++++++++++++++++++++++
include/uapi/linux/bpf.h | 1 +
2 files changed, 49 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 464d630904e2..7c165b2ce8e4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10394,6 +10394,42 @@ BPF_CALL_5(io_bpf_emit_cqe, struct io_bpf_ctx *, bpf_ctx,
return submitted ? 0 : -ENOMEM;
}

+BPF_CALL_4(io_bpf_reap_cqe, struct io_bpf_ctx *, bpf_ctx,
+ u32, cq_idx,
+ struct io_uring_cqe *, cqe_out,
+ u32, cqe_len)
+{
+ struct io_ring_ctx *ctx = bpf_ctx->ctx;
+ struct io_uring_cqe *cqe;
+ struct io_cqring *cq;
+ struct io_rings *r;
+ unsigned tail, head, mask;
+ int ret = -EINVAL;
+
+ if (unlikely(cqe_len != sizeof(*cqe_out)))
+ goto err;
+ if (unlikely(cq_idx >= ctx->cq_nr))
+ goto err;
+
+ cq = &ctx->cqs[cq_idx];
+ r = cq->rings;
+ tail = READ_ONCE(r->cq.tail);
+ head = smp_load_acquire(&r->cq.head);
+
+ ret = -ENOENT;
+ if (unlikely(tail == head))
+ goto err;
+
+ mask = cq->entries - 1;
+ cqe = &r->cqes[head & mask];
+ memcpy(cqe_out, cqe, sizeof(*cqe_out));
+ WRITE_ONCE(r->cq.head, head + 1);
+ return 0;
+err:
+ memset(cqe_out, 0, sizeof(*cqe_out));
+ return ret;
+}
+
const struct bpf_func_proto io_bpf_queue_sqe_proto = {
.func = io_bpf_queue_sqe,
.gpl_only = false,
@@ -10414,6 +10450,16 @@ const struct bpf_func_proto io_bpf_emit_cqe_proto = {
.arg5_type = ARG_ANYTHING,
};

+const struct bpf_func_proto io_bpf_reap_cqe_proto = {
+ .func = io_bpf_reap_cqe,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+};
+
static const struct bpf_func_proto *
io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
@@ -10424,6 +10470,8 @@ io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
case BPF_FUNC_iouring_emit_cqe:
return &io_bpf_emit_cqe_proto;
+ case BPF_FUNC_iouring_reap_cqe:
+ return &io_bpf_reap_cqe_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c6b023be7848..7719ec4a33e7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4084,6 +4084,7 @@ union bpf_attr {
FN(check_mtu), \
FN(iouring_queue_sqe), \
FN(iouring_emit_cqe), \
+ FN(iouring_reap_cqe), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.31.1


2021-05-19 20:05:11

by Pavel Begunkov

Subject: [PATCH 18/23] libbpf: support io_uring

Signed-off-by: Pavel Begunkov <[email protected]>
---
tools/lib/bpf/libbpf.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 4181d178ee7b..de5d1508f58e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13,6 +13,10 @@
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
+
+/* hack, use local headers instead of system-wide */
+#include "../../../include/uapi/linux/bpf.h"
+
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
@@ -8630,6 +8634,9 @@ static const struct bpf_sec_def section_defs[] = {
BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),
BPF_EAPROG_SEC("sk_lookup/", BPF_PROG_TYPE_SK_LOOKUP,
BPF_SK_LOOKUP),
+ SEC_DEF("iouring/", IOURING),
+ SEC_DEF("iouring.s/", IOURING,
+ .is_sleepable = true),
};

#undef BPF_PROG_SEC_IMPL
--
2.31.1


2021-05-19 20:05:13

by Pavel Begunkov

Subject: [PATCH 19/23] io_uring: pass user_data to bpf executor
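
Expose the originating SQE's user_data to the program through the new
struct io_uring_bpf_ctx context, readable as permitted by
io_bpf_is_valid_access(). A hypothetical sketch, reusing the emit helper
from the earlier patch:

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

static long (*iouring_emit_cqe)(void *ctx, __u32 cq_idx, __u64 user_data,
                                __s32 res, __u32 flags) =
                        (void *)BPF_FUNC_iouring_emit_cqe;

SEC("iouring/echo")
int echo(struct io_uring_bpf_ctx *ctx)
{
        /* echo the submitting SQE's user_data back as a CQE on the main CQ */
        iouring_emit_cqe(ctx, 0, ctx->user_data, 0, 0);
        return 0;
}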

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 16 ++++++++++++++++
include/uapi/linux/io_uring.h | 4 ++++
2 files changed, 20 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7c165b2ce8e4..c37846bca863 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -882,6 +882,7 @@ struct io_defer_entry {
};

struct io_bpf_ctx {
+ struct io_uring_bpf_ctx u;
struct io_ring_ctx *ctx;
};

@@ -10482,6 +10483,15 @@ static bool io_bpf_is_valid_access(int off, int size,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
+ if (off < 0 || off >= sizeof(struct io_uring_bpf_ctx))
+ return false;
+ if (off % size != 0)
+ return false;
+
+ switch (off) {
+ case offsetof(struct io_uring_bpf_ctx, user_data):
+ return size == sizeof_field(struct io_uring_bpf_ctx, user_data);
+ }
return false;
}

@@ -10505,6 +10515,8 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
atomic_read(&req->task->io_uring->in_idle)))
goto done;

+ memset(&bpf_ctx.u, 0, sizeof(bpf_ctx.u));
+ bpf_ctx.u.user_data = req->user_data;
bpf_ctx.ctx = ctx;
prog = req->bpf.prog;

@@ -10591,6 +10603,10 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
BUILD_BUG_SQE_ELEM(48, __u16, cq_idx);

+ /* should be first, see io_bpf_is_valid_access() */
+ __BUILD_BUG_VERIFY_ELEMENT(struct io_bpf_ctx, 0,
+ struct io_uring_bpf_ctx, u);
+
BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
sizeof(struct io_uring_rsrc_update));
BUILD_BUG_ON(sizeof(struct io_uring_rsrc_update) >
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 25ab804670e1..d7b1713bcfb0 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -403,4 +403,8 @@ struct io_uring_getevents_arg {
__u64 ts;
};

+struct io_uring_bpf_ctx {
+ __u64 user_data;
+};
+
#endif
--
2.31.1


2021-05-19 20:05:18

by Pavel Begunkov

Subject: [PATCH 20/23] bpf: Add bpf_copy_to_user() helper

Similarly to bpf_copy_from_user(), also allow sleepable BPF programs
to write to user memory.
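
A hypothetical sketch of how an "iouring.s" (sleepable) program might use
it once the next patch wires it up for io_uring; the convention of
carrying a userspace address in user_data is purely illustrative:

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

static long (*bpf_copy_to_user)(void *user_ptr, const void *src, __u32 size) =
                        (void *)BPF_FUNC_copy_to_user;

SEC("iouring.s/report")         /* sleepable, see the libbpf patch */
int report(struct io_uring_bpf_ctx *ctx)
{
        __u64 done = 1;

        /* illustrative only: treat user_data as a userspace address to fill */
        bpf_copy_to_user((void *)(long)ctx->user_data, &done, sizeof(done));
        return 0;
}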

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/bpf.h | 1 +
include/uapi/linux/bpf.h | 8 ++++++++
kernel/bpf/helpers.c | 17 +++++++++++++++++
tools/include/uapi/linux/bpf.h | 7 +++++++
4 files changed, 33 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 00597b0c719c..9b775e2b2a01 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1899,6 +1899,7 @@ extern const struct bpf_func_proto bpf_skc_to_tcp_timewait_sock_proto;
extern const struct bpf_func_proto bpf_skc_to_tcp_request_sock_proto;
extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto;
extern const struct bpf_func_proto bpf_copy_from_user_proto;
+extern const struct bpf_func_proto bpf_copy_to_user_proto;
extern const struct bpf_func_proto bpf_snprintf_btf_proto;
extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7719ec4a33e7..6f19839d2b05 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3648,6 +3648,13 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * long bpf_copy_to_user(void *user_ptr, const void *src, u32 size)
+ * Description
+ * Read *size* bytes from *src* and store the data in user space
+ * address *user_ptr*. This is a wrapper of **copy_to_user**\ ().
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr, u32 btf_ptr_size, u64 flags)
* Description
* Use BTF to store a string representation of *ptr*->ptr in *str*,
@@ -4085,6 +4092,7 @@ union bpf_attr {
FN(iouring_queue_sqe), \
FN(iouring_emit_cqe), \
FN(iouring_reap_cqe), \
+ FN(copy_to_user), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 308427fe03a3..9d7814c564e5 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -634,6 +634,23 @@ const struct bpf_func_proto bpf_copy_from_user_proto = {
.arg3_type = ARG_ANYTHING,
};

+BPF_CALL_3(bpf_copy_to_user, void __user *, user_ptr,
+ const void *, src, u32, size)
+{
+ int ret = copy_to_user(user_ptr, src, size);
+
+ return ret ? -EFAULT : 0;
+}
+
+const struct bpf_func_proto bpf_copy_to_user_proto = {
+ .func = bpf_copy_to_user,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_PTR_TO_MEM,
+ .arg3_type = ARG_CONST_SIZE_OR_ZERO,
+};
+
BPF_CALL_2(bpf_per_cpu_ptr, const void *, ptr, u32, cpu)
{
if (cpu >= nr_cpu_ids)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 79c893310492..18d497247d69 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3647,6 +3647,13 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
+ * long bpf_copy_to_user(void *user_ptr, const void *src, u32 size)
+ * Description
+ * Read *size* bytes from *src* and store the data in user space
+ * address *user_ptr*. This is a wrapper of **copy_to_user**\ ().
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
* long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr, u32 btf_ptr_size, u64 flags)
* Description
* Use BTF to store a string representation of *ptr*->ptr in *str*,
--
2.31.1


2021-05-19 20:05:19

by Pavel Begunkov

Subject: [PATCH 21/23] io_uring: wire bpf copy to user

Enable io_uring BPF programs to write to userspace memory by exposing the
bpf_copy_to_user() helper to sleepable programs.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c37846bca863..c4682146afa4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10467,6 +10467,8 @@ io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
switch (func_id) {
case BPF_FUNC_copy_from_user:
return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
+ case BPF_FUNC_copy_to_user:
+ return prog->aux->sleepable ? &bpf_copy_to_user_proto : NULL;
case BPF_FUNC_iouring_queue_sqe:
return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
case BPF_FUNC_iouring_emit_cqe:
--
2.31.1


2021-05-19 20:05:29

by Pavel Begunkov

Subject: [PATCH 22/23] io_uring: don't wait on CQ exclusively

It doesn't make much sense for several tasks to wait on a single CQ; it
would be racy and involve a rather strange code flow with extra
synchronisation, so we don't really care about optimising this case.

Don't do exclusive CQ waiting and instead wake up everyone in the queue;
this will be handy for implementing waiting on non-default CQs.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index c4682146afa4..805c10be7ea4 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7085,7 +7085,7 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
*/
if (io_should_wake(iowq) || test_bit(0, &iowq->ctx->cq_check_overflow))
return autoremove_wake_function(curr, mode, wake_flags, key);
- return -1;
+ return 0;
}

static int io_run_task_work_sig(void)
@@ -7176,8 +7176,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
ret = -EBUSY;
break;
}
- prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
- TASK_INTERRUPTIBLE);
+ prepare_to_wait(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
ret = io_cqring_wait_schedule(ctx, &iowq, &timeout);
finish_wait(&ctx->wait, &iowq.wq);
cond_resched();
--
2.31.1


2021-05-19 20:05:39

by Pavel Begunkov

Subject: [PATCH 23/23] io_uring: enable bpf reqs to wait for CQs

Add experimental support for bpf requests waiting for a number of CQEs
in a specified CQ.
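
From the BPF side the interface is two new context fields; a hypothetical
sketch, based on the io_bpf_run()/io_bpf_wait_cq_async() changes below:

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

SEC("iouring/waiter")
int waiter(struct io_uring_bpf_ctx *ctx)
{
        /* don't complete the request yet: re-run this program once CQ 1
         * holds at least 8 CQEs */
        ctx->wait_idx = 1;
        ctx->wait_nr = 8;
        return 0;
}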

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 80 +++++++++++++++++++++++++++++++++--
include/uapi/linux/io_uring.h | 2 +
2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 805c10be7ea4..cf02389747b5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -687,6 +687,12 @@ struct io_bpf {
struct bpf_prog *prog;
};

+struct io_async_bpf {
+ struct wait_queue_entry wqe;
+ unsigned int wait_nr;
+ unsigned int wait_idx;
+};
+
struct io_completion {
struct file *file;
struct list_head list;
@@ -1050,7 +1056,9 @@ static const struct io_op_def io_op_defs[] = {
},
[IORING_OP_RENAMEAT] = {},
[IORING_OP_UNLINKAT] = {},
- [IORING_OP_BPF] = {},
+ [IORING_OP_BPF] = {
+ .async_size = sizeof(struct io_async_bpf),
+ },
};

static bool io_disarm_next(struct io_kiocb *req);
@@ -9148,6 +9156,7 @@ static void io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
}
}

+ wake_up_all(&ctx->wait);
ret |= io_cancel_defer_files(ctx, task, files);
ret |= io_poll_remove_all(ctx, task, files);
ret |= io_kill_timeouts(ctx, task, files);
@@ -10492,6 +10501,10 @@ static bool io_bpf_is_valid_access(int off, int size,
switch (off) {
case offsetof(struct io_uring_bpf_ctx, user_data):
return size == sizeof_field(struct io_uring_bpf_ctx, user_data);
+ case offsetof(struct io_uring_bpf_ctx, wait_nr):
+ return size == sizeof_field(struct io_uring_bpf_ctx, wait_nr);
+ case offsetof(struct io_uring_bpf_ctx, wait_idx):
+ return size == sizeof_field(struct io_uring_bpf_ctx, wait_idx);
}
return false;
}
@@ -10503,6 +10516,60 @@ const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
.is_valid_access = io_bpf_is_valid_access,
};

+static inline bool io_bpf_need_wake(struct io_async_bpf *abpf)
+{
+ struct io_kiocb *req = abpf->wqe.private;
+ struct io_ring_ctx *ctx = req->ctx;
+
+ if (unlikely(percpu_ref_is_dying(&ctx->refs)) ||
+ atomic_read(&req->task->io_uring->in_idle))
+ return true;
+ return __io_cqring_events(&ctx->cqs[abpf->wait_idx]) >= abpf->wait_nr;
+}
+
+static int io_bpf_wait_func(struct wait_queue_entry *wqe, unsigned mode,
+ int sync, void *key)
+{
+ struct io_async_bpf *abpf = container_of(wqe, struct io_async_bpf, wqe);
+ bool wake = io_bpf_need_wake(abpf);
+
+ if (wake) {
+ list_del_init_careful(&wqe->entry);
+ req_ref_get(wqe->private);
+ io_queue_async_work(wqe->private);
+ }
+ return wake;
+}
+
+static int io_bpf_wait_cq_async(struct io_kiocb *req, unsigned int nr,
+ unsigned int idx)
+{
+ struct io_ring_ctx *ctx = req->ctx;
+ struct wait_queue_head *wq;
+ struct wait_queue_entry *wqe;
+ struct io_async_bpf *abpf;
+
+ if (unlikely(idx >= ctx->cq_nr))
+ return -EINVAL;
+ if (!req->async_data && io_alloc_async_data(req))
+ return -ENOMEM;
+
+ abpf = req->async_data;
+ abpf->wait_nr = nr;
+ abpf->wait_idx = idx;
+ wqe = &abpf->wqe;
+ init_waitqueue_func_entry(wqe, io_bpf_wait_func);
+ wqe->private = req;
+ wq = &ctx->wait;
+
+ spin_lock_irq(&wq->lock);
+ __add_wait_queue(wq, wqe);
+ smp_mb();
+ io_bpf_wait_func(wqe, 0, 0, NULL);
+ spin_unlock_irq(&wq->lock);
+ return 0;
+}
+
static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -10512,8 +10579,8 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)

lockdep_assert_held(&req->ctx->uring_lock);

- if (unlikely(percpu_ref_is_dying(&ctx->refs) ||
- atomic_read(&req->task->io_uring->in_idle)))
+ if (unlikely(percpu_ref_is_dying(&ctx->refs)) ||
+ atomic_read(&req->task->io_uring->in_idle))
goto done;

memset(&bpf_ctx.u, 0, sizeof(bpf_ctx.u));
@@ -10531,6 +10598,13 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
}
io_submit_state_end(&ctx->submit_state, ctx);
ret = 0;
+
+ if (bpf_ctx.u.wait_nr) {
+ ret = io_bpf_wait_cq_async(req, bpf_ctx.u.wait_nr,
+ bpf_ctx.u.wait_idx);
+ if (!ret)
+ return;
+ }
done:
__io_req_complete(req, issue_flags, ret, 0);
}
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index d7b1713bcfb0..95c04af3afd4 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -405,6 +405,8 @@ struct io_uring_getevents_arg {

struct io_uring_bpf_ctx {
__u64 user_data;
+ __u32 wait_nr;
+ __u32 wait_idx;
};

#endif
--
2.31.1


2021-05-19 20:05:44

by Pavel Begunkov

Subject: [PATCH 08/23] io_uring: internally pass CQ indexes

Allow passing a CQ index from the SQE down to the CQE generators, but
support only one CQ for now.
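
On the userspace side this is driven by a new sqe->cq_idx field (checked
at offset 48 by a BUILD_BUG_SQE_ELEM in a later patch); a hypothetical
sketch, only useful once the multi-CQ patch lands:

#include <linux/io_uring.h>

/* hypothetical sketch: direct a request's completion into a given CQ;
 * index 0 is the default CQ */
static void sqe_target_cq(struct io_uring_sqe *sqe, __u16 cq_idx)
{
        sqe->cq_idx = cq_idx;
}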

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 113 ++++++++++++++++++++++------------
include/uapi/linux/io_uring.h | 1 +
2 files changed, 75 insertions(+), 39 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4fecd9da689e..356a5dc90f46 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -90,6 +90,8 @@
#define IORING_MAX_ENTRIES 32768
#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)

+#define IO_DEFAULT_CQ 0
+
/*
* Shift of 9 is 512 entries, or exactly one page on 64-bit archs
*/
@@ -416,6 +418,7 @@ struct io_ring_ctx {
unsigned cq_extra;
struct wait_queue_head cq_wait;
struct io_cqring cqs[1];
+ unsigned int cq_nr;

struct {
spinlock_t completion_lock;
@@ -832,6 +835,7 @@ struct io_kiocb {

struct io_kiocb *link;
struct percpu_ref *fixed_rsrc_refs;
+ u16 cq_idx;

/* used with ctx->iopoll_list with reads/writes */
struct list_head inflight_entry;
@@ -1034,7 +1038,8 @@ static void io_uring_cancel_sqpoll(struct io_sq_data *sqd);
static struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx);

static bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags);
+ long res, unsigned int cflags,
+ unsigned int cq_idx);
static void io_put_req(struct io_kiocb *req);
static void io_put_req_deferred(struct io_kiocb *req, int nr);
static void io_dismantle_req(struct io_kiocb *req);
@@ -1207,13 +1212,15 @@ static void io_account_cq_overflow(struct io_ring_ctx *ctx)

static bool req_need_defer(struct io_kiocb *req, u32 seq)
{
- if (unlikely(req->flags & REQ_F_IO_DRAIN)) {
- struct io_ring_ctx *ctx = req->ctx;
-
- return seq + READ_ONCE(ctx->cq_extra) != ctx->cqs[0].cached_tail;
- }
+ struct io_ring_ctx *ctx = req->ctx;
+ u32 cnt = 0;
+ int i;

- return false;
+ if (!(req->flags & REQ_F_IO_DRAIN))
+ return false;
+ for (i = 0; i < ctx->cq_nr; i++)
+ cnt += ctx->cqs[i].cached_tail;
+ return seq + READ_ONCE(ctx->cq_extra) != cnt;
}

static void io_req_track_inflight(struct io_kiocb *req)
@@ -1289,7 +1296,8 @@ static void io_kill_timeout(struct io_kiocb *req, int status)
atomic_set(&req->ctx->cq_timeouts,
atomic_read(&req->ctx->cq_timeouts) + 1);
list_del_init(&req->timeout.list);
- io_cqring_fill_event(req->ctx, req->user_data, status, 0);
+ io_cqring_fill_event(req->ctx, req->user_data, status, 0,
+ req->cq_idx);
io_put_req_deferred(req, 1);
}
}
@@ -1346,10 +1354,13 @@ static void io_flush_timeouts(struct io_ring_ctx *ctx)

static void io_commit_cqring(struct io_ring_ctx *ctx)
{
+ int i;
+
io_flush_timeouts(ctx);

/* order cqe stores with ring update */
- smp_store_release(&ctx->rings->cq.tail, ctx->cqs[0].cached_tail);
+ for (i = 0; i < ctx->cq_nr; i++)
+ smp_store_release(&ctx->cqs[i].rings->cq.tail, ctx->cqs[i].cached_tail);

if (unlikely(!list_empty(&ctx->defer_list)))
__io_queue_deferred(ctx);
@@ -1362,25 +1373,27 @@ static inline bool io_sqring_full(struct io_ring_ctx *ctx)
return READ_ONCE(r->sq.tail) - ctx->cached_sq_head == ctx->sq_entries;
}

-static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
+static inline unsigned int __io_cqring_events(struct io_cqring *cq)
{
- return ctx->cqs[0].cached_tail - READ_ONCE(ctx->rings->cq.head);
+ return cq->cached_tail - READ_ONCE(cq->rings->cq.head);
}

-static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx)
+static inline struct io_uring_cqe *io_get_cqe(struct io_ring_ctx *ctx,
+ unsigned int idx)
{
- struct io_rings *rings = ctx->rings;
- unsigned tail, mask = ctx->cqs[0].entries - 1;
+ struct io_cqring *cq = &ctx->cqs[idx];
+ struct io_rings *rings = cq->rings;
+ unsigned tail, mask = cq->entries - 1;

/*
* writes to the cq entry need to come after reading head; the
* control dependency is enough as we're using WRITE_ONCE to
* fill the cq entry
*/
- if (__io_cqring_events(ctx) == ctx->cqs[0].entries)
+ if (__io_cqring_events(cq) == cq->entries)
return NULL;

- tail = ctx->cqs[0].cached_tail++;
+ tail = cq->cached_tail++;
return &rings->cqes[tail & mask];
}

@@ -1432,16 +1445,18 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
{
unsigned long flags;
bool all_flushed, posted;
+ struct io_cqring *cq = &ctx->cqs[IO_DEFAULT_CQ];

- if (!force && __io_cqring_events(ctx) == ctx->cqs[0].entries)
+ if (!force && __io_cqring_events(cq) == cq->entries)
return false;

posted = false;
spin_lock_irqsave(&ctx->completion_lock, flags);
while (!list_empty(&ctx->cq_overflow_list)) {
- struct io_uring_cqe *cqe = io_get_cqe(ctx);
+ struct io_uring_cqe *cqe = io_get_cqe(ctx, IO_DEFAULT_CQ);
struct io_overflow_cqe *ocqe;

+
if (!cqe && !force)
break;
ocqe = list_first_entry(&ctx->cq_overflow_list,
@@ -1523,12 +1538,17 @@ static inline void req_ref_get(struct io_kiocb *req)
}

static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ long res, unsigned int cflags,
+ unsigned int cq_idx)
{
struct io_overflow_cqe *ocqe;

+ if (cq_idx != IO_DEFAULT_CQ)
+ goto overflow;
+
ocqe = kmalloc(sizeof(*ocqe), GFP_ATOMIC | __GFP_ACCOUNT);
if (!ocqe) {
+overflow:
/*
* If we're in ring overflow flush mode, or in task cancel mode,
* or cannot allocate an overflow entry, then we need to drop it
@@ -1550,7 +1570,8 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
}

static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ long res, unsigned int cflags,
+ unsigned int cq_idx)
{
struct io_uring_cqe *cqe;

@@ -1561,21 +1582,22 @@ static inline bool __io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data
* submission (by quite a lot). Increment the overflow count in
* the ring.
*/
- cqe = io_get_cqe(ctx);
+ cqe = io_get_cqe(ctx, cq_idx);
if (likely(cqe)) {
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
WRITE_ONCE(cqe->flags, cflags);
return true;
}
- return io_cqring_event_overflow(ctx, user_data, res, cflags);
+ return io_cqring_event_overflow(ctx, user_data, res, cflags, cq_idx);
}

/* not as hot to bloat with inlining */
static noinline bool io_cqring_fill_event(struct io_ring_ctx *ctx, u64 user_data,
- long res, unsigned int cflags)
+ long res, unsigned int cflags,
+ unsigned int cq_idx)
{
- return __io_cqring_fill_event(ctx, user_data, res, cflags);
+ return __io_cqring_fill_event(ctx, user_data, res, cflags, cq_idx);
}

static void io_req_complete_post(struct io_kiocb *req, long res,
@@ -1585,7 +1607,7 @@ static void io_req_complete_post(struct io_kiocb *req, long res,
unsigned long flags;

spin_lock_irqsave(&ctx->completion_lock, flags);
- __io_cqring_fill_event(ctx, req->user_data, res, cflags);
+ __io_cqring_fill_event(ctx, req->user_data, res, cflags, req->cq_idx);
/*
* If we're the last reference to this request, add to our locked
* free_list cache.
@@ -1797,7 +1819,7 @@ static bool io_kill_linked_timeout(struct io_kiocb *req)
link->timeout.head = NULL;
if (hrtimer_try_to_cancel(&io->timer) != -1) {
io_cqring_fill_event(link->ctx, link->user_data,
- -ECANCELED, 0);
+ -ECANCELED, 0, link->cq_idx);
io_put_req_deferred(link, 1);
return true;
}
@@ -1816,7 +1838,8 @@ static void io_fail_links(struct io_kiocb *req)
link->link = NULL;

trace_io_uring_fail_link(req, link);
- io_cqring_fill_event(link->ctx, link->user_data, -ECANCELED, 0);
+ io_cqring_fill_event(link->ctx, link->user_data, -ECANCELED, 0,
+ link->cq_idx);
io_put_req_deferred(link, 2);
link = nxt;
}
@@ -2138,7 +2161,7 @@ static void io_submit_flush_completions(struct io_comp_state *cs,
for (i = 0; i < nr; i++) {
req = cs->reqs[i];
__io_cqring_fill_event(ctx, req->user_data, req->result,
- req->compl.cflags);
+ req->compl.cflags, req->cq_idx);
}
io_commit_cqring(ctx);
spin_unlock_irq(&ctx->completion_lock);
@@ -2201,7 +2224,7 @@ static unsigned io_cqring_events(struct io_ring_ctx *ctx)
{
/* See comment at the top of this file */
smp_rmb();
- return __io_cqring_events(ctx);
+ return __io_cqring_events(&ctx->cqs[IO_DEFAULT_CQ]);
}

static inline unsigned int io_sqring_entries(struct io_ring_ctx *ctx)
@@ -2278,7 +2301,8 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
if (req->flags & REQ_F_BUFFER_SELECTED)
cflags = io_put_rw_kbuf(req);

- __io_cqring_fill_event(ctx, req->user_data, req->result, cflags);
+ __io_cqring_fill_event(ctx, req->user_data, req->result, cflags,
+ req->cq_idx);
(*nr_events)++;

if (req_ref_put_and_test(req))
@@ -4911,7 +4935,7 @@ static bool io_poll_complete(struct io_kiocb *req, __poll_t mask)
}
if (req->poll.events & EPOLLONESHOT)
flags = 0;
- if (!io_cqring_fill_event(ctx, req->user_data, error, flags)) {
+ if (!io_cqring_fill_event(ctx, req->user_data, error, flags, req->cq_idx)) {
io_poll_remove_waitqs(req);
req->poll.done = true;
flags = 0;
@@ -5242,7 +5266,8 @@ static bool io_poll_remove_one(struct io_kiocb *req)

do_complete = io_poll_remove_waitqs(req);
if (do_complete) {
- io_cqring_fill_event(req->ctx, req->user_data, -ECANCELED, 0);
+ io_cqring_fill_event(req->ctx, req->user_data, -ECANCELED, 0,
+ req->cq_idx);
io_commit_cqring(req->ctx);
req_set_fail_links(req);
io_put_req_deferred(req, 1);
@@ -5494,7 +5519,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
atomic_set(&req->ctx->cq_timeouts,
atomic_read(&req->ctx->cq_timeouts) + 1);

- io_cqring_fill_event(ctx, req->user_data, -ETIME, 0);
+ io_cqring_fill_event(ctx, req->user_data, -ETIME, 0, req->cq_idx);
io_commit_cqring(ctx);
spin_unlock_irqrestore(&ctx->completion_lock, flags);

@@ -5536,7 +5561,7 @@ static int io_timeout_cancel(struct io_ring_ctx *ctx, __u64 user_data)
return PTR_ERR(req);

req_set_fail_links(req);
- io_cqring_fill_event(ctx, req->user_data, -ECANCELED, 0);
+ io_cqring_fill_event(ctx, req->user_data, -ECANCELED, 0, req->cq_idx);
io_put_req_deferred(req, 1);
return 0;
}
@@ -5609,7 +5634,7 @@ static int io_timeout_remove(struct io_kiocb *req, unsigned int issue_flags)
ret = io_timeout_update(ctx, tr->addr, &tr->ts,
io_translate_timeout_mode(tr->flags));

- io_cqring_fill_event(ctx, req->user_data, ret, 0);
+ io_cqring_fill_event(ctx, req->user_data, ret, 0, req->cq_idx);
io_commit_cqring(ctx);
spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx);
@@ -5761,7 +5786,7 @@ static void io_async_find_and_cancel(struct io_ring_ctx *ctx,
done:
if (!ret)
ret = success_ret;
- io_cqring_fill_event(ctx, req->user_data, ret, 0);
+ io_cqring_fill_event(ctx, req->user_data, ret, 0, req->cq_idx);
io_commit_cqring(ctx);
spin_unlock_irqrestore(&ctx->completion_lock, flags);
io_cqring_ev_posted(ctx);
@@ -5818,7 +5843,7 @@ static int io_async_cancel(struct io_kiocb *req, unsigned int issue_flags)

spin_lock_irq(&ctx->completion_lock);
done:
- io_cqring_fill_event(ctx, req->user_data, ret, 0);
+ io_cqring_fill_event(ctx, req->user_data, ret, 0, req->cq_idx);
io_commit_cqring(ctx);
spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx);
@@ -6516,6 +6541,11 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
req->result = 0;
req->work.creds = NULL;

+ req->cq_idx = READ_ONCE(sqe->cq_idx);
+ if (unlikely(req->cq_idx >= ctx->cq_nr)) {
+ req->cq_idx = IO_DEFAULT_CQ;
+ return -EINVAL;
+ }
/* enforce forwards compatibility on users */
if (unlikely(sqe_flags & ~SQE_VALID_FLAGS))
return -EINVAL;
@@ -7548,7 +7578,7 @@ static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)

io_ring_submit_lock(ctx, lock_ring);
spin_lock_irqsave(&ctx->completion_lock, flags);
- io_cqring_fill_event(ctx, prsrc->tag, 0, 0);
+ io_cqring_fill_event(ctx, prsrc->tag, 0, 0, IO_DEFAULT_CQ);
ctx->cq_extra++;
io_commit_cqring(ctx);
spin_unlock_irqrestore(&ctx->completion_lock, flags);
@@ -9484,7 +9514,6 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,

/* make sure these are sane, as we already accounted them */
ctx->sq_entries = p->sq_entries;
- ctx->cqs[0].entries = p->cq_entries;

size = rings_size(p->sq_entries, p->cq_entries, &sq_array_offset);
if (size == SIZE_MAX)
@@ -9501,6 +9530,11 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
rings->sq_ring_entries = p->sq_entries;
rings->cq_ring_entries = p->cq_entries;

+ ctx->cqs[0].cached_tail = 0;
+ ctx->cqs[0].rings = rings;
+ ctx->cqs[0].entries = p->cq_entries;
+ ctx->cq_nr = 1;
+
size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
if (size == SIZE_MAX) {
io_mem_free(ctx->rings);
@@ -10164,6 +10198,7 @@ static int __init io_uring_init(void)
BUILD_BUG_SQE_ELEM(40, __u16, buf_index);
BUILD_BUG_SQE_ELEM(42, __u16, personality);
BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in);
+ BUILD_BUG_SQE_ELEM(48, __u16, cq_idx);

BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
sizeof(struct io_uring_rsrc_update));
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e1ae46683301..c2dfb179360a 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -58,6 +58,7 @@ struct io_uring_sqe {
/* personality to use, if used */
__u16 personality;
__s32 splice_fd_in;
+ __u16 cq_idx;
};
__u64 __pad2[3];
};
--
2.31.1
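
As a usage sketch (not part of the patch; it assumes the ebpf_v2 liburing
headers, where the SQE carries the new cq_idx field), userspace could direct
a request's completion into a non-default CQ roughly like this:

#include <liburing.h>

static int queue_nop_to_cq(struct io_uring *ring, unsigned int cq_idx)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	io_uring_prep_nop(sqe);
	sqe->user_data = 1;
	sqe->cq_idx = cq_idx;	/* 0 is the default CQ */
	return io_uring_submit(ring);
}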


2021-05-19 20:06:38

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 11/23] io_uring: enable mmap'ing additional CQs

TODO: get rid of extra offset

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 13 ++++++++++++-
include/uapi/linux/io_uring.h | 2 ++
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 067cfb3a6e4a..1a4c9e513ac9 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -9207,6 +9207,7 @@ static void *io_uring_validate_mmap_request(struct file *file,
struct io_ring_ctx *ctx = file->private_data;
loff_t offset = pgoff << PAGE_SHIFT;
struct page *page;
+ unsigned long cq_idx;
void *ptr;

switch (offset) {
@@ -9218,7 +9219,15 @@ static void *io_uring_validate_mmap_request(struct file *file,
ptr = ctx->sq_sqes;
break;
default:
- return ERR_PTR(-EINVAL);
+ if (offset < IORING_OFF_CQ_RING_EXTRA)
+ return ERR_PTR(-EINVAL);
+ offset -= IORING_OFF_CQ_RING_EXTRA;
+ if (offset % IORING_STRIDE_CQ_RING)
+ return ERR_PTR(-EINVAL);
+ cq_idx = offset / IORING_STRIDE_CQ_RING;
+ if (cq_idx >= ctx->cq_nr)
+ return ERR_PTR(-EINVAL);
+ ptr = ctx->cqs[cq_idx].rings;
}

page = virt_to_head_page(ptr);
@@ -9615,6 +9624,8 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,

return 0;
err:
+ while (ctx->cq_nr > 1)
+ io_mem_free(ctx->cqs[--ctx->cq_nr].rings);
io_mem_free(ctx->rings);
ctx->rings = NULL;
return ret;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 92b61ca09ea5..67a97c793de7 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -203,6 +203,8 @@ enum {
#define IORING_OFF_SQ_RING 0ULL
#define IORING_OFF_CQ_RING 0x8000000ULL
#define IORING_OFF_SQES 0x10000000ULL
+#define IORING_OFF_CQ_RING_EXTRA 0x1200000ULL
+#define IORING_STRIDE_CQ_RING 0x0100000ULL

/*
* Filled with the offset for mmap(2)
--
2.31.1
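
A userspace sketch of mapping one of the additional CQ rings through the new
offsets (illustration only; the helper name and error handling are made up,
and ring_bytes is assumed to be computed the same way as for the default CQ
ring):

#include <sys/mman.h>
#include <linux/io_uring.h>	/* IORING_OFF_CQ_RING_EXTRA, IORING_STRIDE_CQ_RING */

static void *map_extra_cq(int ring_fd, unsigned int cq_idx, size_t ring_bytes)
{
	off_t off = IORING_OFF_CQ_RING_EXTRA +
		    (off_t)cq_idx * IORING_STRIDE_CQ_RING;

	return mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, ring_fd, off);
}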


2021-05-19 20:06:43

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 15/23] io_uring: enable BPF to submit SQEs

Add a BPF_FUNC_iouring_queue_sqe BPF function as a demonstration of
submitting a new request from a BPF request.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++++----
include/uapi/linux/bpf.h | 1 +
2 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 20fddc5945f2..aae786291c57 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -882,6 +882,7 @@ struct io_defer_entry {
};

struct io_bpf_ctx {
+ struct io_ring_ctx *ctx;
};

struct io_op_def {
@@ -6681,7 +6682,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
ret = -EBADF;
}

- state->ios_left--;
+ if (state->ios_left > 1)
+ state->ios_left--;
return ret;
}

@@ -10345,10 +10347,50 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
return ret;
}

+BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
+ const struct io_uring_sqe *, sqe,
+ u32, sqe_len)
+{
+ struct io_ring_ctx *ctx = bpf_ctx->ctx;
+ struct io_kiocb *req;
+
+ if (sqe_len != sizeof(struct io_uring_sqe))
+ return -EINVAL;
+
+ req = io_alloc_req(ctx);
+ if (unlikely(!req))
+ return -ENOMEM;
+ if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
+ kmem_cache_free(req_cachep, req);
+ return -EAGAIN;
+ }
+ percpu_counter_add(&current->io_uring->inflight, 1);
+ refcount_add(1, &current->usage);
+
+ /* returns number of submitted SQEs or an error */
+ return !io_submit_sqe(ctx, req, sqe);
+}
+
+const struct bpf_func_proto io_bpf_queue_sqe_proto = {
+ .func = io_bpf_queue_sqe,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_PTR_TO_MEM,
+ .arg3_type = ARG_CONST_SIZE,
+};
+
static const struct bpf_func_proto *
io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
- return bpf_base_func_proto(func_id);
+ switch (func_id) {
+ case BPF_FUNC_copy_from_user:
+ return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
+ case BPF_FUNC_iouring_queue_sqe:
+ return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
}

static bool io_bpf_is_valid_access(int off, int size,
@@ -10379,9 +10421,10 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
atomic_read(&req->task->io_uring->in_idle)))
goto done;

- memset(&bpf_ctx, 0, sizeof(bpf_ctx));
+ bpf_ctx.ctx = ctx;
prog = req->bpf.prog;

+ io_submit_state_start(&ctx->submit_state, 1);
if (prog->aux->sleepable) {
rcu_read_lock();
bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
@@ -10389,7 +10432,7 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
} else {
bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
}
-
+ io_submit_state_end(&ctx->submit_state, ctx);
ret = 0;
done:
__io_req_complete(req, issue_flags, ret, 0);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index de544f0fbeef..cc268f749a7d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4082,6 +4082,7 @@ union bpf_attr {
FN(ima_inode_hash), \
FN(sock_from_file), \
FN(check_mtu), \
+ FN(iouring_queue_sqe), \
/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
--
2.31.1
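
For illustration only, a sketch of how a sleepable program might use the
helper from restricted C; the hand-rolled helper declaration follows the
usual bpf_<name>/BPF_FUNC_<name> convention, and the NOP submission is an
assumption, not taken from the examples in the liburing branch:

#include <linux/bpf.h>
#include <linux/io_uring.h>
#include <bpf/bpf_helpers.h>

static long (*bpf_iouring_queue_sqe)(void *ctx, const struct io_uring_sqe *sqe,
				     __u32 sqe_len) = (void *)BPF_FUNC_iouring_queue_sqe;

SEC("iouring.s/")	/* sleepable, required for submitting SQEs */
int submit_nop(void *ctx)
{
	struct io_uring_sqe sqe = {};

	sqe.opcode = IORING_OP_NOP;
	sqe.user_data = 0xcafe;
	/* returns the number of submitted SQEs or a negative error */
	bpf_iouring_queue_sqe(ctx, &sqe, sizeof(sqe));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";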


2021-05-19 20:19:00

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 18/23] libbpf: support io_uring

On Wed, May 19, 2021 at 7:14 AM Pavel Begunkov <[email protected]> wrote:
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> tools/lib/bpf/libbpf.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 4181d178ee7b..de5d1508f58e 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13,6 +13,10 @@
> #ifndef _GNU_SOURCE
> #define _GNU_SOURCE
> #endif
> +
> +/* hack, use local headers instead of system-wide */
> +#include "../../../include/uapi/linux/bpf.h"
> +

libbpf is already using the latest UAPI headers, so you don't need
this hack. You just haven't synced include/uapi/linux/bpf.h into
tools/include/uapi/linux/bpf.h

> #include <stdlib.h>
> #include <stdio.h>
> #include <stdarg.h>
> @@ -8630,6 +8634,9 @@ static const struct bpf_sec_def section_defs[] = {
> BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),
> BPF_EAPROG_SEC("sk_lookup/", BPF_PROG_TYPE_SK_LOOKUP,
> BPF_SK_LOOKUP),
> + SEC_DEF("iouring/", IOURING),
> + SEC_DEF("iouring.s/", IOURING,
> + .is_sleepable = true),
> };
>
> #undef BPF_PROG_SEC_IMPL
> --
> 2.31.1
>

2021-05-19 21:16:47

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 01/23] io_uring: shuffle rarely used ctx fields

There are a bunch of ctx fields scattered around that are almost never
used, e.g. only on ring exit. Plunge them to the end for better locality
and better aesthetics.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 36 +++++++++++++++++-------------------
1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9ac5e278a91e..7e3410ce100a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -367,9 +367,6 @@ struct io_ring_ctx {
unsigned cached_cq_overflow;
unsigned long sq_check_overflow;

- /* hashed buffered write serialization */
- struct io_wq_hash *hash_map;
-
struct list_head defer_list;
struct list_head timeout_list;
struct list_head cq_overflow_list;
@@ -386,9 +383,6 @@ struct io_ring_ctx {

struct io_rings *rings;

- /* Only used for accounting purposes */
- struct mm_struct *mm_account;
-
const struct cred *sq_creds; /* cred used for __io_sq_thread() */
struct io_sq_data *sq_data; /* if using sq thread polling */

@@ -409,14 +403,6 @@ struct io_ring_ctx {
unsigned nr_user_bufs;
struct io_mapped_ubuf **user_bufs;

- struct user_struct *user;
-
- struct completion ref_comp;
-
-#if defined(CONFIG_UNIX)
- struct socket *ring_sock;
-#endif
-
struct xarray io_buffers;

struct xarray personalities;
@@ -460,12 +446,24 @@ struct io_ring_ctx {

struct io_restriction restrictions;

- /* exit task_work */
- struct callback_head *exit_task_work;
-
/* Keep this last, we don't need it for the fast path */
- struct work_struct exit_work;
- struct list_head tctx_list;
+ struct {
+ #if defined(CONFIG_UNIX)
+ struct socket *ring_sock;
+ #endif
+ /* hashed buffered write serialization */
+ struct io_wq_hash *hash_map;
+
+ /* Only used for accounting purposes */
+ struct user_struct *user;
+ struct mm_struct *mm_account;
+
+ /* ctx exit and cancelation */
+ struct callback_head *exit_task_work;
+ struct work_struct exit_work;
+ struct list_head tctx_list;
+ struct completion ref_comp;
+ };
};

struct io_uring_task {
--
2.31.1


2021-05-19 21:16:49

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 03/23] io_uring: remove dependency on ring->sq/cq_entries

We have the numbers of {sq,cq} entries cached in ctx, so don't look them
up in the user-shared rings, as 1) it may fetch an additional cacheline and
2) the user may change them, so it's always error-prone.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 31eca208f675..15dc5dad1f7d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1352,7 +1352,7 @@ static inline bool io_sqring_full(struct io_ring_ctx *ctx)
{
struct io_rings *r = ctx->rings;

- return READ_ONCE(r->sq.tail) - ctx->cached_sq_head == r->sq_ring_entries;
+ return READ_ONCE(r->sq.tail) - ctx->cached_sq_head == ctx->sq_entries;
}

static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
@@ -1370,7 +1370,7 @@ static inline struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
* control dependency is enough as we're using WRITE_ONCE to
* fill the cq entry
*/
- if (__io_cqring_events(ctx) == rings->cq_ring_entries)
+ if (__io_cqring_events(ctx) == ctx->cq_entries)
return NULL;

tail = ctx->cached_cq_tail++;
@@ -1423,11 +1423,10 @@ static void io_cqring_ev_posted_iopoll(struct io_ring_ctx *ctx)
/* Returns true if there are no backlogged entries after the flush */
static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
{
- struct io_rings *rings = ctx->rings;
unsigned long flags;
bool all_flushed, posted;

- if (!force && __io_cqring_events(ctx) == rings->cq_ring_entries)
+ if (!force && __io_cqring_events(ctx) == ctx->cq_entries)
return false;

posted = false;
--
2.31.1


2021-05-19 21:16:52

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 04/23] io_uring: deduce cq_mask from cq_entries

No need to cache cq_mask, it's exactly cq_entries - 1, so just deduce it
instead of carrying it around.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 15dc5dad1f7d..067c89e63fea 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -361,7 +361,6 @@ struct io_ring_ctx {
u32 *sq_array;
unsigned cached_sq_head;
unsigned sq_entries;
- unsigned sq_mask;
unsigned sq_thread_idle;
unsigned cached_sq_dropped;
unsigned cached_cq_overflow;
@@ -407,7 +406,6 @@ struct io_ring_ctx {
struct {
unsigned cached_cq_tail;
unsigned cq_entries;
- unsigned cq_mask;
atomic_t cq_timeouts;
unsigned cq_last_tm_flush;
unsigned cq_extra;
@@ -1363,7 +1361,7 @@ static inline unsigned int __io_cqring_events(struct io_ring_ctx *ctx)
static inline struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
{
struct io_rings *rings = ctx->rings;
- unsigned tail;
+ unsigned tail, mask = ctx->cq_entries - 1;

/*
* writes to the cq entry need to come after reading head; the
@@ -1374,7 +1372,7 @@ static inline struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
return NULL;

tail = ctx->cached_cq_tail++;
- return &rings->cqes[tail & ctx->cq_mask];
+ return &rings->cqes[tail & mask];
}

static inline bool io_should_trigger_evfd(struct io_ring_ctx *ctx)
@@ -6677,7 +6675,7 @@ static void io_commit_sqring(struct io_ring_ctx *ctx)
static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
{
u32 *sq_array = ctx->sq_array;
- unsigned head;
+ unsigned head, mask = ctx->sq_entries - 1;

/*
* The cached sq head (or cq tail) serves two purposes:
@@ -6687,7 +6685,7 @@ static const struct io_uring_sqe *io_get_sqe(struct io_ring_ctx *ctx)
* 2) allows the kernel side to track the head on its own, even
* though the application is the one updating it.
*/
- head = READ_ONCE(sq_array[ctx->cached_sq_head++ & ctx->sq_mask]);
+ head = READ_ONCE(sq_array[ctx->cached_sq_head++ & mask]);
if (likely(head < ctx->sq_entries))
return &ctx->sq_sqes[head];

@@ -9493,8 +9491,6 @@ static int io_allocate_scq_urings(struct io_ring_ctx *ctx,
rings->cq_ring_mask = p->cq_entries - 1;
rings->sq_ring_entries = p->sq_entries;
rings->cq_ring_entries = p->cq_entries;
- ctx->sq_mask = rings->sq_ring_mask;
- ctx->cq_mask = rings->cq_ring_mask;

size = array_size(sizeof(struct io_uring_sqe), p->sq_entries);
if (size == SIZE_MAX) {
--
2.31.1
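
The deduction only works because ring sizes are powers of two, so masking
with entries - 1 is the same as a modulo, e.g.:

	/* entries == 8 -> mask == 7 (0b111), so tail & mask == tail % entries */
	unsigned int entries = 8, mask = entries - 1;
	unsigned int idx = 13 & mask;	/* 5, same as 13 % 8 */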


2021-05-19 21:16:56

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 09/23] io_uring: extract cq size helper

Extract a helper calculating the CQ size from a userspace-specified number
of entries.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 38 ++++++++++++++++++++++++--------------
1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 356a5dc90f46..f05592ae5f41 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1139,6 +1139,24 @@ static inline bool io_is_timeout_noseq(struct io_kiocb *req)
return !req->timeout.off;
}

+static long io_get_cqring_size(struct io_uring_params *p, unsigned entries)
+{
+ /*
+ * If IORING_SETUP_CQSIZE is set, we do the same roundup
+ * to a power-of-two, if it isn't already. We do NOT impose
+ * any cq vs sq ring sizing.
+ */
+ if (!entries)
+ return -EINVAL;
+ if (entries > IORING_MAX_CQ_ENTRIES) {
+ if (!(p->flags & IORING_SETUP_CLAMP))
+ return -EINVAL;
+ entries = IORING_MAX_CQ_ENTRIES;
+ }
+ entries = roundup_pow_of_two(entries);
+ return entries;
+}
+
static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
{
struct io_ring_ctx *ctx;
@@ -9625,21 +9643,13 @@ static int io_uring_create(unsigned entries, struct io_uring_params *p,
*/
p->sq_entries = roundup_pow_of_two(entries);
if (p->flags & IORING_SETUP_CQSIZE) {
- /*
- * If IORING_SETUP_CQSIZE is set, we do the same roundup
- * to a power-of-two, if it isn't already. We do NOT impose
- * any cq vs sq ring sizing.
- */
- if (!p->cq_entries)
- return -EINVAL;
- if (p->cq_entries > IORING_MAX_CQ_ENTRIES) {
- if (!(p->flags & IORING_SETUP_CLAMP))
- return -EINVAL;
- p->cq_entries = IORING_MAX_CQ_ENTRIES;
- }
- p->cq_entries = roundup_pow_of_two(p->cq_entries);
- if (p->cq_entries < p->sq_entries)
+ long cq_entries = io_get_cqring_size(p, p->cq_entries);
+
+ if (cq_entries < 0)
+ return cq_entries;
+ if (cq_entries < p->sq_entries)
return -EINVAL;
+ p->cq_entries = cq_entries;
} else {
p->cq_entries = 2 * p->sq_entries;
}
--
2.31.1
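
For context, the userspace side that ends up in this path is the standard
IORING_SETUP_CQSIZE setup (plain liburing usage, shown only as an
illustration):

#include <liburing.h>

static int setup_big_cq(struct io_uring *ring)
{
	struct io_uring_params p = {};

	p.flags = IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP;
	p.cq_entries = 4096;	/* rounded up to a power of two by the kernel */

	return io_uring_queue_init_params(64, ring, &p);
}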


2021-05-21 00:44:13

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 18/23] libbpf: support io_uring

On 5/19/21 6:38 PM, Andrii Nakryiko wrote:
> On Wed, May 19, 2021 at 7:14 AM Pavel Begunkov <[email protected]> wrote:
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> tools/lib/bpf/libbpf.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 4181d178ee7b..de5d1508f58e 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -13,6 +13,10 @@
>> #ifndef _GNU_SOURCE
>> #define _GNU_SOURCE
>> #endif
>> +
>> +/* hack, use local headers instead of system-wide */
>> +#include "../../../include/uapi/linux/bpf.h"
>> +
>
> libbpf is already using the latest UAPI headers, so you don't need
> this hack. You just haven't synced include/uapi/linux/bpf.h into
> tools/include/uapi/linux/bpf.h

It's more convenient for me to keep it local while this is an RFC; I'll
surely drop it later.

btw, I had a problem with find_sec_def() successfully matching the
"iouring.s" string against "iouring", because section_defs[i].len
doesn't include the final \0 and so it does a sort of prefix comparison.
That's why "iouring/". Can we fix it? Are there compatibility concerns?

>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <stdarg.h>
>> @@ -8630,6 +8634,9 @@ static const struct bpf_sec_def section_defs[] = {
>> BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),
>> BPF_EAPROG_SEC("sk_lookup/", BPF_PROG_TYPE_SK_LOOKUP,
>> BPF_SK_LOOKUP),
>> + SEC_DEF("iouring/", IOURING),
>> + SEC_DEF("iouring.s/", IOURING,
>> + .is_sleepable = true),
>> };
>>
>> #undef BPF_PROG_SEC_IMPL
>> --
>> 2.31.1
>>

--
Pavel Begunkov

2021-05-21 08:14:19

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH 01/23] io_uring: shuffle rarely used ctx fields



> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>
> There is a bunch of scattered around ctx fields that are almost never
> used, e.g. only on ring exit, plunge them to the end, better locality,
> better aesthetically.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 36 +++++++++++++++++-------------------
> 1 file changed, 17 insertions(+), 19 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9ac5e278a91e..7e3410ce100a 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -367,9 +367,6 @@ struct io_ring_ctx {
> unsigned cached_cq_overflow;
> unsigned long sq_check_overflow;
>
> - /* hashed buffered write serialization */
> - struct io_wq_hash *hash_map;
> -
> struct list_head defer_list;
> struct list_head timeout_list;
> struct list_head cq_overflow_list;
> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>
> struct io_rings *rings;
>
> - /* Only used for accounting purposes */
> - struct mm_struct *mm_account;
> -
> const struct cred *sq_creds; /* cred used for __io_sq_thread() */
> struct io_sq_data *sq_data; /* if using sq thread polling */
>
> @@ -409,14 +403,6 @@ struct io_ring_ctx {
> unsigned nr_user_bufs;
> struct io_mapped_ubuf **user_bufs;
>
> - struct user_struct *user;
> -
> - struct completion ref_comp;
> -
> -#if defined(CONFIG_UNIX)
> - struct socket *ring_sock;
> -#endif
> -
> struct xarray io_buffers;
>
> struct xarray personalities;
> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>
> struct io_restriction restrictions;
>
> - /* exit task_work */
> - struct callback_head *exit_task_work;
> -
> /* Keep this last, we don't need it for the fast path */
> - struct work_struct exit_work;
> - struct list_head tctx_list;
> + struct {

Why do we need an anonymous struct here? For cache line alignment?
Do we need ____cacheline_aligned_in_smp?

> + #if defined(CONFIG_UNIX)
> + struct socket *ring_sock;
> + #endif
> + /* hashed buffered write serialization */
> + struct io_wq_hash *hash_map;
> +
> + /* Only used for accounting purposes */
> + struct user_struct *user;
> + struct mm_struct *mm_account;
> +
> + /* ctx exit and cancelation */
> + struct callback_head *exit_task_work;
> + struct work_struct exit_work;
> + struct list_head tctx_list;
> + struct completion ref_comp;
> + };
> };
>
> struct io_uring_task {
> --
> 2.31.1
>

2021-05-21 08:44:39

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH 13/23] io_uring: implement bpf prog registration



> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>
> [de]register BPF programs through io_uring_register() with new
> IORING_ATTACH_BPF and IORING_DETACH_BPF commands.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 81 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/io_uring.h | 2 +
> 2 files changed, 83 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 882b16b5e5eb..b13cbcd5c47b 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
> #include <linux/task_work.h>
> #include <linux/pagemap.h>
> #include <linux/io_uring.h>
> +#include <linux/bpf.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/io_uring.h>
> @@ -103,6 +104,8 @@
> #define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
> IORING_REGISTER_LAST + IORING_OP_LAST)
>
> +#define IORING_MAX_BPF_PROGS 100

Is 100 a realistic number here?

> +
> #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \
> IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
> IOSQE_BUFFER_SELECT)
> @@ -266,6 +269,10 @@ struct io_restriction {
> bool registered;
> };
>

[...]

2021-05-21 08:46:00

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH 15/23] io_uring: enable BPF to submit SQEs



> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>
> Add a BPF_FUNC_iouring_queue_sqe BPF function as a demonstration of
> submitting a new request from a BPF request.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 51 ++++++++++++++++++++++++++++++++++++----
> include/uapi/linux/bpf.h | 1 +
> 2 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 20fddc5945f2..aae786291c57 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -882,6 +882,7 @@ struct io_defer_entry {
> };
>
> struct io_bpf_ctx {
> + struct io_ring_ctx *ctx;
> };
>
> struct io_op_def {
> @@ -6681,7 +6682,8 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
> ret = -EBADF;
> }
>
> - state->ios_left--;
> + if (state->ios_left > 1)
> + state->ios_left--;
> return ret;
> }
>
> @@ -10345,10 +10347,50 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> return ret;
> }
>
> +BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
> + const struct io_uring_sqe *, sqe,
> + u32, sqe_len)
> +{
> + struct io_ring_ctx *ctx = bpf_ctx->ctx;
> + struct io_kiocb *req;
> +
> + if (sqe_len != sizeof(struct io_uring_sqe))
> + return -EINVAL;
> +
> + req = io_alloc_req(ctx);
> + if (unlikely(!req))
> + return -ENOMEM;
> + if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
> + kmem_cache_free(req_cachep, req);
> + return -EAGAIN;
> + }
> + percpu_counter_add(&current->io_uring->inflight, 1);
> + refcount_add(1, &current->usage);
> +
> + /* returns number of submitted SQEs or an error */
> + return !io_submit_sqe(ctx, req, sqe);
> +}
> +
> +const struct bpf_func_proto io_bpf_queue_sqe_proto = {
> + .func = io_bpf_queue_sqe,
> + .gpl_only = false,
> + .ret_type = RET_INTEGER,
> + .arg1_type = ARG_PTR_TO_CTX,
> + .arg2_type = ARG_PTR_TO_MEM,
> + .arg3_type = ARG_CONST_SIZE,
> +};
> +
> static const struct bpf_func_proto *
> io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> {
> - return bpf_base_func_proto(func_id);
> + switch (func_id) {
> + case BPF_FUNC_copy_from_user:
> + return prog->aux->sleepable ? &bpf_copy_from_user_proto : NULL;
> + case BPF_FUNC_iouring_queue_sqe:
> + return prog->aux->sleepable ? &io_bpf_queue_sqe_proto : NULL;
> + default:
> + return bpf_base_func_proto(func_id);
> + }
> }
>
> static bool io_bpf_is_valid_access(int off, int size,
> @@ -10379,9 +10421,10 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
> atomic_read(&req->task->io_uring->in_idle)))
> goto done;
>
> - memset(&bpf_ctx, 0, sizeof(bpf_ctx));
> + bpf_ctx.ctx = ctx;
> prog = req->bpf.prog;
>
> + io_submit_state_start(&ctx->submit_state, 1);
> if (prog->aux->sleepable) {
> rcu_read_lock();
> bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> @@ -10389,7 +10432,7 @@ static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
> } else {
> bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> }
> -
> + io_submit_state_end(&ctx->submit_state, ctx);
> ret = 0;
> done:
> __io_req_complete(req, issue_flags, ret, 0);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index de544f0fbeef..cc268f749a7d 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4082,6 +4082,7 @@ union bpf_attr {
> FN(ima_inode_hash), \
> FN(sock_from_file), \
> FN(check_mtu), \
> + FN(iouring_queue_sqe), \

We need to describe this function in the comment above, just like 20/23 does.
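
Something along these lines, following the existing helper description
format in bpf.h (the wording below is only a sketch, not proposed text):

 * long bpf_iouring_queue_sqe(void *ctx, const struct io_uring_sqe *sqe, u32 sqe_len)
 *	Description
 *		Queue *sqe* of *sqe_len* bytes for submission on the io_uring
 *		instance the program is attached to.
 *	Return
 *		Number of submitted SQEs on success, or a negative error.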

> /* */
>
> /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> --
> 2.31.1
>

2021-05-21 08:49:44

by Song Liu

[permalink] [raw]
Subject: Re: [RFC v2 00/23] io_uring BPF requests



> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>
> The main problem solved is feeding completion information of other
> requests in a form of CQEs back into BPF. I decided to wire up support
> for multiple completion queues (aka CQs) and give BPF programs access to
> them, so leaving userspace in control over synchronisation that should
> be much more flexible that the link-based approach.
>
> For instance, there can be a separate CQ for each BPF program, so no
> extra sync is needed, and communication can be done by submitting a
> request targeting a neighboring CQ or submitting a CQE there directly
> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
> cross-fire if willing.
>

[...]

> bpf: add IOURING program type
> io_uring: implement bpf prog registration
> io_uring: add support for bpf requests
> io_uring: enable BPF to submit SQEs
> io_uring: enable bpf to submit CQEs
> io_uring: enable bpf to reap CQEs
> libbpf: support io_uring
> io_uring: pass user_data to bpf executor
> bpf: Add bpf_copy_to_user() helper
> io_uring: wire bpf copy to user
> io_uring: don't wait on CQ exclusively
> io_uring: enable bpf reqs to wait for CQs

Besides the few comments, these BPF-related patches look sane to me.
Please consider adding some selftests (tools/testing/selftests/bpf).

Thanks,
Song

2021-05-21 08:56:16

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 12/23] bpf: add IOURING program type

On 5/21/21 12:34 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>>
>> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
>> io_uring to execute BPF-based requests.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> fs/io_uring.c | 21 +++++++++++++++++++++
>> include/linux/bpf_types.h | 2 ++
>> include/uapi/linux/bpf.h | 1 +
>> kernel/bpf/syscall.c | 1 +
>> kernel/bpf/verifier.c | 5 ++++-
>> 5 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 1a4c9e513ac9..882b16b5e5eb 100644
[...]
>> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
>> + void *, void *)
>>
>> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
>> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 4ba4ef0ff63a..de544f0fbeef 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -206,6 +206,7 @@ enum bpf_prog_type {
>> BPF_PROG_TYPE_EXT,
>> BPF_PROG_TYPE_LSM,
>> BPF_PROG_TYPE_SK_LOOKUP,
>> + BPF_PROG_TYPE_IOURING,
>> };
>>
>> enum bpf_attach_type {
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 250503482cda..6ef7a26f4dc3 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
>> case BPF_PROG_TYPE_CGROUP_SOCKOPT:
>> case BPF_PROG_TYPE_CGROUP_SYSCTL:
>> case BPF_PROG_TYPE_SOCK_OPS:
>> + case BPF_PROG_TYPE_IOURING:
>> case BPF_PROG_TYPE_EXT: /* extends any prog */
>> return true;
>> case BPF_PROG_TYPE_CGROUP_SKB:
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 0399ac092b36..2a53f44618a7 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
>> case BPF_PROG_TYPE_SK_LOOKUP:
>> range = tnum_range(SK_DROP, SK_PASS);
>> break;
>> + case BPF_PROG_TYPE_IOURING:
>> + range = tnum_const(0);
>> + break;
>> case BPF_PROG_TYPE_EXT:
>> /* freplace program can return anything as its return value
>> * depends on the to-be-replaced kernel func or bpf program.
>> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
>> u64 key;
>>
>> if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
>> - prog->type != BPF_PROG_TYPE_LSM) {
>> + prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {
>
> Is IOURING program sleepable? If so, please highlight that in the commit log

Supposed to work with both, sleepable and not, but with a different set
of helpers, e.g. it can't submit requests nor do bpf_copy_from_user() if
it can't sleep. The only other difference in handling is RCU around
non-sleepable, but please shout out if I forgot anything

> and update the warning below.

Sure

>
>> verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
>> return -EINVAL;
>> }
>> --
>> 2.31.1
>>
>

--
Pavel Begunkov

2021-05-21 09:00:48

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 15/23] io_uring: enable BPF to submit SQEs

On Wed, May 19, 2021 at 03:13:26PM +0100, Pavel Begunkov wrote:
>
> +BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
> + const struct io_uring_sqe *, sqe,
> + u32, sqe_len)
> +{
> + struct io_ring_ctx *ctx = bpf_ctx->ctx;
> + struct io_kiocb *req;
> +
> + if (sqe_len != sizeof(struct io_uring_sqe))
> + return -EINVAL;
> +
> + req = io_alloc_req(ctx);

that is a GFP_KERNEL allocation.
It's only allowed from sleepable bpf progs and further down
there is a correct check for it, so all good.
But submitting an sqe is a fundamental io_uring operation,
so what is the use case for non-sleepable?
In other words, why bother? Allow sleepable only and simplify the code?

> + if (unlikely(!req))
> + return -ENOMEM;
> + if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
> + kmem_cache_free(req_cachep, req);
> + return -EAGAIN;
> + }
> + percpu_counter_add(&current->io_uring->inflight, 1);
> + refcount_add(1, &current->usage);
> +
> + /* returns number of submitted SQEs or an error */
> + return !io_submit_sqe(ctx, req, sqe);

A buggy bpf prog will be able to pass sizeof(struct io_uring_sqe) bytes
of junk as 'sqe' here.
What kind of validation does io_submit_sqe() do to avoid crashing the kernel?

General comments that apply to all patches:
- commit logs are way too terse. Pls expand with details.
- describe new bpf helpers in comments in bpf.h. Just adding them to an enum is not enough.
- selftest/bpf are mandatory for all new bpf features.
- consider bpf_link style of attaching bpf progs. We had enough issues with progs
that get stuck due to application bugs. Auto-detach saves the day more often than not.

2021-05-21 16:12:45

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH 18/23] libbpf: support io_uring

On Thu, May 20, 2021 at 2:58 AM Pavel Begunkov <[email protected]> wrote:
>
> On 5/19/21 6:38 PM, Andrii Nakryiko wrote:
> > On Wed, May 19, 2021 at 7:14 AM Pavel Begunkov <[email protected]> wrote:
> >>
> >> Signed-off-by: Pavel Begunkov <[email protected]>
> >> ---
> >> tools/lib/bpf/libbpf.c | 7 +++++++
> >> 1 file changed, 7 insertions(+)
> >>
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index 4181d178ee7b..de5d1508f58e 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -13,6 +13,10 @@
> >> #ifndef _GNU_SOURCE
> >> #define _GNU_SOURCE
> >> #endif
> >> +
> >> +/* hack, use local headers instead of system-wide */
> >> +#include "../../../include/uapi/linux/bpf.h"
> >> +
> >
> > libbpf is already using the latest UAPI headers, so you don't need
> > this hack. You just haven't synced include/uapi/linux/bpf.h into
> > tools/include/uapi/linux/bpf.h
>
> It's more convenient to keep it local to me while RFC, surely will
> drop it later.
>
> btw, I had a problem with find_sec_def() successfully matching
> "iouring.s" string with "iouring", because section_defs[i].len
> doesn't include final \0 and so does a sort of prefix comparison.
> That's why "iouring/". Can we fix it? Are compatibility concerns?

If you put "iouring.s" before "iouring" it will be matched first;
libbpf matches them in order, so the more specific prefix should go
first. It is currently always treated as a prefix, not an exact match,
unfortunately. I have work planned to revamp this logic quite a bit
for libbpf 1.0, so this should be improved as part of that work.
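
Concretely, for this series that would mean ordering the two entries like
below (a sketch only, not checked against the tree):

	SEC_DEF("iouring.s/", IOURING,
		.is_sleepable = true),
	SEC_DEF("iouring/", IOURING),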



>
> >
> >> #include <stdlib.h>
> >> #include <stdio.h>
> >> #include <stdarg.h>
> >> @@ -8630,6 +8634,9 @@ static const struct bpf_sec_def section_defs[] = {
> >> BPF_PROG_SEC("struct_ops", BPF_PROG_TYPE_STRUCT_OPS),
> >> BPF_EAPROG_SEC("sk_lookup/", BPF_PROG_TYPE_SK_LOOKUP,
> >> BPF_SK_LOOKUP),
> >> + SEC_DEF("iouring/", IOURING),
> >> + SEC_DEF("iouring.s/", IOURING,
> >> + .is_sleepable = true),
> >> };
> >>
> >> #undef BPF_PROG_SEC_IMPL
> >> --
> >> 2.31.1
> >>
>
> --
> Pavel Begunkov

2021-05-21 20:02:03

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 01/23] io_uring: shuffle rarely used ctx fields

On 5/20/21 10:46 PM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>> There is a bunch of scattered around ctx fields that are almost never
>> used, e.g. only on ring exit, plunge them to the end, better locality,
>> better aesthetically.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> fs/io_uring.c | 36 +++++++++++++++++-------------------
>> 1 file changed, 17 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9ac5e278a91e..7e3410ce100a 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -367,9 +367,6 @@ struct io_ring_ctx {
>> unsigned cached_cq_overflow;
>> unsigned long sq_check_overflow;
>>
>> - /* hashed buffered write serialization */
>> - struct io_wq_hash *hash_map;
>> -
>> struct list_head defer_list;
>> struct list_head timeout_list;
>> struct list_head cq_overflow_list;
>> @@ -386,9 +383,6 @@ struct io_ring_ctx {
>>
>> struct io_rings *rings;
>>
>> - /* Only used for accounting purposes */
>> - struct mm_struct *mm_account;
>> -
>> const struct cred *sq_creds; /* cred used for __io_sq_thread() */
>> struct io_sq_data *sq_data; /* if using sq thread polling */
>>
>> @@ -409,14 +403,6 @@ struct io_ring_ctx {
>> unsigned nr_user_bufs;
>> struct io_mapped_ubuf **user_bufs;
>>
>> - struct user_struct *user;
>> -
>> - struct completion ref_comp;
>> -
>> -#if defined(CONFIG_UNIX)
>> - struct socket *ring_sock;
>> -#endif
>> -
>> struct xarray io_buffers;
>>
>> struct xarray personalities;
>> @@ -460,12 +446,24 @@ struct io_ring_ctx {
>>
>> struct io_restriction restrictions;
>>
>> - /* exit task_work */
>> - struct callback_head *exit_task_work;
>> -
>> /* Keep this last, we don't need it for the fast path */
>> - struct work_struct exit_work;
>> - struct list_head tctx_list;
>> + struct {
>
> Why do we need an anonymous struct here? For cache line alignment?
> Do we need ____cacheline_aligned_in_smp?

Rather as a visual hint, considering that most of the fields historically
are in structs (____cacheline_aligned_in_smp). It's also preparing for
potentially splitting it out of the ctx struct as it grows big.

The first 2-3 patches are not strictly related to bpf and will go
separately earlier; the set was just based on them.

>> + #if defined(CONFIG_UNIX)
>> + struct socket *ring_sock;
>> + #endif
>> + /* hashed buffered write serialization */
>> + struct io_wq_hash *hash_map;
>> +
>> + /* Only used for accounting purposes */
>> + struct user_struct *user;
>> + struct mm_struct *mm_account;
>> +
>> + /* ctx exit and cancelation */
>> + struct callback_head *exit_task_work;
>> + struct work_struct exit_work;
>> + struct list_head tctx_list;
>> + struct completion ref_comp;
>> + };
>> };
>>
>> struct io_uring_task {

--
Pavel Begunkov

2021-05-21 20:04:40

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 14/23] io_uring: add support for bpf requests

On 5/19/21 3:13 PM, Pavel Begunkov wrote:
> Wire up a new io_uring operation type IORING_OP_BPF, which executes a
> specified BPF program from the registered prog table. It doesn't allow
> doing anything useful for now; no BPF functions are allowed apart from
> basic ones.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 92 +++++++++++++++++++++++++++++++++++
> include/uapi/linux/io_uring.h | 1 +
> 2 files changed, 93 insertions(+)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index b13cbcd5c47b..20fddc5945f2 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -682,6 +682,11 @@ struct io_unlink {
> struct filename *filename;
> };
>
> +struct io_bpf {
> + struct file *file;
> + struct bpf_prog *prog;
> +};
> +
> struct io_completion {
> struct file *file;
> struct list_head list;
> @@ -826,6 +831,7 @@ struct io_kiocb {
> struct io_shutdown shutdown;
> struct io_rename rename;
> struct io_unlink unlink;
> + struct io_bpf bpf;
> /* use only after cleaning per-op data, see io_clean_op() */
> struct io_completion compl;
> };
> @@ -875,6 +881,9 @@ struct io_defer_entry {
> u32 seq;
> };
>
> +struct io_bpf_ctx {
> +};
> +
> struct io_op_def {
> /* needs req->file assigned */
> unsigned needs_file : 1;
> @@ -1039,6 +1048,7 @@ static const struct io_op_def io_op_defs[] = {
> },
> [IORING_OP_RENAMEAT] = {},
> [IORING_OP_UNLINKAT] = {},
> + [IORING_OP_BPF] = {},
> };
>
> static bool io_disarm_next(struct io_kiocb *req);
> @@ -1070,6 +1080,7 @@ static void io_rsrc_put_work(struct work_struct *work);
> static void io_req_task_queue(struct io_kiocb *req);
> static void io_submit_flush_completions(struct io_comp_state *cs,
> struct io_ring_ctx *ctx);
> +static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags);
> static bool io_poll_remove_waitqs(struct io_kiocb *req);
> static int io_req_prep_async(struct io_kiocb *req);
>
> @@ -3931,6 +3942,53 @@ static int io_openat(struct io_kiocb *req, unsigned int issue_flags)
> return io_openat2(req, issue_flags);
> }
>
> +static int io_bpf_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> + struct io_ring_ctx *ctx = req->ctx;
> + struct bpf_prog *prog;
> + unsigned int idx;
> +
> + if (unlikely(ctx->flags & (IORING_SETUP_IOPOLL|IORING_SETUP_SQPOLL)))
> + return -EINVAL;
> + if (unlikely(req->flags & (REQ_F_FIXED_FILE | REQ_F_BUFFER_SELECT)))
> + return -EINVAL;
> + if (sqe->ioprio || sqe->len || sqe->cancel_flags)
> + return -EINVAL;
> + if (sqe->addr)
> + return -EINVAL;
> +
> + idx = READ_ONCE(sqe->off);
> + if (unlikely(idx >= ctx->nr_bpf_progs))
> + return -EFAULT;
> + idx = array_index_nospec(idx, ctx->nr_bpf_progs);
> + prog = ctx->bpf_progs[idx].prog;
> + if (!prog)
> + return -EFAULT;
> +
> + req->bpf.prog = prog;
> + return 0;
> +}
> +
> +static void io_bpf_run_task_work(struct callback_head *cb)
> +{
> + struct io_kiocb *req = container_of(cb, struct io_kiocb, task_work);
> + struct io_ring_ctx *ctx = req->ctx;
> +
> + mutex_lock(&ctx->uring_lock);
> + io_bpf_run(req, 0);
> + mutex_unlock(&ctx->uring_lock);
> +}
> +
> +static int io_bpf(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + init_task_work(&req->task_work, io_bpf_run_task_work);
> + if (unlikely(io_req_task_work_add(req))) {
> + req_ref_get(req);
> + io_req_task_queue_fail(req, -ECANCELED);
> + }
> + return 0;
> +}
> +
> static int io_remove_buffers_prep(struct io_kiocb *req,
> const struct io_uring_sqe *sqe)
> {
> @@ -6002,6 +6060,8 @@ static int io_req_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> return io_renameat_prep(req, sqe);
> case IORING_OP_UNLINKAT:
> return io_unlinkat_prep(req, sqe);
> + case IORING_OP_BPF:
> + return io_bpf_prep(req, sqe);
> }
>
> printk_once(KERN_WARNING "io_uring: unhandled opcode %d\n",
> @@ -6269,6 +6329,9 @@ static int io_issue_sqe(struct io_kiocb *req, unsigned int issue_flags)
> case IORING_OP_UNLINKAT:
> ret = io_unlinkat(req, issue_flags);
> break;
> + case IORING_OP_BPF:
> + ret = io_bpf(req, issue_flags);
> + break;
> default:
> ret = -EINVAL;
> break;
> @@ -10303,6 +10366,35 @@ const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
> .is_valid_access = io_bpf_is_valid_access,
> };
>
> +static void io_bpf_run(struct io_kiocb *req, unsigned int issue_flags)
> +{
> + struct io_ring_ctx *ctx = req->ctx;
> + struct io_bpf_ctx bpf_ctx;
> + struct bpf_prog *prog;
> + int ret = -EAGAIN;
> +
> + lockdep_assert_held(&req->ctx->uring_lock);
> +
> + if (unlikely(percpu_ref_is_dying(&ctx->refs) ||
> + atomic_read(&req->task->io_uring->in_idle)))
> + goto done;
> +
> + memset(&bpf_ctx, 0, sizeof(bpf_ctx));
> + prog = req->bpf.prog;
> +
> + if (prog->aux->sleepable) {

Looks like I forgot to amend this, the condition should be inverted.

> + rcu_read_lock();
> + bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> + rcu_read_unlock();
> + } else {
> + bpf_prog_run_pin_on_cpu(req->bpf.prog, &bpf_ctx);
> + }
> +
> + ret = 0;
> +done:
> + __io_req_complete(req, issue_flags, ret, 0);
> +}
> +
> SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> void __user *, arg, unsigned int, nr_args)
> {
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index b450f41d7389..25ab804670e1 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -138,6 +138,7 @@ enum {
> IORING_OP_SHUTDOWN,
> IORING_OP_RENAMEAT,
> IORING_OP_UNLINKAT,
> + IORING_OP_BPF,
>
> /* this goes last, obviously */
> IORING_OP_LAST,
>

--
Pavel Begunkov
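
As a usage sketch for the opcode discussed above (hypothetical; it assumes
the BPF program was registered at slot 0 via IORING_ATTACH_BPF and that the
liburing headers from the ebpf_v2 branch are in use):

#include <string.h>
#include <liburing.h>

static int queue_bpf_req(struct io_uring *ring)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_BPF;
	sqe->off = 0;		/* index into the registered BPF prog table */
	sqe->user_data = 42;
	return io_uring_submit(ring);
}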

2021-05-21 20:05:00

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 13/23] io_uring: implement bpf prog registration

On 5/21/21 12:45 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>>
>> [de]register BPF programs through io_uring_register() with new
>> IORING_ATTACH_BPF and IORING_DETACH_BPF commands.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> fs/io_uring.c | 81 +++++++++++++++++++++++++++++++++++
>> include/uapi/linux/io_uring.h | 2 +
>> 2 files changed, 83 insertions(+)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 882b16b5e5eb..b13cbcd5c47b 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -78,6 +78,7 @@
>> #include <linux/task_work.h>
>> #include <linux/pagemap.h>
>> #include <linux/io_uring.h>
>> +#include <linux/bpf.h>
>>
>> #define CREATE_TRACE_POINTS
>> #include <trace/events/io_uring.h>
>> @@ -103,6 +104,8 @@
>> #define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
>> IORING_REGISTER_LAST + IORING_OP_LAST)
>>
>> +#define IORING_MAX_BPF_PROGS 100
>
> Is 100 a realistic number here?

Arbitrary test value, will update

>
>> +
>> #define SQE_VALID_FLAGS (IOSQE_FIXED_FILE|IOSQE_IO_DRAIN|IOSQE_IO_LINK| \
>> IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
>> IOSQE_BUFFER_SELECT)
>> @@ -266,6 +269,10 @@ struct io_restriction {
>> bool registered;
>> };
>>
>
> [...]
>

--
Pavel Begunkov

2021-05-21 20:05:05

by Song Liu

[permalink] [raw]
Subject: Re: [PATCH 12/23] bpf: add IOURING program type



> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>
> Draft a new program type BPF_PROG_TYPE_IOURING, which will be used by
> io_uring to execute BPF-based requests.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 21 +++++++++++++++++++++
> include/linux/bpf_types.h | 2 ++
> include/uapi/linux/bpf.h | 1 +
> kernel/bpf/syscall.c | 1 +
> kernel/bpf/verifier.c | 5 ++++-
> 5 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 1a4c9e513ac9..882b16b5e5eb 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -10201,6 +10201,27 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
> return ret;
> }
>
> +static const struct bpf_func_proto *
> +io_bpf_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> + return bpf_base_func_proto(func_id);
> +}
> +
> +static bool io_bpf_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + return false;
> +}
> +
> +const struct bpf_prog_ops bpf_io_uring_prog_ops = {};
> +
> +const struct bpf_verifier_ops bpf_io_uring_verifier_ops = {
> + .get_func_proto = io_bpf_func_proto,
> + .is_valid_access = io_bpf_is_valid_access,
> +};
> +
> SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
> void __user *, arg, unsigned int, nr_args)
> {
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 99f7fd657d87..d0b7954887bd 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -77,6 +77,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
> void *, void *)
> #endif /* CONFIG_BPF_LSM */
> #endif
> +BPF_PROG_TYPE(BPF_PROG_TYPE_IOURING, bpf_io_uring,
> + void *, void *)
>
> BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 4ba4ef0ff63a..de544f0fbeef 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -206,6 +206,7 @@ enum bpf_prog_type {
> BPF_PROG_TYPE_EXT,
> BPF_PROG_TYPE_LSM,
> BPF_PROG_TYPE_SK_LOOKUP,
> + BPF_PROG_TYPE_IOURING,
> };
>
> enum bpf_attach_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 250503482cda..6ef7a26f4dc3 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2041,6 +2041,7 @@ static bool is_net_admin_prog_type(enum bpf_prog_type prog_type)
> case BPF_PROG_TYPE_CGROUP_SOCKOPT:
> case BPF_PROG_TYPE_CGROUP_SYSCTL:
> case BPF_PROG_TYPE_SOCK_OPS:
> + case BPF_PROG_TYPE_IOURING:
> case BPF_PROG_TYPE_EXT: /* extends any prog */
> return true;
> case BPF_PROG_TYPE_CGROUP_SKB:
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0399ac092b36..2a53f44618a7 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -8558,6 +8558,9 @@ static int check_return_code(struct bpf_verifier_env *env)
> case BPF_PROG_TYPE_SK_LOOKUP:
> range = tnum_range(SK_DROP, SK_PASS);
> break;
> + case BPF_PROG_TYPE_IOURING:
> + range = tnum_const(0);
> + break;
> case BPF_PROG_TYPE_EXT:
> /* freplace program can return anything as its return value
> * depends on the to-be-replaced kernel func or bpf program.
> @@ -12560,7 +12563,7 @@ static int check_attach_btf_id(struct bpf_verifier_env *env)
> u64 key;
>
> if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING &&
> - prog->type != BPF_PROG_TYPE_LSM) {
> + prog->type != BPF_PROG_TYPE_LSM && prog->type != BPF_PROG_TYPE_IOURING) {

Is IOURING program sleepable? If so, please highlight that in the commit log
and update the warning below.

> verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n");
> return -EINVAL;
> }
> --
> 2.31.1
>

2021-05-21 20:05:11

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC v2 00/23] io_uring BPF requests

On 5/21/21 1:35 AM, Song Liu wrote:
>> On May 19, 2021, at 7:13 AM, Pavel Begunkov <[email protected]> wrote:
>> The main problem solved is feeding completion information of other
>> requests in a form of CQEs back into BPF. I decided to wire up support
>> for multiple completion queues (aka CQs) and give BPF programs access to
>> them, so leaving userspace in control over synchronisation that should
>> be much more flexible that the link-based approach.
>>
>> For instance, there can be a separate CQ for each BPF program, so no
>> extra sync is needed, and communication can be done by submitting a
>> request targeting a neighboring CQ or submitting a CQE there directly
>> (see test3 below). CQ is choosen by sqe->cq_idx, so everyone can
>> cross-fire if willing.
>>
>
> [...]
>
>> bpf: add IOURING program type
>> io_uring: implement bpf prog registration
>> io_uring: add support for bpf requests
>> io_uring: enable BPF to submit SQEs
>> io_uring: enable bpf to submit CQEs
>> io_uring: enable bpf to reap CQEs
>> libbpf: support io_uring
>> io_uring: pass user_data to bpf executor
>> bpf: Add bpf_copy_to_user() helper
>> io_uring: wire bpf copy to user
>> io_uring: don't wait on CQ exclusively
>> io_uring: enable bpf reqs to wait for CQs
>
> Besides the a few comments, these BPF related patches look sane to me.
> Please consider add some selftests (tools/testing/selftests/bpf).

The comments are noted. Thanks Song

--
Pavel Begunkov

2021-05-21 20:10:49

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 15/23] io_uring: enable BPF to submit SQEs

On 5/21/21 2:07 AM, Alexei Starovoitov wrote:
> On Wed, May 19, 2021 at 03:13:26PM +0100, Pavel Begunkov wrote:
>>
>> +BPF_CALL_3(io_bpf_queue_sqe, struct io_bpf_ctx *, bpf_ctx,
>> + const struct io_uring_sqe *, sqe,
>> + u32, sqe_len)
>> +{
>> + struct io_ring_ctx *ctx = bpf_ctx->ctx;
>> + struct io_kiocb *req;
>> +
>> + if (sqe_len != sizeof(struct io_uring_sqe))
>> + return -EINVAL;
>> +
>> + req = io_alloc_req(ctx);
>
> that is GFP_KERNEL allocation.
> It's only allowed from sleepable bpf progs and further down
> there is a correct check for it, so all good.
> But submitting sqe is a fundemntal io_uring operation,
> so what is the use case for non-sleepable?
> In other words why bother? Allow sleepable only and simplify the code?

Actual submission may be moved out of BPF, so I'm enabling it for both,
but the question I wonder about is what the plans for sleepable
programs are. E.g. if it's a marginal feature much limited in
functionality (e.g. iirc it's not allowed to use some BPF data
types), it may not be worth doing.

>
>> + if (unlikely(!req))
>> + return -ENOMEM;
>> + if (!percpu_ref_tryget_many(&ctx->refs, 1)) {
>> + kmem_cache_free(req_cachep, req);
>> + return -EAGAIN;
>> + }
>> + percpu_counter_add(&current->io_uring->inflight, 1);
>> + refcount_add(1, &current->usage);
>> +
>> + /* returns number of submitted SQEs or an error */
>> + return !io_submit_sqe(ctx, req, sqe);
>
> A buggy bpf prog will be able to pass junk sizeof(struct io_uring_sqe)
> as 'sqe' here.
> What kind of validation io_submit_sqe() does to avoid crashing the kernel?

It works on memory that is rw-shared with userspace, so it already assumes
the worst.

> General comments that apply to all patches:
> - commit logs are way too terse. Pls expand with details.
> - describe new bpf helpers in comments in bpf.h. Just adding them to an enum is not enough.
> - selftest/bpf are mandatory for all new bpf features.
> - consider bpf_link style of attaching bpf progs. We had enough issues with progs
> that get stuck due to application bugs. Auto-detach saves the day more often than not.

Thanks for taking a look! I have no idea what bpf_link is, need
to check it out

--
Pavel Begunkov
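
For reference, the bpf_link style Alexei mentions is the libbpf attach
model where a program stays attached only while its link (fd) is alive, so
the kernel auto-detaches it when the last reference goes away. A generic
sketch, since no io_uring attach type exists in this series:

#include <bpf/libbpf.h>

static int attach_with_link(struct bpf_program *prog)
{
	struct bpf_link *link = bpf_program__attach(prog);

	if (libbpf_get_error(link))
		return -1;
	/* the program stays attached while the link is alive; closing the
	 * last link fd (or process exit) detaches it automatically */
	bpf_link__destroy(link);
	return 0;
}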