2023-03-30 14:55:07

by Pavel Begunkov

Subject: [RFC 00/11] optimise registered buffer/file updates

Updating registered files and buffers is currently a very slow
operation, which makes it infeasible for workloads with moderate update
frequencies. Rework the underlying rsrc infra for greater performance
and a smaller memory footprint.

The improvement is ~11x for a benchmark updating files in a loop
(1040K -> 11468K updates / sec).

The set requires a couple of patches from the 6.3 branch, so it's sent
as an RFC and will be resent after the merge.

https://github.com/isilence/linux.git optimise-rsrc-update

Pavel Begunkov (11):
io_uring/rsrc: use non-pcpu refcounts for nodes
io_uring/rsrc: keep cached refs per node
io_uring: don't put nodes under spinlocks
io_uring: io_free_req() via tw
io_uring/rsrc: protect node refs with uring_lock
io_uring/rsrc: kill rsrc_ref_lock
io_uring/rsrc: rename rsrc_list
io_uring/rsrc: optimise io_rsrc_put allocation
io_uring/rsrc: don't offload node free
io_uring/rsrc: cache struct io_rsrc_node
io_uring/rsrc: add lockdep sanity checks

include/linux/io_uring_types.h | 7 +-
io_uring/io_uring.c | 47 ++++++----
io_uring/rsrc.c | 152 +++++++++++----------------------
io_uring/rsrc.h | 50 ++++++-----
4 files changed, 105 insertions(+), 151 deletions(-)

--
2.39.1


2023-03-30 14:55:29

by Pavel Begunkov

Subject: [PATCH 02/11] io_uring/rsrc: keep cached refs per node

We cache refs of the current node (i.e. ctx->rsrc_node) in
ctx->rsrc_cached_refs. We'll be moving away from atomics, so move the
cached refs into struct io_rsrc_node for now. It's a prep patch and
shouldn't change anything in practice.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 1 -
io_uring/rsrc.c | 15 +++++++++------
io_uring/rsrc.h | 16 +++++++++-------
3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 561fa421c453..a0a5b5964d3a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -240,7 +240,6 @@ struct io_ring_ctx {
* uring_lock, and updated through io_uring_register(2)
*/
struct io_rsrc_node *rsrc_node;
- int rsrc_cached_refs;
atomic_t cancel_seq;
struct io_file_table file_table;
unsigned nr_user_files;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f2da9e251e3f..1e7c960737fd 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -36,9 +36,11 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
- if (ctx->rsrc_cached_refs) {
- io_rsrc_put_node(ctx->rsrc_node, ctx->rsrc_cached_refs);
- ctx->rsrc_cached_refs = 0;
+ struct io_rsrc_node *node = ctx->rsrc_node;
+
+ if (node && node->cached_refs) {
+ io_rsrc_put_node(node, node->cached_refs);
+ node->cached_refs = 0;
}
}

@@ -151,11 +153,11 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
*slot = NULL;
}

-void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
+void io_rsrc_refs_refill(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
__must_hold(&ctx->uring_lock)
{
- ctx->rsrc_cached_refs += IO_RSRC_REF_BATCH;
- refcount_add(IO_RSRC_REF_BATCH, &ctx->rsrc_node->refs);
+ node->cached_refs += IO_RSRC_REF_BATCH;
+ refcount_add(IO_RSRC_REF_BATCH, &node->refs);
}

static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
@@ -300,6 +302,7 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
if (!ctx->rsrc_node) {
ctx->rsrc_node = ctx->rsrc_backup_node;
ctx->rsrc_backup_node = NULL;
+ ctx->rsrc_node->cached_refs = 0;
}
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 1467b31843bc..950535e2b9f4 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -43,6 +43,7 @@ struct io_rsrc_node {
struct io_rsrc_data *rsrc_data;
struct llist_node llist;
bool done;
+ int cached_refs;
};

struct io_mapped_ubuf {
@@ -56,7 +57,7 @@ struct io_mapped_ubuf {
void io_rsrc_put_tw(struct callback_head *cb);
void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
void io_rsrc_put_work(struct work_struct *work);
-void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
+void io_rsrc_refs_refill(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
void io_wait_rsrc_data(struct io_rsrc_data *data);
void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
void io_rsrc_refs_drop(struct io_ring_ctx *ctx);
@@ -128,17 +129,18 @@ static inline void io_req_put_rsrc_locked(struct io_kiocb *req,

if (node) {
if (node == ctx->rsrc_node)
- ctx->rsrc_cached_refs++;
+ node->cached_refs++;
else
io_rsrc_put_node(node, 1);
}
}

-static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx)
+static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx,
+ struct io_rsrc_node *node)
{
- ctx->rsrc_cached_refs--;
- if (unlikely(ctx->rsrc_cached_refs < 0))
- io_rsrc_refs_refill(ctx);
+ node->cached_refs--;
+ if (unlikely(node->cached_refs < 0))
+ io_rsrc_refs_refill(ctx, node);
}

static inline void io_req_set_rsrc_node(struct io_kiocb *req,
@@ -151,7 +153,7 @@ static inline void io_req_set_rsrc_node(struct io_kiocb *req,
lockdep_assert_held(&ctx->uring_lock);

req->rsrc_node = ctx->rsrc_node;
- io_charge_rsrc_node(ctx);
+ io_charge_rsrc_node(ctx, ctx->rsrc_node);
io_ring_submit_unlock(ctx, issue_flags);
}
}
--
2.39.1

2023-03-30 14:55:29

by Pavel Begunkov

Subject: [PATCH 01/11] io_uring/rsrc: use non-pcpu refcounts for nodes

One problem with the current rsrc infra is that updates often generate
lots of rsrc nodes, each carrying pcpu refs. That takes quite a lot of
memory, especially if there is a stall, and burns lots of CPU cycles.
Pcpu allocations alone take >50% of CPU time with a naive benchmark
updating files in a loop.

Replace pcpu refs with normal refcounting. There is already a hot path
avoiding atomics / refs, and the following patches will improve it further.
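The hot path mentioned above works by batching: the lock holder keeps a
private, non-atomic stash of refs and only touches the shared atomic
counter once per batch. A minimal sketch of that pattern, with
illustrative names (struct node, charge_ref, put_refs are not the
kernel's identifiers, and REF_BATCH only mirrors IO_RSRC_REF_BATCH):

```c
#include <assert.h>
#include <stdatomic.h>

#define REF_BATCH 100  /* stands in for IO_RSRC_REF_BATCH */

/* Toy model of the batched refcounting hot path. */
struct node {
	atomic_int refs;   /* shared refcount, needs atomics */
	int cached_refs;   /* private to the lock holder, plain int */
};

/* Hot path: take one ref, touching the atomic only once per REF_BATCH. */
static void charge_ref(struct node *n)
{
	n->cached_refs--;
	if (n->cached_refs < 0) {
		n->cached_refs += REF_BATCH;
		atomic_fetch_add(&n->refs, REF_BATCH);
	}
}

/* Drop nr refs at once; returns 1 when the count hits zero. */
static int put_refs(struct node *n, int nr)
{
	return atomic_fetch_sub(&n->refs, nr) == nr;
}
```

Charging 150 requests this way costs only two atomic operations instead
of 150, which is why dropping pcpu refs for plain refcounting doesn't
hurt the common case.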

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/rsrc.c | 15 +++++----------
io_uring/rsrc.h | 6 ++++--
2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 7a43aed8e395..f2da9e251e3f 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -155,7 +155,7 @@ void io_rsrc_refs_refill(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
ctx->rsrc_cached_refs += IO_RSRC_REF_BATCH;
- percpu_ref_get_many(&ctx->rsrc_node->refs, IO_RSRC_REF_BATCH);
+ refcount_add(IO_RSRC_REF_BATCH, &ctx->rsrc_node->refs);
}

static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
@@ -220,13 +220,11 @@ void io_wait_rsrc_data(struct io_rsrc_data *data)

void io_rsrc_node_destroy(struct io_rsrc_node *ref_node)
{
- percpu_ref_exit(&ref_node->refs);
kfree(ref_node);
}

-static __cold void io_rsrc_node_ref_zero(struct percpu_ref *ref)
+__cold void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
{
- struct io_rsrc_node *node = container_of(ref, struct io_rsrc_node, refs);
struct io_ring_ctx *ctx = node->rsrc_data->ctx;
unsigned long flags;
bool first_add = false;
@@ -269,11 +267,7 @@ static struct io_rsrc_node *io_rsrc_node_alloc(void)
if (!ref_node)
return NULL;

- if (percpu_ref_init(&ref_node->refs, io_rsrc_node_ref_zero,
- 0, GFP_KERNEL)) {
- kfree(ref_node);
- return NULL;
- }
+ refcount_set(&ref_node->refs, 1);
INIT_LIST_HEAD(&ref_node->node);
INIT_LIST_HEAD(&ref_node->rsrc_list);
ref_node->done = false;
@@ -298,7 +292,8 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
spin_unlock_irq(&ctx->rsrc_ref_lock);

atomic_inc(&data_to_kill->refs);
- percpu_ref_kill(&rsrc_node->refs);
+ /* put master ref */
+ io_rsrc_put_node(rsrc_node, 1);
ctx->rsrc_node = NULL;
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f27f4975217d..1467b31843bc 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -37,7 +37,7 @@ struct io_rsrc_data {
};

struct io_rsrc_node {
- struct percpu_ref refs;
+ refcount_t refs;
struct list_head node;
struct list_head rsrc_list;
struct io_rsrc_data *rsrc_data;
@@ -54,6 +54,7 @@ struct io_mapped_ubuf {
};

void io_rsrc_put_tw(struct callback_head *cb);
+void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
void io_rsrc_put_work(struct work_struct *work);
void io_rsrc_refs_refill(struct io_ring_ctx *ctx);
void io_wait_rsrc_data(struct io_rsrc_data *data);
@@ -109,7 +110,8 @@ int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,

static inline void io_rsrc_put_node(struct io_rsrc_node *node, int nr)
{
- percpu_ref_put_many(&node->refs, nr);
+ if (refcount_sub_and_test(nr, &node->refs))
+ io_rsrc_node_ref_zero(node);
}

static inline void io_req_put_rsrc(struct io_kiocb *req)
--
2.39.1

2023-03-30 14:55:40

by Pavel Begunkov

Subject: [PATCH 04/11] io_uring: io_free_req() via tw

io_free_req() is not often used but is nevertheless problematic, as
there is no way to know the calling context: it may be used from the
submission path or even from an irq handler. Push it to a fresh context
using task_work.
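The idea can be sketched in userspace: instead of tearing the request
down in whatever context the caller happens to be in, queue a callback
that the owning task runs later, where the locking rules are known. All
names below are illustrative stand-ins, not the kernel's task_work API:

```c
#include <assert.h>
#include <stddef.h>

struct req;
typedef void (*tw_func_t)(struct req *);

struct req {
	tw_func_t tw_func;
	struct req *tw_next;
	int freed;              /* stands in for the real teardown */
};

struct task {
	struct req *tw_head;    /* pending task_work list */
};

static void free_req_tw(struct req *req)
{
	req->freed = 1;
}

/* May be called from any context: only queues, never frees. */
static void free_req(struct task *tsk, struct req *req)
{
	req->tw_func = free_req_tw;
	req->tw_next = tsk->tw_head;
	tsk->tw_head = req;
}

/* Runs in the task's own context, where it is safe to free. */
static void run_task_work(struct task *tsk)
{
	while (tsk->tw_head) {
		struct req *req = tsk->tw_head;

		tsk->tw_head = req->tw_next;
		req->tw_func(req);
	}
}
```

The request is untouched until run_task_work() fires, mirroring how the
patch defers the actual free to io_free_req_tw().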

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c944833099a6..52a88da65f57 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1116,7 +1116,7 @@ static inline void io_dismantle_req(struct io_kiocb *req)
io_put_file(req->file);
}

-__cold void io_free_req(struct io_kiocb *req)
+__cold void io_free_req_tw(struct io_kiocb *req, struct io_tw_state *ts)
{
struct io_ring_ctx *ctx = req->ctx;

@@ -1130,6 +1130,12 @@ __cold void io_free_req(struct io_kiocb *req)
spin_unlock(&ctx->completion_lock);
}

+__cold void io_free_req(struct io_kiocb *req)
+{
+ req->io_task_work.func = io_free_req_tw;
+ io_req_task_work_add(req);
+}
+
static void __io_req_find_next_prep(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
--
2.39.1

2023-03-30 14:55:46

by Pavel Begunkov

Subject: [PATCH 07/11] io_uring/rsrc: rename rsrc_list

We have too many "rsrc" names around, which makes the name of struct
io_rsrc_node::rsrc_list confusing. The field keeps a list of files or
buffers to put, so rename it to item_list and add a comment explaining
it.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/rsrc.c | 6 +++---
io_uring/rsrc.h | 8 +++++++-
2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index e122b6e5f9c5..10006fb169d2 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -146,7 +146,7 @@ static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
struct io_ring_ctx *ctx = rsrc_data->ctx;
struct io_rsrc_put *prsrc, *tmp;

- list_for_each_entry_safe(prsrc, tmp, &ref_node->rsrc_list, list) {
+ list_for_each_entry_safe(prsrc, tmp, &ref_node->item_list, list) {
list_del(&prsrc->list);

if (prsrc->tag) {
@@ -249,7 +249,7 @@ static struct io_rsrc_node *io_rsrc_node_alloc(void)

ref_node->refs = 1;
INIT_LIST_HEAD(&ref_node->node);
- INIT_LIST_HEAD(&ref_node->rsrc_list);
+ INIT_LIST_HEAD(&ref_node->item_list);
ref_node->done = false;
return ref_node;
}
@@ -737,7 +737,7 @@ int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
prsrc->tag = *tag_slot;
*tag_slot = 0;
prsrc->rsrc = rsrc;
- list_add(&prsrc->list, &node->rsrc_list);
+ list_add(&prsrc->list, &node->item_list);
return 0;
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index fd3bdd30a993..c68846de031f 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -39,10 +39,16 @@ struct io_rsrc_data {
struct io_rsrc_node {
int refs;
struct list_head node;
- struct list_head rsrc_list;
struct io_rsrc_data *rsrc_data;
struct llist_node llist;
bool done;
+
+ /*
+ * Keeps a list of struct io_rsrc_put to be completed. Each entry
+ * represents one rsrc (e.g. file or buffer), but all of them should've
+ * come from the same table and so are of the same type.
+ */
+ struct list_head item_list;
};

struct io_mapped_ubuf {
--
2.39.1

2023-03-30 14:55:52

by Pavel Begunkov

Subject: [PATCH 05/11] io_uring/rsrc: protect node refs with uring_lock

Currently, for nodes we have an atomic counter and some cached
(non-atomic) refs protected by uring_lock. Let's put all ref
manipulations under uring_lock and get rid of the atomic part. This
comes for free, as in all the cases we care about we already hold the
lock.
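Once every ref manipulation happens with the same mutex held, the
counter can be a plain int and the atomics disappear entirely. A small
sketch of that invariant, with illustrative names (the pthread mutex
stands in for ctx->uring_lock):

```c
#include <assert.h>
#include <pthread.h>

struct node {
	int refs;      /* protected by ctx->lock, no atomics needed */
	int freed;     /* stands in for actually destroying the node */
};

struct ctx {
	pthread_mutex_t lock;  /* stands in for ctx->uring_lock */
	struct node *node;
};

static void node_free(struct node *n)
{
	n->freed = 1;
}

/* Caller must hold c->lock. */
static void charge_node(struct ctx *c)
{
	c->node->refs++;
}

/* Caller must hold c->lock; frees the node on the last put. */
static void put_node(struct ctx *c, struct node *n)
{
	(void)c;
	if (n && !--n->refs)
		node_free(n);
}
```

The correctness argument is exactly the one in the message: every path
that touches refs already holds the lock, so serialising on it adds no
new cost.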

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 18 ++++++++++++------
io_uring/rsrc.c | 30 ++++--------------------------
io_uring/rsrc.h | 29 +++++------------------------
3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 52a88da65f57..e55c48d91209 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -967,7 +967,7 @@ bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32
return true;
}

-static void __io_req_complete_post(struct io_kiocb *req)
+static void __io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_rsrc_node *rsrc_node = NULL;
@@ -1003,7 +1003,11 @@ static void __io_req_complete_post(struct io_kiocb *req)
}
io_cq_unlock_post(ctx);

- io_put_rsrc_node(rsrc_node);
+ if (rsrc_node) {
+ io_ring_submit_lock(ctx, issue_flags);
+ io_put_rsrc_node(rsrc_node);
+ io_ring_submit_unlock(ctx, issue_flags);
+ }
}

void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
@@ -1013,12 +1017,12 @@ void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
io_req_task_work_add(req);
} else if (!(issue_flags & IO_URING_F_UNLOCKED) ||
!(req->ctx->flags & IORING_SETUP_IOPOLL)) {
- __io_req_complete_post(req);
+ __io_req_complete_post(req, issue_flags);
} else {
struct io_ring_ctx *ctx = req->ctx;

mutex_lock(&ctx->uring_lock);
- __io_req_complete_post(req);
+ __io_req_complete_post(req, issue_flags & ~IO_URING_F_UNLOCKED);
mutex_unlock(&ctx->uring_lock);
}
}
@@ -1120,7 +1124,10 @@ __cold void io_free_req_tw(struct io_kiocb *req, struct io_tw_state *ts)
{
struct io_ring_ctx *ctx = req->ctx;

- io_put_rsrc_node(req->rsrc_node);
+ if (req->rsrc_node) {
+ io_tw_lock(ctx, ts);
+ io_put_rsrc_node(req->rsrc_node);
+ }
io_dismantle_req(req);
io_put_task_remote(req->task, 1);

@@ -2791,7 +2798,6 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
- io_rsrc_refs_drop(ctx);
/* __io_rsrc_put_work() may need uring_lock to progress, wait w/o it */
io_wait_rsrc_data(ctx->buf_data);
io_wait_rsrc_data(ctx->file_data);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1e7c960737fd..1237fc77c250 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -27,23 +27,10 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
struct io_mapped_ubuf **pimu,
struct page **last_hpage);

-#define IO_RSRC_REF_BATCH 100
-
/* only define max */
#define IORING_MAX_FIXED_FILES (1U << 20)
#define IORING_MAX_REG_BUFFERS (1U << 14)

-void io_rsrc_refs_drop(struct io_ring_ctx *ctx)
- __must_hold(&ctx->uring_lock)
-{
- struct io_rsrc_node *node = ctx->rsrc_node;
-
- if (node && node->cached_refs) {
- io_rsrc_put_node(node, node->cached_refs);
- node->cached_refs = 0;
- }
-}
-
int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
{
unsigned long page_limit, cur_pages, new_pages;
@@ -153,13 +140,6 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
*slot = NULL;
}

-void io_rsrc_refs_refill(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
- __must_hold(&ctx->uring_lock)
-{
- node->cached_refs += IO_RSRC_REF_BATCH;
- refcount_add(IO_RSRC_REF_BATCH, &node->refs);
-}
-
static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
{
struct io_rsrc_data *rsrc_data = ref_node->rsrc_data;
@@ -225,7 +205,8 @@ void io_rsrc_node_destroy(struct io_rsrc_node *ref_node)
kfree(ref_node);
}

-__cold void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
+void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
+ __must_hold(&node->rsrc_data->ctx->uring_lock)
{
struct io_ring_ctx *ctx = node->rsrc_data->ctx;
unsigned long flags;
@@ -269,7 +250,7 @@ static struct io_rsrc_node *io_rsrc_node_alloc(void)
if (!ref_node)
return NULL;

- refcount_set(&ref_node->refs, 1);
+ ref_node->refs = 1;
INIT_LIST_HEAD(&ref_node->node);
INIT_LIST_HEAD(&ref_node->rsrc_list);
ref_node->done = false;
@@ -283,8 +264,6 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
WARN_ON_ONCE(!ctx->rsrc_backup_node);
WARN_ON_ONCE(data_to_kill && !ctx->rsrc_node);

- io_rsrc_refs_drop(ctx);
-
if (data_to_kill) {
struct io_rsrc_node *rsrc_node = ctx->rsrc_node;

@@ -295,14 +274,13 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,

atomic_inc(&data_to_kill->refs);
/* put master ref */
- io_rsrc_put_node(rsrc_node, 1);
+ io_put_rsrc_node(rsrc_node);
ctx->rsrc_node = NULL;
}

if (!ctx->rsrc_node) {
ctx->rsrc_node = ctx->rsrc_backup_node;
ctx->rsrc_backup_node = NULL;
- ctx->rsrc_node->cached_refs = 0;
}
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 8164777279ba..fd3bdd30a993 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -37,13 +37,12 @@ struct io_rsrc_data {
};

struct io_rsrc_node {
- refcount_t refs;
+ int refs;
struct list_head node;
struct list_head rsrc_list;
struct io_rsrc_data *rsrc_data;
struct llist_node llist;
bool done;
- int cached_refs;
};

struct io_mapped_ubuf {
@@ -57,10 +56,8 @@ struct io_mapped_ubuf {
void io_rsrc_put_tw(struct callback_head *cb);
void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
void io_rsrc_put_work(struct work_struct *work);
-void io_rsrc_refs_refill(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
void io_wait_rsrc_data(struct io_rsrc_data *data);
void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
-void io_rsrc_refs_drop(struct io_ring_ctx *ctx);
int io_rsrc_node_switch_start(struct io_ring_ctx *ctx);
int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
struct io_rsrc_node *node, void *rsrc);
@@ -109,38 +106,22 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type);

-static inline void io_rsrc_put_node(struct io_rsrc_node *node, int nr)
-{
- if (refcount_sub_and_test(nr, &node->refs))
- io_rsrc_node_ref_zero(node);
-}
-
static inline void io_put_rsrc_node(struct io_rsrc_node *node)
{
- if (node)
- io_rsrc_put_node(node, 1);
+ if (node && !--node->refs)
+ io_rsrc_node_ref_zero(node);
}

static inline void io_req_put_rsrc_locked(struct io_kiocb *req,
struct io_ring_ctx *ctx)
- __must_hold(&ctx->uring_lock)
{
- struct io_rsrc_node *node = req->rsrc_node;
-
- if (node) {
- if (node == ctx->rsrc_node)
- node->cached_refs++;
- else
- io_rsrc_put_node(node, 1);
- }
+ io_put_rsrc_node(req->rsrc_node);
}

static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx,
struct io_rsrc_node *node)
{
- node->cached_refs--;
- if (unlikely(node->cached_refs < 0))
- io_rsrc_refs_refill(ctx, node);
+ node->refs++;
}

static inline void io_req_set_rsrc_node(struct io_kiocb *req,
--
2.39.1

2023-03-30 14:55:56

by Pavel Begunkov

Subject: [PATCH 03/11] io_uring: don't put nodes under spinlocks

io_req_put_rsrc() doesn't need any locking, so move it out of
a spinlock section in __io_req_complete_post() and adjust helpers.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 7 +++++--
io_uring/rsrc.h | 6 +++---
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 536940675c67..c944833099a6 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -970,6 +970,7 @@ bool io_aux_cqe(struct io_ring_ctx *ctx, bool defer, u64 user_data, s32 res, u32
static void __io_req_complete_post(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
+ struct io_rsrc_node *rsrc_node = NULL;

io_cq_lock(ctx);
if (!(req->flags & REQ_F_CQE_SKIP))
@@ -990,7 +991,7 @@ static void __io_req_complete_post(struct io_kiocb *req)
}
io_put_kbuf_comp(req);
io_dismantle_req(req);
- io_req_put_rsrc(req);
+ rsrc_node = req->rsrc_node;
/*
* Selected buffer deallocation in io_clean_op() assumes that
* we don't hold ->completion_lock. Clean them here to avoid
@@ -1001,6 +1002,8 @@ static void __io_req_complete_post(struct io_kiocb *req)
ctx->locked_free_nr++;
}
io_cq_unlock_post(ctx);
+
+ io_put_rsrc_node(rsrc_node);
}

void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
@@ -1117,7 +1120,7 @@ __cold void io_free_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;

- io_req_put_rsrc(req);
+ io_put_rsrc_node(req->rsrc_node);
io_dismantle_req(req);
io_put_task_remote(req->task, 1);

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 950535e2b9f4..8164777279ba 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -115,10 +115,10 @@ static inline void io_rsrc_put_node(struct io_rsrc_node *node, int nr)
io_rsrc_node_ref_zero(node);
}

-static inline void io_req_put_rsrc(struct io_kiocb *req)
+static inline void io_put_rsrc_node(struct io_rsrc_node *node)
{
- if (req->rsrc_node)
- io_rsrc_put_node(req->rsrc_node, 1);
+ if (node)
+ io_rsrc_put_node(node, 1);
}

static inline void io_req_put_rsrc_locked(struct io_kiocb *req,
--
2.39.1

2023-03-30 14:56:39

by Pavel Begunkov

Subject: [PATCH 08/11] io_uring/rsrc: optimise io_rsrc_put allocation

Every io_rsrc_node keeps a list of items to put, and all entries are
kmalloc()'ed. However, it's quite common to queue up only one entry per
node, so let's add an inline entry there to avoid extra allocations.
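This is the classic "first entry inline" optimisation: the common
single-item case uses storage embedded in the node, and only further
items fall back to the allocator. A minimal sketch with illustrative
names (struct item / queue_item are not the kernel's identifiers):

```c
#include <assert.h>
#include <stdlib.h>

struct item {
	struct item *next;
	int payload;
};

struct node {
	struct item inline_item;  /* item #1 lives here, no malloc */
	int inline_items;
	struct item *overflow;    /* heap-allocated items #2, #3, ... */
};

static struct item *queue_item(struct node *n, int payload)
{
	struct item *it;

	if (!n->inline_items) {
		/* common case: reuse the embedded slot */
		it = &n->inline_item;
		n->inline_items = 1;
	} else {
		/* rare case: spill to the heap-backed list */
		it = malloc(sizeof(*it));
		if (!it)
			return NULL;
		it->next = n->overflow;
		n->overflow = it;
	}
	it->payload = payload;
	return it;
}
```

When the node is torn down, the consumer handles the inline entry first
and then walks the overflow list, matching the shape of
__io_rsrc_put_work() in the patch.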

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/rsrc.c | 51 ++++++++++++++++++++++++++++++++-----------------
io_uring/rsrc.h | 2 ++
2 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 10006fb169d2..95e71300bb35 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -140,26 +140,34 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_mapped_ubuf **slo
*slot = NULL;
}

+static void io_rsrc_put_work_one(struct io_rsrc_data *rsrc_data,
+ struct io_rsrc_put *prsrc)
+{
+ struct io_ring_ctx *ctx = rsrc_data->ctx;
+
+ if (prsrc->tag) {
+ if (ctx->flags & IORING_SETUP_IOPOLL) {
+ mutex_lock(&ctx->uring_lock);
+ io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
+ mutex_unlock(&ctx->uring_lock);
+ } else {
+ io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
+ }
+ }
+ rsrc_data->do_put(ctx, prsrc);
+}
+
static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
{
struct io_rsrc_data *rsrc_data = ref_node->rsrc_data;
- struct io_ring_ctx *ctx = rsrc_data->ctx;
struct io_rsrc_put *prsrc, *tmp;

+ if (ref_node->inline_items)
+ io_rsrc_put_work_one(rsrc_data, &ref_node->item);
+
list_for_each_entry_safe(prsrc, tmp, &ref_node->item_list, list) {
list_del(&prsrc->list);
-
- if (prsrc->tag) {
- if (ctx->flags & IORING_SETUP_IOPOLL) {
- mutex_lock(&ctx->uring_lock);
- io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
- mutex_unlock(&ctx->uring_lock);
- } else {
- io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
- }
- }
-
- rsrc_data->do_put(ctx, prsrc);
+ io_rsrc_put_work_one(rsrc_data, prsrc);
kfree(prsrc);
}

@@ -251,6 +259,7 @@ static struct io_rsrc_node *io_rsrc_node_alloc(void)
INIT_LIST_HEAD(&ref_node->node);
INIT_LIST_HEAD(&ref_node->item_list);
ref_node->done = false;
+ ref_node->inline_items = 0;
return ref_node;
}

@@ -729,15 +738,23 @@ int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
{
u64 *tag_slot = io_get_tag_slot(data, idx);
struct io_rsrc_put *prsrc;
+ bool inline_item = true;

- prsrc = kzalloc(sizeof(*prsrc), GFP_KERNEL);
- if (!prsrc)
- return -ENOMEM;
+ if (!node->inline_items) {
+ prsrc = &node->item;
+ node->inline_items++;
+ } else {
+ prsrc = kzalloc(sizeof(*prsrc), GFP_KERNEL);
+ if (!prsrc)
+ return -ENOMEM;
+ inline_item = false;
+ }

prsrc->tag = *tag_slot;
*tag_slot = 0;
prsrc->rsrc = rsrc;
- list_add(&prsrc->list, &node->item_list);
+ if (!inline_item)
+ list_add(&prsrc->list, &node->item_list);
return 0;
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index c68846de031f..17293ab90f64 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -49,6 +49,8 @@ struct io_rsrc_node {
* came from the same table and so are of the same type.
*/
struct list_head item_list;
+ struct io_rsrc_put item;
+ int inline_items;
};

struct io_mapped_ubuf {
--
2.39.1

2023-03-30 14:56:48

by Pavel Begunkov

Subject: [PATCH 09/11] io_uring/rsrc: don't offload node free

struct delayed_work rsrc_put_work was used to offload node freeing
because io_rsrc_node_ref_zero() used to be called by RCU in irq
context. Now that percpu refcounting is gone, we can free nodes on the
spot without pushing the work off to a worker.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 3 --
io_uring/io_uring.c | 6 ----
io_uring/rsrc.c | 59 +++-------------------------------
3 files changed, 4 insertions(+), 64 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9492889f00c0..47496059e13a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -330,9 +330,6 @@ struct io_ring_ctx {
struct io_rsrc_data *file_data;
struct io_rsrc_data *buf_data;

- struct delayed_work rsrc_put_work;
- struct callback_head rsrc_put_tw;
- struct llist_head rsrc_put_llist;
/* protected by ->uring_lock */
struct list_head rsrc_ref_list;

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e94780c0a024..8c3886a4ca96 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -326,9 +326,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_LIST_HEAD(&ctx->timeout_list);
INIT_LIST_HEAD(&ctx->ltimeout_list);
INIT_LIST_HEAD(&ctx->rsrc_ref_list);
- INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
- init_task_work(&ctx->rsrc_put_tw, io_rsrc_put_tw);
- init_llist_head(&ctx->rsrc_put_llist);
init_llist_head(&ctx->work_llist);
INIT_LIST_HEAD(&ctx->tctx_list);
ctx->submit_state.free_list.next = NULL;
@@ -2822,11 +2819,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
io_rsrc_node_destroy(ctx->rsrc_node);
if (ctx->rsrc_backup_node)
io_rsrc_node_destroy(ctx->rsrc_backup_node);
- flush_delayed_work(&ctx->rsrc_put_work);
- flush_delayed_work(&ctx->fallback_work);

WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list));
- WARN_ON_ONCE(!llist_empty(&ctx->rsrc_put_llist));

#if defined(CONFIG_UNIX)
if (ctx->ring_sock) {
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 95e71300bb35..0f4e245dee1b 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -145,15 +145,8 @@ static void io_rsrc_put_work_one(struct io_rsrc_data *rsrc_data,
{
struct io_ring_ctx *ctx = rsrc_data->ctx;

- if (prsrc->tag) {
- if (ctx->flags & IORING_SETUP_IOPOLL) {
- mutex_lock(&ctx->uring_lock);
- io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
- mutex_unlock(&ctx->uring_lock);
- } else {
- io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
- }
- }
+ if (prsrc->tag)
+ io_post_aux_cqe(ctx, prsrc->tag, 0, 0);
rsrc_data->do_put(ctx, prsrc);
}

@@ -176,32 +169,6 @@ static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
complete(&rsrc_data->done);
}

-void io_rsrc_put_work(struct work_struct *work)
-{
- struct io_ring_ctx *ctx;
- struct llist_node *node;
-
- ctx = container_of(work, struct io_ring_ctx, rsrc_put_work.work);
- node = llist_del_all(&ctx->rsrc_put_llist);
-
- while (node) {
- struct io_rsrc_node *ref_node;
- struct llist_node *next = node->next;
-
- ref_node = llist_entry(node, struct io_rsrc_node, llist);
- __io_rsrc_put_work(ref_node);
- node = next;
- }
-}
-
-void io_rsrc_put_tw(struct callback_head *cb)
-{
- struct io_ring_ctx *ctx = container_of(cb, struct io_ring_ctx,
- rsrc_put_tw);
-
- io_rsrc_put_work(&ctx->rsrc_put_work.work);
-}
-
void io_wait_rsrc_data(struct io_rsrc_data *data)
{
if (data && !atomic_dec_and_test(&data->refs))
@@ -217,34 +184,18 @@ void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
__must_hold(&node->rsrc_data->ctx->uring_lock)
{
struct io_ring_ctx *ctx = node->rsrc_data->ctx;
- bool first_add = false;
- unsigned long delay = HZ;

node->done = true;
-
- /* if we are mid-quiesce then do not delay */
- if (node->rsrc_data->quiesce)
- delay = 0;
-
while (!list_empty(&ctx->rsrc_ref_list)) {
node = list_first_entry(&ctx->rsrc_ref_list,
struct io_rsrc_node, node);
/* recycle ref nodes in order */
if (!node->done)
break;
- list_del(&node->node);
- first_add |= llist_add(&node->llist, &ctx->rsrc_put_llist);
- }

- if (!first_add)
- return;
-
- if (ctx->submitter_task) {
- if (!task_work_add(ctx->submitter_task, &ctx->rsrc_put_tw,
- ctx->notify_method))
- return;
+ list_del(&node->node);
+ __io_rsrc_put_work(node);
}
- mod_delayed_work(system_wq, &ctx->rsrc_put_work, delay);
}

static struct io_rsrc_node *io_rsrc_node_alloc(void)
@@ -320,13 +271,11 @@ __cold static int io_rsrc_ref_quiesce(struct io_rsrc_data *data,
if (ret < 0) {
atomic_inc(&data->refs);
/* wait for all works potentially completing data->done */
- flush_delayed_work(&ctx->rsrc_put_work);
reinit_completion(&data->done);
mutex_lock(&ctx->uring_lock);
break;
}

- flush_delayed_work(&ctx->rsrc_put_work);
ret = wait_for_completion_interruptible(&data->done);
if (!ret) {
mutex_lock(&ctx->uring_lock);
--
2.39.1

2023-03-30 14:57:10

by Pavel Begunkov

Subject: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

Add an allocation cache for struct io_rsrc_node. Nodes are always
allocated and put under ->uring_lock, so the cache doesn't need any
extra synchronisation.
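Such a cache is just a capped free list: a put pushes the object onto
the list instead of freeing it, and a get pops from the list before
falling back to the allocator. A sketch in the spirit of io_alloc_cache,
with illustrative names, assuming cached objects are at least
sizeof(struct cache_entry) and that all calls happen under one lock:

```c
#include <assert.h>
#include <stdlib.h>

struct cache_entry {
	struct cache_entry *next;
};

struct alloc_cache {
	struct cache_entry *head;
	unsigned int nr;
	unsigned int max;   /* cap so the cache can't grow unbounded */
};

static void *cache_get(struct alloc_cache *c, size_t obj_size)
{
	if (c->head) {
		struct cache_entry *e = c->head;

		c->head = e->next;
		c->nr--;
		return e;       /* recycled, no allocator round-trip */
	}
	return malloc(obj_size);
}

/* Returns 1 if cached; 0 means the caller should free the object. */
static int cache_put(struct alloc_cache *c, void *obj)
{
	struct cache_entry *e = obj;

	if (c->nr >= c->max)
		return 0;
	e->next = c->head;
	c->head = e;
	c->nr++;
	return 1;
}
```

Because get and put are serialised by the same lock (the kernel's
->uring_lock), the list needs no atomics, which is the point the commit
message makes.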

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 1 +
io_uring/io_uring.c | 11 +++++++++--
io_uring/rsrc.c | 23 +++++++++++++++--------
io_uring/rsrc.h | 5 ++++-
4 files changed, 29 insertions(+), 11 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 47496059e13a..5d772e36e7fc 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -332,6 +332,7 @@ struct io_ring_ctx {

/* protected by ->uring_lock */
struct list_head rsrc_ref_list;
+ struct io_alloc_cache rsrc_node_cache;

struct list_head io_buffers_pages;

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 8c3886a4ca96..beedaf403284 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -310,6 +310,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_LIST_HEAD(&ctx->sqd_list);
INIT_LIST_HEAD(&ctx->cq_overflow_list);
INIT_LIST_HEAD(&ctx->io_buffers_cache);
+ io_alloc_cache_init(&ctx->rsrc_node_cache, sizeof(struct io_rsrc_node));
io_alloc_cache_init(&ctx->apoll_cache, sizeof(struct async_poll));
io_alloc_cache_init(&ctx->netmsg_cache, sizeof(struct io_async_msghdr));
init_completion(&ctx->ref_comp);
@@ -2791,6 +2792,11 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
mutex_unlock(&ctx->uring_lock);
}

+void io_rsrc_node_cache_free(struct io_cache_entry *entry)
+{
+ kfree(container_of(entry, struct io_rsrc_node, cache));
+}
+
static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
{
io_sq_thread_finish(ctx);
@@ -2816,9 +2822,9 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)

/* there are no registered resources left, nobody uses it */
if (ctx->rsrc_node)
- io_rsrc_node_destroy(ctx->rsrc_node);
+ io_rsrc_node_destroy(ctx, ctx->rsrc_node);
if (ctx->rsrc_backup_node)
- io_rsrc_node_destroy(ctx->rsrc_backup_node);
+ io_rsrc_node_destroy(ctx, ctx->rsrc_backup_node);

WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list));

@@ -2830,6 +2836,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
#endif
WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));

+ io_alloc_cache_free(&ctx->rsrc_node_cache, io_rsrc_node_cache_free);
if (ctx->mm_account) {
mmdrop(ctx->mm_account);
ctx->mm_account = NULL;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 0f4e245dee1b..345631091d80 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -164,7 +164,7 @@ static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
kfree(prsrc);
}

- io_rsrc_node_destroy(ref_node);
+ io_rsrc_node_destroy(rsrc_data->ctx, ref_node);
if (atomic_dec_and_test(&rsrc_data->refs))
complete(&rsrc_data->done);
}
@@ -175,9 +175,10 @@ void io_wait_rsrc_data(struct io_rsrc_data *data)
wait_for_completion(&data->done);
}

-void io_rsrc_node_destroy(struct io_rsrc_node *ref_node)
+void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
{
- kfree(ref_node);
+ if (!io_alloc_cache_put(&ctx->rsrc_node_cache, &node->cache))
+ kfree(node);
}

void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
@@ -198,13 +199,19 @@ void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
}
}

-static struct io_rsrc_node *io_rsrc_node_alloc(void)
+static struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx)
{
struct io_rsrc_node *ref_node;
+ struct io_cache_entry *entry;

- ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
- if (!ref_node)
- return NULL;
+ entry = io_alloc_cache_get(&ctx->rsrc_node_cache);
+ if (entry) {
+ ref_node = container_of(entry, struct io_rsrc_node, cache);
+ } else {
+ ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
+ if (!ref_node)
+ return NULL;
+ }

ref_node->refs = 1;
INIT_LIST_HEAD(&ref_node->node);
@@ -243,7 +250,7 @@ int io_rsrc_node_switch_start(struct io_ring_ctx *ctx)
{
if (ctx->rsrc_backup_node)
return 0;
- ctx->rsrc_backup_node = io_rsrc_node_alloc();
+ ctx->rsrc_backup_node = io_rsrc_node_alloc(ctx);
return ctx->rsrc_backup_node ? 0 : -ENOMEM;
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 17293ab90f64..d1555eaae81a 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -4,6 +4,8 @@

#include <net/af_unix.h>

+#include "alloc_cache.h"
+
#define IO_RSRC_TAG_TABLE_SHIFT (PAGE_SHIFT - 3)
#define IO_RSRC_TAG_TABLE_MAX (1U << IO_RSRC_TAG_TABLE_SHIFT)
#define IO_RSRC_TAG_TABLE_MASK (IO_RSRC_TAG_TABLE_MAX - 1)
@@ -37,6 +39,7 @@ struct io_rsrc_data {
};

struct io_rsrc_node {
+ struct io_cache_entry cache;
int refs;
struct list_head node;
struct io_rsrc_data *rsrc_data;
@@ -65,7 +68,7 @@ void io_rsrc_put_tw(struct callback_head *cb);
void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
void io_rsrc_put_work(struct work_struct *work);
void io_wait_rsrc_data(struct io_rsrc_data *data);
-void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
+void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *ref_node);
int io_rsrc_node_switch_start(struct io_ring_ctx *ctx);
int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
struct io_rsrc_node *node, void *rsrc);
--
2.39.1

2023-03-30 14:57:32

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 11/11] io_uring/rsrc: add lockdep sanity checks

We should hold ->uring_lock while putting nodes with io_put_rsrc_node(),
so add a lockdep check for that.

Signed-off-by: Pavel Begunkov <[email protected]>
---
io_uring/io_uring.c | 4 ++--
io_uring/rsrc.c | 2 +-
io_uring/rsrc.h | 6 ++++--
3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index beedaf403284..a781b7243b97 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -1002,7 +1002,7 @@ static void __io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)

if (rsrc_node) {
io_ring_submit_lock(ctx, issue_flags);
- io_put_rsrc_node(rsrc_node);
+ io_put_rsrc_node(ctx, rsrc_node);
io_ring_submit_unlock(ctx, issue_flags);
}
}
@@ -1123,7 +1123,7 @@ __cold void io_free_req_tw(struct io_kiocb *req, struct io_tw_state *ts)

if (req->rsrc_node) {
io_tw_lock(ctx, ts);
- io_put_rsrc_node(req->rsrc_node);
+ io_put_rsrc_node(ctx, req->rsrc_node);
}
io_dismantle_req(req);
io_put_task_remote(req->task, 1);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 345631091d80..d4bca5e18434 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -236,7 +236,7 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,

atomic_inc(&data_to_kill->refs);
/* put master ref */
- io_put_rsrc_node(rsrc_node);
+ io_put_rsrc_node(ctx, rsrc_node);
ctx->rsrc_node = NULL;
}

diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index d1555eaae81a..99f2df4eafa1 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -117,8 +117,10 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
unsigned int size, unsigned int type);

-static inline void io_put_rsrc_node(struct io_rsrc_node *node)
+static inline void io_put_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
{
+ lockdep_assert_held(&ctx->uring_lock);
+
if (node && !--node->refs)
io_rsrc_node_ref_zero(node);
}
@@ -126,7 +128,7 @@ static inline void io_put_rsrc_node(struct io_rsrc_node *node)
static inline void io_req_put_rsrc_locked(struct io_kiocb *req,
struct io_ring_ctx *ctx)
{
- io_put_rsrc_node(req->rsrc_node);
+ io_put_rsrc_node(ctx, req->rsrc_node);
}

static inline void io_charge_rsrc_node(struct io_ring_ctx *ctx,
--
2.39.1

2023-03-30 14:59:00

by Pavel Begunkov

[permalink] [raw]
Subject: [PATCH 06/11] io_uring/rsrc: kill rsrc_ref_lock

We use the ->rsrc_ref_lock spinlock to protect ->rsrc_ref_list in
io_rsrc_node_ref_zero(). Now that we have removed pcpu refcounting,
io_rsrc_node_ref_zero() is no longer executed from irq context as an RCU
callback, and we also put it under ->uring_lock.
io_rsrc_node_switch(), which queues up nodes into the list, is also
protected by ->uring_lock, so we can safely get rid of ->rsrc_ref_lock.

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/io_uring_types.h | 2 +-
io_uring/io_uring.c | 1 -
io_uring/rsrc.c | 5 -----
3 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index a0a5b5964d3a..9492889f00c0 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -333,8 +333,8 @@ struct io_ring_ctx {
struct delayed_work rsrc_put_work;
struct callback_head rsrc_put_tw;
struct llist_head rsrc_put_llist;
+ /* protected by ->uring_lock */
struct list_head rsrc_ref_list;
- spinlock_t rsrc_ref_lock;

struct list_head io_buffers_pages;

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e55c48d91209..e94780c0a024 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -325,7 +325,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
INIT_LIST_HEAD(&ctx->defer_list);
INIT_LIST_HEAD(&ctx->timeout_list);
INIT_LIST_HEAD(&ctx->ltimeout_list);
- spin_lock_init(&ctx->rsrc_ref_lock);
INIT_LIST_HEAD(&ctx->rsrc_ref_list);
INIT_DELAYED_WORK(&ctx->rsrc_put_work, io_rsrc_put_work);
init_task_work(&ctx->rsrc_put_tw, io_rsrc_put_tw);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 1237fc77c250..e122b6e5f9c5 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -209,11 +209,9 @@ void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
__must_hold(&node->rsrc_data->ctx->uring_lock)
{
struct io_ring_ctx *ctx = node->rsrc_data->ctx;
- unsigned long flags;
bool first_add = false;
unsigned long delay = HZ;

- spin_lock_irqsave(&ctx->rsrc_ref_lock, flags);
node->done = true;

/* if we are mid-quiesce then do not delay */
@@ -229,7 +227,6 @@ void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
list_del(&node->node);
first_add |= llist_add(&node->llist, &ctx->rsrc_put_llist);
}
- spin_unlock_irqrestore(&ctx->rsrc_ref_lock, flags);

if (!first_add)
return;
@@ -268,9 +265,7 @@ void io_rsrc_node_switch(struct io_ring_ctx *ctx,
struct io_rsrc_node *rsrc_node = ctx->rsrc_node;

rsrc_node->rsrc_data = data_to_kill;
- spin_lock_irq(&ctx->rsrc_ref_lock);
list_add_tail(&rsrc_node->node, &ctx->rsrc_ref_list);
- spin_unlock_irq(&ctx->rsrc_ref_lock);

atomic_inc(&data_to_kill->refs);
/* put master ref */
--
2.39.1

2023-03-31 13:40:15

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [RFC 00/11] optimise registered buffer/file updates


Pavel,

Pavel Begunkov <[email protected]> writes:
> Updating registered files and buffers is a very slow operation, which
> makes it not feasible for workloads with medium update frequencies.
> Rework the underlying rsrc infra for greater performance and lesser
> memory footprint.
>
> The improvement is ~11x for a benchmark updating files in a loop
> (1040K -> 11468K updates / sec).

Nice. That's a really impressive improvement.

I've been adding io_uring test cases for automated performance
regression testing with mmtests (open source). I'd love to take a look
at this test case and adapt it to mmtests, so we can pick it up and run
it frequently.

Is it something you can share?

--
Gabriel Krisman Bertazi

2023-03-31 14:13:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

Pavel Begunkov <[email protected]> writes:

> Add allocation cache for struct io_rsrc_node, it's always allocated and
> put under ->uring_lock, so it doesn't need any extra synchronisation
> around caches.

Hi Pavel,

I'm curious if you considered using kmem_cache instead of the custom
cache for this case? I'm wondering if this provokes a visible difference
in performance in your benchmark.

> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> include/linux/io_uring_types.h | 1 +
> io_uring/io_uring.c | 11 +++++++++--
> io_uring/rsrc.c | 23 +++++++++++++++--------
> io_uring/rsrc.h | 5 ++++-
> 4 files changed, 29 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 47496059e13a..5d772e36e7fc 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -332,6 +332,7 @@ struct io_ring_ctx {
>
> /* protected by ->uring_lock */
> struct list_head rsrc_ref_list;
> + struct io_alloc_cache rsrc_node_cache;
>
> struct list_head io_buffers_pages;
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 8c3886a4ca96..beedaf403284 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -310,6 +310,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
> INIT_LIST_HEAD(&ctx->sqd_list);
> INIT_LIST_HEAD(&ctx->cq_overflow_list);
> INIT_LIST_HEAD(&ctx->io_buffers_cache);
> + io_alloc_cache_init(&ctx->rsrc_node_cache, sizeof(struct io_rsrc_node));
> io_alloc_cache_init(&ctx->apoll_cache, sizeof(struct async_poll));
> io_alloc_cache_init(&ctx->netmsg_cache, sizeof(struct io_async_msghdr));
> init_completion(&ctx->ref_comp);
> @@ -2791,6 +2792,11 @@ static void io_req_caches_free(struct io_ring_ctx *ctx)
> mutex_unlock(&ctx->uring_lock);
> }
>
> +void io_rsrc_node_cache_free(struct io_cache_entry *entry)
> +{
> + kfree(container_of(entry, struct io_rsrc_node, cache));
> +}
> +
> static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
> {
> io_sq_thread_finish(ctx);
> @@ -2816,9 +2822,9 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
>
> /* there are no registered resources left, nobody uses it */
> if (ctx->rsrc_node)
> - io_rsrc_node_destroy(ctx->rsrc_node);
> + io_rsrc_node_destroy(ctx, ctx->rsrc_node);
> if (ctx->rsrc_backup_node)
> - io_rsrc_node_destroy(ctx->rsrc_backup_node);
> + io_rsrc_node_destroy(ctx, ctx->rsrc_backup_node);
>
> WARN_ON_ONCE(!list_empty(&ctx->rsrc_ref_list));
>
> @@ -2830,6 +2836,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
> #endif
> WARN_ON_ONCE(!list_empty(&ctx->ltimeout_list));
>
> + io_alloc_cache_free(&ctx->rsrc_node_cache, io_rsrc_node_cache_free);
> if (ctx->mm_account) {
> mmdrop(ctx->mm_account);
> ctx->mm_account = NULL;
> diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
> index 0f4e245dee1b..345631091d80 100644
> --- a/io_uring/rsrc.c
> +++ b/io_uring/rsrc.c
> @@ -164,7 +164,7 @@ static void __io_rsrc_put_work(struct io_rsrc_node *ref_node)
> kfree(prsrc);
> }
>
> - io_rsrc_node_destroy(ref_node);
> + io_rsrc_node_destroy(rsrc_data->ctx, ref_node);
> if (atomic_dec_and_test(&rsrc_data->refs))
> complete(&rsrc_data->done);
> }
> @@ -175,9 +175,10 @@ void io_wait_rsrc_data(struct io_rsrc_data *data)
> wait_for_completion(&data->done);
> }
>
> -void io_rsrc_node_destroy(struct io_rsrc_node *ref_node)
> +void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
> {
> - kfree(ref_node);
> + if (!io_alloc_cache_put(&ctx->rsrc_node_cache, &node->cache))
> + kfree(node);
> }
>
> void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
> @@ -198,13 +199,19 @@ void io_rsrc_node_ref_zero(struct io_rsrc_node *node)
> }
> }
>
> -static struct io_rsrc_node *io_rsrc_node_alloc(void)
> +static struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx)
> {
> struct io_rsrc_node *ref_node;
> + struct io_cache_entry *entry;
>
> - ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
> - if (!ref_node)
> - return NULL;
> + entry = io_alloc_cache_get(&ctx->rsrc_node_cache);
> + if (entry) {
> + ref_node = container_of(entry, struct io_rsrc_node, cache);
> + } else {
> + ref_node = kzalloc(sizeof(*ref_node), GFP_KERNEL);
> + if (!ref_node)
> + return NULL;
> + }
>
> ref_node->refs = 1;
> INIT_LIST_HEAD(&ref_node->node);
> @@ -243,7 +250,7 @@ int io_rsrc_node_switch_start(struct io_ring_ctx *ctx)
> {
> if (ctx->rsrc_backup_node)
> return 0;
> - ctx->rsrc_backup_node = io_rsrc_node_alloc();
> + ctx->rsrc_backup_node = io_rsrc_node_alloc(ctx);
> return ctx->rsrc_backup_node ? 0 : -ENOMEM;
> }
>
> diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
> index 17293ab90f64..d1555eaae81a 100644
> --- a/io_uring/rsrc.h
> +++ b/io_uring/rsrc.h
> @@ -4,6 +4,8 @@
>
> #include <net/af_unix.h>
>
> +#include "alloc_cache.h"
> +
> #define IO_RSRC_TAG_TABLE_SHIFT (PAGE_SHIFT - 3)
> #define IO_RSRC_TAG_TABLE_MAX (1U << IO_RSRC_TAG_TABLE_SHIFT)
> #define IO_RSRC_TAG_TABLE_MASK (IO_RSRC_TAG_TABLE_MAX - 1)
> @@ -37,6 +39,7 @@ struct io_rsrc_data {
> };
>
> struct io_rsrc_node {
> + struct io_cache_entry cache;
> int refs;
> struct list_head node;
> struct io_rsrc_data *rsrc_data;
> @@ -65,7 +68,7 @@ void io_rsrc_put_tw(struct callback_head *cb);
> void io_rsrc_node_ref_zero(struct io_rsrc_node *node);
> void io_rsrc_put_work(struct work_struct *work);
> void io_wait_rsrc_data(struct io_rsrc_data *data);
> -void io_rsrc_node_destroy(struct io_rsrc_node *ref_node);
> +void io_rsrc_node_destroy(struct io_ring_ctx *ctx, struct io_rsrc_node *ref_node);
> int io_rsrc_node_switch_start(struct io_ring_ctx *ctx);
> int io_queue_rsrc_removal(struct io_rsrc_data *data, unsigned idx,
> struct io_rsrc_node *node, void *rsrc);

--
Gabriel Krisman Bertazi

2023-03-31 15:31:20

by Jens Axboe

[permalink] [raw]
Subject: Re: [RFC 00/11] optimise registered buffer/file updates

On 3/30/23 8:53 AM, Pavel Begunkov wrote:
> Updating registered files and buffers is a very slow operation, which
> makes it not feasible for workloads with medium update frequencies.
> Rework the underlying rsrc infra for greater performance and lesser
> memory footprint.
>
> The improvement is ~11x for a benchmark updating files in a loop
> (1040K -> 11468K updates / sec).
>
> The set requires a couple of patches from the 6.3 branch, for that
> reason it's an RFC and will be resent after merge.

Looks pretty sane to me; I didn't find anything immediately wrong. I
do wonder if we should have a conditional uring_lock helper, as we do
have a few of those. But that's not really related to this series, as
it just moves one around.

--
Jens Axboe


2023-03-31 16:34:19

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [RFC 00/11] optimise registered buffer/file updates

On 3/31/23 14:35, Gabriel Krisman Bertazi wrote:
> Pavel,
>
> Pavel Begunkov <[email protected]> writes:
>> Updating registered files and buffers is a very slow operation, which
>> makes it not feasible for workloads with medium update frequencies.
>> Rework the underlying rsrc infra for greater performance and lesser
>> memory footprint.
>>
>> The improvement is ~11x for a benchmark updating files in a loop
>> (1040K -> 11468K updates / sec).
>
> Nice. That's a really impressive improvement.
>
> I've been adding io_uring test cases for automated performance
> regression testing with mmtests (open source). I'd love to take a look
> at this test case and adapt it to mmtests, so we can pick it up and run
> it frequently.
>
> is it something you can share?

I'll post it later.

The test is quite stupid, and with the patches less than 10% of CPU
cycles go to the update machinery (against 90+% without); the rest is
spent on syscalling, submitting update requests, etc., so it almost hits
the limit.

Another test we can do is to measure the latency between the point we
asked for an rsrc to be removed and when it actually got destroyed/freed,
e.g. tags will help with that. It should have improved nicely as well,
since this removes the RCU grace period and other bouncing.

--
Pavel Begunkov

2023-03-31 16:42:16

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

On 3/31/23 15:09, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <[email protected]> writes:
>
>> Add allocation cache for struct io_rsrc_node, it's always allocated and
>> put under ->uring_lock, so it doesn't need any extra synchronisation
>> around caches.
>
> Hi Pavel,
>
> I'm curious if you considered using kmem_cache instead of the custom
> cache for this case? I'm wondering if this provokes visible difference in
> performance in your benchmark.

I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
much, definitely doesn't spare us from locking, and the overhead
definitely wasn't satisfactory for requests before.

A quick note: I want to further limit the cache size, as
IO_ALLOC_CACHE_MAX=512 for nodes doesn't sound great.
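
For readers following along, the shape being discussed is a bounded,
intrusive free list whose every manipulation happens under an external
lock (ctx->uring_lock in the patches). Here is a user-space sketch of
that idea; the names and the cap value below are illustrative only, not
the kernel's API:

```c
#include <assert.h>
#include <stddef.h>

/*
 * User-space sketch of the io_alloc_cache idea: a bounded, intrusive,
 * singly linked free list. In the kernel every manipulation happens
 * under ctx->uring_lock, so no atomics are needed; here the caller is
 * likewise assumed to provide exclusion.
 */
#define NODE_CACHE_MAX 512	/* cap discussed above; illustrative */

struct cache_entry {
	struct cache_entry *next;
};

struct alloc_cache {
	struct cache_entry *head;
	unsigned int nr;
};

/* Try to stash a freed object; 0 means the cache is full and the
 * caller should free the object instead. */
static int cache_put(struct alloc_cache *c, struct cache_entry *e)
{
	if (c->nr >= NODE_CACHE_MAX)
		return 0;
	e->next = c->head;
	c->head = e;
	c->nr++;
	return 1;
}

/* Pop a cached object, or NULL if empty (caller then allocates). */
static struct cache_entry *cache_get(struct alloc_cache *c)
{
	struct cache_entry *e = c->head;

	if (e) {
		c->head = e->next;
		c->nr--;
	}
	return e;
}
```

Allocation first tries the cache and only falls back to the allocator
on a miss, mirroring what io_rsrc_node_alloc() does with
io_alloc_cache_get() in the patch above.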

--
Pavel Begunkov

2023-04-01 00:21:34

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

Pavel Begunkov <[email protected]> writes:

> On 3/31/23 15:09, Gabriel Krisman Bertazi wrote:
>> Pavel Begunkov <[email protected]> writes:
>>
>>> Add allocation cache for struct io_rsrc_node, it's always allocated and
>>> put under ->uring_lock, so it doesn't need any extra synchronisation
>>> around caches.
>> Hi Pavel,
>> I'm curious if you considered using kmem_cache instead of the custom
>> cache for this case? I'm wondering if this provokes visible difference in
>> performance in your benchmark.
>
> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
> much, definitely doesn't spare from locking, and the overhead
> definitely wasn't satisfactory for requests before.

There are no locks in the fast path of slub, as far as I know. It has a
per-cpu cache that is refilled once empty, quite similar to the fast path
of this cache. I imagine the performance hit in slub comes from the
barrier and atomic operations?

kmem_cache works fine for most hot paths of the kernel. I think this
custom cache makes sense for the request cache, where objects are
allocated at an incredibly high rate. But is this level of update
frequency a valid use case here?

If it is indeed a significant performance improvement, I guess it is
fine to have another user of the cache. But I'd be curious to know how
much of the performance improvement you mentioned in the cover letter is
due to this patch!

--
Gabriel Krisman Bertazi

2023-04-04 13:33:25

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <[email protected]> writes:
>
>> On 3/31/23 15:09, Gabriel Krisman Bertazi wrote:
>>> Pavel Begunkov <[email protected]> writes:
>>>
>>>> Add allocation cache for struct io_rsrc_node, it's always allocated and
>>>> put under ->uring_lock, so it doesn't need any extra synchronisation
>>>> around caches.
>>> Hi Pavel,
>>> I'm curious if you considered using kmem_cache instead of the custom
>>> cache for this case? I'm wondering if this provokes visible difference in
>>> performance in your benchmark.
>>
>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>> much, definitely doesn't spare from locking, and the overhead
>> definitely wasn't satisfactory for requests before.
>
> There is no locks in the fast path of slub, as far as I know. it has a
> per-cpu cache that is refilled once empty, quite similar to the fastpath
> of this cache. I imagine the performance hit in slub comes from the
> barrier and atomic operations?

Yeah, I mean all kinds of synchronisation. And I don't think
that's the main offender here; the test is single-threaded without
contention, and the system was mostly idle.

> kmem_cache works fine for most hot paths of the kernel. I think this

It doesn't for io_uring. There are caches for the net side and now
in the block layer as well. I wouldn't say it necessarily halves
performance, but it definitely takes a share of CPU.

> custom cache makes sense for the request cache, where objects are
> allocated at an incredibly high rate. but is this level of update
> frequency a valid use case here?

I can think of some. For example, it was of interest before to
install a file for just 2-3 IO operations while also fully bypassing
the normal file table. I don't really see why we wouldn't use it.

> If it is indeed a significant performance improvement, I guess it is
> fine to have another user of the cache. But I'd be curious to know how
> much of the performance improvement you mentioned in the cover letter is
> due to this patch!

It was definitely sticking out in profiles, 5-10% of cycles, maybe more.

--
Pavel Begunkov

2023-04-04 16:00:28

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

Pavel Begunkov <[email protected]> writes:

> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>> Pavel Begunkov <[email protected]> writes:

>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>> much, definitely doesn't spare from locking, and the overhead
>>> definitely wasn't satisfactory for requests before.
>> There is no locks in the fast path of slub, as far as I know. it has
>> a
>> per-cpu cache that is refilled once empty, quite similar to the fastpath
>> of this cache. I imagine the performance hit in slub comes from the
>> barrier and atomic operations?
>
> Yeah, I mean all kinds of synchronisation. And I don't think
> that's the main offender here, the test is single threaded without
> contention and the system was mostly idle.
>
>> kmem_cache works fine for most hot paths of the kernel. I think this
>
> It doesn't for io_uring. There are caches for the net side and now
> in the block layer as well. I wouldn't say it necessarily halves
> performance but definitely takes a share of CPU.

Right. My point is that all these caches (block, io_uring) duplicate
what the slab cache is meant to do. Since slab became a bottleneck, I'm
looking at how to improve the situation on their side, to see if we can
drop the caching here and in block/.

>> If it is indeed a significant performance improvement, I guess it is
>> fine to have another user of the cache. But I'd be curious to know how
>> much of the performance improvement you mentioned in the cover letter is
>> due to this patch!
>
> It was definitely sticking out in profiles, 5-10% of cycles, maybe
> more

That's surprisingly high. Hopefully we will be able to avoid this
caching soon. For now, feel free to add to this patch:

Reviewed-by: Gabriel Krisman Bertazi <[email protected]>

--
Gabriel Krisman Bertazi

2023-04-04 16:01:23

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <[email protected]> writes:
>
>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>> Pavel Begunkov <[email protected]> writes:
>
>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>> much, definitely doesn't spare from locking, and the overhead
>>>> definitely wasn't satisfactory for requests before.
>>> There is no locks in the fast path of slub, as far as I know. it has
>>> a
>>> per-cpu cache that is refilled once empty, quite similar to the fastpath
>>> of this cache. I imagine the performance hit in slub comes from the
>>> barrier and atomic operations?
>>
>> Yeah, I mean all kinds of synchronisation. And I don't think
>> that's the main offender here, the test is single threaded without
>> contention and the system was mostly idle.
>>
>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>
>> It doesn't for io_uring. There are caches for the net side and now
>> in the block layer as well. I wouldn't say it necessarily halves
>> performance but definitely takes a share of CPU.
>
> Right. My point is that all these caches (block, io_uring) duplicate
> what the slab cache is meant to do. Since slab became a bottleneck, I'm
> looking at how to improve the situation on their side, to see if we can
> drop the caching here and in block/.

That would certainly be a worthy goal, and I do agree that these caches
are (largely) working around deficiencies. One important point that you
may miss is that most of this caching gets its performance from both
avoiding atomics in slub, but also because we can guarantee that both
alloc and free happen from process context. The block IRQ bits are a bit
different, but apart from that, it's true elsewhere. Caching that needs
to even disable IRQs locally generally doesn't beat out slub by much,
the big wins are the cases where we know free+alloc is done in process
context.

--
Jens Axboe

2023-04-04 16:54:53

by Gabriel Krisman Bertazi

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

Jens Axboe <[email protected]> writes:

> On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
>> Pavel Begunkov <[email protected]> writes:
>>
>>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>>> Pavel Begunkov <[email protected]> writes:
>>
>>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>>> much, definitely doesn't spare from locking, and the overhead
>>>>> definitely wasn't satisfactory for requests before.
>>>> There is no locks in the fast path of slub, as far as I know. it has
>>>> a
>>>> per-cpu cache that is refilled once empty, quite similar to the fastpath
>>>> of this cache. I imagine the performance hit in slub comes from the
>>>> barrier and atomic operations?
>>>
>>> Yeah, I mean all kinds of synchronisation. And I don't think
>>> that's the main offender here, the test is single threaded without
>>> contention and the system was mostly idle.
>>>
>>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>>
>>> It doesn't for io_uring. There are caches for the net side and now
>>> in the block layer as well. I wouldn't say it necessarily halves
>>> performance but definitely takes a share of CPU.
>>
>> Right. My point is that all these caches (block, io_uring) duplicate
>> what the slab cache is meant to do. Since slab became a bottleneck, I'm
>> looking at how to improve the situation on their side, to see if we can
>> drop the caching here and in block/.
>
> That would certainly be a worthy goal, and I do agree that these caches
> are (largely) working around deficiencies. One important point that you
> may miss is that most of this caching gets its performance from both
> avoiding atomics in slub, but also because we can guarantee that both
> alloc and free happen from process context. The block IRQ bits are a bit
> different, but apart from that, it's true elsewhere. Caching that needs
> to even disable IRQs locally generally doesn't beat out slub by much,
> the big wins are the cases where we know free+alloc is done in process
> context.

Yes, I noticed that. I was thinking of exposing a flag at kmem_cache
creation time to tell slab the user promises not to use it in IRQ
context, so it doesn't need to worry about nested invocation in the
allocation/free path. Then, for those caches, have a
kmem_cache_alloc_locked variant, where the synchronization is maintained
by the caller (i.e. by ->uring_lock here), so it can manipulate the
cache without atomics.
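
As a rough user-space illustration of what such a caller-locked
allocation path could look like: kmem_cache_alloc_locked does not exist
today, and everything below (the names, the malloc() fallback) is a
hypothetical sketch of the proposal, not slab's API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of the proposed caller-locked cache: the free
 * list is a plain linked list whose first word doubles as the next
 * pointer (so objects must be at least pointer-sized), and exclusion
 * comes from a lock the caller already holds (->uring_lock in
 * io_uring's case), so no atomics or IRQ disabling are needed.
 * None of these names exist in the kernel.
 */
struct locked_cache {
	void *free_head;
	size_t obj_size;
};

/* Caller must hold the external lock. */
static void *cache_alloc_locked(struct locked_cache *c)
{
	void *obj = c->free_head;

	if (obj) {
		c->free_head = *(void **)obj;	/* non-atomic pop */
		return obj;
	}
	return malloc(c->obj_size);		/* slow path: real allocator */
}

/* Caller must hold the external lock. */
static void cache_free_locked(struct locked_cache *c, void *obj)
{
	*(void **)obj = c->free_head;		/* non-atomic push */
	c->free_head = obj;
}
```

The point of the exercise is only the fast path: push and pop are plain
pointer stores because the caller's lock already serializes them.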

I was looking at your implementation of the block cache for inspiration
and saw how you kept a second list for IRQ. I'm thinking about how to fit
a similar change inside slub. But for now, I want to get the simpler
case, which is all io_uring needs.

I'll try to get a prototype before lsfmm and see if I get the MM folks
input there.

--
Gabriel Krisman Bertazi

2023-04-04 18:30:55

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH 10/11] io_uring/rsrc: cache struct io_rsrc_node

On 4/4/23 17:53, Gabriel Krisman Bertazi wrote:
> Jens Axboe <[email protected]> writes:
>
>> On 4/4/23 9:48 AM, Gabriel Krisman Bertazi wrote:
>>> Pavel Begunkov <[email protected]> writes:
>>>
>>>> On 4/1/23 01:04, Gabriel Krisman Bertazi wrote:
>>>>> Pavel Begunkov <[email protected]> writes:
>>>
>>>>>> I didn't try it, but kmem_cache vs kmalloc, IIRC, doesn't bring us
>>>>>> much, definitely doesn't spare from locking, and the overhead
>>>>>> definitely wasn't satisfactory for requests before.
>>>>> There is no locks in the fast path of slub, as far as I know. it has
>>>>> a
>>>>> per-cpu cache that is refilled once empty, quite similar to the fastpath
>>>>> of this cache. I imagine the performance hit in slub comes from the
>>>>> barrier and atomic operations?
>>>>
>>>> Yeah, I mean all kinds of synchronisation. And I don't think
>>>> that's the main offender here, the test is single threaded without
>>>> contention and the system was mostly idle.
>>>>
>>>>> kmem_cache works fine for most hot paths of the kernel. I think this
>>>>
>>>> It doesn't for io_uring. There are caches for the net side and now
>>>> in the block layer as well. I wouldn't say it necessarily halves
>>>> performance but definitely takes a share of CPU.
>>>
>>> Right. My point is that all these caches (block, io_uring) duplicate
>>> what the slab cache is meant to do. Since slab became a bottleneck, I'm
>>> looking at how to improve the situation on their side, to see if we can
>>> drop the caching here and in block/.
>>
>> That would certainly be a worthy goal, and I do agree that these caches
>> are (largely) working around deficiencies. One important point that you
>> may miss is that most of this caching gets its performance from both
>> avoiding atomics in slub, but also because we can guarantee that both
>> alloc and free happen from process context. The block IRQ bits are a bit
>> different, but apart from that, it's true elsewhere. Caching that needs
>> to even disable IRQs locally generally doesn't beat out slub by much,
>> the big wins are the cases where we know free+alloc is done in process
>> context.
>
> Yes, I noticed that. I was thinking of exposing a flag at kmem_cache
> creation-time to tell slab the user promises not to use it in IRQ
> context, so it doesn't need to worry about nested invocation in the
> allocation/free path. Then, for those caches, have a
> kmem_cache_alloc_locked variant, where the synchronization is maintained
> by the caller (i.e. by ->uring_lock here), so it can manipulate the
> cache without atomics.
>
> I was looking at your implementation of the block cache for inspiration
> and saw how you kept a second list for IRQ. I'm thinking how to fit a

It works fine, but one problem I had while implementing it is that
local_irq_save() in bio_put_percpu_cache() is not so cheap even when
interrupts are already disabled. Apparently, raw_local_save_flags() is
not as free as I wished it to be, and we can't easily pass the current
context. Would be nice to do something about it at some point.


> similar change inside slub. But for now, I want to get the simpler
> case, which is all io_uring need.
>
> I'll try to get a prototype before lsfmm and see if I get the MM folks
> input there.
>

--
Pavel Begunkov