2019-12-21 16:17:33

by Pavel Begunkov

Subject: [PATCH v2 0/3] optimise ctx's refs grabbing in io_uring

Optimise percpu_ref_tryget() by batching it instead of calling it for each
request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
better comments for percpu_ref_tryget_many (Dennis Zhou)
amortise across io_uring_enter() boundary

Pavel Begunkov (3):
pcpu_ref: add percpu_ref_tryget_many()
io_uring: batch getting pcpu references
io_uring: batch get(ctx->ref) across submits

fs/io_uring.c | 29 ++++++++++++++++++++++++++---
include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
2 files changed, 47 insertions(+), 8 deletions(-)

--
2.24.0


2019-12-21 16:39:39

by Pavel Begunkov

Subject: [PATCH v2 1/3] pcpu_ref: add percpu_ref_tryget_many()

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs the specified number of refs.
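
For illustration, a minimal sketch of the intended usage pattern (the caller
and do_submit() are hypothetical, not part of this patch; the real users are
in the io_uring patches that follow):

	/* grab nr refs up front, return whatever ends up unused */
	static int submit_batch(struct percpu_ref *ref, unsigned int nr)
	{
		unsigned int done;

		if (!percpu_ref_tryget_many(ref, nr))	/* all-or-nothing */
			return -EAGAIN;

		done = do_submit(nr);		/* may handle fewer than nr */
		if (done != nr)
			percpu_ref_put_many(ref, nr - done);
		return done;
	}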

Signed-off-by: Pavel Begunkov <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..22d9d183950d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
}

/**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
* @ref: percpu_ref to try-get
+ * @nr: number of references to get
*
- * Increment a percpu refcount unless its count already reached zero.
+ * Increment a percpu refcount by @nr unless its count already reached zero.
* Returns %true on success; %false on failure.
*
* This function is safe to call as long as @ref is between init and exit.
*/
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+ unsigned long nr)
{
unsigned long __percpu *percpu_count;
bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
rcu_read_lock();

if (__ref_is_percpu(ref, &percpu_count)) {
- this_cpu_inc(*percpu_count);
+ this_cpu_add(*percpu_count, nr);
ret = true;
} else {
- ret = atomic_long_inc_not_zero(&ref->count);
+ ret = atomic_long_add_unless(&ref->count, nr, 0);
}

rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
return ret;
}

+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+ return percpu_ref_tryget_many(ref, 1);
+}
+
/**
* percpu_ref_tryget_live - try to increment a live percpu refcount
* @ref: percpu_ref to try-get
--
2.24.0

2019-12-21 17:01:07

by Pavel Begunkov

Subject: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits

Double-account ctx->refs by keeping the number of taken refs in ctx. As
io_uring grabs per-request ctx->refs during submission while holding
ctx->uring_lock, this allows it to bypass percpu_ref_get*() and its
overhead most of the time.
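
In short: keep a per-ctx cache of pre-grabbed refs and top it up in
threshold-sized batches, so each request only consumes from the cache. A
condensed sketch of the scheme (see the diff below for the real thing):

	/* under ctx->uring_lock, at the start of io_submit_sqes() */
	if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
		if (percpu_ref_is_dying(&ctx->refs))
			return -ENXIO;	/* after putting back the cache */
		if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
			return -EAGAIN;
		ctx->taken_refs += IORING_REFS_THRESHOLD;
	}
	/* ... submit requests ... */
	ctx->taken_refs -= submitted;	/* ownership moved to the requests */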

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5392134f042f..eef09de94609 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -84,6 +84,9 @@
#define IORING_MAX_ENTRIES 32768
#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)

+/* Not less than IORING_MAX_ENTRIES, so can grab once per submission loop */
+#define IORING_REFS_THRESHOLD IORING_MAX_ENTRIES
+
/*
* Shift of 9 is 512 entries, or exactly one page on 64-bit archs
*/
@@ -197,6 +200,7 @@ struct fixed_file_data {
struct io_ring_ctx {
struct {
struct percpu_ref refs;
+ unsigned long taken_refs; /* used under @uring_lock */
} ____cacheline_aligned_in_smp;

struct {
@@ -690,6 +694,13 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
complete(&ctx->completions[0]);
}

+static void io_free_taken_refs(struct io_ring_ctx *ctx)
+{
+ if (ctx->taken_refs)
+ percpu_ref_put_many(&ctx->refs, ctx->taken_refs);
+ ctx->taken_refs = 0;
+}
+
static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
{
struct io_ring_ctx *ctx;
@@ -4388,7 +4399,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
struct io_submit_state state, *statep = NULL;
struct io_kiocb *link = NULL;
int i, submitted = 0;
- unsigned int extra_refs;
bool mm_fault = false;

/* if we have a backlog and couldn't flush it all, return BUSY */
@@ -4398,9 +4408,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
return -EBUSY;
}

- if (!percpu_ref_tryget_many(&ctx->refs, nr))
- return -EAGAIN;
- extra_refs = nr;
+ if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
+ if (unlikely(percpu_ref_is_dying(&ctx->refs))) {
+ io_free_taken_refs(ctx);
+ return -ENXIO;
+ }
+ if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
+ return -EAGAIN;
+ ctx->taken_refs += IORING_REFS_THRESHOLD;
+ }

if (nr > IO_PLUG_THRESHOLD) {
io_submit_state_start(&state, nr);
@@ -4417,8 +4433,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
submitted = -EAGAIN;
break;
}
- --extra_refs;
if (!io_get_sqring(ctx, req, &sqe)) {
+ /* not submitted, but a ref is freed */
+ ctx->taken_refs--;
__io_free_req(req);
break;
}
@@ -4454,8 +4471,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
io_queue_link_head(link);
if (statep)
io_submit_state_end(&state);
- if (extra_refs)
- percpu_ref_put_many(&ctx->refs, extra_refs);
+ ctx->taken_refs -= submitted;

/* Commit SQ ring head once we've consumed and submitted all SQEs */
io_commit_sqring(ctx);
@@ -5731,6 +5747,7 @@ static int io_uring_fasync(int fd, struct file *file, int on)
static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
{
mutex_lock(&ctx->uring_lock);
+ io_free_taken_refs(ctx);
percpu_ref_kill(&ctx->refs);
mutex_unlock(&ctx->uring_lock);

@@ -6196,6 +6213,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,

if (opcode != IORING_UNREGISTER_FILES &&
opcode != IORING_REGISTER_FILES_UPDATE) {
+ io_free_taken_refs(ctx);
percpu_ref_kill(&ctx->refs);

/*
--
2.24.0

2019-12-21 17:01:39

by Pavel Begunkov

Subject: Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits

On 21/12/2019 19:15, Pavel Begunkov wrote:
> Double-account ctx->refs by keeping the number of taken refs in ctx. As
> io_uring grabs per-request ctx->refs during submission while holding
> ctx->uring_lock, this allows it to bypass percpu_ref_get*() and its
> overhead most of the time.

Jens, could you please benchmark with this one, especially the offloaded QD1
case? I haven't seen any difference in the nops test and don't have a decent
SSD at hand to test it myself. We could drop it if there is no benefit.

This rewrites the @extra_refs handling from the second patch, so I left it for now.


>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 32 +++++++++++++++++++++++++-------
> 1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5392134f042f..eef09de94609 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -84,6 +84,9 @@
> #define IORING_MAX_ENTRIES 32768
> #define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)
>
> +/* Not less than IORING_MAX_ENTRIES, so can grab once per submission loop */
> +#define IORING_REFS_THRESHOLD IORING_MAX_ENTRIES
> +
> /*
> * Shift of 9 is 512 entries, or exactly one page on 64-bit archs
> */
> @@ -197,6 +200,7 @@ struct fixed_file_data {
> struct io_ring_ctx {
> struct {
> struct percpu_ref refs;
> + unsigned long taken_refs; /* used under @uring_lock */
> } ____cacheline_aligned_in_smp;
>
> struct {
> @@ -690,6 +694,13 @@ static void io_ring_ctx_ref_free(struct percpu_ref *ref)
> complete(&ctx->completions[0]);
> }
>
> +static void io_free_taken_refs(struct io_ring_ctx *ctx)
> +{
> + if (ctx->taken_refs)
> + percpu_ref_put_many(&ctx->refs, ctx->taken_refs);
> + ctx->taken_refs = 0;
> +}
> +
> static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
> {
> struct io_ring_ctx *ctx;
> @@ -4388,7 +4399,6 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> struct io_submit_state state, *statep = NULL;
> struct io_kiocb *link = NULL;
> int i, submitted = 0;
> - unsigned int extra_refs;
> bool mm_fault = false;
>
> /* if we have a backlog and couldn't flush it all, return BUSY */
> @@ -4398,9 +4408,15 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> return -EBUSY;
> }
>
> - if (!percpu_ref_tryget_many(&ctx->refs, nr))
> - return -EAGAIN;
> - extra_refs = nr;
> + if (ctx->taken_refs < IORING_REFS_THRESHOLD) {
> + if (unlikely(percpu_ref_is_dying(&ctx->refs))) {
> + io_free_taken_refs(ctx);
> + return -ENXIO;
> + }
> + if (!percpu_ref_tryget_many(&ctx->refs, IORING_REFS_THRESHOLD))
> + return -EAGAIN;
> + ctx->taken_refs += IORING_REFS_THRESHOLD;
> + }
>
> if (nr > IO_PLUG_THRESHOLD) {
> io_submit_state_start(&state, nr);
> @@ -4417,8 +4433,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> submitted = -EAGAIN;
> break;
> }
> - --extra_refs;
> if (!io_get_sqring(ctx, req, &sqe)) {
> + /* not submitted, but a ref is freed */
> + ctx->taken_refs--;
> __io_free_req(req);
> break;
> }
> @@ -4454,8 +4471,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> io_queue_link_head(link);
> if (statep)
> io_submit_state_end(&state);
> - if (extra_refs)
> - percpu_ref_put_many(&ctx->refs, extra_refs);
> + ctx->taken_refs -= submitted;
>
> /* Commit SQ ring head once we've consumed and submitted all SQEs */
> io_commit_sqring(ctx);
> @@ -5731,6 +5747,7 @@ static int io_uring_fasync(int fd, struct file *file, int on)
> static void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
> {
> mutex_lock(&ctx->uring_lock);
> + io_free_taken_refs(ctx);
> percpu_ref_kill(&ctx->refs);
> mutex_unlock(&ctx->uring_lock);
>
> @@ -6196,6 +6213,7 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>
> if (opcode != IORING_UNREGISTER_FILES &&
> opcode != IORING_REGISTER_FILES_UPDATE) {
> + io_free_taken_refs(ctx);
> percpu_ref_kill(&ctx->refs);
>
> /*
>

--
Pavel Begunkov



2019-12-21 17:21:40

by Jens Axboe

Subject: Re: [PATCH RFC v2 3/3] io_uring: batch get(ctx->ref) across submits

On 12/21/19 9:20 AM, Pavel Begunkov wrote:
> On 21/12/2019 19:15, Pavel Begunkov wrote:
>> Double-account ctx->refs by keeping the number of taken refs in ctx. As
>> io_uring grabs per-request ctx->refs during submission while holding
>> ctx->uring_lock, this allows it to bypass percpu_ref_get*() and its
>> overhead most of the time.
>
> Jens, could you please benchmark with this one, especially the offloaded QD1
> case? I haven't seen any difference in the nops test and don't have a decent
> SSD at hand to test it myself. We could drop it if there is no benefit.
>
> This rewrites the @extra_refs handling from the second patch, so I left it for now.

Sure, let me run a peak test, a qd1 test, and a qd1+sqpoll test on
for-5.6/io_uring, on the same branch with 1-2, and on the same branch with
1-3. One core used for all, and we're going to be core-speed bound for
performance in all cases on this setup, so it should give us a good
comparison.

--
Jens Axboe

2019-12-21 20:14:25

by Pavel Begunkov

Subject: [PATCH v3 0/2] optimise ctx's refs grabbing in io_uring

Optimise percpu_ref_tryget() by batching it instead of calling it for each
request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
better comments for percpu_ref_tryget_many (Dennis Zhou)
amortise across io_uring_enter() boundary

v3: drop "batching across syscalls"
remove error handling in io_submit_sqes() from common path

Pavel Begunkov (2):
pcpu_ref: add percpu_ref_tryget_many()
io_uring: batch getting pcpu references

fs/io_uring.c | 14 ++++++++++----
include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
2 files changed, 31 insertions(+), 9 deletions(-)

--
2.24.0

2019-12-21 20:14:48

by Pavel Begunkov

Subject: [PATCH v3 2/2] io_uring: batch getting pcpu references

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

A basic benchmark submitting and waiting for 128 non-linked nops showed a
~5% performance gain (7044 vs 7423 KIOPS).

Signed-off-by: Pavel Begunkov <[email protected]>
---

It's just becoming more bulky with ret for me, and I would love to hear
how to make it clearer. This version removes all error handling from the
hot path, though with a goto.
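
For reference, the benchmark above is roughly the following, assuming
liburing (the loop count and setup are illustrative, not the exact tool
used):

	#include <liburing.h>

	#define BATCH	128

	int main(void)
	{
		struct io_uring ring;
		long i, j;

		if (io_uring_queue_init(BATCH, &ring, 0))
			return 1;

		for (i = 0; i < 1000000; i++) {
			/* queue BATCH non-linked nops */
			for (j = 0; j < BATCH; j++)
				io_uring_prep_nop(io_uring_get_sqe(&ring));

			/* one syscall: submit the batch and wait for it all */
			if (io_uring_submit_and_wait(&ring, BATCH) < 0)
				break;
			io_uring_cq_advance(&ring, BATCH);
		}
		io_uring_queue_exit(&ring);
		return 0;
	}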

fs/io_uring.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 513f1922ce6a..b89a8b975c69 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1045,9 +1045,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
struct io_kiocb *req;

- if (!percpu_ref_tryget(&ctx->refs))
- return NULL;
-
if (!state) {
req = kmem_cache_alloc(req_cachep, gfp);
if (unlikely(!req))
@@ -4400,6 +4397,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
return -EBUSY;
}

+ if (!percpu_ref_tryget_many(&ctx->refs, nr))
+ return -EAGAIN;
+
if (nr > IO_PLUG_THRESHOLD) {
io_submit_state_start(&state, nr);
statep = &state;
@@ -4408,16 +4408,22 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
for (i = 0; i < nr; i++) {
const struct io_uring_sqe *sqe;
struct io_kiocb *req;
+ unsigned int unused_refs;

req = io_get_req(ctx, statep);
if (unlikely(!req)) {
+ unused_refs = nr - submitted;
if (!submitted)
submitted = -EAGAIN;
+put_refs:
+ percpu_ref_put_many(&ctx->refs, unused_refs);
break;
}
if (!io_get_sqring(ctx, req, &sqe)) {
__io_free_req(req);
- break;
+ /* __io_free_req() puts a ref */
+ unused_refs = nr - submitted - 1;
+ goto put_refs;
}

/* will complete beyond this point, count as submitted */
--
2.24.0

2019-12-21 20:23:55

by Pavel Begunkov

Subject: [PATCH v3 1/2] pcpu_ref: add percpu_ref_tryget_many()

Add percpu_ref_tryget_many(), which works the same way as
percpu_ref_tryget(), but grabs the specified number of refs.

Signed-off-by: Pavel Begunkov <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Dennis Zhou <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 390031e816dc..22d9d183950d 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -210,15 +210,17 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
}

/**
- * percpu_ref_tryget - try to increment a percpu refcount
+ * percpu_ref_tryget_many - try to increment a percpu refcount
* @ref: percpu_ref to try-get
+ * @nr: number of references to get
*
- * Increment a percpu refcount unless its count already reached zero.
+ * Increment a percpu refcount by @nr unless its count already reached zero.
* Returns %true on success; %false on failure.
*
* This function is safe to call as long as @ref is between init and exit.
*/
-static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+static inline bool percpu_ref_tryget_many(struct percpu_ref *ref,
+ unsigned long nr)
{
unsigned long __percpu *percpu_count;
bool ret;
@@ -226,10 +228,10 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
rcu_read_lock();

if (__ref_is_percpu(ref, &percpu_count)) {
- this_cpu_inc(*percpu_count);
+ this_cpu_add(*percpu_count, nr);
ret = true;
} else {
- ret = atomic_long_inc_not_zero(&ref->count);
+ ret = atomic_long_add_unless(&ref->count, nr, 0);
}

rcu_read_unlock();
@@ -237,6 +239,20 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
return ret;
}

+/**
+ * percpu_ref_tryget - try to increment a percpu refcount
+ * @ref: percpu_ref to try-get
+ *
+ * Increment a percpu refcount unless its count already reached zero.
+ * Returns %true on success; %false on failure.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
+static inline bool percpu_ref_tryget(struct percpu_ref *ref)
+{
+ return percpu_ref_tryget_many(ref, 1);
+}
+
/**
* percpu_ref_tryget_live - try to increment a live percpu refcount
* @ref: percpu_ref to try-get
--
2.24.0

2019-12-21 21:58:07

by Pavel Begunkov

Subject: Re: [PATCH v3 2/2] io_uring: batch getting pcpu references

On 21/12/2019 23:12, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead of getting a reference
> for each request, grab a bunch once per io_submit_sqes().
>
> A basic benchmark submitting and waiting for 128 non-linked nops showed a
> ~5% performance gain (7044 vs 7423 KIOPS).
>

Hmm, this won't work. Please drop it.

> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
>
> It's just becoming more bulky with ret for me, and I would love to hear
> how to make it clearer. This version removes all error handling from the
> hot path, though with a goto.
>
> fs/io_uring.c | 14 ++++++++++----
> 1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 513f1922ce6a..b89a8b975c69 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1045,9 +1045,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
> gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> struct io_kiocb *req;
>
> - if (!percpu_ref_tryget(&ctx->refs))
> - return NULL;
> -
> if (!state) {
> req = kmem_cache_alloc(req_cachep, gfp);
> if (unlikely(!req))
> @@ -4400,6 +4397,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> return -EBUSY;
> }
>
> + if (!percpu_ref_tryget_many(&ctx->refs, nr))
> + return -EAGAIN;
> +
> if (nr > IO_PLUG_THRESHOLD) {
> io_submit_state_start(&state, nr);
> statep = &state;
> @@ -4408,16 +4408,22 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> for (i = 0; i < nr; i++) {
> const struct io_uring_sqe *sqe;
> struct io_kiocb *req;
> + unsigned int unused_refs;
>
> req = io_get_req(ctx, statep);
> if (unlikely(!req)) {
> + unused_refs = nr - submitted;
> if (!submitted)
> submitted = -EAGAIN;
> +put_refs:
> + percpu_ref_put_many(&ctx->refs, unused_refs);
> break;
> }
> if (!io_get_sqring(ctx, req, &sqe)) {
> __io_free_req(req);
> - break;
> + /* __io_free_req() puts a ref */
> + unused_refs = nr - submitted - 1;
> + goto put_refs;
> }
>
> /* will complete beyond this point, count as submitted */
>

--
Pavel Begunkov



2019-12-28 11:14:32

by Pavel Begunkov

Subject: [PATCH v4 0/2] optimise ctx's refs grabbing in io_uring

Optimise percpu_ref_tryget() by batching it instead of calling it for each
request. This gave a measurable ~5% performance boost for large QD.

v2: fix uncommitted plug (Jens Axboe)
better comments for percpu_ref_tryget_many (Dennis Zhou)
amortise across io_uring_enter() boundary

v3: drop "batching across syscalls"
remove error handling in io_submit_sqes() from common path

v4: fix error handling

Pavel Begunkov (2):
pcpu_ref: add percpu_ref_tryget_many()
io_uring: batch getting pcpu references

fs/io_uring.c | 26 +++++++++++++++++---------
include/linux/percpu-refcount.h | 26 +++++++++++++++++++++-----
2 files changed, 38 insertions(+), 14 deletions(-)

--
2.24.0

2019-12-28 11:15:03

by Pavel Begunkov

Subject: [PATCH v4 2/2] io_uring: batch getting pcpu references

percpu_ref_tryget() has its own overhead. Instead of getting a reference
for each request, grab a bunch once per io_submit_sqes().

~5% throughput boost for a "submit and wait 128 nops" benchmark.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 26 +++++++++++++++++---------
1 file changed, 17 insertions(+), 9 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7fc1158bf9a4..404946080e86 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
struct io_kiocb *req;

- if (!percpu_ref_tryget(&ctx->refs))
- return NULL;
-
if (!state) {
req = kmem_cache_alloc(req_cachep, gfp);
if (unlikely(!req))
@@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
}
}

+static void __io_req_free_empty(struct io_kiocb *req)
+{
+ if (likely(!io_is_fallback_req(req)))
+ kmem_cache_free(req_cachep, req);
+ else
+ clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req);
+}
+
static void __io_free_req(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -1162,11 +1167,9 @@ static void __io_free_req(struct io_kiocb *req)
wake_up(&ctx->inflight_wait);
spin_unlock_irqrestore(&ctx->inflight_lock, flags);
}
- percpu_ref_put(&ctx->refs);
- if (likely(!io_is_fallback_req(req)))
- kmem_cache_free(req_cachep, req);
- else
- clear_bit_unlock(0, (unsigned long *) ctx->fallback_req);
+
+ percpu_ref_put(&req->ctx->refs);
+ __io_req_free_empty(req);
}

static bool io_link_cancel_timeout(struct io_kiocb *req)
@@ -4551,6 +4554,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
return -EBUSY;
}

+ if (!percpu_ref_tryget_many(&ctx->refs, nr))
+ return -EAGAIN;
+
if (nr > IO_PLUG_THRESHOLD) {
io_submit_state_start(&state, nr);
statep = &state;
@@ -4567,7 +4573,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
break;
}
if (!io_get_sqring(ctx, req, &sqe)) {
- __io_free_req(req);
+ __io_req_free_empty(req);
break;
}

@@ -4598,6 +4604,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
break;
}

+ if (submitted != nr)
+ percpu_ref_put_many(&ctx->refs, nr - submitted);
if (link)
io_queue_link_head(link);
if (statep)
--
2.24.0

2019-12-28 11:17:48

by Pavel Begunkov

Subject: Re: [PATCH v4 2/2] io_uring: batch getting pcpu references

On 28/12/2019 14:13, Pavel Begunkov wrote:
> percpu_ref_tryget() has its own overhead. Instead of getting a reference
> for each request, grab a bunch once per io_submit_sqes().
>
> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
>
> Signed-off-by: Pavel Begunkov <[email protected]>
> ---
> fs/io_uring.c | 26 +++++++++++++++++---------
> 1 file changed, 17 insertions(+), 9 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 7fc1158bf9a4..404946080e86 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
> gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> struct io_kiocb *req;
>
> - if (!percpu_ref_tryget(&ctx->refs))
> - return NULL;
> -
> if (!state) {
> req = kmem_cache_alloc(req_cachep, gfp);
> if (unlikely(!req))
> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
> }
> }
>
> +static void __io_req_free_empty(struct io_kiocb *req)

If anybody has better naming (or a better approach altogether), I'm all ears.


> +{
> + if (likely(!io_is_fallback_req(req)))
> + kmem_cache_free(req_cachep, req);
> + else
> + clear_bit_unlock(0, (unsigned long *) req->ctx->fallback_req);
> +}
> +
> static void __io_free_req(struct io_kiocb *req)
> {
> struct io_ring_ctx *ctx = req->ctx;
> @@ -1162,11 +1167,9 @@ static void __io_free_req(struct io_kiocb *req)
> wake_up(&ctx->inflight_wait);
> spin_unlock_irqrestore(&ctx->inflight_lock, flags);
> }
> - percpu_ref_put(&ctx->refs);
> - if (likely(!io_is_fallback_req(req)))
> - kmem_cache_free(req_cachep, req);
> - else
> - clear_bit_unlock(0, (unsigned long *) ctx->fallback_req);
> +
> + percpu_ref_put(&req->ctx->refs);
> + __io_req_free_empty(req);
> }
>
> static bool io_link_cancel_timeout(struct io_kiocb *req)
> @@ -4551,6 +4554,9 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> return -EBUSY;
> }
>
> + if (!percpu_ref_tryget_many(&ctx->refs, nr))
> + return -EAGAIN;
> +
> if (nr > IO_PLUG_THRESHOLD) {
> io_submit_state_start(&state, nr);
> statep = &state;
> @@ -4567,7 +4573,7 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> break;
> }
> if (!io_get_sqring(ctx, req, &sqe)) {
> - __io_free_req(req);
> + __io_req_free_empty(req);
> break;
> }
>
> @@ -4598,6 +4604,8 @@ static int io_submit_sqes(struct io_ring_ctx *ctx, unsigned int nr,
> break;
> }
>
> + if (submitted != nr)
> + percpu_ref_put_many(&ctx->refs, nr - submitted);
> if (link)
> io_queue_link_head(link);
> if (statep)
>

--
Pavel Begunkov

2019-12-28 17:10:03

by Jens Axboe

Subject: Re: [PATCH v4 2/2] io_uring: batch getting pcpu references

On 12/28/19 4:15 AM, Pavel Begunkov wrote:
> On 28/12/2019 14:13, Pavel Begunkov wrote:
>> percpu_ref_tryget() has its own overhead. Instead of getting a reference
>> for each request, grab a bunch once per io_submit_sqes().
>>
>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
>>
>> Signed-off-by: Pavel Begunkov <[email protected]>
>> ---
>> fs/io_uring.c | 26 +++++++++++++++++---------
>> 1 file changed, 17 insertions(+), 9 deletions(-)
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 7fc1158bf9a4..404946080e86 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>> gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>> struct io_kiocb *req;
>>
>> - if (!percpu_ref_tryget(&ctx->refs))
>> - return NULL;
>> -
>> if (!state) {
>> req = kmem_cache_alloc(req_cachep, gfp);
>> if (unlikely(!req))
>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>> }
>> }
>>
>> +static void __io_req_free_empty(struct io_kiocb *req)
>
> If anybody has better naming (or a better approach altogether), I'm all ears.

__io_req_do_free()?

I think that's better than the empty, not quite sure what that means.
If you're fine with that, I can just make that edit when applying.
The rest looks fine to me now.

--
Jens Axboe

2019-12-28 18:39:14

by Pavel Begunkov

Subject: Re: [PATCH v4 2/2] io_uring: batch getting pcpu references

On 28/12/2019 20:03, Jens Axboe wrote:
> On 12/28/19 4:15 AM, Pavel Begunkov wrote:
>> On 28/12/2019 14:13, Pavel Begunkov wrote:
>>> percpu_ref_tryget() has its own overhead. Instead of getting a reference
>>> for each request, grab a bunch once per io_submit_sqes().
>>>
>>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
>>>
>>> Signed-off-by: Pavel Begunkov <[email protected]>
>>> ---
>>> fs/io_uring.c | 26 +++++++++++++++++---------
>>> 1 file changed, 17 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 7fc1158bf9a4..404946080e86 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
>>> gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
>>> struct io_kiocb *req;
>>>
>>> - if (!percpu_ref_tryget(&ctx->refs))
>>> - return NULL;
>>> -
>>> if (!state) {
>>> req = kmem_cache_alloc(req_cachep, gfp);
>>> if (unlikely(!req))
>>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
>>> }
>>> }
>>>
>>> +static void __io_req_free_empty(struct io_kiocb *req)
>>
>> If anybody has better naming (or a better approach altogether), I'm all ears.
>
> __io_req_do_free()?

Not quite clear what's the difference with __io_req_free() then

>
> I think that's better than the empty, not quite sure what that means.

Probably, so. It was kind of "request without a bound sqe".
Does io_free_{hollow,empty}_req() sound better?

> If you're fine with that, I can just make that edit when applying.
> The rest looks fine to me now.
>

Please do

--
Pavel Begunkov



2019-12-30 03:34:17

by Brian Gianforcaro

Subject: Re: [PATCH v4 2/2] io_uring: batch getting pcpu references

On Sat, Dec 28, 2019 at 09:37:35PM +0300, Pavel Begunkov wrote:
> On 28/12/2019 20:03, Jens Axboe wrote:
> > On 12/28/19 4:15 AM, Pavel Begunkov wrote:
> >> On 28/12/2019 14:13, Pavel Begunkov wrote:
> >>> percpu_ref_tryget() has its own overhead. Instead of getting a reference
> >>> for each request, grab a bunch once per io_submit_sqes().
> >>>
> >>> ~5% throughput boost for a "submit and wait 128 nops" benchmark.
> >>>
> >>> Signed-off-by: Pavel Begunkov <[email protected]>
> >>> ---
> >>> fs/io_uring.c | 26 +++++++++++++++++---------
> >>> 1 file changed, 17 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/fs/io_uring.c b/fs/io_uring.c
> >>> index 7fc1158bf9a4..404946080e86 100644
> >>> --- a/fs/io_uring.c
> >>> +++ b/fs/io_uring.c
> >>> @@ -1080,9 +1080,6 @@ static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
> >>> gfp_t gfp = GFP_KERNEL | __GFP_NOWARN;
> >>> struct io_kiocb *req;
> >>>
> >>> - if (!percpu_ref_tryget(&ctx->refs))
> >>> - return NULL;
> >>> -
> >>> if (!state) {
> >>> req = kmem_cache_alloc(req_cachep, gfp);
> >>> if (unlikely(!req))
> >>> @@ -1141,6 +1138,14 @@ static void io_free_req_many(struct io_ring_ctx *ctx, void **reqs, int *nr)
> >>> }
> >>> }
> >>>
> >>> +static void __io_req_free_empty(struct io_kiocb *req)
> >>
> >> If anybody has better naming (or a better approach altogether), I'm all ears.
> >
> > __io_req_do_free()?
>
> Not quite clear what's the difference with __io_req_free() then
>
> >
> > I think that's better than the empty, not quite sure what that means.
>
> Probably, so. It was kind of "request without a bound sqe".
> Does io_free_{hollow,empty}_req() sound better?

Given your description, perhaps io_free_unbound_req() makes sense?

2019-12-30 18:47:04

by Pavel Begunkov

Subject: Re: [PATCH v4 2/2] io_uring: batch getting pcpu references

On 30/12/2019 06:33, Brian Gianforcaro wrote:
>>>>> +static void __io_req_free_empty(struct io_kiocb *req)
>>>>
>>>> If anybody has better naming (or a better approach altogether), I'm all ears.
>>>
>>> __io_req_do_free()?
>>
>> Not quite clear what's the difference with __io_req_free() then
>>
>>>
>>> I think that's better than the empty, not quite sure what that means.
>>
>> Probably, so. It was kind of "request without a bound sqe".
>> Does io_free_{hollow,empty}_req() sound better?
>
> Given your description, perhaps io_free_unbound_req() makes sense?
>

I like it more, though neither of these has a set meaning in the io_uring
context. The patch has already landed in Jens' repo, so you can send another
one if you think it's worth it.

--
Pavel Begunkov

