2023-07-18 20:07:15

by Jens Axboe

Subject: [PATCH] io_uring: Use io_schedule* in cqring wait

From: Andres Freund <[email protected]>

I observed poor performance of io_uring compared to synchronous IO. That
turns out to be caused by deeper CPU idle states entered with io_uring,
due to io_uring using plain schedule(), whereas synchronous IO uses
io_schedule().

The losses due to this are substantial. On my Cascade Lake workstation,
t/io_uring from the fio repository, for example, yields regressions
between 20% and 40% with the following command:
./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0

This is repeatable with different filesystems, using raw block devices
and using different block devices.

Use io_schedule_prepare() / io_schedule_finish() in
io_cqring_wait_schedule() to address the difference.

After that, io_uring is on par with or surpasses synchronous IO (using
registered files etc. makes it reliably win, but that is arguably a less
fair comparison).

There are other calls to schedule() in io_uring/, but none immediately
jumps out as being similarly situated, so I did not touch them. Similarly,
it's possible that mutex_lock_io() should be used, but it's not clear if
there are cases where that matters.
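
For reference, the io_schedule() path boils down to roughly the following
(paraphrased from kernel/sched/core.c; exact details vary by kernel version).
The in_iowait marking is what e.g. cpufreq's iowait boosting keys off:

int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;
	blk_flush_plug(current->plug, true);
	return old_iowait;
}

void io_schedule_finish(int token)
{
	current->in_iowait = token;
}

void __sched io_schedule(void)
{
	int token;

	token = io_schedule_prepare();
	schedule();
	io_schedule_finish(token);
}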

Cc: [email protected] # 5.10+
Cc: Pavel Begunkov <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Andres Freund <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[axboe: minor style fixup]
Signed-off-by: Jens Axboe <[email protected]>
---
io_uring/io_uring.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e8096d502a7c..7505de2428e0 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2489,6 +2489,8 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
 {
+	int token, ret;
+
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
 	if (unlikely(!llist_empty(&ctx->work_llist)))
@@ -2499,11 +2501,20 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return -EINTR;
 	if (unlikely(io_should_wake(iowq)))
 		return 0;
+
+	/*
+	 * Use io_schedule_prepare/finish, so cpufreq can take into account
+	 * that the task is waiting for IO - turns out to be important for low
+	 * QD IO.
+	 */
+	token = io_schedule_prepare();
+	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
-		return -ETIME;
-	return 0;
+		ret = -ETIME;
+	io_schedule_finish(token);
+	return ret;
 }

/*
--
2.40.1



2023-07-24 16:02:01

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 9:35 AM, Phil Elwell wrote:
> Hi Andres,
>
> With this commit applied to the 6.1 and later kernels (others not
> tested) the iowait time ("wa" field in top) in an ARM64 build running
> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
> is permanently blocked on I/O. The change can be observed after
> installing mariadb-server (no configuration or use is required). After
> reverting just this commit, "wa" drops to zero again.

There are a few other threads on this...

> I can believe that this change hasn't negatively affected performance,
> but the result is misleading. I also think it's pushing the boundaries
> of what a back-port to stable should do.

The iowait accounting is just a cosmetic thing, to be fair, and the original
change makes quite a large performance difference in important cases. That is
why it also went to stable, which btw was not Andres's decision at all. I've
posted this patch in another thread as well, but here it is in this thread
too - it will limit the cases that are marked as iowait.


diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 89a611541bc4..f4591b912ea8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2493,11 +2493,20 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 	return 0;
 }
 
+static bool current_pending_io(void)
+{
+	struct io_uring_task *tctx = current->io_uring;
+
+	if (!tctx)
+		return false;
+	return percpu_counter_read_positive(&tctx->inflight);
+}
+
 /* when returns >0, the caller should retry */
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
 {
-	int token, ret;
+	int io_wait, ret;
 
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
@@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return 0;
 
 	/*
-	 * Use io_schedule_prepare/finish, so cpufreq can take into account
-	 * that the task is waiting for IO - turns out to be important for low
-	 * QD IO.
+	 * Mark us as being in io_wait if we have pending requests, so cpufreq
+	 * can take into account that the task is waiting for IO - turns out
+	 * to be important for low QD IO.
 	 */
-	token = io_schedule_prepare();
+	io_wait = current->in_iowait;
+	if (current_pending_io())
+		current->in_iowait = 1;
 	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
 		ret = -ETIME;
-	io_schedule_finish(token);
+	current->in_iowait = io_wait;
 	return ret;
 }


--
Jens Axboe


2023-07-24 16:07:26

by Phil Elwell

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Hi Andres,

With this commit applied to the 6.1 and later kernels (others not
tested) the iowait time ("wa" field in top) in an ARM64 build running
on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
is permanently blocked on I/O. The change can be observed after
installing mariadb-server (no configuration or use is required). After
reverting just this commit, "wa" drops to zero again.

I can believe that this change hasn't negatively affected performance,
but the result is misleading. I also think it's pushing the boundaries
of what a back-port to stable should do.

Phil

2023-07-24 16:25:37

by Greg KH

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On Mon, Jul 24, 2023 at 04:35:43PM +0100, Phil Elwell wrote:
> Hi Andres,
>
> With this commit applied to the 6.1 and later kernels (others not
> tested) the iowait time ("wa" field in top) in an ARM64 build running
> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
> is permanently blocked on I/O. The change can be observed after
> installing mariadb-server (no configuration or use is required). After
> reverting just this commit, "wa" drops to zero again.

This has been discussed already:
https://lore.kernel.org/r/[email protected]

It's not a bug, mariadb does have pending I/O, so the report is correct,
but the CPU isn't blocked at all.

thanks,

greg k-h

2023-07-24 16:26:03

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 9:48 AM, Greg KH wrote:
> On Mon, Jul 24, 2023 at 04:35:43PM +0100, Phil Elwell wrote:
>> Hi Andres,
>>
>> With this commit applied to the 6.1 and later kernels (others not
>> tested) the iowait time ("wa" field in top) in an ARM64 build running
>> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
>> is permanently blocked on I/O. The change can be observed after
>> installing mariadb-server (no configuration or use is required). After
>> reverting just this commit, "wa" drops to zero again.
>
> This has been discussed already:
> https://lore.kernel.org/r/[email protected]
>
> It's not a bug, mariadb does have pending I/O, so the report is correct,
> but the CPU isn't blocked at all.

Indeed - the only thing I can think of is that perhaps mariadb has a
separate thread waiting on the ring in perpetuity, regardless of whether
or not it currently has IO.

But yes, this is very much ado about nothing...

--
Jens Axboe


2023-07-24 16:27:48

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 10:07 AM, Phil Elwell wrote:
>> Even though I don't think this is an actual problem, it is a bit
>> confusing that you get 100% iowait while waiting without having IO
>> pending. So I do think the suggested patch is probably worthwhile
>> pursuing. I'll post it and hopefully have Andres test it too, if he's
>> available.
>
> If you CC me I'll happily test it for you.

Here it is.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 89a611541bc4..f4591b912ea8 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -2493,11 +2493,20 @@ int io_run_task_work_sig(struct io_ring_ctx *ctx)
 	return 0;
 }
 
+static bool current_pending_io(void)
+{
+	struct io_uring_task *tctx = current->io_uring;
+
+	if (!tctx)
+		return false;
+	return percpu_counter_read_positive(&tctx->inflight);
+}
+
 /* when returns >0, the caller should retry */
 static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 					  struct io_wait_queue *iowq)
 {
-	int token, ret;
+	int io_wait, ret;
 
 	if (unlikely(READ_ONCE(ctx->check_cq)))
 		return 1;
@@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
 		return 0;
 
 	/*
-	 * Use io_schedule_prepare/finish, so cpufreq can take into account
-	 * that the task is waiting for IO - turns out to be important for low
-	 * QD IO.
+	 * Mark us as being in io_wait if we have pending requests, so cpufreq
+	 * can take into account that the task is waiting for IO - turns out
+	 * to be important for low QD IO.
 	 */
-	token = io_schedule_prepare();
+	io_wait = current->in_iowait;
+	if (current_pending_io())
+		current->in_iowait = 1;
 	ret = 0;
 	if (iowq->timeout == KTIME_MAX)
 		schedule();
 	else if (!schedule_hrtimeout(&iowq->timeout, HRTIMER_MODE_ABS))
 		ret = -ETIME;
-	io_schedule_finish(token);
+	current->in_iowait = io_wait;
 	return ret;
 }


--
Jens Axboe


2023-07-24 16:29:45

by Phil Elwell

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Jens, Greg,

On Mon, 24 Jul 2023 at 16:58, Jens Axboe <[email protected]> wrote:
>
> > On 7/24/23 9:50 AM, Jens Axboe wrote:
> > > On 7/24/23 9:48 AM, Greg KH wrote:
> >> On Mon, Jul 24, 2023 at 04:35:43PM +0100, Phil Elwell wrote:
> >>> Hi Andres,
> >>>
> >>> With this commit applied to the 6.1 and later kernels (others not
> >>> tested) the iowait time ("wa" field in top) in an ARM64 build running
> >>> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
> >>> is permanently blocked on I/O. The change can be observed after
> >>> installing mariadb-server (no configuration or use is required). After
> >>> reverting just this commit, "wa" drops to zero again.
> >>
> >> This has been discussed already:
> >> https://lore.kernel.org/r/[email protected]

Sorry - a brief search failed to find that.

> >> It's not a bug, mariadb does have pending I/O, so the report is correct,
> >> but the CPU isn't blocked at all.
> >
> > Indeed - only thing I can think of is perhaps mariadb is having a
> > separate thread waiting on the ring in perpetuity, regardless of whether
> > or not it currently has IO.
> >
> > But yes, this is very much ado about nothing...
>
> Current -git and having mariadb idle:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 0.00 0.00 0.04 12.47 0.04 0.00 0.00 0.00 0.00 87.44
> Average: 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 2 0.00 0.00 0.00 0.00 0.33 0.00 0.00 0.00 0.00 99.67
> Average: 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 4 0.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 99.67
> Average: 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 6 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00
> Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>
> which is showing 100% iowait on one cpu, as mariadb has a thread waiting
> on IO. That is obviously a valid use case, if you split submission and
> completion into separate threads. Then you have the latter just always
> waiting on something to process.
>
> With the suggested patch, we do eliminate that case and the iowait on
> that task is gone. Here's current -git with the patch and mariadb also
> running:
>
> 09:53:49 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 09:53:50 AM all 0.00 0.00 0.00 0.00 0.00 0.75 0.00 0.00 0.00 99.25
> 09:53:50 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> 09:53:50 AM 1 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 2 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 3 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 4 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 99.01
> 09:53:50 AM 5 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> 09:53:50 AM 7 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
>
>
> Even though I don't think this is an actual problem, it is a bit
> confusing that you get 100% iowait while waiting without having IO
> pending. So I do think the suggested patch is probably worthwhile
> pursuing. I'll post it and hopefully have Andres test it too, if he's
> available.

If you CC me I'll happily test it for you.

Thanks,

Phil

> --
> Jens Axboe
>

2023-07-24 16:45:48

by Andres Freund

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Hi,

On 2023-07-24 09:48:58 -0600, Jens Axboe wrote:
> On 7/24/23 9:35 AM, Phil Elwell wrote:
> > Hi Andres,
> >
> > With this commit applied to the 6.1 and later kernels (others not
> > tested) the iowait time ("wa" field in top) in an ARM64 build running
> > on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
> > is permanently blocked on I/O. The change can be observed after
> > installing mariadb-server (no configuration or use is required). After
> > reverting just this commit, "wa" drops to zero again.
>
> There are a few other threads on this...
>
> > I can believe that this change hasn't negatively affected performance,
> > but the result is misleading. I also think it's pushing the boundaries
> > of what a back-port to stable should do.

FWIW, I think this is partially just mpstat reporting something quite bogus.
It makes no sense to say that a CPU is 100% busy waiting for IO, when the one
process that is doing IO is just waiting.


> +static bool current_pending_io(void)
> +{
> + struct io_uring_task *tctx = current->io_uring;
> +
> + if (!tctx)
> + return false;
> + return percpu_counter_read_positive(&tctx->inflight);
> +}
> +
> /* when returns >0, the caller should retry */
> static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> struct io_wait_queue *iowq)
> {
> - int token, ret;
> + int io_wait, ret;
>
> if (unlikely(READ_ONCE(ctx->check_cq)))
> return 1;
> @@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> return 0;
>
> /*
> - * Use io_schedule_prepare/finish, so cpufreq can take into account
> - * that the task is waiting for IO - turns out to be important for low
> - * QD IO.
> + * Mark us as being in io_wait if we have pending requests, so cpufreq
> + * can take into account that the task is waiting for IO - turns out
> + * to be important for low QD IO.
> */
> - token = io_schedule_prepare();
> + io_wait = current->in_iowait;

I don't know the kernel "rules" around this, but ->in_iowait is only modified
in kernel/sched, so it seemed a tad "unfriendly" to scribble on it here...


Building a kernel to test with the patch applied, will reboot into it once the
call I am on has finished. Unfortunately the performance difference didn't
reproduce nicely in VM...

Greetings,

Andres Freund

2023-07-24 16:48:25

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 10:16 AM, Andres Freund wrote:
> Hi,
>
> On 2023-07-24 09:48:58 -0600, Jens Axboe wrote:
>> On 7/24/23 9:35 AM, Phil Elwell wrote:
>>> Hi Andres,
>>>
>>> With this commit applied to the 6.1 and later kernels (others not
>>> tested) the iowait time ("wa" field in top) in an ARM64 build running
>>> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
>>> is permanently blocked on I/O. The change can be observed after
>>> installing mariadb-server (no configuration or use is required). After
>>> reverting just this commit, "wa" drops to zero again.
>>
>> There are a few other threads on this...
>>
>>> I can believe that this change hasn't negatively affected performance,
>>> but the result is misleading. I also think it's pushing the boundaries
>>> of what a back-port to stable should do.
>
> FWIW, I think this is partially just mpstat reporting something quite bogus.
> It makes no sense to say that a CPU is 100% busy waiting for IO, when the one
> process that is doing IO is just waiting.

Indeed... It really just means it's spending 100% of its time _waiting_
on IO, not that it's doing anything. This is largely about saving myself
from future emails on this subject.

>> +static bool current_pending_io(void)
>> +{
>> + struct io_uring_task *tctx = current->io_uring;
>> +
>> + if (!tctx)
>> + return false;
>> + return percpu_counter_read_positive(&tctx->inflight);
>> +}
>> +
>> /* when returns >0, the caller should retry */
>> static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>> struct io_wait_queue *iowq)
>> {
>> - int token, ret;
>> + int io_wait, ret;
>>
>> if (unlikely(READ_ONCE(ctx->check_cq)))
>> return 1;
>> @@ -2511,17 +2520,19 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>> return 0;
>>
>> /*
>> - * Use io_schedule_prepare/finish, so cpufreq can take into account
>> - * that the task is waiting for IO - turns out to be important for low
>> - * QD IO.
>> + * Mark us as being in io_wait if we have pending requests, so cpufreq
>> + * can take into account that the task is waiting for IO - turns out
>> + * to be important for low QD IO.
>> */
>> - token = io_schedule_prepare();
>> + io_wait = current->in_iowait;
>
> I don't know the kernel "rules" around this, but ->in_iowait is only
> modified in kernel/sched, so it seemed a tad "unfriendly" to scribble
> on it here...

It's either that or add new helpers for this, at least for the initial
one. Calling blk_flush_plug() (and with async == true, no less) is not
something we need or want to do.

So we could add an io_schedule_prepare_noflush() for this, but also
seems silly to add a single use helper for that imho.
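
Purely for illustration, such a hypothetical io_schedule_prepare_noflush()
would just be io_schedule_prepare() minus the plug flush - something like the
sketch below. It does not exist in the kernel today:

static inline int io_schedule_prepare_noflush(void)
{
	/* hypothetical: mark the task as in iowait, without blk_flush_plug() */
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;
	return old_iowait;
}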

> Building a kernel to test with the patch applied, will reboot into it
> once the call I am on has finished. Unfortunately the performance
> difference didn't reproduce nicely in VM...

Thanks!

--
Jens Axboe


2023-07-24 16:49:17

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 9:50 AM, Jens Axboe wrote:
> On 7/24/23 9:48 AM, Greg KH wrote:
>> On Mon, Jul 24, 2023 at 04:35:43PM +0100, Phil Elwell wrote:
>>> Hi Andres,
>>>
>>> With this commit applied to the 6.1 and later kernels (others not
>>> tested) the iowait time ("wa" field in top) in an ARM64 build running
>>> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
>>> is permanently blocked on I/O. The change can be observed after
>>> installing mariadb-server (no configuration or use is required). After
>>> reverting just this commit, "wa" drops to zero again.
>>
>> This has been discussed already:
>> https://lore.kernel.org/r/[email protected]
>>
>> It's not a bug, mariadb does have pending I/O, so the report is correct,
>> but the CPU isn't blocked at all.
>
> Indeed - only thing I can think of is perhaps mariadb is having a
> separate thread waiting on the ring in perpetuity, regardless of whether
> or not it currently has IO.
>
> But yes, this is very much ado about nothing...

Current -git and having mariadb idle:

Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
Average: all 0.00 0.00 0.04 12.47 0.04 0.00 0.00 0.00 0.00 87.44
Average: 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 2 0.00 0.00 0.00 0.00 0.33 0.00 0.00 0.00 0.00 99.67
Average: 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 4 0.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 99.67
Average: 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
Average: 6 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00

which is showing 100% iowait on one cpu, as mariadb has a thread waiting
on IO. That is obviously a valid use case, if you split submission and
completion into separate threads. Then you have the latter just always
waiting on something to process.

With the suggested patch, we do eliminate that case and the iowait on
that task is gone. Here's current -git with the patch and mariadb also
running:

09:53:49 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:53:50 AM all 0.00 0.00 0.00 0.00 0.00 0.75 0.00 0.00 0.00 99.25
09:53:50 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
09:53:50 AM 1 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
09:53:50 AM 2 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
09:53:50 AM 3 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
09:53:50 AM 4 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 99.01
09:53:50 AM 5 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
09:53:50 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
09:53:50 AM 7 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00


Even though I don't think this is an actual problem, it is a bit
confusing that you get 100% iowait while waiting without having IO
pending. So I do think the suggested patch is probably worthwhile
pursuing. I'll post it and hopefully have Andres test it too, if he's
available.

--
Jens Axboe


2023-07-24 17:13:01

by Phil Elwell

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Hi Jens,

On Mon, 24 Jul 2023 at 17:08, Jens Axboe <[email protected]> wrote:
>
> On 7/24/23 10:07 AM, Phil Elwell wrote:
> >> Even though I don't think this is an actual problem, it is a bit
> >> confusing that you get 100% iowait while waiting without having IO
> >> pending. So I do think the suggested patch is probably worthwhile
> >> pursuing. I'll post it and hopefully have Andres test it too, if he's
> >> available.
> >
> > If you CC me I'll happily test it for you.
>
> Here it is.

< snip >

Thanks, that works for me on top of 6.5-rc3. Going to 6.1 is a
non-trivial (for me) back-port - the switch from "ret = 0" in 6.5 to
"ret = 1" in 6.1 is surprising.

Phil

2023-07-24 17:41:18

by Andres Freund

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Hi,

On 2023-07-24 09:16:56 -0700, Andres Freund wrote:
> Building a kernel to test with the patch applied, will reboot into it once the
> call I am on has finished. Unfortunately the performance difference didn't
> reproduce nicely in VM...

Performance is good with the patch applied. Results are slightly better even,
but I think that's likely just noise.

Greetings,

Andres Freund

2023-07-24 18:33:12

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 11:24 AM, Andres Freund wrote:
> Hi,
>
> On 2023-07-24 09:16:56 -0700, Andres Freund wrote:
>> Building a kernel to test with the patch applied, will reboot into it once the
>> call I am on has finished. Unfortunately the performance difference didn't
>> reproduce nicely in VM...
>
> Performance is good with the patch applied. Results are slightly better even,
> but I think that's likely just noise.

Could be - it's avoiding a few function calls and the flush, but would
probably have to be a pretty targeted test setup to find that for
storage IO. Thanks for testing!

--
Jens Axboe


2023-07-24 18:51:13

by Jens Axboe

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 10:48 AM, Phil Elwell wrote:
> Hi Jens,
>
> On Mon, 24 Jul 2023 at 17:08, Jens Axboe <[email protected]> wrote:
>>
>> On 7/24/23 10:07 AM, Phil Elwell wrote:
>>>> Even though I don't think this is an actual problem, it is a bit
>>>> confusing that you get 100% iowait while waiting without having IO
>>>> pending. So I do think the suggested patch is probably worthwhile
>>>> pursuing. I'll post it and hopefully have Andres test it too, if he's
>>>> available.
>>>
>>> If you CC me I'll happily test it for you.
>>
>> Here it is.
>
> < snip >
>
> Thanks, that works for me on top of 6.5-rc3. Going to 6.1 is a
> non-trivial (for me) back-port - the switch from "ret = 0" in 6.5 to
> "ret = 1" in 6.1 is surprising.

Great, thanks for testing. I'll take care of the stable backports
once the patch lands in upstream -git later this week.

--
Jens Axboe



2023-07-24 19:56:10

by Pavel Begunkov

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

On 7/24/23 16:58, Jens Axboe wrote:
> On 7/24/23 9:50 AM, Jens Axboe wrote:
>> On 7/24/23 9:48 AM, Greg KH wrote:
>>> On Mon, Jul 24, 2023 at 04:35:43PM +0100, Phil Elwell wrote:
>>>> Hi Andres,
>>>>
>>>> With this commit applied to the 6.1 and later kernels (others not
>>>> tested) the iowait time ("wa" field in top) in an ARM64 build running
>>>> on a 4 core CPU (a Raspberry Pi 4 B) increases to 25%, as if one core
>>>> is permanently blocked on I/O. The change can be observed after
>>>> installing mariadb-server (no configuration or use is required). After
>>>> reverting just this commit, "wa" drops to zero again.
>>>
>>> This has been discussed already:
>>> https://lore.kernel.org/r/[email protected]
>>>
>>> It's not a bug, mariadb does have pending I/O, so the report is correct,
>>> but the CPU isn't blocked at all.
>>
>> Indeed - only thing I can think of is perhaps mariadb is having a
>> separate thread waiting on the ring in perpetuity, regardless of whether
>> or not it currently has IO.
>>
>> But yes, this is very much ado about nothing...
>
> Current -git and having mariadb idle:
>
> Average: CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> Average: all 0.00 0.00 0.04 12.47 0.04 0.00 0.00 0.00 0.00 87.44
> Average: 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 2 0.00 0.00 0.00 0.00 0.33 0.00 0.00 0.00 0.00 99.67
> Average: 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 4 0.00 0.00 0.33 0.00 0.00 0.00 0.00 0.00 0.00 99.67
> Average: 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> Average: 6 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00
> Average: 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
>
> which is showing 100% iowait on one cpu, as mariadb has a thread waiting
> on IO. That is obviously a valid use case, if you split submission and
> completion into separate threads. Then you have the latter just always
> waiting on something to process.
>
> With the suggested patch, we do eliminate that case and the iowait on
> that task is gone. Here's current -git with the patch and mariadb also
> running:
>
> 09:53:49 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 09:53:50 AM all 0.00 0.00 0.00 0.00 0.00 0.75 0.00 0.00 0.00 99.25
> 09:53:50 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> 09:53:50 AM 1 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 2 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 3 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 4 0.00 0.00 0.00 0.00 0.00 0.99 0.00 0.00 0.00 99.01
> 09:53:50 AM 5 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
> 09:53:50 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
> 09:53:50 AM 7 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 99.00
>
>
> Even though I don't think this is an actual problem, it is a bit
> confusing that you get 100% iowait while waiting without having IO
> pending. So I do think the suggested patch is probably worthwhile
> pursuing. I'll post it and hopefully have Andres test it too, if he's
> available.

Emmm, what's the definition of the "IO" state? Unless we can say what exactly
it is, there will be no end to adjustments, because I can easily argue that
CQ waiting by itself is IO.

Do we consider sleep(N) to be "IO"? I don't think the kernel uses
io_schedule() around that, and so it'd be different from io_uring waiting for
a timeout request. What about epoll waiting, etc.?

--
Pavel Begunkov

2023-07-24 20:50:46

by Jeff Moyer

Subject: Re: [PATCH] io_uring: Use io_schedule* in cqring wait

Pavel Begunkov <[email protected]> writes:

> On 7/24/23 16:58, Jens Axboe wrote:
>> Even though I don't think this is an actual problem, it is a bit
>> confusing that you get 100% iowait while waiting without having IO
>> pending. So I do think the suggested patch is probably worthwhile
>> pursuing. I'll post it and hopefully have Andres test it too, if he's
>> available.
>
> Emmm, what's the definition of the "IO" state? Unless we can say what exactly
> it is there will be no end to adjustments, because I can easily argue that
> CQ waiting by itself is IO.
> Do we consider sleep(N) to be "IO"? I don't think the kernel uses io
> schedule around that, and so it'd be different from io_uring waiting for
> a timeout request. What about epoll waiting, etc.?

See Documentation/filesystems/proc.rst (and mainly commit 9c240d757658
("Change the document about iowait")):

- iowait: In a word, iowait stands for waiting for I/O to complete. But there
are several problems:

1. CPU will not wait for I/O to complete, iowait is the time that a task is
waiting for I/O to complete. When CPU goes into idle state for
outstanding task I/O, another task will be scheduled on this CPU.
2. In a multi-core CPU, the task waiting for I/O to complete is not running
on any CPU, so the iowait of each CPU is difficult to calculate.
3. The value of iowait field in /proc/stat will decrease in certain
conditions.

So, the iowait is not reliable by reading from /proc/stat.

Also, vmstat(8):
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.

iostat/mpstat man pages:
%iowait
Show the percentage of time that the CPU or CPUs were
idle during which the system had an outstanding disk I/O
request.

sar(1):
%iowait
Percentage of time that the CPU or CPUs were idle during
which the system had an outstanding disk I/O request.

iowait was initially introduced in 2002 by Rik van Riel in historical
git commit 7b88e5e0bdf25 ("[PATCH] "io wait" process accounting"). The
changelog from akpm reads:

Patch from Rik adds "I/O wait" statistics to /proc/stat.

This allows us to determine how much system time is being spent
awaiting IO completion. This is an important statistic, as it tends to
directly subtract from job completion time.

procps-2.0.9 is OK with this, but doesn't report it.

I vaguely recall there was confusion from users about why the system was
idle when running database workloads. Maybe Rik can remember more
clearly.

Anyway, as you can see, the definition is murky, at best. I don't think
we should overthink it. I agree with the principle of Jens'
patch--let's just not surprise users with a change in behavior.
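
For what it's worth, the operational definition in the scheduler is fairly
simple: a task that blocks with current->in_iowait set bumps its runqueue's
nr_iowait counter, and idle time on a CPU whose runqueue has nr_iowait > 0 is
charged to iowait rather than idle. Roughly, paraphrasing kernel/sched/core.c
and kernel/sched/cputime.c (details vary by version):

	/* in __schedule(), when the previous task blocks: */
	if (prev->in_iowait) {
		atomic_inc(&rq->nr_iowait);
		delayacct_blkio_start();
	}

	/* idle time accounting: charged to iowait while the runqueue
	 * still has tasks that went to sleep in iowait */
	void account_idle_time(u64 cputime)
	{
		u64 *cpustat = kcpustat_this_cpu->cpustat;
		struct rq *rq = this_rq();

		if (atomic_read(&rq->nr_iowait) > 0)
			cpustat[CPUTIME_IOWAIT] += cputime;
		else
			cpustat[CPUTIME_IDLE] += cputime;
	}

So "iowait" is just idle time on a CPU from which at least one task went to
sleep in iowait - it says nothing about the CPU being unusable.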

Cheers,
Jeff