2021-08-08 00:15:47

by Nadav Amit

Subject: [PATCH 0/2] io_uring: bug fixes

From: Nadav Amit <[email protected]>

Two small bug fixes. The first fix is for a real issue that I encountered
on 5.13, which caused my workload not to work with a submission queue.
Apparently, on 5.14 it is only a performance issue (i.e., not a
functionality issue).

The second fix is for a theoretical issue.

I did not cc stable, as I leave this decision to the maintainer.

Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>

Nadav Amit (2):
io_uring: clear TIF_NOTIFY_SIGNAL when running task work
io_uring: Use WRITE_ONCE() when writing to sq_flags

fs/io_uring.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

--
2.25.1


2021-08-08 00:16:58

by Nadav Amit

Subject: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

From: Nadav Amit <[email protected]>

When using SQPOLL, the submission queue polling thread calls
task_work_run() to run queued work. However, when work is added with
TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
set afterwards and is never cleared.

Consequently, when the submission queue polling thread checks
signal_pending(), it will always find a pending signal if
task_work_add() was ever called before.

The impact of this bug might be different on different kernel versions.
It appears that on 5.14 it would only cause unnecessary computation and
prevent the polling thread from sleeping. On 5.13, where the bug was
found, it stops the polling thread from finding newly submitted work.

Instead of task_work_run(), use tracehook_notify_signal(), which clears
TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
current->task_works to avoid a race in which task_works is cleared but
TIF_NOTIFY_SIGNAL is set.

Fixes: 685fe7feedb96 ("io-wq: eliminate the need for a manager thread")
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
fs/io_uring.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5a0fd6bcd318..f39244d35f90 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -78,6 +78,7 @@
#include <linux/task_work.h>
#include <linux/pagemap.h>
#include <linux/io_uring.h>
+#include <linux/tracehook.h>

#define CREATE_TRACE_POINTS
#include <trace/events/io_uring.h>
@@ -2203,9 +2204,9 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)

static inline bool io_run_task_work(void)
{
- if (current->task_works) {
+ if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
__set_current_state(TASK_RUNNING);
- task_work_run();
+ tracehook_notify_signal();
return true;
}

--
2.25.1

2021-08-08 00:18:02

by Nadav Amit

Subject: [PATCH 2/2] io_uring: Use WRITE_ONCE() when writing to sq_flags

From: Nadav Amit <[email protected]>

The compiler must be forbidden from applying unexpected optimizations to
asynchronous writes to user-visible data structures. Without proper
protection, the compiler can tear writes or invent writes that would
confuse userspace.

However, there are writes to sq_flags which are not protected by
WRITE_ONCE(). Use WRITE_ONCE() for these writes.

This is purely a theoretical issue. Presumably, any compiler is very
unlikely to perform such optimizations in practice.

Fixes: 75b28affdd6a ("io_uring: allocate the two rings together")
Cc: Jens Axboe <[email protected]>
Cc: Pavel Begunkov <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
fs/io_uring.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index f39244d35f90..c78b487d5a80 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1491,7 +1491,8 @@ static bool __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
all_flushed = list_empty(&ctx->cq_overflow_list);
if (all_flushed) {
clear_bit(0, &ctx->check_cq_overflow);
- ctx->rings->sq_flags &= ~IORING_SQ_CQ_OVERFLOW;
+ WRITE_ONCE(ctx->rings->sq_flags,
+ ctx->rings->sq_flags & ~IORING_SQ_CQ_OVERFLOW);
}

if (posted)
@@ -1570,7 +1571,9 @@ static bool io_cqring_event_overflow(struct io_ring_ctx *ctx, u64 user_data,
}
if (list_empty(&ctx->cq_overflow_list)) {
set_bit(0, &ctx->check_cq_overflow);
- ctx->rings->sq_flags |= IORING_SQ_CQ_OVERFLOW;
+ WRITE_ONCE(ctx->rings->sq_flags,
+ ctx->rings->sq_flags | IORING_SQ_CQ_OVERFLOW);
+
}
ocqe->cqe.user_data = user_data;
ocqe->cqe.res = res;
@@ -6780,14 +6783,16 @@ static inline void io_ring_set_wakeup_flag(struct io_ring_ctx *ctx)
{
/* Tell userspace we may need a wakeup call */
spin_lock_irq(&ctx->completion_lock);
- ctx->rings->sq_flags |= IORING_SQ_NEED_WAKEUP;
+ WRITE_ONCE(ctx->rings->sq_flags,
+ ctx->rings->sq_flags | IORING_SQ_NEED_WAKEUP);
spin_unlock_irq(&ctx->completion_lock);
}

static inline void io_ring_clear_wakeup_flag(struct io_ring_ctx *ctx)
{
spin_lock_irq(&ctx->completion_lock);
- ctx->rings->sq_flags &= ~IORING_SQ_NEED_WAKEUP;
+ WRITE_ONCE(ctx->rings->sq_flags,
+ ctx->rings->sq_flags & ~IORING_SQ_NEED_WAKEUP);
spin_unlock_irq(&ctx->completion_lock);
}

--
2.25.1

2021-08-08 13:00:04

by Pavel Begunkov

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 8/8/21 1:13 AM, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> When using SQPOLL, the submission queue polling thread calls
> task_work_run() to run queued work. However, when work is added with
> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains

static int io_req_task_work_add(struct io_kiocb *req)
{
...
notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
if (!task_work_add(tsk, &tctx->task_work, notify))
...
}

io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
rather curious who does.


> set afterwards and is never cleared.
>
> Consequently, when the submission queue polling thread checks whether
> signal_pending(), it may always find a pending signal, if
> task_work_add() was ever called before.
>
> The impact of this bug might be different on different kernel versions.
> It appears that on 5.14 it would only cause unnecessary calculation and
> prevent the polling thread from sleeping. On 5.13, where the bug was
> found, it stops the polling thread from finding newly submitted work.
>
> Instead of task_work_run(), use tracehook_notify_signal() that clears
> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
> current->task_works to avoid a race in which task_works is cleared but
> the TIF_NOTIFY_SIGNAL is set.
>
> Fixes: 685fe7feedb96 ("io-wq: eliminate the need for a manager thread")
> Cc: Jens Axboe <[email protected]>
> Cc: Pavel Begunkov <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> fs/io_uring.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5a0fd6bcd318..f39244d35f90 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
> #include <linux/task_work.h>
> #include <linux/pagemap.h>
> #include <linux/io_uring.h>
> +#include <linux/tracehook.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/io_uring.h>
> @@ -2203,9 +2204,9 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
>
> static inline bool io_run_task_work(void)
> {
> - if (current->task_works) {
> + if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
> __set_current_state(TASK_RUNNING);
> - task_work_run();
> + tracehook_notify_signal();
> return true;
> }
>
>

--
Pavel Begunkov

2021-08-08 17:56:27

by Nadav Amit

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work



> On Aug 8, 2021, at 5:55 AM, Pavel Begunkov <[email protected]> wrote:
>
> On 8/8/21 1:13 AM, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> When using SQPOLL, the submission queue polling thread calls
>> task_work_run() to run queued work. However, when work is added with
>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>
> static int io_req_task_work_add(struct io_kiocb *req)
> {
> ...
> notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
> if (!task_work_add(tsk, &tctx->task_work, notify))
> ...
> }
>
> io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
> rather curious who does.

I was saying io-uring, but I meant io-uring in the wider sense:
io_queue_worker_create().

Here is a call trace for when TWA_SIGNAL is used. io_queue_worker_create()
uses TWA_SIGNAL. It is called by io_wqe_dec_running() and does not show up
in the trace due to inlining:

[ 70.540761] Call Trace:
[ 70.541352] dump_stack+0x7d/0x9c
[ 70.541930] task_work_add.cold+0x9/0x12
[ 70.542591] io_wqe_dec_running+0xd6/0xf0
[ 70.543259] io_wq_worker_sleeping+0x3d/0x60
[ 70.544106] schedule+0xa0/0xc0
[ 70.544673] userfaultfd_read_iter+0x2c3/0x790
[ 70.545374] ? wake_up_q+0xa0/0xa0
[ 70.545887] io_iter_do_read+0x1e/0x40
[ 70.546531] io_read+0xdc/0x340
[ 70.547148] ? update_curr+0x72/0x1c0
[ 70.547887] ? update_load_avg+0x7c/0x600
[ 70.548538] ? __switch_to_xtra+0x10a/0x500
[ 70.549264] io_issue_sqe+0xd99/0x1840
[ 70.549887] ? lock_timer_base+0x72/0xa0
[ 70.550516] ? try_to_del_timer_sync+0x54/0x80
[ 70.551224] io_wq_submit_work+0x87/0xb0
[ 70.552001] io_worker_handle_work+0x2b5/0x4b0
[ 70.552705] io_wqe_worker+0xd6/0x2f0
[ 70.553364] ? recalc_sigpending+0x1c/0x50
[ 70.554074] ? io_worker_handle_work+0x4b0/0x4b0
[ 70.554813] ret_from_fork+0x22/0x30

Does it answer your question?

2021-08-09 04:09:55

by Hao Xu

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 2021/8/9 1:31 AM, Nadav Amit wrote:
>
>
>> On Aug 8, 2021, at 5:55 AM, Pavel Begunkov <[email protected]> wrote:
>>
>> On 8/8/21 1:13 AM, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>>
>>> When using SQPOLL, the submission queue polling thread calls
>>> task_work_run() to run queued work. However, when work is added with
>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>
>> static int io_req_task_work_add(struct io_kiocb *req)
>> {
>> ...
>> notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
>> if (!task_work_add(tsk, &tctx->task_work, notify))
>> ...
>> }
>>
>> io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
>> rather curious who does.
>
> I was saying io-uring, but I meant io-uring in the wider sense:
> io_queue_worker_create().
>
> Here is a call trace for when TWA_SIGNAL is used. io_queue_worker_create()
> uses TWA_SIGNAL. It is called by io_wqe_dec_running(), and not shown due
> to inlining:
Hi Nadav, Pavel,
How about making this kind of call use TWA_NONE for the sqthread? Though I
know that for this case there is currently no info to tell whether the
task is the sqthread. I think we shouldn't kick the sqthread.

regards,
Hao
>
> [ 70.540761] Call Trace:
> [ 70.541352] dump_stack+0x7d/0x9c
> [ 70.541930] task_work_add.cold+0x9/0x12
> [ 70.542591] io_wqe_dec_running+0xd6/0xf0
> [ 70.543259] io_wq_worker_sleeping+0x3d/0x60
> [ 70.544106] schedule+0xa0/0xc0
> [ 70.544673] userfaultfd_read_iter+0x2c3/0x790
> [ 70.545374] ? wake_up_q+0xa0/0xa0
> [ 70.545887] io_iter_do_read+0x1e/0x40
> [ 70.546531] io_read+0xdc/0x340
> [ 70.547148] ? update_curr+0x72/0x1c0
> [ 70.547887] ? update_load_avg+0x7c/0x600
> [ 70.548538] ? __switch_to_xtra+0x10a/0x500
> [ 70.549264] io_issue_sqe+0xd99/0x1840
> [ 70.549887] ? lock_timer_base+0x72/0xa0
> [ 70.550516] ? try_to_del_timer_sync+0x54/0x80
> [ 70.551224] io_wq_submit_work+0x87/0xb0
> [ 70.552001] io_worker_handle_work+0x2b5/0x4b0
> [ 70.552705] io_wqe_worker+0xd6/0x2f0
> [ 70.553364] ? recalc_sigpending+0x1c/0x50
> [ 70.554074] ? io_worker_handle_work+0x4b0/0x4b0
> [ 70.554813] ret_from_fork+0x22/0x30
>
> Does it answer your question?
>

2021-08-09 05:08:11

by Nadav Amit

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work



> On Aug 8, 2021, at 9:07 PM, Hao Xu <[email protected]> wrote:
>
> On 2021/8/9 1:31 AM, Nadav Amit wrote:
>>> On Aug 8, 2021, at 5:55 AM, Pavel Begunkov <[email protected]> wrote:
>>>
>>> On 8/8/21 1:13 AM, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>>
>>>> When using SQPOLL, the submission queue polling thread calls
>>>> task_work_run() to run queued work. However, when work is added with
>>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>>
>>> static int io_req_task_work_add(struct io_kiocb *req)
>>> {
>>> ...
>>> notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
>>> if (!task_work_add(tsk, &tctx->task_work, notify))
>>> ...
>>> }
>>>
>>> io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
>>> rather curious who does.
>> I was saying io-uring, but I meant io-uring in the wider sense:
>> io_queue_worker_create().
>> Here is a call trace for when TWA_SIGNAL is used. io_queue_worker_create()
>> uses TWA_SIGNAL. It is called by io_wqe_dec_running(), and not shown due
>> to inlining:
> Hi Nadav, Pavel,
> How about trying to make this kind of call to use TWA_NONE for sqthread,
> though I know for this case currently there is no info to get to know if
> task is sqthread. I think we shouldn't kick sqthread.

It is possible, but it would break the abstractions and propagating
it would be disgusting. Let me give it some thought.

Regardless, I think that this patch should go to 5.14 and stable,
and any solution to avoid kicking the SQ should come on top (to be
safe).

2021-08-09 10:49:11

by Pavel Begunkov

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 8/9/21 5:50 AM, Nadav Amit wrote:
>> On Aug 8, 2021, at 9:07 PM, Hao Xu <[email protected]> wrote:
>> On 2021/8/9 1:31 AM, Nadav Amit wrote:
>>>> On Aug 8, 2021, at 5:55 AM, Pavel Begunkov <[email protected]> wrote:
>>>> On 8/8/21 1:13 AM, Nadav Amit wrote:
>>>>> From: Nadav Amit <[email protected]>
>>>>>
>>>>> When using SQPOLL, the submission queue polling thread calls
>>>>> task_work_run() to run queued work. However, when work is added with
>>>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>>>
>>>> static int io_req_task_work_add(struct io_kiocb *req)
>>>> {
>>>> ...
>>>> notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
>>>> if (!task_work_add(tsk, &tctx->task_work, notify))
>>>> ...
>>>> }
>>>>
>>>> io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
>>>> rather curious who does.
>>> I was saying io-uring, but I meant io-uring in the wider sense:
>>> io_queue_worker_create().
>>> Here is a call trace for when TWA_SIGNAL is used. io_queue_worker_create()
>>> uses TWA_SIGNAL. It is called by io_wqe_dec_running(), and not shown due
>>> to inlining:
>> Hi Nadav, Pavel,
>> How about trying to make this kind of call to use TWA_NONE for sqthread,
>> though I know for this case currently there is no info to get to know if
>> task is sqthread. I think we shouldn't kick sqthread.
>
> It is possible, but it would break the abstractions and propagating
> it would be disgusting. Let me give it some thought.

We already do io_wq_enqueue() only from the right task context,
so in theory instead of pushing it through tw, we can create
a new thread on the spot. Though, would need to be careful
with locking.

Anyway, agree that it's better to be left for next.

> Regardless, I think that this patch should go to 5.14 and stable,
> and any solution to avoid kicking the SQ should come on top (to be
> safe).

--
Pavel Begunkov

2021-08-09 11:59:13

by Pavel Begunkov

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 8/8/21 6:31 PM, Nadav Amit wrote:
>> On Aug 8, 2021, at 5:55 AM, Pavel Begunkov <[email protected]> wrote:
>> On 8/8/21 1:13 AM, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>>
>>> When using SQPOLL, the submission queue polling thread calls
>>> task_work_run() to run queued work. However, when work is added with
>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>
>> static int io_req_task_work_add(struct io_kiocb *req)
>> {
>> ...
>> notify = (req->ctx->flags & IORING_SETUP_SQPOLL) ? TWA_NONE : TWA_SIGNAL;
>> if (!task_work_add(tsk, &tctx->task_work, notify))
>> ...
>> }
>>
>> io_uring doesn't set TIF_NOTIFY_SIGNAL for SQPOLL. But if you see it, I'm
>> rather curious who does.
>
> I was saying io-uring, but I meant io-uring in the wider sense:
> io_queue_worker_create().
>
> Here is a call trace for when TWA_SIGNAL is used. io_queue_worker_create()
> uses TWA_SIGNAL. It is called by io_wqe_dec_running(), and not shown due
> to inlining:
>
> [ 70.540761] Call Trace:
> [ 70.541352] dump_stack+0x7d/0x9c
> [ 70.541930] task_work_add.cold+0x9/0x12
> [ 70.542591] io_wqe_dec_running+0xd6/0xf0
> [ 70.543259] io_wq_worker_sleeping+0x3d/0x60
> [ 70.544106] schedule+0xa0/0xc0
> [ 70.544673] userfaultfd_read_iter+0x2c3/0x790
> [ 70.545374] ? wake_up_q+0xa0/0xa0
> [ 70.545887] io_iter_do_read+0x1e/0x40
> [ 70.546531] io_read+0xdc/0x340
> [ 70.547148] ? update_curr+0x72/0x1c0
> [ 70.547887] ? update_load_avg+0x7c/0x600
> [ 70.548538] ? __switch_to_xtra+0x10a/0x500
> [ 70.549264] io_issue_sqe+0xd99/0x1840
> [ 70.549887] ? lock_timer_base+0x72/0xa0
> [ 70.550516] ? try_to_del_timer_sync+0x54/0x80
> [ 70.551224] io_wq_submit_work+0x87/0xb0
> [ 70.552001] io_worker_handle_work+0x2b5/0x4b0
> [ 70.552705] io_wqe_worker+0xd6/0x2f0
> [ 70.553364] ? recalc_sigpending+0x1c/0x50
> [ 70.554074] ? io_worker_handle_work+0x4b0/0x4b0
> [ 70.554813] ret_from_fork+0x22/0x30
>
> Does it answer your question?

Pretty much, thanks

--
Pavel Begunkov

2021-08-09 13:55:53

by Jens Axboe

Subject: Re: [PATCH 0/2] io_uring: bug fixes

On 8/7/21 6:13 PM, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> Two small bug fixes. The first fix is for a real issue that encountered
> on 5.13, which caused my workload not to work with a submission queue.
> Apparently, on 5.14 it is only a performance issue (i.e., not a
> functionality issue).
>
> The second fix is for a theoretical issue.
>
> I did not cc stable, as I leave this decision to the maintainer.

Applied, thanks!

--
Jens Axboe

2021-08-10 02:09:05

by Olivier Langlois

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On Sat, 2021-08-07 at 17:13 -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> When using SQPOLL, the submission queue polling thread calls
> task_work_run() to run queued work. However, when work is added with
> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
> set afterwards and is never cleared.
>
> Consequently, when the submission queue polling thread checks whether
> signal_pending(), it may always find a pending signal, if
> task_work_add() was ever called before.
>
> The impact of this bug might be different on different kernel versions.
> It appears that on 5.14 it would only cause unnecessary calculation and
> prevent the polling thread from sleeping. On 5.13, where the bug was
> found, it stops the polling thread from finding newly submitted work.
>
> Instead of task_work_run(), use tracehook_notify_signal() that clears
> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
> current->task_works to avoid a race in which task_works is cleared but
> the TIF_NOTIFY_SIGNAL is set.
>
> Fixes: 685fe7feedb96 ("io-wq: eliminate the need for a manager thread")
> Cc: Jens Axboe <[email protected]>
> Cc: Pavel Begunkov <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> fs/io_uring.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 5a0fd6bcd318..f39244d35f90 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -78,6 +78,7 @@
> #include <linux/task_work.h>
> #include <linux/pagemap.h>
> #include <linux/io_uring.h>
> +#include <linux/tracehook.h>
>
> #define CREATE_TRACE_POINTS
> #include <trace/events/io_uring.h>
> @@ -2203,9 +2204,9 @@ static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
>
> static inline bool io_run_task_work(void)
> {
> - if (current->task_works) {
> + if (test_thread_flag(TIF_NOTIFY_SIGNAL) || current->task_works) {
> __set_current_state(TASK_RUNNING);
> - task_work_run();
> + tracehook_notify_signal();
> return true;
> }
>

thx a lot for this patch!

This explains what I am seeing here:
https://lore.kernel.org/io-uring/[email protected]/

I was under the impression that task_work_run() was clearing
TIF_NOTIFY_SIGNAL.

your patch made me realize that it does not...


2021-08-10 11:23:48

by Nadav Amit

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work


> On Aug 9, 2021, at 2:48 PM, Olivier Langlois <[email protected]> wrote:
>
> On Sat, 2021-08-07 at 17:13 -0700, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> When using SQPOLL, the submission queue polling thread calls
>> task_work_run() to run queued work. However, when work is added with
>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>> set afterwards and is never cleared.
>>
>> Consequently, when the submission queue polling thread checks whether
>> signal_pending(), it may always find a pending signal, if
>> task_work_add() was ever called before.
>>
>> The impact of this bug might be different on different kernel versions.
>> It appears that on 5.14 it would only cause unnecessary calculation and
>> prevent the polling thread from sleeping. On 5.13, where the bug was
>> found, it stops the polling thread from finding newly submitted work.
>>
>> Instead of task_work_run(), use tracehook_notify_signal() that clears
>> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
>> current->task_works to avoid a race in which task_works is cleared but
>> the TIF_NOTIFY_SIGNAL is set.
>
> thx a lot for this patch!
>
> This explains what I am seeing here:
> https://lore.kernel.org/io-uring/[email protected]/
>
> I was under the impression that task_work_run() was clearing
> TIF_NOTIFY_SIGNAL.
>
> your patch made me realize that it does not…

Happy it could help.

Unfortunately, there seems to be yet another issue (unless my code
somehow caused it). It seems that when SQPOLL is used, there are cases
in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
never goes down to zero.

Debugging... (while also trying to make some progress with my code)

2021-08-10 16:34:36

by Olivier Langlois

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On Tue, 2021-08-10 at 01:28 -0700, Nadav Amit wrote:
>
> Happy it could help.
>
> Unfortunately, there seems to be yet another issue (unless my code
> somehow caused it). It seems that when SQPOLL is used, there are
> cases
> in which we get stuck in io_uring_cancel_sqpoll() when
> tctx_inflight()
> never goes down to zero.
>
> Debugging... (while also trying to make some progress with my code)

You are onto something. io_uring is starting to be very solid, but it
isn't 100% flawless yet.

I am a heavy user of SQPOLL, which now runs flawlessly for me with 5.13.9
(it was running flawlessly since 5.12, minus a few patches I submitted
recently) with my simple use-case (my SQPOLL thread isn't spawning any
threads like in your use-case).

The best is yet to come. I'm salivating over all the performance
optimizations that Jens and Pavel have been putting in place lately...


2021-08-10 21:34:41

by Pavel Begunkov

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 8/10/21 9:28 AM, Nadav Amit wrote:
>> On Aug 9, 2021, at 2:48 PM, Olivier Langlois <[email protected]> wrote:
>> On Sat, 2021-08-07 at 17:13 -0700, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>>
>>> When using SQPOLL, the submission queue polling thread calls
>>> task_work_run() to run queued work. However, when work is added with
>>> TWA_SIGNAL - as done by io_uring itself - the TIF_NOTIFY_SIGNAL remains
>>> set afterwards and is never cleared.
>>>
>>> Consequently, when the submission queue polling thread checks whether
>>> signal_pending(), it may always find a pending signal, if
>>> task_work_add() was ever called before.
>>>
>>> The impact of this bug might be different on different kernel versions.
>>> It appears that on 5.14 it would only cause unnecessary calculation and
>>> prevent the polling thread from sleeping. On 5.13, where the bug was
>>> found, it stops the polling thread from finding newly submitted work.
>>>
>>> Instead of task_work_run(), use tracehook_notify_signal() that clears
>>> TIF_NOTIFY_SIGNAL. Test for TIF_NOTIFY_SIGNAL in addition to
>>> current->task_works to avoid a race in which task_works is cleared but
>>> the TIF_NOTIFY_SIGNAL is set.
>>
>> thx a lot for this patch!
>>
>> This explains what I am seeing here:
>> https://lore.kernel.org/io-uring/[email protected]/
>>
>> I was under the impression that task_work_run() was clearing
>> TIF_NOTIFY_SIGNAL.
>>
>> your patch made me realize that it does not…
>
> Happy it could help.
>
> Unfortunately, there seems to be yet another issue (unless my code
> somehow caused it). It seems that when SQPOLL is used, there are cases
> in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
> never goes down to zero.
>
> Debugging... (while also trying to make some progress with my code)

It's most likely because a request has been lost (mis-refcounted).
Let us know if you need any help. Would be great to solve it for 5.14.
quick tips:

1) if not already, try out Jens' 5.14 branch
git://git.kernel.dk/linux-block io_uring-5.14

2) try to characterise the io_uring use pattern. Poll requests?
Read/write requests? Send/recv? Filesystem vs bdev vs sockets?

If it's easily reproducible, you can match io_alloc_req() with it
getting into io_dismantle_req().

--
Pavel Begunkov

2021-08-11 02:37:58

by Nadav Amit

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work



> On Aug 10, 2021, at 2:32 PM, Pavel Begunkov <[email protected]> wrote:
>
> On 8/10/21 9:28 AM, Nadav Amit wrote:
>>
>> Unfortunately, there seems to be yet another issue (unless my code
>> somehow caused it). It seems that when SQPOLL is used, there are cases
>> in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
>> never goes down to zero.
>>
>> Debugging... (while also trying to make some progress with my code)
>
> It's most likely because a request has been lost (mis-refcounted).
> Let us know if you need any help. Would be great to solve it for 5.14.
> quick tips:
>
> 1) if not already, try out Jens' 5.14 branch
> git://git.kernel.dk/linux-block io_uring-5.14
>
> 2) try to characterise the io_uring use pattern. Poll requests?
> Read/write requests? Send/recv? Filesystem vs bdev vs sockets?
>
> If easily reproducible, you can match io_alloc_req() with it
> getting into io_dismantle_req();

So actually the problem is more of a missing IO-uring functionality that I need. When an I/O is queued for async completion (i.e., after returning -EIOCBQUEUED), there should be a way for io-uring to cancel these I/Os if needed. Otherwise they might potentially never complete, as happens in my use-case.

AIO has ki_cancel() for this matter. So I presume the proper solution would be to move ki_cancel() from aio_kiocb to kiocb so it can be used by both io-uring and aio. And then - to use this infrastructure.

But it is messy. There is already a bug in the (few) uses of kiocb_set_cancel_fn() that blindly assume AIO is used and not IO-uring. Then, I am not sure about some things in the AIO code. Oh boy. I’ll work on an RFC.

2021-08-11 02:55:07

by Jens Axboe

Subject: Re: [PATCH 1/2] io_uring: clear TIF_NOTIFY_SIGNAL when running task work

On 8/10/21 8:33 PM, Nadav Amit wrote:
>
>
>> On Aug 10, 2021, at 2:32 PM, Pavel Begunkov <[email protected]> wrote:
>>
>> On 8/10/21 9:28 AM, Nadav Amit wrote:
>>>
>>> Unfortunately, there seems to be yet another issue (unless my code
>>> somehow caused it). It seems that when SQPOLL is used, there are cases
>>> in which we get stuck in io_uring_cancel_sqpoll() when tctx_inflight()
>>> never goes down to zero.
>>>
>>> Debugging... (while also trying to make some progress with my code)
>>
>> It's most likely because a request has been lost (mis-refcounted).
>> Let us know if you need any help. Would be great to solve it for 5.14.
>> quick tips:
>>
>> 1) if not already, try out Jens' 5.14 branch
>> git://git.kernel.dk/linux-block io_uring-5.14
>>
>> 2) try to characterise the io_uring use pattern. Poll requests?
>> Read/write requests? Send/recv? Filesystem vs bdev vs sockets?
>>
>> If easily reproducible, you can match io_alloc_req() with it
>> getting into io_dismantle_req();
>
> So actually the problem is more of a missing IO-uring functionality
> that I need. When an I/O is queued for async completion (i.e., after
> returning -EIOCBQUEUED), there should be a way for io-uring to cancel
> these I/Os if needed.

There's no way to cancel file/bdev related IO, and there likely never
will be. That's basically the only exception; everything else can be
canceled pretty easily. Much has been written on why that is the
case (by myself included), but it boils down to proper hardware
support, which we'll likely never have as it's not a well-tested
path. For other kinds of async IO, we're either waiting in poll (which is
trivially cancellable) or in an async thread (which is also easily
cancellable). For DMA+irq driven block storage, we'd need to be able to
reliably cancel on the hardware side, to prevent errant DMA after the
fact.

None of this is really io_uring specific, io_uring just suffers from the
same limitations as others would (or are).

> Otherwise they might potentially never complete, as happens in my
> use-case.

If you see that, that is most certainly a bug. While bdev/reg file IO
can't really be canceled, they all have the property that they complete
in finite time. Either the IO completes normally in a "short" amount of
time, or a timeout will cancel it and complete it in error. There are no
unbounded execution times for uncancellable IO.

> AIO has ki_cancel() for this matter. So I presume the proper solution
> would be to move ki_cancel() from aio_kiocb to kiocb so it can be used
> by both io-uring and aio. And then - to use this infrastructure.

There is no infrastructure, I'm afraid. ki_cancel() is just a random hook
that nobody (outside of USB gadget??) ever implemented or used.

> But it is messy. There is already a bug in the (few) uses of
> kiocb_set_cancel_fn() that blindly assume AIO is used and not
> IO-uring. Then, I am not sure about some things in the AIO code. Oh
> boy. I’ll work on an RFC.

ki_cancel is a non-starter; it doesn't even work for the single case
that it's intended for, and I'm actually surprised it hasn't been
removed yet. It's one of those things that someone added a hook for
that never really grew into something useful.

--
Jens Axboe

2021-08-11 05:44:07

by Nadav Amit

Subject: I/O cancellation in io-uring (was: io_uring: clear TIF_NOTIFY_SIGNAL ...)


> On Aug 10, 2021, at 7:51 PM, Jens Axboe <[email protected]> wrote:
>
> There's no way to cancel file/bdev related IO, and there likely never
> will be. That's basically the only exception, everything else can get
> canceled pretty easily. Many things can be written on why that is the
> case, and they have (myself included), but it boils down to proper
> hardware support which we'll likely never have as it's not a well tested
> path. For other kind of async IO, we're either waiting in poll (which is
> trivially cancellable) or in an async thread (which is also easily
> cancellable). For DMA+irq driven block storage, we'd need to be able to
> reliably cancel on the hardware side, to prevent errant DMA after the
> fact.
>
> None of this is really io_uring specific, io_uring just suffers from the
> same limitations as others would (or are).
>
>> Otherwise they might potentially never complete, as happens in my
>> use-case.
>
> If you see that, that is most certainly a bug. While bdev/reg file IO
> can't really be canceled, they all have the property that they complete
> in finite time. Either the IO completes normally in a "short" amount of
> time, or a timeout will cancel it and complete it in error. There are no
> unbounded execution times for uncancellable IO.

I understand your point that hardware reads/writes cannot easily be cancelled.
(well, the buffers can be unmapped from the IOMMU tables, but let's put this
discussion aside).

Yet the question is whether reads/writes from special files such as pipes,
eventfd, signalfd, fuse should be cancellable. Obviously, it is always
possible to issue a blocking read/write from a worker thread. Yet, there are
inherent overheads that are associated with this design, specifically
context-switches. While the overhead of a context-switch is not as high as
portrayed by some, it is still high for low latency use-cases.

There is a potential alternative, however. When a write to a pipe is
performed, or when an event takes place or a signal is sent, queued io-uring
reads can be fulfilled immediately, without a context-switch to a worker. I
specifically want to fulfill userfaultfd reads and notify userspace of
page-faults in such a manner. I do not have the numbers in front of me, but
doing so shows considerable performance improvement.

To allow such use-cases, cancellation of the read/write is needed. A read from
a pipe might never complete if there is no further writes to the pipe.
Cancellation is not hard to implement for such cases (it's only the mess with
the existing AIO's ki_cancel() that bothers me, but as you said - it is a
single use-case).

Admittedly, there are no such use-cases in the kernel today, but arguably,
this is due to the lack of infrastructure. I see no alternative which is as
performant as the one I propose here. Using poll+read or any other option
will have unwarranted overhead.

If an RFC might convince you, or a more mainstream use-case such as queued
pipe reads would, I can work on one to try to get something like this in.