2019-10-25 19:36:49

by Pavel Begunkov

Subject: [BUG][RFC] Miscalculated inflight counter in io_uring

In the case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:

The @inflight count returned by io_submit_sqes() is the number of
entries picked up from the sqring, including those that have already
completed or failed. Only requests that did not fail are placed into
@poll_list.

io_sq_thread() then tries to poll for @inflight events, even though
failed requests never appear in @poll_list. Thus, it will think there
is always something to poll for (i.e. @inflight > 0).

There are several issues with this:
1. io_sq_thread() won't ever sleep
2. io_sq_thread() may be left running and actively polling even after
the user process is destroyed
3. the same goes for the mm_struct with all the vmas of the user process

TL;DR:
While @inflight > 0, io_sq_thread() won't put @cur_mm, which blocks
recycling of the vmas used for the rings' mapping, and those hold a
refcount on io_uring's struct file. Thus, io_uring_release() is never
called, and neither is kthread_{park,stop}(). All of this applies when
the user process hasn't unmapped the rings.
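
To make the accounting concrete, here is a toy userspace model of the
miscount (hypothetical code, not the kernel's):

```
/* Toy model of the broken accounting; not kernel code. */
#include <stdio.h>

#define SUBMITTED 8	/* entries picked up from the sqring */
#define FAILED    3	/* of those, failed at submission time */

int main(void)
{
	int inflight = 0;
	int poll_list_len = 0;

	/* io_submit_sqes() counts everything it picked up... */
	inflight += SUBMITTED;
	/* ...but only requests that didn't fail enter @poll_list. */
	poll_list_len += SUBMITTED - FAILED;

	/* io_sq_thread(): try to reap @inflight completions. */
	while (inflight > 0 && poll_list_len > 0) {
		poll_list_len--;	/* reap one polled completion */
		inflight--;
	}

	/* @poll_list is empty, but @inflight never reaches zero. */
	printf("inflight stuck at %d\n", inflight);	/* prints 3 */
	return 0;
}
```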


I'm not sure what the best way to fix it is:
1. Try to put failed requests into @poll_list as well (grabbing the
mutex); see the sketch after this list.

2. Test for the zero-inflight case by comparing the sq and cq counters,
something like:
```
if (nr_polled == 0) {
	lock(comp_lock);
	if (cached_sq_head == cached_cq_tail)
		inflight = 0;
	unlock(comp_lock);
}
```
But that adds extra spinlock locking in the fast path. And it's unsafe
to use the non-cached heads/tails, as they could be maliciously changed
by userspace.

3. Count failed requests and subtract them (probably needs atomics or
other synchronisation).

4. Something else?
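
For option 1, in the same toy model as above (again hypothetical, not
kernel code), linking failed requests into @poll_list as well lets the
reap loop drain @inflight for them like any other completion:

```
/* Toy model of option 1: failed requests also enter @poll_list. */
#include <stdio.h>

#define SUBMITTED 8
#define FAILED    3

int main(void)
{
	int inflight = 0;
	int poll_list_len = 0;

	inflight += SUBMITTED;
	/* Under the uring mutex, link even failed requests into
	 * @poll_list, marked as already completed. */
	poll_list_len += SUBMITTED;

	while (inflight > 0 && poll_list_len > 0) {
		/* Failed entries are found already completed and are
		 * reaped immediately; the rest are polled for. */
		poll_list_len--;
		inflight--;
	}

	printf("inflight drained to %d\n", inflight);	/* prints 0 */
	return 0;
}
```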


--
Yours sincerely,
Pavel Begunkov



2019-10-25 20:41:17

by Jens Axboe

Subject: Re: [BUG][RFC] Miscalculated inflight counter in io_uring

On 10/25/19 3:48 AM, Pavel Begunkov wrote:
> In the case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>
> The @inflight count returned by io_submit_sqes() is the number of
> entries picked up from the sqring, including those that have already
> completed or failed. Only requests that did not fail are placed into
> @poll_list.
>
> io_sq_thread() then tries to poll for @inflight events, even though
> failed requests never appear in @poll_list. Thus, it will think there
> is always something to poll for (i.e. @inflight > 0).
>
> There are several issues with this:
> 1. io_sq_thread() won't ever sleep
> 2. io_sq_thread() may be left running and actively polling even after
> the user process is destroyed
> 3. the same goes for the mm_struct with all the vmas of the user process
>
> TL;DR:
> While @inflight > 0, io_sq_thread() won't put @cur_mm, which blocks
> recycling of the vmas used for the rings' mapping, and those hold a
> refcount on io_uring's struct file. Thus, io_uring_release() is never
> called, and neither is kthread_{park,stop}(). All of this applies when
> the user process hasn't unmapped the rings.
>
>
> I'm not sure what the best way to fix it is:
> 1. Try to put failed requests into @poll_list as well (grabbing the
> mutex).
>
> 2. Test for the zero-inflight case by comparing the sq and cq counters,
> something like:
> ```
> if (nr_polled == 0) {
> 	lock(comp_lock);
> 	if (cached_sq_head == cached_cq_tail)
> 		inflight = 0;
> 	unlock(comp_lock);
> }
> ```
> But that adds extra spinlock locking in the fast path. And it's unsafe
> to use the non-cached heads/tails, as they could be maliciously changed
> by userspace.
>
> 3. Count failed requests and subtract them (probably needs atomics or
> other synchronisation).
>
> 4. Something else?

Can we just look at the completion count? Ala:

prev_tail = ctx->cached_cq_tail;
inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL,
			   mm_fault);
if (prev_tail != ctx->cached_cq_tail)
	inflight -= (ctx->cached_cq_tail - prev_tail);

or something like that.

--
Jens Axboe

2019-10-25 20:43:42

by Pavel Begunkov

Subject: Re: [BUG][RFC] Miscalculated inflight counter in io_uring

On 25/10/2019 18:32, Jens Axboe wrote:
> On 10/25/19 3:48 AM, Pavel Begunkov wrote:
>> In the case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>>
>> The @inflight count returned by io_submit_sqes() is the number of
>> entries picked up from the sqring, including those that have already
>> completed or failed. Only requests that did not fail are placed into
>> @poll_list.
>>
>> io_sq_thread() then tries to poll for @inflight events, even though
>> failed requests never appear in @poll_list. Thus, it will think there
>> is always something to poll for (i.e. @inflight > 0).
>>
>> There are several issues with this:
>> 1. io_sq_thread() won't ever sleep
>> 2. io_sq_thread() may be left running and actively polling even after
>> the user process is destroyed
>> 3. the same goes for the mm_struct with all the vmas of the user process
>>
>> TL;DR:
>> While @inflight > 0, io_sq_thread() won't put @cur_mm, which blocks
>> recycling of the vmas used for the rings' mapping, and those hold a
>> refcount on io_uring's struct file. Thus, io_uring_release() is never
>> called, and neither is kthread_{park,stop}(). All of this applies when
>> the user process hasn't unmapped the rings.
>>
>>
>> I'm not sure what the best way to fix it is:
>> 1. Try to put failed requests into @poll_list as well (grabbing the
>> mutex).
>>
>> 2. Test for the zero-inflight case by comparing the sq and cq counters,
>> something like:
>> ```
>> if (nr_polled == 0) {
>> 	lock(comp_lock);
>> 	if (cached_sq_head == cached_cq_tail)
>> 		inflight = 0;
>> 	unlock(comp_lock);
>> }
>> ```
>> But that adds extra spinlock locking in the fast path. And it's unsafe
>> to use the non-cached heads/tails, as they could be maliciously changed
>> by userspace.
>>
>> 3. Count failed requests and subtract them (probably needs atomics or
>> other synchronisation).
>>
>> 4. Something else?
>
> Can we just look at the completion count? Ala:
>
> prev_tail = ctx->cached_cq_tail;
> inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL,
> 			   mm_fault);
> if (prev_tail != ctx->cached_cq_tail)
> 	inflight -= (ctx->cached_cq_tail - prev_tail);
>
> or something like that.
>

I don't think so:
1. @cached_cq_tail is protected by @completion_lock (right?). You never
know what happens when you violate the memory model.
2. If something has successfully completed by that time, we again get
the wrong number.

Basically, the invariant is:
inflight = (cached_sq_head - cached_cq_tail) + len(poll_list)
Maybe you can figure out something from this.

Idea 1:
How about counting failed requests and subtracting them? As they may
fail asynchronously, that needs synchronisation, e.g. atomic_add() on
failure (the slow path) and atomic_load() in the kthread (the fast
path).
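
A rough userspace sketch of that scheme, with C11 atomics standing in
for the kernel's atomic API (all names here are hypothetical, and an
atomic exchange is used so that the read also resets the counter):

```
/* Hypothetical sketch of idea 1; C11 atomics stand in for atomic_t. */
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint nr_failed;	/* bumped on the failure slow path */

/* Slow path: a request failed asynchronously, account for it. */
static void req_failed(void)
{
	atomic_fetch_add_explicit(&nr_failed, 1, memory_order_relaxed);
}

/* Fast path in the sq thread: fold the failed count into @inflight. */
static unsigned int drain_failed(unsigned int inflight)
{
	unsigned int failed;

	/* Claim everything accumulated so far and reset the counter. */
	failed = atomic_exchange_explicit(&nr_failed, 0,
					  memory_order_relaxed);
	return inflight - failed;
}

int main(void)
{
	unsigned int inflight = 8;	/* picked up from the sqring */

	req_failed();
	req_failed();
	req_failed();

	inflight = drain_failed(inflight);
	printf("inflight = %u\n", inflight);	/* prints 5 */
	return 0;
}
```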


BTW, I tested the patch below earlier; it fixes the issue, but it is
racy for the same reason as in point 1 above.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 32f6598ecae9..0353d374a0d5 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2650,6 +2650,10 @@ static int io_sq_thread(void *data)
 		bool mm_fault = false;
 		unsigned int to_submit;
 
+		if (ctx->cached_sq_head == ctx->cached_cq_tail +
+					   ctx->rings->sq_dropped)
+			inflight = 0;
+
 		if (inflight) {
 			unsigned nr_events = 0;


--
Yours sincerely,
Pavel Begunkov



2019-10-25 20:43:57

by Jens Axboe

Subject: Re: [BUG][RFC] Miscalculated inflight counter in io_uring

On 10/25/19 10:08 AM, Pavel Begunkov wrote:
> On 25/10/2019 18:32, Jens Axboe wrote:
>> On 10/25/19 3:48 AM, Pavel Begunkov wrote:
>>> In the case of IORING_SETUP_SQPOLL | IORING_SETUP_IOPOLL:
>>>
>>> The @inflight count returned by io_submit_sqes() is the number of
>>> entries picked up from the sqring, including those that have already
>>> completed or failed. Only requests that did not fail are placed into
>>> @poll_list.
>>>
>>> io_sq_thread() then tries to poll for @inflight events, even though
>>> failed requests never appear in @poll_list. Thus, it will think there
>>> is always something to poll for (i.e. @inflight > 0).
>>>
>>> There are several issues with this:
>>> 1. io_sq_thread() won't ever sleep
>>> 2. io_sq_thread() may be left running and actively polling even after
>>> the user process is destroyed
>>> 3. the same goes for the mm_struct with all the vmas of the user process
>>>
>>> TL;DR:
>>> While @inflight > 0, io_sq_thread() won't put @cur_mm, which blocks
>>> recycling of the vmas used for the rings' mapping, and those hold a
>>> refcount on io_uring's struct file. Thus, io_uring_release() is never
>>> called, and neither is kthread_{park,stop}(). All of this applies when
>>> the user process hasn't unmapped the rings.
>>>
>>>
>>> I'm not sure what the best way to fix it is:
>>> 1. Try to put failed requests into @poll_list as well (grabbing the
>>> mutex).
>>>
>>> 2. Test for the zero-inflight case by comparing the sq and cq counters,
>>> something like:
>>> ```
>>> if (nr_polled == 0) {
>>> 	lock(comp_lock);
>>> 	if (cached_sq_head == cached_cq_tail)
>>> 		inflight = 0;
>>> 	unlock(comp_lock);
>>> }
>>> ```
>>> But that adds extra spinlock locking in the fast path. And it's unsafe
>>> to use the non-cached heads/tails, as they could be maliciously changed
>>> by userspace.
>>>
>>> 3. Count failed requests and subtract them (probably needs atomics or
>>> other synchronisation).
>>>
>>> 4. Something else?
>>
>> Can we just look at the completion count? Ala:
>>
>> prev_tail = ctx->cached_cq_tail;
>> inflight += io_submit_sqes(ctx, to_submit, cur_mm != NULL,
>> 			   mm_fault);
>> if (prev_tail != ctx->cached_cq_tail)
>> 	inflight -= (ctx->cached_cq_tail - prev_tail);
>>
>> or something like that.
>>
>
> I don't think so:
> 1. @cached_cq_tail is protected by @completion_lock (right?). You never
> know what happens when you violate the memory model.
> 2. If something has successfully completed by that time, we again get
> the wrong number.
>
> Basically, the invariant is:
> inflight = (cached_sq_head - cached_cq_tail) + len(poll_list)
> Maybe you can figure out something from this.
>
> Idea 1:
> How about counting failed requests and subtracting them? As they may
> fail asynchronously, that needs synchronisation, e.g. atomic_add() on
> failure (the slow path) and atomic_load() in the kthread (the fast
> path).

How about we just go way simpler? We only really care about whether we
have any inflight requests to poll for, or none at all; the exact count
doesn't matter. So if we check poll_list and it's empty, we just reset
inflight to zero. That should handle it without any extra weird
accounting or racy logic.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9c128df3d675..afc463a06fdc 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -874,19 +874,12 @@ static void io_iopoll_reap_events(struct io_ring_ctx *ctx)
 	mutex_unlock(&ctx->uring_lock);
 }
 
-static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
-			   long min)
+static int __io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			     long min)
 {
-	int iters, ret = 0;
-
-	/*
-	 * We disallow the app entering submit/complete with polling, but we
-	 * still need to lock the ring to prevent racing with polled issue
-	 * that got punted to a workqueue.
-	 */
-	mutex_lock(&ctx->uring_lock);
+	int iters, ret;
 
-	iters = 0;
+	ret = iters = 0;
 	do {
 		int tmin = 0;
 
@@ -922,6 +915,21 @@ static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
 		ret = 0;
 	} while (min && !*nr_events && !need_resched());
 
+	return ret;
+}
+
+static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned *nr_events,
+			   long min)
+{
+	int ret;
+
+	/*
+	 * We disallow the app entering submit/complete with polling, but we
+	 * still need to lock the ring to prevent racing with polled issue
+	 * that got punted to a workqueue.
+	 */
+	mutex_lock(&ctx->uring_lock);
+	ret = __io_iopoll_check(ctx, nr_events, min);
 	mutex_unlock(&ctx->uring_lock);
 	return ret;
 }
@@ -2658,7 +2666,12 @@ static int io_sq_thread(void *data)
 			unsigned nr_events = 0;
 
 			if (ctx->flags & IORING_SETUP_IOPOLL) {
-				io_iopoll_check(ctx, &nr_events, 0);
+				mutex_lock(&ctx->uring_lock);
+				if (!list_empty(&ctx->poll_list))
+					__io_iopoll_check(ctx, &nr_events, 0);
+				else
+					inflight = 0;
+				mutex_unlock(&ctx->uring_lock);
 			} else {
 				/*
 				 * Normal IO, just pretend everything completed.
--
Jens Axboe