2023-11-27 21:54:40

by Jann Horn

[permalink] [raw]
Subject: io_uring: risky use of task work, especially wrt fdget()

Hi!

I noticed something that I think does not currently cause any
significant security issues, but could be problematic in the future:

io_uring sometimes processes task work in the middle of syscalls,
including between fdget() and fdput(). My understanding of task work
is that it is expected to run in a context similar to directly at
syscall entry/exit: task context, no locks held, sleeping is okay, and
it doesn't execute in the middle of some syscall that expects private
state of the task_struct to stay the same.

An example of another user of task work is the keyring subsystem,
which does task_work_add() in keyctl_session_to_parent() to change the
cred pointers of another task.

Several places in io_uring process task work while holding an fdget()
reference to some file descriptor. For example, the io_uring_enter
syscall handler calls io_iopoll_check() while the io_ring_ctx is only
referenced via fdget(). This means that if there were another kernel
subsystem that uses task work to close file descriptors, io_uring
would become unsafe. And io_uring does _almost_ that itself, I think:
io_queue_worker_create() can be run on a workqueue, and uses task work
to launch a worker thread from the context of a userspace thread; and
this worker thread can then accept commands to close file descriptors.
Except it doesn't accept commands to close io_uring file descriptors.

A closer miss might be io_sync_cancel(), which holds a reference to
some normal file with fdget()/fdput() while calling into
io_run_task_work_sig(). However, from what I can tell, the only things
that are actually done with this file pointer are pointer comparisons,
so this also shouldn't have significant security impact.

Would it make sense to use fget()/fput() instead of fdget()/fdput() in
io_sync_cancel(), io_uring_enter and io_uring_register? These
functions probably usually run in multithreaded environments anyway
(thanks to the io_uring worker threads), so I would think fdget()
shouldn't bring significant performance savings here?


2023-11-28 15:58:17

by Jens Axboe

[permalink] [raw]
Subject: Re: io_uring: risky use of task work, especially wrt fdget()

On 11/27/23 2:53 PM, Jann Horn wrote:
> Hi!
>
> I noticed something that I think does not currently cause any
> significant security issues, but could be problematic in the future:
>
> io_uring sometimes processes task work in the middle of syscalls,
> including between fdget() and fdput(). My understanding of task work
> is that it is expected to run in a context similar to directly at
> syscall entry/exit: task context, no locks held, sleeping is okay, and
> it doesn't execute in the middle of some syscall that expects private
> state of the task_struct to stay the same.
>
> An example of another user of task work is the keyring subsystem,
> which does task_work_add() in keyctl_session_to_parent() to change the
> cred pointers of another task.
>
> Several places in io_uring process task work while holding an fdget()
> reference to some file descriptor. For example, the io_uring_enter
> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
> referenced via fdget(). This means that if there were another kernel
> subsystem that uses task work to close file descriptors, io_uring
> would become unsafe. And io_uring does _almost_ that itself, I think:
> io_queue_worker_create() can be run on a workqueue, and uses task work
> to launch a worker thread from the context of a userspace thread; and
> this worker thread can then accept commands to close file descriptors.
> Except it doesn't accept commands to close io_uring file descriptors.
>
> A closer miss might be io_sync_cancel(), which holds a reference to
> some normal file with fdget()/fdput() while calling into
> io_run_task_work_sig(). However, from what I can tell, the only things
> that are actually done with this file pointer are pointer comparisons,
> so this also shouldn't have significant security impact.
>
> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
> io_sync_cancel(), io_uring_enter and io_uring_register? These
> functions probably usually run in multithreaded environments anyway
> (thanks to the io_uring worker threads), so I would think fdget()
> shouldn't bring significant performance savings here?

Let me run some testing on that. It's a mistake to think that it's
usually multithreaded, generally if you end up using io-wq then it's not
a fast path. A fast networked setup, for example, would never touch the
threads and hence no threading would be implied by using io_uring. Ditto
on the storage front, if you're just reading/writing or eg doing polled
IO. That said, those workloads are generally threaded _anyway_ - not
because of io_uring, but because that's how these kinds of workloads are
written to begin with.

So probably won't be much of a concern to do the swap. The only
"interesting" part of the above mix of cancel/register/enter is
obviously the enter part. The rest are not really fast path.

--
Jens Axboe

2023-11-28 16:19:19

by Jens Axboe

[permalink] [raw]
Subject: Re: io_uring: risky use of task work, especially wrt fdget()

On 11/28/23 8:58 AM, Jens Axboe wrote:
> On 11/27/23 2:53 PM, Jann Horn wrote:
>> Hi!
>>
>> I noticed something that I think does not currently cause any
>> significant security issues, but could be problematic in the future:
>>
>> io_uring sometimes processes task work in the middle of syscalls,
>> including between fdget() and fdput(). My understanding of task work
>> is that it is expected to run in a context similar to directly at
>> syscall entry/exit: task context, no locks held, sleeping is okay, and
>> it doesn't execute in the middle of some syscall that expects private
>> state of the task_struct to stay the same.
>>
>> An example of another user of task work is the keyring subsystem,
>> which does task_work_add() in keyctl_session_to_parent() to change the
>> cred pointers of another task.
>>
>> Several places in io_uring process task work while holding an fdget()
>> reference to some file descriptor. For example, the io_uring_enter
>> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
>> referenced via fdget(). This means that if there were another kernel
>> subsystem that uses task work to close file descriptors, io_uring
>> would become unsafe. And io_uring does _almost_ that itself, I think:
>> io_queue_worker_create() can be run on a workqueue, and uses task work
>> to launch a worker thread from the context of a userspace thread; and
>> this worker thread can then accept commands to close file descriptors.
>> Except it doesn't accept commands to close io_uring file descriptors.
>>
>> A closer miss might be io_sync_cancel(), which holds a reference to
>> some normal file with fdget()/fdput() while calling into
>> io_run_task_work_sig(). However, from what I can tell, the only things
>> that are actually done with this file pointer are pointer comparisons,
>> so this also shouldn't have significant security impact.
>>
>> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
>> io_sync_cancel(), io_uring_enter and io_uring_register? These
>> functions probably usually run in multithreaded environments anyway
>> (thanks to the io_uring worker threads), so I would think fdget()
>> shouldn't bring significant performance savings here?
>
> Let me run some testing on that. It's a mistake to think that it's
> usually multithreaded, generally if you end up using io-wq then it's not
> a fast path. A fast networked setup, for example, would never touch the
> threads and hence no threading would be implied by using io_uring. Ditto
> on the storage front, if you're just reading/writing or eg doing polled
> IO. That said, those workloads are generally threaded _anyway_ - not
> because of io_uring, but because that's how these kinds of workloads are
> written to begin with.
>
> So probably won't be much of a concern to do the swap. The only
> "interesting" part of the above mix of cancel/register/enter is
> obviously the enter part. The rest are not really fast path.

Did all three and ran the usual testing, which just so happens to be
multithreaded to begin with anyway. No discernable change from using
fget/fput over fdget/fdput.

IOW, we may as well do this. Do you want to send a patch? Or I can send
out mine, up to you.

--
Jens Axboe

2023-11-28 16:28:46

by Jens Axboe

[permalink] [raw]
Subject: Re: io_uring: risky use of task work, especially wrt fdget()

On 11/28/23 9:19 AM, Jens Axboe wrote:
> On 11/28/23 8:58 AM, Jens Axboe wrote:
>> On 11/27/23 2:53 PM, Jann Horn wrote:
>>> Hi!
>>>
>>> I noticed something that I think does not currently cause any
>>> significant security issues, but could be problematic in the future:
>>>
>>> io_uring sometimes processes task work in the middle of syscalls,
>>> including between fdget() and fdput(). My understanding of task work
>>> is that it is expected to run in a context similar to directly at
>>> syscall entry/exit: task context, no locks held, sleeping is okay, and
>>> it doesn't execute in the middle of some syscall that expects private
>>> state of the task_struct to stay the same.
>>>
>>> An example of another user of task work is the keyring subsystem,
>>> which does task_work_add() in keyctl_session_to_parent() to change the
>>> cred pointers of another task.
>>>
>>> Several places in io_uring process task work while holding an fdget()
>>> reference to some file descriptor. For example, the io_uring_enter
>>> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
>>> referenced via fdget(). This means that if there were another kernel
>>> subsystem that uses task work to close file descriptors, io_uring
>>> would become unsafe. And io_uring does _almost_ that itself, I think:
>>> io_queue_worker_create() can be run on a workqueue, and uses task work
>>> to launch a worker thread from the context of a userspace thread; and
>>> this worker thread can then accept commands to close file descriptors.
>>> Except it doesn't accept commands to close io_uring file descriptors.
>>>
>>> A closer miss might be io_sync_cancel(), which holds a reference to
>>> some normal file with fdget()/fdput() while calling into
>>> io_run_task_work_sig(). However, from what I can tell, the only things
>>> that are actually done with this file pointer are pointer comparisons,
>>> so this also shouldn't have significant security impact.
>>>
>>> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
>>> io_sync_cancel(), io_uring_enter and io_uring_register? These
>>> functions probably usually run in multithreaded environments anyway
>>> (thanks to the io_uring worker threads), so I would think fdget()
>>> shouldn't bring significant performance savings here?
>>
>> Let me run some testing on that. It's a mistake to think that it's
>> usually multithreaded, generally if you end up using io-wq then it's not
>> a fast path. A fast networked setup, for example, would never touch the
>> threads and hence no threading would be implied by using io_uring. Ditto
>> on the storage front, if you're just reading/writing or eg doing polled
>> IO. That said, those workloads are generally threaded _anyway_ - not
>> because of io_uring, but because that's how these kinds of workloads are
>> written to begin with.
>>
>> So probably won't be much of a concern to do the swap. The only
>> "interesting" part of the above mix of cancel/register/enter is
>> obviously the enter part. The rest are not really fast path.
>
> Did all three and ran the usual testing, which just so happens to be
> multithreaded to begin with anyway. No discernable change from using
> fget/fput over fdget/fdput.
>
> IOW, we may as well do this. Do you want to send a patch? Or I can send
> out mine, up to you.

For reference, this is what I ran my testing with:

diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 3c19cccb1aec..8a8b07dfc444 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -273,7 +273,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
};
ktime_t timeout = KTIME_MAX;
struct io_uring_sync_cancel_reg sc;
- struct fd f = { };
+ struct file *file = NULL;
DEFINE_WAIT(wait);
int ret, i;

@@ -295,10 +295,10 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
/* we can grab a normal file descriptor upfront */
if ((cd.flags & IORING_ASYNC_CANCEL_FD) &&
!(cd.flags & IORING_ASYNC_CANCEL_FD_FIXED)) {
- f = fdget(sc.fd);
- if (!f.file)
+ file = fget(sc.fd);
+ if (!file)
return -EBADF;
- cd.file = f.file;
+ cd.file = file;
}

ret = __io_sync_cancel(current->io_uring, &cd, sc.fd);
@@ -348,6 +348,7 @@ int io_sync_cancel(struct io_ring_ctx *ctx, void __user *arg)
if (ret == -ENOENT || ret > 0)
ret = 0;
out:
- fdput(f);
+ if (file)
+ fput(file);
return ret;
}
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 05f933dddfde..aba5657d287e 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3652,7 +3652,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
size_t, argsz)
{
struct io_ring_ctx *ctx;
- struct fd f;
+ struct file *file;
long ret;

if (unlikely(flags & ~(IORING_ENTER_GETEVENTS | IORING_ENTER_SQ_WAKEUP |
@@ -3670,20 +3670,19 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
if (unlikely(!tctx || fd >= IO_RINGFD_REG_MAX))
return -EINVAL;
fd = array_index_nospec(fd, IO_RINGFD_REG_MAX);
- f.file = tctx->registered_rings[fd];
- f.flags = 0;
- if (unlikely(!f.file))
+ file = tctx->registered_rings[fd];
+ if (unlikely(!file))
return -EBADF;
} else {
- f = fdget(fd);
- if (unlikely(!f.file))
+ file = fget(fd);
+ if (unlikely(!file))
return -EBADF;
ret = -EOPNOTSUPP;
- if (unlikely(!io_is_uring_fops(f.file)))
+ if (unlikely(!io_is_uring_fops(file)))
goto out;
}

- ctx = f.file->private_data;
+ ctx = file->private_data;
ret = -EBADFD;
if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED))
goto out;
@@ -3777,7 +3776,8 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
}
}
out:
- fdput(f);
+ if (!(flags & IORING_ENTER_REGISTERED_RING))
+ fput(file);
return ret;
}

@@ -4618,7 +4618,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
{
struct io_ring_ctx *ctx;
long ret = -EBADF;
- struct fd f;
+ struct file *file;
bool use_registered_ring;

use_registered_ring = !!(opcode & IORING_REGISTER_USE_REGISTERED_RING);
@@ -4637,27 +4637,27 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
if (unlikely(!tctx || fd >= IO_RINGFD_REG_MAX))
return -EINVAL;
fd = array_index_nospec(fd, IO_RINGFD_REG_MAX);
- f.file = tctx->registered_rings[fd];
- f.flags = 0;
- if (unlikely(!f.file))
+ file = tctx->registered_rings[fd];
+ if (unlikely(!file))
return -EBADF;
} else {
- f = fdget(fd);
- if (unlikely(!f.file))
+ file = fget(fd);
+ if (unlikely(!file))
return -EBADF;
ret = -EOPNOTSUPP;
- if (!io_is_uring_fops(f.file))
+ if (!io_is_uring_fops(file))
goto out_fput;
}

- ctx = f.file->private_data;
+ ctx = file->private_data;

mutex_lock(&ctx->uring_lock);
ret = __io_uring_register(ctx, opcode, arg, nr_args);
mutex_unlock(&ctx->uring_lock);
trace_io_uring_register(ctx, opcode, ctx->nr_user_files, ctx->nr_user_bufs, ret);
out_fput:
- fdput(f);
+ if (!use_registered_ring)
+ fput(file);
return ret;
}


--
Jens Axboe

2023-11-28 16:37:28

by Jann Horn

[permalink] [raw]
Subject: Re: io_uring: risky use of task work, especially wrt fdget()

On Tue, Nov 28, 2023 at 5:19 PM Jens Axboe <[email protected]> wrote:
> On 11/28/23 8:58 AM, Jens Axboe wrote:
> > On 11/27/23 2:53 PM, Jann Horn wrote:
> >> Hi!
> >>
> >> I noticed something that I think does not currently cause any
> >> significant security issues, but could be problematic in the future:
> >>
> >> io_uring sometimes processes task work in the middle of syscalls,
> >> including between fdget() and fdput(). My understanding of task work
> >> is that it is expected to run in a context similar to directly at
> >> syscall entry/exit: task context, no locks held, sleeping is okay, and
> >> it doesn't execute in the middle of some syscall that expects private
> >> state of the task_struct to stay the same.
> >>
> >> An example of another user of task work is the keyring subsystem,
> >> which does task_work_add() in keyctl_session_to_parent() to change the
> >> cred pointers of another task.
> >>
> >> Several places in io_uring process task work while holding an fdget()
> >> reference to some file descriptor. For example, the io_uring_enter
> >> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
> >> referenced via fdget(). This means that if there were another kernel
> >> subsystem that uses task work to close file descriptors, io_uring
> >> would become unsafe. And io_uring does _almost_ that itself, I think:
> >> io_queue_worker_create() can be run on a workqueue, and uses task work
> >> to launch a worker thread from the context of a userspace thread; and
> >> this worker thread can then accept commands to close file descriptors.
> >> Except it doesn't accept commands to close io_uring file descriptors.
> >>
> >> A closer miss might be io_sync_cancel(), which holds a reference to
> >> some normal file with fdget()/fdput() while calling into
> >> io_run_task_work_sig(). However, from what I can tell, the only things
> >> that are actually done with this file pointer are pointer comparisons,
> >> so this also shouldn't have significant security impact.
> >>
> >> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
> >> io_sync_cancel(), io_uring_enter and io_uring_register? These
> >> functions probably usually run in multithreaded environments anyway
> >> (thanks to the io_uring worker threads), so I would think fdget()
> >> shouldn't bring significant performance savings here?
> >
> > Let me run some testing on that. It's a mistake to think that it's
> > usually multithreaded, generally if you end up using io-wq then it's not
> > a fast path. A fast networked setup, for example, would never touch the
> > threads and hence no threading would be implied by using io_uring. Ditto
> > on the storage front, if you're just reading/writing or eg doing polled
> > IO. That said, those workloads are generally threaded _anyway_ - not
> > because of io_uring, but because that's how these kinds of workloads are
> > written to begin with.

Aah, because with polled I/O, when the fd is signalled as ready, the
actual execution of work is done via task_work? Thanks for the
explanation, I missed that.

> > So probably won't be much of a concern to do the swap. The only
> > "interesting" part of the above mix of cancel/register/enter is
> > obviously the enter part. The rest are not really fast path.
>
> Did all three and ran the usual testing, which just so happens to be
> multithreaded to begin with anyway. No discernable change from using
> fget/fput over fdget/fdput.

Oh, nice!

> IOW, we may as well do this. Do you want to send a patch? Or I can send
> out mine, up to you.

Ah, if you already cooked up a patch to do the testing, I guess it's
easier if you commit that?

2023-11-28 16:46:53

by Jens Axboe

[permalink] [raw]
Subject: Re: io_uring: risky use of task work, especially wrt fdget()

On 11/28/23 9:36 AM, Jann Horn wrote:
> On Tue, Nov 28, 2023 at 5:19?PM Jens Axboe <[email protected]> wrote:
>> On 11/28/23 8:58 AM, Jens Axboe wrote:
>>> On 11/27/23 2:53 PM, Jann Horn wrote:
>>>> Hi!
>>>>
>>>> I noticed something that I think does not currently cause any
>>>> significant security issues, but could be problematic in the future:
>>>>
>>>> io_uring sometimes processes task work in the middle of syscalls,
>>>> including between fdget() and fdput(). My understanding of task work
>>>> is that it is expected to run in a context similar to directly at
>>>> syscall entry/exit: task context, no locks held, sleeping is okay, and
>>>> it doesn't execute in the middle of some syscall that expects private
>>>> state of the task_struct to stay the same.
>>>>
>>>> An example of another user of task work is the keyring subsystem,
>>>> which does task_work_add() in keyctl_session_to_parent() to change the
>>>> cred pointers of another task.
>>>>
>>>> Several places in io_uring process task work while holding an fdget()
>>>> reference to some file descriptor. For example, the io_uring_enter
>>>> syscall handler calls io_iopoll_check() while the io_ring_ctx is only
>>>> referenced via fdget(). This means that if there were another kernel
>>>> subsystem that uses task work to close file descriptors, io_uring
>>>> would become unsafe. And io_uring does _almost_ that itself, I think:
>>>> io_queue_worker_create() can be run on a workqueue, and uses task work
>>>> to launch a worker thread from the context of a userspace thread; and
>>>> this worker thread can then accept commands to close file descriptors.
>>>> Except it doesn't accept commands to close io_uring file descriptors.
>>>>
>>>> A closer miss might be io_sync_cancel(), which holds a reference to
>>>> some normal file with fdget()/fdput() while calling into
>>>> io_run_task_work_sig(). However, from what I can tell, the only things
>>>> that are actually done with this file pointer are pointer comparisons,
>>>> so this also shouldn't have significant security impact.
>>>>
>>>> Would it make sense to use fget()/fput() instead of fdget()/fdput() in
>>>> io_sync_cancel(), io_uring_enter and io_uring_register? These
>>>> functions probably usually run in multithreaded environments anyway
>>>> (thanks to the io_uring worker threads), so I would think fdget()
>>>> shouldn't bring significant performance savings here?
>>>
>>> Let me run some testing on that. It's a mistake to think that it's
>>> usually multithreaded, generally if you end up using io-wq then it's not
>>> a fast path. A fast networked setup, for example, would never touch the
>>> threads and hence no threading would be implied by using io_uring. Ditto
>>> on the storage front, if you're just reading/writing or eg doing polled
>>> IO. That said, those workloads are generally threaded _anyway_ - not
>>> because of io_uring, but because that's how these kinds of workloads are
>>> written to begin with.
>
> Aah, because with polled I/O, when the fd is signalled as ready, the
> actual execution of work is done via task_work? Thanks for the
> explanation, I missed that.

For polled IO, there's actually no task_work at all, as the polling task
will run the completions itself, inline. But for things like eg network
recv, when data readiness is triggered, the receive retry is done via
task_work. We don't block on anything pollable, which would then mean
you'd be punting it to io-wq.

There are obviously things that WILL punt to io-wq as part of normal
operations. Buffered reads will generally not need io-wq, but buffered
writes certainly can if we need to throttle or otherwise do something
that will block. For O_DIRECT IO, you'd only ever need io-wq if you
overload the device resources. Fast path DIO would not use io-wq either.

Outside of that, there's a class of "slower" opcodes which always punt
to io-wq. Things like fsync, for example.

For an efficient networked or storage ring usage, you'd never see io-wq
being used.

>>> So probably won't be much of a concern to do the swap. The only
>>> "interesting" part of the above mix of cancel/register/enter is
>>> obviously the enter part. The rest are not really fast path.
>>
>> Did all three and ran the usual testing, which just so happens to be
>> multithreaded to begin with anyway. No discernable change from using
>> fget/fput over fdget/fdput.
>
> Oh, nice!
>
>> IOW, we may as well do this. Do you want to send a patch? Or I can send
>> out mine, up to you.
>
> Ah, if you already cooked up a patch to do the testing, I guess it's
> easier if you commit that?

Sure, I can do that. I'll put a link in to this thread and add a
Suggested-by as well. Commit message will be mostly an edited version of
your post anyway.

--
Jens Axboe