2020-07-15 11:53:39

by Miklos Szeredi

Subject: strace of io_uring events?

Hi,

This thread is to discuss the possibility of stracing requests
submitted through io_uring. I'm not directly involved in io_uring
development, so I'm posting this out of interest in using strace on
processes utilizing io_uring.

io_uring gives the developer a way to bypass the syscall interface,
which results in loss of information when tracing. This is a strace
fragment on "io_uring-cp" from liburing:

io_uring_enter(5, 40, 0, 0, NULL, 8) = 40
io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
...

What really happens are read + write requests. Without that
information the strace output is mostly useless.
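
For reference, the submission side of such a copy looks roughly like
the following with liburing (a minimal sketch, not the actual
io_uring-cp source; "infile" is just a placeholder):

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char buf[4096];
	struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
	int fd = open("infile", O_RDONLY);

	io_uring_queue_init(8, &ring, 0);	/* one io_uring_setup() */

	/* The read is described in the shared SQ ring, not issued as a
	 * read/readv syscall of its own. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);

	io_uring_submit(&ring);		/* strace sees only io_uring_enter() */

	io_uring_wait_cqe(&ring, &cqe);	/* possibly another io_uring_enter() */
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);

	io_uring_queue_exit(&ring);
	return 0;
}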

This loss of information is not new; e.g. calls through the vdso or
futex fast paths are also invisible to strace. But losing filesystem
I/O calls is a major blow, imo.

What do people think?

From what I can tell, listing the submitted requests on
io_uring_enter() would not be hard. Request completion is
asynchronous, however, and may not require io_uring_enter() syscall.
Am I correct?

Is there some existing tracing infrastructure that strace could use to
get async completion events? Should we be introducing one?

Thanks,
Miklos


2020-07-15 15:12:25

by Andy Lutomirski

Subject: Re: strace of io_uring events?



> On Jul 15, 2020, at 4:12 AM, Miklos Szeredi <[email protected]> wrote:
>
> Hi,
>
> This thread is to discuss the possibility of stracing requests
> submitted through io_uring. I'm not directly involved in io_uring
> development, so I'm posting this out of interest in using strace on
> processes utilizing io_uring.
>
> io_uring gives the developer a way to bypass the syscall interface,
> which results in loss of information when tracing. This is a strace
> fragment on "io_uring-cp" from liburing:
>
> io_uring_enter(5, 40, 0, 0, NULL, 8) = 40
> io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
> io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
> ...
>
> What really happens are read + write requests. Without that
> information the strace output is mostly useless.
>
> This loss of information is not new; e.g. calls through the vdso or
> futex fast paths are also invisible to strace. But losing filesystem
> I/O calls is a major blow, imo.
>
> What do people think?
>
> From what I can tell, listing the submitted requests on
> io_uring_enter() would not be hard. Request completion is
> asynchronous, however, and may not require io_uring_enter() syscall.
> Am I correct?
>
> Is there some existing tracing infrastructure that strace could use to
> get async completion events? Should we be introducing one?
>
>

Let’s add some seccomp folks. We probably also want to be able to run seccomp-like filters on io_uring requests. So maybe io_uring should call into seccomp-and-tracing code for each action.

2020-07-15 17:12:10

by Matthew Wilcox

Subject: Re: strace of io_uring events?

On Wed, Jul 15, 2020 at 07:35:50AM -0700, Andy Lutomirski wrote:
> > On Jul 15, 2020, at 4:12 AM, Miklos Szeredi <[email protected]> wrote:
> >
> > <feff>Hi,

feff? Are we doing WTF-16 in email now? ;-)

> >
> > This thread is to discuss the possibility of stracing requests
> > submitted through io_uring. I'm not directly involved in io_uring
> > development, so I'm posting this out of interest in using strace on
> > processes utilizing io_uring.
> >
> > io_uring gives the developer a way to bypass the syscall interface,
> > which results in loss of information when tracing. This is a strace
> > fragment on "io_uring-cp" from liburing:
> >
> > io_uring_enter(5, 40, 0, 0, NULL, 8) = 40
> > io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
> > io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
> > ...
> >
> > What really happens are read + write requests. Without that
> > information the strace output is mostly useless.
> >
> > This loss of information is not new; e.g. calls through the vdso or
> > futex fast paths are also invisible to strace. But losing filesystem
> > I/O calls is a major blow, imo.
> >
> > What do people think?
> >
> > From what I can tell, listing the submitted requests on
> > io_uring_enter() would not be hard. Request completion is
> > asynchronous, however, and may not require io_uring_enter() syscall.
> > Am I correct?
> >
> > Is there some existing tracing infrastructure that strace could use to
> > get async completion events? Should we be introducing one?
> >
> >
>
> Let’s add some seccomp folks. We probably also want to be able to run seccomp-like filters on io_uring requests. So maybe io_uring should call into seccomp-and-tracing code for each action.

Adding Stefano since he had a complementary proposal for io_uring
restrictions that wasn't exactly seccomp.

2020-07-15 19:44:27

by Pavel Begunkov

Subject: Re: strace of io_uring events?

On 15/07/2020 20:11, Matthew Wilcox wrote:
> On Wed, Jul 15, 2020 at 07:35:50AM -0700, Andy Lutomirski wrote:
>>> On Jul 15, 2020, at 4:12 AM, Miklos Szeredi <[email protected]> wrote:
>>>
>>> <feff>Hi,
>
> feff? Are we doing WTF-16 in email now? ;-)
>
>>>
>>> This thread is to discuss the possibility of stracing requests
>>> submitted through io_uring. I'm not directly involved in io_uring
>>> development, so I'm posting this out of interest in using strace on
>>> processes utilizing io_uring.
>>>
>>> io_uring gives the developer a way to bypass the syscall interface,
>>> which results in loss of information when tracing. This is a strace
>>> fragment on "io_uring-cp" from liburing:
>>>
>>> io_uring_enter(5, 40, 0, 0, NULL, 8) = 40
>>> io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
>>> io_uring_enter(5, 1, 0, 0, NULL, 8) = 1
>>> ...
>>>
>>> What really happens are read + write requests. Without that
>>> information the strace output is mostly useless.
>>>
>>> This loss of information is not new; e.g. calls through the vdso or
>>> futex fast paths are also invisible to strace. But losing filesystem
>>> I/O calls is a major blow, imo.

To clarify some details for those who are not familiar with io_uring:

io_uring has a pair of queues, the submission queue (SQ) and the
completion queue (CQ), both shared between kernel and user space.
Userspace submits requests by filling a chunk of memory in the SQ. The
kernel picks up SQ entries either synchronously (in the io_uring_enter
syscall) or asynchronously by polling the SQ.

CQ entries are filled in by the kernel completely asynchronously and
in parallel. Some users just poll the CQ to reap them, but there is
also a way to wait for them.
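
In raw (liburing-less) terms, the sharing works roughly like this --
a sketch only, with error handling omitted:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct io_uring_params p;
	memset(&p, 0, sizeof(p));

	int fd = syscall(__NR_io_uring_setup, 8, &p);

	/* SQ ring header: head/tail indices plus the SQE index array */
	size_t sq_sz = p.sq_off.array + p.sq_entries * sizeof(unsigned);
	unsigned *sq = mmap(NULL, sq_sz, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_POPULATE, fd,
			    IORING_OFF_SQ_RING);

	/* The SQEs themselves: userspace fills these to describe requests */
	struct io_uring_sqe *sqes = mmap(NULL,
			p.sq_entries * sizeof(struct io_uring_sqe),
			PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE,
			fd, IORING_OFF_SQES);

	/* CQ ring: the kernel posts completions here without a syscall */
	size_t cq_sz = p.cq_off.cqes +
		       p.cq_entries * sizeof(struct io_uring_cqe);
	unsigned *cq = mmap(NULL, cq_sz, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_POPULATE, fd,
			    IORING_OFF_CQ_RING);

	(void)sq; (void)sqes; (void)cq;
	return 0;
}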

>>>
>>> What do people think?
>>>
>>> From what I can tell, listing the submitted requests on
>>> io_uring_enter() would not be hard. Request completion is
>>> asynchronous, however, and may not require io_uring_enter() syscall.
>>> Am I correct?

Both the submission and the completion side can work without a syscall.

>>>
>>> Is there some existing tracing infrastructure that strace could use to
>>> get async completion events? Should we be introducing one?

There are static trace points covering all needs.

And if not used, the whole thing has to be zero-overhead. Otherwise
there is perf, which is zero-overhead, and this IMHO won't fly.

>>
>> Let’s add some seccomp folks. We probably also want to be able to run
>> seccomp-like filters on io_uring requests. So maybe io_uring should
>> call into seccomp-and-tracing code for each action.
>
> Adding Stefano since he had a complementary proposal for io_uring
> restrictions that wasn't exactly seccomp.
>

--
Pavel Begunkov

2020-07-15 20:11:10

by Miklos Szeredi

Subject: Re: strace of io_uring events?

On Wed, Jul 15, 2020 at 9:43 PM Pavel Begunkov <[email protected]> wrote:
>
> To clarify some details for those who are not familiar with io_uring:
>
> io_uring has a pair of queues, the submission queue (SQ) and the
> completion queue (CQ), both shared between kernel and user space.
> Userspace submits requests by filling a chunk of memory in the SQ. The
> kernel picks up SQ entries either synchronously (in the io_uring_enter
> syscall) or asynchronously by polling the SQ.
>
> CQ entries are filled in by the kernel completely asynchronously and
> in parallel. Some users just poll the CQ to reap them, but there is
> also a way to wait for them.
>
> >>>
> >>> What do people think?
> >>>
> >>> From what I can tell, listing the submitted requests on
> >>> io_uring_enter() would not be hard. Request completion is
> >>> asynchronous, however, and may not require io_uring_enter() syscall.
> >>> Am I correct?
>
> Both the submission and the completion side can work without a syscall.

Okay.

> >>> Is there some existing tracing infrastructure that strace could use to
> >>> get async completion events? Should we be introducing one?
>
> There are static trace points covering all needs.

This needs to be unprivileged, or its usefulness is again compromised.

>
> And if not used, the whole thing has to be zero-overhead. Otherwise
> there is perf, which is zero-overhead, and this IMHO won't fly.

Obviously it needs to be zero overhead if not tracing.

What won't fly?

Thanks,
Miklos

2020-07-15 20:23:29

by Pavel Begunkov

Subject: Re: strace of io_uring events?

On 15/07/2020 23:09, Miklos Szeredi wrote:
> On Wed, Jul 15, 2020 at 9:43 PM Pavel Begunkov <[email protected]> wrote:
>>
>> To clarify some details for those who are not familiar with io_uring:
>>
>> io_uring has a pair of queues, the submission queue (SQ) and the
>> completion queue (CQ), both shared between kernel and user space.
>> Userspace submits requests by filling a chunk of memory in the SQ. The
>> kernel picks up SQ entries either synchronously (in the io_uring_enter
>> syscall) or asynchronously by polling the SQ.
>>
>> CQ entries are filled in by the kernel completely asynchronously and
>> in parallel. Some users just poll the CQ to reap them, but there is
>> also a way to wait for them.
>>
>>>>>
>>>>> What do people think?
>>>>>
>>>>> From what I can tell, listing the submitted requests on
>>>>> io_uring_enter() would not be hard. Request completion is
>>>>> asynchronous, however, and may not require io_uring_enter() syscall.
>>>>> Am I correct?
>>
>> Both the submission and the completion side can work without a syscall.
>
> Okay.
>
>>>>> Is there some existing tracing infrastructure that strace could use to
>>>>> get async completion events? Should we be introducing one?
>>
>> There are static trace points covering all needs.
>
> This needs to be unprivileged, or its usefulness is again compromised.
>
>>
>> And if not used, the whole thing has to be zero-overhead. Otherwise
>> there is perf, which is zero-overhead, and this IMHO won't fly.
>
> Obviously it needs to be zero overhead if not tracing.
>
> What won't fly?

Any approach that is not "zero-overhead if not used".

--
Pavel Begunkov

2020-07-15 23:08:04

by Kees Cook

Subject: Re: strace of io_uring events?

Earlier Andy Lutomirski wrote:
> Let’s add some seccomp folks. We probably also want to be able to run
> seccomp-like filters on io_uring requests. So maybe io_uring should call into
> seccomp-and-tracing code for each action.

Okay, I'm finally able to spend time looking at this. And thank you to
the many people that CCed me into this and earlier discussions (at least
Jann, Christian, and Andy).

It *seems* like there is a really clean mapping of SQE OPs to syscalls.
To that end, yes, it should be trivial to add ptrace and seccomp support
(sort of). The trouble comes for doing _interception_, which is how both
ptrace and seccomp are designed.

In the basic case of seccomp, various syscalls are just being checked
for accept/reject. It seems like that would be easy to wire up. For the
more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
I think any such results would need to be "upgraded" to "reject". Things
are a bit complex in that seccomp's form of "reject" can be "return
errno" (easy) or it can be "kill thread (or thread_group)" which ...
becomes less clear. (More on this later.)

In the basic case of "I want to run strace", this is really just a
creative use of ptrace in that interception is being used only for
reporting. Does ptrace need to grow a way to create/attach an io_uring
eventfd? Or should there be an entirely different tool for
administrative analysis of io_uring events (kind of how disk IO can be
monitored)?

For io_uring generally, I have a few comments/questions:

- Why did a new syscall get added that couldn't be extended? All new
syscalls should be using Extended Arguments. :(

- Why aren't the io_uring syscalls in the man-page git? (It seems like
they're in liburing, but that should document the _library_, not the
syscalls, yes?)

Speaking to Stefano's proposal[1]:

- There appear to be three classes of desired restrictions:
- opcodes for io_uring_register() (which can be enforced entirely with
seccomp right now; see the sketch below).
- opcodes from SQEs (this _could_ be intercepted by seccomp, but is
not currently written)
- opcodes of the types of restrictions to restrict... for making sure
things can't be changed after being set? seccomp already enforces
that kind of "can only be made stricter"

- Credentials vs no_new_privs needs examination (more on this later)
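
To make the first of those concrete, a minimal seccomp-bpf sketch that
filters the opcode argument of io_uring_register() could look like the
following (purely illustrative: it rejects IORING_REGISTER_PERSONALITY
as an arbitrary example and omits the usual architecture check):

#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/io_uring.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

int install_filter(void)
{
	struct sock_filter filter[] = {
		/* load the syscall number */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			 __NR_io_uring_register, 0, 3),
		/* opcode is the second argument (low 32 bits) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, args[1])),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
			 IORING_REGISTER_PERSONALITY, 0, 1),
		BPF_STMT(BPF_RET | BPF_K,
			 SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}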

So, I think, at least for restrictions, seccomp should absolutely be
the place to get this work done. It already covers 2 of the 3 points in
the proposal.

Solving the mapping of seccomp interception types into CQEs (or anything
more severe) will likely inform what it would mean to map ptrace events
to CQEs. So, I think they're related, and we should get seccomp hooked
up right away, and that might help us see how (if) ptrace should be
attached.

Specifically for seccomp, I see at least the following design questions:

- How does no_new_privs play a role in the existing io_uring credential
management? Using _any_ kind of syscall-effective filtering, whether
it's seccomp or Stefano's existing proposal, needs to address the
potential inheritable restrictions across privilege boundaries (which is
what no_new_privs tries to eliminate). In regular syscall land, this is
an issue when a filter follows a process through setuid via execve()
and it gains privileges that now the filter-creator can trick into
doing weird stuff -- io_uring has a concept of alternative credentials
so I have to ask about it. (I don't *think* there would be a path to
install a filter before gaining privilege, but I likely just
need to do my homework on the io_uring internals. Regardless,
use of seccomp by io_uring would need to have this issue "solved"
in the sense that it must be "safe" to filter io_uring OPs, from a
privilege-boundary-crossing perspective.)

- From which task perspective should filters be applied? It seems like it
needs to follow the io_uring personalities, as that contains the
credentials. (This email is a brain-dump so far -- I haven't gone to
look to see if that means io_uring is literally getting a reference to
struct cred; I assume so.) Seccomp filters are attached to task_struct.
However, for v5.9, seccomp will gain a more generalized get/put system
for having filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding
more get/put-ers for some part of the io_uring context shouldn't
be hard.

- How should seccomp return values be applied? Three seem okay:
SECCOMP_RET_ALLOW: do SQE action normally
SECCOMP_RET_LOG: do SQE action, log via seccomp
SECCOMP_RET_ERRNO: skip actions in SQE and pass errno to CQE
The rest not so much:
SECCOMP_RET_TRAP: can't send SIGSYS anywhere sane?
SECCOMP_RET_TRACE: no tracer, can't send SIGSYS?
SECCOMP_RET_USER_NOTIF: can't do user_notif rewrites?
SECCOMP_RET_KILL_THREAD: kill which thread?
SECCOMP_RET_KILL_PROCESS: kill which thread group?
If TRAP, TRACE, and USER_NOTIF need to be "upgraded" to KILL_THREAD,
what does KILL_THREAD mean? Does it really mean "shut down the entire
SQ?" Does it mean kill the worker thread? Does KILL_PROCESS mean kill
all the tasks with an open mapping for the SQ?

Anyway, I'd love to hear what folks think, but given the very direct
mapping from SQE OPs to syscalls, I really think seccomp needs to be
inserted in here somewhere to maintain any kind of sensible reasoning
about syscall filtering.

-Kees

[1] https://lore.kernel.org/lkml/[email protected]/

--
Kees Cook

2020-07-16 00:13:02

by Theodore Ts'o

Subject: Re: strace of io_uring events?

On Wed, Jul 15, 2020 at 06:11:30PM +0100, Matthew Wilcox wrote:
> On Wed, Jul 15, 2020 at 07:35:50AM -0700, Andy Lutomirski wrote:
> > > On Jul 15, 2020, at 4:12 AM, Miklos Szeredi <[email protected]> wrote:
> > > This thread is to discuss the possibility of stracing requests
> > > submitted through io_uring. I'm not directly involved in io_uring
> > > development, so I'm posting this out of interest in using strace on
> > > processes utilizing io_uring.

> > >
> > > Is there some existing tracing infrastructure that strace could use to
> > > get async completion events? Should we be introducing one?

I suspect the best approach to use here is eBPF, since sending events
asynchronously to a ring buffer is going to be *way* more efficient
than using the blocking ptrace(2) system call...

- Ted

2020-07-16 13:15:06

by Stefano Garzarella

Subject: Re: strace of io_uring events?

+Cc Stefan Hajnoczi

On Wed, Jul 15, 2020 at 04:07:00PM -0700, Kees Cook wrote:
> Earlier Andy Lutomirski wrote:
> > Let’s add some seccomp folks. We probably also want to be able to run
> > seccomp-like filters on io_uring requests. So maybe io_uring should call into
> > seccomp-and-tracing code for each action.
>
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
>

[...]

>
> Speaking to Stefano's proposal[1]:
>
> - There appear to be three classes of desired restrictions:
> - opcodes for io_uring_register() (which can be enforced entirely with
> seccomp right now).
> - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> not currently written)
> - opcodes of the types of restrictions to restrict... for making sure
> things can't be changed after being set? seccomp already enforces
> that kind of "can only be made stricter"

In addition, we want to limit the SQEs to using only the registered fds and buffers.
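
As a liburing sketch (the names here are hypothetical; the enforcement
itself is what the patchset would add), the restricted side would only
be able to refer to a pre-registered file table:

#include <liburing.h>

/* "ring" is assumed to be set up by the trusted side; data_fd is the
 * one file the untrusted submitter is allowed to touch. */
void allow_only_registered(struct io_uring *ring, int data_fd,
			   char *buf, unsigned len)
{
	int fds[1] = { data_fd };

	/* Register a fixed file table once, from the trusted side. */
	io_uring_register_files(ring, fds, 1);

	/* SQEs then refer to index 0 in that table instead of a raw fd
	 * number; the proposed restrictions would reject anything else. */
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, 0 /* registered index */, buf, len, 0);
	sqe->flags |= IOSQE_FIXED_FILE;
	io_uring_submit(ring);
}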

>
> - Credentials vs no_new_privs needs examination (more on this later)
>
> So, I think, at least for restrictions, seccomp should absolutely be
> the place to get this work done. It already covers 2 of the 3 points in
> the proposal.

Thanks for your feedback, I just sent v2 before I read this e-mail.

Do you think it's better to have everything in seccomp instead of adding
the restrictions in io_uring (the patch isn't very big)?

With seccomp, would it be possible to have different restrictions for two
instances of io_uring in the same process?
I suppose it should be possible using BPF filters.

Thanks,
Stefano

>
> Solving the mapping of seccomp interception types into CQEs (or anything
> more severe) will likely inform what it would mean to map ptrace events
> to CQEs. So, I think they're related, and we should get seccomp hooked
> up right away, and that might help us see how (if) ptrace should be
> attached.
>
> Specifically for seccomp, I see at least the following design questions:
>
> - How does no_new_privs play a role in the existing io_uring credential
> management? Using _any_ kind of syscall-effective filtering, whether
> it's seccomp or Stefano's existing proposal, needs to address the
> potential inheritable restrictions across privilege boundaries (which is
> what no_new_privs tries to eliminate). In regular syscall land, this is
> an issue when a filter follows a process through setuid via execve()
> and it gains privileges that now the filter-creator can trick into
> doing weird stuff -- io_uring has a concept of alternative credentials
> so I have to ask about it. (I don't *think* there would be a path to
> install a filter before gaining privilege, but I likely just
> need to do my homework on the io_uring internals. Regardless,
> use of seccomp by io_uring would need to have this issue "solved"
> in the sense that it must be "safe" to filter io_uring OPs, from a
> privilege-boundary-crossing perspective.)
>
> - From which task perspective should filters be applied? It seems like it
> needs to follow the io_uring personalities, as that contains the
> credentials. (This email is a brain-dump so far -- I haven't gone to
> look to see if that means io_uring is literally getting a reference to
> struct cred; I assume so.) Seccomp filters are attached to task_struct.
> However, for v5.9, seccomp will gain a more generalized get/put system
> for having filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding
> more get/put-ers for some part of the io_uring context shouldn't
> be hard.
>
> - How should seccomp return values be applied? Three seem okay:
> SECCOMP_RET_ALLOW: do SQE action normally
> SECCOMP_RET_LOG: do SQE action, log via seccomp
> SECCOMP_RET_ERRNO: skip actions in SQE and pass errno to CQE
> The rest not so much:
> SECCOMP_RET_TRAP: can't send SIGSYS anywhere sane?
> SECCOMP_RET_TRACE: no tracer, can't send SIGSYS?
> SECCOMP_RET_USER_NOTIF: can't do user_notif rewrites?
> SECCOMP_RET_KILL_THREAD: kill which thread?
> SECCOMP_RET_KILL_PROCESS: kill which thread group?
> If TRAP, TRACE, and USER_NOTIF need to be "upgraded" to KILL_THREAD,
> what does KILL_THREAD mean? Does it really mean "shut down the entire
> SQ?" Does it mean kill the worker thread? Does KILL_PROCESS mean kill
> all the tasks with an open mapping for the SQ?
>
> Anyway, I'd love to hear what folks think, but given the very direct
> mapping from SQE OPs to syscalls, I really think seccomp needs to be
> inserted in here somewhere to maintain any kind of sensible reasoning
> about syscall filtering.
>
> -Kees
>
> [1] https://lore.kernel.org/lkml/[email protected]/
>
> --
> Kees Cook
>

2020-07-16 13:19:10

by Aleksa Sarai

Subject: Re: strace of io_uring events?

On 2020-07-15, Kees Cook <[email protected]> wrote:
> Earlier Andy Lutomirski wrote:
> > Let’s add some seccomp folks. We probably also want to be able to run
> > seccomp-like filters on io_uring requests. So maybe io_uring should call into
> > seccomp-and-tracing code for each action.
>
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
>
> It *seems* like there is a really clean mapping of SQE OPs to syscalls.
> To that end, yes, it should be trivial to add ptrace and seccomp support
> (sort of). The trouble comes for doing _interception_, which is how both
> ptrace and seccomp are designed.
>
> In the basic case of seccomp, various syscalls are just being checked
> for accept/reject. It seems like that would be easy to wire up. For the
> more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
> I think any such results would need to be "upgraded" to "reject". Things
> are a bit complex in that seccomp's form of "reject" can be "return
> errno" (easy) or it can be "kill thread (or thread_group)" which ...
> becomes less clear. (More on this later.)
>
> In the basic case of "I want to run strace", this is really just a
> creative use of ptrace in that interception is being used only for
> reporting. Does ptrace need to grow a way to create/attach an io_uring
> eventfd? Or should there be an entirely different tool for
> administrative analysis of io_uring events (kind of how disk IO can be
> monitored)?

I would hope that we wouldn't introduce ptrace to io_uring, because
unless we plan to attach to io_uring events via GDB it's simply the
wrong tool for the job. strace does use ptrace, but that's mostly
because Linux's dynamic tracing was still in its infancy at the time
(and even today it requires more privileges than ptrace) -- but you can
emulate strace using bpftrace these days fairly easily.

So really what is being asked here is "can we make it possible to debug
io_uring programs as easily as traditional I/O programs". And this does
not require ptrace, nor should ptrace be part of this discussion IMHO. I
believe this issue (along with seccomp-style filtering) has been
mentioned informally in the past, but I am happy to finally see a thread
about this appear.

> For io_uring generally, I have a few comments/questions:
>
> - Why did a new syscall get added that couldn't be extended? All new
> syscalls should be using Extended Arguments. :(

io_uring was introduced in Linux 5.1, predating clone3() and openat2().
My larger concern is that io_uring operations aren't extensible-structs
-- but we can resolve that issue with some slight ugliness if we ever
run into the problem.

> - Why aren't the io_uring syscalls in the man-page git? (It seems like
> they're in liburing, but that should document the _library_, not the
> syscalls, yes?)

I imagine because using the syscall requires specific memory barriers
which we probably don't want most C programs to be fiddling with
directly. Sort of similar to how iptables doesn't have a syscall-style
man page.
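
(For the curious, the discipline in question looks roughly like this
when reaping CQEs by hand -- a sketch, where the pointers are assumed
to come from the mmap()ed CQ ring offsets:)

#include <linux/io_uring.h>

/* cq_head/cq_tail/cq_mask/cqes are assumed to point into the mmap()ed
 * CQ ring (p.cq_off.head/tail/ring_mask/cqes); "handle" is a
 * hypothetical consumer callback. */
static void reap_cqes(unsigned *cq_head, unsigned *cq_tail,
		      unsigned *cq_mask, struct io_uring_cqe *cqes,
		      void (*handle)(struct io_uring_cqe *))
{
	unsigned head = *cq_head;	/* only userspace writes the head */

	/* The acquire load pairs with the kernel's release store of the
	 * tail, so the CQE contents are visible before we read them. */
	while (head != __atomic_load_n(cq_tail, __ATOMIC_ACQUIRE)) {
		handle(&cqes[head & *cq_mask]);
		head++;
	}

	/* The release store tells the kernel these slots may be reused. */
	__atomic_store_n(cq_head, head, __ATOMIC_RELEASE);
}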

> Speaking to Stefano's proposal[1]:
>
> - There appear to be three classes of desired restrictions:
> - opcodes for io_uring_register() (which can be enforced entirely with
> seccomp right now).
> - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> not currently written)
> - opcodes of the types of restrictions to restrict... for making sure
> things can't be changed after being set? seccomp already enforces
> that kind of "can only be made stricter"

Unless I misunderstood the patch cover-letter, Stefano's proposal is to
have a mechanism for adding restrictions to individual io_urings -- so
we still need a separate mechanism (or an extended version of Stefano's
proposal) to allow for the "reduce attack surface" usecase of seccomp.
It seems to me like Stefano's proposal is more related to cases where
you might SCM_RIGHTS-send an io_uring to an unprivileged process.

> Solving the mapping of seccomp interception types into CQEs (or anything
> more severe) will likely inform what it would mean to map ptrace events
> to CQEs. So, I think they're related, and we should get seccomp hooked
> up right away, and that might help us see how (if) ptrace should be
> attached.

We could just emulate the seccomp-bpf API with the pseudo-syscalls done
as a result of CQEs, though I'm not sure how happy folks will be with
this kind of glue code in "seccomp-uring" (though in theory it would
allow us to attach existing filters to io_uring...).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>



2020-07-16 15:13:29

by Kees Cook

Subject: Re: strace of io_uring events?

On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> On Wed, Jul 15, 2020 at 04:07:00PM -0700, Kees Cook wrote:
> [...]
>
> > Speaking to Stefano's proposal[1]:
> >
> > - There appear to be three classes of desired restrictions:
> > - opcodes for io_uring_register() (which can be enforced entirely with
> > seccomp right now).
> > - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> > not currently written)
> > - opcodes of the types of restrictions to restrict... for making sure
> > things can't be changed after being set? seccomp already enforces
> > that kind of "can only be made stricter"
>
> In addition, we want to limit the SQEs to using only the registered fds and buffers.

Hmm, good point. Yeah, since it's an "extra" mapping (io_uring file
number vs fd number) this doesn't really map well to seccomp. (And
frankly, there's some difficulty here mapping many of the io_uring
"syscalls" to seccomp, because it's happening "deeper" than the syscall
layer; i.e. some of the arguments have already been resolved into
kernel object pointers, etc.)

> Do you think it's better to have everything in seccomp instead of adding
> the restrictions in io_uring (the patch isn't very big)?

I'm still trying to understand how io_uring will be used, and it seems
odd to me that it's effectively a seccomp bypass. (Though from what I
can tell it is not an LSM bypass, which is good -- though I'm worried
there might be some embedded assumptions in LSMs about creds vs current
and LSMs may try to reason (or report) on actions with the kthread in
mind, but afaict everything important is checked against creds.)

> With seccomp, would it be possible to have different restrictions for two
> instances of io_uring in the same process?

For me, this is the most compelling reason to have the restrictions NOT
implemented via seccomp. Trying to make "which instance" choice in
seccomp would be extremely clumsy.

So at this point, I think it makes sense for the restriction series to
carry on -- it is io_uring-specific and solves some problems that
seccomp is not in good position to reason about.

All this said, I'd still like a way to apply seccomp to io_uring
because it's a rather giant syscall filter bypass mechanism, and gaining
access (IIUC) is possible without actually calling any of the io_uring
syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
access to the SQ and CQ, and off it goes? (The only glitch I see is
waking up the worker thread?)

What appears to be the worst bit about adding seccomp to io_uring is the
almost complete disassociation of process hierarchy from syscall action.
Only a cred is used for io_uring, and seccomp filters are associated with
task structs. I'm not sure if there is a way to solve this disconnect
without a major internal refactoring of seccomp to attach to creds and
then make every filter attachment create a new cred... *head explody*

--
Kees Cook

2020-07-16 15:20:44

by Kees Cook

Subject: Re: strace of io_uring events?

On Thu, Jul 16, 2020 at 11:17:55PM +1000, Aleksa Sarai wrote:
> On 2020-07-15, Kees Cook <[email protected]> wrote:
> > In the basic case of "I want to run strace", this is really just a
> > creative use of ptrace in that interception is being used only for
> > reporting. Does ptrace need to grow a way to create/attach an io_uring
> > eventfd? Or should there be an entirely different tool for
> > administrative analysis of io_uring events (kind of how disk IO can be
> > monitored)?
>
> I would hope that we wouldn't introduce ptrace to io_uring, because
> unless we plan to attach to io_uring events via GDB it's simply the
> wrong tool for the job. strace does use ptrace, but that's mostly
> because Linux's dynamic tracing was still in its infancy at the time
> (and even today it requires more privileges than ptrace) -- but you can
> emulate strace using bpftrace these days fairly easily.
>
> So really what is being asked here is "can we make it possible to debug
> io_uring programs as easily as traditional I/O programs". And this does
> not require ptrace, nor should ptrace be part of this discussion IMHO. I
> believe this issue (along with seccomp-style filtering) has been
> mentioned informally in the past, but I am happy to finally see a thread
> about this appear.

Yeah, I don't see any sane way to attach ptrace, especially when what's
wanted is just "io_uring action logging", which is a much more narrow
issue, and one that doesn't map well to processes.

Can the io_uring eventfd be used for this kind of thing? It seems
io_uring just needs a way to gain an administrative path to opening it?
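
(For reference, the eventfd hookup I mean is the one liburing already
exposes -- a sketch, assuming an already set-up ring:)

#include <liburing.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Once an eventfd is registered, every CQE posting signals it, so a
 * watcher can sleep on the fd instead of polling the CQ ring. */
void watch_completions(struct io_uring *ring)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	uint64_t n;

	io_uring_register_eventfd(ring, efd);
	read(efd, &n, sizeof(n));	/* blocks until completions arrive */
	/* ... then reap CQEs, e.g. with io_uring_peek_cqe() ... */
}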

> > Solving the mapping of seccomp interception types into CQEs (or anything
> > more severe) will likely inform what it would mean to map ptrace events
> > to CQEs. So, I think they're related, and we should get seccomp hooked
> > up right away, and that might help us see how (if) ptrace should be
> > attached.
>
> We could just emulate the seccomp-bpf API with the pseudo-syscalls done
> as a result of CQEs, though I'm not sure how happy folks will be with
> this kind of glue code in "seccomp-uring" (though in theory it would
> allow us to attach existing filters to io_uring...).

Looking at the per-OP "syscall" implementations, I'm kind of alarmed
that some (e.g. openat2) are rather "open coded". It seems like this
should be fixed to have at least a common entry point for both io_uring
and proper syscalls.

--
Kees Cook

2020-07-16 16:25:37

by Andy Lutomirski

Subject: Re: strace of io_uring events?

> On Jul 15, 2020, at 4:07 PM, Kees Cook <[email protected]> wrote:
>
> Earlier Andy Lutomirski wrote:
>> Let’s add some seccomp folks. We probably also want to be able to run
>> seccomp-like filters on io_uring requests. So maybe io_uring should call into
>> seccomp-and-tracing code for each action.
>
> Okay, I'm finally able to spend time looking at this. And thank you to
> the many people that CCed me into this and earlier discussions (at least
> Jann, Christian, and Andy).
>
> It *seems* like there is a really clean mapping of SQE OPs to syscalls.
> To that end, yes, it should be trivial to add ptrace and seccomp support
> (sort of). The trouble comes for doing _interception_, which is how both
> ptrace and seccomp are designed.
>
> In the basic case of seccomp, various syscalls are just being checked
> for accept/reject. It seems like that would be easy to wire up. For the
> more ptrace-y things (SECCOMP_RET_TRAP, SECCOMP_RET_USER_NOTIF, etc),
> I think any such results would need to be "upgraded" to "reject". Things
> are a bit complex in that seccomp's form of "reject" can be "return
> errno" (easy) or it can be "kill thread (or thread_group)" which ...
> becomes less clear. (More on this later.)

My intuition is not to do this kind of creative reinterpretation of
return values. Instead let’s have a new type of seccomp filter
specifically for io_uring. So we can have SECCOMP_IO_URING_ACCEPT,
ERRNO, and eventually other things. We probably will want a user
notifier feature for io_uring, but I'd be a bit surprised if it ends
up ABI-compatible with current users of user notifiers.

> - There appear to be three classes of desired restrictions:
> - opcodes for io_uring_register() (which can be enforced entirely with
> seccomp right now).

Agreed.

> - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> not currently written)

As above, I think this should be intercepted by seccomp, but in a new
mode. I think that existing seccomp filters should not intercept it.

> - opcodes of the types of restrictions to restrict... for making sure
> things can't be changed after being set? seccomp already enforces
> that kind of "can only be made stricter"

Agreed.

>
> - How does no_new_privs play a role in the existing io_uring credential
> management? Using _any_ kind of syscall-effective filtering, whether
> it's seccomp or Stefano's existing proposal, needs to address the
> potential inheritable restrictions across privilege boundaries (which is
> what no_new_privs tries to eliminate). In regular syscall land, this is
> an issue when a filter follows a process through setuid via execve()
> and it gains privileges that now the filter-creator can trick into
> doing weird stuff -- io_uring has a concept of alternative credentials
> so I have to ask about it. (I don't *think* there would be a path to
> install a filter before gaining privilege, but I likely just
> need to do my homework on the io_uring internals. Regardless,
> use of seccomp by io_uring would need to have this issue "solved"
> in the sense that it must be "safe" to filter io_uring OPs, from a
> privilege-boundary-crossing perspective.)
>
> - From which task perspective should filters be applied? It seems like it
> needs to follow the io_uring personalities, as that contains the
> credentials. (This email is a brain-dump so far -- I haven't gone to
> look to see if that means io_uring is literally getting a reference to
> struct cred; I assume so.) Seccomp filters are attached to task_struct.
> However, for v5.9, seccomp will gain a more generalized get/put system
> for having filters attached to the SECCOMP_RET_USER_NOTIF fd. Adding
> more get/put-ers for some part of the io_uring context shouldn't
> be hard.

Let's ignore personalities for a moment (and see below). Thinking
through the possibilities:

A: io_uring seccomp filters are attached to tasks. When an io_uring
is created, it inherits an immutable copy of its creating task's
filter, and that's the filter set that applies to that io_uring
instance. This could have somewhat bizarre consequences if the fd
gets passed around, but io_uring already has odd security effects if
fds are passed around. It has the annoying property that, if a
library creates an io_uring and then a seccomp filter is loaded, the
io_uring bypasses the library.

B: The same, but the io_uring references the creating task so new
filters on the task apply to the io_uring, too. This allows loading
and then sandboxing. Is this too bizarre overall?

C: io_uring filters are attached directly to io_urings. This has the
problem where an io_uring created before a task sandboxes itself isn't
sandboxed. It also would require that a filter be able to hook
io_uring creation to sandbox it.

Does anyone actually pass io_urings around with SCM_RIGHTS? It would
be really nice if we could make the default be that io_urings are
bound to their creating mm and can't be used outside it. Then
creating an mm-crossing io_uring could, itself, be restricted.

In any case, my inclination is to go for choice B. Choice C could
also be supported if there's a use case.

2020-07-17 08:03:44

by Stefano Garzarella

Subject: Re: strace of io_uring events?

On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> > On Wed, Jul 15, 2020 at 04:07:00PM -0700, Kees Cook wrote:
> > [...]
> >
> > > Speaking to Stefano's proposal[1]:
> > >
> > > - There appear to be three classes of desired restrictions:
> > > - opcodes for io_uring_register() (which can be enforced entirely with
> > > seccomp right now).
> > > - opcodes from SQEs (this _could_ be intercepted by seccomp, but is
> > > not currently written)
> > > - opcodes of the types of restrictions to restrict... for making sure
> > > things can't be changed after being set? seccomp already enforces
> > > that kind of "can only be made stricter"
> >
> > In addition, we want to limit the SQEs to using only the registered fds and buffers.
>
> Hmm, good point. Yeah, since it's an "extra" mapping (io_uring file
> number vs fd number) this doesn't really map well to seccomp. (And
> frankly, there's some difficulty here mapping many of the io_uring
> "syscalls" to seccomp, because it's happening "deeper" than the syscall
> layer; i.e. some of the arguments have already been resolved into
> kernel object pointers, etc.)
>
> > Do you think it's better to have everything in seccomp instead of adding
> > the restrictions in io_uring (the patch isn't very big)?
>
> I'm still trying to understand how io_uring will be used, and it seems
> odd to me that it's effectively a seccomp bypass. (Though from what I
> can tell it is not an LSM bypass, which is good -- though I'm worried
> there might be some embedded assumptions in LSMs about creds vs current
> and LSMs may try to reason (or report) on actions with the kthread in
> mind, but afaict everything important is checked against creds.)
>
> > With seccomp, would it be possible to have different restrictions for two
> > instances of io_uring in the same process?
>
> For me, this is the most compelling reason to have the restrictions NOT
> implemented via seccomp. Trying to make "which instance" choice in
> seccomp would be extremely clumsy.
>
> So at this point, I think it makes sense for the restriction series to
> carry on -- it is io_uring-specific and solves some problems that
> seccomp is not in good position to reason about.

Thanks for the feedback, then I'll continue in this direction!

>
> All this said, I'd still like a way to apply seccomp to io_uring
> because it's a rather giant syscall filter bypass mechanism, and gaining

Agree.

> access (IIUC) is possible without actually calling any of the io_uring
> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> access to the SQ and CQ, and off it goes? (The only glitch I see is
> waking up the worker thread?)

It is true only if the io_uring instance is created with the SQPOLL
flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
this case the kthread is created, and you can also set a higher idle
time for it, so the wakeup syscall can be avoided as well.
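
For clarity, such an instance is set up roughly like this with liburing
(a sketch):

#include <liburing.h>
#include <string.h>

/* An SQPOLL setup: the kernel thread polls the SQ, so submissions
 * normally need no syscall at all. */
int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	p.flags = IORING_SETUP_SQPOLL;	/* requires CAP_SYS_ADMIN */
	p.sq_thread_idle = 2000;	/* ms before the kthread idles */

	/* Afterwards io_uring_submit() only calls io_uring_enter() when
	 * the poller has gone idle (IORING_SQ_NEED_WAKEUP). */
	return io_uring_queue_init_params(8, ring, &p);
}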

>
> What appears to be the worst bit about adding seccomp to io_uring is the
> almost complete disassociation of process hierarchy from syscall action.
> Only a cred is used for io_uring, and seccomp filters are associated with
> task structs. I'm not sure if there is a way to solve this disconnect
> without a major internal refactoring of seccomp to attach to creds and
> then make every filter attachment create a new cred... *head explody*
>

Sorry but I don't know seccomp that well :-(
I'm learning a lot about it these days. I'll keep your concern in mind.

Thanks,
Stefano

2020-07-17 08:18:15

by Cyril Hrubis

Subject: Re: strace of io_uring events?

Hi!
> > - Why aren't the io_uring syscalls in the man-page git? (It seems like
> > they're in liburing, but that should document the _library_, not the
> > syscalls, yes?)
>
> I imagine because using the syscall requires specific memory barriers
> which we probably don't want most C programs to be fiddling with
> directly. Sort of similar to how iptables doesn't have a syscall-style
> man page.

Call me old fashioned but I would vote for having all syscalls
documented in man pages. At least for me it makes my life much easier as
I do not have to fish for documentation or read library source code when
debugging. Think of all the poor kernel QA folks that will cry in
despair when you decide not to submit manual pages.

There is plenty of stuff documented there that most C programmers
shouldn't touch, I do not consider this to be a valid excuse.

--
Cyril Hrubis
[email protected]

2020-07-21 15:29:16

by Andy Lutomirski

Subject: Re: strace of io_uring events?

On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
>
> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> > On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:

> > access (IIUC) is possible without actually calling any of the io_uring
> > syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> > pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> > access to the SQ and CQ, and off it goes? (The only glitch I see is
> > waking up the worker thread?)
>
> It is true only if the io_uring instance is created with the SQPOLL
> flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
> this case the kthread is created, and you can also set a higher idle
> time for it, so the wakeup syscall can be avoided as well.

I stared at the io_uring code for a while, and I'm wondering if we're
approaching this the wrong way. It seems to me that most of the
complications here come from the fact that io_uring SQEs don't clearly
belong to any particular security principal. (We have struct creds,
but we don't really have a task or mm.) But I'm also not convinced
that io_uring actually supports cross-mm submission except by accident
-- as it stands, unless a user is very careful to only submit SQEs
that don't use user pointers, the results will be unpredictable.
Perhaps we can get away with this:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 74bc4a04befa..92266f869174 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 	if (!percpu_ref_tryget(&ctx->refs))
 		goto out_fput;
 
+	if (unlikely(current->mm != ctx->sqo_mm)) {
+		/*
+		 * The mm used to process SQEs will be current->mm or
+		 * ctx->sqo_mm depending on which submission path is used.
+		 * It's also unclear who is responsible for an SQE submitted
+		 * out-of-process from a security and auditing perspective.
+		 *
+		 * Until a real usecase emerges and there are clear semantics
+		 * for out-of-process submission, disallow it.
+		 */
+		ret = -EACCES;
+		goto out;
+	}
+
 	/*
 	 * For SQ polling, the thread will do all submissions and completions.
 	 * Just return the requested submit count, and wake the thread if

If we can do that, then we could bind seccomp-like io_uring filters to
an mm, and we get obvious semantics that ought to cover most of the
bases.

Jens, Christoph?

Stefano, what's your intended usecase for your restriction patchset?

2020-07-21 15:32:45

by Jens Axboe

Subject: Re: strace of io_uring events?

On 7/21/20 9:27 AM, Andy Lutomirski wrote:
> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
>>
>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>
>>> access (IIUC) is possible without actually calling any of the io_uring
>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
>>> waking up the worker thread?)
>>
>> It is true only if the io_uring instance is created with the SQPOLL
>> flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
>> this case the kthread is created, and you can also set a higher idle
>> time for it, so the wakeup syscall can be avoided as well.
>
> I stared at the io_uring code for a while, and I'm wondering if we're
> approaching this the wrong way. It seems to me that most of the
> complications here come from the fact that io_uring SQEs don't clearly
> belong to any particular security principal. (We have struct creds,
> but we don't really have a task or mm.) But I'm also not convinced
> that io_uring actually supports cross-mm submission except by accident
> -- as it stands, unless a user is very careful to only submit SQEs
> that don't use user pointers, the results will be unpredictable.

How so?

> Perhaps we can get away with this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 74bc4a04befa..92266f869174 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> if (!percpu_ref_tryget(&ctx->refs))
> goto out_fput;
>
> + if (unlikely(current->mm != ctx->sqo_mm)) {
> + /*
> + * The mm used to process SQEs will be current->mm or
> + * ctx->sqo_mm depending on which submission path is used.
> + * It's also unclear who is responsible for an SQE submitted
> + * out-of-process from a security and auditing perspective.
> + *
> + * Until a real usecase emerges and there are clear semantics
> + * for out-of-process submission, disallow it.
> + */
> + ret = -EACCES;
> + goto out;
> + }
> +
> /*
> * For SQ polling, the thread will do all submissions and completions.
> * Just return the requested submit count, and wake the thread if

That'll break postgres that already uses this, also see:

commit 73e08e711d9c1d79fae01daed4b0e1fee5f8a275
Author: Jens Axboe <[email protected]>
Date: Sun Jan 26 09:53:12 2020 -0700

Revert "io_uring: only allow submit from owning task"

So no, we can't do that.

--
Jens Axboe

2020-07-21 16:01:55

by Stefano Garzarella

Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020 at 08:27:34AM -0700, Andy Lutomirski wrote:
> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
> >
> > On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> > > On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>
> > > access (IIUC) is possible without actually calling any of the io_uring
> > > syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> > > pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> > > access to the SQ and CQ, and off it goes? (The only glitch I see is
> > > waking up the worker thread?)
> >
> > It is true only if the io_uring instance is created with the SQPOLL
> > flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
> > this case the kthread is created, and you can also set a higher idle
> > time for it, so the wakeup syscall can be avoided as well.
>
> I stared at the io_uring code for a while, and I'm wondering if we're
> approaching this the wrong way. It seems to me that most of the
> complications here come from the fact that io_uring SQEs don't clearly
> belong to any particular security principal. (We have struct creds,
> but we don't really have a task or mm.) But I'm also not convinced
> that io_uring actually supports cross-mm submission except by accident
> -- as it stands, unless a user is very careful to only submit SQEs
> that don't use user pointers, the results will be unpredictable.
> Perhaps we can get away with this:
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 74bc4a04befa..92266f869174 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> if (!percpu_ref_tryget(&ctx->refs))
> goto out_fput;
>
> + if (unlikely(current->mm != ctx->sqo_mm)) {
> + /*
> + * The mm used to process SQEs will be current->mm or
> + * ctx->sqo_mm depending on which submission path is used.
> + * It's also unclear who is responsible for an SQE submitted
> + * out-of-process from a security and auditing perspective.
> + *
> + * Until a real usecase emerges and there are clear semantics
> + * for out-of-process submission, disallow it.
> + */
> + ret = -EACCES;
> + goto out;
> + }
> +
> /*
> * For SQ polling, the thread will do all submissions and completions.
> * Just return the requested submit count, and wake the thread if
>
> If we can do that, then we could bind seccomp-like io_uring filters to
> an mm, and we get obvious semantics that ought to cover most of the
> bases.
>
> Jens, Christoph?
>
> Stefano, what's your intended usecase for your restriction patchset?
>

Hi Andy,
my use case concerns virtualization. The idea, that I described in the
proposal of io-uring restrictions [1], is to share io_uring CQ and SQ queues
with a guest VM for block operations.

In the PoC that I implemented, there is a block device driver in the guest that
uses io_uring queues coming from the host to submit block requests.

Since the guest is not trusted, we need restrictions to allow only
a subset of syscalls on a subset of file descriptors and memory.


Cheers,
Stefano

[1] https://lore.kernel.org/io-uring/20200609142406.upuwpfmgqjeji4lc@steredhat/

2020-07-21 17:24:18

by Andy Lutomirski

Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
>
> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
> > On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
> >>
> >> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> >>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> >
> >>> access (IIUC) is possible without actually calling any of the io_uring
> >>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> >>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> >>> access to the SQ and CQ, and off it goes? (The only glitch I see is
> >>> waking up the worker thread?)
> >>
> >> It is true only if the io_uring instance is created with the SQPOLL
> >> flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
> >> this case the kthread is created, and you can also set a higher idle
> >> time for it, so the wakeup syscall can be avoided as well.
> >
> > I stared at the io_uring code for a while, and I'm wondering if we're
> > approaching this the wrong way. It seems to me that most of the
> > complications here come from the fact that io_uring SQEs don't clearly
> > belong to any particular security principal. (We have struct creds,
> > but we don't really have a task or mm.) But I'm also not convinced
> > that io_uring actually supports cross-mm submission except by accident
> > -- as it stands, unless a user is very careful to only submit SQEs
> > that don't use user pointers, the results will be unpredictable.
>
> How so?

Unless I've missed something, either current->mm or sqo_mm will be
used depending on which thread ends up doing the IO. (And there might
be similar issues with threads.) Having the mm against which user
memory references are resolved be an implementation detail seems
suboptimal.

>
> > Perhaps we can get away with this:
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 74bc4a04befa..92266f869174 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
> > if (!percpu_ref_tryget(&ctx->refs))
> > goto out_fput;
> >
> > + if (unlikely(current->mm != ctx->sqo_mm)) {
> > + /*
> > + * The mm used to process SQEs will be current->mm or
> > + * ctx->sqo_mm depending on which submission path is used.
> > + * It's also unclear who is responsible for an SQE submitted
> > + * out-of-process from a security and auditing perspective.
> > + *
> > + * Until a real usecase emerges and there are clear semantics
> > + * for out-of-process submission, disallow it.
> > + */
> > + ret = -EACCES;
> > + goto out;
> > + }
> > +
> > /*
> > * For SQ polling, the thread will do all submissions and completions.
> > * Just return the requested submit count, and wake the thread if
>
> That'll break postgres that already uses this, also see:
>
> commit 73e08e711d9c1d79fae01daed4b0e1fee5f8a275
> Author: Jens Axboe <[email protected]>
> Date: Sun Jan 26 09:53:12 2020 -0700
>
> Revert "io_uring: only allow submit from owning task"
>
> So no, we can't do that.
>

Yikes, I missed that.

Andres, how final is your Postgres branch? I'm wondering if we could
get away with requiring a special flag when creating an io_uring to
indicate that you intend to submit IO from outside the creating mm.

Even if we can't make this change, we could plausibly get away with
tying seccomp-style filtering to sqo_mm. IOW we'd look up a
hypothetical sqo_mm->io_uring_filters to filter SQEs even when
submitted from a different mm.

--Andy

2020-07-21 17:31:52

by Jens Axboe

Subject: Re: strace of io_uring events?

On 7/21/20 11:23 AM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
>>
>> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
>>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
>>>>
>>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
>>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>>>
>>>>> access (IIUC) is possible without actually calling any of the io_uring
>>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
>>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
>>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
>>>>> waking up the worker thread?)
>>>>
>>>> It is true only if the io_uring instance is created with the SQPOLL
>>>> flag (not the default behaviour, and it requires CAP_SYS_ADMIN). In
>>>> this case the kthread is created, and you can also set a higher idle
>>>> time for it, so the wakeup syscall can be avoided as well.
>>>
>>> I stared at the io_uring code for a while, and I'm wondering if we're
>>> approaching this the wrong way. It seems to me that most of the
>>> complications here come from the fact that io_uring SQEs don't clearly
>>> belong to any particular security principal. (We have struct creds,
>>> but we don't really have a task or mm.) But I'm also not convinced
>>> that io_uring actually supports cross-mm submission except by accident
>>> -- as it stands, unless a user is very careful to only submit SQEs
>>> that don't use user pointers, the results will be unpredictable.
>>
>> How so?
>
> Unless I've missed something, either current->mm or sqo_mm will be
> used depending on which thread ends up doing the IO. (And there might
> be similar issues with threads.) Having the mm against which user
> memory references are resolved be an implementation detail seems
> suboptimal.

current->mm is always used from the entering task - obviously if done
synchronously, but also if it needs to go async. The only exception is a
setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
set up the ring. SQPOLL requires root privileges to set up, and there
isn't necessarily any task entering the io_uring at all. It'll just
submit sqes with the credentials that are registered with the ring.
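
For reference, a minimal liburing sketch of such an SQPOLL setup
(untested; error handling omitted):

#include <string.h>
#include <liburing.h>

static int setup_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;  /* kernel-side submission kthread */
    p.sq_thread_idle = 2000;        /* idle ms before the kthread sleeps */

    /* needs CAP_SYS_ADMIN on current kernels */
    return io_uring_queue_init_params(8, ring, &p);
}

While the kthread is awake, sqes are picked up without any
io_uring_enter() call at all.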

>>> Perhaps we can get away with this:
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 74bc4a04befa..92266f869174 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
>>> fd, u32, to_submit,
>>> if (!percpu_ref_tryget(&ctx->refs))
>>> goto out_fput;
>>>
>>> + if (unlikely(current->mm != ctx->sqo_mm)) {
>>> + /*
>>> + * The mm used to process SQEs will be current->mm or
>>> + * ctx->sqo_mm depending on which submission path is used.
>>> + * It's also unclear who is responsible for an SQE submitted
>>> + * out-of-process from a security and auditing perspective.
>>> + *
>>> + * Until a real usecase emerges and there are clear semantics
>>> + * for out-of-process submission, disallow it.
>>> + */
>>> + ret = -EACCES;
>>> + goto out;
>>> + }
>>> +
>>> /*
>>> * For SQ polling, the thread will do all submissions and completions.
>>> * Just return the requested submit count, and wake the thread if
>>
>> That'll break postgres that already uses this, also see:
>>
>> commit 73e08e711d9c1d79fae01daed4b0e1fee5f8a275
>> Author: Jens Axboe <[email protected]>
>> Date: Sun Jan 26 09:53:12 2020 -0700
>>
>> Revert "io_uring: only allow submit from owning task"
>>
>> So no, we can't do that.
>>
>
> Yikes, I missed that.
>
> Andres, how final is your Postgres branch? I'm wondering if we could
> get away with requiring a special flag when creating an io_uring to
> indicate that you intend to submit IO from outside the creating mm.
>
> Even if we can't make this change, we could plausibly get away with
> tying seccomp-style filtering to sqo_mm. IOW we'd look up a
> hypothetical sqo_mm->io_uring_filters to filter SQEs even when
> submitted from a different mm.

This is just one known use case; there may very well be others. Outside
of SQPOLL, which is special, I don't see a reason to restrict this.
Given that you may have a fuller understanding of it after the above
explanation, please clearly state what problem you're seeing that
warrants a change.

--
Jens Axboe

2020-07-21 17:47:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020 at 10:30 AM Jens Axboe <[email protected]> wrote:
>
> On 7/21/20 11:23 AM, Andy Lutomirski wrote:
> > On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
> >>
> >> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
> >>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
> >>>>
> >>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> >>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> >>>
> >>>>> access (IIUC) is possible without actually calling any of the io_uring
> >>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> >>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> >>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
> >>>>> waking up the worker thread?)
> >>>>
> >>>> It is true only if the io_uring instance is created with the SQPOLL flag
> >>>> (not the default behaviour, and it requires CAP_SYS_ADMIN). In this case
> >>>> the kthread is created and you can also set a higher idle time for it, so
> >>>> the wakeup syscall can be avoided as well.
> >>>
> >>> I stared at the io_uring code for a while, and I'm wondering if we're
> >>> approaching this the wrong way. It seems to me that most of the
> >>> complications here come from the fact that io_uring SQEs don't clearly
> >>> belong to any particular security principal. (We have struct creds,
> >>> but we don't really have a task or mm.) But I'm also not convinced
> >>> that io_uring actually supports cross-mm submission except by accident
> >>> -- as it stands, unless a user is very careful to only submit SQEs
> >>> that don't use user pointers, the results will be unpredictable.
> >>
> >> How so?
> >
> > Unless I've missed something, either current->mm or sqo_mm will be
> > used depending on which thread ends up doing the IO. (And there might
> > be similar issues with threads.) Having the mm that the user memory
> > references resolve against be an implementation detail seems suboptimal.
>
> current->mm is always used from the entering task - obviously if done
> synchronously, but also if it needs to go async. The only exception is a
> setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
> set up the ring. SQPOLL requires root privileges to set up, and there
> isn't necessarily any task entering the io_uring at all. It'll just
> submit sqes with the credentials that are registered with the ring.

Really? I admit I haven't fully followed how the code works, but it
looks like anything that goes through the io_queue_async_work() path
will use sqo_mm, and can't most requests that end up blocking end up
there? It looks like, even if SQPOLL is not set, the mm used will
depend on whether the request ends up blocking and thus getting queued
for later completion.

Or does some magic I missed make this a nonissue?


>
> This is just one known use case; there may very well be others. Outside
> of SQPOLL, which is special, I don't see a reason to restrict this.
> Given that you may have a fuller understanding of it after the above
> explanation, please clearly state what problem you're seeing that
> warrants a change.

I see two fundamental issues:

1. The above. This may be less of an issue than it seems to me, but,
if you submit io from outside sqo_mm, the mm that ends up being used
depends on whether the IO is completed from io_uring_enter() or from
the workqueue. For something like Postgres, I guess this is okay
because the memory is MAP_ANONYMOUS | MAP_SHARED and the pointers all
point the same place regardless (see the sketch after this list).

2. If you create an io_uring and io_uring_enter() it from a different
mm, it's unclear what seccomp is supposed to do. (Or audit, for that
matter.) Which task did the IO? Which mm did the IO? Whose sandbox
is supposed to be applied?
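
To illustrate point 1, a minimal sketch of my reading (not Postgres's
actual code):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* A shared anonymous region created before fork() is mapped at
     * the same virtual address in parent and child and backed by the
     * same pages, so an SQE user pointer into it resolves to the same
     * memory in either mm. */
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED)
        return 1;
    if (fork() == 0) {
        strcpy(buf, "visible in both processes");
        _exit(0);
    }
    /* Either process could now submit an SQE referencing buf;
     * whichever mm services the request sees the same data. */
    return 0;
}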

--Andy

2020-07-21 18:42:46

by Jens Axboe

[permalink] [raw]
Subject: Re: strace of io_uring events?

On 7/21/20 11:44 AM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2020 at 10:30 AM Jens Axboe <[email protected]> wrote:
>>
>> On 7/21/20 11:23 AM, Andy Lutomirski wrote:
>>> On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
>>>>
>>>> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
>>>>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
>>>>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>>>>>
>>>>>>> access (IIUC) is possible without actually calling any of the io_uring
>>>>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
>>>>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
>>>>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
>>>>>>> waking up the worker thread?)
>>>>>>
>>>>>> It is true only if the io_uring instance is created with the SQPOLL flag
>>>>>> (not the default behaviour, and it requires CAP_SYS_ADMIN). In this case
>>>>>> the kthread is created and you can also set a higher idle time for it, so
>>>>>> the wakeup syscall can be avoided as well.
>>>>>
>>>>> I stared at the io_uring code for a while, and I'm wondering if we're
>>>>> approaching this the wrong way. It seems to me that most of the
>>>>> complications here come from the fact that io_uring SQEs don't clearly
>>>>> belong to any particular security principal. (We have struct creds,
>>>>> but we don't really have a task or mm.) But I'm also not convinced
>>>>> that io_uring actually supports cross-mm submission except by accident
>>>>> -- as it stands, unless a user is very careful to only submit SQEs
>>>>> that don't use user pointers, the results will be unpredictable.
>>>>
>>>> How so?
>>>
>>> Unless I've missed something, either current->mm or sqo_mm will be
>>> used depending on which thread ends up doing the IO. (And there might
>>> be similar issues with threads.) Having the mm that the user memory
>>> references resolve against be an implementation detail seems suboptimal.
>>
>> current->mm is always used from the entering task - obviously if done
>> synchronously, but also if it needs to go async. The only exception is a
>> setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
>> set up the ring. SQPOLL requires root privileges to set up, and there
>> isn't necessarily any task entering the io_uring at all. It'll just
>> submit sqes with the credentials that are registered with the ring.
>
> Really? I admit I haven't fully followed how the code works, but it
> looks like anything that goes through the io_queue_async_work() path
> will use sqo_mm, and can't most requests that end up blocking end up
> there? It looks like, even if SQPOLL is not set, the mm used will
> depend on whether the request ends up blocking and thus getting queued
> for later completion.
>
> Or does some magic I missed make this a nonissue?

No, you are wrong. The logic works as I described it.

>> This is just one known use case, there may very well be others. Outside
>> of SQPOLL, which is special, I don't see a reason to restrict this.
>> Given that you may have a fuller understanding of it after the above
>> explanation, please clearly state what problem you're seeing that
>> warrants a change.
>
> I see two fundamental issues:
>
> 1. The above. This may be less of an issue than it seems to me, but,
> if you submit io from outside sqo_mm, the mm that ends up being used
> depends on whether the IO is completed from io_uring_enter() or from
> the workqueue. For something like Postgres, I guess this is okay
> because the memory is MAP_ANONYMOUS | MAP_SHARED and the pointers all
> point the same place regardless.

No, that is incorrect. If you disregard SQPOLL, then the 'mm' is always
that of the task that submitted the request.

> 2. If you create an io_uring and io_uring_enter() it from a different
> mm, it's unclear what seccomp is supposed to do. (Or audit, for that
> matter.) Which task did the IO? Which mm did the IO? Whose sandbox
> is supposed to be applied?

That also doesn't seem like a problem, if you understand the 'mm' logic
above. Unless SQPOLL is used, the entering task's mm will be used.
There's no mixing of tasks and mms outside of that.

--
Jens Axboe

2020-07-21 19:38:50

by Andres Freund

[permalink] [raw]
Subject: Re: strace of io_uring events?

Hi,

On 2020-07-21 10:23:22 -0700, Andy Lutomirski wrote:
> Andres, how final is your Postgres branch?

Not final at all. Some of the constraints, like needing to submit to and
receive completions from multiple processes, are pretty immovable though.


> I'm wondering if we could get away with requiring a special flag when
> creating an io_uring to indicate that you intend to submit IO from
> outside the creating mm.

Perhaps. It'd need to be clear when we need to do so, as we certainly
won't want to raise the minimum required kernel version.


But I think postgres is far from the only use case for wanting the
submitting mm to be the relevant one, not the creating one. It seems far
more dangerous to use the creating mm than the submitting mm: that would
make things like passing a uring fd with a few pre-opened files to
another process impossible.
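
For concreteness, the fd passing I mean is plain SCM_RIGHTS - a sketch,
error handling omitted:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send ring_fd (e.g. an io_uring fd with a few files already
 * registered) to the process at the other end of a Unix socket. */
static void send_ring_fd(int sock, int ring_fd)
{
    char dummy = 'r';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ring_fd, sizeof(int));
    sendmsg(sock, &msg, 0);
}

The receiver recvmsg()s the fd and submits on it directly; under a
creating-mm rule none of those submissions could reference the
receiver's own memory.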

Greetings,

Andres Freund

2020-07-21 19:44:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020 at 11:39 AM Jens Axboe <[email protected]> wrote:
>
> On 7/21/20 11:44 AM, Andy Lutomirski wrote:
> > On Tue, Jul 21, 2020 at 10:30 AM Jens Axboe <[email protected]> wrote:
> >>
> >> On 7/21/20 11:23 AM, Andy Lutomirski wrote:
> >>> On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
> >>>>
> >>>> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
> >>>>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
> >>>>>>
> >>>>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> >>>>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> >>>>>
> >>>>>>> access (IIUC) is possible without actually calling any of the io_uring
> >>>>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> >>>>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> >>>>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
> >>>>>>> waking up the worker thread?)
> >>>>>>
> >>>>>> It is true only if the io_uring instance is created with the SQPOLL flag
> >>>>>> (not the default behaviour, and it requires CAP_SYS_ADMIN). In this case
> >>>>>> the kthread is created and you can also set a higher idle time for it, so
> >>>>>> the wakeup syscall can be avoided as well.
> >>>>>
> >>>>> I stared at the io_uring code for a while, and I'm wondering if we're
> >>>>> approaching this the wrong way. It seems to me that most of the
> >>>>> complications here come from the fact that io_uring SQEs don't clearly
> >>>>> belong to any particular security principal. (We have struct creds,
> >>>>> but we don't really have a task or mm.) But I'm also not convinced
> >>>>> that io_uring actually supports cross-mm submission except by accident
> >>>>> -- as it stands, unless a user is very careful to only submit SQEs
> >>>>> that don't use user pointers, the results will be unpredictable.
> >>>>
> >>>> How so?
> >>>
> >>> Unless I've missed something, either current->mm or sqo_mm will be
> >>> used depending on which thread ends up doing the IO. (And there might
> >>> be similar issues with threads.) Having the mm that the user memory
> >>> references resolve against be an implementation detail seems suboptimal.
> >>
> >> current->mm is always used from the entering task - obviously if done
> >> synchronously, but also if it needs to go async. The only exception is a
> >> setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
> >> set up the ring. SQPOLL requires root privileges to set up, and there
> >> isn't necessarily any task entering the io_uring at all. It'll just
> >> submit sqes with the credentials that are registered with the ring.
> >
> > Really? I admit I haven't fully followed how the code works, but it
> > looks like anything that goes through the io_queue_async_work() path
> > will use sqo_mm, and can't most requests that end up blocking end up
> > there? It looks like, even if SQPOLL is not set, the mm used will
> > depend on whether the request ends up blocking and thus getting queued
> > for later completion.
> >
> > Or does some magic I missed make this a nonissue?
>
> No, you are wrong. The logic works as I described it.

Can you enlighten me? I don't see any iov_iter_get_pages() calls or
equivalents. If an IO is punted, how does the data end up in the
io_uring_enter() caller's mm?

2020-07-21 19:50:54

by Jens Axboe

[permalink] [raw]
Subject: Re: strace of io_uring events?

On 7/21/20 1:44 PM, Andy Lutomirski wrote:
> On Tue, Jul 21, 2020 at 11:39 AM Jens Axboe <[email protected]> wrote:
>>
>> On 7/21/20 11:44 AM, Andy Lutomirski wrote:
>>> On Tue, Jul 21, 2020 at 10:30 AM Jens Axboe <[email protected]> wrote:
>>>>
>>>> On 7/21/20 11:23 AM, Andy Lutomirski wrote:
>>>>> On Tue, Jul 21, 2020 at 8:31 AM Jens Axboe <[email protected]> wrote:
>>>>>>
>>>>>> On 7/21/20 9:27 AM, Andy Lutomirski wrote:
>>>>>>> On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
>>>>>>>>> On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
>>>>>>>
>>>>>>>>> access (IIUC) is possible without actually calling any of the io_uring
>>>>>>>>> syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
>>>>>>>>> pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
>>>>>>>>> access to the SQ and CQ, and off it goes? (The only glitch I see is
>>>>>>>>> waking up the worker thread?)
>>>>>>>>
>>>>>>>> It is true only if the io_uring instance is created with the SQPOLL flag
>>>>>>>> (not the default behaviour, and it requires CAP_SYS_ADMIN). In this case
>>>>>>>> the kthread is created and you can also set a higher idle time for it, so
>>>>>>>> the wakeup syscall can be avoided as well.
>>>>>>>
>>>>>>> I stared at the io_uring code for a while, and I'm wondering if we're
>>>>>>> approaching this the wrong way. It seems to me that most of the
>>>>>>> complications here come from the fact that io_uring SQEs don't clearly
>>>>>>> belong to any particular security principal. (We have struct creds,
>>>>>>> but we don't really have a task or mm.) But I'm also not convinced
>>>>>>> that io_uring actually supports cross-mm submission except by accident
>>>>>>> -- as it stands, unless a user is very careful to only submit SQEs
>>>>>>> that don't use user pointers, the results will be unpredictable.
>>>>>>
>>>>>> How so?
>>>>>
>>>>> Unless I've missed something, either current->mm or sqo_mm will be
>>>>> used depending on which thread ends up doing the IO. (And there might
>>>>> be similar issues with threads.) Having the mm that the user memory
>>>>> references resolve against be an implementation detail seems suboptimal.
>>>>
>>>> current->mm is always used from the entering task - obviously if done
>>>> synchronously, but also if it needs to go async. The only exception is a
>>>> setup with SQPOLL, in which case ctx->sqo_mm is the mm of the task that
>>>> set up the ring. SQPOLL requires root privileges to set up, and there
>>>> isn't necessarily any task entering the io_uring at all. It'll just
>>>> submit sqes with the credentials that are registered with the ring.
>>>
>>> Really? I admit I haven't fully followed how the code works, but it
>>> looks like anything that goes through the io_queue_async_work() path
>>> will use sqo_mm, and can't most requests that end up blocking end up
>>> there? It looks like, even if SQPOLL is not set, the mm used will
>>> depend on whether the request ends up blocking and thus getting queued
>>> for later completion.
>>>
>>> Or does some magic I missed make this a nonissue?
>>
>> No, you are wrong. The logic works as I described it.
>
> Can you enlighten me? I don't see any iov_iter_get_pages() calls or
> equivalents. If an IO is punted, how does the data end up in the
> io_uring_enter() caller's mm?

If the SQE needs to be punted to an io-wq worker, then
io_prep_async_work() is ultimately called before it's queued with the
io-wq worker. That grabs anything we need to successfully process this
request, user access and all. io-wq then assumes the right "context" to
perform that request. As the async punt is always done on behalf of the
task that is submitting the IO (via io_uring_enter()), that is the
context that we grab and use for that particular request.

You keep looking at ctx->sqo_mm, and I've told you several times that
it's only related to the SQPOLL thread. If you don't use SQPOLL, no
request will ever use it.
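
In simplified form - paraphrasing fs/io_uring.c, not the literal
source:

static void io_prep_async_work(struct io_kiocb *req)
{
    const struct io_op_def *def = &io_op_defs[req->opcode];

    if (!req->work.mm && def->needs_mm) {
        /* pin the submitter's mm, i.e. current->mm at
         * io_uring_enter() time - not ctx->sqo_mm */
        mmgrab(current->mm);
        req->work.mm = current->mm;
    }
    /* creds, fs, etc. are captured the same way before the request
     * is handed to an io-wq worker */
}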

--
Jens Axboe

2020-07-21 19:59:33

by Andres Freund

[permalink] [raw]
Subject: Re: strace of io_uring events?

Hi,

On 2020-07-21 12:44:09 -0700, Andy Lutomirski wrote:
> Can you enlighten me? I don't see any iov_iter_get_pages() calls or
> equivalents. If an IO is punted, how does the data end up in the
> io_uring_enter() caller's mm?

For operations that need it, io_op_def.needs_mm is true. That is checked
by io_prep_async_work(), which adds the current mm to the req. On the wq
side, io_wq_switch_mm() uses that mm when executing the queue entry.
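
Roughly, simplified from fs/io-wq.c (not the literal source):

static void io_wq_switch_mm(struct io_worker *worker,
                            struct io_wq_work *work)
{
    if (worker->mm) {
        kthread_unuse_mm(worker->mm);   /* drop the previous mm */
        mmput(worker->mm);
        worker->mm = NULL;
    }
    if (mmget_not_zero(work->mm)) {
        kthread_use_mm(work->mm);       /* adopt the submitter's mm */
        worker->mm = work->mm;
        return;
    }
    /* failed to grab the mm; make sure the work gets cancelled */
    work->flags |= IO_WQ_WORK_CANCEL;
}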

Greetings,

Andres Freund

2020-07-23 10:40:45

by Stefan Hajnoczi

[permalink] [raw]
Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020 at 05:58:48PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 21, 2020 at 08:27:34AM -0700, Andy Lutomirski wrote:
> > On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella <[email protected]> wrote:
> > >
> > > On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> > > > On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> >
> > > > access (IIUC) is possible without actually calling any of the io_uring
> > > > syscalls. Is that correct? A process would receive an fd (via SCM_RIGHTS,
> > > > pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> > > > access to the SQ and CQ, and off it goes? (The only glitch I see is
> > > > waking up the worker thread?)
> > >
> > > It is true only if the io_uring instance is created with the SQPOLL flag
> > > (not the default behaviour, and it requires CAP_SYS_ADMIN). In this case
> > > the kthread is created and you can also set a higher idle time for it, so
> > > the wakeup syscall can be avoided as well.
> >
> > I stared at the io_uring code for a while, and I'm wondering if we're
> > approaching this the wrong way. It seems to me that most of the
> > complications here come from the fact that io_uring SQEs don't clearly
> > belong to any particular security principal. (We have struct creds,
> > but we don't really have a task or mm.) But I'm also not convinced
> > that io_uring actually supports cross-mm submission except by accident
> > -- as it stands, unless a user is very careful to only submit SQEs
> > that don't use user pointers, the results will be unpredictable.
> > Perhaps we can get away with this:
> >
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 74bc4a04befa..92266f869174 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
> > fd, u32, to_submit,
> > if (!percpu_ref_tryget(&ctx->refs))
> > goto out_fput;
> >
> > + if (unlikely(current->mm != ctx->sqo_mm)) {
> > + /*
> > + * The mm used to process SQEs will be current->mm or
> > + * ctx->sqo_mm depending on which submission path is used.
> > + * It's also unclear who is responsible for an SQE submitted
> > + * out-of-process from a security and auditing perspective.
> > + *
> > + * Until a real usecase emerges and there are clear semantics
> > + * for out-of-process submission, disallow it.
> > + */
> > + ret = -EACCES;
> > + goto out;
> > + }
> > +
> > /*
> > * For SQ polling, the thread will do all submissions and completions.
> > * Just return the requested submit count, and wake the thread if
> >
> > If we can do that, then we could bind seccomp-like io_uring filters to
> > an mm, and we get obvious semantics that ought to cover most of the
> > bases.
> >
> > Jens, Christoph?
> >
> > Stefano, what's your intended use case for your restriction patchset?
> >
>
> Hi Andy,
> my use case concerns virtualization. The idea, which I described in the
> io_uring restrictions proposal [1], is to share io_uring CQ and SQ queues
> with a guest VM for block operations.
>
> In the PoC that I built, there is a block device driver in the guest that
> uses io_uring queues provided by the host to submit block requests.
>
> Since the guest is not trusted, we need restrictions to allow only
> a subset of syscalls on a subset of file descriptors and memory.

BTW there's only a single mm in the kvm.ko use case.

Stefan



2020-07-23 13:38:38

by Colin Walters

[permalink] [raw]
Subject: Re: strace of io_uring events?

On Tue, Jul 21, 2020, at 11:58 AM, Stefano Garzarella wrote:

> my use case concerns virtualization. The idea, which I described in the
> io_uring restrictions proposal [1], is to share io_uring CQ and SQ queues
> with a guest VM for block operations.

Treating virtualization as a strong security barrier is in eternal conflict with maximizing performance.
All of these "let's add a special guest/host channel" efforts are high-risk areas.

And this effort in particular - is it *really* worth it to expose a brand-new, fast-moving Linux kernel interface (that probably hasn't been fuzzed as much as it needs to be) to virtual machines?

People who want maximum performance at the cost of a bit of security already have the choice to use Linux containers, where they can use io_uring natively.

2020-07-24 07:27:57

by Stefano Garzarella

[permalink] [raw]
Subject: Re: strace of io_uring events?

On Thu, Jul 23, 2020 at 09:37:40AM -0400, Colin Walters wrote:
> On Tue, Jul 21, 2020, at 11:58 AM, Stefano Garzarella wrote:
>
> > my use case concerns virtualization. The idea, which I described in the
> > io_uring restrictions proposal [1], is to share io_uring CQ and SQ queues
> > with a guest VM for block operations.
>
> Treating virtualization as a strong security barrier is in eternal
> conflict with maximizing performance. All of these "let's add a special
> guest/host channel" efforts are high-risk areas.
>
> And this effort in particular - is it *really* worth it to expose a
> brand-new, fast-moving Linux kernel interface (that probably hasn't
> been fuzzed as much as it needs to be) to virtual machines?
>

It is an experiment to explore the potential of io_uring. In addition,
the restrictions can also be useful for other use cases, for example if
a process wants to allow another process to use io_uring while only
allowing a subset of operations.
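
As a sketch against the interface from my RFC (constant names as
proposed there and may change before merge; raw syscall used since
liburing has no wrapper yet):

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/* ring_fd must have been created with IORING_SETUP_R_DISABLED */
static int restrict_ring_to_rw(int ring_fd)
{
    struct io_uring_restriction res[2];

    memset(res, 0, sizeof(res));
    res[0].opcode = IORING_RESTRICTION_SQE_OP;
    res[0].sqe_op = IORING_OP_READV;    /* allow reads ...         */
    res[1].opcode = IORING_RESTRICTION_SQE_OP;
    res[1].sqe_op = IORING_OP_WRITEV;   /* ... and writes, no more */

    if (syscall(__NR_io_uring_register, ring_fd,
                IORING_REGISTER_RESTRICTIONS, res, 2) < 0)
        return -1;

    /* start SQE processing with the restrictions in place */
    return syscall(__NR_io_uring_register, ring_fd,
                   IORING_REGISTER_ENABLE_RINGS, NULL, 0);
}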

> People who want maximum performance at the cost of a bit of security
> already have the choice to use Linux containers, where they can use
> io_uring natively.
>

Thanks,
Stefano