2020-10-29 02:29:45

by Camille Mougey

Subject: [seccomp] Request for a "enable on execve" mode for Seccomp filters

Hello,

(This is my first message to the kernel list, I hope I'm doing it right)

From my understanding, there is no way to delay the activation of
seccomp filters, for instance "until an _execve_ call".
But this might be useful, especially for tools that sandbox other,
non-cooperative executables, such as "systemd" or "FireJail".

It seems to be a caveat of seccomp specific to the system call
_execve_. For now, some tools such as "systemd" explicitly mention
this exception, and do not support it (from the man page):
> Note that strict system call filters may impact execution and error handling code paths of the service invocation. Specifically, access to the execve system call is required for the execution of the service binary — if it is blocked service invocation will necessarily fail

"FireJail" takes a different approach[1], with a kind of workaround:
the project uses an external library to be loaded through LD_PRELOAD
mechanism, in order to install filters during the loader stage.
This approach, a bit hacky, also has several caveats:
* _openat_, _mmap_, etc. must be allowed in order to reach the
  LD_PRELOAD mechanism, and for the crafted library to work;
* it doesn't work for static binaries.

I only see hackish ways to restrict the use of _execve_ in a
non-cooperative executable. These methods all seem bypassable
and unsatisfactory from a security point of view.

IMHO, a way to prepare filters and enable them only on the next
_execve_ would have some benefits:
* have a way to restrict _execve_ in a non-cooperative executable;
* install filters atomically, i.e. before the _execve_ system call
  returns. That would limit racy situations, and have the very first
  instructions of potentially untrusted binaries already subject to
  seccomp filters. It would also ensure there is only one thread running
  when the filter is enabled.

From what I understand, there is a related use case[2] where the
"enable on exec" mode would also be a solution.

Thanks for your attention,
C. Mougey

[1]: https://github.com/netblue30/firejail/issues/3685
[2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/


2020-10-29 08:57:36

by Rich Felker

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 07:39:41PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 7:35 PM Rich Felker <[email protected]> wrote:
> > On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker <[email protected]> wrote:
> > > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker <[email protected]> wrote:
> > > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey <[email protected]> wrote:
> > > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > > >
> > > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > > effectively does almost the same thing as execve().
> > > > > >
> > > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > > controlling these operations and allowing only the ones that are valid
> > > > > > during dynamic linking. This also allows you to defer application of
> > > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > > this doesn't work, I think the requested functionality is already
> > > > > > available.
> > > > >
> > > > > Ah, yeah, good point.
> > > > >
> > > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > > no notify fd open; I forget)
> > > > >
> > > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > > though it might be a bit nicer if userspace had control over the errno
> > > > > there, such that it could be EPERM instead... oh well.)
> > > >
> > > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > > at least mildly better, but indeed it should be controllable, probably
> > > > by allowing a code path for the BPF to continue with a jump to a
> > > > different logic path if the notify listener is missing.
> > >
> > > I guess we might be able to expose the listener status through a bit /
> > > a field in the struct seccomp_data, and then filters could branch on
> > > that. (And the kernel would run the filter twice if we raced with
> > > filter detachment.) I don't know whether it would look pretty, but I
> > > think it should be doable...
> >
> > I was thinking the race wouldn't be salvagable, but indeed since the
> > filter is side-effect-free you can just re-run it if the status
> > changes between start of filter processing and the attempt at
> > notification. This sounds like it should work.
> >
> > I guess it's not possible to chain two BPF filters to do this, because
> > that only works when the first one allows? Or am I misunderstanding
> > the multiple-filters case entirely? (I've never gotten that far with
> > programming it.)
>
> I'm not sure if I'm understanding the question correctly...
> At the moment you basically can't have multiple filters with notifiers.
> The rule with multiple filters is always that all the filters get run,
> and the actual action taken is the most restrictive result of all of
> them.

I probably just don't understand how multiple filters work then, which
is pretty much what I expected. But in any case it seems correct that
they're not a tool for solving the problem here.

Rich

2020-10-29 08:57:38

by Rich Felker

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> On Wed, Oct 28, 2020 at 6:52 PM Rich Felker <[email protected]> wrote:
> > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker <[email protected]> wrote:
> > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey <[email protected]> wrote:
> > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > >
> > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > you have to permit in the filter. And if you want the filter to take
> > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > permit are sufficient to cobble something together in userspace that
> > > > > effectively does almost the same thing as execve().
> > > >
> > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > controlling these operations and allowing only the ones that are valid
> > > > during dynamic linking. This also allows you to defer application of
> > > > the filter until after execve. So unless I'm missing some reason why
> > > > this doesn't work, I think the requested functionality is already
> > > > available.
> > >
> > > Ah, yeah, good point.
> > >
> > > > If you really just want the "activate at exec" behavior, it might be
> > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > no notify fd open; I forget)
> > >
> > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > though it might be a bit nicer if userspace had control over the errno
> > > there, such that it could be EPERM instead... oh well.)
> >
> > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > at least mildly better, but indeed it should be controllable, probably
> > by allowing a code path for the BPF to continue with a jump to a
> > different logic path if the notify listener is missing.
>
> I guess we might be able to expose the listener status through a bit /
> a field in the struct seccomp_data, and then filters could branch on
> that. (And the kernel would run the filter twice if we raced with
> filter detachment.) I don't know whether it would look pretty, but I
> think it should be doable...

I was thinking the race wouldn't be salvageable, but indeed since the
filter is side-effect-free you can just re-run it if the status
changes between start of filter processing and the attempt at
notification. This sounds like it should work.

I guess it's not possible to chain two BPF filters to do this, because
that only works when the first one allows? Or am I misunderstanding
the multiple-filters case entirely? (I've never gotten that far with
programming it.)

Rich

2020-10-29 09:07:52

by Sargun Dhillon

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 03:47:27PM -0700, Kees Cook wrote:
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
>
> Looks good to me! The key was CCing real people. ;)
>
> > From my understanding, there is no way to delay the activation of
> > seccomp filters, for instance "until an _execve_ call".
> > But this might be useful, especially for tools who sandbox other,
> > non-cooperative, executables, such as "systemd" or "FireJail".
> > [...]
> > I only see hackish ways to restrict the use of _execve_ in a
> > non-cooperative executable. These methods seem globally bypassables
> > and not satisfactory from a security point of view.
> >
> > IMHO, a way to prepare filter and enable them only on the next
> > _execve_ would have some benefit:
> > * have a way to restrict _execve_ in a non-cooperative executable;
> > * install filters atomically, ie. before the _execve_ system call
> > return. That would limit racy situations, and have the very firsts
> > instructions of potentially untrusted binaries already subject to
> > seccomp filters. It would also ensure there is only one thread running
> > at the filter enabling time.
> >
> > From what I understand, there is a relative use case[2] where the
> > "enable on exec" mode would also be a solution.
> >
> > Thanks for your attention,
> > C. Mougey
> >
> > [1]: https://github.com/netblue30/firejail/issues/3685
> > [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/
>
> Just to restate things already said in the thread and to try to illustrate
> with more clarity, I tend to organize my thinking about seccomp usage
> into three categories:
>
> 1- self-confinement
> 2- launching external processes
> a) cooperating
> b) oblivious
>
> I classify things like Chrome's complex tree of related processes and
> filters as 1, since it's all one thing together.
>
> I think of systemd, docker, minijail, FireJail, etc all as falling into
> category 2, with some variation about how to deal with 2a or 2b. I see
> systemd as weakly covering both 2a and 2b: e.g. services are documenting
> what restrictions they want, etc. minijail has stronger 2b coverage as
> it attempts to do PRELOAD tricks (which it sounds like FireJail does
> too?) (Aside: why doesn't systemd do any self-confinement?)
>
> We don't have much possibility for the targets in the 2a realm as far
> as cooperating over how to _manage_ confinement, but rather about simply
> expecting confinement to exist, or adding more confinement on their own.
>
> So, what would adding delayed filters gain in the above classifications?
>
> Both 1 and 2 would benefit from some simplification over how to apply
> filters (e.g. the referenced relative complexity of needing to pass the
> USER_NOTIF fd up to the supervisor).
>
> Dealing with 2b is improved by allowing execve itself to be blocked.
>
> If we turn this:
>
> fork
> prepare & apply
> exec
>
> into this:
>
> fork
> prepare
> exec & apply
>
> for 2a, this isn't too interesting since a 2a target could just give up
> execve after it launched. For 2b, though, it's pretty meaningful to gain
> further isolation of an oblivious (and assumingly untrusted) process
> (given all the hacks needed to try to cover the situation).
>
> And to clarify, 2a would much prefer this to be able to separate
> initialization from runtime:
>
> fork
> prepare
> exec
> other things
> apply
>
> And just for completeness, none of this is useful at all for 1, which
> doesn't even "see" the fork from its perspective:
>
> exec
> other things
> prepare & apply
>
> How should 2a targets indicate they're ready? Can it be done passively
> (in the sense that libc would make some seccomp call to apply the
> delayed filters), or does it need to stay explicit? (e.g. can we turn
> a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
> My instinct is that hiding it won't gain much over a "on-execve" case,
> but having an explicit call that means "I'm done initializing now" would
> be a meaningful synchronization point -- except I note that it just means
> the target could just as easily start doing its own confinement anyway,
> which means they effectively move from 2a to 1, and now we don't care
> about delayed filters any more.
>
> So, lacking a clearer sync point, execve() does seem to stand out to me.
>
> The other idea which was touched on in the thread was very direct
> management (e.g. ptrace) and the supervisor waits until some point and
> then forces the filters to apply on the target. What would be more
> light-weight than this? (Or rather, what kinds of things would such a
> ptracer be looking for to mark "I've started"?)
>
> Since I've got bitmaps on my mind, what about a syscall bitmap that
> triggers the application of delayed filters? The supervisor is launching
> a daemon: mark NR_listen as the apply-point. The supervisor is launching
> something totally unknown: mark NR_execve as the apply-point.
>
> If we did that, what happens to non-delayed filters applied between
> program start and the apply-point getting tripped?
>
> --
> Kees Cook

So, I think we talked about something like this at LSS NA 2019. We have a use
case where we want to build a syscall attenuation framework. Basically, we know
there are some set of syscalls that all programs need, and are not problematic
(write, read, send, recv, etc..), there are syscalls that some programs need
(execve, connect, etc...), and there are syscalls that no program needs (reboot,
kexec).
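The three classes above can be sketched as a small lookup. This is only an illustration, not the framework itself: the names `ex_policy` and `classify` are hypothetical, syscalls are matched by name rather than by number to keep the sketch architecture-independent, "kexec" is spelled `kexec_load` here, and routing unknown syscalls to the supervisor is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy classes for the three groups of syscalls. */
enum ex_policy {
    EX_ALLOW,   /* needed by all programs, not problematic */
    EX_NOTIFY,  /* needed by some programs: ask the supervisor */
    EX_DENY     /* needed by no program */
};

/* Return 1 if name appears in the NULL-terminated list. */
static int in_list(const char *name, const char *const *list)
{
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(name, list[i]) == 0)
            return 1;
    }
    return 0;
}

static enum ex_policy classify(const char *name)
{
    static const char *const always[] = { "read", "write", "send", "recv", NULL };
    static const char *const never[] = { "reboot", "kexec_load", NULL };

    assert(name != NULL);
    if (in_list(name, always))
        return EX_ALLOW;
    if (in_list(name, never))
        return EX_DENY;
    /* Assumption: everything else (execve, connect, and anything not
     * listed above) is forwarded to the supervisor for a decision. */
    return EX_NOTIFY;
}
```

In a real filter the same decision would be expressed as BPF over the syscall number, with the notify class mapped to SECCOMP_RET_USER_NOTIF.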

Upon the first time a developer starts a docker image, we want to have our
seccomp supervisor monitor for the syscalls that some programs need, and
once it "learns" what the image is doing we can prepare a filter for it
that can be evaluated purely in kernelspace.

We wanted a couple properties before embarking upon this work:

1. We still wanted to be able to capture the syscalls that are not allowed.
This could be indicative of (a) a special program that needs access for
good reasons (think: systemd) or (b) an attacker trying to compromise
the system. SECCOMP_FILTER_FLAG_LOG is useful for keeping track of (b),
but USER_NOTIF really spoils you.

2. We wanted to be able to have the supervisor shut down / crash, and
some default actions to be put in place for the filter.

---

I think that there may be a few things that could be useful here:

A way to set up multiple filters with multiple listeners. AFAIK, the reason for
limiting the seccomp filters to have one listener was for sanity / simplicity,
and making the semantics clear. Now that we've matured the feature a little bit,
I think it might make sense to add something like:
SECCOMP_FILTER_FLAG_NEW_LISTENER_COOPERATIVE (Which allows other filters with
the cooperative flag to be installed), and the semantics could be that if a
specific response is set, it would terminate processing of further filters,
whereas if SECCOMP_USER_NOTIF_FLAG_CONTINUE was set it would continue up the
chain.

A mechanism for the thing listening on the listener FD to turn itself on or off
and indicate that it is no longer interested in receiving notifications and to
always continue / return an error code, or that it has taken an interest now,
and it would like to return to handling these events. The idea of an action
other than ENOSYS (specifically SECCOMP_USER_NOTIF_FLAG_CONTINUE) if the
listener goes away is attractive as well, in case the supervisor crashes. EPERM
is somewhat "cleaner" of an error code than ENOSYS (most people don't write
handling for ENOSYS on connect).

And lastly, a mechanism that once we ran the program long enough to learn
what special syscalls it needed, we can replace (or add) a filter from
the listener fd.

This hasn't been particularly high up on my priority list (dealing with
arguments copied from userspace is next on my list to tackle), so the above is
mostly a brain dump of what I've thought about so far. It seems to align
somewhat with what Camille is asking for, but I'm unsure whether the "replace filter"
or "append filter" approach would suit the needs, as we would want it to affect every
instance of that filter, and not only descendants of the newly spawned
process (after an execve).

----

Kees,

What do you think of adding the above features^? I realize *replacing* the
filter somewhat removes pr_set_no_new_privs, but if the supervisor was going to
return continue (or some error) anyways, I'm not sure that it violates the
notion of no_new_privs

^features:
1. The ability for multiple cooperative filters that use user_notif / listener
to be installed?
2. The ability for the listener to specify some default response, and
turn that default response off and on?
3. The ability for the supervisor to replace the filter (on the entire
existing process tree)?
4. The ability for the supervisor to specify some default response (or even
some fallback filter to use) in case it exits / crashes / closes
the listener fd?

2020-10-29 10:01:12

by Andy Lutomirski

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 3:47 PM Kees Cook <[email protected]> wrote:
>
> On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> > (This is my first message to the kernel list, I hope I'm doing it right)
>
> 1- self-confinement
> 2- launching external processes
> a) cooperating
> b) oblivious

I remain quite unconvinced that delayed filters will solve a real
problem. As you described, 2a could just confine itself. There's an
obvious synchronization point -- sd_notify(). I bet sd_notify() could
be rigged up to apply externally-supplied filters, or sd_notify()
could interact with user notifiers to get some assistance.

2b is nasty. In an ideal world, we would materialize a fully formed
process with filters installed. The problem is that processes don't
generally come fully formed. Almost all interesting processes are
dynamically linked, and they get to specify their own dynamic linkers.
Even if we limit ourselves to a known dynamic linker, we would want to
make sure that the dynamic linker is hardened against various escape
techniques. For dynamic linking, we would probably want to start out
with one set of privileges (loading libraries) and then switch.

I have an alternative suggestion to try to address some of the above:
allow a notifier to run in a mode in which it can replace the BPF
program outright. This would be something like:

    if (fork() != 0)
        return; // do parent stuff

    // Start up. Set a BPF program that directs pretty much everything at
    // the listener.
    int fd = seccomp(..., SECCOMP_FILTER_FLAG_NEW_LISTENER |
                     SECCOMP_FILTER_FLAG_ALLOW_REPLACEMENT, ...);

    // Set up other things if needed.

    execve();

Now, in the parent, once the child is ready for its final filters:

    // Replace the filter on *all* processes using the filter to which
    // we're attached.
    // I think the locking for this should be straightforward.
    // Optional flag here to remove the ALLOW_REPLACEMENT flag, but it's
    // not really necessary
    // since we're about to close() the listener.
    ioctl(fd, SECCOMP_IOCTL_NOTIF_REPLACE_FILTER, new_filter);

    // Call recv in a loop to drain and handle notifications.
    for (...) {
        ioctl(fd, SECCOMP_IOCTL_NOTIF_RECV, ...);
        ...
    }

    close(fd);

And now we're done. We can make the synchronization point be anything we like.


What do you all think? For people who really want
delay-until-execve(), this can emulate it efficiently.

2020-10-29 10:02:14

by Jann Horn

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 7:35 PM Rich Felker <[email protected]> wrote:
> On Wed, Oct 28, 2020 at 07:25:45PM +0100, Jann Horn wrote:
> > On Wed, Oct 28, 2020 at 6:52 PM Rich Felker <[email protected]> wrote:
> > > On Wed, Oct 28, 2020 at 06:34:56PM +0100, Jann Horn wrote:
> > > > On Wed, Oct 28, 2020 at 5:49 PM Rich Felker <[email protected]> wrote:
> > > > > On Wed, Oct 28, 2020 at 01:42:13PM +0100, Jann Horn wrote:
> > > > > > On Wed, Oct 28, 2020 at 12:18 PM Camille Mougey <[email protected]> wrote:
> > > > > > You're just focusing on execve() - I think it's important to keep in
> > > > > > mind what happens after execve() for normal, dynamically-linked
> > > > > > binaries: The next step is that the dynamic linker runs, and it will
> > > > > > poke around in the file system with access() and openat() and fstat(),
> > > > > > it will mmap() executable libraries into memory, it will mprotect()
> > > > > > some memory regions, it will set up thread-local storage (e.g. using
> > > > > > arch_prctl(); even if the process is single-threaded), and so on.
> > > > > >
> > > > > > The earlier you install the seccomp filter, the more of these steps
> > > > > > you have to permit in the filter. And if you want the filter to take
> > > > > > effect directly after execve(), the syscalls you'll be forced to
> > > > > > permit are sufficient to cobble something together in userspace that
> > > > > > effectively does almost the same thing as execve().
> > > > >
> > > > > I would assume you use SECCOMP_RET_USER_NOTIF to implement policy for
> > > > > controlling these operations and allowing only the ones that are valid
> > > > > during dynamic linking. This also allows you to defer application of
> > > > > the filter until after execve. So unless I'm missing some reason why
> > > > > this doesn't work, I think the requested functionality is already
> > > > > available.
> > > >
> > > > Ah, yeah, good point.
> > > >
> > > > > If you really just want the "activate at exec" behavior, it might be
> > > > > possible (depending on how SECCOMP_RET_USER_NOTIF behaves when there's
> > > > > no notify fd open; I forget)
> > > >
> > > > syscall returns -ENOSYS. Yeah, that'd probably do the job. (Even
> > > > though it might be a bit nicer if userspace had control over the errno
> > > > there, such that it could be EPERM instead... oh well.)
> > >
> > > EPERM is a major bug in current sandbox implementations, so ENOSYS is
> > > at least mildly better, but indeed it should be controllable, probably
> > > by allowing a code path for the BPF to continue with a jump to a
> > > different logic path if the notify listener is missing.
> >
> > I guess we might be able to expose the listener status through a bit /
> > a field in the struct seccomp_data, and then filters could branch on
> > that. (And the kernel would run the filter twice if we raced with
> > filter detachment.) I don't know whether it would look pretty, but I
> > think it should be doable...
>
> I was thinking the race wouldn't be salvagable, but indeed since the
> filter is side-effect-free you can just re-run it if the status
> changes between start of filter processing and the attempt at
> notification. This sounds like it should work.
>
> I guess it's not possible to chain two BPF filters to do this, because
> that only works when the first one allows? Or am I misunderstanding
> the multiple-filters case entirely? (I've never gotten that far with
> programming it.)

I'm not sure if I'm understanding the question correctly...
At the moment you basically can't have multiple filters with notifiers.
The rule with multiple filters is always that all the filters get run,
and the actual action taken is the most restrictive result of all of
them.
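
To make the "most restrictive result" rule concrete, here is a minimal userspace sketch of how the results of stacked filters combine. It is not kernel code: `most_restrictive` is a hypothetical helper, and the `EX_`-prefixed constants are copies of the action values from <linux/seccomp.h> so that the sketch builds without kernel headers. The action bits are compared as a signed 32-bit value and the lowest one wins, which gives the documented order KILL_PROCESS, KILL_THREAD, TRAP, ERRNO, USER_NOTIF, TRACE, LOG, ALLOW.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Action values copied from <linux/seccomp.h>; the EX_ prefix marks
 * them as local copies for this sketch. */
#define EX_RET_KILL_PROCESS 0x80000000U
#define EX_RET_KILL_THREAD  0x00000000U
#define EX_RET_TRAP         0x00030000U
#define EX_RET_ERRNO        0x00050000U
#define EX_RET_USER_NOTIF   0x7fc00000U
#define EX_RET_TRACE        0x7ff00000U
#define EX_RET_LOG          0x7ffc0000U
#define EX_RET_ALLOW        0x7fff0000U
#define EX_RET_ACTION_FULL  0xffff0000U

/* Keep only the action bits and view them as signed, so that
 * KILL_PROCESS (sign bit set) sorts below every other action. */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & EX_RET_ACTION_FULL);
}

/* Hypothetical helper: combine the return values of n stacked filters
 * by keeping the one whose action has the lowest signed value. */
static uint32_t most_restrictive(const uint32_t *results, size_t n)
{
    assert(results != NULL && n > 0);
    uint32_t ret = results[0];
    for (size_t i = 1; i < n; i++) {
        if (action_only(results[i]) < action_only(ret))
            ret = results[i];
    }
    return ret;
}
```

With this rule, a second filter returning ALLOW can never loosen a USER_NOTIF or ERRNO result from another filter, which is why stacking does not give a "fall through to the next filter" behaviour.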

2020-10-29 10:03:04

by Kees Cook

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> (This is my first message to the kernel list, I hope I'm doing it right)

Looks good to me! The key was CCing real people. ;)

> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".
> But this might be useful, especially for tools who sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
> [...]
> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassables
> and not satisfactory from a security point of view.
>
> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;
> * install filters atomically, ie. before the _execve_ system call
> return. That would limit racy situations, and have the very firsts
> instructions of potentially untrusted binaries already subject to
> seccomp filters. It would also ensure there is only one thread running
> at the filter enabling time.
>
> From what I understand, there is a relative use case[2] where the
> "enable on exec" mode would also be a solution.
>
> Thanks for your attention,
> C. Mougey
>
> [1]: https://github.com/netblue30/firejail/issues/3685
> [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/

Just to restate things already said in the thread and to try to illustrate
with more clarity, I tend to organize my thinking about seccomp usage
into three categories:

1- self-confinement
2- launching external processes
   a) cooperating
   b) oblivious

I classify things like Chrome's complex tree of related processes and
filters as 1, since it's all one thing together.

I think of systemd, docker, minijail, FireJail, etc all as falling into
category 2, with some variation about how to deal with 2a or 2b. I see
systemd as weakly covering both 2a and 2b: e.g. services are documenting
what restrictions they want, etc. minijail has stronger 2b coverage as
it attempts to do PRELOAD tricks (which it sounds like FireJail does
too?) (Aside: why doesn't systemd do any self-confinement?)

We don't have much possibility for the targets in the 2a realm as far
as cooperating over how to _manage_ confinement, but rather about simply
expecting confinement to exist, or adding more confinement on their own.

So, what would adding delayed filters gain in the above classifications?

Both 1 and 2 would benefit from some simplification over how to apply
filters (e.g. the referenced relative complexity of needing to pass the
USER_NOTIF fd up to the supervisor).

Dealing with 2b is improved by allowing execve itself to be blocked.

If we turn this:

    fork
    prepare & apply
    exec

into this:

    fork
    prepare
    exec & apply

for 2a, this isn't too interesting since a 2a target could just give up
execve after it launched. For 2b, though, it's pretty meaningful to gain
further isolation of an oblivious (and presumably untrusted) process
(given all the hacks needed to try to cover the situation).

And to clarify, 2a would much prefer this to be able to separate
initialization from runtime:

    fork
    prepare
    exec
    other things
    apply

And just for completeness, none of this is useful at all for 1, which
doesn't even "see" the fork from its perspective:

    exec
    other things
    prepare & apply

How should 2a targets indicate they're ready? Can it be done passively
(in the sense that libc would make some seccomp call to apply the
delayed filters), or does it need to stay explicit? (e.g. can we turn
a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
My instinct is that hiding it won't gain much over a "on-execve" case,
but having an explicit call that means "I'm done initializing now" would
be a meaningful synchronization point -- except I note that it just means
the target could just as easily start doing its own confinement anyway,
which means they effectively move from 2a to 1, and now we don't care
about delayed filters any more.

So, lacking a clearer sync point, execve() does seem to stand out to me.

The other idea which was touched on in the thread was very direct
management (e.g. ptrace) and the supervisor waits until some point and
then forces the filters to apply on the target. What would be more
light-weight than this? (Or rather, what kinds of things would such a
ptracer be looking for to mark "I've started"?)

Since I've got bitmaps on my mind, what about a syscall bitmap that
triggers the application of delayed filters? The supervisor is launching
a daemon: mark NR_listen as the apply-point. The supervisor is launching
something totally unknown: mark NR_execve as the apply-point.

If we did that, what happens to non-delayed filters applied between
program start and the apply-point getting tripped?
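
As a rough illustration of the bitmap idea, the sketch below keeps one bit per syscall number and reports whether a given syscall is a marked apply-point. Nothing here is an existing kernel interface: `struct apply_points`, its helpers, and the `EX_MAX_SYSCALLS` bound are all hypothetical, and the sketch does not model what happens to the delayed filters once the bit is hit.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define EX_MAX_SYSCALLS  512  /* assumed upper bound on syscall numbers */
#define EX_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define EX_WORDS ((EX_MAX_SYSCALLS + EX_BITS_PER_WORD - 1) / EX_BITS_PER_WORD)

/* Hypothetical bitmap: bit nr is set if syscall nr is an apply-point. */
struct apply_points {
    unsigned long bits[EX_WORDS];
};

static void apply_points_init(struct apply_points *ap)
{
    memset(ap, 0, sizeof(*ap));
}

/* Mark syscall nr as a point at which delayed filters get applied. */
static void apply_points_mark(struct apply_points *ap, unsigned int nr)
{
    assert(nr < EX_MAX_SYSCALLS);
    ap->bits[nr / EX_BITS_PER_WORD] |= 1UL << (nr % EX_BITS_PER_WORD);
}

/* Return true if syscall nr is a marked apply-point. */
static bool apply_points_hit(const struct apply_points *ap, unsigned int nr)
{
    if (nr >= EX_MAX_SYSCALLS)
        return false;
    return (ap->bits[nr / EX_BITS_PER_WORD] >> (nr % EX_BITS_PER_WORD)) & 1UL;
}
```

A supervisor launching a daemon would mark `SYS_listen` from <sys/syscall.h>, and one launching something unknown would mark `SYS_execve`.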

--
Kees Cook

2020-10-29 14:05:01

by Rich Felker

Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp filters

On Thu, Oct 29, 2020 at 07:58:42AM +0000, Sargun Dhillon wrote:
> A mechanism for the thing listening on the listener FD to turn itself on or off
> and indicate that it is no longer interested in receiving notifications and to
> always continue / return an error code, or that it has taken an interest now,
> and it would like to return to handling these events. The idea of an action
> other than ENOSYS (specifically SECCOMP_USER_NOTIF_FLAG_CONTINUE) if the
> listener goes away is attractive as well, in case the supervisor crashes. EPERM
> is somewhat "cleaner" of an error code than ENOSYS (most people don't write
> handling for ENOSYS on connect).

This is a common misconception that really needs to be addressed.
EPERM is not a suitable error code for as-yet-unknown seccomp-blocked
syscalls, and is not suitable for a large portion of currently known
ones either. Its use has actively broken lots of things that would have
worked fine with ENOSYS returned. This is because a caller will react
to ENOSYS by attempting to do whatever it wanted to do in another way,
one which your policy might handle better (for example, if your
filter is outdated and blocking clock_gettime64 but not clock_gettime,
or blocking statx but not stat64) while it will react to EPERM as an
indication that your filter understands the abstract operation it was
trying to perform and considers that action forbidden. We hit issues
like this with virtually every seccomp-using application while moving
to time64 because the wrong idiom is so widespread, and further vdso
prevented it from being caught right away (only users without vdso hit
it later).
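
The caller-side idiom being described can be sketched as follows. This is a simplified model, not code from any libc: `call_with_fallback` and the three stand-in operations are hypothetical, and each operation returns 0 or a negative errno value in the style of a raw syscall. The point is that only ENOSYS triggers the retry through the older interface, so a sandbox that answers EPERM for a syscall it does not know turns a recoverable situation into a hard failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Each operation returns 0 on success or a negative errno value,
 * mirroring the raw kernel syscall convention. */
typedef int (*op_fn)(void);

/* Try the newer interface first; fall back to the older one only when
 * the newer one reports "not implemented". */
static int call_with_fallback(op_fn newer, op_fn older)
{
    assert(newer != NULL && older != NULL);
    int ret = newer();
    if (ret == -ENOSYS)
        return older();
    return ret; /* success, or a real error such as -EPERM */
}

/* Hypothetical stand-ins for how a sandbox might answer. */
static int blocked_with_enosys(void) { return -ENOSYS; }
static int blocked_with_eperm(void)  { return -EPERM; }
static int older_allowed(void)       { return 0; }
```

Here `call_with_fallback(blocked_with_enosys, older_allowed)` succeeds through the older path, while `call_with_fallback(blocked_with_eperm, older_allowed)` fails with -EPERM even though the older path would have been permitted.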

From a standards perspective (and thus of programming to them), POSIX
does permit implementations to define additional errors for any
interface that already has errors defined, but does not permit
overloading the standard errors defined, so if EPERM is already
defined for an interface, returning it is probably not ok. (IMO it is
defensible if the seccomp policy can be thought of as an extension of
the permission model that would cause the standard EPERM, but it's not
defensible if the policy is just "we don't know about this syscall
yet" since the syscall might just be a new/better way of implementing
some existing operation that makes the policy appear inconsistent.)

In your example of connect, I think either is semantically defensible;
one is "you lack suitable privilege to perform the operation" and the
other is "you're running on a type of system that doesn't have socket
connect functionality" (because it's a minimal execution environment).

Rich