2021-02-20 09:35:21

by Sargun Dhillon

[permalink] [raw]
Subject: seccomp: Delay filter activation

We've run into a problem where attaching a filter can be quite messy
business because the filter itself intercepts sendmsg, and other
syscalls related to exfiltrating the listener FD. I believe that this
problem set has been brought up before, and although there are
"simpler" methods of exfiltrating the listener, like clone3 or
pidfd_getfd, but these are still less than ideal.

One of the ideas that's been talked about (I want to say back at LSS
NA) is the idea of "delayed activation". I was thinking that it might
be nice to have a mechanism to do delayed attach, either activated on
execve / fork, or an ioctl on the listenerfd to activate the filter
and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which
indicates that the listener should be setup, but not enforcing, and
another ioctl to activate it.

The later approach is preferred due to simplicity, but I can see a
situation where you could accidentally get into a state where the
filter is not being enforced. Additionally, this may have unforeseen
implications with CRIU.

I'm curious whether this is a problem others share, and whether any of
the aforementioned approaches seem reasonable.

-Thanks,
Sargun


2021-03-02 18:03:46

by Kees Cook

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Mon, Mar 01, 2021 at 02:21:56PM +0100, Christian Brauner wrote:
> On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> > On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > > We've run into a problem where attaching a filter can be quite messy
> > > business because the filter itself intercepts sendmsg, and other
> > > syscalls related to exfiltrating the listener FD. I believe that this
> > > problem set has been brought up before, and although there are
> > > "simpler" methods of exfiltrating the listener, like clone3 or
> > > pidfd_getfd, but these are still less than ideal.

I'm trying to make sure I understand: the target process would like to
have a filter attached that blocks sendmsg, but that would mean it has
no way to send the listener FD to its manager?

And you'd want to have listening working for sendmsg (otherwise you
could do it with two filters, I imagine)?

> > int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
> >
> > BARRIER_WAIT_SETUP_DONE;
> >
> > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));
>
> This obviously should've been sm like:
>
> struct seccomp_filter_attach {
> union {
> __s32 pidfd;
> __s32 pid;
> };
> __u32 fd_filter;
> };
>
> and then
>
> int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);

Given the difficulty with TSYNC, I'm not excited about adding an
"apply this filter to another process" API. :)

The prior thread was here:
https://lore.kernel.org/lkml/[email protected]/

But I haven't had time to follow up. Both Andy and Sargun discuss filter
"replacement", but I'm not a fan of that, since I'd really like to keep
the "additive-only" property of seccomp.

So, I'm still back to wanting an answer to my questions at the end of
https://lore.kernel.org/lkml/202010281503.3D1FCFE0@keescook/

Namely, how to best indicate the point of execution where "delayed"
filters become applied?

If we require supporting the "2b" (launched oblivious target) case
(which I think we must), we need to signal it externally, or via an
automatic trip point.

Since synchronizing with an oblivious target is rather nasty (e.g.
involving ptrace or at least ptrace access checking), I'd rather create
a predefined trip point. Having it be "execve" limits the utility of
this feature for cooperating targets, though, so I think "apply on exec"
isn't great.

struct seccomp_filter_attach_trigger {
u64 nr;
unsigned char *filter;
};

seccomp(SECCOMP_ATTACH_FILTER_TRIGGER, 0, seccomp_filter_attach_trigger);

after "nr" is evaluated (but before it runs), seccomp installs the
filter.

And by "installs", I'm not sure if it needs to keep it in a queue, with
separate ref coutning, or if it should be in the main filter stack, but
have an "alive" toggle, or what.

--
Kees Cook

2021-03-03 02:50:54

by Christian Brauner

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> We've run into a problem where attaching a filter can be quite messy
> business because the filter itself intercepts sendmsg, and other
> syscalls related to exfiltrating the listener FD. I believe that this
> problem set has been brought up before, and although there are
> "simpler" methods of exfiltrating the listener, like clone3 or
> pidfd_getfd, but these are still less than ideal.

(You really like sending patches and discussion points in the middle of
the merge window. :D I think everyone's panicked about getting their PRs
in shape so it's not unlikely that this sometimes gets lost on the list. :))

It hasn't been a huge problem for us, especially since we added
pidfd_getfd() this seemed like a straightforward problem to solve by
selecting a fix fd number that is to be used for the listener. But I can
see why it is annoying.

>
> One of the ideas that's been talked about (I want to say back at LSS
> NA) is the idea of "delayed activation". I was thinking that it might
> be nice to have a mechanism to do delayed attach, either activated on
> execve / fork, or an ioctl on the listenerfd to activate the filter
> and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which
> indicates that the listener should be setup, but not enforcing, and
> another ioctl to activate it.
>
> The later approach is preferred due to simplicity, but I can see a
> situation where you could accidentally get into a state where the
> filter is not being enforced. Additionally, this may have unforeseen
> implications with CRIU.

(If you were to expose an ioctl() that allows userspace to query the
notifer state then CRIU shouldn't have a problem restoring the notifier
in the correct state. Right now it doesn't do anyting fancy about the
notifier, it just restores the task with the filter. It just has to
learn about the new feature and that's fine imho.)

>
> I'm curious whether this is a problem others share, and whether any of
> the aforementioned approaches seem reasonable.

So when I originally suggested the delayed activation I I had another
related idea that I think I might have mentioned too: if we're already
considering delaying filter activation I like to discuss the possibility
of attaching a seccomp filter to a task.

Right now, if one task wants to attach to another task they need to
recreate the whole seccomp filter and load it. That's not just pretty
expensive but also only works if you have access to the rules that the
filter was generated with. For container that's usually some sort of
pseudo seccomp filter configuration language dumped into a config file
from which it can be read.

So right now the status quo is:

struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), /* Get me a listener fd */
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
.len = (unsigned short)ARRAY_SIZE(filter),
.filter = filter,
};
int fd = seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);

and then the caller must send the fd to the manager or the manager uses
pidfd_getfd().

But, why not get a bit crazy^wcreative; especially since seccomp() is
already a multiplexer. We introduce a new seccomp flag:

#define SECCOMP_FILTER_DETACHED

and a new seccomp command:

#define SECCOMP_ATTACH_FILTER

And now we could do something like:

pid_t pid = fork();
if (pid < 0)
return;

if (pid == 0) {
// do stuff
BARRIER_WAKE_SETUP_DONE;

// do more unrelated stuff

BARRIER_WAIT_SECCOMP_FILTER;
execve(exec-something);
} else {

int fd_filter;

struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog prog = {
.len = (unsigned short)ARRAY_SIZE(filter),
.filter = filter,
};

int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);

BARRIER_WAIT_SETUP_DONE;

int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));

BARRIER_WAKE_SECCOMP_FILTER;
}

And now you have attached a filter to another task. This would be super
elegant for a container manager. The container manager could also stash
the filter fd and when attaching to a container the manager can send the
attaching task the fd and the attaching task can do:

int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_filter));

too and would be attached to the same filter as the target task.

And for the listener fd case a container manager could simply set
SECCOMP_RET_USER_NOTIF as before

struct sock_filter filter[] = {
BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
};

and now fd_filter simply functions as the notifier fd after
seccomp(SECCOMP_ATTACH_FILTER) that's basically the fancy version of my
delayed notifier activiation idea.

I'm sure there's nastiness to figure out but I would love to see
something like this.

Christian

2021-03-03 03:27:19

by Christian Brauner

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > We've run into a problem where attaching a filter can be quite messy
> > business because the filter itself intercepts sendmsg, and other
> > syscalls related to exfiltrating the listener FD. I believe that this
> > problem set has been brought up before, and although there are
> > "simpler" methods of exfiltrating the listener, like clone3 or
> > pidfd_getfd, but these are still less than ideal.
>
> (You really like sending patches and discussion points in the middle of
> the merge window. :D I think everyone's panicked about getting their PRs
> in shape so it's not unlikely that this sometimes gets lost on the list. :))
>
> It hasn't been a huge problem for us, especially since we added
> pidfd_getfd() this seemed like a straightforward problem to solve by
> selecting a fix fd number that is to be used for the listener. But I can
> see why it is annoying.
>
> >
> > One of the ideas that's been talked about (I want to say back at LSS
> > NA) is the idea of "delayed activation". I was thinking that it might
> > be nice to have a mechanism to do delayed attach, either activated on
> > execve / fork, or an ioctl on the listenerfd to activate the filter
> > and have a flag like SECCOMP_FILTER_FLAG_NEW_LISTENER_INACTIVE, which
> > indicates that the listener should be setup, but not enforcing, and
> > another ioctl to activate it.
> >
> > The later approach is preferred due to simplicity, but I can see a
> > situation where you could accidentally get into a state where the
> > filter is not being enforced. Additionally, this may have unforeseen
> > implications with CRIU.
>
> (If you were to expose an ioctl() that allows userspace to query the
> notifer state then CRIU shouldn't have a problem restoring the notifier
> in the correct state. Right now it doesn't do anyting fancy about the
> notifier, it just restores the task with the filter. It just has to
> learn about the new feature and that's fine imho.)
>
> >
> > I'm curious whether this is a problem others share, and whether any of
> > the aforementioned approaches seem reasonable.
>
> So when I originally suggested the delayed activation I I had another
> related idea that I think I might have mentioned too: if we're already
> considering delaying filter activation I like to discuss the possibility
> of attaching a seccomp filter to a task.
>
> Right now, if one task wants to attach to another task they need to
> recreate the whole seccomp filter and load it. That's not just pretty
> expensive but also only works if you have access to the rules that the
> filter was generated with. For container that's usually some sort of
> pseudo seccomp filter configuration language dumped into a config file
> from which it can be read.
>
> So right now the status quo is:
>
> struct sock_filter filter[] = {
> BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF), /* Get me a listener fd */
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
> struct sock_fprog prog = {
> .len = (unsigned short)ARRAY_SIZE(filter),
> .filter = filter,
> };
> int fd = seccomp(SECCOMP_SET_MODE_FILTER, flags, &prog);
>
> and then the caller must send the fd to the manager or the manager uses
> pidfd_getfd().
>
> But, why not get a bit crazy^wcreative; especially since seccomp() is
> already a multiplexer. We introduce a new seccomp flag:
>
> #define SECCOMP_FILTER_DETACHED
>
> and a new seccomp command:
>
> #define SECCOMP_ATTACH_FILTER
>
> And now we could do something like:
>
> pid_t pid = fork();
> if (pid < 0)
> return;
>
> if (pid == 0) {
> // do stuff
> BARRIER_WAKE_SETUP_DONE;
>
> // do more unrelated stuff
>
> BARRIER_WAIT_SECCOMP_FILTER;
> execve(exec-something);
> } else {
>
> int fd_filter;
>
> struct sock_filter filter[] = {
> BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
>
> struct sock_fprog prog = {
> .len = (unsigned short)ARRAY_SIZE(filter),
> .filter = filter,
> };
>
> int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
>
> BARRIER_WAIT_SETUP_DONE;
>
> int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));

This obviously should've been sm like:

struct seccomp_filter_attach {
union {
__s32 pidfd;
__s32 pid;
};
__u32 fd_filter;
};

and then

int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);

>
> BARRIER_WAKE_SECCOMP_FILTER;
> }
>
> And now you have attached a filter to another task. This would be super
> elegant for a container manager. The container manager could also stash
> the filter fd and when attaching to a container the manager can send the
> attaching task the fd and the attaching task can do:
>
> int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_filter));
>
> too and would be attached to the same filter as the target task.
>
> And for the listener fd case a container manager could simply set
> SECCOMP_RET_USER_NOTIF as before
>
> struct sock_filter filter[] = {
> BPF_STMT(BPF_LD|BPF_W|BPF_ABS, offsetof(struct seccomp_data, nr)),
> BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, nr, 0, 1),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_USER_NOTIF),
> BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
> };
>
> and now fd_filter simply functions as the notifier fd after
> seccomp(SECCOMP_ATTACH_FILTER) that's basically the fancy version of my
> delayed notifier activiation idea.
>
> I'm sure there's nastiness to figure out but I would love to see
> something like this.
>
> Christian

2021-03-18 14:57:46

by Christian Brauner

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

Sorry, I just found that mail.

On Mon, Mar 01, 2021 at 03:44:06PM -0800, Kees Cook wrote:
> On Mon, Mar 01, 2021 at 02:21:56PM +0100, Christian Brauner wrote:
> > On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> > > On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > > > We've run into a problem where attaching a filter can be quite messy
> > > > business because the filter itself intercepts sendmsg, and other
> > > > syscalls related to exfiltrating the listener FD. I believe that this
> > > > problem set has been brought up before, and although there are
> > > > "simpler" methods of exfiltrating the listener, like clone3 or
> > > > pidfd_getfd, but these are still less than ideal.
>
> I'm trying to make sure I understand: the target process would like to
> have a filter attached that blocks sendmsg, but that would mean it has
> no way to send the listener FD to its manager?

With pidfd_getfd() that wouldn't be a problem, I think which is what I
was trying to say. Unless the supervising task doen't have enough
privilege over the supervised task which seems like an odd scenario but
is technically possible, I guess.

>
> And you'd want to have listening working for sendmsg (otherwise you
> could do it with two filters, I imagine)?
>
> > > int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
> > >
> > > BARRIER_WAIT_SETUP_DONE;
> > >
> > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));
> >
> > This obviously should've been sm like:
> >
> > struct seccomp_filter_attach {
> > union {
> > __s32 pidfd;
> > __s32 pid;
> > };
> > __u32 fd_filter;
> > };
> >
> > and then
> >
> > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);
>
> Given the difficulty with TSYNC, I'm not excited about adding an
> "apply this filter to another process" API. :)

Just to give a more complete reason for suggesting something like this
without trying to argue that we must have this:

seccomp() has so far been an API that is caller-centric and by that I
mean that the caller loaded it's seccomp profile and sandboxed itself. As
such seccomp is an example of "caller-managed" security. This security
model has obvious advantages and fits into the general fork()-like world
of unix. But imho that self-management model breaks down as soon as a
file descriptor that can be used to refer to the object in question
enters into the picture. For seccomp this "breaking point" was the
seccomp notifier fd.

Because with the introduction of that fd we have introduced the concept
of supervisor and supervisee for seccomp which imho didn't really exist
in the same way before. It's pretty obvious from the type of language
that we now use both in userspace and in kernelspace when we talk about
the seccomp notifier.

At the current point we're somewhere in the middle between caller-managed
and supervised seccomp which brings up funny probelms and edge-cases.
One of them most obvious examples is in fact the question how to get the
seccomp notify fd out of the supervised task. This clearly points to the
fact that we're missing one of the fundamentals of an fd-based
supervision model: open(). This is why I was suggesting the
SECCOMP_ATTACH_FILTER command. It's in a sense an open-call for the
seccomp notify fd.

That all being said I know that it can be weird to implement this and if
you prefer we go with another simpler model to work around such things
than I fully understand.

Christian

2021-03-18 20:41:23

by Sargun Dhillon

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Thu, Mar 18, 2021 at 03:54:54PM +0100, Christian Brauner wrote:
> Sorry, I just found that mail.
>
> On Mon, Mar 01, 2021 at 03:44:06PM -0800, Kees Cook wrote:
> > On Mon, Mar 01, 2021 at 02:21:56PM +0100, Christian Brauner wrote:
> > > On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> > > > On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > > > > We've run into a problem where attaching a filter can be quite messy
> > > > > business because the filter itself intercepts sendmsg, and other
> > > > > syscalls related to exfiltrating the listener FD. I believe that this
> > > > > problem set has been brought up before, and although there are
> > > > > "simpler" methods of exfiltrating the listener, like clone3 or
> > > > > pidfd_getfd, but these are still less than ideal.
> >
> > I'm trying to make sure I understand: the target process would like to
> > have a filter attached that blocks sendmsg, but that would mean it has
> > no way to send the listener FD to its manager?
>
> With pidfd_getfd() that wouldn't be a problem, I think which is what I
> was trying to say. Unless the supervising task doen't have enough
> privilege over the supervised task which seems like an odd scenario but
> is technically possible, I guess.
>
> >
> > And you'd want to have listening working for sendmsg (otherwise you
> > could do it with two filters, I imagine)?
> >
> > > > int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
> > > >
> > > > BARRIER_WAIT_SETUP_DONE;
> > > >
> > > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));
> > >
> > > This obviously should've been sm like:
> > >
> > > struct seccomp_filter_attach {
> > > union {
> > > __s32 pidfd;
> > > __s32 pid;
> > > };
> > > __u32 fd_filter;
> > > };
> > >
> > > and then
> > >
> > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);
> >
> > Given the difficulty with TSYNC, I'm not excited about adding an
> > "apply this filter to another process" API. :)
>
> Just to give a more complete reason for suggesting something like this
> without trying to argue that we must have this:
>
> seccomp() has so far been an API that is caller-centric and by that I
> mean that the caller loaded it's seccomp profile and sandboxed itself. As
> such seccomp is an example of "caller-managed" security. This security
> model has obvious advantages and fits into the general fork()-like world
> of unix. But imho that self-management model breaks down as soon as a
> file descriptor that can be used to refer to the object in question
> enters into the picture. For seccomp this "breaking point" was the
> seccomp notifier fd.
>
> Because with the introduction of that fd we have introduced the concept
> of supervisor and supervisee for seccomp which imho didn't really exist
> in the same way before. It's pretty obvious from the type of language
> that we now use both in userspace and in kernelspace when we talk about
> the seccomp notifier.
>
> At the current point we're somewhere in the middle between caller-managed
> and supervised seccomp which brings up funny probelms and edge-cases.
> One of them most obvious examples is in fact the question how to get the
> seccomp notify fd out of the supervised task. This clearly points to the
> fact that we're missing one of the fundamentals of an fd-based
> supervision model: open(). This is why I was suggesting the
> SECCOMP_ATTACH_FILTER command. It's in a sense an open-call for the
> seccomp notify fd.
>
> That all being said I know that it can be weird to implement this and if
> you prefer we go with another simpler model to work around such things
> than I fully understand.
>
> Christian

So, beyond clone3 to get pidfds being kind of awkward, how do you see this
pattern actually working? How does the filter installer let the supervisor
know that it's ready for extraction? pause() + signaling the parent?

In the case that you're not fork-execing, how do you communicate to the
notifier? My coworkers have been working on this code, where they need to
connect to a daemon that does the supervision, and it's gnarly[1]. They're
looking at adding sendmsg to the filter list, and that complicates things.

I think that pidfd_getfd works well if the child has some way to signal to the
parent that it's ready and that the filter has been installed, but when your
filter intercepts connect, and sendmsg, and the supervisor is not your parent
(and your parent is some compiled, COTS software), it becomes complicated to
handle this.

I believe that the OCI spec[2] is going to run into this class of problem unless
we introduce an out of band signaling mechanism. I think a valid way to handle
this is do a send() of the fd number (literal), and wait for the other side to
pidfd_getfd the seccomp filter, and wait for the socket to be closed to continue,
but I think we should maybe create an example (I volunteer) showing how to do this.



[1]:
https://github.com/Netflix/titus-executor/blob/393d71fa3b4ab836c6036c6a242a7f4fd2d0307e/tini/src/seccomp_fd_notify.c#L139-L201
[2]:
https://github.com/opencontainers/runtime-spec/commit/a8c4a9ee0f6b5a0b994c5c23c68725394e2b0d9d

2021-03-19 10:08:41

by Rodrigo Campos

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Thu, Mar 18, 2021 at 9:39 PM Sargun Dhillon <[email protected]> wrote:
> I believe that the OCI spec[2] is going to run into this class of problem unless
> we introduce an out of band signaling mechanism. I think a valid way to handle
> this is do a send() of the fd number (literal), and wait for the other side to
> pidfd_getfd the seccomp filter, and wait for the socket to be closed to continue,
> but I think we should maybe create an example (I volunteer) showing how to do this.

Well, we created a runc implementation for that OCI spec change and we
hit exactly that[1].

runc has a pipe mechanism to communicate already, so we use that. What
we do is: do the seccomp syscall, send the plain fd number over the
pipe and the parent gets the fd with pidfd_getfd()[2]. We use the pipe
to sync, so no issues with that part.

But, of course, if the seccomp filter blocks the syscall to send over
the pipe, this fails.

Christian, can you please elaborate on how you solve this on lxd? I'm
curious to understand if we can use the same in runc or not.


[1]: https://github.com/opencontainers/runc/pull/2682
[2]: https://github.com/opencontainers/runc/pull/2682/files#diff-f0214a0f16408fc7f168c6fc9837d189590025cc1813ebf7c1d751136936dfbfR172
--
Rodrigo Campos
---
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

2021-03-19 11:06:18

by Christian Brauner

[permalink] [raw]
Subject: Re: seccomp: Delay filter activation

On Thu, Mar 18, 2021 at 08:39:13PM +0000, Sargun Dhillon wrote:
> On Thu, Mar 18, 2021 at 03:54:54PM +0100, Christian Brauner wrote:
> > Sorry, I just found that mail.
> >
> > On Mon, Mar 01, 2021 at 03:44:06PM -0800, Kees Cook wrote:
> > > On Mon, Mar 01, 2021 at 02:21:56PM +0100, Christian Brauner wrote:
> > > > On Mon, Mar 01, 2021 at 12:09:09PM +0100, Christian Brauner wrote:
> > > > > On Sat, Feb 20, 2021 at 01:31:57AM -0800, Sargun Dhillon wrote:
> > > > > > We've run into a problem where attaching a filter can be quite messy
> > > > > > business because the filter itself intercepts sendmsg, and other
> > > > > > syscalls related to exfiltrating the listener FD. I believe that this
> > > > > > problem set has been brought up before, and although there are
> > > > > > "simpler" methods of exfiltrating the listener, like clone3 or
> > > > > > pidfd_getfd, but these are still less than ideal.
> > >
> > > I'm trying to make sure I understand: the target process would like to
> > > have a filter attached that blocks sendmsg, but that would mean it has
> > > no way to send the listener FD to its manager?
> >
> > With pidfd_getfd() that wouldn't be a problem, I think which is what I
> > was trying to say. Unless the supervising task doen't have enough
> > privilege over the supervised task which seems like an odd scenario but
> > is technically possible, I guess.
> >
> > >
> > > And you'd want to have listening working for sendmsg (otherwise you
> > > could do it with two filters, I imagine)?
> > >
> > > > > int fd_filter = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_DETACHED, &prog);
> > > > >
> > > > > BARRIER_WAIT_SETUP_DONE;
> > > > >
> > > > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, INT_TO_PTR(fd_listener));
> > > >
> > > > This obviously should've been sm like:
> > > >
> > > > struct seccomp_filter_attach {
> > > > union {
> > > > __s32 pidfd;
> > > > __s32 pid;
> > > > };
> > > > __u32 fd_filter;
> > > > };
> > > >
> > > > and then
> > > >
> > > > int ret = seccomp(SECCOMP_ATTACH_FILTER, 0, seccomp_filter_attach);
> > >
> > > Given the difficulty with TSYNC, I'm not excited about adding an
> > > "apply this filter to another process" API. :)
> >
> > Just to give a more complete reason for suggesting something like this
> > without trying to argue that we must have this:
> >
> > seccomp() has so far been an API that is caller-centric and by that I
> > mean that the caller loaded it's seccomp profile and sandboxed itself. As
> > such seccomp is an example of "caller-managed" security. This security
> > model has obvious advantages and fits into the general fork()-like world
> > of unix. But imho that self-management model breaks down as soon as a
> > file descriptor that can be used to refer to the object in question
> > enters into the picture. For seccomp this "breaking point" was the
> > seccomp notifier fd.
> >
> > Because with the introduction of that fd we have introduced the concept
> > of supervisor and supervisee for seccomp which imho didn't really exist
> > in the same way before. It's pretty obvious from the type of language
> > that we now use both in userspace and in kernelspace when we talk about
> > the seccomp notifier.
> >
> > At the current point we're somewhere in the middle between caller-managed
> > and supervised seccomp which brings up funny probelms and edge-cases.
> > One of them most obvious examples is in fact the question how to get the
> > seccomp notify fd out of the supervised task. This clearly points to the
> > fact that we're missing one of the fundamentals of an fd-based
> > supervision model: open(). This is why I was suggesting the
> > SECCOMP_ATTACH_FILTER command. It's in a sense an open-call for the
> > seccomp notify fd.
> >
> > That all being said I know that it can be weird to implement this and if
> > you prefer we go with another simpler model to work around such things
> > than I fully understand.
> >
> > Christian
>
> So, beyond clone3 to get pidfds being kind of awkward, how do you see this

Afaict we're talking about getting a file descriptor out of another
task. So just for my understanding what does clone3()'s way of returning
pidfds have to do with this and why is returning a pidfd in clone3()
awkward?

> pattern actually working? How does the filter installer let the supervisor

Sorry, what pattern are you referring to, i.e. is this a
critique/questions of my attach api proposal or of my statement about
using pidfd_getfd() to retrieve the fd? I think I'm just a little
confused about where you chose to place your reply in the thread. :)
Just to reiterate briefly: my preferred variant is to have an api
similar to what I proposed above where we can essentially prepare a
filter for another task and attach it to that task. In that model the fd
is retrieved directly by the supervisor without any need for cooperation
from the supervisee. After that you just need simply barriers (pipe,
socket etc.) to synchronize with the supervisee. For example, a
supervise does all the setup it wants and then sends a message over a
socket or pipe to the supervisor signaling it's ready and then the parent
attaches the child to its seccomp filter via some form of the command I
proposed above.

> know that it's ready for extraction? pause() + signaling the parent?

>
> In the case that you're not fork-execing, how do you communicate to the
> notifier? My coworkers have been working on this code, where they need to
> connect to a daemon that does the supervision, and it's gnarly[1]. They're
> looking at adding sendmsg to the filter list, and that complicates things.
>
> I think that pidfd_getfd works well if the child has some way to signal to the
> parent that it's ready and that the filter has been installed, but when your
> filter intercepts connect, and sendmsg, and the supervisor is not your parent
> (and your parent is some compiled, COTS software), it becomes complicated to
> handle this.
>
> I believe that the OCI spec[2] is going to run into this class of problem unless
> we introduce an out of band signaling mechanism. I think a valid way to handle
> this is do a send() of the fd number (literal), and wait for the other side to
> pidfd_getfd the seccomp filter, and wait for the socket to be closed to continue,
> but I think we should maybe create an example (I volunteer) showing how to do this.

So at first it seems were looking at very different cases here:

- Supervisor and supervisee do not have a child-parent relationship
- Supervisor and supervisee do have a child-parent relationship

But in both scenarios you must have a way for the supervisor and
supervisee to communicate with each other. If you don't then I'd be
curious how you manage to supervise the setup process of the supervisee
even independent of the seccomp notifier. (Your first link seems to show
a non-child-parent supervisor-supervisee relationship and you
communicate between supervisor and supervisee via a socket.)

I think you have two immediate ways of solving this problem (Possibly
more but I would need to think a bit more and I think that the two
solutions would solve most cases.):
1. Send the fd number and use pidfd_getfd() (Which is also what Rodrigo
mentioned.).
This can technically be a problem when somehow the supervisee is
blocked from using any kind of write method to a pipe which I find to
be a very artifical example.
And in fact looking at [1] runC is writing to a fifo after the
seccomp profile has been loaded so a similar method of telling the
parent which fd number to retrieve should always work.
2. Use a shared memory location: Have the supervisor create a file
created via memfd_create(). Send that fd to the supervisee. (Make the
fd close-on-exec in case you for some reason want to block close()
which again looking at [1] would mean you'd violate assumptions about
what syscalls are available after loading the seccomp profile.)
Have the supervisee mmap() that location (In case you're blocking
write which again, doesn't work with current assumption.). Have it
load it's seccomp profile. Write the fd number to that memory
location. The mmap() memory will be cleared on exec so no need to
have munmap() available.

But again, I prefer a model I suggested above.

[1]: https://github.com/opencontainers/runc/blob/4d4d19ce528ac40cc357ef92cd3a6931dba19316/libcontainer/standard_init_linux.go#L192

Christian