Date:   Wed, 28 Oct 2020 15:47:27 -0700
From:   Kees Cook <keescook@chromium.org>
To:     Camille Mougey <commial@gmail.com>
Cc:     lkml <linux-kernel@vger.kernel.org>, Jann Horn <jannh@google.com>,
        Tycho Andersen <tycho@tycho.pizza>,
        Rich Felker <dalias@libc.org>,
        Sargun Dhillon <sargun@sargun.me>,
        Christian Brauner <christian.brauner@ubuntu.com>,
        "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>,
        Denis Efremov <efremov@linux.com>,
        Andy Lutomirski <luto@kernel.org>
Subject: Re: [seccomp] Request for a "enable on execve" mode for Seccomp
 filters
Message-ID: <202010281503.3D1FCFE0@keescook>
References: <CAAnLoWnS74dK9Wq4EQ-uzQ0qCRfSK-dLqh+HCais-5qwDjrVzg@mail.gmail.com>
In-Reply-To: <CAAnLoWnS74dK9Wq4EQ-uzQ0qCRfSK-dLqh+HCais-5qwDjrVzg@mail.gmail.com>

On Wed, Oct 28, 2020 at 12:18:47PM +0100, Camille Mougey wrote:
> (This is my first message to the kernel list, I hope I'm doing it right)

Looks good to me! The key was CCing real people. ;)

> From my understanding, there is no way to delay the activation of
> seccomp filters, for instance "until an _execve_ call".
> But this might be useful, especially for tools who sandbox other,
> non-cooperative, executables, such as "systemd" or "FireJail".
> [...]
> I only see hackish ways to restrict the use of _execve_ in a
> non-cooperative executable. These methods seem globally bypassables
> and not satisfactory from a security point of view.
> 
> IMHO, a way to prepare filter and enable them only on the next
> _execve_ would have some benefit:
> * have a way to restrict _execve_ in a non-cooperative executable;
> * install filters atomically, ie. before the _execve_ system call
> return. That would limit racy situations, and have the very firsts
> instructions of potentially untrusted binaries already subject to
> seccomp filters. It would also ensure there is only one thread running
> at the filter enabling time.
> 
> From what I understand, there is a relative use case[2] where the
> "enable on exec" mode would also be a solution.
> 
> Thanks for your attention,
> C. Mougey
> 
> [1]: https://github.com/netblue30/firejail/issues/3685
> [2]: https://lore.kernel.org/linux-man/202010250759.F9745E0B6@keescook/

Just to restate things already said in the thread and to try to illustrate
with more clarity, I tend to organize my thinking about seccomp usage
into three categories:

1- self-confinement
2- launching external processes
	a) cooperating
	b) oblivious

I classify things like Chrome's complex tree of related processes and
filters as 1, since it's all one thing together.

I think of systemd, docker, minijail, FireJail, etc. all as falling into
category 2, with some variation in how they deal with 2a or 2b. I see
systemd as weakly covering both 2a and 2b: e.g. services document
what restrictions they want, etc. minijail has stronger 2b coverage, as
it attempts LD_PRELOAD tricks (which it sounds like FireJail does
too?). (Aside: why doesn't systemd do any self-confinement?)

Targets in the 2a realm don't have much room to cooperate over how to
_manage_ confinement; they can only expect confinement to exist, or add
more confinement on their own.

So, what would adding delayed filters gain in the above classifications?

Both 1 and 2 would benefit from some simplification over how to apply
filters (e.g. the referenced relative complexity of needing to pass the
USER_NOTIF fd up to the supervisor).

Dealing with 2b is improved by allowing execve itself to be blocked.

If we turn this:

	fork
	prepare & apply
	exec

into this:

	fork
	prepare
	exec & apply

for 2a, this isn't too interesting since a 2a target could just give up
execve after it launched. For 2b, though, it's pretty meaningful to gain
further isolation of an oblivious (and presumably untrusted) process
(given all the hacks needed to try to cover the situation).
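
To make the difference between the two orderings concrete, here is a toy
Python model. It is not a kernel interface: the Task class, its "pending"
list, and the rule that pending filters become active inside execve() are
all assumptions made for illustration, and a filter is reduced to a set of
denied syscall names.

```python
from dataclasses import dataclass, field

EXECVE = "execve"


@dataclass
class Task:
    """Toy model of one task's seccomp state (hypothetical, not a kernel API)."""
    image: str = "launcher"
    active: list = field(default_factory=list)   # filters enforced right now
    pending: list = field(default_factory=list)  # assumed "delayed" filters

    def allowed(self, syscall):
        # A syscall passes only if no active filter denies it.
        return all(syscall not in deny for deny in self.active)

    def execve(self, image):
        if not self.allowed(EXECVE):
            return False  # blocked by a filter that is already active
        self.image = image
        # Assumed "enable on execve" semantics: pending filters are
        # installed before execve returns to the new image.
        self.active.extend(self.pending)
        self.pending.clear()
        return True


deny_exec = frozenset({EXECVE})

# Today: fork; prepare & apply; exec
today = Task()
today.active.append(deny_exec)
launched_today = today.execve("target")   # the launcher blocks its own exec

# Delayed: fork; prepare; exec & apply
delayed = Task()
delayed.pending.append(deny_exec)
launched_delayed = delayed.execve("target")  # the exec itself goes through
can_reexec = delayed.execve("other")         # the target can no longer exec
```

In the first ordering the filter has to leave execve open or the launcher
can never start the target; in the second, the target starts already
unable to exec anything else.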

And to clarify, 2a would much prefer this to be able to separate
initialization from runtime:

	fork
	prepare
	exec
	other things
	apply

And just for completeness, none of this is useful at all for 1, which
doesn't even "see" the fork from its perspective:

	exec
	other things
	prepare & apply

How should 2a targets indicate they're ready? Can it be done passively
(in the sense that libc would make some seccomp call to apply the
delayed filters), or does it need to stay explicit? (e.g. can we turn
a pre-untrusted-input 2b into a 2a just by having the libc make calls?)
My instinct is that hiding it won't gain much over an "on-execve" case,
but having an explicit call that means "I'm done initializing now" would
be a meaningful synchronization point -- except I note that the target
could then just as easily start doing its own confinement anyway, which
means it effectively moves from 2a to 1, and now we don't care about
delayed filters any more.

So, lacking a clearer sync point, execve() does seem to stand out to me.

The other idea touched on in the thread was very direct management
(e.g. ptrace), where the supervisor waits until some point and then
forces the filters to apply on the target. What would be more
light-weight than this? (Or rather, what kinds of things would such a
ptracer be looking for to mark "I've started"?)

Since I've got bitmaps on my mind, what about a syscall bitmap that
triggers the application of delayed filters? The supervisor is launching
a daemon: mark NR_listen as the apply-point. The supervisor is launching
something totally unknown: mark NR_execve as the apply-point.
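
A rough sketch of what I mean by the trigger bitmap, again as a toy Python
model and not a proposed interface. The DelayedFilters class and its
method names are hypothetical; the syscall numbers are the x86-64 ones and
are only there for illustration. The model deliberately does not decide
whether the triggering syscall itself would be subject to the newly
applied filters.

```python
# x86-64 syscall numbers, used here purely as example bit positions.
NR_BIND = 49
NR_LISTEN = 50
NR_EXECVE = 59


class DelayedFilters:
    """Hypothetical model: filters stay pending until an apply-point trips."""

    def __init__(self, apply_points, filters):
        # One bit per syscall number; a set bit marks an apply-point.
        self.trigger_bitmap = 0
        for nr in apply_points:
            self.trigger_bitmap |= 1 << nr
        self.pending = list(filters)
        self.active = []

    def on_syscall(self, nr):
        """Return True if this syscall tripped the apply-point."""
        if self.pending and (self.trigger_bitmap >> nr) & 1:
            self.active.extend(self.pending)
            self.pending.clear()
            return True
        return False


# Supervisor launching a daemon: listen() is the apply-point.
daemon = DelayedFilters([NR_LISTEN], ["daemon-filter"])
tripped_by_bind = daemon.on_syscall(NR_BIND)
tripped_by_listen = daemon.on_syscall(NR_LISTEN)

# Supervisor launching something unknown: execve() is the apply-point.
unknown = DelayedFilters([NR_EXECVE], ["strict-filter"])
tripped_by_execve = unknown.on_syscall(NR_EXECVE)
```

Initialization syscalls such as bind() run unfiltered in this model, and
the pending filters move to the active list the first time a marked
syscall is seen.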

If we did that, what happens to non-delayed filters applied between
program start and the apply-point getting tripped?

-- 
Kees Cook