Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
MIME-Version: 1.0
References: <202005181120.971232B7B@keescook> <CAG48ez1LrQvR2RHD5-ZCEihL4YT1tVgoAJfGYo+M3QukumX=OQ@mail.gmail.com>
 <20200519024846.b6dr5cjojnuetuyb@yavin.dot.cyphar.com>
In-Reply-To: <20200519024846.b6dr5cjojnuetuyb@yavin.dot.cyphar.com>
From:   Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date:   Tue, 19 May 2020 09:18:47 -0700
Message-ID: <CAADnVQKRCCHRQrNy=V7ue38skb8nKCczScpph2WFv7U_jsS3KQ@mail.gmail.com>
Subject: Re: seccomp feature development
To:     Aleksa Sarai <cyphar@cyphar.com>
Cc:     Jann Horn <jannh@google.com>, Kees Cook <keescook@chromium.org>,
        Christian Brauner <christian.brauner@ubuntu.com>,
        Tycho Andersen <tycho@tycho.ws>,
        Sargun Dhillon <sargun@sargun.me>,
        Matt Denton <mpdenton@google.com>,
        Chris Palmer <palmer@google.com>,
        Jeffrey Vander Stoep <jeffv@google.com>,
        Linux Containers <containers@lists.linux-foundation.org>,
        Linux API <linux-api@vger.kernel.org>,
        kernel list <linux-kernel@vger.kernel.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        bpf <bpf@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On Mon, May 18, 2020 at 7:53 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2020-05-19, Jann Horn <jannh@google.com> wrote:
> > On Mon, May 18, 2020 at 11:05 PM Kees Cook <keescook@chromium.org> wrote:
> > > ## deep argument inspection
> > >
> > > Background: seccomp users would like to write filters that traverse
> > > the user pointers passed into many syscalls, but seccomp can't do this
> > > dereference for a variety of reasons (mostly involving race conditions and
> > > rearchitecting the entire kernel syscall and copy_from_user() code flows).
> >
> > Also, other than for syscall entry, it might be worth thinking about
> > whether we want to have a special hook into seccomp for io_uring.
> > io_uring is growing support for more and more syscalls, including
> > things like openat2, connect, sendmsg, splice and so on, and that list
> > is probably just going to grow in the future. If people start wanting
> > to use io_uring in software with seccomp filters, it might be
> > necessary to come up with some mechanism to prevent io_uring from
> > permitting access to almost everything else...
> >
> > Probably not a big priority for now, but something to keep in mind for
> > the future.
>
> Indeed. Quite a few people have raised concerns about io_uring and its
> debug-ability, but I agree that another less-commonly-mentioned concern
> should be how you restrict io_uring(2) from doing operations you've
> disallowed through seccomp. Though obviously user_notif shouldn't be
> allowed. :D
>
> > > The argument caching bit is, I think, rather mechanical in nature since
> > > it's all "just" internal to the kernel: seccomp can likely adjust how it
> > > allocates seccomp_data (maybe going so far as to have it split across two
> > > pages with the syscall argument struct always starting on the 2nd page
> > > boundary), and copying the EA struct into that page, which will be both
> > > used by the filter and by the syscall.
> >
> > We could also do the same kind of thing the eBPF verifier does in
> > convert_ctx_accesses(), and rewrite the context accesses to actually
> > go through two different pointers depending on the (constant) offset
> > into seccomp_data.
>
> My main worry with this is that we'll need to figure out what kind of
> offset mathematics are necessary to deal with pointers inside the
> extensible struct. As a very ugly proposal, you could make it so that
> you multiply the offset by PAGE_SIZE each time you want to dereference
> the pointer at that offset (unless we want to add new opcodes to cBPF to
> allow us to represent this).

Please don't. cbpf is frozen.

>
> This might even be needed for seccomp user_notif -- given one of the
> recent proposals was basically to just add two (extensible) struct
> pointers inside the main user_notif struct.
>
> > > I imagine state tracking ("is
> > > there a cached EA?", "what is the address of seccomp_data?", "what is
> > > the address of the EA?") can be associated with the thread struct.
> >
> > You probably mean the task struct?
> >
> > > The growing size of the EA struct will need some API design. For filters
> > > to operate on the contiguous seccomp_data+EA struct, the filter will
> > > need to know how large seccomp_data is (more on this later), and how
> > > large the EA struct is. When the filter is written in userspace, it can
> > > do the math, point into the expected offsets, and get what it needs. For
> > > this to work correctly in the kernel, though, the seccomp BPF verifier
> > > needs to know the size of the EA struct as well, so it can correctly
> > > perform the offset checking (as it currently does for just the
> > > seccomp_data struct size).
> > >
> > > Since there is not really any caller-based "seccomp state" associated
> > > across seccomp(2) calls, I don't think we can add a new command to tell
> > > the kernel "I'm expecting the EA struct size to be $foo bytes", since
> > > the kernel doesn't track who "I" is besides just being "current", which
> > > doesn't take into account the thread lifetime -- if a process launcher
> > > knows about one size and the child knows about another, things will get
> > > confused. The sizes really are just associated with individual filters,
> > > based on the syscalls they're examining. So, I have thoughts on possible
> > > solutions:
> > >
> > > - create a new seccomp command SECCOMP_SET_MODE_FILTER2 which uses the
> > >   EA style so we can pass in more than a filter and include also an
> > >   array of syscall to size mappings. (I don't like this...)
> > > - create a new filter flag, SECCOMP_FILTER_FLAG_EXTENSIBLE, which changes
> > >   the meaning of the uarg from "filter" to a EA-style structure with
> > >   sizes and pointers to the filter and an array of syscall to size
> > >   mappings. (I like this slightly better, but I still don't like it.)
> > > - leverage the EA design and just accept anything <= PAGE_SIZE, record
> > >   the "max offset" value seen during filter verification, and zero-fill
> > >   the EA struct with zeros to that size when constructing the
> > >   seccomp_data + EA struct that the filter will examine. Then the seccomp
> > >   filter doesn't care what any of the sizes are, and userspace doesn't
> > >   care what any of the sizes are. (I like this as it makes the problems
> > >   to solve contained entirely by the seccomp infrastructure and does not
> > >   touch user API, but I worry I'm missing some gotcha I haven't
> > >   considered.)
> >
> > We don't need to actually zero-fill memory for this beyond what the
> > kernel supports - AFAIK the existing APIs already say that passing a
> > short length is equivalent to passing zeroes, so we can just replace
> > all out-of-bounds loads with zeroing registers in the filter.
> > The tricky case is what should happen if the userspace program passes
> > in fields that the filter doesn't know about. The filter can see the
> > length field passed in by userspace, and then just reject anything
> > where the length field is bigger than the structure size the filter
> > knows about. But maybe it'd be slightly nicer if there was an
> > operation for "tell me whether everything starting at offset X is
> > zeroes", so that if someone compiles with newer kernel headers where
> > the struct is bigger, and leaves the new fields zeroed, the syscalls
> > still go through an outdated filter properly.
>
> I think the best way of handling this (without breaking programs
> senselessly) is to have filters essentially emulate
> copy_struct_from_user() semantics -- which is along the lines of what
> you've suggested.

and cpbf load instruction will become copy_from_user() underneath?
I don't see how that can work.
Have you considered implications to jits, register usage, etc ?

ebpf will become sleepable soon. It will be able to do copy_from_user()
and examine any level of user pointer dereference.
toctou is still going to be a concern though,
but such ebpf+copy_from_user analysis and syscall sandboxing
will not need to change kernel code base around syscalls at all.
No need to invent E-syscalls and all the rest I've seen in this thread.