Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
MIME-Version: 1.0
References: <45f07f17-18b6-d187-0914-6f341fe90857@gmail.com>
 <CAG48ez3aqLs_-xgU0bThOLqRiiDWGObxcg-X9iFe6D5RDnLVJg@mail.gmail.com> <20201001125043.dj6taeieatpw3a4w@gmail.com>
In-Reply-To: <20201001125043.dj6taeieatpw3a4w@gmail.com>
From:   Jann Horn <jannh@google.com>
Date:   Thu, 1 Oct 2020 17:47:54 +0200
Message-ID: <CAG48ez2U1K2XYZu6goRYwmQ-RSu7LkKSOhPt8_wPVEUQfm7Eeg@mail.gmail.com>
Subject: Re: For review: seccomp_user_notif(2) manual page
To:     Christian Brauner <christian.brauner@canonical.com>
Cc:     "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>,
        linux-man <linux-man@vger.kernel.org>,
        Song Liu <songliubraving@fb.com>,
        Will Drewry <wad@chromium.org>,
        Kees Cook <keescook@chromium.org>,
        Daniel Borkmann <daniel@iogearbox.net>,
        Giuseppe Scrivano <gscrivan@redhat.com>,
        Robert Sesek <rsesek@google.com>,
        Linux Containers <containers@lists.linux-foundation.org>,
        lkml <linux-kernel@vger.kernel.org>,
        Alexei Starovoitov <ast@kernel.org>, bpf <bpf@vger.kernel.org>,
        Andy Lutomirski <luto@amacapital.net>,
        Christian Brauner <christian@brauner.io>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Precedence: bulk

On Thu, Oct 1, 2020 at 2:54 PM Christian Brauner
<christian.brauner@canonical.com> wrote:
> On Wed, Sep 30, 2020 at 05:53:46PM +0200, Jann Horn via Containers wrote:
> > On Wed, Sep 30, 2020 at 1:07 PM Michael Kerrisk (man-pages)
> > <mtk.manpages@gmail.com> wrote:
> > > NOTES
> > >        The file descriptor returned when seccomp(2) is employed with =
the
> > >        SECCOMP_FILTER_FLAG_NEW_LISTENER  flag  can  be  monitored  us=
ing
> > >        poll(2), epoll(7), and select(2).  When a notification  is  pe=
nd=E2=80=90
> > >        ing,  these interfaces indicate that the file descriptor is re=
ad=E2=80=90
> > >        able.
> >
> > We should probably also point out somewhere that, as
> > include/uapi/linux/seccomp.h says:
> >
> >  * Similar precautions should be applied when stacking SECCOMP_RET_USER=
_NOTIF
> >  * or SECCOMP_RET_TRACE. For SECCOMP_RET_USER_NOTIF filters acting on t=
he
> >  * same syscall, the most recently added filter takes precedence. This =
means
> >  * that the new SECCOMP_RET_USER_NOTIF filter can override any
> >  * SECCOMP_IOCTL_NOTIF_SEND from earlier filters, essentially allowing =
all
> >  * such filtered syscalls to be executed by sending the response
> >  * SECCOMP_USER_NOTIF_FLAG_CONTINUE. Note that SECCOMP_RET_TRACE can eq=
ually
> >  * be overriden by SECCOMP_USER_NOTIF_FLAG_CONTINUE.
> >
> > In other words, from a security perspective, you must assume that the
> > target process can bypass any SECCOMP_RET_USER_NOTIF (or
> > SECCOMP_RET_TRACE) filters unless it is completely prohibited from
> > calling seccomp(). This should also be noted over in the main
> > seccomp(2) manpage, especially the SECCOMP_RET_TRACE part.
>
> So I was actually wondering about this when I skimmed this and a while
> ago but forgot about this again... Afaict, you can only ever load a
> single filter with SECCOMP_FILTER_FLAG_NEW_LISTENER set. If there
> already is a filter with the SECCOMP_FILTER_FLAG_NEW_LISTENER property
> in the tasks filter hierarchy then the kernel will refuse to load a new
> one?
>
> static struct file *init_listener(struct seccomp_filter *filter)
> {
>         struct file *ret =3D ERR_PTR(-EBUSY);
>         struct seccomp_filter *cur;
>
>         for (cur =3D current->seccomp.filter; cur; cur =3D cur->prev) {
>                 if (cur->notif)
>                         goto out;
>         }
>
> shouldn't that be sufficient to guarantee that USER_NOTIF filters can't
> override each other for the same task simply because there can only ever
> be a single one?

Good point. Exceeeept that that check seems ineffective because this
happens before we take the locks that guard against TSYNC, and also
before we decide to which existing filter we want to chain the new
filter. So if two threads race with TSYNC, I think they'll be able to
chain two filters with listeners together.

I don't know whether we want to eternalize this "only one listener
across all the filters" restriction in the manpage though, or whether
the man page should just say that the kernel currently doesn't support
it but that security-wise you should assume that it might at some
point.

[...]
> > >            if (procMemFd =3D=3D -1)
> > >                errExit("Supervisor: open");
> > >
> > >            /* Check that the process whose info we are accessing is s=
till alive.
> > >               If the SECCOMP_IOCTL_NOTIF_ID_VALID operation (performe=
d
> > >               in checkNotificationIdIsValid()) succeeds, we know that=
 the
> > >               /proc/PID/mem file descriptor that we opened correspond=
s to the
> > >               process for which we received a notification. If that p=
rocess
> > >               subsequently terminates, then read() on that file descr=
iptor
> > >               will return 0 (EOF). */
> > >
> > >            checkNotificationIdIsValid(notifyFd, req->id);
> > >
> > >            /* Seek to the location containing the pathname argument (=
i.e., the
> > >               first argument) of the mkdir(2) call and read that path=
name */
> > >
> > >            if (lseek(procMemFd, req->data.args[0], SEEK_SET) =3D=3D -=
1)
> > >                errExit("Supervisor: lseek");
> > >
> > >            ssize_t s =3D read(procMemFd, path, PATH_MAX);
> > >            if (s =3D=3D -1)
> > >                errExit("read");
> >
> > Why not pread() instead of lseek()+read()?
>
> With multiple arguments to be read process_vm_readv() should also be
> considered.

process_vm_readv() can end up doing each read against a different
process, which is sort of weird semantically. You would end up taking
page faults at random addresses in unrelated processes, blocking on
their mmap locks, potentially triggering their userfaultfd notifiers,
and so on.

Whereas if you first open /proc/$tid/mem, then re-check
SECCOMP_IOCTL_NOTIF_ID_VALID, and then do the read, you know that
you're only taking page faults on the process where you intended to do
it.

So until there is a variant of process_vm_readv() that operates on
pidfds, I would not recommend using that here.