MIME-Version: 1.0
In-Reply-To: <87oaar2ryz.fsf@x220.int.ebiederm.org>
References: <20160306082820.GA1917@mail.hallyn.com> <87oaar2ryz.fsf@x220.int.ebiederm.org>
From: Andy Lutomirski <luto@amacapital.net>
Date: Sun, 6 Mar 2016 18:24:23 -0800
Message-ID: <CALCETrXUcWnP-mvdJCUy91iJ8U+0=K0WHEs1p8h89SisQr0ZOQ@mail.gmail.com>
Subject: Re: user namespace and fully visible proc and sys mounts
To: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>,
        Serge Hallyn <serge.hallyn@ubuntu.com>,
        Seth Forshee <seth.forshee@canonical.com>,
        lkml <linux-kernel@vger.kernel.org>,
        Stéphane Graber <stgraber@ubuntu.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2695
Lines: 61

On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
> "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
>
> > Hi,
> >
> > So we've been over this many times...  but unfortunately there is more
> > breakage to report.  Regular privileged and unprivileged containers
> > work all right for us.  But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over proc files, including /proc/meminfo and
> > /proc/uptime.  It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back.  Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started.  If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces?  A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security?  It
> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.
>
> If none of the mounts are for security, the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.
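
To make the workaround above concrete, here is a rough Python model of
the visibility check as this thread describes it: a fresh proc mount in
a child user namespace is allowed only if the mounter can already see
some proc instance that is neither read-only nor partly hidden by
overmounts.  This is a simplified sketch, not the kernel's actual
logic.  The names ProcMount, is_fully_visible and may_mount_fresh_proc
and the path /mnt/spare-proc are hypothetical, made up for
illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ProcMount:
    """One proc instance visible in the container's mount namespace (hypothetical model)."""
    path: str
    read_only: bool = False
    # Paths inside this mount that are covered by another mount,
    # e.g. lxcfs files or a read-only bind mount of /proc/sys.
    overmounts: Tuple[str, ...] = ()


def is_fully_visible(mount: ProcMount) -> bool:
    # Simplifying assumption: any overmount or a read-only mount
    # disqualifies the instance.
    return not mount.read_only and not mount.overmounts


def may_mount_fresh_proc(mounts: Iterable[ProcMount]) -> bool:
    # The check passes if at least one existing instance is fully visible.
    return any(is_fully_visible(m) for m in mounts)


# lxc's default privileged-container setup, as described in the quoted mail.
lxc_default = [
    ProcMount("/proc", overmounts=("/proc/meminfo", "/proc/uptime",
                                   "/proc/sysrq-trigger", "/proc/sys")),
]
print(may_mount_fresh_proc(lxc_default))                  # False

# Workaround: keep a second, untouched proc mount somewhere else.
with_spare = lxc_default + [ProcMount("/mnt/spare-proc")]
print(may_mount_fresh_proc(with_spare))                   # True
```

In this model the spare mount is exactly what defeats a read-only
/proc/sys, which is why the workaround only makes sense when none of
the overmounts are there for security.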

Can we use the big hammer approach on /proc/sys?  Specifically, what
if we made it so that /proc mounts created in a non-root namespace
*only* see things that are scoped to the active namespaces, and only
those over which the mounter has capabilities?  We could have mount
options for this.
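
A minimal sketch of the filtering rule proposed above, in Python and
purely as a model: an entry shows up in the new mount only if it is
scoped to one of the active namespaces and the mounter has
capabilities over that namespace.  Nothing like this exists as an
interface today.  SysctlEntry, visible_entries and the namespace
labels ("net:container", "uts:host") are hypothetical names chosen
for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SysctlEntry:
    """A /proc/sys entry and the namespace it is scoped to (hypothetical model)."""
    path: str
    namespace: Optional[str]  # None means global, not namespaced at all


def visible_entries(entries: Iterable[SysctlEntry],
                    active_namespaces: Set[str],
                    capable_namespaces: Set[str]) -> List[str]:
    # Keep only entries that are namespaced, belong to an active namespace,
    # and belong to a namespace the mounter has capabilities over.
    return [
        e.path
        for e in entries
        if e.namespace is not None
        and e.namespace in active_namespaces
        and e.namespace in capable_namespaces
    ]


entries = [
    SysctlEntry("kernel/sysrq", None),                    # global knob
    SysctlEntry("net/ipv4/ip_forward", "net:container"),  # per network namespace
    SysctlEntry("kernel/hostname", "uts:host"),           # per UTS namespace
]

# The mounter created its own network namespace but still shares the
# host's UTS namespace, over which it has no capabilities.
print(visible_entries(entries,
                      active_namespaces={"net:container", "uts:host"},
                      capable_namespaces={"net:container"}))
# ['net/ipv4/ip_forward']
```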

/proc/sys utterly sucks for namespaced things.  So does the uid_map
and similar crap.  The API is simply awful.

On a related note, can we *please* find a way to constrain namespace
creation in a way that might satisfy the RHEL crowd?

>
> That said /proc/sys appears to be a show stopper in this scheme.  As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.
>
> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes.  Call it processfs.  That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.

Lovely.  Yet another vaguely-namespaced thing in a pseudo-filesystem.

--Andy