MIME-Version: 1.0
In-Reply-To: <20160307034516.GA11489@mail.hallyn.com>
References: <20160306082820.GA1917@mail.hallyn.com> <87oaar2ryz.fsf@x220.int.ebiederm.org>
 <CALCETrXUcWnP-mvdJCUy91iJ8U+0=K0WHEs1p8h89SisQr0ZOQ@mail.gmail.com> <20160307034516.GA11489@mail.hallyn.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Sun, 6 Mar 2016 19:49:14 -0800
Message-ID: <CALCETrWYOuKNjUTCc6bMHFg72gRKtRk1m+rG0sLTcnRti5G6dw@mail.gmail.com>
Subject: Re: user namespace and fully visible proc and sys mounts
To: "Serge E. Hallyn" <serge@hallyn.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>,
        Serge Hallyn <serge.hallyn@ubuntu.com>,
        Seth Forshee <seth.forshee@canonical.com>,
        lkml <linux-kernel@vger.kernel.org>,
        =?UTF-8?Q?St=C3=A9phane_Graber?= <stgraber@ubuntu.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2596
Lines: 55

On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <serge@hallyn.com> wrote:
> On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
>> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>> >
>> > "Serge E. Hallyn" <serge.hallyn@ubuntu.com> writes:
>> >
>> > > Hi,
>> > >
>> > > So we've been over this many times...  but unfortunately there is more
>> > > breakage to report.  Regular privileged and unprivileged containers
>> > > work all right for us.  But running an unprivileged container inside a
>> > > privileged container is blocked.
>> > >
>> > > When creating privileged containers, lxc by default does a few things:
>> > > it mounts some fuse.lxcfs files over proc files, including /proc/meminfo and
>> > > /proc/uptime.  It mounts proc rw but /proc/sysrq-trigger ro as well as
>> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
>> > > (because this container is not in a user namespace) then moves
>> > > /proc/sys/net back.  Finally it mounts sys ro but bind-mounts
>> > > /sys/devices/virtual/net as writeable.
>> > >
>> > > If any of these are left enabled, unprivileged containers can't be
>> > > started.  If all are disabled, then they can be.
>> > >
>> > > Can we find a way to make these not block remounts in child user
>> > > namespaces?  A boot flag, a procfs and sysfs mount option, a sysctl?
>> >
>> > Are any of these overmounts done for the purpose of security?  It
>> > appears the /proc/sys and /sys mounts being made read-only is for that
>> > purpose.
>> >
>> > If none of the mounts are for security the easy solution that works
>> > today is to also mount /proc and /sys somewhere else in your container
>> > so that the permission check for mounting a new copy passes.
>>
>> Can we use the big hammer approach on /proc/sys?  Specifically, what
>> if we made it so that /proc mounts created in a non-root namespace
>> *only* see things that are scoped to the active namespaces, and only
>> those over which the mounter has capabilities?  We could have mount
>> options for this.
>
> Of course the problem is precisely non-user-namespaced containers which
> do own and have capabilities over the /proc/sys files.  For user-namespaced
> containers /proc/sys/ isn't really an issue.

What I mean is:

mount -o nsonly=user,net -t proc none /proc

would show the list of processors and things scoped to the current
userns and netns, would *not* show global sysctls, and would fail
unless the caller has appropriate caps over the userns and netns.
This would work even if the old procfs is not fully visible.
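
To make the option syntax concrete, here is a rough sketch of how the
value of such an option might be turned into a set of namespace types.
This is illustration only: "nsonly" is a proposed option and does not
exist in procfs, and parse_nsonly() and its table of names are
hypothetical. Only the CLONE_NEW* constants are real.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>	/* CLONE_NEWUSER, CLONE_NEWNET, ... */
#include <string.h>

/*
 * Hypothetical helper: translate the value of the proposed "nsonly="
 * option (e.g. "user,net") into a mask of CLONE_NEW* flags.
 * Returns -1 on an unknown namespace name or an over-long value.
 */
int parse_nsonly(const char *value)
{
	static const struct { const char *name; int flag; } known[] = {
		{ "user", CLONE_NEWUSER },
		{ "net",  CLONE_NEWNET  },
		{ "pid",  CLONE_NEWPID  },
		{ "ipc",  CLONE_NEWIPC  },
		{ "uts",  CLONE_NEWUTS  },
	};
	const size_t nknown = sizeof(known) / sizeof(known[0]);
	char buf[64];
	char *save = NULL;
	int mask = 0;

	if (strlen(value) >= sizeof(buf))
		return -1;
	strcpy(buf, value);	/* strtok_r() modifies its input */

	for (char *tok = strtok_r(buf, ",", &save); tok;
	     tok = strtok_r(NULL, ",", &save)) {
		size_t i;

		for (i = 0; i < nknown; i++) {
			if (strcmp(tok, known[i].name) == 0) {
				mask |= known[i].flag;
				break;
			}
		}
		if (i == nknown)
			return -1;	/* unknown namespace name */
	}
	return mask;
}
```

With a mask like this, the mount path could check the caller's
capabilities over each listed namespace and refuse the mount otherwise;
that check is not shown here.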

--Andy