From: Andy Lutomirski
Date: Sun, 6 Mar 2016 19:49:14 -0800
Subject: Re: user namespace and fully visible proc and sys mounts
To: "Serge E. Hallyn"
Cc: "Eric W. Biederman", Serge Hallyn, Seth Forshee, lkml, Stéphane Graber
In-Reply-To: <20160307034516.GA11489@mail.hallyn.com>
References: <20160306082820.GA1917@mail.hallyn.com> <87oaar2ryz.fsf@x220.int.ebiederm.org> <20160307034516.GA11489@mail.hallyn.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn wrote:
> On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
>> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" wrote:
>> >
>> > "Serge E. Hallyn" writes:
>> >
>> > > Hi,
>> > >
>> > > So we've been over this many times... but unfortunately there is more
>> > > breakage to report. Regular privileged and unprivileged containers
>> > > work all right for us. But running an unprivileged container inside a
>> > > privileged container is blocked.
>> > >
>> > > When creating privileged containers, lxc by default does a few things:
>> > > it mounts some fuse.lxcfs files over procfiles including /proc/meminfo and
>> > > /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
>> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
>> > > (because this container is not in a user namespace) then moves
>> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
>> > > /sys/devices/virtual/net as writeable.
>> > >
>> > > If any of these are left enabled, unprivileged containers can't be
>> > > started. If all are disabled, then they can be.
>> > >
>> > > Can we find a way to make these not block remounts in child user
>> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>> >
>> > Are any of these overmounts done for the purpose of security? It
>> > appears the /proc/sys and /sys mounts being made read-only is for that
>> > purpose.
>> >
>> > If none of the mounts are for security the easy solution that works
>> > today is to also mount /proc and /sys somewhere else in your container
>> > so that the permission check for mounting a new copy passes.
>>
>> Can we use the big hammer approach on /proc/sys? Specifically, what
>> if we made it so that /proc mounts created in a non-root namespace
>> *only* see things that are scoped to the active namespaces, and only
>> those over which the mounter has capabilities? We could have mount
>> options for this.
>
> Of course the problem is precisely non-user-namespaced containers which
> do own and have capabilities over the /proc/sys/files. For user-namespaced
> containers /proc/sys/ isn't really an issue.

What I mean is:

mount -o nsonly=user,net -t proc none /proc

would show the list of processors and things scoped to the current
userns and netns, would *not* show global sysctls, and would fail
unless the caller has appropriate caps over the userns and netns.
This would work even if the old procfs is not fully visible.

--Andy