Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751932AbcCGCYo (ORCPT ); Sun, 6 Mar 2016 21:24:44 -0500
Received: from mail-ob0-f175.google.com ([209.85.214.175]:32882 "EHLO
	mail-ob0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751771AbcCGCYo (ORCPT );
	Sun, 6 Mar 2016 21:24:44 -0500
MIME-Version: 1.0
In-Reply-To: <87oaar2ryz.fsf@x220.int.ebiederm.org>
References: <20160306082820.GA1917@mail.hallyn.com> <87oaar2ryz.fsf@x220.int.ebiederm.org>
From: Andy Lutomirski
Date: Sun, 6 Mar 2016 18:24:23 -0800
Subject: Re: user namespace and fully visible proc and sys mounts
To: "Eric W. Biederman"
Cc: "Serge E. Hallyn", Serge Hallyn, Seth Forshee, lkml, Stéphane Graber
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2695
Lines: 61

On Mar 6, 2016 2:03 PM, "Eric W. Biederman" wrote:
>
> "Serge E. Hallyn" writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over proc files including /proc/meminfo
> > and /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace) then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces?
> > A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It
> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.
>
> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Can we use the big hammer approach on /proc/sys? Specifically, what
if we made it so that /proc mounts created in a non-root namespace
*only* see things that are scoped to the active namespaces, and only
those over which the mounter has capabilities? We could have mount
options for this.

/proc/sys utterly sucks for namespaced things. So does the uid_map
and similar crap. The API is simply awful.

On a related note, can we *please* find a way to constrain namespace
creation in a way that might satisfy the RHEL crowd?

>
> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.
>
> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.

Lovely. Yet another vaguely-namespaced thing in a pseudo-filesystem.

--Andy
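
For anyone trying Eric's workaround quoted above (keep an extra, unobscured
/proc and /sys mounted somewhere else in the container), here is a rough
userspace sanity check, as a sketch only. It parses mountinfo text in the
format documented in proc(5) and reports mounts of a given filesystem type
that have root "/", are not read-only, and have nothing mounted on top of
them. The helper names and the sample data are made up for illustration,
and the rule is deliberately conservative: it is an approximation and
should not be read as the kernel's actual permission check.

```python
# Hypothetical helpers, not part of lxc or the kernel.

def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo text (see proc(5)) into a list of dicts."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue  # blank or malformed line
        sep = fields.index("-")  # a lone "-" ends the optional fields
        if sep < 6 or sep + 1 >= len(fields):
            continue
        mounts.append({
            "id": fields[0],
            "parent": fields[1],
            "root": fields[3],
            "mountpoint": fields[4],
            "options": fields[5].split(","),
            "fstype": fields[sep + 1],
        })
    return mounts


def fully_visible(mounts, fstype):
    """Mount points of fstype that look unobscured.

    Assumed rule (stricter than necessary, chosen for simplicity):
    root is "/", not mounted read-only, and no other mount has this
    mount as its parent.
    """
    parents = {m["parent"] for m in mounts}
    return [
        m["mountpoint"]
        for m in mounts
        if m["fstype"] == fstype
        and m["root"] == "/"
        and "ro" not in m["options"]
        and m["id"] not in parents
    ]


# Made-up sample resembling the lxc setup described in the thread:
# /proc has an lxcfs file mounted over it, /sys is read-only, and a
# second clean proc instance is mounted at /mnt/clean-proc.
SAMPLE = """\
20 1 0:4 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
21 20 0:40 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw
22 1 0:4 / /mnt/clean-proc rw,nosuid,nodev,noexec,relatime - proc proc rw
23 1 0:17 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
"""

print(fully_visible(parse_mountinfo(SAMPLE), "proc"))   # ['/mnt/clean-proc']
print(fully_visible(parse_mountinfo(SAMPLE), "sysfs"))  # []
```

On a live system the same functions can be fed the contents of
/proc/self/mountinfo read from inside the container.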