Date: Wed, 13 Apr 2016 18:52:40 -0500
From: "Serge E. Hallyn"
To: Aditya Kali
Cc: "Serge E. Hallyn", Tejun Heo, Linux API, Linux Containers,
    "Eric W. Biederman", cgroups mailinglist, lkml
Subject: Re: [RFC PATCH] cgroup namespaces: add a 'nsroot=' mountinfo field
Message-ID: <20160413235240.GA921@mail.hallyn.com>
References: <20160321234133.GA22463@mail.hallyn.com>
    <20160413175736.GC3676@htj.duckdns.org>
    <20160413184639.GA29483@mail.hallyn.com>
    <20160413185033.GH3676@htj.duckdns.org>
    <20160413190152.GA29753@mail.hallyn.com>

Quoting Aditya Kali (adityakali@google.com):
> On Wed, Apr 13, 2016 at 12:01 PM, Serge E. Hallyn wrote:
> > Quoting Tejun Heo (tj@kernel.org):
> >> Hello, Serge.
> >>
> >> On Wed, Apr 13, 2016 at 01:46:39PM -0500, Serge E. Hallyn wrote:
> >> > It's not a leak of any information we're trying to hide. I realize
> >> > something like 8 years have passed, but I still basically go by the
> >> > ksummit guidance that containers are ok but the kernel's first priority
> >> > is to facilitate containers but not trick containers into thinking
> >> > they're not containerized. So long as the container is properly set
> >> > up, I don't think there's anything the workload could do with the
> >> > nsroot= info other than *know* that it is in a ns cgroup.
> >> >
> >> > If we did change that guidance, there's a slew of proc info that we
> >> > could better virtualize :)
> >>
> >> I see. I'm just wondering because the information here seems a bit
> >> gratuitous. Isn't the only thing necessary telling whether the root
> >> is bind mounted or namescoped? Wouldn't simple "nsroot" work for that
> >> purpose?
> >
> > I don't think so - we could be in a cgroup namespace but still have
> > access only to bind-mounted cgroups. So we need to compare the
> > superblock dentry root field to the nsroot= value.
>
> Umm, I don't think this is such a good idea. The main purpose of
> cgroup namespace was to prevent this exposure of system cgroup
> hierarchy that used to happen because of /proc/self/cgroup. Wouldn't
> showing that information in /proc/self/mountinfo defeat the purpose?

I disagree. The primary purpose was to simplify init's job and to keep
cgroup mounts in sync with /proc/self/cgroup, so that userspace doesn't
have to look at /proc/self/cgroup and then try to figure out how that
relates to its actual cgroup mountpoints. It was not to *hide* the
information. Field 3 already gives us the path; nsroot just tells us
what part of it we are namespaced under.

> > One practical problem I've found with cgroup namespaces is that there
> > is no way to disambiguate between a cgroupfs mount which was done in
> > a cgroup namespace, and a bind mount of a cgroupfs directory.
>
> That's actually by design, no? Namespaced apps should not know/care if
> they are running inside namespace. If they can find it out today, it's

No. If a workload isn't allowed to mount its own cgroups, and can only
see that freezer /lxc/x1 was mounted at /dev/cgroup (poorly done, but we
don't get to pass judgement or choose mountpoints for userspace), and it
sees /lxc/x1 in its freezer entry for /proc/self/cgroup, then it cannot
tell whether it should be using /dev/cgroup/tasks or
/dev/cgroup/lxc/tasks or /dev/cgroup/lxc/x1/tasks. That's a problem.

> just because of certain side-effects. I fear adding explicit "nsroot"
> or something in /proc/self/mountinfo now becomes an API making it hard
> to virtualize user-apps again.

It doesn't make it hard to virtualize. The only complication would be
if you wanted to checkpoint/restart and reproduce the exact
/proc/self/mountinfo output. That's a bogus goal anyway, since the
restart could be in a different cgroup and field 3 would be different.

In contrast, not providing this makes it impossible for software to deal
with both cgroup namespace and any bind-mounted cgroups. Which means
any new docker (say) which can run in cgroup namespaces will not be able
to run under old (that is, anything currently released except lxc 2.0)
container managers. We're breaking all container managers.

Now the other thing we could do would be to tweak field 3 in the
mountinfo output. That had been my first inclination, but the way the
mountinfo code is currently done makes that ... challenging.

-serge
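
To make the comparison discussed above concrete, here is a minimal Python sketch of how userspace might use the proposed field. It rests on several assumptions that are not confirmed by this thread: that nsroot= would show up among the per-superblock options at the end of a mountinfo line, that "field 3" means the mount root column (index 3 counting from zero), that this root is shown as an absolute cgroup path, and that /proc/self/cgroup reports a path relative to the namespace root. The sample line and all function names are hypothetical.

```python
import posixpath


def parse_mountinfo_line(line):
    """Pick out the mountinfo fields this sketch needs from one line."""
    # Fields before " - " are per-mount; fields after it are per-superblock.
    left, _, right = line.partition(" - ")
    pre, post = left.split(), right.split()
    return {
        "root": pre[3],          # "field 3" in the message (0-indexed)
        "mount_point": pre[4],
        "fstype": post[0],
        "super_options": post[2].split(",") if len(post) > 2 else [],
    }


def nsroot_of(info):
    """Return the value of the (assumed) nsroot= option, or None."""
    for opt in info["super_options"]:
        if opt.startswith("nsroot="):
            return opt[len("nsroot="):]
    return None


def tasks_path(mount_point, root, nsroot, ns_relative):
    """Resolve the tasks file for a cgroup path read from /proc/self/cgroup.

    Assumes ns_relative is relative to nsroot and root is absolute.
    Returns None if the task's cgroup is not visible under this mount.
    """
    absolute = posixpath.normpath(
        posixpath.join(nsroot, ns_relative.lstrip("/")))
    if absolute != root and not absolute.startswith(root.rstrip("/") + "/"):
        return None
    rel = posixpath.relpath(absolute, root)
    return posixpath.normpath(posixpath.join(mount_point, rel, "tasks"))


# Hypothetical mountinfo line for the freezer example from the message.
SAMPLE = ("100 50 0:25 /lxc/x1 /dev/cgroup rw,relatime shared:12 - "
          "cgroup cgroup rw,freezer,nsroot=/lxc/x1")

info = parse_mountinfo_line(SAMPLE)
# root == nsroot suggests a mount made inside the namespace; a root
# below nsroot suggests a bind mount of a cgroupfs subdirectory.
print(info["root"] == nsroot_of(info))
print(tasks_path(info["mount_point"], info["root"], nsroot_of(info), "/lxc/x1"))
```

Under these assumptions, the same root (/lxc/x1) and the same /proc/self/cgroup entry (/lxc/x1) resolve to /dev/cgroup/tasks when nsroot=/ and to /dev/cgroup/lxc/x1/tasks when nsroot=/lxc/x1, which is the ambiguity that cannot be resolved without the extra field.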