Date: Fri, 29 Apr 2016 15:54:03 +1000
Subject: Re: cgroup namespace and user namespace interactions
From: Aleksa Sarai
To: Tejun Heo
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org

> The new cgroup namespace currently only allows for superficial
> interaction with the user namespace (it checks against the namespace
> it was created in whether or not a user has the right capabilities
> before allowing mounting, and things like that). However, there is one
> glaring feature that appears to be missing from the new cgroup
> namespace implementation: unprivileged user namespaces can't modify
> their sub-hierarchy. This is particularly frustrating for the
> containerisation community, where we are working on adding support for
> "rootless containers" in runC (the execution driver of Docker)[1]. It
> essentially means that we can't use cgroup resource limiting to limit
> *the resources of our own processes*. It also makes things like the
> freezer cgroup unusable.
>
> Here follows how I think we can solve this issue: the most obvious way
> of dealing with this would be (in the cgroupv1 view) to create a new
> subtree in every controller when you CLONE_NEWCGROUP. This new subtree
> is the root of the process's cgroup hierarchy. This doesn't affect any
> resource control, but it will result in the process only being able to
> affect its *own* resources.
> However, for cgroupv2 we have the "No Internal Process Constraint".
> So, maybe we could also move all of the other processes into a
> sibling subtree (with the *exact same* access permissions as the
> parent). Thus, the operation would look like this:
>
> - C0 - P00
>      \ P01
>      \ P02 (about to setns)
>
> becomes
>
> - C0 - C00 - P00
>            \ P01
>      \ C01 - P02
>
> But then we have C00 which is just a waste of cycles (it doesn't have
> any resource settings). So maybe there's some optimisation we can do
> there, but that's as far as I've gotten into thinking about how to
> deal with the constraints of cgroupv2. After that's been solved we can
> reuse how we store the user namespace the cgroup was created in
> (cgroup_namespace.user_ns), and just check that whatever user is
> trying to modify the cgroup has CAP_SYS_ADMIN in that user namespace.
>
> Do you think this would work? Are there any recommendations on whether
> we can make this work better? Also, can you clarify whether or not
> CLONE_NEWCGROUP only works for cgroupv2 or does it also work on
> cgroupv1 (we haven't yet transitioned to cgroupv2 in runC).
>
> Thanks.
>
> [1]: https://github.com/opencontainers/runc/pull/774

Does anyone have an opinion on this proposal?

-- 
Aleksa Sarai (cyphar)
www.cyphar.com
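
To make the proposed split concrete, here is a rough userspace model of it in Python. It is only a sketch, not kernel code: the Cgroup class, split_for_new_namespace(), may_modify() and the owner_userns field are all hypothetical names made up for illustration. It also assumes that ownership is recorded per cgroup and that capabilities can be looked up per user namespace by name, which ignores how capabilities are inherited across nested user namespaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Cgroup:
    """Toy cgroup node for illustration; not a kernel structure."""
    name: str
    procs: List[str] = field(default_factory=list)
    children: List["Cgroup"] = field(default_factory=list)
    # Assumption: each cgroup remembers the user namespace that owns it.
    owner_userns: str = "init_user_ns"


def split_for_new_namespace(parent: Cgroup, proc: str, new_userns: str) -> Cgroup:
    """Move `proc` into a new child of `parent`, and every other process
    into a sibling child, so that `parent` holds no processes itself."""
    if proc not in parent.procs:
        raise ValueError(f"{proc} is not a member of {parent.name}")
    # Sibling subtree for everything else, with the parent's ownership.
    rest = Cgroup(parent.name + "0",
                  procs=[p for p in parent.procs if p != proc],
                  owner_userns=parent.owner_userns)
    # Root of the new cgroup namespace, holding only the requesting process.
    ns_root = Cgroup(parent.name + "1", procs=[proc], owner_userns=new_userns)
    parent.procs = []  # parent is now an internal node without processes
    parent.children += [rest, ns_root]
    return ns_root


def may_modify(cgroup: Cgroup, caps: Dict[str, Set[str]]) -> bool:
    """Allow writes only if the caller has CAP_SYS_ADMIN in the owning
    user namespace. `caps` maps a user namespace name to the caller's
    capabilities in it (a simplification)."""
    return "CAP_SYS_ADMIN" in caps.get(cgroup.owner_userns, set())
```

Starting from C0 with P00, P01 and P02 and splitting out P02 gives C00 (P00, P01) and C01 (P02), matching the diagram in the quoted message. A caller with CAP_SYS_ADMIN only in the new user namespace would then be allowed to modify C01 but not C00.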