Date: Tue, 26 Apr 2016 04:37:01 +1000
Subject: cgroup namespace and user namespace interactions
From: Aleksa Sarai
To: Tejun Heo
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org

Hello,

The new cgroup namespace currently only allows for superficial interaction with the user namespace (it checks whether a user has the right capabilities in the user namespace it was created in before allowing mounting, and things like that). However, there is one glaring feature that appears to be missing from the new cgroup namespace implementation: unprivileged user namespaces can't modify their sub-hierarchy. This is particularly frustrating for the containerisation community, where we are working on adding support for "rootless containers" in runC (the execution driver of Docker)[1]. It essentially means that we can't use cgroup resource limiting to limit *the resources of our own processes*. It also makes things like the freezer cgroup unusable.

Here is how I think we can solve this issue. The most obvious way of dealing with this would be (in the cgroupv1 view) to create a new subtree in every controller when you CLONE_NEWCGROUP. This new subtree is the root of the process's cgroup hierarchy. This doesn't affect any resource control, but it will result in the process only being able to affect its *own* resources.
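To make the cgroupv1 idea concrete, here is a rough userspace model of it. It is only a sketch of the proposal, not kernel code: the Cgroup class, find_cgroup(), unshare_newcgroup() and the "ns-<pid>" naming are all hypothetical names made up for illustration. Each controller is modelled as a tree, and "unsharing" creates a fresh child under the process's current cgroup in every controller and moves the process into it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass(eq=False)
class Cgroup:
    """Toy model of one cgroup node in a single controller's hierarchy."""
    name: str
    parent: Optional["Cgroup"] = None
    children: Dict[str, "Cgroup"] = field(default_factory=dict)
    procs: Set[int] = field(default_factory=set)

    def add_child(self, name: str) -> "Cgroup":
        child = Cgroup(name, parent=self)
        self.children[name] = child
        return child

    def path(self) -> str:
        # The hierarchy root is "/"; every other node appends its own name.
        if self.parent is None:
            return "/"
        return f"{self.parent.path().rstrip('/')}/{self.name}"


def find_cgroup(root: Cgroup, pid: int) -> Optional[Cgroup]:
    """Depth-first search for the cgroup that currently holds pid."""
    if pid in root.procs:
        return root
    for child in root.children.values():
        found = find_cgroup(child, pid)
        if found is not None:
            return found
    return None


def unshare_newcgroup(controllers: Dict[str, Cgroup], pid: int) -> Dict[str, Cgroup]:
    """Hypothetical model of the proposal: in every controller, create a new
    subtree under pid's current cgroup, move pid into it, and return the new
    subtree as the root of pid's view of that controller."""
    ns_roots: Dict[str, Cgroup] = {}
    for controller, root in controllers.items():
        current = find_cgroup(root, pid)
        if current is None:
            raise ValueError(f"pid {pid} is not in the {controller} hierarchy")
        subtree = current.add_child(f"ns-{pid}")  # naming is an assumption
        current.procs.discard(pid)
        subtree.procs.add(pid)
        ns_roots[controller] = subtree
    return ns_roots
```

In this model no limits on existing cgroups are touched; the process simply ends up in a leaf of its own, which is the only node it would be allowed to modify.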

However, for cgroupv2 we have the "No Internal Process Constraint". So, maybe we could also move all of the other processes into a sibling subtree (with the *exact same* access permissions as the parent). Thus, the operation would look like this:

    - C0 - P00
         \ P01
         \ P02 (about to setns)

becomes

    - C0 - C00 - P00
         \     \ P01
         \ C01 - P02

But then we have C00 which is just a waste of cycles (it doesn't have any resource settings). So maybe there's some optimisation we can do there, but that's as far as I've gotten into thinking about how to deal with the constraints of cgroupv2.

After that's been solved we can reuse how we store the user namespace the cgroup was created in (cgroup_namespace.user_ns), and just check that whatever user is trying to modify the cgroup has CAP_SYS_ADMIN in that user namespace.

Do you think this would work? Are there any recommendations on whether we can make this work better? Also, can you clarify whether or not CLONE_NEWCGROUP only works for cgroupv2 or does it also work on cgroupv1 (we haven't yet transitioned to cgroupv2 in runC).

Thanks.

[1]: https://github.com/opencontainers/runc/pull/774

-- 
Aleksa Sarai (cyphar)
www.cyphar.com