2014-10-31 19:19:13

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 0/7] CGroup Namespaces

Another attempt at Cgroup Namespace patch-set. This incorporates
suggestions on previous patch-set.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
cgroupns-root.

Changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
your cgroupns-root.

More details in the writeup below.

Background
Cgroups and Namespaces are used together to create “virtual”
containers that isolates the host environment from the processes
running in container. But since cgroups themselves are not
“virtualized”, the task is always able to see global cgroups view
through cgroupfs mount and via /proc/self/cgroup file.

$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

This exposure of cgroup names to the processes running inside a
container results in some problems:
(1) The container names are typically host-container-management-agent
(systemd, docker/libcontainer, etc.) data and leaking its name (or
leaking the hierarchy) reveals too much information about the host
system.
(2) It makes the container migration across machines (CRIU) more
difficult as the container names need to be unique across the
machines in the migration domain.
(3) It makes it difficult to run container management tools (like
docker/libcontainer, lmctfy, etc.) within virtual containers
without adding dependency on some state/agent present outside the
container.

Note that the feature proposed here is completely different than the
“ns cgroup” feature which existed in the linux kernel until recently.
The ns cgroup also attempted to connect cgroups and namespaces by
creating a new cgroup every time a new namespace was created. It did
not solve any of the above mentioned problems and was later dropped
from the kernel. Incidentally though, it used the same config option
name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
With unified cgroup hierarchy
(Documentation/cgroups/unified-hierarchy.txt), the containers can now
have a much more coherent cgroup view and its easy to associate a
container with a single cgroup. This also allows us to virtualize the
cgroup view for tasks inside the container.

The new CGroup Namespace allows a process to “unshare” its cgroup
hierarchy starting from the cgroup its currently in.
For Ex:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
$ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
[ns]$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
cgroup:[4026532183]
# From within new cgroupns, process sees that its in the root cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

# From global cgroupns:
$ cat /proc/<pid>/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

# Unshare cgroupns along with userns and mountns
# Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
# sets up uid/gid map and exec’s /bin/bash
$ ~/unshare -c -u -m

# Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
# hierarchy.
[ns]$ mount -t cgroup cgroup /tmp/cgroup
[ns]$ ls -l /tmp/cgroup
total 0
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
-rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
-rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
filesystem root for the namespace specific cgroupfs mount.

The virtualization of /proc/self/cgroup file combined with restricting
the view of cgroup hierarchy by namespace-private cgroupfs mount
should provide a completely isolated cgroup view inside the container.

In its current form, the cgroup namespaces patcheset provides following
behavior:

(1) The “root” cgroup for a cgroup namespace is the cgroup in which
the process calling unshare is running.
For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
For the init_cgroup_ns, this is the real root (“/”) cgroup
(identified in code as cgrp_dfl_root.cgrp).

(2) The cgroupns-root cgroup does not change even if the namespace
creator process later moves to a different cgroup.
$ ~/unshare -c # unshare cgroupns in some cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
[ns]$ mkdir sub_cgrp_1
[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
(a) Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup
[ns]$ sleep 100000 & # From within unshared cgroupns
[1] 7353
[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(b) From global cgroupns, the real cgroup path will be visible:
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
path relative to its own cgroupns-root will be shown:
# ns2's cgroupns-root is at '/batchjobs/c_job_id2'
[ns2]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id2/sub_cgrp_1
[ns2]$
Note that the relative path always starts with '/' to indicate that its
relative to the cgroupns-root of the caller.

(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
(if they have proper access to external cgroups).
# From inside cgroupns (with cgroupns-root at /batchjobs/c_job_id1), and
# assuming that the global hierarchy is still accessible inside cgroupns:
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
$ echo 7353 > batchjobs/c_job_id2/cgroup.procs
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../c_job_id2

Note that this kind of setup is not encouraged. A task inside cgroupns
should only be exposed to its own cgroupns hierarchy. Otherwise it makes
the virtualization of /proc/<pid>/cgroup less useful.

(5) Setns to another cgroup namespace is allowed when:
(a) the process has CAP_SYS_ADMIN in its current userns
(b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
No implicit cgroup changes happen with attaching to another cgroupns. It
is expected that the somone moves the attaching process under the target
cgroupns-root.

(6) When some thread from a multi-threaded process unshares its
cgroup-namespace, the new cgroupns gets applied to the entire
process (all the threads). This should be OK since
unified-hierarchy only allows process-level containerization. So
all the threads in the process will have the same cgroup. And both
- changing cgroups and unsharing namespaces - are protected under
threadgroup_lock(task).

(7) The cgroup namespace is alive as long as there is atleast 1
process inside it. When the last process exits, the cgroup
namespace is destroyed. The cgroupns-root and the actual cgroups
remain though.

(8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
the unified cgroup hierarchy with cgroupns-root as the filesystem root.
The process needs CAP_SYS_ADMIN in its userns and mntns.

Implementation
The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
branch). Its fairly non-intrusive and provides above mentioned
features.

Possible extensions of CGROUPNS:
(1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
capabilities to restrict cgroups to administrative users. CGroup
namespaces could be of help here. With cgroup namespaces, it might
be possible to delegate administration of sub-cgroups under a
cgroupns-root to the cgroupns owner.

---

fs/kernfs/dir.c | 194 ++++++++++++++++++++++++++++++++++-----
fs/kernfs/mount.c | 48 ++++++++++
fs/proc/namespaces.c | 1 +
include/linux/cgroup.h | 41 ++++++++-
include/linux/cgroup_namespace.h | 36 ++++++++
include/linux/kernfs.h | 5 +
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 +
include/uapi/linux/sched.h | 3 +-
kernel/Makefile | 2 +-
kernel/cgroup.c | 108 +++++++++++++++++-----
kernel/cgroup_namespace.c | 148 +++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++-
14 files changed, 561 insertions(+), 52 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c

[PATCHv2 1/7] kernfs: Add API to generate relative kernfs path
[PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv2 3/7] cgroup: add function to get task's cgroup on default
[PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()
[PATCHv2 5/7] cgroup: introduce cgroup namespaces
[PATCHv2 6/7] cgroup: cgroup namespace setns support
[PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns


2014-10-31 19:19:21

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 1/7] kernfs: Add API to generate relative kernfs path

The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/dir.c | 194 +++++++++++++++++++++++++++++++++++++++++++------
include/linux/kernfs.h | 3 +
2 files changed, 176 insertions(+), 21 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..e49c365 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,158 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}

-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
{
- char *p = buf + buflen;
+ size_t depth = 0;
+
+ BUG_ON(!kn);
+ while (kn->parent) {
+ depth++;
+ kn = kn->parent;
+ }
+ return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to: /n1/n2/n3/n4/n5
+ * result: /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to: /n1/n2/n5
+ * result: /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5 [depth=5]
+ * kn_to: /n1/n2/n3 [depth=3]
+ * result: /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to,
+ char *buf,
+ size_t buflen)
+{
+ char *p = buf;
+ struct kernfs_node *kn;
+ size_t depth_from = 0, depth_to, d;
int len;

- *--p = '\0';
+ /* We atleast need 2 bytes to write "/\0". */
+ BUG_ON(buflen < 2);

- do {
- len = strlen(kn->name);
- if (p - buf < len + 1) {
- buf[0] = '\0';
- p = NULL;
- break;
+ if (kn_from == kn_to) {
+ *p = '/';
+ *(p + 1) = '\0';
+ return p;
+ }
+
+ /* We can find the relative path only if both the nodes belong to the
+ * same kernfs root.
+ */
+ if (kn_from) {
+ BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+ depth_from = kernfs_node_depth(kn_from);
+ }
+
+ depth_to = kernfs_node_depth(kn_to);
+
+ /* We compose path from left to right. So first write out all possible
+ * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+ */
+ if (kn_from) {
+ while (depth_from > depth_to) {
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
}
+
+ d = depth_to;
+ kn = kn_to;
+ while (depth_from < d) {
+ kn = kn->parent;
+ d--;
+ }
+
+ /* Now we have 'depth_from == depth_to' at this point. Add more
+ * "/.."s until we reach common ancestor. In the worst case,
+ * root node will be the common ancestor.
+ */
+ while (depth_from > 0) {
+ /* If we reached common ancestor, stop. */
+ if (kn_from == kn)
+ break;
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
+ kn = kn->parent;
+ }
+ }
+
+ /* Figure out how many bytes we need to write the path.
+ */
+ d = depth_to;
+ kn = kn_to;
+ len = 0;
+ while (depth_from < d) {
+ /* Account for "/<name>". */
+ len += strlen(kn->name) + 1;
+ kn = kn->parent;
+ --d;
+ }
+
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+
+ /* We have enough space. Move 'p' ahead by computed length and start
+ * writing node names into buffer.
+ */
+ p += len;
+ *p = '\0';
+ d = depth_to;
+ kn = kn_to;
+ while (d > depth_from) {
+ len = strlen(kn->name);
p -= len;
memcpy(p, kn->name, len);
*--p = '/';
kn = kn->parent;
- } while (kn && kn->parent);
+ --d;
+ }

- return p;
+ return buf;
}

/**
@@ -92,26 +222,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
}

/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;

spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);

/**
@@ -145,8 +297,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)

spin_lock_irqsave(&kernfs_rename_lock, flags);

- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}

int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:19:27

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali <[email protected]>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:19:33

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
}

+/**
+ * kernfs_make_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..250aaec 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

memset(opts, 0, sizeof(*opts));

+ /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+ * namespace.
+ */
+ if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+ opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+ }
+
while ((token = strsep(&o, ",")) != NULL) {
nr_opts++;

@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
- if (nr_opts != 1) {
+ if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
{
LIST_HEAD(tmp_links);
@@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1841,28 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);

- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }

dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+ /* If this mount is for the default hierarchy in non-init cgroup
+ * namespace, then instead of root cgroup's dentry, we return
+ * the dentry corresponding to the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);

@@ -1834,6 +1875,7 @@ out_free:
deactivate_super(pinned_sb);
}

+ put_cgroup_ns(ns);
return dentry;
}

@@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct kobject *cgroup_kobj;
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:19:36

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 6/7] cgroup: cgroup namespace setns support

setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that the somone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <[email protected]>
---
kernel/cgroup_namespace.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 7e9bda0..0803575 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -86,8 +86,22 @@ err_out:

static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
{
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+ !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+ threadgroup_unlock(current);
+
+ return 0;
}

static void *cgroupns_get(struct task_struct *task)
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:20:11

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/proc/namespaces.c | 1 +
include/linux/cgroup.h | 18 +++++-
include/linux/cgroup_namespace.h | 36 +++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
kernel/Makefile | 2 +-
kernel/cgroup.c | 14 ++++
kernel/cgroup_namespace.c | 134 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
10 files changed, 227 insertions(+), 5 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+ &cgroupns_operations,
};

static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

#ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
#endif
};

+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
}

static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+ return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;

/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;

struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};

#ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

#include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
root = parent->root;

/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
unsigned long key;
int ssid, err;

+ get_user_ns(init_cgroup_ns.user_ns);
+
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..7e9bda0
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,134 @@
+/*
+ * Copyright (C) 2014 Google Inc.
+ *
+ * Author: Aditya Kali ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+ kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+ */
+ cgrp = get_task_cgroup(current);
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out_unlock;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out_unlock;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ threadgroup_unlock(current);
+
+ return new_ns;
+
+err_out_unlock:
+ threadgroup_unlock(current);
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ rcu_read_unlock();
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};

static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}

+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;

out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:20:44

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 3/7] cgroup: add function to get task's cgroup on default hierarchy

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
}

char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);

int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 136ecea..50fa8e3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1917,6 +1917,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
}
EXPORT_SYMBOL_GPL(task_cgroup_path);

+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
/* used to track tasks and other necessary states during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
--
2.1.0.rc2.206.gedb03e5

2014-10-31 19:20:43

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv2 4/7] cgroup: export cgroup_get() and cgroup_put()

move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 22 ++++++++++++++++++++++
kernel/cgroup.c | 22 ----------------------
2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
}

+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 50fa8e3..9c622b9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
}

-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}

-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
/**
* cgroup_refresh_child_subsys_mask - update child_subsys_mask
* @cgrp: the target cgroup
--
2.1.0.rc2.206.gedb03e5

2014-11-01 00:03:06

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <[email protected]> wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
> they will see a relative path from their cgroupns-root).
> For a correctly setup container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>

> + /* Prevent cgroup changes for this task. */
> + threadgroup_lock(current);

This could just be me being dense, but what is the lock for?

> +
> + /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> + */
> + cgrp = get_task_cgroup(current);
> +
> + err = -ENOMEM;
> + new_ns = alloc_cgroup_ns();
> + if (!new_ns)
> + goto err_out_unlock;
> +
> + err = proc_alloc_inum(&new_ns->proc_inum);
> + if (err)
> + goto err_out_unlock;
> +
> + new_ns->user_ns = get_user_ns(user_ns);
> + new_ns->root_cgrp = cgrp;
> +
> + threadgroup_unlock(current);
> +
> + return new_ns;
> +
> +err_out_unlock:
> + threadgroup_unlock(current);
> +err_out:
> + if (cgrp)
> + cgroup_put(cgrp);
> + kfree(new_ns);
> + return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> + pr_info("setns not supported for cgroup namespace");
> + return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> + struct cgroup_namespace *ns = NULL;
> + struct nsproxy *nsproxy;
> +
> + rcu_read_lock();
> + nsproxy = task->nsproxy;
> + if (nsproxy) {
> + ns = nsproxy->cgroup_ns;
> + get_cgroup_ns(ns);
> + }
> + rcu_read_unlock();

How is this correct? Other namespaces do it too, so it Must Be
Correct (tm), but I don't understand. What is RCU protecting?

--Andy

2014-11-01 00:08:16

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 2 ++
> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> return NULL;
> }
>
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn)
> +{

I can't usefully review this, but kernfs_make_root and
kernfs_obtain_root aren't the same string...

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> memset(opts, 0, sizeof(*opts));
>
> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> + * namespace.
> + */
> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> + }
> +

I don't like this implicit stuff. Can you just return -EINVAL if sane
behavior isn't requested?

> while ((token = strsep(&o, ",")) != NULL) {
> nr_opts++;
>
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> - if (nr_opts != 1) {
> + if (nr_opts > 1) {
> pr_err("sane_behavior: no other mount options allowed\n");
> return -EINVAL;

This looks wrong. But, if you make the change above, then it'll be right.

> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> int ret;
> int i;
> bool new_sb;
> + struct cgroup_namespace *ns =
> + get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }

Why is this necessary?

> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
> .name = "cgroup",
> .mount = cgroup_mount,
> .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,

Aargh, another one! Eric, can you either ack or nack my patch?
Because if my patch goes in, then this line may need to change. Or
not, but if a stable release with cgroupfs and without my patch
happens, then we'll have an ABI break.

--Andy

2014-11-01 00:59:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

Andy Lutomirski <[email protected]> writes:

> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <[email protected]> wrote:

<snip>

>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> + struct cgroup_namespace *ns = NULL;
>> + struct nsproxy *nsproxy;
>> +
>> + rcu_read_lock();
>> + nsproxy = task->nsproxy;
>> + if (nsproxy) {
>> + ns = nsproxy->cgroup_ns;
>> + get_cgroup_ns(ns);
>> + }
>> + rcu_read_unlock();
>
> How is this correct? Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand. What is RCU protecting?

The code is not correct. The code needs to use task_lock.

RCU used to protect nsproxy, and now task_lock protects nsproxy.
For the reasons of of all of this I refer you to the commit
that changed this, and the comment in nsproxy.h

commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
Author: Eric W. Biederman <[email protected]>
Date: Mon Feb 3 19:13:49 2014 -0800

namespaces: Use task_lock and not rcu to protect nsproxy

The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
a sufficiently expensive system call that people have complained.

Upon inspect nsproxy no longer needs rcu protection for remote reads.
remote reads are rare. So optimize for same process reads and write
by switching using rask_lock instead.

This yields a simpler to understand lock, and a faster setns system call.

In particular this fixes a performance regression observed
by Rafael David Tinoco <[email protected]>.

This is effectively a revert of Pavel Emelyanov's commit
cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
from 2007. The race this originialy fixed no longer exists as
do_notify_parent uses task_active_pid_ns(parent) instead of
parent->nsproxy.

Signed-off-by: "Eric W. Biederman" <[email protected]>

Eric

2014-11-01 01:10:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Aditya Kali <[email protected]> writes:

> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.

There is a misdesign in this. Because files already exist we need the
protections that are present in proc and sysfs that only allow you to
mount the filesystem if it is already mounted. Otherwise you can wind
up mounting this cgroupfs in a chroot jail when the global root would
not like you to see it. cgroupfs isn't as bad as proc and sys but there
is at the very least an information leak in mounting it.

Given that we are effectively performing a bind mount in this patch, and
that we need to require cgroupfs be mounted anyway (to be safe).

I don't see the point of this change.

If we could change the set of cgroups or visible in cgroupfs I could
probably see the point. But as it is this change seems to be pointless.

Eric


> Signed-off-by: Aditya Kali <[email protected]>
> ---
> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 2 ++
> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> return NULL;
> }
>
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn)
> +{
> + struct dentry *dentry;
> + struct inode *inode;
> +
> + BUG_ON(sb->s_op != &kernfs_sops);
> +
> + /* inode for the given kernfs_node should already exist. */
> + inode = ilookup(sb, kn->ino);
> + if (!inode) {
> + pr_debug("kernfs: could not get inode for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + /* instantiate and link root dentry */
> + dentry = d_obtain_root(inode);
> + if (!dentry) {
> + pr_debug("kernfs: could not get dentry for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
> + * may be called by callers other than kernfs_fill_super. */
> + mutex_lock(&kernfs_mutex);
> + if (!dentry->d_fsdata) {
> + kernfs_get(kn);
> + dentry->d_fsdata = kn;
> + } else {
> + WARN_ON(dentry->d_fsdata != kn);
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return dentry;
> +}
> +
> static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> {
> struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
> struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
> struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn);
> struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
> unsigned int flags, void *priv);
> void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7e5d597..250aaec 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> memset(opts, 0, sizeof(*opts));
>
> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> + * namespace.
> + */
> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> + }
> +
> while ((token = strsep(&o, ",")) != NULL) {
> nr_opts++;
>
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> - if (nr_opts != 1) {
> + if (nr_opts > 1) {
> pr_err("sane_behavior: no other mount options allowed\n");
> return -EINVAL;
> }
> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> }
>
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> + struct cgroup_namespace *ns)
> +{
> + struct dentry *nsdentry;
> +
> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> + return nsdentry;
> +}
> +
> static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
> {
> LIST_HEAD(tmp_links);
> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> int ret;
> int i;
> bool new_sb;
> + struct cgroup_namespace *ns =
> + get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }
>
> /*
> * The first time anyone tries to mount a cgroup, enable the list
> @@ -1817,11 +1841,28 @@ out_free:
> kfree(opts.release_agent);
> kfree(opts.name);
>
> - if (ret)
> + if (ret) {
> + put_cgroup_ns(ns);
> return ERR_PTR(ret);
> + }
>
> dentry = kernfs_mount(fs_type, flags, root->kf_root,
> CGROUP_SUPER_MAGIC, &new_sb);
> +
> + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> + /* If this mount is for the default hierarchy in non-init cgroup
> + * namespace, then instead of root cgroup's dentry, we return
> + * the dentry corresponding to the cgroupns->root_cgrp.
> + */
> + if (ns != &init_cgroup_ns) {
> + struct dentry *nsdentry;
> +
> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> + dput(dentry);
> + dentry = nsdentry;
> + }
> + }
> +
> if (IS_ERR(dentry) || !new_sb)
> cgroup_put(&root->cgrp);
>
> @@ -1834,6 +1875,7 @@ out_free:
> deactivate_super(pinned_sb);
> }
>
> + put_cgroup_ns(ns);
> return dentry;
> }
>
> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
> .name = "cgroup",
> .mount = cgroup_mount,
> .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> static struct kobject *cgroup_kobj;

2014-11-01 03:00:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Andy Lutomirski <[email protected]> writes:
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> + .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one! Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change. Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.

cgroupfs has no device nodes. So as long as we are consistent in any
given release what happens here is orthogonal.

I don't remember if we have managed to get the original problem fixed
with the trivial backportable solution. I think so.

My apologies for not getting to that I haven't even had time to shepherd
through the regression associated regression fix. I probably just lock
track of them but I haven't found the Tested-By's for it yet.

Nor have I had time to dig through and figure out how to safely deal
with umount -l aka MOUNT_DETACH.

Along with the question about what to do with nodev, there is also
your patch about nosuid.

Starting in about 5 minutes I am going to be mostly offline until
sometime in the 3rd week in November as I haul all of my stuff accross
the country to someplace that actually has winter and my allergies don't
kill me.

I am going to have to review and merge a lot of code as soon as I am
back to being a programmer full time again. There is a lot of
interesting stuff coming in right now.

Eric

2014-11-01 03:29:38

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Fri, Oct 31, 2014 at 7:59 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>>> .name = "cgroup",
>>> .mount = cgroup_mount,
>>> .kill_sb = cgroup_kill_sb,
>>> + .fs_flags = FS_USERNS_MOUNT,
>>
>> Aargh, another one! Eric, can you either ack or nack my patch?
>> Because if my patch goes in, then this line may need to change. Or
>> not, but if a stable release with cgroupfs and without my patch
>> happens, then we'll have an ABI break.
>
> cgroupfs has no device nodes. So as long as we are consistent in any
> given release what happens here is orthogonal.
>
> I don't remember if we have managed to get the original problem fixed
> with the trivial backportable solution. I think so.

I don't remember. I think the problem is still there, since I think
my patch still applies, and my patch conflicts with your fix. It's
been long enough that I'm not sure it's worth applying your patch as
an interim fix.

>
> My apologies for not getting to that I haven't even had time to shepherd
> through the regression associated regression fix. I probably just lock
> track of them but I haven't found the Tested-By's for it yet.

No worries. I've tested it, but it's my patch, so there's a big grain
of salt there. I think Serge tested it, too.

>
> Nor have I had time to dig through and figure out how to safely deal
> with umount -l aka MOUNT_DETACH.

If you're talking about the do_remount_sb thing, that's already in Linus' tree.

>
> Along with the question about what to do with nodev, there is also
> your patch about nosuid.

The nosuid patch has a couple versions, and I'm not sure which version
I prefer. It's certainly debatable.

>
> Starting in about 5 minutes I am going to be mostly offline until
> sometime in the 3rd week in November as I haul all of my stuff accross
> the country to someplace that actually has winter and my allergies don't
> kill me.

Have fun!

--Andy

>
> I am going to have to review and merge a lot of code as soon as I am
> back to being a programmer full time again. There is a lot of
> interesting stuff coming in right now.
>
> Eric



--
Andy Lutomirski
AMA Capital Management, LLC

2014-11-03 22:46:52

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

(sorry for accidental non-plain-text response earlier).

On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman
<[email protected]> wrote:
> Aditya Kali <[email protected]> writes:
>
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>
> There is a misdesign in this. Because files already exist we need the
> protections that are present in proc and sysfs that only allow you to
> mount the filesystem if it is already mounted. Otherwise you can wind
> up mounting this cgroupfs in a chroot jail when the global root would
> not like you to see it. cgroupfs isn't as bad as proc and sys but there
> is at the very least an information leak in mounting it.
>

I think simply mounting the cgroupfs doesn't give you any more
information than what you don't already know about the system ;
specially if the visibility is restricted within the process's
cgroupns-root. The cgroups still wont be writable by the user, so I
think it should be fine to allow mounting?

> Given that we are effectively performing a bind mount in this patch, and
> that we need to require cgroupfs be mounted anyway (to be safe).
>
> I don't see the point of this change.
>
> If we could change the set of cgroups or visible in cgroupfs I could
> probably see the point. But as it is this change seems to be pointless.
>

I agree that this is effectively bind-mounting, but doing this in
kernel makes it really convenient for the userspace. The process that
sets up the container doesn't need to care whether it should
bind-mount cgroupfs inside the container or not. The tasks inside the
container can mount cgroupfs on as-needed basis. The root container
manager can simply unshare cgroupns and forget about the internal
setup. I think this is useful just for the reason that it makes life
much simpler for userspace.

> Eric
>
>
>> Signed-off-by: Aditya Kali <[email protected]>
>> ---
>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/kernfs.h | 2 ++
>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>> return NULL;
>> }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn)
>> +{
>> + struct dentry *dentry;
>> + struct inode *inode;
>> +
>> + BUG_ON(sb->s_op != &kernfs_sops);
>> +
>> + /* inode for the given kernfs_node should already exist. */
>> + inode = ilookup(sb, kn->ino);
>> + if (!inode) {
>> + pr_debug("kernfs: could not get inode for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-EINVAL);
>> + }
>> +
>> + /* instantiate and link root dentry */
>> + dentry = d_obtain_root(inode);
>> + if (!dentry) {
>> + pr_debug("kernfs: could not get dentry for '");
>> + pr_cont_kernfs_path(kn);
>> + pr_cont("'.\n");
>> + return ERR_PTR(-ENOMEM);
>> + }
>> +
>> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
>> + * may be called by callers other than kernfs_fill_super. */
>> + mutex_lock(&kernfs_mutex);
>> + if (!dentry->d_fsdata) {
>> + kernfs_get(kn);
>> + dentry->d_fsdata = kn;
>> + } else {
>> + WARN_ON(dentry->d_fsdata != kn);
>> + }
>> + mutex_unlock(&kernfs_mutex);
>> +
>> + return dentry;
>> +}
>> +
>> static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
>> {
>> struct kernfs_super_info *info = kernfs_info(sb);
>> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
>> index 3c2be75..b9538e0 100644
>> --- a/include/linux/kernfs.h
>> +++ b/include/linux/kernfs.h
>> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
>> struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
>> struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>>
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn);
>> struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
>> unsigned int flags, void *priv);
>> void kernfs_destroy_root(struct kernfs_root *root);
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> memset(opts, 0, sizeof(*opts));
>>
>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> + * namespace.
>> + */
>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> + }
>> +
>> while ((token = strsep(&o, ",")) != NULL) {
>> nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> - if (nr_opts != 1) {
>> + if (nr_opts > 1) {
>> pr_err("sane_behavior: no other mount options allowed\n");
>> return -EINVAL;
>> }
>> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
>> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
>> }
>>
>> +struct dentry *cgroupns_get_root(struct super_block *sb,
>> + struct cgroup_namespace *ns)
>> +{
>> + struct dentry *nsdentry;
>> +
>> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
>> + return nsdentry;
>> +}
>> +
>> static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
>> {
>> LIST_HEAD(tmp_links);
>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>> int ret;
>> int i;
>> bool new_sb;
>> + struct cgroup_namespace *ns =
>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> + /* Check if the caller has permission to mount. */
>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> + put_cgroup_ns(ns);
>> + return ERR_PTR(-EPERM);
>> + }
>>
>> /*
>> * The first time anyone tries to mount a cgroup, enable the list
>> @@ -1817,11 +1841,28 @@ out_free:
>> kfree(opts.release_agent);
>> kfree(opts.name);
>>
>> - if (ret)
>> + if (ret) {
>> + put_cgroup_ns(ns);
>> return ERR_PTR(ret);
>> + }
>>
>> dentry = kernfs_mount(fs_type, flags, root->kf_root,
>> CGROUP_SUPER_MAGIC, &new_sb);
>> +
>> + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
>> + /* If this mount is for the default hierarchy in non-init cgroup
>> + * namespace, then instead of root cgroup's dentry, we return
>> + * the dentry corresponding to the cgroupns->root_cgrp.
>> + */
>> + if (ns != &init_cgroup_ns) {
>> + struct dentry *nsdentry;
>> +
>> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
>> + dput(dentry);
>> + dentry = nsdentry;
>> + }
>> + }
>> +
>> if (IS_ERR(dentry) || !new_sb)
>> cgroup_put(&root->cgrp);
>>
>> @@ -1834,6 +1875,7 @@ out_free:
>> deactivate_super(pinned_sb);
>> }
>>
>> + put_cgroup_ns(ns);
>> return dentry;
>> }
>>
>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> + .fs_flags = FS_USERNS_MOUNT,
>> };
>>
>> static struct kobject *cgroup_kobj;



--
Aditya

2014-11-03 22:57:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 2:43 PM, Aditya Kali <[email protected]> wrote:
>
>
> On Fri, Oct 31, 2014 at 6:09 PM, Eric W. Biederman <[email protected]>
> wrote:
>>
>> Aditya Kali <[email protected]> writes:
>>
>> > This patch enables cgroup mounting inside userns when a process
>> > as appropriate privileges. The cgroup filesystem mounted is
>> > rooted at the cgroupns-root. Thus, in a container-setup, only
>> > the hierarchy under the cgroupns-root is exposed inside the container.
>> > This allows container management tools to run inside the containers
>> > without depending on any global state.
>> > In order to support this, a new kernfs api is added to lookup the
>> > dentry for the cgroupns-root.
>>
>> There is a misdesign in this. Because files already exist we need the
>> protections that are present in proc and sysfs that only allow you to
>> mount the filesystem if it is already mounted. Otherwise you can wind
>> up mounting this cgroupfs in a chroot jail when the global root would
>> not like you to see it. cgroupfs isn't as bad as proc and sys but there
>> is at the very least an information leak in mounting it.
>>
>
> I think simply mounting the cgroupfs doesn't give you any more information
> than what you don't already know about the system ; specially if the
> visibility is restricted within the process's cgroupns-root. The cgroups
> still wont be writable by the user, so I think it should be fine to allow
> mounting?
>

Can we try to figure out a better way to do this than checking at
mount time for a fully-visible procfs/sysfs/cgroupfs? The current
approach is unpleasant to deal with.

For example, we could check the equivalent conditions when the userns
is created and store then in a per-userns flags field.

>
>>
>> Given that we are effectively performing a bind mount in this patch, and
>> that we need to require cgroupfs be mounted anyway (to be safe).
>>
>> I don't see the point of this change.
>>
>> If we could change the set of cgroups or visible in cgroupfs I could
>> probably see the point. But as it is this change seems to be pointless.
>>
>
> I agree that this is effectively bind-mounting, but doing this in kernel
> makes it really convenient for the userspace. The process that sets up the
> container doesn't need to care whether it should bind-mount cgroupfs inside
> the container or not. The tasks inside the container can mount cgroupfs on
> as-needed basis. The root container manager can simply unshare cgroupns and
> forget about the internal setup. I think this is useful just for the reason
> that it makes life much simpler for userspace.
>

If we add the fully-visible check at mount time, then I almost agree
with Eric. I say almost because fs_fully_visible isn't checking
whether the superblock root is the thing that's mounted, and, if we
fix that, then bind-mounting like this becomes impossible and we'd
have to refine the check.

But if we come up with something less gross than checking for fs
visibility at mount time, then I agree with Aditya: let's let mount do
the right thing, since there may be nothing there to bind mount. If
we go that route, then I think we might want to make it explicit:
require a mount flag like root=. to indicate that we want to be rooted
at our cgroupns's root cgroup.

--Andy

2014-11-03 23:12:52

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>> This patch enables cgroup mounting inside userns when a process
>> as appropriate privileges. The cgroup filesystem mounted is
>> rooted at the cgroupns-root. Thus, in a container-setup, only
>> the hierarchy under the cgroupns-root is exposed inside the container.
>> This allows container management tools to run inside the containers
>> without depending on any global state.
>> In order to support this, a new kernfs api is added to lookup the
>> dentry for the cgroupns-root.
>>
>> Signed-off-by: Aditya Kali <[email protected]>
>> ---
>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/kernfs.h | 2 ++
>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>> index f973ae9..e334f45 100644
>> --- a/fs/kernfs/mount.c
>> +++ b/fs/kernfs/mount.c
>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>> return NULL;
>> }
>>
>> +/**
>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>> + * @sb: the kernfs super_block
>> + * @kn: kernfs_node for which a dentry is needed
>> + *
>> + * This can used used by callers which want to mount only a part of the kernfs
>> + * as root of the filesystem.
>> + */
>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>> + struct kernfs_node *kn)
>> +{
>
> I can't usefully review this, but kernfs_make_root and
> kernfs_obtain_root aren't the same string...
>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 7e5d597..250aaec 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> memset(opts, 0, sizeof(*opts));
>>
>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>> + * namespace.
>> + */
>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>> + }
>> +
>
> I don't like this implicit stuff. Can you just return -EINVAL if sane
> behavior isn't requested?
>

I think the sane-behavior flag is only temporary and will be removed
anyways, right? So I didn't bother asking user to supply it. But I can
make the change as you suggested. We just have to make sure that tasks
inside cgroupns cannot mount non-default hierarchies as it would be a
regression.

>> while ((token = strsep(&o, ",")) != NULL) {
>> nr_opts++;
>>
>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>
>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>> - if (nr_opts != 1) {
>> + if (nr_opts > 1) {
>> pr_err("sane_behavior: no other mount options allowed\n");
>> return -EINVAL;
>
> This looks wrong. But, if you make the change above, then it'll be right.
>

It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
cgroupns does the right thing automatically.


>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>> int ret;
>> int i;
>> bool new_sb;
>> + struct cgroup_namespace *ns =
>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>> +
>> + /* Check if the caller has permission to mount. */
>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>> + put_cgroup_ns(ns);
>> + return ERR_PTR(-EPERM);
>> + }
>
> Why is this necessary?
>

Without this, if I unshare userns and mntns (but no cgroupns), I will
be able to mount my parent's cgroupfs hierarchy. This is deviation
from whats allowed today (i.e., today I can't mount cgroupfs even
after unsharing userns & mntns). This check is there to prevent the
unintended effect of cgroupns feature.

>> @@ -1862,6 +1904,7 @@ static struct file_system_type cgroup_fs_type = {
>> .name = "cgroup",
>> .mount = cgroup_mount,
>> .kill_sb = cgroup_kill_sb,
>> + .fs_flags = FS_USERNS_MOUNT,
>
> Aargh, another one! Eric, can you either ack or nack my patch?
> Because if my patch goes in, then this line may need to change. Or
> not, but if a stable release with cgroupfs and without my patch
> happens, then we'll have an ABI break.
>
> --Andy



--
Aditya

2014-11-03 23:16:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>> This patch enables cgroup mounting inside userns when a process
>>> as appropriate privileges. The cgroup filesystem mounted is
>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>> This allows container management tools to run inside the containers
>>> without depending on any global state.
>>> In order to support this, a new kernfs api is added to lookup the
>>> dentry for the cgroupns-root.
>>>
>>> Signed-off-by: Aditya Kali <[email protected]>
>>> ---
>>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/kernfs.h | 2 ++
>>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>> index f973ae9..e334f45 100644
>>> --- a/fs/kernfs/mount.c
>>> +++ b/fs/kernfs/mount.c
>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>> return NULL;
>>> }
>>>
>>> +/**
>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>> + * @sb: the kernfs super_block
>>> + * @kn: kernfs_node for which a dentry is needed
>>> + *
>>> + * This can used used by callers which want to mount only a part of the kernfs
>>> + * as root of the filesystem.
>>> + */
>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>> + struct kernfs_node *kn)
>>> +{
>>
>> I can't usefully review this, but kernfs_make_root and
>> kernfs_obtain_root aren't the same string...
>>
>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>> index 7e5d597..250aaec 100644
>>> --- a/kernel/cgroup.c
>>> +++ b/kernel/cgroup.c
>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>> memset(opts, 0, sizeof(*opts));
>>>
>>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>> + * namespace.
>>> + */
>>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>> + }
>>> +
>>
>> I don't like this implicit stuff. Can you just return -EINVAL if sane
>> behavior isn't requested?
>>
>
> I think the sane-behavior flag is only temporary and will be removed
> anyways, right? So I didn't bother asking user to supply it. But I can
> make the change as you suggested. We just have to make sure that tasks
> inside cgroupns cannot mount non-default hierarchies as it would be a
> regression.
>
>>> while ((token = strsep(&o, ",")) != NULL) {
>>> nr_opts++;
>>>
>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>
>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>> - if (nr_opts != 1) {
>>> + if (nr_opts > 1) {
>>> pr_err("sane_behavior: no other mount options allowed\n");
>>> return -EINVAL;
>>
>> This looks wrong. But, if you make the change above, then it'll be right.
>>
>
> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
> cgroupns does the right thing automatically.
>

This is a debatable point, but it's not what I meant. Won't your code
let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?

>
>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>> int ret;
>>> int i;
>>> bool new_sb;
>>> + struct cgroup_namespace *ns =
>>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>>> +
>>> + /* Check if the caller has permission to mount. */
>>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>> + put_cgroup_ns(ns);
>>> + return ERR_PTR(-EPERM);
>>> + }
>>
>> Why is this necessary?
>>
>
> Without this, if I unshare userns and mntns (but no cgroupns), I will
> be able to mount my parent's cgroupfs hierarchy. This is deviation
> from whats allowed today (i.e., today I can't mount cgroupfs even
> after unsharing userns & mntns). This check is there to prevent the
> unintended effect of cgroupns feature.

Oh, I get it. I misunderstood the code.

I guess this is reasonable. If it annoys anyone, it can be reverted
or weakened.

--Andy

2014-11-03 23:24:12

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>>> This patch enables cgroup mounting inside userns when a process
>>>> as appropriate privileges. The cgroup filesystem mounted is
>>>> rooted at the cgroupns-root. Thus, in a container-setup, only
>>>> the hierarchy under the cgroupns-root is exposed inside the container.
>>>> This allows container management tools to run inside the containers
>>>> without depending on any global state.
>>>> In order to support this, a new kernfs api is added to lookup the
>>>> dentry for the cgroupns-root.
>>>>
>>>> Signed-off-by: Aditya Kali <[email protected]>
>>>> ---
>>>> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>> include/linux/kernfs.h | 2 ++
>>>> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
>>>> 3 files changed, 95 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
>>>> index f973ae9..e334f45 100644
>>>> --- a/fs/kernfs/mount.c
>>>> +++ b/fs/kernfs/mount.c
>>>> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
>>>> return NULL;
>>>> }
>>>>
>>>> +/**
>>>> + * kernfs_make_root - create new root dentry for the given kernfs_node.
>>>> + * @sb: the kernfs super_block
>>>> + * @kn: kernfs_node for which a dentry is needed
>>>> + *
>>>> + * This can used used by callers which want to mount only a part of the kernfs
>>>> + * as root of the filesystem.
>>>> + */
>>>> +struct dentry *kernfs_obtain_root(struct super_block *sb,
>>>> + struct kernfs_node *kn)
>>>> +{
>>>
>>> I can't usefully review this, but kernfs_make_root and
>>> kernfs_obtain_root aren't the same string...
>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 7e5d597..250aaec 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> memset(opts, 0, sizeof(*opts));
>>>>
>>>> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
>>>> + * namespace.
>>>> + */
>>>> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
>>>> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
>>>> + }
>>>> +
>>>
>>> I don't like this implicit stuff. Can you just return -EINVAL if sane
>>> behavior isn't requested?
>>>
>>
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>>
>>>> while ((token = strsep(&o, ",")) != NULL) {
>>>> nr_opts++;
>>>>
>>>> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>>>>
>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>> - if (nr_opts != 1) {
>>>> + if (nr_opts > 1) {
>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>> return -EINVAL;
>>>
>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>
>>
>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>> cgroupns does the right thing automatically.
>>
>
> This is a debatable point, but it's not what I meant. Won't your code
> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>

I don't think so. This check "if (nr_opts > 1)" is nested under "if
(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
here.

>>
>>>> @@ -1685,6 +1701,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>>>> int ret;
>>>> int i;
>>>> bool new_sb;
>>>> + struct cgroup_namespace *ns =
>>>> + get_cgroup_ns(current->nsproxy->cgroup_ns);
>>>> +
>>>> + /* Check if the caller has permission to mount. */
>>>> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
>>>> + put_cgroup_ns(ns);
>>>> + return ERR_PTR(-EPERM);
>>>> + }
>>>
>>> Why is this necessary?
>>>
>>
>> Without this, if I unshare userns and mntns (but no cgroupns), I will
>> be able to mount my parent's cgroupfs hierarchy. This is deviation
>> from whats allowed today (i.e., today I can't mount cgroupfs even
>> after unsharing userns & mntns). This check is there to prevent the
>> unintended effect of cgroupns feature.
>
> Oh, I get it. I misunderstood the code.
>
> I guess this is reasonable. If it annoys anyone, it can be reverted
> or weakened.
>
> --Andy



--
Aditya

2014-11-03 23:40:34

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

On Fri, Oct 31, 2014 at 5:02 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <[email protected]> wrote:
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the cgroup of the process at the point
>> of creation of the cgroup namespace (referred as cgroupns-root).
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root
>> (unless they are moved outside of their cgroupns-root, at which point
>> they will see a relative path from their cgroupns-root).
>> For a correctly setup container this enables container-tools
>> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
>> containers without leaking system level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>
>> + /* Prevent cgroup changes for this task. */
>> + threadgroup_lock(current);
>
> This could just be me being dense, but what is the lock for?
>

threadgroup_lock() is there to prevent the task from changing cgroups
while we are unsharing cgroupns.
But it seems that this might be unnecessary now because we have
removed the pinning restriction. Without pinning, we don't care if the
task cgroup changes underneath us. I will remove it from here as well
as from cgroupns_install().

>> +
>> + /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
>> + */
>> + cgrp = get_task_cgroup(current);
>> +
>> + err = -ENOMEM;
>> + new_ns = alloc_cgroup_ns();
>> + if (!new_ns)
>> + goto err_out_unlock;
>> +
>> + err = proc_alloc_inum(&new_ns->proc_inum);
>> + if (err)
>> + goto err_out_unlock;
>> +
>> + new_ns->user_ns = get_user_ns(user_ns);
>> + new_ns->root_cgrp = cgrp;
>> +
>> + threadgroup_unlock(current);
>> +
>> + return new_ns;
>> +
>> +err_out_unlock:
>> + threadgroup_unlock(current);
>> +err_out:
>> + if (cgrp)
>> + cgroup_put(cgrp);
>> + kfree(new_ns);
>> + return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> + pr_info("setns not supported for cgroup namespace");
>> + return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> + struct cgroup_namespace *ns = NULL;
>> + struct nsproxy *nsproxy;
>> +
>> + rcu_read_lock();
>> + nsproxy = task->nsproxy;
>> + if (nsproxy) {
>> + ns = nsproxy->cgroup_ns;
>> + get_cgroup_ns(ns);
>> + }
>> + rcu_read_unlock();
>
> How is this correct? Other namespaces do it too, so it Must Be
> Correct (tm), but I don't understand. What is RCU protecting?
>
> --Andy



--
Aditya

2014-11-03 23:42:53

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces

On Fri, Oct 31, 2014 at 5:58 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Fri, Oct 31, 2014 at 12:18 PM, Aditya Kali <[email protected]> wrote:
>
> <snip>
>
>>> +static void *cgroupns_get(struct task_struct *task)
>>> +{
>>> + struct cgroup_namespace *ns = NULL;
>>> + struct nsproxy *nsproxy;
>>> +
>>> + rcu_read_lock();
>>> + nsproxy = task->nsproxy;
>>> + if (nsproxy) {
>>> + ns = nsproxy->cgroup_ns;
>>> + get_cgroup_ns(ns);
>>> + }
>>> + rcu_read_unlock();
>>
>> How is this correct? Other namespaces do it too, so it Must Be
>> Correct (tm), but I don't understand. What is RCU protecting?
>
> The code is not correct. The code needs to use task_lock.
>
> RCU used to protect nsproxy, and now task_lock protects nsproxy.
> For the reasons of of all of this I refer you to the commit
> that changed this, and the comment in nsproxy.h
>

My bad. This should be under task_lock. I will fix it.

> commit 728dba3a39c66b3d8ac889ddbe38b5b1c264aec3
> Author: Eric W. Biederman <[email protected]>
> Date: Mon Feb 3 19:13:49 2014 -0800
>
> namespaces: Use task_lock and not rcu to protect nsproxy
>
> The synchronous syncrhonize_rcu in switch_task_namespaces makes setns
> a sufficiently expensive system call that people have complained.
>
> Upon inspect nsproxy no longer needs rcu protection for remote reads.
> remote reads are rare. So optimize for same process reads and write
> by switching using rask_lock instead.
>
> This yields a simpler to understand lock, and a faster setns system call.
>
> In particular this fixes a performance regression observed
> by Rafael David Tinoco <[email protected]>.
>
> This is effectively a revert of Pavel Emelyanov's commit
> cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter
> from 2007. The race this originialy fixed no longer exists as
> do_notify_parent uses task_active_pid_ns(parent) instead of
> parent->nsproxy.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
>
> Eric



--
Aditya

2014-11-03 23:48:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <[email protected]> wrote:
> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>> - if (nr_opts != 1) {
>>>>> + if (nr_opts > 1) {
>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>> return -EINVAL;
>>>>
>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>
>>>
>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>> cgroupns does the right thing automatically.
>>>
>>
>> This is a debatable point, but it's not what I meant. Won't your code
>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>
>
> I don't think so. This check "if (nr_opts > 1)" is nested under "if
> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
> here.

But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?

--Andy

2014-11-04 00:13:21

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <[email protected]> wrote:
>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>> - if (nr_opts != 1) {
>>>>>> + if (nr_opts > 1) {
>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>> return -EINVAL;
>>>>>
>>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>>
>>>>
>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>> cgroupns does the right thing automatically.
>>>>
>>>
>>> This is a debatable point, but it's not what I meant. Won't your code
>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>
>>
>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>> here.
>
> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>

Yes. Hence this change makes sure that we don't return EINVAL when
nr_opts == 0 or nr_opts == 1 :)
That way, both of the following are equivalent when inside non-init cgroupns:

(1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
(2) $ mount -t cgroup cgroup mountpoint

Any other mount option will trigger the error here.


> --Andy

--
Aditya

2014-11-04 00:17:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <[email protected]> wrote:
> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <[email protected]> wrote:
>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>> - if (nr_opts != 1) {
>>>>>>> + if (nr_opts > 1) {
>>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>> return -EINVAL;
>>>>>>
>>>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>>>
>>>>>
>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>> cgroupns does the right thing automatically.
>>>>>
>>>>
>>>> This is a debatable point, but it's not what I meant. Won't your code
>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>
>>>
>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>> here.
>>
>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>
>
> Yes. Hence this change makes sure that we don't return EINVAL when
> nr_opts == 0 or nr_opts == 1 :)
> That way, both of the following are equivalent when inside non-init cgroupns:
>
> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
> (2) $ mount -t cgroup cgroup mountpoint
>
> Any other mount option will trigger the error here.

I still don't get it. Can you walk me through why mount -o
some_other_option -t cgroup cgroup mountpoint causes -EINVAL?

--Andy

2014-11-04 00:50:06

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Mon, Nov 3, 2014 at 4:17 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Nov 3, 2014 at 4:12 PM, Aditya Kali <[email protected]> wrote:
>> On Mon, Nov 3, 2014 at 3:48 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Mon, Nov 3, 2014 at 3:23 PM, Aditya Kali <[email protected]> wrote:
>>>> On Mon, Nov 3, 2014 at 3:15 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Mon, Nov 3, 2014 at 3:12 PM, Aditya Kali <[email protected]> wrote:
>>>>>> On Fri, Oct 31, 2014 at 5:07 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>>> On Fri, Oct 31, 2014 at 12:19 PM, Aditya Kali <[email protected]> wrote:
>>>>>>>> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
>>>>>>>> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
>>>>>>>> - if (nr_opts != 1) {
>>>>>>>> + if (nr_opts > 1) {
>>>>>>>> pr_err("sane_behavior: no other mount options allowed\n");
>>>>>>>> return -EINVAL;
>>>>>>>
>>>>>>> This looks wrong. But, if you make the change above, then it'll be right.
>>>>>>>
>>>>>>
>>>>>> It would have been nice if simple 'mount -t cgroup cgroup <mnt>' from
>>>>>> cgroupns does the right thing automatically.
>>>>>>
>>>>>
>>>>> This is a debatable point, but it's not what I meant. Won't your code
>>>>> let 'mount -t cgroup -o one_evil_flag cgroup mountpoint' through?
>>>>>
>>>>
>>>> I don't think so. This check "if (nr_opts > 1)" is nested under "if
>>>> (opts->flags & CGRP_ROOT_SANE_BEHAVIOR)". So we know that there is
>>>> atleast 1 option ('__DEVEL__sane_behavior') present (implicit or not).
>>>> Addition of 'one_evil_flag' will make nr_opts = 2 and result in EINVAL
>>>> here.
>>>
>>> But the implicit __DEVEL__sane_behavior doesn't increment nr_opts, right?
>>>
>>
>> Yes. Hence this change makes sure that we don't return EINVAL when
>> nr_opts == 0 or nr_opts == 1 :)
>> That way, both of the following are equivalent when inside non-init cgroupns:
>>
>> (1) $ mount -t cgroup -o __DEVEL__sane_behavior cgroup mountpoint
>> (2) $ mount -t cgroup cgroup mountpoint
>>
>> Any other mount option will trigger the error here.
>
> I still don't get it. Can you walk me through why mount -o
> some_other_option -t cgroup cgroup mountpoint causes -EINVAL?
>

Argh! You are right. I was totally convinced that this works. But it
clearly doesn't if you specify 1 legit mount option. I wanted to make
it work for both cases (1) and (2) above. But then this check will
have to be changed :(
Sorry about the back and forth. I am just going to make it return
EINVAL if __DEVEL_sane_behavior is not specified as suggested in the
beginning.

> --Andy

--
Aditya

2014-11-04 01:57:06

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 5/7] cgroup: introduce cgroup namespaces


Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/proc/namespaces.c | 1 +
include/linux/cgroup.h | 18 +++++-
include/linux/cgroup_namespace.h | 36 +++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
kernel/Makefile | 2 +-
kernel/cgroup.c | 14 +++++
kernel/cgroup_namespace.c | 127
+++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
10 files changed, 220 insertions(+), 5 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+ &cgroupns_operations,
};

static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

#ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
#endif
};

+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp,
char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}

+static inline char * __must_check cgroup_path_ns(struct
cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp,
char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
}

static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h
b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+ return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;

/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;

struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};

#ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 9c622b9..7e5d597 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated
array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

#include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int
end,
gfp_t gfp_mask)
@@ -4550,6 +4561,7 @@ static int cgroup_mkdir(struct kernfs_node
*parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
root = parent->root;

/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4922,6 +4934,8 @@ int __init cgroup_init(void)
unsigned long key;
int ssid, err;

+ get_user_ns(init_cgroup_ns.user_ns);
+
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2014 Google Inc.
+ *
+ * Author: Aditya Kali ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by
the Free
+ * Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+ kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+ */
+ cgrp = get_task_cgroup(current);
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ return new_ns;
+
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long
unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};

static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned
long flags,
goto out_pid;
}

+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy
*create_new_namespaces(unsigned long flags,
return new_nsp;

out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct
task_struct *tsk)
struct nsproxy *new_ns;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long
unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.1.0.rc2.206.gedb03e5

2014-11-04 02:00:02

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/mount.c | 48
++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct
super_block *sb)
return NULL;
}

+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the
kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7e5d597..8008c4c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1389,6 +1389,14 @@ static int parse_cgroupfs_options(char *data,
struct cgroup_sb_opts *opts)
return -ENOENT;
}

+ /* If inside a non-init cgroup namespace, only allow default hierarchy
+ * to be mounted.
+ */
+ if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+ !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+ return -EINVAL;
+ }
+
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its
behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
@@ -1581,6 +1589,15 @@ static void init_cgroup_root(struct cgroup_root
*root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
static int cgroup_setup_root(struct cgroup_root *root, unsigned int
ss_mask)
{
LIST_HEAD(tmp_links);
@@ -1685,6 +1702,14 @@ static struct dentry *cgroup_mount(struct
file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1817,11 +1842,28 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);

- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }

dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+ /* If this mount is for the default hierarchy in non-init cgroup
+ * namespace, then instead of root cgroup's dentry, we return
+ * the dentry corresponding to the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);

@@ -1834,6 +1876,7 @@ out_free:
deactivate_super(pinned_sb);
}

+ put_cgroup_ns(ns);
return dentry;
}

@@ -1862,6 +1905,7 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct kobject *cgroup_kobj;
--
2.1.0.rc2.206.gedb03e5

2014-11-04 13:11:02

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCHv2 0/7] CGroup Namespaces

On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
[..]
> fs/kernfs/dir.c | 194 ++++++++++++++++++++++++++++++++++-----
> fs/kernfs/mount.c | 48 ++++++++++
> fs/proc/namespaces.c | 1 +
> include/linux/cgroup.h | 41 ++++++++-
> include/linux/cgroup_namespace.h | 36 ++++++++
> include/linux/kernfs.h | 5 +
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 +
> include/uapi/linux/sched.h | 3 +-
> kernel/Makefile | 2 +-
> kernel/cgroup.c | 108 +++++++++++++++++-----
> kernel/cgroup_namespace.c | 148 +++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 +++-

Hi Aditya,

Can we provide a documentation file for cgroup namespace behavior. Say,
Documentation/namespaces/cgroup-namespace.txt.

Namespaces are complicated and it might be a good idea to keep one .txt
file for each namespace.

Thanks
Vivek

2014-11-04 13:46:47

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Hello, Aditya.

On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> I agree that this is effectively bind-mounting, but doing this in kernel
> makes it really convenient for the userspace. The process that sets up the
> container doesn't need to care whether it should bind-mount cgroupfs inside
> the container or not. The tasks inside the container can mount cgroupfs on
> as-needed basis. The root container manager can simply unshare cgroupns and
> forget about the internal setup. I think this is useful just for the reason
> that it makes life much simpler for userspace.

If it's okay to require userland to just do bind mounting, I'd be far
happier with that. cgroup mount code is already overcomplicated
because of the dynamic matching of supers to mounts when it could just
have told userland to use bind mounting. Doesn't the host side have
to set up some of the filesystem layouts anyway? Does it really
matter that we require the host to set up cgroup hierarchy too?

Thanks.

--
tejun

2014-11-04 13:57:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Hello, Aditya.

On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
> I think the sane-behavior flag is only temporary and will be removed
> anyways, right? So I didn't bother asking user to supply it. But I can
> make the change as you suggested. We just have to make sure that tasks
> inside cgroupns cannot mount non-default hierarchies as it would be a
> regression.

I'm not sure whether supporting mounting from inside a ns is even
necessary but, if it is, can't you just test against cgrp_dfl_root?
There's no reason to do anything differnetly for ns mounting.

Thanks.

--
tejun

2014-11-04 15:00:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <[email protected]> wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
>> I agree that this is effectively bind-mounting, but doing this in kernel
>> makes it really convenient for the userspace. The process that sets up the
>> container doesn't need to care whether it should bind-mount cgroupfs inside
>> the container or not. The tasks inside the container can mount cgroupfs on
>> as-needed basis. The root container manager can simply unshare cgroupns and
>> forget about the internal setup. I think this is useful just for the reason
>> that it makes life much simpler for userspace.
>
> If it's okay to require userland to just do bind mounting, I'd be far
> happier with that. cgroup mount code is already overcomplicated
> because of the dynamic matching of supers to mounts when it could just
> have told userland to use bind mounting. Doesn't the host side have
> to set up some of the filesystem layouts anyway? Does it really
> matter that we require the host to set up cgroup hierarchy too?
>

Sort of, but only sort of.

You can create a container by unsharing namespaces, mounting
everything, and then calling pivot_root. But this is unpleasant
because of the strange way that pid namespaces work -- you generally
have to fork first, so this gets tedious. And it doesn't integrate
well with things like fstab or other container-side configuration
mechanisms.

It's nicer if you can unshare namespaces, mount the bare minimum,
pivot_root, and let the contained software do as much setup as
possible.

--Andy

2014-11-04 15:51:01

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Quoting Andy Lutomirski ([email protected]):
> On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <[email protected]> wrote:
> > Hello, Aditya.
> >
> > On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> >> I agree that this is effectively bind-mounting, but doing this in kernel
> >> makes it really convenient for the userspace. The process that sets up the
> >> container doesn't need to care whether it should bind-mount cgroupfs inside
> >> the container or not. The tasks inside the container can mount cgroupfs on
> >> as-needed basis. The root container manager can simply unshare cgroupns and
> >> forget about the internal setup. I think this is useful just for the reason
> >> that it makes life much simpler for userspace.
> >
> > If it's okay to require userland to just do bind mounting, I'd be far
> > happier with that. cgroup mount code is already overcomplicated
> > because of the dynamic matching of supers to mounts when it could just
> > have told userland to use bind mounting. Doesn't the host side have
> > to set up some of the filesystem layouts anyway? Does it really
> > matter that we require the host to set up cgroup hierarchy too?
> >
>
> Sort of, but only sort of.
>
> You can create a container by unsharing namespaces, mounting
> everything, and then calling pivot_root. But this is unpleasant
> because of the strange way that pid namespaces work -- you generally
> have to fork first, so this gets tedious. And it doesn't integrate
> well with things like fstab or other container-side configuration
> mechanisms.
>
> It's nicer if you can unshare namespaces, mount the bare minimum,
> pivot_root, and let the contained software do as much setup as
> possible.

Also, the bind-mount requires the container manager to know where
the guest distro will want the cgroups mounted.

-serge

2014-11-06 17:28:48

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Tue, Nov 4, 2014 at 5:57 AM, Tejun Heo <[email protected]> wrote:
> Hello, Aditya.
>
> On Mon, Nov 03, 2014 at 03:12:28PM -0800, Aditya Kali wrote:
>> I think the sane-behavior flag is only temporary and will be removed
>> anyways, right? So I didn't bother asking user to supply it. But I can
>> make the change as you suggested. We just have to make sure that tasks
>> inside cgroupns cannot mount non-default hierarchies as it would be a
>> regression.
>
> I'm not sure whether supporting mounting from inside a ns is even
> necessary but, if it is, can't you just test against cgrp_dfl_root?
> There's no reason to do anything differnetly for ns mounting.
>

I am not sure I fully understand what you mean. But we don't have a
way to test against cgrp_dfl_root while parsing mount-options. They
only way we know that user is trying to mount a default hierarchy is
via the sane_behavior flag. So I need to test against this flag it if
we want to restrict processes inside cgroupns to mounting the default
hierarchy only.
Or are you suggesting that its OK for nsown_capable(CAP_SYS_ADMIN)
processes to mount any cgroup hierarchy (irrespective of their
cgroupns)? I assumed that this will be a undesirable.

> Thanks.
>
> --
> tejun


Thanks,
--
Aditya

2014-11-06 17:34:09

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 0/7] CGroup Namespaces

On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <[email protected]> wrote:
> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> [..]
>> fs/kernfs/dir.c | 194 ++++++++++++++++++++++++++++++++++-----
>> fs/kernfs/mount.c | 48 ++++++++++
>> fs/proc/namespaces.c | 1 +
>> include/linux/cgroup.h | 41 ++++++++-
>> include/linux/cgroup_namespace.h | 36 ++++++++
>> include/linux/kernfs.h | 5 +
>> include/linux/nsproxy.h | 2 +
>> include/linux/proc_ns.h | 4 +
>> include/uapi/linux/sched.h | 3 +-
>> kernel/Makefile | 2 +-
>> kernel/cgroup.c | 108 +++++++++++++++++-----
>> kernel/cgroup_namespace.c | 148 +++++++++++++++++++++++++++++
>> kernel/fork.c | 2 +-
>> kernel/nsproxy.c | 19 +++-
>
> Hi Aditya,
>
> Can we provide a documentation file for cgroup namespace behavior. Say,
> Documentation/namespaces/cgroup-namespace.txt.
>
Yes, definitely. I will add it as soon as we have a consensus on the
overall series.

> Namespaces are complicated and it might be a good idea to keep one .txt
> file for each namespace.
>
> Thanks
> Vivek


Thanks,
--
Aditya

2014-11-12 17:48:42

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 7/7] cgroup: mount cgroupns-root when inside non-init cgroupns

I agree with what Andy and Serge has to say. The ability to mount
cgroupfs inside userns also seems consistent with other kernel
interfaces like sysfs, procfs, etc.

Though it would be great if we can atleast merge the rest of the
patches first while we address the mounting part.

Thanks for your feedback.

On Tue, Nov 4, 2014 at 7:50 AM, Serge E. Hallyn <[email protected]> wrote:
>
> Quoting Andy Lutomirski ([email protected]):
> > On Tue, Nov 4, 2014 at 5:46 AM, Tejun Heo <[email protected]> wrote:
> > > Hello, Aditya.
> > >
> > > On Mon, Nov 03, 2014 at 02:43:47PM -0800, Aditya Kali wrote:
> > >> I agree that this is effectively bind-mounting, but doing this in kernel
> > >> makes it really convenient for the userspace. The process that sets up the
> > >> container doesn't need to care whether it should bind-mount cgroupfs inside
> > >> the container or not. The tasks inside the container can mount cgroupfs on
> > >> as-needed basis. The root container manager can simply unshare cgroupns and
> > >> forget about the internal setup. I think this is useful just for the reason
> > >> that it makes life much simpler for userspace.
> > >
> > > If it's okay to require userland to just do bind mounting, I'd be far
> > > happier with that. cgroup mount code is already overcomplicated
> > > because of the dynamic matching of supers to mounts when it could just
> > > have told userland to use bind mounting. Doesn't the host side have
> > > to set up some of the filesystem layouts anyway? Does it really
> > > matter that we require the host to set up cgroup hierarchy too?
> > >
> >
> > Sort of, but only sort of.
> >
> > You can create a container by unsharing namespaces, mounting
> > everything, and then calling pivot_root. But this is unpleasant
> > because of the strange way that pid namespaces work -- you generally
> > have to fork first, so this gets tedious. And it doesn't integrate
> > well with things like fstab or other container-side configuration
> > mechanisms.
> >
> > It's nicer if you can unshare namespaces, mount the bare minimum,
> > pivot_root, and let the contained software do as much setup as
> > possible.
>
> Also, the bind-mount requires the container manager to know where
> the guest distro will want the cgroups mounted.
>
> -serge
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers




--
Aditya

2014-11-26 22:59:00

by Richard Weinberger

[permalink] [raw]
Subject: Re: [PATCHv2 0/7] CGroup Namespaces

On Thu, Nov 6, 2014 at 6:33 PM, Aditya Kali <[email protected]> wrote:
> On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <[email protected]> wrote:
>> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
>> [..]
>>> fs/kernfs/dir.c | 194 ++++++++++++++++++++++++++++++++++-----
>>> fs/kernfs/mount.c | 48 ++++++++++
>>> fs/proc/namespaces.c | 1 +
>>> include/linux/cgroup.h | 41 ++++++++-
>>> include/linux/cgroup_namespace.h | 36 ++++++++
>>> include/linux/kernfs.h | 5 +
>>> include/linux/nsproxy.h | 2 +
>>> include/linux/proc_ns.h | 4 +
>>> include/uapi/linux/sched.h | 3 +-
>>> kernel/Makefile | 2 +-
>>> kernel/cgroup.c | 108 +++++++++++++++++-----
>>> kernel/cgroup_namespace.c | 148 +++++++++++++++++++++++++++++
>>> kernel/fork.c | 2 +-
>>> kernel/nsproxy.c | 19 +++-
>>
>> Hi Aditya,
>>
>> Can we provide a documentation file for cgroup namespace behavior. Say,
>> Documentation/namespaces/cgroup-namespace.txt.
>>
> Yes, definitely. I will add it as soon as we have a consensus on the
> overall series.

Do you have a public git repository which contains your patches?

--
Thanks,
//richard

2014-12-02 19:14:27

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv2 0/7] CGroup Namespaces

On Wed, Nov 26, 2014 at 2:58 PM, Richard Weinberger
<[email protected]> wrote:
>
> On Thu, Nov 6, 2014 at 6:33 PM, Aditya Kali <[email protected]> wrote:
> > On Tue, Nov 4, 2014 at 5:10 AM, Vivek Goyal <[email protected]> wrote:
> >> On Fri, Oct 31, 2014 at 12:18:54PM -0700, Aditya Kali wrote:
> >> [..]
> >>> fs/kernfs/dir.c | 194 ++++++++++++++++++++++++++++++++++-----
> >>> fs/kernfs/mount.c | 48 ++++++++++
> >>> fs/proc/namespaces.c | 1 +
> >>> include/linux/cgroup.h | 41 ++++++++-
> >>> include/linux/cgroup_namespace.h | 36 ++++++++
> >>> include/linux/kernfs.h | 5 +
> >>> include/linux/nsproxy.h | 2 +
> >>> include/linux/proc_ns.h | 4 +
> >>> include/uapi/linux/sched.h | 3 +-
> >>> kernel/Makefile | 2 +-
> >>> kernel/cgroup.c | 108 +++++++++++++++++-----
> >>> kernel/cgroup_namespace.c | 148 +++++++++++++++++++++++++++++
> >>> kernel/fork.c | 2 +-
> >>> kernel/nsproxy.c | 19 +++-
> >>
> >> Hi Aditya,
> >>
> >> Can we provide a documentation file for cgroup namespace behavior. Say,
> >> Documentation/namespaces/cgroup-namespace.txt.
> >>
> > Yes, definitely. I will add it as soon as we have a consensus on the
> > overall series.
>
> Do you have a public git repository which contains your patches?
>

Hi, Sorry for late reply. I don't have these in a public git repo yet.
But I will try to post it on github or somewhere.
Also, I found a bug in this patchset that crashes the kernel in some
cases (when both unified and split hierarchies are mounted). I have a
fix and will send out the patches (with documentation) soon.

>
> --
> Thanks,
> //richard

Thanks,
--
Aditya