Another spin for CGroup Namespaces feature.
Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
- removed use of threadgroup_lock() while creating new cgroupns
- use task_lock() instead of rcu_read_lock() while accessing
task->nsproxy
- optimized setns() to own cgroupns
- simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
not changed since then.
Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
cgroupns-root.
Changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
your cgroupns-root.
---
Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
fs/kernfs/dir.c | 195 ++++++++++++++++++++++++++++++++----
fs/kernfs/mount.c | 48 +++++++++
fs/proc/namespaces.c | 1 +
include/linux/cgroup.h | 52 +++++++++-
include/linux/cgroup_namespace.h | 36 +++++++
include/linux/kernfs.h | 5 +
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 +
include/uapi/linux/sched.h | 3 +-
kernel/Makefile | 2 +-
kernel/cgroup.c | 106 +++++++++++++++-----
kernel/cgroup_namespace.c | 140 ++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++-
15 files changed, 711 insertions(+), 51 deletions(-)
create mode 100644 Documentation/cgroups/namespace.txt
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c
[PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv3 3/8] cgroup: add function to get task's cgroup on default
[PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv3 5/8] cgroup: introduce cgroup namespaces
[PATCHv3 6/8] cgroup: cgroup namespace setns support
[PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
[PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.
Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/dir.c | 195 +++++++++++++++++++++++++++++++++++++++++++------
include/linux/kernfs.h | 3 +
2 files changed, 177 insertions(+), 21 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1c77193..cb225a7 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,159 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/**
+ * kernfs_node_depth - compute depth of the kernfs node from root.
+ * The root node itself is considered to be at depth 0.
+ */
+static size_t kernfs_node_depth(struct kernfs_node *kn)
{
- char *p = buf + buflen;
+ size_t depth = 0;
+
+ BUG_ON(!kn);
+ while (kn->parent) {
+ depth++;
+ kn = kn->parent;
+ }
+ return depth;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a relative path from @kn_from to @kn_to
+ * @kn_from: reference node of the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to: /n1/n2/n3/n4/n5
+ * result: /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to: /n1/n2/n5
+ * result: /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5 [depth=5]
+ * kn_to: /n1/n2/n3 [depth=3]
+ * result: /../..
+ */
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to,
+ char *buf,
+ size_t buflen)
+{
+ char *p = buf;
+ struct kernfs_node *kn;
+ size_t depth_from = 0, depth_to, d;
int len;
- *--p = '\0';
+ /* We atleast need 2 bytes to write "/\0". */
+ BUG_ON(buflen < 2);
- do {
- len = strlen(kn->name);
- if (p - buf < len + 1) {
- buf[0] = '\0';
- p = NULL;
- break;
+ /* Short-circuit the easy case - kn_to is the root node. */
+ if ((kn_from == kn_to) || (!kn_from && !kn_to->parent)) {
+ *p = '/';
+ *(p + 1) = '\0';
+ return p;
+ }
+
+ /* We can find the relative path only if both the nodes belong to the
+ * same kernfs root.
+ */
+ if (kn_from) {
+ BUG_ON(kernfs_root(kn_from) != kernfs_root(kn_to));
+ depth_from = kernfs_node_depth(kn_from);
+ }
+
+ depth_to = kernfs_node_depth(kn_to);
+
+ /* We compose path from left to right. So first write out all possible
+ * "/.." strings needed to reach from 'kn_from' to the common ancestor.
+ */
+ if (kn_from) {
+ while (depth_from > depth_to) {
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
}
+
+ d = depth_to;
+ kn = kn_to;
+ while (depth_from < d) {
+ kn = kn->parent;
+ d--;
+ }
+
+ /* Now we have 'depth_from == depth_to' at this point. Add more
+ * "/.."s until we reach common ancestor. In the worst case,
+ * root node will be the common ancestor.
+ */
+ while (depth_from > 0) {
+ /* If we reached common ancestor, stop. */
+ if (kn_from == kn)
+ break;
+ len = strlen("/..");
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+ memcpy(p, "/..", len);
+ p += len;
+ *p = '\0';
+ --depth_from;
+ kn_from = kn_from->parent;
+ kn = kn->parent;
+ }
+ }
+
+ /* Figure out how many bytes we need to write the path.
+ */
+ d = depth_to;
+ kn = kn_to;
+ len = 0;
+ while (depth_from < d) {
+ /* Account for "/<name>". */
+ len += strlen(kn->name) + 1;
+ kn = kn->parent;
+ --d;
+ }
+
+ if ((buflen - (p - buf)) < len + 1) {
+ /* buffer not big enough. */
+ buf[0] = '\0';
+ return NULL;
+ }
+
+ /* We have enough space. Move 'p' ahead by computed length and start
+ * writing node names into buffer.
+ */
+ p += len;
+ *p = '\0';
+ d = depth_to;
+ kn = kn_to;
+ while (d > depth_from) {
+ len = strlen(kn->name);
p -= len;
memcpy(p, kn->name, len);
*--p = '/';
kn = kn->parent;
- } while (kn && kn->parent);
+ --d;
+ }
- return p;
+ return buf;
}
/**
@@ -92,26 +223,48 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
}
/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);
/**
@@ -145,8 +298,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.2.0.rc0.207.ga3a616c
CLONE_NEWCGROUP will be used to create new cgroup namespace.
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aditya Kali <[email protected]>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.2.0.rc0.207.ga3a616c
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.
Signed-off-by: Aditya Kali <[email protected]>
---
fs/proc/namespaces.c | 1 +
include/linux/cgroup.h | 29 ++++++++-
include/linux/cgroup_namespace.h | 36 +++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
kernel/Makefile | 2 +-
kernel/cgroup.c | 13 ++++
kernel/cgroup_namespace.c | 127 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
10 files changed, 230 insertions(+), 5 deletions(-)
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..55bc5da 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+ &cgroupns_operations,
};
static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6e7533b..94a5a0c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
#ifdef CONFIG_CGROUPS
@@ -460,6 +462,13 @@ struct cftype {
#endif
};
+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;
@@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ if (ns) {
+ BUG_ON(!cgroup_on_dfl(cgrp));
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
+ buflen);
+ } else {
+ return kernfs_path(cgrp->kn, buf, buflen);
+ }
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ if (cgroup_on_dfl(cgrp)) {
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
+ buflen);
+ } else {
+ return cgroup_path_ns(NULL, cgrp, buf, buflen);
+ }
}
static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..0b97b8d
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *current_cgroupns_root(void)
+{
+ return current->nsproxy->cgroup_ns->root_cgrp;
+}
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;
/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;
struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};
#ifdef CONFIG_PROC_FS
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..d9731e2 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
-obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e12d36e..b1ae6d9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
#include <linux/atomic.h>
@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);
+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
gfp_t gfp_mask)
@@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
unsigned long key;
int ssid, err;
+ get_user_ns(init_cgroup_ns.user_ns);
+
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..0e0ef3a
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,127 @@
+/*
+ * Copyright (C) 2014 Google Inc.
+ *
+ * Author: Aditya Kali ([email protected])
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation, version 2 of the License.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+ kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
+ */
+ cgrp = get_task_cgroup(current);
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ return new_ns;
+
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..d22d793 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
static struct kmem_cache *nsproxy_cachep;
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};
static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}
+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;
out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.2.0.rc0.207.ga3a616c
Signed-off-by: Aditya Kali <[email protected]>
---
Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
1 file changed, 147 insertions(+)
create mode 100644 Documentation/cgroups/namespace.txt
diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..6480379
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,147 @@
+ CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+ $ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and its desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+ # Before creating cgroup namespace
+ $ ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+ $ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+ # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+ $ ~/unshare -c
+ [ns]$ ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+ # From within new cgroupns, process sees that its in the root cgroup
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+ # From global cgroupns:
+ $ cat /proc/<pid>/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+ # Unshare cgroupns along with userns and mountns
+ # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+ # sets up uid/gid map and execs /bin/bash
+ $ ~/unshare -c -u -m
+ # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+ # hierarchy.
+ [ns]$ mount -t cgroup cgroup /tmp/cgroup
+ [ns]$ ls -l /tmp/cgroup
+ total 0
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+ -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+ -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+Note that CGroup Namespaces virtualizes the path on unified hierarchy only. If
+other hierarchies are mounted, /proc/<pid>/cgroup will continue to show the full
+cgroup path for those.
+
+In its current form, the cgroup namespaces patcheset provides following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+ the process calling unshare is running.
+ For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+ cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+ For the init_cgroup_ns, this is the real root ('/') cgroup
+ (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+ creator process later moves to a different cgroup.
+ $ ~/unshare -c # unshare cgroupns in some cgroup
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+ [ns]$ mkdir sub_cgrp_1
+ [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+ cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+ [ns]$ sleep 100000 & # From within unshared cgroupns
+ [1] 7353
+ [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+ [ns]$ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+ path relative to its own cgroupns-root will be shown:
+ # ns2's cgroupns-root is at '/batchjobs/container_id2'
+ [ns2]$ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
+
+ Note that the relative path always starts with '/' to indicate that its
+ relative to the cgroupns-root of the caller.
+
+(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
+ (if they have proper access to external cgroups).
+ # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
+ # assuming that the global hierarchy is still accessible inside cgroupns:
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+ $ echo 7353 > batchjobs/container_id2/cgroup.procs
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+ Note that this kind of setup is not encouraged. A task inside cgroupns
+ should only be exposed to its own cgroupns hierarchy. Otherwise it makes
+ the virtualization of /proc/<pid>/cgroup less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+ (a) the process has CAP_SYS_ADMIN in its current userns
+ (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
+ No implicit cgroup changes happen with attaching to another cgroupns. It
+ is expected that the somone moves the attaching process under the target
+ cgroupns-root.
+
+(6) When some thread from a multi-threaded process unshares its
+ cgroup-namespace, the new cgroupns gets applied to the entire
+ process (all the threads). This should be OK since
+ unified-hierarchy only allows process-level containerization. So
+ all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is atleast 1
+ process inside it. When the last process exits, the cgroup
+ namespace is destroyed. The cgroupns-root and the actual cgroups
+ remain though.
+
+(8) Namespace specific cgroup hierarchy can be mounted by a process running
+ inside cgroupns:
+ $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+ This will mount the unified cgroup hierarchy with cgroupns-root as the
+ filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
+
--
2.2.0.rc0.207.ga3a616c
This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.
Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 95 insertions(+), 1 deletion(-)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..efe5e15 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
}
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b1ae6d9..e779890 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
return -ENOENT;
}
+ /* If inside a non-init cgroup namespace, only allow default hierarchy
+ * to be mounted.
+ */
+ if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
+ !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
+ return -EINVAL;
+ }
+
if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
if (nr_opts != 1) {
@@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}
+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
{
LIST_HEAD(tmp_links);
@@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1866,11 +1891,28 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }
dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
+ /* If this mount is for the default hierarchy in non-init cgroup
+ * namespace, then instead of root cgroup's dentry, we return
+ * the dentry corresponding to the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
@@ -1883,6 +1925,7 @@ out_free:
deactivate_super(pinned_sb);
}
+ put_cgroup_ns(ns);
return dentry;
}
@@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};
static struct kobject *cgroup_kobj;
--
2.2.0.rc0.207.ga3a616c
move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 22 ++++++++++++++++++++++
kernel/cgroup.c | 22 ----------------------
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d6930de..6e7533b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
}
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5d8fc84..e12d36e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -321,12 +321,6 @@ out_unlock:
return css;
}
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1039,22 +1033,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}
-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
/**
* cgroup_calc_child_subsys_mask - calculate child_subsys_mask
* @cgrp: the target cgroup
--
2.2.0.rc0.207.ga3a616c
setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that the somone moves the attaching
process under the target cgroupns-root.
Signed-off-by: Aditya Kali <[email protected]>
---
kernel/cgroup_namespace.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index 0e0ef3a..ee0cc51 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -79,8 +79,21 @@ err_out:
static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
{
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+ !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Don't need to do anything if we are attaching to our own cgroupns. */
+ if (cgroup_ns == nsproxy->cgroup_ns)
+ return 0;
+
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+ return 0;
}
static void *cgroupns_get(struct task_struct *task)
--
2.2.0.rc0.207.ga3a616c
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.
Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 9fd99f5..d6930de 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
}
char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bb263d0..5d8fc84 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1966,6 +1966,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
}
EXPORT_SYMBOL_GPL(task_cgroup_path);
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
/* used to track tasks and other necessary states during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
--
2.2.0.rc0.207.ga3a616c
These patches are now also hosted on github at
https://github.com/adityakali/linux/tree/cgroupns_v3.
Thanks,
On Thu, Dec 4, 2014 at 5:55 PM, Aditya Kali <[email protected]> wrote:
> Another spin for CGroup Namespaces feature.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
> - removed use of threadgroup_lock() while creating new cgroupns
> - use task_lock() instead of rcu_read_lock() while accessing
> task->nsproxy
> - optimized setns() to own cgroupns
> - simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
> not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
> across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
> apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
> cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
> of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
> cgroupns-root.
>
> Changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> your cgroupns-root.
>
> ---
> Documentation/cgroups/namespace.txt | 147 +++++++++++++++++++++++++++
> fs/kernfs/dir.c | 195 ++++++++++++++++++++++++++++++++----
> fs/kernfs/mount.c | 48 +++++++++
> fs/proc/namespaces.c | 1 +
> include/linux/cgroup.h | 52 +++++++++-
> include/linux/cgroup_namespace.h | 36 +++++++
> include/linux/kernfs.h | 5 +
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 +
> include/uapi/linux/sched.h | 3 +-
> kernel/Makefile | 2 +-
> kernel/cgroup.c | 106 +++++++++++++++-----
> kernel/cgroup_namespace.c | 140 ++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 +++-
> 15 files changed, 711 insertions(+), 51 deletions(-)
> create mode 100644 Documentation/cgroups/namespace.txt
> create mode 100644 include/linux/cgroup_namespace.h
> create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv3 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv3 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv3 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv3 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv3 5/8] cgroup: introduce cgroup namespaces
> [PATCHv3 6/8] cgroup: cgroup namespace setns support
> [PATCHv3 7/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> [PATCHv3 8/8] cgroup: Add documentation for cgroup namespaces
--
Aditya
> +In its current form, the cgroup namespaces patcheset provides following
> +behavior:
> +
> +(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
> + the process calling unshare is running.
> + For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
> + cgroup /batchjobs/container_id1 becomes the cgroupns-root.
> + For the init_cgroup_ns, this is the real root ('/') cgroup
> + (identified in code as cgrp_dfl_root.cgrp).
> +
> +(2) The cgroupns-root cgroup does not change even if the namespace
> + creator process later moves to a different cgroup.
> + $ ~/unshare -c # unshare cgroupns in some cgroup
> + [ns]$ cat /proc/self/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> + [ns]$ mkdir sub_cgrp_1
> + [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> + [ns]$ cat /proc/self/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
> +(a) Processes running inside the cgroup namespace will be able to see
> + cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> + [ns]$ sleep 100000 & # From within unshared cgroupns
> + [1] 7353
> + [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> + [ns]$ cat /proc/7353/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> +
> +(b) From global cgroupns, the real cgroup path will be visible:
> + $ cat /proc/7353/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
> +
> +(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
> + path relative to its own cgroupns-root will be shown:
> + # ns2's cgroupns-root is at '/batchjobs/container_id2'
> + [ns2]$ cat /proc/7353/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
Should be ../container_id1/sub_cgrp_1 ?
> +
> + Note that the relative path always starts with '/' to indicate that its
> + relative to the cgroupns-root of the caller.
If a path doesn't start with '/', then it's a relative path, so why make it start with '/'?
> +
> +(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
> + (if they have proper access to external cgroups).
> + # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
> + # assuming that the global hierarchy is still accessible inside cgroupns:
> + $ cat /proc/7353/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> + $ echo 7353 > batchjobs/container_id2/cgroup.procs
> + $ cat /proc/7353/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
> +
> + Note that this kind of setup is not encouraged. A task inside cgroupns
> + should only be exposed to its own cgroupns hierarchy. Otherwise it makes
> + the virtualization of /proc/<pid>/cgroup less useful.
> +
> +(5) Setns to another cgroup namespace is allowed when:
> + (a) the process has CAP_SYS_ADMIN in its current userns
> + (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
> + No implicit cgroup changes happen with attaching to another cgroupns. It
> + is expected that the somone moves the attaching process under the target
> + cgroupns-root.
> +
s/the somone/someone
> +(6) When some thread from a multi-threaded process unshares its
> + cgroup-namespace, the new cgroupns gets applied to the entire
> + process (all the threads). This should be OK since
> + unified-hierarchy only allows process-level containerization. So
> + all the threads in the process will have the same cgroup.
> +
> +(7) The cgroup namespace is alive as long as there is atleast 1
s/atelast/at least
> + process inside it. When the last process exits, the cgroup
> + namespace is destroyed. The cgroupns-root and the actual cgroups
> + remain though.
> +
> +(8) Namespace specific cgroup hierarchy can be mounted by a process running
> + inside cgroupns:
> + $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
> +
> + This will mount the unified cgroup hierarchy with cgroupns-root as the
> + filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
> +
>
On 2014/12/5 9:55, Aditya Kali wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the cgroup of the process at the point
> of creation of the cgroup namespace (referred as cgroupns-root).
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root
> (unless they are moved outside of their cgroupns-root, at which point
> they will see a relative path from their cgroupns-root).
> For a correctly setup container this enables container-tools
> (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
> containers without leaking system level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> fs/proc/namespaces.c | 1 +
> include/linux/cgroup.h | 29 ++++++++-
> include/linux/cgroup_namespace.h | 36 +++++++++++
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 ++
> kernel/Makefile | 2 +-
> kernel/cgroup.c | 13 ++++
> kernel/cgroup_namespace.c | 127 +++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 +++++-
> 10 files changed, 230 insertions(+), 5 deletions(-)
>
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..55bc5da 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,7 @@ static const struct proc_ns_operations *ns_entries[] = {
> &userns_operations,
> #endif
> &mntns_operations,
> + &cgroupns_operations,
Should be guarded with CONFIG_CGROUPS ?
There are other changes that break compile if !CONFIG_CGROUPS.
> };
>
> static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 6e7533b..94a5a0c 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
> #include <linux/seq_file.h>
> #include <linux/kernfs.h>
> #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>
> #ifdef CONFIG_CGROUPS
>
> @@ -460,6 +462,13 @@ struct cftype {
> #endif
> };
>
> +struct cgroup_namespace {
> + atomic_t count;
> + unsigned int proc_inum;
> + struct user_namespace *user_ns;
> + struct cgroup *root_cgrp;
> +};
> +
> extern struct cgroup_root cgrp_dfl_root;
> extern struct css_set init_css_set;
>
> @@ -584,10 +593,28 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
> return kernfs_name(cgrp->kn, buf, buflen);
> }
>
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> + struct cgroup *cgrp, char *buf,
> + size_t buflen)
> +{
> + if (ns) {
> + BUG_ON(!cgroup_on_dfl(cgrp));
> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf,
> + buflen);
> + } else {
> + return kernfs_path(cgrp->kn, buf, buflen);
> + }
> +}
> +
> static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
> size_t buflen)
> {
> - return kernfs_path(cgrp->kn, buf, buflen);
> + if (cgroup_on_dfl(cgrp)) {
> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf,
> + buflen);
> + } else {
> + return cgroup_path_ns(NULL, cgrp, buf, buflen);
> + }
> }
>
> static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..0b97b8d
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,36 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *current_cgroupns_root(void)
> +{
> + return current->nsproxy->cgroup_ns->root_cgrp;
> +}
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + if (ns)
> + atomic_inc(&ns->count);
> + return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns && atomic_dec_and_test(&ns->count))
> + free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns);
> +
> +#endif /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
> struct uts_namespace;
> struct ipc_namespace;
> struct pid_namespace;
> +struct cgroup_namespace;
> struct fs_struct;
>
> /*
> @@ -33,6 +34,7 @@ struct nsproxy {
> struct mnt_namespace *mnt_ns;
> struct pid_namespace *pid_ns_for_children;
> struct net *net_ns;
> + struct cgroup_namespace *cgroup_ns;
> };
> extern struct nsproxy init_nsproxy;
>
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>
> struct pid_namespace;
> struct nsproxy;
> +struct task_struct;
> +struct inode;
These two lines seems unnecessary.
>
> struct proc_ns_operations {
> const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
> extern const struct proc_ns_operations pidns_operations;
> extern const struct proc_ns_operations userns_operations;
> extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>
> /*
> * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
> PROC_UTS_INIT_INO = 0xEFFFFFFEU,
> PROC_USER_INIT_INO = 0xEFFFFFFDU,
> PROC_PID_INIT_INO = 0xEFFFFFFCU,
> + PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
> };
>
> #ifdef CONFIG_PROC_FS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..d9731e2 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -50,7 +50,7 @@ obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
> obj-$(CONFIG_KEXEC) += kexec.o
> obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
> obj-$(CONFIG_COMPAT) += compat.o
> -obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUPS) += cgroup.o cgroup_namespace.o
> obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
> obj-$(CONFIG_CPUSETS) += cpuset.o
> obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e12d36e..b1ae6d9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
> #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
> #include <linux/kthread.h>
> #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>
> #include <linux/atomic.h>
>
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
> static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
> bool is_add);
>
> +struct cgroup_namespace init_cgroup_ns = {
> + .count = {
> + .counter = 1,
> + },
.count = ATOMIC_INIT(1)
> + .proc_inum = PROC_CGROUP_INIT_INO,
> + .user_ns = &init_user_ns,
> + .root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
> /* IDR wrappers which synchronize using cgroup_idr_lock */
> static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
> gfp_t gfp_mask)
> @@ -4989,6 +5000,8 @@ int __init cgroup_init(void)
> unsigned long key;
> int ssid, err;
>
> + get_user_ns(init_cgroup_ns.user_ns);
> +
> BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
> BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));
>
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..0e0ef3a
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,127 @@
> +/*
> + * Copyright (C) 2014 Google Inc.
> + *
> + * Author: Aditya Kali ([email protected])
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the Free
> + * Software Foundation, version 2 of the License.
> + */
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> + struct cgroup_namespace *new_ns;
> +
> + new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> + if (new_ns)
> + atomic_set(&new_ns->count, 1);
> + return new_ns;
> +}
Better fold this function into copy_cgroup_ns().
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + cgroup_put(ns->root_cgrp);
> + put_user_ns(ns->user_ns);
> + proc_free_inum(ns->proc_inum);
> + kfree(ns);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
This should be a static inline function.
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns)
> +{
> + struct cgroup_namespace *new_ns = NULL;
> + struct cgroup *cgrp = NULL;
> + int err;
> +
> + BUG_ON(!old_ns);
> +
> + if (!(flags & CLONE_NEWCGROUP))
> + return get_cgroup_ns(old_ns);
> +
> + /* Allow only sysadmin to create cgroup namespace. */
> + err = -EPERM;
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> + goto err_out;
> +
> + /* CGROUPNS only virtualizes the cgroup path on the unified hierarchy.
> + */
The comment style should be
/*
* ...
*/
> + cgrp = get_task_cgroup(current);
> +
> + err = -ENOMEM;
> + new_ns = alloc_cgroup_ns();
> + if (!new_ns)
> + goto err_out;
> +
> + err = proc_alloc_inum(&new_ns->proc_inum);
> + if (err)
> + goto err_out;
> +
> + new_ns->user_ns = get_user_ns(user_ns);
> + new_ns->root_cgrp = cgrp;
> +
> + return new_ns;
> +
> +err_out:
> + if (cgrp)
> + cgroup_put(cgrp);
> + kfree(new_ns);
> + return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> + pr_info("setns not supported for cgroup namespace");
> + return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> + struct cgroup_namespace *ns = NULL;
> + struct nsproxy *nsproxy;
> +
> + task_lock(task);
> + nsproxy = task->nsproxy;
> + if (nsproxy) {
> + ns = nsproxy->cgroup_ns;
> + get_cgroup_ns(ns);
> + }
> + task_unlock(task);
> +
> + return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> + put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> + struct cgroup_namespace *cgroup_ns = ns;
> +
> + return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> + .name = "cgroup",
> + .type = CLONE_NEWCGROUP,
> + .get = cgroupns_get,
> + .put = cgroupns_put,
> + .install = cgroupns_install,
> + .inum = cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> + return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
Why provide this unused init function?
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9b7d746..d22d793 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1797,7 +1797,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
> if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
> CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> - CLONE_NEWUSER|CLONE_NEWPID))
> + CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
> return -EINVAL;
> /*
> * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
> #include <linux/proc_ns.h>
> #include <linux/file.h>
> #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>
> static struct kmem_cache *nsproxy_cachep;
>
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
> #ifdef CONFIG_NET
> .net_ns = &init_net,
> #endif
> + .cgroup_ns = &init_cgroup_ns,
> };
>
> static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> goto out_pid;
> }
>
> + new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> + tsk->nsproxy->cgroup_ns);
> + if (IS_ERR(new_nsp->cgroup_ns)) {
> + err = PTR_ERR(new_nsp->cgroup_ns);
> + goto out_cgroup;
> + }
> +
> new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> if (IS_ERR(new_nsp->net_ns)) {
> err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> return new_nsp;
>
> out_net:
> + if (new_nsp->cgroup_ns)
> + put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
> if (new_nsp->pid_ns_for_children)
> put_pid_ns(new_nsp->pid_ns_for_children);
> out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> struct nsproxy *new_ns;
>
> if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> - CLONE_NEWPID | CLONE_NEWNET)))) {
> + CLONE_NEWPID | CLONE_NEWNET |
> + CLONE_NEWCGROUP)))) {
> get_nsproxy(old_ns);
> return 0;
> }
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
> put_ipc_ns(ns->ipc_ns);
> if (ns->pid_ns_for_children)
> put_pid_ns(ns->pid_ns_for_children);
> + if (ns->cgroup_ns)
> + put_cgroup_ns(ns->cgroup_ns);
> put_net(ns->net_ns);
> kmem_cache_free(nsproxy_cachep, ns);
> }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> int err = 0;
>
> if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> - CLONE_NEWNET | CLONE_NEWPID)))
> + CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
> return 0;
>
> user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>
On 2014/12/5 9:55, Aditya Kali wrote:
> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is
s/as/has
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 2 ++
> kernel/cgroup.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..efe5e15 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> return NULL;
> }
>
> +/**
> + * kernfs_obtain_root - get a dentry for the given kernfs_node
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
s/used used/be used/
s/which/who
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn)
> +{
> + struct dentry *dentry;
> + struct inode *inode;
> +
> + BUG_ON(sb->s_op != &kernfs_sops);
> +
> + /* inode for the given kernfs_node should already exist. */
> + inode = ilookup(sb, kn->ino);
> + if (!inode) {
> + pr_debug("kernfs: could not get inode for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + /* instantiate and link root dentry */
> + dentry = d_obtain_root(inode);
> + if (!dentry) {
> + pr_debug("kernfs: could not get dentry for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
> + * may be called by callers other than kernfs_fill_super. */
/*
* ...
*/
> + mutex_lock(&kernfs_mutex);
> + if (!dentry->d_fsdata) {
> + kernfs_get(kn);
> + dentry->d_fsdata = kn;
> + } else {
> + WARN_ON(dentry->d_fsdata != kn);
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return dentry;
> +}
Seperate this as a standalone patch?
> +
> static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> {
> struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
> struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
> struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn);
> struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
> unsigned int flags, void *priv);
> void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index b1ae6d9..e779890 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1438,6 +1438,14 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
> return -ENOENT;
> }
>
> + /* If inside a non-init cgroup namespace, only allow default hierarchy
> + * to be mounted.
> + */
/*
* ...
*/
> + if ((current->nsproxy->cgroup_ns != &init_cgroup_ns) &&
> + !(opts->flags & CGRP_ROOT_SANE_BEHAVIOR)) {
> + return -EINVAL;
> + }
> +
> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> if (nr_opts != 1) {
> @@ -1630,6 +1638,15 @@ static void init_cgroup_root(struct cgroup_root *root,
> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> }
>
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> + struct cgroup_namespace *ns)
> +{
> + struct dentry *nsdentry;
> +
> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> + return nsdentry;
> +}
> +
> static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
> {
> LIST_HEAD(tmp_links);
> @@ -1734,6 +1751,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> int ret;
> int i;
> bool new_sb;
> + struct cgroup_namespace *ns =
> + get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }
>
> /*
> * The first time anyone tries to mount a cgroup, enable the list
> @@ -1866,11 +1891,28 @@ out_free:
> kfree(opts.release_agent);
> kfree(opts.name);
>
> - if (ret)
> + if (ret) {
> + put_cgroup_ns(ns);
> return ERR_PTR(ret);
> + }
>
> dentry = kernfs_mount(fs_type, flags, root->kf_root,
> CGROUP_SUPER_MAGIC, &new_sb);
> +
> + if (!IS_ERR(dentry) && (root == &cgrp_dfl_root)) {
> + /* If this mount is for the default hierarchy in non-init cgroup
> + * namespace, then instead of root cgroup's dentry, we return
> + * the dentry corresponding to the cgroupns->root_cgrp.
> + */
> + if (ns != &init_cgroup_ns) {
> + struct dentry *nsdentry;
> +
> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> + dput(dentry);
> + dentry = nsdentry;
> + }
> + }
> +
> if (IS_ERR(dentry) || !new_sb)
> cgroup_put(&root->cgrp);
>
> @@ -1883,6 +1925,7 @@ out_free:
> deactivate_super(pinned_sb);
> }
>
> + put_cgroup_ns(ns);
> return dentry;
> }
>
> @@ -1911,6 +1954,7 @@ static struct file_system_type cgroup_fs_type = {
> .name = "cgroup",
> .mount = cgroup_mount,
> .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> static struct kobject *cgroup_kobj;
>
Aditya,
I gave your patch set a try but it does not work for me.
Maybe you can bring some light into the issues I'm facing.
Sadly I still had no time to dig into your code.
Am 05.12.2014 um 02:55 schrieb Aditya Kali:
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> Documentation/cgroups/namespace.txt | 147 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 147 insertions(+)
> create mode 100644 Documentation/cgroups/namespace.txt
>
> diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
> new file mode 100644
> index 0000000..6480379
> --- /dev/null
> +++ b/Documentation/cgroups/namespace.txt
> @@ -0,0 +1,147 @@
> + CGroup Namespaces
> +
> +CGroup Namespace provides a mechanism to virtualize the view of the
> +/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
> +clone() and unshare() syscalls to create a new cgroup namespace.
> +The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
> +output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
> +at the time of creation of the cgroup namespace.
> +
> +Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
> +path of the cgroup of a process. In a container setup (where a set of cgroups
> +and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
> +may leak potential system level information to the isolated processes.
> +
> +For Example:
> + $ cat /proc/self/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> +The path '/batchjobs/container_id1' can generally be considered as system-data
> +and its desirable to not expose it to the isolated process.
> +
> +CGroup Namespaces can be used to restrict visibility of this path.
> +For Example:
> + # Before creating cgroup namespace
> + $ ls -l /proc/self/ns/cgroup
> + lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> + $ cat /proc/self/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> + # unshare(CLONE_NEWCGROUP) and exec /bin/bash
> + $ ~/unshare -c
> + [ns]$ ls -l /proc/self/ns/cgroup
> + lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> + # From within new cgroupns, process sees that its in the root cgroup
> + [ns]$ cat /proc/self/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> +
> + # From global cgroupns:
> + $ cat /proc/<pid>/cgroup
> + 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
> +
> + # Unshare cgroupns along with userns and mountns
> + # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> + # sets up uid/gid map and execs /bin/bash
> + $ ~/unshare -c -u -m
This command does not issue CLONE_NEWUSER, -U does.
> + # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
> + # hierarchy.
> + [ns]$ mount -t cgroup cgroup /tmp/cgroup
> + [ns]$ ls -l /tmp/cgroup
> + total 0
> + -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> + -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> + -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> + -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
I've patched libvirt-lxc to issue CLONE_NEWCGROUP and not bind mount cgroupfs into a container.
But I'm unable to mount cgroupfs within the container, mount(2) is failing with EINVAL.
And /proc/self/cgroup still shows the cgroup from outside.
---cut---
container:/ # ls /sys/fs/cgroup/
container:/ # mount -t cgroup none /sys/fs/cgroup/
mount: wrong fs type, bad option, bad superblock on none,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so.
container:/ # cat /proc/self/cgroup
8:memory:/machine/test00.libvirt-lxc
7:devices:/machine/test00.libvirt-lxc
6:hugetlb:/
5:cpuset:/machine/test00.libvirt-lxc
4:blkio:/machine/test00.libvirt-lxc
3:cpu,cpuacct:/machine/test00.libvirt-lxc
2:freezer:/machine/test00.libvirt-lxc
1:name=systemd:/user.slice/user-0.slice/session-c2.scope
container:/ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:02 .
dr-xr-xr-x 8 root root 0 Dec 14 23:02 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:02 cgroup -> cgroup:[4026532240]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 ipc -> ipc:[4026532238]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 mnt -> mnt:[4026532235]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 net -> net:[4026532242]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 pid -> pid:[4026532239]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 user -> user:[4026532234]
lrwxrwxrwx 1 root root 0 Dec 14 23:02 uts -> uts:[4026532236]
container:/ #
#host side
lxc-os132:~ # ls -la /proc/self/ns
total 0
dr-x--x--x 2 root root 0 Dec 14 23:56 .
dr-xr-xr-x 8 root root 0 Dec 14 23:56 ..
lrwxrwxrwx 1 root root 0 Dec 14 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 net -> net:[4026531957]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Dec 14 23:56 uts -> uts:[4026531838]
---cut---
Any ideas?
Thanks,
//richard
Quoting Tejun Heo ([email protected]):
> On Wed, Jan 07, 2015 at 05:27:38PM -0600, Eric W. Biederman wrote:
> > >> The -o SUBSYS option doesn't exist. Jesus, at least get yourself
> > >> familiar with the basics before claiming random stuff.
> >
> > Oh let's see I got that command line option out of /proc/mounts and yes
> > it works. Perhaps it doesn't if I invoke unified hiearchies but the
> > option does in fact exist and work.
>
> I meant the -o SUBSYS doesn't exist for unified hierarchy.
>
> > Now I really do need to test report regressions, and send probably send
> > regression fixes. If I understand your strange ranting I think you just
> > told me that option that -o SUBSYS does work with unified hierarchies.
>
> What? Why would -O SUBSYS exist for unified hierarchy? It's unified
> for all controllers.
>
> > Tejun. I asked you specifically about this case 2 years ago at plumbers
> > and you personally told me this would continue to work. I am going to
> > hold you to that.
>
> I have no idea what you're talking about in *THIS* thread. I'm fully
> aware of what was discussed *THEN*.
>
> > Fixing bugs is one thing. Gratuitious regressions that make supporting
> > existing user space applications insane is another.
>
> Can you explain what problem you're actually trying to talk about
> without spouting random claims about regressions?
A few weeks ago, in order to test the cgroup namespace patchset with lxc,
I went through the motions of getting lxc to work with unified hierarchy.
A few of the things I had to change:
1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0. Used
to start with 1. I expect many userspace parsers to be broken by this.
2. After creating every non-leaf cgroup, we must fill in the
cgroup.subtree_cgroups file. This is extra work which userspace
doesn't have to do right now.
3. Let's say we want to create a freezer cgroup /foo/bar for some set of
tasks, which they will administer. In fact let's assume we are going to
use cgroup namespaces. We have to put the tasks into /foo/bar, unshare
the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
and then write 'freezer' into /foo/bar. (If we're not using cgroup
namespaces, then we have to do a similar thing to let the tasks administer
/foo/bar while placing them under /foo/bar/leaf). The oddness I'm pointing
to is where the tasks have to know that they can create cgroups in "..".
For containers this becomes odd. We tend to group containers by the
tasks in and under a cgroup. We now will have to assume a convention
where we know to check for tasks in and under "..", since by definition
pid 1's cgroup (in a container) cannot have children.
4. The per-cgroup "tasks" file not existing seems odd, although certainly
unexpected by much current software.
So, if the unified hierarchy is going to not cause undue pain, existing
software really needs to start working now to use it. It's going to be
a sizeable task for lxc.
-serge
On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0. Used
> to start with 1. I expect many userspace parsers to be broken by this.
This is intentional. The unified hierarchy will always have the
hierarchy number zero. Userland needs to be updated anyway and the
unified hierarchy won't show up unless explicitly enabled.
> 2. After creating every non-leaf cgroup, we must fill in the
> cgroup.subtree_cgroups file. This is extra work which userspace
> doesn't have to do right now.
Again, by design. This is how organization and control are separated
and the differing levels of granularity is achieved.
> 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
There shouldn't be a "freezer" cgroup. The processes are categorized
according to their logical structure and controllers are applied to
the hierarchy as necessary.
> tasks, which they will administer. In fact let's assume we are going to
> use cgroup namespaces. We have to put the tasks into /foo/bar, unshare
> the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> and then write 'freezer' into /foo/bar. (If we're not using cgroup
> namespaces, then we have to do a similar thing to let the tasks administer
> /foo/bar while placing them under /foo/bar/leaf). The oddness I'm pointing
> to is where the tasks have to know that they can create cgroups in "..".
>
> For containers this becomes odd. We tend to group containers by the
> tasks in and under a cgroup. We now will have to assume a convention
> where we know to check for tasks in and under "..", since by definition
> pid 1's cgroup (in a container) cannot have children.
The semantics is that the parent enables distribution of its given
type of resource by enabling the controller in its subtree_control.
This scoping isn't necessary for freezer and I'm debating whether to
enable controllers which don't need granularity control to be enabled
unconditionally. Right now, I'm leaning against it mostly for
consistency.
> 4. The per-cgroup "tasks" file not existing seems odd, although certainly
> unexpected by much current software.
And, yes, everything is per-process for reasons described in
unified-hierarchy.txt.
> So, if the unified hierarchy is going to not cause undue pain, existing
> software really needs to start working now to use it. It's going to be
> a sizeable task for lxc.
Yes, this isn't gonna be a trivial conversion. The usage model
changes and so will a lot of controller knobs and behaviors.
Thanks.
--
tejun
Quoting Tejun Heo ([email protected]):
> On Wed, Feb 11, 2015 at 04:46:16AM +0100, Serge E. Hallyn wrote:
> > 1. Hierarchy_num in /proc/cgroups and /proc/self/cgroup start at 0. Used
> > to start with 1. I expect many userspace parsers to be broken by this.
>
> This is intentional. The unified hierarchy will always have the
> hierarchy number zero. Userland needs to be updated anyway and the
> unified hierarchy won't show up unless explicitly enabled.
>
> > 2. After creating every non-leaf cgroup, we must fill in the
> > cgroup.subtree_cgroups file. This is extra work which userspace
> > doesn't have to do right now.
>
> Again, by design. This is how organization and control are separated
> and the differing levels of granularity is achieved.
>
> > 3. Let's say we want to create a freezer cgroup /foo/bar for some set of
>
> There shouldn't be a "freezer" cgroup. The processes are categorized
> according to their logical structure and controllers are applied to
> the hierarchy as necessary.
But there can well be cgroups for which only freezer is enabled. If
I'm wrong about that, then I am suffering a fundamental misunderstanding.
> > tasks, which they will administer. In fact let's assume we are going to
> > use cgroup namespaces. We have to put the tasks into /foo/bar, unshare
> > the cgroup ns, then create /foo/bar/leaf, move the tasks into /foo/bar/leaf,
> > and then write 'freezer' into /foo/bar. (If we're not using cgroup
> > namespaces, then we have to do a similar thing to let the tasks administer
> > /foo/bar while placing them under /foo/bar/leaf). The oddness I'm pointing
> > to is where the tasks have to know that they can create cgroups in "..".
> >
> > For containers this becomes odd. We tend to group containers by the
> > tasks in and under a cgroup. We now will have to assume a convention
> > where we know to check for tasks in and under "..", since by definition
> > pid 1's cgroup (in a container) cannot have children.
>
> The semantics is that the parent enables distribution of its given
> type of resource by enabling the controller in its subtree_control.
> This scoping isn't necessary for freezer and I'm debating whether to
> enable controllers which don't need granularity control to be enabled
> unconditionally. Right now, I'm leaning against it mostly for
> consistency.
Yeah, IIUC (i.e. freezer would always be enabled?) that would be
even-more-confusing.
> > 4. The per-cgroup "tasks" file not existing seems odd, although certainly
> > unexpected by much current software.
>
> And, yes, everything is per-process for reasons described in
> unified-hierarchy.txt.
>
> > So, if the unified hierarchy is going to not cause undue pain, existing
> > software really needs to start working now to use it. It's going to be
> > a sizeable task for lxc.
>
> Yes, this isn't gonna be a trivial conversion. The usage model
> changes and so will a lot of controller knobs and behaviors.
>
> Thanks.
>
> --
> tejun
A slightly off topic comment, for where this thread has gone but
relevant if we are talking about cgroup namespaces.
If don't implement compatibility with existing userspace, they get a
nack. A backwards-incompatible change should figure out how to remove
the need for any namespaces.
Because that is what namespaces are about backwards compatibility.
Eric
Hello,
On Wed, Feb 11, 2015 at 05:29:42AM +0100, Serge E. Hallyn wrote:
> > There shouldn't be a "freezer" cgroup. The processes are categorized
> > according to their logical structure and controllers are applied to
> > the hierarchy as necessary.
>
> But there can well be cgroups for which only freezer is enabled. If
> I'm wrong about that, then I am suffering a fundamental misunderstanding.
Ah, sure, I was mostly arguing semantics. It's just weird to call it
"freezer" cgroup.
> > The semantics is that the parent enables distribution of its given
> > type of resource by enabling the controller in its subtree_control.
> > This scoping isn't necessary for freezer and I'm debating whether to
> > enable controllers which don't need granularity control to be enabled
> > unconditionally. Right now, I'm leaning against it mostly for
> > consistency.
>
> Yeah, IIUC (i.e. freezer would always be enabled?) that would be
> even-more-confusing.
Right, freezer is kinda weird tho. Its feature can almost be
considered a utility feature of cgroups core rather than a separate
controller. That said, it's most likely that it'll remain in its
current form although how it blocks tasks should definitely be
reimplemented.
Thanks.
--
tejun
Hey,
On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
> A slightly off topic comment, for where this thread has gone but
> relevant if we are talking about cgroup namespaces.
>
> If don't implement compatibility with existing userspace, they get a
> nack. A backwards-incompatible change should figure out how to remove
> the need for any namespaces.
>
> Because that is what namespaces are about backwards compatibility.
Are you claiming that namespaces are soley about backwards
compatibility? ie. to trick userland into scoping without letting it
notice? That's a very restricted view and namespaces do provide
further isolation capabilties in addition to what can be achieved
otherwise and it is logical to collect simliar funtionalities there.
Thanks.
--
tejun
Tejun Heo <[email protected]> writes:
> Hey,
>
> On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
>> A slightly off topic comment, for where this thread has gone but
>> relevant if we are talking about cgroup namespaces.
>>
>> If don't implement compatibility with existing userspace, they get a
>> nack. A backwards-incompatible change should figure out how to remove
>> the need for any namespaces.
>>
>> Because that is what namespaces are about backwards compatibility.
>
> Are you claiming that namespaces are soley about backwards
> compatibility? ie. to trick userland into scoping without letting it
> notice? That's a very restricted view and namespaces do provide
> further isolation capabilties in addition to what can be achieved
> otherwise and it is logical to collect simliar funtionalities there.
In principle a namespace is an additional layer of indirection from
names to objects. A namespace does not invent new kinds of objects.
A namespace takes things that were previously global and gives them a
scope.
In princple after name resolution a namespace should impose no overhead.
In general namespaces are not necessary if your scope of names
already has hierarchy. Which means that new interfaces can almost
always be designed in such a way that you can support containers without
needing to add any special namespace support. Which typically results
in more flexible and useful APIs for everyone, with no real code cost.
Further in the cgroup namespace patchset I looked at a while ago, the
only reason for having a cgroup namespace was to provide a measure of
backwards compatibility with existing userspace. I expect removing the
/proc/<pid>/cgroup file and replacing it with something in cgroupfs
itself would serve just as well if backwards compatibility is not the
objective. Or possibly replacincg /proc/<pid>/cgroup into a magic
symlink onto somewhere in the unified cgroupfs itself.
I just don't see any point in doing weird silly namespace things to keep
existing userspace working when the existing userspace won't work.
As such if a namespace doesn't implement compatibility with the existing
userspace it gets my nack.
Eric
Hey,
On Wed, Feb 11, 2015 at 12:29:20AM -0600, Eric W. Biederman wrote:
> In general namespaces are not necessary if your scope of names
> already has hierarchy. Which means that new interfaces can almost
> always be designed in such a way that you can support containers without
> needing to add any special namespace support. Which typically results
> in more flexible and useful APIs for everyone, with no real code cost.
Sure, and cgroup ns support isn't doing anything weird there. Just
bind mounting a subhierarchy is enough for the core features. The ns
part is dealing with things which can't easily be tied to such
hierarchical scoping like path reported under through proc and even
handling that can be achieved by, for example, marking delegation
points in cgroup proper and forcing tasks beyond that point to
consider that as its origin when determining the path to report.
However, note that something like that is inherently similar to what's
being provided by other namespaces. It is true that it can be
implemented outside namespace facility proper but that doesn't
automatically make that the right choice and it's more likely to be
worse - we'd be introducing a different way to perform about the same
thing.
So, the argument that adding namespace interface except for backward
compatibility doesn't seem to hold water. Like it or not, namespace
is serving as a platform for certain type of features and we'd be
foolish to not to consider putting a related feature together there
and I fail to see a valid technical argument as of yet.
> Further in the cgroup namespace patchset I looked at a while ago, the
> only reason for having a cgroup namespace was to provide a measure of
> backwards compatibility with existing userspace. I expect removing the
> /proc/<pid>/cgroup file and replacing it with something in cgroupfs
> itself would serve just as well if backwards compatibility is not the
> objective. Or possibly replacincg /proc/<pid>/cgroup into a magic
> symlink onto somewhere in the unified cgroupfs itself.
No matter what we do, we'd still need to mark the delegation point
somehow; otherwise, there's no way to produce a scoped identifier.
This isn't really about backward compatibility but rather the feature
to scope a subhierarcy properly.
> I just don't see any point in doing weird silly namespace things to keep
> existing userspace working when the existing userspace won't work.
If it's too different from existing namespaces, sure, doing something
is definitely an option but is it?
> As such if a namespace doesn't implement compatibility with the existing
> userspace it gets my nack.
Hmmm.... I don't think making the proposed NS support to work across
all hierarchies including the traditional multiple ones would be too
difficult. That should work then, right?
Thanks.
--
tejun
Quoting Tejun Heo ([email protected]):
> Hey,
>
> On Tue, Feb 10, 2015 at 11:02:40PM -0600, Eric W. Biederman wrote:
> > A slightly off topic comment, for where this thread has gone but
> > relevant if we are talking about cgroup namespaces.
> >
> > If don't implement compatibility with existing userspace, they get a
> > nack. A backwards-incompatible change should figure out how to remove
> > the need for any namespaces.
> >
> > Because that is what namespaces are about backwards compatibility.
>
> Are you claiming that namespaces are soley about backwards
> compatibility? ie. to trick userland into scoping without letting it
> notice? That's a very restricted view and namespaces do provide
> further isolation capabilties in addition to what can be achieved
> otherwise and it is logical to collect simliar funtionalities there.
We absolutely would love to use cgroup namespaces to run older
userspace in containers. I don't know that it's actually possible
to do both that and use unified hierarchy at the same time though,
which is unfortunate. So an Ubuntu 12.04 container will never, afaics,
be able to run inside an ubuntu 16.04 host that is using unified
hierarchy, without using backported newer versions of lxc (etc) in
the container.
On Wed, Feb 11, 2015 at 05:00:23PM +0100, Serge E. Hallyn wrote:
> We absolutely would love to use cgroup namespaces to run older
> userspace in containers. I don't know that it's actually possible
> to do both that and use unified hierarchy at the same time though,
> which is unfortunate. So an Ubuntu 12.04 container will never, afaics,
> be able to run inside an ubuntu 16.04 host that is using unified
> hierarchy, without using backported newer versions of lxc (etc) in
> the container.
So, the constraint there are the controllers. A controller can't be
attached to two hierarchies at the same time for obvious reasons, so
regardless of NS, you can't use the same controller on a unified
hierarchy *and* a traditional hierarchy. NS doesn't adds or
substracts from the situation. If you decide to attach a controller
to a traditional hierarchy, that's where it's gonna be available. If
you attach it to the unified hierarchy, the same story.
Thanks.
--
tejun
Quoting Tejun Heo ([email protected]):
> On Wed, Feb 11, 2015 at 05:00:23PM +0100, Serge E. Hallyn wrote:
> > We absolutely would love to use cgroup namespaces to run older
> > userspace in containers. I don't know that it's actually possible
> > to do both that and use unified hierarchy at the same time though,
> > which is unfortunate. So an Ubuntu 12.04 container will never, afaics,
> > be able to run inside an ubuntu 16.04 host that is using unified
> > hierarchy, without using backported newer versions of lxc (etc) in
> > the container.
>
> So, the constraint there are the controllers. A controller can't be
> attached to two hierarchies at the same time for obvious reasons, so
> regardless of NS, you can't use the same controller on a unified
> hierarchy *and* a traditional hierarchy. NS doesn't adds or
> substracts from the situation. If you decide to attach a controller
> to a traditional hierarchy, that's where it's gonna be available. If
> you attach it to the unified hierarchy, the same story.
Right, exactly.
thanks,
-serge