Background
Cgroups and namespaces are used together to create “virtual”
containers that isolate the host environment from the processes
running inside a container. But since cgroups themselves are not
“virtualized”, a task can always see the global cgroup view
through the cgroupfs mount and via the /proc/self/cgroup file.
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
This exposure of cgroup names to the processes running inside a
container results in some problems:
(1) The container names are typically data of the host's
container-management agent (systemd, docker/libcontainer, etc.), and
leaking these names (or the hierarchy) reveals too much information
about the host system.
(2) It makes container migration across machines (CRIU) more
difficult, as the container names need to be unique across all
machines in the migration domain.
(3) It makes it difficult to run container-management tools (like
docker/libcontainer, lmctfy, etc.) within virtual containers
without adding a dependency on some state or agent present outside
the container.
Note that the feature proposed here is completely different from the
“ns cgroup” feature which existed in the Linux kernel until recently.
The ns cgroup also attempted to connect cgroups and namespaces by
creating a new cgroup every time a new namespace was created. It did
not solve any of the above-mentioned problems and was later dropped
from the kernel.
Introducing CGroup Namespaces
With the unified cgroup hierarchy
(Documentation/cgroups/unified-hierarchy.txt), containers can now
have a much more coherent cgroup view and it is easy to associate a
container with a single cgroup. This also allows us to virtualize the
cgroup view for tasks inside the container.
The new CGroup Namespace allows a process to “unshare” its cgroup
hierarchy starting from the cgroup it is currently in.
For example:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
$ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
[ns]$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
# From within the new cgroupns, the process sees that it is in the root cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
# From global cgroupns:
$ cat /proc/<pid>/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
The virtualization of the /proc/self/cgroup file, combined with
restricting the view of the cgroup hierarchy by bind-mounting the
$CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
$CONTAINER_CHROOT/sys/fs/cgroup/, should provide a completely isolated
cgroup view inside the container.
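For illustration, a container-setup agent could combine these two steps
roughly as follows (a minimal sketch only; $CGROUP_MOUNT,
$CONTAINER_CHROOT and the cgroup path are just the placeholder names
used above):
$ echo $$ > $CGROUP_MOUNT/batchjobs/c_job_id1/cgroup.procs
$ ~/unshare -c          # unshare(CLONE_NEWCGROUP) and exec a shell
[ns]$ mkdir -p $CONTAINER_CHROOT/sys/fs/cgroup
[ns]$ mount --bind $CGROUP_MOUNT/batchjobs/c_job_id1 \
        $CONTAINER_CHROOT/sys/fs/cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
After chroot'ing (or pivot_root'ing) into $CONTAINER_CHROOT, the tasks
inside see only their own subtree under /sys/fs/cgroup and only “/” in
/proc/self/cgroup.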
In their current simplistic form, cgroup namespaces provide the
following behavior:
(1) The “root” cgroup for a cgroup namespace is the cgroup in which
the process calling unshare is running.
For example, if a process in the /batchjobs/c_job_id1 cgroup calls
unshare, the cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
For the init_cgroup_ns, this is the real root (“/”) cgroup
(identified in code as cgrp_dfl_root.cgrp).
(2) The cgroupns-root cgroup does not change even if the namespace
creator process later moves to a different cgroup.
$ ~/unshare -c # unshare cgroupns in some cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
[ns]$ mkdir sub_cgrp_1
[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
(3) Each process gets its cgroupns-specific view of
/proc/<pid>/cgroup.
(a) Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup
[ns]$ sleep 100000 & # From within unshared cgroupns
[1] 7353
[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
(b) From global cgroupns, the real cgroup path will be visible:
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
(c) From a sibling cgroupns, the real path will be visible:
[ns2]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
(In a correct container setup, though, it should not be possible to
access PIDs in another container in the first place. This can be
changed if desired.)
(4) Processes inside a cgroupns are not allowed to move out of the
cgroupns-root. This is true even if a privileged process in global
cgroupns tries to move the process out of its cgroupns-root.
# From global cgroupns
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
# cgroupns-root for 7353 is /batchjobs/c_job_id1
$ echo 7353 > batchjobs/c_job_id2/cgroup.procs
-bash: echo: write error: Operation not permitted
(5) setns() is not supported for cgroup namespace in the initial
version.
(6) When some thread of a multi-threaded process unshares its
cgroup namespace, the new cgroupns gets applied to the entire
process (all the threads). This should be OK since the
unified hierarchy only allows process-level containerization, so
all the threads in the process will have the same cgroup. Both
changing cgroups and unsharing namespaces are protected under
threadgroup_lock(task).
(7) The cgroup namespace is alive as long as there is at least one
process inside it. When the last process exits, the cgroup
namespace is destroyed. The cgroupns-root and the actual cgroups
remain, though.
Implementation
The current patch-set is based on top of Tejun's cgroup tree (for-next
branch). It is fairly non-intrusive and provides the above-mentioned
features.
Possible extensions of CGROUPNS:
(1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
capabilities to restrict cgroups to administrative users. CGroup
namespaces could be of help here. With cgroup namespaces, it might
be possible to delegate administration of sub-cgroups under a
cgroupns-root to the cgroupns owner.
(2) Provide a cgroupns-specific cgroupfs mount, i.e., the following
command, when run from inside a cgroupns, should only mount the
hierarchy starting at the cgroupns-root cgroup:
$ mount -t cgroup cgroup <cgroup-mountpoint>
# -o __DEVEL__sane_behavior should be implicit
This is similar to how procfs can be mounted for every PIDNS. This
may have some use cases.
---
fs/kernfs/dir.c | 51 +++++++++++++---
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 36 ++++++++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
include/linux/kernfs.h | 3 +
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
include/uapi/linux/sched.h | 3 +-
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 75 +++++++++++++++++------
kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
14 files changed, 364 insertions(+), 34 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c
[PATCH 1/5] kernfs: Add API to get generate relative kernfs path
[PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCH 3/5] cgroup: add function to get task's cgroup on default
[PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
[PATCH 5/5] cgroup: introduce cgroup namespaces
CLONE_NEWCGROUP will be used to create a new cgroup namespace.
Signed-off-by: Aditya Kali <[email protected]>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.0.0.526.g5318336
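For reference, a helper like the ~/unshare tool used in the cover-letter
examples could be implemented roughly as below. This is only a userspace
sketch (no option parsing, unlike the real tool); CLONE_NEWCGROUP is
defined locally since pre-patch libc headers do not know the new flag,
and with this patch-set the unshare() call additionally requires
CAP_SYS_ADMIN in the init user namespace:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* value introduced by this patch */
#endif

int main(void)
{
	/* detach from the parent's cgroup namespace; the current cgroup
	 * becomes the cgroupns-root of the new namespace */
	if (unshare(CLONE_NEWCGROUP) != 0) {
		perror("unshare(CLONE_NEWCGROUP)");
		return 1;
	}

	/* the shell (and all its descendants) now see "/" as the root
	 * in /proc/self/cgroup */
	execl("/bin/bash", "bash", (char *)NULL);
	perror("execl");
	return 1;
}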
Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of creation of the cgroup namespace. The task that creates the new
cgroup namespace and all its future children will now be restricted only
to the cgroup hierarchy under this root_cgrp. In the first version,
setns() is not supported for cgroup namespaces.
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking system
level cgroup hierarchy to the task.
Signed-off-by: Aditya Kali <[email protected]>
---
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 18 +++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 32 ++++++++++
kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
11 files changed, 276 insertions(+), 4 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+ &cgroupns_operations,
+#endif
};
static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4ea477f..d3c6070 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
#ifdef CONFIG_CGROUPS
@@ -469,6 +471,13 @@ struct cftype {
#endif
};
+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;
@@ -591,10 +600,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}
+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
}
static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+ return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#else /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+ unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns) {
+ if (flags & CLONE_NEWCGROUP)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+#endif /* CONFIG_CGROUP_NS */
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index b4ec59d..44f388c 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;
/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@
struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;
struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;
/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};
#ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index 9d76b99..2f43ec9 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1101,6 +1101,15 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
+config CGROUP_NS
+ bool "CGroup Namespaces"
+ default n
+ help
+ This options enables CGroup Namespaces which can be used to isolate
+ cgroup paths. This feature is only useful when unified cgroup
+ hierarchy is in use (i.e. cgroups are mounted with sane_behavior
+ option).
+
endif # CGROUPS
config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index f2a8b62..61c5791 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -52,6 +52,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8552513..c04e971 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>
#include <linux/atomic.h>
@@ -196,6 +198,15 @@ static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);
static void cgroup_pidlist_destroy_all(struct cgroup *cgrp);
+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
gfp_t gfp_mask)
@@ -2333,6 +2344,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *task;
int ret;
+ /* Only allow changing cgroups accessible within task's cgroup
+ * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+ * cgroupns->root_cgrp. */
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ return -EPERM;
+
/* look up all src csets */
down_read(&css_set_rwsem);
rcu_read_lock();
@@ -4551,6 +4568,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
+ /* Allow mkdir only within process's cgroup namespace root. */
+ if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
root = parent->root;
/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4819,6 +4843,14 @@ static int cgroup_rmdir(struct kernfs_node *kn)
cgrp = cgroup_kn_lock_live(kn);
if (!cgrp)
return 0;
+
+ /* Allow rmdir only within process's cgroup namespace root.
+ * The process can't delete its own root anyways. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+ cgroup_kn_unlock(kn);
+ return -EPERM;
+ }
+
cgroup_get(cgrp); /* for @kn->priv clearing */
ret = cgroup_destroy_locked(cgrp);
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..a2e6804
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!capable(CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ cgrp = get_task_cgroup(current);
+
+ /* Creating new CGROUPNS is supported only when unified hierarchy is in
+ * use. */
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto err_out_unlock;
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out_unlock;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out_unlock;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ threadgroup_unlock(current);
+
+ return new_ns;
+
+err_out_unlock:
+ threadgroup_unlock(current);
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task_nsproxy(task);
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ rcu_read_unlock();
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index d2799d1..95981a1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1747,7 +1747,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 8e78110..e20298c 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>
static struct kmem_cache *nsproxy_cachep;
@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};
static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}
+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;
out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;
if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;
user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.0.0.526.g5318336
move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.
Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 17 +++++++++++++++++
kernel/cgroup.c | 18 ------------------
2 files changed, 17 insertions(+), 18 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 707c302..4ea477f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -530,6 +530,23 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
}
+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1671345..8552513 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -185,7 +185,6 @@ static int need_forkexit_callback __read_mostly;
static struct cftype cgroup_dfl_base_files[];
static struct cftype cgroup_legacy_base_files[];
-static void cgroup_put(struct cgroup *cgrp);
static int rebind_subsystems(struct cgroup_root *dst_root,
unsigned int ss_mask);
static int cgroup_destroy_locked(struct cgroup *cgrp);
@@ -286,12 +285,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
}
-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1029,17 +1022,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}
-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
/**
* cgroup_refresh_child_subsys_mask - update child_subsys_mask
* @cgrp: the target cgroup
--
2.0.0.526.g5318336
get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.
Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b5223c5..707c302 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -591,6 +591,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
}
char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);
int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1e94b71..1671345 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1937,6 +1937,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
}
EXPORT_SYMBOL_GPL(task_cgroup_path);
+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
/* used to track tasks and other necessary states during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
--
2.0.0.526.g5318336
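The intended get/put pattern, mirroring how copy_cgroup_ns() in the
cgroup namespace patch uses this helper, is roughly (sketch only):

	struct cgroup *cgrp;

	cgrp = get_task_cgroup(current);	/* takes a reference */
	/* ... use cgrp, e.g. remember it as a cgroupns root ... */
	cgroup_put(cgrp);			/* drop the reference when done */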
The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.
Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/dir.c | 51 ++++++++++++++++++++++++++++++++++++++++----------
include/linux/kernfs.h | 3 +++
2 files changed, 44 insertions(+), 10 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..2224f08 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,22 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}
-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_root,
+ struct kernfs_node *kn,
+ char *buf,
+ size_t buflen)
{
char *p = buf + buflen;
int len;
*--p = '\0';
+ if (kn == kn_root) {
+ *--p = '/';
+ return p;
+ }
+
do {
len = strlen(kn->name);
if (p - buf < len + 1) {
@@ -63,6 +71,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
memcpy(p, kn->name, len);
*--p = '/';
kn = kn->parent;
+ if (kn == kn_root)
+ break;
} while (kn && kn->parent);
return p;
@@ -92,26 +102,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
}
/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
* doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);
/**
@@ -145,8 +176,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 20f4935..1627341 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -257,6 +257,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}
int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.0.0.526.g5318336
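As a kernel-context usage sketch (this is essentially what
cgroup_path_ns() in the cgroup namespace patch does; the function and
variable names below are illustrative only):

	/* sketch: print @kn's path relative to @kn_root */
	static void show_relative_path(struct kernfs_node *kn_root,
				       struct kernfs_node *kn)
	{
		char buf[256];
		char *p;

		/* e.g. returns "/sub_cgrp_1" when kn_root corresponds to
		 * /batchjobs/c_job_id1 and kn to
		 * /batchjobs/c_job_id1/sub_cgrp_1; passing NULL as kn_root
		 * falls back to the full path, as kernfs_path() now does. */
		p = kernfs_path_from_node(kn_root, kn, buf, sizeof(buf));
		if (p)
			pr_info("relative kernfs path: %s\n", p);
	}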
On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <[email protected]> wrote:
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp. In the first version,
> setns() is not supported for cgroup namespaces.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
What happens if someone moves a task in a cgroup namespace outside of
the namespace root cgroup?
--Andy
On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
> On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <[email protected]> wrote:
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp. In the first version,
>> setns() is not supported for cgroup namespaces.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>
> What happens if someone moves a task in a cgroup namespace outside of
> the namespace root cgroup?
>
Attempt to move a task outside of cgroupns root will fail with EPERM.
This is true irrespective of the privileges of the process attempting
this. Once cgroupns is created, the task will be confined to the
cgroup hierarchy under its cgroupns root until it dies.
> --Andy
--
Aditya
Quoting Aditya Kali ([email protected]):
> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel.
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
Hi,
So right now we basically do this in userspace using cgmanager. Each
container/chroot/whatever that has a cgproxy is 'locked' under that
proxy's cgroup. So if root in a container asks the cgproxy for the
cgroup of pid 2000, and cgproxy is in /lxc/u1 while pid 2000 in the
container is in /lxc/u1/service1, then the response will be '/service1'.
Same happens with creating cgroups, moving pids into cgroups, etc.
(Hoping to take a close look at this set early next week)
-serge
On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>
> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
> > On Thu, Jul 17, 2014 at 12:52 PM, Aditya Kali <[email protected]> wrote:
> >> Introduce the ability to create new cgroup namespace. The newly created
> >> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> >> of creation of the cgroup namespace. The task that creates the new
> >> cgroup namespace and all its future children will now be restricted only
> >> to the cgroup hierarchy under this root_cgrp. In the first version,
> >> setns() is not supported for cgroup namespaces.
> >> The main purpose of cgroup namespace is to virtualize the contents
> >> of /proc/self/cgroup file. Processes inside a cgroup namespace
> >> are only able to see paths relative to their namespace root.
> >> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> >> to create completely virtualized containers without leaking system
> >> level cgroup hierarchy to the task.
> >
> > What happens if someone moves a task in a cgroup namespace outside of
> > the namespace root cgroup?
> >
>
> Attempt to move a task outside of cgroupns root will fail with EPERM.
> This is true irrespective of the privileges of the process attempting
> this. Once cgroupns is created, the task will be confined to the
> cgroup hierarchy under its cgroupns root until it dies.
Can a task in a non-init userns create a cgroupns? If not, that's
unusual. If so, is it problematic if they can prevent themselves from
being moved?
I hate to say it, but it might be worth requiring explicit permission
from the cgroup manager for this. For example, there could be a new
cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
will fail with -EPERM unless the caller is in a may_share=1 cgroup.
may_unshare in a parent cgroup would not give child cgroups the
ability to unshare.
--Andy
On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <[email protected]> wrote:
> On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>>
>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
>> > What happens if someone moves a task in a cgroup namespace outside of
>> > the namespace root cgroup?
>> >
>>
>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>> This is true irrespective of the privileges of the process attempting
>> this. Once cgroupns is created, the task will be confined to the
>> cgroup hierarchy under its cgroupns root until it dies.
>
> Can a task in a non-init userns create a cgroupns? If not, that's
> unusual. If so, is it problematic if they can prevent themselves from
> being moved?
>
Currently, only a task with CAP_SYS_ADMIN in the init-userns can
create cgroupns. It is stricter than for other namespaces, yes.
> I hate to say it, but it might be worth requiring explicit permission
> from the cgroup manager for this. For example, there could be a new
> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
> will fail with -EPERM unless the caller is in a may_share=1 cgroup.
> may_unshare in a parent cgroup would not give child cgroups the
> ability to unshare.
>
What you suggest can be done. The current patch-set punts the problem
of permission checking by only allowing unshare from a
capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
improvement to cgroupns feature if we want to open it to non-init
userns.
Being said that, I would argue that even if we don't have this
explicit permission and relax the check to non-init userns, it should
be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
you should probably be allowed to unshare() it). By unsharing
cgroupns, the tasks can only confine themselves further under its
cgroupns-root. As long as it cannot escape that hierarchy, it should
be fine.
In my experience, there is seldom a need to move tasks out of their
cgroup. At most, we create a sub-cgroup and move the task there (which
is allowed in their cgroupns). Even for a cgroup manager, I can't
think of a case where it will be useful to move a task from one cgroup
hierarchy to another. Such move seems overly complicated (even without
cgroup namespaces). The cgroup manager can just modify the settings of
the task's cgroup as needed or simply kill & restart the task in a new
container.
> --Andy
Thanks,
--
Aditya
On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <[email protected]> wrote:
> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <[email protected]> wrote:
>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>>>
>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
>>> > What happens if someone moves a task in a cgroup namespace outside of
>>> > the namespace root cgroup?
>>> >
>>>
>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>> This is true irrespective of the privileges of the process attempting
>>> this. Once cgroupns is created, the task will be confined to the
>>> cgroup hierarchy under its cgroupns root until it dies.
>>
>> Can a task in a non-init userns create a cgroupns? If not, that's
>> unusual. If so, is it problematic if they can prevent themselves from
>> being moved?
>>
>
> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
> create cgroupns. It is stricter than for other namespaces, yes.
I'm slightly hesitant to have unshare(CLONE_NEWUSER |
CLONE_NEWCGROUPNS | ...) start having weird side effects that are
visible outside the namespace, especially when those side effects
don't happen (because the call fails entirely) if
unshare(CLONE_NEWUSER) happens first. I don't see a real problem with
it, but it's weird.
>
>> I hate to say it, but it might be worth requiring explicit permission
>> from the cgroup manager for this. For example, there could be a new
>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>> will fail with -EPERM unless the caller is in a may_share=1 cgroup.
>> may_unshare in a parent cgroup would not give child cgroups the
>> ability to unshare.
>>
>
> What you suggest can be done. The current patch-set punts the problem
> of permission checking by only allowing unshare from a
> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
> improvement to cgroupns feature if we want to open it to non-init
> userns.
>
> Being said that, I would argue that even if we don't have this
> explicit permission and relax the check to non-init userns, it should
> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
> you should probably be allowed to unshare() it).
But non-init-userns tasks can't create cgroup hierarchies, unless I
misunderstand the current code. And, if they can, I bet I can find
three or four serious security issues in an hour or two. :)
> By unsharing
> cgroupns, the tasks can only confine themselves further under its
> cgroupns-root. As long as it cannot escape that hierarchy, it should
> be fine.
But they can also *lock* their hierarchy.
> In my experience, there is seldom a need to move tasks out of their
> cgroup. At most, we create a sub-cgroup and move the task there (which
> is allowed in their cgroupns). Even for a cgroup manager, I can't
> think of a case where it will be useful to move a task from one cgroup
> hierarchy to another. Such move seems overly complicated (even without
> cgroup namespaces). The cgroup manager can just modify the settings of
> the task's cgroup as needed or simply kill & restart the task in a new
> container.
>
I do this all the time. Maybe my new systemd overlords will make me
stop doing it, at which point my current production setup will blow
up.
--Andy
On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <[email protected]> wrote:
>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>>>>
>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>> > the namespace root cgroup?
>>>> >
>>>>
>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>> This is true irrespective of the privileges of the process attempting
>>>> this. Once cgroupns is created, the task will be confined to the
>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>
>>> Can a task in a non-init userns create a cgroupns? If not, that's
>>> unusual. If so, is it problematic if they can prevent themselves from
>>> being moved?
>>>
>>
>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>> create cgroupns. It is stricter than for other namespaces, yes.
>
> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
> visible outside the namespace, especially when those side effects
> don't happen (because the call fails entirely) if
> unshare(CLONE_NEWUSER) happens first. I don't see a real problem with
> it, but it's weird.
>
I expect this to be only in the initial version of the patch. We can
make this consistent with other namespaces once we figure out how
cgroupns can be safely enabled for non-init-userns.
>>
>>> I hate to say it, but it might be worth requiring explicit permission
>>> from the cgroup manager for this. For example, there could be a new
>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>> will fail with -EPERM unless the caller is in a may_share=1 cgroup.
>>> may_unshare in a parent cgroup would not give child cgroups the
>>> ability to unshare.
>>>
>>
>> What you suggest can be done. The current patch-set punts the problem
>> of permission checking by only allowing unshare from a
>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>> improvement to cgroupns feature if we want to open it to non-init
>> userns.
>>
>> Being said that, I would argue that even if we don't have this
>> explicit permission and relax the check to non-init userns, it should
>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>> you should probably be allowed to unshare() it).
>
> But non-init-userns tasks can't create cgroup hierarchies, unless I
> misunderstand the current code. And, if they can, I bet I can find
> three or four serious security issues in an hour or two. :)
>
Task running in non-init userns can create cgroup hierarchies if you
chown/chgrp their cgroup root to the task user:
# while running as 'root' (uid=0)
$ cd $CGROUP_MOUNT
$ mkdir -p batchjobs/c_job_id1/
# transfer ownership to the user (in this case 'nobody' (uid=99)).
$ chown nobody batchjobs/c_job_id1/
$ chgrp nobody batchjobs/c_job_id1/
$ ls -ld batchjobs/c_job_id1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 12:47 batchjobs/c_job_id1/
# enter container cgroup
$ echo 0 > batchjobs/c_job_id1/cgroup.procs
# unshare both userns and cgroupns
$ unshare -u -c
# setup uid_map and gid_map and export user '99' in the userns
# $ cat /proc/<pid>/uid_map
# 0 0 1
# 99 99 1
# $ cat /proc/<pid>/gid_map
# 0 0 1
# 99 99 1
# switch to user 'nobody'
$ su nobody
$ id
uid=99(nobody) gid=99(nobody) groups=99(nobody)
# Now user nobody running under non-init userns can create sub-cgroups
# under "batchjobs/c_job_id1/".
# PWD=$CGROUP_MOUNT/batchjobs/c_job_id1
$ mkdir sub_cgroup1
$ ls -ld sub_cgroup1/
drwxr-xr-x 2 nobody nobody 0 2014-07-21 13:11 sub_cgroup1/
$ echo 0 > sub_cgroup1/cgroup.procs
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgroup1
$ ls -l sub_cgroup1/
total 0
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.controllers
-r--r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.populated
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:12 cgroup.procs
-rw-r--r-- 1 nobody nobody 0 2014-07-21 13:11 cgroup.subtree_control
This is a powerful feature as it allows non-root tasks to run
container-management tools and provision their resources properly. But
this makes implementing your suggestion of having 'cgroup.may_unshare'
file tricky as the cgroup owner (task) will be able to set it and
still unshare cgroupns. Instead, may be we could just check if the
task has appropriate (write?) permissions on the cgroup directory
before allowing nested cgroupns creation.
>> By unsharing
>> cgroupns, the tasks can only confine themselves further under its
>> cgroupns-root. As long as it cannot escape that hierarchy, it should
>> be fine.
>
> But they can also *lock* their hierarchy.
>
But locking the tasks inside the hierarchy is really what cgroupns
feature is trying to provide. I understand that this is a change in
expectation, but with unified hierarchy, there are already
restrictions on where tasks can be moved (only to leaf cgroups). With
cgroup namespaces, this becomes: "only to leaf cgroups within task's
cgroupns".
>> In my experience, there is seldom a need to move tasks out of their
>> cgroup. At most, we create a sub-cgroup and move the task there (which
>> is allowed in their cgroupns). Even for a cgroup manager, I can't
>> think of a case where it will be useful to move a task from one cgroup
>> hierarchy to another. Such move seems overly complicated (even without
>> cgroup namespaces). The cgroup manager can just modify the settings of
>> the task's cgroup as needed or simply kill & restart the task in a new
>> container.
>>
>
> I do this all the time. Maybe my new systemd overlords will make me
> stop doing it, at which point my current production setup will blow
> up.
>
[shudder]
I am surprised that this even works correctly.
Either way, may be checking cgroup directory permissions will work for
you? i.e., if you "chown" a cgroup directory to the user, it should be
OK if the user's task unshares cgroupns under that cgroup and you
don't care about moving tasks from under that cgroup. Without
ownership of the cgroup directory, creation of cgroupns will be
disallowed. What do you think?
> --Andy
--
Aditya
On Mon, Jul 21, 2014 at 3:11 PM, Aditya Kali <[email protected]> wrote:
> On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <[email protected]> wrote:
>> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <[email protected]> wrote:
>>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <[email protected]> wrote:
>>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>>>>>
>>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>>> > the namespace root cgroup?
>>>>> >
>>>>>
>>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>>> This is true irrespective of the privileges of the process attempting
>>>>> this. Once cgroupns is created, the task will be confined to the
>>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>>
>>>> Can a task in a non-init userns create a cgroupns? If not, that's
>>>> unusual. If so, is it problematic if they can prevent themselves from
>>>> being moved?
>>>>
>>>
>>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>>> create cgroupns. It is stricter than for other namespaces, yes.
>>
>> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
>> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
>> visible outside the namespace, especially when those side effects
>> don't happen (because the call fails entirely) if
>> unshare(CLONE_NEWUSER) happens first. I don't see a real problem with
>> it, but it's weird.
>>
>
> I expect this to be only in the initial version of the patch. We can
> make this consistent with other namespaces once we figure out how
> cgroupns can be safely enabled for non-init-userns.
>
>>>
>>>> I hate to say it, but it might be worth requiring explicit permission
>>>> from the cgroup manager for this. For example, there could be a new
>>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>>> will fail with -EPERM unless the caller is in a may_share=1 cgroup.
>>>> may_unshare in a parent cgroup would not give child cgroups the
>>>> ability to unshare.
>>>>
>>>
>>> What you suggest can be done. The current patch-set punts the problem
>>> of permission checking by only allowing unshare from a
>>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>>> improvement to cgroupns feature if we want to open it to non-init
>>> userns.
>>>
>>> Being said that, I would argue that even if we don't have this
>>> explicit permission and relax the check to non-init userns, it should
>>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
>>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>>> you should probably be allowed to unshare() it).
>>
>> But non-init-userns tasks can't create cgroup hierarchies, unless I
>> misunderstand the current code. And, if they can, I bet I can find
>> three or four serious security issues in an hour or two. :)
>>
>
> Task running in non-init userns can create cgroup hierarchies if you
> chown/chgrp their cgroup root to the task user:
Won't the systemd people hate you forever for this suggestion? (I do
exactly this myself...)
> This is a powerful feature as it allows non-root tasks to run
> container-management tools and provision their resources properly. But
> this makes implementing your suggestion of having 'cgroup.may_unshare'
> file tricky as the cgroup owner (task) will be able to set it and
> still unshare cgroupns. Instead, may be we could just check if the
> task has appropriate (write?) permissions on the cgroup directory
> before allowing nested cgroupns creation.
I bet that systemd will want to set may_unshare but not give write
access. Who knows?
> [shudder]
> I am surprised that this even works correctly.
>
> Either way, may be checking cgroup directory permissions will work for
> you? i.e., if you "chown" a cgroup directory to the user, it should be
> OK if the user's task unshares cgroupns under that cgroup and you
> don't care about moving tasks from under that cgroup. Without
> ownership of the cgroup directory, creation of cgroupns will be
> disallowed. What do you think?
I think this is *safe* but may not useful for eventual systemd stuff.
Not really sure.
--Andy
On Mon, Jul 21, 2014 at 3:16 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Jul 21, 2014 at 3:11 PM, Aditya Kali <[email protected]> wrote:
>> On Fri, Jul 18, 2014 at 11:57 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Fri, Jul 18, 2014 at 11:51 AM, Aditya Kali <[email protected]> wrote:
>>>> On Fri, Jul 18, 2014 at 9:51 AM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Jul 17, 2014 1:56 PM, "Aditya Kali" <[email protected]> wrote:
>>>>>>
>>>>>> On Thu, Jul 17, 2014 at 12:57 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>> > What happens if someone moves a task in a cgroup namespace outside of
>>>>>> > the namespace root cgroup?
>>>>>> >
>>>>>>
>>>>>> Attempt to move a task outside of cgroupns root will fail with EPERM.
>>>>>> This is true irrespective of the privileges of the process attempting
>>>>>> this. Once cgroupns is created, the task will be confined to the
>>>>>> cgroup hierarchy under its cgroupns root until it dies.
>>>>>
>>>>> Can a task in a non-init userns create a cgroupns? If not, that's
>>>>> unusual. If so, is it problematic if they can prevent themselves from
>>>>> being moved?
>>>>>
>>>>
>>>> Currently, only a task with CAP_SYS_ADMIN in the init-userns can
>>>> create cgroupns. It is stricter than for other namespaces, yes.
>>>
>>> I'm slightly hesitant to have unshare(CLONE_NEWUSER |
>>> CLONE_NEWCGROUPNS | ...) start having weird side effects that are
>>> visible outside the namespace, especially when those side effects
>>> don't happen (because the call fails entirely) if
>>> unshare(CLONE_NEWUSER) happens first. I don't see a real problem with
>>> it, but it's weird.
>>>
>>
>> I expect this to be only in the initial version of the patch. We can
>> make this consistent with other namespaces once we figure out how
>> cgroupns can be safely enabled for non-init-userns.
>>
>>>>
>>>>> I hate to say it, but it might be worth requiring explicit permission
>>>>> from the cgroup manager for this. For example, there could be a new
>>>>> cgroup attribute may_unshare, and any attempt to unshare the cgroup ns
>>>>> will fail with -EPERM unless the caller is in a may_share=1 cgroup.
>>>>> may_unshare in a parent cgroup would not give child cgroups the
>>>>> ability to unshare.
>>>>>
>>>>
>>>> What you suggest can be done. The current patch-set punts the problem
>>>> of permission checking by only allowing unshare from a
>>>> capable(CAP_SYS_ADMIN) process. This can be implemented as a follow-up
>>>> improvement to cgroupns feature if we want to open it to non-init
>>>> userns.
>>>>
>>>> Being said that, I would argue that even if we don't have this
>>>> explicit permission and relax the check to non-init userns, it should
>>>> be 'OK' to let ns_capable(current_user_ns(), CAP_SYS_ADMIN) tasks to
>>>> unshare cgroupns (basically, if you can "create" a cgroup hierarchy,
>>>> you should probably be allowed to unshare() it).
>>>
>>> But non-init-userns tasks can't create cgroup hierarchies, unless I
>>> misunderstand the current code. And, if they can, I bet I can find
>>> three or four serious security issues in an hour or two. :)
>>>
>>
>> Task running in non-init userns can create cgroup hierarchies if you
>> chown/chgrp their cgroup root to the task user:
>
> Won't the systemd people hate you forever for this suggestion? (I do
> exactly this myself...)
>
I was actually thinking this feature will really simplify container
management tools (since cgroupns allows you to run them recursively
inside containers without any hacks). I would appreciate any feedback
from the people working on those tools on how we can improve this to
help their use case.
Thanks for your comments!
>
>> This is a powerful feature as it allows non-root tasks to run
>> container-management tools and provision their resources properly. But
>> this makes implementing your suggestion of having 'cgroup.may_unshare'
>> file tricky as the cgroup owner (task) will be able to set it and
>> still unshare cgroupns. Instead, may be we could just check if the
>> task has appropriate (write?) permissions on the cgroup directory
>> before allowing nested cgroupns creation.
>
> I bet that systemd will want to set may_unshare but not give write
> access. Who knows?
>
>> [shudder]
>> I am surprised that this even works correctly.
>>
>> Either way, may be checking cgroup directory permissions will work for
>> you? i.e., if you "chown" a cgroup directory to the user, it should be
>> OK if the user's task unshares cgroupns under that cgroup and you
>> don't care about moving tasks from under that cgroup. Without
>> ownership of the cgroup directory, creation of cgroupns will be
>> disallowed. What do you think?
>
> I think this is *safe* but may not be useful for eventual systemd stuff.
> Not really sure.
>
> --Andy
--
Aditya
Quoting Aditya Kali ([email protected]):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
>
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> fs/kernfs/dir.c | 51 ++++++++++++++++++++++++++++++++++++++++----------
> include/linux/kernfs.h | 3 +++
> 2 files changed, 44 insertions(+), 10 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..2224f08 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,22 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
> return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
> }
>
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> - size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> + struct kernfs_node *kn_root,
> + struct kernfs_node *kn,
> + char *buf,
> + size_t buflen)
> {
> char *p = buf + buflen;
> int len;
>
> *--p = '\0';
I realize this currently couldn't happen (hm, well, through the
EXPORT_SYMBOL_GPL(kernfs_path) export it actually could), and it's the same
in the current code, but could you add a BUG_ON(!buflen) here?
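Something like this is all I have in mind (sketch only; everything
other than the BUG_ON is exactly as in your patch):

    static char * __must_check kernfs_path_from_node_locked(
            struct kernfs_node *kn_root,
            struct kernfs_node *kn,
            char *buf,
            size_t buflen)
    {
            char *p = buf + buflen;
            int len;

            /* catch zero-sized buffers before *--p writes out of bounds */
            BUG_ON(!buflen);

            *--p = '\0';
            ...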
Otherwise looks good to me.
Acked-by: Serge E. Hallyn <[email protected]>
>
> + if (kn == kn_root) {
> + *--p = '/';
> + return p;
> + }
> +
> do {
> len = strlen(kn->name);
> if (p - buf < len + 1) {
> @@ -63,6 +71,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> memcpy(p, kn->name, len);
> *--p = '/';
> kn = kn->parent;
> + if (kn == kn_root)
> + break;
> } while (kn && kn->parent);
>
> return p;
> @@ -92,26 +102,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
> }
>
> /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
> * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
> * @buflen: size of @buf
> *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes. The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
> * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
> * and %NULL is returned.
> */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> + char *buf, size_t buflen)
> {
> unsigned long flags;
> char *p;
>
> spin_lock_irqsave(&kernfs_rename_lock, flags);
> - p = kernfs_path_locked(kn, buf, buflen);
> + p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
> spin_unlock_irqrestore(&kernfs_rename_lock, flags);
> return p;
> }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes. The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> + return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
> EXPORT_SYMBOL_GPL(kernfs_path);
>
> /**
> @@ -145,8 +176,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>
> spin_lock_irqsave(&kernfs_rename_lock, flags);
>
> - p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> - sizeof(kernfs_pr_cont_buf));
> + p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> + sizeof(kernfs_pr_cont_buf));
> if (p)
> pr_cont("%s", p);
> else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 20f4935..1627341 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -257,6 +257,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
> }
>
> int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> + struct kernfs_node *kn, char *buf,
> + size_t buflen);
> char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
> size_t buflen);
> void pr_cont_kernfs_name(struct kernfs_node *kn);
> --
> 2.0.0.526.g5318336
>
Quoting Aditya Kali ([email protected]):
> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
Hi,
A few questions/comments:
1. Based on this description, am I to understand that after doing a
cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
still mount the global root cgroup? Any plans on "changing" that?
Will attempts to change settings of a cgroup which is not under
our current ns be rejected? (That should be easy to do given your
patch 1/5). Sorry if it's done in the set, I'm jumping around...
2. What would be the repercussions of allowing cgroupns unshare so
long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
created your current ns cgroup? It'd be a shame if that wasn't
on the roadmap.
3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
makes me wonder whether it wouldn't be more appropriate to leave
/proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
(or somesuch) to provide the namespaced view. /proc/self/nscgroup
would simply be empty (or say (invalid) or (unreachable)) from a
sibling ns. That will give criu and admin tools like lxc/docker all
they need to do simple cgroup setup.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel.
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
>
> The new CGroup Namespace allows a process to “unshare” its cgroup
> hierarchy starting from the cgroup its currently in.
> For Ex:
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> $ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> [ns]$ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> # From within new cgroupns, process sees that its in the root cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
> # From global cgroupns:
> $ cat /proc/<pid>/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> The virtualization of /proc/self/cgroup file combined with restricting
> the view of cgroup hierarchy by bind-mounting for the
> $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
> $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
> cgroup view inside the container.
>
> In its current simplistic form, the cgroup namespaces provide
> following behavior:
>
> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> the process calling unshare is running.
> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> For the init_cgroup_ns, this is the real root (“/”) cgroup
> (identified in code as cgrp_dfl_root.cgrp).
>
> (2) The cgroupns-root cgroup does not change even if the namespace
> creator process later moves to a different cgroup.
> $ ~/unshare -c # unshare cgroupns in some cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> [ns]$ mkdir sub_cgrp_1
> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (3) Each process gets its CGROUPNS specific view of
> /proc/<pid>/cgroup.
> (a) Processes running inside the cgroup namespace will be able to see
> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> [ns]$ sleep 100000 & # From within unshared cgroupns
> [1] 7353
> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (b) From global cgroupns, the real cgroup path will be visible:
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> (c) From a sibling cgroupns, the real path will be visible:
> [ns2]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> (In correct container setup though, it should not be possible to
> access PIDs in another container in the first place. This can be
> detected changed if desired.)
>
> (4) Processes inside a cgroupns are not allowed to move out of the
> cgroupns-root. This is true even if a privileged process in global
> cgroupns tries to move the process out of its cgroupns-root.
>
> # From global cgroupns
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> -bash: echo: write error: Operation not permitted
>
> (5) setns() is not supported for cgroup namespace in the initial
> version.
>
> (6) When some thread from a multi-threaded process unshares its
> cgroup-namespace, the new cgroupns gets applied to the entire
> process (all the threads). This should be OK since
> unified-hierarchy only allows process-level containerization. So
> all the threads in the process will have the same cgroup. And both
> - changing cgroups and unsharing namespaces - are protected under
> threadgroup_lock(task).
>
> (7) The cgroup namespace is alive as long as there is atleast 1
> process inside it. When the last process exits, the cgroup
> namespace is destroyed. The cgroupns-root and the actual cgroups
> remain though.
>
> Implementation
> The current patch-set is based on top of Tejun's cgroup tree (for-next
> branch). Its fairly non-intrusive and provides above mentioned
> features.
>
> Possible extensions of CGROUPNS:
> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> capabilities to restrict cgroups to administrative users. CGroup
> namespaces could be of help here. With cgroup namespaces, it might
> be possible to delegate administration of sub-cgroups under a
> cgroupns-root to the cgroupns owner.
>
> (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
> command when ran from inside a cgroupns should only mount the
> hierarchy from cgroupns-root cgroup:
> $ mount -t cgroup cgroup <cgroup-mountpoint>
> # -o __DEVEL__sane_behavior should be implicit
>
> This is similar to how procfs can be mounted for every PIDNS. This
> may have some usecases.
>
> ---
> fs/kernfs/dir.c | 51 +++++++++++++---
> fs/proc/namespaces.c | 3 +
> include/linux/cgroup.h | 36 ++++++++++-
> include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
> include/linux/kernfs.h | 3 +
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 ++
> include/uapi/linux/sched.h | 3 +-
> init/Kconfig | 9 +++
> kernel/Makefile | 1 +
> kernel/cgroup.c | 75 +++++++++++++++++------
> kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 +++++-
> 14 files changed, 364 insertions(+), 34 deletions(-)
> create mode 100644 include/linux/cgroup_namespace.h
> create mode 100644 kernel/cgroup_namespace.c
>
> [PATCH 1/5] kernfs: Add API to get generate relative kernfs path
> [PATCH 2/5] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCH 3/5] cgroup: add function to get task's cgroup on default
> [PATCH 4/5] cgroup: export cgroup_get() and cgroup_put()
> [PATCH 5/5] cgroup: introduce cgroup namespaces
Quoting Aditya Kali ([email protected]):
> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel.
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
>
> The new CGroup Namespace allows a process to “unshare” its cgroup
> hierarchy starting from the cgroup its currently in.
> For Ex:
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> $ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> [ns]$ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> # From within new cgroupns, process sees that its in the root cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
> # From global cgroupns:
> $ cat /proc/<pid>/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> The virtualization of /proc/self/cgroup file combined with restricting
> the view of cgroup hierarchy by bind-mounting for the
> $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
> $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
> cgroup view inside the container.
>
> In its current simplistic form, the cgroup namespaces provide
> following behavior:
>
> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> the process calling unshare is running.
> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> For the init_cgroup_ns, this is the real root (“/”) cgroup
> (identified in code as cgrp_dfl_root.cgrp).
>
> (2) The cgroupns-root cgroup does not change even if the namespace
> creator process later moves to a different cgroup.
> $ ~/unshare -c # unshare cgroupns in some cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> [ns]$ mkdir sub_cgrp_1
> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (3) Each process gets its CGROUPNS specific view of
> /proc/<pid>/cgroup.
> (a) Processes running inside the cgroup namespace will be able to see
> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> [ns]$ sleep 100000 & # From within unshared cgroupns
> [1] 7353
> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (b) From global cgroupns, the real cgroup path will be visible:
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> (c) From a sibling cgroupns, the real path will be visible:
> [ns2]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> (In correct container setup though, it should not be possible to
> access PIDs in another container in the first place. This can be
> detected changed if desired.)
>
> (4) Processes inside a cgroupns are not allowed to move out of the
> cgroupns-root. This is true even if a privileged process in global
> cgroupns tries to move the process out of its cgroupns-root.
>
> # From global cgroupns
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> -bash: echo: write error: Operation not permitted
>
> (5) setns() is not supported for cgroup namespace in the initial
> version.
This combined with the full-path reporting for peer ns cgroups could make
for fun antics when attaching to an existing container (since we'd have
to unshare into a new ns cgroup with the same root as the container).
I understand you are implying this will be fixed soon though.
> (6) When some thread from a multi-threaded process unshares its
> cgroup-namespace, the new cgroupns gets applied to the entire
> process (all the threads). This should be OK since
> unified-hierarchy only allows process-level containerization. So
> all the threads in the process will have the same cgroup. And both
> - changing cgroups and unsharing namespaces - are protected under
> threadgroup_lock(task).
>
> (7) The cgroup namespace is alive as long as there is atleast 1
> process inside it. When the last process exits, the cgroup
> namespace is destroyed. The cgroupns-root and the actual cgroups
> remain though.
>
> Implementation
> The current patch-set is based on top of Tejun's cgroup tree (for-next
> branch). Its fairly non-intrusive and provides above mentioned
> features.
>
> Possible extensions of CGROUPNS:
> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> capabilities to restrict cgroups to administrative users. CGroup
> namespaces could be of help here. With cgroup namespaces, it might
> be possible to delegate administration of sub-cgroups under a
> cgroupns-root to the cgroupns owner.
That would be nice.
> (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
> command when ran from inside a cgroupns should only mount the
> hierarchy from cgroupns-root cgroup:
> $ mount -t cgroup cgroup <cgroup-mountpoint>
> # -o __DEVEL__sane_behavior should be implicit
>
> This is similar to how procfs can be mounted for every PIDNS. This
> may have some usecases.
Sorry - I see this answers the first part of a question in my previous email.
However, the question remains whether changes to limits in cgroups which
are not under our cgroup-ns-root are allowed.
Admittedly the current case with cgmanager is the same - in that it depends
on proper setup of the container - but cgmanager is geared to recommend
not mounting the cgroups in the container at all (and we can reject such
mounts in the container altogether with no loss in functionality) whereas
you are here encouraging such mounts. Which is fine - so long as you then
fully address the potential issues.
Quoting Aditya Kali ([email protected]):
> get_task_cgroup() returns the (reference counted) cgroup of the
> given task on the default hierarchy.
>
> Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
> ---
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 25 +++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b5223c5..707c302 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -591,6 +591,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
> }
>
> char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
> +struct cgroup *get_task_cgroup(struct task_struct *task);
>
> int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 1e94b71..1671345 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1937,6 +1937,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
> }
> EXPORT_SYMBOL_GPL(task_cgroup_path);
>
> +/*
> + * get_task_cgroup - returns the cgroup of the task in the default cgroup
> + * hierarchy.
> + *
> + * @task: target task
> + * This function returns the @task's cgroup on the default cgroup hierarchy. The
> + * returned cgroup has its reference incremented (by calling cgroup_get()). So
> + * the caller must cgroup_put() the obtained reference once it is done with it.
> + */
> +struct cgroup *get_task_cgroup(struct task_struct *task)
> +{
> + struct cgroup *cgrp;
> +
> + mutex_lock(&cgroup_mutex);
> + down_read(&css_set_rwsem);
> +
> + cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
> + cgroup_get(cgrp);
> +
> + up_read(&css_set_rwsem);
> + mutex_unlock(&cgroup_mutex);
> + return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(get_task_cgroup);
> +
> /* used to track tasks and other necessary states during migration */
> struct cgroup_taskset {
> /* the src and dst cset list running through cset->mg_node */
> --
> 2.0.0.526.g5318336
>
Quoting Aditya Kali ([email protected]):
> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>
This is fine and I'm not looking to bikeshed, but am wondering - did
you consider any other ways besides unshare (i.e. a new mount option
to cgroupfs)? If so, do you have a list of the downsides of those?
(I mainly ask because clone flags are still a scarce commodity)
> Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
> ---
> include/uapi/linux/sched.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..2f90d00 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -21,8 +21,7 @@
> #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
> #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
> #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> - and is now available for re-use. */
> +#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
> #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
> #define CLONE_NEWIPC 0x08000000 /* New ipcs */
> #define CLONE_NEWUSER 0x10000000 /* New user namespace */
> --
> 2.0.0.526.g5318336
>
Quoting Aditya Kali ([email protected]):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
>
> Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
> ---
> include/linux/cgroup.h | 17 +++++++++++++++++
> kernel/cgroup.c | 18 ------------------
> 2 files changed, 17 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 707c302..4ea477f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -530,6 +530,23 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
> return cgrp->root == &cgrp_dfl_root;
> }
>
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> + return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> + WARN_ON_ONCE(cgroup_is_dead(cgrp));
> + css_get(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> + css_put(&cgrp->self);
> +}
> +
> /* no synchronization, the result can only be used as a hint */
> static inline bool cgroup_has_tasks(struct cgroup *cgrp)
> {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 1671345..8552513 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -185,7 +185,6 @@ static int need_forkexit_callback __read_mostly;
> static struct cftype cgroup_dfl_base_files[];
> static struct cftype cgroup_legacy_base_files[];
>
> -static void cgroup_put(struct cgroup *cgrp);
> static int rebind_subsystems(struct cgroup_root *dst_root,
> unsigned int ss_mask);
> static int cgroup_destroy_locked(struct cgroup *cgrp);
> @@ -286,12 +285,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
> return cgroup_css(cgrp, ss);
> }
>
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> - return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
> struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
> {
> struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1029,17 +1022,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
> return mode;
> }
>
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> - WARN_ON_ONCE(cgroup_is_dead(cgrp));
> - css_get(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> - css_put(&cgrp->self);
> -}
> -
> /**
> * cgroup_refresh_child_subsys_mask - update child_subsys_mask
> * @cgrp: the target cgroup
> --
> 2.0.0.526.g5318336
>
Thank you for your review. I have tried to respond to both your emails here.
On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> Background
>> Cgroups and Namespaces are used together to create “virtual”
>> containers that isolates the host environment from the processes
>> running in container. But since cgroups themselves are not
>> “virtualized”, the task is always able to see global cgroups view
>> through cgroupfs mount and via /proc/self/cgroup file.
>>
> Hi,
>
> A few questions/comments:
>
> 1. Based on this description, am I to understand that after doing a
> cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> still mount the global root cgroup? Any plans on "changing" that?
This is suggested in the "Possible Extensions of CGROUPNS" section.
More details below.
> Will attempts to change settings of a cgroup which is not under
> our current ns be rejected? (That should be easy to do given your
> patch 1/5). Sorry if it's done in the set, I'm jumping around...
>
Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
'cgroup_rmdir' are prevented for cgroups outside of the cgroupns-root.
Reads and writes of actual cgroup properties are not prevented; the
usual permission checks continue to apply for those. I was hoping that
would be enough, but see more comments towards the end.
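Roughly, the check in those three paths amounts to something like the
following (an illustrative sketch only; the helper, the 'root_cgrp'
field and the parent walk are made-up names to show the intent, not
what the patches actually use):

    /*
     * Return true if @cgrp is the caller's cgroupns-root or lies
     * below it; attach/mkdir/rmdir on anything else gets -EPERM.
     */
    static bool cgroupns_may_modify(struct cgroup_namespace *ns,
                                    struct cgroup *cgrp)
    {
            struct cgroup *c;

            for (c = cgrp; c; c = c->parent)
                    if (c == ns->root_cgrp)
                            return true;
            return false;
    }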
> 2. What would be the repercussions of allowing cgroupns unshare so
> long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> created your current ns cgroup? It'd be a shame if that wasn't
> on the roadmap.
>
It's certainly on the roadmap, just that some logistics were not clear
at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
processes, we may need some kind of explicit permission from the
cgroup subsystem to allow this. One approach could be an explicit
cgroup.may_unshare setting. Alternatively, the cgroup directory (which
is going to become the cgroupns-root) ownership could also be used
here, i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns the
cgroup directory. There already seems to be a function that does a
similar check and might be sufficient:
/**
* capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
* @inode: The inode in question
* @cap: The capability in question
*
* Return true if the current task has the given capability targeted at
* its own user namespace and that the given inode's uid and gid are
* mapped into the current user namespace.
*/
bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
What do you think? We can enable this for non-init userns once this is
decided on.
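To make the second option concrete, here is a rough sketch of the
"ns_capable() && owns the cgroup directory" check (a hypothetical
helper; how we get at the cgroup directory's inode is glossed over,
and capable_wrt_inode_uidgid() above would be the mapping-based
variant of the second half):

    static bool may_unshare_cgroupns(struct inode *cgrp_dir_inode)
    {
            /* privileged within our own user namespace ... */
            if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
                    return false;
            /* ... and we own the directory that becomes the cgroupns-root */
            return uid_eq(current_fsuid(), cgrp_dir_inode->i_uid);
    }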
> 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
> makes me wonder whether it wouldn't be more appropriate to leave
> /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
> (or somesuch) to provide the namespaced view. /proc/self/nscgroup
> would simply be empty (or say (invalid) or (unreachable)) from a
> sibling ns. That will give criu and admin tools like lxc/docker all
> they need to do simple cgroup setup.
>
It may work for lxc/docker and new applications that use the new
interface. But it's difficult to change the numerous existing user
applications and libraries that depend on /proc/self/cgroup. Moreover,
even with the new interface, /proc/self/cgroup will continue to leak
system-level cgroup information. And fixing this leak is critical to
making the container migratable.
It's easy to correctly handle the read of /proc/<pid>/cgroup from a
sibling cgroupns. Instead of showing the unfiltered view, we could just
not show anything (the same behavior as when the cgroup hierarchy is
not mounted). Will that be more acceptable? I can make that change in
the next version of this series.
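Concretely, the /proc/<pid>/cgroup show path would just gain a filter
along these lines and print nothing when it returns false (sketch
only, not in the current series; 'root_cgrp' is a stand-in for
whatever the cgroupns patch ends up calling its root):

    /* Would a reader confined to cgroupns @ns see @cgrp at all? */
    static bool cgroup_visible_in_ns(struct cgroup *cgrp,
                                     struct cgroup_namespace *ns)
    {
            return cgroup_is_descendant(cgrp, ns->root_cgrp);
    }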
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> This exposure of cgroup names to the processes running inside a
>> container results in some problems:
>> (1) The container names are typically host-container-management-agent
>> (systemd, docker/libcontainer, etc.) data and leaking its name (or
>> leaking the hierarchy) reveals too much information about the host
>> system.
>> (2) It makes the container migration across machines (CRIU) more
>> difficult as the container names need to be unique across the
>> machines in the migration domain.
>> (3) It makes it difficult to run container management tools (like
>> docker/libcontainer, lmctfy, etc.) within virtual containers
>> without adding dependency on some state/agent present outside the
>> container.
>>
>> Note that the feature proposed here is completely different than the
>> “ns cgroup” feature which existed in the linux kernel until recently.
>> The ns cgroup also attempted to connect cgroups and namespaces by
>> creating a new cgroup every time a new namespace was created. It did
>> not solve any of the above mentioned problems and was later dropped
>> from the kernel.
>>
>> Introducing CGroup Namespaces
>> With unified cgroup hierarchy
>> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>> have a much more coherent cgroup view and its easy to associate a
>> container with a single cgroup. This also allows us to virtualize the
>> cgroup view for tasks inside the container.
>>
>> The new CGroup Namespace allows a process to “unshare” its cgroup
>> hierarchy starting from the cgroup its currently in.
>> For Ex:
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>> $ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>> [ns]$ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
>> # From within new cgroupns, process sees that its in the root cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>> # From global cgroupns:
>> $ cat /proc/<pid>/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> The virtualization of /proc/self/cgroup file combined with restricting
>> the view of cgroup hierarchy by bind-mounting for the
>> $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
>> $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
>> cgroup view inside the container.
>>
>> In its current simplistic form, the cgroup namespaces provide
>> following behavior:
>>
>> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>> the process calling unshare is running.
>> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>> For the init_cgroup_ns, this is the real root (“/”) cgroup
>> (identified in code as cgrp_dfl_root.cgrp).
>>
>> (2) The cgroupns-root cgroup does not change even if the namespace
>> creator process later moves to a different cgroup.
>> $ ~/unshare -c # unshare cgroupns in some cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> [ns]$ mkdir sub_cgrp_1
>> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (3) Each process gets its CGROUPNS specific view of
>> /proc/<pid>/cgroup.
>> (a) Processes running inside the cgroup namespace will be able to see
>> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> [ns]$ sleep 100000 & # From within unshared cgroupns
>> [1] 7353
>> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (b) From global cgroupns, the real cgroup path will be visible:
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>> (c) From a sibling cgroupns, the real path will be visible:
>> [ns2]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> (In correct container setup though, it should not be possible to
>> access PIDs in another container in the first place. This can be
>> detected changed if desired.)
>>
>> (4) Processes inside a cgroupns are not allowed to move out of the
>> cgroupns-root. This is true even if a privileged process in global
>> cgroupns tries to move the process out of its cgroupns-root.
>>
>> # From global cgroupns
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> # cgroupns-root for 7353 is /batchjobs/c_job_id1
>> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>> -bash: echo: write error: Operation not permitted
>>
>> (5) setns() is not supported for cgroup namespace in the initial
>> version.
>
> This combined with the full-path reporting for peer ns cgroups could make
> for fun antics when attaching to an existing container (since we'd have
> to unshare into a new ns cgroup with the same root as the container).
> I understand you are implying this will be fixed soon though.
>
I am thinking that setns() will only be allowed if
target_cgrpns->cgroupns_root is_descendant_of
current_cgrpns->cgroupns_root, i.e., you will only be able to setns()
to a cgroup namespace which is rooted deeper in the hierarchy than
your own (in addition to checking
capable_wrt_inode_uidgid(target_cgrpns_inode)).
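As a sketch, that rule is just a descendancy test between the two
namespaces' roots (helper and field names here are illustrative, not
from the series):

    /*
     * Only allow setns() "downward": the target namespace must be
     * rooted at or below our own cgroupns-root. The capability check
     * mentioned above is separate.
     */
    static bool cgroupns_setns_allowed(struct cgroup_namespace *cur,
                                       struct cgroup_namespace *target)
    {
            return cgroup_is_descendant(target->root_cgrp, cur->root_cgrp);
    }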
In addition to this, we need to decide whether it's OK for setns() to
also change the cgroup of the task. Consider the following example:
[A] ----> [B] ----> C
----> D
[A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
under cgroupns [A]) attempts to setns() to cgroupns [B], then its
cgroup should change from /A/D to /A/B. I am concerned about the
side-effects this might cause. Though otherwise, this is a very useful
feature for containers. One could argue that this is similar to
setns() to a mount-namespace which is pivot_root'd somewhere else (in
which case, the attaching task's root "/" moves implicitly with
setns).
Alternatively, we could only allow setns() if
target_cgrpns->cgroupns_root == current->cgroup. I.e., taking the above
example again, if a process in Cgroup D wants to setns() to cgroupns
[B], then it will first need to move to Cgroup B, and only then will
the setns() succeed. This makes sure that there is no implicit cgroup
move.
WDYT? I haven't prototyped this yet, but will send out a patch after
this series is accepted.
>> (6) When some thread from a multi-threaded process unshares its
>> cgroup-namespace, the new cgroupns gets applied to the entire
>> process (all the threads). This should be OK since
>> unified-hierarchy only allows process-level containerization. So
>> all the threads in the process will have the same cgroup. And both
>> - changing cgroups and unsharing namespaces - are protected under
>> threadgroup_lock(task).
>>
>> (7) The cgroup namespace is alive as long as there is atleast 1
>> process inside it. When the last process exits, the cgroup
>> namespace is destroyed. The cgroupns-root and the actual cgroups
>> remain though.
>>
>> Implementation
>> The current patch-set is based on top of Tejun's cgroup tree (for-next
>> branch). Its fairly non-intrusive and provides above mentioned
>> features.
>>
>> Possible extensions of CGROUPNS:
>> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>> capabilities to restrict cgroups to administrative users. CGroup
>> namespaces could be of help here. With cgroup namespaces, it might
>> be possible to delegate administration of sub-cgroups under a
>> cgroupns-root to the cgroupns owner.
>
> That would be nice.
>
>> (2) Provide a cgroupns specific cgroupfs mount. i.e., the following
>> command when ran from inside a cgroupns should only mount the
>> hierarchy from cgroupns-root cgroup:
>> $ mount -t cgroup cgroup <cgroup-mountpoint>
>> # -o __DEVEL__sane_behavior should be implicit
>>
>> This is similar to how procfs can be mounted for every PIDNS. This
>> may have some usecases.
>
> Sorry - I see this answers the first part of a question in my previous email.
> However, the question remains whether changes to limits in cgroups which
> are not under our cgroup-ns-root are allowed.
>
> Admittedly the current case with cgmanager is the same - in that it depends
> on proper setup of the container - but cgmanager is geared to recommend
> not mounting the cgroups in the container at all (and we can reject such
> mounts in the container altogether with no loss in functionality) whereas
> you are here encouraging such mounts. Which is fine - so long as you then
> fully address the potential issues.
It will be nice to have this, but frankly, it may add a bit of
complexity in the cgroup/kernfs code (I will have to prototype and
see). Also, the same behavior can be obtained simply by bind-mounting
the cgroupns-root inside the container. So I am currently inclined
towards rejecting such mounts in favor of simplicity.
Regarding disallowing writes to cgroup files outside of your
cgroupns-root, I think it should be possible to implement that easily.
I will include it in the next revision of this series.
Thanks,
--
Aditya
On Fri, Jul 25, 2014 at 12:29 PM, Aditya Kali <[email protected]> wrote:
> Thank you for your review. I have tried to respond to both your emails here.
>
> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <[email protected]> wrote:
>> 2. What would be the repercussions of allowing cgroupns unshare so
>> long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>> created your current ns cgroup? It'd be a shame if that wasn't
>> on the roadmap.
>>
>
> Its certainly on the roadmap, just that some logistics were not clear
> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> processes, we may need some kind of explicit permission from the
> cgroup subsystem to allow this. One approach could be an explicit
> cgroup.may_unshare setting. Alternatively, the cgroup directory (which
> is going to become the cgroupns-root) ownership could also be used
> here. i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns the
> cgroup directory. There seems to be already a function that allows
> similar thing and might be sufficient:
>
> /**
> * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
> * @inode: The inode in question
> * @cap: The capability in question
> *
> * Return true if the current task has the given capability targeted at
> * its own user namespace and that the given inode's uid and gid are
> * mapped into the current user namespace.
> */
> bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
>
> What do you think? We can enable this for non-init userns once this is
> decided on.
>
I think I'd rather it just check that it's owned by the userns owner
if we were going down that route. But maybe there's a good reason to
do it this way.
>
>> 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
>> makes me wonder whether it wouldn't be more appropriate to leave
>> /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
>> (or somesuch) to provide the namespaced view. /proc/self/nscgroup
>> would simply be empty (or say (invalid) or (unreachable)) from a
>> sibling ns. That will give criu and admin tools like lxc/docker all
>> they need to do simple cgroup setup.
>>
>
> It may work for lxc/docker and new applications that use the new
> interface. But its difficult to change numerous existing user
> applications and libraries that depend on /proc/self/cgroup. Moreover,
> even with the new interface, /proc/self/cgroup will continue to leak
> system level cgroup information. And fixing this leak is critical to
> make the container migratable.
>
> Its easy to correctly handle the read of /proc/<pid>/cgroup from a
> sibling cgroupns. Instead of showing unfiltered view, we could just
> not show anything (same behavior when the cgroup hierarchy is not
> mounted). Will that be more acceptable? I can make that change in the
> next version of this series.
>
>
>>> (5) setns() is not supported for cgroup namespace in the initial
>>> version.
>>
>> This combined with the full-path reporting for peer ns cgroups could make
>> for fun antics when attaching to an existing container (since we'd have
>> to unshare into a new ns cgroup with the same root as the container).
>> I understand you are implying this will be fixed soon though.
>>
>
> I am thinking the setns() will be only allowed if
> target_cgrpns->cgroupns_root is_descendant_of
> current_cgrpns->cgroupns_root. i.e., you will only be setns to a
> cgroup namespace which is rooted deeper in hierarchy than your own (in
> addition to checking capable_wrt_inode_uidgid(target_cgrpns_inode)).
I'm not sure why the capable_wrt_inode_uidgid is needed here -- I
imagine that the hierarchy check and the usual CAP_SYS_ADMIN check on
the cgroupns's userns would be sufficient.
>
> In addition to this, we need to decide whether its OK for setns() to
> also change the cgroup of the task. Consider following example:
>
> [A] ----> [B] ----> C
> ----> D
>
> [A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
> under cgroupns [A]) attempts to setns() to cgroupns [B], then its
> cgroup should change from /A/D to /A/B. I am concerned about the
> side-effects this might cause. Though otherwise, this is a very useful
> feature for containers. One could argue that this is similar to
> setns() to a mount-namespace which is pivot_root'd somewhere else (in
> which case, the attaching task's root "/" moves implicitly with
> setns).
Off the top of my head, I think that making setns do this would be too
magical. How about just requiring that you already be in (a
descendant of) the requested cgroupns's root cgroup if you try to
setns?
>
> Alternatively, we could only allow setns() if
> target_cgrpns->cgroupns_root == current->cgroup . I.e., taking above
> example again, if process in Cgroup D wants to setns() to cgroupns
> [B], then it will first need to move to Cgroup B, and only then the
> setns() will succeed. This makes sure that there is no implicit cgroup
> move.
I like this one, but I think that descendant cgroups should probably
be allowed, too.
--Andy
Quoting Aditya Kali ([email protected]):
> Thank you for your review. I have tried to respond to both your emails here.
>
> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <[email protected]> wrote:
> > Quoting Aditya Kali ([email protected]):
> >> Background
> >> Cgroups and Namespaces are used together to create “virtual”
> >> containers that isolates the host environment from the processes
> >> running in container. But since cgroups themselves are not
> >> “virtualized”, the task is always able to see global cgroups view
> >> through cgroupfs mount and via /proc/self/cgroup file.
> >>
> > Hi,
> >
> > A few questions/comments:
> >
> > 1. Based on this description, am I to understand that after doing a
> > cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> > still mount the global root cgroup? Any plans on "changing" that?
>
> This is suggested in the "Possible Extensions of CGROUPNS" section.
> More details below.
>
> > Will attempts to change settings of a cgroup which is not under
> > our current ns be rejected? (That should be easy to do given your
> > patch 1/5). Sorry if it's done in the set, I'm jumping around...
> >
>
> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
> read/write of actual cgroup properties are not prevented. Usual
> permission checks continue to apply for those. I was hoping that
> should be enough, but see more comments towards the end.
>
> > 2. What would be the repercussions of allowing cgroupns unshare so
> > long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> > created your current ns cgroup? It'd be a shame if that wasn't
> > on the roadmap.
> >
>
> Its certainly on the roadmap, just that some logistics were not clear
> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> processes, we may need some kind of explicit permission from the
> cgroup subsystem to allow this. One approach could be an explicit
So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
you're fine.
The only real problem I can think of with unsharing a cgroup_ns is that
you could lock a setuid-root application someplace it wasn't expecting.
The above check guarantees that you were privileged enough that you'd
better be trusted in this user namespace.
(Unless there is some possible interaction I'm overlooking)
> cgroup.may_unshare setting. Alternatively, the cgroup directory (which
> is going to become the cgroupns-root) ownership could also be used
> here. i.e., the process is ns_capable(CAP_SYS_ADMIN) && it owns the
> cgroup directory. There seems to be already a function that allows
> similar thing and might be sufficient:
>
> /**
> * capable_wrt_inode_uidgid - Check nsown_capable and uid and gid mapped
> * @inode: The inode in question
> * @cap: The capability in question
> *
> * Return true if the current task has the given capability targeted at
> * its own user namespace and that the given inode's uid and gid are
> * mapped into the current user namespace.
> */
> bool capable_wrt_inode_uidgid(const struct inode *inode, int cap)
>
> What do you think? We can enable this for non-init userns once this is
> decided on.
I don't think it's needed... (until you show how wrong I am above :)
> > 3. The un-namespaced view of /proc/self/cgroup from a sibling cgroupns
> > makes me wonder whether it wouldn't be more appropriate to leave
> > /proc/self/cgroup always un-filtered, and use /proc/self/nscgroup
> > (or somesuch) to provide the namespaced view. /proc/self/nscgroup
> > would simply be empty (or say (invalid) or (unreachable)) from a
> > sibling ns. That will give criu and admin tools like lxc/docker all
> > they need to do simple cgroup setup.
> >
>
> It may work for lxc/docker and new applications that use the new
> interface. But its difficult to change numerous existing user
> applications and libraries that depend on /proc/self/cgroup. Moreover,
> even with the new interface, /proc/self/cgroup will continue to leak
> system level cgroup information. And fixing this leak is critical to
> make the container migratable.
>
> Its easy to correctly handle the read of /proc/<pid>/cgroup from a
> sibling cgroupns. Instead of showing unfiltered view, we could just
> not show anything (same behavior when the cgroup hierarchy is not
> mounted). Will that be more acceptable? I can make that change in the
> next version of this series.
It'll be acceptable so long as setns(CLONE_NEWCGROUP) is supported.
> >> $ cat /proc/self/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >> This exposure of cgroup names to the processes running inside a
> >> container results in some problems:
> >> (1) The container names are typically host-container-management-agent
> >> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> >> leaking the hierarchy) reveals too much information about the host
> >> system.
> >> (2) It makes the container migration across machines (CRIU) more
> >> difficult as the container names need to be unique across the
> >> machines in the migration domain.
> >> (3) It makes it difficult to run container management tools (like
> >> docker/libcontainer, lmctfy, etc.) within virtual containers
> >> without adding dependency on some state/agent present outside the
> >> container.
> >>
> >> Note that the feature proposed here is completely different than the
> >> “ns cgroup” feature which existed in the linux kernel until recently.
> >> The ns cgroup also attempted to connect cgroups and namespaces by
> >> creating a new cgroup every time a new namespace was created. It did
> >> not solve any of the above mentioned problems and was later dropped
> >> from the kernel.
> >>
> >> Introducing CGroup Namespaces
> >> With unified cgroup hierarchy
> >> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> >> have a much more coherent cgroup view and its easy to associate a
> >> container with a single cgroup. This also allows us to virtualize the
> >> cgroup view for tasks inside the container.
> >>
> >> The new CGroup Namespace allows a process to “unshare” its cgroup
> >> hierarchy starting from the cgroup its currently in.
> >> For Ex:
> >> $ cat /proc/self/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >> $ ls -l /proc/self/ns/cgroup
> >> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> >> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> >> [ns]$ ls -l /proc/self/ns/cgroup
> >> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
> >> # From within new cgroupns, process sees that its in the root cgroup
> >> [ns]$ cat /proc/self/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >>
> >> # From global cgroupns:
> >> $ cat /proc/<pid>/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> >>
> >> The virtualization of /proc/self/cgroup file combined with restricting
> >> the view of cgroup hierarchy by bind-mounting for the
> >> $CGROUP_MOUNT/batchjobs/c_job_id1/ directory to
> >> $CONTAINER_CHROOT/sys/fs/cgroup/) should provide a completely isolated
> >> cgroup view inside the container.
> >>
> >> In its current simplistic form, the cgroup namespaces provide
> >> following behavior:
> >>
> >> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> >> the process calling unshare is running.
> >> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> >> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> >> For the init_cgroup_ns, this is the real root (“/”) cgroup
> >> (identified in code as cgrp_dfl_root.cgrp).
> >>
> >> (2) The cgroupns-root cgroup does not change even if the namespace
> >> creator process later moves to a different cgroup.
> >> $ ~/unshare -c # unshare cgroupns in some cgroup
> >> [ns]$ cat /proc/self/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> >> [ns]$ mkdir sub_cgrp_1
> >> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> >> [ns]$ cat /proc/self/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >> (3) Each process gets its CGROUPNS specific view of
> >> /proc/<pid>/cgroup.
> >> (a) Processes running inside the cgroup namespace will be able to see
> >> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> >> [ns]$ sleep 100000 & # From within unshared cgroupns
> >> [1] 7353
> >> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> >> [ns]$ cat /proc/7353/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
> >>
> >> (b) From global cgroupns, the real cgroup path will be visible:
> >> $ cat /proc/7353/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >>
> >> (c) From a sibling cgroupns, the real path will be visible:
> >> [ns2]$ cat /proc/7353/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >> (In a correct container setup though, it should not be possible to
> >> access PIDs in another container in the first place. This can be
> >> changed if desired.)
> >>
> >> (4) Processes inside a cgroupns are not allowed to move out of the
> >> cgroupns-root. This is true even if a privileged process in global
> >> cgroupns tries to move the process out of its cgroupns-root.
> >>
> >> # From global cgroupns
> >> $ cat /proc/7353/cgroup
> >> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> >> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> >> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> >> -bash: echo: write error: Operation not permitted
> >>
> >> (5) setns() is not supported for cgroup namespace in the initial
> >> version.
> >
> > This combined with the full-path reporting for peer ns cgroups could make
> > for fun antics when attaching to an existing container (since we'd have
> > to unshare into a new cgroup ns with the same root as the container).
> > I understand you are implying this will be fixed soon though.
> >
>
> I am thinking setns() will only be allowed if
> target_cgrpns->cgroupns_root is_descendant_of
> current_cgrpns->cgroupns_root. I.e., you will only be able to setns()
> to a cgroup namespace which is rooted deeper in the hierarchy than your
> own (in addition to checking capable_wrt_inode_uidgid(target_cgrpns_inode)).
Certainly.
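To spell out the containment that restriction implies on the paths a task
sees (purely illustrative; the real check would compare cgroup objects in the
hierarchy, not path strings):

#include <stdbool.h>
#include <string.h>

/* Does the cgroup at 'child' live at or below the cgroup at 'ancestor'?
 * E.g. "/batchjobs/c_job_id1/sub_cgrp_1" is under "/batchjobs/c_job_id1",
 * but "/batchjobs/c_job_id2" is not. */
static bool cgroup_path_is_under(const char *child, const char *ancestor)
{
	size_t len = strlen(ancestor);

	if (strcmp(ancestor, "/") == 0)		/* everything is under the root */
		return true;
	if (strncmp(child, ancestor, len) != 0)
		return false;
	return child[len] == '\0' || child[len] == '/';
}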
> In addition to this, we need to decide whether its OK for setns() to
> also change the cgroup of the task. Consider following example:
>
> [A] ----> [B] ----> C
> ----> D
>
> [A] and [B] are cgroupns-roots. Now, if a task in Cgroup D (which is
> under cgroupns [A]) attempts to setns() to cgroupns [B], then its
> cgroup should change from /A/D to /A/B. I am concerned about the
> side-effects this might cause. Though otherwise, this is a very useful
> feature for containers. One could argue that this is similar to
> setns() to a mount-namespace which is pivot_root'd somewhere else (in
> which case, the attaching task's root "/" moves implicitly with
> setns).
This is what I'd expect.
> Alternatively, we could only allow setns() if
> target_cgrpns->cgroupns_root == current->cgroup. I.e., taking the above
> example again, if a process in Cgroup D wants to setns() to cgroupns
> [B], then it will first need to move to Cgroup B, and only then will
> the setns() succeed. This makes sure that there is no implicit cgroup
> move.
I'm ok with the restriction if it makes the patchset easier for you -
i.e. you not having to man-handle me into another cgroup. Though I
wouldn't expect the locking for that to be an obstacle...
> WDYT? I haven't prototyped this yet, but will send out a patch after
> this series is accepted.
Either one is fine with me.
> >> (6) When some thread from a multi-threaded process unshares its
> >> cgroup-namespace, the new cgroupns gets applied to the entire
> >> process (all the threads). This should be OK since
> >> unified-hierarchy only allows process-level containerization. So
> >> all the threads in the process will have the same cgroup. And both
> >> - changing cgroups and unsharing namespaces - are protected under
> >> threadgroup_lock(task).
> >>
> >> (7) The cgroup namespace is alive as long as there is at least 1
> >> process inside it. When the last process exits, the cgroup
> >> namespace is destroyed. The cgroupns-root and the actual cgroups
> >> remain though.
> >>
> >> Implementation
> >> The current patch-set is based on top of Tejun's cgroup tree (for-next
> >> branch). It's fairly non-intrusive and provides the above-mentioned
> >> features.
> >>
> >> Possible extensions of CGROUPNS:
> >> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> >> capabilities to restrict cgroups to administrative users. CGroup
> >> namespaces could be of help here. With cgroup namespaces, it might
> >> be possible to delegate administration of sub-cgroups under a
> >> cgroupns-root to the cgroupns owner.
> >
> > That would be nice.
> >
> >> (2) Provide a cgroupns-specific cgroupfs mount. I.e., the following
> >> command, when run from inside a cgroupns, should only mount the
> >> hierarchy from the cgroupns-root cgroup:
> >> $ mount -t cgroup cgroup <cgroup-mountpoint>
> >> # -o __DEVEL__sane_behavior should be implicit
> >>
> >> This is similar to how procfs can be mounted for every PIDNS. This
> >> may have some usecases.
> >
> > Sorry - I see this answers the first part of a question in my previous email.
> > However, the question remains of whether changes to limits in cgroups which
> > are not under our cgroup-ns-root are allowed.
> >
> > Admittedly the current case with cgmanager is the same - in that it depends
> > on proper setup of the container - but cgmanager is geared to recommend
> > not mounting the cgroups in the container at all (and we can reject such
> > mounts in the container altogether with no loss in functionality) whereas
> > you are here encouraging such mounts. Which is fine - so long as you then
> > fully address the potential issues.
>
> It will be nice to have this, but frankly, it may add a bit of
> complexity in the cgroup/kernfs code (I will have to prototype and
> see). Also, the same behavior can be obtained simply by bind-mounting
> the cgroupns-root inside the container. So I am currently leaning
> towards rejecting such mounts in favor of simplicity.
Not having to track what to bind-mount where is a very nice
simplification though. In lxc with cgmanager, we are now able to always
simply bind-mount /sys/fs/cgroup/cgmanager from the host into the
container. Nothing more is needed for the container to be able to manage
its own cgroup and start its own containers. Likewise, if mount -t
cgroup were filtered to cgroupns, then lxc could simply not mount
anything into the container at all. If mount -t cgroup is not
filtered wrt cgroupns, then we'd have to go back to, at container start,
finding the mountpoint for every subsystem, calculating the container's
cgroup there, and bind-mounting them into the container.
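(For the record, the per-hierarchy dance that a filtered mount would let us
drop looks roughly like the following; the paths are purely illustrative and
not anything lxc actually hard-codes:)

#include <stdio.h>
#include <sys/mount.h>

/* Bind the container's own cgroup directory for one hierarchy over the
 * corresponding mountpoint inside its rootfs. */
static int bind_container_cgroup(const char *host_cgroup_dir,
				 const char *rootfs_cgroup_dir)
{
	if (mount(host_cgroup_dir, rootfs_cgroup_dir, NULL, MS_BIND, NULL) < 0) {
		perror("bind-mounting container cgroup");
		return -1;
	}
	return 0;
}

/* e.g. bind_container_cgroup("/sys/fs/cgroup/memory/batchjobs/c_job_id1",
 *                            "/path/to/rootfs/sys/fs/cgroup/memory");
 * ...repeated for every mounted hierarchy. */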
> Regarding disallowing writes to cgroup files outside of your
> cgroupns-root, I think it should be possible to implement it easily. I will
> include it in the next revision of this series.
Great - thanks.
-serge
On Mon, Jul 28, 2014 at 9:51 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> Thank you for your review. I have tried to respond to both your emails here.
>>
>> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <[email protected]> wrote:
>> > Quoting Aditya Kali ([email protected]):
>> >> Background
>> >> Cgroups and Namespaces are used together to create “virtual”
>> >> containers that isolates the host environment from the processes
>> >> running in container. But since cgroups themselves are not
>> >> “virtualized”, the task is always able to see global cgroups view
>> >> through cgroupfs mount and via /proc/self/cgroup file.
>> >>
>> > Hi,
>> >
>> > A few questions/comments:
>> >
>> > 1. Based on this description, am I to understand that after doing a
>> > cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
>> > still mount the global root cgroup? Any plans on "changing" that?
>>
>> This is suggested in the "Possible Extensions of CGROUPNS" section.
>> More details below.
>>
>> > Will attempts to change settings of a cgroup which is not under
>> > our current ns be rejected? (That should be easy to do given your
>> > patch 1/5). Sorry if it's done in the set, I'm jumping around...
>> >
>>
>> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
>> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
>> read/write of actual cgroup properties are not prevented. Usual
>> permission checks continue to apply for those. I was hoping that
>> should be enough, but see more comments towards the end.
>>
>> > 2. What would be the repercussions of allowing cgroupns unshare so
>> > long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
>> > created your current ns cgroup? It'd be a shame if that wasn't
>> > on the roadmap.
>> >
>>
>> It's certainly on the roadmap, just that some logistics were not clear
>> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
>> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
>> processes, we may need some kind of explicit permission from the
>> cgroup subsystem to allow this. One approach could be an explicit
>
> So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
> you're fine.
>
> The only real problem I can think of with unsharing a cgroup_ns is that
> you could lock a setuid-root application someplace it wasn't expecting.
> The above check guarantees that you were privileged enough that you'd
> better be trusted in this user namespace.
>
> (Unless there is some possible interaction I'm overlooking)
I think that, if it's done this way, you'd have to unshare cgroupns
before unsharing userns, since you forfeit that capability when you
unshare your userns. That means that the new cgroupns ends up being
associated w/ the root userns, which may not be what you want.
You could unshare both namespaces in one syscall and give that some
magic semantics, but that's kind of weird. It would be nice if you
could unshare your userns and temporarily retain caps in the parent,
but there is no such mechanism right now.
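For what it's worth, the only ordering that works under that rule is roughly
the following (a sketch; the CLONE_NEWCGROUP value is taken from patch 1/5
and is not in libc headers yet):

#define _GNU_SOURCE
#include <err.h>
#include <sched.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000	/* from this series; not upstream yet */
#endif

int main(void)
{
	/* Unshare the cgroupns while we still hold CAP_SYS_ADMIN in the
	 * current userns... */
	if (unshare(CLONE_NEWCGROUP) < 0)
		err(1, "unshare(CLONE_NEWCGROUP)");
	/* ...then drop into a new userns.  The cgroupns stays owned by the
	 * old userns, which may not be what you want. */
	if (unshare(CLONE_NEWUSER) < 0)
		err(1, "unshare(CLONE_NEWUSER)");
	return 0;
}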
--Andy
Quoting Andy Lutomirski ([email protected]):
> On Mon, Jul 28, 2014 at 9:51 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Aditya Kali ([email protected]):
> >> Thank you for your review. I have tried to respond to both your emails here.
> >>
> >> On Thu, Jul 24, 2014 at 9:36 AM, Serge Hallyn <[email protected]> wrote:
> >> > Quoting Aditya Kali ([email protected]):
> >> >> Background
> >> >> Cgroups and Namespaces are used together to create “virtual”
> >> >> containers that isolates the host environment from the processes
> >> >> running in container. But since cgroups themselves are not
> >> >> “virtualized”, the task is always able to see global cgroups view
> >> >> through cgroupfs mount and via /proc/self/cgroup file.
> >> >>
> >> > Hi,
> >> >
> >> > A few questions/comments:
> >> >
> >> > 1. Based on this description, am I to understand that after doing a
> >> > cgroupns unshare, 'mount -t cgroup cgroup /mnt' by default will
> >> > still mount the global root cgroup? Any plans on "changing" that?
> >>
> >> This is suggested in the "Possible Extensions of CGROUPNS" section.
> >> More details below.
> >>
> >> > Will attempts to change settings of a cgroup which is not under
> >> > our current ns be rejected? (That should be easy to do given your
> >> > patch 1/5). Sorry if it's done in the set, I'm jumping around...
> >> >
> >>
> >> Currently, only 'cgroup_attach_task', 'cgroup_mkdir' and
> >> 'cgroup_rmdir' of cgroups outside of cgroupns-root are prevented. The
> >> read/write of actual cgroup properties are not prevented. Usual
> >> permission checks continue to apply for those. I was hoping that
> >> should be enough, but see more comments towards the end.
> >>
> >> > 2. What would be the repercussions of allowing cgroupns unshare so
> >> > long as you have ns_capable(CAP_SYS_ADMIN) to the user_ns which
> >> > created your current ns cgroup? It'd be a shame if that wasn't
> >> > on the roadmap.
> >> >
> >>
> >> It's certainly on the roadmap, just that some logistics were not clear
> >> at this time. As pointed out by Andy Lutomirski on [PATCH 5/5] of this
> >> series, if we allow cgroupns creation to ns_capable(CAP_SYS_ADMIN)
> >> processes, we may need some kind of explicit permission from the
> >> cgroup subsystem to allow this. One approach could be an explicit
> >
> > So long as you do ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN) I think
> > you're fine.
> >
> > The only real problem I can think of with unsharing a cgroup_ns is that
> > you could lock a setuid-root application someplace it wasn't expecting.
> > The above check guarantees that you were privileged enough that you'd
> > better be trusted in this user namespace.
> >
> > (Unless there is some possible interaction I'm overlooking)
>
> I think that, if it's done this way, you'd have to unshare cgroupns
> before unsharing userns, since you forfeit that capability when you
> unshare your userns. That means that the new cgroupns ends up being
> associated w/ the root userns, which may not be what you want.
>
> You could unshare both namespaces in one syscall and give that some
> magic semantics, but that's kind of weird. It would be nice if you
> could unshare your userns and temporarily retain caps in the parent,
> but there is no such mechanism right now.
Hm, good point.
On Thu, Jul 24, 2014 at 10:01 AM, Serge Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>>
>
> This is fine and I'm not looking to bikeshed, but am wondering - did
> you consider any other ways besides unshare (i.e. a new mount option
> to cgroupfs)? If so, do you have a list of the downsides of those?
> (I mainly ask bc clone flags are still a scarce commodity)
>
I did consider a couple of other ways:
(1) having a cgroup.ns_root (or something) cgroup file. If this value
is '1', it would mean that all processes in it and its descendant
cgroups will have their cgroup paths in /proc/self/cgroup terminated
at this cgroup.
For ex:
[A] --> [B] --> C
| --> [D] --> E
[A], [B] and [D] have cgroup.ns_root = 1.
* all processes in cgroups C & E will see their cgroup path as /C and
/E respectively
* all processes in cgroups B & D will see their own cgroup path as /
In this model, it's easy to know what to show if a process is looking
at its own cgroup path (/proc/self/cgroup). It gets tricky when you
are looking at another process's /proc/<pid>/cgroup. We may be able to
come up with some hacky way to read the correct value, but depending
on the cgroupfs mount, it may not make sense.
One other major drawback of this approach is that "every" process in
the cgroup will now get a restricted view; i.e., you cannot change
cgroups without affecting your view. And this is undesirable for
administrative processes.
(2) Another idea that I didn't pursue further (and is a bit hacky as
above) was having cgroup.ns_procs (like cgroup.procs, but all the pids
in cgroup.ns_procs will have their /proc/self/cgroup restricted).
Writing a pid to cgroup.ns_procs implies that you are writing it to
cgroup.procs too. But, not vice versa. So, you could move yourself into
another cgroup by writing your pid to cgroup.procs, but not to
cgroup.ns_procs, thus preventing yourself from getting "rooted". This
was to solve the administrative-process issue in the above approach.
But I think this is very clunky too, and I find the semantics of this
approach to be non-intuitive. It almost looks like moving towards a
separate "ns" subsystem. But as we already know, that is a path to
failure.
I didn't think of using a mount option. I imagine the mount option
(something like -o root=/batchjobs/container_1) could be used to
restrict the visibility of cgroupfs inside the container's mount
namespace. I.e., the value you read from /proc/<pid>/cgroup now
depends on what mount namespace you are in. It's similar to a cgroup
namespace, just that the cgroupns_root is now stored in
'struct mnt_namespace' instead of a separate 'struct
cgroup_namespace'. But since a mount namespace inherits mounts from
its parent on creation, the first cgroupfs mount in a mount namespace
is now treated specially. Also, it's not possible to restrict cgroups
without a mount namespace now. This is interesting and may not be too
bad. I am willing to give this a try. But I feel the cgroup namespace
approach fits well in line with other namespaces in that it does one
thing - virtualize the view of the /proc/<pid>/cgroup file for
processes inside the namespace. The semantics are more intuitive as
they are similar to other namespaces.
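(Just to spell out the alternative: from userspace the mount-option variant
would presumably look something like the following; the "root=" option is
hypothetical and does not exist.)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Hypothetical: mount a cgroupfs view rooted at the container's
	 * cgroup.  No such "root=" option exists; this only sketches what
	 * the mount-option alternative would have looked like. */
	if (mount("cgroup", "/sys/fs/cgroup", "cgroup", 0,
		  "root=/batchjobs/container_1") < 0)
		perror("mount");
	return 0;
}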
Thanks,
>> Signed-off-by: Aditya Kali <[email protected]>
>
> Acked-by: Serge E. Hallyn <[email protected]>
>
>> ---
>> include/uapi/linux/sched.h | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
>> index 34f9d73..2f90d00 100644
>> --- a/include/uapi/linux/sched.h
>> +++ b/include/uapi/linux/sched.h
>> @@ -21,8 +21,7 @@
>> #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
>> #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
>> #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
>> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
>> - and is now available for re-use. */
>> +#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
>> #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
>> #define CLONE_NEWIPC 0x08000000 /* New ipcs */
>> #define CLONE_NEWUSER 0x10000000 /* New user namespace */
>> --
>> 2.0.0.526.g5318336
>>
--
Aditya