2014-10-13 21:30:44

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 0/8] CGroup Namespaces

Second take at the Cgroup Namespace patch-set.

Major changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
your cgroupns-root.

More details in the writeup below.

Background
Cgroups and Namespaces are used together to create “virtual”
containers that isolates the host environment from the processes
running in container. But since cgroups themselves are not
“virtualized”, the task is always able to see global cgroups view
through cgroupfs mount and via /proc/self/cgroup file.

$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

This exposure of cgroup names to the processes running inside a
container results in some problems:
(1) The container names are typically host-container-management-agent
(systemd, docker/libcontainer, etc.) data and leaking its name (or
leaking the hierarchy) reveals too much information about the host
system.
(2) It makes the container migration across machines (CRIU) more
difficult as the container names need to be unique across the
machines in the migration domain.
(3) It makes it difficult to run container management tools (like
docker/libcontainer, lmctfy, etc.) within virtual containers
without adding dependency on some state/agent present outside the
container.

Note that the feature proposed here is completely different than the
“ns cgroup” feature which existed in the linux kernel until recently.
The ns cgroup also attempted to connect cgroups and namespaces by
creating a new cgroup every time a new namespace was created. It did
not solve any of the above mentioned problems and was later dropped
from the kernel. Incidentally though, it used the same config option
name CONFIG_CGROUP_NS as used in my prototype!

Introducing CGroup Namespaces
With unified cgroup hierarchy
(Documentation/cgroups/unified-hierarchy.txt), the containers can now
have a much more coherent cgroup view and its easy to associate a
container with a single cgroup. This also allows us to virtualize the
cgroup view for tasks inside the container.

The new CGroup Namespace allows a process to “unshare” its cgroup
hierarchy starting from the cgroup its currently in.
For Ex:
$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
$ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
[ns]$ ls -l /proc/self/ns/cgroup
lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
cgroup:[4026532183]
# From within new cgroupns, process sees that its in the root cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/

# From global cgroupns:
$ cat /proc/<pid>/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1

# Unshare cgroupns along with userns and mountns
# Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
# sets up uid/gid map and exec’s /bin/bash
$ ~/unshare -c -u -m

# Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
# hierarchy.
[ns]$ mount -t cgroup cgroup /tmp/cgroup
[ns]$ ls -l /tmp/cgroup
total 0
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
-r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
-rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
-rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control

The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
filesystem root for the namespace specific cgroupfs mount.

The virtualization of /proc/self/cgroup file combined with restricting
the view of cgroup hierarchy by namespace-private cgroupfs mount
should provide a completely isolated cgroup view inside the container.

In its current form, the cgroup namespaces patcheset provides following
behavior:

(1) The “root” cgroup for a cgroup namespace is the cgroup in which
the process calling unshare is running.
For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
For the init_cgroup_ns, this is the real root (“/”) cgroup
(identified in code as cgrp_dfl_root.cgrp).

(2) The cgroupns-root cgroup does not change even if the namespace
creator process later moves to a different cgroup.
$ ~/unshare -c # unshare cgroupns in some cgroup
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
[ns]$ mkdir sub_cgrp_1
[ns]$ echo 0 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/self/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(3) Each process gets its CGROUPNS specific view of
/proc/<pid>/cgroup.
(a) Processes running inside the cgroup namespace will be able to see
cgroup paths (in /proc/self/cgroup) only inside their root cgroup
[ns]$ sleep 100000 & # From within unshared cgroupns
[1] 7353
[ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
[ns]$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1

(b) From global cgroupns, the real cgroup path will be visible:
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

(c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
path will be visible:
# ns2's cgroupns-root is at '/batchjobs/c_job_id2'
[ns2]$ cat /proc/7353/cgroup
[ns2]$
This is same as when cgroup hierarchy is not mounted at all.
(In correct container setup though, it should not be possible to
access PIDs in another container in the first place.)

(4) Processes inside a cgroupns are not allowed to move out of the
cgroupns-root. This is true even if a privileged process in global
cgroupns tries to move the process out of its cgroupns-root.

# From global cgroupns
$ cat /proc/7353/cgroup
0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
# cgroupns-root for 7353 is /batchjobs/c_job_id1
$ echo 7353 > batchjobs/c_job_id2/cgroup.procs
-bash: echo: write error: Operation not permitted

(5) Setns to another cgroup namespace is allowed only when:
(a) process has CAP_SYS_ADMIN in its current userns
(b) process has CAP_SYS_ADMIN in the target cgroupns' userns
(c) the process's current cgroup is a descendant cgroupns-root of the
target namespace.
(d) the target cgroupns-root is descendant of current cgroupns-root..
The last check (d) prevents processes from escaping their cgroupns-root by
attaching to parent cgroupns. Thus, setns is allowed only when the process
is trying to restrict itself to a deeper cgroup hierarchy.

(6) When some thread from a multi-threaded process unshares its
cgroup-namespace, the new cgroupns gets applied to the entire
process (all the threads). This should be OK since
unified-hierarchy only allows process-level containerization. So
all the threads in the process will have the same cgroup. And both
- changing cgroups and unsharing namespaces - are protected under
threadgroup_lock(task).

(7) The cgroup namespace is alive as long as there is atleast 1
process inside it. When the last process exits, the cgroup
namespace is destroyed. The cgroupns-root and the actual cgroups
remain though.

(8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
the unified cgroup hierarchy with cgroupns-root as the filesystem root.
The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
container management tools to be run inside the containers transparently.

Implementation
The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
branch). Its fairly non-intrusive and provides above mentioned
features.

Possible extensions of CGROUPNS:
(1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
capabilities to restrict cgroups to administrative users. CGroup
namespaces could be of help here. With cgroup namespaces, it might
be possible to delegate administration of sub-cgroups under a
cgroupns-root to the cgroupns owner.


---
fs/kernfs/dir.c | 53 +++++++++---
fs/kernfs/mount.c | 48 +++++++++++
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 41 +++++++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++
include/linux/kernfs.h | 5 ++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 +
include/uapi/linux/sched.h | 3 +-
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 ++++-
15 files changed, 518 insertions(+), 41 deletions(-)
create mode 100644 include/linux/cgroup_namespace.h
create mode 100644 kernel/cgroup_namespace.c

[PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
[PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
[PATCHv1 3/8] cgroup: add function to get task's cgroup on default
[PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
[PATCHv1 5/8] cgroup: introduce cgroup namespaces
[PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
[PATCHv1 7/8] cgroup: cgroup namespace setns support
[PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns


2014-10-13 21:24:17

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path

The new function kernfs_path_from_node() generates and returns
kernfs path of a given kernfs_node relative to a given parent
kernfs_node.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/dir.c | 53 ++++++++++++++++++++++++++++++++++++++++----------
include/linux/kernfs.h | 3 +++
2 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a693f5b..8655485 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}

-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+static char * __must_check kernfs_path_from_node_locked(
+ struct kernfs_node *kn_root,
+ struct kernfs_node *kn,
+ char *buf,
+ size_t buflen)
{
char *p = buf + buflen;
int len;

+ BUG_ON(!buflen);
+
*--p = '\0';

+ if (kn == kn_root) {
+ *--p = '/';
+ return p;
+ }
+
do {
len = strlen(kn->name);
if (p - buf < len + 1) {
@@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
memcpy(p, kn->name, len);
*--p = '/';
kn = kn->parent;
+ if (kn == kn_root)
+ break;
} while (kn && kn->parent);

return p;
@@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
}

/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
+ * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
+ * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
+ * then full path of @kn is returned.
+ * The path is built from the end of @buf so the returned pointer usually
* doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;

spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);

/**
@@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)

spin_lock_irqsave(&kernfs_rename_lock, flags);

- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 30faf79..3c2be75 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
}

int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:24:25

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy

get_task_cgroup() returns the (reference counted) cgroup of the
given task on the default hierarchy.

Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1d51968..80ed6e0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
}

char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
+struct cgroup *get_task_cgroup(struct task_struct *task);

int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cab7dc4..56d507b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
}
EXPORT_SYMBOL_GPL(task_cgroup_path);

+/*
+ * get_task_cgroup - returns the cgroup of the task in the default cgroup
+ * hierarchy.
+ *
+ * @task: target task
+ * This function returns the @task's cgroup on the default cgroup hierarchy. The
+ * returned cgroup has its reference incremented (by calling cgroup_get()). So
+ * the caller must cgroup_put() the obtained reference once it is done with it.
+ */
+struct cgroup *get_task_cgroup(struct task_struct *task)
+{
+ struct cgroup *cgrp;
+
+ mutex_lock(&cgroup_mutex);
+ down_read(&css_set_rwsem);
+
+ cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
+ cgroup_get(cgrp);
+
+ up_read(&css_set_rwsem);
+ mutex_unlock(&cgroup_mutex);
+ return cgrp;
+}
+EXPORT_SYMBOL_GPL(get_task_cgroup);
+
/* used to track tasks and other necessary states during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:24:30

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 7/8] cgroup: cgroup namespace setns support

setns on a cgroup namespace is allowed only if
* task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
* task's current cgroup is descendent of the target cgroupns-root
cgroup.
* target cgroupns-root is same as or deeper than task's current
cgroupns-root. This is so that the task cannot escape out of its
cgroupns-root. This also ensures that setns() only makes the task
get restricted to a deeper cgroup hierarchy.

Signed-off-by: Aditya Kali <[email protected]>
---
kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
index c16604f..c612946 100644
--- a/kernel/cgroup_namespace.c
+++ b/kernel/cgroup_namespace.c
@@ -80,8 +80,48 @@ err_out:

static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
{
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ struct cgroup_namespace *cgroup_ns = ns;
+ struct task_struct *task = current;
+ struct cgroup *cgrp = NULL;
+ int err = 0;
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+ !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(task);
+
+ cgrp = get_task_cgroup(task);
+
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto out_unlock;
+
+ /* Allow switch only if the task's current cgroup is descendant of the
+ * target cgroup_ns->root_cgrp.
+ */
+ if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
+ goto out_unlock;
+
+ /* Only allow setns to a cgroupns root-ed deeper than task's current
+ * cgroupns-root. This will make sure that tasks cannot escape their
+ * cgroupns by attaching to parent cgroupns.
+ */
+ if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
+ task_cgroupns_root(task)))
+ goto out_unlock;
+
+ err = 0;
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+out_unlock:
+ threadgroup_unlock(current);
+ if (cgrp)
+ cgroup_put(cgrp);
+ return err;
}

static void *cgroupns_get(struct task_struct *task)
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:24:38

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns

This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f973ae9..e334f45 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
}

+/**
+ * kernfs_make_root - create new root dentry for the given kernfs_node.
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can used used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct inode *inode;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ /* inode for the given kernfs_node should already exist. */
+ inode = ilookup(sb, kn->ino);
+ if (!inode) {
+ pr_debug("kernfs: could not get inode for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ /* instantiate and link root dentry */
+ dentry = d_obtain_root(inode);
+ if (!dentry) {
+ pr_debug("kernfs: could not get dentry for '");
+ pr_cont_kernfs_path(kn);
+ pr_cont("'.\n");
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* If this is a new dentry, set it up. We need kernfs_mutex because this
+ * may be called by callers other than kernfs_fill_super. */
+ mutex_lock(&kernfs_mutex);
+ if (!dentry->d_fsdata) {
+ kernfs_get(kn);
+ dentry->d_fsdata = kn;
+ } else {
+ WARN_ON(dentry->d_fsdata != kn);
+ }
+ mutex_unlock(&kernfs_mutex);
+
+ return dentry;
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 3c2be75..b9538e0 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2fc0dfa..ef27dc4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

memset(opts, 0, sizeof(*opts));

+ /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
+ * namespace.
+ */
+ if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+ opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
+ }
+
while ((token = strsep(&o, ",")) != NULL) {
nr_opts++;

@@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
- if (nr_opts != 1) {
+ if (nr_opts > 1) {
pr_err("sane_behavior: no other mount options allowed\n");
return -EINVAL;
}
@@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

+struct dentry *cgroupns_get_root(struct super_block *sb,
+ struct cgroup_namespace *ns)
+{
+ struct dentry *nsdentry;
+
+ nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
+ return nsdentry;
+}
+
static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
{
LIST_HEAD(tmp_links);
@@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns =
+ get_cgroup_ns(current->nsproxy->cgroup_ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1816,11 +1840,28 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);

- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }

dentry = kernfs_mount(fs_type, flags, root->kf_root,
CGROUP_SUPER_MAGIC, &new_sb);
+
+ if (!IS_ERR(dentry)) {
+ /* If this mount is for a non-init cgroup namespace, then
+ * Instead of root's dentry, we return the dentry specific to
+ * the cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+
+ nsdentry = cgroupns_get_root(dentry->d_sb, ns);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);

@@ -1833,6 +1874,7 @@ out_free:
deactivate_super(pinned_sb);
}

+ put_cgroup_ns(ns);
return dentry;
}

@@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct kobject *cgroup_kobj;
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:24:23

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali <[email protected]>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 34f9d73..2f90d00 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname group? */
#define CLONE_NEWIPC 0x08000000 /* New ipcs */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:25:31

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
of creation of the cgroup namespace. The task that creates the new
cgroup namespace and all its future children will now be restricted only
to the cgroup hierarchy under this root_cgrp.
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root.
This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
to create completely virtualized containers without leaking system
level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Signed-off-by: Aditya Kali <[email protected]>
---
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 18 +++++-
include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
init/Kconfig | 9 +++
kernel/Makefile | 1 +
kernel/cgroup.c | 11 ++++
kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
kernel/fork.c | 2 +-
kernel/nsproxy.c | 19 +++++-
11 files changed, 255 insertions(+), 4 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 8902609..e04ed4b 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+#ifdef CONFIG_CGROUP_NS
+ &cgroupns_operations,
+#endif
};

static const struct file_operations ns_file_operations = {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4a0eb2d..aa86495 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -22,6 +22,8 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/wait.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>

#ifdef CONFIG_CGROUPS

@@ -460,6 +462,13 @@ struct cftype {
#endif
};

+struct cgroup_namespace {
+ atomic_t count;
+ unsigned int proc_inum;
+ struct user_namespace *user_ns;
+ struct cgroup *root_cgrp;
+};
+
extern struct cgroup_root cgrp_dfl_root;
extern struct css_set init_css_set;

@@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}

+static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
+ struct cgroup *cgrp, char *buf,
+ size_t buflen)
+{
+ return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
+}
+
static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
size_t buflen)
{
- return kernfs_path(cgrp->kn, buf, buflen);
+ return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
}

static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
new file mode 100644
index 0000000..9f637fe
--- /dev/null
+++ b/include/linux/cgroup_namespace.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_CGROUP_NAMESPACE_H
+#define _LINUX_CGROUP_NAMESPACE_H
+
+#include <linux/nsproxy.h>
+#include <linux/cgroup.h>
+#include <linux/types.h>
+#include <linux/user_namespace.h>
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
+{
+ return tsk->nsproxy->cgroup_ns->root_cgrp;
+}
+
+#ifdef CONFIG_CGROUP_NS
+
+extern void free_cgroup_ns(struct cgroup_namespace *ns);
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+ return ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
+extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
+#else /* CONFIG_CGROUP_NS */
+
+static inline struct cgroup_namespace *get_cgroup_ns(
+ struct cgroup_namespace *ns)
+{
+ return &init_cgroup_ns;
+}
+
+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+}
+
+static inline struct cgroup_namespace *copy_cgroup_ns(
+ unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns) {
+ if (flags & CLONE_NEWCGROUP)
+ return ERR_PTR(-EINVAL);
+
+ return old_ns;
+}
+
+#endif /* CONFIG_CGROUP_NS */
+
+#endif /* _LINUX_CGROUP_NAMESPACE_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;

/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 34a1e10..e56dd73 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -6,6 +6,8 @@

struct pid_namespace;
struct nsproxy;
+struct task_struct;
+struct inode;

struct proc_ns_operations {
const char *name;
@@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

/*
* We always define these enumerators
@@ -37,6 +40,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};

#ifdef CONFIG_PROC_FS
diff --git a/init/Kconfig b/init/Kconfig
index e84c642..c3be001 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config CGROUP_NS
+ bool "CGroup Namespaces"
+ default n
+ help
+ This options enables CGroup Namespaces which can be used to isolate
+ cgroup paths. This feature is only useful when unified cgroup
+ hierarchy is in use (i.e. cgroups are mounted with sane_behavior
+ option).
+
endif # CGROUPS

config CHECKPOINT_RESTORE
diff --git a/kernel/Makefile b/kernel/Makefile
index dc5c775..75334f8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
+obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_UTS_NS) += utsname.o
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2b3e9f9..f8099b4 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,8 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/cgroup_namespace.h>

#include <linux/atomic.h>

@@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
bool is_add);

+struct cgroup_namespace init_cgroup_ns = {
+ .count = {
+ .counter = 1,
+ },
+ .proc_inum = PROC_CGROUP_INIT_INO,
+ .user_ns = &init_user_ns,
+ .root_cgrp = &cgrp_dfl_root.cgrp,
+};
+
/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
gfp_t gfp_mask)
diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
new file mode 100644
index 0000000..c16604f
--- /dev/null
+++ b/kernel/cgroup_namespace.c
@@ -0,0 +1,128 @@
+
+#include <linux/cgroup.h>
+#include <linux/cgroup_namespace.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+
+ new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (new_ns)
+ atomic_set(&new_ns->count, 1);
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ cgroup_put(ns->root_cgrp);
+ put_user_ns(ns->user_ns);
+ proc_free_inum(ns->proc_inum);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct cgroup *cgrp = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP))
+ return get_cgroup_ns(old_ns);
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ /* Prevent cgroup changes for this task. */
+ threadgroup_lock(current);
+
+ cgrp = get_task_cgroup(current);
+
+ /* Creating new CGROUPNS is supported only when unified hierarchy is in
+ * use. */
+ err = -EINVAL;
+ if (!cgroup_on_dfl(cgrp))
+ goto err_out_unlock;
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out_unlock;
+
+ err = proc_alloc_inum(&new_ns->proc_inum);
+ if (err)
+ goto err_out_unlock;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cgrp = cgrp;
+
+ threadgroup_unlock(current);
+
+ return new_ns;
+
+err_out_unlock:
+ threadgroup_unlock(current);
+err_out:
+ if (cgrp)
+ cgroup_put(cgrp);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static void *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ rcu_read_lock();
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ rcu_read_unlock();
+
+ return ns;
+}
+
+static void cgroupns_put(void *ns)
+{
+ put_cgroup_ns(ns);
+}
+
+static unsigned int cgroupns_inum(void *ns)
+{
+ struct cgroup_namespace *cgroup_ns = ns;
+
+ return cgroup_ns->proc_inum;
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+ .inum = cgroupns_inum,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
diff --git a/kernel/fork.c b/kernel/fork.c
index 0cf9cdb..cc06851 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing to
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index ef42d0a..a8b1970 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup_namespace.h>

static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+ .cgroup_ns = &init_cgroup_ns,
};

static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}

+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;

out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:25:55

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()

move cgroup_get() and cgroup_put() into cgroup.h so that
they can be called from other places.

Signed-off-by: Aditya Kali <[email protected]>
---
include/linux/cgroup.h | 22 ++++++++++++++++++++++
kernel/cgroup.c | 22 ----------------------
2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 80ed6e0..4a0eb2d 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
return cgrp->root == &cgrp_dfl_root;
}

+/* convenient tests for these bits */
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
+{
+ return !(cgrp->self.flags & CSS_ONLINE);
+}
+
+static inline void cgroup_get(struct cgroup *cgrp)
+{
+ WARN_ON_ONCE(cgroup_is_dead(cgrp));
+ css_get(&cgrp->self);
+}
+
+static inline bool cgroup_tryget(struct cgroup *cgrp)
+{
+ return css_tryget(&cgrp->self);
+}
+
+static inline void cgroup_put(struct cgroup *cgrp)
+{
+ css_put(&cgrp->self);
+}
+
/* no synchronization, the result can only be used as a hint */
static inline bool cgroup_has_tasks(struct cgroup *cgrp)
{
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 56d507b..2b3e9f9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
return cgroup_css(cgrp, ss);
}

-/* convenient tests for these bits */
-static inline bool cgroup_is_dead(const struct cgroup *cgrp)
-{
- return !(cgrp->self.flags & CSS_ONLINE);
-}
-
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
@@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
return mode;
}

-static void cgroup_get(struct cgroup *cgrp)
-{
- WARN_ON_ONCE(cgroup_is_dead(cgrp));
- css_get(&cgrp->self);
-}
-
-static bool cgroup_tryget(struct cgroup *cgrp)
-{
- return css_tryget(&cgrp->self);
-}
-
-static void cgroup_put(struct cgroup *cgrp)
-{
- css_put(&cgrp->self);
-}
-
/**
* cgroup_refresh_child_subsys_mask - update child_subsys_mask
* @cgrp: the target cgroup
--
2.1.0.rc2.206.gedb03e5

2014-10-13 21:30:35

by Aditya Kali

[permalink] [raw]
Subject: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

Restrict following operations within the calling tasks:
* cgroup_mkdir & cgroup_rmdir
* cgroup_attach_task
* writes to cgroup files outside of task's cgroupns-root

Also, read of /proc/<pid>/cgroup file is now restricted only
to tasks under same cgroupns-root. If a task tries to look
at cgroup of another task outside of its cgroupns-root, then
it won't be able to see anything for the default hierarchy.
This is same as if the cgroups are not mounted.

Signed-off-by: Aditya Kali <[email protected]>
---
kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f8099b4..2fc0dfa 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
struct task_struct *task;
int ret;

+ /* Only allow changing cgroups accessible within task's cgroup
+ * namespace. i.e. 'dst_cgrp' should be a descendant of task's
+ * cgroupns->root_cgrp. */
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ return -EPERM;
+
/* look up all src csets */
down_read(&css_set_rwsem);
rcu_read_lock();
@@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
struct cgroup_subsys_state *css;
int ret;

+ /* Reject writes to cgroup files outside of task's cgroupns-root. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ return -EINVAL;
+
if (cft->write)
return cft->write(of, buf, nbytes, off);

@@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
parent = cgroup_kn_lock_live(parent_kn);
if (!parent)
return -ENODEV;
+
+ /* Allow mkdir only within process's cgroup namespace root. */
+ if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
root = parent->root;

/* allocate the cgroup and its ID, 0 is reserved for the root */
@@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
if (!cgrp)
return 0;

+ /* Allow rmdir only within process's cgroup namespace root.
+ * The process can't delete its own root anyways. */
+ if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
+ cgroup_kn_unlock(kn);
+ return -EPERM;
+ }
+
ret = cgroup_destroy_locked(cgrp);

cgroup_kn_unlock(kn);
@@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
continue;

+ cgrp = task_cgroup_from_root(tsk, root);
+
+ /* The cgroup path on default hierarchy is shown only if it
+ * falls under current task's cgroupns-root.
+ */
+ if (root == &cgrp_dfl_root &&
+ !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
+ continue;
+
seq_printf(m, "%d:", root->hierarchy_id);
for_each_subsys(ss, ssid)
if (root->subsys_mask & (1 << ssid))
@@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m, "%sname=%s", count ? "," : "",
root->name);
seq_putc(m, ':');
- cgrp = task_cgroup_from_root(tsk, root);
path = cgroup_path(cgrp, buf, PATH_MAX);
if (!path) {
retval = -ENAMETOOLONG;
--
2.1.0.rc2.206.gedb03e5

2014-10-14 22:43:35

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 0/8] CGroup Namespaces

On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <[email protected]> wrote:
> Second take at the Cgroup Namespace patch-set.
>
> Major changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> your cgroupns-root.
>
> More details in the writeup below.
>
> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel. Incidentally though, it used the same config option
> name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
>
> The new CGroup Namespace allows a process to “unshare” its cgroup
> hierarchy starting from the cgroup its currently in.
> For Ex:
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> $ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> [ns]$ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
> cgroup:[4026532183]
> # From within new cgroupns, process sees that its in the root cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
> # From global cgroupns:
> $ cat /proc/<pid>/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> # Unshare cgroupns along with userns and mountns
> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> # sets up uid/gid map and exec’s /bin/bash
> $ ~/unshare -c -u -m
>
> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
> # hierarchy.
> [ns]$ mount -t cgroup cgroup /tmp/cgroup
> [ns]$ ls -l /tmp/cgroup
> total 0
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
> filesystem root for the namespace specific cgroupfs mount.
>
> The virtualization of /proc/self/cgroup file combined with restricting
> the view of cgroup hierarchy by namespace-private cgroupfs mount
> should provide a completely isolated cgroup view inside the container.
>
> In its current form, the cgroup namespaces patcheset provides following
> behavior:
>
> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> the process calling unshare is running.
> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> For the init_cgroup_ns, this is the real root (“/”) cgroup
> (identified in code as cgrp_dfl_root.cgrp).
>
> (2) The cgroupns-root cgroup does not change even if the namespace
> creator process later moves to a different cgroup.
> $ ~/unshare -c # unshare cgroupns in some cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> [ns]$ mkdir sub_cgrp_1
> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (3) Each process gets its CGROUPNS specific view of
> /proc/<pid>/cgroup.
> (a) Processes running inside the cgroup namespace will be able to see
> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> [ns]$ sleep 100000 & # From within unshared cgroupns
> [1] 7353
> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (b) From global cgroupns, the real cgroup path will be visible:
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1

This is a little weird. Not sure it's a problem.

>
> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
> path will be visible:
> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
> [ns2]$ cat /proc/7353/cgroup
> [ns2]$
> This is same as when cgroup hierarchy is not mounted at all.
> (In correct container setup though, it should not be possible to
> access PIDs in another container in the first place.)
>
> (4) Processes inside a cgroupns are not allowed to move out of the
> cgroupns-root. This is true even if a privileged process in global
> cgroupns tries to move the process out of its cgroupns-root.
>
> # From global cgroupns
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> -bash: echo: write error: Operation not permitted
>

>
> (6) When some thread from a multi-threaded process unshares its
> cgroup-namespace, the new cgroupns gets applied to the entire
> process (all the threads). This should be OK since
> unified-hierarchy only allows process-level containerization. So
> all the threads in the process will have the same cgroup. And both
> - changing cgroups and unsharing namespaces - are protected under
> threadgroup_lock(task).

This seems odd to me. Does unsharing the cgroupns unshare for all
tasks in the process? If not, then I think that it shouldn't change
the cgroup either.

What did you end up doing to grant permission to unshare the cgroup ns?

--Andy

2014-10-14 23:33:35

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 0/8] CGroup Namespaces

On Tue, Oct 14, 2014 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Oct 13, 2014 at 2:23 PM, Aditya Kali <[email protected]> wrote:
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes form RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>> your cgroupns-root.
>>
>> More details in the writeup below.
>>
>> Background
>> Cgroups and Namespaces are used together to create “virtual”
>> containers that isolates the host environment from the processes
>> running in container. But since cgroups themselves are not
>> “virtualized”, the task is always able to see global cgroups view
>> through cgroupfs mount and via /proc/self/cgroup file.
>>
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> This exposure of cgroup names to the processes running inside a
>> container results in some problems:
>> (1) The container names are typically host-container-management-agent
>> (systemd, docker/libcontainer, etc.) data and leaking its name (or
>> leaking the hierarchy) reveals too much information about the host
>> system.
>> (2) It makes the container migration across machines (CRIU) more
>> difficult as the container names need to be unique across the
>> machines in the migration domain.
>> (3) It makes it difficult to run container management tools (like
>> docker/libcontainer, lmctfy, etc.) within virtual containers
>> without adding dependency on some state/agent present outside the
>> container.
>>
>> Note that the feature proposed here is completely different than the
>> “ns cgroup” feature which existed in the linux kernel until recently.
>> The ns cgroup also attempted to connect cgroups and namespaces by
>> creating a new cgroup every time a new namespace was created. It did
>> not solve any of the above mentioned problems and was later dropped
>> from the kernel. Incidentally though, it used the same config option
>> name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>> With unified cgroup hierarchy
>> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>> have a much more coherent cgroup view and its easy to associate a
>> container with a single cgroup. This also allows us to virtualize the
>> cgroup view for tasks inside the container.
>>
>> The new CGroup Namespace allows a process to “unshare” its cgroup
>> hierarchy starting from the cgroup its currently in.
>> For Ex:
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>> $ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>> [ns]$ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>> cgroup:[4026532183]
>> # From within new cgroupns, process sees that its in the root cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>> # From global cgroupns:
>> $ cat /proc/<pid>/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> # Unshare cgroupns along with userns and mountns
>> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> # sets up uid/gid map and exec’s /bin/bash
>> $ ~/unshare -c -u -m
>>
>> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>> # hierarchy.
>> [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> [ns]$ ls -l /tmp/cgroup
>> total 0
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>> filesystem root for the namespace specific cgroupfs mount.
>>
>> The virtualization of /proc/self/cgroup file combined with restricting
>> the view of cgroup hierarchy by namespace-private cgroupfs mount
>> should provide a completely isolated cgroup view inside the container.
>>
>> In its current form, the cgroup namespaces patcheset provides following
>> behavior:
>>
>> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>> the process calling unshare is running.
>> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>> For the init_cgroup_ns, this is the real root (“/”) cgroup
>> (identified in code as cgrp_dfl_root.cgrp).
>>
>> (2) The cgroupns-root cgroup does not change even if the namespace
>> creator process later moves to a different cgroup.
>> $ ~/unshare -c # unshare cgroupns in some cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> [ns]$ mkdir sub_cgrp_1
>> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (3) Each process gets its CGROUPNS specific view of
>> /proc/<pid>/cgroup.
>> (a) Processes running inside the cgroup namespace will be able to see
>> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> [ns]$ sleep 100000 & # From within unshared cgroupns
>> [1] 7353
>> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (b) From global cgroupns, the real cgroup path will be visible:
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> This is a little weird. Not sure it's a problem.
>
>>
>> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>> path will be visible:
>> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>> [ns2]$ cat /proc/7353/cgroup
>> [ns2]$
>> This is same as when cgroup hierarchy is not mounted at all.
>> (In correct container setup though, it should not be possible to
>> access PIDs in another container in the first place.)
>>
>> (4) Processes inside a cgroupns are not allowed to move out of the
>> cgroupns-root. This is true even if a privileged process in global
>> cgroupns tries to move the process out of its cgroupns-root.
>>
>> # From global cgroupns
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> # cgroupns-root for 7353 is /batchjobs/c_job_id1
>> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>> -bash: echo: write error: Operation not permitted
>>
>
>>
>> (6) When some thread from a multi-threaded process unshares its
>> cgroup-namespace, the new cgroupns gets applied to the entire
>> process (all the threads). This should be OK since
>> unified-hierarchy only allows process-level containerization. So
>> all the threads in the process will have the same cgroup. And both
>> - changing cgroups and unsharing namespaces - are protected under
>> threadgroup_lock(task).
>
> This seems odd to me. Does unsharing the cgroupns unshare for all
> tasks in the process? If not, then I think that it shouldn't change
> the cgroup either.
>

Unsharing cgorupns unshares for all tasks in the process, yes.

The cgroup changes are protected by threadgroup_lock. So it made sense
to protect cgroupns changes (unshare or setns) by the same lock as we
don't want task's cgroup to change underneath while we are changing
its cgroup-namespace. No cgroup change happens during the
unshare/setns call.

> What did you end up doing to grant permission to unshare the cgroup ns?
>

Currently the only requirement is ns_capable(cgroupns->user_ns,
CAP_SYS_ADMIN). Its possible to refine this further, but for now I
just kept it simpler. I am looking into the explicit permission check
discussed previously (https://lkml.org/lkml/2014/7/29/402), but wanted
to get this out sooner.

> --Andy

Thanks,
--
Aditya

2014-10-16 16:07:40

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path

Quoting Aditya Kali ([email protected]):
> The new function kernfs_path_from_node() generates and returns
> kernfs path of a given kernfs_node relative to a given parent
> kernfs_node.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

(with or without my comment below taken)

> ---
> fs/kernfs/dir.c | 53 ++++++++++++++++++++++++++++++++++++++++----------
> include/linux/kernfs.h | 3 +++
> 2 files changed, 46 insertions(+), 10 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index a693f5b..8655485 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -44,14 +44,24 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
> return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
> }
>
> -static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> - size_t buflen)
> +static char * __must_check kernfs_path_from_node_locked(
> + struct kernfs_node *kn_root,
> + struct kernfs_node *kn,
> + char *buf,
> + size_t buflen)
> {
> char *p = buf + buflen;
> int len;
>
> + BUG_ON(!buflen);
> +
> *--p = '\0';
>
> + if (kn == kn_root) {
> + *--p = '/';
> + return p;
> + }
> +
> do {
> len = strlen(kn->name);
> if (p - buf < len + 1) {
> @@ -63,6 +73,8 @@ static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
> memcpy(p, kn->name, len);
> *--p = '/';
> kn = kn->parent;
> + if (kn == kn_root)
> + break;

I wonder if it would be clearer if you instead changed the while condition, i.e.

} while (kn && kn != kn_root && kn_parent);

i.e .it's not a special condition, just a part of the expected flow.

> } while (kn && kn->parent);
>
> return p;
> @@ -92,26 +104,47 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
> }
>
> /**
> - * kernfs_path - build full path of a given node
> + * kernfs_path_from_node - build path of node @kn relative to @kn_root.
> + * @kn_root: parent kernfs_node relative to which we need to build the path
> * @kn: kernfs_node of interest
> - * @buf: buffer to copy @kn's name into
> + * @buf: buffer to copy @kn's path into
> * @buflen: size of @buf
> *
> - * Builds and returns the full path of @kn in @buf of @buflen bytes. The
> - * path is built from the end of @buf so the returned pointer usually
> + * Builds and returns @kn's path relative to @kn_root. @kn_root is expected to
> + * be parent of @kn at some level. If this is not true or if @kn_root is NULL,
> + * then full path of @kn is returned.
> + * The path is built from the end of @buf so the returned pointer usually
> * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
> * and %NULL is returned.
> */
> -char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
> + char *buf, size_t buflen)
> {
> unsigned long flags;
> char *p;
>
> spin_lock_irqsave(&kernfs_rename_lock, flags);
> - p = kernfs_path_locked(kn, buf, buflen);
> + p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
> spin_unlock_irqrestore(&kernfs_rename_lock, flags);
> return p;
> }
> +EXPORT_SYMBOL_GPL(kernfs_path_from_node);
> +
> +/**
> + * kernfs_path - build full path of a given node
> + * @kn: kernfs_node of interest
> + * @buf: buffer to copy @kn's name into
> + * @buflen: size of @buf
> + *
> + * Builds and returns the full path of @kn in @buf of @buflen bytes. The
> + * path is built from the end of @buf so the returned pointer usually
> + * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
> + * and %NULL is returned.
> + */
> +char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
> +{
> + return kernfs_path_from_node(NULL, kn, buf, buflen);
> +}
> EXPORT_SYMBOL_GPL(kernfs_path);
>
> /**
> @@ -145,8 +178,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)
>
> spin_lock_irqsave(&kernfs_rename_lock, flags);
>
> - p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
> - sizeof(kernfs_pr_cont_buf));
> + p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
> + sizeof(kernfs_pr_cont_buf));
> if (p)
> pr_cont("%s", p);
> else
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 30faf79..3c2be75 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -258,6 +258,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)
> }
>
> int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
> +char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
> + struct kernfs_node *kn, char *buf,
> + size_t buflen);
> char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
> size_t buflen);
> void pr_cont_kernfs_name(struct kernfs_node *kn);
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-16 16:08:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

Quoting Aditya Kali ([email protected]):
> CLONE_NEWCGROUP will be used to create new cgroup namespace.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> include/uapi/linux/sched.h | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 34f9d73..2f90d00 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -21,8 +21,7 @@
> #define CLONE_DETACHED 0x00400000 /* Unused, ignored */
> #define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
> #define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
> -/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
> - and is now available for re-use. */
> +#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
> #define CLONE_NEWUTS 0x04000000 /* New utsname group? */
> #define CLONE_NEWIPC 0x08000000 /* New ipcs */
> #define CLONE_NEWUSER 0x10000000 /* New user namespace */
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-16 16:13:28

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 3/8] cgroup: add function to get task's cgroup on default hierarchy

Quoting Aditya Kali ([email protected]):
> get_task_cgroup() returns the (reference counted) cgroup of the
> given task on the default hierarchy.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 25 +++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 1d51968..80ed6e0 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -579,6 +579,7 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
> }
>
> char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen);
> +struct cgroup *get_task_cgroup(struct task_struct *task);
>
> int cgroup_add_dfl_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> int cgroup_add_legacy_cftypes(struct cgroup_subsys *ss, struct cftype *cfts);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index cab7dc4..56d507b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1916,6 +1916,31 @@ char *task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
> }
> EXPORT_SYMBOL_GPL(task_cgroup_path);
>
> +/*
> + * get_task_cgroup - returns the cgroup of the task in the default cgroup
> + * hierarchy.
> + *
> + * @task: target task
> + * This function returns the @task's cgroup on the default cgroup hierarchy. The
> + * returned cgroup has its reference incremented (by calling cgroup_get()). So
> + * the caller must cgroup_put() the obtained reference once it is done with it.
> + */
> +struct cgroup *get_task_cgroup(struct task_struct *task)
> +{
> + struct cgroup *cgrp;
> +
> + mutex_lock(&cgroup_mutex);
> + down_read(&css_set_rwsem);
> +
> + cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
> + cgroup_get(cgrp);
> +
> + up_read(&css_set_rwsem);
> + mutex_unlock(&cgroup_mutex);
> + return cgrp;
> +}
> +EXPORT_SYMBOL_GPL(get_task_cgroup);
> +
> /* used to track tasks and other necessary states during migration */
> struct cgroup_taskset {
> /* the src and dst cset list running through cset->mg_node */
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-16 16:14:25

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()

Quoting Aditya Kali ([email protected]):
> move cgroup_get() and cgroup_put() into cgroup.h so that
> they can be called from other places.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> include/linux/cgroup.h | 22 ++++++++++++++++++++++
> kernel/cgroup.c | 22 ----------------------
> 2 files changed, 22 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 80ed6e0..4a0eb2d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -521,6 +521,28 @@ static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
> return cgrp->root == &cgrp_dfl_root;
> }
>
> +/* convenient tests for these bits */
> +static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> +{
> + return !(cgrp->self.flags & CSS_ONLINE);
> +}
> +
> +static inline void cgroup_get(struct cgroup *cgrp)
> +{
> + WARN_ON_ONCE(cgroup_is_dead(cgrp));
> + css_get(&cgrp->self);
> +}
> +
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> +{
> + return css_tryget(&cgrp->self);
> +}
> +
> +static inline void cgroup_put(struct cgroup *cgrp)
> +{
> + css_put(&cgrp->self);
> +}
> +
> /* no synchronization, the result can only be used as a hint */
> static inline bool cgroup_has_tasks(struct cgroup *cgrp)
> {
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 56d507b..2b3e9f9 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -284,12 +284,6 @@ static struct cgroup_subsys_state *cgroup_e_css(struct cgroup *cgrp,
> return cgroup_css(cgrp, ss);
> }
>
> -/* convenient tests for these bits */
> -static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> -{
> - return !(cgrp->self.flags & CSS_ONLINE);
> -}
> -
> struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
> {
> struct cgroup *cgrp = of->kn->parent->priv;
> @@ -1002,22 +996,6 @@ static umode_t cgroup_file_mode(const struct cftype *cft)
> return mode;
> }
>
> -static void cgroup_get(struct cgroup *cgrp)
> -{
> - WARN_ON_ONCE(cgroup_is_dead(cgrp));
> - css_get(&cgrp->self);
> -}
> -
> -static bool cgroup_tryget(struct cgroup *cgrp)
> -{
> - return css_tryget(&cgrp->self);
> -}
> -
> -static void cgroup_put(struct cgroup *cgrp)
> -{
> - css_put(&cgrp->self);
> -}
> -
> /**
> * cgroup_refresh_child_subsys_mask - update child_subsys_mask
> * @cgrp: the target cgroup
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-16 16:37:11

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Quoting Aditya Kali ([email protected]):
> Introduce the ability to create new cgroup namespace. The newly created
> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
> of creation of the cgroup namespace. The task that creates the new
> cgroup namespace and all its future children will now be restricted only
> to the cgroup hierarchy under this root_cgrp.
> The main purpose of cgroup namespace is to virtualize the contents
> of /proc/self/cgroup file. Processes inside a cgroup namespace
> are only able to see paths relative to their namespace root.
> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
> to create completely virtualized containers without leaking system
> level cgroup hierarchy to the task.
> This patch only implements the 'unshare' part of the cgroupns.
>
> Signed-off-by: Aditya Kali <[email protected]>

I'm not sure that the CONFIG_CGROUP_NS is worthwhile. If you already
have cgroups in the kernel this won't add much in the way of memory
usage, right? And I think the 'experimental' argument has long since
been squashed. So I'd argue for simplifying this patch by removing
CONFIG_CGROUP_NS.

(more below)

> ---
> fs/proc/namespaces.c | 3 +
> include/linux/cgroup.h | 18 +++++-
> include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 ++
> init/Kconfig | 9 +++
> kernel/Makefile | 1 +
> kernel/cgroup.c | 11 ++++
> kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 +++++-
> 11 files changed, 255 insertions(+), 4 deletions(-)
>
> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
> index 8902609..e04ed4b 100644
> --- a/fs/proc/namespaces.c
> +++ b/fs/proc/namespaces.c
> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
> &userns_operations,
> #endif
> &mntns_operations,
> +#ifdef CONFIG_CGROUP_NS
> + &cgroupns_operations,
> +#endif
> };
>
> static const struct file_operations ns_file_operations = {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 4a0eb2d..aa86495 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -22,6 +22,8 @@
> #include <linux/seq_file.h>
> #include <linux/kernfs.h>
> #include <linux/wait.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
>
> #ifdef CONFIG_CGROUPS
>
> @@ -460,6 +462,13 @@ struct cftype {
> #endif
> };
>
> +struct cgroup_namespace {
> + atomic_t count;
> + unsigned int proc_inum;
> + struct user_namespace *user_ns;
> + struct cgroup *root_cgrp;
> +};
> +
> extern struct cgroup_root cgrp_dfl_root;
> extern struct css_set init_css_set;
>
> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
> return kernfs_name(cgrp->kn, buf, buflen);
> }
>
> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
> + struct cgroup *cgrp, char *buf,
> + size_t buflen)
> +{
> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
> +}
> +
> static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
> size_t buflen)
> {
> - return kernfs_path(cgrp->kn, buf, buflen);
> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
> }
>
> static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
> new file mode 100644
> index 0000000..9f637fe
> --- /dev/null
> +++ b/include/linux/cgroup_namespace.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_CGROUP_NAMESPACE_H
> +#define _LINUX_CGROUP_NAMESPACE_H
> +
> +#include <linux/nsproxy.h>
> +#include <linux/cgroup.h>
> +#include <linux/types.h>
> +#include <linux/user_namespace.h>
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
> +{
> + return tsk->nsproxy->cgroup_ns->root_cgrp;

Per the rules in nsproxy.h, you should be taking the task_lock here.

(If you are making assumptions about tsk then you need to state them
here - I only looked quickly enough that you pass in 'leader')

> +}
> +
> +#ifdef CONFIG_CGROUP_NS
> +
> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + if (ns)
> + atomic_inc(&ns->count);
> + return ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns && atomic_dec_and_test(&ns->count))
> + free_cgroup_ns(ns);
> +}
> +
> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns);
> +
> +#else /* CONFIG_CGROUP_NS */
> +
> +static inline struct cgroup_namespace *get_cgroup_ns(
> + struct cgroup_namespace *ns)
> +{
> + return &init_cgroup_ns;
> +}
> +
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> +}
> +
> +static inline struct cgroup_namespace *copy_cgroup_ns(
> + unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns) {
> + if (flags & CLONE_NEWCGROUP)
> + return ERR_PTR(-EINVAL);
> +
> + return old_ns;
> +}
> +
> +#endif /* CONFIG_CGROUP_NS */
> +
> +#endif /* _LINUX_CGROUP_NAMESPACE_H */
> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
> index 35fa08f..ac0d65b 100644
> --- a/include/linux/nsproxy.h
> +++ b/include/linux/nsproxy.h
> @@ -8,6 +8,7 @@ struct mnt_namespace;
> struct uts_namespace;
> struct ipc_namespace;
> struct pid_namespace;
> +struct cgroup_namespace;
> struct fs_struct;
>
> /*
> @@ -33,6 +34,7 @@ struct nsproxy {
> struct mnt_namespace *mnt_ns;
> struct pid_namespace *pid_ns_for_children;
> struct net *net_ns;
> + struct cgroup_namespace *cgroup_ns;
> };
> extern struct nsproxy init_nsproxy;
>
> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
> index 34a1e10..e56dd73 100644
> --- a/include/linux/proc_ns.h
> +++ b/include/linux/proc_ns.h
> @@ -6,6 +6,8 @@
>
> struct pid_namespace;
> struct nsproxy;
> +struct task_struct;
> +struct inode;
>
> struct proc_ns_operations {
> const char *name;
> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
> extern const struct proc_ns_operations pidns_operations;
> extern const struct proc_ns_operations userns_operations;
> extern const struct proc_ns_operations mntns_operations;
> +extern const struct proc_ns_operations cgroupns_operations;
>
> /*
> * We always define these enumerators
> @@ -37,6 +40,7 @@ enum {
> PROC_UTS_INIT_INO = 0xEFFFFFFEU,
> PROC_USER_INIT_INO = 0xEFFFFFFDU,
> PROC_PID_INIT_INO = 0xEFFFFFFCU,
> + PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
> };
>
> #ifdef CONFIG_PROC_FS
> diff --git a/init/Kconfig b/init/Kconfig
> index e84c642..c3be001 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
> Enable some debugging help. Currently it exports additional stat
> files in a cgroup which can be useful for debugging.
>
> +config CGROUP_NS
> + bool "CGroup Namespaces"
> + default n
> + help
> + This options enables CGroup Namespaces which can be used to isolate
> + cgroup paths. This feature is only useful when unified cgroup
> + hierarchy is in use (i.e. cgroups are mounted with sane_behavior
> + option).
> +
> endif # CGROUPS
>
> config CHECKPOINT_RESTORE
> diff --git a/kernel/Makefile b/kernel/Makefile
> index dc5c775..75334f8 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
> obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
> obj-$(CONFIG_COMPAT) += compat.o
> obj-$(CONFIG_CGROUPS) += cgroup.o
> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
> obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
> obj-$(CONFIG_CPUSETS) += cpuset.o
> obj-$(CONFIG_UTS_NS) += utsname.o
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2b3e9f9..f8099b4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -57,6 +57,8 @@
> #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
> #include <linux/kthread.h>
> #include <linux/delay.h>
> +#include <linux/proc_ns.h>
> +#include <linux/cgroup_namespace.h>
>
> #include <linux/atomic.h>
>
> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
> static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
> bool is_add);
>
> +struct cgroup_namespace init_cgroup_ns = {
> + .count = {
> + .counter = 1,
> + },
> + .proc_inum = PROC_CGROUP_INIT_INO,
> + .user_ns = &init_user_ns,

This might mean that you should bump the init_user_ns refcount.

> + .root_cgrp = &cgrp_dfl_root.cgrp,
> +};
> +
> /* IDR wrappers which synchronize using cgroup_idr_lock */
> static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
> gfp_t gfp_mask)
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> new file mode 100644
> index 0000000..c16604f
> --- /dev/null
> +++ b/kernel/cgroup_namespace.c
> @@ -0,0 +1,128 @@
> +
> +#include <linux/cgroup.h>
> +#include <linux/cgroup_namespace.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/nsproxy.h>
> +#include <linux/proc_ns.h>
> +
> +static struct cgroup_namespace *alloc_cgroup_ns(void)
> +{
> + struct cgroup_namespace *new_ns;
> +
> + new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
> + if (new_ns)
> + atomic_set(&new_ns->count, 1);
> + return new_ns;
> +}
> +
> +void free_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + cgroup_put(ns->root_cgrp);
> + put_user_ns(ns->user_ns);

This is a problem on error patch in copy_cgroup_ns. The
alloc_cgroup_ns() doesn't initialize these values, so if
you should fail in proc_alloc_inum() you'll show up here
with fandom values in ns->*.

> + proc_free_inum(ns->proc_inum);
> +}
> +EXPORT_SYMBOL(free_cgroup_ns);
> +
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns)
> +{
> + struct cgroup_namespace *new_ns = NULL;
> + struct cgroup *cgrp = NULL;
> + int err;
> +
> + BUG_ON(!old_ns);
> +
> + if (!(flags & CLONE_NEWCGROUP))
> + return get_cgroup_ns(old_ns);
> +
> + /* Allow only sysadmin to create cgroup namespace. */
> + err = -EPERM;
> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> + goto err_out;
> +
> + /* Prevent cgroup changes for this task. */
> + threadgroup_lock(current);
> +
> + cgrp = get_task_cgroup(current);
> +
> + /* Creating new CGROUPNS is supported only when unified hierarchy is in
> + * use. */

Oh, drat. Well, I'll take, it, but under protest :)

> + err = -EINVAL;
> + if (!cgroup_on_dfl(cgrp))
> + goto err_out_unlock;
> +
> + err = -ENOMEM;
> + new_ns = alloc_cgroup_ns();
> + if (!new_ns)
> + goto err_out_unlock;
> +
> + err = proc_alloc_inum(&new_ns->proc_inum);
> + if (err)
> + goto err_out_unlock;
> +
> + new_ns->user_ns = get_user_ns(user_ns);
> + new_ns->root_cgrp = cgrp;
> +
> + threadgroup_unlock(current);
> +
> + return new_ns;
> +
> +err_out_unlock:
> + threadgroup_unlock(current);
> +err_out:
> + if (cgrp)
> + cgroup_put(cgrp);
> + kfree(new_ns);
> + return ERR_PTR(err);
> +}
> +
> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> +{
> + pr_info("setns not supported for cgroup namespace");
> + return -EINVAL;
> +}
> +
> +static void *cgroupns_get(struct task_struct *task)
> +{
> + struct cgroup_namespace *ns = NULL;
> + struct nsproxy *nsproxy;
> +
> + rcu_read_lock();
> + nsproxy = task->nsproxy;
> + if (nsproxy) {
> + ns = nsproxy->cgroup_ns;
> + get_cgroup_ns(ns);
> + }
> + rcu_read_unlock();
> +
> + return ns;
> +}
> +
> +static void cgroupns_put(void *ns)
> +{
> + put_cgroup_ns(ns);
> +}
> +
> +static unsigned int cgroupns_inum(void *ns)
> +{
> + struct cgroup_namespace *cgroup_ns = ns;
> +
> + return cgroup_ns->proc_inum;
> +}
> +
> +const struct proc_ns_operations cgroupns_operations = {
> + .name = "cgroup",
> + .type = CLONE_NEWCGROUP,
> + .get = cgroupns_get,
> + .put = cgroupns_put,
> + .install = cgroupns_install,
> + .inum = cgroupns_inum,
> +};
> +
> +static __init int cgroup_namespaces_init(void)
> +{
> + return 0;
> +}
> +subsys_initcall(cgroup_namespaces_init);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0cf9cdb..cc06851 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
> if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
> CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
> - CLONE_NEWUSER|CLONE_NEWPID))
> + CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
> return -EINVAL;
> /*
> * Not implemented, but pretend it works if there is nothing to
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index ef42d0a..a8b1970 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -25,6 +25,7 @@
> #include <linux/proc_ns.h>
> #include <linux/file.h>
> #include <linux/syscalls.h>
> +#include <linux/cgroup_namespace.h>
>
> static struct kmem_cache *nsproxy_cachep;
>
> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
> #ifdef CONFIG_NET
> .net_ns = &init_net,
> #endif
> + .cgroup_ns = &init_cgroup_ns,
> };
>
> static inline struct nsproxy *create_nsproxy(void)
> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> goto out_pid;
> }
>
> + new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> + tsk->nsproxy->cgroup_ns);
> + if (IS_ERR(new_nsp->cgroup_ns)) {
> + err = PTR_ERR(new_nsp->cgroup_ns);
> + goto out_cgroup;
> + }
> +
> new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> if (IS_ERR(new_nsp->net_ns)) {
> err = PTR_ERR(new_nsp->net_ns);
> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
> return new_nsp;
>
> out_net:
> + if (new_nsp->cgroup_ns)
> + put_cgroup_ns(new_nsp->cgroup_ns);
> +out_cgroup:
> if (new_nsp->pid_ns_for_children)
> put_pid_ns(new_nsp->pid_ns_for_children);
> out_pid:
> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> struct nsproxy *new_ns;
>
> if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> - CLONE_NEWPID | CLONE_NEWNET)))) {
> + CLONE_NEWPID | CLONE_NEWNET |
> + CLONE_NEWCGROUP)))) {
> get_nsproxy(old_ns);
> return 0;
> }
> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
> put_ipc_ns(ns->ipc_ns);
> if (ns->pid_ns_for_children)
> put_pid_ns(ns->pid_ns_for_children);
> + if (ns->cgroup_ns)
> + put_cgroup_ns(ns->cgroup_ns);
> put_net(ns->net_ns);
> kmem_cache_free(nsproxy_cachep, ns);
> }
> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
> int err = 0;
>
> if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> - CLONE_NEWNET | CLONE_NEWPID)))
> + CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
> return 0;
>
> user_ns = new_cred ? new_cred->user_ns : current_user_ns();
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-16 21:12:46

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Quoting Aditya Kali ([email protected]):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
> over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
> cgroup.

What is the point of this?

If I'm a user logged into
/lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
a container which is in
/lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
then I will want to be able to enter the container's cgroup.
The container's cgroup root is under my own (satisfying the
below condition0 but my cgroup is not a descendent of the
container's cgroup.


> * target cgroupns-root is same as or deeper than task's current
> cgroupns-root. This is so that the task cannot escape out of its
> cgroupns-root. This also ensures that setns() only makes the task
> get restricted to a deeper cgroup hierarchy.
>
> Signed-off-by: Aditya Kali <[email protected]>
> ---
> kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>
> static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> {
> - pr_info("setns not supported for cgroup namespace");
> - return -EINVAL;
> + struct cgroup_namespace *cgroup_ns = ns;
> + struct task_struct *task = current;
> + struct cgroup *cgrp = NULL;
> + int err = 0;
> +
> + if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> + !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /* Prevent cgroup changes for this task. */
> + threadgroup_lock(task);
> +
> + cgrp = get_task_cgroup(task);
> +
> + err = -EINVAL;
> + if (!cgroup_on_dfl(cgrp))
> + goto out_unlock;
> +
> + /* Allow switch only if the task's current cgroup is descendant of the
> + * target cgroup_ns->root_cgrp.
> + */
> + if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> + goto out_unlock;
> +
> + /* Only allow setns to a cgroupns root-ed deeper than task's current
> + * cgroupns-root. This will make sure that tasks cannot escape their
> + * cgroupns by attaching to parent cgroupns.
> + */
> + if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> + task_cgroupns_root(task)))
> + goto out_unlock;
> +
> + err = 0;
> + get_cgroup_ns(cgroup_ns);
> + put_cgroup_ns(nsproxy->cgroup_ns);
> + nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> + threadgroup_unlock(current);
> + if (cgrp)
> + cgroup_put(cgrp);
> + return err;
> }
>
> static void *cgroupns_get(struct task_struct *task)
> --
> 2.1.0.rc2.206.gedb03e5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-10-16 21:17:42

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>> over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>> cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition0 but my cgroup is not a descendent of the
> container's cgroup.
>

Presumably you need to ask your friendly cgroup manager to stick you
in that cgroup first. Or we need to generally allow tasks to move
themselves deeper in the hierarchy, but that seems like a big change.

--Andy

>
>> * target cgroupns-root is same as or deeper than task's current
>> cgroupns-root. This is so that the task cannot escape out of its
>> cgroupns-root. This also ensures that setns() only makes the task
>> get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <[email protected]>
>> ---
>> kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>> static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> {
>> - pr_info("setns not supported for cgroup namespace");
>> - return -EINVAL;
>> + struct cgroup_namespace *cgroup_ns = ns;
>> + struct task_struct *task = current;
>> + struct cgroup *cgrp = NULL;
>> + int err = 0;
>> +
>> + if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> + !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + /* Prevent cgroup changes for this task. */
>> + threadgroup_lock(task);
>> +
>> + cgrp = get_task_cgroup(task);
>> +
>> + err = -EINVAL;
>> + if (!cgroup_on_dfl(cgrp))
>> + goto out_unlock;
>> +
>> + /* Allow switch only if the task's current cgroup is descendant of the
>> + * target cgroup_ns->root_cgrp.
>> + */
>> + if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> + goto out_unlock;
>> +
>> + /* Only allow setns to a cgroupns root-ed deeper than task's current
>> + * cgroupns-root. This will make sure that tasks cannot escape their
>> + * cgroupns by attaching to parent cgroupns.
>> + */
>> + if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> + task_cgroupns_root(task)))
>> + goto out_unlock;
>> +
>> + err = 0;
>> + get_cgroup_ns(cgroup_ns);
>> + put_cgroup_ns(nsproxy->cgroup_ns);
>> + nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> + threadgroup_unlock(current);
>> + if (cgrp)
>> + cgroup_put(cgrp);
>> + return err;
>> }
>>
>> static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-16 21:22:41

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> setns on a cgroup namespace is allowed only if
>> * task has CAP_SYS_ADMIN in its current user-namespace and
>> over the user-namespace associated with target cgroupns.
>> * task's current cgroup is descendent of the target cgroupns-root
>> cgroup.
>
> What is the point of this?
>
> If I'm a user logged into
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> a container which is in
> /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> then I will want to be able to enter the container's cgroup.
> The container's cgroup root is under my own (satisfying the
> below condition0 but my cgroup is not a descendent of the
> container's cgroup.
>
This condition is there because we don't want to do implicit cgroup
changes when a process attaches to another cgroupns. cgroupns tries to
preserve the invariant that at any point, your current cgroup is
always under the cgroupns-root of your cgroup namespace. But in your
example, if we allow a process in "session-c12.scope" container to
attach to cgroupns root'ed at "session-c12.scope/x1" container
(without implicitly moving its cgroup), then this invariant won't
hold.

>
>> * target cgroupns-root is same as or deeper than task's current
>> cgroupns-root. This is so that the task cannot escape out of its
>> cgroupns-root. This also ensures that setns() only makes the task
>> get restricted to a deeper cgroup hierarchy.
>>
>> Signed-off-by: Aditya Kali <[email protected]>
>> ---
>> kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 42 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> index c16604f..c612946 100644
>> --- a/kernel/cgroup_namespace.c
>> +++ b/kernel/cgroup_namespace.c
>> @@ -80,8 +80,48 @@ err_out:
>>
>> static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> {
>> - pr_info("setns not supported for cgroup namespace");
>> - return -EINVAL;
>> + struct cgroup_namespace *cgroup_ns = ns;
>> + struct task_struct *task = current;
>> + struct cgroup *cgrp = NULL;
>> + int err = 0;
>> +
>> + if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
>> + !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
>> + return -EPERM;
>> +
>> + /* Prevent cgroup changes for this task. */
>> + threadgroup_lock(task);
>> +
>> + cgrp = get_task_cgroup(task);
>> +
>> + err = -EINVAL;
>> + if (!cgroup_on_dfl(cgrp))
>> + goto out_unlock;
>> +
>> + /* Allow switch only if the task's current cgroup is descendant of the
>> + * target cgroup_ns->root_cgrp.
>> + */
>> + if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
>> + goto out_unlock;
>> +
>> + /* Only allow setns to a cgroupns root-ed deeper than task's current
>> + * cgroupns-root. This will make sure that tasks cannot escape their
>> + * cgroupns by attaching to parent cgroupns.
>> + */
>> + if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
>> + task_cgroupns_root(task)))
>> + goto out_unlock;
>> +
>> + err = 0;
>> + get_cgroup_ns(cgroup_ns);
>> + put_cgroup_ns(nsproxy->cgroup_ns);
>> + nsproxy->cgroup_ns = cgroup_ns;
>> +
>> +out_unlock:
>> + threadgroup_unlock(current);
>> + if (cgrp)
>> + cgroup_put(cgrp);
>> + return err;
>> }
>>
>> static void *cgroupns_get(struct task_struct *task)
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/



--
Aditya

2014-10-16 21:47:13

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Quoting Aditya Kali ([email protected]):
> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]> wrote:
> > Quoting Aditya Kali ([email protected]):
> >> setns on a cgroup namespace is allowed only if
> >> * task has CAP_SYS_ADMIN in its current user-namespace and
> >> over the user-namespace associated with target cgroupns.
> >> * task's current cgroup is descendent of the target cgroupns-root
> >> cgroup.
> >
> > What is the point of this?
> >
> > If I'm a user logged into
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
> > a container which is in
> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
> > then I will want to be able to enter the container's cgroup.
> > The container's cgroup root is under my own (satisfying the
> > below condition0 but my cgroup is not a descendent of the
> > container's cgroup.
> >
> This condition is there because we don't want to do implicit cgroup
> changes when a process attaches to another cgroupns. cgroupns tries to
> preserve the invariant that at any point, your current cgroup is
> always under the cgroupns-root of your cgroup namespace. But in your
> example, if we allow a process in "session-c12.scope" container to
> attach to cgroupns root'ed at "session-c12.scope/x1" container
> (without implicitly moving its cgroup), then this invariant won't
> hold.

Oh, I see. Guess that should be workable. Thanks.

-serge

2014-10-17 09:28:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

Quoting Aditya Kali ([email protected]):
> Restrict following operations within the calling tasks:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files outside of task's cgroupns-root
>
> Also, read of /proc/<pid>/cgroup file is now restricted only
> to tasks under same cgroupns-root. If a task tries to look
> at cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is same as if the cgroups are not mounted.
>
> Signed-off-by: Aditya Kali <[email protected]>

So this is a bit different from some other namespaces - if I
have an open fd to a file, then setns into a mntns where that
file is not addressable, I can still use the file.

I guess not allowing attach to a cgroup outside our ns is a
good failsafe as we'll otherwise risk falling off a cliff in
some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
restrictions are needed. (And really I can fchdir to a
directory not in my ns, so the cgroup-attach restriction is
any more justified).

Still I'm not strictly opposed ot this, so

Acked-by: Serge Hallyn <[email protected]>

just wanted to point this out.

> ---
> kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
> struct task_struct *task;
> int ret;
>
> + /* Only allow changing cgroups accessible within task's cgroup
> + * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> + * cgroupns->root_cgrp. */
> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> + return -EPERM;
> +
> /* look up all src csets */
> down_read(&css_set_rwsem);
> rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> struct cgroup_subsys_state *css;
> int ret;
>
> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + return -EINVAL;
> +
> if (cft->write)
> return cft->write(of, buf, nbytes, off);
>
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> parent = cgroup_kn_lock_live(parent_kn);
> if (!parent)
> return -ENODEV;
> +
> + /* Allow mkdir only within process's cgroup namespace root. */
> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> +
> root = parent->root;
>
> /* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
> if (!cgrp)
> return 0;
>
> + /* Allow rmdir only within process's cgroup namespace root.
> + * The process can't delete its own root anyways. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> + cgroup_kn_unlock(kn);
> + return -EPERM;
> + }
> +
> ret = cgroup_destroy_locked(cgrp);
>
> cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
> continue;
>
> + cgrp = task_cgroup_from_root(tsk, root);
> +
> + /* The cgroup path on default hierarchy is shown only if it
> + * falls under current task's cgroupns-root.
> + */
> + if (root == &cgrp_dfl_root &&
> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + continue;
> +
> seq_printf(m, "%d:", root->hierarchy_id);
> for_each_subsys(ss, ssid)
> if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> seq_printf(m, "%sname=%s", count ? "," : "",
> root->name);
> seq_putc(m, ':');
> - cgrp = task_cgroup_from_root(tsk, root);
> path = cgroup_path(cgrp, buf, PATH_MAX);
> if (!path) {
> retval = -ENAMETOOLONG;
> --
> 2.1.0.rc2.206.gedb03e5
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-17 09:52:29

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Quoting Aditya Kali ([email protected]):
> setns on a cgroup namespace is allowed only if
> * task has CAP_SYS_ADMIN in its current user-namespace and
> over the user-namespace associated with target cgroupns.
> * task's current cgroup is descendent of the target cgroupns-root
> cgroup.
> * target cgroupns-root is same as or deeper than task's current
> cgroupns-root. This is so that the task cannot escape out of its
> cgroupns-root. This also ensures that setns() only makes the task
> get restricted to a deeper cgroup hierarchy.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

Below you allow setns to your own cgroupns. I think that's fine,
but since you're not doing an explicit cgroup change anyway should
you just return 0 at top in that case to save some cpu time?

> ---
> kernel/cgroup_namespace.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
> index c16604f..c612946 100644
> --- a/kernel/cgroup_namespace.c
> +++ b/kernel/cgroup_namespace.c
> @@ -80,8 +80,48 @@ err_out:
>
> static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
> {
> - pr_info("setns not supported for cgroup namespace");
> - return -EINVAL;
> + struct cgroup_namespace *cgroup_ns = ns;
> + struct task_struct *task = current;
> + struct cgroup *cgrp = NULL;
> + int err = 0;
> +
> + if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> + !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> + return -EPERM;
> +
> + /* Prevent cgroup changes for this task. */
> + threadgroup_lock(task);
> +
> + cgrp = get_task_cgroup(task);
> +
> + err = -EINVAL;
> + if (!cgroup_on_dfl(cgrp))
> + goto out_unlock;
> +
> + /* Allow switch only if the task's current cgroup is descendant of the
> + * target cgroup_ns->root_cgrp.
> + */
> + if (!cgroup_is_descendant(cgrp, cgroup_ns->root_cgrp))
> + goto out_unlock;
> +
> + /* Only allow setns to a cgroupns root-ed deeper than task's current
> + * cgroupns-root. This will make sure that tasks cannot escape their
> + * cgroupns by attaching to parent cgroupns.
> + */
> + if (!cgroup_is_descendant(cgroup_ns->root_cgrp,
> + task_cgroupns_root(task)))
> + goto out_unlock;
> +
> + err = 0;
> + get_cgroup_ns(cgroup_ns);
> + put_cgroup_ns(nsproxy->cgroup_ns);
> + nsproxy->cgroup_ns = cgroup_ns;
> +
> +out_unlock:
> + threadgroup_unlock(current);
> + if (cgrp)
> + cgroup_put(cgrp);
> + return err;
> }
>
> static void *cgroupns_get(struct task_struct *task)
> --
> 2.1.0.rc2.206.gedb03e5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-10-17 12:20:01

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns

Quoting Aditya Kali ([email protected]):
> This patch enables cgroup mounting inside userns when a process
> as appropriate privileges. The cgroup filesystem mounted is
> rooted at the cgroupns-root. Thus, in a container-setup, only
> the hierarchy under the cgroupns-root is exposed inside the container.
> This allows container management tools to run inside the containers
> without depending on any global state.
> In order to support this, a new kernfs api is added to lookup the
> dentry for the cgroupns-root.
>
> Signed-off-by: Aditya Kali <[email protected]>

Acked-by: Serge Hallyn <[email protected]>

> ---
> fs/kernfs/mount.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 2 ++
> kernel/cgroup.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index f973ae9..e334f45 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +62,54 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> return NULL;
> }
>
> +/**
> + * kernfs_make_root - create new root dentry for the given kernfs_node.
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can used used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn)
> +{
> + struct dentry *dentry;
> + struct inode *inode;
> +
> + BUG_ON(sb->s_op != &kernfs_sops);
> +
> + /* inode for the given kernfs_node should already exist. */
> + inode = ilookup(sb, kn->ino);
> + if (!inode) {
> + pr_debug("kernfs: could not get inode for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-EINVAL);
> + }
> +
> + /* instantiate and link root dentry */
> + dentry = d_obtain_root(inode);
> + if (!dentry) {
> + pr_debug("kernfs: could not get dentry for '");
> + pr_cont_kernfs_path(kn);
> + pr_cont("'.\n");
> + return ERR_PTR(-ENOMEM);
> + }
> +
> + /* If this is a new dentry, set it up. We need kernfs_mutex because this
> + * may be called by callers other than kernfs_fill_super. */
> + mutex_lock(&kernfs_mutex);
> + if (!dentry->d_fsdata) {
> + kernfs_get(kn);
> + dentry->d_fsdata = kn;
> + } else {
> + WARN_ON(dentry->d_fsdata != kn);
> + }
> + mutex_unlock(&kernfs_mutex);
> +
> + return dentry;
> +}
> +
> static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
> {
> struct kernfs_super_info *info = kernfs_info(sb);
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 3c2be75..b9538e0 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -274,6 +274,8 @@ void kernfs_put(struct kernfs_node *kn);
> struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
> struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
>
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn);
> struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
> unsigned int flags, void *priv);
> void kernfs_destroy_root(struct kernfs_root *root);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2fc0dfa..ef27dc4 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1302,6 +1302,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> memset(opts, 0, sizeof(*opts));
>
> + /* Implicitly add CGRP_ROOT_SANE_BEHAVIOR if inside a non-init cgroup
> + * namespace.
> + */
> + if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
> + opts->flags |= CGRP_ROOT_SANE_BEHAVIOR;
> + }
> +
> while ((token = strsep(&o, ",")) != NULL) {
> nr_opts++;
>
> @@ -1391,7 +1398,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
>
> if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
> pr_warn("sane_behavior: this is still under development and its behaviors will change, proceed at your own risk\n");
> - if (nr_opts != 1) {
> + if (nr_opts > 1) {
> pr_err("sane_behavior: no other mount options allowed\n");
> return -EINVAL;
> }
> @@ -1581,6 +1588,15 @@ static void init_cgroup_root(struct cgroup_root *root,
> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> }
>
> +struct dentry *cgroupns_get_root(struct super_block *sb,
> + struct cgroup_namespace *ns)
> +{
> + struct dentry *nsdentry;
> +
> + nsdentry = kernfs_obtain_root(sb, ns->root_cgrp->kn);
> + return nsdentry;
> +}
> +
> static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
> {
> LIST_HEAD(tmp_links);
> @@ -1684,6 +1700,14 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> int ret;
> int i;
> bool new_sb;
> + struct cgroup_namespace *ns =
> + get_cgroup_ns(current->nsproxy->cgroup_ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }
>
> /*
> * The first time anyone tries to mount a cgroup, enable the list
> @@ -1816,11 +1840,28 @@ out_free:
> kfree(opts.release_agent);
> kfree(opts.name);
>
> - if (ret)
> + if (ret) {
> + put_cgroup_ns(ns);
> return ERR_PTR(ret);
> + }
>
> dentry = kernfs_mount(fs_type, flags, root->kf_root,
> CGROUP_SUPER_MAGIC, &new_sb);
> +
> + if (!IS_ERR(dentry)) {
> + /* If this mount is for a non-init cgroup namespace, then
> + * Instead of root's dentry, we return the dentry specific to
> + * the cgroupns->root_cgrp.
> + */
> + if (ns != &init_cgroup_ns) {
> + struct dentry *nsdentry;
> +
> + nsdentry = cgroupns_get_root(dentry->d_sb, ns);
> + dput(dentry);
> + dentry = nsdentry;
> + }
> + }
> +
> if (IS_ERR(dentry) || !new_sb)
> cgroup_put(&root->cgrp);
>
> @@ -1833,6 +1874,7 @@ out_free:
> deactivate_super(pinned_sb);
> }
>
> + put_cgroup_ns(ns);
> return dentry;
> }
>
> @@ -1861,6 +1903,7 @@ static struct file_system_type cgroup_fs_type = {
> .name = "cgroup",
> .mount = cgroup_mount,
> .kill_sb = cgroup_kill_sb,
> + .fs_flags = FS_USERNS_MOUNT,
> };
>
> static struct kobject *cgroup_kobj;
> --
> 2.1.0.rc2.206.gedb03e5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-10-19 04:55:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 0/8] CGroup Namespaces

Aditya Kali <[email protected]> writes:

> Second take at the Cgroup Namespace patch-set.
>
> Major changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> your cgroupns-root.
>
> More details in the writeup below.

This definitely looks like the right direction to go, and something that
in some form or another I had been asking for since cgroups were merged.
So I am very glad to see this work moving forward.

I had hoped that we might just be able to be clever with remounting
cgroupfs but 2 things stand in the way.
1) /proc/<pid>/cgroups (but proc could capture that).
2) providing a hard guarnatee that tasks stay within a subset of the
cgroup hierarchy.

So I think this clearly meets the requirements for a new namespace.

We need to have the discussion on chmod of files on cgroupfs. There is
a notion that has floated around that only systemd or only root (with
the appropriate capabilities) should be allowed to set resource limits
in cgroupfs. In a practical reality that is nonsense. If an atribute
is properly bound in it's hiearchy it should be safe to change.

Not all attributes are properly bound to hierarchy and some are or at
least were dangerous for anyone except root to set. So I suggest that a
CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
a cgroup attribute from root.

That would be complimentary work, and not strictly tied the cgroup
namespaces but unprivileged cgroup namespaces don't make much sense
without that work.

Eric

> Background
> Cgroups and Namespaces are used together to create “virtual”
> containers that isolates the host environment from the processes
> running in container. But since cgroups themselves are not
> “virtualized”, the task is always able to see global cgroups view
> through cgroupfs mount and via /proc/self/cgroup file.
>
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> This exposure of cgroup names to the processes running inside a
> container results in some problems:
> (1) The container names are typically host-container-management-agent
> (systemd, docker/libcontainer, etc.) data and leaking its name (or
> leaking the hierarchy) reveals too much information about the host
> system.
> (2) It makes the container migration across machines (CRIU) more
> difficult as the container names need to be unique across the
> machines in the migration domain.
> (3) It makes it difficult to run container management tools (like
> docker/libcontainer, lmctfy, etc.) within virtual containers
> without adding dependency on some state/agent present outside the
> container.
>
> Note that the feature proposed here is completely different than the
> “ns cgroup” feature which existed in the linux kernel until recently.
> The ns cgroup also attempted to connect cgroups and namespaces by
> creating a new cgroup every time a new namespace was created. It did
> not solve any of the above mentioned problems and was later dropped
> from the kernel. Incidentally though, it used the same config option
> name CONFIG_CGROUP_NS as used in my prototype!
>
> Introducing CGroup Namespaces
> With unified cgroup hierarchy
> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
> have a much more coherent cgroup view and its easy to associate a
> container with a single cgroup. This also allows us to virtualize the
> cgroup view for tasks inside the container.
>
> The new CGroup Namespace allows a process to “unshare” its cgroup
> hierarchy starting from the cgroup its currently in.
> For Ex:
> $ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
> $ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
> [ns]$ ls -l /proc/self/ns/cgroup
> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
> cgroup:[4026532183]
> # From within new cgroupns, process sees that its in the root cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>
> # From global cgroupns:
> $ cat /proc/<pid>/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>
> # Unshare cgroupns along with userns and mountns
> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
> # sets up uid/gid map and exec’s /bin/bash
> $ ~/unshare -c -u -m
>
> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
> # hierarchy.
> [ns]$ mount -t cgroup cgroup /tmp/cgroup
> [ns]$ ls -l /tmp/cgroup
> total 0
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>
> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
> filesystem root for the namespace specific cgroupfs mount.
>
> The virtualization of /proc/self/cgroup file combined with restricting
> the view of cgroup hierarchy by namespace-private cgroupfs mount
> should provide a completely isolated cgroup view inside the container.
>
> In its current form, the cgroup namespaces patcheset provides following
> behavior:
>
> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
> the process calling unshare is running.
> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
> For the init_cgroup_ns, this is the real root (“/”) cgroup
> (identified in code as cgrp_dfl_root.cgrp).
>
> (2) The cgroupns-root cgroup does not change even if the namespace
> creator process later moves to a different cgroup.
> $ ~/unshare -c # unshare cgroupns in some cgroup
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
> [ns]$ mkdir sub_cgrp_1
> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/self/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (3) Each process gets its CGROUPNS specific view of
> /proc/<pid>/cgroup.
> (a) Processes running inside the cgroup namespace will be able to see
> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
> [ns]$ sleep 100000 & # From within unshared cgroupns
> [1] 7353
> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
> [ns]$ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>
> (b) From global cgroupns, the real cgroup path will be visible:
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>
> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
> path will be visible:
> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
> [ns2]$ cat /proc/7353/cgroup
> [ns2]$
> This is same as when cgroup hierarchy is not mounted at all.
> (In correct container setup though, it should not be possible to
> access PIDs in another container in the first place.)
>
> (4) Processes inside a cgroupns are not allowed to move out of the
> cgroupns-root. This is true even if a privileged process in global
> cgroupns tries to move the process out of its cgroupns-root.
>
> # From global cgroupns
> $ cat /proc/7353/cgroup
> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
> # cgroupns-root for 7353 is /batchjobs/c_job_id1
> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
> -bash: echo: write error: Operation not permitted
>
> (5) Setns to another cgroup namespace is allowed only when:
> (a) process has CAP_SYS_ADMIN in its current userns
> (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
> (c) the process's current cgroup is a descendant cgroupns-root of the
> target namespace.
> (d) the target cgroupns-root is descendant of current cgroupns-root..
> The last check (d) prevents processes from escaping their cgroupns-root by
> attaching to parent cgroupns. Thus, setns is allowed only when the process
> is trying to restrict itself to a deeper cgroup hierarchy.
>
> (6) When some thread from a multi-threaded process unshares its
> cgroup-namespace, the new cgroupns gets applied to the entire
> process (all the threads). This should be OK since
> unified-hierarchy only allows process-level containerization. So
> all the threads in the process will have the same cgroup. And both
> - changing cgroups and unsharing namespaces - are protected under
> threadgroup_lock(task).
>
> (7) The cgroup namespace is alive as long as there is atleast 1
> process inside it. When the last process exits, the cgroup
> namespace is destroyed. The cgroupns-root and the actual cgroups
> remain though.
>
> (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
> the unified cgroup hierarchy with cgroupns-root as the filesystem root.
> The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
> container management tools to be run inside the containers transparently.
>
> Implementation
> The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
> branch). Its fairly non-intrusive and provides above mentioned
> features.
>
> Possible extensions of CGROUPNS:
> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
> capabilities to restrict cgroups to administrative users. CGroup
> namespaces could be of help here. With cgroup namespaces, it might
> be possible to delegate administration of sub-cgroups under a
> cgroupns-root to the cgroupns owner.




> ---
> fs/kernfs/dir.c | 53 +++++++++---
> fs/kernfs/mount.c | 48 +++++++++++
> fs/proc/namespaces.c | 3 +
> include/linux/cgroup.h | 41 +++++++++-
> include/linux/cgroup_namespace.h | 62 +++++++++++++++
> include/linux/kernfs.h | 5 ++
> include/linux/nsproxy.h | 2 +
> include/linux/proc_ns.h | 4 +
> include/uapi/linux/sched.h | 3 +-
> init/Kconfig | 9 +++
> kernel/Makefile | 1 +
> kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
> kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
> kernel/fork.c | 2 +-
> kernel/nsproxy.c | 19 ++++-
> 15 files changed, 518 insertions(+), 41 deletions(-)
> create mode 100644 include/linux/cgroup_namespace.h
> create mode 100644 kernel/cgroup_namespace.c
>
> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
> [PATCHv1 7/8] cgroup: cgroup namespace setns support
> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2014-10-19 04:58:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

Aditya Kali <[email protected]> writes:

> Restrict following operations within the calling tasks:
> * cgroup_mkdir & cgroup_rmdir
> * cgroup_attach_task
> * writes to cgroup files outside of task's cgroupns-root
>
> Also, read of /proc/<pid>/cgroup file is now restricted only
> to tasks under same cgroupns-root. If a task tries to look
> at cgroup of another task outside of its cgroupns-root, then
> it won't be able to see anything for the default hierarchy.
> This is same as if the cgroups are not mounted.

So I think this patch is out of order.

We should add the namespace infrastructre and the restrictions before
we allow creation of the namespace. Otherwise there is a bisection
point where cgroup namespaces are broken or at the very least have a
security hole. Since we can anticipate this let's see if we can figure
out how to avoid it.

Eric


> Signed-off-by: Aditya Kali <[email protected]>
> ---
> kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f8099b4..2fc0dfa 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
> struct task_struct *task;
> int ret;
>
> + /* Only allow changing cgroups accessible within task's cgroup
> + * namespace. i.e. 'dst_cgrp' should be a descendant of task's
> + * cgroupns->root_cgrp. */
> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
> + return -EPERM;
> +
> /* look up all src csets */
> down_read(&css_set_rwsem);
> rcu_read_lock();
> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> struct cgroup_subsys_state *css;
> int ret;
>
> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + return -EINVAL;
> +
> if (cft->write)
> return cft->write(of, buf, nbytes, off);
>
> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> parent = cgroup_kn_lock_live(parent_kn);
> if (!parent)
> return -ENODEV;
> +
> + /* Allow mkdir only within process's cgroup namespace root. */
> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
> + ret = -EPERM;
> + goto out_unlock;
> + }
> +
> root = parent->root;
>
> /* allocate the cgroup and its ID, 0 is reserved for the root */
> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
> if (!cgrp)
> return 0;
>
> + /* Allow rmdir only within process's cgroup namespace root.
> + * The process can't delete its own root anyways. */
> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
> + cgroup_kn_unlock(kn);
> + return -EPERM;
> + }
> +
> ret = cgroup_destroy_locked(cgrp);
>
> cgroup_kn_unlock(kn);
> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
> continue;
>
> + cgrp = task_cgroup_from_root(tsk, root);
> +
> + /* The cgroup path on default hierarchy is shown only if it
> + * falls under current task's cgroupns-root.
> + */
> + if (root == &cgrp_dfl_root &&
> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
> + continue;
> +
> seq_printf(m, "%d:", root->hierarchy_id);
> for_each_subsys(ss, ssid)
> if (root->subsys_mask & (1 << ssid))
> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
> seq_printf(m, "%sname=%s", count ? "," : "",
> root->name);
> seq_putc(m, ':');
> - cgrp = task_cgroup_from_root(tsk, root);
> path = cgroup_path(cgrp, buf, PATH_MAX);
> if (!path) {
> retval = -ENAMETOOLONG;

2014-10-19 05:24:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

"Serge E. Hallyn" <[email protected]> writes:

> Quoting Aditya Kali ([email protected]):
>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]> wrote:
>> > Quoting Aditya Kali ([email protected]):
>> >> setns on a cgroup namespace is allowed only if
>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>> >> over the user-namespace associated with target cgroupns.
>> >> * task's current cgroup is descendent of the target cgroupns-root
>> >> cgroup.
>> >
>> > What is the point of this?
>> >
>> > If I'm a user logged into
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>> > a container which is in
>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>> > then I will want to be able to enter the container's cgroup.
>> > The container's cgroup root is under my own (satisfying the
>> > below condition0 but my cgroup is not a descendent of the
>> > container's cgroup.
>> >
>> This condition is there because we don't want to do implicit cgroup
>> changes when a process attaches to another cgroupns. cgroupns tries to
>> preserve the invariant that at any point, your current cgroup is
>> always under the cgroupns-root of your cgroup namespace. But in your
>> example, if we allow a process in "session-c12.scope" container to
>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>> (without implicitly moving its cgroup), then this invariant won't
>> hold.
>
> Oh, I see. Guess that should be workable. Thanks.

Which has me looking at what the rules are for moving through
the cgroup hierarchy.

As long as we have write access to cgroup.procs and are allowed
to open the file for write, we can move any of our own tasks
into the cgroup. So the cgroup namespace rules don't seem
to be a problem.

Andy can you please take a look at the permission checks in
__cgroup_procs_write.

As I read the code I see 3 security gaffaws in the permssion check.
- Using current->cred instead of file->f_cred.
- Not checking tcred->euid.
- Checking GLOBAL_ROOT_UID instead of having a capable call.

The file permission on cgroup.procs seem just sufficient to keep
to keep those bugs from being easily exploitable.

Eric

2014-10-19 18:26:55

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
<[email protected]> wrote:
> "Serge E. Hallyn" <[email protected]> writes:
>
>> Quoting Aditya Kali ([email protected]):
>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]> wrote:
>>> > Quoting Aditya Kali ([email protected]):
>>> >> setns on a cgroup namespace is allowed only if
>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>> >> over the user-namespace associated with target cgroupns.
>>> >> * task's current cgroup is descendent of the target cgroupns-root
>>> >> cgroup.
>>> >
>>> > What is the point of this?
>>> >
>>> > If I'm a user logged into
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>> > a container which is in
>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>> > then I will want to be able to enter the container's cgroup.
>>> > The container's cgroup root is under my own (satisfying the
>>> > below condition0 but my cgroup is not a descendent of the
>>> > container's cgroup.
>>> >
>>> This condition is there because we don't want to do implicit cgroup
>>> changes when a process attaches to another cgroupns. cgroupns tries to
>>> preserve the invariant that at any point, your current cgroup is
>>> always under the cgroupns-root of your cgroup namespace. But in your
>>> example, if we allow a process in "session-c12.scope" container to
>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>> (without implicitly moving its cgroup), then this invariant won't
>>> hold.
>>
>> Oh, I see. Guess that should be workable. Thanks.
>
> Which has me looking at what the rules are for moving through
> the cgroup hierarchy.
>
> As long as we have write access to cgroup.procs and are allowed
> to open the file for write, we can move any of our own tasks
> into the cgroup. So the cgroup namespace rules don't seem
> to be a problem.
>
> Andy can you please take a look at the permission checks in
> __cgroup_procs_write.

The actual requirements for calling that function haven't changed,
right? IOW, what does this have to do with cgroupns? Is the idea
that you want a privileged user wrt a cgroupns's userns to be able to
use this? If so:

Yes, that current_cred() thing is bogus. (Actually, this is probably
exploitable right now if any cgroup.procs inode anywhere on the system
lets non-root write.) (Can we have some kernel debugging option that
makes any use of current_cred() in write(2) warn?)

We really need a weaker version of may_ptrace for this kind of stuff.
Maybe the existing may_ptrace stuff is okay, actually. But this is
completely missing group checks, cap checks, capabilities wrt the
userns, etc.

Also, I think that, if this version of the patchset allows non-init
userns to unshare cgroupns, then the issue of what permission is
needed to lock the cgroup hierarchy like that needs to be addressed,
because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
the calling task with no permission required. Bolting on a fix later
will be a mess.

--Andy

>
> As I read the code I see 3 security gaffaws in the permssion check.
> - Using current->cred instead of file->f_cred.
> - Not checking tcred->euid.
> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>
> The file permission on cgroup.procs seem just sufficient to keep
> to keep those bugs from being easily exploitable.
>
> Eric



--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-20 04:56:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support



On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <[email protected]> wrote:
>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
><[email protected]> wrote:
>> "Serge E. Hallyn" <[email protected]> writes:
>>
>>> Quoting Aditya Kali ([email protected]):
>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]>
>wrote:
>>>> > Quoting Aditya Kali ([email protected]):
>>>> >> setns on a cgroup namespace is allowed only if
>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>> >> over the user-namespace associated with target cgroupns.
>>>> >> * task's current cgroup is descendent of the target
>cgroupns-root
>>>> >> cgroup.
>>>> >
>>>> > What is the point of this?
>>>> >
>>>> > If I'm a user logged into
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>> > a container which is in
>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>> > then I will want to be able to enter the container's cgroup.
>>>> > The container's cgroup root is under my own (satisfying the
>>>> > below condition0 but my cgroup is not a descendent of the
>>>> > container's cgroup.
>>>> >
>>>> This condition is there because we don't want to do implicit cgroup
>>>> changes when a process attaches to another cgroupns. cgroupns tries
>to
>>>> preserve the invariant that at any point, your current cgroup is
>>>> always under the cgroupns-root of your cgroup namespace. But in
>your
>>>> example, if we allow a process in "session-c12.scope" container to
>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>> (without implicitly moving its cgroup), then this invariant won't
>>>> hold.
>>>
>>> Oh, I see. Guess that should be workable. Thanks.
>>
>> Which has me looking at what the rules are for moving through
>> the cgroup hierarchy.
>>
>> As long as we have write access to cgroup.procs and are allowed
>> to open the file for write, we can move any of our own tasks
>> into the cgroup. So the cgroup namespace rules don't seem
>> to be a problem.
>>
>> Andy can you please take a look at the permission checks in
>> __cgroup_procs_write.
>
>The actual requirements for calling that function haven't changed,
>right? IOW, what does this have to do with cgroupns?

Excluding user namespaces the requirements have not changed.

The immediate correlation is that to enter a cgroupns you must first put your process in one of it's cgroups.

So I was examining what it would take to enter the cgroup of cgroupns.

> Is the idea
>that you want a privileged user wrt a cgroupns's userns to be able to
>use this? If so:
>
>Yes, that current_cred() thing is bogus. (Actually, this is probably
>exploitable right now if any cgroup.procs inode anywhere on the system
>lets non-root write.) (Can we have some kernel debugging option that
>makes any use of current_cred() in write(2) warn?)
>
>We really need a weaker version of may_ptrace for this kind of stuff.
>Maybe the existing may_ptrace stuff is okay, actually. But this is
>completely missing group checks, cap checks, capabilities wrt the
>userns, etc.
>
>Also, I think that, if this version of the patchset allows non-init
>userns to unshare cgroupns, then the issue of what permission is
>needed to lock the cgroup hierarchy like that needs to be addressed,
>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>the calling task with no permission required. Bolting on a fix later
>will be a mess.

I imagine the pinning would be like the userns.

Ah but there is a potentially serious issue with the pinning.
With pinning we can make it impossible for root to move us to a different cgroup.

I am not certain how serious that is but it bears thinking about.
If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.

Sigh.

I am too tired tonight to see the end game in this.

Eric
>> As I read the code I see 3 security gaffaws in the permssion check.
>> - Using current->cred instead of file->f_cred.
>> - Not checking tcred->euid.
>> - Checking GLOBAL_ROOT_UID instead of having a capable call.
>>
>> The file permission on cgroup.procs seem just sufficient to keep
>> to keep those bugs from being easily exploitable.
>>
>> Eric

2014-10-21 00:21:10

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <[email protected]> wrote:
>
>
> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <[email protected]> wrote:
>>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
>><[email protected]> wrote:
>>> "Serge E. Hallyn" <[email protected]> writes:
>>>
>>>> Quoting Aditya Kali ([email protected]):
>>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <[email protected]>
>>wrote:
>>>>> > Quoting Aditya Kali ([email protected]):
>>>>> >> setns on a cgroup namespace is allowed only if
>>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>>> >> over the user-namespace associated with target cgroupns.
>>>>> >> * task's current cgroup is descendent of the target
>>cgroupns-root
>>>>> >> cgroup.
>>>>> >
>>>>> > What is the point of this?
>>>>> >
>>>>> > If I'm a user logged into
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>>> > a container which is in
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>>> > then I will want to be able to enter the container's cgroup.
>>>>> > The container's cgroup root is under my own (satisfying the
>>>>> > below condition0 but my cgroup is not a descendent of the
>>>>> > container's cgroup.
>>>>> >
>>>>> This condition is there because we don't want to do implicit cgroup
>>>>> changes when a process attaches to another cgroupns. cgroupns tries
>>to
>>>>> preserve the invariant that at any point, your current cgroup is
>>>>> always under the cgroupns-root of your cgroup namespace. But in
>>your
>>>>> example, if we allow a process in "session-c12.scope" container to
>>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>>> (without implicitly moving its cgroup), then this invariant won't
>>>>> hold.
>>>>
>>>> Oh, I see. Guess that should be workable. Thanks.
>>>
>>> Which has me looking at what the rules are for moving through
>>> the cgroup hierarchy.
>>>
>>> As long as we have write access to cgroup.procs and are allowed
>>> to open the file for write, we can move any of our own tasks
>>> into the cgroup. So the cgroup namespace rules don't seem
>>> to be a problem.
>>>
>>> Andy can you please take a look at the permission checks in
>>> __cgroup_procs_write.
>>
>>The actual requirements for calling that function haven't changed,
>>right? IOW, what does this have to do with cgroupns?
>
> Excluding user namespaces the requirements have not changed.
>
> The immediate correlation is that to enter a cgroupns you must first put your process in one of it's cgroups.
>
> So I was examining what it would take to enter the cgroup of cgroupns.
>
>> Is the idea
>>that you want a privileged user wrt a cgroupns's userns to be able to
>>use this? If so:
>>
>>Yes, that current_cred() thing is bogus. (Actually, this is probably
>>exploitable right now if any cgroup.procs inode anywhere on the system
>>lets non-root write.) (Can we have some kernel debugging option that
>>makes any use of current_cred() in write(2) warn?)
>>
>>We really need a weaker version of may_ptrace for this kind of stuff.
>>Maybe the existing may_ptrace stuff is okay, actually. But this is
>>completely missing group checks, cap checks, capabilities wrt the
>>userns, etc.
>>
>>Also, I think that, if this version of the patchset allows non-init
>>userns to unshare cgroupns, then the issue of what permission is
>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>the calling task with no permission required. Bolting on a fix later
>>will be a mess.
>
> I imagine the pinning would be like the userns.
>
> Ah but there is a potentially serious issue with the pinning.
> With pinning we can make it impossible for root to move us to a different cgroup.
>
> I am not certain how serious that is but it bears thinking about.
> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>
> Sigh.
>
> I am too tired tonight to see the end game in this.

Possible solution:

Ditch the pinning. That is, if you're outside a cgroupns (or you have
a non-ns-confined cgroupfs mounted), then you can move a task in a
cgroupns outside of its root cgroup. If you do this, then the task
thinks its cgroup is something like "../foo" or "../../foo".

While we're at it, consider making setns for a cgroupns *not* change
the caller's cgroup. Is there any reason it really needs to?

Thoughts?

--Andy

2014-10-21 04:50:39

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Andy Lutomirski <[email protected]> writes:

> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <[email protected]> wrote:
>>
>>
>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <[email protected]> wrote:

>>> Is the idea
>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>use this? If so:
>>>
>>>Yes, that current_cred() thing is bogus. (Actually, this is probably
>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>lets non-root write.) (Can we have some kernel debugging option that
>>>makes any use of current_cred() in write(2) warn?)
>>>
>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>Maybe the existing may_ptrace stuff is okay, actually. But this is
>>>completely missing group checks, cap checks, capabilities wrt the
>>>userns, etc.
>>>
>>>Also, I think that, if this version of the patchset allows non-init
>>>userns to unshare cgroupns, then the issue of what permission is
>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>the calling task with no permission required. Bolting on a fix later
>>>will be a mess.
>>
>> I imagine the pinning would be like the userns.
>>
>> Ah but there is a potentially serious issue with the pinning.
>> With pinning we can make it impossible for root to move us to a different cgroup.
>>
>> I am not certain how serious that is but it bears thinking about.
>> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>>
>> Sigh.
>>
>> I am too tired tonight to see the end game in this.
>
> Possible solution:
>
> Ditch the pinning. That is, if you're outside a cgroupns (or you have
> a non-ns-confined cgroupfs mounted), then you can move a task in a
> cgroupns outside of its root cgroup. If you do this, then the task
> thinks its cgroup is something like "../foo" or "../../foo".

Of the possible solutions that seems attractive to me, simply because
we sometimes want to allow clever things to occur.

Does anyone know of a reason (beyond pretty printing) why we need
cgroupns to restrict the subset of cgroups processes can be in?

I would expect permissions on the cgroup directories themselves, and
limited visiblilty would be (in general) to achieve the desired
visiblity.

> While we're at it, consider making setns for a cgroupns *not* change
> the caller's cgroup. Is there any reason it really needs to?

setns doesn't but nsenter is going to need to change the cgroup
if the pinning requirement is kept. nsenenter is going to want to
change the cgroup if the pinning requirement is dropped.

Eric

2014-10-21 05:04:11

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <[email protected]> wrote:
>>>
>>>
>>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <[email protected]> wrote:
>
>>>> Is the idea
>>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>>use this? If so:
>>>>
>>>>Yes, that current_cred() thing is bogus. (Actually, this is probably
>>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>>lets non-root write.) (Can we have some kernel debugging option that
>>>>makes any use of current_cred() in write(2) warn?)
>>>>
>>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>>Maybe the existing may_ptrace stuff is okay, actually. But this is
>>>>completely missing group checks, cap checks, capabilities wrt the
>>>>userns, etc.
>>>>
>>>>Also, I think that, if this version of the patchset allows non-init
>>>>userns to unshare cgroupns, then the issue of what permission is
>>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>>the calling task with no permission required. Bolting on a fix later
>>>>will be a mess.
>>>
>>> I imagine the pinning would be like the userns.
>>>
>>> Ah but there is a potentially serious issue with the pinning.
>>> With pinning we can make it impossible for root to move us to a different cgroup.
>>>
>>> I am not certain how serious that is but it bears thinking about.
>>> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>>>
>>> Sigh.
>>>
>>> I am too tired tonight to see the end game in this.
>>
>> Possible solution:
>>
>> Ditch the pinning. That is, if you're outside a cgroupns (or you have
>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>> cgroupns outside of its root cgroup. If you do this, then the task
>> thinks its cgroup is something like "../foo" or "../../foo".
>
> Of the possible solutions that seems attractive to me, simply because
> we sometimes want to allow clever things to occur.
>
> Does anyone know of a reason (beyond pretty printing) why we need
> cgroupns to restrict the subset of cgroups processes can be in?
>
> I would expect permissions on the cgroup directories themselves, and
> limited visiblilty would be (in general) to achieve the desired
> visiblity.

This makes the security impact of cgroupns very easy to understand,
right? Because there really won't be any -- cgroupns only affects
reads from /proc and what cgroupfs shows, but it doesn't change any
actual cgroups, nor does it affect any cgroup *changes*.

>
>> While we're at it, consider making setns for a cgroupns *not* change
>> the caller's cgroup. Is there any reason it really needs to?
>
> setns doesn't but nsenter is going to need to change the cgroup
> if the pinning requirement is kept. nsenenter is going to want to
> change the cgroup if the pinning requirement is dropped.
>

It seems easy enough for nsenter to change the cgroup all by itself.

--Andy

2014-10-21 05:43:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Andy Lutomirski <[email protected]> writes:

> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <[email protected]> wrote:
>>>>
>>>>
>>>> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <[email protected]> wrote:
>>
>>>>> Is the idea
>>>>>that you want a privileged user wrt a cgroupns's userns to be able to
>>>>>use this? If so:
>>>>>
>>>>>Yes, that current_cred() thing is bogus. (Actually, this is probably
>>>>>exploitable right now if any cgroup.procs inode anywhere on the system
>>>>>lets non-root write.) (Can we have some kernel debugging option that
>>>>>makes any use of current_cred() in write(2) warn?)
>>>>>
>>>>>We really need a weaker version of may_ptrace for this kind of stuff.
>>>>>Maybe the existing may_ptrace stuff is okay, actually. But this is
>>>>>completely missing group checks, cap checks, capabilities wrt the
>>>>>userns, etc.
>>>>>
>>>>>Also, I think that, if this version of the patchset allows non-init
>>>>>userns to unshare cgroupns, then the issue of what permission is
>>>>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>>>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>>>>the calling task with no permission required. Bolting on a fix later
>>>>>will be a mess.
>>>>
>>>> I imagine the pinning would be like the userns.
>>>>
>>>> Ah but there is a potentially serious issue with the pinning.
>>>> With pinning we can make it impossible for root to move us to a different cgroup.
>>>>
>>>> I am not certain how serious that is but it bears thinking about.
>>>> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>>>>
>>>> Sigh.
>>>>
>>>> I am too tired tonight to see the end game in this.
>>>
>>> Possible solution:
>>>
>>> Ditch the pinning. That is, if you're outside a cgroupns (or you have
>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>> cgroupns outside of its root cgroup. If you do this, then the task
>>> thinks its cgroup is something like "../foo" or "../../foo".
>>
>> Of the possible solutions that seems attractive to me, simply because
>> we sometimes want to allow clever things to occur.
>>
>> Does anyone know of a reason (beyond pretty printing) why we need
>> cgroupns to restrict the subset of cgroups processes can be in?
>>
>> I would expect permissions on the cgroup directories themselves, and
>> limited visiblilty would be (in general) to achieve the desired
>> visiblity.
>
> This makes the security impact of cgroupns very easy to understand,
> right? Because there really won't be any -- cgroupns only affects
> reads from /proc and what cgroupfs shows, but it doesn't change any
> actual cgroups, nor does it affect any cgroup *changes*.

It seems like what we have described is chcgrouproot aka chroot for
cgroups. At which point I think there are potentially similar security
issues as for chroot. Can we confuse a setuid root process if we make
it's cgroup names look different.

Of course the confusing root concern is handled by the usual namespace
security checks that are already present.

I do wonder if we think of this as chcgrouproot if there is a simpler
implementation.

>>> While we're at it, consider making setns for a cgroupns *not* change
>>> the caller's cgroup. Is there any reason it really needs to?
>>
>> setns doesn't but nsenter is going to need to change the cgroup
>> if the pinning requirement is kept. nsenenter is going to want to
>> change the cgroup if the pinning requirement is dropped.
>>
>
> It seems easy enough for nsenter to change the cgroup all by itself.

Again. I don't think anyone has suggested or implemented anything
different.

Eric

2014-10-21 05:49:44

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>> Possible solution:
>>>>
>>>> Ditch the pinning. That is, if you're outside a cgroupns (or you have
>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>> cgroupns outside of its root cgroup. If you do this, then the task
>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>
>>> Of the possible solutions that seems attractive to me, simply because
>>> we sometimes want to allow clever things to occur.
>>>
>>> Does anyone know of a reason (beyond pretty printing) why we need
>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>
>>> I would expect permissions on the cgroup directories themselves, and
>>> limited visiblilty would be (in general) to achieve the desired
>>> visiblity.
>>
>> This makes the security impact of cgroupns very easy to understand,
>> right? Because there really won't be any -- cgroupns only affects
>> reads from /proc and what cgroupfs shows, but it doesn't change any
>> actual cgroups, nor does it affect any cgroup *changes*.
>
> It seems like what we have described is chcgrouproot aka chroot for
> cgroups. At which point I think there are potentially similar security
> issues as for chroot. Can we confuse a setuid root process if we make
> it's cgroup names look different.
>
> Of course the confusing root concern is handled by the usual namespace
> security checks that are already present.

I think that the chroot issues are mostly in two categories: setuid
confusion (not an issue here as you described) and chroot escapes.
cgroupns escapes aren't a big deal, I think -- admins should deny the
confined task the right to write to cgroupfs outside its hierarchy, by
setting cgroupfs permissions appropriately and/or avoiding mounting
cgroupfs outside the hierarchy.

>
> I do wonder if we think of this as chcgrouproot if there is a simpler
> implementation.

Could be. I'll defer to Aditya for that one.

>
>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>> the caller's cgroup. Is there any reason it really needs to?
>>>
>>> setns doesn't but nsenter is going to need to change the cgroup
>>> if the pinning requirement is kept. nsenenter is going to want to
>>> change the cgroup if the pinning requirement is dropped.
>>>
>>
>> It seems easy enough for nsenter to change the cgroup all by itself.
>
> Again. I don't think anyone has suggested or implemented anything
> different.

The current patchset seems to punt on this decision by just failing
the setns call if the caller is outside the cgroup in question.

--Andy

2014-10-21 18:49:36

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On Mon, Oct 20, 2014 at 9:49 PM, Eric W. Biederman
>>> <[email protected]> wrote:
>>>> Andy Lutomirski <[email protected]> writes:
>>>>> Possible solution:
>>>>>
>>>>> Ditch the pinning. That is, if you're outside a cgroupns (or you have
>>>>> a non-ns-confined cgroupfs mounted), then you can move a task in a
>>>>> cgroupns outside of its root cgroup. If you do this, then the task
>>>>> thinks its cgroup is something like "../foo" or "../../foo".
>>>>
>>>> Of the possible solutions that seems attractive to me, simply because
>>>> we sometimes want to allow clever things to occur.
>>>>
>>>> Does anyone know of a reason (beyond pretty printing) why we need
>>>> cgroupns to restrict the subset of cgroups processes can be in?
>>>>
>>>> I would expect permissions on the cgroup directories themselves, and
>>>> limited visiblilty would be (in general) to achieve the desired
>>>> visiblity.
>>>
>>> This makes the security impact of cgroupns very easy to understand,
>>> right? Because there really won't be any -- cgroupns only affects
>>> reads from /proc and what cgroupfs shows, but it doesn't change any
>>> actual cgroups, nor does it affect any cgroup *changes*.
>>
>> It seems like what we have described is chcgrouproot aka chroot for
>> cgroups. At which point I think there are potentially similar security
>> issues as for chroot. Can we confuse a setuid root process if we make
>> it's cgroup names look different.
>>
>> Of course the confusing root concern is handled by the usual namespace
>> security checks that are already present.
>
> I think that the chroot issues are mostly in two categories: setuid
> confusion (not an issue here as you described) and chroot escapes.
> cgroupns escapes aren't a big deal, I think -- admins should deny the
> confined task the right to write to cgroupfs outside its hierarchy, by
> setting cgroupfs permissions appropriately and/or avoiding mounting
> cgroupfs outside the hierarchy.
>
>>
>> I do wonder if we think of this as chcgrouproot if there is a simpler
>> implementation.
>
> Could be. I'll defer to Aditya for that one.
>

More than chcgrouproot, its probably closer to pivot_cgroup_root. In
addition to restricting the process to a cgroup-root, new processes
entering the container should also be implicitly contained within the
cgroup-root of that container. Implementing pivot_cgroup_root would
probably involve overloading mount-namespace to now understand cgroup
filesystem too. I did attempt combining cgroupns-root with mntns
earlier (not via a new syscall though), but came to the conclusion
that its just simpler to have a separate cgroup namespace and get
clear semantics. One of the issues was that implicitly changing cgroup
on setns to mntns seemed like a huge undesirable side-effect.

About pinning: I really feel that it should be OK to pin processes
within cgroupns-root. I think thats one of the most important feature
of cgroup-namespace since its most common usecase is to containerize
un-trusted processes - processes that, for their entire lifetime, need
to remain inside their container. And with explicit permission from
cgroup subsystem (something like cgroup.may_unshare as you had
suggested previously), we can make sure that unprivileged processes
cannot pin themselves. Also, maintaining this invariant (your current
cgroup is always under your cgroupns-root) keeps the code and the
semantics simple.

If we ditch the pinning requirement and allow the containarized
process to move outside of its cgroupns-root, we will have to address
atleast the following:
* what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
look like? We might need to just not show anything in
/proc/<pid>/cgroup in such case (for default hierarchy).
* how should future setns() and unshare() by such process behave?
* 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
* container will not remain migratable
* added code complexity to handle above scenarios

I understand that having process pinned to a cgroup hierarchy might
seem inconvenient. But even today (without cgroup namespaces), moving
a task from one cgroup to another can fail for reasons outside of
control of the task attempting the move (even if its privileged). So
the userspace should already handle this scenario. I feel its not
worth to add complexity in the kernel for this.

>>
>>>>> While we're at it, consider making setns for a cgroupns *not* change
>>>>> the caller's cgroup. Is there any reason it really needs to?
>>>>
>>>> setns doesn't but nsenter is going to need to change the cgroup
>>>> if the pinning requirement is kept. nsenenter is going to want to
>>>> change the cgroup if the pinning requirement is dropped.
>>>>
>>>
>>> It seems easy enough for nsenter to change the cgroup all by itself.
>>
>> Again. I don't think anyone has suggested or implemented anything
>> different.
>
> The current patchset seems to punt on this decision by just failing
> the setns call if the caller is outside the cgroup in question.
>
> --Andy

--
Aditya

2014-10-21 19:03:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>>
>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>> implementation.
>>
>> Could be. I'll defer to Aditya for that one.
>>
>
> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
> addition to restricting the process to a cgroup-root, new processes
> entering the container should also be implicitly contained within the
> cgroup-root of that container.

Why? Concretely, why should this be in the kernel namespace code
instead of in userspace?

> Implementing pivot_cgroup_root would
> probably involve overloading mount-namespace to now understand cgroup
> filesystem too. I did attempt combining cgroupns-root with mntns
> earlier (not via a new syscall though), but came to the conclusion
> that its just simpler to have a separate cgroup namespace and get
> clear semantics. One of the issues was that implicitly changing cgroup
> on setns to mntns seemed like a huge undesirable side-effect.
>
> About pinning: I really feel that it should be OK to pin processes
> within cgroupns-root. I think thats one of the most important feature
> of cgroup-namespace since its most common usecase is to containerize
> un-trusted processes - processes that, for their entire lifetime, need
> to remain inside their container.

So don't let them out. None of the other namespaces have this kind of
constraint:

- If you're in a mntns, you can still use fds from outside.
- If you're in a netns, you can still use sockets from outside the namespace.
- If you're in an ipcns, you can still use ipc handles from outside.

etc.

> And with explicit permission from
> cgroup subsystem (something like cgroup.may_unshare as you had
> suggested previously), we can make sure that unprivileged processes
> cannot pin themselves. Also, maintaining this invariant (your current
> cgroup is always under your cgroupns-root) keeps the code and the
> semantics simple.

I actually think it makes the semantics more complex. The less policy
you stick in the kernel, the easier it is to understand the impact of
that policy.

>
> If we ditch the pinning requirement and allow the containarized
> process to move outside of its cgroupns-root, we will have to address
> atleast the following:
> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
> look like? We might need to just not show anything in
> /proc/<pid>/cgroup in such case (for default hierarchy).

The process should see the cgroup path relative to its cgroup ns.
Whether this requires a new /proc mount or happens automatically is an
open question. (I *hate* procfs for reasons like this.)

> * how should future setns() and unshare() by such process behave?

Open question.

> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result

You could disallow that and instead require 'mount -t cgroup -o
cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
relative to the caller's cgroupns.

> * container will not remain migratable

Why not?

> * added code complexity to handle above scenarios
>
> I understand that having process pinned to a cgroup hierarchy might
> seem inconvenient. But even today (without cgroup namespaces), moving
> a task from one cgroup to another can fail for reasons outside of
> control of the task attempting the move (even if its privileged). So
> the userspace should already handle this scenario. I feel its not
> worth to add complexity in the kernel for this.

--Andy

2014-10-21 22:33:45

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>> <[email protected]> wrote:
>>>>
>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>> implementation.
>>>
>>> Could be. I'll defer to Aditya for that one.
>>>
>>
>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>> addition to restricting the process to a cgroup-root, new processes
>> entering the container should also be implicitly contained within the
>> cgroup-root of that container.
>
> Why? Concretely, why should this be in the kernel namespace code
> instead of in userspace?
>

Userspace can do it too. Though then there will be possibility of
having processes in the same mount namespace with different
cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
more complex. Thats another reason why it might not be good idea to
tie cgroups with mount namespace.

>> Implementing pivot_cgroup_root would
>> probably involve overloading mount-namespace to now understand cgroup
>> filesystem too. I did attempt combining cgroupns-root with mntns
>> earlier (not via a new syscall though), but came to the conclusion
>> that its just simpler to have a separate cgroup namespace and get
>> clear semantics. One of the issues was that implicitly changing cgroup
>> on setns to mntns seemed like a huge undesirable side-effect.
>>
>> About pinning: I really feel that it should be OK to pin processes
>> within cgroupns-root. I think thats one of the most important feature
>> of cgroup-namespace since its most common usecase is to containerize
>> un-trusted processes - processes that, for their entire lifetime, need
>> to remain inside their container.
>
> So don't let them out. None of the other namespaces have this kind of
> constraint:
>
> - If you're in a mntns, you can still use fds from outside.
> - If you're in a netns, you can still use sockets from outside the namespace.
> - If you're in an ipcns, you can still use ipc handles from outside.

But none of the namespaces allow you to allocate new fds/sockets/ipc
handles in the outside namespace. I think moving a process outside of
cgroupns-root is like allocating a resource outside of your namespace.

>
> etc.

>
>> And with explicit permission from
>> cgroup subsystem (something like cgroup.may_unshare as you had
>> suggested previously), we can make sure that unprivileged processes
>> cannot pin themselves. Also, maintaining this invariant (your current
>> cgroup is always under your cgroupns-root) keeps the code and the
>> semantics simple.
>
> I actually think it makes the semantics more complex. The less policy
> you stick in the kernel, the easier it is to understand the impact of
> that policy.
>

My inclination is towards keeping things simpler - both in code as
well as in configuration. I agree that cgroupns might seem
"less-flexible", but in its current form, it encourages consistent
container configuration. If you have a process that needs to move
around between cgroups belonging to different containers, then that
process should probably not be inside any container's cgroup
namespace. Allowing that will just make the cgroup namespace
pretty-much meaningless.

>>
>> If we ditch the pinning requirement and allow the containarized
>> process to move outside of its cgroupns-root, we will have to address
>> atleast the following:
>> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
>> look like? We might need to just not show anything in
>> /proc/<pid>/cgroup in such case (for default hierarchy).
>
> The process should see the cgroup path relative to its cgroup ns.
> Whether this requires a new /proc mount or happens automatically is an
> open question. (I *hate* procfs for reasons like this.)
>
>> * how should future setns() and unshare() by such process behave?
>
> Open question.
>
>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>
> You could disallow that and instead require 'mount -t cgroup -o
> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
> relative to the caller's cgroupns.
>
>> * container will not remain migratable
>
> Why not?
>

Well, the processes running outside of cgroupns root will be exposed
to information outside of the container (i.e., its /proc/self/cgroup
will show paths involving other containers and potentially system
level information). So unless you even restore them, it will be
difficult to restore these processes. The whole point of virtualizing
the /proc/self/cgroup view was so that the processes don't see outside
cgroups.

>> * added code complexity to handle above scenarios
>>
>> I understand that having process pinned to a cgroup hierarchy might
>> seem inconvenient. But even today (without cgroup namespaces), moving
>> a task from one cgroup to another can fail for reasons outside of
>> control of the task attempting the move (even if its privileged). So
>> the userspace should already handle this scenario. I feel its not
>> worth to add complexity in the kernel for this.
>
> --Andy



--
Aditya

2014-10-21 22:42:41

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>> <[email protected]> wrote:
>>>>>
>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>> implementation.
>>>>
>>>> Could be. I'll defer to Aditya for that one.
>>>>
>>>
>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>> addition to restricting the process to a cgroup-root, new processes
>>> entering the container should also be implicitly contained within the
>>> cgroup-root of that container.
>>
>> Why? Concretely, why should this be in the kernel namespace code
>> instead of in userspace?
>>
>
> Userspace can do it too. Though then there will be possibility of
> having processes in the same mount namespace with different
> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
> more complex. Thats another reason why it might not be good idea to
> tie cgroups with mount namespace.
>
>>> Implementing pivot_cgroup_root would
>>> probably involve overloading mount-namespace to now understand cgroup
>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>> earlier (not via a new syscall though), but came to the conclusion
>>> that its just simpler to have a separate cgroup namespace and get
>>> clear semantics. One of the issues was that implicitly changing cgroup
>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>
>>> About pinning: I really feel that it should be OK to pin processes
>>> within cgroupns-root. I think thats one of the most important feature
>>> of cgroup-namespace since its most common usecase is to containerize
>>> un-trusted processes - processes that, for their entire lifetime, need
>>> to remain inside their container.
>>
>> So don't let them out. None of the other namespaces have this kind of
>> constraint:
>>
>> - If you're in a mntns, you can still use fds from outside.
>> - If you're in a netns, you can still use sockets from outside the namespace.
>> - If you're in an ipcns, you can still use ipc handles from outside.
>
> But none of the namespaces allow you to allocate new fds/sockets/ipc
> handles in the outside namespace. I think moving a process outside of
> cgroupns-root is like allocating a resource outside of your namespace.

In a pidns, you can see outside tasks if you have an outside procfs
mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
like that? You wouldn't be able to escape your cgroup as long as you
don't have an inappropriate cgroupfs mounted.


>>
>>> And with explicit permission from
>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>> suggested previously), we can make sure that unprivileged processes
>>> cannot pin themselves. Also, maintaining this invariant (your current
>>> cgroup is always under your cgroupns-root) keeps the code and the
>>> semantics simple.
>>
>> I actually think it makes the semantics more complex. The less policy
>> you stick in the kernel, the easier it is to understand the impact of
>> that policy.
>>
>
> My inclination is towards keeping things simpler - both in code as
> well as in configuration. I agree that cgroupns might seem
> "less-flexible", but in its current form, it encourages consistent
> container configuration. If you have a process that needs to move
> around between cgroups belonging to different containers, then that
> process should probably not be inside any container's cgroup
> namespace. Allowing that will just make the cgroup namespace
> pretty-much meaningless.

The problem with pinning is that preventing it causes problems
(specifically, either something potentially complex and incompatible
needs to be added or unprivileged processes will be able to pin
themselves).

Unless I'm missing something, a normal cgroupns user doesn't actually
need kernel pinning support to effectively constrain its members'
cgroups.

>
>>>
>>> If we ditch the pinning requirement and allow the containarized
>>> process to move outside of its cgroupns-root, we will have to address
>>> atleast the following:
>>> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
>>> look like? We might need to just not show anything in
>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>
>> The process should see the cgroup path relative to its cgroup ns.
>> Whether this requires a new /proc mount or happens automatically is an
>> open question. (I *hate* procfs for reasons like this.)
>>
>>> * how should future setns() and unshare() by such process behave?
>>
>> Open question.
>>
>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>
>> You could disallow that and instead require 'mount -t cgroup -o
>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>> relative to the caller's cgroupns.
>>
>>> * container will not remain migratable
>>
>> Why not?
>>
>
> Well, the processes running outside of cgroupns root will be exposed
> to information outside of the container (i.e., its /proc/self/cgroup
> will show paths involving other containers and potentially system
> level information). So unless you even restore them, it will be
> difficult to restore these processes. The whole point of virtualizing
> the /proc/self/cgroup view was so that the processes don't see outside
> cgroups.
>

So don't do that?

--Andy

2014-10-22 00:46:32

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[email protected]> wrote:
>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>> implementation.
>>>>>
>>>>> Could be. I'll defer to Aditya for that one.
>>>>>
>>>>
>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>> addition to restricting the process to a cgroup-root, new processes
>>>> entering the container should also be implicitly contained within the
>>>> cgroup-root of that container.
>>>
>>> Why? Concretely, why should this be in the kernel namespace code
>>> instead of in userspace?
>>>
>>
>> Userspace can do it too. Though then there will be possibility of
>> having processes in the same mount namespace with different
>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>> more complex. Thats another reason why it might not be good idea to
>> tie cgroups with mount namespace.
>>
>>>> Implementing pivot_cgroup_root would
>>>> probably involve overloading mount-namespace to now understand cgroup
>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>> earlier (not via a new syscall though), but came to the conclusion
>>>> that its just simpler to have a separate cgroup namespace and get
>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>
>>>> About pinning: I really feel that it should be OK to pin processes
>>>> within cgroupns-root. I think thats one of the most important feature
>>>> of cgroup-namespace since its most common usecase is to containerize
>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>> to remain inside their container.
>>>
>>> So don't let them out. None of the other namespaces have this kind of
>>> constraint:
>>>
>>> - If you're in a mntns, you can still use fds from outside.
>>> - If you're in a netns, you can still use sockets from outside the namespace.
>>> - If you're in an ipcns, you can still use ipc handles from outside.
>>
>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>> handles in the outside namespace. I think moving a process outside of
>> cgroupns-root is like allocating a resource outside of your namespace.
>
> In a pidns, you can see outside tasks if you have an outside procfs
> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
> like that? You wouldn't be able to escape your cgroup as long as you
> don't have an inappropriate cgroupfs mounted.
>

I am not if we should only depend on restricted visibility for this
though. More details below.

>
>>>
>>>> And with explicit permission from
>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>> suggested previously), we can make sure that unprivileged processes
>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>> semantics simple.
>>>
>>> I actually think it makes the semantics more complex. The less policy
>>> you stick in the kernel, the easier it is to understand the impact of
>>> that policy.
>>>
>>
>> My inclination is towards keeping things simpler - both in code as
>> well as in configuration. I agree that cgroupns might seem
>> "less-flexible", but in its current form, it encourages consistent
>> container configuration. If you have a process that needs to move
>> around between cgroups belonging to different containers, then that
>> process should probably not be inside any container's cgroup
>> namespace. Allowing that will just make the cgroup namespace
>> pretty-much meaningless.
>
> The problem with pinning is that preventing it causes problems
> (specifically, either something potentially complex and incompatible
> needs to be added or unprivileged processes will be able to pin
> themselves).
>
> Unless I'm missing something, a normal cgroupns user doesn't actually
> need kernel pinning support to effectively constrain its members'
> cgroups.
>

So there are 2 scenarios to consider:

We have 2 containers with cgroups: /container1 and /container2
Assume process P is running under cgroupns-root '/container1'

(1) process P wants to 'write' to cgroup.procs outside its
cgroupns-root (say to /container2/cgroup.procs)
(2) An admin process running in init_cgroup_ns (or any parent cgroupns
with cgroupns-root above /container1) wants to write pid of process P
to /container2/cgroup.procs (which lies outside of P's cgroupns-root)

For (1), I think its ok to reject such a write. This is consistent
with the restriction in cgroup_file_write added in 'Patch 6' of this
set. I believe this should be independent of visibility of the cgroup
hierarchy for P.

For (2), we may allow the write to succeed if we make sure that the
process doing the write is an admin process (with CAP_SYS_ADMIN in its
userns AND over P's cgroupns->user_ns).
If this write succeeds, then:
(a) process P's /proc/<pid>/cgroup does not show anything when viewed
by 'self' or any other process in P's cgrgroupns. I would really like
to avoid showing relative paths or paths outside the cgroupns-root
(b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
only able to mount and see cgroup hierarchy under its cgroupns-root
(d) if process P tries to write to any cgroup file outside of its
cgroupns-root (assuming that hierarchy is visible to it for whatever
reason), it will fail as in (1)

i.e., in summary, you can't escape out of cgroupns-root yourself. You
will need help from an admin process running under some parent
cgroupns-root to move you out. Is that workable for your usecase? Most
of the things above already happen with the current patch-set, so it
should be easy to enable this.

Though there are still some open issues like:
* what happens if you move all the processes out of /container1 and
then 'rmdir /container1'? As it is now, you won't be able to setns()
to that cgroupns anymore. But the cgroupns will still hang around
until the processes switch their cgroupns.
* should we then also allow setns() without first entering the
cgroupns-root? setns also checks the same conditions as in (a) plus it
checks that your current cgroup is descendant of target cgroupns-root.
Alternatively we can special-case setns() to own cgroupns so that it
doesn't fail.
* migration for these processes will be tricky, if not impossible. But
the admin trying to do this probably doesn't care about it or will
provision for it.

>>
>>>>
>>>> If we ditch the pinning requirement and allow the containarized
>>>> process to move outside of its cgroupns-root, we will have to address
>>>> atleast the following:
>>>> * what does its /proc/self/cgroup (and /proc/<pid>/cgroup in general)
>>>> look like? We might need to just not show anything in
>>>> /proc/<pid>/cgroup in such case (for default hierarchy).
>>>
>>> The process should see the cgroup path relative to its cgroup ns.
>>> Whether this requires a new /proc mount or happens automatically is an
>>> open question. (I *hate* procfs for reasons like this.)
>>>
>>>> * how should future setns() and unshare() by such process behave?
>>>
>>> Open question.
>>>
>>>> * 'mount -t cgroup cgroup <mnt>' by such a process will yield unexpected result
>>>
>>> You could disallow that and instead require 'mount -t cgroup -o
>>> cgrouproot=. cgroup mnt' where '.' will be resolved at mount time
>>> relative to the caller's cgroupns.
>>>
>>>> * container will not remain migratable
>>>
>>> Why not?
>>>
>>
>> Well, the processes running outside of cgroupns root will be exposed
>> to information outside of the container (i.e., its /proc/self/cgroup
>> will show paths involving other containers and potentially system
>> level information). So unless you even restore them, it will be
>> difficult to restore these processes. The whole point of virtualizing
>> the /proc/self/cgroup view was so that the processes don't see outside
>> cgroups.
>>
>
> So don't do that?
>

Lot of non-cgroup-manager userspace processes have legitimate reasons
to read /proc/self/cgroup. One example is to register for OOM
notifications. Migratability of the container is also very important.
So "don't do that" is not always an option :)


> --Andy

--
Aditya

2014-10-22 00:59:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[email protected]> wrote:
>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>> implementation.
>>>>>>
>>>>>> Could be. I'll defer to Aditya for that one.
>>>>>>
>>>>>
>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>> entering the container should also be implicitly contained within the
>>>>> cgroup-root of that container.
>>>>
>>>> Why? Concretely, why should this be in the kernel namespace code
>>>> instead of in userspace?
>>>>
>>>
>>> Userspace can do it too. Though then there will be possibility of
>>> having processes in the same mount namespace with different
>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>> more complex. Thats another reason why it might not be good idea to
>>> tie cgroups with mount namespace.
>>>
>>>>> Implementing pivot_cgroup_root would
>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>
>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>> to remain inside their container.
>>>>
>>>> So don't let them out. None of the other namespaces have this kind of
>>>> constraint:
>>>>
>>>> - If you're in a mntns, you can still use fds from outside.
>>>> - If you're in a netns, you can still use sockets from outside the namespace.
>>>> - If you're in an ipcns, you can still use ipc handles from outside.
>>>
>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>> handles in the outside namespace. I think moving a process outside of
>>> cgroupns-root is like allocating a resource outside of your namespace.
>>
>> In a pidns, you can see outside tasks if you have an outside procfs
>> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
>> like that? You wouldn't be able to escape your cgroup as long as you
>> don't have an inappropriate cgroupfs mounted.
>>
>
> I am not if we should only depend on restricted visibility for this
> though. More details below.
>
>>
>>>>
>>>>> And with explicit permission from
>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>> suggested previously), we can make sure that unprivileged processes
>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>> semantics simple.
>>>>
>>>> I actually think it makes the semantics more complex. The less policy
>>>> you stick in the kernel, the easier it is to understand the impact of
>>>> that policy.
>>>>
>>>
>>> My inclination is towards keeping things simpler - both in code as
>>> well as in configuration. I agree that cgroupns might seem
>>> "less-flexible", but in its current form, it encourages consistent
>>> container configuration. If you have a process that needs to move
>>> around between cgroups belonging to different containers, then that
>>> process should probably not be inside any container's cgroup
>>> namespace. Allowing that will just make the cgroup namespace
>>> pretty-much meaningless.
>>
>> The problem with pinning is that preventing it causes problems
>> (specifically, either something potentially complex and incompatible
>> needs to be added or unprivileged processes will be able to pin
>> themselves).
>>
>> Unless I'm missing something, a normal cgroupns user doesn't actually
>> need kernel pinning support to effectively constrain its members'
>> cgroups.
>>
>
> So there are 2 scenarios to consider:
>
> We have 2 containers with cgroups: /container1 and /container2
> Assume process P is running under cgroupns-root '/container1'
>
> (1) process P wants to 'write' to cgroup.procs outside its
> cgroupns-root (say to /container2/cgroup.procs)

This, at least, doesn't have the problem with unprivileged processes
pinning themselves.

> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
> with cgroupns-root above /container1) wants to write pid of process P
> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>
> For (1), I think its ok to reject such a write. This is consistent
> with the restriction in cgroup_file_write added in 'Patch 6' of this
> set. I believe this should be independent of visibility of the cgroup
> hierarchy for P.
>
> For (2), we may allow the write to succeed if we make sure that the
> process doing the write is an admin process (with CAP_SYS_ADMIN in its
> userns AND over P's cgroupns->user_ns).

Why is its userns relevant?

Why not just check whether the target cgroup is in the process doing
the write's cgroupns? (NB: you need to check f_cred, here, not
current_cred(), but that's orthogonal.) Then the policy becomes: no
user of cgroupfs can move any process outside of the cgroupfs's user's
cgroupns root.

I think I'm okay with this.

> If this write succeeds, then:
> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
> by 'self' or any other process in P's cgrgroupns. I would really like
> to avoid showing relative paths or paths outside the cgroupns-root

The empty string seems just as problematic to me.

> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
> only able to mount and see cgroup hierarchy under its cgroupns-root
> (d) if process P tries to write to any cgroup file outside of its
> cgroupns-root (assuming that hierarchy is visible to it for whatever
> reason), it will fail as in (1)

I'm still unconvinced that this serves any purpose. If you give
DAC/MAC permission to a task to write to something, and you give it
access to an fd or mount pointing there, and you don't want it writing
there, then *don't do that*. I'm not really seeing why cgroupns needs
special treatment here.

>
> i.e., in summary, you can't escape out of cgroupns-root yourself. You
> will need help from an admin process running under some parent
> cgroupns-root to move you out. Is that workable for your usecase? Most
> of the things above already happen with the current patch-set, so it
> should be easy to enable this.
>
> Though there are still some open issues like:
> * what happens if you move all the processes out of /container1 and
> then 'rmdir /container1'? As it is now, you won't be able to setns()
> to that cgroupns anymore. But the cgroupns will still hang around
> until the processes switch their cgroupns.

Seems okay.

> * should we then also allow setns() without first entering the
> cgroupns-root? setns also checks the same conditions as in (a) plus it
> checks that your current cgroup is descendant of target cgroupns-root.
> Alternatively we can special-case setns() to own cgroupns so that it
> doesn't fail.

I think setns should completely ignore the caller's cgroup and should
not change it. Userspace can do this.

> * migration for these processes will be tricky, if not impossible. But
> the admin trying to do this probably doesn't care about it or will
> provision for it.

Migration for processes in a mntns that have a current directory
outside their mntns is also difficult or impossible. Same with
pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
procfs. Nothing new here.

--Andy

2014-10-22 18:38:19

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <[email protected]> wrote:
>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[email protected]> wrote:
>>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <[email protected]> wrote:
>>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <[email protected]> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <[email protected]> wrote:
>>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>>> implementation.
>>>>>>>
>>>>>>> Could be. I'll defer to Aditya for that one.
>>>>>>>
>>>>>>
>>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>>> entering the container should also be implicitly contained within the
>>>>>> cgroup-root of that container.
>>>>>
>>>>> Why? Concretely, why should this be in the kernel namespace code
>>>>> instead of in userspace?
>>>>>
>>>>
>>>> Userspace can do it too. Though then there will be possibility of
>>>> having processes in the same mount namespace with different
>>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>>> more complex. Thats another reason why it might not be good idea to
>>>> tie cgroups with mount namespace.
>>>>
>>>>>> Implementing pivot_cgroup_root would
>>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>>
>>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>>> to remain inside their container.
>>>>>
>>>>> So don't let them out. None of the other namespaces have this kind of
>>>>> constraint:
>>>>>
>>>>> - If you're in a mntns, you can still use fds from outside.
>>>>> - If you're in a netns, you can still use sockets from outside the namespace.
>>>>> - If you're in an ipcns, you can still use ipc handles from outside.
>>>>
>>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>>> handles in the outside namespace. I think moving a process outside of
>>>> cgroupns-root is like allocating a resource outside of your namespace.
>>>
>>> In a pidns, you can see outside tasks if you have an outside procfs
>>> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
>>> like that? You wouldn't be able to escape your cgroup as long as you
>>> don't have an inappropriate cgroupfs mounted.
>>>
>>
>> I am not if we should only depend on restricted visibility for this
>> though. More details below.
>>
>>>
>>>>>
>>>>>> And with explicit permission from
>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>> semantics simple.
>>>>>
>>>>> I actually think it makes the semantics more complex. The less policy
>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>> that policy.
>>>>>
>>>>
>>>> My inclination is towards keeping things simpler - both in code as
>>>> well as in configuration. I agree that cgroupns might seem
>>>> "less-flexible", but in its current form, it encourages consistent
>>>> container configuration. If you have a process that needs to move
>>>> around between cgroups belonging to different containers, then that
>>>> process should probably not be inside any container's cgroup
>>>> namespace. Allowing that will just make the cgroup namespace
>>>> pretty-much meaningless.
>>>
>>> The problem with pinning is that preventing it causes problems
>>> (specifically, either something potentially complex and incompatible
>>> needs to be added or unprivileged processes will be able to pin
>>> themselves).
>>>
>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>> need kernel pinning support to effectively constrain its members'
>>> cgroups.
>>>
>>
>> So there are 2 scenarios to consider:
>>
>> We have 2 containers with cgroups: /container1 and /container2
>> Assume process P is running under cgroupns-root '/container1'
>>
>> (1) process P wants to 'write' to cgroup.procs outside its
>> cgroupns-root (say to /container2/cgroup.procs)
>
> This, at least, doesn't have the problem with unprivileged processes
> pinning themselves.
>
>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>> with cgroupns-root above /container1) wants to write pid of process P
>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>
>> For (1), I think its ok to reject such a write. This is consistent
>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>> set. I believe this should be independent of visibility of the cgroup
>> hierarchy for P.
>>
>> For (2), we may allow the write to succeed if we make sure that the
>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>> userns AND over P's cgroupns->user_ns).
>
> Why is its userns relevant?
>
> Why not just check whether the target cgroup is in the process doing
> the write's cgroupns? (NB: you need to check f_cred, here, not
> current_cred(), but that's orthogonal.) Then the policy becomes: no
> user of cgroupfs can move any process outside of the cgroupfs's user's
> cgroupns root.
>
Humm .. it doesn't have to be. I think its simpler to not enforce
artificial permission checks unless there is a security concern (and
in this case, there doesn't seem to be any). So I will leave the
capability check out from here.

> I think I'm okay with this.
>
>> If this write succeeds, then:
>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>> by 'self' or any other process in P's cgrgroupns. I would really like
>> to avoid showing relative paths or paths outside the cgroupns-root
>
> The empty string seems just as problematic to me.

Actually, there is no right answer here. Our options are:
* show relative path
-- this will break userspace as /proc/<pid>/cgroup does not show
relative paths today. This is also very ambiguous (is it relative to
cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).

* show absolute path
-- this will also wrong as the process won't be able to make sense of
it unless it has exposure to the global cgroup hierarchy.
-- worse case is this that the global path also exists under the
cgroupns-root ... so now the process thinks its in completely wrong
cgroup
-- this exposes system

* show only "/"
-- this is arguably better, but if the process tires to verify that
its pid is in cgroup.procs of the cgroupns-root, its in for a
surprise!

In either case, whatever we expose, the userspace won't be able to use
this path correctly (worse yet, it associates wrong cgroup for that
path). So I think its best to not print out the line for default
hierarchy at all. This happens today when cgroupfs is not mounted. I
am open to other suggestions.

>
>> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
>> only able to mount and see cgroup hierarchy under its cgroupns-root
>> (d) if process P tries to write to any cgroup file outside of its
>> cgroupns-root (assuming that hierarchy is visible to it for whatever
>> reason), it will fail as in (1)
>
> I'm still unconvinced that this serves any purpose. If you give
> DAC/MAC permission to a task to write to something, and you give it
> access to an fd or mount pointing there, and you don't want it writing
> there, then *don't do that*. I'm not really seeing why cgroupns needs
> special treatment here.
>

There was a suggestion on the previous version of this patch-set that
we need to prevent processes inside cgroupns to not be able to modify
settings of cgroups outside of its cgroupns-root. But I agree with
your point that cgroupns should not enforce unnecessary access-control
restrictions. Its job is only to virtualize the view of
/proc/<pid>/cgroup file as much as possible (100% virtualized for a
correctly setup container). This will get rid of most of patch 6/8
"cgroup: restrict cgroup operations within task's cgroupns" of this
series. The only check we keep is in cgroup_attach_task() which
ensures that target-cgroup is descendant of current's cgroupns-root
and prevents processes from escaping their cgroupns on their own.

>>
>> i.e., in summary, you can't escape out of cgroupns-root yourself. You
>> will need help from an admin process running under some parent
>> cgroupns-root to move you out. Is that workable for your usecase? Most
>> of the things above already happen with the current patch-set, so it
>> should be easy to enable this.
>>
>> Though there are still some open issues like:
>> * what happens if you move all the processes out of /container1 and
>> then 'rmdir /container1'? As it is now, you won't be able to setns()
>> to that cgroupns anymore. But the cgroupns will still hang around
>> until the processes switch their cgroupns.
>
> Seems okay.
>
>> * should we then also allow setns() without first entering the
>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>> checks that your current cgroup is descendant of target cgroupns-root.
>> Alternatively we can special-case setns() to own cgroupns so that it
>> doesn't fail.
>
> I think setns should completely ignore the caller's cgroup and should
> not change it. Userspace can do this.
>

All above changes more or less means that tasks cannot pin themselves
by unsharing cgroupns. Do you agree that we don't need that "explicit
permission from cgroupfs" anymore (via cgroup.may_unshare file or
other mechanism)?

>> * migration for these processes will be tricky, if not impossible. But
>> the admin trying to do this probably doesn't care about it or will
>> provision for it.
>
> Migration for processes in a mntns that have a current directory
> outside their mntns is also difficult or impossible. Same with
> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
> procfs. Nothing new here.
>
> --Andy

Thanks for the review!

--
Aditya

2014-10-22 18:50:33

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

On Wed, Oct 22, 2014 at 11:37 AM, Aditya Kali <[email protected]> wrote:
> On Tue, Oct 21, 2014 at 5:58 PM, Andy Lutomirski <[email protected]> wrote:
>> On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <[email protected]> wrote:
>>> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <[email protected]> wrote:
>>>>>>
>>>>>>> And with explicit permission from
>>>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>>>> suggested previously), we can make sure that unprivileged processes
>>>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>>>> semantics simple.
>>>>>>
>>>>>> I actually think it makes the semantics more complex. The less policy
>>>>>> you stick in the kernel, the easier it is to understand the impact of
>>>>>> that policy.
>>>>>>
>>>>>
>>>>> My inclination is towards keeping things simpler - both in code as
>>>>> well as in configuration. I agree that cgroupns might seem
>>>>> "less-flexible", but in its current form, it encourages consistent
>>>>> container configuration. If you have a process that needs to move
>>>>> around between cgroups belonging to different containers, then that
>>>>> process should probably not be inside any container's cgroup
>>>>> namespace. Allowing that will just make the cgroup namespace
>>>>> pretty-much meaningless.
>>>>
>>>> The problem with pinning is that preventing it causes problems
>>>> (specifically, either something potentially complex and incompatible
>>>> needs to be added or unprivileged processes will be able to pin
>>>> themselves).
>>>>
>>>> Unless I'm missing something, a normal cgroupns user doesn't actually
>>>> need kernel pinning support to effectively constrain its members'
>>>> cgroups.
>>>>
>>>
>>> So there are 2 scenarios to consider:
>>>
>>> We have 2 containers with cgroups: /container1 and /container2
>>> Assume process P is running under cgroupns-root '/container1'
>>>
>>> (1) process P wants to 'write' to cgroup.procs outside its
>>> cgroupns-root (say to /container2/cgroup.procs)
>>
>> This, at least, doesn't have the problem with unprivileged processes
>> pinning themselves.
>>
>>> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
>>> with cgroupns-root above /container1) wants to write pid of process P
>>> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>>>
>>> For (1), I think its ok to reject such a write. This is consistent
>>> with the restriction in cgroup_file_write added in 'Patch 6' of this
>>> set. I believe this should be independent of visibility of the cgroup
>>> hierarchy for P.
>>>
>>> For (2), we may allow the write to succeed if we make sure that the
>>> process doing the write is an admin process (with CAP_SYS_ADMIN in its
>>> userns AND over P's cgroupns->user_ns).
>>
>> Why is its userns relevant?
>>
>> Why not just check whether the target cgroup is in the process doing
>> the write's cgroupns? (NB: you need to check f_cred, here, not
>> current_cred(), but that's orthogonal.) Then the policy becomes: no
>> user of cgroupfs can move any process outside of the cgroupfs's user's
>> cgroupns root.
>>
> Humm .. it doesn't have to be. I think its simpler to not enforce
> artificial permission checks unless there is a security concern (and
> in this case, there doesn't seem to be any). So I will leave the
> capability check out from here.
>
>> I think I'm okay with this.
>>
>>> If this write succeeds, then:
>>> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
>>> by 'self' or any other process in P's cgrgroupns. I would really like
>>> to avoid showing relative paths or paths outside the cgroupns-root
>>
>> The empty string seems just as problematic to me.
>
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).
>

Confused now. If ".." in /proc/pid/group would be ambiguous, then so
would a path relative to cgroupns root, right? Or am I missing
something?

(I'm not saying that ".." is beautiful or that it won't confuse
things. I'm just not sure why it's ambiguous.)

> * show absolute path
> -- this will also wrong as the process won't be able to make sense of
> it unless it has exposure to the global cgroup hierarchy.
> -- worse case is this that the global path also exists under the
> cgroupns-root ... so now the process thinks its in completely wrong
> cgroup
> -- this exposes system
>
> * show only "/"
> -- this is arguably better, but if the process tires to verify that
> its pid is in cgroup.procs of the cgroupns-root, its in for a
> surprise!
>
> In either case, whatever we expose, the userspace won't be able to use
> this path correctly (worse yet, it associates wrong cgroup for that
> path). So I think its best to not print out the line for default
> hierarchy at all. This happens today when cgroupfs is not mounted. I
> am open to other suggestions.

I suppose that ".." is a possible security problem. If I can force
you to see lots of ..s in there, then I might be about to get you to
write outside cgroupfs.

Grr. No great solution here. I suppose that the empty string isn't
so bad. We could also write something obviously invalid like
"(unreachable)". As long as no one actually creates a cgroup called
"(unreachable)", then this could result in errors but not actual
confusion.

>>> * should we then also allow setns() without first entering the
>>> cgroupns-root? setns also checks the same conditions as in (a) plus it
>>> checks that your current cgroup is descendant of target cgroupns-root.
>>> Alternatively we can special-case setns() to own cgroupns so that it
>>> doesn't fail.
>>
>> I think setns should completely ignore the caller's cgroup and should
>> not change it. Userspace can do this.
>>
>
> All above changes more or less means that tasks cannot pin themselves
> by unsharing cgroupns. Do you agree that we don't need that "explicit
> permission from cgroupfs" anymore (via cgroup.may_unshare file or
> other mechanism)?

Yes, I agree.

>
>>> * migration for these processes will be tricky, if not impossible. But
>>> the admin trying to do this probably doesn't care about it or will
>>> provision for it.
>>
>> Migration for processes in a mntns that have a current directory
>> outside their mntns is also difficult or impossible. Same with
>> pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
>> procfs. Nothing new here.
>>
>> --Andy
>
> Thanks for the review!

No problem.

2014-10-22 19:06:47

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns

On Fri, Oct 17, 2014 at 2:28 AM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> Restrict following operations within the calling tasks:
>> * cgroup_mkdir & cgroup_rmdir
>> * cgroup_attach_task
>> * writes to cgroup files outside of task's cgroupns-root
>>
>> Also, read of /proc/<pid>/cgroup file is now restricted only
>> to tasks under same cgroupns-root. If a task tries to look
>> at cgroup of another task outside of its cgroupns-root, then
>> it won't be able to see anything for the default hierarchy.
>> This is same as if the cgroups are not mounted.
>>
>> Signed-off-by: Aditya Kali <[email protected]>
>
> So this is a bit different from some other namespaces - if I
> have an open fd to a file, then setns into a mntns where that
> file is not addressable, I can still use the file.
>
> I guess not allowing attach to a cgroup outside our ns is a
> good failsafe as we'll otherwise risk falling off a cliff in
> some code, but I'm not sure the cgroup_file_write/mkdir/rmdir
> restrictions are needed. (And really I can fchdir to a
> directory not in my ns, so the cgroup-attach restriction is
> any more justified).
>

As discussed on another thread, most of the restrictions in this patch
are undesirable and will be removed in the next version. Even the
restriction in cgroup_attach_task() will change to something like:

- if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
+ if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(current)))
return -EPERM;

i.e., we don't care the cgroup of the process being moved. We only
check if the writer has access to the dst_cgrp.

So I will just drop this patch in the next version and merge the
cgroup_attach_task() change in another patch.

> Still I'm not strictly opposed ot this, so
>
> Acked-by: Serge Hallyn <[email protected]>
>
> just wanted to point this out.
>
>> ---
>> kernel/cgroup.c | 34 +++++++++++++++++++++++++++++++++-
>> 1 file changed, 33 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index f8099b4..2fc0dfa 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -2318,6 +2318,12 @@ static int cgroup_attach_task(struct cgroup *dst_cgrp,
>> struct task_struct *task;
>> int ret;
>>
>> + /* Only allow changing cgroups accessible within task's cgroup
>> + * namespace. i.e. 'dst_cgrp' should be a descendant of task's
>> + * cgroupns->root_cgrp. */
>> + if (!cgroup_is_descendant(dst_cgrp, task_cgroupns_root(leader)))
>> + return -EPERM;
>> +
>> /* look up all src csets */
>> down_read(&css_set_rwsem);
>> rcu_read_lock();
>> @@ -2882,6 +2888,10 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
>> struct cgroup_subsys_state *css;
>> int ret;
>>
>> + /* Reject writes to cgroup files outside of task's cgroupns-root. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + return -EINVAL;
>> +
>> if (cft->write)
>> return cft->write(of, buf, nbytes, off);
>>
>> @@ -4560,6 +4570,13 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
>> parent = cgroup_kn_lock_live(parent_kn);
>> if (!parent)
>> return -ENODEV;
>> +
>> + /* Allow mkdir only within process's cgroup namespace root. */
>> + if (!cgroup_is_descendant(parent, task_cgroupns_root(current))) {
>> + ret = -EPERM;
>> + goto out_unlock;
>> + }
>> +
>> root = parent->root;
>>
>> /* allocate the cgroup and its ID, 0 is reserved for the root */
>> @@ -4822,6 +4839,13 @@ static int cgroup_rmdir(struct kernfs_node *kn)
>> if (!cgrp)
>> return 0;
>>
>> + /* Allow rmdir only within process's cgroup namespace root.
>> + * The process can't delete its own root anyways. */
>> + if (!cgroup_is_descendant(cgrp, task_cgroupns_root(current))) {
>> + cgroup_kn_unlock(kn);
>> + return -EPERM;
>> + }
>> +
>> ret = cgroup_destroy_locked(cgrp);
>>
>> cgroup_kn_unlock(kn);
>> @@ -5051,6 +5075,15 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>> if (root == &cgrp_dfl_root && !cgrp_dfl_root_visible)
>> continue;
>>
>> + cgrp = task_cgroup_from_root(tsk, root);
>> +
>> + /* The cgroup path on default hierarchy is shown only if it
>> + * falls under current task's cgroupns-root.
>> + */
>> + if (root == &cgrp_dfl_root &&
>> + !cgroup_is_descendant(cgrp, task_cgroupns_root(current)))
>> + continue;
>> +
>> seq_printf(m, "%d:", root->hierarchy_id);
>> for_each_subsys(ss, ssid)
>> if (root->subsys_mask & (1 << ssid))
>> @@ -5059,7 +5092,6 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
>> seq_printf(m, "%sname=%s", count ? "," : "",
>> root->name);
>> seq_putc(m, ':');
>> - cgrp = task_cgroup_from_root(tsk, root);
>> path = cgroup_path(cgrp, buf, PATH_MAX);
>> if (!path) {
>> retval = -ENAMETOOLONG;
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the reiview!

--
Aditya

2014-10-22 19:42:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

Hello,

On Wed, Oct 22, 2014 at 11:37:55AM -0700, Aditya Kali wrote:
...
> Actually, there is no right answer here. Our options are:
> * show relative path
> -- this will break userspace as /proc/<pid>/cgroup does not show
> relative paths today. This is also very ambiguous (is it relative to
> cgroupns-root or relative to /proc/<pid>cgroup file reader's cgroup?).

Let's go with this w/o pinning. The only necessary feature for
cgroupns is making the /proc/*/cgroups relative to its own root. It's
not like containers can avoid trusting its outside world anyway and
playing tricks with things like this tend to lead to weird surprises
down the road. If userland messes up, userland messes up.

Thanks.

--
tejun

2014-10-24 01:03:26

by Aditya Kali

[permalink] [raw]
Subject: Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

I will include the suggested changes in the new patchset. Some comments inline.

On Thu, Oct 16, 2014 at 9:37 AM, Serge E. Hallyn <[email protected]> wrote:
> Quoting Aditya Kali ([email protected]):
>> Introduce the ability to create new cgroup namespace. The newly created
>> cgroup namespace remembers the 'struct cgroup *root_cgrp' at the point
>> of creation of the cgroup namespace. The task that creates the new
>> cgroup namespace and all its future children will now be restricted only
>> to the cgroup hierarchy under this root_cgrp.
>> The main purpose of cgroup namespace is to virtualize the contents
>> of /proc/self/cgroup file. Processes inside a cgroup namespace
>> are only able to see paths relative to their namespace root.
>> This allows container-tools (like libcontainer, lxc, lmctfy, etc.)
>> to create completely virtualized containers without leaking system
>> level cgroup hierarchy to the task.
>> This patch only implements the 'unshare' part of the cgroupns.
>>
>> Signed-off-by: Aditya Kali <[email protected]>
>
> I'm not sure that the CONFIG_CGROUP_NS is worthwhile. If you already
> have cgroups in the kernel this won't add much in the way of memory
> usage, right? And I think the 'experimental' argument has long since
> been squashed. So I'd argue for simplifying this patch by removing
> CONFIG_CGROUP_NS.
>

With no pinning involved, I think its safe to enable the feature
without needing a config option. Removed it from next version. This
feature is now implicitly available with CONFIG_CGROUPS.

> (more below)
>
>> ---
>> fs/proc/namespaces.c | 3 +
>> include/linux/cgroup.h | 18 +++++-
>> include/linux/cgroup_namespace.h | 62 +++++++++++++++++++
>> include/linux/nsproxy.h | 2 +
>> include/linux/proc_ns.h | 4 ++
>> init/Kconfig | 9 +++
>> kernel/Makefile | 1 +
>> kernel/cgroup.c | 11 ++++
>> kernel/cgroup_namespace.c | 128 +++++++++++++++++++++++++++++++++++++++
>> kernel/fork.c | 2 +-
>> kernel/nsproxy.c | 19 +++++-
>> 11 files changed, 255 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
>> index 8902609..e04ed4b 100644
>> --- a/fs/proc/namespaces.c
>> +++ b/fs/proc/namespaces.c
>> @@ -32,6 +32,9 @@ static const struct proc_ns_operations *ns_entries[] = {
>> &userns_operations,
>> #endif
>> &mntns_operations,
>> +#ifdef CONFIG_CGROUP_NS
>> + &cgroupns_operations,
>> +#endif
>> };
>>
>> static const struct file_operations ns_file_operations = {
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 4a0eb2d..aa86495 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -22,6 +22,8 @@
>> #include <linux/seq_file.h>
>> #include <linux/kernfs.h>
>> #include <linux/wait.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/types.h>
>>
>> #ifdef CONFIG_CGROUPS
>>
>> @@ -460,6 +462,13 @@ struct cftype {
>> #endif
>> };
>>
>> +struct cgroup_namespace {
>> + atomic_t count;
>> + unsigned int proc_inum;
>> + struct user_namespace *user_ns;
>> + struct cgroup *root_cgrp;
>> +};
>> +
>> extern struct cgroup_root cgrp_dfl_root;
>> extern struct css_set init_css_set;
>>
>> @@ -584,10 +593,17 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
>> return kernfs_name(cgrp->kn, buf, buflen);
>> }
>>
>> +static inline char * __must_check cgroup_path_ns(struct cgroup_namespace *ns,
>> + struct cgroup *cgrp, char *buf,
>> + size_t buflen)
>> +{
>> + return kernfs_path_from_node(ns->root_cgrp->kn, cgrp->kn, buf, buflen);
>> +}
>> +
>> static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
>> size_t buflen)
>> {
>> - return kernfs_path(cgrp->kn, buf, buflen);
>> + return cgroup_path_ns(current->nsproxy->cgroup_ns, cgrp, buf, buflen);
>> }
>>
>> static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
>> diff --git a/include/linux/cgroup_namespace.h b/include/linux/cgroup_namespace.h
>> new file mode 100644
>> index 0000000..9f637fe
>> --- /dev/null
>> +++ b/include/linux/cgroup_namespace.h
>> @@ -0,0 +1,62 @@
>> +#ifndef _LINUX_CGROUP_NAMESPACE_H
>> +#define _LINUX_CGROUP_NAMESPACE_H
>> +
>> +#include <linux/nsproxy.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/types.h>
>> +#include <linux/user_namespace.h>
>> +
>> +extern struct cgroup_namespace init_cgroup_ns;
>> +
>> +static inline struct cgroup *task_cgroupns_root(struct task_struct *tsk)
>> +{
>> + return tsk->nsproxy->cgroup_ns->root_cgrp;
>
> Per the rules in nsproxy.h, you should be taking the task_lock here.
>
> (If you are making assumptions about tsk then you need to state them
> here - I only looked quickly enough that you pass in 'leader')
>

In the new version of the patch, we call this function only for the
'current' task. As per nsproxy.h, no special precautions needed when
reading current task's nsproxy. So I just remodeled this function into
"current_cgroupns_root(void)".

>> +}
>> +
>> +#ifdef CONFIG_CGROUP_NS
>> +
>> +extern void free_cgroup_ns(struct cgroup_namespace *ns);
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> + struct cgroup_namespace *ns)
>> +{
>> + if (ns)
>> + atomic_inc(&ns->count);
>> + return ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> + if (ns && atomic_dec_and_test(&ns->count))
>> + free_cgroup_ns(ns);
>> +}
>> +
>> +extern struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> + struct user_namespace *user_ns,
>> + struct cgroup_namespace *old_ns);
>> +
>> +#else /* CONFIG_CGROUP_NS */
>> +
>> +static inline struct cgroup_namespace *get_cgroup_ns(
>> + struct cgroup_namespace *ns)
>> +{
>> + return &init_cgroup_ns;
>> +}
>> +
>> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> +}
>> +
>> +static inline struct cgroup_namespace *copy_cgroup_ns(
>> + unsigned long flags,
>> + struct user_namespace *user_ns,
>> + struct cgroup_namespace *old_ns) {
>> + if (flags & CLONE_NEWCGROUP)
>> + return ERR_PTR(-EINVAL);
>> +
>> + return old_ns;
>> +}
>> +
>> +#endif /* CONFIG_CGROUP_NS */
>> +
>> +#endif /* _LINUX_CGROUP_NAMESPACE_H */
>> diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
>> index 35fa08f..ac0d65b 100644
>> --- a/include/linux/nsproxy.h
>> +++ b/include/linux/nsproxy.h
>> @@ -8,6 +8,7 @@ struct mnt_namespace;
>> struct uts_namespace;
>> struct ipc_namespace;
>> struct pid_namespace;
>> +struct cgroup_namespace;
>> struct fs_struct;
>>
>> /*
>> @@ -33,6 +34,7 @@ struct nsproxy {
>> struct mnt_namespace *mnt_ns;
>> struct pid_namespace *pid_ns_for_children;
>> struct net *net_ns;
>> + struct cgroup_namespace *cgroup_ns;
>> };
>> extern struct nsproxy init_nsproxy;
>>
>> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
>> index 34a1e10..e56dd73 100644
>> --- a/include/linux/proc_ns.h
>> +++ b/include/linux/proc_ns.h
>> @@ -6,6 +6,8 @@
>>
>> struct pid_namespace;
>> struct nsproxy;
>> +struct task_struct;
>> +struct inode;
>>
>> struct proc_ns_operations {
>> const char *name;
>> @@ -27,6 +29,7 @@ extern const struct proc_ns_operations ipcns_operations;
>> extern const struct proc_ns_operations pidns_operations;
>> extern const struct proc_ns_operations userns_operations;
>> extern const struct proc_ns_operations mntns_operations;
>> +extern const struct proc_ns_operations cgroupns_operations;
>>
>> /*
>> * We always define these enumerators
>> @@ -37,6 +40,7 @@ enum {
>> PROC_UTS_INIT_INO = 0xEFFFFFFEU,
>> PROC_USER_INIT_INO = 0xEFFFFFFDU,
>> PROC_PID_INIT_INO = 0xEFFFFFFCU,
>> + PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
>> };
>>
>> #ifdef CONFIG_PROC_FS
>> diff --git a/init/Kconfig b/init/Kconfig
>> index e84c642..c3be001 100644
>> --- a/init/Kconfig
>> +++ b/init/Kconfig
>> @@ -1144,6 +1144,15 @@ config DEBUG_BLK_CGROUP
>> Enable some debugging help. Currently it exports additional stat
>> files in a cgroup which can be useful for debugging.
>>
>> +config CGROUP_NS
>> + bool "CGroup Namespaces"
>> + default n
>> + help
>> + This options enables CGroup Namespaces which can be used to isolate
>> + cgroup paths. This feature is only useful when unified cgroup
>> + hierarchy is in use (i.e. cgroups are mounted with sane_behavior
>> + option).
>> +
>> endif # CGROUPS
>>
>> config CHECKPOINT_RESTORE
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index dc5c775..75334f8 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -51,6 +51,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
>> obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
>> obj-$(CONFIG_COMPAT) += compat.o
>> obj-$(CONFIG_CGROUPS) += cgroup.o
>> +obj-$(CONFIG_CGROUP_NS) += cgroup_namespace.o
>> obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
>> obj-$(CONFIG_CPUSETS) += cpuset.o
>> obj-$(CONFIG_UTS_NS) += utsname.o
>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>> index 2b3e9f9..f8099b4 100644
>> --- a/kernel/cgroup.c
>> +++ b/kernel/cgroup.c
>> @@ -57,6 +57,8 @@
>> #include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
>> #include <linux/kthread.h>
>> #include <linux/delay.h>
>> +#include <linux/proc_ns.h>
>> +#include <linux/cgroup_namespace.h>
>>
>> #include <linux/atomic.h>
>>
>> @@ -195,6 +197,15 @@ static void kill_css(struct cgroup_subsys_state *css);
>> static int cgroup_addrm_files(struct cgroup *cgrp, struct cftype cfts[],
>> bool is_add);
>>
>> +struct cgroup_namespace init_cgroup_ns = {
>> + .count = {
>> + .counter = 1,
>> + },
>> + .proc_inum = PROC_CGROUP_INIT_INO,
>> + .user_ns = &init_user_ns,
>
> This might mean that you should bump the init_user_ns refcount.
>

Humm. Doesn't look like all other namespaces are doing it though (ex:
init_pid_ns or init_ipc_ns). The initial count in init_user_ns is set
to 3 which only accounts for some current users, but not all. I will
increment it for init_cgroup_ns nevertheless (in cgroup_init()).

>> + .root_cgrp = &cgrp_dfl_root.cgrp,
>> +};
>> +
>> /* IDR wrappers which synchronize using cgroup_idr_lock */
>> static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>> gfp_t gfp_mask)
>> diff --git a/kernel/cgroup_namespace.c b/kernel/cgroup_namespace.c
>> new file mode 100644
>> index 0000000..c16604f
>> --- /dev/null
>> +++ b/kernel/cgroup_namespace.c
>> @@ -0,0 +1,128 @@
>> +
>> +#include <linux/cgroup.h>
>> +#include <linux/cgroup_namespace.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/nsproxy.h>
>> +#include <linux/proc_ns.h>
>> +
>> +static struct cgroup_namespace *alloc_cgroup_ns(void)
>> +{
>> + struct cgroup_namespace *new_ns;
>> +
>> + new_ns = kmalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
>> + if (new_ns)
>> + atomic_set(&new_ns->count, 1);
>> + return new_ns;
>> +}
>> +
>> +void free_cgroup_ns(struct cgroup_namespace *ns)
>> +{
>> + cgroup_put(ns->root_cgrp);
>> + put_user_ns(ns->user_ns);
>
> This is a problem on error patch in copy_cgroup_ns. The
> alloc_cgroup_ns() doesn't initialize these values, so if
> you should fail in proc_alloc_inum() you'll show up here
> with fandom values in ns->*.
>

I don't see the codepath that leads to calling free_cgroup_ns() with
uninitialized members. We don't call free_cgroup_ns() on the error
path in copy_cgroup_ns().

>> + proc_free_inum(ns->proc_inum);

BTW, I was missing the actual kfree(ns) here. Added it.

>> +}
>> +EXPORT_SYMBOL(free_cgroup_ns);
>> +
>> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
>> + struct user_namespace *user_ns,
>> + struct cgroup_namespace *old_ns)
>> +{
>> + struct cgroup_namespace *new_ns = NULL;
>> + struct cgroup *cgrp = NULL;
>> + int err;
>> +
>> + BUG_ON(!old_ns);
>> +
>> + if (!(flags & CLONE_NEWCGROUP))
>> + return get_cgroup_ns(old_ns);
>> +
>> + /* Allow only sysadmin to create cgroup namespace. */
>> + err = -EPERM;
>> + if (!ns_capable(user_ns, CAP_SYS_ADMIN))
>> + goto err_out;
>> +
>> + /* Prevent cgroup changes for this task. */
>> + threadgroup_lock(current);
>> +
>> + cgrp = get_task_cgroup(current);
>> +
>> + /* Creating new CGROUPNS is supported only when unified hierarchy is in
>> + * use. */
>
> Oh, drat. Well, I'll take, it, but under protest :)
>

Actually, I realized that this comment and the check below is bogus.
The 'get_task_cgroup(current)' always only returns the cgroup on the
default hierarchy. And so, the check below is unnecessary.
What this comment should really say is that cgroup namespace only
virtualizes the cgroup path for the default(unified) hierarchy. Its
fine if you have other hierarchies mounted too. Just that for those
hierarchies, full (non-virtualized) cgroup path will be visible in
/proc/self/cgroup. So cgroupns won't help there.

I have updated the comment in the new version of the patch.

>> + err = -EINVAL;
>> + if (!cgroup_on_dfl(cgrp))
>> + goto err_out_unlock;
>> +
>> + err = -ENOMEM;
>> + new_ns = alloc_cgroup_ns();
>> + if (!new_ns)
>> + goto err_out_unlock;
>> +
>> + err = proc_alloc_inum(&new_ns->proc_inum);
>> + if (err)
>> + goto err_out_unlock;
>> +
>> + new_ns->user_ns = get_user_ns(user_ns);
>> + new_ns->root_cgrp = cgrp;
>> +
>> + threadgroup_unlock(current);
>> +
>> + return new_ns;
>> +
>> +err_out_unlock:
>> + threadgroup_unlock(current);
>> +err_out:
>> + if (cgrp)
>> + cgroup_put(cgrp);
>> + kfree(new_ns);
>> + return ERR_PTR(err);
>> +}
>> +
>> +static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
>> +{
>> + pr_info("setns not supported for cgroup namespace");
>> + return -EINVAL;
>> +}
>> +
>> +static void *cgroupns_get(struct task_struct *task)
>> +{
>> + struct cgroup_namespace *ns = NULL;
>> + struct nsproxy *nsproxy;
>> +
>> + rcu_read_lock();
>> + nsproxy = task->nsproxy;
>> + if (nsproxy) {
>> + ns = nsproxy->cgroup_ns;
>> + get_cgroup_ns(ns);
>> + }
>> + rcu_read_unlock();
>> +
>> + return ns;
>> +}
>> +
>> +static void cgroupns_put(void *ns)
>> +{
>> + put_cgroup_ns(ns);
>> +}
>> +
>> +static unsigned int cgroupns_inum(void *ns)
>> +{
>> + struct cgroup_namespace *cgroup_ns = ns;
>> +
>> + return cgroup_ns->proc_inum;
>> +}
>> +
>> +const struct proc_ns_operations cgroupns_operations = {
>> + .name = "cgroup",
>> + .type = CLONE_NEWCGROUP,
>> + .get = cgroupns_get,
>> + .put = cgroupns_put,
>> + .install = cgroupns_install,
>> + .inum = cgroupns_inum,
>> +};
>> +
>> +static __init int cgroup_namespaces_init(void)
>> +{
>> + return 0;
>> +}
>> +subsys_initcall(cgroup_namespaces_init);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 0cf9cdb..cc06851 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -1790,7 +1790,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
>> if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>> CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
>> CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
>> - CLONE_NEWUSER|CLONE_NEWPID))
>> + CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
>> return -EINVAL;
>> /*
>> * Not implemented, but pretend it works if there is nothing to
>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
>> index ef42d0a..a8b1970 100644
>> --- a/kernel/nsproxy.c
>> +++ b/kernel/nsproxy.c
>> @@ -25,6 +25,7 @@
>> #include <linux/proc_ns.h>
>> #include <linux/file.h>
>> #include <linux/syscalls.h>
>> +#include <linux/cgroup_namespace.h>
>>
>> static struct kmem_cache *nsproxy_cachep;
>>
>> @@ -39,6 +40,7 @@ struct nsproxy init_nsproxy = {
>> #ifdef CONFIG_NET
>> .net_ns = &init_net,
>> #endif
>> + .cgroup_ns = &init_cgroup_ns,
>> };
>>
>> static inline struct nsproxy *create_nsproxy(void)
>> @@ -92,6 +94,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>> goto out_pid;
>> }
>>
>> + new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
>> + tsk->nsproxy->cgroup_ns);
>> + if (IS_ERR(new_nsp->cgroup_ns)) {
>> + err = PTR_ERR(new_nsp->cgroup_ns);
>> + goto out_cgroup;
>> + }
>> +
>> new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
>> if (IS_ERR(new_nsp->net_ns)) {
>> err = PTR_ERR(new_nsp->net_ns);
>> @@ -101,6 +110,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>> return new_nsp;
>>
>> out_net:
>> + if (new_nsp->cgroup_ns)
>> + put_cgroup_ns(new_nsp->cgroup_ns);
>> +out_cgroup:
>> if (new_nsp->pid_ns_for_children)
>> put_pid_ns(new_nsp->pid_ns_for_children);
>> out_pid:
>> @@ -128,7 +140,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>> struct nsproxy *new_ns;
>>
>> if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> - CLONE_NEWPID | CLONE_NEWNET)))) {
>> + CLONE_NEWPID | CLONE_NEWNET |
>> + CLONE_NEWCGROUP)))) {
>> get_nsproxy(old_ns);
>> return 0;
>> }
>> @@ -165,6 +178,8 @@ void free_nsproxy(struct nsproxy *ns)
>> put_ipc_ns(ns->ipc_ns);
>> if (ns->pid_ns_for_children)
>> put_pid_ns(ns->pid_ns_for_children);
>> + if (ns->cgroup_ns)
>> + put_cgroup_ns(ns->cgroup_ns);
>> put_net(ns->net_ns);
>> kmem_cache_free(nsproxy_cachep, ns);
>> }
>> @@ -180,7 +195,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
>> int err = 0;
>>
>> if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
>> - CLONE_NEWNET | CLONE_NEWPID)))
>> + CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
>> return 0;
>>
>> user_ns = new_cred ? new_cred->user_ns : current_user_ns();
>> --
>> 2.1.0.rc2.206.gedb03e5
>>
>> _______________________________________________
>> Containers mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers


Thanks for the review!
--
Aditya

2014-10-25 03:16:16

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCHv1 5/8] cgroup: introduce cgroup namespaces

Quoting Aditya Kali ([email protected]):
> >> +void free_cgroup_ns(struct cgroup_namespace *ns)
> >> +{
> >> + cgroup_put(ns->root_cgrp);
> >> + put_user_ns(ns->user_ns);
> >
> > This is a problem on error patch in copy_cgroup_ns. The
> > alloc_cgroup_ns() doesn't initialize these values, so if
> > you should fail in proc_alloc_inum() you'll show up here
> > with fandom values in ns->*.
> >
>
> I don't see the codepath that leads to calling free_cgroup_ns() with
> uninitialized members. We don't call free_cgroup_ns() on the error
> path in copy_cgroup_ns().

Hm, yeah, I'm not seeing it now, sorry.

2015-07-22 18:10:38

by Vincent Batts

[permalink] [raw]
Subject: Re: [PATCHv1 0/8] CGroup Namespaces

Has there been further movement on CLONE_NEWCGROUP outside of this?


vb

On Sun, Oct 19, 2014 at 12:54 AM, Eric W. Biederman
<[email protected]> wrote:
> Aditya Kali <[email protected]> writes:
>
>> Second take at the Cgroup Namespace patch-set.
>>
>> Major changes form RFC (V0):
>> 1. setns support for cgroupns
>> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
>> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
>> 3. writes to cgroup files outside of cgroupns-root are not allowed
>> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
>> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
>> your cgroupns-root.
>>
>> More details in the writeup below.
>
> This definitely looks like the right direction to go, and something that
> in some form or another I had been asking for since cgroups were merged.
> So I am very glad to see this work moving forward.
>
> I had hoped that we might just be able to be clever with remounting
> cgroupfs but 2 things stand in the way.
> 1) /proc/<pid>/cgroups (but proc could capture that).
> 2) providing a hard guarnatee that tasks stay within a subset of the
> cgroup hierarchy.
>
> So I think this clearly meets the requirements for a new namespace.
>
> We need to have the discussion on chmod of files on cgroupfs. There is
> a notion that has floated around that only systemd or only root (with
> the appropriate capabilities) should be allowed to set resource limits
> in cgroupfs. In a practical reality that is nonsense. If an atribute
> is properly bound in it's hiearchy it should be safe to change.
>
> Not all attributes are properly bound to hierarchy and some are or at
> least were dangerous for anyone except root to set. So I suggest that a
> CFTYPE flag perhaps CFTYPE_UNPRIV be added for attributes that are safe
> to allow anyone to set, and require CFTYPE_UNPRIV be set before we chmod
> a cgroup attribute from root.
>
> That would be complimentary work, and not strictly tied the cgroup
> namespaces but unprivileged cgroup namespaces don't make much sense
> without that work.
>
> Eric
>
>> Background
>> Cgroups and Namespaces are used together to create “virtual”
>> containers that isolates the host environment from the processes
>> running in container. But since cgroups themselves are not
>> “virtualized”, the task is always able to see global cgroups view
>> through cgroupfs mount and via /proc/self/cgroup file.
>>
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> This exposure of cgroup names to the processes running inside a
>> container results in some problems:
>> (1) The container names are typically host-container-management-agent
>> (systemd, docker/libcontainer, etc.) data and leaking its name (or
>> leaking the hierarchy) reveals too much information about the host
>> system.
>> (2) It makes the container migration across machines (CRIU) more
>> difficult as the container names need to be unique across the
>> machines in the migration domain.
>> (3) It makes it difficult to run container management tools (like
>> docker/libcontainer, lmctfy, etc.) within virtual containers
>> without adding dependency on some state/agent present outside the
>> container.
>>
>> Note that the feature proposed here is completely different than the
>> “ns cgroup” feature which existed in the linux kernel until recently.
>> The ns cgroup also attempted to connect cgroups and namespaces by
>> creating a new cgroup every time a new namespace was created. It did
>> not solve any of the above mentioned problems and was later dropped
>> from the kernel. Incidentally though, it used the same config option
>> name CONFIG_CGROUP_NS as used in my prototype!
>>
>> Introducing CGroup Namespaces
>> With unified cgroup hierarchy
>> (Documentation/cgroups/unified-hierarchy.txt), the containers can now
>> have a much more coherent cgroup view and its easy to associate a
>> container with a single cgroup. This also allows us to virtualize the
>> cgroup view for tasks inside the container.
>>
>> The new CGroup Namespace allows a process to “unshare” its cgroup
>> hierarchy starting from the cgroup its currently in.
>> For Ex:
>> $ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>> $ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
>> $ ~/unshare -c # calls unshare(CLONE_NEWCGROUP) and exec’s /bin/bash
>> [ns]$ ls -l /proc/self/ns/cgroup
>> lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup ->
>> cgroup:[4026532183]
>> # From within new cgroupns, process sees that its in the root cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>>
>> # From global cgroupns:
>> $ cat /proc/<pid>/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1
>>
>> # Unshare cgroupns along with userns and mountns
>> # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
>> # sets up uid/gid map and exec’s /bin/bash
>> $ ~/unshare -c -u -m
>>
>> # Originally, we were in /batchjobs/c_job_id1 cgroup. Mount our own cgroup
>> # hierarchy.
>> [ns]$ mount -t cgroup cgroup /tmp/cgroup
>> [ns]$ ls -l /tmp/cgroup
>> total 0
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
>> -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
>> -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
>> -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
>>
>> The cgroupns-root (/batchjobs/c_job_id1 in above example) becomes the
>> filesystem root for the namespace specific cgroupfs mount.
>>
>> The virtualization of /proc/self/cgroup file combined with restricting
>> the view of cgroup hierarchy by namespace-private cgroupfs mount
>> should provide a completely isolated cgroup view inside the container.
>>
>> In its current form, the cgroup namespaces patcheset provides following
>> behavior:
>>
>> (1) The “root” cgroup for a cgroup namespace is the cgroup in which
>> the process calling unshare is running.
>> For ex. if a process in /batchjobs/c_job_id1 cgroup calls unshare,
>> cgroup /batchjobs/c_job_id1 becomes the cgroupns-root.
>> For the init_cgroup_ns, this is the real root (“/”) cgroup
>> (identified in code as cgrp_dfl_root.cgrp).
>>
>> (2) The cgroupns-root cgroup does not change even if the namespace
>> creator process later moves to a different cgroup.
>> $ ~/unshare -c # unshare cgroupns in some cgroup
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
>> [ns]$ mkdir sub_cgrp_1
>> [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/self/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (3) Each process gets its CGROUPNS specific view of
>> /proc/<pid>/cgroup.
>> (a) Processes running inside the cgroup namespace will be able to see
>> cgroup paths (in /proc/self/cgroup) only inside their root cgroup
>> [ns]$ sleep 100000 & # From within unshared cgroupns
>> [1] 7353
>> [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
>> [ns]$ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
>>
>> (b) From global cgroupns, the real cgroup path will be visible:
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>>
>> (c) From a sibling cgroupns (cgroupns root-ed at a sibling cgroup), no cgroup
>> path will be visible:
>> # ns2's cgroupns-root is at '/batchjobs/c_job_id2'
>> [ns2]$ cat /proc/7353/cgroup
>> [ns2]$
>> This is same as when cgroup hierarchy is not mounted at all.
>> (In correct container setup though, it should not be possible to
>> access PIDs in another container in the first place.)
>>
>> (4) Processes inside a cgroupns are not allowed to move out of the
>> cgroupns-root. This is true even if a privileged process in global
>> cgroupns tries to move the process out of its cgroupns-root.
>>
>> # From global cgroupns
>> $ cat /proc/7353/cgroup
>> 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/c_job_id1/sub_cgrp_1
>> # cgroupns-root for 7353 is /batchjobs/c_job_id1
>> $ echo 7353 > batchjobs/c_job_id2/cgroup.procs
>> -bash: echo: write error: Operation not permitted
>>
>> (5) Setns to another cgroup namespace is allowed only when:
>> (a) process has CAP_SYS_ADMIN in its current userns
>> (b) process has CAP_SYS_ADMIN in the target cgroupns' userns
>> (c) the process's current cgroup is a descendant cgroupns-root of the
>> target namespace.
>> (d) the target cgroupns-root is descendant of current cgroupns-root..
>> The last check (d) prevents processes from escaping their cgroupns-root by
>> attaching to parent cgroupns. Thus, setns is allowed only when the process
>> is trying to restrict itself to a deeper cgroup hierarchy.
>>
>> (6) When some thread from a multi-threaded process unshares its
>> cgroup-namespace, the new cgroupns gets applied to the entire
>> process (all the threads). This should be OK since
>> unified-hierarchy only allows process-level containerization. So
>> all the threads in the process will have the same cgroup. And both
>> - changing cgroups and unsharing namespaces - are protected under
>> threadgroup_lock(task).
>>
>> (7) The cgroup namespace is alive as long as there is atleast 1
>> process inside it. When the last process exits, the cgroup
>> namespace is destroyed. The cgroupns-root and the actual cgroups
>> remain though.
>>
>> (8) 'mount -t cgroup cgroup <mntpt>' when called from within cgroupns mounts
>> the unified cgroup hierarchy with cgroupns-root as the filesystem root.
>> The process needs CAP_SYS_ADMIN in its userns and mntns. This allows the
>> container management tools to be run inside the containers transparently.
>>
>> Implementation
>> The current patch-set is based on top of Tejun Heo's cgroup tree (for-next
>> branch). Its fairly non-intrusive and provides above mentioned
>> features.
>>
>> Possible extensions of CGROUPNS:
>> (1) The Documentation/cgroups/unified-hierarchy.txt mentions use of
>> capabilities to restrict cgroups to administrative users. CGroup
>> namespaces could be of help here. With cgroup namespaces, it might
>> be possible to delegate administration of sub-cgroups under a
>> cgroupns-root to the cgroupns owner.
>
>
>
>
>> ---
>> fs/kernfs/dir.c | 53 +++++++++---
>> fs/kernfs/mount.c | 48 +++++++++++
>> fs/proc/namespaces.c | 3 +
>> include/linux/cgroup.h | 41 +++++++++-
>> include/linux/cgroup_namespace.h | 62 +++++++++++++++
>> include/linux/kernfs.h | 5 ++
>> include/linux/nsproxy.h | 2 +
>> include/linux/proc_ns.h | 4 +
>> include/uapi/linux/sched.h | 3 +-
>> init/Kconfig | 9 +++
>> kernel/Makefile | 1 +
>> kernel/cgroup.c | 139 ++++++++++++++++++++++++++------
>> kernel/cgroup_namespace.c | 168 +++++++++++++++++++++++++++++++++++++++
>> kernel/fork.c | 2 +-
>> kernel/nsproxy.c | 19 ++++-
>> 15 files changed, 518 insertions(+), 41 deletions(-)
>> create mode 100644 include/linux/cgroup_namespace.h
>> create mode 100644 kernel/cgroup_namespace.c
>>
>> [PATCHv1 1/8] kernfs: Add API to generate relative kernfs path
>> [PATCHv1 2/8] sched: new clone flag CLONE_NEWCGROUP for cgroup
>> [PATCHv1 3/8] cgroup: add function to get task's cgroup on default
>> [PATCHv1 4/8] cgroup: export cgroup_get() and cgroup_put()
>> [PATCHv1 5/8] cgroup: introduce cgroup namespaces
>> [PATCHv1 6/8] cgroup: restrict cgroup operations within task's cgroupns
>> [PATCHv1 7/8] cgroup: cgroup namespace setns support
>> [PATCHv1 8/8] cgroup: mount cgroupns-root when inside non-init cgroupns
>> _______________________________________________
>> Containers mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers