2015-12-07 23:09:10

by Serge Hallyn

[permalink] [raw]
Subject: CGroup Namespaces (v6)

Hi,

following is a revised set of the CGroup Namespace patchset which Aditya
Kali has previously sent. The code can also be found in the cgroupns.v6
branch of

https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/

To summarize the semantics:

1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED

2. unsharing a cgroup namespace makes all your current cgroups your new
cgroup root.

3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
cgroup namespce root. A task outside of your cgroup looks like

8:memory:/../../..

4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
on the mounting task's cgroup namespace.

5. setns to a cgroup namespace switches your cgroup namespace but not
your cgroups.

With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

This is completely backward compatible and will be completely invisible
to any existing cgroup users (except for those running inside a cgroup
namespace and looking at /proc/pid/cgroup of tasks outside their
namespace.)

Changes from V5:
1. To get a root dentry for cgroup namespace mount, walk the path from the
kernfs root dentry.

Changes from V4:
1. Move the FS_USERNS_MOUNT flag to last patch
2. Rebase onto cgroup/for-4.5
3. Don't non-init user namespaces to bind new subsystems when mounting.
4. Address feedback from Tejun (thanks). Specificaly, not addressed:
. kernfs_obtain_root - walking dentry from kernfs root.
(I think that's the only piece)
5. Dropped unused get_task_cgroup fn/patch.
6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
It now finds a common ancestor, walks from the source to it, then back
up to the target.

Changes from V3:
1. Rebased onto latest cgroup changes. In particular switch to
css_set_lock and ns_common.
2. Support all hierarchies.

Changes from V2:
1. Added documentation in Documentation/cgroups/namespace.txt
2. Fixed a bug that caused crash
3. Incorporated some other suggestions from last patchset:
- removed use of threadgroup_lock() while creating new cgroupns
- use task_lock() instead of rcu_read_lock() while accessing
task->nsproxy
- optimized setns() to own cgroupns
- simplified code around sane-behavior mount option parsing
4. Restored ACKs from Serge Hallyn from v1 on few patches that have
not changed since then.

Changes from V1:
1. No pinning of processes within cgroupns. Tasks can be freely moved
across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
apply as before.
2. Path in /proc/<pid>/cgroup is now always shown and is relative to
cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
of the reader and cgroup of <pid>.
3. setns() does not require the process to first move under target
cgroupns-root.

Changes form RFC (V0):
1. setns support for cgroupns
2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
3. writes to cgroup files outside of cgroupns-root are not allowed
4. visibility of /proc/<pid>/cgroup is further restricted by not showing
anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
your cgroupns-root.


2015-12-07 23:06:35

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

From: Aditya Kali <[email protected]>

The new function kernfs_path_from_node() generates and returns kernfs
path of a given kernfs_node relative to a given parent kernfs_node.

Changelog 20151125:
- Fully-wing multilinecomments
- Rework kernfs_path_from_node_locked() logic
- Replace BUG_ONs with returning NULL
- Use a const char* for /.. and precalculate its size
Changelog 20151130:
- Update kernfs_path_from_node_locked comment

Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
---
fs/kernfs/dir.c | 182 +++++++++++++++++++++++++++++++++++++++++-------
include/linux/kernfs.h | 3 +
2 files changed, 158 insertions(+), 27 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 91e0045..7cd4bb4 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -44,28 +44,134 @@ static int kernfs_name_locked(struct kernfs_node *kn, char *buf, size_t buflen)
return strlcpy(buf, kn->parent ? kn->name : "/", buflen);
}

-static char * __must_check kernfs_path_locked(struct kernfs_node *kn, char *buf,
- size_t buflen)
+/* kernfs_node_depth - compute depth from @from to @to */
+static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
{
- char *p = buf + buflen;
- int len;
+ size_t depth = 0;

- *--p = '\0';
+ BUG_ON(!to);
+ BUG_ON(!from);

- do {
- len = strlen(kn->name);
- if (p - buf < len + 1) {
- buf[0] = '\0';
- p = NULL;
- break;
- }
- p -= len;
- memcpy(p, kn->name, len);
- *--p = '/';
- kn = kn->parent;
- } while (kn && kn->parent);
+ while (to->parent && to != from) {
+ depth++;
+ to = to->parent;
+ }
+ return depth;
+}

- return p;
+static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
+ struct kernfs_node *b)
+{
+ size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
+ size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
+
+ if (da == 0)
+ return a;
+ if (db == 0)
+ return b;
+
+ while (da > db) {
+ a = a->parent;
+ da--;
+ }
+ while (db > da) {
+ b = b->parent;
+ db--;
+ }
+
+ /* worst case b and a will be the same at root */
+ while (b != a) {
+ b = b->parent;
+ a = a->parent;
+ }
+
+ return a;
+}
+
+/**
+ * kernfs_path_from_node_locked - find a pseudo-absolute path to @kn_to,
+ * where kn_from is treated as root of the path.
+ * @kn_from: kernfs node which should be treated as root for the path
+ * @kn_to: kernfs node to which path is needed
+ * @buf: buffer to copy the path into
+ * @buflen: size of @buf
+ *
+ * We need to handle couple of scenarios here:
+ * [1] when @kn_from is an ancestor of @kn_to at some level
+ * kn_from: /n1/n2/n3
+ * kn_to: /n1/n2/n3/n4/n5
+ * result: /n4/n5
+ *
+ * [2] when @kn_from is on a different hierarchy and we need to find common
+ * ancestor between @kn_from and @kn_to.
+ * kn_from: /n1/n2/n3/n4
+ * kn_to: /n1/n2/n5
+ * result: /../../n5
+ * OR
+ * kn_from: /n1/n2/n3/n4/n5 [depth=5]
+ * kn_to: /n1/n2/n3 [depth=3]
+ * result: /../..
+ */
+static char *
+__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,
+ struct kernfs_node *kn_to, char *buf,
+ size_t buflen)
+{
+ char *p = buf;
+ struct kernfs_node *kn, *common;
+ const char parent_str[] = "/..";
+ int i;
+ size_t depth_from, depth_to, len = 0, nlen = 0,
+ plen = sizeof(parent_str) - 1;
+
+ /* We atleast need 2 bytes to write "/\0". */
+ if (buflen < 2)
+ return NULL;
+
+ if (!kn_from)
+ kn_from = kernfs_root(kn_to)->kn;
+
+ if (kn_from == kn_to) {
+ *p = '/';
+ *(++p) = '\0';
+ return buf;
+ }
+
+ common = kernfs_common_ancestor(kn_from, kn_to);
+ if (!common) {
+ WARN_ONCE("%s: kn_from and kn_to on different roots\n",
+ __func__);
+ return NULL;
+ }
+
+ depth_to = kernfs_node_distance(common, kn_to);
+ depth_from = kernfs_node_distance(common, kn_from);
+
+ for (i = 0; i < depth_from; i++) {
+ if (len + plen + 1 > buflen)
+ return NULL;
+ strcpy(p, parent_str);
+ p += plen;
+ len += plen;
+ }
+
+ /* Calculate how many bytes we need for the rest */
+ for (kn = kn_to; kn != common; kn = kn->parent)
+ nlen += strlen(kn->name) + 1;
+
+ if (len + nlen + 1 > buflen)
+ return NULL;
+
+ p += nlen;
+ *p = '\0';
+ for (kn = kn_to; kn != common; kn = kn->parent) {
+ nlen = strlen(kn->name);
+ p -= nlen;
+ memcpy(p, kn->name, nlen);
+ *(--p) = '/';
+ }
+
+ return buf;
}

/**
@@ -115,26 +221,48 @@ size_t kernfs_path_len(struct kernfs_node *kn)
}

/**
- * kernfs_path - build full path of a given node
+ * kernfs_path_from_node - build path of node @kn relative to @kn_root.
+ * @kn_root: parent kernfs_node relative to which we need to build the path
* @kn: kernfs_node of interest
- * @buf: buffer to copy @kn's name into
+ * @buf: buffer to copy @kn's path into
* @buflen: size of @buf
*
- * Builds and returns the full path of @kn in @buf of @buflen bytes. The
- * path is built from the end of @buf so the returned pointer usually
- * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * Builds and returns @kn's path relative to @kn_root. @kn_root and @kn must
+ * be on the same kernfs-root. If @kn_root is not parent of @kn, then a relative
+ * path (which includes '..'s) as needed to reach from @kn_root to @kn is
+ * returned.
+ * The path may be built from the end of @buf so the returned pointer may not
+ * match @buf. If @buf isn't long enough, @buf is nul terminated
* and %NULL is returned.
*/
-char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+char *kernfs_path_from_node(struct kernfs_node *kn_root, struct kernfs_node *kn,
+ char *buf, size_t buflen)
{
unsigned long flags;
char *p;

spin_lock_irqsave(&kernfs_rename_lock, flags);
- p = kernfs_path_locked(kn, buf, buflen);
+ p = kernfs_path_from_node_locked(kn_root, kn, buf, buflen);
spin_unlock_irqrestore(&kernfs_rename_lock, flags);
return p;
}
+EXPORT_SYMBOL_GPL(kernfs_path_from_node);
+
+/**
+ * kernfs_path - build full path of a given node
+ * @kn: kernfs_node of interest
+ * @buf: buffer to copy @kn's name into
+ * @buflen: size of @buf
+ *
+ * Builds and returns the full path of @kn in @buf of @buflen bytes. The
+ * path is built from the end of @buf so the returned pointer usually
+ * doesn't match @buf. If @buf isn't long enough, @buf is nul terminated
+ * and %NULL is returned.
+ */
+char *kernfs_path(struct kernfs_node *kn, char *buf, size_t buflen)
+{
+ return kernfs_path_from_node(NULL, kn, buf, buflen);
+}
EXPORT_SYMBOL_GPL(kernfs_path);

/**
@@ -168,8 +296,8 @@ void pr_cont_kernfs_path(struct kernfs_node *kn)

spin_lock_irqsave(&kernfs_rename_lock, flags);

- p = kernfs_path_locked(kn, kernfs_pr_cont_buf,
- sizeof(kernfs_pr_cont_buf));
+ p = kernfs_path_from_node_locked(NULL, kn, kernfs_pr_cont_buf,
+ sizeof(kernfs_pr_cont_buf));
if (p)
pr_cont("%s", p);
else
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 5d4e9c4..d025ebd 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -267,6 +267,9 @@ static inline bool kernfs_ns_enabled(struct kernfs_node *kn)

int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
size_t kernfs_path_len(struct kernfs_node *kn);
+char * __must_check kernfs_path_from_node(struct kernfs_node *root_kn,
+ struct kernfs_node *kn, char *buf,
+ size_t buflen);
char * __must_check kernfs_path(struct kernfs_node *kn, char *buf,
size_t buflen);
void pr_cont_kernfs_name(struct kernfs_node *kn);
--
1.7.9.5

2015-12-07 23:06:32

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 2/7] sched: new clone flag CLONE_NEWCGROUP for cgroup namespace

From: Aditya Kali <[email protected]>

CLONE_NEWCGROUP will be used to create new cgroup namespace.

Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
---
include/uapi/linux/sched.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index cc89dde..5f0fe01 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -21,8 +21,7 @@
#define CLONE_DETACHED 0x00400000 /* Unused, ignored */
#define CLONE_UNTRACED 0x00800000 /* set if the tracing process can't force CLONE_PTRACE on this clone */
#define CLONE_CHILD_SETTID 0x01000000 /* set the TID in the child */
-/* 0x02000000 was previously the unused CLONE_STOPPED (Start in stopped state)
- and is now available for re-use. */
+#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname namespace */
#define CLONE_NEWIPC 0x08000000 /* New ipc namespace */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
--
1.7.9.5

2015-12-07 23:09:11

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 3/7] cgroup: introduce cgroup namespaces

From: Aditya Kali <[email protected]>

Introduce the ability to create new cgroup namespace. The newly created
cgroup namespace remembers the cgroup of the process at the point
of creation of the cgroup namespace (referred as cgroupns-root).
The main purpose of cgroup namespace is to virtualize the contents
of /proc/self/cgroup file. Processes inside a cgroup namespace
are only able to see paths relative to their namespace root
(unless they are moved outside of their cgroupns-root, at which point
they will see a relative path from their cgroupns-root).
For a correctly setup container this enables container-tools
(like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
containers without leaking system level cgroup hierarchy to the task.
This patch only implements the 'unshare' part of the cgroupns.

Changelog: 2015-11-24
- move cgroup_namespace.c into cgroup.c (and .h)
- reformatting
- make get_cgroup_ns return void
- rename ns->root_cgrps to root_cset.

Signed-off-by: Aditya Kali <[email protected]>
Signed-off-by: Serge Hallyn <[email protected]>
---
fs/proc/namespaces.c | 3 +
include/linux/cgroup.h | 51 ++++++++++++----
include/linux/nsproxy.h | 2 +
include/linux/proc_ns.h | 4 ++
kernel/cgroup.c | 151 +++++++++++++++++++++++++++++++++++++++++++++--
kernel/fork.c | 2 +-
kernel/nsproxy.c | 21 ++++++-
7 files changed, 217 insertions(+), 17 deletions(-)

diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index f6e8354..bd61075 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -28,6 +28,9 @@ static const struct proc_ns_operations *ns_entries[] = {
&userns_operations,
#endif
&mntns_operations,
+#ifdef CONFIG_CGROUPS
+ &cgroupns_operations,
+#endif
};

static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2b3e2314..906f240 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -17,11 +17,36 @@
#include <linux/seq_file.h>
#include <linux/kernfs.h>
#include <linux/jump_label.h>
+#include <linux/nsproxy.h>
+#include <linux/types.h>
+#include <linux/ns_common.h>
+#include <linux/nsproxy.h>
+#include <linux/user_namespace.h>

#include <linux/cgroup-defs.h>

+struct cgroup_namespace {
+ atomic_t count;
+ struct ns_common ns;
+ struct user_namespace *user_ns;
+ struct css_set *root_cset;
+};
+
+extern struct cgroup_namespace init_cgroup_ns;
+
+static inline void get_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns)
+ atomic_inc(&ns->count);
+}
+
#ifdef CONFIG_CGROUPS

+void free_cgroup_ns(struct cgroup_namespace *ns);
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns);
+
/*
* All weight knobs on the default hierarhcy should use the following min,
* default and max values. The default value is the logarithmic center of
@@ -105,6 +130,10 @@ void cgroup_free(struct task_struct *p);
int cgroup_init_early(void);
int cgroup_init(void);

+char * __must_check cgroup_path_ns(struct cgroup *cgrp, char *buf,
+ size_t buflen, struct cgroup_namespace *ns);
+char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
+
/*
* Iteration helpers and macros.
*/
@@ -272,10 +301,6 @@ void css_task_iter_end(struct css_task_iter *it);
; \
else

-/*
- * Inline functions.
- */
-
/**
* css_get - obtain a reference on the specified css
* @css: target css
@@ -509,12 +534,6 @@ static inline int cgroup_name(struct cgroup *cgrp, char *buf, size_t buflen)
return kernfs_name(cgrp->kn, buf, buflen);
}

-static inline char * __must_check cgroup_path(struct cgroup *cgrp, char *buf,
- size_t buflen)
-{
- return kernfs_path(cgrp->kn, buf, buflen);
-}
-
static inline void pr_cont_cgroup_name(struct cgroup *cgrp)
{
pr_cont_kernfs_name(cgrp->kn);
@@ -527,6 +546,12 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)

#else /* !CONFIG_CGROUPS */

+static inline void free_cgroup_ns(struct cgroup_namespace *ns) { }
+static inline struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{ return old_ns; }
+
struct cgroup_subsys_state;

static inline void css_put(struct cgroup_subsys_state *css) {}
@@ -547,4 +572,10 @@ static inline int cgroup_init(void) { return 0; }

#endif /* !CONFIG_CGROUPS */

+static inline void put_cgroup_ns(struct cgroup_namespace *ns)
+{
+ if (ns && atomic_dec_and_test(&ns->count))
+ free_cgroup_ns(ns);
+}
+
#endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..ac0d65b 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -8,6 +8,7 @@ struct mnt_namespace;
struct uts_namespace;
struct ipc_namespace;
struct pid_namespace;
+struct cgroup_namespace;
struct fs_struct;

/*
@@ -33,6 +34,7 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net *net_ns;
+ struct cgroup_namespace *cgroup_ns;
};
extern struct nsproxy init_nsproxy;

diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index 42dfc61..de0e771 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -9,6 +9,8 @@
struct pid_namespace;
struct nsproxy;
struct path;
+struct task_struct;
+struct inode;

struct proc_ns_operations {
const char *name;
@@ -24,6 +26,7 @@ extern const struct proc_ns_operations ipcns_operations;
extern const struct proc_ns_operations pidns_operations;
extern const struct proc_ns_operations userns_operations;
extern const struct proc_ns_operations mntns_operations;
+extern const struct proc_ns_operations cgroupns_operations;

/*
* We always define these enumerators
@@ -34,6 +37,7 @@ enum {
PROC_UTS_INIT_INO = 0xEFFFFFFEU,
PROC_USER_INIT_INO = 0xEFFFFFFDU,
PROC_PID_INIT_INO = 0xEFFFFFFCU,
+ PROC_CGROUP_INIT_INO = 0xEFFFFFFBU,
};

#ifdef CONFIG_PROC_FS
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7f2f007..4fd07b5a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -57,6 +57,9 @@
#include <linux/vmalloc.h> /* TODO: replace with more sophisticated array */
#include <linux/kthread.h>
#include <linux/delay.h>
+#include <linux/proc_ns.h>
+#include <linux/nsproxy.h>
+#include <linux/proc_ns.h>

#include <linux/atomic.h>

@@ -297,6 +300,13 @@ static bool cgroup_on_dfl(const struct cgroup *cgrp)
{
return cgrp->root == &cgrp_dfl_root;
}
+struct cgroup_namespace init_cgroup_ns = {
+ .count = { .counter = 1, },
+ .user_ns = &init_user_ns,
+ .ns.ops = &cgroupns_operations,
+ .ns.inum = PROC_CGROUP_INIT_INO,
+ .root_cset = &init_css_set,
+};

/* IDR wrappers which synchronize using cgroup_idr_lock */
static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
@@ -430,18 +440,18 @@ static inline bool cgroup_is_dead(const struct cgroup *cgrp)
return !(cgrp->self.flags & CSS_ONLINE);
}

-static void cgroup_get(struct cgroup *cgrp)
+static inline void cgroup_get(struct cgroup *cgrp)
{
WARN_ON_ONCE(cgroup_is_dead(cgrp));
css_get(&cgrp->self);
}

-static bool cgroup_tryget(struct cgroup *cgrp)
+static inline bool cgroup_tryget(struct cgroup *cgrp)
{
return css_tryget(&cgrp->self);
}

-static void cgroup_put(struct cgroup *cgrp)
+static inline void cgroup_put(struct cgroup *cgrp)
{
css_put(&cgrp->self);
}
@@ -780,7 +790,7 @@ static void put_css_set_locked(struct css_set *cset)
kfree_rcu(cset, rcu_head);
}

-static void put_css_set(struct css_set *cset)
+static inline void put_css_set(struct css_set *cset)
{
/*
* Ensure that the refcount doesn't hit zero while any readers
@@ -2189,6 +2199,25 @@ static struct file_system_type cgroup2_fs_type = {
.kill_sb = cgroup_kill_sb,
};

+char * __must_check
+cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
+ struct cgroup_namespace *ns)
+{
+ struct cgroup *root;
+
+ BUG_ON(!ns);
+ root = cset_cgroup_from_root(ns->root_cset, cgrp->root);
+ return kernfs_path_from_node(root->kn, cgrp->kn, buf, buflen);
+}
+
+char * __must_check
+cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen)
+{
+ return cgroup_path_ns(cgrp, buf, buflen,
+ current->nsproxy->cgroup_ns);
+}
+EXPORT_SYMBOL_GPL(cgroup_path);
+
/**
* task_cgroup_path - cgroup path of a task in the first cgroup hierarchy
* @task: target task
@@ -5289,6 +5318,8 @@ int __init cgroup_init(void)
BUG_ON(cgroup_init_cftypes(NULL, cgroup_dfl_base_files));
BUG_ON(cgroup_init_cftypes(NULL, cgroup_legacy_base_files));

+ get_user_ns(init_cgroup_ns.user_ns);
+
mutex_lock(&cgroup_mutex);

/* Add init_css_set to the hash table */
@@ -5805,6 +5836,118 @@ struct cgroup_subsys_state *css_from_id(int id, struct cgroup_subsys *ss)
return id > 0 ? idr_find(&ss->css_idr, id) : NULL;
}

+/* cgroup namespaces */
+
+const struct proc_ns_operations cgroupns_operations;
+
+static struct cgroup_namespace *alloc_cgroup_ns(void)
+{
+ struct cgroup_namespace *new_ns;
+ int ret;
+
+ new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
+ if (!new_ns)
+ return ERR_PTR(-ENOMEM);
+ ret = ns_alloc_inum(&new_ns->ns);
+ if (ret) {
+ kfree(new_ns);
+ return ERR_PTR(ret);
+ }
+ atomic_set(&new_ns->count, 1);
+ new_ns->ns.ops = &cgroupns_operations;
+ return new_ns;
+}
+
+void free_cgroup_ns(struct cgroup_namespace *ns)
+{
+ put_css_set(ns->root_cset);
+ put_user_ns(ns->user_ns);
+ ns_free_inum(&ns->ns);
+ kfree(ns);
+}
+EXPORT_SYMBOL(free_cgroup_ns);
+
+struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
+ struct user_namespace *user_ns,
+ struct cgroup_namespace *old_ns)
+{
+ struct cgroup_namespace *new_ns = NULL;
+ struct css_set *cset = NULL;
+ int err;
+
+ BUG_ON(!old_ns);
+
+ if (!(flags & CLONE_NEWCGROUP)) {
+ get_cgroup_ns(old_ns);
+ return old_ns;
+ }
+
+ /* Allow only sysadmin to create cgroup namespace. */
+ err = -EPERM;
+ if (!ns_capable(user_ns, CAP_SYS_ADMIN))
+ goto err_out;
+
+ cset = task_css_set(current);
+ get_css_set(cset);
+
+ err = -ENOMEM;
+ new_ns = alloc_cgroup_ns();
+ if (!new_ns)
+ goto err_out;
+
+ new_ns->user_ns = get_user_ns(user_ns);
+ new_ns->root_cset = cset;
+
+ return new_ns;
+
+err_out:
+ if (cset)
+ put_css_set(cset);
+ kfree(new_ns);
+ return ERR_PTR(err);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+{
+ pr_info("setns not supported for cgroup namespace");
+ return -EINVAL;
+}
+
+static struct ns_common *cgroupns_get(struct task_struct *task)
+{
+ struct cgroup_namespace *ns = NULL;
+ struct nsproxy *nsproxy;
+
+ task_lock(task);
+ nsproxy = task->nsproxy;
+ if (nsproxy) {
+ ns = nsproxy->cgroup_ns;
+ get_cgroup_ns(ns);
+ }
+ task_unlock(task);
+
+ return ns ? &ns->ns : NULL;
+}
+
+static void cgroupns_put(struct ns_common *ns)
+{
+ put_cgroup_ns(to_cg_ns(ns));
+}
+
+const struct proc_ns_operations cgroupns_operations = {
+ .name = "cgroup",
+ .type = CLONE_NEWCGROUP,
+ .get = cgroupns_get,
+ .put = cgroupns_put,
+ .install = cgroupns_install,
+};
+
+static __init int cgroup_namespaces_init(void)
+{
+ return 0;
+}
+subsys_initcall(cgroup_namespaces_init);
+
#ifdef CONFIG_CGROUP_DEBUG
static struct cgroup_subsys_state *
debug_css_alloc(struct cgroup_subsys_state *parent_css)
diff --git a/kernel/fork.c b/kernel/fork.c
index ba7d1c0..7982fee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1880,7 +1880,7 @@ static int check_unshare_flags(unsigned long unshare_flags)
if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET|
- CLONE_NEWUSER|CLONE_NEWPID))
+ CLONE_NEWUSER|CLONE_NEWPID|CLONE_NEWCGROUP))
return -EINVAL;
/*
* Not implemented, but pretend it works if there is nothing
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 49746c8..64fe865 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -25,6 +25,7 @@
#include <linux/proc_ns.h>
#include <linux/file.h>
#include <linux/syscalls.h>
+#include <linux/cgroup.h>

static struct kmem_cache *nsproxy_cachep;

@@ -39,6 +40,9 @@ struct nsproxy init_nsproxy = {
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
+#ifdef CONFIG_CGROUPS
+ .cgroup_ns = &init_cgroup_ns,
+#endif
};

static inline struct nsproxy *create_nsproxy(void)
@@ -92,6 +96,13 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
goto out_pid;
}

+ new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
+ tsk->nsproxy->cgroup_ns);
+ if (IS_ERR(new_nsp->cgroup_ns)) {
+ err = PTR_ERR(new_nsp->cgroup_ns);
+ goto out_cgroup;
+ }
+
new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
@@ -101,6 +112,9 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
return new_nsp;

out_net:
+ if (new_nsp->cgroup_ns)
+ put_cgroup_ns(new_nsp->cgroup_ns);
+out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
@@ -128,7 +142,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
struct nsproxy *new_ns;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWPID | CLONE_NEWNET)))) {
+ CLONE_NEWPID | CLONE_NEWNET |
+ CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}
@@ -165,6 +180,8 @@ void free_nsproxy(struct nsproxy *ns)
put_ipc_ns(ns->ipc_ns);
if (ns->pid_ns_for_children)
put_pid_ns(ns->pid_ns_for_children);
+ if (ns->cgroup_ns)
+ put_cgroup_ns(ns->cgroup_ns);
put_net(ns->net_ns);
kmem_cache_free(nsproxy_cachep, ns);
}
@@ -180,7 +197,7 @@ int unshare_nsproxy_namespaces(unsigned long unshare_flags,
int err = 0;

if (!(unshare_flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
- CLONE_NEWNET | CLONE_NEWPID)))
+ CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWCGROUP)))
return 0;

user_ns = new_cred ? new_cred->user_ns : current_user_ns();
--
1.7.9.5

2015-12-07 23:06:33

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 4/7] cgroup: cgroup namespace setns support

From: Aditya Kali <[email protected]>

setns on a cgroup namespace is allowed only if
task has CAP_SYS_ADMIN in its current user-namespace and
over the user-namespace associated with target cgroupns.
No implicit cgroup changes happen with attaching to another
cgroupns. It is expected that the somone moves the attaching
process under the target cgroupns-root.

Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
---
kernel/cgroup.c | 24 +++++++++++++++++++++---
1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4fd07b5a..a5ab74d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5907,10 +5907,28 @@ err_out:
return ERR_PTR(err);
}

-static int cgroupns_install(struct nsproxy *nsproxy, void *ns)
+static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
{
- pr_info("setns not supported for cgroup namespace");
- return -EINVAL;
+ return container_of(ns, struct cgroup_namespace, ns);
+}
+
+static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
+{
+ struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
+
+ if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
+ !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ /* Don't need to do anything if we are attaching to our own cgroupns. */
+ if (cgroup_ns == nsproxy->cgroup_ns)
+ return 0;
+
+ get_cgroup_ns(cgroup_ns);
+ put_cgroup_ns(nsproxy->cgroup_ns);
+ nsproxy->cgroup_ns = cgroup_ns;
+
+ return 0;
}

static struct ns_common *cgroupns_get(struct task_struct *task)
--
1.7.9.5

2015-12-07 23:07:46

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

From: Aditya Kali <[email protected]>

This patch enables cgroup mounting inside userns when a process
as appropriate privileges. The cgroup filesystem mounted is
rooted at the cgroupns-root. Thus, in a container-setup, only
the hierarchy under the cgroupns-root is exposed inside the container.
This allows container management tools to run inside the containers
without depending on any global state.
In order to support this, a new kernfs api is added to lookup the
dentry for the cgroupns-root.

Changelog:
20151116 - Don't allow user namespaces to bind new subsystems
20151118 - postpone the FS_USERNS_MOUNT flag until the
last patch, until we can convince ourselves it
is safe.
20151207 - Switch to walking up the kernfs path from kn root.

Signed-off-by: Aditya Kali <[email protected]>
Acked-by: Serge E. Hallyn <[email protected]>
---
fs/kernfs/mount.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kernfs.h | 2 ++
kernel/cgroup.c | 39 ++++++++++++++++++++++++-
3 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 8eaf417..9219444 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -14,6 +14,7 @@
#include <linux/magic.h>
#include <linux/slab.h>
#include <linux/pagemap.h>
+#include <linux/namei.h>

#include "kernfs-internal.h"

@@ -62,6 +63,79 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
return NULL;
}

+/*
+ * find the next ancestor in the path down to @child, where @parent was the
+ * parent whose child we want to find.
+ *
+ * Say the path is /a/b/c/d. @child is d, @parent is NULL. We return the root
+ * node. If @parent is b, then we return the node for c.
+ * Passing in d as @parent is not ok.
+ */
+static struct kernfs_node *
+find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
+{
+ if (child == parent) {
+ pr_crit_once("BUG in find_next_ancestor: called with parent == child");
+ return NULL;
+ }
+
+ while (child->parent != parent) {
+ if (!child->parent)
+ return NULL;
+ child = child->parent;
+ }
+
+ return child;
+}
+
+/**
+ * kernfs_obtain_root - get a dentry for the given kernfs_node
+ * @sb: the kernfs super_block
+ * @kn: kernfs_node for which a dentry is needed
+ *
+ * This can be used by callers which want to mount only a part of the kernfs
+ * as root of the filesystem.
+ */
+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn)
+{
+ struct dentry *dentry;
+ struct kernfs_node *knparent = NULL;
+
+ BUG_ON(sb->s_op != &kernfs_sops);
+
+ dentry = dget(sb->s_root);
+ if (!kn->parent) // this is the root
+ return dentry;
+
+ knparent = find_next_ancestor(kn, NULL);
+ if (!knparent) {
+ pr_crit("BUG: find_next_ancestor for root dentry returned NULL\n");
+ return ERR_PTR(-EINVAL);
+ }
+
+ do {
+ struct dentry *dtmp;
+ struct kernfs_node *kntmp;
+
+ if (kn == knparent)
+ return dentry;
+ kntmp = find_next_ancestor(kn, knparent);
+ if (!kntmp) {
+ pr_crit("BUG: find_next_ancestor returned NULL for node\n");
+ return ERR_PTR(-EINVAL);
+ }
+ dtmp = lookup_one_len(kntmp->name, dentry, strlen(kntmp->name));
+ dput(dentry);
+ if (IS_ERR(dtmp))
+ return dtmp;
+ knparent = kntmp;
+ dentry = dtmp;
+ } while (1);
+
+ // notreached
+}
+
static int kernfs_fill_super(struct super_block *sb, unsigned long magic)
{
struct kernfs_super_info *info = kernfs_info(sb);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d025ebd..1903777 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -284,6 +284,8 @@ struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);

+struct dentry *kernfs_obtain_root(struct super_block *sb,
+ struct kernfs_node *kn);
struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
unsigned int flags, void *priv);
void kernfs_destroy_root(struct kernfs_root *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a5ab74d..09cd718 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2011,6 +2011,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+
+ get_cgroup_ns(ns);
+
+ /* Check if the caller has permission to mount. */
+ if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
+ put_cgroup_ns(ns);
+ return ERR_PTR(-EPERM);
+ }

/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -2127,6 +2136,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}

+ if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+ ret = -EPERM;
+ goto out_unlock;
+ }
+
root = kzalloc(sizeof(*root), GFP_KERNEL);
if (!root) {
ret = -ENOMEM;
@@ -2145,12 +2159,34 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);

- if (ret)
+ if (ret) {
+ put_cgroup_ns(ns);
return ERR_PTR(ret);
+ }
+
out_mount:
dentry = kernfs_mount(fs_type, flags, root->kf_root,
is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
&new_sb);
+
+ if (!IS_ERR(dentry)) {
+ /*
+ * In non-init cgroup namespace, instead of root cgroup's
+ * dentry, we return the dentry corresponding to the
+ * cgroupns->root_cgrp.
+ */
+ if (ns != &init_cgroup_ns) {
+ struct dentry *nsdentry;
+ struct cgroup *cgrp;
+
+ cgrp = cset_cgroup_from_root(ns->root_cset, root);
+ nsdentry = kernfs_obtain_root(dentry->d_sb,
+ cgrp->kn);
+ dput(dentry);
+ dentry = nsdentry;
+ }
+ }
+
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);

@@ -2163,6 +2199,7 @@ out_mount:
deactivate_super(pinned_sb);
}

+ put_cgroup_ns(ns);
return dentry;
}

--
1.7.9.5

2015-12-07 23:07:45

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 6/7] cgroup: Add documentation for cgroup namespaces

From: Aditya Kali <[email protected]>

Signed-off-by: Aditya Kali <[email protected]>
Signed-off-by: Serge Hallyn <[email protected]>
---
Documentation/cgroups/namespace.txt | 142 +++++++++++++++++++++++++++++++++++
1 file changed, 142 insertions(+)
create mode 100644 Documentation/cgroups/namespace.txt

diff --git a/Documentation/cgroups/namespace.txt b/Documentation/cgroups/namespace.txt
new file mode 100644
index 0000000..a5b80e8
--- /dev/null
+++ b/Documentation/cgroups/namespace.txt
@@ -0,0 +1,142 @@
+ CGroup Namespaces
+
+CGroup Namespace provides a mechanism to virtualize the view of the
+/proc/<pid>/cgroup file. The CLONE_NEWCGROUP clone-flag can be used with
+clone() and unshare() syscalls to create a new cgroup namespace.
+The process running inside the cgroup namespace will have its /proc/<pid>/cgroup
+output restricted to cgroupns-root. cgroupns-root is the cgroup of the process
+at the time of creation of the cgroup namespace.
+
+Prior to CGroup Namespace, the /proc/<pid>/cgroup file used to show complete
+path of the cgroup of a process. In a container setup (where a set of cgroups
+and namespaces are intended to isolate processes), the /proc/<pid>/cgroup file
+may leak potential system level information to the isolated processes.
+
+For Example:
+ $ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' can generally be considered as system-data
+and its desirable to not expose it to the isolated process.
+
+CGroup Namespaces can be used to restrict visibility of this path.
+For Example:
+ # Before creating cgroup namespace
+ $ ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+ $ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+ # unshare(CLONE_NEWCGROUP) and exec /bin/bash
+ $ ~/unshare -c
+ [ns]$ ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+ # From within new cgroupns, process sees that its in the root cgroup
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+ # From global cgroupns:
+ $ cat /proc/<pid>/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+ # Unshare cgroupns along with userns and mountns
+ # Following calls unshare(CLONE_NEWCGROUP|CLONE_NEWUSER|CLONE_NEWNS), then
+ # sets up uid/gid map and execs /bin/bash
+ $ ~/unshare -c -u -m
+ # Originally, we were in /batchjobs/container_id1 cgroup. Mount our own cgroup
+ # hierarchy.
+ [ns]$ mount -t cgroup cgroup /tmp/cgroup
+ [ns]$ ls -l /tmp/cgroup
+ total 0
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+ -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+ -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns-root (/batchjobs/container_id1 in above example) becomes the
+filesystem root for the namespace specific cgroupfs mount.
+
+The virtualization of /proc/self/cgroup file combined with restricting
+the view of cgroup hierarchy by namespace-private cgroupfs mount
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patcheset provides following
+behavior:
+
+(1) The 'cgroupns-root' for a cgroup namespace is the cgroup in which
+ the process calling unshare is running.
+ For ex. if a process in /batchjobs/container_id1 cgroup calls unshare,
+ cgroup /batchjobs/container_id1 becomes the cgroupns-root.
+ For the init_cgroup_ns, this is the real root ('/') cgroup
+ (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns-root cgroup does not change even if the namespace
+ creator process later moves to a different cgroup.
+ $ ~/unshare -c # unshare cgroupns in some cgroup
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+ [ns]$ mkdir sub_cgrp_1
+ [ns]$ echo 0 > sub_cgrp_1/cgroup.procs
+ [ns]$ cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its CGROUPNS specific view of /proc/<pid>/cgroup
+(a) Processes running inside the cgroup namespace will be able to see
+ cgroup paths (in /proc/self/cgroup) only inside their root cgroup
+ [ns]$ sleep 100000 & # From within unshared cgroupns
+ [1] 7353
+ [ns]$ echo 7353 > sub_cgrp_1/cgroup.procs
+ [ns]$ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From global cgroupns, the real cgroup path will be visible:
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroupns (cgroupns root-ed at a different cgroup), cgroup
+ path relative to its own cgroupns-root will be shown:
+ # ns2's cgroupns-root is at '/batchjobs/container_id2'
+ [ns2]$ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
+
+ Note that the relative path always starts with '/' to indicate that its
+ relative to the cgroupns-root of the caller.
+
+(4) Processes inside a cgroupns can move in-and-out of the cgroupns-root
+ (if they have proper access to external cgroups).
+ # From inside cgroupns (with cgroupns-root at /batchjobs/container_id1), and
+ # assuming that the global hierarchy is still accessible inside cgroupns:
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+ $ echo 7353 > batchjobs/container_id2/cgroup.procs
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+ Note that this kind of setup is not encouraged. A task inside cgroupns
+ should only be exposed to its own cgroupns hierarchy. Otherwise it makes
+ the virtualization of /proc/<pid>/cgroup less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+ (a) the process has CAP_SYS_ADMIN in its current userns
+ (b) the process has CAP_SYS_ADMIN in the target cgroupns' userns
+ No implicit cgroup changes happen with attaching to another cgroupns. It
+ is expected that the somone moves the attaching process under the target
+ cgroupns-root.
+
+(6) When some thread from a multi-threaded process unshares its
+ cgroup-namespace, the new cgroupns gets applied to the entire process (all
+ the threads). For the unified-hierarchy this is expected as it only allows
+ process-level containerization. For the legacy hierarchies this may be
+ unexpected. So all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is atleast 1
+ process inside it. When the last process exits, the cgroup
+ namespace is destroyed. The cgroupns-root and the actual cgroups
+ remain though.
+
+(8) Namespace specific cgroup hierarchy can be mounted by a process running
+ inside cgroupns:
+ $ mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+ This will mount the unified cgroup hierarchy with cgroupns-root as the
+ filesystem root. The process needs CAP_SYS_ADMIN in its userns and mntns.
--
1.7.9.5

2015-12-07 23:07:44

by Serge Hallyn

[permalink] [raw]
Subject: [PATCH 7/7] Add FS_USERNS_FLAG to cgroup fs

From: Serge Hallyn <[email protected]>

allowing root in a non-init user namespace to mount it. This should
now be safe, because

1. non-init-root cannot mount a previously unbound subsystem
2. the task doing the mount must be privileged with respect to the
user namespace owning the cgroup namespace
3. the mounted subsystem will have its current cgroup as the root dentry.
the permissions will be unchanged, so tasks will receive no new
privilege over the cgroups which they did not have on the original
mounts.

Signed-off-by: Serge Hallyn <[email protected]>
---
kernel/cgroup.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 09cd718..5419ef7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2228,12 +2228,14 @@ static struct file_system_type cgroup_fs_type = {
.name = "cgroup",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

static struct file_system_type cgroup2_fs_type = {
.name = "cgroup2",
.mount = cgroup_mount,
.kill_sb = cgroup_kill_sb,
+ .fs_flags = FS_USERNS_MOUNT,
};

char * __must_check
--
1.7.9.5

2015-12-08 10:10:30

by Alban Crequy

[permalink] [raw]
Subject: Re: CGroup Namespaces (v6)

Hi,

Thanks for the patches!

On 8 December 2015 at 00:06, <[email protected]> wrote:
> Hi,
>
> following is a revised set of the CGroup Namespace patchset which Aditya
> Kali has previously sent. The code can also be found in the cgroupns.v6
> branch of
>
> https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
>
> To summarize the semantics:
>
> 1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED
>
> 2. unsharing a cgroup namespace makes all your current cgroups your new
> cgroup root.
>
> 3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
> cgroup namespce root. A task outside of your cgroup looks like
>
> 8:memory:/../../..
>
> 4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
> on the mounting task's cgroup namespace.
>
> 5. setns to a cgroup namespace switches your cgroup namespace but not
> your cgroups.
>
> With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
> github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
> proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.

I tested cgroupns.v6 with systemd-nspawn + patches from
https://github.com/systemd/systemd/pull/2112 using
unshare(CLONE_NEWCGROUP) booted with
systemd.unified_cgroup_hierarchy=1 in Fedora22. Tested with and
without userns. It worked for me :)

Do you need people to run more tests, with other scenarios?

Do you have patches already for /usr/bin/unshare and /usr/bin/nsenter?

> This is completely backward compatible and will be completely invisible
> to any existing cgroup users (except for those running inside a cgroup
> namespace and looking at /proc/pid/cgroup of tasks outside their
> namespace.)
>
> Changes from V5:
> 1. To get a root dentry for cgroup namespace mount, walk the path from the
> kernfs root dentry.
>
> Changes from V4:
> 1. Move the FS_USERNS_MOUNT flag to last patch
> 2. Rebase onto cgroup/for-4.5
> 3. Don't non-init user namespaces to bind new subsystems when mounting.
> 4. Address feedback from Tejun (thanks). Specificaly, not addressed:
> . kernfs_obtain_root - walking dentry from kernfs root.
> (I think that's the only piece)
> 5. Dropped unused get_task_cgroup fn/patch.
> 6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
> It now finds a common ancestor, walks from the source to it, then back
> up to the target.
>
> Changes from V3:
> 1. Rebased onto latest cgroup changes. In particular switch to
> css_set_lock and ns_common.
> 2. Support all hierarchies.
>
> Changes from V2:
> 1. Added documentation in Documentation/cgroups/namespace.txt
> 2. Fixed a bug that caused crash
> 3. Incorporated some other suggestions from last patchset:
> - removed use of threadgroup_lock() while creating new cgroupns
> - use task_lock() instead of rcu_read_lock() while accessing
> task->nsproxy
> - optimized setns() to own cgroupns
> - simplified code around sane-behavior mount option parsing
> 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
> not changed since then.
>
> Changes from V1:
> 1. No pinning of processes within cgroupns. Tasks can be freely moved
> across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
> apply as before.
> 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
> cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
> of the reader and cgroup of <pid>.
> 3. setns() does not require the process to first move under target
> cgroupns-root.
>
> Changes form RFC (V0):
> 1. setns support for cgroupns
> 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> 3. writes to cgroup files outside of cgroupns-root are not allowed
> 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> your cgroupns-root.
>
>
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2015-12-09 00:48:00

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

On Mon, Dec 07, 2015 at 05:06:16PM -0600, [email protected] wrote:
> From: Aditya Kali <[email protected]>
>
> The new function kernfs_path_from_node() generates and returns kernfs
> path of a given kernfs_node relative to a given parent kernfs_node.
>
> Changelog 20151125:
> - Fully-wing multilinecomments
> - Rework kernfs_path_from_node_locked() logic
> - Replace BUG_ONs with returning NULL
> - Use a const char* for /.. and precalculate its size
> Changelog 20151130:
> - Update kernfs_path_from_node_locked comment

changes should go below the --- line, not here in the body of the
changelog that will show up in git :(

2015-12-08 15:23:05

by Serge Hallyn

[permalink] [raw]
Subject: Re: CGroup Namespaces (v6)

On Tue, Dec 08, 2015 at 11:10:03AM +0100, Alban Crequy wrote:
> Hi,
>
> Thanks for the patches!
>
> On 8 December 2015 at 00:06, <[email protected]> wrote:
> > Hi,
> >
> > following is a revised set of the CGroup Namespace patchset which Aditya
> > Kali has previously sent. The code can also be found in the cgroupns.v6
> > branch of
> >
> > https://git.kernel.org/cgit/linux/kernel/git/sergeh/linux-security.git/
> >
> > To summarize the semantics:
> >
> > 1. CLONE_NEWCGROUP re-uses 0x02000000, which was previously CLONE_STOPPED
> >
> > 2. unsharing a cgroup namespace makes all your current cgroups your new
> > cgroup root.
> >
> > 3. /proc/pid/cgroup always shows cgroup paths relative to the reader's
> > cgroup namespce root. A task outside of your cgroup looks like
> >
> > 8:memory:/../../..
> >
> > 4. when a task mounts a cgroupfs, the cgroup which shows up as root depends
> > on the mounting task's cgroup namespace.
> >
> > 5. setns to a cgroup namespace switches your cgroup namespace but not
> > your cgroups.
> >
> > With this, using github.com/hallyn/lxc #2015-11-09/cgns (and
> > github.com/hallyn/lxcfs #2015-11-10/cgns) we can start a container in a full
> > proper cgroup namespace, avoiding either cgmanager or lxcfs cgroup bind mounts.
>
> I tested cgroupns.v6 with systemd-nspawn + patches from
> https://github.com/systemd/systemd/pull/2112 using
> unshare(CLONE_NEWCGROUP) booted with
> systemd.unified_cgroup_hierarchy=1 in Fedora22. Tested with and
> without userns. It worked for me :)

Great, thanks for testing.

> Do you need people to run more tests, with other scenarios?

Certainly the more testing the better. There is a particular set of
cases which I'd earlier tested just in the shell, which could stand
to have a testcase. That's to basically test all of the '..' possibilities
for /proc/self/cgroup and make sure it's all sane. I.e. place task t1
into cgroups: '/', '/x1', '/x1/x2', '/x1/x2/x3'; place task t2 into
various relative paths '/', '/x1', '/x1/x2', '/y1', etc; have t1
check where t2 is, then have t2 setns into t1's namespace and check where
t1 is.

> Do you have patches already for /usr/bin/unshare and /usr/bin/nsenter?

Nope, I don't have patch for util-linux yet, I just used a custom unshare
and setns program.

> > This is completely backward compatible and will be completely invisible
> > to any existing cgroup users (except for those running inside a cgroup
> > namespace and looking at /proc/pid/cgroup of tasks outside their
> > namespace.)
> >
> > Changes from V5:
> > 1. To get a root dentry for cgroup namespace mount, walk the path from the
> > kernfs root dentry.
> >
> > Changes from V4:
> > 1. Move the FS_USERNS_MOUNT flag to last patch
> > 2. Rebase onto cgroup/for-4.5
> > 3. Don't non-init user namespaces to bind new subsystems when mounting.
> > 4. Address feedback from Tejun (thanks). Specificaly, not addressed:
> > . kernfs_obtain_root - walking dentry from kernfs root.
> > (I think that's the only piece)
> > 5. Dropped unused get_task_cgroup fn/patch.
> > 6. Reworked kernfs_path_from_node_locked() to try to simplify the logic.
> > It now finds a common ancestor, walks from the source to it, then back
> > up to the target.
> >
> > Changes from V3:
> > 1. Rebased onto latest cgroup changes. In particular switch to
> > css_set_lock and ns_common.
> > 2. Support all hierarchies.
> >
> > Changes from V2:
> > 1. Added documentation in Documentation/cgroups/namespace.txt
> > 2. Fixed a bug that caused crash
> > 3. Incorporated some other suggestions from last patchset:
> > - removed use of threadgroup_lock() while creating new cgroupns
> > - use task_lock() instead of rcu_read_lock() while accessing
> > task->nsproxy
> > - optimized setns() to own cgroupns
> > - simplified code around sane-behavior mount option parsing
> > 4. Restored ACKs from Serge Hallyn from v1 on few patches that have
> > not changed since then.
> >
> > Changes from V1:
> > 1. No pinning of processes within cgroupns. Tasks can be freely moved
> > across cgroups even outside of their cgroupns-root. Usual DAC/MAC policies
> > apply as before.
> > 2. Path in /proc/<pid>/cgroup is now always shown and is relative to
> > cgroupns-root. So path can contain '/..' strings depending on cgroupns-root
> > of the reader and cgroup of <pid>.
> > 3. setns() does not require the process to first move under target
> > cgroupns-root.
> >
> > Changes form RFC (V0):
> > 1. setns support for cgroupns
> > 2. 'mount -t cgroup cgroup <mntpt>' from inside a cgroupns now
> > mounts the cgroup hierarcy with cgroupns-root as the filesystem root.
> > 3. writes to cgroup files outside of cgroupns-root are not allowed
> > 4. visibility of /proc/<pid>/cgroup is further restricted by not showing
> > anything if the <pid> is in a sibling cgroupns and its cgroup falls outside
> > your cgroupns-root.
> >
> >
> > _______________________________________________
> > Containers mailing list
> > [email protected]
> > https://lists.linuxfoundation.org/mailman/listinfo/containers
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2015-12-08 15:52:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

Hello, Serge.

On Mon, Dec 07, 2015 at 05:06:16PM -0600, [email protected] wrote:
> +/* kernfs_node_depth - compute depth from @from to @to */
> +static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
> {
> + size_t depth = 0;
>
> + BUG_ON(!to);
> + BUG_ON(!from);

Do these BUG_ON()s achieve anything? Also, would something like
kernfs_relative_depth() be a better name for the function? Maybe even
just kernfs_depth()?

...
> +static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
> + struct kernfs_node *b)
> +{
> + size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
> + size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
> +
> + if (da == 0)
> + return a;
> + if (db == 0)
> + return b;

Hmm... are the above two ifs necessary? Wouldn't the outcome be the
same? Furthermore, if a and b are on different roots the above may
give the wrong answer while not doing the above would return NULL.

> + while (da > db) {
> + a = a->parent;
> + da--;
> + }
> + while (db > da) {
> + b = b->parent;
> + db--;
> + }
> +
> + /* worst case b and a will be the same at root */
> + while (b != a) {
> + b = b->parent;
> + a = a->parent;
> + }
> +
> + return a;
> +}
...
> +static char *
> +__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,

Maybe

static char * __must_check
kernfs_path...

> + struct kernfs_node *kn_to, char *buf,
> + size_t buflen)

Given that @kn_from is optional and is not the target node, maybe put
@kn_to before @kn_from?

> +{
> + char *p = buf;
> + struct kernfs_node *kn, *common;
> + const char parent_str[] = "/..";
> + int i;
> + size_t depth_from, depth_to, len = 0, nlen = 0,
> + plen = sizeof(parent_str) - 1;

Heh, idk, just put plen on a separate decl?

> +
> + /* We atleast need 2 bytes to write "/\0". */
> + if (buflen < 2)
> + return NULL;
> +
> + if (!kn_from)
> + kn_from = kernfs_root(kn_to)->kn;
> +
> + if (kn_from == kn_to) {
> + *p = '/';
> + *(++p) = '\0';
> + return buf;
> + }
> +
> + common = kernfs_common_ancestor(kn_from, kn_to);
> + if (!common) {
> + WARN_ONCE("%s: kn_from and kn_to on different roots\n",
> + __func__);
> + return NULL;
> + }

Have you compiled it? WARN_ONCE()'s first argument is condition, so
you'd write

if (WARN_ONCE(!common, "blah blah"))
return NULL;

> +
> + depth_to = kernfs_node_distance(common, kn_to);
> + depth_from = kernfs_node_distance(common, kn_from);
> +
> + for (i = 0; i < depth_from; i++) {
> + if (len + plen + 1 > buflen)
> + return NULL;
> + strcpy(p, parent_str);
> + p += plen;
> + len += plen;
> + }
> +
> + /* Calculate how many bytes we need for the rest */
> + for (kn = kn_to; kn != common; kn = kn->parent)
> + nlen += strlen(kn->name) + 1;
> +
> + if (len + nlen + 1 > buflen)
> + return NULL;

Hmm... if we do this anyway, maybe we can make the function behave
more like other string formatting function (strlcpy) and return the
would-be length instead where ret >= len indicates truncation?

Thanks.

--
tejun

2015-12-08 16:04:57

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/7] cgroup: introduce cgroup namespaces

On Mon, Dec 07, 2015 at 05:06:18PM -0600, [email protected] wrote:
> static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 2b3e2314..906f240 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -17,11 +17,36 @@
> #include <linux/seq_file.h>
> #include <linux/kernfs.h>
> #include <linux/jump_label.h>
> +#include <linux/nsproxy.h>
> +#include <linux/types.h>
> +#include <linux/ns_common.h>
> +#include <linux/nsproxy.h>
> +#include <linux/user_namespace.h>
>
> #include <linux/cgroup-defs.h>
>
> +struct cgroup_namespace {
> + atomic_t count;
> + struct ns_common ns;
> + struct user_namespace *user_ns;
> + struct css_set *root_cset;
> +};
> +
> +extern struct cgroup_namespace init_cgroup_ns;
> +
> +static inline void get_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns)
> + atomic_inc(&ns->count);
> +}
> +
..
> +void free_cgroup_ns(struct cgroup_namespace *ns);
> +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> + struct user_namespace *user_ns,
> + struct cgroup_namespace *old_ns);
...
> +char * __must_check cgroup_path_ns(struct cgroup *cgrp, char *buf,
> + size_t buflen, struct cgroup_namespace *ns);
> +char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
...
> +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> +{
> + if (ns && atomic_dec_and_test(&ns->count))
> + free_cgroup_ns(ns);
> +}

I'd prefer collecting all ns related declarations in a single place.

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7f2f007..4fd07b5a 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -297,6 +300,13 @@ static bool cgroup_on_dfl(const struct cgroup *cgrp)
> {
> return cgrp->root == &cgrp_dfl_root;
> }

New line.

> +struct cgroup_namespace init_cgroup_ns = {
> + .count = { .counter = 1, },
> + .user_ns = &init_user_ns,
> + .ns.ops = &cgroupns_operations,
> + .ns.inum = PROC_CGROUP_INIT_INO,
> + .root_cset = &init_css_set,
> +};

Also, why inbetween two function defs? cgroup.c is messy but you can
still put it after other variable definitions.

> @@ -430,18 +440,18 @@ static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> return !(cgrp->self.flags & CSS_ONLINE);
> }
>
> -static void cgroup_get(struct cgroup *cgrp)
> +static inline void cgroup_get(struct cgroup *cgrp)
> {
> WARN_ON_ONCE(cgroup_is_dead(cgrp));
> css_get(&cgrp->self);
> }
>
> -static bool cgroup_tryget(struct cgroup *cgrp)
> +static inline bool cgroup_tryget(struct cgroup *cgrp)
> {
> return css_tryget(&cgrp->self);
> }
>
> -static void cgroup_put(struct cgroup *cgrp)
> +static inline void cgroup_put(struct cgroup *cgrp)
> {
> css_put(&cgrp->self);
> }
>
> @@ -780,7 +790,7 @@ static void put_css_set_locked(struct css_set *cset)
> kfree_rcu(cset, rcu_head);
> }
>
> -static void put_css_set(struct css_set *cset)
> +static inline void put_css_set(struct css_set *cset)

Stray diffs?

> @@ -2189,6 +2199,25 @@ static struct file_system_type cgroup2_fs_type = {
> .kill_sb = cgroup_kill_sb,
> };
>
> +char * __must_check
> +cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
> + struct cgroup_namespace *ns)
> +{
> + struct cgroup *root;
> +
> + BUG_ON(!ns);

Just let it crash on NULL deref.

Thanks.

--
tejun

2015-12-08 16:21:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Hello,

On Mon, Dec 07, 2015 at 05:06:20PM -0600, [email protected] wrote:
> fs/kernfs/mount.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/kernfs.h | 2 ++
> kernel/cgroup.c | 39 ++++++++++++++++++++++++-
> 3 files changed, 114 insertions(+), 1 deletion(-)

Please put kernfs changes in a spearate patch.

> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index 8eaf417..9219444 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -62,6 +63,79 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> return NULL;
> }
>
> +/*
> + * find the next ancestor in the path down to @child, where @parent was the
> + * parent whose child we want to find.

s/parent/ancestor/ s/child/descendant/ ?

> + *
> + * Say the path is /a/b/c/d. @child is d, @parent is NULL. We return the root
> + * node. If @parent is b, then we return the node for c.
> + * Passing in d as @parent is not ok.
> + */
> +static struct kernfs_node *
> +find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
> +{
> + if (child == parent) {
> + pr_crit_once("BUG in find_next_ancestor: called with parent == child");
> + return NULL;
> + }
> +
> + while (child->parent != parent) {
> + if (!child->parent)
> + return NULL;
> + child = child->parent;
> + }
> +
> + return child;
> +}
> +
> +/**
> + * kernfs_obtain_root - get a dentry for the given kernfs_node
> + * @sb: the kernfs super_block
> + * @kn: kernfs_node for which a dentry is needed
> + *
> + * This can be used by callers which want to mount only a part of the kernfs
> + * as root of the filesystem.
> + */
> +struct dentry *kernfs_obtain_root(struct super_block *sb,
> + struct kernfs_node *kn)

Wouldn't @kn, @sb be a better order? Also, kernfs super_blocks are
determined by the kernfs_root and its namespace. I wonder whether
specifying @ns would be better.

> +{
> + struct dentry *dentry;
> + struct kernfs_node *knparent = NULL;
> +
> + BUG_ON(sb->s_op != &kernfs_sops);
> +
> + dentry = dget(sb->s_root);
> + if (!kn->parent) // this is the root
^^^
Do we do this now?

> + return dentry;
> +
> + knparent = find_next_ancestor(kn, NULL);
> + if (!knparent) {
> + pr_crit("BUG: find_next_ancestor for root dentry returned NULL\n");

Wouldn't stack dump helpful here? Why not

if (WARN_ONCE(!knparent, "find_next..."))
return ERR_PTR(-EINVAL);

or even just WARN_ON_ONCE().

> + return ERR_PTR(-EINVAL);
> + }
> +
> + do {
> + struct dentry *dtmp;
> + struct kernfs_node *kntmp;
> +
> + if (kn == knparent)
> + return dentry;
> + kntmp = find_next_ancestor(kn, knparent);
> + if (!kntmp) {
> + pr_crit("BUG: find_next_ancestor returned NULL for node\n");

Ditto. It'd be a kernel bug. WARN is usually the better way.

> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index a5ab74d..09cd718 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2011,6 +2011,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> int ret;
> int i;
> bool new_sb;
> + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;

Please move this upwards so that it's below other initialized
variables.

> +
> + get_cgroup_ns(ns);
> +
> + /* Check if the caller has permission to mount. */
> + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> + put_cgroup_ns(ns);
> + return ERR_PTR(-EPERM);
> + }
>
> /*
> * The first time anyone tries to mount a cgroup, enable the list
> @@ -2127,6 +2136,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> goto out_unlock;
> }
>
> + if (!opts.none && !capable(CAP_SYS_ADMIN)) {
> + ret = -EPERM;
> + goto out_unlock;
> + }

Hmmm... why is !opts.none necessary? Please add a comment explaining
why the above is necessary.

> out_mount:
> dentry = kernfs_mount(fs_type, flags, root->kf_root,
> is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
> &new_sb);
> +
> + if (!IS_ERR(dentry)) {
> + /*
> + * In non-init cgroup namespace, instead of root cgroup's
> + * dentry, we return the dentry corresponding to the
> + * cgroupns->root_cgrp.
> + */
> + if (ns != &init_cgroup_ns) {

if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {

> + struct dentry *nsdentry;
> + struct cgroup *cgrp;
> +
> + cgrp = cset_cgroup_from_root(ns->root_cset, root);
> + nsdentry = kernfs_obtain_root(dentry->d_sb,
> + cgrp->kn);

Heh, is kernfs_obtain_root() the right name? Maybe
kernfs_node_to_inode()?

Thanks.

--
tejun

2015-12-08 16:22:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 6/7] cgroup: Add documentation for cgroup namespaces

On Mon, Dec 07, 2015 at 05:06:21PM -0600, [email protected] wrote:
> From: Aditya Kali <[email protected]>
>
> Signed-off-by: Aditya Kali <[email protected]>
> Signed-off-by: Serge Hallyn <[email protected]>
> ---
> Documentation/cgroups/namespace.txt | 142 +++++++++++++++++++++++++++++++++++

Please integrate the documentation into Documentation/cgroup.txt.

Thanks.

--
tejun

2015-12-08 16:46:30

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

On Tue, Dec 08, 2015 at 10:52:51AM -0500, Tejun Heo wrote:
> Hello, Serge.
>
> On Mon, Dec 07, 2015 at 05:06:16PM -0600, [email protected] wrote:
> > +/* kernfs_node_depth - compute depth from @from to @to */
> > +static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
> > {
> > + size_t depth = 0;
> >
> > + BUG_ON(!to);
> > + BUG_ON(!from);
>
> Do these BUG_ON()s achieve anything? Also, would something like
> kernfs_relative_depth() be a better name for the function? Maybe even
> just kernfs_depth()?
>
> ...
> > +static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
> > + struct kernfs_node *b)
> > +{
> > + size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
> > + size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
> > +
> > + if (da == 0)
> > + return a;
> > + if (db == 0)
> > + return b;
>
> Hmm... are the above two ifs necessary? Wouldn't the outcome be the
> same? Furthermore, if a and b are on different roots the above may
> give the wrong answer while not doing the above would return NULL.
>
> > + while (da > db) {
> > + a = a->parent;
> > + da--;
> > + }
> > + while (db > da) {
> > + b = b->parent;
> > + db--;
> > + }
> > +
> > + /* worst case b and a will be the same at root */
> > + while (b != a) {
> > + b = b->parent;
> > + a = a->parent;
> > + }
> > +
> > + return a;
> > +}
> ...
> > +static char *
> > +__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,
>
> Maybe
>
> static char * __must_check
> kernfs_path...
>
> > + struct kernfs_node *kn_to, char *buf,
> > + size_t buflen)
>
> Given that @kn_from is optional and is not the target node, maybe put
> @kn_to before @kn_from?
>
> > +{
> > + char *p = buf;
> > + struct kernfs_node *kn, *common;
> > + const char parent_str[] = "/..";
> > + int i;
> > + size_t depth_from, depth_to, len = 0, nlen = 0,
> > + plen = sizeof(parent_str) - 1;
>
> Heh, idk, just put plen on a separate decl?
>
> > +
> > + /* We atleast need 2 bytes to write "/\0". */
> > + if (buflen < 2)
> > + return NULL;
> > +
> > + if (!kn_from)
> > + kn_from = kernfs_root(kn_to)->kn;
> > +
> > + if (kn_from == kn_to) {
> > + *p = '/';
> > + *(++p) = '\0';
> > + return buf;
> > + }
> > +
> > + common = kernfs_common_ancestor(kn_from, kn_to);
> > + if (!common) {
> > + WARN_ONCE("%s: kn_from and kn_to on different roots\n",
> > + __func__);
> > + return NULL;
> > + }
>
> Have you compiled it?

Yup,

> WARN_ONCE()'s first argument is condition, so
> you'd write
>
> if (WARN_ONCE(!common, "blah blah"))
> return NULL;

Well that's odd, i thought I had changed all those to
pr_*. Will fix.

> > +
> > + depth_to = kernfs_node_distance(common, kn_to);
> > + depth_from = kernfs_node_distance(common, kn_from);
> > +
> > + for (i = 0; i < depth_from; i++) {
> > + if (len + plen + 1 > buflen)
> > + return NULL;
> > + strcpy(p, parent_str);
> > + p += plen;
> > + len += plen;
> > + }
> > +
> > + /* Calculate how many bytes we need for the rest */
> > + for (kn = kn_to; kn != common; kn = kn->parent)
> > + nlen += strlen(kn->name) + 1;
> > +
> > + if (len + nlen + 1 > buflen)
> > + return NULL;
>
> Hmm... if we do this anyway, maybe we can make the function behave
> more like other string formatting function (strlcpy) and return the
> would-be length instead where ret >= len indicates truncation?

Ok, will do.

2015-12-08 16:48:34

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Tue, Dec 08, 2015 at 11:20:40AM -0500, Tejun Heo wrote:
> Hello,
>
> On Mon, Dec 07, 2015 at 05:06:20PM -0600, [email protected] wrote:
> > fs/kernfs/mount.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/kernfs.h | 2 ++
> > kernel/cgroup.c | 39 ++++++++++++++++++++++++-
> > 3 files changed, 114 insertions(+), 1 deletion(-)
>
> Please put kernfs changes in a spearate patch.
>
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index 8eaf417..9219444 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -62,6 +63,79 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> > return NULL;
> > }
> >
> > +/*
> > + * find the next ancestor in the path down to @child, where @parent was the
> > + * parent whose child we want to find.
>
> s/parent/ancestor/ s/child/descendant/ ?
>
> > + *
> > + * Say the path is /a/b/c/d. @child is d, @parent is NULL. We return the root
> > + * node. If @parent is b, then we return the node for c.
> > + * Passing in d as @parent is not ok.
> > + */
> > +static struct kernfs_node *
> > +find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
> > +{
> > + if (child == parent) {
> > + pr_crit_once("BUG in find_next_ancestor: called with parent == child");
> > + return NULL;
> > + }
> > +
> > + while (child->parent != parent) {
> > + if (!child->parent)
> > + return NULL;
> > + child = child->parent;
> > + }
> > +
> > + return child;
> > +}
> > +
> > +/**
> > + * kernfs_obtain_root - get a dentry for the given kernfs_node
> > + * @sb: the kernfs super_block
> > + * @kn: kernfs_node for which a dentry is needed
> > + *
> > + * This can be used by callers which want to mount only a part of the kernfs
> > + * as root of the filesystem.
> > + */
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > + struct kernfs_node *kn)
>
> Wouldn't @kn, @sb be a better order? Also, kernfs super_blocks are
> determined by the kernfs_root and its namespace. I wonder whether
> specifying @ns would be better.
>
> > +{
> > + struct dentry *dentry;
> > + struct kernfs_node *knparent = NULL;
> > +
> > + BUG_ON(sb->s_op != &kernfs_sops);
> > +
> > + dentry = dget(sb->s_root);
> > + if (!kn->parent) // this is the root
> ^^^
> Do we do this now?
>
> > + return dentry;
> > +
> > + knparent = find_next_ancestor(kn, NULL);
> > + if (!knparent) {
> > + pr_crit("BUG: find_next_ancestor for root dentry returned NULL\n");
>
> Wouldn't stack dump helpful here? Why not

Hm, yeah, that's a good reason to use WARN - thanks.

> if (WARN_ONCE(!knparent, "find_next..."))
> return ERR_PTR(-EINVAL);
>
> or even just WARN_ON_ONCE().
>
> > + return ERR_PTR(-EINVAL);
> > + }
> > +
> > + do {
> > + struct dentry *dtmp;
> > + struct kernfs_node *kntmp;
> > +
> > + if (kn == knparent)
> > + return dentry;
> > + kntmp = find_next_ancestor(kn, knparent);
> > + if (!kntmp) {
> > + pr_crit("BUG: find_next_ancestor returned NULL for node\n");
>
> Ditto. It'd be a kernel bug. WARN is usually the better way.
>
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index a5ab74d..09cd718 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -2011,6 +2011,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> > int ret;
> > int i;
> > bool new_sb;
> > + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
>
> Please move this upwards so that it's below other initialized
> variables.
>
> > +
> > + get_cgroup_ns(ns);
> > +
> > + /* Check if the caller has permission to mount. */
> > + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> > + put_cgroup_ns(ns);
> > + return ERR_PTR(-EPERM);
> > + }
> >
> > /*
> > * The first time anyone tries to mount a cgroup, enable the list
> > @@ -2127,6 +2136,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> > goto out_unlock;
> > }
> >
> > + if (!opts.none && !capable(CAP_SYS_ADMIN)) {
> > + ret = -EPERM;
> > + goto out_unlock;
> > + }
>
> Hmmm... why is !opts.none necessary? Please add a comment explaining
> why the above is necessary.
>
> > out_mount:
> > dentry = kernfs_mount(fs_type, flags, root->kf_root,
> > is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
> > &new_sb);
> > +
> > + if (!IS_ERR(dentry)) {
> > + /*
> > + * In non-init cgroup namespace, instead of root cgroup's
> > + * dentry, we return the dentry corresponding to the
> > + * cgroupns->root_cgrp.
> > + */
> > + if (ns != &init_cgroup_ns) {
>
> if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
>
> > + struct dentry *nsdentry;
> > + struct cgroup *cgrp;
> > +
> > + cgrp = cset_cgroup_from_root(ns->root_cset, root);
> > + nsdentry = kernfs_obtain_root(dentry->d_sb,
> > + cgrp->kn);
>
> Heh, is kernfs_obtain_root() the right name? Maybe
> kernfs_node_to_inode()?
>
> Thanks.
>
> --
> tejun
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2015-12-08 18:45:10

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

On Tue, Dec 08, 2015 at 10:52:51AM -0500, Tejun Heo wrote:
> Hello, Serge.
>
> On Mon, Dec 07, 2015 at 05:06:16PM -0600, [email protected] wrote:
> > +/* kernfs_node_depth - compute depth from @from to @to */
> > +static size_t kernfs_node_distance(struct kernfs_node *from, struct kernfs_node *to)
> > {
> > + size_t depth = 0;
> >
> > + BUG_ON(!to);
> > + BUG_ON(!from);
>
> Do these BUG_ON()s achieve anything?

Just try to catch caller errors early on, but I'll drop these.

> Also, would something like
> kernfs_relative_depth() be a better name for the function? Maybe even
> just kernfs_depth()?

ok

> ...
> > +static struct kernfs_node *kernfs_common_ancestor(struct kernfs_node *a,
> > + struct kernfs_node *b)
> > +{
> > + size_t da = kernfs_node_distance(kernfs_root(a)->kn, a);
> > + size_t db = kernfs_node_distance(kernfs_root(b)->kn, b);
> > +
> > + if (da == 0)
> > + return a;
> > + if (db == 0)
> > + return b;
>
> Hmm... are the above two ifs necessary? Wouldn't the outcome be the
> same?

Yeah it would, dropping them.

> Furthermore, if a and b are on different roots the above may
> give the wrong answer while not doing the above would return NULL.

Right, thanks for catching that. I'm adding a check that the
roots are the same before proceeding.

> > + while (da > db) {
> > + a = a->parent;
> > + da--;
> > + }
> > + while (db > da) {
> > + b = b->parent;
> > + db--;
> > + }
> > +
> > + /* worst case b and a will be the same at root */
> > + while (b != a) {
> > + b = b->parent;
> > + a = a->parent;
> > + }
> > +
> > + return a;
> > +}
> ...
> > +static char *
> > +__must_check kernfs_path_from_node_locked(struct kernfs_node *kn_from,
>
> Maybe
>
> static char * __must_check
> kernfs_path...

Actually that __must_check seems weird, I'll just drop it. (ISTM __must_check
makes sense in a fn that does something where we worry the caller doesn't
check that it succeeded, not in a fn where we are just querying a value.

> > + struct kernfs_node *kn_to, char *buf,
> > + size_t buflen)
>
> Given that @kn_from is optional and is not the target node, maybe put
> @kn_to before @kn_from?

ok

> > +{
> > + char *p = buf;
> > + struct kernfs_node *kn, *common;
> > + const char parent_str[] = "/..";
> > + int i;
> > + size_t depth_from, depth_to, len = 0, nlen = 0,
> > + plen = sizeof(parent_str) - 1;
>
> Heh, idk, just put plen on a separate decl?
>
> > +
> > + /* We atleast need 2 bytes to write "/\0". */
> > + if (buflen < 2)
> > + return NULL;
> > +
> > + if (!kn_from)
> > + kn_from = kernfs_root(kn_to)->kn;
> > +
> > + if (kn_from == kn_to) {
> > + *p = '/';
> > + *(++p) = '\0';
> > + return buf;
> > + }
> > +
> > + common = kernfs_common_ancestor(kn_from, kn_to);
> > + if (!common) {
> > + WARN_ONCE("%s: kn_from and kn_to on different roots\n",
> > + __func__);
> > + return NULL;
> > + }
>
> Have you compiled it? WARN_ONCE()'s first argument is condition, so
> you'd write
>
> if (WARN_ONCE(!common, "blah blah"))
> return NULL;

D'oh. Actually once isn't even right. I'll just do WARN_ON
(and try to do it right).

> > + depth_to = kernfs_node_distance(common, kn_to);
> > + depth_from = kernfs_node_distance(common, kn_from);
> > +
> > + for (i = 0; i < depth_from; i++) {
> > + if (len + plen + 1 > buflen)
> > + return NULL;
> > + strcpy(p, parent_str);
> > + p += plen;
> > + len += plen;
> > + }
> > +
> > + /* Calculate how many bytes we need for the rest */
> > + for (kn = kn_to; kn != common; kn = kn->parent)
> > + nlen += strlen(kn->name) + 1;
> > +
> > + if (len + nlen + 1 > buflen)
> > + return NULL;
>
> Hmm... if we do this anyway, maybe we can make the function behave
> more like other string formatting function (strlcpy) and return the
> would-be length instead where ret >= len indicates truncation?

I can change that, but the callers right now don't re-try with
larger buffer anyway, so this would actually complicate them just
a smidgeon. Would you want them changed to do that? (pr_cont_kernfs_path
right now writes into a static char[] for instance)

2015-12-08 19:34:36

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] cgroup: introduce cgroup namespaces

On Tue, Dec 08, 2015 at 11:04:53AM -0500, Tejun Heo wrote:
> On Mon, Dec 07, 2015 at 05:06:18PM -0600, [email protected] wrote:
> > static const char *proc_ns_follow_link(struct dentry *dentry, void **cookie)
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 2b3e2314..906f240 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -17,11 +17,36 @@
> > #include <linux/seq_file.h>
> > #include <linux/kernfs.h>
> > #include <linux/jump_label.h>
> > +#include <linux/nsproxy.h>
> > +#include <linux/types.h>
> > +#include <linux/ns_common.h>
> > +#include <linux/nsproxy.h>
> > +#include <linux/user_namespace.h>
> >
> > #include <linux/cgroup-defs.h>
> >
> > +struct cgroup_namespace {
> > + atomic_t count;
> > + struct ns_common ns;
> > + struct user_namespace *user_ns;
> > + struct css_set *root_cset;
> > +};
> > +
> > +extern struct cgroup_namespace init_cgroup_ns;
> > +
> > +static inline void get_cgroup_ns(struct cgroup_namespace *ns)
> > +{
> > + if (ns)
> > + atomic_inc(&ns->count);
> > +}
> > +
> ..
> > +void free_cgroup_ns(struct cgroup_namespace *ns);
> > +struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
> > + struct user_namespace *user_ns,
> > + struct cgroup_namespace *old_ns);
> ...
> > +char * __must_check cgroup_path_ns(struct cgroup *cgrp, char *buf,
> > + size_t buflen, struct cgroup_namespace *ns);
> > +char * __must_check cgroup_path(struct cgroup *cgrp, char *buf, size_t buflen);
> ...
> > +static inline void put_cgroup_ns(struct cgroup_namespace *ns)
> > +{
> > + if (ns && atomic_dec_and_test(&ns->count))
> > + free_cgroup_ns(ns);
> > +}
>
> I'd prefer collecting all ns related declarations in a single place.

I can group some of them, but free_cgroup_ns needs the
cgroup_namespace definition, put_cgroup_ns() needs free_cgroup_ns(),
and free_cgroup_ns() is static in the !CONFIG_CGROUPS case and a
non-static function in the other case.

So I'm going to put both get_cgroup_ns() and put_cgroup_ns() at the
end of the file, with the struct namespace define at the top. Is
that sufficient?

> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index 7f2f007..4fd07b5a 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -297,6 +300,13 @@ static bool cgroup_on_dfl(const struct cgroup *cgrp)
> > {
> > return cgrp->root == &cgrp_dfl_root;
> > }
>
> New line.
>
> > +struct cgroup_namespace init_cgroup_ns = {
> > + .count = { .counter = 1, },
> > + .user_ns = &init_user_ns,
> > + .ns.ops = &cgroupns_operations,
> > + .ns.inum = PROC_CGROUP_INIT_INO,
> > + .root_cset = &init_css_set,
> > +};
>
> Also, why inbetween two function defs? cgroup.c is messy but you can
> still put it after other variable definitions.
>
> > @@ -430,18 +440,18 @@ static inline bool cgroup_is_dead(const struct cgroup *cgrp)
> > return !(cgrp->self.flags & CSS_ONLINE);
> > }
> >
> > -static void cgroup_get(struct cgroup *cgrp)
> > +static inline void cgroup_get(struct cgroup *cgrp)
> > {
> > WARN_ON_ONCE(cgroup_is_dead(cgrp));
> > css_get(&cgrp->self);
> > }
> >
> > -static bool cgroup_tryget(struct cgroup *cgrp)
> > +static inline bool cgroup_tryget(struct cgroup *cgrp)
> > {
> > return css_tryget(&cgrp->self);
> > }
> >
> > -static void cgroup_put(struct cgroup *cgrp)
> > +static inline void cgroup_put(struct cgroup *cgrp)
> > {
> > css_put(&cgrp->self);
> > }
> >
> > @@ -780,7 +790,7 @@ static void put_css_set_locked(struct css_set *cset)
> > kfree_rcu(cset, rcu_head);
> > }
> >
> > -static void put_css_set(struct css_set *cset)
> > +static inline void put_css_set(struct css_set *cset)
>
> Stray diffs?
>
> > @@ -2189,6 +2199,25 @@ static struct file_system_type cgroup2_fs_type = {
> > .kill_sb = cgroup_kill_sb,
> > };
> >
> > +char * __must_check
> > +cgroup_path_ns(struct cgroup *cgrp, char *buf, size_t buflen,
> > + struct cgroup_namespace *ns)
> > +{
> > + struct cgroup *root;
> > +
> > + BUG_ON(!ns);
>
> Just let it crash on NULL deref.
>
> Thanks.
>
> --
> tejun
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers

2015-12-08 19:46:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/7] cgroup: introduce cgroup namespaces

Hello, Serge.

On Tue, Dec 08, 2015 at 01:34:31PM -0600, Serge E. Hallyn wrote:
> > I'd prefer collecting all ns related declarations in a single place.
>
> I can group some of them, but free_cgroup_ns needs the
> cgroup_namespace definition, put_cgroup_ns() needs free_cgroup_ns(),
> and free_cgroup_ns() is static in the !CONFIG_CGROUPS case and a
> non-static function in the other case.
>
> So I'm going to put both get_cgroup_ns() and put_cgroup_ns() at the
> end of the file, with the struct namespace define at the top. Is
> that sufficient?

I wouldn't even mind creating a separate #ifdef section for NS stuff
if that's wnat's necessary for collecting everything into one spot.

Thanks.

--
tejun

2015-12-08 19:47:46

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 3/7] cgroup: introduce cgroup namespaces

On Tue, Dec 08, 2015 at 02:46:00PM -0500, Tejun Heo wrote:
> Hello, Serge.
>
> On Tue, Dec 08, 2015 at 01:34:31PM -0600, Serge E. Hallyn wrote:
> > > I'd prefer collecting all ns related declarations in a single place.
> >
> > I can group some of them, but free_cgroup_ns needs the
> > cgroup_namespace definition, put_cgroup_ns() needs free_cgroup_ns(),
> > and free_cgroup_ns() is static in the !CONFIG_CGROUPS case and a
> > non-static function in the other case.
> >
> > So I'm going to put both get_cgroup_ns() and put_cgroup_ns() at the
> > end of the file, with the struct namespace define at the top. Is
> > that sufficient?
>
> I wouldn't even mind creating a separate #ifdef section for NS stuff
> if that's wnat's necessary for collecting everything into one spot.

Ok, will do that, thanks.

2015-12-08 23:21:28

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

On Tue, Dec 08, 2015 at 11:20:40AM -0500, Tejun Heo wrote:
> Hello,
>
> On Mon, Dec 07, 2015 at 05:06:20PM -0600, [email protected] wrote:
> > fs/kernfs/mount.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/kernfs.h | 2 ++
> > kernel/cgroup.c | 39 ++++++++++++++++++++++++-
> > 3 files changed, 114 insertions(+), 1 deletion(-)
>
> Please put kernfs changes in a spearate patch.
>
> > diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> > index 8eaf417..9219444 100644
> > --- a/fs/kernfs/mount.c
> > +++ b/fs/kernfs/mount.c
> > @@ -62,6 +63,79 @@ struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
> > return NULL;
> > }
> >
> > +/*
> > + * find the next ancestor in the path down to @child, where @parent was the
> > + * parent whose child we want to find.
>
> s/parent/ancestor/ s/child/descendant/ ?
>
> > + *
> > + * Say the path is /a/b/c/d. @child is d, @parent is NULL. We return the root
> > + * node. If @parent is b, then we return the node for c.
> > + * Passing in d as @parent is not ok.
> > + */
> > +static struct kernfs_node *
> > +find_next_ancestor(struct kernfs_node *child, struct kernfs_node *parent)
> > +{
> > + if (child == parent) {
> > + pr_crit_once("BUG in find_next_ancestor: called with parent == child");
> > + return NULL;
> > + }
> > +
> > + while (child->parent != parent) {
> > + if (!child->parent)
> > + return NULL;
> > + child = child->parent;
> > + }
> > +
> > + return child;
> > +}
> > +
> > +/**
> > + * kernfs_obtain_root - get a dentry for the given kernfs_node
> > + * @sb: the kernfs super_block
> > + * @kn: kernfs_node for which a dentry is needed
> > + *
> > + * This can be used by callers which want to mount only a part of the kernfs
> > + * as root of the filesystem.
> > + */
> > +struct dentry *kernfs_obtain_root(struct super_block *sb,
> > + struct kernfs_node *kn)
>
> Wouldn't @kn, @sb be a better order? Also, kernfs super_blocks are
> determined by the kernfs_root and its namespace. I wonder whether
> specifying @ns would be better.

That would require kernfs to keep a mapping from an opaque void* to the
kernfs_nodes, though. This way the caller can worry about how to choose a
kernfs_node from the namespace object.

It might be worth it to make 'namespacing' of kernfs more boilerplate, but OTOH
this could also fall under the old kernel rule of don't add it until there's a
user, i.e. until there's a second user to justify the abstraction.

> > +{
> > + struct dentry *dentry;
> > + struct kernfs_node *knparent = NULL;
> > +
> > + BUG_ON(sb->s_op != &kernfs_sops);
> > +
> > + dentry = dget(sb->s_root);
> > + if (!kn->parent) // this is the root
> ^^^
> Do we do this now?
>
> > + return dentry;
> > +
> > + knparent = find_next_ancestor(kn, NULL);
> > + if (!knparent) {
> > + pr_crit("BUG: find_next_ancestor for root dentry returned NULL\n");
>
> Wouldn't stack dump helpful here? Why not
>
> if (WARN_ONCE(!knparent, "find_next..."))
> return ERR_PTR(-EINVAL);
>
> or even just WARN_ON_ONCE().
>
> > + return ERR_PTR(-EINVAL);
> > + }
> > +
> > + do {
> > + struct dentry *dtmp;
> > + struct kernfs_node *kntmp;
> > +
> > + if (kn == knparent)
> > + return dentry;
> > + kntmp = find_next_ancestor(kn, knparent);
> > + if (!kntmp) {
> > + pr_crit("BUG: find_next_ancestor returned NULL for node\n");
>
> Ditto. It'd be a kernel bug. WARN is usually the better way.
>
> > diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> > index a5ab74d..09cd718 100644
> > --- a/kernel/cgroup.c
> > +++ b/kernel/cgroup.c
> > @@ -2011,6 +2011,15 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> > int ret;
> > int i;
> > bool new_sb;
> > + struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
>
> Please move this upwards so that it's below other initialized
> variables.
>
> > +
> > + get_cgroup_ns(ns);
> > +
> > + /* Check if the caller has permission to mount. */
> > + if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN)) {
> > + put_cgroup_ns(ns);
> > + return ERR_PTR(-EPERM);
> > + }
> >
> > /*
> > * The first time anyone tries to mount a cgroup, enable the list
> > @@ -2127,6 +2136,11 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
> > goto out_unlock;
> > }
> >
> > + if (!opts.none && !capable(CAP_SYS_ADMIN)) {
> > + ret = -EPERM;
> > + goto out_unlock;
> > + }
>
> Hmmm... why is !opts.none necessary? Please add a comment explaining
> why the above is necessary.
>
> > out_mount:
> > dentry = kernfs_mount(fs_type, flags, root->kf_root,
> > is_v2 ? CGROUP2_SUPER_MAGIC : CGROUP_SUPER_MAGIC,
> > &new_sb);
> > +
> > + if (!IS_ERR(dentry)) {
> > + /*
> > + * In non-init cgroup namespace, instead of root cgroup's
> > + * dentry, we return the dentry corresponding to the
> > + * cgroupns->root_cgrp.
> > + */
> > + if (ns != &init_cgroup_ns) {
>
> if (!IS_ERR(dentry) && ns != &init_cgroup_ns) {
>
> > + struct dentry *nsdentry;
> > + struct cgroup *cgrp;
> > +
> > + cgrp = cset_cgroup_from_root(ns->root_cset, root);
> > + nsdentry = kernfs_obtain_root(dentry->d_sb,
> > + cgrp->kn);
>
> Heh, is kernfs_obtain_root() the right name? Maybe
> kernfs_node_to_inode()?

kernfs_node_to_dentry?

This would presumably make the question of whether to pass in a namespace
moot?

2015-12-09 01:17:12

by Serge Hallyn

[permalink] [raw]
Subject: Re: [PATCH 1/7] kernfs: Add API to generate relative kernfs path

On Tue, Dec 08, 2015 at 06:51:20AM -0500, Greg KH wrote:
> On Mon, Dec 07, 2015 at 05:06:16PM -0600, [email protected] wrote:
> > From: Aditya Kali <[email protected]>
> >
> > The new function kernfs_path_from_node() generates and returns kernfs
> > path of a given kernfs_node relative to a given parent kernfs_node.
> >
> > Changelog 20151125:
> > - Fully-wing multilinecomments
> > - Rework kernfs_path_from_node_locked() logic
> > - Replace BUG_ONs with returning NULL
> > - Use a const char* for /.. and precalculate its size
> > Changelog 20151130:
> > - Update kernfs_path_from_node_locked comment
>
> changes should go below the --- line, not here in the body of the
> changelog that will show up in git :(

Right, thanks. I've changed them all to add a --- before the changelog.
The format-patch ends up looking a bit weird with two --- lines, but
subsequent git am does the right thing, so hopefully this is the right
way.

2015-12-09 15:48:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/7] cgroup: mount cgroupns-root when inside non-init cgroupns

Hello, Serge.

On Tue, Dec 08, 2015 at 05:21:24PM -0600, Serge E. Hallyn wrote:
> > Heh, is kernfs_obtain_root() the right name? Maybe
> > kernfs_node_to_inode()?
>
> kernfs_node_to_dentry?
>
> This would presumably make the question of whether to pass in a namespace
> moot?

Sounds good.

Thanks.

--
tejun