2009-12-04 08:54:10

by Ben Blum

Subject: [RFC] [PATCH 0/5] cgroups: support for module-loadable subsystems

[This is a revision of http://lkml.org/lkml/2009/11/2/442 ]

This patch series implements support for building, loading, and
unloading subsystems as modules, both inside and outside the kernel
source tree. It provides an interface, cgroup_load_subsys() and
cgroup_unload_subsys(), which modular subsystems can use to register
and unregister at runtime. The net_cls classifier subsystem serves as
the example of a subsystem that can be converted into a module using
these changes. Additionally, support for subsystem dependencies is
implemented with the new modular-subsystem scenario in mind.

Patch #1 sets up the subsys[] array so its contents can be dynamic as
modules appear and (eventually) disappear. I introduce an rwsem called
subsys_mutex to protect against concurrent loads/unloads/reads, and
modify iterations over the array to handle when subsystems are absent.

Patch #2 implements an interface for modules to load subsystems, called
cgroup_load_subsys, similar to cgroup_init_subsys, and adds a module
pointer in struct cgroup_subsys.

Patch #3 modifies the net_cls subsystem, which already had some module
declarations, to be configurable as a module, which also serves as a
simple proof-of-concept.

Patch #4 adds a mechanism for unloading modular subsystems, which
includes a more advanced rework of the rudimentary reference counting
introduced in patch 2. It also modifies the exitcall of net_cls to use
this interface.

Patch #5 implements support for subsystems to specify dependencies on
other subsystems. It requires the subsystem to give a list of its
dependencies in the form of a NULL-terminated array of strings, each
being the name of a depended-upon subsystem, for reasons explained in
the patch itself.

Implementing patches 2 and 4 involved updating the css pointers in each
css_set when a module appears or leaves. In doing this, it was
discovered that css_sets always remain linked to the dummy cgroup,
regardless of whether or not any subsystems are actually bound to it
(i.e., not mounted on an actual hierarchy). The subsystem loading and
unloading code should therefore handle the special cases where the
added subsystem is the only one in the dummy cgroup (so all css_sets
need to be linked back into it) and where the removed subsystem was the
only one in the dummy cgroup (so all css_sets should be unlinked from
it). However, since all css_sets currently stay attached to the dummy
cgroup anyway, these cases are ignored here; any fix for that issue
should also make sure these cases are handled in the subsystem loading
and unloading code.

-- bblum


2009-12-04 08:55:27

by Ben Blum

Subject: [RFC] [PATCH 1/5] cgroups: revamp subsys array

Make subsys[] able to be dynamically populated to support modular subsystems

From: Ben Blum <[email protected]>

This patch reworks the way the subsys[] array is used so that subsystems can
register themselves after boot time, and enables the cgroups internals to
handle subsystems that are absent or that may appear and disappear.
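The reworked array can be sketched as a user-space model: builtin entries occupy fixed slots, modular entries go into the first free slot past the builtin section, and every iteration must skip NULL slots. All names here (model_subsys, model_register, and the counts) are illustrative, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BUILTIN_COUNT 2
/* mirrors CGROUP_SUBSYS_COUNT being bounded by the bits in subsys_bits */
#define MODEL_SUBSYS_COUNT (8 * sizeof(unsigned long))

struct model_subsys { const char *name; };

static struct model_subsys cpu_ss = { "cpu" }, mem_ss = { "memory" };
static struct model_subsys *model_subsys_arr[MODEL_SUBSYS_COUNT] = {
	&cpu_ss, &mem_ss, /* remaining slots stay NULL until a module loads */
};

/* place a modular subsystem in the first free dynamic slot;
 * returns the assigned id, or -1 if every slot is taken */
static int model_register(struct model_subsys *ss)
{
	size_t i;
	for (i = MODEL_BUILTIN_COUNT; i < MODEL_SUBSYS_COUNT; i++) {
		if (model_subsys_arr[i] == NULL) {
			model_subsys_arr[i] = ss;
			return (int)i;
		}
	}
	return -1;
}

/* count registered subsystems, skipping absent (NULL) entries
 * exactly as the patched iterations must */
static int model_count(void)
{
	int n = 0;
	size_t i;
	for (i = 0; i < MODEL_SUBSYS_COUNT; i++)
		if (model_subsys_arr[i] != NULL)
			n++;
	return n;
}
```

In the kernel the dynamic section is additionally guarded by subsys_mutex; the model omits locking to keep the slot logic visible.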

Signed-off-by: Ben Blum <[email protected]>
---

include/linux/cgroup.h | 8 +++-
kernel/cgroup.c | 92 ++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 82 insertions(+), 18 deletions(-)


diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 642a47f..d7f1545 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -41,13 +41,17 @@ extern int cgroupstats_build(struct cgroupstats *stats,

extern struct file_operations proc_cgroup_operations;

-/* Define the enumeration of all cgroup subsystems */
+/* Define the enumeration of all builtin cgroup subsystems */
#define SUBSYS(_x) _x ## _subsys_id,
enum cgroup_subsys_id {
#include <linux/cgroup_subsys.h>
- CGROUP_SUBSYS_COUNT
+ CGROUP_BUILTIN_SUBSYS_COUNT
};
#undef SUBSYS
+/* This define indicates the maximum number of subsystems that can be loaded
+ * at once. We limit to this many since cgroupfs_root has subsys_bits to keep
+ * track of all of them. */
+#define CGROUP_SUBSYS_COUNT (BITS_PER_BYTE*sizeof(unsigned long))

/* Per-subsystem/per-cgroup state maintained by the system. */
struct cgroup_subsys_state {
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7af6168..84448d0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -58,10 +58,15 @@ static DEFINE_MUTEX(cgroup_mutex);

/* Generate an array of cgroup subsystem pointers */
#define SUBSYS(_x) &_x ## _subsys,
-
-static struct cgroup_subsys *subsys[] = {
+static struct cgroup_subsys *subsys[CGROUP_SUBSYS_COUNT] = {
#include <linux/cgroup_subsys.h>
};
+/*
+ * this lock protects the dynamic section of the array, in which modular
+ * subsystems are loaded. nests outside of cgroup_mutex, because of both
+ * cgroup_get_sb and cgroup_load_subsys.
+ */
+static DECLARE_RWSEM(subsys_mutex);

#define MAX_CGROUP_ROOT_NAMELEN 64

@@ -433,8 +438,9 @@ static struct css_set *find_existing_css_set(
struct hlist_node *node;
struct css_set *cg;

- /* Built the set of subsystem state objects that we want to
- * see in the new css_set */
+ /* Built the set of subsystem state objects that we want to see in the
+ * new css_set. while subsystems can change globally, the entries here
+ * won't change, so no need for subsys_mutex. */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
if (root->subsys_bits & (1UL << i)) {
/* Subsystem is in this hierarchy. So we want
@@ -869,7 +875,9 @@ void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
css_put(css);
}

-
+/*
+ * Call with subsys_mutex held - see parse_cgroupfs_options.
+ */
static int rebind_subsystems(struct cgroupfs_root *root,
unsigned long final_bits)
{
@@ -877,6 +885,8 @@ static int rebind_subsystems(struct cgroupfs_root *root,
struct cgroup *cgrp = &root->top_cgroup;
int i;

+ BUG_ON(!rwsem_is_locked(&subsys_mutex));
+
removed_bits = root->actual_subsys_bits & ~final_bits;
added_bits = final_bits & ~root->actual_subsys_bits;
/* Check that any added subsystems are currently free */
@@ -885,6 +895,12 @@ static int rebind_subsystems(struct cgroupfs_root *root,
struct cgroup_subsys *ss = subsys[i];
if (!(bit & added_bits))
continue;
+ /* nobody should tell us to do a subsys that doesn't exist:
+ * parse_cgroupfs_options should catch that case (subsys_mutex
+ * prevents state from changing in between), and the only time
+ * we're called without that beforehand is when the bitmask is
+ * 0 in cgroup_kill_sb. */
+ BUG_ON(ss == NULL);
if (ss->root != &rootnode) {
/* Subsystem isn't free */
return -EBUSY;
@@ -904,6 +920,7 @@ static int rebind_subsystems(struct cgroupfs_root *root,
unsigned long bit = 1UL << i;
if (bit & added_bits) {
/* We're binding this subsystem to this hierarchy */
+ BUG_ON(ss == NULL);
BUG_ON(cgrp->subsys[i]);
BUG_ON(!dummytop->subsys[i]);
BUG_ON(dummytop->subsys[i]->cgroup != dummytop);
@@ -915,8 +932,11 @@ static int rebind_subsystems(struct cgroupfs_root *root,
if (ss->bind)
ss->bind(ss, cgrp);
mutex_unlock(&ss->hierarchy_mutex);
+ /* TODO: If subsystem unloading support, need to take
+ * a reference on the subsystem here. */
} else if (bit & removed_bits) {
/* We're removing this subsystem */
+ BUG_ON(ss == NULL);
BUG_ON(cgrp->subsys[i] != dummytop->subsys[i]);
BUG_ON(cgrp->subsys[i]->cgroup != cgrp);
mutex_lock(&ss->hierarchy_mutex);
@@ -927,8 +947,11 @@ static int rebind_subsystems(struct cgroupfs_root *root,
subsys[i]->root = &rootnode;
list_move(&ss->sibling, &rootnode.subsys_list);
mutex_unlock(&ss->hierarchy_mutex);
+ /* TODO: If subsystem unloading support, drop refcount
+ * here. */
} else if (bit & final_bits) {
/* Subsystem state should already exist */
+ BUG_ON(ss == NULL);
BUG_ON(!cgrp->subsys[i]);
} else {
/* Subsystem state shouldn't exist */
@@ -971,14 +994,18 @@ struct cgroup_sb_opts {

};

-/* Convert a hierarchy specifier into a bitmask of subsystems and
- * flags. */
+/*
+ * Convert a hierarchy specifier into a bitmask of subsystems and flags. Call
+ * with subsys_mutex held to guard subsys[] between us and rebind_subsystems.
+ */
static int parse_cgroupfs_options(char *data,
struct cgroup_sb_opts *opts)
{
char *token, *o = data ?: "all";
unsigned long mask = (unsigned long)-1;

+ BUG_ON(!rwsem_is_locked(&subsys_mutex));
+
#ifdef CONFIG_CPUSETS
mask = ~(1UL << cpuset_subsys_id);
#endif
@@ -994,6 +1021,8 @@ static int parse_cgroupfs_options(char *data,
opts->subsys_bits = 0;
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
+ if (ss == NULL)
+ continue;
if (!ss->disabled)
opts->subsys_bits |= 1ul << i;
}
@@ -1038,6 +1067,8 @@ static int parse_cgroupfs_options(char *data,
int i;
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
ss = subsys[i];
+ if (ss == NULL)
+ continue;
if (!strcmp(token, ss->name)) {
if (!ss->disabled)
set_bit(i, &opts->subsys_bits);
@@ -1084,6 +1115,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)

lock_kernel();
mutex_lock(&cgrp->dentry->d_inode->i_mutex);
+ down_read(&subsys_mutex);
mutex_lock(&cgroup_mutex);

/* See what subsystems are wanted */
@@ -1116,6 +1148,7 @@ static int cgroup_remount(struct super_block *sb, int *flags, char *data)
kfree(opts.release_agent);
kfree(opts.name);
mutex_unlock(&cgroup_mutex);
+ up_read(&subsys_mutex);
mutex_unlock(&cgrp->dentry->d_inode->i_mutex);
unlock_kernel();
return ret;
@@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
struct cgroupfs_root *new_root;

/* First find the desired set of subsystems */
+ down_read(&subsys_mutex);
ret = parse_cgroupfs_options(data, &opts);
if (ret)
goto out_err;
@@ -1367,6 +1401,8 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
free_cg_links(&tmp_cg_links);
goto drop_new_super;
}
+ /* done with subsys stuff and no other failure case */
+ up_read(&subsys_mutex);

/* EBUSY should be the only error here */
BUG_ON(ret);
@@ -1415,6 +1451,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
drop_new_super:
deactivate_locked_super(sb);
out_err:
+ up_read(&subsys_mutex);
kfree(opts.release_agent);
kfree(opts.name);

@@ -1434,10 +1471,12 @@ static void cgroup_kill_sb(struct super_block *sb) {
BUG_ON(!list_empty(&cgrp->children));
BUG_ON(!list_empty(&cgrp->sibling));

+ down_read(&subsys_mutex);
mutex_lock(&cgroup_mutex);

/* Rebind all subsystems back to the default hierarchy */
ret = rebind_subsystems(root, 0);
+ up_read(&subsys_mutex);
/* Shouldn't be able to fail ... */
BUG_ON(ret);

@@ -3276,9 +3315,12 @@ static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
{
/* We need to take each hierarchy_mutex in a consistent order */
int i;
-
+ /* no worry about a race with rebind_subsystems that might mess up the
+ * locking order, since both parties are under cgroup_mutex. */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
+ if (ss == NULL)
+ continue;
if (ss->root == root)
mutex_lock(&ss->hierarchy_mutex);
}
@@ -3287,9 +3329,10 @@ static void cgroup_lock_hierarchy(struct cgroupfs_root *root)
static void cgroup_unlock_hierarchy(struct cgroupfs_root *root)
{
int i;
-
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
+ if (ss == NULL)
+ continue;
if (ss->root == root)
mutex_unlock(&ss->hierarchy_mutex);
}
@@ -3410,11 +3453,14 @@ static int cgroup_has_css_refs(struct cgroup *cgrp)
* synchronization other than RCU, and the subsystem linked
* list isn't RCU-safe */
int i;
+ /* we won't need to lock the subsys array, because the subsystems
+ * we're concerned about aren't going anywhere since our cgroup root
+ * has a reference on them. */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
struct cgroup_subsys_state *css;
- /* Skip subsystems not in this hierarchy */
- if (ss->root != cgrp->root)
+ /* Skip subsystems not present or not in this hierarchy */
+ if (ss == NULL || ss->root != cgrp->root)
continue;
css = cgrp->subsys[ss->subsys_id];
/* When called from check_for_release() it's possible
@@ -3635,7 +3681,8 @@ int __init cgroup_init_early(void)
for (i = 0; i < CSS_SET_TABLE_SIZE; i++)
INIT_HLIST_HEAD(&css_set_table[i]);

- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ /* at bootup time, we don't worry about modular subsystems */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];

BUG_ON(!ss->name);
@@ -3670,7 +3717,8 @@ int __init cgroup_init(void)
if (err)
return err;

- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ /* at bootup time, we don't worry about modular subsystems */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (!ss->early_init)
cgroup_init_subsys(ss);
@@ -3779,14 +3827,19 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
int i;

seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
+ /* ideally we don't want subsystems moving around while we do this */
+ down_read(&subsys_mutex);
mutex_lock(&cgroup_mutex);
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
+ if (ss == NULL)
+ continue;
seq_printf(m, "%s\t%d\t%d\t%d\n",
ss->name, ss->root->hierarchy_id,
ss->root->number_of_cgroups, !ss->disabled);
}
mutex_unlock(&cgroup_mutex);
+ up_read(&subsys_mutex);
return 0;
}

@@ -3841,7 +3894,10 @@ void cgroup_fork_callbacks(struct task_struct *child)
{
if (need_forkexit_callback) {
int i;
- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ /* forkexit callbacks are only supported for builtin subsystems,
+ * and the builtin section of the subsys array is immutable,
+ * so we don't need subsys_mutex here. */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->fork)
ss->fork(ss, child);
@@ -3912,7 +3968,9 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
struct css_set *cg;

if (run_callbacks && need_forkexit_callback) {
- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ /* modular subsystems can't use callbacks, so no need to take
+ * subsys_mutex */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
if (ss->exit)
ss->exit(ss, tsk);
@@ -4221,7 +4279,9 @@ static int __init cgroup_disable(char *str)
if (!*token)
continue;

- for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
+ /* cgroup_disable, being at boot time, can't know about module
+ * subsystems, so we don't worry about them. */
+ for (i = 0; i < CGROUP_BUILTIN_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];

if (!strcmp(token, ss->name)) {


Attachments:
cgroups-revamp-subsys-array.patch (12.28 kB)

2009-12-04 08:56:49

by Ben Blum

Subject: [RFC] [PATCH 2/5] cgroups: subsystem module loading interface

Add interface between cgroups subsystem management and module loading

From: Ben Blum <[email protected]>

This patch implements rudimentary module-loading support for cgroups - namely,
a cgroup_load_subsys (similar to cgroup_init_subsys) for use as a module
initcall, and a struct module pointer in struct cgroup_subsys.

Several functions that might be wanted by modules have had EXPORT_SYMBOL_GPL
added to them, but it's unclear exactly which functions need to be exported
and which do not.
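The subtle part of loading is that each existing css_set must be rehashed: the hash key is the whole css pointer array, so writing the new subsystem's css pointer into a set changes which bucket it belongs in. A user-space sketch of just that property (all names and the hash function are illustrative, not the kernel's css_set_hash):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 4
#define NBUCKETS 8

/* toy stand-in for css_set_hash: hashes the entire pointer array */
static unsigned model_css_set_hash(void *const css[NSLOTS])
{
	uintptr_t h = 0;
	size_t i;
	for (i = 0; i < NSLOTS; i++)
		h = h * 31 + (uintptr_t)css[i];
	return (unsigned)(h % NBUCKETS);
}

/* mimics the loading loop's mutation: install the new css pointer,
 * then return the bucket the entry must be re-inserted into */
static unsigned model_attach_css(void *css[NSLOTS], int new_id, void *new_css)
{
	css[new_id] = new_css;
	return model_css_set_hash(css);
}
```

This is why the patch removes each css_set from the table, sets the pointer, recomputes the hash, and re-adds it, all under css_set_lock.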

Signed-off-by: Ben Blum <[email protected]>
---

Documentation/cgroups/cgroups.txt | 3 +
include/linux/cgroup.h | 4 +
kernel/cgroup.c | 119 +++++++++++++++++++++++++++++++++++++
3 files changed, 126 insertions(+), 0 deletions(-)


diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 3df4b9a..dd0d6f1 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -492,6 +492,9 @@ Each subsystem should:
- add an entry in linux/cgroup_subsys.h
- define a cgroup_subsys object called <name>_subsys

+If a subsystem can be compiled as a module, it should also have in its
+module initcall a call to cgroup_load_subsys().
+
Each subsystem may export the following methods. The only mandatory
methods are create/destroy. Any others that are null are presumed to
be successful no-ops.
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d7f1545..c8474c4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -38,6 +38,7 @@ extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
unsigned long clone_flags);
extern int cgroupstats_build(struct cgroupstats *stats,
struct dentry *dentry);
+extern int cgroup_load_subsys(struct cgroup_subsys *ss);

extern struct file_operations proc_cgroup_operations;

@@ -477,6 +478,9 @@ struct cgroup_subsys {
/* used when use_id == true */
struct idr idr;
spinlock_t id_lock;
+
+ /* should be defined only by modular subsystems */
+ struct module *module;
};

#define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 84448d0..858a786 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2492,6 +2492,7 @@ int cgroup_add_file(struct cgroup *cgrp,
error = PTR_ERR(dentry);
return error;
}
+EXPORT_SYMBOL_GPL(cgroup_add_file);

int cgroup_add_files(struct cgroup *cgrp,
struct cgroup_subsys *subsys,
@@ -2506,6 +2507,7 @@ int cgroup_add_files(struct cgroup *cgrp,
}
return 0;
}
+EXPORT_SYMBOL_GPL(cgroup_add_files);

/**
* cgroup_task_count - count the number of tasks in a cgroup.
@@ -3651,7 +3653,124 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
mutex_init(&ss->hierarchy_mutex);
lockdep_set_class(&ss->hierarchy_mutex, &ss->subsys_key);
ss->active = 1;
+
+ /* this function shouldn't be used with modular subsystems, since they
+ * need to register a subsys_id, among other things */
+ BUG_ON(ss->module);
+}
+
+/**
+ * cgroup_load_subsys: load and register a modular subsystem at runtime
+ * @ss: the subsystem to load
+ *
+ * This function should be called in a modular subsystem's initcall. If the
+ * subsystem is built as a module, it will be assigned a new subsys_id and set
+ * up for use. If the subsystem is built-in anyway, work is delegated to the
+ * simpler cgroup_init_subsys.
+ */
+int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
+{
+ int i;
+ struct cgroup_subsys_state *css;
+ struct cg_cgroup_link *link;
+
+ /* check name and function validity */
+ if (ss->name == NULL || strlen(ss->name) > MAX_CGROUP_TYPE_NAMELEN ||
+ ss->create == NULL || ss->destroy == NULL)
+ return -EINVAL;
+
+ /* we don't support callbacks in modular subsystems. this check is
+ * before the ss->module check for consistency - a subsystem that
+ * *could* be a module should still have no callbacks. */
+ if (ss->fork || ss->exit)
+ return -EINVAL;
+
+ /* an optionally modular subsystem is built-in: we want to do nothing,
+ * since cgroup_init_subsys will take care of it. */
+ if (ss->module == NULL) {
+ /* sanity: ss->module NULL only if the subsys is built-in and
+ * appears in subsys[] already. */
+ BUG_ON(ss->subsys_id >= CGROUP_BUILTIN_SUBSYS_COUNT);
+ BUG_ON(subsys[ss->subsys_id] != ss);
+ return 0;
+ }
+
+ /* need to register a subsys id before anything else - for example,
+ * init_cgroup_css needs it. also, subsys_mutex needs to nest outside
+ * cgroup_mutex. */
+ down_write(&subsys_mutex);
+ /* find the first empty slot in the array */
+ for (i = CGROUP_BUILTIN_SUBSYS_COUNT; i < CGROUP_SUBSYS_COUNT; i++) {
+ if (subsys[i] == NULL)
+ break;
+ }
+ if (i == CGROUP_SUBSYS_COUNT) {
+ /* maximum number of subsystems already registered! */
+ up_write(&subsys_mutex);
+ return -EBUSY;
+ }
+ /* assign ourselves the subsys_id */
+ ss->subsys_id = i;
+ subsys[i] = ss;
+
+ mutex_lock(&cgroup_mutex);
+ /* no ss->create seems to need anything important in the ss struct, so
+ * this can happen first (i.e. before the rootnode attachment). */
+ css = ss->create(ss, dummytop);
+ if (IS_ERR(css)) {
+ /* failure case - need to deassign the subsys[] slot. */
+ mutex_unlock(&cgroup_mutex);
+ subsys[i] = NULL;
+ up_write(&subsys_mutex);
+ return PTR_ERR(css);
+ }
+
+ list_add(&ss->sibling, &rootnode.subsys_list);
+ ss->root = &rootnode;
+
+ /* our new subsystem will be attached to the dummy hierarchy. */
+ init_cgroup_css(css, ss, dummytop);
+ /* now we need to entangle the css into the existing css_sets. unlike
+ * in cgroup_init_subsys, there are now multiple css_sets, so each one
+ * will need a new pointer to it; done by iterating the css_set_table.
+ * furthermore, modifying the existing css_sets will corrupt the hash
+ * table state, so each changed css_set will need its hash recomputed.
+ * this is all done under the css_set_lock. */
+ write_lock(&css_set_lock);
+ for (i = 0; i < CSS_SET_TABLE_SIZE; i++) {
+ struct css_set *cg;
+ struct hlist_node *node, *tmp;
+ struct hlist_head *bucket = &css_set_table[i], *new_bucket;
+ hlist_for_each_entry_safe(cg, node, tmp, bucket, hlist) {
+ /* skip entries that we already rehashed */
+ if (cg->subsys[ss->subsys_id])
+ continue;
+ /* remove existing entry */
+ hlist_del(&cg->hlist);
+ /* set new value */
+ cg->subsys[ss->subsys_id] = css;
+ /* recompute hash and restore entry */
+ new_bucket = css_set_hash(cg->subsys);
+ hlist_add_head(&cg->hlist, new_bucket);
+ }
+ }
+ write_unlock(&css_set_lock);
+
+ mutex_init(&ss->hierarchy_mutex);
+ lockdep_set_class(&ss->hierarchy_mutex, &ss->subsys_key);
+ ss->active = 1;
+
+ /* pin the subsystem's module so it doesn't go away. this shouldn't
+ * fail, since the module's initcall calls us.
+ * TODO: with module unloading, move this elsewhere */
+ BUG_ON(!try_module_get(ss->module));
+
+ /* success! */
+ mutex_unlock(&cgroup_mutex);
+ up_write(&subsys_mutex);
+ return 0;
}
+EXPORT_SYMBOL_GPL(cgroup_load_subsys);

/**
* cgroup_init_early - cgroup initialization at system boot


Attachments:
cgroups-subsys-module-interface.patch (7.18 kB)

2009-12-04 08:57:34

by Ben Blum

Subject: [RFC] [PATCH 3/5] cgroups: net_cls as module

Allows the net_cls cgroup subsystem to be compiled as a module

From: Ben Blum <[email protected]>

This patch modifies net/sched/cls_cgroup.c to allow the net_cls subsystem to
be optionally compiled as a module instead of builtin. The cgroup_subsys
struct is moved around a bit to allow the subsys_id to be either declared as a
compile-time constant by the cgroup_subsys.h include in cgroup.h, or, if it's
a module, initialized within the struct by cgroup_load_subsys.
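The subsys_id trick can be modeled in user space: in the builtin configuration the id is a compile-time enum constant, while in the module configuration the identifier is a macro aliasing a struct field that the loader fills in. All names here are hypothetical stand-ins for net_cls_subsys_id and cgroup_load_subsys.

```c
#include <assert.h>

struct model_subsys { int subsys_id; };

#ifndef MODEL_BUILTIN /* the "module" configuration */
static struct model_subsys model_net_cls = { -1 }; /* id unknown until load */
#define model_net_cls_id model_net_cls.subsys_id
#else /* the "builtin" configuration: id is a compile-time constant */
enum { model_net_cls_id = 1 };
#endif

/* stands in for cgroup_load_subsys assigning the runtime id */
static void model_load(struct model_subsys *ss, int id)
{
	ss->subsys_id = id;
}
```

Because the macro expands to the struct field, all existing uses of the id keep compiling unchanged in either configuration.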

Signed-off-by: Ben Blum <[email protected]>
---

net/sched/Kconfig | 5 ++++-
net/sched/cls_cgroup.c | 37 ++++++++++++++++++++++++++++---------
2 files changed, 32 insertions(+), 10 deletions(-)


diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 929218a..08ff9bc 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -328,13 +328,16 @@ config NET_CLS_FLOW
module will be called cls_flow.

config NET_CLS_CGROUP
- bool "Control Group Classifier"
+ tristate "Control Group Classifier"
select NET_CLS
depends on CGROUPS
---help---
Say Y here if you want to classify packets based on the control
cgroup of their process.

+ To compile this code as a module, choose M here: the
+ module will be called cls_cgroup.
+
config NET_EMATCH
bool "Extended Matches"
select NET_CLS
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index e4877ca..df9723b 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -24,6 +24,25 @@ struct cgroup_cls_state
u32 classid;
};

+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+ struct cgroup *cgrp);
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+struct cgroup_subsys net_cls_subsys = {
+ .name = "net_cls",
+ .create = cgrp_create,
+ .destroy = cgrp_destroy,
+ .populate = cgrp_populate,
+#ifdef CONFIG_NET_CLS_CGROUP
+ .subsys_id = net_cls_subsys_id,
+#else
+#define net_cls_subsys_id net_cls_subsys.subsys_id
+#endif
+ .module = THIS_MODULE,
+};
+
+
static inline struct cgroup_cls_state *cgrp_cls_state(struct cgroup *cgrp)
{
return container_of(cgroup_subsys_state(cgrp, net_cls_subsys_id),
@@ -79,14 +98,6 @@ static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
}

-struct cgroup_subsys net_cls_subsys = {
- .name = "net_cls",
- .create = cgrp_create,
- .destroy = cgrp_destroy,
- .populate = cgrp_populate,
- .subsys_id = net_cls_subsys_id,
-};
-
struct cls_cgroup_head
{
u32 handle;
@@ -277,12 +288,20 @@ static struct tcf_proto_ops cls_cgroup_ops __read_mostly = {

static int __init init_cgroup_cls(void)
{
- return register_tcf_proto_ops(&cls_cgroup_ops);
+ int ret = register_tcf_proto_ops(&cls_cgroup_ops);
+ if (ret)
+ return ret;
+ ret = cgroup_load_subsys(&net_cls_subsys);
+ if (ret)
+ unregister_tcf_proto_ops(&cls_cgroup_ops);
+ return ret;
}

static void __exit exit_cgroup_cls(void)
{
unregister_tcf_proto_ops(&cls_cgroup_ops);
+ /* TODO: unload subsystem. for now, the try_module_get in load_subsys
+ * prevents us from getting here. */
}

module_init(init_cgroup_cls);


Attachments:
cgroups-net_cls-as-module.patch (3.40 kB)

2009-12-04 08:59:08

by Ben Blum

Subject: [RFC] [PATCH 4/5] cgroups: subsystem module unloading

Provides support for unloading modular subsystems.

From: Ben Blum <[email protected]>

This patch adds a new function cgroup_unload_subsys which is to be used for
removing a loaded subsystem during module deletion. Reference counting of a
subsystem's module is moved from a single reference taken at load time to one
reference per attached hierarchy, taken in rebind_subsystems (i.e., either 0
or 1).

It also adds a call to cgroup_unload_subsys in the exitcall of
net/sched/cls_cgroup.c.
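The pin-and-roll-back loop this patch adds to rebind_subsystems can be sketched in user space: every module being attached is pinned, and a failure part-way through unpins exactly the prefix already pinned. The model_* names and the dying flag are illustrative stand-ins for try_module_get/module_put semantics.

```c
#include <assert.h>

struct model_module { int refcnt; int dying; };

/* stand-in for try_module_get: fails if the module is being removed */
static int model_try_get(struct model_module *m)
{
	if (m->dying)
		return 0;
	m->refcnt++;
	return 1;
}

static void model_put(struct model_module *m)
{
	m->refcnt--;
}

/* pin every module in mods[]; on failure unpin the pinned prefix
 * (the backwards loop mirrors the patch) and return -1 */
static int model_pin_all(struct model_module *mods[], int n)
{
	int i;
	for (i = 0; i < n; i++)
		if (!model_try_get(mods[i]))
			break;
	if (i == n)
		return 0;
	for (i--; i >= 0; i--)
		model_put(mods[i]);
	return -1; /* maps to -ENOENT: we raced with module removal */
}
```

To the mounting user this failure looks like "subsystem doesn't exist", which is why the kernel-side code returns -ENOENT rather than -EBUSY.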

Signed-off-by: Ben Blum <[email protected]>
---

Documentation/cgroups/cgroups.txt | 3 +
include/linux/cgroup.h | 4 +-
kernel/cgroup.c | 92 ++++++++++++++++++++++++++++++++-----
net/sched/cls_cgroup.c | 3 -
4 files changed, 86 insertions(+), 16 deletions(-)


diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index dd0d6f1..110228e 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -493,7 +493,8 @@ Each subsystem should:
- define a cgroup_subsys object called <name>_subsys

If a subsystem can be compiled as a module, it should also have in its
-module initcall a call to cgroup_load_subsys().
+module initcall a call to cgroup_load_subsys(), and in its exitcall a
+call to cgroup_unload_subsys().

Each subsystem may export the following methods. The only mandatory
methods are create/destroy. Any others that are null are presumed to
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c8474c4..1cbb07f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -39,6 +39,7 @@ extern void cgroup_fork_failed(struct task_struct *p, int run_callbacks,
extern int cgroupstats_build(struct cgroupstats *stats,
struct dentry *dentry);
extern int cgroup_load_subsys(struct cgroup_subsys *ss);
+extern void cgroup_unload_subsys(struct cgroup_subsys *ss);

extern struct file_operations proc_cgroup_operations;

@@ -264,7 +265,8 @@ struct css_set {
/*
* Set of subsystem states, one for each subsystem. This array
* is immutable after creation apart from the init_css_set
- * during subsystem registration (at boot time).
+ * during subsystem registration (at boot time) and modular subsystem
+ * loading/unloading.
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 858a786..c2c7681 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -883,7 +883,7 @@ static int rebind_subsystems(struct cgroupfs_root *root,
{
unsigned long added_bits, removed_bits;
struct cgroup *cgrp = &root->top_cgroup;
- int i;
+ int i, module_pin_failed = 0;

BUG_ON(!rwsem_is_locked(&subsys_mutex));

@@ -914,6 +914,27 @@ static int rebind_subsystems(struct cgroupfs_root *root,
if (root->number_of_cgroups > 1)
return -EBUSY;

+ /* pin the modules for all subsystems that will stop being free. we
+ * don't drop refcounts on removed subsystems until later, since there
+ * are failure cases here. */
+ for (i = CGROUP_BUILTIN_SUBSYS_COUNT; i < CGROUP_SUBSYS_COUNT; i++) {
+ unsigned long bit = 1UL << i;
+ if (!(bit & added_bits))
+ continue;
+ if (!try_module_get(subsys[i]->module)) {
+ module_pin_failed = 1;
+ break;
+ }
+ }
+ if (module_pin_failed) {
+ /* oops, one of the modules was going away. this means that we
+ * raced with a delete_module call, and to the user this is
+ * essentially a "subsystem doesn't exist" case. */
+ for (i--; i >= CGROUP_BUILTIN_SUBSYS_COUNT; i--)
+ module_put(subsys[i]->module);
+ return -ENOENT;
+ }
+
/* Process each subsystem */
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
@@ -932,8 +953,7 @@ static int rebind_subsystems(struct cgroupfs_root *root,
if (ss->bind)
ss->bind(ss, cgrp);
mutex_unlock(&ss->hierarchy_mutex);
- /* TODO: If subsystem unloading support, need to take
- * a reference on the subsystem here. */
+ /* we already got the reference for this subsystem. */
} else if (bit & removed_bits) {
/* We're removing this subsystem */
BUG_ON(ss == NULL);
@@ -947,8 +967,8 @@ static int rebind_subsystems(struct cgroupfs_root *root,
subsys[i]->root = &rootnode;
list_move(&ss->sibling, &rootnode.subsys_list);
mutex_unlock(&ss->hierarchy_mutex);
- /* TODO: If subsystem unloading support, drop refcount
- * here. */
+ /* subsystem is now free - drop reference on module */
+ module_put(ss->module);
} else if (bit & final_bits) {
/* Subsystem state should already exist */
BUG_ON(ss == NULL);
@@ -1395,7 +1415,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
}

ret = rebind_subsystems(root, root->subsys_bits);
- if (ret == -EBUSY) {
+ if (ret == -EBUSY || ret == -ENOENT) {
mutex_unlock(&cgroup_mutex);
mutex_unlock(&inode->i_mutex);
free_cg_links(&tmp_cg_links);
@@ -1404,7 +1424,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
/* done with subsys stuff and no other failure case */
up_read(&subsys_mutex);

- /* EBUSY should be the only error here */
+ /* EBUSY and ENOENT should be the only errors here */
BUG_ON(ret);

list_add(&root->root_list, &roots);
@@ -3760,11 +3780,6 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
lockdep_set_class(&ss->hierarchy_mutex, &ss->subsys_key);
ss->active = 1;

- /* pin the subsystem's module so it doesn't go away. this shouldn't
- * fail, since the module's initcall calls us.
- * TODO: with module unloading, move this elsewhere */
- BUG_ON(!try_module_get(ss->module));
-
/* success! */
mutex_unlock(&cgroup_mutex);
up_write(&subsys_mutex);
@@ -3773,6 +3788,59 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
EXPORT_SYMBOL_GPL(cgroup_load_subsys);

/**
+ * cgroup_unload_subsys: unload a modular subsystem
+ * @ss: the subsystem to unload
+ *
+ * This function should be called in a modular subsystem's exitcall. When this
+ * function is invoked, the refcount on the subsystem's module will be 0, so
+ * the subsystem will not be attached to any hierarchy.
+ */
+void cgroup_unload_subsys(struct cgroup_subsys *ss)
+{
+	struct cg_cgroup_link *link;
+	struct hlist_head *hhead;
+
+	BUG_ON(ss->module == NULL);
+
+	/* we shouldn't be called if the subsystem is in use, and the use of
+	 * try_module_get in rebind_subsystems should ensure that it doesn't
+	 * start being used while we're killing it off. */
+	BUG_ON(ss->root != &rootnode);
+
+	down_write(&subsys_mutex);
+	/* deassign the subsys_id */
+	BUG_ON(ss->subsys_id < CGROUP_BUILTIN_SUBSYS_COUNT);
+	subsys[ss->subsys_id] = NULL;
+
+	mutex_lock(&cgroup_mutex);
+	/* remove subsystem from rootnode's list of subsystems */
+	list_del(&ss->sibling);
+
+	/* disentangle the css from all css_sets attached to the dummytop. as
+	 * in loading, we need to pay our respects to the hashtable gods. */
+	write_lock(&css_set_lock);
+	list_for_each_entry(link, &dummytop->css_sets, cgrp_link_list) {
+		struct css_set *cg = link->cg;
+		hlist_del(&cg->hlist);
+		BUG_ON(!cg->subsys[ss->subsys_id]);
+		cg->subsys[ss->subsys_id] = NULL;
+		hhead = css_set_hash(cg->subsys);
+		hlist_add_head(&cg->hlist, hhead);
+	}
+	write_unlock(&css_set_lock);
+
+	/* remove subsystem's css from the dummytop and free it - need to free
+	 * before marking as null because ss->destroy needs the cgrp->subsys
+	 * pointer to find its state. */
+	ss->destroy(ss, dummytop);
+	dummytop->subsys[ss->subsys_id] = NULL;
+
+	mutex_unlock(&cgroup_mutex);
+	up_write(&subsys_mutex);
+}
+EXPORT_SYMBOL_GPL(cgroup_unload_subsys);
+
+/**
* cgroup_init_early - cgroup initialization at system boot
*
* Initialize cgroups at system boot, and initialize any
diff --git a/net/sched/cls_cgroup.c b/net/sched/cls_cgroup.c
index df9723b..7f27d2c 100644
--- a/net/sched/cls_cgroup.c
+++ b/net/sched/cls_cgroup.c
@@ -300,8 +300,7 @@ static int __init init_cgroup_cls(void)
static void __exit exit_cgroup_cls(void)
{
unregister_tcf_proto_ops(&cls_cgroup_ops);
- /* TODO: unload subsystem. for now, the try_module_get in load_subsys
- * prevents us from getting here. */
+ cgroup_unload_subsys(&net_cls_subsys);
}

module_init(init_cgroup_cls);


Attachments:
cgroups-subsys-unloading.patch (8.25 kB)
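The subtlest step in cgroup_unload_subsys above is the css_set hash-table dance: because a css_set is hashed by the contents of its subsys[] array, clearing one slot changes its hash, so each entry must be pulled out of the table before the slot is cleared and re-inserted afterwards. A userspace sketch of that move, using a toy singly-linked hash table (the `cset`, `bucket_*`, and `clear_subsys` names are illustrative, not the kernel's hlist API):

```c
#include <assert.h>
#include <stddef.h>

#define NSUBSYS 4
#define NBUCKETS 8

/* Toy stand-in for a css_set: hashed by the contents of subsys[]. */
struct cset {
	void *subsys[NSUBSYS];
	struct cset *next;	/* hash-bucket chain */
};

static struct cset *buckets[NBUCKETS];

static unsigned hash_cset(const struct cset *cg)
{
	unsigned long h = 0;
	for (int i = 0; i < NSUBSYS; i++)
		h += (unsigned long)cg->subsys[i];
	return h % NBUCKETS;
}

static void bucket_add(struct cset *cg)
{
	unsigned b = hash_cset(cg);
	cg->next = buckets[b];
	buckets[b] = cg;
}

static void bucket_del(struct cset *cg)
{
	struct cset **p = &buckets[hash_cset(cg)];
	while (*p != cg)
		p = &(*p)->next;
	*p = cg->next;
}

/* Mirror of the unload loop: the key changes when subsys[id] is cleared,
 * so the entry must come out of the table first and go back in after. */
static void clear_subsys(struct cset *cg, int id)
{
	bucket_del(cg);
	cg->subsys[id] = NULL;
	bucket_add(cg);
}
```

Deleting after clearing the slot would compute the wrong bucket and corrupt the chain, which is why both the load and unload paths in the patch bracket the pointer update with hlist_del/hlist_add_head.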

2009-12-04 09:00:16

by Ben Blum

Subject: [RFC] [PATCH 5/5] cgroups: subsystem dependencies

Adds support for cgroup subsystems to specify dependencies on other subsystems

From: Ben Blum <[email protected]>

This patch adds a mechanism with which a subsystem can specify dependencies on
other subsystems. Since subsystems can come and go, it would not be useful to
refer to a depended-upon subsystem by subsys_id, so instead a NULL-terminated
array of strings must be specified. Validation is done by scanning for each
potential subsystem its list of subsystem names with strcmp at the end of
parse_cgroupfs_options.

Signed-off-by: Ben Blum <[email protected]>
---

Documentation/cgroups/cgroups.txt | 4 ++++
include/linux/cgroup.h | 8 ++++++++
kernel/cgroup.c | 36 ++++++++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+), 0 deletions(-)


diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt
index 110228e..a8de651 100644
--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -458,6 +458,10 @@ Other fields in the cgroup_subsys object include:
- name: should be initialized to a unique subsystem name. Should be
no longer than MAX_CGROUP_TYPE_NAMELEN.

+- depends: allows subsystems to specify dependencies in the form of
+ a list (NULL-terminated) of names of other subsystems. A subsystem
+ cannot be mounted unless all of its dependencies are also mounted.
+
- early_init: indicate if the subsystem needs early initialization
at system boot.

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 1cbb07f..6f77e30 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -457,6 +457,14 @@ struct cgroup_subsys {
bool use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;
+	/*
+	 * A subsystem can specify dependencies in the form of a list of names
+	 * of other subsystems. Such a subsystem can only be mounted on a
+	 * hierarchy if all subsystems named in this list are also to be
+	 * mounted. If this list of strings is initialized, it must be null
+	 * terminated.
+	 */
+	const char **depends;

/*
* Protects sibling/children links of cgroups in this
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c2c7681..5bbc45a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1015,6 +1015,38 @@ struct cgroup_sb_opts {
};

/*
+ * Given a mask of subsystems to be mounted, ensure all dependencies are met.
+ */
+static bool check_subsys_dependencies(unsigned long subsys_bits)
+{
+	unsigned long i, j, k;
+	bool matched;
+
+	for_each_bit(i, &subsys_bits, CGROUP_SUBSYS_COUNT) {
+		BUG_ON(!subsys[i]);
+		if (!subsys[i]->depends)
+			continue;
+		/* For each specified dependency, make sure there's a matching
+		 * subsystem among the ones intended for mounting. */
+		for (j = 0; subsys[i]->depends[j]; j++) {
+			matched = false;
+			for_each_bit(k, &subsys_bits, CGROUP_SUBSYS_COUNT) {
+				BUG_ON(!subsys[k]);
+				BUG_ON(!subsys[k]->name);
+				if (!strcmp(subsys[i]->depends[j],
+					    subsys[k]->name)) {
+					matched = true;
+					break;
+				}
+			}
+			if (!matched)
+				return false;
+		}
+	}
+	return true;
+}
+
+/*
* Convert a hierarchy specifier into a bitmask of subsystems and flags. Call
* with subsys_mutex held to guard subsys[] between us and rebind_subsystems.
*/
@@ -1123,6 +1155,10 @@ static int parse_cgroupfs_options(char *data,
if (!opts->subsys_bits && !opts->name)
return -EINVAL;

+	/* Check subsystem dependencies */
+	if (opts->subsys_bits && !check_subsys_dependencies(opts->subsys_bits))
+		return -ESRCH;
+
return 0;
}


Attachments:
cgroups-subsystem-dependencies.patch (3.73 kB)
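The name-matching pass this patch performs can be modeled outside the kernel. A minimal userspace sketch, assuming a toy `registry[]` and a selection bitmask in place of the real `subsys[]` and `opts->subsys_bits` (those toy names are not in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_SUBSYS 8

struct subsys_desc {
	const char *name;
	const char **depends;	/* NULL-terminated list of names, or NULL */
};

/* Toy registry: slot i holds the subsystem with id i, or NULL. */
static struct subsys_desc *registry[MAX_SUBSYS];

/* Every dependency of every selected subsystem must itself be selected. */
static bool check_deps(unsigned long bits)
{
	for (int i = 0; i < MAX_SUBSYS; i++) {
		if (!(bits & (1UL << i)))
			continue;
		assert(registry[i]);	/* selected ids must be registered */
		if (!registry[i]->depends)
			continue;
		for (int j = 0; registry[i]->depends[j]; j++) {
			bool matched = false;
			for (int k = 0; k < MAX_SUBSYS; k++) {
				if ((bits & (1UL << k)) &&
				    !strcmp(registry[i]->depends[j],
					    registry[k]->name)) {
					matched = true;
					break;
				}
			}
			if (!matched)
				return false;
		}
	}
	return true;
}
```

As in the patch, a mount request would fail (there, with -ESRCH) whenever any named dependency is absent from the requested set.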

2009-12-08 06:08:14

by Li Zefan

Subject: Re: [RFC] [PATCH 3/5] cgroups: net_cls as module

> +struct cgroup_subsys net_cls_subsys = {
> + .name = "net_cls",
> + .create = cgrp_create,
> + .destroy = cgrp_destroy,
> + .populate = cgrp_populate,
> +#ifdef CONFIG_NET_CLS_CGROUP
> + .subsys_id = net_cls_subsys_id,
> +#else
> +#define net_cls_subsys_id net_cls_subsys.subsys_id
> +#endif

This looks a bit odd. I think below is the more standard way:

#ifndef MODULES
...
#else
...
#endif

> + .module = THIS_MODULE,
> +};

2009-12-08 06:11:41

by Li Zefan

Subject: Re: [RFC] [PATCH 5/5] cgroups: subsystem dependencies

> This patch adds a mechanism with which a subsystem can specify dependencies on
> other subsystems. Since subsystems can come and go, it would not be useful to
> refer to a depended-upon subsystem by subsys_id, so instead a NULL-terminated
> array of strings must be specified. Validation is done by scanning for each
> potential subsystem its list of subsystem names with strcmp at the end of
> parse_cgroupfs_options.

I once posted a patch to implement this feature, but it was
rejected by Andrew, because there was no user using it.

And still there is no user, right?

2009-12-08 07:39:08

by Li Zefan

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

> @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> struct cgroupfs_root *new_root;
>
> /* First find the desired set of subsystems */
> + down_read(&subsys_mutex);

Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
held, so here we have:

down_read(&subsys_mutex);

down_write(&sb->s_umount);

On the other hand, sb->s_umount is held before calling kill_sb(),
so when umounting we have:

down_write(&sb->s_umount);

down_read(&subsys_mutex);

> ret = parse_cgroupfs_options(data, &opts);
> if (ret)
> goto out_err;
> @@ -1367,6 +1401,8 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> free_cg_links(&tmp_cg_links);
> goto drop_new_super;
> }
> + /* done with subsys stuff and no other failure case */
> + up_read(&subsys_mutex);
>
> /* EBUSY should be the only error here */
> BUG_ON(ret);
> @@ -1415,6 +1451,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> drop_new_super:
> deactivate_locked_super(sb);
> out_err:
> + up_read(&subsys_mutex);
> kfree(opts.release_agent);
> kfree(opts.name);
>
> @@ -1434,10 +1471,12 @@ static void cgroup_kill_sb(struct super_block *sb) {
> BUG_ON(!list_empty(&cgrp->children));
> BUG_ON(!list_empty(&cgrp->sibling));
>
> + down_read(&subsys_mutex);
> mutex_lock(&cgroup_mutex);
>
> /* Rebind all subsystems back to the default hierarchy */
> ret = rebind_subsystems(root, 0);
> + up_read(&subsys_mutex);
> /* Shouldn't be able to fail ... */
> BUG_ON(ret);
>

2009-12-09 01:08:36

by Ben Blum

Subject: Re: [RFC] [PATCH 5/5] cgroups: subsystem dependencies

On Tue, Dec 08, 2009 at 02:11:21PM +0800, Li Zefan wrote:
> > This patch adds a mechanism with which a subsystem can specify dependencies on
> > other subsystems. Since subsystems can come and go, it would not be useful to
> > refer to a depended-upon subsystem by subsys_id, so instead a NULL-terminated
> > array of strings must be specified. Validation is done by scanning for each
> > potential subsystem its list of subsystem names with strcmp at the end of
> > parse_cgroupfs_options.
>
> I once posted a patch to implement this feature, but it was
> rejected by Andrew, because there was no user using it.
>
> And still there is no user, right?

Dunno - Paul mentioned it might be useful in the future, so I threw this
together in an hour or so. Not entirely sure what he had in mind, but
this series can go with or without this particular patch.

2009-12-09 01:41:07

by Li Zefan

Subject: Re: [RFC] [PATCH 5/5] cgroups: subsystem dependencies

Ben Blum wrote:
> On Tue, Dec 08, 2009 at 02:11:21PM +0800, Li Zefan wrote:
>>> This patch adds a mechanism with which a subsystem can specify dependencies on
>>> other subsystems. Since subsystems can come and go, it would not be useful to
>>> refer to a depended-upon subsystem by subsys_id, so instead a NULL-terminated
>>> array of strings must be specified. Validation is done by scanning for each
>>> potential subsystem its list of subsystem names with strcmp at the end of
>>> parse_cgroupfs_options.
>> I once posted a patch to implement this feature, but it was
>> rejected by Andrew, because there was no user using it.
>>
>> And still there is no user, right?
>
> Dunno - Paul mentioned it might be useful in the future, so I threw this
> together in an hour or so. Not entirely sure what he had in mind, but
> this series can go with or without this particular patch.
>

So I think Andrew's suggestion still applies: drop the
patch for now, until it finds its user.

2009-12-09 05:51:36

by Ben Blum

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

On Tue, Dec 08, 2009 at 03:38:43PM +0800, Li Zefan wrote:
> > @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> > struct cgroupfs_root *new_root;
> >
> > /* First find the desired set of subsystems */
> > + down_read(&subsys_mutex);
>
> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
> held, so here we have:
>
> down_read(&subsys_mutex);
>
> down_write(&sb->s_umount);
>
> On the other hand, sb->s_umount is held before calling kill_sb(),
> so when umounting we have:
>
> down_write(&sb->s_umount);
>
> down_read(&subsys_mutex);

Unless I'm gravely mistaken, you can't have deadlock on an rwsem when
it's being taken for reading in both cases? You would have to have at
least one of the cases being down_write.

In fairness to readability, perhaps subsys_mutex should instead be
subsys_rwsem? It seemed to me that calling it a "mutex" was
conventional anyway.

-- bblum

2009-12-09 06:16:55

by Li Zefan

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

Ben Blum wrote:
> On Tue, Dec 08, 2009 at 03:38:43PM +0800, Li Zefan wrote:
>>> @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
>>> struct cgroupfs_root *new_root;
>>>
>>> /* First find the desired set of subsystems */
>>> + down_read(&subsys_mutex);
>> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
>> held, so here we have:
>>
>> down_read(&subsys_mutex);
>>
>> down_write(&sb->s_umount);
>>
>> On the other hand, sb->s_umount is held before calling kill_sb(),
>> so when umounting we have:
>>
>> down_write(&sb->s_umount);
>>
>> down_read(&subsys_mutex);
>
> Unless I'm gravely mistaken, you can't have deadlock on an rwsem when
> it's being taken for reading in both cases? You would have to have at
> least one of the cases being down_write.
>

lockdep will warn on this..

And it can really lead to deadlock, though not so obviously:

  thread 1        thread 2        thread 3
-------------------------------------------
| read(A)         write(B)
|
|                                 write(A)
|
|                 read(A)
|
| write(B)
|

t3 is waiting for t1 to release the lock, then t2 tries to
acquire the A lock for reading, but it has to wait because of t3,
and t1 has to wait for t2.

Note: a read lock has to wait if a write lock is already
waiting for the lock.
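One way to read the scenario is as a cycle in a wait-for graph: t1 waits on B (held by t2), t2's read of A queues behind t3's pending write, and t3 waits on t1's read hold of A. A small illustrative sketch (the thread names and the `waits_for` encoding are made up for this example, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Wait-for graph for the three-thread scenario:
 * t1 -> t2 : t1's write(B) waits on B, held by t2
 * t2 -> t3 : t2's read(A) queues behind t3's pending write(A)
 * t3 -> t1 : t3's write(A) waits on t1's read hold of A
 */
enum { T1, T2, T3, NTHREADS };
static const int waits_for[NTHREADS] = { T2, T3, T1 };

/* wf[t] is the thread t waits for, or -1 if t is not blocked.
 * Deadlock iff following wf from some thread leads back to it. */
static bool has_cycle(const int *wf, int start)
{
	int cur = wf[start];
	for (int steps = 0; steps < NTHREADS && cur >= 0; steps++) {
		if (cur == start)
			return true;
		cur = wf[cur];
	}
	return false;
}
```

The key edge is t2 -> t3: without the "readers queue behind a waiting writer" rule of rwsems, t2's read would succeed alongside t1's and the cycle would never close.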

> In fairness to readability, perhaps subsys_mutex should instead be
> subsys_rwsem? It seemed to me to be that calling it "mutex" was
> conventional anyway.
>

2009-12-09 06:16:48

by Li Zefan

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

Fix "To" and "Cc"..

Li Zefan wrote:
> Ben Blum wrote:
>> On Tue, Dec 08, 2009 at 03:38:43PM +0800, Li Zefan wrote:
>>>> @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
>>>> struct cgroupfs_root *new_root;
>>>>
>>>> /* First find the desired set of subsystems */
>>>> + down_read(&subsys_mutex);
>>> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
>>> held, so here we have:
>>>
>>> down_read(&subsys_mutex);
>>>
>>> down_write(&sb->s_umount);
>>>
>>> On the other hand, sb->s_umount is held before calling kill_sb(),
>>> so when umounting we have:
>>>
>>> down_write(&sb->s_umount);
>>>
>>> down_read(&subsys_mutex);
>> Unless I'm gravely mistaken, you can't have deadlock on an rwsem when
>> it's being taken for reading in both cases? You would have to have at
>> least one of the cases being down_write.
>>
>
> lockdep will warn on this..
>
> And it can really lead to deadlock, though not so obviously:
>
>   thread 1        thread 2        thread 3
> -------------------------------------------
> | read(A)         write(B)
> |
> |                                 write(A)
> |
> |                 read(A)
> |
> | write(B)
> |
>
> t3 is waiting for t1 to release the lock, then t2 tries to
> acquire the A lock for reading, but it has to wait because of t3,
> and t1 has to wait for t2.
>
> Note: a read lock has to wait if a write lock is already
> waiting for the lock.
>
>> In fairness to readability, perhaps subsys_mutex should instead be
>> subsys_rwsem? It seemed to me to be that calling it "mutex" was
>> conventional anyway.
>>

2009-12-09 08:27:49

by Ben Blum

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

On Wed, Dec 09, 2009 at 02:07:53PM +0800, Li Zefan wrote:
> Ben Blum wrote:
> > On Tue, Dec 08, 2009 at 03:38:43PM +0800, Li Zefan wrote:
> >>> @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> >>> struct cgroupfs_root *new_root;
> >>>
> >>> /* First find the desired set of subsystems */
> >>> + down_read(&subsys_mutex);
> >> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
> >> held, so here we have:
> >>
> >> down_read(&subsys_mutex);
> >>
> >> down_write(&sb->s_umount);
> >>
> >> On the other hand, sb->s_umount is held before calling kill_sb(),
> >> so when umounting we have:
> >>
> >> down_write(&sb->s_umount);
> >>
> >> down_read(&subsys_mutex);
> >
> > Unless I'm gravely mistaken, you can't have deadlock on an rwsem when
> > it's being taken for reading in both cases? You would have to have at
> > least one of the cases being down_write.
> >
>
> lockdep will warn on this..

Hm. Why did I not see this warning...?

> And it can really lead to deadlock, though not so obviously:
>
>   thread 1        thread 2        thread 3
> -------------------------------------------
> | read(A)         write(B)
> |
> |                                 write(A)
> |
> |                 read(A)
> |
> | write(B)
> |
>
> t3 is waiting for t1 to release the lock, then t2 tries to
> acquire the A lock for reading, but it has to wait because of t3,
> and t1 has to wait for t2.
>
> Note: a read lock has to wait if a write lock is already
> waiting for the lock.

Okay, clever, the deadlock happens because of a behavioural optimization
of the rwsems. Good catch on the whole issue.

How does this sound as a possible solution, in cgroup_get_sb:

1) Take subsys_mutex
2) Call parse_cgroupfs_options()
3) Drop subsys_mutex
4) Call sget(), which gets sb->s_umount without subsys_mutex held
5) Take subsys_mutex
6) Call verify_cgroupfs_options()
7) Proceed as normal

In which verify_cgroupfs_options will be a new function that ensures the
invariants that rebind_subsystems expects are still there; if not, bail
out by jumping to drop_new_super just as if parse_cgroupfs_options had
failed in the first place.
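A miniature of what that verify step might look like, using a hypothetical generation counter bumped on every subsystem load/unload (none of these names exist in the patches; it is one way to detect that the table changed while the lock was dropped):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy registry guarded by a lock (elided: this model is single-threaded).
 * 'generation' bumps on every load/unload, so a parse result can be
 * revalidated after the lock was dropped and retaken. */
static unsigned long generation;
static bool loaded[4];

struct parsed_opts {
	unsigned long bits;	/* which subsystems were requested */
	unsigned long seen_gen;	/* generation at parse time */
};

static void parse_options(struct parsed_opts *opts, unsigned long bits)
{
	opts->bits = bits;	/* pretend the string parse chose these */
	opts->seen_gen = generation;
}

/* Step 6 of the proposal: after re-acquiring the lock, either nothing
 * changed, or every requested subsystem must still be present. */
static bool verify_options(const struct parsed_opts *opts)
{
	if (opts->seen_gen == generation)
		return true;
	for (int i = 0; i < 4; i++)
		if ((opts->bits & (1UL << i)) && !loaded[i])
			return false;
	return true;
}
```

A failed verify would then jump to drop_new_super exactly as a parse failure does.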

Another question: What's the justification for having an interface of
seemingly symmetrical "initialize" and "destroy" functions, one of which
has to take a lock and the other gets called with the lock already held?
Seems like it's asking for trouble.

-- bblum

2009-12-09 08:37:13

by Ben Blum

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

On Tue, Dec 08, 2009 at 03:38:43PM +0800, Li Zefan wrote:
> > @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> > struct cgroupfs_root *new_root;
> >
> > /* First find the desired set of subsystems */
> > + down_read(&subsys_mutex);
>
> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
> held, so here we have:
>
> down_read(&subsys_mutex);
>
> down_write(&sb->s_umount);
>
> On the other hand, sb->s_umount is held before calling kill_sb(),
> so when umounting we have:
>
> down_write(&sb->s_umount);
>
> down_read(&subsys_mutex);
>
> > ret = parse_cgroupfs_options(data, &opts);
> > if (ret)
> > goto out_err;
> > @@ -1367,6 +1401,8 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> > free_cg_links(&tmp_cg_links);
> > goto drop_new_super;
> > }
> > + /* done with subsys stuff and no other failure case */
> > + up_read(&subsys_mutex);
> >
> > /* EBUSY should be the only error here */
> > BUG_ON(ret);
> > @@ -1415,6 +1451,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
> > drop_new_super:
> > deactivate_locked_super(sb);
> > out_err:
> > + up_read(&subsys_mutex);
> > kfree(opts.release_agent);
> > kfree(opts.name);

There's another problem here. deactivate_locked_super() calls kill_sb(),
which tries to down_read(&subsys_mutex), and rwsems can't recurse. This
can be fixed by just moving the up_read(&subsys_mutex); at the end to
before each 'goto' in each error case, though, so not a big deal.

-- bblum

2009-12-10 03:18:28

by Li Zefan

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

>>>>> @@ -1291,6 +1324,7 @@ static int cgroup_get_sb(struct file_system_type *fs_type,
>>>>> struct cgroupfs_root *new_root;
>>>>>
>>>>> /* First find the desired set of subsystems */
>>>>> + down_read(&subsys_mutex);
>>>> Hmm.. this can lead to deadlock. sget() returns success with sb->s_umount
>>>> held, so here we have:
>>>>
>>>> down_read(&subsys_mutex);
>>>>
>>>> down_write(&sb->s_umount);
>>>>
>>>> On the other hand, sb->s_umount is held before calling kill_sb(),
>>>> so when umounting we have:
>>>>
>>>> down_write(&sb->s_umount);
>>>>
>>>> down_read(&subsys_mutex);
>>> Unless I'm gravely mistaken, you can't have deadlock on an rwsem when
>>> it's being taken for reading in both cases? You would have to have at
>>> least one of the cases being down_write.
>>>
>> lockdep will warn on this..
>
> Hm. Why did I not see this warning...?
>

Because you haven't triggered it. ;)

The scripts below can trigger the warning (at least for me):

# cat test1.sh
#!/bin/bash
for ((; ;))
{
	mount -t cgroup -o devices xxx /cgroup1
	umount /cgroup1
}

# cat test2.sh
#!/bin/bash
for ((; ;))
{
	mount -t cgroup -o devices xxx /cgroup2
	umount /cgroup2
}


>> And it can really lead to deadlock, though not so obviously:
>>
>>   thread 1        thread 2        thread 3
>> -------------------------------------------
>> | read(A)         write(B)
>> |
>> |                                 write(A)
>> |
>> |                 read(A)
>> |
>> | write(B)
>> |
>>
>> t3 is waiting for t1 to release the lock, then t2 tries to
>> acquire the A lock for reading, but it has to wait because of t3,
>> and t1 has to wait for t2.
>>
>> Note: a read lock has to wait if a write lock is already
>> waiting for the lock.
>
> Okay, clever, the deadlock happens because of a behavioural optimization
> of the rwsems. Good catch on the whole issue.
>
> How does this sound as a possible solution, in cgroup_get_sb:
>
> 1) Take subsys_mutex
> 2) Call parse_cgroupfs_options()
> 3) Drop subsys_mutex
> 4) Call sget(), which gets sb->s_umount without subsys_mutex held
> 5) Take subsys_mutex
> 6) Call verify_cgroupfs_options()
> 7) Proceed as normal
>
> In which verify_cgroupfs_options will be a new function that ensures the
> invariants that rebind_subsystems expects are still there; if not, bail
> out by jumping to drop_new_super just as if parse_cgroupfs_options had
> failed in the first place.
>

The current code doesn't need this verify_cgroupfs_options, so why would
it become necessary? I think what we need is to grab the module refcnt in
parse_cgroupfs_options, and then we can drop subsys_mutex.

But why are you using a rw semaphore? I think a mutex is fine.
And why not just use cgroup_mutex to protect the subsys[] array?
The adding and spreading of subsys_mutex looks ugly to me.

> Another question: What's the justification for having an interface of
> seemingly symmetrical "initialize" and "destroy" functions, one of which
> has to take a lock and the other gets called with the lock already held?
> Seems like it's asking for trouble.
>

Are you refering to get_sb() and kill_sb()? VFS is not my area, so I'm
not going to judge it. ;)

2009-12-10 05:19:33

by Ben Blum

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

On Thu, Dec 10, 2009 at 11:18:06AM +0800, Li Zefan wrote:
> >> And it can really lead to deadlock, though not so obviously:
> >>
> >>   thread 1        thread 2        thread 3
> >> -------------------------------------------
> >> | read(A)         write(B)
> >> |
> >> |                                 write(A)
> >> |
> >> |                 read(A)
> >> |
> >> | write(B)
> >> |
> >>
> >> t3 is waiting for t1 to release the lock, then t2 tries to
> >> acquire the A lock for reading, but it has to wait because of t3,
> >> and t1 has to wait for t2.
> >>
> >> Note: a read lock has to wait if a write lock is already
> >> waiting for the lock.
> >
> > Okay, clever, the deadlock happens because of a behavioural optimization
> > of the rwsems. Good catch on the whole issue.
> >
> > How does this sound as a possible solution, in cgroup_get_sb:
> >
> > 1) Take subsys_mutex
> > 2) Call parse_cgroupfs_options()
> > 3) Drop subsys_mutex
> > 4) Call sget(), which gets sb->s_umount without subsys_mutex held
> > 5) Take subsys_mutex
> > 6) Call verify_cgroupfs_options()
> > 7) Proceed as normal
> >
> > In which verify_cgroupfs_options will be a new function that ensures the
> > invariants that rebind_subsystems expects are still there; if not, bail
> > out by jumping to drop_new_super just as if parse_cgroupfs_options had
> > failed in the first place.
> >
>
> The current code doesn't need this verify_cgroupfs_options, so why it
> will become necessary? I think what we need is grab module refcnt in
> parse_cgroupfs_options, and then we can drop subsys_mutex.

Oh, good point. I thought pinning the modules had to happen in rebinding
since there's a case where rebind_subsystems is called without parsing,
but that's just in kill_sb where no new subsystems are added. So, better
would be to make sure we can't get owned while we drop the lock instead
of checking afterwards if we got owned and bailing if so.

> But why you are using a rw semaphore? I think a mutex is fine.

The "most of cgroups wants to look at the subsys array" versus "module
loading/unloading modifies the array" is clearly a readers/writers case.

> And why not just use cgroup_mutex to protect the subsys[] array?
> The adding and spreading of subsys_mutex looks ugly to me.

The reasoning for this is that there are various chunks of code that
need to be protected by a mutex guarding subsys[] that aren't already
under cgroup_mutex - like parse_cgroupfs_options, or the first stage
of cgroup_load_subsys. Do you think those critical sections are small
enough that sacrificing reentrancy for simplicity of code is worth it?

2009-12-10 06:00:51

by Li Zefan

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

>>> How does this sound as a possible solution, in cgroup_get_sb:
>>>
>>> 1) Take subsys_mutex
>>> 2) Call parse_cgroupfs_options()
>>> 3) Drop subsys_mutex
>>> 4) Call sget(), which gets sb->s_umount without subsys_mutex held
>>> 5) Take subsys_mutex
>>> 6) Call verify_cgroupfs_options()
>>> 7) Proceed as normal
>>>
>>> In which verify_cgroupfs_options will be a new function that ensures the
>>> invariants that rebind_subsystems expects are still there; if not, bail
>>> out by jumping to drop_new_super just as if parse_cgroupfs_options had
>>> failed in the first place.
>>>
>> The current code doesn't need this verify_cgroupfs_options, so why it
>> will become necessary? I think what we need is grab module refcnt in
>> parse_cgroupfs_options, and then we can drop subsys_mutex.
>
> Oh, good point. I thought pinning the modules had to happen in rebinding
> since there's a case where rebind_subsystems is called without parsing,
> but that's just in kill_sb where no new subsystems are added. So, better
> would be to make sure we can't get owned while we drop the lock instead
> of checking afterwards if we got owned and bailing if so.
>
>> But why you are using a rw semaphore? I think a mutex is fine.
>
> The "most of cgroups wants to look at the subsys array" versus "module
> loading/unloading modifies the array" is clearly a readers/writers case.
>

Yes, but that doesn't mean we should use a rw lock, or that a rw
semaphore is preferable to a plain mutex.

- the read side of subsys_mutex is mainly at mount/remount/umount,
and the write side is in cgroup_load_subsys() and cgroup_unload_subsys().
Neither is in a critical path.

- In most callsites, cgroup_mutex is held just after acquiring
subsys_mutex.

So what does it gain us to use this rw_sem?

>> And why not just use cgroup_mutex to protect the subsys[] array?
>> The adding and spreading of subsys_mutex looks ugly to me.
>
> The reasoning for this is that there are various chunks of code that
> need to be protected by a mutex guarding subsys[] that aren't already
> under cgroup_mutex - like parse_cgroupfs_options, or the first stage
> of cgroup_load_subsys. Do you think those critical sections are small
> enough that sacrificing reentrancy for simplicity of code is worth it?
>

Except for parse_cgroupfs_options(), which is called without cgroup_mutex
held, in all other callsites cgroup_mutex is held right after acquiring
subsys_mutex.

So yes, I don't think using cgroup_mutex will harm scalability.

In contrast, this subsys_mutex is quite ugly and deadlock-prone.
For example, see this:

static int cgroup_remount(struct super_block *sb, int *flags, char *data)
{
...
lock_kernel();
mutex_lock(&cgrp->dentry->d_inode->i_mutex);
down_read(&subsys_mutex);
mutex_lock(&cgroup_mutex);
...
}

Four locks here!

2009-12-10 06:13:52

by Ben Blum

Subject: Re: [RFC] [PATCH 1/5] cgroups: revamp subsys array

On Thu, Dec 10, 2009 at 02:00:29PM +0800, Li Zefan wrote:
> Yes, but it doesn't mean we should use rw lock or rw semaphore is
> preferable than plain mutex.
>
> - the read side of subsys_mutex is mainly at mount/remount/umount,
> the write side is in cgroup_load_subsys() and cgroup_unload_subsys().
> None is in critical path.
>
> - In most callsites, cgroup_mutex is held just after acquiring
> subsys_mutex.
>
> So what does it gain us to use this rw_sem?
>
> >> And why not just use cgroup_mutex to protect the subsys[] array?
> >> The adding and spreading of subsys_mutex looks ugly to me.
> >
> > The reasoning for this is that there are various chunks of code that
> > need to be protected by a mutex guarding subsys[] that aren't already
> > under cgroup_mutex - like parse_cgroupfs_options, or the first stage
> > of cgroup_load_subsys. Do you think those critical sections are small
> > enough that sacrificing reentrancy for simplicity of code is worth it?
> >
>
> Except parse_cgroupfs_options() which is called without cgroup_mutex
> held, in all other callsites, cgroup_mutex is held right after acquiring
> subsys_mutex.
>
> So yes, I don't think use cgroup_mutex will harm scalibility.
>
> In contrast, this subsys_mutex is quite ugly and deadlock-prone.

To be fair, the deadlock case would still be just as much of a worry
if we were protecting subsys[] with cgroup_mutex instead.

> For example, see this:
>
> static int cgroup_remount(struct super_block *sb, int *flags, char *data)
> {
> ...
> lock_kernel();
> mutex_lock(&cgrp->dentry->d_inode->i_mutex);
> down_read(&subsys_mutex);
> mutex_lock(&cgroup_mutex);
> ...
> }
>
> Four locks here!
>
>

If we really don't care about performance in those places, it doesn't
really gain anything. When I presented the previous patch series, one
guy (at Google) asked me if anything could be done about the "big ugly
cgroup_mutex", and said he wished less of cgroups sat under one single
lock. So I wrote this with that in mind...

Anyway, I think the best solution here is - whichever lock ends up being
used for this - to have parse_cgroupfs_options and rebind_subsystems each
take the lock when they are called and drop it before returning, instead
of having their calling functions hold the lock between calls. The
division of labour would be that parse_cgroupfs_options takes the module
refcounts, and that rebind_subsystems drops them.
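That division of labour can be modeled in userspace with a toy stand-in for try_module_get/module_put (all names here are illustrative, not the kernel API): the parse step pins every requested module or pins none, and the rebind step consumes the pins on its way out.

```c
#include <stdbool.h>

/* Toy module: 'live' mirrors whether the module is still loaded. */
struct mod {
	int refcnt;
	bool live;
};

static bool try_get(struct mod *m)
{
	if (!m->live)
		return false;
	m->refcnt++;
	return true;
}

static void put(struct mod *m)
{
	m->refcnt--;
}

/* parse_cgroupfs_options analog: pin all requested modules, or none.
 * On failure, the pins already taken are rolled back. */
static bool parse_pin(struct mod **mods, int n)
{
	for (int i = 0; i < n; i++) {
		if (!try_get(mods[i])) {
			while (--i >= 0)
				put(mods[i]);
			return false;
		}
	}
	return true;
}

/* rebind_subsystems analog: consumes the pins taken by parse_pin. */
static void rebind_unpin(struct mod **mods, int n)
{
	for (int i = 0; i < n; i++)
		put(mods[i]);
}
```

With the pins held across the unlocked window, no subsystem named in the parsed options can disappear before rebinding, which is what makes dropping the lock between parse and sget safe.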