2013-06-12 21:03:27

by Tejun Heo

Subject: [PATCHSET cgroup/for-3.11] cgroup: convert cgroup_subsys_state refcnt to percpu_ref

Hello,

This patchset does a lot of cleanup and then updates css
(cgroup_subsys_state) refcnt to use the new percpu reference counter.
A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have a
significant adverse impact on scalability. For example, the css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For high-IOPS configurations which try to do as much per-cpu as
possible, the frequent global refcnting can be very expensive.

In general, given the hugely diverse paths css's end up being used
from, refcnting needs to be cheap and highly scalable. In its usage,
css refcnting isn't very different from module refcnting.
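
For reference, below is a minimal sketch of the percpu_ref pattern this
series converts css refcnting to. The struct and function names are
made up for illustration; the API is the one from the
percpu/review-percpu-ref-tryget branch referenced below (later kernels
add extra arguments to percpu_ref_init()).

#include <linux/kernel.h>
#include <linux/percpu-refcount.h>
#include <linux/slab.h>

struct foo {
        struct percpu_ref ref;
        /* ... payload ... */
};

/* invoked once the last reference has been dropped */
static void foo_release(struct percpu_ref *ref)
{
        struct foo *foo = container_of(ref, struct foo, ref);

        kfree(foo);
}

static struct foo *foo_create(void)
{
        struct foo *foo = kzalloc(sizeof(*foo), GFP_KERNEL);

        if (!foo)
                return NULL;
        /* starts with one base reference held by the creator */
        if (percpu_ref_init(&foo->ref, foo_release)) {
                kfree(foo);
                return NULL;
        }
        return foo;
}

static void foo_hot_path(struct foo *foo)
{
        percpu_ref_get(&foo->ref);      /* per-cpu increment, no shared cacheline */
        /* ... */
        percpu_ref_put(&foo->ref);      /* per-cpu decrement */
}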

This patchset contains the following 11 patches.

0001-cgroup-remove-now-unused-css_depth.patch
0002-cgroup-consistently-use-cset-for-struct-css_set-vari.patch
0003-cgroup-bring-some-sanity-to-naming-around-cg_cgroup_.patch
0004-cgroup-use-kzalloc-and-list_del_init.patch
0005-cgroup-clean-up-css_-try-get-and-css_put.patch
0006-cgroup-rename-CGRP_REMOVED-to-CGRP_DEAD.patch
0007-cgroup-drop-unnecessary-RCU-dancing-from-__put_css_s.patch
0008-cgroup-remove-cgroup-count-and-use.patch
0009-cgroup-reorder-the-operations-in-cgroup_destroy_lock.patch
0010-cgroup-split-cgroup-destruction-into-two-steps.patch
0011-cgroup-use-percpu-refcnt-for-cgroup_subsys_states.patch

0001-0007 are cleanups, many of them cosmetic. They clean up the code
paths related to the refcnting changes. If you're only interested in
the percpu usage, you can probably skip these.

0008-0010 update the cgroup destruction path so that percpu refcnting
can be used.

0011 updates css refcnting to use percpu_ref.

This patchset is on top of

cgroup/for-3.11 d5c56ced77 ("cgroup: clean up the cftype array for the base cgroup files")
+ [1] percpu/review-percpu-ref-tryget

and available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-css-percpu-ref

diffstat follows. Thanks.

include/linux/cgroup.h | 87 ++----
kernel/cgroup.c | 710 +++++++++++++++++++++++++------------------------
2 files changed, 406 insertions(+), 391 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel/1507258


2013-06-12 21:03:38

by Tejun Heo

Subject: [PATCH 02/11] cgroup: consistently use @cset for struct css_set variables

cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but it's made much worse by the fact that
there are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion: @[old_]cg are struct
css_set while @cg[12] are struct cgroup.

It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough. Let's give it some sanity by
uniformly using @cset for all struct css_set variables.

* s/cg/cset/ for all css_set variables.

* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones
prefixed with "new".

* s/cg/cgrp/ for cgroup variables in compare_css_sets().

* s/css/cset/ for the misnamed css_set variable in task_cgroup_from_root().

* Blank-line adjustments.

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 216 ++++++++++++++++++++++++++++----------------------------
1 file changed, 109 insertions(+), 107 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index dc8997a..e7f6845 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -376,30 +376,32 @@ static unsigned long css_set_hash(struct cgroup_subsys_state *css[])
* compiled into their kernel but not actually in use */
static int use_task_css_set_links __read_mostly;

-static void __put_css_set(struct css_set *cg, int taskexit)
+static void __put_css_set(struct css_set *cset, int taskexit)
{
struct cg_cgroup_link *link;
struct cg_cgroup_link *saved_link;
+
/*
* Ensure that the refcount doesn't hit zero while any readers
* can see it. Similar to atomic_dec_and_lock(), but for an
* rwlock
*/
- if (atomic_add_unless(&cg->refcount, -1, 1))
+ if (atomic_add_unless(&cset->refcount, -1, 1))
return;
write_lock(&css_set_lock);
- if (!atomic_dec_and_test(&cg->refcount)) {
+ if (!atomic_dec_and_test(&cset->refcount)) {
write_unlock(&css_set_lock);
return;
}

/* This css_set is dead. unlink it and release cgroup refcounts */
- hash_del(&cg->hlist);
+ hash_del(&cset->hlist);
css_set_count--;

- list_for_each_entry_safe(link, saved_link, &cg->cg_links,
+ list_for_each_entry_safe(link, saved_link, &cset->cg_links,
cg_link_list) {
struct cgroup *cgrp = link->cgrp;
+
list_del(&link->cg_link_list);
list_del(&link->cgrp_link_list);

@@ -421,45 +423,45 @@ static void __put_css_set(struct css_set *cg, int taskexit)
}

write_unlock(&css_set_lock);
- kfree_rcu(cg, rcu_head);
+ kfree_rcu(cset, rcu_head);
}

/*
* refcounted get/put for css_set objects
*/
-static inline void get_css_set(struct css_set *cg)
+static inline void get_css_set(struct css_set *cset)
{
- atomic_inc(&cg->refcount);
+ atomic_inc(&cset->refcount);
}

-static inline void put_css_set(struct css_set *cg)
+static inline void put_css_set(struct css_set *cset)
{
- __put_css_set(cg, 0);
+ __put_css_set(cset, 0);
}

-static inline void put_css_set_taskexit(struct css_set *cg)
+static inline void put_css_set_taskexit(struct css_set *cset)
{
- __put_css_set(cg, 1);
+ __put_css_set(cset, 1);
}

/*
* compare_css_sets - helper function for find_existing_css_set().
- * @cg: candidate css_set being tested
- * @old_cg: existing css_set for a task
+ * @cset: candidate css_set being tested
+ * @old_cset: existing css_set for a task
* @new_cgrp: cgroup that's being entered by the task
* @template: desired set of css pointers in css_set (pre-calculated)
*
* Returns true if "cg" matches "old_cg" except for the hierarchy
* which "new_cgrp" belongs to, for which it should match "new_cgrp".
*/
-static bool compare_css_sets(struct css_set *cg,
- struct css_set *old_cg,
+static bool compare_css_sets(struct css_set *cset,
+ struct css_set *old_cset,
struct cgroup *new_cgrp,
struct cgroup_subsys_state *template[])
{
struct list_head *l1, *l2;

- if (memcmp(template, cg->subsys, sizeof(cg->subsys))) {
+ if (memcmp(template, cset->subsys, sizeof(cset->subsys))) {
/* Not all subsystems matched */
return false;
}
@@ -473,28 +475,28 @@ static bool compare_css_sets(struct css_set *cg,
* candidates.
*/

- l1 = &cg->cg_links;
- l2 = &old_cg->cg_links;
+ l1 = &cset->cg_links;
+ l2 = &old_cset->cg_links;
while (1) {
struct cg_cgroup_link *cgl1, *cgl2;
- struct cgroup *cg1, *cg2;
+ struct cgroup *cgrp1, *cgrp2;

l1 = l1->next;
l2 = l2->next;
/* See if we reached the end - both lists are equal length. */
- if (l1 == &cg->cg_links) {
- BUG_ON(l2 != &old_cg->cg_links);
+ if (l1 == &cset->cg_links) {
+ BUG_ON(l2 != &old_cset->cg_links);
break;
} else {
- BUG_ON(l2 == &old_cg->cg_links);
+ BUG_ON(l2 == &old_cset->cg_links);
}
/* Locate the cgroups associated with these links. */
cgl1 = list_entry(l1, struct cg_cgroup_link, cg_link_list);
cgl2 = list_entry(l2, struct cg_cgroup_link, cg_link_list);
- cg1 = cgl1->cgrp;
- cg2 = cgl2->cgrp;
+ cgrp1 = cgl1->cgrp;
+ cgrp2 = cgl2->cgrp;
/* Hierarchies should be linked in the same order. */
- BUG_ON(cg1->root != cg2->root);
+ BUG_ON(cgrp1->root != cgrp2->root);

/*
* If this hierarchy is the hierarchy of the cgroup
@@ -503,11 +505,11 @@ static bool compare_css_sets(struct css_set *cg,
* hierarchy, then this css_set should point to the
* same cgroup as the old css_set.
*/
- if (cg1->root == new_cgrp->root) {
- if (cg1 != new_cgrp)
+ if (cgrp1->root == new_cgrp->root) {
+ if (cgrp1 != new_cgrp)
return false;
} else {
- if (cg1 != cg2)
+ if (cgrp1 != cgrp2)
return false;
}
}
@@ -527,14 +529,13 @@ static bool compare_css_sets(struct css_set *cg,
* template: location in which to build the desired set of subsystem
* state objects for the new cgroup group
*/
-static struct css_set *find_existing_css_set(
- struct css_set *oldcg,
- struct cgroup *cgrp,
- struct cgroup_subsys_state *template[])
+static struct css_set *find_existing_css_set(struct css_set *old_cset,
+ struct cgroup *cgrp,
+ struct cgroup_subsys_state *template[])
{
int i;
struct cgroupfs_root *root = cgrp->root;
- struct css_set *cg;
+ struct css_set *cset;
unsigned long key;

/*
@@ -551,17 +552,17 @@ static struct css_set *find_existing_css_set(
} else {
/* Subsystem is not in this hierarchy, so we
* don't want to change the subsystem state */
- template[i] = oldcg->subsys[i];
+ template[i] = old_cset->subsys[i];
}
}

key = css_set_hash(template);
- hash_for_each_possible(css_set_table, cg, hlist, key) {
- if (!compare_css_sets(cg, oldcg, cgrp, template))
+ hash_for_each_possible(css_set_table, cset, hlist, key) {
+ if (!compare_css_sets(cset, old_cset, cgrp, template))
continue;

/* This css_set matches what we need */
- return cg;
+ return cset;
}

/* No existing cgroup group matched */
@@ -603,18 +604,18 @@ static int allocate_cg_links(int count, struct list_head *tmp)
/**
* link_css_set - a helper function to link a css_set to a cgroup
* @tmp_cg_links: cg_cgroup_link objects allocated by allocate_cg_links()
- * @cg: the css_set to be linked
+ * @cset: the css_set to be linked
* @cgrp: the destination cgroup
*/
static void link_css_set(struct list_head *tmp_cg_links,
- struct css_set *cg, struct cgroup *cgrp)
+ struct css_set *cset, struct cgroup *cgrp)
{
struct cg_cgroup_link *link;

BUG_ON(list_empty(tmp_cg_links));
link = list_first_entry(tmp_cg_links, struct cg_cgroup_link,
cgrp_link_list);
- link->cg = cg;
+ link->cg = cset;
link->cgrp = cgrp;
atomic_inc(&cgrp->count);
list_move(&link->cgrp_link_list, &cgrp->css_sets);
@@ -622,7 +623,7 @@ static void link_css_set(struct list_head *tmp_cg_links,
* Always add links to the tail of the list so that the list
* is sorted by order of hierarchy creation
*/
- list_add_tail(&link->cg_link_list, &cg->cg_links);
+ list_add_tail(&link->cg_link_list, &cset->cg_links);
}

/*
@@ -632,10 +633,10 @@ static void link_css_set(struct list_head *tmp_cg_links,
* substituted into the appropriate hierarchy. Must be called with
* cgroup_mutex held
*/
-static struct css_set *find_css_set(
- struct css_set *oldcg, struct cgroup *cgrp)
+static struct css_set *find_css_set(struct css_set *old_cset,
+ struct cgroup *cgrp)
{
- struct css_set *res;
+ struct css_set *cset;
struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];

struct list_head tmp_cg_links;
@@ -646,40 +647,40 @@ static struct css_set *find_css_set(
/* First see if we already have a cgroup group that matches
* the desired set */
read_lock(&css_set_lock);
- res = find_existing_css_set(oldcg, cgrp, template);
- if (res)
- get_css_set(res);
+ cset = find_existing_css_set(old_cset, cgrp, template);
+ if (cset)
+ get_css_set(cset);
read_unlock(&css_set_lock);

- if (res)
- return res;
+ if (cset)
+ return cset;

- res = kmalloc(sizeof(*res), GFP_KERNEL);
- if (!res)
+ cset = kmalloc(sizeof(*cset), GFP_KERNEL);
+ if (!cset)
return NULL;

/* Allocate all the cg_cgroup_link objects that we'll need */
if (allocate_cg_links(root_count, &tmp_cg_links) < 0) {
- kfree(res);
+ kfree(cset);
return NULL;
}

- atomic_set(&res->refcount, 1);
- INIT_LIST_HEAD(&res->cg_links);
- INIT_LIST_HEAD(&res->tasks);
- INIT_HLIST_NODE(&res->hlist);
+ atomic_set(&cset->refcount, 1);
+ INIT_LIST_HEAD(&cset->cg_links);
+ INIT_LIST_HEAD(&cset->tasks);
+ INIT_HLIST_NODE(&cset->hlist);

/* Copy the set of subsystem state objects generated in
* find_existing_css_set() */
- memcpy(res->subsys, template, sizeof(res->subsys));
+ memcpy(cset->subsys, template, sizeof(cset->subsys));

write_lock(&css_set_lock);
/* Add reference counts and links from the new css_set. */
- list_for_each_entry(link, &oldcg->cg_links, cg_link_list) {
+ list_for_each_entry(link, &old_cset->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
if (c->root == cgrp->root)
c = cgrp;
- link_css_set(&tmp_cg_links, res, c);
+ link_css_set(&tmp_cg_links, cset, c);
}

BUG_ON(!list_empty(&tmp_cg_links));
@@ -687,12 +688,12 @@ static struct css_set *find_css_set(
css_set_count++;

/* Add this cgroup group to the hash table */
- key = css_set_hash(res->subsys);
- hash_add(css_set_table, &res->hlist, key);
+ key = css_set_hash(cset->subsys);
+ hash_add(css_set_table, &cset->hlist, key);

write_unlock(&css_set_lock);

- return res;
+ return cset;
}

/*
@@ -702,7 +703,7 @@ static struct css_set *find_css_set(
static struct cgroup *task_cgroup_from_root(struct task_struct *task,
struct cgroupfs_root *root)
{
- struct css_set *css;
+ struct css_set *cset;
struct cgroup *res = NULL;

BUG_ON(!mutex_is_locked(&cgroup_mutex));
@@ -712,12 +713,12 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
* task can't change groups, so the only thing that can happen
* is that it exits and its css is set back to init_css_set.
*/
- css = task->cgroups;
- if (css == &init_css_set) {
+ cset = task->cgroups;
+ if (cset == &init_css_set) {
res = &root->top_cgroup;
} else {
struct cg_cgroup_link *link;
- list_for_each_entry(link, &css->cg_links, cg_link_list) {
+ list_for_each_entry(link, &cset->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
if (c->root == root) {
res = c;
@@ -1608,7 +1609,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
struct cgroupfs_root *existing_root;
const struct cred *cred;
int i;
- struct css_set *cg;
+ struct css_set *cset;

BUG_ON(sb->s_root != NULL);

@@ -1666,8 +1667,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
/* Link the top cgroup in this hierarchy into all
* the css_set objects */
write_lock(&css_set_lock);
- hash_for_each(css_set_table, i, cg, hlist)
- link_css_set(&tmp_cg_links, cg, root_cgrp);
+ hash_for_each(css_set_table, i, cset, hlist)
+ link_css_set(&tmp_cg_links, cset, root_cgrp);
write_unlock(&css_set_lock);

free_cg_links(&tmp_cg_links);
@@ -1947,10 +1948,11 @@ EXPORT_SYMBOL_GPL(cgroup_taskset_size);
*
* Must be called with cgroup_mutex and threadgroup locked.
*/
-static void cgroup_task_migrate(struct cgroup *oldcgrp,
- struct task_struct *tsk, struct css_set *newcg)
+static void cgroup_task_migrate(struct cgroup *old_cgrp,
+ struct task_struct *tsk,
+ struct css_set *new_cset)
{
- struct css_set *oldcg;
+ struct css_set *old_cset;

/*
* We are synchronized through threadgroup_lock() against PF_EXITING
@@ -1958,25 +1960,25 @@ static void cgroup_task_migrate(struct cgroup *oldcgrp,
* css_set to init_css_set and dropping the old one.
*/
WARN_ON_ONCE(tsk->flags & PF_EXITING);
- oldcg = tsk->cgroups;
+ old_cset = tsk->cgroups;

task_lock(tsk);
- rcu_assign_pointer(tsk->cgroups, newcg);
+ rcu_assign_pointer(tsk->cgroups, new_cset);
task_unlock(tsk);

/* Update the css_set linked lists if we're using them */
write_lock(&css_set_lock);
if (!list_empty(&tsk->cg_list))
- list_move(&tsk->cg_list, &newcg->tasks);
+ list_move(&tsk->cg_list, &new_cset->tasks);
write_unlock(&css_set_lock);

/*
- * We just gained a reference on oldcg by taking it from the task. As
- * trading it for newcg is protected by cgroup_mutex, we're safe to drop
- * it here; it will be freed under RCU.
+ * We just gained a reference on old_cset by taking it from the
+ * task. As trading it for new_cset is protected by cgroup_mutex,
+ * we're safe to drop it here; it will be freed under RCU.
*/
- set_bit(CGRP_RELEASABLE, &oldcgrp->flags);
- put_css_set(oldcg);
+ set_bit(CGRP_RELEASABLE, &old_cgrp->flags);
+ put_css_set(old_cset);
}

/**
@@ -2928,7 +2930,7 @@ static void cgroup_advance_iter(struct cgroup *cgrp,
{
struct list_head *l = it->cg_link;
struct cg_cgroup_link *link;
- struct css_set *cg;
+ struct css_set *cset;

/* Advance to the next non-empty css_set */
do {
@@ -2938,10 +2940,10 @@ static void cgroup_advance_iter(struct cgroup *cgrp,
return;
}
link = list_entry(l, struct cg_cgroup_link, cgrp_link_list);
- cg = link->cg;
- } while (list_empty(&cg->tasks));
+ cset = link->cg;
+ } while (list_empty(&cset->tasks));
it->cg_link = l;
- it->task = cg->tasks.next;
+ it->task = cset->tasks.next;
}

/*
@@ -4519,7 +4521,7 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
struct cgroup_subsys_state *css;
int i, ret;
struct hlist_node *tmp;
- struct css_set *cg;
+ struct css_set *cset;
unsigned long key;

/* check name and function validity */
@@ -4586,17 +4588,17 @@ int __init_or_module cgroup_load_subsys(struct cgroup_subsys *ss)
* this is all done under the css_set_lock.
*/
write_lock(&css_set_lock);
- hash_for_each_safe(css_set_table, i, tmp, cg, hlist) {
+ hash_for_each_safe(css_set_table, i, tmp, cset, hlist) {
/* skip entries that we already rehashed */
- if (cg->subsys[ss->subsys_id])
+ if (cset->subsys[ss->subsys_id])
continue;
/* remove existing entry */
- hash_del(&cg->hlist);
+ hash_del(&cset->hlist);
/* set new value */
- cg->subsys[ss->subsys_id] = css;
+ cset->subsys[ss->subsys_id] = css;
/* recompute hash and restore entry */
- key = css_set_hash(cg->subsys);
- hash_add(css_set_table, &cg->hlist, key);
+ key = css_set_hash(cset->subsys);
+ hash_add(css_set_table, &cset->hlist, key);
}
write_unlock(&css_set_lock);

@@ -4656,13 +4658,13 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
*/
write_lock(&css_set_lock);
list_for_each_entry(link, &dummytop->css_sets, cgrp_link_list) {
- struct css_set *cg = link->cg;
+ struct css_set *cset = link->cg;
unsigned long key;

- hash_del(&cg->hlist);
- cg->subsys[ss->subsys_id] = NULL;
- key = css_set_hash(cg->subsys);
- hash_add(css_set_table, &cg->hlist, key);
+ hash_del(&cset->hlist);
+ cset->subsys[ss->subsys_id] = NULL;
+ key = css_set_hash(cset->subsys);
+ hash_add(css_set_table, &cset->hlist, key);
}
write_unlock(&css_set_lock);

@@ -5009,7 +5011,7 @@ void cgroup_post_fork(struct task_struct *child)
*/
void cgroup_exit(struct task_struct *tsk, int run_callbacks)
{
- struct css_set *cg;
+ struct css_set *cset;
int i;

/*
@@ -5026,7 +5028,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)

/* Reassign the task to the init_css_set. */
task_lock(tsk);
- cg = tsk->cgroups;
+ cset = tsk->cgroups;
tsk->cgroups = &init_css_set;

if (run_callbacks && need_forkexit_callback) {
@@ -5039,7 +5041,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)

if (ss->exit) {
struct cgroup *old_cgrp =
- rcu_dereference_raw(cg->subsys[i])->cgroup;
+ rcu_dereference_raw(cset->subsys[i])->cgroup;
struct cgroup *cgrp = task_cgroup(tsk, i);
ss->exit(cgrp, old_cgrp, tsk);
}
@@ -5047,7 +5049,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
}
task_unlock(tsk);

- put_css_set_taskexit(cg);
+ put_css_set_taskexit(cset);
}

static void check_for_release(struct cgroup *cgrp)
@@ -5456,12 +5458,12 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
struct seq_file *seq)
{
struct cg_cgroup_link *link;
- struct css_set *cg;
+ struct css_set *cset;

read_lock(&css_set_lock);
rcu_read_lock();
- cg = rcu_dereference(current->cgroups);
- list_for_each_entry(link, &cg->cg_links, cg_link_list) {
+ cset = rcu_dereference(current->cgroups);
+ list_for_each_entry(link, &cset->cg_links, cg_link_list) {
struct cgroup *c = link->cgrp;
const char *name;

@@ -5486,11 +5488,11 @@ static int cgroup_css_links_read(struct cgroup *cont,

read_lock(&css_set_lock);
list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
- struct css_set *cg = link->cg;
+ struct css_set *cset = link->cg;
struct task_struct *task;
int count = 0;
- seq_printf(seq, "css_set %p\n", cg);
- list_for_each_entry(task, &cg->tasks, cg_list) {
+ seq_printf(seq, "css_set %p\n", cset);
+ list_for_each_entry(task, &cset->tasks, cg_list) {
if (count++ > MAX_TASKS_SHOWN_PER_CSS) {
seq_puts(seq, " ...\n");
break;
--
1.8.2.1

2013-06-12 21:03:46

by Tejun Heo

Subject: [PATCH 06/11] cgroup: rename CGRP_REMOVED to CGRP_DEAD

We will add another flag indicating that the cgroup is in the process
of being killed. REMOVING / REMOVED would be harder to distinguish,
and cgroup_is_removing()/cgroup_is_removed() are a bit awkward. Also,
later percpu_ref usage will involve "killing" the refcnt.

s/CGRP_REMOVED/CGRP_DEAD/
s/cgroup_is_removed()/cgroup_is_dead()/

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 2 +-
kernel/cgroup.c | 30 ++++++++++++++----------------
2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 7b16882..75a6733 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -143,7 +143,7 @@ static inline void css_put(struct cgroup_subsys_state *css)
/* bits in struct cgroup flags field */
enum {
/* Control Group is dead */
- CGRP_REMOVED,
+ CGRP_DEAD,
/*
* Control Group has previously had a child cgroup or a task,
* but no longer (only if CGRP_NOTIFY_ON_RELEASE is set)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 585f0a5..0343df5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -226,9 +226,9 @@ static int css_refcnt(struct cgroup_subsys_state *css)
}

/* convenient tests for these bits */
-static inline bool cgroup_is_removed(const struct cgroup *cgrp)
+static inline bool cgroup_is_dead(const struct cgroup *cgrp)
{
- return test_bit(CGRP_REMOVED, &cgrp->flags);
+ return test_bit(CGRP_DEAD, &cgrp->flags);
}

/**
@@ -300,7 +300,7 @@ static inline struct cftype *__d_cft(struct dentry *dentry)
static bool cgroup_lock_live_group(struct cgroup *cgrp)
{
mutex_lock(&cgroup_mutex);
- if (cgroup_is_removed(cgrp)) {
+ if (cgroup_is_dead(cgrp)) {
mutex_unlock(&cgroup_mutex);
return false;
}
@@ -892,7 +892,7 @@ static void cgroup_diput(struct dentry *dentry, struct inode *inode)
if (S_ISDIR(inode->i_mode)) {
struct cgroup *cgrp = dentry->d_fsdata;

- BUG_ON(!(cgroup_is_removed(cgrp)));
+ BUG_ON(!(cgroup_is_dead(cgrp)));
call_rcu(&cgrp->rcu_head, cgroup_free_rcu);
} else {
struct cfent *cfe = __d_cfe(dentry);
@@ -2366,7 +2366,7 @@ static ssize_t cgroup_file_write(struct file *file, const char __user *buf,
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);

- if (cgroup_is_removed(cgrp))
+ if (cgroup_is_dead(cgrp))
return -ENODEV;
if (cft->write)
return cft->write(cgrp, cft, file, buf, nbytes, ppos);
@@ -2411,7 +2411,7 @@ static ssize_t cgroup_file_read(struct file *file, char __user *buf,
struct cftype *cft = __d_cft(file->f_dentry);
struct cgroup *cgrp = __d_cgrp(file->f_dentry->d_parent);

- if (cgroup_is_removed(cgrp))
+ if (cgroup_is_dead(cgrp))
return -ENODEV;

if (cft->read)
@@ -2834,7 +2834,7 @@ static void cgroup_cfts_commit(struct cgroup_subsys *ss,

mutex_lock(&inode->i_mutex);
mutex_lock(&cgroup_mutex);
- if (!cgroup_is_removed(cgrp))
+ if (!cgroup_is_dead(cgrp))
cgroup_addrm_files(cgrp, ss, cfts, is_add);
mutex_unlock(&cgroup_mutex);
mutex_unlock(&inode->i_mutex);
@@ -3002,14 +3002,14 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos)
/*
* @pos could already have been removed. Once a cgroup is removed,
* its ->sibling.next is no longer updated when its next sibling
- * changes. As CGRP_REMOVED is set on removal which is fully
+ * changes. As CGRP_DEAD is set on removal which is fully
* serialized, if we see it unasserted, it's guaranteed that the
* next sibling hasn't finished its grace period even if it's
* already removed, and thus safe to dereference from this RCU
* critical section. If ->sibling.next is inaccessible,
- * cgroup_is_removed() is guaranteed to be visible as %true here.
+ * cgroup_is_dead() is guaranteed to be visible as %true here.
*/
- if (likely(!cgroup_is_removed(pos))) {
+ if (likely(!cgroup_is_dead(pos))) {
next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
if (&next->sibling != &pos->parent->children)
return next;
@@ -4386,7 +4386,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
* attempts fail thus maintaining the removal conditions verified
* above.
*
- * Note that CGRP_REMVOED clearing is depended upon by
+ * Note that CGRP_DEAD assertion is depended upon by
* cgroup_next_sibling() to resume iteration after dropping RCU
* read lock. See cgroup_next_sibling() for details.
*/
@@ -4396,7 +4396,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
WARN_ON(atomic_read(&css->refcnt) < 0);
atomic_add(CSS_DEACT_BIAS, &css->refcnt);
}
- set_bit(CGRP_REMOVED, &cgrp->flags);
+ set_bit(CGRP_DEAD, &cgrp->flags);

/* tell subsystems to initate destruction */
for_each_subsys(cgrp->root, ss)
@@ -5066,7 +5066,7 @@ static void check_for_release(struct cgroup *cgrp)
int need_schedule_work = 0;

raw_spin_lock(&release_list_lock);
- if (!cgroup_is_removed(cgrp) &&
+ if (!cgroup_is_dead(cgrp) &&
list_empty(&cgrp->release_list)) {
list_add(&cgrp->release_list, &release_list);
need_schedule_work = 1;
@@ -5212,9 +5212,7 @@ __setup("cgroup_disable=", cgroup_disable);
* Functons for CSS ID.
*/

-/*
- *To get ID other than 0, this should be called when !cgroup_is_removed().
- */
+/* to get ID other than 0, this should be called when !cgroup_is_dead() */
unsigned short css_id(struct cgroup_subsys_state *css)
{
struct css_id *cssid;
--
1.8.2.1

2013-06-12 21:03:42

by Tejun Heo

Subject: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

There's no point in using kmalloc() and list_del() instead of the
clearing variants for trivial stuff; we can live dangerously
elsewhere. Use kzalloc() and list_del_init() instead and drop the
manual zero initializations (see the short illustration below).
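
For illustration only (generic code, not from the patch; the struct
and field names are made up), the difference the patch leans on:

#include <linux/list.h>
#include <linux/slab.h>

struct item {
        struct list_head node;
        int key;
};

static struct item *item_alloc(void)
{
        /* kzalloc() zeroes every field, so no explicit "= 0" inits are needed */
        return kzalloc(sizeof(struct item), GFP_KERNEL);
}

static void item_unlink(struct item *it)
{
        /*
         * list_del() poisons the pointers; list_del_init() instead leaves
         * it->node as a valid empty list, so later list_empty() checks or
         * re-additions won't blow up.
         */
        list_del_init(&it->node);
}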

While at it, do trivial code reorganization in cgroup_file_open().

This patch doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0b5cd0e..585f0a5 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -404,8 +404,8 @@ static void __put_css_set(struct css_set *cset, int taskexit)
list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
struct cgroup *cgrp = link->cgrp;

- list_del(&link->cset_link);
- list_del(&link->cgrp_link);
+ list_del_init(&link->cset_link);
+ list_del_init(&link->cgrp_link);

/*
* We may not be holding cgroup_mutex, and if cgrp->count is
@@ -576,7 +576,7 @@ static void free_cgrp_cset_links(struct list_head *links_to_free)
struct cgrp_cset_link *link, *tmp_link;

list_for_each_entry_safe(link, tmp_link, links_to_free, cset_link) {
- list_del(&link->cset_link);
+ list_del_init(&link->cset_link);
kfree(link);
}
}
@@ -597,7 +597,7 @@ static int allocate_cgrp_cset_links(int count, struct list_head *tmp_links)
INIT_LIST_HEAD(tmp_links);

for (i = 0; i < count; i++) {
- link = kmalloc(sizeof(*link), GFP_KERNEL);
+ link = kzalloc(sizeof(*link), GFP_KERNEL);
if (!link) {
free_cgrp_cset_links(tmp_links);
return -ENOMEM;
@@ -658,7 +658,7 @@ static struct css_set *find_css_set(struct css_set *old_cset,
if (cset)
return cset;

- cset = kmalloc(sizeof(*cset), GFP_KERNEL);
+ cset = kzalloc(sizeof(*cset), GFP_KERNEL);
if (!cset)
return NULL;

@@ -1754,14 +1754,14 @@ static void cgroup_kill_sb(struct super_block *sb) {
write_lock(&css_set_lock);

list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
- list_del(&link->cset_link);
- list_del(&link->cgrp_link);
+ list_del_init(&link->cset_link);
+ list_del_init(&link->cgrp_link);
kfree(link);
}
write_unlock(&css_set_lock);

if (!list_empty(&root->root_list)) {
- list_del(&root->root_list);
+ list_del_init(&root->root_list);
root_count--;
}

@@ -2478,10 +2478,12 @@ static int cgroup_file_open(struct inode *inode, struct file *file)
cft = __d_cft(file->f_dentry);

if (cft->read_map || cft->read_seq_string) {
- struct cgroup_seqfile_state *state =
- kzalloc(sizeof(*state), GFP_USER);
+ struct cgroup_seqfile_state *state;
+
+ state = kzalloc(sizeof(*state), GFP_USER);
if (!state)
return -ENOMEM;
+
state->cft = cft;
state->cgroup = __d_cgrp(file->f_dentry->d_parent);
file->f_op = &cgroup_seqfile_operations;
@@ -3514,7 +3516,7 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
}
}
/* entry not found; create a new one */
- l = kmalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
+ l = kzalloc(sizeof(struct cgroup_pidlist), GFP_KERNEL);
if (!l) {
mutex_unlock(&cgrp->pidlist_mutex);
return l;
@@ -3523,8 +3525,6 @@ static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
down_write(&l->mutex);
l->key.type = type;
l->key.ns = get_pid_ns(ns);
- l->use_count = 0; /* don't increment here */
- l->list = NULL;
l->owner = cgrp;
list_add(&l->links, &cgrp->pidlists);
mutex_unlock(&cgrp->pidlist_mutex);
@@ -3738,7 +3738,7 @@ static void cgroup_release_pid_array(struct cgroup_pidlist *l)
BUG_ON(!l->use_count);
if (!--l->use_count) {
/* we're the last user if refcount is 0; remove and free */
- list_del(&l->links);
+ list_del_init(&l->links);
mutex_unlock(&l->owner->pidlist_mutex);
pidlist_free(l->list);
put_pid_ns(l->key.ns);
--
1.8.2.1

2013-06-12 21:03:51

by Tejun Heo

Subject: [PATCH 08/11] cgroup: remove cgroup->count and use

cgroup->count tracks the number of css_sets associated with the cgroup
and is used only to verify that no css_set is associated when the
cgroup is being destroyed. It's superfluous, as the destruction path
can simply check whether cgroup->cset_links is empty instead.

Drop cgroup->count and check ->cset_links directly from
cgroup_destroy_locked().

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 6 ------
kernel/cgroup.c | 21 +++++----------------
2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 75a6733..5428738 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -169,12 +169,6 @@ struct cgroup_name {
struct cgroup {
unsigned long flags; /* "unsigned long" so bitops work */

- /*
- * count users of this cgroup. >0 means busy, but doesn't
- * necessarily indicate the number of tasks in the cgroup
- */
- atomic_t count;
-
int id; /* ida allocated in-hierarchy ID */

/*
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index efe500b..aefda90 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -408,8 +408,7 @@ static void __put_css_set(struct css_set *cset, int taskexit)
list_del_init(&link->cgrp_link);

/* @cgrp can't go away while we're holding css_set_lock */
- if (atomic_dec_and_test(&cgrp->count) &&
- notify_on_release(cgrp)) {
+ if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
if (taskexit)
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
@@ -616,7 +615,6 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
link->cset = cset;
link->cgrp = cgrp;
- atomic_inc(&cgrp->count);
list_move(&link->cset_link, &cgrp->cset_links);
/*
* Always add links to the tail of the list so that the list
@@ -4373,11 +4371,11 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
lockdep_assert_held(&cgroup_mutex);

/*
- * css_set_lock prevents @cgrp from being removed while
- * __put_css_set() is in progress.
+ * css_set_lock synchronizes access to ->cset_links and prevents
+ * @cgrp from being removed while __put_css_set() is in progress.
*/
read_lock(&css_set_lock);
- empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children);
+ empty = list_empty(&cgrp->cset_links) && list_empty(&cgrp->children);
read_unlock(&css_set_lock);
if (!empty)
return -EBUSY;
@@ -5057,7 +5055,7 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)
static void check_for_release(struct cgroup *cgrp)
{
if (cgroup_is_releasable(cgrp) &&
- !atomic_read(&cgrp->count) && list_empty(&cgrp->children)) {
+ list_empty(&cgrp->cset_links) && list_empty(&cgrp->children)) {
/*
* Control Group is currently removeable. If it's not
* already queued for a userspace notification, queue
@@ -5425,11 +5423,6 @@ static void debug_css_free(struct cgroup *cont)
kfree(cont->subsys[debug_subsys_id]);
}

-static u64 cgroup_refcount_read(struct cgroup *cont, struct cftype *cft)
-{
- return atomic_read(&cont->count);
-}
-
static u64 debug_taskcount_read(struct cgroup *cont, struct cftype *cft)
{
return cgroup_task_count(cont);
@@ -5511,10 +5504,6 @@ static u64 releasable_read(struct cgroup *cgrp, struct cftype *cft)

static struct cftype debug_files[] = {
{
- .name = "cgroup_refcount",
- .read_u64 = cgroup_refcount_read,
- },
- {
.name = "taskcount",
.read_u64 = debug_taskcount_read,
},
--
1.8.2.1

2013-06-12 21:04:02

by Tejun Heo

Subject: [PATCH 11/11] cgroup: use percpu refcnt for cgroup_subsys_states

A css (cgroup_subsys_state) is how each cgroup is represented to a
controller. As such, it can be used in hot paths across the various
subsystems different controllers are associated with.

One of the common operations is reference counting, which up until now
has been implemented using a global atomic counter and can have a
significant adverse impact on scalability. For example, the css refcnt
can be gotten and put multiple times by blkcg for each IO request.
For high-IOPS configurations which try to do as much per-cpu as
possible, the frequent global refcnting can be very expensive.

In general, given the hugely diverse paths css's end up being used
from, refcnting needs to be cheap and highly scalable. In its usage,
css refcnting isn't very different from module refcnting.

This patch converts css refcnting to use the recently added
percpu_ref. css_get/tryget/put() map directly to the matching
percpu_ref operations, and the deactivation logic is no longer
necessary as percpu_ref already supports refcnt killing.

The only complication is that, as the refcnt is per-cpu,
percpu_ref_kill() by itself doesn't ensure that further tryget
operations will fail, which we need to guarantee before invoking the
->css_offline() callbacks. This is resolved by collecting kill
confirmations using percpu_ref_kill_and_confirm() and initiating the
offline phase of destruction only after all css refcnts are confirmed
to be seen as killed on all CPUs. The previous patches already split
destruction into two phases, so percpu_ref_kill_and_confirm() can be
hooked up easily.
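
The guarantee being relied on can be summarized with the following
generic sketch (made-up names, not the cgroup code; the actual hook-up
is in the diff below):

#include <linux/percpu-refcount.h>

struct foo {
        struct percpu_ref ref;
};

/* confirmation callback passed to percpu_ref_kill_and_confirm() */
static void foo_ref_killed(struct percpu_ref *ref)
{
        /*
         * Invoked only once every CPU is guaranteed to see @ref as
         * killed, i.e. all further percpu_ref_tryget() calls on it
         * fail.  This is the point where teardown that must not race
         * with new getters (->css_offline() in cgroup's case) can be
         * kicked off.
         */
}

static void foo_destroy(struct foo *foo)
{
        /* switch the ref to atomic mode and schedule foo_ref_killed() */
        percpu_ref_kill_and_confirm(&foo->ref, foo_ref_killed);
}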

This patch removes css_refcnt(), which was used for the RCU
dereference sanity check in css_id(). While we could add a percpu
refcnt API to ask the same question, css_id() itself is scheduled to
be removed fairly soon, so let's not bother with it. Just drop the
sanity check and use rcu_dereference_raw() instead.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: "Alasdair G. Kergon" <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Kent Overstreet <[email protected]>
---
include/linux/cgroup.h | 23 +++------
kernel/cgroup.c | 138 +++++++++++++++++++++++++++++--------------------
2 files changed, 91 insertions(+), 70 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 58ee0f4..4d71cc2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -20,6 +20,7 @@
#include <linux/workqueue.h>
#include <linux/xattr.h>
#include <linux/fs.h>
+#include <linux/percpu-refcount.h>

#ifdef CONFIG_CGROUPS

@@ -72,13 +73,8 @@ struct cgroup_subsys_state {
*/
struct cgroup *cgroup;

- /*
- * State maintained by the cgroup system to allow subsystems
- * to be "busy". Should be accessed via css_get(),
- * css_tryget() and css_put().
- */
-
- atomic_t refcnt;
+ /* reference count - access via css_[try]get() and css_put() */
+ struct percpu_ref refcnt;

unsigned long flags;
/* ID for this css, if possible */
@@ -104,11 +100,9 @@ static inline void css_get(struct cgroup_subsys_state *css)
{
/* We don't need to reference count the root state */
if (!(css->flags & CSS_ROOT))
- atomic_inc(&css->refcnt);
+ percpu_ref_get(&css->refcnt);
}

-extern bool __css_tryget(struct cgroup_subsys_state *css);
-
/**
* css_tryget - try to obtain a reference on the specified css
* @css: target css
@@ -123,11 +117,9 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
{
if (css->flags & CSS_ROOT)
return true;
- return __css_tryget(css);
+ return percpu_ref_tryget(&css->refcnt);
}

-extern void __css_put(struct cgroup_subsys_state *css);
-
/**
* css_put - put a css reference
* @css: target css
@@ -137,7 +129,7 @@ extern void __css_put(struct cgroup_subsys_state *css);
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!(css->flags & CSS_ROOT))
- __css_put(css);
+ percpu_ref_put(&css->refcnt);
}

/* bits in struct cgroup flags field */
@@ -231,9 +223,10 @@ struct cgroup {
struct list_head pidlists;
struct mutex pidlist_mutex;

- /* For RCU-protected deletion */
+ /* For css percpu_ref killing and RCU-protected deletion */
struct rcu_head rcu_head;
struct work_struct destroy_work;
+ atomic_t css_kill_cnt;

/* List of events which userspace want to receive */
struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 83e3183..769f800 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -63,9 +63,6 @@

#include <linux/atomic.h>

-/* css deactivation bias, makes css->refcnt negative to deny new trygets */
-#define CSS_DEACT_BIAS INT_MIN
-
/*
* cgroup_mutex is the master lock. Any modification to cgroup or its
* hierarchy must be performed while holding it.
@@ -213,19 +210,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
struct cftype cfts[], bool is_add);

-static int css_unbias_refcnt(int refcnt)
-{
- return refcnt >= 0 ? refcnt : refcnt - CSS_DEACT_BIAS;
-}
-
-/* the current nr of refs, always >= 0 whether @css is deactivated or not */
-static int css_refcnt(struct cgroup_subsys_state *css)
-{
- int v = atomic_read(&css->refcnt);
-
- return css_unbias_refcnt(v);
-}
-
/* convenient tests for these bits */
static inline bool cgroup_is_dead(const struct cgroup *cgrp)
{
@@ -4139,12 +4123,20 @@ static void css_dput_fn(struct work_struct *work)
deactivate_super(sb);
}

+static void css_release(struct percpu_ref *ref)
+{
+ struct cgroup_subsys_state *css =
+ container_of(ref, struct cgroup_subsys_state, refcnt);
+
+ schedule_work(&css->dput_work);
+}
+
static void init_cgroup_css(struct cgroup_subsys_state *css,
struct cgroup_subsys *ss,
struct cgroup *cgrp)
{
css->cgroup = cgrp;
- atomic_set(&css->refcnt, 1);
+ percpu_ref_init(&css->refcnt, css_release);
css->flags = 0;
css->id = NULL;
if (cgrp == dummytop)
@@ -4360,6 +4352,48 @@ static int cgroup_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
return cgroup_create(c_parent, dentry, mode | S_IFDIR);
}

+static void cgroup_css_killed(struct cgroup *cgrp)
+{
+ if (!atomic_dec_and_test(&cgrp->css_kill_cnt))
+ return;
+
+ /* percpu ref's of all css's are killed, kick off the next step */
+ INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
+ schedule_work(&cgrp->destroy_work);
+}
+
+static void css_ref_killed_fn(struct percpu_ref *ref)
+{
+ struct cgroup_subsys_state *css =
+ container_of(ref, struct cgroup_subsys_state, refcnt);
+
+ cgroup_css_killed(css->cgroup);
+}
+
+/**
+ * cgroup_destroy_locked - the first stage of cgroup destruction
+ * @cgrp: cgroup to be destroyed
+ *
+ * css's make use of percpu refcnts whose killing latency shouldn't be
+ * exposed to userland and are RCU protected. Also, cgroup core needs to
+ * guarantee that css_tryget() won't succeed by the time ->css_offline() is
+ * invoked. To satisfy all the requirements, destruction is implemented in
+ * the following two steps.
+ *
+ * s1. Verify @cgrp can be destroyed and mark it dying. Remove all
+ * userland visible parts and start killing the percpu refcnts of
+ * css's. Set up so that the next stage will be kicked off once all
+ * the percpu refcnts are confirmed to be killed.
+ *
+ * s2. Invoke ->css_offline(), mark the cgroup dead and proceed with the
+ * rest of destruction. Once all cgroup references are gone, the
+ * cgroup is RCU-freed.
+ *
+ * This function implements s1. After this step, @cgrp is gone as far as
+ * the userland is concerned and a new cgroup with the same name may be
+ * created. As cgroup doesn't care about the names internally, this
+ * doesn't cause any problem.
+ */
static int cgroup_destroy_locked(struct cgroup *cgrp)
__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
{
@@ -4382,16 +4416,28 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
return -EBUSY;

/*
- * Block new css_tryget() by deactivating refcnt and mark @cgrp
- * removed. This makes future css_tryget() attempts fail which we
- * guarantee to ->css_offline() callbacks.
+ * Block new css_tryget() by killing css refcnts. cgroup core
+ * guarantees that, by the time ->css_offline() is invoked, no new
+ * css reference will be given out via css_tryget(). We can't
+ * simply call percpu_ref_kill() and proceed to offlining css's
+ * because percpu_ref_kill() doesn't guarantee that the ref is seen
+ * as killed on all CPUs on return.
+ *
+ * Use percpu_ref_kill_and_confirm() to get notifications as each
+ * css is confirmed to be seen as killed on all CPUs. The
+ * notification callback keeps track of the number of css's to be
+ * killed and schedules cgroup_offline_fn() to perform the rest of
+ * destruction once the percpu refs of all css's are confirmed to
+ * be killed.
*/
+ atomic_set(&cgrp->css_kill_cnt, 1);
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];

- WARN_ON(atomic_read(&css->refcnt) < 0);
- atomic_add(CSS_DEACT_BIAS, &css->refcnt);
+ atomic_inc(&cgrp->css_kill_cnt);
+ percpu_ref_kill_and_confirm(&css->refcnt, css_ref_killed_fn);
}
+ cgroup_css_killed(cgrp);

/*
* Mark @cgrp dead. This prevents further task migration and child
@@ -4427,12 +4473,19 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
}
spin_unlock(&cgrp->event_list_lock);

- INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
- schedule_work(&cgrp->destroy_work);
-
return 0;
};

+/**
+ * cgroup_offline_fn - the second step of cgroup destruction
+ * @work: cgroup->destroy_free_work
+ *
+ * This function is invoked from a work item for a cgroup which is being
+ * destroyed after the percpu refcnts of all css's are guaranteed to be
+ * seen as killed on all CPUs, and performs the rest of destruction. This
+ * is the second step of destruction described in the comment above
+ * cgroup_destroy_locked().
+ */
static void cgroup_offline_fn(struct work_struct *work)
{
struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
@@ -4442,7 +4495,10 @@ static void cgroup_offline_fn(struct work_struct *work)

mutex_lock(&cgroup_mutex);

- /* tell subsystems to initate destruction */
+ /*
+ * css_tryget() is guaranteed to fail now. Tell subsystems to
+ * initate destruction.
+ */
for_each_subsys(cgrp->root, ss)
offline_css(ss, cgrp);

@@ -5100,34 +5156,6 @@ static void check_for_release(struct cgroup *cgrp)
}
}

-/* Caller must verify that the css is not for root cgroup */
-bool __css_tryget(struct cgroup_subsys_state *css)
-{
- while (true) {
- int t, v;
-
- v = css_refcnt(css);
- t = atomic_cmpxchg(&css->refcnt, v, v + 1);
- if (likely(t == v))
- return true;
- else if (t < 0)
- return false;
- cpu_relax();
- }
-}
-EXPORT_SYMBOL_GPL(__css_tryget);
-
-/* Caller must verify that the css is not for root cgroup */
-void __css_put(struct cgroup_subsys_state *css)
-{
- int v;
-
- v = css_unbias_refcnt(atomic_dec_return(&css->refcnt));
- if (v == 0)
- schedule_work(&css->dput_work);
-}
-EXPORT_SYMBOL_GPL(__css_put);
-
/*
* Notify userspace when a cgroup is released, by running the
* configured release agent with the name of the cgroup (path
@@ -5245,7 +5273,7 @@ unsigned short css_id(struct cgroup_subsys_state *css)
* on this or this is under rcu_read_lock(). Once css->id is allocated,
* it's unchanged until freed.
*/
- cssid = rcu_dereference_check(css->id, css_refcnt(css));
+ cssid = rcu_dereference_raw(css->id);

if (cssid)
return cssid->id;
--
1.8.2.1

2013-06-12 21:03:55

by Tejun Heo

Subject: [PATCH 10/11] cgroup: split cgroup destruction into two steps

Split cgroup_destroy_locked() into two steps and put the latter half
into cgroup_offline_fn(), which is executed from a work item. The
latter half is responsible for offlining the css's, removing the
cgroup from internal lists, and propagating the release notification
to the parent. The separation is to allow css's to use a percpu
refcnt.

Note that this allows for other cgroup operations to happen between
the first and second halves of destruction, including creating a new
cgroup with the same name. As the target cgroup is marked DEAD in the
first half and cgroup internals don't care about the names of cgroups,
this should be fine. A comment explaining this will be added by the
next patch which implements the actual percpu refcnting.

As RCU freeing is guaranteed to happen after the second step of
destruction, we can use the same work item for both. This patch
renames cgroup->free_work to ->destroy_work and uses it for both
purposes. INIT_WORK() is now performed right before queueing the work
item.
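
A bare-bones sketch of the resulting pattern (illustrative names; for
brevity it chains the RCU free directly off the offline step, whereas
the real code triggers it from the dentry release):

#include <linux/workqueue.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
        struct work_struct destroy_work;        /* reused for offline and free */
        struct rcu_head rcu_head;
};

static void foo_free_fn(struct work_struct *work)
{
        struct foo *foo = container_of(work, struct foo, destroy_work);

        kfree(foo);     /* final freeing, after the RCU grace period */
}

/* RCU callback: the same work item is reused with a different function */
static void foo_free_rcu(struct rcu_head *head)
{
        struct foo *foo = container_of(head, struct foo, rcu_head);

        INIT_WORK(&foo->destroy_work, foo_free_fn);
        schedule_work(&foo->destroy_work);
}

/* second half of destruction, from workqueue context */
static void foo_offline_fn(struct work_struct *work)
{
        struct foo *foo = container_of(work, struct foo, destroy_work);

        /* ... offline, unlink from internal lists, notify the parent ... */
        call_rcu(&foo->rcu_head, foo_free_rcu);
}

/* first half: runs synchronously, then punts the rest to a work item */
static void foo_destroy(struct foo *foo)
{
        INIT_WORK(&foo->destroy_work, foo_offline_fn);
        schedule_work(&foo->destroy_work);
}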

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 2 +-
kernel/cgroup.c | 38 +++++++++++++++++++++++++++-----------
2 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5428738..58ee0f4 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -233,7 +233,7 @@ struct cgroup {

/* For RCU-protected deletion */
struct rcu_head rcu_head;
- struct work_struct free_work;
+ struct work_struct destroy_work;

/* List of events which userspace want to receive */
struct list_head event_list;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a43bc9d..83e3183 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -208,6 +208,7 @@ static struct cgroup_name root_cgroup_name = { .name = "/" };
*/
static int need_forkexit_callback __read_mostly;

+static void cgroup_offline_fn(struct work_struct *work);
static int cgroup_destroy_locked(struct cgroup *cgrp);
static int cgroup_addrm_files(struct cgroup *cgrp, struct cgroup_subsys *subsys,
struct cftype cfts[], bool is_add);
@@ -830,7 +831,7 @@ static struct cgroup_name *cgroup_alloc_name(struct dentry *dentry)

static void cgroup_free_fn(struct work_struct *work)
{
- struct cgroup *cgrp = container_of(work, struct cgroup, free_work);
+ struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
struct cgroup_subsys *ss;

mutex_lock(&cgroup_mutex);
@@ -875,7 +876,8 @@ static void cgroup_free_rcu(struct rcu_head *head)
{
struct cgroup *cgrp = container_of(head, struct cgroup, rcu_head);

- schedule_work(&cgrp->free_work);
+ INIT_WORK(&cgrp->destroy_work, cgroup_free_fn);
+ schedule_work(&cgrp->destroy_work);
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -1407,7 +1409,6 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->allcg_node);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
- INIT_WORK(&cgrp->free_work, cgroup_free_fn);
mutex_init(&cgrp->pidlist_mutex);
INIT_LIST_HEAD(&cgrp->event_list);
spin_lock_init(&cgrp->event_list_lock);
@@ -2994,12 +2995,13 @@ struct cgroup *cgroup_next_sibling(struct cgroup *pos)
/*
* @pos could already have been removed. Once a cgroup is removed,
* its ->sibling.next is no longer updated when its next sibling
- * changes. As CGRP_DEAD is set on removal which is fully
- * serialized, if we see it unasserted, it's guaranteed that the
- * next sibling hasn't finished its grace period even if it's
- * already removed, and thus safe to dereference from this RCU
- * critical section. If ->sibling.next is inaccessible,
- * cgroup_is_dead() is guaranteed to be visible as %true here.
+ * changes. As CGRP_DEAD assertion is serialized and happens
+ * before the cgroup is taken off the ->sibling list, if we see it
+ * unasserted, it's guaranteed that the next sibling hasn't
+ * finished its grace period even if it's already removed, and thus
+ * safe to dereference from this RCU critical section. If
+ * ->sibling.next is inaccessible, cgroup_is_dead() is guaranteed
+ * to be visible as %true here.
*/
if (likely(!cgroup_is_dead(pos))) {
next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
@@ -4362,7 +4364,6 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
__releases(&cgroup_mutex) __acquires(&cgroup_mutex)
{
struct dentry *d = cgrp->dentry;
- struct cgroup *parent = cgrp->parent;
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;
bool empty;
@@ -4426,6 +4427,21 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
}
spin_unlock(&cgrp->event_list_lock);

+ INIT_WORK(&cgrp->destroy_work, cgroup_offline_fn);
+ schedule_work(&cgrp->destroy_work);
+
+ return 0;
+};
+
+static void cgroup_offline_fn(struct work_struct *work)
+{
+ struct cgroup *cgrp = container_of(work, struct cgroup, destroy_work);
+ struct cgroup *parent = cgrp->parent;
+ struct dentry *d = cgrp->dentry;
+ struct cgroup_subsys *ss;
+
+ mutex_lock(&cgroup_mutex);
+
/* tell subsystems to initate destruction */
for_each_subsys(cgrp->root, ss)
offline_css(ss, cgrp);
@@ -4449,7 +4465,7 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);

- return 0;
+ mutex_unlock(&cgroup_mutex);
}

static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
--
1.8.2.1

2013-06-12 21:04:36

by Tejun Heo

Subject: [PATCH 09/11] cgroup: reorder the operations in cgroup_destroy_locked()

This patch reorders the operations in cgroup_destroy_locked() such
that the userland-visible parts happen before css offlining and
removal from the ->sibling list. This will be used to make css's use
a percpu refcnt.

While at it, split out the CGRP_DEAD-related comment from the refcnt
deactivation one and correct / clarify how the different guarantees
are met.

While this patch changes the specific order of operations, it
shouldn't cause any noticeable behavior difference.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 61 +++++++++++++++++++++++++++++++++------------------------
1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index aefda90..a43bc9d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4382,13 +4382,8 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)

/*
* Block new css_tryget() by deactivating refcnt and mark @cgrp
- * removed. This makes future css_tryget() and child creation
- * attempts fail thus maintaining the removal conditions verified
- * above.
- *
- * Note that CGRP_DEAD assertion is depended upon by
- * cgroup_next_sibling() to resume iteration after dropping RCU
- * read lock. See cgroup_next_sibling() for details.
+ * removed. This makes future css_tryget() attempts fail which we
+ * guarantee to ->css_offline() callbacks.
*/
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4396,8 +4391,41 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
WARN_ON(atomic_read(&css->refcnt) < 0);
atomic_add(CSS_DEACT_BIAS, &css->refcnt);
}
+
+ /*
+ * Mark @cgrp dead. This prevents further task migration and child
+ * creation by disabling cgroup_lock_live_group(). Note that
+ * CGRP_DEAD assertion is depended upon by cgroup_next_sibling() to
+ * resume iteration after dropping RCU read lock. See
+ * cgroup_next_sibling() for details.
+ */
set_bit(CGRP_DEAD, &cgrp->flags);

+ /* CGRP_DEAD is set, remove from ->release_list for the last time */
+ raw_spin_lock(&release_list_lock);
+ if (!list_empty(&cgrp->release_list))
+ list_del_init(&cgrp->release_list);
+ raw_spin_unlock(&release_list_lock);
+
+ /*
+ * Remove @cgrp directory. The removal puts the base ref but we
+ * aren't quite done with @cgrp yet, so hold onto it.
+ */
+ dget(d);
+ cgroup_d_remove_dir(d);
+
+ /*
+ * Unregister events and notify userspace.
+ * Notify userspace about cgroup removing only after rmdir of cgroup
+ * directory to avoid race between userspace and kernelspace.
+ */
+ spin_lock(&cgrp->event_list_lock);
+ list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
+ list_del_init(&event->list);
+ schedule_work(&event->remove);
+ }
+ spin_unlock(&cgrp->event_list_lock);
+
/* tell subsystems to initate destruction */
for_each_subsys(cgrp->root, ss)
offline_css(ss, cgrp);
@@ -4412,34 +4440,15 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
for_each_subsys(cgrp->root, ss)
css_put(cgrp->subsys[ss->subsys_id]);

- raw_spin_lock(&release_list_lock);
- if (!list_empty(&cgrp->release_list))
- list_del_init(&cgrp->release_list);
- raw_spin_unlock(&release_list_lock);
-
/* delete this cgroup from parent->children */
list_del_rcu(&cgrp->sibling);
list_del_init(&cgrp->allcg_node);

- dget(d);
- cgroup_d_remove_dir(d);
dput(d);

set_bit(CGRP_RELEASABLE, &parent->flags);
check_for_release(parent);

- /*
- * Unregister events and notify userspace.
- * Notify userspace about cgroup removing only after rmdir of cgroup
- * directory to avoid race between userspace and kernelspace.
- */
- spin_lock(&cgrp->event_list_lock);
- list_for_each_entry_safe(event, tmp, &cgrp->event_list, list) {
- list_del_init(&event->list);
- schedule_work(&event->remove);
- }
- spin_unlock(&cgrp->event_list_lock);
-
return 0;
}

--
1.8.2.1

2013-06-12 21:05:17

by Tejun Heo

Subject: [PATCH 07/11] cgroup: drop unnecessary RCU dancing from __put_css_set()

__put_css_set() does RCU read access on @cgrp across dropping
@cgrp->count so that it can continue accessing @cgrp even if the count
reached zero and destruction of the cgroup commenced. Given that both
sides, __put_css_set() and cgroup_destroy_locked(), are cold paths,
this is unnecessary. Just making cgroup_destroy_locked() grab
css_set_lock while checking @cgrp->count is enough.

Remove the RCU read locking from __put_css_set() and make
cgroup_destroy_locked() read-lock css_set_lock when checking
@cgrp->count. This will also allow removing @cgrp->count.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 0343df5..efe500b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -407,19 +407,13 @@ static void __put_css_set(struct css_set *cset, int taskexit)
list_del_init(&link->cset_link);
list_del_init(&link->cgrp_link);

- /*
- * We may not be holding cgroup_mutex, and if cgrp->count is
- * dropped to 0 the cgroup can be destroyed at any time, hence
- * rcu_read_lock is used to keep it alive.
- */
- rcu_read_lock();
+ /* @cgrp can't go away while we're holding css_set_lock */
if (atomic_dec_and_test(&cgrp->count) &&
notify_on_release(cgrp)) {
if (taskexit)
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
}
- rcu_read_unlock();

kfree(link);
}
@@ -4373,11 +4367,19 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
struct cgroup *parent = cgrp->parent;
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;
+ bool empty;

lockdep_assert_held(&d->d_inode->i_mutex);
lockdep_assert_held(&cgroup_mutex);

- if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children))
+ /*
+ * css_set_lock prevents @cgrp from being removed while
+ * __put_css_set() is in progress.
+ */
+ read_lock(&css_set_lock);
+ empty = !atomic_read(&cgrp->count) && list_empty(&cgrp->children);
+ read_unlock(&css_set_lock);
+ if (!empty)
return -EBUSY;

/*
@@ -5054,8 +5056,6 @@ void cgroup_exit(struct task_struct *tsk, int run_callbacks)

static void check_for_release(struct cgroup *cgrp)
{
- /* All of these checks rely on RCU to keep the cgroup
- * structure alive */
if (cgroup_is_releasable(cgrp) &&
!atomic_read(&cgrp->count) && list_empty(&cgrp->children)) {
/*
--
1.8.2.1

2013-06-12 21:05:45

by Tejun Heo

Subject: [PATCH 05/11] cgroup: clean up css_[try]get() and css_put()

* __css_get() isn't used by anyone. Fold it into css_get().

* Add proper function comments to all css reference functions.

This patch is purely cosmetic.
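
As context for the css_tryget() comment added below, the pattern it
documents looks roughly like this (a sketch; the lookup via
task_subsys_state() and the error handling are illustrative, not part
of the patch):

#include <linux/cgroup.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

static void use_current_css(int subsys_id)
{
        struct cgroup_subsys_state *css;

        rcu_read_lock();
        css = task_subsys_state(current, subsys_id);
        if (!css_tryget(css)) {
                /* the css is being destroyed; behave as if it were gone */
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();

        /* we now hold a real reference and may block, etc. */

        css_put(css);
}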

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 48 ++++++++++++++++++++++++------------------------
1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 0e32855..7b16882 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -94,33 +94,31 @@ enum {
CSS_ONLINE = (1 << 1), /* between ->css_online() and ->css_offline() */
};

-/* Caller must verify that the css is not for root cgroup */
-static inline void __css_get(struct cgroup_subsys_state *css, int count)
-{
- atomic_add(count, &css->refcnt);
-}
-
-/*
- * Call css_get() to hold a reference on the css; it can be used
- * for a reference obtained via:
- * - an existing ref-counted reference to the css
- * - task->cgroups for a locked task
+/**
+ * css_get - obtain a reference on the specified css
+ * @css: taget css
+ *
+ * The caller must already have a reference.
*/
-
static inline void css_get(struct cgroup_subsys_state *css)
{
/* We don't need to reference count the root state */
if (!(css->flags & CSS_ROOT))
- __css_get(css, 1);
+ atomic_inc(&css->refcnt);
}

-/*
- * Call css_tryget() to take a reference on a css if your existing
- * (known-valid) reference isn't already ref-counted. Returns false if
- * the css has been destroyed.
- */
-
extern bool __css_tryget(struct cgroup_subsys_state *css);
+
+/**
+ * css_tryget - try to obtain a reference on the specified css
+ * @css: target css
+ *
+ * Obtain a reference on @css if it's alive. The caller naturally needs to
+ * ensure that @css is accessible but doesn't have to be holding a
+ * reference on it - IOW, RCU protected access is good enough for this
+ * function. Returns %true if a reference count was successfully obtained;
+ * %false otherwise.
+ */
static inline bool css_tryget(struct cgroup_subsys_state *css)
{
if (css->flags & CSS_ROOT)
@@ -128,12 +126,14 @@ static inline bool css_tryget(struct cgroup_subsys_state *css)
return __css_tryget(css);
}

-/*
- * css_put() should be called to release a reference taken by
- * css_get() or css_tryget()
- */
-
extern void __css_put(struct cgroup_subsys_state *css);
+
+/**
+ * css_put - put a css reference
+ * @css: target css
+ *
+ * Put a reference obtained via css_get() and css_tryget().
+ */
static inline void css_put(struct cgroup_subsys_state *css)
{
if (!(css->flags & CSS_ROOT))
--
1.8.2.1

2013-06-12 21:03:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/11] cgroup: remove now unused css_depth()

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 1 -
kernel/cgroup.c | 12 ------------
2 files changed, 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index d0ad379..5830592 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -848,7 +848,6 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,

/* Get id and depth of css */
unsigned short css_id(struct cgroup_subsys_state *css);
-unsigned short css_depth(struct cgroup_subsys_state *css);
struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);

#else /* !CONFIG_CGROUPS */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3ad2a9c..dc8997a 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5230,18 +5230,6 @@ unsigned short css_id(struct cgroup_subsys_state *css)
}
EXPORT_SYMBOL_GPL(css_id);

-unsigned short css_depth(struct cgroup_subsys_state *css)
-{
- struct css_id *cssid;
-
- cssid = rcu_dereference_check(css->id, css_refcnt(css));
-
- if (cssid)
- return cssid->depth;
- return 0;
-}
-EXPORT_SYMBOL_GPL(css_depth);
-
/**
* css_is_ancestor - test "root" css is an ancestor of "child"
* @child: the css to be tested.
--
1.8.2.1

2013-06-12 21:06:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link

cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides. The naming around this already confusing struct
is pretty bad.

From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list. From css_set side, it starts off ->cg_links and
runs through ->cg_link_list. This is rather reversed, as
cgrp_link_list is used to iterate css_sets and cg_link_list to
iterate cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets. This patch cleans it up a bit.

* s/cgroup->css_sets/cgroup->cset_links/
s/css_set->cg_links/css_set->cgrp_links/
s/cgroup_iter->cg_link/cgroup_iter->cset_link/

* s/cg_cgroup_link/cgrp_cset_link/

* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/

* s/init_css_set_link/init_cgrp_cset_link/
s/free_cg_links/free_cgrp_cset_links/
s/allocate_cg_links/allocate_cgrp_cset_links/

* s/cgl[12]/link[12]/ in compare_css_sets()

* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
adjustments.

* Comment and whiteline adjustments.

After the changes, we have

list_for_each_entry(link, &cont->cset_links, cset_link) {
struct css_set *cset = link->cset;

instead of

list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
struct css_set *cset = link->cg;

This patch is purely cosmetic.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 15 ++--
kernel/cgroup.c | 226 ++++++++++++++++++++++++-------------------------
2 files changed, 120 insertions(+), 121 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5830592..0e32855 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -215,10 +215,10 @@ struct cgroup {
struct cgroupfs_root *root;

/*
- * List of cg_cgroup_links pointing at css_sets with
- * tasks in this cgroup. Protected by css_set_lock
+ * List of cgrp_cset_links pointing at css_sets with tasks in this
+ * cgroup. Protected by css_set_lock.
*/
- struct list_head css_sets;
+ struct list_head cset_links;

struct list_head allcg_node; /* cgroupfs_root->allcg_list */
struct list_head cft_q_node; /* used during cftype add/rm */
@@ -365,11 +365,10 @@ struct css_set {
struct list_head tasks;

/*
- * List of cg_cgroup_link objects on link chains from
- * cgroups referenced from this css_set. Protected by
- * css_set_lock
+ * List of cgrp_cset_links pointing at cgroups referenced from this
+ * css_set. Protected by css_set_lock.
*/
- struct list_head cg_links;
+ struct list_head cgrp_links;

/*
* Set of subsystem states, one for each subsystem. This array
@@ -792,7 +791,7 @@ struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,

/* A cgroup_iter should be treated as an opaque object */
struct cgroup_iter {
- struct list_head *cg_link;
+ struct list_head *cset_link;
struct list_head *task;
};

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e7f6845..0b5cd0e 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -315,20 +315,24 @@ static void cgroup_release_agent(struct work_struct *work);
static DECLARE_WORK(release_agent_work, cgroup_release_agent);
static void check_for_release(struct cgroup *cgrp);

-/* Link structure for associating css_set objects with cgroups */
-struct cg_cgroup_link {
- /*
- * List running through cg_cgroup_links associated with a
- * cgroup, anchored on cgroup->css_sets
- */
- struct list_head cgrp_link_list;
- struct cgroup *cgrp;
- /*
- * List running through cg_cgroup_links pointing at a
- * single css_set object, anchored on css_set->cg_links
- */
- struct list_head cg_link_list;
- struct css_set *cg;
+/*
+ * A cgroup can be associated with multiple css_sets as different tasks may
+ * belong to different cgroups on different hierarchies. In the other
+ * direction, a css_set is naturally associated with multiple cgroups.
+ * This M:N relationship is represented by the following link structure
+ * which exists for each association and allows traversing the associations
+ * from both sides.
+ */
+struct cgrp_cset_link {
+ /* the cgroup and css_set this link associates */
+ struct cgroup *cgrp;
+ struct css_set *cset;
+
+ /* list of cgrp_cset_links anchored at cgrp->cset_links */
+ struct list_head cset_link;
+
+ /* list of cgrp_cset_links anchored at css_set->cgrp_links */
+ struct list_head cgrp_link;
};

/* The default css_set - used by init and its children prior to any
@@ -339,7 +343,7 @@ struct cg_cgroup_link {
*/

static struct css_set init_css_set;
-static struct cg_cgroup_link init_css_set_link;
+static struct cgrp_cset_link init_cgrp_cset_link;

static int cgroup_init_idr(struct cgroup_subsys *ss,
struct cgroup_subsys_state *css);
@@ -378,8 +382,7 @@ static int use_task_css_set_links __read_mostly;

static void __put_css_set(struct css_set *cset, int taskexit)
{
- struct cg_cgroup_link *link;
- struct cg_cgroup_link *saved_link;
+ struct cgrp_cset_link *link, *tmp_link;

/*
* Ensure that the refcount doesn't hit zero while any readers
@@ -398,12 +401,11 @@ static void __put_css_set(struct css_set *cset, int taskexit)
hash_del(&cset->hlist);
css_set_count--;

- list_for_each_entry_safe(link, saved_link, &cset->cg_links,
- cg_link_list) {
+ list_for_each_entry_safe(link, tmp_link, &cset->cgrp_links, cgrp_link) {
struct cgroup *cgrp = link->cgrp;

- list_del(&link->cg_link_list);
- list_del(&link->cgrp_link_list);
+ list_del(&link->cset_link);
+ list_del(&link->cgrp_link);

/*
* We may not be holding cgroup_mutex, and if cgrp->count is
@@ -475,26 +477,26 @@ static bool compare_css_sets(struct css_set *cset,
* candidates.
*/

- l1 = &cset->cg_links;
- l2 = &old_cset->cg_links;
+ l1 = &cset->cgrp_links;
+ l2 = &old_cset->cgrp_links;
while (1) {
- struct cg_cgroup_link *cgl1, *cgl2;
+ struct cgrp_cset_link *link1, *link2;
struct cgroup *cgrp1, *cgrp2;

l1 = l1->next;
l2 = l2->next;
/* See if we reached the end - both lists are equal length. */
- if (l1 == &cset->cg_links) {
- BUG_ON(l2 != &old_cset->cg_links);
+ if (l1 == &cset->cgrp_links) {
+ BUG_ON(l2 != &old_cset->cgrp_links);
break;
} else {
- BUG_ON(l2 == &old_cset->cg_links);
+ BUG_ON(l2 == &old_cset->cgrp_links);
}
/* Locate the cgroups associated with these links. */
- cgl1 = list_entry(l1, struct cg_cgroup_link, cg_link_list);
- cgl2 = list_entry(l2, struct cg_cgroup_link, cg_link_list);
- cgrp1 = cgl1->cgrp;
- cgrp2 = cgl2->cgrp;
+ link1 = list_entry(l1, struct cgrp_cset_link, cgrp_link);
+ link2 = list_entry(l2, struct cgrp_cset_link, cgrp_link);
+ cgrp1 = link1->cgrp;
+ cgrp2 = link2->cgrp;
/* Hierarchies should be linked in the same order. */
BUG_ON(cgrp1->root != cgrp2->root);

@@ -569,61 +571,64 @@ static struct css_set *find_existing_css_set(struct css_set *old_cset,
return NULL;
}

-static void free_cg_links(struct list_head *tmp)
+static void free_cgrp_cset_links(struct list_head *links_to_free)
{
- struct cg_cgroup_link *link;
- struct cg_cgroup_link *saved_link;
+ struct cgrp_cset_link *link, *tmp_link;

- list_for_each_entry_safe(link, saved_link, tmp, cgrp_link_list) {
- list_del(&link->cgrp_link_list);
+ list_for_each_entry_safe(link, tmp_link, links_to_free, cset_link) {
+ list_del(&link->cset_link);
kfree(link);
}
}

-/*
- * allocate_cg_links() allocates "count" cg_cgroup_link structures
- * and chains them on tmp through their cgrp_link_list fields. Returns 0 on
- * success or a negative error
+/**
+ * allocate_cgrp_cset_links - allocate cgrp_cset_links
+ * @count: the number of links to allocate
+ * @tmp_links: list_head the allocated links are put on
+ *
+ * Allocate @count cgrp_cset_link structures and chain them on @tmp_links
+ * through ->cset_link. Returns 0 on success or -errno.
*/
-static int allocate_cg_links(int count, struct list_head *tmp)
+static int allocate_cgrp_cset_links(int count, struct list_head *tmp_links)
{
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;
int i;
- INIT_LIST_HEAD(tmp);
+
+ INIT_LIST_HEAD(tmp_links);
+
for (i = 0; i < count; i++) {
link = kmalloc(sizeof(*link), GFP_KERNEL);
if (!link) {
- free_cg_links(tmp);
+ free_cgrp_cset_links(tmp_links);
return -ENOMEM;
}
- list_add(&link->cgrp_link_list, tmp);
+ list_add(&link->cset_link, tmp_links);
}
return 0;
}

/**
* link_css_set - a helper function to link a css_set to a cgroup
- * @tmp_cg_links: cg_cgroup_link objects allocated by allocate_cg_links()
+ * @tmp_links: cgrp_cset_link objects allocated by allocate_cgrp_cset_links()
* @cset: the css_set to be linked
* @cgrp: the destination cgroup
*/
-static void link_css_set(struct list_head *tmp_cg_links,
- struct css_set *cset, struct cgroup *cgrp)
+static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
+ struct cgroup *cgrp)
{
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;

- BUG_ON(list_empty(tmp_cg_links));
- link = list_first_entry(tmp_cg_links, struct cg_cgroup_link,
- cgrp_link_list);
- link->cg = cset;
+ BUG_ON(list_empty(tmp_links));
+ link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
+ link->cset = cset;
link->cgrp = cgrp;
atomic_inc(&cgrp->count);
- list_move(&link->cgrp_link_list, &cgrp->css_sets);
+ list_move(&link->cset_link, &cgrp->cset_links);
/*
* Always add links to the tail of the list so that the list
* is sorted by order of hierarchy creation
*/
- list_add_tail(&link->cg_link_list, &cset->cg_links);
+ list_add_tail(&link->cgrp_link, &cset->cgrp_links);
}

/*
@@ -638,10 +643,8 @@ static struct css_set *find_css_set(struct css_set *old_cset,
{
struct css_set *cset;
struct cgroup_subsys_state *template[CGROUP_SUBSYS_COUNT];
-
- struct list_head tmp_cg_links;
-
- struct cg_cgroup_link *link;
+ struct list_head tmp_links;
+ struct cgrp_cset_link *link;
unsigned long key;

/* First see if we already have a cgroup group that matches
@@ -659,14 +662,14 @@ static struct css_set *find_css_set(struct css_set *old_cset,
if (!cset)
return NULL;

- /* Allocate all the cg_cgroup_link objects that we'll need */
- if (allocate_cg_links(root_count, &tmp_cg_links) < 0) {
+ /* Allocate all the cgrp_cset_link objects that we'll need */
+ if (allocate_cgrp_cset_links(root_count, &tmp_links) < 0) {
kfree(cset);
return NULL;
}

atomic_set(&cset->refcount, 1);
- INIT_LIST_HEAD(&cset->cg_links);
+ INIT_LIST_HEAD(&cset->cgrp_links);
INIT_LIST_HEAD(&cset->tasks);
INIT_HLIST_NODE(&cset->hlist);

@@ -676,14 +679,15 @@ static struct css_set *find_css_set(struct css_set *old_cset,

write_lock(&css_set_lock);
/* Add reference counts and links from the new css_set. */
- list_for_each_entry(link, &old_cset->cg_links, cg_link_list) {
+ list_for_each_entry(link, &old_cset->cgrp_links, cgrp_link) {
struct cgroup *c = link->cgrp;
+
if (c->root == cgrp->root)
c = cgrp;
- link_css_set(&tmp_cg_links, cset, c);
+ link_css_set(&tmp_links, cset, c);
}

- BUG_ON(!list_empty(&tmp_cg_links));
+ BUG_ON(!list_empty(&tmp_links));

css_set_count++;

@@ -717,9 +721,11 @@ static struct cgroup *task_cgroup_from_root(struct task_struct *task,
if (cset == &init_css_set) {
res = &root->top_cgroup;
} else {
- struct cg_cgroup_link *link;
- list_for_each_entry(link, &cset->cg_links, cg_link_list) {
+ struct cgrp_cset_link *link;
+
+ list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
struct cgroup *c = link->cgrp;
+
if (c->root == root) {
res = c;
break;
@@ -1405,7 +1411,7 @@ static void init_cgroup_housekeeping(struct cgroup *cgrp)
INIT_LIST_HEAD(&cgrp->sibling);
INIT_LIST_HEAD(&cgrp->children);
INIT_LIST_HEAD(&cgrp->files);
- INIT_LIST_HEAD(&cgrp->css_sets);
+ INIT_LIST_HEAD(&cgrp->cset_links);
INIT_LIST_HEAD(&cgrp->allcg_node);
INIT_LIST_HEAD(&cgrp->release_list);
INIT_LIST_HEAD(&cgrp->pidlists);
@@ -1604,7 +1610,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
BUG_ON(!root);
if (root == opts.new_root) {
/* We used the new root structure, so this is a new hierarchy */
- struct list_head tmp_cg_links;
+ struct list_head tmp_links;
struct cgroup *root_cgrp = &root->top_cgroup;
struct cgroupfs_root *existing_root;
const struct cred *cred;
@@ -1636,7 +1642,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
* that's us. The worst that can happen is that we
* have some link structures left over
*/
- ret = allocate_cg_links(css_set_count, &tmp_cg_links);
+ ret = allocate_cgrp_cset_links(css_set_count, &tmp_links);
if (ret)
goto unlock_drop;

@@ -1646,7 +1652,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,

ret = rebind_subsystems(root, root->subsys_mask);
if (ret == -EBUSY) {
- free_cg_links(&tmp_cg_links);
+ free_cgrp_cset_links(&tmp_links);
goto unlock_drop;
}
/*
@@ -1668,10 +1674,10 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
* the css_set objects */
write_lock(&css_set_lock);
hash_for_each(css_set_table, i, cset, hlist)
- link_css_set(&tmp_cg_links, cset, root_cgrp);
+ link_css_set(&tmp_links, cset, root_cgrp);
write_unlock(&css_set_lock);

- free_cg_links(&tmp_cg_links);
+ free_cgrp_cset_links(&tmp_links);

BUG_ON(!list_empty(&root_cgrp->children));
BUG_ON(root->number_of_cgroups != 1);
@@ -1725,9 +1731,8 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
static void cgroup_kill_sb(struct super_block *sb) {
struct cgroupfs_root *root = sb->s_fs_info;
struct cgroup *cgrp = &root->top_cgroup;
+ struct cgrp_cset_link *link, *tmp_link;
int ret;
- struct cg_cgroup_link *link;
- struct cg_cgroup_link *saved_link;

BUG_ON(!root);

@@ -1743,15 +1748,14 @@ static void cgroup_kill_sb(struct super_block *sb) {
BUG_ON(ret);

/*
- * Release all the links from css_sets to this hierarchy's
+ * Release all the links from cset_links to this hierarchy's
* root cgroup
*/
write_lock(&css_set_lock);

- list_for_each_entry_safe(link, saved_link, &cgrp->css_sets,
- cgrp_link_list) {
- list_del(&link->cg_link_list);
- list_del(&link->cgrp_link_list);
+ list_for_each_entry_safe(link, tmp_link, &cgrp->cset_links, cset_link) {
+ list_del(&link->cset_link);
+ list_del(&link->cgrp_link);
kfree(link);
}
write_unlock(&css_set_lock);
@@ -2911,12 +2915,11 @@ int cgroup_rm_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
int cgroup_task_count(const struct cgroup *cgrp)
{
int count = 0;
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;

read_lock(&css_set_lock);
- list_for_each_entry(link, &cgrp->css_sets, cgrp_link_list) {
- count += atomic_read(&link->cg->refcount);
- }
+ list_for_each_entry(link, &cgrp->cset_links, cset_link)
+ count += atomic_read(&link->cset->refcount);
read_unlock(&css_set_lock);
return count;
}
@@ -2925,24 +2928,23 @@ int cgroup_task_count(const struct cgroup *cgrp)
* Advance a list_head iterator. The iterator should be positioned at
* the start of a css_set
*/
-static void cgroup_advance_iter(struct cgroup *cgrp,
- struct cgroup_iter *it)
+static void cgroup_advance_iter(struct cgroup *cgrp, struct cgroup_iter *it)
{
- struct list_head *l = it->cg_link;
- struct cg_cgroup_link *link;
+ struct list_head *l = it->cset_link;
+ struct cgrp_cset_link *link;
struct css_set *cset;

/* Advance to the next non-empty css_set */
do {
l = l->next;
- if (l == &cgrp->css_sets) {
- it->cg_link = NULL;
+ if (l == &cgrp->cset_links) {
+ it->cset_link = NULL;
return;
}
- link = list_entry(l, struct cg_cgroup_link, cgrp_link_list);
- cset = link->cg;
+ link = list_entry(l, struct cgrp_cset_link, cset_link);
+ cset = link->cset;
} while (list_empty(&cset->tasks));
- it->cg_link = l;
+ it->cset_link = l;
it->task = cset->tasks.next;
}

@@ -3163,7 +3165,7 @@ void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
cgroup_enable_task_cg_lists();

read_lock(&css_set_lock);
- it->cg_link = &cgrp->css_sets;
+ it->cset_link = &cgrp->cset_links;
cgroup_advance_iter(cgrp, it);
}

@@ -3172,16 +3174,16 @@ struct task_struct *cgroup_iter_next(struct cgroup *cgrp,
{
struct task_struct *res;
struct list_head *l = it->task;
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;

/* If the iterator cg is NULL, we have no tasks */
- if (!it->cg_link)
+ if (!it->cset_link)
return NULL;
res = list_entry(l, struct task_struct, cg_list);
/* Advance iterator to find next entry */
l = l->next;
- link = list_entry(it->cg_link, struct cg_cgroup_link, cgrp_link_list);
- if (l == &link->cg->tasks) {
+ link = list_entry(it->cset_link, struct cgrp_cset_link, cset_link);
+ if (l == &link->cset->tasks) {
/* We reached the end of this task list - move on to
* the next cg_cgroup_link */
cgroup_advance_iter(cgrp, it);
@@ -4628,7 +4630,7 @@ EXPORT_SYMBOL_GPL(cgroup_load_subsys);
*/
void cgroup_unload_subsys(struct cgroup_subsys *ss)
{
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;

BUG_ON(ss->module == NULL);

@@ -4657,8 +4659,8 @@ void cgroup_unload_subsys(struct cgroup_subsys *ss)
* in loading, we need to pay our respects to the hashtable gods.
*/
write_lock(&css_set_lock);
- list_for_each_entry(link, &dummytop->css_sets, cgrp_link_list) {
- struct css_set *cset = link->cg;
+ list_for_each_entry(link, &dummytop->cset_links, cset_link) {
+ struct css_set *cset = link->cset;
unsigned long key;

hash_del(&cset->hlist);
@@ -4691,7 +4693,7 @@ int __init cgroup_init_early(void)
{
int i;
atomic_set(&init_css_set.refcount, 1);
- INIT_LIST_HEAD(&init_css_set.cg_links);
+ INIT_LIST_HEAD(&init_css_set.cgrp_links);
INIT_LIST_HEAD(&init_css_set.tasks);
INIT_HLIST_NODE(&init_css_set.hlist);
css_set_count = 1;
@@ -4699,12 +4701,10 @@ int __init cgroup_init_early(void)
root_count = 1;
init_task.cgroups = &init_css_set;

- init_css_set_link.cg = &init_css_set;
- init_css_set_link.cgrp = dummytop;
- list_add(&init_css_set_link.cgrp_link_list,
- &rootnode.top_cgroup.css_sets);
- list_add(&init_css_set_link.cg_link_list,
- &init_css_set.cg_links);
+ init_cgrp_cset_link.cset = &init_css_set;
+ init_cgrp_cset_link.cgrp = dummytop;
+ list_add(&init_cgrp_cset_link.cset_link, &rootnode.top_cgroup.cset_links);
+ list_add(&init_cgrp_cset_link.cgrp_link, &init_css_set.cgrp_links);

for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
struct cgroup_subsys *ss = subsys[i];
@@ -5457,13 +5457,13 @@ static int current_css_set_cg_links_read(struct cgroup *cont,
struct cftype *cft,
struct seq_file *seq)
{
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;
struct css_set *cset;

read_lock(&css_set_lock);
rcu_read_lock();
cset = rcu_dereference(current->cgroups);
- list_for_each_entry(link, &cset->cg_links, cg_link_list) {
+ list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
struct cgroup *c = link->cgrp;
const char *name;

@@ -5484,11 +5484,11 @@ static int cgroup_css_links_read(struct cgroup *cont,
struct cftype *cft,
struct seq_file *seq)
{
- struct cg_cgroup_link *link;
+ struct cgrp_cset_link *link;

read_lock(&css_set_lock);
- list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
- struct css_set *cset = link->cg;
+ list_for_each_entry(link, &cont->cset_links, cset_link) {
+ struct css_set *cset = link->cset;
struct task_struct *task;
int count = 0;
seq_printf(seq, "css_set %p\n", cset);
--
1.8.2.1

2013-06-13 02:35:19

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 03/11] cgroup: bring some sanity to naming around cg_cgroup_link

On 2013/6/13 5:03, Tejun Heo wrote:
> cgroups and css_sets are mapped M:N and this M:N mapping is
> represented by struct cg_cgroup_link which forms linked lists on both
> sides. The naming around this already confusing struct
> is pretty bad.
>
> From cgroup side, it starts off ->css_sets and runs through
> ->cgrp_link_list. From css_set side, it starts off ->cg_links and
> runs through ->cg_link_list. This is rather reversed, as
> cgrp_link_list is used to iterate css_sets and cg_link_list to
> iterate cgroups.
> Also, this is the only place which is still using the confusing "cg"
> for css_sets. This patch cleans it up a bit.
>
> * s/cgroup->css_sets/cgroup->cset_links/
> s/css_set->cg_links/css_set->cgrp_links/
> s/cgroup_iter->cg_link/cgroup_iter->cset_link/
>
> * s/cg_cgroup_link/cgrp_cset_link/
>
> * s/cgrp_cset_link->cg/cgrp_cset_link->cset/
> s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
> s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
>
> * s/init_css_set_link/init_cgrp_cset_link/
> s/free_cg_links/free_cgrp_cset_links/
> s/allocate_cg_links/allocate_cgrp_cset_links/
>
> * s/cgl[12]/link[12]/ in compare_css_sets()
>
> * s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
> adjustments.
>
> * Comment and whiteline adjustments.
>
> After the changes, we have
>
> list_for_each_entry(link, &cont->cset_links, cset_link) {
> struct css_set *cset = link->cset;
>
> instead of
>
> list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
> struct css_set *cset = link->cg;
>

Those renames really make the code more readable! I have to think for a
while every time I see those names.

2013-06-13 02:37:05

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On 2013/6/13 5:03, Tejun Heo wrote:
> There's no point in using kmalloc() and list_del() instead of the
> clearing variants for trivial stuff. We can live dangerously
> elsewhere. Use kzalloc() and list_del_init() instead and drop 0
> inits.
>

Do you mean we prefer list_del_init() over list_del() in general? Then
in which cases do we prefer list_del()?

2013-06-13 02:38:28

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 05/11] cgroup: clean up css_[try]get() and css_put()

On 2013/6/13 5:03, Tejun Heo wrote:
> * __css_get() isn't used by anyone. Fold it into css_get().
>
> * Add proper function comments to all css reference functions.
>
> This patch is purely cosmetic.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> include/linux/cgroup.h | 48 ++++++++++++++++++++++++------------------------
> 1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 0e32855..7b16882 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -94,33 +94,31 @@ enum {
> CSS_ONLINE = (1 << 1), /* between ->css_online() and ->css_offline() */
> };
>
> -/* Caller must verify that the css is not for root cgroup */
> -static inline void __css_get(struct cgroup_subsys_state *css, int count)
> -{
> - atomic_add(count, &css->refcnt);
> -}
> -
> -/*
> - * Call css_get() to hold a reference on the css; it can be used
> - * for a reference obtained via:
> - * - an existing ref-counted reference to the css
> - * - task->cgroups for a locked task
> +/**
> + * css_get - obtain a reference on the specified css
> + * @css: taget css

s/taget/target

> + *
> + * The caller must already have a reference.
> */
> -
> static inline void css_get(struct cgroup_subsys_state *css)
> {
> /* We don't need to reference count the root state */
> if (!(css->flags & CSS_ROOT))
> - __css_get(css, 1);
> + atomic_inc(&css->refcnt);
> }
>

2013-06-13 02:38:37

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Thu, Jun 13, 2013 at 10:36:40AM +0800, Li Zefan wrote:
> On 2013/6/13 5:03, Tejun Heo wrote:
> > There's no point in using kmalloc() and list_del() instead of the
> > clearing variants for trivial stuff. We can live dangerously
> > elsewhere. Use kzalloc() and list_del_init() instead and drop 0
> > inits.
> >
>
> Do you mean we prefer list_del_init() over list_del() in general? Then
> in which cases do we prefer list_del()?

IMO, list_del() is preferred when the object shouldn't be reused (i.e.
it gets taken off a list and then it's freed). list_del_init() could
hide bugs.
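
For instance (a made-up sketch, not from the cgroup code; struct foo and
release_foo() are hypothetical):

#include <linux/list.h>
#include <linux/slab.h>

struct foo {
	struct list_head node;
	int data;
};

static void release_foo(struct foo *obj)
{
	list_del(&obj->node);	/* node.next/prev now hold poison values */
	kfree(obj);
}

A buggy caller that keeps walking obj->node after release_foo() trips
over the poison pointers right away. With list_del_init() the node looks
like a valid empty list, so the same use-after-free can go unnoticed.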

2013-06-13 02:39:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

Hello,

On Wed, Jun 12, 2013 at 7:36 PM, Li Zefan <[email protected]> wrote:
> Do you mean we prefer list_del_init() over list_del() in general? Then

Yes.

> in which cases do we prefer list_del()?

Nowadays, I don't think we ever prefer list_del(). Maybe if it can be
shown that the extra init part is noticeably expensive?

--
tejun

2013-06-13 02:41:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

Hello,

On Wed, Jun 12, 2013 at 7:38 PM, Kent Overstreet <[email protected]> wrote:
> IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> it gets taken off a list and then it's freed). list_del_init() could
> hide bugs.

Nah... use-after-frees are detected much more reliably by poisoning
anyway. Using list_del() instead of list_del_init() to hunt down
use-after-free isn't a good idea because you're likely to corrupt the
memory of an unrelated area. I really don't see much point in using
list_del().

Thanks.

--
tejun

2013-06-13 02:43:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Wed, Jun 12, 2013 at 07:41:15PM -0700, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 12, 2013 at 7:38 PM, Kent Overstreet <[email protected]> wrote:
> > IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> > it gets taken off a list and then it's freed). list_del_init() could
> > hide bugs.
>
> Nah... use-after-frees are detected much more reliably by poisoning
> anyway. Using list_del() instead of list_del_init() to hunt down
> use-after-free isn't a good idea because you're likely to corrupt the
> memory of unrelated area. I really don't see much point in using
> list_del().

list_del() does do poisoning - and list debugging is cheaper to enable
than full slab debugging.
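
For reference, the two helpers read roughly like this around this kernel
version (paraphrased from include/linux/list.h, !CONFIG_DEBUG_LIST
variant):

static inline void list_del(struct list_head *entry)
{
	__list_del(entry->prev, entry->next);
	entry->next = LIST_POISON1;	/* non-dereferenceable poison */
	entry->prev = LIST_POISON2;
}

static inline void list_del_init(struct list_head *entry)
{
	__list_del_entry(entry);	/* unlink without poisoning */
	INIT_LIST_HEAD(entry);		/* node becomes an empty list */
}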

2013-06-13 02:49:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> list_del() does do poisoning - and list debugging is cheaper to enable
> than full slab debugging.

Ah, right, now we have DEBUG_LIST. Completely forgot about that. I
don't think the cost difference matters that much as long as there are
enough people running with slab debugging, but, yeah, with DEBUG_LIST,
leaving list_del() alone would actually be better. I'll drop that
part.

Thanks.

--
tejun

2013-06-13 02:52:08

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Wed, Jun 12, 2013 at 07:48:59PM -0700, Tejun Heo wrote:
> On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> > list_del() does do poisoning - and list debugging is cheaper to enable
> > than full slab debugging.
>
> Ah, right, now we have DEBUG_LIST. Completely forgot about that. I
> don't think the cost difference matters that much as long as there are
> enough people running with slab debugging, but, yeah, with DEBUG_LIST,
> leaving list_del() alone would actually be better. I'll drop that
> part.

I can't remember if it was Fedora or RH (or both?) but in one of those
they actually leave it enabled in their production kernels. Someone was
blogging about the bugs it found...

2013-06-13 02:56:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Wed, Jun 12, 2013 at 07:52:02PM -0700, Kent Overstreet wrote:
> On Wed, Jun 12, 2013 at 07:48:59PM -0700, Tejun Heo wrote:
> > On Wed, Jun 12, 2013 at 07:43:10PM -0700, Kent Overstreet wrote:
> > > list_del() does do poisoning - and list debugging is cheaper to enable
> > > than full slab debugging.
> >
> > Ah, right, now we have DEBUG_LIST. Completely forgot about that. I
> > don't think the cost difference matters that much as long as there are
> > enough people running with slab debugging, but, yeah, with DEBUG_LIST,
> > leaving list_del() alone would actually be better. I'll drop that
> > part.
>
> I can't remember if it was Fedora or RH (or both?) but in one of those
> they actually leave it enabled in their production kernels. Someone was
> blogging about the bugs it found...

And we poison regardless of DEBUG_LIST and it looks like we have been
doing that forever. I have no idea why I was thinking list_del() didn't
poison. Maybe it was something which got stuck in my brain from
before the git history or I'm just hallucinating. Anyways, yeap,
list_del() is better.

Thanks.

--
tejun

2013-06-13 03:05:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On Wed, Jun 12, 2013 at 07:56:23PM -0700, Tejun Heo wrote:
> poison. Maybe it was something which got stuck in my brain from
> before the git history or I'm just hallucinating. Anyways, yeap,

Just checked 2.4. It didn't poison then. Somehow I never got that
out of my brain all these years. #feelingancient

--
tejun

2013-06-13 03:13:44

by Zefan Li

[permalink] [raw]
Subject: Re: [PATCH 04/11] cgroup: use kzalloc() and list_del_init()

On 2013/6/13 10:38, Kent Overstreet wrote:
> On Thu, Jun 13, 2013 at 10:36:40AM +0800, Li Zefan wrote:
>> On 2013/6/13 5:03, Tejun Heo wrote:
>>> There's no point in using kmalloc() and list_del() instead of the
>>> clearing variants for trivial stuff. We can live dangerously
>>> elsewhere. Use kzalloc() and list_del_init() instead and drop 0
>>> inits.
>>>
>>
>> Do you mean we prefer list_del_init() over list_del() in general? Then
>> in which cases do we prefer list_del()?
>
> IMO, list_del() is preferred when the object shouldn't be reused (i.e.
> it gets taken off a list and then it's freed).

Yeah, this is what I have in mind. I would wonder why list_del_init()
is used if I know that the object won't be used anymore.

> list_del_init() could
> hide bugs.
>

Same here. I do worry a bit about this.