Hello,
This patchset implements proper hierarchy support for cgroup_freezer as
discussed in "[RFC] cgroup TODOs"[1].
The patchset first implements generic cgroup iteration macros -
cgroup_for_each_children(), cgroup_for_each_descendant_{pre|post}().
Combined with the newly introduced ->post_create() callback, this
allows controllers to implement reliable iteration over descendants
without messing with cgroup internal locking. Controllers can perform
reliable walking using simple hierarchy-wide locking or finer-grained
parent-children locking.
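As a very rough sketch (not part of any patch in this series; the
my_* names are made up), the hierarchy-wide locking variant would look
something like the following, with ->pre_destroy() taking the same
mutex.

        static DEFINE_MUTEX(my_hierarchy_mutex);  /* hypothetical subsystem-wide lock */

        static void my_post_create(struct cgroup *cgrp)
        {
                mutex_lock(&my_hierarchy_mutex);
                /* inherit config from cgrp->parent and mark @cgrp online */
                mutex_unlock(&my_hierarchy_mutex);
        }

        static void my_update_config(struct cgroup *cgrp)
        {
                struct cgroup *pos;

                mutex_lock(&my_hierarchy_mutex);
                rcu_read_lock();
                cgroup_for_each_descendant_pre(pos, cgrp) {
                        /*
                         * ->post_create() and ->pre_destroy() also take
                         * my_hierarchy_mutex, so every @pos seen here is
                         * online and can't escape this update.
                         */
                }
                rcu_read_unlock();
                mutex_unlock(&my_hierarchy_mutex);
        }

The finer-grained variant used by cgroup_freezer in this series locks
only the parent and the new cgroup in ->post_create() instead.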
Using the iteration macros and ->post_create(), cgroup_freezer is
updated to propagate state updates to and collect FROZEN completions
from the descendants. This removes .broken_hierarchy marking from
cgroup_freezer.
cgroup_freezer hierarchy support is implemented using finer-grained
locking, not because it's strictly necessary but because I wanted an
example controller doing it that way.
This patchset contains the following nine patches.
0001-cgroup-add-cgroup_subsys-post_create.patch
0002-cgroup-Use-rculist-ops-for-cgroup-children.patch
0003-cgroup-implement-generic-child-descendant-walk-macro.patch
0004-cgroup_freezer-trivial-cleanups.patch
0005-cgroup_freezer-prepare-freezer_change_state-for-full.patch
0006-cgroup_freezer-make-freezer-state-mask-of-flags.patch
0007-cgroup_freezer-introduce-CGROUP_FREEZING_-SELF-PAREN.patch
0008-cgroup_freezer-add-post_create-and-pre_destroy-and-t.patch
0009-cgroup_freezer-implement-proper-hierarchy-support.patch
0001-0003 implement cgroup descendant iterators.
0004-0008 prepare cgroup_freezer for hierarchy support. 0009
implements it.
This patchset is on top of
v3.6 (a0d271cbfe)
+ [2] the first three patches of
"memcg/cgroup: do not fail fail on pre_destroy callbacks" patchset
+ [3] "cgroup: simplify cgroup removal path" v2 patchset
with cgroup/for-3.8 pulled into it. The branch is rather floaty at
the moment, so it would be easiest to pull the following branch for
review.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup_freezer-hierarchy
Thanks.
kernel/cgroup.c | 106 +++++++++++++-
kernel/cgroup_freezer.c | 359 +++++++++++++++++++++++++++++++++++-------------
3 files changed, 445 insertions(+), 104 deletions(-)
--
tejun
[1] http://thread.gmane.org/gmane.linux.kernel.containers/23698
[2] http://thread.gmane.org/gmane.linux.kernel.cgroups/4757
[3] http://thread.gmane.org/gmane.linux.kernel.cgroups/4861
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocations successfully and is
considered "live" by cgroup.
This becomes a problem later when we add generic descendant walking
to cgroup for use by controllers, as controllers don't have a
synchronization point where they can synchronize against new cgroups
appearing in such walks.
This patch adds ->post_create(). It's called after all ->create()
callbacks have succeeded and the cgroup is linked into the generic
cgroup hierarchy. It serves as the counterpart of ->pre_destroy().
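For illustration only - the foo_* names below are hypothetical and not
part of this patch - a controller wires the new callback up like any
other cgroup_subsys operation:

        static void foo_post_create(struct cgroup *cgrp)
        {
                /* @cgrp is fully created and linked; e.g. mark it online here */
        }

        struct cgroup_subsys foo_subsys = {
                .name           = "foo",
                .create         = foo_create,
                .post_create    = foo_post_create,
                .pre_destroy    = foo_pre_destroy,
                .destroy        = foo_destroy,
                .subsys_id      = foo_subsys_id,
        };

cgroup_freezer does exactly this in a later patch of this series.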
Signed-off-by: Tejun Heo <[email protected]>
Cc: Glauber Costa <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index fe876a7..b442122 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -438,6 +438,7 @@ int cgroup_taskset_size(struct cgroup_taskset *tset);
struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
+ void (*post_create)(struct cgroup *cgrp);
void (*pre_destroy)(struct cgroup *cgrp);
void (*destroy)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index e3045ad..f05d992 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4060,10 +4060,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
if (err < 0)
goto err_remove;
- /* each css holds a ref to the cgroup's dentry */
- for_each_subsys(root, ss)
+ for_each_subsys(root, ss) {
+ /* each css holds a ref to the cgroup's dentry */
dget(dentry);
+ /* creation succeeded, notify subsystems */
+ if (ss->post_create)
+ ss->post_create(cgrp);
+ }
+
/* The cgroup directory was pre-locked for us */
BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
@@ -4281,6 +4286,9 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
ss->active = 1;
+ if (ss->post_create)
+ ss->post_create(&ss->root->top_cgroup);
+
/* this function shouldn't be used with modular subsystems, since they
* need to register a subsys_id, among other things */
BUG_ON(ss->module);
--
1.7.11.7
Use RCU-safe list operations for cgroup->children. This will be used
to implement cgroup children / descendant walking, which can be used
by controllers.
Note that cgroup_create() now puts a new cgroup at the end of the
->children list instead of head. This isn't strictly necessary but is
done so that the iteration order is more conventional.
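To illustrate what the rculist ops buy us (this snippet is just a
sketch, not part of the patch), readers can now walk ->children
locklessly under the RCU read lock; verifying that each child is
actually live remains the caller's job:

        struct cgroup *pos;

        rcu_read_lock();
        list_for_each_entry_rcu(pos, &cgrp->children, sibling) {
                /*
                 * @pos may not have finished ->post_create() yet or may
                 * already be on its way out; the subsystem must check.
                 */
        }
        rcu_read_unlock();

The following patch wraps this pattern in proper iteration macros.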
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 8 +++-----
2 files changed, 4 insertions(+), 5 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b442122..90c33eb 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/rcupdate.h>
+#include <linux/rculist.h>
#include <linux/cgroupstats.h>
#include <linux/prio_heap.h>
#include <linux/rwsem.h>
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f05d992..cc5d2a0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1650,7 +1650,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
free_cg_links(&tmp_cg_links);
- BUG_ON(!list_empty(&root_cgrp->sibling));
BUG_ON(!list_empty(&root_cgrp->children));
BUG_ON(root->number_of_cgroups != 1);
@@ -1699,7 +1698,6 @@ static void cgroup_kill_sb(struct super_block *sb) {
BUG_ON(root->number_of_cgroups != 1);
BUG_ON(!list_empty(&cgrp->children));
- BUG_ON(!list_empty(&cgrp->sibling));
mutex_lock(&cgroup_mutex);
mutex_lock(&cgroup_root_mutex);
@@ -4053,7 +4051,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
}
}
- list_add(&cgrp->sibling, &cgrp->parent->children);
+ list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
root->number_of_cgroups++;
err = cgroup_create_dir(cgrp, dentry, mode);
@@ -4084,7 +4082,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
err_remove:
- list_del(&cgrp->sibling);
+ list_del_rcu(&cgrp->sibling);
root->number_of_cgroups--;
err_destroy:
@@ -4210,7 +4208,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
raw_spin_unlock(&release_list_lock);
/* delete this cgroup from parent->children */
- list_del_init(&cgrp->sibling);
+ list_del_rcu(&cgrp->sibling);
list_del_init(&cgrp->allcg_node);
--
1.7.11.7
Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING the OR
of the two flags. This is to prepare for full hierarchy support.
freezer_apply_state() is updated so that it can handle setting and
clearing of both flags. The two flags are also exposed to userland
via the read-only files self_freezing and parent_freezing.
Other than the added cgroupfs files, this patch doesn't introduce any
behavior change.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 55 ++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 8 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index e76aa9f..b8ad93c 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -23,8 +23,12 @@
#include <linux/seq_file.h>
enum freezer_state_flags {
- CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
+ CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
+ CGROUP_FREEZING_PARENT = (1 << 2), /* the parent freezer is freezing */
CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
+
+ /* mask for all FREEZING flags */
+ CGROUP_FREEZING = CGROUP_FREEZING_SELF | CGROUP_FREEZING_PARENT,
};
struct freezer {
@@ -245,8 +249,13 @@ static void unfreeze_cgroup(struct freezer *freezer)
* freezer_apply_state - apply state change to a single cgroup_freezer
* @freezer: freezer to apply state change to
* @freeze: whether to freeze or unfreeze
+ * @state: CGROUP_FREEZING_* flag to set or clear
+ *
+ * Set or clear @state on @cgroup according to @freeze, and perform
+ * freezing or thawing as necessary.
*/
-static void freezer_apply_state(struct freezer *freezer, bool freeze)
+static void freezer_apply_state(struct freezer *freezer, bool freeze,
+ unsigned int state)
{
/* also synchronizes against task migration, see freezer_attach() */
lockdep_assert_held(&freezer->lock);
@@ -254,13 +263,19 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
if (freeze) {
if (!(freezer->state & CGROUP_FREEZING))
atomic_inc(&system_freezing_cnt);
- freezer->state |= CGROUP_FREEZING;
+ freezer->state |= state;
freeze_cgroup(freezer);
} else {
- if (freezer->state & CGROUP_FREEZING)
- atomic_dec(&system_freezing_cnt);
- freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
- unfreeze_cgroup(freezer);
+ bool was_freezing = freezer->state & CGROUP_FREEZING;
+
+ freezer->state &= ~state;
+
+ if (!(freezer->state & CGROUP_FREEZING)) {
+ if (was_freezing)
+ atomic_dec(&system_freezing_cnt);
+ freezer->state &= ~CGROUP_FROZEN;
+ unfreeze_cgroup(freezer);
+ }
}
}
@@ -275,7 +290,7 @@ static void freezer_change_state(struct freezer *freezer, bool freeze)
{
/* update @freezer */
spin_lock_irq(&freezer->lock);
- freezer_apply_state(freezer, freeze);
+ freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
spin_unlock_irq(&freezer->lock);
}
@@ -295,6 +310,20 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
return 0;
}
+static u64 freezer_self_freezing_read(struct cgroup *cgroup, struct cftype *cft)
+{
+ struct freezer *freezer = cgroup_freezer(cgroup);
+
+ return (bool)(freezer->state & CGROUP_FREEZING_SELF);
+}
+
+static u64 freezer_parent_freezing_read(struct cgroup *cgroup, struct cftype *cft)
+{
+ struct freezer *freezer = cgroup_freezer(cgroup);
+
+ return (bool)(freezer->state & CGROUP_FREEZING_PARENT);
+}
+
static struct cftype files[] = {
{
.name = "state",
@@ -302,6 +331,16 @@ static struct cftype files[] = {
.read_seq_string = freezer_read,
.write_string = freezer_write,
},
+ {
+ .name = "self_freezing",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = freezer_self_freezing_read,
+ },
+ {
+ .name = "parent_freezing",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .read_u64 = freezer_parent_freezing_read,
+ },
{ } /* terminate */
};
--
1.7.11.7
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in a hierarchy, but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.
This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.
freezer_post_create() locks the parent and the new cgroup and inherits
the parent's state, while freezer_change_state() applies the new state
top-down using cgroup_for_each_descendant_pre(), which guarantees that
no child can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.
Synchronization could be coarser and easier by using a single mutex
to protect all hierarchy operations. The finer-grained approach was
used because it wasn't too difficult for cgroup_freezer and I think
it's beneficial to have an example implementation; cgroup_freezer is
rather simple and can serve as a good one.
As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.
Note that this patch changes userland-visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned about via .broken_hierarchy.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 161 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 123 insertions(+), 38 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 4f12d31..3262537 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -22,6 +22,13 @@
#include <linux/freezer.h>
#include <linux/seq_file.h>
+/*
+ * A cgroup is freezing if any FREEZING flags are set. FREEZING_SELF is
+ * set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
+ * for "THAWED". FREEZING_PARENT is set if the parent freezer is FREEZING
+ * for whatever reason. IOW, a cgroup has FREEZING_PARENT set if one of
+ * its ancestors has FREEZING_SELF set.
+ */
enum freezer_state_flags {
CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
@@ -50,6 +57,15 @@ static inline struct freezer *task_freezer(struct task_struct *task)
struct freezer, css);
}
+static struct freezer *parent_freezer(struct freezer *freezer)
+{
+ struct cgroup *pcg = freezer->css.cgroup->parent;
+
+ if (pcg)
+ return cgroup_freezer(pcg);
+ return NULL;
+}
+
bool cgroup_freezing(struct task_struct *task)
{
bool ret;
@@ -74,17 +90,6 @@ static const char *freezer_state_strs(unsigned int state)
return "THAWED";
};
-/*
- * State diagram
- * Transitions are caused by userspace writes to the freezer.state file.
- * The values in parenthesis are state labels. The rest are edge labels.
- *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- * ^ ^ | |
- * | \_______THAWED_______/ |
- * \__________________________THAWED____________/
- */
-
struct cgroup_subsys freezer_subsys;
static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
@@ -103,15 +108,34 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
* freezer_post_create - commit creation of a freezer cgroup
* @cgroup: cgroup being created
*
- * We're committing to creation of @cgroup. Mark it online.
+ * We're committing to creation of @cgroup. Mark it online and inherit
+ * parent's freezing state while holding both parent's and our
+ * freezer->lock.
*/
static void freezer_post_create(struct cgroup *cgroup)
{
struct freezer *freezer = cgroup_freezer(cgroup);
+ struct freezer *parent = parent_freezer(freezer);
+
+ /*
+ * The following double locking and freezing state inheritance
+ * guarantee that @cgroup can never escape ancestors' freezing
+ * states. See cgroup_for_each_descendant_pre() for details.
+ */
+ if (parent)
+ spin_lock_irq(&parent->lock);
+ spin_lock_nested(&freezer->lock, SINGLE_DEPTH_NESTING);
- spin_lock_irq(&freezer->lock);
freezer->state |= CGROUP_FREEZER_ONLINE;
- spin_unlock_irq(&freezer->lock);
+
+ if (parent && (parent->state & CGROUP_FREEZING)) {
+ freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
+ atomic_inc(&system_freezing_cnt);
+ }
+
+ spin_unlock(&freezer->lock);
+ if (parent)
+ spin_unlock_irq(&parent->lock);
}
/**
@@ -153,6 +177,7 @@ static void freezer_attach(struct cgroup *new_cgrp, struct cgroup_taskset *tset)
{
struct freezer *freezer = cgroup_freezer(new_cgrp);
struct task_struct *task;
+ bool clear_frozen = false;
spin_lock_irq(&freezer->lock);
@@ -172,10 +197,25 @@ static void freezer_attach(struct cgroup *new_cgrp, struct cgroup_taskset *tset)
} else {
freeze_task(task);
freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = true;
}
}
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Propagate FROZEN clearing upwards. We may race with
+ * update_if_frozen(), but as long as both work bottom-up, either
+ * update_if_frozen() sees child's FROZEN cleared or we clear the
+ * parent's FROZEN later. No parent w/ !FROZEN children can be
+ * left FROZEN.
+ */
+ while (clear_frozen && (freezer = parent_freezer(freezer))) {
+ spin_lock_irq(&freezer->lock);
+ freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = freezer->state & CGROUP_FREEZING;
+ spin_unlock_irq(&freezer->lock);
+ }
}
static void freezer_fork(struct task_struct *task)
@@ -200,24 +240,47 @@ out:
rcu_read_unlock();
}
-/*
- * We change from FREEZING to FROZEN lazily if the cgroup was only
- * partially frozen when we exitted write. Caller must hold freezer->lock.
+/**
+ * update_if_frozen - update whether a cgroup finished freezing
+ * @cgroup: cgroup of interest
+ *
+ * Once FREEZING is initiated, transition to FROZEN is lazily updated by
+ * calling this function. If the current state is FREEZING but not FROZEN,
+ * this function checks whether all tasks of this cgroup and the descendant
+ * cgroups finished freezing and, if so, sets FROZEN.
+ *
+ * The caller is responsible for grabbing RCU read lock and calling
+ * update_if_frozen() on all descendants prior to invoking this function.
*
* Task states and freezer state might disagree while tasks are being
* migrated into or out of @cgroup, so we can't verify task states against
* @freezer state here. See freezer_attach() for details.
*/
-static void update_if_frozen(struct freezer *freezer)
+static void update_if_frozen(struct cgroup *cgroup)
{
- struct cgroup *cgroup = freezer->css.cgroup;
+ struct freezer *freezer = cgroup_freezer(cgroup);
+ struct cgroup *pos;
struct cgroup_iter it;
struct task_struct *task;
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ spin_lock_irq(&freezer->lock);
+
if (!(freezer->state & CGROUP_FREEZING) ||
(freezer->state & CGROUP_FROZEN))
- return;
+ goto out_unlock;
+
+ /* are all (live) children frozen? */
+ cgroup_for_each_child(pos, cgroup) {
+ struct freezer *child = cgroup_freezer(pos);
+ if ((child->state & CGROUP_FREEZER_ONLINE) &&
+ !(child->state & CGROUP_FROZEN))
+ goto out_unlock;
+ }
+
+ /* are all tasks frozen? */
cgroup_iter_start(cgroup, &it);
while ((task = cgroup_iter_next(cgroup, &it))) {
@@ -229,27 +292,32 @@ static void update_if_frozen(struct freezer *freezer)
* the usual frozen condition.
*/
if (!frozen(task) && !freezer_should_skip(task))
- goto notyet;
+ goto out_iter_end;
}
}
freezer->state |= CGROUP_FROZEN;
-notyet:
+out_iter_end:
cgroup_iter_end(cgroup, &it);
+out_unlock:
+ spin_unlock_irq(&freezer->lock);
}
static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
- struct freezer *freezer = cgroup_freezer(cgroup);
- unsigned int state;
+ struct cgroup *pos;
- spin_lock_irq(&freezer->lock);
- update_if_frozen(freezer);
- state = freezer->state;
- spin_unlock_irq(&freezer->lock);
+ rcu_read_lock();
- seq_puts(m, freezer_state_strs(state));
+ /* update states bottom-up */
+ cgroup_for_each_descendant_post(pos, cgroup)
+ update_if_frozen(pos);
+ update_if_frozen(cgroup);
+
+ rcu_read_unlock();
+
+ seq_puts(m, freezer_state_strs(cgroup_freezer(cgroup)->state));
seq_putc(m, '\n');
return 0;
}
@@ -320,14 +388,39 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
* @freezer: freezer of interest
* @freeze: whether to freeze or thaw
*
- * Freeze or thaw @cgroup according to @freeze.
+ * Freeze or thaw @freezer according to @freeze. The operations are
+ * recursive - all descendants of @freezer will be affected.
*/
static void freezer_change_state(struct freezer *freezer, bool freeze)
{
+ struct cgroup *pos;
+
/* update @freezer */
spin_lock_irq(&freezer->lock);
freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Update all its descendants in pre-order traversal. Each
+ * descendant will try to inherit its parent's FREEZING state as
+ * CGROUP_FREEZING_PARENT.
+ */
+ rcu_read_lock();
+ cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
+ struct freezer *pos_f = cgroup_freezer(pos);
+ struct freezer *parent = parent_freezer(freezer);
+
+ /*
+ * Our update to @parent->state is already visible which is
+ * all we need. No need to lock @parent. For more info on
+ * synchronization, see freezer_post_create().
+ */
+ spin_lock_irq(&pos_f->lock);
+ freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
+ CGROUP_FREEZING_PARENT);
+ spin_unlock_irq(&pos_f->lock);
+ }
+ rcu_read_unlock();
}
static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
@@ -390,12 +483,4 @@ struct cgroup_subsys freezer_subsys = {
.attach = freezer_attach,
.fork = freezer_fork,
.base_cftypes = files,
-
- /*
- * freezer subsys doesn't handle hierarchy at all. Frozen state
- * should be inherited through the hierarchy - if a parent is
- * frozen, all its children should be frozen. Fix it and remove
- * the following.
- */
- .broken_hierarchy = true,
};
--
1.7.11.7
A cgroup is online and visible to iteration between ->post_create()
and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
toggles it from the newly added freezer_post_create() and
freezer_pre_destroy() while holding freezer->lock, such that a
cgroup_freezer can be reliably determined to be online. This will be
used by full hierarchy support.
An ONLINE test is added to freezer_apply_state() but it currently doesn't
make any difference as freezer_write() can only be called for an
online cgroup.
Adjusting system_freezing_cnt on destruction is moved from
freezer_destroy() to the new freezer_pre_destroy() for consistency.
This patch doesn't introduce any noticeable behavior change.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 42 ++++++++++++++++++++++++++++++++++++++++--
1 file changed, 40 insertions(+), 2 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index b8ad93c..4f12d31 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -23,6 +23,7 @@
#include <linux/seq_file.h>
enum freezer_state_flags {
+ CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
CGROUP_FREEZING_PARENT = (1 << 2), /* the parent freezer is freezing */
CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
@@ -98,13 +99,45 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
return &freezer->css;
}
-static void freezer_destroy(struct cgroup *cgroup)
+/**
+ * freezer_post_create - commit creation of a freezer cgroup
+ * @cgroup: cgroup being created
+ *
+ * We're committing to creation of @cgroup. Mark it online.
+ */
+static void freezer_post_create(struct cgroup *cgroup)
{
struct freezer *freezer = cgroup_freezer(cgroup);
+ spin_lock_irq(&freezer->lock);
+ freezer->state |= CGROUP_FREEZER_ONLINE;
+ spin_unlock_irq(&freezer->lock);
+}
+
+/**
+ * freezer_pre_destroy - initiate destruction of @cgroup
+ * @cgroup: cgroup being destroyed
+ *
+ * @cgroup is going away. Mark it dead and decrement system_freezing_count
+ * if it was holding one.
+ */
+static void freezer_pre_destroy(struct cgroup *cgroup)
+{
+ struct freezer *freezer = cgroup_freezer(cgroup);
+
+ spin_lock_irq(&freezer->lock);
+
if (freezer->state & CGROUP_FREEZING)
atomic_dec(&system_freezing_cnt);
- kfree(freezer);
+
+ freezer->state = 0;
+
+ spin_unlock_irq(&freezer->lock);
+}
+
+static void freezer_destroy(struct cgroup *cgroup)
+{
+ kfree(cgroup_freezer(cgroup));
}
/*
@@ -260,6 +293,9 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
/* also synchronizes against task migration, see freezer_attach() */
lockdep_assert_held(&freezer->lock);
+ if (!(freezer->state & CGROUP_FREEZER_ONLINE))
+ return;
+
if (freeze) {
if (!(freezer->state & CGROUP_FREEZING))
atomic_inc(&system_freezing_cnt);
@@ -347,6 +383,8 @@ static struct cftype files[] = {
struct cgroup_subsys freezer_subsys = {
.name = "freezer",
.create = freezer_create,
+ .post_create = freezer_post_create,
+ .pre_destroy = freezer_pre_destroy,
.destroy = freezer_destroy,
.subsys_id = freezer_subsys_id,
.attach = freezer_attach,
--
1.7.11.7
* Clean up indentation and line breaks. Drop the invalid comment
about freezer->lock.
* Make all internal functions take @freezer instead of both @cgroup
and @freezer.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 41 +++++++++++++++++++----------------------
1 file changed, 19 insertions(+), 22 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index bedefd9..975b3d8 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -29,17 +29,15 @@ enum freezer_state {
};
struct freezer {
- struct cgroup_subsys_state css;
- enum freezer_state state;
- spinlock_t lock; /* protects _writes_ to state */
+ struct cgroup_subsys_state css;
+ enum freezer_state state;
+ spinlock_t lock;
};
-static inline struct freezer *cgroup_freezer(
- struct cgroup *cgroup)
+static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)
{
- return container_of(
- cgroup_subsys_state(cgroup, freezer_subsys_id),
- struct freezer, css);
+ return container_of(cgroup_subsys_state(cgroup, freezer_subsys_id),
+ struct freezer, css);
}
static inline struct freezer *task_freezer(struct task_struct *task)
@@ -180,8 +178,9 @@ out:
* migrated into or out of @cgroup, so we can't verify task states against
* @freezer state here. See freezer_attach() for details.
*/
-static void update_if_frozen(struct cgroup *cgroup, struct freezer *freezer)
+static void update_if_frozen(struct freezer *freezer)
{
+ struct cgroup *cgroup = freezer->css.cgroup;
struct cgroup_iter it;
struct task_struct *task;
@@ -211,12 +210,11 @@ notyet:
static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
- struct freezer *freezer;
+ struct freezer *freezer = cgroup_freezer(cgroup);
enum freezer_state state;
- freezer = cgroup_freezer(cgroup);
spin_lock_irq(&freezer->lock);
- update_if_frozen(cgroup, freezer);
+ update_if_frozen(freezer);
state = freezer->state;
spin_unlock_irq(&freezer->lock);
@@ -225,8 +223,9 @@ static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
return 0;
}
-static void freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
+static void freeze_cgroup(struct freezer *freezer)
{
+ struct cgroup *cgroup = freezer->css.cgroup;
struct cgroup_iter it;
struct task_struct *task;
@@ -236,8 +235,9 @@ static void freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
cgroup_iter_end(cgroup, &it);
}
-static void unfreeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
+static void unfreeze_cgroup(struct freezer *freezer)
{
+ struct cgroup *cgroup = freezer->css.cgroup;
struct cgroup_iter it;
struct task_struct *task;
@@ -247,11 +247,9 @@ static void unfreeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
cgroup_iter_end(cgroup, &it);
}
-static void freezer_change_state(struct cgroup *cgroup,
+static void freezer_change_state(struct freezer *freezer,
enum freezer_state goal_state)
{
- struct freezer *freezer = cgroup_freezer(cgroup);
-
/* also synchronizes against task migration, see freezer_attach() */
spin_lock_irq(&freezer->lock);
@@ -260,13 +258,13 @@ static void freezer_change_state(struct cgroup *cgroup,
if (freezer->state != CGROUP_THAWED)
atomic_dec(&system_freezing_cnt);
freezer->state = CGROUP_THAWED;
- unfreeze_cgroup(cgroup, freezer);
+ unfreeze_cgroup(freezer);
break;
case CGROUP_FROZEN:
if (freezer->state == CGROUP_THAWED)
atomic_inc(&system_freezing_cnt);
freezer->state = CGROUP_FREEZING;
- freeze_cgroup(cgroup, freezer);
+ freeze_cgroup(freezer);
break;
default:
BUG();
@@ -275,8 +273,7 @@ static void freezer_change_state(struct cgroup *cgroup,
spin_unlock_irq(&freezer->lock);
}
-static int freezer_write(struct cgroup *cgroup,
- struct cftype *cft,
+static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
const char *buffer)
{
enum freezer_state goal_state;
@@ -288,7 +285,7 @@ static int freezer_write(struct cgroup *cgroup,
else
return -EINVAL;
- freezer_change_state(cgroup, goal_state);
+ freezer_change_state(cgroup_freezer(cgroup), goal_state);
return 0;
}
--
1.7.11.7
* Make freezer_change_state() take bool @freeze instead of enum
freezer_state.
* Separate freezer_apply_state() out of freezer_change_state().
This makes freezer_change_state() a rather silly thin wrapper. It
will be filled with hierarchy handling later on.
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 48 ++++++++++++++++++++++++++++++------------------
1 file changed, 30 insertions(+), 18 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 975b3d8..2690830 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -247,45 +247,57 @@ static void unfreeze_cgroup(struct freezer *freezer)
cgroup_iter_end(cgroup, &it);
}
-static void freezer_change_state(struct freezer *freezer,
- enum freezer_state goal_state)
+/**
+ * freezer_apply_state - apply state change to a single cgroup_freezer
+ * @freezer: freezer to apply state change to
+ * @freeze: whether to freeze or unfreeze
+ */
+static void freezer_apply_state(struct freezer *freezer, bool freeze)
{
/* also synchronizes against task migration, see freezer_attach() */
- spin_lock_irq(&freezer->lock);
+ lockdep_assert_held(&freezer->lock);
- switch (goal_state) {
- case CGROUP_THAWED:
- if (freezer->state != CGROUP_THAWED)
- atomic_dec(&system_freezing_cnt);
- freezer->state = CGROUP_THAWED;
- unfreeze_cgroup(freezer);
- break;
- case CGROUP_FROZEN:
+ if (freeze) {
if (freezer->state == CGROUP_THAWED)
atomic_inc(&system_freezing_cnt);
freezer->state = CGROUP_FREEZING;
freeze_cgroup(freezer);
- break;
- default:
- BUG();
+ } else {
+ if (freezer->state != CGROUP_THAWED)
+ atomic_dec(&system_freezing_cnt);
+ freezer->state = CGROUP_THAWED;
+ unfreeze_cgroup(freezer);
}
+}
+/**
+ * freezer_change_state - change the freezing state of a cgroup_freezer
+ * @freezer: freezer of interest
+ * @freeze: whether to freeze or thaw
+ *
+ * Freeze or thaw @cgroup according to @freeze.
+ */
+static void freezer_change_state(struct freezer *freezer, bool freeze)
+{
+ /* update @freezer */
+ spin_lock_irq(&freezer->lock);
+ freezer_apply_state(freezer, freeze);
spin_unlock_irq(&freezer->lock);
}
static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
const char *buffer)
{
- enum freezer_state goal_state;
+ bool freeze;
if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
- goal_state = CGROUP_THAWED;
+ freeze = false;
else if (strcmp(buffer, freezer_state_strs[CGROUP_FROZEN]) == 0)
- goal_state = CGROUP_FROZEN;
+ freeze = true;
else
return -EINVAL;
- freezer_change_state(cgroup_freezer(cgroup), goal_state);
+ freezer_change_state(cgroup_freezer(cgroup), freeze);
return 0;
}
--
1.7.11.7
freezer->state was an enum value, one of THAWED, FREEZING and FROZEN.
As the scheduled full hierarchy support requires more than one
freezing condition, switch it to a mask of flags. If FREEZING is not
set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
both FREEZING and FROZEN are set. Now that tasks can be attached to
an already frozen cgroup, this also makes freezing condition checks
more natural.
This patch doesn't introduce any behavior change.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 60 ++++++++++++++++++++++---------------------------
1 file changed, 27 insertions(+), 33 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index 2690830..e76aa9f 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -22,15 +22,14 @@
#include <linux/freezer.h>
#include <linux/seq_file.h>
-enum freezer_state {
- CGROUP_THAWED = 0,
- CGROUP_FREEZING,
- CGROUP_FROZEN,
+enum freezer_state_flags {
+ CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
+ CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
};
struct freezer {
struct cgroup_subsys_state css;
- enum freezer_state state;
+ unsigned int state;
spinlock_t lock;
};
@@ -48,12 +47,10 @@ static inline struct freezer *task_freezer(struct task_struct *task)
bool cgroup_freezing(struct task_struct *task)
{
- enum freezer_state state;
bool ret;
rcu_read_lock();
- state = task_freezer(task)->state;
- ret = state == CGROUP_FREEZING || state == CGROUP_FROZEN;
+ ret = task_freezer(task)->state & CGROUP_FREEZING;
rcu_read_unlock();
return ret;
@@ -63,10 +60,13 @@ bool cgroup_freezing(struct task_struct *task)
* cgroups_write_string() limits the size of freezer state strings to
* CGROUP_LOCAL_BUFFER_SIZE
*/
-static const char *freezer_state_strs[] = {
- "THAWED",
- "FREEZING",
- "FROZEN",
+static const char *freezer_state_strs(unsigned int state)
+{
+ if (state & CGROUP_FROZEN)
+ return "FROZEN";
+ if (state & CGROUP_FREEZING)
+ return "FREEZING";
+ return "THAWED";
};
/*
@@ -91,7 +91,6 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
return ERR_PTR(-ENOMEM);
spin_lock_init(&freezer->lock);
- freezer->state = CGROUP_THAWED;
return &freezer->css;
}
@@ -99,7 +98,7 @@ static void freezer_destroy(struct cgroup *cgroup)
{
struct freezer *freezer = cgroup_freezer(cgroup);
- if (freezer->state != CGROUP_THAWED)
+ if (freezer->state & CGROUP_FREEZING)
atomic_dec(&system_freezing_cnt);
kfree(freezer);
}
@@ -129,15 +128,13 @@ static void freezer_attach(struct cgroup *new_cgrp, struct cgroup_taskset *tset)
* Tasks in @tset are on @new_cgrp but may not conform to its
* current state before executing the following - !frozen tasks may
* be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
- * This means that, to determine whether to freeze, one should test
- * whether the state equals THAWED.
*/
cgroup_taskset_for_each(task, new_cgrp, tset) {
- if (freezer->state == CGROUP_THAWED) {
+ if (!(freezer->state & CGROUP_FREEZING)) {
__thaw_task(task);
} else {
freeze_task(task);
- freezer->state = CGROUP_FREEZING;
+ freezer->state &= ~CGROUP_FROZEN;
}
}
@@ -159,11 +156,7 @@ static void freezer_fork(struct task_struct *task)
goto out;
spin_lock_irq(&freezer->lock);
- /*
- * @task might have been just migrated into a FROZEN cgroup. Test
- * equality with THAWED. Read the comment in freezer_attach().
- */
- if (freezer->state != CGROUP_THAWED)
+ if (freezer->state & CGROUP_FREEZING)
freeze_task(task);
spin_unlock_irq(&freezer->lock);
out:
@@ -184,7 +177,8 @@ static void update_if_frozen(struct freezer *freezer)
struct cgroup_iter it;
struct task_struct *task;
- if (freezer->state != CGROUP_FREEZING)
+ if (!(freezer->state & CGROUP_FREEZING) ||
+ (freezer->state & CGROUP_FROZEN))
return;
cgroup_iter_start(cgroup, &it);
@@ -202,7 +196,7 @@ static void update_if_frozen(struct freezer *freezer)
}
}
- freezer->state = CGROUP_FROZEN;
+ freezer->state |= CGROUP_FROZEN;
notyet:
cgroup_iter_end(cgroup, &it);
}
@@ -211,14 +205,14 @@ static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
struct freezer *freezer = cgroup_freezer(cgroup);
- enum freezer_state state;
+ unsigned int state;
spin_lock_irq(&freezer->lock);
update_if_frozen(freezer);
state = freezer->state;
spin_unlock_irq(&freezer->lock);
- seq_puts(m, freezer_state_strs[state]);
+ seq_puts(m, freezer_state_strs(state));
seq_putc(m, '\n');
return 0;
}
@@ -258,14 +252,14 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
lockdep_assert_held(&freezer->lock);
if (freeze) {
- if (freezer->state == CGROUP_THAWED)
+ if (!(freezer->state & CGROUP_FREEZING))
atomic_inc(&system_freezing_cnt);
- freezer->state = CGROUP_FREEZING;
+ freezer->state |= CGROUP_FREEZING;
freeze_cgroup(freezer);
} else {
- if (freezer->state != CGROUP_THAWED)
+ if (freezer->state & CGROUP_FREEZING)
atomic_dec(&system_freezing_cnt);
- freezer->state = CGROUP_THAWED;
+ freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
unfreeze_cgroup(freezer);
}
}
@@ -290,9 +284,9 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
{
bool freeze;
- if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
+ if (strcmp(buffer, freezer_state_strs(0)) == 0)
freeze = false;
- else if (strcmp(buffer, freezer_state_strs[CGROUP_FROZEN]) == 0)
+ else if (strcmp(buffer, freezer_state_strs(CGROUP_FROZEN)) == 0)
freeze = true;
else
return -EINVAL;
--
1.7.11.7
Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants. This patch adds the following
three macros.
* cgroup_for_each_child() - walk immediate children of a cgroup.
* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
in pre-order tree traversal.
* cgroup_for_each_descendant_post() - visit all descendants of a
cgroup in post-order tree traversal.
All three only require the user to hold the RCU read lock during
traversal. Verifying that each iterated cgroup is online is the
responsibility of the user. When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate config
updates to descendants in a reliable way. See comments for details.
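As a sketch of the intended usage - my_state(), struct my_state and
MY_ONLINE are hypothetical, not part of this patch - a child walk with
per-cgroup locking and an online check would look like this:

        struct cgroup *pos;

        rcu_read_lock();
        cgroup_for_each_child(pos, cgrp) {
                struct my_state *s = my_state(pos);

                spin_lock_irq(&s->lock);
                if (s->flags & MY_ONLINE) {
                        /*
                         * @pos has finished ->post_create() and hasn't
                         * started ->pre_destroy(); safe to act on it.
                         */
                }
                spin_unlock_irq(&s->lock);
        }
        rcu_read_unlock();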
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 168 insertions(+)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 90c33eb..0020329 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -534,6 +534,88 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
return task_subsys_state(task, subsys_id)->cgroup;
}
+/**
+ * cgroup_for_each_child - iterate through children of a cgroup
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose children to walk
+ *
+ * Walk @cgroup's children. Must be called under rcu_read_lock(). A child
+ * cgroup which hasn't finished ->post_create() or already has finished
+ * ->pre_destroy() may show up during traversal and it's each subsystem's
+ * responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, a cgroup which finished ->post_create()
+ * is guaranteed to be visible in the future iterations.
+ */
+#define cgroup_for_each_child(pos, cgroup) \
+ list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
+
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+ struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_pre - pre-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Walk @cgroup's descendants. Must be called under rcu_read_lock(). A
+ * descendant cgroup which hasn't finished ->post_create() or already has
+ * finished ->pre_destroy() may show up during traversal and it's each
+ * subsystem's responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, and synchronizes against @pos on each
+ * iteration, any descendant cgroup which finished ->post_create() is
+ * guaranteed to be visible in the future iterations.
+ *
+ * In other words, the following guarantees that a descendant can't escape
+ * configuration of its ancestors.
+ *
+ * my_post_create(@cgrp)
+ * {
+ * Lock @cgrp->parent and @cgrp;
+ * Inherit config from @cgrp->parent;
+ * Unlock both.
+ * }
+ *
+ * my_update_config(@cgrp)
+ * {
+ * Lock @cgrp;
+ * Update @cgrp's config;
+ * Unlock @cgrp;
+ *
+ * cgroup_for_each_descendant_pre(@pos, @cgrp) {
+ * Lock @pos;
+ * Verify @pos is alive and inherit config from @pos->parent;
+ * Unlock @pos;
+ * }
+ * }
+ *
+ * Alternatively, a subsystem may choose to use a single global lock to
+ * synchronize ->post_create() and ->pre_destroy() against tree-walking
+ * operations.
+ */
+#define cgroup_for_each_descendant_pre(pos, cgroup) \
+ for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \
+ pos = cgroup_next_descendant_pre((pos), (cgroup)))
+
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+ struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_post - post-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Similar to cgroup_for_each_descendant_pre() but performs post-order
+ * traversal instead. Note that the walk visibility guarantee described in
+ * pre-order walk doesn't apply the same to post-order walks.
+ */
+#define cgroup_for_each_descendant_post(pos, cgroup) \
+ for (pos = cgroup_next_descendant_post(NULL, (cgroup)); (pos); \
+ pos = cgroup_next_descendant_post((pos), (cgroup)))
+
/* A cgroup_iter should be treated as an opaque object */
struct cgroup_iter {
struct list_head *cg_link;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index cc5d2a0..8bd662c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2985,6 +2985,92 @@ static void cgroup_enable_task_cg_lists(void)
write_unlock(&css_set_lock);
}
+/**
+ * cgroup_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_pre(). Find the next
+ * descendant to visit for pre-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+ struct cgroup *cgroup)
+{
+ struct cgroup *next;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* if first iteration, pretend we just visited @cgroup */
+ if (!pos) {
+ if (list_empty(&cgroup->children))
+ return NULL;
+ pos = cgroup;
+ }
+
+ /* visit the first child if exists */
+ next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
+ if (next)
+ return next;
+
+ /* no child, visit my or the closest ancestor's next sibling */
+ do {
+ next = list_entry_rcu(pos->sibling.next, struct cgroup,
+ sibling);
+ if (&next->sibling != &pos->parent->children)
+ return next;
+
+ pos = pos->parent;
+ } while (pos != cgroup);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
+
+static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
+{
+ struct cgroup *last;
+
+ do {
+ last = pos;
+ pos = list_first_or_null_rcu(&pos->children, struct cgroup,
+ sibling);
+ } while (pos);
+
+ return last;
+}
+
+/**
+ * cgroup_next_descendant_post - find the next descendant for post-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_post(). Find the next
+ * descendant to visit for post-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+ struct cgroup *cgroup)
+{
+ struct cgroup *next;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* if first iteration, visit the leftmost descendant */
+ if (!pos) {
+ next = cgroup_leftmost_descendant(cgroup);
+ return next != cgroup ? next : NULL;
+ }
+
+ /* if there's an unvisited sibling, visit its leftmost descendant */
+ next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+ if (&next->sibling != &pos->parent->children)
+ return cgroup_leftmost_descendant(next);
+
+ /* no sibling left, visit parent */
+ next = pos->parent;
+ return next != cgroup ? next : NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_post);
+
void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
__acquires(css_set_lock)
{
--
1.7.11.7
On 11/03/2012 09:38 AM, Tejun Heo wrote:
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
>
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where it can synchronize against new cgroups
> appearing in such walks.
>
> This patch adds ->post_create(). It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Glauber Costa <[email protected]>
Tejun, if we do it this way, we end up with two callbacks that are
called after create: post_clone and post_create. I myself prefer the
approach I took, which converts post_clone into post_create, and would
prefer if you would pick that up.
For me, post_clone is totally a glitch that should not exist. Merging
this with post_create gives the following semantics:
* A while after cgroup creation, you will get a callback. In that
callback, you do whatever initialization you may need that you could
not do in create. Why is reacting to a flag being set any different?
clone_children makes cgroup invoke the ->post_clone() callback if it
exists and set CGRP_CLONE_CHILDREN. ->post_clone(), while being
named generically, is only supposed to copy configuration from its
parent.
This is entirely a convenience feature which is only used by cpuset
to alter its configuration propagation. As the mount option and
cgroupfs knobs are cgroup-wide and different controllers differ in
their configuration propagation, it's awkward to use for multiple
controllers, especially if they're co-mounted. At this point, we can't
use this flag for any other controller anyway as it would implicitly
change the behavior.
As this is an unnecessary feature with very limited use and is awkward
in co-mounted use cases, let's try to deprecate it. Whine on the mount
option and on accesses to the cgroupfs knobs.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
Glauber, I think this is more befitting change for .post_clone(). If
people aren't depending on this, I'd much prefer to just remove it.
Thanks.
kernel/cgroup.c | 8 ++++++++
1 file changed, 8 insertions(+)
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1162,6 +1162,8 @@ static int parse_cgroupfs_options(char *
continue;
}
if (!strcmp(token, "clone_children")) {
+ pr_warning("cgroup: mount option clone_children is deprecated (pid=%d comm=%s)\n",
+ task_tgid_nr(current), current->comm);
opts->clone_children = true;
continue;
}
@@ -3837,6 +3839,9 @@ fail:
static u64 cgroup_clone_children_read(struct cgroup *cgrp,
struct cftype *cft)
{
+ printk_ratelimited(KERN_WARNING "cgroup: clone_children is deprecated (pid=%d comm=%s)\n",
+ task_tgid_nr(current), current->comm);
+
return clone_children(cgrp);
}
@@ -3844,6 +3849,9 @@ static int cgroup_clone_children_write(s
struct cftype *cft,
u64 val)
{
+ printk_ratelimited(KERN_WARNING "cgroup: clone_children is deprecated (pid=%d comm=%s)\n",
+ task_tgid_nr(current), current->comm);
+
if (val)
set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
else
Quoting Tejun Heo ([email protected]):
> clone_children makes cgroup invoke ->post_clone() callback if it
> exists and sets CGRP_CLONE_CHILDREN. ->post_clone(), while being
> named generically, is only supposed to copy configuration from its
> parent.
>
> This is an entirely convenience feature which is only used by cpuset
> to alter its configuration propagation. As the mount option and
> cgroupfs knobs are cgroup-wide and different controllers differ in
> their configuration propagations, it's awkward to use for multiple
> controllers especially if they're co-mounted. At this point, we can't
> use this flag for any other controller anyway as it would implicitly
> change the behavior.
>
> As this is unnecessary feature with very limited use and awkward in
clone_children is currently required by lxc. Of course lxc could
manually set the .cpus and .mems in the newly created child, but if
those interfaces or others like them ever change (i.e. any new ones
which must be set before a task can join a cgroup) we'll get into yet
more of a rat's nest of code to support different kernel versions.
(Just as an idea, if there was a way to generically tell from the list
of files in my cgroups dir which files need to be initialized before a
task can join, I think that would suffice. Maybe a cgroups.needssetup
file which right now contains 'cpuset.mems\ncpuset.cpus'. Then if we
find a file we don't recognize in there we can throw an intelligent
error, or guess at duplicating the parent value.)
> co-mounted use cases, let's try to deprecate it. Whine on the mount
> option and accesses to cgroupfs knobs.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Glauber Costa <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> Glauber, I think this is more befitting change for .post_clone(). If
What do you mean by that? That he is working on a patch-set which will
remove post_clone and this belongs there, or that there is a proposed
alternative?
> people aren't depending on this, I'd much prefer to just remove it.
>
> Thanks.
>
> kernel/cgroup.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1162,6 +1162,8 @@ static int parse_cgroupfs_options(char *
> continue;
> }
> if (!strcmp(token, "clone_children")) {
> + pr_warning("cgroup: mount option clone_children is deprecated (pid=%d comm=%s)\n",
> + task_tgid_nr(current), current->comm);
> opts->clone_children = true;
> continue;
> }
> @@ -3837,6 +3839,9 @@ fail:
> static u64 cgroup_clone_children_read(struct cgroup *cgrp,
> struct cftype *cft)
> {
> + printk_ratelimited(KERN_WARNING "cgroup: clone_children is deprecated (pid=%d comm=%s)\n",
> + task_tgid_nr(current), current->comm);
> +
> return clone_children(cgrp);
> }
>
> @@ -3844,6 +3849,9 @@ static int cgroup_clone_children_write(s
> struct cftype *cft,
> u64 val)
> {
> + printk_ratelimited(KERN_WARNING "cgroup: clone_children is deprecated (pid=%d comm=%s)\n",
> + task_tgid_nr(current), current->comm);
> +
> if (val)
> set_bit(CGRP_CLONE_CHILDREN, &cgrp->flags);
> else
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Hello, Serge.
On Mon, Nov 05, 2012 at 01:17:14PM -0600, Serge Hallyn wrote:
> > As this is unnecessary feature with very limited use and awkward in
>
> clone_children is currently required by lxc. Of course lxc could
> manually set the .cpus and .mems in the newly created child, but if
> those interfaces or others like it ever change (i.e. any new ones which
> must be set before a task can join a cgroup) we'll get into yet more of
> a rats nets of code to support different kernel versions.
If lxc is using it, this will have to stay then. I'll try to make it
a cpuset specific thing.
> (Just as an idea, if there was a way to generically tell from the list
> of files in my cgroups dir which files need to be initialized before a
> task can join, I think that would suffice. Maybe a cgroups.needssetup
> file which right now contains 'cpuset.mems\ncpuset.cpus'. Then if we
> find a file we don't recognize in there we can throw an intelligent
> error, or guess at duplicating the parent value.)
Hmmm... I think the root problem is that different controllers don't
agree on the way they inherit configurations from parents. cgroup
really is a trainwreck. :(
I don't know whether "needssetup" is the right way to deal with it.
I'll think more about it.
> > co-mounted use cases, let's try to deprecate it. Whine on the mount
> > option and accesses to cgroupfs knobs.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Glauber Costa <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > ---
> > Glauber, I think this is more befitting change for .post_clone(). If
>
> What do you mean by that? That he is working on a patch-set which will
> remove post_clone and this belongs there, or that there is a proposed
> alternative?
The former. There was a patchset from Glauber updating ->post_clone()
(which didn't change the behavior).
Thanks.
--
tejun
On Sat, Nov 03, 2012 at 01:38:29AM -0700, Tejun Heo wrote:
> Currently, cgroup doesn't provide any generic helper for walking a
> given cgroup's children or descendants. This patch adds the following
> three macros.
>
> * cgroup_for_each_child() - walk immediate children of a cgroup.
>
> * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> in pre-order tree traversal.
>
> * cgroup_for_each_descendant_post() - visit all descendants of a
> cgroup in post-order tree traversal.
>
> All three only require the user to hold RCU read lock during
> traversal. Verifying that each iterated cgroup is online is the
> responsibility of the user. When used with proper synchronization,
> cgroup_for_each_descendant_pre() can be used to propagate config
> updates to descendants in reliable way. See comments for details.
Michal, Li, how does this look to you? Would this be okay for memcg
too? Li, do you think the comment on cgroup_for_each_descendant_pre()
is correct?
Thanks.
--
tejun
On Sat 03-11-12 01:38:35, Tejun Heo wrote:
[...]
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index 4f12d31..3262537 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
[...]
> @@ -320,14 +388,39 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
> * @freezer: freezer of interest
> * @freeze: whether to freeze or thaw
> *
> - * Freeze or thaw @cgroup according to @freeze.
> + * Freeze or thaw @freezer according to @freeze. The operations are
> + * recursive - all descendants of @freezer will be affected.
> */
> static void freezer_change_state(struct freezer *freezer, bool freeze)
> {
> + struct cgroup *pos;
> +
> /* update @freezer */
> spin_lock_irq(&freezer->lock);
> freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
> spin_unlock_irq(&freezer->lock);
> +
> + /*
> + * Update all its descendants in pre-order traversal. Each
> + * descendant will try to inherit its parent's FREEZING state as
> + * CGROUP_FREEZING_PARENT.
> + */
> + rcu_read_lock();
> + cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
> + struct freezer *pos_f = cgroup_freezer(pos);
> + struct freezer *parent = parent_freezer(freezer);
This doesn't seem right. Why are we interested in freezer->parent here
at all? We should either care about freezer->state & CGROUP_FREEZING or
parent = parent_freezer(pos_f), right?
> +
> + /*
> + * Our update to @parent->state is already visible which is
> + * all we need. No need to lock @parent. For more info on
> + * synchronization, see freezer_post_create().
> + */
> + spin_lock_irq(&pos_f->lock);
> + freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
> + CGROUP_FREEZING_PARENT);
> + spin_unlock_irq(&pos_f->lock);
> + }
> + rcu_read_unlock();
> }
>
> static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:27, Tejun Heo wrote:
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
>
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where it can synchronize against new cgroups
> appearing in such walks.
>
> This patch adds ->post_create(). It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
Hmm, I had to look at "cgroup_freezer: implement proper hierarchy
support" to actually understand what the callback is good for. The
above sounds as if the callback is needed when a controller wants to
use the new iterators or when pre_destroy is defined.
I think it would be helpful if the changelog described that the
callback is needed when the controller keeps mutable shared state for
the hierarchy. For example, the memory controller doesn't have any
such strict requirement, so we can safely use your new iterators
without pre_destroy.
Anyway, I like this change because the shared state is now really easy
to implement.
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Glauber Costa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 12 ++++++++++--
> 2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index fe876a7..b442122 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -438,6 +438,7 @@ int cgroup_taskset_size(struct cgroup_taskset *tset);
>
> struct cgroup_subsys {
> struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
> + void (*post_create)(struct cgroup *cgrp);
> void (*pre_destroy)(struct cgroup *cgrp);
> void (*destroy)(struct cgroup *cgrp);
> int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index e3045ad..f05d992 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4060,10 +4060,15 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> if (err < 0)
> goto err_remove;
>
> - /* each css holds a ref to the cgroup's dentry */
> - for_each_subsys(root, ss)
> + for_each_subsys(root, ss) {
> + /* each css holds a ref to the cgroup's dentry */
> dget(dentry);
>
> + /* creation succeeded, notify subsystems */
> + if (ss->post_create)
> + ss->post_create(cgrp);
> + }
> +
> /* The cgroup directory was pre-locked for us */
> BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
>
> @@ -4281,6 +4286,9 @@ static void __init cgroup_init_subsys(struct cgroup_subsys *ss)
>
> ss->active = 1;
>
> + if (ss->post_create)
> + ss->post_create(&ss->root->top_cgroup);
> +
> /* this function shouldn't be used with modular subsystems, since they
> * need to register a subsys_id, among other things */
> BUG_ON(ss->module);
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:28, Tejun Heo wrote:
> Use RCU safe list operations for cgroup->children. This will be used
> to implement cgroup children / descendant walking which can be used by
> controllers.
>
> Note that cgroup_create() now puts a new cgroup at the end of the
> ->children list instead of head. This isn't strictly necessary but is
> done so that the iteration order is more conventional.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
> ---
> include/linux/cgroup.h | 1 +
> kernel/cgroup.c | 8 +++-----
> 2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index b442122..90c33eb 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -12,6 +12,7 @@
> #include <linux/cpumask.h>
> #include <linux/nodemask.h>
> #include <linux/rcupdate.h>
> +#include <linux/rculist.h>
> #include <linux/cgroupstats.h>
> #include <linux/prio_heap.h>
> #include <linux/rwsem.h>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index f05d992..cc5d2a0 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -1650,7 +1650,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
>
> free_cg_links(&tmp_cg_links);
>
> - BUG_ON(!list_empty(&root_cgrp->sibling));
> BUG_ON(!list_empty(&root_cgrp->children));
> BUG_ON(root->number_of_cgroups != 1);
>
> @@ -1699,7 +1698,6 @@ static void cgroup_kill_sb(struct super_block *sb) {
>
> BUG_ON(root->number_of_cgroups != 1);
> BUG_ON(!list_empty(&cgrp->children));
> - BUG_ON(!list_empty(&cgrp->sibling));
>
> mutex_lock(&cgroup_mutex);
> mutex_lock(&cgroup_root_mutex);
> @@ -4053,7 +4051,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> }
> }
>
> - list_add(&cgrp->sibling, &cgrp->parent->children);
> + list_add_tail_rcu(&cgrp->sibling, &cgrp->parent->children);
> root->number_of_cgroups++;
>
> err = cgroup_create_dir(cgrp, dentry, mode);
> @@ -4084,7 +4082,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>
> err_remove:
>
> - list_del(&cgrp->sibling);
> + list_del_rcu(&cgrp->sibling);
> root->number_of_cgroups--;
>
> err_destroy:
> @@ -4210,7 +4208,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> raw_spin_unlock(&release_list_lock);
>
> /* delete this cgroup from parent->children */
> - list_del_init(&cgrp->sibling);
> + list_del_rcu(&cgrp->sibling);
>
> list_del_init(&cgrp->allcg_node);
>
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Tue 06-11-12 12:31:54, Tejun Heo wrote:
> On Sat, Nov 03, 2012 at 01:38:29AM -0700, Tejun Heo wrote:
> > Currently, cgroup doesn't provide any generic helper for walking a
> > given cgroup's children or descendants. This patch adds the following
> > three macros.
> >
> > * cgroup_for_each_child() - walk immediate children of a cgroup.
> >
> > * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> > in pre-order tree traversal.
> >
> > * cgroup_for_each_descendant_post() - visit all descendants of a
> > cgroup in post-order tree traversal.
> >
> > All three only require the user to hold RCU read lock during
> > traversal. Verifying that each iterated cgroup is online is the
> > responsibility of the user. When used with proper synchronization,
> > cgroup_for_each_descendant_pre() can be used to propagate config
> > updates to descendants in reliable way. See comments for details.
>
> Michal, Li, how does this look to you? Would this be okay for memcg
> too?
Yes, definitely. We are currently iterating by css->id which is, ehm,
impractical. Having a deterministic tree walk is definitely a plus.
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Wed, Nov 07, 2012 at 12:00:57PM +0100, Michal Hocko wrote:
> > + * Update all its descendants in pre-order traversal. Each
> > + * descendant will try to inherit its parent's FREEZING state as
> > + * CGROUP_FREEZING_PARENT.
> > + */
> > + rcu_read_lock();
> > + cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
> > + struct freezer *pos_f = cgroup_freezer(pos);
> > + struct freezer *parent = parent_freezer(freezer);
>
> This doesn't seem right. Why are we interested in freezer->parent here
> at all? We should either care about freezer->state & CGROUP_FREEZING or
> parent = parent_freezer(pos_f), right?
Yeap, that should have been parent_freezer(pos_f). Argh...
Thanks.
--
tejun
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in a hierarchy, but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.
This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.
freezer_post_create() locks the parent and the new cgroup and inherits
the parent's state, and freezer_change_state() applies the new state
top-down using cgroup_for_each_descendant_pre(), which guarantees that
no child can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.
Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations. The finer grained approach was used
because it wasn't too difficult for cgroup_freezer and I think it's
beneficial to have an example implementation; cgroup_freezer is
rather simple and can serve as a good one.
As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.
Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned about via .broken_hierarchy.
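For example, with the freezer hierarchy mounted at /sys/fs/cgroup/freezer
and cgroups A and A/B already set up (the mount point and the cgroup names
are only illustrative assumptions), a quick untested userspace check of the
new behavior could look like this:

#include <stdio.h>
#include <string.h>

#define ROOT "/sys/fs/cgroup/freezer"   /* assumed mount point */

static void write_state(const char *cg, const char *state)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), ROOT "/%s/freezer.state", cg);
        f = fopen(path, "w");
        if (!f)
                return;
        fputs(state, f);
        fclose(f);
}

static void show_state(const char *cg)
{
        char path[256], buf[32] = "";
        FILE *f;

        snprintf(path, sizeof(path), ROOT "/%s/freezer.state", cg);
        f = fopen(path, "r");
        if (f) {
                if (fgets(buf, sizeof(buf), f))
                        buf[strcspn(buf, "\n")] = '\0';
                fclose(f);
        }
        printf("%s: %s\n", cg, buf);
}

int main(void)
{
        write_state("A", "FROZEN");     /* freeze only the parent */

        /* A and A/B both report FREEZING and, once all their tasks have
         * frozen, FROZEN - previously A/B would have stayed THAWED */
        show_state("A");
        show_state("A/B");
        return 0;
}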
v2: Michal spotted a bug in freezer_change_state() - descendants were
inheriting from the wrong ancestor. Fixed.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 161 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 123 insertions(+), 38 deletions(-)
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -22,6 +22,13 @@
#include <linux/freezer.h>
#include <linux/seq_file.h>
+/*
+ * A cgroup is freezing if any FREEZING flags are set. FREEZING_SELF is
+ * set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
+ * for "THAWED". FREEZING_PARENT is set if the parent freezer is FREEZING
+ * for whatever reason. IOW, a cgroup has FREEZING_PARENT set if one of
+ * its ancestors has FREEZING_SELF set.
+ */
enum freezer_state_flags {
CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
@@ -50,6 +57,15 @@ static inline struct freezer *task_freez
struct freezer, css);
}
+static struct freezer *parent_freezer(struct freezer *freezer)
+{
+ struct cgroup *pcg = freezer->css.cgroup->parent;
+
+ if (pcg)
+ return cgroup_freezer(pcg);
+ return NULL;
+}
+
bool cgroup_freezing(struct task_struct *task)
{
bool ret;
@@ -74,17 +90,6 @@ static const char *freezer_state_strs(un
return "THAWED";
};
-/*
- * State diagram
- * Transitions are caused by userspace writes to the freezer.state file.
- * The values in parenthesis are state labels. The rest are edge labels.
- *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- * ^ ^ | |
- * | \_______THAWED_______/ |
- * \__________________________THAWED____________/
- */
-
struct cgroup_subsys freezer_subsys;
static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
@@ -103,15 +108,34 @@ static struct cgroup_subsys_state *freez
* freezer_post_create - commit creation of a freezer cgroup
* @cgroup: cgroup being created
*
- * We're committing to creation of @cgroup. Mark it online.
+ * We're committing to creation of @cgroup. Mark it online and inherit
+ * parent's freezing state while holding both parent's and our
+ * freezer->lock.
*/
static void freezer_post_create(struct cgroup *cgroup)
{
struct freezer *freezer = cgroup_freezer(cgroup);
+ struct freezer *parent = parent_freezer(freezer);
+
+ /*
+ * The following double locking and freezing state inheritance
+ * guarantee that @cgroup can never escape ancestors' freezing
+ * states. See cgroup_for_each_descendant_pre() for details.
+ */
+ if (parent)
+ spin_lock_irq(&parent->lock);
+ spin_lock_nested(&freezer->lock, SINGLE_DEPTH_NESTING);
- spin_lock_irq(&freezer->lock);
freezer->state |= CGROUP_FREEZER_ONLINE;
- spin_unlock_irq(&freezer->lock);
+
+ if (parent && (parent->state & CGROUP_FREEZING)) {
+ freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
+ atomic_inc(&system_freezing_cnt);
+ }
+
+ spin_unlock(&freezer->lock);
+ if (parent)
+ spin_unlock_irq(&parent->lock);
}
/**
@@ -153,6 +177,7 @@ static void freezer_attach(struct cgroup
{
struct freezer *freezer = cgroup_freezer(new_cgrp);
struct task_struct *task;
+ bool clear_frozen = false;
spin_lock_irq(&freezer->lock);
@@ -172,10 +197,25 @@ static void freezer_attach(struct cgroup
} else {
freeze_task(task);
freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = true;
}
}
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Propagate FROZEN clearing upwards. We may race with
+ * update_if_frozen(), but as long as both work bottom-up, either
+ * update_if_frozen() sees child's FROZEN cleared or we clear the
+ * parent's FROZEN later. No parent w/ !FROZEN children can be
+ * left FROZEN.
+ */
+ while (clear_frozen && (freezer = parent_freezer(freezer))) {
+ spin_lock_irq(&freezer->lock);
+ freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = freezer->state & CGROUP_FREEZING;
+ spin_unlock_irq(&freezer->lock);
+ }
}
static void freezer_fork(struct task_struct *task)
@@ -200,24 +240,47 @@ out:
rcu_read_unlock();
}
-/*
- * We change from FREEZING to FROZEN lazily if the cgroup was only
- * partially frozen when we exitted write. Caller must hold freezer->lock.
+/**
+ * update_if_frozen - update whether a cgroup finished freezing
+ * @cgroup: cgroup of interest
+ *
+ * Once FREEZING is initiated, transition to FROZEN is lazily updated by
+ * calling this function. If the current state is FREEZING but not FROZEN,
+ * this function checks whether all tasks of this cgroup and the descendant
+ * cgroups finished freezing and, if so, sets FROZEN.
+ *
+ * The caller is responsible for grabbing RCU read lock and calling
+ * update_if_frozen() on all descendants prior to invoking this function.
*
* Task states and freezer state might disagree while tasks are being
* migrated into or out of @cgroup, so we can't verify task states against
* @freezer state here. See freezer_attach() for details.
*/
-static void update_if_frozen(struct freezer *freezer)
+static void update_if_frozen(struct cgroup *cgroup)
{
- struct cgroup *cgroup = freezer->css.cgroup;
+ struct freezer *freezer = cgroup_freezer(cgroup);
+ struct cgroup *pos;
struct cgroup_iter it;
struct task_struct *task;
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ spin_lock_irq(&freezer->lock);
+
if (!(freezer->state & CGROUP_FREEZING) ||
(freezer->state & CGROUP_FROZEN))
- return;
+ goto out_unlock;
+ /* are all (live) children frozen? */
+ cgroup_for_each_child(pos, cgroup) {
+ struct freezer *child = cgroup_freezer(pos);
+
+ if ((child->state & CGROUP_FREEZER_ONLINE) &&
+ !(child->state & CGROUP_FROZEN))
+ goto out_unlock;
+ }
+
+ /* are all tasks frozen? */
cgroup_iter_start(cgroup, &it);
while ((task = cgroup_iter_next(cgroup, &it))) {
@@ -229,27 +292,32 @@ static void update_if_frozen(struct free
* the usual frozen condition.
*/
if (!frozen(task) && !freezer_should_skip(task))
- goto notyet;
+ goto out_iter_end;
}
}
freezer->state |= CGROUP_FROZEN;
-notyet:
+out_iter_end:
cgroup_iter_end(cgroup, &it);
+out_unlock:
+ spin_unlock_irq(&freezer->lock);
}
static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
- struct freezer *freezer = cgroup_freezer(cgroup);
- unsigned int state;
+ struct cgroup *pos;
- spin_lock_irq(&freezer->lock);
- update_if_frozen(freezer);
- state = freezer->state;
- spin_unlock_irq(&freezer->lock);
+ rcu_read_lock();
- seq_puts(m, freezer_state_strs(state));
+ /* update states bottom-up */
+ cgroup_for_each_descendant_post(pos, cgroup)
+ update_if_frozen(pos);
+ update_if_frozen(cgroup);
+
+ rcu_read_unlock();
+
+ seq_puts(m, freezer_state_strs(cgroup_freezer(cgroup)->state));
seq_putc(m, '\n');
return 0;
}
@@ -320,14 +388,39 @@ static void freezer_apply_state(struct f
* @freezer: freezer of interest
* @freeze: whether to freeze or thaw
*
- * Freeze or thaw @cgroup according to @freeze.
+ * Freeze or thaw @freezer according to @freeze. The operations are
+ * recursive - all descendants of @freezer will be affected.
*/
static void freezer_change_state(struct freezer *freezer, bool freeze)
{
+ struct cgroup *pos;
+
/* update @freezer */
spin_lock_irq(&freezer->lock);
freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Update all its descendants in pre-order traversal. Each
+ * descendant will try to inherit its parent's FREEZING state as
+ * CGROUP_FREEZING_PARENT.
+ */
+ rcu_read_lock();
+ cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
+ struct freezer *pos_f = cgroup_freezer(pos);
+ struct freezer *parent = parent_freezer(pos_f);
+
+ /*
+ * Our update to @parent->state is already visible which is
+ * all we need. No need to lock @parent. For more info on
+ * synchronization, see freezer_post_create().
+ */
+ spin_lock_irq(&pos_f->lock);
+ freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
+ CGROUP_FREEZING_PARENT);
+ spin_unlock_irq(&pos_f->lock);
+ }
+ rcu_read_unlock();
}
static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
@@ -390,12 +483,4 @@ struct cgroup_subsys freezer_subsys = {
.attach = freezer_attach,
.fork = freezer_fork,
.base_cftypes = files,
-
- /*
- * freezer subsys doesn't handle hierarchy at all. Frozen state
- * should be inherited through the hierarchy - if a parent is
- * frozen, all its children should be frozen. Fix it and remove
- * the following.
- */
- .broken_hierarchy = true,
};
On Sat 03-11-12 01:38:29, Tejun Heo wrote:
[...]
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index cc5d2a0..8bd662c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2985,6 +2985,92 @@ static void cgroup_enable_task_cg_lists(void)
> write_unlock(&css_set_lock);
> }
>
> +/**
> + * cgroup_next_descendant_pre - find the next descendant for pre-order walk
> + * @pos: the current position (%NULL to initiate traversal)
> + * @cgroup: cgroup whose descendants to walk
> + *
> + * To be used by cgroup_for_each_descendant_pre(). Find the next
> + * descendant to visit for pre-order traversal of @cgroup's descendants.
> + */
> +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> + struct cgroup *cgroup)
> +{
> + struct cgroup *next;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /* if first iteration, pretend we just visited @cgroup */
> + if (!pos) {
> + if (list_empty(&cgroup->children))
> + return NULL;
> + pos = cgroup;
> + }
Is there any specific reason why the root of the tree is excluded?
This is a bit impractical because you have to special-case the root
in the code.
> +
> + /* visit the first child if exists */
> + next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
> + if (next)
> + return next;
> +
> + /* no child, visit my or the closest ancestor's next sibling */
> + do {
> + next = list_entry_rcu(pos->sibling.next, struct cgroup,
> + sibling);
> + if (&next->sibling != &pos->parent->children)
> + return next;
> +
> + pos = pos->parent;
> + } while (pos != cgroup);
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
[...]
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Wed, Nov 07, 2012 at 05:54:57PM +0100, Michal Hocko wrote:
> > +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> > + struct cgroup *cgroup)
> > +{
> > + struct cgroup *next;
> > +
> > + WARN_ON_ONCE(!rcu_read_lock_held());
> > +
> > + /* if first iteration, pretend we just visited @cgroup */
> > + if (!pos) {
> > + if (list_empty(&cgroup->children))
> > + return NULL;
> > + pos = cgroup;
> > + }
>
> Is there any specific reason why the root of the tree is excluded?
> This is bit impractical because you have to special case the root
> in the code.
Yeah, thought about including it but decided against it for two
reasons.
* To be consistent with cgroup_for_each_child() - it's a bit weird
for descendants to include self when children don't.
* Iteration root is likely to require different treatment anyway.
e.g. for cgroup_freezer, the root is updated to the specified config
while all the descendants inherit config from their immediate parent.
They are different.
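IOW, the usual pattern would be something like the following sketch -
struct my_state and the my_state() accessor are made up for illustration,
and the per-cgroup locking / online checks are omitted:

#include <linux/cgroup.h>
#include <linux/rcupdate.h>

struct my_state {
        int val;
};

/* made-up accessor mapping a cgroup to this controller's state */
static struct my_state *my_state(struct cgroup *cgrp);

static void my_update_config(struct cgroup *root_cgrp, int val)
{
        struct cgroup *pos;

        /* the iteration root gets the explicit config ... */
        my_state(root_cgrp)->val = val;

        /*
         * ... while every descendant inherits from its immediate parent,
         * so the walk never needs to visit the root itself.
         */
        rcu_read_lock();
        cgroup_for_each_descendant_pre(pos, root_cgrp)
                my_state(pos)->val = my_state(pos->parent)->val;
        rcu_read_unlock();
}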
Thanks.
--
tejun
Hello, Michal.
On Wed, Nov 07, 2012 at 04:25:16PM +0100, Michal Hocko wrote:
> > This patch adds ->post_create(). It's called after all ->create()
> > succeeded and the cgroup is linked into the generic cgroup hierarchy.
> > This plays the counterpart of ->pre_destroy().
>
> Hmm, I had to look at "cgroup_freezer: implement proper hierarchy
> support" to actually understand what is the callback good for. The above
> sounds as if the callback is needed when a controller wants to use
> the new iterators or when pre_destroy is defined.
>
> I think it would be helpful if the changelog described that the callback
> is needed when the controller keeps a mutable shared state for the
> hierarchy. For example memory controller doesn't have any such a strict
> requirement so we can safely use your new iterators without pre_destroy.
Hmm.... will try to explain it but I think it might be best to just
refer to the later patch for details. It's a bit tricky to explain.
Thanks.
--
tejun
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocations successfully and is
considered "live" by cgroup.
This becomes a problem later when we add generic descendant walking
to cgroup which can be used by controllers, as controllers don't have a
synchronization point where they can synchronize against new cgroups
appearing in such walks.
This patch adds ->post_create(). It's called after all ->create()
succeeded and the cgroup is linked into the generic cgroup hierarchy.
This plays the counterpart of ->pre_destroy().
When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance. It will be explained with the descendant iterators.
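Roughly, the intended pattern is to inherit the parent's state from
->post_create() while holding both the parent's and the new cgroup's locks.
The following is only a sketch; struct my_state and the my_state() accessor
are made up for illustration and mirror what the later cgroup_freezer patch
does:

#include <linux/cgroup.h>
#include <linux/spinlock.h>

struct my_state {
        spinlock_t lock;
        int val;
        bool online;
};

/* made-up accessor mapping a cgroup to this controller's state */
static struct my_state *my_state(struct cgroup *cgrp);

static void my_post_create(struct cgroup *cgrp)
{
        struct my_state *s = my_state(cgrp);
        struct my_state *parent = cgrp->parent ? my_state(cgrp->parent) : NULL;

        /* lock the parent first so its value can't change while we come online */
        if (parent)
                spin_lock_irq(&parent->lock);
        spin_lock_nested(&s->lock, SINGLE_DEPTH_NESTING);

        s->online = true;
        if (parent)
                s->val = parent->val;   /* inherit under both locks */

        spin_unlock(&s->lock);
        if (parent)
                spin_unlock_irq(&parent->lock);
}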
v2: Added a paragraph about its future use w/ descendant iterators per
Michal.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Glauber Costa <[email protected]>
---
If you can explain it better, please be my guest.
Thanks.
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 12 ++++++++++--
2 files changed, 11 insertions(+), 2 deletions(-)
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -438,6 +438,7 @@ int cgroup_taskset_size(struct cgroup_ta
struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
+ void (*post_create)(struct cgroup *cgrp);
void (*pre_destroy)(struct cgroup *cgrp);
void (*destroy)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4060,10 +4060,15 @@ static long cgroup_create(struct cgroup
if (err < 0)
goto err_remove;
- /* each css holds a ref to the cgroup's dentry */
- for_each_subsys(root, ss)
+ for_each_subsys(root, ss) {
+ /* each css holds a ref to the cgroup's dentry */
dget(dentry);
+ /* creation succeeded, notify subsystems */
+ if (ss->post_create)
+ ss->post_create(cgrp);
+ }
+
/* The cgroup directory was pre-locked for us */
BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
@@ -4281,6 +4286,9 @@ static void __init cgroup_init_subsys(st
ss->active = 1;
+ if (ss->post_create)
+ ss->post_create(&ss->root->top_cgroup);
+
/* this function shouldn't be used with modular subsystems, since they
* need to register a subsys_id, among other things */
BUG_ON(ss->module);
On Wed 07-11-12 09:15:08, Tejun Heo wrote:
[...]
> This patch adds ->post_create(). It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
>
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance. It will be explained with the descendant iterators.
Thanks
--
Michal Hocko
SUSE Labs
On Wed 07-11-12 09:01:18, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Nov 07, 2012 at 05:54:57PM +0100, Michal Hocko wrote:
> > > +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> > > + struct cgroup *cgroup)
> > > +{
> > > + struct cgroup *next;
> > > +
> > > + WARN_ON_ONCE(!rcu_read_lock_held());
> > > +
> > > + /* if first iteration, pretend we just visited @cgroup */
> > > + if (!pos) {
> > > + if (list_empty(&cgroup->children))
> > > + return NULL;
> > > + pos = cgroup;
> > > + }
> >
> > Is there any specific reason why the root of the tree is excluded?
> > This is bit impractical because you have to special case the root
> > in the code.
>
> Yeah, thought about including it but decided against it for two
> reasons.
>
> * To be consistent with cgroup_for_each_children() - it's a bit weird
> for descendants to include self when children don't.
>
> * Iteration root is likely to require different treatment anyway.
> e.g. for cgroup_freezer, the root is updated to the specified config
> while all the descendants inherit config from its immediate parent.
> They are different.
OK, if there is code which handles the root differently then let's not
overcomplicate this. The naming should make it clear that the root needs
special treatment.
I will continue with the review tomorrow.
--
Michal Hocko
SUSE Labs
(2012/11/08 2:15), Tejun Heo wrote:
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocations successfully and is
> considered "live" by cgroup.
>
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where it can synchronize against new cgroups
> appearing in such walks.
>
> This patch adds ->post_create(). It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
>
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance. It will be explained with the descendant iterators.
>
> v2: Added a paragraph about its future use w/ descendant iterators per
> Michal.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: Glauber Costa <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/03 17:38), Tejun Heo wrote:
> Use RCU safe list operations for cgroup->children. This will be used
> to implement cgroup children / descendant walking which can be used by
> controllers.
>
> Note that cgroup_create() now puts a new cgroup at the end of the
> ->children list instead of head. This isn't strictly necessary but is
> done so that the iteration order is more conventional.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/03 17:38), Tejun Heo wrote:
> Currently, cgroup doesn't provide any generic helper for walking a
> given cgroup's children or descendants. This patch adds the following
> three macros.
>
> * cgroup_for_each_child() - walk immediate children of a cgroup.
>
> * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> in pre-order tree traversal.
>
> * cgroup_for_each_descendant_post() - visit all descendants of a
> cgroup in post-order tree traversal.
>
> All three only require the user to hold RCU read lock during
> traversal. Verifying that each iterated cgroup is online is the
> responsibility of the user. When used with proper synchronization,
> cgroup_for_each_descendant_pre() can be used to propagate config
> updates to descendants in reliable way. See comments for details.
>
> Signed-off-by: Tejun Heo <[email protected]>
maybe better than using css->id in some (many) cases.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/03 17:38), Tejun Heo wrote:
> * Clean-up indentation and line-breaks. Drop the invalid comment
> about freezer->lock.
>
> * Make all internal functions take @freezer instead of both @cgroup
> and @freezer.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/03 17:38), Tejun Heo wrote:
> * Make freezer_change_state() take bool @freeze instead of enum
> freezer_state.
>
> * Separate out freezer_apply_state() out of freezer_change_state().
> This makes freezer_change_state() a rather silly thin wrapper. It
> will be filled with hierarchy handling later on.
>
> This patch doesn't introduce any behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/03 17:38), Tejun Heo wrote:
> freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> As the scheduled full hierarchy support requires more than one
> freezing condition, switch it to mask of flags. If FREEZING is not
> set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
> both FREEZING and FROZEN are set. Now that tasks can be attached to
> an already frozen cgroup, this also makes freezing condition checks
> more natural.
>
> This patch doesn't introduce any behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/cgroup_freezer.c | 60 ++++++++++++++++++++++---------------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index 2690830..e76aa9f 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -22,15 +22,14 @@
> #include <linux/freezer.h>
> #include <linux/seq_file.h>
>
> -enum freezer_state {
> - CGROUP_THAWED = 0,
> - CGROUP_FREEZING,
> - CGROUP_FROZEN,
> +enum freezer_state_flags {
> + CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
> + CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
> };
>
> struct freezer {
> struct cgroup_subsys_state css;
> - enum freezer_state state;
> + unsigned int state;
> spinlock_t lock;
> };
>
> @@ -48,12 +47,10 @@ static inline struct freezer *task_freezer(struct task_struct *task)
>
> bool cgroup_freezing(struct task_struct *task)
> {
> - enum freezer_state state;
> bool ret;
>
> rcu_read_lock();
> - state = task_freezer(task)->state;
> - ret = state == CGROUP_FREEZING || state == CGROUP_FROZEN;
> + ret = task_freezer(task)->state & CGROUP_FREEZING;
> rcu_read_unlock();
>
> return ret;
> @@ -63,10 +60,13 @@ bool cgroup_freezing(struct task_struct *task)
> * cgroups_write_string() limits the size of freezer state strings to
> * CGROUP_LOCAL_BUFFER_SIZE
> */
> -static const char *freezer_state_strs[] = {
> - "THAWED",
> - "FREEZING",
> - "FROZEN",
> +static const char *freezer_state_strs(unsigned int state)
> +{
> + if (state & CGROUP_FROZEN)
> + return "FROZEN";
> + if (state & CGROUP_FREEZING)
> + return "FREEZING";
> + return "THAWED";
> };
>
> /*
> @@ -91,7 +91,6 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
> return ERR_PTR(-ENOMEM);
>
> spin_lock_init(&freezer->lock);
> - freezer->state = CGROUP_THAWED;
> return &freezer->css;
> }
>
> @@ -99,7 +98,7 @@ static void freezer_destroy(struct cgroup *cgroup)
> {
> struct freezer *freezer = cgroup_freezer(cgroup);
>
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> atomic_dec(&system_freezing_cnt);
> kfree(freezer);
> }
> @@ -129,15 +128,13 @@ static void freezer_attach(struct cgroup *new_cgrp, struct cgroup_taskset *tset)
> * Tasks in @tset are on @new_cgrp but may not conform to its
> * current state before executing the following - !frozen tasks may
> * be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
> - * This means that, to determine whether to freeze, one should test
> - * whether the state equals THAWED.
> */
> cgroup_taskset_for_each(task, new_cgrp, tset) {
> - if (freezer->state == CGROUP_THAWED) {
> + if (!(freezer->state & CGROUP_FREEZING)) {
> __thaw_task(task);
> } else {
> freeze_task(task);
> - freezer->state = CGROUP_FREEZING;
> + freezer->state &= ~CGROUP_FROZEN;
> }
> }
>
> @@ -159,11 +156,7 @@ static void freezer_fork(struct task_struct *task)
> goto out;
>
> spin_lock_irq(&freezer->lock);
> - /*
> - * @task might have been just migrated into a FROZEN cgroup. Test
> - * equality with THAWED. Read the comment in freezer_attach().
> - */
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> freeze_task(task);
> spin_unlock_irq(&freezer->lock);
> out:
> @@ -184,7 +177,8 @@ static void update_if_frozen(struct freezer *freezer)
> struct cgroup_iter it;
> struct task_struct *task;
>
> - if (freezer->state != CGROUP_FREEZING)
> + if (!(freezer->state & CGROUP_FREEZING) ||
> + (freezer->state & CGROUP_FROZEN))
> return;
Hmm....
How about
enum {
__CGROUP_FREEZING,
__CGROUP_FROZEN,
};
#define CGROUP_FREEZER_STATE_MASK 0x3
#define CGROUP_FREEZER_STATE(state) ((state) & CGROUP_FREEZER_STATE_MASK)
#define CGROUP_THAW(state) (CGROUP_FREEZER_STATE(state) == 0)
#define CGROUP_FREEZING(state) (CGROUP_FREEZER_STATE(state) == __CGROUP_FREEZING)
#define CGROUP_FROZEN(state)\
(CGROUP_FREEZER_STATE(state) == (__CGROUP_FREEZING | __CGROUP_FROZEN))
>
> cgroup_iter_start(cgroup, &it);
> @@ -202,7 +196,7 @@ static void update_if_frozen(struct freezer *freezer)
> }
> }
>
> - freezer->state = CGROUP_FROZEN;
> + freezer->state |= CGROUP_FROZEN;
> notyet:
> cgroup_iter_end(cgroup, &it);
> }
> @@ -211,14 +205,14 @@ static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
> struct seq_file *m)
> {
> struct freezer *freezer = cgroup_freezer(cgroup);
> - enum freezer_state state;
> + unsigned int state;
>
> spin_lock_irq(&freezer->lock);
> update_if_frozen(freezer);
> state = freezer->state;
> spin_unlock_irq(&freezer->lock);
>
> - seq_puts(m, freezer_state_strs[state]);
> + seq_puts(m, freezer_state_strs(state));
> seq_putc(m, '\n');
> return 0;
> }
> @@ -258,14 +252,14 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
> lockdep_assert_held(&freezer->lock);
>
> if (freeze) {
> - if (freezer->state == CGROUP_THAWED)
> + if (!(freezer->state & CGROUP_FREEZING))
> atomic_inc(&system_freezing_cnt);
> - freezer->state = CGROUP_FREEZING;
> + freezer->state |= CGROUP_FREEZING;
> freeze_cgroup(freezer);
> } else {
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> atomic_dec(&system_freezing_cnt);
> - freezer->state = CGROUP_THAWED;
> + freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
> unfreeze_cgroup(freezer);
> }
> }
> @@ -290,9 +284,9 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> {
> bool freeze;
>
> - if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
> + if (strcmp(buffer, freezer_state_strs(0)) == 0)
Can't we have a name for "0" ?
> freeze = false;
> - else if (strcmp(buffer, freezer_state_strs[CGROUP_FROZEN]) == 0)
> + else if (strcmp(buffer, freezer_state_strs(CGROUP_FROZEN)) == 0)
> freeze = true;
> else
> return -EINVAL;
>
Thanks,
-Kame
(2012/11/03 17:38), Tejun Heo wrote:
> Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
> the two flags. This is to prepare for full hierarchy support.
>
> freezer_apply_state() is updated such that it can handle setting and
> clearing of both flags. The two flags are also exposed to userland
> via read-only files self_freezing and parent_freezing.
>
> Other than the added cgroupfs files, this patch doesn't introduce any
> behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
What is the purpose of FREEZING_SELF/PARENT, having 2 flags and
making them visible to users?
Thanks,
-Kame
> ---
> kernel/cgroup_freezer.c | 55 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 47 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index e76aa9f..b8ad93c 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -23,8 +23,12 @@
> #include <linux/seq_file.h>
>
> enum freezer_state_flags {
> - CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
> + CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
> + CGROUP_FREEZING_PARENT = (1 << 2), /* the parent freezer is freezing */
> CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
> +
> + /* mask for all FREEZING flags */
> + CGROUP_FREEZING = CGROUP_FREEZING_SELF | CGROUP_FREEZING_PARENT,
> };
>
> struct freezer {
> @@ -245,8 +249,13 @@ static void unfreeze_cgroup(struct freezer *freezer)
> * freezer_apply_state - apply state change to a single cgroup_freezer
> * @freezer: freezer to apply state change to
> * @freeze: whether to freeze or unfreeze
> + * @state: CGROUP_FREEZING_* flag to set or clear
> + *
> + * Set or clear @state on @cgroup according to @freeze, and perform
> + * freezing or thawing as necessary.
> */
> -static void freezer_apply_state(struct freezer *freezer, bool freeze)
> +static void freezer_apply_state(struct freezer *freezer, bool freeze,
> + unsigned int state)
> {
> /* also synchronizes against task migration, see freezer_attach() */
> lockdep_assert_held(&freezer->lock);
> @@ -254,13 +263,19 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
> if (freeze) {
> if (!(freezer->state & CGROUP_FREEZING))
> atomic_inc(&system_freezing_cnt);
> - freezer->state |= CGROUP_FREEZING;
> + freezer->state |= state;
> freeze_cgroup(freezer);
> } else {
> - if (freezer->state & CGROUP_FREEZING)
> - atomic_dec(&system_freezing_cnt);
> - freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
> - unfreeze_cgroup(freezer);
> + bool was_freezing = freezer->state & CGROUP_FREEZING;
> +
> + freezer->state &= ~state;
> +
> + if (!(freezer->state & CGROUP_FREEZING)) {
> + if (was_freezing)
> + atomic_dec(&system_freezing_cnt);
> + freezer->state &= ~CGROUP_FROZEN;
> + unfreeze_cgroup(freezer);
> + }
> }
> }
>
> @@ -275,7 +290,7 @@ static void freezer_change_state(struct freezer *freezer, bool freeze)
> {
> /* update @freezer */
> spin_lock_irq(&freezer->lock);
> - freezer_apply_state(freezer, freeze);
> + freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
> spin_unlock_irq(&freezer->lock);
> }
>
> @@ -295,6 +310,20 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> return 0;
> }
>
> +static u64 freezer_self_freezing_read(struct cgroup *cgroup, struct cftype *cft)
> +{
> + struct freezer *freezer = cgroup_freezer(cgroup);
> +
> + return (bool)(freezer->state & CGROUP_FREEZING_SELF);
> +}
> +
> +static u64 freezer_parent_freezing_read(struct cgroup *cgroup, struct cftype *cft)
> +{
> + struct freezer *freezer = cgroup_freezer(cgroup);
> +
> + return (bool)(freezer->state & CGROUP_FREEZING_PARENT);
> +}
> +
> static struct cftype files[] = {
> {
> .name = "state",
> @@ -302,6 +331,16 @@ static struct cftype files[] = {
> .read_seq_string = freezer_read,
> .write_string = freezer_write,
> },
> + {
> + .name = "self_freezing",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = freezer_self_freezing_read,
> + },
> + {
> + .name = "parent_freezing",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = freezer_parent_freezing_read,
> + },
> { } /* terminate */
> };
>
>
Hello, Kame.
On Thu, Nov 08, 2012 at 01:37:50PM +0900, Kamezawa Hiroyuki wrote:
> How about
> enum {
> __CGROUP_FREEZING,
> __CGROUP_FROZEN,
> };
>
> #define CGROUP_FREEZER_STATE_MASK 0x3
> #define CGROUP_FREEZER_STATE(state) ((state) & CGROUP_FREEZER_STATE_MASK)
> #define CGROUP_THAW(state) (CGROUP_FREEZER_STATE(state) == 0)
> #define CGROUP_FREEZING(state) (CGROUP_FREEZER_STATE(state) == __CGROUP_FREEZING)
> #define CGROUP_FROZEN(state)\
> (CGROUP_FREEZER_STATE(state) == (__CGROUP_FREEZING | __CGROUP_FROZEN))
I think it's a bit overdone and we have cases where we test for
FREEZING regardless of FROZEN and cases where we test for FREEZING &&
!FROZEN. We can have, say, CGROUP_FREEZING() and then
CGROUP_FREEZING_BUT_NOT_FROZEN(), but it feels more like obfuscation
than anything else.
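If it ever gets confusing, tiny helpers along these lines would do (just a
sketch against the flags in this series, not something I'm planning to add):

static inline bool freezer_freezing(unsigned int state)
{
        return state & CGROUP_FREEZING;         /* freezing, frozen or not */
}

static inline bool freezer_freezing_not_frozen(unsigned int state)
{
        /* FREEZING initiated but not all tasks frozen yet */
        return (state & CGROUP_FREEZING) && !(state & CGROUP_FROZEN);
}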
> >@@ -290,9 +284,9 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> > {
> > bool freeze;
> >
> >- if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
> >+ if (strcmp(buffer, freezer_state_strs(0)) == 0)
>
> Can't we have a name for "0" ?
We can define CGROUP_THAWED to be zero. I'd rather keep it explicitly
zero tho. freezer state is mask of freezing and frozen flags and no
flag set means thawed.
Thanks.
--
tejun
Hello,
On Thu, Nov 08, 2012 at 01:42:22PM +0900, Kamezawa Hiroyuki wrote:
> (2012/11/03 17:38), Tejun Heo wrote:
> >Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
> >the two flags. This is to prepare for full hierarchy support.
> >
> >freezer_apply_date() is updated such that it can handle setting and
> >clearing of both flags. The two flags are also exposed to userland
> >via read-only files self_freezing and parent_freezing.
> >
> >Other than the added cgroupfs files, this patch doesn't introduce any
> >behavior change.
> >
> >Signed-off-by: Tejun Heo <[email protected]>
>
> What is the purpose of FREEZING_SELF/PARENT, having 2 flags and
> making them visible to users?
SELF is the configuration of the current cgroup, PARENT is freezing
state inherited from an ancestor. If either is set, the cgroup is
freezing. Without the two conditions visible to userland, it can get
a bit confusing as echoing "THAWED" to state may not do anything to
the cgroup.
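For example, a quick untested userspace check like the following shows why a
cgroup may still be freezing after "THAWED" has been written - the
/sys/fs/cgroup/freezer mount point and the A/B cgroup path are only
illustrative assumptions:

#include <stdio.h>
#include <string.h>

#define CG "/sys/fs/cgroup/freezer/A/B" /* assumed cgroup path */

static void show(const char *file)
{
        char path[256], buf[64] = "";
        FILE *f;

        snprintf(path, sizeof(path), CG "/%s", file);
        f = fopen(path, "r");
        if (f) {
                if (fgets(buf, sizeof(buf), f))
                        buf[strcspn(buf, "\n")] = '\0';
                fclose(f);
        }
        printf("%-24s %s\n", file, buf);
}

int main(void)
{
        /*
         * If freezer.state still shows FREEZING/FROZEN after "THAWED" was
         * written, self_freezing reads 0 and parent_freezing reads 1 when
         * an ancestor is the reason.
         */
        show("freezer.state");
        show("freezer.self_freezing");
        show("freezer.parent_freezing");
        return 0;
}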
Thanks.
--
tejun
(2012/11/03 17:38), Tejun Heo wrote:
> A cgroup is online and visible to iteration between ->post_create()
> and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
> toggles it from the newly added freezer_post_create() and
> freezer_pre_destroy() while holding freezer->lock such that a
> cgroup_freezer can be reliably distinguished to be online. This will
> be used by full hierarchy support.
>
> ONLINE test is added to freezer_apply_state() but it currently doesn't
> make any difference as freezer_write() can only be called for an
> online cgroup.
>
> Adjusting system_freezing_cnt on destruction is moved from
> freezer_destroy() to the new freezer_pre_destroy() for consistency.
>
> This patch doesn't introduce any noticeable behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> kernel/cgroup_freezer.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index b8ad93c..4f12d31 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -23,6 +23,7 @@
> #include <linux/seq_file.h>
>
> enum freezer_state_flags {
> + CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
Could you explain again what 'online' means here, in a comment rather than just the changelog?
BTW, is 'online' a shared concept between post_create() and pre_destroy() among
developers? Is it new?
Anyway, the patch itself is simple.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/08 13:45), Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 08, 2012 at 01:42:22PM +0900, Kamezawa Hiroyuki wrote:
>> (2012/11/03 17:38), Tejun Heo wrote:
>>> Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
>>> the two flags. This is to prepare for full hierarchy support.
>>>
>>> freezer_apply_date() is updated such that it can handle setting and
>>> clearing of both flags. The two flags are also exposed to userland
>>> via read-only files self_freezing and parent_freezing.
>>>
>>> Other than the added cgroupfs files, this patch doesn't introduce any
>>> behavior change.
>>>
>>> Signed-off-by: Tejun Heo <[email protected]>
>>
>> What is the purpose of FREEZING_SELF/PARENT, having 2 flags and
>> making them visible to users?
>
> SELF is the configuration of the current cgroup, PARENT is freezing
> state inherited from an ancestor. If either is set, the cgroup is
> freezing. Without the two conditions visible to userland, it can get
> a bit confusing as echoing "THAWED" to state may not do anything to
> the cgroup.
>
Got it, thank you. I'd be glad if you added some more explanation in the comments
and the considered use-case in the changelog.
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Thanks,
-Kame
(2012/11/08 13:42), Tejun Heo wrote:
> Hello, Kame.
>
> On Thu, Nov 08, 2012 at 01:37:50PM +0900, Kamezawa Hiroyuki wrote:
>> How about
>> enum {
>> __CGROUP_FREEZING,
>> __CGROUP_FROZEN,
>> };
>>
>> #define CGROUP_FREEZER_STATE_MASK 0x3
>> #define CGROUP_FREEZER_STATE(state) ((state) & CGROUP_FREEZER_STATE_MASK)
>> #define CGROUP_THAW(state) (CGROUP_FREEZER_STATE(state) == 0)
>> #define CGROUP_FREEZING(state) (CGROUP_FREEZER_STATE(state) == __CGROUP_FREEZING)
>> #define CGROUP_FROZEN(state)\
>> (CGROUP_FREEZER_STATE(state) == (__CGROUP_FREEZING | __CGROUP_FROZEN))
>
> I think it's a bit overdone and we have cases where we test for
> FREEZING regardless of FROZEN and cases where test for FREEZING &&
> !FROZEN. We can have, say, CGROUP_FREZING() and then
> CGROUP_FREEZING_BUT_NOT_FROZEN(), but it feels more like obfuscation
> than anything else.
>
Hm, then, I'd be glad to see which combinations of flags are valid and
what they mean in source code comments.
Anyway,
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
On Sat 03-11-12 01:38:29, Tejun Heo wrote:
> Currently, cgroup doesn't provide any generic helper for walking a
> given cgroup's children or descendants. This patch adds the following
> three macros.
>
> * cgroup_for_each_child() - walk immediate children of a cgroup.
>
> * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> in pre-order tree traversal.
>
> * cgroup_for_each_descendant_post() - visit all descendants of a
> cgroup in post-order tree traversal.
>
> All three only require the user to hold RCU read lock during
> traversal. Verifying that each iterated cgroup is online is the
> responsibility of the user. When used with proper synchronization,
> cgroup_for_each_descendant_pre() can be used to propagate config
> updates to descendants in reliable way. See comments for details.
>
> Signed-off-by: Tejun Heo <[email protected]>
I will convert mem_cgroup_iter to use this rather than css_get_next
after this gets into the next tree so that it can fly via Andrew.
Reviewed-by: Michal Hocko <[email protected]>
Just a minor nit. You are talking about config propagation while I
would consider state propagation clearer and less confusing. Config
is usually stable enough that post_create is not necessary for
syncing (e.g. memcg.swappiness). It is state which must be kept consistent
throughout the hierarchy that matters here.
Thanks!
> ---
> include/linux/cgroup.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++
> kernel/cgroup.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 168 insertions(+)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 90c33eb..0020329 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -534,6 +534,88 @@ static inline struct cgroup* task_cgroup(struct task_struct *task,
> return task_subsys_state(task, subsys_id)->cgroup;
> }
>
> +/**
> + * cgroup_for_each_child - iterate through children of a cgroup
> + * @pos: the cgroup * to use as the loop cursor
> + * @cgroup: cgroup whose children to walk
> + *
> + * Walk @cgroup's children. Must be called under rcu_read_lock(). A child
> + * cgroup which hasn't finished ->post_create() or already has finished
> + * ->pre_destroy() may show up during traversal and it's each subsystem's
> + * responsibility to verify that each @pos is alive.
> + *
> + * If a subsystem synchronizes against the parent in its ->post_create()
> + * and before starting iterating, a cgroup which finished ->post_create()
> + * is guaranteed to be visible in the future iterations.
> + */
> +#define cgroup_for_each_child(pos, cgroup) \
> + list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
> +
> +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> + struct cgroup *cgroup);
> +
> +/**
> + * cgroup_for_each_descendant_pre - pre-order walk of a cgroup's descendants
> + * @pos: the cgroup * to use as the loop cursor
> + * @cgroup: cgroup whose descendants to walk
> + *
> + * Walk @cgroup's descendants. Must be called under rcu_read_lock(). A
> + * descendant cgroup which hasn't finished ->post_create() or already has
> + * finished ->pre_destroy() may show up during traversal and it's each
> + * subsystem's responsibility to verify that each @pos is alive.
> + *
> + * If a subsystem synchronizes against the parent in its ->post_create()
> + * and before starting iterating, and synchronizes against @pos on each
> + * iteration, any descendant cgroup which finished ->post_create() is
> + * guaranteed to be visible in the future iterations.
> + *
> + * In other words, the following guarantees that a descendant can't escape
> + * configuration of its ancestors.
> + *
> + * my_post_create(@cgrp)
> + * {
> + * Lock @cgrp->parent and @cgrp;
> + * Inherit config from @cgrp->parent;
> + * Unlock both.
> + * }
> + *
> + * my_update_config(@cgrp)
> + * {
> + * Lock @cgrp;
> + * Update @cgrp's config;
> + * Unlock @cgrp;
> + *
> + * cgroup_for_each_descendant_pre(@pos, @cgrp) {
> + * Lock @pos;
> + * Verify @pos is alive and inherit config from @pos->parent;
> + * Unlock @pos;
> + * }
> + * }
> + *
> + * Alternatively, a subsystem may choose to use a single global lock to
> + * synchronize ->post_create() and ->pre_destroy() against tree-walking
> + * operations.
> + */
> +#define cgroup_for_each_descendant_pre(pos, cgroup) \
> + for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \
> + pos = cgroup_next_descendant_pre((pos), (cgroup)))
> +
> +struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
> + struct cgroup *cgroup);
> +
> +/**
> + * cgroup_for_each_descendant_post - post-order walk of a cgroup's descendants
> + * @pos: the cgroup * to use as the loop cursor
> + * @cgroup: cgroup whose descendants to walk
> + *
> + * Similar to cgroup_for_each_descendant_pre() but performs post-order
> + * traversal instead. Note that the walk visibility guarantee described in
> + * pre-order walk doesn't apply the same to post-order walks.
> + */
> +#define cgroup_for_each_descendant_post(pos, cgroup) \
> + for (pos = cgroup_next_descendant_post(NULL, (cgroup)); (pos); \
> + pos = cgroup_next_descendant_post((pos), (cgroup)))
> +
> /* A cgroup_iter should be treated as an opaque object */
> struct cgroup_iter {
> struct list_head *cg_link;
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index cc5d2a0..8bd662c 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -2985,6 +2985,92 @@ static void cgroup_enable_task_cg_lists(void)
> write_unlock(&css_set_lock);
> }
>
> +/**
> + * cgroup_next_descendant_pre - find the next descendant for pre-order walk
> + * @pos: the current position (%NULL to initiate traversal)
> + * @cgroup: cgroup whose descendants to walk
> + *
> + * To be used by cgroup_for_each_descendant_pre(). Find the next
> + * descendant to visit for pre-order traversal of @cgroup's descendants.
> + */
> +struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
> + struct cgroup *cgroup)
> +{
> + struct cgroup *next;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /* if first iteration, pretend we just visited @cgroup */
> + if (!pos) {
> + if (list_empty(&cgroup->children))
> + return NULL;
> + pos = cgroup;
> + }
> +
> + /* visit the first child if exists */
> + next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
> + if (next)
> + return next;
> +
> + /* no child, visit my or the closest ancestor's next sibling */
> + do {
> + next = list_entry_rcu(pos->sibling.next, struct cgroup,
> + sibling);
> + if (&next->sibling != &pos->parent->children)
> + return next;
> +
> + pos = pos->parent;
> + } while (pos != cgroup);
> +
> + return NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
> +
> +static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
> +{
> + struct cgroup *last;
> +
> + do {
> + last = pos;
> + pos = list_first_or_null_rcu(&pos->children, struct cgroup,
> + sibling);
> + } while (pos);
> +
> + return last;
> +}
> +
> +/**
> + * cgroup_next_descendant_post - find the next descendant for post-order walk
> + * @pos: the current position (%NULL to initiate traversal)
> + * @cgroup: cgroup whose descendants to walk
> + *
> + * To be used by cgroup_for_each_descendant_post(). Find the next
> + * descendant to visit for post-order traversal of @cgroup's descendants.
> + */
> +struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
> + struct cgroup *cgroup)
> +{
> + struct cgroup *next;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /* if first iteration, visit the leftmost descendant */
> + if (!pos) {
> + next = cgroup_leftmost_descendant(cgroup);
> + return next != cgroup ? next : NULL;
> + }
> +
> + /* if there's an unvisited sibling, visit its leftmost descendant */
> + next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
> + if (&next->sibling != &pos->parent->children)
> + return cgroup_leftmost_descendant(next);
> +
> + /* no sibling left, visit parent */
> + next = pos->parent;
> + return next != cgroup ? next : NULL;
> +}
> +EXPORT_SYMBOL_GPL(cgroup_next_descendant_post);
> +
> void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
> __acquires(css_set_lock)
> {
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:30, Tejun Heo wrote:
> * Clean-up indentation and line-breaks. Drop the invalid comment
> about freezer->lock.
>
> * Make all internal functions take @freezer instead of both @cgroup
> and @freezer.
>
> Signed-off-by: Tejun Heo <[email protected]>
Looks reasonable
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup_freezer.c | 41 +++++++++++++++++++----------------------
> 1 file changed, 19 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index bedefd9..975b3d8 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -29,17 +29,15 @@ enum freezer_state {
> };
>
> struct freezer {
> - struct cgroup_subsys_state css;
> - enum freezer_state state;
> - spinlock_t lock; /* protects _writes_ to state */
> + struct cgroup_subsys_state css;
> + enum freezer_state state;
> + spinlock_t lock;
> };
>
> -static inline struct freezer *cgroup_freezer(
> - struct cgroup *cgroup)
> +static inline struct freezer *cgroup_freezer(struct cgroup *cgroup)
> {
> - return container_of(
> - cgroup_subsys_state(cgroup, freezer_subsys_id),
> - struct freezer, css);
> + return container_of(cgroup_subsys_state(cgroup, freezer_subsys_id),
> + struct freezer, css);
> }
>
> static inline struct freezer *task_freezer(struct task_struct *task)
> @@ -180,8 +178,9 @@ out:
> * migrated into or out of @cgroup, so we can't verify task states against
> * @freezer state here. See freezer_attach() for details.
> */
> -static void update_if_frozen(struct cgroup *cgroup, struct freezer *freezer)
> +static void update_if_frozen(struct freezer *freezer)
> {
> + struct cgroup *cgroup = freezer->css.cgroup;
> struct cgroup_iter it;
> struct task_struct *task;
>
> @@ -211,12 +210,11 @@ notyet:
> static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
> struct seq_file *m)
> {
> - struct freezer *freezer;
> + struct freezer *freezer = cgroup_freezer(cgroup);
> enum freezer_state state;
>
> - freezer = cgroup_freezer(cgroup);
> spin_lock_irq(&freezer->lock);
> - update_if_frozen(cgroup, freezer);
> + update_if_frozen(freezer);
> state = freezer->state;
> spin_unlock_irq(&freezer->lock);
>
> @@ -225,8 +223,9 @@ static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
> return 0;
> }
>
> -static void freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
> +static void freeze_cgroup(struct freezer *freezer)
> {
> + struct cgroup *cgroup = freezer->css.cgroup;
> struct cgroup_iter it;
> struct task_struct *task;
>
> @@ -236,8 +235,9 @@ static void freeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
> cgroup_iter_end(cgroup, &it);
> }
>
> -static void unfreeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
> +static void unfreeze_cgroup(struct freezer *freezer)
> {
> + struct cgroup *cgroup = freezer->css.cgroup;
> struct cgroup_iter it;
> struct task_struct *task;
>
> @@ -247,11 +247,9 @@ static void unfreeze_cgroup(struct cgroup *cgroup, struct freezer *freezer)
> cgroup_iter_end(cgroup, &it);
> }
>
> -static void freezer_change_state(struct cgroup *cgroup,
> +static void freezer_change_state(struct freezer *freezer,
> enum freezer_state goal_state)
> {
> - struct freezer *freezer = cgroup_freezer(cgroup);
> -
> /* also synchronizes against task migration, see freezer_attach() */
> spin_lock_irq(&freezer->lock);
>
> @@ -260,13 +258,13 @@ static void freezer_change_state(struct cgroup *cgroup,
> if (freezer->state != CGROUP_THAWED)
> atomic_dec(&system_freezing_cnt);
> freezer->state = CGROUP_THAWED;
> - unfreeze_cgroup(cgroup, freezer);
> + unfreeze_cgroup(freezer);
> break;
> case CGROUP_FROZEN:
> if (freezer->state == CGROUP_THAWED)
> atomic_inc(&system_freezing_cnt);
> freezer->state = CGROUP_FREEZING;
> - freeze_cgroup(cgroup, freezer);
> + freeze_cgroup(freezer);
> break;
> default:
> BUG();
> @@ -275,8 +273,7 @@ static void freezer_change_state(struct cgroup *cgroup,
> spin_unlock_irq(&freezer->lock);
> }
>
> -static int freezer_write(struct cgroup *cgroup,
> - struct cftype *cft,
> +static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> const char *buffer)
> {
> enum freezer_state goal_state;
> @@ -288,7 +285,7 @@ static int freezer_write(struct cgroup *cgroup,
> else
> return -EINVAL;
>
> - freezer_change_state(cgroup, goal_state);
> + freezer_change_state(cgroup_freezer(cgroup), goal_state);
> return 0;
> }
>
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:31, Tejun Heo wrote:
> * Make freezer_change_state() take bool @freeze instead of enum
> freezer_state.
>
> * Separate out freezer_apply_state() out of freezer_change_state().
> This makes freezer_change_state() a rather silly thin wrapper. It
> will be filled with hierarchy handling later on.
>
> This patch doesn't introduce any behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
Makes sense
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup_freezer.c | 48 ++++++++++++++++++++++++++++++------------------
> 1 file changed, 30 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index 975b3d8..2690830 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -247,45 +247,57 @@ static void unfreeze_cgroup(struct freezer *freezer)
> cgroup_iter_end(cgroup, &it);
> }
>
> -static void freezer_change_state(struct freezer *freezer,
> - enum freezer_state goal_state)
> +/**
> + * freezer_apply_state - apply state change to a single cgroup_freezer
> + * @freezer: freezer to apply state change to
> + * @freeze: whether to freeze or unfreeze
> + */
> +static void freezer_apply_state(struct freezer *freezer, bool freeze)
> {
> /* also synchronizes against task migration, see freezer_attach() */
> - spin_lock_irq(&freezer->lock);
> + lockdep_assert_held(&freezer->lock);
>
> - switch (goal_state) {
> - case CGROUP_THAWED:
> - if (freezer->state != CGROUP_THAWED)
> - atomic_dec(&system_freezing_cnt);
> - freezer->state = CGROUP_THAWED;
> - unfreeze_cgroup(freezer);
> - break;
> - case CGROUP_FROZEN:
> + if (freeze) {
> if (freezer->state == CGROUP_THAWED)
> atomic_inc(&system_freezing_cnt);
> freezer->state = CGROUP_FREEZING;
> freeze_cgroup(freezer);
> - break;
> - default:
> - BUG();
> + } else {
> + if (freezer->state != CGROUP_THAWED)
> + atomic_dec(&system_freezing_cnt);
> + freezer->state = CGROUP_THAWED;
> + unfreeze_cgroup(freezer);
> }
> +}
>
> +/**
> + * freezer_change_state - change the freezing state of a cgroup_freezer
> + * @freezer: freezer of interest
> + * @freeze: whether to freeze or thaw
> + *
> + * Freeze or thaw @cgroup according to @freeze.
> + */
> +static void freezer_change_state(struct freezer *freezer, bool freeze)
> +{
> + /* update @freezer */
> + spin_lock_irq(&freezer->lock);
> + freezer_apply_state(freezer, freeze);
> spin_unlock_irq(&freezer->lock);
> }
>
> static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> const char *buffer)
> {
> - enum freezer_state goal_state;
> + bool freeze;
>
> if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
> - goal_state = CGROUP_THAWED;
> + freeze = false;
> else if (strcmp(buffer, freezer_state_strs[CGROUP_FROZEN]) == 0)
> - goal_state = CGROUP_FROZEN;
> + freeze = true;
> else
> return -EINVAL;
>
> - freezer_change_state(cgroup_freezer(cgroup), goal_state);
> + freezer_change_state(cgroup_freezer(cgroup), freeze);
> return 0;
> }
>
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:32, Tejun Heo wrote:
> freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> As the scheduled full hierarchy support requires more than one
> freezing condition, switch it to mask of flags. If FREEZING is not
> set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
> both FREEZING and FROZEN are set. Now that tasks can be attached to
> an already frozen cgroup, this also makes freezing condition checks
> more natural.
>
> This patch doesn't introduce any behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
I think it would be nicer to use freezer_state_flags enum rather than
unsigned int for the state. I would even expect gcc to complain about
that but it looks like -fstrict-enums is c++ specific (so long enum
safety).
Anyway
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup_freezer.c | 60 ++++++++++++++++++++++---------------------------
> 1 file changed, 27 insertions(+), 33 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index 2690830..e76aa9f 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -22,15 +22,14 @@
> #include <linux/freezer.h>
> #include <linux/seq_file.h>
>
> -enum freezer_state {
> - CGROUP_THAWED = 0,
> - CGROUP_FREEZING,
> - CGROUP_FROZEN,
> +enum freezer_state_flags {
> + CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
> + CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
> };
>
> struct freezer {
> struct cgroup_subsys_state css;
> - enum freezer_state state;
> + unsigned int state;
> spinlock_t lock;
> };
>
> @@ -48,12 +47,10 @@ static inline struct freezer *task_freezer(struct task_struct *task)
>
> bool cgroup_freezing(struct task_struct *task)
> {
> - enum freezer_state state;
> bool ret;
>
> rcu_read_lock();
> - state = task_freezer(task)->state;
> - ret = state == CGROUP_FREEZING || state == CGROUP_FROZEN;
> + ret = task_freezer(task)->state & CGROUP_FREEZING;
> rcu_read_unlock();
>
> return ret;
> @@ -63,10 +60,13 @@ bool cgroup_freezing(struct task_struct *task)
> * cgroups_write_string() limits the size of freezer state strings to
> * CGROUP_LOCAL_BUFFER_SIZE
> */
> -static const char *freezer_state_strs[] = {
> - "THAWED",
> - "FREEZING",
> - "FROZEN",
> +static const char *freezer_state_strs(unsigned int state)
> +{
> + if (state & CGROUP_FROZEN)
> + return "FROZEN";
> + if (state & CGROUP_FREEZING)
> + return "FREEZING";
> + return "THAWED";
> };
>
> /*
> @@ -91,7 +91,6 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
> return ERR_PTR(-ENOMEM);
>
> spin_lock_init(&freezer->lock);
> - freezer->state = CGROUP_THAWED;
> return &freezer->css;
> }
>
> @@ -99,7 +98,7 @@ static void freezer_destroy(struct cgroup *cgroup)
> {
> struct freezer *freezer = cgroup_freezer(cgroup);
>
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> atomic_dec(&system_freezing_cnt);
> kfree(freezer);
> }
> @@ -129,15 +128,13 @@ static void freezer_attach(struct cgroup *new_cgrp, struct cgroup_taskset *tset)
> * Tasks in @tset are on @new_cgrp but may not conform to its
> * current state before executing the following - !frozen tasks may
> * be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
> - * This means that, to determine whether to freeze, one should test
> - * whether the state equals THAWED.
> */
> cgroup_taskset_for_each(task, new_cgrp, tset) {
> - if (freezer->state == CGROUP_THAWED) {
> + if (!(freezer->state & CGROUP_FREEZING)) {
> __thaw_task(task);
> } else {
> freeze_task(task);
> - freezer->state = CGROUP_FREEZING;
> + freezer->state &= ~CGROUP_FROZEN;
> }
> }
>
> @@ -159,11 +156,7 @@ static void freezer_fork(struct task_struct *task)
> goto out;
>
> spin_lock_irq(&freezer->lock);
> - /*
> - * @task might have been just migrated into a FROZEN cgroup. Test
> - * equality with THAWED. Read the comment in freezer_attach().
> - */
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> freeze_task(task);
> spin_unlock_irq(&freezer->lock);
> out:
> @@ -184,7 +177,8 @@ static void update_if_frozen(struct freezer *freezer)
> struct cgroup_iter it;
> struct task_struct *task;
>
> - if (freezer->state != CGROUP_FREEZING)
> + if (!(freezer->state & CGROUP_FREEZING) ||
> + (freezer->state & CGROUP_FROZEN))
> return;
>
> cgroup_iter_start(cgroup, &it);
> @@ -202,7 +196,7 @@ static void update_if_frozen(struct freezer *freezer)
> }
> }
>
> - freezer->state = CGROUP_FROZEN;
> + freezer->state |= CGROUP_FROZEN;
> notyet:
> cgroup_iter_end(cgroup, &it);
> }
> @@ -211,14 +205,14 @@ static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
> struct seq_file *m)
> {
> struct freezer *freezer = cgroup_freezer(cgroup);
> - enum freezer_state state;
> + unsigned int state;
>
> spin_lock_irq(&freezer->lock);
> update_if_frozen(freezer);
> state = freezer->state;
> spin_unlock_irq(&freezer->lock);
>
> - seq_puts(m, freezer_state_strs[state]);
> + seq_puts(m, freezer_state_strs(state));
> seq_putc(m, '\n');
> return 0;
> }
> @@ -258,14 +252,14 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
> lockdep_assert_held(&freezer->lock);
>
> if (freeze) {
> - if (freezer->state == CGROUP_THAWED)
> + if (!(freezer->state & CGROUP_FREEZING))
> atomic_inc(&system_freezing_cnt);
> - freezer->state = CGROUP_FREEZING;
> + freezer->state |= CGROUP_FREEZING;
> freeze_cgroup(freezer);
> } else {
> - if (freezer->state != CGROUP_THAWED)
> + if (freezer->state & CGROUP_FREEZING)
> atomic_dec(&system_freezing_cnt);
> - freezer->state = CGROUP_THAWED;
> + freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
> unfreeze_cgroup(freezer);
> }
> }
> @@ -290,9 +284,9 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> {
> bool freeze;
>
> - if (strcmp(buffer, freezer_state_strs[CGROUP_THAWED]) == 0)
> + if (strcmp(buffer, freezer_state_strs(0)) == 0)
> freeze = false;
> - else if (strcmp(buffer, freezer_state_strs[CGROUP_FROZEN]) == 0)
> + else if (strcmp(buffer, freezer_state_strs(CGROUP_FROZEN)) == 0)
> freeze = true;
> else
> return -EINVAL;
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:33, Tejun Heo wrote:
> Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
> the two flags. This is to prepare for full hierarchy support.
>
> freezer_apply_state() is updated such that it can handle setting and
> clearing of both flags. The two flags are also exposed to userland
> via read-only files self_freezing and parent_freezing.
>
> Other than the added cgroupfs files, this patch doesn't introduce any
> behavior change.
OK, I can imagine that this might be useful to find the first parent
group that needs to be thawed to unfreeze the group I am interested
in. Could you also update Documentation/cgroups/freezer-subsystem.txt to
clarify the intended usage?
Minor nit. Same as mentioned in the previous patch freezer_apply_state
should get enum freezer_state_flags state parameter.
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup_freezer.c | 55 ++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 47 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index e76aa9f..b8ad93c 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -23,8 +23,12 @@
> #include <linux/seq_file.h>
>
> enum freezer_state_flags {
> - CGROUP_FREEZING = (1 << 1), /* this freezer is freezing */
> + CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
> + CGROUP_FREEZING_PARENT = (1 << 2), /* the parent freezer is freezing */
> CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
> +
> + /* mask for all FREEZING flags */
> + CGROUP_FREEZING = CGROUP_FREEZING_SELF | CGROUP_FREEZING_PARENT,
> };
>
> struct freezer {
> @@ -245,8 +249,13 @@ static void unfreeze_cgroup(struct freezer *freezer)
> * freezer_apply_state - apply state change to a single cgroup_freezer
> * @freezer: freezer to apply state change to
> * @freeze: whether to freeze or unfreeze
> + * @state: CGROUP_FREEZING_* flag to set or clear
> + *
> + * Set or clear @state on @cgroup according to @freeze, and perform
> + * freezing or thawing as necessary.
> */
> -static void freezer_apply_state(struct freezer *freezer, bool freeze)
> +static void freezer_apply_state(struct freezer *freezer, bool freeze,
> + unsigned int state)
> {
> /* also synchronizes against task migration, see freezer_attach() */
> lockdep_assert_held(&freezer->lock);
> @@ -254,13 +263,19 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze)
> if (freeze) {
> if (!(freezer->state & CGROUP_FREEZING))
> atomic_inc(&system_freezing_cnt);
> - freezer->state |= CGROUP_FREEZING;
> + freezer->state |= state;
> freeze_cgroup(freezer);
> } else {
> - if (freezer->state & CGROUP_FREEZING)
> - atomic_dec(&system_freezing_cnt);
> - freezer->state &= ~(CGROUP_FREEZING | CGROUP_FROZEN);
> - unfreeze_cgroup(freezer);
> + bool was_freezing = freezer->state & CGROUP_FREEZING;
> +
> + freezer->state &= ~state;
> +
> + if (!(freezer->state & CGROUP_FREEZING)) {
> + if (was_freezing)
> + atomic_dec(&system_freezing_cnt);
> + freezer->state &= ~CGROUP_FROZEN;
> + unfreeze_cgroup(freezer);
> + }
> }
> }
>
> @@ -275,7 +290,7 @@ static void freezer_change_state(struct freezer *freezer, bool freeze)
> {
> /* update @freezer */
> spin_lock_irq(&freezer->lock);
> - freezer_apply_state(freezer, freeze);
> + freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
> spin_unlock_irq(&freezer->lock);
> }
>
> @@ -295,6 +310,20 @@ static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
> return 0;
> }
>
> +static u64 freezer_self_freezing_read(struct cgroup *cgroup, struct cftype *cft)
> +{
> + struct freezer *freezer = cgroup_freezer(cgroup);
> +
> + return (bool)(freezer->state & CGROUP_FREEZING_SELF);
> +}
> +
> +static u64 freezer_parent_freezing_read(struct cgroup *cgroup, struct cftype *cft)
> +{
> + struct freezer *freezer = cgroup_freezer(cgroup);
> +
> + return (bool)(freezer->state & CGROUP_FREEZING_PARENT);
> +}
> +
> static struct cftype files[] = {
> {
> .name = "state",
> @@ -302,6 +331,16 @@ static struct cftype files[] = {
> .read_seq_string = freezer_read,
> .write_string = freezer_write,
> },
> + {
> + .name = "self_freezing",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = freezer_self_freezing_read,
> + },
> + {
> + .name = "parent_freezing",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = freezer_parent_freezing_read,
> + },
> { } /* terminate */
> };
>
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Sat 03-11-12 01:38:34, Tejun Heo wrote:
> A cgroup is online and visible to iteration between ->post_create()
> and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
> toggles it from the newly added freezer_post_create() and
> freezer_pre_destroy() while holding freezer->lock such that a
> cgroup_freezer can be reliably distinguished to be online. This will
> be used by full hierarchy support.
I am wondering whether freezer_pre_destroy is really needed. Once we
reach pre_destroy there are no tasks nor any children left in the
group, so there is nobody to wake up if the group was frozen, and the
destroy callback is called after synchronize_rcu, so the traversal
should be safe.
> ONLINE test is added to freezer_apply_state() but it currently doesn't
> make any difference as freezer_write() can only be called for an
> online cgroup.
>
> Adjusting system_freezing_cnt on destruction is moved from
> freezer_destroy() to the new freezer_pre_destroy() for consistency.
>
> This patch doesn't introduce any noticeable behavior change.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
> ---
> kernel/cgroup_freezer.c | 42 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
> index b8ad93c..4f12d31 100644
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
> @@ -23,6 +23,7 @@
> #include <linux/seq_file.h>
>
> enum freezer_state_flags {
> + CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
> CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
> CGROUP_FREEZING_PARENT = (1 << 2), /* the parent freezer is freezing */
> CGROUP_FROZEN = (1 << 3), /* this and its descendants frozen */
> @@ -98,13 +99,45 @@ static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
> return &freezer->css;
> }
>
> -static void freezer_destroy(struct cgroup *cgroup)
> +/**
> + * freezer_post_create - commit creation of a freezer cgroup
> + * @cgroup: cgroup being created
> + *
> + * We're committing to creation of @cgroup. Mark it online.
> + */
> +static void freezer_post_create(struct cgroup *cgroup)
> {
> struct freezer *freezer = cgroup_freezer(cgroup);
>
> + spin_lock_irq(&freezer->lock);
> + freezer->state |= CGROUP_FREEZER_ONLINE;
> + spin_unlock_irq(&freezer->lock);
> +}
> +
> +/**
> + * freezer_pre_destroy - initiate destruction of @cgroup
> + * @cgroup: cgroup being destroyed
> + *
> + * @cgroup is going away. Mark it dead and decrement system_freezing_count
> + * if it was holding one.
> + */
> +static void freezer_pre_destroy(struct cgroup *cgroup)
> +{
> + struct freezer *freezer = cgroup_freezer(cgroup);
> +
> + spin_lock_irq(&freezer->lock);
> +
> if (freezer->state & CGROUP_FREEZING)
> atomic_dec(&system_freezing_cnt);
> - kfree(freezer);
> +
> + freezer->state = 0;
> +
> + spin_unlock_irq(&freezer->lock);
> +}
> +
> +static void freezer_destroy(struct cgroup *cgroup)
> +{
> + kfree(cgroup_freezer(cgroup));
> }
>
> /*
> @@ -260,6 +293,9 @@ static void freezer_apply_state(struct freezer *freezer, bool freeze,
> /* also synchronizes against task migration, see freezer_attach() */
> lockdep_assert_held(&freezer->lock);
>
> + if (!(freezer->state & CGROUP_FREEZER_ONLINE))
> + return;
> +
> if (freeze) {
> if (!(freezer->state & CGROUP_FREEZING))
> atomic_inc(&system_freezing_cnt);
> @@ -347,6 +383,8 @@ static struct cftype files[] = {
> struct cgroup_subsys freezer_subsys = {
> .name = "freezer",
> .create = freezer_create,
> + .post_create = freezer_post_create,
> + .pre_destroy = freezer_pre_destroy,
> .destroy = freezer_destroy,
> .subsys_id = freezer_subsys_id,
> .attach = freezer_attach,
> --
> 1.7.11.7
>
--
Michal Hocko
SUSE Labs
On Wed 07-11-12 08:39:19, Tejun Heo wrote:
[...]
> --- a/kernel/cgroup_freezer.c
> +++ b/kernel/cgroup_freezer.c
[...]
> @@ -320,14 +388,39 @@ static void freezer_apply_state(struct f
> * @freezer: freezer of interest
> * @freeze: whether to freeze or thaw
> *
> - * Freeze or thaw @cgroup according to @freeze.
> + * Freeze or thaw @freezer according to @freeze. The operations are
> + * recursive - all descendants of @freezer will be affected.
> */
> static void freezer_change_state(struct freezer *freezer, bool freeze)
> {
> + struct cgroup *pos;
> +
> /* update @freezer */
> spin_lock_irq(&freezer->lock);
> freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
> spin_unlock_irq(&freezer->lock);
> +
> + /*
> + * Update all its descendants in pre-order traversal. Each
> + * descendant will try to inherit its parent's FREEZING state as
> + * CGROUP_FREEZING_PARENT.
> + */
> + rcu_read_lock();
> + cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
> + struct freezer *pos_f = cgroup_freezer(pos);
> + struct freezer *parent = parent_freezer(pos_f);
> +
> + /*
> + * Our update to @parent->state is already visible which is
> + * all we need. No need to lock @parent. For more info on
> + * synchronization, see freezer_post_create().
> + */
> + spin_lock_irq(&pos_f->lock);
> + freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
> + CGROUP_FREEZING_PARENT);
> + spin_unlock_irq(&pos_f->lock);
> + }
> + rcu_read_unlock();
> }
This seems to be racy because parent->state access is not linearized.
Say we have parallel freeze and thawing on a tree like the following:
A
|
B
|
C
pre_order will visit them in B, C order.
CPU1 CPU2
freezer_apply_state(A, true)
A->state & FREEZING == true freezer_apply_state(A, false)
A->state & FREEZING == false
freezer_apply_state(B, false)
B->state & FREEZING == false
freezer_apply_state(B, true)
B->state & FREEZING == true
freezer_apply_state(C, true)
freezer_apply_state(C, false)
So A, C are thawed while B is frozen. Or am I missing something which
would prevent from this kind of race?
The easy fix is to move spin_unlock_irq(&freezer->lock); after
rcu_read_unlock.
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Thu, Nov 08, 2012 at 03:08:52PM +0100, Michal Hocko wrote:
> This seems to be racy because parent->state access is not linearized.
> Say we have parallel freeze and thawing on a tree like the following:
> A
> |
> B
> |
> C
>
> pre_order will visit them in B, C order.
> CPU1 CPU2
> freezer_apply_state(A, true)
> A->state & FREEZING == true freezer_apply_state(A, false)
> A->state & FREEZING == false
> freezer_apply_state(B, false)
> B->state & FREEZING == false
> freezer_apply_state(B, true)
>
> B->state & FREEZING == true
> freezer_apply_state(C, true)
> freezer_apply_state(C, false)
>
> So A, C are thawed while B is frozen. Or am I missing something which
> would prevent from this kind of race?
The rule is that there will be at least one inheritance operation
after a parent is updated. The exact order of propagation doesn't
matter as long as there's at least one inheritance event after the
latest update to a parent. This works because inheritance operations
are atomic to each other. If one inheritance operation "saw" an
update to its parent, the next inheritance operation is guaranteed to
see at least upto that update.
So, in the above example in CPU2, (B->state & FREEZING) test and
freezer_apply_state(C, false) can't be interleaved with the same
inheritance operation from CPU1. They either happen before or after.
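To make that concrete, a single inheritance step has this shape (same
as the loop body in the quoted hunk, pulled out here purely for
illustration - it is not a separate helper in the series):

static void inherit_from_parent(struct freezer *pos_f,
				struct freezer *parent)
{
	/*
	 * Both the read of the parent's FREEZING state and the apply to
	 * @pos_f happen inside pos_f->lock, so two walkers racing on the
	 * same descendant serialize on that lock and the later one
	 * applies the newer parent state.
	 */
	spin_lock_irq(&pos_f->lock);
	freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
			    CGROUP_FREEZING_PARENT);
	spin_unlock_irq(&pos_f->lock);
}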
Maybe it's too subtle. The only reason I didn't use a giant
freezer_mutex was that I wanted to demonstrate how to do state
propagation without depending on single giant lock. Maybe nobody
wants that and this should use one mutex to protect the hierarchy. I
don't know.
Thanks.
--
tejun
Hello, Kamezawa.
On Thu, Nov 08, 2012 at 02:00:34PM +0900, Kamezawa Hiroyuki wrote:
> (2012/11/08 13:42), Tejun Heo wrote:
> >Hello, Kame.
> >
> >On Thu, Nov 08, 2012 at 01:37:50PM +0900, Kamezawa Hiroyuki wrote:
> >>How about
> >>enum {
> >> __CGROUP_FREEZING,
> >> __CGROUP_FROZEN,
> >>};
> >>
> >>#define CGROUP_FREEZER_STATE_MASK 0x3
> >>#define CGROUP_FREEZER_STATE(state) ((state) & CGROUP_FREEZER_STATE_MASK)
> >>#define CGROUP_THAW(state) (CGROUP_FREEZER_STATE(state) == 0)
> >>#define CGROUP_FREEZING(state) (CGROUP_FREEZER_STATE(state) == __CGROUP_FREEZING)
> >>#define CGROUP_FROZEN(state)\
> >> (CGROUP_FREEZER_STATE(state) == (__CGROUP_FREEZING | __CGROUP_FROZEN))
> >
> >I think it's a bit overdone and we have cases where we test for
> >FREEZING regardless of FROZEN and cases where test for FREEZING &&
> >!FROZEN. We can have, say, CGROUP_FREZING() and then
> >CGROUP_FREEZING_BUT_NOT_FROZEN(), but it feels more like obfuscation
> >than anything else.
> >
>
> Hm, then, I'm glad if I can see what combinations of flags are valid and
> meanings of them in source code comments.
Ooh, the following comment is added by a later patch when the whole
thing is complete.
/*
* A cgroup is freezing if any FREEZING flags are set. FREEZING_SELF is
* set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
* for "THAWED". FREEZING_PARENT is set if the parent freezer is FREEZING
* for whatever reason. IOW, a cgroup has FREEZING_PARENT set if one of
* its ancestors has FREEZING_SELF set.
*/
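At a glance, the valid combinations map to the user-visible strings
like this (illustrative summary only - it just restates
freezer_state_strs() and the flag semantics above; CGROUP_ prefixes
dropped):

/*
 * FREEZER_ONLINE                         -> "THAWED"
 * FREEZER_ONLINE | FREEZING_SELF         -> "FREEZING" (own freezer.state written)
 * FREEZER_ONLINE | FREEZING_PARENT       -> "FREEZING" (an ancestor is freezing)
 * FREEZER_ONLINE | FREEZING_* | FROZEN   -> "FROZEN" (all tasks and descendants frozen)
 */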
Thanks.
--
tejun
Hello, Michal.
On Thu, Nov 08, 2012 at 11:39:28AM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:32, Tejun Heo wrote:
> > freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> > As the scheduled full hierarchy support requires more than one
> > freezing condition, switch it to mask of flags. If FREEZING is not
> > set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
> > both FREEZING and FROZEN are set. Now that tasks can be attached to
> > an already frozen cgroup, this also makes freezing condition checks
> > more natural.
> >
> > This patch doesn't introduce any behavior change.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
>
> I think it would be nicer to use freezer_state_flags enum rather than
> unsigned int for the state. I would even expect gcc to complain about
> that but it looks like -fstrict-enums is c++ specific (so long enum
> safety).
But if you use it as flag bits, the resulting value no longer is
inside the defined enum values. Isn't that weird?
Thanks.
--
tejun
Hello, Kamezawa.
On Thu, Nov 08, 2012 at 01:56:52PM +0900, Kamezawa Hiroyuki wrote:
> Got it, thank you. I'm glad if you add some more explanation in comments
> and considered use-case in changelog.
I explained it in the last patch because it was a bit weird to explain
stuff which isn't implemented yet. If it doesn't feel sufficient
after the last patch, please let me know.
Thanks.
--
tejun
Hello,
On Thu, Nov 08, 2012 at 01:47:55PM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:33, Tejun Heo wrote:
> > Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
> > the two flags. This is to prepare for full hierarchy support.
> >
> > freezer_apply_state() is updated such that it can handle setting and
> > clearing of both flags. The two flags are also exposed to userland
> > via read-only files self_freezing and parent_freezing.
> >
> > Other than the added cgroupfs files, this patch doesn't introduce any
> > behavior change.
>
> OK, I can imagine that this might be useful to find the first parent
> group that needs to be thawed to unfreeze the group I am interested
> in. Could you also update Documentation/cgroups/freezer-subsystem.txt to
> clarify the intended usage?
Ooh, right. Will update the doc in the last patch.
> Minor nit. Same as mentioned in the previous patch freezer_apply_state
> should get enum freezer_state_flags state parameter.
But it isn't an enum value.
Thanks.
--
tejun
On Thu 08-11-12 06:39:52, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Nov 08, 2012 at 11:39:28AM +0100, Michal Hocko wrote:
> > On Sat 03-11-12 01:38:32, Tejun Heo wrote:
> > > freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
> > > As the scheduled full hierarchy support requires more than one
> > > freezing condition, switch it to mask of flags. If FREEZING is not
> > > set, it's thawed. FREEZING is set if freezing or frozen. If frozen,
> > > both FREEZING and FROZEN are set. Now that tasks can be attached to
> > > an already frozen cgroup, this also makes freezing condition checks
> > > more natural.
> > >
> > > This patch doesn't introduce any behavior change.
> > >
> > > Signed-off-by: Tejun Heo <[email protected]>
> >
> > I think it would be nicer to use freezer_state_flags enum rather than
> > unsigned int for the state. I would even expect gcc to complain about
> > that but it looks like -fstrict-enums is c++ specific (so long enum
> > safety).
>
> But if you use it as flag bits, the resulting value no longer is
> inside the defined enum values. Isn't that weird?
Ahh, there is always CGROUP_FREEZER_ONLINE which would overcomplicate
the possible and valid masks.
Please ignore this...
--
Michal Hocko
SUSE Labs
On Thu 08-11-12 06:18:48, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Nov 08, 2012 at 03:08:52PM +0100, Michal Hocko wrote:
> > This seems to be racy because parent->state access is not linearized.
> > Say we have parallel freeze and thawing on a tree like the following:
> > A
> > |
> > B
> > |
> > C
> >
> > pre_order will visit them in B, C order.
> > CPU1 CPU2
> > freezer_apply_state(A, true)
> > A->state & FREEZING == true freezer_apply_state(A, false)
> > A->state & FREEZING == false
> > freezer_apply_state(B, false)
> > B->state & FREEZING == false
> > freezer_apply_state(B, true)
> >
> > B->state & FREEZING == true
> > freezer_apply_state(C, true)
> > freezer_apply_state(C, false)
> >
> > So A, C are thawed while B is frozen. Or am I missing something which
> > would prevent from this kind of race?
>
> The rule is that there will be at least one inheritance operation
> after a parent is updated. The exact order of propagation doesn't
> matter as long as there's at least one inheritance event after the
> latest update to a parent. This works because inheritance operations
> are atomic to each other. If one inheritance operation "saw" an
> update to its parent, the next inheritance operation is guaranteed to
> see at least upto that update.
>
> So, in the above example in CPU2, (B->state & FREEZING) test and
> freezer_apply_state(C, false) can't be interleaved with the same
> inheritance operation from CPU1. They either happen before or after.
I am not sure I understand what you mean by inheritance operation, but
you are right that the race is not possible because spin_lock makes
sure that Foo->state is read after lock(Foo->child) is taken and
spin_unlock then serves as a release barrier so the other CPUs will see
the correctly updated value. The rest is handled by the fixed-order
tree walk. So there is really no interleaving going on here.
Sorry about the confusion!
Feel free to add my Reviewed-by.
--
Michal Hocko
SUSE Labs
Hey, Michal.
On Thu, Nov 08, 2012 at 04:20:39PM +0100, Michal Hocko wrote:
> > So, in the above example in CPU2, (B->state & FREEZING) test and
> > freezer_apply_state(C, false) can't be interleaved with the same
> > inheritance operation from CPU1. They either happen before or after.
>
> I am not sure I understand what you mean by inheritance operation but
The operation of looking at one's parent and inheriting the state.
Here, checking parent->state and calling apply_state accordingly.
> you are right that the race is not possible because spin_lock makes
> sure that Foo->state is done after the lock(Foo->child) and spin_unlock
> then serves as a leave barrier so the other CPUs will see the correctly
> updated value. The rest is handled by the fixed ordered tree walk. So
> there is really no interleaving going on here.
I'm worried that this is a bit too subtle. This will work fine with a
single hierarchy mutex protecting hierarchy updates and state
propagations through it and that should work for most controllers too.
I want to have at least one example of finer grained locking for
future reference and cgroup_freezer happened to be the one I started
working on. So, it is more complicated (not necessarily in the written
code but in the sync rules) than strictly necessary. I'm still not sure
whether to keep it or not. I'll add more documentation about synchronization
in cgroup_for_each_descendant_pre() either way.
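For reference, the coarse-grained variant would look schematically like
the following (hypothetical code, not what this series does; the real
freezer would still need its per-freezer spinlock to synchronize with
task attach/fork):

/* one mutex serializing online/offline and every state propagation */
static DEFINE_MUTEX(freezer_mutex);

static void coarse_change_state(struct freezer *freezer, bool freeze)
{
	struct cgroup *pos;

	mutex_lock(&freezer_mutex);
	/* update @freezer itself, then push the state down the tree */
	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
		/*
		 * No per-descendant locking subtleties here - the mutex
		 * already orders this walk against cgroup online/offline
		 * and against any other writer.
		 */
	}
	rcu_read_unlock();
	mutex_unlock(&freezer_mutex);
}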
Thanks.
--
tejun
Hello, Kamezawa.
On Thu, Nov 08, 2012 at 01:48:25PM +0900, Kamezawa Hiroyuki wrote:
> Could you explain what 'online' means here again, rather than changelog ?
> BTW, 'online' is a shared concept, between post_create() and pre_destroy(), among
> developers? Is it new?
I'm prepping a patch to rename create, post_create, pre_destroy,
destroy operations to allocate, online, offline, free, so yeah the
terms and concepts will be used for all of cgroup. I'll update the
docs in that patch.
Thanks!
--
tejun
On Thu 08-11-12 07:29:23, Tejun Heo wrote:
> Hey, Michal.
>
> On Thu, Nov 08, 2012 at 04:20:39PM +0100, Michal Hocko wrote:
> > > So, in the above example in CPU2, (B->state & FREEZING) test and
> > > freezer_apply_state(C, false) can't be interleaved with the same
> > > inheritance operation from CPU1. They either happen before or after.
> >
> > I am not sure I understand what you mean by inheritance operation but
>
> The operation of looking at one's parent and inheriting the state.
> Here, checking parent->state and calling apply_state accordingly.
>
> > you are right that the race is not possible because spin_lock makes
> > sure that Foo->state is done after the lock(Foo->child) and spin_unlock
> > then serves as a leave barrier so the other CPUs will see the correctly
> > updated value. The rest is handled by the fixed ordered tree walk. So
> > there is really no interleaving going on here.
>
> I'm worried that this is a bit too subtle.
Dunno, it looks obvious now; I just missed the implicit acquire/release
barriers of the spinlocks, and again sorry about the confusion.
> This will work fine with a single hierarchy mutex protecting hierarchy
> updates and state propagations through it and that should work for
> most controllers too.
But a single mutex is just ugly.
> I want to have at least one example of finer grained locking for
> future reference and cgroup_freezer happened to be the one I started
> working on.
I think this is a good example because it shows how to share the state
without too many implementation details.
> So, it is more complicated (not necessarily in written code but the
> sync rules) than necessary. I'm still not sure whether to keep it or
> not.
I think the locking is fine and I would keep it this way rather than a
big lock.
> I'll add more documentation about synchronization in
> cgroup_for_each_descendant_pre() either way.
more doc cannot hurt ;)
Thanks
--
Michal Hocko
SUSE Labs
Hey, Michal.
On Thu, Nov 08, 2012 at 10:50:13AM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:29, Tejun Heo wrote:
> > Currently, cgroup doesn't provide any generic helper for walking a
> > given cgroup's children or descendants. This patch adds the following
> > three macros.
> >
> > * cgroup_for_each_child() - walk immediate children of a cgroup.
> >
> > * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> > in pre-order tree traversal.
> >
> > * cgroup_for_each_descendant_post() - visit all descendants of a
> > cgroup in post-order tree traversal.
> >
> > All three only require the user to hold RCU read lock during
> > traversal. Verifying that each iterated cgroup is online is the
> > responsibility of the user. When used with proper synchronization,
> > cgroup_for_each_descendant_pre() can be used to propagate config
> > updates to descendants in reliable way. See comments for details.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
>
> I will convert mem_cgroup_iter to use this rather than css_get_next
> after this gets into the next tree so that it can fly via Andrew.
>
> Reviewed-by: Michal Hocko <[email protected]>
>
> Just a minor nit. You are talking about a config propagation while I
> would consider state propagation more clear and less confusing. Config
> is usually stable enough so that post_create is not necessary for
> syncing (e.g. memcg.swappiness). It is a state which must be consistent
> throughout the hierarchy which matters here.
Did s/config/state/g
Thanks.
--
tejun
Hello,
On Thu, Nov 08, 2012 at 02:23:06PM +0100, Michal Hocko wrote:
> On Sat 03-11-12 01:38:34, Tejun Heo wrote:
> > A cgroup is online and visible to iteration between ->post_create()
> > and ->pre_destroy(). This patch introduces CGROUP_FREEZER_ONLINE and
> > toggles it from the newly added freezer_post_create() and
> > freezer_pre_destroy() while holding freezer->lock such that a
> > cgroup_freezer can be reliably distinguished to be online. This will
> > be used by full hierarchy support.
>
> I am wondering whether freezer_pre_destroy is really needed. Once we
> reach pre_destroy there are no tasks nor any children left in the
> group, so there is nobody to wake up if the group was frozen, and the
> destroy callback is called after synchronize_rcu, so the traversal
> should be safe.
Yeah, it might be true, but I'd still like to keep the offlining in
->pre_destroy() so that it's symmetrical w/ ->post_create(). I'll
rename and document the ops so that the roles of each are clearer.
Thanks.
--
tejun
Up until now, cgroup_freezer didn't implement hierarchy properly.
cgroups could be arranged in hierarchy but it didn't make any
difference in how each cgroup_freezer behaved. They all operated
separately.
This patch implements proper hierarchy support. If a cgroup is
frozen, all its descendants are frozen. A cgroup is thawed iff it and
all its ancestors are THAWED. freezer.self_freezing shows the current
freezing state for the cgroup itself. freezer.parent_freezing shows
whether the cgroup is freezing because any of its ancestors is
freezing.
freezer_post_create() locks the parent and new cgroup and inherits the
parent's state and freezer_change_state() applies new state top-down
using cgroup_for_each_descendant_pre() which guarantees that no child
can escape its parent's state. update_if_frozen() uses
cgroup_for_each_descendant_post() to propagate frozen states
bottom-up.
Synchronization could be coarser and easier by using a single mutex to
protect all hierarchy operations. A finer-grained approach was used
because it wasn't too difficult for cgroup_freezer, I think it's
beneficial to have an example implementation, and cgroup_freezer is
rather simple and can serve as a good one.
As this makes cgroup_freezer properly hierarchical,
freezer_subsys.broken_hierarchy marking is removed.
Note that this patch changes userland visible behavior - freezing a
cgroup now freezes all its descendants too. This behavior change is
intended and has been warned via .broken_hierarchy.
v2: Michal spotted a bug in freezer_change_state() - descendants were
inheriting from the wrong ancestor. Fixed.
v3: Documentation/cgroups/freezer-subsystem.txt updated.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
Documentation/cgroups/freezer-subsystem.txt | 63 +++++++---
kernel/cgroup_freezer.c | 161 +++++++++++++++++++++-------
2 files changed, 165 insertions(+), 59 deletions(-)
--- a/Documentation/cgroups/freezer-subsystem.txt
+++ b/Documentation/cgroups/freezer-subsystem.txt
@@ -49,13 +49,49 @@ prevent the freeze/unfreeze cycle from b
being frozen. This allows the bash example above and gdb to run as
expected.
-The freezer subsystem in the container filesystem defines a file named
-freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the
-cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup.
-Reading will return the current state.
+The cgroup freezer is hierarchical. Freezing a cgroup freezes all
+tasks belonging to the cgroup and all its descendant cgroups. Each
+cgroup has its own state (self-state) and the state inherited from the
+parent (parent-state). Iff both states are THAWED, the cgroup is
+THAWED.
-Note freezer.state doesn't exist in root cgroup, which means root cgroup
-is non-freezable.
+The following cgroupfs files are created by cgroup freezer.
+
+* freezer.state: Read-write.
+
+ When read, returns the effective state of the cgroup - "THAWED",
+ "FREEZING" or "FROZEN". This is the combined self and parent-states.
+ If any is freezing, the cgroup is freezing (FREEZING or FROZEN).
+
+ FREEZING cgroup transitions into FROZEN state when all tasks
+ belonging to the cgroup and its descendants become frozen. Note that
+ a cgroup reverts to FREEZING from FROZEN after a new task is added
+ to the cgroup or one of its descendant cgroups until the new task is
+ frozen.
+
+ When written, sets the self-state of the cgroup. Two values are
+ allowed - "FROZEN" and "THAWED". If FROZEN is written, the cgroup,
+ if not already freezing, enters FREEZING state along with all its
+ descendant cgroups.
+
+ If THAWED is written, the self-state of the cgroup is changed to
+ THAWED. Note that the effective state may not change to THAWED if
+ the parent-state is still freezing. If a cgroup's effective state
+ becomes THAWED, all its descendants which are freezing because of
+ the cgroup also leave the freezing state.
+
+* freezer.self_freezing: Read only.
+
+ Shows the self-state. 0 if the self-state is THAWED; otherwise, 1.
+ This value is 1 iff the last write to freezer.state was "FROZEN".
+
+* freezer.parent_freezing: Read only.
+
+ Shows the parent-state. 0 if none of the cgroup's ancestors is
+ frozen; otherwise, 1.
+
+The root cgroup is non-freezable and the above interface files don't
+exist.
* Examples of usage :
@@ -85,18 +121,3 @@ to unfreeze all tasks in the container :
This is the basic mechanism which should do the right thing for user space task
in a simple scenario.
-
-It's important to note that freezing can be incomplete. In that case we return
-EBUSY. This means that some tasks in the cgroup are busy doing something that
-prevents us from completely freezing the cgroup at this time. After EBUSY,
-the cgroup will remain partially frozen -- reflected by freezer.state reporting
-"FREEZING" when read. The state will remain "FREEZING" until one of these
-things happens:
-
- 1) Userspace cancels the freezing operation by writing "THAWED" to
- the freezer.state file
- 2) Userspace retries the freezing operation by writing "FROZEN" to
- the freezer.state file (writing "FREEZING" is not legal
- and returns EINVAL)
- 3) The tasks that blocked the cgroup from entering the "FROZEN"
- state disappear from the cgroup's set of tasks.
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -22,6 +22,13 @@
#include <linux/freezer.h>
#include <linux/seq_file.h>
+/*
+ * A cgroup is freezing if any FREEZING flags are set. FREEZING_SELF is
+ * set if "FROZEN" is written to freezer.state cgroupfs file, and cleared
+ * for "THAWED". FREEZING_PARENT is set if the parent freezer is FREEZING
+ * for whatever reason. IOW, a cgroup has FREEZING_PARENT set if one of
+ * its ancestors has FREEZING_SELF set.
+ */
enum freezer_state_flags {
CGROUP_FREEZER_ONLINE = (1 << 0), /* freezer is fully online */
CGROUP_FREEZING_SELF = (1 << 1), /* this freezer is freezing */
@@ -50,6 +57,15 @@ static inline struct freezer *task_freez
struct freezer, css);
}
+static struct freezer *parent_freezer(struct freezer *freezer)
+{
+ struct cgroup *pcg = freezer->css.cgroup->parent;
+
+ if (pcg)
+ return cgroup_freezer(pcg);
+ return NULL;
+}
+
bool cgroup_freezing(struct task_struct *task)
{
bool ret;
@@ -74,17 +90,6 @@ static const char *freezer_state_strs(un
return "THAWED";
};
-/*
- * State diagram
- * Transitions are caused by userspace writes to the freezer.state file.
- * The values in parenthesis are state labels. The rest are edge labels.
- *
- * (THAWED) --FROZEN--> (FREEZING) --FROZEN--> (FROZEN)
- * ^ ^ | |
- * | \_______THAWED_______/ |
- * \__________________________THAWED____________/
- */
-
struct cgroup_subsys freezer_subsys;
static struct cgroup_subsys_state *freezer_create(struct cgroup *cgroup)
@@ -103,15 +108,34 @@ static struct cgroup_subsys_state *freez
* freezer_post_create - commit creation of a freezer cgroup
* @cgroup: cgroup being created
*
- * We're committing to creation of @cgroup. Mark it online.
+ * We're committing to creation of @cgroup. Mark it online and inherit
+ * parent's freezing state while holding both parent's and our
+ * freezer->lock.
*/
static void freezer_post_create(struct cgroup *cgroup)
{
struct freezer *freezer = cgroup_freezer(cgroup);
+ struct freezer *parent = parent_freezer(freezer);
+
+ /*
+ * The following double locking and freezing state inheritance
+ * guarantee that @cgroup can never escape ancestors' freezing
+ * states. See cgroup_for_each_descendant_pre() for details.
+ */
+ if (parent)
+ spin_lock_irq(&parent->lock);
+ spin_lock_nested(&freezer->lock, SINGLE_DEPTH_NESTING);
- spin_lock_irq(&freezer->lock);
freezer->state |= CGROUP_FREEZER_ONLINE;
- spin_unlock_irq(&freezer->lock);
+
+ if (parent && (parent->state & CGROUP_FREEZING)) {
+ freezer->state |= CGROUP_FREEZING_PARENT | CGROUP_FROZEN;
+ atomic_inc(&system_freezing_cnt);
+ }
+
+ spin_unlock(&freezer->lock);
+ if (parent)
+ spin_unlock_irq(&parent->lock);
}
/**
@@ -153,6 +177,7 @@ static void freezer_attach(struct cgroup
{
struct freezer *freezer = cgroup_freezer(new_cgrp);
struct task_struct *task;
+ bool clear_frozen = false;
spin_lock_irq(&freezer->lock);
@@ -172,10 +197,25 @@ static void freezer_attach(struct cgroup
} else {
freeze_task(task);
freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = true;
}
}
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Propagate FROZEN clearing upwards. We may race with
+ * update_if_frozen(), but as long as both work bottom-up, either
+ * update_if_frozen() sees child's FROZEN cleared or we clear the
+ * parent's FROZEN later. No parent w/ !FROZEN children can be
+ * left FROZEN.
+ */
+ while (clear_frozen && (freezer = parent_freezer(freezer))) {
+ spin_lock_irq(&freezer->lock);
+ freezer->state &= ~CGROUP_FROZEN;
+ clear_frozen = freezer->state & CGROUP_FREEZING;
+ spin_unlock_irq(&freezer->lock);
+ }
}
static void freezer_fork(struct task_struct *task)
@@ -200,24 +240,47 @@ out:
rcu_read_unlock();
}
-/*
- * We change from FREEZING to FROZEN lazily if the cgroup was only
- * partially frozen when we exitted write. Caller must hold freezer->lock.
+/**
+ * update_if_frozen - update whether a cgroup finished freezing
+ * @cgroup: cgroup of interest
+ *
+ * Once FREEZING is initiated, transition to FROZEN is lazily updated by
+ * calling this function. If the current state is FREEZING but not FROZEN,
+ * this function checks whether all tasks of this cgroup and the descendant
+ * cgroups finished freezing and, if so, sets FROZEN.
+ *
+ * The caller is responsible for grabbing RCU read lock and calling
+ * update_if_frozen() on all descendants prior to invoking this function.
*
* Task states and freezer state might disagree while tasks are being
* migrated into or out of @cgroup, so we can't verify task states against
* @freezer state here. See freezer_attach() for details.
*/
-static void update_if_frozen(struct freezer *freezer)
+static void update_if_frozen(struct cgroup *cgroup)
{
- struct cgroup *cgroup = freezer->css.cgroup;
+ struct freezer *freezer = cgroup_freezer(cgroup);
+ struct cgroup *pos;
struct cgroup_iter it;
struct task_struct *task;
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ spin_lock_irq(&freezer->lock);
+
if (!(freezer->state & CGROUP_FREEZING) ||
(freezer->state & CGROUP_FROZEN))
- return;
+ goto out_unlock;
+ /* are all (live) children frozen? */
+ cgroup_for_each_child(pos, cgroup) {
+ struct freezer *child = cgroup_freezer(pos);
+
+ if ((child->state & CGROUP_FREEZER_ONLINE) &&
+ !(child->state & CGROUP_FROZEN))
+ goto out_unlock;
+ }
+
+ /* are all tasks frozen? */
cgroup_iter_start(cgroup, &it);
while ((task = cgroup_iter_next(cgroup, &it))) {
@@ -229,27 +292,32 @@ static void update_if_frozen(struct free
* the usual frozen condition.
*/
if (!frozen(task) && !freezer_should_skip(task))
- goto notyet;
+ goto out_iter_end;
}
}
freezer->state |= CGROUP_FROZEN;
-notyet:
+out_iter_end:
cgroup_iter_end(cgroup, &it);
+out_unlock:
+ spin_unlock_irq(&freezer->lock);
}
static int freezer_read(struct cgroup *cgroup, struct cftype *cft,
struct seq_file *m)
{
- struct freezer *freezer = cgroup_freezer(cgroup);
- unsigned int state;
+ struct cgroup *pos;
- spin_lock_irq(&freezer->lock);
- update_if_frozen(freezer);
- state = freezer->state;
- spin_unlock_irq(&freezer->lock);
+ rcu_read_lock();
- seq_puts(m, freezer_state_strs(state));
+ /* update states bottom-up */
+ cgroup_for_each_descendant_post(pos, cgroup)
+ update_if_frozen(pos);
+ update_if_frozen(cgroup);
+
+ rcu_read_unlock();
+
+ seq_puts(m, freezer_state_strs(cgroup_freezer(cgroup)->state));
seq_putc(m, '\n');
return 0;
}
@@ -320,14 +388,39 @@ static void freezer_apply_state(struct f
* @freezer: freezer of interest
* @freeze: whether to freeze or thaw
*
- * Freeze or thaw @cgroup according to @freeze.
+ * Freeze or thaw @freezer according to @freeze. The operations are
+ * recursive - all descendants of @freezer will be affected.
*/
static void freezer_change_state(struct freezer *freezer, bool freeze)
{
+ struct cgroup *pos;
+
/* update @freezer */
spin_lock_irq(&freezer->lock);
freezer_apply_state(freezer, freeze, CGROUP_FREEZING_SELF);
spin_unlock_irq(&freezer->lock);
+
+ /*
+ * Update all its descendants in pre-order traversal. Each
+ * descendant will try to inherit its parent's FREEZING state as
+ * CGROUP_FREEZING_PARENT.
+ */
+ rcu_read_lock();
+ cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
+ struct freezer *pos_f = cgroup_freezer(pos);
+ struct freezer *parent = parent_freezer(pos_f);
+
+ /*
+ * Our update to @parent->state is already visible which is
+ * all we need. No need to lock @parent. For more info on
+ * synchronization, see freezer_post_create().
+ */
+ spin_lock_irq(&pos_f->lock);
+ freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
+ CGROUP_FREEZING_PARENT);
+ spin_unlock_irq(&pos_f->lock);
+ }
+ rcu_read_unlock();
}
static int freezer_write(struct cgroup *cgroup, struct cftype *cft,
@@ -390,12 +483,4 @@ struct cgroup_subsys freezer_subsys = {
.attach = freezer_attach,
.fork = freezer_fork,
.base_cftypes = files,
-
- /*
- * freezer subsys doesn't handle hierarchy at all. Frozen state
- * should be inherited through the hierarchy - if a parent is
- * frozen, all its children should be frozen. Fix it and remove
- * the following.
- */
- .broken_hierarchy = true,
};
Currently, cgroup doesn't provide any generic helper for walking a
given cgroup's children or descendants. This patch adds the following
three macros.
* cgroup_for_each_child() - walk immediate children of a cgroup.
* cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
in pre-order tree traversal.
* cgroup_for_each_descendant_post() - visit all descendants of a
cgroup in post-order tree traversal.
All three only require the user to hold RCU read lock during
traversal. Verifying that each iterated cgroup is online is the
responsibility of the user. When used with proper synchronization,
cgroup_for_each_descendant_pre() can be used to propagate state
updates to descendants in a reliable way. See comments for details.
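For example, a controller could drive the pre-order walk roughly like
this (sketch only; update_one() is a hypothetical per-cgroup helper -
the cgroup_freezer patches later in the series do exactly this with
their own locking):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, cgrp) {
		/*
		 * The subsystem must verify that @pos is still online,
		 * typically under its own lock, before acting on it.
		 */
		update_one(pos);
	}
	rcu_read_unlock();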
v2: s/config/state/ in commit message and comments per Michal. More
documentation on synchronization rules.
Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
include/linux/cgroup.h | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup.c | 86 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 180 insertions(+)
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -534,6 +534,100 @@ static inline struct cgroup* task_cgroup
return task_subsys_state(task, subsys_id)->cgroup;
}
+/**
+ * cgroup_for_each_child - iterate through children of a cgroup
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose children to walk
+ *
+ * Walk @cgroup's children. Must be called under rcu_read_lock(). A child
+ * cgroup which hasn't finished ->post_create() or already has finished
+ * ->pre_destroy() may show up during traversal and it's each subsystem's
+ * responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, a cgroup which finished ->post_create()
+ * is guaranteed to be visible in the future iterations.
+ */
+#define cgroup_for_each_child(pos, cgroup) \
+ list_for_each_entry_rcu(pos, &(cgroup)->children, sibling)
+
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+ struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_pre - pre-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Walk @cgroup's descendants. Must be called under rcu_read_lock(). A
+ * descendant cgroup which hasn't finished ->post_create() or already has
+ * finished ->pre_destroy() may show up during traversal and it's each
+ * subsystem's responsibility to verify that each @pos is alive.
+ *
+ * If a subsystem synchronizes against the parent in its ->post_create()
+ * and before starting iterating, and synchronizes against @pos on each
+ * iteration, any descendant cgroup which finished ->post_create() is
+ * guaranteed to be visible in the future iterations.
+ *
+ * In other words, the following guarantees that a descendant can't escape
+ * state updates of its ancestors.
+ *
+ * my_post_create(@cgrp)
+ * {
+ * Lock @cgrp->parent and @cgrp;
+ * Inherit state from @cgrp->parent;
+ * Unlock both.
+ * }
+ *
+ * my_update_state(@cgrp)
+ * {
+ * Lock @cgrp;
+ * Update @cgrp's state;
+ * Unlock @cgrp;
+ *
+ * cgroup_for_each_descendant_pre(@pos, @cgrp) {
+ * Lock @pos;
+ * Verify @pos is alive and inherit state from @pos->parent;
+ * Unlock @pos;
+ * }
+ * }
+ *
+ * As long as the inheriting step, including checking the parent state, is
+ * enclosed inside @pos locking, double-locking the parent isn't necessary
+ * while inheriting. The state update to the parent is guaranteed to be
+ * visible by walking order and, as long as inheriting operations to the
+ * same @pos are atomic to each other, multiple updates racing each other
+ * still result in the correct state. It's guaranteed that at least one
+ * inheritance happens for any cgroup after the latest update to its
+ * parent.
+ *
+ * If checking parent's state requires locking the parent, each inheriting
+ * iteration should lock and unlock both @pos->parent and @pos.
+ *
+ * Alternatively, a subsystem may choose to use a single global lock to
+ * synchronize ->post_create() and ->pre_destroy() against tree-walking
+ * operations.
+ */
+#define cgroup_for_each_descendant_pre(pos, cgroup) \
+ for (pos = cgroup_next_descendant_pre(NULL, (cgroup)); (pos); \
+ pos = cgroup_next_descendant_pre((pos), (cgroup)))
+
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+ struct cgroup *cgroup);
+
+/**
+ * cgroup_for_each_descendant_post - post-order walk of a cgroup's descendants
+ * @pos: the cgroup * to use as the loop cursor
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * Similar to cgroup_for_each_descendant_pre() but performs post-order
+ * traversal instead. Note that the walk visibility guarantee described in
+ * pre-order walk doesn't apply the same to post-order walks.
+ */
+#define cgroup_for_each_descendant_post(pos, cgroup) \
+ for (pos = cgroup_next_descendant_post(NULL, (cgroup)); (pos); \
+ pos = cgroup_next_descendant_post((pos), (cgroup)))
+
/* A cgroup_iter should be treated as an opaque object */
struct cgroup_iter {
struct list_head *cg_link;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2984,6 +2984,92 @@ static void cgroup_enable_task_cg_lists(
write_unlock(&css_set_lock);
}
+/**
+ * cgroup_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_pre(). Find the next
+ * descendant to visit for pre-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_pre(struct cgroup *pos,
+ struct cgroup *cgroup)
+{
+ struct cgroup *next;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* if first iteration, pretend we just visited @cgroup */
+ if (!pos) {
+ if (list_empty(&cgroup->children))
+ return NULL;
+ pos = cgroup;
+ }
+
+ /* visit the first child if exists */
+ next = list_first_or_null_rcu(&pos->children, struct cgroup, sibling);
+ if (next)
+ return next;
+
+ /* no child, visit my or the closest ancestor's next sibling */
+ do {
+ next = list_entry_rcu(pos->sibling.next, struct cgroup,
+ sibling);
+ if (&next->sibling != &pos->parent->children)
+ return next;
+
+ pos = pos->parent;
+ } while (pos != cgroup);
+
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_pre);
+
+static struct cgroup *cgroup_leftmost_descendant(struct cgroup *pos)
+{
+ struct cgroup *last;
+
+ do {
+ last = pos;
+ pos = list_first_or_null_rcu(&pos->children, struct cgroup,
+ sibling);
+ } while (pos);
+
+ return last;
+}
+
+/**
+ * cgroup_next_descendant_post - find the next descendant for post-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @cgroup: cgroup whose descendants to walk
+ *
+ * To be used by cgroup_for_each_descendant_post(). Find the next
+ * descendant to visit for post-order traversal of @cgroup's descendants.
+ */
+struct cgroup *cgroup_next_descendant_post(struct cgroup *pos,
+ struct cgroup *cgroup)
+{
+ struct cgroup *next;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* if first iteration, visit the leftmost descendant */
+ if (!pos) {
+ next = cgroup_leftmost_descendant(cgroup);
+ return next != cgroup ? next : NULL;
+ }
+
+ /* if there's an unvisited sibling, visit its leftmost descendant */
+ next = list_entry_rcu(pos->sibling.next, struct cgroup, sibling);
+ if (&next->sibling != &pos->parent->children)
+ return cgroup_leftmost_descendant(next);
+
+ /* no sibling left, visit parent */
+ next = pos->parent;
+ return next != cgroup ? next : NULL;
+}
+EXPORT_SYMBOL_GPL(cgroup_next_descendant_post);
+
void cgroup_iter_start(struct cgroup *cgrp, struct cgroup_iter *it)
__acquires(css_set_lock)
{
On Sat, Nov 03, 2012 at 01:38:26AM -0700, Tejun Heo wrote:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup_freezer-hierarchy
Updated patches posted as replies to the original patches and the
above git branch updated with the updated patches. As all the updates
are minor, I won't re-post the whole series.
Thanks.
--
tejun
On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> Up until now, cgroup_freezer didn't implement hierarchy properly.
> cgroups could be arranged in hierarchy but it didn't make any
> difference in how each cgroup_freezer behaved. They all operated
> separately.
>
> This patch implements proper hierarchy support. If a cgroup is
> frozen, all its descendants are frozen. A cgroup is thawed iff it and
> all its ancestors are THAWED. freezer.self_freezing shows the current
> freezing state for the cgroup itself. freezer.parent_freezing shows
> whether the cgroup is freezing because any of its ancestors is
> freezing.
>
> freezer_post_create() locks the parent and new cgroup and inherits the
> parent's state and freezer_change_state() applies new state top-down
> using cgroup_for_each_descendant_pre() which guarantees that no child
> can escape its parent's state. update_if_frozen() uses
> cgroup_for_each_descendant_post() to propagate frozen states
> bottom-up.
>
> Synchronization could be coarser and easier by using a single mutex to
> protect all hierarchy operations. A finer-grained approach was used
> because it wasn't too difficult for cgroup_freezer, I think it's
> beneficial to have an example implementation, and cgroup_freezer is
> rather simple and can serve as a good one.
>
> As this makes cgroup_freezer properly hierarchical,
> freezer_subsys.broken_hierarchy marking is removed.
>
> Note that this patch changes userland visible behavior - freezing a
> cgroup now freezes all its descendants too. This behavior change is
> intended and has been warned via .broken_hierarchy.
>
> v2: Michal spotted a bug in freezer_change_state() - descendants were
> inheriting from the wrong ancestor. Fixed.
>
> v3: Documentation/cgroups/freezer-subsystem.txt updated.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
You probably meant Reviewed-by: Michal Hocko .... ;)
--
Michal Hocko
SUSE Labs
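To make the top-down propagation described above concrete, it boils down to
roughly the following condensed sketch (not the patch verbatim; freezer_apply_state()
and the CGROUP_FREEZING* flag names are stand-ins for the series' helpers, and
cgroup_freezer() is the usual cgroup-to-freezer accessor):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, freezer->css.cgroup) {
		struct freezer *pos_f = cgroup_freezer(pos);
		struct freezer *parent = cgroup_freezer(pos->parent);

		spin_lock_irq(&pos_f->lock);
		/* a descendant is FREEZING whenever its parent is FREEZING */
		freezer_apply_state(pos_f, parent->state & CGROUP_FREEZING,
				    CGROUP_FREEZING_PARENT);
		spin_unlock_irq(&pos_f->lock);
	}
	rcu_read_unlock();

Because the walk is pre-order, a parent is always updated before any of its
children, so no child can observe a stale parent state and escape freezing.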
On Thu, Nov 08, 2012 at 07:02:46PM +0100, Michal Hocko wrote:
> On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reviewed-by: Tejun Heo <[email protected]>
>
> You probably meant Reviewed-by: Michal Hocko .... ;)
Hehehheheh... man, I'm too self-absolved. Thanks for noticing
that. :) Updated.
--
tejun
On Thu 08-11-12 10:04:17, Tejun Heo wrote:
> On Thu, Nov 08, 2012 at 07:02:46PM +0100, Michal Hocko wrote:
> > On Thu 08-11-12 09:57:50, Tejun Heo wrote:
> > > Signed-off-by: Tejun Heo <[email protected]>
> > > Reviewed-by: Tejun Heo <[email protected]>
> >
> > You probably meant Reviewed-by: Michal Hocko .... ;)
>
> Hehehheheh... man, I'm too self-absolved. Thanks for noticing
> that. :) Updated.
Heh, btw. the doc update looks good as well.
--
Michal Hocko
SUSE Labs
Subject: cgroup: add cgroup_subsys->post_create()
Currently, there's no way for a controller to find out whether a new
cgroup finished all ->create() allocations successfully and is
considered "live" by cgroup.
This becomes a problem later when we add generic descendant walking
to cgroup, which can be used by controllers, as controllers don't have a
synchronization point where they can synchronize against new cgroups
appearing in such walks.
This patch adds ->post_create(). It's called after all ->create()
invocations have succeeded and the cgroup is linked into the generic
cgroup hierarchy. It plays the counterpart of ->pre_destroy().
When used in combination with the to-be-added generic descendant
iterators, ->post_create() can be used to implement reliable state
inheritance. It will be explained with the descendant iterators.
v2: Added a paragraph about its future use w/ descendant iterators per
Michal.
v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
Fixed.
Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: Glauber Costa <[email protected]>
---
Oops, forgot to update cgroup_load_subsys(). Hate that it's a
different path from cgroup_init_subsys(). :(
Thanks.
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 15 +++++++++++++--
2 files changed, 14 insertions(+), 2 deletions(-)
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -438,6 +438,7 @@ int cgroup_taskset_size(struct cgroup_ta
struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
+ void (*post_create)(struct cgroup *cgrp);
void (*pre_destroy)(struct cgroup *cgrp);
void (*destroy)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4059,10 +4059,15 @@ static long cgroup_create(struct cgroup
if (err < 0)
goto err_remove;
- /* each css holds a ref to the cgroup's dentry */
- for_each_subsys(root, ss)
+ for_each_subsys(root, ss) {
+ /* each css holds a ref to the cgroup's dentry */
dget(dentry);
+ /* creation succeeded, notify subsystems */
+ if (ss->post_create)
+ ss->post_create(cgrp);
+ }
+
/* The cgroup directory was pre-locked for us */
BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
@@ -4280,6 +4285,9 @@ static void __init cgroup_init_subsys(st
ss->active = 1;
+ if (ss->post_create)
+ ss->post_create(&ss->root->top_cgroup);
+
/* this function shouldn't be used with modular subsystems, since they
* need to register a subsys_id, among other things */
BUG_ON(ss->module);
@@ -4389,6 +4397,9 @@ int __init_or_module cgroup_load_subsys(
ss->active = 1;
+ if (ss->post_create)
+ ss->post_create(&ss->root->top_cgroup);
+
/* success! */
mutex_unlock(&cgroup_mutex);
return 0;
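To illustrate how ->post_create() pairs with the descendant iterators, a
hypothetical controller could inherit its parent's state along these lines
(struct foo, cgroup_foo() and the field names are made-up for the sketch, not
from this patch):

	struct foo {
		spinlock_t	lock;
		int		state;
		bool		online;
	};

	static void foo_post_create(struct cgroup *cgrp)
	{
		struct foo *foo = cgroup_foo(cgrp);
		struct foo *parent;

		if (!cgrp->parent)	/* root cgroup, nothing to inherit */
			return;
		parent = cgroup_foo(cgrp->parent);

		/*
		 * Holding the parent's lock across the copy means a
		 * concurrent state change on the parent either sees this
		 * cgroup in its descendant walk or is fully reflected in
		 * the copied state.
		 */
		spin_lock_irq(&parent->lock);
		spin_lock_nested(&foo->lock, SINGLE_DEPTH_NESTING);
		foo->state = parent->state;
		foo->online = true;
		spin_unlock(&foo->lock);
		spin_unlock_irq(&parent->lock);
	}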
On 2012/11/9 3:07, Tejun Heo wrote:
> Subject: cgroup: add cgroup_subsys->post_create()
>
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocatinos successfully and is
> considered "live" by cgroup.
>
> This becomes a problem later when we add generic descendants walking
> to cgroup which can be used by controllers as controllers don't have a
> synchronization point where it can synchronize against new cgroups
> appearing in such walks.
>
> This patch adds ->post_create(). It's called after all ->create()
> succeeded and the cgroup is linked into the generic cgroup hierarchy.
> This plays the counterpart of ->pre_destroy().
>
> When used in combination with the to-be-added generic descendant
> iterators, ->post_create() can be used to implement reliable state
> inheritance. It will be explained with the descendant iterators.
>
> v2: Added a paragraph about its future use w/ descendant iterators per
> Michal.
>
> v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
> Fixed.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Glauber Costa <[email protected]>
Acked-by: Li Zefan <[email protected]>
On 2012/11/3 16:38, Tejun Heo wrote:
> Use RCU safe list operations for cgroup->children. This will be used
> to implement cgroup children / descendant walking which can be used by
> controllers.
>
> Note that cgroup_create() now puts a new cgroup at the end of the
> ->children list instead of head. This isn't strictly necessary but is
> done so that the iteration order is more conventional.
>
> Signed-off-by: Tejun Heo <[email protected]>
Acked-by: Li Zefan <[email protected]>
On 2012/11/9 1:59, Tejun Heo wrote:
> Currently, cgroup doesn't provide any generic helper for walking a
> given cgroup's children or descendants. This patch adds the following
> three macros.
>
> * cgroup_for_each_child() - walk immediate children of a cgroup.
>
> * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
> in pre-order tree traversal.
>
> * cgroup_for_each_descendant_post() - visit all descendants of a
> cgroup in post-order tree traversal.
>
> All three only require the user to hold RCU read lock during
> traversal. Verifying that each iterated cgroup is online is the
> responsibility of the user. When used with proper synchronization,
> cgroup_for_each_descendant_pre() can be used to propagate state
> updates to descendants in reliable way. See comments for details.
>
> v2: s/config/state/ in commit message and comments per Michal. More
> documentation on synchronization rules.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Michal Hocko <[email protected]>
I don't see anything wrong with the comment on cgroup_next_descendant_pre().
Acked-by: Li Zefan <[email protected]>
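A tiny sketch of the rule quoted above that verifying each iterated cgroup is
online is the user's responsibility; cgroup_foo() and the ->online flag (set
from ->post_create()) are hypothetical:

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_child(pos, cgrp) {
		struct foo *child = cgroup_foo(pos);

		spin_lock_irq(&child->lock);
		/* skip children whose ->create() hasn't fully finished yet */
		if (child->online)
			child->state = new_state;
		spin_unlock_irq(&child->lock);
	}
	rcu_read_unlock();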
Hi Tejun,
On 08.11.2012 20:07, Tejun Heo wrote:
> Subject: cgroup: add cgroup_subsys->post_create()
>
> Currently, there's no way for a controller to find out whether a new
> cgroup finished all ->create() allocatinos successfully and is
> considered "live" by cgroup.
I'd like to add hierarchy support to net_prio, and the first thing to
do is to get rid of get_prioidx(). It looks like it would be nice to be
able to use use_id and post_create() for this, but as I read the code
this might not work because the netdev might access resources allocated
between create() and post_create(). So my question is: would it make
sense to move
cgroup_create():
if (ss->use_id) {
err = alloc_css_id(ss, parent, cgrp);
if (err)
goto err_destroy;
}
part before create(), or to add some protection between create() and
the post_create() callback in net_prio? I have a patch, but I see
I could drop it completely if post_create() is there.
cheers,
daniel
From 84fbbdf0dc5d3460389e39a00a3ee553ee55b563 Mon Sep 17 00:00:00 2001
From: Daniel Wagner <[email protected]>
Date: Thu, 8 Nov 2012 17:17:21 +0100
Subject: [PATCH] cgroups: net_prio: Use IDR library to assign netprio index
get_prioidx() allocated a new id whenever it was called, and put_prioidx()
gave an id back so that get_prioidx() could reallocate it later on.
That is exactly what the IDR library offers, so let's use it.
Signed-off-by: Daniel Wagner <[email protected]>
---
net/core/netprio_cgroup.c | 51 +++++++++--------------------------------------
1 file changed, 9 insertions(+), 42 deletions(-)
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 847c02b..3c1b612 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -27,10 +27,7 @@
#include <linux/fdtable.h>
-#define PRIOIDX_SZ 128
-
-static unsigned long prioidx_map[PRIOIDX_SZ];
-static DEFINE_SPINLOCK(prioidx_map_lock);
+static DEFINE_IDA(netprio_ida);
static atomic_t max_prioidx = ATOMIC_INIT(0);
static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
@@ -39,34 +36,6 @@ static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgr
struct cgroup_netprio_state, css);
}
-static int get_prioidx(u32 *prio)
-{
- unsigned long flags;
- u32 prioidx;
-
- spin_lock_irqsave(&prioidx_map_lock, flags);
-	prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
- if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ) {
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
- return -ENOSPC;
- }
- set_bit(prioidx, prioidx_map);
- if (atomic_read(&max_prioidx) < prioidx)
- atomic_set(&max_prioidx, prioidx);
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
- *prio = prioidx;
- return 0;
-}
-
-static void put_prioidx(u32 idx)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&prioidx_map_lock, flags);
- clear_bit(idx, prioidx_map);
- spin_unlock_irqrestore(&prioidx_map_lock, flags);
-}
-
static int extend_netdev_table(struct net_device *dev, u32 new_len)
{
size_t new_size = sizeof(struct netprio_map) +
@@ -120,9 +89,9 @@ static struct cgroup_subsys_state *cgrp_create(struct cgroup *cgrp)
if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx)
goto out;
- ret = get_prioidx(&cs->prioidx);
- if (ret < 0) {
- pr_warn("No space in priority index array\n");
+ cs->prioidx = ida_simple_get(&netprio_ida, 0, 0, GFP_KERNEL);
+ if (cs->prioidx < 0) {
+ ret = cs->prioidx;
goto out;
}
@@ -146,7 +115,7 @@ static void cgrp_destroy(struct cgroup *cgrp)
map->priomap[cs->prioidx] = 0;
}
rtnl_unlock();
- put_prioidx(cs->prioidx);
+ ida_simple_remove(&netprio_ida, cs->prioidx);
kfree(cs);
}
@@ -284,12 +253,10 @@ struct cgroup_subsys net_prio_subsys = {
.module = THIS_MODULE,
/*
- * net_prio has artificial limit on the number of cgroups and
- * disallows nesting making it impossible to co-mount it with other
- * hierarchical subsystems. Remove the artificially low PRIOIDX_SZ
- * limit and properly nest configuration such that children follow
- * their parents' configurations by default and are allowed to
- * override and remove the following.
+ * net_prio has artificial limit on properly nest
+ * configuration such that children follow their parents'
+ * configurations by default and are allowed to override and
+ * remove the following.
*/
.broken_hierarchy = true,
};
--
1.7.11.7
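For context, the allocate/free pattern the patch switches to is the stock ida
API; a generic sketch (example_ida and the function names are example names,
not from the patch):

	static DEFINE_IDA(example_ida);

	static int example_get_index(void)
	{
		/* smallest free id in [0, INT_MAX]; may sleep with GFP_KERNEL */
		return ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
	}

	static void example_put_index(int id)
	{
		ida_simple_remove(&example_ida, id);
	}

A negative return from ida_simple_get() is the usual -errno and is simply handed
back up by the caller, which is what the patch does with cs->prioidx.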
On Sat, Nov 03, 2012 at 01:38:26AM -0700, Tejun Heo wrote:
> Hello,
>
> This patchset implement proper hierarchy support for cgroup_freezer as
> discussed in "[RFC] cgroup TODOs"[1].
Applied to cgroup/for-3.8. Rafael, I applied the cgroup_freezer
changes there too as there already are and will be more dependencies
between cgroup_freezer and cgroup changes, and the cgroup_freezer
changes don't really affect the rest of the freezer. If you'd like
them to be routed differently, please let me know.
Thanks.
--
tejun
Hey, Daniel.
On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
> cgroup_subsys->post_create()
> >
> > Currently, there's no way for a controller to find out whether a new
> > cgroup finished all ->create() allocatinos successfully and is
> > considered "live" by cgroup.
>
> I'd like add hierarchy support to net_prio and the first thing to
> do is to get rid of get_prioidx(). It looks like it would be nice to
Ooh, I'm already working on it. I *think* I should be able to post
the patches later today or early next week.
> be able to use use_id and post_create() for this but as I read the
> code this might not work because the netdev might access resources
> allocated between create() and post_create(). So my question is
> would it make sense to move
>
> cgroup_create():
>
> if (ss->use_id) {
> err = alloc_css_id(ss, parent, cgrp);
> if (err)
> goto err_destroy;
> }
>
> part before create() or add some protection between create() and
> post_create() callback in net_prio. I have a patch but I see
> I could drop it completely if post_create() is there.
Glauber had a similar question about css_id, and I need to think
more about it, but currently I think I want to phase out css IDs. It's
an id of the wrong thing (CSSes don't need IDs, cgroups do) and
unnecessarily duplicates its own hierarchy when the hierarchy of
cgroups already exists. Once memcontrol moves away from walking using
css_ids, I *think* I'll try to kill it.
I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
number) so that it can be used for cgroup indexing. Glauber, that
should solve your problem too, right?
Thanks.
--
tejun
On 11/09/2012 06:22 PM, Tejun Heo wrote:
> Hey, Daniel.
>
> On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
>> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
>> cgroup_subsys->post_create()
>>>
>>> Currently, there's no way for a controller to find out whether a new
>>> cgroup finished all ->create() allocatinos successfully and is
>>> considered "live" by cgroup.
>>
>> I'd like add hierarchy support to net_prio and the first thing to
>> do is to get rid of get_prioidx(). It looks like it would be nice to
>
> Ooh, I'm already working on it. I *think* I should be able to post
> the patches later today or early next week.
>
>> be able to use use_id and post_create() for this but as I read the
>> code this might not work because the netdev might access resources
>> allocated between create() and post_create(). So my question is
>> would it make sense to move
>>
>> cgroup_create():
>>
>> if (ss->use_id) {
>> err = alloc_css_id(ss, parent, cgrp);
>> if (err)
>> goto err_destroy;
>> }
>>
>> part before create() or add some protection between create() and
>> post_create() callback in net_prio. I have a patch but I see
>> I could drop it completely if post_create() is there.
>
> Glauber had about similar question about css_id and I need to think
> more about it but currently I think I want to phase out css IDs. It's
> an id of the wrong thing (CSSes don't need IDs, cgroups do) and
> unnecessarily duplicates its own hierarchy when the hierarchy of
> cgroups already exists. Once memcontrol moves away from walking using
> css_ids, I *think* I'll try to kill it.
May I suggest doing something similar to what the scheduler does? I
had some code in the past that reused that code - but basically
duplicated it. If you want, I can try getting a version of that into
kernel/cgroup.c that would serve as a general walker.
I like that walker a lot, because the walk happens in a sane order. memcg
basically walks in a random, weird order, which makes hierarchical
computation of anything quite hard.
>
> I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
> number) so that it can be used for cgroup indexing. Glauber, that
> should solve your problem too, right?
>
Actually I went with a totally orthogonal solution. I am now using ids
local to kmem-limited memcgs. Because they are not tied to the cgroup
creation workflow, I can allocate whenever it is more convenient.
I ended up liking this solution because it will do better in scenarios
where most of the memcgs are not kmem limited. So it had an edge here,
and also got rid of the create/post_create problem by breaking the
dependency.
But of course, if cgroups gained some kind of sane indexing, it
could shift the balance towards reusing it.
Hi Tejun,
On 09.11.2012 18:22, Tejun Heo wrote:
> Hey, Daniel.
>
> On Fri, Nov 09, 2012 at 12:09:38PM +0100, Daniel Wagner wrote:
>> On 08.11.2012 20:07, Tejun Heo wrote:> Subject: cgroup: add
>> cgroup_subsys->post_create()
>>>
>>> Currently, there's no way for a controller to find out whether a new
>>> cgroup finished all ->create() allocatinos successfully and is
>>> considered "live" by cgroup.
>>
>> I'd like add hierarchy support to net_prio and the first thing to
>> do is to get rid of get_prioidx(). It looks like it would be nice to
>
> Ooh, I'm already working on it. I *think* I should be able to post
> the patches later today or early next week.
No problem. I didn't spend too much time writing the patch. Getting
rid of get_prioidx() was simple. I just wanted to help out, but then
I might be too slow to keep up with you :)
Though I have a question on the patch you are writing. When you disable
broken_hierarchy for the networking controllers, are you also
considering disabling them on the root cgroup? Currently both
net_prio and net_cls will do work on the root cgroup. I think, for
harmonizing the behavior of all controllers, this should also be
disabled.
>> be able to use use_id and post_create() for this but as I read the
>> code this might not work because the netdev might access resources
>> allocated between create() and post_create(). So my question is
>> would it make sense to move
>>
>> cgroup_create():
>>
>> if (ss->use_id) {
>> err = alloc_css_id(ss, parent, cgrp);
>> if (err)
>> goto err_destroy;
>> }
>>
>> part before create() or add some protection between create() and
>> post_create() callback in net_prio. I have a patch but I see
>> I could drop it completely if post_create() is there.
>
> Glauber had about similar question about css_id and I need to think
> more about it but currently I think I want to phase out css IDs. It's
> an id of the wrong thing (CSSes don't need IDs, cgroups do) and
> unnecessarily duplicates its own hierarchy when the hierarchy of
> cgroups already exists. Once memcontrol moves away from walking using
> css_ids, I *think* I'll try to kill it.
>
> I'll add cgroup ID (no hierarchy funnies, just a single ida allocated
> number) so that it can be used for cgroup indexing. Glauber, that
> should solve your problem too, right?
net_prio doesn't have any particular requirements on the indexing,
at least none that I can see. So a global ida would work for net_prio.
cheers,
daniel