2012-10-31 04:22:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET] cgroup: simplify cgroup removal path

Hello, guys.

cgroup removal path is quite ugly. A lot of the ugliness comes from
the weird design which allows ->pre_destroy() to fail and the feature
to drain existing CSS reference counts before committing to removal.
Both mean that it should be possible to roll-back cgroup destruction
after some or all ->pre_destroy() invocations.

This weird design has never really worked. To list a couple examples.

* Some ->pre_destroy() implementations aren't side-effect free.
Roll-back happens after a lot of state is already lost.

* Some ->pre_destroy() implementations (naturally) assume that the
cgroup being destroyed would stay quiescent between successful
->pre_destroy() and its destruction. Unfortunately, any operation
can happen inbetween and the cgroup could be in a very different
state by the time it actually gets destroyed.

It's just such an unusual design which unnecessarily contains weird
code path combinations which are tricky to hit, reproduce and expect.
Moreover, the design's deficiencies attracts kludges on top as
workarounds and we end up with stuff like cgroup_exclude_rmdir() and
cgroup_release_and_wakeup_rmdir() which really make me want to cry.

Now that memcg has moved away from failable ->pre_destroy(), we can do
away with all these. I tested some basic operations and some corner
cases but am still a bit scared. Would love to get acks from Li and
memcg people.

This patchset contains the following eight patches.

0001-cgroup-kill-cgroup_subsys-__DEPRECATED_clear_css_ref.patch
0002-cgroup-kill-CSS_REMOVED.patch
0003-cgroup-use-cgroup_lock_live_group-parent-in-cgroup_c.patch
0004-cgroup-deactivate-CSS-s-and-mark-cgroup-dead-before-.patch
0005-cgroup-remove-CGRP_WAIT_ON_RMDIR-cgroup_exclude_rmdi.patch
0006-memcg-make-mem_cgroup_reparent_charges-non-failing.patch
0007-hugetlb-do-not-fail-in-hugetlb_cgroup_pre_destroy.patch
0008-cgroup-make-pre_destroy-return-void.patch

0001-0002 remove now unused ->pre_destroy() failure handling and do
follow-up simplification.

0003-0004 update removal path such that each ->pre_destroy() is
guaranteed to be invoked once per removal and the cgroup being
destroyed stays quiescent until destruction is complete.

0005 removes the scary CGRP_WAIT_ON_RMDIR mechanism.

0006-0008 are follow-up clean-ups. 0006 and 0007 are from Michal's
patchset[1].

This patchset is on top of

v3.6 (a0d271cbfe)
+ [1] the first three patches of
"memcg/cgroup: do not fail fail on pre_destroy callbacks" patchset

and available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-rmdir-updates

Thanks.

block/blk-cgroup.c | 3
include/linux/cgroup.h | 41 -------
kernel/cgroup.c | 256 +++++++++++--------------------------------------
mm/hugetlb_cgroup.c | 11 --
mm/memcontrol.c | 51 +--------
5 files changed, 75 insertions(+), 287 deletions(-)

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/4757


2012-10-31 04:22:56

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
handling") removed the last user of __DEPRECATED_clear_css_refs. This
patch removes __DEPRECATED_clear_css_refs and mechanisms to support
it.

* Conditionals dependent on __DEPRECATED_clear_css_refs removed.

* ->pre_destroy() now can only fail if a new task is attached or child
cgroup is created while ->pre_destroy()s are being called. As the
condition is checked again after re-acquiring cgroup_mutex
afterwards, we don't need to take any immediate action on
->pre_destroy() failures. This reduces cgroup_call_pre_destroy() to
a simple loop surrounding ->pre_destory(). Remove
cgroup_call_pre_destroy() and open-code the loop into
cgroup_rmdir().

* cgroup_clear_css_refs() can no longer fail. All that needs to be
done are deactivating refcnts, setting CSS_REMOVED and putting the
base reference on each css. Remove cgroup_clear_css_refs() and the
failure path, and open-code the loops into cgroup_rmdir().

Note that cgroup_rmdir() will see more cleanup soon.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/cgroup.h | 12 ----
kernel/cgroup.c | 159 ++++++++++++-------------------------------------
2 files changed, 38 insertions(+), 133 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c90eaa8..02e09c0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -86,7 +86,6 @@ struct cgroup_subsys_state {
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
CSS_REMOVED, /* This CSS is dead */
- CSS_CLEAR_CSS_REFS, /* @ss->__DEPRECATED_clear_css_refs */
};

/* Caller must verify that the css is not for root cgroup */
@@ -485,17 +484,6 @@ struct cgroup_subsys {
*/
bool use_id;

- /*
- * If %true, cgroup removal will try to clear css refs by retrying
- * ss->pre_destroy() until there's no css ref left. This behavior
- * is strictly for backward compatibility and will be removed as
- * soon as the current user (memcg) is updated.
- *
- * If %false, ss->pre_destroy() can't fail and cgroup removal won't
- * wait for css refs to drop to zero before proceeding.
- */
- bool __DEPRECATED_clear_css_refs;
-
#define MAX_CGROUP_TYPE_NAMELEN 32
const char *name;

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7981850..033bf4b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -851,30 +851,6 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
return inode;
}

-/*
- * Call subsys's pre_destroy handler.
- * This is called before css refcnt check.
- */
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
-{
- struct cgroup_subsys *ss;
- int ret = 0;
-
- for_each_subsys(cgrp->root, ss) {
- if (!ss->pre_destroy)
- continue;
-
- ret = ss->pre_destroy(cgrp);
- if (ret) {
- /* ->pre_destroy() failure is being deprecated */
- WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs);
- break;
- }
- }
-
- return ret;
-}
-
static void cgroup_diput(struct dentry *dentry, struct inode *inode)
{
/* is dentry a directory ? if so, kfree() associated cgroup */
@@ -3901,14 +3877,12 @@ static void init_cgroup_css(struct cgroup_subsys_state *css,
cgrp->subsys[ss->subsys_id] = css;

/*
- * If !clear_css_refs, css holds an extra ref to @cgrp->dentry
- * which is put on the last css_put(). dput() requires process
- * context, which css_put() may be called without. @css->dput_work
- * will be used to invoke dput() asynchronously from css_put().
+ * css holds an extra ref to @cgrp->dentry which is put on the last
+ * css_put(). dput() requires process context, which css_put() may
+ * be called without. @css->dput_work will be used to invoke
+ * dput() asynchronously from css_put().
*/
INIT_WORK(&css->dput_work, css_dput_fn);
- if (ss->__DEPRECATED_clear_css_refs)
- set_bit(CSS_CLEAR_CSS_REFS, &css->flags);
}

/*
@@ -3978,10 +3952,9 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
if (err < 0)
goto err_remove;

- /* If !clear_css_refs, each css holds a ref to the cgroup's dentry */
+ /* each css holds a ref to the cgroup's dentry */
for_each_subsys(root, ss)
- if (!ss->__DEPRECATED_clear_css_refs)
- dget(dentry);
+ dget(dentry);

/* The cgroup directory was pre-locked for us */
BUG_ON(!mutex_is_locked(&cgrp->dentry->d_inode->i_mutex));
@@ -4066,71 +4039,6 @@ static int cgroup_has_css_refs(struct cgroup *cgrp)
return 0;
}

-/*
- * Atomically mark all (or else none) of the cgroup's CSS objects as
- * CSS_REMOVED. Return true on success, or false if the cgroup has
- * busy subsystems. Call with cgroup_mutex held
- *
- * Depending on whether a subsys has __DEPRECATED_clear_css_refs set or
- * not, cgroup removal behaves differently.
- *
- * If clear is set, css refcnt for the subsystem should be zero before
- * cgroup removal can be committed. This is implemented by
- * CGRP_WAIT_ON_RMDIR and retry logic around ->pre_destroy(), which may be
- * called multiple times until all css refcnts reach zero and is allowed to
- * veto removal on any invocation. This behavior is deprecated and will be
- * removed as soon as the existing user (memcg) is updated.
- *
- * If clear is not set, each css holds an extra reference to the cgroup's
- * dentry and cgroup removal proceeds regardless of css refs.
- * ->pre_destroy() will be called at least once and is not allowed to fail.
- * On the last put of each css, whenever that may be, the extra dentry ref
- * is put so that dentry destruction happens only after all css's are
- * released.
- */
-static int cgroup_clear_css_refs(struct cgroup *cgrp)
-{
- struct cgroup_subsys *ss;
- unsigned long flags;
- bool failed = false;
-
- local_irq_save(flags);
-
- /*
- * Block new css_tryget() by deactivating refcnt. If all refcnts
- * for subsystems w/ clear_css_refs set were 1 at the moment of
- * deactivation, we succeeded.
- */
- for_each_subsys(cgrp->root, ss) {
- struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-
- WARN_ON(atomic_read(&css->refcnt) < 0);
- atomic_add(CSS_DEACT_BIAS, &css->refcnt);
-
- if (ss->__DEPRECATED_clear_css_refs)
- failed |= css_refcnt(css) != 1;
- }
-
- /*
- * If succeeded, set REMOVED and put all the base refs; otherwise,
- * restore refcnts to positive values. Either way, all in-progress
- * css_tryget() will be released.
- */
- for_each_subsys(cgrp->root, ss) {
- struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-
- if (!failed) {
- set_bit(CSS_REMOVED, &css->flags);
- css_put(css);
- } else {
- atomic_sub(CSS_DEACT_BIAS, &css->refcnt);
- }
- }
-
- local_irq_restore(flags);
- return !failed;
-}
-
static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
{
struct cgroup *cgrp = dentry->d_fsdata;
@@ -4138,10 +4046,9 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
struct cgroup *parent;
DEFINE_WAIT(wait);
struct cgroup_event *event, *tmp;
- int ret;
+ struct cgroup_subsys *ss;

/* the vfs holds both inode->i_mutex already */
-again:
mutex_lock(&cgroup_mutex);
if (atomic_read(&cgrp->count) != 0) {
mutex_unlock(&cgroup_mutex);
@@ -4168,11 +4075,9 @@ again:
* Call pre_destroy handlers of subsys. Notify subsystems
* that rmdir() request comes.
*/
- ret = cgroup_call_pre_destroy(cgrp);
- if (ret) {
- clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
- return ret;
- }
+ for_each_subsys(cgrp->root, ss)
+ if (ss->pre_destroy)
+ WARN_ON_ONCE(ss->pre_destroy(cgrp));

mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
@@ -4182,21 +4087,34 @@ again:
return -EBUSY;
}
prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
- if (!cgroup_clear_css_refs(cgrp)) {
- mutex_unlock(&cgroup_mutex);
- /*
- * Because someone may call cgroup_wakeup_rmdir_waiter() before
- * prepare_to_wait(), we need to check this flag.
- */
- if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
- schedule();
- finish_wait(&cgroup_rmdir_waitq, &wait);
- clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
- if (signal_pending(current))
- return -EINTR;
- goto again;
+
+ local_irq_disable();
+
+ /* block new css_tryget() by deactivating refcnt */
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+ WARN_ON(atomic_read(&css->refcnt) < 0);
+ atomic_add(CSS_DEACT_BIAS, &css->refcnt);
}
- /* NO css_tryget() can success after here. */
+
+ /*
+ * Set REMOVED. All in-progress css_tryget() will be released.
+ * Put all the base refs. Each css holds an extra reference to the
+ * cgroup's dentry and cgroup removal proceeds regardless of css
+ * refs. On the last put of each css, whenever that may be, the
+ * extra dentry ref is put so that dentry destruction happens only
+ * after all css's are released.
+ */
+ for_each_subsys(cgrp->root, ss) {
+ struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
+
+ set_bit(CSS_REMOVED, &css->flags);
+ css_put(css);
+ }
+
+ local_irq_enable();
+
finish_wait(&cgroup_rmdir_waitq, &wait);
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

@@ -4949,8 +4867,7 @@ void __css_put(struct cgroup_subsys_state *css)
cgroup_wakeup_rmdir_waiter(cgrp);
break;
case 0:
- if (!test_bit(CSS_CLEAR_CSS_REFS, &css->flags))
- schedule_work(&css->dput_work);
+ schedule_work(&css->dput_work);
break;
}
rcu_read_unlock();
--
1.7.11.7

2012-10-31 04:23:03

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

This patch makes cgroup_create() fail if @parent is marked removed.
This is to prepare for further updates to cgroup_rmdir() path.

Note that this change isn't strictly necessary. cgroup can only be
created via mkdir and the removed marking and dentry removal happen
without releasing cgroup_mutex, so cgroup_create() can never race with
cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(),
cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
rendering the added liveliness check unnecessary.

Do it anyway such that locking is contained inside cgroup proper and
we don't get nasty surprises if we ever grow another caller of
cgroup_create().

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a49cdbc..b3010ae 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3906,6 +3906,18 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
if (!cgrp)
return -ENOMEM;

+ /*
+ * Only live parents can have children. Note that the liveliness
+ * check isn't strictly necessary because cgroup_mkdir() and
+ * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
+ * anyway so that locking is contained inside cgroup proper and we
+ * don't get nasty surprises if we ever grow another caller.
+ */
+ if (!cgroup_lock_live_group(parent)) {
+ err = -ENODEV;
+ goto err_free;
+ }
+
/* Grab a reference on the superblock so the hierarchy doesn't
* get deleted on unmount if there are child cgroups. This
* can be done outside cgroup_mutex, since the sb can't
@@ -3913,8 +3925,6 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
* fs */
atomic_inc(&sb->s_active);

- mutex_lock(&cgroup_mutex);
-
init_cgroup_housekeeping(cgrp);

cgrp->parent = parent;
@@ -3985,7 +3995,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,

/* Release the reference count that we took on the superblock */
deactivate_super(sb);
-
+err_free:
kfree(cgrp);
return err;
}
--
1.7.11.7

2012-10-31 04:23:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 7/8] hugetlb: do not fail in hugetlb_cgroup_pre_destroy

From: Michal Hocko <[email protected]>

Now that pre_destroy callbacks are called from the context where neither
any task can attach the group nor any children group can be added there
is no other way to fail from hugetlb_pre_destroy.

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Reviewed-by: Glauber Costa <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
mm/hugetlb_cgroup.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a3f358f..dc595c6 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -159,14 +159,9 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
struct hstate *h;
struct page *page;
- int ret = 0, idx = 0;
+ int idx = 0;

do {
- if (cgroup_task_count(cgroup) ||
- !list_empty(&cgroup->children)) {
- ret = -EBUSY;
- goto out;
- }
for_each_hstate(h) {
spin_lock(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_activelist, lru)
@@ -177,8 +172,8 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
}
cond_resched();
} while (hugetlb_cgroup_have_usage(cgroup));
-out:
- return ret;
+
+ return 0;
}

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
--
1.7.11.7

2012-10-31 04:23:07

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 6/8] memcg: make mem_cgroup_reparent_charges non failing

From: Michal Hocko <[email protected]>

Now that pre_destroy callbacks are called from the context where neither
any task can attach the group nor any children group can be added there
is no other way to fail from mem_cgroup_pre_destroy.
mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css
because all css' are marked dead already.

tj: Remove now unused local variable @cgrp from
mem_cgroup_reparent_charges().

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Glauber Costa <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 19 ++++++-------------
1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1033b2b..47c4680 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3740,14 +3740,11 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
*
* Caller is responsible for holding css reference on the memcg.
*/
-static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
- struct cgroup *cgrp = memcg->css.cgroup;
int node, zid;

do {
- if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
- return -EBUSY;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
@@ -3773,8 +3770,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
* charge before adding to the LRU.
*/
} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
-
- return 0;
}

/*
@@ -3811,7 +3806,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)

}
lru_add_drain();
- return mem_cgroup_reparent_charges(memcg);
+ mem_cgroup_reparent_charges(memcg);
+
+ return 0;
}

static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
@@ -5008,13 +5005,9 @@ free_out:
static int mem_cgroup_pre_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int ret;

- css_get(&memcg->css);
- ret = mem_cgroup_reparent_charges(memcg);
- css_put(&memcg->css);
-
- return ret;
+ mem_cgroup_reparent_charges(memcg);
+ return 0;
}

static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.11.7

2012-10-31 04:23:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 8/8] cgroup: make ->pre_destroy() return void

All ->pre_destory() implementations return 0 now, which is the only
allowed return value. Make it return void.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 3 +--
include/linux/cgroup.h | 2 +-
kernel/cgroup.c | 2 +-
mm/hugetlb_cgroup.c | 4 +---
mm/memcontrol.c | 3 +--
5 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index f3b44a6..a7816f3 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -600,7 +600,7 @@ struct cftype blkcg_files[] = {
*
* This is the blkcg counterpart of ioc_release_fn().
*/
-static int blkcg_pre_destroy(struct cgroup *cgroup)
+static void blkcg_pre_destroy(struct cgroup *cgroup)
{
struct blkcg *blkcg = cgroup_to_blkcg(cgroup);

@@ -622,7 +622,6 @@ static int blkcg_pre_destroy(struct cgroup *cgroup)
}

spin_unlock_irq(&blkcg->lock);
- return 0;
}

static void blkcg_destroy(struct cgroup *cgroup)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 47868a8..adb2adc 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -436,7 +436,7 @@ int cgroup_taskset_size(struct cgroup_taskset *tset);

struct cgroup_subsys {
struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
- int (*pre_destroy)(struct cgroup *cgrp);
+ void (*pre_destroy)(struct cgroup *cgrp);
void (*destroy)(struct cgroup *cgrp);
int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index c5f6fb2..83cd7d0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4054,7 +4054,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
mutex_unlock(&cgroup_mutex);
for_each_subsys(cgrp->root, ss)
if (ss->pre_destroy)
- WARN_ON_ONCE(ss->pre_destroy(cgrp));
+ ss->pre_destroy(cgrp);
mutex_lock(&cgroup_mutex);

/*
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index dc595c6..0d3a1a3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -155,7 +155,7 @@ out:
* Force the hugetlb cgroup to empty the hugetlb resources by moving them to
* the parent cgroup.
*/
-static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
+static void hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
struct hstate *h;
struct page *page;
@@ -172,8 +172,6 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
}
cond_resched();
} while (hugetlb_cgroup_have_usage(cgroup));
-
- return 0;
}

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 47c4680..af05a60 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5002,12 +5002,11 @@ free_out:
return ERR_PTR(error);
}

-static int mem_cgroup_pre_destroy(struct cgroup *cont)
+static void mem_cgroup_pre_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);

mem_cgroup_reparent_charges(memcg);
- return 0;
}

static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.11.7

2012-10-31 04:23:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 5/8] cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()

CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
destruction rollback somewhat working. cgroup_rmdir() used to drain
CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
helpers were used to allow the task performing rmdir to wait for the
next relevant event.

Unfortunately, the wait is visible to controllers too and the
mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
sleep at rmdir").

Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
unnecessary. Remove it and all the mechanisms supporting it. Note
that memcontrol.c changes are essentially revert of 887032670d
("cgroup avoid permanent sleep at rmdir").

Signed-off-by: Tejun Heo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 21 ---------------------
kernel/cgroup.c | 51 --------------------------------------------------
mm/memcontrol.c | 24 +-----------------------
3 files changed, 1 insertion(+), 95 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a309804..47868a8 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -145,10 +145,6 @@ enum {
/* Control Group requires release notifications to userspace */
CGRP_NOTIFY_ON_RELEASE,
/*
- * A thread in rmdir() is wating for this cgroup.
- */
- CGRP_WAIT_ON_RMDIR,
- /*
* Clone cgroup values when creating a new child cgroup
*/
CGRP_CLONE_CHILDREN,
@@ -412,23 +408,6 @@ int cgroup_task_count(const struct cgroup *cgrp);
int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);

/*
- * When the subsys has to access css and may add permanent refcnt to css,
- * it should take care of racy conditions with rmdir(). Following set of
- * functions, is for stop/restart rmdir if necessary.
- * Because these will call css_get/put, "css" should be alive css.
- *
- * cgroup_exclude_rmdir();
- * ...do some jobs which may access arbitrary empty cgroup
- * cgroup_release_and_wakeup_rmdir();
- *
- * When someone removes a cgroup while cgroup_exclude_rmdir() holds it,
- * it sleeps and cgroup_release_and_wakeup_rmdir() will wake him up.
- */
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
-
-/*
* Control Group taskset, used to pass around set of tasks to cgroup_subsys
* methods.
*/
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 66204a6..c5f6fb2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -966,33 +966,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
}

/*
- * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
- * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
- * reference to css->refcnt. In general, this refcnt is expected to goes down
- * to zero, soon.
- *
- * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
- */
-static DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
-
-static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
-{
- if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
- wake_up_all(&cgroup_rmdir_waitq);
-}
-
-void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
-{
- css_get(css);
-}
-
-void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
-{
- cgroup_wakeup_rmdir_waiter(css->cgroup);
- css_put(css);
-}
-
-/*
* Call with cgroup_mutex held. Drops reference counts on modules, including
* any duplicate ones that parse_cgroupfs_options took. If this function
* returns an error, no reference counts are touched.
@@ -1963,12 +1936,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
}

synchronize_rcu();
-
- /*
- * wake up rmdir() waiter. the rmdir should fail since the cgroup
- * is no longer empty.
- */
- cgroup_wakeup_rmdir_waiter(cgrp);
out:
if (retval) {
for_each_subsys(root, ss) {
@@ -2138,7 +2105,6 @@ static int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
* step 5: success! and cleanup
*/
synchronize_rcu();
- cgroup_wakeup_rmdir_waiter(cgrp);
retval = 0;
out_put_css_set_refs:
if (retval) {
@@ -4058,26 +4024,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;

- /*
- * In general, subsystem has no css->refcnt after pre_destroy(). But
- * in racy cases, subsystem may have to get css->refcnt after
- * pre_destroy() and it makes rmdir return with -EBUSY. This sometimes
- * make rmdir return -EBUSY too often. To avoid that, we use waitqueue
- * for cgroup's rmdir. CGRP_WAIT_ON_RMDIR is for synchronizing rmdir
- * and subsystem's reference count handling. Please see css_get/put
- * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
- */
- set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-
/* the vfs holds both inode->i_mutex already */
mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
- clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
- prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);

/*
* Block new css_tryget() by deactivating refcnt and mark @cgrp
@@ -4114,9 +4067,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
for_each_subsys(cgrp->root, ss)
css_put(cgrp->subsys[ss->subsys_id]);

- finish_wait(&cgroup_rmdir_waitq, &wait);
- clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
-
raw_spin_lock(&release_list_lock);
if (!list_empty(&cgrp->release_list))
list_del_init(&cgrp->release_list);
@@ -4864,7 +4814,6 @@ void __css_put(struct cgroup_subsys_state *css)
set_bit(CGRP_RELEASABLE, &cgrp->flags);
check_for_release(cgrp);
}
- cgroup_wakeup_rmdir_waiter(cgrp);
break;
case 0:
schedule_work(&css->dput_work);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f8bd9d..1033b2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2681,13 +2681,6 @@ static int mem_cgroup_move_account(struct page *page,
/* caller should have done css_get */
pc->mem_cgroup = to;
mem_cgroup_charge_statistics(to, anon, nr_pages);
- /*
- * We charges against "to" which may not have any tasks. Then, "to"
- * can be under rmdir(). But in current implementation, caller of
- * this function is just force_empty() and move charge, so it's
- * guaranteed that "to" is never removed. So, we don't check rmdir
- * status here.
- */
move_unlock_mem_cgroup(from, &flags);
ret = 0;
unlock:
@@ -2893,7 +2886,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
return;
if (!memcg)
return;
- cgroup_exclude_rmdir(&memcg->css);

__mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
/*
@@ -2907,12 +2899,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
swp_entry_t ent = {.val = page_private(page)};
mem_cgroup_uncharge_swap(ent);
}
- /*
- * At swapin, we may charge account against cgroup which has no tasks.
- * So, rmdir()->pre_destroy() can be called while we do this charge.
- * In that case, we need to call pre_destroy() again. check it here.
- */
- cgroup_release_and_wakeup_rmdir(&memcg->css);
}

void mem_cgroup_commit_charge_swapin(struct page *page,
@@ -3360,8 +3346,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,

if (!memcg)
return;
- /* blocks rmdir() */
- cgroup_exclude_rmdir(&memcg->css);
+
if (!migration_ok) {
used = oldpage;
unused = newpage;
@@ -3395,13 +3380,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,
*/
if (anon)
mem_cgroup_uncharge_page(used);
- /*
- * At migration, we may charge account against cgroup which has no
- * tasks.
- * So, rmdir()->pre_destroy() can be called while we do this charge.
- * In that case, we need to call pre_destroy() again. check it here.
- */
- cgroup_release_and_wakeup_rmdir(&memcg->css);
}

/*
--
1.7.11.7

2012-10-31 04:24:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 4/8] cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()

Because ->pre_destroy() could fail and can't be called under
cgroup_mutex, cgroup destruction did something very ugly.

1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.

2. Release cgroup_mutex and call ->pre_destroy().

3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
otherwise.

4. Continue destroying.

In addition to being ugly, it has been always broken in various ways.
For example, memcg ->pre_destroy() expects the cgroup to be inactive
after it's done but tasks can be attached and detached between #2 and
#3 and the conditions that memcg verified in ->pre_destroy() might no
longer hold by the time control reaches #3.

Now that ->pre_destroy() is no longer allowed to fail. We can switch
to the following.

1. Grab cgroup_mutex and fail if it can't be destroyed; fail
otherwise.

2. Deactivate CSS's and mark the cgroup removed thus preventing any
further operations which can invalidate the verification from #1.

3. Release cgroup_mutex and call ->pre_destroy().

4. Re-grab cgroup_mutex and continue destroying.

After this change, controllers can safely assume that ->pre_destroy()
will only be called only once for a given cgroup and, once
->pre_destroy() is called, the cgroup will stay dormant till it's
destroyed.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 41 +++++++++++++++++++----------------------
1 file changed, 19 insertions(+), 22 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b3010ae..66204a6 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -4058,18 +4058,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
struct cgroup_event *event, *tmp;
struct cgroup_subsys *ss;

- /* the vfs holds both inode->i_mutex already */
- mutex_lock(&cgroup_mutex);
- if (atomic_read(&cgrp->count) != 0) {
- mutex_unlock(&cgroup_mutex);
- return -EBUSY;
- }
- if (!list_empty(&cgrp->children)) {
- mutex_unlock(&cgroup_mutex);
- return -EBUSY;
- }
- mutex_unlock(&cgroup_mutex);
-
/*
* In general, subsystem has no css->refcnt after pre_destroy(). But
* in racy cases, subsystem may have to get css->refcnt after
@@ -4081,14 +4069,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
*/
set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

- /*
- * Call pre_destroy handlers of subsys. Notify subsystems
- * that rmdir() request comes.
- */
- for_each_subsys(cgrp->root, ss)
- if (ss->pre_destroy)
- WARN_ON_ONCE(ss->pre_destroy(cgrp));
-
+ /* the vfs holds both inode->i_mutex already */
mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
@@ -4098,13 +4079,30 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
}
prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);

- /* block new css_tryget() by deactivating refcnt */
+ /*
+ * Block new css_tryget() by deactivating refcnt and mark @cgrp
+ * removed. This makes future css_tryget() and child creation
+ * attempts fail thus maintaining the removal conditions verified
+ * above.
+ */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];

WARN_ON(atomic_read(&css->refcnt) < 0);
atomic_add(CSS_DEACT_BIAS, &css->refcnt);
}
+ set_bit(CGRP_REMOVED, &cgrp->flags);
+
+ /*
+ * Tell subsystems to initate destruction. pre_destroy() should be
+ * called with cgroup_mutex unlocked. See 3fa59dfbc3 ("cgroup: fix
+ * potential deadlock in pre_destroy") for details.
+ */
+ mutex_unlock(&cgroup_mutex);
+ for_each_subsys(cgrp->root, ss)
+ if (ss->pre_destroy)
+ WARN_ON_ONCE(ss->pre_destroy(cgrp));
+ mutex_lock(&cgroup_mutex);

/*
* Put all the base refs. Each css holds an extra reference to the
@@ -4120,7 +4118,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

raw_spin_lock(&release_list_lock);
- set_bit(CGRP_REMOVED, &cgrp->flags);
if (!list_empty(&cgrp->release_list))
list_del_init(&cgrp->release_list);
raw_spin_unlock(&release_list_lock);
--
1.7.11.7

2012-10-31 04:24:39

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/8] cgroup: kill CSS_REMOVED

CSS_REMOVED is one of the several contortions which were necessary to
support css reference draining on cgroup removal. All css->refcnts
which need draining should be deactivated and verified to equal zero
atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
needed to be re-activated and css_tryget() shouldn't fail in the
process.

This was achieved by letting css_tryget() busy-loop until either the
refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
(committing to removal).

Now that css refcnt draining is no longer used, there's no need for
atomic rollback mechanism. css_tryget() simply can look at the
reference count and fail if the it's deactivated - it's never getting
re-activated.

This patch removes CSS_REMOVED and updates __css_tryget() to fail if
the refcnt is deactivated.

Note that this removes css_is_removed() whose only user is VM_BUG_ON()
in memcontrol.c. We can replace it with a check on the refcnt but
given that the only use case is a debug assert, I think it's better to
simply unexport it.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/cgroup.h | 6 ------
kernel/cgroup.c | 31 ++++++++++++-------------------
mm/memcontrol.c | 7 +++----
3 files changed, 15 insertions(+), 29 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 02e09c0..a309804 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -85,7 +85,6 @@ struct cgroup_subsys_state {
/* bits in struct cgroup_subsys_state flags field */
enum {
CSS_ROOT, /* This CSS is the root of the subsystem */
- CSS_REMOVED, /* This CSS is dead */
};

/* Caller must verify that the css is not for root cgroup */
@@ -108,11 +107,6 @@ static inline void css_get(struct cgroup_subsys_state *css)
__css_get(css, 1);
}

-static inline bool css_is_removed(struct cgroup_subsys_state *css)
-{
- return test_bit(CSS_REMOVED, &css->flags);
-}
-
/*
* Call css_tryget() to take a reference on a css if your existing
* (known-valid) reference isn't already ref-counted. Returns false if
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 033bf4b..a49cdbc 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -170,8 +170,8 @@ struct css_id {
* The css to which this ID points. This pointer is set to valid value
* after cgroup is populated. If cgroup is removed, this will be NULL.
* This pointer is expected to be RCU-safe because destroy()
- * is called after synchronize_rcu(). But for safe use, css_is_removed()
- * css_tryget() should be used for avoiding race.
+ * is called after synchronize_rcu(). But for safe use, css_tryget()
+ * should be used for avoiding race.
*/
struct cgroup_subsys_state __rcu *css;
/*
@@ -4088,8 +4088,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
}
prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);

- local_irq_disable();
-
/* block new css_tryget() by deactivating refcnt */
for_each_subsys(cgrp->root, ss) {
struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
@@ -4099,21 +4097,14 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
}

/*
- * Set REMOVED. All in-progress css_tryget() will be released.
* Put all the base refs. Each css holds an extra reference to the
* cgroup's dentry and cgroup removal proceeds regardless of css
* refs. On the last put of each css, whenever that may be, the
* extra dentry ref is put so that dentry destruction happens only
* after all css's are released.
*/
- for_each_subsys(cgrp->root, ss) {
- struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
-
- set_bit(CSS_REMOVED, &css->flags);
- css_put(css);
- }
-
- local_irq_enable();
+ for_each_subsys(cgrp->root, ss)
+ css_put(cgrp->subsys[ss->subsys_id]);

finish_wait(&cgroup_rmdir_waitq, &wait);
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4837,15 +4828,17 @@ static void check_for_release(struct cgroup *cgrp)
/* Caller must verify that the css is not for root cgroup */
bool __css_tryget(struct cgroup_subsys_state *css)
{
- do {
- int v = css_refcnt(css);
+ while (true) {
+ int t, v;

- if (atomic_cmpxchg(&css->refcnt, v, v + 1) == v)
+ v = css_refcnt(css);
+ t = atomic_cmpxchg(&css->refcnt, v, v + 1);
+ if (likely(t == v))
return true;
+ else if (t < 0)
+ return false;
cpu_relax();
- } while (!test_bit(CSS_REMOVED, &css->flags));
-
- return false;
+ }
}
EXPORT_SYMBOL_GPL(__css_tryget);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a1d584..6f8bd9d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2343,7 +2343,6 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
again:
if (*ptr) { /* css should be a valid one */
memcg = *ptr;
- VM_BUG_ON(css_is_removed(&memcg->css));
if (mem_cgroup_is_root(memcg))
goto done;
if (nr_pages == 1 && consume_stock(memcg))
@@ -2483,9 +2482,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,

/*
* A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock(). The caller must check css_is_removed() or some if
- * it's concern. (dropping refcnt from swap can be called against removed
- * memcg.)
+ * rcu_read_lock(). The caller is responsible for verifying the returned
+ * memcg is still alive if necessary. (dropping refcnt from swap can be
+ * called against removed memcg.)
*/
static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
{
--
1.7.11.7

2012-10-31 13:21:49

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs


> }
> prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> - if (!cgroup_clear_css_refs(cgrp)) {
> - mutex_unlock(&cgroup_mutex);
> - /*
> - * Because someone may call cgroup_wakeup_rmdir_waiter() before
> - * prepare_to_wait(), we need to check this flag.
> - */
> - if (test_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags))
> - schedule();
> - finish_wait(&cgroup_rmdir_waitq, &wait);
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> - if (signal_pending(current))
> - return -EINTR;
> - goto again;
> +
> + local_irq_disable();
> +
> + /* block new css_tryget() by deactivating refcnt */
> + for_each_subsys(cgrp->root, ss) {
> + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +
> + WARN_ON(atomic_read(&css->refcnt) < 0);
> + atomic_add(CSS_DEACT_BIAS, &css->refcnt);
> }
> - /* NO css_tryget() can success after here. */
> +
> + /*
> + * Set REMOVED. All in-progress css_tryget() will be released.
> + * Put all the base refs. Each css holds an extra reference to the
> + * cgroup's dentry and cgroup removal proceeds regardless of css
> + * refs. On the last put of each css, whenever that may be, the
> + * extra dentry ref is put so that dentry destruction happens only
> + * after all css's are released.
> + */
> + for_each_subsys(cgrp->root, ss) {
> + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> +
> + set_bit(CSS_REMOVED, &css->flags);
> + css_put(css);
> + }
> +
> + local_irq_enable();
> +
> finish_wait(&cgroup_rmdir_waitq, &wait);

It seems to me that the main reason for doing this in two passes was to
derive a global "failed" state on the first pass. Now that we can't
fail, why can't you just loop through the subsystems just once ?

2012-10-31 13:42:44

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 4/8] cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()

On 10/31/2012 08:22 AM, Tejun Heo wrote:
> Because ->pre_destroy() could fail and can't be called under
> cgroup_mutex, cgroup destruction did something very ugly.
>
> 1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
>
> 2. Release cgroup_mutex and call ->pre_destroy().
>
> 3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
> otherwise.
>
> 4. Continue destroying.
>
> In addition to being ugly, it has been always broken in various ways.
> For example, memcg ->pre_destroy() expects the cgroup to be inactive
> after it's done but tasks can be attached and detached between #2 and
> #3 and the conditions that memcg verified in ->pre_destroy() might no
> longer hold by the time control reaches #3.
>
> Now that ->pre_destroy() is no longer allowed to fail. We can switch
> to the following.
>
> 1. Grab cgroup_mutex and fail if it can't be destroyed; fail
> otherwise.
>
fail, or fail otherwise? Seems quite negative =)

2012-10-31 13:49:53

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

On 10/31/2012 08:22 AM, Tejun Heo wrote:
> Hello, guys.
>
> cgroup removal path is quite ugly. A lot of the ugliness comes from
> the weird design which allows ->pre_destroy() to fail and the feature
> to drain existing CSS reference counts before committing to removal.
> Both mean that it should be possible to roll-back cgroup destruction
> after some or all ->pre_destroy() invocations.
>
> This weird design has never really worked. To list a couple examples.
>
> * Some ->pre_destroy() implementations aren't side-effect free.
> Roll-back happens after a lot of state is already lost.
>
> * Some ->pre_destroy() implementations (naturally) assume that the
> cgroup being destroyed would stay quiescent between successful
> ->pre_destroy() and its destruction. Unfortunately, any operation
> can happen inbetween and the cgroup could be in a very different
> state by the time it actually gets destroyed.
>
> It's just such an unusual design which unnecessarily contains weird
> code path combinations which are tricky to hit, reproduce and expect.
> Moreover, the design's deficiencies attracts kludges on top as
> workarounds and we end up with stuff like cgroup_exclude_rmdir() and
> cgroup_release_and_wakeup_rmdir() which really make me want to cry.
>
> Now that memcg has moved away from failable ->pre_destroy(), we can do
> away with all these. I tested some basic operations and some corner
> cases but am still a bit scared. Would love to get acks from Li and
> memcg people.
>
> This patchset contains the following eight patches.
>
> 0001-cgroup-kill-cgroup_subsys-__DEPRECATED_clear_css_ref.patch
> 0002-cgroup-kill-CSS_REMOVED.patch
> 0003-cgroup-use-cgroup_lock_live_group-parent-in-cgroup_c.patch
> 0004-cgroup-deactivate-CSS-s-and-mark-cgroup-dead-before-.patch
> 0005-cgroup-remove-CGRP_WAIT_ON_RMDIR-cgroup_exclude_rmdi.patch
> 0006-memcg-make-mem_cgroup_reparent_charges-non-failing.patch
> 0007-hugetlb-do-not-fail-in-hugetlb_cgroup_pre_destroy.patch
> 0008-cgroup-make-pre_destroy-return-void.patch
>
> 0001-0002 remove now unused ->pre_destroy() failure handling and do
> follow-up simplification.
>
> 0003-0004 update removal path such that each ->pre_destroy() is
> guaranteed to be invoked once per removal and the cgroup being
> destroyed stays quiescent until destruction is complete.
>
> 0005 removes the scary CGRP_WAIT_ON_RMDIR mechanism.
>
> 0006-0008 are follow-up clean-ups. 0006 and 0007 are from Michal's
> patchset[1].
>
> This patchset is on top of
>
> v3.6 (a0d271cbfe)
> + [1] the first three patches of
> "memcg/cgroup: do not fail fail on pre_destroy callbacks" patchset
>
> and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-rmdir-updates
>
> Thanks.
>
> block/blk-cgroup.c | 3
> include/linux/cgroup.h | 41 -------
> kernel/cgroup.c | 256 +++++++++++--------------------------------------
> mm/hugetlb_cgroup.c | 11 --
> mm/memcontrol.c | 51 +--------
> 5 files changed, 75 insertions(+), 287 deletions(-)


The patches are quite straightforward, and you are basically throwing
useless code away...

The only think that drew my attention is that you are changing the
local_irq_save callsite to local_irq_disable. It shouldn't be a problem,
since this is never expected to be called in interrupt context.

Still... it makes me wonder if that disabled-interrupt block is still
needed? According to the changelogs, it was introduced in e7c5ec919 for
the css_tryget mechanism. But css_tryget itself will never scan
subsystems, so if we can no longer fail, we should be able to just ditch
it. Unless I am missing something

2012-10-31 13:57:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 8/8] cgroup: make ->pre_destroy() return void

On Tue, Oct 30, 2012 at 09:22:45PM -0700, Tejun Heo wrote:
> All ->pre_destory() implementations return 0 now, which is the only
> allowed return value. Make it return void.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> ---
> block/blk-cgroup.c | 3 +--
> include/linux/cgroup.h | 2 +-
> kernel/cgroup.c | 2 +-
> mm/hugetlb_cgroup.c | 4 +---
> mm/memcontrol.c | 3 +--
> 5 files changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index f3b44a6..a7816f3 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -600,7 +600,7 @@ struct cftype blkcg_files[] = {
> *
> * This is the blkcg counterpart of ioc_release_fn().
> */
> -static int blkcg_pre_destroy(struct cgroup *cgroup)
> +static void blkcg_pre_destroy(struct cgroup *cgroup)
> {
> struct blkcg *blkcg = cgroup_to_blkcg(cgroup);
>
> @@ -622,7 +622,6 @@ static int blkcg_pre_destroy(struct cgroup *cgroup)
> }
>
> spin_unlock_irq(&blkcg->lock);
> - return 0;

Looks good from blk-cgroup perspective. ACK for these bits.

Thanks
Vivek

2012-10-31 14:37:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

On Tue 30-10-12 21:22:38, Tejun Heo wrote:
> 2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
> handling") removed the last user of __DEPRECATED_clear_css_refs. This
> patch removes __DEPRECATED_clear_css_refs and mechanisms to support
> it.
>
> * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
>
> * ->pre_destroy() now can only fail if a new task is attached or child
> cgroup is created while ->pre_destroy()s are being called. As the
> condition is checked again after re-acquiring cgroup_mutex
> afterwards, we don't need to take any immediate action on
> ->pre_destroy() failures. This reduces cgroup_call_pre_destroy() to
> a simple loop surrounding ->pre_destory(). Remove
> cgroup_call_pre_destroy() and open-code the loop into
> cgroup_rmdir().
>
> * cgroup_clear_css_refs() can no longer fail. All that needs to be
> done are deactivating refcnts, setting CSS_REMOVED and putting the
> base reference on each css. Remove cgroup_clear_css_refs() and the
> failure path, and open-code the loops into cgroup_rmdir().
>
> Note that cgroup_rmdir() will see more cleanup soon.
>
> Signed-off-by: Tejun Heo <[email protected]>

Looks good to me and the diffstat is really encouraging
Reviewed-by: Michal Hocko <[email protected]>

with a minor note bellow
> ---
> include/linux/cgroup.h | 12 ----
> kernel/cgroup.c | 159 ++++++++++++-------------------------------------
> 2 files changed, 38 insertions(+), 133 deletions(-)
[...]
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 7981850..033bf4b 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
[...]
> @@ -4168,11 +4075,9 @@ again:
> * Call pre_destroy handlers of subsys. Notify subsystems
> * that rmdir() request comes.
> */
> - ret = cgroup_call_pre_destroy(cgrp);
> - if (ret) {
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> - return ret;
> - }
> + for_each_subsys(cgrp->root, ss)
> + if (ss->pre_destroy)
> + WARN_ON_ONCE(ss->pre_destroy(cgrp));

Hmm, I am not sure I like this WARN_ON_ONCE. First it can happen for
more than one controller and second we can just clear CGRP_WAIT_ON_RMDIR
and return with EBUSY. The only possible failure at the moment is when a
new task or a child group appear.
I know it is not a big deal because it will disappear later in the
series but it would be more readable IMO.

Thanks!
--
Michal Hocko
SUSE Labs

2012-10-31 15:39:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On Tue 30-10-12 21:22:39, Tejun Heo wrote:
> CSS_REMOVED is one of the several contortions which were necessary to
> support css reference draining on cgroup removal. All css->refcnts
> which need draining should be deactivated and verified to equal zero
> atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
> needed to be re-activated and css_tryget() shouldn't fail in the
> process.
>
> This was achieved by letting css_tryget() busy-loop until either the
> refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
> (committing to removal).
>
> Now that css refcnt draining is no longer used, there's no need for
> atomic rollback mechanism. css_tryget() simply can look at the
> reference count and fail if the it's deactivated - it's never getting
> re-activated.
>
> This patch removes CSS_REMOVED and updates __css_tryget() to fail if
> the refcnt is deactivated.
>
> Note that this removes css_is_removed() whose only user is VM_BUG_ON()
> in memcontrol.c. We can replace it with a check on the refcnt but
> given that the only use case is a debug assert, I think it's better to
> simply unexport it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>

Few questions/suggestions below but it looks good in general to me.
Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/cgroup.h | 6 ------
> kernel/cgroup.c | 31 ++++++++++++-------------------
> mm/memcontrol.c | 7 +++----
> 3 files changed, 15 insertions(+), 29 deletions(-)
>
[...]
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 033bf4b..a49cdbc 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -170,8 +170,8 @@ struct css_id {
> * The css to which this ID points. This pointer is set to valid value
> * after cgroup is populated. If cgroup is removed, this will be NULL.
> * This pointer is expected to be RCU-safe because destroy()
> - * is called after synchronize_rcu(). But for safe use, css_is_removed()
> - * css_tryget() should be used for avoiding race.
> + * is called after synchronize_rcu(). But for safe use, css_tryget()
> + * should be used for avoiding race.
> */
> struct cgroup_subsys_state __rcu *css;
> /*
> @@ -4088,8 +4088,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> }
> prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>
> - local_irq_disable();
> -

OK, so the new charges shouldn't come from the IRQ context so we cannot
race with css_tryget but why did we need this in the first place?
A separate patch which removes this with an explanation would be nice.

> /* block new css_tryget() by deactivating refcnt */
> for_each_subsys(cgrp->root, ss) {
> struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5a1d584..6f8bd9d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2343,7 +2343,6 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> again:
> if (*ptr) { /* css should be a valid one */
> memcg = *ptr;
> - VM_BUG_ON(css_is_removed(&memcg->css));

All the callers seem to be fine but this was a safety net that something
didn't leak out. Can we keep it and test that the reference counter has
been disabled already (css_refcnt(&memcg->css) < 0 - I do not care
whether open coded or wrapped innsude css_is_removed albeit helper
sounds nicer)?

> if (mem_cgroup_is_root(memcg))
> goto done;
> if (nr_pages == 1 && consume_stock(memcg))
> @@ -2483,9 +2482,9 @@ static void __mem_cgroup_cancel_local_charge(struct mem_cgroup *memcg,
>
> /*
> * A helper function to get mem_cgroup from ID. must be called under
> - * rcu_read_lock(). The caller must check css_is_removed() or some if
> - * it's concern. (dropping refcnt from swap can be called against removed
> - * memcg.)
> + * rcu_read_lock(). The caller is responsible for verifying the returned
> + * memcg is still alive if necessary. (dropping refcnt from swap can be
> + * called against removed memcg.)

I think that something like the following would be more instructive:

+ * rcu_read_lock(). The caller is responsible for calling css_tryget
+ * if the mem_cgroup is used for charging. (dropping refcnt from swap can be
+ * called against removed memcg.)

Thanks!
--
Michal Hocko
SUSE Labs

2012-10-31 15:55:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

On Tue 30-10-12 21:22:40, Tejun Heo wrote:
> This patch makes cgroup_create() fail if @parent is marked removed.
> This is to prepare for further updates to cgroup_rmdir() path.
>
> Note that this change isn't strictly necessary. cgroup can only be
> created via mkdir and the removed marking and dentry removal happen
> without releasing cgroup_mutex, so cgroup_create() can never race with
> cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(),
> cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
> rendering the added liveliness check unnecessary.
>
> Do it anyway such that locking is contained inside cgroup proper and
> we don't get nasty surprises if we ever grow another caller of
> cgroup_create().
>
> Signed-off-by: Tejun Heo <[email protected]>

Looks good. Just a nit bellow
Reviewed-by: Michal Hocko <[email protected]>

> ---
> kernel/cgroup.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index a49cdbc..b3010ae 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3906,6 +3906,18 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> if (!cgrp)
> return -ENOMEM;
>
> + /*
> + * Only live parents can have children. Note that the liveliness
> + * check isn't strictly necessary because cgroup_mkdir() and
> + * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
> + * anyway so that locking is contained inside cgroup proper and we
> + * don't get nasty surprises if we ever grow another caller.
> + */
> + if (!cgroup_lock_live_group(parent)) {
> + err = -ENODEV;
> + goto err_free;
> + }
> +

I think this should be moved up before we try to allocate any memory.
Or is your motivation to keep cgroup_lock held for shorter time?
I could agree with that but a small comment would be helpful.

--
Michal Hocko
SUSE Labs

2012-10-31 16:05:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/8] cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()

On Tue 30-10-12 21:22:41, Tejun Heo wrote:
> Because ->pre_destroy() could fail and can't be called under
> cgroup_mutex, cgroup destruction did something very ugly.

You are referring to a commit in the comment but I would rather see it
here.

> 1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
>
> 2. Release cgroup_mutex and call ->pre_destroy().
>
> 3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
> otherwise.
>
> 4. Continue destroying.
>
> In addition to being ugly, it has been always broken in various ways.
> For example, memcg ->pre_destroy() expects the cgroup to be inactive
> after it's done but tasks can be attached and detached between #2 and
> #3 and the conditions that memcg verified in ->pre_destroy() might no
> longer hold by the time control reaches #3.
>
> Now that ->pre_destroy() is no longer allowed to fail. We can switch
> to the following.
>
> 1. Grab cgroup_mutex and fail if it can't be destroyed; fail
> otherwise.

the other fail is superfluous and too negative ;)

> 2. Deactivate CSS's and mark the cgroup removed thus preventing any
> further operations which can invalidate the verification from #1.
>
> 3. Release cgroup_mutex and call ->pre_destroy().
>
> 4. Re-grab cgroup_mutex and continue destroying.
>
> After this change, controllers can safely assume that ->pre_destroy()
> will only be called only once for a given cgroup and, once
> ->pre_destroy() is called, the cgroup will stay dormant till it's
> destroyed.
>
> Signed-off-by: Tejun Heo <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> kernel/cgroup.c | 41 +++++++++++++++++++----------------------
> 1 file changed, 19 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index b3010ae..66204a6 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4058,18 +4058,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> struct cgroup_event *event, *tmp;
> struct cgroup_subsys *ss;
>
> - /* the vfs holds both inode->i_mutex already */
> - mutex_lock(&cgroup_mutex);
> - if (atomic_read(&cgrp->count) != 0) {
> - mutex_unlock(&cgroup_mutex);
> - return -EBUSY;
> - }
> - if (!list_empty(&cgrp->children)) {
> - mutex_unlock(&cgroup_mutex);
> - return -EBUSY;
> - }
> - mutex_unlock(&cgroup_mutex);
> -
> /*
> * In general, subsystem has no css->refcnt after pre_destroy(). But
> * in racy cases, subsystem may have to get css->refcnt after
> @@ -4081,14 +4069,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> */
> set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>
> - /*
> - * Call pre_destroy handlers of subsys. Notify subsystems
> - * that rmdir() request comes.
> - */
> - for_each_subsys(cgrp->root, ss)
> - if (ss->pre_destroy)
> - WARN_ON_ONCE(ss->pre_destroy(cgrp));
> -
> + /* the vfs holds both inode->i_mutex already */
> mutex_lock(&cgroup_mutex);
> parent = cgrp->parent;
> if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> @@ -4098,13 +4079,30 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> }
> prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>
> - /* block new css_tryget() by deactivating refcnt */
> + /*
> + * Block new css_tryget() by deactivating refcnt and mark @cgrp
> + * removed. This makes future css_tryget() and child creation
> + * attempts fail thus maintaining the removal conditions verified
> + * above.
> + */
> for_each_subsys(cgrp->root, ss) {
> struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
>
> WARN_ON(atomic_read(&css->refcnt) < 0);
> atomic_add(CSS_DEACT_BIAS, &css->refcnt);
> }
> + set_bit(CGRP_REMOVED, &cgrp->flags);
> +
> + /*
> + * Tell subsystems to initate destruction. pre_destroy() should be
> + * called with cgroup_mutex unlocked. See 3fa59dfbc3 ("cgroup: fix
> + * potential deadlock in pre_destroy") for details.
> + */
> + mutex_unlock(&cgroup_mutex);
> + for_each_subsys(cgrp->root, ss)
> + if (ss->pre_destroy)
> + WARN_ON_ONCE(ss->pre_destroy(cgrp));
> + mutex_lock(&cgroup_mutex);
>
> /*
> * Put all the base refs. Each css holds an extra reference to the
> @@ -4120,7 +4118,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
>
> raw_spin_lock(&release_list_lock);
> - set_bit(CGRP_REMOVED, &cgrp->flags);
> if (!list_empty(&cgrp->release_list))
> list_del_init(&cgrp->release_list);
> raw_spin_unlock(&release_list_lock);
> --
> 1.7.11.7
>

--
Michal Hocko
SUSE Labs

2012-10-31 16:27:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 5/8] cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()

On Tue 30-10-12 21:22:42, Tejun Heo wrote:
> CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
> destruction rollback somewhat working. cgroup_rmdir() used to drain
> CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
> helpers were used to allow the task performing rmdir to wait for the
> next relevant event.
>
> Unfortunately, the wait is visible to controllers too and the
> mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
> sleep at rmdir").
>
> Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
> unnecessary. Remove it and all the mechanisms supporting it. Note
> that memcontrol.c changes are essentially revert of 887032670d
> ("cgroup avoid permanent sleep at rmdir").
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> include/linux/cgroup.h | 21 ---------------------
> kernel/cgroup.c | 51 --------------------------------------------------
> mm/memcontrol.c | 24 +-----------------------
> 3 files changed, 1 insertion(+), 95 deletions(-)

/me likes this

>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index a309804..47868a8 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -145,10 +145,6 @@ enum {
> /* Control Group requires release notifications to userspace */
> CGRP_NOTIFY_ON_RELEASE,
> /*
> - * A thread in rmdir() is wating for this cgroup.
> - */
> - CGRP_WAIT_ON_RMDIR,
> - /*
> * Clone cgroup values when creating a new child cgroup
> */
> CGRP_CLONE_CHILDREN,
> @@ -412,23 +408,6 @@ int cgroup_task_count(const struct cgroup *cgrp);
> int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>
> /*
> - * When the subsys has to access css and may add permanent refcnt to css,
> - * it should take care of racy conditions with rmdir(). Following set of
> - * functions, is for stop/restart rmdir if necessary.
> - * Because these will call css_get/put, "css" should be alive css.
> - *
> - * cgroup_exclude_rmdir();
> - * ...do some jobs which may access arbitrary empty cgroup
> - * cgroup_release_and_wakeup_rmdir();
> - *
> - * When someone removes a cgroup while cgroup_exclude_rmdir() holds it,
> - * it sleeps and cgroup_release_and_wakeup_rmdir() will wake him up.
> - */
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
> -
> -/*
> * Control Group taskset, used to pass around set of tasks to cgroup_subsys
> * methods.
> */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66204a6..c5f6fb2 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -966,33 +966,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
> }
>
> /*
> - * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> - * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> - * reference to css->refcnt. In general, this refcnt is expected to goes down
> - * to zero, soon.
> - *
> - * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> - */
> -static DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> -
> -static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> -{
> - if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> - wake_up_all(&cgroup_rmdir_waitq);
> -}
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> -{
> - css_get(css);
> -}
> -
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> -{
> - cgroup_wakeup_rmdir_waiter(css->cgroup);
> - css_put(css);
> -}
> -
> -/*
> * Call with cgroup_mutex held. Drops reference counts on modules, including
> * any duplicate ones that parse_cgroupfs_options took. If this function
> * returns an error, no reference counts are touched.
> @@ -1963,12 +1936,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> }
>
> synchronize_rcu();
> -
> - /*
> - * wake up rmdir() waiter. the rmdir should fail since the cgroup


> - * is no longer empty.
> - */
> - cgroup_wakeup_rmdir_waiter(cgrp);
> out:
> if (retval) {
> for_each_subsys(root, ss) {
> @@ -2138,7 +2105,6 @@ static int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> * step 5: success! and cleanup
> */
> synchronize_rcu();
> - cgroup_wakeup_rmdir_waiter(cgrp);
> retval = 0;
> out_put_css_set_refs:
> if (retval) {
> @@ -4058,26 +4024,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> struct cgroup_event *event, *tmp;
> struct cgroup_subsys *ss;
>
> - /*
> - * In general, subsystem has no css->refcnt after pre_destroy(). But
> - * in racy cases, subsystem may have to get css->refcnt after
> - * pre_destroy() and it makes rmdir return with -EBUSY. This sometimes
> - * make rmdir return -EBUSY too often. To avoid that, we use waitqueue
> - * for cgroup's rmdir. CGRP_WAIT_ON_RMDIR is for synchronizing rmdir
> - * and subsystem's reference count handling. Please see css_get/put
> - * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
> - */
> - set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -
> /* the vfs holds both inode->i_mutex already */
> mutex_lock(&cgroup_mutex);
> parent = cgrp->parent;
> if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> mutex_unlock(&cgroup_mutex);
> return -EBUSY;
> }
> - prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>
> /*
> * Block new css_tryget() by deactivating refcnt and mark @cgrp
> @@ -4114,9 +4067,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> for_each_subsys(cgrp->root, ss)
> css_put(cgrp->subsys[ss->subsys_id]);
>
> - finish_wait(&cgroup_rmdir_waitq, &wait);
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -
> raw_spin_lock(&release_list_lock);
> if (!list_empty(&cgrp->release_list))
> list_del_init(&cgrp->release_list);
> @@ -4864,7 +4814,6 @@ void __css_put(struct cgroup_subsys_state *css)
> set_bit(CGRP_RELEASABLE, &cgrp->flags);
> check_for_release(cgrp);
> }
> - cgroup_wakeup_rmdir_waiter(cgrp);
> break;
> case 0:
> schedule_work(&css->dput_work);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6f8bd9d..1033b2b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2681,13 +2681,6 @@ static int mem_cgroup_move_account(struct page *page,
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> mem_cgroup_charge_statistics(to, anon, nr_pages);
> - /*
> - * We charges against "to" which may not have any tasks. Then, "to"
> - * can be under rmdir(). But in current implementation, caller of
> - * this function is just force_empty() and move charge, so it's
> - * guaranteed that "to" is never removed. So, we don't check rmdir
> - * status here.
> - */
> move_unlock_mem_cgroup(from, &flags);
> ret = 0;
> unlock:
> @@ -2893,7 +2886,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
> return;
> if (!memcg)
> return;
> - cgroup_exclude_rmdir(&memcg->css);
>
> __mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
> /*
> @@ -2907,12 +2899,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
> swp_entry_t ent = {.val = page_private(page)};
> mem_cgroup_uncharge_swap(ent);
> }
> - /*
> - * At swapin, we may charge account against cgroup which has no tasks.
> - * So, rmdir()->pre_destroy() can be called while we do this charge.
> - * In that case, we need to call pre_destroy() again. check it here.
> - */
> - cgroup_release_and_wakeup_rmdir(&memcg->css);
> }
>
> void mem_cgroup_commit_charge_swapin(struct page *page,
> @@ -3360,8 +3346,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,
>
> if (!memcg)
> return;
> - /* blocks rmdir() */
> - cgroup_exclude_rmdir(&memcg->css);
> +
> if (!migration_ok) {
> used = oldpage;
> unused = newpage;
> @@ -3395,13 +3380,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> */
> if (anon)
> mem_cgroup_uncharge_page(used);
> - /*
> - * At migration, we may charge account against cgroup which has no
> - * tasks.
> - * So, rmdir()->pre_destroy() can be called while we do this charge.
> - * In that case, we need to call pre_destroy() again. check it here.
> - */
> - cgroup_release_and_wakeup_rmdir(&memcg->css);
> }
>
> /*
> --
> 1.7.11.7
>

--
Michal Hocko
SUSE Labs

2012-10-31 16:28:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 8/8] cgroup: make ->pre_destroy() return void

On Tue 30-10-12 21:22:45, Tejun Heo wrote:
> All ->pre_destory() implementations return 0 now, which is the only
> allowed return value. Make it return void.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Vivek Goyal <[email protected]>

Acked-by: Michal Hocko <[email protected]>

> ---
> block/blk-cgroup.c | 3 +--
> include/linux/cgroup.h | 2 +-
> kernel/cgroup.c | 2 +-
> mm/hugetlb_cgroup.c | 4 +---
> mm/memcontrol.c | 3 +--
> 5 files changed, 5 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index f3b44a6..a7816f3 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -600,7 +600,7 @@ struct cftype blkcg_files[] = {
> *
> * This is the blkcg counterpart of ioc_release_fn().
> */
> -static int blkcg_pre_destroy(struct cgroup *cgroup)
> +static void blkcg_pre_destroy(struct cgroup *cgroup)
> {
> struct blkcg *blkcg = cgroup_to_blkcg(cgroup);
>
> @@ -622,7 +622,6 @@ static int blkcg_pre_destroy(struct cgroup *cgroup)
> }
>
> spin_unlock_irq(&blkcg->lock);
> - return 0;
> }
>
> static void blkcg_destroy(struct cgroup *cgroup)
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 47868a8..adb2adc 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -436,7 +436,7 @@ int cgroup_taskset_size(struct cgroup_taskset *tset);
>
> struct cgroup_subsys {
> struct cgroup_subsys_state *(*create)(struct cgroup *cgrp);
> - int (*pre_destroy)(struct cgroup *cgrp);
> + void (*pre_destroy)(struct cgroup *cgrp);
> void (*destroy)(struct cgroup *cgrp);
> int (*can_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
> void (*cancel_attach)(struct cgroup *cgrp, struct cgroup_taskset *tset);
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index c5f6fb2..83cd7d0 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -4054,7 +4054,7 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> mutex_unlock(&cgroup_mutex);
> for_each_subsys(cgrp->root, ss)
> if (ss->pre_destroy)
> - WARN_ON_ONCE(ss->pre_destroy(cgrp));
> + ss->pre_destroy(cgrp);
> mutex_lock(&cgroup_mutex);
>
> /*
> diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> index dc595c6..0d3a1a3 100644
> --- a/mm/hugetlb_cgroup.c
> +++ b/mm/hugetlb_cgroup.c
> @@ -155,7 +155,7 @@ out:
> * Force the hugetlb cgroup to empty the hugetlb resources by moving them to
> * the parent cgroup.
> */
> -static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> +static void hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> {
> struct hstate *h;
> struct page *page;
> @@ -172,8 +172,6 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
> }
> cond_resched();
> } while (hugetlb_cgroup_have_usage(cgroup));
> -
> - return 0;
> }
>
> int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 47c4680..af05a60 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5002,12 +5002,11 @@ free_out:
> return ERR_PTR(error);
> }
>
> -static int mem_cgroup_pre_destroy(struct cgroup *cont)
> +static void mem_cgroup_pre_destroy(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
>
> mem_cgroup_reparent_charges(memcg);
> - return 0;
> }
>
> static void mem_cgroup_destroy(struct cgroup *cont)
> --
> 1.7.11.7
>

--
Michal Hocko
SUSE Labs

2012-10-31 16:31:36

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

Thanks for this cleanup. The code looks much better now and:
$ git diff --stat mmotm...tj-cgroups/review-cgroup-rmdir-updates
block/blk-cgroup.c | 3 +-
include/linux/cgroup.h | 41 +-------
kernel/cgroup.c | 256 ++++++++++++------------------------------------
mm/hugetlb_cgroup.c | 11 +--
mm/memcontrol.c | 51 ++--------
5 files changed, 75 insertions(+), 287 deletions(-)

speaks for itself.
--
Michal Hocko
SUSE Labs

2012-10-31 16:35:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

On Wed, Oct 31, 2012 at 05:31:34PM +0100, Michal Hocko wrote:
> Thanks for this cleanup. The code looks much better now and:
> $ git diff --stat mmotm...tj-cgroups/review-cgroup-rmdir-updates
> block/blk-cgroup.c | 3 +-
> include/linux/cgroup.h | 41 +-------
> kernel/cgroup.c | 256 ++++++++++++------------------------------------
> mm/hugetlb_cgroup.c | 11 +--
> mm/memcontrol.c | 51 ++--------
> 5 files changed, 75 insertions(+), 287 deletions(-)
>
> speaks for itself.

Oh, yeah, I've been wanting to do this for so long. :)

Thanks.

--
tejun

2012-10-31 16:38:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

Hello, Glauber.

On Wed, Oct 31, 2012 at 05:21:29PM +0400, Glauber Costa wrote:
> > +
> > + local_irq_disable();
> > +
> > + /* block new css_tryget() by deactivating refcnt */
> > + for_each_subsys(cgrp->root, ss) {
> > + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> > +
> > + WARN_ON(atomic_read(&css->refcnt) < 0);
> > + atomic_add(CSS_DEACT_BIAS, &css->refcnt);
> > }
> > - /* NO css_tryget() can success after here. */
> > +
> > + /*
> > + * Set REMOVED. All in-progress css_tryget() will be released.
> > + * Put all the base refs. Each css holds an extra reference to the
> > + * cgroup's dentry and cgroup removal proceeds regardless of css
> > + * refs. On the last put of each css, whenever that may be, the
> > + * extra dentry ref is put so that dentry destruction happens only
> > + * after all css's are released.
> > + */
> > + for_each_subsys(cgrp->root, ss) {
> > + struct cgroup_subsys_state *css = cgrp->subsys[ss->subsys_id];
> > +
> > + set_bit(CSS_REMOVED, &css->flags);
> > + css_put(css);
> > + }
> > +
> > + local_irq_enable();
> > +
> > finish_wait(&cgroup_rmdir_waitq, &wait);
>
> It seems to me that the main reason for doing this in two passes was to
> derive a global "failed" state on the first pass. Now that we can't
> fail, why can't you just loop through the subsystems just once ?

Later in the series, pre_destroy() happens inbetween. The first loop
marks the CSSes and cgrp dead, ->pre_destroy()s are called, and the
later loop will put the final ref. I'll mention it in the commit
message.

Thanks.

--
tejun

2012-10-31 16:41:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

Hey, Michal.

On Wed, Oct 31, 2012 at 03:37:51PM +0100, Michal Hocko wrote:
> > + for_each_subsys(cgrp->root, ss)
> > + if (ss->pre_destroy)
> > + WARN_ON_ONCE(ss->pre_destroy(cgrp));
>
> Hmm, I am not sure I like this WARN_ON_ONCE. First it can happen for
> more than one controller and second we can just clear CGRP_WAIT_ON_RMDIR
> and return with EBUSY. The only possible failure at the moment is when a
> new task or a child group appear.
> I know it is not a big deal because it will disappear later in the
> series but it would be more readable IMO.

The WARN_ON_ONCE() is just moved from the original
cgroup_call_pre_destroy(). We can add an error out there but that
makes future changes difficult. It's a chicken and egg problem. You
gotta break the loop somewhere.

Thanks.

--
tejun

2012-10-31 16:49:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

On Wed 31-10-12 09:41:23, Tejun Heo wrote:
> Hey, Michal.
>
> On Wed, Oct 31, 2012 at 03:37:51PM +0100, Michal Hocko wrote:
> > > + for_each_subsys(cgrp->root, ss)
> > > + if (ss->pre_destroy)
> > > + WARN_ON_ONCE(ss->pre_destroy(cgrp));
> >
> > Hmm, I am not sure I like this WARN_ON_ONCE. First it can happen for
> > more than one controller and second we can just clear CGRP_WAIT_ON_RMDIR
> > and return with EBUSY. The only possible failure at the moment is when a
> > new task or a child group appear.
> > I know it is not a big deal because it will disappear later in the
> > series but it would be more readable IMO.
>
> The WARN_ON_ONCE() is just moved from the original
> cgroup_call_pre_destroy(). We can add an error out there but that
> makes future changes difficult. It's a chicken and egg problem. You
> gotta break the loop somewhere.

I do not think this is that hard. You can simply change the return path
on pre_destroy error by BUG_ON in "cgroup: deactivate CSS's and mark
cgroup dead before invoking ->pre_destroy()". There is no chicke&egg
problem here because once the group is marked dead and css frozen then
the existing callbacks cannot possibly fail.
Or am I missing your point?
--
Michal Hocko
SUSE Labs

2012-10-31 16:57:45

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

Hey, Michal.

On Wed, Oct 31, 2012 at 04:39:26PM +0100, Michal Hocko wrote:
> > prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> >
> > - local_irq_disable();
> > -
>
> OK, so the new charges shouldn't come from the IRQ context so we cannot
> race with css_tryget but why did we need this in the first place?
> A separate patch which removes this with an explanation would be nice.

The change is actually tied to this one. Because css_tryget() busy
loops on DEACT_BIAS && !CSS_REMOVED and css_tryget() may happen from
an IRQ context, we need to disable IRQ while deactivating refcnts and
setting CSS_REMOVED. I'll mention it in the commit message.

> > @@ -2343,7 +2343,6 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > again:
> > if (*ptr) { /* css should be a valid one */
> > memcg = *ptr;
> > - VM_BUG_ON(css_is_removed(&memcg->css));
>
> All the callers seem to be fine but this was a safety net that something
> didn't leak out. Can we keep it and test that the reference counter has
> been disabled already (css_refcnt(&memcg->css) < 0 - I do not care
> whether open coded or wrapped innsude css_is_removed albeit helper
> sounds nicer)?

I don't think that's a good idea. In general, I think too much of
cgroup internals are exposed to controllers. People try to implement
weird behaviors and expose cgroup internals for that, which in turn
attracts more weirdness, and there seems to be a pattern - cgroup core
is unnecessarily coupled with VFS locking like controllers are
unnecessarily coupled with cgroup internal locking. I really wanna
move away from such pattern. I mean, you can't even know
css_is_removed() isn't gonna change while the function is in progress.

I have a patch queued to add ->pre_destroy() - different from
Glauber's in that it can't fail, so we'll have

->create()
->post_create()
->pre_destroy()
->destroy()

Where ->create() may fail but none other can. ->post_create() and
->pre_destroy() mark the point where a cgroup is committed to and
decommissioned from active service and thus can be used as
synchronization points. If you want liveliness check inside memcg,
please take the necessary actions (synchronization and marking) from
->post_create() and ->pre_destroy() and check against that. That way,
you control your locking and there will also be a general mechanism to
iterate through a cgroup's children/descendants which can also be
synchronized that way. I'm planning to send the series out later
today.

> I think that something like the following would be more instructive:
>
> + * rcu_read_lock(). The caller is responsible for calling css_tryget
> + * if the mem_cgroup is used for charging. (dropping refcnt from swap can be
> + * called against removed memcg.)

So updated. Thanks!

--
tejun

2012-10-31 17:04:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

Hey, Michal.

On Wed, Oct 31, 2012 at 04:55:14PM +0100, Michal Hocko wrote:
> > + /*
> > + * Only live parents can have children. Note that the liveliness
> > + * check isn't strictly necessary because cgroup_mkdir() and
> > + * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
> > + * anyway so that locking is contained inside cgroup proper and we
> > + * don't get nasty surprises if we ever grow another caller.
> > + */
> > + if (!cgroup_lock_live_group(parent)) {
> > + err = -ENODEV;
> > + goto err_free;
> > + }
> > +
>
> I think this should be moved up before we try to allocate any memory.
> Or is your motivation to keep cgroup_lock held for shorter time?
> I could agree with that but a small comment would be helpful.

Then I have to change the error out path more and I'm not sure I wanna
call deactivate_super() under cgroup_mutex. It's just simpler this
way.

Thanks.

--
tejun

2012-10-31 17:06:37

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On 10/31/2012 08:57 PM, Tejun Heo wrote:
> I have a patch queued to add ->pre_destroy() - different from
> Glauber's in that it can't fail, so we'll have
>
> ->create()
> ->post_create()
> ->pre_destroy()
> ->destroy()
>
> Where ->create() may fail but none other can.

This is not the topic of this thread, but since you brought it:
If you take a look at the description patch in the patch I sent, the
problem I outlined is that at create time, we don't know anything about
which will the css_id be - and I would like to avoid creating yet
another index.

Is there any way you would suggest of handling this ? Any chance of us
allocating the css_id earlier then?

2012-10-31 17:10:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

Hello, Glauber.

On Wed, Oct 31, 2012 at 10:06 AM, Glauber Costa <[email protected]> wrote:
> This is not the topic of this thread, but since you brought it:
> If you take a look at the description patch in the patch I sent, the
> problem I outlined is that at create time, we don't know anything about
> which will the css_id be - and I would like to avoid creating yet
> another index.
>
> Is there any way you would suggest of handling this ? Any chance of us
> allocating the css_id earlier then?

I only glanced the patch description but the problem I'm trying to
solve is locking - currently we don't have a place where a controller
can tell a cgroup is becoming online; thus it has nothing to
synchronized against and tell that a cgroup is alive or not. As for
css_id allocation, maybe you can deal with that in ->post_create() or
maybe we can allocate css_id earlier (but where would it be stored?).
I'll look into it.

Thanks.

--
tejun

2012-10-31 17:16:52

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 5/8] cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()

On Wed, Oct 31, 2012 at 05:27:35PM +0100, Michal Hocko wrote:
> > ---
> > include/linux/cgroup.h | 21 ---------------------
> > kernel/cgroup.c | 51 --------------------------------------------------
> > mm/memcontrol.c | 24 +-----------------------
> > 3 files changed, 1 insertion(+), 95 deletions(-)
>
> /me likes this

Amen, Michal. The amount of code deleted isn't much but it was one
ugly thing. The comment in memcg around the use of exclude_rmdir was
very depressing.

Thanks.

--
tejun

2012-10-31 17:18:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

Hello,

On Wed, Oct 31, 2012 at 05:49:33PM +0400, Glauber Costa wrote:
> The only think that drew my attention is that you are changing the
> local_irq_save callsite to local_irq_disable. It shouldn't be a problem,
> since this is never expected to be called in interrupt context.
>
> Still... it makes me wonder if that disabled-interrupt block is still
> needed? According to the changelogs, it was introduced in e7c5ec919 for
> the css_tryget mechanism. But css_tryget itself will never scan
> subsystems, so if we can no longer fail, we should be able to just ditch
> it. Unless I am missing something

Note both in the commit messages.

Thanks.

--
tejun

2012-10-31 17:20:11

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On 10/31/2012 09:10 PM, Tejun Heo wrote:
> Hello, Glauber.
>
> On Wed, Oct 31, 2012 at 10:06 AM, Glauber Costa <[email protected]> wrote:
>> This is not the topic of this thread, but since you brought it:
>> If you take a look at the description patch in the patch I sent, the
>> problem I outlined is that at create time, we don't know anything about
>> which will the css_id be - and I would like to avoid creating yet
>> another index.
>>
>> Is there any way you would suggest of handling this ? Any chance of us
>> allocating the css_id earlier then?
>
> I only glanced the patch description but the problem I'm trying to
> solve is locking - currently we don't have a place where a controller
> can tell a cgroup is becoming online; thus it has nothing to
> synchronized against and tell that a cgroup is alive or not. As for
> css_id allocation, maybe you can deal with that in ->post_create() or
> maybe we can allocate css_id earlier (but where would it be stored?).
> I'll look into it.
>

The css_id is allocated right after ->create(). So if post_create() is
called after that, I can use it just fine - which is what I do in my
patch. Problem is, since I do memory allocation based on that, it can fail.

So although I would have to look at your series myself to see exactly
what you are trying to achieve (looking forward), I seemed natural to me
to think about it terms of "early_create" + "late_create" (->create())
and ->post_create()), where the later is a callback when things are
readier (in this case, the css_id).

I don't see post_create failing as a huge problem. The natural
synchronization point would be "right after post_create" - then you can
definitely tell that it is online. Although this can be viewed a bit as
"exposing internals", creating is different then destroying: When you
create, you may not have all data yet. When destroying, you do - and
want to get rid of it. So this kind of bootstrapping is pretty standard
and common.


2012-10-31 17:22:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

Hey, Michal.

On Wed, Oct 31, 2012 at 05:48:55PM +0100, Michal Hocko wrote:
> > The WARN_ON_ONCE() is just moved from the original
> > cgroup_call_pre_destroy(). We can add an error out there but that
> > makes future changes difficult. It's a chicken and egg problem. You
> > gotta break the loop somewhere.
>
> I do not think this is that hard. You can simply change the return path
> on pre_destroy error by BUG_ON in "cgroup: deactivate CSS's and mark
> cgroup dead before invoking ->pre_destroy()". There is no chicke&egg
> problem here because once the group is marked dead and css frozen then
> the existing callbacks cannot possibly fail.
> Or am I missing your point?

I want to be able to move code verbatim in the later patch when
relocating ->pre_destroy() invocations. Hmmm... alright, I'll make it
break out of pre_destroy() here and change it to WARN_ON_ONCE() later
on.

Thanks.

--
tejun

2012-10-31 17:24:25

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

On 10/31/2012 09:18 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 31, 2012 at 05:49:33PM +0400, Glauber Costa wrote:
>> The only think that drew my attention is that you are changing the
>> local_irq_save callsite to local_irq_disable. It shouldn't be a problem,
>> since this is never expected to be called in interrupt context.
>>
>> Still... it makes me wonder if that disabled-interrupt block is still
>> needed? According to the changelogs, it was introduced in e7c5ec919 for
>> the css_tryget mechanism. But css_tryget itself will never scan
>> subsystems, so if we can no longer fail, we should be able to just ditch
>> it. Unless I am missing something
>
> Note both in the commit messages.
>

I am sorry, but I can't find anything that may be related to this in the
commit messages. Can you be more specific ?

2012-10-31 17:25:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

Hello,

On Wed, Oct 31, 2012 at 09:19:51PM +0400, Glauber Costa wrote:
> I don't see post_create failing as a huge problem. The natural
> synchronization point would be "right after post_create" - then you can
> definitely tell that it is online. Although this can be viewed a bit as
> "exposing internals", creating is different then destroying: When you
> create, you may not have all data yet. When destroying, you do - and
> want to get rid of it. So this kind of bootstrapping is pretty standard
> and common.

More proper names for these callbacks would be,

->allocate()
->online()
->offline()
->free()

And I may rename them. I don't wanna make ->online() failable. Why
can't you just allocate everything from ->allocate() and use it from
->online()?

Thanks.

--
tejun

2012-10-31 17:26:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

On Wed, Oct 31, 2012 at 09:24:06PM +0400, Glauber Costa wrote:
> > Note both in the commit messages.
>
> I am sorry, but I can't find anything that may be related to this in the
> commit messages. Can you be more specific ?

Eh.. 'd', missing there. I meant that I noted both in the updated
commit messages, so they will be mentioned in commit messages on the
next posting of this series.

Thanks.

--
tejun

2012-10-31 17:34:06

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCHSET] cgroup: simplify cgroup removal path

On 10/31/2012 09:26 PM, Tejun Heo wrote:
> On Wed, Oct 31, 2012 at 09:24:06PM +0400, Glauber Costa wrote:
>>> Note both in the commit messages.
>>
>> I am sorry, but I can't find anything that may be related to this in the
>> commit messages. Can you be more specific ?
>
> Eh.. 'd', missing there. I meant that I noted both in the updated
> commit messages, so they will be mentioned in commit messages on the
> next posting of this series.
>

Ahhh! All right.

2012-10-31 17:38:53

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On 10/31/2012 09:25 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Oct 31, 2012 at 09:19:51PM +0400, Glauber Costa wrote:
>> I don't see post_create failing as a huge problem. The natural
>> synchronization point would be "right after post_create" - then you can
>> definitely tell that it is online. Although this can be viewed a bit as
>> "exposing internals", creating is different then destroying: When you
>> create, you may not have all data yet. When destroying, you do - and
>> want to get rid of it. So this kind of bootstrapping is pretty standard
>> and common.
>
> More proper names for these callbacks would be,
>
> ->allocate()
> ->online()
> ->offline()
> ->free()
>
> And I may rename them. I don't wanna make ->online() failable. Why
> can't you just allocate everything from ->allocate() and use it from
> ->online()?
>

Because I am allocating an array big enough to hold one entry per memcg.
The natural array index for this, is the css_id. Obviously, I don't want
this array to have 65k entries in size, so I resize it (doubling every
time) Because I don't know the css_id at this time, I have to do it later.

Another option for this - which I also considered - would be to use a
different index. We get more packing, since not all memcgs will be kmem
limited (and the index would contain only the kmem limited memcgs), and
we can allocate this index during ->create().

I initially picked the css_index because I though a specialized index
might be confusing. But if you feel strongly about all the allocations
happening inside ->create(), this would be a way to avoid it. Would you
prefer that?

2012-10-31 17:39:23

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On 10/31/2012 09:25 PM, Tejun Heo wrote:
> More proper names for these callbacks would be,
>
> ->allocate()
> ->online()
> ->offline()
> ->free()
I support the name change, btw.

2012-10-31 17:44:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

Hello, Glauber.

On Wed, Oct 31, 2012 at 09:38:32PM +0400, Glauber Costa wrote:
> Because I am allocating an array big enough to hold one entry per memcg.
> The natural array index for this, is the css_id. Obviously, I don't want
> this array to have 65k entries in size, so I resize it (doubling every
> time) Because I don't know the css_id at this time, I have to do it later.
>
> Another option for this - which I also considered - would be to use a
> different index. We get more packing, since not all memcgs will be kmem
> limited (and the index would contain only the kmem limited memcgs), and
> we can allocate this index during ->create().
>
> I initially picked the css_index because I though a specialized index
> might be confusing. But if you feel strongly about all the allocations
> happening inside ->create(), this would be a way to avoid it. Would you
> prefer that?

I'll think more about it. The whole css_id thing might need some soul
searching too anyway. Let's continue this one your patch thread.

Thanks.

--
tejun

2012-10-31 19:16:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

On Wed 31-10-12 09:57:39, Tejun Heo wrote:
> Hey, Michal.
>
> On Wed, Oct 31, 2012 at 04:39:26PM +0100, Michal Hocko wrote:
> > > prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
> > >
> > > - local_irq_disable();
> > > -
> >
> > OK, so the new charges shouldn't come from the IRQ context so we cannot
> > race with css_tryget but why did we need this in the first place?
> > A separate patch which removes this with an explanation would be nice.
>
> The change is actually tied to this one. Because css_tryget() busy
> loops on DEACT_BIAS && !CSS_REMOVED and css_tryget() may happen from
> an IRQ context, we need to disable IRQ while deactivating refcnts and
> setting CSS_REMOVED. I'll mention it in the commit message.

OK

> > > @@ -2343,7 +2343,6 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > > again:
> > > if (*ptr) { /* css should be a valid one */
> > > memcg = *ptr;
> > > - VM_BUG_ON(css_is_removed(&memcg->css));
> >
> > All the callers seem to be fine but this was a safety net that something
> > didn't leak out. Can we keep it and test that the reference counter has
> > been disabled already (css_refcnt(&memcg->css) < 0 - I do not care
> > whether open coded or wrapped innsude css_is_removed albeit helper
> > sounds nicer)?
>
> I don't think that's a good idea. In general, I think too much of
> cgroup internals are exposed to controllers. People try to implement
> weird behaviors and expose cgroup internals for that, which in turn
> attracts more weirdness, and there seems to be a pattern - cgroup core
> is unnecessarily coupled with VFS locking like controllers are
> unnecessarily coupled with cgroup internal locking. I really wanna
> move away from such pattern. I mean, you can't even know
> css_is_removed() isn't gonna change while the function is in progress.

Sure, it cannot detect races but this wasn't the intention of the check.
The more important part is that memcg can outlive css and we want to
catch bugs when we try to charge to an already dead memcg.

[...]
--
Michal Hocko
SUSE Labs

2012-10-31 19:33:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

Hey, Michal.

On Wed, Oct 31, 2012 at 08:16:02PM +0100, Michal Hocko wrote:
> Sure, it cannot detect races but this wasn't the intention of the check.
> The more important part is that memcg can outlive css and we want to
> catch bugs when we try to charge to an already dead memcg.

cgroup will provide mechanisms to perform such checks properly
synchronized, so let's not leak internals like CSS rercnt going
negative outside cgroup.c.

Thanks.

--
tejun

2012-11-01 09:16:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

On Wed 31-10-12 10:04:31, Tejun Heo wrote:
> Hey, Michal.
>
> On Wed, Oct 31, 2012 at 04:55:14PM +0100, Michal Hocko wrote:
> > > + /*
> > > + * Only live parents can have children. Note that the liveliness
> > > + * check isn't strictly necessary because cgroup_mkdir() and
> > > + * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
> > > + * anyway so that locking is contained inside cgroup proper and we
> > > + * don't get nasty surprises if we ever grow another caller.
> > > + */
> > > + if (!cgroup_lock_live_group(parent)) {
> > > + err = -ENODEV;
> > > + goto err_free;
> > > + }
> > > +
> >
> > I think this should be moved up before we try to allocate any memory.
> > Or is your motivation to keep cgroup_lock held for shorter time?
> > I could agree with that but a small comment would be helpful.
>
> Then I have to change the error out path more and I'm not sure I wanna
> call deactivate_super() under cgroup_mutex. It's just simpler this
> way.

I am not sure I understand. What does deactivate_super has to do with
the above suggestion? cgroup_lock_live_group will take the cgroup_mutex
on the success or frees the previously allocated&unused memory. The
only thing I was suggesting is to do cgroup_lock_live_group first and
allocate the cgroup only if it doesn't fail.

--
Michal Hocko
SUSE Labs

2012-11-01 14:52:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

Hey, Michal.

On Thu, Nov 1, 2012 at 2:16 AM, Michal Hocko <[email protected]> wrote:
> I am not sure I understand. What does deactivate_super has to do with
> the above suggestion? cgroup_lock_live_group will take the cgroup_mutex
> on the success or frees the previously allocated&unused memory. The
> only thing I was suggesting is to do cgroup_lock_live_group first and
> allocate the cgroup only if it doesn't fail.

It complicates updates to the error exit path. You end up having to
update a lot more and it's not like grabbing lock first is
substantially better in any way, so why bother?

Thanks.

--
tejun

2012-11-01 15:05:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

On Thu 01-11-12 07:52:24, Tejun Heo wrote:
> Hey, Michal.
>
> On Thu, Nov 1, 2012 at 2:16 AM, Michal Hocko <[email protected]> wrote:
> > I am not sure I understand. What does deactivate_super has to do with
> > the above suggestion? cgroup_lock_live_group will take the cgroup_mutex
> > on the success or frees the previously allocated&unused memory. The
> > only thing I was suggesting is to do cgroup_lock_live_group first and
> > allocate the cgroup only if it doesn't fail.
>
> It complicates updates to the error exit path.

Still don't get it, sorry. What prevents us to do:
static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
umode_t mode)
{
struct cgroup *cgrp;
struct cgroupfs_root *root = parent->root;
int err = 0;
struct cgroup_subsys *ss;
struct super_block *sb = root->sb;

if (!cgroup_lock_live_group(parent))
return -ENODEV;

cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
return -ENOMEM;

> You end up having to update a lot more and it's not like grabbing lock
> first is substantially better in any way, so why bother?

Yes the allocation can sleep if we are short on memory so this can
potentially take long which is not entirely nice but a pointless
allocation is not nice either. Anyway I am asking because I am trying to
understand what is the motivation behind and your explanation about the
error exit path doesn't make much sense to me. So I am either missing
something or we are talking about two different things.
--
Michal Hocko
SUSE Labs

2012-11-01 15:16:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

On Thu 01-11-12 16:05:32, Michal Hocko wrote:
> On Thu 01-11-12 07:52:24, Tejun Heo wrote:
> > Hey, Michal.
> >
> > On Thu, Nov 1, 2012 at 2:16 AM, Michal Hocko <[email protected]> wrote:
> > > I am not sure I understand. What does deactivate_super has to do with
> > > the above suggestion? cgroup_lock_live_group will take the cgroup_mutex
> > > on the success or frees the previously allocated&unused memory. The
> > > only thing I was suggesting is to do cgroup_lock_live_group first and
> > > allocate the cgroup only if it doesn't fail.
> >
> > It complicates updates to the error exit path.
>
> Still don't get it, sorry. What prevents us to do:
> static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> umode_t mode)
> {
> struct cgroup *cgrp;
> struct cgroupfs_root *root = parent->root;
> int err = 0;
> struct cgroup_subsys *ss;
> struct super_block *sb = root->sb;
>
> if (!cgroup_lock_live_group(parent))
> return -ENODEV;
>
> cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
> if (!cgrp)
> return -ENOMEM;

this needs to drop the lock of course but it doesn't make it much more
complicated...

> > You end up having to update a lot more and it's not like grabbing lock
> > first is substantially better in any way, so why bother?
>
> Yes the allocation can sleep if we are short on memory so this can
> potentially take long which is not entirely nice but a pointless
> allocation is not nice either. Anyway I am asking because I am trying to
> understand what is the motivation behind and your explanation about the
> error exit path doesn't make much sense to me. So I am either missing
> something or we are talking about two different things.

--
Michal Hocko
SUSE Labs

2012-11-01 15:43:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

Hey, Michal.

On Thu, Nov 01, 2012 at 04:15:56PM +0100, Michal Hocko wrote:
> > if (!cgroup_lock_live_group(parent))
> > return -ENODEV;
> >
> > cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
> > if (!cgrp)
> > return -ENOMEM;
>
> this needs to drop the lock of course but it doesn't make it much more
> complicated...

Yeah, now look at the error exit path of the function. You have to
reorder them too so that it matches the changed order above, which is
all fine and dandy, but one way is not necessarily better than the
other, so there's no good reason to incur that churn.

Thanks.

--
tejun

2012-11-02 09:23:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 1/8] cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs

(2012/10/31 13:22), Tejun Heo wrote:
> 2ef37d3fe4 ("memcg: Simplify mem_cgroup_force_empty_list error
> handling") removed the last user of __DEPRECATED_clear_css_refs. This
> patch removes __DEPRECATED_clear_css_refs and mechanisms to support
> it.
>
> * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
>
> * ->pre_destroy() now can only fail if a new task is attached or child
> cgroup is created while ->pre_destroy()s are being called. As the
> condition is checked again after re-acquiring cgroup_mutex
> afterwards, we don't need to take any immediate action on
> ->pre_destroy() failures. This reduces cgroup_call_pre_destroy() to
> a simple loop surrounding ->pre_destory(). Remove
> cgroup_call_pre_destroy() and open-code the loop into
> cgroup_rmdir().
>
> * cgroup_clear_css_refs() can no longer fail. All that needs to be
> done are deactivating refcnts, setting CSS_REMOVED and putting the
> base reference on each css. Remove cgroup_clear_css_refs() and the
> failure path, and open-code the loops into cgroup_rmdir().
>
> Note that cgroup_rmdir() will see more cleanup soon.
>
> Signed-off-by: Tejun Heo <[email protected]>

I thank you and Michal for this work.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>



2012-11-02 09:30:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/8] cgroup: kill CSS_REMOVED

(2012/10/31 13:22), Tejun Heo wrote:
> CSS_REMOVED is one of the several contortions which were necessary to
> support css reference draining on cgroup removal. All css->refcnts
> which need draining should be deactivated and verified to equal zero
> atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts
> needed to be re-activated and css_tryget() shouldn't fail in the
> process.
>
> This was achieved by letting css_tryget() busy-loop until either the
> refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
> (committing to removal).
>
> Now that css refcnt draining is no longer used, there's no need for
> atomic rollback mechanism. css_tryget() simply can look at the
> reference count and fail if the it's deactivated - it's never getting
> re-activated.
>
> This patch removes CSS_REMOVED and updates __css_tryget() to fail if
> the refcnt is deactivated.
>
> Note that this removes css_is_removed() whose only user is VM_BUG_ON()
> in memcontrol.c. We can replace it with a check on the refcnt but
> given that the only use case is a debug assert, I think it's better to
> simply unexport it.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>


2012-11-02 09:37:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 3/8] cgroup: use cgroup_lock_live_group(parent) in cgroup_create()

(2012/10/31 13:22), Tejun Heo wrote:
> This patch makes cgroup_create() fail if @parent is marked removed.
> This is to prepare for further updates to cgroup_rmdir() path.
>
> Note that this change isn't strictly necessary. cgroup can only be
> created via mkdir and the removed marking and dentry removal happen
> without releasing cgroup_mutex, so cgroup_create() can never race with
> cgroup_rmdir(). Even after the scheduled updates to cgroup_rmdir(),
> cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
> rendering the added liveliness check unnecessary.
>
> Do it anyway such that locking is contained inside cgroup proper and
> we don't get nasty surprises if we ever grow another caller of
> cgroup_create().
>
> Signed-off-by: Tejun Heo <[email protected]>

I welcome this change.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

> ---
> kernel/cgroup.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index a49cdbc..b3010ae 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -3906,6 +3906,18 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> if (!cgrp)
> return -ENOMEM;
>
> + /*
> + * Only live parents can have children. Note that the liveliness
> + * check isn't strictly necessary because cgroup_mkdir() and
> + * cgroup_rmdir() are fully synchronized by i_mutex; however, do it
> + * anyway so that locking is contained inside cgroup proper and we
> + * don't get nasty surprises if we ever grow another caller.
> + */
> + if (!cgroup_lock_live_group(parent)) {
> + err = -ENODEV;
> + goto err_free;
> + }
> +
> /* Grab a reference on the superblock so the hierarchy doesn't
> * get deleted on unmount if there are child cgroups. This
> * can be done outside cgroup_mutex, since the sb can't
> @@ -3913,8 +3925,6 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
> * fs */
> atomic_inc(&sb->s_active);
>
> - mutex_lock(&cgroup_mutex);
> -
> init_cgroup_housekeeping(cgrp);
>
> cgrp->parent = parent;
> @@ -3985,7 +3995,7 @@ static long cgroup_create(struct cgroup *parent, struct dentry *dentry,
>
> /* Release the reference count that we took on the superblock */
> deactivate_super(sb);
> -
> +err_free:
> kfree(cgrp);
> return err;
> }
>

2012-11-02 09:44:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/8] cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy()

(2012/10/31 13:22), Tejun Heo wrote:
> Because ->pre_destroy() could fail and can't be called under
> cgroup_mutex, cgroup destruction did something very ugly.
>
> 1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
>
> 2. Release cgroup_mutex and call ->pre_destroy().
>
> 3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
> otherwise.
>
> 4. Continue destroying.
>
> In addition to being ugly, it has been always broken in various ways.
> For example, memcg ->pre_destroy() expects the cgroup to be inactive
> after it's done but tasks can be attached and detached between #2 and
> #3 and the conditions that memcg verified in ->pre_destroy() might no
> longer hold by the time control reaches #3.
>
> Now that ->pre_destroy() is no longer allowed to fail. We can switch
> to the following.
>
> 1. Grab cgroup_mutex and fail if it can't be destroyed; fail
> otherwise.
>
> 2. Deactivate CSS's and mark the cgroup removed thus preventing any
> further operations which can invalidate the verification from #1.
>
> 3. Release cgroup_mutex and call ->pre_destroy().
>
> 4. Re-grab cgroup_mutex and continue destroying.
>
> After this change, controllers can safely assume that ->pre_destroy()
> will only be called only once for a given cgroup and, once
> ->pre_destroy() is called, the cgroup will stay dormant till it's
> destroyed.
>
> Signed-off-by: Tejun Heo <[email protected]>


Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Ok, cgroup_lock_live_group() will work synchronously against attach_task().
I welcome this.

2012-11-02 09:53:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 5/8] cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir()

(2012/10/31 13:22), Tejun Heo wrote:
> CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
> destruction rollback somewhat working. cgroup_rmdir() used to drain
> CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
> helpers were used to allow the task performing rmdir to wait for the
> next relevant event.
>
> Unfortunately, the wait is visible to controllers too and the
> mechanism got exposed to memcg by 887032670d ("cgroup avoid permanent
> sleep at rmdir").
>
> Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
> unnecessary. Remove it and all the mechanisms supporting it. Note
> that memcontrol.c changes are essentially revert of 887032670d
> ("cgroup avoid permanent sleep at rmdir").
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>

Hmm, ok...



> ---
> include/linux/cgroup.h | 21 ---------------------
> kernel/cgroup.c | 51 --------------------------------------------------
> mm/memcontrol.c | 24 +-----------------------
> 3 files changed, 1 insertion(+), 95 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index a309804..47868a8 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -145,10 +145,6 @@ enum {
> /* Control Group requires release notifications to userspace */
> CGRP_NOTIFY_ON_RELEASE,
> /*
> - * A thread in rmdir() is wating for this cgroup.
> - */
> - CGRP_WAIT_ON_RMDIR,
> - /*
> * Clone cgroup values when creating a new child cgroup
> */
> CGRP_CLONE_CHILDREN,
> @@ -412,23 +408,6 @@ int cgroup_task_count(const struct cgroup *cgrp);
> int cgroup_is_descendant(const struct cgroup *cgrp, struct task_struct *task);
>
> /*
> - * When the subsys has to access css and may add permanent refcnt to css,
> - * it should take care of racy conditions with rmdir(). Following set of
> - * functions, is for stop/restart rmdir if necessary.
> - * Because these will call css_get/put, "css" should be alive css.
> - *
> - * cgroup_exclude_rmdir();
> - * ...do some jobs which may access arbitrary empty cgroup
> - * cgroup_release_and_wakeup_rmdir();
> - *
> - * When someone removes a cgroup while cgroup_exclude_rmdir() holds it,
> - * it sleeps and cgroup_release_and_wakeup_rmdir() will wake him up.
> - */
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css);
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css);
> -
> -/*
> * Control Group taskset, used to pass around set of tasks to cgroup_subsys
> * methods.
> */
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 66204a6..c5f6fb2 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -966,33 +966,6 @@ static void cgroup_d_remove_dir(struct dentry *dentry)
> }
>
> /*
> - * A queue for waiters to do rmdir() cgroup. A tasks will sleep when
> - * cgroup->count == 0 && list_empty(&cgroup->children) && subsys has some
> - * reference to css->refcnt. In general, this refcnt is expected to goes down
> - * to zero, soon.
> - *
> - * CGRP_WAIT_ON_RMDIR flag is set under cgroup's inode->i_mutex;
> - */
> -static DECLARE_WAIT_QUEUE_HEAD(cgroup_rmdir_waitq);
> -
> -static void cgroup_wakeup_rmdir_waiter(struct cgroup *cgrp)
> -{
> - if (unlikely(test_and_clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags)))
> - wake_up_all(&cgroup_rmdir_waitq);
> -}
> -
> -void cgroup_exclude_rmdir(struct cgroup_subsys_state *css)
> -{
> - css_get(css);
> -}
> -
> -void cgroup_release_and_wakeup_rmdir(struct cgroup_subsys_state *css)
> -{
> - cgroup_wakeup_rmdir_waiter(css->cgroup);
> - css_put(css);
> -}
> -
> -/*
> * Call with cgroup_mutex held. Drops reference counts on modules, including
> * any duplicate ones that parse_cgroupfs_options took. If this function
> * returns an error, no reference counts are touched.
> @@ -1963,12 +1936,6 @@ int cgroup_attach_task(struct cgroup *cgrp, struct task_struct *tsk)
> }
>
> synchronize_rcu();
> -
> - /*
> - * wake up rmdir() waiter. the rmdir should fail since the cgroup
> - * is no longer empty.
> - */
> - cgroup_wakeup_rmdir_waiter(cgrp);
> out:
> if (retval) {
> for_each_subsys(root, ss) {
> @@ -2138,7 +2105,6 @@ static int cgroup_attach_proc(struct cgroup *cgrp, struct task_struct *leader)
> * step 5: success! and cleanup
> */
> synchronize_rcu();
> - cgroup_wakeup_rmdir_waiter(cgrp);
> retval = 0;
> out_put_css_set_refs:
> if (retval) {
> @@ -4058,26 +4024,13 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> struct cgroup_event *event, *tmp;
> struct cgroup_subsys *ss;
>
> - /*
> - * In general, subsystem has no css->refcnt after pre_destroy(). But
> - * in racy cases, subsystem may have to get css->refcnt after
> - * pre_destroy() and it makes rmdir return with -EBUSY. This sometimes
> - * make rmdir return -EBUSY too often. To avoid that, we use waitqueue
> - * for cgroup's rmdir. CGRP_WAIT_ON_RMDIR is for synchronizing rmdir
> - * and subsystem's reference count handling. Please see css_get/put
> - * and css_tryget() and cgroup_wakeup_rmdir_waiter() implementation.
> - */
> - set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -
> /* the vfs holds both inode->i_mutex already */
> mutex_lock(&cgroup_mutex);
> parent = cgrp->parent;
> if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> mutex_unlock(&cgroup_mutex);
> return -EBUSY;
> }
> - prepare_to_wait(&cgroup_rmdir_waitq, &wait, TASK_INTERRUPTIBLE);
>
> /*
> * Block new css_tryget() by deactivating refcnt and mark @cgrp
> @@ -4114,9 +4067,6 @@ static int cgroup_rmdir(struct inode *unused_dir, struct dentry *dentry)
> for_each_subsys(cgrp->root, ss)
> css_put(cgrp->subsys[ss->subsys_id]);
>
> - finish_wait(&cgroup_rmdir_waitq, &wait);
> - clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
> -
> raw_spin_lock(&release_list_lock);
> if (!list_empty(&cgrp->release_list))
> list_del_init(&cgrp->release_list);
> @@ -4864,7 +4814,6 @@ void __css_put(struct cgroup_subsys_state *css)
> set_bit(CGRP_RELEASABLE, &cgrp->flags);
> check_for_release(cgrp);
> }
> - cgroup_wakeup_rmdir_waiter(cgrp);
> break;
> case 0:
> schedule_work(&css->dput_work);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6f8bd9d..1033b2b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2681,13 +2681,6 @@ static int mem_cgroup_move_account(struct page *page,
> /* caller should have done css_get */
> pc->mem_cgroup = to;
> mem_cgroup_charge_statistics(to, anon, nr_pages);
> - /*
> - * We charges against "to" which may not have any tasks. Then, "to"
> - * can be under rmdir(). But in current implementation, caller of
> - * this function is just force_empty() and move charge, so it's
> - * guaranteed that "to" is never removed. So, we don't check rmdir
> - * status here.
> - */
> move_unlock_mem_cgroup(from, &flags);
> ret = 0;
> unlock:
> @@ -2893,7 +2886,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
> return;
> if (!memcg)
> return;
> - cgroup_exclude_rmdir(&memcg->css);

I'm sorry if I misunderstand...I think we need css_tryget() instead, here.
Following code will set pc->mem_cgroup to be this lost memcg. It's not good because
we assume there are no page_cgroup pointing this memcg after pre_desroy().

I think, at failure, we need to avoid set pc->mem_cgroup to this memcg.
Hmm, set pc->mem_cgroup as root_mem_cgroup should work.

>
> __mem_cgroup_commit_charge(memcg, page, 1, ctype, true);
> /*
> @@ -2907,12 +2899,6 @@ __mem_cgroup_commit_charge_swapin(struct page *page, struct mem_cgroup *memcg,
> swp_entry_t ent = {.val = page_private(page)};
> mem_cgroup_uncharge_swap(ent);
> }
> - /*
> - * At swapin, we may charge account against cgroup which has no tasks.
> - * So, rmdir()->pre_destroy() can be called while we do this charge.
> - * In that case, we need to call pre_destroy() again. check it here.
> - */
> - cgroup_release_and_wakeup_rmdir(&memcg->css);

css_put() ?


> }
>
> void mem_cgroup_commit_charge_swapin(struct page *page,
> @@ -3360,8 +3346,7 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,
>
> if (!memcg)
> return;
> - /* blocks rmdir() */
> - cgroup_exclude_rmdir(&memcg->css);

css_tryget()

> +
> if (!migration_ok) {
> used = oldpage;
> unused = newpage;
> @@ -3395,13 +3380,6 @@ void mem_cgroup_end_migration(struct mem_cgroup *memcg,
> */
> if (anon)
> mem_cgroup_uncharge_page(used);
> - /*
> - * At migration, we may charge account against cgroup which has no
> - * tasks.
> - * So, rmdir()->pre_destroy() can be called while we do this charge.
> - * In that case, we need to call pre_destroy() again. check it here.
> - */
> - cgroup_release_and_wakeup_rmdir(&memcg->css);

css_put()..

Thanks,
-Kame

2012-11-02 09:55:09

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg: make mem_cgroup_reparent_charges non failing

(2012/10/31 13:22), Tejun Heo wrote:
> From: Michal Hocko <[email protected]>
>
> Now that pre_destroy callbacks are called from the context where neither
> any task can attach the group nor any children group can be added there
> is no other way to fail from mem_cgroup_pre_destroy.
> mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css
> because all css' are marked dead already.
>
> tj: Remove now unused local variable @cgrp from
> mem_cgroup_reparent_charges().
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Glauber Costa <[email protected]>
> Signed-off-by: Tejun Heo <[email protected]>

+1
Acked-by: KAMEZAWA Hiroyuki <[email protected]>


> ---
> mm/memcontrol.c | 19 ++++++-------------
> 1 file changed, 6 insertions(+), 13 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1033b2b..47c4680 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3740,14 +3740,11 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
> *
> * Caller is responsible for holding css reference on the memcg.
> */
> -static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> +static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> {
> - struct cgroup *cgrp = memcg->css.cgroup;
> int node, zid;
>
> do {
> - if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> - return -EBUSY;
> /* This is for making all *used* pages to be on LRU. */
> lru_add_drain_all();
> drain_all_stock_sync(memcg);
> @@ -3773,8 +3770,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
> * charge before adding to the LRU.
> */
> } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
> -
> - return 0;
> }
>
> /*
> @@ -3811,7 +3806,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>
> }
> lru_add_drain();
> - return mem_cgroup_reparent_charges(memcg);
> + mem_cgroup_reparent_charges(memcg);
> +
> + return 0;
> }
>
> static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
> @@ -5008,13 +5005,9 @@ free_out:
> static int mem_cgroup_pre_destroy(struct cgroup *cont)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
> - int ret;
>
> - css_get(&memcg->css);
> - ret = mem_cgroup_reparent_charges(memcg);
> - css_put(&memcg->css);
> -
> - return ret;
> + mem_cgroup_reparent_charges(memcg);
> + return 0;
> }
>
> static void mem_cgroup_destroy(struct cgroup *cont)
>

2012-11-02 09:56:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: do not fail in hugetlb_cgroup_pre_destroy

(2012/10/31 13:22), Tejun Heo wrote:
> From: Michal Hocko <[email protected]>
>
> Now that pre_destroy callbacks are called from the context where neither
> any task can attach the group nor any children group can be added there
> is no other way to fail from hugetlb_pre_destroy.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
> Reviewed-by: Glauber Costa <[email protected]>
> Signed-off-by: Tejun Heo <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-02 09:57:36

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 8/8] cgroup: make ->pre_destroy() return void

(2012/10/31 13:22), Tejun Heo wrote:
> All ->pre_destory() implementations return 0 now, which is the only
> allowed return value. Make it return void.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Vivek Goyal <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>