This patchset follows several iterations of the memory controller hierarchy
patches: the hardwall approach by Kamezawa-San [1], and version 1 of this
patchset at [2].
The current approach is based on [2] and has the following properties:
1. Hierarchies are very natural in a filesystem like the cgroup filesystem.
Multi-level hierarchies have been supported in filesystems for a long time.
When the feature is turned on, we honor hierarchies, so that the root
accounts for the resource usage of all of its children and limits can be
set at any point in the hierarchy. Any memory cgroup is constrained by all
of the limits along its path to the root: the total usage of all children
of a node cannot exceed the limit of the node (see the sketch below).
2. The hierarchy feature is selectable and off by default.
3. Hierarchies are expensive, and the trade-off is depth versus performance.
Hierarchies can also be completely turned off.
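As a concrete illustration of property 1, here is a hedged sketch in plain,
userspace C (a toy model, not the kernel code): a charge succeeds only if
every counter up to the root stays under its limit, and a failed charge is
rolled back.

	#include <stdio.h>

	struct counter {
		long usage, limit;
		struct counter *parent;
	};

	static int charge(struct counter *c, long val)
	{
		struct counter *p, *u;

		for (p = c; p != NULL; p = p->parent) {
			if (p->usage + val > p->limit)
				goto undo;	/* this level would go over */
			p->usage += val;
		}
		return 0;
	undo:
		for (u = c; u != p; u = u->parent)
			u->usage -= val;	/* roll back partial charges */
		return -1;
	}

	int main(void)
	{
		struct counter root = { 0, 100, NULL };
		struct counter a = { 0, 80, &root };
		struct counter b = { 0, 80, &root };

		printf("%d\n", charge(&a, 60)); /* 0: root at 60/100, a at 60/80 */
		printf("%d\n", charge(&b, 60)); /* -1: b stays under 80, but root would hit 120/100 */
		return 0;
	}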
The patches are against 2.6.28-rc4 (mmotm 15th November) and were tested in
a KVM instance with SMP and swap turned on.
Signed-off-by: Balbir Singh <[email protected]>
v4..v3
======
Move to iteration instead of recursion in hierarchical reclaim. Thanks
go out to Kamezawa for pushing this.
Port the patches to 2.6.28-rc4 (mmotm 15th November).
v3..v2
======
1. Hierarchy selection logic now allows use_hierarchy changes only if the
parent's use_hierarchy is set to 0 and the cgroup has no children
2. last_scanned_child is protected by cgroup_lock()
3. cgroup_lock() is released before lru_add_drain_all() in
mem_cgroup_force_empty()
v2..v1
======
1. The hierarchy is now selectable per-subtree
2. The features file has been renamed to use_hierarchy
3. Reclaim now holds the cgroup lock and does a recursive walk and reclaim
Acknowledgements
----------------
Thanks for the feedback from Li Zefan, Kamezawa Hiroyuki, Paul Menage and
others.
Series
------
memcg-hierarchy-documentation.patch
resource-counters-hierarchy-support.patch
memcg-hierarchical-reclaim.patch
memcg-add-hierarchy-selector.patch
Reviews? Comments?
References
1. http://linux.derkeiler.com/Mailing-Lists/Kernel/2008-06/msg05417.html
2. http://kerneltrap.org/mailarchive/linux-kernel/2008/4/19/1513644/thread
--
Balbir
Documentation updates for hierarchy support
Signed-off-by: Balbir Singh <[email protected]>
---
Documentation/controllers/memory.txt | 38 ++++++++++++++++++++++++++++++++++-
1 file changed, 37 insertions(+), 1 deletion(-)
diff -puN Documentation/controllers/memory.txt~memcg-hierarchy-documentation Documentation/controllers/memory.txt
--- linux-2.6.28-rc4/Documentation/controllers/memory.txt~memcg-hierarchy-documentation 2008-11-16 13:14:39.000000000 +0530
+++ linux-2.6.28-rc4-balbir/Documentation/controllers/memory.txt 2008-11-16 13:14:39.000000000 +0530
@@ -289,8 +289,44 @@ will be charged as a new owner of it.
Because rmdir() moves all pages to parent, some out-of-use page caches can be
moved to the parent. If you want to avoid that, force_empty will be useful.
+6. Hierarchy support
-6. TODO
+The memory controller supports a deep hierarchy and hierarchical accounting.
+The hierarchy is built by creating the appropriate cgroups in the
+cgroup filesystem. Consider, for example, the following cgroup filesystem
+hierarchy
+
+ root
+ / | \
+ / | \
+ a b c
+ | \
+ | \
+ d e
+
+In the diagram above, with hierarchical accounting enabled, all memory
+usage of e is accounted to its ancestors up to the root (i.e., c and root)
+that have memory.use_hierarchy enabled. If one of the ancestors goes over
+its limit, the reclaim algorithm reclaims from the tasks in the ancestor
+and the children of the ancestor.
+
+6.1 Enabling hierarchical accounting and reclaim
+
+The memory controller by default disables the hierarchy feature. Support
+can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
+
+# echo 1 > memory.use_hierarchy
+
+The feature can be disabled by
+
+# echo 0 > memory.use_hierarchy
+
+NOTE1: Enabling/disabling will fail if the cgroup already has other
+cgroups created below it.
+
+NOTE2: This feature can be enabled/disabled per subtree.
+
+7. TODO
1. Add support for accounting huge pages (as a separate controller)
2. Make per-cgroup scanner reclaim not-shared pages first
_
--
Balbir
Add support for building hierarchies in resource counters. Cgroups allow us
to build a deep hierarchy, but we currently don't link the resource counters
belonging to the memory controller's control groups in the same fashion as
the corresponding cgroup entries in the cgroup hierarchy. This patch provides
the infrastructure for resource counters that mirror the hierarchy of their
cgroup counterparts.
This set of patches is based on the resource counter hierarchy patches posted
by Pavel Emelianov.
NOTE: Building hierarchies is expensive; a deeper hierarchy implies charging
all the way up to the root, so the user needs to be careful and aware of the
trade-offs before creating very deep ones.
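To illustrate how the new interface is meant to be used, here is a minimal
sketch of a hypothetical caller (the function and variable names are made up
for illustration; only the res_counter_init()/res_counter_charge()/
res_counter_uncharge() signatures come from this patch):

	#include <linux/mm.h>		/* PAGE_SIZE */
	#include <linux/res_counter.h>

	static struct res_counter parent_cnt, child_cnt;

	static int example_charge_page(void)
	{
		struct res_counter *fail;
		int ret;

		/* Link the child counter to its parent at init time */
		res_counter_init(&parent_cnt, NULL);
		res_counter_init(&child_cnt, &parent_cnt);

		/*
		 * Charging the child also charges every ancestor. On
		 * failure, fail points at the counter that hit its
		 * limit, so reclaim can be targeted at that subtree.
		 */
		ret = res_counter_charge(&child_cnt, PAGE_SIZE, &fail);
		if (ret)
			return ret;

		/* Uncharging walks the same parent chain */
		res_counter_uncharge(&child_cnt, PAGE_SIZE);
		return 0;
	}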
Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/res_counter.h | 8 ++++++--
kernel/res_counter.c | 42 ++++++++++++++++++++++++++++++++++--------
mm/memcontrol.c | 20 +++++++++++++-------
3 files changed, 53 insertions(+), 17 deletions(-)
diff -puN include/linux/res_counter.h~resource-counters-hierarchy-support include/linux/res_counter.h
--- linux-2.6.28-rc4/include/linux/res_counter.h~resource-counters-hierarchy-support 2008-11-16 13:14:43.000000000 +0530
+++ linux-2.6.28-rc4-balbir/include/linux/res_counter.h 2008-11-16 13:14:43.000000000 +0530
@@ -43,6 +43,10 @@ struct res_counter {
* the routines below consider this to be IRQ-safe
*/
spinlock_t lock;
+ /*
+	 * Parent counter, used for hierarchical resource accounting
+ */
+ struct res_counter *parent;
};
/**
@@ -87,7 +91,7 @@ enum {
* helpers for accounting
*/
-void res_counter_init(struct res_counter *counter);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent);
/*
* charge - try to consume more resource.
@@ -103,7 +107,7 @@ void res_counter_init(struct res_counter
int __must_check res_counter_charge_locked(struct res_counter *counter,
unsigned long val);
int __must_check res_counter_charge(struct res_counter *counter,
- unsigned long val);
+ unsigned long val, struct res_counter **limit_fail_at);
/*
* uncharge - tell that some portion of the resource is released
diff -puN kernel/res_counter.c~resource-counters-hierarchy-support kernel/res_counter.c
--- linux-2.6.28-rc4/kernel/res_counter.c~resource-counters-hierarchy-support 2008-11-16 13:14:43.000000000 +0530
+++ linux-2.6.28-rc4-balbir/kernel/res_counter.c 2008-11-16 13:14:43.000000000 +0530
@@ -15,10 +15,11 @@
#include <linux/uaccess.h>
#include <linux/mm.h>
-void res_counter_init(struct res_counter *counter)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent)
{
spin_lock_init(&counter->lock);
counter->limit = (unsigned long long)LLONG_MAX;
+ counter->parent = parent;
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
@@ -34,14 +35,34 @@ int res_counter_charge_locked(struct res
return 0;
}
-int res_counter_charge(struct res_counter *counter, unsigned long val)
+int res_counter_charge(struct res_counter *counter, unsigned long val,
+ struct res_counter **limit_fail_at)
{
int ret;
unsigned long flags;
+ struct res_counter *c, *u;
- spin_lock_irqsave(&counter->lock, flags);
- ret = res_counter_charge_locked(counter, val);
- spin_unlock_irqrestore(&counter->lock, flags);
+ *limit_fail_at = NULL;
+ local_irq_save(flags);
+ for (c = counter; c != NULL; c = c->parent) {
+ spin_lock(&c->lock);
+ ret = res_counter_charge_locked(c, val);
+ spin_unlock(&c->lock);
+ if (ret < 0) {
+ *limit_fail_at = c;
+ goto undo;
+ }
+ }
+ ret = 0;
+ goto done;
+undo:
+ for (u = counter; u != c; u = u->parent) {
+ spin_lock(&u->lock);
+ res_counter_uncharge_locked(u, val);
+ spin_unlock(&u->lock);
+ }
+done:
+ local_irq_restore(flags);
return ret;
}
@@ -56,10 +77,15 @@ void res_counter_uncharge_locked(struct
void res_counter_uncharge(struct res_counter *counter, unsigned long val)
{
unsigned long flags;
+ struct res_counter *c;
- spin_lock_irqsave(&counter->lock, flags);
- res_counter_uncharge_locked(counter, val);
- spin_unlock_irqrestore(&counter->lock, flags);
+ local_irq_save(flags);
+ for (c = counter; c != NULL; c = c->parent) {
+ spin_lock(&c->lock);
+ res_counter_uncharge_locked(c, val);
+ spin_unlock(&c->lock);
+ }
+ local_irq_restore(flags);
}
diff -puN mm/memcontrol.c~resource-counters-hierarchy-support mm/memcontrol.c
--- linux-2.6.28-rc4/mm/memcontrol.c~resource-counters-hierarchy-support 2008-11-16 13:14:43.000000000 +0530
+++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:14:43.000000000 +0530
@@ -470,6 +470,7 @@ static int __mem_cgroup_try_charge(struc
{
struct mem_cgroup *mem;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ struct res_counter *fail_res;
/*
* We always charge the cgroup the mm_struct belongs to.
* The mm_struct's mem_cgroup changes on task migration if the
@@ -498,11 +499,12 @@ static int __mem_cgroup_try_charge(struc
int ret;
bool noswap = false;
- ret = res_counter_charge(&mem->res, PAGE_SIZE);
+ ret = res_counter_charge(&mem->res, PAGE_SIZE, &fail_res);
if (likely(!ret)) {
if (!do_swap_account)
break;
- ret = res_counter_charge(&mem->memsw, PAGE_SIZE);
+ ret = res_counter_charge(&mem->memsw, PAGE_SIZE,
+ &fail_res);
if (likely(!ret))
break;
/* mem+swap counter fails */
@@ -1687,22 +1689,26 @@ static void __init enable_swap_cgroup(vo
static struct cgroup_subsys_state *
mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *parent;
int node;
mem = mem_cgroup_alloc();
if (!mem)
return ERR_PTR(-ENOMEM);
- res_counter_init(&mem->res);
- res_counter_init(&mem->memsw);
-
for_each_node_state(node, N_POSSIBLE)
if (alloc_mem_cgroup_per_zone_info(mem, node))
goto free_out;
/* root ? */
- if (cont->parent == NULL)
+ if (cont->parent == NULL) {
enable_swap_cgroup();
+ parent = NULL;
+ } else
+ parent = mem_cgroup_from_cont(cont->parent);
+
+ res_counter_init(&mem->res, parent ? &parent->res : NULL);
+ res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
+
return &mem->css;
free_out:
_
--
Balbir
This patch introduces hierarchical reclaim. When a charge cannot be
satisfied because an ancestor is over its limit, the charging routine
identifies that ancestor. The reclaim process then starts from the last
scanned child of the ancestor and reclaims until the ancestor goes back
below its limit.
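The reclaim walk is iterative rather than recursive (see the v4 changelog).
To illustrate just the successor logic, here is a hedged sketch over a toy
tree in plain C (the struct and function names are hypothetical, not the
kernel's):

	struct node {
		struct node *parent;
		struct node *first_child;
		struct node *next_sibling;
	};

	/* Return the DFS successor of curr within the subtree rooted at root */
	static struct node *dfs_next(struct node *curr, struct node *root)
	{
		/* First, walk down to the children */
		if (curr->first_child)
			return curr->first_child;

		/* Otherwise, climb until a sibling is found */
		while (curr != root) {
			if (curr->next_sibling)
				return curr->next_sibling;
			curr = curr->parent;
		}

		/* Wrapped around the whole subtree: restart at the root */
		return root;
	}

mem_cgroup_get_next_node() below performs the same walk over the cgroup
children/sibling lists, with reference counting and the last_scanned_child
cache layered on top.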
Signed-off-by: Balbir Singh <[email protected]>
---
mm/memcontrol.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 167 insertions(+), 3 deletions(-)
diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
--- linux-2.6.28-rc4/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-16 13:17:33.000000000 +0530
+++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:17:33.000000000 +0530
@@ -142,6 +142,13 @@ struct mem_cgroup {
struct mem_cgroup_lru_info info;
int prev_priority; /* for recording reclaim priority */
+
+ /*
+	 * While reclaiming in a hierarchy, we cache the last child we
+ * reclaimed from. Protected by cgroup_lock()
+ */
+ struct mem_cgroup *last_scanned_child;
+
int obsolete;
atomic_t refcnt;
/*
@@ -460,6 +467,153 @@ unsigned long mem_cgroup_isolate_pages(u
return nr_taken;
}
+static struct mem_cgroup *
+mem_cgroup_from_res_counter(struct res_counter *counter)
+{
+ return container_of(counter, struct mem_cgroup, res);
+}
+
+/*
+ * This routine finds the DFS walk successor. This routine should be
+ * called with cgroup_mutex held
+ */
+static struct mem_cgroup *
+mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
+{
+ struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
+
+ curr_cgroup = curr->css.cgroup;
+ root_cgroup = root_mem->css.cgroup;
+
+ if (!list_empty(&curr_cgroup->children)) {
+ /*
+ * Walk down to children
+ */
+ mem_cgroup_put(curr);
+ cgroup = list_entry(curr_cgroup->children.next,
+ struct cgroup, sibling);
+ curr = mem_cgroup_from_cont(cgroup);
+ mem_cgroup_get(curr);
+ goto done;
+ }
+
+visit_parent:
+ if (curr_cgroup == root_cgroup) {
+ mem_cgroup_put(curr);
+ curr = root_mem;
+ mem_cgroup_get(curr);
+ goto done;
+ }
+
+ /*
+ * Goto next sibling
+ */
+ if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
+ mem_cgroup_put(curr);
+ cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
+ sibling);
+ curr = mem_cgroup_from_cont(cgroup);
+ mem_cgroup_get(curr);
+ goto done;
+ }
+
+ /*
+ * Go up to next parent and next parent's sibling if need be
+ */
+ curr_cgroup = curr_cgroup->parent;
+ goto visit_parent;
+
+done:
+ root_mem->last_scanned_child = curr;
+ return curr;
+}
+
+/*
+ * Visit the first child (need not be the first child as per the ordering
+ * of the cgroup list, since we track last_scanned_child) of @mem and use
+ * that to reclaim free pages from.
+ */
+static struct mem_cgroup *
+mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
+{
+ struct cgroup *cgroup;
+ struct mem_cgroup *ret;
+ bool obsolete = (root_mem->last_scanned_child &&
+ root_mem->last_scanned_child->obsolete);
+
+ /*
+ * Scan all children under the mem_cgroup mem
+ */
+ cgroup_lock();
+ if (list_empty(&root_mem->css.cgroup->children)) {
+ ret = root_mem;
+ goto done;
+ }
+
+ if (!root_mem->last_scanned_child || obsolete) {
+
+ if (obsolete)
+ mem_cgroup_put(root_mem->last_scanned_child);
+
+ cgroup = list_first_entry(&root_mem->css.cgroup->children,
+ struct cgroup, sibling);
+ ret = mem_cgroup_from_cont(cgroup);
+ mem_cgroup_get(ret);
+ } else
+ ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
+ root_mem);
+
+done:
+ root_mem->last_scanned_child = ret;
+ cgroup_unlock();
+ return ret;
+}
+
+/*
+ * Dance down the hierarchy if needed to reclaim memory. We remember the
+ * last child we reclaimed from, so that we don't end up penalizing
+ * one child extensively based on its position in the children list.
+ *
+ * root_mem is the original ancestor that we've been reclaiming from.
+ */
+static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ gfp_t gfp_mask, bool noswap)
+{
+ struct mem_cgroup *next_mem;
+ int ret = 0;
+
+ /*
+ * Reclaim unconditionally and don't check for return value.
+ * We need to reclaim in the current group and down the tree.
+ * One might think about checking for children before reclaiming,
+ * but there might be left over accounting, even after children
+ * have left.
+ */
+ ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
+ if (res_counter_check_under_limit(&root_mem->res))
+ return 0;
+
+ next_mem = mem_cgroup_get_first_node(root_mem);
+
+ while (next_mem != root_mem) {
+ if (next_mem->obsolete) {
+ mem_cgroup_put(next_mem);
+ cgroup_lock();
+ next_mem = mem_cgroup_get_first_node(root_mem);
+ cgroup_unlock();
+ continue;
+ }
+ ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap);
+ if (res_counter_check_under_limit(&root_mem->res)) {
+ return 0;
+ }
+ cgroup_lock();
+ next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
+ cgroup_unlock();
+ }
+ return ret;
+}
+
/*
* Unlike exported interface, "oom" parameter is added. if oom==true,
* oom-killer can be invoked.
@@ -468,7 +622,7 @@ static int __mem_cgroup_try_charge(struc
gfp_t gfp_mask, struct mem_cgroup **memcg,
bool oom)
{
- struct mem_cgroup *mem;
+ struct mem_cgroup *mem, *mem_over_limit;
int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct res_counter *fail_res;
/*
@@ -514,8 +668,16 @@ static int __mem_cgroup_try_charge(struc
if (!(gfp_mask & __GFP_WAIT))
goto nomem;
- if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
- continue;
+ /*
+ * Is one of our ancestors over their limit?
+ */
+ if (fail_res)
+ mem_over_limit = mem_cgroup_from_res_counter(fail_res);
+ else
+ mem_over_limit = mem;
+
+ ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
+ noswap);
/*
* try_to_free_mem_cgroup_pages() might not give us a full
@@ -1710,6 +1872,8 @@ mem_cgroup_create(struct cgroup_subsys *
res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
+ mem->last_scanned_child = NULL;
+
return &mem->css;
free_out:
for_each_node_state(node, N_POSSIBLE)
_
--
Balbir
Don't enable hierarchy support by default. This patch introduces a
use_hierarchy flag that can be set to enable hierarchical accounting and
reclaim for a subtree. The feature can only be enabled when the cgroup for
which it is being enabled has no children.
Signed-off-by: Balbir Singh <[email protected]>
---
mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 57 insertions(+), 4 deletions(-)
diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
--- linux-2.6.28-rc4/mm/memcontrol.c~memcg-add-hierarchy-selector 2008-11-16 13:19:33.000000000 +0530
+++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:19:33.000000000 +0530
@@ -148,6 +148,10 @@ struct mem_cgroup {
* reclaimed from. Protected by cgroup_lock()
*/
struct mem_cgroup *last_scanned_child;
+ /*
+ * Should the accounting and control be hierarchical, per subtree?
+ */
+ unsigned long use_hierarchy;
int obsolete;
atomic_t refcnt;
@@ -1527,6 +1531,44 @@ int mem_cgroup_force_empty_write(struct
}
+static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
+{
+ return mem_cgroup_from_cont(cont)->use_hierarchy;
+}
+
+static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
+ u64 val)
+{
+ int retval = 0;
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ struct cgroup *parent = cont->parent;
+ struct mem_cgroup *parent_mem = NULL;
+
+ if (parent)
+ parent_mem = mem_cgroup_from_cont(parent);
+
+ cgroup_lock();
+ /*
+	 * If parent's use_hierarchy is set, we can't make any modifications
+ * in the child subtrees. If it is unset, then the change can
+ * occur, provided the current cgroup has no children.
+ *
+ * For the root cgroup, parent_mem is NULL, we allow value to be
+ * set if there are no children.
+ */
+ if (!parent_mem || (!parent_mem->use_hierarchy &&
+ (val == 1 || val == 0))) {
+ if (list_empty(&cont->children))
+ mem->use_hierarchy = val;
+ else
+ retval = -EBUSY;
+ } else
+ retval = -EINVAL;
+ cgroup_unlock();
+
+ return retval;
+}
+
static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
{
struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
@@ -1690,6 +1732,11 @@ static struct cftype mem_cgroup_files[]
.name = "force_empty",
.trigger = mem_cgroup_force_empty_write,
},
+ {
+ .name = "use_hierarchy",
+ .write_u64 = mem_cgroup_hierarchy_write,
+ .read_u64 = mem_cgroup_hierarchy_read,
+ },
};
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
@@ -1865,12 +1912,18 @@ mem_cgroup_create(struct cgroup_subsys *
if (cont->parent == NULL) {
enable_swap_cgroup();
parent = NULL;
- } else
+ } else {
parent = mem_cgroup_from_cont(cont->parent);
+ mem->use_hierarchy = parent->use_hierarchy;
+ }
- res_counter_init(&mem->res, parent ? &parent->res : NULL);
- res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
-
+ if (parent && parent->use_hierarchy) {
+ res_counter_init(&mem->res, &parent->res);
+ res_counter_init(&mem->memsw, &parent->memsw);
+ } else {
+ res_counter_init(&mem->res, NULL);
+ res_counter_init(&mem->memsw, NULL);
+ }
mem->last_scanned_child = NULL;
_
--
Balbir
> +6.1 Enabling hierarchical accounting and reclaim
> +
> +The memory controller by default disables the hierarchy feature. Support
> +can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
> +
> +# echo 1 > memory.use_hierarchy
> +
> +The feature can be disabled by
> +
> +# echo 0 > memory.use_hierarchy
> +
> +NOTE1: Enabling/disabling will fail if the cgroup already has other
> +cgroups created below it.
> +
It's better to also document that it will fail if its parent's use_hierarchy
is already enabled.
> +NOTE2: This feature can be enabled/disabled per subtree.
> +
> +7. TODO
>
> 1. Add support for accounting huge pages (as a separate controller)
> 2. Make per-cgroup scanner reclaim not-shared pages first
> _
>
Li Zefan wrote:
>> +6.1 Enabling hierarchical accounting and reclaim
>> +
>> +The memory controller by default disables the hierarchy feature. Support
>> +can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup
>> +
>> +# echo 1 > memory.use_hierarchy
>> +
>> +The feature can be disabled by
>> +
>> +# echo 0 > memory.use_hierarchy
>> +
>> +NOTE1: Enabling/disabling will fail if the cgroup already has other
>> +cgroups created below it.
>> +
>
> It's better to also document that it will fail if its parent's use_hierarchy
> is already enabled.
>
Good point, will update the documentation.
--
Balbir
> + /*
> +	 * If parent's use_hierarchy is set, we can't make any modifications
> + * in the child subtrees. If it is unset, then the change can
> + * occur, provided the current cgroup has no children.
> + *
> + * For the root cgroup, parent_mem is NULL, we allow value to be
> + * set if there are no children.
> + */
> + if (!parent_mem || (!parent_mem->use_hierarchy &&
> + (val == 1 || val == 0))) {
Should be:
if ((!parent_mem || !parent_mem->use_hierarchy) &&
(val == 1 || val == 0)) {
> + if (list_empty(&cont->children))
> + mem->use_hierarchy = val;
> + else
> + retval = -EBUSY;
> + } else
> + retval = -EINVAL;
> + cgroup_unlock();
> +
> + return retval;
> +}
Li Zefan wrote:
>> + /*
>> +	 * If parent's use_hierarchy is set, we can't make any modifications
>> + * in the child subtrees. If it is unset, then the change can
>> + * occur, provided the current cgroup has no children.
>> + *
>> + * For the root cgroup, parent_mem is NULL, we allow value to be
>> + * set if there are no children.
>> + */
>> + if (!parent_mem || (!parent_mem->use_hierarchy &&
>> + (val == 1 || val == 0))) {
>
> Should be:
>
> if ((!parent_mem || !parent_mem->use_hierarchy) &&
> (val == 1 || val == 0)) {
Yes, we need to validate values for the root cgroup as well. Thanks for the
comments!
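For reference, the check with this fix applied would look as follows (a
sketch of the intended logic, untested):

	if ((!parent_mem || !parent_mem->use_hierarchy) &&
	    (val == 1 || val == 0)) {
		if (list_empty(&cont->children))
			mem->use_hierarchy = val;
		else
			retval = -EBUSY;
	} else
		retval = -EINVAL;

That way the 0/1 validation also applies to the root cgroup, where
parent_mem is NULL.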
--
Balbir
On Sun, 16 Nov 2008 13:41:05 +0530
Balbir Singh <[email protected]> wrote:
>
> Don't enable hierarchy support by default. This patch introduces a
> use_hierarchy flag that can be set to enable hierarchical accounting and
> reclaim for a subtree. The feature can only be enabled when the cgroup for
> which it is being enabled has no children.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 57 insertions(+), 4 deletions(-)
>
> diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
> --- linux-2.6.28-rc4/mm/memcontrol.c~memcg-add-hierarchy-selector 2008-11-16 13:19:33.000000000 +0530
> +++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:19:33.000000000 +0530
> @@ -148,6 +148,10 @@ struct mem_cgroup {
> * reclaimed from. Protected by cgroup_lock()
> */
> struct mem_cgroup *last_scanned_child;
> + /*
> + * Should the accounting and control be hierarchical, per subtree?
> + */
> + unsigned long use_hierarchy;
This field is a boolean, but it is declared as an unsigned long and is
accessed from userspace via an API which returns a u64. This all seems
ripe for a cleanup..
> int obsolete;
> atomic_t refcnt;
> @@ -1527,6 +1531,44 @@ int mem_cgroup_force_empty_write(struct
> }
>
>
> +static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
> +{
> + return mem_cgroup_from_cont(cont)->use_hierarchy;
> +}
> +
> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
> + u64 val)
> +{
> + int retval = 0;
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> + struct cgroup *parent = cont->parent;
> + struct mem_cgroup *parent_mem = NULL;
> +
> + if (parent)
> + parent_mem = mem_cgroup_from_cont(parent);
> +
> + cgroup_lock();
> + /*
> +	 * If parent's use_hierarchy is set, we can't make any modifications
> + * in the child subtrees. If it is unset, then the change can
> + * occur, provided the current cgroup has no children.
> + *
> + * For the root cgroup, parent_mem is NULL, we allow value to be
> + * set if there are no children.
> + */
> + if (!parent_mem || (!parent_mem->use_hierarchy &&
> + (val == 1 || val == 0))) {
One part of this test permits any value, but the other part restricts
values to 0 or 1.
> + if (list_empty(&cont->children))
> + mem->use_hierarchy = val;
> + else
> + retval = -EBUSY;
> + } else
> + retval = -EINVAL;
> + cgroup_unlock();
> +
> + return retval;
> +}
> +
> static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
> {
> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
> @@ -1690,6 +1732,11 @@ static struct cftype mem_cgroup_files[]
> .name = "force_empty",
> .trigger = mem_cgroup_force_empty_write,
> },
> + {
> + .name = "use_hierarchy",
> + .write_u64 = mem_cgroup_hierarchy_write,
> + .read_u64 = mem_cgroup_hierarchy_read,
> + },
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -1865,12 +1912,18 @@ mem_cgroup_create(struct cgroup_subsys *
> if (cont->parent == NULL) {
> enable_swap_cgroup();
> parent = NULL;
> - } else
> + } else {
> parent = mem_cgroup_from_cont(cont->parent);
> + mem->use_hierarchy = parent->use_hierarchy;
> + }
>
> - res_counter_init(&mem->res, parent ? &parent->res : NULL);
> - res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
> -
> + if (parent && parent->use_hierarchy) {
> + res_counter_init(&mem->res, &parent->res);
> + res_counter_init(&mem->memsw, &parent->memsw);
> + } else {
> + res_counter_init(&mem->res, NULL);
> + res_counter_init(&mem->memsw, NULL);
> + }
Andrew Morton wrote:
> On Sun, 16 Nov 2008 13:41:05 +0530
> Balbir Singh <[email protected]> wrote:
>
>> Don't enable hierarchy support by default. This patch introduces a
>> use_hierarchy flag that can be set to enable hierarchical accounting and
>> reclaim for a subtree. The feature can only be enabled when the cgroup for
>> which it is being enabled has no children.
>>
>> Signed-off-by: Balbir Singh <[email protected]>
>> ---
>>
>> mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 57 insertions(+), 4 deletions(-)
>>
>> diff -puN mm/memcontrol.c~memcg-add-hierarchy-selector mm/memcontrol.c
>> --- linux-2.6.28-rc4/mm/memcontrol.c~memcg-add-hierarchy-selector 2008-11-16 13:19:33.000000000 +0530
>> +++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:19:33.000000000 +0530
>> @@ -148,6 +148,10 @@ struct mem_cgroup {
>> * reclaimed from. Protected by cgroup_lock()
>> */
>> struct mem_cgroup *last_scanned_child;
>> + /*
>> + * Should the accounting and control be hierarchical, per subtree?
>> + */
>> + unsigned long use_hierarchy;
>
> This field is a boolean, but it is declared as an unsigned long and is
> accessed from userspace via an API which returns a u64. This all seems
> ripe for a cleanup..
>
Hmm.. Yes. I initially had a file called features that I intended to use for
enabling features. I'll change/fix this and the write routine.
>> int obsolete;
>> atomic_t refcnt;
>> @@ -1527,6 +1531,44 @@ int mem_cgroup_force_empty_write(struct
>> }
>>
>>
>> +static u64 mem_cgroup_hierarchy_read(struct cgroup *cont, struct cftype *cft)
>> +{
>> + return mem_cgroup_from_cont(cont)->use_hierarchy;
>> +}
>> +
>> +static int mem_cgroup_hierarchy_write(struct cgroup *cont, struct cftype *cft,
>> + u64 val)
>> +{
>> + int retval = 0;
>> + struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> + struct cgroup *parent = cont->parent;
>> + struct mem_cgroup *parent_mem = NULL;
>> +
>> + if (parent)
>> + parent_mem = mem_cgroup_from_cont(parent);
>> +
>> + cgroup_lock();
>> + /*
>> +	 * If parent's use_hierarchy is set, we can't make any modifications
>> + * in the child subtrees. If it is unset, then the change can
>> + * occur, provided the current cgroup has no children.
>> + *
>> + * For the root cgroup, parent_mem is NULL, we allow value to be
>> + * set if there are no children.
>> + */
>> + if (!parent_mem || (!parent_mem->use_hierarchy &&
>> + (val == 1 || val == 0))) {
>
> One part of this test permits any value, but the other part restricts
> values to 0 or 1.
>
Thanks, will fix!
>> + if (list_empty(&cont->children))
>> + mem->use_hierarchy = val;
>> + else
>> + retval = -EBUSY;
>> + } else
>> + retval = -EINVAL;
>> + cgroup_unlock();
>> +
>> + return retval;
>> +}
>> +
>> static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
>> {
>> struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
>> @@ -1690,6 +1732,11 @@ static struct cftype mem_cgroup_files[]
>> .name = "force_empty",
>> .trigger = mem_cgroup_force_empty_write,
>> },
>> + {
>> + .name = "use_hierarchy",
>> + .write_u64 = mem_cgroup_hierarchy_write,
>> + .read_u64 = mem_cgroup_hierarchy_read,
>> + },
>> };
>>
>> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
>> @@ -1865,12 +1912,18 @@ mem_cgroup_create(struct cgroup_subsys *
>> if (cont->parent == NULL) {
>> enable_swap_cgroup();
>> parent = NULL;
>> - } else
>> + } else {
>> parent = mem_cgroup_from_cont(cont->parent);
>> + mem->use_hierarchy = parent->use_hierarchy;
>> + }
>>
>> - res_counter_init(&mem->res, parent ? &parent->res : NULL);
>> - res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
>> -
>> + if (parent && parent->use_hierarchy) {
>> + res_counter_init(&mem->res, &parent->res);
>> + res_counter_init(&mem->memsw, &parent->memsw);
>> + } else {
>> + res_counter_init(&mem->res, NULL);
>> + res_counter_init(&mem->memsw, NULL);
>> + }
>
Thanks for the review!
--
Balbir
Hi.
Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
For example, when attaching a task to some cpuset directory (memory_migrate=on):
cgroup_tasks_write (hold cgroup_mutex)
attach_task_by_pid
cgroup_attach_task
cpuset_attach
cpuset_migrate_mm
:
unmap_and_move
mem_cgroup_prepare_migration
mem_cgroup_try_charge
mem_cgroup_hierarchical_reclaim
I think a similar problem can also happen when removing a memcg's directory.
Any ideas?
Thanks,
Daisuke Nishimura.
On Sun, 16 Nov 2008 13:40:55 +0530, Balbir Singh <[email protected]> wrote:
> This patch introduces hierarchical reclaim. When a charge cannot be
> satisfied because an ancestor is over its limit, the charging routine
> identifies that ancestor. The reclaim process then starts from the last
> scanned child of the ancestor and reclaims until the ancestor goes back
> below its limit.
>
> Signed-off-by: Balbir Singh <[email protected]>
> ---
>
> mm/memcontrol.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 167 insertions(+), 3 deletions(-)
>
> diff -puN mm/memcontrol.c~memcg-hierarchical-reclaim mm/memcontrol.c
> --- linux-2.6.28-rc4/mm/memcontrol.c~memcg-hierarchical-reclaim 2008-11-16 13:17:33.000000000 +0530
> +++ linux-2.6.28-rc4-balbir/mm/memcontrol.c 2008-11-16 13:17:33.000000000 +0530
> @@ -142,6 +142,13 @@ struct mem_cgroup {
> struct mem_cgroup_lru_info info;
>
> int prev_priority; /* for recording reclaim priority */
> +
> + /*
> +	 * While reclaiming in a hierarchy, we cache the last child we
> + * reclaimed from. Protected by cgroup_lock()
> + */
> + struct mem_cgroup *last_scanned_child;
> +
> int obsolete;
> atomic_t refcnt;
> /*
> @@ -460,6 +467,153 @@ unsigned long mem_cgroup_isolate_pages(u
> return nr_taken;
> }
>
> +static struct mem_cgroup *
> +mem_cgroup_from_res_counter(struct res_counter *counter)
> +{
> + return container_of(counter, struct mem_cgroup, res);
> +}
> +
> +/*
> + * This routine finds the DFS walk successor. This routine should be
> + * called with cgroup_mutex held
> + */
> +static struct mem_cgroup *
> +mem_cgroup_get_next_node(struct mem_cgroup *curr, struct mem_cgroup *root_mem)
> +{
> + struct cgroup *cgroup, *curr_cgroup, *root_cgroup;
> +
> + curr_cgroup = curr->css.cgroup;
> + root_cgroup = root_mem->css.cgroup;
> +
> + if (!list_empty(&curr_cgroup->children)) {
> + /*
> + * Walk down to children
> + */
> + mem_cgroup_put(curr);
> + cgroup = list_entry(curr_cgroup->children.next,
> + struct cgroup, sibling);
> + curr = mem_cgroup_from_cont(cgroup);
> + mem_cgroup_get(curr);
> + goto done;
> + }
> +
> +visit_parent:
> + if (curr_cgroup == root_cgroup) {
> + mem_cgroup_put(curr);
> + curr = root_mem;
> + mem_cgroup_get(curr);
> + goto done;
> + }
> +
> + /*
> + * Goto next sibling
> + */
> + if (curr_cgroup->sibling.next != &curr_cgroup->parent->children) {
> + mem_cgroup_put(curr);
> + cgroup = list_entry(curr_cgroup->sibling.next, struct cgroup,
> + sibling);
> + curr = mem_cgroup_from_cont(cgroup);
> + mem_cgroup_get(curr);
> + goto done;
> + }
> +
> + /*
> + * Go up to next parent and next parent's sibling if need be
> + */
> + curr_cgroup = curr_cgroup->parent;
> + goto visit_parent;
> +
> +done:
> + root_mem->last_scanned_child = curr;
> + return curr;
> +}
> +
> +/*
> + * Visit the first child (need not be the first child as per the ordering
> + * of the cgroup list, since we track last_scanned_child) of @mem and use
> + * that to reclaim free pages from.
> + */
> +static struct mem_cgroup *
> +mem_cgroup_get_first_node(struct mem_cgroup *root_mem)
> +{
> + struct cgroup *cgroup;
> + struct mem_cgroup *ret;
> + bool obsolete = (root_mem->last_scanned_child &&
> + root_mem->last_scanned_child->obsolete);
> +
> + /*
> + * Scan all children under the mem_cgroup mem
> + */
> + cgroup_lock();
> + if (list_empty(&root_mem->css.cgroup->children)) {
> + ret = root_mem;
> + goto done;
> + }
> +
> + if (!root_mem->last_scanned_child || obsolete) {
> +
> + if (obsolete)
> + mem_cgroup_put(root_mem->last_scanned_child);
> +
> + cgroup = list_first_entry(&root_mem->css.cgroup->children,
> + struct cgroup, sibling);
> + ret = mem_cgroup_from_cont(cgroup);
> + mem_cgroup_get(ret);
> + } else
> + ret = mem_cgroup_get_next_node(root_mem->last_scanned_child,
> + root_mem);
> +
> +done:
> + root_mem->last_scanned_child = ret;
> + cgroup_unlock();
> + return ret;
> +}
> +
> +/*
> + * Dance down the hierarchy if needed to reclaim memory. We remember the
> + * last child we reclaimed from, so that we don't end up penalizing
> + * one child extensively based on its position in the children list.
> + *
> + * root_mem is the original ancestor that we've been reclaiming from.
> + */
> +static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
> + gfp_t gfp_mask, bool noswap)
> +{
> + struct mem_cgroup *next_mem;
> + int ret = 0;
> +
> + /*
> + * Reclaim unconditionally and don't check for return value.
> + * We need to reclaim in the current group and down the tree.
> + * One might think about checking for children before reclaiming,
> + * but there might be left over accounting, even after children
> + * have left.
> + */
> + ret = try_to_free_mem_cgroup_pages(root_mem, gfp_mask, noswap);
> + if (res_counter_check_under_limit(&root_mem->res))
> + return 0;
> +
> + next_mem = mem_cgroup_get_first_node(root_mem);
> +
> + while (next_mem != root_mem) {
> + if (next_mem->obsolete) {
> + mem_cgroup_put(next_mem);
> + cgroup_lock();
> + next_mem = mem_cgroup_get_first_node(root_mem);
> + cgroup_unlock();
> + continue;
> + }
> + ret = try_to_free_mem_cgroup_pages(next_mem, gfp_mask, noswap);
> + if (res_counter_check_under_limit(&root_mem->res)) {
> + return 0;
> + }
> + cgroup_lock();
> + next_mem = mem_cgroup_get_next_node(next_mem, root_mem);
> + cgroup_unlock();
> + }
> + return ret;
> +}
> +
> /*
> * Unlike exported interface, "oom" parameter is added. if oom==true,
> * oom-killer can be invoked.
> @@ -468,7 +622,7 @@ static int __mem_cgroup_try_charge(struc
> gfp_t gfp_mask, struct mem_cgroup **memcg,
> bool oom)
> {
> - struct mem_cgroup *mem;
> + struct mem_cgroup *mem, *mem_over_limit;
> int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> struct res_counter *fail_res;
> /*
> @@ -514,8 +668,16 @@ static int __mem_cgroup_try_charge(struc
> if (!(gfp_mask & __GFP_WAIT))
> goto nomem;
>
> - if (try_to_free_mem_cgroup_pages(mem, gfp_mask, noswap))
> - continue;
> + /*
> + * Is one of our ancestors over their limit?
> + */
> + if (fail_res)
> + mem_over_limit = mem_cgroup_from_res_counter(fail_res);
> + else
> + mem_over_limit = mem;
> +
> + ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, gfp_mask,
> + noswap);
>
> /*
> * try_to_free_mem_cgroup_pages() might not give us a full
> @@ -1710,6 +1872,8 @@ mem_cgroup_create(struct cgroup_subsys *
> res_counter_init(&mem->memsw, parent ? &parent->memsw : NULL);
>
>
> + mem->last_scanned_child = NULL;
> +
> return &mem->css;
> free_out:
> for_each_node_state(node, N_POSSIBLE)
> _
>
> --
> Balbir
Daisuke Nishimura wrote:
> Hi.
>
> Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
>
> For example, when attaching a task to some cpuset directory (memory_migrate=on):
>
> cgroup_tasks_write (hold cgroup_mutex)
> attach_task_by_pid
> cgroup_attach_task
> cpuset_attach
> cpuset_migrate_mm
> :
> unmap_and_move
> mem_cgroup_prepare_migration
> mem_cgroup_try_charge
> mem_cgroup_hierarchical_reclaim
>
Did lockdep complain about it?
1. We could probably move away from cgroup_mutex to a memory-controller-specific
mutex.
2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
hold the cgroup lock for a long time, and holding it during reclaim will
definitely be visible to users trying to create/delete nodes.
I prefer to do (2); I'll look at the code more closely.
> I think a similar problem can also happen when removing a memcg's directory.
>
Why removing a directory? memcg (now) marks the directory as obsolete and we
check for obsolete directories and get/put references.
Thanks for the bug report!
--
Balbir
On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh <[email protected]> wrote:
> Daisuke Nishimura wrote:
> > Hi.
> >
> > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> >
> > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> >
> > cgroup_tasks_write (hold cgroup_mutex)
> > attach_task_by_pid
> > cgroup_attach_task
> > cpuset_attach
> > cpuset_migrate_mm
> > :
> > unmap_and_move
> > mem_cgroup_prepare_migration
> > mem_cgroup_try_charge
> > mem_cgroup_hierarchical_reclaim
> >
>
> Did lockdep complain about it?
>
I haven't understood lockdep so well, but I got logs like this:
===
INFO: task move.sh:17710 blocked for more than 480 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
move.sh D ffff88010e1c76c0 0 17710 17597
ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
Call Trace:
[<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff804cb357>] mutex_lock_nested+0x180/0x2a2
[<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
[<ffffffff802aed9c>] __mem_cgroup_try_charge+0x27a/0x2de
[<ffffffff802afdfd>] mem_cgroup_prepare_migration+0x6c/0xa5
[<ffffffff802ad97f>] migrate_pages+0x10c/0x4a0
[<ffffffff802ad9c8>] migrate_pages+0x155/0x4a0
[<ffffffff802a14cb>] new_node_page+0x0/0x2f
[<ffffffff802a1adb>] check_range+0x300/0x325
[<ffffffff802a2374>] do_migrate_pages+0x1a5/0x1f1
[<ffffffff8026d272>] cpuset_migrate_mm+0x30/0x93
[<ffffffff8026d29c>] cpuset_migrate_mm+0x5a/0x93
[<ffffffff8026df41>] cpuset_attach+0x93/0xa6
[<ffffffff8026ae1b>] cgroup_attach_task+0x395/0x3e1
[<ffffffff8026af61>] cgroup_tasks_write+0xfa/0x11d
[<ffffffff8026aea0>] cgroup_tasks_write+0x39/0x11d
[<ffffffff8026b5aa>] cgroup_file_write+0xef/0x216
[<ffffffff802b2968>] vfs_write+0xad/0x136
[<ffffffff802b2dfe>] sys_write+0x45/0x6e
[<ffffffff8020bdab>] system_call_fastpath+0x16/0x1b
INFO: lockdep is turned off.
===
And other processes trying to hold cgroup_mutex are also stuck.
> 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> mutex.
> 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> hold the cgroup lock for a long time, and holding it during reclaim will
> definitely be visible to users trying to create/delete nodes.
>
> I prefer to do (2); I'll look at the code more closely.
>
I basically agree, but I think we should also consider mpol_rebind_mm.
mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
OTOH, page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
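In other words, the lock ordering is inverted between the two paths.
Schematically (not actual code):

	/*
	 * Task A (cpuset attach)              Task B (page fault)
	 * ----------------------              -------------------
	 * mutex_lock(&cgroup_mutex);          down_read(&mm->mmap_sem);
	 * cpuset_attach()                     mem_cgroup_try_charge()
	 *   mpol_rebind_mm()                    mem_cgroup_hierarchical_reclaim()
	 *     down_write(&mm->mmap_sem);          cgroup_lock();
	 *
	 * A waits for B to release mmap_sem; B waits for A to release
	 * cgroup_mutex: an AB-BA deadlock.
	 */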
> > I think a similar problem can also happen when removing a memcg's directory.
> >
>
> Why removing a directory? memcg (now) marks the directory as obsolete and we
> check for obsolete directories and get/put references.
>
I don't think so.
mem_cgroup_pre_destroy (make mem->obsolete = 1)
mem_cgroup_force_empty(mem, **FALSE**)
mem_cgroup_force_empty_list
mem_cgroup_move_parent
__mem_cgroup_try_charge
Hmm, but looking more closely, cgroup_call_pre_destroy is called
outside of cgroup_mutex, so this problem probably doesn't happen at rmdir.
Thanks,
Daisuke Nishimura.
On Wed, 26 Nov 2008 11:14:47 +0900
Daisuke Nishimura <[email protected]> wrote:
> On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh <[email protected]> wrote:
> > Daisuke Nishimura wrote:
> > > Hi.
> > >
> > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > >
> > > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> > >
> > > cgroup_tasks_write (hold cgroup_mutex)
> > > attach_task_by_pid
> > > cgroup_attach_task
> > > cpuset_attach
> > > cpuset_migrate_mm
> > > :
> > > unmap_and_move
> > > mem_cgroup_prepare_migration
> > > mem_cgroup_try_charge
> > > mem_cgroup_hierarchical_reclaim
> > >
> >
> > Did lockdep complain about it?
> >
> I haven't understood lockdep so well, but I got logs like this:
>
> ===
> INFO: task move.sh:17710 blocked for more than 480 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> move.sh D ffff88010e1c76c0 0 17710 17597
> ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> Call Trace:
> [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> [<ffffffff804cb357>] mutex_lock_nested+0x180/0x2a2
> [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> [<ffffffff802aed9c>] __mem_cgroup_try_charge+0x27a/0x2de
> [<ffffffff802afdfd>] mem_cgroup_prepare_migration+0x6c/0xa5
> [<ffffffff802ad97f>] migrate_pages+0x10c/0x4a0
> [<ffffffff802ad9c8>] migrate_pages+0x155/0x4a0
> [<ffffffff802a14cb>] new_node_page+0x0/0x2f
> [<ffffffff802a1adb>] check_range+0x300/0x325
> [<ffffffff802a2374>] do_migrate_pages+0x1a5/0x1f1
> [<ffffffff8026d272>] cpuset_migrate_mm+0x30/0x93
> [<ffffffff8026d29c>] cpuset_migrate_mm+0x5a/0x93
> [<ffffffff8026df41>] cpuset_attach+0x93/0xa6
> [<ffffffff8026ae1b>] cgroup_attach_task+0x395/0x3e1
> [<ffffffff8026af61>] cgroup_tasks_write+0xfa/0x11d
> [<ffffffff8026aea0>] cgroup_tasks_write+0x39/0x11d
> [<ffffffff8026b5aa>] cgroup_file_write+0xef/0x216
> [<ffffffff802b2968>] vfs_write+0xad/0x136
> [<ffffffff802b2dfe>] sys_write+0x45/0x6e
> [<ffffffff8020bdab>] system_call_fastpath+0x16/0x1b
> INFO: lockdep is turned off.
> ===
>
> And other processes trying to hold cgroup_mutex are also stuck.
>
> > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> > > mutex.
> > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> > > hold the cgroup lock for a long time, and holding it during reclaim will
> > > definitely be visible to users trying to create/delete nodes.
> > >
> > > I prefer to do (2); I'll look at the code more closely.
> >
> I basically agree, but I think we should also consider mpol_rebind_mm.
>
> mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
> which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
> OTOH, page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
> which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
>
What's the status of this problem? Fixed or not yet?
Sorry for failing to track patches.
Thanks,
-Kame
> > > I think a similar problem can also happen when removing a memcg's directory.
> > >
> >
> > Why removing a directory? memcg (now) marks the directory as obsolete and we
> > check for obsolete directories and get/put references.
> >
> I don't think so.
>
> mem_cgroup_pre_destroy (make mem->obsolete = 1)
> mem_cgroup_force_empty(mem, **FALSE**)
> mem_cgroup_force_empty_list
> mem_cgroup_move_parent
> __mem_cgroup_try_charge
>
> Hmm, but looking more closely, cgroup_call_pre_destroy is called
> outside of cgroup_mutex, so this problem probably doesn't happen at rmdir.
>
>
> Thanks,
> Daisuke Nishimura.
>
* KAMEZAWA Hiroyuki <[email protected]> [2008-12-09 11:59:43]:
> On Wed, 26 Nov 2008 11:14:47 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh <[email protected]> wrote:
> > > Daisuke Nishimura wrote:
> > > > Hi.
> > > >
> > > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > > >
> > > > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> > > >
> > > > cgroup_tasks_write (hold cgroup_mutex)
> > > > attach_task_by_pid
> > > > cgroup_attach_task
> > > > cpuset_attach
> > > > cpuset_migrate_mm
> > > > :
> > > > unmap_and_move
> > > > mem_cgroup_prepare_migration
> > > > mem_cgroup_try_charge
> > > > mem_cgroup_hierarchical_reclaim
> > > >
> > >
> > > Did lockdep complain about it?
> > >
> > I haven't understood lockdep so well, but I got logs like this:
> >
> > ===
> > INFO: task move.sh:17710 blocked for more than 480 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > move.sh D ffff88010e1c76c0 0 17710 17597
> > ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> > ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> > ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> > Call Trace:
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff804cb357>] mutex_lock_nested+0x180/0x2a2
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff802aed9c>] __mem_cgroup_try_charge+0x27a/0x2de
> > [<ffffffff802afdfd>] mem_cgroup_prepare_migration+0x6c/0xa5
> > [<ffffffff802ad97f>] migrate_pages+0x10c/0x4a0
> > [<ffffffff802ad9c8>] migrate_pages+0x155/0x4a0
> > [<ffffffff802a14cb>] new_node_page+0x0/0x2f
> > [<ffffffff802a1adb>] check_range+0x300/0x325
> > [<ffffffff802a2374>] do_migrate_pages+0x1a5/0x1f1
> > [<ffffffff8026d272>] cpuset_migrate_mm+0x30/0x93
> > [<ffffffff8026d29c>] cpuset_migrate_mm+0x5a/0x93
> > [<ffffffff8026df41>] cpuset_attach+0x93/0xa6
> > [<ffffffff8026ae1b>] cgroup_attach_task+0x395/0x3e1
> > [<ffffffff8026af61>] cgroup_tasks_write+0xfa/0x11d
> > [<ffffffff8026aea0>] cgroup_tasks_write+0x39/0x11d
> > [<ffffffff8026b5aa>] cgroup_file_write+0xef/0x216
> > [<ffffffff802b2968>] vfs_write+0xad/0x136
> > [<ffffffff802b2dfe>] sys_write+0x45/0x6e
> > [<ffffffff8020bdab>] system_call_fastpath+0x16/0x1b
> > INFO: lockdep is turned off.
> > ===
> >
> > And other processes trying to hold cgroup_mutex are also stuck.
> >
> > > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> > > > mutex.
> > > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> > > > hold the cgroup lock for a long time, and holding it during reclaim will
> > > > definitely be visible to users trying to create/delete nodes.
> > > >
> > > > I prefer to do (2); I'll look at the code more closely.
> > >
> > I basically agree, but I think we should also consider mpol_rebind_mm.
> >
> > mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
> > which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
> > OTOH, page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
> > which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
> >
>
> What's the status of this problem? Fixed or not yet?
> Sorry for failing to track patches.
>
Kamezawa-San,
We are looking at the two approaches that I had mentioned earlier:
1) Rely on the new cgroup_tasklist mutex introduced to close the race.
2) Remove the cgroup lock dependency from cgroup_tasks_write. I worry
that it can lead to long latencies with cgroup_lock held.
I can send a patch for (1) today. I want to fix (2), but I've spent a lot
of time staring at that code and could not find any easy way to fix it.
--
Balbir
On Tue, 9 Dec 2008 11:59:43 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Wed, 26 Nov 2008 11:14:47 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh <[email protected]> wrote:
> > > Daisuke Nishimura wrote:
> > > > Hi.
> > > >
> > > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > > >
> > > > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> > > >
> > > > cgroup_tasks_write (hold cgroup_mutex)
> > > > attach_task_by_pid
> > > > cgroup_attach_task
> > > > cpuset_attach
> > > > cpuset_migrate_mm
> > > > :
> > > > unmap_and_move
> > > > mem_cgroup_prepare_migration
> > > > mem_cgroup_try_charge
> > > > mem_cgroup_hierarchical_reclaim
> > > >
> > >
> > > Did lockdep complain about it?
> > >
> > I haven't understood lockdep so well, but I got logs like this:
> >
> > ===
> > INFO: task move.sh:17710 blocked for more than 480 seconds.
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > move.sh D ffff88010e1c76c0 0 17710 17597
> > ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> > ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> > ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> > Call Trace:
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff804cb357>] mutex_lock_nested+0x180/0x2a2
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > [<ffffffff802aed9c>] __mem_cgroup_try_charge+0x27a/0x2de
> > [<ffffffff802afdfd>] mem_cgroup_prepare_migration+0x6c/0xa5
> > [<ffffffff802ad97f>] migrate_pages+0x10c/0x4a0
> > [<ffffffff802ad9c8>] migrate_pages+0x155/0x4a0
> > [<ffffffff802a14cb>] new_node_page+0x0/0x2f
> > [<ffffffff802a1adb>] check_range+0x300/0x325
> > [<ffffffff802a2374>] do_migrate_pages+0x1a5/0x1f1
> > [<ffffffff8026d272>] cpuset_migrate_mm+0x30/0x93
> > [<ffffffff8026d29c>] cpuset_migrate_mm+0x5a/0x93
> > [<ffffffff8026df41>] cpuset_attach+0x93/0xa6
> > [<ffffffff8026ae1b>] cgroup_attach_task+0x395/0x3e1
> > [<ffffffff8026af61>] cgroup_tasks_write+0xfa/0x11d
> > [<ffffffff8026aea0>] cgroup_tasks_write+0x39/0x11d
> > [<ffffffff8026b5aa>] cgroup_file_write+0xef/0x216
> > [<ffffffff802b2968>] vfs_write+0xad/0x136
> > [<ffffffff802b2dfe>] sys_write+0x45/0x6e
> > [<ffffffff8020bdab>] system_call_fastpath+0x16/0x1b
> > INFO: lockdep is turned off.
> > ===
> >
> > And other processes trying to hold cgroup_mutex are also stuck.
> >
> > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> > > mutex.
> > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> > > hold the cgroup lock for a long time, and holding it during reclaim will
> > > definitely be visible to users trying to create/delete nodes.
> > >
> > > I prefer to do (2); I'll look at the code more closely.
> > >
> > I basically agree, but I think we should also consider mpol_rebind_mm.
> >
> > mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
> > which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
> > OTOH, page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
> > which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
> >
>
> What's the status of this problem? Fixed or not yet?
> Sorry for failing to track patches.
>
Not yet.
Those deadlocks cannot be fixed as long as the reclaim path tries to hold cgroup_mutex.
(The current mmotm doesn't hold cgroup_mutex on the reclaim path if !use_hierarchy, and
I'm testing with !use_hierarchy. It works well basically, but I got another bug
at rmdir today, and I'm digging into it now.)
The deadlock I've fixed by memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
is another one (it removed cgroup_lock from the oom code).
Thanks,
Daisuke Nishimura.
On Tue, 9 Dec 2008 09:18:28 +0530
Balbir Singh <[email protected]> wrote:
> * KAMEZAWA Hiroyuki <[email protected]> [2008-12-09 11:59:43]:
>
> > On Wed, 26 Nov 2008 11:14:47 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > On Tue, 25 Nov 2008 20:31:25 +0530, Balbir Singh <[email protected]> wrote:
> > > > Daisuke Nishimura wrote:
> > > > > Hi.
> > > > >
> > > > > Unfortunately, trying to hold cgroup_mutex at reclaim causes a deadlock.
> > > > >
> > > > > For example, when attaching a task to some cpuset directory (memory_migrate=on):
> > > > >
> > > > > cgroup_tasks_write (hold cgroup_mutex)
> > > > > attach_task_by_pid
> > > > > cgroup_attach_task
> > > > > cpuset_attach
> > > > > cpuset_migrate_mm
> > > > > :
> > > > > unmap_and_move
> > > > > mem_cgroup_prepare_migration
> > > > > mem_cgroup_try_charge
> > > > > mem_cgroup_hierarchical_reclaim
> > > > >
> > > >
> > > > Did lockdep complain about it?
> > > >
> > > I haven't understood lockdep so well, but I got logs like this:
> > >
> > > ===
> > > INFO: task move.sh:17710 blocked for more than 480 seconds.
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > move.sh D ffff88010e1c76c0 0 17710 17597
> > > ffff8800bd9edf00 0000000000000046 0000000000000000 0000000000000000
> > > ffff8803afbc0000 ffff8800bd9ee270 0000000e00000000 000000010a54459c
> > > ffffffffffffffff ffffffffffffffff ffffffffffffffff 7fffffffffffffff
> > > Call Trace:
> > > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > > [<ffffffff804cb357>] mutex_lock_nested+0x180/0x2a2
> > > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > > [<ffffffff802ae9f0>] mem_cgroup_get_first_node+0x29/0x8a
> > > [<ffffffff802aed9c>] __mem_cgroup_try_charge+0x27a/0x2de
> > > [<ffffffff802afdfd>] mem_cgroup_prepare_migration+0x6c/0xa5
> > > [<ffffffff802ad97f>] migrate_pages+0x10c/0x4a0
> > > [<ffffffff802ad9c8>] migrate_pages+0x155/0x4a0
> > > [<ffffffff802a14cb>] new_node_page+0x0/0x2f
> > > [<ffffffff802a1adb>] check_range+0x300/0x325
> > > [<ffffffff802a2374>] do_migrate_pages+0x1a5/0x1f1
> > > [<ffffffff8026d272>] cpuset_migrate_mm+0x30/0x93
> > > [<ffffffff8026d29c>] cpuset_migrate_mm+0x5a/0x93
> > > [<ffffffff8026df41>] cpuset_attach+0x93/0xa6
> > > [<ffffffff8026ae1b>] cgroup_attach_task+0x395/0x3e1
> > > [<ffffffff8026af61>] cgroup_tasks_write+0xfa/0x11d
> > > [<ffffffff8026aea0>] cgroup_tasks_write+0x39/0x11d
> > > [<ffffffff8026b5aa>] cgroup_file_write+0xef/0x216
> > > [<ffffffff802b2968>] vfs_write+0xad/0x136
> > > [<ffffffff802b2dfe>] sys_write+0x45/0x6e
> > > [<ffffffff8020bdab>] system_call_fastpath+0x16/0x1b
> > > INFO: lockdep is turned off.
> > > ===
> > >
> > > And other processes trying to hold cgroup_mutex are also stuck.
> > >
> > > > 1. We could probably move away from cgroup_mutex to a memory-controller-specific
> > > > mutex.
> > > > 2. We could give up cgroup_mutex before migrate_mm, since it seems like we'll
> > > > hold the cgroup lock for a long time, and holding it during reclaim will
> > > > definitely be visible to users trying to create/delete nodes.
> > > >
> > > > I prefer to do (2); I'll look at the code more closely.
> > > >
> > > I basically agree, but I think we should also consider mpol_rebind_mm.
> > >
> > > mpol_rebind_mm, which can be called from cpuset_attach, does down_write(mm->mmap_sem),
> > > which means down_write(mm->mmap_sem) can be called under cgroup_mutex.
> > > OTOH, page fault path does down_read(mm->mmap_sem) and can call mem_cgroup_try_charge,
> > > which means mutex_lock(cgroup_mutex) can be called under down_read(mm->mmap_sem).
> > >
> >
> > What's the status of this problem? Fixed or not yet?
> > Sorry for failing to track patches.
> >
>
> Kamezawa-San,
>
> We are looking at the two approaches that I had mentioned earlier:
>
> 1) Rely on the new cgroup_tasklist mutex introduced to close the race.
Hmm? What you're talking about is
==memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
+static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
this?
I think there are no Acks from Paul Menage (i.e., the cgroup maintainer)
on this, and I am afraid that it will add new complexity.
Hmm... It seems that my cgroup-ID scan patch will need more time to fix this
race. But I'd like to push fixes for this kind of race out to the cgroup layer
rather than making memcg more special... we already have tons of special
operations which cannot be handled in the cgroup layer and break assumptions.
-Kame
> 2) Remove the cgroup lock dependency from cgroup_tasks_write. I worry
> that it can lead to long latencies with cgroup_lock held.
>
> I can send a patch for (1) today. I want to fix (2), but I've spent a lot
> of time staring at that code and could not find any easy way to fix it.
>
>
> --
> Balbir
>
On Tue, 9 Dec 2008 12:53:41 +0900
Daisuke Nishimura <[email protected]> wrote:
> Not yet.
>
> Those deadlocks cannot be fixed as long as the reclaim path tries to hold cgroup_mutex.
> (The current mmotm doesn't hold cgroup_mutex on the reclaim path if !use_hierarchy, and
> I'm testing with !use_hierarchy. It works well basically, but I got another bug
> at rmdir today, and I'm digging into it now.)
>
> The deadlock I've fixed by memcg-avoid-dead-lock-caused-by-race-between-oom-and-cpuset_attach.patch
> is another one (it removed cgroup_lock from the oom code).
>
Okay, then removing cgroup_lock from the memory-reclaim path is the way to go.
Thank you.
-Kame