Hi all,
this patch set tries to make mem_cgroup_iter saner in the way it
walks hierarchies. css->id based traversal is far from being ideal as it
is not deterministic because it depends on the creation ordering.
Diffstat looks promising but it is fair to say that the biggest cleanup is
just css_get_next removal. The memcg code has grown a bit but I think it is
worth the resulting outcome (the sanity ;)).
The first patch fixes a potential misbehavior which I haven't seen but the
fix is needed for the later patches anyway. We could take it alone as well
but I do not have any bug report to base the fix on.
The second patch replaces css_get_next by cgroup iterators which are
scheduled for 3.8 in Tejun's tree and I depend on the following two patches:
fe1e904c cgroup: implement generic child / descendant walk macros
7e187c6c cgroup: use rculist ops for cgroup->children
The third patch is an attempt to simplify mem_cgroup_iter. It
basically removes all css usages to make the code easier. The next patch
removes the big while(!memcg) loop around the iterating logic. It could have
been folded into #3 but I'd rather have the rework separate from the code
moving noise.
The last patch just removes css_get_next as there is no user for it any
longer.
I am also thinking that leaf-to-root iteration makes more sense but this
patch is not included in the series yet because I have to think some
more about the justification.
So far I didn't get to testing but I am posting this early to check whether
everybody is OK with this change.
Any thoughts?
Cumulative diffstat:
include/linux/cgroup.h | 7 ---
kernel/cgroup.c | 49 ---------------------
mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++---------------
3 files changed, 75 insertions(+), 91 deletions(-)
Michal Hocko (5):
memcg: synchronize per-zone iterator access by a spinlock
memcg: rework mem_cgroup_iter to use cgroup iterators
memcg: simplify mem_cgroup_iter
memcg: clean up mem_cgroup_iter
cgroup: remove css_get_next
Now that we have generic and well-ordered cgroup tree walkers there is
no need to keep css_get_next in place.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/cgroup.h | 7 -------
kernel/cgroup.c | 49 ------------------------------------------------
2 files changed, 56 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 329eb46..ba46041 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -676,13 +676,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);
struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);
-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
- struct cgroup_subsys_state *root, int *foundid);
-
/* Returns true if root is ancestor of cg */
bool css_is_ancestor(struct cgroup_subsys_state *cg,
const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d51958a..4d874b2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5230,55 +5230,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
}
EXPORT_SYMBOL_GPL(css_lookup);
-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
- struct cgroup_subsys_state *root, int *foundid)
-{
- struct cgroup_subsys_state *ret = NULL;
- struct css_id *tmp;
- int tmpid;
- int rootid = css_id(root);
- int depth = css_depth(root);
-
- if (!rootid)
- return NULL;
-
- BUG_ON(!ss->use_id);
- WARN_ON_ONCE(!rcu_read_lock_held());
-
- /* fill start point for scan */
- tmpid = id;
- while (1) {
- /*
- * scan next entry from bitmap(tree), tmpid is updated after
- * idr_get_next().
- */
- tmp = idr_get_next(&ss->idr, &tmpid);
- if (!tmp)
- break;
- if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
- ret = rcu_dereference(tmp->css);
- if (ret) {
- *foundid = tmpid;
- break;
- }
- }
- /* continue to scan from next id */
- tmpid = tmpid + 1;
- }
- return ret;
-}
-
/*
* get corresponding css from file open on cgroupfs directory
*/
--
1.7.10.4
The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL - aka
the loop is completed) and whether no memcg is associated with the found
node (!memcg - aka css_tryget failed because the group is no longer
alive). This leads to awkward tweaks like testing for css && !memcg to
skip the current node.
It will be much easier if we get rid of the css variable altogether and
rely only on memcg. In order to do that the iteration part has to skip
dead nodes. This sounds natural to me and as a nice side effect we will
get a simple invariant: memcg is always alive when non-NULL and all
nodes have been visited otherwise.
We could get rid of the surrounding while loop but it is kept for now
to make the review easier. It will go away in the next patch.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 52 ++++++++++++++++++++++++----------------------------
1 file changed, 24 insertions(+), 28 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5da1e58..dd84094 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- struct cgroup_subsys_state *css = NULL;
if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1113,49 +1112,46 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
* treatment.
*/
if (!last_visited) {
- css = &root->css;
+ memcg = root;
} else {
- struct cgroup *next_cgroup;
-
+ struct cgroup *next_cgroup,
+ *pos = last_visited->css.cgroup;
+skip_node:
next_cgroup = cgroup_next_descendant_pre(
- last_visited->css.cgroup,
+ pos,
root->css.cgroup);
- if (next_cgroup)
- css = cgroup_subsys_state(next_cgroup,
- mem_cgroup_subsys_id);
+ /*
+ * Even if we find a group we have to make sure it is
+			 * alive. If not, we should skip the node.
+ */
+ if (next_cgroup) {
+ struct mem_cgroup *mem = mem_cgroup_from_cont(
+ next_cgroup);
+ if (css_tryget(&mem->css))
+ memcg = mem;
+ else {
+ pos = next_cgroup;
+ goto skip_node;
+ }
+ }
}
- /*
- * Even if we find a group we have to make sure it is alive.
- * css && !memcg means that the groups should be skipped and
- * we should continue the tree walk.
- */
- if (css == &root->css || (css && css_tryget(css)))
- memcg = mem_cgroup_from_css(css);
-
if (reclaim) {
- struct mem_cgroup *curr = memcg;
-
if (last_visited)
mem_cgroup_put(last_visited);
+ if (memcg)
+ mem_cgroup_get(memcg);
+ iter->last_visited = memcg;
- if (css && !memcg)
- curr = mem_cgroup_from_css(css);
- if (curr)
- mem_cgroup_get(curr);
- iter->last_visited = curr;
-
- if (!css)
+ if (!memcg)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&iter->iter_lock);
- } else if (css && !memcg) {
- last_visited = mem_cgroup_from_css(css);
}
rcu_read_unlock();
- if (prev && !css)
+ if (prev && !memcg)
return NULL;
}
return memcg;
--
1.7.10.4
The per-zone per-priority iterator is aimed at coordinating concurrent
reclaimers on the same hierarchy (or the global reclaim when all
groups are reclaimed) so that all groups get reclaimed as evenly as
possible. iter->position holds the css->id of the last group visited
and iter->generation signals a completed tree walk (it is incremented
on every round-trip).
Concurrent reclaimers are supposed to provide a reclaim cookie which
holds the reclaim priority and the last generation they saw. If the
cookie's generation doesn't match the iterator's view then another
concurrent reclaimer has already done the job and the tree walk is done
for that priority.
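For reference, the cookie the reclaimers pass in is just a small struct
(roughly as in include/linux/memcontrol.h at this point, comments mine):

	struct mem_cgroup_reclaim_cookie {
		struct zone *zone;		/* zone under reclaim */
		int priority;			/* reclaim priority */
		unsigned int generation;	/* last generation seen */
	};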
This scheme works nicely in most cases but it is not raceless. Two
racing reclaimers can see the same iter->position and so bang on the
same group. The iter->generation increment is not serialized either, so a
reclaimer can see an updated iter->position with an old generation and
the iteration might be restarted from the root of the hierarchy.
The simplest way to fix this issue is to synchronise access to the
iterator by a lock. This implementation uses per-zone per-priority
spinlock which linearizes only directly racing reclaimers which use
reclaim cookies so the effect of the new locking should be really
minimal.
I have to note that I haven't seen this as a real issue so far. The
primary motivation for the change is different. The following patch
will change the way the iterator is implemented: css->id iteration
will be replaced by generic cgroup iteration, which requires storing a
mem_cgroup pointer in the iterator, and that in turn requires reference
counting, so concurrent access will be a problem.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6136fec..0fe5177 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -146,6 +146,8 @@ struct mem_cgroup_reclaim_iter {
int position;
/* scan generation, increased every round-trip */
unsigned int generation;
+ /* lock to protect the position and generation */
+ spinlock_t iter_lock;
};
/*
@@ -1093,8 +1095,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
- if (prev && reclaim->generation != iter->generation)
+ spin_lock(&iter->iter_lock);
+ if (prev && reclaim->generation != iter->generation) {
+ spin_unlock(&iter->iter_lock);
return NULL;
+ }
id = iter->position;
}
@@ -1113,6 +1118,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
+ spin_unlock(&iter->iter_lock);
}
if (prev && !css)
@@ -5871,8 +5877,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
return 1;
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ int prio;
+
mz = &pn->zoneinfo[zone];
lruvec_init(&mz->lruvec, &NODE_DATA(node)->node_zones[zone]);
+ for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+ spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
mz->usage_in_excess = 0;
mz->on_tree = false;
mz->memcg = memcg;
--
1.7.10.4
Get rid of the while(!memcg) loop as it is no longer needed because there
will always be at least one group that should be visited (root).
This patch doesn't introduce any functional change but it is kept
separate to make the review easier.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 120 +++++++++++++++++++++++++++----------------------------
1 file changed, 60 insertions(+), 60 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd84094..b924f27 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1063,6 +1063,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
struct mem_cgroup_reclaim_cookie *reclaim)
{
+ struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
struct mem_cgroup *memcg = NULL,
*last_visited = NULL;
@@ -1084,76 +1085,75 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
return root;
}
- while (!memcg) {
- struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
-
- if (reclaim) {
- int nid = zone_to_nid(reclaim->zone);
- int zid = zone_idx(reclaim->zone);
- struct mem_cgroup_per_zone *mz;
-
- mz = mem_cgroup_zoneinfo(root, nid, zid);
- iter = &mz->reclaim_iter[reclaim->priority];
- spin_lock(&iter->iter_lock);
- last_visited = iter->last_visited;
- if (prev && reclaim->generation != iter->generation) {
- if (last_visited) {
- mem_cgroup_put(last_visited);
- iter->last_visited = NULL;
- }
- spin_unlock(&iter->iter_lock);
- return NULL;
+ if (reclaim) {
+ int nid = zone_to_nid(reclaim->zone);
+ int zid = zone_idx(reclaim->zone);
+ struct mem_cgroup_per_zone *mz;
+
+ mz = mem_cgroup_zoneinfo(root, nid, zid);
+ iter = &mz->reclaim_iter[reclaim->priority];
+ spin_lock(&iter->iter_lock);
+ last_visited = iter->last_visited;
+ if (prev && reclaim->generation != iter->generation) {
+ if (last_visited) {
+ mem_cgroup_put(last_visited);
+ iter->last_visited = NULL;
}
+ spin_unlock(&iter->iter_lock);
+ return NULL;
}
+ }
- rcu_read_lock();
+ rcu_read_lock();
+ /*
+ * Root is not visited by cgroup iterators so it needs a special
+ * treatment.
+ */
+ if (!last_visited) {
+ memcg = root;
+ } else {
+ struct cgroup *next_cgroup,
+ *pos = last_visited->css.cgroup;
+skip_node:
+ next_cgroup = cgroup_next_descendant_pre(
+ pos,
+ root->css.cgroup);
/*
- * Root is not visited by cgroup iterators so it needs a special
- * treatment.
+ * Even if we find a group we have to make sure it is
+		 * alive. If not, we should skip the node.
*/
- if (!last_visited) {
- memcg = root;
- } else {
- struct cgroup *next_cgroup,
- *pos = last_visited->css.cgroup;
-skip_node:
- next_cgroup = cgroup_next_descendant_pre(
- pos,
- root->css.cgroup);
- /*
- * Even if we find a group we have to make sure it is
-			 * alive. If not, we should skip the node.
- */
- if (next_cgroup) {
- struct mem_cgroup *mem = mem_cgroup_from_cont(
- next_cgroup);
- if (css_tryget(&mem->css))
- memcg = mem;
- else {
- pos = next_cgroup;
- goto skip_node;
- }
+ if (next_cgroup) {
+ struct mem_cgroup *mem = mem_cgroup_from_cont(
+ next_cgroup);
+ if (css_tryget(&mem->css))
+ memcg = mem;
+ else {
+ pos = next_cgroup;
+ goto skip_node;
}
}
+ }
- if (reclaim) {
- if (last_visited)
- mem_cgroup_put(last_visited);
- if (memcg)
- mem_cgroup_get(memcg);
- iter->last_visited = memcg;
-
- if (!memcg)
- iter->generation++;
- else if (!prev && memcg)
- reclaim->generation = iter->generation;
- spin_unlock(&iter->iter_lock);
- }
- rcu_read_unlock();
+ if (reclaim) {
+ if (last_visited)
+ mem_cgroup_put(last_visited);
+ if (memcg)
+ mem_cgroup_get(memcg);
+ iter->last_visited = memcg;
- if (prev && !memcg)
- return NULL;
+ if (!memcg)
+ iter->generation++;
+ else if (!prev && memcg)
+ reclaim->generation = iter->generation;
+ spin_unlock(&iter->iter_lock);
}
+ rcu_read_unlock();
+
+ /*
+ * At least root has to be visited
+ */
+ VM_BUG_ON(!prev && !memcg);
+
return memcg;
}
--
1.7.10.4
mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups' creation ordering. The only guarantee is that a parent node
is visited before its children.
Example
1) mkdir -p a a/d a/b/c
2) mkdir -p a/b/c a/d
Will create the same trees but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d
574bd9f7 (cgroup: implement generic child / descendant walk macros) has
introduced generic cgroup tree walkers which provide either pre-order
or post-order tree walk. This patch converts css->id based iteration
to a pre-order tree walk to keep the semantics of the original iterator
where parent is always visited before its subtree.
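For reference, the new pre-order walker is a thin macro around
cgroup_next_descendant_pre - roughly (from fe1e904c):

	/* pos is a struct cgroup *; walks all descendants of cgroup */
	#define cgroup_for_each_descendant_pre(pos, cgroup)		\
		for (pos = cgroup_next_descendant_pre(NULL, (cgroup));	\
		     (pos);						\
		     pos = cgroup_next_descendant_pre((pos), (cgroup)))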
cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition and removal,
respectively. This implementation doesn't use those because a new memory
cgroup is fully initialized in mem_cgroup_create and css_tryget makes
sure that the group is alive when the iterator encounters it.
If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the meantime. The elevated reference count on the memcg guarantees that
the group will not vanish even though it has been already removed from
the tree. In such a case css_tryget will fail and the iteration is
retried (groups are linked with RCU-safe lists so forward progress
is still possible). iter_lock will make sure that only one reclaimer
will see the last_visited group and the reference count game around it.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 49 insertions(+), 15 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0fe5177..5da1e58 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu {
};
struct mem_cgroup_reclaim_iter {
- /* css_id of the last scanned hierarchy member */
- int position;
+ /* last scanned hierarchy member with elevated ref count */
+ struct mem_cgroup *last_visited;
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
struct mem_cgroup_reclaim_cookie *reclaim)
{
- struct mem_cgroup *memcg = NULL;
- int id = 0;
+ struct mem_cgroup *memcg = NULL,
+ *last_visited = NULL;
if (mem_cgroup_disabled())
return NULL;
@@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
root = root_mem_cgroup;
if (prev && !reclaim)
- id = css_id(&prev->css);
+ last_visited = prev;
if (prev && prev != root)
css_put(&prev->css);
@@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *css = NULL;
if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
+ last_visited = iter->last_visited;
if (prev && reclaim->generation != iter->generation) {
+ if (last_visited) {
+ mem_cgroup_put(last_visited);
+ iter->last_visited = NULL;
+ }
spin_unlock(&iter->iter_lock);
return NULL;
}
- id = iter->position;
}
rcu_read_lock();
- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
- if (css) {
- if (css == &root->css || css_tryget(css))
- memcg = mem_cgroup_from_css(css);
- } else
- id = 0;
- rcu_read_unlock();
+ /*
+ * Root is not visited by cgroup iterators so it needs a special
+ * treatment.
+ */
+ if (!last_visited) {
+ css = &root->css;
+ } else {
+ struct cgroup *next_cgroup;
+
+ next_cgroup = cgroup_next_descendant_pre(
+ last_visited->css.cgroup,
+ root->css.cgroup);
+ if (next_cgroup)
+ css = cgroup_subsys_state(next_cgroup,
+ mem_cgroup_subsys_id);
+ }
+
+ /*
+ * Even if we find a group we have to make sure it is alive.
+ * css && !memcg means that the groups should be skipped and
+ * we should continue the tree walk.
+ */
+ if (css == &root->css || (css && css_tryget(css)))
+ memcg = mem_cgroup_from_css(css);
if (reclaim) {
- iter->position = id;
+ struct mem_cgroup *curr = memcg;
+
+ if (last_visited)
+ mem_cgroup_put(last_visited);
+
+ if (css && !memcg)
+ curr = mem_cgroup_from_css(css);
+ if (curr)
+ mem_cgroup_get(curr);
+ iter->last_visited = curr;
+
if (!css)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&iter->iter_lock);
+ } else if (css && !memcg) {
+ last_visited = mem_cgroup_from_css(css);
}
+ rcu_read_unlock();
if (prev && !css)
return NULL;
--
1.7.10.4
On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote:
> @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> struct mem_cgroup_reclaim_cookie *reclaim)
> {
> - struct mem_cgroup *memcg = NULL;
> - int id = 0;
> + struct mem_cgroup *memcg = NULL,
> + *last_visited = NULL;
Nitpick but please don't do this.
> + /*
> + * Root is not visited by cgroup iterators so it needs a special
> + * treatment.
> + */
> + if (!last_visited) {
> + css = &root->css;
> + } else {
> + struct cgroup *next_cgroup;
> +
> + next_cgroup = cgroup_next_descendant_pre(
> + last_visited->css.cgroup,
> + root->css.cgroup);
> + if (next_cgroup)
> + css = cgroup_subsys_state(next_cgroup,
> + mem_cgroup_subsys_id);
Hmmm... wouldn't it be better to move the reclaim logic into a
function and do the following?
reclaim(root);
for_each_descendent_pre()
reclaim(descendant);
If this is a problem, I'd be happy to add an iterator which includes
the top node. I'd prefer controllers not using the next functions
directly.
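Something like this (reclaim_one() being a hypothetical helper that
would hold the current per-group reclaim logic; pos is a struct cgroup *):

	reclaim_one(root);			/* root first */
	cgroup_for_each_descendant_pre(pos, root->css.cgroup)
		reclaim_one(mem_cgroup_from_cont(pos));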
Thanks.
--
tejun
(2012/11/14 0:30), Michal Hocko wrote:
> The per-zone per-priority iterator is aimed at coordinating concurrent
> reclaimers on the same hierarchy (or the global reclaim when all
> groups are reclaimed) so that all groups get reclaimed as evenly as
> possible. iter->position holds the css->id of the last group visited
> and iter->generation signals a completed tree walk (it is incremented
> on every round-trip).
> Concurrent reclaimers are supposed to provide a reclaim cookie which
> holds the reclaim priority and the last generation they saw. If the
> cookie's generation doesn't match the iterator's view then another
> concurrent reclaimer has already done the job and the tree walk is done
> for that priority.
>
> This scheme works nicely in most cases but it is not raceless. Two
> racing reclaimers can see the same iter->position and so bang on the
> same group. The iter->generation increment is not serialized either, so a
> reclaimer can see an updated iter->position with an old generation and
> the iteration might be restarted from the root of the hierarchy.
>
> The simplest way to fix this issue is to synchronise access to the
> iterator by a lock. This implementation uses per-zone per-priority
> spinlock which linearizes only directly racing reclaimers which use
> reclaim cookies so the effect of the new locking should be really
> minimal.
>
> I have to note that I haven't seen this as a real issue so far. The
> primary motivation for the change is different. The following patch
> will change the way the iterator is implemented: css->id iteration
> will be replaced by generic cgroup iteration, which requires storing a
> mem_cgroup pointer in the iterator, and that in turn requires reference
> counting, so concurrent access will be a problem.
>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
(2012/11/14 0:30), Michal Hocko wrote:
> Hi all,
> this patch set tries to make mem_cgroup_iter saner in the way it
> walks hierarchies. css->id based traversal is far from being ideal as it
> is not deterministic because it depends on the creation ordering.
>
> Diffstat looks promising but it is fair to say that the biggest cleanup is
> just css_get_next removal. The memcg code has grown a bit but I think it is
> worth the resulting outcome (the sanity ;)).
>
> The first patch fixes a potential misbehavior which I haven't seen but the
> fix is needed for the later patches anyway. We could take it alone as well
> but I do not have any bug report to base the fix on.
>
> The second patch replaces css_get_next by cgroup iterators which are
> scheduled for 3.8 in Tejun's tree and I depend on the following two patches:
> fe1e904c cgroup: implement generic child / descendant walk macros
> 7e187c6c cgroup: use rculist ops for cgroup->children
>
> The third patch is an attempt to simplify mem_cgroup_iter. It
> basically removes all css usages to make the code easier. The next patch
> removes the big while(!memcg) loop around the iterating logic. It could have
> been folded into #3 but I'd rather have the rework separate from the code
> moving noise.
>
> The last patch just removes css_get_next as there is no user for it any
> longer.
>
> I am also thinking that leaf-to-root iteration makes more sense but this
> patch is not included in the series yet because I have to think some
> more about the justification.
>
> So far I didn't get to testing but I am posting this early to check whether
> everybody is OK with this change.
>
> Any thoughts?
>
I'm OK with this. Maybe there are some points I'm not understanding... I'll reply to the patches.
Thanks,
-Kame
(2012/11/14 0:30), Michal Hocko wrote:
> mem_cgroup_iter currently relies on css->id when walking down a group
> hierarchy tree. This is really awkward because the tree walk depends on
> the groups' creation ordering. The only guarantee is that a parent node
> is visited before its children.
> Example
> 1) mkdir -p a a/d a/b/c
> 2) mkdir -p a/b/c a/d
> Will create the same trees but the tree walks will be different:
> 1) a, d, b, c
> 2) a, b, c, d
>
> 574bd9f7 (cgroup: implement generic child / descendant walk macros) has
> introduced generic cgroup tree walkers which provide either pre-order
> or post-order tree walk. This patch converts css->id based iteration
> to a pre-order tree walk to keep the semantics of the original iterator
> where parent is always visited before its subtree.
>
> cgroup_for_each_descendant_pre suggests using post_create and
> pre_destroy for proper synchronization with group addition and removal,
> respectively. This implementation doesn't use those because a new memory
> cgroup is fully initialized in mem_cgroup_create and css_tryget makes
> sure that the group is alive when the iterator encounters it.
>
> If the reclaim cookie is used we need to store the last visited group
> into the iterator so we have to be careful that it doesn't disappear in
> the meantime. The elevated reference count on the memcg guarantees that
> the group will not vanish even though it has been already removed from
> the tree. In such a case css_tryget will fail and the iteration is
> retried (groups are linked with RCU-safe lists so forward progress
> is still possible). iter_lock will make sure that only one reclaimer
> will see the last_visited group and the reference count game around it.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 49 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 0fe5177..5da1e58 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu {
> };
>
> struct mem_cgroup_reclaim_iter {
> - /* css_id of the last scanned hierarchy member */
> - int position;
> + /* last scanned hierarchy member with elevated ref count */
> + struct mem_cgroup *last_visited;
> /* scan generation, increased every round-trip */
> unsigned int generation;
> /* lock to protect the position and generation */
> @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> struct mem_cgroup_reclaim_cookie *reclaim)
> {
> - struct mem_cgroup *memcg = NULL;
> - int id = 0;
> + struct mem_cgroup *memcg = NULL,
> + *last_visited = NULL;
>
> if (mem_cgroup_disabled())
> return NULL;
> @@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> root = root_mem_cgroup;
>
> if (prev && !reclaim)
> - id = css_id(&prev->css);
> + last_visited = prev;
>
> if (prev && prev != root)
> css_put(&prev->css);
> @@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>
> while (!memcg) {
> struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> - struct cgroup_subsys_state *css;
> + struct cgroup_subsys_state *css = NULL;
>
> if (reclaim) {
> int nid = zone_to_nid(reclaim->zone);
> @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> mz = mem_cgroup_zoneinfo(root, nid, zid);
> iter = &mz->reclaim_iter[reclaim->priority];
> spin_lock(&iter->iter_lock);
> + last_visited = iter->last_visited;
> if (prev && reclaim->generation != iter->generation) {
> + if (last_visited) {
> + mem_cgroup_put(last_visited);
> + iter->last_visited = NULL;
> + }
> spin_unlock(&iter->iter_lock);
> return NULL;
> }
> - id = iter->position;
> }
>
> rcu_read_lock();
> - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> - if (css) {
> - if (css == &root->css || css_tryget(css))
> - memcg = mem_cgroup_from_css(css);
> - } else
> - id = 0;
> - rcu_read_unlock();
> + /*
> + * Root is not visited by cgroup iterators so it needs a special
> + * treatment.
> + */
> + if (!last_visited) {
> + css = &root->css;
> + } else {
> + struct cgroup *next_cgroup;
> +
> + next_cgroup = cgroup_next_descendant_pre(
> + last_visited->css.cgroup,
> + root->css.cgroup);
Maybe I am missing something but... last_visited is held by the memcg's
refcnt. The cgroup pointed to by css.cgroup is held by the cgroup's
refcnt, which can be dropped before the memcg is freed, so
last_visited->css.cgroup can be accessed outside of any RCU grace period.
Is this safe?
Thanks,
-Kame
> + if (next_cgroup)
> + css = cgroup_subsys_state(next_cgroup,
> + mem_cgroup_subsys_id);
> + }
> +
> + /*
> + * Even if we find a group we have to make sure it is alive.
> + * css && !memcg means that the groups should be skipped and
> + * we should continue the tree walk.
> + */
> + if (css == &root->css || (css && css_tryget(css)))
> + memcg = mem_cgroup_from_css(css);
>
> if (reclaim) {
> - iter->position = id;
> + struct mem_cgroup *curr = memcg;
> +
> + if (last_visited)
> + mem_cgroup_put(last_visited);
> +
> + if (css && !memcg)
> + curr = mem_cgroup_from_css(css);
> + if (curr)
> + mem_cgroup_get(curr);
> + iter->last_visited = curr;
> +
> if (!css)
> iter->generation++;
> else if (!prev && memcg)
> reclaim->generation = iter->generation;
> spin_unlock(&iter->iter_lock);
> + } else if (css && !memcg) {
> + last_visited = mem_cgroup_from_css(css);
> }
> + rcu_read_unlock();
>
> if (prev && !css)
> return NULL;
>
On 2012/11/13 23:30, Michal Hocko wrote:
> Hi all,
> this patch set tries to make mem_cgroup_iter saner in the way it
> walks hierarchies. css->id based traversal is far from being ideal as it
> is not deterministic because it depends on the creation ordering.
>
> Diffstat looks promising but it is fair to say that the biggest cleanup is
> just css_get_next removal. The memcg code has grown a bit but I think it is
> worth the resulting outcome (the sanity ;)).
>
So memcg won't use css id at all, right? Then we can remove the whole css_id
stuff, and that's quite a bunch of code.
> The first patch fixes a potential misbehavior which I haven't seen but the
> fix is needed for the later patches anyway. We could take it alone as well
> but I do not have any bug report to base the fix on.
>
> The second patch replaces css_get_next by cgroup iterators which are
> scheduled for 3.8 in Tejun's tree and I depend on the following two patches:
> fe1e904c cgroup: implement generic child / descendant walk macros
> 7e187c6c cgroup: use rculist ops for cgroup->children
>
> The third patch is an attempt to simplify mem_cgroup_iter. It
> basically removes all css usages to make the code easier. The next patch
> removes the big while(!memcg) loop around the iterating logic. It could have
> been folded into #3 but I'd rather have the rework separate from the code
> moving noise.
>
> The last patch just removes css_get_next as there is no user for it any
> longer.
>
> I am also thinking that leaf-to-root iteration makes more sense but this
> patch is not included in the series yet because I have to think some
> more about the justification.
>
> So far I didn't get to testing but I am posting this early to check whether
> everybody is OK with this change.
>
On 11/13/2012 04:30 PM, Michal Hocko wrote:
> Hi all,
> this patch set tries to make mem_cgroup_iter saner in the way it
> walks hierarchies. css->id based traversal is far from being ideal as it
> is not deterministic because it depends on the creation ordering.
>
> Diffstat looks promising but it is fair to say that the biggest cleanup is
> just css_get_next removal. The memcg code has grown a bit but I think it is
> worth the resulting outcome (the sanity ;)).
>
> The first patch fixes a potential misbehavior which I haven't seen but the
> fix is needed for the later patches anyway. We could take it alone as well
> but I do not have any bug report to base the fix on.
>
> The second patch replaces css_get_next by cgroup iterators which are
> scheduled for 3.8 in Tejun's tree and I depend on the following two patches:
> fe1e904c cgroup: implement generic child / descendant walk macros
> 7e187c6c cgroup: use rculist ops for cgroup->children
>
> The third patch is an attempt to simplify mem_cgroup_iter. It
> basically removes all css usages to make the code easier. The next patch
> removes the big while(!memcg) loop around the iterating logic. It could have
> been folded into #3 but I'd rather have the rework separate from the code
> moving noise.
>
> The last patch just removes css_get_next as there is no user for it any
> longer.
>
> I am also thinking that leaf-to-root iteration makes more sense but this
> patch is not included in the series yet because I have to think some
> more about the justification.
>
> So far I didn't get to testing but I am posting this early to check whether
> everybody is OK with this change.
>
> Any thoughts?
>
Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c?
It already exists, provides sane ordering, and only relies on parent
information - which the cgroup core already has - to do the walk.
The only minor problem is that we'll have to handle the damn
use_hierarchy case, so we may not be able to blindly rely on
cgroup->parent. But maybe we can, if we don't guarantee any particular
leaf-order.
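For reference, mem_cgroup_iter currently just short-circuits the
!use_hierarchy case instead of walking anything (from mm/memcontrol.c):

	if (!root->use_hierarchy && root != root_mem_cgroup) {
		if (prev)
			return NULL;
		/* the "hierarchy" is only the root group itself */
		return root;
	}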
On Wed 14-11-12 09:55:08, Li Zefan wrote:
> On 2012/11/13 23:30, Michal Hocko wrote:
> > Hi all,
> > this patch set tries to make mem_cgroup_iter saner in the way it
> > walks hierarchies. css->id based traversal is far from being ideal as it
> > is not deterministic because it depends on the creation ordering.
> >
> > Diffstat looks promising but it is fair to say that the biggest cleanup is
> > just css_get_next removal. The memcg code has grown a bit but I think it is
> > worth the resulting outcome (the sanity ;)).
> >
>
> So memcg won't use css id at all, right?
Unfortunately we still use it for the swap accounting but that one could
be replaced by something else, probably. Have to think about it.
> Then we can remove the whole css_id stuff, and that's quite a bunch of
> code.
Is memcg the only user of css_id? A quick grep suggests so but I
haven't checked all the callers of the exported functions. I would be
happy if more code went away.
--
Michal Hocko
SUSE Labs
On Wed 14-11-12 17:17:51, Glauber Costa wrote:
[...]
> Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c?
I do not care much about the internal implementation of the core
iterators. Those implemented by Tejun make sense to me. I just want to
get rid of css->id based ones.
Memcg, however, still needs its own iterator on top because we
have to handle the parallel reclaimers.
[...]
--
Michal Hocko
SUSE Labs
On Tue 13-11-12 08:14:42, Tejun Heo wrote:
> On Tue, Nov 13, 2012 at 04:30:36PM +0100, Michal Hocko wrote:
> > @@ -1063,8 +1063,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > struct mem_cgroup *prev,
> > struct mem_cgroup_reclaim_cookie *reclaim)
> > {
> > - struct mem_cgroup *memcg = NULL;
> > - int id = 0;
> > + struct mem_cgroup *memcg = NULL,
> > + *last_visited = NULL;
>
> Nitpick but please don't do this.
OK, will make it grep-friendlier.
> > + /*
> > + * Root is not visited by cgroup iterators so it needs a special
> > + * treatment.
> > + */
> > + if (!last_visited) {
> > + css = &root->css;
> > + } else {
> > + struct cgroup *next_cgroup;
> > +
> > + next_cgroup = cgroup_next_descendant_pre(
> > + last_visited->css.cgroup,
> > + root->css.cgroup);
> > + if (next_cgroup)
> > + css = cgroup_subsys_state(next_cgroup,
> > + mem_cgroup_subsys_id);
>
> Hmmm... wouldn't it be better to move the reclaim logic into a
> function and do the following?
>
> reclaim(root);
> for_each_descendent_pre()
> reclaim(descendant);
We cannot do for_each_descendent_pre here because we do not iterate
through the whole hierarchy all the time. Check shrink_zone.
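For illustration, shrink_zone does roughly the following (simplified
from mm/vmscan.c of this period):

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);

		shrink_lruvec(lruvec, sc);
		/*
		 * Limit reclaim may bail out once enough pages have been
		 * reclaimed, so the walk can stop in the middle of the
		 * tree and resume from iter->position the next time.
		 */
		if (!global_reclaim(sc) &&
		    sc->nr_reclaimed >= sc->nr_to_reclaim) {
			mem_cgroup_iter_break(root, memcg);
			break;
		}
		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);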
> If this is a problem, I'd be happy to add an iterator which includes
> the top node.
This would help with the above if-else but I do not think this is the
worst thing in the function ;)
> I'd prefer controllers not using the next functions directly.
Well, we will need to use it directly because of the single group
reclaim mentioned above.
--
Michal Hocko
SUSE Labs
On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote:
> (2012/11/14 0:30), Michal Hocko wrote:
[...]
> > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > mz = mem_cgroup_zoneinfo(root, nid, zid);
> > iter = &mz->reclaim_iter[reclaim->priority];
> > spin_lock(&iter->iter_lock);
> > + last_visited = iter->last_visited;
> > if (prev && reclaim->generation != iter->generation) {
> > + if (last_visited) {
> > + mem_cgroup_put(last_visited);
> > + iter->last_visited = NULL;
> > + }
> > spin_unlock(&iter->iter_lock);
> > return NULL;
> > }
> > - id = iter->position;
> > }
> >
> > rcu_read_lock();
> > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> > - if (css) {
> > - if (css == &root->css || css_tryget(css))
> > - memcg = mem_cgroup_from_css(css);
> > - } else
> > - id = 0;
> > - rcu_read_unlock();
> > + /*
> > + * Root is not visited by cgroup iterators so it needs a special
> > + * treatment.
> > + */
> > + if (!last_visited) {
> > + css = &root->css;
> > + } else {
> > + struct cgroup *next_cgroup;
> > +
> > + next_cgroup = cgroup_next_descendant_pre(
> > + last_visited->css.cgroup,
> > + root->css.cgroup);
>
> Maybe I am missing something but... last_visited is held by the memcg's
> refcnt. The cgroup pointed to by css.cgroup is held by the cgroup's
> refcnt, which can be dropped before the memcg is freed, so
> last_visited->css.cgroup can be accessed outside of any RCU grace period.
> Is this safe?
Well spotted. You are right. What I need to do is to check that
last_visited is alive and restart from the root if not. Something like
the below (incremental patch on top of this one) should help, right?
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30efd7e..c0a91a3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
spin_unlock(&iter->iter_lock);
return NULL;
}
+ /*
+ * memcg is still valid because we hold a reference but
+ * its cgroup might have vanished in the meantime so
+ * we have to double check it is alive and restart the
+ * tree walk otherwise.
+ */
+ if (last_visited && !css_tryget(&last_visited->css)) {
+ mem_cgroup_put(last_visited);
+ last_visited = NULL;
+ }
}
rcu_read_lock();
@@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (reclaim) {
struct mem_cgroup *curr = memcg;
- if (last_visited)
+ if (last_visited) {
+ css_put(&last_visited->css);
mem_cgroup_put(last_visited);
+ }
if (css && !memcg)
curr = mem_cgroup_from_css(css);
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Wed, Nov 14, 2012 at 09:36:53AM +0100, Michal Hocko wrote:
> > So memcg won't use css id at all, right?
>
> Unfortunately we still use it for the swap accounting but that one could
> be replaced by something else, probably. Have to think about it.
I have a patch to add cgrp->id pending. From what I can see, memcg
should be able to use that for swap accounting.
> > Then we can remove the whole css_id stuff, and that's quite a bunch of
> > code.
Yeap, that's the plan.
> Is memcg the only user of css_id? Quick grep shows that yes but I
> haven't checked all the callers of the exported functions. I would be
> happy if more code goes away.
Yeap, memcg is the only user and I really wanna remove it once memcg
moves onto saner stuff.
Thanks.
--
tejun
Hello, Glauber.
On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote:
> Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c?
> It already exists, provides sane ordering, and only relies on parent
> information - which the cgroup core already has - to do the walk.
Hmmm... we can but I personally much prefer for_each_*() iterators
over callback-based ones. It's just much easier to share state
across an iteration and follow the logic. walk_tg_tree_from() does
have the benefit of being able to combine pre and post visits in the
same walk, which doesn't seem to have any user at the moment.
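For comparison, the scheduler walker is callback based - roughly:

	typedef int (*tg_visitor)(struct task_group *, void *);

	/* @down is called on the way down, @up on the way back up */
	int walk_tg_tree_from(struct task_group *from,
			      tg_visitor down, tg_visitor up, void *data);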
Thanks.
--
tejun
On 11/14/2012 10:41 PM, Tejun Heo wrote:
> Hello, Glauber.
>
> On Wed, Nov 14, 2012 at 05:17:51PM +0100, Glauber Costa wrote:
>> Why can't we reuse the scheduler iterator and move it to kernel/cgroup.c?
>> It already exists, provides sane ordering, and only relies on parent
>> information - which the cgroup core already has - to do the walk.
>
> Hmmm... we can but I personally much prefer for_each_*() iterators
> over callback-based ones. It's just much easier to share state
> across an iteration and follow the logic. walk_tg_tree_from() does
> have the benefit of being able to combine pre and post visits in the
> same walk, which doesn't seem to have any user at the moment.
>
> Thanks.
>
Is there any particular reason why we can't do it the other way around
then, and use a for_each_*() for the sched walks? Without even considering
what I personally prefer, what I really don't like is having two different
cgroup walkers when it seems like we could very well have just one.
Hello,
On Thu, Nov 15, 2012 at 06:44:57AM +0400, Glauber Costa wrote:
> Is there any particular reason why we can't do it the other way around
> then, and use a for_each_*() for the sched walks? Without even considering
> what I personally prefer, what I really don't like is having two different
> cgroup walkers when it seems like we could very well have just one.
Ooh, sure thing, let's do that. Will work on that.
Thanks.
--
tejun
Hello, Michal.
On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote:
> > reclaim(root);
> > for_each_descendent_pre()
> > reclaim(descendant);
>
> We cannot do for_each_descendent_pre here because we do not iterate
> through the whole hierarchy all the time. Check shrink_zone.
I'm a bit confused. Why would that make any difference? Shouldn't it
be just able to test the condition and continue?
Thanks.
--
tejun
(2012/11/14 10:55), Li Zefan wrote:
> On 2012/11/13 23:30, Michal Hocko wrote:
>> Hi all,
>> this patch set tries to make mem_cgroup_iter saner in the way it
>> walks hierarchies. css->id based traversal is far from being ideal as it
>> is not deterministic because it depends on the creation ordering.
>>
>> Diffstat looks promising but it is fair to say that the biggest cleanup is
>> just css_get_next removal. The memcg code has grown a bit but I think it is
>> worth the resulting outcome (the sanity ;)).
>>
>
> So memcg won't use css id at all, right? Then we can remove the whole css_id
> stuff, and that's quite a bunch of code.
>
It's used by the swap information recording to save space.
Thanks,
-Kame
(2012/11/14 19:10), Michal Hocko wrote:
> On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote:
>> (2012/11/14 0:30), Michal Hocko wrote:
> [...]
>>> @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>>> mz = mem_cgroup_zoneinfo(root, nid, zid);
>>> iter = &mz->reclaim_iter[reclaim->priority];
>>> spin_lock(&iter->iter_lock);
>>> + last_visited = iter->last_visited;
>>> if (prev && reclaim->generation != iter->generation) {
>>> + if (last_visited) {
>>> + mem_cgroup_put(last_visited);
>>> + iter->last_visited = NULL;
>>> + }
>>> spin_unlock(&iter->iter_lock);
>>> return NULL;
>>> }
>>> - id = iter->position;
>>> }
>>>
>>> rcu_read_lock();
>>> - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
>>> - if (css) {
>>> - if (css == &root->css || css_tryget(css))
>>> - memcg = mem_cgroup_from_css(css);
>>> - } else
>>> - id = 0;
>>> - rcu_read_unlock();
>>> + /*
>>> + * Root is not visited by cgroup iterators so it needs a special
>>> + * treatment.
>>> + */
>>> + if (!last_visited) {
>>> + css = &root->css;
>>> + } else {
>>> + struct cgroup *next_cgroup;
>>> +
>>> + next_cgroup = cgroup_next_descendant_pre(
>>> + last_visited->css.cgroup,
>>> + root->css.cgroup);
>>
>> Maybe I am missing something but... last_visited is held by the memcg's
>> refcnt. The cgroup pointed to by css.cgroup is held by the cgroup's
>> refcnt, which can be dropped before the memcg is freed, so
>> last_visited->css.cgroup can be accessed outside of any RCU grace period.
>> Is this safe?
>
> Well spotted. You are right. What I need to do is to check that
> last_visited is alive and restart from the root if not. Something like
> the below (incremental patch on top of this one) should help, right?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30efd7e..c0a91a3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> spin_unlock(&iter->iter_lock);
> return NULL;
> }
> + /*
> + * memcg is still valid because we hold a reference but
> + * its cgroup might have vanished in the meantime so
> + * we have to double check it is alive and restart the
> + * tree walk otherwise.
> + */
> + if (last_visited && !css_tryget(&last_visited->css)) {
> + mem_cgroup_put(last_visited);
> + last_visited = NULL;
> + }
> }
>
> rcu_read_lock();
> @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> if (reclaim) {
> struct mem_cgroup *curr = memcg;
>
> - if (last_visited)
> + if (last_visited) {
> + css_put(&last_visited->css);
> mem_cgroup_put(last_visited);
> + }
>
> if (css && !memcg)
> curr = mem_cgroup_from_css(css);
>
I think this will work.
Thanks,
-Kame
On Wed 14-11-12 10:52:45, Tejun Heo wrote:
> Hello, Michal.
>
> On Wed, Nov 14, 2012 at 09:51:29AM +0100, Michal Hocko wrote:
> > > reclaim(root);
> > > for_each_descendent_pre()
> > > reclaim(descendant);
> >
> > We cannot do for_each_descendent_pre here because we do not iterate
> > through the whole hierarchy all the time. Check shrink_zone.
>
> I'm a bit confused. Why would that make any difference? Shouldn't it
> be just able to test the condition and continue?
Ohh, I misunderstood your proposal. So what you are suggesting is
to put all the logic we have in mem_cgroup_iter inside what you call
reclaim here + mem_cgroup_iter_break inside the loop, right?
I do not see how this would help us much. mem_cgroup_iter is not the
nicest piece of code but it handles quite complex requirements that we
have currently (css reference count, multiple reclaimers racing). So I
would rather keep it this way. Further simplifications are welcome of
course.
Is there any reason why you are not happy about using
cgroup_next_descendant_pre directly?
--
Michal Hocko
SUSE Labs
On Thu 15-11-12 13:12:38, KAMEZAWA Hiroyuki wrote:
> (2012/11/14 19:10), Michal Hocko wrote:
> >On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote:
> >>(2012/11/14 0:30), Michal Hocko wrote:
> >[...]
> >>>@@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >>> mz = mem_cgroup_zoneinfo(root, nid, zid);
> >>> iter = &mz->reclaim_iter[reclaim->priority];
> >>> spin_lock(&iter->iter_lock);
> >>>+ last_visited = iter->last_visited;
> >>> if (prev && reclaim->generation != iter->generation) {
> >>>+ if (last_visited) {
> >>>+ mem_cgroup_put(last_visited);
> >>>+ iter->last_visited = NULL;
> >>>+ }
> >>> spin_unlock(&iter->iter_lock);
> >>> return NULL;
> >>> }
> >>>- id = iter->position;
> >>> }
> >>>
> >>> rcu_read_lock();
> >>>- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> >>>- if (css) {
> >>>- if (css == &root->css || css_tryget(css))
> >>>- memcg = mem_cgroup_from_css(css);
> >>>- } else
> >>>- id = 0;
> >>>- rcu_read_unlock();
> >>>+ /*
> >>>+ * Root is not visited by cgroup iterators so it needs a special
> >>>+ * treatment.
> >>>+ */
> >>>+ if (!last_visited) {
> >>>+ css = &root->css;
> >>>+ } else {
> >>>+ struct cgroup *next_cgroup;
> >>>+
> >>>+ next_cgroup = cgroup_next_descendant_pre(
> >>>+ last_visited->css.cgroup,
> >>>+ root->css.cgroup);
> >>
> >>Maybe I am missing something but... last_visited is held by the memcg's
> >>refcnt. The cgroup pointed to by css.cgroup is held by the cgroup's
> >>refcnt, which can be dropped before the memcg is freed, so
> >>last_visited->css.cgroup can be accessed outside of any RCU grace period.
> >>Is this safe?
> >
> >Well spotted. You are right. What I need to do is to check that
> >last_visited is alive and restart from the root if not. Something like
> >the below (incremental patch on top of this one) should help, right?
> >
> >diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >index 30efd7e..c0a91a3 100644
> >--- a/mm/memcontrol.c
> >+++ b/mm/memcontrol.c
> >@@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > spin_unlock(&iter->iter_lock);
> > return NULL;
> > }
> >+ /*
> >+ * memcg is still valid because we hold a reference but
> >+ * its cgroup might have vanished in the meantime so
> >+ * we have to double check it is alive and restart the
> >+ * tree walk otherwise.
> >+ */
> >+ if (last_visited && !css_tryget(&last_visited->css)) {
> >+ mem_cgroup_put(last_visited);
> >+ last_visited = NULL;
> >+ }
> > }
> >
> > rcu_read_lock();
> >@@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > if (reclaim) {
> > struct mem_cgroup *curr = memcg;
> >
> >- if (last_visited)
> >+ if (last_visited) {
> >+ css_put(&last_visited->css);
> > mem_cgroup_put(last_visited);
> >+ }
> >
> > if (css && !memcg)
> > curr = mem_cgroup_from_css(css);
> >
>
> I think this will work.
Thanks for double checking. The updated patch:
---
From 3811de23a0306c229ce44dc655878431a4fdf449 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 13 Nov 2012 11:40:35 +0100
Subject: [PATCH] memcg: rework mem_cgroup_iter to use cgroup iterators
mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups' creation ordering. The only guarantee is that a parent node
is visited before its children.
Example
1) mkdir -p a a/d a/b/c
2) mkdir -p a/b/c a/d
Will create the same trees but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d
574bd9f7 (cgroup: implement generic child / descendant walk macros) has
introduced generic cgroup tree walkers which provide either pre-order
or post-order tree walk. This patch converts css->id based iteration
to a pre-order tree walk to keep the semantics of the original iterator
where parent is always visited before its subtree.
cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition and removal,
respectively. This implementation doesn't use those because a new memory
cgroup is fully initialized in mem_cgroup_create and css_tryget makes
sure that the group is alive when the iterator encounters it.
If the reclaim cookie is used we need to store the last visited group
into the iterator so we have to be careful that it doesn't disappear in
the meantime. The elevated reference count on the memcg keeps it alive
even though the group has vanished in the meantime. This is checked by
css_tryget and if the group is gone the iteration is restarted from the
beginning.
V2
- iter->last_visited needs to check whether the cgroup is still alive.
Thanks to Kamezawa for pointing this out.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 60 insertions(+), 14 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0fe5177..c0a91a3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -142,8 +142,8 @@ struct mem_cgroup_stat_cpu {
};
struct mem_cgroup_reclaim_iter {
- /* css_id of the last scanned hierarchy member */
- int position;
+ /* last scanned hierarchy member with elevated ref count */
+ struct mem_cgroup *last_visited;
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -1064,7 +1064,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup_reclaim_cookie *reclaim)
{
struct mem_cgroup *memcg = NULL;
- int id = 0;
+ struct mem_cgroup *last_visited = NULL;
if (mem_cgroup_disabled())
return NULL;
@@ -1073,7 +1073,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
root = root_mem_cgroup;
if (prev && !reclaim)
- id = css_id(&prev->css);
+ last_visited = prev;
if (prev && prev != root)
css_put(&prev->css);
@@ -1086,7 +1086,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *css = NULL;
if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1096,30 +1096,76 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
+ last_visited = iter->last_visited;
if (prev && reclaim->generation != iter->generation) {
+ if (last_visited) {
+ mem_cgroup_put(last_visited);
+ iter->last_visited = NULL;
+ }
spin_unlock(&iter->iter_lock);
return NULL;
}
- id = iter->position;
+ /*
+ * memcg is still valid because we hold a reference but
+ * its cgroup might have vanished in the meantime so
+ * we have to double check it is alive and restart the
+ * tree walk otherwise.
+ */
+ if (last_visited && !css_tryget(&last_visited->css)) {
+ mem_cgroup_put(last_visited);
+ last_visited = NULL;
+ }
}
rcu_read_lock();
- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
- if (css) {
- if (css == &root->css || css_tryget(css))
- memcg = mem_cgroup_from_css(css);
- } else
- id = 0;
- rcu_read_unlock();
+ /*
+ * Root is not visited by cgroup iterators so it needs a special
+ * treatment.
+ */
+ if (!last_visited) {
+ css = &root->css;
+ } else {
+ struct cgroup *next_cgroup;
+
+ next_cgroup = cgroup_next_descendant_pre(
+ last_visited->css.cgroup,
+ root->css.cgroup);
+ if (next_cgroup)
+ css = cgroup_subsys_state(next_cgroup,
+ mem_cgroup_subsys_id);
+ }
+
+ /*
+ * Even if we find a group we have to make sure it is alive.
+ * css && !memcg means that the groups should be skipped and
+ * we should continue the tree walk.
+ */
+ if (css == &root->css || (css && css_tryget(css)))
+ memcg = mem_cgroup_from_css(css);
if (reclaim) {
- iter->position = id;
+ struct mem_cgroup *curr = memcg;
+
+ if (last_visited) {
+ css_put(&last_visited->css);
+ mem_cgroup_put(last_visited);
+ }
+
+ if (css && !memcg)
+ curr = mem_cgroup_from_css(css);
+ if (curr)
+ mem_cgroup_get(curr);
+ iter->last_visited = curr;
+
if (!css)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&iter->iter_lock);
+ } else if (css && !memcg) {
+ last_visited = mem_cgroup_from_css(css);
}
+ rcu_read_unlock();
if (prev && !css)
return NULL;
--
1.7.10.4
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote:
> > I'm a bit confused. Why would that make any difference? Shouldn't it
> > be just able to test the condition and continue?
>
> Ohh, I misunderstood your proposal. So what you are suggesting is
> to put all the logic we have in mem_cgroup_iter inside what you call
> reclaim here + mem_cgroup_iter_break inside the loop, right?
>
> I do not see how this would help us much. mem_cgroup_iter is not the
> nicest piece of code but it handles quite complex requirements that we
> have currently (css reference count, multiple reclaimers racing). So I
> would rather keep it this way. Further simplifications are welcome of
> course.
>
> Is there any reason why you are not happy about using
> cgroup_next_descendant_pre directly?
Because I'd like to consider the next functions an implementation
detail, and having iterations structured as loops tends to read better
and be less error-prone. e.g. when you use the next functions directly, it's
way easier to circumvent locking requirements in a way which isn't
very obvious. So, unless it messes up the code too much (and I can't
see why it would), I'd much prefer if memcg used for_each_*() macros.
Thanks.
--
tejun
On Thu 15-11-12 06:47:32, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Nov 15, 2012 at 10:51:03AM +0100, Michal Hocko wrote:
> > > I'm a bit confused. Why would that make any difference? Shouldn't it
> > > be just able to test the condition and continue?
> >
> > Ohh, I misunderstood your proposal. So what you are suggesting is
> > to put all the logic we have in mem_cgroup_iter inside what you call
> > reclaim here + mem_cgroup_iter_break inside the loop, right?
> >
> > I do not see how this would help us much. mem_cgroup_iter is not the
> > nicest piece of code but it handles quite complex requirements that we
> > have currently (css reference count, multiple reclaimers racing). So I
> > would rather keep it this way. Further simplifications are welcome of
> > course.
> >
> > Is there any reason why you are not happy about using
> > cgroup_next_descendant_pre directly?
>
> Because I'd like to consider the next functions as an implementation
> detail, and having iterations structured as loops tends to read better
> and be less error-prone. E.g., when you use the next functions directly, it's
> way easier to circumvent locking requirements in a way which isn't
> very obvious.
The whole point behind mem_cgroup_iter is to hide all the complexity
behind memcg iteration. Memcg code either uses for_each_mem_cgroup_tree
for the !reclaim case or mem_cgroup_iter otherwise.
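Just for illustration, the caller side then reduces to something like
this (a simplified sketch of the shrink_zone pattern, not the exact
in-tree code; done_enough stands for whatever abort condition the
caller has):

	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = sc->priority,
	};
	struct mem_cgroup *memcg;

	memcg = mem_cgroup_iter(root, NULL, &reclaim);
	do {
		/* reclaim from the per-zone LRU lists of this memcg */
		if (done_enough) {
			/* drops the reference the iterator holds */
			mem_cgroup_iter_break(root, memcg);
			break;
		}
		memcg = mem_cgroup_iter(root, memcg, &reclaim);
	} while (memcg);

All of the css reference counting and generation handling stays behind
mem_cgroup_iter.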
> So, unless it messes up the code too much (and I can't see why it
> would), I'd much prefer if memcg used for_each_*() macros.
As I said, this would mean that the current mem_cgroup_iter code would
have to be inverted, which doesn't simplify the code much. I'd rather
hide all the gross details inside the memcg iterator.
Or am I still missing your suggestion?
--
Michal Hocko
SUSE Labs
Hello, Michal.
On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote:
> > Because I'd like to consider the next functions as an implementation
> > detail, and having iterations structured as loops tends to read better
> > and be less error-prone. E.g., when you use the next functions directly, it's
> > way easier to circumvent locking requirements in a way which isn't
> > very obvious.
>
> The whole point behind mem_cgroup_iter is to hide all the complexity
> behind memcg iteration. Memcg code either uses for_each_mem_cgroup_tree
> for the !reclaim case or mem_cgroup_iter otherwise.
>
> > So, unless it messes up the code too much (and I can't see why it
> > would), I'd much prefer if memcg used for_each_*() macros.
>
> As I said, this would mean that the current mem_cgroup_iter code would
> have to be inverted, which doesn't simplify the code much. I'd rather
> hide all the gross details inside the memcg iterator.
> Or am I still missing your suggestion?
One way or the other, I don't think the code complexity would change
much. Again, I'd much *prefer* if memcg used what other controllers
would be using, but that's a preference and if necessary we can keep
the next functions as exposed APIs. I think the issue I have is that
I can't see much technical justification for that. If the code
becomes much simpler by choosing one over the other, sure, but is that
the case here? Isn't it mostly just about where to put the same
things? If so, what would be the rationale for requiring a different
interface?
Thanks.
--
tejun
On Thu 15-11-12 07:31:24, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, Nov 15, 2012 at 04:12:55PM +0100, Michal Hocko wrote:
> > > Because I'd like to consider the next functions as an implementation
> > > detail, and having iterations structured as loops tends to read better
> > > and be less error-prone. E.g., when you use the next functions directly, it's
> > > way easier to circumvent locking requirements in a way which isn't
> > > very obvious.
> >
> > The whole point behind mem_cgroup_iter is to hide all the complexity
> > behind memcg iteration. Memcg code either uses for_each_mem_cgroup_tree
> > for the !reclaim case or mem_cgroup_iter otherwise.
> >
> > > So, unless it messes up the code too much (and I can't see why it
> > > would), I'd much prefer if memcg used for_each_*() macros.
> >
> > As I said, this would mean that the current mem_cgroup_iter code would
> > have to be inverted, which doesn't simplify the code much. I'd rather
> > hide all the gross details inside the memcg iterator.
> > Or am I still missing your suggestion?
>
> One way or the other, I don't think the code complexity would change
> much. Again, I'd much *prefer* if memcg used what other controllers
> would be using, but that's a preference and if necessary we can keep
> the next functions as exposed APIs.
Yes please.
> I think the issue I have is that I can't see much technical
> justification for that. If the code becomes much simpler by choosing
> one over the other, sure, but is that the case here?
Yes, and I've tried to say that already. Memcg needs iteration which is
aware of the hierarchy, css reference counting and concurrent reclaim
(per-zone, per-priority). All of that is currently hidden in
mem_cgroup_iter so the caller doesn't have to care about it at all,
which keeps shrink_zone from caring about memcg that much.
cgroup_for_each_descendant_pre is not suitable at least because it
doesn't provide a way to start the walk at a selected node (which is
shared per-zone, per-priority in the memcg case).
Even if cgroup_for_each_descendant_pre had a start parameter there
is still a lot of housekeeping that callers would have to handle
(css_tryget to start with, updating the cached position, not to mention
the use_hierarchy thingy or mem_cgroup_disabled).
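Very roughly, the open-coded variant would have to look something like
this (an untested sketch; the start-aware macro
cgroup_for_each_descendant_pre_from and start_cgroup are made up,
everything else is the current API):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre_from(pos, start_cgroup,
					    root->css.cgroup) {
		struct cgroup_subsys_state *css;

		css = cgroup_subsys_state(pos, mem_cgroup_subsys_id);
		/* the group might be on its way out, skip it then */
		if (!css_tryget(css))
			continue;
		/* reclaim, update the cached position, css_put, ... */
	}
	rcu_read_unlock();

and every reclaim path would have to get all of that right.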
We also try to pollute mm/vmscan.c as little as possible so we
definitely do not want to bring all this into shrink_zone.
This all sounds like too much of a hassle to expose, so I would
really like to stay with mem_cgroup_iter and slowly simplify it until it
can go away (if that is possible at all).
> Isn't it mostly just about where to put the same things?
Unfortunately no. We wouldn't need to grow our own iterator in such a case.
> If so, what would be the rationale for requiring a different
> interface?
Does the above explain it?
--
Michal Hocko
SUSE Labs
On Wed 14-11-12 11:10:52, Michal Hocko wrote:
> On Wed 14-11-12 09:20:03, KAMEZAWA Hiroyuki wrote:
> > (2012/11/14 0:30), Michal Hocko wrote:
> [...]
> > > @@ -1096,30 +1096,64 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > > mz = mem_cgroup_zoneinfo(root, nid, zid);
> > > iter = &mz->reclaim_iter[reclaim->priority];
> > > spin_lock(&iter->iter_lock);
> > > + last_visited = iter->last_visited;
> > > if (prev && reclaim->generation != iter->generation) {
> > > + if (last_visited) {
> > > + mem_cgroup_put(last_visited);
> > > + iter->last_visited = NULL;
> > > + }
> > > spin_unlock(&iter->iter_lock);
> > > return NULL;
> > > }
> > > - id = iter->position;
> > > }
> > >
> > > rcu_read_lock();
> > > - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> > > - if (css) {
> > > - if (css == &root->css || css_tryget(css))
> > > - memcg = mem_cgroup_from_css(css);
> > > - } else
> > > - id = 0;
> > > - rcu_read_unlock();
> > > + /*
> > > + * Root is not visited by cgroup iterators so it needs special
> > > + * treatment.
> > > + */
> > > + if (!last_visited) {
> > > + css = &root->css;
> > > + } else {
> > > + struct cgroup *next_cgroup;
> > > +
> > > + next_cgroup = cgroup_next_descendant_pre(
> > > + last_visited->css.cgroup,
> > > + root->css.cgroup);
> >
> > Maybe I am missing something but... last_visited is held by the memcg's
> > refcnt. The cgroup pointed to by css.cgroup is held by the cgroup's
> > refcnt, which can be dropped before the memcg is freed, and
> > last_visited->css.cgroup is outside any RCU grace period. Is this safe?
>
> Good spot. You are right. What I need to do is to check that
> last_visited is alive and restart from the root if not. Something like
> the below (incremental patch on top of this one) should help, right?
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30efd7e..c0a91a3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1105,6 +1105,16 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> spin_unlock(&iter->iter_lock);
> return NULL;
> }
> + /*
> + * memcg is still valid because we hold a reference but
> + * its cgroup might have vanished in the meantime so
> + * we have to double check it is alive and restart the
> + * tree walk otherwise.
> + */
> + if (last_visited && !css_tryget(&last_visited->css)) {
> + mem_cgroup_put(last_visited);
> + last_visited = NULL;
> + }
> }
>
> rcu_read_lock();
> @@ -1136,8 +1146,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> if (reclaim) {
> struct mem_cgroup *curr = memcg;
>
> - if (last_visited)
> + if (last_visited) {
> + css_put(&last_visited->css);
> mem_cgroup_put(last_visited);
> + }
>
> if (css && !memcg)
> curr = mem_cgroup_from_css(css);
Now that I think about it again it seems that this is more complicated
than necessary. It should be sufficient to hold a css reference for
iter->last_visited because that makes sure the cgroup won't go away,
same as the mem_cgroup.
Memcg reference counting + css_tryget just makes the situation more
complicated because it forces us to restart the iteration on css_tryget
failure; the cgroup is already gone and we have no position to continue
from other than starting all over again. Which is, ehmm, _really_ ugly.
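IOW the position caching would boil down to something like this (an
untested sketch of the idea, not the final patch):

	/* forget the old position... */
	if (last_visited)
		css_put(&last_visited->css);
	/* ...and pin the new one by a css reference only */
	if (memcg)
		css_get(&memcg->css);
	iter->last_visited = memcg;

with no css_tryget failure to recover from when the cached position is
reused.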
I will repost the updated version sometime this week after it passes
some testing.
--
Michal Hocko
SUSE Labs
On Tue 13-11-12 16:30:36, Michal Hocko wrote:
[...]
> + /*
> + * Root is not visited by cgroup iterators so it needs special
> + * treatment.
> + */
> + if (!last_visited) {
> + css = &root->css;
> + } else {
> + struct cgroup *next_cgroup;
> +
> + next_cgroup = cgroup_next_descendant_pre(
> + last_visited->css.cgroup,
> + root->css.cgroup);
> + if (next_cgroup)
> + css = cgroup_subsys_state(next_cgroup,
> + mem_cgroup_subsys_id);
This is not correct because cgroup_next_descendant_pre expects pos to be
NULL for the first iteration, but the way we iterate (visiting the root
first) means that the second iteration will have last_visited != NULL,
and if the root doesn't have any children the walk would run off into an
endless loop. We need something like:

	/* the root itself is not part of the descendant walk */
	struct cgroup *prev_cgroup = (last_visited == root) ? NULL
		: last_visited->css.cgroup;
	next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
			root->css.cgroup);
--
Michal Hocko
SUSE Labs