2012-11-26 18:48:16

by Michal Hocko

Subject: rework mem_cgroup iterator

Hi all,
this is a second version of the patchset previously posted here:
https://lkml.org/lkml/2012/11/13/335

The patch set tries to make mem_cgroup_iter saner in the way it
walks hierarchies. css->id based traversal is far from ideal because it
is not deterministic; the walk depends on the creation ordering.

Diffstat looks promising but it is fair to say that the biggest cleanup is
just the css_get_next removal. The memcg code has grown a bit but I think
it is worth the resulting outcome (the sanity ;)).

The first patch fixes a potential misbehavior which I haven't seen in
practice, but the fix is needed for the later patches anyway. We could take
it alone as well but I do not have any bug report to base the fix on. The
second one is also preparatory and it is new to the series.

The third patch replaces css_get_next by the cgroup iterators which are
scheduled for 3.8 in Tejun's tree, so I depend on the following two patches:
fe1e904c cgroup: implement generic child / descendant walk macros
7e187c6c cgroup: use rculist ops for cgroup->children

The fourth and fifth patches are an attempt to simplify mem_cgroup_iter.
The css juggling is removed and the iteration logic is moved to a helper
so that the reference counting and the iteration are separated.

The last patch just removes css_get_next as there is no user for it any
longer.

I am also thinking that a leaf-to-root iteration would make more sense but
that patch is not included in the series yet because I have to think some
more about its justification.

I have dropped "[RFC 4/5] memcg: clean up mem_cgroup_iter"
(https://lkml.org/lkml/2012/11/13/333) because testing quickly showed
that my thinking was flawed and VM_BUG_ON(!prev && !memcg) triggered
very quickly. This also suggests that this version has seen some testing,
unlike the previous one ;)
The test was simple but I guess it exercised this code path quite heavily:
      A   (limit = 280M, use_hierarchy=true)
    / | \
   B  C  D (each with a 100M limit)

with an independent kernel build (with the full distribution config)
running in each child group. This triggers both the children-only and the
hierarchical reclaims.

The shortlog says:
Michal Hocko (6):
memcg: synchronize per-zone iterator access by a spinlock
memcg: keep prev's css alive for the whole mem_cgroup_iter
memcg: rework mem_cgroup_iter to use cgroup iterators
memcg: simplify mem_cgroup_iter
memcg: further simplify mem_cgroup_iter
cgroup: remove css_get_next

And diffstat:
include/linux/cgroup.h | 7 ---
kernel/cgroup.c | 49 ---------------------
mm/memcontrol.c | 110 +++++++++++++++++++++++++++++++++++++-----------
3 files changed, 86 insertions(+), 80 deletions(-)


2012-11-26 18:48:15

by Michal Hocko

Subject: [patch v2 1/6] memcg: synchronize per-zone iterator access by a spinlock

The per-zone per-priority iterator is aimed at coordinating concurrent
reclaimers working on the same hierarchy (or at the global reclaim when
all groups are reclaimed) so that all groups get reclaimed as evenly as
possible. iter->position holds the css->id of the last group visited and
iter->generation is incremented whenever a tree walk completes.
Concurrent reclaimers are supposed to provide a reclaim cookie which
holds the reclaim priority and the last generation they saw. If the
cookie's generation doesn't match the iterator's view then another
concurrent reclaimer has already done the job and the tree walk is done
for that priority.
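
For illustration, here is a minimal sketch (not part of the patch;
shrink_group() and the surrounding zone/priority variables are
placeholders) of how a reclaimer is expected to drive the shared
iterator with a cookie:

	struct mem_cgroup_reclaim_cookie reclaim = {
		.zone = zone,
		.priority = priority,
	};
	struct mem_cgroup *memcg;

	/*
	 * Walk the hierarchy under root; the shared per-zone iterator
	 * hands out the next group and returns NULL once this
	 * generation's round trip has been completed.
	 */
	for (memcg = mem_cgroup_iter(root, NULL, &reclaim);
	     memcg;
	     memcg = mem_cgroup_iter(root, memcg, &reclaim))
		shrink_group(memcg);	/* placeholder for per-group reclaim */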

This scheme works nicely in most cases but it is not raceless. Two
racing reclaimers can see the same iter->position and so bang on the
same group. The iter->generation increment is not serialized either, so a
reclaimer can see an updated iter->position with an old generation and
the iteration might be restarted from the root of the hierarchy.

The simplest way to fix this issue is to synchronize access to the
iterator by a lock. This implementation uses a per-zone per-priority
spinlock which linearizes only directly racing reclaimers which use
reclaim cookies, so the effect of the new locking should be really
minimal.

I have to note that I haven't seen this as a real issue so far. The
primary motivation for the change is different. The following patch
will change the way the iterator is implemented: the css->id based
iteration will be replaced by the generic cgroup iteration, which
requires storing a mem_cgroup pointer in the iterator. That in turn
requires reference counting, so concurrent access would be a problem.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ee2f7..0e52ec9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -148,6 +148,8 @@ struct mem_cgroup_reclaim_iter {
int position;
/* scan generation, increased every round-trip */
unsigned int generation;
+ /* lock to protect the position and generation */
+ spinlock_t iter_lock;
};

/*
@@ -1095,8 +1097,11 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,

mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
- if (prev && reclaim->generation != iter->generation)
+ spin_lock(&iter->iter_lock);
+ if (prev && reclaim->generation != iter->generation) {
+ spin_unlock(&iter->iter_lock);
return NULL;
+ }
id = iter->position;
}

@@ -1115,6 +1120,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
+ spin_unlock(&iter->iter_lock);
}

if (prev && !css)
@@ -5880,8 +5886,12 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
return 1;

for (zone = 0; zone < MAX_NR_ZONES; zone++) {
+ int prio;
+
mz = &pn->zoneinfo[zone];
lruvec_init(&mz->lruvec);
+ for (prio = 0; prio < DEF_PRIORITY + 1; prio++)
+ spin_lock_init(&mz->reclaim_iter[prio].iter_lock);
mz->usage_in_excess = 0;
mz->on_tree = false;
mz->memcg = memcg;
--
1.7.10.4

2012-11-26 18:48:14

by Michal Hocko

Subject: [patch v2 2/6] memcg: keep prev's css alive for the whole mem_cgroup_iter

The css reference counting keeps the cgroup alive even though it has
already been removed. mem_cgroup_iter relies on this fact and takes a
reference to the returned group. The reference is then released on the
next iteration or by mem_cgroup_iter_break.
mem_cgroup_iter currently releases the reference right after it gets the
last css_id. This is correct because neither prev's memcg nor its cgroup
is accessed after that. This will change in the next patch, so we need
to keep the group alive a bit longer; let's move the css_put to the end
of the function.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0e52ec9..1f5528d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1077,12 +1077,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (prev && !reclaim)
id = css_id(&prev->css);

- if (prev && prev != root)
- css_put(&prev->css);
-
if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
- return NULL;
+ goto out_css_put;
return root;
}

@@ -1100,7 +1097,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
spin_lock(&iter->iter_lock);
if (prev && reclaim->generation != iter->generation) {
spin_unlock(&iter->iter_lock);
- return NULL;
+ goto out_css_put;
}
id = iter->position;
}
@@ -1124,8 +1121,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}

if (prev && !css)
- return NULL;
+ goto out_css_put;
}
+out_css_put:
+ if (prev && prev != root)
+ css_put(&prev->css);
+
return memcg;
}

--
1.7.10.4

2012-11-26 18:48:40

by Michal Hocko

Subject: [patch v2 6/6] cgroup: remove css_get_next

Now that we have generic and well ordered cgroup tree walkers there is
no need to keep css_get_next in place.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/cgroup.h | 7 -------
kernel/cgroup.c | 49 ------------------------------------------------
2 files changed, 56 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 329eb46..ba46041 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -676,13 +676,6 @@ void free_css_id(struct cgroup_subsys *ss, struct cgroup_subsys_state *css);

struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id);

-/*
- * Get a cgroup whose id is greater than or equal to id under tree of root.
- * Returning a cgroup_subsys_state or NULL.
- */
-struct cgroup_subsys_state *css_get_next(struct cgroup_subsys *ss, int id,
- struct cgroup_subsys_state *root, int *foundid);
-
/* Returns true if root is ancestor of cg */
bool css_is_ancestor(struct cgroup_subsys_state *cg,
const struct cgroup_subsys_state *root);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d51958a..4d874b2 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -5230,55 +5230,6 @@ struct cgroup_subsys_state *css_lookup(struct cgroup_subsys *ss, int id)
}
EXPORT_SYMBOL_GPL(css_lookup);

-/**
- * css_get_next - lookup next cgroup under specified hierarchy.
- * @ss: pointer to subsystem
- * @id: current position of iteration.
- * @root: pointer to css. search tree under this.
- * @foundid: position of found object.
- *
- * Search next css under the specified hierarchy of rootid. Calling under
- * rcu_read_lock() is necessary. Returns NULL if it reaches the end.
- */
-struct cgroup_subsys_state *
-css_get_next(struct cgroup_subsys *ss, int id,
- struct cgroup_subsys_state *root, int *foundid)
-{
- struct cgroup_subsys_state *ret = NULL;
- struct css_id *tmp;
- int tmpid;
- int rootid = css_id(root);
- int depth = css_depth(root);
-
- if (!rootid)
- return NULL;
-
- BUG_ON(!ss->use_id);
- WARN_ON_ONCE(!rcu_read_lock_held());
-
- /* fill start point for scan */
- tmpid = id;
- while (1) {
- /*
- * scan next entry from bitmap(tree), tmpid is updated after
- * idr_get_next().
- */
- tmp = idr_get_next(&ss->idr, &tmpid);
- if (!tmp)
- break;
- if (tmp->depth >= depth && tmp->stack[depth] == rootid) {
- ret = rcu_dereference(tmp->css);
- if (ret) {
- *foundid = tmpid;
- break;
- }
- }
- /* continue to scan from next id */
- tmpid = tmpid + 1;
- }
- return ret;
-}
-
/*
* get corresponding css from file open on cgroupfs directory
*/
--
1.7.10.4

2012-11-26 18:48:56

by Michal Hocko

Subject: [patch v2 5/6] memcg: further simplify mem_cgroup_iter

mem_cgroup_iter currently does two things. It takes care of the
housekeeping (reference counting, reclaim cookie) and it iterates
through the hierarchy tree (using the generic cgroup tree walk).
The code would be much easier to follow if we moved the iteration
outside of the function (to __mem_cgrou_iter_next) so the distinction
is clearer.
This patch doesn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++-----------------------
1 file changed, 46 insertions(+), 33 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d1bc0e8..a5018bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1044,6 +1044,51 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
return memcg;
}

+/*
+ * Returns a next (in a pre-order walk) alive memcg (with elevated css
+ * ref. count) or NULL if the whole root's subtree has been visited.
+ *
+ * helper function to be used by mem_cgroup_iter
+ */
+static struct mem_cgroup *__mem_cgrou_iter_next(struct mem_cgroup *root,
+ struct mem_cgroup *last_visited)
+{
+ struct cgroup *prev_cgroup, *next_cgroup;
+
+ /*
+ * Root is not visited by cgroup iterators so it needs an
+ * explicit visit.
+ */
+ if (!last_visited)
+ return root;
+
+ prev_cgroup = (last_visited == root) ? NULL
+ : last_visited->css.cgroup;
+skip_node:
+ next_cgroup = cgroup_next_descendant_pre(
+ prev_cgroup, root->css.cgroup);
+
+ /*
+ * Even if we found a group we have to make sure it is
+ * alive. css && !memcg means that the groups should be
+ * skipped and we should continue the tree walk.
+ * last_visited css is safe to use because it is
+ * protected by css_get and the tree walk is rcu safe.
+ */
+ if (next_cgroup) {
+ struct mem_cgroup *mem = mem_cgroup_from_cont(
+ next_cgroup);
+ if (css_tryget(&mem->css))
+ return mem;
+ else {
+ prev_cgroup = next_cgroup;
+ goto skip_node;
+ }
+ }
+
+ return NULL;
+}
+
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -1106,39 +1151,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}
}

- /*
- * Root is not visited by cgroup iterators so it needs an
- * explicit visit.
- */
- if (!last_visited) {
- memcg = root;
- } else {
- struct cgroup *prev_cgroup, *next_cgroup;
-
- prev_cgroup = (last_visited == root) ? NULL
- : last_visited->css.cgroup;
-skip_node:
- next_cgroup = cgroup_next_descendant_pre(
- prev_cgroup, root->css.cgroup);
-
- /*
- * Even if we found a group we have to make sure it is
- * alive. css && !memcg means that the groups should be
- * skipped and we should continue the tree walk.
- * last_visited css is safe to use because it is
- * protected by css_get and the tree walk is rcu safe.
- */
- if (next_cgroup) {
- struct mem_cgroup *mem = mem_cgroup_from_cont(
- next_cgroup);
- if (css_tryget(&mem->css))
- memcg = mem;
- else {
- prev_cgroup = next_cgroup;
- goto skip_node;
- }
- }
- }
+ memcg = __mem_cgrou_iter_next(root, last_visited);

if (reclaim) {
if (last_visited)
--
1.7.10.4

2012-11-26 18:49:19

by Michal Hocko

Subject: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

mem_cgroup_iter currently relies on css->id when walking down a group
hierarchy tree. This is really awkward because the tree walk depends on
the groups' creation ordering. The only guarantee is that a parent node
is visited before its children.
Example:
1) mkdir -p a a/d a/b/c
2) mkdir -p a a/b/c a/d
Both create the same tree but the tree walks will be different:
1) a, d, b, c
2) a, b, c, d

574bd9f7 (cgroup: implement generic child / descendant walk macros)
introduced generic cgroup tree walkers which provide either a pre-order
or a post-order tree walk. This patch converts the css->id based
iteration to a pre-order tree walk to keep the semantics of the original
iterator, where a parent is always visited before its subtree.
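
For reference, a minimal sketch of how the new pre-order walker is used
(visit() is a placeholder; root is the memcg hierarchy root as below):

	struct cgroup *pos;

	rcu_read_lock();
	cgroup_for_each_descendant_pre(pos, root->css.cgroup) {
		/* a parent is always visited before any of its children */
		visit(pos);
	}
	rcu_read_unlock();

Note that the walk does not visit root->css.cgroup itself, which is why
mem_cgroup_iter has to visit the root explicitly.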

cgroup_for_each_descendant_pre suggests using post_create and
pre_destroy for proper synchronization with group addition resp.
removal. This implementation doesn't use those because a new memory
cgroup is fully initialized in mem_cgroup_create, and the css reference
counting enforces that the group is alive for both the last seen cgroup
and the found one, resp. signals that the group is dead and should
be skipped.

If the reclaim cookie is used we need to store the last visited group
in the iterator, so we have to be careful that it doesn't disappear in
the meantime. An elevated reference count on the css keeps it alive even
though the group has been removed (parked waiting for the last dput so
that it can be freed).

V2
- use css_{get,put} for iter->last_visited rather than
mem_cgroup_{get,put} because it is stronger wrt. the cgroup life cycle
- cgroup_next_descendant_pre expects a NULL pos for the first iteration,
otherwise it might loop endlessly for an intermediate node without any
children.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 57 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f5528d..6bcc97b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu {
};

struct mem_cgroup_reclaim_iter {
- /* css_id of the last scanned hierarchy member */
- int position;
+ /* last scanned hierarchy member with elevated css ref count */
+ struct mem_cgroup *last_visited;
/* scan generation, increased every round-trip */
unsigned int generation;
/* lock to protect the position and generation */
@@ -1066,7 +1066,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup_reclaim_cookie *reclaim)
{
struct mem_cgroup *memcg = NULL;
- int id = 0;
+ struct mem_cgroup *last_visited = NULL;

if (mem_cgroup_disabled())
return NULL;
@@ -1075,7 +1075,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
root = root_mem_cgroup;

if (prev && !reclaim)
- id = css_id(&prev->css);
+ last_visited = prev;

if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
@@ -1083,9 +1083,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
return root;
}

+ rcu_read_lock();
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *css = NULL;

if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1095,34 +1096,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
mz = mem_cgroup_zoneinfo(root, nid, zid);
iter = &mz->reclaim_iter[reclaim->priority];
spin_lock(&iter->iter_lock);
+ last_visited = iter->last_visited;
if (prev && reclaim->generation != iter->generation) {
+ if (last_visited) {
+ css_put(&last_visited->css);
+ iter->last_visited = NULL;
+ }
spin_unlock(&iter->iter_lock);
- goto out_css_put;
+ goto out_unlock;
}
- id = iter->position;
}

- rcu_read_lock();
- css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
- if (css) {
- if (css == &root->css || css_tryget(css))
- memcg = mem_cgroup_from_css(css);
- } else
- id = 0;
- rcu_read_unlock();
+ /*
+ * Root is not visited by cgroup iterators so it needs an
+ * explicit visit.
+ */
+ if (!last_visited) {
+ css = &root->css;
+ } else {
+ struct cgroup *prev_cgroup, *next_cgroup;
+
+ prev_cgroup = (last_visited == root) ? NULL
+ : last_visited->css.cgroup;
+ next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
+ root->css.cgroup);
+ if (next_cgroup)
+ css = cgroup_subsys_state(next_cgroup,
+ mem_cgroup_subsys_id);
+ }
+
+ /*
+ * Even if we found a group we have to make sure it is alive.
+ * css && !memcg means that the groups should be skipped and
+ * we should continue the tree walk.
+ * last_visited css is safe to use because it is protected by
+ * css_get and the tree walk is rcu safe.
+ */
+ if (css == &root->css || (css && css_tryget(css)))
+ memcg = mem_cgroup_from_css(css);

if (reclaim) {
- iter->position = id;
+ struct mem_cgroup *curr = memcg;
+
+ if (last_visited)
+ css_put(&last_visited->css);
+
+ if (css && !memcg)
+ curr = mem_cgroup_from_css(css);
+
+ /* make sure that the cached memcg is not removed */
+ if (curr)
+ css_get(&curr->css);
+ iter->last_visited = curr;
+
if (!css)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&iter->iter_lock);
+ } else if (css && !memcg) {
+ last_visited = mem_cgroup_from_css(css);
}

if (prev && !css)
- goto out_css_put;
+ goto out_unlock;
}
+out_unlock:
+ rcu_read_unlock();
out_css_put:
if (prev && prev != root)
css_put(&prev->css);
--
1.7.10.4

2012-11-26 18:49:18

by Michal Hocko

Subject: [patch v2 4/6] memcg: simplify mem_cgroup_iter

The current implementation of mem_cgroup_iter has to consider both css
and memcg to find out whether no group has been found (css==NULL, aka
the loop is completed) and whether no memcg is associated with the found
node (!memcg, aka css_tryget failed because the group is no longer
alive). This leads to awkward tweaks like testing for css && !memcg to
skip the current node.

It will be much easier if we get rid of the css variable altogether and
only rely on memcg. In order to do that the iteration part has to skip
dead nodes. This sounds natural to me, and as a nice side effect we get
a simple invariant: memcg is always alive when non-NULL, and otherwise
all nodes have been visited.

We could get rid of the surrounding while loop but we keep it for now to
make review easier. It will go away in the following patch.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 56 +++++++++++++++++++++++++++----------------------------
1 file changed, 27 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6bcc97b..d1bc0e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1086,7 +1086,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
rcu_read_lock();
while (!memcg) {
struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
- struct cgroup_subsys_state *css = NULL;

if (reclaim) {
int nid = zone_to_nid(reclaim->zone);
@@ -1112,53 +1111,52 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
* explicit visit.
*/
if (!last_visited) {
- css = &root->css;
+ memcg = root;
} else {
struct cgroup *prev_cgroup, *next_cgroup;

prev_cgroup = (last_visited == root) ? NULL
: last_visited->css.cgroup;
- next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
- root->css.cgroup);
- if (next_cgroup)
- css = cgroup_subsys_state(next_cgroup,
- mem_cgroup_subsys_id);
- }
+skip_node:
+ next_cgroup = cgroup_next_descendant_pre(
+ prev_cgroup, root->css.cgroup);

- /*
- * Even if we found a group we have to make sure it is alive.
- * css && !memcg means that the groups should be skipped and
- * we should continue the tree walk.
- * last_visited css is safe to use because it is protected by
- * css_get and the tree walk is rcu safe.
- */
- if (css == &root->css || (css && css_tryget(css)))
- memcg = mem_cgroup_from_css(css);
+ /*
+ * Even if we found a group we have to make sure it is
+ * alive. css && !memcg means that the groups should be
+ * skipped and we should continue the tree walk.
+ * last_visited css is safe to use because it is
+ * protected by css_get and the tree walk is rcu safe.
+ */
+ if (next_cgroup) {
+ struct mem_cgroup *mem = mem_cgroup_from_cont(
+ next_cgroup);
+ if (css_tryget(&mem->css))
+ memcg = mem;
+ else {
+ prev_cgroup = next_cgroup;
+ goto skip_node;
+ }
+ }
+ }

if (reclaim) {
- struct mem_cgroup *curr = memcg;
-
if (last_visited)
css_put(&last_visited->css);

- if (css && !memcg)
- curr = mem_cgroup_from_css(css);
-
/* make sure that the cached memcg is not removed */
- if (curr)
- css_get(&curr->css);
- iter->last_visited = curr;
+ if (memcg)
+ css_get(&memcg->css);
+ iter->last_visited = memcg;

- if (!css)
+ if (!memcg)
iter->generation++;
else if (!prev && memcg)
reclaim->generation = iter->generation;
spin_unlock(&iter->iter_lock);
- } else if (css && !memcg) {
- last_visited = mem_cgroup_from_css(css);
}

- if (prev && !css)
+ if (prev && !memcg)
goto out_unlock;
}
out_unlock:
--
1.7.10.4

2012-11-28 08:39:13

by Kamezawa Hiroyuki

Subject: Re: [patch v2 2/6] memcg: keep prev's css alive for the whole mem_cgroup_iter

(2012/11/27 3:47), Michal Hocko wrote:
> The css reference counting keeps the cgroup alive even though it has
> already been removed. mem_cgroup_iter relies on this fact and takes a
> reference to the returned group. The reference is then released on the
> next iteration or by mem_cgroup_iter_break.
> mem_cgroup_iter currently releases the reference right after it gets the
> last css_id. This is correct because neither prev's memcg nor its cgroup
> is accessed after that. This will change in the next patch, so we need
> to keep the group alive a bit longer; let's move the css_put to the end
> of the function.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-28 08:48:22

by Kamezawa Hiroyuki

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

(2012/11/27 3:47), Michal Hocko wrote:
> mem_cgroup_iter currently relies on css->id when walking down a group
> hierarchy tree. This is really awkward because the tree walk depends on
> the groups' creation ordering. The only guarantee is that a parent node
> is visited before its children.
> Example:
> 1) mkdir -p a a/d a/b/c
> 2) mkdir -p a a/b/c a/d
> Both create the same tree but the tree walks will be different:
> 1) a, d, b, c
> 2) a, b, c, d
>
> 574bd9f7 (cgroup: implement generic child / descendant walk macros)
> introduced generic cgroup tree walkers which provide either a pre-order
> or a post-order tree walk. This patch converts the css->id based
> iteration to a pre-order tree walk to keep the semantics of the original
> iterator, where a parent is always visited before its subtree.
>
> cgroup_for_each_descendant_pre suggests using post_create and
> pre_destroy for proper synchronization with group addition resp.
> removal. This implementation doesn't use those because a new memory
> cgroup is fully initialized in mem_cgroup_create, and the css reference
> counting enforces that the group is alive for both the last seen cgroup
> and the found one, resp. signals that the group is dead and should
> be skipped.
>
> If the reclaim cookie is used we need to store the last visited group
> in the iterator, so we have to be careful that it doesn't disappear in
> the meantime. An elevated reference count on the css keeps it alive even
> though the group has been removed (parked waiting for the last dput so
> that it can be freed).
>
> V2
> - use css_{get,put} for iter->last_visited rather than
> mem_cgroup_{get,put} because it is stronger wrt. the cgroup life cycle
> - cgroup_next_descendant_pre expects a NULL pos for the first iteration,
> otherwise it might loop endlessly for an intermediate node without any
> children.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 74 ++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 57 insertions(+), 17 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1f5528d..6bcc97b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -144,8 +144,8 @@ struct mem_cgroup_stat_cpu {
> };
>
> struct mem_cgroup_reclaim_iter {
> - /* css_id of the last scanned hierarchy member */
> - int position;
> + /* last scanned hierarchy member with elevated css ref count */
> + struct mem_cgroup *last_visited;
> /* scan generation, increased every round-trip */
> unsigned int generation;
> /* lock to protect the position and generation */
> @@ -1066,7 +1066,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup_reclaim_cookie *reclaim)
> {
> struct mem_cgroup *memcg = NULL;
> - int id = 0;
> + struct mem_cgroup *last_visited = NULL;
>
> if (mem_cgroup_disabled())
> return NULL;
> @@ -1075,7 +1075,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> root = root_mem_cgroup;
>
> if (prev && !reclaim)
> - id = css_id(&prev->css);
> + last_visited = prev;
>
> if (!root->use_hierarchy && root != root_mem_cgroup) {
> if (prev)
> @@ -1083,9 +1083,10 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> return root;
> }
>
> + rcu_read_lock();
> while (!memcg) {
> struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> - struct cgroup_subsys_state *css;
> + struct cgroup_subsys_state *css = NULL;
>
> if (reclaim) {
> int nid = zone_to_nid(reclaim->zone);
> @@ -1095,34 +1096,73 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> mz = mem_cgroup_zoneinfo(root, nid, zid);
> iter = &mz->reclaim_iter[reclaim->priority];
> spin_lock(&iter->iter_lock);
> + last_visited = iter->last_visited;
> if (prev && reclaim->generation != iter->generation) {
> + if (last_visited) {
> + css_put(&last_visited->css);
> + iter->last_visited = NULL;
> + }
> spin_unlock(&iter->iter_lock);
> - goto out_css_put;
> + goto out_unlock;
> }
> - id = iter->position;
> }
>
> - rcu_read_lock();
> - css = css_get_next(&mem_cgroup_subsys, id + 1, &root->css, &id);
> - if (css) {
> - if (css == &root->css || css_tryget(css))
> - memcg = mem_cgroup_from_css(css);
> - } else
> - id = 0;
> - rcu_read_unlock();
> + /*
> + * Root is not visited by cgroup iterators so it needs an
> + * explicit visit.
> + */
> + if (!last_visited) {
> + css = &root->css;
> + } else {
> + struct cgroup *prev_cgroup, *next_cgroup;
> +
> + prev_cgroup = (last_visited == root) ? NULL
> + : last_visited->css.cgroup;
> + next_cgroup = cgroup_next_descendant_pre(prev_cgroup,
> + root->css.cgroup);
> + if (next_cgroup)
> + css = cgroup_subsys_state(next_cgroup,
> + mem_cgroup_subsys_id);
> + }
> +
> + /*
> + * Even if we found a group we have to make sure it is alive.
> + * css && !memcg means that the groups should be skipped and
> + * we should continue the tree walk.
> + * last_visited css is safe to use because it is protected by
> + * css_get and the tree walk is rcu safe.
> + */
> + if (css == &root->css || (css && css_tryget(css)))
> + memcg = mem_cgroup_from_css(css);

Could you note somewhere that this iterator will never visit a
dangling (removed) memcg?
Hmm, I'm not sure, but it may be trouble when shrinking dangling
kmem_caches (slab).

Costa, what do you think?

I guess there is no problem with swap, and I am not against the way you
are going.


Thanks,
-Kame


>
> if (reclaim) {
> - iter->position = id;
> + struct mem_cgroup *curr = memcg;
> +
> + if (last_visited)
> + css_put(&last_visited->css);
> +
> + if (css && !memcg)
> + curr = mem_cgroup_from_css(css);
> +
> + /* make sure that the cached memcg is not removed */
> + if (curr)
> + css_get(&curr->css);
> + iter->last_visited = curr;
> +
> if (!css)
> iter->generation++;
> else if (!prev && memcg)
> reclaim->generation = iter->generation;
> spin_unlock(&iter->iter_lock);
> + } else if (css && !memcg) {
> + last_visited = mem_cgroup_from_css(css);
> }
>
> if (prev && !css)
> - goto out_css_put;
> + goto out_unlock;
> }
> +out_unlock:
> + rcu_read_unlock();
> out_css_put:
> if (prev && prev != root)
> css_put(&prev->css);
>

2012-11-28 08:52:48

by Kamezawa Hiroyuki

Subject: Re: [patch v2 4/6] memcg: simplify mem_cgroup_iter

(2012/11/27 3:47), Michal Hocko wrote:
> The current implementation of mem_cgroup_iter has to consider both css
> and memcg to find out whether no group has been found (css==NULL, aka
> the loop is completed) and whether no memcg is associated with the found
> node (!memcg, aka css_tryget failed because the group is no longer
> alive). This leads to awkward tweaks like testing for css && !memcg to
> skip the current node.
>
> It will be much easier if we get rid of the css variable altogether and
> only rely on memcg. In order to do that the iteration part has to skip
> dead nodes. This sounds natural to me, and as a nice side effect we get
> a simple invariant: memcg is always alive when non-NULL, and otherwise
> all nodes have been visited.
>
> We could get rid of the surrounding while loop but we keep it for now to
> make review easier. It will go away in the following patch.
>
> Signed-off-by: Michal Hocko <[email protected]>

Seems like a nice cleanup. I'll ack this once I ack 3/6 and we reach
agreement that the iterator will skip dead nodes.

Thanks,
-Kame

2012-11-28 09:17:51

by Michal Hocko

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
> (2012/11/27 3:47), Michal Hocko wrote:
[...]
> > + /*
> > + * Even if we found a group we have to make sure it is alive.
> > + * css && !memcg means that the groups should be skipped and
> > + * we should continue the tree walk.
> > + * last_visited css is safe to use because it is protected by
> > + * css_get and the tree walk is rcu safe.
> > + */
> > + if (css == &root->css || (css && css_tryget(css)))
> > + memcg = mem_cgroup_from_css(css);
>
> Could you note somewhere that this iterator will never visit a
> dangling (removed) memcg?

OK, I can add it to the function comment but the behavior hasn't changed
so I wouldn't like to confuse anybody.

> Hmm, I'm not sure, but it may be trouble when shrinking dangling
> kmem_caches (slab).

We do not shrink slab at all. Objects in a dead memcg wait for their
owner to release them, which will make the dangling group eventually go
away.

>
> Costa, what do you think?
>
> I guess there is no problem with swap, and I am not against the way you
> are going.

Yes, swap should be OK. Pages charged against a removed memcg will
fall back to the current task's mm (try_get_mem_cgroup_from_page and
__mem_cgroup_try_charge_swapin).

[...]
--
Michal Hocko
SUSE Labs

2012-11-28 09:24:14

by Glauber Costa

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

On 11/28/2012 01:17 PM, Michal Hocko wrote:
> On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
>> (2012/11/27 3:47), Michal Hocko wrote:
> [...]
>>> + /*
>>> + * Even if we found a group we have to make sure it is alive.
>>> + * css && !memcg means that the groups should be skipped and
>>> + * we should continue the tree walk.
>>> + * last_visited css is safe to use because it is protected by
>>> + * css_get and the tree walk is rcu safe.
>>> + */
>>> + if (css == &root->css || (css && css_tryget(css)))
>>> + memcg = mem_cgroup_from_css(css);
>>
>> Could you note somewhere that this iterator will never visit a
>> dangling (removed) memcg?
>
> OK, I can add it to the function comment but the behavior hasn't changed
> so I wouldn't like to confuse anybody.
>
>> Hmm, I'm not sure, but it may be trouble when shrinking dangling
>> kmem_caches (slab).
>
> We do not shrink slab at all.

yet. However...

> Objects in a dead memcg wait for their owner to release them, which
> will make the dangling group eventually go away.
>
>>
>> Costa, what do you think?
>>

In general, I particularly believe it is a good idea to skip dead memcgs
in the iterator. I don't anticipate any problems with shrinking at all.

Basically, we will only ever actively shrink when the memcg is dying, at
which point it is still alive.

After this, it's better to just leave it to vmscan. Whenever there is
pressure, it will go away.

2012-11-28 09:33:41

by Michal Hocko

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

On Wed 28-11-12 13:23:57, Glauber Costa wrote:
> On 11/28/2012 01:17 PM, Michal Hocko wrote:
> > On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
> >> (2012/11/27 3:47), Michal Hocko wrote:
> > [...]
> >>> + /*
> >>> + * Even if we found a group we have to make sure it is alive.
> >>> + * css && !memcg means that the groups should be skipped and
> >>> + * we should continue the tree walk.
> >>> + * last_visited css is safe to use because it is protected by
> >>> + * css_get and the tree walk is rcu safe.
> >>> + */
> >>> + if (css == &root->css || (css && css_tryget(css)))
> >>> + memcg = mem_cgroup_from_css(css);
> >>
> >> Could you note somewhere that this iterator will never visit a
> >> dangling (removed) memcg?
> >
> > OK, I can add it to the function comment but the behavior hasn't changed
> > so I wouldn't like to confuse anybody.
> >
> >> Hmm, I'm not sure, but it may be trouble when shrinking dangling
> >> kmem_caches (slab).
> >
> > We do not shrink slab at all.
>
> yet. However...
>
> > Objects in a dead memcg wait for their owner to release them, which
> > will make the dangling group eventually go away.
> >
> >>
> >> Costa, what do you think?
> >>
>
> In general, I particularly believe it is a good idea to skip dead memcgs
> in the iterator. I don't anticipate any problems with shrinking at all.

We cannot even iterate dead ones because their cgroups are gone and so
there is no way to walk them. So we either keep them alive by css_get
or we cannot iterate them.

> Basically, we will only ever actively shrink when the memcg is dying, at
> which point it is still alive.
> After this, it's better to just leave it to vmscan. Whenever there is
> pressure, it will go away.
--
Michal Hocko
SUSE Labs

2012-11-28 09:35:54

by Glauber Costa

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

On 11/28/2012 01:33 PM, Michal Hocko wrote:
> On Wed 28-11-12 13:23:57, Glauber Costa wrote:
>> On 11/28/2012 01:17 PM, Michal Hocko wrote:
>>> On Wed 28-11-12 17:47:59, KAMEZAWA Hiroyuki wrote:
>>>> (2012/11/27 3:47), Michal Hocko wrote:
>>> [...]
>>>>> + /*
>>>>> + * Even if we found a group we have to make sure it is alive.
>>>>> + * css && !memcg means that the groups should be skipped and
>>>>> + * we should continue the tree walk.
>>>>> + * last_visited css is safe to use because it is protected by
>>>>> + * css_get and the tree walk is rcu safe.
>>>>> + */
>>>>> + if (css == &root->css || (css && css_tryget(css)))
>>>>> + memcg = mem_cgroup_from_css(css);
>>>>
>>>> Could you note somewhere that this iterator will never visit a
>>>> dangling (removed) memcg?
>>>
>>> OK, I can add it to the function comment but the behavior hasn't changed
>>> so I wouldn't like to confuse anybody.
>>>
>>>> Hmm, I'm not sure, but it may be trouble when shrinking dangling
>>>> kmem_caches (slab).
>>>
>>> We do not shrink slab at all.
>>
>> yet. However...
>>
>>> Objects in a dead memcg wait for their owner to release them, which
>>> will make the dangling group eventually go away.
>>>
>>>>
>>>> Costa, what do you think?
>>>>
>>
>> In general, I particularly believe it is a good idea to skip dead memcgs
>> in the iterator. I don't anticipate any problems with shrinking at all.
>
> We cannot even iterate dead ones because their cgroups are gone and so
> there is no way to walk them. So we either keep them alive by css_get
> or we cannot iterate them.
>

We are in full agreement.

2012-11-30 04:07:57

by Kamezawa Hiroyuki

Subject: Re: [patch v2 3/6] memcg: rework mem_cgroup_iter to use cgroup iterators

(2012/11/27 3:47), Michal Hocko wrote:
> mem_cgroup_iter currently relies on css->id when walking down a group
> hierarchy tree. This is really awkward because the tree walk depends on
> the groups' creation ordering. The only guarantee is that a parent node
> is visited before its children.
> Example:
> 1) mkdir -p a a/d a/b/c
> 2) mkdir -p a a/b/c a/d
> Both create the same tree but the tree walks will be different:
> 1) a, d, b, c
> 2) a, b, c, d
>
> 574bd9f7 (cgroup: implement generic child / descendant walk macros)
> introduced generic cgroup tree walkers which provide either a pre-order
> or a post-order tree walk. This patch converts the css->id based
> iteration to a pre-order tree walk to keep the semantics of the original
> iterator, where a parent is always visited before its subtree.
>
> cgroup_for_each_descendant_pre suggests using post_create and
> pre_destroy for proper synchronization with group addition resp.
> removal. This implementation doesn't use those because a new memory
> cgroup is fully initialized in mem_cgroup_create, and the css reference
> counting enforces that the group is alive for both the last seen cgroup
> and the found one, resp. signals that the group is dead and should
> be skipped.
>
> If the reclaim cookie is used we need to store the last visited group
> in the iterator, so we have to be careful that it doesn't disappear in
> the meantime. An elevated reference count on the css keeps it alive even
> though the group has been removed (parked waiting for the last dput so
> that it can be freed).
>
> V2
> - use css_{get,put} for iter->last_visited rather than
> mem_cgroup_{get,put} because it is stronger wrt. the cgroup life cycle
> - cgroup_next_descendant_pre expects a NULL pos for the first iteration,
> otherwise it might loop endlessly for an intermediate node without any
> children.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-30 04:09:21

by Kamezawa Hiroyuki

Subject: Re: [patch v2 4/6] memcg: simplify mem_cgroup_iter

(2012/11/27 3:47), Michal Hocko wrote:
> The current implementation of mem_cgroup_iter has to consider both css
> and memcg to find out whether no group has been found (css==NULL, aka
> the loop is completed) and whether no memcg is associated with the found
> node (!memcg, aka css_tryget failed because the group is no longer
> alive). This leads to awkward tweaks like testing for css && !memcg to
> skip the current node.
>
> It will be much easier if we get rid of the css variable altogether and
> only rely on memcg. In order to do that the iteration part has to skip
> dead nodes. This sounds natural to me, and as a nice side effect we get
> a simple invariant: memcg is always alive when non-NULL, and otherwise
> all nodes have been visited.
>
> We could get rid of the surrounding while loop but we keep it for now to
> make review easier. It will go away in the following patch.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-30 04:10:55

by Kamezawa Hiroyuki

Subject: Re: [patch v2 5/6] memcg: further simplify mem_cgroup_iter

(2012/11/27 3:47), Michal Hocko wrote:
> mem_cgroup_iter currently does two things. It takes care of the
> housekeeping (reference counting, reclaim cookie) and it iterates
> through the hierarchy tree (using the generic cgroup tree walk).
> The code would be much easier to follow if we moved the iteration
> outside of the function (to __mem_cgrou_iter_next) so the distinction
> is clearer.
> This patch doesn't introduce any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>

Looks very nice!

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-30 04:12:52

by Kamezawa Hiroyuki

Subject: Re: [patch v2 6/6] cgroup: remove css_get_next

(2012/11/27 3:47), Michal Hocko wrote:
> Now that we have generic and well ordered cgroup tree walkers there is
> no need to keep css_get_next in the place.
>
> Signed-off-by: Michal Hocko <[email protected]>

Hm, then the next thing will be css_is_ancestor() etc.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-11-30 08:18:36

by Michal Hocko

Subject: Re: [patch v2 6/6] cgroup: remove css_get_next

On Fri 30-11-12 13:12:29, KAMEZAWA Hiroyuki wrote:
> (2012/11/27 3:47), Michal Hocko wrote:
> > Now that we have generic and well ordered cgroup tree walkers there is
> > no need to keep css_get_next in the place.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Hm, then the next thing will be css_is_ancestor() etc.

Good point. I thought that the only remaining part was swap accounting.
It is not directly related to the iterators so I will send a separate
patch. Tejun said he has a replacement that could be used for the swap
accounting and then the whole css_id can be buried.

>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Thanks for the review of the whole patchset, Kame!
I would like to hear back from Johannes and then resubmit the series to
Andrew.
--
Michal Hocko
SUSE Labs

2012-11-30 09:08:48

by Glauber Costa

Subject: Re: [patch v2 5/6] memcg: further simplify mem_cgroup_iter

On 11/26/2012 10:47 PM, Michal Hocko wrote:
> The code would be much easier to follow if we moved the iteration
> outside of the function (to __mem_cgrou_iter_next) so the distinction
> is clearer.
Totally a nit: why is it called __mem_cgrou?

What happened to your p?

2012-11-30 10:23:30

by Michal Hocko

Subject: Re: [patch v2 5/6] memcg: further simplify mem_cgroup_iter

On Fri 30-11-12 13:08:35, Glauber Costa wrote:
> On 11/26/2012 10:47 PM, Michal Hocko wrote:
> > The code would be much easier to follow if we moved the iteration
> > outside of the function (to __mem_cgrou_iter_next) so the distinction
> > is clearer.
> Totally a nit: why is it called __mem_cgrou?
>
> What happened to your p?

It was a fight against p as a source of all evil but the fight is over
and we can put it back :p

Thanks for noticing
--
Michal Hocko
SUSE Labs