Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
once enough pages have been reclaimed, in which case, in contrast to a
full round-trip over a cgroup sub-tree, the current position stored in
mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
and so is left holding the reference to the last scanned cgroup. If the
target cgroup does not get scanned again (we might have just reclaimed
the last page, or all processes might exit and free their memory
voluntarily), we will leak it, because there is nobody to put the
reference held by the iterator.
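
For reference, here is a simplified sketch (paraphrased from the code this
patch removes, not a complete function) of the pre-patch iterator update
that creates the dangling reference: the shared iterator takes its own css
reference on the cached position, and that reference is only dropped when
the position is advanced again:

	if (reclaim) {
		/* advance the shared position only if nobody beat us to it */
		if (cmpxchg(&iter->position, pos, memcg) == pos) {
			if (memcg)
				css_get(&memcg->css);	/* iterator now pins memcg */
			if (pos)
				css_put(&pos->css);	/* old position released */
		}
	}

mem_cgroup_iter_break() only puts the caller's reference on prev, so the
reference pinned via iter->position stays until this sub-tree is scanned
again - which may never happen.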
The problem is easy to reproduce by running the following command
sequence in a loop:
mkdir /sys/fs/cgroup/memory/test
echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
memhog 150M
echo $$ > /sys/fs/cgroup/memory/cgroup.procs
rmdir test
The cgroups generated by it will never get freed.
This patch fixes the issue by making mem_cgroup_iter avoid taking a
reference to the current position. In order not to hit a use-after-free
bug while running reclaim in parallel with cgroup deletion, we make use
of the ->css_released cgroup callback to clear references to the dying
cgroup in all reclaim iterators that might refer to it. This callback is
called right before scheduling the RCU work that will free the css, so if
we access iter->position from an RCU read-side section, we can be sure it
won't go away under us.
Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: <[email protected]> # 3.19+
---
Changes in v2:
As pointed out by Johannes, clearing iter->position when interrupting
memcg reclaim, as was done in v1, would result in unfairly high pressure
being exerted on a parent cgroup in comparison to its children. So in v2
we go another way: instead of pinning the cgroup in the iterator, we clear
references to the dying cgroup in all iterators that might refer to it,
right before it is scheduled to be freed.
mm/memcontrol.c | 53 ++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 42 insertions(+), 11 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 87af26a24491..f42352369cbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (prev && reclaim->generation != iter->generation)
goto out_unlock;
- do {
+ while (1) {
pos = READ_ONCE(iter->position);
+ if (!pos || css_tryget(&pos->css))
+ break;
/*
- * A racing update may change the position and
- * put the last reference, hence css_tryget(),
- * or retry to see the updated position.
+ * css reference reached zero, so iter->position will
+ * be cleared by ->css_released. However, we should not
+ * rely on this happening soon, because ->css_released
+ * is called from a work queue, and by busy-waiting we
+ * might block it. So we clear iter->position right
+ * away.
*/
- } while (pos && !css_tryget(&pos->css));
+ cmpxchg(&iter->position, pos, NULL);
+ }
}
if (pos)
@@ -912,12 +918,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}
if (reclaim) {
- if (cmpxchg(&iter->position, pos, memcg) == pos) {
- if (memcg)
- css_get(&memcg->css);
- if (pos)
- css_put(&pos->css);
- }
+ cmpxchg(&iter->position, pos, memcg);
/*
* pairs with css_tryget when dereferencing iter->position
@@ -955,6 +956,28 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
css_put(&prev->css);
}
+static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
+{
+ struct mem_cgroup *memcg = dead_memcg;
+ struct mem_cgroup_reclaim_iter *iter;
+ struct mem_cgroup_per_zone *mz;
+ int nid, zid;
+ int i;
+
+ while ((memcg = parent_mem_cgroup(memcg))) {
+ for_each_node(nid) {
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
+ for (i = 0; i <= DEF_PRIORITY; i++) {
+ iter = &mz->iter[i];
+ cmpxchg(&iter->position,
+ dead_memcg, NULL);
+ }
+ }
+ }
+ }
+}
+
/*
* Iteration constructs for visiting all cgroups (under a tree). If
* loops are exited prematurely (break), mem_cgroup_iter_break() must
@@ -4375,6 +4398,13 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
wb_memcg_offline(memcg);
}
+static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ invalidate_reclaim_iterators(memcg);
+}
+
static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5229,6 +5259,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
.css_alloc = mem_cgroup_css_alloc,
.css_online = mem_cgroup_css_online,
.css_offline = mem_cgroup_css_offline,
+ .css_released = mem_cgroup_css_released,
.css_free = mem_cgroup_css_free,
.css_reset = mem_cgroup_css_reset,
.can_attach = mem_cgroup_can_attach,
--
2.1.4
On Tue, Dec 15, 2015 at 03:31:37PM +0300, Vladimir Davydov wrote:
> Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> once enough pages have been reclaimed, in which case, in contrast to a
> full round-trip over a cgroup sub-tree, the current position stored in
> mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> and so is left holding the reference to the last scanned cgroup. If the
> target cgroup does not get scanned again (we might have just reclaimed
> the last page or all processes might exit and free their memory
> voluntary), we will leak it, because there is nobody to put the
> reference held by the iterator.
>
> The problem is easy to reproduce by running the following command
> sequence in a loop:
>
> mkdir /sys/fs/cgroup/memory/test
> echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> memhog 150M
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> rmdir test
>
> The cgroups generated by it will never get freed.
>
> This patch fixes this issue by making mem_cgroup_iter avoid taking
> reference to the current position. In order not to hit use-after-free
> bug while running reclaim in parallel with cgroup deletion, we make use
> of ->css_released cgroup callback to clear references to the dying
> cgroup in all reclaim iterators that might refer to it. This callback is
> called right before scheduling rcu work which will free css, so if we
> access iter->position from rcu read section, we might be sure it won't
> go away under us.
>
> Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
> Signed-off-by: Vladimir Davydov <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: <[email protected]> # 3.19+
Acked-by: Johannes Weiner <[email protected]>
Full quote follows for [email protected].
> ---
> Changes in v2:
>
> As pointed out by Johannes, clearing iter->position when interrupting
> memcg reclaim, as it was done in v1, would result in unfairly high
> pressure exerted on a parent cgroup in comparison to its children. So in
> v2, we go another way - instead of pinning cgroup in iterator we clear
> references to dying cgroup in all iterators that might refer to it right
> before it is scheduled to be freed.
>
> mm/memcontrol.c | 53 ++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 42 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 87af26a24491..f42352369cbc 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> if (prev && reclaim->generation != iter->generation)
> goto out_unlock;
>
> - do {
> + while (1) {
> pos = READ_ONCE(iter->position);
> + if (!pos || css_tryget(&pos->css))
> + break;
> /*
> - * A racing update may change the position and
> - * put the last reference, hence css_tryget(),
> - * or retry to see the updated position.
> + * css reference reached zero, so iter->position will
> + * be cleared by ->css_released. However, we should not
> + * rely on this happening soon, because ->css_released
> + * is called from a work queue, and by busy-waiting we
> + * might block it. So we clear iter->position right
> + * away.
> */
> - } while (pos && !css_tryget(&pos->css));
> + cmpxchg(&iter->position, pos, NULL);
> + }
> }
>
> if (pos)
> @@ -912,12 +918,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> }
>
> if (reclaim) {
> - if (cmpxchg(&iter->position, pos, memcg) == pos) {
> - if (memcg)
> - css_get(&memcg->css);
> - if (pos)
> - css_put(&pos->css);
> - }
> + cmpxchg(&iter->position, pos, memcg);
>
> /*
> * pairs with css_tryget when dereferencing iter->position
> @@ -955,6 +956,28 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
> css_put(&prev->css);
> }
>
> +static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
> +{
> + struct mem_cgroup *memcg = dead_memcg;
> + struct mem_cgroup_reclaim_iter *iter;
> + struct mem_cgroup_per_zone *mz;
> + int nid, zid;
> + int i;
> +
> + while ((memcg = parent_mem_cgroup(memcg))) {
> + for_each_node(nid) {
> + for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> + mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
> + for (i = 0; i <= DEF_PRIORITY; i++) {
> + iter = &mz->iter[i];
> + cmpxchg(&iter->position,
> + dead_memcg, NULL);
> + }
> + }
> + }
> + }
> +}
> +
> /*
> * Iteration constructs for visiting all cgroups (under a tree). If
> * loops are exited prematurely (break), mem_cgroup_iter_break() must
> @@ -4375,6 +4398,13 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> wb_memcg_offline(memcg);
> }
>
> +static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> + invalidate_reclaim_iterators(memcg);
> +}
> +
> static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
> {
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> @@ -5229,6 +5259,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
> .css_alloc = mem_cgroup_css_alloc,
> .css_online = mem_cgroup_css_online,
> .css_offline = mem_cgroup_css_offline,
> + .css_released = mem_cgroup_css_released,
> .css_free = mem_cgroup_css_free,
> .css_reset = mem_cgroup_css_reset,
> .can_attach = mem_cgroup_can_attach,
> --
> 2.1.4
>
On Tue, 15 Dec 2015 15:31:37 +0300 Vladimir Davydov <[email protected]> wrote:
> Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> once enough pages have been reclaimed, in which case, in contrast to a
> full round-trip over a cgroup sub-tree, the current position stored in
> mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> and so is left holding the reference to the last scanned cgroup. If the
> target cgroup does not get scanned again (we might have just reclaimed
> the last page or all processes might exit and free their memory
> voluntary), we will leak it, because there is nobody to put the
> reference held by the iterator.
>
> The problem is easy to reproduce by running the following command
> sequence in a loop:
>
> mkdir /sys/fs/cgroup/memory/test
> echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> memhog 150M
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> rmdir test
>
> The cgroups generated by it will never get freed.
>
> This patch fixes this issue by making mem_cgroup_iter avoid taking
> reference to the current position. In order not to hit use-after-free
> bug while running reclaim in parallel with cgroup deletion, we make use
> of ->css_released cgroup callback to clear references to the dying
> cgroup in all reclaim iterators that might refer to it. This callback is
> called right before scheduling rcu work which will free css, so if we
> access iter->position from rcu read section, we might be sure it won't
> go away under us.
>
> ...
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> if (prev && reclaim->generation != iter->generation)
> goto out_unlock;
>
> - do {
> + while (1) {
> pos = READ_ONCE(iter->position);
> + if (!pos || css_tryget(&pos->css))
> + break;
> /*
> - * A racing update may change the position and
> - * put the last reference, hence css_tryget(),
> - * or retry to see the updated position.
> + * css reference reached zero, so iter->position will
> + * be cleared by ->css_released. However, we should not
> + * rely on this happening soon, because ->css_released
> + * is called from a work queue, and by busy-waiting we
> + * might block it. So we clear iter->position right
> + * away.
> */
> - } while (pos && !css_tryget(&pos->css));
> + cmpxchg(&iter->position, pos, NULL);
> + }
It's peculiar to use cmpxchg() without actually checking that it did
anything. Should we use xchg() here? And why aren't we using plain
old "=", come to that?
Perhaps it just needs a comment to defog things.
On Thu, Dec 17, 2015 at 03:02:17PM -0800, Andrew Morton wrote:
> On Tue, 15 Dec 2015 15:31:37 +0300 Vladimir Davydov <[email protected]> wrote:
>
> > Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> > once enough pages have been reclaimed, in which case, in contrast to a
> > full round-trip over a cgroup sub-tree, the current position stored in
> > mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> > and so is left holding the reference to the last scanned cgroup. If the
> > target cgroup does not get scanned again (we might have just reclaimed
> > the last page or all processes might exit and free their memory
> > voluntary), we will leak it, because there is nobody to put the
> > reference held by the iterator.
> >
> > The problem is easy to reproduce by running the following command
> > sequence in a loop:
> >
> > mkdir /sys/fs/cgroup/memory/test
> > echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> > echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> > memhog 150M
> > echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> > rmdir test
> >
> > The cgroups generated by it will never get freed.
> >
> > This patch fixes this issue by making mem_cgroup_iter avoid taking
> > reference to the current position. In order not to hit use-after-free
> > bug while running reclaim in parallel with cgroup deletion, we make use
> > of ->css_released cgroup callback to clear references to the dying
> > cgroup in all reclaim iterators that might refer to it. This callback is
> > called right before scheduling rcu work which will free css, so if we
> > access iter->position from rcu read section, we might be sure it won't
> > go away under us.
> >
> > ...
> >
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > if (prev && reclaim->generation != iter->generation)
> > goto out_unlock;
> >
> > - do {
> > + while (1) {
> > pos = READ_ONCE(iter->position);
> > + if (!pos || css_tryget(&pos->css))
> > + break;
> > /*
> > - * A racing update may change the position and
> > - * put the last reference, hence css_tryget(),
> > - * or retry to see the updated position.
> > + * css reference reached zero, so iter->position will
> > + * be cleared by ->css_released. However, we should not
> > + * rely on this happening soon, because ->css_released
> > + * is called from a work queue, and by busy-waiting we
> > + * might block it. So we clear iter->position right
> > + * away.
> > */
> > - } while (pos && !css_tryget(&pos->css));
> > + cmpxchg(&iter->position, pos, NULL);
> > + }
>
> It's peculiar to use cmpxchg() without actually checking that it did
> anything. Should we use xchg() here? And why aren't we using plain
> old "=", come to that?
Well, it's obvious why we need the 'compare' part - the iter could
already have been advanced by a competing process, in which case we
shouldn't touch it, otherwise we would reclaim some cgroup twice during
the same reclaim generation. However, it's not that clear why it must be
atomic. Before this patch, atomicity was necessary to guarantee that we
adjust the reference counters correctly, but now we don't do that
anymore. If a competing process happens to update iter->position between
the compare and set steps, we might at worst reclaim from the same cgroup
twice, and that is extremely unlikely to happen.
So I think we can replace the atomic operation with a non-atomic one,
like the patch below does. Any objections?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc25dc211eaf..42187e84b147 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -864,7 +864,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
* might block it. So we clear iter->position right
* away.
*/
- cmpxchg(&iter->position, pos, NULL);
+ if (iter->position == pos)
+ iter->position = NULL;
}
}
@@ -902,7 +903,15 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}
if (reclaim) {
- cmpxchg(&iter->position, pos, memcg);
+ /*
+ * The position could have already been updated by a competing
+ * thread, so check that the value hasn't changed since we read
+ * it. This operation doesn't need to be atomic, because a race
+ * is extremely unlikely and in the worst case can only result
+ * in the same cgroup reclaimed twice.
+ */
+ if (iter->position == pos)
+ iter->position = memcg;
/*
* pairs with css_tryget when dereferencing iter->position
On Fri, Dec 18, 2015 at 06:32:02PM +0300, Vladimir Davydov wrote:
> On Thu, Dec 17, 2015 at 03:02:17PM -0800, Andrew Morton wrote:
> > On Tue, 15 Dec 2015 15:31:37 +0300 Vladimir Davydov <[email protected]> wrote:
> > > @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > > if (prev && reclaim->generation != iter->generation)
> > > goto out_unlock;
> > >
> > > - do {
> > > + while (1) {
> > > pos = READ_ONCE(iter->position);
> > > + if (!pos || css_tryget(&pos->css))
> > > + break;
> > > /*
> > > - * A racing update may change the position and
> > > - * put the last reference, hence css_tryget(),
> > > - * or retry to see the updated position.
> > > + * css reference reached zero, so iter->position will
> > > + * be cleared by ->css_released. However, we should not
> > > + * rely on this happening soon, because ->css_released
> > > + * is called from a work queue, and by busy-waiting we
> > > + * might block it. So we clear iter->position right
> > > + * away.
> > > */
> > > - } while (pos && !css_tryget(&pos->css));
> > > + cmpxchg(&iter->position, pos, NULL);
> > > + }
> >
> > It's peculiar to use cmpxchg() without actually checking that it did
> > anything. Should we use xchg() here? And why aren't we using plain
> > old "=", come to that?
>
> Well, it's obvious why we need the 'compare' part - the iter could have
> been already advanced by a competing process, in which case we shouldn't
> touch it, otherwise we would reclaim some cgroup twice during the same
> reclaim generation. However, it's not that clear why it must be atomic.
> Before this patch, atomicity was necessary to guarantee that we adjust
> the reference counters correctly, but now we don't do it anymore. If a
> competing process happens to update iter->position between the compare
> and set steps, we might reclaim from the same cgroup twice at worst, and
> this extremely unlikely to happen.
>
> So I think we can replace the atomic operation with a non-atomic one,
> like the patch below does. Any objections?
I don't think the race window is actually that small and reclaiming a
group twice could cause sporadic latency issues in the victim group.
Think about the group not just trimming caches but already swapping.
The cmpxchg()s without checking the return values look odd without a
comment, but that doesn't mean that they're wrong in this situation:
advance the iterator from what we think is the current position, and
don't if somebody beat us to that. That's what cmpxchg() does. So I'd
rather we kept them here.
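
For the record, a conceptual model of what cmpxchg(&iter->position, pos,
memcg) does here - ignoring the atomicity, and not the real
implementation - might look like this:

	static void *cmpxchg_model(void **slot, void *expected, void *new)
	{
		void *cur = *slot;

		if (cur == expected)	/* advance only if nobody beat us to it */
			*slot = new;
		return cur;		/* these call sites don't need the old value */
	}

The whole point is the conditional store; whether the store actually
happened is irrelevant to the callers, hence no need to check the return
value.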
> @@ -902,7 +903,15 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> }
>
> if (reclaim) {
> - cmpxchg(&iter->position, pos, memcg);
> + /*
> + * The position could have already been updated by a competing
> + * thread, so check that the value hasn't changed since we read
> + * it. This operation doesn't need to be atomic, because a race
> + * is extremely unlikely and in the worst case can only result
> + * in the same cgroup reclaimed twice.
But it would be good to add the first half of that comment to the
cmpxchg to explain why we don't have to check the return value.
On Fri, Dec 18, 2015 at 11:00:41AM -0500, Johannes Weiner wrote:
> On Fri, Dec 18, 2015 at 06:32:02PM +0300, Vladimir Davydov wrote:
> > On Thu, Dec 17, 2015 at 03:02:17PM -0800, Andrew Morton wrote:
> > > On Tue, 15 Dec 2015 15:31:37 +0300 Vladimir Davydov <[email protected]> wrote:
> > > > @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > > > if (prev && reclaim->generation != iter->generation)
> > > > goto out_unlock;
> > > >
> > > > - do {
> > > > + while (1) {
> > > > pos = READ_ONCE(iter->position);
> > > > + if (!pos || css_tryget(&pos->css))
> > > > + break;
> > > > /*
> > > > - * A racing update may change the position and
> > > > - * put the last reference, hence css_tryget(),
> > > > - * or retry to see the updated position.
> > > > + * css reference reached zero, so iter->position will
> > > > + * be cleared by ->css_released. However, we should not
> > > > + * rely on this happening soon, because ->css_released
> > > > + * is called from a work queue, and by busy-waiting we
> > > > + * might block it. So we clear iter->position right
> > > > + * away.
> > > > */
> > > > - } while (pos && !css_tryget(&pos->css));
> > > > + cmpxchg(&iter->position, pos, NULL);
> > > > + }
> > >
> > > It's peculiar to use cmpxchg() without actually checking that it did
> > > anything. Should we use xchg() here? And why aren't we using plain
> > > old "=", come to that?
> >
> > Well, it's obvious why we need the 'compare' part - the iter could have
> > been already advanced by a competing process, in which case we shouldn't
> > touch it, otherwise we would reclaim some cgroup twice during the same
> > reclaim generation. However, it's not that clear why it must be atomic.
> > Before this patch, atomicity was necessary to guarantee that we adjust
> > the reference counters correctly, but now we don't do it anymore. If a
> > competing process happens to update iter->position between the compare
> > and set steps, we might reclaim from the same cgroup twice at worst, and
> > this extremely unlikely to happen.
> >
> > So I think we can replace the atomic operation with a non-atomic one,
> > like the patch below does. Any objections?
>
> I don't think the race window is actually that small and reclaiming a
> group twice could cause sporadic latency issues in the victim group.
> Think about the group not just trimming caches but already swapping.
>
> The cmpxchg()s without checking the return values look odd without a
> comment, but that doesn't mean that they're wrong in this situation:
> advance the iterator from what we think is the current position, and
> don't if somebody beat us to that. That's what cmpxchg() does. So I'd
> rather we kept them here.
>
> > @@ -902,7 +903,15 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > }
> >
> > if (reclaim) {
> > - cmpxchg(&iter->position, pos, memcg);
> > + /*
> > + * The position could have already been updated by a competing
> > + * thread, so check that the value hasn't changed since we read
> > + * it. This operation doesn't need to be atomic, because a race
> > + * is extremely unlikely and in the worst case can only result
> > + * in the same cgroup reclaimed twice.
>
> But it would be good to add the first half of that comment to the
> cmpxchg to explain why we don't have to check the return value.
>
OK, got it, thanks. Here is the incremental patch (it should also fix
the warning about the unused cmpxchg return value):
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc25dc211eaf..908c075e04eb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -864,7 +864,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
* might block it. So we clear iter->position right
* away.
*/
- cmpxchg(&iter->position, pos, NULL);
+ (void)cmpxchg(&iter->position, pos, NULL);
}
}
@@ -902,7 +902,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}
if (reclaim) {
- cmpxchg(&iter->position, pos, memcg);
+ /*
+ * The position could have already been updated by a competing
+ * thread, so check that the value hasn't changed since we read
+ * it to avoid reclaiming from the same cgroup twice.
+ */
+ (void)cmpxchg(&iter->position, pos, memcg);
/*
* pairs with css_tryget when dereferencing iter->position
On Fri, 18 Dec 2015 19:24:05 +0300 Vladimir Davydov <[email protected]> wrote:
>
> OK, got it, thanks. Here goes the incremental patch (it should also fix
> the warning regarding unused cmpxchg returned value):
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fc25dc211eaf..908c075e04eb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -864,7 +864,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> * might block it. So we clear iter->position right
> * away.
> */
> - cmpxchg(&iter->position, pos, NULL);
> + (void)cmpxchg(&iter->position, pos, NULL);
No, this doesn't actually squish the __must_check warning.
Can anyone think of anything smarter than this?
--- a/mm/memcontrol.c~mm-memcontrol-fix-possible-memcg-leak-due-to-interrupted-reclaim-fix-fix
+++ a/mm/memcontrol.c
@@ -851,6 +851,9 @@ static struct mem_cgroup *get_mem_cgroup
return memcg;
}
+/* Move this to compiler.h if it proves worthy */
+#define defeat_must_check(expr) do { if (expr) ; } while (0)
+
/**
* mem_cgroup_iter - iterate over memory cgroup hierarchy
* @root: hierarchy root
@@ -915,7 +918,7 @@ struct mem_cgroup *mem_cgroup_iter(struc
* might block it. So we clear iter->position right
* away.
*/
- (void)cmpxchg(&iter->position, pos, NULL);
+ defeat_must_check(cmpxchg(&iter->position, pos, NULL));
}
}
@@ -967,7 +970,7 @@ struct mem_cgroup *mem_cgroup_iter(struc
* thread, so check that the value hasn't changed since we read
* it to avoid reclaiming from the same cgroup twice.
*/
- (void)cmpxchg(&iter->position, pos, memcg);
+ defeat_must_check(cmpxchg(&iter->position, pos, memcg));
/*
* pairs with css_tryget when dereferencing iter->position
_
On Fri, Dec 18, 2015 at 02:40:04PM -0800, Andrew Morton wrote:
> On Fri, 18 Dec 2015 19:24:05 +0300 Vladimir Davydov <[email protected]> wrote:
>
> >
> > OK, got it, thanks. Here goes the incremental patch (it should also fix
> > the warning regarding unused cmpxchg returned value):
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index fc25dc211eaf..908c075e04eb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -864,7 +864,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> > * might block it. So we clear iter->position right
> > * away.
> > */
> > - cmpxchg(&iter->position, pos, NULL);
> > + (void)cmpxchg(&iter->position, pos, NULL);
>
> No, this doesn't actually squish the __must_check warning.
The warning was caused not by a __must_check annotation - using one for
cmpxchg would be just wrong - but by the type conversion done in the
expansion of the cmpxchg macro:
arch/m68k/include/asm/cmpxchg.h:121:3: warning: value computed is not used [-Wunused-value]
((__typeof__(*(ptr)))__cmpxchg((ptr), (unsigned long)(o), \
^
(see http://www.spinics.net/lists/linux-mm/msg99133.html)
Casting to (void) suppresses this warning, and this seems to be the
common way of dealing with it (e.g. see kernel/rcu/tree_plugin.h).
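
To illustrate with a toy example (not the real m68k macro): a macro whose
expansion is a cast to a non-void type can trigger -Wunused-value when the
result is discarded, and casting the whole expression to (void) silences
it:

	#define toy_cmpxchg(ptr, o, n) \
		((__typeof__(*(ptr)))__sync_val_compare_and_swap((ptr), (o), (n)))

	void toy_example(int *slot)
	{
		toy_cmpxchg(slot, 0, 1);	/* may warn: value computed is not used */
		(void)toy_cmpxchg(slot, 0, 1);	/* no warning */
	}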
Thanks,
Vladimir
I think we can fold the following in there as well:
From b885bf06f55d05f0e8249357d4edb231dfe4a5dc Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Wed, 23 Dec 2015 17:16:07 -0500
Subject: [PATCH] mm: memcontrol: fix possible memcg leak due to interrupted
reclaim fix
When we handled multiple css references in mem_cgroup_iter(), it got a
little confusing which puts belonged to which gets. Now that we only ref
the position css it's obvious, and a multi-line comment to explain it
gets in the way of reading an already long function.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99acd6a..498c61e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -909,10 +909,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
*/
(void)cmpxchg(&iter->position, pos, memcg);
- /*
- * pairs with css_tryget when dereferencing iter->position
- * above.
- */
if (pos)
css_put(&pos->css);
--
2.6.4