Date: Fri, 18 Dec 2015 18:32:02 +0300
From: Vladimir Davydov
To: Andrew Morton
CC: Johannes Weiner, Michal Hocko
Subject: Re: [PATCH v2] mm: memcontrol: fix possible memcg leak due to interrupted reclaim
Message-ID: <20151218153202.GS28521@esperanza>
References: <1450182697-11049-1-git-send-email-vdavydov@virtuozzo.com>
 <20151217150217.a02c264ce9b5335b02bae888@linux-foundation.org>
In-Reply-To: <20151217150217.a02c264ce9b5335b02bae888@linux-foundation.org>

On Thu, Dec 17, 2015 at 03:02:17PM -0800, Andrew Morton wrote:
> On Tue, 15 Dec 2015 15:31:37 +0300 Vladimir Davydov wrote:
>
> > Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> > once enough pages have been reclaimed, in which case, in contrast to a
> > full round-trip over a cgroup sub-tree, the current position stored in
> > mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> > and so is left holding the reference to the last scanned cgroup. If the
> > target cgroup does not get scanned again (we might have just reclaimed
> > the last page, or all processes might exit and free their memory
> > voluntarily), we will leak it, because there is nobody to put the
> > reference held by the iterator.
> >
> > The problem is easy to reproduce by running the following command
> > sequence in a loop:
> >
> >   mkdir /sys/fs/cgroup/memory/test
> >   echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> >   echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> >   memhog 150M
> >   echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> >   rmdir test
> >
> > The cgroups generated by it will never get freed.
> >
> > This patch fixes the issue by making mem_cgroup_iter avoid taking a
> > reference to the current position. In order not to hit a use-after-free
> > bug while running reclaim in parallel with cgroup deletion, we make use
> > of the ->css_released cgroup callback to clear references to the dying
> > cgroup in all reclaim iterators that might refer to it. This callback
> > is called right before scheduling the rcu work which will free the css,
> > so if we access iter->position from an rcu read section, we can be sure
> > it won't go away under us.
> >
> > ...
> >
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -859,14 +859,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >  		if (prev && reclaim->generation != iter->generation)
> >  			goto out_unlock;
> >
> > -		do {
> > +		while (1) {
> >  			pos = READ_ONCE(iter->position);
> > +			if (!pos || css_tryget(&pos->css))
> > +				break;
> >  			/*
> > -			 * A racing update may change the position and
> > -			 * put the last reference, hence css_tryget(),
> > -			 * or retry to see the updated position.
> > +			 * css reference reached zero, so iter->position will
> > +			 * be cleared by ->css_released. However, we should not
> > +			 * rely on this happening soon, because ->css_released
> > +			 * is called from a work queue, and by busy-waiting we
> > +			 * might block it. So we clear iter->position right
> > +			 * away.
> >  			 */
> > -		} while (pos && !css_tryget(&pos->css));
> > +			cmpxchg(&iter->position, pos, NULL);
> > +		}
>
> It's peculiar to use cmpxchg() without actually checking that it did
> anything.  Should we use xchg() here?  And why aren't we using plain
> old "=", come to that?

Well, it's obvious why we need the 'compare' part - the iter could
already have been advanced by a competing process, in which case we
should not touch it, otherwise we would reclaim some cgroup twice
during the same reclaim generation. However, it's not that clear why
the update must be atomic. Before this patch, atomicity was necessary
to guarantee that we adjusted the reference counters correctly, but now
we don't do that anymore. If a competing process happens to update
iter->position between the compare and the set steps, we might at worst
reclaim from the same cgroup twice, and that is extremely unlikely to
happen.

So I think we can replace the atomic operation with a non-atomic one,
like the patch below does. Any objections?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc25dc211eaf..42187e84b147 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -864,7 +864,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			 * might block it. So we clear iter->position right
 			 * away.
 			 */
-			cmpxchg(&iter->position, pos, NULL);
+			if (iter->position == pos)
+				iter->position = NULL;
 		}
 	}

@@ -902,7 +903,15 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}

 	if (reclaim) {
-		cmpxchg(&iter->position, pos, memcg);
+		/*
+		 * The position could have already been updated by a competing
+		 * thread, so check that the value hasn't changed since we read
+		 * it. This operation doesn't need to be atomic, because a race
+		 * is extremely unlikely and in the worst case can only result
+		 * in the same cgroup reclaimed twice.
+		 */
+		if (iter->position == pos)
+			iter->position = memcg;

 		/*
 		 * pairs with css_tryget when dereferencing iter->position
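
In case it helps to see the two update styles side by side outside the
kernel, here is a minimal user-space sketch. It is only an
illustration: struct reclaim_iter, update_position_atomic(),
update_position_plain() and the GCC __sync_val_compare_and_swap()
builtin are stand-ins for the reclaim iterator and the kernel's
cmpxchg(), not actual kernel code.

/*
 * User-space sketch only; build with: gcc -O2 -o pos-sketch pos-sketch.c
 */
#include <stdio.h>

struct reclaim_iter {
        void *position;         /* last scanned cgroup, or NULL */
};

static struct reclaim_iter iter;

/* Atomic variant: compare and store happen as one indivisible step. */
static void update_position_atomic(void *pos, void *next)
{
        __sync_val_compare_and_swap(&iter.position, pos, next);
}

/*
 * Non-atomic variant: another thread may change iter.position between
 * the comparison and the store. Losing that race only means the same
 * cgroup can be handed out twice in one reclaim generation, which is
 * harmless once no reference count depends on this store.
 */
static void update_position_plain(void *pos, void *next)
{
        if (iter.position == pos)
                iter.position = next;
}

int main(void)
{
        int a, b;

        update_position_plain(NULL, &a);        /* NULL -> &a */
        update_position_atomic(&a, &b);         /* &a   -> &b */
        update_position_plain(&a, NULL);        /* stale pos, no effect */
        printf("position points to b: %d\n", iter.position == (void *)&b);
        return 0;
}

The point is the same as in the patch above: the compare only protects
which cgroup gets scanned next, so once refcounting no longer depends
on the store, losing the race between the comparison and the plain
store costs nothing worse than a duplicate scan.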