Date: Sat, 12 Dec 2015 22:18:55 +0300
From: Vladimir Davydov <vdavydov@virtuozzo.com>
To: Johannes Weiner <hannes@cmpxchg.org>
CC: Andrew Morton <akpm@linux-foundation.org>,
        Michal Hocko <mhocko@kernel.org>, <stable@vger.kernel.org>,
        <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: memcontrol: fix possible memcg leak due to
 interrupted reclaim
Message-ID: <20151212191855.GE28521@esperanza>
References: <1449927242-9608-1-git-send-email-vdavydov@virtuozzo.com>
 <20151212164540.GA7107@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20151212164540.GA7107@cmpxchg.org>
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6441
Lines: 177

On Sat, Dec 12, 2015 at 11:45:40AM -0500, Johannes Weiner wrote:
> On Sat, Dec 12, 2015 at 04:34:02PM +0300, Vladimir Davydov wrote:
> > Memory cgroup reclaim can be interrupted with mem_cgroup_iter_break()
> > once enough pages have been reclaimed, in which case, in contrast to a
> > full round-trip over a cgroup sub-tree, the current position stored in
> > mem_cgroup_reclaim_iter of the target cgroup does not get invalidated
> > and so is left holding the reference to the last scanned cgroup. If the
> > target cgroup does not get scanned again (we might have just reclaimed
> > the last page or all processes might exit and free their memory
> > voluntary), we will leak it, because there is nobody to put the
> > reference held by the iterator.
> > 
> > The problem is easy to reproduce by running the following command
> > sequence in a loop:
> > 
> >     mkdir /sys/fs/cgroup/memory/test
> >     echo 100M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
> >     echo $$ > /sys/fs/cgroup/memory/test/cgroup.procs
> >     memhog 150M
> >     echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> >     rmdir test
> > 
> > The cgroups generated by it will never get freed.
> > 
> > This patch fixes this issue by making mem_cgroup_iter_break() clear
> > mem_cgroup_reclaim_iter->position in case it points to the memory cgroup
> > we interrupted reclaim on.
> > 
> > Fixes: 5ac8fb31ad2e ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
> > Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: <stable@vger.kernel.org> # 3.19+
> 
> Good catch!
> 
> The downside of not remembering the last position across incomplete
> reclaim cycles is that we always restart at the same position. If a
> cgroup has a certain number of children, it's conceivable that we
> might never get to the youngest cgroups in the subtree anymore.

That's true, scratch this patch.

> 
> So I'd be more comfortable removing incomplete reclaim walks. It was a
> nice little optimization we could do while supporting interleave walks,
> but it's not necessary: when global reclaim can walk the entire system,
> limit reclaim should be okay walking subtrees exhaustively.
> 
> How about this?
...
> @@ -2425,21 +2425,6 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  				   sc->nr_scanned - scanned,
>  				   sc->nr_reclaimed - reclaimed);
>  
> -			/*
> -			 * Direct reclaim and kswapd have to scan all memory
> -			 * cgroups to fulfill the overall scan target for the
> -			 * zone.
> -			 *
> -			 * Limit reclaim, on the other hand, only cares about
> -			 * nr_to_reclaim pages to be reclaimed and it will
> -			 * retry with decreasing priority if one round over the
> -			 * whole hierarchy is not sufficient.
> -			 */
> -			if (!global_reclaim(sc) &&
> -					sc->nr_reclaimed >= sc->nr_to_reclaim) {
> -				mem_cgroup_iter_break(root, memcg);
> -				break;
> -			}

Dunno. I like it, because it's simple and clean, but I'm unsure: can't
it result in lags when performing memcg reclaim for deep hierarchies?
For global reclaim we have kswapd, which tries to keep the system within
bounds so as to avoid direct reclaim at all. Memcg lacks such thing, and
interleave walks looks like a good compensation for it.

Alternatively, we could avoid taking reference to iter->position and
make use of css_released cgroup callback to invalidate reclaim
iterators. With this approach, upper level cgroups shouldn't receive
unfairly high pressure in comparison to their children. Something like
this, maybe?

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 87af26a24491..fcc5133210a0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -859,14 +859,12 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		if (prev && reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		do {
+		while (1) {
 			pos = READ_ONCE(iter->position);
-			/*
-			 * A racing update may change the position and
-			 * put the last reference, hence css_tryget(),
-			 * or retry to see the updated position.
-			 */
-		} while (pos && !css_tryget(&pos->css));
+			if (!pos || css_tryget(&pos->css))
+				break;
+			cmpxchg(&iter->position, pos, NULL);
+		}
 	}
 
 	if (pos)
@@ -912,12 +910,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	}
 
 	if (reclaim) {
-		if (cmpxchg(&iter->position, pos, memcg) == pos) {
-			if (memcg)
-				css_get(&memcg->css);
-			if (pos)
-				css_put(&pos->css);
-		}
+		cmpxchg(&iter->position, pos, memcg);
 
 		/*
 		 * pairs with css_tryget when dereferencing iter->position
@@ -955,6 +948,28 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
 		css_put(&prev->css);
 }
 
+static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
+{
+	struct mem_cgroup *memcg = dead_memcg;
+	struct mem_cgroup_reclaim_iter *iter;
+	struct mem_cgroup_per_zone *mz;
+	int nid, zid;
+	int i;
+
+	while ((memcg = parent_mem_cgroup(memcg))) {
+		for_each_node(nid) {
+			for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+				mz = &memcg->nodeinfo[nid]->zoneinfo[zid];
+				for (i = 0; i <= DEF_PRIORITY; i++) {
+					iter = &mz->iter[i];
+					cmpxchg(&iter->position,
+						dead_memcg, NULL);
+				}
+			}
+		}
+	}
+}
+
 /*
  * Iteration constructs for visiting all cgroups (under a tree).  If
  * loops are exited prematurely (break), mem_cgroup_iter_break() must
@@ -4375,6 +4390,13 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	wb_memcg_offline(memcg);
 }
 
+static void mem_cgroup_css_released(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	invalidate_reclaim_iterators(memcg);
+}
+
 static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -5229,6 +5251,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_alloc = mem_cgroup_css_alloc,
 	.css_online = mem_cgroup_css_online,
 	.css_offline = mem_cgroup_css_offline,
+	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.can_attach = mem_cgroup_can_attach,
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/