If page migration cannot charge the new page to the memcg,
migrate_pages() will return -ENOMEM. This isn't considered in memory
compaction, however, and the loop continues to iterate over all
pageblocks, futilely trying to continue migrations that are bound to
fail.
This will short-circuit and fail memory compaction if migrate_pages()
returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations
were successful so that the page allocator will retry.
Signed-off-by: David Rientjes <[email protected]>
---
mm/compaction.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -701,8 +701,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
if (err) {
putback_lru_pages(&cc->migratepages);
cc->nr_migratepages = 0;
+ if (err == -ENOMEM) {
+ ret = COMPACT_PARTIAL;
+ goto out;
+ }
}
-
}
out:
On 06/21/2012 03:52 PM, David Rientjes wrote:
> If page migration cannot charge the new page to the memcg,
> migrate_pages() will return -ENOMEM. This isn't considered in memory
> compaction, however, and the loop continues to iterate over all
> pageblocks, futilely trying to continue migrations that are bound to
> fail.
Hmm, it might be a dumb question.
I imagine that pages in the next pageblock could be in another memcg, so migration could succeed.
Why should we stop compaction once it fails to migrate pages in the current pageblock/memcg?
>
> This will short-circuit and fail memory compaction if migrate_pages()
> returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations
> were successful so that the page allocator will retry.
>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> mm/compaction.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/compaction.c b/mm/compaction.c
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -701,8 +701,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> if (err) {
> putback_lru_pages(&cc->migratepages);
> cc->nr_migratepages = 0;
> + if (err == -ENOMEM) {
> + ret = COMPACT_PARTIAL;
> + goto out;
> + }
> }
> -
> }
>
> out:
--
Kind regards,
Minchan Kim
On Thu, 21 Jun 2012, Minchan Kim wrote:
> > If page migration cannot charge the new page to the memcg,
> > migrate_pages() will return -ENOMEM. This isn't considered in memory
> > compaction, however, and the loop continues to iterate over all
> > pageblocks, futilely trying to continue migrations that are bound to
> > fail.
>
>
> Hmm, it might be a dumb question.
> I imagine that pages in the next pageblock could be in another memcg, so migration could succeed.
> Why should we stop compaction once it fails to migrate pages in the current pageblock/memcg?
>
[ You included the gmane.linux.kernel and gmane.linux.kernel.mm
newsgroups in your reply, not sure why, so I removed them. ]
This was inspired by a system running with a single oom memcg running with
thp that continuously tried migrating pages resulting in vmstats such as
this:
compact_blocks_moved 59473599
compact_pages_moved 50041548
compact_pagemigrate_failed 1494277831
compact_stall 1013
compact_fail 573
Obviously not a good result: roughly thirty migration failures for
every page successfully moved.
We could certainly continue the iteration in cases like this, but I
thought it would be better to fail and rely on direct reclaim to actually
try to free some memory, especially if that oom memcg happens to include
current.
It's possible that subsequent pageblocks would contain memory allocated
solely from non-oom memcgs, but that's certainly not guaranteed, and it
results in terrible performance as exhibited above. Is there another
good criterion for deciding when to stop isolating and attempting to
migrate all of these pageblocks?
Other ideas?
On Thu, 21 Jun 2012, David Rientjes wrote:
> It's possible that subsequent pageblocks would contain memory allocated
> solely from non-oom memcgs, but that's certainly not guaranteed, and it
> results in terrible performance as exhibited above. Is there another
> good criterion for deciding when to stop isolating and attempting to
> migrate all of these pageblocks?
>
> Other ideas?
>
The only other alternative I can think of is to check
mem_cgroup_margin() in isolate_migratepages_range() and return a NULL
lruvec to break out of that pageblock and return. We could then set a
bit in struct mem_cgroup that labels it as oom, so we can check it on
subsequent pageblocks without incurring the res_counter locking that
mem_cgroup_margin() does, and clear that bit on every uncharge to the
memcg. But this still seems like a tremendous waste of cpu (especially
if /sys/kernel/mm/transparent_hugepage/defrag == always) if most
pageblocks contain pages from an oom memcg.
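Roughly, that alternative might look like the following (untested
sketch only: the oom_no_migrate field is invented here for
illustration, mem_cgroup_margin() is currently static to memcontrol.c
so it would need a wrapper, and the page_cgroup locking is elided):

	static bool memcg_margin_for_migration(struct page *page)
	{
		struct page_cgroup *pc = lookup_page_cgroup(page);
		struct mem_cgroup *memcg = pc ? pc->mem_cgroup : NULL;

		if (!memcg)
			return true;
		/* cached oom state avoids res_counter locking on later pageblocks */
		if (memcg->oom_no_migrate)
			return false;
		if (!mem_cgroup_margin(memcg)) {
			/* would be cleared again on every uncharge to this memcg */
			memcg->oom_no_migrate = true;
			return false;
		}
		return true;
	}

isolate_migratepages_range() would call this before isolating each page
and bail out of the pageblock when it returns false.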
On Wed, Jun 20, 2012 at 11:52:35PM -0700, David Rientjes wrote:
> If page migration cannot charge the new page to the memcg,
> migrate_pages() will return -ENOMEM.
I take it this happens in a memcg that is full and the temporary page
is enough to push it over the edge. Is that correct? If so, it should
be included in the changelog because that would make this a -stable
candidate on the grounds that it is a corner case where compaction can
consume excessive CPU when using memcgs, leading to apparent stalls.
> This isn't considered in memory
> compaction, however, and the loop continues to iterate over all
> pageblocks, futilely trying to continue migrations that are bound to
> fail.
>
This is not necessarily true if it was just one small memcg occupying
that pageblock, for example.
> This will short-circuit and fail memory compaction if migrate_pages()
> returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations
> were successful so that the page allocator will retry.
>
the "some migration" could have been in other unrelated zones so
COMPACT_PARTIAL makes sense from that perspective.
> Signed-off-by: David Rientjes <[email protected]>
There is a side-effect that a small memcg that is full will cause
compaction to abort when it could have succeeded. This will impact THP
allocation success rates in other memcgs or in the root domain.
The ideal would be to back out iff the memcg occupied a large percentage
of a poorly fragmented zone. There is no easy way of finding this out,
and a good value for a "large percentage" is unknowable. Besides, it
would depend on the distribution: if the memcg had one page per
pageblock it would still fail, so you'd need a full scan anyway. That
all feels like a bust and a path to fail. I'd like to see a more
detailed changelog, but otherwise;
Acked-by: Mel Gorman <[email protected]>
However, here is a possible extension to your patch that should work
while preserving THP success rates, but it needs a bit more messing. At
the place of your patch, do something like this in compact_zone():
arbitrary_memcg = NULL

...

/*
 * Break out if a memcg has "unmovable" pages that disable compaction in
 * this zone
 */
if err == -ENOMEM
	foreach page in cc->migratepages
		pc = lookup_page_cgroup(page)
		if pc && pc->mem_cgroup
			memcg = pc->mem_cgroup
			if memcg->disabled_compaction == true
				goto out
			else
				arbitrary_memcg = memcg
i.e. add a new boolean to struct mem_cgroup that is set to true if this
memcg has impaired compaction. If a cgroup does not have
disabled_compaction set, then remember it as arbitrary_memcg.
Next is when to set disabled_compaction. At the end of compact_zone(),
do

if ret == COMPACT_COMPLETE && cc->order != -1 && arbitrary_memcg
	arbitrary_memcg->disabled_compaction = true
i.e. if we are in direct compaction and a full compaction cycle failed
due to a cgroup getting in the way, then tag that cgroup as
"disabled_compaction". On subsequent compaction attempts, if that
cgroup is encountered again then abort compaction faster.
This will mitigate a small full memcg disabling compaction for the entire
zone at least until such time as the memcg has polluted every movable
pageblock.
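In C, the check at the place of your patch might look something like
this (untested sketch; disabled_compaction is the new field proposed
above, the page_cgroup locking is elided, and note it would have to run
before putback_lru_pages() empties cc->migratepages):

	if (err == -ENOMEM) {
		struct page *page;

		/* scan the failed pages for a compaction-disabled memcg */
		list_for_each_entry(page, &cc->migratepages, lru) {
			struct page_cgroup *pc = lookup_page_cgroup(page);

			if (pc && pc->mem_cgroup) {
				if (pc->mem_cgroup->disabled_compaction)
					goto out;
				arbitrary_memcg = pc->mem_cgroup;
			}
		}
	}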
> putback_lru_pages(&cc->migratepages);
> cc->nr_migratepages = 0;
> + if (err == -ENOMEM) {
> + ret = COMPACT_PARTIAL;
> + goto out;
> + }
> }
> -
> }
>
> out:
--
Mel Gorman
SUSE Labs
On Thu, 21 Jun 2012, Mel Gorman wrote:
> I take it this happens in a memcg that is full and the temporary page
> is enough to push it over the edge. Is that correct? If so, it should
> be included in the changelog because that would make this a -stable
> candidate on the grounds that it is a corner case where compaction can
> consume excessive CPU when using memcgs, leading to apparent stalls.
>
Yes, the charge against the page under migration causes the oom. It's
really a nasty side-effect of memcg page migration that we have to
charge a temporary page; I wish we could address that there, and
certainly we can try to in the future. This issue has just been causing
us a lot of pain, especially for systems with a low number of very
large memcgs.
I agree with your assessment that it should be added to stable and ask
that Andrew replace the old changelog with the following:
===SNIP===
mm, thp: abort compaction if migration page cannot be charged to memcg
If page migration cannot charge the temporary page to the memcg,
migrate_pages() will return -ENOMEM. This isn't considered in memory
compaction, however, and the loop continues to iterate over all
pageblocks, trying to isolate and migrate pages. If a small number of
very large memcgs happen to be oom, these attempts will mostly be
futile, leading to an enormous amount of cpu consumption due to the
page migration failures.
This patch will short-circuit and fail memory compaction if
migrate_pages() returns -ENOMEM. COMPACT_PARTIAL is returned in case
some migrations were successful so that the page allocator will retry.
Cc: [email protected]
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
===SNIP===
> However, here is a possible extension to your patch that should work
> while preserving THP success rates, but it needs a bit more messing. At
> the place of your patch, do something like this in compact_zone():
>
> arbitrary_memcg = NULL
>
> ...
>
> /*
>  * Break out if a memcg has "unmovable" pages that disable compaction in
>  * this zone
>  */
> if err == -ENOMEM
> 	foreach page in cc->migratepages
> 		pc = lookup_page_cgroup(page)
> 		if pc && pc->mem_cgroup
> 			memcg = pc->mem_cgroup
> 			if memcg->disabled_compaction == true
> 				goto out
> 			else
> 				arbitrary_memcg = memcg
>
> i.e. add a new boolean to struct mem_cgroup that is set to true if this
> memcg has impaired compaction. If a cgroup does not have
> disabled_compaction set, then remember it as arbitrary_memcg.
>
> Next is when to set disabled_compaction. At the end of compact_zone(),
> do
>
> if ret == COMPACT_COMPLETE && cc->order != -1 && arbitrary_memcg
> 	arbitrary_memcg->disabled_compaction = true
>
> i.e. if we are in direct compaction and a full compaction cycle failed
> due to a cgroup getting in the way, then tag that cgroup as
> "disabled_compaction". On subsequent compaction attempts, if that
> cgroup is encountered again then abort compaction faster.
>
> This will mitigate a small full memcg disabling compaction for the entire
> zone at least until such time as the memcg has polluted every movable
> pageblock.
>
Interesting approach; I'll look to do something like this as a follow-up
to this patch, since we have usecases that reproduce this easily.
Thanks for looking at it and for the detailed analysis, Mel.
(2012/06/21 18:01), David Rientjes wrote:
> On Thu, 21 Jun 2012, David Rientjes wrote:
>
>> It's possible that subsequent pageblocks would contain memory allocated
>> solely from non-oom memcgs, but that's certainly not guaranteed, and it
>> results in terrible performance as exhibited above. Is there another
>> good criterion for deciding when to stop isolating and attempting to
>> migrate all of these pageblocks?
>>
>> Other ideas?
>>
>
> The only other alternative I can think of is to check
> mem_cgroup_margin() in isolate_migratepages_range() and return a NULL
> lruvec to break out of that pageblock and return. We could then set a
> bit in struct mem_cgroup that labels it as oom, so we can check it on
> subsequent pageblocks without incurring the res_counter locking that
> mem_cgroup_margin() does, and clear that bit on every uncharge to the
> memcg. But this still seems like a tremendous waste of cpu (especially
> if /sys/kernel/mm/transparent_hugepage/defrag == always) if most
> pageblocks contain pages from an oom memcg.
I guess the best way would be to never call charge/uncharge at
migration... but that has been a battle with many race conditions.
Here is an alternative: remove the -ENOMEM failure in
mem_cgroup_prepare_migration() by using res_counter_charge_nofail().
Could you try this?
==
From 12cd8c387cc19b6f89c51a89dc89cdb0fc54074e Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 21 Jun 2012 19:18:07 +0900
Subject: [PATCH] memcg: never fail charge at page migration.
The memory cgroup code adds an extra charge to the memcg at page
migration. In theory this is unnecessary, but the code would be much
more complex without it because of many race conditions.
Now, if a memory cgroup is under OOM, page migration never succeeds if
the target page belongs to that memcg. This prevents page
defragmentation and tends to consume cpu needlessly.
This patch uses res_counter_charge_nofail() in the migration path and
avoids stopping page migration because of an OOM memcg.
But even if it is a temporary state, usage > limit doesn't seem good.
This patch adds a new function, res_counter_usage_safe(), which does

	if (usage < limit)
		return usage;
	return limit;

So res_counter_charge_nofail() will never break the user experience.
Reported-by: David Rientjes <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/res_counter.h | 2 ++
kernel/res_counter.c | 14 ++++++++++++++
mm/memcontrol.c | 22 ++++++----------------
3 files changed, 22 insertions(+), 16 deletions(-)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index c6b0368..ece3d02 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -226,4 +226,6 @@ res_counter_set_soft_limit(struct res_counter *cnt,
return 0;
}
+u64 res_counter_usage_safe(struct res_counter *cnt);
+
#endif
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index d9ea45e..da520c7 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -176,6 +176,20 @@ u64 res_counter_read_u64(struct res_counter *counter, int member)
}
#endif
+u64 res_counter_usage_safe(struct res_counter *counter)
+{
+ u64 val;
+ unsigned long flags;
+
+ spin_lock_irqsave(&counter->lock, flags);
+ if (counter->usage < counter->limit)
+ val = counter->usage;
+ else
+ val = counter->limit;
+ spin_unlock_irqrestore(&counter->lock, flags);
+ return val;
+}
+
int res_counter_memparse_write_strategy(const char *buf,
unsigned long long *res)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 767440c..b468d9a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3364,6 +3364,7 @@ int mem_cgroup_prepare_migration(struct page *page,
struct mem_cgroup *memcg = NULL;
struct page_cgroup *pc;
enum charge_type ctype;
+ struct res_counter *dummy;
int ret = 0;
*memcgp = NULL;
@@ -3418,21 +3419,10 @@ int mem_cgroup_prepare_migration(struct page *page,
return 0;
*memcgp = memcg;
- ret = __mem_cgroup_try_charge(NULL, gfp_mask, 1, memcgp, false);
+ res_counter_charge_nofail(&memcg->res, PAGE_SIZE, &dummy);
+ if (do_swap_account)
+ res_counter_charge_nofail(&memcg->memsw, PAGE_SIZE, &dummy);
css_put(&memcg->css);/* drop extra refcnt */
- if (ret) {
- if (PageAnon(page)) {
- lock_page_cgroup(pc);
- ClearPageCgroupMigration(pc);
- unlock_page_cgroup(pc);
- /*
- * The old page may be fully unmapped while we kept it.
- */
- mem_cgroup_uncharge_page(page);
- }
- /* we'll need to revisit this error code (we have -EINTR) */
- return -ENOMEM;
- }
/*
* We charge new page before it's used/mapped. So, even if unlock_page()
* is called before end_migration, we can catch all events on this new
@@ -3995,9 +3985,9 @@ static inline u64 mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
if (!mem_cgroup_is_root(memcg)) {
if (!swap)
- return res_counter_read_u64(&memcg->res, RES_USAGE);
+ return res_counter_usage_safe(&memcg->res);
else
- return res_counter_read_u64(&memcg->memsw, RES_USAGE);
+ return res_counter_usage_safe(&memcg->memsw);
}
val = mem_cgroup_recursive_stat(memcg, MEM_CGROUP_STAT_CACHE);
--
1.7.4.1
On Thu, 21 Jun 2012, Kamezawa Hiroyuki wrote:
> I guess the best way would be to never call charge/uncharge at
> migration... but that has been a battle with many race conditions.
>
> Here is an alternative: remove the -ENOMEM failure in
> mem_cgroup_prepare_migration() by using res_counter_charge_nofail().
>
> Could you try this?
I would love to be able to remove the -ENOMEM as the result of charging
the temporary page, but what happens if all cpus are calling into
migrate_pages() and unmapping pages from the same memcg? This need not
happen only from compaction, either. It feels like this wouldn't scale
appropriately, and you risk going significantly over the limit, even if
only for a brief period of time. I'd hate to be 128K over my limit on a
machine with 32 cores.
Comments from the memcg folks on whether this is acceptable?
In the interim, do you have any objection to merging this patch as a
bandaid for stable?
On 06/21/2012 02:52 AM, David Rientjes wrote:
> If page migration cannot charge the new page to the memcg,
> migrate_pages() will return -ENOMEM. This isn't considered in memory
> compaction, however, and the loop continues to iterate over all
> pageblocks, futilely trying to continue migrations that are bound to
> fail.
>
> This will short-circuit and fail memory compaction if migrate_pages()
> returns -ENOMEM. COMPACT_PARTIAL is returned in case some migrations
> were successful so that the page allocator will retry.
The patch makes sense; however, I wonder if it would make more sense in
the long run to allow migrate/compaction to temporarily exceed the
memcg memory limit for a cgroup, because the original page will get
freed again soon anyway.
That has the potential to improve compaction success rates and reduce
compaction-related CPU use.
On Mon, 25 Jun 2012, Rik van Riel wrote:
> The patch makes sense; however, I wonder if it would make more sense in
> the long run to allow migrate/compaction to temporarily exceed the
> memcg memory limit for a cgroup, because the original page will get
> freed again soon anyway.
>
> That has the potential to improve compaction success rates and reduce
> compaction-related CPU use.
>
Yeah, Kame brought up the same point with a sample patch allowing the
temporary charge for the new page. It would certainly solve this problem
in a way that we don't even have to touch compaction; it's disappointing
that we have to charge memory to do a page migration at all. I'm not so
sure about the approach of temporarily allowing the excess charge,
however, since it would scale with the number of cpus doing compaction
or migration and could end up at PAGE_SIZE * nr_cpu_ids of overcharge.
I haven't looked at it (yet), but I'm hoping that there's a way to avoid
charging the temporary page at all until after move_to_new_page()
succeeds, i.e. find a way to uncharge page before charging newpage. We
currently don't charge things like vmalloc() memory for callers that use
alloc_pages() directly, so it seems plausible to do this without causing
usage > limit.
(2012/06/26 9:32), David Rientjes wrote:
> On Mon, 25 Jun 2012, Rik van Riel wrote:
>
>> The patch makes sense; however, I wonder if it would make more sense in
>> the long run to allow migrate/compaction to temporarily exceed the
>> memcg memory limit for a cgroup, because the original page will get
>> freed again soon anyway.
>>
>> That has the potential to improve compaction success rates and reduce
>> compaction-related CPU use.
>>
>
> Yeah, Kame brought up the same point with a sample patch allowing the
> temporary charge for the new page. It would certainly solve this problem
> in a way that we don't even have to touch compaction; it's disappointing
> that we have to charge memory to do a page migration at all. I'm not so
> sure about the approach of temporarily allowing the excess charge,
> however, since it would scale with the number of cpus doing compaction
> or migration and could end up at PAGE_SIZE * nr_cpu_ids of overcharge.
>
I don't think it's a problem. Even if there are 4096 cpus, it's only
16MB, on a system that tends to have terabytes of memory.
(We already have 32 pages of per-cpu cache....)
I'd like to post that patch, updated against mmotm.
> I haven't looked at it (yet), but I'm hoping that there's a way to avoid
> charging the temporary page at all until after move_to_new_page()
> succeeds, i.e. find a way to uncharge page before charging newpage.
Hmm... this code has been very racy and we have had many mis-accounting
bugs. So I'd like to start from a safe approach.
Thanks,
-Kame