2016-04-12 07:18:13

by Hugh Dickins

Subject: mmotm woes, mainly compaction

Michal, I'm sorry to say that I now find that I misinformed you.

You'll remember when we were chasing the order=2 OOMs on two of my
machines at the end of March (in private mail). And you sent me a
mail containing two patches, the second "Another thing to try ...
so this on top" doing a *migrate_mode++.

I answered you definitively that the first patch worked,
so "I haven't tried adding the one below at all".

Not true, I'm afraid. Although I had split the *migrate_mode++ one
off into a separate patch that I did not apply, I found looking back
today (when trying to work out why order=2 OOMs were still a problem
on mmotm 2016-04-06) that I never deleted that part from the end of
the first patch; so in fact what I'd been testing had included the
second; and now I find that _it_ was the effective solution.

Which is particularly sad because I think we were both a bit
uneasy about the *migrate_mode++ one: partly the style of it
incrementing the enum; but more seriously that it advances all the
way to MIGRATE_SYNC, when the first went only to MIGRATE_SYNC_LIGHT.

But without it, I am still stuck with the order=2 OOMs.

And worse: after establishing that that fixes the order=2 OOMs for
me on 4.6-rc2-mm1, I thought I'd better check that the three you
posted today (the 1/2 classzone_idx one, the 2/2 prevent looping
forever, and the "ction-abstract-compaction-feedback-to-helpers-fix";
but I'm too far behind to consider or try the RFC thp backoff one)
(a) did not surprisingly fix it on their own, and (b) worked well
with the *migrate_mode++ one added in.

(a) as you'd expect, they did not help on their own; and (b) they
worked fine together on the G5 (until it hit the powerpc swapping
sigsegv, which I think the powerpc guys are hoping is a figment of
my imagination); but (b) they did not work fine together on the
laptop, that combination now gives it order=1 OOMs. Despair.

And I'm sorry that it's taken me so long to report, but aside from
home distractions, I had quite a lot of troubles with 4.6-rc2-mm1 on
different machines, once I got down to trying it. But I located Eric's
fix to an __inet_hash() crash in linux-next, and spotted Joonsoo's
setup_kmem_cache_node() slab bootup fix on lkml this morning.
With those out of the way, and forgetting the OOMs for now,

[PATCH mmotm] mm: fix several bugs in compaction

Fix three problems in the mmotm 2016-04-06-20-40 mm/compaction.c,
plus three minor tidyups there. Sorry, I'm now too tired to work
out which is a fix to what patch, and split them up appropriately:
better get these out quickly now.

1. Fix crash in release_pages() from compact_zone() from kcompactd_do_work():
kcompactd needs to INIT_LIST_HEAD on the new freepages_held list.

2. Fix crash in get_pfnblock_flags_mask() from suitable_migration_target()
from isolate_freepages(): there's a case when that "block_start_pfn -=
pageblock_nr_pages" loop can pass through 0 and end up trying to access
a pageblock before the start of the mem_map[]; see the sketch after the
sign-off below. (I have not worked out why this never hit me before
4.6-rc2-mm1, it looks much older.)

3. /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
go increasingly negative under compaction: which would add delay when
there should be none, or no delay when there should be some.
putback_movable_pages()
decrements the NR_ISOLATED counts which acct_isolated() increments,
so isolate_migratepages_block() needs to acct before putback in that
special case, and isolate_migratepages_range() can always do the acct
itself, leaving migratepages putback to caller like most other places.

4. Added VM_BUG_ONs to assert freepages_held is empty, matching those on
the other lists - though they're getting to look rather too much now.

5. It's easier to track the life of cc->migratepages if we don't assign
it to a migratelist variable.

6. Remove unused bool success from kcompactd_do_work().

Signed-off-by: Hugh Dickins <[email protected]>
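
A minimal userspace sketch of the wraparound in 2 above (illustrative
only, not kernel code): pfns are unsigned long, so once block_start_pfn
reaches 0 the next subtraction wraps to a huge value, the old
"block_start_pfn >= low_pfn" guard stays true, and the scan walks off
the front of mem_map[]; the "block_start_pfn < block_end_pfn" test added
in the hunk below is false after the wrap, which ends the loop.

#include <stdio.h>

int main(void)
{
        unsigned long pageblock_nr_pages = 512; /* an order-9 pageblock */
        unsigned long low_pfn = 0;              /* mistakenly too low */
        unsigned long block_end_pfn = 1024;
        unsigned long block_start_pfn = 512;
        int steps = 0;

        /* old guard only: unsigned >= 0 is always true */
        for (; block_start_pfn >= low_pfn;
             block_end_pfn = block_start_pfn,
             block_start_pfn -= pageblock_nr_pages) {
                printf("scanning block at pfn %lu\n", block_start_pfn);
                if (++steps == 3)       /* third pass prints 2^64 - 512 */
                        break;
        }
        return 0;
}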

--- 4.6-rc2-mm1/mm/compaction.c 2016-04-10 09:43:20.314514944 -0700
+++ linux/mm/compaction.c 2016-04-11 11:35:08.536604712 -0700
@@ -638,7 +638,6 @@ isolate_migratepages_block(struct compac
{
struct zone *zone = cc->zone;
unsigned long nr_scanned = 0, nr_isolated = 0;
- struct list_head *migratelist = &cc->migratepages;
struct lruvec *lruvec;
unsigned long flags = 0;
bool locked = false;
@@ -817,7 +816,7 @@ isolate_migratepages_block(struct compac
del_page_from_lru_list(page, lruvec, page_lru(page));

isolate_success:
- list_add(&page->lru, migratelist);
+ list_add(&page->lru, &cc->migratepages);
cc->nr_migratepages++;
nr_isolated++;

@@ -851,9 +850,11 @@ isolate_fail:
spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;
}
- putback_movable_pages(migratelist);
- nr_isolated = 0;
+ acct_isolated(zone, cc);
+ putback_movable_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
cc->last_migrated_pfn = 0;
+ nr_isolated = 0;
}

if (low_pfn < next_skip_pfn) {
@@ -928,17 +929,8 @@ isolate_migratepages_range(struct compac

pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
ISOLATE_UNEVICTABLE);
-
- /*
- * In case of fatal failure, release everything that might
- * have been isolated in the previous iteration, and signal
- * the failure back to caller.
- */
- if (!pfn) {
- putback_movable_pages(&cc->migratepages);
- cc->nr_migratepages = 0;
+ if (!pfn)
break;
- }

if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
break;
@@ -1019,7 +1011,7 @@ static void isolate_freepages(struct com
* pages on cc->migratepages. We stop searching if the migrate
* and free page scanners meet or enough free pages are isolated.
*/
- for (; block_start_pfn >= low_pfn;
+ for (; block_start_pfn >= low_pfn && block_start_pfn < block_end_pfn;
block_end_pfn = block_start_pfn,
block_start_pfn -= pageblock_nr_pages,
isolate_start_pfn = block_start_pfn) {
@@ -1617,6 +1609,7 @@ static enum compact_result compact_zone_

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
+ VM_BUG_ON(!list_empty(&cc.freepages_held));

*contended = cc.contended;
return ret;
@@ -1776,6 +1769,7 @@ static void __compact_pgdat(pg_data_t *p

VM_BUG_ON(!list_empty(&cc->freepages));
VM_BUG_ON(!list_empty(&cc->migratepages));
+ VM_BUG_ON(!list_empty(&cc->freepages_held));

if (is_via_compact_memory(cc->order))
continue;
@@ -1915,7 +1909,6 @@ static void kcompactd_do_work(pg_data_t
.ignore_skip_hint = true,

};
- bool success = false;

trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
cc.classzone_idx);
@@ -1940,12 +1933,12 @@ static void kcompactd_do_work(pg_data_t
cc.zone = zone;
INIT_LIST_HEAD(&cc.freepages);
INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.freepages_held);

status = compact_zone(zone, &cc);

if (zone_watermark_ok(zone, cc.order, low_wmark_pages(zone),
cc.classzone_idx, 0)) {
- success = true;
compaction_defer_reset(zone, cc.order, false);
} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
/*
@@ -1957,6 +1950,7 @@ static void kcompactd_do_work(pg_data_t

VM_BUG_ON(!list_empty(&cc.freepages));
VM_BUG_ON(!list_empty(&cc.migratepages));
+ VM_BUG_ON(!list_empty(&cc.freepages_held));
}

/*


2016-04-12 08:03:55

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/12/2016 09:18 AM, Hugh Dickins wrote:
> 2. Fix crash in get_pfnblock_flags_mask() from suitable_migration_target()
> from isolate_freepages(): there's a case when that "block_start_pfn -=
> pageblock_nr_pages" loop can pass through 0 and end up trying to access
> a pageblock before the start of the mem_map[]. (I have not worked out
> why this never hit me before 4.6-rc2-mm1, it looks much older.)

This is actually my fresh mmotm bug, thanks for catching that!

Fix for:
mm-compaction-wrap-calculating-first-and-last-pfn-of-pageblock.patch

----8<----
From 45330dfb350d6b3bc72bcdaccc226bcc286e1236 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <[email protected]>
Date: Tue, 12 Apr 2016 09:54:33 +0200
Subject: [PATCH] mm, compaction: fix crash in get_pfnblock_flags_mask() from
 isolate_freepages()

In isolate_freepages(), low_pfn was mistakenly initialized to
pageblock_start_pfn() instead of pageblock_end_pfn(), creating a possible
underflow, as described by Hugh:

There's a case when that "block_start_pfn -= pageblock_nr_pages" loop can
pass through 0 and end up trying to access a pageblock before the start of
the mem_map[].

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/compaction.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 315e5d57e7e9..67f886ecd773 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1012,7 +1012,7 @@ static void isolate_freepages(struct compact_control *cc)
block_start_pfn = pageblock_start_pfn(cc->free_pfn);
block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
zone_end_pfn(zone));
- low_pfn = pageblock_start_pfn(cc->migrate_pfn);
+ low_pfn = pageblock_end_pfn(cc->migrate_pfn);

/*
* Isolate free pages until enough are available to migrate the
--
2.8.1


2016-04-12 09:03:15

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/12/2016 09:18 AM, Hugh Dickins wrote:
> 3. /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
> go increasingly negative under compaction: which would add delay when
> there should be none, or no delay when there should be some.
> putback_movable_pages()
> decrements the NR_ISOLATED counts which acct_isolated() increments,
> so isolate_migratepages_block() needs to acct before putback in that
> special case, and isolate_migratepages_range() can always do the acct
> itself, leaving migratepages putback to caller like most other places.

The isolate_migratepages_block() part is mmotm-specific, so I'll split
it out in this patch. Thanks for catching it, and also the missing reset
of cc->nr_migratepages, which wasn't mentioned in the changelog, so I
added it there.

> 5. It's easier to track the life of cc->migratepages if we don't assign
> it to a migratelist variable.

This is also included here.

This is a -fix for:
mm-compaction-skip-blocks-where-isolation-fails-in-async-direct-compaction.patch

----8<----
From 59a0075b6cf85045aa2dc5cee1f27797bcd0b3d2 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <[email protected]>
Date: Tue, 12 Apr 2016 10:51:20 +0200
Subject: [PATCH] mm, compaction: prevent nr_isolated_* from going negative

/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
go increasingly negative under compaction: which would add delay when
there should be none, or no delay when there should be some.
putback_movable_pages()
decrements the NR_ISOLATED counts which acct_isolated() increments,
so isolate_migratepages_block() needs to acct before putback in that
special case. It's also useful to reset cc->nr_migratepages after putback
so we don't needlessly return too early on the COMPACT_CLUSTER_MAX check.

Also it's easier to track the life of cc->migratepages if we don't assign
it to a migratelist variable.
---
mm/compaction.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 67f886ecd773..ab649fba3d88 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -638,7 +638,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
{
struct zone *zone = cc->zone;
unsigned long nr_scanned = 0, nr_isolated = 0;
- struct list_head *migratelist = &cc->migratepages;
struct lruvec *lruvec;
unsigned long flags = 0;
bool locked = false;
@@ -817,7 +816,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
del_page_from_lru_list(page, lruvec, page_lru(page));

isolate_success:
- list_add(&page->lru, migratelist);
+ list_add(&page->lru, &cc->migratepages);
cc->nr_migratepages++;
nr_isolated++;

@@ -851,9 +850,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
spin_unlock_irqrestore(&zone->lru_lock, flags);
locked = false;
}
- putback_movable_pages(migratelist);
- nr_isolated = 0;
+ acct_isolated(zone, cc);
+ putback_movable_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
cc->last_migrated_pfn = 0;
+ nr_isolated = 0;
}

if (low_pfn < next_skip_pfn) {
--
2.8.1
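
To see the arithmetic of that changelog in isolation, a toy model in
plain C (the kernel names are reused only as labels): acct_isolated()
is the only place the NR_ISOLATED counters go up, and
putback_movable_pages() decrements them once per page, so putting pages
back that were never accounted drives the counter negative.

#include <stdio.h>

static long nr_isolated_file;   /* stands in for the NR_ISOLATED_FILE stat */

static void acct_isolated(long nr)              /* models acct_isolated() */
{
        nr_isolated_file += nr;
}

static void putback_movable_pages(long nr)      /* models the putback */
{
        nr_isolated_file -= nr;
}

int main(void)
{
        /* buggy error path: putback without a prior acct_isolated() */
        putback_movable_pages(32);
        printf("buggy: nr_isolated_file = %ld\n", nr_isolated_file); /* -32 */

        /* fixed order: acct first, then putback, and the counter balances */
        nr_isolated_file = 0;
        acct_isolated(32);
        putback_movable_pages(32);
        printf("fixed: nr_isolated_file = %ld\n", nr_isolated_file); /*  0 */
        return 0;
}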


2016-04-12 09:14:30

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/12/2016 11:03 AM, Vlastimil Babka wrote:
> On 04/12/2016 09:18 AM, Hugh Dickins wrote:
>> 3. /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
>> go increasingly negative under compaction: which would add delay when
>> there should be none, or no delay when there should be some.
>> putback_movable_pages()
>> decrements the NR_ISOLATED counts which acct_isolated() increments,
>> so isolate_migratepages_block() needs to acct before putback in that
>> special case, and isolate_migratepages_range() can always do the acct
>> itself, leaving migratepages putback to caller like most other places.
>
> The isolate_migratepages_block() part is mmotm-specific, so I'll split
> it out in this patch. Thanks for catching it and the lack of reset for
> cc->nr_migratepages which wasn't mentioned in changelog so I added it.
>
>> 5. It's easier to track the life of cc->migratepages if we don't assign
>> it to a migratelist variable.
>
> This is also included here.
>
> This is a -fix for:
> mm-compaction-skip-blocks-where-isolation-fails-in-async-direct-compaction.patch
>
> ----8<----
> From 59a0075b6cf85045aa2dc5cee1f27797bcd0b3d2 Mon Sep 17 00:00:00 2001
> From: Hugh Dickins <[email protected]>
> Date: Tue, 12 Apr 2016 10:51:20 +0200
> Subject: [PATCH] mm, compaction: prevent nr_isolated_* from going negative
>
> /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
> go increasingly negative under compaction: which would add delay when
> there should be none, or no delay when there should be some.
> putback_movable_pages()
> decrements the NR_ISOLATED counts which acct_isolated() increments,
> so isolate_migratepages_block() needs to acct before putback in that
> special case. It's also useful to reset cc->nr_migratepages after putback
> so we don't needlessly return too early on the COMPACT_CLUSTER_MAX check.
>
> Also it's easier to track the life of cc->migratepages if we don't assign
> it to a migratelist variable.

Forgot
Signed-off-by: Vlastimil Babka <[email protected]>

[...]

2016-04-12 09:31:14

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/12/2016 09:18 AM, Hugh Dickins wrote:
> 3. /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
> go increasingly negative under compaction: which would add delay when
> there should be none, or no delay when there should be some.
> putback_movable_pages()
> decrements the NR_ISOLATED counts which acct_isolated() increments,
> so isolate_migratepages_block() needs to acct before putback in that
> special case, and isolate_migratepages_range() can always do the acct
> itself, leaving migratepages putback to caller like most other places.

Sigh, looks like I notoriously suck at the nr_isolated_* accounting. The
isolate_migratepages_range() part is also due to my 3.18 commit. Back then,
Joonsoo caught the problem on the compaction side, but the CMA issue
remained. Sorry and thanks.

----8<----
From 8aa6e765d4f931718386e13290d43348e34f0e76 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <[email protected]>
Date: Tue, 12 Apr 2016 11:14:49 +0200
Subject: [PATCH] mm, cma: prevent nr_isolated_* counters from going negative

/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file
go increasingly negative under compaction: which would add delay when
there should be none, or no delay when there should be some. The bug in
compaction was due to a recent mmotm patch, but a much older instance of
the bug was also noticed in isolate_migratepages_range(), which is used
for CMA and gigantic hugepage allocations.

The bug is caused by putback_movable_pages() in an error path decrementing
the isolated counters without them being previously incremented by
acct_isolated(). Fix isolate_migratepages_range() by removing the error-path
putback, thus reaching acct_isolated() with migratepages still isolated, and
leaving putback to the caller, as most other places do.

[[email protected]: expanded the changelog]
Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
Cc: [email protected] #3.18+
Cc: Joonsoo Kim <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
mm/compaction.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index ab649fba3d88..c0be38f634d3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -930,16 +930,8 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
ISOLATE_UNEVICTABLE);

- /*
- * In case of fatal failure, release everything that might
- * have been isolated in the previous iteration, and signal
- * the failure back to caller.
- */
- if (!pfn) {
- putback_movable_pages(&cc->migratepages);
- cc->nr_migratepages = 0;
+ if (!pfn)
break;
- }

if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
break;
--
2.8.1


2016-04-12 09:38:33

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/12/2016 09:18 AM, Hugh Dickins wrote:
> 1. Fix crash in release_pages() from compact_zone() from kcompactd_do_work():
> kcompactd needs to INIT_LIST_HEAD on the new freepages_held list.

Hmm, right. I didn't notice the new call site added by one series when
rebasing the other series :/
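
For reference, the invariant that the INIT_LIST_HEAD fix restores, in a
minimal userspace model of the kernel's circular list (the real one
lives in <linux/list.h>): an initialized head points at itself, while an
on-stack head left uninitialized holds stack garbage, so the first walk
over it (as in the release_pages() crash) chases wild pointers.

#include <stdio.h>

struct list_head {
        struct list_head *next, *prev;
};

/* models the kernel's INIT_LIST_HEAD(): an empty list points at itself */
static void INIT_LIST_HEAD(struct list_head *head)
{
        head->next = head;
        head->prev = head;
}

static int list_empty(const struct list_head *head)
{
        return head->next == head;
}

int main(void)
{
        struct list_head freepages_held; /* on-stack, like compact_control */

        INIT_LIST_HEAD(&freepages_held); /* the line kcompactd was missing */
        printf("empty after init? %d\n", list_empty(&freepages_held));
        return 0;
}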

> 4. Added VM_BUG_ONs to assert freepages_held is empty, matching those on
> the other lists - though they're getting to look rather too much now.

I think the easiest thing to do for now is to drop from mmotm:
mm-compaction-direct-freepage-allocation-for-async-direct-compaction.patch
as Mel and Joonsoo didn't like it in its current state anyway.

> 6. Remove unused bool success from kcompactd_do_work().

That leaves just this part, which didn't fit anywhere else. I guess we can
just fold it into the upcoming kcompactd patches?

Thanks for organizing my morning today, Hugh :)

2016-04-12 12:10:25

by Michal Hocko

Subject: Re: mmotm woes, mainly compaction

On Tue 12-04-16 00:18:00, Hugh Dickins wrote:
> Michal, I'm sorry to say that I now find that I misinformed you.
>
> You'll remember when we were chasing the order=2 OOMs on two of my
> machines at the end of March (in private mail). And you sent me a
> mail containing two patches, the second "Another thing to try ...
> so this on top" doing a *migrate_mode++.
>
> I answered you definitively that the first patch worked,
> so "I haven't tried adding the one below at all".
>
> Not true, I'm afraid. Although I had split the *migrate_mode++ one
> off into a separate patch that I did not apply, I found looking back
> today (when trying to work out why order=2 OOMs were still a problem
> on mmotm 2016-04-06) that I never deleted that part from the end of
> the first patch; so in fact what I'd been testing had included the
> second; and now I find that _it_ was the effective solution.
>
> Which is particularly sad because I think we were both a bit
> uneasy about the *migrate_mode++ one: partly the style of it
> incrementing the enum; but more seriously that it advances all the
> way to MIGRATE_SYNC, when the first went only to MIGRATE_SYNC_LIGHT.

Yeah, I was thinking about this some more and I have dropped the
MIGRATE_SYNC patch because this is just too dangerous. It gets all the
way to writeout() and this is a great stack overflow hazard. But I
guess we do not need this writeout, and wait_on_page_writeback (done from
__unmap_and_move) would be sufficient. I was already thinking about
splitting MIGRATE_SYNC into two states, one allowing the wait on events
and the other allowing the writeout.

> But without it, I am still stuck with the order=2 OOMs.
>
> And worse: after establishing that that fixes the order=2 OOMs for
> me on 4.6-rc2-mm1, I thought I'd better check that the three you
> posted today (the 1/2 classzone_idx one, the 2/2 prevent looping
> forever, and the "ction-abstract-compaction-feedback-to-helpers-fix";
> but I'm too far behind to consider or try the RFC thp backoff one)
> (a) did not surprisingly fix it on their own, and (b) worked well
> with the *migrate_mode++ one added in.

I am not really sure what you have been testing here. The hugetlb load
or the same make on tmpfs? I would be really surprised if any of the
above patches made any difference for the make workload.

> (a) as you'd expect, they did not help on their own; and (b) they
> worked fine together on the G5 (until it hit the powerpc swapping
> sigsegv, which I think the powerpc guys are hoping is a figment of
> my imagination); but (b) they did not work fine together on the
> laptop, that combination now gives it order=1 OOMs. Despair.

Something is definitely wrong here. I have already seen that compaction
sometimes gives surprising results. I have seen that Vlastimil has posted
some fixes, so maybe this would be a side effect. I also hope to come up
with some reasonable set of tracepoints to tell us more, but let's see
whether the order-2 issue can be solved first.

This is still with the ugly enum++ but let's close our eyes and think about
something nicer...

Thanks!
---
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89a3919..e1947d7af63f 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SYNC_WRITEOUT will trigger the IO when migrating pages. Make sure
+ * to not use this flag from deep stacks.
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_SYNC_WRITEOUT,
};

#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 539b25a76111..0f14c65865ad 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -9,7 +9,8 @@
#define MIGRATE_MODE \
EM( MIGRATE_ASYNC, "MIGRATE_ASYNC") \
EM( MIGRATE_SYNC_LIGHT, "MIGRATE_SYNC_LIGHT") \
- EMe(MIGRATE_SYNC, "MIGRATE_SYNC")
+ EM( MIGRATE_SYNC, "MIGRATE_SYNC") \
+ EMe(MIGRATE_SYNC_WRITEOUT, "MIGRATE_SYNC_WRITEOUT")


#define MIGRATE_REASON \
diff --git a/mm/compaction.c b/mm/compaction.c
index 68dfbc07692d..7f631a6e234f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1839,7 +1839,7 @@ static void compact_node(int nid)
{
struct compact_control cc = {
.order = -1,
- .mode = MIGRATE_SYNC,
+ .mode = MIGRATE_SYNC_WRITEOUT,
.ignore_skip_hint = true,
};

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5a544c6c0717..a591b29a25ba 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1571,7 +1571,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
}

ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
- MIGRATE_SYNC, MR_MEMORY_FAILURE);
+ MIGRATE_SYNC_WRITEOUT, MR_MEMORY_FAILURE);
if (ret) {
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
pfn, ret, page->flags);
@@ -1651,7 +1651,7 @@ static int __soft_offline_page(struct page *page, int flags)
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
- MIGRATE_SYNC, MR_MEMORY_FAILURE);
+ MIGRATE_SYNC_WRITEOUT, MR_MEMORY_FAILURE);
if (ret) {
if (!list_empty(&pagelist)) {
list_del(&page->lru);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7ad0d2eb9a2c..6cd8664c9e6e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1554,7 +1554,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
* migrate_pages returns # of failed pages.
*/
ret = migrate_pages(&source, alloc_migrate_target, NULL, 0,
- MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
+ MIGRATE_SYNC_WRITEOUT, MR_MEMORY_HOTPLUG);
if (ret)
putback_movable_pages(&source);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7f80ebcd6552..a6a947980773 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1001,7 +1001,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,

if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_node_page, NULL, dest,
- MIGRATE_SYNC, MR_SYSCALL);
+ MIGRATE_SYNC_WRITEOUT, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
}
@@ -1242,7 +1242,7 @@ static long do_mbind(unsigned long start, unsigned long len,
if (!list_empty(&pagelist)) {
WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_page, NULL,
- start, MIGRATE_SYNC, MR_MEMPOLICY_MBIND);
+ start, MIGRATE_SYNC_WRITEOUT, MR_MEMPOLICY_MBIND);
if (nr_failed)
putback_movable_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 028814625eea..3e907354cfec 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -809,7 +809,7 @@ static int fallback_migrate_page(struct address_space *mapping,
{
if (PageDirty(page)) {
/* Only writeback pages in full synchronous migration */
- if (mode != MIGRATE_SYNC)
+ if (mode != MIGRATE_SYNC_WRITEOUT)
return -EBUSY;
return writeout(mapping, page);
}
@@ -938,7 +938,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
* the retry loop is too short and in the sync-light case,
* the overhead of stalling is too much
*/
- if (mode != MIGRATE_SYNC) {
+ if (mode < MIGRATE_SYNC) {
rc = -EBUSY;
goto out_unlock;
}
@@ -1187,7 +1187,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
return -ENOMEM;

if (!trylock_page(hpage)) {
- if (!force || mode != MIGRATE_SYNC)
+ if (!force || mode < MIGRATE_SYNC)
goto out;
lock_page(hpage);
}
@@ -1447,7 +1447,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = 0;
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_page_node, NULL,
- (unsigned long)pm, MIGRATE_SYNC, MR_SYSCALL);
+ (unsigned long)pm, MIGRATE_SYNC_WRITEOUT, MR_SYSCALL);
if (err)
putback_movable_pages(&pagelist);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d1da0ceaf1e..d80c9755ffc7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3030,8 +3030,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
* failure could be caused by weak migration mode.
*/
if (compaction_failed(compact_result)) {
- if (*migrate_mode == MIGRATE_ASYNC) {
- *migrate_mode = MIGRATE_SYNC_LIGHT;
+ if (*migrate_mode < MIGRATE_SYNC) {
+ *migrate_mode++;
return true;
}
return false;
--
Michal Hocko
SUSE Labs

2016-04-12 14:51:41

by Michal Hocko

Subject: Re: mmotm woes, mainly compaction

On Tue 12-04-16 14:10:20, Michal Hocko wrote:
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d1da0ceaf1e..d80c9755ffc7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3030,8 +3030,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> * failure could be caused by weak migration mode.
> */
> if (compaction_failed(compact_result)) {
> - if (*migrate_mode == MIGRATE_ASYNC) {
> - *migrate_mode = MIGRATE_SYNC_LIGHT;
> + if (*migrate_mode < MIGRATE_SYNC) {
> + *migrate_mode++;
> return true;

this should be (*migrate_mode)++ of course.

> }
> return false;
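
For the record, the precedence trap in a few lines of userspace C:
postfix ++ binds tighter than the unary *, so "*migrate_mode++" parses
as "*(migrate_mode++)", advancing the local pointer and discarding the
loaded value; only "(*migrate_mode)++" bumps the caller's mode.

#include <stdio.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

int main(void)
{
        enum migrate_mode mode = MIGRATE_ASYNC;
        enum migrate_mode *migrate_mode = &mode;

        *migrate_mode++;        /* pointer moves, pointed-to mode doesn't */
        printf("after *migrate_mode++:   %d\n", mode); /* still 0 */

        migrate_mode = &mode;
        (*migrate_mode)++;      /* increments the mode itself */
        printf("after (*migrate_mode)++: %d\n", mode); /* now 1 */
        return 0;
}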

--
Michal Hocko
SUSE Labs

2016-04-14 20:16:04

by Hugh Dickins

Subject: Re: mmotm woes, mainly compaction

On Tue, 12 Apr 2016, Michal Hocko wrote:
> On Tue 12-04-16 00:18:00, Hugh Dickins wrote:
> > Michal, I'm sorry to say that I now find that I misinformed you.
> >
> > You'll remember when we were chasing the order=2 OOMs on two of my
> > machines at the end of March (in private mail). And you sent me a
> > mail containing two patches, the second "Another thing to try ...
> > so this on top" doing a *migrate_mode++.
> >
> > I answered you definitively that the first patch worked,
> > so "I haven't tried adding the one below at all".
> >
> > Not true, I'm afraid. Although I had split the *migrate_mode++ one
> > off into a separate patch that I did not apply, I found looking back
> > today (when trying to work out why order=2 OOMs were still a problem
> > on mmotm 2016-04-06) that I never deleted that part from the end of
> > the first patch; so in fact what I'd been testing had included the
> > second; and now I find that _it_ was the effective solution.
> >
> > Which is particularly sad because I think we were both a bit
> > uneasy about the *migrate_mode++ one: partly the style of it
> > incrementing the enum; but more seriously that it advances all the
> > way to MIGRATE_SYNC, when the first went only to MIGRATE_SYNC_LIGHT.
>
> Yeah, I was thinking about this some more and I have dropped the
> MIGRATE_SYNC patch because this is just too dangerous. It gets all the
> way to writeout() and this is a great stack overflow hazard.

That's a very good point about MIGRATE_SYNC, which I never thought of.
Worried me for a moment, since I do use MIGRATE_SYNC from a reclaim
context in huge tmpfs (its shrinker): but in fact that one's not a
problem, because the only pages it's trying to migrate are its own,
and shmem migrates PageDirty without writeout(); whereas compaction
has to deal with pages from assorted unknown filesystems, including
the writeout() ones.

> But I
> guess we do not need this writeout, and wait_on_page_writeback (done from
> __unmap_and_move) would be sufficient. I was already thinking about
> splitting MIGRATE_SYNC into two states, one allowing the wait on events
> and the other allowing the writeout.

Might eventually turn out to be necessary, but not for my immediate
issue: see below.

>
> > But without it, I am still stuck with the order=2 OOMs.
> >
> > And worse: after establishing that that fixes the order=2 OOMs for
> > me on 4.6-rc2-mm1, I thought I'd better check that the three you
> > posted today (the 1/2 classzone_idx one, the 2/2 prevent looping
> > forever, and the "ction-abstract-compaction-feedback-to-helpers-fix";
> > but I'm too far behind to consider or try the RFC thp backoff one)
> > (a) did not surprisingly fix it on their own, and (b) worked well
> > with the *migrate_mode++ one added in.
>
> I am not really sure what you have been testing here. The hugetlb load
> or the same make on tmpfs? I would be really surprised if any of the
> above patches made any difference for the make workload.

Merely the make on tmpfs. Optimizing hugepage loads is something else,
that I'm just not going to worry about, until your end has settled down:
hence why I asked Andrew to back out my 30/31 and 31/31, to get out of
your way. Just the recognition of THP allocations is a tricky matter,
which gets even more confusing when chasing a moving target.

>
> > (a) as you'd expect, they did not help on their own; and (b) they
> > worked fine together on the G5 (until it hit the powerpc swapping
> > sigsegv, which I think the powerpc guys are hoping is a figment of
> > my imagination); but (b) they did not work fine together on the
> > laptop, that combination now gives it order=1 OOMs. Despair.
>
> Something is definitely wrong here.

Indeed! And you'll kick yourself when you read on :)

> I have already seen that compaction
> is sometimes giving surprising results. I have seen Vlastimil has posted
> some fixes so maybe this would be a side effect. I also hope to come up
> with some reasonable set of trace points to tell us more but let's see
> whether the order-2 issue can be solved first.
>
> This is still with the ugly enum++ but let's close our eyes and think about
> something nicer...
>
> Thanks!
> ---
> diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
> index ebf3d89a3919..e1947d7af63f 100644
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -6,11 +6,14 @@
> * on most operations but not ->writepage as the potential stall time
> * is too significant
> * MIGRATE_SYNC will block when migrating pages
> + * MIGRATE_SYNC_WRITEOUT will trigger the IO when migrating pages. Make sure
> + * to not use this flag from deep stacks.
> */
> enum migrate_mode {
> MIGRATE_ASYNC,
> MIGRATE_SYNC_LIGHT,
> MIGRATE_SYNC,
> + MIGRATE_SYNC_WRITEOUT,
> };

To tell the truth, I didn't even try applying your patch, because I knew
it was going to clash with my MIGRATE_SHMEM_RECOVERy one, and I didn't
want to spend time thinking about what the resultant tests ought to be
(!= this or == that or < the other or what?).

(And your patch seemed unnecessarily extensive, changing the meaning of
MIGRATE_SYNC, then adding a new mode beyond it to do what MIGRATE_SYNC
used to do.)

Since I wasn't worried about latency in the experiment, just order=2
OOMs, I instead tried the much simpler patch...

> @@ -938,7 +938,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> * the retry loop is too short and in the sync-light case,
> * the overhead of stalling is too much
> */
> - if (mode != MIGRATE_SYNC) {
> + if (mode < MIGRATE_SYNC) {
> rc = -EBUSY;
> goto out_unlock;
> }

... saying "if (mode == MIGRATE_ASYNC) {" there, so that
MIGRATE_SYNC_LIGHT would proceed to wait_on_page_writeback() when force
is set.

And that patch did not help at all, on either machine: so although
pages under writeback had been my suspicion, and motivation for your
patch, it just wasn't the issue.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3030,8 +3030,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> * failure could be caused by weak migration mode.
> */
> if (compaction_failed(compact_result)) {
> - if (*migrate_mode == MIGRATE_ASYNC) {
> - *migrate_mode = MIGRATE_SYNC_LIGHT;
> + if (*migrate_mode < MIGRATE_SYNC) {
> + *migrate_mode++;
> return true;
> }
> return false;

Thanks so much for your followup mail, pointing out that it should say
(*migrate_mode)++. I never noticed that. So the set of patches I had
been testing before was doing something quite random; I don't want to
think about exactly what it was doing. Yet it proved useful...

... because the thing that was really wrong (and I spent far too long
studying compaction.c before noticing in page_alloc.c) was this:

/*
* It can become very expensive to allocate transparent hugepages at
* fault, so use asynchronous memory compaction for THP unless it is
* khugepaged trying to collapse.
*/
if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
migration_mode = MIGRATE_SYNC_LIGHT;

Yes, but advancing migration_mode before should_compact_retry() checks
whether it was MIGRATE_ASYNC, so eliminating the retry when MIGRATE_ASYNC
compaction_failed(). And the bogus *migrate_mode++ code had appeared to
work by interposing an additional state for a retry.

So all I had to do to get OOM-free results, on both machines, was to
remove those lines quoted above. Now, no doubt it's wrong (for THP)
to remove them completely, but I don't want to second-guess you on
where to do the equivalent check: over to you for that.
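
The shape of that interaction, reduced to a userspace mock (the real
code is __alloc_pages_slowpath() and should_compact_retry() in
mm/page_alloc.c; this models only the ordering, nothing else):

#include <stdbool.h>
#include <stdio.h>

enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

/* models should_compact_retry(): retry harder iff we only tried async */
static bool should_compact_retry(enum migrate_mode *migrate_mode)
{
        if (*migrate_mode == MIGRATE_ASYNC) {
                *migrate_mode = MIGRATE_SYNC_LIGHT;
                return true;    /* go round again with a stronger mode */
        }
        return false;
}

int main(void)
{
        enum migrate_mode migration_mode = MIGRATE_ASYNC;

        /* ... the MIGRATE_ASYNC compaction attempt fails ... */

        /* the offending lines: promote before the retry check has run */
        migration_mode = MIGRATE_SYNC_LIGHT;

        /* so the retry branch can never be taken: premature OOM */
        printf("retry? %s\n",
               should_compact_retry(&migration_mode) ? "yes" : "no -> OOM");
        return 0;
}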

I'm over-optimistic when I say OOM-free: on the G5 yes; but I did
see an order=2 OOM after an hour on the laptop one time, and much
sooner when I applied your further three patches (classzone_idx etc),
again on the laptop, on one occasion but not another. Something not
quite right, but much easier to live with than before, and will need
a separate tedious investigation if it persists.

Earlier on, when all my suspicions were in compaction.c, I did make a
couple of probable fixes there, though neither helped out of my OOMs:

At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects
with -EBUSY: of course the writeback might complete in between, but
that's not what we usually expect, so probably better not to isolate it.

And where compact_zone() sets whole_zone, I tried a BUG_ON if
compact_scanners_met() already, and hit that as a real possibility
(only when compactors competing perhaps): without the BUG_ON, it
would proceed to compact_finished() COMPACT_COMPLETE without doing
any work at all - and the new should_compact_retry() code is placing
more faith in that compact_result than perhaps it deserves. No need
to BUG_ON then, just be stricter about setting whole_zone.

(I do worry about the skip hints: once in MIGRATE_SYNC_LIGHT mode,
compaction seemed good at finding pages to migrate, but not so good
at then finding enough free pages to migrate them into. Perhaps
there were none, but it left me suspicious.)

Vlastimil, thanks so much for picking up my bits and pieces a couple
of days ago: I think I'm going to impose upon you again with the below,
if that's not too irritating.

Signed-off-by: Hugh Dickins <[email protected]>

--- 4.6-rc2-mm1/mm/compaction.c 2016-04-11 11:35:08.536604712 -0700
+++ linux/mm/compaction.c 2016-04-13 23:17:03.671959715 -0700
@@ -1190,7 +1190,7 @@ static isolate_migrate_t isolate_migrate
struct page *page;
const isolate_mode_t isolate_mode =
(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
- (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+ (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);

/*
* Start at where we last stopped, or beginning of the zone as
@@ -1459,8 +1459,8 @@ static enum compact_result compact_zone(
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}

- if (cc->migrate_pfn == start_pfn)
- cc->whole_zone = true;
+ cc->whole_zone = cc->migrate_pfn == start_pfn &&
+ cc->free_pfn == pageblock_start_pfn(end_pfn - 1);

cc->last_migrated_pfn = 0;


2016-04-14 23:25:05

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/14/2016 10:15 PM, Hugh Dickins wrote:
> ... because the thing that was really wrong (and I spent far too long
> studying compaction.c before noticing in page_alloc.c) was this:
>
> /*
> * It can become very expensive to allocate transparent hugepages at
> * fault, so use asynchronous memory compaction for THP unless it is
> * khugepaged trying to collapse.
> */
> if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> migration_mode = MIGRATE_SYNC_LIGHT;
>
> Yes, but advancing migration_mode before should_compact_retry() checks
> whether it was MIGRATE_ASYNC, so eliminating the retry when MIGRATE_ASYNC
> compaction_failed(). And the bogus *migrate_mode++ code had appeared to
> work by interposing an additional state for a retry.

Ouch. But AFAICS the is_thp_gfp_mask() path is already addressed properly in
Michal's latest versions of the patches (patch 10/11 of "[PATCH 00/11] oom
detection rework v5"). So that wasn't what you tested this time? Maybe mmotm
still had the older version... But I guess Michal will know better than me.

> So all I had to do to get OOM-free results, on both machines, was to
> remove those lines quoted above. Now, no doubt it's wrong (for THP)
> to remove them completely, but I don't want to second-guess you on
> where to do the equivalent check: over to you for that.
>
> I'm over-optimistic when I say OOM-free: on the G5 yes; but I did
> see an order=2 OOM after an hour on the laptop one time, and much
> sooner when I applied your further three patches (classzone_idx etc),
> again on the laptop, on one occasion but not another. Something not
> quite right, but much easier to live with than before, and will need
> a separate tedious investigation if it persists.
>
> Earlier on, when all my suspicions were in compaction.c, I did make a
> couple of probable fixes there, though neither helped out of my OOMs:
>
> At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
> isolate a PageWriteback page, which __unmap_and_move() then rejects
> with -EBUSY: of course the writeback might complete in between, but
> that's not what we usually expect, so probably better not to isolate it.
>
> And where compact_zone() sets whole_zone, I tried a BUG_ON if
> compact_scanners_met() already, and hit that as a real possibility
> (only when compactors competing perhaps): without the BUG_ON, it
> would proceed to compact_finished() COMPACT_COMPLETE without doing
> any work at all - and the new should_compact_retry() code is placing
> more faith in that compact_result than perhaps it deserves. No need
> to BUG_ON then, just be stricter about setting whole_zone.

I warned Michal about the whole_zone settings being prone to races, thanks for
confirming it's real. Although, compact_scanners_met() being immediately true,
when the migration scanner is at the zone start, sounds really weird. That would
mean the free scanner had previously advanced as far as the first pageblock
to cache that pfn, which I wouldn't imagine except maybe in a tiny zone...
was it a ZONE_DMA? Or it's a sign of a larger issue.

Anyway, too bad fixing this didn't help the rare OOMs as that would be a
perfectly valid reason for them.

> (I do worry about the skip hints: once in MIGRATE_SYNC_LIGHT mode,
> compaction seemed good at finding pages to migrate, but not so good
> at then finding enough free pages to migrate them into. Perhaps
> there were none, but it left me suspicious.)

Skip hints could be another reason for the rare OOMs. What you describe sounds
like the reclaim between async and sync_light compaction attempts did free some
pages, but their pageblocks were not scanned due to the skip hints.
A test patch setting compact_control.ignore_skip_hint for sync_(light)
compaction could confirm (or also mask, I'm afraid) the issue.
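
Something along these lines, say (an untested debugging hack sketched
here, not a submitted patch; it assumes the same compact_control fields
used elsewhere in this thread), early in compact_zone():

        /* debug only: let sync(_light) compaction rescan skipped pageblocks */
        if (cc->mode != MIGRATE_ASYNC)
                cc->ignore_skip_hint = true;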

> Vlastimil, thanks so much for picking up my bits and pieces a couple
> of days ago: I think I'm going to impose upon you again with the below,
> if that's not too irritating.

No problem :)

> Signed-off-by: Hugh Dickins <[email protected]>
>
> --- 4.6-rc2-mm1/mm/compaction.c 2016-04-11 11:35:08.536604712 -0700
> +++ linux/mm/compaction.c 2016-04-13 23:17:03.671959715 -0700
> @@ -1190,7 +1190,7 @@ static isolate_migrate_t isolate_migrate
> struct page *page;
> const isolate_mode_t isolate_mode =
> (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
> - (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> + (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>
> /*
> * Start at where we last stopped, or beginning of the zone as

I'll check this optimization, thanks.

> @@ -1459,8 +1459,8 @@ static enum compact_result compact_zone(
> zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> }
>
> - if (cc->migrate_pfn == start_pfn)
> - cc->whole_zone = true;
> + cc->whole_zone = cc->migrate_pfn == start_pfn &&
> + cc->free_pfn == pageblock_start_pfn(end_pfn - 1);
>
> cc->last_migrated_pfn = 0;

This would be for Michal, but I agree.

2016-04-15 01:00:14

by Hugh Dickins

Subject: Re: mmotm woes, mainly compaction

On Fri, 15 Apr 2016, Vlastimil Babka wrote:
> On 04/14/2016 10:15 PM, Hugh Dickins wrote:
> > ... because the thing that was really wrong (and I spent far too long
> > studying compaction.c before noticing in page_alloc.c) was this:
> >
> > /*
> > * It can become very expensive to allocate transparent hugepages at
> > * fault, so use asynchronous memory compaction for THP unless it is
> > * khugepaged trying to collapse.
> > */
> > if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> > migration_mode = MIGRATE_SYNC_LIGHT;
> >
> > Yes, but advancing migration_mode before should_compact_retry() checks
> > whether it was MIGRATE_ASYNC, so eliminating the retry when MIGRATE_ASYNC
> > compaction_failed(). And the bogus *migrate_mode++ code had appeared to
> > work by interposing an additional state for a retry.
>
> Ouch. But AFAICS the is_thp_gfp_mask() path is already addressed properly in
> Michal's latest versions of the patches (patch 10/11 of "[PATCH 00/11] oom
> detection rework v5"). So that wasn't what you tested this time? Maybe mmotm
> still had the older version... But I guess Michal will know better than me.

What I've been testing is mmotm 2016-04-06-20-40, but (unsurprisingly)
linux next-20160414 is the same here. I think these were based on
Michal's oom detection rework v5.

This muckup is probably a relic from when Andrew had to resolve the
conflict between Michal's 10/11 and my 30/31: I thought we reverted
to Michal's as is here, that was the intention; but maybe there was
a misunderstanding and this block got left behind.

But I don't think it's causing trouble at the moment for anyone but me,
I being the one with the order=2 OOMs nobody else could see (though
I'm not sure what happened to Sergey's).

> >
> > And where compact_zone() sets whole_zone, I tried a BUG_ON if
> > compact_scanners_met() already, and hit that as a real possibility
> > (only when compactors competing perhaps): without the BUG_ON, it
> > would proceed to compact_finished() COMPACT_COMPLETE without doing
> > any work at all - and the new should_compact_retry() code is placing
> > more faith in that compact_result than perhaps it deserves. No need
> > to BUG_ON then, just be stricter about setting whole_zone.
>
> I warned Michal about the whole_zone settings being prone to races, thanks
> for confirming it's real. Although, compact_scanners_met() being immediately
> true, when the migration scanner is at the zone start, sounds really weird.
> That would mean the free scanner did previously advance as far as first
> pageblock to cache that pfn, which I wouldn't imagine except maybe a tiny
> zone... was it a ZONE_DMA? Or it's a sign of a larger issue.

I am constraining with mem=700M to do this swap testing (and all in the
one zone): smallish, but not tiny.

There is no locking when cc->migrate_pfn and cc->free_pfn are initialized
from zone->compact_cached*, so I assumed (but did not check) that it was
a consequence of a racing compactor updating zone->compact_cached_free_pfn
to the low pfn at an awkward moment; and separately I did see evidence of
the search for free sometimes skipping down lots of pageblocks, finding
too few pages free to place all the 32 migration candidates selected.
But I'm not at all familiar with the expectations here.

>
> Anyway, too bad fixing this didn't help the rare OOMs as that would be a
> perfectly valid reason for them.
>
> > (I do worry about the skip hints: once in MIGRATE_SYNC_LIGHT mode,
> > compaction seemed good at finding pages to migrate, but not so good
> > at then finding enough free pages to migrate them into. Perhaps
> > there were none, but it left me suspicious.)
>
> Skip hints could be another reason for the rare OOMs. What you describe
> sounds like the reclaim between async and sync_light compaction attempts did
> free some pages, but their pageblocks were not scanned due to the skip hints.
> A test patch setting compact_control.ignore_skip_hint for sync_(light)
> compaction could confirm (or also mask, I'm afraid) the issue.

Yes, at one stage I simply #ifdef'ed out respect for the skip hints.
But as soon as I'd found the real offender, I undid that hack and was
glad to find it was not needed - apart perhaps from those "rare OOMs":
I have no picture yet of how much of an issue they are (and since
I only saw them on the laptop from which I am replying to you,
and prefer not to be running a heavy swapping load while typing,
my testing is and will ever be very intermittent).

>
> > Vlastimil, thanks so much for picking up my bits and pieces a couple
> > of days ago: I think I'm going to impose upon you again with the below,
> > if that's not too irritating.
>
> No problem :)

Thanks!

>
> > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > --- 4.6-rc2-mm1/mm/compaction.c 2016-04-11 11:35:08.536604712 -0700
> > +++ linux/mm/compaction.c 2016-04-13 23:17:03.671959715 -0700
> > @@ -1190,7 +1190,7 @@ static isolate_migrate_t isolate_migrate
> > struct page *page;
> > const isolate_mode_t isolate_mode =
> > (sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE :
> > 0) |
> > - (cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> > + (cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> >
> > /*
> > * Start at where we last stopped, or beginning of the zone as
>
> I'll check this optimization, thanks.

Yes, please do (in /proc/vmstat, I could see the ratio of pgmigrate_success
to _fail go up, but also the ratio of compact_migrate_scanned to _isolated
go up).

>
> > @@ -1459,8 +1459,8 @@ static enum compact_result compact_zone(
> > zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> > }
> >
> > - if (cc->migrate_pfn == start_pfn)
> > - cc->whole_zone = true;
> > + cc->whole_zone = cc->migrate_pfn == start_pfn &&
> > + cc->free_pfn == pageblock_start_pfn(end_pfn - 1);
> >
> > cc->last_migrated_pfn = 0;
>
> This would be for Michal, but I agree.

Okay, thanks.

Hugh

2016-04-15 08:42:50

by Michal Hocko

Subject: Re: mmotm woes, mainly compaction

On Thu 14-04-16 13:15:51, Hugh Dickins wrote:
> On Tue, 12 Apr 2016, Michal Hocko wrote:
[...]
> > @@ -938,7 +938,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> > * the retry loop is too short and in the sync-light case,
> > * the overhead of stalling is too much
> > */
> > - if (mode != MIGRATE_SYNC) {
> > + if (mode < MIGRATE_SYNC) {
> > rc = -EBUSY;
> > goto out_unlock;
> > }
>
> ... saying "if (mode == MIGRATE_ASYNC) {" there, so that
> MIGRATE_SYNC_LIGHT would proceed to wait_on_page_writeback() when force.
>
> And that patch did not help at all, on either machine: so although
> pages under writeback had been my suspicion, and motivation for your
> patch, it just wasn't the issue.

OK, so it was not the writeback which blocked the compaction. This is
good to know. But I guess we want to wait for writeback long term. I kind
of like the MIGRATE_SYNC vs. MIGRATE_SYNC_WRITEOUT split. I will think
some more about whether to add it to the series or not.

> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3030,8 +3030,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> > * failure could be caused by weak migration mode.
> > */
> > if (compaction_failed(compact_result)) {
> > - if (*migrate_mode == MIGRATE_ASYNC) {
> > - *migrate_mode = MIGRATE_SYNC_LIGHT;
> > + if (*migrate_mode < MIGRATE_SYNC) {
> > + *migrate_mode++;
> > return true;
> > }
> > return false;
>
> Thanks so much for your followup mail, pointing out that it should say
> (*migrate_mode)++. I never noticed that. So the set of patches I had
> been testing before was doing something quite random, I don't want to
> think about exactly what it was doing. Yet proved useful...
>
> ... because the thing that was really wrong (and I spent far too long
> studying compaction.c before noticing in page_alloc.c) was this:
>
> /*
> * It can become very expensive to allocate transparent hugepages at
> * fault, so use asynchronous memory compaction for THP unless it is
> * khugepaged trying to collapse.
> */
> if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> migration_mode = MIGRATE_SYNC_LIGHT;
>
> Yes, but advancing migration_mode before should_compact_retry() checks
> whether it was MIGRATE_ASYNC, so eliminating the retry when MIGRATE_ASYNC
> compaction_failed(). And the bogus *migrate_mode++ code had appeared to
> work by interposing an additional state for a retry.

Ouch. My http://lkml.kernel.org/r/[email protected]
has moved the code down after the noretry: label for this very reason.
Andrew has asked about this because it caused a conflict when applying
both your and my patch series, but the end result contains both the
original code and my moved migrate_mode update for the noretry path.
I have even checked my patch when it was posted to the mm-commits ML, but
failed to notice that the to-be-removed hunk is still there!

That would mean that there is no change to be done in my original patch,
just to apply it properly in the mmotm tree. I will ask Andrew to drop the
whole series from the mmotm tree and will repost it sometime next week.
I have made some changes in the patch ordering which should make it easier
to merge in smaller parts, because the core part of the change should sit
longer in mmotm before it gets merged.

> So all I had to do to get OOM-free results, on both machines, was to
> remove those lines quoted above. Now, no doubt it's wrong (for THP)
> to remove them completely, but I don't want to second-guess you on
> where to do the equivalent check: over to you for that.

As I have tried to explain
http://lkml.kernel.org/r/[email protected] moving
the check down for thp is OK.

> I'm over-optimistic when I say OOM-free: on the G5 yes; but I did
> see an order=2 OOM after an hour on the laptop one time, and much
> sooner when I applied your further three patches (classzone_idx etc),

Well, the classzone_idx patch is fixing a long-term bug where we actually
never triggered OOM for order != 0. So it might be possible that a
previously existing issue was just papered over.

> again on the laptop, on one occasion but not another. Something not
> quite right, but much easier to live with than before, and will need
> a separate tedious investigation if it persists.

I really hope I will have some tracepoints ready soon which would help
to pinpoint what is going on there.

> Earlier on, when all my suspicions were in compaction.c, I did make a
> couple of probable fixes there, though neither helped out of my OOMs:
>
> At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
> isolate a PageWriteback page, which __unmap_and_move() then rejects
> with -EBUSY: of course the writeback might complete in between, but
> that's not what we usually expect, so probably better not to isolate it.

Those two definitely should be in sync.

> And where compact_zone() sets whole_zone, I tried a BUG_ON if
> compact_scanners_met() already, and hit that as a real possibility
> (only when compactors competing perhaps): without the BUG_ON, it
> would proceed to compact_finished() COMPACT_COMPLETE without doing
> any work at all - and the new should_compact_retry() code is placing
> more faith in that compact_result than perhaps it deserves. No need
> to BUG_ON then, just be stricter about setting whole_zone.

Interesting, I will have to check this closer.

Thanks again!
--
Michal Hocko
SUSE Labs

2016-04-18 19:14:27

by Vlastimil Babka

Subject: Re: mmotm woes, mainly compaction

On 04/14/2016 07:24 PM, Vlastimil Babka wrote:
>> > @@ -1459,8 +1459,8 @@ static enum compact_result compact_zone(
>> > zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
>> > }
>> >
>> > - if (cc->migrate_pfn == start_pfn)
>> > - cc->whole_zone = true;
>> > + cc->whole_zone = cc->migrate_pfn == start_pfn &&
>> > + cc->free_pfn == pageblock_start_pfn(end_pfn - 1);
>> >
>> > cc->last_migrated_pfn = 0;
> This would be for Michal, but I agree.

So there's an alternative here that wouldn't have the danger of missing
cc->whole_zone multiple times due to races: when resetting either scanner
to the zone boundary, reset the other as well (and set cc->whole_zone).
I think the situations where not doing that has any (performance)
advantage are rare.
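
A sketch of that alternative with stub types (only the shape of the
idea; the real fields are zone->compact_cached_*_pfn and struct
compact_control, and the real resets are pageblock-aligned):

#include <stdbool.h>
#include <stdio.h>

struct zone_stub {                      /* stands in for struct zone */
        unsigned long zone_start_pfn, zone_end_pfn;
        unsigned long cached_migrate_pfn, cached_free_pfn;
};

struct cc_stub {                        /* stands in for compact_control */
        unsigned long migrate_pfn, free_pfn;
        bool whole_zone;
};

static void scanner_setup(struct zone_stub *zone, struct cc_stub *cc)
{
        cc->migrate_pfn = zone->cached_migrate_pfn;
        cc->free_pfn = zone->cached_free_pfn;
        cc->whole_zone = false;

        /* if either cached position needs resetting to a zone boundary,
         * reset BOTH scanners, so whole_zone is known for sure instead
         * of being inferred from one racily-updated position */
        if (cc->migrate_pfn < zone->zone_start_pfn ||
            cc->migrate_pfn >= zone->zone_end_pfn ||
            cc->free_pfn < zone->zone_start_pfn ||
            cc->free_pfn >= zone->zone_end_pfn) {
                cc->migrate_pfn = zone->zone_start_pfn;
                cc->free_pfn = zone->zone_end_pfn - 1;
                zone->cached_migrate_pfn = cc->migrate_pfn;
                zone->cached_free_pfn = cc->free_pfn;
                cc->whole_zone = true;
        }
}

int main(void)
{
        struct zone_stub zone = { 0, 4096, 7777777, 4095 }; /* stale cache */
        struct cc_stub cc;

        scanner_setup(&zone, &cc);
        printf("whole_zone=%d migrate_pfn=%lu free_pfn=%lu\n",
               cc.whole_zone, cc.migrate_pfn, cc.free_pfn);
        return 0;
}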