2016-03-07 16:08:59

by Michal Hocko

Subject: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> Andrew,
> could you queue this one as well, please? This is more a band aid than a
> real solution which I will be working on as soon as I am able to
> reproduce the issue but the patch should help to some degree at least.

Joonsoo wasn't very happy about this approach so let me try a different
way. What do you think about the following? Hugh, Sergey, does it help
for your load? I have tested it with Hugh's load and there was no
major difference from the previous testing, so at least nothing has blown
up, but I am not able to reproduce the issue here.

Other changes in the compaction code are still needed, but I would like to
not depend on them right now.
---
From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 7 Mar 2016 15:30:37 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help, but at least make sure
that compaction was active before we go OOM: keep retrying even
if should_reclaim_retry tells us to OOM, as long as the last compaction
round was either inactive (deferred, skipped or bailed out early due to
contention) or told us to continue.

Additionally define COMPACT_NONE, which reflects the case where
compaction is completely disabled.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 2 ++
mm/page_alloc.c | 41 ++++++++++++++++++++++++-----------------
2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf64cc7..a4cec4a03f7d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H

+/* compaction disabled */
+#define COMPACT_NONE -1
/* Return values for compact_zone() and try_to_compact_pages() */
/* compaction didn't start as it was deferred due to past failures */
#define COMPACT_DEFERRED 0
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..f89e3cbfdf90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2819,28 +2819,22 @@ static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ unsigned long *compact_result)
{
- unsigned long compact_result;
struct page *page;

- if (!order)
+ if (!order) {
+ *compact_result = COMPACT_NONE;
return NULL;
+ }

current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC;

- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_SKIPPED)
return NULL;
- default:
- break;
- }

/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2875,8 +2869,9 @@ static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ unsigned long *compact_result)
{
+ *compact_result = COMPACT_NONE;
return NULL;
}
#endif /* CONFIG_COMPACTION */
@@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int alloc_flags;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
+ unsigned long compact_result;
int contended_compaction = COMPACT_CONTENDED_NONE;
int no_progress_loops = 0;

@@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;

@@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
@@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
did_some_progress > 0, no_progress_loops))
goto retry;

+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM.
+ */
+ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compact_result <= COMPACT_CONTINUE)
+ goto retry;
+ if (contended_compaction > COMPACT_CONTENDED_NONE)
+ goto retry;
+ }
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
--
2.7.0

--
Michal Hocko
SUSE Labs


2016-03-08 03:49:52

by Sergey Senozhatsky

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

Hello Michal,

On (03/07/16 17:08), Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.

(next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")

seems it's significantly less likely to oom-kill now, but I still can see
something like this

[ 501.942745] coretemp-sensor invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[ 501.942796] CPU: 3 PID: 409 Comm: coretemp-sensor Not tainted 4.5.0-rc6-next-20160307-dbg-00015-g8a56edd-dirty #250
[ 501.942801] 0000000000000000 ffff88013114fb88 ffffffff812364e9 0000000000000000
[ 501.942804] ffff88013114fd28 ffff88013114fbf8 ffffffff8113b11c ffff88013114fba8
[ 501.942807] ffffffff810835c1 ffff88013114fbc8 0000000000000206 ffffffff81a46de0
[ 501.942808] Call Trace:
[ 501.942813] [<ffffffff812364e9>] dump_stack+0x67/0x90
[ 501.942817] [<ffffffff8113b11c>] dump_header.isra.5+0x54/0x359
[ 501.942820] [<ffffffff810835c1>] ? trace_hardirqs_on+0xd/0xf
[ 501.942823] [<ffffffff810f97c2>] oom_kill_process+0x89/0x503
[ 501.942825] [<ffffffff810f9ffe>] out_of_memory+0x372/0x38d
[ 501.942827] [<ffffffff810fe5ae>] __alloc_pages_nodemask+0x9b6/0xa92
[ 501.942830] [<ffffffff810fe882>] alloc_kmem_pages_node+0x1b/0x1d
[ 501.942833] [<ffffffff81041f86>] copy_process.part.9+0xfe/0x17f4
[ 501.942835] [<ffffffff810858f6>] ? lock_acquire+0x10f/0x1a3
[ 501.942837] [<ffffffff8104380f>] _do_fork+0xbd/0x5da
[ 501.942838] [<ffffffff81083598>] ? trace_hardirqs_on_caller+0x16c/0x188
[ 501.942842] [<ffffffff81001a79>] ? do_syscall_64+0x18/0xe6
[ 501.942844] [<ffffffff81043db2>] SyS_clone+0x19/0x1b
[ 501.942845] [<ffffffff81001abb>] do_syscall_64+0x5a/0xe6
[ 501.942848] [<ffffffff8151245a>] entry_SYSCALL64_slow_path+0x25/0x25
[ 501.942850] Mem-Info:
[ 501.942853] active_anon:151312 inactive_anon:54791 isolated_anon:0
active_file:31213 inactive_file:302048 isolated_file:0
unevictable:0 dirty:44 writeback:221 unstable:0
slab_reclaimable:43570 slab_unreclaimable:5651
mapped:16660 shmem:29495 pagetables:2542 bounce:0
free:10884 free_pcp:214 free_cma:0
[ 501.942859] DMA free:14896kB min:28kB low:40kB high:52kB active_anon:0kB inactive_anon:0kB active_file:96kB inactive_file:104kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:124kB shmem:0kB slab_reclaimable:28kB slab_unreclaimable:108kB kernel_stack:16kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 501.942862] lowmem_reserve[]: 0 3031 3855 3855
[ 501.942867] DMA32 free:23664kB min:6232kB low:9332kB high:12432kB active_anon:516228kB inactive_anon:129136kB active_file:96508kB inactive_file:954780kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107512kB mlocked:0kB dirty:136kB writeback:440kB mapped:51816kB shmem:91488kB slab_reclaimable:129856kB slab_unreclaimable:13876kB kernel_stack:2160kB pagetables:7888kB unstable:0kB bounce:0kB free_pcp:724kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 501.942870] lowmem_reserve[]: 0 0 824 824
[ 501.942876] Normal free:4784kB min:1696kB low:2540kB high:3384kB active_anon:89020kB inactive_anon:90028kB active_file:28248kB inactive_file:253308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:40kB writeback:444kB mapped:14700kB shmem:26492kB slab_reclaimable:44396kB slab_unreclaimable:8620kB kernel_stack:1328kB pagetables:2280kB unstable:0kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:60 all_unreclaimable? no
[ 501.942879] lowmem_reserve[]: 0 0 0 0
[ 501.942902] DMA: 6*4kB (UME) 3*8kB (M) 2*16kB (UM) 3*32kB (ME) 2*64kB (ME) 2*128kB (ME) 2*256kB (UE) 3*512kB (UME) 2*1024kB (ME) 1*2048kB (E) 2*4096kB (M) = 14896kB
[ 501.942912] DMA32: 564*4kB (UME) 2700*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23856kB
[ 501.942921] Normal: 959*4kB (ME) 128*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4860kB
[ 501.942922] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 501.942923] 362670 total pagecache pages
[ 501.942924] 0 pages in swap cache
[ 501.942926] Swap cache stats: add 150, delete 150, find 0/0
[ 501.942926] Free swap = 8388504kB
[ 501.942927] Total swap = 8388604kB
[ 501.942928] 1032092 pages RAM
[ 501.942928] 0 pages HighMem/MovableOnly
[ 501.942929] 40111 pages reserved
[ 501.942930] 0 pages hwpoisoned
[ 501.942930] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 501.942935] [ 162] 0 162 15823 1065 35 3 0 0 systemd-journal
[ 501.942991] [ 186] 0 186 8586 1054 19 3 0 -1000 systemd-udevd
[ 501.942993] [ 287] 0 287 3557 651 12 3 0 0 crond
[ 501.942995] [ 288] 81 288 8159 775 20 3 0 -900 dbus-daemon
[ 501.942997] [ 289] 0 289 3843 518 13 3 0 0 systemd-logind
[ 501.942999] [ 294] 0 294 22455 856 47 3 0 0 login
[ 501.943001] [ 302] 1000 302 8481 1029 20 3 0 0 systemd
[ 501.943003] [ 304] 1000 304 24212 438 47 3 0 0 (sd-pam)
[ 501.943005] [ 309] 1000 309 4431 1123 14 3 0 0 bash
[ 501.943007] [ 316] 1000 316 3712 764 13 3 0 0 startx
[ 501.943009] [ 338] 1000 338 3976 255 14 3 0 0 xinit
[ 501.943012] [ 339] 1000 339 44397 11311 90 3 0 0 Xorg
[ 501.943014] [ 341] 1000 341 39703 4045 78 3 0 0 openbox
[ 501.943016] [ 352] 1000 352 43465 2997 86 4 0 0 tint2
[ 501.943018] [ 355] 1000 355 33962 4351 57 3 0 0 urxvt
[ 501.943020] [ 356] 1000 356 4466 1155 13 3 0 0 bash
[ 501.943022] [ 359] 1000 359 4433 1116 13 3 0 0 bash
[ 501.943024] [ 364] 1000 364 49365 6236 62 3 0 0 urxvt
[ 501.943026] [ 365] 1000 365 4433 1093 15 3 0 0 bash
[ 501.943028] [ 368] 1000 368 5203 745 15 3 0 0 tmux
[ 501.943030] [ 370] 1000 370 6336 1374 17 3 0 0 tmux
[ 501.943046] [ 371] 1000 371 4433 1100 14 3 0 0 bash
[ 501.943049] [ 378] 1000 378 4433 1115 13 3 0 0 bash
[ 501.943051] [ 381] 1000 381 5203 763 16 3 0 0 tmux
[ 501.943053] [ 382] 1000 382 4433 1089 15 3 0 0 bash
[ 501.943055] [ 389] 1000 389 4433 1078 15 3 0 0 bash
[ 501.943057] [ 392] 1000 392 4433 1078 15 3 0 0 bash
[ 501.943058] [ 395] 1000 395 4433 1090 14 3 0 0 bash
[ 501.943060] [ 398] 1000 398 4433 1111 14 3 0 0 bash
[ 501.943062] [ 401] 1000 401 10126 1010 25 3 0 0 top
[ 501.943064] [ 403] 1000 403 4433 1129 14 3 0 0 bash
[ 501.943066] [ 409] 1000 409 3740 786 13 3 0 0 coretemp-sensor
[ 501.943069] [ 443] 1000 443 25873 3141 51 3 0 0 urxvt
[ 501.943071] [ 444] 1000 444 4433 1110 13 3 0 0 bash
[ 501.943073] [ 447] 1000 447 68144 55547 138 3 0 0 mutt
[ 501.943075] [ 450] 1000 450 29966 3825 51 3 0 0 urxvt
[ 501.943077] [ 451] 1000 451 4433 1117 14 3 0 0 bash
[ 501.943079] [ 456] 1000 456 29967 3793 53 3 0 0 urxvt
[ 501.943081] [ 457] 1000 457 4433 1085 14 3 0 0 bash
[ 501.943083] [ 462] 1000 462 29967 3845 51 4 0 0 urxvt
[ 501.943085] [ 463] 1000 463 4433 1093 14 3 0 0 bash
[ 501.943087] [ 468] 1000 468 29967 3793 50 3 0 0 urxvt
[ 501.943089] [ 469] 1000 469 4433 1086 15 3 0 0 bash
[ 501.943091] [ 493] 1000 493 52976 6416 69 3 0 0 urxvt
[ 501.943093] [ 494] 1000 494 4433 1106 14 3 0 0 bash
[ 501.943095] [ 499] 1000 499 29966 3792 54 3 0 0 urxvt
[ 501.943097] [ 500] 1000 500 4433 1078 14 3 0 0 bash
[ 501.943099] [ 525] 0 525 17802 1108 38 3 0 0 sudo
[ 501.943101] [ 528] 0 528 186583 768 207 4 0 0 journalctl
[ 501.943103] [ 550] 1000 550 42144 9259 66 4 0 0 urxvt
[ 501.943105] [ 551] 1000 551 4433 1067 14 4 0 0 bash
[ 501.943107] [ 557] 1000 557 11115 768 27 3 0 0 su
[ 501.943109] [ 579] 0 579 4462 1148 13 3 0 0 bash
[ 501.943111] [ 963] 1000 963 4433 1075 14 3 0 0 bash
[ 501.943113] [ 981] 1000 981 4433 1114 13 3 0 0 bash
[ 501.943115] [ 993] 1000 993 4432 1118 14 3 0 0 bash
[ 501.943117] [ 1062] 1000 1062 5203 734 15 3 0 0 tmux
[ 501.943119] [ 1063] 1000 1063 13805 10479 32 3 0 0 bash
[ 501.943121] [ 1145] 1000 1145 4466 1144 14 3 0 0 bash
[ 501.943123] [ 4331] 1000 4331 287422 64040 429 4 0 0 firefox
[ 501.943125] [ 4440] 1000 4440 8132 761 20 3 0 0 dbus-daemon
[ 501.943127] [ 4470] 1000 4470 83823 934 31 4 0 0 at-spi-bus-laun
[ 501.943129] [17875] 1000 17875 7549 1926 20 3 0 0 vim
[ 501.943131] [27066] 1000 27066 4432 1120 15 3 0 0 bash
[ 501.943133] [27073] 1000 27073 4432 1071 13 3 0 0 bash
[ 501.943135] [27079] 1000 27079 4432 1077 15 3 0 0 bash
[ 501.943137] [27085] 1000 27085 4432 1080 14 3 0 0 bash
[ 501.943139] [27091] 1000 27091 4432 1091 14 3 0 0 bash
[ 501.943141] [27097] 1000 27097 4432 1096 15 3 0 0 bash
[ 501.943143] [ 1235] 0 1235 3745 809 11 3 0 0 zram-test.sh
[ 501.943145] [ 2316] 1000 2316 1759 166 9 3 0 0 sleep
[ 501.943147] [ 2323] 0 2323 3302 1946 12 3 0 0 dd
[ 501.943148] Out of memory: Kill process 4331 (firefox) score 20 or sacrifice child
[ 501.943352] Killed process 4331 (firefox) total-vm:1149688kB, anon-rss:207844kB, file-rss:48172kB, shmem-rss:516kB

-ss

2016-03-08 09:08:29

by Michal Hocko

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Tue 08-03-16 12:51:04, Sergey Senozhatsky wrote:
> Hello Michal,
>
> On (03/07/16 17:08), Michal Hocko wrote:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > Andrew,
> > > could you queue this one as well, please? This is more a band aid than a
> > > real solution which I will be working on as soon as I am able to
> > > reproduce the issue but the patch should help to some degree at least.
> >
> > Joonsoo wasn't very happy about this approach so let me try a different
> > way. What do you think about the following? Hugh, Sergey does it help
> > for your load? I have tested it with the Hugh's load and there was no
> > major difference from the previous testing so at least nothing has blown
> > up as I am not able to reproduce the issue here.
>
> (next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")
>
> seems it's significantly less likely to oom-kill now, but I still can see
> something like this

Thanks for the testing. This is highly appreciated. If you are able to
reproduce this then collecting compaction related tracepoints might be
really helpful.
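
A rough sketch of how to capture them (assuming tracefs is mounted at the
usual /sys/kernel/debug/tracing; the output file name is just an example)
would be:

$ echo 1 > /sys/kernel/debug/tracing/events/compaction/enable
$ cat /sys/kernel/debug/tracing/trace_pipe > compaction-trace.log

Run the load and send the log along; mm_compaction_suitable and
mm_compaction_finished would probably be the most interesting events here.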

> [ 501.942745] coretemp-sensor invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
> [ 501.942853] active_anon:151312 inactive_anon:54791 isolated_anon:0
> active_file:31213 inactive_file:302048 isolated_file:0
> unevictable:0 dirty:44 writeback:221 unstable:0
> slab_reclaimable:43570 slab_unreclaimable:5651
> mapped:16660 shmem:29495 pagetables:2542 bounce:0
> free:10884 free_pcp:214 free_cma:0
[...]
> [ 501.942867] DMA32 free:23664kB min:6232kB low:9332kB high:12432kB active_anon:516228kB inactive_anon:129136kB active_file:96508kB inactive_file:954780kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3194880kB managed:3107512kB mlocked:0kB dirty:136kB writeback:440kB mapped:51816kB shmem:91488kB slab_reclaimable:129856kB slab_unreclaimable:13876kB kernel_stack:2160kB pagetables:7888kB unstable:0kB bounce:0kB free_pcp:724kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
> [ 501.942870] lowmem_reserve[]: 0 0 824 824
> [ 501.942876] Normal free:4784kB min:1696kB low:2540kB high:3384kB active_anon:89020kB inactive_anon:90028kB active_file:28248kB inactive_file:253308kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:917504kB managed:844512kB mlocked:0kB dirty:40kB writeback:444kB mapped:14700kB shmem:26492kB slab_reclaimable:44396kB slab_unreclaimable:8620kB kernel_stack:1328kB pagetables:2280kB unstable:0kB bounce:0kB free_pcp:244kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:60 all_unreclaimable? no

Both DMA32 and Normal zones are above their high watermarks, so this OOM is
due to memory fragmentation.

> [ 501.942912] DMA32: 564*4kB (UME) 2700*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23856kB
> [ 501.942921] Normal: 959*4kB (ME) 128*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4860kB

There are no order-2+ pages usable even though we know that
compaction was active and didn't back out early. I might be missing
something of course, and the patch might still be tweaked to be more
conservative. Tracepoints should tell us more though.
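
(As an aside, a very low-tech way to watch whether order-2+ pages ever
reappear during the run is something like

$ watch -n1 cat /proc/buddyinfo

and, if debugfs is mounted, /sys/kernel/debug/extfrag/extfrag_index gives
a per-order view of how fragmented each zone is.)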

Thanks!
--
Michal Hocko
SUSE Labs

2016-03-08 09:23:34

by Sergey Senozhatsky

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On (03/08/16 10:08), Michal Hocko wrote:
> On Tue 08-03-16 12:51:04, Sergey Senozhatsky wrote:
> > Hello Michal,
> >
> > On (03/07/16 17:08), Michal Hocko wrote:
> > > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > > Andrew,
> > > > could you queue this one as well, please? This is more a band aid than a
> > > > real solution which I will be working on as soon as I am able to
> > > > reproduce the issue but the patch should help to some degree at least.
> > >
> > > Joonsoo wasn't very happy about this approach so let me try a different
> > > way. What do you think about the following? Hugh, Sergey does it help
> > > for your load? I have tested it with the Hugh's load and there was no
> > > major difference from the previous testing so at least nothing has blown
> > > up as I am not able to reproduce the issue here.
> >
> > (next-20160307 + "[PATCH] mm, oom: protect !costly allocations some more")
> >
> > seems it's significantly less likely to oom-kill now, but I still can see
> > something like this
>
> Thanks for the testing. This is highly appreciated. If you are able to
> reproduce this then collecting compaction related tracepoints might be
> really helpful.
>

oh, wow... compaction is disabled, somehow.

$ zcat /proc/config.gz | grep -i CONFIG_COMPACTION
# CONFIG_COMPACTION is not set

I should have checked that, sorry!

will enable and re-test.
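
(For the record, a minimal way to flip it back on - using the in-tree
helper and assuming the usual .config in the build directory - is:

$ ./scripts/config -e COMPACTION
$ make olddefconfig

or just select "Allow for memory compaction" in menuconfig.)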

-ss

2016-03-08 09:25:10

by Vlastimil Babka

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On 03/07/2016 05:08 PM, Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> Andrew,
>> could you queue this one as well, please? This is more a band aid than a
>> real solution which I will be working on as soon as I am able to
>> reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
>
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.
> ---
> From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 7 Mar 2016 15:30:37 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if the last compaction round
> was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
>
> Additionally define COMPACT_NONE which reflects cases where the
> compaction is completely disabled.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/compaction.h | 2 ++
> mm/page_alloc.c | 41 ++++++++++++++++++++++++-----------------
> 2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..a4cec4a03f7d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,8 @@
> #ifndef _LINUX_COMPACTION_H
> #define _LINUX_COMPACTION_H
>
> +/* compaction disabled */
> +#define COMPACT_NONE -1
> /* Return values for compact_zone() and try_to_compact_pages() */
> /* compaction didn't start as it was deferred due to past failures */
> #define COMPACT_DEFERRED 0
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f89e3cbfdf90 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,22 @@ static struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + unsigned long *compact_result)
> {
> - unsigned long compact_result;
> struct page *page;
>
> - if (!order)
> + if (!order) {
> + *compact_result = COMPACT_NONE;
> return NULL;
> + }
>
> current->flags |= PF_MEMALLOC;
> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> - switch (compact_result) {
> - case COMPACT_DEFERRED:
> - *deferred_compaction = true;
> - /* fall-through */
> - case COMPACT_SKIPPED:
> + if (*compact_result <= COMPACT_SKIPPED)

COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
work as expected.

> return NULL;
> - default:
> - break;
> - }
>
> /*
> * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2875,8 +2869,9 @@ static inline struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + unsigned long *compact_result)
> {
> + *compact_result = COMPACT_NONE;
> return NULL;
> }
> #endif /* CONFIG_COMPACTION */
> @@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int alloc_flags;
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> - bool deferred_compaction = false;
> + unsigned long compact_result;
> int contended_compaction = COMPACT_CONTENDED_NONE;
> int no_progress_loops = 0;
>
> @@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
> migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
>
> @@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * to heavily disrupt the system, so we fail the allocation
> * instead of entering direct reclaim.
> */
> - if (deferred_compaction)
> + if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> did_some_progress > 0, no_progress_loops))
> goto retry;
>
> + /*
> + * !costly allocations are really important and we have to make sure
> + * the compaction wasn't deferred or didn't bail out early due to locks
> + * contention before we go OOM.
> + */
> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> + if (compact_result <= COMPACT_CONTINUE)

Same here.
I was going to say that this didn't have effect on Sergey's test, but
turns out it did :)

> + goto retry;
> + if (contended_compaction > COMPACT_CONTENDED_NONE)
> + goto retry;
> + }
> +
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> if (page)
> @@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> ac, migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
> nopage:
>

2016-03-08 09:31:15

by Sergey Senozhatsky

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On (03/08/16 10:24), Vlastimil Babka wrote:
[..]
> > @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > did_some_progress > 0, no_progress_loops))
> > goto retry;
> >
> > + /*
> > + * !costly allocations are really important and we have to make sure
> > + * the compaction wasn't deferred or didn't bail out early due to locks
> > + * contention before we go OOM.
> > + */
> > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> > + if (compact_result <= COMPACT_CONTINUE)
>
> Same here.
> I was going to say that this didn't have effect on Sergey's test, but
> turns out it did :)

I'm sorry, my test is not correct. I disabled compaction last weekend on
purpose - to provoke more OOM-kills and OOM conditions for testing the
reworked printk() patch set (http://marc.info/?l=linux-kernel&m=145734549308803) -
and I forgot to re-enable it.

-ss

2016-03-08 09:46:22

by Michal Hocko

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On Tue 08-03-16 10:24:56, Vlastimil Babka wrote:
[...]
> > @@ -2819,28 +2819,22 @@ static struct page *
> > __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > int alloc_flags, const struct alloc_context *ac,
> > enum migrate_mode mode, int *contended_compaction,
> > - bool *deferred_compaction)
> > + unsigned long *compact_result)
> > {
> > - unsigned long compact_result;
> > struct page *page;
> >
> > - if (!order)
> > + if (!order) {
> > + *compact_result = COMPACT_NONE;
> > return NULL;
> > + }
> >
> > current->flags |= PF_MEMALLOC;
> > - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> > + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> > mode, contended_compaction);
> > current->flags &= ~PF_MEMALLOC;
> >
> > - switch (compact_result) {
> > - case COMPACT_DEFERRED:
> > - *deferred_compaction = true;
> > - /* fall-through */
> > - case COMPACT_SKIPPED:
> > + if (*compact_result <= COMPACT_SKIPPED)
>
> COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
> work as expected.

Well, COMPACT_NONE is documented as /* compaction disabled */ so we
should never get it from try_to_compact_pages.

[...]
> > @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > did_some_progress > 0, no_progress_loops))
> > goto retry;
> >
> > + /*
> > + * !costly allocations are really important and we have to make sure
> > + * the compaction wasn't deferred or didn't bail out early due to locks
> > + * contention before we go OOM.
> > + */
> > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> > + if (compact_result <= COMPACT_CONTINUE)
>
> Same here.
> I was going to say that this didn't have effect on Sergey's test, but
> turns out it did :)

This should work as expected because compact_result is unsigned long
and so this is unsigned arithmetic. I can make
#define COMPACT_NONE -1UL

to make the intention more obvious if you prefer, though.
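
For illustration, a tiny userspace-only sketch (constants re-declared
locally, nothing kernel-specific) of how the comparison behaves once the
-1 is stored in an unsigned long:

#include <stdio.h>

#define COMPACT_NONE		-1	/* becomes ULONG_MAX in an unsigned long */
#define COMPACT_SKIPPED		1
#define COMPACT_CONTINUE	2

int main(void)
{
	unsigned long compact_result = COMPACT_NONE;

	/* both print 0, so a COMPACT_NONE result never triggers the !costly retry */
	printf("%d\n", compact_result <= COMPACT_SKIPPED);
	printf("%d\n", compact_result <= COMPACT_CONTINUE);
	return 0;
}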

Thanks for the review.
--
Michal Hocko
SUSE Labs

2016-03-08 09:52:25

by Vlastimil Babka

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On 03/08/2016 10:46 AM, Michal Hocko wrote:
> On Tue 08-03-16 10:24:56, Vlastimil Babka wrote:
> [...]
>>> @@ -2819,28 +2819,22 @@ static struct page *
>>> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>> int alloc_flags, const struct alloc_context *ac,
>>> enum migrate_mode mode, int *contended_compaction,
>>> - bool *deferred_compaction)
>>> + unsigned long *compact_result)
>>> {
>>> - unsigned long compact_result;
>>> struct page *page;
>>>
>>> - if (!order)
>>> + if (!order) {
>>> + *compact_result = COMPACT_NONE;
>>> return NULL;
>>> + }
>>>
>>> current->flags |= PF_MEMALLOC;
>>> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
>>> mode, contended_compaction);
>>> current->flags &= ~PF_MEMALLOC;
>>>
>>> - switch (compact_result) {
>>> - case COMPACT_DEFERRED:
>>> - *deferred_compaction = true;
>>> - /* fall-through */
>>> - case COMPACT_SKIPPED:
>>> + if (*compact_result <= COMPACT_SKIPPED)
>>
>> COMPACT_NONE is -1 and compact_result is unsigned long, so this won't
>> work as expected.
>
> Well, COMPACT_NONE is documented as /* compaction disabled */ so we
> should never get it from try_to_compact_pages.

Right.

>
> [...]
>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>> did_some_progress > 0, no_progress_loops))
>>> goto retry;
>>>
>>> + /*
>>> + * !costly allocations are really important and we have to make sure
>>> + * the compaction wasn't deferred or didn't bail out early due to locks
>>> + * contention before we go OOM.
>>> + */
>>> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
>>> + if (compact_result <= COMPACT_CONTINUE)
>>
>> Same here.
>> I was going to say that this didn't have effect on Sergey's test, but
>> turns out it did :)
>
> This should work as expected because compact_result is unsigned long
> and so this is the unsigned arithmetic. I can make
> #define COMPACT_NONE -1UL
>
> to make the intention more obvious if you prefer, though.

Well, what wasn't obvious to me is that here (unlike in the
test above) it was actually intended that COMPACT_NONE doesn't result in
a retry. But it makes sense; otherwise we would retry endlessly if
reclaim couldn't form a higher-order page, right.

> Thanks for the review.
>

2016-03-08 09:57:15

by Sergey Senozhatsky

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On (03/07/16 17:08), Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
>
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.

works fine for me.

$ cat /proc/vmstat | egrep -e "compact|swap"
pgsteal_kswapd_dma 7
pgsteal_kswapd_dma32 6457075
pgsteal_kswapd_normal 1462767
pgsteal_kswapd_movable 0
pgscan_kswapd_dma 18
pgscan_kswapd_dma32 6544126
pgscan_kswapd_normal 1495604
pgscan_kswapd_movable 0
kswapd_inodesteal 29
kswapd_low_wmark_hit_quickly 1168
kswapd_high_wmark_hit_quickly 1627
compact_migrate_scanned 5762793
compact_free_scanned 54090239
compact_isolated 1303895
compact_stall 1542
compact_fail 1117
compact_success 425
compact_kcompatd_wake 0

no OOM-kills after 6 rounds of tests.

Tested-by: Sergey Senozhatsky <[email protected]>

thanks!

-ss

2016-03-08 10:10:38

by Michal Hocko

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
> On 03/08/2016 10:46 AM, Michal Hocko wrote:
[...]
> >>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>> did_some_progress > 0, no_progress_loops))
> >>> goto retry;
> >>>
> >>> + /*
> >>> + * !costly allocations are really important and we have to make sure
> >>> + * the compaction wasn't deferred or didn't bail out early due to locks
> >>> + * contention before we go OOM.
> >>> + */
> >>> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>> + if (compact_result <= COMPACT_CONTINUE)
> >>
> >> Same here.
> >> I was going to say that this didn't have effect on Sergey's test, but
> >> turns out it did :)
> >
> > This should work as expected because compact_result is unsigned long
> > and so this is the unsigned arithmetic. I can make
> > #define COMPACT_NONE -1UL
> >
> > to make the intention more obvious if you prefer, though.
>
> Well, what wasn't obvious to me is actually that here (unlike in the
> test above) it was actually intended that COMPACT_NONE doesn't result in
> a retry. But it makes sense, otherwise we would retry endlessly if
> reclaim couldn't form a higher-order page, right.

Yeah, that was the whole point. An alternative would be moving the test
into should_compact_retry(order, compact_result, contended_compaction)
which would be CONFIG_COMPACTION specific so we can get rid of the
COMPACT_NONE altogether. Something like the following. We would lose the
always initialized compact_result but this would matter only for
order==0 and we check for that. Even gcc doesn't complain.

A more important question is whether the criteria I have chosen are
reasonable and reasonably independent of the particular implementation
of the compaction. I still cannot convince myself about the convergence
here. Is it possible that the compaction would keep returning
compact_result <= COMPACT_CONTINUE while not making any progress at all?

Sure, we can see a case where somebody is stealing the compacted blocks,
but that is much the same as with order-0, where parallel mem eaters will
piggy-back on the reclaimer and there is no upper boundary there either.

---
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a4cec4a03f7d..4cd4ddf64cc7 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,8 +1,6 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H

-/* compaction disabled */
-#define COMPACT_NONE -1
/* Return values for compact_zone() and try_to_compact_pages() */
/* compaction didn't start as it was deferred due to past failures */
#define COMPACT_DEFERRED 0
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f89e3cbfdf90..c5932a218fc6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2823,10 +2823,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
{
struct page *page;

- if (!order) {
- *compact_result = COMPACT_NONE;
+ if (!order)
return NULL;
- }

current->flags |= PF_MEMALLOC;
*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
@@ -2864,6 +2862,25 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, unsigned long compact_result,
+ int contended_compaction)
+{
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM.
+ */
+ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compact_result <= COMPACT_CONTINUE)
+ return true;
+ if (contended_compaction > COMPACT_CONTENDED_NONE)
+ return true;
+ }
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -2871,9 +2888,15 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum migrate_mode mode, int *contended_compaction,
unsigned long *compact_result)
{
- *compact_result = COMPACT_NONE;
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, unsigned long compact_result,
+ int contended_compaction)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */

/* Perform direct synchronous page reclaim */
@@ -3289,17 +3312,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
did_some_progress > 0, no_progress_loops))
goto retry;

- /*
- * !costly allocations are really important and we have to make sure
- * the compaction wasn't deferred or didn't bail out early due to locks
- * contention before we go OOM.
- */
- if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
- if (compact_result <= COMPACT_CONTINUE)
- goto retry;
- if (contended_compaction > COMPACT_CONTENDED_NONE)
- goto retry;
- }
+ if (should_compact_retry(order, compact_result, contended_compaction))
+ goto retry;

/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
--
Michal Hocko
SUSE Labs

2016-03-08 10:38:53

by Hugh Dickins

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Mon, 7 Mar 2016, Michal Hocko wrote:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > Andrew,
> > could you queue this one as well, please? This is more a band aid than a
> > real solution which I will be working on as soon as I am able to
> > reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help
> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.

Did not help with my load at all, I'm afraid: quite the reverse,
OOMed very much sooner (as usual on order=2), and with much more
noise (multiple OOMs) than your previous patch.

vmstats.xz attached; sorry, I don't have tracing built in,
and must move on to the powerpc issue before going back to bed.

I do hate replying without having something constructive to say, but
have very little time to think about this, and no bright ideas so far.

I do not understand why it's so easy for me to reproduce, yet impossible
for you - unless it's that you are still doing all your testing in a VM?
Is Sergey the only other to see this issue?

Hugh


Attachments:
vmstats.xz (8.01 kB)

2016-03-08 11:12:33

by Vlastimil Babka

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On 03/08/2016 11:10 AM, Michal Hocko wrote:
> On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
>> On 03/08/2016 10:46 AM, Michal Hocko wrote:
> [...]
>>>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>>> did_some_progress > 0, no_progress_loops))
>>>>> goto retry;
>>>>>
>>>>> + /*
>>>>> + * !costly allocations are really important and we have to make sure
>>>>> + * the compaction wasn't deferred or didn't bail out early due to locks
>>>>> + * contention before we go OOM.
>>>>> + */
>>>>> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
>>>>> + if (compact_result <= COMPACT_CONTINUE)
>>>>
>>>> Same here.
>>>> I was going to say that this didn't have effect on Sergey's test, but
>>>> turns out it did :)
>>>
>>> This should work as expected because compact_result is unsigned long
>>> and so this is the unsigned arithmetic. I can make
>>> #define COMPACT_NONE -1UL
>>>
>>> to make the intention more obvious if you prefer, though.
>>
>> Well, what wasn't obvious to me is actually that here (unlike in the
>> test above) it was actually intended that COMPACT_NONE doesn't result in
>> a retry. But it makes sense, otherwise we would retry endlessly if
>> reclaim couldn't form a higher-order page, right.
>
> Yeah, that was the whole point. An alternative would be moving the test
> into should_compact_retry(order, compact_result, contended_compaction)
> which would be CONFIG_COMPACTION specific so we can get rid of the
> COMPACT_NONE altogether. Something like the following. We would lose the
> always initialized compact_result but this would matter only for
> order==0 and we check for that. Even gcc doesn't complain.

Yeah I like this version better, you can add my Acked-By.

Thanks.

> A more important question is whether the criteria I have chosen are
> reasonable and reasonably independent on the particular implementation
> of the compaction. I still cannot convince myself about the convergence
> here. Is it possible that the compaction would keep returning
> compact_result <= COMPACT_CONTINUE while not making any progress at all?

Theoretically, if reclaim/compaction suitability decisions and
allocation attempts didn't match the watermark checks, including the
alloc_flags and classzone_idx parameters. Possible scenarios:

- reclaim thinks compaction has enough to proceed, but compaction thinks
otherwise and returns COMPACT_SKIPPED
- compaction thinks it succeeded and returns COMPACT_PARTIAL, but
allocation attempt fails
- and perhaps some other combinations

> Sure we can see a case where somebody is stealing the compacted blocks
> but that is very same with the order-0 where parallel mem eaters will
> piggy back on the reclaimer and there is no upper boundary as well well.

Yep.

2016-03-08 12:22:52

by Michal Hocko

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On Tue 08-03-16 12:12:20, Vlastimil Babka wrote:
> On 03/08/2016 11:10 AM, Michal Hocko wrote:
> > On Tue 08-03-16 10:52:15, Vlastimil Babka wrote:
> >> On 03/08/2016 10:46 AM, Michal Hocko wrote:
> > [...]
> >>>>> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>>>> did_some_progress > 0, no_progress_loops))
> >>>>> goto retry;
> >>>>>
> >>>>> + /*
> >>>>> + * !costly allocations are really important and we have to make sure
> >>>>> + * the compaction wasn't deferred or didn't bail out early due to locks
> >>>>> + * contention before we go OOM.
> >>>>> + */
> >>>>> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> >>>>> + if (compact_result <= COMPACT_CONTINUE)
> >>>>
> >>>> Same here.
> >>>> I was going to say that this didn't have effect on Sergey's test, but
> >>>> turns out it did :)
> >>>
> >>> This should work as expected because compact_result is unsigned long
> >>> and so this is the unsigned arithmetic. I can make
> >>> #define COMPACT_NONE -1UL
> >>>
> >>> to make the intention more obvious if you prefer, though.
> >>
> >> Well, what wasn't obvious to me is actually that here (unlike in the
> >> test above) it was actually intended that COMPACT_NONE doesn't result in
> >> a retry. But it makes sense, otherwise we would retry endlessly if
> >> reclaim couldn't form a higher-order page, right.
> >
> > Yeah, that was the whole point. An alternative would be moving the test
> > into should_compact_retry(order, compact_result, contended_compaction)
> > which would be CONFIG_COMPACTION specific so we can get rid of the
> > COMPACT_NONE altogether. Something like the following. We would lose the
> > always initialized compact_result but this would matter only for
> > order==0 and we check for that. Even gcc doesn't complain.
>
> Yeah I like this version better, you can add my Acked-By.

OK, patch updated and I will post it as a reply to the original email.

> Thanks.
>
> > A more important question is whether the criteria I have chosen are
> > reasonable and reasonably independent on the particular implementation
> > of the compaction. I still cannot convince myself about the convergence
> > here. Is it possible that the compaction would keep returning
> > compact_result <= COMPACT_CONTINUE while not making any progress at all?
>
> Theoretically, if reclaim/compaction suitability decisions and
> allocation attempts didn't match the watermark checks, including the
> alloc_flags and classzone_idx parameters. Possible scenarios:
>
> - reclaim thinks compaction has enough to proceed, but compaction thinks
> otherwise and returns COMPACT_SKIPPED
> - compaction thinks it succeeded and returns COMPACT_PARTIAL, but
> allocation attempt fails
> - and perhaps some other combinations

But that might happen right now as well so it wouldn't be a regression,
right?
--
Michal Hocko
SUSE Labs

2016-03-08 12:30:06

by Vlastimil Babka

Subject: Re: [PATCH] mm, oom: protect !costly allocations some more

On 03/08/2016 01:22 PM, Michal Hocko wrote:
>> Thanks.
>>
>>> A more important question is whether the criteria I have chosen are
>>> reasonable and reasonably independent on the particular implementation
>>> of the compaction. I still cannot convince myself about the convergence
>>> here. Is it possible that the compaction would keep returning
>>> compact_result <= COMPACT_CONTINUE while not making any progress at all?
>>
>> Theoretically, if reclaim/compaction suitability decisions and
>> allocation attempts didn't match the watermark checks, including the
>> alloc_flags and classzone_idx parameters. Possible scenarios:
>>
>> - reclaim thinks compaction has enough to proceed, but compaction thinks
>> otherwise and returns COMPACT_SKIPPED
>> - compaction thinks it succeeded and returns COMPACT_PARTIAL, but
>> allocation attempt fails
>> - and perhaps some other combinations
>
> But that might happen right now as well so it wouldn't be a regression,
> right?

Maybe, somehow, I didn't study closely how the retry decisions work.
Your patch adds another way to retry so it's theoretically more
dangerous. Just hinting at what to possibly check (the watermark checks) :)


2016-03-08 13:43:00

by Michal Hocko

Subject: [PATCH 0/3] oom rework: high order enhancements

The first two patches are compaction cleanups and the last patch is
updated as per Vlastimil's feedback. I didn't add his Acked-by
because I have added COMPACT_SHOULD_RETRY to make the retry logic in
the page allocator more robust for future changes.

Hugh has reported that this is still not sufficient, but I would prefer to
handle the issue he is seeing in a separate patch once we understand
what is going on there. The last patch sounds like a reasonable
starting point to me.

2016-03-08 13:43:06

by Michal Hocko

Subject: [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum

From: Michal Hocko <[email protected]>

compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long

but there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value
so it really makes sense to define an enum for them and make it clear
that no other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
mm/compaction.c | 27 ++++++++++++++-------------
mm/page_alloc.c | 2 +-
3 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4cd4ddf64cc7..b167801187e7 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
#define _LINUX_COMPACTION_H

/* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED 0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE 4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE 5
-#define COMPACT_NOT_SUITABLE_ZONE 6
-#define COMPACT_CONTENDED 7
/* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /*
+ * compaction didn't start as it was not possible or direct reclaim
+ * was more suitable
+ */
+ COMPACT_SKIPPED,
+ /* compaction should continue to another pageblock */
+ COMPACT_CONTINUE,
+ /*
+ * direct compaction partially compacted a zone and there are suitable
+ * pages
+ */
+ COMPACT_PARTIAL,
+ /* The full zone was compacted */
+ COMPACT_COMPLETE,
+ /* For more detailed tracepoint output */
+ COMPACT_NO_SUITABLE_PAGE,
+ COMPACT_NOT_SUITABLE_ZONE,
+ COMPACT_CONTENDED,
+};

/* Used to signal whether compaction detected need_sched() or lock contention */
/* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int sysctl_compact_unevictable_allowed;

extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+ unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);

extern void defer_compaction(struct zone *zone, int order);
@@ -53,7 +62,7 @@ extern void compaction_defer_reset(struct zone *zone, int order,
extern bool compaction_restarting(struct zone *zone, int order);

#else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
@@ -69,7 +78,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
{
}

-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 585de54dbe8c..0f61f12d82b6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1195,7 +1195,7 @@ static inline bool is_via_compact_memory(int order)
return order == -1;
}

-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
@@ -1258,8 +1258,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_NO_SUITABLE_PAGE;
}

-static int compact_finished(struct zone *zone, struct compact_control *cc,
- const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+ struct compact_control *cc,
+ const int migratetype)
{
int ret;

@@ -1278,7 +1279,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
* COMPACT_PARTIAL - If the allocation would succeed without compaction
* COMPACT_CONTINUE - If compaction should run now
*/
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
int fragindex;
@@ -1323,10 +1324,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
return COMPACT_CONTINUE;
}

-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;

ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
trace_mm_compaction_suitable(zone, order, ret);
@@ -1336,9 +1337,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
return ret;
}

-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
- int ret;
+ enum compact_result ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1483,11 +1484,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}

-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum migrate_mode mode, int *contended,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -1524,7 +1525,7 @@ int sysctl_extfrag_threshold = 500;
*
* This is the main entry point for direct page compaction.
*/
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
{
@@ -1532,7 +1533,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_DEFERRED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */

*contended = COMPACT_CONTENDED_NONE;
@@ -1546,7 +1547,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
- int status;
+ enum compact_result status;
int zone_contended;

if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 269a04f20927..4acc0aa1aee0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2821,7 +2821,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum migrate_mode mode, int *contended_compaction,
bool *deferred_compaction)
{
- unsigned long compact_result;
+ enum compact_result compact_result;
struct page *page;

if (!order)
--
2.7.0

2016-03-08 13:43:12

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone

From: Michal Hocko <[email protected]>

the compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"

mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
switch (ret) {
^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/compaction.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0f61f12d82b6..86968d3a04e6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1347,15 +1347,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro

ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
- switch (ret) {
- case COMPACT_PARTIAL:
- case COMPACT_SKIPPED:
- /* Compaction is likely to fail */
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
return ret;
- case COMPACT_CONTINUE:
- /* Fall through to compaction */
- ;
- }
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);

/*
* Clear pageblock skip if there were failures recently and compaction
--
2.7.0

2016-03-08 13:43:23

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 3/3] mm, oom: protect !costly allocations some more

From: Michal Hocko <[email protected]>

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where a high-order request (e.g.
the order-2 allocation required for the stack during fork) triggers
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
that further reclaim/compaction attempts would help, but at least make
sure that compaction was active before we go OOM: keep retrying even
if should_reclaim_retry tells us to OOM, as long as the last compaction
round was either inactive (deferred, skipped or bailed out early due to
contention) or it told us to continue.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 5 +++++
mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++--------------
2 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b167801187e7..49e04326dcb8 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -14,6 +14,11 @@ enum compact_result {
/* compaction should continue to another pageblock */
COMPACT_CONTINUE,
/*
+ * whoever is calling compaction should retry because it was either
+ * not active or it tells us there is more work to be done.
+ */
+ COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
+ /*
* direct compaction partially compacted a zone and there are suitable
* pages
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4acc0aa1aee0..041aeb1dc3b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2819,28 +2819,20 @@ static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum compact_result *compact_result)
{
- enum compact_result compact_result;
struct page *page;

if (!order)
return NULL;

current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC;

- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_SKIPPED)
return NULL;
- default:
- break;
- }

/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2870,15 +2862,41 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ int contended_compaction)
+{
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM.
+ */
+ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compact_result <= COMPACT_SHOULD_RETRY)
+ return true;
+ if (contended_compaction > COMPACT_CONTENDED_NONE)
+ return true;
+ }
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum compact_result *compact_result)
{
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ int contended_compaction)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */

/* Perform direct synchronous page reclaim */
@@ -3118,7 +3136,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int alloc_flags;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
+ enum compact_result compact_result;
int contended_compaction = COMPACT_CONTENDED_NONE;
int no_progress_loops = 0;

@@ -3227,7 +3245,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;

@@ -3240,7 +3258,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
@@ -3294,6 +3312,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
did_some_progress > 0, no_progress_loops))
goto retry;

+ if (should_compact_retry(order, compact_result, contended_compaction))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3314,7 +3335,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
--
2.7.0

2016-03-08 13:57:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Tue 08-03-16 18:58:24, Sergey Senozhatsky wrote:
> On (03/07/16 17:08), Michal Hocko wrote:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> > > Andrew,
> > > could you queue this one as well, please? This is more a band aid than a
> > > real solution which I will be working on as soon as I am able to
> > > reproduce the issue but the patch should help to some degree at least.
> >
> > Joonsoo wasn't very happy about this approach so let me try a different
> > way. What do you think about the following? Hugh, Sergey does it help
> > for your load? I have tested it with the Hugh's load and there was no
> > major difference from the previous testing so at least nothing has blown
> > up as I am not able to reproduce the issue here.
> >
> > Other changes in the compaction are still needed but I would like to not
> > depend on them right now.
>
> works fine for me.
>
> $ cat /proc/vmstat | egrep -e "compact|swap"
> pgsteal_kswapd_dma 7
> pgsteal_kswapd_dma32 6457075
> pgsteal_kswapd_normal 1462767
> pgsteal_kswapd_movable 0
> pgscan_kswapd_dma 18
> pgscan_kswapd_dma32 6544126
> pgscan_kswapd_normal 1495604
> pgscan_kswapd_movable 0
> kswapd_inodesteal 29
> kswapd_low_wmark_hit_quickly 1168
> kswapd_high_wmark_hit_quickly 1627
> compact_migrate_scanned 5762793
> compact_free_scanned 54090239
> compact_isolated 1303895
> compact_stall 1542
> compact_fail 1117
> compact_success 425
> compact_kcompatd_wake 0
>
> no OOM-kills after 6 rounds of tests.
>
> Tested-by: Sergey Senozhatsky <[email protected]>

Thanks for retesting!
--
Michal Hocko
SUSE Labs

2016-03-08 14:19:37

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> compaction code is doing weird dances between
> COMPACT_FOO -> int -> unsigned long
>
> but there doesn't seem to be any reason for that. All functions which

I vaguely recall trying this once and running into header dependency
hell. But maybe it was something a bit different and involved storing a
value in struct compact_control.

> return/use one of those constants are not expecting any other value
> so it really makes sense to define an enum for them and make it clear
> that no other values are expected.
>
> This is a pure cleanup and shouldn't introduce any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

Thanks.

2016-03-08 14:22:22

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> the compiler is complaining after "mm, compaction: change COMPACT_
> constants into enum"

Potentially a squash into that patch then?

> mm/compaction.c: In function ‘compact_zone’:
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
> switch (ret) {
> ^
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]
>
> compaction_suitable is allowed to return only COMPACT_PARTIAL,
> COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
> impossible. Put a VM_BUG_ON to catch an impossible return value.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2016-03-08 14:34:55

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm, oom: protect !costly allocations some more

On 03/08/2016 02:42 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if the last compaction round
> was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/compaction.h | 5 +++++
> mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++--------------
> 2 files changed, 42 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b167801187e7..49e04326dcb8 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -14,6 +14,11 @@ enum compact_result {
> /* compaction should continue to another pageblock */
> COMPACT_CONTINUE,
> /*
> + * whoever is calling compaction should retry because it was either
> + * not active or it tells us there is more work to be done.
> + */
> + COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,

Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
overloading of CONTINUE only applies to compaction_suitable(). But the
value that should_compact_retry() is testing comes only from
try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?
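
For illustration only, here is a minimal stand-alone sketch (plain
user-space C, not kernel code) of the ordering this relies on; the enum
values follow the ordering introduced by patch 1/3 (trimmed to the ones
relevant here) and retry_worthwhile() is just a hypothetical stand-in
for the result check in should_compact_retry():

	#include <stdio.h>

	enum compact_result {
		COMPACT_DEFERRED,	/* not attempted because earlier attempts kept failing */
		COMPACT_SKIPPED,	/* not worth trying, e.g. too few free base pages */
		COMPACT_CONTINUE,	/* compaction should continue to another pageblock */
		COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
		COMPACT_PARTIAL,	/* a page of the requested order became available */
		COMPACT_COMPLETE,	/* the whole zone was scanned without success */
	};

	static int retry_worthwhile(enum compact_result result)
	{
		return result <= COMPACT_SHOULD_RETRY;
	}

	int main(void)
	{
		printf("deferred: %d\n", retry_worthwhile(COMPACT_DEFERRED));	/* 1 */
		printf("skipped : %d\n", retry_worthwhile(COMPACT_SKIPPED));	/* 1 */
		printf("partial : %d\n", retry_worthwhile(COMPACT_PARTIAL));	/* 0 */
		printf("complete: %d\n", retry_worthwhile(COMPACT_COMPLETE));	/* 0 */
		return 0;
	}

Because try_to_compact_pages() never actually returns COMPACT_CONTINUE
(as noted above), the <= COMPACT_SHOULD_RETRY test effectively fires
only for DEFERRED and SKIPPED.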

> + /*
> * direct compaction partially compacted a zone and there are suitable
> * pages
> */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4acc0aa1aee0..041aeb1dc3b4 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,20 @@ static struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum compact_result *compact_result)
> {
> - enum compact_result compact_result;
> struct page *page;
>
> if (!order)
> return NULL;
>
> current->flags |= PF_MEMALLOC;
> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> - switch (compact_result) {
> - case COMPACT_DEFERRED:
> - *deferred_compaction = true;
> - /* fall-through */
> - case COMPACT_SKIPPED:
> + if (*compact_result <= COMPACT_SKIPPED)
> return NULL;
> - default:
> - break;
> - }
>
> /*
> * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2870,15 +2862,41 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>
> return NULL;
> }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> + int contended_compaction)
> +{
> + /*
> + * !costly allocations are really important and we have to make sure
> + * the compaction wasn't deferred or didn't bail out early due to locks
> + * contention before we go OOM.
> + */
> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> + if (compact_result <= COMPACT_SHOULD_RETRY)
> + return true;
> + if (contended_compaction > COMPACT_CONTENDED_NONE)
> + return true;
> + }
> +
> + return false;
> +}
> #else
> static inline struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum compact_result *compact_result)
> {
> return NULL;
> }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> + int contended_compaction)
> +{
> + return false;
> +}
> #endif /* CONFIG_COMPACTION */
>
> /* Perform direct synchronous page reclaim */
> @@ -3118,7 +3136,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int alloc_flags;
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> - bool deferred_compaction = false;
> + enum compact_result compact_result;
> int contended_compaction = COMPACT_CONTENDED_NONE;
> int no_progress_loops = 0;
>
> @@ -3227,7 +3245,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
> migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
>
> @@ -3240,7 +3258,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * to heavily disrupt the system, so we fail the allocation
> * instead of entering direct reclaim.
> */
> - if (deferred_compaction)
> + if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> @@ -3294,6 +3312,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> did_some_progress > 0, no_progress_loops))
> goto retry;
>
> + if (should_compact_retry(order, compact_result, contended_compaction))
> + goto retry;
> +
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> if (page)
> @@ -3314,7 +3335,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> ac, migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
> nopage:
>

2016-03-08 14:48:38

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm, oom: protect !costly allocations some more

On Tue 08-03-16 15:34:37, Vlastimil Babka wrote:
> On 03/08/2016 02:42 PM, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermak check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> >
> > This can, however, lead to situations were the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and there is no guarantee
> > further reclaim/compaction attempts would help but at least make sure
> > that the compaction was active before we go OOM and keep retrying even
> > if should_reclaim_retry tells us to oom if the last compaction round
> > was either inactive (deferred, skipped or bailed out early due to
> > contention) or it told us to continue.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > include/linux/compaction.h | 5 +++++
> > mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++--------------
> > 2 files changed, 42 insertions(+), 16 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index b167801187e7..49e04326dcb8 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -14,6 +14,11 @@ enum compact_result {
> > /* compaction should continue to another pageblock */
> > COMPACT_CONTINUE,
> > /*
> > + * whoever is calling compaction should retry because it was either
> > + * not active or it tells us there is more work to be done.
> > + */
> > + COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
>
> Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
> COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
> overloading of CONTINUE only applies to compaction_suitable(). But the
> value that should_compact_retry() is testing comes only from
> try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?

Well the idea was that I wanted to cover all the _possible_ cases where
compaction might want to tell us "please try again even when the last
round wasn't really successful". COMPACT_CONTINUE might not be returned
right now but we may come up with that in the future. It sounds like
sensible feedback to me. But maybe there would be a better name for such
feedback. I confess this is a bit of an oom-rework centric name...

Also I find it better to hide details behind a more generic name.

I am open to suggestions here, of course.
--
Michal Hocko
SUSE Labs

2016-03-08 15:03:49

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm, oom: protect !costly allocations some more

On 03/08/2016 03:48 PM, Michal Hocko wrote:
> On Tue 08-03-16 15:34:37, Vlastimil Babka wrote:
>>> --- a/include/linux/compaction.h
>>> +++ b/include/linux/compaction.h
>>> @@ -14,6 +14,11 @@ enum compact_result {
>>> /* compaction should continue to another pageblock */
>>> COMPACT_CONTINUE,
>>> /*
>>> + * whoever is calling compaction should retry because it was either
>>> + * not active or it tells us there is more work to be done.
>>> + */
>>> + COMPACT_SHOULD_RETRY = COMPACT_CONTINUE,
>>
>> Hmm, I'm not sure about this. AFAIK compact_zone() doesn't ever return
>> COMPACT_CONTINUE, and thus try_to_compact_pages() also doesn't. This
>> overloading of CONTINUE only applies to compaction_suitable(). But the
>> value that should_compact_retry() is testing comes only from
>> try_to_compact_pages(). So this is not wrong, but perhaps a bit misleading?
>
> Well the idea was that I wanted to cover all the _possible_ cases where
> compaction might want to tell us "please try again even when the last
> round wasn't really successful". COMPACT_CONTINUE might not be returned
> right now but we can come up with that in the future. It sounds like a
> sensible feedback to me. But maybe there would be a better name for such
> a feedback. I confess this is a bit oom-rework centric name...

Hmm, I see. But it doesn't really tell us to please try again. That
interpretation is indeed oom-specific. What it's actually telling us is
either a) reclaim and then try again (COMPACT_SKIPPED), or b) try again
just to overcome the deferred state (COMPACT_DEFERRED). COMPACT_CONTINUE
says "go ahead", but only from compaction_suitable().
So the attempt at a generic name doesn't really work here, I'm afraid :/

> Also I find it better to hide details behind a more generic name.
>
> I am open to suggestions here, of course.
>

2016-03-08 15:19:15

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

2016-03-08 1:08 GMT+09:00 Michal Hocko <[email protected]>:
> On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> Andrew,
>> could you queue this one as well, please? This is more a band aid than a
>> real solution which I will be working on as soon as I am able to
>> reproduce the issue but the patch should help to some degree at least.
>
> Joonsoo wasn't very happy about this approach so let me try a different
> way. What do you think about the following? Hugh, Sergey does it help

I'm still not happy. Just ensuring one compaction run doesn't mean we
have done our best. What's the purpose of the OOM rework? From my
understanding, you'd like to trigger the OOM kill deterministically and
*not prematurely*. This makes sense.

But what you did in the case of high-order allocations is completely
different from the original purpose. It may be deterministic but it is
*completely premature*. There is no way to prevent a premature OOM kill.
So I want to ask one more time: why is an OOM kill better than retrying
reclaim when there are still reclaimable pages? What does determinism
buy us? Does it ensure something more?

Please see Hugh's latest vmstat. There are plenty of anon pages when
OOM kill happens and it may have enough swap space. Even if
compaction runs and fails, why do we need to kill something
in this case? OOM kill should be a last resort.

Please see Hugh's previous report and OOM dump.

[ 796.540791] Mem-Info:
[ 796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
active_file:5107 inactive_file:1664 isolated_file:57
unevictable:3067 dirty:4 writeback:75 unstable:0
slab_reclaimable:13907 slab_unreclaimable:23236
mapped:8889 shmem:3171 pagetables:2176 bounce:0
free:1637 free_pcp:54 free_cma:0
[ 796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
high:5948kB active_anon:588776kB inactive_anon:188816kB
active_file:20432kB inactive_file:6928kB unevictable:12268kB
isolated(anon):128kB isolated(file):8kB present:1046128kB
managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 796.685815] lowmem_reserve[]: 0 0 0
[ 796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 8820kB
[ 796.729696] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB

See [ 796.557378] and [ 796.630465].
In this 100 ms interval the number of free pages increased a lot and
there are enough high-order pages. The OOM kill happens later,
so those free pages would have come from reclaim. This shows
that your previous implementation, which uses a static retry number,
causes premature OOM.
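
(As a cross-check, the buddy counts in the quoted dump do add up:
969*4kB + 184*8kB + 167*16kB + 19*32kB + 3*64kB = 3876 + 1472 + 2672 +
608 + 192 = 8820kB, i.e. at the moment of the dump there were 167 free
order-2, 19 free order-3 and 3 free order-4 blocks, more than enough
for an order-2 fork-time stack allocation.)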

This attempt using the compaction result doesn't look any different to me.
It would also cause premature OOM kills.

I don't insist on endless retries. I just want a more scientific criterion
that prevents premature OOM kills. I'm really tired of saying the same
thing again and again. Am I missing something? Is this a situation where
I totally misunderstand something? Please let me know.

Note: your current implementation doesn't consider which zone is compacted.
If the DMA zone, which easily fails to form a high-order page, is the one
compacted, your implementation will not retry. That also doesn't look like
our best.

Thanks.

> for your load? I have tested it with the Hugh's load and there was no
> major difference from the previous testing so at least nothing has blown
> up as I am not able to reproduce the issue here.
>
> Other changes in the compaction are still needed but I would like to not
> depend on them right now.
> ---
> From 0974f127e8eb7fe53e65f3a8b398db57effe9755 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 7 Mar 2016 15:30:37 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if the last compaction round
> was either inactive (deferred, skipped or bailed out early due to
> contention) or it told us to continue.
>
> Additionally define COMPACT_NONE which reflects cases where the
> compaction is completely disabled.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/compaction.h | 2 ++
> mm/page_alloc.c | 41 ++++++++++++++++++++++++-----------------
> 2 files changed, 26 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4cd4ddf64cc7..a4cec4a03f7d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,8 @@
> #ifndef _LINUX_COMPACTION_H
> #define _LINUX_COMPACTION_H
>
> +/* compaction disabled */
> +#define COMPACT_NONE -1
> /* Return values for compact_zone() and try_to_compact_pages() */
> /* compaction didn't start as it was deferred due to past failures */
> #define COMPACT_DEFERRED 0
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 269a04f20927..f89e3cbfdf90 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2819,28 +2819,22 @@ static struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + unsigned long *compact_result)
> {
> - unsigned long compact_result;
> struct page *page;
>
> - if (!order)
> + if (!order) {
> + *compact_result = COMPACT_NONE;
> return NULL;
> + }
>
> current->flags |= PF_MEMALLOC;
> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> - switch (compact_result) {
> - case COMPACT_DEFERRED:
> - *deferred_compaction = true;
> - /* fall-through */
> - case COMPACT_SKIPPED:
> + if (*compact_result <= COMPACT_SKIPPED)
> return NULL;
> - default:
> - break;
> - }
>
> /*
> * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2875,8 +2869,9 @@ static inline struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + unsigned long *compact_result)
> {
> + *compact_result = COMPACT_NONE;
> return NULL;
> }
> #endif /* CONFIG_COMPACTION */
> @@ -3118,7 +3113,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> int alloc_flags;
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> - bool deferred_compaction = false;
> + unsigned long compact_result;
> int contended_compaction = COMPACT_CONTENDED_NONE;
> int no_progress_loops = 0;
>
> @@ -3227,7 +3222,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
> migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
>
> @@ -3240,7 +3235,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * to heavily disrupt the system, so we fail the allocation
> * instead of entering direct reclaim.
> */
> - if (deferred_compaction)
> + if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> @@ -3294,6 +3289,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> did_some_progress > 0, no_progress_loops))
> goto retry;
>
> + /*
> + * !costly allocations are really important and we have to make sure
> + * the compaction wasn't deferred or didn't bail out early due to locks
> + * contention before we go OOM.
> + */
> + if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> + if (compact_result <= COMPACT_CONTINUE)
> + goto retry;
> + if (contended_compaction > COMPACT_CONTENDED_NONE)
> + goto retry;
> + }
> +
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> if (page)
> @@ -3314,7 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> ac, migration_mode,
> &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
> nopage:
> --
> 2.7.0
>
> --
> Michal Hocko
> SUSE Labs

2016-03-08 16:05:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
> 2016-03-08 1:08 GMT+09:00 Michal Hocko <[email protected]>:
> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> >> Andrew,
> >> could you queue this one as well, please? This is more a band aid than a
> >> real solution which I will be working on as soon as I am able to
> >> reproduce the issue but the patch should help to some degree at least.
> >
> > Joonsoo wasn't very happy about this approach so let me try a different
> > way. What do you think about the following? Hugh, Sergey does it help
>
> I'm still not happy. Just ensuring one compaction run doesn't mean our
> best.

OK, let me think about it some more.

> What's your purpose of OOM rework? From my understanding,
> you'd like to trigger OOM kill deterministic and *not prematurely*.
> This makes sense.

Well this is a bit awkward because we do not have any proper definition
of what prematurely actually means. We do not know whether something
changes and decides to free some memory right after we made the decision.
We also do not know whether reclaiming some more memory would help,
because we might be thrashing over the few remaining pages so there
would still be some progress, albeit a small one. The system would be
basically unusable and the OOM killer would be a large relief. What I
want to achieve is to have a clear definition of _when_ we fire and to
not fire so _often_ as to be impractical. There are loads where the new
implementation behaved slightly better (see the cover for my tests) and
there will surely be some where this will be worse. I want this to be
reasonably good. I am not claiming we are there yet and the interaction
with the compaction seems like it needs some work, no question about
that.

> But, what you did in case of high order allocation is completely different
> with original purpose. It may be deterministic but *completely premature*.
> There is no way to prevent premature OOM kill. So, I want to ask one more
> time. Why OOM kill is better than retry reclaiming when there is reclaimable
> page? Deterministic is for what? It ensures something more?

Yes. If we keep reclaiming we can soon start thrashing or overreclaim
too much, which would hurt more processes. If you invoke the OOM killer
instead then chances are that you will release a lot of memory at once,
which would help to relieve the memory pressure as well as free
some page blocks which couldn't have been compacted before, without
affecting potentially many processes. The effect would be reduced to a
single process. If we had proper thrashing detection feedback we could
make much more clever decisions, of course.

But back to the !costly OOMs. Once your system is fragmented so heavily
that there are no free blocks that would satisfy a !costly request then
something has gone terribly wrong and we should fix it. To me it sounds
like we do not care about those requests early enough and only start
caring after we hit the wall. Maybe kcompactd can help us in this
regard.

> Please see Hugh's latest vmstat. There are plenty of anon pages when
> OOM kill happens and it may have enough swap space. Even if
> compaction runs and fails, why do we need to kill something
> in this case? OOM kill should be a last resort.

Well this would be the case even if we were thrashing over swap.
Refaulting the swapped out memory all over again...

> Please see Hugh's previous report and OOM dump.
>
> [ 796.540791] Mem-Info:
> [ 796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
> active_file:5107 inactive_file:1664 isolated_file:57
> unevictable:3067 dirty:4 writeback:75 unstable:0
> slab_reclaimable:13907 slab_unreclaimable:23236
> mapped:8889 shmem:3171 pagetables:2176 bounce:0
> free:1637 free_pcp:54 free_cma:0
> [ 796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
> high:5948kB active_anon:588776kB inactive_anon:188816kB
> active_file:20432kB inactive_file:6928kB unevictable:12268kB
> isolated(anon):128kB isolated(file):8kB present:1046128kB
> managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
> mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
> slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
> unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> [ 796.685815] lowmem_reserve[]: 0 0 0
> [ 796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
> 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
> 0*4096kB = 8820kB
> [ 796.729696] Node 0 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=2048kB
>
> See [ 796.557378] and [ 796.630465].
> In this 100 ms time interval, freepage increase a lot and
> there are enough high order pages. OOM kill happen later
> so freepage would come from reclaim. This shows
> that your previous implementation which uses static retry number
> causes premature OOM.

Or one of the gcc instances simply exited and freed up memory, which is
more likely. As I've tried to explain in another email, we cannot prevent
those races. We simply do not have a crystal ball. All we know is
that at the time we checked the watermarks the last time there were
simply no eligible high-order pages available.

> This attempt using compaction result looks not different to me.
> It would also cause premature OOM kill.
>
> I don't insist endless retry. I just want a more scientific criteria
> that prevents premature OOM kill.

That is exactly what I try to achieve here. Right now we are relying on
the zone_reclaimable heuristic. That relies on some pages being freed (and
NR_PAGES_SCANNED being reset) while we are scanning. With a stream of
order-0 pages this is basically unbounded. What I am trying to achieve
here is to base the decision on the feedback. The first attempt was to use
the reclaim feedback. This turned out to be not sufficient for higher
orders because compaction can defer and skip if we are close to the
watermarks, which was really surprising to me. So now I've tried to make
sure that we do not hit this path. I agree we can do better but there
will always be a moment to simply give up. Whatever that moment will
be, we can still find loads which could theoretically go on for a little
longer and survive.

> I'm really tire to say same thing again and again.
> Am I missing something? This is the situation that I totally misunderstand
> something? Please let me know.
>
> Note: your current implementation doesn't consider which zone is compacted.
> If DMA zone which easily fail to make high order page is compacted,
> your implementation will not do retry. It also looks not our best.

Why do we even consider the DMA zone when we cannot ever allocate from
it?
--
Michal Hocko
SUSE Labs

2016-03-08 17:04:14

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

2016-03-09 1:05 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
>> 2016-03-08 1:08 GMT+09:00 Michal Hocko <[email protected]>:
>> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> >> Andrew,
>> >> could you queue this one as well, please? This is more a band aid than a
>> >> real solution which I will be working on as soon as I am able to
>> >> reproduce the issue but the patch should help to some degree at least.
>> >
>> > Joonsoo wasn't very happy about this approach so let me try a different
>> > way. What do you think about the following? Hugh, Sergey does it help
>>
>> I'm still not happy. Just ensuring one compaction run doesn't mean our
>> best.
>
> OK, let me think about it some more.
>
>> What's your purpose of OOM rework? From my understanding,
>> you'd like to trigger OOM kill deterministic and *not prematurely*.
>> This makes sense.
>
> Well this is a bit awkward because we do not have any proper definition
> of what prematurely actually means. We do not know whether something

If we don't have a proper definition for it, please define it first. We
need to improve the situation toward a clear goal. Just a certain number
of retries with no basis doesn't make any sense.

> changes and decides to free some memory right after we made the decision.
> We also do not know whether reclaiming some more memory would help
> because we might be trashing over few remaining pages so there would be
> still some progress, albeit small, progress. The system would be
> basically unusable and the OOM killer would be a large relief. What I
> want to achieve is to have a clear definition of _when_ we fire and do

If we have no clear definition of premature, what's the meaning of
a clear definition of _when_? It would just be a random point in time.

> not fire _often_ to be impractical. There are loads where the new
> implementation behaved slightly better (see the cover for my tests) and
> there surely be some where this will be worse. I want this to be
> reasonably good. I am not claiming we are there yet and the interaction
> with the compaction seems like it needs some work, no question about
> that.
>
>> But, what you did in case of high order allocation is completely different
>> with original purpose. It may be deterministic but *completely premature*.
>> There is no way to prevent premature OOM kill. So, I want to ask one more
>> time. Why OOM kill is better than retry reclaiming when there is reclaimable
>> page? Deterministic is for what? It ensures something more?
>
> yes, If we keep reclaiming we can soon start trashing or over reclaim
> too much which would hurt more processes. If you invoke the OOM killer
> instead then chances are that you will release a lot of memory at once
> and that would help to reconcile the memory pressure as well as free
> some page blocks which couldn't have been compacted before and not
> affect potentially many processes. The effect would be reduced to a
> single process. If we had a proper trashing detection feedback we could
> do much more clever decisions of course.

It looks like you did it for performance reasons. You'd better think again
about the effect of an OOM kill. We don't have enough knowledge about the
user space program architecture, and killing one important process could
leave the whole system unusable. Moreover, an OOM kill could cause
important data loss, so it should be avoided as much as possible. A
performance reason cannot justify an OOM kill.

>
> But back to the !costly OOMs. Once your system is fragmented so heavily
> that there are no free blocks that would satisfy !costly request then
> something has gone terribly wrong and we should fix it. To me it sounds
> like we do not care about those requests early enough and only start
> carying after we hit the wall. Maybe kcompactd can help us in this
> regards.

Yes, but that's another issue. In any situation, a !costly OOM should not
happen prematurely.

>> Please see Hugh's latest vmstat. There are plenty of anon pages when
>> OOM kill happens and it may have enough swap space. Even if
>> compaction runs and fails, why do we need to kill something
>> in this case? OOM kill should be a last resort.
>
> Well this would be the case even if we were trashing over swap.
> Refaulting the swapped out memory all over again...

If thrashing is the main obstacle to deciding the proper OOM point,
we need to invent a way to handle thrashing, or invent a reasonable
metric which isn't affected by it.

>> Please see Hugh's previous report and OOM dump.
>>
>> [ 796.540791] Mem-Info:
>> [ 796.557378] active_anon:150198 inactive_anon:46022 isolated_anon:32
>> active_file:5107 inactive_file:1664 isolated_file:57
>> unevictable:3067 dirty:4 writeback:75 unstable:0
>> slab_reclaimable:13907 slab_unreclaimable:23236
>> mapped:8889 shmem:3171 pagetables:2176 bounce:0
>> free:1637 free_pcp:54 free_cma:0
>> [ 796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB
>> high:5948kB active_anon:588776kB inactive_anon:188816kB
>> active_file:20432kB inactive_file:6928kB unevictable:12268kB
>> isolated(anon):128kB isolated(file):8kB present:1046128kB
>> managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB
>> mapped:35556kB shmem:12684kB slab_reclaimable:55628kB
>> slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB
>> unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB
>> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
>> [ 796.685815] lowmem_reserve[]: 0 0 0
>> [ 796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM)
>> 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
>> 0*4096kB = 8820kB
>> [ 796.729696] Node 0 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=2048kB
>>
>> See [ 796.557378] and [ 796.630465].
>> In this 100 ms time interval, freepage increase a lot and
>> there are enough high order pages. OOM kill happen later
>> so freepage would come from reclaim. This shows
>> that your previous implementation which uses static retry number
>> causes premature OOM.
>
> Or simply one of the gcc simply exitted and freed up a memory which is

It doesn't matter where the free memory comes from. If free memory
increases because gcc exits, it implies that we could reclaim some memory
from it. There is no reason to trigger OOM in this case.

> more likely. As I've tried to explain in other email, we cannot prevent
> from those races. We simply do not have a crystal ball. All we know is
> that at the time we checked the watermarks the last time there were
> simply no eligible high order pages available.
>
>> This attempt using compaction result looks not different to me.
>> It would also cause premature OOM kill.
>>
>> I don't insist endless retry. I just want a more scientific criteria
>> that prevents premature OOM kill.
>
> That is exactly what I try to achive here. Right now we are relying on
> zone_reclaimable heuristic. That relies that some pages are freed (and
> reset NR_PAGES_SCANNED) while we are scanning. With a stream of order-0
> pages this is basically unbounded. What I am trying to achieve here
> is to base the decision on the feedback. The first attempt was to use
> the reclaim feedback. This turned out to be not sufficient for higher
> orders because compaction can deffer and skip if we are close to
> watermarks which is really surprising to me. So now I've tried to make
> sure that we do not hit this path. I agree we can do better but there
> always will be a moment to simply give up. Whatever that moment will
> be we can still find loads which could theoretically go on for little
> more and survive.

The problem is that, to me, the current implementation looks like it
simply gives up. Maybe a precise definition of premature would be helpful
here. Without it, this is just subjective.

>
>> I'm really tire to say same thing again and again.
>> Am I missing something? This is the situation that I totally misunderstand
>> something? Please let me know.
>>
>> Note: your current implementation doesn't consider which zone is compacted.
>> If DMA zone which easily fail to make high order page is compacted,
>> your implementation will not do retry. It also looks not our best.
>
> Why are we even consider DMA zone when we cannot ever allocate from this
> zone?

This is just an example. It could be ZONE_NORMAL or something else. If
we don't try to compact all zones, is it a reasonable point to trigger OOM?

Thanks.

2016-03-09 03:57:22

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, compaction: change COMPACT_ constants into enum

>
> From: Michal Hocko <[email protected]>
>
> compaction code is doing weird dances between
> COMPACT_FOO -> int -> unsigned long
>
> but there doesn't seem to be any reason for that. All functions which
> return/use one of those constants are not expecting any other value
> so it really makes sense to define an enum for them and make it clear
> that no other values are expected.
>
> This is a pure cleanup and shouldn't introduce any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---

Acked-by: Hillf Danton <[email protected]>


2016-03-09 03:58:17

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm, compaction: cover all compaction mode in compact_zone

>
> From: Michal Hocko <[email protected]>
>
> the compiler is complaining after "mm, compaction: change COMPACT_
> constants into enum"
>
> mm/compaction.c: In function ‘compact_zone’:
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
> switch (ret) {
> ^
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
> mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]
>
> compaction_suitable is allowed to return only COMPACT_PARTIAL,
> COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
> impossible. Put a VM_BUG_ON to catch an impossible return value.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---

Acked-by: Hillf Danton <[email protected]>


2016-03-09 10:41:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm, oom: protect !costly allocations some more (was: Re: [PATCH 0/3] OOM detection rework v4)

On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
> 2016-03-09 1:05 GMT+09:00 Michal Hocko <[email protected]>:
> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
> >> 2016-03-08 1:08 GMT+09:00 Michal Hocko <[email protected]>:
> >> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
> >> >> Andrew,
> >> >> could you queue this one as well, please? This is more a band aid than a
> >> >> real solution which I will be working on as soon as I am able to
> >> >> reproduce the issue but the patch should help to some degree at least.
> >> >
> >> > Joonsoo wasn't very happy about this approach so let me try a different
> >> > way. What do you think about the following? Hugh, Sergey does it help
> >>
> >> I'm still not happy. Just ensuring one compaction run doesn't mean our
> >> best.
> >
> > OK, let me think about it some more.
> >
> >> What's your purpose of OOM rework? From my understanding,
> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
> >> This makes sense.
> >
> > Well this is a bit awkward because we do not have any proper definition
> > of what prematurely actually means. We do not know whether something
>
> If we don't have proper definition to it, please define it first.

OK, I should have probably said that _there_is_no_proper_definition_...
This will always be about heuristics, as any clear cut can be pretty
subjective and what one load might see as unreasonable retries another
might see as insufficient. Our ultimate goal is to behave reasonably for
reasonable workloads. I am somewhat skeptical about formulating this
into a single equation...

> We need to improve the situation toward the clear goal. Just certain
> number of retry which has no base doesn't make any sense.

A certain number of retries is what we already have right now. And that
certain number is hard to define, even though it looks as simple as

NR_PAGES_SCANNED < 6*zone_reclaimable_pages && no_reclaimable_pages

because this is highly fragile when a few pages are freed regularly
but not enough of them to get us out of the loop... I am trying
to formulate those retries somewhat more deterministically, considering
the feedback _and_ an estimate of the feasibility of future
reclaim/compaction. I admit that my attempts at the compaction part have
been far from ideal so far, partially because I missed many aspects of
how it works.
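
To make the heuristic concrete, it corresponds to something like the
following check in mm/vmscan.c (a sketch reconstructed from the
condition above, not necessarily a verbatim copy of the source):

	static bool zone_reclaimable(struct zone *zone)
	{
		/*
		 * Keep retrying reclaim while fewer than six times the
		 * reclaimable pages have been scanned since the last time
		 * a page was freed (freeing resets NR_PAGES_SCANNED).
		 */
		return zone_page_state(zone, NR_PAGES_SCANNED) <
			zone_reclaimable_pages(zone) * 6;
	}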

[...]
> > not fire _often_ to be impractical. There are loads where the new
> > implementation behaved slightly better (see the cover for my tests) and
> > there surely be some where this will be worse. I want this to be
> > reasonably good. I am not claiming we are there yet and the interaction
> > with the compaction seems like it needs some work, no question about
> > that.
> >
> >> But, what you did in case of high order allocation is completely different
> >> with original purpose. It may be deterministic but *completely premature*.
> >> There is no way to prevent premature OOM kill. So, I want to ask one more
> >> time. Why OOM kill is better than retry reclaiming when there is reclaimable
> >> page? Deterministic is for what? It ensures something more?
> >
> > yes, If we keep reclaiming we can soon start trashing or over reclaim
> > too much which would hurt more processes. If you invoke the OOM killer
> > instead then chances are that you will release a lot of memory at once
> > and that would help to reconcile the memory pressure as well as free
> > some page blocks which couldn't have been compacted before and not
> > affect potentially many processes. The effect would be reduced to a
> > single process. If we had a proper trashing detection feedback we could
> > do much more clever decisions of course.
>
> It looks like you did it for performance reason. You'd better think again about
> effect of OOM kill. We don't have enough knowledge about user space program
> architecture and killing one important process could lead to whole
> system unusable. Moreover, OOM kill could cause important data loss so
> should be avoided as much as possible. Performance reason cannot
> justify OOM kill.

No, I am not talking about performance. I am talking about the health of
the system as a whole.

> > But back to the !costly OOMs. Once your system is fragmented so heavily
> > that there are no free blocks that would satisfy !costly request then
> > something has gone terribly wrong and we should fix it. To me it sounds
> > like we do not care about those requests early enough and only start
> > carying after we hit the wall. Maybe kcompactd can help us in this
> > regards.
>
> Yes, but, it's another issue. In any situation, !costly OOM should not happen
> prematurely.

I fully agree and I guess we also agree on the assumption that we
shouldn't retry endlessly. So let's focus on what the OOM convergence
criteria should look like. I have another proposal which I will send as
a reply to the previous one.

> >> Please see Hugh's latest vmstat. There are plenty of anon pages when
> >> OOM kill happens and it may have enough swap space. Even if
> >> compaction runs and fails, why do we need to kill something
> >> in this case? OOM kill should be a last resort.
> >
> > Well this would be the case even if we were trashing over swap.
> > Refaulting the swapped out memory all over again...
>
> If thrashing is a main obstacle to decide proper OOM point,
> we need to invent a way to handle thrashing or invent reasonable metric
> which isn't affected by thrashing.

Great, you are welcome to come up with one. But more seriously, isn't
limiting the retries a way to reduce the chances of thrashing? It might
not be the ideal one because it doesn't work 100%, but can we really come
up with one which works that reliably? This is a hard problem which
we haven't been able to solve for ages.

[...]

Thanks!
--
Michal Hocko
SUSE Labs

2016-03-09 11:11:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/3] mm, oom: protect !costly allocations some more

Joonsoo has pointed out that this attempt is still not sufficient
because we might have invoked only a single compaction round, which
might not be enough. I fully agree with that. Here is my take on
it. It is again based on counting the number of retries.

I was also playing with an idea of doing something similar to the
reclaim retry logic:

	if (order) {
		if (compaction_made_progress(compact_result))
			no_compact_progress = 0;
		else if (compaction_failed(compact_result))
			no_compact_progress++;
	}

but it is the compaction_failed() part which is not really
straightforward to define. Would COMPACT_NO_SUITABLE_PAGE
resp. COMPACT_NOT_SUITABLE_ZONE be sufficient? compact_finished and
compaction_suitable however hide these from compaction users, so it
seems like we can never see them.
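
Just to illustrate the problem, the obvious candidate would be a helper
along these lines (hypothetical, not part of any posted patch), but as
noted above the NO_SUITABLE_PAGE/NOT_SUITABLE_ZONE values never
propagate out to compaction users, so this predicate would never trigger
through try_to_compact_pages():

	static inline bool compaction_failed(enum compact_result result)
	{
		/* compaction ran but could not find or form a suitable page */
		return result == COMPACT_NO_SUITABLE_PAGE ||
		       result == COMPACT_NOT_SUITABLE_ZONE;
	}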

Maybe we can update the feedback mechanism from the compaction, but a
retry count seems reasonably easy to understand and pragmatic. If
we cannot form an order page after we have tried N times then it really
doesn't make much sense to continue and we are OOM for this order. I am
holding my breath to hear from Hugh on this, though. In case it doesn't
help then I would be really interested whether changing MAX_COMPACT_RETRIES
makes any difference.

I haven't preserved Tested-by from Sergey to be on the safe side even
though strictly speaking this should be less prone to high order OOMs
because we clearly retry more times.
---
>From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 9 Mar 2016 10:57:42 +0100
Subject: [PATCH] mm, oom: protect !costly allocations some more

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if
- the last compaction round was either inactive (deferred,
skipped or bailed out early due to contention) or
- we haven't completed at least MAX_COMPACT_RETRIES successful
(either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
rounds.

The first rule ensures that the very last attempt for compaction
was not ignored while the second guarantees that the compaction has done
some work. Multiple retries might be needed to prevent occasional
piggy backing of other contexts to steal the compacted pages while
the current context manages to retry to allocate them.

If the given number of successful retries is not sufficient for
reasonable workloads we should focus on the collected compaction
tracepoints data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 10 +++++++
mm/page_alloc.c | 68 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b167801187e7..7d028ccf440a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);

+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return (compact_result > COMPACT_SKIPPED &&
+ compact_result < COMPACT_NO_SUITABLE_PAGE)
+}
+
#else
static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
@@ -93,6 +99,10 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}

+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4acc0aa1aee0..5f1fc3793836 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,34 +2813,33 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
return page;
}

+
+/*
+ * Maximum number of compaction retries with progress before the OOM
+ * killer is considered the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
#ifdef CONFIG_COMPACTION
/* Try memory compaction for high-order allocations before reclaim */
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum compact_result *compact_result)
{
- enum compact_result compact_result;
struct page *page;

if (!order)
return NULL;

current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC;

- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_SKIPPED)
return NULL;
- default:
- break;
- }

/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2870,15 +2869,44 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ int contended_compaction, int compaction_retries)
+{
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to lock
+ * contention before we go OOM. Still cap the reclaim retry loops with
+ * progress to prevent looping forever and potential thrashing.
+ */
+ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compact_result <= COMPACT_SKIPPED)
+ return true;
+ if (contended_compaction > COMPACT_CONTENDED_NONE)
+ return true;
+ if (compaction_retries <= MAX_COMPACT_RETRIES)
+ return true;
+ }
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum compact_result *compact_result)
{
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ int contended_compaction, int compaction_retries)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */

/* Perform direct synchronous page reclaim */
@@ -3118,7 +3146,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
int alloc_flags;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
+ enum compact_result compact_result;
+ int compaction_retries = 0;
int contended_compaction = COMPACT_CONTENDED_NONE;
int no_progress_loops = 0;

@@ -3227,10 +3256,13 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;

+ if (order && compaction_made_progress(compact_result))
+ compaction_retries++;
+
/* Checks for THP-specific high-order allocations */
if (is_thp_gfp_mask(gfp_mask)) {
/*
@@ -3240,7 +3272,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
@@ -3294,6 +3326,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
did_some_progress > 0, no_progress_loops))
goto retry;

+ if (should_compact_retry(order, compact_result, contended_compaction,
+ compaction_retries))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3314,7 +3350,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
--
2.7.0

--
Michal Hocko
SUSE Labs

2016-03-09 14:07:27

by Vlastimil Babka


On 03/09/2016 12:11 PM, Michal Hocko wrote:
> Joonsoo has pointed out that this attempt is still not sufficient
> becasuse we might have invoked only a single compaction round which
> is might be not enough. I fully agree with that. Here is my take on
> that. It is again based on the number of retries loop.
>
> I was also playing with an idea of doing something similar to the
> reclaim retry logic:
> if (order) {
> if (compaction_made_progress(compact_result)

Progress for compaction would probably mean counting successful
migrations. This would converge towards a definitive false (without
parallel activity) in the current implementation, but probably not for
the proposed redesigns where migration and free scanner initial
positions are not fixed.

> no_compact_progress = 0;
> else if (compaction_failed(compact_result)
> no_compact_progress++;
> }
> but it is compaction_failed() part which is not really
> straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> compaction_suitable however hide this from compaction users so it
> seems like we can never see it.

Anything other than COMPACT_PARTIAL is "failed" :) But it doesn't itself
hint at whether retrying makes sense or not. Reclaim is simpler in this
sense...
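
Taken together, the two observations could be written down as follows
(sketch only: the per-round migration count is hypothetical and not
something compact_zone() reports in this form, while compaction_failed()
merely encodes the "anything but COMPACT_PARTIAL" reading and assumes
the enum compact_result values from the patch):

#include <stdbool.h>

/* Hypothetical per-round statistic: pages successfully migrated. */
struct compact_round_stats {
        unsigned long nr_migrated;
};

/*
 * "Progress" as successful migrations: a round that migrated nothing
 * counts as no progress. With fixed scanner starting positions this
 * eventually stays false; with movable starting positions it may not.
 */
static bool compaction_migrated_something(const struct compact_round_stats *stats)
{
        return stats->nr_migrated > 0;
}

/*
 * "Failed" in the simplest sense: the round did not make a page of the
 * requested order allocatable (only COMPACT_PARTIAL means it did).
 * This by itself says nothing about whether retrying is worthwhile.
 */
static bool compaction_failed(enum compact_result result)
{
        return result != COMPACT_PARTIAL;
}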

> Maybe we can update the feedback mechanism from the compaction but
> retries count seems reasonably easy to understand and pragmatic. If
> we cannot form a order page after we tried for N times then it really
> doesn't make much sense to continue and we are oom for this order. I am
> holding my breath to hear from Hugh on this, though. In case it doesn't
> then I would be really interested whether changing MAX_COMPACT_RETRIES
> makes any difference.
>
> I haven't preserved Tested-by from Sergey to be on the safe side even
> though strictly speaking this should be less prone to high order OOMs
> because we clearly retry more times.
> ---
> From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 9 Mar 2016 10:57:42 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> - the last compaction round was either inactive (deferred,
> skipped or bailed out early due to contention) or
> - we haven't completed at least MAX_COMPACT_RETRIES successful
> (either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
> rounds.
>
> The first rule ensures that the very last attempt for compaction
> was ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> pigggy packing of other contexts to steal the compacted pages while
> the current context manages to retry to allocate them.
>
> If the given number of successful retries is not sufficient for a
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
>
> Signed-off-by: Michal Hocko <[email protected]>

Yeah, this could work.
Acked-by: Vlastimil Babka <[email protected]>

[...]

2016-03-11 12:17:51

by Hugh Dickins


On Wed, 9 Mar 2016, Michal Hocko wrote:
> Joonsoo has pointed out that this attempt is still not sufficient
> becasuse we might have invoked only a single compaction round which
> is might be not enough. I fully agree with that. Here is my take on
> that. It is again based on the number of retries loop.
>
> I was also playing with an idea of doing something similar to the
> reclaim retry logic:
> if (order) {
> if (compaction_made_progress(compact_result)
> no_compact_progress = 0;
> else if (compaction_failed(compact_result)
> no_compact_progress++;
> }
> but it is compaction_failed() part which is not really
> straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> compaction_suitable however hide this from compaction users so it
> seems like we can never see it.
>
> Maybe we can update the feedback mechanism from the compaction but
> retries count seems reasonably easy to understand and pragmatic. If
> we cannot form a order page after we tried for N times then it really
> doesn't make much sense to continue and we are oom for this order. I am
> holding my breath to hear from Hugh on this, though.

Never a wise strategy. But I just got around to it tonight.

I do believe you've nailed it with this patch! Thank you!

I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
but I've not got to them yet): so far it is looking good on all machines.

After a quick go with the simple make -j20 in tmpfs, which survived
a cycle on the laptop, I've switched back to my original tougher load,
and that's going well so far: no sign of any OOMs. But I've interrupted
on the laptop to report back to you now, then I'll leave it running
overnight.

> In case it doesn't
> then I would be really interested whether changing MAX_COMPACT_RETRIES
> makes any difference.
>
> I haven't preserved Tested-by from Sergey to be on the safe side even
> though strictly speaking this should be less prone to high order OOMs
> because we clearly retry more times.
> ---
> From 33f08d6eeb0f5eaf1c73c292f070102ddec5878a Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 9 Mar 2016 10:57:42 +0100
> Subject: [PATCH] mm, oom: protect !costly allocations some more
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> - the last compaction round was either inactive (deferred,
> skipped or bailed out early due to contention) or
> - we haven't completed at least MAX_COMPACT_RETRIES successful
> (either COMPACT_PARTIAL or COMPACT_COMPLETE) compaction
> rounds.
>
> The first rule ensures that the very last attempt for compaction
> was ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> pigggy packing of other contexts to steal the compacted pages while
> the current context manages to retry to allocate them.
>
> If the given number of successful retries is not sufficient for a
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/compaction.h | 10 +++++++
> mm/page_alloc.c | 68 +++++++++++++++++++++++++++++++++++-----------
> 2 files changed, 62 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index b167801187e7..7d028ccf440a 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> bool alloc_success);
> extern bool compaction_restarting(struct zone *zone, int order);
>
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + return (compact_result > COMPACT_SKIPPED &&
> + compact_result < COMPACT_NO_SUITABLE_PAGE)

That line didn't build at all:

return result > COMPACT_SKIPPED && result < COMPACT_NO_SUITABLE_PAGE;

[...]

2016-03-11 13:06:57

by Michal Hocko


On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> On Wed, 9 Mar 2016, Michal Hocko wrote:
> > Joonsoo has pointed out that this attempt is still not sufficient
> > becasuse we might have invoked only a single compaction round which
> > is might be not enough. I fully agree with that. Here is my take on
> > that. It is again based on the number of retries loop.
> >
> > I was also playing with an idea of doing something similar to the
> > reclaim retry logic:
> > if (order) {
> > if (compaction_made_progress(compact_result)
> > no_compact_progress = 0;
> > else if (compaction_failed(compact_result)
> > no_compact_progress++;
> > }
> > but it is compaction_failed() part which is not really
> > straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > compaction_suitable however hide this from compaction users so it
> > seems like we can never see it.
> >
> > Maybe we can update the feedback mechanism from the compaction but
> > retries count seems reasonably easy to understand and pragmatic. If
> > we cannot form a order page after we tried for N times then it really
> > doesn't make much sense to continue and we are oom for this order. I am
> > holding my breath to hear from Hugh on this, though.
>
> Never a wise strategy. But I just got around to it tonight.
>
> I do believe you've nailed it with this patch! Thank you!

That's great news! Thanks for testing.

> I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> but I've not got to them yet): so far it is looking good on all machines.
>
> After a quick go with the simple make -j20 in tmpfs, which survived
> a cycle on the laptop, I've switched back to my original tougher load,
> and that's going well so far: no sign of any OOMs. But I've interrupted
> on the laptop to report back to you now, then I'll leave it running
> overnight.

OK, let's wait for the rest of the tests but I find it really optimistic
considering how easily you could trigger the issue previously. Anyway
I hope for your Tested-by after you are reasonably confident your loads
are behaving well.

[...]
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index b167801187e7..7d028ccf440a 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -61,6 +61,12 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> > bool alloc_success);
> > extern bool compaction_restarting(struct zone *zone, int order);
> >
> > +static inline bool compaction_made_progress(enum compact_result result)
> > +{
> > + return (compact_result > COMPACT_SKIPPED &&
> > + compact_result < COMPACT_NO_SUITABLE_PAGE)
>
> That line didn't build at all:
>
> return result > COMPACT_SKIPPED && result < COMPACT_NO_SUITABLE_PAGE;

those last minute changes... Sorry about that. Fixed.

Thanks!
--
Michal Hocko
SUSE Labs

2016-03-11 14:53:27

by Joonsoo Kim


2016-03-09 19:41 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
>> 2016-03-09 1:05 GMT+09:00 Michal Hocko <[email protected]>:
>> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
>> >> 2016-03-08 1:08 GMT+09:00 Michal Hocko <[email protected]>:
>> >> > On Mon 29-02-16 22:02:13, Michal Hocko wrote:
>> >> >> Andrew,
>> >> >> could you queue this one as well, please? This is more a band aid than a
>> >> >> real solution which I will be working on as soon as I am able to
>> >> >> reproduce the issue but the patch should help to some degree at least.
>> >> >
>> >> > Joonsoo wasn't very happy about this approach so let me try a different
>> >> > way. What do you think about the following? Hugh, Sergey does it help
>> >>
>> >> I'm still not happy. Just ensuring one compaction run doesn't mean our
>> >> best.
>> >
>> > OK, let me think about it some more.
>> >
>> >> What's your purpose of OOM rework? From my understanding,
>> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
>> >> This makes sense.
>> >
>> > Well this is a bit awkward because we do not have any proper definition
>> > of what prematurely actually means. We do not know whether something
>>
>> If we don't have proper definition to it, please define it first.
>
> OK, I should have probably said that _there_is_no_proper_definition_...
> This will always be about heuristics as the clear cut can be pretty
> subjective and what some load might see as unreasonable retries others
> might see as insufficient. Our ultimate goal is to behave reasonable for
> reasonable workloads. I am somehow skeptical about formulating this
> into a single equation...

I don't want a theoretically perfect definition. We need something that
can be used for judging further changes. So, how can you judge what
reasonable behaviour for a reasonable workload is? What are your
criteria? If someone complains that 16 retries is too small and another
complains that 16 retries is too big, what's your decision in this case?

If you decide to increase the number of retries in this case, when can
we stop increasing it? If someone complains again that XX is too small,
do you continue to increase it?

For me, in the order-0 case, the reasonable part is the watermark check
against available (free + reclaimable) memory. It shows that we've done
our best, so it doesn't matter how many times we retry.

But for the high-order case, there is no *feasible* estimation. A
watermark check like the one you did here isn't feasible because the
high-order freepage problem usually happens when there are enough
freepages but they are fragmented. It would always fail. Without a
feasible estimation, N retries can't show anything.
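
The contrast can be made concrete with a simplified sketch (not the
kernel's __zone_watermark_ok(), just the shape of the argument): an
order-0 estimate only needs a total of free plus reclaimable pages,
while a high-order estimate needs a suitably sized free block, which
fragmentation can make unavailable no matter how much is reclaimed.

#include <stdbool.h>

#define MAX_ORDER 11

/* Simplified zone view for illustration, not the real struct zone. */
struct zone_sketch {
        unsigned long free_pages;
        unsigned long reclaimable_pages;
        /* free blocks per order, as in struct free_area */
        unsigned long nr_free[MAX_ORDER];
};

/*
 * order-0: a total of (free + reclaimable) above the watermark is a
 * meaningful "we could still make progress" estimate.
 */
static bool order0_can_make_progress(const struct zone_sketch *z,
                                     unsigned long watermark)
{
        return z->free_pages + z->reclaimable_pages > watermark;
}

/*
 * high order: what matters is whether a free block of the requested
 * order (or larger) exists; plenty of order-0 pages do not help, so
 * this keeps failing on a fragmented zone even when the totals look
 * perfectly healthy.
 */
static bool high_order_block_available(const struct zone_sketch *z,
                                       unsigned int order)
{
        unsigned int o;

        for (o = order; o < MAX_ORDER; o++)
                if (z->nr_free[o])
                        return true;
        return false;
}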

Your logic here goes like this:

"We've tried reclaim/compaction N times and failed. That proves there
is no possibility to make a high-order page. We should trigger OOM now."

Is it true that there is no possibility to make a high-order page in
this case? Can you be sure?

If someone who gets an OOM complains about a regression, can you
persuade him with the above logic?

I don't think so. This is why I ask you to make a proper definition of
the term *premature* here.

>> We need to improve the situation toward the clear goal. Just certain
>> number of retry which has no base doesn't make any sense.
>
> Certain number of retries is what we already have right now. And that
> certain number is hard to define even though it looks as simple as
>
> NR_PAGES_SCANNED < 6*zone_reclaimable_pages && no_reclaimable_pages
>
> because this is highly fragile when there are only few pages freed
> regularly but not sufficient to get us out of the loop... I am trying
> to formulate those retries somehow more deterministically considering
> the feedback _and_ an estimate about the feasibility of future
> reclaim/compaction. I admit that my attempts at compaction part have
> been far from ideal so far. Partially because I missed many aspects
> how it works.
> [...]
>> > not fire _often_ to be impractical. There are loads where the new
>> > implementation behaved slightly better (see the cover for my tests) and
>> > there surely be some where this will be worse. I want this to be
>> > reasonably good. I am not claiming we are there yet and the interaction
>> > with the compaction seems like it needs some work, no question about
>> > that.
>> >
>> >> But, what you did in case of high order allocation is completely different
>> >> with original purpose. It may be deterministic but *completely premature*.
>> >> There is no way to prevent premature OOM kill. So, I want to ask one more
>> >> time. Why OOM kill is better than retry reclaiming when there is reclaimable
>> >> page? Deterministic is for what? It ensures something more?
>> >
>> > yes, If we keep reclaiming we can soon start trashing or over reclaim
>> > too much which would hurt more processes. If you invoke the OOM killer
>> > instead then chances are that you will release a lot of memory at once
>> > and that would help to reconcile the memory pressure as well as free
>> > some page blocks which couldn't have been compacted before and not
>> > affect potentially many processes. The effect would be reduced to a
>> > single process. If we had a proper trashing detection feedback we could
>> > do much more clever decisions of course.
>>
>> It looks like you did it for performance reason. You'd better think again about
>> effect of OOM kill. We don't have enough knowledge about user space program
>> architecture and killing one important process could lead to whole
>> system unusable. Moreover, OOM kill could cause important data loss so
>> should be avoided as much as possible. Performance reason cannot
>> justify OOM kill.
>
> No I am not talking about performance. I am talking about the system
> healthiness as whole.

So, do you think that more frequent OOM kill is healthier than other ways?

>> > But back to the !costly OOMs. Once your system is fragmented so heavily
>> > that there are no free blocks that would satisfy !costly request then
>> > something has gone terribly wrong and we should fix it. To me it sounds
>> > like we do not care about those requests early enough and only start
>> > carying after we hit the wall. Maybe kcompactd can help us in this
>> > regards.
>>
>> Yes, but, it's another issue. In any situation, !costly OOM should not happen
>> prematurely.
>
> I fully agree and I guess we also agree on the assumption that we
> shouldn't retry endlessly. So let's focus on what the OOM convergence
> criteria should look like. I have another proposal which I will send as
> a reply to the previous one.

That's also insufficient to me. It just adds one more brute-force retry
for compaction without any reasonable estimation.

>> >> Please see Hugh's latest vmstat. There are plenty of anon pages when
>> >> OOM kill happens and it may have enough swap space. Even if
>> >> compaction runs and fails, why do we need to kill something
>> >> in this case? OOM kill should be a last resort.
>> >
>> > Well this would be the case even if we were trashing over swap.
>> > Refaulting the swapped out memory all over again...
>>
>> If thrashing is a main obstacle to decide proper OOM point,
>> we need to invent a way to handle thrashing or invent reasonable metric
>> which isn't affected by thrashing.
>
> Great, you are welcome to come up with one. But more seriously, isn't

For example, we can count how many pages were reclaimed and compare
that with the number of reclaimable pages at the start. If the number
of reclaimed pages exceeds it, we can assume that we've tried to
reclaim every reclaimable page at least once and can go to the next
step such as OOM.
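
A rough sketch of that idea (illustrative only, with the bookkeeping
collapsed into one structure and helper names invented here): snapshot
the reclaimable estimate when the allocation attempt starts, accumulate
what each reclaim round actually freed, and only treat the attempt as
exhausted once the accumulated total covers the initial estimate.

#include <stdbool.h>

struct reclaim_budget {
        unsigned long reclaimable_at_start;     /* snapshot at first attempt */
        unsigned long reclaimed_so_far;         /* sum of per-round progress */
};

static void reclaim_budget_init(struct reclaim_budget *b,
                                unsigned long reclaimable_pages)
{
        b->reclaimable_at_start = reclaimable_pages;
        b->reclaimed_so_far = 0;
}

/* Returns true once we have reclaimed at least as much as looked
 * reclaimable at the start, i.e. every reclaimable page got a chance. */
static bool reclaim_budget_exhausted(struct reclaim_budget *b,
                                     unsigned long just_reclaimed)
{
        b->reclaimed_so_far += just_reclaimed;
        return b->reclaimed_so_far >= b->reclaimable_at_start;
}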

Thanks.

2016-03-11 15:20:53

by Michal Hocko


On Fri 11-03-16 23:53:18, Joonsoo Kim wrote:
> 2016-03-09 19:41 GMT+09:00 Michal Hocko <[email protected]>:
> > On Wed 09-03-16 02:03:59, Joonsoo Kim wrote:
> >> 2016-03-09 1:05 GMT+09:00 Michal Hocko <[email protected]>:
> >> > On Wed 09-03-16 00:19:03, Joonsoo Kim wrote:
[...]
> >> >> What's your purpose of OOM rework? From my understanding,
> >> >> you'd like to trigger OOM kill deterministic and *not prematurely*.
> >> >> This makes sense.
> >> >
> >> > Well this is a bit awkward because we do not have any proper definition
> >> > of what prematurely actually means. We do not know whether something
> >>
> >> If we don't have proper definition to it, please define it first.
> >
> > OK, I should have probably said that _there_is_no_proper_definition_...
> > This will always be about heuristics as the clear cut can be pretty
> > subjective and what some load might see as unreasonable retries others
> > might see as insufficient. Our ultimate goal is to behave reasonable for
> > reasonable workloads. I am somehow skeptical about formulating this
> > into a single equation...
>
> I don't want a theoretically perfect definition. We need something that
> can be used for judging further changes. So, how can you judge that
> reasonable behave for reasonable workload? What's your criteria?
> If someone complains 16 retries is too small and the other complains
> 16 retries is too big, what's your decision in this case?

The number of retries is an implementation detail. What matters,
really, is whether we can argue about the particular load and why it
should resp. shouldn't trigger the OOM killer. We can use our
tracepoints to have a look, judge the overall progress or lack of it,
and see if we could do better. It is not the number of retries to tweak
first; it is the reclaim/compaction that should be made more reliable.
Tweaking the retries would be just the very last resort. If we can see
that compaction doesn't form the high order pages at a sufficient pace,
we should find out why.

> If you decide to increase number of retry in this case, when can we
> stop that increasing? If someone complains again that XX is too small
> then do you continue to increase it?
>
> For me, for order 0 case, reasonable part is watermark checking with
> available (free + reclaimable) memory. It shows that we've done
> our best so it doesn't matter that how many times we retry.
>
> But, for high order case, there is no *feasible* estimation. Watermark
> check as you did here isn't feasible because high order freepage
> problem usually happen when there are enough but fragmented freepages.
> It would be always failed. Without feasible estimation, N retry can't
> show anything.

That's why I have made the compaction retry loop independent of it in
the last patch.

> Your logic here is just like below.
>
> "We've tried N times reclaim/compaction and failed. It is proved that
> there is no possibility to make high order page. We should trigger OOM now."

Have you seen the last patch where I make sure that the compaction has
to report _success_ at least N times before we declare the OOM? I think
we can be reasonably sure that compacting again and again without any
bound doesn't make much sense when it doesn't lead to a page of the
requested order.

> Is it true that there is no possibility to make high order page in this case?
> Can you be sure?

The thing I am trying to tell you, and I seem to fail here, is that you
simply cannot be sure. Full stop. We might be staggering on the edge of the
cliff and fall or be lucky and end up on the safe side.

> If someone who get OOM complains regression, can you persuade him
> by above logic?

This really depends on the particular load of course.

> I don't think so. This is why I ask you to make proper definition on
> term *premature* here.

Sigh. And what if that particular reporter doesn't agree with my
"proper" definition because it doesn't suit the workload of interest?
I mean, anything we end up doing is highly subjective and it has been
like that ever since the OOM killer was introduced.

[...]
> >> It looks like you did it for performance reason. You'd better think again about
> >> effect of OOM kill. We don't have enough knowledge about user space program
> >> architecture and killing one important process could lead to whole
> >> system unusable. Moreover, OOM kill could cause important data loss so
> >> should be avoided as much as possible. Performance reason cannot
> >> justify OOM kill.
> >
> > No I am not talking about performance. I am talking about the system
> > healthiness as whole.
>
> So, do you think that more frequent OOM kill is healthier than other ways?

I didn't say so. And except for Hugh's testcase I haven't seen the
rework cause that. As per the last testing results it seems that this
particular case has been fixed. If you believe that you can see other
cases then I am more than happy to look at them.

> >> > But back to the !costly OOMs. Once your system is fragmented so heavily
> >> > that there are no free blocks that would satisfy !costly request then
> >> > something has gone terribly wrong and we should fix it. To me it sounds
> >> > like we do not care about those requests early enough and only start
> >> > carying after we hit the wall. Maybe kcompactd can help us in this
> >> > regards.
> >>
> >> Yes, but, it's another issue. In any situation, !costly OOM should not happen
> >> prematurely.
> >
> > I fully agree and I guess we also agree on the assumption that we
> > shouldn't retry endlessly. So let's focus on what the OOM convergence
> > criteria should look like. I have another proposal which I will send as
> > a reply to the previous one.
>
> That's also insufficient to me. It just add one more brute force retry
> for compaction
> without any reasonable estimation.

The compaction absolutely lacks any useful feedback mechanism. If we
ever grow one I am more than happy to make the estimate better.
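
Purely as a speculative illustration of what such a feedback channel
might carry (none of these fields exist; the structure and names are
invented here), the allocator would want per-attempt data rather than a
bare status code:

#include <stdbool.h>

/*
 * Speculative sketch only: a richer result from one compaction attempt
 * that the allocator could use to estimate whether retrying is likely
 * to help, instead of relying on enum compact_result alone.
 */
struct compact_feedback {
        unsigned long nr_migrated;      /* pages successfully moved */
        unsigned long nr_isolate_fail;  /* pages that could not be isolated */
        unsigned long blocks_skipped;   /* pageblocks skipped entirely */
        bool contended;                 /* bailed out due to lock contention */
};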

--
Michal Hocko
SUSE Labs

2016-03-11 19:08:17

by Hugh Dickins


On Fri, 11 Mar 2016, Michal Hocko wrote:
> On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> > On Wed, 9 Mar 2016, Michal Hocko wrote:
> > > Joonsoo has pointed out that this attempt is still not sufficient
> > > becasuse we might have invoked only a single compaction round which
> > > is might be not enough. I fully agree with that. Here is my take on
> > > that. It is again based on the number of retries loop.
> > >
> > > I was also playing with an idea of doing something similar to the
> > > reclaim retry logic:
> > > if (order) {
> > > if (compaction_made_progress(compact_result)
> > > no_compact_progress = 0;
> > > else if (compaction_failed(compact_result)
> > > no_compact_progress++;
> > > }
> > > but it is compaction_failed() part which is not really
> > > straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> > > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > > compaction_suitable however hide this from compaction users so it
> > > seems like we can never see it.
> > >
> > > Maybe we can update the feedback mechanism from the compaction but
> > > retries count seems reasonably easy to understand and pragmatic. If
> > > we cannot form a order page after we tried for N times then it really
> > > doesn't make much sense to continue and we are oom for this order. I am
> > > holding my breath to hear from Hugh on this, though.
> >
> > Never a wise strategy. But I just got around to it tonight.
> >
> > I do believe you've nailed it with this patch! Thank you!
>
> That's a great news! Thanks for testing.
>
> > I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> > on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> > but I've not got to them yet): so far it is looking good on all machines.
> >
> > After a quick go with the simple make -j20 in tmpfs, which survived
> > a cycle on the laptop, I've switched back to my original tougher load,
> > and that's going well so far: no sign of any OOMs. But I've interrupted
> > on the laptop to report back to you now, then I'll leave it running
> > overnight.
>
> OK, let's wait for the rest of the tests but I find it really optimistic
> considering how easily you could trigger the issue previously. Anyway
> I hope for your Tested-by after you are reasonably confident your loads
> are behaving well.

Three have been stably running load for between 6 and 7 hours now,
no problems, looking very good:

Tested-by: Hugh Dickins <[email protected]>

I'll be interested to see how my huge tmpfs loads fare with the rework,
but I'm not quite ready to try that today; and any issue there (I've no
reason to suppose that there will be) can be a separate investigation
for me to make at some future date. It was this order=2 regression
that was holding me back, and I've now no objection to your patches
(though nobody should imagine that I've actually studied them).

Thank you, Michal.

Hugh

2016-03-14 16:22:03

by Michal Hocko


On Fri 11-03-16 11:08:05, Hugh Dickins wrote:
> On Fri, 11 Mar 2016, Michal Hocko wrote:
> > On Fri 11-03-16 04:17:30, Hugh Dickins wrote:
> > > On Wed, 9 Mar 2016, Michal Hocko wrote:
> > > > Joonsoo has pointed out that this attempt is still not sufficient
> > > > becasuse we might have invoked only a single compaction round which
> > > > is might be not enough. I fully agree with that. Here is my take on
> > > > that. It is again based on the number of retries loop.
> > > >
> > > > I was also playing with an idea of doing something similar to the
> > > > reclaim retry logic:
> > > > if (order) {
> > > > if (compaction_made_progress(compact_result)
> > > > no_compact_progress = 0;
> > > > else if (compaction_failed(compact_result)
> > > > no_compact_progress++;
> > > > }
> > > > but it is compaction_failed() part which is not really
> > > > straightforward to define. Is it COMPACT_NO_SUITABLE_PAGE
> > > > resp. COMPACT_NOT_SUITABLE_ZONE sufficient? compact_finished and
> > > > compaction_suitable however hide this from compaction users so it
> > > > seems like we can never see it.
> > > >
> > > > Maybe we can update the feedback mechanism from the compaction but
> > > > retries count seems reasonably easy to understand and pragmatic. If
> > > > we cannot form a order page after we tried for N times then it really
> > > > doesn't make much sense to continue and we are oom for this order. I am
> > > > holding my breath to hear from Hugh on this, though.
> > >
> > > Never a wise strategy. But I just got around to it tonight.
> > >
> > > I do believe you've nailed it with this patch! Thank you!
> >
> > That's a great news! Thanks for testing.
> >
> > > I've applied 1/3, 2/3 and this (ah, it became the missing 3/3 later on)
> > > on top of 4.5.0-rc5-mm1 (I think there have been a couple of mmotms since,
> > > but I've not got to them yet): so far it is looking good on all machines.
> > >
> > > After a quick go with the simple make -j20 in tmpfs, which survived
> > > a cycle on the laptop, I've switched back to my original tougher load,
> > > and that's going well so far: no sign of any OOMs. But I've interrupted
> > > on the laptop to report back to you now, then I'll leave it running
> > > overnight.
> >
> > OK, let's wait for the rest of the tests but I find it really optimistic
> > considering how easily you could trigger the issue previously. Anyway
> > I hope for your Tested-by after you are reasonably confident your loads
> > are behaving well.
>
> Three have been stably running load for between 6 and 7 hours now,
> no problems, looking very good:
>
> Tested-by: Hugh Dickins <[email protected]>

Thanks!

> I'll be interested to see how my huge tmpfs loads fare with the rework,
> but I'm not quite ready to try that today; and any issue there (I've no
> reason to suppose that there will be) can be a separate investigation
> for me to make at some future date. It was this order=2 regression
> that was holding me back, and I've now no objection to your patches
> (though nobody should imagine that I've actually studied them).

I still have some work pending on top and I do not want to rush these
changes, so I will target 4.7. 4.6 is just too close and I would hate
to push some last minute changes. I think the oom_reaper is a large
enough change for 4.6 in this area.

I will post the full series after rc1. Andrew, feel free to drop it
from the mmotm tree for now. I would prefer the patches all get
reviewed together rather than as a larger number of fixups.

Thanks, Hugh, for your testing. I wish I could depend on it less, but
I've not been able to reproduce the issue no matter how much I tried.

--
Michal Hocko
SUSE Labs