2016-04-05 11:25:50

by Michal Hocko

Subject: [PATCH 00/11] oom detection rework v5

Hi,

This is v5 of the series. The previous version was posted [1]. There
were a number of changes since then, mostly driven by the premature oom
killer invocation for higher order requests reported by Hugh [2]. That
forced me to look into compaction because it became clear that we have
to consider its feedback to make any retry decisions, especially for
!costly high order requests. As a result we have 7 compaction related
patches because the code wasn't really prepared to give its caller any
feedback. I have tried hard not to change the compaction logic or tune
it for my specific test cases, and rather come up with a generic
feedback mechanism which is mostly independent of the current
implementation.

As pointed out by Linus [3][4], relying on zone_reclaimable as a way to
communicate the reclaim progress is rather dubious. I tend to agree.
Not only is it really obscure, it is also not hard to imagine cases where
a single page freed in the loop keeps all the reclaimers looping without
making any progress because their gfp_mask wouldn't allow them to get
that page anyway (e.g. a single GFP_ATOMIC alloc and free loop). This is
rather rare so it doesn't happen in practice, but the current logic is
obscure, hard to follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress, which is implemented at the page allocator layer rather than
spread out between the allocator and the reclaim path. More on the
implementation is described in the first patch.

I have tested several different scenarios but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually a tiny gap between almost OOM and full blown OOM which is often
time sensitive. Anyway, I have tested the following 2 scenarios and I
would appreciate suggestions for more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs without
any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
file size, removing the files and starting over again) running in
parallel for 10s to build up a lot of dirty pages, and then 100 parallel
mem_eaters (anon private populated mmap which waits until it gets a
signal - see the sketch below) with 80M each.
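
The mem_eater program itself is not included in this posting; a minimal
sketch of what such a program might look like (an anonymous private
populated mapping that just waits for a signal; the size handling is
illustrative) is:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* size in MB, 80 in the test above */
	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 80;
	size_t size = mb << 20;
	void *p;

	/* anon private mapping; MAP_POPULATE pre-faults all pages up front */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* sit on the memory until the test driver sends a signal */
	pause();
	return 0;
}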

This causes an OOM flood of course and I have compared both patched
and unpatched kernels. The test is considered finished once no more OOM
conditions are detected. This should tell us whether there are any
excessive kills or whether some of them are premature (e.g. due to dirty
pages):

I have performed two runs this time, each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
$ grep "Out of memory:" patched-oom-run2.log | wc -l
77

e grep "Kill process" patched-oom-run1.log | tail -n1
[ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel ran noticeably longer while invoking the OOM killer the
same number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. The free pages stats
show that neither kernel went OOM too early most of the time, though. I
guess the difference is in the backoff: retries without any progress
sleep for a while if there is memory under writeback or dirty, which is
highly likely considering the parallel IO.
Both kernels have seen races where the zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.

2) 2 writers again with a 10s run and then 10 mem_eaters to consume as much
memory as possible without triggering the OOM killer. This required a lot
of tuning but I've considered 3 consecutive runs in three different boots
without an OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory (4M more per mem_eater times 10 mem_eaters)
was usable without triggering the OOM killer. The base kernel sometimes
managed to handle the same amount as the patched one but it wasn't
consistent and failed in at least one of the 3 runs. This seems like a
minor improvement.

I was also testing __GFP_REPEAT costly requests (hugetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short, I can see a huge improvement there.

I am certainly interested in other use cases as well as any feedback,
especially for those which require higher order requests.

* Changes since v4
- dropped __GFP_REPEAT for costly allocations as it is now replaced by
the compaction based feedback logic
- !costly high order requests are retried based on the compaction feedback
- compaction feedback has been tweaked to give us useful information for
making decisions in the page allocator
- rebased on the current mmotm-2016-04-01-16-24 with the previous version
of the rework reverted

* Changes since v3
- factor out the new heuristic into its own function as suggested by
Johannes (no functional changes)

* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
wait_iff_congested related changes which needed refresh in
patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- fixed a theoretical bug where a __GFP_NOFAIL high order allocation might fail

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[4] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com


2016-04-05 11:25:54

by Michal Hocko

Subject: [PATCH 07/11] mm, compaction: Update compaction_result ordering

From: Michal Hocko <[email protected]>

compaction_result will be used as the primary feedback channel for
compaction users. At the same time try_to_compact_pages (and potentially
others) assumes a certain ordering where a more specific feedback takes
precedence. This gets a bit awkward when we have conflicting feedback
from different zones. E.g. one zone returning COMPACT_COMPLETE, meaning
the full zone has been scanned without any outcome, while another returns
COMPACT_PARTIAL, i.e. it made some progress. The caller should get
COMPACT_PARTIAL because that means that the compaction can still make
some progress. The same applies for COMPACT_PARTIAL vs.
COMPACT_PARTIAL_SKIPPED. Reorder COMPACT_PARTIAL to be the largest value
so that the larger the value, the more progress has been made.
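
Illustration only (not part of the patch): with this ordering a caller
can merge per-zone results with a plain max, similar in spirit to the
max_t merging added earlier in the series, and automatically keep the
most optimistic feedback. Assumes enum compact_result from
include/linux/compaction.h.

/* e.g. COMPACT_PARTIAL wins over COMPACT_COMPLETE over COMPACT_SKIPPED */
static enum compact_result pick_best_result(enum compact_result a,
					    enum compact_result b)
{
	return a > b ? a : b;
}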

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7c4de92d12cc..a7b9091ff349 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,6 +4,8 @@
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NOT_SUITABLE_ZONE,
/*
* compaction didn't start as it was not possible or direct reclaim
* was more suitable
@@ -11,30 +13,34 @@ enum compact_result {
COMPACT_SKIPPED,
/* compaction didn't start as it was deferred due to past failures */
COMPACT_DEFERRED,
+
/* compaction not active last round */
COMPACT_INACTIVE = COMPACT_DEFERRED,

+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NO_SUITABLE_PAGE,
/* compaction should continue to another pageblock */
COMPACT_CONTINUE,
+
/*
- * direct compaction partially compacted a zone and there are suitable
- * pages
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
*/
- COMPACT_PARTIAL,
+ COMPACT_COMPLETE,
/*
* direct compaction has scanned part of the zone but wasn't successfull
* to compact suitable pages.
*/
COMPACT_PARTIAL_SKIPPED,
+
+ /* compaction terminated prematurely due to lock contentions */
+ COMPACT_CONTENDED,
+
/*
- * The full zone was compacted scanned but wasn't successfull to compact
- * suitable pages.
+ * direct compaction partially compacted a zone and there might be
+ * suitable pages
*/
- COMPACT_COMPLETE,
- /* For more detailed tracepoint output */
- COMPACT_NO_SUITABLE_PAGE,
- COMPACT_NOT_SUITABLE_ZONE,
- COMPACT_CONTENDED,
+ COMPACT_PARTIAL,
};

/* Used to signal whether compaction detected need_sched() or lock contention */
--
2.8.0.rc3

2016-04-05 11:25:56

by Michal Hocko

Subject: [PATCH 11/11] mm: consider compaction feedback also for costly allocation

From: Michal Hocko <[email protected]>

PAGE_ALLOC_COSTLY_ORDER retry logic is currently mostly handled inside
should_reclaim_retry, where we decide not to retry after at least order
worth of pages have been reclaimed or when the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress. Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim has
been removed quite some time ago so the assumption doesn't hold anymore.
Remove the check for the number of reclaimed pages and rely on the
compaction feedback solely. should_reclaim_retry now makes sure that we
keep retrying reclaim for high order pages only if they are hidden by
watermarks, so that order-0 reclaim really makes sense.

should_compact_retry now keeps retrying even for costly allocations.
The number of retries is reduced wrt. !costly requests because they are
less important and harder to grant, so their pressure shouldn't cause
contention for other requests or cause over-reclaim. We also do not
reset no_progress_loops for costly requests to make sure we do not keep
reclaiming too aggressively.

This has been tested by running a process which fragments memory:
- compact memory
- mmap a large portion of the memory (1920M on a 2G RAM machine with 2G
of swap space)
- MADV_DONTNEED a single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
steps until a certain amount of memory is freed (250M in my test)
and reduce the step to (step / 2) + 1 after reaching the end of
the mapping (see the sketch after this list)
- then run a script which populates the page cache with 2G (MemTotal)
from /dev/zero into a new file
And then try to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.
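
The fragmentation helper is not part of this posting; a rough sketch of
the MADV_DONTNEED punching loop described above could look like this
(the names and the MAX_ORDER == 11 assumption are illustrative; it also
assumes to_free < len):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void fragment(char *base, size_t len, size_t to_free)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t nr_pages = len / page;
	size_t step = (1UL << 11) - 1;	/* stride in pages, (1UL<<MAX_ORDER)-1 */
	size_t freed = 0, idx = 0;

	while (freed < to_free) {
		if (idx >= nr_pages) {
			/* reached the end of the mapping, shrink the stride */
			step = step / 2 + 1;
			idx = 0;
			continue;
		}
		/* punch a single page hole */
		madvise(base + idx * page, page, MADV_DONTNEED);
		freed += page;	/* may re-punch a page; close enough for a sketch */
		idx += step;
	}
}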

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1
Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437

* This is the /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0

* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0

* /proc/buddyinfo after page cache got eaten

Trying to allocate 129
129

* 129 hugepages requested and all of them granted.

Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0

* /proc/buddyinfo after hugetlb allocation.

10 runs behaved as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129

So basically a 100% success rate for all 10 attempts.
Without the patch the numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37

Just for completeness, the base kernel without the oom detection rework
looks as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8

As we can see, the success rate is much more volatile and smaller without
this patch. So the patch not only makes the retry logic for costly
requests more sensible, the success rate is also higher.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
1 file changed, 32 insertions(+), 31 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d26ab892a7d..90a18ae92849 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
enum migrate_mode *migrate_mode,
int compaction_retries)
{
+ int max_retries = MAX_COMPACT_RETRIES;
+
if (!order)
return false;

@@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
}

/*
- * !costly allocations are really important and we have to make sure
- * the compaction wasn't deferred or didn't bail out early due to locks
- * contention before we go OOM. Still cap the reclaim retry loops with
- * progress to prevent from looping forever and potential trashing.
+ * make sure the compaction wasn't deferred or didn't bail out early
+ * due to locks contention before we declare that we should give up.
*/
- if (order <= PAGE_ALLOC_COSTLY_ORDER) {
- if (compaction_withdrawn(compact_result))
- return true;
- if (compaction_retries <= MAX_COMPACT_RETRIES)
- return true;
- }
+ if (compaction_withdrawn(compact_result))
+ return true;
+
+ /*
+ * !costly requests are much more important than __GFP_REPEAT
+ * costly ones because they are de facto nofail and invoke OOM
+ * killer to move on while costly can fail and users are ready
+ * to cope with that. 1/4 retries is rather arbitrary but we
+ * would need much more detailed feedback from compaction to
+ * make a better decision.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_retries /= 4;
+ if (compaction_retries <= max_retries)
+ return true;

return false;
}
@@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
* Checks whether it makes sense to retry the reclaim to make a forward progress
* for the given allocation request.
* The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
*
* Returns true if a retry is viable or false to enter the oom path.
*/
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
- bool did_some_progress, unsigned long pages_reclaimed,
- int no_progress_loops)
+ bool did_some_progress, int no_progress_loops)
{
struct zone *zone;
struct zoneref *z;
@@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
if (no_progress_loops > MAX_RECLAIM_RETRIES)
return false;

- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- if (pages_reclaimed >= (1<<order))
- return false;
-
- if (did_some_progress)
- return true;
- }
-
/*
* Keep reclaiming pages while there is a chance this will lead somewhere.
* If none of the target zones can satisfy our allocation request even
@@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
struct page *page = NULL;
int alloc_flags;
- unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
enum compact_result compact_result;
@@ -3442,16 +3441,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto noretry;

- if (did_some_progress) {
+ /*
+ * Costly allocations might have made a progress but this doesn't mean
+ * their order will become available due to high fragmentation so
+ * always increment the no progress counter for them
+ */
+ if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
no_progress_loops = 0;
- pages_reclaimed += did_some_progress;
- } else {
+ else
no_progress_loops++;
- }

if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, pages_reclaimed,
- no_progress_loops))
+ did_some_progress > 0, no_progress_loops))
goto retry;

/*
--
2.8.0.rc3

2016-04-05 11:26:17

by Michal Hocko

Subject: [PATCH 10/11] mm, oom: protect !costly allocations some more

From: Michal Hocko <[email protected]>

should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available, even if we pass the watermark check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help, but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if
- the last compaction round backed off or
- we haven't completed at least MAX_COMPACT_RETRIES active
compaction rounds.

The first rule ensures that the very last attempt at compaction
was not ignored while the second guarantees that the compaction has done
some work. Multiple retries might be needed to prevent occasional
piggybacking of other contexts which steal the compacted pages before
the current context manages to retry to allocate them.

compaction_failed() is taken as a final word from the compaction that
the retry doesn't make much sense. We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up. With this logic in place we do not have
to increase the migration mode unconditionally and rather do it only if
the compaction failed for the weaker mode. A nice side effect is that
the stronger migration mode is used only when really needed, so this has
the potential for smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how
successful it was when returning compaction_made_progress, so we just
have to blindly trust that another retry is worthwhile and cap the
number to something reasonable to guarantee convergence.

If the given number of successful retries is not sufficient for
reasonable workloads we should focus on the collected compaction
tracepoint data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 78 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c05de84c8157..3d26ab892a7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
return page;
}

+
+/*
+ * Maximum number of compaction retries wit a progress before OOM
+ * killer is consider as the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
#ifdef CONFIG_COMPACTION
/* Try memory compaction for high-order allocations before reclaim */
static struct page *
@@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,

return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ if (!order)
+ return false;
+
+ /*
+ * compaction considers all the zone as desperately out of memory
+ * so it doesn't really make much sense to retry except when the
+ * failure could be caused by weak migration mode.
+ */
+ if (compaction_failed(compact_result)) {
+ if (*migrate_mode == MIGRATE_ASYNC) {
+ *migrate_mode = MIGRATE_SYNC_LIGHT;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM. Still cap the reclaim retry loops with
+ * progress to prevent from looping forever and potential trashing.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compaction_withdrawn(compact_result))
+ return true;
+ if (compaction_retries <= MAX_COMPACT_RETRIES)
+ return true;
+ }
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
{
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */

/* Perform direct synchronous page reclaim */
@@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
enum compact_result compact_result;
+ int compaction_retries = 0;
int no_progress_loops = 0;

/*
@@ -3362,6 +3415,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;

+ if (order && compaction_made_progress(compact_result))
+ compaction_retries++;
+
/*
* Checks for THP-specific high-order allocations and back off
* if the the compaction backed off
@@ -3369,14 +3425,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
goto nopage;

- /*
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse.
- */
- if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_SYNC_LIGHT;
-
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
&did_some_progress);
@@ -3406,6 +3454,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
no_progress_loops))
goto retry;

+ /*
+ * It doesn't make any sense to retry for the compaction if the order-0
+ * reclaim is not able to make any progress because the current
+ * implementation of the compaction depends on the sufficient amount
+ * of free memory (see __compaction_suitable)
+ */
+ if (did_some_progress > 0 &&
+ should_compact_retry(order, compact_result,
+ &migration_mode, compaction_retries))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3419,10 +3478,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,

noretry:
/*
- * High-order allocations do not necessarily loop after
- * direct reclaim and reclaim/compaction depends on compaction
- * being called after reclaim so call directly if necessary
+ * High-order allocations do not necessarily loop after direct reclaim
+ * and reclaim/compaction depends on compaction being called after
+ * reclaim so call directly if necessary.
+ * It can become very expensive to allocate transparent hugepages at
+ * fault, so use asynchronous memory compaction for THP unless it is
+ * khugepaged trying to collapse. All other requests should tolerate
+ * at least light sync migration.
*/
+ if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ else
+ migration_mode = MIGRATE_SYNC_LIGHT;
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&compact_result);
--
2.8.0.rc3

2016-04-05 11:26:33

by Michal Hocko

Subject: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

From: Michal Hocko <[email protected]>

Compaction can provide a wide variety of feedback to the caller. Much
of it is implementation specific and the caller of the compaction
(especially the page allocator) shouldn't be bound to the specifics of
the current implementation.

This patch abstracts the feedback into three basic types:
- compaction_made_progress - compaction was active and made some
progress.
- compaction_failed - compaction failed and further attempts to
invoke it would most probably fail and therefore it is not
worth retrying
- compaction_withdrawn - compaction wasn't invoked for
implementation specific reasons. In the current implementation
it means that the compaction was deferred, contended or the
page scanners met too early without any progress. Retrying is
still worthwhile.
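
Illustration only, not part of the patch: a compaction user could act on
the abstracted feedback without looking at individual COMPACT_* states
along these lines (the real allocator-side logic lands in the following
patches; assumes include/linux/compaction.h):

static bool keep_trying_compaction(enum compact_result result,
				   int retries, int max_retries)
{
	if (compaction_made_progress(result))
		return retries < max_retries;	/* bounded number of useful retries */
	if (compaction_failed(result))
		return false;			/* whole zone scanned, no point retrying */
	if (compaction_withdrawn(result))
		return true;			/* compaction backed off, retrying still makes sense */
	return false;
}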

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 74 ++++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 25 ++++------------
2 files changed, 80 insertions(+), 19 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a7b9091ff349..512db9c3f0ed 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);

+/* Compaction has made some progress and retrying makes sense */
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ /*
+ * Even though this might sound confusing this in fact tells us
+ * that the compaction successfully isolated and migrated some
+ * pageblocks.
+ */
+ if (result == COMPACT_PARTIAL)
+ return true;
+
+ return false;
+}
+
+/* Compaction has failed and it doesn't make much sense to keep retrying. */
+static inline bool compaction_failed(enum compact_result result)
+{
+ /* All zones where scanned completely and still not result. */
+ if (result == COMPACT_COMPLETE)
+ return true;
+
+ return false;
+}
+
+/*
+ * Compaction has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ /*
+ * Compaction backed off due to watermark checks for order-0
+ * so the regular reclaim has to try harder and reclaim something.
+ */
+ if (result == COMPACT_SKIPPED)
+ return true;
+
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (result == COMPACT_DEFERRED)
+ return true;
+
+ /*
+ * If compaction in async mode encounters contention or blocks higher
+ * priority task we back off early rather than cause stalls.
+ */
+ if (result == COMPACT_CONTENDED)
+ return true;
+
+ /*
+ * Page scanners have met but we haven't scanned full zones so this
+ * is a back off in fact.
+ */
+ if (result == COMPACT_PARTIAL_SKIPPED)
+ return true;
+
+ return false;
+}
+
extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
@@ -114,6 +178,16 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}

+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ return true;
+}
+
static inline int kcompactd_run(int nid)
{
return 0;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c37e6d1ad643..c05de84c8157 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3362,25 +3362,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;

- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (compact_result == COMPACT_DEFERRED)
- goto nopage;
-
- /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if(compact_result == COMPACT_CONTENDED)
- goto nopage;
- }
+ /*
+ * Checks for THP-specific high-order allocations and back off
+ * if the the compaction backed off
+ */
+ if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
+ goto nopage;

/*
* It can become very expensive to allocate transparent hugepages at
--
2.8.0.rc3

2016-04-05 11:26:51

by Michal Hocko

Subject: [PATCH 08/11] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface

From: Michal Hocko <[email protected]>

__alloc_pages_direct_compact communicates a potential back off by two
variables:
- deferred_compaction tells us that the compaction returned
COMPACT_DEFERRED
- contended_compaction is set when there is contention on
zone->lock or zone->lru_lock

__alloc_pages_slowpath then backs off for THP allocation requests to
prevent long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details in __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
1 file changed, 31 insertions(+), 36 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 42e935c83de1..c37e6d1ad643 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
- enum compact_result compact_result;
struct page *page;
+ int contended_compaction;

if (!order)
return NULL;

current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- mode, contended_compaction);
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ mode, &contended_compaction);
current->flags &= ~PF_MEMALLOC;

- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_INACTIVE)
return NULL;
- default:
- break;
- }

/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTFAIL);

+ /*
+ * In all zones where compaction was attempted (and not
+ * deferred or skipped), lock contention has been detected.
+ * For THP allocation we do not want to disrupt the others
+ * so we fallback to base pages instead.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ *compact_result = COMPACT_CONTENDED;
+
+ /*
+ * If compaction was aborted due to need_resched(), we do not
+ * want to further increase allocation latency, unless it is
+ * khugepaged trying to collapse.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_SCHED
+ && !(current->flags & PF_KTHREAD))
+ *compact_result = COMPACT_CONTENDED;
+
cond_resched();

return NULL;
@@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
return NULL;
}
@@ -3250,8 +3259,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
- int contended_compaction = COMPACT_CONTENDED_NONE;
+ enum compact_result compact_result;
int no_progress_loops = 0;

/*
@@ -3350,8 +3358,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;

@@ -3364,25 +3371,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
- goto nopage;
-
- /*
- * In all zones where compaction was attempted (and not
- * deferred or skipped), lock contention has been detected.
- * For THP allocation we do not want to disrupt the others
- * so we fallback to base pages instead.
- */
- if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;

/*
- * If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
+ * Compaction is contended so rather back off than cause
+ * excessive stalls.
*/
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
+ if(compact_result == COMPACT_CONTENDED)
goto nopage;
}

@@ -3442,8 +3438,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
--
2.8.0.rc3

2016-04-05 11:27:24

by Michal Hocko

Subject: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

From: Michal Hocko <[email protected]>

COMPACT_COMPLETE now means that the compaction migration and free
scanners met. This is not very useful information if somebody just wants
to use this feedback and make decisions based on it. The current caller
might be a poor guy who just happened to scan a tiny portion of the zone
and that could be the reason no suitable pages were compacted. Make sure
we distinguish full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to
use COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
reconsidered to defer the compaction only for COMPACT_COMPLETE with the
new semantics.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 10 +++++++++-
include/trace/events/compaction.h | 1 +
mm/compaction.c | 14 +++++++++++---
mm/internal.h | 1 +
4 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7e177d111c39..7c4de92d12cc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -21,7 +21,15 @@ enum compact_result {
* pages
*/
COMPACT_PARTIAL,
- /* The full zone was compacted */
+ /*
+ * direct compaction has scanned part of the zone but wasn't successfull
+ * to compact suitable pages.
+ */
+ COMPACT_PARTIAL_SKIPPED,
+ /*
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
+ */
COMPACT_COMPLETE,
/* For more detailed tracepoint output */
COMPACT_NO_SUITABLE_PAGE,
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6ba16c86d7db..36e2d6fb1360 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -14,6 +14,7 @@
EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_CONTINUE, "continue") \
EM( COMPACT_PARTIAL, "partial") \
+ EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
EM( COMPACT_COMPLETE, "complete") \
EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
diff --git a/mm/compaction.c b/mm/compaction.c
index 13709e33a2fc..e2e487cea5ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
if (cc->direct_compaction)
zone->compact_blockskip_flush = true;

- return COMPACT_COMPLETE;
+ if (cc->whole_zone)
+ return COMPACT_COMPLETE;
+ else
+ return COMPACT_PARTIAL_SKIPPED;
}

if (is_via_compact_memory(cc->order))
@@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
+
+ if (cc->migrate_pfn == start_pfn)
+ cc->whole_zone = true;
+
cc->last_migrated_pfn = 0;

trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
@@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
goto break_loop;
}

- if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
+ if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+ status == COMPACT_PARTIAL_SKIPPED)) {
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
@@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
cc.classzone_idx, 0)) {
success = true;
compaction_defer_reset(zone, cc.order, false);
- } else if (status == COMPACT_COMPLETE) {
+ } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
/*
* We use sync migration mode here, so we defer like
* sync direct compaction does.
diff --git a/mm/internal.h b/mm/internal.h
index e9aacea1a0d1..4423dfe69382 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -182,6 +182,7 @@ struct compact_control {
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool direct_compaction; /* False from kcompactd or /proc/... */
+ bool whole_zone; /* Whole zone has been scanned */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
--
2.8.0.rc3

2016-04-05 11:25:48

by Michal Hocko

Subject: [PATCH 01/11] mm, oom: rework oom detection

From: Michal Hocko <[email protected]>

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry the
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent
a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several
times and restarting if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps the
allocator looping but the current task will not be able to allocate that
single page. The OOM killer would be more appropriate than looping
without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of
shrink_zone, which is too low a level to be appropriate for any high
level decisions such as OOM, which is a per-zonelist property. It is
__alloc_pages_slowpath which knows how many attempts have been done and
what the progress was so far, therefore it is more appropriate to
implement this logic there.

The new heuristic is implemented in the should_reclaim_retry helper
called from __alloc_pages_slowpath. It tries to be more deterministic
and easier to follow. It builds on the assumption that retrying makes
sense only if the currently reclaimable memory + free pages would allow
the current allocation request to succeed (as per __zone_watermark_ok)
for at least one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset the counter for successful reclaim rounds.
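
To make the backoff concrete, the per-zone target checked against the
watermark (as implemented in should_reclaim_retry below) is

    target = reclaimable
             - DIV_ROUND_UP(no_progress_loops * reclaimable, MAX_RECLAIM_RETRIES)
             + free

so, as a worked example, with 160000 reclaimable pages each round
without progress shaves another 10000 pages off the target, and after
MAX_RECLAIM_RETRIES (16) such rounds only the free pages remain.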

Costly high order pages mostly preserve their semantics: those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches the equivalent of
the requested order. The only difference is that if there was no progress
during the reclaim we rely on the zone watermark check. This is a more
logical thing to do than the previous 1<<order attempts which were a
result of zone_reclaimable faking the progress.

[[email protected]: check classzone_idx for shrink_zone]
[[email protected]: separate the heuristic into should_reclaim_retry]
[[email protected]: use zone_page_state_snapshot for NR_FREE_PAGES]
[[email protected]: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/swap.h | 1 +
mm/page_alloc.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
mm/vmscan.c | 25 +++----------
3 files changed, 97 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..b14a2bb33514 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
struct vm_area_struct *vma);

/* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4efafc38273..16100acc40ba 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3136,6 +3136,77 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
}

+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+ struct alloc_context *ac, int alloc_flags,
+ bool did_some_progress, unsigned long pages_reclaimed,
+ int no_progress_loops)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure we converge to OOM if we cannot make any progress
+ * several times in the row.
+ */
+ if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;
+
+ if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (pages_reclaimed >= (1<<order))
+ return false;
+
+ if (did_some_progress)
+ return true;
+ }
+
+ /*
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
+ * If none of the target zones can satisfy our allocation request even
+ * if all reclaimable pages are considered then we are screwed and have
+ * to go OOM.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+
+ available = zone_reclaimable_pages(zone);
+ available -= DIV_ROUND_UP(no_progress_loops * available,
+ MAX_RECLAIM_RETRIES);
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ /*
+ * Would the allocation succeed if we reclaimed the whole
+ * available?
+ */
+ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ ac->high_zoneidx, alloc_flags, available)) {
+ /* Wait for some write requests to complete then retry */
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ return true;
+ }
+ }
+
+ return false;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
@@ -3148,6 +3219,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
int contended_compaction = COMPACT_CONTENDED_NONE;
+ int no_progress_loops = 0;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3299,23 +3371,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;

- /* Keep reclaiming pages as long as there is reasonable progress */
- pages_reclaimed += did_some_progress;
- if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
- ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
- goto retry;
+ /*
+ * Do not retry costly high order allocations unless they are
+ * __GFP_REPEAT
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+ goto noretry;
+
+ if (did_some_progress) {
+ no_progress_loops = 0;
+ pages_reclaimed += did_some_progress;
+ } else {
+ no_progress_loops++;
}

+ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, pages_reclaimed,
+ no_progress_loops))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
goto got_pg;

/* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ no_progress_loops = 0;
goto retry;
+ }

noretry:
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c839adc13efd..9a3b2342dbae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,7 +191,7 @@ static bool sane_reclaim(struct scan_control *sc)
}
#endif

-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
{
unsigned long nr;

@@ -2530,10 +2530,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
@@ -2541,7 +2539,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
unsigned long nr_soft_scanned;
gfp_t orig_mask;
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
- bool reclaimable = false;

/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2606,17 +2603,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
&nr_soft_scanned);
sc->nr_reclaimed += nr_soft_reclaimed;
sc->nr_scanned += nr_soft_scanned;
- if (nr_soft_reclaimed)
- reclaimable = true;
/* need some check for avoid more shrink_zone() */
}

- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
- reclaimable = true;
-
- if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
- reclaimable = true;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
}

/*
@@ -2624,8 +2614,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
* promoted it to __GFP_HIGHMEM.
*/
sc->gfp_mask = orig_mask;
-
- return reclaimable;
}

/*
@@ -2650,7 +2638,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
int initial_priority = sc->priority;
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
- bool zones_reclaimable;
retry:
delayacct_freepages_start();

@@ -2661,7 +2648,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ shrink_zones(zonelist, sc);

total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2708,10 +2695,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
goto retry;
}

- /* Any of the zones still reclaimable? Don't OOM. */
- if (zones_reclaimable)
- return 1;
-
return 0;
}

--
2.8.0.rc3

2016-04-05 11:27:47

by Michal Hocko

Subject: [PATCH 04/11] mm, compaction: cover all compaction mode in compact_zone

From: Michal Hocko <[email protected]>

The compiler complains after "mm, compaction: change COMPACT_
constants into enum":

mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
switch (ret) {
^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/compaction.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 8ae7b1c46c72..b06de27b7f72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,15 +1433,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro

ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
- switch (ret) {
- case COMPACT_PARTIAL:
- case COMPACT_SKIPPED:
- /* Compaction is likely to fail */
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
return ret;
- case COMPACT_CONTINUE:
- /* Fall through to compaction */
- ;
- }
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);

/*
* Clear pageblock skip if there were failures recently and compaction
--
2.8.0.rc3

2016-04-05 11:27:46

by Michal Hocko

Subject: [PATCH 05/11] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED

From: Michal Hocko <[email protected]>

try_to_compact_pages can currently return COMPACT_SKIPPED even when the
compaction is deferred for some zone, just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their
ordering in the enum. This means that COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.

This shouldn't introduce any functional change.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 7 +++++--
include/trace/events/compaction.h | 2 +-
mm/compaction.c | 8 +++++---
3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4458fd94170f..7e177d111c39 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,13 +4,16 @@
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
- /* compaction didn't start as it was deferred due to past failures */
- COMPACT_DEFERRED,
/*
* compaction didn't start as it was not possible or direct reclaim
* was more suitable
*/
COMPACT_SKIPPED,
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /* compaction not active last round */
+ COMPACT_INACTIVE = COMPACT_DEFERRED,
+
/* compaction should continue to another pageblock */
COMPACT_CONTINUE,
/*
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index e215bf68f521..6ba16c86d7db 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -10,8 +10,8 @@
#include <trace/events/mmflags.h>

#define COMPACTION_STATUS \
- EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_SKIPPED, "skipped") \
+ EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_CONTINUE, "continue") \
EM( COMPACT_PARTIAL, "partial") \
EM( COMPACT_COMPLETE, "complete") \
diff --git a/mm/compaction.c b/mm/compaction.c
index b06de27b7f72..13709e33a2fc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- enum compact_result rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_SKIPPED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */

*contended = COMPACT_CONTENDED_NONE;
@@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
enum compact_result status;
int zone_contended;

- if (compaction_deferred(zone, order))
+ if (compaction_deferred(zone, order)) {
+ rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
+ }

status = compact_zone_order(zone, order, gfp_mask, mode,
&zone_contended, alloc_flags,
@@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
* If at least one zone wasn't deferred or skipped, we report if all
* zones that were tried were lock contended.
*/
- if (rc > COMPACT_SKIPPED && all_zones_contended)
+ if (rc > COMPACT_INACTIVE && all_zones_contended)
*contended = COMPACT_CONTENDED_LOCK;

return rc;
--
2.8.0.rc3
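
For illustration, a minimal user-space sketch of the aggregation rule the
reordered enum enables. The values and the max-based merging mirror the
hunk above (try_to_compact_pages does merge per-zone results with a max,
as also noted later in the thread); the zone array and the main() harness
are only assumptions for the example, not code from the series.

#include <stdio.h>

/* mirrors the ordering introduced above: SKIPPED < DEFERRED < ... */
enum compact_result {
	COMPACT_SKIPPED,	/* zone not even tried (e.g. watermark checks) */
	COMPACT_DEFERRED,	/* compaction deferred due to past failures */
	COMPACT_CONTINUE,
	COMPACT_PARTIAL,
	COMPACT_COMPLETE,
};

#define max_t(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* merge per-zone results the way the zonelist loop accumulates them */
static enum compact_result merge(const enum compact_result *zone, int nr)
{
	enum compact_result rc = COMPACT_SKIPPED;	/* new initial value */
	int i;

	for (i = 0; i < nr; i++)
		rc = max_t(enum compact_result, zone[i], rc);
	return rc;
}

int main(void)
{
	/* zone DMA skipped on watermarks, the only interesting zone deferred */
	enum compact_result zones[] = { COMPACT_SKIPPED, COMPACT_DEFERRED };

	/* prints 1 (COMPACT_DEFERRED) instead of hiding it behind SKIPPED */
	printf("%d\n", merge(zones, 2));
	return 0;
}

With the old ordering (DEFERRED < SKIPPED) the same merge would have
reported COMPACT_SKIPPED, which is exactly the problem the changelog
describes.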

2016-04-05 11:28:43

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 03/11] mm, compaction: change COMPACT_ constants into enum

From: Michal Hocko <[email protected]>

The compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long

but there doesn't seem to be any reason for that. None of the functions
which return/use one of those constants expects any other value, so it
really makes sense to define an enum for them and make it clear that no
other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
mm/compaction.c | 27 ++++++++++++++-------------
mm/page_alloc.c | 2 +-
3 files changed, 42 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d7c8de583a23..4458fd94170f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
#define _LINUX_COMPACTION_H

/* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED 0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE 4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE 5
-#define COMPACT_NOT_SUITABLE_ZONE 6
-#define COMPACT_CONTENDED 7
/* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /*
+ * compaction didn't start as it was not possible or direct reclaim
+ * was more suitable
+ */
+ COMPACT_SKIPPED,
+ /* compaction should continue to another pageblock */
+ COMPACT_CONTINUE,
+ /*
+ * direct compaction partially compacted a zone and there are suitable
+ * pages
+ */
+ COMPACT_PARTIAL,
+ /* The full zone was compacted */
+ COMPACT_COMPLETE,
+ /* For more detailed tracepoint output */
+ COMPACT_NO_SUITABLE_PAGE,
+ COMPACT_NOT_SUITABLE_ZONE,
+ COMPACT_CONTENDED,
+};

/* Used to signal whether compaction detected need_sched() or lock contention */
/* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int sysctl_compact_unevictable_allowed;

extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+ unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);

extern void defer_compaction(struct zone *zone, int order);
@@ -57,7 +66,7 @@ extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);

#else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
@@ -73,7 +82,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
{
}

-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8cc495042303..8ae7b1c46c72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1281,7 +1281,7 @@ static inline bool is_via_compact_memory(int order)
return order == -1;
}

-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
@@ -1344,8 +1344,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_NO_SUITABLE_PAGE;
}

-static int compact_finished(struct zone *zone, struct compact_control *cc,
- const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+ struct compact_control *cc,
+ const int migratetype)
{
int ret;

@@ -1364,7 +1365,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
* COMPACT_PARTIAL - If the allocation would succeed without compaction
* COMPACT_CONTINUE - If compaction should run now
*/
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
int fragindex;
@@ -1409,10 +1410,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
return COMPACT_CONTINUE;
}

-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;

ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
trace_mm_compaction_suitable(zone, order, ret);
@@ -1422,9 +1423,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
return ret;
}

-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
- int ret;
+ enum compact_result ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1588,11 +1589,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}

-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum migrate_mode mode, int *contended,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -1631,7 +1632,7 @@ int sysctl_extfrag_threshold = 500;
*
* This is the main entry point for direct page compaction.
*/
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
{
@@ -1639,7 +1640,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_DEFERRED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */

*contended = COMPACT_CONTENDED_NONE;
@@ -1653,7 +1654,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
- int status;
+ enum compact_result status;
int zone_contended;

if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f18092572f38..42e935c83de1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2947,7 +2947,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum migrate_mode mode, int *contended_compaction,
bool *deferred_compaction)
{
- unsigned long compact_result;
+ enum compact_result compact_result;
struct page *page;

if (!order)
--
2.8.0.rc3
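
One practical benefit of the enum, shown as a small stand-alone sketch
rather than anything from the patch itself: with -Wswitch the compiler
can point out a switch statement that fails to handle one of the states,
which an int/unsigned long return value never allowed. The enum is
abbreviated here and the needs_retry() policy is made up purely to have
something to switch over.

/* build with: gcc -Wall -Wswitch -c compact_enum_demo.c */
enum compact_result {
	COMPACT_DEFERRED,
	COMPACT_SKIPPED,
	COMPACT_CONTINUE,
	COMPACT_PARTIAL,
	COMPACT_COMPLETE,
};

int needs_retry(enum compact_result result)
{
	/*
	 * If a new state is later added to the enum and not handled here,
	 * -Wswitch warns about the missing case. With the old #define plus
	 * unsigned long scheme the compiler had nothing to check against.
	 */
	switch (result) {
	case COMPACT_DEFERRED:
	case COMPACT_SKIPPED:
	case COMPACT_CONTINUE:
		return 1;
	case COMPACT_PARTIAL:
	case COMPACT_COMPLETE:
		return 0;
	}
	return 0;
}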

2016-04-05 11:29:05

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 02/11] mm: throttle on IO only when there are too many dirty and writeback pages

From: Michal Hocko <[email protected]>

wait_iff_congested has been used to throttle the allocator before it
retried another round of direct reclaim, to allow the writeback to make
some progress and prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback while the bdi is not congested,
so this ends up being just a cond_resched and can trigger a premature
OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce the stalls
fixed by 0e093d99763e because congestion_wait is called only when we
are getting hopeless and sleeping is a better choice than OOMing with
many pages under IO.

We have to preserve the logic introduced by 373ccbe59270 ("mm, vmstat:
allow WQ concurrency to discover memory reclaim doesn't make any
progress") in __alloc_pages_slowpath now that wait_iff_congested is not
used there anymore. As the only remaining user of wait_iff_congested is
shrink_inactive_list, we can remove the WQ specific short sleep from
wait_iff_congested because the sleep only needs to be done once per
allocation retry cycle.

Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/backing-dev.c | 20 +++-----------------
mm/page_alloc.c | 39 ++++++++++++++++++++++++++++++++++++---
2 files changed, 39 insertions(+), 20 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index bfbd7096b6ed..08e3a58628ed 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
@@ -979,20 +978,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
*/
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
-
- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that a particular
- * WQ is congested if the worker thread is looping without
- * ever sleeping. Therefore we have to do a short sleep
- * here rather than calling cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
+ cond_resched();
/* In case we scheduled, work out time remaining */
ret = timeout - (jiffies - start);
if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 16100acc40ba..f18092572f38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3186,8 +3186,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
unsigned long available;
+ unsigned long reclaimable;

- available = zone_reclaimable_pages(zone);
+ available = reclaimable = zone_reclaimable_pages(zone);
available -= DIV_ROUND_UP(no_progress_loops * available,
MAX_RECLAIM_RETRIES);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -3198,8 +3199,40 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
*/
if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
ac->high_zoneidx, alloc_flags, available)) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ /*
+ * If we didn't make any progress and have a lot of
+ * dirty + writeback pages then we should wait for
+ * an IO to complete to slow down the reclaim and
+ * prevent from pre mature OOM
+ */
+ if (!did_some_progress) {
+ unsigned long writeback;
+ unsigned long dirty;
+
+ writeback = zone_page_state_snapshot(zone,
+ NR_WRITEBACK);
+ dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+ if (2*(writeback + dirty) > reclaimable) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ return true;
+ }
+ }
+
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+
return true;
}
}
--
2.8.0.rc3
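
For reference, the new throttling decision boils down to a simple
predicate. A stand-alone sketch of it follows; the helper name and the
page counts passed in are illustrative, not real zone state or code from
the patch.

#include <stdbool.h>
#include <stdio.h>

/*
 * Only sleep on IO (congestion_wait) when the last reclaim round made no
 * progress and more than half of the reclaimable pages on the zone are
 * dirty or under writeback.
 */
static bool should_throttle_on_io(bool did_some_progress,
				  unsigned long reclaimable,
				  unsigned long dirty,
				  unsigned long writeback)
{
	if (did_some_progress)
		return false;
	return 2 * (writeback + dirty) > reclaimable;
}

int main(void)
{
	/* small file LRU with everything dirty: sleep and let the IO finish */
	printf("%d\n", should_throttle_on_io(false, 1000, 600, 300));	/* 1 */
	/* plenty of clean reclaimable pages: keep reclaiming instead */
	printf("%d\n", should_throttle_on_io(false, 1000, 100, 100));	/* 0 */
	return 0;
}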

2016-04-05 12:46:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 11/11] mm: consider compaction feedback also for costly allocation

Attached you can find both the tool to fragment memory and the script it
calls in my testing.

I was executing this as follows (on a 2G machine with 2G of swap space):
echo 1 > /proc/sys/vm/overcommit_memory
echo 1 > /proc/sys/vm/compact_memory
/root/fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
--
Michal Hocko
SUSE Labs


Attachments:
fragment-mem-and-run.c (2.59 kB)
alloc_hugepages.sh (536.00 B)

2016-04-05 12:47:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/11] oom detection rework v5

One side note. I have promised to provide some tracepoints which would
help us see how the new code behaves. I have some basics but still
have to think more about that so I will send some more patches later on.
--
Michal Hocko
SUSE Labs

2016-04-05 23:58:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Tue, 5 Apr 2016 13:25:31 +0200 Michal Hocko <[email protected]> wrote:

> From: Michal Hocko <[email protected]>
>
> Compaction can provide a wild variation of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> - compaction_made_progress - compaction was active and made some
> progress.
> - compaction_failed - compaction failed and further attempts to
> invoke it would most probably fail and therefore it is not
> worth retrying
> - compaction_withdrawn - compaction wasn't invoked for
> implementation specific reasons. In the current implementation
> it means that the compaction was deferred, contended or the
> page scanners met too early without any progress. Retrying is
> still worthwhile.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3362,25 +3362,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (page)
> goto got_pg;
>
> - /* Checks for THP-specific high-order allocations */
> - if (is_thp_gfp_mask(gfp_mask)) {
> - /*
> - * If compaction is deferred for high-order allocations, it is
> - * because sync compaction recently failed. If this is the case
> - * and the caller requested a THP allocation, we do not want
> - * to heavily disrupt the system, so we fail the allocation
> - * instead of entering direct reclaim.
> - */
> - if (compact_result == COMPACT_DEFERRED)
> - goto nopage;
> -
> - /*
> - * Compaction is contended so rather back off than cause
> - * excessive stalls.
> - */
> - if(compact_result == COMPACT_CONTENDED)
> - goto nopage;
> - }
> + /*
> + * Checks for THP-specific high-order allocations and back off
> + * if the the compaction backed off
> + */
> + if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> + goto nopage;

This change smashed into Hugh's "huge tmpfs: shmem_huge_gfpmask and
shmem_recovery_gfpmask".

I ended up doing this:

/* Checks for THP-specific high-order allocations */
if (!is_thp_allocation(gfp_mask, order))
migration_mode = MIGRATE_SYNC_LIGHT;

/*
* Checks for THP-specific high-order allocations and back off
* if the the compaction backed off
*/
if (is_thp_allocation(gfp_mask) && compaction_withdrawn(compact_result))
goto nopage;



2016-04-06 00:06:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 10/11] mm, oom: protect !costly allocations some more

On Tue, 5 Apr 2016 13:25:32 +0200 Michal Hocko <[email protected]> wrote:

> From: Michal Hocko <[email protected]>
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> - the last compaction round backed off or
> - we haven't completed at least MAX_COMPACT_RETRIES active
> compaction rounds.
>
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> piggybacking of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
>
> compaction_failed() is taken as a final word from the compaction that
> the retry doesn't make much sense. We have to be careful though because
> the first compaction round is MIGRATE_ASYNC which is rather weak as it
> ignores pages under writeback and gives up too easily in other
> situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> has been used before we give up. With this logic in place we do not have
> to increase the migration mode unconditionally and rather do it only if
> the compaction failed for the weaker mode. A nice side effect is that
> the stronger migration mode is used only when really needed so this has
> a potential of smaller latencies in some cases.
>
> Please note that the compaction doesn't tell us much about how
> successful it was when returning compaction_made_progress so we just
> have to blindly trust that another retry is worthwhile and cap the
> number to something reasonable to guarantee a convergence.
>
> If the given number of successful retries is not sufficient for a
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
>
> @@ -3369,14 +3425,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> goto nopage;
>
> - /*
> - * It can become very expensive to allocate transparent hugepages at
> - * fault, so use asynchronous memory compaction for THP unless it is
> - * khugepaged trying to collapse.
> - */
> - if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> - migration_mode = MIGRATE_SYNC_LIGHT;
> -
> /* Try direct reclaim and then allocating */
> page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> &did_some_progress);

Hugh's patches moved this elsewhere. I'll drop this hunk altogether -
please carefully review the result.
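
To make the quoted retry rules easier to follow, here is a rough,
stand-alone sketch of the logic the changelog describes. It is a
paraphrase, not the should_compact_retry() from the patch: the enum is
abbreviated, the two helpers are reduced stand-ins for the patch 09
predicates, and the retry cap value is illustrative.

#include <stdbool.h>

enum compact_result {
	COMPACT_SKIPPED, COMPACT_DEFERRED, COMPACT_CONTINUE,
	COMPACT_PARTIAL_SKIPPED, COMPACT_COMPLETE, COMPACT_PARTIAL,
};
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };

#define MAX_COMPACT_RETRIES 16	/* illustrative cap */

/* reduced stand-ins for the helpers introduced in patch 09 */
static bool compaction_failed(enum compact_result r)
{
	return r == COMPACT_COMPLETE;
}

static bool compaction_withdrawn(enum compact_result r)
{
	return r == COMPACT_SKIPPED || r == COMPACT_DEFERRED ||
	       r == COMPACT_PARTIAL_SKIPPED;
}

/*
 * Rough shape of the rules above: do not declare OOM for a !costly order
 * while compaction is still backing off or has not done enough active
 * rounds, and take compaction_failed() as the final word only once the
 * stronger MIGRATE_SYNC_LIGHT mode has been tried.
 */
static bool keep_retrying(enum compact_result result,
			  enum migrate_mode *mode, int compaction_retries)
{
	if (compaction_failed(result)) {
		if (*mode == MIGRATE_ASYNC) {
			/* async gives up too easily; escalate and retry */
			*mode = MIGRATE_SYNC_LIGHT;
			return true;
		}
		return false;
	}

	if (compaction_withdrawn(result))
		return true;	/* the last round backed off */

	return compaction_retries < MAX_COMPACT_RETRIES;
}

int main(void)
{
	enum migrate_mode mode = MIGRATE_ASYNC;
	int retries = 0;

	/* a failed async round escalates to sync-light instead of OOMing */
	while (keep_retrying(COMPACT_COMPLETE, &mode, retries))
		retries++;
	return mode == MIGRATE_SYNC_LIGHT ? 0 : 1;
}

The real patch of course wires this into __alloc_pages_slowpath and
covers more states; the sketch only captures the rules spelled out in
the changelog.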

2016-04-06 00:55:53

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Tue, 5 Apr 2016, Andrew Morton wrote:
> On Tue, 5 Apr 2016 13:25:31 +0200 Michal Hocko <[email protected]> wrote:
> > - if (is_thp_gfp_mask(gfp_mask)) {
> > - /*
> > - * If compaction is deferred for high-order allocations, it is
> > - * because sync compaction recently failed. If this is the case
> > - * and the caller requested a THP allocation, we do not want
> > - * to heavily disrupt the system, so we fail the allocation
> > - * instead of entering direct reclaim.
> > - */
> > - if (compact_result == COMPACT_DEFERRED)
> > - goto nopage;
> > -
> > - /*
> > - * Compaction is contended so rather back off than cause
> > - * excessive stalls.
> > - */
> > - if(compact_result == COMPACT_CONTENDED)
> > - goto nopage;
> > - }
> > + /*
> > + * Checks for THP-specific high-order allocations and back off
> > + * if the the compaction backed off
> > + */
> > + if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> > + goto nopage;
>
> This change smashed into Hugh's "huge tmpfs: shmem_huge_gfpmask and
> shmem_recovery_gfpmask".
>
> I ended up doing this:
>
> /* Checks for THP-specific high-order allocations */
> if (!is_thp_allocation(gfp_mask, order))
> migration_mode = MIGRATE_SYNC_LIGHT;
>
> /*
> * Checks for THP-specific high-order allocations and back off
> * if the the compaction backed off
> */
> if (is_thp_allocation(gfp_mask) && compaction_withdrawn(compact_result))
> goto nopage;

You'll already have found that is_thp_allocation() needs the order too.
But then you had to drop a hunk out of his 10/11 also to fit with mine.

What you've done may be just right, but I haven't had time to digest
Michal's changes yet (and not yet seen what happens to the PF_KTHREAD
distinction), so I think it will probably end up better if you take
his exactly as he tested and posted them, and drop my 30/31 and 31/31
for now - I can resubmit them (or maybe drop 30 altogether) after I've
pondered and tested a little on top of Michal's.

Huge tmpfs got along fine for many months without 30/31 and 31/31: 30
is just for experimentation, and 31 to reduce the compaction stalls we
saw under some loads. Maybe I'll find that Michal's rework has changed
the balance there anyway, and something else or nothing at all needed.

(The gfp_mask stuff was very confusing, and it's painful for me, how
~__GFP_KSWAPD_RECLAIM gets used as a secret password to say "THP" and
how to angle compaction - or maybe it's all more straightforward now.)

Many thanks for giving us both this quick exposure!

Hugh

2016-04-06 09:26:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Tue 05-04-16 17:55:39, Hugh Dickins wrote:
> On Tue, 5 Apr 2016, Andrew Morton wrote:
> > On Tue, 5 Apr 2016 13:25:31 +0200 Michal Hocko <[email protected]> wrote:
> > > - if (is_thp_gfp_mask(gfp_mask)) {
> > > - /*
> > > - * If compaction is deferred for high-order allocations, it is
> > > - * because sync compaction recently failed. If this is the case
> > > - * and the caller requested a THP allocation, we do not want
> > > - * to heavily disrupt the system, so we fail the allocation
> > > - * instead of entering direct reclaim.
> > > - */
> > > - if (compact_result == COMPACT_DEFERRED)
> > > - goto nopage;
> > > -
> > > - /*
> > > - * Compaction is contended so rather back off than cause
> > > - * excessive stalls.
> > > - */
> > > - if(compact_result == COMPACT_CONTENDED)
> > > - goto nopage;
> > > - }
> > > + /*
> > > + * Checks for THP-specific high-order allocations and back off
> > > + * if the the compaction backed off
> > > + */
> > > + if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> > > + goto nopage;
> >
> > This change smashed into Hugh's "huge tmpfs: shmem_huge_gfpmask and
> > shmem_recovery_gfpmask".
> >
> > I ended up doing this:
> >
> > /* Checks for THP-specific high-order allocations */
> > if (!is_thp_allocation(gfp_mask, order))
> > migration_mode = MIGRATE_SYNC_LIGHT;
> >
> > /*
> > * Checks for THP-specific high-order allocations and back off
> > * if the the compaction backed off
> > */
> > if (is_thp_allocation(gfp_mask) && compaction_withdrawn(compact_result))
> > goto nopage;
>
> You'll already have found that is_thp_allocation() needs the order too.
> But then you had to drop a hunk out of his 10/11 also to fit with mine.
>
> What you've done may be just right, but I haven't had time to digest
> Michal's changes yet (and not yet seen what happens to the PF_KTHREAD
> distinction), so I think it will probably end up better if you take
> his exactly as he tested and posted them, and drop my 30/31 and 31/31
> for now

I have only briefly checked your patch30 but I guess the above is
not really necessary. If the request is __GFP_REPEAT (I haven't checked
whether that is the case for shmem) then we promote to MIGRATE_SYNC_LIGHT
as soon as we cannot move on with ASYNC. For !__GFP_REPEAT I did
+ if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ else
+ migration_mode = MIGRATE_SYNC_LIGHT;
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&compact_result);

so you will end up doing SYNC_LIGHT for !is_thp_allocation as well

> - I can resubmit them (or maybe drop 30 altogether) after I've
> pondered and tested a little on top of Michal's.

I guess this would be safer. If it turns out that we need some special
handling I would prefer if that could be done in should_compact_retry.

> Huge tmpfs got along fine for many months without 30/31 and 31/31: 30
> is just for experimentation, and 31 to reduce the compaction stalls we
> saw under some loads. Maybe I'll find that Michal's rework has changed
> the balance there anyway, and something else or nothing at all needed.
>
> (The gfp_mask stuff was very confusing, and it's painful for me, how
> ~__GFP_KSWAPD_RECLAIM gets used as a secret password to say "THP" and
> how to angle compaction - or maybe it's all more straightforward now.)
>
> Many thanks for giving us both this quick exposure!

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-06 09:28:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 10/11] mm, oom: protect !costly allocations some more

On Tue 05-04-16 17:06:12, Andrew Morton wrote:
> On Tue, 5 Apr 2016 13:25:32 +0200 Michal Hocko <[email protected]> wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermark check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> >
> > This can, however, lead to situations where the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and there is no guarantee
> > further reclaim/compaction attempts would help but at least make sure
> > that the compaction was active before we go OOM and keep retrying even
> > if should_reclaim_retry tells us to oom if
> > - the last compaction round backed off or
> > - we haven't completed at least MAX_COMPACT_RETRIES active
> > compaction rounds.
> >
> > The first rule ensures that the very last attempt for compaction
> > was not ignored while the second guarantees that the compaction has done
> > some work. Multiple retries might be needed to prevent occasional
> > piggybacking of other contexts to steal the compacted pages before
> > the current context manages to retry to allocate them.
> >
> > compaction_failed() is taken as a final word from the compaction that
> > the retry doesn't make much sense. We have to be careful though because
> > the first compaction round is MIGRATE_ASYNC which is rather weak as it
> > ignores pages under writeback and gives up too easily in other
> > situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> > has been used before we give up. With this logic in place we do not have
> > to increase the migration mode unconditionally and rather do it only if
> > the compaction failed for the weaker mode. A nice side effect is that
> > the stronger migration mode is used only when really needed so this has
> > a potential of smaller latencies in some cases.
> >
> > Please note that the compaction doesn't tell us much about how
> > successful it was when returning compaction_made_progress so we just
> > have to blindly trust that another retry is worthwhile and cap the
> > number to something reasonable to guarantee a convergence.
> >
> > If the given number of successful retries is not sufficient for a
> > reasonable workloads we should focus on the collected compaction
> > tracepoints data and try to address the issue in the compaction code.
> > If this is not feasible we can increase the retries limit.
> >
> > @@ -3369,14 +3425,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> > goto nopage;
> >
> > - /*
> > - * It can become very expensive to allocate transparent hugepages at
> > - * fault, so use asynchronous memory compaction for THP unless it is
> > - * khugepaged trying to collapse.
> > - */
> > - if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> > - migration_mode = MIGRATE_SYNC_LIGHT;
> > -
> > /* Try direct reclaim and then allocating */
> > page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> > &did_some_progress);
>
> Hugh's patches moved this elsewhere. I'll drop this hunk altogether -
> please carefully review the result.

I have checked mm-oom-protect-costly-allocations-some-more.patch and it
kept the hunk, which is the correct way to go, because migration_mode
should be updated only in should_compact_retry or before the last
__alloc_pages_direct_compact attempt before we fail (for
!__GFP_REPEAT resp. __GFP_NORETRY).

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-06 17:46:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Tue, 5 Apr 2016 17:55:39 -0700 (PDT) Hugh Dickins <[email protected]> wrote:

> > I ended up doing this:
> >
> > /* Checks for THP-specific high-order allocations */
> > if (!is_thp_allocation(gfp_mask, order))
> > migration_mode = MIGRATE_SYNC_LIGHT;
> >
> > /*
> > * Checks for THP-specific high-order allocations and back off
> > * if the the compaction backed off
> > */
> > if (is_thp_allocation(gfp_mask) && compaction_withdrawn(compact_result))
> > goto nopage;
>
> You'll already have found that is_thp_allocation() needs the order too.
> But then you had to drop a hunk out of his 10/11 also to fit with mine.
>
> What you've done may be just right, but I haven't had time to digest
> Michal's changes yet (and not yet seen what happens to the PF_KTHREAD
> distinction), so I think it will probably end up better if you take
> his exactly as he tested and posted them, and drop my 30/31 and 31/31
> for now - I can resubmit them (or maybe drop 30 altogether) after I've
> pondered and tested a little on top of Michal's.
>
> Huge tmpfs got along fine for many months without 30/31 and 31/31: 30
> is just for experimentation, and 31 to reduce the compaction stalls we
> saw under some loads. Maybe I'll find that Michal's rework has changed
> the balance there anyway, and something else or nothing at all needed.
>
> (The gfp_mask stuff was very confusing, and it's painful for me, how
> ~__GFP_KSWAPD_RECLAIM gets used as a secret password to say "THP" and
> how to angle compaction - or maybe it's all more straightforward now.)

OK, thanks. I dropped
huge-tmpfs-shmem_huge_gfpmask-and-shmem_recovery_gfpmask.patch and
huge-tmpfs-no-kswapd-by-default-on-sync-allocations.patch and restored
Michal's patches.

> Many thanks for giving us both this quick exposure!

I'll push all this into -next later today.

2016-04-11 11:02:26

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 05/11] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> try_to_compact_pages can currently return COMPACT_SKIPPED even when the
> compaction is deferred for some zone just because zone DMA is skipped
> in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
> basically unusable for the page allocator as a feedback mechanism.
>
> Make sure we distinguish those two states properly and switch their
> ordering in the enum. This would mean that the COMPACT_SKIPPED will be
> returned only when all eligible zones are skipped.
>
> This shouldn't introduce any functional change.

Hmm, really? __alloc_pages_direct_compact() does distinguish
COMPACT_DEFERRED, and sets *deferred_compaction, so ultimately this is
some change for THP allocations?

Also there's no mention of COMPACT_INACTIVE in the changelog (which
indeed isn't a functional change, but might surprise somebody).

> Signed-off-by: Michal Hocko <[email protected]>

Patch itself is OK.

Acked-by: Vlastimil Babka <[email protected]>

2016-04-11 11:24:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 05/11] mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED

On Mon 11-04-16 13:02:20, Vlastimil Babka wrote:
> On 04/05/2016 01:25 PM, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >try_to_compact_pages can currently return COMPACT_SKIPPED even when the
> >compaction is deferred for some zone just because zone DMA is skipped
> >in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
> >basically unusable for the page allocator as a feedback mechanism.
> >
> >Make sure we distinguish those two states properly and switch their
> >ordering in the enum. This would mean that the COMPACT_SKIPPED will be
> >returned only when all eligible zones are skipped.
> >
> >This shouldn't introduce any functional change.
>
> Hmm, really? __alloc_pages_direct_compact() does distinguish
> COMPACT_DEFERRED, and sets *deferred compaction, so ultimately this is some
> change for THP allocations?

Hmm, you are right. In cases where we would return COMPACT_SKIPPED
even though there is a zone which would really like to tell us
COMPACT_DEFERRED, we would previously have done __alloc_pages_direct_reclaim
and then bailed out for THP allocations which do not have __GFP_REPEAT,
while now we recognize DEFERRED and bail out without the direct reclaim.
So there is a functional change. Andrew, could you drop the sentence about
no functional change and replace it with the following?

"
As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.
"

> Also there's no mention of COMPACT_INACTIVE in the changelog (which indeed
> isn't functional change, but might surprise somebody).
>
> >Signed-off-by: Michal Hocko <[email protected]>
>
> Patch itself is OK.
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks!

--
Michal Hocko
SUSE Labs

2016-04-11 12:10:32

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> COMPACT_COMPLETE now means that compaction and free scanner met. This is
> not very useful information if somebody just wants to use this feedback
> and make any decisions based on that. The current caller might be a poor
> guy who just happened to scan tiny portion of the zone and that could be
> the reason no suitable pages were compacted. Make sure we distinguish
> the full and partial zone walks.
>
> Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
> and be optimistic in retrying.
>
> The existing users of COMPACT_COMPLETE are conservatively changed to
> use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
> reconsidered and only defer the compaction only for COMPACT_COMPLETE
> with the new semantic.
>
> This patch shouldn't introduce any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

With some notes:

> @@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
> zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
> zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> }
> +
> + if (cc->migrate_pfn == start_pfn)
> + cc->whole_zone = true;
> +

This assumes that migrate scanner at initial position implies also free
scanner at the initial position. That should be true, because migration
scanner is the first to run. But getting the zone->compact_cached_*_pfn
is racy. Worse, zone->compact_cached_migrate_pfn is an array distinguishing
sync and async compaction, so it's possible that async compaction has
advanced both its own migrate scanner cached position, and the shared
free scanner cached position, and then sync compaction starts migrate
scanner at start_pfn, but free scanner has already advanced.
So you might still see a false positive COMPACT_COMPLETE, just less
frequently and probably with much lower impact.
But if you need to be truly reliable, check also that cc->free_pfn ==
round_down(end_pfn - 1, pageblock_nr_pages)
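
Spelling out the stricter check suggested here as a stand-alone sketch
(the pageblock size and the helper name are illustrative; the condition
itself is the one proposed above):

#include <stdbool.h>

#define pageblock_nr_pages	(1UL << 9)	/* illustrative: order-9 pageblocks */
#define round_down(x, y)	((x) & ~((y) - 1))

/*
 * Treat the attempt as a whole-zone walk only if the migrate scanner
 * started at the zone start *and* the cached free scanner position is
 * still in the last pageblock of the zone.
 */
bool saw_whole_zone(unsigned long migrate_pfn, unsigned long free_pfn,
		    unsigned long start_pfn, unsigned long end_pfn)
{
	return migrate_pfn == start_pfn &&
	       free_pfn == round_down(end_pfn - 1, pageblock_nr_pages);
}

As the rest of the thread concludes, the weaker migrate_pfn-only test was
considered good enough because the window a racing free scanner can hide
is small.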

2016-04-11 12:17:00

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 07/11] mm, compaction: Update compaction_result ordering

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> compaction_result will be used as the primary feedback channel for
> compaction users. At the same time try_to_compact_pages (and potentially
> others) assume a certain ordering where a more specific feedback takes
> precendence. This gets a bit awkward when we have conflicting feedback
> from different zones. E.g one returing COMPACT_COMPLETE meaning the full
> zone has been scanned without any outcome while other returns with
> COMPACT_PARTIAL aka made some progress. The caller should get
> COMPACT_PARTIAL because that means that the compaction still can make
> some progress. The same applies for COMPACT_PARTIAL vs.
> COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
> larger the value is the more progress we have done.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2016-04-11 12:47:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On Mon 11-04-16 14:10:26, Vlastimil Babka wrote:
> On 04/05/2016 01:25 PM, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >COMPACT_COMPLETE now means that compaction and free scanner met. This is
> >not very useful information if somebody just wants to use this feedback
> >and make any decisions based on that. The current caller might be a poor
> >guy who just happened to scan tiny portion of the zone and that could be
> >the reason no suitable pages were compacted. Make sure we distinguish
> >the full and partial zone walks.
> >
> >Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
> >and be optimistic in retrying.
> >
> >The existing users of COMPACT_COMPLETE are conservatively changed to
> >use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
> >reconsidered and only defer the compaction only for COMPACT_COMPLETE
> >with the new semantic.
> >
> >This patch shouldn't introduce any functional changes.
> >
> >Signed-off-by: Michal Hocko <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks!

> With some notes:
>
> >@@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
> > zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
> > zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> > }
> >+
> >+ if (cc->migrate_pfn == start_pfn)
> >+ cc->whole_zone = true;
> >+
>
> This assumes that migrate scanner at initial position implies also free
> scanner at the initial position. That should be true, because migration
> scanner is the first to run. But getting the zone->compact_cached_*_pfn is
> racy. Worse, zone->compact_cached_migrate_pfn is array distinguishing sync
> and async compaction, so it's possible that async compaction has advanced
> both its own migrate scanner cached position, and the shared free scanner
> cached position, and then sync compaction starts migrate scanner at
> start_pfn, but free scanner has already advanced.

OK, I see. The whole thing smelled racy but I thought it wouldn't be
such a big deal. Even if we raced then only a marginal part of the zone
wouldn't be scanned, right? Or is it possible that free_pfn would appear
in the middle of the zone because of the race?

> So you might still see a false positive COMPACT_COMPLETE, just less
> frequently and probably with much lower impact.
> But if you need to be truly reliable, check also that cc->free_pfn ==
> round_down(end_pfn - 1, pageblock_nr_pages)

I do not think we need the precise check if the race window (in the
skipped zone range) is always small.

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-11 12:53:43

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On 04/11/2016 02:46 PM, Michal Hocko wrote:
>> This assumes that migrate scanner at initial position implies also free
>> scanner at the initial position. That should be true, because migration
>> scanner is the first to run. But getting the zone->compact_cached_*_pfn is
>> racy. Worse, zone->compact_cached_migrate_pfn is array distinguishing sync
>> and async compaction, so it's possible that async compaction has advanced
>> both its own migrate scanner cached position, and the shared free scanner
>> cached position, and then sync compaction starts migrate scanner at
>> start_pfn, but free scanner has already advanced.
>
> OK, I see. The whole thing smelled racy but I thought it wouldn't be
> such a big deal. Even if we raced then only a marginal part of the zone
> wouldn't be scanned, right? Or is it possible that free_pfn would appear
> in the middle of the zone because of the race?

The racy part is negligible but I didn't realize the sync/async migrate
scanner part until now. So yeah, free_pfn could have got to the middle of
the zone when it was in the async mode. But that also means that the async
mode recently used up all free pages in the second half of the zone. WRT
free page isolation, async mode is not trying less than sync, so it
shouldn't be a considerable missed opportunity if we don't rescan it,
though.

>> So you might still see a false positive COMPACT_COMPLETE, just less
>> frequently and probably with much lower impact.
>> But if you need to be truly reliable, check also that cc->free_pfn ==
>> round_down(end_pfn - 1, pageblock_nr_pages)
>
> I do not think we need the precise check if the race window (in the
> skipped zone range) is always small.
>
> Thanks!
>

2016-04-11 13:28:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On Mon 11-04-16 14:53:36, Vlastimil Babka wrote:
> On 04/11/2016 02:46 PM, Michal Hocko wrote:
> >>This assumes that migrate scanner at initial position implies also free
> >>scanner at the initial position. That should be true, because migration
> >>scanner is the first to run. But getting the zone->compact_cached_*_pfn is
> >>racy. Worse, zone->compact_cached_migrate_pfn is array distinguishing sync
> >>and async compaction, so it's possible that async compaction has advanced
> >>both its own migrate scanner cached position, and the shared free scanner
> >>cached position, and then sync compaction starts migrate scanner at
> >>start_pfn, but free scanner has already advanced.
> >
> >OK, I see. The whole thing smelled racy but I thought it wouldn't be
> >such a big deal. Even if we raced then only a marginal part of the zone
> >wouldn't be scanned, right? Or is it possible that free_pfn would appear
> >in the middle of the zone because of the race?
>
> The racy part is negligible but I didn't realize the sync/async migrate
> scanner part until now. So yeah, free_pfn could have got to middle of zone
> when it was in the async mode. But that also means that the async mode
> recently used up all free pages in the second half of the zone. WRT free
> pages isolation, async mode is not trying less than sync, so it shouldn't be
> a considerable missed opportunity if we don't rescan it, though.

I am not really sure I understand. The primary intention of this patch
is to distinguish cases where we have scanned basically the whole zone
from cases where a new scan started off a previous mark and so was just
unlucky to see only a tiny bit of the zone, where we would clearly give
up too early. FWIU this shouldn't be the case if we start scanning from
the beginning of the zone even if we raced on the other end of the zone,
because the missed part would be negligible. Is that understanding
correct?
--
Michal Hocko
SUSE Labs

2016-04-11 13:42:31

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On 04/11/2016 03:27 PM, Michal Hocko wrote:
> On Mon 11-04-16 14:53:36, Vlastimil Babka wrote:
>> On 04/11/2016 02:46 PM, Michal Hocko wrote:
>>
>> The racy part is negligible but I didn't realize the sync/async migrate
>> scanner part until now. So yeah, free_pfn could have got to middle of zone
>> when it was in the async mode. But that also means that the async mode
>> recently used up all free pages in the second half of the zone. WRT free
>> pages isolation, async mode is not trying less than sync, so it shouldn't be
>> a considerable missed opportunity if we don't rescan it, though.
>
> I am not really sure I understand. The primary intention of this patch
> is to distinguish where we have scanned basically whole zones from cases
> where a new scan started off previous mark and so it was just unlucky to
> see only tiny bit of the zone where we would clearly give up too early.
> FWIU this shouldn't be the case if we start scanning from the beginning
> of the zone even if we raced on the other end of the zone because the
> missed part would be negligible. Is that understanding correct?

Yes, it should be less unlucky than seeing a tiny bit of the zone. Just
wanted to point out that you might still not see the whole zone in one
compaction attempt. E.g. async compaction is first, advances the free
scanner and caches its position when it bails out due to being
contended. Then direct reclaim frees some pages behind the cached
position. Sync compaction attempts starts migration scanner from
start_pfn, but picks up the cached free scanner pfn. The result is
missing some free pages and the scanners meeting somewhat earlier than
they otherwise would. Probably not critical even for OOM decisions, as
that's also racy anyway.

2016-04-11 13:46:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/11] mm, compaction: distinguish between full and partial COMPACT_COMPLETE

On Mon 11-04-16 15:42:21, Vlastimil Babka wrote:
> On 04/11/2016 03:27 PM, Michal Hocko wrote:
> >On Mon 11-04-16 14:53:36, Vlastimil Babka wrote:
> >>On 04/11/2016 02:46 PM, Michal Hocko wrote:
> >>
> >>The racy part is negligible but I didn't realize the sync/async migrate
> >>scanner part until now. So yeah, free_pfn could have got to middle of zone
> >>when it was in the async mode. But that also means that the async mode
> >>recently used up all free pages in the second half of the zone. WRT free
> >>pages isolation, async mode is not trying less than sync, so it shouldn't be
> >>a considerable missed opportunity if we don't rescan it, though.
> >
> >I am not really sure I understand. The primary intention of this patch
> >is to distinguish where we have scanned basically whole zones from cases
> >where a new scan started off previous mark and so it was just unlucky to
> >see only tiny bit of the zone where we would clearly give up too early.
> >FWIU this shouldn't be the case if we start scanning from the beginning
> >of the zone even if we raced on the other end of the zone because the
> >missed part would be negligible. Is that understanding correct?
>
> Yes, it should be less unlucky than seeing a tiny bit of the zone. Just
> wanted to point out that you might still not see the whole zone in one
> compaction attempt. E.g. async compaction is first, advances the free
> scanner and caches its position when it bails out due to being contended.
> Then direct reclaim frees some pages behind the cached position. Sync
> compaction attempts starts migration scanner from start_pfn, but picks up
> the cached free scanner pfn. The result is missing some free pages and the
> scanners meeting somewhat earlier than they otherwise would. Probably not
> critical even for OOM decisions, as that's also racy anyway.

OK, I see now. I agree this shouldn't be critical and thanks for the
clarification.
--
Michal Hocko
SUSE Labs

2016-04-11 13:49:22

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 08/11] mm, compaction: Simplify __alloc_pages_direct_compact feedback interface

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> __alloc_pages_direct_compact communicates potential back off by two
> variables:
> - deferred_compaction tells that the compaction returned
> COMPACT_DEFERRED
> - contended_compaction is set when there is a contention on
> zone->lock resp. zone->lru_lock locks
>
> __alloc_pages_slowpath then backs off for THP allocation requests to
> prevent from long stalls. This is rather messy and it would be much
> cleaner to return a single compact result value and hide all the nasty
> details into __alloc_pages_direct_compact.

On the other hand, the nasty subtle details of THP allocation handling
are now split between __alloc_pages_direct_compact and
__alloc_pages_slowpath(). Lesser evil, I guess. I wish we could get rid
of these special cases, now that latency of THP direct allocations is
reduced by Mel's new defaults.

> This patch shouldn't introduce any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2016-04-11 14:39:33

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Compaction can provide a wild variation of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> - compaction_made_progress - compaction was active and made some
> progress.
> - compaction_failed - compaction failed and further attempts to
> invoke it would most probably fail and therefore it is not
> worth retrying
> - compaction_withdrawn - compaction wasn't invoked for
> implementation specific reasons. In the current implementation
> it means that the compaction was deferred, contended or the
> page scanners met too early without any progress. Retrying is
> still worthwhile.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/compaction.h | 74 ++++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 25 ++++------------
> 2 files changed, 80 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a7b9091ff349..512db9c3f0ed 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> bool alloc_success);
> extern bool compaction_restarting(struct zone *zone, int order);
>
> +/* Compaction has made some progress and retrying makes sense */
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + /*
> + * Even though this might sound confusing this in fact tells us
> + * that the compaction successfully isolated and migrated some
> + * pageblocks.
> + */
> + if (result == COMPACT_PARTIAL)
> + return true;
> +
> + return false;
> +}
> +
> +/* Compaction has failed and it doesn't make much sense to keep retrying. */
> +static inline bool compaction_failed(enum compact_result result)
> +{
> + /* All zones where scanned completely and still not result. */

Hmm, given that try_to_compact_pages() uses a max() on results, it in
fact takes only one zone to get this. Others could also have been
SKIPPED or DEFERRED. Is that what you want?

> + if (result == COMPACT_COMPLETE)
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Compaction has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> + /*
> + * Compaction backed off due to watermark checks for order-0
> + * so the regular reclaim has to try harder and reclaim something.
> + */
> + if (result == COMPACT_SKIPPED)
> + return true;
> +
> + /*
> + * If compaction is deferred for high-order allocations, it is
> + * because sync compaction recently failed. If this is the case
> + * and the caller requested a THP allocation, we do not want
> + * to heavily disrupt the system, so we fail the allocation
> + * instead of entering direct reclaim.
> + */
> + if (result == COMPACT_DEFERRED)
> + return true;
> +
> + /*
> + * If compaction in async mode encounters contention or blocks higher
> + * priority task we back off early rather than cause stalls.
> + */
> + if (result == COMPACT_CONTENDED)
> + return true;
> +
> + /*
> + * Page scanners have met but we haven't scanned full zones so this
> + * is a back off in fact.
> + */
> + if (result == COMPACT_PARTIAL_SKIPPED)
> + return true;
> +
> + return false;
> +}
> +

[...]

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c37e6d1ad643..c05de84c8157 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3362,25 +3362,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (page)
> goto got_pg;
>
> - /* Checks for THP-specific high-order allocations */
> - if (is_thp_gfp_mask(gfp_mask)) {
> - /*
> - * If compaction is deferred for high-order allocations, it is
> - * because sync compaction recently failed. If this is the case
> - * and the caller requested a THP allocation, we do not want
> - * to heavily disrupt the system, so we fail the allocation
> - * instead of entering direct reclaim.
> - */
> - if (compact_result == COMPACT_DEFERRED)
> - goto nopage;
> -
> - /*
> - * Compaction is contended so rather back off than cause
> - * excessive stalls.
> - */
> - if(compact_result == COMPACT_CONTENDED)
> - goto nopage;
> - }
> + /*
> + * Checks for THP-specific high-order allocations and back off
> + * if the compaction backed off
> + */
> + if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> + goto nopage;

The change of semantics for THP is not trivial here and should at least
be discussed in the changelog. CONTENDED and DEFERRED are only a subset of
compaction_withdrawn() as seen above. Why is it useful to back off due
to COMPACT_PARTIAL_SKIPPED (we were just unlucky in our starting
position), but not due to COMPACT_COMPLETE (we have seen the whole zone
but failed anyway)? Why back off due to COMPACT_SKIPPED (not enough
order-0 pages) without trying reclaim at least once, and then another
async compaction, like before?
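
For reference, the difference in coverage can be spelled out with the helper
semantics quoted above. The helper names below are invented for illustration
only; they are not part of the patch.

/* Old behaviour: only these two states skipped direct reclaim for THP. */
static inline bool thp_should_bail_old(enum compact_result result)
{
	return result == COMPACT_DEFERRED || result == COMPACT_CONTENDED;
}

/*
 * New behaviour: compaction_withdrawn() additionally covers
 * COMPACT_SKIPPED and COMPACT_PARTIAL_SKIPPED, which is the semantic
 * change being questioned here.
 */
static inline bool thp_should_bail_new(enum compact_result result)
{
	return compaction_withdrawn(result);
}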

>
> /*
> * It can become very expensive to allocate transparent hugepages at
>

2016-04-11 14:48:49

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 10/11] mm, oom: protect !costly allocations some more

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermark check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations where the high-order request (e.g.
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> that further reclaim/compaction attempts would help, but at least make
> sure that the compaction was active before we go OOM and keep retrying
> even if should_reclaim_retry tells us to OOM if
> - the last compaction round backed off or
> - we haven't completed at least MAX_COMPACT_RETRIES active
> compaction rounds.
>
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> piggy backing of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
>
> compaction_failed() is taken as a final word from the compaction that
> the retry doesn't make much sense. We have to be careful though because
> the first compaction round is MIGRATE_ASYNC which is rather weak as it
> ignores pages under writeback and gives up too easily in other
> situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> has been used before we give up. With this logic in place we do not have
> to increase the migration mode unconditionally and rather do it only if
> the compaction failed for the weaker mode. A nice side effect is that
> the stronger migration mode is used only when really needed so this has
> a potential of smaller latencies in some cases.
>
> Please note that the compaction doesn't tell us much about how
> successful it was when returning compaction_made_progress so we just
> have to blindly trust that another retry is worthwhile and cap the
> number to something reasonable to guarantee convergence.
>
> If the given number of successful retries is not sufficient for
> reasonable workloads, we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
>
> Signed-off-by: Michal Hocko <[email protected]>

Looks good.

Acked-by: Vlastimil Babka <[email protected]>
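
To help follow the changelog above, the two retry rules and the migration
mode handling can be condensed roughly as follows, using the helpers
introduced in patch 09. This is only a sketch: the retry cap value and the
exact body are assumptions for illustration, not a quote of the patch.

#define MAX_COMPACT_RETRIES 16	/* cap assumed for the sketch */

static inline bool
should_compact_retry(unsigned int order, enum compact_result compact_result,
		     enum migrate_mode *migrate_mode, int compaction_retries)
{
	if (!order)
		return false;

	/*
	 * Rule 1: the last compaction round backed off (deferred, contended,
	 * skipped, ...), so it was not a real attempt and retrying is still
	 * worthwhile.
	 */
	if (compaction_withdrawn(compact_result))
		return true;

	/*
	 * compaction_failed() is taken as the final word only once the
	 * stronger MIGRATE_SYNC_LIGHT mode has been tried; the first
	 * MIGRATE_ASYNC round is too weak to decide an OOM.
	 */
	if (compaction_failed(compact_result)) {
		if (*migrate_mode == MIGRATE_ASYNC) {
			*migrate_mode = MIGRATE_SYNC_LIGHT;
			return true;
		}
		return false;
	}

	/*
	 * Rule 2: compaction made some progress but the allocation still
	 * failed (e.g. another context grabbed the compacted pages), so
	 * retry a bounded number of times to guarantee convergence.
	 */
	return compaction_retries <= MAX_COMPACT_RETRIES;
}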

2016-04-11 15:07:22

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 11/11] mm: consider compaction feedback also for costly allocation

On 04/05/2016 01:25 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
> should_reclaim_retry currently where we decide to not retry after at
> least order worth of pages were reclaimed or the watermark check for at
> least one zone would succeed after reclaiming all pages if the reclaim
> hasn't made any progress. Compaction feedback is mostly ignored and we
> just try to make sure that the compaction did at least something before
> giving up.
>
> The first condition was added by a41f24ea9fd6 ("page allocator: smarter
> retry of costly-order allocations") and it assumed that lumpy reclaim
> could have created a page of sufficient order. Lumpy reclaim
> has been removed quite some time ago so the assumption doesn't hold
> anymore. Remove the check for the number of reclaimed pages and rely
> on the compaction feedback solely. should_reclaim_retry now only
> makes sure that we keep retrying reclaim for high order pages
> if they are hidden by watermarks so order-0 reclaim really makes sense.
>
> should_compact_retry now keeps retrying even for the costly allocations.
> The number of retries is reduced wrt. !costly requests because they are
> less important and harder to grant and so their pressure shouldn't cause
> contention for other requests or cause over-reclaim. We also do not
> reset no_progress_loops for costly requests to make sure we do not keep
> reclaiming too aggressively.

[...]

> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>
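
The reduced retry budget for costly requests mentioned above can be pictured
with a small helper. The helper name and the 1/4 scaling are assumptions for
illustration, not necessarily what the patch uses.

/* Hypothetical helper: pick a compaction retry budget for a request. */
static inline int compact_retries_limit(unsigned int order)
{
	int max_retries = MAX_COMPACT_RETRIES;

	/*
	 * Costly requests are less important and are allowed to fail, so
	 * give them a smaller share of the budget (divisor chosen for
	 * illustration).
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		max_retries /= 4;

	return max_retries;
}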

2016-04-11 15:14:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Mon 11-04-16 16:39:21, Vlastimil Babka wrote:
> On 04/05/2016 01:25 PM, Michal Hocko wrote:
[...]
> >+/* Compaction has failed and it doesn't make much sense to keep retrying. */
> >+static inline bool compaction_failed(enum compact_result result)
> >+{
> >+ /* All zones were scanned completely and still no result. */
>
> Hmm given that try_to_compact_pages() uses a max() on results, then in fact
> it takes only one zone to get this. Others could have been also SKIPPED or
> DEFERRED. Is that what you want?

In short, I didn't find any better way that would still provide some
guarantee of convergence. COMPACT_COMPLETE means that at least one zone
was completely scanned and led to no result. That zone would be
compact_suitable by definition. If I gave DEFERRED or SKIPPED a higher
priority (aka higher in the enum) then I could easily end up in a state
where most zones would return COMPACT_COMPLETE and the few remaining would
just alternate returning DEFERRED resp. SKIPPED. So while this
might sound like giving up too early, I couldn't come up with anything
more specific that would lead to reliable results.

I am open to any suggestions of course.

[...]
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -3362,25 +3362,12 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > if (page)
> > goto got_pg;
> >
> >- /* Checks for THP-specific high-order allocations */
> >- if (is_thp_gfp_mask(gfp_mask)) {
> >- /*
> >- * If compaction is deferred for high-order allocations, it is
> >- * because sync compaction recently failed. If this is the case
> >- * and the caller requested a THP allocation, we do not want
> >- * to heavily disrupt the system, so we fail the allocation
> >- * instead of entering direct reclaim.
> >- */
> >- if (compact_result == COMPACT_DEFERRED)
> >- goto nopage;
> >-
> >- /*
> >- * Compaction is contended so rather back off than cause
> >- * excessive stalls.
> >- */
> >- if(compact_result == COMPACT_CONTENDED)
> >- goto nopage;
> >- }
> >+ /*
> >+ * Checks for THP-specific high-order allocations and back off
> >+ * if the compaction backed off
> >+ */
> >+ if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> >+ goto nopage;
>
> The change of semantics for THP is not trivial here and should at least be
> discussed in the changelog. CONTENDED and DEFERRED are only a subset of
> compaction_withdrawn() as seen above.

True. My main motivation was to get rid of the compaction specific code
from the allocator path as much as possible. I can drop the above hunk
of course but I think we should get rid of these checks and make the
code simpler. To be honest I am not even sure those changes are really
measurable.

> Why is it useful to back off due to
> COMPACT_PARTIAL_SKIPPED (we were just unlucky in our starting position), but
> not due to COMPACT_COMPLETE (we have seen the whole zone but failed anyway)?

OK, that is a good remark. I could change that to:
if (is_thp_gfp_mask(gfp_mask) &&
    (compaction_withdrawn(compact_result) || compaction_failed(compact_result)))

> Why back off due to COMPACT_SKIPPED (not enough order-0 pages) without
> trying reclaim at least once, and then another async compaction, like
> before?

The idea was that COMPACT_SKIPPED wouldn't change after a single reclaim
round most of the time because a zone would have to get above
low_wmark + 1<<9 pages. So the only situation where it would matter would be if
we had some order-9 pages available hidden by the min wmark and we would
reclaim enough to get above that gap. I am not sure this is what we
really want in the first place: increasing the reclaim stalls when we are
getting under memory pressure.
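
For scale: with 4kB pages, 1<<9 means 512 extra free base pages, i.e. roughly
2MB per zone on top of the low watermark, which a single reclaim round under
pressure rarely delivers. A minimal sketch of such a gate follows; this is an
illustration only, not the actual compaction_suitable() logic, and the exact
gap it applies may differ.

/* Sketch only; the gate that keeps yielding COMPACT_SKIPPED for THP. */
static bool thp_compaction_still_skipped(struct zone *zone)
{
	unsigned long needed = low_wmark_pages(zone) + (1UL << HPAGE_PMD_ORDER);

	return zone_page_state(zone, NR_FREE_PAGES) < needed;
}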

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-11 15:34:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Mon 11-04-16 17:14:10, Michal Hocko wrote:
> On Mon 11-04-16 16:39:21, Vlastimil Babka wrote:
> > On 04/05/2016 01:25 PM, Michal Hocko wrote:
[...]
> > >+ /*
> > >+ * Checks for THP-specific high-order allocations and back off
> > >+ * if the compaction backed off
> > >+ */
> > >+ if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
> > >+ goto nopage;
> >
> > The change of semantics for THP is not trivial here and should at least be
> > discussed in the changelog. CONTENDED and DEFERRED are only a subset of
> > compaction_withdrawn() as seen above.
>
> True. My main motivation was to get rid of the compaction specific code
> from the allocator path as much as possible. I can drop the above hunk

I was thinking about this some more and will drop the hunk. I would
rather have this patch without side effects as much as possible. A
follow up patch can get rid of the specific checks and use simpler
ones. I will post a -fix patch to Andrew.
--
Michal Hocko
SUSE Labs

2016-04-11 15:56:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

Hi Andrew,
Vlastimil has pointed out[1] that using compaction_withdrawn() for THP
allocations has some non-trivial consequences. While I still think that
the check is OK, it is true that we shouldn't sneak in a potential behavior
change into something that basically provides an API. So can you fold
the following partial revert into the original patch please?

[1] http://lkml.kernel.org/r/[email protected]

---
From 71ddeee4238e33d67ef07883e73f946a7cc40e73 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 11 Apr 2016 17:38:22 +0200
Subject: [PATCH] ction-abstract-compaction-feedback-to-helpers-fix

Preserve the original thp back off checks to not introduce any
functional changes as per Vlastimil.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 25 +++++++++++++++++++------
1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c05de84c8157..c37e6d1ad643 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3362,12 +3362,25 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;

- /*
- * Checks for THP-specific high-order allocations and back off
- * if the compaction backed off
- */
- if (is_thp_gfp_mask(gfp_mask) && compaction_withdrawn(compact_result))
- goto nopage;
+ /* Checks for THP-specific high-order allocations */
+ if (is_thp_gfp_mask(gfp_mask)) {
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (compact_result == COMPACT_DEFERRED)
+ goto nopage;
+
+ /*
+ * Compaction is contended so rather back off than cause
+ * excessive stalls.
+ */
+ if(compact_result == COMPACT_CONTENDED)
+ goto nopage;
+ }

/*
* It can become very expensive to allocate transparent hugepages at
--
2.8.0.rc3

--
Michal Hocko
SUSE Labs

2016-04-11 16:07:53

by Michal Hocko

[permalink] [raw]
Subject: [RFC PATCH] mm: use compaction feedback for thp backoff conditions

On Mon 11-04-16 17:40:36, Michal Hocko wrote:
> Hi Andrew,
> Vlastimil has pointed out[1] that using compaction_withdrawn() for THP
> allocations has some non-trivial consequences. While I still think that
> the check is OK, it is true that we shouldn't sneak in a potential behavior
> change into something that basically provides an API. So can you fold
> the following partial revert into the original patch please?
>
> [1] http://lkml.kernel.org/r/[email protected]

This would be an RFC on top.
---
From 6cfed80ad41f3f1506930b9a3254fe135bf90d4c Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 11 Apr 2016 17:51:28 +0200
Subject: [PATCH] mm: use compaction feedback for thp backoff conditions

THP requests skip the direct reclaim if the compaction is either
deferred or contended to reduce stalls which wouldn't help the
allocation success anyway. These checks are ignoring other potential
feedback modes which we have available now.

It clearly doesn't make much sense to go and reclaim a few pages if the
previous compaction has failed.

We can also simplify the check by using compaction_withdrawn which
checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
is however covering more reasons why the compaction was withdrawn.
None of them should be a problem for the THP case though.

It is safe to back off if we see COMPACT_SKIPPED because that means
that compaction_suitable failed and a single round of the reclaim is
unlikely to make any difference here. We would have to be close to
the low watermark to reclaim enough and even then there is no guarantee
that the compaction would make any progress while the direct reclaim
would have caused the stall.

COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
have only seen a part of the zone so a retry would make some sense. But
it would be a compaction retry rather than a reclaim retry. We are
not doing that and that might indeed lead to situations where THP fails
but this should happen only rarely and it would be really hard to
measure.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5eae9e0555ed..6d1da0ceaf1e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3420,25 +3420,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (order && compaction_made_progress(compact_result))
compaction_retries++;

- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (compact_result == COMPACT_DEFERRED)
- goto nopage;
-
- /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if(compact_result == COMPACT_CONTENDED)
- goto nopage;
- }
+ /*
+ * Checks for THP-specific high-order allocations and back off
+ * if the compaction backed off or failed
+ */
+ if (is_thp_gfp_mask(gfp_mask) &&
+ (compaction_withdrawn(compact_result) ||
+ compaction_failed(compact_result)))
+ goto nopage;

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
--
2.8.0.rc3

--
Michal Hocko
SUSE Labs

2016-04-12 11:53:53

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On 04/11/2016 05:14 PM, Michal Hocko wrote:
> On Mon 11-04-16 16:39:21, Vlastimil Babka wrote:
>> On 04/05/2016 01:25 PM, Michal Hocko wrote:
> [...]
>>> +/* Compaction has failed and it doesn't make much sense to keep retrying. */
>>> +static inline bool compaction_failed(enum compact_result result)
>>> +{
>>> + /* All zones were scanned completely and still no result. */
>>
>> Hmm given that try_to_compact_pages() uses a max() on results, then in fact
>> it takes only one zone to get this. Others could have been also SKIPPED or
>> DEFERRED. Is that what you want?
>
> In short, I didn't find any better way that would still provide some
> guarantee of convergence. COMPACT_COMPLETE means that at least one zone
> was completely scanned and led to no result. That zone would be
> compact_suitable by definition. If I gave DEFERRED or SKIPPED a higher
> priority (aka higher in the enum) then I could easily end up in a state
> where most zones would return COMPACT_COMPLETE and the few remaining would
> just alternate returning DEFERRED resp. SKIPPED. So while this
> might sound like giving up too early, I couldn't come up with anything
> more specific that would lead to reliable results.
>
> I am open to any suggestions of course.

I guess you would have to track each zone separately and make sure
you've seen COMPACT_COMPLETE in all of them, although not necessarily
during the same zonelist attempt. But then do the same for reclaim, as
you would also have to match COMPACT_SKIPPED and the inability of reclaim...
and that gets uglier and uglier, and also against the move to node-based
reclaim...

So there's a danger that you'll see COMPACT_COMPLETE on a small ZONE_DMA
early on, before the larger zones even stop being deferred, but I don't
see an easy solution.
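
To give a feel for what the per-zone tracking above would require, here is a
hypothetical sketch. The names are invented and this is not proposed code; it
also assumes fewer than BITS_PER_LONG eligible zones.

/*
 * Remember which zones have already reported COMPACT_COMPLETE across
 * retries and only treat compaction as a definitive failure once every
 * eligible zone has done so.
 */
struct compact_progress {
	unsigned long complete_mask;	/* one bit per zonelist index */
	unsigned int nr_zones;
};

static void note_zone_result(struct compact_progress *cp,
			     unsigned int zone_idx,
			     enum compact_result result)
{
	if (result == COMPACT_COMPLETE)
		cp->complete_mask |= 1UL << zone_idx;
}

static bool compaction_failed_everywhere(const struct compact_progress *cp)
{
	return cp->complete_mask == (1UL << cp->nr_zones) - 1;
}

Even this simple form shows the difficulty: the mask would have to survive
across zonelist attempts and be kept consistent with the reclaim side, which
is exactly where it starts getting ugly.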


2016-04-12 11:55:03

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On 04/11/2016 05:40 PM, Michal Hocko wrote:
> Hi Andrew,
> Vlastimil has pointed out[1] that using compaction_withdrawn() for THP
> allocations has some non-trivial consequences. While I still think that
> the check is OK, it is true that we shouldn't sneak in a potential behavior
> change into something that basically provides an API. So can you fold
> the following partial revert into the original patch please?
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> ---
> From 71ddeee4238e33d67ef07883e73f946a7cc40e73 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 11 Apr 2016 17:38:22 +0200
> Subject: [PATCH] ction-abstract-compaction-feedback-to-helpers-fix
>
> Preserve the original thp back off checks to not introduce any
> functional changes as per Vlastimil.
>
> Signed-off-by: Michal Hocko <[email protected]>

Ack, thanks.

2016-04-12 12:23:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 09/11] mm, compaction: Abstract compaction feedback to helpers

On Tue 12-04-16 13:53:47, Vlastimil Babka wrote:
> On 04/11/2016 05:14 PM, Michal Hocko wrote:
> >On Mon 11-04-16 16:39:21, Vlastimil Babka wrote:
> >>On 04/05/2016 01:25 PM, Michal Hocko wrote:
> >[...]
> >>>+/* Compaction has failed and it doesn't make much sense to keep retrying. */
> >>>+static inline bool compaction_failed(enum compact_result result)
> >>>+{
> >>>+ /* All zones were scanned completely and still no result. */
> >>
> >>Hmm given that try_to_compact_pages() uses a max() on results, then in fact
> >>it takes only one zone to get this. Others could have been also SKIPPED or
> >>DEFERRED. Is that what you want?
> >
> >In short, I didn't find any better way that would still provide some
> >guarantee of convergence. COMPACT_COMPLETE means that at least one zone
> >was completely scanned and led to no result. That zone would be
> >compact_suitable by definition. If I gave DEFERRED or SKIPPED a higher
> >priority (aka higher in the enum) then I could easily end up in a state
> >where most zones would return COMPACT_COMPLETE and the few remaining would
> >just alternate returning DEFERRED resp. SKIPPED. So while this
> >might sound like giving up too early, I couldn't come up with anything
> >more specific that would lead to reliable results.
> >
> >I am open to any suggestions of course.
>
> I guess you would have to track each zone separately and make sure you've
> seen COMPACT_COMPLETE in all of them, although not necessarily during the same
> zonelist attempt. But then do the same for reclaim, as you would also have
> to match COMPACT_SKIPPED and the inability of reclaim... and that gets uglier and
> uglier, and also against the move to node-based reclaim...

I think we want to get rid of some of these states long term. Or at least
not defer or skip for small orders that really matter and are in fact
nofail. But I cannot claim I understand the defer logic well enough to
do it right now.

> So there's a danger that you'll see COMPACT_COMPLETE on a small ZONE_DMA
> early on, before the larger zones even stop being deferred, but I don't see
> an easy solution.

ZONE_DMA should back off most of the time due to watermark checks. But
it is true that we might have a small zone which is not protected by lowmem
reserves.

I certainly see a lot of room for improving the compaction and this
rework looks like a good motivation.
--
Michal Hocko
SUSE Labs