Hi,
This is v6 of the series. The previous version was posted [1]. The
code hasn't changed much since then. I have found one old standing
bug (patch 1) which just got much more severe and visible with this
series. Other than that I have reorganized the series and put the
compaction feedback abstraction to the front just in case we find out
that parts of the series would have to be reverted later on for some
reason. The premature oom killer invocation reported by Hugh [2] seems
to be addressed.
We have discussed this series at LSF/MM summit in Raleigh and there
didn't seem to be any concerns/objections to go on with the patch set
and target it for the next merge window.
Motivation:
As pointed by Linus [3][4] relying on zone_reclaimable as a way to
communicate the reclaim progress is rater dubious. I tend to agree,
not only it is really obscure, it is not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
getting any progress because their gfp_mask wouldn't allow to get that
page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in the practice but the current logic which we
have is rather obscure and hard to follow a also non-deterministic.
This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather spread
out between the allocator and the reclaim. The more on the implementation
is described in the first patch.
I have tested several different scenarios but it should be clear that
testing OOM killer is quite hard to be representative. There is usually
a tiny gap between almost OOM and full blown OOM which is often time
sensitive. Anyway, I have tested the following 2 scenarios and I would
appreciate if there are more to test.
Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.
1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
file size, removes the files and starts over again) running in
parallel for 10s to build up a lot of dirty pages when 100 parallel
mem_eaters (anon private populated mmap which waits until it gets
signal) with 80M each.
This causes an OOM flood of course and I have compared both patched
and unpatched kernels. The test is considered finished after there
are no OOM conditions detected. This should tell us whether there are
any excessive kills or some of them premature (e.g. due to dirty pages):
I have performed two runs this time each after a fresh boot.
* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78
$ grep "Kill process" base-oom-run1.log | tail -n1
[ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52
$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3
* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
77
e grep "Kill process" patched-oom-run1.log | tail -n1
[ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child
$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77
e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3
The patched kernel run noticeably longer while invoking OOM killer same
number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. free pages stats
show that neither kernels went OOM too early most of the time, though. I
guess the difference is in the backoff when retries without any progress
do sleep for a while if there is memory under writeback or dirty which
is highly likely considering the parallel IO.
Both kernels have seen races where zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.
2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
memory as possible without triggering the OOM killer. This required a lot
of tuning but I've considered 3 consecutive runs in three different boots
without OOM as a success.
* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)
* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)
That means 40M more memory was usable without triggering OOM killer. The
base kernel sometimes managed to handle the same as patched but it
wasn't consistent and failed in at least on of the 3 runs. This seems
like a minor improvement.
I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see huge improvement there.
I am certainly interested in other usecases as well as well as any
feedback. Especially those which require higher order requests.
* Changes since v5
- added "vmscan: consider classzone_idx in compaction_ready"
- added "mm, oom, compaction: prevent from should_compact_retry looping
for ever for costly orders"
- acked-bys from Vlastimil
- integrated feedback from review
* Changes since v4
- dropped __GFP_REPEAT for costly allocation as it is now replaced by
the compaction based feedback logic
- !costly high order requests are retried based on the compaction feedback
- compaction feedback has been tweaked to give us an useful information
to make decisions in the page allocator
- rebased on the current mmotm-2016-04-01-16-24 with the previous version
of the rework reverted
* Changes since v3
- factor out the new heuristic into its own function as suggested by
Johannes (no functional changes)
* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
wait_iff_congested related changes which needed refresh in
patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
the last time (4.2 -> 4.3 based kernel + mmotm patches)
* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- __GFP_NOFAIL high order migh fail fixed - theoretical bug
[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]
[3] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[4] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com
From: Michal Hocko <[email protected]>
while playing with the oom detection rework [1] I have noticed
that my heavy order-9 (hugetlb) load close to OOM ended up in an
endless loop where the reclaim hasn't made any progress but
did_some_progress didn't reflect that and compaction_suitable
was backing off because no zone is above low wmark + 1 << order.
It turned out that this is in fact an old standing bug in compaction_ready
which ignores the requested_highidx and did the watermark check for
0 classzone_idx. This succeeds for zone DMA most of the time as the zone
is mostly unused because of lowmem protection. This also means that the
OOM killer wouldn't be triggered for higher order requests even when
there is no reclaim progress and we essentially rely on order-0 request
to find this out. This has been broken in one way or another since
fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
are sufficient free pages available") but only since 7335084d446b ("mm:
vmscan: do not OOM if aborting reclaim to start compaction") we are not
invoking the OOM killer based on the wrong calculation.
Propagate requested_highidx down to compaction_ready and use it for both
the watermak check and compaction_suitable to fix this issue.
[1] http://lkml.kernel.org/r/[email protected]
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmscan.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c839adc13efd..3e6347e2a5fc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
* Returns true if compaction should go ahead for a high-order request, or
* the high-order allocation would succeed without compaction.
*/
-static inline bool compaction_ready(struct zone *zone, int order)
+static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
{
unsigned long balance_gap, watermark;
bool watermark_ok;
@@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
- watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
+ watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
/*
* If compaction is deferred, reclaim up to a point where
@@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
* If compaction is not ready to start and allocation is not likely
* to succeed without it, then keep reclaiming.
*/
- if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
+ if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
return false;
return watermark_ok;
@@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (IS_ENABLED(CONFIG_COMPACTION) &&
sc->order > PAGE_ALLOC_COSTLY_ORDER &&
zonelist_zone_idx(z) <= requested_highidx &&
- compaction_ready(zone, sc->order)) {
+ compaction_ready(zone, sc->order, requested_highidx)) {
sc->compaction_ready = true;
continue;
}
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
the compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"
mm/compaction.c: In function ‘compact_zone’:
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_DEFERRED’ not handled in switch [-Wswitch]
switch (ret) {
^
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_COMPLETE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NO_SUITABLE_PAGE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_NOT_SUITABLE_ZONE’ not handled in switch [-Wswitch]
mm/compaction.c:1350:2: warning: enumeration value ‘COMPACT_CONTENDED’ not handled in switch [-Wswitch]
compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible. Put a VM_BUG_ON to catch an impossible return value.
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
mm/compaction.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 8ae7b1c46c72..b06de27b7f72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1433,15 +1433,12 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
ret = compaction_suitable(zone, cc->order, cc->alloc_flags,
cc->classzone_idx);
- switch (ret) {
- case COMPACT_PARTIAL:
- case COMPACT_SKIPPED:
- /* Compaction is likely to fail */
+ /* Compaction is likely to fail */
+ if (ret == COMPACT_PARTIAL || ret == COMPACT_SKIPPED)
return ret;
- case COMPACT_CONTINUE:
- /* Fall through to compaction */
- ;
- }
+
+ /* huh, compaction_suitable is returning something unexpected */
+ VM_BUG_ON(ret != COMPACT_CONTINUE);
/*
* Clear pageblock skip if there were failures recently and compaction
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
COMPACT_COMPLETE now means that compaction and free scanner met. This is
not very useful information if somebody just wants to use this feedback
and make any decisions based on that. The current caller might be a poor
guy who just happened to scan tiny portion of the zone and that could be
the reason no suitable pages were compacted. Make sure we distinguish
the full and partial zone walks.
Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.
The existing users of COMPACT_COMPLETE are conservatively changed to
use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.
This patch shouldn't introduce any functional changes.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 10 +++++++++-
include/trace/events/compaction.h | 1 +
mm/compaction.c | 14 +++++++++++---
mm/internal.h | 1 +
4 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7e177d111c39..7c4de92d12cc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -21,7 +21,15 @@ enum compact_result {
* pages
*/
COMPACT_PARTIAL,
- /* The full zone was compacted */
+ /*
+ * direct compaction has scanned part of the zone but wasn't successfull
+ * to compact suitable pages.
+ */
+ COMPACT_PARTIAL_SKIPPED,
+ /*
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
+ */
COMPACT_COMPLETE,
/* For more detailed tracepoint output */
COMPACT_NO_SUITABLE_PAGE,
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 6ba16c86d7db..36e2d6fb1360 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -14,6 +14,7 @@
EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_CONTINUE, "continue") \
EM( COMPACT_PARTIAL, "partial") \
+ EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
EM( COMPACT_COMPLETE, "complete") \
EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
diff --git a/mm/compaction.c b/mm/compaction.c
index 13709e33a2fc..e2e487cea5ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
if (cc->direct_compaction)
zone->compact_blockskip_flush = true;
- return COMPACT_COMPLETE;
+ if (cc->whole_zone)
+ return COMPACT_COMPLETE;
+ else
+ return COMPACT_PARTIAL_SKIPPED;
}
if (is_via_compact_memory(cc->order))
@@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
}
+
+ if (cc->migrate_pfn == start_pfn)
+ cc->whole_zone = true;
+
cc->last_migrated_pfn = 0;
trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
@@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
goto break_loop;
}
- if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
+ if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+ status == COMPACT_PARTIAL_SKIPPED)) {
/*
* We think that allocation won't succeed in this zone
* so we defer compaction there. If it ends up
@@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
cc.classzone_idx, 0)) {
success = true;
compaction_defer_reset(zone, cc.order, false);
- } else if (status == COMPACT_COMPLETE) {
+ } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
/*
* We use sync migration mode here, so we defer like
* sync direct compaction does.
diff --git a/mm/internal.h b/mm/internal.h
index e9aacea1a0d1..4423dfe69382 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -182,6 +182,7 @@ struct compact_control {
enum migrate_mode mode; /* Async or sync migration mode */
bool ignore_skip_hint; /* Scan blocks even if marked skip */
bool direct_compaction; /* False from kcompactd or /proc/... */
+ bool whole_zone; /* Whole zone has been scanned */
int order; /* order a direct compactor needs */
const gfp_t gfp_mask; /* gfp mask of a direct compactor */
const int alloc_flags; /* alloc flags of a direct compactor */
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
Compaction can provide a wild variation of feedback to the caller. Many
of them are implementation specific and the caller of the compaction
(especially the page allocator) shouldn't be bound to specifics of the
current implementation.
This patch abstracts the feedback into three basic types:
- compaction_made_progress - compaction was active and made some
progress.
- compaction_failed - compaction failed and further attempts to
invoke it would most probably fail and therefore it is not
worth retrying
- compaction_withdrawn - compaction wasn't invoked for an
implementation specific reasons. In the current implementation
it means that the compaction was deferred, contended or the
page scanners met too early without any progress. Retrying is
still worthwhile.
[[email protected]: do not change thp back off behavior]
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a7b9091ff349..a002ca55c513 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);
+/* Compaction has made some progress and retrying makes sense */
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ /*
+ * Even though this might sound confusing this in fact tells us
+ * that the compaction successfully isolated and migrated some
+ * pageblocks.
+ */
+ if (result == COMPACT_PARTIAL)
+ return true;
+
+ return false;
+}
+
+/* Compaction has failed and it doesn't make much sense to keep retrying. */
+static inline bool compaction_failed(enum compact_result result)
+{
+ /* All zones where scanned completely and still not result. */
+ if (result == COMPACT_COMPLETE)
+ return true;
+
+ return false;
+}
+
+/*
+ * Compaction has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ /*
+ * Compaction backed off due to watermark checks for order-0
+ * so the regular reclaim has to try harder and reclaim something.
+ */
+ if (result == COMPACT_SKIPPED)
+ return true;
+
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. If this is the case
+ * and the caller requested a THP allocation, we do not want
+ * to heavily disrupt the system, so we fail the allocation
+ * instead of entering direct reclaim.
+ */
+ if (result == COMPACT_DEFERRED)
+ return true;
+
+ /*
+ * If compaction in async mode encounters contention or blocks higher
+ * priority task we back off early rather than cause stalls.
+ */
+ if (result == COMPACT_CONTENDED)
+ return true;
+
+ /*
+ * Page scanners have met but we haven't scanned full zones so this
+ * is a back off in fact.
+ */
+ if (result == COMPACT_PARTIAL_SKIPPED)
+ return true;
+
+ return false;
+}
+
extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
@@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}
+static inline bool compaction_made_progress(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_failed(enum compact_result result)
+{
+ return false;
+}
+
+static inline bool compaction_withdrawn(enum compact_result result)
+{
+ return true;
+}
+
static inline int kcompactd_run(int nid)
{
return 0;
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback and yet the bdi is not congested
so this is just a cond_resched in the end and can end up triggering pre
mature OOM.
This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce stalls
fixed by 0e093d99763e because congestion_wait is called only when we
are getting hopeless when sleeping is a better choice than OOM with many
pages under IO.
We have to preserve logic introduced by 373ccbe59270 ("mm, vmstat: allow
WQ concurrency to discover memory reclaim doesn't make any progress")
into the __alloc_pages_slowpath now that wait_iff_congested is not
used anymore. As the only remaining user of wait_iff_congested is
shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/backing-dev.c | 20 +++-----------------
mm/page_alloc.c | 39 ++++++++++++++++++++++++++++++++++++---
2 files changed, 39 insertions(+), 20 deletions(-)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index bfbd7096b6ed..08e3a58628ed 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
@@ -979,20 +978,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
*/
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
-
- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that a particular
- * WQ is congested if the worker thread is looping without
- * ever sleeping. Therefore we have to do a short sleep
- * here rather than calling cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
+ cond_resched();
/* In case we scheduled, work out time remaining */
ret = timeout - (jiffies - start);
if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38302c2041a3..3b78936eca70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3195,8 +3195,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
unsigned long available;
+ unsigned long reclaimable;
- available = zone_reclaimable_pages(zone);
+ available = reclaimable = zone_reclaimable_pages(zone);
available -= DIV_ROUND_UP(no_progress_loops * available,
MAX_RECLAIM_RETRIES);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
@@ -3207,8 +3208,40 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
*/
if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
ac->high_zoneidx, alloc_flags, available)) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ /*
+ * If we didn't make any progress and have a lot of
+ * dirty + writeback pages then we should wait for
+ * an IO to complete to slow down the reclaim and
+ * prevent from pre mature OOM
+ */
+ if (!did_some_progress) {
+ unsigned long writeback;
+ unsigned long dirty;
+
+ writeback = zone_page_state_snapshot(zone,
+ NR_WRITEBACK);
+ dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+ if (2*(writeback + dirty) > reclaimable) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ return true;
+ }
+ }
+
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+
return true;
}
}
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
compaction_result will be used as the primary feedback channel for
compaction users. At the same time try_to_compact_pages (and potentially
others) assume a certain ordering where a more specific feedback takes
precendence. This gets a bit awkward when we have conflicting feedback
from different zones. E.g one returing COMPACT_COMPLETE meaning the full
zone has been scanned without any outcome while other returns with
COMPACT_PARTIAL aka made some progress. The caller should get
COMPACT_PARTIAL because that means that the compaction still can make
some progress. The same applies for COMPACT_PARTIAL vs.
COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
larger the value is the more progress we have done.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 26 ++++++++++++++++----------
1 file changed, 16 insertions(+), 10 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 7c4de92d12cc..a7b9091ff349 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,6 +4,8 @@
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NOT_SUITABLE_ZONE,
/*
* compaction didn't start as it was not possible or direct reclaim
* was more suitable
@@ -11,30 +13,34 @@ enum compact_result {
COMPACT_SKIPPED,
/* compaction didn't start as it was deferred due to past failures */
COMPACT_DEFERRED,
+
/* compaction not active last round */
COMPACT_INACTIVE = COMPACT_DEFERRED,
+ /* For more detailed tracepoint output - internal to compaction */
+ COMPACT_NO_SUITABLE_PAGE,
/* compaction should continue to another pageblock */
COMPACT_CONTINUE,
+
/*
- * direct compaction partially compacted a zone and there are suitable
- * pages
+ * The full zone was compacted scanned but wasn't successfull to compact
+ * suitable pages.
*/
- COMPACT_PARTIAL,
+ COMPACT_COMPLETE,
/*
* direct compaction has scanned part of the zone but wasn't successfull
* to compact suitable pages.
*/
COMPACT_PARTIAL_SKIPPED,
+
+ /* compaction terminated prematurely due to lock contentions */
+ COMPACT_CONTENDED,
+
/*
- * The full zone was compacted scanned but wasn't successfull to compact
- * suitable pages.
+ * direct compaction partially compacted a zone and there might be
+ * suitable pages
*/
- COMPACT_COMPLETE,
- /* For more detailed tracepoint output */
- COMPACT_NO_SUITABLE_PAGE,
- COMPACT_NOT_SUITABLE_ZONE,
- COMPACT_CONTENDED,
+ COMPACT_PARTIAL,
};
/* Used to signal whether compaction detected need_sched() or lock contention */
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
THP requests skip the direct reclaim if the compaction is either
deferred or contended to reduce stalls which wouldn't help the
allocation success anyway. These checks are ignoring other potential
feedback modes which we have available now.
It clearly doesn't make much sense to go and reclaim few pages if the
previous compaction has failed.
We can also simplify the check by using compaction_withdrawn which
checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
is however covering more reasons why the compaction was withdrawn.
None of them should be a problem for the THP case though.
It is safe to back of if we see COMPACT_SKIPPED because that means
that compaction_suitable failed and a single round of the reclaim is
unlikely to make any difference here. We would have to be close to
the low watermark to reclaim enough and even then there is no guarantee
that the compaction would make any progress while the direct reclaim
would have caused the stall.
COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
have only seen a part of the zone so a retry would make some sense. But
it would be a compaction retry not a reclaim retry to perform. We are
not doing that and that might indeed lead to situations where THP fails
but this should happen only rarely and it would be really hard to
measure.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 350d13f3709b..d551fe326c33 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3257,25 +3257,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (page)
goto got_pg;
- /* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
- /*
- * If compaction is deferred for high-order allocations, it is
- * because sync compaction recently failed. If this is the case
- * and the caller requested a THP allocation, we do not want
- * to heavily disrupt the system, so we fail the allocation
- * instead of entering direct reclaim.
- */
- if (compact_result == COMPACT_DEFERRED)
- goto nopage;
-
- /*
- * Compaction is contended so rather back off than cause
- * excessive stalls.
- */
- if(compact_result == COMPACT_CONTENDED)
- goto nopage;
- }
+ /*
+ * Checks for THP-specific high-order allocations and back off
+ * if the the compaction backed off or failed
+ */
+ if (is_thp_gfp_mask(gfp_mask) &&
+ (compaction_withdrawn(compact_result) ||
+ compaction_failed(compact_result)))
+ goto nopage;
/*
* It can become very expensive to allocate transparent hugepages at
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders. While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic. The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap. If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g. hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark. The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.
I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.
The reason why compaction requires being over low rather than min
watermark is not clear to me. This check was there essentially since
56de7263fcf3 ("mm: compaction: direct compact when a high-order
allocation fails"). It is clearly an implementation detail though and we
shouldn't pull it into the generic retry logic while we should be able
to cope with such eventuality. The only place in should_compact_retry
where we retry without any upper bound is for compaction_withdrawn()
case.
Introduce compaction_zonelist_suitable function which checks the given
zonelist and returns true only if there is at least one zone which would
would unblock __compaction_suitable if more memory got reclaimed. In
this implementation it checks __compaction_suitable with NR_FREE_PAGES
plus part of the reclaimable memory as the target for the watermark check.
The reclaimable memory is reduced linearly by the allocation order. The
idea is that we do not want to reclaim all the remaining memory for a
single allocation request just unblock __compaction_suitable which
doesn't guarantee we will make a further progress.
The new helper is then used if compaction_withdrawn() feedback was
provided so we do not retry if there is no outlook for a further
progress. !costly requests shouldn't be affected much - e.g. order-2
pages would require to have at least 64kB on the reclaimable LRUs while
order-9 would need at least 32M which should be enough to not lock up.
[[email protected]: fix classzone_idx vs. high_zoneidx usage in
compaction_zonelist_suitable]
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 4 ++++
include/linux/mmzone.h | 3 +++
mm/compaction.c | 42 +++++++++++++++++++++++++++++++++++++++---
mm/page_alloc.c | 18 +++++++++++-------
4 files changed, 57 insertions(+), 10 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a002ca55c513..7bbdbf729757 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
return false;
}
+
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags);
+
extern int kcompactd_run(int nid);
extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 150c6049f961..0bf13c7cd8cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
extern struct mutex zonelists_mutex;
void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
+ unsigned long mark, int classzone_idx, int alloc_flags,
+ long free_pages);
bool zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags);
bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index e2e487cea5ea..0a7ca578af97 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
* COMPACT_CONTINUE - If compaction should run now
*/
static enum compact_result __compaction_suitable(struct zone *zone, int order,
- int alloc_flags, int classzone_idx)
+ int alloc_flags, int classzone_idx,
+ unsigned long wmark_target)
{
int fragindex;
unsigned long watermark;
@@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
* allocated and for a short time, the footprint is higher
*/
watermark += (2UL << order);
- if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
+ if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
+ alloc_flags, wmark_target))
return COMPACT_SKIPPED;
/*
@@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
{
enum compact_result ret;
- ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
+ ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
+ zone_page_state(zone, NR_FREE_PAGES));
trace_mm_compaction_suitable(zone, order, ret);
if (ret == COMPACT_NOT_SUITABLE_ZONE)
ret = COMPACT_SKIPPED;
@@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
return ret;
}
+bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
+ int alloc_flags)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure at least one zone would pass __compaction_suitable if we continue
+ * retrying the reclaim.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+ enum compact_result compact_result;
+
+ /*
+ * Do not consider all the reclaimable memory because we do not
+ * want to trash just for a single high order allocation which
+ * is even not guaranteed to appear even if __compaction_suitable
+ * is happy about the watermark check.
+ */
+ available = zone_reclaimable_pages(zone) / order;
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+ compact_result = __compaction_suitable(zone, order, alloc_flags,
+ ac->classzone_idx, available);
+ if (compact_result != COMPACT_SKIPPED &&
+ compact_result != COMPACT_NOT_SUITABLE_ZONE)
+ return true;
+ }
+
+ return false;
+}
+
static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
enum compact_result ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d5a938f12554..6757d6df2160 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
* one free page of a suitable size. Checking now avoids taking the zone lock
* to check in the allocation paths if no pages are free.
*/
-static bool __zone_watermark_ok(struct zone *z, unsigned int order,
+bool __zone_watermark_ok(struct zone *z, unsigned int order,
unsigned long mark, int classzone_idx, int alloc_flags,
long free_pages)
{
@@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
}
static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
- enum migrate_mode *migrate_mode,
+should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
+ enum compact_result compact_result, enum migrate_mode *migrate_mode,
int compaction_retries)
{
int max_retries = MAX_COMPACT_RETRIES;
@@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
/*
* make sure the compaction wasn't deferred or didn't bail out early
* due to locks contention before we declare that we should give up.
+ * But do not retry if the given zonelist is not suitable for
+ * compaction.
*/
if (compaction_withdrawn(compact_result))
- return true;
+ return compaction_zonelist_suitable(ac, order, alloc_flags);
/*
* !costly requests are much more important than __GFP_REPEAT
@@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
}
static inline bool
-should_compact_retry(unsigned int order, enum compact_result compact_result,
+should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
+ enum compact_result compact_result,
enum migrate_mode *migrate_mode,
int compaction_retries)
{
@@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* of free memory (see __compaction_suitable)
*/
if (did_some_progress > 0 &&
- should_compact_retry(order, compact_result,
- &migration_mode, compaction_retries))
+ should_compact_retry(ac, order, alloc_flags,
+ compact_result, &migration_mode,
+ compaction_retries))
goto retry;
/* Reclaim has failed us, start killing things */
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
should_reclaim_retry currently where we decide to not retry after at
least order worth of pages were reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress. Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.
The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations) and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim,
has been removed quite some time ago so the assumption doesn't hold
anymore. Remove the check for the number of reclaimed pages and rely
on the compaction feedback solely. should_reclaim_retry now only
makes sure that we keep retrying reclaim for high order pages only
if they are hidden by watermaks so order-0 reclaim makes really sense.
should_compact_retry now keeps retrying even for the costly allocations.
The number of retries is reduced wrt. !costly requests because they are
less important and harder to grant and so their pressure shouldn't cause
contention for other requests or cause an over reclaim. We also do not
reset no_progress_loops for costly request to make sure we do not keep
reclaiming too agressively.
This has been tested by running a process which fragments memory:
- compact memory
- mmap large portion of the memory (1920M on 2GRAM machine with 2G
of swapspace)
- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
steps until certain amount of memory is freed (250M in my test)
and reduce the step to (step / 2) + 1 after reaching the end of
the mapping
- then run a script which populates the page cache 2G (MemTotal)
from /dev/zero to a new file
And then tries to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.
root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1
Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437
* This is the /proc/buddyinfo after the compaction
Done fragmenting. size=2013265920 freed=262144000
Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0
* /proc/buddyinfo after memory got fragmented
Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0
* /proc/buddyinfo after page cache got eaten
Trying to allocate 129
129
* 129 hugepages requested and all of them granted.
Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0
* /proc/buddyinfo after hugetlb allocation.
10 runs will behave as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
So basically 100% success for all 10 attempts.
Without the patch numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37
Just for completness the base kernel without oom detection rework looks
as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8
As we can see the success rate is much more volatile and smaller without
this patch. So the patch not only makes the retry logic for costly
requests more sensible the success rate is even higher.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
1 file changed, 32 insertions(+), 31 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb4df1be0d43..d5a938f12554 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
enum migrate_mode *migrate_mode,
int compaction_retries)
{
+ int max_retries = MAX_COMPACT_RETRIES;
+
if (!order)
return false;
@@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
}
/*
- * !costly allocations are really important and we have to make sure
- * the compaction wasn't deferred or didn't bail out early due to locks
- * contention before we go OOM. Still cap the reclaim retry loops with
- * progress to prevent from looping forever and potential trashing.
+ * make sure the compaction wasn't deferred or didn't bail out early
+ * due to locks contention before we declare that we should give up.
*/
- if (order <= PAGE_ALLOC_COSTLY_ORDER) {
- if (compaction_withdrawn(compact_result))
- return true;
- if (compaction_retries <= MAX_COMPACT_RETRIES)
- return true;
- }
+ if (compaction_withdrawn(compact_result))
+ return true;
+
+ /*
+ * !costly requests are much more important than __GFP_REPEAT
+ * costly ones because they are de facto nofail and invoke OOM
+ * killer to move on while costly can fail and users are ready
+ * to cope with that. 1/4 retries is rather arbitrary but we
+ * would need much more detailed feedback from compaction to
+ * make a better decision.
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ max_retries /= 4;
+ if (compaction_retries <= max_retries)
+ return true;
return false;
}
@@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
* Checks whether it makes sense to retry the reclaim to make a forward progress
* for the given allocation request.
* The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
*
* Returns true if a retry is viable or false to enter the oom path.
*/
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
- bool did_some_progress, unsigned long pages_reclaimed,
- int no_progress_loops)
+ bool did_some_progress, int no_progress_loops)
{
struct zone *zone;
struct zoneref *z;
@@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
if (no_progress_loops > MAX_RECLAIM_RETRIES)
return false;
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- if (pages_reclaimed >= (1<<order))
- return false;
-
- if (did_some_progress)
- return true;
- }
-
/*
* Keep reclaiming pages while there is a chance this will lead somewhere.
* If none of the target zones can satisfy our allocation request even
@@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
struct page *page = NULL;
int alloc_flags;
- unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
enum compact_result compact_result;
@@ -3444,16 +3443,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
goto noretry;
- if (did_some_progress) {
+ /*
+ * Costly allocations might have made a progress but this doesn't mean
+ * their order will become available due to high fragmentation so
+ * always increment the no progress counter for them
+ */
+ if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
no_progress_loops = 0;
- pages_reclaimed += did_some_progress;
- } else {
+ else
no_progress_loops++;
- }
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, pages_reclaimed,
- no_progress_loops))
+ did_some_progress > 0, no_progress_loops))
goto retry;
/*
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermak check for order-0. This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.
This can, however, lead to situations were the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger
OOM too early - e.g. after the first reclaim/compaction round. Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if
- the last compaction round backed off or
- we haven't completed at least MAX_COMPACT_RETRIES active
compaction rounds.
The first rule ensures that the very last attempt for compaction
was not ignored while the second guarantees that the compaction has done
some work. Multiple retries might be needed to prevent occasional
pigggy backing of other contexts to steal the compacted pages before
the current context manages to retry to allocate them.
compaction_failed() is taken as a final word from the compaction that
the retry doesn't make much sense. We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up. With this logic in place we do not have
to increase the migration mode unconditionally and rather do it only if
the compaction failed for the weaker mode. A nice side effect is that
the stronger migration mode is used only when really needed so this has
a potential of smaller latencies in some cases.
Please note that the compaction doesn't tell us much about how
successful it was when returning compaction_made_progress so we just
have to blindly trust that another retry is worthwhile and cap the
number to something reasonable to guarantee a convergence.
If the given number of successful retries is not sufficient for a
reasonable workloads we should focus on the collected compaction
tracepoints data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 77 insertions(+), 10 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b78936eca70..bb4df1be0d43 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
return page;
}
+
+/*
+ * Maximum number of compaction retries wit a progress before OOM
+ * killer is consider as the only way to move forward.
+ */
+#define MAX_COMPACT_RETRIES 16
+
#ifdef CONFIG_COMPACTION
/* Try memory compaction for high-order allocations before reclaim */
static struct page *
@@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ if (!order)
+ return false;
+
+ /*
+ * compaction considers all the zone as desperately out of memory
+ * so it doesn't really make much sense to retry except when the
+ * failure could be caused by weak migration mode.
+ */
+ if (compaction_failed(compact_result)) {
+ if (*migrate_mode == MIGRATE_ASYNC) {
+ *migrate_mode = MIGRATE_SYNC_LIGHT;
+ return true;
+ }
+ return false;
+ }
+
+ /*
+ * !costly allocations are really important and we have to make sure
+ * the compaction wasn't deferred or didn't bail out early due to locks
+ * contention before we go OOM. Still cap the reclaim retry loops with
+ * progress to prevent from looping forever and potential trashing.
+ */
+ if (order <= PAGE_ALLOC_COSTLY_ORDER) {
+ if (compaction_withdrawn(compact_result))
+ return true;
+ if (compaction_retries <= MAX_COMPACT_RETRIES)
+ return true;
+ }
+
+ return false;
+}
#else
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
@@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
{
return NULL;
}
+
+static inline bool
+should_compact_retry(unsigned int order, enum compact_result compact_result,
+ enum migrate_mode *migrate_mode,
+ int compaction_retries)
+{
+ return false;
+}
#endif /* CONFIG_COMPACTION */
/* Perform direct synchronous page reclaim */
@@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
enum compact_result compact_result;
+ int compaction_retries = 0;
int no_progress_loops = 0;
/*
@@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
compaction_failed(compact_result)))
goto nopage;
- /*
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse.
- */
- if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
- migration_mode = MIGRATE_SYNC_LIGHT;
+ if (order && compaction_made_progress(compact_result))
+ compaction_retries++;
/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
@@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
no_progress_loops))
goto retry;
+ /*
+ * It doesn't make any sense to retry for the compaction if the order-0
+ * reclaim is not able to make any progress because the current
+ * implementation of the compaction depends on the sufficient amount
+ * of free memory (see __compaction_suitable)
+ */
+ if (did_some_progress > 0 &&
+ should_compact_retry(order, compact_result,
+ &migration_mode, compaction_retries))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
@@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
noretry:
/*
- * High-order allocations do not necessarily loop after
- * direct reclaim and reclaim/compaction depends on compaction
- * being called after reclaim so call directly if necessary
+ * High-order allocations do not necessarily loop after direct reclaim
+ * and reclaim/compaction depends on compaction being called after
+ * reclaim so call directly if necessary.
+ * It can become very expensive to allocate transparent hugepages at
+ * fault, so use asynchronous memory compaction for THP unless it is
+ * khugepaged trying to collapse. All other requests should tolerate
+ * at least light sync migration.
*/
+ if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
+ migration_mode = MIGRATE_ASYNC;
+ else
+ migration_mode = MIGRATE_SYNC_LIGHT;
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
&compact_result);
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent
from a premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.
zone_reclaimable allows to rescan the reclaimable lists several
times and restart if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page. OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.
This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as OOM
which is per zonelist property. It is __alloc_pages_slowpath which knows
how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.
The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow. It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.
This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM. The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset counter for successful reclaim rounds.
Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order. The only difference is that if there was no progress
during the reclaim we rely on zone watermark check. This is more logical
thing to do than previous 1<<order attempts which were a result of
zone_reclaimable faking the progress.
[[email protected]: check classzone_idx for shrink_zone]
[[email protected]: separate the heuristic into should_reclaim_retry]
[[email protected]: use zone_page_state_snapshot for NR_FREE_PAGES]
[[email protected]: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/swap.h | 1 +
mm/page_alloc.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++-----
mm/vmscan.c | 25 +++----------
3 files changed, 97 insertions(+), 29 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d18b65c53dbb..b14a2bb33514 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
struct vm_area_struct *vma);
/* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d551fe326c33..38302c2041a3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3145,6 +3145,77 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
}
+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is consider as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+ struct alloc_context *ac, int alloc_flags,
+ bool did_some_progress, unsigned long pages_reclaimed,
+ int no_progress_loops)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure we converge to OOM if we cannot make any progress
+ * several times in the row.
+ */
+ if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;
+
+ if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (pages_reclaimed >= (1<<order))
+ return false;
+
+ if (did_some_progress)
+ return true;
+ }
+
+ /*
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
+ * If none of the target zones can satisfy our allocation request even
+ * if all reclaimable pages are considered then we are screwed and have
+ * to go OOM.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
+ ac->nodemask) {
+ unsigned long available;
+
+ available = zone_reclaimable_pages(zone);
+ available -= DIV_ROUND_UP(no_progress_loops * available,
+ MAX_RECLAIM_RETRIES);
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ /*
+ * Would the allocation succeed if we reclaimed the whole
+ * available?
+ */
+ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ ac->high_zoneidx, alloc_flags, available)) {
+ /* Wait for some write requests to complete then retry */
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ return true;
+ }
+ }
+
+ return false;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
@@ -3156,6 +3227,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
enum compact_result compact_result;
+ int no_progress_loops = 0;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3284,23 +3356,35 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;
- /* Keep reclaiming pages as long as there is reasonable progress */
- pages_reclaimed += did_some_progress;
- if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
- ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
- goto retry;
+ /*
+ * Do not retry costly high order allocations unless they are
+ * __GFP_REPEAT
+ */
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+ goto noretry;
+
+ if (did_some_progress) {
+ no_progress_loops = 0;
+ pages_reclaimed += did_some_progress;
+ } else {
+ no_progress_loops++;
}
+ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, pages_reclaimed,
+ no_progress_loops))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
goto got_pg;
/* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ no_progress_loops = 0;
goto retry;
+ }
noretry:
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3e6347e2a5fc..a2ba60aa7b88 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -191,7 +191,7 @@ static bool sane_reclaim(struct scan_control *sc)
}
#endif
-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
{
unsigned long nr;
@@ -2530,10 +2530,8 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
@@ -2541,7 +2539,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
unsigned long nr_soft_scanned;
gfp_t orig_mask;
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
- bool reclaimable = false;
/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2606,17 +2603,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
&nr_soft_scanned);
sc->nr_reclaimed += nr_soft_reclaimed;
sc->nr_scanned += nr_soft_scanned;
- if (nr_soft_reclaimed)
- reclaimable = true;
/* need some check for avoid more shrink_zone() */
}
- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
- reclaimable = true;
-
- if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
- reclaimable = true;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
}
/*
@@ -2624,8 +2614,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
* promoted it to __GFP_HIGHMEM.
*/
sc->gfp_mask = orig_mask;
-
- return reclaimable;
}
/*
@@ -2650,7 +2638,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
int initial_priority = sc->priority;
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
- bool zones_reclaimable;
retry:
delayacct_freepages_start();
@@ -2661,7 +2648,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ shrink_zones(zonelist, sc);
total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2708,10 +2695,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
goto retry;
}
- /* Any of the zones still reclaimable? Don't OOM. */
- if (zones_reclaimable)
- return 1;
-
return 0;
}
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
__alloc_pages_direct_compact communicates potential back off by two
variables:
- deferred_compaction tells that the compaction returned
COMPACT_DEFERRED
- contended_compaction is set when there is a contention on
zone->lock resp. zone->lru_lock locks
__alloc_pages_slowpath then backs of for THP allocation requests to
prevent from long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details into __alloc_pages_direct_compact.
This patch shouldn't introduce any functional changes.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
1 file changed, 31 insertions(+), 36 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06af8a757d52..350d13f3709b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
static struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
- enum compact_result compact_result;
struct page *page;
+ int contended_compaction;
if (!order)
return NULL;
current->flags |= PF_MEMALLOC;
- compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
- mode, contended_compaction);
+ *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
+ mode, &contended_compaction);
current->flags &= ~PF_MEMALLOC;
- switch (compact_result) {
- case COMPACT_DEFERRED:
- *deferred_compaction = true;
- /* fall-through */
- case COMPACT_SKIPPED:
+ if (*compact_result <= COMPACT_INACTIVE)
return NULL;
- default:
- break;
- }
/*
* At least in one zone compaction wasn't deferred or skipped, so let's
@@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTFAIL);
+ /*
+ * In all zones where compaction was attempted (and not
+ * deferred or skipped), lock contention has been detected.
+ * For THP allocation we do not want to disrupt the others
+ * so we fallback to base pages instead.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ *compact_result = COMPACT_CONTENDED;
+
+ /*
+ * If compaction was aborted due to need_resched(), we do not
+ * want to further increase allocation latency, unless it is
+ * khugepaged trying to collapse.
+ */
+ if (contended_compaction == COMPACT_CONTENDED_SCHED
+ && !(current->flags & PF_KTHREAD))
+ *compact_result = COMPACT_CONTENDED;
+
cond_resched();
return NULL;
@@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
static inline struct page *
__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
- enum migrate_mode mode, int *contended_compaction,
- bool *deferred_compaction)
+ enum migrate_mode mode, enum compact_result *compact_result)
{
return NULL;
}
@@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
- bool deferred_compaction = false;
- int contended_compaction = COMPACT_CONTENDED_NONE;
+ enum compact_result compact_result;
/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
@@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* to heavily disrupt the system, so we fail the allocation
* instead of entering direct reclaim.
*/
- if (deferred_compaction)
- goto nopage;
-
- /*
- * In all zones where compaction was attempted (and not
- * deferred or skipped), lock contention has been detected.
- * For THP allocation we do not want to disrupt the others
- * so we fallback to base pages instead.
- */
- if (contended_compaction == COMPACT_CONTENDED_LOCK)
+ if (compact_result == COMPACT_DEFERRED)
goto nopage;
/*
- * If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
+ * Compaction is contended so rather back off than cause
+ * excessive stalls.
*/
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
+ if(compact_result == COMPACT_CONTENDED)
goto nopage;
}
@@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
ac, migration_mode,
- &contended_compaction,
- &deferred_compaction);
+ &compact_result);
if (page)
goto got_pg;
nopage:
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
try_to_compact_pages can currently return COMPACT_SKIPPED even when the
compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.
Make sure we distinguish those two states properly and switch their
ordering in the enum. This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.
As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.
Acked-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/compaction.h | 7 +++++--
include/trace/events/compaction.h | 2 +-
mm/compaction.c | 8 +++++---
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4458fd94170f..7e177d111c39 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -4,13 +4,16 @@
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
- /* compaction didn't start as it was deferred due to past failures */
- COMPACT_DEFERRED,
/*
* compaction didn't start as it was not possible or direct reclaim
* was more suitable
*/
COMPACT_SKIPPED,
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /* compaction not active last round */
+ COMPACT_INACTIVE = COMPACT_DEFERRED,
+
/* compaction should continue to another pageblock */
COMPACT_CONTINUE,
/*
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index e215bf68f521..6ba16c86d7db 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -10,8 +10,8 @@
#include <trace/events/mmflags.h>
#define COMPACTION_STATUS \
- EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_SKIPPED, "skipped") \
+ EM( COMPACT_DEFERRED, "deferred") \
EM( COMPACT_CONTINUE, "continue") \
EM( COMPACT_PARTIAL, "partial") \
EM( COMPACT_COMPLETE, "complete") \
diff --git a/mm/compaction.c b/mm/compaction.c
index b06de27b7f72..13709e33a2fc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- enum compact_result rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_SKIPPED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
*contended = COMPACT_CONTENDED_NONE;
@@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
enum compact_result status;
int zone_contended;
- if (compaction_deferred(zone, order))
+ if (compaction_deferred(zone, order)) {
+ rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
+ }
status = compact_zone_order(zone, order, gfp_mask, mode,
&zone_contended, alloc_flags,
@@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
* If at least one zone wasn't deferred or skipped, we report if all
* zones that were tried were lock contended.
*/
- if (rc > COMPACT_SKIPPED && all_zones_contended)
+ if (rc > COMPACT_INACTIVE && all_zones_contended)
*contended = COMPACT_CONTENDED_LOCK;
return rc;
--
2.8.0.rc3
From: Michal Hocko <[email protected]>
compaction code is doing weird dances between
COMPACT_FOO -> int -> unsigned long
but there doesn't seem to be any reason for that. All functions which
return/use one of those constants are not expecting any other value
so it really makes sense to define an enum for them and make it clear
that no other values are expected.
This is a pure cleanup and shouldn't introduce any functional changes.
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
include/linux/compaction.h | 45 +++++++++++++++++++++++++++------------------
mm/compaction.c | 27 ++++++++++++++-------------
mm/page_alloc.c | 2 +-
3 files changed, 42 insertions(+), 32 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d7c8de583a23..4458fd94170f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,21 +2,29 @@
#define _LINUX_COMPACTION_H
/* Return values for compact_zone() and try_to_compact_pages() */
-/* compaction didn't start as it was deferred due to past failures */
-#define COMPACT_DEFERRED 0
-/* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED 1
-/* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE 2
-/* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL 3
-/* The full zone was compacted */
-#define COMPACT_COMPLETE 4
-/* For more detailed tracepoint output */
-#define COMPACT_NO_SUITABLE_PAGE 5
-#define COMPACT_NOT_SUITABLE_ZONE 6
-#define COMPACT_CONTENDED 7
/* When adding new states, please adjust include/trace/events/compaction.h */
+enum compact_result {
+ /* compaction didn't start as it was deferred due to past failures */
+ COMPACT_DEFERRED,
+ /*
+ * compaction didn't start as it was not possible or direct reclaim
+ * was more suitable
+ */
+ COMPACT_SKIPPED,
+ /* compaction should continue to another pageblock */
+ COMPACT_CONTINUE,
+ /*
+ * direct compaction partially compacted a zone and there are suitable
+ * pages
+ */
+ COMPACT_PARTIAL,
+ /* The full zone was compacted */
+ COMPACT_COMPLETE,
+ /* For more detailed tracepoint output */
+ COMPACT_NO_SUITABLE_PAGE,
+ COMPACT_NOT_SUITABLE_ZONE,
+ COMPACT_CONTENDED,
+};
/* Used to signal whether compaction detected need_sched() or lock contention */
/* No contention detected */
@@ -38,12 +46,13 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
extern int sysctl_compact_unevictable_allowed;
extern int fragmentation_index(struct zone *zone, unsigned int order);
-extern unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
+ unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
-extern unsigned long compaction_suitable(struct zone *zone, int order,
+extern enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx);
extern void defer_compaction(struct zone *zone, int order);
@@ -57,7 +66,7 @@ extern void kcompactd_stop(int nid);
extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
#else
-static inline unsigned long try_to_compact_pages(gfp_t gfp_mask,
+static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
unsigned int order, int alloc_flags,
const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
@@ -73,7 +82,7 @@ static inline void reset_isolation_suitable(pg_data_t *pgdat)
{
}
-static inline unsigned long compaction_suitable(struct zone *zone, int order,
+static inline enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
return COMPACT_SKIPPED;
diff --git a/mm/compaction.c b/mm/compaction.c
index 8cc495042303..8ae7b1c46c72 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1281,7 +1281,7 @@ static inline bool is_via_compact_memory(int order)
return order == -1;
}
-static int __compact_finished(struct zone *zone, struct compact_control *cc,
+static enum compact_result __compact_finished(struct zone *zone, struct compact_control *cc,
const int migratetype)
{
unsigned int order;
@@ -1344,8 +1344,9 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
return COMPACT_NO_SUITABLE_PAGE;
}
-static int compact_finished(struct zone *zone, struct compact_control *cc,
- const int migratetype)
+static enum compact_result compact_finished(struct zone *zone,
+ struct compact_control *cc,
+ const int migratetype)
{
int ret;
@@ -1364,7 +1365,7 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
* COMPACT_PARTIAL - If the allocation would succeed without compaction
* COMPACT_CONTINUE - If compaction should run now
*/
-static unsigned long __compaction_suitable(struct zone *zone, int order,
+static enum compact_result __compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
int fragindex;
@@ -1409,10 +1410,10 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
return COMPACT_CONTINUE;
}
-unsigned long compaction_suitable(struct zone *zone, int order,
+enum compact_result compaction_suitable(struct zone *zone, int order,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
trace_mm_compaction_suitable(zone, order, ret);
@@ -1422,9 +1423,9 @@ unsigned long compaction_suitable(struct zone *zone, int order,
return ret;
}
-static int compact_zone(struct zone *zone, struct compact_control *cc)
+static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
{
- int ret;
+ enum compact_result ret;
unsigned long start_pfn = zone->zone_start_pfn;
unsigned long end_pfn = zone_end_pfn(zone);
const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
@@ -1588,11 +1589,11 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
-static unsigned long compact_zone_order(struct zone *zone, int order,
+static enum compact_result compact_zone_order(struct zone *zone, int order,
gfp_t gfp_mask, enum migrate_mode mode, int *contended,
int alloc_flags, int classzone_idx)
{
- unsigned long ret;
+ enum compact_result ret;
struct compact_control cc = {
.nr_freepages = 0,
.nr_migratepages = 0,
@@ -1631,7 +1632,7 @@ int sysctl_extfrag_threshold = 500;
*
* This is the main entry point for direct page compaction.
*/
-unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
+enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int alloc_flags, const struct alloc_context *ac,
enum migrate_mode mode, int *contended)
{
@@ -1639,7 +1640,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
int may_perform_io = gfp_mask & __GFP_IO;
struct zoneref *z;
struct zone *zone;
- int rc = COMPACT_DEFERRED;
+ enum compact_result rc = COMPACT_DEFERRED;
int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
*contended = COMPACT_CONTENDED_NONE;
@@ -1653,7 +1654,7 @@ unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
/* Compact each zone in the list */
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
ac->nodemask) {
- int status;
+ enum compact_result status;
int zone_contended;
if (compaction_deferred(zone, order))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c4efafc38273..06af8a757d52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2947,7 +2947,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
enum migrate_mode mode, int *contended_compaction,
bool *deferred_compaction)
{
- unsigned long compact_result;
+ enum compact_result compact_result;
struct page *page;
if (!order)
--
2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> while playing with the oom detection rework [1] I have noticed
> that my heavy order-9 (hugetlb) load close to OOM ended up in an
> endless loop where the reclaim hasn't made any progress but
> did_some_progress didn't reflect that and compaction_suitable
> was backing off because no zone is above low wmark + 1 << order.
>
> It turned out that this is in fact an old standing bug in compaction_ready
> which ignores the requested_highidx and did the watermark check for
> 0 classzone_idx. This succeeds for zone DMA most of the time as the zone
> is mostly unused because of lowmem protection. This also means that the
> OOM killer wouldn't be triggered for higher order requests even when
> there is no reclaim progress and we essentially rely on order-0 request
> to find this out.
Thanks.
> This has been broken in one way or another since
> fe4b1b244bdb ("mm: vmscan: when reclaiming for compaction, ensure there
> are sufficient free pages available") but only since 7335084d446b ("mm:
> vmscan: do not OOM if aborting reclaim to start compaction") we are not
> invoking the OOM killer based on the wrong calculation.
>
> Propagate requested_highidx down to compaction_ready and use it for both
> the watermak check and compaction_suitable to fix this issue.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> mm/vmscan.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c839adc13efd..3e6347e2a5fc 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2482,7 +2482,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
> * Returns true if compaction should go ahead for a high-order request, or
> * the high-order allocation would succeed without compaction.
> */
> -static inline bool compaction_ready(struct zone *zone, int order)
> +static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
> {
> unsigned long balance_gap, watermark;
> bool watermark_ok;
> @@ -2496,7 +2496,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
> balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
> zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
> watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
> - watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
> + watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
>
> /*
> * If compaction is deferred, reclaim up to a point where
> @@ -2509,7 +2509,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
> * If compaction is not ready to start and allocation is not likely
> * to succeed without it, then keep reclaiming.
> */
> - if (compaction_suitable(zone, order, 0, 0) == COMPACT_SKIPPED)
> + if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
> return false;
>
> return watermark_ok;
> @@ -2589,7 +2589,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> if (IS_ENABLED(CONFIG_COMPACTION) &&
> sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> zonelist_zone_idx(z) <= requested_highidx &&
> - compaction_ready(zone, sc->order)) {
> + compaction_ready(zone, sc->order, requested_highidx)) {
> sc->compaction_ready = true;
> continue;
> }
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> COMPACT_COMPLETE now means that compaction and free scanner met. This is
> not very useful information if somebody just wants to use this feedback
> and make any decisions based on that. The current caller might be a poor
> guy who just happened to scan tiny portion of the zone and that could be
> the reason no suitable pages were compacted. Make sure we distinguish
> the full and partial zone walks.
>
> Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
> and be optimistic in retrying.
>
> The existing users of COMPACT_COMPLETE are conservatively changed to
> use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
> reconsidered and only defer the compaction only for COMPACT_COMPLETE
> with the new semantic.
>
> This patch shouldn't introduce any functional changes.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> include/linux/compaction.h | 10 +++++++++-
> include/trace/events/compaction.h | 1 +
> mm/compaction.c | 14 +++++++++++---
> mm/internal.h | 1 +
> 4 files changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7e177d111c39..7c4de92d12cc 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -21,7 +21,15 @@ enum compact_result {
> * pages
> */
> COMPACT_PARTIAL,
> - /* The full zone was compacted */
> + /*
> + * direct compaction has scanned part of the zone but wasn't successfull
> + * to compact suitable pages.
> + */
> + COMPACT_PARTIAL_SKIPPED,
> + /*
> + * The full zone was compacted scanned but wasn't successfull to compact
> + * suitable pages.
> + */
> COMPACT_COMPLETE,
> /* For more detailed tracepoint output */
> COMPACT_NO_SUITABLE_PAGE,
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index 6ba16c86d7db..36e2d6fb1360 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -14,6 +14,7 @@
> EM( COMPACT_DEFERRED, "deferred") \
> EM( COMPACT_CONTINUE, "continue") \
> EM( COMPACT_PARTIAL, "partial") \
> + EM( COMPACT_PARTIAL_SKIPPED, "partial_skipped") \
> EM( COMPACT_COMPLETE, "complete") \
> EM( COMPACT_NO_SUITABLE_PAGE, "no_suitable_page") \
> EM( COMPACT_NOT_SUITABLE_ZONE, "not_suitable_zone") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 13709e33a2fc..e2e487cea5ea 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1304,7 +1304,10 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
> if (cc->direct_compaction)
> zone->compact_blockskip_flush = true;
>
> - return COMPACT_COMPLETE;
> + if (cc->whole_zone)
> + return COMPACT_COMPLETE;
> + else
> + return COMPACT_PARTIAL_SKIPPED;
> }
>
> if (is_via_compact_memory(cc->order))
> @@ -1463,6 +1466,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
> zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
> zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> }
> +
> + if (cc->migrate_pfn == start_pfn)
> + cc->whole_zone = true;
> +
> cc->last_migrated_pfn = 0;
>
> trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
> @@ -1693,7 +1700,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> goto break_loop;
> }
>
> - if (mode != MIGRATE_ASYNC && status == COMPACT_COMPLETE) {
> + if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
> + status == COMPACT_PARTIAL_SKIPPED)) {
> /*
> * We think that allocation won't succeed in this zone
> * so we defer compaction there. If it ends up
> @@ -1939,7 +1947,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
> cc.classzone_idx, 0)) {
> success = true;
> compaction_defer_reset(zone, cc.order, false);
> - } else if (status == COMPACT_COMPLETE) {
> + } else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
> /*
> * We use sync migration mode here, so we defer like
> * sync direct compaction does.
> diff --git a/mm/internal.h b/mm/internal.h
> index e9aacea1a0d1..4423dfe69382 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -182,6 +182,7 @@ struct compact_control {
> enum migrate_mode mode; /* Async or sync migration mode */
> bool ignore_skip_hint; /* Scan blocks even if marked skip */
> bool direct_compaction; /* False from kcompactd or /proc/... */
> + bool whole_zone; /* Whole zone has been scanned */
> int order; /* order a direct compactor needs */
> const gfp_t gfp_mask; /* gfp mask of a direct compactor */
> const int alloc_flags; /* alloc flags of a direct compactor */
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> compaction_result will be used as the primary feedback channel for
> compaction users. At the same time try_to_compact_pages (and potentially
> others) assume a certain ordering where a more specific feedback takes
> precendence. This gets a bit awkward when we have conflicting feedback
> from different zones. E.g one returing COMPACT_COMPLETE meaning the full
> zone has been scanned without any outcome while other returns with
> COMPACT_PARTIAL aka made some progress. The caller should get
> COMPACT_PARTIAL because that means that the compaction still can make
> some progress. The same applies for COMPACT_PARTIAL vs.
> COMPACT_PARTIAL_SKIPPED. Reorder PARTIAL to be the largest one so the
> larger the value is the more progress we have done.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> include/linux/compaction.h | 26 ++++++++++++++++----------
> 1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 7c4de92d12cc..a7b9091ff349 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -4,6 +4,8 @@
> /* Return values for compact_zone() and try_to_compact_pages() */
> /* When adding new states, please adjust include/trace/events/compaction.h */
> enum compact_result {
> + /* For more detailed tracepoint output - internal to compaction */
> + COMPACT_NOT_SUITABLE_ZONE,
> /*
> * compaction didn't start as it was not possible or direct reclaim
> * was more suitable
> @@ -11,30 +13,34 @@ enum compact_result {
> COMPACT_SKIPPED,
> /* compaction didn't start as it was deferred due to past failures */
> COMPACT_DEFERRED,
> +
> /* compaction not active last round */
> COMPACT_INACTIVE = COMPACT_DEFERRED,
>
> + /* For more detailed tracepoint output - internal to compaction */
> + COMPACT_NO_SUITABLE_PAGE,
> /* compaction should continue to another pageblock */
> COMPACT_CONTINUE,
> +
> /*
> - * direct compaction partially compacted a zone and there are suitable
> - * pages
> + * The full zone was compacted scanned but wasn't successfull to compact
> + * suitable pages.
> */
> - COMPACT_PARTIAL,
> + COMPACT_COMPLETE,
> /*
> * direct compaction has scanned part of the zone but wasn't successfull
> * to compact suitable pages.
> */
> COMPACT_PARTIAL_SKIPPED,
> +
> + /* compaction terminated prematurely due to lock contentions */
> + COMPACT_CONTENDED,
> +
> /*
> - * The full zone was compacted scanned but wasn't successfull to compact
> - * suitable pages.
> + * direct compaction partially compacted a zone and there might be
> + * suitable pages
> */
> - COMPACT_COMPLETE,
> - /* For more detailed tracepoint output */
> - COMPACT_NO_SUITABLE_PAGE,
> - COMPACT_NOT_SUITABLE_ZONE,
> - COMPACT_CONTENDED,
> + COMPACT_PARTIAL,
> };
>
> /* Used to signal whether compaction detected need_sched() or lock contention */
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> __alloc_pages_direct_compact communicates potential back off by two
> variables:
> - deferred_compaction tells that the compaction returned
> COMPACT_DEFERRED
> - contended_compaction is set when there is a contention on
> zone->lock resp. zone->lru_lock locks
>
> __alloc_pages_slowpath then backs of for THP allocation requests to
> prevent from long stalls. This is rather messy and it would be much
> cleaner to return a single compact result value and hide all the nasty
> details into __alloc_pages_direct_compact.
>
> This patch shouldn't introduce any functional changes.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> mm/page_alloc.c | 67 ++++++++++++++++++++++++++-------------------------------
> 1 file changed, 31 insertions(+), 36 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 06af8a757d52..350d13f3709b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2944,29 +2944,21 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> static struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> - enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum migrate_mode mode, enum compact_result *compact_result)
> {
> - enum compact_result compact_result;
> struct page *page;
> + int contended_compaction;
>
> if (!order)
> return NULL;
>
> current->flags |= PF_MEMALLOC;
> - compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> - mode, contended_compaction);
> + *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
> + mode, &contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> - switch (compact_result) {
> - case COMPACT_DEFERRED:
> - *deferred_compaction = true;
> - /* fall-through */
> - case COMPACT_SKIPPED:
> + if (*compact_result <= COMPACT_INACTIVE)
> return NULL;
> - default:
> - break;
> - }
>
> /*
> * At least in one zone compaction wasn't deferred or skipped, so let's
> @@ -2992,6 +2984,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> */
> count_vm_event(COMPACTFAIL);
>
> + /*
> + * In all zones where compaction was attempted (and not
> + * deferred or skipped), lock contention has been detected.
> + * For THP allocation we do not want to disrupt the others
> + * so we fallback to base pages instead.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_LOCK)
> + *compact_result = COMPACT_CONTENDED;
> +
> + /*
> + * If compaction was aborted due to need_resched(), we do not
> + * want to further increase allocation latency, unless it is
> + * khugepaged trying to collapse.
> + */
> + if (contended_compaction == COMPACT_CONTENDED_SCHED
> + && !(current->flags & PF_KTHREAD))
> + *compact_result = COMPACT_CONTENDED;
> +
> cond_resched();
>
> return NULL;
> @@ -3000,8 +3010,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> static inline struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> int alloc_flags, const struct alloc_context *ac,
> - enum migrate_mode mode, int *contended_compaction,
> - bool *deferred_compaction)
> + enum migrate_mode mode, enum compact_result *compact_result)
> {
> return NULL;
> }
> @@ -3146,8 +3155,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed = 0;
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> - bool deferred_compaction = false;
> - int contended_compaction = COMPACT_CONTENDED_NONE;
> + enum compact_result compact_result;
>
> /*
> * In the slowpath, we sanity check order to avoid ever trying to
> @@ -3245,8 +3253,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> */
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
> migration_mode,
> - &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
>
> @@ -3259,25 +3266,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * to heavily disrupt the system, so we fail the allocation
> * instead of entering direct reclaim.
> */
> - if (deferred_compaction)
> - goto nopage;
> -
> - /*
> - * In all zones where compaction was attempted (and not
> - * deferred or skipped), lock contention has been detected.
> - * For THP allocation we do not want to disrupt the others
> - * so we fallback to base pages instead.
> - */
> - if (contended_compaction == COMPACT_CONTENDED_LOCK)
> + if (compact_result == COMPACT_DEFERRED)
> goto nopage;
>
> /*
> - * If compaction was aborted due to need_resched(), we do not
> - * want to further increase allocation latency, unless it is
> - * khugepaged trying to collapse.
> + * Compaction is contended so rather back off than cause
> + * excessive stalls.
> */
> - if (contended_compaction == COMPACT_CONTENDED_SCHED
> - && !(current->flags & PF_KTHREAD))
> + if(compact_result == COMPACT_CONTENDED)
> goto nopage;
> }
>
> @@ -3325,8 +3321,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> */
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> ac, migration_mode,
> - &contended_compaction,
> - &deferred_compaction);
> + &compact_result);
> if (page)
> goto got_pg;
> nopage:
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> Compaction can provide a wild variation of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> - compaction_made_progress - compaction was active and made some
> progress.
> - compaction_failed - compaction failed and further attempts to
> invoke it would most probably fail and therefore it is not
> worth retrying
> - compaction_withdrawn - compaction wasn't invoked for an
> implementation specific reasons. In the current implementation
> it means that the compaction was deferred, contended or the
> page scanners met too early without any progress. Retrying is
> still worthwhile.
>
> [[email protected]: do not change thp back off behavior]
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> include/linux/compaction.h | 79 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 79 insertions(+)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a7b9091ff349..a002ca55c513 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -78,6 +78,70 @@ extern void compaction_defer_reset(struct zone *zone, int order,
> bool alloc_success);
> extern bool compaction_restarting(struct zone *zone, int order);
>
> +/* Compaction has made some progress and retrying makes sense */
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + /*
> + * Even though this might sound confusing this in fact tells us
> + * that the compaction successfully isolated and migrated some
> + * pageblocks.
> + */
> + if (result == COMPACT_PARTIAL)
> + return true;
> +
> + return false;
> +}
> +
> +/* Compaction has failed and it doesn't make much sense to keep retrying. */
> +static inline bool compaction_failed(enum compact_result result)
> +{
> + /* All zones where scanned completely and still not result. */
s/where/were/
> + if (result == COMPACT_COMPLETE)
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Compaction has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> + /*
> + * Compaction backed off due to watermark checks for order-0
> + * so the regular reclaim has to try harder and reclaim something.
> + */
> + if (result == COMPACT_SKIPPED)
> + return true;
> +
> + /*
> + * If compaction is deferred for high-order allocations, it is
> + * because sync compaction recently failed. If this is the case
> + * and the caller requested a THP allocation, we do not want
> + * to heavily disrupt the system, so we fail the allocation
> + * instead of entering direct reclaim.
> + */
> + if (result == COMPACT_DEFERRED)
> + return true;
> +
> + /*
> + * If compaction in async mode encounters contention or blocks higher
> + * priority task we back off early rather than cause stalls.
> + */
> + if (result == COMPACT_CONTENDED)
> + return true;
> +
> + /*
> + * Page scanners have met but we haven't scanned full zones so this
> + * is a back off in fact.
> + */
> + if (result == COMPACT_PARTIAL_SKIPPED)
> + return true;
> +
> + return false;
> +}
> +
> extern int kcompactd_run(int nid);
> extern void kcompactd_stop(int nid);
> extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> @@ -114,6 +178,21 @@ static inline bool compaction_deferred(struct zone *zone, int order)
> return true;
> }
>
> +static inline bool compaction_made_progress(enum compact_result result)
> +{
> + return false;
> +}
> +
> +static inline bool compaction_failed(enum compact_result result)
> +{
> + return false;
> +}
> +
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
> + return true;
> +}
> +
> static inline int kcompactd_run(int nid)
> {
> return 0;
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> THP requests skip the direct reclaim if the compaction is either
> deferred or contended to reduce stalls which wouldn't help the
> allocation success anyway. These checks are ignoring other potential
> feedback modes which we have available now.
>
> It clearly doesn't make much sense to go and reclaim few pages if the
> previous compaction has failed.
>
> We can also simplify the check by using compaction_withdrawn which
> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> is however covering more reasons why the compaction was withdrawn.
> None of them should be a problem for the THP case though.
>
> It is safe to back of if we see COMPACT_SKIPPED because that means
> that compaction_suitable failed and a single round of the reclaim is
> unlikely to make any difference here. We would have to be close to
> the low watermark to reclaim enough and even then there is no guarantee
> that the compaction would make any progress while the direct reclaim
> would have caused the stall.
>
> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> have only seen a part of the zone so a retry would make some sense. But
> it would be a compaction retry not a reclaim retry to perform. We are
> not doing that and that might indeed lead to situations where THP fails
> but this should happen only rarely and it would be really hard to
> measure.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> mm/page_alloc.c | 27 ++++++++-------------------
> 1 file changed, 8 insertions(+), 19 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 350d13f3709b..d551fe326c33 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3257,25 +3257,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (page)
> goto got_pg;
>
> - /* Checks for THP-specific high-order allocations */
> - if (is_thp_gfp_mask(gfp_mask)) {
> - /*
> - * If compaction is deferred for high-order allocations, it is
> - * because sync compaction recently failed. If this is the case
> - * and the caller requested a THP allocation, we do not want
> - * to heavily disrupt the system, so we fail the allocation
> - * instead of entering direct reclaim.
> - */
> - if (compact_result == COMPACT_DEFERRED)
> - goto nopage;
> -
> - /*
> - * Compaction is contended so rather back off than cause
> - * excessive stalls.
> - */
> - if(compact_result == COMPACT_CONTENDED)
> - goto nopage;
> - }
> + /*
> + * Checks for THP-specific high-order allocations and back off
> + * if the the compaction backed off or failed
> + */
Alternatively,
/*
* Check THP allocations and back off
* if the compaction bailed out or failed
*/
> + if (is_thp_gfp_mask(gfp_mask) &&
> + (compaction_withdrawn(compact_result) ||
> + compaction_failed(compact_result)))
> + goto nopage;
>
> /*
> * It can become very expensive to allocate transparent hugepages at
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> try_to_compact_pages can currently return COMPACT_SKIPPED even when the
> compaction is defered for some zone just because zone DMA is skipped
> in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED
> basically unusable for the page allocator as a feedback mechanism.
>
> Make sure we distinguish those two states properly and switch their
> ordering in the enum. This would mean that the COMPACT_SKIPPED will be
> returned only when all eligible zones are skipped.
>
> As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
> will be more precise and we would bail out rather than reclaim.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> include/linux/compaction.h | 7 +++++--
> include/trace/events/compaction.h | 2 +-
> mm/compaction.c | 8 +++++---
> 3 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4458fd94170f..7e177d111c39 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -4,13 +4,16 @@
> /* Return values for compact_zone() and try_to_compact_pages() */
> /* When adding new states, please adjust include/trace/events/compaction.h */
> enum compact_result {
> - /* compaction didn't start as it was deferred due to past failures */
> - COMPACT_DEFERRED,
> /*
> * compaction didn't start as it was not possible or direct reclaim
> * was more suitable
> */
> COMPACT_SKIPPED,
> + /* compaction didn't start as it was deferred due to past failures */
> + COMPACT_DEFERRED,
> + /* compaction not active last round */
> + COMPACT_INACTIVE = COMPACT_DEFERRED,
> +
> /* compaction should continue to another pageblock */
> COMPACT_CONTINUE,
> /*
> diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
> index e215bf68f521..6ba16c86d7db 100644
> --- a/include/trace/events/compaction.h
> +++ b/include/trace/events/compaction.h
> @@ -10,8 +10,8 @@
> #include <trace/events/mmflags.h>
>
> #define COMPACTION_STATUS \
> - EM( COMPACT_DEFERRED, "deferred") \
> EM( COMPACT_SKIPPED, "skipped") \
> + EM( COMPACT_DEFERRED, "deferred") \
> EM( COMPACT_CONTINUE, "continue") \
> EM( COMPACT_PARTIAL, "partial") \
> EM( COMPACT_COMPLETE, "complete") \
> diff --git a/mm/compaction.c b/mm/compaction.c
> index b06de27b7f72..13709e33a2fc 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1637,7 +1637,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> int may_perform_io = gfp_mask & __GFP_IO;
> struct zoneref *z;
> struct zone *zone;
> - enum compact_result rc = COMPACT_DEFERRED;
> + enum compact_result rc = COMPACT_SKIPPED;
> int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
>
> *contended = COMPACT_CONTENDED_NONE;
> @@ -1654,8 +1654,10 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> enum compact_result status;
> int zone_contended;
>
> - if (compaction_deferred(zone, order))
> + if (compaction_deferred(zone, order)) {
> + rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> continue;
> + }
>
> status = compact_zone_order(zone, order, gfp_mask, mode,
> &zone_contended, alloc_flags,
> @@ -1726,7 +1728,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> * If at least one zone wasn't deferred or skipped, we report if all
> * zones that were tried were lock contended.
> */
> - if (rc > COMPACT_SKIPPED && all_zones_contended)
> + if (rc > COMPACT_INACTIVE && all_zones_contended)
> *contended = COMPACT_CONTENDED_LOCK;
>
> return rc;
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> should_reclaim_retry will give up retries for higher order allocations
> if none of the eligible zones has any requested or higher order pages
> available even if we pass the watermak check for order-0. This is done
> because there is no guarantee that the reclaimable and currently free
> pages will form the required order.
>
> This can, however, lead to situations were the high-order request (e.g.
s/were/where/
> order-2 required for the stack allocation during fork) will trigger
> OOM too early - e.g. after the first reclaim/compaction round. Such a
> system would have to be highly fragmented and there is no guarantee
> further reclaim/compaction attempts would help but at least make sure
> that the compaction was active before we go OOM and keep retrying even
> if should_reclaim_retry tells us to oom if
> - the last compaction round backed off or
> - we haven't completed at least MAX_COMPACT_RETRIES active
> compaction rounds.
>
> The first rule ensures that the very last attempt for compaction
> was not ignored while the second guarantees that the compaction has done
> some work. Multiple retries might be needed to prevent occasional
> pigggy backing of other contexts to steal the compacted pages before
> the current context manages to retry to allocate them.
>
> compaction_failed() is taken as a final word from the compaction that
> the retry doesn't make much sense. We have to be careful though because
> the first compaction round is MIGRATE_ASYNC which is rather weak as it
> ignores pages under writeback and gives up too easily in other
> situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
> has been used before we give up. With this logic in place we do not have
> to increase the migration mode unconditionally and rather do it only if
> the compaction failed for the weaker mode. A nice side effect is that
> the stronger migration mode is used only when really needed so this has
> a potential of smaller latencies in some cases.
>
> Please note that the compaction doesn't tell us much about how
> successful it was when returning compaction_made_progress so we just
> have to blindly trust that another retry is worthwhile and cap the
> number to something reasonable to guarantee a convergence.
>
> If the given number of successful retries is not sufficient for a
> reasonable workloads we should focus on the collected compaction
> tracepoints data and try to address the issue in the compaction code.
> If this is not feasible we can increase the retries limit.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> mm/page_alloc.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 77 insertions(+), 10 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b78936eca70..bb4df1be0d43 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2939,6 +2939,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
> return page;
> }
>
> +
> +/*
> + * Maximum number of compaction retries wit a progress before OOM
> + * killer is consider as the only way to move forward.
> + */
> +#define MAX_COMPACT_RETRIES 16
> +
> #ifdef CONFIG_COMPACTION
> /* Try memory compaction for high-order allocations before reclaim */
> static struct page *
> @@ -3006,6 +3013,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>
> return NULL;
> }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> + enum migrate_mode *migrate_mode,
> + int compaction_retries)
> +{
> + if (!order)
> + return false;
> +
> + /*
> + * compaction considers all the zone as desperately out of memory
> + * so it doesn't really make much sense to retry except when the
> + * failure could be caused by weak migration mode.
> + */
> + if (compaction_failed(compact_result)) {
> + if (*migrate_mode == MIGRATE_ASYNC) {
> + *migrate_mode = MIGRATE_SYNC_LIGHT;
> + return true;
> + }
> + return false;
> + }
> +
> + /*
> + * !costly allocations are really important and we have to make sure
> + * the compaction wasn't deferred or didn't bail out early due to locks
> + * contention before we go OOM. Still cap the reclaim retry loops with
> + * progress to prevent from looping forever and potential trashing.
> + */
> + if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> + if (compaction_withdrawn(compact_result))
> + return true;
> + if (compaction_retries <= MAX_COMPACT_RETRIES)
> + return true;
> + }
> +
> + return false;
> +}
> #else
> static inline struct page *
> __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> @@ -3014,6 +3058,14 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> {
> return NULL;
> }
> +
> +static inline bool
> +should_compact_retry(unsigned int order, enum compact_result compact_result,
> + enum migrate_mode *migrate_mode,
> + int compaction_retries)
> +{
> + return false;
> +}
> #endif /* CONFIG_COMPACTION */
>
> /* Perform direct synchronous page reclaim */
> @@ -3260,6 +3312,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> enum compact_result compact_result;
> + int compaction_retries = 0;
> int no_progress_loops = 0;
>
> /*
> @@ -3371,13 +3424,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> compaction_failed(compact_result)))
> goto nopage;
>
> - /*
> - * It can become very expensive to allocate transparent hugepages at
> - * fault, so use asynchronous memory compaction for THP unless it is
> - * khugepaged trying to collapse.
> - */
> - if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
> - migration_mode = MIGRATE_SYNC_LIGHT;
> + if (order && compaction_made_progress(compact_result))
> + compaction_retries++;
>
> /* Try direct reclaim and then allocating */
> page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> @@ -3408,6 +3456,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> no_progress_loops))
> goto retry;
>
> + /*
> + * It doesn't make any sense to retry for the compaction if the order-0
> + * reclaim is not able to make any progress because the current
> + * implementation of the compaction depends on the sufficient amount
> + * of free memory (see __compaction_suitable)
> + */
> + if (did_some_progress > 0 &&
> + should_compact_retry(order, compact_result,
> + &migration_mode, compaction_retries))
> + goto retry;
> +
> /* Reclaim has failed us, start killing things */
> page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
> if (page)
> @@ -3421,10 +3480,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>
> noretry:
> /*
> - * High-order allocations do not necessarily loop after
> - * direct reclaim and reclaim/compaction depends on compaction
> - * being called after reclaim so call directly if necessary
> + * High-order allocations do not necessarily loop after direct reclaim
> + * and reclaim/compaction depends on compaction being called after
> + * reclaim so call directly if necessary.
> + * It can become very expensive to allocate transparent hugepages at
> + * fault, so use asynchronous memory compaction for THP unless it is
> + * khugepaged trying to collapse. All other requests should tolerate
> + * at least light sync migration.
> */
> + if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> + migration_mode = MIGRATE_ASYNC;
> + else
> + migration_mode = MIGRATE_SYNC_LIGHT;
> page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> ac, migration_mode,
> &compact_result);
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
> should_reclaim_retry currently where we decide to not retry after at
> least order worth of pages were reclaimed or the watermark check for at
> least one zone would succeed after reclaiming all pages if the reclaim
> hasn't made any progress. Compaction feedback is mostly ignored and we
> just try to make sure that the compaction did at least something before
> giving up.
>
> The first condition was added by a41f24ea9fd6 ("page allocator: smarter
> retry of costly-order allocations) and it assumed that lumpy reclaim
> could have created a page of the sufficient order. Lumpy reclaim,
> has been removed quite some time ago so the assumption doesn't hold
> anymore. Remove the check for the number of reclaimed pages and rely
> on the compaction feedback solely. should_reclaim_retry now only
> makes sure that we keep retrying reclaim for high order pages only
> if they are hidden by watermaks so order-0 reclaim makes really sense.
>
> should_compact_retry now keeps retrying even for the costly allocations.
> The number of retries is reduced wrt. !costly requests because they are
> less important and harder to grant and so their pressure shouldn't cause
> contention for other requests or cause an over reclaim. We also do not
> reset no_progress_loops for costly request to make sure we do not keep
> reclaiming too agressively.
>
> This has been tested by running a process which fragments memory:
> - compact memory
> - mmap large portion of the memory (1920M on 2GRAM machine with 2G
> of swapspace)
> - MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
> steps until certain amount of memory is freed (250M in my test)
> and reduce the step to (step / 2) + 1 after reaching the end of
> the mapping
> - then run a script which populates the page cache 2G (MemTotal)
> from /dev/zero to a new file
> And then tries to allocate
> nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
> huge pages.
>
> root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run
> /root/alloc_hugepages.sh 1920M 250M
> Node 0, zone DMA 31 28 31 10 2 0 2 1 2 3 1
> Node 0, zone DMA32 437 319 171 50 28 25 20 16 16 14 437
>
> * This is the /proc/buddyinfo after the compaction
>
> Done fragmenting. size=2013265920 freed=262144000
> Node 0, zone DMA 165 48 3 1 2 0 2 2 2 2 0
> Node 0, zone DMA32 35109 14575 185 51 41 12 6 0 0 0 0
>
> * /proc/buddyinfo after memory got fragmented
>
> Executing "/root/alloc_hugepages.sh"
> Eating some pagecache
> 508623+0 records in
> 508623+0 records out
> 2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
> Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
> Node 0, zone DMA32 111 344 153 20 24 10 3 0 0 0 0
>
> * /proc/buddyinfo after page cache got eaten
>
> Trying to allocate 129
> 129
>
> * 129 hugepages requested and all of them granted.
>
> Node 0, zone DMA 3 5 3 1 2 0 2 2 2 2 0
> Node 0, zone DMA32 127 97 30 99 11 6 2 1 4 0 0
>
> * /proc/buddyinfo after hugetlb allocation.
>
> 10 runs will behave as follows:
> Trying to allocate 130
> 130
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 132
> 132
> --
> Trying to allocate 129
> 129
> --
> Trying to allocate 128
> 128
> --
> Trying to allocate 129
> 129
>
> So basically 100% success for all 10 attempts.
> Without the patch numbers looked much worse:
> Trying to allocate 128
> 12
> --
> Trying to allocate 129
> 14
> --
> Trying to allocate 129
> 7
> --
> Trying to allocate 129
> 16
> --
> Trying to allocate 129
> 30
> --
> Trying to allocate 129
> 38
> --
> Trying to allocate 129
> 19
> --
> Trying to allocate 129
> 37
> --
> Trying to allocate 129
> 28
> --
> Trying to allocate 129
> 37
>
> Just for completness the base kernel without oom detection rework looks
> as follows:
> Trying to allocate 127
> 30
> --
> Trying to allocate 129
> 12
> --
> Trying to allocate 129
> 52
> --
> Trying to allocate 128
> 32
> --
> Trying to allocate 129
> 12
> --
> Trying to allocate 129
> 10
> --
> Trying to allocate 129
> 32
> --
> Trying to allocate 128
> 14
> --
> Trying to allocate 128
> 16
> --
> Trying to allocate 129
> 8
>
> As we can see the success rate is much more volatile and smaller without
> this patch. So the patch not only makes the retry logic for costly
> requests more sensible the success rate is even higher.
>
> Acked-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>
> mm/page_alloc.c | 63 +++++++++++++++++++++++++++++----------------------------
> 1 file changed, 32 insertions(+), 31 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bb4df1be0d43..d5a938f12554 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3019,6 +3019,8 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
> enum migrate_mode *migrate_mode,
> int compaction_retries)
> {
> + int max_retries = MAX_COMPACT_RETRIES;
> +
> if (!order)
> return false;
>
> @@ -3036,17 +3038,24 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
> }
>
> /*
> - * !costly allocations are really important and we have to make sure
> - * the compaction wasn't deferred or didn't bail out early due to locks
> - * contention before we go OOM. Still cap the reclaim retry loops with
> - * progress to prevent from looping forever and potential trashing.
> + * make sure the compaction wasn't deferred or didn't bail out early
> + * due to locks contention before we declare that we should give up.
> */
> - if (order <= PAGE_ALLOC_COSTLY_ORDER) {
> - if (compaction_withdrawn(compact_result))
> - return true;
> - if (compaction_retries <= MAX_COMPACT_RETRIES)
> - return true;
> - }
> + if (compaction_withdrawn(compact_result))
> + return true;
> +
> + /*
> + * !costly requests are much more important than __GFP_REPEAT
> + * costly ones because they are de facto nofail and invoke OOM
> + * killer to move on while costly can fail and users are ready
> + * to cope with that. 1/4 retries is rather arbitrary but we
> + * would need much more detailed feedback from compaction to
> + * make a better decision.
> + */
> + if (order > PAGE_ALLOC_COSTLY_ORDER)
> + max_retries /= 4;
> + if (compaction_retries <= max_retries)
> + return true;
>
> return false;
> }
> @@ -3207,18 +3216,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> * Checks whether it makes sense to retry the reclaim to make a forward progress
> * for the given allocation request.
> * The reclaim feedback represented by did_some_progress (any progress during
> - * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
> - * pages) and no_progress_loops (number of reclaim rounds without any progress
> - * in a row) is considered as well as the reclaimable pages on the applicable
> - * zone list (with a backoff mechanism which is a function of no_progress_loops).
> + * the last reclaim round) and no_progress_loops (number of reclaim rounds without
> + * any progress in a row) is considered as well as the reclaimable pages on the
> + * applicable zone list (with a backoff mechanism which is a function of
> + * no_progress_loops).
> *
> * Returns true if a retry is viable or false to enter the oom path.
> */
> static inline bool
> should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> struct alloc_context *ac, int alloc_flags,
> - bool did_some_progress, unsigned long pages_reclaimed,
> - int no_progress_loops)
> + bool did_some_progress, int no_progress_loops)
> {
> struct zone *zone;
> struct zoneref *z;
> @@ -3230,14 +3238,6 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> if (no_progress_loops > MAX_RECLAIM_RETRIES)
> return false;
>
> - if (order > PAGE_ALLOC_COSTLY_ORDER) {
> - if (pages_reclaimed >= (1<<order))
> - return false;
> -
> - if (did_some_progress)
> - return true;
> - }
> -
> /*
> * Keep reclaiming pages while there is a chance this will lead somewhere.
> * If none of the target zones can satisfy our allocation request even
> @@ -3308,7 +3308,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
> struct page *page = NULL;
> int alloc_flags;
> - unsigned long pages_reclaimed = 0;
> unsigned long did_some_progress;
> enum migrate_mode migration_mode = MIGRATE_ASYNC;
> enum compact_result compact_result;
> @@ -3444,16 +3443,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> goto noretry;
>
> - if (did_some_progress) {
> + /*
> + * Costly allocations might have made a progress but this doesn't mean
> + * their order will become available due to high fragmentation so
> + * always increment the no progress counter for them
> + */
> + if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> no_progress_loops = 0;
> - pages_reclaimed += did_some_progress;
> - } else {
> + else
> no_progress_loops++;
> - }
>
> if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> - did_some_progress > 0, pages_reclaimed,
> - no_progress_loops))
> + did_some_progress > 0, no_progress_loops))
> goto retry;
>
> /*
> --
> 2.8.0.rc3
>
> From: Michal Hocko <[email protected]>
>
> "mm: consider compaction feedback also for costly allocation" has
> removed the upper bound for the reclaim/compaction retries based on the
> number of reclaimed pages for costly orders. While this is desirable
> the patch did miss a mis interaction between reclaim, compaction and the
> retry logic. The direct reclaim tries to get zones over min watermark
> while compaction backs off and returns COMPACT_SKIPPED when all zones
> are below low watermark + 1<<order gap. If we are getting really close
> to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
> high order request (e.g. hugetlb order-9) while the reclaim is not able
> to release enough pages to get us over low watermark. The reclaim is
> still able to make some progress (usually trashing over few remaining
> pages) so we are not able to break out from the loop.
>
> I have seen this happening with the same test described in "mm: consider
> compaction feedback also for costly allocation" on a swapless system.
> The original problem got resolved by "vmscan: consider classzone_idx in
> compaction_ready" but it shows how things might go wrong when we
> approach the oom event horizont.
>
> The reason why compaction requires being over low rather than min
> watermark is not clear to me. This check was there essentially since
> 56de7263fcf3 ("mm: compaction: direct compact when a high-order
> allocation fails"). It is clearly an implementation detail though and we
> shouldn't pull it into the generic retry logic while we should be able
> to cope with such eventuality. The only place in should_compact_retry
> where we retry without any upper bound is for compaction_withdrawn()
> case.
>
> Introduce compaction_zonelist_suitable function which checks the given
> zonelist and returns true only if there is at least one zone which would
> would unblock __compaction_suitable if more memory got reclaimed. In
> this implementation it checks __compaction_suitable with NR_FREE_PAGES
> plus part of the reclaimable memory as the target for the watermark check.
> The reclaimable memory is reduced linearly by the allocation order. The
> idea is that we do not want to reclaim all the remaining memory for a
> single allocation request just unblock __compaction_suitable which
> doesn't guarantee we will make a further progress.
>
> The new helper is then used if compaction_withdrawn() feedback was
> provided so we do not retry if there is no outlook for a further
> progress. !costly requests shouldn't be affected much - e.g. order-2
> pages would require to have at least 64kB on the reclaimable LRUs while
> order-9 would need at least 32M which should be enough to not lock up.
>
> [[email protected]: fix classzone_idx vs. high_zoneidx usage in
> compaction_zonelist_suitable]
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Hillf Danton <[email protected]>
> ---
> include/linux/compaction.h | 4 ++++
> include/linux/mmzone.h | 3 +++
> mm/compaction.c | 42 +++++++++++++++++++++++++++++++++++++++---
> mm/page_alloc.c | 18 +++++++++++-------
> 4 files changed, 57 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index a002ca55c513..7bbdbf729757 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -142,6 +142,10 @@ static inline bool compaction_withdrawn(enum compact_result result)
> return false;
> }
>
> +
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> + int alloc_flags);
> +
> extern int kcompactd_run(int nid);
> extern void kcompactd_stop(int nid);
> extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 150c6049f961..0bf13c7cd8cd 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -746,6 +746,9 @@ static inline bool is_dev_zone(const struct zone *zone)
> extern struct mutex zonelists_mutex;
> void build_all_zonelists(pg_data_t *pgdat, struct zone *zone);
> void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
> + unsigned long mark, int classzone_idx, int alloc_flags,
> + long free_pages);
> bool zone_watermark_ok(struct zone *z, unsigned int order,
> unsigned long mark, int classzone_idx, int alloc_flags);
> bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index e2e487cea5ea..0a7ca578af97 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1369,7 +1369,8 @@ static enum compact_result compact_finished(struct zone *zone,
> * COMPACT_CONTINUE - If compaction should run now
> */
> static enum compact_result __compaction_suitable(struct zone *zone, int order,
> - int alloc_flags, int classzone_idx)
> + int alloc_flags, int classzone_idx,
> + unsigned long wmark_target)
> {
> int fragindex;
> unsigned long watermark;
> @@ -1392,7 +1393,8 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
> * allocated and for a short time, the footprint is higher
> */
> watermark += (2UL << order);
> - if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
> + if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
> + alloc_flags, wmark_target))
> return COMPACT_SKIPPED;
>
> /*
> @@ -1418,7 +1420,8 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
> {
> enum compact_result ret;
>
> - ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx);
> + ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
> + zone_page_state(zone, NR_FREE_PAGES));
> trace_mm_compaction_suitable(zone, order, ret);
> if (ret == COMPACT_NOT_SUITABLE_ZONE)
> ret = COMPACT_SKIPPED;
> @@ -1426,6 +1429,39 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
> return ret;
> }
>
> +bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
> + int alloc_flags)
> +{
> + struct zone *zone;
> + struct zoneref *z;
> +
> + /*
> + * Make sure at least one zone would pass __compaction_suitable if we continue
> + * retrying the reclaim.
> + */
> + for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
> + ac->nodemask) {
> + unsigned long available;
> + enum compact_result compact_result;
> +
> + /*
> + * Do not consider all the reclaimable memory because we do not
> + * want to trash just for a single high order allocation which
> + * is even not guaranteed to appear even if __compaction_suitable
> + * is happy about the watermark check.
> + */
> + available = zone_reclaimable_pages(zone) / order;
> + available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> + compact_result = __compaction_suitable(zone, order, alloc_flags,
> + ac->classzone_idx, available);
> + if (compact_result != COMPACT_SKIPPED &&
> + compact_result != COMPACT_NOT_SUITABLE_ZONE)
> + return true;
> + }
> +
> + return false;
> +}
> +
> static enum compact_result compact_zone(struct zone *zone, struct compact_control *cc)
> {
> enum compact_result ret;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5a938f12554..6757d6df2160 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2526,7 +2526,7 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> * one free page of a suitable size. Checking now avoids taking the zone lock
> * to check in the allocation paths if no pages are free.
> */
> -static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> +bool __zone_watermark_ok(struct zone *z, unsigned int order,
> unsigned long mark, int classzone_idx, int alloc_flags,
> long free_pages)
> {
> @@ -3015,8 +3015,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> }
>
> static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> - enum migrate_mode *migrate_mode,
> +should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
> + enum compact_result compact_result, enum migrate_mode *migrate_mode,
> int compaction_retries)
> {
> int max_retries = MAX_COMPACT_RETRIES;
> @@ -3040,9 +3040,11 @@ should_compact_retry(unsigned int order, enum compact_result compact_result,
> /*
> * make sure the compaction wasn't deferred or didn't bail out early
> * due to locks contention before we declare that we should give up.
> + * But do not retry if the given zonelist is not suitable for
> + * compaction.
> */
> if (compaction_withdrawn(compact_result))
> - return true;
> + return compaction_zonelist_suitable(ac, order, alloc_flags);
>
> /*
> * !costly requests are much more important than __GFP_REPEAT
> @@ -3069,7 +3071,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> }
>
> static inline bool
> -should_compact_retry(unsigned int order, enum compact_result compact_result,
> +should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
> + enum compact_result compact_result,
> enum migrate_mode *migrate_mode,
> int compaction_retries)
> {
> @@ -3464,8 +3467,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> * of free memory (see __compaction_suitable)
> */
> if (did_some_progress > 0 &&
> - should_compact_retry(order, compact_result,
> - &migration_mode, compaction_retries))
> + should_compact_retry(ac, order, alloc_flags,
> + compact_result, &migration_mode,
> + compaction_retries))
> goto retry;
>
> /* Reclaim has failed us, start killing things */
> --
> 2.8.0.rc3
On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Compaction can provide a wild variation of feedback to the caller. Many
> of them are implementation specific and the caller of the compaction
> (especially the page allocator) shouldn't be bound to specifics of the
> current implementation.
>
> This patch abstracts the feedback into three basic types:
> - compaction_made_progress - compaction was active and made some
> progress.
> - compaction_failed - compaction failed and further attempts to
> invoke it would most probably fail and therefore it is not
> worth retrying
> - compaction_withdrawn - compaction wasn't invoked for an
> implementation specific reasons. In the current implementation
> it means that the compaction was deferred, contended or the
> page scanners met too early without any progress. Retrying is
> still worthwhile.
>
> [[email protected]: do not change thp back off behavior]
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> "mm: consider compaction feedback also for costly allocation" has
> removed the upper bound for the reclaim/compaction retries based on the
> number of reclaimed pages for costly orders. While this is desirable
> the patch did miss a mis interaction between reclaim, compaction and the
> retry logic.
Hmm perhaps reversing the order of patches 13 and 14 would be a bit
safer wrt future bisections then? Add compaction_zonelist_suitable()
first with the reasoning, and then immediately use it in the other patch.
> The direct reclaim tries to get zones over min watermark
> while compaction backs off and returns COMPACT_SKIPPED when all zones
> are below low watermark + 1<<order gap. If we are getting really close
> to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
> high order request (e.g. hugetlb order-9) while the reclaim is not able
> to release enough pages to get us over low watermark. The reclaim is
> still able to make some progress (usually trashing over few remaining
> pages) so we are not able to break out from the loop.
>
> I have seen this happening with the same test described in "mm: consider
> compaction feedback also for costly allocation" on a swapless system.
> The original problem got resolved by "vmscan: consider classzone_idx in
> compaction_ready" but it shows how things might go wrong when we
> approach the oom event horizont.
>
> The reason why compaction requires being over low rather than min
> watermark is not clear to me. This check was there essentially since
> 56de7263fcf3 ("mm: compaction: direct compact when a high-order
> allocation fails"). It is clearly an implementation detail though and we
> shouldn't pull it into the generic retry logic while we should be able
> to cope with such eventuality. The only place in should_compact_retry
> where we retry without any upper bound is for compaction_withdrawn()
> case.
>
> Introduce compaction_zonelist_suitable function which checks the given
> zonelist and returns true only if there is at least one zone which would
> would unblock __compaction_suitable if more memory got reclaimed. In
> this implementation it checks __compaction_suitable with NR_FREE_PAGES
> plus part of the reclaimable memory as the target for the watermark check.
> The reclaimable memory is reduced linearly by the allocation order. The
> idea is that we do not want to reclaim all the remaining memory for a
> single allocation request just unblock __compaction_suitable which
> doesn't guarantee we will make a further progress.
>
> The new helper is then used if compaction_withdrawn() feedback was
> provided so we do not retry if there is no outlook for a further
> progress. !costly requests shouldn't be affected much - e.g. order-2
> pages would require to have at least 64kB on the reclaimable LRUs while
> order-9 would need at least 32M which should be enough to not lock up.
>
> [[email protected]: fix classzone_idx vs. high_zoneidx usage in
> compaction_zonelist_suitable]
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 04/20/2016 09:47 PM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> THP requests skip the direct reclaim if the compaction is either
> deferred or contended to reduce stalls which wouldn't help the
> allocation success anyway. These checks are ignoring other potential
> feedback modes which we have available now.
>
> It clearly doesn't make much sense to go and reclaim few pages if the
> previous compaction has failed.
>
> We can also simplify the check by using compaction_withdrawn which
> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> is however covering more reasons why the compaction was withdrawn.
> None of them should be a problem for the THP case though.
>
> It is safe to back of if we see COMPACT_SKIPPED because that means
> that compaction_suitable failed and a single round of the reclaim is
> unlikely to make any difference here. We would have to be close to
> the low watermark to reclaim enough and even then there is no guarantee
> that the compaction would make any progress while the direct reclaim
> would have caused the stall.
>
> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> have only seen a part of the zone so a retry would make some sense. But
> it would be a compaction retry not a reclaim retry to perform. We are
> not doing that and that might indeed lead to situations where THP fails
> but this should happen only rarely and it would be really hard to
> measure.
>
> Signed-off-by: Michal Hocko <[email protected]>
THP's don't compact by default in page fault path anymore, so we don't
need to restrict them even more. And hopefully we'll replace the
is_thp_gfp_mask() hack with something better soon, so this might be just
extra code churn. But I don't feel strongly enough to nack it.
On Thu 28-04-16 10:53:18, Vlastimil Babka wrote:
> On 04/20/2016 09:47 PM, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >THP requests skip the direct reclaim if the compaction is either
> >deferred or contended to reduce stalls which wouldn't help the
> >allocation success anyway. These checks are ignoring other potential
> >feedback modes which we have available now.
> >
> >It clearly doesn't make much sense to go and reclaim few pages if the
> >previous compaction has failed.
> >
> >We can also simplify the check by using compaction_withdrawn which
> >checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
> >is however covering more reasons why the compaction was withdrawn.
> >None of them should be a problem for the THP case though.
> >
> >It is safe to back of if we see COMPACT_SKIPPED because that means
> >that compaction_suitable failed and a single round of the reclaim is
> >unlikely to make any difference here. We would have to be close to
> >the low watermark to reclaim enough and even then there is no guarantee
> >that the compaction would make any progress while the direct reclaim
> >would have caused the stall.
> >
> >COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
> >have only seen a part of the zone so a retry would make some sense. But
> >it would be a compaction retry not a reclaim retry to perform. We are
> >not doing that and that might indeed lead to situations where THP fails
> >but this should happen only rarely and it would be really hard to
> >measure.
> >
> >Signed-off-by: Michal Hocko <[email protected]>
>
> THP's don't compact by default in page fault path anymore, so we don't need
> to restrict them even more. And hopefully we'll replace the
> is_thp_gfp_mask() hack with something better soon, so this might be just
> extra code churn. But I don't feel strongly enough to nack it.
My main point was to simplify the code and get rid of as much compaction
specific hacks as possible. We might very well drop this later on but it
would be at least less code to grasp through. I do not have any problem
with dropping this but I think this shouldn't collide with other patches
much so reducing the number of lines is worth it.
--
Michal Hocko
SUSE Labs
On Thu 28-04-16 10:59:22, Vlastimil Babka wrote:
> On 04/20/2016 09:47 PM, Michal Hocko wrote:
> >From: Michal Hocko <[email protected]>
> >
> >"mm: consider compaction feedback also for costly allocation" has
> >removed the upper bound for the reclaim/compaction retries based on the
> >number of reclaimed pages for costly orders. While this is desirable
> >the patch did miss a mis interaction between reclaim, compaction and the
> >retry logic.
>
> Hmm perhaps reversing the order of patches 13 and 14 would be a bit safer
> wrt future bisections then? Add compaction_zonelist_suitable() first with
> the reasoning, and then immediately use it in the other patch.
Hmm, I do not think the risk is high. This would require the allocate
GFP_REPEAT large orders to the last drop which is not usual. I found the
ordering more logical to argue about because this patch will be mostly
noop for costly orders without 13 and !costly allocations retry
endlessly anyway. So I would prefer this ordering even though there is
a window where an extreme load can lockup. I do not expect people
shooting their head during bisection.
[...]
> >
> >[[email protected]: fix classzone_idx vs. high_zoneidx usage in
> >compaction_zonelist_suitable]
> >Signed-off-by: Michal Hocko <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
On 04/28/2016 02:35 PM, Michal Hocko wrote:
> On Thu 28-04-16 10:53:18, Vlastimil Babka wrote:
>> On 04/20/2016 09:47 PM, Michal Hocko wrote:
>>> From: Michal Hocko <[email protected]>
>>>
>>> THP requests skip the direct reclaim if the compaction is either
>>> deferred or contended to reduce stalls which wouldn't help the
>>> allocation success anyway. These checks are ignoring other potential
>>> feedback modes which we have available now.
>>>
>>> It clearly doesn't make much sense to go and reclaim few pages if the
>>> previous compaction has failed.
>>>
>>> We can also simplify the check by using compaction_withdrawn which
>>> checks for both COMPACT_CONTENDED and COMPACT_DEFERRED. This check
>>> is however covering more reasons why the compaction was withdrawn.
>>> None of them should be a problem for the THP case though.
>>>
>>> It is safe to back of if we see COMPACT_SKIPPED because that means
>>> that compaction_suitable failed and a single round of the reclaim is
>>> unlikely to make any difference here. We would have to be close to
Hmm this is actually incorrect, as should_continue_reclaim() will keep
shrink_zone() going as much as needed for compaction to become enabled,
so it doesn't reclaim just SWAP_CLUSTER_MAX.
>>> the low watermark to reclaim enough and even then there is no guarantee
>>> that the compaction would make any progress while the direct reclaim
>>> would have caused the stall.
>>>
>>> COMPACT_PARTIAL_SKIPPED is slightly different because that means that we
>>> have only seen a part of the zone so a retry would make some sense. But
>>> it would be a compaction retry not a reclaim retry to perform. We are
>>> not doing that and that might indeed lead to situations where THP fails
>>> but this should happen only rarely and it would be really hard to
>>> measure.
>>>
>>> Signed-off-by: Michal Hocko <[email protected]>
>>
>> THP's don't compact by default in page fault path anymore, so we don't need
>> to restrict them even more. And hopefully we'll replace the
>> is_thp_gfp_mask() hack with something better soon, so this might be just
>> extra code churn. But I don't feel strongly enough to nack it.
>
> My main point was to simplify the code and get rid of as much compaction
> specific hacks as possible. We might very well drop this later on but it
> would be at least less code to grasp through. I do not have any problem
> with dropping this but I think this shouldn't collide with other patches
> much so reducing the number of lines is worth it.
I just realized it also affects khugepaged, and not just THP page
faults, so it may potentially cripple THP's completely. My main issue is
that the reasons to bail out includes COMPACT_SKIPPED, and for a wrong
reason (see the comment above). It also goes against the comment below
the noretry label:
* High-order allocations do not necessarily loop after direct reclaim
* and reclaim/compaction depends on compaction being called after
* reclaim so call directly if necessary.
Given that THP's are large, I expect reclaim would indeed be quite often
necessary before compaction, and the first optimistic async compaction
attempt will just return SKIPPED. After this patch, there will be no
more reclaim/compaction attempts for THP's, including khugepaged. And
given the change of THP page fault defaults, even crippling that path
should no longer be necessary.
So I would just drop this for now indeed.
On Fri 29-04-16 11:16:44, Vlastimil Babka wrote:
> On 04/28/2016 02:35 PM, Michal Hocko wrote:
[...]
> >My main point was to simplify the code and get rid of as much compaction
> >specific hacks as possible. We might very well drop this later on but it
> >would be at least less code to grasp through. I do not have any problem
> >with dropping this but I think this shouldn't collide with other patches
> >much so reducing the number of lines is worth it.
Good point, I have completely missed this part.
> I just realized it also affects khugepaged, and not just THP page faults, so
> it may potentially cripple THP's completely. My main issue is that the
> reasons to bail out includes COMPACT_SKIPPED, and for a wrong reason (see
> the comment above). It also goes against the comment below the noretry
> label:
>
> * High-order allocations do not necessarily loop after direct reclaim
> * and reclaim/compaction depends on compaction being called after
> * reclaim so call directly if necessary.
>
> Given that THP's are large, I expect reclaim would indeed be quite often
> necessary before compaction, and the first optimistic async compaction
> attempt will just return SKIPPED. After this patch, there will be no more
> reclaim/compaction attempts for THP's, including khugepaged. And given the
> change of THP page fault defaults, even crippling that path should no longer
> be necessary.
>
> So I would just drop this for now indeed.
Agreed, thanks for catching this. Andrew, could you drop this patch
please? It was supposed to be a mere clean up without any effect on the
oom detection.
Thanks!
--
Michal Hocko
SUSE Labs