2015-12-15 18:20:10

by Michal Hocko

Subject: [PATCH 0/3] OOM detection rework v4

Hi,

This is v4 of the series. The previous version was posted [1]. I have
dropped the RFC because this has been sitting and waiting for fundamental
objections for quite some time and there were none. I still do not think
we should rush this, so it should be merged no sooner than 4.6. Having
it in mmotm and thus linux-next would open it to a much larger testing
coverage. I will iron out issues as they come but hopefully there will
be no serious ones.

* Changes since v3
- factor out the new heuristic into its own function as suggested by
Johannes (no functional changes)
* Changes since v2
- rebased on top of mmotm-2015-11-25-17-08 which includes
wait_iff_congested related changes which needed refresh in
patch#1 and patch#2
- use zone_page_state_snapshot for NR_FREE_PAGES per David
- shrink_zones doesn't need to return anything per David
- retested because the major kernel version has changed since
the last time (4.2 -> 4.3 based kernel + mmotm patches)

* Changes since v1
- backoff calculation was de-obfuscated by using DIV_ROUND_UP
- fixed a theoretical bug where a __GFP_NOFAIL high order allocation might fail

As pointed out by Linus [2][3], relying on zone_reclaimable as a way to
communicate reclaim progress is rather dubious. I tend to agree: not
only is it really obscure, it is also not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
making any progress because their gfp_mask wouldn't allow them to get
that page anyway (e.g. a single GFP_ATOMIC alloc and free loop). This is
rather rare so it doesn't happen in practice, but the current logic is
obscure, hard to follow and also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress, which is implemented at the page allocator layer rather than
spread out between the allocator and the reclaim path. More on the
implementation is described in the first patch.

I have tested several different scenarios but it should be clear that
testing the OOM killer in a representative way is quite hard. There is
usually a tiny gap between almost-OOM and full-blown OOM which is often
time sensitive. Anyway, I have tested the following 3 scenarios and I
would appreciate suggestions for more to test.

Testing environment: a virtual machine with 2G of RAM and 2 CPUs without
any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to a 1G xfs partition,
removing the files and starting over again) running in parallel for 10s
to build up a lot of dirty pages, then 100 parallel mem_eaters (anon
private populated mmap which waits until it gets a signal) with 80M
each (a sketch of such a mem_eater follows).
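
For reference, a mem_eater along these lines might look like the
following minimal sketch. This is an illustration only, not necessarily
the exact program used for the test; taking the size in plain bytes as
argv[1] and the 80M default are assumptions.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	/* size of the anon private mapping, 80M by default */
	size_t size = argc > 1 ? strtoull(argv[1], NULL, 0) : 80UL << 20;
	void *p;

	/* populate the mapping right away so the memory is really consumed */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* sit on the memory until we get a signal */
	pause();
	return 0;
}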

This causes an OOM flood of course and I have compared both patched
and unpatched kernels. The test is considered finished after there
are no more OOM conditions detected. This should tell us whether there
are any excessive kills or whether some of them are premature:

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Killed process" base-oom-run1.log | tail -n1
[ 211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
$ grep "Killed process" base-oom-run2.log | tail -n1
[ 157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB

$ grep "invoked oom-killer" base-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" base-oom-run2.log | wc -l
76

The number of OOM invocations is consistent with my last measurements
but the runtime is way too different (it took 800+s). One thing that
could have skewed results was that I was tail -f'ing the serial log on
the host system to see the progress. I have stopped doing that. The
results are more consistent now but still too different from the last
time. This is really weird so I've retested with the last 4.2 mmotm again
and I am getting consistent ~220s which is really close to the above. If
I apply the WQ vmstat patch on top I am getting close to 160s so the
stale vmstat counters made a difference, which is to be expected. I have
a new SSD in my laptop which might have made a difference but I wouldn't
expect it to be that large.

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
4
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
1

* patched kernel
$ grep "Killed process" patched-oom-run1.log | tail -n1
[ 341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
$ grep "Killed process" patched-oom-run2.log | tail -n1
[ 349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

$ grep "invoked oom-killer" patched-oom-run1.log | wc -l
78
$ grep "invoked oom-killer" patched-oom-run2.log | wc -l
77

$ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
0

So the number of OOM killer invocations is the same but the overall
runtime of the test was much longer with the patched kernel. This can be
attributed to more retries in general. The results from the base kernel
are quite inconsistent and I think that consistency is better here.


2) 2 writers again running for 10s and then 10 mem_eaters to consume as much
memory as possible without triggering the OOM killer. This required a lot
of tuning but I've considered 3 consecutive runs without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(15*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(9*1024)}' /proc/meminfo)

It was -14M for the base 4.2 kernel and -7500M for the patched 4.2 kernel in
my last measurements.
The patched kernel handled the low memory conditions better and fired the
OOM killer later.

3) Costly high-order allocations with a limited amount of memory.
Start 10 memeaters in parallel, each with
size=$(awk '/MemTotal/{printf "%d\n", $2/10}' /proc/meminfo)
This will trigger the OOM killer, which will kill one of them and free up
200M, and then try to use all the remaining space for hugetlb pages. See
how many of them succeed, kill everything, wait 2s and try again (the
hugetlb loop is sketched below). This tests whether we do not fail
__GFP_REPEAT costly allocations too early now.
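
The hugetlb part of the test might look roughly like the following shell
sketch. This is an illustration only, not the exact script; the iteration
count and the way the number of 2M pages is derived are assumptions.

# number of 2M hugetlb pages that would fit into the currently free memory
count=$(awk '/MemFree/{printf "%d", $2/2048}' /proc/meminfo)
for i in $(seq 20); do
	echo "Trying to allocate $count"
	echo $count > /proc/sys/vm/nr_hugepages
	# how many hugetlb pages we actually got
	cat /proc/sys/vm/nr_hugepages
	# release everything and let the system settle before the next attempt
	echo 0 > /proc/sys/vm/nr_hugepages
	sleep 2
done
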
* base kernel
$ sort base-hugepages.log | uniq -c
1 64
13 65
6 66
20 Trying to allocate 73

* patched kernel
$ sort patched-hugepages.log | uniq -c
17 65
3 66
20 Trying to allocate 73

This also doesn't look very bad but this particular test is quite timing
sensitive.

The above results do seem optimistic but more loads obviously should be
tested. I would really appreciate feedback on the approach I have chosen
before I go into more tuning. Is this a viable way to go?

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/CA+55aFwapaED7JV6zm-NVkP-jKie+eQ1vDXWrKD=SkbshZSgmw@mail.gmail.com
[3] http://lkml.kernel.org/r/CA+55aFxwg=vS2nrXsQhAUzPQDGb8aQpZi0M7UUh21ftBo-z46Q@mail.gmail.com


2015-12-15 18:21:32

by Michal Hocko

Subject: [PATCH 1/3] mm, oom: rework oom detection

From: Michal Hocko <[email protected]>

__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry the
allocation rather than declaring OOM. shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent a
premature OOM killer invocation - the LRU might be full of dirty
or writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows rescanning the reclaimable lists several
times and restarting if a page is freed. This is really subtle behavior
and it might lead to a livelock when a single freed page keeps the
allocator looping but the current task will not be able to allocate that
single page. The OOM killer would be more appropriate than looping
without any progress for an unbounded amount of time.

This patch changes the OOM detection logic and pulls it out of shrink_zone,
which is at too low a level to be appropriate for any high level decisions
such as OOM, which is a per-zonelist property. It is __alloc_pages_slowpath
which knows how many attempts have been made and what progress has been
achieved so far, therefore it is the more appropriate place to implement
this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath. It tries to be more deterministic and
easier to follow. It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context. Therefore there is a
feedback mechanism implemented which reduces the reclaim target after
each reclaim round without any progress. This means that we should
eventually converge to only NR_FREE_PAGES as the target, fail the
wmark check and proceed to OOM. The backoff is simple and linear, with
1/16 of the reclaimable pages for each round without any progress. We
are optimistic and reset the counter after successful reclaim rounds.
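
To illustrate how the backoff converges, here is a small userspace sketch
with made-up numbers mirroring the DIV_ROUND_UP based calculation used in
the patch (an illustration only; the real code operates on per-zone vmstat
counters):

#include <stdio.h>

#define MAX_RECLAIM_RETRIES 16
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	/* assumed example values for a single zone */
	unsigned long reclaimable = 1600, free_pages = 100;
	int loops;

	for (loops = 0; loops <= MAX_RECLAIM_RETRIES; loops++) {
		unsigned long available = reclaimable;

		/* each round without progress discounts 1/16 of the reclaimable pages */
		available -= DIV_ROUND_UP(loops * available, MAX_RECLAIM_RETRIES);
		available += free_pages;
		printf("no_progress_loops=%2d -> watermark target=%lu\n",
		       loops, available);
	}
	/* after 16 rounds only free_pages is left and the wmark check decides */
	return 0;
}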

Costly high order allocations mostly preserve their semantics: those
without __GFP_REPEAT fail right away while those which have the flag set
back off after the amount of reclaimed pages reaches the equivalent of
the requested order. The only difference is that if there was no progress
during the reclaim we rely on the zone watermark check. This is a more
logical thing to do than the previous 1<<order attempts which were a
result of zone_reclaimable faking the progress.

[[email protected]: separate the heuristic into should_reclaim_retry]
[[email protected]: use zone_page_state_snapshot for NR_FREE_PAGES]
[[email protected]: shrink_zones doesn't need to return anything]
Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>

factor out the retry logic into separate function - per Johannes
---
include/linux/swap.h | 1 +
mm/page_alloc.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++-----
mm/vmscan.c | 25 +++------------
3 files changed, 88 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 457181844b6e..738ae2206635 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -316,6 +316,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
struct vm_area_struct *vma);

/* linux/mm/vmscan.c */
+extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e267faad4649..f77e283fb8c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2984,6 +2984,75 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
}

+/*
+ * Maximum number of reclaim retries without any progress before OOM killer
+ * is considered as the only way to move forward.
+ */
+#define MAX_RECLAIM_RETRIES 16
+
+/*
+ * Checks whether it makes sense to retry the reclaim to make a forward progress
+ * for the given allocation request.
+ * The reclaim feedback represented by did_some_progress (any progress during
+ * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
+ * pages) and no_progress_loops (number of reclaim rounds without any progress
+ * in a row) is considered as well as the reclaimable pages on the applicable
+ * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ *
+ * Returns true if a retry is viable or false to enter the oom path.
+ */
+static inline bool
+should_reclaim_retry(gfp_t gfp_mask, unsigned order,
+ struct alloc_context *ac, int alloc_flags,
+ bool did_some_progress, unsigned long pages_reclaimed,
+ int no_progress_loops)
+{
+ struct zone *zone;
+ struct zoneref *z;
+
+ /*
+ * Make sure we converge to OOM if we cannot make any progress
+ * several times in the row.
+ */
+ if (no_progress_loops > MAX_RECLAIM_RETRIES)
+ return false;
+
+ /* Do not retry high order allocations unless they are __GFP_REPEAT */
+ if (order > PAGE_ALLOC_COSTLY_ORDER) {
+ if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
+ return false;
+
+ if (did_some_progress)
+ return true;
+ }
+
+ /*
+ * Keep reclaiming pages while there is a chance this will lead somewhere.
+ * If none of the target zones can satisfy our allocation request even
+ * if all reclaimable pages are considered then we are screwed and have
+ * to go OOM.
+ */
+ for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
+ unsigned long available;
+
+ available = zone_reclaimable_pages(zone);
+ available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
+ available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+
+ /*
+ * Would the allocation succeed if we reclaimed the whole available?
+ */
+ if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ ac->high_zoneidx, alloc_flags, available)) {
+ /* Wait for some write requests to complete then retry */
+ wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ return true;
+ }
+ }
+
+ return false;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
@@ -2996,6 +3065,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
int contended_compaction = COMPACT_CONTENDED_NONE;
+ int no_progress_loops = 0;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3155,23 +3225,28 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;

- /* Keep reclaiming pages as long as there is reasonable progress */
- pages_reclaimed += did_some_progress;
- if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
- ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
- goto retry;
+ if (did_some_progress) {
+ no_progress_loops = 0;
+ pages_reclaimed += did_some_progress;
+ } else {
+ no_progress_loops++;
}

+ if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+ did_some_progress > 0, pages_reclaimed,
+ no_progress_loops))
+ goto retry;
+
/* Reclaim has failed us, start killing things */
page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
if (page)
goto got_pg;

/* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ no_progress_loops = 0;
goto retry;
+ }

noretry:
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..489212252cd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -192,7 +192,7 @@ static bool sane_reclaim(struct scan_control *sc)
}
#endif

-static unsigned long zone_reclaimable_pages(struct zone *zone)
+unsigned long zone_reclaimable_pages(struct zone *zone)
{
unsigned long nr;

@@ -2516,10 +2516,8 @@ static inline bool compaction_ready(struct zone *zone, int order)
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
- *
- * Returns true if a zone was reclaimable.
*/
-static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
+static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
@@ -2527,7 +2525,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
unsigned long nr_soft_scanned;
gfp_t orig_mask;
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
- bool reclaimable = false;

/*
* If the number of buffer_heads in the machine exceeds the maximum
@@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
&nr_soft_scanned);
sc->nr_reclaimed += nr_soft_reclaimed;
sc->nr_scanned += nr_soft_scanned;
- if (nr_soft_reclaimed)
- reclaimable = true;
/* need some check for avoid more shrink_zone() */
}

- if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
- reclaimable = true;
-
- if (global_reclaim(sc) &&
- !reclaimable && zone_reclaimable(zone))
- reclaimable = true;
+ shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
}

/*
@@ -2610,8 +2600,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
* promoted it to __GFP_HIGHMEM.
*/
sc->gfp_mask = orig_mask;
-
- return reclaimable;
}

/*
@@ -2636,7 +2624,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
int initial_priority = sc->priority;
unsigned long total_scanned = 0;
unsigned long writeback_threshold;
- bool zones_reclaimable;
retry:
delayacct_freepages_start();

@@ -2647,7 +2634,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
sc->priority);
sc->nr_scanned = 0;
- zones_reclaimable = shrink_zones(zonelist, sc);
+ shrink_zones(zonelist, sc);

total_scanned += sc->nr_scanned;
if (sc->nr_reclaimed >= sc->nr_to_reclaim)
@@ -2694,10 +2681,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
goto retry;
}

- /* Any of the zones still reclaimable? Don't OOM. */
- if (zones_reclaimable)
- return 1;
-
return 0;
}

--
2.6.2

2015-12-15 18:20:16

by Michal Hocko

Subject: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

From: Michal Hocko <[email protected]>

wait_iff_congested has been used to throttle the allocator before it
retried another round of direct reclaim, to allow the writeback to make
some progress and to prevent reclaim from looping over dirty/writeback
pages without making any progress. We used to do congestion_wait before
0e093d99763e ("writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone") but that led to undesirable stalls
and sleeping for the full timeout even when the BDI wasn't congested.
Hence wait_iff_congested was used instead. But it seems that even
wait_iff_congested doesn't work as expected. We might have a small file
LRU list with all pages dirty/writeback and yet the bdi is not congested
so this is just a cond_resched in the end and can end up triggering a
premature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation. This shouldn't reintroduce the stalls
fixed by 0e093d99763e because congestion_wait is called only when we
are getting hopeless and sleeping is a better choice than OOM with many
pages under IO.

We have to preserve the logic introduced by "mm, vmstat: allow WQ
concurrency to discover memory reclaim doesn't make any progress" in
__alloc_pages_slowpath now that wait_iff_congested is not used anymore.
As the only remaining user of wait_iff_congested is shrink_inactive_list
we can remove the WQ specific short sleep from wait_iff_congested
because the sleep needs to be done only once in the allocation retry
cycle.

Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/backing-dev.c | 19 +++----------------
mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++---
2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7340353f8aea..d2473ce9cc57 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,9 +957,8 @@ EXPORT_SYMBOL(congestion_wait);
* jiffies for either a BDI to exit congestion of the given @sync queue
* or a write to complete.
*
- * In the absence of zone congestion, a short sleep or a cond_resched is
- * performed to yield the processor and to allow other subsystems to make
- * a forward progress.
+ * In the absence of zone congestion, cond_resched() is called to yield
+ * the processor if necessary but otherwise does not sleep.
*
* The return value is 0 if the sleep is for the full timeout. Otherwise,
* it is the number of jiffies that were still remaining when the function
@@ -980,19 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {

- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that a particular
- * WQ is congested if the worker thread is looping without
- * ever sleeping. Therefore we have to do a short sleep
- * here rather than calling cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout(1);
- else
- cond_resched();
-
+ cond_resched();
/* In case we scheduled, work out time remaining */
ret = timeout - (jiffies - start);
if (ret < 0)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f77e283fb8c6..b2de8c8761ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3034,8 +3034,9 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
*/
for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
unsigned long available;
+ unsigned long reclaimable;

- available = zone_reclaimable_pages(zone);
+ available = reclaimable = zone_reclaimable_pages(zone);
available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES);
available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

@@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
*/
if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
ac->high_zoneidx, alloc_flags, available)) {
- /* Wait for some write requests to complete then retry */
- wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
+ unsigned long writeback;
+ unsigned long dirty;
+
+ writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
+ dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
+
+ /*
+ * If we didn't make any progress and have a lot of
+ * dirty + writeback pages then we should wait for
+ * an IO to complete to slow down the reclaim and
+ * prevent a premature OOM
+ */
+ if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ return true;
+ }
+
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout(1);
+ else
+ cond_resched();
+
return true;
}
}
--
2.6.2

2015-12-15 18:20:14

by Michal Hocko

Subject: [PATCH 3/3] mm: use watermark checks for __GFP_REPEAT high order allocations

From: Michal Hocko <[email protected]>

__alloc_pages_slowpath retries costly allocations until at least
order worth of pages has been reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages, if the reclaim
hasn't made any progress.

The first condition was added by a41f24ea9fd6 ("page allocator: smarter
retry of costly-order allocations") and it assumed that lumpy reclaim
could have created a page of the sufficient order. Lumpy reclaim
has been removed quite some time ago so the assumption doesn't hold
anymore. It would be more appropriate to check the compaction progress
instead but this patch simply removes the check and relies solely
on the watermark check.

To prevent too many retries, no_progress_loops is not reset after
a reclaim round which made progress because we cannot assume it helped the
high order situation. Only costly allocation requests depended on
pages_reclaimed so we can drop it.

Acked-by: Hillf Danton <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 34 +++++++++++++++-------------------
1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b2de8c8761ad..268de1654128 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2994,17 +2994,17 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
* Checks whether it makes sense to retry the reclaim to make a forward progress
* for the given allocation request.
* The reclaim feedback represented by did_some_progress (any progress during
- * the last reclaim round), pages_reclaimed (cumulative number of reclaimed
- * pages) and no_progress_loops (number of reclaim rounds without any progress
- * in a row) is considered as well as the reclaimable pages on the applicable
- * zone list (with a backoff mechanism which is a function of no_progress_loops).
+ * the last reclaim round) and no_progress_loops (number of reclaim rounds without
+ * any progress in a row) is considered as well as the reclaimable pages on the
+ * applicable zone list (with a backoff mechanism which is a function of
+ * no_progress_loops).
*
* Returns true if a retry is viable or false to enter the oom path.
*/
static inline bool
should_reclaim_retry(gfp_t gfp_mask, unsigned order,
struct alloc_context *ac, int alloc_flags,
- bool did_some_progress, unsigned long pages_reclaimed,
+ bool did_some_progress,
int no_progress_loops)
{
struct zone *zone;
@@ -3018,13 +3018,8 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
return false;

/* Do not retry high order allocations unless they are __GFP_REPEAT */
- if (order > PAGE_ALLOC_COSTLY_ORDER) {
- if (!(gfp_mask & __GFP_REPEAT) || pages_reclaimed >= (1<<order))
- return false;
-
- if (did_some_progress)
- return true;
- }
+ if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
+ return false;

/*
* Keep reclaiming pages while there is a chance this will lead somewhere.
@@ -3090,7 +3085,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
struct page *page = NULL;
int alloc_flags;
- unsigned long pages_reclaimed = 0;
unsigned long did_some_progress;
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
@@ -3255,16 +3249,18 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
goto noretry;

- if (did_some_progress) {
+ /*
+ * Costly allocations might have made a progress but this doesn't mean
+ * their order will become available due to high fragmentation so do
+ * not reset the no progress counter for them
+ */
+ if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
no_progress_loops = 0;
- pages_reclaimed += did_some_progress;
- } else {
+ else
no_progress_loops++;
- }

if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
- did_some_progress > 0, pages_reclaimed,
- no_progress_loops))
+ did_some_progress > 0, no_progress_loops))
goto retry;

/* Reclaim has failed us, start killing things */
--
2.6.2

2015-12-16 23:35:16

by Andrew Morton

Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <[email protected]> wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress, which is implemented at the page allocator layer rather than
> spread out between the allocator and the reclaim path. More on the
> implementation is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well. This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything). It's not so easy in the case of
oom-too-late-or-never. The reporter's machine just hangs or it
twiddles thumbs for five minutes then goes oom. But there are things
we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
start going wrong and turns on diagnostics (this would need an enable
knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
the diagnostics.

- sprinkle tracepoints everywhere and provide a set of
instructions/scripts so that people who know nothing about kernel
internals or tracing can easily gather the info we need to understand
issues.

- add a sysrq key to turn on diagnostics. Pretty essential when the
machine is comatose and doesn't respond to keystrokes.

- something else

So... please have a think about it? What can we add in here to make it
as easy as possible for us (ie: you ;)) to get this code working well?
At this time, too much developer support code will be better than too
little. We can take it out later on.

2015-12-16 23:58:47

by Andrew Morton

Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <[email protected]> wrote:

>
> ...
>
> * base kernel
> $ grep "Killed process" base-oom-run1.log | tail -n1
> [ 211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> $ grep "Killed process" base-oom-run2.log | tail -n1
> [ 157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
>
> $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> 76
>
> The number of OOM invocations is consistent with my last measurements
> but the runtime is way too different (it took 800+s).

I'm seeing 211 seconds vs 157 seconds? If so, that's not toooo bad. I
assume the 800+s is sum-across-multiple-CPUs? Given that all the CPUs
are pounding away at the same data and the same disk, that doesn't
sound like very interesting info - the overall elapsed time is the
thing to look at in this case.

> One thing that
> could have skewed results was that I was tail -f'ing the serial log on the
> host system to see the progress. I have stopped doing that. The results
> are more consistent now but still too different from the last time.
> This is really weird so I've retested with the last 4.2 mmotm again and
> I am getting consistent ~220s which is really close to the above. If I
> apply the WQ vmstat patch on top I am getting close to 160s so the stale
> vmstat counters made a difference which is to be expected. I have a new
> SSD in my laptop which might have made a difference but I wouldn't expect
> it to be that large.
>
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
> 4
> $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
> 1
>
> * patched kernel
> $ grep "Killed process" patched-oom-run1.log | tail -n1
> [ 341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> $ grep "Killed process" patched-oom-run2.log | tail -n1
> [ 349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB

Even better.

> $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> 78
> $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> 77
>
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> 1
> $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> 0
>
> So the number of OOM killer invocations is the same but the overall
> runtime of the test was much longer with the patched kernel. This can be
> attributed to more retries in general. The results from the base kernel
> are quite inconsistent and I think that consistency is better here.

It's hard to say how long declaration of oom should take. Correctness
comes first. But what is "correct"? oom isn't a binary condition -
there's a chance that if we keep churning away for another 5 minutes
we'll be able to satisfy this allocation (but probably not the next
one). There are tradeoffs between promptness-of-declaring-oom and
exhaustiveness-in-avoiding-it.

>
> 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
> memory as possible without triggering the OOM killer. This required a lot
> of tuning but I've considered 3 consecutive runs without OOM as a success.

"a lot of tuning" sounds bad. It means that the tuning settings you
have now for a particular workload on a particular machine will be
wrong for other workloads and machines. uh-oh.

> ...

2015-12-18 12:12:54

by Michal Hocko

Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 16-12-15 15:35:13, Andrew Morton wrote:
[...]
> So... please have a think about it? What can we add in here to make it
> as easy as possible for us (ie: you ;)) to get this code working well?
> At this time, too much developer support code will be better than too
> little. We can take it out later on.

Sure. I will think about this and get back to it early next year. I will
be mostly offline starting next week.

Thanks for looking into this!

--
Michal Hocko
SUSE Labs

2015-12-18 13:15:14

by Michal Hocko

Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <[email protected]> wrote:
>
> >
> > ...
> >
> > * base kernel
> > $ grep "Killed process" base-oom-run1.log | tail -n1
> > [ 211.824379] Killed process 3086 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:332kB, shmem-rss:0kB
> > $ grep "Killed process" base-oom-run2.log | tail -n1
> > [ 157.188326] Killed process 3094 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:368kB, shmem-rss:0kB
> >
> > $ grep "invoked oom-killer" base-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" base-oom-run2.log | wc -l
> > 76
> >
> > The number of OOM invocations is consistent with my last measurements
> > but the runtime is way too different (it took 800+s).
>
> I'm seeing 211 seconds vs 157 seconds? If so, that's not toooo bad. I
> assume the 800+s is sum-across-multiple-CPUs?

This is the time until the oom situation settled down. And I really
suspect that the new SSD made a difference here.

> Given that all the CPUs
> are pounding away at the same data and the same disk, that doesn't
> sound like very interesting info - the overall elapsed time is the
> thing to look at in this case.

Which is what I was looking at when checking the timestamp in the log.

[...]
> > * patched kernel
> > $ grep "Killed process" patched-oom-run1.log | tail -n1
> > [ 341.164930] Killed process 3099 (mem_eater) total-vm:85852kB, anon-rss:82000kB, file-rss:336kB, shmem-rss:0kB
> > $ grep "Killed process" patched-oom-run2.log | tail -n1
> > [ 349.111539] Killed process 3082 (mem_eater) total-vm:85852kB, anon-rss:81996kB, file-rss:4kB, shmem-rss:0kB
>
> Even better.
>
> > $ grep "invoked oom-killer" patched-oom-run1.log | wc -l
> > 78
> > $ grep "invoked oom-killer" patched-oom-run2.log | wc -l
> > 77
> >
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
> > 1
> > $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
> > 0
> >
> > So the number of OOM killer invocations is the same but the overall
> > runtime of the test was much longer with the patched kernel. This can be
> > attributed to more retries in general. The results from the base kernel
> > are quite inconsistent and I think that consistency is better here.
>
> It's hard to say how long declaration of oom should take. Correctness
> comes first. But what is "correct"? oom isn't a binary condition -
> there's a chance that if we keep churning away for another 5 minutes
> we'll be able to satisfy this allocation (but probably not the next
> one). There are tradeoffs between promptness-of-declaring-oom and
> exhaustiveness-in-avoiding-it.

Yes, this is really hard to tell. What I wanted to achieve here is
determinism - the same load should give comparable results. It seems
that there is an improvement in this regard. The time to settle is
much more consistent than with the original implementation.

> > 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
> > memory as possible without triggering the OOM killer. This required a lot
> > of tuning but I've considered 3 consecutive runs without OOM as a success.
>
> "a lot of tuning" sounds bad. It means that the tuning settings you
> have now for a particular workload on a particular machine will be
> wrong for other workloads and machines. uh-oh.

Well, I had to tune the test to see how close to the edge I can get. I
haven't made any decisions based on this test.

Thanks!
--
Michal Hocko
SUSE Labs

2015-12-18 16:36:13

by Johannes Weiner

Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri, Dec 18, 2015 at 02:15:09PM +0100, Michal Hocko wrote:
> On Wed 16-12-15 15:58:44, Andrew Morton wrote:
> > It's hard to say how long declaration of oom should take. Correctness
> > comes first. But what is "correct"? oom isn't a binary condition -
> > there's a chance that if we keep churning away for another 5 minutes
> > we'll be able to satisfy this allocation (but probably not the next
> > one). There are tradeoffs between promptness-of-declaring-oom and
> > exhaustiveness-in-avoiding-it.
>
> Yes, this is really hard to tell. What I wanted to achieve here is
> determinism - the same load should give comparable results. It seems
> that there is an improvement in this regard. The time to settle is
> much more consistent than with the original implementation.

+1

Before that we couldn't even really make a meaningful statement about
how long we are going to try - "as long as reclaim thinks it can maybe
do some more, depending on heuristics". I think the best thing we can
strive for with OOM is to make the rules simple and predictable.

2015-12-24 12:41:26

by Tetsuo Handa

Subject: Re: [PATCH 0/3] OOM detection rework v4

I got OOM killers while running heavy disk I/O (extracting kernel source,
running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
Do you think these OOM killers are reasonable? Too weak against fragmentation?

[ 3902.430630] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 3902.432780] kthreadd cpuset=/ mems_allowed=0
[ 3902.433904] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 3902.435463] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 3902.437541] 0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 3902.439129] 0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 3902.440779] ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 3902.442436] Call Trace:
[ 3902.443094] [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 3902.444188] [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 3902.445301] [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 3902.446656] [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 3902.447881] [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 3902.449093] [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 3902.450266] [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 3902.451430] [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 3902.452757] [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 3902.454468] [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 3902.455756] [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 3902.457076] [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 3902.458396] [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 3902.459480] [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 3902.460775] [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 3902.461894] [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 3902.463035] [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 3902.464230] [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.465502] [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 3902.466648] [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 3902.467953] Mem-Info:
[ 3902.468537] active_anon:20817 inactive_anon:2098 isolated_anon:0
[ 3902.468537] active_file:145434 inactive_file:145453 isolated_file:0
[ 3902.468537] unevictable:0 dirty:20613 writeback:7248 unstable:0
[ 3902.468537] slab_reclaimable:86363 slab_unreclaimable:14905
[ 3902.468537] mapped:6670 shmem:2167 pagetables:1497 bounce:0
[ 3902.468537] free:5422 free_pcp:75 free_cma:0
[ 3902.476541] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:3268kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:36kB shmem:216kB slab_reclaimable:3708kB slab_unreclaimable:456kB kernel_stack:48kB pagetables:160kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 3902.486494] lowmem_reserve[]: 0 1714 1714 1714
[ 3902.487659] Node 0 DMA32 free:13760kB min:5172kB low:6464kB high:7756kB active_anon:80000kB inactive_anon:8192kB active_file:581780kB inactive_file:581848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:82312kB writeback:29588kB mapped:26648kB shmem:8452kB slab_reclaimable:341744kB slab_unreclaimable:59496kB kernel_stack:3456kB pagetables:5828kB unstable:0kB bounce:0kB free_pcp:732kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:560 all_unreclaimable? no
[ 3902.500438] lowmem_reserve[]: 0 0 0 0
[ 3902.502373] Node 0 DMA: 42*4kB (UME) 84*8kB (UM) 57*16kB (UM) 15*32kB (UM) 11*64kB (M) 9*128kB (UME) 1*256kB (M) 1*512kB (M) 2*1024kB (UM) 0*2048kB 0*4096kB = 6904kB
[ 3902.507561] Node 0 DMA32: 3788*4kB (UME) 184*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16624kB
[ 3902.511236] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3902.513938] 292144 total pagecache pages
[ 3902.515609] 0 pages in swap cache
[ 3902.517139] Swap cache stats: add 0, delete 0, find 0/0
[ 3902.519153] Free swap = 0kB
[ 3902.520587] Total swap = 0kB
[ 3902.522095] 524157 pages RAM
[ 3902.523511] 0 pages HighMem/MovableOnly
[ 3902.525091] 80441 pages reserved
[ 3902.526580] 0 pages hwpoisoned
[ 3902.528169] Out of memory: Kill process 687 (firewalld) score 11 or sacrifice child
[ 3902.531017] Killed process 687 (firewalld) total-vm:323600kB, anon-rss:17032kB, file-rss:4896kB, shmem-rss:0kB
[ 5262.901161] smbd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5262.903629] smbd cpuset=/ mems_allowed=0
[ 5262.904725] CPU: 2 PID: 3935 Comm: smbd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5262.906401] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5262.908679] 0000000000000000 00000000eaa24b41 ffff88007c37faf8 ffffffff81395bc3
[ 5262.910459] 0000000000000000 ffff88007c37fb98 ffffffff811babac 0000000000000206
[ 5262.912224] ffffffff81810470 ffff88007c37fb38 ffffffff810bce29 0000000000000206
[ 5262.914019] Call Trace:
[ 5262.914839] [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5262.916118] [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5262.917493] [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5262.919131] [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5262.920690] [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5262.922204] [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5262.923863] [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5262.925386] [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5262.927121] [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5262.928738] [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5262.930438] [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.932110] [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5262.933410] [<ffffffff8111c4da>] ? __audit_syscall_entry+0xaa/0xf0
[ 5262.935016] [<ffffffff810030c1>] ? do_audit_syscall_entry+0x61/0x70
[ 5262.936632] [<ffffffff81003254>] ? syscall_trace_enter_phase1+0x134/0x150
[ 5262.938383] [<ffffffff81003017>] ? trace_hardirqs_on_thunk+0x17/0x19
[ 5262.940024] [<ffffffff8106e5a4>] SyS_clone+0x14/0x20
[ 5262.941465] [<ffffffff816fc532>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 5262.943137] Mem-Info:
[ 5262.944068] active_anon:37901 inactive_anon:2095 isolated_anon:0
[ 5262.944068] active_file:134812 inactive_file:135474 isolated_file:0
[ 5262.944068] unevictable:0 dirty:257 writeback:0 unstable:0
[ 5262.944068] slab_reclaimable:90770 slab_unreclaimable:12759
[ 5262.944068] mapped:4223 shmem:2166 pagetables:1428 bounce:0
[ 5262.944068] free:3738 free_pcp:49 free_cma:0
[ 5262.953176] Node 0 DMA free:6904kB min:44kB low:52kB high:64kB active_anon:900kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:712kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5262.963749] lowmem_reserve[]: 0 1714 1714 1714
[ 5262.965434] Node 0 DMA32 free:8048kB min:5172kB low:6464kB high:7756kB active_anon:150704kB inactive_anon:8180kB active_file:539244kB inactive_file:541892kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:1028kB writeback:0kB mapped:16860kB shmem:8448kB slab_reclaimable:357524kB slab_unreclaimable:50324kB kernel_stack:3232kB pagetables:5560kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:132 all_unreclaimable? no
[ 5262.976879] lowmem_reserve[]: 0 0 0 0
[ 5262.978586] Node 0 DMA: 58*4kB (UME) 60*8kB (UME) 73*16kB (UME) 23*32kB (UME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6904kB
[ 5262.983496] Node 0 DMA32: 1987*4kB (UME) 14*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8060kB
[ 5262.987124] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5262.989532] 272459 total pagecache pages
[ 5262.991203] 0 pages in swap cache
[ 5262.992583] Swap cache stats: add 0, delete 0, find 0/0
[ 5262.994334] Free swap = 0kB
[ 5262.995787] Total swap = 0kB
[ 5262.997038] 524157 pages RAM
[ 5262.998270] 0 pages HighMem/MovableOnly
[ 5262.999683] 80441 pages reserved
[ 5263.001153] 0 pages hwpoisoned
[ 5263.002612] Out of memory: Kill process 26226 (genxref) score 54 or sacrifice child
[ 5263.004648] Killed process 26226 (genxref) total-vm:130348kB, anon-rss:94680kB, file-rss:4756kB, shmem-rss:0kB
[ 5269.764580] kthreadd invoked oom-killer: order=2, oom_score_adj=0, gfp_mask=0x27000c0(GFP_KERNEL|GFP_NOTRACK|0x100000)
[ 5269.767289] kthreadd cpuset=/ mems_allowed=0
[ 5269.768904] CPU: 2 PID: 2 Comm: kthreadd Not tainted 4.4.0-rc6-next-20151222 #255
[ 5269.770956] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 5269.773754] 0000000000000000 000000009cc7eb67 ffff88007cc1faa0 ffffffff81395bc3
[ 5269.776088] 0000000000000000 ffff88007cc1fb40 ffffffff811babac 0000000000000206
[ 5269.778213] ffffffff81810470 ffff88007cc1fae0 ffffffff810bce29 0000000000000206
[ 5269.780497] Call Trace:
[ 5269.781796] [<ffffffff81395bc3>] dump_stack+0x4b/0x68
[ 5269.783634] [<ffffffff811babac>] dump_header+0x5b/0x3b0
[ 5269.786116] [<ffffffff810bce29>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 5269.788495] [<ffffffff810bcefd>] ? trace_hardirqs_on+0xd/0x10
[ 5269.790538] [<ffffffff81142646>] oom_kill_process+0x366/0x540
[ 5269.792755] [<ffffffff81142a5f>] out_of_memory+0x1ef/0x5a0
[ 5269.794784] [<ffffffff81142b1d>] ? out_of_memory+0x2ad/0x5a0
[ 5269.796848] [<ffffffff8114836d>] __alloc_pages_nodemask+0xb9d/0xd90
[ 5269.799038] [<ffffffff810bce00>] ? trace_hardirqs_on_caller+0xd0/0x1c0
[ 5269.801073] [<ffffffff8114871c>] alloc_kmem_pages_node+0x4c/0xc0
[ 5269.803186] [<ffffffff8106c451>] copy_process.part.31+0x131/0x1b40
[ 5269.805249] [<ffffffff8108f590>] ? kthread_create_on_node+0x230/0x230
[ 5269.807374] [<ffffffff8106e02b>] _do_fork+0xdb/0x5d0
[ 5269.809089] [<ffffffff81094a8a>] ? finish_task_switch+0x6a/0x2b0
[ 5269.811146] [<ffffffff8106e544>] kernel_thread+0x24/0x30
[ 5269.812944] [<ffffffff8109007c>] kthreadd+0x1bc/0x220
[ 5269.814698] [<ffffffff816fc89f>] ? ret_from_fork+0x3f/0x70
[ 5269.816330] [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.818088] [<ffffffff816fc89f>] ret_from_fork+0x3f/0x70
[ 5269.819685] [<ffffffff8108fec0>] ? kthread_create_on_cpu+0x60/0x60
[ 5269.821399] Mem-Info:
[ 5269.822430] active_anon:14280 inactive_anon:2095 isolated_anon:0
[ 5269.822430] active_file:134344 inactive_file:134515 isolated_file:0
[ 5269.822430] unevictable:0 dirty:2 writeback:0 unstable:0
[ 5269.822430] slab_reclaimable:96214 slab_unreclaimable:22185
[ 5269.822430] mapped:3512 shmem:2166 pagetables:1368 bounce:0
[ 5269.822430] free:12388 free_pcp:51 free_cma:0
[ 5269.831310] Node 0 DMA free:6892kB min:44kB low:52kB high:64kB active_anon:856kB inactive_anon:200kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:32kB shmem:216kB slab_reclaimable:5556kB slab_unreclaimable:768kB kernel_stack:48kB pagetables:152kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.840580] lowmem_reserve[]: 0 1714 1714 1714
[ 5269.842107] Node 0 DMA32 free:42660kB min:5172kB low:6464kB high:7756kB active_anon:56264kB inactive_anon:8180kB active_file:537372kB inactive_file:538056kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:1758960kB mlocked:0kB dirty:8kB writeback:0kB mapped:14020kB shmem:8448kB slab_reclaimable:379300kB slab_unreclaimable:87972kB kernel_stack:3232kB pagetables:5320kB unstable:0kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 5269.852375] lowmem_reserve[]: 0 0 0 0
[ 5269.853784] Node 0 DMA: 67*4kB (ME) 60*8kB (UME) 72*16kB (ME) 22*32kB (ME) 13*64kB (UME) 5*128kB (UM) 5*256kB (UME) 3*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 6892kB
[ 5269.858330] Node 0 DMA32: 10648*4kB (UME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42592kB
[ 5269.861551] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 5269.863676] 271012 total pagecache pages
[ 5269.865100] 0 pages in swap cache
[ 5269.866366] Swap cache stats: add 0, delete 0, find 0/0
[ 5269.867996] Free swap = 0kB
[ 5269.869363] Total swap = 0kB
[ 5269.870593] 524157 pages RAM
[ 5269.871857] 0 pages HighMem/MovableOnly
[ 5269.873604] 80441 pages reserved
[ 5269.874937] 0 pages hwpoisoned
[ 5269.876207] Out of memory: Kill process 2710 (tuned) score 7 or sacrifice child
[ 5269.878265] Killed process 2710 (tuned) total-vm:553052kB, anon-rss:10596kB, file-rss:2776kB, shmem-rss:0kB

2015-12-28 12:09:03

by Tetsuo Handa

Subject: Re: [PATCH 0/3] OOM detection rework v4

Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers are reasonable? Too weak against fragmentation?

Well, the current patch invokes the OOM killer when more than 75% of memory
is used for file cache (active_file: + inactive_file:). I think this is a
surprising thing for administrators and we want to retry harder (but not
forever, please).

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151228.txt.xz .
----------
[ 277.863985] Node 0 DMA32 free:20128kB min:5564kB low:6952kB high:8344kB active_anon:108332kB inactive_anon:8252kB active_file:985160kB inactive_file:615436kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5904kB shmem:8524kB slab_reclaimable:52088kB slab_unreclaimable:59748kB kernel_stack:31280kB pagetables:55708kB unstable:0kB bounce:0kB free_pcp:1056kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 277.884512] Node 0 DMA32: 3438*4kB (UME) 791*8kB (UME) 3*16kB (UM) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20128kB
[ 291.331040] Node 0 DMA32 free:29500kB min:5564kB low:6952kB high:8344kB active_anon:126756kB inactive_anon:8252kB active_file:821500kB inactive_file:604016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:12684kB shmem:8524kB slab_reclaimable:56808kB slab_unreclaimable:99804kB kernel_stack:58448kB pagetables:92552kB unstable:0kB bounce:0kB free_pcp:2004kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 291.349097] Node 0 DMA32: 4221*4kB (UME) 1971*8kB (UME) 436*16kB (UME) 141*32kB (UME) 8*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44652kB
[ 302.897985] Node 0 DMA32 free:28240kB min:5564kB low:6952kB high:8344kB active_anon:79344kB inactive_anon:8248kB active_file:1016568kB inactive_file:604696kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:80kB writeback:0kB mapped:13004kB shmem:8520kB slab_reclaimable:52076kB slab_unreclaimable:64064kB kernel_stack:35168kB pagetables:48552kB unstable:0kB bounce:0kB free_pcp:1384kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 302.916334] Node 0 DMA32: 4304*4kB (UM) 1181*8kB (UME) 59*16kB (UME) 7*32kB (ME) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27832kB
[ 311.014501] Node 0 DMA32 free:22820kB min:5564kB low:6952kB high:8344kB active_anon:56852kB inactive_anon:11976kB active_file:1142936kB inactive_file:582040kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:160kB writeback:0kB mapped:10796kB shmem:16640kB slab_reclaimable:48608kB slab_unreclaimable:41912kB kernel_stack:16560kB pagetables:30876kB unstable:0kB bounce:0kB free_pcp:948kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
[ 311.034251] Node 0 DMA32: 6*4kB (U) 2401*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 19232kB
[ 314.293371] Node 0 DMA32 free:15244kB min:5564kB low:6952kB high:8344kB active_anon:82496kB inactive_anon:11976kB active_file:1110984kB inactive_file:467400kB unevictable:0kB isolated(anon):0kB isolated(file):88kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:9440kB shmem:16640kB slab_reclaimable:53684kB slab_unreclaimable:72536kB kernel_stack:40048kB pagetables:67672kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:12 all_unreclaimable? no
[ 314.314336] Node 0 DMA32: 1180*4kB (UM) 1449*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16312kB
[ 322.774181] Node 0 DMA32 free:19780kB min:5564kB low:6952kB high:8344kB active_anon:68264kB inactive_anon:17816kB active_file:1155724kB inactive_file:470216kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:8kB writeback:0kB mapped:9744kB shmem:24708kB slab_reclaimable:52540kB slab_unreclaimable:63216kB kernel_stack:32464kB pagetables:51856kB unstable:0kB bounce:0kB free_pcp:1076kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 322.796256] Node 0 DMA32: 86*4kB (UME) 2474*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20136kB
[ 330.804341] Node 0 DMA32 free:22076kB min:5564kB low:6952kB high:8344kB active_anon:47616kB inactive_anon:17816kB active_file:1063272kB inactive_file:685848kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:216kB writeback:0kB mapped:9708kB shmem:24708kB slab_reclaimable:48536kB slab_unreclaimable:36844kB kernel_stack:12048kB pagetables:25992kB unstable:0kB bounce:0kB free_pcp:776kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 330.826190] Node 0 DMA32: 1637*4kB (UM) 1354*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17380kB
[ 332.828224] Node 0 DMA32 free:15544kB min:5564kB low:6952kB high:8344kB active_anon:63184kB inactive_anon:17784kB active_file:1215752kB inactive_file:468872kB unevictable:0kB isolated(anon):0kB isolated(file):68kB present:2080640kB managed:2021100kB mlocked:0kB dirty:312kB writeback:0kB mapped:9116kB shmem:24708kB slab_reclaimable:49912kB slab_unreclaimable:50068kB kernel_stack:21600kB pagetables:42384kB unstable:0kB bounce:0kB free_pcp:1364kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 332.846805] Node 0 DMA32: 4108*4kB (UME) 897*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 23608kB
[ 341.054731] Node 0 DMA32 free:20512kB min:5564kB low:6952kB high:8344kB active_anon:76796kB inactive_anon:23792kB active_file:1053836kB inactive_file:618588kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1656kB writeback:0kB mapped:19768kB shmem:32784kB slab_reclaimable:49000kB slab_unreclaimable:47636kB kernel_stack:21664kB pagetables:37188kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 341.073722] Node 0 DMA32: 3309*4kB (UM) 1124*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22228kB
[ 360.075472] Node 0 DMA32 free:17856kB min:5564kB low:6952kB high:8344kB active_anon:117872kB inactive_anon:25588kB active_file:1022532kB inactive_file:466856kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:420kB writeback:0kB mapped:25300kB shmem:40976kB slab_reclaimable:57804kB slab_unreclaimable:79416kB kernel_stack:46784kB pagetables:78044kB unstable:0kB bounce:0kB free_pcp:1100kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 360.093794] Node 0 DMA32: 2719*4kB (UM) 97*8kB (UM) 14*16kB (UM) 37*32kB (UME) 27*64kB (UME) 3*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15172kB
[ 368.853099] Node 0 DMA32 free:22524kB min:5564kB low:6952kB high:8344kB active_anon:79156kB inactive_anon:24876kB active_file:872972kB inactive_file:738900kB unevictable:0kB isolated(anon):0kB isolated(file):96kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:25708kB shmem:40976kB slab_reclaimable:50820kB slab_unreclaimable:62880kB kernel_stack:32048kB pagetables:49656kB unstable:0kB bounce:0kB free_pcp:524kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 368.871173] Node 0 DMA32: 5042*4kB (UM) 248*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22152kB
[ 379.261759] Node 0 DMA32 free:15888kB min:5564kB low:6952kB high:8344kB active_anon:89928kB inactive_anon:23780kB active_file:1295512kB inactive_file:358284kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1608kB writeback:0kB mapped:25376kB shmem:40976kB slab_reclaimable:47972kB slab_unreclaimable:50848kB kernel_stack:22320kB pagetables:42360kB unstable:0kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 379.279344] Node 0 DMA32: 2994*4kB (ME) 503*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 16000kB
[ 387.367409] Node 0 DMA32 free:15320kB min:5564kB low:6952kB high:8344kB active_anon:76364kB inactive_anon:28712kB active_file:1061180kB inactive_file:596956kB unevictable:0kB isolated(anon):0kB isolated(file):120kB present:2080640kB managed:2021100kB mlocked:0kB dirty:20kB writeback:0kB mapped:27700kB shmem:49168kB slab_reclaimable:51236kB slab_unreclaimable:51096kB kernel_stack:22912kB pagetables:40920kB unstable:0kB bounce:0kB free_pcp:700kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 387.385740] Node 0 DMA32: 3638*4kB (UM) 115*8kB (UM) 1*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15488kB
[ 391.207543] Node 0 DMA32 free:15224kB min:5564kB low:6952kB high:8344kB active_anon:115956kB inactive_anon:28392kB active_file:1117532kB inactive_file:359656kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:29348kB shmem:49168kB slab_reclaimable:56028kB slab_unreclaimable:85168kB kernel_stack:48592kB pagetables:81620kB unstable:0kB bounce:0kB free_pcp:1124kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:356 all_unreclaimable? no
[ 391.228084] Node 0 DMA32: 3374*4kB (UME) 221*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15264kB
[ 395.663881] Node 0 DMA32 free:12820kB min:5564kB low:6952kB high:8344kB active_anon:98924kB inactive_anon:27520kB active_file:1105780kB inactive_file:494760kB unevictable:0kB isolated(anon):4kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:1412kB writeback:12kB mapped:29588kB shmem:49168kB slab_reclaimable:49836kB slab_unreclaimable:60524kB kernel_stack:32176kB pagetables:50356kB unstable:0kB bounce:0kB free_pcp:1500kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:388 all_unreclaimable? no
[ 395.683137] Node 0 DMA32: 3794*4kB (ME) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15176kB
[ 399.871655] Node 0 DMA32 free:18432kB min:5564kB low:6952kB high:8344kB active_anon:99156kB inactive_anon:26780kB active_file:1150532kB inactive_file:408872kB unevictable:0kB isolated(anon):68kB isolated(file):80kB present:2080640kB managed:2021100kB mlocked:0kB dirty:3492kB writeback:0kB mapped:30924kB shmem:49168kB slab_reclaimable:54236kB slab_unreclaimable:68184kB kernel_stack:37392kB pagetables:63708kB unstable:0kB bounce:0kB free_pcp:784kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 399.890082] Node 0 DMA32: 4155*4kB (UME) 200*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18220kB
[ 408.447006] Node 0 DMA32 free:12684kB min:5564kB low:6952kB high:8344kB active_anon:74296kB inactive_anon:25960kB active_file:1086404kB inactive_file:605660kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:264kB writeback:0kB mapped:30604kB shmem:49168kB slab_reclaimable:50200kB slab_unreclaimable:45212kB kernel_stack:19184kB pagetables:34500kB unstable:0kB bounce:0kB free_pcp:740kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 408.465169] Node 0 DMA32: 2804*4kB (ME) 203*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 12840kB
[ 416.426931] Node 0 DMA32 free:15396kB min:5564kB low:6952kB high:8344kB active_anon:98836kB inactive_anon:32120kB active_file:964808kB inactive_file:666224kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:33628kB shmem:57332kB slab_reclaimable:51048kB slab_unreclaimable:51824kB kernel_stack:23328kB pagetables:41896kB unstable:0kB bounce:0kB free_pcp:988kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 416.447247] Node 0 DMA32: 5158*4kB (UME) 68*8kB (M) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 21176kB
[ 418.780159] Node 0 DMA32 free:8876kB min:5564kB low:6952kB high:8344kB active_anon:86544kB inactive_anon:31516kB active_file:965016kB inactive_file:654444kB unevictable:0kB isolated(anon):0kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:8408kB shmem:57332kB slab_reclaimable:48856kB slab_unreclaimable:61116kB kernel_stack:30224kB pagetables:48636kB unstable:0kB bounce:0kB free_pcp:980kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:260 all_unreclaimable? no
[ 418.799643] Node 0 DMA32: 3093*4kB (UME) 1043*8kB (UME) 2*16kB (M) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 20748kB
[ 428.087913] Node 0 DMA32 free:22760kB min:5564kB low:6952kB high:8344kB active_anon:94544kB inactive_anon:38936kB active_file:1013576kB inactive_file:564976kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:0kB mapped:36096kB shmem:65376kB slab_reclaimable:52196kB slab_unreclaimable:60576kB kernel_stack:29888kB pagetables:56364kB unstable:0kB bounce:0kB free_pcp:852kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 428.109005] Node 0 DMA32: 2943*4kB (UME) 458*8kB (UME) 20*16kB (UME) 11*32kB (UME) 11*64kB (ME) 4*128kB (UME) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 17324kB
[ 439.014180] Node 0 DMA32 free:11232kB min:5564kB low:6952kB high:8344kB active_anon:82868kB inactive_anon:38872kB active_file:1189912kB inactive_file:439592kB unevictable:0kB isolated(anon):12kB isolated(file):40kB present:2080640kB managed:2021100kB mlocked:0kB dirty:0kB writeback:1152kB mapped:35948kB shmem:65376kB slab_reclaimable:51224kB slab_unreclaimable:56664kB kernel_stack:27696kB pagetables:43180kB unstable:0kB bounce:0kB free_pcp:380kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 439.032446] Node 0 DMA32: 2761*4kB (UM) 28*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 11268kB
[ 441.731001] Node 0 DMA32 free:15056kB min:5564kB low:6952kB high:8344kB active_anon:90532kB inactive_anon:42716kB active_file:1204248kB inactive_file:377196kB unevictable:0kB isolated(anon):12kB isolated(file):116kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5552kB shmem:73568kB slab_reclaimable:52956kB slab_unreclaimable:68304kB kernel_stack:39936kB pagetables:47472kB unstable:0kB bounce:0kB free_pcp:624kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 441.731018] Node 0 DMA32: 3130*4kB (UM) 338*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15224kB
[ 442.070851] Node 0 DMA32 free:8852kB min:5564kB low:6952kB high:8344kB active_anon:90412kB inactive_anon:42664kB active_file:1179304kB inactive_file:371316kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5544kB shmem:73568kB slab_reclaimable:55136kB slab_unreclaimable:80080kB kernel_stack:55456kB pagetables:52692kB unstable:0kB bounce:0kB free_pcp:312kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:348 all_unreclaimable? no
[ 442.070867] Node 0 DMA32: 590*4kB (ME) 827*8kB (ME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8976kB
[ 442.245192] Node 0 DMA32 free:10832kB min:5564kB low:6952kB high:8344kB active_anon:97756kB inactive_anon:42664kB active_file:1082048kB inactive_file:417012kB unevictable:0kB isolated(anon):108kB isolated(file):268kB present:2080640kB managed:2021100kB mlocked:0kB dirty:4kB writeback:0kB mapped:5248kB shmem:73568kB slab_reclaimable:62816kB slab_unreclaimable:88964kB kernel_stack:61408kB pagetables:62908kB unstable:0kB bounce:0kB free_pcp:696kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 442.245208] Node 0 DMA32: 1902*4kB (UME) 410*8kB (UME) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10888kB
----------

Since I cannot reproduce the workload that caused December 24's natural OOM
killers, I used the following stressor to generate a similar situation.

The fileio.c program fills up all memory with file cache and tries to keep
it in memory. The fork.c program is a flood generator for order-2 allocations,
because December 24's OOM killers were triggered by copy_process(), which
involves an order-2 allocation request.

---------- fileio.c start ----------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>

int main(int argc, char *argv[])
{
    int i;
    static char buffer[4096];
    signal(SIGCHLD, SIG_IGN);
    /* Create two files, each about half of MemTotal, filled with zeroes. */
    for (i = 0; i < 2; i++) {
        int fd;
        int j;
        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
        fd = open(buffer, O_RDWR | O_CREAT, 0600);
        memset(buffer, 0, sizeof(buffer));
        for (j = 0; j < 1048576 * 1000 / 4096; j++) /* 1000 is MemTotal / 2 */
            write(fd, buffer, sizeof(buffer));
        close(fd);
    }
    /* Two readers re-read the files forever to keep them in page cache. */
    for (i = 0; i < 2; i++) {
        if (fork() == 0) {
            int fd;
            snprintf(buffer, sizeof(buffer), "/tmp/file.%u", i);
            fd = open(buffer, O_RDWR);
            memset(buffer, 0, sizeof(buffer));
            while (fd != EOF) {
                lseek(fd, 0, SEEK_SET);
                while (read(fd, buffer, sizeof(buffer)) == sizeof(buffer));
            }
            _exit(0);
        }
    }
    /* Two instances of the order-2 allocation flood (fork.c). */
    if (fork() == 0) {
        execl("./fork", "./fork", NULL);
        _exit(1);
    }
    if (fork() == 0) {
        sleep(1);
        execl("./fork", "./fork", NULL);
        _exit(1);
    }
    while (1)
        system("pidof fork | wc");
    return 0;
}
---------- fileio.c end ----------

---------- fork.c start ----------
#include <unistd.h>
#include <signal.h>

int main(int argc, char *argv[])
{
    int i;
    signal(SIGCHLD, SIG_IGN);
    while (1) {
        sleep(5);
        /* each fork() goes through copy_process(), which involves an
           order-2 allocation request */
        for (i = 0; i < 2000; i++) {
            if (fork() == 0) {
                sleep(3);
                _exit(0);
            }
        }
    }
}
---------- fork.c end ----------
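
(Note: fileio.c assumes that fork.c has been built as ./fork in the same
directory, since it launches it via execl("./fork", ...).)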

This reproducer also showed that once the OOM killer is invoked,
subsequent OOM killers tend to follow shortly because the file cache
does not decrease.

2015-12-28 14:13:36

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
>
> Since I cannot establish workload that caused December 24's natural OOM
> killers, I used the following stressor for generating similar situation.
>

I came to feel that I am observing a different problem, one which is currently
hidden behind the "too small to fail" memory-allocation rule. That is, tasks
requesting order > 0 pages continuously lose the competition when tasks
requesting order = 0 pages dominate, because reclaimed pages are stolen by the
order = 0 requesters before they can be combined into order > 0 pages (or
order > 0 pages are immediately split back into order = 0 pages by the tasks
requesting order = 0 pages).

Currently, order <= PAGE_ALLOC_COSTLY_ORDER allocations implicitly retry
unless the allocating task is chosen by the OOM killer. Therefore, even if
tasks requesting order = 2 pages lose the competition to tasks requesting
order = 0 pages, the order = 2 allocation request is implicitly retried
and the OOM killer is not invoked (though there is the problem that tasks
requesting order > 0 allocations will stall as long as tasks requesting
order = 0 pages dominate).
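
As a rough illustration of that implicit retry (simplified pseudo-C, not
the actual kernel code; the reclaim helper name is made up):

retry:
    page = get_page_from_freelist(gfp_mask, order, ...);
    if (page)
        return page;
    did_some_progress = do_reclaim_and_compaction();    /* illustrative only */
    /*
     * "too small to fail": !costly requests keep looping even without
     * progress, unless this task was itself selected by the OOM killer
     * (TIF_MEMDIE).
     */
    if (order <= PAGE_ALLOC_COSTLY_ORDER && !test_thread_flag(TIF_MEMDIE))
        goto retry;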

But this patchset introduces a limit of 16 retries. Thus, if tasks requesting
order = 2 pages lose the competition 16 times to tasks requesting
order = 0 pages, they invoke the OOM killer.
To avoid the OOM killer, we need to make sure that pages reclaimed for
order > 0 allocations will not be stolen by tasks requesting order = 0
allocations.

Is my feeling plausible?

2015-12-29 16:28:00

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu 24-12-15 21:41:19, Tetsuo Handa wrote:
> I got OOM killers while running heavy disk I/O (extracting kernel source,
> running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> Do you think these OOM killers reasonable? Too weak against fragmentation?

I will have a look at the oom report more closely early next week (I am
still in holiday mode) but it would be good to compare how the same load
behaves with the original implementation. It would be also interesting
to see how stable are the results (is there any variability in multiple
runs?).

Thanks!
--
Michal Hocko
SUSE Labs

2015-12-29 16:32:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > Do you think these OOM killers reasonable? Too weak against fragmentation?
>
> Well, current patch invokes OOM killers when more than 75% of memory is used
> for file cache (active_file: + inactive_file:). I think this is a surprising
> thing for administrators and we want to retry more harder (but not forever,
> please).

Here again, it would be good to see a comparison between
the original and the new behavior. 75% of memory in page cache is certainly
unexpected but those pages might be pinned for other reasons and thus be
unreclaimable and basically IO bound. This is hard to optimize for
without causing undesirable side effects for other loads. I will
have a look at the oom reports later but having a comparison would be
a great start.

Thanks!
--
Michal Hocko
SUSE Labs

2015-12-30 15:05:51

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> On Mon 28-12-15 21:08:56, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > I got OOM killers while running heavy disk I/O (extracting kernel source,
> > > running lxr's genxref command). (Environ: 4 CPUs / 2048MB RAM / no swap / XFS)
> > > Do you think these OOM killers reasonable? Too weak against fragmentation?
> >
> > Well, current patch invokes OOM killers when more than 75% of memory is used
> > for file cache (active_file: + inactive_file:). I think this is a surprising
> > thing for administrators and we want to retry more harder (but not forever,
> > please).
>
> Here again, it would be good to see what is the comparision between
> the original and the new behavior. 75% of a page cache is certainly
> unexpected but those pages might be pinned for other reasons and so
> unreclaimable and basically IO bound. This is hard to optimize for
> without causing any undesirable side effects for other loads. I will
> have a look at the oom reports later but having a comparision would be
> a great start.

Prior to the "mm, oom: rework oom detection" patch (the original), this stressor
never invoked the OOM killer. After this patch (the new), this stressor easily
invokes the OOM killer. In both the original and the new case, active_file: +
inactive_file: occupy nearly 75%. I think we lost the invisible retry logic for
order > 0 allocation requests.

2016-03-01 07:29:18

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Mon, 29 Feb 2016, Michal Hocko wrote:
> On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> [...]
> > Boot with mem=1G (or boot your usual way, and do something to occupy
> > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > way to gobble up most of the memory, though it's not how I've done it).
> >
> > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > kernel source tree into a tmpfs: size=2G is more than enough.
> > make defconfig there, then make -j20.
> >
> > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> >
> > Except that you'll probably need to fiddle around with that j20,
> > it's true for my laptop but not for my workstation. j20 just happens
> > to be what I've had there for years, that I now see breaking down
> > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > but it still doesn't exercise swap very much).
>
> I have tried to reproduce and failed in a virtual on my laptop. I
> will try with another host with more CPUs (because my laptop has only
> two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> (16, 10 no difference really). I was also collecting vmstat in the
> background. The compilation takes ages but the behavior seems consistent
> and stable.

Thanks a lot for giving it a go.

I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
are less than 2M big?? But probably that's just a misunderstanding
or typo somewhere.

Ignoring that, you're successfully doing a make -j20 defconfig build
in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
at all surprised that it takes ages, but I am very surprised that it
does not OOM. I suppose by rights it ought not to OOM, the built
tree occupies only a little more than 1G, so you do have enough swap;
but I wouldn't get anywhere near that myself without OOMing - I give
myself 1G of RAM (well, minus whatever the booted system takes up)
to do that build in, four times your RAM, yet in my case it OOMs.

That source tree alone occupies more than 700M, so just copying it
into your tmpfs would take a long time. I'd expect a build in 224M
RAM plus 2G of swap to take so long, that I'd be very grateful to be
OOM killed, even if there is technically enough space. Unless
perhaps it's some superfast swap that you have?

I was only suggesting to allocate hugetlb pages, if you preferred
not to reboot with artificially reduced RAM. Not an issue if you're
booting VMs.

It's true that my testing has been done on the physical machines,
no virtualization involved: I expect that accounts for some difference
between us, but as much difference as we're seeing? That's strange.

>
> If I try 900M for huge pages then I get OOMs but this happens with the
> mmotm without my oom rework patch set as well.

Right, not at all surprising.

>
> It would be great if you could retry and collect /proc/vmstat data
> around the OOM time to see what compaction did? (I was using the
> attached little program to reduce interference during OOM (no forks, the
> code locked in and the resulting file preallocated - e.g.
> read_vmstat 1s vmstat.log 10M and interrupt it by ctrl+c after the OOM
> hits).
>
> Thanks!

I'll give it a try, thanks, but not tonight.

Hugh

2016-03-01 13:38:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

[Adding Vlastimil and Joonsoo for compaction related things - this was a
large thread but the more interesting part starts with
http://lkml.kernel.org/r/[email protected]]

On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> On Mon, 29 Feb 2016, Michal Hocko wrote:
> > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > [...]
> > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > way to gobble up most of the memory, though it's not how I've done it).
> > >
> > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > make defconfig there, then make -j20.
> > >
> > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > >
> > > Except that you'll probably need to fiddle around with that j20,
> > > it's true for my laptop but not for my workstation. j20 just happens
> > > to be what I've had there for years, that I now see breaking down
> > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > but it still doesn't exercise swap very much).
> >
> > I have tried to reproduce and failed in a virtual on my laptop. I
> > will try with another host with more CPUs (because my laptop has only
> > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
> > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > (16, 10 no difference really). I was also collecting vmstat in the
> > background. The compilation takes ages but the behavior seems consistent
> > and stable.
>
> Thanks a lot for giving it a go.
>
> I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
> are less than 2M big?? But probably that's just a misunderstanding
> or typo somewhere.

A typo. 445 was from 900M test which I was doing while writing the
email. Sorry about the confusion.

> Ignoring that, you're successfully doing a make -20 defconfig build
> in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
> at all surprised that it takes ages, but I am very surprised that it
> does not OOM. I suppose by rights it ought not to OOM, the built
> tree occupies only a little more than 1G, so you do have enough swap;
> but I wouldn't get anywhere near that myself without OOMing - I give
> myself 1G of RAM (well, minus whatever the booted system takes up)
> to do that build in, four times your RAM, yet in my case it OOMs.
>
> That source tree alone occupies more than 700M, so just copying it
> into your tmpfs would take a long time.

OK, I just found out that I was cheating a bit. I was building
linux-3.7-rc5.tar.bz2 which is smaller:
$ du -sh /mnt/tmpfs/linux-3.7-rc5/
537M /mnt/tmpfs/linux-3.7-rc5/

and after the defconfig build:
$ free
total used free shared buffers cached
Mem: 1008460 941904 66556 0 5092 806760
-/+ buffers/cache: 130052 878408
Swap: 2097148 42648 2054500
$ du -sh linux-3.7-rc5/
799M linux-3.7-rc5/

Sorry about that but this is what my other tests were using and I forgot
to check. Now let's try the same with the current linus tree:
host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
$ du -sh /mnt/tmpfs/linux-4.5-rc6/
707M /mnt/tmpfs/linux-4.5-rc6/
$ free
total used free shared buffers cached
Mem: 1008460 962976 45484 0 7236 820064
-/+ buffers/cache: 135676 872784
Swap: 2097148 16 2097132
$ time make -j20 > /dev/null
drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]

real 8m36.621s
user 14m1.642s
sys 2m45.238s

so I wasn't cheating all that much...

> I'd expect a build in 224M
> RAM plus 2G of swap to take so long, that I'd be very grateful to be
> OOM killed, even if there is technically enough space. Unless
> perhaps it's some superfast swap that you have?

the swap partition is a standard qcow image stored on my SSD disk. So
I guess the IO should be quite fast. This smells like a potential
contributor because my reclaim seems to be much faster and that should
lead to a more efficient reclaim (in the scanned/reclaimed sense).
I realize I might be boring already when blaming compaction but let me
try again ;)
$ grep compact /proc/vmstat
compact_migrate_scanned 113983
compact_free_scanned 1433503
compact_isolated 134307
compact_stall 128
compact_fail 26
compact_success 102
compact_kcompatd_wake 0

So the whole load has done direct compaction only 128 times during
that test. This doesn't sound like much to me
$ grep allocstall /proc/vmstat
allocstall 1061

we entered direct reclaim much more often, but most of the load will be
order-0 so this might still be ok. So I've tried the following:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894b4219..107d444afdb1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
mode, contended_compaction);
current->flags &= ~PF_MEMALLOC;

+ if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
+ trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
+
switch (compact_result) {
case COMPACT_DEFERRED:
*deferred_compaction = true;

And the result was:
$ cat /debug/tracing/trace_pipe | tee ~/trace.log
gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1

this shows that order-2 memory pressure is not overly high in my
setup. Both attempts ended up COMPACT_SKIPPED which is interesting.

So I went back to 800M of hugetlb pages and tried again. It took ages
so I have interrupted that after one hour (there was still no OOM). The
trace log is quite interesting regardless:
$ wc -l ~/trace.log
371 /root/trace.log

$ grep compact_stall /proc/vmstat
compact_stall 190

so the compaction was still ignored more than actually invoked for
!costly allocations:
sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
190 2 1
122 2 3
59 2 4

#define COMPACT_SKIPPED 1
#define COMPACT_PARTIAL 3
#define COMPACT_COMPLETE 4

that means that compaction is not even tried in half of the cases! This
doesn't sound right to me, especially when we are talking about
<= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
then we simply rely on order-0 reclaim to automagically form higher order
blocks. This might indeed work when we retry many times but I guess this
is not a good approach. It leads to excessive reclaim and the stall
for an allocation can be really large.

One of the suspicious places is __compaction_suitable which does an order-0
watermark check (raised by 2<<order). I have put another trace_printk
there and it clearly pointed out that this was the case.
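
For reference, the check in question looks roughly like this (paraphrased
from mm/compaction.c of this era; exact details may differ):

    /* order-0 watermark, raised by the 2<<order pages needed as migration targets */
    watermark = low_wmark_pages(zone) + (2UL << order);
    if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
        return COMPACT_SKIPPED;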

So I have tried the following:
diff --git a/mm/compaction.c b/mm/compaction.c
index 4d99e1f5055c..7364e48cf69a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
alloc_flags))
return COMPACT_PARTIAL;

+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ return COMPACT_CONTINUE;
+
/*
* Watermarks for order-0 must be met for compaction. Note the 2UL.
* This is because during migration, copies of pages need to be

and retried the same test (without huge pages):
$ time make -j20 > /dev/null

real 8m46.626s
user 14m15.823s
sys 2m45.471s

the time increased but I haven't checked how stable the result is.

$ grep compact /proc/vmstat
compact_migrate_scanned 139822
compact_free_scanned 1661642
compact_isolated 139407
compact_stall 129
compact_fail 58
compact_success 71
compact_kcompatd_wake 1

$ grep allocstall /proc/vmstat
allocstall 1665

this is worse because we have scanned more pages for migration but the
overall success rate was much smaller and the direct reclaim was invoked
more. I do not have a good theory for that and will play with this some
more. Maybe other changes are needed deeper in the compaction code.

I will play with this some more but I would be really interested to hear
whether this helped Hugh with his setup. Vlastimil, Joonsoo, does this
even make sense to you?

> I was only suggesting to allocate hugetlb pages, if you preferred
> not to reboot with artificially reduced RAM. Not an issue if you're
> booting VMs.

Ohh, I see.

--
Michal Hocko
SUSE Labs

2016-03-01 14:40:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue 01-03-16 14:38:46, Michal Hocko wrote:
[...]
> the time increased but I haven't checked how stable the result is.

And those results vary a lot (even when executed from a fresh boot)
as per my further testing. Sure, it might be related to the virtual
environment, but I do not think this particular test should be used for
performance regression comparisons.
--
Michal Hocko
SUSE Labs

2016-03-01 18:14:14

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On 03/01/2016 02:38 PM, Michal Hocko wrote:
> $ grep compact /proc/vmstat
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
>
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
>
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> + if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> + trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
> switch (compact_result) {
> case COMPACT_DEFERRED:
> *deferred_compaction = true;
>
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
> gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
>
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
>
> $ grep compact_stall /proc/vmstat
> compact_stall 190
>
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> 190 2 1
> 122 2 3
> 59 2 4
>
> #define COMPACT_SKIPPED 1
> #define COMPACT_PARTIAL 3
> #define COMPACT_COMPLETE 4
>
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
>
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.

Yes, compaction is historically quite careful to avoid making low memory
conditions worse, and to avoid work if it doesn't look like the allocation can
ultimately succeed (so not having enough base pages means that compacting
them is considered pointless). This aspect of preventing non-zero-order OOMs is
somewhat unexpected :)

> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> alloc_flags))
> return COMPACT_PARTIAL;
>
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + return COMPACT_CONTINUE;
> +
> /*
> * Watermarks for order-0 must be met for compaction. Note the 2UL.
> * This is because during migration, copies of pages need to be
>
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
>
> real 8m46.626s
> user 14m15.823s
> sys 2m45.471s
>
> the time increased but I haven't checked how stable the result is.
>
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1
>
> $ grep allocstall /proc/vmstat
> allocstall 1665
>
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.

I was under the impression that checks similar to compaction_suitable() were done
also in compact_finished(), to stop compacting if memory got low due to parallel
activity. But I guess that was a patch from Joonsoo that didn't get merged.

My only other theory so far is that watermark checks fail in
__isolate_free_page() when we want to grab page(s) as migration targets. I would
suggest enabling all compaction tracepoint and the migration tracepoint. Looking
at the trace could hopefully help faster than going one trace_printk() per attempt.

Once we learn all the relevant places/checks, we can think about how to
communicate to them that this compaction attempt is "important" and should
continue as long as possible even in low-memory conditions. Maybe not just a
costly order check, but we also have alloc_flags or could add something to
compact_control, etc.

> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?
>
>> I was only suggesting to allocate hugetlb pages, if you preferred
>> not to reboot with artificially reduced RAM. Not an issue if you're
>> booting VMs.
>
> Ohh, I see.
>
>

2016-03-02 02:19:40

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> Andrew,
> could you queue this one as well, please? This is more a band aid than a
> real solution which I will be working on as soon as I am able to
> reproduce the issue but the patch should help to some degree at least.

I'm not sure that this is the way to go. See below.

>
> On Thu 25-02-16 10:23:15, Michal Hocko wrote:
> > From d09de26cee148b4d8c486943b4e8f3bd7ad6f4be Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Thu, 4 Feb 2016 14:56:59 +0100
> > Subject: [PATCH] mm, oom: protect !costly allocations some more
> >
> > should_reclaim_retry will give up retries for higher order allocations
> > if none of the eligible zones has any requested or higher order pages
> > available even if we pass the watermak check for order-0. This is done
> > because there is no guarantee that the reclaimable and currently free
> > pages will form the required order.
> >
> > This can, however, lead to situations were the high-order request (e.g.
> > order-2 required for the stack allocation during fork) will trigger
> > OOM too early - e.g. after the first reclaim/compaction round. Such a
> > system would have to be highly fragmented and the OOM killer is just a
> > matter of time but let's stick to our MAX_RECLAIM_RETRIES for the high
> > order and not costly requests to make sure we do not fail prematurely.
> >
> > This also means that we do not reset no_progress_loops at the
> > __alloc_pages_slowpath for high order allocations to guarantee a bounded
> > number of retries.
> >
> > Longterm it would be much better to communicate with the compaction
> > and retry only if the compaction considers it meaningfull.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/page_alloc.c | 20 ++++++++++++++++----
> > 1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 269a04f20927..f05aca36469b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3106,6 +3106,18 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > }
> > }
> >
> > + /*
> > + * OK, so the watermak check has failed. Make sure we do all the
> > + * retries for !costly high order requests and hope that multiple
> > + * runs of compaction will generate some high order ones for us.
> > + *
> > + * XXX: ideally we should teach the compaction to try _really_ hard
> > + * if we are in the retry path - something like priority 0 for the
> > + * reclaim
> > + */
> > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > + return true;
> > +
> > return false;

This doesn't seem like a proper fix. Checking the watermark with a high order
has a different meaning: whether a high order page exists or not. That isn't
what we want here. So, the following fix is needed.

The 'if (order)' check isn't strictly needed; it is only there to clarify the
meaning of this fix. You can remove it.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1993894..8c80375 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
return false;

+ /* To check whether compaction is available or not */
+ if (order)
+ order = 0;
+
/*
* Keep reclaiming pages while there is a chance this will lead
* somewhere. If none of the target zones can satisfy our allocation

> > }
> >
> > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > goto noretry;
> >
> > /*
> > - * Costly allocations might have made a progress but this doesn't mean
> > - * their order will become available due to high fragmentation so do
> > - * not reset the no progress counter for them
> > + * High order allocations might have made a progress but this doesn't
> > + * mean their order will become available due to high fragmentation so
> > + * do not reset the no progress counter for them
> > */
> > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > + if (did_some_progress && !order)
> > no_progress_loops = 0;
> > else
> > no_progress_loops++;

This unconditionally increases no_progress_loops for high order
allocations, so, after 16 iterations, they will fail. If compaction isn't
enabled in Kconfig, 16 reclaim attempts would not be sufficient
to make a high order page. Should we consider this case as well?

Thanks.

2016-03-02 02:28:35

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space. Unless
> > perhaps it's some superfast swap that you have?
>
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).

Hmm... This looks like one potential culprit. If a page is under
writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
In this case, such a page acts as a pinned page and prevents compaction.
It'd be better to check whether changing 'migration_mode = MIGRATE_SYNC' when
'no_progress_loops > XXX' helps in this situation.
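
Something along these lines in the allocator slow path, purely as an
illustrative sketch with an arbitrary threshold (not a tested patch):

    /* escalate to fully synchronous migration once reclaim keeps failing */
    if (no_progress_loops > 4 /* XXX: threshold to be tuned */)
        migration_mode = MIGRATE_SYNC;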

Thanks.

2016-03-02 02:54:50

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> On 03/01/2016 02:38 PM, Michal Hocko wrote:
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 113983
> >compact_free_scanned 1433503
> >compact_isolated 134307
> >compact_stall 128
> >compact_fail 26
> >compact_success 102
> >compact_kcompatd_wake 0
> >
> >So the whole load has done the direct compaction only 128 times during
> >that test. This doesn't sound much to me
> >$ grep allocstall /proc/vmstat
> >allocstall 1061
> >
> >we entered the direct reclaim much more but most of the load will be
> >order-0 so this might be still ok. So I've tried the following:
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 1993894b4219..107d444afdb1 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > mode, contended_compaction);
> > current->flags &= ~PF_MEMALLOC;
> >
> >+ if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> >+ trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> >+
> > switch (compact_result) {
> > case COMPACT_DEFERRED:
> > *deferred_compaction = true;
> >
> >And the result was:
> >$ cat /debug/tracing/trace_pipe | tee ~/trace.log
> > gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> > gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >
> >this shows that order-2 memory pressure is not overly high in my
> >setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> >
> >So I went back to 800M of hugetlb pages and tried again. It took ages
> >so I have interrupted that after one hour (there was still no OOM). The
> >trace log is quite interesting regardless:
> >$ wc -l ~/trace.log
> >371 /root/trace.log
> >
> >$ grep compact_stall /proc/vmstat
> >compact_stall 190
> >
> >so the compaction was still ignored more than actually invoked for
> >!costly allocations:
> >sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> > 190 2 1
> > 122 2 3
> > 59 2 4
> >
> >#define COMPACT_SKIPPED 1
> >#define COMPACT_PARTIAL 3
> >#define COMPACT_COMPLETE 4
> >
> >that means that compaction is even not tried in half cases! This
> >doesn't sounds right to me, especially when we are talking about
> ><= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> >then we simply rely on the order-0 reclaim to automagically form higher
> >blocks. This might indeed work when we retry many times but I guess this
> >is not a good approach. It leads to a excessive reclaim and the stall
> >for allocation can be really large.
> >
> >One of the suspicious places is __compaction_suitable which does order-0
> >watermark check (increased by 2<<order). I have put another trace_printk
> >there and it clearly pointed out this was the case.
>
> Yes, compaction is historically quite careful to avoid making low
> memory conditions worse, and to prevent work if it doesn't look like
> it can ultimately succeed the allocation (so having not enough base
> pages means that compacting them is considered pointless). This
> aspect of preventing non-zero-order OOMs is somewhat unexpected :)

It's better not to assume that compaction will succeed all the time.
Compaction has some limitations so it sometimes fails.
For example, in a lowmem situation, it only scans small parts of memory
and if that part is fragmented by non-movable pages, compaction will fail.
And compaction will defer requests up to 64 times if successive
compaction failures happened before.

Depending on compaction heavily is the right direction to go but I think
that it's not ready for now. More reclaim would relieve the problem.

I tried to fix this situation but have not finished yet.

http://thread.gmane.org/gmane.linux.kernel.mm/142364
https://lkml.org/lkml/2015/8/23/182


> >So I have tried the following:
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index 4d99e1f5055c..7364e48cf69a 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> > alloc_flags))
> > return COMPACT_PARTIAL;
> >
> >+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
> >+ return COMPACT_CONTINUE;
> >+
> > /*
> > * Watermarks for order-0 must be met for compaction. Note the 2UL.
> > * This is because during migration, copies of pages need to be
> >
> >and retried the same test (without huge pages):
> >$ time make -j20 > /dev/null
> >
> >real 8m46.626s
> >user 14m15.823s
> >sys 2m45.471s
> >
> >the time increased but I haven't checked how stable the result is.
> >
> >$ grep compact /proc/vmstat
> >compact_migrate_scanned 139822
> >compact_free_scanned 1661642
> >compact_isolated 139407
> >compact_stall 129
> >compact_fail 58
> >compact_success 71
> >compact_kcompatd_wake 1
> >
> >$ grep allocstall /proc/vmstat
> >allocstall 1665
> >
> >this is worse because we have scanned more pages for migration but the
> >overall success rate was much smaller and the direct reclaim was invoked
> >more. I do not have a good theory for that and will play with this some
> >more. Maybe other changes are needed deeper in the compaction code.
>
> I was under impression that similar checks to compaction_suitable()
> were done also in compact_finished(), to stop compacting if memory
> got low due to parallel activity. But I guess it was a patch from
> Joonsoo that didn't get merged.
>
> My only other theory so far is that watermark checks fail in
> __isolate_free_page() when we want to grab page(s) as migration
> targets. I would suggest enabling all compaction tracepoint and the
> migration tracepoint. Looking at the trace could hopefully help
> faster than going one trace_printk() per attempt.

Agreed. It's best thing to do now.

Thanks.

>
> Once we learn all the relevant places/checks, we can think about how
> to communicate to them that this compaction attempt is "important"
> and should continue as long as possible even in low-memory
> conditions. Maybe not just a costly order check, but we also have
> alloc_flags or could add something to compact_control, etc.
>
> >I will play with this some more but I would be really interested to hear
> >whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> >even make sense to you?
> >
> >>I was only suggesting to allocate hugetlb pages, if you preferred
> >>not to reboot with artificially reduced RAM. Not an issue if you're
> >>booting VMs.
> >
> >Ohh, I see.
> >
> >
>

2016-03-02 09:51:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
[...]
> > > + /*
> > > + * OK, so the watermak check has failed. Make sure we do all the
> > > + * retries for !costly high order requests and hope that multiple
> > > + * runs of compaction will generate some high order ones for us.
> > > + *
> > > + * XXX: ideally we should teach the compaction to try _really_ hard
> > > + * if we are in the retry path - something like priority 0 for the
> > > + * reclaim
> > > + */
> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > + return true;
> > > +
> > > return false;
>
> This seems not a proper fix. Checking watermark with high order has
> another meaning that there is high order page or not. This isn't
> what we want here.

Why not? Why should we retry the reclaim if we do not have >=order page
available? Reclaim itself doesn't guarantee any of the freed pages will
form the requested order. The ordering on the LRU lists is pretty much
random wrt. pfn ordering. On the other hand if we have a page available
which is just hidden by watermarks then it makes perfect sense to retry
and free even order-0 pages.
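
For context, the heuristic being debated works roughly like this (a
simplified sketch of should_reclaim_retry(), not the exact code):

    for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                    ac->high_zoneidx, ac->nodemask) {
        unsigned long available;

        available = zone_reclaimable_pages(zone);
        /* back off the longer we loop without making progress */
        available -= DIV_ROUND_UP(no_progress_loops * available,
                        MAX_RECLAIM_RETRIES);
        available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

        if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
                    ac->classzone_idx, alloc_flags, available))
            return true;
    }
    return false;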

> So, following fix is needed.

> 'if (order)' check isn't needed. It is used to clarify the meaning of
> this fix. You can remove it.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894..8c80375 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> return false;
>
> + /* To check whether compaction is available or not */
> + if (order)
> + order = 0;
> +

This would enforce the order 0 wmark check which is IMHO not correct as
per above.

> /*
> * Keep reclaiming pages while there is a chance this will lead
> * somewhere. If none of the target zones can satisfy our allocation
>
> > > }
> > >
> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > goto noretry;
> > >
> > > /*
> > > - * Costly allocations might have made a progress but this doesn't mean
> > > - * their order will become available due to high fragmentation so do
> > > - * not reset the no progress counter for them
> > > + * High order allocations might have made a progress but this doesn't
> > > + * mean their order will become available due to high fragmentation so
> > > + * do not reset the no progress counter for them
> > > */
> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > + if (did_some_progress && !order)
> > > no_progress_loops = 0;
> > > else
> > > no_progress_loops++;
>
> This unconditionally increases no_progress_loops for high order
> allocation, so, after 16 iterations, it will fail. If compaction isn't
> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> to make high order page. Should we consider this case also?

How many retries would help? I do not think any number will work
reliably. Configurations without compaction enabled are asking for
problems by definition IMHO. Relying on order-0 reclaim for high order
allocations simply cannot work.

--
Michal Hocko
SUSE Labs

2016-03-02 12:37:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
[...]
> > Yes, compaction is historically quite careful to avoid making low
> > memory conditions worse, and to prevent work if it doesn't look like
> > it can ultimately succeed the allocation (so having not enough base
> > pages means that compacting them is considered pointless). This
> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
>
> It's better not to assume that compaction would succeed all the times.
> Compaction has some limitations so it sometimes fails.
> For example, in lowmem situation, it only scans small parts of memory
> and if that part is fragmented by non-movable page, compaction would fail.
> And, compaction would defer requests 64 times at maximum if successive
> compaction failure happens before.
>
> Depending on compaction heavily is right direction to go but I think
> that it's not ready for now. More reclaim would relieve problem.

I really fail to see why. The reclaimable memory can be migrated as
well, no? Relying on order-0 reclaim only makes sense to get over
the watermarks.

--
Michal Hocko
SUSE Labs

2016-03-02 12:39:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 02-03-16 11:28:46, Joonsoo Kim wrote:
> On Tue, Mar 01, 2016 at 02:38:46PM +0100, Michal Hocko wrote:
> > > I'd expect a build in 224M
> > > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > > OOM killed, even if there is technically enough space. Unless
> > > perhaps it's some superfast swap that you have?
> >
> > the swap partition is a standard qcow image stored on my SSD disk. So
> > I guess the IO should be quite fast. This smells like a potential
> > contributor because my reclaim seems to be much faster and that should
> > lead to a more efficient reclaim (in the scanned/reclaimed sense).
>
> Hmm... This looks like one of potential culprit. If page is in
> writeback, it can't be migrated by compaction with MIGRATE_SYNC_LIGHT.
> In this case, this page works as pinned page and prevent compaction.
> It'd be better to check that changing 'migration_mode = MIGRATE_SYNC' at
> 'no_progress_loops > XXX' will help in this situation.

Would it make sense to use MIGRATE_SYNC for !costly allocations by
default?

--
Michal Hocko
SUSE Labs

2016-03-02 13:22:38

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On 03/02/2016 01:24 PM, Michal Hocko wrote:
> On Tue 01-03-16 19:14:08, Vlastimil Babka wrote:
>>
>> I was under impression that similar checks to compaction_suitable() were
>> done also in compact_finished(), to stop compacting if memory got low due to
>> parallel activity. But I guess it was a patch from Joonsoo that didn't get
>> merged.
>>
>> My only other theory so far is that watermark checks fail in
>> __isolate_free_page() when we want to grab page(s) as migration targets.
>
> yes this certainly contributes to the problem and triggered in my case a
> lot:
> $ grep __isolate_free_page trace.log | wc -l
> 181
> $ grep __alloc_pages_direct_compact: trace.log | wc -l
> 7
>
>> I would suggest enabling all compaction tracepoint and the migration
>> tracepoint. Looking at the trace could hopefully help faster than
>> going one trace_printk() per attempt.
>
> OK, here we go with both watermarks checks removed and hopefully all the
> compaction related tracepoints enabled:
> echo 1 > /debug/tracing/events/compaction/enable
> echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

The trace shows only 4 direct compaction attempts with order=2. The rest
is order=9, i.e. THP, which has little chance of success under such
pressure, hence those failures and defers. The few order=2 attempts
all appear successful (defer_reset is called).

So it seems your system is mostly fine with just reclaim, there's
little need for order-2 compaction, and that's also why you can't
reproduce the OOMs. So I'm afraid we'll learn nothing here, and it looks
like Hugh will have to try those watermark check adjustments/removals
and/or provide the same kind of trace.

2016-03-02 13:32:12

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
>> > > + /*
>> > > + * OK, so the watermak check has failed. Make sure we do all the
>> > > + * retries for !costly high order requests and hope that multiple
>> > > + * runs of compaction will generate some high order ones for us.
>> > > + *
>> > > + * XXX: ideally we should teach the compaction to try _really_ hard
>> > > + * if we are in the retry path - something like priority 0 for the
>> > > + * reclaim
>> > > + */
>> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > + return true;
>> > > +
>> > > return false;
>>
>> This seems not a proper fix. Checking watermark with high order has
>> another meaning that there is high order page or not. This isn't
>> what we want here.
>
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.

If we have a >= order page available, we would not reach here. We would
just allocate it.

And, should_reclaim_retry() is not just for reclaim. It is also for
retrying compaction.

That watermark check is there to decide whether further reclaim/compaction
is meaningful. And, for the high order case, if there is enough free memory,
compaction could make a high order page even if there is no high order
page now.

Adding freeable memory and checking the watermark with it doesn't help
in this case because the number of high order pages isn't changed by it.

I have only done a quick review of your patches, so maybe I am wrong.
Am I missing something?

>> So, following fix is needed.
>
>> 'if (order)' check isn't needed. It is used to clarify the meaning of
>> this fix. You can remove it.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 1993894..8c80375 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>> if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>> return false;
>>
>> + /* To check whether compaction is available or not */
>> + if (order)
>> + order = 0;
>> +
>
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
>
>> /*
>> * Keep reclaiming pages while there is a chance this will lead
>> * somewhere. If none of the target zones can satisfy our allocation
>>
>> > > }
>> > >
>> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> > > goto noretry;
>> > >
>> > > /*
>> > > - * Costly allocations might have made a progress but this doesn't mean
>> > > - * their order will become available due to high fragmentation so do
>> > > - * not reset the no progress counter for them
>> > > + * High order allocations might have made a progress but this doesn't
>> > > + * mean their order will become available due to high fragmentation so
>> > > + * do not reset the no progress counter for them
>> > > */
>> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> > > + if (did_some_progress && !order)
>> > > no_progress_loops = 0;
>> > > else
>> > > no_progress_loops++;
>>
>> This unconditionally increases no_progress_loops for high order
>> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> to make high order page. Should we consider this case also?
>
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

At least, reset no_progress_loops when did_some_progress is non-zero. High
order allocations up to PAGE_ALLOC_COSTLY_ORDER are as important
as order 0, and reclaiming something would increase the probability of
compaction success. Why do we limit retries to 16 with no
evidence that making a high order page is actually impossible?

And, 16 retries does not look good to me because compaction could defer
the actual work up to 64 times.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs
>

2016-03-02 14:06:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> > [...]
> >> > > + /*
> >> > > + * OK, so the watermak check has failed. Make sure we do all the
> >> > > + * retries for !costly high order requests and hope that multiple
> >> > > + * runs of compaction will generate some high order ones for us.
> >> > > + *
> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
> >> > > + * if we are in the retry path - something like priority 0 for the
> >> > > + * reclaim
> >> > > + */
> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > + return true;
> >> > > +
> >> > > return false;
> >>
> >> This seems not a proper fix. Checking watermark with high order has
> >> another meaning that there is high order page or not. This isn't
> >> what we want here.
> >
> > Why not? Why should we retry the reclaim if we do not have >=order page
> > available? Reclaim itself doesn't guarantee any of the freed pages will
> > form the requested order. The ordering on the LRU lists is pretty much
> > random wrt. pfn ordering. On the other hand if we have a page available
> > which is just hidden by watermarks then it makes perfect sense to retry
> > and free even order-0 pages.
>
> If we have >= order page available, we would not reach here. We would
> just allocate it.

not really, we can still be under the low watermark. Note that the
target for the should_reclaim_retry watermark check includes also the
reclaimable memory.

> And, should_reclaim_retry() is not just for reclaim. It is also for
> retrying compaction.
>
> That watermark check is to check further reclaim/compaction
> is meaningful. And, for high order case, if there is enough freepage,
> compaction could make high order page even if there is no high order
> page now.
>
> Adding freeable memory and checking watermark with it doesn't help
> in this case because number of high order page isn't changed with it.
>
> I just did quick review to your patches so maybe I am wrong.
> Am I missing something?

The core idea behind should_reclaim_retry is to check whether reclaiming
all the pages would help to get over the watermark and whether there
is at least one >= order page. Then it really makes sense to retry. As
compaction has already been performed before this is called, we should
have created some high order pages already. The decay guarantees that we
eventually trigger the OOM killer after some attempts.

If the compaction can back off and ignore our requests then we are
screwed of course and that should be addressed imho at the compaction
layer. Maybe we can tell the compaction to try harder but I would like
to understand why this shouldn't be a default behavior for !costly
orders.
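
For illustration, a condensed sketch of that retry decision, assuming the
shape of the proposed should_reclaim_retry(); the gradual backoff of the
reclaimable target and other details are simplified away:

	/*
	 * Condensed sketch, not the exact patch: pretend all reclaimable
	 * pages were already freed and ask whether the min watermark plus
	 * at least one page of the requested order would then be available.
	 */
	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
					ac->high_zoneidx, ac->nodemask) {
		unsigned long available;

		available = zone_reclaimable_pages(zone);
		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);

		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
					ac->classzone_idx, alloc_flags,
					available))
			return true;	/* another reclaim/compaction round can help */
	}
	return false;			/* give up and head towards the OOM killer */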

[...]
> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >> > > goto noretry;
> >> > >
> >> > > /*
> >> > > - * Costly allocations might have made a progress but this doesn't mean
> >> > > - * their order will become available due to high fragmentation so do
> >> > > - * not reset the no progress counter for them
> >> > > + * High order allocations might have made a progress but this doesn't
> >> > > + * mean their order will become available due to high fragmentation so
> >> > > + * do not reset the no progress counter for them
> >> > > */
> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> > > + if (did_some_progress && !order)
> >> > > no_progress_loops = 0;
> >> > > else
> >> > > no_progress_loops++;
> >>
> >> This unconditionally increases no_progress_loops for high order
> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> >> to make high order page. Should we consider this case also?
> >
> > How many retries would help? I do not think any number will work
> > reliably. Configurations without compaction enabled are asking for
> > problems by definition IMHO. Relying on order-0 reclaim for high order
> > allocations simply cannot work.
>
> At least, reset no_progress_loops when did_some_progress. High
> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> as order 0. And, reclaim something would increase probability of
> compaction success.

This is something I still do not understand. Why would reclaiming
random order-0 pages help compaction? Could you clarify this please?

> Why do we limit retry as 16 times with no evidence of potential
> impossibility of making high order page?

If we tried to compact 16 times without any progress then this sounds
like sufficient evidence to me. Well, this number is somewhat arbitrary
but the main point is to limit it to _some_ number; if we can show that
a larger value would work better then we can update it of course.

> And, 16 retry looks not good to me because compaction could defer
> actual doing up to 64 times.

OK, this is something that needs to be handled in a better way. The
primary question would be why to defer the compaction for <=
PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
why it makes sense for the best effort mode of operation, but !costly
orders should be trying much harder as they are effectively nofail, no?
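
For reference, a condensed sketch of the deferral bookkeeping being referred
to, based on my reading of defer_compaction()/compaction_deferred() in
mm/compaction.c; tracing and the counter-capping details are trimmed:

	/* Each failure widens the deferral window, capped at 1 << 6 = 64. */
	#define COMPACT_MAX_DEFER_SHIFT 6

	static void defer_compaction_sketch(struct zone *zone, int order)
	{
		zone->compact_considered = 0;
		zone->compact_defer_shift++;
		if (order < zone->compact_order_failed)
			zone->compact_order_failed = order;
		if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
			zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
	}

	/* Requests are skipped until enough of them have been "considered". */
	static bool compaction_deferred_sketch(struct zone *zone, int order)
	{
		unsigned long defer_limit = 1UL << zone->compact_defer_shift;

		if (order < zone->compact_order_failed)
			return false;
		if (++zone->compact_considered >= defer_limit)
			return false;	/* give compaction another try */
		return true;		/* still deferred */
	}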

Thanks!
--
Michal Hocko
SUSE Labs

2016-03-02 14:06:32

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

2016-03-02 21:37 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 02-03-16 11:55:07, Joonsoo Kim wrote:
>> On Tue, Mar 01, 2016 at 07:14:08PM +0100, Vlastimil Babka wrote:
> [...]
>> > Yes, compaction is historically quite careful to avoid making low
>> > memory conditions worse, and to prevent work if it doesn't look like
>> > it can ultimately succeed the allocation (so having not enough base
>> > pages means that compacting them is considered pointless). This
>> > aspect of preventing non-zero-order OOMs is somewhat unexpected :)
>>
>> It's better not to assume that compaction would succeed all the times.
>> Compaction has some limitations so it sometimes fails.
>> For example, in lowmem situation, it only scans small parts of memory
>> and if that part is fragmented by non-movable page, compaction would fail.
>> And, compaction would defer requests 64 times at maximum if successive
>> compaction failure happens before.
>>
>> Depending on compaction heavily is right direction to go but I think
>> that it's not ready for now. More reclaim would relieve problem.
>
> I really fail to see why. The reclaimable memory can be migrated as
> well, no? Relying on the order-0 reclaim makes only sense to get over
> wmarks.

The link I attached to my previous reply mentioned a limitation of the current
compaction implementation. Briefly speaking, it does not scan the whole range
of memory due to an algorithmic limitation, so even if there is reclaimable
memory that could also be migrated, compaction could fail.

There is no such limitation on reclaim, and that's why I think that compaction
is not ready for now.

Thanks.

2016-03-02 14:34:34

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

2016-03-02 23:06 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
>> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> > [...]
>> >> > > + /*
>> >> > > + * OK, so the watermak check has failed. Make sure we do all the
>> >> > > + * retries for !costly high order requests and hope that multiple
>> >> > > + * runs of compaction will generate some high order ones for us.
>> >> > > + *
>> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
>> >> > > + * if we are in the retry path - something like priority 0 for the
>> >> > > + * reclaim
>> >> > > + */
>> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > + return true;
>> >> > > +
>> >> > > return false;
>> >>
>> >> This seems not a proper fix. Checking watermark with high order has
>> >> another meaning that there is high order page or not. This isn't
>> >> what we want here.
>> >
>> > Why not? Why should we retry the reclaim if we do not have >=order page
>> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> > form the requested order. The ordering on the LRU lists is pretty much
>> > random wrt. pfn ordering. On the other hand if we have a page available
>> > which is just hidden by watermarks then it makes perfect sense to retry
>> > and free even order-0 pages.
>>
>> If we have >= order page available, we would not reach here. We would
>> just allocate it.
>
> not really, we can still be under the low watermark. Note that the

you mean min watermark?

> target for the should_reclaim_retry watermark check includes also the
> reclaimable memory.

I guess that the usual case of high order allocation failure has enough free pages.

>> And, should_reclaim_retry() is not just for reclaim. It is also for
>> retrying compaction.
>>
>> That watermark check is to check further reclaim/compaction
>> is meaningful. And, for high order case, if there is enough freepage,
>> compaction could make high order page even if there is no high order
>> page now.
>>
>> Adding freeable memory and checking watermark with it doesn't help
>> in this case because number of high order page isn't changed with it.
>>
>> I just did quick review to your patches so maybe I am wrong.
>> Am I missing something?
>
> The core idea behind should_reclaim_retry is to check whether the
> reclaiming all the pages would help to get over the watermark and there
> is at least one >= order page. Then it really makes sense to retry. As

How can you judge that reclaiming all the pages would help to ensure
there is at least one >= order page?

> the compaction has already was performed before this is called we should
> have created some high order pages already. The decay guarantees that we

Not really. Compaction could fail.

> eventually trigger the OOM killer after some attempts.

Yep.

> If the compaction can backoff and ignore our requests then we are
> screwed of course and that should be addressed imho at the compaction
> layer. Maybe we can tell the compaction to try harder but I would like
> to understand why this shouldn't be a default behavior for !costly
> orders.

Yes, I agree that.

> [...]
>> >> > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>> >> > > goto noretry;
>> >> > >
>> >> > > /*
>> >> > > - * Costly allocations might have made a progress but this doesn't mean
>> >> > > - * their order will become available due to high fragmentation so do
>> >> > > - * not reset the no progress counter for them
>> >> > > + * High order allocations might have made a progress but this doesn't
>> >> > > + * mean their order will become available due to high fragmentation so
>> >> > > + * do not reset the no progress counter for them
>> >> > > */
>> >> > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> > > + if (did_some_progress && !order)
>> >> > > no_progress_loops = 0;
>> >> > > else
>> >> > > no_progress_loops++;
>> >>
>> >> This unconditionally increases no_progress_loops for high order
>> >> allocation, so, after 16 iterations, it will fail. If compaction isn't
>> >> enabled in Kconfig, 16 times reclaim attempt would not be sufficient
>> >> to make high order page. Should we consider this case also?
>> >
>> > How many retries would help? I do not think any number will work
>> > reliably. Configurations without compaction enabled are asking for
>> > problems by definition IMHO. Relying on order-0 reclaim for high order
>> > allocations simply cannot work.
>>
>> At least, reset no_progress_loops when did_some_progress. High
>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> as order 0. And, reclaim something would increase probability of
>> compaction success.
>
> This is something I still do not understand. Why would reclaiming
> random order-0 pages help compaction? Could you clarify this please?

I can just give the simple version; please check the link in my other reply.
Compaction could scan a wider range of memory if we have more free pages.
This is due to an algorithmic limitation. So, anyway, reclaiming random
order-0 pages helps compaction.

>> Why do we limit retry as 16 times with no evidence of potential
>> impossibility of making high order page?
>
> If we tried to compact 16 times without any progress then this sounds
> like a sufficient evidence to me. Well, this number is somehow arbitrary
> but the main point is to limit it to _some_ number, if we can show that
> a larger value would work better then we can update it of course.

My argument is about your band-aid patch.
My point is: why is the retry counter for order-0 reset if there is some
progress, but the retry counter for orders up to costly isn't reset even if
there is some progress?

>> And, 16 retry looks not good to me because compaction could defer
>> actual doing up to 64 times.
>
> OK, this is something that needs to be handled in a better way. The
> primary question would be why to defer the compaction for <=
> PAGE_ALLOC_COSTLY_ORDER requests in the first place. I guess I do see
> why it makes sense it for the best effort mode of operation but !costly
> orders should be trying much harder as they are nofail, no?

Makes sense.

Thanks.

2016-03-02 15:01:21

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed, Mar 02, 2016 at 10:50:56AM +0100, Michal Hocko wrote:
> On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> > On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> [...]
> > > > + /*
> > > > + * OK, so the watermak check has failed. Make sure we do all the
> > > > + * retries for !costly high order requests and hope that multiple
> > > > + * runs of compaction will generate some high order ones for us.
> > > > + *
> > > > + * XXX: ideally we should teach the compaction to try _really_ hard
> > > > + * if we are in the retry path - something like priority 0 for the
> > > > + * reclaim
> > > > + */
> > > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > + return true;
> > > > +
> > > > return false;
> >
> > This seems not a proper fix. Checking watermark with high order has
> > another meaning that there is high order page or not. This isn't
> > what we want here.
>
> Why not? Why should we retry the reclaim if we do not have >=order page
> available? Reclaim itself doesn't guarantee any of the freed pages will
> form the requested order. The ordering on the LRU lists is pretty much
> random wrt. pfn ordering. On the other hand if we have a page available
> which is just hidden by watermarks then it makes perfect sense to retry
> and free even order-0 pages.
>
> > So, following fix is needed.
>
> > 'if (order)' check isn't needed. It is used to clarify the meaning of
> > this fix. You can remove it.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1993894..8c80375 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3125,6 +3125,10 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> > return false;
> >
> > + /* To check whether compaction is available or not */
> > + if (order)
> > + order = 0;
> > +
>
> This would enforce the order 0 wmark check which is IMHO not correct as
> per above.
>
> > /*
> > * Keep reclaiming pages while there is a chance this will lead
> > * somewhere. If none of the target zones can satisfy our allocation
> >
> > > > }
> > > >
> > > > @@ -3281,11 +3293,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > > > goto noretry;
> > > >
> > > > /*
> > > > - * Costly allocations might have made a progress but this doesn't mean
> > > > - * their order will become available due to high fragmentation so do
> > > > - * not reset the no progress counter for them
> > > > + * High order allocations might have made a progress but this doesn't
> > > > + * mean their order will become available due to high fragmentation so
> > > > + * do not reset the no progress counter for them
> > > > */
> > > > - if (did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER)
> > > > + if (did_some_progress && !order)
> > > > no_progress_loops = 0;
> > > > else
> > > > no_progress_loops++;
> >
> > This unconditionally increases no_progress_loops for high order
> > allocation, so, after 16 iterations, it will fail. If compaction isn't
> > enabled in Kconfig, 16 times reclaim attempt would not be sufficient
> > to make high order page. Should we consider this case also?
>
> How many retries would help? I do not think any number will work
> reliably. Configurations without compaction enabled are asking for
> problems by definition IMHO. Relying on order-0 reclaim for high order
> allocations simply cannot work.

I left the compaction code a long time ago, so a super hero might have made it
perfect by now, but I don't think that dream has come true yet, and I believe
any algorithm has drawbacks, so we end up relying on a fallback approach
in case compaction does not work correctly.

My suggestion is to reintroduce *lumpy reclaim* and have it kick in only when
compaction has given up for some reason. That would be better than relying on
a random number of reclaim retries.

2016-03-03 09:26:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> 2016-03-02 23:06 GMT+09:00 Michal Hocko <[email protected]>:
> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> >> > [...]
> >> >> > > + /*
> >> >> > > + * OK, so the watermak check has failed. Make sure we do all the
> >> >> > > + * retries for !costly high order requests and hope that multiple
> >> >> > > + * runs of compaction will generate some high order ones for us.
> >> >> > > + *
> >> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
> >> >> > > + * if we are in the retry path - something like priority 0 for the
> >> >> > > + * reclaim
> >> >> > > + */
> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> >> > > + return true;
> >> >> > > +
> >> >> > > return false;
> >> >>
> >> >> This seems not a proper fix. Checking watermark with high order has
> >> >> another meaning that there is high order page or not. This isn't
> >> >> what we want here.
> >> >
> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> >> > form the requested order. The ordering on the LRU lists is pretty much
> >> > random wrt. pfn ordering. On the other hand if we have a page available
> >> > which is just hidden by watermarks then it makes perfect sense to retry
> >> > and free even order-0 pages.
> >>
> >> If we have >= order page available, we would not reach here. We would
> >> just allocate it.
> >
> > not really, we can still be under the low watermark. Note that the
>
> you mean min watermark?

ohh, right...

> > target for the should_reclaim_retry watermark check includes also the
> > reclaimable memory.
>
> I guess that usual case for high order allocation failure has enough freepage.

Not sure I understand what you mean here, but I wouldn't be surprised if a high
order allocation failed even with enough free pages. And that is exactly why I
am claiming that reclaiming more pages is no free ticket to high order
pages.

[...]
> >> I just did quick review to your patches so maybe I am wrong.
> >> Am I missing something?
> >
> > The core idea behind should_reclaim_retry is to check whether the
> > reclaiming all the pages would help to get over the watermark and there
> > is at least one >= order page. Then it really makes sense to retry. As
>
> How you can judge that reclaiming all the pages would help to check
> there is at least one >= order page?

Again, not sure I understand you here. __zone_watermark_ok checks both
the wmark and the availability of a page of sufficient order. While increased
free_pages (which includes reclaimable pages as well) will tell us
whether we have a chance to get over the min wmark, the order check will
tell us we have something to allocate from after we reach the min wmark.
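
For illustration, a stripped-down sketch of those two parts of the check,
roughly paraphrasing __zone_watermark_ok(); the ALLOC_HIGH/ALLOC_HARDER
adjustments and the per-migratetype checks are left out:

	/* Quantity check: enough free (or free + reclaimable) pages at all? */
	if (free_pages <= mark + z->lowmem_reserve[classzone_idx])
		return false;

	/* Order-0 needs nothing more than the quantity check. */
	if (!order)
		return true;

	/* Structural check: is there any free block of at least 'order'? */
	for (o = order; o < MAX_ORDER; o++) {
		if (z->free_area[o].nr_free)
			return true;
	}
	return false;	/* enough memory in total, but too fragmented */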

> > the compaction has already was performed before this is called we should
> > have created some high order pages already. The decay guarantees that we
>
> Not really. Compaction could fail.

Yes, it could have failed. But what is the point of retrying endlessly then?

[...]
> >> At least, reset no_progress_loops when did_some_progress. High
> >> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >> as order 0. And, reclaim something would increase probability of
> >> compaction success.
> >
> > This is something I still do not understand. Why would reclaiming
> > random order-0 pages help compaction? Could you clarify this please?
>
> I just can tell simple version. Please check the link from me on another reply.
> Compaction could scan more range of memory if we have more freepage.
> This is due to algorithm limitation. Anyway, so, reclaiming random
> order-0 pages helps compaction.

I will have a look at that code but this just doesn't make any sense.
Compaction should be reshuffling pages; this shouldn't be a function
of free memory.

> >> Why do we limit retry as 16 times with no evidence of potential
> >> impossibility of making high order page?
> >
> > If we tried to compact 16 times without any progress then this sounds
> > like a sufficient evidence to me. Well, this number is somehow arbitrary
> > but the main point is to limit it to _some_ number, if we can show that
> > a larger value would work better then we can update it of course.
>
> My arguing is for your band aid patch.
> My point is that why retry count for order-0 is reset if there is some progress,
> but, retry counter for order up to costly isn't reset even if there is
> some progress

Because we know that order-0 requests have a chance to proceed if we keep
reclaiming order-0 pages, while this is not true for order > 0. If we did
reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
then we would be back to the zone_reclaimable heuristic. Why? Because
order-0 reclaim progress will keep !costly requests in the reclaim loop while
compaction still might not make any progress. So we either have to fail
when __zone_watermark_ok fails for the order (which turned out to be
too easy to trigger) or have a fixed amount of retries regardless of the
watermark check result. We cannot relax both unless we have other
measures in place.

Sure, we can be more intelligent and reset the counter if the
feedback from compaction is optimistic and we are making some
progress. This would be less hackish and the XXX comment points in
that direction. For now I would like this to catch most loads reasonably
and to build better heuristics on top. I would like to do as much as
possible to close the obvious regressions, but I guess we have to expect
there will be cases where the OOM killer fires and hasn't before, and vice
versa.

--
Michal Hocko
SUSE Labs

2016-03-03 09:55:26

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, 1 Mar 2016, Michal Hocko wrote:
> [Adding Vlastimil and Joonsoo for compaction related things - this was a
> large thread but the more interesting part starts with
> http://lkml.kernel.org/r/[email protected]]
>
> On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > [...]
> > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > way to gobble up most of the memory, though it's not how I've done it).
> > > >
> > > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > make defconfig there, then make -j20.
> > > >
> > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > >
> > > > Except that you'll probably need to fiddle around with that j20,
> > > > it's true for my laptop but not for my workstation. j20 just happens
> > > > to be what I've had there for years, that I now see breaking down
> > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > but it still doesn't exercise swap very much).
> > >
> > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > will try with another host with more CPUs (because my laptop has only
> > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap

I've found that the number of CPUs makes quite a difference - I have 4.

And another difference between us may be in our configs: on this laptop
I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
the size of each shmem_inode (and those of course are not swappable).

I found that I could avoid the OOM if I ran the "make -j20" on a
kernel without all those debug options, and booted with nr_cpus=2.
And currently I'm booting the kernel with the debug options in,
but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).

Maybe in the OOM rework, threads are cancelling each other's progress
more destructively, where before they co-operated to some extent?

(All that is on the laptop. The G5 is still busy full-time bisecting
a powerpc issue: I know it was OOMing with the rework, but I have not
verified the effect of nr_cpus on it. My x86 workstation has not been
OOMing with the rework - I think that means that I've not been exerting
as much memory pressure on it as I'd thought, that it copes with the load
better, and would only show the difference if I loaded it more heavily.)

> > > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > > (16, 10 no difference really). I was also collecting vmstat in the
> > > background. The compilation takes ages but the behavior seems consistent
> > > and stable.
> >
> > Thanks a lot for giving it a go.
> >
> > I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
> > are less than 2M big?? But probably that's just a misunderstanding
> > or typo somewhere.
>
> A typo. 445 was from 900M test which I was doing while writing the
> email. Sorry about the confusion.

That makes more sense! Though I'm still amazed that you got anywhere,
taking so much of the usable memory out.

>
> > Ignoring that, you're successfully doing a make -20 defconfig build
> > in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
> > at all surprised that it takes ages, but I am very surprised that it
> > does not OOM. I suppose by rights it ought not to OOM, the built
> > tree occupies only a little more than 1G, so you do have enough swap;
> > but I wouldn't get anywhere near that myself without OOMing - I give
> > myself 1G of RAM (well, minus whatever the booted system takes up)
> > to do that build in, four times your RAM, yet in my case it OOMs.
> >
> > That source tree alone occupies more than 700M, so just copying it
> > into your tmpfs would take a long time.
>
> OK, I just found out that I was cheating a bit. I was building
> linux-3.7-rc5.tar.bz2 which is smaller:
> $ du -sh /mnt/tmpfs/linux-3.7-rc5/
> 537M /mnt/tmpfs/linux-3.7-rc5/

Right, I have a habit like that too; but my habitual testing still
uses the 2.6.24 source tree, which is rather too old to ask others
to reproduce with - but we both find that the kernel source tree
keeps growing, and prefer to stick with something of a fixed size.

>
> and after the defconfig build:
> $ free
> total used free shared buffers cached
> Mem: 1008460 941904 66556 0 5092 806760
> -/+ buffers/cache: 130052 878408
> Swap: 2097148 42648 2054500
> $ du -sh linux-3.7-rc5/
> 799M linux-3.7-rc5/
>
> Sorry about that but this is what my other tests were using and I forgot
> to check. Now let's try the same with the current linus tree:
> host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
> $ du -sh /mnt/tmpfs/linux-4.5-rc6/
> 707M /mnt/tmpfs/linux-4.5-rc6/
> $ free
> total used free shared buffers cached
> Mem: 1008460 962976 45484 0 7236 820064

I guess we have different versions of "free": mine shows Shmem as shared,
but yours appears to be an older version, just showing 0.

> -/+ buffers/cache: 135676 872784
> Swap: 2097148 16 2097132
> $ time make -j20 > /dev/null
> drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
> drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>
> real 8m36.621s
> user 14m1.642s
> sys 2m45.238s
>
> so I wasn't cheating all that much...
>
> > I'd expect a build in 224M
> > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > OOM killed, even if there is technically enough space. Unless
> > perhaps it's some superfast swap that you have?
>
> the swap partition is a standard qcow image stored on my SSD disk. So
> I guess the IO should be quite fast. This smells like a potential
> contributor because my reclaim seems to be much faster and that should
> lead to a more efficient reclaim (in the scanned/reclaimed sense).
> I realize I might be boring already when blaming compaction but let me
> try again ;)
> $ grep compact /proc/vmstat
> compact_migrate_scanned 113983
> compact_free_scanned 1433503
> compact_isolated 134307
> compact_stall 128
> compact_fail 26
> compact_success 102
> compact_kcompatd_wake 0
>
> So the whole load has done the direct compaction only 128 times during
> that test. This doesn't sound much to me
> $ grep allocstall /proc/vmstat
> allocstall 1061
>
> we entered the direct reclaim much more but most of the load will be
> order-0 so this might be still ok. So I've tried the following:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1993894b4219..107d444afdb1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> mode, contended_compaction);
> current->flags &= ~PF_MEMALLOC;
>
> + if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> + trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> +
> switch (compact_result) {
> case COMPACT_DEFERRED:
> *deferred_compaction = true;
>
> And the result was:
> $ cat /debug/tracing/trace_pipe | tee ~/trace.log
> gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
>
> this shows that order-2 memory pressure is not overly high in my
> setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
>
> So I went back to 800M of hugetlb pages and tried again. It took ages
> so I have interrupted that after one hour (there was still no OOM). The
> trace log is quite interesting regardless:
> $ wc -l ~/trace.log
> 371 /root/trace.log
>
> $ grep compact_stall /proc/vmstat
> compact_stall 190
>
> so the compaction was still ignored more than actually invoked for
> !costly allocations:
> sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> 190 2 1
> 122 2 3
> 59 2 4
>
> #define COMPACT_SKIPPED 1
> #define COMPACT_PARTIAL 3
> #define COMPACT_COMPLETE 4
>
> that means that compaction is even not tried in half cases! This
> doesn't sounds right to me, especially when we are talking about
> <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> then we simply rely on the order-0 reclaim to automagically form higher
> blocks. This might indeed work when we retry many times but I guess this
> is not a good approach. It leads to a excessive reclaim and the stall
> for allocation can be really large.
>
> One of the suspicious places is __compaction_suitable which does order-0
> watermark check (increased by 2<<order). I have put another trace_printk
> there and it clearly pointed out this was the case.
>
> So I have tried the following:
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 4d99e1f5055c..7364e48cf69a 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> alloc_flags))
> return COMPACT_PARTIAL;
>
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + return COMPACT_CONTINUE;
> +

I gave that a try just now, but it didn't help me: OOMed much sooner,
after doing half as much work. (FWIW, I have been including your other
patch, the "Andrew, could you queue this one as well, please" patch.)

I do agree that compaction appears to have closed down when we OOM:
taking that along with my nr_cpus remark (and the make -jNumber),
are parallel compactions interfering with each other destructively,
in a way that they did not before the rework?

> /*
> * Watermarks for order-0 must be met for compaction. Note the 2UL.
> * This is because during migration, copies of pages need to be
>
> and retried the same test (without huge pages):
> $ time make -j20 > /dev/null
>
> real 8m46.626s
> user 14m15.823s
> sys 2m45.471s
>
> the time increased but I haven't checked how stable the result is.

But I didn't investigate its stability either, may have judged against
it too soon.

>
> $ grep compact /proc/vmstat
> compact_migrate_scanned 139822
> compact_free_scanned 1661642
> compact_isolated 139407
> compact_stall 129
> compact_fail 58
> compact_success 71
> compact_kcompatd_wake 1

I have not seen any compact_kcompatd_wakes at all:
perhaps we're too busy compacting directly.

(Vlastimil, there's a "c" missing from that name, it should be
"compact_kcompactd_wake" - though "compact_daemon_wake" might be nicer.)

>
> $ grep allocstall /proc/vmstat
> allocstall 1665
>
> this is worse because we have scanned more pages for migration but the
> overall success rate was much smaller and the direct reclaim was invoked
> more. I do not have a good theory for that and will play with this some
> more. Maybe other changes are needed deeper in the compaction code.
>
> I will play with this some more but I would be really interested to hear
> whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> even make sense to you?

It didn't help me; but I do suspect you're right to be worrying about
the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.

>
> > I was only suggesting to allocate hugetlb pages, if you preferred
> > not to reboot with artificially reduced RAM. Not an issue if you're
> > booting VMs.
>
> Ohh, I see.

I've attached vmstats.xz, output from your read_vmstat proggy;
together with oom.xz, the dmesg for the OOM in question.

I hacked out_of_memory() to count_vm_event(BALLOON_DEFLATE),
that being a count that's always 0 for me: so when you see
"balloon_deflate 1" towards the end, that's where the OOM
kill came in, and shortly after I Ctrl-C'ed.

I hope you can get more out of it than I have - thanks!

Hugh


Attachments:
vmstats.xz (52.84 kB)
oom.xz (3.43 kB)

2016-03-03 10:29:52

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> Sure we can be more intelligent and reset the counter if the
> feedback from compaction is optimistic and we are making some
> progress. This would be less hackish and the XXX comment points into
> that direction. For now I would like this to catch most loads reasonably
> and build better heuristics on top. I would like to do as much as
> possible to close the obvious regressions but I guess we have to expect
> there will be cases where the OOM fires and hasn't before and vice
> versa.

Aren't you forgetting that some people use panic_on_oom > 0 which means that
premature OOM killer invocation is fatal for them?

2016-03-03 12:33:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
[...]
> > So I have tried the following:
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 4d99e1f5055c..7364e48cf69a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> > alloc_flags))
> > return COMPACT_PARTIAL;
> >
> > + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > + return COMPACT_CONTINUE;
> > +
>
> I gave that a try just now, but it didn't help me: OOMed much sooner,
> after doing half as much work.

I do not have an explanation for why it would cause the OOM sooner, but this
turned out to be incomplete. There is another watermark check deeper in the
compaction path. Could you try the one from
http://lkml.kernel.org/r/[email protected]

I will try to find a machine with more CPUs and try to reproduce this in
the meantime.

I will also have a look at the data you have collected.
--
Michal Hocko
SUSE Labs

2016-03-03 14:10:13

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

2016-03-03 18:26 GMT+09:00 Michal Hocko <[email protected]>:
> On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
>> 2016-03-02 23:06 GMT+09:00 Michal Hocko <[email protected]>:
>> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
>> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
>> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
>> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
>> >> > [...]
>> >> >> > > + /*
>> >> >> > > + * OK, so the watermak check has failed. Make sure we do all the
>> >> >> > > + * retries for !costly high order requests and hope that multiple
>> >> >> > > + * runs of compaction will generate some high order ones for us.
>> >> >> > > + *
>> >> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
>> >> >> > > + * if we are in the retry path - something like priority 0 for the
>> >> >> > > + * reclaim
>> >> >> > > + */
>> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
>> >> >> > > + return true;
>> >> >> > > +
>> >> >> > > return false;
>> >> >>
>> >> >> This seems not a proper fix. Checking watermark with high order has
>> >> >> another meaning that there is high order page or not. This isn't
>> >> >> what we want here.
>> >> >
>> >> > Why not? Why should we retry the reclaim if we do not have >=order page
>> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
>> >> > form the requested order. The ordering on the LRU lists is pretty much
>> >> > random wrt. pfn ordering. On the other hand if we have a page available
>> >> > which is just hidden by watermarks then it makes perfect sense to retry
>> >> > and free even order-0 pages.
>> >>
>> >> If we have >= order page available, we would not reach here. We would
>> >> just allocate it.
>> >
>> > not really, we can still be under the low watermark. Note that the
>>
>> you mean min watermark?
>
> ohh, right...
>
>> > target for the should_reclaim_retry watermark check includes also the
>> > reclaimable memory.
>>
>> I guess that usual case for high order allocation failure has enough freepage.
>
> Not sure I understand you mean here but I wouldn't be surprised if high
> order failed even with enough free pages. And that is exactly why I am
> claiming that reclaiming more pages is no free ticket to high order
> pages.

I didn't say that it's a free ticket. An OOM kill would be the most expensive
ticket that we have. Why do you want to kill something? It also doesn't
guarantee to produce high order pages; it is just another way of reclaiming
memory. What is the difference between plain reclaim and an OOM kill? Why do
we use an OOM kill in this case?

> [...]
>> >> I just did quick review to your patches so maybe I am wrong.
>> >> Am I missing something?
>> >
>> > The core idea behind should_reclaim_retry is to check whether the
>> > reclaiming all the pages would help to get over the watermark and there
>> > is at least one >= order page. Then it really makes sense to retry. As
>>
>> How you can judge that reclaiming all the pages would help to check
>> there is at least one >= order page?
>
> Again, not sure I understand you here. __zone_watermark_ok checks both
> wmark and an available page of the sufficient order. While increased
> free_pages (which includes reclaimable pages as well) will tell us
> whether we have a chance to get over the min wmark, the order check will
> tell us we have something to allocate from after we reach the min wmark.

Again, your assumption seems to be different from mine. My assumption is that
the high order allocation problem happens due to fragmentation rather than
low free memory. In this case, there is no high order page. Even if you could
reclaim 1TB and add that to the free page counter, the high order page
count would not change and the watermark check would still fail. So, the high
order allocation would not go through the retry logic. Is this what you want?

>> > the compaction has already was performed before this is called we should
>> > have created some high order pages already. The decay guarantees that we
>>
>> Not really. Compaction could fail.
>
> Yes it could have failed. But what is the point to retry endlessly then?

I didn't say we should retry endlessly.

> [...]
>> >> At least, reset no_progress_loops when did_some_progress. High
>> >> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>> >> as order 0. And, reclaim something would increase probability of
>> >> compaction success.
>> >
>> > This is something I still do not understand. Why would reclaiming
>> > random order-0 pages help compaction? Could you clarify this please?
>>
>> I just can tell simple version. Please check the link from me on another reply.
>> Compaction could scan more range of memory if we have more freepage.
>> This is due to algorithm limitation. Anyway, so, reclaiming random
>> order-0 pages helps compaction.
>
> I will have a look at that code but this just doesn't make any sense.
> The compaction should be reshuffling pages, this shouldn't be a function
> of free memory.

Please refer to the link I mentioned before. There is a reason why more free
memory would help compaction succeed. Compaction doesn't work
like random reshuffling; it has an algorithm to reduce overall system
fragmentation, so there is a limitation.

>> >> Why do we limit retry as 16 times with no evidence of potential
>> >> impossibility of making high order page?
>> >
>> > If we tried to compact 16 times without any progress then this sounds
>> > like a sufficient evidence to me. Well, this number is somehow arbitrary
>> > but the main point is to limit it to _some_ number, if we can show that
>> > a larger value would work better then we can update it of course.
>>
>> My arguing is for your band aid patch.
>> My point is that why retry count for order-0 is reset if there is some progress,
>> but, retry counter for order up to costly isn't reset even if there is
>> some progress
>
> Because we know that order-0 requests have chance to proceed if we keep
> reclaiming order-0 pages while this is not true for order > 0. If we did
> reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> then we would be back to the zone_reclaimable heuristic. Why? Because
> order-0 reclaim progress will keep !costly in the reclaim loop while
> compaction still might not make any progress. So we either have to fail
> when __zone_watermark_ok fails for the order (which turned out to be
> too easy to trigger) or have the fixed amount of retries regardless the
> watermark check result. We cannot relax both unless we have other
> measures in place.

As mentioned before, an OOM kill also doesn't guarantee to produce a high
order page. Reclaiming as much memory as possible makes more sense to me.
The timing of the OOM kill for order-0 is reasonable because there are not
enough freeable pages. But it's not reasonable to kill something when we have
plenty of reclaimable memory, as in your current implementation.

Thanks.

2016-03-03 15:25:20

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> 2016-03-03 18:26 GMT+09:00 Michal Hocko <[email protected]>:
> > On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> >> 2016-03-02 23:06 GMT+09:00 Michal Hocko <[email protected]>:
> >> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> >> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
> >> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> >> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> >> >> > [...]
> >> >> >> > > + /*
> >> >> >> > > + * OK, so the watermak check has failed. Make sure we do all the
> >> >> >> > > + * retries for !costly high order requests and hope that multiple
> >> >> >> > > + * runs of compaction will generate some high order ones for us.
> >> >> >> > > + *
> >> >> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
> >> >> >> > > + * if we are in the retry path - something like priority 0 for the
> >> >> >> > > + * reclaim
> >> >> >> > > + */
> >> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> >> >> >> > > + return true;
> >> >> >> > > +
> >> >> >> > > return false;
> >> >> >>
> >> >> >> This seems not a proper fix. Checking watermark with high order has
> >> >> >> another meaning that there is high order page or not. This isn't
> >> >> >> what we want here.
> >> >> >
> >> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> >> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> >> >> > form the requested order. The ordering on the LRU lists is pretty much
> >> >> > random wrt. pfn ordering. On the other hand if we have a page available
> >> >> > which is just hidden by watermarks then it makes perfect sense to retry
> >> >> > and free even order-0 pages.
> >> >>
> >> >> If we have >= order page available, we would not reach here. We would
> >> >> just allocate it.
> >> >
> >> > not really, we can still be under the low watermark. Note that the
> >>
> >> you mean min watermark?
> >
> > ohh, right...
> >
> >> > target for the should_reclaim_retry watermark check includes also the
> >> > reclaimable memory.
> >>
> >> I guess that usual case for high order allocation failure has enough freepage.
> >
> > Not sure I understand you mean here but I wouldn't be surprised if high
> > order failed even with enough free pages. And that is exactly why I am
> > claiming that reclaiming more pages is no free ticket to high order
> > pages.
>
> I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> that we have. Why do you want to kill something?

Because all the attempts so far have failed and we should rather not
retry endlessly. With the band-aid we know we will retry
MAX_RECLAIM_RETRIES times at most. So compaction has had that many attempts to
resolve the situation, along with the same number of reclaim rounds to
help and get over the watermarks.
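
For reference, a minimal sketch of the bound being described; the counter
update itself is the hunk quoted earlier in the thread, and the label name
here is made up for illustration:

	#define MAX_RECLAIM_RETRIES 16		/* the "16 times" referred to above */

	/*
	 * Sketch only: once the no-progress counter from the quoted hunk
	 * exceeds the cap, stop retrying reclaim/compaction and fall
	 * through towards the OOM killer.
	 */
	if (no_progress_loops > MAX_RECLAIM_RETRIES)
		goto give_up;			/* hypothetical label */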

> It also doesn't guarantee to make high order pages. It is just another
> way of reclaiming memory. What is the difference between plain reclaim
> and OOM kill? Why do we use OOM kill in this case?

What is our alternative, other than to keep looping endlessly?

> > [...]
> >> >> I just did quick review to your patches so maybe I am wrong.
> >> >> Am I missing something?
> >> >
> >> > The core idea behind should_reclaim_retry is to check whether the
> >> > reclaiming all the pages would help to get over the watermark and there
> >> > is at least one >= order page. Then it really makes sense to retry. As
> >>
> >> How you can judge that reclaiming all the pages would help to check
> >> there is at least one >= order page?
> >
> > Again, not sure I understand you here. __zone_watermark_ok checks both
> > wmark and an available page of the sufficient order. While increased
> > free_pages (which includes reclaimable pages as well) will tell us
> > whether we have a chance to get over the min wmark, the order check will
> > tell us we have something to allocate from after we reach the min wmark.
>
> Again, your assumption would be different with mine. My assumption is that
> high order allocation problem happens due to fragmentation rather than
> low free memory. In this case, there is no high order page. Even if you can
> reclaim 1TB and add this counter to freepage counter, high order page
> counter will not be changed and watermark check would fail. So, high order
> allocation will not go through retry logic. This is what you want?

I really want to base the decision on something measurable rather
than on good hope. That is what the whole zone_reclaimable() approach is
about. I understand your concern that compaction doesn't guarantee anything,
but I am quite convinced that we really need an upper bound on retries
(unlike now, when zone_reclaimable is basically unbounded as long as
order-0 reclaim makes some progress). What the best bound should be is harder
to tell, of course.

[...]
> >> My arguing is for your band aid patch.
> >> My point is that why retry count for order-0 is reset if there is some progress,
> >> but, retry counter for order up to costly isn't reset even if there is
> >> some progress
> >
> > Because we know that order-0 requests have chance to proceed if we keep
> > reclaiming order-0 pages while this is not true for order > 0. If we did
> > reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> > then we would be back to the zone_reclaimable heuristic. Why? Because
> > order-0 reclaim progress will keep !costly in the reclaim loop while
> > compaction still might not make any progress. So we either have to fail
> > when __zone_watermark_ok fails for the order (which turned out to be
> > too easy to trigger) or have the fixed amount of retries regardless the
> > watermark check result. We cannot relax both unless we have other
> > measures in place.
>
> As mentioned before, OOM kill also doesn't guarantee to make high order page.

Yes, of course; apart from the kernel stack, which is high order, there is
no guarantee.

> Reclaim more memory as much as possible makes more sense to me.

But then we are back to square one. How much should we reclaim, and how do we
decide when it makes sense to give up? Do you have any suggestions on what the
criteria should be? Is there any feedback mechanism from compaction which
would tell us to keep retrying? Something like did_some_progress from
the order-0 reclaim? Is either of deferred_compaction or
contended_compaction usable? Or is there any per-zone flag we can check
and prefer over the wmark order check?

Thanks
--
Michal Hocko
SUSE Labs

2016-03-03 15:50:21

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
>
>> [...]
>>>>> At least, reset no_progress_loops when did_some_progress. High
>>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
>>>>> as order 0. And, reclaim something would increase probability of
>>>>> compaction success.
>>>>
>>>> This is something I still do not understand. Why would reclaiming
>>>> random order-0 pages help compaction? Could you clarify this please?
>>>
>>> I just can tell simple version. Please check the link from me on another reply.
>>> Compaction could scan more range of memory if we have more freepage.
>>> This is due to algorithm limitation. Anyway, so, reclaiming random
>>> order-0 pages helps compaction.
>>
>> I will have a look at that code but this just doesn't make any sense.
>> The compaction should be reshuffling pages, this shouldn't be a function
>> of free memory.
>
> Please refer the link I mentioned before. There is a reason why more free
> memory would help compaction success. Compaction doesn't work
> like as random reshuffling. It has an algorithm to reduce system overall
> fragmentation so there is limitation.

I proposed another way to get better results from direct compaction -
don't scan for free pages but get them directly from freelists:

https://lkml.org/lkml/2015/12/3/60

But your redesign would be useful too for kcompactd/khugepaged keeping
overall fragmentation low.

2016-03-03 16:26:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu 03-03-16 16:50:16, Vlastimil Babka wrote:
> On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
> >
> >> [...]
> >>>>> At least, reset no_progress_loops when did_some_progress. High
> >>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >>>>> as order 0. And, reclaim something would increase probability of
> >>>>> compaction success.
> >>>>
> >>>> This is something I still do not understand. Why would reclaiming
> >>>> random order-0 pages help compaction? Could you clarify this please?
> >>>
> >>> I just can tell simple version. Please check the link from me on another reply.
> >>> Compaction could scan more range of memory if we have more freepage.
> >>> This is due to algorithm limitation. Anyway, so, reclaiming random
> >>> order-0 pages helps compaction.
> >>
> >> I will have a look at that code but this just doesn't make any sense.
> >> The compaction should be reshuffling pages, this shouldn't be a function
> >> of free memory.
> >
> > Please refer the link I mentioned before. There is a reason why more free
> > memory would help compaction success. Compaction doesn't work
> > like as random reshuffling. It has an algorithm to reduce system overall
> > fragmentation so there is limitation.
>
> I proposed another way to get better results from direct compaction -
> don't scan for free pages but get them directly from freelists:
>
> https://lkml.org/lkml/2015/12/3/60

Yes, this makes perfect sense to me (with my limited experience in
this area I might be missing some obvious problems this would
introduce). Direct compaction for !costly orders is something we had
better satisfy immediately. I would just object that this shouldn't be
reduced to ASYNC compaction requests only. SYNC* modes are an even more
desperate call for the page (at least that is my understanding) and we
should treat them accordingly.

> But your redesign would be useful too for kcompactd/khugepaged keeping
> overall fragmentation low.

kcompactd can handle that and should focus on the long-term goals.

--
Michal Hocko
SUSE Labs

2016-03-03 20:57:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu, 3 Mar 2016, Michal Hocko wrote:
> On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> > On Tue, 1 Mar 2016, Michal Hocko wrote:
> [...]
> > > So I have tried the following:
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 4d99e1f5055c..7364e48cf69a 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> > > alloc_flags))
> > > return COMPACT_PARTIAL;
> > >
> > > + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > > + return COMPACT_CONTINUE;
> > > +
> >
> > I gave that a try just now, but it didn't help me: OOMed much sooner,
> > after doing half as much work.

I think I exaggerated: sooner, but not _much_ sooner; and I cannot
see now what I based that estimate of "half as much work" on.

>
> I do not have an explanation why it would cause oom sooner but this
> turned out to be incomplete. There is another wmark check deeper in the
> compaction path. Could you try the one from
> http://lkml.kernel.org/r/[email protected]

I've now added that in: it corrects the "sooner", but does not make
any difference to the fact of OOMing for me.

Hugh

>
> I will try to find a machine with more CPUs and try to reproduce this in
> the mean time.
>
> I will also have a look at the data you have collected.
> --
> Michal Hocko
> SUSE Labs

2016-03-04 05:23:06

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > 2016-03-03 18:26 GMT+09:00 Michal Hocko <[email protected]>:
> > > On Wed 02-03-16 23:34:21, Joonsoo Kim wrote:
> > >> 2016-03-02 23:06 GMT+09:00 Michal Hocko <[email protected]>:
> > >> > On Wed 02-03-16 22:32:09, Joonsoo Kim wrote:
> > >> >> 2016-03-02 18:50 GMT+09:00 Michal Hocko <[email protected]>:
> > >> >> > On Wed 02-03-16 11:19:54, Joonsoo Kim wrote:
> > >> >> >> On Mon, Feb 29, 2016 at 10:02:13PM +0100, Michal Hocko wrote:
> > >> >> > [...]
> > >> >> >> > > + /*
> > >> >> >> > > + * OK, so the watermak check has failed. Make sure we do all the
> > >> >> >> > > + * retries for !costly high order requests and hope that multiple
> > >> >> >> > > + * runs of compaction will generate some high order ones for us.
> > >> >> >> > > + *
> > >> >> >> > > + * XXX: ideally we should teach the compaction to try _really_ hard
> > >> >> >> > > + * if we are in the retry path - something like priority 0 for the
> > >> >> >> > > + * reclaim
> > >> >> >> > > + */
> > >> >> >> > > + if (order && order <= PAGE_ALLOC_COSTLY_ORDER)
> > >> >> >> > > + return true;
> > >> >> >> > > +
> > >> >> >> > > return false;
> > >> >> >>
> > >> >> >> This seems not a proper fix. Checking watermark with high order has
> > >> >> >> another meaning that there is high order page or not. This isn't
> > >> >> >> what we want here.
> > >> >> >
> > >> >> > Why not? Why should we retry the reclaim if we do not have >=order page
> > >> >> > available? Reclaim itself doesn't guarantee any of the freed pages will
> > >> >> > form the requested order. The ordering on the LRU lists is pretty much
> > >> >> > random wrt. pfn ordering. On the other hand if we have a page available
> > >> >> > which is just hidden by watermarks then it makes perfect sense to retry
> > >> >> > and free even order-0 pages.
> > >> >>
> > >> >> If we have >= order page available, we would not reach here. We would
> > >> >> just allocate it.
> > >> >
> > >> > not really, we can still be under the low watermark. Note that the
> > >>
> > >> you mean min watermark?
> > >
> > > ohh, right...
> > >
> > >> > target for the should_reclaim_retry watermark check includes also the
> > >> > reclaimable memory.
> > >>
> > >> I guess that usual case for high order allocation failure has enough freepage.
> > >
> > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > order failed even with enough free pages. And that is exactly why I am
> > > claiming that reclaiming more pages is no free ticket to high order
> > > pages.
> >
> > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > that we have. Why do you want to kill something?
>
> Because all the attempts so far have failed and we should rather not
> retry endlessly. With the band-aid we know we will retry
> MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> resolve the situation along with the same amount of reclaim rounds to
> help and get over watermarks.
>
> > It also doesn't guarantee to make high order pages. It is just another
> > way of reclaiming memory. What is the difference between plain reclaim
> > and OOM kill? Why do we use OOM kill in this case?
>
> What is our alternative other than keep looping endlessly?

Loop as long as free memory or the estimated available memory (free +
reclaimable) increases. That means we made some progress. And they will
not grow forever because we only have a limited amount of reclaimable
memory and of memory overall. You can reset no_progress_loops = 0
whenever those metrics increase compared to before.
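
Roughly something like the following (just a sketch to show the idea;
the names and the plumbing are approximate, not a real patch):

/*
 * Sketch: reset the retry counter whenever the estimate of available
 * memory (free + reclaimable) grew since the previous round.  The
 * estimate cannot grow forever, so no_progress_loops only counts the
 * rounds with no measurable progress and the retry loop stays bounded.
 */
static bool keep_retrying(struct zone *zone, unsigned long *last_available,
			  int *no_progress_loops)
{
	unsigned long available;

	available = zone_page_state_snapshot(zone, NR_FREE_PAGES) +
		    zone_reclaimable_pages(zone);

	if (available > *last_available) {
		/* we made some measurable progress, don't count this round */
		*last_available = available;
		*no_progress_loops = 0;
		return true;
	}

	return ++(*no_progress_loops) <= MAX_RECLAIM_RETRIES;
}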

With this bound, we can do our best to try to solve this unpleasant
situation before OOM.

Unconditionally looping 16 times and then OOM killing really doesn't
make any sense, because it doesn't mean that we have already done our
best. OOM should not be invoked prematurely, and AFAIK that is one of
the goals of your patches.

If the above suggestion doesn't make sense to you, please try to find
another way rather than suggesting a work-around that could cause
premature OOM in the high order allocation case.

Thanks.

>
> > > [...]
> > >> >> I just did quick review to your patches so maybe I am wrong.
> > >> >> Am I missing something?
> > >> >
> > >> > The core idea behind should_reclaim_retry is to check whether the
> > >> > reclaiming all the pages would help to get over the watermark and there
> > >> > is at least one >= order page. Then it really makes sense to retry. As
> > >>
> > >> How you can judge that reclaiming all the pages would help to check
> > >> there is at least one >= order page?
> > >
> > > Again, not sure I understand you here. __zone_watermark_ok checks both
> > > wmark and an available page of the sufficient order. While increased
> > > free_pages (which includes reclaimable pages as well) will tell us
> > > whether we have a chance to get over the min wmark, the order check will
> > > tell us we have something to allocate from after we reach the min wmark.
> >
> > Again, your assumption would be different with mine. My assumption is that
> > high order allocation problem happens due to fragmentation rather than
> > low free memory. In this case, there is no high order page. Even if you can
> > reclaim 1TB and add this counter to freepage counter, high order page
> > counter will not be changed and watermark check would fail. So, high order
> > allocation will not go through retry logic. This is what you want?
>
> I really want to base the decision on something measurable rather
> than a good hope. This is what all the zone_reclaimable() is about. I
> understand your concerns that compaction doesn't guarantee anything but
> I am quite convinced that we really need an upper bound for retries
> (unlike now when zone_reclaimable is basically unbounded assuming
> order-0 reclaim makes some progress). What is the best bound is harder
> to tell, of course.
>
> [...]
> > >> My arguing is for your band aid patch.
> > >> My point is that why retry count for order-0 is reset if there is some progress,
> > >> but, retry counter for order up to costly isn't reset even if there is
> > >> some progress
> > >
> > > Because we know that order-0 requests have chance to proceed if we keep
> > > reclaiming order-0 pages while this is not true for order > 0. If we did
> > > reset the no_progress_loops for order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER
> > > then we would be back to the zone_reclaimable heuristic. Why? Because
> > > order-0 reclaim progress will keep !costly in the reclaim loop while
> > > compaction still might not make any progress. So we either have to fail
> > > when __zone_watermark_ok fails for the order (which turned out to be
> > > too easy to trigger) or have the fixed amount of retries regardless the
> > > watermark check result. We cannot relax both unless we have other
> > > measures in place.
> >
> > As mentioned before, OOM kill also doesn't guarantee to make high order page.
>
> Yes, of course, apart from the kernel stack which is high order there is
> no guarantee.
>
> > Reclaim more memory as much as possible makes more sense to me.
>
> But then we are back to square one. How much and how to decide when it
> makes sense to give up. Do you have any suggestions on what should be
> the criteria? Is there any feedback mechanism from the compaction which
> would tell us to keep retrying? Something like did_some_progress from
> the order-0 reclaim? Is any of deferred_compaction resp.
> contended_compaction usable? Or is there any per-zone flag we can check
> and prefer over wmark order check?
>
> Thanks
> --
> Michal Hocko
> SUSE Labs
>

2016-03-04 07:10:34

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu, Mar 03, 2016 at 04:50:16PM +0100, Vlastimil Babka wrote:
> On 03/03/2016 03:10 PM, Joonsoo Kim wrote:
> >
> >> [...]
> >>>>> At least, reset no_progress_loops when did_some_progress. High
> >>>>> order allocation up to PAGE_ALLOC_COSTLY_ORDER is as important
> >>>>> as order 0. And, reclaim something would increase probability of
> >>>>> compaction success.
> >>>>
> >>>> This is something I still do not understand. Why would reclaiming
> >>>> random order-0 pages help compaction? Could you clarify this please?
> >>>
> >>> I just can tell simple version. Please check the link from me on another reply.
> >>> Compaction could scan more range of memory if we have more freepage.
> >>> This is due to algorithm limitation. Anyway, so, reclaiming random
> >>> order-0 pages helps compaction.
> >>
> >> I will have a look at that code but this just doesn't make any sense.
> >> The compaction should be reshuffling pages, this shouldn't be a function
> >> of free memory.
> >
> > Please refer the link I mentioned before. There is a reason why more free
> > memory would help compaction success. Compaction doesn't work
> > like as random reshuffling. It has an algorithm to reduce system overall
> > fragmentation so there is limitation.
>
> I proposed another way to get better results from direct compaction -
> don't scan for free pages but get them directly from freelists:
>
> https://lkml.org/lkml/2015/12/3/60
>

I think the major problem with this approach is that there is no way
to prevent other parallel compacting threads from taking free pages in the
targeted aligned block. So, if there are parallel compaction requestors,
they would disturb each other. However, it would not be a problem for orders
up to PAGE_ALLOC_COSTLY_ORDER, which finish soon enough.

In fact, for quick allocation, the migration scanner is also unnecessary.
There would be a lot of pageblocks where we cannot do migration. Scanning
all of them in this situation is unnecessary and costly. Moreover, scanning
only half of the zone due to the limitation of the compaction algorithm also
doesn't look good. Instead, we can take a base page from the LRU list and
migrate its neighbouring pages. I named this idea "lumpy compaction" but
haven't tried it; a very rough sketch follows below. If we only focus on
quick allocation, this would be a better way. Any thoughts?
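
Very roughly (pseudocode-level sketch only; every helper below is made up
just to show the shape of the idea, nothing here exists in the tree):

/*
 * "Lumpy compaction" sketch: instead of running the migration and free
 * scanners, pick a base page from the LRU and try to empty the whole
 * aligned block around it by migrating the neighbouring pages.
 */
static struct page *lumpy_compact_one(struct zone *zone, unsigned int order)
{
	/* isolate_one_lru_page() and friends are hypothetical helpers */
	struct page *base = isolate_one_lru_page(zone);
	struct page *block = align_to_order_block(base, order);
	int i;

	for (i = 0; i < (1 << order); i++) {
		struct page *page = block + i;

		if (PageBuddy(page))
			continue;
		if (!migrate_away(page))
			return NULL;	/* give up on this block */
	}

	return block;	/* the whole aligned block is free now */
}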

Thanks.

2016-03-04 07:41:28

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On 03/03/2016 09:57 PM, Hugh Dickins wrote:
>
>>
>> I do not have an explanation why it would cause oom sooner but this
>> turned out to be incomplete. There is another wmark check deeper in the
>> compaction path. Could you try the one from
>> http://lkml.kernel.org/r/[email protected]
>
> I've now added that in: it corrects the "sooner", but does not make
> any difference to the fact of OOMing for me.

Could you try producing a trace with
echo 1 > /debug/tracing/events/compaction/enable
echo 1 > /debug/tracing/events/migrate/mm_migrate_pages/enable

Hopefully it will hint at what's wrong with:
compact_migrate_scanned 424920
compact_free_scanned 9278408
compact_isolated 469472
compact_stall 377
compact_fail 297
compact_success 80
compact_kcompatd_wake 0

2016-03-04 07:53:28

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu, Mar 03, 2016 at 01:54:43AM -0800, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
> > [Adding Vlastimil and Joonsoo for compaction related things - this was a
> > large thread but the more interesting part starts with
> > http://lkml.kernel.org/r/[email protected]]
> >
> > On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > > [...]
> > > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > > way to gobble up most of the memory, though it's not how I've done it).
> > > > >
> > > > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > > make defconfig there, then make -j20.
> > > > >
> > > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > > >
> > > > > Except that you'll probably need to fiddle around with that j20,
> > > > > it's true for my laptop but not for my workstation. j20 just happens
> > > > > to be what I've had there for years, that I now see breaking down
> > > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > > but it still doesn't exercise swap very much).
> > > >
> > > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > > will try with another host with more CPUs (because my laptop has only
> > > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
>
> I've found that the number of CPUs makes quite a difference - I have 4.
>
> And another difference between us may be in our configs: on this laptop
> I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
> PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
> the size of each shmem_inode (and those of course are not swappable).
>
> I found that I could avoid the OOM if I ran the "make -j20" on a
> kernel without all those debug options, and booted with nr_cpus=2.
> And currently I'm booting the kernel with the debug options in,
> but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).
>
> Maybe in the OOM rework, threads are cancelling each other's progress
> more destructively, where before they co-operated to some extent?
>
> (All that is on the laptop. The G5 is still busy full-time bisecting
> a powerpc issue: I know it was OOMing with the rework, but I have not
> verified the effect of nr_cpus on it. My x86 workstation has not been
> OOMing with the rework - I think that means that I've not been exerting
> as much memory pressure on it as I'd thought, that it copes with the load
> better, and would only show the difference if I loaded it more heavily.)
>
> > > > and reserve 800M for hugetlb pages (I got 445 of them). Then I extract
> > > > the kernel source to tmpfs (-o size=2G), make defconfig and make -j20
> > > > (16, 10 no difference really). I was also collecting vmstat in the
> > > > background. The compilation takes ages but the behavior seems consistent
> > > > and stable.
> > >
> > > Thanks a lot for giving it a go.
> > >
> > > I'm puzzled. 445 hugetlb pages in 800M surprises me: some of them
> > > are less than 2M big?? But probably that's just a misunderstanding
> > > or typo somewhere.
> >
> > A typo. 445 was from 900M test which I was doing while writing the
> > email. Sorry about the confusion.
>
> That makes more sense! Though I'm still amazed that you got anywhere,
> taking so much of the usable memory out.
>
> >
> > > Ignoring that, you're successfully doing a make -20 defconfig build
> > > in tmpfs, with only 224M of RAM available, plus 2G of swap? I'm not
> > > at all surprised that it takes ages, but I am very surprised that it
> > > does not OOM. I suppose by rights it ought not to OOM, the built
> > > tree occupies only a little more than 1G, so you do have enough swap;
> > > but I wouldn't get anywhere near that myself without OOMing - I give
> > > myself 1G of RAM (well, minus whatever the booted system takes up)
> > > to do that build in, four times your RAM, yet in my case it OOMs.
> > >
> > > That source tree alone occupies more than 700M, so just copying it
> > > into your tmpfs would take a long time.
> >
> > OK, I just found out that I was cheating a bit. I was building
> > linux-3.7-rc5.tar.bz2 which is smaller:
> > $ du -sh /mnt/tmpfs/linux-3.7-rc5/
> > 537M /mnt/tmpfs/linux-3.7-rc5/
>
> Right, I have a habit like that too; but my habitual testing still
> uses the 2.6.24 source tree, which is rather too old to ask others
> to reproduce with - but we both find that the kernel source tree
> keeps growing, and prefer to stick with something of a fixed size.
>
> >
> > and after the defconfig build:
> > $ free
> > total used free shared buffers cached
> > Mem: 1008460 941904 66556 0 5092 806760
> > -/+ buffers/cache: 130052 878408
> > Swap: 2097148 42648 2054500
> > $ du -sh linux-3.7-rc5/
> > 799M linux-3.7-rc5/
> >
> > Sorry about that but this is what my other tests were using and I forgot
> > to check. Now let's try the same with the current linus tree:
> > host $ git archive v4.5-rc6 --prefix=linux-4.5-rc6/ | bzip2 > linux-4.5-rc6.tar.bz2
> > $ du -sh /mnt/tmpfs/linux-4.5-rc6/
> > 707M /mnt/tmpfs/linux-4.5-rc6/
> > $ free
> > total used free shared buffers cached
> > Mem: 1008460 962976 45484 0 7236 820064
>
> I guess we have different versions of "free": mine shows Shmem as shared,
> but yours appears to be an older version, just showing 0.
>
> > -/+ buffers/cache: 135676 872784
> > Swap: 2097148 16 2097132
> > $ time make -j20 > /dev/null
> > drivers/acpi/property.c: In function ‘acpi_data_prop_read’:
> > drivers/acpi/property.c:745:8: warning: ‘obj’ may be used uninitialized in this function [-Wmaybe-uninitialized]
> >
> > real 8m36.621s
> > user 14m1.642s
> > sys 2m45.238s
> >
> > so I wasn't cheating all that much...
> >
> > > I'd expect a build in 224M
> > > RAM plus 2G of swap to take so long, that I'd be very grateful to be
> > > OOM killed, even if there is technically enough space. Unless
> > > perhaps it's some superfast swap that you have?
> >
> > the swap partition is a standard qcow image stored on my SSD disk. So
> > I guess the IO should be quite fast. This smells like a potential
> > contributor because my reclaim seems to be much faster and that should
> > lead to a more efficient reclaim (in the scanned/reclaimed sense).
> > I realize I might be boring already when blaming compaction but let me
> > try again ;)
> > $ grep compact /proc/vmstat
> > compact_migrate_scanned 113983
> > compact_free_scanned 1433503
> > compact_isolated 134307
> > compact_stall 128
> > compact_fail 26
> > compact_success 102
> > compact_kcompatd_wake 0
> >
> > So the whole load has done the direct compaction only 128 times during
> > that test. This doesn't sound much to me
> > $ grep allocstall /proc/vmstat
> > allocstall 1061
> >
> > we entered the direct reclaim much more but most of the load will be
> > order-0 so this might be still ok. So I've tried the following:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 1993894b4219..107d444afdb1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2910,6 +2910,9 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> > mode, contended_compaction);
> > current->flags &= ~PF_MEMALLOC;
> >
> > + if (order > 0 && order <= PAGE_ALLOC_COSTLY_ORDER)
> > + trace_printk("order:%d gfp_mask:%pGg compact_result:%lu\n", order, &gfp_mask, compact_result);
> > +
> > switch (compact_result) {
> > case COMPACT_DEFERRED:
> > *deferred_compaction = true;
> >
> > And the result was:
> > $ cat /debug/tracing/trace_pipe | tee ~/trace.log
> > gcc-8707 [001] .... 137.946370: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> > gcc-8726 [000] .... 138.528571: __alloc_pages_direct_compact: order:2 gfp_mask:GFP_KERNEL_ACCOUNT|__GFP_NOTRACK compact_result:1
> >
> > this shows that order-2 memory pressure is not overly high in my
> > setup. Both attempts ended up COMPACT_SKIPPED which is interesting.
> >
> > So I went back to 800M of hugetlb pages and tried again. It took ages
> > so I have interrupted that after one hour (there was still no OOM). The
> > trace log is quite interesting regardless:
> > $ wc -l ~/trace.log
> > 371 /root/trace.log
> >
> > $ grep compact_stall /proc/vmstat
> > compact_stall 190
> >
> > so the compaction was still ignored more than actually invoked for
> > !costly allocations:
> > sed 's@.*order:\([[:digit:]]\).* compact_result:\([[:digit:]]\)@\1 \2@' ~/trace.log | sort | uniq -c
> > 190 2 1
> > 122 2 3
> > 59 2 4
> >
> > #define COMPACT_SKIPPED 1
> > #define COMPACT_PARTIAL 3
> > #define COMPACT_COMPLETE 4
> >
> > that means that compaction is even not tried in half cases! This
> > doesn't sounds right to me, especially when we are talking about
> > <= PAGE_ALLOC_COSTLY_ORDER requests which are implicitly nofail, because
> > then we simply rely on the order-0 reclaim to automagically form higher
> > blocks. This might indeed work when we retry many times but I guess this
> > is not a good approach. It leads to a excessive reclaim and the stall
> > for allocation can be really large.
> >
> > One of the suspicious places is __compaction_suitable which does order-0
> > watermark check (increased by 2<<order). I have put another trace_printk
> > there and it clearly pointed out this was the case.
> >
> > So I have tried the following:
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 4d99e1f5055c..7364e48cf69a 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1276,6 +1276,9 @@ static unsigned long __compaction_suitable(struct zone *zone, int order,
> > alloc_flags))
> > return COMPACT_PARTIAL;
> >
> > + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > + return COMPACT_CONTINUE;
> > +
>
> I gave that a try just now, but it didn't help me: OOMed much sooner,
> after doing half as much work. (FWIW, I have been including your other
> patch, the "Andrew, could you queue this one as well, please" patch.)
>
> I do agree that compaction appears to have closed down when we OOM:
> taking that along with my nr_cpus remark (and the make -jNumber),
> are parallel compactions interfering with each other destructively,
> in a way that they did not before the rework?
>
> > /*
> > * Watermarks for order-0 must be met for compaction. Note the 2UL.
> > * This is because during migration, copies of pages need to be
> >
> > and retried the same test (without huge pages):
> > $ time make -j20 > /dev/null
> >
> > real 8m46.626s
> > user 14m15.823s
> > sys 2m45.471s
> >
> > the time increased but I haven't checked how stable the result is.
>
> But I didn't investigate its stability either, may have judged against
> it too soon.
>
> >
> > $ grep compact /proc/vmstat
> > compact_migrate_scanned 139822
> > compact_free_scanned 1661642
> > compact_isolated 139407
> > compact_stall 129
> > compact_fail 58
> > compact_success 71
> > compact_kcompatd_wake 1
>
> I have not seen any compact_kcompatd_wakes at all:
> perhaps we're too busy compacting directly.
>
> (Vlastimil, there's a "c" missing from that name, it should be
> "compact_kcompactd_wake" - though "compact_daemon_wake" might be nicer.)
>
> >
> > $ grep allocstall /proc/vmstat
> > allocstall 1665
> >
> > this is worse because we have scanned more pages for migration but the
> > overall success rate was much smaller and the direct reclaim was invoked
> > more. I do not have a good theory for that and will play with this some
> > more. Maybe other changes are needed deeper in the compaction code.
> >
> > I will play with this some more but I would be really interested to hear
> > whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> > even make sense to you?
>
> It didn't help me; but I do suspect you're right to be worrying about
> the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.
>
> >
> > > I was only suggesting to allocate hugetlb pages, if you preferred
> > > not to reboot with artificially reduced RAM. Not an issue if you're
> > > booting VMs.
> >
> > Ohh, I see.
>
> I've attached vmstats.xz, output from your read_vmstat proggy;
> together with oom.xz, the dmesg for the OOM in question.

Hello, Hugh.

Here is what I guess from your vmstat.
It could be wrong, so please take it with a grain of salt. :)

Before OOM happens,

pgmigrate_success 230007
pgmigrate_fail 94
compact_migrate_scanned 422734
compact_free_scanned 9277915
compact_isolated 469308
compact_stall 370
compact_fail 291
compact_success 79
...
balloon_deflate 0

After OOM happens,

pgmigrate_success 230007
pgmigrate_fail 94
compact_migrate_scanned 424920
compact_free_scanned 9278408
compact_isolated 469472
compact_stall 377
compact_fail 297
compact_success 80
...
balloon_deflate 1

This shows that we tried compaction (compact_stall increases).
The increased compact_isolated tells us that we isolated something for
migration. But pgmigrate_xxx didn't change, which means that we didn't
do any actual migration. That can happen when we can't find a free page.
compact_free_scanned changed only a little, so it seems that there are
many pageblocks with the skip bit set and compaction skips almost the
whole range in this case. The skip bit would be reset once we retry
enough times to reach the reset threshold. How about testing with
MAX_RECLAIM_RETRIES 128, or something larger, to see whether that makes
any difference?
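
For the experiment, something as simple as the following should do
(assuming the constant is still a plain define in mm/page_alloc.c as in
the band-aid patch):

/* for this experiment only: allow far more no-progress rounds before OOM */
#define MAX_RECLAIM_RETRIES 128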

Thanks.

>
> I hacked out_of_memory() to count_vm_event(BALLOON_DEFLATE),
> that being a count that's always 0 for me: so when you see
> "balloon_deflate 1" towards the end, that's where the OOM
> kill came in, and shortly after I Ctrl-C'ed.
>
> I hope you can get more out of it than I have - thanks!
>
> Hugh



2016-03-04 12:28:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Thu 03-03-16 01:54:43, Hugh Dickins wrote:
> On Tue, 1 Mar 2016, Michal Hocko wrote:
> > [Adding Vlastimil and Joonsoo for compaction related things - this was a
> > large thread but the more interesting part starts with
> > http://lkml.kernel.org/r/[email protected]]
> >
> > On Mon 29-02-16 23:29:06, Hugh Dickins wrote:
> > > On Mon, 29 Feb 2016, Michal Hocko wrote:
> > > > On Wed 24-02-16 19:47:06, Hugh Dickins wrote:
> > > > [...]
> > > > > Boot with mem=1G (or boot your usual way, and do something to occupy
> > > > > most of the memory: I think /proc/sys/vm/nr_hugepages provides a great
> > > > > way to gobble up most of the memory, though it's not how I've done it).
> > > > >
> > > > > Make sure you have swap: 2G is more than enough. Copy the v4.5-rc5
> > > > > kernel source tree into a tmpfs: size=2G is more than enough.
> > > > > make defconfig there, then make -j20.
> > > > >
> > > > > On a v4.5-rc5 kernel that builds fine, on mmotm it is soon OOM-killed.
> > > > >
> > > > > Except that you'll probably need to fiddle around with that j20,
> > > > > it's true for my laptop but not for my workstation. j20 just happens
> > > > > to be what I've had there for years, that I now see breaking down
> > > > > (I can lower to j6 to proceed, perhaps could go a bit higher,
> > > > > but it still doesn't exercise swap very much).
> > > >
> > > > I have tried to reproduce and failed in a virtual on my laptop. I
> > > > will try with another host with more CPUs (because my laptop has only
> > > > two). Just for the record I did: boot 1G machine in kvm, I have 2G swap
>
> I've found that the number of CPUs makes quite a difference - I have 4.
>
> And another difference between us may be in our configs: on this laptop
> I had lots of debug options on (including DEBUG_VM, DEBUG_SPINLOCK and
> PROVE_LOCKING, though not DEBUG_PAGEALLOC), which approximately doubles
> the size of each shmem_inode (and those of course are not swappable).

I had everything but PROVE_LOCKING. Enabling this option doesn't change
anything (except for the overall runtime, which is longer of course) in my
2 CPU setup, though.

All the following is with the clean mmotm (mmotm-2016-02-24-16-18)
without any additional change. I have moved my kvm setup to a larger
machine. The storage is a standard spinning-rust disk and I've made sure
that the swap is not cached on the host and that the swap IO is done
directly by using
-drive file=swap-2G.qcow,if=ide,index=2,cache=none

retested with 4 CPUs and make -j20
real 8m42.263s
user 20m52.838s
sys 8m8.805s

with 16 CPUs and make -j20
real 3m34.806s
user 20m25.245s
sys 8m39.366s

and the same with -j60 which actually triggered the OOM
$ grep "invoked oom-killer:" oomrework.qcow_serial.log
[10064.286799] cc1 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[...]
[10064.394172] DMA32 free:3764kB min:3796kB low:4776kB high:5756kB active_anon:394184kB inactive_anon:394168kB active_file:1836kB inactive_file:2156kB unevictable:0kB isolated(anon):148kB isolated(file):0kB present:1032060kB managed:987556kB mlocked:0kB dirty:0kB writeback:96kB mapped:1308kB shmem:6704kB slab_reclaimable:51356kB slab_unreclaimable:100532kB kernel_stack:7328kB pagetables:15944kB unstable:0kB bounce:0kB free_pcp:1796kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:63244 all_unreclaimable? yes
[...]
[10560.926971] cc1 invoked oom-killer: gfp_mask=0x24200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[...]
[10561.007362] DMA32 free:4800kB min:3796kB low:4776kB high:5756kB active_anon:393112kB inactive_anon:393508kB active_file:1560kB inactive_file:1428kB unevictable:0kB isolated(anon):2452kB isolated(file):212kB present:1032060kB managed:987556kB mlocked:0kB dirty:0kB writeback:564kB mapped:2552kB shmem:7664kB slab_reclaimable:51352kB slab_unreclaimable:100396kB kernel_stack:7392kB pagetables:16196kB unstable:0kB bounce:0kB free_pcp:812kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1172 all_unreclaimable? no

but those are simple order-0 OOMs. So this cannot be compaction
related. The second oom is probably racing with the exiting task
because we are over the low wmark. This would suggest we have exhausted
all the attempts with no progress.

This was all after a fresh boot, so then I stayed with 16 CPUs and did
make -j20 > /dev/null
make clean

in a loop and left it running overnight. This should randomize the swap
IO and also give a better chance of long-term fragmentation.
It survived 300 iterations.

I really have no idea what might be different in your setup. So
I've tried testing linux-next (next-20160226) just to make sure that
this is not something specific to the mmotm git tree (which I maintain).

> I found that I could avoid the OOM if I ran the "make -j20" on a
> kernel without all those debug options, and booted with nr_cpus=2.
> And currently I'm booting the kernel with the debug options in,
> but with nr_cpus=2, which does still OOM (whereas not if nr_cpus=1).
>
> Maybe in the OOM rework, threads are cancelling each other's progress
> more destructively, where before they co-operated to some extent?
>
> (All that is on the laptop. The G5 is still busy full-time bisecting
> a powerpc issue: I know it was OOMing with the rework, but I have not
> verified the effect of nr_cpus on it. My x86 workstation has not been
> OOMing with the rework - I think that means that I've not been exerting
> as much memory pressure on it as I'd thought, that it copes with the load
> better, and would only show the difference if I loaded it more heavily.)

I am currently testing with the swap backed by sshfs (with -o direct_io),
which should emulate really slow storage. But still no OOM; I only
managed to hit:
INFO: task khugepaged:246 blocked for more than 120 seconds.
in the IO path
[ 480.422500] [<ffffffff812b0c9b>] get_request+0x440/0x55e
[ 480.423444] [<ffffffff81081148>] ? wait_woken+0x72/0x72
[ 480.424447] [<ffffffff812b3071>] blk_queue_bio+0x16d/0x302
[ 480.425566] [<ffffffff812b1607>] generic_make_request+0xc0/0x15e
[ 480.426642] [<ffffffff812b17ae>] submit_bio+0x109/0x114
[ 480.427704] [<ffffffff81147101>] __swap_writepage+0x1ea/0x1f9
[ 480.430364] [<ffffffff81149346>] ? page_swapcount+0x45/0x4c
[ 480.432718] [<ffffffff815a8aed>] ? _raw_spin_unlock+0x31/0x44
[ 480.433722] [<ffffffff81149346>] ? page_swapcount+0x45/0x4c
[ 480.434697] [<ffffffff8114714a>] swap_writepage+0x3a/0x3e
[ 480.435718] [<ffffffff81122bbe>] shmem_writepage+0x37b/0x3d1
[ 480.436757] [<ffffffff8111dbe8>] shrink_page_list+0x49c/0xd88

[...]
> > I will play with this some more but I would be really interested to hear
> > whether this helped Hugh with his setup. Vlastimi, Joonsoo does this
> > even make sense to you?
>
> It didn't help me; but I do suspect you're right to be worrying about
> the treatment of compaction of 0 < order <= PAGE_ALLOC_COSTLY_ORDER.
>
> >
> > > I was only suggesting to allocate hugetlb pages, if you preferred
> > > not to reboot with artificially reduced RAM. Not an issue if you're
> > > booting VMs.
> >
> > Ohh, I see.
>
> I've attached vmstats.xz, output from your read_vmstat proggy;
> together with oom.xz, the dmesg for the OOM in question.

[ 796.225322] sh invoked oom-killer: gfp_mask=0x27000c0(GFP_KERNEL_ACCOUNT|__GFP_NOTRACK), order=2, oom_score_adj=0
[...]
[ 796.630465] Node 0 DMA32 free:13904kB min:3940kB low:4944kB high:5948kB active_anon:588776kB inactive_anon:188816kB active_file:20432kB inactive_file:6928kB unevictable:12268kB isolated(anon):128kB isolated(file):8kB present:1046128kB managed:1004892kB mlocked:12268kB dirty:16kB writeback:1400kB mapped:35556kB shmem:12684kB slab_reclaimable:55628kB slab_unreclaimable:92944kB kernel_stack:4448kB pagetables:8604kB unstable:0kB bounce:0kB free_pcp:296kB local_pcp:164kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 796.687390] Node 0 DMA32: 969*4kB (UE) 184*8kB (UME) 167*16kB (UM) 19*32kB (UM) 3*64kB (UM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 8820kB
[...]

This is really interesting because there are some order-2+ pages
available. Even more striking is that free is way above the high watermark.
This would suggest that declaring OOM must have raced with an exiting
task. That is not entirely unexpected because the gcc processes are quite
short-lived and `make' spawns a new one as soon as the last one terminates.
This race is not new and we cannot do much better without moving the wmark
check closer to the actual do_send_sig_info. That is not the main problem
though. What bothers me is that you are able to trigger this consistently.
--
Michal Hocko
SUSE Labs

2016-03-04 15:16:05

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
> On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> > On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > > 2016-03-03 18:26 GMT+09:00 Michal Hocko <[email protected]>:
[...]
> > > >> I guess that usual case for high order allocation failure has enough freepage.
> > > >
> > > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > > order failed even with enough free pages. And that is exactly why I am
> > > > claiming that reclaiming more pages is no free ticket to high order
> > > > pages.
> > >
> > > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > > that we have. Why do you want to kill something?
> >
> > Because all the attempts so far have failed and we should rather not
> > retry endlessly. With the band-aid we know we will retry
> > MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> > resolve the situation along with the same amount of reclaim rounds to
> > help and get over watermarks.
> >
> > > It also doesn't guarantee to make high order pages. It is just another
> > > way of reclaiming memory. What is the difference between plain reclaim
> > > and OOM kill? Why do we use OOM kill in this case?
> >
> > What is our alternative other than keep looping endlessly?
>
> Loop as long as free memory or estimated available memory (free +
> reclaimable) increases. This means that we did some progress. And,
> they will not grow forever because we have just limited reclaimable
> memory and limited memory. You can reset no_progress_loops = 0 when
> those metric increases than before.

Hmm, why is this any better than taking the feedback from the reclaim
(did_some_progress)?

> With this bound, we can do our best to try to solve this unpleasant
> situation before OOM.
>
> Unconditional 16 looping and then OOM kill really doesn't make any
> sense, because it doesn't mean that we already do our best.

16 is not really that important. We can change that if it doesn't
sound sufficient. But please note that each reclaim round means
that we have scanned all eligible LRUs to find and reclaim something
and asked direct compaction to prepare a high order page.
This sounds like "do our best" to me.

Now it seems that we need more changes, at least in the compaction area,
because the code doesn't seem to fit the nature of !costly allocation
requests. I am also not satisfied with the fixed MAX_RECLAIM_RETRIES for
high order pages; I would much rather see some feedback mechanism which
would be measurable and evaluated in some way, but is this really necessary
for the initial version?
--
Michal Hocko
SUSE Labs

2016-03-04 17:39:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri 04-03-16 16:15:58, Michal Hocko wrote:
> On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
[...]
> > Unconditional 16 looping and then OOM kill really doesn't make any
> > sense, because it doesn't mean that we already do our best.
>
> 16 is not really that important. We can change that if that doesn't
> sounds sufficient. But please note that each reclaim round means
> that we have scanned all eligible LRUs to find and reclaim something

this should read "scanned potentially all eligible LRUs..."

> and asked direct compaction to prepare a high order page.
> This sounds like "do our best" to me.


--
Michal Hocko
SUSE Labs

2016-03-07 05:22:37

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri, Mar 04, 2016 at 04:15:58PM +0100, Michal Hocko wrote:
> On Fri 04-03-16 14:23:27, Joonsoo Kim wrote:
> > On Thu, Mar 03, 2016 at 04:25:15PM +0100, Michal Hocko wrote:
> > > On Thu 03-03-16 23:10:09, Joonsoo Kim wrote:
> > > > 2016-03-03 18:26 GMT+09:00 Michal Hocko <[email protected]>:
> [...]
> > > > >> I guess that usual case for high order allocation failure has enough freepage.
> > > > >
> > > > > Not sure I understand you mean here but I wouldn't be surprised if high
> > > > > order failed even with enough free pages. And that is exactly why I am
> > > > > claiming that reclaiming more pages is no free ticket to high order
> > > > > pages.
> > > >
> > > > I didn't say that it's free ticket. OOM kill would be the most expensive ticket
> > > > that we have. Why do you want to kill something?
> > >
> > > Because all the attempts so far have failed and we should rather not
> > > retry endlessly. With the band-aid we know we will retry
> > > MAX_RECLAIM_RETRIES at most. So compaction had that many attempts to
> > > resolve the situation along with the same amount of reclaim rounds to
> > > help and get over watermarks.
> > >
> > > > It also doesn't guarantee to make high order pages. It is just another
> > > > way of reclaiming memory. What is the difference between plain reclaim
> > > > and OOM kill? Why do we use OOM kill in this case?
> > >
> > > What is our alternative other than keep looping endlessly?
> >
> > Loop as long as free memory or estimated available memory (free +
> > reclaimable) increases. This means that we did some progress. And,
> > they will not grow forever because we have just limited reclaimable
> > memory and limited memory. You can reset no_progress_loops = 0 when
> > those metric increases than before.
>
> Hmm, why is this any better than taking the feedback from the reclaim
> (did_some_progress)?

My suggestion would only apply to the high order case. In that case,
free pages and reclaimable pages are already sufficient, and parallel
free page consumers would re-generate reclaimable pages endlessly, so a
positive did_some_progress would be returned endlessly. We need to stop
retrying at some point, so we need some metric that ensures a finite
number of retries in any case.

>
> > With this bound, we can do our best to try to solve this unpleasant
> > situation before OOM.
> >
> > Unconditional 16 looping and then OOM kill really doesn't make any
> > sense, because it doesn't mean that we already do our best.
>
> 16 is not really that important. We can change that if that doesn't
> sounds sufficient. But please note that each reclaim round means
> that we have scanned all eligible LRUs to find and reclaim something
> and asked direct compaction to prepare a high order page.
> This sounds like "do our best" to me.

AFAIK, each reclaim round doesn't reclaim all reclaimable pages; it has
a limit on how much it reclaims. That doesn't look like our best to me,
and N retries only multiply that limit N times, which still doesn't look
like our best and will lead to premature OOM kills.

> Now it seems that we need more changes at least in the compaction area
> because the code doesn't seem to fit the nature of !costly allocation
> requests. I am also not satisfied with the fixed MAX_RECLAIM_RETRIES for
> high order pages, I would much rather see some feedback mechanism which
> would measurable and evaluated in some way but is this really necessary
> for the initial version?

I don't know. My analysis is based only on my guesses and background
knowledge, not on a practical use case, so I'm not sure whether it is
necessary for the initial version or not. It's up to you.

Thanks.

2016-03-11 10:45:41

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

(Posting as a reply to this thread.)

I was trying to test the side effects of "oom, oom_reaper: disable oom_reaper for
oom_kill_allocating_task" compared to "oom: clear TIF_MEMDIE after oom_reaper
managed to unmap the address space" using the reproducer shown below.

---------- Reproducer start ----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/prctl.h>
#include <signal.h>

static char buffer[4096] = { };

static int file_io(void *unused)
{
	const int fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
	sleep(2);
	while (write(fd, buffer, sizeof(buffer)) > 0);
	close(fd);
	return 0;
}

int main(int argc, char *argv[])
{
	int i;
	if (chdir("/tmp"))
		return 1;
	for (i = 0; i < 64; i++)
		if (fork() == 0) {
			static cpu_set_t set = { { 1 } };
			const int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sched_setaffinity(0, sizeof(set), &set);
			snprintf(buffer, sizeof(buffer), "file_io.%02u", i);
			prctl(PR_SET_NAME, (unsigned long) buffer, 0, 0, 0);
			for (i = 0; i < 16; i++)
				clone(file_io, malloc(1024) + 1024, CLONE_VM, NULL);
			while (1)
				pause();
		}
	{ /* A dummy process for invoking the OOM killer. */
		char *buf = NULL;
		unsigned long i;
		unsigned long size = 0;
		prctl(PR_SET_NAME, (unsigned long) "memeater", 0, 0, 0);
		for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
			char *cp = realloc(buf, size);
			if (!cp) {
				size >>= 1;
				break;
			}
			buf = cp;
		}
		sleep(4);
		for (i = 0; i < size; i += 4096)
			buf[i] = '\0'; /* Will cause OOM due to overcommit */
	}
	kill(-1, SIGKILL);
	return * (char *) NULL; /* Not reached. */
}
---------- Reproducer end ----------

The characteristic of this reproducer is that the OOM killer chooses the same mm
multiple times due to clone(!CLONE_SIGHAND && CLONE_VM), and the OOM reaper
happily skips reaping that mm because the mm_struct is marked MMF_OOM_KILLED or
only the first victim's signal_struct is marked OOM_SCORE_ADJ_MIN, which means that
nobody can clear TIF_MEMDIE when a non-first victim cannot terminate.

But the problem I can hit trivially is that kswapd gets stuck on an unkillable lock
while all allocating tasks are waiting at congestion_wait(). This situation resembles
http://lkml.kernel.org/r/[email protected]
but without looping at too_many_isolated() in shrink_inactive_list().
I don't know what is happening.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160311.txt.xz .
---------- console log start ----------
[ 81.282661] memeater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[ 81.297589] memeater cpuset=/ mems_allowed=0
[ 81.303615] CPU: 2 PID: 1239 Comm: memeater Tainted: G W 4.5.0-rc7-next-20160310 #103
(...snipped...)
[ 81.456295] Out of memory: Kill process 1240 (file_io.00) score 999 or sacrifice child
[ 81.459768] Killed process 1240 (file_io.00) total-vm:4308kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 81.682547] ksmtuned invoked oom-killer: gfp_mask=0x24084c0(GFP_KERNEL|__GFP_REPEAT|__GFP_ZERO), order=0, oom_score_adj=0
[ 81.703992] ksmtuned cpuset=/ mems_allowed=0
[ 81.709402] CPU: 1 PID: 2330 Comm: ksmtuned Tainted: G W 4.5.0-rc7-next-20160310 #103
(...snipped...)
[ 81.928733] Out of memory: Kill process 1248 (file_io.00) score 1000 or sacrifice child
[ 81.932194] Killed process 1248 (file_io.00) total-vm:4308kB, anon-rss:104kB, file-rss:1044kB, shmem-rss:0kB
(...snipped...)
[ 136.837273] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[ 136.858075] lowmem_reserve[]: 0 953 953 953
[ 136.860609] Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34720kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39068kB kernel_stack:20512kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1648kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:964952 all_unreclaimable? yes
[ 136.880330] lowmem_reserve[]: 0 0 0 0
[ 136.883137] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[ 136.890862] Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB
(...snipped...)
[ 143.721805] kswapd0 D ffff880039ffb760 0 52 2 0x00000000
[ 143.724711] ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[ 143.727782] ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[ 143.730815] ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[ 143.733839] Call Trace:
[ 143.735190] [<ffffffff81666600>] schedule+0x30/0x80
[ 143.737387] [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[ 143.739964] [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[ 143.742944] [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[ 143.745315] [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[ 143.747737] [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[ 143.750071] [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[ 143.752534] [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 143.755230] [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 143.757959] [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[ 143.760382] [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[ 143.762785] [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[ 143.765347] [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[ 143.767801] [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[ 143.770084] [<ffffffff81157661>] kswapd+0x441/0x830
[ 143.772193] [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 143.774941] [<ffffffff8109181e>] kthread+0xee/0x110
[ 143.777025] [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[ 143.779276] [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 144.479298] file_io.00 D ffff88003ac97cb8 0 1248 1 0x00100084
[ 144.482410] ffff88003ac97cb8 ffff88003b8760c0 ffff88003658c040 ffff88003ac98000
[ 144.485513] ffff88003a280ac8 0000000000000246 ffff88003658c040 00000000ffffffff
[ 144.488618] ffff88003ac97cd0 ffffffff81666600 ffff88003a280ac0 ffff88003ac97ce0
[ 144.491661] Call Trace:
[ 144.492921] [<ffffffff81666600>] schedule+0x30/0x80
[ 144.495066] [<ffffffff81666909>] schedule_preempt_disabled+0x9/0x10
[ 144.497582] [<ffffffff816684bf>] mutex_lock_nested+0x14f/0x3a0
[ 144.500060] [<ffffffffa0237eef>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[ 144.503077] [<ffffffff810bd130>] ? __lock_acquire+0x8c0/0x1f50
[ 144.505494] [<ffffffffa0237eef>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[ 144.508375] [<ffffffff8111dfca>] ? __audit_syscall_entry+0xaa/0xf0
[ 144.510996] [<ffffffffa023810a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[ 144.514521] [<ffffffff811bf327>] __vfs_write+0xc7/0x100
[ 144.517230] [<ffffffff811bfedd>] vfs_write+0x9d/0x190
[ 144.519407] [<ffffffff811df5da>] ? __fget_light+0x6a/0x90
[ 144.521772] [<ffffffff811c0713>] SyS_write+0x53/0xd0
[ 144.523909] [<ffffffff8100364d>] do_syscall_64+0x5d/0x180
[ 144.526145] [<ffffffff8166b57f>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 145.684411] kworker/3:3 D ffff88000e987878 0 2329 2 0x00000080
[ 145.684415] Workqueue: events_freezable_power_ disk_events_workfn
[ 145.684416] ffff88000e987878 ffff880037d76140 ffff88000e980100 ffff88000e988000
[ 145.684417] ffff88000e9878b0 ffff88003d6d02c0 00000000fffd9bc4 ffff88003ffdf100
[ 145.684417] ffff88000e987890 ffffffff81666600 ffff88003d6d02c0 ffff88000e987938
[ 145.684418] Call Trace:
[ 145.684419] [<ffffffff81666600>] schedule+0x30/0x80
[ 145.684419] [<ffffffff8166a687>] schedule_timeout+0x117/0x1c0
[ 145.684420] [<ffffffff810bc306>] ? mark_held_locks+0x66/0x90
[ 145.684421] [<ffffffff810def90>] ? init_timer_key+0x40/0x40
[ 145.684422] [<ffffffff810e5e17>] ? ktime_get+0xa7/0x130
[ 145.684423] [<ffffffff81665b41>] io_schedule_timeout+0xa1/0x110
[ 145.684424] [<ffffffff81160ccd>] congestion_wait+0x7d/0xd0
[ 145.684425] [<ffffffff810b63a0>] ? wait_woken+0x80/0x80
[ 145.684426] [<ffffffff8114a602>] __alloc_pages_nodemask+0xb42/0xd50
[ 145.684427] [<ffffffff810bc300>] ? mark_held_locks+0x60/0x90
[ 145.684428] [<ffffffff81193a26>] alloc_pages_current+0x96/0x1b0
[ 145.684430] [<ffffffff812e1b3d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 145.684431] [<ffffffff812e2e74>] bio_copy_kern+0xc4/0x180
[ 145.684433] [<ffffffff812edb20>] blk_rq_map_kern+0x70/0x130
[ 145.684435] [<ffffffff8145255d>] scsi_execute+0x12d/0x160
[ 145.684436] [<ffffffff81452684>] scsi_execute_req_flags+0x84/0xf0
[ 145.684438] [<ffffffffa01ed762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 145.684440] [<ffffffffa01e1163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 145.684441] [<ffffffffa01edba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 145.684442] [<ffffffff812f928b>] disk_check_events+0x5b/0x150
[ 145.684443] [<ffffffff812f9397>] disk_events_workfn+0x17/0x20
[ 145.684445] [<ffffffff8108b4c5>] process_one_work+0x1a5/0x400
[ 145.684446] [<ffffffff8108b461>] ? process_one_work+0x141/0x400
[ 145.684448] [<ffffffff8108b846>] worker_thread+0x126/0x490
[ 145.684449] [<ffffffff81665ec1>] ? __schedule+0x311/0xa20
[ 145.684450] [<ffffffff8108b720>] ? process_one_work+0x400/0x400
[ 145.684451] [<ffffffff8109181e>] kthread+0xee/0x110
[ 145.684452] [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[ 145.684453] [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 208.035194] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[ 208.051970] lowmem_reserve[]: 0 953 953 953
[ 208.054174] Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34724kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39064kB kernel_stack:20512kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1644kB local_pcp:108kB free_cma:0kB writeback_tmp:0kB pages_scanned:1882904 all_unreclaimable? yes
[ 208.072237] lowmem_reserve[]: 0 0 0 0
[ 208.074340] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[ 208.080915] Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB
(...snipped...)
[ 290.388544] INFO: task kswapd0:52 blocked for more than 120 seconds.
[ 290.391197] Tainted: G W 4.5.0-rc7-next-20160310 #103
[ 290.393979] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 290.397150] kswapd0 D ffff880039ffb760 0 52 2 0x00000000
[ 290.400194] ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[ 290.403394] ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[ 290.406715] ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[ 290.409874] Call Trace:
[ 290.411242] [<ffffffff81666600>] schedule+0x30/0x80
[ 290.413423] [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[ 290.416100] [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[ 290.418835] [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[ 290.421278] [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[ 290.423672] [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[ 290.426042] [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[ 290.428569] [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 290.431173] [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 290.433753] [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[ 290.436135] [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[ 290.438583] [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[ 290.441090] [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[ 290.443500] [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[ 290.445703] [<ffffffff81157661>] kswapd+0x441/0x830
[ 290.447973] [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 290.450676] [<ffffffff8109181e>] kthread+0xee/0x110
[ 290.452780] [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[ 290.455018] [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
[ 290.457910] 1 lock held by kswapd0/52:
[ 290.459813] #0: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
(...snipped...)
[ 336.562747] Node 0 DMA free:3864kB min:60kB low:72kB high:84kB active_anon:9504kB inactive_anon:84kB active_file:140kB inactive_file:448kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:15988kB managed:15904kB mlocked:0kB dirty:448kB writeback:0kB mapped:172kB shmem:84kB slab_reclaimable:164kB slab_unreclaimable:692kB kernel_stack:448kB pagetables:156kB unstable:0kB
bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:4244 all_unreclaimable? yes
[ 336.589823] lowmem_reserve[]: 0 953 953 953
[ 336.593296] Node 0 DMA32 free:3776kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34724kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39192kB kernel_stack:20416kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1520kB local_pcp:100kB free_cma:0kB writeback_tmp:0kB pages_scanned:1001584 all_unreclaimable? yes
[ 336.618011] lowmem_reserve[]: 0 0 0 0
[ 336.620073] Node 0 DMA: 28*4kB (UE) 15*8kB (UE) 9*16kB (UME) 1*32kB (M) 2*64kB (UE) 2*128kB (UE) 0*256kB 2*512kB (UE) 2*1024kB (UE) 0*2048kB 0*4096kB = 3864kB
[ 336.626844] Node 0 DMA32: 860*4kB (UME) 18*8kB (UME) 8*16kB (UM) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3776kB
(...snipped...)
[ 393.774051] kswapd0 D ffff880039ffb760 0 52 2 0x00000000
[ 393.777018] ffff880039ffb760 ffff88003bb5e140 ffff880039ff4000 ffff880039ffc000
[ 393.779986] ffff88003a2c3850 ffff88003a2c3868 ffff880039ffb958 0000000000000001
[ 393.783000] ffff880039ffb778 ffffffff81666600 ffff880039ff4000 ffff880039ffb7d8
[ 393.785958] Call Trace:
[ 393.787191] [<ffffffff81666600>] schedule+0x30/0x80
[ 393.789198] [<ffffffff8166a066>] rwsem_down_read_failed+0xd6/0x140
[ 393.791707] [<ffffffff81323708>] call_rwsem_down_read_failed+0x18/0x30
[ 393.794364] [<ffffffff810b888b>] down_read_nested+0x3b/0x50
[ 393.796634] [<ffffffffa0242c5b>] ? xfs_ilock+0x4b/0xe0 [xfs]
[ 393.798952] [<ffffffffa0242c5b>] xfs_ilock+0x4b/0xe0 [xfs]
[ 393.801274] [<ffffffffa022d2d0>] xfs_map_blocks+0x80/0x150 [xfs]
[ 393.803709] [<ffffffffa022e27b>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 393.806254] [<ffffffffa022e656>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 393.808718] [<ffffffff8115356f>] pageout.isra.43+0x18f/0x240
[ 393.811002] [<ffffffff81154ed3>] shrink_page_list+0x803/0xae0
[ 393.813415] [<ffffffff8115590b>] shrink_inactive_list+0x1fb/0x460
[ 393.815834] [<ffffffff81156516>] shrink_zone_memcg+0x5b6/0x780
[ 393.818316] [<ffffffff811567b4>] shrink_zone+0xd4/0x2f0
[ 393.820472] [<ffffffff81157661>] kswapd+0x441/0x830
[ 393.822658] [<ffffffff81157220>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 393.825463] [<ffffffff8109181e>] kthread+0xee/0x110
[ 393.827626] [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[ 393.829824] [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 395.000240] file_io.00 D ffff88003ac97cb8 0 1248 1 0x00100084
[ 395.003355] ffff88003ac97cb8 ffff88003b8760c0 ffff88003658c040 ffff88003ac98000
[ 395.006582] ffff88003a280ac8 0000000000000246 ffff88003658c040 00000000ffffffff
[ 395.010026] ffff88003ac97cd0 ffffffff81666600 ffff88003a280ac0 ffff88003ac97ce0
[ 395.013010] Call Trace:
[ 395.014201] [<ffffffff81666600>] schedule+0x30/0x80
[ 395.016248] [<ffffffff81666909>] schedule_preempt_disabled+0x9/0x10
[ 395.018824] [<ffffffff816684bf>] mutex_lock_nested+0x14f/0x3a0
[ 395.021194] [<ffffffffa0237eef>] ? xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[ 395.024197] [<ffffffff810bd130>] ? __lock_acquire+0x8c0/0x1f50
[ 395.026672] [<ffffffffa0237eef>] xfs_file_buffered_aio_write+0x5f/0x1f0 [xfs]
[ 395.029525] [<ffffffff8111dfca>] ? __audit_syscall_entry+0xaa/0xf0
[ 395.032029] [<ffffffffa023810a>] xfs_file_write_iter+0x8a/0x150 [xfs]
[ 395.034589] [<ffffffff811bf327>] __vfs_write+0xc7/0x100
[ 395.036723] [<ffffffff811bfedd>] vfs_write+0x9d/0x190
[ 395.038841] [<ffffffff811df5da>] ? __fget_light+0x6a/0x90
[ 395.041069] [<ffffffff811c0713>] SyS_write+0x53/0xd0
[ 395.043258] [<ffffffff8100364d>] do_syscall_64+0x5d/0x180
[ 395.045511] [<ffffffff8166b57f>] entry_SYSCALL64_slow_path+0x25/0x25
(...snipped...)
[ 446.012823] kworker/3:3 D ffff88000e987878 0 2329 2 0x00000080
[ 446.015632] Workqueue: events_freezable_power_ disk_events_workfn
[ 446.018103] ffff88000e987878 ffff88003cc0c040 ffff88000e980100 ffff88000e988000
[ 446.021099] ffff88000e9878b0 ffff88003d6d02c0 0000000100016c95 ffff88003ffdf100
[ 446.024247] ffff88000e987890 ffffffff81666600 ffff88003d6d02c0 ffff88000e987938
[ 446.027332] Call Trace:
[ 446.028568] [<ffffffff81666600>] schedule+0x30/0x80
[ 446.030748] [<ffffffff8166a687>] schedule_timeout+0x117/0x1c0
[ 446.033122] [<ffffffff810bc306>] ? mark_held_locks+0x66/0x90
[ 446.035466] [<ffffffff810def90>] ? init_timer_key+0x40/0x40
[ 446.037756] [<ffffffff810e5e17>] ? ktime_get+0xa7/0x130
[ 446.039960] [<ffffffff81665b41>] io_schedule_timeout+0xa1/0x110
[ 446.042385] [<ffffffff81160ccd>] congestion_wait+0x7d/0xd0
[ 446.044651] [<ffffffff810b63a0>] ? wait_woken+0x80/0x80
[ 446.046817] [<ffffffff8114a602>] __alloc_pages_nodemask+0xb42/0xd50
[ 446.049395] [<ffffffff810bc300>] ? mark_held_locks+0x60/0x90
[ 446.051700] [<ffffffff81193a26>] alloc_pages_current+0x96/0x1b0
[ 446.054089] [<ffffffff812e1b3d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 446.056515] [<ffffffff812e2e74>] bio_copy_kern+0xc4/0x180
[ 446.058737] [<ffffffff812edb20>] blk_rq_map_kern+0x70/0x130
[ 446.061105] [<ffffffff8145255d>] scsi_execute+0x12d/0x160
[ 446.063334] [<ffffffff81452684>] scsi_execute_req_flags+0x84/0xf0
[ 446.065810] [<ffffffffa01ed762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 446.068343] [<ffffffffa01e1163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 446.070897] [<ffffffffa01edba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 446.073569] [<ffffffff812f928b>] disk_check_events+0x5b/0x150
[ 446.075895] [<ffffffff812f9397>] disk_events_workfn+0x17/0x20
[ 446.078340] [<ffffffff8108b4c5>] process_one_work+0x1a5/0x400
[ 446.080696] [<ffffffff8108b461>] ? process_one_work+0x141/0x400
[ 446.083069] [<ffffffff8108b846>] worker_thread+0x126/0x490
[ 446.085395] [<ffffffff81665ec1>] ? __schedule+0x311/0xa20
[ 446.087587] [<ffffffff8108b720>] ? process_one_work+0x400/0x400
[ 446.089996] [<ffffffff8109181e>] kthread+0xee/0x110
[ 446.092242] [<ffffffff8166b6f2>] ret_from_fork+0x22/0x50
[ 446.094527] [<ffffffff81091730>] ? kthread_create_on_node+0x230/0x230
---------- console log end ----------

2016-03-11 13:08:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> (Posting as a reply to this thread.)

I really do not see how this is related to this thread.
--
Michal Hocko
SUSE Labs

2016-03-11 13:32:08

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > (Posting as a reply to this thread.)
>
> I really do not see how this is related to this thread.

All allocating tasks are looping at

/*
* If we didn't make any progress and have a lot of
* dirty + writeback pages then we should wait for
* an IO to complete to slow down the reclaim and
* prevent from pre mature OOM
*/
if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
return true;
}

in should_reclaim_retry().

should_reclaim_retry() was added by the OOM detection rework, wasn't it?

2016-03-11 15:28:55

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > (Posting as a reply to this thread.)
> >
> > I really do not see how this is related to this thread.
>
> All allocating tasks are looping at
>
> /*
> * If we didn't make any progress and have a lot of
> * dirty + writeback pages then we should wait for
> * an IO to complete to slow down the reclaim and
> * prevent from pre mature OOM
> */
> if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> congestion_wait(BLK_RW_ASYNC, HZ/10);
> return true;
> }
>
> in should_reclaim_retry().
>
> should_reclaim_retry() was added by the OOM detection rework, wasn't it?

What happens without this patch applied? In other words, it all smells
like the IO got stuck somewhere and the direct reclaim cannot perform it,
so we have to wait for the flushers to make progress for us. Are those
stuck? Is the IO making any progress at all, or is it just too slow and
would eventually finish? Wouldn't we just wait somewhere else in the
direct reclaim path instead?

--
Michal Hocko
SUSE Labs

2016-03-11 16:49:36

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > (Posting as a reply to this thread.)
> > >
> > > I really do not see how this is related to this thread.
> >
> > All allocating tasks are looping at
> >
> > /*
> > * If we didn't make any progress and have a lot of
> > * dirty + writeback pages then we should wait for
> > * an IO to complete to slow down the reclaim and
> > * prevent from pre mature OOM
> > */
> > if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > return true;
> > }
> >
> > in should_reclaim_retry().
> >
> > should_reclaim_retry() was added by the OOM detection rework, wasn't it?
>
> What happens without this patch applied. In other words, it all smells
> like the IO got stuck somewhere and the direct reclaim cannot perform it
> so we have to wait for the flushers to make a progress for us. Are those
> stuck? Is the IO making any progress at all or it is just too slow and
> it would finish actually. Wouldn't we just wait somewhere else in the
> direct reclaim path instead.

As of next-20160311, CPU usage becomes 0% when this problem occurs.

If I remove

mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
mm: use watermark checks for __GFP_REPEAT high order allocations
mm: throttle on IO only when there are too many dirty and writeback pages
mm-oom-rework-oom-detection-checkpatch-fixes
mm, oom: rework oom detection

then CPU usage becomes 60% and most of the allocating tasks
are looping at

/*
* Acquire the oom lock. If that fails, somebody else is
* making progress for us.
*/
if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
schedule_timeout_uninterruptible(1);
return NULL;
}

in __alloc_pages_may_oom() (i.e. an OOM livelock because the OOM reaper is disabled).

2016-03-11 17:00:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Sat 12-03-16 01:49:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 11-03-16 22:32:02, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Fri 11-03-16 19:45:29, Tetsuo Handa wrote:
> > > > > (Posting as a reply to this thread.)
> > > >
> > > > I really do not see how this is related to this thread.
> > >
> > > All allocating tasks are looping at
> > >
> > > /*
> > > * If we didn't make any progress and have a lot of
> > > * dirty + writeback pages then we should wait for
> > > * an IO to complete to slow down the reclaim and
> > > * prevent from pre mature OOM
> > > */
> > > if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > > congestion_wait(BLK_RW_ASYNC, HZ/10);
> > > return true;
> > > }
> > >
> > > in should_reclaim_retry().
> > >
> > > should_reclaim_retry() was added by the OOM detection rework, wasn't it?
> >
> > What happens without this patch applied. In other words, it all smells
> > like the IO got stuck somewhere and the direct reclaim cannot perform it
> > so we have to wait for the flushers to make a progress for us. Are those
> > stuck? Is the IO making any progress at all or it is just too slow and
> > it would finish actually. Wouldn't we just wait somewhere else in the
> > direct reclaim path instead.
>
> As of next-20160311, CPU usage becomes 0% when this problem occurs.
>
> If I remove
>
> mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
> mm: use watermark checks for __GFP_REPEAT high order allocations
> mm: throttle on IO only when there are too many dirty and writeback pages
> mm-oom-rework-oom-detection-checkpatch-fixes
> mm, oom: rework oom detection
>
> then CPU usage becomes 60% and most of allocating tasks
> are looping at
>
> /*
> * Acquire the oom lock. If that fails, somebody else is
> * making progress for us.
> */
> if (!mutex_trylock(&oom_lock)) {
> *did_some_progress = 1;
> schedule_timeout_uninterruptible(1);
> return NULL;
> }
>
> in __alloc_pages_may_oom() (i.e. OOM-livelock due to the OOM reaper disabled).

OK, that would suggest that the oom rework patches are not really
related. They just moved us from the livelock to a sleep, which is good in
general IMHO. We even know that it is most probably the IO that is the
problem, because we know that more than half of the reclaimable memory is
either dirty or under writeback. That is where you should be looking:
why is the IO not making progress, or making only such slow progress?

--
Michal Hocko
SUSE Labs

2016-03-11 17:20:48

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> On Sat 12-03-16 01:49:26, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > What happens without this patch applied. In other words, it all smells
> > > like the IO got stuck somewhere and the direct reclaim cannot perform it
> > > so we have to wait for the flushers to make a progress for us. Are those
> > > stuck? Is the IO making any progress at all or it is just too slow and
> > > it would finish actually. Wouldn't we just wait somewhere else in the
> > > direct reclaim path instead.
> >
> > As of next-20160311, CPU usage becomes 0% when this problem occurs.
> >
> > If I remove
> >
> > mm-use-watermak-checks-for-__gfp_repeat-high-order-allocations-checkpatch-fixes
> > mm: use watermark checks for __GFP_REPEAT high order allocations
> > mm: throttle on IO only when there are too many dirty and writeback pages
> > mm-oom-rework-oom-detection-checkpatch-fixes
> > mm, oom: rework oom detection
> >
> > then CPU usage becomes 60% and most of allocating tasks
> > are looping at
> >
> > /*
> > * Acquire the oom lock. If that fails, somebody else is
> > * making progress for us.
> > */
> > if (!mutex_trylock(&oom_lock)) {
> > *did_some_progress = 1;
> > schedule_timeout_uninterruptible(1);
> > return NULL;
> > }
> >
> > in __alloc_pages_may_oom() (i.e. OOM-livelock due to the OOM reaper disabled).
>
> OK, that would suggest that the oom rework patches are not really
> related. They just moved from the livelock to a sleep which is good in
> general IMHO. We even know that it is most probably the IO that is the
> problem because we know that more than half of the reclaimable memory is
> either dirty or under writeback. That is where you should be looking.
> Why the IO is not making progress or such a slow progress.
>

Excuse me, but I can't understand why you think the oom rework patches are not
related. This problem occurs immediately after the OOM killer is invoked, which
means that there is little reclaimable memory.

Node 0 DMA32 free:3648kB min:3780kB low:4752kB high:5724kB active_anon:783216kB inactive_anon:6376kB active_file:33388kB inactive_file:40292kB unevictable:0kB isolated(anon):0kB
isolated(file):128kB present:1032064kB managed:980816kB mlocked:0kB dirty:40232kB writeback:120kB mapped:34720kB shmem:6628kB slab_reclaimable:10528kB slab_unreclaimable:39068kB kernel_stack:20512kB
pagetables:8000kB unstable:0kB bounce:0kB free_pcp:1648kB local_pcp:116kB free_cma:0kB writeback_tmp:0kB pages_scanned:964952 all_unreclaimable? yes
Node 0 DMA32: 860*4kB (UME) 16*8kB (UME) 1*16kB (M) 0*32kB 1*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3648kB

The OOM killer is invoked (but nothing happens due to TIF_MEMDIE) if I remove
the oom rework patches, which means that there is little reclaimable memory.

My understanding is that memory allocation requests needed for doing I/O cannot
be satisfied because free: is below min:. And since kswapd got stuck, nobody can
perform the operations needed to make 2*(writeback + dirty) > reclaimable false.
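
To illustrate the two conditions with the numbers from the DMA32 dump above,
here is a back-of-the-envelope sketch (a standalone illustration, not kernel
code; reclaimable is approximated by the file LRU pages since there is no
swap, and everything is kept in kB because only the ratios matter):

#include <stdio.h>

int main(void)
{
	long free = 3648, min = 3780;		/* free is below the min watermark */
	long dirty = 40232, writeback = 120;
	long reclaimable = 33388 + 40292;	/* active_file + inactive_file */

	/* ordinary allocation attempts fail the min watermark check */
	printf("watermark ok: %s\n", free >= min ? "yes" : "no");

	/*
	 * should_reclaim_retry() keeps retrying because free + reclaimable
	 * is well above min, but then throttles on IO because more than
	 * half of the reclaimable memory is dirty or under writeback.
	 */
	printf("keep retrying: %s\n", free + reclaimable >= min ? "yes" : "no");
	printf("throttle on IO: %s\n",
	       2 * (writeback + dirty) > reclaimable ? "yes" : "no");
	return 0;
}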

2016-03-12 04:08:12

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Michal Hocko wrote:
> OK, that would suggest that the oom rework patches are not really
> related. They just moved from the livelock to a sleep which is good in
> general IMHO. We even know that it is most probably the IO that is the
> problem because we know that more than half of the reclaimable memory is
> either dirty or under writeback. That is where you should be looking.
> Why the IO is not making progress or such a slow progress.
>

A footnote. Regarding this reproducer, before the OOM detection rework patches
the problem was "anybody can declare OOM and call out_of_memory(), but
out_of_memory() does nothing because there is a thread which has TIF_MEMDIE."
After the OOM detection rework patches the problem is "nobody can declare OOM
and call out_of_memory(), although out_of_memory() would do nothing anyway
because there is a thread which has TIF_MEMDIE."

Dave Chinner wrote at http://lkml.kernel.org/r/20160211225929.GU14668@dastard :
> > Although there are memory allocating tasks passing gfp flags with
> > __GFP_KSWAPD_RECLAIM, kswapd is unable to make forward progress because
> > it is blocked at down() called from memory reclaim path. And since it is
> > legal to block kswapd from memory reclaim path (am I correct?), I think
> > we must not assume that current_is_kswapd() check will break the infinite
> > loop condition.
>
> Right, the threads that are blocked in writeback waiting on memory
> reclaim will be using GFP_NOFS to prevent recursion deadlocks, but
> that does not avoid the problem that kswapd can then get stuck
> on those locks, too. Hence there is no guarantee that kswapd can
> make reclaim progress if it does dirty page writeback...

Unless we address the issue Dave commented on, the OOM detection rework patches
add a new location of livelock (which is demonstrated by this reproducer) in
the memory allocator. It is an unfortunate change to add a new location of
livelock while we are trying to solve the thrashing problem.

2016-03-13 14:41:52

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/3] OOM detection rework v4

Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, that would suggest that the oom rework patches are not really
> > related. They just moved from the livelock to a sleep which is good in
> > general IMHO. We even know that it is most probably the IO that is the
> > problem because we know that more than half of the reclaimable memory is
> > either dirty or under writeback. That is where you should be looking.
> > Why the IO is not making progress or such a slow progress.
> >
>
> A footnote. Regarding this reproducer, the problem was "anybody can declare
> OOM and call out_of_memory(). But out_of_memory() does nothing because there
> is a thread which has TIF_MEMDIE." before the OOM detection rework patches,
> and the problem is "nobody can declare OOM and call out_of_memory(). Although
> out_of_memory() will do nothing because there is a thread which has
> TIF_MEMDIE." after the OOM detection rework patches.

According to kmallocwd, allocating tasks are able to call out_of_memory(),
albeit very slowly ( http://I-love.SAKURA.ne.jp/tmp/serial-20160313.txt.xz ).
It seems that the oom detection rework patches are not really related.

>
> Dave Chinner wrote at http://lkml.kernel.org/r/20160211225929.GU14668@dastard :
> > > Although there are memory allocating tasks passing gfp flags with
> > > __GFP_KSWAPD_RECLAIM, kswapd is unable to make forward progress because
> > > it is blocked at down() called from memory reclaim path. And since it is
> > > legal to block kswapd from memory reclaim path (am I correct?), I think
> > > we must not assume that current_is_kswapd() check will break the infinite
> > > loop condition.
> >
> > Right, the threads that are blocked in writeback waiting on memory
> > reclaim will be using GFP_NOFS to prevent recursion deadlocks, but
> > that does not avoid the problem that kswapd can then get stuck
> > on those locks, too. Hence there is no guarantee that kswapd can
> > make reclaim progress if it does dirty page writeback...
>
> Unless we address the issue Dave commented, the OOM detection rework patches
> add a new location of livelock (which is demonstrated by this reproducer) in
> the memory allocator. It is an unfortunate change that we add a new location
> of livelock when we are trying to solve thrashing problem.
>

The oom detection rework patches did not add a new location of livelock.
They just did not address the problem that I/O cannot make progress.

2016-03-17 11:35:36

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

Today I was testing

----------
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c950e6e8..aa52e23ac280 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -887,7 +887,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
{
struct wb_writeback_work *work;

- if (!wb_has_dirty_io(wb))
+ if (!wb_has_dirty_io(wb) || writeback_in_progress(wb))
return;

/*
----------

using next-20160317, and I got the results below.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20160317.txt.xz .
---------- console log ----------
[ 1354.048836] Out of memory: Kill process 3641 (file_io.02) score 1000 or sacrifice child
[ 1354.054773] Killed process 3641 (file_io.02) total-vm:4308kB, anon-rss:104kB, file-rss:1264kB, shmem-rss:0kB
[ 1593.471245] sysrq: SysRq : Show State
(...snipped...)
[ 1595.944649] kswapd0 D ffff88003681f760 0 53 2 0x00000000
[ 1595.949872] ffff88003681f760 ffff88003fbfa140 ffff88003681a040 ffff880036820000
[ 1595.955342] ffff88002b5e0750 ffff88002b5e0768 ffff88003681f958 0000000000000001
[ 1595.960826] ffff88003681f778 ffffffff81660570 ffff88003681a040 ffff88003681f7d8
[ 1595.966319] Call Trace:
[ 1595.968662] [<ffffffff81660570>] schedule+0x30/0x80
[ 1595.972552] [<ffffffff81663fd6>] rwsem_down_read_failed+0xd6/0x140
[ 1595.977199] [<ffffffff81322d98>] call_rwsem_down_read_failed+0x18/0x30
[ 1595.982087] [<ffffffff810b8b4b>] down_read_nested+0x3b/0x50
[ 1595.986370] [<ffffffffa024bcbb>] ? xfs_ilock+0x4b/0xe0 [xfs]
[ 1595.990681] [<ffffffffa024bcbb>] xfs_ilock+0x4b/0xe0 [xfs]
[ 1595.994898] [<ffffffffa0236330>] xfs_map_blocks+0x80/0x150 [xfs]
[ 1595.999441] [<ffffffffa02372db>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 1596.004138] [<ffffffffa02376b6>] xfs_vm_writepage+0x36/0x70 [xfs]
[ 1596.008692] [<ffffffff811538ef>] pageout.isra.43+0x18f/0x240
[ 1596.012938] [<ffffffff81155253>] shrink_page_list+0x803/0xae0
[ 1596.017247] [<ffffffff81155c8b>] shrink_inactive_list+0x1fb/0x460
[ 1596.021771] [<ffffffff81156896>] shrink_zone_memcg+0x5b6/0x780
[ 1596.026103] [<ffffffff81156b34>] shrink_zone+0xd4/0x2f0
[ 1596.030111] [<ffffffff811579e1>] kswapd+0x441/0x830
[ 1596.033847] [<ffffffff811575a0>] ? mem_cgroup_shrink_node_zone+0xb0/0xb0
[ 1596.038786] [<ffffffff8109196e>] kthread+0xee/0x110
[ 1596.042546] [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1596.046591] [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1596.216946] kworker/u128:1 D ffff8800368eaf78 0 70 2 0x00000000
[ 1596.222105] Workqueue: writeback wb_workfn (flush-8:0)
[ 1596.226009] ffff8800368eaf78 ffff88003aa4c040 ffff88003686c0c0 ffff8800368ec000
[ 1596.231502] ffff8800368eafb0 ffff88003d610300 000000010013c47d ffff88003ffdf100
[ 1596.237003] ffff8800368eaf90 ffffffff81660570 ffff88003d610300 ffff8800368eb038
[ 1596.242505] Call Trace:
[ 1596.244750] [<ffffffff81660570>] schedule+0x30/0x80
[ 1596.248519] [<ffffffff816645f7>] schedule_timeout+0x117/0x1c0
[ 1596.252841] [<ffffffff810bc5c6>] ? mark_held_locks+0x66/0x90
[ 1596.257153] [<ffffffff810df270>] ? init_timer_key+0x40/0x40
[ 1596.261424] [<ffffffff810e60f7>] ? ktime_get+0xa7/0x130
[ 1596.265390] [<ffffffff8165fab1>] io_schedule_timeout+0xa1/0x110
[ 1596.269836] [<ffffffff8116104d>] congestion_wait+0x7d/0xd0
[ 1596.273978] [<ffffffff810b6620>] ? wait_woken+0x80/0x80
[ 1596.278153] [<ffffffff8114a982>] __alloc_pages_nodemask+0xb42/0xd50
[ 1596.283301] [<ffffffff81193876>] alloc_pages_current+0x96/0x1b0
[ 1596.287737] [<ffffffffa0270d70>] xfs_buf_allocate_memory+0x170/0x2ab [xfs]
[ 1596.292829] [<ffffffffa023c9aa>] xfs_buf_get_map+0xfa/0x160 [xfs]
[ 1596.297457] [<ffffffffa023cea9>] xfs_buf_read_map+0x29/0xe0 [xfs]
[ 1596.302034] [<ffffffffa02670e7>] xfs_trans_read_buf_map+0x97/0x1a0 [xfs]
[ 1596.307004] [<ffffffffa02171b3>] xfs_btree_read_buf_block.constprop.29+0x73/0xc0 [xfs]
[ 1596.312736] [<ffffffffa021727b>] xfs_btree_lookup_get_block+0x7b/0xf0 [xfs]
[ 1596.317859] [<ffffffffa021b981>] xfs_btree_lookup+0xc1/0x580 [xfs]
[ 1596.322448] [<ffffffffa0205dcc>] ? xfs_allocbt_init_cursor+0x3c/0xc0 [xfs]
[ 1596.327478] [<ffffffffa0204290>] xfs_alloc_ag_vextent_near+0xb0/0x880 [xfs]
[ 1596.332841] [<ffffffffa0204b57>] xfs_alloc_ag_vextent+0xf7/0x130 [xfs]
[ 1596.338547] [<ffffffffa02056a2>] xfs_alloc_vextent+0x3b2/0x480 [xfs]
[ 1596.343706] [<ffffffffa021316f>] xfs_bmap_btalloc+0x3bf/0x710 [xfs]
[ 1596.348841] [<ffffffffa02134c9>] xfs_bmap_alloc+0x9/0x10 [xfs]
[ 1596.353988] [<ffffffffa0213eba>] xfs_bmapi_write+0x47a/0xa10 [xfs]
[ 1596.359255] [<ffffffffa02493cd>] xfs_iomap_write_allocate+0x16d/0x380 [xfs]
[ 1596.365138] [<ffffffffa02363ed>] xfs_map_blocks+0x13d/0x150 [xfs]
[ 1596.370046] [<ffffffffa02372db>] xfs_do_writepage+0x15b/0x500 [xfs]
[ 1596.375322] [<ffffffff8114d756>] write_cache_pages+0x1f6/0x490
[ 1596.380014] [<ffffffffa0237180>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
[ 1596.385220] [<ffffffffa0236fa6>] xfs_vm_writepages+0x66/0xa0 [xfs]
[ 1596.389823] [<ffffffff8114e8bc>] do_writepages+0x1c/0x30
[ 1596.393865] [<ffffffff811ed543>] __writeback_single_inode+0x33/0x170
[ 1596.398583] [<ffffffff811ede3e>] writeback_sb_inodes+0x2ce/0x570
[ 1596.403200] [<ffffffff811ee167>] __writeback_inodes_wb+0x87/0xc0
[ 1596.407955] [<ffffffff811ee38b>] wb_writeback+0x1eb/0x220
[ 1596.412037] [<ffffffff811eea2f>] wb_workfn+0x1df/0x2b0
[ 1596.416133] [<ffffffff8108b2c5>] process_one_work+0x1a5/0x400
[ 1596.420437] [<ffffffff8108b261>] ? process_one_work+0x141/0x400
[ 1596.424836] [<ffffffff8108b646>] worker_thread+0x126/0x490
[ 1596.428948] [<ffffffff8108b520>] ? process_one_work+0x400/0x400
[ 1596.433635] [<ffffffff8109196e>] kthread+0xee/0x110
[ 1596.437346] [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1596.441325] [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1599.581883] kworker/0:2 D ffff880036743878 0 3476 2 0x00000080
[ 1599.587099] Workqueue: events_freezable_power_ disk_events_workfn
[ 1599.591615] ffff880036743878 ffffffff81c0d540 ffff880039c02040 ffff880036744000
[ 1599.597112] ffff8800367438b0 ffff88003d610300 000000010013d1a9 ffff88003ffdf100
[ 1599.602613] ffff880036743890 ffffffff81660570 ffff88003d610300 ffff880036743938
[ 1599.608068] Call Trace:
[ 1599.610155] [<ffffffff81660570>] schedule+0x30/0x80
[ 1599.613996] [<ffffffff816645f7>] schedule_timeout+0x117/0x1c0
[ 1599.618285] [<ffffffff810bc5c6>] ? mark_held_locks+0x66/0x90
[ 1599.622537] [<ffffffff810df270>] ? init_timer_key+0x40/0x40
[ 1599.626721] [<ffffffff810e60f7>] ? ktime_get+0xa7/0x130
[ 1599.630666] [<ffffffff8165fab1>] io_schedule_timeout+0xa1/0x110
[ 1599.635108] [<ffffffff8116104d>] congestion_wait+0x7d/0xd0
[ 1599.639234] [<ffffffff810b6620>] ? wait_woken+0x80/0x80
[ 1599.643156] [<ffffffff8114a982>] __alloc_pages_nodemask+0xb42/0xd50
[ 1599.647774] [<ffffffff810bc500>] ? mark_lock+0x620/0x680
[ 1599.651785] [<ffffffff81193876>] alloc_pages_current+0x96/0x1b0
[ 1599.656235] [<ffffffff812e108d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 1599.660662] [<ffffffff812e2454>] bio_copy_kern+0xc4/0x180
[ 1599.664702] [<ffffffff812ed070>] blk_rq_map_kern+0x70/0x130
[ 1599.668864] [<ffffffff8144c4bd>] scsi_execute+0x12d/0x160
[ 1599.672950] [<ffffffff8144c5e4>] scsi_execute_req_flags+0x84/0xf0
[ 1599.677784] [<ffffffffa01e8762>] sr_check_events+0xb2/0x2a0 [sr_mod]
[ 1599.682744] [<ffffffffa01ce163>] cdrom_check_events+0x13/0x30 [cdrom]
[ 1599.687747] [<ffffffffa01e8ba5>] sr_block_check_events+0x25/0x30 [sr_mod]
[ 1599.692752] [<ffffffff812f874b>] disk_check_events+0x5b/0x150
[ 1599.697130] [<ffffffff812f8857>] disk_events_workfn+0x17/0x20
[ 1599.701783] [<ffffffff8108b2c5>] process_one_work+0x1a5/0x400
[ 1599.706347] [<ffffffff8108b261>] ? process_one_work+0x141/0x400
[ 1599.710809] [<ffffffff8108b646>] worker_thread+0x126/0x490
[ 1599.715005] [<ffffffff8108b520>] ? process_one_work+0x400/0x400
[ 1599.719427] [<ffffffff8109196e>] kthread+0xee/0x110
[ 1599.723220] [<ffffffff81665672>] ret_from_fork+0x22/0x50
[ 1599.727240] [<ffffffff81091880>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 1698.163933] 1 lock held by kswapd0/53:
[ 1698.166948] #0: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa024bcbb>] xfs_ilock+0x4b/0xe0 [xfs]
[ 1698.174361] 5 locks held by kworker/u128:1/70:
[ 1698.177849] #0: ("writeback"){.+.+.+}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.184626] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.191670] #2: (&type->s_umount_key#30){++++++}, at: [<ffffffff811c35d6>] trylock_super+0x16/0x50
[ 1698.198449] #3: (sb_internal){.+.+.?}, at: [<ffffffff811c35ac>] __sb_start_write+0xcc/0xe0
[ 1698.204743] #4: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffa024bcef>] xfs_ilock+0x7f/0xe0 [xfs]
(...snipped...)
[ 1698.222061] 2 locks held by kworker/0:2/3476:
[ 1698.225546] #0: ("events_freezable_power_efficient"){.+.+.+}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
[ 1698.233350] #1: ((&(&ev->dwork)->work)){+.+.+.}, at: [<ffffffff8108b261>] process_one_work+0x141/0x400
(...snipped...)
[ 1718.427909] Showing busy workqueues and worker pools:
[ 1718.432224] workqueue events: flags=0x0
[ 1718.435754] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[ 1718.440769] in-flight: 52:mptspi_dv_renegotiate_work [mptspi]
[ 1718.445766] pending: vmpressure_work_fn, cache_reap
[ 1718.450227] workqueue events_power_efficient: flags=0x80
[ 1718.454645] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1718.459663] pending: fb_flashcursor
[ 1718.463133] workqueue events_freezable_power_: flags=0x84
[ 1718.467620] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1718.472552] in-flight: 3476:disk_events_workfn
[ 1718.476643] workqueue writeback: flags=0x4e
[ 1718.480197] pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[ 1718.484977] in-flight: 70:wb_workfn
[ 1718.488671] workqueue vmstat: flags=0xc
[ 1718.492312] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 MAYDAY
[ 1718.497665] pending: vmstat_update
[ 1718.501304] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3451 3501
[ 1718.507471] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=15s workers=2 manager: 3490
[ 1718.513528] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 manager: 3471 idle: 6
[ 1745.495540] sysrq: SysRq : Show Memory
[ 1745.508581] Mem-Info:
[ 1745.516772] active_anon:182211 inactive_anon:12238 isolated_anon:0
[ 1745.516772] active_file:6978 inactive_file:19887 isolated_file:32
[ 1745.516772] unevictable:0 dirty:19697 writeback:214 unstable:0
[ 1745.516772] slab_reclaimable:2382 slab_unreclaimable:8786
[ 1745.516772] mapped:6820 shmem:12582 pagetables:1311 bounce:0
[ 1745.516772] free:1877 free_pcp:132 free_cma:0
[ 1745.563639] Node 0 DMA free:3868kB min:60kB low:72kB high:84kB active_anon:6184kB inactive_anon:1120kB active_file:644kB inactive_file:1784kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:1784kB writeback:0kB mapped:644kB shmem:1172kB slab_reclaimable:220kB slab_unreclaimable:660kB kernel_stack:496kB pagetables:252kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:15392 all_unreclaimable? yes
[ 1745.595872] lowmem_reserve[]: 0 953 953 953
[ 1745.599508] Node 0 DMA32 free:3640kB min:3780kB low:4752kB high:5724kB active_anon:722660kB inactive_anon:47832kB active_file:27268kB inactive_file:77764kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:1032064kB managed:980852kB mlocked:0kB dirty:77004kB writeback:856kB mapped:26636kB shmem:49156kB slab_reclaimable:9308kB slab_unreclaimable:34484kB kernel_stack:19760kB pagetables:4992kB unstable:0kB bounce:0kB free_pcp:528kB local_pcp:60kB free_cma:0kB writeback_tmp:0kB pages_scanned:1387692 all_unreclaimable? yes
[ 1745.633558] lowmem_reserve[]: 0 0 0 0
[ 1745.636871] Node 0 DMA: 25*4kB (UME) 9*8kB (UME) 7*16kB (UME) 2*32kB (ME) 3*64kB (ME) 4*128kB (UE) 3*256kB (UME) 4*512kB (UE) 0*1024kB 0*2048kB 0*4096kB = 3868kB
[ 1745.648828] Node 0 DMA32: 886*4kB (UE) 8*8kB (UM) 2*16kB (U) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3640kB
[ 1745.658179] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1745.664712] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1745.671127] 39477 total pagecache pages
[ 1745.674392] 0 pages in swap cache
[ 1745.677315] Swap cache stats: add 0, delete 0, find 0/0
[ 1745.681493] Free swap = 0kB
[ 1745.684113] Total swap = 0kB
[ 1745.686786] 262013 pages RAM
[ 1745.689386] 0 pages HighMem/MovableOnly
[ 1745.692883] 12824 pages reserved
[ 1745.695779] 0 pages cma reserved
[ 1745.698763] 0 pages hwpoisoned
[ 1746.841678] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 44s!
[ 1746.866634] Showing busy workqueues and worker pools:
[ 1746.881055] workqueue events: flags=0x0
[ 1746.887480] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[ 1746.894205] in-flight: 52:mptspi_dv_renegotiate_work [mptspi]
[ 1746.900892] pending: vmpressure_work_fn, cache_reap
[ 1746.906938] workqueue events_power_efficient: flags=0x80
[ 1746.912780] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 1746.917657] pending: fb_flashcursor
[ 1746.920983] workqueue events_freezable_power_: flags=0x84
[ 1746.925304] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 1746.930114] in-flight: 3476:disk_events_workfn
[ 1746.934076] workqueue writeback: flags=0x4e
[ 1746.937546] pwq 128: cpus=0-63 flags=0x4 nice=0 active=1/256
[ 1746.942258] in-flight: 70:wb_workfn
[ 1746.945978] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3451 3501
[ 1746.952268] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=44s workers=2 manager: 3490
[ 1746.958276] pool 128: cpus=0-63 flags=0x4 nice=0 hung=0s workers=3 manager: 3471 idle: 6
---------- console log end ----------

This is an OOM-livelocked situation where kswapd got stuck and
allocating tasks are sleeping at

/*
* If we didn't make any progress and have a lot of
* dirty + writeback pages then we should wait for
* an IO to complete to slow down the reclaim and
* prevent from pre mature OOM
*/
if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
return true;
}

in should_reclaim_retry(). Presumably out_of_memory() is called (I didn't
confirm it using kmallocwd), and this is a situation where we need to either
select the next OOM victim or fail !__GFP_FS && !__GFP_NOFAIL allocation requests.

But what I found strange is what should_reclaim_retry() is doing.

Michal Hocko wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f77e283fb8c6..b2de8c8761ad 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> */
> if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> ac->high_zoneidx, alloc_flags, available)) {
> - /* Wait for some write requests to complete then retry */
> - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> + unsigned long writeback;
> + unsigned long dirty;
> +
> + writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> + dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> +
> + /*
> + * If we didn't make any progress and have a lot of
> + * dirty + writeback pages then we should wait for
> + * an IO to complete to slow down the reclaim and
> + * prevent from pre mature OOM
> + */
> + if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + return true;
> + }

writeback and dirty are used only when did_some_progress == 0. Thus, we don't
need to calculate writeback and dirty using zone_page_state_snapshot() unless
did_some_progress == 0.
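
A minimal sketch of that restructuring, applied to the quoted hunk (illustrative
only, not the actual follow-up patch):

		if (!did_some_progress) {
			unsigned long writeback =
				zone_page_state_snapshot(zone, NR_WRITEBACK);
			unsigned long dirty =
				zone_page_state_snapshot(zone, NR_FILE_DIRTY);

			/*
			 * A lot of dirty + writeback pages and no reclaim
			 * progress: wait for an IO to complete to slow down
			 * the reclaim and prevent a premature OOM.
			 */
			if (2 * (writeback + dirty) > reclaimable) {
				congestion_wait(BLK_RW_ASYNC, HZ/10);
				return true;
			}
		}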

But, does it make sense to take writeback and dirty into account when
disk_events_workfn (trace shown above) is doing a GFP_NOIO allocation and
wb_workfn (trace shown above) is doing (presumably) a GFP_NOFS allocation?
Shouldn't we use different thresholds for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

> +
> + /*
> + * Memory allocation/reclaim might be called from a WQ
> + * context and the current implementation of the WQ
> + * concurrency control doesn't recognize that
> + * a particular WQ is congested if the worker thread is
> + * looping without ever sleeping. Therefore we have to
> + * do a short sleep here rather than calling
> + * cond_resched().
> + */
> + if (current->flags & PF_WQ_WORKER)
> + schedule_timeout(1);

This schedule_timeout(1) does not actually sleep because the task state is
still TASK_RUNNING at this point. You lost the fix as of next-20160317.
Please update. (A sketch of a variant that really sleeps follows the quoted
hunk below.)

> + else
> + cond_resched();
> +
> return true;
> }
> }
> --
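
For reference, a sketch of the quoted branch in a form that actually sleeps
(schedule_timeout() returns immediately while the task is still TASK_RUNNING,
so the state has to be set first, e.g. via the _uninterruptible helper):

			if (current->flags & PF_WQ_WORKER)
				/*
				 * Sets TASK_UNINTERRUPTIBLE before scheduling,
				 * so the worker really sleeps for a jiffy
				 * instead of returning immediately.
				 */
				schedule_timeout_uninterruptible(1);
			else
				cond_resched();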

2016-03-17 12:01:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

On Thu 17-03-16 20:35:23, Tetsuo Handa wrote:
[...]
> But what I felt strange is what should_reclaim_retry() is doing.
>
> Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f77e283fb8c6..b2de8c8761ad 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > */
> > if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > ac->high_zoneidx, alloc_flags, available)) {
> > - /* Wait for some write requests to complete then retry */
> > - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > + unsigned long writeback;
> > + unsigned long dirty;
> > +
> > + writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> > + dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> > +
> > + /*
> > + * If we didn't make any progress and have a lot of
> > + * dirty + writeback pages then we should wait for
> > + * an IO to complete to slow down the reclaim and
> > + * prevent from pre mature OOM
> > + */
> > + if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + return true;
> > + }
>
> writeback and dirty are used only when did_some_progress == 0. Thus, we don't
> need to calculate writeback and dirty using zone_page_state_snapshot() unless
> did_some_progress == 0.

OK, I will move this into if !did_some_progress.

> But, does it make sense to take writeback and dirty into account when
> disk_events_workfn (trace shown above) is doing GFP_NOIO allocation and
> wb_workfn (trace shown above) is doing (presumably) GFP_NOFS allocation?
> Shouldn't we use different threshold for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

I have considered skipping the throttling part for GFP_NOFS/GFP_NOIO
previously, but I couldn't convince myself it would make any difference. We
know there was no progress in the reclaim, and even if the current context is
potentially doing an FS/IO allocation, it obviously cannot get its memory, so
it cannot proceed. So now we are in a state where we either busy loop or sleep
for a while. So I ended up not complicating the code even more. If you have a
use case where busy waiting makes a difference then I would vote for a separate
patch with a clear description.
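
For illustration, the gfp-aware variant being discussed would amount to
something like the following sketch (not a posted patch; it simply skips the
IO throttle for contexts which may not enter the FS layers themselves):

		/*
		 * Sketch: GFP_NOFS/GFP_NOIO contexts do not throttle on
		 * dirty/writeback pages they cannot write back themselves.
		 */
		if (!did_some_progress && (gfp_mask & __GFP_FS) &&
		    2 * (writeback + dirty) > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;
		}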

> > +
> > + /*
> > + * Memory allocation/reclaim might be called from a WQ
> > + * context and the current implementation of the WQ
> > + * concurrency control doesn't recognize that
> > + * a particular WQ is congested if the worker thread is
> > + * looping without ever sleeping. Therefore we have to
> > + * do a short sleep here rather than calling
> > + * cond_resched().
> > + */
> > + if (current->flags & PF_WQ_WORKER)
> > + schedule_timeout(1);
>
> This schedule_timeout(1) does not sleep. You lost the fix as of next-20160317.
> Please update.

Yeah, I have that updated in my local patch already.

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-04 08:23:58

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, oom: rework oom detection

On Tue, Dec 15, 2015 at 07:19:44PM +0100, Michal Hocko wrote:
...
> @@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> &nr_soft_scanned);
> sc->nr_reclaimed += nr_soft_reclaimed;
> sc->nr_scanned += nr_soft_scanned;
> - if (nr_soft_reclaimed)
> - reclaimable = true;
> /* need some check for avoid more shrink_zone() */
> }
>
> - if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
> - reclaimable = true;
> -
> - if (global_reclaim(sc) &&
> - !reclaimable && zone_reclaimable(zone))
> - reclaimable = true;
> + shrink_zone(zone, sc, zone_idx(zone));

Shouldn't it be

shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);

?

> }
>
> /*

2016-04-04 09:42:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/3] mm, oom: rework oom detection

On Mon 04-04-16 11:23:43, Vladimir Davydov wrote:
> On Tue, Dec 15, 2015 at 07:19:44PM +0100, Michal Hocko wrote:
> ...
> > @@ -2592,17 +2589,10 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > &nr_soft_scanned);
> > sc->nr_reclaimed += nr_soft_reclaimed;
> > sc->nr_scanned += nr_soft_scanned;
> > - if (nr_soft_reclaimed)
> > - reclaimable = true;
> > /* need some check for avoid more shrink_zone() */
> > }
> >
> > - if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
> > - reclaimable = true;
> > -
> > - if (global_reclaim(sc) &&
> > - !reclaimable && zone_reclaimable(zone))
> > - reclaimable = true;
> > + shrink_zone(zone, sc, zone_idx(zone));
>
> Shouldn't it be
>
> shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
>
> ?

I cannot remember the reason why I removed it, so it was most likely
unintentional. Thanks for catching this. I will fold it into the original
patch before I repost the full series (hopefully this week).
--
Michal Hocko
SUSE Labs