LinuxLists.cc - [PATCH mmotm] vmscan: raise the bar to PAGEOUT_IO

2010-08-01 08:52:07

Subject: [PATCH mmotm] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

Invoking any command locked up for minutes (note that I'm
talking about attempted additional I/O to the _other_,
_unaffected_ main system HDD - such as loading some shell
binaries -, NOT the external SSD18M!!).

This happens when both of the following conditions are met:
- under memory pressure
- writing heavily to a slow device

OOM also happens in Andreas' system. The OOM trace shows that 3
processes are stuck in wait_on_page_writeback() in the direct reclaim
path. One in do_fork() and the other two in unix_stream_sendmsg(). They
are blocked on this condition:

(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, the scan window at that priority is
512MB/1024 = 512KB. If the direct reclaim for the order-1 fork()
allocation runs into a 512KB range of hard-to-reclaim LRU pages, it
will stall.
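
The 512KB figure is the LRU size shifted right by the scan priority. A minimal sketch of that arithmetic in user-space C follows; the helper name scan_window_bytes() is hypothetical, and DEF_PRIORITY = 12 is the value quoted later in this thread.

```c
#include <stdint.h>

#define DEF_PRIORITY 12  /* value as stated later in this thread */

/*
 * Hypothetical helper (not a kernel function): bytes of LRU covered by
 * one scan pass at the given priority, modelled as lru_bytes >> priority.
 */
static uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
    return lru_bytes >> priority;
}
```

Calling scan_window_bytes(512ULL << 20, DEF_PRIORITY - 2) models the case above: 512MB shifted right by 10 bits.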

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage. vmscan priority
can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even
if 50% of the system's pages are globally reclaimable, there is still a
good chance of hitting hard-to-reclaim ranges 0.1% in size. For example,
a simple dd can easily create a big range (up to 20%) of dirty pages in
the LRU lists. And order-1 to order-3 allocations are very common
with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of
high-order slab caches. For example, the order-1 radix_tree_node slab
cache may stall applications at swap-in time; the order-3 inode cache on
most filesystems may stall applications when trying to read some file;
the order-2 proc_inode_cache may stall applications when trying to open
a /proc file.

Secondly, once triggered, it will stall unrelated processes (ones not
doing IO at all) in the system. This "one slow USB device stalls the
whole system" avalanche effect is very bad.

Thirdly, once stalled, the stall time could be intolerably long for the
users. When there are 20MB of queued writeback pages and USB 1.1 is
writing them at 1MB/s, wait_on_page_writeback() will be stuck for up to
20 seconds. Not to mention that it may be called multiple times.

So raise the bar to only enable PAGEOUT_IO_SYNC when priority drops to
DEF_PRIORITY/3 or below, i.e. when the scan window reaches 6.25% of the
LRU size. As the default dirty throttle ratio is 20%, it will hardly be
triggered by pure dirty pages. We'd better treat PAGEOUT_IO_SYNC as a
last-resort workaround -- its stall time is so uncomfortably long
(easily goes beyond 1s).
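
The 6.25% figure is 100 / 2^(DEF_PRIORITY/3) = 100 / 16. A rough sketch of that calculation, assuming DEF_PRIORITY = 12 as quoted later in this thread; scan_window_percent() is a hypothetical name used only for illustration.

```c
#define DEF_PRIORITY 12  /* value as stated later in this thread */

/*
 * Hypothetical helper (not a kernel function): percentage of the LRU
 * covered by one scan pass at the given priority, i.e. 100 / 2^priority.
 */
static double scan_window_percent(int priority)
{
    return 100.0 / (double)(1UL << priority);
}
```

With these assumptions, priority DEF_PRIORITY / 3 gives the 6.25% window of the new threshold, while priority DEF_PRIORITY - 2 gives roughly the 0.1% window mentioned for the old one.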

The bar is only raised for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy even in 1TB memory boxes. So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.
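
To make the two thresholds concrete, here is a user-space model of the decision this patch adds. It is only a sketch: struct scan_control_model and the is_kswapd parameter are hypothetical stand-ins for the kernel's struct scan_control and current_is_kswapd(), and PAGE_ALLOC_COSTLY_ORDER = 3 is assumed rather than taken from this thread.

```c
#include <stdbool.h>

#define DEF_PRIORITY            12  /* as stated later in this thread */
#define PAGE_ALLOC_COSTLY_ORDER 3   /* assumed value, not from this thread */

/* Hypothetical stand-in for the two scan_control fields the check uses. */
struct scan_control_model {
    int  order;
    bool lumpy_reclaim_mode;
};

/* Model of the stall decision; is_kswapd replaces current_is_kswapd(). */
static bool should_reclaim_stall_model(unsigned long nr_taken,
                                       unsigned long nr_freed,
                                       int priority,
                                       bool is_kswapd,
                                       const struct scan_control_model *sc)
{
    int lumpy_stall_priority;

    if (is_kswapd)                 /* kswapd never stalls on sync IO */
        return false;
    if (!sc->lumpy_reclaim_mode)   /* only lumpy reclaim may stall */
        return false;
    if (nr_freed == nr_taken)      /* everything reclaimed: no stall */
        return false;

    /* Costly orders may stall at any priority; others only at <= 4. */
    if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
        lumpy_stall_priority = DEF_PRIORITY;
    else
        lumpy_stall_priority = DEF_PRIORITY / 3;

    return priority <= lumpy_stall_priority;
}
```

In this model an order-1 request at priority 9, which the old condition would have stalled, returns false; the same request only returns true once priority has dropped to 4 or lower.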

Andreas tested an older version of this patch and reported that it
mostly fixed his problem. Mel Gorman helped improve it and KOSAKI
Motohiro will fix it further in the next patch.

Reported-by: Andreas Mohr <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
mm/vmscan.c | 51 ++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 43 insertions(+), 8 deletions(-)

--- mmotm.orig/mm/vmscan.c 2010-07-20 11:21:08.000000000 +0800
+++ mmotm/mm/vmscan.c 2010-08-01 16:47:52.000000000 +0800
@@ -1232,6 +1232,47 @@ static noinline_for_stack void update_is
}

/*
+ * Returns true if the caller should wait to clean dirty/writeback pages.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for writeback IO to complete.
+ * This will stall high-order allocations noticeably. Only do that when really
+ * need to free the pages under high memory pressure.
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+ unsigned long nr_freed,
+ int priority,
+ struct scan_control *sc)
+{
+ int lumpy_stall_priority;
+
+ /* kswapd should not stall on sync IO */
+ if (current_is_kswapd())
+ return false;
+
+ /* Only stall on lumpy reclaim */
+ if (!sc->lumpy_reclaim_mode)
+ return false;
+
+ /* If we have relaimed everything on the isolated list, no stall */
+ if (nr_freed == nr_taken)
+ return false;
+
+ /*
+ * For high-order allocations, there are two stall thresholds.
+ * High-cost allocations stall immediately where as lower
+ * order allocations such as stacks require the scanning
+ * priority to be much higher before stalling.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ lumpy_stall_priority = DEF_PRIORITY;
+ else
+ lumpy_stall_priority = DEF_PRIORITY / 3;
+
+ return priority <= lumpy_stall_priority;
+}
+
+/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
@@ -1296,14 +1337,8 @@ shrink_inactive_list(unsigned long nr_to

nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);

- /*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
- */
- if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
- sc->lumpy_reclaim_mode) {
+ /* Check if we should syncronously wait for writeback */
+ if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*

2010-08-01 08:55:15

by KOSAKI Motohiro

Subject: Re: [PATCH mmotm] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

> Reported-by: Andreas Mohr <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>

2010-08-01 09:12:54

by KOSAKI Motohiro

Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

rebased onto Wu's patch

----------------------------------------------
From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Sun, 1 Aug 2010 17:23:41 +0900
Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

congestion_wait() means "wait until the number of requests in the IO queue
drops below the congestion threshold".
That said, if the system has plenty of dirty pages, the flusher thread pushes
new requests to the IO queue continuously. So the IO queue does not clear its
congestion status for a long time. Thus, congestion_wait(HZ/10) is
almost equivalent to schedule_timeout(HZ/10).

If the system has 512MB memory, DEF_PRIORITY means a 128kB scan, and it takes 4096
shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) of memory.
4096 times a 0.1sec stall makes an insanely long stall. That shouldn't happen.

On the other hand, this synchronous lumpy reclaim doesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) ends up
calling wait_on_page_writeback(), which provides sufficient waiting.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Wu Fengguang <[email protected]>

---
mm/vmscan.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 972c8f0..c5e673e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1339,8 +1339,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.6.5.2

2010-08-01 10:44:46

by Fengguang Wu

Subject: Re: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

> If the system has 512MB memory, DEF_PRIORITY means a 128kB scan, and it takes 4096
> shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) of memory.

Err, you must have forgotten the page size.

128kB means 128kB/4kB=32 pages, which fit exactly into one
SWAP_CLUSTER_MAX batch. The number of shrink_page_list() calls
has nothing to do with DEF_PRIORITY.
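
The correction can be checked with a few lines of user-space C. This is only a sketch: scan_batches() is a hypothetical helper, the 4kB page size is the one assumed in the message above, and SWAP_CLUSTER_MAX = 32 is the batch size quoted in this thread.

```c
#define SWAP_CLUSTER_MAX 32UL  /* batch size quoted in this thread */

/*
 * Hypothetical helper (not a kernel function): number of
 * SWAP_CLUSTER_MAX-page batches needed to cover scan_bytes, rounding up.
 * page_bytes must be non-zero.
 */
static unsigned long scan_batches(unsigned long scan_bytes,
                                  unsigned long page_bytes)
{
    unsigned long pages = scan_bytes / page_bytes;

    return (pages + SWAP_CLUSTER_MAX - 1) / SWAP_CLUSTER_MAX;
}
```

For a 128kB scan with 4kB pages, scan_batches(128UL * 1024, 4096) models the case discussed here: 32 pages, which is a single batch rather than 4096 calls.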

Thanks,
Fengguang

2010-08-01 10:51:11

by KOSAKI Motohiro

Subject: Re: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

> > If the system has 512MB memory, DEF_PRIORITY means a 128kB scan, and it takes 4096
> > shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) of memory.
>
> Err, you must have forgotten the page size.

Page size? DEF_PRIORITY is 12.

512MB >> DEF_PRIORITY
= 512MB / 4096
= 128kB

A 128kB scan means 4096 calls to shrink_list(), because one shrink_list()
call scans SWAP_CLUSTER_MAX (i.e. 32).

>
> 128kB means 128kB/4kB=32 pages, which fit exactly into one
> SWAP_CLUSTER_MAX batch. The number of shrink_page_list() calls
> has nothing to do with DEF_PRIORITY.

Umm.. I don't follow this point.

2010-08-01 13:41:32

by Minchan Kim

Subject: Re: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

Hi KOSAKI,

On Sun, Aug 01, 2010 at 06:12:47PM +0900, KOSAKI Motohiro wrote:
> rebased onto Wu's patch
>
> ----------------------------------------------
> From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
> From: KOSAKI Motohiro <[email protected]>
> Date: Sun, 1 Aug 2010 17:23:41 +0900
> Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()
>
> congestion_wait() means "wait until the number of requests in the IO queue
> drops below the congestion threshold".
> That said, if the system has plenty of dirty pages, the flusher thread pushes
> new requests to the IO queue continuously. So the IO queue does not clear its
> congestion status for a long time. Thus, congestion_wait(HZ/10) is
> almost equivalent to schedule_timeout(HZ/10).
Just a nitpick.
Why is it a problem?
HZ/10 is the upper bound we intended. If it is rather high, we can lower it.
But I totally agree with this patch. It would be better to remove it
than to lower it.

>
> If the system has 512MB memory, DEF_PRIORITY means a 128kB scan, and it takes 4096
> shrink_page_list() calls to scan 128kB (i.e. 128kB/32=4096) of memory.
> 4096 times a 0.1sec stall makes an insanely long stall. That shouldn't happen.

128K / (4K * SWAP_CLUSTER_MAX) = 1

>
> On the other hand, this synchronous lumpy reclaim doesn't need this
> congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) ends up
> calling wait_on_page_writeback(), which provides sufficient waiting.

I absolutely agree with you.

>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Wu Fengguang <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-08-02 04:13:29

by KOSAKI Motohiro

Subject: Re: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

> Hi KOSAKI,
>
> On Sun, Aug 01, 2010 at 06:12:47PM +0900, KOSAKI Motohiro wrote:
> > rebased onto Wu's patch
> >
> > ----------------------------------------------
> > From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
> > From: KOSAKI Motohiro <[email protected]>
> > Date: Sun, 1 Aug 2010 17:23:41 +0900
> > Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()
> >
> > congestion_wait() means "wait until the number of requests in the IO queue
> > drops below the congestion threshold".
> > That said, if the system has plenty of dirty pages, the flusher thread pushes
> > new requests to the IO queue continuously. So the IO queue does not clear its
> > congestion status for a long time. Thus, congestion_wait(HZ/10) is
> > almost equivalent to schedule_timeout(HZ/10).
> Just a nitpick.
> Why is it a problem?
> HZ/10 is the upper bound we intended. If it is rather high, we can lower it.
> But I totally agree with this patch. It would be better to remove it
> than to lower it.

Because all _unnecessary_ sleep is evil. The problem is, congestion_wait()
means "wait until the queue congestion is cleared, iow, wait for all of the IO",
but we want to wait only until _my_ IO has finished.

So, if the flusher thread continuously pushes new IO into the queue, that makes
a big difference.

Thanks.

2010-08-02 04:38:32

by Minchan Kim

Subject: Re: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()

On Mon, Aug 2, 2010 at 1:13 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> Hi KOSAKI,
>>
>> On Sun, Aug 01, 2010 at 06:12:47PM +0900, KOSAKI Motohiro wrote:
>> > rebased onto Wu's patch
>> >
>> > ----------------------------------------------
>> > From 35772ad03e202c1c9a2252de3a9d3715e30d180f Mon Sep 17 00:00:00 2001
>> > From: KOSAKI Motohiro <[email protected]>
>> > Date: Sun, 1 Aug 2010 17:23:41 +0900
>> > Subject: [PATCH] vmscan: synchronous lumpy reclaim don't call congestion_wait()
>> >
>> > congestion_wait() means "wait until the number of requests in the IO queue
>> > drops below the congestion threshold".
>> > That said, if the system has plenty of dirty pages, the flusher thread pushes
>> > new requests to the IO queue continuously. So the IO queue does not clear its
>> > congestion status for a long time. Thus, congestion_wait(HZ/10) is
>> > almost equivalent to schedule_timeout(HZ/10).
>> Just a nitpick.
>> Why is it a problem?
>> HZ/10 is the upper bound we intended. If it is rather high, we can lower it.
>> But I totally agree with this patch. It would be better to remove it
>> than to lower it.
>
> Because all _unnecessary_ sleep is evil. The problem is, congestion_wait()
> means "wait until the queue congestion is cleared, iow, wait for all of the IO",
> but we want to wait only until _my_ IO has finished.
>
> So, if the flusher thread continuously pushes new IO into the queue, that makes
> a big difference.
>

Agreed. Please include this explanation in the description to make it
clearer if you resend this patch.
Thanks

--
Kind regards,
Minchan Kim