2010-08-05 06:11:20

by KOSAKI Motohiro

Subject: [RFC][PATCH 0/7] low latency synchronous lumpy reclaim


If slow USB storage is connected and plenty of I/O is running, lumpy
reclaim often stalls in shrink_inactive_list(). This patch series tries
to solve this issue.
At least, it works fine on my desktop and USB stick environment :-)

This series is still an RFC. Comments, reviews and testing are welcome!




Wu Fengguang (1):
vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

KOSAKI Motohiro (6):
vmscan: synchronous lumpy reclaim don't call congestion_wait()
vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()
vmscan: narrowing synchronous lumpy reclaim condition
vmscan: kill dead code in shrink_inactive_list()
vmscan: remove PF_SWAPWRITE from __zone_reclaim()
vmscan: isolate_lru_pages() stop neighbor search if neighbor can't be isolated

mm/vmscan.c | 211 ++++++++++++++++++++++++++++++++++++++--------------------
1 files changed, 138 insertions(+), 73 deletions(-)



2010-08-05 06:12:29

by KOSAKI Motohiro

Subject: [PATCH 1/7] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

From: Wu Fengguang <[email protected]>

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

Invoking any command locked up for minutes (note that I'm
talking about attempted additional I/O to the _other_,
_unaffected_ main system HDD - such as loading some shell
binaries -, NOT the external SSD18M!!).

This happens when the following two conditions are both met:
- under memory pressure
- writing heavily to a slow device

OOM also happens in Andreas' system. The OOM trace shows that 3
processes are stuck in wait_on_page_writeback() in the direct reclaim
path. One in do_fork() and the other two in unix_stream_sendmsg(). They
are blocked on this condition:

(sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, the scan window is 512MB/1024 = 512KB. If the
direct reclaim for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage. vmscan priority
can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even
if the system has 50% globally reclaimable pages, it still has a good
chance of containing 0.1%-sized hard-to-reclaim ranges. For example, a
simple dd can easily create a big range (up to 20%) of dirty pages in
the LRU lists. And order-1 to order-3 allocations are more than common
with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high
order slab caches. For example, the order-1 radix_tree_node slab cache
may stall applications at swap-in time; the order-3 inode cache on most
filesystems may stall applications when trying to read some file; the
order-2 proc_inode_cache may stall applications when trying to open a
/proc file.

Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system. This "one slow USB device stalls the whole system"
avalanching effect is very bad.

Thirdly, once stalled, the stall time can be intolerably long for the
user. When there are 20MB of queued writeback pages and a USB 1.1 device is
writing them at 1MB/s, wait_on_page_writeback() can be stuck for up to 20
seconds. Not to mention it may be called multiple times.

So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
DEF_PRIORITY/3, or 6.25% of the LRU size. As the default dirty throttle ratio
is 20%, it will hardly be triggered by pure dirty pages. We'd better treat
PAGEOUT_IO_SYNC as a last-resort workaround -- its stall time is so
uncomfortably long (easily goes beyond 1s).

The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy in 1TB memory boxes. So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.
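
To make the arithmetic above concrete, here is a quick userspace sketch
(illustrative only, not part of the patch), assuming the reclaim scanner's
usual window of lru_size >> priority and DEF_PRIORITY = 12:

#include <stdio.h>

#define DEF_PRIORITY	12	/* assumed default reclaim priority */

int main(void)
{
	unsigned long lru_kb = 512UL * 1024;	/* a 512MB LRU, as in Andreas' report */
	int old_bar = DEF_PRIORITY - 2;		/* old PAGEOUT_IO_SYNC threshold */
	int new_bar = DEF_PRIORITY / 3;		/* threshold after this patch */

	/* the scanner looks at roughly lru_size >> priority pages per pass */
	printf("priority %2d: window %lu KB (1/%lu of the LRU)\n",
	       old_bar, lru_kb >> old_bar, 1UL << old_bar);
	printf("priority %2d: window %lu KB (1/%lu of the LRU, 6.25%%)\n",
	       new_bar, lru_kb >> new_bar, 1UL << new_bar);
	return 0;
}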

Andreas tested an older version of this patch and reported that it
mostly fixed his problem. Mel Gorman helped improve it and KOSAKI
Motohiro will fix it further in the next patch.

Reported-by: Andreas Mohr <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 52 +++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1b145e6..cf51d62 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1234,6 +1234,47 @@ static noinline_for_stack void update_isolated_counts(struct zone *zone,
}

/*
+ * Returns true if the caller should wait to clean dirty/writeback pages.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for writeback IO to complete.
+ * This will stall high-order allocations noticeably. Only do that when really
+ * need to free the pages under high memory pressure.
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+ unsigned long nr_freed,
+ int priority,
+ struct scan_control *sc)
+{
+ int lumpy_stall_priority;
+
+ /* kswapd should not stall on sync IO */
+ if (current_is_kswapd())
+ return false;
+
+ /* Only stall on lumpy reclaim */
+ if (!sc->lumpy_reclaim_mode)
+ return false;
+
+ /* If we have relaimed everything on the isolated list, no stall */
+ if (nr_freed == nr_taken)
+ return false;
+
+ /*
+ * For high-order allocations, there are two stall thresholds.
+ * High-cost allocations stall immediately where as lower
+ * order allocations such as stacks require the scanning
+ * priority to be much higher before stalling.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ lumpy_stall_priority = DEF_PRIORITY;
+ else
+ lumpy_stall_priority = DEF_PRIORITY / 3;
+
+ return priority <= lumpy_stall_priority;
+}
+
+/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
@@ -1298,16 +1339,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);

- /*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
- */
- if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
- sc->lumpy_reclaim_mode) {
+ /* Check if we should syncronously wait for writeback */
+ if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
-
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.6.5.2


2010-08-05 06:13:11

by KOSAKI Motohiro

Subject: [PATCH 2/7] vmscan: synchronous lumpy reclaim don't call congestion_wait()

congestion_wait() means "wait until the queue's congestion is cleared".
That said, if the system has plenty of dirty pages and the flusher thread
pushes new requests to the IO queue continuously, the IO queue does not
leave the congested state for a long time. Thus, congestion_wait(HZ/10)
becomes almost equivalent to schedule_timeout(HZ/10).

However, synchronous lumpy reclaim doesn't need this
congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) uses
wait_on_page_writeback(), which provides sufficient waiting.
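
As an aside, a tiny userspace model of the behaviour described above
(illustrative only; the real congestion_wait() lives elsewhere in the
kernel): while the flusher keeps the queue congested for the whole
interval, the wait degenerates into a plain fixed-length sleep.

#include <stdbool.h>
#include <stdio.h>

#define HZ 100

/* Model: return early only if the queue stops being congested, otherwise
 * sleep for the full timeout (counted in "ticks"). */
static long model_congestion_wait(bool (*congested)(long tick), long timeout)
{
	for (long tick = 0; tick < timeout; tick++) {
		if (!congested(tick))
			return timeout - tick;	/* woke up early */
	}
	return 0;				/* full sleep, like schedule_timeout() */
}

/* The flusher keeps pushing requests, so the queue never clears. */
static bool always_congested(long tick)
{
	(void)tick;
	return true;
}

int main(void)
{
	printf("ticks left at wakeup: %ld (0 == a full HZ/10 sleep)\n",
	       model_congestion_wait(always_congested, HZ / 10));
	return 0;
}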

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf51d62..1cdc3db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1341,7 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
/*
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
--
1.6.5.2


2010-08-05 06:13:47

by KOSAKI Motohiro

Subject: [PATCH 3/7] vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()

During synchronous lumpy reclaim, there is no reason to give up
reclaiming a page just because it is locked. Use lock_page() instead of
trylock_page() in this case.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1cdc3db..833b6ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
page = lru_to_page(page_list);
list_del(&page->lru);

- if (!trylock_page(page))
+ if (sync_writeback == PAGEOUT_IO_SYNC)
+ lock_page(page);
+ else if (!trylock_page(page))
goto keep;

VM_BUG_ON(PageActive(page));
--
1.6.5.2


2010-08-05 06:14:21

by KOSAKI Motohiro

Subject: [PATCH 4/7] vmscan: narrowing synchronous lumpy reclaim condition

Currently, shrink_page_list() gives up reclaiming a page for the following
10 reasons.

1. trylock_page() failure
2. page is unevictable
3. zone reclaim and page is mapped
4. PageWriteback() is true
5. page is swapbacked and swap is full
6. add_to_swap() failure
7. page is dirty and gfpmask doesn't have GFP_IO or GFP_FS
8. page is pinned
9. IO queue is congested
10. pageout() started IO, but it has not finished

During lumpy reclaim, every one of these failures triggers synchronous lumpy
reclaim, but that is slightly stupid. At least cases (2), (3), (5), (6), (7)
and (8) are not worth retrying. This patch therefore disables lumpy reclaim
when such a failure is encountered during reclaim processing.

Case (9) is more interesting. The current behaviour is:
1. start shrink_page_list(async)
2. find queue_congested()
3. skip the pageout write
4. start shrink_page_list(sync) anyway
5. wait on a lot of pages
6. again find queue_congested()
7. give up the pageout write again

So it is a meaningless waste of time. However, just skipping the pageout does
not seem a good idea for high-order lumpy reclaim either, because, for
example, an x86 hugepage needs 512 contiguous pages, which often contain more
dirty pages than the queue congestion threshold (~=128).

After this patch, pageout() treats queue congestion as follows:

- If order > PAGE_ALLOC_COSTLY_ORDER
  always ignore queue congestion.
- If order <= PAGE_ALLOC_COSTLY_ORDER
  skip writing the page and disable lumpy reclaim.
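
A small userspace model of the two cases above (illustrative only; the
real decision is made by may_write_to_queue()/pageout() in the diff
below):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Model of the post-patch policy: only hugepage-sized lumpy reclaim keeps
 * writing when the queue is congested. */
static bool write_despite_congestion(int order)
{
	return order > PAGE_ALLOC_COSTLY_ORDER;
}

int main(void)
{
	for (int order = 0; order <= 9; order++)
		printf("order %d: %s\n", order,
		       write_despite_congestion(order) ?
		       "ignore congestion, write the page" :
		       "skip the write, disable lumpy reclaim");
	return 0;
}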

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 122 ++++++++++++++++++++++++++++++++++++----------------------
1 files changed, 76 insertions(+), 46 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 833b6ad..f7aabd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/vmscan.h>

+enum lumpy_mode {
+ LUMPY_MODE_NONE,
+ LUMPY_MODE_ASYNC,
+ LUMPY_MODE_SYNC,
+};
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -82,7 +88,7 @@ struct scan_control {
* Intend to reclaim enough contenious memory rather than to reclaim
* enough amount memory. I.e, it's the mode for high order allocation.
*/
- bool lumpy_reclaim_mode;
+ enum lumpy_mode lumpy_reclaim_mode;

/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
@@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
return ret;
}

+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+ bool sync)
+{
+ enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
+
+ /*
+ * Some reclaim have alredy been failed. No worth to try synchronous
+ * lumpy reclaim.
+ */
+ if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
+ return;
+
+ /*
+ * If we need a large contiguous chunk of memory, or have
+ * trouble getting a small set of contiguous pages, we
+ * will reclaim both active and inactive pages.
+ */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ sc->lumpy_reclaim_mode = mode;
+ else if (sc->order && priority < DEF_PRIORITY - 2)
+ sc->lumpy_reclaim_mode = mode;
+ else
+ sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
+static void disable_lumpy_reclaim_mode(struct scan_control *sc)
+{
+ sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
+}
+
static inline int is_page_cache_freeable(struct page *page)
{
/*
@@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
return page_count(page) - page_has_private(page) == 2;
}

-static int may_write_to_queue(struct backing_dev_info *bdi)
+static int may_write_to_queue(struct backing_dev_info *bdi,
+ struct scan_control *sc)
{
if (current->flags & PF_SWAPWRITE)
return 1;
@@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
return 1;
if (bdi == current->backing_dev_info)
return 1;
+
+ /* lumpy reclaim for hugepage often need a lot of write */
+ if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+ return 1;
return 0;
}

@@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}

-/* Request for sync pageout. */
-enum pageout_io {
- PAGEOUT_IO_ASYNC,
- PAGEOUT_IO_SYNC,
-};
-
/* possible outcome of pageout() */
typedef enum {
/* failed to write page out, page is locked */
@@ -330,7 +365,7 @@ typedef enum {
* Calls ->writepage().
*/
static pageout_t pageout(struct page *page, struct address_space *mapping,
- enum pageout_io sync_writeback)
+ struct scan_control *sc)
{
/*
* If the page is dirty, only perform writeback if that write
@@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
- if (!may_write_to_queue(mapping->backing_dev_info))
+ if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
+ disable_lumpy_reclaim_mode(sc);
return PAGE_KEEP;
+ }

if (clear_page_dirty_for_io(page)) {
int res;
@@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* direct reclaiming a large contiguous area and the
* first attempt to free a range of pages fails.
*/
- if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+ if (PageWriteback(page) &&
+ sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
wait_on_page_writeback(page);

if (!PageWriteback(page)) {
@@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
ClearPageReclaim(page);
}
trace_mm_vmscan_writepage(page,
- sync_writeback == PAGEOUT_IO_SYNC);
+ sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC);
inc_zone_page_state(page, NR_VMSCAN_WRITE);
return PAGE_SUCCESS;
}
@@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
referenced_page = TestClearPageReferenced(page);

/* Lumpy reclaim - ignore references */
- if (sc->lumpy_reclaim_mode)
+ if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
return PAGEREF_RECLAIM;

/*
@@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
- struct scan_control *sc,
- enum pageout_io sync_writeback)
+ struct scan_control *sc)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -665,7 +702,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
page = lru_to_page(page_list);
list_del(&page->lru);

- if (sync_writeback == PAGEOUT_IO_SYNC)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
lock_page(page);
else if (!trylock_page(page))
goto keep;
@@ -696,10 +733,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* for any page for which writeback has already
* started.
*/
- if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
+ may_enter_fs)
wait_on_page_writeback(page);
- else
- goto keep_locked;
+ else {
+ unlock_page(page);
+ goto keep_lumpy;
+ }
}

references = page_check_references(page, sc);
@@ -753,14 +793,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;

/* Page is dirty, try to write it out here */
- switch (pageout(page, mapping, sync_writeback)) {
+ switch (pageout(page, mapping, sc)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- if (PageWriteback(page) || PageDirty(page))
+ if (PageWriteback(page))
+ goto keep_lumpy;
+ if (PageDirty(page))
goto keep;
+
/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
@@ -843,6 +886,7 @@ cull_mlocked:
try_to_free_swap(page);
unlock_page(page);
putback_lru_page(page);
+ disable_lumpy_reclaim_mode(sc);
continue;

activate_locked:
@@ -855,6 +899,8 @@ activate_locked:
keep_locked:
unlock_page(page);
keep:
+ disable_lumpy_reclaim_mode(sc);
+keep_lumpy:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
@@ -1255,7 +1301,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
return false;

/* Only stall on lumpy reclaim */
- if (!sc->lumpy_reclaim_mode)
+ if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
return false;

/* If we have relaimed everything on the isolated list, no stall */
@@ -1300,15 +1346,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
return SWAP_CLUSTER_MAX;
}

-
+ set_lumpy_reclaim_mode(priority, sc, false);
lru_add_drain();
spin_lock_irq(&zone->lru_lock);

if (scanning_global_lru(sc)) {
nr_taken = isolate_pages_global(nr_to_scan,
&page_list, &nr_scanned, sc->order,
- sc->lumpy_reclaim_mode ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
+ sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+ ISOLATE_INACTIVE: ISOLATE_BOTH,
zone, 0, file);
zone->pages_scanned += nr_scanned;
if (current_is_kswapd())
@@ -1320,8 +1366,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
} else {
nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
&page_list, &nr_scanned, sc->order,
- sc->lumpy_reclaim_mode ?
- ISOLATE_BOTH : ISOLATE_INACTIVE,
+ sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
+ ISOLATE_INACTIVE: ISOLATE_BOTH,
zone, sc->mem_cgroup,
0, file);
/*
@@ -1339,7 +1385,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

spin_unlock_irq(&zone->lru_lock);

- nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc);

/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
@@ -1350,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
nr_active = clear_active_flags(&page_list, NULL);
count_vm_events(PGDEACTIVATE, nr_active);

- nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+ set_lumpy_reclaim_mode(priority, sc, true);
+ nr_reclaimed += shrink_page_list(&page_list, sc);
}

local_irq_disable();
@@ -1721,21 +1768,6 @@ out:
}
}

-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
-{
- /*
- * If we need a large contiguous chunk of memory, or have
- * trouble getting a small set of contiguous pages, we
- * will reclaim both active and inactive pages.
- */
- if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
- sc->lumpy_reclaim_mode = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
- sc->lumpy_reclaim_mode = 1;
- else
- sc->lumpy_reclaim_mode = 0;
-}
-
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
@@ -1750,8 +1782,6 @@ static void shrink_zone(int priority, struct zone *zone,

get_scan_count(zone, sc, nr, priority);

- set_lumpy_reclaim_mode(priority, sc);
-
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
for_each_evictable_lru(l) {
--
1.6.5.2


2010-08-05 06:14:52

by KOSAKI Motohiro

Subject: [PATCH 5/7] vmscan: kill dead code in shrink_inactive_list()

When synchronous lumpy reclaim occurs, page_list is guaranteed not to
contain active pages, because page activation in shrink_page_list()
now always disables lumpy reclaim.

This patch removes the resulting dead code.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 8 --------
1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7aabd2..f21dbeb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1334,7 +1334,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
unsigned long nr_taken;
- unsigned long nr_active;
unsigned long nr_anon;
unsigned long nr_file;

@@ -1389,13 +1388,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,

/* Check if we should syncronously wait for writeback */
if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, NULL);
- count_vm_events(PGDEACTIVATE, nr_active);
-
set_lumpy_reclaim_mode(priority, sc, true);
nr_reclaimed += shrink_page_list(&page_list, sc);
}
--
1.6.5.2


2010-08-05 06:15:29

by KOSAKI Motohiro

Subject: [PATCH 6/7] vmscan: remove PF_SWAPWRITE from __zone_reclaim()

When zone_reclaim() was first merged, it always reclaimed at order-0
regardless of the allocation order, and it always set the PF_SWAPWRITE
flag (probably both were unintentional).

The former was fixed by commit bd2f6199cf (vmscan: respect higher
order in zone_reclaim()). However, that exposed the latter mistake,
because !order-0 reclaim often causes dirty page isolation and large
latencies.

Now we have a good opportunity to fix the latter too.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f21dbeb..e043e97 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2663,7 +2663,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* and we also need to be able to write out pages for RECLAIM_WRITE
* and RECLAIM_SWAP.
*/
- p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
+ p->flags |= PF_MEMALLOC;
lockdep_set_current_reclaim_state(gfp_mask);
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
@@ -2716,7 +2716,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
}

p->reclaim_state = NULL;
- current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
+ current->flags &= ~PF_MEMALLOC;
lockdep_clear_current_reclaim_state();
return sc.nr_reclaimed >= nr_pages;
}
--
1.6.5.2


2010-08-05 06:16:12

by KOSAKI Motohiro

Subject: [PATCH 7/7] vmscan: isolate_lru_pages() stop neighbor search if neighbor can't be isolated

isolate_lru_pages() doesn't only isolate LRU tail pages; it also
isolates the PFN neighbors of the evicted page.

Currently the neighbor search doesn't stop even if a neighbor can't be
isolated. That is silly: a successful higher-order allocation needs fully
contiguous memory, so failing to reclaim even one page means we fail to
make enough contiguous memory.

Therefore isolate_lru_pages() should stop searching the PFN neighbors and
move on to the next page on the LRU right away. This patch does that. It
also accounts every lumpy reclaim failure in nr_lumpy_failed.
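
The reasoning can be illustrated with a small userspace model (illustrative
only; the real loop is in isolate_lru_pages() in the diff below): once a
single neighbor fails, the block can no longer be contiguous, so the scan
stops.

#include <stdbool.h>
#include <stdio.h>

/* Model: isolate PFN neighbors until one fails; after that the block cannot
 * be made contiguous, so further scanning is wasted work. */
static int isolate_neighbors(const bool *isolatable, int nr_pages)
{
	int taken = 0;

	for (int i = 0; i < nr_pages; i++) {
		if (!isolatable[i])
			break;		/* give up on this block right away */
		taken++;
	}
	return taken;
}

int main(void)
{
	/* an order-2 block whose third page is pinned */
	bool block[4] = { true, true, false, true };

	printf("isolated %d of 4 neighbors before giving up\n",
	       isolate_neighbors(block, 4));
	return 0;
}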

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e043e97..264addc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1047,14 +1047,18 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
continue;

/* Avoid holes within the zone. */
- if (unlikely(!pfn_valid_within(pfn)))
+ if (unlikely(!pfn_valid_within(pfn))) {
+ nr_lumpy_failed++;
break;
+ }

cursor_page = pfn_to_page(pfn);

/* Check that we have not crossed a zone boundary. */
- if (unlikely(page_zone_id(cursor_page) != zone_id))
- continue;
+ if (unlikely(page_zone_id(cursor_page) != zone_id)) {
+ nr_lumpy_failed++;
+ break;
+ }

/*
* If we don't have enough swap space, reclaiming of
@@ -1062,8 +1066,10 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* pointless.
*/
if (nr_swap_pages <= 0 && PageAnon(cursor_page) &&
- !PageSwapCache(cursor_page))
- continue;
+ !PageSwapCache(cursor_page)) {
+ nr_lumpy_failed++;
+ break;
+ }

if (__isolate_lru_page(cursor_page, mode, file) == 0) {
list_move(&cursor_page->lru, dst);
@@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
nr_lumpy_dirty++;
scan++;
} else {
- if (mode == ISOLATE_BOTH &&
- page_count(cursor_page))
- nr_lumpy_failed++;
+ /* the page is freed already. */
+ if (!page_count(cursor_page))
+ continue;
+ nr_lumpy_failed++;
+ break;
}
}
}
--
1.6.5.2


2010-08-05 13:55:19

by Minchan Kim

Subject: Re: [PATCH 2/7] vmscan: synchronous lumpy reclaim don't call congestion_wait()

On Thu, Aug 05, 2010 at 03:13:03PM +0900, KOSAKI Motohiro wrote:
> congestion_wait() means "wait until the queue's congestion is cleared".
> That said, if the system has plenty of dirty pages and the flusher thread
> pushes new requests to the IO queue continuously, the IO queue does not
> leave the congested state for a long time. Thus, congestion_wait(HZ/10)
> becomes almost equivalent to schedule_timeout(HZ/10).
>
> However, synchronous lumpy reclaim doesn't need this
> congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) uses
> wait_on_page_writeback(), which provides sufficient waiting.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-08-05 14:17:17

by Minchan Kim

Subject: Re: [PATCH 3/7] vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()

On Thu, Aug 05, 2010 at 03:13:39PM +0900, KOSAKI Motohiro wrote:
> During synchronous lumpy reclaim, there is no reason to give up
> reclaiming a page just because it is locked. Use lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1cdc3db..833b6ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> - if (!trylock_page(page))
> + if (sync_writeback == PAGEOUT_IO_SYNC)
> + lock_page(page);
> + else if (!trylock_page(page))
> goto keep;
>
> VM_BUG_ON(PageActive(page));
> --
> 1.6.5.2
>
>
>

Hmm. Can we be sure that lumpy reclaim doesn't select a page that is
already locked? I mean the scenario below.

LRU head -> page A -> page B -> LRU tail

lock_page(page A)
some_function()
direct reclaim
select victim page B
enter lumpy mode
select victim page A as well as page B
shrink_page_list
lock_page(page A)


--
Kind regards,
Minchan Kim

2010-08-05 14:59:17

by Minchan Kim

Subject: Re: [PATCH 4/7] vmscan: narrowing synchronous lumpy reclaim condition

On Thu, Aug 05, 2010 at 03:14:13PM +0900, KOSAKI Motohiro wrote:
> Currently, shrink_page_list() gives up reclaiming a page for the following
> 10 reasons.
>
> 1. trylock_page() failure
> 2. page is unevictable
> 3. zone reclaim and page is mapped
> 4. PageWriteback() is true
> 5. page is swapbacked and swap is full
> 6. add_to_swap() failure
> 7. page is dirty and gfpmask doesn't have GFP_IO or GFP_FS
> 8. page is pinned
> 9. IO queue is congested
> 10. pageout() started IO, but it has not finished
>
> During lumpy reclaim, every one of these failures triggers synchronous lumpy
> reclaim, but that is slightly stupid. At least cases (2), (3), (5), (6), (7)
> and (8) are not worth retrying. This patch therefore disables lumpy reclaim
> when such a failure is encountered during reclaim processing.

Agree.

>
> Case (9) is more interesting. The current behaviour is:
> 1. start shrink_page_list(async)
> 2. find queue_congested()
> 3. skip the pageout write
> 4. start shrink_page_list(sync) anyway
> 5. wait on a lot of pages
> 6. again find queue_congested()
> 7. give up the pageout write again
>
> So it is a meaningless waste of time. However, just skipping the pageout does
> not seem a good idea for high-order lumpy reclaim either, because, for
> example, an x86 hugepage needs 512 contiguous pages, which often contain more
> dirty pages than the queue congestion threshold (~=128).
>
> After this patch, pageout() treats queue congestion as follows:
>
> - If order > PAGE_ALLOC_COSTLY_ORDER
>   always ignore queue congestion.
> - If order <= PAGE_ALLOC_COSTLY_ORDER
>   skip writing the page and disable lumpy reclaim.
>

If Mel's patch which skips pageout in direct reclaim works well, do we still
need this? File pages are no longer paged out in the direct reclaim path.
AFAIK, swap page writeout doesn't consider queue congestion, as we discussed
in Wu's thread a few days ago, so that is not a problem.

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/vmscan.c | 122 ++++++++++++++++++++++++++++++++++++----------------------
> 1 files changed, 76 insertions(+), 46 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 833b6ad..f7aabd2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -51,6 +51,12 @@
> #define CREATE_TRACE_POINTS
> #include <trace/events/vmscan.h>
>
> +enum lumpy_mode {
> + LUMPY_MODE_NONE,
> + LUMPY_MODE_ASYNC,
> + LUMPY_MODE_SYNC,
> +};
> +
> struct scan_control {
> /* Incremented by the number of inactive pages that were scanned */
> unsigned long nr_scanned;
> @@ -82,7 +88,7 @@ struct scan_control {
> * Intend to reclaim enough contenious memory rather than to reclaim
> * enough amount memory. I.e, it's the mode for high order allocation.
> */
> - bool lumpy_reclaim_mode;
> + enum lumpy_mode lumpy_reclaim_mode;
>
> /* Which cgroup do we reclaim from */
> struct mem_cgroup *mem_cgroup;
> @@ -265,6 +271,36 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> return ret;
> }
>
> +static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
> + bool sync)
> +{
> + enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
> +
> + /*
> + * Some reclaim have alredy been failed. No worth to try synchronous
> + * lumpy reclaim.
> + */
> + if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> + return;
> +
> + /*
> + * If we need a large contiguous chunk of memory, or have
> + * trouble getting a small set of contiguous pages, we
> + * will reclaim both active and inactive pages.
> + */
> + if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> + sc->lumpy_reclaim_mode = mode;
> + else if (sc->order && priority < DEF_PRIORITY - 2)
> + sc->lumpy_reclaim_mode = mode;
> + else
> + sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
> +}
> +
> +static void disable_lumpy_reclaim_mode(struct scan_control *sc)
> +{
> + sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
> +}
> +
> static inline int is_page_cache_freeable(struct page *page)
> {
> /*
> @@ -275,7 +311,8 @@ static inline int is_page_cache_freeable(struct page *page)
> return page_count(page) - page_has_private(page) == 2;
> }
>
> -static int may_write_to_queue(struct backing_dev_info *bdi)
> +static int may_write_to_queue(struct backing_dev_info *bdi,
> + struct scan_control *sc)
> {
> if (current->flags & PF_SWAPWRITE)
> return 1;
> @@ -283,6 +320,10 @@ static int may_write_to_queue(struct backing_dev_info *bdi)
> return 1;
> if (bdi == current->backing_dev_info)
> return 1;
> +
> + /* lumpy reclaim for hugepage often need a lot of write */
> + if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> + return 1;
> return 0;
> }
>
> @@ -307,12 +348,6 @@ static void handle_write_error(struct address_space *mapping,
> unlock_page(page);
> }
>
> -/* Request for sync pageout. */
> -enum pageout_io {
> - PAGEOUT_IO_ASYNC,
> - PAGEOUT_IO_SYNC,
> -};
> -
> /* possible outcome of pageout() */
> typedef enum {
> /* failed to write page out, page is locked */
> @@ -330,7 +365,7 @@ typedef enum {
> * Calls ->writepage().
> */
> static pageout_t pageout(struct page *page, struct address_space *mapping,
> - enum pageout_io sync_writeback)
> + struct scan_control *sc)
> {
> /*
> * If the page is dirty, only perform writeback if that write
> @@ -366,8 +401,10 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> }
> if (mapping->a_ops->writepage == NULL)
> return PAGE_ACTIVATE;
> - if (!may_write_to_queue(mapping->backing_dev_info))
> + if (!may_write_to_queue(mapping->backing_dev_info, sc)) {
> + disable_lumpy_reclaim_mode(sc);
> return PAGE_KEEP;
> + }
>
> if (clear_page_dirty_for_io(page)) {
> int res;
> @@ -394,7 +431,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> * direct reclaiming a large contiguous area and the
> * first attempt to free a range of pages fails.
> */
> - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> + if (PageWriteback(page) &&
> + sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
> wait_on_page_writeback(page);
>
> if (!PageWriteback(page)) {
> @@ -402,7 +440,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
> ClearPageReclaim(page);
> }
> trace_mm_vmscan_writepage(page,
> - sync_writeback == PAGEOUT_IO_SYNC);
> + sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC);
> inc_zone_page_state(page, NR_VMSCAN_WRITE);
> return PAGE_SUCCESS;
> }
> @@ -580,7 +618,7 @@ static enum page_references page_check_references(struct page *page,
> referenced_page = TestClearPageReferenced(page);
>
> /* Lumpy reclaim - ignore references */
> - if (sc->lumpy_reclaim_mode)
> + if (sc->lumpy_reclaim_mode != LUMPY_MODE_NONE)
> return PAGEREF_RECLAIM;
>
> /*
> @@ -644,8 +682,7 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> * shrink_page_list() returns the number of reclaimed pages
> */
> static unsigned long shrink_page_list(struct list_head *page_list,
> - struct scan_control *sc,
> - enum pageout_io sync_writeback)
> + struct scan_control *sc)
> {
> LIST_HEAD(ret_pages);
> LIST_HEAD(free_pages);
> @@ -665,7 +702,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> - if (sync_writeback == PAGEOUT_IO_SYNC)
> + if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC)
> lock_page(page);
> else if (!trylock_page(page))
> goto keep;
> @@ -696,10 +733,13 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> * for any page for which writeback has already
> * started.
> */
> - if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
> + if (sc->lumpy_reclaim_mode == LUMPY_MODE_SYNC &&
> + may_enter_fs)
> wait_on_page_writeback(page);
> - else
> - goto keep_locked;
> + else {
> + unlock_page(page);
> + goto keep_lumpy;
> + }
> }
>
> references = page_check_references(page, sc);
> @@ -753,14 +793,17 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> goto keep_locked;
>
> /* Page is dirty, try to write it out here */
> - switch (pageout(page, mapping, sync_writeback)) {
> + switch (pageout(page, mapping, sc)) {
> case PAGE_KEEP:
> goto keep_locked;
> case PAGE_ACTIVATE:
> goto activate_locked;
> case PAGE_SUCCESS:
> - if (PageWriteback(page) || PageDirty(page))
> + if (PageWriteback(page))
> + goto keep_lumpy;
> + if (PageDirty(page))
> goto keep;
> +
> /*
> * A synchronous write - probably a ramdisk. Go
> * ahead and try to reclaim the page.
> @@ -843,6 +886,7 @@ cull_mlocked:
> try_to_free_swap(page);
> unlock_page(page);
> putback_lru_page(page);
> + disable_lumpy_reclaim_mode(sc);
> continue;
>
> activate_locked:
> @@ -855,6 +899,8 @@ activate_locked:
> keep_locked:
> unlock_page(page);
> keep:
> + disable_lumpy_reclaim_mode(sc);
> +keep_lumpy:
> list_add(&page->lru, &ret_pages);
> VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
> }
> @@ -1255,7 +1301,7 @@ static inline bool should_reclaim_stall(unsigned long nr_taken,
> return false;
>
> /* Only stall on lumpy reclaim */
> - if (!sc->lumpy_reclaim_mode)
> + if (sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> return false;
>
> /* If we have relaimed everything on the isolated list, no stall */
> @@ -1300,15 +1346,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> return SWAP_CLUSTER_MAX;
> }
>
> -
> + set_lumpy_reclaim_mode(priority, sc, false);
> lru_add_drain();
> spin_lock_irq(&zone->lru_lock);
>
> if (scanning_global_lru(sc)) {
> nr_taken = isolate_pages_global(nr_to_scan,
> &page_list, &nr_scanned, sc->order,
> - sc->lumpy_reclaim_mode ?
> - ISOLATE_BOTH : ISOLATE_INACTIVE,
> + sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
> + ISOLATE_INACTIVE: ISOLATE_BOTH,
> zone, 0, file);
> zone->pages_scanned += nr_scanned;
> if (current_is_kswapd())
> @@ -1320,8 +1366,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> } else {
> nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
> &page_list, &nr_scanned, sc->order,
> - sc->lumpy_reclaim_mode ?
> - ISOLATE_BOTH : ISOLATE_INACTIVE,
> + sc->lumpy_reclaim_mode == LUMPY_MODE_NONE ?
> + ISOLATE_INACTIVE: ISOLATE_BOTH,
> zone, sc->mem_cgroup,
> 0, file);
> /*
> @@ -1339,7 +1385,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> spin_unlock_irq(&zone->lru_lock);
>
> - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
> + nr_reclaimed = shrink_page_list(&page_list, sc);
>
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> @@ -1350,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
> nr_active = clear_active_flags(&page_list, NULL);
> count_vm_events(PGDEACTIVATE, nr_active);
>
> - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
> + set_lumpy_reclaim_mode(priority, sc, true);
> + nr_reclaimed += shrink_page_list(&page_list, sc);
> }
>
> local_irq_disable();
> @@ -1721,21 +1768,6 @@ out:
> }
> }
>
> -static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
> -{
> - /*
> - * If we need a large contiguous chunk of memory, or have
> - * trouble getting a small set of contiguous pages, we
> - * will reclaim both active and inactive pages.
> - */
> - if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> - sc->lumpy_reclaim_mode = 1;
> - else if (sc->order && priority < DEF_PRIORITY - 2)
> - sc->lumpy_reclaim_mode = 1;
> - else
> - sc->lumpy_reclaim_mode = 0;
> -}
> -
> /*
> * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
> */
> @@ -1750,8 +1782,6 @@ static void shrink_zone(int priority, struct zone *zone,
>
> get_scan_count(zone, sc, nr, priority);
>
> - set_lumpy_reclaim_mode(priority, sc);
> -
> while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
> nr[LRU_INACTIVE_FILE]) {
> for_each_evictable_lru(l) {
> --
> 1.6.5.2
>
>
>

--
Kind regards,
Minchan Kim

2010-08-05 15:02:51

by Mel Gorman

Subject: Re: [PATCH 1/7] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

On Thu, Aug 05, 2010 at 03:12:22PM +0900, KOSAKI Motohiro wrote:
> From: Wu Fengguang <[email protected]>
>
> Fix "system goes unresponsive under memory pressure and lots of
> dirty/writeback pages" bug.
>
> http://lkml.org/lkml/2010/4/4/86
>
> In the above thread, Andreas Mohr described that
>
> Invoking any command locked up for minutes (note that I'm
> talking about attempted additional I/O to the _other_,
> _unaffected_ main system HDD - such as loading some shell
> binaries -, NOT the external SSD18M!!).
>
> This happens when the following two conditions are both met:
> - under memory pressure
> - writing heavily to a slow device
>
> <SNIP>

Other than an unnecessary whitespace removal at the end of the patch, I see
no problem with letting this patch stand on its own as we are
reasonably sure this patch fixes a problem on its own. Patches 2-7 might
further improve the situation but can be considered as a new series.

This patch (as well as most of the series) will reject against current mmotm
because of other reclaim-related patches already in there. The resolutions
are not too hard but bear it in mind.

> <SNIP>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-05 15:06:23

by Mel Gorman

Subject: Re: [PATCH 2/7] vmscan: synchronous lumpy reclaim don't call congestion_wait()

On Thu, Aug 05, 2010 at 03:13:03PM +0900, KOSAKI Motohiro wrote:
> congestion_wait() means "wait until the queue's congestion is cleared".
> That said, if the system has plenty of dirty pages and the flusher thread
> pushes new requests to the IO queue continuously, the IO queue does not
> leave the congested state for a long time. Thus, congestion_wait(HZ/10)
> becomes almost equivalent to schedule_timeout(HZ/10).
>
> However, synchronous lumpy reclaim doesn't need this
> congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) uses
> wait_on_page_writeback(), which provides sufficient waiting.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Needs rebasing for mmotm but otherwise;

Acked-by: Mel Gorman <[email protected]>

> ---
> mm/vmscan.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf51d62..1cdc3db 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1341,7 +1341,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
>
> /* Check if we should syncronously wait for writeback */
> if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> /*
> * The attempt at page out may have made some
> * of the pages active, mark them inactive again.
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-05 15:06:36

by Rik van Riel

Subject: Re: [PATCH 2/7] vmscan: synchronous lumpy reclaim don't call congestion_wait()

On 08/05/2010 02:13 AM, KOSAKI Motohiro wrote:
> congestion_wait() means "wait until the queue's congestion is cleared".
> That said, if the system has plenty of dirty pages and the flusher thread
> pushes new requests to the IO queue continuously, the IO queue does not
> leave the congested state for a long time. Thus, congestion_wait(HZ/10)
> becomes almost equivalent to schedule_timeout(HZ/10).
>
> However, synchronous lumpy reclaim doesn't need this
> congestion_wait() at all. shrink_page_list(PAGEOUT_IO_SYNC) uses
> wait_on_page_writeback(), which provides sufficient waiting.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-08-05 15:08:24

by Minchan Kim

Subject: Re: [PATCH 5/7] vmscan: kill dead code in shrink_inactive_list()

On Thu, Aug 05, 2010 at 03:14:47PM +0900, KOSAKI Motohiro wrote:
> When synchronous lumpy reclaim occurs, page_list is guaranteed not to
> contain active pages, because page activation in shrink_page_list()
> now always disables lumpy reclaim.
>
> This patch removes the resulting dead code.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim

2010-08-05 15:12:56

by Mel Gorman

Subject: Re: [PATCH 3/7] vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()

On Thu, Aug 05, 2010 at 03:13:39PM +0900, KOSAKI Motohiro wrote:
> During synchronous lumpy reclaim, there is no reason to give up
> reclaiming a page just because it is locked. Use lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

The intention of the code looks fine so;

Acked-by: Mel Gorman <[email protected]>

Something like the following might just be easier on the eye but it's a
question of personal taste.

/* Returns true if the page is locked */
static bool lru_lock_page(struct page *page, enum pageout_io sync_writeback)
{
if (likely(sync_writeback == PAGEOUT_IO_ASYNC))
return trylock_page(page);

lock_page(page);
return true;
}

then replace trylock_page() with lru_lock_page(). The naming is vaguely
similar to other helpers like lru_to_page().

> ---
> mm/vmscan.c | 4 +++-
> 1 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1cdc3db..833b6ad 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> page = lru_to_page(page_list);
> list_del(&page->lru);
>
> - if (!trylock_page(page))
> + if (sync_writeback == PAGEOUT_IO_SYNC)
> + lock_page(page);
> + else if (!trylock_page(page))
> goto keep;
>
> VM_BUG_ON(PageActive(page));
> --
> 1.6.5.2
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-05 15:14:24

by Mel Gorman

Subject: Re: [PATCH 5/7] vmscan: kill dead code in shrink_inactive_list()

On Thu, Aug 05, 2010 at 03:14:47PM +0900, KOSAKI Motohiro wrote:
> When synchronous lumpy reclaim occurs, page_list is guaranteed not to
> contain active pages, because page activation in shrink_page_list()
> now always disables lumpy reclaim.
>
> This patch removes the resulting dead code.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Seems fine.

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-05 15:20:25

by Rik van Riel

Subject: Re: [PATCH 1/7] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

On 08/05/2010 02:12 AM, KOSAKI Motohiro wrote:
> From: Wu Fengguang<[email protected]>
>
> Fix "system goes unresponsive under memory pressure and lots of
> dirty/writeback pages" bug.
>
> http://lkml.org/lkml/2010/4/4/86
>
> In the above thread, Andreas Mohr described that
>
> Invoking any command locked up for minutes (note that I'm
> talking about attempted additional I/O to the _other_,
> _unaffected_ main system HDD - such as loading some shell
> binaries -, NOT the external SSD18M!!).
>
> This happens when the following two conditions are both met:
> - under memory pressure
> - writing heavily to a slow device

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-08-05 15:25:45

by Mel Gorman

Subject: Re: [PATCH 7/7] vmscan: isolate_lru_pages() stop neighbor search if neighbor can't be isolated

On Thu, Aug 05, 2010 at 03:16:06PM +0900, KOSAKI Motohiro wrote:
> isolate_lru_pages() doesn't only isolate LRU tail pages; it also
> isolates the PFN neighbors of the evicted page.
>
> Currently the neighbor search doesn't stop even if a neighbor can't be
> isolated. That is silly: a successful higher-order allocation needs fully
> contiguous memory, so failing to reclaim even one page means we fail to
> make enough contiguous memory.
>
> Therefore isolate_lru_pages() should stop searching the PFN neighbors and
> move on to the next page on the LRU right away. This patch does that. It
> also accounts every lumpy reclaim failure in nr_lumpy_failed.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Seems reasonable.

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-08-05 15:27:16

by Rik van Riel

Subject: Re: [PATCH 3/7] vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()

On 08/05/2010 02:13 AM, KOSAKI Motohiro wrote:
> During synchronous lumpy reclaim, there is no reason to give up
> reclaiming a page just because it is locked. Use lock_page() instead of
> trylock_page() in this case.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-08-05 15:40:57

by Minchan Kim

Subject: Re: [PATCH 7/7] vmscan: isolate_lru_pages() stop neighbor search if neighbor can't be isolated

On Thu, Aug 05, 2010 at 03:16:06PM +0900, KOSAKI Motohiro wrote:
> isolate_lru_pages() doesn't only isolate LRU tail pages; it also
> isolates the PFN neighbors of the evicted page.
>
> Currently the neighbor search doesn't stop even if a neighbor can't be
> isolated. That is silly: a successful higher-order allocation needs fully
> contiguous memory, so failing to reclaim even one page means we fail to
> make enough contiguous memory.
>
> Therefore isolate_lru_pages() should stop searching the PFN neighbors and
> move on to the next page on the LRU right away. This patch does that. It
> also accounts every lumpy reclaim failure in nr_lumpy_failed.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

I agree with this patch.
But I have one question.

> ---
> mm/vmscan.c | 24 ++++++++++++++++--------
> 1 files changed, 16 insertions(+), 8 deletions(-)
>
<snip>


> if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> list_move(&cursor_page->lru, dst);
> @@ -1074,9 +1080,11 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
> nr_lumpy_dirty++;
> scan++;
> } else {
> - if (mode == ISOLATE_BOTH &&
> - page_count(cursor_page))
> - nr_lumpy_failed++;

sc->order = 1;
shrink_inactive_list;
isolate_pages_global with ISOLATE_INACTIVE (I mean no lumpy reclaim mode);
lumpy reclaim within the inactive list in isolate_lru_pages;
(But I am not sure we can call it lumpy reclaim; at least I think it is a
part of lumpy reclaim.)
I mean it can reclaim in physical PFN order, not LRU order, in the inactive
list, since it only considers sc->order. Is that intentional?

I guess it is intentional, since we used to check for ISOLATE_BOTH when
increasing nr_lumpy_failed. If so, shouldn't we still check for ISOLATE_BOTH?


> + /* the page is freed already. */
> + if (!page_count(cursor_page))
> + continue;
> + nr_lumpy_failed++;
> + break;
> }
> }
> }
> --
> 1.6.5.2
>


--
Kind regards,
Minchan Kim

2010-08-06 00:52:41

by Minchan Kim

Subject: Re: [PATCH 3/7] vmscan: synchronous lumpy reclaim use lock_page() instead of trylock_page()

On Thu, Aug 5, 2010 at 11:17 PM, Minchan Kim <[email protected]> wrote:
> On Thu, Aug 05, 2010 at 03:13:39PM +0900, KOSAKI Motohiro wrote:
>> During synchronous lumpy reclaim, there is no reason to give up
>> reclaiming a page just because it is locked. Use lock_page() instead of
>> trylock_page() in this case.
>>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> ---
>> mm/vmscan.c | 4 +++-
>> 1 files changed, 3 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 1cdc3db..833b6ad 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -665,7 +665,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>> page = lru_to_page(page_list);
>> list_del(&page->lru);
>>
>> - if (!trylock_page(page))
>> + if (sync_writeback == PAGEOUT_IO_SYNC)
>> + lock_page(page);
>> + else if (!trylock_page(page))
>> goto keep;
>>
>> VM_BUG_ON(PageActive(page));
>> --
>> 1.6.5.2
>>
>>
>>
>
> Hmm. Can we be sure that lumpy reclaim doesn't select a page that is
> already locked? I mean the scenario below.
>
> LRU head -> page A -> page B -> LRU tail
>
> lock_page(page A)
> some_function()
> direct reclaim
> select victim page B
> enter lumpy mode
> select victim page A as well as page B
> shrink_page_list
> lock_page(page A)
>
>
> --
> Kind regards,
> Minchan Kim
>

Ignore the above comment.
lock_page() doesn't have a deadlock problem. My bad.

Reviewed-by: Minchan Kim <[email protected]>



--
Kind regards,
Minchan Kim

2010-08-08 06:43:13

by KOSAKI Motohiro

Subject: Re: [PATCH 1/7] vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

> This patch (as well as most of the series) will reject against current mmotm
> because of other reclaim-related patches already in there. The resolutions
> are not too hard but bear it in mind.

I was working on the latest published mmotm. But yes, the current private
akpm mmotm includes your reclaim-related changes.

That said, I need to rework them later.


2010-11-02 02:05:16

by Fengguang Wu

Subject: Re: [PATCH 4/7] vmscan: narrowing synchronous lumpy reclaim condition

> To make compaction a full replacement for lumpy, reclaim would have to
> know how to reclaim order-9 worth of pages and then compact properly.
> It's not setup for this and a naive algorithm would spend a lot of time
> in the compaction scanning code (which is pretty inefficient). A possible
> alternative would be to lumpy-compact i.e. select a page from the LRU and
> move all pages around it elsewhere. Again, this is not what we are currently
> doing but it's a direction that could be taken.

Agreed. The more lumpy reclaim we do, the more young pages are wrongly
evicted. THP could trigger lumpy reclaim heavily, which is why Andreas
needs to disable it. Lumpy migration looks much better. Compaction
looks like a form of pre-emptive/batched lumpy migration. We may also do
on-demand lumpy migration in the future.

> +static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
> + bool sync)
> +{
> + enum lumpy_mode mode = sync ? LUMPY_MODE_SYNC : LUMPY_MODE_ASYNC;
> +
> + /*
> + * Some reclaim have alredy been failed. No worth to try synchronous
> + * lumpy reclaim.
> + */
> + if (sync && sc->lumpy_reclaim_mode == LUMPY_MODE_NONE)
> + return;
> +
> + /*
> + * If we need a large contiguous chunk of memory, or have
> + * trouble getting a small set of contiguous pages, we
> + * will reclaim both active and inactive pages.
> + */
> + if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
> + sc->lumpy_reclaim_mode = mode;
> + else if (sc->order && priority < DEF_PRIORITY - 2)
> + sc->lumpy_reclaim_mode = mode;
> + else
> + sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;
> +}

Andrea, I don't see a conflict in doing lumpy reclaim improvements
in parallel with compaction and THP. If lumpy reclaim hurts THP, can't it
be trivially disabled in your tree for huge-page-order allocations?

+ if (sc->order > 9)
+ sc->lumpy_reclaim_mode = LUMPY_MODE_NONE;

Thanks,
Fengguang