2012-06-14 08:13:38

by KOSAKI Motohiro

Subject: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

From: KOSAKI Motohiro <[email protected]>

Currently, do_try_to_free_pages() can enter livelock, because vmscan
now has two conflicting policies:

1) kswapd sleeps when it couldn't reclaim any page after reaching
   priority 0, to avoid an infinite loop in kswapd(). In other words,
   kswapd assumes that direct reclaim will make enough free pages
   through either regular page reclaim or the oom-killer.
   This logic creates a kswapd -> direct-reclaim dependency.
2) direct reclaim keeps reclaiming without invoking the oom-killer
   until kswapd turns on zone->all_unreclaimable, to avoid a too
   early oom-kill.
   This logic creates a direct-reclaim -> kswapd dependency.

In the worst case, direct reclaim may continue page reclaim forever
while kswapd sleeps forever.

We can't turn on zone->all_unreclaimable from the direct reclaim path,
because direct reclaim doesn't take any lock and doing so would be racy.

Thus this patch removes the zone->all_unreclaimable field completely
and recalculates the zone reclaimable state every time.

Note: we can't take the approach of having direct reclaim look at
zone->pages_scanned directly while kswapd keeps using
zone->all_unreclaimable, because that is racy. Commit 929bea7c71
("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
describes the details.

Reported-by: Aaditya Kumar <[email protected]>
Reported-by: Ying Han <[email protected]>
Cc: Nick Piggin <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/mm_inline.h | 19 +++++++++++++++++
include/linux/mmzone.h | 2 +-
include/linux/vmstat.h | 1 -
mm/page-writeback.c | 2 +
mm/page_alloc.c | 5 +--
mm/vmscan.c | 48 ++++++++++++--------------------------------
mm/vmstat.c | 3 +-
7 files changed, 39 insertions(+), 41 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 1397ccf..04f32e1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -2,6 +2,7 @@
#define LINUX_MM_INLINE_H

#include <linux/huge_mm.h>
+#include <linux/swap.h>

/**
* page_is_file_cache - should the page be on a file LRU or anon LRU?
@@ -99,4 +100,22 @@ static __always_inline enum lru_list page_lru(struct page *page)
return lru;
}

+static inline unsigned long zone_reclaimable_pages(struct zone *zone)
+{
+ int nr;
+
+ nr = zone_page_state(zone, NR_ACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_FILE);
+
+ if (nr_swap_pages > 0)
+ nr += zone_page_state(zone, NR_ACTIVE_ANON) +
+ zone_page_state(zone, NR_INACTIVE_ANON);
+
+ return nr;
+}
+
+static inline bool zone_reclaimable(struct zone *zone)
+{
+ return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
+}
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..9d2a720 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -368,7 +368,7 @@ struct zone {
* free areas of different sizes
*/
spinlock_t lock;
- int all_unreclaimable; /* All pages pinned */
+
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 65efb92..9607256 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -140,7 +140,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
}

extern unsigned long global_reclaimable_pages(void);
-extern unsigned long zone_reclaimable_pages(struct zone *zone);

#ifdef CONFIG_NUMA
/*
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 93d8d2f..d2d957f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -34,6 +34,8 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h> /* __set_page_dirty_buffers */
#include <linux/pagevec.h>
+#include <linux/mm_inline.h>
+
#include <trace/events/writeback.h>

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..5716b00 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/prefetch.h>
#include <linux/migrate.h>
#include <linux/page-debug-flags.h>
+#include <linux/mm_inline.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
int to_free = count;

spin_lock(&zone->lock);
- zone->all_unreclaimable = 0;
zone->pages_scanned = 0;

while (to_free) {
@@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
int migratetype)
{
spin_lock(&zone->lock);
- zone->all_unreclaimable = 0;
zone->pages_scanned = 0;

__free_one_page(page, zone, order, migratetype);
@@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_BOUNCE)),
K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
zone->pages_scanned,
- (zone->all_unreclaimable ? "yes" : "no")
+ (zone_reclaimable(zone) ? "yes" : "no")
);
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..033671c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
* latencies, so it's better to scan a minimum amount there as
* well.
*/
- if (current_is_kswapd() && zone->all_unreclaimable)
+ if (current_is_kswapd() && !zone_reclaimable(zone))
force_scan = true;
if (!global_reclaim(sc))
force_scan = true;
@@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (global_reclaim(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (zone->all_unreclaimable &&
- sc->priority != DEF_PRIORITY)
+ if (!zone_reclaimable(zone) &&
+ sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
if (COMPACTION_BUILD) {
/*
@@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
return aborted_reclaim;
}

-static bool zone_reclaimable(struct zone *zone)
-{
- return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
-}
-
/* All zones in zonelist are unreclaimable? */
static bool all_unreclaimable(struct zonelist *zonelist,
struct scan_control *sc)
@@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
continue;
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- if (!zone->all_unreclaimable)
+ if (zone_reclaimable(zone))
return false;
}

@@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
* they must be considered balanced here as well if kswapd
* is to sleep
*/
- if (zone->all_unreclaimable) {
+ if (zone_reclaimable(zone)) {
balanced += zone->present_pages;
continue;
}
@@ -2393,8 +2388,7 @@ loop_again:
if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable &&
- sc.priority != DEF_PRIORITY)
+ if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
continue;

/*
@@ -2443,14 +2437,13 @@ loop_again:
*/
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- int nr_slab, testorder;
+ int testorder;
unsigned long balance_gap;

if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable &&
- sc.priority != DEF_PRIORITY)
+ if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
continue;

sc.nr_scanned = 0;
@@ -2497,12 +2490,11 @@ loop_again:
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+ shrink_slab(&shrink, sc.nr_scanned, lru_pages);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;

- if (nr_slab == 0 && !zone_reclaimable(zone))
- zone->all_unreclaimable = 1;
+
}

/*
@@ -2514,7 +2506,7 @@ loop_again:
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;

- if (zone->all_unreclaimable) {
+ if (!zone_reclaimable(zone)) {
if (end_zone && end_zone == i)
end_zone--;
continue;
@@ -2616,7 +2608,7 @@ out:
if (!populated_zone(zone))
continue;

- if (zone->all_unreclaimable &&
+ if (!zone_reclaimable(zone) &&
sc.priority != DEF_PRIORITY)
continue;

@@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
return nr;
}

-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
- int nr;
-
- nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
-
- if (nr_swap_pages > 0)
- nr += zone_page_state(zone, NR_ACTIVE_ANON) +
- zone_page_state(zone, NR_INACTIVE_ANON);
-
- return nr;
-}
-
#ifdef CONFIG_HIBERNATION
/*
* Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
@@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
return ZONE_RECLAIM_FULL;

- if (zone->all_unreclaimable)
+ if (!zone_reclaimable(zone))
return ZONE_RECLAIM_FULL;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1bbbbd9..94b9d4c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -19,6 +19,7 @@
#include <linux/math64.h>
#include <linux/writeback.h>
#include <linux/compaction.h>
+#include <linux/mm_inline.h>

#ifdef CONFIG_VM_EVENT_COUNTERS
DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n all_unreclaimable: %u"
"\n start_pfn: %lu"
"\n inactive_ratio: %u",
- zone->all_unreclaimable,
+ !zone_reclaimable(zone),
zone->zone_start_pfn,
zone->inactive_ratio);
seq_putc(m, '\n');
--
1.7.1


2012-06-14 08:43:42

by Johannes Weiner

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently, do_try_to_free_pages() can enter livelock. Because of,
> now vmscan has two conflicted policies.
>
> 1) kswapd sleep when it couldn't reclaim any page when reaching
> priority 0. This is because to avoid kswapd() infinite
> loop. That said, kswapd assume direct reclaim makes enough
> free pages to use either regular page reclaim or oom-killer.
> This logic makes kswapd -> direct-reclaim dependency.
> 2) direct reclaim continue to reclaim without oom-killer until
> kswapd turn on zone->all_unreclaimble. This is because
> to avoid too early oom-kill.
> This logic makes direct-reclaim -> kswapd dependency.
>
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever.
>
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
>
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
>
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
>
> Reported-by: Aaditya Kumar <[email protected]>
> Reported-by: Ying Han <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

> @@ -2497,12 +2490,11 @@ loop_again:
> shrink_zone(zone, &sc);
>
> reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> + shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> total_scanned += sc.nr_scanned;
>
> - if (nr_slab == 0 && !zone_reclaimable(zone))
> - zone->all_unreclaimable = 1;
> +

That IS a slight change in behaviour. But then, if you scanned 6
times the amount of reclaimable pages without freeing a single slab
page, it's probably not worth going on.

2012-06-14 08:54:39

by Kamezawa Hiroyuki

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

(2012/06/14 17:13), [email protected] wrote:
> From: KOSAKI Motohiro<[email protected]>
>
> Currently, do_try_to_free_pages() can enter livelock. Because of,
> now vmscan has two conflicted policies.
>
> 1) kswapd sleep when it couldn't reclaim any page when reaching
> priority 0. This is because to avoid kswapd() infinite
> loop. That said, kswapd assume direct reclaim makes enough
> free pages to use either regular page reclaim or oom-killer.
> This logic makes kswapd -> direct-reclaim dependency.
> 2) direct reclaim continue to reclaim without oom-killer until
> kswapd turn on zone->all_unreclaimble. This is because
> to avoid too early oom-kill.
> This logic makes direct-reclaim -> kswapd dependency.
>
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever.
>
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
>
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
>
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
>
> Reported-by: Aaditya Kumar<[email protected]>
> Reported-by: Ying Han<[email protected]>
> Cc: Nick Piggin<[email protected]>
> Acked-by: Rik van Riel<[email protected]>
> Cc: Michal Hocko<[email protected]>
> Cc: Johannes Weiner<[email protected]>
> Cc: Mel Gorman<[email protected]>
> Cc: KAMEZAWA Hiroyuki<[email protected]>
> Cc: Minchan Kim<[email protected]>
> Signed-off-by: KOSAKI Motohiro<[email protected]>

I like this.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-06-14 14:57:32

by Minchan Kim

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

Hi KOSAKI,

Sorry for the late response.
Let me ask a question about the description.

On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently, do_try_to_free_pages() can enter livelock. Because of,
> now vmscan has two conflicted policies.
>
> 1) kswapd sleep when it couldn't reclaim any page when reaching
> priority 0. This is because to avoid kswapd() infinite
> loop. That said, kswapd assume direct reclaim makes enough
> free pages to use either regular page reclaim or oom-killer.
> This logic makes kswapd -> direct-reclaim dependency.
> 2) direct reclaim continue to reclaim without oom-killer until
> kswapd turn on zone->all_unreclaimble. This is because
> to avoid too early oom-kill.
> This logic makes direct-reclaim -> kswapd dependency.
>
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever.

I have tried to reproduce the scenario you mention above at the code level, but
unfortunately I failed.
If kswapd can't meet the high watermark on order-0, it doesn't sleep, unless I am missing something.
So if kswapd sleeps, it means we already have enough order-0 free pages.
Hmm, could you describe the scenario you found in detail, at the code level?

Anyway, as I look at your patch, I can't find any problem.
I just want to understand the scenario you mention completely in my head.
Maybe that can help make the description clearer.

Thanks.

>
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
>
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
>
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
>
> Reported-by: Aaditya Kumar <[email protected]>
> Reported-by: Ying Han <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> include/linux/mm_inline.h | 19 +++++++++++++++++
> include/linux/mmzone.h | 2 +-
> include/linux/vmstat.h | 1 -
> mm/page-writeback.c | 2 +
> mm/page_alloc.c | 5 +--
> mm/vmscan.c | 48 ++++++++++++--------------------------------
> mm/vmstat.c | 3 +-
> 7 files changed, 39 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 1397ccf..04f32e1 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -2,6 +2,7 @@
> #define LINUX_MM_INLINE_H
>
> #include <linux/huge_mm.h>
> +#include <linux/swap.h>
>
> /**
> * page_is_file_cache - should the page be on a file LRU or anon LRU?
> @@ -99,4 +100,22 @@ static __always_inline enum lru_list page_lru(struct page *page)
> return lru;
> }
>
> +static inline unsigned long zone_reclaimable_pages(struct zone *zone)
> +{
> + int nr;
> +
> + nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_FILE);
> +
> + if (nr_swap_pages > 0)
> + nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> + zone_page_state(zone, NR_INACTIVE_ANON);
> +
> + return nr;
> +}
> +
> +static inline bool zone_reclaimable(struct zone *zone)
> +{
> + return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> +}
> #endif
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..9d2a720 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -368,7 +368,7 @@ struct zone {
> * free areas of different sizes
> */
> spinlock_t lock;
> - int all_unreclaimable; /* All pages pinned */
> +
> #ifdef CONFIG_MEMORY_HOTPLUG
> /* see spanned/present_pages for more description */
> seqlock_t span_seqlock;
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 65efb92..9607256 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -140,7 +140,6 @@ static inline unsigned long zone_page_state_snapshot(struct zone *zone,
> }
>
> extern unsigned long global_reclaimable_pages(void);
> -extern unsigned long zone_reclaimable_pages(struct zone *zone);
>
> #ifdef CONFIG_NUMA
> /*
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 93d8d2f..d2d957f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -34,6 +34,8 @@
> #include <linux/syscalls.h>
> #include <linux/buffer_head.h> /* __set_page_dirty_buffers */
> #include <linux/pagevec.h>
> +#include <linux/mm_inline.h>
> +
> #include <trace/events/writeback.h>
>
> /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..5716b00 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
> #include <linux/prefetch.h>
> #include <linux/migrate.h>
> #include <linux/page-debug-flags.h>
> +#include <linux/mm_inline.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -638,7 +639,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> int to_free = count;
>
> spin_lock(&zone->lock);
> - zone->all_unreclaimable = 0;
> zone->pages_scanned = 0;
>
> while (to_free) {
> @@ -680,7 +680,6 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
> int migratetype)
> {
> spin_lock(&zone->lock);
> - zone->all_unreclaimable = 0;
> zone->pages_scanned = 0;
>
> __free_one_page(page, zone, order, migratetype);
> @@ -2870,7 +2869,7 @@ void show_free_areas(unsigned int filter)
> K(zone_page_state(zone, NR_BOUNCE)),
> K(zone_page_state(zone, NR_WRITEBACK_TEMP)),
> zone->pages_scanned,
> - (zone->all_unreclaimable ? "yes" : "no")
> + (zone_reclaimable(zone) ? "yes" : "no")
> );
> printk("lowmem_reserve[]:");
> for (i = 0; i < MAX_NR_ZONES; i++)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1592,7 +1592,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> * latencies, so it's better to scan a minimum amount there as
> * well.
> */
> - if (current_is_kswapd() && zone->all_unreclaimable)
> + if (current_is_kswapd() && !zone_reclaimable(zone))
> force_scan = true;
> if (!global_reclaim(sc))
> force_scan = true;
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> if (global_reclaim(sc)) {
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> - if (zone->all_unreclaimable &&
> - sc->priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) &&
> + sc->priority != DEF_PRIORITY)
> continue; /* Let kswapd poll it */
> if (COMPACTION_BUILD) {
> /*
> @@ -1975,11 +1975,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> return aborted_reclaim;
> }
>
> -static bool zone_reclaimable(struct zone *zone)
> -{
> - return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
> -}
> -
> /* All zones in zonelist are unreclaimable? */
> static bool all_unreclaimable(struct zonelist *zonelist,
> struct scan_control *sc)
> @@ -1993,7 +1988,7 @@ static bool all_unreclaimable(struct zonelist *zonelist,
> continue;
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> - if (!zone->all_unreclaimable)
> + if (zone_reclaimable(zone))
> return false;
> }
>
> @@ -2299,7 +2294,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> * they must be considered balanced here as well if kswapd
> * is to sleep
> */
> - if (zone->all_unreclaimable) {
> + if (zone_reclaimable(zone)) {
> balanced += zone->present_pages;
> continue;
> }
> @@ -2393,8 +2388,7 @@ loop_again:
> if (!populated_zone(zone))
> continue;
>
> - if (zone->all_unreclaimable &&
> - sc.priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
> continue;
>
> /*
> @@ -2443,14 +2437,13 @@ loop_again:
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int nr_slab, testorder;
> + int testorder;
> unsigned long balance_gap;
>
> if (!populated_zone(zone))
> continue;
>
> - if (zone->all_unreclaimable &&
> - sc.priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
> continue;
>
> sc.nr_scanned = 0;
> @@ -2497,12 +2490,11 @@ loop_again:
> shrink_zone(zone, &sc);
>
> reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> + shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> total_scanned += sc.nr_scanned;
>
> - if (nr_slab == 0 && !zone_reclaimable(zone))
> - zone->all_unreclaimable = 1;
> +
> }
>
> /*
> @@ -2514,7 +2506,7 @@ loop_again:
> total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> sc.may_writepage = 1;
>
> - if (zone->all_unreclaimable) {
> + if (!zone_reclaimable(zone)) {
> if (end_zone && end_zone == i)
> end_zone--;
> continue;
> @@ -2616,7 +2608,7 @@ out:
> if (!populated_zone(zone))
> continue;
>
> - if (zone->all_unreclaimable &&
> + if (!zone_reclaimable(zone) &&
> sc.priority != DEF_PRIORITY)
> continue;
>
> @@ -2850,20 +2842,6 @@ unsigned long global_reclaimable_pages(void)
> return nr;
> }
>
> -unsigned long zone_reclaimable_pages(struct zone *zone)
> -{
> - int nr;
> -
> - nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> - zone_page_state(zone, NR_INACTIVE_FILE);
> -
> - if (nr_swap_pages > 0)
> - nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> - zone_page_state(zone, NR_INACTIVE_ANON);
> -
> - return nr;
> -}
> -
> #ifdef CONFIG_HIBERNATION
> /*
> * Try to free `nr_to_reclaim' of memory, system-wide, and return the number of
> @@ -3158,7 +3136,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> return ZONE_RECLAIM_FULL;
>
> - if (zone->all_unreclaimable)
> + if (!zone_reclaimable(zone))
> return ZONE_RECLAIM_FULL;
>
> /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 1bbbbd9..94b9d4c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -19,6 +19,7 @@
> #include <linux/math64.h>
> #include <linux/writeback.h>
> #include <linux/compaction.h>
> +#include <linux/mm_inline.h>
>
> #ifdef CONFIG_VM_EVENT_COUNTERS
> DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
> @@ -1022,7 +1023,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
> "\n all_unreclaimable: %u"
> "\n start_pfn: %lu"
> "\n inactive_ratio: %u",
> - zone->all_unreclaimable,
> + !zone_reclaimable(zone),
> zone->zone_start_pfn,
> zone->inactive_ratio);
> seq_putc(m, '\n');
> --
> 1.7.1
>

2012-06-14 15:25:41

by Michal Hocko

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Thu 14-06-12 04:13:12, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently, do_try_to_free_pages() can enter livelock. Because of,
> now vmscan has two conflicted policies.
>
> 1) kswapd sleep when it couldn't reclaim any page when reaching
> priority 0. This is because to avoid kswapd() infinite
> loop. That said, kswapd assume direct reclaim makes enough
> free pages to use either regular page reclaim or oom-killer.
> This logic makes kswapd -> direct-reclaim dependency.
> 2) direct reclaim continue to reclaim without oom-killer until
> kswapd turn on zone->all_unreclaimble. This is because
> to avoid too early oom-kill.
> This logic makes direct-reclaim -> kswapd dependency.
>
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever.
>
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
>
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
>
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
>
> Reported-by: Aaditya Kumar <[email protected]>
> Reported-by: Ying Han <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Looks good, just one comment below:

Reviewed-by: Michal Hocko <[email protected]>

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeb3bc9..033671c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
[...]
> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> if (global_reclaim(sc)) {
> if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
> continue;
> - if (zone->all_unreclaimable &&
> - sc->priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) &&
> + sc->priority != DEF_PRIORITY)

Not exactly a hot path but still would be nice to test the priority
first as the test is cheaper (maybe compiler is clever enough to reorder
this, as both expressions are independent and without any side-effects
but...).
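
i.e. something along these lines (just an illustration of the suggested
reordering, not a tested change):

			if (sc->priority != DEF_PRIORITY &&
			    !zone_reclaimable(zone))
				continue;	/* Let kswapd poll it */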

[...]
> @@ -2393,8 +2388,7 @@ loop_again:
> if (!populated_zone(zone))
> continue;
>
> - if (zone->all_unreclaimable &&
> - sc.priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
> continue;

Same here

>
> /*
> @@ -2443,14 +2437,13 @@ loop_again:
> */
> for (i = 0; i <= end_zone; i++) {
> struct zone *zone = pgdat->node_zones + i;
> - int nr_slab, testorder;
> + int testorder;
> unsigned long balance_gap;
>
> if (!populated_zone(zone))
> continue;
>
> - if (zone->all_unreclaimable &&
> - sc.priority != DEF_PRIORITY)
> + if (!zone_reclaimable(zone) && sc.priority != DEF_PRIORITY)
> continue;

Same here

[...]
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-06-14 15:47:12

by KOSAKI Motohiro

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Thu, Jun 14, 2012 at 11:25 AM, Michal Hocko <[email protected]> wrote:
> On Thu 14-06-12 04:13:12, [email protected] wrote:
>> From: KOSAKI Motohiro <[email protected]>
>>
>> Currently, do_try_to_free_pages() can enter livelock. Because of,
>> now vmscan has two conflicted policies.
>>
>> 1) kswapd sleep when it couldn't reclaim any page when reaching
>>    priority 0. This is because to avoid kswapd() infinite
>>    loop. That said, kswapd assume direct reclaim makes enough
>>    free pages to use either regular page reclaim or oom-killer.
>>    This logic makes kswapd -> direct-reclaim dependency.
>> 2) direct reclaim continue to reclaim without oom-killer until
>>    kswapd turn on zone->all_unreclaimble. This is because
>>    to avoid too early oom-kill.
>>    This logic makes direct-reclaim -> kswapd dependency.
>>
>> In worst case, direct-reclaim may continue to page reclaim forever
>> when kswapd sleeps forever.
>>
>> We can't turn on zone->all_unreclaimable from direct reclaim path
>> because direct reclaim path don't take any lock and this way is racy.
>>
>> Thus this patch removes zone->all_unreclaimable field completely and
>> recalculates zone reclaimable state every time.
>>
>> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
>> directly and kswapd continue to use zone->all_unreclaimable. Because, it
>> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
>> zone->all_unreclaimable as a name) describes the detail.
>>
>> Reported-by: Aaditya Kumar <[email protected]>
>> Reported-by: Ying Han <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Acked-by: Rik van Riel <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: KAMEZAWA Hiroyuki <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> Looks good, just one comment bellow:
>
> Reviewed-by: Michal Hocko <[email protected]>
>
> [...]
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index eeb3bc9..033671c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
> [...]
>> @@ -1936,8 +1936,8 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>>              if (global_reclaim(sc)) {
>>                      if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
>>                              continue;
>> -                    if (zone->all_unreclaimable &&
>> -                                    sc->priority != DEF_PRIORITY)
>> +                    if (!zone_reclaimable(zone) &&
>> +                        sc->priority != DEF_PRIORITY)
>
> Not exactly a hot path but still would be nice to test the priority
> first as the test is cheaper (maybe compiler is clever enough to reorder
> this, as both expressions are independent and without any side-effects
> but...).

ok, will fix.

2012-06-14 16:11:12

by KOSAKI Motohiro

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <[email protected]> wrote:
> Hi KOSAKI,
>
> Sorry for late response.
> Let me ask a question about description.
>
> On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
>> From: KOSAKI Motohiro <[email protected]>
>>
>> Currently, do_try_to_free_pages() can enter livelock. Because of,
>> now vmscan has two conflicted policies.
>>
>> 1) kswapd sleep when it couldn't reclaim any page when reaching
>> ? ?priority 0. This is because to avoid kswapd() infinite
>> ? ?loop. That said, kswapd assume direct reclaim makes enough
>> ? ?free pages to use either regular page reclaim or oom-killer.
>> ? ?This logic makes kswapd -> direct-reclaim dependency.
>> 2) direct reclaim continue to reclaim without oom-killer until
>> ? ?kswapd turn on zone->all_unreclaimble. This is because
>> ? ?to avoid too early oom-kill.
>> ? ?This logic makes direct-reclaim -> kswapd dependency.
>>
>> In worst case, direct-reclaim may continue to page reclaim forever
>> when kswapd sleeps forever.
>
> I have tried imagined scenario you mentioned above with code level but
> unfortunately I got failed.
> If kswapd can't meet high watermark on order-0, it doesn't sleep if I don't miss something.

pgdat_balanced() doesn't look at individual zones. Therefore kswapd may
sleep if the node has multiple zones. Hm, ok, I realize my description
was slightly misleading: priority 0 is not needed. balance_pgdat() calls
pgdat_balanced() at every priority. The easiest case is a movable zone
with a lot of free pages and a normal zone with no reclaimable pages.

btw, the current pgdat_balanced() logic seems incorrect. kswapd should
sleep only if every zone has more free pages than the high watermark
_and_ 25% of the present pages in the node are free.
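
For reference, the pgdat_balanced() logic in question is roughly the
following (a simplified sketch of the mm/vmscan.c code of this era, not
a verbatim copy):

static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
			   int classzone_idx)
{
	unsigned long present_pages = 0;
	int i;

	for (i = 0; i <= classzone_idx; i++)
		present_pages += pgdat->node_zones[i].present_pages;

	/*
	 * The node counts as balanced once the balanced zones cover at
	 * least 25% of its present pages, regardless of which zones they
	 * are. A large, balanced MOVABLE zone can satisfy this on its
	 * own, so kswapd may go to sleep even though the NORMAL zone has
	 * no reclaimable pages left.
	 */
	return balanced_pages >= (present_pages >> 2);
}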



> So if kswapd sleeps, it means we already have enough order-0 free pages.
> Hmm, could you describe scenario you found in detail with code level?
>
> Anyway, as I look at your patch, I can't find any problem.
> I just want to understand scenario you mentioned completely in my head.
> Maybe It can help making description clear.
>

2012-06-15 07:27:04

by Minchan Kim

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On 06/15/2012 01:10 AM, KOSAKI Motohiro wrote:

> On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <[email protected]> wrote:
>> Hi KOSAKI,
>>
>> Sorry for late response.
>> Let me ask a question about description.
>>
>> On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
>>> From: KOSAKI Motohiro <[email protected]>
>>>
>>> Currently, do_try_to_free_pages() can enter livelock. Because of,
>>> now vmscan has two conflicted policies.
>>>
>>> 1) kswapd sleep when it couldn't reclaim any page when reaching
>>> priority 0. This is because to avoid kswapd() infinite
>>> loop. That said, kswapd assume direct reclaim makes enough
>>> free pages to use either regular page reclaim or oom-killer.
>>> This logic makes kswapd -> direct-reclaim dependency.
>>> 2) direct reclaim continue to reclaim without oom-killer until
>>> kswapd turn on zone->all_unreclaimble. This is because
>>> to avoid too early oom-kill.
>>> This logic makes direct-reclaim -> kswapd dependency.
>>>
>>> In worst case, direct-reclaim may continue to page reclaim forever
>>> when kswapd sleeps forever.
>>
>> I have tried imagined scenario you mentioned above with code level but
>> unfortunately I got failed.
>> If kswapd can't meet high watermark on order-0, it doesn't sleep if I don't miss something.
>
> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
> if node has multiple zones. Hm ok, I realized my descriptions was
> slightly misleading. priority 0 is not needed. bakance_pddat() calls
> pgdat_balanced()
> every priority. Most easy case is, movable zone has a lot of free pages and
> normal zone has no reclaimable page.
>
> btw, current pgdat_balanced() logic seems not correct. kswapd should
> sleep only if every zones have much free pages than high water mark
> _and_ 25% of present pages in node are free.
>


Sorry. I can't understand your point.
Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
It seems I am missing your point.
Please anybody correct me.

--
Kind regards,
Minchan Kim

2012-06-15 10:45:56

by Mel Gorman

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
> From: KOSAKI Motohiro <[email protected]>
>
> Currently, do_try_to_free_pages() can enter livelock. Because of,
> now vmscan has two conflicted policies.
>
> 1) kswapd sleep when it couldn't reclaim any page when reaching
> priority 0. This is because to avoid kswapd() infinite
> loop. That said, kswapd assume direct reclaim makes enough
> free pages to use either regular page reclaim or oom-killer.
> This logic makes kswapd -> direct-reclaim dependency.
> 2) direct reclaim continue to reclaim without oom-killer until
> kswapd turn on zone->all_unreclaimble. This is because
> to avoid too early oom-kill.
> This logic makes direct-reclaim -> kswapd dependency.
>
> In worst case, direct-reclaim may continue to page reclaim forever
> when kswapd sleeps forever.
>
> We can't turn on zone->all_unreclaimable from direct reclaim path
> because direct reclaim path don't take any lock and this way is racy.
>
> Thus this patch removes zone->all_unreclaimable field completely and
> recalculates zone reclaimable state every time.
>
> Note: we can't take the idea that direct-reclaim see zone->pages_scanned
> directly and kswapd continue to use zone->all_unreclaimable. Because, it
> is racy. commit 929bea7c71 (vmscan: all_unreclaimable() use
> zone->all_unreclaimable as a name) describes the detail.
>
> Reported-by: Aaditya Kumar <[email protected]>
> Reported-by: Ying Han <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2012-06-15 12:31:52

by Hillf Danton

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

Hi Minchan and KOSAKI

On Fri, Jun 15, 2012 at 3:27 PM, Minchan Kim <[email protected]> wrote:
> On 06/15/2012 01:10 AM, KOSAKI Motohiro wrote:
>
>> On Thu, Jun 14, 2012 at 10:57 AM, Minchan Kim <[email protected]> wrote:
>>> Hi KOSAKI,
>>>
>>> Sorry for late response.
>>> Let me ask a question about description.
>>>
>>> On Thu, Jun 14, 2012 at 04:13:12AM -0400, [email protected] wrote:
>>>> From: KOSAKI Motohiro <[email protected]>
>>>>
>>>> Currently, do_try_to_free_pages() can enter livelock. Because of,
>>>> now vmscan has two conflicted policies.
>>>>
>>>> 1) kswapd sleep when it couldn't reclaim any page when reaching
>>>>    priority 0. This is because to avoid kswapd() infinite
>>>>    loop. That said, kswapd assume direct reclaim makes enough
>>>>    free pages to use either regular page reclaim or oom-killer.
>>>>    This logic makes kswapd -> direct-reclaim dependency.
>>>> 2) direct reclaim continue to reclaim without oom-killer until
>>>>    kswapd turn on zone->all_unreclaimble. This is because
>>>>    to avoid too early oom-kill.
>>>>    This logic makes direct-reclaim -> kswapd dependency.
>>>>
>>>> In worst case, direct-reclaim may continue to page reclaim forever
>>>> when kswapd sleeps forever.
>>>
>>> I have tried imagined scenario you mentioned above with code level but
>>> unfortunately I got failed.
>>> If kswapd can't meet high watermark on order-0, it doesn't sleep if I don't miss something.
>>
>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>> if node has multiple zones. Hm ok, I realized my descriptions was
>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>> pgdat_balanced()
>> every priority. Most easy case is, movable zone has a lot of free pages and
>> normal zone has no reclaimable page.
>>
>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>> sleep only if every zones have much free pages than high water mark
>> _and_ 25% of present pages in node are free.
>>
>
>
> Sorry. I can't understand your point.
> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
> It seems I am missing your point.
> Please anybody correct me.
>

Who left the comment on unreclaimable there, and why?
/*
* balance_pgdat() skips over all_unreclaimable after
* DEF_PRIORITY. Effectively, it considers them balanced so
* they must be considered balanced here as well if kswapd
* is to sleep
*/

BTW, are you still using prefetch_prev_lru_page?

Good Weekend
Hillf

2012-06-16 17:48:11

by Aaditya Kumar

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <[email protected]> wrote:

>>
>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>> if node has multiple zones. Hm ok, I realized my descriptions was
>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>> pgdat_balanced()
>> every priority. Most easy case is, movable zone has a lot of free pages and
>> normal zone has no reclaimable page.
>>
>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>> sleep only if every zones have much free pages than high water mark
>> _and_ 25% of present pages in node are free.
>>
>
>
> Sorry. I can't understand your point.
> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
> It seems I am missing your point.
> Please anybody correct me.

Since direct reclaim is currently given up based on the
zone->all_unreclaimable flag, consider for example one of the scenarios:

Let's say the system has one node with two zones (NORMAL and MOVABLE) and we
hot-remove all the pages of the MOVABLE zone.

While migrating pages during memory hot-unplugging, the allocation function
(for the new page to which a page in the MOVABLE zone would be moved) can end
up looping in the direct reclaim path forever.

This is so because when most of the pages in the MOVABLE zone have been
migrated, the zone now contains lots of free memory (basically above the low
watermark), BUT all of it is on the MIGRATE_ISOLATE list of the buddy list.

So kswapd() would not balance this zone, as its free pages are above the low
watermark (but all are on the isolate list). So the zone->all_unreclaimable
flag would never be set for this zone, and the allocation function would end
up looping forever (assuming the NORMAL zone is left with no reclaimable
memory).
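
To make the accounting point concrete, the watermark checks involved
essentially compare NR_FREE_PAGES against a mark, roughly like the
sketch below (simplified; the real __zone_watermark_ok() also handles
lowmem reserves and per-order free lists):

static bool watermark_ok_sketch(struct zone *z, unsigned long mark)
{
	/*
	 * NR_FREE_PAGES still counts pages sitting on the MIGRATE_ISOLATE
	 * free list, even though the allocator can never hand them out.
	 */
	unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

	return free_pages >= mark;
}

So the zone looks fine to kswapd and to the allocator's watermark check,
while every real allocation attempt keeps failing and falling back into
direct reclaim.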


Regards,
Aaditya Kumar
Sony India Software Centre,
Bangalore.

2012-06-18 00:43:49

by Minchan Kim

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On 06/17/2012 02:48 AM, Aaditya Kumar wrote:

> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <[email protected]> wrote:
>
>>>
>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>> pgdat_balanced()
>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>> normal zone has no reclaimable page.
>>>
>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>> sleep only if every zones have much free pages than high water mark
>>> _and_ 25% of present pages in node are free.
>>>
>>
>>
>> Sorry. I can't understand your point.
>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>> It seems I am missing your point.
>> Please anybody correct me.
>
> Since currently direct reclaim is given up based on
> zone->all_unreclaimable flag,
> so for e.g in one of the scenarios:
>
> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
> hot-remove the all the pages of the MOVABLE zone.
>
> While migrating pages during memory hot-unplugging, the allocation function
> (for new page to which the page in MOVABLE zone would be moved) can end up
> looping in direct reclaim path for ever.
>
> This is so because when most of the pages in the MOVABLE zone have
> been migrated,
> the zone now contains lots of free memory (basically above low watermark)
> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>
> So kswapd() would not balance this zone as free pages are above low watermark
> (but all are in isolate list). So zone->all_unreclaimable flag would
> never be set for this zone
> and allocation function would end up looping forever. (assuming the
> zone NORMAL is
> left with no reclaimable memory)
>


Thanks a lot, Aaditya! The scenario you mention makes perfect sense.
But I don't see it as a problem of kswapd.

a5d76b54 added the new migration type 'MIGRATE_ISOLATE', which is a rather ironic type because there are many free pages on the free list
but we can't allocate them. :(
NR_FREE_PAGES doesn't reflect it correctly, while many places in the kernel use NR_FREE_PAGES to trigger some operation.
Kswapd is just one of the things it confuses.
As the right fix for this problem, we should fix the hotplug code, IMHO, which can fix CMA, too.

This patch could make NR_FREE_PAGES inconsistent with SumOf[free_area[order].nr_free],
and it could confuse __zone_watermark_ok, so we might need to fix move_freepages_block itself to reflect
free_area[order].nr_free exactly.

Any thoughts?

Side note: I still need KOSAKI's patch with a fixed description regardless of this problem, because having only kswapd set zone->all_unreclaimable is very fragile.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4403009..19de56c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)

out:
if (!ret) {
+ int pages_moved;
set_pageblock_migratetype(page, MIGRATE_ISOLATE);
- move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
}

spin_unlock_irqrestore(&zone->lock, flags);
@@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
{
struct zone *zone;
unsigned long flags;
+ int pages_moved;
zone = page_zone(page);
spin_lock_irqsave(&zone->lock, flags);
if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
goto out;
set_pageblock_migratetype(page, migratetype);
- move_freepages_block(zone, page, migratetype);
+ pages_moved = move_freepages_block(zone, page, migratetype);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
out:
spin_unlock_irqrestore(&zone->lock, flags);
}


>
> Regards,
> Aaditya Kumar
> Sony India Software Centre,
> Bangalore.
>



--
Kind regards,
Minchan Kim

2012-06-18 00:54:55

by Kamezawa Hiroyuki

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

(2012/06/18 9:43), Minchan Kim wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim<[email protected]> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>> pgdat_balanced()
>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>> normal zone has no reclaimable page.
>>>>
>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>> sleep only if every zones have much free pages than high water mark
>>>> _and_ 25% of present pages in node are free.
>>>>
>>>
>>>
>>> Sorry. I can't understand your point.
>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>> It seems I am missing your point.
>>> Please anybody correct me.
>>
>> Since currently direct reclaim is given up based on
>> zone->all_unreclaimable flag,
>> so for e.g in one of the scenarios:
>>
>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove the all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for new page to which the page in MOVABLE zone would be moved) can end up
>> looping in direct reclaim path for ever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated,
>> the zone now contains lots of free memory (basically above low watermark)
>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>
>> So kswapd() would not balance this zone as free pages are above low watermark
>> (but all are in isolate list). So zone->all_unreclaimable flag would
>> never be set for this zone
>> and allocation function would end up looping forever. (assuming the
>> zone NORMAL is
>> left with no reclaimable memory)
>>
>
>
> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
> But I don't see it's a problem of kswapd.
>
> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
> but we can't allocate it. :(
> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
> Kswapd is just one of them confused.
> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>
> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
> free_area[order].nr_free exactly.
>
> Any thought?
>
> Side Note: I still need KOSAKI's patch with fixed description regardless of this problem because set zone->all_unreclaimable of only kswapd is very fragile.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>
> out:
> if (!ret) {
> + int pages_moved;
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> - move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
> }
>
> spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> {
> struct zone *zone;
> unsigned long flags;
> + int pages_moved;
> zone = page_zone(page);
> spin_lock_irqsave(&zone->lock, flags);
> if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> goto out;
> set_pageblock_migratetype(page, migratetype);
> - move_freepages_block(zone, page, migratetype);
> + pages_moved = move_freepages_block(zone, page, migratetype);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> }
>


I think this patch is very good.

Thanks,
-Kame

2012-06-19 13:18:05

by Aaditya Kumar

Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On Mon, Jun 18, 2012 at 6:13 AM, Minchan Kim <[email protected]> wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <[email protected]> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>> pgdat_balanced()
>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>> normal zone has no reclaimable page.
>>>>
>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>> sleep only if every zones have much free pages than high water mark
>>>> _and_ 25% of present pages in node are free.
>>>>
>>>
>>>
>>> Sorry. I can't understand your point.
>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>> It seems I am missing your point.
>>> Please anybody correct me.
>>
>> Since currently direct reclaim is given up based on
>> zone->all_unreclaimable flag,
>> so for e.g in one of the scenarios:
>>
>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove the all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for new page to which the page in MOVABLE zone would be moved) can end up
>> looping in direct reclaim path for ever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated,
>> the zone now contains lots of free memory (basically above low watermark)
>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>
>> So kswapd() would not balance this zone as free pages are above low watermark
>> (but all are in isolate list). So zone->all_unreclaimable flag would
>> never be set for this zone
>> and allocation function would end up looping forever. (assuming the
>> zone NORMAL is
>> left with no reclaimable memory)
>>
>
>
> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
> But I don't see it's a problem of kswapd.

Hi Kim,

Yes I agree it is not a problem of kswapd.

> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
> but we can't allocate it. :(
> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
> Kswapd is just one of them confused.
> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>
> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]


I assume that by the inconsistency you mention above, you mean
temporary inconsistency.

Sorry, but IMHO, as far as memory hotplug is concerned, the main issue with
this patch is that the inconsistency you mention above would NOT be a
temporary inconsistency.

Every time, say, 'x' page frames are offlined, they will
introduce a difference of 'x' pages between
NR_FREE_PAGES and SumOf[free_area[order].nr_free].
(So, for example, if we do frequent offline/online it will make
NR_FREE_PAGES negative.)

This is so because unset_migratetype_isolate() is called from the
offlining code (to set the migrate type of the offlined pages back
to MIGRATE_MOVABLE)
after the pages have been offlined and removed from the buddy list.
Since the pages for which unset_migratetype_isolate() is called are
not buddy pages, move_freepages_block() does not move any page,
thus introducing a permanent inconsistency.

> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
> free_area[order].nr_free exactly.
>
> Any thought?

As for fixing move_freepages_block(): at least for memory hotplug,
the pages stay on the MIGRATE_ISOLATE list only for the duration of the
offline_pages() function,
I mean only temporarily. Since fixing move_freepages_block() will
introduce some overhead, I am not very sure whether that overhead
is justified
for a temporary condition. What do you think?


> Side Note: I still need KOSAKI's patch with fixed description regardless of this problem because set zone->all_unreclaimable of only kswapd is very fragile.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>
> out:
>        if (!ret) {
> +              int pages_moved;
>                set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -              move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +              pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +              __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>        }
>
>        spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> {
>        struct zone *zone;
>        unsigned long flags;
> +      int pages_moved;
>        zone = page_zone(page);
>        spin_lock_irqsave(&zone->lock, flags);
>        if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>                goto out;
>        set_pageblock_migratetype(page, migratetype);
> -      move_freepages_block(zone, page, migratetype);
> +      pages_moved = move_freepages_block(zone, page, migratetype);
> +      __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
> out:
>        spin_unlock_irqrestore(&zone->lock, flags);
> }
>
>
>>
>> Regards,
>> Aaditya Kumar
>> Sony India Software Centre,
>> Bangalore.
>>
>
>
>
> --
> Kind regards,
> Minchan Kim

Regards,
Aaditya Kumar
Sony India Software Centre,
Bangalore.

2012-06-19 21:18:00

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

> Who left comment on unreclaimable there, and why?
> /*
> * balance_pgdat() skips over all_unreclaimable after
> * DEF_PRIORITY. Effectively, it considers them balanced so
> * they must be considered balanced here as well if kswapd
> * is to sleep
> */

Thank you for finding this! I'll fix it.



> BTW, are you still using prefetch_prev_lru_page?

No. We should kill it. ;-)

2012-06-19 22:17:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

(6/17/12 8:43 PM), Minchan Kim wrote:
> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>
>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <[email protected]> wrote:
>>
>>>>
>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>> pgdat_balanced()
>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>> normal zone has no reclaimable page.
>>>>
>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>> sleep only if every zones have much free pages than high water mark
>>>> _and_ 25% of present pages in node are free.
>>>>
>>>
>>>
>>> Sorry. I can't understand your point.
>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>> It seems I am missing your point.
>>> Please anybody correct me.
>>
>> Since currently direct reclaim is given up based on
>> zone->all_unreclaimable flag,
>> so for e.g in one of the scenarios:
>>
>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>> hot-remove the all the pages of the MOVABLE zone.
>>
>> While migrating pages during memory hot-unplugging, the allocation function
>> (for new page to which the page in MOVABLE zone would be moved) can end up
>> looping in direct reclaim path for ever.
>>
>> This is so because when most of the pages in the MOVABLE zone have
>> been migrated,
>> the zone now contains lots of free memory (basically above low watermark)
>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>
>> So kswapd() would not balance this zone as free pages are above low watermark
>> (but all are in isolate list). So zone->all_unreclaimable flag would
>> never be set for this zone
>> and allocation function would end up looping forever. (assuming the
>> zone NORMAL is
>> left with no reclaimable memory)
>>
>
>
> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
> But I don't see it's a problem of kswapd.
>
> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
> but we can't allocate it. :(
> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
> Kswapd is just one of them confused.
> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>
> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
> free_area[order].nr_free exactly.
>
> Any thought?
>
> Side Note: I still need KOSAKI's patch with fixed description regardless of this problem because set zone->all_unreclaimable of only kswapd is very fragile.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4403009..19de56c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>
> out:
> if (!ret) {
> + int pages_moved;
> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> - move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
> }
>
> spin_unlock_irqrestore(&zone->lock, flags);
> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
> {
> struct zone *zone;
> unsigned long flags;
> + int pages_moved;
> zone = page_zone(page);
> spin_lock_irqsave(&zone->lock, flags);
> if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
> goto out;
> set_pageblock_migratetype(page, migratetype);
> - move_freepages_block(zone, page, migratetype);
> + pages_moved = move_freepages_block(zone, page, migratetype);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
> out:
> spin_unlock_irqrestore(&zone->lock, flags);
> }

Unfortunately, this doesn't work, for two reasons.

1) When memory hotplug occurs, we have two scenarios: a) a page is freed and then its pageblock
is changed to MIGRATE_ISOLATE, and b) the pageblock is changed to MIGRATE_ISOLATE and pages are
freed into it afterwards. The above patch only handles scenario (a), so it still leads to a
confusing NR_FREE_PAGES value. _If_ we put a new branch in the page-free hotpath, we could solve
scenario (b), but I don't like that, because zero hotpath overhead is one of the memory hotplug
design principles.

2) Even if we could solve the above issue, the all_unreclaimable logic is still broken, because
__alloc_pages_slowpath() wakes up kswapd only once and does not wake it up again on the
"goto rebalance" path. Moreover, wake_all_kswapd() is racy and gives no guarantee of waking
kswapd up. That means direct reclaim should work fine without background reclaim.
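
To be concrete about scenario (b) and the branch I mean, it would have to look roughly
like this (purely illustrative, not a patch I am proposing; the per-cpu bulk free path
would need the same treatment, and this is exactly the hotpath cost I want to avoid):

	/* inside the zone->lock section of the page-free path */
	__free_one_page(page, zone, order, migratetype);

	/* hypothetical branch: pages freed into a MIGRATE_ISOLATE pageblock
	 * cannot be allocated, so don't count them in NR_FREE_PAGES */
	if (migratetype != MIGRATE_ISOLATE)
		__mod_zone_page_state(zone, NR_FREE_PAGES, 1 << order);

Without such a branch, every page freed into an already-isolated block (for example by
migration during offlining) bumps NR_FREE_PAGES even though it is unallocatable, which
is scenario (b).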

2012-06-20 06:18:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [resend][PATCH] mm, vmscan: fix do_try_to_free_pages() livelock

On 06/20/2012 07:17 AM, KOSAKI Motohiro wrote:

> (6/17/12 8:43 PM), Minchan Kim wrote:
>> On 06/17/2012 02:48 AM, Aaditya Kumar wrote:
>>
>>> On Fri, Jun 15, 2012 at 12:57 PM, Minchan Kim <[email protected]> wrote:
>>>
>>>>>
>>>>> pgdat_balanced() doesn't recognized zone. Therefore kswapd may sleep
>>>>> if node has multiple zones. Hm ok, I realized my descriptions was
>>>>> slightly misleading. priority 0 is not needed. bakance_pddat() calls
>>>>> pgdat_balanced()
>>>>> every priority. Most easy case is, movable zone has a lot of free pages and
>>>>> normal zone has no reclaimable page.
>>>>>
>>>>> btw, current pgdat_balanced() logic seems not correct. kswapd should
>>>>> sleep only if every zones have much free pages than high water mark
>>>>> _and_ 25% of present pages in node are free.
>>>>>
>>>>
>>>>
>>>> Sorry. I can't understand your point.
>>>> Current kswapd doesn't sleep if relevant zones don't have free pages above high watermark.
>>>> It seems I am missing your point.
>>>> Please anybody correct me.
>>>
>>> Since currently direct reclaim is given up based on
>>> zone->all_unreclaimable flag,
>>> so for e.g in one of the scenarios:
>>>
>>> Lets say system has one node with two zones (NORMAL and MOVABLE) and we
>>> hot-remove the all the pages of the MOVABLE zone.
>>>
>>> While migrating pages during memory hot-unplugging, the allocation function
>>> (for new page to which the page in MOVABLE zone would be moved) can end up
>>> looping in direct reclaim path for ever.
>>>
>>> This is so because when most of the pages in the MOVABLE zone have
>>> been migrated,
>>> the zone now contains lots of free memory (basically above low watermark)
>>> BUT all are in MIGRATE_ISOLATE list of the buddy list.
>>>
>>> So kswapd() would not balance this zone as free pages are above low watermark
>>> (but all are in isolate list). So zone->all_unreclaimable flag would
>>> never be set for this zone
>>> and allocation function would end up looping forever. (assuming the
>>> zone NORMAL is
>>> left with no reclaimable memory)
>>>
>>
>>
>> Thanks a lot, Aaditya! Scenario you mentioned makes perfect.
>> But I don't see it's a problem of kswapd.
>>
>> a5d76b54 made new migration type 'MIGRATE_ISOLATE' which is very irony type because there are many free pages in free list
>> but we can't allocate it. :(
>> It doesn't reflect right NR_FREE_PAGES while many places in the kernel use NR_FREE_PAGES to trigger some operation.
>> Kswapd is just one of them confused.
>> As right fix of this problem, we should fix hot plug code, IMHO which can fix CMA, too.
>>
>> This patch could make inconsistency between NR_FREE_PAGES and SumOf[free_area[order].nr_free]
>> and it could make __zone_watermark_ok confuse so we might need to fix move_freepages_block itself to reflect
>> free_area[order].nr_free exactly.
>>
>> Any thought?
>>
>> Side Note: I still need KOSAKI's patch with fixed description regardless of this problem because set zone->all_unreclaimable of only kswapd is very fragile.
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 4403009..19de56c 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5593,8 +5593,10 @@ int set_migratetype_isolate(struct page *page)
>>
>> out:
>> if (!ret) {
>> + int pages_moved;
>> set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>> - move_freepages_block(zone, page, MIGRATE_ISOLATE);
>> + pages_moved = move_freepages_block(zone, page, MIGRATE_ISOLATE);
>> + __mod_zone_page_state(zone, NR_FREE_PAGES, -pages_moved);
>> }
>>
>> spin_unlock_irqrestore(&zone->lock, flags);
>> @@ -5607,12 +5609,14 @@ void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>> {
>> struct zone *zone;
>> unsigned long flags;
>> + int pages_moved;
>> zone = page_zone(page);
>> spin_lock_irqsave(&zone->lock, flags);
>> if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
>> goto out;
>> set_pageblock_migratetype(page, migratetype);
>> - move_freepages_block(zone, page, migratetype);
>> + pages_moved = move_freepages_block(zone, page, migratetype);
>> + __mod_zone_page_state(zone, NR_FREE_PAGES, pages_moved);
>> out:
>> spin_unlock_irqrestore(&zone->lock, flags);
>> }
>
> Unfortunately, this doesn't work, for two reasons.
>
> 1) When memory hotplug occurs, we have two scenarios: a) a page is freed and then its pageblock
> is changed to MIGRATE_ISOLATE, and b) the pageblock is changed to MIGRATE_ISOLATE and pages are
> freed into it afterwards. The above patch only handles scenario (a), so it still leads to a
> confusing NR_FREE_PAGES value. _If_ we put a new branch in the page-free hotpath, we could solve
> scenario (b), but I don't like that, because zero hotpath overhead is one of the memory hotplug
> design principles.


Yep. Aaditya already pointed that out, and I just sent another patch.
Let's talk about this problem in another thread, because it's not a direct/background reclaim problem:
http://lkml.org/lkml/2012/6/20/30


> 2) Even if we could solve the above issue, the all_unreclaimable logic is still broken, because
> __alloc_pages_slowpath() wakes up kswapd only once and does not wake it up again on the
> "goto rebalance" path. Moreover, wake_all_kswapd() is racy and gives no guarantee of waking
> kswapd up. That means direct reclaim should work fine without background reclaim.


We can fix that easily in the direct reclaim path, but I think your approach still makes sense,
because the current scheme, where only kswapd sets zone->all_unreclaimable, is very fragile with respect to livelock.
So if you send your patch again with a rewritten description, I have no objection.

Thanks.
--
Kind regards,
Minchan Kim