2010-12-02 16:01:19

by Minchan Kim

[permalink] [raw]
Subject: [PATCH] vmscan: make kswapd use a correct order

If we wake up prematurely, it means we should keep going on
reclaiming not new order page but at old order page.
Sometime new order can be smaller than old order by below
race so it could make failure of old order page reclaiming.

T0: Task 1 wakes up kswapd with order-3
T1: So, kswapd starts to reclaim pages using balance_pgdat
T2: Task 2 wakes up kswapd with order-2 because pages reclaimed
by T1 are consumed quickly.
T3: kswapd exits balance_pgdat and will do following:
T4-1: In beginning of kswapd's loop, pgdat->kswapd_max_order will
be reset with zero.
T4-2: 'order' will be set to pgdat->kswapd_max_order(0), since it
enters the false branch of 'if (order (3) < new_order (2))'
T4-3: If previous balance_pgdat can't meet requirement of order-2
free pages by high watermark, it will start reclaiming again.
So balance_pgdat will use order-0 to do reclaim while it
really should use order-2 at the moment.
T4-4: At last, Task 1 can't get the any page if it wanted with
GFP_ATOMIC.

Reported-by: Shaohua Li <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Shaohua Li <[email protected]>
Cc: Mel Gorman <[email protected]>
---
mm/vmscan.c | 23 +++++++++++++++++++----
1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..27d0839 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2447,13 +2447,18 @@ out:
return sc.nr_reclaimed;
}

-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
+/*
+ * Return true if we slept enough. Otherwise, return false
+ */
+static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
{
long remaining = 0;
+ bool slept = false;
+
DEFINE_WAIT(wait);

if (freezing(current) || kthread_should_stop())
- return;
+ return slept;

prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);

@@ -2482,6 +2487,7 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
schedule();
set_pgdat_percpu_threshold(pgdat, calculate_pressure_threshold);
+ slept = true;
} else {
if (remaining)
count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
@@ -2489,6 +2495,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order)
count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
}
finish_wait(&pgdat->kswapd_wait, &wait);
+
+ return slept;
}

/*
@@ -2550,8 +2558,15 @@ static int kswapd(void *p)
*/
order = new_order;
} else {
- kswapd_try_to_sleep(pgdat, order);
- order = pgdat->kswapd_max_order;
+ /*
+ * If we wake up after enough sleeping, it means
+ * we reclaimed enough pages at that order. so
+ * we starts reclaim new order in this time.
+ * Otherwise, it was a premature sleep so we should
+ * keep going on reclaiming at that order pages.
+ */
+ if (kswapd_try_to_sleep(pgdat, order))
+ order = pgdat->kswapd_max_order;
}

ret = try_to_freeze();
--
1.7.0.4


2010-12-03 12:12:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] vmscan: make kswapd use a correct order

On Fri, Dec 03, 2010 at 01:00:49AM +0900, Minchan Kim wrote:
> If we wake up prematurely, it means we should keep going on
> reclaiming not new order page but at old order page.
> Sometime new order can be smaller than old order by below
> race so it could make failure of old order page reclaiming.
>
> T0: Task 1 wakes up kswapd with order-3
> T1: So, kswapd starts to reclaim pages using balance_pgdat
> T2: Task 2 wakes up kswapd with order-2 because pages reclaimed
> by T1 are consumed quickly.
> T3: kswapd exits balance_pgdat and will do following:
> T4-1: In beginning of kswapd's loop, pgdat->kswapd_max_order will
> be reset with zero.
> T4-2: 'order' will be set to pgdat->kswapd_max_order(0), since it
> enters the false branch of 'if (order (3) < new_order (2))'
> T4-3: If previous balance_pgdat can't meet requirement of order-2
> free pages by high watermark, it will start reclaiming again.
> So balance_pgdat will use order-0 to do reclaim while it
> really should use order-2 at the moment.
> T4-4: At last, Task 1 can't get the any page if it wanted with
> GFP_ATOMIC.
>
> Reported-by: Shaohua Li <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Shaohua Li <[email protected]>
> Cc: Mel Gorman <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-12-09 22:14:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] vmscan: make kswapd use a correct order

On Fri, 3 Dec 2010 01:00:49 +0900
Minchan Kim <[email protected]> wrote:

> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)

OT: kswapd_try_to_sleep() does a
trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
time, but doesn't trace anything at all if it does a short sleep.
Where's the sense in that?

2010-12-10 03:53:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] vmscan: make kswapd use a correct order

On Fri, Dec 10, 2010 at 7:13 AM, Andrew Morton
<[email protected]> wrote:
> On Fri, ?3 Dec 2010 01:00:49 +0900
> Minchan Kim <[email protected]> wrote:
>
>> +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>
> OT: kswapd_try_to_sleep() does a
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
> time, but doesn't trace anything at all if it does a short sleep.
> Where's the sense in that?
>

AFAIU, short sleep is _sleep_ but that trace's goal is to count only long sleep.
In addition, short sleep is a just ready to go or not long sleep so I
think we don't need short sleep trace.
And for knowing short sleep count, we can use
KSWAPD_{LOW|HIGH}_WMARK_HIT_QUICKLY.

--
Kind regards,
Minchan Kim

2010-12-10 11:17:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] vmscan: make kswapd use a correct order

On Thu, Dec 09, 2010 at 02:13:17PM -0800, Andrew Morton wrote:
> On Fri, 3 Dec 2010 01:00:49 +0900
> Minchan Kim <[email protected]> wrote:
>
> > +static bool kswapd_try_to_sleep(pg_data_t *pgdat, int order)
>
> OT: kswapd_try_to_sleep() does a
> trace_mm_vmscan_kswapd_sleep(pgdat->node_id) if it sleeps for a long
> time, but doesn't trace anything at all if it does a short sleep.
> Where's the sense in that?
>

The tracepoint is to mark when kswapd is going fully to sleep and being
inactive because all its work is done. The tracepoints name might be
unfortunate because it's really used to track if kswapd is active or
inactive rather than sleeping.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab