LinuxLists.cc - [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small

Subject: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

A problem occurs if the highest zone is small. balance_pgdat()
only considers unreclaimable zones when priority is DEF_PRIORITY
but sleeping_prematurely considers all zones. It's possible for this
sequence to occur:

  1. kswapd wakes up and enters balance_pgdat()
  2. At DEF_PRIORITY, marks highest zone unreclaimable
  3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
  4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
     highest zone, clearing all_unreclaimable. Highest zone
     is still unbalanced
  5. kswapd returns and calls sleeping_prematurely
  6. sleeping_prematurely looks at *all* zones, not just the ones
     being considered by balance_pgdat. The highest small zone
     has all_unreclaimable cleared but the zone is not
     balanced. all_zones_ok is false so kswapd stays awake

This patch corrects the behaviour of sleeping_prematurely to check
the zones balance_pgdat() checked.
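
The effect of the loop bound can be sketched with a toy model. This is not the kernel code: struct toy_zone, its fields and toy_zones_need_work() are hypothetical names, and the watermark check is reduced to a precomputed flag. The last_idx argument stands in for either pgdat->nr_zones - 1 (old bound) or classzone_idx (patched bound).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct zone; field names are illustrative only. */
struct toy_zone {
	bool populated;
	bool all_unreclaimable;
	bool balanced;		/* assumed: free pages above the high watermark */
};

/*
 * Return true if any zone in [0, last_idx] would keep kswapd awake.
 * Passing nr_zones - 1 models the old loop bound; passing the classzone
 * index models the patched one.
 */
bool toy_zones_need_work(const struct toy_zone *zones, int nr_zones,
			 int last_idx)
{
	int i;

	assert(last_idx >= 0 && last_idx < nr_zones);

	for (i = 0; i <= last_idx; i++) {
		const struct toy_zone *zone = &zones[i];

		if (!zone->populated)
			continue;
		/* Zones already given up on do not block sleeping. */
		if (zone->all_unreclaimable)
			continue;
		if (!zone->balanced)
			return true;
	}
	return false;
}
```

With three zones where only the highest is unbalanced and has all_unreclaimable cleared, the old bound reports work to do while a bound of 1 does not, which matches step 6 of the sequence above.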

Reported-and-tested-by: Pádraig Brady <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8ff834e..841e3bf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		return true;

 	/* Check the watermark levels */
-	for (i = 0; i < pgdat->nr_zones; i++) {
+	for (i = 0; i <= classzone_idx; i++) {
 		struct zone *zone = pgdat->node_zones + i;

 		if (!populated_zone(zone))
--
1.7.3.4

2011-06-24 14:46:02

Subject: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour.

When kswapd applies pressure to zones during node balancing, it checks
if the zone is above a high+balance_gap threshold. If it is, it does
not apply pressure but it unconditionally shrinks slab on a global
basis which is excessive. In the event kswapd is being kept awake due to
a high small unreclaimable zone, it skips zone shrinking but still
calls shrink_slab().

Once pressure has been applied, the check for the zone being
unreclaimable is made before the check of whether all_unreclaimable
should be set. A zone that has just become unreclaimable is therefore
missed, which can cause has_under_min_watermark_zone to be set because
of an unreclaimable zone, preventing kswapd from backing off on
congestion_wait().
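
As a rough illustration of the first change, the sketch below counts how often slab would be shrunk for a single zone under the old and new control flow. All names (toy_zone_state, toy_counters, toy_balance_old, toy_balance_new) are hypothetical, and the real watermark test is reduced to a plain comparison.

```c
#include <assert.h>

/* Toy per-zone numbers; the values used with this are made up. */
struct toy_zone_state {
	long free_pages;
	long high_wmark;
	long balance_gap;
};

struct toy_counters {
	int zone_shrinks;
	int slab_shrinks;
};

/* Old flow: slab is shrunk on every pass, whether or not the zone is. */
void toy_balance_old(const struct toy_zone_state *z, struct toy_counters *c)
{
	assert(z && c);
	if (z->free_pages <= z->high_wmark + z->balance_gap)
		c->zone_shrinks++;
	c->slab_shrinks++;
}

/* Patched flow: slab is shrunk only when pressure is applied to the zone. */
void toy_balance_new(const struct toy_zone_state *z, struct toy_counters *c)
{
	assert(z && c);
	if (z->free_pages <= z->high_wmark + z->balance_gap) {
		c->zone_shrinks++;
		c->slab_shrinks++;
	}
}
```

For a zone comfortably above high+balance_gap, the old flow still records a slab shrink on each pass while the patched flow records none.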

Reported-and-tested-by: Pádraig Brady <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 23 +++++++++++++----------
1 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 841e3bf..9cebed1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2507,18 +2507,18 @@ loop_again:
 					KSWAPD_ZONE_BALANCE_GAP_RATIO);
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone) + balance_gap,
-					end_zone, 0))
+					end_zone, 0)) {
 				shrink_zone(priority, zone, &sc);
-			reclaim_state->reclaimed_slab = 0;
-			nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
-			sc.nr_reclaimed += reclaim_state->reclaimed_slab;
-			total_scanned += sc.nr_scanned;

-			if (zone->all_unreclaimable)
-				continue;
-			if (nr_slab == 0 &&
-			    !zone_reclaimable(zone))
-				zone->all_unreclaimable = 1;
+				reclaim_state->reclaimed_slab = 0;
+				nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
+				sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+				total_scanned += sc.nr_scanned;
+
+				if (nr_slab == 0 && !zone_reclaimable(zone))
+					zone->all_unreclaimable = 1;
+			}
+
 			/*
 			 * If we've done a decent amount of scanning and
 			 * the reclaim ratio is low, start doing writepage
@@ -2528,6 +2528,9 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;

+			if (zone->all_unreclaimable)
+				continue;
+
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
 				all_zones_ok = 0;
--
1.7.3.4

2011-06-24 14:46:00

Subject: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone

When deciding if kswapd is sleeping prematurely, the classzone is
taken into account but this is different to what balance_pgdat() and
the allocator are doing. Specifically, the DMA zone will be checked
based on the classzone used when waking kswapd which could be for a
GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
the watermark is not met and kswapd thinks it's sleeping prematurely,
keeping kswapd awake in error.
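
To see why the classzone index matters, here is a simplified model of a watermark check that adds a per-classzone lowmem reserve to the mark. It is a sketch only: toy_wmark_zone, TOY_NR_ZONES and toy_watermark_ok() are hypothetical, order handling is omitted, and any page counts fed to it are invented.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ZONES 3	/* e.g. DMA, NORMAL, HIGHMEM in this toy model */

struct toy_wmark_zone {
	long free_pages;
	long high_wmark;
	/* Pages held back from requests that could use a higher zone. */
	long lowmem_reserve[TOY_NR_ZONES];
};

/*
 * Simplified watermark test: the zone passes only if its free pages
 * exceed the mark plus the reserve for the given classzone index.
 */
bool toy_watermark_ok(const struct toy_wmark_zone *zone, int classzone_idx)
{
	assert(classzone_idx >= 0 && classzone_idx < TOY_NR_ZONES);
	return zone->free_pages >
	       zone->high_wmark + zone->lowmem_reserve[classzone_idx];
}
```

A small low zone with 3000 free pages, a high watermark of 128 and reserves of {0, 2000, 4000} fails when judged against classzone index 2 but passes when judged against its own index 0, which is the situation the patch description attributes to the DMA zone.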

Reported-and-tested-by: Pádraig Brady <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9cebed1..a76b6cc2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 		}

 		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
-							classzone_idx, 0))
+							i, 0))
 			all_zones_ok = false;
 		else
 			balanced += zone->present_pages;
--
1.7.3.4

2011-06-24 14:45:09

Subject: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully

During allocator-intensive workloads, kswapd will be woken frequently
causing free memory to oscillate between the high and min watermark.
This is expected behaviour. Unfortunately, if the highest zone is
small, a problem occurs.

When balance_pgdat() returns, it may be at a lower classzone_idx than
it started because the highest zone was unreclaimable. Before checking
if it should go to sleep though, it checks pgdat->classzone_idx which
when there is no other activity will be MAX_NR_ZONES-1. It interprets
this as it has been woken up while reclaiming, skips scheduling and
reclaims again. As there is no useful reclaim work to do, it enters
into a loop of shrinking slab consuming loads of CPU until the highest
zone becomes reclaimable for a long period of time.

There are two problems here. 1) If the returned classzone or order is
lower, it'll continue reclaiming without scheduling. 2) If the highest
zone was marked unreclaimable but balance_pgdat() returns immediately
at DEF_PRIORITY, the new lower classzone is not communicated back to
kswapd() for sleeping.

This patch does two things that are related. If the end_zone is
unreclaimable, this information is communicated back. Second, if
the classzone or order was reduced due to failing to reclaim, new
information is not read from pgdat and instead an attempt is made to go
to sleep. Due to this, it is also necessary that pgdat->classzone_idx
be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
being interpreted as wakeups.
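
The decision described above can be modelled in isolation. The sketch below is an assumption-laden reduction of the kswapd() loop head: toy_pgdat, toy_kswapd_state and toy_should_skip_sleep() are hypothetical names, and nothing else in the loop (freezing, the actual sleep, balance_pgdat itself) is represented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the pgdat fields a waker would update. */
struct toy_pgdat {
	int nr_zones;
	unsigned long kswapd_max_order;
	int classzone_idx;
};

/* What kswapd last reclaimed at, and the latest request it has read. */
struct toy_kswapd_state {
	unsigned long order, new_order;
	int classzone_idx, new_classzone_idx;
};

/*
 * Return true if kswapd should reclaim again without trying to sleep.
 * A new request is read from pgdat only when the previous pass ended at
 * the order and classzone that were asked for.
 */
bool toy_should_skip_sleep(struct toy_pgdat *pgdat,
			   struct toy_kswapd_state *s)
{
	assert(pgdat && s);

	if (s->classzone_idx >= s->new_classzone_idx &&
	    s->order == s->new_order) {
		s->new_order = pgdat->kswapd_max_order;
		s->new_classzone_idx = pgdat->classzone_idx;
		/* Reset so an untouched pgdat is not read as a wakeup. */
		pgdat->kswapd_max_order = 0;
		pgdat->classzone_idx = pgdat->nr_zones - 1;
	}

	/* A larger order or a lower classzone request means keep going. */
	return s->order < s->new_order ||
	       s->classzone_idx > s->new_classzone_idx;
}
```

If the previous pass came back at classzone index 1 after being asked for 2 and nobody has woken kswapd since, the function returns false and the caller would go on to attempt sleeping. If the previous pass succeeded and a waker has meanwhile posted a higher order, it returns true.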

Reported-and-tested-by: Pádraig Brady <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 34 +++++++++++++++++++++-------------
1 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a76b6cc2..fe854d7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2448,7 +2448,6 @@ loop_again:
 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), 0, 0)) {
 				end_zone = i;
-				*classzone_idx = i;
 				break;
 			}
 		}
@@ -2528,8 +2527,11 @@ loop_again:
 			    total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
 				sc.may_writepage = 1;

-			if (zone->all_unreclaimable)
+			if (zone->all_unreclaimable) {
+				if (end_zone && end_zone == i)
+					end_zone--;
 				continue;
+			}

 			if (!zone_watermark_ok_safe(zone, order,
 					high_wmark_pages(zone), end_zone, 0)) {
@@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
  */
 static int kswapd(void *p)
 {
-	unsigned long order;
-	int classzone_idx;
+	unsigned long order, new_order;
+	int classzone_idx, new_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
 	struct task_struct *tsk = current;

@@ -2740,17 +2742,23 @@ static int kswapd(void *p)
 	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 	set_freezable();

-	order = 0;
-	classzone_idx = MAX_NR_ZONES - 1;
+	order = new_order = 0;
+	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
 	for ( ; ; ) {
-		unsigned long new_order;
-		int new_classzone_idx;
 		int ret;

-		new_order = pgdat->kswapd_max_order;
-		new_classzone_idx = pgdat->classzone_idx;
-		pgdat->kswapd_max_order = 0;
-		pgdat->classzone_idx = MAX_NR_ZONES - 1;
+		/*
+		 * If the last balance_pgdat was unsuccessful it's unlikely a
+		 * new request of a similar or harder type will succeed soon
+		 * so consider going to sleep on the basis we reclaimed at
+		 */
+		if (classzone_idx >= new_classzone_idx && order == new_order) {
+			new_order = pgdat->kswapd_max_order;
+			new_classzone_idx = pgdat->classzone_idx;
+			pgdat->kswapd_max_order = 0;
+			pgdat->classzone_idx = pgdat->nr_zones - 1;
+		}
+
 		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
 			 * Don't sleep if someone wants a larger 'order'
@@ -2763,7 +2771,7 @@ static int kswapd(void *p)
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
 			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = MAX_NR_ZONES - 1;
+			pgdat->classzone_idx = pgdat->nr_zones - 1;
 		}

 		ret = try_to_freeze();
--
1.7.3.4

2011-06-25 14:23:59

by Andrew Lutomirski

Subject: Re: [PATCH 0/4] Stop kswapd consuming 100% CPU when highest zone is small

On Fri, Jun 24, 2011 at 8:44 AM, Mel Gorman <[email protected]> wrote:
> (Built this time and passed a basic sniff-test.)
>
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour. Unfortunately, if the highest zone is
> small, a problem occurs.
>

[...]

I've been running these for a couple days with no problems, although I
haven't been trying to reproduce the problem. (Well, no problems
related to memory management.)

I suspect that my pet unnecessary-OOM-kill bug is still around, but
that's probably not related, especially since I can trigger it if I
stick 8 GB of RAM in this laptop.

Thanks,
Andy

2011-06-25 21:34:06

Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small. balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
> 1. kswapd wakes up and enters balance_pgdat()
> 2. At DEF_PRIORITY, marks highest zone unreclaimable
> 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> highest zone, clearing all_unreclaimable. Highest zone
> is still unbalanced
> 5. kswapd returns and calls sleeping_prematurely
> 6. sleeping_prematurely looks at *all* zones, not just the ones
> being considered by balance_pgdat. The highest small zone
> has all_unreclaimable cleared but the zone is not
> balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-25 21:41:13

Subject: Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
>
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-25 21:42:26

Subject: Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> When deciding if kswapd is sleeping prematurely, the classzone is
> taken into account but this is different to what balance_pgdat() and
> the allocator are doing. Specifically, the DMA zone will be checked
> based on the classzone used when waking kswapd which could be for a
> GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> the watermark is not met and kswapd thinks it's sleeping prematurely,
> keeping kswapd awake in error.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-25 23:18:22

Subject: Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully

On 06/24/2011 10:44 AM, Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour. Unfortunately, if the highest zone is
> small, a problem occurs.
>
> When balance_pgdat() returns, it may be at a lower classzone_idx than
> it started because the highest zone was unreclaimable. Before checking
> if it should go to sleep though, it checks pgdat->classzone_idx which
> when there is no other activity will be MAX_NR_ZONES-1. It interprets
> this as it has been woken up while reclaiming, skips scheduling and
> reclaims again. As there is no useful reclaim work to do, it enters
> into a loop of shrinking slab consuming loads of CPU until the highest
> zone becomes reclaimable for a long period of time.
>
> There are two problems here. 1) If the returned classzone or order is
> lower, it'll continue reclaiming without scheduling. 2) if the highest
> zone was marked unreclaimable but balance_pgdat() returns immediately
> at DEF_PRIORITY, the new lower classzone is not communicated back to
> kswapd() for sleeping.
>
> This patch does two things that are related. If the end_zone is
> unreclaimable, this information is communicated back. Second, if
> the classzone or order was reduced due to failing to reclaim, new
> information is not read from pgdat and instead an attempt is made to go
> to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> being interpreted as wakeups.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2011-06-27 06:11:22

Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <[email protected]> wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small. balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
> 1. kswapd wakes up and enters balance_pgdat()
> 2. At DEF_PRIORITY, marks highest zone unreclaimable
> 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> highest zone, clearing all_unreclaimable. Highest zone
> is still unbalanced
> 5. kswapd returns and calls sleeping_prematurely
> 6. sleeping_prematurely looks at *all* zones, not just the ones
> being considered by balance_pgdat. The highest small zone
> has all_unreclaimable cleared but the zone is not
> balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-06-27 06:54:39

Subject: Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <[email protected]> wrote:
> When deciding if kswapd is sleeping prematurely, the classzone is
> taken into account but this is different to what balance_pgdat() and
> the allocator are doing. Specifically, the DMA zone will be checked
> based on the classzone used when waking kswapd which could be for a
> GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> the watermark is not met and kswapd thinks it's sleeping prematurely,
> keeping kswapd awake in error.

I thought this was intentional when you first submitted the patch:
"Kswapd makes sure zones include enough free pages (i.e., include the
reserve limit of the zones above)."
But you seem to have found that the DMA zone can never meet that
requirement in some situations, so kswapd doesn't sleep.
Right?

>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9cebed1..a76b6cc2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> }
>
> if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> - classzone_idx, 0))
> + i, 0))

Isn't it better to use 0 instead of i?

--
Kind regards,
Minchan Kim

2011-06-28 12:53:33

Subject: Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone

On Mon, Jun 27, 2011 at 03:53:04PM +0900, Minchan Kim wrote:
> On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <[email protected]> wrote:
> > When deciding if kswapd is sleeping prematurely, the classzone is
> > taken into account but this is different to what balance_pgdat() and
> > the allocator are doing. Specifically, the DMA zone will be checked
> > based on the classzone used when waking kswapd which could be for a
> > GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
> > the watermark is not met and kswapd thinks it's sleeping prematurely,
> > keeping kswapd awake in error.
>
>
> I thought it was intentional when you submitted a patch firstly.

It was, it also wasn't right.

> "Kswapd makes sure zones include enough free pages(ie, include reserve
> limit of above zones).
> But you seem to see DMA zone can't meet above requirement forever in
> some situation so that kswapd doesn't sleep.
> Right?
>

Right.

> >
> > Reported-and-tested-by: Pádraig Brady <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> >  mm/vmscan.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9cebed1..a76b6cc2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >  		}
> >
> >  		if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
> > -							classzone_idx, 0))
> > +							i, 0))
>
> Isn't it better to use 0 instead of i?
>

I considered it but went with i to compromise between making sure zones
included enough free pages without requiring that ZONE_DMA meet an
almost impossible requirement when under continual memory pressure.

--
Mel Gorman
SUSE Labs

2011-06-28 21:49:54

by Andrew Morton

Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

On Fri, 24 Jun 2011 15:44:54 +0100
Mel Gorman <[email protected]> wrote:

> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small. balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
> 1. kswapd wakes up and enters balance_pgdat()
> 2. At DEF_PRIORITY, marks highest zone unreclaimable
> 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> highest zone, clearing all_unreclaimable. Highest zone
> is still unbalanced
> 5. kswapd returns and calls sleeping_prematurely
> 6. sleeping_prematurely looks at *all* zones, not just the ones
> being considered by balance_pgdat. The highest small zone
> has all_unreclaimable cleared but the zone is not
> balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.

But kswapd is making progress: it's reclaiming slab. Eventually that
won't work any more and all_unreclaimable will not be cleared and the
condition will fix itself up?

btw,

	if (!sleeping_prematurely(...))
		sleep();

hurts my brain. My brain would prefer

	if (kswapd_should_sleep(...))
		sleep();

no?
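
The renaming suggested here amounts to wrapping the existing predicate in one with positive sense. A minimal sketch, assuming a much-reduced premature-sleep test: toy_sleep_check, toy_sleeping_prematurely() and toy_kswapd_should_sleep() are hypothetical and do not reproduce the real function's arguments or its zone loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs standing in for what the real check examines. */
struct toy_sleep_check {
	long remaining;		/* nonzero: woken before the timed sleep ended */
	bool all_zones_ok;	/* every checked zone met its watermark */
};

/* Negative-sense predicate, in the style of sleeping_prematurely(). */
bool toy_sleeping_prematurely(const struct toy_sleep_check *c)
{
	assert(c);
	if (c->remaining)
		return true;
	return !c->all_zones_ok;
}

/* Positive-sense wrapper so callers can write "if (should_sleep) sleep". */
bool toy_kswapd_should_sleep(const struct toy_sleep_check *c)
{
	return !toy_sleeping_prematurely(c);
}
```

The behaviour is unchanged; only the call site reads without a double negative.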

> Reported-and-tested-by: Pádraig Brady <[email protected]>

But what were the before-and-after observations? I don't understand
how this can cause a permanent cpuchew by kswapd.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> return true;
>
> /* Check the watermark levels */
> - for (i = 0; i < pgdat->nr_zones; i++) {
> + for (i = 0; i <= classzone_idx; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> if (!populated_zone(zone))

The patch looks sensible.

2011-06-28 23:23:16

Subject: Re: [PATCH 3/4] mm: vmscan: Evaluate the watermarks against the correct classzone

On Tue, Jun 28, 2011 at 9:52 PM, Mel Gorman <[email protected]> wrote:
> On Mon, Jun 27, 2011 at 03:53:04PM +0900, Minchan Kim wrote:
>> On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <[email protected]> wrote:
>> > When deciding if kswapd is sleeping prematurely, the classzone is
>> > taken into account but this is different to what balance_pgdat() and
>> > the allocator are doing. Specifically, the DMA zone will be checked
>> > based on the classzone used when waking kswapd which could be for a
>> > GFP_KERNEL or GFP_HIGHMEM request. The lowmem reserve limit kicks in,
>> > the watermark is not met and kswapd thinks it's sleeping prematurely,
>> > keeping kswapd awake in error.
>>
>>
>> I thought it was intentional when you submitted a patch firstly.
>
> It was, it also wasn't right.
>
>> "Kswapd makes sure zones include enough free pages(ie, include reserve
>> limit of above zones).
>> But you seem to see DMA zone can't meet above requirement forever in
>> some situation so that kswapd doesn't sleep.
>> Right?
>>
>
> Right.
>
>> >
>> > Reported-and-tested-by: Pádraig Brady <[email protected]>
>> > Signed-off-by: Mel Gorman <[email protected]>
>> > ---
>> > mm/vmscan.c | 2 +-
>> > 1 files changed, 1 insertions(+), 1 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 9cebed1..a76b6cc2 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2341,7 +2341,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > }
>> >
>> > if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
>> > - classzone_idx, 0))
>> > + i, 0))
>>
>> Isn't it better to use 0 instead of i?
>>
>
> I considered it but went with i to compromise between making sure zones
> included enough free pages without requiring that ZONE_DMA meet an
> almost impossible requirement when under continual memory pressure.

I see.
Thanks, Mel.

--
Kind regards,
Minchan Kim

2011-06-28 23:38:15

Subject: Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone

On Fri, Jun 24, 2011 at 11:44 PM, Mel Gorman <[email protected]> wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
>
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

It does make sense.

--
Kind regards,
Minchan Kim

2011-06-29 10:58:29

by Pádraig Brady

Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

On 28/06/11 22:49, Andrew Morton wrote:
> On Fri, 24 Jun 2011 15:44:54 +0100
> Mel Gorman <[email protected]> wrote:
>
>> During allocator-intensive workloads, kswapd will be woken frequently
>> causing free memory to oscillate between the high and min watermark.
>> This is expected behaviour.
>>
>> A problem occurs if the highest zone is small. balance_pgdat()
>> only considers unreclaimable zones when priority is DEF_PRIORITY
>> but sleeping_prematurely considers all zones. It's possible for this
>> sequence to occur
>>
>> 1. kswapd wakes up and enters balance_pgdat()
>> 2. At DEF_PRIORITY, marks highest zone unreclaimable
>> 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
>> 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
>> highest zone, clearing all_unreclaimable. Highest zone
>> is still unbalanced
>> 5. kswapd returns and calls sleeping_prematurely
>> 6. sleeping_prematurely looks at *all* zones, not just the ones
>> being considered by balance_pgdat. The highest small zone
>> has all_unreclaimable cleared but the zone is not
>> balanced. all_zones_ok is false so kswapd stays awake
>>
>> This patch corrects the behaviour of sleeping_prematurely to check
>> the zones balance_pgdat() checked.
>
> But kswapd is making progress: it's reclaiming slab. Eventually that
> won't work any more and all_unreclaimable will not be cleared and the
> condition will fix itself up?
>
>
>
> btw,
>
> 	if (!sleeping_prematurely(...))
> 		sleep();
>
> hurts my brain. My brain would prefer
>
> 	if (kswapd_should_sleep(...))
> 		sleep();
>
> no?
>
>> Reported-and-tested-by: Pádraig Brady <[email protected]>
>
> But what were the before-and-after observations? I don't understand
> how this can cause a permanent cpuchew by kswapd.

Context:
http://marc.info/?t=130865025500001&r=1&w=2
https://bugzilla.redhat.com/show_bug.cgi?id=712019

Summary:

This will spin kswapd0 on my SNB laptop with 3GB RAM (with small normal zone):

dd bs=1M count=3000 if=/dev/zero of=spin.test

Basically once a certain amount of data is cached,
kswapd0 will start spinning, until the data
is removed from cache (by `rm spin.test` for example).

cheers,
Pádraig.

2011-06-30 02:24:04

by KOSAKI Motohiro

Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> A problem occurs if the highest zone is small. balance_pgdat()
> only considers unreclaimable zones when priority is DEF_PRIORITY
> but sleeping_prematurely considers all zones. It's possible for this
> sequence to occur
>
> 1. kswapd wakes up and enters balance_pgdat()
> 2. At DEF_PRIORITY, marks highest zone unreclaimable
> 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> highest zone, clearing all_unreclaimable. Highest zone
> is still unbalanced
> 5. kswapd returns and calls sleeping_prematurely
> 6. sleeping_prematurely looks at *all* zones, not just the ones
> being considered by balance_pgdat. The highest small zone
> has all_unreclaimable cleared but the zone is not
> balanced. all_zones_ok is false so kswapd stays awake
>
> This patch corrects the behaviour of sleeping_prematurely to check
> the zones balance_pgdat() checked.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8ff834e..841e3bf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2323,7 +2323,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> return true;
>
> /* Check the watermark levels */
> - for (i = 0; i < pgdat->nr_zones; i++) {
> + for (i = 0; i <= classzone_idx; i++) {
> struct zone *zone = pgdat->node_zones + i;
>
> if (!populated_zone(zone))

sorry for the delay.
Reviewed-by: KOSAKI Motohiro <[email protected]>

2011-06-30 02:37:50

by KOSAKI Motohiro

Subject: Re: [PATCH 2/4] mm: vmscan: Do not apply pressure to slab if we are not applying pressure to zone

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour.
>
> When kswapd applies pressure to zones during node balancing, it checks
> if the zone is above a high+balance_gap threshold. If it is, it does
> not apply pressure but it unconditionally shrinks slab on a global
> basis which is excessive. In the event kswapd is being kept awake due to
> a high small unreclaimable zone, it skips zone shrinking but still
> calls shrink_slab().
>
> Once pressure has been applied, the check for zone being unreclaimable
> is being made before the check is made if all_unreclaimable should be
> set. This miss of unreclaimable can cause has_under_min_watermark_zone
> to be set due to an unreclaimable zone preventing kswapd backing off
> on congestion_wait().
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 23 +++++++++++++----------
> 1 files changed, 13 insertions(+), 10 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 841e3bf..9cebed1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2507,18 +2507,18 @@ loop_again:
> KSWAPD_ZONE_BALANCE_GAP_RATIO);
> if (!zone_watermark_ok_safe(zone, order,
> high_wmark_pages(zone) + balance_gap,
> - end_zone, 0))
> + end_zone, 0)) {
> shrink_zone(priority, zone, &sc);
> - reclaim_state->reclaimed_slab = 0;
> - nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> - sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> - total_scanned += sc.nr_scanned;
>
> - if (zone->all_unreclaimable)
> - continue;
> - if (nr_slab == 0 &&
> - !zone_reclaimable(zone))
> - zone->all_unreclaimable = 1;
> + reclaim_state->reclaimed_slab = 0;
> + nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
> + sc.nr_reclaimed += reclaim_state->reclaimed_slab;
> + total_scanned += sc.nr_scanned;
> +
> + if (nr_slab == 0 && !zone_reclaimable(zone))
> + zone->all_unreclaimable = 1;
> + }
> +
> /*
> * If we've done a decent amount of scanning and
> * the reclaim ratio is low, start doing writepage
> @@ -2528,6 +2528,9 @@ loop_again:
> total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> sc.may_writepage = 1;
>
> + if (zone->all_unreclaimable)
> + continue;
> +
> if (!zone_watermark_ok_safe(zone, order,
> high_wmark_pages(zone), end_zone, 0)) {
> all_zones_ok = 0;

Reviewed-by: KOSAKI Motohiro <[email protected]>
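
The control-flow change in this patch can be sketched in isolation. The following is a minimal model under stated assumptions, not the kernel code: struct zone_state and its counters are hypothetical, above_gap stands in for the zone_watermark_ok_safe() check against high+balance_gap, and nr_slab is passed in rather than returned by a real shrink_slab().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state; the counters only record what was called. */
struct zone_state {
	bool above_gap;		/* free pages above high watermark + balance_gap */
	bool reclaimable;	/* stand-in for zone_reclaimable() */
	bool all_unreclaimable;
	int slab_calls;		/* number of modelled shrink_slab() calls */
	int zone_calls;		/* number of modelled shrink_zone() calls */
};

/* Before the patch: slab is shrunk even when the zone itself is skipped. */
static void balance_zone_old(struct zone_state *z, int nr_slab)
{
	assert(nr_slab >= 0);

	if (!z->above_gap)
		z->zone_calls++;
	z->slab_calls++;	/* unconditional slab pressure */

	if (z->all_unreclaimable)
		return;
	if (nr_slab == 0 && !z->reclaimable)
		z->all_unreclaimable = true;
}

/* After the patch: slab pressure is applied only together with zone pressure. */
static void balance_zone_new(struct zone_state *z, int nr_slab)
{
	assert(nr_slab >= 0);

	if (!z->above_gap) {
		z->zone_calls++;
		z->slab_calls++;
		if (nr_slab == 0 && !z->reclaimable)
			z->all_unreclaimable = true;
	}
}
```

For a zone that is already above the high+balance_gap threshold, the old flow still records a slab shrink while the new flow records none.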

2011-06-30 09:06:25

by KOSAKI Motohiro


Subject: Re: [PATCH 4/4] mm: vmscan: Only read new_classzone_idx from pgdat when reclaiming successfully

(2011/06/24 23:44), Mel Gorman wrote:
> During allocator-intensive workloads, kswapd will be woken frequently
> causing free memory to oscillate between the high and min watermark.
> This is expected behaviour. Unfortunately, if the highest zone is
> small, a problem occurs.
>
> When balance_pgdat() returns, it may be at a lower classzone_idx than
> it started because the highest zone was unreclaimable. Before checking
> if it should go to sleep though, it checks pgdat->classzone_idx which
> when there is no other activity will be MAX_NR_ZONES-1. It interprets
> this as it has been woken up while reclaiming, skips scheduling and
> reclaims again. As there is no useful reclaim work to do, it enters
> into a loop of shrinking slab consuming loads of CPU until the highest
> zone becomes reclaimable for a long period of time.
>
> There are two problems here. 1) If the returned classzone or order is
> lower, it'll continue reclaiming without scheduling. 2) if the highest
> zone was marked unreclaimable but balance_pgdat() returns immediately
> at DEF_PRIORITY, the new lower classzone is not communicated back to
> kswapd() for sleeping.
>
> This patch does two things that are related. If the end_zone is
> unreclaimable, this information is communicated back. Second, if
> the classzone or order was reduced due to failing to reclaim, new
> information is not read from pgdat and instead an attempt is made to go
> to sleep. Due to this, it is also necessary that pgdat->classzone_idx
> be initialised each time to pgdat->nr_zones - 1 to avoid re-reads
> being interpreted as wakeups.
>
> Reported-and-tested-by: Pádraig Brady <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 34 +++++++++++++++++++++-------------
> 1 files changed, 21 insertions(+), 13 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a76b6cc2..fe854d7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2448,7 +2448,6 @@ loop_again:
> if (!zone_watermark_ok_safe(zone, order,
> high_wmark_pages(zone), 0, 0)) {
> end_zone = i;
> - *classzone_idx = i;
> break;
> }
> }
> @@ -2528,8 +2527,11 @@ loop_again:
> total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> sc.may_writepage = 1;
>
> - if (zone->all_unreclaimable)
> + if (zone->all_unreclaimable) {
> + if (end_zone && end_zone == i)
> + end_zone--;
> continue;
> + }
>
> if (!zone_watermark_ok_safe(zone, order,
> high_wmark_pages(zone), end_zone, 0)) {
> @@ -2709,8 +2711,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> */
> static int kswapd(void *p)
> {
> - unsigned long order;
> - int classzone_idx;
> + unsigned long order, new_order;
> + int classzone_idx, new_classzone_idx;
> pg_data_t *pgdat = (pg_data_t*)p;
> struct task_struct *tsk = current;
>
> @@ -2740,17 +2742,23 @@ static int kswapd(void *p)
> tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> set_freezable();
>
> - order = 0;
> - classzone_idx = MAX_NR_ZONES - 1;
> + order = new_order = 0;
> + classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
> for ( ; ; ) {
> - unsigned long new_order;
> - int new_classzone_idx;
> int ret;
>
> - new_order = pgdat->kswapd_max_order;
> - new_classzone_idx = pgdat->classzone_idx;
> - pgdat->kswapd_max_order = 0;
> - pgdat->classzone_idx = MAX_NR_ZONES - 1;
> + /*
> + * If the last balance_pgdat was unsuccessful it's unlikely a
> + * new request of a similar or harder type will succeed soon
> + * so consider going to sleep on the basis we reclaimed at
> + */
> + if (classzone_idx >= new_classzone_idx && order == new_order) {

I'm confused by this. In the following scenario, new_classzone_idx may be garbage.

1. new_classzone_idx = pgdat->classzone_idx
2. kswapd_try_to_sleep()
3. classzone_idx = pgdat->classzone_idx
4. balance_pgdat()

Wouldn't we need to reinitialize new_classzone_idx and new_order in the kswapd_try_to_sleep()
path too?

> + new_order = pgdat->kswapd_max_order;
> + new_classzone_idx = pgdat->classzone_idx;
> + pgdat->kswapd_max_order = 0;
> + pgdat->classzone_idx = pgdat->nr_zones - 1;
> + }
> +
> if (order < new_order || classzone_idx > new_classzone_idx) {
> /*
> * Don't sleep if someone wants a larger 'order'
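
The scenario raised in the reply above can be modelled outside the kernel. This is a sketch under assumptions: struct pgdat_model, struct kswapd_vars, wake_from_sleep() and new_values_stale() are hypothetical names, and the resync flag shows only one possible way of reinitialising the new_* pair on the sleep path, not necessarily what the series ends up doing.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the pgdat fields and kswapd() locals involved. */
struct pgdat_model {
	int kswapd_max_order;
	int classzone_idx;
	int nr_zones;
};

struct kswapd_vars {
	int order, new_order;
	int classzone_idx, new_classzone_idx;
};

/*
 * Models what happens after kswapd_try_to_sleep() returns: order and
 * classzone_idx are re-read from pgdat. With resync == false the new_*
 * pair is left untouched, as in the quoted hunk. With resync == true
 * it is brought back in step with the values just read.
 */
static void wake_from_sleep(struct kswapd_vars *v, struct pgdat_model *p,
			    bool resync)
{
	assert(p->nr_zones > 0);

	v->order = p->kswapd_max_order;
	v->classzone_idx = p->classzone_idx;
	if (resync) {
		v->new_order = v->order;
		v->new_classzone_idx = v->classzone_idx;
	}
	p->kswapd_max_order = 0;
	p->classzone_idx = p->nr_zones - 1;
}

/* True when new_* no longer describes the request kswapd is about to serve. */
static bool new_values_stale(const struct kswapd_vars *v)
{
	return v->new_order != v->order ||
	       v->new_classzone_idx != v->classzone_idx;
}
```

If a waker lowers pgdat->classzone_idx while kswapd sleeps, the non-resyncing variant leaves new_classzone_idx holding the value read before the sleep, so the comparison at the top of the next loop iteration is made against old data.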

2011-06-30 09:39:42


Subject: Re: [PATCH 1/4] mm: vmscan: Correct check for kswapd sleeping in sleeping_prematurely

On Tue, Jun 28, 2011 at 02:49:00PM -0700, Andrew Morton wrote:
> On Fri, 24 Jun 2011 15:44:54 +0100
> Mel Gorman <[email protected]> wrote:
>
> > During allocator-intensive workloads, kswapd will be woken frequently
> > causing free memory to oscillate between the high and min watermark.
> > This is expected behaviour.
> >
> > A problem occurs if the highest zone is small. balance_pgdat()
> > only considers unreclaimable zones when priority is DEF_PRIORITY
> > but sleeping_prematurely considers all zones. It's possible for this
> > sequence to occur
> >
> > 1. kswapd wakes up and enters balance_pgdat()
> > 2. At DEF_PRIORITY, marks highest zone unreclaimable
> > 3. At DEF_PRIORITY-1, ignores highest zone setting end_zone
> > 4. At DEF_PRIORITY-1, calls shrink_slab freeing memory from
> > highest zone, clearing all_unreclaimable. Highest zone
> > is still unbalanced
> > 5. kswapd returns and calls sleeping_prematurely
> > 6. sleeping_prematurely looks at *all* zones, not just the ones
> > being considered by balance_pgdat. The highest small zone
> > > has all_unreclaimable cleared but the zone is not
> > balanced. all_zones_ok is false so kswapd stays awake
> >
> > This patch corrects the behaviour of sleeping_prematurely to check
> > the zones balance_pgdat() checked.
>
> But kswapd is making progress: it's reclaiming slab. Eventually that
> won't work any more and all_unreclaimable will not be cleared and the
> condition will fix itself up?
>

It might, but by that point we've dumped as much slab as we can, which
is very aggressive, and there is no guarantee the condition is fixed
up. For example, if fork is happening often enough due to terminal
usage, there may be just enough allocation requests satisfied
from the highest zone to clear all_unreclaimable during exit.

> btw,
>
> if (!sleeping_prematurely(...))
> sleep();
>
> hurts my brain. My brain would prefer
>
> if (kswapd_should_sleep(...))
> sleep();
>
> no?
>

kswapd_try_to_sleep -> should_sleep feels like it would hurt too. I
prefer the sleeping_prematurely name because it indicates what
condition we are checking, but I'm biased and generally suck at naming.

> > > Reported-and-tested-by: Pádraig Brady <[email protected]>
>
> But what were the before-and-after observations? I don't understand
> how this can cause a permanent cpuchew by kswapd.
>

Pádraig has reported on his before-and-after observations.

On its own, this patch doesn't entirely fix his problem; all
the patches are required, but I felt that a rolled-up patch would be
too hard to review.

--
Mel Gorman
SUSE Labs

2011-06-30 10:19:44