2004-10-25 17:04:23

by Andrea Arcangeli

[permalink] [raw]
Subject: lowmem_reserve (replaces protection)

This is a forward port to 2.6 CVS of the lowmem_reserve VM feature in
the 2.4 kernel.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-1

Lack of this feature might explain out of memory related killing or
deadlocks hit on >2G boxes. so if anybody is having trouble with oom
conditions this is a patch to try.

this is only slightly tested but works for me so far.

This is the first of a series of oom related fix I'm going to do and
test within the next weeks to attempt to cure various oom regressions in
2.6 (deadlocks turned into crazy early oom kills and the like).


2004-10-26 02:04:01

by Rik van Riel

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Mon, 25 Oct 2004, Andrea Arcangeli wrote:

> This is a forward port to 2.6 CVS of the lowmem_reserve VM feature in
> the 2.4 kernel.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-1

- unsigned long protection[MAX_NR_ZONES];
+ unsigned long lowmem_reserve[MAX_NR_ZONES];

The gratituous renaming of variable and function names makes
it hard to see what this patch actually changed. Hard enough
that I'm not sure what the behavioural difference is supposed
to be.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-26 04:21:40

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Tue, Oct 26, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
>
>>I see classzone_idx snuck in, can we leave that as alloc_type please?
>
>
> when I wrote that code in 2.4 it was called class_idx. Just to show it
> was not an opaque type, in this 2.6 I called it classzone_idx but it's
> the same as class_idx. If you feel classzone_idx is too long I'm sure
> fine to rename to class_idx like plain 2.4.
>
> The reason I renamed it is that alloc_type tells nothing to who's
> reading the code. That value in the opaque "alloc_type" variable, is
> really the classzone_idx that identify the classzone we have to allocate
> memory from. Classzone 2 means "all ram is good", classzone 2 means
> "zone-normal + zone-dma is good", classzone 0 means "zone-dma is good".
>

OK that makes sense... it isn't the length of the name, but the fact
that that naming convention hasn't proliferated thoughout the 2.6 tree;
maybe could you add a little comment along the lines of the above?
Thanks

2004-10-26 03:20:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Mon, Oct 25, 2004 at 09:48:25PM -0400, Rik van Riel wrote:
> On Mon, 25 Oct 2004, Andrea Arcangeli wrote:
>
> > This is a forward port to 2.6 CVS of the lowmem_reserve VM feature in
> > the 2.4 kernel.
> >
> > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-1
>
> - unsigned long protection[MAX_NR_ZONES];
> + unsigned long lowmem_reserve[MAX_NR_ZONES];
>
> The gratituous renaming of variable and function names makes
> it hard to see what this patch actually changed. Hard enough
> that I'm not sure what the behavioural difference is supposed
> to be.

the behavioural difference is the API and the fact the feaure is now
enabled with sane values (the previous code was disabled by default and
it was unusable with that API). besides fixing the API the patch nukes
dozens of useless lines of code and a buffer overflow. The sysctl
definitely needs renaming or it'd break the ABI with userspace, it's far
from a gratituous rename. since I was foroced to change the sysctl name
accordingly with the new 2.4 API, I thought renaming the variable that
is set by the sysctl was also required, otherwise the sysctl is called
lowmem_reserve and the variable is still called protection. Clearly it's
much cleaner if _both_ sysctl and variable are called lowmem_reserve.

I could have used protection2 to still use the "protection" name, but
lowmem_reserve (btw, the same name I used first in 2.4, before
protection ever existed in 2.6) looks nicer to me.

2004-10-26 03:52:33

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Mon, Oct 25, 2004 at 09:48:25PM -0400, Rik van Riel wrote:
>
>>On Mon, 25 Oct 2004, Andrea Arcangeli wrote:
>>
>>
>>>This is a forward port to 2.6 CVS of the lowmem_reserve VM feature in
>>>the 2.4 kernel.
>>>
>>> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-1
>>
>>- unsigned long protection[MAX_NR_ZONES];
>>+ unsigned long lowmem_reserve[MAX_NR_ZONES];
>>
>>The gratituous renaming of variable and function names makes
>>it hard to see what this patch actually changed. Hard enough
>>that I'm not sure what the behavioural difference is supposed
>>to be.
>
>
> the behavioural difference is the API and the fact the feaure is now
> enabled with sane values (the previous code was disabled by default and
> it was unusable with that API). besides fixing the API the patch nukes
> dozens of useless lines of code and a buffer overflow. The sysctl
> definitely needs renaming or it'd break the ABI with userspace, it's far
> from a gratituous rename. since I was foroced to change the sysctl name
> accordingly with the new 2.4 API, I thought renaming the variable that
> is set by the sysctl was also required, otherwise the sysctl is called
> lowmem_reserve and the variable is still called protection. Clearly it's
> much cleaner if _both_ sysctl and variable are called lowmem_reserve.
>
> I could have used protection2 to still use the "protection" name, but
> lowmem_reserve (btw, the same name I used first in 2.4, before
> protection ever existed in 2.6) looks nicer to me.
>

I'd say go with the name change. "protection" is fairly vague...
OTOH, lowmem_reserve doesn't quite carry the meaning that it is
_protecting_ lower zones from higher zone allocations... maybe
lowmem_protection? But I don't mind too much.

I see classzone_idx snuck in, can we leave that as alloc_type please?

Otherwise, looks great.

2004-10-26 04:08:24

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 01:48:02PM +1000, Nick Piggin wrote:
> I see classzone_idx snuck in, can we leave that as alloc_type please?

when I wrote that code in 2.4 it was called class_idx. Just to show it
was not an opaque type, in this 2.6 I called it classzone_idx but it's
the same as class_idx. If you feel classzone_idx is too long I'm sure
fine to rename to class_idx like plain 2.4.

The reason I renamed it is that alloc_type tells nothing to who's
reading the code. That value in the opaque "alloc_type" variable, is
really the classzone_idx that identify the classzone we have to allocate
memory from. Classzone 2 means "all ram is good", classzone 2 means
"zone-normal + zone-dma is good", classzone 0 means "zone-dma is good".

alloc_type means very very little to me, calling it with a meaningful
name made the code more readable for me. gfp_mask describes other alloc
types as well that are not less singificant than the classzone_idx, so I
don't see why we should go back to the opaque variable name when we can
have a more descriptive one.

I'm not going to nitpick further on this detail though, all I care about
is the asm generated and not breaking ABIs with userspace (i.e. at least
one forced rename to the sysctl).

thanks for the review!

2004-10-27 00:28:22

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 02:17:33PM +1000, Nick Piggin wrote:
> OK that makes sense... it isn't the length of the name, but the fact
> that that naming convention hasn't proliferated thoughout the 2.6 tree;

I don't see any other equivalent teminology besides my "classzone" word
existing, if we standardize on "alloc_type" to only mean a classzone
then I'd be fine to giveup on my wording, I don't care to retain my
wording, but to me classzone sounds more self explanatory than
alloc_type (though I must be biased having invented that word).

> maybe could you add a little comment along the lines of the above?

sure. done here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-2

2004-10-27 00:34:06

by Rik van Riel

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, 26 Oct 2004, Nick Piggin wrote:

> OK that makes sense... it isn't the length of the name, but the fact
> that that naming convention hasn't proliferated thoughout the 2.6 tree;

Speaking about not proliferating...

One thing we need to make sure of is that the lower zone
protection stuff doesn't put the allocation threshold
higher than kswapd's freeing threshold.

Otherwise on a 1GB system, we'll end up cycling most of
userspace allocations through the 128MB highmem zone,
instead of falling back to the other zones.

Note that simply having a small lower zone protection isn't
enough, if we still cross the kswapd threshold. I suspect
the lower zone protection should just be off by default,
except if highmem is at least 5x (or 8x?) bigger than lowmem...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-27 00:42:37

by Andrew Morton

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli <[email protected]> wrote:
>
> I don't see any other equivalent teminology besides my "classzone" word
> existing,

I'll confess that I've never understood what "classzone" _means_. Is it "a
zone from amongst several classes" or what?

If it was "zone_class" then it might mean "a particular classification of
zones". Maybe that's what you meant?

I think a lot of other mm hackers share my confusion, which is why
"classzone" has been trickling away. But yeah, we haven't been replacing it
with anything very useful.

2004-10-27 00:47:59

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 05:42:37PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > I don't see any other equivalent teminology besides my "classzone" word
> > existing,
>
> I'll confess that I've never understood what "classzone" _means_. Is it "a
> zone from amongst several classes" or what?
>
> If it was "zone_class" then it might mean "a particular classification of
> zones". Maybe that's what you meant?
>
> I think a lot of other mm hackers share my confusion, which is why
> "classzone" has been trickling away. But yeah, we haven't been replacing it
> with anything very useful.

it's very easy to explain it now that you spontanously used that concept
in 2.6: classzone is the piece of ram represented by what you previously
called alloc_type in 2.6.9 mainline alloc_pages function. You wrote that
code so you know what it means. If you prefer to still call that
alloc_type that's fine with me.

2004-10-27 00:53:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 08:31:32PM -0400, Rik van Riel wrote:
> On Tue, 26 Oct 2004, Nick Piggin wrote:
>
> > OK that makes sense... it isn't the length of the name, but the fact
> > that that naming convention hasn't proliferated thoughout the 2.6 tree;
>
> Speaking about not proliferating...
>
> One thing we need to make sure of is that the lower zone
> protection stuff doesn't put the allocation threshold
> higher than kswapd's freeing threshold.

I agree. I didn't introduce that bug, the very same problem would happen
with the previous protection code. So this is not a regression, I'm far
from finished... I'm just trying to post orthogonal patches, since Hugh
had a much better merging success rate with small patches (though I find
very hard to produce small patches myself when there's more than one
thing to fix in the same file).

the per-classzone kswapd treshold was very well taken care of in 2.4,
thanks the watermarks embedding the low/min/high and the classzone being
passed up to the kswapd wakeup function.

> Otherwise on a 1GB system, we'll end up cycling most of
> userspace allocations through the 128MB highmem zone,
> instead of falling back to the other zones.

that's the side effect of the per-zone lru too (though I'm not going to
change the lru).

2004-10-27 00:56:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, Oct 27, 2004 at 02:54:25AM +0200, Andrea Arcangeli wrote:
> I agree. I didn't introduce that bug, the very same problem would happen

btw, the severity of such performance bug in the LRU screwup, is minor
compared to the severity of the functional deadlocks or early oom kills
potentially generated by the lack of lowmem_reserve enabled with proper
defaults but sure we need to fix the performance LRU issue too.

2004-10-27 01:03:30

by Rik van Riel

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, 27 Oct 2004, Andrea Arcangeli wrote:

> the per-classzone kswapd treshold was very well taken care of in 2.4,
> thanks the watermarks embedding the low/min/high and the classzone being
> passed up to the kswapd wakeup function.

And it works fine on non-numa x86.

However, doesn't the protection also serve a purpose for
NUMA fallback ? How is that handled by your code ?

NUMA wasn't very important a few years ago, but with AMD64
systems being common, it is an important consideration now.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-27 01:09:50

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 09:00:11PM -0400, Rik van Riel wrote:
> However, doesn't the protection also serve a purpose for
> NUMA fallback ? How is that handled by your code ?

It doesn't.

The only reason any NUMA folk were interested in this stuff was to nuke
the incremental min weak band-aid that Linus introduced in 2.4 while
rejecting my real lowmem_reserve algorithm (that has been reinvented in
2.6.5 first and against 2.6.7 with the name of protection but still
buggy and disabled by default, so not very useful.., especially after
even the incremental-min was gone from 2.6.8+).

NUMA folks only want to nuke the incremental-min hack since that
escalated to huge waste of ram on the hundred nodes.

the major point of the rework in the protection algorithm happened
between 2.6.5 and 2.6.8 was to drop any NUMA effect from the algorithm
that guarantees DMA allocations on x86-64 and NORMAL allocations on x86
PAE. So being it unrelated to NUMA is a feature as far as I can tell.

2004-10-27 01:34:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

this _incremental_ 2/? patch should fix the longtanding kswapd issue
vs protection algorithm, now lowmem_reserve (partly hidden by the lack
of lowmem_reserve/protection or equivalent band-aid enabled in 2.6.9).

--- 2-kswapd-balance/include/linux/mmzone.h.~1~ 2004-10-27 03:17:07.207812600 +0200
+++ 2-kswapd-balance/include/linux/mmzone.h 2004-10-27 03:26:22.673369000 +0200
@@ -273,7 +273,7 @@ void __get_zone_counts(unsigned long *ac
void get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free);
void build_all_zonelists(void);
-void wakeup_kswapd(struct zone *zone);
+void wakeup_kswapd(struct zone *zone, int classzone_idx);

/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
--- 2-kswapd-balance/mm/page_alloc.c.~1~ 2004-10-27 03:17:07.215811384 +0200
+++ 2-kswapd-balance/mm/page_alloc.c 2004-10-27 03:24:31.351292528 +0200
@@ -641,7 +641,7 @@ __alloc_pages(unsigned int gfp_mask, uns
}

for (i = 0; (z = zones[i]) != NULL; i++)
- wakeup_kswapd(z);
+ wakeup_kswapd(z, classzone_idx);

/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
--- 2-kswapd-balance/mm/vmscan.c.~1~ 2004-10-27 03:14:22.563842288 +0200
+++ 2-kswapd-balance/mm/vmscan.c 2004-10-27 03:26:57.462080312 +0200
@@ -1169,11 +1169,11 @@ static int kswapd(void *p)
/*
* A zone is low on free memory, so wake its kswapd task to service it.
*/
-void wakeup_kswapd(struct zone *zone)
+void wakeup_kswapd(struct zone *zone, int classzone_idx)
{
if (zone->present_pages == 0)
return;
- if (zone->free_pages > zone->pages_low)
+ if (zone->free_pages > zone->pages_low + zone->lowmem_reserve[classzone_idx])
return;
if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
return;

2004-10-27 02:07:01

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Tue, Oct 26, 2004 at 02:17:33PM +1000, Nick Piggin wrote:

>>maybe could you add a little comment along the lines of the above?
>
>
> sure. done here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-2
>

Thanks Andrea.

2004-10-27 02:08:01

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:

> the per-classzone kswapd treshold was very well taken care of in 2.4,
> thanks the watermarks embedding the low/min/high and the classzone being
> passed up to the kswapd wakeup function.
>

Kswapd actually should take care of this properly: see the
initial loop before the real scanning loop.

I thought this was a bit subtle, but it seems to work fine,
and Andrew likes it.

I have a patch to explicitly have kswapd use the lower zone
protection watermarks but I haven't really demonstrated it is
better than what is currently there (other than being simpler).

2004-10-27 02:11:38

by Andrew Morton

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli <[email protected]> wrote:
>
> this _incremental_ 2/? patch should fix the longtanding kswapd issue
> vs protection algorithm

Could you please email the patch which this depends on?

2004-10-27 02:28:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, Oct 27, 2004 at 12:05:19PM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >the per-classzone kswapd treshold was very well taken care of in 2.4,
> >thanks the watermarks embedding the low/min/high and the classzone being
> >passed up to the kswapd wakeup function.
> >
>
> Kswapd actually should take care of this properly: see the
> initial loop before the real scanning loop.

I don't see how it can take care of it, if it doesn't even get a wakeup?
kswapd just sleeps.

kswapd is just an optimization, we try provide async allocation if the
kswapd watermarks can it keep up between low and high (and we hang
synchronosly while hitting the wall at min), so it's not fatal that
kswapd sleeps, but it can be fixed easily with the patch I just posted
some minute ago.

> I thought this was a bit subtle, but it seems to work fine,
> and Andrew likes it.
>
> I have a patch to explicitly have kswapd use the lower zone
> protection watermarks but I haven't really demonstrated it is
> better than what is currently there (other than being simpler).

how can it ever wakeup if nobody ever calls
wake_up_interruptible(&zone->zone_pgdat->kswapd_wait)?


Now the patch I posted fixes the wakeup side, but I'm not sure that the
"stop" is properly done yet. Peraphs that's what you mean, that the
kswapd-stop decision already works despite not being aware of the
low memory ram reservation?

Following the code I see nr_pages is 0 in balance_pgdat when invoked
from kswapd. nr_pages then goes here:

if (nr_pages == 0) {
/*
* Scan in the highmem->dma direction for the
* highest
* zone which needs scanning
*/
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
struct zone *zone = pgdat->node_zones + i;

if (zone->all_unreclaimable &&
priority != DEF_PRIORITY)
continue;

if (zone->free_pages <= zone->pages_high) {
end_zone = i;
goto scan;
}
}
goto out;


shouldn't we take the full watermarks into account in the above too?
I think so, otherwise we wakeup but it goes back to sleep, i.e.
overscheduling but nothing done still, so I'm going to fixup the stop as
well.

2004-10-27 02:31:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 07:08:56PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > this _incremental_ 2/? patch should fix the longtanding kswapd issue
> > vs protection algorithm
>
> Could you please email the patch which this depends on?

it's against this one:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-2

(the patch only address the wakeup issue, the stop is still unsolved,
but it can be done in a separate patch)

2004-10-27 02:57:00

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> this _incremental_ 2/? patch should fix the longtanding kswapd issue
> vs protection algorithm, now lowmem_reserve (partly hidden by the lack
> of lowmem_reserve/protection or equivalent band-aid enabled in 2.6.9).
>
> --- 2-kswapd-balance/include/linux/mmzone.h.~1~ 2004-10-27 03:17:07.207812600 +0200
> +++ 2-kswapd-balance/include/linux/mmzone.h 2004-10-27 03:26:22.673369000 +0200
> @@ -273,7 +273,7 @@ void __get_zone_counts(unsigned long *ac
> void get_zone_counts(unsigned long *active, unsigned long *inactive,
> unsigned long *free);
> void build_all_zonelists(void);
> -void wakeup_kswapd(struct zone *zone);
> +void wakeup_kswapd(struct zone *zone, int classzone_idx);
>
> /*
> * zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
> --- 2-kswapd-balance/mm/page_alloc.c.~1~ 2004-10-27 03:17:07.215811384 +0200
> +++ 2-kswapd-balance/mm/page_alloc.c 2004-10-27 03:24:31.351292528 +0200
> @@ -641,7 +641,7 @@ __alloc_pages(unsigned int gfp_mask, uns
> }
>
> for (i = 0; (z = zones[i]) != NULL; i++)
> - wakeup_kswapd(z);
> + wakeup_kswapd(z, classzone_idx);
>
> /*
> * Go through the zonelist again. Let __GFP_HIGH and allocations
> --- 2-kswapd-balance/mm/vmscan.c.~1~ 2004-10-27 03:14:22.563842288 +0200
> +++ 2-kswapd-balance/mm/vmscan.c 2004-10-27 03:26:57.462080312 +0200
> @@ -1169,11 +1169,11 @@ static int kswapd(void *p)
> /*
> * A zone is low on free memory, so wake its kswapd task to service it.
> */
> -void wakeup_kswapd(struct zone *zone)
> +void wakeup_kswapd(struct zone *zone, int classzone_idx)
> {
> if (zone->present_pages == 0)
> return;
> - if (zone->free_pages > zone->pages_low)
> + if (zone->free_pages > zone->pages_low + zone->lowmem_reserve[classzone_idx])
> return;
> if (!waitqueue_active(&zone->zone_pgdat->kswapd_wait))
> return;

I don't think this is required, because by the time __alloc_pages
reaches wakeup_kswapd, it would have checked one zone with a
->lowmem_reserve of 0, and found it to be low on pages. Thus
wakeup_kswapd will wake it up.

2004-10-27 03:03:14

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Wed, Oct 27, 2004 at 12:05:19PM +1000, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>the per-classzone kswapd treshold was very well taken care of in 2.4,
>>>thanks the watermarks embedding the low/min/high and the classzone being
>>>passed up to the kswapd wakeup function.
>>>
>>
>>Kswapd actually should take care of this properly: see the
>>initial loop before the real scanning loop.
>
>
> I don't see how it can take care of it, if it doesn't even get a wakeup?
> kswapd just sleeps.
>
> kswapd is just an optimization, we try provide async allocation if the
> kswapd watermarks can it keep up between low and high (and we hang
> synchronosly while hitting the wall at min), so it's not fatal that
> kswapd sleeps, but it can be fixed easily with the patch I just posted
> some minute ago.
>

See my reply.

>
>>I thought this was a bit subtle, but it seems to work fine,
>>and Andrew likes it.
>>
>>I have a patch to explicitly have kswapd use the lower zone
>>protection watermarks but I haven't really demonstrated it is
>>better than what is currently there (other than being simpler).
>

[snip]

> shouldn't we take the full watermarks into account in the above too?
> I think so, otherwise we wakeup but it goes back to sleep, i.e.
> overscheduling but nothing done still, so I'm going to fixup the stop as
> well.
>

Please don't just yet. I've already got this patch to do it, but
I've got more kswapd work underneath that (which makes kswapd
work properly with higher order allocations).

Currently it does not overschedule, because one zone is always
going to be low by the time kswapd wakes up. This causes all zones
below to be scanned as well.

2004-10-27 03:23:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, Oct 27, 2004 at 01:01:54PM +1000, Nick Piggin wrote:
> Currently it does not overschedule, because one zone is always
> going to be low by the time kswapd wakes up. This causes all zones
> below to be scanned as well.

that's quite subtle (after all it's needed only for the numa pgdats) and
I agree on the wakeup side, the one thing that is wrong instead is the
kswapd-stop side. On that side you really need to know every single zone
that has to be balanced. So whatever, you can't just use pages_high
there. I'm creating a zone->max_lowmem_reserve to fix that efficiently
(that's recalculated every time with the sysctl and at boot).

However my patch to wakeup_kswapd sure wouldn't hurt there.

2004-10-27 03:38:49

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Wed, Oct 27, 2004 at 01:01:54PM +1000, Nick Piggin wrote:
>
>>Currently it does not overschedule, because one zone is always
>>going to be low by the time kswapd wakes up. This causes all zones
>>below to be scanned as well.
>
>
> that's quite subtle (after all it's needed only for the numa pgdats) and
> I agree on the wakeup side, the one thing that is wrong instead is the
> kswapd-stop side. On that side you really need to know every single zone
> that has to be balanced. So whatever, you can't just use pages_high
> there. I'm creating a zone->max_lowmem_reserve to fix that efficiently
> (that's recalculated every time with the sysctl and at boot).
>
> However my patch to wakeup_kswapd sure wouldn't hurt there.
>

You do though... it is right as is (sort of).

It actually can overscan lower zones a little bit, because
whenever any higher zone in the pgdat is low on memory, then
it and all zones below it get scanned too.

So if HIGHMEM is low on memory, NORMAL and DMA get scanned
as well regardless. If NORMAL is low, then DMA gets scanned
as well.

2004-10-27 03:46:13

by Andrew Morton

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Nick Piggin <[email protected]> wrote:
>
> It actually can overscan lower zones a little bit, because
> whenever any higher zone in the pgdat is low on memory, then
> it and all zones below it get scanned too.

Because we know that all of the eligible zones are below pages_low. kswapd
will then work to bring all the relevant zones back to pages_high.

When working on this code it is very very easy to break the zone levelling:
you *have* to run a workload mix and monitor the numbers in /proc/vmstat to
ensure that all zones are undergoing page scanning at frequencies which are
proportional to their sizes. It's easy to screw up the zone levelling so
all allocations end up coming from ZONE_NORMAL and pagecache pages in, say,
ZONE_DMA end up just sitting there.

2004-10-27 04:44:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 08:43:08PM -0700, Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > It actually can overscan lower zones a little bit, because
> > whenever any higher zone in the pgdat is low on memory, then
> > it and all zones below it get scanned too.
>
> Because we know that all of the eligible zones are below pages_low. kswapd
> will then work to bring all the relevant zones back to pages_high.

entering the loop is easy and shrinking all zones is easy.

is how much to free from each zone in kswapd that won't work as 2.4 did
with only pages_high.

what we'll happen is that we'll blindly free a few pages from each zone,
but then we'll be allowed to allocate the highmem pages, and not the
normal/dma pages. So after allocating the highmem pages we invoke kswapd
again and it frees again some highmem/normal/dma pages but we keep only
using the highmem ones. So for a while we may be rolling over only the
highmem lru and ignoring all freed pages from the normal/dma zones.

2.4 would have freed all pages from the lower zones first (without
this additional rolling over the highmem zone). Though it's a bit hard
to compare since 2.4 had a global lru.

The bigger the highmem zone, the less it will matter the stuff that can
be cached in the low zones. And the smaller the highemm zone the less
imbalancing will be generated since the reservation will be minuscle on
the lower zones. So perhaps it already works fine, but this is not 2.4
was working (I'm sure to make it work like 2.4 you'd need to add the
lowmem_reserve in the when-to-stop-kswapd math). But maybe it works even
better this way?

2004-10-27 04:51:36

by Rik van Riel

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, 27 Oct 2004, Andrea Arcangeli wrote:

> what we'll happen is that we'll blindly free a few pages from each zone,
> but then we'll be allowed to allocate the highmem pages, and not the
> normal/dma pages. So after allocating the highmem pages we invoke kswapd
> again and it frees again some highmem/normal/dma pages but we keep only
> using the highmem ones. So for a while we may be rolling over only the
> highmem lru and ignoring all freed pages from the normal/dma zones.

This is not how the page allocator in 2.6 works.

It will not wake up kswapd until it is past the low
watermark in all zones.


--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-10-27 05:04:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, Oct 27, 2004 at 12:51:11AM -0400, Rik van Riel wrote:
> On Wed, 27 Oct 2004, Andrea Arcangeli wrote:
>
> > what we'll happen is that we'll blindly free a few pages from each zone,
> > but then we'll be allowed to allocate the highmem pages, and not the
> > normal/dma pages. So after allocating the highmem pages we invoke kswapd
> > again and it frees again some highmem/normal/dma pages but we keep only
> > using the highmem ones. So for a while we may be rolling over only the
> > highmem lru and ignoring all freed pages from the normal/dma zones.
>
> This is not how the page allocator in 2.6 works.
>
> It will not wake up kswapd until it is past the low
> watermark in all zones.

how else the page allocator should work if not waiting all zones to hit
the low+reserve watermarks? the page allocator works the same in 2.4
and 2.6, so I don't understand your point.

what changes is kswapd stop algorithm and the fact the lru is per-zone,
but I think I described correctly what happens in 2.6 kswapd-stop.

2004-10-27 05:40:50

by Andrew Morton

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli <[email protected]> wrote:
>
> So after allocating the highmem pages we invoke kswapd
> again

Nope. We don't wake kswapd until the lower zones are below pages_low as
well. Then we rebalance all the zones which are below pages_high.

As I say: run a workload mix and monitor /proc/vmstat:

pgscan_kswapd_high
pgscan_kswapd_normal
pgscan_kswapd_dma

These should be increasing at rates which are proportional to their size,
if most allocations have __GFP_HIGHMEM.

The lower-zone protection thing means that the scanning rate really should
be proportional to (zone size - (protection from __GFP_HIGHMEM
allocations)), but it's not too far off at present.

The balancing of the zone scanning rates is subtle and easy to break. As is
the balancing of that against slab scanning.

2004-10-27 05:51:01

by Nick Piggin

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

Andrea Arcangeli wrote:
> On Wed, Oct 27, 2004 at 12:51:11AM -0400, Rik van Riel wrote:
>
>>On Wed, 27 Oct 2004, Andrea Arcangeli wrote:
>>
>>
>>>what we'll happen is that we'll blindly free a few pages from each zone,
>>>but then we'll be allowed to allocate the highmem pages, and not the
>>>normal/dma pages. So after allocating the highmem pages we invoke kswapd
>>>again and it frees again some highmem/normal/dma pages but we keep only
>>>using the highmem ones. So for a while we may be rolling over only the
>>>highmem lru and ignoring all freed pages from the normal/dma zones.
>>

That's what I mean when I say it can overscan the lower zones.
You don't want to change the stop condition here though because
obviously the highmem zone still has work to do - you just want
to stop scanning the lower zones.

I don't think it is a big problem though.

I do have a patch to make it use the lower zone protection watermarks
explicitly, but it is sitting on top of other stuff which I'm going
to try to get merged first...

2004-10-27 06:16:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Tue, Oct 26, 2004 at 10:33:35PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > So after allocating the highmem pages we invoke kswapd
> > again
>
> Nope. We don't wake kswapd until the lower zones are below pages_low as
> well. Then we rebalance all the zones which are below pages_high.

the lower zones will always be below pages_low+lowmem_reserve forever
since the reservation is 800M of ram.

The allocator isn't using pages_low, the allocator is always using
pages_low+lowmem_reserve. This is why only highmem will be allocated and
then immediatly kswapd will be invoked again.

the lower zones are above pages_high all the time, but they're also
below pages_low+lowmem_reserve all the time.

I'm not talking about 2.6.9 with protection disabled, I'm talking about
2.6.9+lowmem_reserve enabled.

So there will definitely be a rolling on the highmem, but it's probably
ok see below:

> As I say: run a workload mix and monitor /proc/vmstat:
>
> pgscan_kswapd_high
> pgscan_kswapd_normal
> pgscan_kswapd_dma
>
> These should be increasing at rates which are proportional to their size,
> if most allocations have __GFP_HIGHMEM.

they should yes. And this is also why it shouldn't be a problem this
behaviour of kswapd despite the rolling (after all we're shrinking cache
from the lower zones too, except the highmem cache will be rolled).

I'll try to measure the above soon (and I already applied it to all
relevant trees so it'll get some beating, since the aging details are
somewhat less severe than potential deadlocks or oom kills with lots of
ram and ptes, espcially now without the incremental min we have to
enable it).

2004-10-28 00:32:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: lowmem_reserve (replaces protection)

On Wed, Oct 27, 2004 at 02:25:37AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 26, 2004 at 02:17:33PM +1000, Nick Piggin wrote:
> > OK that makes sense... it isn't the length of the name, but the fact
> > that that naming convention hasn't proliferated thoughout the 2.6 tree;
>
> I don't see any other equivalent teminology besides my "classzone" word
> existing, if we standardize on "alloc_type" to only mean a classzone
> then I'd be fine to giveup on my wording, I don't care to retain my
> wording, but to me classzone sounds more self explanatory than
> alloc_type (though I must be biased having invented that word).
>
> > maybe could you add a little comment along the lines of the above?
>
> sure. done here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-2

sorry there was a small buglet, when testing I must have tweaked the
sysctl before sysrq+m so I overlooked it until now when I was playing
with a new feature (the other version of the patch against 2.6.5 that I
tested more closely didn't miss the below)

So it was disabled by default *still* ;)

--- 1-lowmem_reserve/mm/page_alloc.c.~1~ 2004-10-28 01:56:50.000000000 +0200
+++ 1-lowmem_reserve/mm/page_alloc.c 2004-10-28 02:18:39.160567304 +0200
@@ -1916,6 +1916,7 @@ static int __init init_per_zone_pages_mi
if (min_free_kbytes > 16384)
min_free_kbytes = 16384;
setup_per_zone_pages_min();
+ setup_per_zone_lowmem_reserve();
return 0;
}
module_init(init_per_zone_pages_min)


full update ready to be merged is here:

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/lowmem_reserve-3

thanks.