2004-10-21 01:29:57

by Andrea Arcangeli

Subject: ZONE_PADDING wastes 4 bytes of the new cacheline

I don't see why this 'int x' exists; alignment should really work fine
even with an empty structure (it works with my compiler in a userspace
test, please double-check).

Index: linux-2.5/include/linux/mmzone.h
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/mmzone.h,v
retrieving revision 1.67
diff -u -p -r1.67 mmzone.h
--- linux-2.5/include/linux/mmzone.h 19 Oct 2004 14:58:00 -0000 1.67
+++ linux-2.5/include/linux/mmzone.h 21 Oct 2004 01:14:20 -0000
@@ -35,7 +35,6 @@ struct pglist_data;
*/
#if defined(CONFIG_SMP)
struct zone_padding {
- int x;
} ____cacheline_maxaligned_in_smp;
#define ZONE_PADDING(name) struct zone_padding name;
#else


2004-10-21 03:28:58

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:
> I don't see why this 'int x' exists, alignment should really work fine
> even with empty structure (works with my compiler with an userspace
> test, please double check).
>
> Index: linux-2.5/include/linux/mmzone.h
> ===================================================================
> RCS file: /home/andrea/crypto/cvs/linux-2.5/include/linux/mmzone.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 mmzone.h
> --- linux-2.5/include/linux/mmzone.h 19 Oct 2004 14:58:00 -0000 1.67
> +++ linux-2.5/include/linux/mmzone.h 21 Oct 2004 01:14:20 -0000
> @@ -35,7 +35,6 @@ struct pglist_data;
> */
> #if defined(CONFIG_SMP)
> struct zone_padding {
> - int x;
> } ____cacheline_maxaligned_in_smp;
> #define ZONE_PADDING(name) struct zone_padding name;
> #else

Perhaps to keep old compilers working? Not sure.

I think it is common to use a zero length array...?

int x[0];

Although maybe that is only used when one requires a
handle on the address.

Nick

2004-10-21 04:47:34

by Andrew Morton

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Nick Piggin <[email protected]> wrote:
>
> > #if defined(CONFIG_SMP)
> > struct zone_padding {
> > - int x;
> > } ____cacheline_maxaligned_in_smp;
> > #define ZONE_PADDING(name) struct zone_padding name;
> > #else
>
> Perhaps to keep old compilers working? Not sure.

gcc-2.95 is OK with it.

Stock 2.6.9:

sizeof(struct zone) = 1920

With Andrea's patch:

sizeof(struct zone) = 1536

With ZONE_PADDING removed:

sizeof(struct zone) = 1408

2004-10-21 05:02:49

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>> #if defined(CONFIG_SMP)
>>
>> > struct zone_padding {
>> > - int x;
>> > } ____cacheline_maxaligned_in_smp;
>> > #define ZONE_PADDING(name) struct zone_padding name;
>> > #else
>>
>> Perhaps to keep old compilers working? Not sure.
>
>
> gcc-2.95 is OK with it.
>
> Stock 2.6.9:
>
> sizeof(struct zone) = 1920
>
> With Andrea's patch:
>
> sizeof(struct zone) = 1536
>
> With ZONE_PADDING removed:
>
> sizeof(struct zone) = 1408
>
>

Wow. You'd probably still want pad1 because that separates the
allocator and the scanner. It looks like temp/prev_priority
should be _above_ pad2, and possibly free_area should be above
pad1.

Don't know about all the stuff below pad3.

2004-10-21 10:56:44

by Mikael Pettersson

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrew Morton writes:
> Nick Piggin <[email protected]> wrote:
> >
> > > #if defined(CONFIG_SMP)
> > > struct zone_padding {
> > > - int x;
> > > } ____cacheline_maxaligned_in_smp;
> > > #define ZONE_PADDING(name) struct zone_padding name;
> > > #else
> >
> > Perhaps to keep old compilers working? Not sure.
>
> gcc-2.95 is OK with it.

Have you verified that? GCCs up to and including 2.95.3 and
early versions of 2.96 miscompiled the kernel when spinlocks
were empty structs on UP. I.e., you might not get a compile-time
error but runtime corruption instead.

2004-10-21 13:00:24

by Andrea Arcangeli

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, Oct 21, 2004 at 12:51:18PM +0200, Mikael Pettersson wrote:
> Have you verified that? GCCs up to and including 2.95.3 and
> early versions of 2.96 miscompiled the kernel when spinlocks
> where empty structs on UP. I.e., you might not get a compile-time
> error but runtime corruption instead.

perhaps we should add a check on the compiler and force people to use
gcc >= 3?

Otherwise adding an #ifdef will fix 2.95, just like the spinlock does on
UP.

btw, the only machine where I still have gcc 2.95.3 is not up to date
enough to run 2.6, regardless of whether 2.6 could compile on such a
machine or not.
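The #ifdef Andrea alludes to would presumably mirror the UP spinlock workaround; a hypothetical sketch (the gcc version cutoff and exact spelling are illustrative, not from the thread):

```c
#if defined(CONFIG_SMP)
#if (__GNUC__ >= 3)
/* gcc 3.x and later handle empty structs correctly */
struct zone_padding {
} ____cacheline_maxaligned_in_smp;
#else
/* gcc 2.95.x reportedly miscompiled empty structs; keep a dummy member */
struct zone_padding {
	int x;
} ____cacheline_maxaligned_in_smp;
#endif
#define ZONE_PADDING(name)	struct zone_padding name;
#else
#define ZONE_PADDING(name)
#endif
```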

2004-10-21 18:57:32

by Adam Heath

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, 21 Oct 2004, Andrea Arcangeli wrote:

> On Thu, Oct 21, 2004 at 12:51:18PM +0200, Mikael Pettersson wrote:
> > Have you verified that? GCCs up to and including 2.95.3 and
> > early versions of 2.96 miscompiled the kernel when spinlocks
> > where empty structs on UP. I.e., you might not get a compile-time
> > error but runtime corruption instead.
>
> peraphs we should add a check on the compiler and force people to use
> gcc >= 3?
>
> Otherwise adding an #ifdef will fix 2.95, just like the spinlock does in
> UP.
>
> btw, the only machine where I still have gcc 2.95.3 is not uptodate
> enough to run 2.6 regardless of the fact 2.6 could compile on such
> machine or not.

So compile a 2.6 kernel on the machine with 2.95.3 for another machine.

2004-10-21 20:31:49

by DaMouse

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, 21 Oct 2004 13:54:41 -0500 (CDT), Adam Heath <[email protected]> wrote:
> On Thu, 21 Oct 2004, Andrea Arcangeli wrote:
>
> > On Thu, Oct 21, 2004 at 12:51:18PM +0200, Mikael Pettersson wrote:
> > > Have you verified that? GCCs up to and including 2.95.3 and
> > > early versions of 2.96 miscompiled the kernel when spinlocks
> > > where empty structs on UP. I.e., you might not get a compile-time
> > > error but runtime corruption instead.
> >
> > peraphs we should add a check on the compiler and force people to use
> > gcc >= 3?
> >
> > Otherwise adding an #ifdef will fix 2.95, just like the spinlock does in
> > UP.
> >
> > btw, the only machine where I still have gcc 2.95.3 is not uptodate
> > enough to run 2.6 regardless of the fact 2.6 could compile on such
> > machine or not.
>
> So compile a 2.6 kernel on the machine with 2.95.3 for another machine.
>

I think what he was referring to was that most machines with 2.95.x
have older kernels anyway.

-DaMouse


--
I know I broke SOMETHING but its there fault for not fixing it before me

2004-10-21 21:28:32

by Jon Masters

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, 21 Oct 2004 21:21:55 +0100, DaMouse <[email protected]> wrote:
> On Thu, 21 Oct 2004 13:54:41 -0500 (CDT), Adam Heath <[email protected]> wrote:
> > On Thu, 21 Oct 2004, Andrea Arcangeli wrote:
> >
> > > On Thu, Oct 21, 2004 at 12:51:18PM +0200, Mikael Pettersson wrote:
> > > > Have you verified that? GCCs up to and including 2.95.3 and
> > > > early versions of 2.96 miscompiled the kernel when spinlocks
> > > > where empty structs on UP. I.e., you might not get a compile-time
> > > > error but runtime corruption instead.
> > >
> > > peraphs we should add a check on the compiler and force people to use
> > > gcc >= 3?
> > >
> > > Otherwise adding an #ifdef will fix 2.95, just like the spinlock does in
> > > UP.
> > >
> > > btw, the only machine where I still have gcc 2.95.3 is not uptodate
> > > enough to run 2.6 regardless of the fact 2.6 could compile on such
> > > machine or not.
> >
> > So compile a 2.6 kernel on the machine with 2.95.3 for another machine.
> >
>
> I think what he was referring to was that most machines with 2.95.x
> have older kernels anyway.

That's probably mostly true even for embedded folks, but I don't think
it's a good idea to completely throw away 2.95 users just yet. Better
to use ifdefs or somesuch for now.

Jon.

2004-10-21 22:31:37

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>> #if defined(CONFIG_SMP)
>>
>> > struct zone_padding {
>> > - int x;
>> > } ____cacheline_maxaligned_in_smp;
>> > #define ZONE_PADDING(name) struct zone_padding name;
>> > #else
>>
>> Perhaps to keep old compilers working? Not sure.
>
>
> gcc-2.95 is OK with it.
>
> Stock 2.6.9:
>
> sizeof(struct zone) = 1920
>
> With Andrea's patch:
>
> sizeof(struct zone) = 1536
>
> With ZONE_PADDING removed:
>
> sizeof(struct zone) = 1408
>
>

How about this patch. 1536 before, 1152 afterwards for me.

Uses the zero-length array, which seems to be quite abundant throughout
the tree (although maybe that also causes problems when it is by itself
in an array?).

Also try to be a bit smarter about getting commonly accessed fields
together, which surely can't be worse than before.


Attachments:
mm-help-zone-padding.patch (2.64 kB)

2004-10-21 22:56:18

by Andrea Arcangeli

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Fri, Oct 22, 2004 at 08:26:47AM +1000, Nick Piggin wrote:
> Uses the zero length array which seems to be quite abundant throughout
> the tree (although maybe that also causes problems when it is by itself
> in an array?).

then we could use this on the spinlock too to drop the #ifdef. It
doesn't really matter either way; in practice it's the same.

> @@ -108,10 +108,7 @@ struct per_cpu_pageset {
> */
>
> struct zone {
> - /*
> - * Commonly accessed fields:
> - */
> - spinlock_t lock;
> + /* Fields commonly accessed by the page allocator */
> unsigned long free_pages;
> unsigned long pages_min, pages_low, pages_high;
> /*
> @@ -128,8 +125,18 @@ struct zone {
> */
> unsigned long protection[MAX_NR_ZONES];
>
> + struct per_cpu_pageset pageset[NR_CPUS];
> +
> + /*
> + * free areas of different sizes
> + */
> + spinlock_t lock;
> + struct free_area free_area[MAX_ORDER];
> +
> +
> ZONE_PADDING(_pad1_)
>
> + /* Fields commonly accessed by the page reclaim scanner */
> spinlock_t lru_lock;
> struct list_head active_list;
> struct list_head inactive_list;
> @@ -137,10 +144,8 @@ struct zone {
> unsigned long nr_scan_inactive;
> unsigned long nr_active;
> unsigned long nr_inactive;
> - int all_unreclaimable; /* All pages pinned */
> unsigned long pages_scanned; /* since last reclaim */
> -
> - ZONE_PADDING(_pad2_)
> + int all_unreclaimable; /* All pages pinned */
>
> /*
> * prev_priority holds the scanning priority for this zone. It is
> @@ -161,10 +166,9 @@ struct zone {
> int temp_priority;
> int prev_priority;
>
> - /*
> - * free areas of different sizes
> - */
> - struct free_area free_area[MAX_ORDER];
> +
> + ZONE_PADDING(_pad2_)
> + /* Rarely used or read-mostly fields */
>
> /*
> * wait_table -- the array holding the hash table
> @@ -194,10 +198,6 @@ struct zone {
> unsigned long wait_table_size;
> unsigned long wait_table_bits;
>
> - ZONE_PADDING(_pad3_)
> -
> - struct per_cpu_pageset pageset[NR_CPUS];
> -
> /*
> * Discontig memory support fields.
> */
> @@ -206,12 +206,13 @@ struct zone {
> /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
> unsigned long zone_start_pfn;
>
> + unsigned long spanned_pages; /* total size, including holes */
> + unsigned long present_pages; /* amount of memory (excluding holes) */
> +
> /*
> * rarely used fields:
> */
> char *name;
> - unsigned long spanned_pages; /* total size, including holes */
> - unsigned long present_pages; /* amount of memory (excluding holes) */
> } ____cacheline_maxaligned_in_smp;

looks reasonable. The only con is that it rejects against my tree ;). pages_*
and protection are gone in my tree, replaced by watermarks[] using the
very same optimal and proven algo from 2.4 (enabled by default, of course).
I'll reevaluate the false sharing later on.

2004-10-22 00:49:41

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:

>
>looks reasonable. only cons is that this rejects on my tree ;), pages_*
>and protection is gone in my tree, replaced by watermarks[] using the
>very same optimal and proven algo of 2.4 (enabled by default of course).
>I'll reevaluate the false sharing later on.
>

May I again ask what you think is wrong with ->protection[] apart from
it being turned off by default? (I don't think our previous conversation
ever reached a conclusion...)

2004-10-22 01:18:02

by Andrea Arcangeli

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Fri, Oct 22, 2004 at 10:34:13AM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
> >
> >looks reasonable. only cons is that this rejects on my tree ;), pages_*
> >and protection is gone in my tree, replaced by watermarks[] using the
> >very same optimal and proven algo of 2.4 (enabled by default of course).
> >I'll reevaluate the false sharing later on.
> >
>
> May I again ask what you think is wrong with ->protection[] apart from
> it being turned off by default? (I don't think our previous conversation
> ever reached a conclusion...)

the API is flawed; how can we set it up by default? If somebody tweaks
pages_min it gets screwed.

plus it's worthless to have pages_min/low/high and protection[] when you
can combine the two things together. Having pages_min/low/high and
protection combined, when protection itself is calculated as a function of
pages_min/low/high, just creates confusion. I believe this comment
explains it well enough:

/*
* setup_per_zone_protection - called whenver min_free_kbytes or
* sysctl_lower_zone_protection changes. Ensures that each zone
* has a correct pages_protected value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
*
* This algorithm is way confusing. I tries to keep the same
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
* behavior
* as we had with the incremental min iterative algorithm.
*/

I have to agree with the comment; if I have to start doing math on top of
this algorithm, it's almost faster to replace it with the 2.4 one, which
has clear semantics.

An example of the runtime behaviour of the very well understood
lowmem_reserve from 2.4; it's easier to make an example than to explain
it in words:

1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
            = (16M dma, 784M normal, 224M high)

sysctl defaults are:

int sysctl_lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };

the results will be:

1) a NORMAL allocation will leave 784M/256 of ram reserved in ZONE_DMA

2) a HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL

3) a HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA

I invented this algorithm to scale on all machines and to make the
memory reservation unnoticeable and only beneficial. With this API it is
trivial to tune and to understand what will happen; the default values
look good and are proven by years of production use in 2.4. Most
important: it's purely a function of the sizes of the zones; the
pages_min levels have nothing to do with it.

I'm still unsure whether the 2.6 lower_zone_protection completely mimics
the 2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
settings accordingly, but I believe it's easier to drop it and replace it
with a clear, understandable API that also drops the pages_min levels
(which have no reason to exist anymore) than to leave it in its current
state and start reverse-engineering the pages_min algorithm to tune it
from userspace (in the hope nobody ever tweaks the pages_min calculation
in the kernel, to avoid breaking userspace, which would require
kernel-internal knowledge to have a chance of tuning lowmem_protection
from an rc.d script).

2004-10-22 01:36:15

by Andrew Morton

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli <[email protected]> wrote:
>
> I'm still unsure if the 2.6 lower_zone_protection completely mimics the
> 2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
> settings accordingly, but I believe it's easier to drop it and replace
> with a clear understandable API that as well drops the pages_min levels
> that have no reason to exists anymore

I'd be OK with wapping over to the watermark version, as long as we have
runtime-settable levels.

But I'd be worried about making the default values anything other than zero
because nobody seems to be hitting the problems.

But then again, this gets discussed so infrequently that by the time it
comes around again I've forgotten all the previous discussion. Ho hum.

2004-10-22 03:11:04

by Jesse Barnes

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thursday, October 21, 2004 8:26 pm, Andrew Morton wrote:
> I'd be OK with wapping over to the watermark version, as long as we have
> runtime-settable levels.
>
> But I'd be worried about making the default values anything other than zero
> because nobody seems to be hitting the problems.

Yes, please keep the default at 0 regardless of the algorithm. On the NUMA
systems I'm aware of, an incremental min just doesn't make any sense.

Thanks,
Jesse

2004-10-22 03:11:04

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:
> On Fri, Oct 22, 2004 at 10:34:13AM +1000, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>
>>>looks reasonable. only cons is that this rejects on my tree ;), pages_*
>>>and protection is gone in my tree, replaced by watermarks[] using the
>>>very same optimal and proven algo of 2.4 (enabled by default of course).
>>>I'll reevaluate the false sharing later on.
>>>
>>
>>May I again ask what you think is wrong with ->protection[] apart from
>>it being turned off by default? (I don't think our previous conversation
>>ever reached a conclusion...)
>
>
> the API is flawed, how can set it up by default? if somebody tweaks
> pages_min it get screwed.
>

The setup code leaves a bit to be desired, but for the purpose of
kswapd and the page allocator they are fine.

> plus it's worthless to have pages_min/low/high and protection[] when you
> can combine the two things together. those pages_min/low/high and
> protection combined when protection itself is calculated in function of
> pages_min/low/high just creates confusion. I believe this comments
> explains it well enough:
>

I don't agree, there are times when you need to know the bare pages_xxx
watermark, and times when you need to know the whole ->protection thing.

> /*
> * setup_per_zone_protection - called whenver min_free_kbytes or
> * sysctl_lower_zone_protection changes. Ensures that each zone
> * has a correct pages_protected value, so an adequate number of
> * pages are left in the zone after a successful __alloc_pages().
> *
> * This algorithm is way confusing. I tries to keep the same
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> * behavior
> * as we had with the incremental min iterative algorithm.
> */
>
> I've to agree with the comment, if I've to start doing math on top of
> this algorithm, it's almost faster to replace it with the 2.4 one that
> has clear semantics.
>
> Example of the runtime behavaiour of the very well understood
> lowmem_reserve from 2.4, it's easier to make an example than to explain
> it with words:
>
> 1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
> 1G machine -> (16M dma, 784M normal, 224M high)
>
> sysctl defaults are:
>
> int sysctl_lower_zone_reserve_ratio[MAX_NR_ZONES-1] = { 256, 32 };
>
> the results will be:
>
> 1) NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
>
> 2) HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
>
> 3) HIGHMEM allocation will (224M+784M)/256 of ram reserved in ZONE_DMA
>
> I invented this algorithm to scale in all machines and to make the
> memory reservation not noticeable and only beneficial. With this API is
> trivial to tune and to understand what will happen, the default value
> looks good and they're proven by years of production in 2.4. Most
> important: it's only in function of the sizes of the zones, the
> pages_min levels have nothing to do with it.
>

OK, I don't disagree that your setup calculations are much nicer, and
the current ones are pretty broken...

> I'm still unsure if the 2.6 lower_zone_protection completely mimics the
> 2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
> settings accordingly, but I believe it's easier to drop it and replace
> with a clear understandable API that as well drops the pages_min levels
> that have no reason to exists anymore, than to leave it in its current
> state and to start reversing pages_min algorithm to tune it from
> userspace (in the hope nobody could ever tweak pages_min calculation in
> the kernel, to avoid breaking the userspace that would require
> kernel-internal knowledge to have a chance to tune lowmem_protection
> from a rc.d script).
>

But please no wholesale replacement of the ->pages_xxx / ->protection
thing unless you really show it is needed (which I'm pretty sure it
isn't). alloc_pages is very nice right now ;)

2004-10-22 03:28:46

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
>
>>I'm still unsure if the 2.6 lower_zone_protection completely mimics the
>> 2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
>> settings accordingly, but I believe it's easier to drop it and replace
>> with a clear understandable API that as well drops the pages_min levels
>> that have no reason to exists anymore
>
>
> I'd be OK with wapping over to the watermark version, as long as we have
> runtime-settable levels.
>

Please no "wapping" over :) This release is the first time the allocator
has been anywhere near working properly in this area.

Of course, if Andrea shows that the ->protection racket isn't sufficient,
then yeah.

> But I'd be worried about making the default values anything other than zero
> because nobody seems to be hitting the problems.
>
> But then again, this get discussed so infrequently that by the time it
> comes around again I've forgotten all the previous discussion. Ho hum.
>

I think they probably should be turned on. A system with a gig of ram
shouldn't be able to use up all of ZONE_DMA on pagecache. It seems like
a small price to pay... same goes for very big highmem systems and ZONE_NORMAL.

2004-10-22 03:37:26

by Andrew Morton

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> >
> >>I'm still unsure if the 2.6 lower_zone_protection completely mimics the
> >> 2.4 lowmem_zone_reserve algorithm if tuned by reversing the pages_min
> >> settings accordingly, but I believe it's easier to drop it and replace
> >> with a clear understandable API that as well drops the pages_min levels
> >> that have no reason to exists anymore
> >
> >
> > I'd be OK with wapping over to the watermark version, as long as we have
> > runtime-settable levels.
> >
>
> Please no "wapping" over :) This release is the first time the allocator
> has been anywhere near working properly in this area.
>
> Of course, if Andrea shows that the ->protection racket isn't sufficient,
> then yeah.

Well yes, I spose the answer as always is "show me a testcase". But the
lack of reports from the field has weight.

> > But I'd be worried about making the default values anything other than zero
> > because nobody seems to be hitting the problems.
> >
> > But then again, this get discussed so infrequently that by the time it
> > comes around again I've forgotten all the previous discussion. Ho hum.
> >
>
> I think they probably should be turned on. A system with a gig of ram
> shouldn't be able to use up all of ZONE_DMA on pagecache. It seems like
> a small price to pay... same goes for very big highmem systems and ZONE_NORMAL.

Problem is, how much lower zone memory do you reserve? If someone is
really getting hit by this in real life then the answer for their workload
is probably "lots". If they are not getting hit then the answer is "none".

Any halfway setting will screw everyone.

2004-10-22 03:41:40

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>

>>I think they probably should be turned on. A system with a gig of ram
>>shouldn't be able to use up all of ZONE_DMA on pagecache. It seems like
>>a small price to pay... same goes for very big highmem systems and ZONE_NORMAL.
>
>
> Problem is, how much lower zone memory do you reserve? If someone is
> really getting hit by this in real life then the answer for their workload
> is probably "lots". If they are not getting hit then the answer is "none".
>

Yeah you might be right... although the ZONE_NORMAL can still be used
for other things like slab caches.

> Any halfway setting will screw everyone.
>
>


I guess what we really need to do is find someone who is getting hit by it.
Andrea do you have any pointers?

2004-10-22 03:46:58

by Nick Piggin

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Jesse Barnes wrote:
> On Thursday, October 21, 2004 8:26 pm, Andrew Morton wrote:
>
>>I'd be OK with wapping over to the watermark version, as long as we have
>>runtime-settable levels.
>>
>>But I'd be worried about making the default values anything other than zero
>>because nobody seems to be hitting the problems.
>
>
> Yes, please keep the default at 0 regardless of the algorithm. On the NUMA
> systems I'm aware of, an incremental min just doesn't make any sense.
>

That problem shouldn't exist any more, so on your one-zone-per-node (?)
NUMA systems the incremental min won't have any effect at all.

That said, it isn't something that we should just turn on and see
who yells.

2004-10-22 03:52:08

by Jesse Barnes

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thursday, October 21, 2004 10:38 pm, Nick Piggin wrote:
> That problem shouldn't exist any more, so your one zone per node (?)
> NUMA systems, incremental min won't have any effect at all.

Well, it used to affect us: as the allocator iterated over nodes, the
incremental min would increase, and so by the time we hit the 3rd or so node,
we were leaving quite a bit of memory unused. I just don't want to return to
the bad old days.

> That said, it isn't something that we should just turn on and see
> who yells.

Agreed.

Thanks,
Jesse

2004-10-22 10:09:33

by DaMouse

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, 21 Oct 2004 22:24:44 +0100, Jon Masters <[email protected]> wrote:
> On Thu, 21 Oct 2004 21:21:55 +0100, DaMouse <[email protected]> wrote:
>
>
> > On Thu, 21 Oct 2004 13:54:41 -0500 (CDT), Adam Heath <[email protected]> wrote:
> > > On Thu, 21 Oct 2004, Andrea Arcangeli wrote:
> > >
> > > > On Thu, Oct 21, 2004 at 12:51:18PM +0200, Mikael Pettersson wrote:
> > > > > Have you verified that? GCCs up to and including 2.95.3 and
> > > > > early versions of 2.96 miscompiled the kernel when spinlocks
> > > > > where empty structs on UP. I.e., you might not get a compile-time
> > > > > error but runtime corruption instead.
> > > >
> > > > peraphs we should add a check on the compiler and force people to use
> > > > gcc >= 3?
> > > >
> > > > Otherwise adding an #ifdef will fix 2.95, just like the spinlock does in
> > > > UP.
> > > >
> > > > btw, the only machine where I still have gcc 2.95.3 is not uptodate
> > > > enough to run 2.6 regardless of the fact 2.6 could compile on such
> > > > machine or not.
> > >
> > > So compile a 2.6 kernel on the machine with 2.95.3 for another machine.
> > >
> >
> > I think what he was referring to was that most machines with 2.95.x
> > have older kernels anyway.
>
> That's probably mostly true even for embedded folks, but I don't think
> it's a good idea to completely throw away 2.95 users just yet. Better
> to use ifdefs or somesuch for now.
>
> Jon.
>

Perhaps making gcc 4.x a target for 2.7 would be a good idea though?

-DaMouse

--
I know I broke SOMETHING but its there fault for not fixing it before me

2004-10-22 15:50:03

by Andrea Arcangeli

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, Oct 21, 2004 at 06:26:51PM -0700, Andrew Morton wrote:
> But I'd be worried about making the default values anything other than zero
> because nobody seems to be hitting the problems.

Google reported the bug two years ago, and that's when I first saw and
fixed the problem. The same workload that hurt two years ago would still
hurt today on 2.6 (though not on 2.4 anymore).

On an x86 32G box reserving 800M is needed, otherwise if you swapoff I'll
trivially deadlock the box, and if you implement relocation you'd be
wasting cpu. The only fear of a 32GB x86 owner is running out of
lowmem. And the only fear of a 256G x86-64 or ia64 user is running out
of DMA. They could not care less about spending 1/32 of HIGHMEM and 1/256 of
HIGHMEM+NORMAL as insurance against expensive relocation memcopies
in the middle of their peak loads and against running out of memory.

I've some VM deadlocks in my queue; admittedly they should never deadlock
but return -ENOMEM like 2.4 does, but that's an orthogonal, known
problem. I've also a DMA allocation failure on x86-64 that triggers the
oom killer, which as well might not be strictly related, but there is no
swap to relocate anonymous memory to (there cannot be swap since yast
hasn't formatted the disk yet). My current workaround has been to turn
off the oom killer during install, and that sort of worked for now,
despite somebody saying it tends to cause slowdowns.

I tried to tune the protection sysctl, but that resulted in too much
memory being reserved and a deadlock with the oom killer turned off, hence
my decision to start fixing this instead of pouring kernel-internal
knowledge of the pages_min algorithm into setting protection properly (plus I
wasn't 100% sure protection does exactly what lowmem_reserve does,
plus keeping both protection and pages_min is a waste of ram).

As for the sysctl: sure, I'm retaining those runtime-settable
improvements in 2.6; that's pretty cool compared to the non-working
command line of 2.4 (such a boot-time command line is parsed far too late
to make a difference, but nobody has ever needed to turn off the values,
so nobody (or perhaps only one person) ever noticed ;).

> But then again, this get discussed so infrequently that by the time it
> comes around again I've forgotten all the previous discussion. Ho hum.

;)

2004-10-22 17:11:53

by Andrea Arcangeli

Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Fri, Oct 22, 2004 at 01:02:24PM +1000, Nick Piggin wrote:
> I don't agree, there are times when you need to know the bare pages_xxx
> watermark, and times when you need to know the whole ->protection thing.

we'll see. I agree the current alloc_pages is quite clean, but I'm quite
tempted to have a straightforward alloc_pages as clean as 2.4's:

	for (;;) {
		zone_t *z = *(zone++);
		if (!z)
			break;

		if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
			page = rmqueue(z, order);
			if (page)
				return page;
		}
	}

2.6 is like this:

	/*
	 * Go through the zonelist once, looking for a zone with enough
	 * free pages.
	 */
	for (i = 0; (z = zones[i]) != NULL; i++) {
		min = z->pages_low + (1<<order) + z->protection[alloc_type];

		if (z->free_pages < min)
			continue;

		page = buffered_rmqueue(z, order, gfp_mask);
		if (page)
			goto got_pg;
	}


I don't see any benefit in limiting the high order; in fact it seems a
bad bug. If anything you should limit the _small_ orders, so that the
high orders have a slight chance to succeed. You're basically doing
the opposite.

The pages_low is completely useless too, for example, and it could go.
pages_min has some benefit for some extra features 2.6 provides (those
could be translated into more watermarks, to separate the "setting of
the watermarks" from the alloc_pages use of the watermarks).

> OK I dont disagree that your setup calculations are much nicer, and
> the current ones are pretty broken...

ok cool.

2004-10-22 18:42:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Fri, Oct 22, 2004 at 01:35:27PM +1000, Nick Piggin wrote:
> Andrea do you have any pointers?

here we go: the same workload that deadlocked 2.4.14pre5aa1 will deadlock
2.6 right now (but not the latest mainline 2.4, and of course not any
recent 2.4-aa)

http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&selm=linux.kernel.3BE07730.60905%40google.com
http://groups.google.com/groups?q=google+VM+deadlock&hl=en&lr=&ie=UTF-8&selm=linux.kernel.20011118092434.A1331%40athlon.random&rnum=1

2.6 has a sysctl that should fix it too, but it's disabled by default
(which makes it useless to 99% of the userbase) and it's pretty hard to
tune it to a desired value.

When I get the oom killer triggering too early, or a complete oom
deadlock, I never know if this is the lowmem_reserve missing or the oom
killer going nuts, and I have a dozen reports of spurious oom triggers
and some oom deadlocks pending. So this is certainly the first thing I
must fix.

2004-10-23 04:36:34

by Nick Piggin

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:
> On Fri, Oct 22, 2004 at 01:02:24PM +1000, Nick Piggin wrote:
>
>>I don't agree, there are times when you need to know the bare pages_xxx
>>watermark, and times when you need to know the whole ->protection thing.
>

[snip]

>
> I don't see any benefit in limiting the high order; in fact it seems a
> bad bug. If anything you should limit the _small_ orders, so that the
> high orders have a slight chance to succeed. You're basically doing
> the opposite.
>

You need the order there so someone can't allocate a huge amount
of memory and deplete all your reserves and crash the system.

For day to day running, it should barely make a difference because
the watermarks will be an order of magnitude larger.

> The pages_low is completely useless too, for example, and it could go.
> pages_min has some benefit for some extra features 2.6 provides (those
> could be translated into more watermarks, to separate the "setting of
> the watermarks" from the alloc_pages use of the watermarks).
>

AFAIKS, pages_min, pages_low and pages_high are all required for
what we want to be doing. I don't see how you could remove any one
of them and still have everything functioning properly....

I haven't really looked at 2.4 or your patches though. Maybe I
misunderstood you.

2004-10-22 17:25:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, Oct 21, 2004 at 08:26:56PM -0700, Andrew Morton wrote:
> Well yes, I spose the answer as always is "show me a testcase". But the

I already gave you a testcase ages ago. Pretty simple: get your hands on
a 2G box, boot, swapoff -a, load 1G into pagecache, malloc 1G, bzero 1G,
then open some dozen thousand files until the machine deadlocks, even
though 1G is perfectly freeable in highmem.

> Any halfway setting will screw everyone.

this stuff has run in production for 2 years at least; if anyone were
being screwed I'd already know. The only thing I know is that I have oom
deadlocks in my queue of bugs, and we'll see if this fixes them; it's
hard to tell since I don't know how much lowmem is pinned by
highmem-capable allocations (like pagetables!).

2004-10-22 17:25:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Thu, Oct 21, 2004 at 10:49:36PM -0500, Jesse Barnes wrote:
> On Thursday, October 21, 2004 10:38 pm, Nick Piggin wrote:
> > That problem shouldn't exist any more, so your one zone per node (?)
> > NUMA systems, incremental min won't have any effect at all.
>
> Well, it used to affect us, since as the allocator iterated over nodes, the
> incremental min would increase, and so by the time we hit the 3rd or so node,
> we were leaving quite a bit of memory unused. I just don't want to return to
> the bad old days.

yes, but all you care about is turning off the incremental min; you
don't really care about lowmem_reserve, because you don't have floppies
doing ZONE_DMA allocations on ia64. x86-64 would oom-kill tasks just
because insmod floppy.o was running (and this is one of the showstopper
bugs in my queue; I had to forcibly disable the oom killer to work around
it, since apparently the vm can then later free some pages after many
loops, even without swap enabled).

2004-10-23 10:00:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Sat, Oct 23, 2004 at 02:33:39PM +1000, Nick Piggin wrote:
> Andrea Arcangeli wrote:
> >I don't see any benefit in limiting the high order; in fact it seems a
> >bad bug. If anything you should limit the _small_ orders, so that the
> >high orders have a slight chance to succeed. You're basically doing
> >the opposite.
> >
>
> You need the order there so someone can't allocate a huge amount
> of memory and deplete all your reserves and crash the system.

what? the point of alloc_pages is to allow people to allocate memory.
What's the difference between one 2M allocation and 512 4k allocations? No
difference at all. In fact, if anything, the 512 4k allocations hurt a
lot more, since they can fragment the memory, while the single 2M
allocation will not fragment the memory. So if you really want to limit
something you should do the exact opposite of what the allocator is
doing.

> For day to day running, it should barely make a difference because
> the watermarks will be an order of magnitude larger.

yes, it makes little difference, this is why it doesn't hurt that much.

>>AFAIKS, pages_min, pages_low and pages_high are all required for
>>what we want to be doing. I don't see how you could remove any one
>>of them and still have everything functioning properly....
>
> I haven't really looked at 2.4 or your patches though. Maybe I
> misunderstood you.

2.4 has everything functioning properly, but it has no
pages_min/low/high; it only has the watermarks. In fact the watermarks
_are_ low/min/high. That's what I'm forward porting to 2.6 (besides
fixing minor, mostly unnoticeable but harmful bits like the order
nonsense described above).

2004-10-23 10:26:30

by Nick Piggin

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:
> On Sat, Oct 23, 2004 at 02:33:39PM +1000, Nick Piggin wrote:
>
>>Andrea Arcangeli wrote:
>>
>>>I don't see any benefit in limiting the high order; in fact it seems a
>>>bad bug. If anything you should limit the _small_ orders, so that the
>>>high orders have a slight chance to succeed. You're basically doing
>>>the opposite.
>>>
>>
>>You need the order there so someone can't allocate a huge amount
>>of memory and deplete all your reserves and crash the system.
>
>
> what? the point of alloc_pages is to allow people to allocate memory.
> What's the difference between one 2M allocation and 512 4k allocations? No
> difference at all. In fact, if anything, the 512 4k allocations hurt a
> lot more, since they can fragment the memory, while the single 2M
> allocation will not fragment the memory. So if you really want to limit
> something you should do the exact opposite of what the allocator is
> doing.
>
>
>>For day to day running, it should barely make a difference because
>>the watermarks will be an order of magnitude larger.
>
>
> yes, it makes little difference, this is why it doesn't hurt that much.
>

It is an unlikely scenario, but it is definitely good for robustness.
Especially on small memory systems where the amount allocated doesn't
have to be that large.

Let's say a 16MB system has pages_low ~= 64K, so we'll also say we've
currently got 64K free. Someone then wants to do an order-4 allocation;
OK, they succeed (assuming memory isn't fragmented) and there's 0K free.

Which is bad, because you can now get deadlocks when trying to free
memory.

>
>>AFAIKS, pages_min, pages_low and pages_high are all required for
>>what we want to be doing. I don't see how you could remove any one
>>of them and still have everything functioning properly....
>>
>>I haven't really looked at 2.4 or your patches though. Maybe I
>>misunderstood you.
>
>
> 2.4 has everything functioning properly, but it has no
> pages_min/low/high; it only has the watermarks. In fact the watermarks
> _are_ low/min/high. That's what I'm forward porting to 2.6 (besides
> fixing minor, mostly unnoticeable but harmful bits like the order
> nonsense described above).
>

Oh, if you've still got the three watermarks then that may work -
I thought you meant getting rid of one of them *completely*.

But I'm still not sure what advantage you see in moving from
pages_xxx + protection to a single watermark.

2004-10-23 11:03:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Sat, Oct 23, 2004 at 08:22:38PM +1000, Nick Piggin wrote:
> It is an unlikely scenario, but it is definitely good for robustness.
> Especially on small memory systems where the amount allocated doesn't
> have to be that large.
>
> Let's say a 16MB system pages_low ~= 64K, so we'll also say we've

btw, thinking of the watermarks as linear with the amount of memory isn't
correct. The watermarks for zone normal against zone normal (i.e. the
current pages_xx of 2.6) should have high and low limits independent of
the memory size of the machine. The low limit is what the machine needs
to avoid locking up in the PF_MEMALLOC paths, so it obviously has
absolutely nothing to do with the amount of ram in the machine.

64k sounds way too low even for a PDA that doesn't swap; there are still
PF_MEMALLOC paths in the dcache and fs methods.

but this is just a side note; let's assume 64k would be sane in this
workload (a page size smaller than 4k, which in turn requires less ram
to execute a method on each page object, would make it sane for example).

> currently got 64K free. Someone then wants to do an order 4 allocation
> OK they succeed (assuming memory isn't fragmented) and there's 0K free.
>
> Which is bad because you can now get deadlocks when trying to free
> memory.

I got what you mean; I misread that code, sorry. You're perfectly right
about order being needed in that code.

In 2.4 I had to implement it too of course, it's just much cleaner than
2.6.

static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
{
	long free = zone->free_pages - (1UL << order);
	return free >= 0 ? free : 0;
}


for (;;) {
	zone_t *z = *(zone++);
	if (!z)
		break;

	if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
		page = rmqueue(z, order);
		if (page)
			return page;
	}
}


this compares with your 2.6 code:

for (i = 0; (z = zones[i]) != NULL; i++) {
	min = z->pages_min;
	if (gfp_mask & __GFP_HIGH)
		min /= 2;
	if (can_try_harder)
		min -= min / 4;
	min += (1<<order) + z->protection[alloc_type];

	if (z->free_pages < min)
		continue;

	page = buffered_rmqueue(z, order, gfp_mask);
	if (page)
		goto got_pg;
}

When I was reading "z->free_pages < min" in your code, I was really
reading it like my code here, "zone_free_pages(z, order) > z->watermarks[class_idx].low".
I was taking for granted that the free_pages - (1UL << order) was already
accounted for in z->free_pages, because I had hidden that calculation in
a helper, so I'm not used to thinking about it while reading alloc_pages
(I assumed it was already accounted for in a different function, like in 2.4).

Sorry if I'm biased but I read and modified 2.4 many more times than
2.6.

> Oh, if you've still got the three watermarks then that may work -
> I thought you meant getting rid of one of them *completely*.
>
> But I'm still not sure what advantage you see in moving from
> pages_xxx + protection to a single watermark.

then what advantage do you get from computing pages_xx + protection at
runtime, when reading a pages_xx that already contains the protection
would be enough? I avoid computations at runtime and keep them localized
in the watermark generation. I doubt it makes much difference, but this
is the way I did it in 2.4 and it looks cleaner to me, plus it avoids
having to reinvent the wheel.

2004-10-23 16:30:51

by Nick Piggin

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:
> On Sat, Oct 23, 2004 at 08:22:38PM +1000, Nick Piggin wrote:
>
>>It is an unlikely scenario, but it is definitely good for robustness.
>>Especially on small memory systems where the amount allocated doesn't
>>have to be that large.
>>
>>Let's say a 16MB system pages_low ~= 64K, so we'll also say we've
>
>
> btw, thinking of the watermarks as linear with the amount of memory isn't
> correct. The watermarks for zone normal against zone normal (i.e. the
> current pages_xx of 2.6) should have high and low limits independent of
> the memory size of the machine. The low limit is what the machine needs
> to avoid locking up in the PF_MEMALLOC paths, so it obviously has
> absolutely nothing to do with the amount of ram in the machine.
>

No, you are right there of course. However, I think 64K will be the
reality in this case because I just did a sysrq+M and took a look
at my ZONE_DMA (ie. 16MB) limits.

However, it does have something to do with the amount of concurrency
in the system - the chance of multiple tasks in PF_MEMALLOC on a larger
system (with more tasks, more CPUs) will increase of course.

> 64k sounds way too low even for a PDA that doesn't swap; there are still
> PF_MEMALLOC paths in the dcache and fs methods.
>
> but this is just a side note; let's assume 64k would be sane in this
> workload (a page size smaller than 4k, which in turn requires less ram
> to execute a method on each page object, would make it sane for example).
>

Maybe - I haven't really looked at those paths at all... but yeah it
is a peripheral issue. We can continue that in another thread sometime
:)

>
>>currently got 64K free. Someone then wants to do an order 4 allocation
>>OK they succeed (assuming memory isn't fragmented) and there's 0K free.
>>
>>Which is bad because you can now get deadlocks when trying to free
>>memory.
>
>
> I got what you mean; I misread that code, sorry. You're perfectly right
> about order being needed in that code.
>
> In 2.4 I had to implement it too of course, it's just much cleaner than
> 2.6.
>
> static inline unsigned long zone_free_pages(zone_t * zone, unsigned int order)
> {
> 	long free = zone->free_pages - (1UL << order);
> 	return free >= 0 ? free : 0;
> }
>
>
> for (;;) {
> 	zone_t *z = *(zone++);
> 	if (!z)
> 		break;
>
> 	if (zone_free_pages(z, order) > z->watermarks[class_idx].low) {
> 		page = rmqueue(z, order);
> 		if (page)
> 			return page;
> 	}
> }
>
>
> this compares with your 2.6 code:
>
> for (i = 0; (z = zones[i]) != NULL; i++) {
> 	min = z->pages_min;
> 	if (gfp_mask & __GFP_HIGH)
> 		min /= 2;
> 	if (can_try_harder)
> 		min -= min / 4;
> 	min += (1<<order) + z->protection[alloc_type];
>
> 	if (z->free_pages < min)
> 		continue;
>
> 	page = buffered_rmqueue(z, order, gfp_mask);
> 	if (page)
> 		goto got_pg;
> }
>

Although you put 2.6 in a bad light with this code ;) __GFP_HIGH and
can_try_harder are pretty important... It does look like the continue
could be replaced with the 2.4 version's negated if statement to be a
bit cleaner, though.

> When I was reading "z->free_pages < min" in your code, I was really
> reading it like my code here, "zone_free_pages(z, order) > z->watermarks[class_idx].low".
> I was taking for granted that the free_pages - (1UL << order) was already
> accounted for in z->free_pages, because I had hidden that calculation in
> a helper, so I'm not used to thinking about it while reading alloc_pages
> (I assumed it was already accounted for in a different function, like in 2.4).
>
> Sorry if I'm biased but I read and modified 2.4 many more times than
> 2.6.
>

That's OK.

>
>>Oh if you've still got the three watermarks then that may work -
>>I thought you meant getting rid of one of the *completely*.
>>
>>But I'm still not sure what advantage you see in moving from
>>pages_xxx + protection to a single watermark.
>
>
> then what advantage do you get from computing pages_xx + protection at
> runtime, when reading a pages_xx that already contains the protection
> would be enough? I avoid computations at runtime and keep them localized
> in the watermark generation. I doubt it makes much difference, but this
> is the way I did it in 2.4 and it looks cleaner to me, plus it avoids
> having to reinvent the wheel.
>

In kswapd you really just want the pages_xxx value (well, pages_high).

2004-10-25 12:45:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

The more I read the old protection setting, the more it seems broken. It
even looks like it would buffer overflow, if it weren't by luck that
GFP_ZONETYPES == GFP_ZONEMASK, and I suspect the Non-loner version of
GFP_ZONETYPES would trigger it. I don't understand what loner means.
the online dict returns:

loner
n : a person who avoids the company or assistance of others

nonsense for a non-English speaker in the GFP_ZONETYPES context:

/* #define GFP_ZONETYPES (GFP_ZONEMASK + 1) */ /* Non-loner */
#define GFP_ZONETYPES ((GFP_ZONEMASK + 1) / 2 + 1) /* Loner */


anyway, GFP_ZONETYPES has absolutely nothing to do with this code; no
idea how it slipped into it. This is not a zonelist, this is a
node_zones array sized purely by MAX_NR_ZONES, and it makes no sense
to me to use anything but MAX_NR_ZONES to initialize it.

This is my forward port from 2.4 that should fix it.

static void setup_per_zone_lowmem_reserve(void)
{
	struct pglist_data *pgdat;
	int j, idx;

	for_each_pgdat(pgdat) {
		for (j = 0; j < MAX_NR_ZONES; j++) {
			zone_t *zone = pgdat->node_zones + j;
			unsigned long presentpages = zone->presentpages;

			zone->lowmem_reserve[j] = 0;

			for (idx = j-1; idx >= 0; idx--) {
				zone_t * lower_zone = pgdat->node_zones + idx;

				lower_zone->lowmem_reserve[j] = presetpages / lower_zone_reserve_ratio[idx];
				presentpages += lower_zone->presentpages;
			}
		}
	}
}

here a diff to compare it with current 2.6 code.

diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids -u sles-ref/mm/page_alloc.c sles/mm/page_alloc.c
--- sles-ref/mm/page_alloc.c 2004-10-25 13:50:37.287995072 +0200
+++ sles/mm/page_alloc.c 2004-10-25 14:37:23.358407528 +0200
@@ -1905,87 +1892,29 @@ void __init page_alloc_init(void)
hotcpu_notifier(page_alloc_cpu_notify, 0);
}

-static unsigned long higherzone_val(struct zone *z, int max_zone,
- int alloc_type)
-{
- int z_idx = zone_idx(z);
- struct zone *higherzone;
- unsigned long pages;
-
- /* there is no higher zone to get a contribution from */
- if (z_idx == MAX_NR_ZONES-1)
- return 0;
-
- higherzone = &z->zone_pgdat->node_zones[z_idx+1];
-
- /* We always start with the higher zone's protection value */
- pages = higherzone->protection[alloc_type];
-
- /*
- * We get a lower-zone-protection contribution only if there are
- * pages in the higher zone and if we're not the highest zone
- * in the current zonelist. e.g., never happens for GFP_DMA. Happens
- * only for ZONE_DMA in a GFP_KERNEL allocation and happens for ZONE_DMA
- * and ZONE_NORMAL for a GFP_HIGHMEM allocation.
- */
- if (higherzone->present_pages && z_idx < alloc_type)
- pages += higherzone->pages_low * sysctl_lower_zone_protection;
-
- return pages;
-}
-
/*
- * setup_per_zone_protection - called whenver min_free_kbytes or
- * sysctl_lower_zone_protection changes. Ensures that each zone
- * has a correct pages_protected value, so an adequate number of
+ * setup_per_zone_lowmem_reserve - called whenever
+ * sysctl_lower_zone_reserve_ratio changes. Ensures that each zone
+ * has a correct pages reserved value, so an adequate number of
* pages are left in the zone after a successful __alloc_pages().
- *
- * This algorithm is way confusing. I tries to keep the same behavior
- * as we had with the incremental min iterative algorithm.
*/
-static void setup_per_zone_protection(void)
+static void setup_per_zone_lowmem_reserve(void)
{
struct pglist_data *pgdat;
- struct zone *zones, *zone;
- int max_zone;
- int i, j;
+ int j, idx;

for_each_pgdat(pgdat) {
- zones = pgdat->node_zones;
+ for (j = 0; j < MAX_NR_ZONES; j++) {
+ zone_t *zone = pgdat->node_zones + j;
+ unsigned long presentpages = zone->presentpages;

- for (i = 0, max_zone = 0; i < MAX_NR_ZONES; i++)
- if (zones[i].present_pages)
- max_zone = i;
+ zone->lowmem_reserve[j] = 0;

- /*
- * For each of the different allocation types:
- * GFP_DMA -> GFP_KERNEL -> GFP_HIGHMEM
- */
- for (i = 0; i < GFP_ZONETYPES; i++) {
- /*
- * For each of the zones:
- * ZONE_HIGHMEM -> ZONE_NORMAL -> ZONE_DMA
- */
- for (j = MAX_NR_ZONES-1; j >= 0; j--) {
- zone = &zones[j];
+ for (idx = j-1; idx >= 0; idx--) {
+ zone_t * lower_zone = pgdat->node_zones + idx;

- /*
- * We never protect zones that don't have memory
- * in them (j>max_zone) or zones that aren't in
- * the zonelists for a certain type of
- * allocation (j>=i). We have to assign these
- * to zero because the lower zones take
- * contributions from the higher zones.
- */
- if (j > max_zone || j >= i) {
- zone->protection[i] = 0;
- continue;
- }
- /*
- * The contribution of the next higher zone
- */
- zone->protection[i] = higherzone_val(zone,
- max_zone, i);
+ lower_zone->lowmem_reserve[j] = presetpages / lower_zone_reserve_ratio[idx];
+ presentpages += lower_zone->presentpages;
}
}
}

quite some garbage got deleted:

andrea@dualathlon:~/devel/kernel> diffstat /tmp/z
page_alloc.c | 84 +++++++++--------------------------------------------------
1 files changed, 13 insertions(+), 71 deletions(-)
andrea@dualathlon:~/devel/kernel>

I'll adapt the rest of the code to deal with this and test it in a few
more minutes.

2004-10-25 12:49:42

by Nick Piggin

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

Andrea Arcangeli wrote:

> I'll adapt the rest of the code to deal with this and test it in a few
> more minutes.
>

The current stuff is pretty crufty. I think your changes are
far better (aside from the minor fact they won't compile!).

Also, switching to a calculation that has seen some real-world
use would be a good idea before we think about turning it on
by default.

2004-10-25 20:13:23

by Robert White

[permalink] [raw]
Subject: RE: ZONE_PADDING wastes 4 bytes of the new cacheline

As another "americanism", a "loaner" is something that is, or is to be, loaned out
for a time.

Usage:

"Where did you get the fancy car?"
"It's a loaner, the dealer is fixing my hunk-of-junk."

I'd bet the comment is a spelling error, not that I have any real opinion... 8-)

-----Original Message-----
From: [email protected] [mailto:[email protected]]
On Behalf Of Andrea Arcangeli
I don't understand what loner means.
online dict returns:

loner
n : a person who avoids the company or assistance of others



2004-10-26 00:15:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ZONE_PADDING wastes 4 bytes of the new cacheline

On Mon, Oct 25, 2004 at 10:49:32PM +1000, Nick Piggin wrote:
> The current stuff is pretty crufty. I think your changes are
> far better (aside from the minor fact they won't compile!).

;) yes, I still have to finish hooking the new code into the
page_alloc.c, this was just a preview.

dropping pages_xx would have been a bigger change, so I'm avoiding it for
now, since I agree I would need to add more fields than just min/low/high
for the realtime and GFP_HIGH features (and most importantly I couldn't
remove those 2 branches anyway). In 2.4 I was ignoring __GFP_HIGH. I
doubt it makes much difference in practice, but it makes sense in theory,
so I'll keep those new features of course (especially the RT one could
help in practice too).

> Also, switching to a calculation that has seen some real-world
> use would be a good idea before we think about turning it on
> by default.

agreed. The value tuning is quite simple to understand this way: it is
tuned exactly to start reserving the entire lowmem on a 32G x86 box.
31G/32 ≈ 0.97G > 800m.