2004-10-30 14:11:16

by Andrea Arcangeli

Subject: PG_zero

This experiment is incremental with lowmem_reserve-3 (downloadable in the
same place), and it's against 2.6.9; it rejects against kernel CVS, but
it should be easy to fix up.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2

I think it's much better to have PG_zero in the main page allocator than
to put the ptes in the slab. This way we can share available zero pages with
all zero-page users and we have a central place where people can
generate zero pages and allocate them later efficiently.

This gives the whole buddy system and the per-cpu subsystem internal
knowledge of zero pages.

I also refile zeropages from the hot-cold quicklist to the zero
quicklist, clearing them if needed, with the idle task.

This microbenchmark runs 3 times faster with the patch applied:

#include <stdlib.h>
#include <strings.h>
#include <stdio.h>
#include <asm/msr.h>            /* for rdtscl() */

#define SIZE (1024*1024*4)

int main(void)
{
        unsigned long before, after;
        unsigned long *p = malloc(SIZE), *end = p + (SIZE / sizeof(long));

        rdtscl(before);
        /* write one word per page so every page gets faulted in
           (and therefore zeroed by the kernel) */
        while (p < end) {
                *p = 0;
                p += 4096 / sizeof(long);
        }
        rdtscl(after);

        printf("time %lu\n", after - before);

        return 0;
}

One fix included in the patch is to fall back to the quicklists of the
whole classzone before eating from the buddy; otherwise 1G boxes are
heavily penalized by entering the buddy system too early instead of
using the quicklists of the lower zones (2.4-aa wasn't penalized). Plus
this adds a sysctl so the thing is tunable at runtime. And there was no
need to use two quicklists for cold and hot pages; fewer resources are
wasted by just using the LRU ordering to differentiate hot/cold
allocations and hot/cold freeing.
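
To make the single-list discipline concrete, it boils down to roughly
this (only a sketch with made-up helper names, assuming the usual
<linux/list.h> API and the 2.6 struct per_cpu_pages; it's not the
literal code in the patch):

        /* hot frees and hot allocations work at the head of the list,
           cold frees and cold allocations at the tail, so the single
           list stays LRU-ordered from cache-hot (head) to cache-cold
           (tail) */
        static void pcp_list_free(struct per_cpu_pages *pcp,
                                  struct page *page, int cold)
        {
                if (cold)
                        list_add_tail(&page->lru, &pcp->list);
                else
                        list_add(&page->lru, &pcp->list);
                pcp->count++;
        }

        static struct page *pcp_list_alloc(struct per_cpu_pages *pcp,
                                           int cold)
        {
                struct page *page;

                if (!pcp->count)
                        return NULL;
                if (cold)
                        page = list_entry(pcp->list.prev, struct page, lru);
                else
                        page = list_entry(pcp->list.next, struct page, lru);
                list_del(&page->lru);
                pcp->count--;
                return page;
        }

One list, two ends, and the lru order does the rest, which is why the
second list (and the ram it pins) isn't needed.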

The API with PG_zero is that if you set __GFP_ZERO in the gfp_mask, then
you must check PG_zero. If PG_zero is set, then you don't need to clear
the page. However, you must clear PG_zero before freeing the page if its
contents are not zero anymore by the time you free it, or future users
of __GFP_ZERO will be screwed. So the pagetables for example never clear
PG_zero for the lifetime of the page; in fact they set PG_zero if
they're forced to execute clear_page. shmem likewise, in a failure path
where it fails to get an entry, frees the zero page again and doesn't
have to touch PG_zero since it didn't modify the page contents.
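
From the caller's point of view the contract is roughly this (a sketch
only: PageZero()/ClearPageZero() are made-up names for the PG_zero bit
accessors, and page_was_dirtied is a placeholder for whatever the
caller knows about the page contents):

        struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);

        if (!page)
                return -ENOMEM;
        if (!PageZero(page))
                /* no pre-zeroed page was available, clear it ourselves */
                clear_page(page_address(page));

        /* ... use the page knowing it starts out all zeroes ... */

        /* before freeing: if the contents are no longer zero, the bit
           must go away, or a later __GFP_ZERO allocation would see
           stale data */
        if (page_was_dirtied)
                ClearPageZero(page);
        __free_page(page);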

Zero pages try to go into the zero quicklist; if the zero quicklist is
full they go into the hot-cold one, but they retain the PG_zero
information. If even that is full they fall back into the buddy with the
"batch" pass, and even the buddy system retains the PG_zero info, which
is zero cost to retain and may still be useful later. For __GFP_ZERO
allocations I only look at whether something is available in the zero
quicklist; I don't search for zero pages in the other lists. The idle
task as well is 100% scalable, since it only tries to refile pages from
the hot-cold list to the zero quicklist, and only if the zero quicklist
isn't full yet and the hot-cold list isn't empty (so normally the idle
task does nothing and it never even does a cli). Only the idle task is
capable of refiling the stuff and potentially taking advantage of the
PG_zero info in the pages that are outside the zero quicklist. The idle
clearing can be disabled via sysctl; then the overhead is just 1
cacheline for the sysctl in the idle task.
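
The idle-time refiling then amounts to something like this (again only
a sketch, reusing the made-up helpers above; sysctl_idle_page_zeroing
and SetPageZero are invented names, and the real code also limits
itself to one page per "halt" wakeup, as explained below):

        /* run from the idle loop; touches only this CPU's lists, so
           there is no lock and no cross-CPU cacheline traffic */
        static void idle_refill_zero_list(struct per_cpu_pages *hot_cold,
                                          struct per_cpu_pages *zero)
        {
                struct page *page;

                if (!sysctl_idle_page_zeroing)
                        return;
                /* the common case: nothing to do, not even a cli */
                if (zero->count >= zero->high || !hot_cold->count)
                        return;

                local_irq_disable();
                page = pcp_list_alloc(hot_cold, 1);     /* take a cold page */
                local_irq_enable();
                if (!page)
                        return;

                if (!PageZero(page)) {
                        clear_highpage(page);
                        SetPageZero(page);
                }

                local_irq_disable();
                pcp_list_free(zero, page, 0);
                local_irq_enable();
        }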

__GFP_ONLY_ZERO can be used in combination with __GFP_ZERO;
__GFP_ONLY_ZERO will not fall back into the hot-cold quicklist and it
won't fall back into the buddy, and it is atomic as well. This is only
used by the wp fault, which detects whether the ZERO_PAGE was
instantiated in the pte and avoids the copy-page in that case (if we can
find a zeropage during the wp fault we don't even drop locks and remap
the high-pte).
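
In do_wp_page() terms the idea is roughly (a sketch: __GFP_ONLY_ZERO
exists only in this patch, install_new_page is a placeholder label,
and all the pte/locking details are elided):

        if (old_page == ZERO_PAGE(address)) {
                /* only succeeds if a pre-zeroed page is already sitting
                   in this CPU's zero quicklist; it never enters the buddy */
                new_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO |
                                      __GFP_ONLY_ZERO);
                if (new_page) {
                        /* already zero: no copy_user_highpage() and no
                           need to drop the page_table_lock meanwhile */
                        goto install_new_page;
                }
                /* otherwise fall through to the usual allocate-and-copy
                   path */
        }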

Not sure if this experiment will turn out to be worthwhile or just a
useless complication (I can't measure a difference with any real-life
program), but since it works, it's stable, and the above
microbenchmark got a 200% boost (the writeprotect fault after zeropage
I expect will get a 200% boost too), it seems worth posting. The idle
zeroing is the most suspect part, since it has a chance to hurt caches,
but it's only refilling the quicklist, with total scalability. I would
never attempt to do idle zeroing in the buddy (I think I was negative
about patches trying to do that); that could never work well with the
spinlock bouncing, and there's way too much free memory for each cpu,
so it would keep freeing ram too frequently, throwing away a lot more
cache. One more detail: I clear at most 1 page for each "halt" wakeup.
Not sure about monitor/mwait; mainline doesn't support it anyway, and I
don't remember offhand whether mwait gets a wakeup after an irq handler,
but anyway that bit is fixable. I believe this is the best design to
handle zero page caching with idle zeroing on top of it. One can always
increase the size of the per-cpu queues via sysctl if more zero pages
are needed.

shmem allocations as well should get a nice boost in any micro
benchmark, since they're now using the native PG_zero too.

Obvious improvements would be to implement a long_write_zero(ptr)
operation that doesn't pollute the cache. IIRC it exists on the alpha, I
assume it exists on x86/x86-64 too. But that's incremental on top of
this.
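
On x86 the cache-avoiding clear could look something like this (a
userspace-flavoured sketch with SSE2 non-temporal stores, only to
illustrate the instruction pattern; a kernel version would need the
usual FPU/SSE state handling or a movnti-based variant, and none of
this is in the patch):

        #include <emmintrin.h>          /* SSE2 intrinsics */

        /* zero one 4k page without pulling its lines into the CPU cache */
        static void clear_page_nocache(void *page)
        {
                __m128i zero = _mm_setzero_si128();
                char *p = page;         /* assumed 16-byte (page) aligned */
                int i;

                for (i = 0; i < 4096; i += 64) {
                        _mm_stream_si128((__m128i *)(p + i +  0), zero);
                        _mm_stream_si128((__m128i *)(p + i + 16), zero);
                        _mm_stream_si128((__m128i *)(p + i + 32), zero);
                        _mm_stream_si128((__m128i *)(p + i + 48), zero);
                }
                _mm_sfence();   /* order the streaming stores before reuse */
        }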

It seems stable, I'm running it while writing this.

I guess testing on a memory bound architecture would be more interesting
(more cpus will make it more memory bound somewhat).

Comments welcome.


2004-10-30 21:09:38

by Andrew Morton

Subject: Re: PG_zero

Andrea Arcangeli <[email protected]> wrote:
>
> I think it's much better to have PG_zero in the main page allocator than
> to put the ptes in the slab. This way we can share available zero pages with
> all zero-page users and we have a central place where people can
> generate zero pages and allocate them later efficiently.

Yup.

> This gives the whole buddy system and the per-cpu subsystem internal
> knowledge of zero pages.

Makes sense. I had a go at this ages ago and wasn't able to demonstrate
much benefit on a mixed workload.

I wonder if it would help if the page zeroing in the idle thread was done
with the CPU cache disabled. It should be pretty easy to test - isn't it
just a matter of setting the cache-disable bit in the kmap_atomic()
operation?

There are quite a few patches happening in this area - the
make-kswapd-aware-of-higher-order-pages patches and the no-buddy-bitmap
patches are queued in -mm. It'll take some time to work through them
all...

2004-10-30 22:45:54

by Andrea Arcangeli

Subject: Re: PG_zero

On Sat, Oct 30, 2004 at 02:07:32PM -0700, Andrew Morton wrote:
> I wonder if it would help if the page zeroing in the idle thread was done
> with the CPU cache disabled. It should be pretty easy to test - isn't it
> just a matter of setting the cache-disable bit in the kmap_atomic()
> operation?

It's certainly an improvement, I agree; however, I don't measure a
slowdown here, so I'm not sure it will make any significant difference
either. Plus, the idle clearing code could in theory be disabled, and
what would remain after that should still be an improvement over the
current code.

I share your concern about the fact that there seems to be no speedup on
my 2-way boxes unless I microbenchmark (but if I microbenchmark the best
case the speedup is very huge). OTOH the same applies to the per-cpu
queues at large; they're only measurable on the big boxes. Overall, if
we have to use slab for the pte just to cache zero (which for sure won't
be a measurable speedup either in any small box using a _macro_
benchmark), this looks like a better design IMHO since it boosts
everything zero related, not just the pte. Plus it fixes some mistakes
in the current code (like the failure to properly utilize all the
quicklists belonging to each classzone [current code falls back into the
buddy before falling back to the lower zone quicklist], the waste of
resources in keeping the hot and cold caches separated, the pointless
low watermark in the quicklists, and other very minor details).

> There are quite a few patches happening in this area - the
> make-kswapd-aware-of-higher-order-pages patches and the no-buddy-bitmap
> patches are queued in -mm. It'll take some time to work through them
> all...

Sure, take your time (this is only an experiment so far anyway). I'll
just do the reject fixups after they're in mainline, so I have less
chance of doing useless rediff work.

2004-10-31 15:18:12

by Martin J. Bligh

Subject: Re: PG_zero

--Andrew Morton <[email protected]> wrote (on Saturday, October 30, 2004 14:07:32 -0700):

> Andrea Arcangeli <[email protected]> wrote:
>>
>> I think it's much better to have PG_zero in the main page allocator than
>> to put the ptes in the slab. This way we can share available zero pages with
>> all zero-page users and we have a central place where people can
>> generate zero pages and allocate them later efficiently.
>
> Yup.
>
>> This gives the whole buddy system and the per-cpu subsystem internal
>> knowledge of zero pages.
>
> Makes sense. I had a go at this ages ago and wasn't able to demonstrate
> much benefit on a mixed workload.
>
> I wonder if it would help if the page zeroing in the idle thread was done
> with the CPU cache disabled. It should be pretty easy to test - isn't it
> just a matter of setting the cache-disable bit in the kmap_atomic()
> operation?

I looked at the basic problem a couple of years ago (based on your own code,
IIRC Andrew) then Andy (cc'ed) did it again with cache writethrough. It
didn't provide any benefit at all, no matter what we did, and it was
finally ditched.

I wouldn't bother doing it again personally ... perhaps Andy still has
the last set of results he can send to you.

M.

2004-10-31 15:35:15

by Martin J. Bligh

Subject: Re: PG_zero

> I share your concern about the fact that there seems to be no speedup on
> my 2-way boxes unless I microbenchmark (but if I microbenchmark the best
> case the speedup is very huge). OTOH the same applies to the per-cpu
> queues at large; they're only measurable on the big boxes. Overall, if
> we have to use slab for the pte just to cache zero (which for sure won't
> be a measurable speedup either in any small box using a _macro_
> benchmark), this looks like a better design IMHO since it boosts
> everything zero related, not just the pte. Plus it fixes some mistakes
> in the current code (like the failure to properly utilize all the
> quicklists belonging to each classzone [current code falls back into the
> buddy before falling back to the lower zone quicklist], the waste of
> resources in keeping the hot and cold caches separated, the pointless
> low watermark in the quicklists, and other very minor details).

I'll non-micro-benchmark stuff for you on big machines if you want, but I've
wasted enough time coding this stuff already ;-)

BTW, the one really useful thing the whole page zeroing stuff did was to
shift the profiled cost of page zeroing out to the routine actually using
the pages, as it's no longer just do_anonymous_page taking the cache hit.

M.


2004-11-01 18:16:29

by Martin J. Bligh

Subject: Re: PG_zero

>> And there was no need to use two quicklists for cold and hot pages;
>> fewer resources are wasted by just using the LRU ordering to
>> differentiate hot/cold allocations and hot/cold freeing.
>
> Not sure if this is wise. Reclaimed pages should definitely be cache
> cold. Other freeing is assumed cache hot and LRU ordered on the hot
> list which seems right... but I think you want the cold list for page
> reclaim, don't you?

You're completely correct about the hot vs cold, but I don't think that
precludes what Andrea is suggesting ... merge into one list and use the
hot/cold ends. Mmmm ... why did we do that? I think it was to stop cold
allocations from eating into hot pages - we'd prefer them to fall back
into the buddy instead.

>> Obvious improvements would be to implement a long_write_zero(ptr)
>> operation that doesn't pollute the cache. IIRC it exists on the alpha, I
>> assume it exists on x86/x86-64 too. But that's incremental on top of
>> this.
>>
>> It seems stable, I'm running it while writing this.
>>
>> I guess testing on a memory bound architecture would be more interesting
>> (more cpus will make it more memory bound somewhat).
>>
>> Comments welcome.
>
> I have the feeling that it might not be worthwhile doing zero on idle.
> You've got a chance of blowing the cache on zeroing pages that won't be
> used for a while. If you do uncached writes then you've changed the
> problem to memory bandwidth (I guess it doesn't matter much on UP).

Yeah, we got bugger-all benefit out of it. The only thing it might do
is lower the latency on initial load-spikes, but basically you end up
paying the cache fetch cost twice. But ... numbers rule - if you can come
up with something that helps a real macro benchmark, I'll eat my
non-existent hat ;-)

M.

2004-11-01 18:31:32

by Nick Piggin

Subject: Re: PG_zero

Andrea Arcangeli wrote:
> This experiment is incremental with lowmem_reserve-3 (downloadable in the
> same place), and it's against 2.6.9; it rejects against kernel CVS, but
> it should be easy to fix up.
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.9/PG_zero-2
>

...

> One fix included in the patch is to fall back to the quicklists of the
> whole classzone before eating from the buddy; otherwise 1G boxes are
> heavily penalized by entering the buddy system too early instead of
> using the quicklists of the lower zones (2.4-aa wasn't penalized). Plus
> this adds a sysctl so the thing is tunable at runtime. And there was no
> need to use two quicklists for cold and hot pages; fewer resources are
> wasted by just using the LRU ordering to differentiate hot/cold
> allocations and hot/cold freeing.
>

Not sure if this is wise. Reclaimed pages should definitely be cache
cold. Other freeing is assumed cache hot and LRU ordered on the hot
list which seems right... but I think you want the cold list for page
reclaim, don't you?

> The API with PG_zero is that if you set __GFP_ZERO in the gfp_mask, then
> you must check PG_zero. If PG_zero is set, then you don't need to clear
> the page. However, you must clear PG_zero before freeing the page if its
> contents are not zero anymore by the time you free it, or future users
> of __GFP_ZERO will be screwed. So the pagetables for example never clear
> PG_zero for the lifetime of the page; in fact they set PG_zero if
> they're forced to execute clear_page. shmem likewise, in a failure path
> where it fails to get an entry, frees the zero page again and doesn't
> have to touch PG_zero since it didn't modify the page contents.
>

Could the API be made nicer by clearing the page for you if it didn't
find a PG_zero page?

>
> Obvious improvements would be to implement a long_write_zero(ptr)
> operation that doesn't pollute the cache. IIRC it exists on the alpha, I
> assume it exists on x86/x86-64 too. But that's incremental on top of
> this.
>
> It seems stable, I'm running it while writing this.
>
> I guess testing on a memory bound architecture would be more interesting
> (more cpus will make it more memory bound somewhat).
>
> Comments welcome.
>

I have the feeling that it might not be worthwhile doing zero on idle.
You've got a chance of blowing the cache on zeroing pages that won't be
used for a while. If you do uncached writes then you've changed the
problem to memory bandwidth (I guess it doesn't matter much on UP).

It seems like a good idea to do zero pages in the page allocator if
at all (rather than slab), but I guess you don't want to complicate
it unless it shows improvements in macro benchmarks.

Sorry my feedback isn't much; it is not based on previous experience.

2004-11-01 23:56:22

by Martin J. Bligh

Subject: Re: PG_zero

Apologies to akpm if you're not getting this directly ... OSDL is spitting
my email back as spam.

> On Mon, Nov 01, 2004 at 10:03:56AM -0800, Martin J. Bligh wrote:
>> [..] it was to stop cold
>> allocations from eating into hot pages [..]
>
> exactly, and I believe that hurts. bouncing on the global lock is going to
> hurt more than preserving an hot page (at least on a 512-way). Plus the
> cold page may very soon become hot too.

? which global lock are we talking about now? the buddy allocator? mmm,
yes, might well do. OTOH, with hot/cold pages the lock should hardly
be contended at all (512-ways scare me, yes ... but they're broken in
lots of other ways ;-) ... do we have lockmeter data from one?

> Plus you should at least allow an hot allocation to eat into the cold
> pages (which didn't happen IIRC).

I think the hotlist was set to refill from the cold list before it refilled
from the buddy ... or it was at one point.

> I simply believe using the lru ordering is a more efficient way to
> implement hot/cold behaviour and it will save some minor ram too (with
> big lists the reservation might even confuse the oom conditions, if the
> allocation is hot, but the VM frees in the cold "stopped" list). I know
> the cold list was a lot smaller so this is probably only a theoretical
> issue.

well, it'd only save RAM in theory on SMP systems where the load was
very unevenly distributed across CPUs ... it's out of the reserved pool.

>> Yeah, we got bugger-all benefit out of it. The only thing it might do
>> is lower the latency on initial load-spikes, but basically you end up
>> paying the cache fetch cost twice. But ... numbers rule - if you can come
>> up with something that helps a real macro benchmark, I'll eat my
>> non-existent hat ;-)
>
> I've no idea if it will help... I only know it helps the micro ;), but I
> don't measure any slowdown.
>
> Note that my PG_zero will boost the microbenchmark by 200% even without
> the idle zeroing enabled: if a big app quits, all its ptes will go into
> the PG_zero quicklist and the next 2M allocation of anonymous memory
> won't require clearing. That has no downside at all. That's not
> something that can be achieved with slab; plus slab wastes ram as well
> and has more overhead than PG_zero.

Let's see what it does on the macro-benchmarks ;-)

M.

2004-11-02 01:52:12

by Nick Piggin

Subject: Re: PG_zero

Andrea Arcangeli wrote:
> On Mon, Nov 01, 2004 at 10:03:56AM -0800, Martin J. Bligh wrote:
>
>>[..] it was to stop cold
>>allocations from eating into hot pages [..]
>
>
> exactly, and I believe that hurts. bouncing on the global lock is going to
> hurt more than preserving an hot page (at least on a 512-way). Plus the
> cold page may very soon become hot too.
>

Well, the lock isn't global of course. You might be better off
benchmarking on an old Intel 8-way SMP rather than a 512-way Altix :)

But nevertheless I won't say the lock will never hurt.

> Plus you should at least allow an hot allocation to eat into the cold
> pages (which didn't happen IIRC).
>
> I simply believe using the lru ordering is a more efficient way to
> implement hot/cold behaviour and it will save some minor ram too (with
> big lists the reservation might even confuse the oom conditions, if the
> allocation is hot, but the VM frees in the cold "stopped" list). I know
> the cold list was a lot smaller so this is probably only a theoretical
> issue.
>

If you don't have cold allocations eating hot pages, nor cold frees
pushing out hot pages then it may be worthwhile.

If that helps a lot, then couldn't you just have hot allocations
also check the cold list before falling back to the buddy?

I admit I didn't look closely at this - mainly the PG_zero stuff.

2004-11-02 01:51:34

by Andrea Arcangeli

Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 04:26:02AM +1100, Nick Piggin wrote:
> Not sure if this is wise. Reclaimed pages should definitely be cache
> cold. Other freeing is assumed cache hot and LRU ordered on the hot
> list which seems right... but I think you want the cold list for page
> reclaim, don't you?

I retain hot and cold info in the same list. That avoids wasting ram.

I call it hot-cold quicklist.

> Could the API be made nicer by clearing the page for you if it didn't
> find a PG_zero page?

That's what get_zeroed_page is already doing with my patch applied; I
natively and transparently support the feature via the API that already
exists for it (and it also makes sure to clear PG_zero, since there's
no guarantee that the user will clear the page again before freeing it).

> I have the feeling that it might not be worthwhile doing zero on idle.

Yes, it only boosts one workload by 200%; for the rest it seems a noop,
but I'm very far from being memory bound on my test boxes. The sysctl
will disable it at runtime, and we can make it enabled or disabled by
default by tweaking the static setting to 1 or 0 at compile time.

> You've got a chance of blowing the cache on zeroing pages that won't be
> used for a while. If you do uncached writes then you've changed the
> problem to memory bandwidth (I guess it doesn't matter much on UP).

Normally the cpu is really perfectly idle; only in a few cases do we
clear pages, and it starts clearing the hot caches first. Plus the
refiling may find pages from the buddy that are already PG_zero, and
then it will not do any clear_page at all.

> It seems like a good idea to do zero pages in the page allocator if
> at all (rather than slab), but I guess you don't want to complicate
> it unless it shows improvements in macro benchmarks.
>
> Sorry my feedback isn't much, it is not based on previous experience.

I seriously doubt any previous experience is similar to what I've
implemented, but I completely missed that previous work (probably
because I was doing mostly 2.4 work while you were developing 2.5), so I
may be completely wrong. Still, lots of things are going wrong in the
current per-cpu lists and those are all fixed; plus this is mostly
for the pte_quicklist that was dropped by mistake from 2.6 (I just
got a complaint about it from SGI saying latency is too high, and that's
what drove me to do PG_zero, to close that bug; but underway I found
many more wrong things, and eventually I figured out the only design I
would ever let my cpu use to do idle zeroing on top of what I already
implemented, though I certainly agree that's the least interesting part
of the patch).

2004-11-01 23:26:25

by Martin J. Bligh

Subject: Re: PG_zero

--On Monday, November 01, 2004 22:57:09 +0100 Andrea Arcangeli <[email protected]> wrote:

> On Sun, Oct 31, 2004 at 07:35:03AM -0800, Martin J. Bligh wrote:
>> I'll non-micro-benchmark stuff for you on big machines if you want, but I've
>> wasted enough time coding this stuff already ;-)
>>
>> BTW, the one really useful thing the whole page zeroing stuff did was to
>> shift the profiled cost of page zeroing out to the routine actually using
>> the pages, as it's no longer just do_anonymous_page taking the cache hit.
>
> I'm not a big fan of the idle zeroing myself even though I implemented it.
> The only performance difference I can measure is the microbenchmark
> running 3 times faster, everything else seems the same.
>
> However the real point of the patch is to address all other issues with
> the per-cpu list and to fixup the lack of pte_quicklist in 2.6, and to
> avoid wasting zeropages (like the pte_quicklists did) by sharing them
> with all other page-zero users.
>
> I'm fine to drop the idle zeroing stuff, it just took another half an
> hour to add it so I did it just in case (it can be already disabled via
> sysctl of course).
>
> BTW, how did you implement idle zeroing? Did you have two per-cpu lists,
> where the idle zeroing only refiles across the two per-cpu lists like I
> did?

I did a nasty hack (I was just testing perf). I think Andy added another
per-cpu queue, when he redid my stuff properly.

> Did you address the COW with zeropage? The design I did for it is
> the only one I could imagine being beneficial at all; I would never attempt
> taking any spinlock on the idle task.

I really don't see much point in the COW'ed zeropage thing - it just
generates more minor faults. From the stuff I measured, I couldn't find
anything that read from an allocated page, and never wrote to it. I've
turned it off before, and just given them their own page straight away,
and it had no perf impact.

> No idea if you also were just
> refiling pages across two per-cpu lists. I need two lists anyways, one
> is the hot-cold one (shared to avoid wasting memory like 2.6 mainline)
> and one is the zero-list that is needed to avoid the lack of
> pte_quicklists and to cache the PG_zero info (it actually renders
> pte_quicklists obsolete since they're less optimal at utilizing zero
> resources).

Yeah, fair enough. The only issue is that I think cold allocators could
steal hot pages the way you're doing it (i.e. if we set up a big DMA).

> Plus the patch fixes other issues like trying all per-cpu
> lists in the classzone before falling back to the buddy allocator.

Yeah, that seems like a good thing to do ... except on machines with
big highmem, won't we exhaust ZONE_NORMAL much quicker?

> It
> removes the low watermark and other minor details. The idle zeroing is
> just a minor thing in the whole patch, the point of PG_zero is to create
> an infrastructure in the main page allocator that is capable of caching
> already zero data like ptes.
>
> So far this in practice has been an improvement or a noop, and it solves
> various theoretical issues too, but I still need to test on some bigger
> box and see if it makes any difference there or not... But since it's
> rock solid here, it was good enough for posting ;)

Heh. If it's the same as what you posted before, I'll do a sniff test on
one of the big boxes here.

M

2004-11-01 23:01:34

by Andrea Arcangeli

Subject: Re: PG_zero

On Sun, Oct 31, 2004 at 07:35:03AM -0800, Martin J. Bligh wrote:
> I'll non-micro-benchmark stuff for you on big machines if you want, but I've
> wasted enough time coding this stuff already ;-)
>
> BTW, the one really useful thing the whole page zeroing stuff did was to
> shift the profiled cost of page zeroing out to the routine actually using
> the pages, as it's no longer just do_anonymous_page taking the cache hit.

I'm not a big fan of the idle zeroing myself even though I implemented it.
The only performance difference I can measure is the microbenchmark
running 3 times faster, everything else seems the same.

However the real point of the patch is to address all other issues with
the per-cpu list and to fixup the lack of pte_quicklist in 2.6, and to
avoid wasting zeropages (like the pte_quicklists did) by sharing them
with all other page-zero users.

I'm fine to drop the idle zeroing stuff, it just took another half an
hour to add it so I did it just in case (it can be already disabled via
sysctl of course).

BTW, how did you implement idle zeroing? Did you have two per-cpu lists,
where the idle zeroing only refiles across the two per-cpu lists like I
did? Did you address the COW with zeropage? The design I did for it is
the only one I could imagine being beneficial at all; I would never attempt
taking any spinlock on the idle task. No idea if you also were just
refiling pages across two per-cpu lists. I need two lists anyways, one
is the hot-cold one (shared to avoid wasting memory like 2.6 mainline)
and one is the zero-list that is needed to avoid the lack of
pte_quicklists and to cache the PG_zero info (it actually renders
pte_quicklists obsolete since they're less optimal at utilizing zero
resources). Plus the patch fixes other issues like trying all per-cpu
lists in the classzone before falling back to the buddy allocator. It
removes the low watermark and other minor details. The idle zeroing is
just a minor thing in the whole patch, the point of PG_zero is to create
an infrastructure in the main page allocator that is capable of caching
already zero data like ptes.

So far this in practice has been an improvement or a noop, and it solves
various theoretical issues too, but I still need to test on some bigger
box and see if it makes any difference there or not... But since it's
rock solid here, it was good enough for posting ;)

2004-11-02 03:42:13

by William Lee Irwin III

Subject: Re: PG_zero

On Mon, Nov 01, 2004 at 10:57:09PM +0100, Andrea Arcangeli wrote:
> However the real point of the patch is to address all other issues with
> the per-cpu list and to fixup the lack of pte_quicklist in 2.6, and to
> avoid wasting zeropages (like the pte_quicklists did) by sharing them
> with all other page-zero users.

For the record, I had pte prezeroing patches written and posted prior
to 2.6.0, and had maintained them for a year when I gave up on them.


-- wli

2004-11-02 03:03:45

by Nick Piggin

Subject: Re: PG_zero

Andrea Arcangeli wrote:
> On Mon, Nov 01, 2004 at 11:34:19PM +0100, Andrea Arcangeli wrote:
>
>>On Mon, Nov 01, 2004 at 10:03:56AM -0800, Martin J. Bligh wrote:
>>
>>>[..] it was to stop cold
>>>allocations from eating into hot pages [..]
>>
>>exactly, and I believe that hurts. bouncing on the global lock is going to
>>hurt more than preserving an hot page (at least on a 512-way). Plus the
>>cold page may very soon become hot too.
>
>
> I've read your reply via the web in some archive since my email is down
> right now (probably some upgrade is underway).
>
>
> with global I mean the buddy lock, which is global for all cpus.
>

Hmm... yeah, they're global, but also per-zone. And most of the
page activity will be coming from the node-local CPUs, hopefully.

I guess the exception could be things like ZONE_NORMAL for big
highmem systems, but I think we don't want to be doing too much
more to accommodate those dinosaurs nowadays.


> The idea that the quicklist is meant to take the lock once every X pages
> is limiting. The object is to never ever have to enter the buddy, not
> just to "buffer" allocations. The two separated cold/hot lists prevents
> that. As far as there's a single page available we should use it since
> bouncing the cacheline is very costly.
>

Well yeah. I'm still unconvinced that the cacheline is bouncing
all that much.. I'm sure some of the guys at SGI will have profiles
lying around for their bigger systems though.

> It's really a question of whether you believe the cache effects are going to be
> more significant than the cacheline bouncing on the zone lock. I do
> believe the latter is a several-orders-of-magnitude more severe issue.
> Hence I allow fallbacks both ways (I even allow fallback across the zero
> pages that carry more info than just a hot cache). Mainline doesn't
> allow any fallback at all instead and that's certainly wrong.
>
> BTW, please apply PG_zero-2-no-zerolist-reserve-1 on top of PG_zero-2
> too if you're going to run any bench (PG_zero-2 alone doesn't give
> zeropages to non-zero users and I don't like it that much even if it
> mirrors the quicklists better). And to make a second run to disable the
> idle clearing sysctl you can just write 0 into the last entry of the
> per_cpu_pages sysctl.
>
> If you can run the bench with several cpus and with mem=1G that
> will put into action other bugfixes as well.
>
>
>>Plus you should at least allow an hot allocation to eat into the cold
>>pages (which didn't happen IIRC).
>
>
> all I could see is that it doesn't fallback in 2.6.9, and that's fixed
> with the single list of course ;). Plus I provide hot-cold caching on
> the zero list too (zero list guarantees all pages have PG_zero set, but
> that's the only difference with the hot-cold list). cold info is
> retained in the zero list too so you can freely allocate with __GFP_ZERO
> and __GFP_COLD, or even __GFP_ZERO|__GFP_ONLY_ZERO|__GFP_COLD etc...
>

OK well I guess any changes are a matter for the numbers to decide,
speculation only gets one so far :)

(Great to see you're getting more time to work on 2.6 BTW).

2004-11-02 02:28:45

by Andrea Arcangeli

Subject: Re: PG_zero

On Mon, Nov 01, 2004 at 11:34:19PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 01, 2004 at 10:03:56AM -0800, Martin J. Bligh wrote:
> > [..] it was to stop cold
> > allocations from eating into hot pages [..]
>
> exactly, and I believe that hurts. bouncing on the global lock is going to
> hurt more than preserving an hot page (at least on a 512-way). Plus the
> cold page may very soon become hot too.

I've read your reply via the web in some archive since my email is down
right now (probably some upgrade is underway).


with global I mean the buddy lock, which is global for all cpus.

The idea that the quicklist is meant to take the lock once every X pages
is limiting. The object is to never ever have to enter the buddy, not
just to "buffer" allocations. The two separated cold/hot lists prevents
that. As far as there's a single page available we should use it since
bouncing the cacheline is very costly.

It's really a question of whether you believe the cache effects are going to be
more significant than the cacheline bouncing on the zone lock. I do
believe the latter is a several-orders-of-magnitude more severe issue.
Hence I allow fallbacks both ways (I even allow fallback across the zero
pages that carry more info than just a hot cache). Mainline doesn't
allow any fallback at all instead and that's certainly wrong.

BTW, please apply PG_zero-2-no-zerolist-reserve-1 on top of PG_zero-2
too if you're going to run any bench (PG_zero-2 alone doesn't give
zeropages to non-zero users and I don't like it that much even if it
mirrors the quicklists better). And to make a second run to disable the
idle clearing sysctl you can just write 0 into the last entry of the
per_cpu_pages sysctl.

If you can run the bench with several cpus and with mem=1G that
will put into action other bugfixes as well.

> Plus you should at least allow an hot allocation to eat into the cold
> pages (which didn't happen IIRC).

all I could see is that it doesn't fallback in 2.6.9, and that's fixed
with the single list of course ;). Plus I provide hot-cold caching on
the zero list too (zero list guarantees all pages have PG_zero set, but
that's the only difference with the hot-cold list). cold info is
retained in the zero list too so you can freely allocate with __GFP_ZERO
and __GFP_COLD, or even __GFP_ZERO|__GFP_ONLY_ZERO|__GFP_COLD etc...

2004-11-02 05:13:52

by Andrea Arcangeli

Subject: Re: PG_zero

On Mon, Nov 01, 2004 at 10:03:56AM -0800, Martin J. Bligh wrote:
> [..] it was to stop cold
> allocations from eating into hot pages [..]

exactly, and I believe that hurts. bouncing on the global lock is going to
hurt more than preserving an hot page (at least on a 512-way). Plus the
cold page may very soon become hot too.

Plus you should at least allow an hot allocation to eat into the cold
pages (which didn't happen IIRC).

I simply believe using the lru ordering is a more efficient way to
implement hot/cold behaviour and it will save some minor ram too (with
big lists the reservation might even confuse the oom conditions, if the
allocation is hot, but the VM frees in the cold "stopped" list). I know
the cold list was a lot smaller so this is probably only a theoretical
issue.

> Yeah, we got bugger-all benefit out of it. The only thing it might do
> is lower the latency on initial load-spikes, but basically you end up
> paying the cache fetch cost twice. But ... numbers rule - if you can come
> up with something that helps a real macro benchmark, I'll eat my
> non-existent hat ;-)

I've no idea if it will help... I only know it helps the micro ;), but I
don't measure any slowdown.

Note that my PG_zero will boost the microbenchmark by 200% even without
the idle zeroing enabled: if a big app quits, all its ptes will go into
the PG_zero quicklist and the next 2M allocation of anonymous memory
won't require clearing. That has no downside at all. That's not
something that can be achieved with slab; plus slab wastes ram as well
and has more overhead than PG_zero.

2004-11-02 15:10:54

by Andy Whitcroft

Subject: Re: PG_zero

Martin J. Bligh wrote:
> --Andrew Morton <[email protected]> wrote (on Saturday, October 30, 2004 14:07:32 -0700):
>
>
>>Andrea Arcangeli <[email protected]> wrote:
>>
>>>I think it's much better to have PG_zero in the main page allocator than
>>> to put the ptes in the slab. This way we can share available zero pages with
>>> all zero-page users and we have a central place where people can
>>> generate zero pages and allocate them later efficiently.
>>
>>Yup.
>>
>>
>>> This gives the whole buddy system and the per-cpu subsystem internal
>>> knowledge of zero pages.
>>
>>Makes sense. I had a go at this ages ago and wasn't able to demonstrate
>>much benefit on a mixed workload.
>>
>>I wonder if it would help if the page zeroing in the idle thread was done
>>with the CPU cache disabled. It should be pretty easy to test - isn't it
>>just a matter of setting the cache-disable bit in the kmap_atomic()
>>operation?
>
>
> I looked at the basic problem a couple of years ago (based on your own code,
> IIRC Andrew) then Andy (cc'ed) did it again with cache writethrough. It
> didn't provide any benefit at all, no matter what we did, and it was
> finally ditched.
>
> I wouldn't bother doing it again personally ... perhaps Andy still has
> the last set of results he can send to you.

I'll have a look out for the results, they should be around somewhere?

The work I did was based on the idea that it _had_ to be more efficient
to have pre-zero'd pages available instead of wasting the hot pages by
scrubbing them on use. I did a lot of work to introduce a new queue for
these, per-cpu, scrubbing them using spare cycles, even trying a
version where we didn't pollute the cache with them by using uncached
write-combining. The results all showed (on i386 machines at least)
that the predominant cost was cache warmth. It was slower to fetch the
clean zero page from memory than it was to clean out all of the cache
page for use. The colder the page the slower the system went.

-apw

2004-11-02 15:52:15

by Martin J. Bligh

Subject: Re: PG_zero

> The idea that the quicklist is meant to take the lock once every X pages
> is limiting. The object is to never ever have to enter the buddy, not
> just to "buffer" allocations.

That'd be nicer, yes.

> The two separate cold/hot lists prevent that. As long as there's a single
> page available we should use it since bouncing the cacheline is very costly.

Well, it doesn't really prevent it ... it used to work, I think. However,
the current code appears to be broken - what it's meant to do is refill
the hot list from the cold list if the hot list was completely empty.
I could've sworn it used to do that, but no matter ... I think we just
need to fix up the bit in buffered_rmqueue() here:

        if (pcp->count <= pcp->low)
                pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);

To say something more like:

        if (pcp->count <= pcp->low && !cold)
                <shift some pages from cold to hot>
        if (pcp->count <= pcp->low)
                pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);

Though I'm less than convinced in retrospect that there was any point in
having low watermarks, rather than running it down to zero. Andrew, can
you recall why we did that?
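
For what it's worth, the <shift some pages from cold to hot> placeholder
could be something like this (purely a sketch against the 2.6.9
per_cpu_pageset; pcp_hot and pcp_cold are just names for the two
struct per_cpu_pages entries):

        while (pcp_hot->count <= pcp_hot->low && pcp_cold->count) {
                /* move one page from the cold list to the tail of the
                   hot list, so genuinely hot pages still get handed out
                   first */
                struct page *page = list_entry(pcp_cold->list.next,
                                               struct page, lru);
                list_del(&page->lru);
                pcp_cold->count--;
                list_add_tail(&page->lru, &pcp_hot->list);
                pcp_hot->count++;
        }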

> It's really a question of whether you believe the cache effects are going to be
> more significant than the cacheline bouncing on the zone lock.

Exactly. The disadvantage of the single list is that cold allocs can steal
hot pages, which we believe are precious, and as CPUs get faster, will only
get more so (insert McKenney's bog roll here).

> I do
> believe the latter is a several-orders-of-magnitude more severe issue.
> Hence I allow fallbacks both ways (I even allow fallback across the zero
> pages that carry more info than just a hot cache). Mainline doesn't
> allow any fallback at all instead and that's certainly wrong.

Mmmm. Are you actually seeing lock contention on the buddy allocator on
a real benchmark? I haven't seen it since the advent of hot/cold pages.
Remember it's not global at all on the larger systems, since they're NUMA.

> all I could see is that it doesn't fallback in 2.6.9, and that's fixed
> with the single list of course ;). Plus I provide hot-cold caching on
> the zero list too (zero list guarantees all pages have PG_zero set, but
> that's the only difference with the hot-cold list). cold info is
> retained in the zero list too so you can freely allocate with __GFP_ZERO
> and __GFP_COLD, or even __GFP_ZERO|__GFP_ONLY_ZERO|__GFP_COLD etc...

Mmmm. I'm still very worried this'll exhaust ZONE_NORMAL on highmem systems,
and keep remote pages around instead of returning them to their home node
on NUMA systems. Not sure that's really what we want, even if we gain a
little in spinlock and immediate cache cost. Nor will it be easy to measure
since it'll only do anything under memory pressure, and the perf there is
notoriously unstable for measurement.

M.

2004-11-02 19:41:02

by Andrea Arcangeli

Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 01:53:09PM +0000, Andy Whitcroft wrote:
> I'll have a look out for the results, they should be around somewhere?

Could you give a hint on which workload to use for the measurements?

> page for use. The colder the page the slower the system went.

This is expected. This is why I waste no more than 4k of cache for each
local-apic irq, and this is also why I don't waste any cache at all if
the zerolist is already full or the hot-cold list is empty (plus, if
there are PG_zero pages in the hot-cold list, since I cache PG_zero deep
down in the buddy, I don't waste any cache at all by refiling them into
the zero quicklist). Not sure if you were using the same design. Note
that idle zeroing is a worthless feature in my patch; I never intended
to implement it, I just happened to be able to implement it with a
trivial change so I did (and it can be disabled via sysctl). All the
important stuff is the bugfixes, obsoleting the inefficient slab for pte
allocation, and creating a superset of the 2.4 pte_quicklist that is
missing in 2.6 by mistake.

Note that I'm keeping track of hot and cold cache for the zerolist too.
Plus I can disable the idle zeroing and still be able to cache zero
information in O(1) (so then caching zero will be zero-cost, plus I'll
keep track of the hottest zero page and the coldest one).

2004-11-02 20:01:07

by Andrea Arcangeli

Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 07:42:18AM -0800, Martin J. Bligh wrote:
> the current code appears to be broken - what it's meant to do is refill

agreed.

> Though I'm less than convinced in retrospect that there was any point in
> having low watermarks, rather than running it down to zero. Andrew, can

There is no point in the low watermark as far as I can tell.

Both the above and this have been fixed in my patch.

> Exactly. The disadvantage of the single list is that cold allocs can steal
> hot pages, which we believe are precious, and as CPUs get faster, will only
> get more so (insert McKenney's bog roll here).

a cold page can become hot anytime soon (just when the I/O has
completed and we map it into userspace), plus you're throwing at the
problem twice as much memory as I do to get the same boost.

> Mmmm. Are you actually seeing lock contention on the buddy allocator on
> a real benchmark? I haven't seen it since the advent of hot/cold pages.
> Remember it's not global at all on the larger systems, since they're NUMA.

yes, the pgdat helps there. But certainly doing a single cli should be
faster than a spinlock + buddy algorithms.

This is also to drop down the probability of having to call into the
higher latency of the buddy allocator; this is why the per-cpu lists
exist in UP too. So the more we can work inside the per-cpu lists the
better. This is why the last thing I consider the per-cpu lists to be
is a "buffer". The buffering is really a "refill", because we were
unlucky and no other free_page refilled it, so we have to enter the slow
path.

> Mmmm. I'm still very worried this'll exhaust ZONE_NORMAL on highmem systems,

See my patch, I've all the protection code on top of it. Andi as well
was worried about that and he was right, but then I've fixed it.

> and keep remote pages around instead of returning them to their home node
> on NUMA systems. Not sure that's really what we want, even if we gain a

in my early version of the code I had a:

+       for (i = 0; (z = zones[i]) != NULL; i++) {
+#ifdef CONFIG_NUMA
+               /* discontigmem doesn't need this, only true numa needs this */
+               if (zones[0]->zone_pgdat != z->zone_pgdat)
+                       break;
+#endif

I dropped it from the latest just to make it a bit simpler and because I
started to suspect on the new numa it might be faster to use the hot
cache on remote memory, than cold cache on local memory. So effectively
I believe removing the above was going to optimize x86-64 which is the
only numa arch I run anyways. The above code can be made conditional
under an #ifdef CONFIG_SLOW_NUMA of course.

> little in spinlock and immediate cache cost. Nor will it be easy to measure
> since it'll only do anything under memory pressure, and the perf there is
> notoriously unstable for measurement.

Under mem pressure it's worthless to measure performance, I agree.
However the mainline code under mem pressure was very wrong and it could
have generated early OOM kills by mistake on big boxes with lots of ram,
since you prevented hot allocations from getting ram from the cold
quicklist (the breakage I noticed and fixed, and that you acknowledged
at the top of the email) and that would lead to the VM freeing memory
and the application going oom anyway.

2004-11-02 21:05:58

by Andrew Morton

Subject: Re: PG_zero

"Martin J. Bligh" <[email protected]> wrote:
>
> > The idea that the quicklist is meant to take the lock once every X pages
> > is limiting. The object is to never ever have to enter the buddy, not
> > just to "buffer" allocations.
>
> That'd be nicer, yes.
>
> > The two separate cold/hot lists prevent that. As long as there's a single
> > page available we should use it since bouncing the cacheline is very costly.
>
> Well, it doesn't really prevent it ... it used to work, I think. However,
> the current code appears to be broken - what it's meant to do is refill
> the hot list from the cold list if the hot list was completely empty.
> I could've sworn it used to do that, but no matter

We discussed it, but iirc we worked out that it wouldn't have any useful
effect on the frequency at which we take the buddy lock.

It adds more code and it adds special knowledge of hotness and coldness to
the core page allocator. The intention was that the argument called `cold'
could be renamed to `pcp_index' or whatever so that we could add more head
arrays in the future. For known-to-be-zero pages.

> ... I think we just
> need to fix up the bit in buffered_rmqueue() here:
>
>         if (pcp->count <= pcp->low)
>                 pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);
>
> To say something more like:
>
>         if (pcp->count <= pcp->low && !cold)
>                 <shift some pages from cold to hot>
>         if (pcp->count <= pcp->low)
>                 pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);
>
> Though I'm less than convinced in retrospect that there was any point in
> having low watermarks, rather than running it down to zero. Andrew, can
> you recall why we did that?

Nope.

> > It's really a question of whether you believe the cache effects are going to be
> > more significant than the cacheline bouncing on the zone lock.
>
> Exactly. The disadvantage of the single list is that cold allocs can steal
> hot pages, which we believe are precious, and as CPUs get faster, will only
> get more so (insert McKenney's bog roll here).

The cold pages are mainly intended to be the pages which will be placed
under DMA transfers. We should never return hot pages in response for a
request for a cold page.

2004-11-02 21:57:25

by Andrea Arcangeli

Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 01:09:10PM -0800, Andrew Morton wrote:
> The cold pages are mainly intended to be the pages which will be placed
> under DMA transfers. We should never return hot pages in response for a
> request for a cold page.

after the DMA transfer often the cpu will touch the data contents
(all the pagein/reads do that) and the previously cold page will become
hotter than the other hot pages you left in the hot list. I doubt
there's any difference between a cache snoop or a recycle of some cache
entry because we run out of cache (in turn making some random hot cache
as cold). There's a window of time during the dma that may run faster by
allocating hot cache, but in the same window of time some other task may
as well free some hot data, in turn avoiding entering the buddy at all
and to take the zone lock.

2004-11-02 22:46:35

by Martin J. Bligh

Subject: Re: PG_zero

> On Tue, Nov 02, 2004 at 01:09:10PM -0800, Andrew Morton wrote:
>> The cold pages are mainly intended to be the pages which will be placed
>> under DMA transfers. We should never return hot pages in response for a
>> request for a cold page.
>
> after the DMA transfer often the cpu will touch the data contents
> (all the pagein/reads do that) and the previously cold page will become
> hotter than the other hot pages you left in the hot list. I doubt

eh? I don't see how that matters at all. After the DMA transfer, all the
cache lines will have to be invalidated in every CPUs cache anyway, so
it's guaranteed to be stone-dead zero-degrees-kelvin cold. I don't see how
however hot it becomes afterwards is relevant?

> there's any difference between a cache snoop or a recycle of some cache
> entry because we run out of cache (in turn making some random hot cache
> as cold). There's a window of time during the dma that may run faster by
> allocating hot cache, but in the same window of time some other task may
> as well free some hot data, in turn avoiding entering the buddy at all
> and to take the zone lock.

If the DMA is to pages that are hot in the CPUs cache - it's WORSE ... we
have more work to do in terms of cacheline invalidates. Mmm ... in terms
of DMAs, we're talking about disk reads (ie a new page allocates) - we're
both on the same page there, right?

M.

2004-11-03 00:23:32

by Martin J. Bligh

Subject: Re: PG_zero

>> Exactly. The disadvantage of the single list is that cold allocs can steal
>> hot pages, which we believe are precious, and as CPUs get faster, will only
>> get more so (insert McKenney's bog roll here).
>
> a cold page can become hot anytime soon (just when the I/O has
> completed and we map it into userspace),

See other email ... this makes non sense to me.

> plus you're throwing at the
> problem twice as much memory as I do to get the same boost.

Well .... depends on how we do it. Conceptually, I agree with you that
it's really one list. However, I'd like to stop the cold list stealing
from the hot, so I think it's easier to *implement* it as two lists,
because there's nothing in the standard list code to add a magic marker
on one element in the middle (though maybe you can think of a trick way
to do that). Moving things from one list to the other is pretty cheap
because we can use the splice operations.

>> Mmmm. Are you actually seeing lock contention on the buddy allocator on
>> a real benchmark? I haven't seen it since the advent of hot/cold pages.
>> Remember it's not global at all on the larger systems, since they're NUMA.
>
> yes, the pgdat helps there. But certainly doing a single cli should be
> faster than a spinlock + buddy algorithms.

Yes, it's certainly faster to do that one operation - no dispute from me.
However, the question is whether it's worth trading off against the
cache-warmth of pages ... that's why we wrote it that way originally, to
preserve that.

> This is also to drop down the probability of having to call into the
> higher latency of the buddy allocator; this is why the per-cpu lists
> exist in UP too. So the more we can work inside the per-cpu lists the
> better. This is why the last thing I consider the per-cpu lists to be
> is a "buffer". The buffering is really a "refill", because we were
> unlucky and no other free_page refilled it, so we have to enter the slow
> path.

yup, exactly.

>> Mmmm. I'm still very worried this'll exhaust ZONE_NORMAL on highmem systems,
>
> See my patch, I've all the protection code on top of it. Andi as well
> was worried about that and he was right, but then I've fixed it.

OK, I'll go read it.

>> and keep remote pages around instead of returning them to their home node
>> on NUMA systems. Not sure that's really what we want, even if we gain a
>
> in my early version of the code I had a:
>
> +       for (i = 0; (z = zones[i]) != NULL; i++) {
> +#ifdef CONFIG_NUMA
> +               /* discontigmem doesn't need this, only true numa needs this */
> +               if (zones[0]->zone_pgdat != z->zone_pgdat)
> +                       break;
> +#endif
>
> I dropped it from the latest just to make it a bit simpler and because I
> started to suspect on the new numa it might be faster to use the hot
> cache on remote memory, than cold cache on local memory. So effectively
> I believe removing the above was going to optimize x86-64 which is the
> only numa arch I run anyways. The above code can be made conditional
> under an #ifdef CONFIG_SLOW_NUMA of course.

Probably easier to make it a per-arch option to define, but yes, if one
arch works better one way, and others the other, we can make it optional
fairly easily in lots of ways. <insert traditional disagreement with
"new" vs "old" connotations here ...>

>> little in spinlock and immediate cache cost. Nor will it be easy to measure
>> since it'll only do anything under memory pressure, and the perf there is
>> notoriously unstable for measurement.
>
> Under mem pressure it's worthless to measure performance, I agree.
> However the mainline code under mem pressure was very wrong and it could
> have generated early OOM kills by mistake on big boxes with lots of ram,
> since you prevented hot allocations from getting ram from the cold
> quicklist (the breakage I noticed and fixed, and that you acknowledged
> at the top of the email) and that would lead to the VM freeing memory
> and the application going oom anyway.

Well, remember that the lists only contain a very small amount of memory
relative to the machine size anyway, so I'm not really worried about us
triggering early OOMs or whatever.

But however one chose to fix this, it'd require more specialization in
the allocator (knowing about hot/cold, or merging into one list), so not
sure how concerned Andrew is about that (see his other email on the subject)

2004-11-03 01:10:24

by Andrea Arcangeli

Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 02:41:15PM -0800, Martin J. Bligh wrote:
> eh? I don't see how that matters at all. After the DMA transfer, all the
> cache lines will have to be invalidated in every CPUs cache anyway, so
> it's guaranteed to be stone-dead zero-degrees-kelvin cold. I don't see how
> however hot it becomes afterwards is relevant?

If the cold page becomes hot, it means the hot pages in the hot
quicklist will become colder. The cache size is limited, so if something
becomes hot, something else will become cold.

The only difference is that the hot pages will become cold during the
dma if we return a hot page, or the hot pages will become cold while
the cpu touches the data of the previously cold page, if we return a
cold page. Or are you worried that the cache snooping is measurable?

I believe the hot-cold thing is mostly important for the hot
allocations, not for the cold ones. That the hot allocations are served
in strict LIFO order truly matters, but the cold allocations are
a grey area.

What kind of slowdown can you measure if you drop __GFP_COLD entirely?

Don't get me wrong, __GFP_COLD makes perfect sense, since it costs so
little to do it that it's most certainly worth the branch in the
allocator; but I don't think the hot pages are worth a _reservation_,
since they'll become cold anyway after the I/O has completed, so we
could have returned a hot page in the first place without slowing down
in the buddy to get it.

> If the DMA is to pages that are hot in the CPUs cache - it's WORSE ... we
> have more work to do in terms of cacheline invalidates. Mmm ... in terms
> of DMAs, we're talking about disk reads (ie a new page allocates) - we're
> both on the same page there, right?

the DMA snoops the cache for the cacheline invalidate but I don't think
it's measurable.

I would really like to see the performance difference of disabling the
__GFP_COLD thing for the allocations and forcing picking from the head
of the list (and always freeing the cold pages at the tail); I doubt you
will measure anything.

NOTE: I'm not talking about the freeing of cold pages. The freeing of
cold pages definitely must not go to the head; this way hot allocations
will keep going fast. But for reserving hot pages during cold
allocations I doubt the difference is measurable. I wonder if you have
any measurement that collides with my theory. I could be wrong of
course.

I can change my patch to reserve hot pages during cold allocations, no
problem, but I'd really like to have any measurement data before doing
that, since I feel I'd be wasting some tons of memory on a many-cpu
lots-of-ram box for a worthless cause.

2004-11-03 01:23:27

by Nick Piggin

[permalink] [raw]
Subject: Re: PG_zero

Andrea Arcangeli wrote:
> On Tue, Nov 02, 2004 at 02:41:15PM -0800, Martin J. Bligh wrote:
>
>>eh? I don't see how that matters at all. After the DMA transfer, all the
>>cache lines will have to be invalidated in every CPU's cache anyway, so
>>it's guaranteed to be stone-dead zero-degrees-kelvin cold. I don't see how
>>however hot it becomes afterwards is relevant?
>
>
> if the cold page becomes hot, it means the hot pages in the hot
> quicklist will become colder. The cache size is limited, so if something
> becomes hot, something will become cold.
>
> The only difference is that the hot pages will become cold during the
> dma if we return a hot page, or the hot pages will become cold while
> the cpu touches the data of the previously cold page, if we return a
> cold page. Or are you worried that the cache snooping is measurable?
>
> I believe the hot-cold thing is mostly important for the hot
> allocations, not for the cold ones. Serving the hot allocations in
> strict LIFO order truly matters, but the cold allocations are a grey
> area.
>
> What kind of slowdown can you measure if you drop __GFP_COLD entirely?
>
> Don't get me wrong, __GFP_COLD makes perfect sense since it costs so
> little that it's most certainly worth the branch in the allocator, but I
> don't think the hot pages are worth a _reservation_ since they'll become
> cold anyways after the I/O has completed, so we could have returned a
> hot page in the first place without slowing down in the buddy to get it.
>

I see what you mean. You could be correct that it would model cache
behaviour better to just have the last N freed "hot" pages in LIFO
order on the list, and allocate cold pages from the other end of it.

You still don't want cold freeing to pollute this list, *but* you do
want to still batch up cold freeing to amortise the buddy's lock
acquisition.

You could do that with just one list, if you gave cold pages a small
extra allowance to batch freeing if the list is full.
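
Roughly like this (a sketch only, none of these names exist;
free_bulk_cold_end() just stands for "take the zone lock once and push a
batch back into the buddy from the cold end"):

struct pcp_list {
	int count;			/* pages currently on the list */
	struct list_head list;		/* next = hottest, prev = coldest */
};

#define PCP_HIGH	96		/* made-up watermarks, really tunables */
#define PCP_BATCH	16

/* hypothetical helper: frees 'batch' pages from the cold end into the
 * buddy under one zone->lock acquisition, returns how many it freed */
int free_bulk_cold_end(struct pcp_list *pcp, int batch);

static void pcp_free(struct pcp_list *pcp, struct page *page, int cold)
{
	if (cold)
		list_add_tail(&page->lru, &pcp->list);	/* cold end */
	else
		list_add(&page->lru, &pcp->list);	/* hot end */
	pcp->count++;

	/* the "extra allowance": the page always goes on the list first,
	 * and only then a whole batch is pushed back to the buddy */
	if (pcp->count > PCP_HIGH)
		pcp->count -= free_bulk_cold_end(pcp, PCP_BATCH);
}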

>
>>If the DMA is to pages that are hot in the CPUs cache - it's WORSE ... we
>>have more work to do in terms of cacheline invalidates. Mmm ... in terms
>>of DMAs, we're talking about disk reads (ie a new page allocation) - we're
>>both on the same page there, right?
>
>
> the DMA snoops the cache for the cacheline invalidate but I don't think
> it's measurable.
>
> I would really like to see the performance difference of disabling the
> __GFP_COLD thing for the allocations and forcing picking from the head
> of the list (and always freeing the cold pages at the tail); I doubt you
> will measure anything.
>

I think you want to still take them off the cold end. Taking a
really cache hot page and having it invalidated is worse than
having some cachelines out of your combined pool of hot pages
pushed out when you heat the cold page.

2004-11-03 01:24:40

by Martin J. Bligh

[permalink] [raw]
Subject: Re: PG_zero

--On Wednesday, November 03, 2004 02:09:52 +0100 Andrea Arcangeli <[email protected]> wrote:

> On Tue, Nov 02, 2004 at 02:41:15PM -0800, Martin J. Bligh wrote:
>> eh? I don't see how that matters at all. After the DMA transfer, all the
>> cache lines will have to be invalidated in every CPU's cache anyway, so
>> it's guaranteed to be stone-dead zero-degrees-kelvin cold. I don't see how
>> however hot it becomes afterwards is relevant?
>
> if the cold page becomes hot, it means the hot pages in the hot
> quicklist will become colder. The cache size is limited, so if something
> becomes hot, something will become cold.

Aaah. OK - I see what you mean. Not sure I agree, but at least I understand
now ;-) will think on that some more, but I'm still not sure it makes any
difference.

> The only difference is that the hot pages will become cold during the
> dma if we return a hot page, or the hot pages will become cold while
> the cpu touches the data of the previously cold page, if we return a
> cold page. Or are you worried that the cache snooping is measurable?

Not really, I just don't want to waste hot pages on DMA that we could
be using for something else.

...

> NOTE: I'm not talking about the freeing of cold pages. The freeing of
> cold pages definitely must not go to the head; this way hot allocations
> will keep going fast. But for reserving hot pages during cold
> allocations I doubt the difference is measurable. I wonder if you have
> any measurement that collides with my theory. I could be wrong of
> course.

Mmmm. probably depends on cache size vs readahead windows, etc. I don't
think either of us has any numbers, so we're both postulating ;-)

> I can change my patch to reserve hot pages during cold allocations, no
> problem, but I'd really like to have any measurement data before doing
> that, since I feel I'd be wasting some tons of memory on a many-cpu
> lots-of-ram box for a worthless cause.

Why is it tons of memory? Shouldn't be more than cache-size * nr_cpus,
and as long as we fetch from the cold to the hot, it's not really wasted
at all, it's just sitting in the cache. I guess we don't steal from other
cpu's caches (maybe we ought to when we're under very heavy pressure),
but still ... I think it's a marginal amount of memory, not "tons" ;-)
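
A back-of-the-envelope with made-up but plausible numbers: say the
per-cpu high watermark is 256 pages and the box has 16 cpus, then the
lists pin at most

	256 pages * 4KB * 16 cpus = 16MB

which is a rounding error next to the tens of GB of ram on the machines
we're worried about.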

M.

2004-11-03 01:26:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: PG_zero

On Tue, Nov 02, 2004 at 02:41:12PM -0800, Martin J. Bligh wrote:
> it's really one list. However, I'd like to stop the cold list stealing
> from the hot, so I think it's easier to *implement* it as two lists,
> because there's nothing in the standard list code to add a magic marker
> on one element in the middle (though maybe you can think of a trick way


sure if we want to stop cold allocations from getting hot pages we need
two lists.

> Yes, it's certainly faster to do that one operation - no dispute from me.
> However, the question is whether it's worth trading off against the
> cache-warmth of pages ... that's why we wrote it that way originally, to
> preserve that.

my argument is very simple. A random content cold page is worthless.

When somebody allocates a free page, it will either:

1) write some useful data into it with the cpu, so the page has a value
(not the case here, since it used __GFP_COLD and we assume it was not
using __GFP_COLD by mistake)
2) use DMA to write something useful into the page

Now after 2), unless it's readahead, it will also _read_ or modify the
contents of the page. And writing into a _cold_ page with the cpu isn't
very different from doing DMA into a hot page. At least this is my
theory; I'm not a hardware designer, but there should be some cache
snooping during the invalidates that shouldn't be very different from
the cache recycling. Note that we've no clue if the cache is hot in the
other cpus as well; this is all per-cpu stuff.

Probably the only way to know if retaining the separated hot/cold list
is worthwhile is benchmarking.

I'm pretty sure __GFP_COLD makes a huge difference for the freeing of ram,
but I doubt it does for the allocation of ram. Everybody who is
allocating memory is going to write to it and touch it with the cpu
eventually.

Freed ram, on the other hand, may remain cold forever, so to me
__GFP_COLD looks more like a freeing-ram hint than an allocation hint.

There's no such thing as allocating ram and then never touching it
with the CPU anytime soon.

> Probably easier to make it a per-arch option to define, but yes, if one
> arch works better one way, and others the other, we can make it optional
> fairly easily in lots of ways. <insert traditional disagreement with
> "new" vs "old" connotations here ...>

yep ;).

> Well, remember that the lists only contain a very small amount of memory
> relative to the machine size anyway, so I'm not really worried about us
> triggering early OOMs or whatever.

I'm a bit worried about the many-cpu systems. Since I'm also not
limiting the per-cpu size as a function of the total ram of the machine,
one could have a configuration that doesn't even boot if it has too many
cpus, probably thousands of them (I doubt any system like that exists,
it's only a theoretical issue). But the closer you get to the
non-booting system, the more likely allocations will fail with oom due
to the reservation.

especially order > 0 allocations will fail very early, but for those
dropping the reservation won't help, so they're an unfixable problem
that calls for using the per-cpu resources as efficiently as possible.
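
The sizing side at least could be handled by clamping the per-cpu size
against the zone size at boot; a sketch (the function is invented and the
divisor is arbitrary, just to show the idea):

/*
 * Hypothetical clamp: never let all cpus together pin more than a small
 * fraction of the zone, however large nr_cpus is.
 */
static int pcp_high_for_zone(struct zone *zone, int nr_cpus)
{
	unsigned long high = zone->present_pages / (nr_cpus * 64);

	if (high < 4)
		high = 4;	/* keep some batching even on tiny zones */
	if (high > 256)
		high = 256;	/* no point caching more than ~1MB per cpu */
	return high;
}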

2004-11-03 02:05:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: PG_zero

On Wed, Nov 03, 2004 at 12:23:12PM +1100, Nick Piggin wrote:
> I see what you mean. You could be correct that it would model cache
> behaviour better to just have the last N freed "hot" pages in LIFO
> order on the list, and allocate cold pages from the other end of it.

I believe this is the more efficient way to use the per-cpu resources
while still providing hot/cold behaviour; we should verify it, though.

Plus I provide hot-cold info to the zerolist too, in the same way.

with the current model I'd need 4 per-cpu lists.

> You still don't want cold freeing to pollute this list, *but* you do
> want to still batch up cold freeing to amortise the buddy's lock
> acquisition.
>
> You could do that with just one list, if you gave cold pages a small
> extra allowance to batch freeing if the list is full.

Once the list is full, I free a "batch" from the other end of it. And
then I put the new page into the list.

I agree that if I have to fall back into the buddy allocator to free pages
because the list is full, then if it's a __GFP_COLD freeing, I should
put the cold page into the buddy first, while I'm putting the cold page
into the list instead. So I'm basically off by one potentially hot page.

I doubt it makes any difference (this only matters when the list is
completely full of only hot pages [unlikely, since the list size is
bigger than the l2 size and cache isn't only allocated for the freed
pages, but also for the still-allocated pages]), but I think I can
easily change it to put the page at the end first, and then to batch the
freeing. This will fix this minor performance issue in my code
(theoretically it's more efficient the way you suggested ;). It's more
an implementation bug than anything I did intentionally ;).

> I think you want to still take them off the cold end. Taking a

yes, there is no doubt I want to take cold pages from the other end. But
if no improvement can be measured by always picking hot pages, then of
course picking them from the cold end isn't going to pay off much... ;)

> really cache hot page and having it invalidated is worse than
> having some cachelines out of your combined pool of hot pages
> pushed out when you heat the cold page.

how much worse? I do understand the snooping has a cost, but that dma is
pretty slow anyways.

Or do you mean it's because the hot pages won't be guaranteed to be
completely cold, but just a few cachelines cold? I prefer the next hot
page to be guaranteed to still be hot even after the dma completes,
rather than avoiding the snooping and clobbering the cache randomly.

If you invalidate the cache of a hot page, you're guaranteed you won't
affect the next hot page; the next hot page is guaranteed to still be
hot even after the dma has completed.

I'm not sure what's more important but I don't feel reserving is right.

With PG_zero-2-no-zerolist-reserve-1 I went as far as un-reserving
zero pages during non-zero allocations, despite them being in different
lists ;). I need them in different lists to provide PG_zero efficiently
and to track hot-cold info in the zerolist too. But I stopped all
reservations, even for pages that are already zero. This way the 200%
boost in the microbench isn't guaranteed anymore, but it still very
often happens.
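
In other words the fallback order becomes roughly this (just a sketch,
the struct, the field names and pcp_pop() are invented, not the patch
verbatim):

/* hypothetical per-cpu state: a zero list plus the usual hot/cold list */
struct pcp_state {
	struct list_head zero_list;	/* pages with PG_zero still set */
	struct list_head hot_cold_list;
};

/* hypothetical helper: hot/cold-aware pop from one list, NULL if empty */
struct page *pcp_pop(struct list_head *list, unsigned int gfp_mask);

static struct page *pcp_alloc_no_reserve(struct pcp_state *pc,
					 unsigned int gfp_mask)
{
	struct page *page = NULL;

	if (gfp_mask & __GFP_ZERO)
		page = pcp_pop(&pc->zero_list, gfp_mask);   /* PG_zero set */

	if (!page)
		page = pcp_pop(&pc->hot_cold_list, gfp_mask);

	/* the un-reservation: a plain allocation can drain the zero
	 * list too, once the hot/cold list is empty */
	if (!page && !(gfp_mask & __GFP_ZERO))
		page = pcp_pop(&pc->zero_list, gfp_mask);

	return page;
}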

2004-11-03 11:53:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: PG_zero

On Wed, Nov 03, 2004 at 03:05:35AM +0100, Andrea Arcangeli wrote:
> because the list is full, then if it's a __GFP_COLD freeing, I should
> put the cold page into the buddy first, while I'm putting the cold page
> into the list instead. So I'm basically off by one potentially hot page.

Not the case, I'm already putting the page at the other end before
calling free_pages_bulk, so there's no off-by-one error and no
modification required.

maybe I misunderstood what you wanted to say.

2004-11-03 12:13:10

by Pavel Machek

[permalink] [raw]
Subject: Re: PG_zero

Hi!

> > Yeah, we got bugger-all benefit out of it. The only thing it might do
> > is lower the latency on initial load-spikes, but basically you end up

Isn't the lower latency on load-spikes pretty nice for the user? If
openoffice loads faster or something like that, it might be worth it.

Desktop systems *are* idle most of the time..
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!