2004-09-09 18:22:54

by Marcelo Tosatti

Subject: [PATCH] cacheline align pagevec structure

Hi,

I discussed this with Andrew before: we can shrink the pagevec structure to
cacheline-align it. It is used all over the VM reclaim and mpage
pagecache read code.

Right now it is 140 bytes on 64-bit and 72 bytes on 32-bit. That's just a little bit more
than a power of 2 (which would cacheline-align it), so shrink it to be aligned: 64 bytes on
32-bit and 124 bytes on 64-bit.

It now occupies two cachelines most of the time instead of three.

I changed nr and cold to "unsigned short" because they'll never reach 2^16.

I do not see a problem with changing pagevec to "15" page pointers either.
Andrew, is there a special reason for that "16"? Is it intentional, to align
to 64 kbytes (IO device alignment)? I don't think that matters much because
of the elevator, which sorts and merges requests anyway.



Did some reaim benchmarking on a 4-way PIII (32-byte cacheline) with 512MB RAM:

#### stock 2.6.9-rc1-mm4 ####

Peak load Test: Maximum Jobs per Minute 4144.44 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4007.86 (average of 3 runs)

Peak load Test: Maximum Jobs per Minute 4207.48 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 3999.28 (average of 3 runs)

#### shrink-pagevec #####

Peak load Test: Maximum Jobs per Minute 4717.88 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4360.59 (average of 3 runs)

Peak load Test: Maximum Jobs per Minute 4493.18 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 4327.77 (average of 3 runs)


--- linux-2.6.9-rc1-mm4.orig/include/linux/pagevec.h 2004-09-08 16:13:14.000000000 -0300
+++ linux-2.6.9-rc1-mm4/include/linux/pagevec.h 2004-09-08 16:48:51.703401288 -0300
@@ -5,14 +5,14 @@
* pages. A pagevec is a multipage container which is used for that.
*/

-#define PAGEVEC_SIZE 16
+#define PAGEVEC_SIZE 15

struct page;
struct address_space;

struct pagevec {
- unsigned nr;
- int cold;
+ unsigned short nr;
+ unsigned short cold;
struct page *pages[PAGEVEC_SIZE];
};
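
A minimal userspace sketch to sanity-check the 32-bit size arithmetic above (not part of the patch; struct page is left opaque, so only the pointer size matters):

#include <stdio.h>

#define PAGEVEC_SIZE 15

struct page;				/* opaque: only the pointer size matters */

struct pagevec {
	unsigned short nr;
	unsigned short cold;
	struct page *pages[PAGEVEC_SIZE];
};

int main(void)
{
	/* 2 + 2 + 15 * 4 = 64 bytes on a 32-bit ABI: two 32-byte cachelines
	 * instead of the three the old 72-byte layout needed */
	printf("sizeof(struct pagevec) = %zu bytes\n", sizeof(struct pagevec));
	return 0;
}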





2004-09-09 22:45:30

by Andrew Morton

Subject: Re: [PATCH] cacheline align pagevec structure

Marcelo Tosatti <[email protected]> wrote:
>
> Right now it is 140 bytes on 64-bit and 72 bytes on 32-bit. Thats just a little bit more
> than a power of 2 (which will cacheline align), so shrink it to be aligned: 64 bytes on
> 32bit and 124bytes on 64-bit.
>
> It now occupies two cachelines most of the time instead of three.
>
> I changed nr and cold to "unsigned short" because they'll never reach 2 ^ 16.
>
> I do not see a problem with changing pagevec to "15" page pointers either,
> Andrew, is there a special reason for that "16"? Is intentional to align
> to 64 kbytes (IO device alignment)? I dont think that matters much because
> of the elevator which sorts and merges requests anyway?
>
>
>
> Did some reaim benchmarking on 4way PIII (32byte cacheline), with 512MB RAM:
>
> #### stock 2.6.9-rc1-mm4 ####
>
> Peak load Test: Maximum Jobs per Minute 4144.44 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 4007.86 (average of 3 runs)
>
> Peak load Test: Maximum Jobs per Minute 4207.48 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 3999.28 (average of 3 runs)
>
> #### shrink-pagevec #####
>
> Peak load Test: Maximum Jobs per Minute 4717.88 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 4360.59 (average of 3 runs)
>
> Peak load Test: Maximum Jobs per Minute 4493.18 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 4327.77 (average of 3 runs)

I think the patch makes sense, but I'm very sceptical about the benchmarks
;)

2004-09-09 22:48:50

by Andrew Morton

Subject: Re: [PATCH] cacheline align pagevec structure

Marcelo Tosatti <[email protected]> wrote:
>
> I do not see a problem with changing pagevec to "15" page pointers either,
> Andrew, is there a special reason for that "16"? Is intentional to align
> to 64 kbytes (IO device alignment)? I dont think that matters much because
> of the elevator which sorts and merges requests anyway?

No, it was just a randomly-chosen batching factor.

The tradeoff here is between

a) lock acquisition frequency versus lock hold time (increasing the size
helps).

b) icache misses versus dcache misses. (increasing the size probably hurts).

I suspect that some benefit would be seen from making the size very small
(say, 4). And on some machines, making it larger might help.

2004-09-09 23:06:13

by Marcelo Tosatti

Subject: Re: [PATCH] cacheline align pagevec structure

On Thu, Sep 09, 2004 at 03:49:06PM -0700, Andrew Morton wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >
> > Right now it is 140 bytes on 64-bit and 72 bytes on 32-bit. Thats just a little bit more
> > than a power of 2 (which will cacheline align), so shrink it to be aligned: 64 bytes on
> > 32bit and 124bytes on 64-bit.
> >
> > It now occupies two cachelines most of the time instead of three.
> >
> > I changed nr and cold to "unsigned short" because they'll never reach 2 ^ 16.
> >
> > I do not see a problem with changing pagevec to "15" page pointers either,
> > Andrew, is there a special reason for that "16"? Is intentional to align
> > to 64 kbytes (IO device alignment)? I dont think that matters much because
> > of the elevator which sorts and merges requests anyway?
> >
> >
> >
> > Did some reaim benchmarking on 4way PIII (32byte cacheline), with 512MB RAM:
> >
> > #### stock 2.6.9-rc1-mm4 ####
> >
> > Peak load Test: Maximum Jobs per Minute 4144.44 (average of 3 runs)
> > Quick Convergence Test: Maximum Jobs per Minute 4007.86 (average of 3 runs)
> >
> > Peak load Test: Maximum Jobs per Minute 4207.48 (average of 3 runs)
> > Quick Convergence Test: Maximum Jobs per Minute 3999.28 (average of 3 runs)
> >
> > #### shrink-pagevec #####
> >
> > Peak load Test: Maximum Jobs per Minute 4717.88 (average of 3 runs)
> > Quick Convergence Test: Maximum Jobs per Minute 4360.59 (average of 3 runs)
> >
> > Peak load Test: Maximum Jobs per Minute 4493.18 (average of 3 runs)
> > Quick Convergence Test: Maximum Jobs per Minute 4327.77 (average of 3 runs)
>
> I think the patch make sense, but I'm very sceptical about the benchmarks
> ;)

Why's that? You think changing the number of pages in the pagevec to "15" instead
of "16" is the cause?

Or that the performance increase is not a direct effect of the one cacheline
saved per pagevec instance?

Or do you think such a benchmark is too specific to be interpreted as a broad
view of performance?

:)

2004-09-09 23:09:14

by William Lee Irwin III

Subject: Re: [PATCH] cacheline align pagevec structure

Marcelo Tosatti <[email protected]> wrote:
>> I do not see a problem with changing pagevec to "15" page pointers either,
>> Andrew, is there a special reason for that "16"? Is intentional to align
>> to 64 kbytes (IO device alignment)? I dont think that matters much because
>> of the elevator which sorts and merges requests anyway?

On Thu, Sep 09, 2004 at 03:52:26PM -0700, Andrew Morton wrote:
> No, it was just a randomly-chosen batching factor.
> The tradeoff here is between
> a) lock acquisition frequency versus lock hold time (increasing the size
> helps).
> b) icache misses versus dcache misses. (increasing the size probably hurts).
> I suspect that some benefit would be seen from making the size very small
> (say, 4). And on some machines, making it larger might help.

Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
though that may blow the stack on e.g. larger Altixen. Perhaps
O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
though we may have debates about how to evaluate lg(n) at compile-time...
Would be nice if calls to sufficiently simple __attribute__((pure))
functions with constant args were considered constant expressions by gcc.

-- wli

2004-09-09 23:17:06

by Andrew Morton

Subject: Re: [PATCH] cacheline align pagevec structure

Marcelo Tosatti <[email protected]> wrote:
>
> > I think the patch make sense, but I'm very sceptical about the benchmarks
> > ;)
>
> Why's that? You think changing to the number of pages in the pagevec to "15" instead
> "16" is the cause?

Nope. I wouldn't have expected to see a significant (or even measurable)
change in performance as a result of this patch.

After all, these structures are always stack-allocated, and top-of-stack is
almost always in L1 cache.

I'd suspect that benchmark variability is the cause here.

2004-09-09 23:19:06

by Andrew Morton

Subject: Re: [PATCH] cacheline align pagevec structure

William Lee Irwin III <[email protected]> wrote:
>
> On Thu, Sep 09, 2004 at 03:52:26PM -0700, Andrew Morton wrote:
> > No, it was just a randomly-chosen batching factor.
> > The tradeoff here is between
> > a) lock acquisition frequency versus lock hold time (increasing the size
> > helps).
> > b) icache misses versus dcache misses. (increasing the size probably hurts).
> > I suspect that some benefit would be seen from making the size very small
> > (say, 4). And on some machines, making it larger might help.
>
> Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
> though that may blow the stack on e.g. larger Altixen. Perhaps
> O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
> though we may have debates about how to evaluate lg(n) at compile-time...
> Would be nice if calls to sufficiently simple __attribute__((pure))
> functions with constant args were considered constant expressions by gcc.

Yes, that sort of thing.

It wouldn't be surprising if increasing the pagevec up to 64 slots on big
ia64 SMP provided a useful increase in some fs-intensive workloads.

One needs to watch stack consumption though.

2004-09-09 23:41:21

by Marcelo Tosatti

Subject: Re: [PATCH] cacheline align pagevec structure

On Thu, Sep 09, 2004 at 04:09:05PM -0700, William Lee Irwin III wrote:
> Marcelo Tosatti <[email protected]> wrote:
> >> I do not see a problem with changing pagevec to "15" page pointers either,
> >> Andrew, is there a special reason for that "16"? Is intentional to align
> >> to 64 kbytes (IO device alignment)? I dont think that matters much because
> >> of the elevator which sorts and merges requests anyway?
>
> On Thu, Sep 09, 2004 at 03:52:26PM -0700, Andrew Morton wrote:
> > No, it was just a randomly-chosen batching factor.
> > The tradeoff here is between
> > a) lock acquisition frequency versus lock hold time (increasing the size
> > helps).
> > b) icache misses versus dcache misses. (increasing the size probably hurts).
> > I suspect that some benefit would be seen from making the size very small
> > (say, 4). And on some machines, making it larger might help.
>
> Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
> though that may blow the stack on e.g. larger Altixen. Perhaps
> O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
> though we may have debates about how to evaluate lg(n) at compile-time...
> Would be nice if calls to sufficiently simple __attribute__((pure))
> functions with constant args were considered constant expressions by gcc.

Let me see if I get you right - basically what you're suggesting is
to make PAGEVEC_SIZE depend on NR_CPUS?

2004-09-09 23:59:40

by William Lee Irwin III

Subject: Re: [PATCH] cacheline align pagevec structure

On Thu, Sep 09, 2004 at 04:09:05PM -0700, William Lee Irwin III wrote:
>> Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
>> though that may blow the stack on e.g. larger Altixen. Perhaps
>> O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
>> though we may have debates about how to evaluate lg(n) at compile-time...
>> Would be nice if calls to sufficiently simple __attribute__((pure))
>> functions with constant args were considered constant expressions by gcc.

On Thu, Sep 09, 2004 at 07:12:38PM -0300, Marcelo Tosatti wrote:
> Let me see if I get you right - basically what you're suggesting is
> to depend PAGEVEC_SIZE on NR_CPUS?

Yes. The motive is that the arrival rate is O(num_cpus_online()), and
NR_CPUS is supposed to be strongly correlated with that, so reducing the
arrival rate by a (hopefully) similar factor ought to help.


-- wli

2004-09-10 00:13:22

by William Lee Irwin III

Subject: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III <[email protected]> wrote:
>> Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
>> though that may blow the stack on e.g. larger Altixen. Perhaps
>> O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
>> though we may have debates about how to evaluate lg(n) at compile-time...
>> Would be nice if calls to sufficiently simple __attribute__((pure))
>> functions with constant args were considered constant expressions by gcc.

On Thu, Sep 09, 2004 at 04:22:45PM -0700, Andrew Morton wrote:
> Yes, that sort of thing.
> It wouldn't be surprising if increasing the pagevec up to 64 slots on big
> ia64 SMP provided a useful increase in some fs-intensive workloads.
> One needs to watch stack consumption though.

Okay, Marcelo, looks like we need to do cache alignment work with a
variable-size pagevec.

In order to attempt to compensate for arrival rates to zone->lru_lock
increasing as O(num_cpus_online()), this patch resizes the pagevec to
O(lg(NR_CPUS)) for lock amortization that adjusts better to the size of
the system. Compile-tested on ia64.


Index: mm4-2.6.9-rc1/include/linux/pagevec.h
===================================================================
--- mm4-2.6.9-rc1.orig/include/linux/pagevec.h 2004-08-24 00:03:39.000000000 -0700
+++ mm4-2.6.9-rc1/include/linux/pagevec.h 2004-09-09 16:58:19.978150158 -0700
@@ -4,8 +4,27 @@
* In many places it is efficient to batch an operation up against multiple
* pages. A pagevec is a multipage container which is used for that.
*/
+#include <linux/config.h>
+#include <linux/threads.h>

-#define PAGEVEC_SIZE 16
+#define __PAGEVEC_SIZE_0(n, k) \
+ ((k) * !!((unsigned long)(n) > (1ULL << (!(k) ? 0 : (k) - 1)) \
+ && ((unsigned long)(n) <= (1ULL << (k)))))
+#define __PAGEVEC_SIZE_1(n, k) \
+ (__PAGEVEC_SIZE_0(n, 2*(k)+1) + __PAGEVEC_SIZE_0(n, 2*(k)))
+#define __PAGEVEC_SIZE_2(n, k) \
+ (__PAGEVEC_SIZE_1(n, 2*(k)+1) + __PAGEVEC_SIZE_1(n, 2*(k)))
+#define __PAGEVEC_SIZE_3(n, k) \
+ (__PAGEVEC_SIZE_2(n, 2*(k)+1) + __PAGEVEC_SIZE_2(n, 2*(k)))
+#define __PAGEVEC_SIZE_4(n, k) \
+ (__PAGEVEC_SIZE_3(n, 2*(k)+1) + __PAGEVEC_SIZE_3(n, 2*(k)))
+#define __PAGEVEC_SIZE_5(n, k) \
+ (__PAGEVEC_SIZE_4(n, 2*(k)+1) + __PAGEVEC_SIZE_4(n, 2*(k)))
+#define __PAGEVEC_SIZE_6(n, k) \
+ (__PAGEVEC_SIZE_5(n, 2*(k)+1) + __PAGEVEC_SIZE_5(n, 2*(k)))
+#define __PAGEVEC_SIZE(n) \
+ (BITS_PER_LONG == 32 ? __PAGEVEC_SIZE_5(n, 0) : __PAGEVEC_SIZE_6(n, 0))
+#define PAGEVEC_SIZE (NR_CPUS > 1 ? 4*__PAGEVEC_SIZE(NR_CPUS) : 4)

struct page;
struct address_space;
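
Unrolled, the macro tower above is just a compile-time ceil(lg(n)): __PAGEVEC_SIZE_0(n, k) contributes k exactly when 2^(k-1) < n <= 2^k, and the _1 through _6 levels sum that test over every k up to the word size, so PAGEVEC_SIZE works out to 4*ceil(lg(NR_CPUS)) on SMP and 4 on UP. A rough userspace sketch of what it evaluates to (illustrative names, not part of the patch):

#include <stdio.h>

/* ceil(log2(n)) for n >= 1, computed the straightforward way */
static unsigned ceil_lg(unsigned long n)
{
	unsigned k = 0;

	while ((1UL << k) < n)
		k++;
	return k;
}

/* what the proposed PAGEVEC_SIZE evaluates to: 4 on UP, 4*ceil(lg(NR_CPUS)) on SMP */
static unsigned pagevec_size(unsigned long nr_cpus)
{
	return nr_cpus > 1 ? 4 * ceil_lg(nr_cpus) : 4;
}

int main(void)
{
	unsigned long n;

	for (n = 1; n <= 512; n <<= 1)
		printf("NR_CPUS=%4lu -> PAGEVEC_SIZE=%u\n", n, pagevec_size(n));
	/* e.g. 2 -> 4, 16 -> 16, 512 -> 36 */
	return 0;
}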

2004-09-10 05:45:31

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:
> William Lee Irwin III <[email protected]> wrote:
>
>>>Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
>>>though that may blow the stack on e.g. larger Altixen. Perhaps
>>>O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
>>>though we may have debates about how to evaluate lg(n) at compile-time...
>>>Would be nice if calls to sufficiently simple __attribute__((pure))
>>>functions with constant args were considered constant expressions by gcc.
>
>
> On Thu, Sep 09, 2004 at 04:22:45PM -0700, Andrew Morton wrote:
>
>>Yes, that sort of thing.
>>It wouldn't be surprising if increasing the pagevec up to 64 slots on big
>>ia64 SMP provided a useful increase in some fs-intensive workloads.
>>One needs to watch stack consumption though.
>
>
> Okay, Marcelo, looks like we need to do cache alignment work with a
> variable-size pagevec.
>
> In order to attempt to compensate for arrival rates to zone->lru_lock
> increasing as O(num_cpus_online()), this patch resizes the pagevec to
> O(lg(NR_CPUS)) for lock amortization that adjusts better to the size of
> the system. Compiletested on ia64.
>

Yuck. I don't like things like this to depend on NR_CPUS, because your
kernel may behave quite differently depending on the value. But in this
case I guess "quite differently" is probably "a little bit differently",
and practical reality may dictate that you need to do something like
this at compile time, and NR_CPUS is your best approximation.

That said, I *don't* think this should go in hastily.

First reason is that the lru lock is per zone, so the premise that
zone->lru_lock acquisitions increase as O(cpus) is wrong for anything large
enough to care (i.e. it will be NUMA). It is possible that a 512 CPU Altix
will see less lru_lock contention than an 8-way Intel box.

Secondly, you might really start putting pressure on small L1
caches (e.g. Itanium 2) if you bite off too much in one go. If you blow
it, you'll have to pull all the pages into cache again as you process
the pagevec.

I don't think the smallish loop overhead constant (mainly pulling a lock
and a couple of hot cachelines off another CPU) would gain much from
increasing these a lot, either. The overhead should already be at least an
order of magnitude smaller than the actual work cost.

Lock contention isn't a good argument either, because it shouldn't
significantly change as you trade off hold time vs frequency if we assume
that the lock transfer and other overheads aren't significant (which
should be a safe assumption at PAGEVEC_SIZE of >= 16, I think).

Probably a PAGEVEC_SIZE of 4 on UP would be an interesting test, because
your loop overheads get a bit smaller.

2004-09-10 05:47:32

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

Nick Piggin wrote:

>
> That said, I *don't* think this should go in hastily.
>

Of course, I would be happy for it to go in if it is shown to improve
things...

2004-09-11 21:07:16

by Marcelo Tosatti

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
> William Lee Irwin III wrote:
> >William Lee Irwin III <[email protected]> wrote:
> >
> >>>Reducing arrival rates by an Omega(NR_CPUS) factor would probably help,
> >>>though that may blow the stack on e.g. larger Altixen. Perhaps
> >>>O(lg(NR_CPUS)), e.g. NR_CPUS > 1 ? 4*lg(NR_CPUS) : 4 etc., will suffice,
> >>>though we may have debates about how to evaluate lg(n) at compile-time...
> >>>Would be nice if calls to sufficiently simple __attribute__((pure))
> >>>functions with constant args were considered constant expressions by gcc.
> >
> >
> >On Thu, Sep 09, 2004 at 04:22:45PM -0700, Andrew Morton wrote:
> >
> >>Yes, that sort of thing.
> >>It wouldn't be surprising if increasing the pagevec up to 64 slots on big
> >>ia64 SMP provided a useful increase in some fs-intensive workloads.
> >>One needs to watch stack consumption though.
> >
> >
> >Okay, Marcelo, looks like we need to do cache alignment work with a
> >variable-size pagevec.
> >
> >In order to attempt to compensate for arrival rates to zone->lru_lock
> >increasing as O(num_cpus_online()), this patch resizes the pagevec to
> >O(lg(NR_CPUS)) for lock amortization that adjusts better to the size of
> >the system. Compiletested on ia64.
> >
>
> Yuck. I don't like things like this to depend on NR_CPUS, because your
> kernel may behave quite differently depending on the value. But in this
> case I guess "quite differently" is probably "a little bit differently",
> and practical reality may dictate that you need to do something like
> this at compile time, and NR_CPUS is your best approximation.

For me Bill's patch (with the recursive thingie) is very cryptic. It's
just doing log2(n); it took me an hour to figure it out with his help.

> That said, I *don't* think this should go in hastily.
>
> First reason is that the lru lock is per zone, so the premise that
> zone->lru_lock aquisitions increases O(cpus) is wrong for anything large
> enough to care (ie. it will be NUMA). It is possible that a 512 CPU Altix
> will see less lru_lock contention than an 8-way Intel box.

Oops, right. wli's patch is borked for NUMA. Clamping it at 64 should do fine.

> Secondly is that you'll might really start putting pressure on small L1
> caches (eg. Itanium 2) if you bite off too much in one go. If you blow
> it, you'll have to pull all the pages into cache again as you process
> the pagevec.

What's the L1 cache size of Itanium2? Each page is huge compared to the pagevec
structure (you need a 64 item pagevec array on 64-bit to occupy the space of
one 4KB page). So I think you won't blow up the cache even with a really big
pagevec.

> I don't think the smallish loop overhead constant (mainly pulling a lock
> and a couple of hot cachelines off another CPU) would gain much from
> increasing these a lot, either. The overhead should already at least an
> order of magnitude smaller than the actual work cost.
>
> Lock contention isn't a good argument either, because it shouldn't
> significantly change as you tradeoff hold vs frequency if we assume
> that the lock transfer and other overheads aren't significant (which
> should be a safe assumption at PAGEVEC_SIZE of >= 16, I think).
>
> Probably a PAGEVEC_SIZE of 4 on UP would be an interesting test, because
> your loop overheads get a bit smaller.

Not very noticeable on reaim. I want to do more tests (different workloads, nr CPUs, etc).


kernel: pagevec-4
plmid: 3304
Host: stp1-002
Reaim test
http://khack.osdl.org/stp/297484
kernel: 3304
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 992.40 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 987.72 (average of 3 runs)
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

kernel: 2.6.9-rc1-mm4
plmid: 3294
Host: stp1-003
Reaim test
http://khack.osdl.org/stp/297485
kernel: 3294
Filesystem: ext3
Peak load Test: Maximum Jobs per Minute 989.85 (average of 3 runs)
Quick Convergence Test: Maximum Jobs per Minute 982.07 (average of 3 runs)
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.





2004-09-12 01:19:25

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

Marcelo Tosatti wrote:
> On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:

>>Yuck. I don't like things like this to depend on NR_CPUS, because your
>>kernel may behave quite differently depending on the value. But in this
>>case I guess "quite differently" is probably "a little bit differently",
>>and practical reality may dictate that you need to do something like
>>this at compile time, and NR_CPUS is your best approximation.
>
>
> For me Bill's patch (with the recursive thingie) is very cryptic. Its
> just doing log2(n), it took me an hour to figure it out with his help.
>

Having it depend on NR_CPUS should be avoided if possible.
But yeah in this case I guess you can't easily make it work
at runtime.

>
>>That said, I *don't* think this should go in hastily.
>>
>>First reason is that the lru lock is per zone, so the premise that
>>zone->lru_lock aquisitions increases O(cpus) is wrong for anything large
>>enough to care (ie. it will be NUMA). It is possible that a 512 CPU Altix
>>will see less lru_lock contention than an 8-way Intel box.
>
>
> Oops, right. wli's patch is borked for NUMA. Clamping it at 64 should do fine.
>

Is 16 any good? ;)

>
>>Secondly is that you'll might really start putting pressure on small L1
>>caches (eg. Itanium 2) if you bite off too much in one go. If you blow
>>it, you'll have to pull all the pages into cache again as you process
>>the pagevec.
>
>
> Whats the L1 cache size of Itanium2? Each page is huge compared to the pagevec
> structure (you need a 64 item pagevec array on 64-bits to occupy the space of
> one 4KB page). So I think you wont blow up the cache even with a really big
> pagevec.
>

I think it is 16K data cache. It is not the pagevec structure that you
are worried about, but all the cachelines from all the pages you put
into it. If you put 64 pages in it, that's 8K with a 128byte cacheline
size (the structure will be ~512 bytes on a 64-bit arch).

And if you touch one other cacheline per page, there's 16K.

So I'm just making up numbers, but the point is you obviously want to
keep it as small as possible unless you can demonstrate improvements.

>
>>I don't think the smallish loop overhead constant (mainly pulling a lock
>>and a couple of hot cachelines off another CPU) would gain much from
>>increasing these a lot, either. The overhead should already at least an
>>order of magnitude smaller than the actual work cost.
>>
>>Lock contention isn't a good argument either, because it shouldn't
>>significantly change as you tradeoff hold vs frequency if we assume
>>that the lock transfer and other overheads aren't significant (which
>>should be a safe assumption at PAGEVEC_SIZE of >= 16, I think).
>>
>>Probably a PAGEVEC_SIZE of 4 on UP would be an interesting test, because
>>your loop overheads get a bit smaller.
>
>
> Not very noticeable on reaim. I want to do more tests (different workloads, nr CPUs, etc).
>

Would be good.

>
> kernel: pagevec-4
> plmid: 3304
> Host: stp1-002
> Reaim test
> http://khack.osdl.org/stp/297484
> kernel: 3304
> Filesystem: ext3
> Peak load Test: Maximum Jobs per Minute 992.40 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 987.72 (average of 3 runs)
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>
> kernel: 2.6.9-rc1-mm4
> plmid: 3294
> Host: stp1-003
> Reaim test
> http://khack.osdl.org/stp/297485
> kernel: 3294
> Filesystem: ext3
> Peak load Test: Maximum Jobs per Minute 989.85 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 982.07 (average of 3 runs)
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>
>

To get a best-case argument for increasing the size of the structure, I guess
you'll want to set up tests to put the maximum contention on the lru_lock. That
would mean big non-NUMA boxes (e.g. OSDL's stp8 systems), lots of page reclaim so
you'll have to fill up the caches, and lots of read()'ing.

2004-09-12 04:57:10

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:
>>> In order to attempt to compensate for arrival rates to zone->lru_lock
>>> increasing as O(num_cpus_online()), this patch resizes the pagevec to
>>> O(lg(NR_CPUS)) for lock amortization that adjusts better to the size of
>>> the system. Compiletested on ia64.

On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
>> Yuck. I don't like things like this to depend on NR_CPUS, because your
>> kernel may behave quite differently depending on the value. But in this
>> case I guess "quite differently" is probably "a little bit differently",
>> and practical reality may dictate that you need to do something like
>> this at compile time, and NR_CPUS is your best approximation.

On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
> For me Bill's patch (with the recursive thingie) is very cryptic. Its
> just doing log2(n), it took me an hour to figure it out with his help.

Feel free to suggest other ways to discover lg(n) at compile-time.


On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
>> That said, I *don't* think this should go in hastily.
>> First reason is that the lru lock is per zone, so the premise that
>> zone->lru_lock aquisitions increases O(cpus) is wrong for anything large
>> enough to care (ie. it will be NUMA). It is possible that a 512 CPU Altix
>> will see less lru_lock contention than an 8-way Intel box.

On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
> Oops, right. wli's patch is borked for NUMA. Clamping it at 64 should
> do fine.

No, it DTRT. Batching does not directly compensate for increases in
arrival rates; rather, it most directly compensates for increases in lock
transfer times, which do indeed increase on systems with large numbers
of cpus.


On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
>> Secondly is that you'll might really start putting pressure on small L1
>> caches (eg. Itanium 2) if you bite off too much in one go. If you blow
>> it, you'll have to pull all the pages into cache again as you process
>> the pagevec.

On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
> Whats the L1 cache size of Itanium2? Each page is huge compared to the pagevec
> structure (you need a 64 item pagevec array on 64-bits to occupy the space of
> one 4KB page). So I think you wont blow up the cache even with a really big
> pagevec.

A 511 item pagevec is 4KB on 64-bit machines.


On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
>> I don't think the smallish loop overhead constant (mainly pulling a lock
>> and a couple of hot cachelines off another CPU) would gain much from
>> increasing these a lot, either. The overhead should already at least an
>> order of magnitude smaller than the actual work cost.
>> Lock contention isn't a good argument either, because it shouldn't
>> significantly change as you tradeoff hold vs frequency if we assume
>> that the lock transfer and other overheads aren't significant (which
>> should be a safe assumption at PAGEVEC_SIZE of >= 16, I think).
>> Probably a PAGEVEC_SIZE of 4 on UP would be an interesting test, because
>> your loop overheads get a bit smaller.

On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
> Not very noticeable on reaim. I want to do more tests (different
> workloads, nr CPUs, etc).

The results I got suggest the tests will not be significantly different
unless the machines differ significantly in the overhead of acquiring a
cacheline in an exclusive state.


On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
> kernel: pagevec-4
> plmid: 3304
> Host: stp1-002
> Reaim test
> http://khack.osdl.org/stp/297484
> kernel: 3304
> Filesystem: ext3
> Peak load Test: Maximum Jobs per Minute 992.40 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 987.72 (average of 3 runs)
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
> kernel: 2.6.9-rc1-mm4
> plmid: 3294
> Host: stp1-003
> Reaim test
> http://khack.osdl.org/stp/297485
> kernel: 3294
> Filesystem: ext3
> Peak load Test: Maximum Jobs per Minute 989.85 (average of 3 runs)
> Quick Convergence Test: Maximum Jobs per Minute 982.07 (average of 3 runs)
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.

Unsurprising. If the expected response time given batching factor K is
T(K) (which also depends on the lock transfer time), then T(K)/(K*T(1)) may
have nontrivial maxima and minima in K. I've tried it for the expected
waiting time of a few queues (e.g. M/M/1) and verified it's not a
degradation for them, though I've not gone over it in generality
(G/G/m is hard to get results of any kind for anyway). I refrained from
posting a lengthier discussion of the results.


-- wli

2004-09-12 05:20:32

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:

> On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
>
>>Oops, right. wli's patch is borked for NUMA. Clamping it at 64 should
>>do fine.
>
>
> No, it DTRT. Batching does not directly compensate for increases in
> arrival rates, rather most directly compensates for increases to lock
> transfer times, which do indeed increase on systems with large numbers
> of cpus.
>

Generally though I think you could expect the lru lock to be most
often taken by the scanner by node local CPUs. Even on the big
systems. We'll see.

>
> On Fri, Sep 10, 2004 at 02:56:11PM +1000, Nick Piggin wrote:
>
>>>Secondly is that you'll might really start putting pressure on small L1
>>>caches (eg. Itanium 2) if you bite off too much in one go. If you blow
>>>it, you'll have to pull all the pages into cache again as you process
>>>the pagevec.
>
>
> On Fri, Sep 10, 2004 at 02:49:15PM -0300, Marcelo Tosatti wrote:
>
>>Whats the L1 cache size of Itanium2? Each page is huge compared to the pagevec
>>structure (you need a 64 item pagevec array on 64-bits to occupy the space of
>>one 4KB page). So I think you wont blow up the cache even with a really big
>>pagevec.
>
>
> A 511 item pagevec is 4KB on 64-bit machines.
>

Sure. And when you fill it with pages, they'll use up 32KB of dcache
by using a single 64B line per page. Now that you've blown the cache,
when you go to move those pages to another list, you'll have to pull
them out of L2 again one at a time.

OK, so a 511 item pagevec is pretty unlikely. How about a 64 item one
with 128 byte cachelines, and you're touching two cachelines per
page operation? That's 16K.

2004-09-12 05:24:05

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

Marcelo Tosatti wrote:
>> For me Bill's patch (with the recursive thingie) is very cryptic. Its
>> just doing log2(n), it took me an hour to figure it out with his help.

On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
> Having it depend on NR_CPUS should be avoided if possible.
> But yeah in this case I guess you can't easily make it work
> at runtime.

With some work it could be tuned at boot-time.


Marcelo Tosatti wrote:
>> Oops, right. wli's patch is borked for NUMA. Clamping it at 64 should do
>> fine.

On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
> Is 16 any good? ;)

There are nontrivial differences in the optimal batching factor
dependent on the distribution of hold times and interarrival times. The
strongest dependencies of all are on the ratio of the lock transfer time to
the interarrival time and the lock transfer time itself. These appear
routinely in numerous odd places in the expressions for expected
response time, and the latter often as a constant of proportionality.


On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
>> Whats the L1 cache size of Itanium2? Each page is huge compared to the
>> pagevec structure (you need a 64 item pagevec array on 64-bits to
>> occupy the space of one 4KB page). So I think you wont blow up the
>> cache even with a really big pagevec.

On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
> I think it is 16K data cache. It is not the pagevec structure that you
> are worried about, but all the cachelines from all the pages you put
> into it. If you put 64 pages in it, that's 8K with a 128byte cacheline
> size (the structure will be ~512 bytes on a 64-bit arch).
> And if you touch one other cacheline per page, there's 16K.
> So I'm just making up numbers, but the point is you obviously want to
> keep it as small as possible unless you can demonstrate improvements.

It's unclear what you're estimating the size of. PAGEVEC_SIZE of 62
yields a 512B pagevec, for 4 cachelines exclusive to the cpu (or if
stack allocated, the task). The pagevecs themselves are not shared,
so the TLB entries for per-cpu pagevecs span surrounding per-cpu data,
not other cpus' pagevecs, and the TLB entries for stack-allocated
pagevecs are in turn shared with other stack-allocated data.


Marcelo Tosatti wrote:
>> Not very noticeable on reaim. I want to do more tests (different
>> workloads, nr CPUs, etc).

On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
> Would be good.

On Sun, Sep 12, 2004 at 10:29:56AM +1000, Nick Piggin wrote:
> To get a best case argument for increasing the size of the structure, I
> guess you'll want to setup tests to put the maximum contention on the
> lru_lock. That would mean big non NUMAs (eg. OSDL's stp8 systems),
> lots of page reclaim so you'll have to fill up the caches, and lots
> of read()'ing.

mapping->tree_lock is affected as well as zone->lru_lock. The workload
obviously has to touch the relevant locks for pagevecs to be relevant;
however, the primary factor in the effectiveness of pagevecs is the
lock transfer time, which is not likely to vary significantly on boxen
such as the OSDL STP machines. You should use a workload stressing
mapping->tree_lock via codepaths using radix_tree_gang_lookup() and
getting runtime on OSDL's NUMA-Q or otherwise asking SGI to test its
effects, otherwise you're dorking around with boxen with identical
characteristics as far as batched locking is concerned.


-- wli

2004-09-12 05:28:18

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:

>
> It's unclear what you're estimating the size of. PAGEVEC_SIZE of 62

Overhead of the loaded pagevec.

[snip]

>
> mapping->tree_lock is affected as well as zone->lru_lock. The workload
> obviously has to touch the relevant locks for pagevecs to be relevant;
> however, the primary factor in the effectiveness of pagevecs is the
> lock transfer time, which is not likely to vary significantly on boxen
> such as the OSDL STP machines. You should use a workload stressing
> mapping->tree_lock via codepaths using radix_tree_gang_lookup() and
> getting runtime on OSDL's NUMA-Q or otherwise asking SGI to test its
> effects, otherwise you're dorking around with boxen with identical
> characteristics as far as batched locking is concerned.
>

Yeah I forgot about that. I guess it probably would be easier to
get contention on the tree lock.

2004-09-12 06:27:13

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:
>> No, it DTRT. Batching does not directly compensate for increases in
>> arrival rates, rather most directly compensates for increases to lock
>> transfer times, which do indeed increase on systems with large numbers
>> of cpus.

On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
> Generally though I think you could expect the lru lock to be most
> often taken by the scanner by node local CPUs. Even on the big
> systems. We'll see.

No, I'd expect zone->lru_lock to be taken most often for lru_cache_add()
and lru_cache_del().


William Lee Irwin III wrote:
>> A 511 item pagevec is 4KB on 64-bit machines.

On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
> Sure. And when you fill it with pages, they'll use up 32KB of dcache
> by using a single 64B line per page. Now that you've blown the cache,
> when you go to move those pages to another list, you'll have to pull
> them out of L2 again one at a time.

There can be no adequate compile-time metric of L1 cache size. 64B
cachelines with a 16KB cache sounds a bit small: 256 entries, which is
even smaller than the TLBs on various systems.

In general a hard cap at the L1 cache size would be beneficial for
operations done in tight loops, but there is no adequate detection
method. Also recall that the page structures will be touched
regardless of whether they arrive in a sufficiently large
pagevec. Various pagevecs are meant to amortize locking done in
scenarios where there is no relationship between calls. Again,
lru_cache_add() and lru_cache_del() are the poster children. These
operations are often done for one page at a time in some long codepath,
e.g. fault handlers, and the pagevec is merely deferring the work until
enough has accumulated. radix_tree_gang_lookup() and mpage_readpages()
OTOH execute the operations to be done under the locks in tight loops,
where the lock acquisitions are to be done immediately by the same caller.

This differentiation between the characteristics of pagevec users
happily matches the cases where they're used on-stack and per-cpu.
In the former case, larger pagevecs are desirable, as the cachelines
will not be L1-hot regardless; in the latter, L1 size limits apply.


On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
> OK, so a 511 item pagevec is pretty unlikely. How about a 64 item one
> with 128 byte cachelines, and you're touching two cachelines per
> page operation? That's 16K.

4*lg(NR_CPUS) is 64 for 16x-31x boxen. No constant number suffices.
Adaptation to systems and the usage cases would be an improvement.


-- wli

2004-09-12 06:55:29

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:
> William Lee Irwin III wrote:
>
>>>No, it DTRT. Batching does not directly compensate for increases in
>>>arrival rates, rather most directly compensates for increases to lock
>>>transfer times, which do indeed increase on systems with large numbers
>>>of cpus.
>
>
> On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
>
>>Generally though I think you could expect the lru lock to be most
>>often taken by the scanner by node local CPUs. Even on the big
>>systems. We'll see.
>
>
> No, I'd expect zone->lru_lock to be taken most often for lru_cache_add()
> and lru_cache_del().
>

Well "lru_cache_del" will be often coming from the scanner.
lru_cache_add should be being performed on newly allocated pages,
which should be node local most of the time.

>
> William Lee Irwin III wrote:
>
>>>A 511 item pagevec is 4KB on 64-bit machines.
>
>
> On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
>
>>Sure. And when you fill it with pages, they'll use up 32KB of dcache
>>by using a single 64B line per page. Now that you've blown the cache,
>>when you go to move those pages to another list, you'll have to pull
>>them out of L2 again one at a time.
>
>
> There can be no adequate compile-time metric of L1 cache size. 64B
> cachelines with 16KB caches sounds a bit small, 256 entries, which is
> even smaller than TLB's on various systems.
>

Although I'm pretty sure that is what Itanium 2 has. P4s may even
have 128B lines and 16K L1 IIRC.

> In general a hard cap at the L1 cache size would be beneficial for
> operations done in tight loops, but there is no adequate detection
> method. Also recall that the page structures things will be touched
> regardless if they are there to be touched in a sufficiently large
> pagevec. Various pagevecs are meant to amortize locking done in
> scenarios where there is no relationship between calls. Again,
> lru_cache_add() and lru_cache_del() are the poster children. These
> operations are often done for one page at a time in some long codepath,
> e.g. fault handlers, and the pagevec is merely deferring the work until
> enough has accumulated. radix_tree_gang_lookup() and mpage_readpages()
> OTOH execute the operations to be done under the locks in tight loops,
> where the lock acquisitions are to be done immediately by the same caller.
>
> This differentiation between the characteristics of pagevec users
> happily matches the cases where they're used on-stack and per-cpu.
> In the former case, larger pagevecs are desirable, as the cachelines
> will not be L1-hot regardless; in the latter, L1 size limits apply.
>

Possibly, I don't know. Performing a large stream of faults to
map in a file could easily keep all pages of a small pagevec
in cache.

Anyway, the point I'm making is just that you don't want to be
expanding this thing just because you can. If all else is equal,
a smaller size is obviously preferable. So obviously, simple
testing is required - but I don't think I need to be telling you
that ;)

>
> On Sun, Sep 12, 2004 at 02:28:46PM +1000, Nick Piggin wrote:
>
>>OK, so a 511 item pagevec is pretty unlikely. How about a 64 item one
>>with 128 byte cachelines, and you're touching two cachelines per
>>page operation? That's 16K.
>
>
> 4*lg(NR_CPUS) is 64 for 16x-31x boxen. No constant number suffices.
> Adaptation to systems and the usage cases would be an improvement.
>

Ignore my comments about disliking compile-time sizing: the main
thing is to just find improvements; a merge-worthy implementation
can follow.

2004-09-12 07:21:25

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III wrote:
>> No, I'd expect zone->lru_lock to be taken most often for lru_cache_add()
>> and lru_cache_del().

On Sun, Sep 12, 2004 at 04:03:50PM +1000, Nick Piggin wrote:
> Well "lru_cache_del" will be often coming from the scanner.
> lru_cache_add should be being performed on newly allocated pages,
> which should be node local most of the time.

I presume page scanning to be rare outside memory-starved systems. I
would expect lru_cache_del() in the scanner to originate from evictions
of inodes in some odd cases, and from process or virtual area
destruction in the case of anonymous memory.


William Lee Irwin III wrote:
>> There can be no adequate compile-time metric of L1 cache size. 64B
>> cachelines with 16KB caches sounds a bit small, 256 entries, which is
>> even smaller than TLB's on various systems.

On Sun, Sep 12, 2004 at 04:03:50PM +1000, Nick Piggin wrote:
> Although I'm pretty sure that is what Itanium 2 has. P4s may even
> have 128B lines and 16K L1 IIRC.

Pathetic and not particularly intelligent on the cpu designers' parts.
The L1 cache should be large enough to cover at least two base pages
and have at least as many entries as the TLB.


William Lee Irwin III wrote:
>> In general a hard cap at the L1 cache size would be beneficial for
>> operations done in tight loops, but there is no adequate detection
>> method. Also recall that the page structures things will be touched
>> regardless if they are there to be touched in a sufficiently large
>> pagevec. Various pagevecs are meant to amortize locking done in
>> scenarios where there is no relationship between calls. Again,
>> lru_cache_add() and lru_cache_del() are the poster children. These
>> operations are often done for one page at a time in some long codepath,
>> e.g. fault handlers, and the pagevec is merely deferring the work until
>> enough has accumulated. radix_tree_gang_lookup() and mpage_readpages()
>> OTOH execute the operations to be done under the locks in tight loops,
>> where the lock acquisitions are to be done immediately by the same caller.
>> This differentiation between the characteristics of pagevec users
>> happily matches the cases where they're used on-stack and per-cpu.
>> In the former case, larger pagevecs are desirable, as the cachelines
>> will not be L1-hot regardless; in the latter, L1 size limits apply.

On Sun, Sep 12, 2004 at 04:03:50PM +1000, Nick Piggin wrote:
> Possibly, I don't know. Performing a large stream of faults to
> map in a file could easily keep all pages of a small pagevec
> in cache.
> Anyway, the point I'm making is just that you don't want to be
> expanding this thing just because you can. If all else is equal,
> a smaller size is obviously preferable. So obviously, simple
> testing is required - but I don't think I need to be telling you
> that ;)

A large stream of faults to map in a file will blow L1 caches of the
sizes you've mentioned at every kernel/user context switch. 256 distinct
cachelines will very easily be referenced between faults. MAP_POPULATE
and mlock() don't implement batching for either ->page_table_lock or
->tree_lock, so the pagevec point is moot in pagetable instantiation
codepaths (though it probably shouldn't be).

O_DIRECT writes and msync(..., ..., MS_SYNC) will use pagevecs on
->tree_lock in a rapid-fire process-triggerable manner. Almost all
uses of pagevecs for ->lru_lock outside the scanner that I'm aware
of are not rapid-fire in nature (though there probably should be some).
IMHO pagevecs are somewhat underutilized.


William Lee Irwin III wrote:
>> 4*lg(NR_CPUS) is 64 for 16x-31x boxen. No constant number suffices.
>> Adaptation to systems and the usage cases would be an improvement.

On Sun, Sep 12, 2004 at 04:03:50PM +1000, Nick Piggin wrote:
> Ignore my comments about disliking compile time sizing: the main
> thing is to just find improvements, and merge-worthy implementation
> can follow.

Sorry, 4*lg(NR_CPUS) is 64 when lg(NR_CPUS) = 16, or 65536 cpus. 512x
Altixen would have 4*lg(512) = 4*9 = 36. The 4*lg(NR_CPUS) sizing was
rather conservative on behalf of users of stack-allocated pagevecs.


-- wli

2004-09-12 07:45:01

by Andrew Morton

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III <[email protected]> wrote:
>
> A large stream of faults to map in a file will blow L1 caches of the
> sizes you've mentioned at every kernel/user context switch. 256 distinct
> cachelines will very easily be referenced between faults. MAP_POPULATE
> and mlock() don't implement batching for either ->page_table_lock or
> ->tree_lock, so the pagevec point is moot in pagetable instantiation
> codepaths (though it probably shouldn't be).

Instantiation via normal fault-in becomes lock-intensive once you have
enough CPUs. At low CPU count the page zeroing probably preponderates.

> O_DIRECT writes and msync(..., ..., MS_SYNC) will use pagevecs on
> ->tree_lock in a rapid-fire process-triggerable manner. Almost all
> uses of pagevecs for ->lru_lock outside the scanner that I'm aware
> of are not rapid-fire in nature (though there probably should be some).

pagetable teardown (munmap, mremap, exit) is the place where pagevecs help
->lru_lock. And truncate.

> IMHO pagevecs are somewhat underutilized.
>

Possibly. I wouldn't bother converting anything unless a profiler tells
you to though.

> Sorry, 4*lg(NR_CPUS) is 64 when lg(NR_CPUS) = 16, or 65536 cpus. 512x
> Altixen would have 4*lg(512) = 4*9 = 36. The 4*lg(NR_CPUS) sizing was
> rather conservative on behalf of users of stack-allocated pagevecs.

It's pretty simple to diddle PAGEVEC_SIZE, run a few benchmarks. If that
makes no difference then the discussion is moot. If it makes a significant
difference then more investigation is warranted.

2004-09-12 08:57:30

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

On Sun, Sep 12, 2004 at 12:19:48AM -0700, William Lee Irwin III wrote:
> Sorry, 4*lg(NR_CPUS) is 64 when lg(NR_CPUS) = 16, or 65536 cpus. 512x
> Altixen would have 4*lg(512) = 4*9 = 36. The 4*lg(NR_CPUS) sizing was
> rather conservative on behalf of users of stack-allocated pagevecs.

And for the extra multiplications, that's a pagevec 296B in size, and
touching 36 page structure cachelines, or 2304B with a 64B cacheline,
4608B for a 128B cacheline, etc. and that even with a ridiculously
large number of cpus. A more involved batching factor may make
sense, though. e.g. 2**(2.5*sqrt(lg(NR_CPUS)) - 1) or some such to
get 4 -> 6, 9 -> 11, 16 -> 16, 25 -> 21, 36 -> 26, 49 -> 31, 64 -> 35,
81 -> 40, 100 -> 44, 121 -> 48, 144 -> 52, 169 -> 56, 196 -> 60,
225 -> 64, 256 -> 68, 289 -> 71, 324 -> 75, 361 -> 79, 400 -> 82,
441 -> 86, 484 -> 89, 529 -> 92, 576 -> 96, 625 -> 99, 676 -> 102,
729 -> 105, 784 -> 108, 841 -> 111, 900 -> 114, 961 -> 117, 1024 -> 120
etc., which looks like a fairly good tradeoff between growth with
NR_CPUS and various limits. I can approximate this well enough in the
preprocessor, but it would be somewhat more obscure than 4*lg(NR_CPUS)
(basically nest expansions of sufficiently rapidly convergent series
and use some functional relations to transform arguments into areas of
rapid convergence). That said, I suspect we should explore differentiating
between on-stack rapid-fire usage and longer-term amortization if we
must adapt so precisely rather than tuning a global PAGEVEC_SIZE to death.
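
A quick floating-point check of that curve (illustrative userspace code rather than the preprocessor approximation described above; build with -lm). It reproduces the 4 -> 6, 16 -> 16, ..., 1024 -> 120 table:

#include <math.h>
#include <stdio.h>

/* 2^(2.5*sqrt(lg(n)) - 1), rounded to the nearest integer */
static unsigned batch(double n)
{
	return (unsigned)(pow(2.0, 2.5 * sqrt(log2(n)) - 1.0) + 0.5);
}

int main(void)
{
	unsigned n;

	/* walk the perfect squares 4, 9, 16, ..., 1024 used in the table above */
	for (n = 2; n <= 32; n++)
		printf("NR_CPUS=%4u -> batch=%u\n", n * n, batch(n * n));
	return 0;
}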


-- wli

2004-09-13 23:56:38

by Marcelo Tosatti

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))


> Sure. And when you fill it with pages, they'll use up 32KB of dcache
> by using a single 64B line per page. Now that you've blown the cache,
> when you go to move those pages to another list, you'll have to pull
> them out of L2 again one at a time.
>
> OK, so a 511 item pagevec is pretty unlikely. How about a 64 item one
> with 128 byte cachelines, and you're touching two cachelines per
> page operation? That's 16K.

Nick,

Note that you don't read/write data to the actual pages most of the
time pagevecs are used. The great majority is just page management code.
So we don't really blow the caches like you said.

I agree we need more tests :)

2004-09-14 01:59:59

by Nick Piggin

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

Marcelo Tosatti wrote:
>>Sure. And when you fill it with pages, they'll use up 32KB of dcache
>>by using a single 64B line per page. Now that you've blown the cache,
>>when you go to move those pages to another list, you'll have to pull
>>them out of L2 again one at a time.
>>
>>OK, so a 511 item pagevec is pretty unlikely. How about a 64 item one
>>with 128 byte cachelines, and you're touching two cachelines per
>>page operation? That's 16K.
>
>
> Nick,
>
> Note that you dont read/write data to the actual pages most of the
> times pagevec's are used. The great majority is just page management code.
> So we dont really blow the caches like you said.
>

You're often pulling them off lists, though, which is what will do it.
Not the actual page, but the struct page.

> I agree we need more tests :)
>

Yep.

2004-09-14 02:22:38

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III <[email protected]> wrote:
>> A large stream of faults to map in a file will blow L1 caches of the
>> sizes you've mentioned at every kernel/user context switch. 256 distinct
>> cachelines will very easily be referenced between faults. MAP_POPULATE
>> and mlock() don't implement batching for either ->page_table_lock or
>> ->tree_lock, so the pagevec point is moot in pagetable instantiation
>> codepaths (though it probably shouldn't be).

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> Instantiation via normal fault-in becomes lock-intensive once you have
> enough CPUs. At low CPU count the page zeroing probably preponderates.

But that's mm->page_table_lock, for which pagevecs aren't used, and for
faulting there isn't a batch of work to do anyway, so other methods are
needed, e.g. more fine-grained pte locking or lockless pagetable radix
trees. I'd favor going fully lockless, but haven't put any code down
for it myself.


William Lee Irwin III <[email protected]> wrote:
>> O_DIRECT writes and msync(..., ..., MS_SYNC) will use pagevecs on
>> ->tree_lock in a rapid-fire process-triggerable manner. Almost all
>> uses of pagevecs for ->lru_lock outside the scanner that I'm aware
>> of are not rapid-fire in nature (though there probably should be some).

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> pagetable teardown (munmap, mremap, exit) is the place where pagevecs help
> ->lru_lock. And truncate.

truncate is rare enough that I didn't bother, but that probably hurts it the
worst. I'd expect pte teardown to affect mostly anonymous pages, as
file-backed pages will have a reference from the mapping holding them
until either that's shot down or they're evicted via page replacement.


William Lee Irwin III <[email protected]> wrote:
>> IMHO pagevecs are somewhat underutilized.

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> Possibly. I wouldn't bother converting anything unless a profiler tells
> you to though.

mlock() is the case I have in hand, though I've only heard of it being
problematic on vendor kernels. MAP_POPULATE is underutilized in
userspace thus far, so I've not heard anything about it good or bad.


William Lee Irwin III <[email protected]> wrote:
>> Sorry, 4*lg(NR_CPUS) is 64 when lg(NR_CPUS) = 16, or 65536 cpus. 512x
>> Altixen would have 4*lg(512) = 4*9 = 36. The 4*lg(NR_CPUS) sizing was
>> rather conservative on behalf of users of stack-allocated pagevecs.

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> It's pretty simple to diddle PAGEVEC_SIZE, run a few benchmarks. If that
> makes no difference then the discussion is moot. If it makes a significant
> difference then more investigation is warranted.

It only really applies when either the lock transfer time is high, such
as it is in NUMA architectures with high remote access latencies to
sufficiently distant nodes, or the arrival rates are high, such as they
are in unusual workloads, as the common cases (and even some of the
unusual ones) have already been addressed. Marcelo's original bits
about cache alignment are likely more pressing; my real hope was to get
some guess at the pagevec size based on NR_CPUS, align it upward to a
cacheline boundary, and subtract the size of the header information,
but as I found there are diminishing returns, and there are cache size
boundaries to be concerned about, I'd favor advancing Marcelo's cache
alignment work prior to exploring larger batches' benefits.
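
A rough userspace sketch of that sizing rule, under assumed values (128B cachelines) and hypothetical helper names: guess a batch from NR_CPUS, round the whole structure up to a cacheline boundary, and give the header's bytes back to the page array.

#include <stdio.h>

#define L1_CACHE_BYTES	128			/* assumed; e.g. ia64 */
#define ALIGN_UP(x, a)	(((x) + (a) - 1) / (a) * (a))

/* ceil(lg(n)), computed the straightforward way */
static unsigned ceil_lg(unsigned long n)
{
	unsigned k = 0;

	while ((1UL << k) < n)
		k++;
	return k;
}

/* guess a batch from NR_CPUS, round the struct up to a cacheline,
 * then give the header bytes back to the page array */
static unsigned pagevec_slots(unsigned long nr_cpus, unsigned ptr_size)
{
	unsigned hdr = 2 * sizeof(unsigned short);	/* nr + cold; ignores padding to pointer alignment */
	unsigned guess = nr_cpus > 1 ? 4 * ceil_lg(nr_cpus) : 4;
	unsigned bytes = ALIGN_UP(hdr + guess * ptr_size, L1_CACHE_BYTES);

	return (bytes - hdr) / ptr_size;
}

int main(void)
{
	printf("NR_CPUS=512, 64-bit pointers: %u slots\n", pagevec_slots(512, 8));
	return 0;
}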


-- wli

2004-09-14 03:02:39

by Andrew Morton

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

William Lee Irwin III <[email protected]> wrote:
>
> William Lee Irwin III <[email protected]> wrote:
> >> A large stream of faults to map in a file will blow L1 caches of the
> >> sizes you've mentioned at every kernel/user context switch. 256 distinct
> >> cachelines will very easily be referenced between faults. MAP_POPULATE
> >> and mlock() don't implement batching for either ->page_table_lock or
> >> ->tree_lock, so the pagevec point is moot in pagetable instantiation
> >> codepaths (though it probably shouldn't be).
>
> On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> > Instantiation via normal fault-in becomes lock-intensive once you have
> > enough CPUs. At low CPU count the page zeroing probably preponderates.
>
> But that's mm->page_table_lock, for which pagevecs aren't used,

It is zone->lru_lock and pagevecs are indeed used. See
do_anonymous_page->lru_cache_add_active.

> On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
> > Possibly. I wouldn't bother converting anything unless a profiler tells
> > you to though.
>
> mlock() is the case I have in hand, though I've only heard of it being
> problematic on vendor kernels. MAP_POPULATE is underutilized in
> userspace thus far, so I've not heard anything about it good or bad.
>

If you're referring to mlock() of an anonymous vma then that should all go
through do_anonymous_page->lru_cache_add_active anyway?

2004-09-14 03:19:17

by William Lee Irwin III

Subject: Re: [pagevec] resize pagevec to O(lg(NR_CPUS))

On Sun, Sep 12, 2004 at 12:42:56AM -0700, Andrew Morton wrote:
>>> Instantiation via normal fault-in becomes lock-intensive once you have
>>> enough CPUs. At low CPU count the page zeroing probably preponderates.

William Lee Irwin III <[email protected]> wrote:
>> But that's mm->page_table_lock, for which pagevecs aren't used,

On Mon, Sep 13, 2004 at 07:57:31PM -0700, Andrew Morton wrote:
> It is zone->lru_lock and pagevecs are indeed used. See
> do_anonymous_page->lru_cache_add_active.

zone->lru_lock is acquired there, but I believe locks in the mm are the
dominant overhead in such scenarios.


William Lee Irwin III <[email protected]> wrote:
>> mlock() is the case I have in hand, though I've only heard of it being
>> problematic on vendor kernels. MAP_POPULATE is underutilized in
>> userspace thus far, so I've not heard anything about it good or bad.

On Mon, Sep 13, 2004 at 07:57:31PM -0700, Andrew Morton wrote:
> If you're referring to mlock() of an anonymous vma then that should all go
> through do_anonymous_page->lru_cache_add_active anyway?

It's a shared memory segment. It's not clear that it was lock
contention per se that was the issue in this case, merely a lot of
(unexpected) cpu overhead. I'm going to check in with the reporter of
the issue to see whether this is still an issue in 2.6.x. The vendor
solution was to ask the app politely not to mlock the shm segments; in
2.6.x this can be addressed if it's still an issue. I believe a fair
amount of the computational expense may be attributed to atomic
operations (not lock contention per se) that may be reduced by batching;
i.e. the pagevecs would merely be used to reduce the number of slow
atomic operations and to avoid walking trees from the top-down for each
element, not to address lock contention.


-- wli