2002-03-07 08:20:37

by Andrea Arcangeli

[permalink] [raw]
Subject: 2.4.19pre2aa1

URL:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre2aa1.gz
ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.19pre2aa1/

The alone vm-29 against 2.4.19pre2 can be downloaded from here:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.19pre2/vm-29

Only in 2.4.19pre2aa1: 00_TCDRAIN-1

drain/flush pty fixes from Michael Schroeder.

Only in 2.4.19pre1aa1: 00_VM_IO-2
Only in 2.4.19pre2aa1: 00_VM_IO-3
Only in 2.4.19pre1aa1: 00_block-highmem-all-18b-4.gz
Only in 2.4.19pre2aa1: 00_block-highmem-all-18b-5.gz
Only in 2.4.19pre1aa1: 00_flush_icache_range-2
Only in 2.4.19pre2aa1: 00_flush_icache_range-3
Only in 2.4.19pre1aa1: 00_module-gfp-5
Only in 2.4.19pre2aa1: 00_module-gfp-6
Only in 2.4.19pre1aa1: 00_nfs-2.4.17-pathconf-1
Only in 2.4.19pre2aa1: 00_nfs-2.4.17-pathconf-2
Only in 2.4.19pre1aa1: 00_lvm-1.0.2-1.gz
Only in 2.4.19pre2aa1: 00_lvm-1.0.2-2.gz
Only in 2.4.19pre1aa1: 00_silent-stack-overflow-14
Only in 2.4.19pre2aa1: 00_silent-stack-overflow-15
Only in 2.4.19pre1aa1: 20_numa-mm-1
Only in 2.4.19pre2aa1: 20_numa-mm-2
Only in 2.4.19pre1aa1: 10_compiler.h-2
Only in 2.4.19pre2aa1: 10_compiler.h-3

Rediffed.

Only in 2.4.19pre2aa1: 00_alpha-lseek-1

Export right interface for lseek/sys_lseek.

Only in 2.4.19pre2aa1: 00_ffs_compile_failure-1

Fix compilation failure with gcc 3.1 due a kernel bug.
From Andi Kleen and Jan Hubicka.

Only in 2.4.19pre2aa1: 00_gcc-3_1-compile-1

Fix compilation failure with gcc 3.1.

Only in 2.4.19pre2aa1: 00_kiobuf_init-1

Optimize kiobuf initialization. From Intel and Hubert Mantel.

Only in 2.4.19pre2aa1: 00_nfs-fix_create-1

New nfs fix from Trond.

Only in 2.4.19pre2aa1: 00_proc-sig-race-fix-1

Fix /proc signal SMP race from Chris Mason.

Only in 2.4.19pre1aa1: 00_rwsem-fair-26
Only in 2.4.19pre1aa1: 00_rwsem-fair-26-recursive-7
Only in 2.4.19pre2aa1: 00_rwsem-fair-27
Only in 2.4.19pre2aa1: 00_rwsem-fair-27-recursive-8

Fixed 64bit bug s/int/long/ in down_read. Thanks to
Jeff Wiedemeier and Andreas Schwab for providing
testcases on such bug.

Only in 2.4.19pre2aa1: 00_sd-cleanup-1

Rest of the sd cleanups from Pete Zaitcev.

Only in 2.4.19pre2aa1: 00_shmdt-retval-1

Fix shmdt retval from Andreas Schwab.


Only in 2.4.19pre1aa1: 10_nfs-o_direct-3
Only in 2.4.19pre2aa1: 10_nfs-o_direct-4

New version from Trond.

Only in 2.4.19pre1aa1: 10_no-virtual-2

Obsoleted by mainline.

Only in 2.4.19pre1aa1: 10_vm-28
Only in 2.4.19pre2aa1: 10_vm-29

Make sure to free memclass-realted-metadata for pages out of the
memclass (bh headers in short). kmem_cache_shrink as well in the
max_mapped path now. Added other sysctls and minor tuning.

Changed the wait_table stuff, first of all have it per-node (no point
of per-zone), then let it scale more with the ram in the machine (the
amount of ram used for the wait table is ridicolously small and it
mostly depends on the amount of the I/O, not on the amount of ram, so
set up a low limit instead of an high limit). The hashfn I wasn't
very confortable with. This one is the same of pagemap.h, and it's
not that huge on the 64bit archs. If something it had to be a mul.
This hashfn ensures to be contigous on the cacheline for physically
consecutive pages, and once the pages are randomized they just provide
enough randomization, we just need to protect against the bootup when
it's more likely the pages are physically consecutive.

Only in 2.4.19pre1aa1: 20_pte-highmem-14
Only in 2.4.19pre2aa1: 20_pte-highmem-15

Initialize the kmap functionality eary during boot, right
after the mem init, so pte_alloc and in turn ioremap can be used
in the driver init functions.

Only in 2.4.19pre1aa1: 50_uml-patch-2.4.17-12.gz
Only in 2.4.19pre2aa1: 50_uml-patch-2.4.18-2.gz

New revision from Jeff at user-mode-linux.sourceforge.net.

Only in 2.4.19pre1aa1: 70_xfs-2.gz
Only in 2.4.19pre2aa1: 70_xfs-4.gz

Rediffed.

Andrea


2002-03-07 10:50:25

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> Changed the wait_table stuff, first of all have it per-node (no point
> of per-zone),

Yes there is. It's called locality of reference. Zones exhibit distinct
usage patterns and affinities; hence, they should have distinct tables.
It is unwise to allow pressure on one zone (i.e. many waiters there) to
interfere with (by creating collisions) efficient wakeups for other zones.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> then let it scale more with the ram in the machine (the
> amount of ram used for the wait table is ridicolously small and it
> mostly depends on the amount of the I/O, not on the amount of ram, so
> set up a low limit instead of an high limit).

The wait_table does not need to scale with the RAM on the machine
It needs to scale with the number of waiters.

4096 threads blocked on I/O is already approaching or exceeding the
scalability limits of other core kernel subsystems.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> The hashfn I wasn't
> very confortable with.

I am less than impressed by this unscientific rationale.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> This one is the same of pagemap.h, and it's
> not that huge on the 64bit archs. If something it had to be a mul.

It was a mul. It was explicitly commented that the mul was expanded to a
shift/add implementation to accommodate 64-bit architectures, and the
motivation for the particular constant to multiply by was documented in
several posts of mine. It didn't "have to be", it was, and was commented.

This is not mere idle speculation and unfounded micro-optimization.
This is a direct consequence of reports of poor performance due to the
cost of multiplication on several 64-bit architectures, which was
resolved by the expansion of the multiplication to shifts and adds
(the multiplier was designed from inception to support expansion to shifts
and adds; it was discovered the compiler could not do this automatically).

And for Pete's sake that pagemap hash function is ridiculously dirty code;
please, regardless of whether you insist on being inefficient for the
sake of some sort of vendetta, or deny the facts upon which the prior
implementation was based, or just refuse to cooperate until the bitter
end, please, please, clean that hash up instead of copying and pasting it.


On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> This hashfn ensures to be contigous on the cacheline for physically
> consecutive pages, and once the pages are randomized they just provide
> enough randomization, we just need to protect against the bootup when
> it's more likely the pages are physically consecutive.

Is this some kind of joke? You honestly believe some kind of homebrew
strategy to create collisions is remotely worthwhile? Do you realize
even for one instant that the cost of collisions here is *not* just a
list traversal, but also the spurious wakeup of blocked threads? You've
gone off the deep end in an attempt to be contrary; this is even more
than demonstrably inefficient, it's so blatant inspection reveals it.

Designing your hash function to create collisions is voodoo, pure and
simple. "Just enough randomization" is also voodoo. The pagemap.h hash
was one of the specific hash functions in the kernel producing
measurably horrific hash table distributions on large memory machines,
and my efforts to find a better hash function for it contributed
directly to the design of the hash functions presented in the hashed
waitqueue code. The (rather long) lengths I've gone to to prevent
collisions are justified by the low impact on the implementation (the
choice of multiplier) and the extremely high cost of collisions in this
scheme, which comes from the wake-all semantics of the hash table buckets.

This reversion to a demonstrably poor hashing scheme in the face of
very high collision penalties is laughable.


Bill

2002-03-07 11:34:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

In article <[email protected]> you wrote:
> Only in 2.4.19pre1aa1: 00_lvm-1.0.2-1.gz
> Only in 2.4.19pre2aa1: 00_lvm-1.0.2-2.gz

Is there a specific reason you revert back to LVM 1.0.2 from the intree
LVM 1.0.3?

2002-03-07 11:32:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On March 7, 2002 11:49 am, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > then let it scale more with the ram in the machine (the
> > amount of ram used for the wait table is ridicolously small and it
> > mostly depends on the amount of the I/O, not on the amount of ram, so
> > set up a low limit instead of an high limit).
>
> The wait_table does not need to scale with the RAM on the machine
> It needs to scale with the number of waiters.
>
> 4096 threads blocked on I/O is already approaching or exceeding the
> scalability limits of other core kernel subsystems.

I think he meant that with larger ram it's easy to justify making the wait
table a little looser, to gain a tiny little bit of extra performance.

That seems perfectly reasonable to me. Oh, and there's also the observation
that machines with larger ram tend to be more heavily loaded with processes,
just because one can.

--
Daniel

2002-03-07 11:47:50

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On March 7, 2002 11:49 am, William Lee Irwin III wrote:
>> 4096 threads blocked on I/O is already approaching or exceeding the
>> scalability limits of other core kernel subsystems.

On Thu, Mar 07, 2002 at 12:27:41PM +0100, Daniel Phillips wrote:
> I think he meant that with larger ram it's easy to justify making the wait
> table a little looser, to gain a tiny little bit of extra performance.
> That seems perfectly reasonable to me. Oh, and there's also the observation
> that machines with larger ram tend to be more heavily loaded with processes,
> just because one can.


And these processes will not be waiting on pages as often, either, as
pages will be more plentiful.

>From the reports I've seen, typical numbers of waiters on pages, even
on large systems, are as much as an order of magnitude fewer in number
than 4096. It would seem applications need to wait on pages less with
the increased memory size.

Cheers,
Bill

2002-03-07 11:51:39

by Daniel Phillips

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On March 7, 2002 12:47 pm, William Lee Irwin III wrote:
> On March 7, 2002 11:49 am, William Lee Irwin III wrote:
> >> 4096 threads blocked on I/O is already approaching or exceeding the
> >> scalability limits of other core kernel subsystems.
>
> On Thu, Mar 07, 2002 at 12:27:41PM +0100, Daniel Phillips wrote:
> > I think he meant that with larger ram it's easy to justify making the wait
> > table a little looser, to gain a tiny little bit of extra performance.
> > That seems perfectly reasonable to me. Oh, and there's also the observation
> > that machines with larger ram tend to be more heavily loaded with processes,
> > just because one can.
>
> And these processes will not be waiting on pages as often, either, as
> pages will be more plentiful.
>
> From the reports I've seen, typical numbers of waiters on pages, even
> on large systems, are as much as an order of magnitude fewer in number
> than 4096. It would seem applications need to wait on pages less with
> the increased memory size.

Uh Yup.

--
Daniel

2002-03-07 17:04:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, Mar 07, 2002 at 02:49:42AM -0800, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > Changed the wait_table stuff, first of all have it per-node (no point
> > of per-zone),
>
> Yes there is. It's called locality of reference. Zones exhibit distinct
> usage patterns and affinities; hence, they should have distinct tables.

there is no different usage pattern, the pages you're waiting on are all
in the pagecache, that spreads across the whole ram of the machine on
any arch. That's just wasted ram. I didn't made it global for the
numa locality in memory of the local node (so it uses the whole bandwith
better and with reads there's more chances the cpu issuing the wait is
on the local node too).

> It is unwise to allow pressure on one zone (i.e. many waiters there) to
> interfere with (by creating collisions) efficient wakeups for other zones.
>
>
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > then let it scale more with the ram in the machine (the
> > amount of ram used for the wait table is ridicolously small and it
> > mostly depends on the amount of the I/O, not on the amount of ram, so
> > set up a low limit instead of an high limit).
>
> The wait_table does not need to scale with the RAM on the machine
> It needs to scale with the number of waiters.

16G machines tends to have much more cpu and waiters than a 32mbyte
machines. With a 32mbyte machine you'll run out of kernel stack first :)

> 4096 threads blocked on I/O is already approaching or exceeding the
> scalability limits of other core kernel subsystems.

It's not only a matter of 4096 limit, the ram wastage is so ridicolously
low, that we definitely want to decrease the probability of collisions,
since 16k for those 16G boxes is nothing, also knowing we'll more likely
have several thousand of threads blocking on I/O in those machines
(no-async IO in 2.4). Furthmore we don't know if some future arch is
able to deliver more cpu/bus/ram performance and scale better so I
wouldn't put such limit given the so little ram utilization on a huge
box.

> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > The hashfn I wasn't
> > very confortable with.
>
> I am less than impressed by this unscientific rationale.
>
>
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > This one is the same of pagemap.h, and it's
> > not that huge on the 64bit archs. If something it had to be a mul.
>
> It was a mul. It was explicitly commented that the mul was expanded to a
> shift/add implementation to accommodate 64-bit architectures, and the
> motivation for the particular constant to multiply by was documented in
> several posts of mine. It didn't "have to be", it was, and was commented.
>
> This is not mere idle speculation and unfounded micro-optimization.
> This is a direct consequence of reports of poor performance due to the
> cost of multiplication on several 64-bit architectures, which was

I think that happened to be true 10 years ago. I don't know of any
recent machine where some dozen of incremental shifts is faster than a
single plain mul. Am I wrong?

> resolved by the expansion of the multiplication to shifts and adds
> (the multiplier was designed from inception to support expansion to shifts
> and adds; it was discovered the compiler could not do this automatically).
>
> And for Pete's sake that pagemap hash function is ridiculously dirty code;

the pagemap.h ensures that following indexes will fall into the same
hash cacheline. That is an important property. The sizeof(struct inode)
trickery also allows the compiler not to divided the inode pointer for
the size of the inode, but it allows it to optimize it to a right shift,
Linus is be so kind to explain it to me years ago when I really couldn't
see the point of such trickery. A shift is faster than a div (note:
this have nothing to do with the dozen of shifts and the mul).

> please, regardless of whether you insist on being inefficient for the
> sake of some sort of vendetta, or deny the facts upon which the prior

What vendetta sorry :) ? If I change something is because I feel it's
faster/cleaner done that way, I will never change anything unless I've
a technical argument for that.

> implementation was based, or just refuse to cooperate until the bitter
> end, please, please, clean that hash up instead of copying and pasting it.
>
>
> On Thu, Mar 07, 2002 at 09:21:19AM +0100, Andrea Arcangeli wrote:
> > This hashfn ensures to be contigous on the cacheline for physically
> > consecutive pages, and once the pages are randomized they just provide
> > enough randomization, we just need to protect against the bootup when
> > it's more likely the pages are physically consecutive.
>
> Is this some kind of joke? You honestly believe some kind of homebrew
> strategy to create collisions is remotely worthwhile? Do you realize
> even for one instant that the cost of collisions here is *not* just a
> list traversal, but also the spurious wakeup of blocked threads? You've

I know what happens during collisiosn, do you really think I designed my
hashfn to make more collisions than yours?

> gone off the deep end in an attempt to be contrary; this is even more
> than demonstrably inefficient, it's so blatant inspection reveals it.

Please proof it in practice.

I repeat: the page during production are truly random, so neither my nor
your hashfn can make any difference, they cannot predict the future. So
IMHO you're totally wrong.

I think you're confused sometime there's a relationship between the hash
input, here there's no relationship at all, except at the very early
boot stage (or because somebody freed some multipage previously) where
pages could have an higher probability to be physically consecutive, and
my hashfn infact takes care of distributing well such case, unlike
yours. A mul cannot predict the future.

so please, go ahead, change my code to use your hashfn, monitor the
number of collisions with my hashfn and yours, and if you can measure
that my hashfn will generate a 50% more collisions than yours (you know,
some random variations are very possible since it's totally random, 50%
is a random number, it depends on the standard deviation of the
collisions) I will be glad to apologise for my mistake and to revert to
your version immediatly. All I'm asking is for a real world measurement,
in particular on the long run, just run cerberus for 10 hours or
something like that, and measure the max and mean population at regular
intervals with a tasklet or something like that. Take it as a
"comparison", just to see the difference. If your hashfn is so much
better you will have no worry to show the 50% improvement in real life.
I will also try to do it locally here if time permits, so if you write a
patch please pass it over so we don't duplicate efforts.

For the other points I think you shouldn't really complain (both at
runtime and in code style as well, please see how clean it is with the
wait_table_t thing), I made a definitive improvement to your code, the
only not obvious part is the hashfn but I really cannot see yours
beating mine because of the total random input, infact it could be the
other way around due the fact if something there's the probability the
pages are physically consecutive and I take care of that fine.

Take my changes as a suggestion to try it out, definitely not a
vendetta. Instead of just endlessy sprading words on l-k, I did the
changes in practice instead, so now it is very very easy to make the
comparison.

Don't be offended if I changed your code. I very much appreciate your
and Rik's changes they're good, I only meant to improve them.

If my mistake about the hashfn is so huge, I hope it will be compensated
by the other two improvements at least :).

thanks,

Andrea

2002-03-07 20:18:49

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> For the other points I think you shouldn't really complain (both at
> runtime and in code style as well, please see how clean it is with the
> wait_table_t thing), I made a definitive improvement to your code, the
> only not obvious part is the hashfn but I really cannot see yours
> beating mine because of the total random input, infact it could be the
> other way around due the fact if something there's the probability the
> pages are physically consecutive and I take care of that fine.


I don't know whose definition of clean code this is:

+static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
+{
+#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
+#define s(x) ((x)+((x)>>wait_table->shift))
+ return wait_table->head + (s(i) & (wait_table->size-1));
+#undef i
+#undef s
+}


I'm not sure I want to find out.


Bill

2002-03-07 20:39:31

by Richard B. Johnson

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, 7 Mar 2002, William Lee Irwin III wrote:

> On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > For the other points I think you shouldn't really complain (both at
> > runtime and in code style as well, please see how clean it is with the
> > wait_table_t thing), I made a definitive improvement to your code, the
> > only not obvious part is the hashfn but I really cannot see yours
> > beating mine because of the total random input, infact it could be the
> > other way around due the fact if something there's the probability the
> > pages are physically consecutive and I take care of that fine.
>
>
> I don't know whose definition of clean code this is:
>
> +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> +{
> +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> +#define s(x) ((x)+((x)>>wait_table->shift))
> + return wait_table->head + (s(i) & (wait_table->size-1));
> +#undef i
> +#undef s
> +}
>
>
> I'm not sure I want to find out.
>
>
> Bill

Methinks there is entirely too much white-space in the code. It
is almost readable *;). It probably could be fixed up to slip through
the compiler with no possibility of human interpretation, but that would
take a bit more work ;^)

The choice of temporaries makes me think of a former President
who, when confronted with a lie, said; "It depends upon what 'is' is!"
Keep up the good work!

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (799.53 BogoMips).

Bill Gates? Who?

2002-03-08 00:12:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, Mar 07, 2002 at 12:18:19PM -0800, William Lee Irwin III wrote:
> On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > For the other points I think you shouldn't really complain (both at
> > runtime and in code style as well, please see how clean it is with the
> > wait_table_t thing), I made a definitive improvement to your code, the
> > only not obvious part is the hashfn but I really cannot see yours
> > beating mine because of the total random input, infact it could be the
> > other way around due the fact if something there's the probability the
> > pages are physically consecutive and I take care of that fine.
>
>
> I don't know whose definition of clean code this is:
>
> +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> +{
> +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> +#define s(x) ((x)+((x)>>wait_table->shift))
> + return wait_table->head + (s(i) & (wait_table->size-1));
> +#undef i
> +#undef s
> +}
>
>
> I'm not sure I want to find out.

The above is again the hashfunction, the hashfn code doesn't need to be
nice, the API around wait_table_hashfn has to instead. See the above
wait_table_t typedef.

During some further auditing I also noticed now that you introduced
a certain usused wake_up_page. That's buggy, if you use it you'll
deadlock. Also it would be cleaner if __lock_page wasn't using the
exclusive waitqueue and that in turn you would keep using wake_up for
unlock_page. By the time you share the waitqueue nothing can be wake one
any longer, this is probably the worst drawback of the wait_table
memory-saving patch. Infact I was considering to solve the collisions
with additional memory, rather than by having to drop the wake-one
behaviour when many threads are working on the same chunk of the file
that your design solution requires. quite frankly I don't think this was
an urgent thing to change in 2.4 (it only saves some memory and even if
64G will now boot with CONFIG_1G, the lowmem will be way too much
unbalanced to be good for general purpose).

Andrea

2002-03-08 00:23:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Thu, Mar 07, 2002 at 03:38:09PM -0500, Richard B. Johnson wrote:
> On Thu, 7 Mar 2002, William Lee Irwin III wrote:
>
> > On Thu, Mar 07, 2002 at 06:03:00PM +0100, Andrea Arcangeli wrote:
> > > For the other points I think you shouldn't really complain (both at
> > > runtime and in code style as well, please see how clean it is with the
> > > wait_table_t thing), I made a definitive improvement to your code, the
> > > only not obvious part is the hashfn but I really cannot see yours
> > > beating mine because of the total random input, infact it could be the
> > > other way around due the fact if something there's the probability the
> > > pages are physically consecutive and I take care of that fine.
> >
> >
> > I don't know whose definition of clean code this is:
> >
> > +static inline wait_queue_head_t * wait_table_hashfn(struct page * page, wait_table_t * wait_table)
> > +{
> > +#define i (((unsigned long) page)/(sizeof(struct page) & ~ (sizeof(struct page) - 1)))
> > +#define s(x) ((x)+((x)>>wait_table->shift))
> > + return wait_table->head + (s(i) & (wait_table->size-1));
> > +#undef i
> > +#undef s
> > +}
> >
> >
> > I'm not sure I want to find out.
> >
> >
> > Bill
>
> Methinks there is entirely too much white-space in the code. It
> is almost readable *;). It probably could be fixed up to slip through
> the compiler with no possibility of human interpretation, but that would
> take a bit more work ;^)

well, to me it is very readable, but I'm certainly biased I admit.

You have to think it this way:

- "i" is the random not predictable input
- "i" is generated by dropping the non random bits of the "page"
with a cute trick that drops the bit with
a shiftright with an immediate argument, it's not a
division so it's faster it could been
"page/sizeof(struct page)" but it would been slower
- s(i) takes into account not just the last "wait_table->shift" bytes
of the random information in "i", but also the next "shift"
bytes, so it's more random even if some of those lower or higher
bits is generating patterns for whatever reason
- finally you mask out the resulting random information s(i) with
size-1, so you fit into the wait_table->head memory.

The result is that it exploits most of the random input, and physically
consecutive pages are liekly to use the same cacheline in case there is
some pattern. That looks a good algorithm to me.

Andrea

2002-03-08 00:27:54

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.4.19pre2aa1

On Fri, 8 Mar 2002, Andrea Arcangeli wrote:

> The result is that it exploits most of the random input, and physically
> consecutive pages are liekly to use the same cacheline in case there is
> some pattern. That looks a good algorithm to me.

Worthy of documentation, even.

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/