2002-01-29 15:55:16

by Christoph Hellwig

Subject: [PATCH] Radix-tree pagecache for 2.5

I've ported my hacked-up version of Momchil Velikov's radix tree
pagecache to 2.5.3-pre{5,6}.

The changes over the 2.4.17 version are:

o use mempool to avoid OOM situations involving radix nodes.
o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
o unify add_to_page_cache and add_to_page_cache_unique

It gives nice scalability improvements on big machines and drops the
memory usage on small ones (if you consider my 64MB Athlon small :)).
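
If you only want to look at the interface: everything goes through the
small rat_* API in lib/rat.c below. Here is a minimal, untested sketch
of how a caller drives it (the index 42 and the example function are
made up for illustration; the error conventions are the ones the patch
uses):

#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/rat.h>

static struct rat_root example_root;

static int rat_example(void *item)
{
        /* one tree per address_space in the patch; nodes come
         * from a mempool, allocated with the given gfp_mask */
        INIT_RAT_ROOT(&example_root, GFP_ATOMIC);

        /* 0 on success, -EEXIST if the slot is already occupied,
         * -ENOMEM if a radix node could not be allocated */
        if (rat_insert(&example_root, 42, item) != 0)
                return -1;

        /* rat_lookup returns the item, or NULL if nothing is there */
        if (rat_lookup(&example_root, 42) != item)
                return -1;

        return rat_delete(&example_root, 42);  /* 0 or -ENOENT */
}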

The patch is at

ftp://ftp.kernel.org:/pub/linux/kernel/people/hch/patches/v2.5/2.5.3-pre5/linux-2.5.3-ratpagecache.patch.gz
ftp://ftp.kernel.org:/pub/linux/kernel/people/hch/patches/v2.5/2.5.3-pre5/linux-2.5.3-ratpagecache.patch.bz2

or below.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

diff -uNr -Xdontdiff /datenklo/ref/linux-vger/drivers/block/rd.c linux-vger/drivers/block/rd.c
--- /datenklo/ref/linux-vger/drivers/block/rd.c Sun Jan 13 00:05:09 2002
+++ linux-vger/drivers/block/rd.c Tue Jan 29 03:05:40 2002
@@ -156,7 +156,6 @@

do {
int count;
- struct page ** hash;
struct page * page;
char * src, * dst;
int unlock = 0;
@@ -166,8 +165,7 @@
count = size;
size -= count;

- hash = page_hash(mapping, index);
- page = __find_get_page(mapping, index, hash);
+ page = find_get_page(mapping, index);
if (!page) {
page = grab_cache_page(mapping, index);
err = -ENOMEM;
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/fs/inode.c linux-vger/fs/inode.c
--- /datenklo/ref/linux-vger/fs/inode.c Fri Jan 25 13:12:03 2002
+++ linux-vger/fs/inode.c Tue Jan 29 03:05:40 2002
@@ -144,6 +144,7 @@
INIT_LIST_HEAD(&inode->i_devices);
sema_init(&inode->i_sem, 1);
sema_init(&inode->i_zombie, 1);
+ INIT_RAT_ROOT(&inode->i_data.page_tree, GFP_ATOMIC);
spin_lock_init(&inode->i_data.i_shared_lock);
}

diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/fs.h linux-vger/include/linux/fs.h
--- /datenklo/ref/linux-vger/include/linux/fs.h Fri Jan 25 13:15:19 2002
+++ linux-vger/include/linux/fs.h Tue Jan 29 03:05:40 2002
@@ -21,6 +21,7 @@
#include <linux/cache.h>
#include <linux/stddef.h>
#include <linux/string.h>
+#include <linux/rat.h>

#include <asm/atomic.h>
#include <asm/bitops.h>
@@ -378,6 +379,7 @@
};

struct address_space {
+ struct rat_root page_tree; /* radix tree of all pages */
struct list_head clean_pages; /* list of clean pages */
struct list_head dirty_pages; /* list of dirty pages */
struct list_head locked_pages; /* list of locked pages */
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/mm.h linux-vger/include/linux/mm.h
--- /datenklo/ref/linux-vger/include/linux/mm.h Wed Jan 16 21:49:08 2002
+++ linux-vger/include/linux/mm.h Tue Jan 29 03:05:40 2002
@@ -149,15 +149,12 @@
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
- struct page *next_hash; /* Next page sharing our hash bucket in
- the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
wait_queue_head_t wait; /* Page locked? Stand in line... */
- struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
@@ -225,9 +222,8 @@
* using the page->list list_head. These fields are also used for
* freelist management (when page->count==0).
*
- * There is also a hash table mapping (mapping,index) to the page
- * in memory if present. The lists for this hash table use the fields
- * page->next_hash and page->pprev_hash.
+ * There is also a per-mapping radix tree mapping index to the page
+ * in memory if present. The tree is rooted at mapping->page_tree.
*
* All process pages can do I/O:
* - inode pages may need to be read from disk,
@@ -461,6 +457,24 @@
return 0;
}

+static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
+{
+ struct list_head *head = &mapping->clean_pages;
+
+ mapping->nrpages++;
+ list_add(&page->list, head);
+ page->mapping = mapping;
+}
+
+static inline void remove_page_from_inode_queue(struct page * page)
+{
+ struct address_space * mapping = page->mapping;
+
+ mapping->nrpages--;
+ list_del(&page->list);
+ page->mapping = NULL;
+}
+
struct zone_t;
/* filemap.c */
extern void remove_inode_page(struct page *);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/pagemap.h linux-vger/include/linux/pagemap.h
--- /datenklo/ref/linux-vger/include/linux/pagemap.h Mon Nov 12 18:19:18 2001
+++ linux-vger/include/linux/pagemap.h Tue Jan 29 03:18:57 2002
@@ -41,53 +41,22 @@
*/
#define page_cache_entry(x) virt_to_page(x)

-extern unsigned int page_hash_bits;
-#define PAGE_HASH_BITS (page_hash_bits)
-#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)
-
-extern atomic_t page_cache_size; /* # of pages currently in the hash table */
-extern struct page **page_hash_table;
-
-extern void page_cache_init(unsigned long);
-
-/*
- * We use a power-of-two hash table to avoid a modulus,
- * and get a reasonable hash by knowing roughly how the
- * inode pointer and indexes are distributed (ie, we
- * roughly know which bits are "significant")
- *
- * For the time being it will work for struct address_space too (most of
- * them sitting inside the inodes). We might want to change it later.
- */
-static inline unsigned long _page_hashfn(struct address_space * mapping, unsigned long index)
-{
-#define i (((unsigned long) mapping)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
-#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
- return s(i+index) & (PAGE_HASH_SIZE-1);
-#undef i
-#undef s
-}
-
-#define page_hash(mapping,index) (page_hash_table+_page_hashfn(mapping,index))
-
-extern struct page * __find_get_page(struct address_space *mapping,
- unsigned long index, struct page **hash);
-#define find_get_page(mapping, index) \
- __find_get_page(mapping, index, page_hash(mapping, index))
-extern struct page * __find_lock_page (struct address_space * mapping,
- unsigned long index, struct page **hash);
+extern atomic_t page_cache_size; /* # of pages currently in the page cache */
+
+extern struct page * find_get_page(struct address_space *mapping,
+ unsigned long index);
+extern struct page * find_lock_page(struct address_space *mapping,
+ unsigned long index);
+extern struct page * find_trylock_page(struct address_space *mapping,
+ unsigned long index);
extern struct page * find_or_create_page(struct address_space *mapping,
unsigned long index, unsigned int gfp_mask);

+extern int add_to_page_cache(struct page * page, struct address_space *mapping,
+ unsigned long index);
+
extern void FASTCALL(lock_page(struct page *page));
extern void FASTCALL(unlock_page(struct page *page));
-#define find_lock_page(mapping, index) \
- __find_lock_page(mapping, index, page_hash(mapping, index))
-extern struct page *find_trylock_page(struct address_space *, unsigned long);
-
-extern void add_to_page_cache(struct page * page, struct address_space *mapping, unsigned long index);
-extern void add_to_page_cache_locked(struct page * page, struct address_space *mapping, unsigned long index);
-extern int add_to_page_cache_unique(struct page * page, struct address_space *mapping, unsigned long index, struct page **hash);

extern void ___wait_on_page(struct page *);

diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/rat.h linux-vger/include/linux/rat.h
--- /datenklo/ref/linux-vger/include/linux/rat.h Thu Jan 1 01:00:00 1970
+++ linux-vger/include/linux/rat.h Tue Jan 29 03:05:40 2002
@@ -0,0 +1,26 @@
+#ifndef _LINUX_RAT_H
+#define _LINUX_RAT_H
+
+struct rat_node;
+
+#define RAT_SLOT_RESERVED ((void *)~0UL)
+
+struct rat_root {
+ unsigned int height;
+ int gfp_mask;
+ struct rat_node *rnode;
+};
+
+#define INIT_RAT_ROOT(root, mask) \
+do { \
+ (root)->height = 0; \
+ (root)->gfp_mask = (mask); \
+ (root)->rnode = NULL; \
+} while (0)
+
+extern int rat_reserve(struct rat_root *, unsigned long, void ***);
+extern int rat_insert(struct rat_root *, unsigned long, void *);
+extern void *rat_lookup(struct rat_root *, unsigned long);
+extern int rat_delete(struct rat_root *, unsigned long);
+
+#endif /* _LINUX_RAT_H */
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/include/linux/swap.h linux-vger/include/linux/swap.h
--- /datenklo/ref/linux-vger/include/linux/swap.h Wed Jan 16 21:49:11 2002
+++ linux-vger/include/linux/swap.h Tue Jan 29 03:05:40 2002
@@ -96,7 +96,7 @@
struct task_struct;
struct vm_area_struct;
struct sysinfo;
-
+struct address_space;
struct zone_t;

/* linux/mm/swap.c */
@@ -126,6 +126,9 @@
extern int add_to_swap_cache(struct page *, swp_entry_t);
extern void __delete_from_swap_cache(struct page *page);
extern void delete_from_swap_cache(struct page *page);
+extern int move_to_swap_cache(struct page *page, swp_entry_t entry);
+extern int move_from_swap_cache(struct page *page, unsigned long index,
+ struct address_space *mapping);
extern void free_page_and_swap_cache(struct page *page);
extern struct page * lookup_swap_cache(swp_entry_t);
extern struct page * read_swap_cache_async(swp_entry_t);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/init/main.c linux-vger/init/main.c
--- /datenklo/ref/linux-vger/init/main.c Fri Jan 25 13:16:04 2002
+++ linux-vger/init/main.c Tue Jan 29 03:05:40 2002
@@ -69,6 +69,7 @@
extern void sbus_init(void);
extern void sysctl_init(void);
extern void signals_init(void);
+extern void ratcache_init(void) __init;

extern void free_initmem(void);

@@ -377,7 +378,7 @@
proc_caches_init();
vfs_caches_init(mempages);
buffer_init(mempages);
- page_cache_init(mempages);
+ ratcache_init();
#if defined(CONFIG_ARCH_S390)
ccwcache_init();
#endif
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/kernel/ksyms.c linux-vger/kernel/ksyms.c
--- /datenklo/ref/linux-vger/kernel/ksyms.c Fri Jan 25 13:16:04 2002
+++ linux-vger/kernel/ksyms.c Tue Jan 29 03:05:40 2002
@@ -218,8 +218,6 @@
EXPORT_SYMBOL(generic_file_mmap);
EXPORT_SYMBOL(generic_ro_fops);
EXPORT_SYMBOL(generic_buffer_fdatasync);
-EXPORT_SYMBOL(page_hash_bits);
-EXPORT_SYMBOL(page_hash_table);
EXPORT_SYMBOL(file_lock_list);
EXPORT_SYMBOL(locks_init_lock);
EXPORT_SYMBOL(locks_copy_lock);
@@ -254,8 +252,8 @@
EXPORT_SYMBOL(__pollwait);
EXPORT_SYMBOL(poll_freewait);
EXPORT_SYMBOL(ROOT_DEV);
-EXPORT_SYMBOL(__find_get_page);
-EXPORT_SYMBOL(__find_lock_page);
+EXPORT_SYMBOL(find_get_page);
+EXPORT_SYMBOL(find_lock_page);
EXPORT_SYMBOL(grab_cache_page);
EXPORT_SYMBOL(grab_cache_page_nowait);
EXPORT_SYMBOL(read_cache_page);
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/lib/Makefile linux-vger/lib/Makefile
--- /datenklo/ref/linux-vger/lib/Makefile Wed Jan 16 21:49:17 2002
+++ linux-vger/lib/Makefile Tue Jan 29 03:05:40 2002
@@ -8,9 +8,10 @@

L_TARGET := lib.a

-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o
+export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o crc32.o rat.o

-obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
+obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o \
+ bust_spinlocks.o rbtree.o rat.o

obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/lib/rat.c linux-vger/lib/rat.c
--- /datenklo/ref/linux-vger/lib/rat.c Thu Jan 1 01:00:00 1970
+++ linux-vger/lib/rat.c Tue Jan 29 03:05:40 2002
@@ -0,0 +1,294 @@
+/*
+ * Copyright (C) 2001 Momchil Velikov
+ * Portions Copyright (C) 2001 Christoph Hellwig
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2, or (at
+ * your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+#include <linux/module.h>
+#include <linux/rat.h>
+#include <linux/slab.h>
+
+
+/*
+ * Radix tree node definition.
+ */
+#define RAT_MAP_SHIFT 7
+#define RAT_MAP_SIZE (1UL << RAT_MAP_SHIFT)
+#define RAT_MAP_MASK (RAT_MAP_SIZE-1)
+
+struct rat_node {
+ unsigned int count;
+ void *slots[RAT_MAP_SIZE];
+};
+
+struct rat_path {
+ struct rat_node *node, **slot;
+};
+
+#define RAT_INDEX_BITS (8/* CHAR_BIT */ * sizeof(unsigned long))
+
+/*
+ * Radix tree node cache.
+ */
+#define POOL_SIZE 32
+
+static kmem_cache_t *ratnode_cachep;
+static mempool_t *ratnode_pool;
+
+#define ratnode_alloc(root) \
+ mempool_alloc(ratnode_pool, (root)->gfp_mask)
+#define ratnode_free(node) \
+ mempool_free((node), ratnode_pool)
+
+
+/*
+ * Return the maximum key which can be stored in a
+ * radix tree with height HEIGHT.
+ */
+static inline unsigned long rat_maxindex(unsigned int height)
+{
+ unsigned int tmp = height * RAT_MAP_SHIFT;
+ unsigned long index = (~0UL >> (RAT_INDEX_BITS - tmp - 1)) >> 1;
+
+ if (tmp >= RAT_INDEX_BITS)
+ index = ~0UL;
+ return index;
+}
+
+
+/*
+ * Extend a radix tree so it can store key @index.
+ */
+static int rat_extend(struct rat_root *root, unsigned long index)
+{
+ struct rat_node *node;
+ unsigned int height;
+
+ /* Figure out what the height should be. */
+ height = root->height + 1;
+ while (index > rat_maxindex(height))
+ height++;
+
+ if (root->rnode) {
+ do {
+ if (!(node = ratnode_alloc(root)))
+ return -ENOMEM;
+
+ /* Increase the height. */
+ node->slots[0] = root->rnode;
+ if (root->rnode)
+ node->count = 1;
+ root->rnode = node;
+ root->height++;
+ } while (height > root->height);
+ } else
+ root->height = height;
+
+ return 0;
+}
+
+
+/**
+ * rat_reserve - reserve space in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ * @pslot: pointer to reserved slot
+ *
+ * Reserve a slot in a radix tree for the key @index.
+ */
+int rat_reserve(struct rat_root *root, unsigned long index, void ***pslot)
+{
+ struct rat_node *node = NULL, *tmp, **slot;
+ unsigned int height, shift;
+ int error;
+
+ /* Make sure the tree is high enough. */
+ if (index > rat_maxindex(root->height)) {
+ error = rat_extend(root, index);
+ if (error)
+ return error;
+ }
+
+ slot = &root->rnode;
+ height = root->height;
+ shift = (height-1) * RAT_MAP_SHIFT;
+
+ while (height > 0) {
+ if (*slot == NULL) {
+ /* Have to add a child node. */
+ if (!(tmp = ratnode_alloc(root)))
+ return -ENOMEM;
+ *slot = tmp;
+ if (node)
+ node->count++;
+ }
+
+ /* Go a level down. */
+ node = *slot;
+ slot = (struct rat_node **)
+ (node->slots + ((index >> shift) & RAT_MAP_MASK));
+ shift -= RAT_MAP_SHIFT;
+ height--;
+ }
+
+ if (*slot != NULL)
+ return -EEXIST;
+ if (node)
+ node->count++;
+
+ *pslot = (void **)slot;
+ **pslot = RAT_SLOT_RESERVED;
+ return 0;
+}
+
+EXPORT_SYMBOL(rat_reserve);
+
+
+/**
+ * rat_insert - insert into a radix tree
+ * @root: radix tree root
+ * @index: index key
+ * @item: item to insert
+ *
+ * Insert an item into the radix tree at position @index.
+ */
+int rat_insert(struct rat_root *root, unsigned long index, void *item)
+{
+ void **slot;
+ int error;
+
+ error = rat_reserve(root, index, &slot);
+ if (!error)
+ *slot = item;
+ return error;
+}
+
+EXPORT_SYMBOL(rat_insert);
+
+
+/**
+ * rat_lookup - perform lookup operation on a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Look up the item at position @index in the radix tree @root.
+ */
+void *rat_lookup(struct rat_root *root, unsigned long index)
+{
+ unsigned int height, shift;
+ struct rat_node **slot;
+
+ height = root->height;
+ if (index > rat_maxindex(height))
+ return NULL;
+
+ shift = (height-1) * RAT_MAP_SHIFT;
+ slot = &root->rnode;
+
+ while (height > 0) {
+ if (*slot == NULL)
+ return NULL;
+
+ slot = (struct rat_node **)
+ ((*slot)->slots + ((index >> shift) & RAT_MAP_MASK));
+ shift -= RAT_MAP_SHIFT;
+ height--;
+ }
+
+ return (void *) *slot;
+}
+
+EXPORT_SYMBOL(rat_lookup);
+
+
+/**
+ * rat_delete - delete an item from a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Remove the item at @index from the radix tree rooted at @root.
+ */
+int rat_delete(struct rat_root *root, unsigned long index)
+{
+ struct rat_path path[RAT_INDEX_BITS/RAT_MAP_SHIFT + 2], *pathp = path;
+ unsigned int height, shift;
+
+ height = root->height;
+ if (index > rat_maxindex(height))
+ return -ENOENT;
+
+ shift = (height-1) * RAT_MAP_SHIFT;
+ pathp->node = NULL;
+ pathp->slot = &root->rnode;
+
+ while (height > 0) {
+ if (*pathp->slot == NULL)
+ return -ENOENT;
+
+ pathp[1].node = *pathp[0].slot;
+ pathp[1].slot = (struct rat_node **)
+ (pathp[1].node->slots + ((index >> shift) & RAT_MAP_MASK));
+ pathp++;
+ shift -= RAT_MAP_SHIFT;
+ height--;
+ }
+
+ if (*pathp[0].slot == NULL)
+ return -ENOENT;
+
+ *pathp[0].slot = NULL;
+ while (pathp[0].node && --pathp[0].node->count == 0) {
+ pathp--;
+ *pathp[0].slot = NULL;
+ ratnode_free(pathp[1].node);
+ }
+
+ return 0;
+}
+
+EXPORT_SYMBOL(rat_delete);
+
+
+static void ratnode_ctor(void *node, kmem_cache_t *cachep, unsigned long flags)
+{
+ memset(node, 0, sizeof(struct rat_node));
+}
+
+static void *ratnode_pool_alloc(int gfp_mask, void *data)
+{
+ return kmem_cache_alloc(ratnode_cachep, gfp_mask);
+}
+
+static void ratnode_pool_free(void *node, void *data)
+{
+ kmem_cache_free(ratnode_cachep, node);
+}
+
+void __init ratcache_init(void)
+{
+ ratnode_cachep = kmem_cache_create("ratnode", sizeof(struct rat_node),
+ 0, SLAB_HWCACHE_ALIGN, ratnode_ctor, NULL);
+ if (!ratnode_cachep)
+ panic ("Failed to create ratnode cache\n");
+ ratnode_pool = mempool_create(POOL_SIZE, ratnode_pool_alloc,
+ ratnode_pool_free, NULL);
+ if (!ratnode_pool)
+ panic ("Failed to create ratnode pool\n");
+}
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/filemap.c linux-vger/mm/filemap.c
--- /datenklo/ref/linux-vger/mm/filemap.c Fri Jan 25 13:16:04 2002
+++ linux-vger/mm/filemap.c Tue Jan 29 03:32:41 2002
@@ -44,69 +44,16 @@
*/

atomic_t page_cache_size = ATOMIC_INIT(0);
-unsigned int page_hash_bits;
-struct page **page_hash_table;

-spinlock_t pagecache_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;
/*
- * NOTE: to avoid deadlocking you must never acquire the pagemap_lru_lock
- * with the pagecache_lock held.
- *
- * Ordering:
- * swap_lock ->
- * pagemap_lru_lock ->
- * pagecache_lock
+ * The deadlock-free ordering of lock acquisition is:
+ * pagemap_lru_lock ==> mapping_lock
*/
spinlock_t pagemap_lru_lock __cacheline_aligned_in_smp = SPIN_LOCK_UNLOCKED;

#define CLUSTER_PAGES (1 << page_cluster)
#define CLUSTER_OFFSET(x) (((x) >> page_cluster) << page_cluster)

-static void FASTCALL(add_page_to_hash_queue(struct page * page, struct page **p));
-static void add_page_to_hash_queue(struct page * page, struct page **p)
-{
- struct page *next = *p;
-
- *p = page;
- page->next_hash = next;
- page->pprev_hash = p;
- if (next)
- next->pprev_hash = &page->next_hash;
- if (page->buffers)
- PAGE_BUG(page);
- atomic_inc(&page_cache_size);
-}
-
-static inline void add_page_to_inode_queue(struct address_space *mapping, struct page * page)
-{
- struct list_head *head = &mapping->clean_pages;
-
- mapping->nrpages++;
- list_add(&page->list, head);
- page->mapping = mapping;
-}
-
-static inline void remove_page_from_inode_queue(struct page * page)
-{
- struct address_space * mapping = page->mapping;
-
- mapping->nrpages--;
- list_del(&page->list);
- page->mapping = NULL;
-}
-
-static inline void remove_page_from_hash_queue(struct page * page)
-{
- struct page *next = page->next_hash;
- struct page **pprev = page->pprev_hash;
-
- if (next)
- next->pprev_hash = pprev;
- *pprev = next;
- page->pprev_hash = NULL;
- atomic_dec(&page_cache_size);
-}
-
/*
* Remove a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
@@ -115,18 +62,20 @@
void __remove_inode_page(struct page *page)
{
if (PageDirty(page)) BUG();
+ rat_delete(&page->mapping->page_tree, page->index);
remove_page_from_inode_queue(page);
- remove_page_from_hash_queue(page);
+ atomic_dec(&page_cache_size);
}

void remove_inode_page(struct page *page)
{
+ struct address_space *mapping = page->mapping;
if (!PageLocked(page))
PAGE_BUG(page);

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
__remove_inode_page(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
}

static inline int sync_page(struct page *page)
@@ -147,10 +96,10 @@
struct address_space *mapping = page->mapping;

if (mapping) {
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
list_del(&page->list);
list_add(&page->list, &mapping->dirty_pages);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);

if (mapping->host)
mark_inode_dirty_pages(mapping->host);
@@ -170,11 +119,12 @@
{
struct list_head *head, *curr;
struct page * page;
+ struct address_space *mapping = inode->i_mapping;

- head = &inode->i_mapping->clean_pages;
+ head = &mapping->clean_pages;

spin_lock(&pagemap_lru_lock);
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
curr = head->next;

while (curr != head) {
@@ -205,7 +155,7 @@
continue;
}

- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
spin_unlock(&pagemap_lru_lock);
}

@@ -244,8 +194,9 @@
page_cache_release(page);
}

-static int FASTCALL(truncate_list_pages(struct list_head *, unsigned long, unsigned *));
-static int truncate_list_pages(struct list_head *head, unsigned long start, unsigned *partial)
+static int FASTCALL(truncate_list_pages(struct address_space *, struct list_head *, unsigned long, unsigned *));
+static int truncate_list_pages(struct address_space *mapping,
+ struct list_head *head, unsigned long start, unsigned *partial)
{
struct list_head *curr;
struct page * page;
@@ -274,7 +225,7 @@
/* Restart on this page */
list_add(head, curr);

- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
unlocked = 1;

if (!failed) {
@@ -295,7 +246,7 @@
schedule();
}

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
goto restart;
}
curr = curr->prev;
@@ -319,24 +270,28 @@
unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
int unlocked;

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
do {
- unlocked = truncate_list_pages(&mapping->clean_pages, start, &partial);
- unlocked |= truncate_list_pages(&mapping->dirty_pages, start, &partial);
- unlocked |= truncate_list_pages(&mapping->locked_pages, start, &partial);
+ unlocked = truncate_list_pages(mapping, &mapping->clean_pages,
+ start, &partial);
+ unlocked |= truncate_list_pages(mapping, &mapping->dirty_pages,
+ start, &partial);
+ unlocked |= truncate_list_pages(mapping, &mapping->locked_pages,
+ start, &partial);
} while (unlocked);
/* Traversed all three lists without dropping the lock */
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
}

-static inline int invalidate_this_page2(struct page * page,
+static inline int invalidate_this_page2(struct address_space *mapping,
+ struct page * page,
struct list_head * curr,
struct list_head * head)
{
int unlocked = 1;

/*
- * The page is locked and we hold the pagecache_lock as well
+ * The page is locked and we hold the mapping lock as well
* so both page_count(page) and page->buffers stays constant here.
*/
if (page_count(page) == 1 + !!page->buffers) {
@@ -345,7 +300,7 @@
list_add_tail(head, curr);

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
truncate_complete_page(page);
} else {
if (page->buffers) {
@@ -354,7 +309,7 @@
list_add_tail(head, curr);

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
block_invalidate_page(page);
} else
unlocked = 0;
@@ -366,8 +321,8 @@
return unlocked;
}

-static int FASTCALL(invalidate_list_pages2(struct list_head *));
-static int invalidate_list_pages2(struct list_head *head)
+static int FASTCALL(invalidate_list_pages2(struct address_space *mapping, struct list_head *));
+static int invalidate_list_pages2(struct address_space *mapping, struct list_head *head)
{
struct list_head *curr;
struct page * page;
@@ -381,7 +336,7 @@
if (!TryLockPage(page)) {
int __unlocked;

- __unlocked = invalidate_this_page2(page, curr, head);
+ __unlocked = invalidate_this_page2(mapping, page, curr, head);
UnlockPage(page);
unlocked |= __unlocked;
if (!__unlocked) {
@@ -394,7 +349,7 @@
list_add(head, curr);

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
unlocked = 1;
wait_on_page(page);
}
@@ -405,7 +360,7 @@
schedule();
}

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
goto restart;
}
return unlocked;
@@ -420,32 +375,13 @@
{
int unlocked;

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
do {
- unlocked = invalidate_list_pages2(&mapping->clean_pages);
- unlocked |= invalidate_list_pages2(&mapping->dirty_pages);
- unlocked |= invalidate_list_pages2(&mapping->locked_pages);
+ unlocked = invalidate_list_pages2(mapping, &mapping->clean_pages);
+ unlocked |= invalidate_list_pages2(mapping, &mapping->dirty_pages);
+ unlocked |= invalidate_list_pages2(mapping, &mapping->locked_pages);
} while (unlocked);
- spin_unlock(&pagecache_lock);
-}
-
-static inline struct page * __find_page_nolock(struct address_space *mapping, unsigned long offset, struct page *page)
-{
- goto inside;
-
- for (;;) {
- page = page->next_hash;
-inside:
- if (!page)
- goto not_found;
- if (page->mapping != mapping)
- continue;
- if (page->index == offset)
- break;
- }
-
-not_found:
- return page;
+ spin_unlock(&mapping->i_shared_lock);
}

/*
@@ -483,13 +419,14 @@
return error;
}

-static int do_buffer_fdatasync(struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
+static int do_buffer_fdatasync(struct address_space *mapping,
+ struct list_head *head, unsigned long start, unsigned long end, int (*fn)(struct page *))
{
struct list_head *curr;
struct page *page;
int retval = 0;

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
curr = head->next;
while (curr != head) {
page = list_entry(curr, struct page, list);
@@ -502,7 +439,7 @@
continue;

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
lock_page(page);

/* The buffers could have been free'd while we waited for the page lock */
@@ -510,11 +447,11 @@
retval |= fn(page);

UnlockPage(page);
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
curr = page->list.next;
page_cache_release(page);
}
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);

return retval;
}
@@ -525,17 +462,18 @@
*/
int generic_buffer_fdatasync(struct inode *inode, unsigned long start_idx, unsigned long end_idx)
{
+ struct address_space *mapping = inode->i_mapping;
int retval;

/* writeout dirty buffers on pages from both clean and dirty lists */
- retval = do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, end_idx, writeout_one_page);
- retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, end_idx, writeout_one_page);
- retval |= do_buffer_fdatasync(&inode->i_mapping->locked_pages, start_idx, end_idx, writeout_one_page);
+ retval = do_buffer_fdatasync(mapping, &mapping->dirty_pages, start_idx, end_idx, writeout_one_page);
+ retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages, start_idx, end_idx, writeout_one_page);
+ retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages, start_idx, end_idx, writeout_one_page);

/* now wait for locked buffers on pages from both clean and dirty lists */
- retval |= do_buffer_fdatasync(&inode->i_mapping->dirty_pages, start_idx, end_idx, waitfor_one_page);
- retval |= do_buffer_fdatasync(&inode->i_mapping->clean_pages, start_idx, end_idx, waitfor_one_page);
- retval |= do_buffer_fdatasync(&inode->i_mapping->locked_pages, start_idx, end_idx, waitfor_one_page);
+ retval |= do_buffer_fdatasync(mapping, &mapping->dirty_pages, start_idx, end_idx, waitfor_one_page);
+ retval |= do_buffer_fdatasync(mapping, &mapping->clean_pages, start_idx, end_idx, waitfor_one_page);
+ retval |= do_buffer_fdatasync(mapping, &mapping->locked_pages, start_idx, end_idx, waitfor_one_page);

return retval;
}
@@ -580,7 +518,7 @@
{
int (*writepage)(struct page *) = mapping->a_ops->writepage;

- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);

while (!list_empty(&mapping->dirty_pages)) {
struct page *page = list_entry(mapping->dirty_pages.next, struct page, list);
@@ -592,7 +530,7 @@
continue;

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);

lock_page(page);

@@ -603,9 +541,9 @@
UnlockPage(page);

page_cache_release(page);
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
}
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
}

/**
@@ -617,7 +555,7 @@
*/
void filemap_fdatawait(struct address_space * mapping)
{
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);

while (!list_empty(&mapping->locked_pages)) {
struct page *page = list_entry(mapping->locked_pages.next, struct page, list);
@@ -629,83 +567,57 @@
continue;

page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);

___wait_on_page(page);

page_cache_release(page);
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
}
- spin_unlock(&pagecache_lock);
-}
-
-/*
- * Add a page to the inode page cache.
- *
- * The caller must have locked the page and
- * set all the page flags correctly..
- */
-void add_to_page_cache_locked(struct page * page, struct address_space *mapping, unsigned long index)
-{
- if (!PageLocked(page))
- BUG();
-
- page->index = index;
- page_cache_get(page);
- spin_lock(&pagecache_lock);
- add_page_to_inode_queue(mapping, page);
- add_page_to_hash_queue(page, page_hash(mapping, index));
- spin_unlock(&pagecache_lock);
-
- lru_cache_add(page);
+ spin_unlock(&mapping->i_shared_lock);
}

/*
* This adds a page to the page cache, starting out as locked,
* owned by us, but unreferenced, not uptodate and with no errors.
*/
-static inline void __add_to_page_cache(struct page * page,
- struct address_space *mapping, unsigned long offset,
- struct page **hash)
+static int __add_to_page_cache(struct page * page, struct address_space *mapping,
+ unsigned long offset)
{
unsigned long flags;
+ int error;

- flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_dirty | 1 << PG_referenced | 1 << PG_arch_1 | 1 << PG_checked);
- page->flags = flags | (1 << PG_locked);
page_cache_get(page);
+ if ((error = rat_insert(&mapping->page_tree, offset, page)))
+ goto fail;
+ flags = page->flags & ~(1 << PG_uptodate | 1 << PG_error |
+ 1 << PG_dirty | 1 << PG_referenced |
+ 1 << PG_arch_1 | 1 << PG_checked);
+ page->flags = flags | (1 << PG_locked);
page->index = offset;
add_page_to_inode_queue(mapping, page);
- add_page_to_hash_queue(page, hash);
-}

-void add_to_page_cache(struct page * page, struct address_space * mapping, unsigned long offset)
-{
- spin_lock(&pagecache_lock);
- __add_to_page_cache(page, mapping, offset, page_hash(mapping, offset));
- spin_unlock(&pagecache_lock);
- lru_cache_add(page);
+ atomic_inc(&page_cache_size);
+ return 0;
+fail:
+ page_cache_release(page);
+ return error;
}

-int add_to_page_cache_unique(struct page * page,
- struct address_space *mapping, unsigned long offset,
- struct page **hash)
+int add_to_page_cache(struct page *page, struct address_space *mapping,
+ unsigned long offset)
{
- int err;
- struct page *alias;
-
- spin_lock(&pagecache_lock);
- alias = __find_page_nolock(mapping, offset, *hash);
-
- err = 1;
- if (!alias) {
- __add_to_page_cache(page,mapping,offset,hash);
- err = 0;
- }
+ int error;

- spin_unlock(&pagecache_lock);
- if (!err)
- lru_cache_add(page);
- return err;
+ spin_lock(&mapping->i_shared_lock);
+ if ((error = __add_to_page_cache(page, mapping, offset)))
+ goto fail;
+ spin_unlock(&mapping->i_shared_lock);
+ lru_cache_add(page);
+ return 0;
+fail:
+ spin_unlock(&mapping->i_shared_lock);
+ return error;
}

/*
@@ -716,12 +628,12 @@
static int page_cache_read(struct file * file, unsigned long offset)
{
struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
- struct page **hash = page_hash(mapping, offset);
struct page *page;
+ int error;

- spin_lock(&pagecache_lock);
- page = __find_page_nolock(mapping, offset, *hash);
- spin_unlock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
+ page = rat_lookup(&mapping->page_tree, offset);
+ spin_unlock(&mapping->i_shared_lock);
if (page)
return 0;

@@ -729,11 +641,18 @@
if (!page)
return -ENOMEM;

- if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
- int error = mapping->a_ops->readpage(file, page);
+ while ((error = add_to_page_cache(page, mapping, offset)) == -ENOMEM) {
+ /* Yield for kswapd, and try again */
+ __set_current_state(TASK_RUNNING);
+ yield();
+ }
+
+ if (!error) {
+ error = mapping->a_ops->readpage(file, page);
page_cache_release(page);
return error;
}
+
/*
* We arrive here in the unlikely event that someone
* raced with us and added our page to the cache first.
@@ -837,8 +756,7 @@
* a rather lightweight function, finding and getting a reference to a
* hashed page atomically.
*/
-struct page * __find_get_page(struct address_space *mapping,
- unsigned long offset, struct page **hash)
+struct page * find_get_page(struct address_space *mapping, unsigned long offset)
{
struct page *page;

@@ -846,11 +764,11 @@
* We scan the hash list read-only. Addition to and removal from
* the hash-list needs a held write-lock.
*/
- spin_lock(&pagecache_lock);
- page = __find_page_nolock(mapping, offset, *hash);
+ spin_lock(&mapping->i_shared_lock);
+ page = rat_lookup(&mapping->page_tree, offset);
if (page)
page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
return page;
}

@@ -860,15 +778,14 @@
struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
{
struct page *page;
- struct page **hash = page_hash(mapping, offset);

- spin_lock(&pagecache_lock);
- page = __find_page_nolock(mapping, offset, *hash);
+ spin_lock(&mapping->i_shared_lock);
+ page = rat_lookup(&mapping->page_tree, offset);
if (page) {
if (TryLockPage(page))
page = NULL;
}
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
return page;
}

@@ -877,9 +794,9 @@
* will return with it held (but it may be dropped
* during blocking operations..
*/
-static struct page * FASTCALL(__find_lock_page_helper(struct address_space *, unsigned long, struct page *));
-static struct page * __find_lock_page_helper(struct address_space *mapping,
- unsigned long offset, struct page *hash)
+static struct page * FASTCALL(find_lock_page_helper(struct address_space *, unsigned long));
+static struct page * find_lock_page_helper(struct address_space *mapping,
+ unsigned long offset)
{
struct page *page;

@@ -888,13 +805,13 @@
* the hash-list needs a held write-lock.
*/
repeat:
- page = __find_page_nolock(mapping, offset, hash);
+ page = rat_lookup(&mapping->page_tree, offset);
if (page) {
page_cache_get(page);
if (TryLockPage(page)) {
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
lock_page(page);
- spin_lock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);

/* Has the page been re-allocated while we slept? */
if (page->mapping != mapping || page->index != offset) {
@@ -911,39 +828,40 @@
* Same as the above, but lock the page too, verifying that
* it's still valid once we own it.
*/
-struct page * __find_lock_page (struct address_space *mapping,
- unsigned long offset, struct page **hash)
+struct page * find_lock_page(struct address_space *mapping, unsigned long offset)
{
struct page *page;

- spin_lock(&pagecache_lock);
- page = __find_lock_page_helper(mapping, offset, *hash);
- spin_unlock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
+ page = find_lock_page_helper(mapping, offset);
+ spin_unlock(&mapping->i_shared_lock);
+
return page;
}

/*
* Same as above, but create the page if required..
*/
-struct page * find_or_create_page(struct address_space *mapping, unsigned long index, unsigned int gfp_mask)
+struct page * find_or_create_page(struct address_space *mapping,
+ unsigned long index, unsigned int gfp_mask)
{
struct page *page;
- struct page **hash = page_hash(mapping, index);

- spin_lock(&pagecache_lock);
- page = __find_lock_page_helper(mapping, index, *hash);
- spin_unlock(&pagecache_lock);
+ spin_lock(&mapping->i_shared_lock);
+ page = find_lock_page_helper(mapping, index);
+ spin_unlock(&mapping->i_shared_lock);
if (!page) {
struct page *newpage = alloc_page(gfp_mask);
if (newpage) {
- spin_lock(&pagecache_lock);
- page = __find_lock_page_helper(mapping, index, *hash);
+ spin_lock(&mapping->i_shared_lock);
+ page = find_lock_page_helper(mapping, index);
if (likely(!page)) {
- page = newpage;
- __add_to_page_cache(page, mapping, index, hash);
- newpage = NULL;
+ if (!__add_to_page_cache(newpage, mapping, index)) {
+ page = newpage;
+ newpage = NULL;
+ }
}
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
if (newpage == NULL)
lru_cache_add(page);
else
@@ -970,10 +888,9 @@
*/
struct page *grab_cache_page_nowait(struct address_space *mapping, unsigned long index)
{
- struct page *page, **hash;
+ struct page *page;

- hash = page_hash(mapping, index);
- page = __find_get_page(mapping, index, hash);
+ page = find_get_page(mapping, index);

if ( page ) {
if ( !TryLockPage(page) ) {
@@ -998,7 +915,7 @@
if ( unlikely(!page) )
return NULL; /* Failed to allocate a page */

- if ( unlikely(add_to_page_cache_unique(page, mapping, index, hash)) ) {
+ if (unlikely(add_to_page_cache(page, mapping, index)) ) {
/* Someone else grabbed the page already. */
page_cache_release(page);
return NULL;
@@ -1323,7 +1240,7 @@
}

for (;;) {
- struct page *page, **hash;
+ struct page *page;
unsigned long end_index, nr, ret;

end_index = inode->i_size >> PAGE_CACHE_SHIFT;
@@ -1342,15 +1259,14 @@
/*
* Try to find the data in the page cache..
*/
- hash = page_hash(mapping, index);

- spin_lock(&pagecache_lock);
- page = __find_page_nolock(mapping, index, *hash);
+ spin_lock(&mapping->i_shared_lock);
+ page = rat_lookup(&mapping->page_tree, index);
if (!page)
goto no_cached_page;
found_page:
page_cache_get(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);

if (!Page_Uptodate(page))
goto page_not_up_to_date;
@@ -1444,7 +1360,7 @@
* We get here with the page cache lock held.
*/
if (!cached_page) {
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
cached_page = page_cache_alloc(mapping);
if (!cached_page) {
desc->error = -ENOMEM;
@@ -1455,8 +1371,8 @@
* Somebody may have added the page while we
* dropped the page cache lock. Check for that.
*/
- spin_lock(&pagecache_lock);
- page = __find_page_nolock(mapping, index, *hash);
+ spin_lock(&mapping->i_shared_lock);
+ page = rat_lookup(&mapping->page_tree, index);
if (page)
goto found_page;
}
@@ -1464,9 +1380,13 @@
/*
* Ok, add the new page to the hash-queues...
*/
+ if (__add_to_page_cache(cached_page, mapping, index) == -ENOMEM) {
+ spin_unlock (&mapping->i_shared_lock);
+ desc->error = -ENOMEM;
+ break;
+ }
page = cached_page;
- __add_to_page_cache(page, mapping, index, hash);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
lru_cache_add(page);
cached_page = NULL;

@@ -1866,7 +1786,7 @@
struct file *file = area->vm_file;
struct address_space *mapping = file->f_dentry->d_inode->i_mapping;
struct inode *inode = mapping->host;
- struct page *page, **hash;
+ struct page *page;
unsigned long size, pgoff, endoff;

pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
@@ -1888,9 +1808,8 @@
/*
* Do we have something in the page cache already?
*/
- hash = page_hash(mapping, pgoff);
retry_find:
- page = __find_get_page(mapping, pgoff, hash);
+ page = find_get_page(mapping, pgoff);
if (!page)
goto no_cached_page;

@@ -2575,13 +2494,13 @@
{
unsigned char present = 0;
struct address_space * as = vma->vm_file->f_dentry->d_inode->i_mapping;
- struct page * page, ** hash = page_hash(as, pgoff);
+ struct page * page;

- spin_lock(&pagecache_lock);
- page = __find_page_nolock(as, pgoff, *hash);
+ spin_lock(&as->i_shared_lock);
+ page = rat_lookup(&as->page_tree, pgoff);
if ((page) && (Page_Uptodate(page)))
present = 1;
- spin_unlock(&pagecache_lock);
+ spin_unlock(&as->i_shared_lock);

return present;
}
@@ -2724,20 +2643,24 @@
int (*filler)(void *,struct page*),
void *data)
{
- struct page **hash = page_hash(mapping, index);
struct page *page, *cached_page = NULL;
int err;
repeat:
- page = __find_get_page(mapping, index, hash);
+ page = find_get_page(mapping, index);
if (!page) {
if (!cached_page) {
cached_page = page_cache_alloc(mapping);
if (!cached_page)
return ERR_PTR(-ENOMEM);
}
- page = cached_page;
- if (add_to_page_cache_unique(page, mapping, index, hash))
+ err = add_to_page_cache(cached_page, mapping, index);
+ if (err == -EEXIST)
goto repeat;
+ if (err < 0) {
+ page_cache_release(cached_page);
+ return ERR_PTR(err);
+ }
+ page = cached_page;
cached_page = NULL;
err = filler(data, page);
if (err < 0) {
@@ -2792,19 +2715,23 @@
static inline struct page * __grab_cache_page(struct address_space *mapping,
unsigned long index, struct page **cached_page)
{
- struct page *page, **hash = page_hash(mapping, index);
+ int err;
+ struct page *page;
repeat:
- page = __find_lock_page(mapping, index, hash);
+ page = find_lock_page(mapping, index);
if (!page) {
if (!*cached_page) {
*cached_page = page_cache_alloc(mapping);
if (!*cached_page)
return NULL;
}
- page = *cached_page;
- if (add_to_page_cache_unique(page, mapping, index, hash))
+ err = add_to_page_cache(*cached_page, mapping, index);
+ if (err == -EEXIST)
goto repeat;
- *cached_page = NULL;
+ if (err == 0) {
+ page = *cached_page;
+ *cached_page = NULL;
+ }
}
return page;
}
@@ -3064,30 +2991,3 @@
status = generic_osync_inode(inode, OSYNC_METADATA);
goto out_status;
}
-
-void __init page_cache_init(unsigned long mempages)
-{
- unsigned long htable_size, order;
-
- htable_size = mempages;
- htable_size *= sizeof(struct page *);
- for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
- ;
-
- do {
- unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
-
- page_hash_bits = 0;
- while((tmp >>= 1UL) != 0UL)
- page_hash_bits++;
-
- page_hash_table = (struct page **)
- __get_free_pages(GFP_ATOMIC, order);
- } while(page_hash_table == NULL && --order > 0);
-
- printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
- (1 << page_hash_bits), order, (PAGE_SIZE << order));
- if (!page_hash_table)
- panic("Failed to allocate page hash table\n");
- memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));
-}
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/shmem.c linux-vger/mm/shmem.c
--- /datenklo/ref/linux-vger/mm/shmem.c Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/shmem.c Tue Jan 29 03:20:18 2002
@@ -366,7 +366,7 @@
swp_entry_t *ptr;
unsigned long idx;
int offset;
-
+
idx = 0;
spin_lock (&info->lock);
offset = shmem_clear_swp (entry, info->i_direct, SHMEM_NR_DIRECT);
@@ -385,11 +385,8 @@
spin_unlock (&info->lock);
return 0;
found:
- delete_from_swap_cache(page);
- add_to_page_cache(page, info->vfs_inode.i_mapping, offset + idx);
- SetPageDirty(page);
- SetPageUptodate(page);
- info->swapped--;
+ if (!move_from_swap_cache(page, offset + idx, info->vfs_inode.i_mapping))
+ info->swapped--;
spin_unlock(&info->lock);
return 1;
}
@@ -426,6 +423,7 @@
struct address_space *mapping;
unsigned long index;
struct inode *inode;
+ int error;

if (!PageLocked(page))
BUG();
@@ -438,7 +436,6 @@
info = SHMEM_I(inode);
if (info->locked)
return fail_writepage(page);
-getswap:
swap = get_swap_page();
if (!swap.val)
return fail_writepage(page);
@@ -451,21 +448,12 @@
if (entry->val)
BUG();

- /* Remove it from the page cache */
- remove_inode_page(page);
- page_cache_release(page);
-
- /* Add it to the swap cache */
- if (add_to_swap_cache(page, swap) != 0) {
- /*
- * Raced with "speculative" read_swap_cache_async.
- * Add page back to page cache, unref swap, try again.
- */
- add_to_page_cache_locked(page, mapping, index);
+ error = move_to_swap_cache(page, swap);
+ if (error) {
spin_unlock(&info->lock);
swap_free(swap);
- goto getswap;
- }
+ return fail_writepage(page);
+ }

*entry = swap;
info->swapped++;
@@ -520,8 +508,6 @@

shmem_recalc_inode(inode);
if (entry->val) {
- unsigned long flags;
-
/* Look it up and read it in.. */
page = find_get_page(&swapper_space, entry->val);
if (!page) {
@@ -546,16 +532,15 @@
goto repeat;
}

- /* We have to this with page locked to prevent races */
+ /* We have to do this with page locked to prevent races */
if (TryLockPage(page))
goto wait_retry;

+ if (move_from_swap_cache(page, idx, mapping))
+ goto nomem_retry;
+
swap_free(*entry);
*entry = (swp_entry_t) {0};
- delete_from_swap_cache(page);
- flags = page->flags & ~((1 << PG_uptodate) | (1 << PG_error) | (1 << PG_referenced) | (1 << PG_arch_1));
- page->flags = flags | (1 << PG_dirty);
- add_to_page_cache_locked(page, mapping, idx);
info->swapped--;
spin_unlock (&info->lock);
} else {
@@ -579,7 +564,11 @@
return ERR_PTR(-ENOMEM);
clear_highpage(page);
inode->i_blocks += BLOCKS_PER_PAGE;
- add_to_page_cache (page, mapping, idx);
+ while (add_to_page_cache(page, mapping, idx) == -ENOMEM) {
+ /* Yield for kswapd, and try again */
+ __set_current_state(TASK_RUNNING);
+ yield();
+ }
}

/* We have the page */
@@ -594,6 +583,16 @@
wait_on_page(page);
page_cache_release(page);
goto repeat;
+
+nomem_retry:
+ spin_unlock (&info->lock);
+ UnlockPage (page);
+ page_cache_release (page);
+
+ /* Yield for kswapd, and try again */
+ __set_current_state(TASK_RUNNING);
+ yield();
+ goto repeat;
}

static int shmem_getpage(struct inode * inode, unsigned long idx, struct page **ptr)
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/swap_state.c linux-vger/mm/swap_state.c
--- /datenklo/ref/linux-vger/mm/swap_state.c Mon Nov 12 18:19:54 2001
+++ linux-vger/mm/swap_state.c Tue Jan 29 03:30:00 2002
@@ -14,6 +14,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/smp_lock.h>
+#include <linux/rat.h>

#include <asm/pgtable.h>

@@ -37,11 +38,12 @@
};

struct address_space swapper_space = {
- LIST_HEAD_INIT(swapper_space.clean_pages),
- LIST_HEAD_INIT(swapper_space.dirty_pages),
- LIST_HEAD_INIT(swapper_space.locked_pages),
- 0, /* nrpages */
- &swap_aops,
+ page_tree: {0, GFP_ATOMIC, NULL},
+ clean_pages: LIST_HEAD_INIT(swapper_space.clean_pages),
+ dirty_pages: LIST_HEAD_INIT(swapper_space.dirty_pages),
+ locked_pages: LIST_HEAD_INIT(swapper_space.locked_pages),
+ a_ops: &swap_aops,
+ i_shared_lock: SPIN_LOCK_UNLOCKED,
};

#ifdef SWAP_CACHE_INFO
@@ -69,17 +71,20 @@

int add_to_swap_cache(struct page *page, swp_entry_t entry)
{
+ int error;
+
if (page->mapping)
BUG();
if (!swap_duplicate(entry)) {
INC_CACHE_INFO(noent_race);
return -ENOENT;
}
- if (add_to_page_cache_unique(page, &swapper_space, entry.val,
- page_hash(&swapper_space, entry.val)) != 0) {
+
+ error = add_to_page_cache(page, &swapper_space, entry.val);
+ if (error) {
swap_free(entry);
INC_CACHE_INFO(exist_race);
- return -EEXIST;
+ return error;
}
if (!PageLocked(page))
BUG();
@@ -121,14 +126,100 @@

entry.val = page->index;

- spin_lock(&pagecache_lock);
+ spin_lock(&swapper_space.i_shared_lock);
__delete_from_swap_cache(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&swapper_space.i_shared_lock);

swap_free(entry);
page_cache_release(page);
}

+int move_to_swap_cache(struct page *page, swp_entry_t entry)
+{
+ struct address_space *mapping = page->mapping;
+ void **pslot;
+ int err;
+
+ if (!mapping)
+ BUG();
+
+ if (!swap_duplicate(entry)) {
+ INC_CACHE_INFO(noent_race);
+ return -ENOENT;
+ }
+
+ spin_lock(&swapper_space.i_shared_lock);
+ spin_lock(&mapping->i_shared_lock);
+
+ err = rat_reserve(&swapper_space.page_tree, entry.val, &pslot);
+ if (!err) {
+ /* Remove it from the page cache */
+ __remove_inode_page(page);
+
+ /* Add it to the swap cache */
+ *pslot = page;
+ page->flags = ((page->flags & ~(1 << PG_uptodate | 1 << PG_error
+ | 1 << PG_dirty | 1 << PG_referenced
+ | 1 << PG_arch_1 | 1 << PG_checked))
+ | (1 << PG_locked));
+ page->index = entry.val;
+ add_page_to_inode_queue(&swapper_space, page);
+ atomic_inc(&page_cache_size);
+ }
+
+ spin_unlock(&mapping->i_shared_lock);
+ spin_unlock(&swapper_space.i_shared_lock);
+
+ if (!err) {
+ INC_CACHE_INFO(add_total);
+ return 0;
+ }
+
+ swap_free(entry);
+
+ if (err == -EEXIST)
+ INC_CACHE_INFO(exist_race);
+
+ return err;
+}
+
+int move_from_swap_cache(struct page *page, unsigned long index,
+ struct address_space *mapping)
+{
+ void **pslot;
+ int err;
+
+ if (!PageLocked(page))
+ BUG();
+
+ spin_lock(&swapper_space.i_shared_lock);
+ spin_lock(&mapping->i_shared_lock);
+
+ err = rat_reserve(&mapping->page_tree, index, &pslot);
+ if (!err) {
+ swp_entry_t entry;
+
+ block_flushpage(page, 0);
+ entry.val = page->index;
+ __delete_from_swap_cache(page);
+ swap_free(entry);
+
+ *pslot = page;
+ page->flags = ((page->flags & ~(1 << PG_uptodate | 1 << PG_error
+ | 1 << PG_referenced | 1 << PG_arch_1
+ | 1 << PG_checked))
+ | (1 << PG_dirty));
+ page->index = index;
+ add_page_to_inode_queue (mapping, page);
+ atomic_inc(&page_cache_size);
+ }
+
+ spin_unlock(&mapping->i_shared_lock);
+ spin_unlock(&swapper_space.i_shared_lock);
+
+ return err;
+}
+
/*
* Perform a free_page(), also freeing any swap cache associated with
* this page if it is the last user of the page. Can not do a lock_page,
diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/swapfile.c linux-vger/mm/swapfile.c
--- /datenklo/ref/linux-vger/mm/swapfile.c Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/swapfile.c Tue Jan 29 03:05:40 2002
@@ -239,10 +239,10 @@
/* Is the only swap cache user the cache itself? */
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
- spin_lock(&pagecache_lock);
+ spin_lock(&swapper_space.i_shared_lock);
if (page_count(page) - !!page->buffers == 2)
retval = 1;
- spin_unlock(&pagecache_lock);
+ spin_unlock(&swapper_space.i_shared_lock);
}
swap_info_put(p);
}
@@ -307,13 +307,13 @@
retval = 0;
if (p->swap_map[SWP_OFFSET(entry)] == 1) {
/* Recheck the page count with the pagecache lock held.. */
- spin_lock(&pagecache_lock);
+ spin_lock(&swapper_space.i_shared_lock);
if (page_count(page) - !!page->buffers == 2) {
__delete_from_swap_cache(page);
SetPageDirty(page);
retval = 1;
}
- spin_unlock(&pagecache_lock);
+ spin_unlock(&swapper_space.i_shared_lock);
}
swap_info_put(p);

diff -uNr -Xdontdiff /datenklo/ref/linux-vger/mm/vmscan.c linux-vger/mm/vmscan.c
--- /datenklo/ref/linux-vger/mm/vmscan.c Fri Jan 25 13:16:06 2002
+++ linux-vger/mm/vmscan.c Tue Jan 29 03:12:59 2002
@@ -137,10 +137,16 @@
* (adding to the page cache will clear the dirty
* and uptodate bits, so we need to do it again)
*/
- if (add_to_swap_cache(page, entry) == 0) {
+ switch (add_to_swap_cache(page, entry)) {
+ case 0:
SetPageUptodate(page);
set_page_dirty(page);
goto set_swap_pte;
+ case -ENOMEM:
+ swap_free(entry);
+ goto preserve;
+ default:
+ break;
}
/* Raced with "speculative" read_swap_cache_async */
swap_free(entry);
@@ -338,6 +344,7 @@
static int shrink_cache(int nr_pages, zone_t * classzone, unsigned int gfp_mask, int priority)
{
struct list_head * entry;
+ struct address_space *mapping;
int max_scan = nr_inactive_pages / priority;
int max_mapped = nr_pages << (9 - priority);

@@ -392,7 +399,9 @@
continue;
}

- if (PageDirty(page) && is_page_cache_freeable(page) && page->mapping) {
+ mapping = page->mapping;
+
+ if (PageDirty(page) && is_page_cache_freeable(page) && mapping) {
/*
* It is not critical here to write it only if
* the page is unmapped because any direct writer
@@ -403,7 +412,7 @@
*/
int (*writepage)(struct page *);

- writepage = page->mapping->a_ops->writepage;
+ writepage = mapping->a_ops->writepage;
if ((gfp_mask & __GFP_FS) && writepage) {
ClearPageDirty(page);
SetPageLaunder(page);
@@ -430,7 +439,7 @@
page_cache_get(page);

if (try_to_release_page(page, gfp_mask)) {
- if (!page->mapping) {
+ if (!mapping) {
/*
* We must not allow an anon page
* with no buffers to be visible on
@@ -467,13 +476,22 @@
}
}

- spin_lock(&pagecache_lock);
+ /*
+ * Page is locked, so mapping can't change under our
+ * feet.
+ */
+ if (!mapping) {
+ UnlockPage (page);
+ goto page_mapped;
+ }
+
+ spin_lock(&mapping->i_shared_lock);

/*
* this is the non-racy check for busy page.
*/
- if (!page->mapping || !is_page_cache_freeable(page)) {
- spin_unlock(&pagecache_lock);
+ if (!is_page_cache_freeable(page)) {
+ spin_unlock(&mapping->i_shared_lock);
UnlockPage(page);
page_mapped:
if (--max_mapped >= 0)
@@ -493,7 +511,7 @@
* the page is freeable* so not in use by anybody.
*/
if (PageDirty(page)) {
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
UnlockPage(page);
continue;
}
@@ -501,12 +519,12 @@
/* point of no return */
if (likely(!PageSwapCache(page))) {
__remove_inode_page(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
} else {
swp_entry_t swap;
swap.val = page->index;
__delete_from_swap_cache(page);
- spin_unlock(&pagecache_lock);
+ spin_unlock(&mapping->i_shared_lock);
swap_free(swap);
}


2002-01-29 19:28:59

by Linus Torvalds

Subject: Re: [PATCH] Radix-tree pagecache for 2.5

In article <[email protected]>,
Christoph Hellwig <[email protected]> wrote:
>I've ported my hacked-up version of Momchil Velikov's radix tree
>pagecache to 2.5.3-pre{5,6}.
>
>The changes over the 2.4.17 version are:
>
> o use mempool to avoid OOM situations involving radix nodes.
> o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
> o unify add_to_page_cache and add_to_page_cache_unique
>
>It gives nice scalability improvements on big machines and drops the
>memory usage on small ones (if you consider my 64MB Athlon small :)).

Looks good.

In fact, this looks a _lot_ more palatable than the "scalable page
cache" approach with the hashed locks.

Can you post the numbers on scalability (I can see the locking
improvement, but if you have numbers I'd be even happier) and any
benchmarks you have?

The only real complaint I have is that I'd rather see "radix_root" than
"rat_root". Maybe it is just me, but the latter makes me wonder about
the sex-lives of small furry mammals. Which is not what I really want to
be thinking about.

It looks straightforward enough, so if you feel it is stable (and
cleaned up), I'd suggest just submitting it for real.

Linus

2002-01-29 21:42:21

by David Miller

Subject: Re: [PATCH] Radix-tree pagecache for 2.5

From: [email protected] (Linus Torvalds)
Date: Tue, 29 Jan 2002 19:27:43 +0000 (UTC)

In article <[email protected]>,
Christoph Hellwig <[email protected]> wrote:
>I've ported my hacked-up version of Momchil Velikov's radix tree
>pagecache to 2.5.3-pre{5,6}.

Looks good.

I like the changes too, but I'd like to see some numbers
as well.

My only concern is one particular case that it doesn't handle
better than the ugly per-hashchain lock version: when we're
running through a file and the task doing this changes CPUs.
In that case we'll get a lock collision, which the per-hashchain
lock changes would at least potentially avoid.

For serving sizeable files over the web this might matter, but
probably we don't really care; we're limited to moving one lock
over in such an event anyway.

2002-01-29 22:08:13

by Linus Torvalds

Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Tue, 29 Jan 2002, David S. Miller wrote:
>
> I like the changes too, but I'd like to see some numbers
> as well.

Absolutely. Even something as simplistic as "lmbench file re-read" changing
by 0.1% or something. I definitely believe in the scalability part (as
long as the different processes don't all touch the same mapping all the
time), so I'm more interested in the "what is the impact of the hash chain
lookup/walk vs the radix tree walk" kinds of numbers.

> My only concern is that it doesn't handle one particular
> case better than the ugly per-hashchain lock version: when we're
> running through a file and the task doing this changes cpus.
> In that case we'll get a lock collision, which the per-hashchain
> lock changes would at least potentially avoid.

Well, I would put it the other way around: the advantage of the
per-mapping lock (as opposed to the per-hashchain one) is that for common
access patterns where one process walks the whole file, we get added
locality from the per-mapping approach, while the per-hashchain one
tends to take a _lot_ of different locks (modulo bucket behaviour, of
course).

And locality is good - especially as we try to make processes have CPU
affinity anyway. So I'd expect the per-mapping lock to generally show
_nicer_ cache behaviour.

Linus

2002-01-29 22:59:24

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Tue, Jan 29, 2002 at 04:54:44PM +0100, Christoph Hellwig wrote:
> I've ported my hacked up version of Momchil Velikov's radix tree
> pagecache to 2.5.3-pre{5,6}.
> The changes over the 2.4.17 version are:
> o use mempool to avoid OOM situation involving radix nodes.
> o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
> o unify add_to_page and add_to_page_unique
> It gives nice scalability improvements on big machines and drops the
> memory usage on small ones (if you consider my 64MB Athlon small :)).

I love this patch. My only concern is about worst-case space consumption,
but it is beautiful regardless, and space consumption can be addressed
later if it is a problem in practice. The average case space consumption,
as you have noted, is quite good already, and it seems difficult to
trigger the worst case (I have tested it myself).


Cheers,
Bill

2002-01-29 23:02:34

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Tue, 29 Jan 2002, Linus Torvalds wrote:
> On Tue, 29 Jan 2002, David S. Miller wrote:
> >
> > I like the changes too, but I'd like to see some numbers
> > as well.
>
> Absolutely. Even something as simplistic as "lmbench file re-read" changed
> by 0.1% or something. I definitely believe in the scalability part (as
> long as the different processes don't all touch the same mapping all the
> time), so I'm more interested in the "what is the impact of the hash chain
> lookup/walk vs the radix tree walk" kinds of numbers.

There's another nice advantage to the radix tree.

We can let oracle shared memory segments use 4 MB pages,
but still use the normal page cache code to look up the
pages.

With a radix tree there is no overhead in using different
page sizes since we'll just run into them in the tree.

(as opposed to the horrors of trying a hash lookup with
multiple page orders)

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 23:06:43

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> In article <[email protected]>,
Linus> Christoph Hellwig <[email protected]> wrote:
>> I've ported my hacked up version of Momchil Velikov's radix tree
>> pagecache to 2.5.3-pre{5,6}.
>>
>> The changes over the 2.4.17 version are:
>>
>> o use mempool to avoid OOM situation involving radix nodes.
>> o remove add_to_page_cache_locked, it was unused in the 2.4.17 patch.
>> o unify add_to_page and add_to_page_unique
>>
>> It gives nice scalability improvements on big machines and drops the
>> memory usage on small ones (if you consider my 64MB Athlon small :)).

Linus> Can you post the numbers on scalability (I can see the locking
Linus> improvement, but if you have numbers I'd be even happier) and any
Linus> benchmarks you have?

Well, these are dbench numbers from December, on 2.4.17.
Unfortunately, it appears OSDL has trouble with 2.5 currently ...

FWIW, box is 8-way PIII Xeon, 700MHz, 1MB cache, 8G RAM

rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

#Clients 2.4.17 2.4.17-rat-7 2.4.17-rat-4
---------------------------------------------------------
1 81.81 82.70 79.49
2 131.77 133.15 116.32
3 179.74 188.04 184.80
4 221.60 228.70 223.97
5 249.86 252.89 258.77
6 260.56 277.70 265.20
7 285.82 287.47 281.27
8 263.61 258.81 256.29
9 271.06 268.29 261.04
10 261.23 265.82 259.34
11 256.82 260.38 258.35
12 255.55 255.68 252.78
13 252.70 254.02 249.42
14 251.41 253.93 252.21
15 255.27 257.13 262.21
16 156.81 146.69 180.77
17 113.00 103.32 101.14
18 81.06 78.98 86.77
19 76.24 40.09 39.89
20 17.51 17.64 17.53

The results are similar on 4-way OSDL boxen and on the 12- and 16-way
PPC64 runs by Anton Blanchard:

# clients
[1 - ncpu] - linear increase in the throughput, small improvement over the
stock kernel, I guess we quickly hit other locks
[ncpu - 2 * ncpu] - flat
[2 * ncpu, +infty) - drops down to zero

Linus> The only real complaint I have is that I'd rather see "radix_root" than
Linus> "rat_root". Maybe it is just me, but the latter makes me wonder about
Linus> the sex-lives of small furry mammals. Which is not what I really want to
Linus> be thinking about.

Done. rat_* -> radix_tree_*

Linus> It looks straightforward enough, so if you feel it is stable (and
Linus> cleaned up), I'd suggest just submitting it for real.

I'll wait for a day for some more comments (e.g. Ingo) and will submit
it.

Regards,
-velco


2002-01-29 23:27:25

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> We can let oracle shared memory segments use 4 MB pages,
> but still use the normal page cache code to look up the
> pages.

That has some potential big wins beyond oracle. Some of the big number
crunching algorithms also benefit heavily from 4Mb pages even when you
try to minimise TLB misses.

Just remember to read the ppro/early pII errata when starting - there are
some page invalidation funnies. If I remember rightly, we have to kill MCE
support on PPro if we do 4Mb pages that may overlap 4K ones.

2002-01-29 23:37:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On 30 Jan 2002, Momchil Velikov wrote:
>
> rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

Hmm. It appears that the 128-way one is no slower, at least.

Probably not very relevant question: What are the memory usage
implications? I love having that global big page_hash_table gone, but what
are the differences in memory usage between rat-4 and rat-7? In
particular, it _looks_ like the way the radix_node is done, it will
basically always be a factor-of-two+1 words, which sounds like the worst
possible scenario from an allocator standpoint.

Linus

2002-01-29 23:38:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Tue, 29 Jan 2002, Alan Cox wrote:

> > We can let oracle shared memory segments use 4 MB pages,
> > but still use the normal page cache code to look up the
> > pages.
>
> That has some potential big wins beyond oracle. Some of the big number
> crunching algorithms also benefit heavily from 4Mb pages even when you
> try and minimise tlb misses.

Note that I'm not sure whether the complexity of using
4 MB pages is worth it or not ... I just like the fact
that the radix tree page cache gives us the opportunity
to easily implement and try it.

I like radix trees for making our design more flexible
and opening doors to possible new functionality.

It could even be a CONFIG option for the embedded folks,
if we can keep the code isolated enough ;)

cheers,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-29 23:49:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

In article <[email protected]> you wrote:
> it _looks_ like the way the radix_node is done, it will
> basically always be a factor-of-two+1 words, which sounds like the worst
> possible scenario from an allocator standpoint.

One advantage of the slab allocator is that it works efficiently with
odd object sizes.

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2002-01-30 02:58:36

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On January 30, 2002 12:02 am, Momchil Velikov wrote:
> Well, these are dbench numbers from December, it's
> 2.4.17. Unfortunately, it appears OSDL have trouble with 2.5 currently ...

Have you tested with anything besides dbench?

--
Daniel

2002-01-30 02:57:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On January 30, 2002 12:35 am, Rik van Riel wrote:
> On Tue, 29 Jan 2002, Alan Cox wrote:
> > > We can let oracle shared memory segments use 4 MB pages,
> > > but still use the normal page cache code to look up the
> > > pages.
> >
> > That has some potential big wins beyond oracle. Some of the big number
> > crunching algorithms also benefit heavily from 4Mb pages even when you
> > try and minimise tlb misses.
>
> Note that I'm not sure whether the complexity of using
> 4 MB pages is worth it or not ... I just like the fact
> that the radix tree page cache gives us the opportunity
> to easily implement and try it.
>
> I like radix trees for making our design more flexible
> and opening doors to possible new functionality.
>
> It could even be a CONFIG option for the embedded folks,
> if we can keep the code isolated enough ;)

Making it a CONFIG option would preclude leveraging the advantages of the
radix tree that fall out from its ordered nature, so I'd vote for all or
nothing in this case.

I also have some perhaps-twisted plans for this thing, but if this patch
passes muster on its own merits - that is, in the context where we're
currently using the pagecache hash, as opposed to new capabilities that
can be leveraged - that's the ideal situation. Hence no, or minimal,
talking up of other advantages.

I'm inclined to think that the radix tree has locality advantages for UP as
well as SMP, under certain types of filesystem loads, and that it is never
worse. Well, we shall see about that soon enough.

--
Daniel

2002-01-30 21:26:18

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> On 30 Jan 2002, Momchil Velikov wrote:
>>
>> rat-7 is with 128-way radix tree branch factor, rat-4 is with 16-way.

Linus> Hmm. It appears that the 128-way one is no slower, at least.

Linus> Probably not very relevant question: What are the memory usage
Linus> implications? I love having that global big page_hash_table gone, but what
Linus> are the differences in memory usage between rat-4 and rat-7? In
Linus> particular, it _looks_ like the way the radix_node is done, it will
Linus> basically always be a factor-of-two+1 words, which sounds like the worst
Linus> possible scenario from an allocator standpoint.

Memory overhead due to allocator overhead is of no concern with the
slab allocator. What matters most is probably the overhead of the
radix tree nodes themselves, compared to the two pointers in struct
page with the hash table approach. The rat-4 variant ought to have
less overhead than rat-7, at the expense of a deeper tree. I have no
figures for the actual memory usage though. For small files it should
be negligible, i.e. one radix tree node, 68 or 516 bytes for rat-4 or
rat-7, for a file of size up to 65536 or 524288 bytes. The worst case
would be a very large file with a few cached pages whose offsets are
uniformly distributed across the whole file, that is, a deep tree with
only one page hanging off each leaf node.
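
To spell out the arithmetic behind those numbers (assuming 4-byte
pointers and one word of node header, which is what the 68/516 figures
imply): a 16-way node is 16 * 4 + 4 = 68 bytes and one level covers
16 * 4096 = 65536 bytes of file; a 128-way node is 128 * 4 + 4 = 516
bytes and covers 128 * 4096 = 524288 bytes.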

Regards,
-velco

2002-01-30 22:10:23

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


Momchil> Memory overhead due to allocator overhead is of no concern with the
Momchil> slab allocator. What matters most is probably the overhead of the
Momchil> radix tree nodes themselves, compared to the two pointers in struct
Momchil> page with the hash table approach. rat-4 variant ought to have less
Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
Momchil> have no figures for the actual memory usage though. For small files it
Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes. The
Momchil> worst case would be very large file with a few cached pages with
Momchil> offsets uniformly distributed across the whole file, that is having
Momchil> deep tree with only one page hanging off each leaf node.

Isn't this a good place to use AVL trees then, since they balance
automatically? Admittedly, it may be more overhead than we want in
the case where the tree is balanced by default anyway.

Again, benchmarks would be the good thing to see either way.

John

2002-01-30 22:23:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

In article <[email protected]> you wrote:
> Isn't this a good place to use AVL trees then, since they balance
> automatically? Admittedly, it may be more overhead than we want in
> the case where the tree is balanced by default anyway.

OpenUnix uses AVL trees for the pagecache. The overhead in struct page
is immense.

2002-01-30 22:30:10

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "John" == John Stoffel <[email protected]> writes:

Momchil> Memory overhead due to allocator overhead is of no concern with the
Momchil> slab allocator. What matters most is probably the overhead of the
Momchil> radix tree nodes themselves, compared to the two pointers in struct
Momchil> page with the hash table approach. rat-4 variant ought to have less
Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
Momchil> have no figures for the actual memory usage though. For small files it
Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes. The
Momchil> worst case would be very large file with a few cached pages with
Momchil> offsets uniformly distributed across the whole file, that is having
Momchil> deep tree with only one page hanging off each leaf node.

John> Isn't this a good place to use AVL trees then, since they balance
John> automatically? Admittedly, it may be more overhead than we want in
John> the case where the tree is balanced by default anyway.

The widespread opinion is that binary trees are generally way too deep
compared to radix trees, so searches have a larger cache footprint.
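
(To put numbers on that: 16384 cached pages means a balanced binary
tree about 14 levels deep, versus 2 levels for a 128-way radix tree.)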

John> Again, benchmarks would be the good thing to see either way.

I've posted some with 2.4.


2002-01-31 02:32:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 12:15:09AM +0200, Momchil Velikov wrote:
> >>>>> "John" == John Stoffel <[email protected]> writes:
>
> Momchil> Memory overhead due to allocator overhead is of no concern with the
> Momchil> slab allocator. What matters most is probably the overhead of the
> Momchil> radix tree nodes themselves, compared to the two pointers in struct
> Momchil> page with the hash table approach. rat-4 variant ought to have less
> Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
> Momchil> have no figures for the actual memory usage though. For small files it
> Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
> Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes. The
> Momchil> worst case would be very large file with a few cached pages with
> Momchil> offsets uniformly distributed across the whole file, that is having
> Momchil> deep tree with only one page hanging off each leaf node.
>
> John> Isn't this a good place to use AVL trees then, since they balance
> John> automatically? Admittedly, it may be more overhead than we want in
> John> the case where the tree is balanced by default anyway.

rbtrees don't have too much overhead for the rebalance, but the problem
with not using the hashtable is that you can't just pay with ram
globally. Of course you can enlarge the array for each radix node (that
will end up being a waste with a huge number of inodes with only a page
in them), but as the height of the tree increases performance will go
down anyways (it will never be as large as the global hashtable that we
can tune optimally at boot). With the hashtable the ram we pay for is
not per inode, but global.

I'm not optimistic it will work (even if it can be better than an rb or
an avl during the lookups because it pays more ram per tree node [and
per-inode], it still has not nearly enough ram per node with big files
to be fast, and a big waste of ram with lots of inodes with only 1 page
in them).

So I wouldn't merge it, at least until some math is done for the memory
consumption with 500k inodes with only 1 page in them each, and on the
number of heights/levels that must be walked during the tree lookup,
during an access at offset 10G (or worst case in general [biggest
height]) of an inode with 10G just allocated in pagecache.

>
> The widespread opinion is that binary trees are generally way too deep
> compared to radix trees, so searches have larger cache footprint.
>
> John> Again, benchmarks would be the good thing to see either way.
>
> I've posted some with 2.4.


Andrea

2002-01-31 10:42:10

by Josh MacDonald

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Quoting Momchil Velikov ([email protected]):
> >>>>> "John" == John Stoffel <[email protected]> writes:
>
> Momchil> Memory overhead due to allocator overhead is of no concern with the
> Momchil> slab allocator. What matters most is probably the overhead of the
> Momchil> radix tree nodes themselves, compared to the two pointers in struct
> Momchil> page with the hash table approach. rat-4 variant ought to have less
> Momchil> overhead compared to rat-7 at the expense of deeper/higher tree. I
> Momchil> have no figures for the actual memory usage though. For small files it
> Momchil> should be negligible, i.e. one radix tree node, 68 or 516 bytes for
> Momchil> rat-4 or rat-7, for a file of size up to 65536 or 524288 bytes. The
> Momchil> worst case would be very large file with a few cached pages with
> Momchil> offsets uniformly distributed across the whole file, that is having
> Momchil> deep tree with only one page hanging off each leaf node.
>
> John> Isn't this a good place to use AVL trees then, since they balance
> John> automatically? Admittedly, it may be more overhead than we want in
> John> the case where the tree is balanced by default anyway.
>
> The widespread opinion is that binary trees are generally way too deep
> compared to radix trees, so searches have larger cache footprint.

I've posted this before -- my cache-optimized skip list solves the
problem of balanced-tree cache footprint. It uses cacheline-sized
nodes and per-node locking to avoid false-sharing and increase
concurrency. The memory usage for the skip list is also less than
the red-black tree for trees larger than several hundred nodes.

I posted a graph on space consumption (using the Linux vm_area_struct to
calculate space overhead) at:

http://prdownloads.sourceforge.net/skiplist/slrb_space.gif

There are also results for concurrency and performance as a function
of node size.

-josh

--
PRCS version control system http://sourceforge.net/projects/prcs
Xdelta storage & transport http://sourceforge.net/projects/xdelta
Need a concurrent skip list? http://sourceforge.net/projects/skiplist

2002-01-31 13:58:53

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:

> So I wouldn't merge it, at least until some math is done for the memory
> consumption with 500k inodes with only 1 page in them each, and on the
> number of heights/levels that must be walked during the tree lookup,
> during an access at offset 10G (or worst case in general [biggest
> height]) of an inode with 10G just allocated in pagecache.

Ummm, I don't see how this worst case is any more realistic
than the worst case for the hash table (where all pages live
in very few hash buckets and we have really deep chains).

People just don't go around caching a single page each for
all of their 10 GB files and even if they _wanted to_ they
couldn't because of the readahead code.

I suspect that for large files we'll always have around
min_readahead logically contiguous pages cached, if not more.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-31 14:01:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, 31 Jan 2002, Josh MacDonald wrote:

> I've posted this before -- my cache-optimized skip list solves the
> problem of balanced-tree cache footprint. It uses cacheline-sized
> nodes and per-node locking to avoid false-sharing and increase
> concurrency. The memory usage for the skip list is also less than
> the red-black tree for trees larger than several hundred nodes.

I'd be happy to test a kernel where the page cache uses
these skip lists for indexing.

Where can I download the patch ? ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-31 14:20:47

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Josh" == Josh MacDonald <[email protected]> writes:

Josh> Quoting Momchil Velikov ([email protected]):
>> >>>>> "John" == John Stoffel <[email protected]> writes:

Momchil> The worst case would be very large file with a few cached
Momchil> pages with offsets uniformly distributed across the whole
Momchil> file, that is having deep tree with only one page hanging off
Momchil> each leaf node.

John> Isn't this a good place to use AVL trees then, since they balance
John> automatically? Admittedly, it may be more overhead than we want in
John> the case where the tree is balanced by default anyway.

>> The widespread opinion is that binary trees are generally way too deep
>> compared to radix trees, so searches have larger cache footprint.

Josh> I've posted this before -- my cache-optimized skip list solves the
Josh> problem of balanced-tree cache footprint. It uses cacheline-sized

I don't think skip lists differ from the balanced trees w.r.t. cache
line footprint.

Josh> nodes and per-node locking to avoid false-sharing and increase

Whether there _is_ a (non-negligible) false sharing would be an open
question.

Josh> concurrency. The memory usage for the skip list is also less than
Josh> the red-black tree for trees larger than several hundred nodes.

Yes. A skip list or (whatever) b-tree is sure to have less space
overhead in the worst case. Therefore, I'd be curious to see
comparisons of the three pagecache implementations. Note that in my
last patch you can do a drop-in replacement of the radix tree with a
skip list, since the memory allocation issues are solved.

Regards,
-velco

2002-01-31 14:35:58

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>
> > So I wouldn't merge it, at least until some math is done for the memory
> > consumption with 500k inodes with only 1 page in them each, and on the
> > number of heights/levels that must be walked during the tree lookup,
> > during an access at offset 10G (or worst case in general [biggest
> > height]) of an inode with 10G just allocated in pagecache.
>
> Ummm, I don't see how this worst case is any more realistic
> than the worst case for the hash table (where all pages live
> in very few hash buckets and we have really deep chains).

Mathematically the hashtable complexity is O(N). But probabilistically
with the tuning we do on the hashtable size, the collisions will be
nearly zero for most buckets for of most workloads. Despite the worst
case is with all the pagecache and swapcache queued in a single linked
list :).

So in short math is wrong about O(N) being bad; the hashtable is in fact the
only way we can get an effective O(1) by just paying RAM. We pay with
RAM and we get performance back to us.

but with the radix tree (please correct me if I'm wrong) the height will
increase eventually, no matter what (so it won't be an effective O(1)
like the hashtable provides in real life, not the worst case, the common
case). With the hashtable the height won't increase instead.

In short to get the same performance with the radix tree, you'd need to
waste a huge amount of ram per inode, the hashtable is instead global
so we pay only once for all the pages in the system. At least this is my
understanding, I'm not a radix tree guru though, so I may be missing
something.

>
> People just don't go around caching a single page each for
> all of their 10 GB files and even if they _wanted to_ they
> couldn't because of the readahead code.
>
> I suspect that for large files we'll always have around
> min_readahead logically contiguous pages cached, if not more.

readahead really doesn't matter at all. Consider all the data just in
cache, assume you wrote it and you will never ever need to read it once
again because you've 64G of ram and only 20G of disk.

I/O on large files can and must run as fast as I/O on small files, at
the pagecache level. If the fs doesn't support extents or it's
inefficient with large files that's an entirely different problem, and
the underlying fs doesn't matter any longer once the data is in cache.
pagecache must not slow down on big files.

Otherwise for a unix fs developer usage (the small files ala dbench)
the rbtree was a much nicer data structure than the hash in the first place
(and it eats less ram than the radix tree if only one page is queued
etc...).

Andrea

2002-01-31 15:20:34

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> Mathematically the hashtable complexity is O(N). But probabilistically
> with the tuning we do on the hashtable size, the collisions will be
> nearly zero for most buckets for most workloads, even though the worst
> case is with all the pagecache and swapcache queued in a single linked
> list :).

Providing it handles the worst case. Some of the hash table inputs appear
to be user controllable so an end user can set out to get worst case
behaviour 8(

2002-01-31 16:40:47

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>>> So I wouldn't merge it, at least until some math is done for the memory
>>> consumption with 500k inodes with only 1 page in them each, and on the
>>> number of heights/levels that must be walked during the tree lookup,
>>> during an access at offset 10G (or worst case in general [biggest
>>> height]) of an inode with 10G just allocated in pagecache.

Did someone say math? Looks like I popped in just in time.

The radix tree forest worst case space usage for fixed-precision search
keys is where each leaf node of a radix tree is occupied by a unique page,
and furthermore, each radix tree contains a single page (otherwise the
shared root conserves a small amount of space).

key precision = B^D = wordsize (e.g. 2^32 or 2^64)
D = depth
B = branch factor

Each leaf node lies within a chain of D nodes, where all but the root
nodes are of size B words. This is (D-1)*B + 1 words per file, hence
per-page. Variable branch factors don't complicate this significantly:
1 + \sum_{0 \leq k \leq D} B_k words per page.

For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
words per file. So for 500K inodes each with one page, 42MB (Douglas
Adams fan?). Offsets of 10GB don't work here. Sounds like either an
interesting patch or a 64-bit machine if they work for you. =)
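
(For reference, the 7 + 7 + 6 split is the 20-bit page index -- 32 bits
minus the 12-bit PAGE_SHIFT on i386 -- divided into 7-bit levels, i.e. a
tree of depth 3.)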

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
>> Ummm, I don't see how this worst case is any more realistic
>> than the worst case for the hash table (where all pages live
>> in very few hash buckets and we have really deep chains).

I don't believe it's particularly realistic either. And sorry about the
inaccurate estimates from before. =)

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> Mathematically the hashtable complexity is O(N). But probabilistically
> with the tuning we do on the hashtable size, the collisions will be
> nearly zero for most buckets for most workloads, even though the worst
> case is with all the pagecache and swapcache queued in a single linked
> list :).

To avoid its worst case (or just poor distribution across the buckets),
a good hash function is necessary. And I don't believe the measurements
are in favor of the one currently in use. Also, the pointer links for
separate chaining within the objects cost extremely precious boot-time
allocated memory, and that memory is taken away from the system for all
time, whereas the dynamic allocation at least allows for the possibility
of recovering memory when needed.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> So in short math is wrong about O(N) being bad, hashtable is in fact the
> only way we can get an effective O(1) by just paying RAM. We pay with
> RAM and we get performance back to us.
> but with the radix tree (please correct me if I'm wrong) the height will
> increase eventually, no matter what (so it won't be an effective O(1)
> like the hashtable provides in real life, not the worst case, the common
> case). With the hashtable the height won't increase instead.

The key is of a fixed precision, hence the tree is of a fixed depth.
The radix tree is O(1).

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> In short to get the same performance with the radix tree, you'd need to
> waste a huge amount of ram per inode, the hashtable is instead global
> so we pay only once for all the pages in the system. At least this is my
> understanding, I'm not a radix tree guru though, so I may be missing
> something.

Lock and cache contention introduced by intermixing data from unrelated
objects. We've all seen radix trees before: most page tables are radix
trees.

On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
>> People just don't go around caching a single page each for
>> all of their 10 GB files and even if they _wanted to_ they
>> couldn't because of the readahead code.
>> I suspect that for large files we'll always have around
>> min_readahead logically contiguous pages cached, if not more.

I suspect the worst case could only arise after evictions of the
readahead from the pagecache.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> readahead really doesn't matter at all. consider all the data just in
> cache, assume you wrote it and you will never ever need to read it once
> again because you've 64G of ram and only 20G of disk.

Good luck booting on a 64GB x86 with excess pointer links in struct page.
Boot-time allocations filling the direct-mapped portion of the kernel
virtual address space -appear- to be a severe problem there, but I've
not got the RAM to empirically verify this quite yet.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> I/O on large files can and must run as fast as I/O on small files, at
> the pagecache level. If the fs doesn't support extents or it's
> inefficient with large files that's an entirely different problem, and
> the underlying fs doesn't matter any longer once the data is in cache.
> pagecache must not slow down on big files.

O(1) is O(1). This isn't even average-case or worst case: it's all cases.
In a radix tree using fixed-precision search keys, such as machine words,
exactly the same number of internal nodes are traversed to reach a leaf
for every search key, every time, regardless of how populated or unpopulated
the radix tree is.

On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> Otherwise for a unix fs developer usage (the small files ala dbench)
> the rbtree was a much nicer data structure than the hash in the first place
> (and it eats less ram than the radix tree if only one page is queued
> etc...).

And the pointer links in struct page? Sounds like more RAM to me...
4000 open files (much more realistic than 500K) each with one page
leads to 48000 words of radix tree overhead. 3 words per page of
pointer links and > 16000 pages of RAM and the rbtree eats more, not
less. And 16000 pages is just 64MB on i386.


Cheers,
Bill

2002-01-31 17:19:44

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> The radix tree forest worst case space usage for fixed-precision search
> keys is where each leaf node of a radix tree is occupied by a unique page,
> and furthermore, each radix tree contains a single page (otherwise the
> shared root conserves a small amount of space).
> key precision = B^D = wordsize (e.g. 2^32 or 2^64)
> D = depth
> B = branch factor
> Each leaf node lies within a chain of D nodes, where all but the root
> nodes are of size B words. This is (D-1)*B + 1 words per file, hence
> per-page. Variable branch factors don't complicate this significantly:
> 1 + \sum_{0 \leq k \leq D} B_k words per page.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
> words per file. So for 500K inodes each with one page, 42MB (Douglas
> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
> interesting patch or a 64-bit machine if they work for you. =)

As was just pointed out to me, the minute I did a substitution I went wrong:

A branch factor of 128 leads to
1 + (7 + 7 + 6)*128 words = 2561 words per file, which is somewhat
more severe. =(

More corrections are welcome.
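
(Taken at face value, the corrected figure is about 10KB per file on a
32-bit machine, so the fully-degenerate 500K-inode case discussed above
would be roughly 5GB rather than 42MB.)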


On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
>> Otherwise for a unix fs developer usage (the small files ala dbench)
>> the rbtree was a much nicer data structure than the hash in the first place
>> (and it eats less ram than the radix tree if only one page is queued
>> etc...).

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> And the pointer links in struct page? Sounds like more RAM to me...
> 4000 open files (much more realistic than 500K) each with one page
> leads to 48000 words of radix tree overhead. 3 words per page of
> pointer links and > 16000 pages of RAM and the rbtree eats more, not
> less. And 16000 pages is just 64MB on i386.

This doesn't quite hold up after the correction above. 4K open
files end up having 10.5M words, or about 40MB, of overhead.


Cheers,
Bill

2002-01-31 17:21:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >>> So I wouldn't merge it, at least until some math is done for the memory
> >>> consumption with 500k inodes with only 1 page in them each, and on the
> >>> number of heights/levels that must be walked during the tree lookup,
> >>> during an access at offset 10G (or worst case in general [biggest
> >>> height]) of an inode with 10G just allocated in pagecache.
>
> Did someone say math? Looks like I popped in just in time.
>
> The radix tree forest worst case space usage for fixed-precision search
> keys is where each leaf node of a radix tree is occupied by a unique page,
> and furthermore, each radix tree contains a single page (otherwise the
> shared root conserves a small amount of space).
>
> key precision = B^D = wordsize (e.g. 2^32 or 2^64)
> D = depth
> B = branch factor
>
> Each leaf node lies within a chain of D nodes, where all but the root
> nodes are of size B words. This is (D-1)*B + 1 words per file, hence
> per-page. Variable branch factors don't complicate this significantly:
> 1 + \sum_{0 \leq k \leq D} B_k words per page.
>
> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
> words per file. So for 500K inodes each with one page, 42MB (Douglas
> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
> interesting patch or a 64-bit machine if they work for you. =)

What do you mean with offsets of 10GB not working? In any recent
distribution supporting LFS the file size limit is only a constraint of
the filesystem on disk format. You don't need a 64bit arch for that.
(and anyways any change to the pagecache must work fine for 64bit archs
too)

> On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> >> Ummm, I don't see how this worst case is any more realistic
> >> than the worst case for the hash table (where all pages live
> >> in very few hash buckets and we have really deep chains).
>
> I don't believe it's particularly realistic either. And sorry about the
> inaccurate estimates from before. =)
>
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > Mathematically the hashtable complexity is O(N). But probabilistically
> > with the tuning we do on the hashtable size, the collisions will be
> > nearly zero for most buckets for most workloads, even though the worst
> > case is with all the pagecache and swapcache queued in a single linked
> > list :).
>
> To avoid its worst case (or just poor distribution across the buckets),
> a good hash function is necessary. And I don't believe the measurements
> are in favor of the one currently in use. Also, the pointer links for

the randomization provided by the inode is quite powerful (and it makes
it impossible to guess the hash bucket in use from userspace without
privileges), and the current one makes sure to optimize the cacheline
usage during consecutive reads.
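
Schematically, the shape of the hash being defended here is something
like the following (a stand-in sketch only, NOT the kernel's actual
_page_hashfn() definition):

/*
 * Stand-in sketch, NOT the kernel's actual _page_hashfn(): the mapping
 * pointer supplies per-inode randomization, and adding the page index
 * keeps consecutive pages of one file in neighbouring buckets, which
 * is cacheline-friendly for sequential reads.
 */
#define PAGE_HASH_BITS  13
#define PAGE_HASH_SIZE  (1UL << PAGE_HASH_BITS)

static inline unsigned long example_page_hashfn(void *mapping,
                                                unsigned long index)
{
        unsigned long h = (unsigned long)mapping >> 4; /* drop alignment zeroes */
        return (h + index) & (PAGE_HASH_SIZE - 1);
}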

> separate chaining within the objects costs extremely-precious boot-time
> allocated memory and that memory is taken away from the system for all
> time, where the dynamic allocation at least allows for the possibility
> of recovering memory when needed.
>
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > So in short math is wrong about O(N) being bad, hashtable is in fact the
> > only way we can get an effective O(1) by just paying RAM. We pay with
> > RAM and we get performance back to us.
> > but with the radix tree (please correct me if I'm wrong) the height will
> > increase eventually, no matter what (so it won't be an effective O(1)
> > like the hashtable provides in real life, not the worst case, the common
> > case). With the hashtable the height won't increase instead.
>
> The key is of a fixed precision, hence the tree is of a fixed depth.
> The radix tree is O(1).

What does it mean that the tree is of a fixed depth? If the depth is
fixed and you claim the lookup complexity is O(1), how can you support
terabytes of pagecache queued into the tree without wasting quite a lot
of ram per inode in the worst case (the worst case is with only a few
pages in each radix tree, just enough to make sure all the depth gets
allocated)? Forget totally about x86, some boxes run linux with 256G of
ram (I guess a terabyte is next).

Also, the complexity arguments are all about the worst case; O(1) as
said in the earlier email may very well be much slower than O(N) when
you get to numbers. The hashtable will provide a common case where
there is no collision in the bucket, and then the lookup only consists
of a pointer check.

If your radix tree O(1) fixed depth is 10000, you will always have to
walk 10000 pointers before you can finish the lookup and it will be
definitely much slower than O(N).

So keep in mind the math complexity arguments can be very misleading in
real life.

> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > In short to get the same performance with the radix tree, you'd need to
> > waste a huge amount of ram per inode, the hashtable is instead global
> > so we pay only once for all the pages in the system. At least this is my
> > understanding, I'm not a radix tree guru though, so I may be missing
> > something.
>
> Lock and cache contention introduced by intermixing data from unrelated
> objects. We've all seen radix trees before: most page tables are radix

of course with a per-inode data structure the locking issue while
accessing different inodes goes away, but I think nominal performance of
the data structure is more important than the scalability issue (also
contention would remain in workloads where all tasks access the same
inode, like a database). The cacheline part has to be taken into account,
but the hashfn is just optimized for that one.

> trees.
>
> On Thu, Jan 31, 2002 at 11:58:10AM -0200, Rik van Riel wrote:
> >> People just don't go around caching a single page each for
> >> all of their 10 GB files and even if they _wanted to_ they
> >> couldn't because of the readahead code.
> >> I suspect that for large files we'll always have around
> >> min_readahead logically contiguous pages cached, if not more.
>
> I suspect the worst case could only arise after evictions of the
> readahead from the pagecache.
>
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > readahead really doesn't matter at all. consider all the data just in
> > cache, assume you wrote it and you will never ever need to read it once
> > again because you've 64G of ram and only 20G of disk.
>
> Good luck booting on a 64GB x86 with excess pointer links in struct page.

x86 doesn't matter. this is common code.

> Boot-time allocations filling the direct-mapped portion of the kernel
> virtual address space -appear- to be a severe problem there, but I've
> not got the RAM to empirically verify this quite yet.
>
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > I/O on large files can and must run as fast as I/O on small files, at
> > the pagecache level. If the fs doesn't support extents or it's
> > inefficient with large files that's an entirely different problem, and
> > the underlying fs doesn't matter any longer once the data is in cache.
> > pagecache must not slow down on big files.
>
> O(1) is O(1). This isn't even average-case or worst case: it's all cases.
> In a radix tree using fixed-precision search keys, such as machine words,
> exactly the same number of internal nodes are traversed to reach a leaf
> for every search key, every time, regardless of how populated or unpopulated
> the radix tree is.

and this means it will be slower than the hashtable that will reach the
page without walking any "depth" in the common case.

>
> On Thu, Jan 31, 2002 at 03:36:07PM +0100, Andrea Arcangeli wrote:
> > Otherwise for a unix fs developer usage (the small files ala dbench)
> > the rbtree was a much nicer data structure than the hash in the first place
> > (and it eats less ram than the radix tree if only one page is queued
> > etc...).
>
> And the pointer links in struct page? Sounds like more RAM to me...

yes, that would be a few more bytes per page.... unless you allocate the
node structure dynamically like you seem to be doing in the radix tree
patch for the very same reason of not increasing the struct page I guess.

> 4000 open files (much more realistic than 500K) each with one page

Open files don't matter. What matters is the number of inodes with
cache in them. 500k is definitely realistic, try to run updatedb with
plenty of ram free and then check /proc/sys/fs/inode-nr.

> leads to 48000 words of radix tree overhead. 3 words per page of
> pointer links and > 16000 pages of RAM and the rbtree eats more, not
> less. And 16000 pages is just 64MB on i386.
>
>
> Cheers,
> Bill


Andrea

2002-01-31 17:48:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>
> but with the radix tree (please correct me if I'm wrong) the height will
> increase eventually, no matter what (so it won't be an effective O(1)
> like the hashtable provides in real life, not the worst case, the common
> case). With the hashtable the height won't increase instead.

No.

The radix tree is basically O(1), because the maximum depth of a 7-bit
radix tree is just 5. The index is only a 32-bit number.

We could, in fact, make all page caches use a fixed-depth tree, which is
clearly O(1). But the radix tree is slightly faster and tends to use less
memory under common loads, so..

Remember: you must NOT ignore the constant part of a "O(x)" equation.
Hashes tend to be effectively O(1) under most loads, but they have cache
costs, and they have scalability costs that a radix tree doesn't have.

Linus
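
To picture the bound being cited: with a 7-bit fanout each level
consumes 7 bits of the 32-bit index, so at most ceil(32/7) = 5 levels
are ever walked, however many pages are cached. A self-contained sketch
of such a fixed-fanout walk (illustrative names and layout, not the
patch's actual API):

#define MAP_SHIFT       7
#define MAP_SIZE        (1UL << MAP_SHIFT)      /* 128 slots per node */
#define MAP_MASK        (MAP_SIZE - 1)

struct radix_node {
        void *slots[MAP_SIZE];
};

struct radix_root {
        unsigned int height;            /* 0..5 for a 32-bit index */
        struct radix_node *rnode;
};

/* Largest index representable by a tree of the given height. */
static unsigned long radix_maxindex(unsigned int height)
{
        unsigned int bits = height * MAP_SHIFT;

        if (bits >= 8 * sizeof(unsigned long))
                return ~0UL;
        return (1UL << bits) - 1;
}

/* At most 5 node hops for any 32-bit index, however full the tree. */
static void *radix_lookup(struct radix_root *root, unsigned long index)
{
        struct radix_node *node = root->rnode;
        unsigned int height = root->height;
        unsigned int shift;

        if (height == 0 || index > radix_maxindex(height))
                return NULL;            /* empty tree, or out of range */

        shift = (height - 1) * MAP_SHIFT;
        while (height > 0) {
                if (node == NULL)
                        return NULL;    /* unpopulated slot on the path */
                node = node->slots[(index >> shift) & MAP_MASK];
                shift -= MAP_SHIFT;
                height--;
        }
        return (void *)node;            /* level-0 slot: the struct page */
}

The cost is bounded by the height, not by how many pages are cached,
which is the sense in which the lookup is O(1).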

2002-01-31 17:51:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> For a branch factor of 128 on i386, this ends up as 1 + 7 + 7 + 6 = 21
>> words per file. So for 500K inodes each with one page, 42MB (Douglas
>> Adams fan?). Offsets of 10GB don't work here. Sounds like either an
>> interesting patch or a 64-bit machine if they work for you. =)

These numbers are wrong -- see the other reply.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> What do you mean with offsets of 10GB not working? In any recent
> distribution supporting LFS the file size limit is only a constraint of
> the filesystem on disk format. You don't need a 64bit arch for that.
> (and anyways any change to the pagecache must work fine for 64bit archs
> too)

I stand corrected on that. It appears the extra bits are used for large
files. The depth of the tree as represented in the calculation may need
to go up and so the worst case space usage is even larger than the
2.2-ish 32-bit - PAGE_SHIFT calculation.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> To avoid its worst case (or just poor distribution across the buckets),
>> a good hash function is necessary. And I don't believe the measurements
>> are in favor of the one currently in use. Also, the pointer links for

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> the randomization provided by the inode is quite powerful (and it makes
> it impossible to guess the hash bucket in use from userspace without
> privileges), and the current one makes sure to optimize the cacheline
> usage during consecutive reads.

chi^2 is nowhere near passing a confidence test for uniformity on the
pagecache and on various machines extremely poor bucket distribution
has been observed (i.e. visibly poor from histograms).

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> The key is of a fixed precision, hence the tree is of a fixed depth.
>> The radix tree is O(1).

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> What does it mean that the tree is of a fixed depth? If the depth is
> fixed and you claim the lookup complexity is O(1), how can you support
> terabytes of pagecache queued into the tree without wasting quite a lot
> of ram per inode in the worst case (the worst case is with only a few
> pages in each radix tree, just enough to make sure all the depth gets
> allocated)? Forget totally about x86, some boxes run linux with 256G of
> ram (I guess a terabyte is next).

The number of levels in the tree is proportional to the number of bits in
the machine word, which is a constant. A double-precision machine word
is of constant size as well. Radix trees can be used on strings, which
is where they would not be of fixed depth.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> Also the complexity arguments are all about the worst case, O(1) as said
> in the earlier email may very well be much slower than O(N) when you get
> to numbers. hashtable will provide a common case where there is no
> collision in the bucket, and then the lookup only consists of a pointer
> check.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> If your radix tree O(1) fixed depth is 10000, you will always have to
> walk 10000 pointers before you can finish the lookup and it will be
> definitely much slower than O(N).
> So keep in mind the math complexity arguments can be very misleading in
> real life.

I know how to use them. Unfortunately, I am not a lightning calculator,
as seen in the preceding post.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> of course with a per-inode data structure the locking issue while
> accessing different inodes goes away, but I think nominal performance of
> the data structure is more important than the scalability issue (also
> contention would remain in workloads where all tasks access the same
> inode, like a database). The cacheline part has to be taken into account,
> but the hashfn is just optimized for that one.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> Good luck booting on a 64GB x86 with excess pointer links in struct page.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> x86 doesn't matter. this is common code.

I wish.

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> O(1) is O(1). This isn't even average-case or worst case: it's all cases.
>> In a radix tree using fixed-precision search keys, such as machine words,
>> exactly the same number of internal nodes are traversed to reach a leaf
>> for every search key, every time, regardless of how populated or unpopulated
>> the radix tree is.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> and this means it will be slower than the hashtable that will reach the
> page without walking any "depth" in the common case.

Not so; the hash table will walk some number of pointer links which can
be estimated from the load on the table, and this is likely to be
approximately the same as the radix tree from hash table statistics I've
seen.

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> yes, that would be a few more bytes per page.... unless you allocate the
> node structure dynamically like you seem to be doing in the radix tree
> patch for the very same reason of not increasing the struct page I guess.

Well, maybe not given the revised numbers. =)

On Thu, Jan 31, 2002 at 08:39:34AM -0800, William Lee Irwin III wrote:
>> 4000 open files (much more realistic than 500K) each with one page

On Thu, Jan 31, 2002 at 06:21:50PM +0100, Andrea Arcangeli wrote:
> open files don't matter. what matters is the number of inodes with
> cache in them. 500k is definitely realistic, try to run updatedb with
> plenty of ram free and then check /proc/sys/fs/inode-nr.

I misspoke -- I used an estimate of the number of address_spaces in use
on busy systems I heard elsewhere and quadrupled it (and then said "open
files"). Since it's inodes, 4K is still a bad estimate.


Cheers,
Bill

2002-01-31 18:01:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 09:46:52AM -0800, Linus Torvalds wrote:
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >
> > but with the radix tree (please correct me if I'm wrong) the height will
> > increase eventually, no matter what (so it won't be an effective O(1)
> > like the hashtable provides in real life, not the worst case, the common
> > case). With the hashtable the height won't increase instead.
>
> No.
>
> The radix tree is basically O(1), because the maximum depth of a 7-bit
> radix tree is just 5. The index is only a 32-bit number.

then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).

Also there must be some significant memory overhead that can be
triggered with a certain layout of pages, in some configuration it
should take much more ram than the hashtable if I understood well how it
works.

Also its O(1) may be slower than the O(N) of the hashtable in 99% of
the cases.

>
> We could, in fact, make all page caches use a fixed-depth tree, which is
> clearly O(1). But the radix tree is slightly faster and tends to use less
> memory under common loads, so..
>
> Remember: you must NOT ignore the constant part of a "O(x)" equation.
> Hashes tend to be effectively O(1) under most loads, but they have cache
> costs, and they have scalability costs that a radix tree doesn't have.

On the scalability cost I obviously agree :) (however on some workloads
with all tasks on the same inode, the scalability cost remains the same).

Andrea

2002-01-31 18:34:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> >
> > The radix tree is basically O(1), because the maximum depth of a 7-bit
> > radix tree is just 5. The index is only a 32-bit number.
>
> then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).

NO.

The radix tree is an index lookup mechanism.

The index is 32 bits.

That's true regardless of how much RAM you have.

> Also there must be some significant memory overhead that can be
> triggered with a certain layout of pages, in some configuration it
> should take much more ram than the hashtable if I understood well how it
> works.

Considering that the radix tree can _remove_ 8 bytes per "struct page", I
suspect you potentially win more memory than you lose.

Linus

2002-01-31 18:39:10

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, 31 Jan 2002, Linus Torvalds wrote:

> > Also there must be some significant memory overhead that can be
> > triggered with a certain layout of pages, in some configuration it
> > should take much more ram than the hashtable if I understood well how it
> > works.
>
> Considering that the radix tree can _remove_ 8 bytes per "struct
> page", I suspect you potentially win more memory than you lose.

Actually, since the page cache hash table is also 8 bytes
per page, the radix trees effectively remove 16 bytes per
struct page.

Also, Momchil's radix trees are only as deep as needed
for each file, so most files should have very shallow
radix trees.
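
To put numbers on "as deep as needed" (assuming 4K pages and the
128-way fanout): one level covers files up to 512KB, two levels up to
64MB, three levels up to 8GB, and five levels are enough for the entire
32-bit page index.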

Combine these two facts with min_readahead and you'll
see that the memory consumption for radix trees should
be pretty decent.

It's still a question whether we'll want to use 128 as
the branch factor or another number ... but I'm sure
somebody will figure that out (and it can be changed
later, it's just one define).

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-01-31 18:50:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Thu, 31 Jan 2002, Rik van Riel wrote:
>
> It's still a question whether we'll want to use 128 as
> the branch factor or another number ... but I'm sure
> somebody will figure that out (and it can be changed
> later, it's just one define).

Actually, I think the big question is whether somebody is willing to clean
up and fix the "move_from_swap_cache()" issue with block_flushpage.

Linus

2002-01-31 19:10:10

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> On Thu, 31 Jan 2002, Rik van Riel wrote:
>>
>> It's still a question whether we'll want to use 128 as
>> the branch factor or another number ... but I'm sure
>> somebody will figure that out (and it can be changed
>> later, it's just one define).

Linus> Actually, I think the big question is whether somebody is willing to clean
Linus> up and fix the "move_from_swap_cache()" issue with block_flushpage.

Actually, I would be the one -- if only I knew what the issue was :)

2002-01-31 19:15:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 10:32:35AM -0800, Linus Torvalds wrote:
>
> On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
> > >
> > > The radix tree is basically O(1), because the maximum depth of a 7-bit
> > > radix tree is just 5. The index is only a 32-bit number.
> >
> > then it will break on archs with more ram than 1<<(32+PAGE_CACHE_SHIFT).
>
> NO.
>
> The radix tree is an index lookup mechanism.
>
> The index is 32 bits.
>
> That's true regardless of how much RAM you have.

then there must be some collision handling that raises the complexity to
O(N) like with the hashtable, if the depth is fixed and if 32 bits of
index are enough regardless of how many entries are in the tree.

I'm confused by the comments I've heard so far, but well, I don't want
to bother you further until I have a clear picture of how this data
structure is laid out exactly. I mainly wanted to give a warning, to be
sure this point is evaluated properly before integration.

> Considering that the radix tree can _remove_ 8 bytes per "struct page", I
> suspect you potentially win more memory than you lose.

of course if we add kmalloc to the pagecache code we can drop such part
from the page structure with the hashtable too.

Andrea

2002-01-31 19:25:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Thu, 31 Jan 2002, Andrea Arcangeli wrote:
>
> then there must be some collision handling that raises the complexity to
> O(N) like with the hashtable, if the depth is fixed and if 32 bits of
> index are enough regardless of how many entries are in the tree.

No collisions. Each mapping has its own private tree. And mappings are
virtually indexed by 32 bits. No hashes, no collisions, no nothing.

Think of the page tables. We can have 64GB of memory, and the page tables
will shrink and grow dynamically to match the needs for virtual memory.
The radix tree is no different, except it ends up being a bit more
aggressive about shrinking by virtue of not always using the maximum depth.

(A fixed depth tree is much simpler, and has equivalent memory use for
not-very-dense mappings. But file mappings are 99% dense).

> of course if we add kmalloc to the pagecache code we can drop such part
> from the page structure with the hashtable too.

But you still need the hashtable.

Right now the hashtable is _roughly_ the size of 4 bytes per physical page
in the machine - and it was done that way explicitly to avoid having to
walk the chains. That's a LOT of memory.

For example, on my 2GB machine, I have 2MB worth of hash-tables.

In addition, each "struct page" has 8 bytes in it, so we have a total of
12 bytes per page just for the hash chains.

And yes, you could use kmalloc to allocate the hash chain entries. But
we're _guaranteed_ that 12 bytes, and kmalloc overhead might make it
worse.

In short: the radix tree certainly isn't any worse.
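
A quick back-of-the-envelope check of those figures, assuming 4KB pages
and 4-byte pointers as on i386 (plain userspace C, just for the
arithmetic):

#include <stdio.h>

int main(void)
{
	unsigned long pages  = (2UL << 30) >> 12; /* 2GB / 4KB = 524288   */
	unsigned long htable = pages * 4;	  /* bucket heads: 2MB    */
	unsigned long chains = pages * 8;	  /* next/pprev_hash: 4MB */

	printf("hash table: %luMB, chain pointers in struct page: %luMB, "
	       "total: %lu bytes per page\n",
	       htable >> 20, chains >> 20, (htable + chains) / pages);
	return 0;
}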

Linus

2002-01-31 19:34:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Linus Torvalds wrote:
>
> On Thu, 31 Jan 2002, Rik van Riel wrote:
> >
> > It's still a question whether we'll want to use 128 as
> > the branch factor or another number ... but I'm sure
> > somebody will figure that out (and it can be changed
> > later, it's just one define).
>
> Actually, I think the big question is whether somebody is willing to clean
> up and fix the "move_from_swap_cache()" issue with block_flushpage.
>

It appears that move_from_swap_cache() is in good company:

1: shmem_unuse_inode() calls delete_from_swap_cache under
spinlock, but delete_from_swap_cache() calls block_flushpage(),
which can sleep.

2: shmem_getpage_locked() calls delete_from_swap_cache() calls
block_flushpage() under info->lock.

3: zap_pte_range holds mm->page_table_lock, and calls
free_swap_and_cache() calls delete_from_swap_cache() calls
block_flushpage().


block_flushpage() can only sleep in the lock_buffer() in
discard_buffer(). It so happens that all three callers
are always using block_flushpage() against a locked
swapcache page, and (correct me if I'm wrong), it's
not possible for those buffers to be locked.

So we got lucky.

A short-term fix is to put a BIG FAT COMMENT over block_flushpage.
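
The shape of the bug is the same in all three chains; schematically
(this is an illustration, not the actual shmem/swap code):

/* Simplified illustration of the hazard shared by the three chains
 * above - a spinlock held across a path that can sleep. */
static void hazard_example(struct shmem_inode_info *info, struct page *page)
{
	spin_lock(&info->lock);
	delete_from_swap_cache(page);	/* -> block_flushpage()         */
					/* -> discard_buffer()          */
					/* -> lock_buffer(): may sleep  */
	spin_unlock(&info->lock);	/* reached too late if we slept */
}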

-

2002-01-31 19:37:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Thu, 31 Jan 2002, Linus Torvalds wrote:

> > then there must be some collision handling that raises the complexity to
> > O(N) like with the hashtable, if the depth is fixed and if 32 bits of
> > index are enough regardless of how many entries are in the tree.
>
> No collisions. Each mapping has its own private tree. And mappings are
> virtually indexed by 32 bits. No hashes, no collisions, no nothing.

Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
(tree vs. scalable-hash page buckets) for SMP scalability against the
stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
performance of radix trees vs. scalable hash to be at least equivalent. (i
think Anton has a few links to show the resulting graphs.)

In fact the radix trees showed a slight performance/scalability edge in
some parts of the performance curve. So given the fact that hashes/buckets
were *purely* designed for speed/scalability and not for RAM usage, this
proves that radix trees are superior. Plus the locking is much simpler
than for the hash buckets solution. Which makes radix trees a clear winner
IMO.

Ingo

2002-01-31 21:14:02

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Linus" == Linus Torvalds <[email protected]> writes:

Linus> On Thu, 31 Jan 2002, Rik van Riel wrote:
>>
>> It's still a question whether we'll want to use 128 as
>> the branch factor or another number ... but I'm sure
>> somebody will figure that out (and it can be changed
>> later, it's just one define).

Linus> Actually, I think the big question is whether somebody is willing to clean
Linus> up and fix the "move_from_swap_cache()" issue with block_flushpage.

Ah, almost forgot it. The patch removes ``next_hash'' and
``pprev_hash'' from ``struct page'', which breaks ARM and sparc64.

Regards,
-velco

2002-01-31 23:25:37

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


Hi Ingo,

> Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
> (tree vs. scalable-hash page buckets) for SMP scalability against the
> stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
> performance of radix trees vs. scalable hash to be at least equivalent. (i
> think Anton has a few links to show the resulting graphs.)

Here are some results on a 12 way machine. (2.4.16-splay is the radix
patch):

http://samba.org/~anton/linux/pagecache_locking/1/summary.png

As you can see both patches give pretty much equal improvements.

The other problem with the current pagecache hash is that it maxes out
at order 9 (due to the get_free_pages limitation) which starts to hurt
at 4GB RAM and above. On a 32GB machine the average hashchain depth
was very high:

http://samba.org/~anton/linux/pagecache/pagecache_before.png

There were a few solutions (from davem and ingo) to allocate a larger
hash but with the radix patch we no longer have to worry about this.

So the radix patch solves 2 problems quite nicely :)

Anton

2002-01-31 23:49:57

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


> That has some potential big wins beyond oracle. Some of the big number
> crunching algorithms also benefit heavily from 4Mb pages even when you
> try and minimise tlb misses.

There are further wins on some architectures, eg POWER has hardware
prefetch streams which terminate at a page boundary. With a 4kB pagesize
the prefetch engine will have to restart every 4kB, so we would want to
use 16MB pages if possible.

How would we allocate large pages? Would there be a boot option to
reserve an area of RAM for large pages only?

Anton

2002-01-31 23:55:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, Feb 01, 2002 at 10:12:42AM +1100, Anton Blanchard wrote:
>
> Hi Ingo,
>
> > Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
> > (tree vs. scalable-hash page buckets) for SMP scalability against the
> > stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
> > performance of radix trees vs. scalable hash to be at least equivalent. (i
> > think Anton has a few links to show the resulting graphs.)
>
> Here are some results on a 12 way machine. (2.4.16-splay is the radix
> patch):
>
> http://samba.org/~anton/linux/pagecache_locking/1/summary.png
>
> As you can see both patches give pretty much equal improvements.
>
> The other problem with the current pagecache hash is that it maxes out
> at order 9 (due to the get_free_pages limitation) which starts to hurt
> at 4GB RAM and above. On a 32GB machine the average hashchain depth
> was very high:
>
> http://samba.org/~anton/linux/pagecache/pagecache_before.png
>
> There were a few solutions (from davem and ingo) to allocate a larger
> hash but with the radix patch we no longer have to worry about this.
>
> So the radix patch solves 2 problems quite nicely :)

all the hashes should be allocated with the bootmem allocator, which
doesn't have the MAX_ORDER limit - not only the pagecache hash, which is
the only one replaced here.

In short, for an optimal comparison between hash and radix tree, we'd
need to fixup the hash allocation with the bootmem allocator first.

Andrea

2002-02-01 00:04:17

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

From: Andrea Arcangeli <[email protected]>
Date: Fri, 1 Feb 2002 00:55:43 +0100

In short, for an optimal comparison between hash and radix tree, we'd
need to fixup the hash allocation with the bootmem allocator first.

I'm totally convinced the radix stuff is much better, but since you
are not, here is the "pagecache hash in bootmem" patch I did ages ago
so Anton can make you happy :-)

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/init.c linux/arch/alpha/mm/init.c
--- vanilla/linux/arch/alpha/mm/init.c Thu Sep 20 20:02:03 2001
+++ linux/arch/alpha/mm/init.c Sat Nov 10 01:49:56 2001
@@ -23,6 +23,7 @@
#ifdef CONFIG_BLK_DEV_INITRD
#include <linux/blk.h>
#endif
+#include <linux/pagemap.h>

#include <asm/system.h>
#include <asm/uaccess.h>
@@ -360,6 +361,7 @@
mem_init(void)
{
max_mapnr = num_physpages = max_low_pfn;
+ page_cache_init(count_free_bootmem());
totalram_pages += free_all_bootmem();
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/alpha/mm/numa.c linux/arch/alpha/mm/numa.c
--- vanilla/linux/arch/alpha/mm/numa.c Sun Aug 12 10:38:48 2001
+++ linux/arch/alpha/mm/numa.c Sat Nov 10 01:52:27 2001
@@ -15,6 +15,7 @@
#ifdef CONFIG_BLK_DEV_INITRD
#include <linux/blk.h>
#endif
+#include <linux/pagemap.h>

#include <asm/hwrpb.h>
#include <asm/pgalloc.h>
@@ -359,8 +360,13 @@
extern char _text, _etext, _data, _edata;
extern char __init_begin, __init_end;
extern unsigned long totalram_pages;
- unsigned long nid, i;
+ unsigned long nid, i, num_free_bootmem_pages;
mem_map_t * lmem_map;
+
+ num_free_bootmem_pages = 0;
+ for (nid = 0; nid < numnodes; nid++)
+ num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(nid));
+ page_cache_init(num_free_bootmem_pages);

high_memory = (void *) __va(max_mapnr <<PAGE_SHIFT);

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/arm/mm/init.c linux/arch/arm/mm/init.c
--- vanilla/linux/arch/arm/mm/init.c Thu Oct 11 09:04:57 2001
+++ linux/arch/arm/mm/init.c Sat Nov 10 01:52:34 2001
@@ -23,6 +23,7 @@
#include <linux/init.h>
#include <linux/bootmem.h>
#include <linux/blk.h>
+#include <linux/pagemap.h>

#include <asm/segment.h>
#include <asm/mach-types.h>
@@ -594,6 +595,7 @@
void __init mem_init(void)
{
unsigned int codepages, datapages, initpages;
+ unsigned long num_free_bootmem_pages;
int i, node;

codepages = &_etext - &_text;
@@ -608,6 +610,11 @@
*/
if (meminfo.nr_banks != 1)
create_memmap_holes(&meminfo);
+
+ num_free_bootmem_pages = 0;
+ for (node = 0; node < numnodes; node++)
+ num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(node));
+ page_cache_init(num_free_bootmem_pages);

/* this will put all unused low memory onto the freelists */
for (node = 0; node < numnodes; node++) {
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/cris/mm/init.c linux/arch/cris/mm/init.c
--- vanilla/linux/arch/cris/mm/init.c Thu Jul 26 15:10:06 2001
+++ linux/arch/cris/mm/init.c Sat Nov 10 01:53:10 2001
@@ -95,6 +95,7 @@
#include <linux/swap.h>
#include <linux/smp.h>
#include <linux/bootmem.h>
+#include <linux/pagemap.h>

#include <asm/system.h>
#include <asm/segment.h>
@@ -366,6 +367,8 @@

max_mapnr = num_physpages = max_low_pfn - min_low_pfn;

+ page_cache_init(count_free_bootmem());
+
/* this will put all memory onto the freelists */
totalram_pages = free_all_bootmem();

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/i386/mm/init.c linux/arch/i386/mm/init.c
--- vanilla/linux/arch/i386/mm/init.c Sun Nov 18 19:59:22 2001
+++ linux/arch/i386/mm/init.c Mon Nov 12 00:14:00 2001
@@ -466,6 +466,8 @@
#endif
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);

+ page_cache_init(count_free_bootmem());
+
/* clear the zero-page */
memset(empty_zero_page, 0, PAGE_SIZE);

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ia64/mm/init.c linux/arch/ia64/mm/init.c
--- vanilla/linux/arch/ia64/mm/init.c Sun Nov 18 19:59:23 2001
+++ linux/arch/ia64/mm/init.c Sat Nov 10 01:54:20 2001
@@ -13,6 +13,7 @@
#include <linux/reboot.h>
#include <linux/slab.h>
#include <linux/swap.h>
+#include <linux/pagemap.h>

#include <asm/bitops.h>
#include <asm/dma.h>
@@ -406,6 +407,8 @@

max_mapnr = max_low_pfn;
high_memory = __va(max_low_pfn * PAGE_SIZE);
+
+ page_cache_init(count_free_bootmem());

totalram_pages += free_all_bootmem();

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/m68k/mm/init.c linux/arch/m68k/mm/init.c
--- vanilla/linux/arch/m68k/mm/init.c Thu Sep 20 20:02:03 2001
+++ linux/arch/m68k/mm/init.c Sat Nov 10 01:54:47 2001
@@ -20,6 +20,7 @@
#ifdef CONFIG_BLK_DEV_RAM
#include <linux/blk.h>
#endif
+#include <linux/pagemap.h>

#include <asm/setup.h>
#include <asm/uaccess.h>
@@ -135,6 +136,8 @@
if (MACH_IS_ATARI)
atari_stram_mem_init_hook();
#endif
+
+ page_cache_init(count_free_bootmem());

/* this will put all memory onto the freelists */
totalram_pages = free_all_bootmem();
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips/mm/init.c linux/arch/mips/mm/init.c
--- vanilla/linux/arch/mips/mm/init.c Wed Jul 4 11:50:39 2001
+++ linux/arch/mips/mm/init.c Sat Nov 10 01:55:09 2001
@@ -28,6 +28,7 @@
#ifdef CONFIG_BLK_DEV_INITRD
#include <linux/blk.h>
#endif
+#include <linux/pagemap.h>

#include <asm/bootinfo.h>
#include <asm/cachectl.h>
@@ -203,6 +204,8 @@

max_mapnr = num_physpages = max_low_pfn;
high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+ page_cache_init(count_free_bootmem());

totalram_pages += free_all_bootmem();
totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. */
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/mm/init.c linux/arch/mips64/mm/init.c
--- vanilla/linux/arch/mips64/mm/init.c Wed Jul 4 11:50:39 2001
+++ linux/arch/mips64/mm/init.c Sat Nov 10 01:55:30 2001
@@ -25,6 +25,7 @@
#ifdef CONFIG_BLK_DEV_INITRD
#include <linux/blk.h>
#endif
+#include <linux/pagemap.h>

#include <asm/bootinfo.h>
#include <asm/cachectl.h>
@@ -396,6 +397,8 @@

max_mapnr = num_physpages = max_low_pfn;
high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+ page_cache_init(count_free_bootmem());

totalram_pages += free_all_bootmem();
totalram_pages -= setup_zero_pages(); /* Setup zeroed pages. */
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c linux/arch/mips64/sgi-ip27/ip27-memory.c
--- vanilla/linux/arch/mips64/sgi-ip27/ip27-memory.c Sun Sep 9 10:43:02 2001
+++ linux/arch/mips64/sgi-ip27/ip27-memory.c Sat Nov 10 02:02:33 2001
@@ -15,6 +15,7 @@
#include <linux/mm.h>
#include <linux/bootmem.h>
#include <linux/swap.h>
+#include <linux/pagemap.h>

#include <asm/page.h>
#include <asm/bootinfo.h>
@@ -277,6 +278,11 @@
num_physpages = numpages; /* memory already sized by szmem */
max_mapnr = pagenr; /* already found during paging_init */
high_memory = (void *) __va(max_mapnr << PAGE_SHIFT);
+
+ tmp = 0;
+ for (nid = 0; nid < numnodes; nid++)
+ tmp += count_free_bootmem_node(NODE_DATA(nid));
+ page_cache_init(tmp);

for (nid = 0; nid < numnodes; nid++) {

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/parisc/mm/init.c linux/arch/parisc/mm/init.c
--- vanilla/linux/arch/parisc/mm/init.c Tue Dec 5 12:29:39 2000
+++ linux/arch/parisc/mm/init.c Sat Nov 10 01:57:11 2001
@@ -17,6 +17,7 @@
#include <linux/pci.h> /* for hppa_dma_ops and pcxl_dma_ops */
#include <linux/swap.h>
#include <linux/unistd.h>
+#include <linux/pagemap.h>

#include <asm/pgalloc.h>

@@ -48,6 +49,8 @@
{
max_mapnr = num_physpages = max_low_pfn;
high_memory = __va(max_low_pfn * PAGE_SIZE);
+
+ page_cache_init(count_free_bootmem());

totalram_pages += free_all_bootmem();
printk("Memory: %luk available\n", totalram_pages << (PAGE_SHIFT-10));
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/ppc/mm/init.c linux/arch/ppc/mm/init.c
--- vanilla/linux/arch/ppc/mm/init.c Tue Oct 2 09:12:44 2001
+++ linux/arch/ppc/mm/init.c Sat Nov 10 01:57:34 2001
@@ -34,6 +34,7 @@
#ifdef CONFIG_BLK_DEV_INITRD
#include <linux/blk.h> /* for initrd_* */
#endif
+#include <linux/pagemap.h>

#include <asm/pgalloc.h>
#include <asm/prom.h>
@@ -462,6 +463,8 @@

high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
num_physpages = max_mapnr; /* RAM is assumed contiguous */
+
+ page_cache_init(count_free_bootmem());

totalram_pages += free_all_bootmem();

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390/mm/init.c linux/arch/s390/mm/init.c
--- vanilla/linux/arch/s390/mm/init.c Thu Oct 11 09:04:57 2001
+++ linux/arch/s390/mm/init.c Sat Nov 10 01:57:56 2001
@@ -186,6 +186,8 @@
/* clear the zero-page */
memset(empty_zero_page, 0, PAGE_SIZE);

+ page_cache_init(count_free_bootmem());
+
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem();

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/s390x/mm/init.c linux/arch/s390x/mm/init.c
--- vanilla/linux/arch/s390x/mm/init.c Sun Nov 18 19:59:23 2001
+++ linux/arch/s390x/mm/init.c Sat Nov 10 01:58:14 2001
@@ -198,6 +198,8 @@
/* clear the zero-page */
memset(empty_zero_page, 0, PAGE_SIZE);

+ page_cache_init(count_free_bootmem());
+
/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem();

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sh/mm/init.c linux/arch/sh/mm/init.c
--- vanilla/linux/arch/sh/mm/init.c Mon Oct 15 13:36:48 2001
+++ linux/arch/sh/mm/init.c Sat Nov 10 01:59:56 2001
@@ -26,6 +26,7 @@
#endif
#include <linux/highmem.h>
#include <linux/bootmem.h>
+#include <linux/pagemap.h>

#include <asm/processor.h>
#include <asm/system.h>
@@ -139,6 +140,7 @@
void __init mem_init(void)
{
extern unsigned long empty_zero_page[1024];
+ unsigned long num_free_bootmem_pages;
int codesize, reservedpages, datasize, initsize;
int tmp;

@@ -148,6 +150,12 @@
/* clear the zero-page */
memset(empty_zero_page, 0, PAGE_SIZE);
__flush_wback_region(empty_zero_page, PAGE_SIZE);
+
+ num_free_bootmem_pages = count_free_bootmem_node(NODE_DATA(0));
+#ifdef CONFIG_DISCONTIGMEM
+ num_free_bootmem_pages += count_free_bootmem_node(NODE_DATA(1));
+#endif
+ page_cache_init(num_free_bootmem_pages);

/* this will put all low memory onto the freelists */
totalram_pages += free_all_bootmem_node(NODE_DATA(0));
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc/mm/init.c linux/arch/sparc/mm/init.c
--- vanilla/linux/arch/sparc/mm/init.c Mon Oct 1 09:19:56 2001
+++ linux/arch/sparc/mm/init.c Mon Nov 12 19:27:47 2001
@@ -25,6 +25,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/bootmem.h>
+#include <linux/pagemap.h>

#include <asm/system.h>
#include <asm/segment.h>
@@ -434,6 +432,8 @@

max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT);
high_memory = __va(max_low_pfn << PAGE_SHIFT);
+
+ page_cache_init(count_free_bootmem());

#ifdef DEBUG_BOOTMEM
prom_printf("mem_init: Calling free_all_bootmem().\n");
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/arch/sparc64/mm/init.c linux/arch/sparc64/mm/init.c
--- vanilla/linux/arch/sparc64/mm/init.c Sun Nov 18 19:59:23 2001
+++ linux/arch/sparc64/mm/init.c Sat Nov 17 23:51:28 2001
@@ -1583,6 +1583,8 @@

max_mapnr = last_valid_pfn - (phys_base >> PAGE_SHIFT);
high_memory = __va(last_valid_pfn << PAGE_SHIFT);
+
+ page_cache_init(count_free_bootmem());

num_physpages = free_all_bootmem() - 1;

diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/include/linux/bootmem.h linux/include/linux/bootmem.h
--- vanilla/linux/include/linux/bootmem.h Mon Nov 5 12:43:18 2001
+++ linux/include/linux/bootmem.h Mon Nov 19 10:22:17 2001
@@ -43,11 +43,13 @@
#define alloc_bootmem_low_pages(x) \
__alloc_bootmem((x), PAGE_SIZE, 0)
extern unsigned long __init free_all_bootmem (void);
+extern unsigned long __init count_free_bootmem (void);

extern unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn);
extern void __init reserve_bootmem_node (pg_data_t *pgdat, unsigned long physaddr, unsigned long size);
extern void __init free_bootmem_node (pg_data_t *pgdat, unsigned long addr, unsigned long size);
extern unsigned long __init free_all_bootmem_node (pg_data_t *pgdat);
+extern unsigned long __init count_free_bootmem_node (pg_data_t *pgdat);
extern void * __init __alloc_bootmem_node (pg_data_t *pgdat, unsigned long size, unsigned long align, unsigned long goal);
#define alloc_bootmem_node(pgdat, x) \
__alloc_bootmem_node((pgdat), (x), SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/init/main.c linux/init/main.c
--- vanilla/linux/init/main.c Sun Nov 18 19:59:37 2001
+++ linux/init/main.c Sat Nov 10 04:58:16 2001
@@ -597,7 +597,6 @@
proc_caches_init();
vfs_caches_init(mempages);
buffer_init(mempages);
- page_cache_init(mempages);
#if defined(CONFIG_ARCH_S390)
ccwcache_init();
#endif
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/bootmem.c linux/mm/bootmem.c
--- vanilla/linux/mm/bootmem.c Tue Sep 18 14:10:43 2001
+++ linux/mm/bootmem.c Mon Nov 12 20:40:58 2001
@@ -272,6 +279,28 @@
return total;
}

+static unsigned long __init count_free_bootmem_core(pg_data_t *pgdat)
+{
+ bootmem_data_t *bdata = pgdat->bdata;
+ unsigned long i, idx, total;
+
+ if (!bdata->node_bootmem_map) BUG();
+
+ total = 0;
+ idx = bdata->node_low_pfn - (bdata->node_boot_start >> PAGE_SHIFT);
+ for (i = 0; i < idx; i++) {
+ if (!test_bit(i, bdata->node_bootmem_map))
+ total++;
+ }
+
+ /*
+ * Count the allocator bitmap itself.
+ */
+ total += ((bdata->node_low_pfn-(bdata->node_boot_start >> PAGE_SHIFT))/8 + PAGE_SIZE-1)/PAGE_SIZE;
+
+ return total;
+}
+
unsigned long __init init_bootmem_node (pg_data_t *pgdat, unsigned long freepfn, unsigned long startpfn, unsigned long endpfn)
{
return(init_bootmem_core(pgdat, freepfn, startpfn, endpfn));
@@ -292,6 +321,11 @@
return(free_all_bootmem_core(pgdat));
}

+unsigned long __init count_free_bootmem_node (pg_data_t *pgdat)
+{
+ return(count_free_bootmem_core(pgdat));
+}
+
unsigned long __init init_bootmem (unsigned long start, unsigned long pages)
{
max_low_pfn = pages;
@@ -312,6 +346,11 @@
unsigned long __init free_all_bootmem (void)
{
return(free_all_bootmem_core(&contig_page_data));
+}
+
+unsigned long __init count_free_bootmem (void)
+{
+ return(count_free_bootmem_core(&contig_page_data));
}

void * __init __alloc_bootmem (unsigned long size, unsigned long align, unsigned long goal)
diff -u --recursive --new-file --exclude=CVS --exclude=.cvsignore vanilla/linux/mm/filemap.c linux/mm/filemap.c
--- vanilla/linux/mm/filemap.c Sun Nov 18 19:59:38 2001
+++ linux/mm/filemap.c Fri Nov 16 07:31:35 2001
@@ -24,6 +24,7 @@
#include <linux/mm.h>
#include <linux/iobuf.h>
#include <linux/compiler.h>
+#include <linux/bootmem.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -2931,28 +2932,48 @@
goto unlock;
}

+/* This is called from the arch specific mem_init routine.
+ * It is done right before free_all_bootmem (or NUMA equivalent).
+ *
+ * The mempages arg is the number of pages free_all_bootmem is
+ * going to liberate, or a close approximation.
+ *
+ * We have to use bootmem because on huge systems (ie. 16GB ram)
+ * get_free_pages cannot give us a large enough allocation.
+ */
void __init page_cache_init(unsigned long mempages)
{
- unsigned long htable_size, order;
+ unsigned long htable_size, real_size;

htable_size = mempages;
htable_size *= sizeof(struct page *);
- for(order = 0; (PAGE_SIZE << order) < htable_size; order++)
+
+ for (real_size = 1UL; real_size < htable_size; real_size <<= 1UL)
;

do {
- unsigned long tmp = (PAGE_SIZE << order) / sizeof(struct page *);
+ unsigned long tmp = (real_size / sizeof(struct page *));
+ unsigned long align;

page_hash_bits = 0;
while((tmp >>= 1UL) != 0UL)
page_hash_bits++;
+
+ align = real_size;
+ if (align > (4UL * 1024UL * 1024UL))
+ align = (4UL * 1024UL * 1024UL);
+
+ page_hash_table = __alloc_bootmem(real_size, align,
+ __pa(MAX_DMA_ADDRESS));
+
+ /* Perhaps the alignment was too strict. */
+ if (page_hash_table == NULL)
+ page_hash_table = alloc_bootmem(real_size);
+ } while (page_hash_table == NULL &&
+ (real_size >>= 1UL) >= PAGE_SIZE);

- page_hash_table = (struct page **)
- __get_free_pages(GFP_ATOMIC, order);
- } while(page_hash_table == NULL && --order > 0);
-
- printk("Page-cache hash table entries: %d (order: %ld, %ld bytes)\n",
- (1 << page_hash_bits), order, (PAGE_SIZE << order));
+ printk("Page-cache hash table entries: %d (%ld bytes)\n",
+ (1 << page_hash_bits), real_size);
if (!page_hash_table)
panic("Failed to allocate page hash table\n");
memset((void *)page_hash_table, 0, PAGE_HASH_SIZE * sizeof(struct page *));

2002-02-01 00:21:47

by Alan

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> the prefetch engine will have to restart every 4kB, so we would want to
> use 16MB pages if possible.
>
> How would we allocate large pages? Would there be a boot option to
> reserve an area of RAM for large pages only?

If you have an rmap all you have to do is to avoid smearing kernel objects
around lots of 16Mb page sets. If need be you can then get a 16Mb page
back just by shuffling user pages.

It does make the performance analysis much more interesting though.

2002-02-01 04:00:12

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


> all the hashes should be allocated with the bootmem allocator, that
> doesn't have the MAX_ORDER limit. Not only the pagecache hash, that is
> the only one replaced.
>
> In short, for an optimal comparison between hash and radix tree, we'd
> need to fixup the hash allocation with the bootmem allocator first.

All my results use vmalloc to allocate the hashes so they get sized
correctly.

Don't worry, there is no increased TLB pressure on these machines due
to vmalloc; that CPU doesn't have large page support.

Anton

2002-02-01 06:33:50

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Anton" == Anton Blanchard <[email protected]> writes:

Anton> Hi Ingo,

>> Yes, it's very nice. Anton Blanchard has benchmarked both patch variants
>> (tree vs. scalable-hash page buckets) for SMP scalability against the
>> stock hash, on big RAM, many CPUs boxes, via dbench load. He has found
>> performance of radix trees vs. scalable hash to be at least equivalent. (i
>> think Anton has a few links to show the resulting graphs.)

Anton> Here are some results on a 12 way machine. (2.4.16-splay is the radix
Anton> patch):

Anton> http://samba.org/~anton/linux/pagecache_locking/1/summary.png

A correction, "-splay" is the very first variant I posted, which used
splay trees for the page cache.

Anton> As you can see both patches give pretty much equal improvements.

Anton> The other problem with the current pagecache hash is that it maxes out
Anton> at order 9 (due to the get_free_pages limitation) which starts to hurt
Anton> at 4GB RAM and above. On a 32GB machine the average hashchain depth
Anton> was very high:

Anton> http://samba.org/~anton/linux/pagecache/pagecache_before.png

Anton> There were a few solutions (from davem and ingo) to allocate a larger
Anton> hash but with the radix patch we no longer have to worry about this.

Anton> So the radix patch solves 2 problems quite nicely :)

Anton> Anton

2002-02-01 07:58:49

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Ingo" == Ingo Molnar <[email protected]> writes:

Ingo> On Fri, 1 Feb 2002, Anton Blanchard wrote:

>> There were a few solutions (from davem and ingo) to allocate a larger
>> hash but with the radix patch we no longer have to worry about this.

Ingo> there is one big issue we forgot to consider.

Ingo> in the case of radix trees it's not only search depth that gets worse with

Hmm, worse, yes, the same way as page tables get "worse" with larger
address spaces.

Ingo> big files. The thing i'm worried about is the 'big pagecache lock' being
Ingo> reintroduced again. If eg. a database application puts lots of data into a

Yes, though I'd strongly suspect big database engines can/should/do
benefit from doing their application specific caching and indexing,
outperforming whatever cache implementation the OS has.

Ingo> single file (multiple gigabytes - why not), then the
Ingo> mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
Ingo> serious SMP contention for even the read() case. Benchmarks show that it's
Ingo> the distribution of locks that matters on big boxes.

So, we can use a read-write spinlock instead of ->i_shared_lock, ok?

Regards,
-velco

2002-02-01 07:08:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Fri, 1 Feb 2002, Anton Blanchard wrote:

> There were a few solutions (from davem and ingo) to allocate a larger
> hash but with the radix patch we no longer have to worry about this.

there is one big issue we forgot to consider.

in the case of radix trees it's not only search depth that gets worse with
big files. The thing i'm worried about is the 'big pagecache lock' being
reintroduced again. If eg. a database application puts lots of data into a
single file (multiple gigabytes - why not), then the
mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
serious SMP contention for even the read() case. Benchmarks show that it's
the distribution of locks that matters on big boxes.

dbench hides this issue, because it uses many temporary files, so the
locking overhead is distributed. Would you be willing to run benchmarks
that measure the scalability of reading from one bigger file, from
multiple CPUs?
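
For reference, a minimal userspace sketch of such a test - the path,
process count and buffer size below are placeholders:

/* Several processes read()ing one big file in parallel, to hammer the
 * per-mapping lock. Placeholder values throughout. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	char buf[65536];
	int i;

	for (i = 0; i < 8; i++) {		/* one reader per CPU */
		if (fork() == 0) {
			int fd = open("/tmp/bigfile", O_RDONLY);
			int pass;

			if (fd < 0) { perror("open"); _exit(1); }
			for (pass = 0; pass < 16; pass++) {
				lseek(fd, 0, SEEK_SET);
				while (read(fd, buf, sizeof(buf)) > 0)
					;	/* pagecache lookups galore */
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;				/* reap all readers */
	return 0;
}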

with hash based locking, the locking overhead is *always* distributed.

with radix trees the locking overhead is distributed only if multiple
files are used. With one big file (or a few big files), the i_shared_lock
will always bounce between CPUs wildly in read() workloads, degrading
scalability just as much as it is degraded with the pagecache_lock now.

Ingo

2002-02-01 08:34:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On 1 Feb 2002, Momchil Velikov wrote:

> Hmm, worse, yes, the same way as page tables get "worse" with larger
> address spaces.

with the difference that for address spaces one of the preferred methods
of operation is read() [or sendfile(), or any other non-mmap() operation],
while for pagetables the hardware helps to get locking-free access to the
mapped contents.

> Ingo> big files. The thing i'm worried about is the 'big pagecache lock' being
> Ingo> reintroduced again. If eg. a database application puts lots of data into a
>
> Yes, though I'd strongly suspect big database engines can/should/do
> benefit from doing their application specific caching and indexing,
> outperforming whatever cache implementation the OS has.

it's not just databases. It's webservers too, serving content via
sendfile() from a single, bigger file. Think streaming media servers,
where the 'movie of the night' sits in a single big binary glob.

> Ingo> single file (multiple gigabytes - why not), then the
> Ingo> mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
> Ingo> serious SMP contention for even the read() case. Benchmarks show that it's
> Ingo> the distribution of locks that matters on big boxes.
>
> So, we can use a read-write spinlock instead of ->i_shared_lock, ok?

using read-write locks does not solve the scalability problem: the problem
is the bouncing of the spinlock cacheline from CPU to CPU.

Ingo

2002-02-01 09:01:03

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Ingo" == Ingo Molnar <[email protected]> writes:

Ingo> On 1 Feb 2002, Momchil Velikov wrote:
>> So, we can use a read-write spinlock instead of ->i_shared_lock, ok?

Ingo> using read-write locks does not solve the scalability problem: the problem
Ingo> is the bouncing of the spinlock cacheline from CPU to CPU.

Does a cache line bounce (shared somewhere -> exclusive elsewhere) cost
more than a simple miss (present nowhere -> exclusive somewhere)?

Regards,
-velco

2002-02-01 09:10:04

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

From: Ingo Molnar <[email protected]>
Date: Fri, 1 Feb 2002 11:29:53 +0100 (CET)

using read-write locks does not solve the scalability problem: the problem
is the bouncing of the spinlock cacheline from CPU to CPU.

I so much wish more people understood this :(

2002-02-01 09:12:04

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "David" == David S Miller <[email protected]> writes:

David> From: Ingo Molnar <[email protected]>
David> Date: Fri, 1 Feb 2002 11:29:53 +0100 (CET)

David> using read-write locks does not solve the scalability problem: the problem
David> is the bouncing of the spinlock cacheline from CPU to CPU.

David> I so much wish more people understood this :(

Amen. From now on I'll have it on a yellow sticker on my display ;)


2002-02-01 09:12:24

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

From: Momchil Velikov <[email protected]>
Date: 01 Feb 2002 11:01:50 +0200

Does a cache line bounce (shared somewhere -> exclusive elsewhere) cost
more than a simple miss (present nowhere -> exclusive somewhere)?

They are about equal. For coherency purposes all CPUs have to listen
to all the transactions anyway, to see if they have a match in their
L2 caches (and thus must provide the data to the requestor).

Perhaps the exclusive somewhere --> exclusive somewhere else is a bit
more expensive because you eat a write port for the cache line move on
the processor providing the data.

2002-02-01 11:05:23

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, 1 Feb 2002, Alan Cox wrote:

> > the prefetch engine will have to restart every 4kB, so we would want to
> > use 16MB pages if possible.
> >
> > How would we allocate large pages? Would there be a boot option to
> > reserve an area of RAM for large pages only?
>
> If you have an rmap all you have to do is to avoid smearing kernel objects
> around lots of 16Mb page sets. If need be you can then get a 16Mb page
> back just by shuffling user pages.
>
> It does make the performance analysis much more interesting though.

Actually, I suspect that for most workloads the amount of
large pages vs. the amount of small pages should be fairly
static.

In that case we can just reclaim an old large page from
the inactive_clean list whenever we want to allocate a new
one.

As for not putting kernel objects everywhere, this comes
naturally with HIGHMEM ;)

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-01 11:34:23

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Rik van Riel wrote:
>
> On Fri, 1 Feb 2002, Alan Cox wrote:
>
> > > the prefetch engine will have to restart every 4kB, so we would want to
> > > use 16MB pages if possible.
> > >
> > > How would we allocate large pages? Would there be a boot option to
> > > reserve an area of RAM for large pages only?
> >
> > If you have an rmap all you have to do is to avoid smearing kernel objects
> > around lots of 16Mb page sets. If need be you can then get a 16Mb page
> > back just by shuffling user pages.
> >
> > It does make the performance analysis much more interesting though.
>
> Actually, I suspect that for most workloads the amount of
> large pages vs. the amount of small pages should be fairly
> static.
>
> In that case we can just reclaim an old large page from
> the inactive_clean list whenever we want to allocate a new
> one.
>
> As for not putting kernel objects everywhere, this comes
> naturally with HIGHMEM ;)

well, except when you start putting pagetables in highmem, as Andrea is
doing (and it makes tons of sense to do that)

2002-02-01 14:44:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, Feb 01, 2002 at 10:04:50AM +0100, Ingo Molnar wrote:
>
> On Fri, 1 Feb 2002, Anton Blanchard wrote:
>
> > There were a few solutions (from davem and ingo) to allocate a larger
> > hash but with the radix patch we no longer have to worry about this.
>
> there is one big issue we forgot to consider.
>
> in the case of radix trees it's not only search depth that gets worse with
> big files. The thing i'm worried about is the 'big pagecache lock' being
> reintroduced again. If eg. a database application puts lots of data into a
> single file (multiple gigabytes - why not), then the
> mapping->i_shared_lock becomes a 'big pagecache lock' again, causing
> serious SMP contention for even the read() case. Benchmarks show that it's
> the distribution of locks that matters on big boxes.

exactly, this is the same thing I mentioned in some past email. It's not
that having per-inode data structures solves the locking completely -
DBMSs are used to storing stuff in a single file. And of course with a
structure like a radix tree it would be a pain to make it scale within
the same file, unlike with the hashtable, where each bucket is
independent from the others.

>
> dbench hides this issue, because it uses many temporary files, so the

Indeed, a lot of workloads would benefit from the separate data
structures and locking, but not all - some important ones would not.

> locking overhead is distributed. Would you be willing to run benchmarks
> that measure the scalability of reading from one bigger file, from
> multiple CPUs?

Agreed. Also, with DaveM's patch applied, sizing the hash properly so
that it has a mean distribution of around 1 entry per bucket will
decrease the window for spinlock collisions as well, btw.

>
> with hash based locking, the locking overhead is *always* distributed.
>
> with radix trees the locking overhead is distributed only if multiple
> files are used. With one big file (or a few big files), the i_shared_lock
> will always bounce between CPUs wildly in read() workloads, degrading
> scalability just as much as it is degraded with the pagecache_lock now.
>
> Ingo


Andrea

2002-02-01 14:59:02

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Ingo" == Ingo Molnar <[email protected]> writes:

Ingo> files are used. With one big file (or a few big files), the i_shared_lock
Ingo> will always bounce between CPUs wildly in read() workloads, degrading

Will there be a difference between the bounces of an rwlock in the radix
tree variant and the cache misses in the hashed-locks variant in the case
of a concurrently accessed large file?

Regards,
-velco

2002-02-01 15:06:02

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On 1 Feb 2002, Momchil Velikov wrote:

> Ingo> files are used. With one big file (or a few big files), the i_shared_lock
> Ingo> will always bounce between CPUs wildly in read() workloads, degrading
>
> Will there be a difference between the bounces of an rwlock in the radix
> tree variant and the cache misses in the hashed-locks variant in the case
> of a concurrently accessed large file?

definitely, because in the case of page buckets there are many locks
hashed in a mapping-neutral way. Ie. different parts of the same file will
likely map to different spinlocks. In the radix tree case all pages in the
inode will map to the same spinlock.
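
In code terms the contrast is roughly the following - bucket_locks,
PAGE_LOCK_BITS and the hash are invented for illustration, only
mapping->i_shared_lock is real:

#include <linux/fs.h>
#include <linux/spinlock.h>

#define PAGE_LOCK_BITS	10
static spinlock_t bucket_locks[1 << PAGE_LOCK_BITS];

static spinlock_t *bucket_lock(struct address_space *mapping,
			       unsigned long index)
{
	/* lock depends on (mapping, index): different offsets of one
	 * file usually hit different locks - contention is spread out,
	 * at the price of a likely cache miss per access */
	unsigned long h = ((unsigned long)mapping >> 4) + index;
	return &bucket_locks[h & ((1 << PAGE_LOCK_BITS) - 1)];
}

static spinlock_t *radix_lock(struct address_space *mapping)
{
	/* one lock per mapping: CPU-local for one-file-per-CPU loads,
	 * but it bounces when many CPUs hit the same big file */
	return &mapping->i_shared_lock;
}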

Ingo

2002-02-01 15:25:35

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "Ingo" == Ingo Molnar <[email protected]> writes:

Ingo> On 1 Feb 2002, Momchil Velikov wrote:

Ingo> files are used. With one big file (or a few big files), the i_shared_lock
Ingo> will always bounce between CPUs wildly in read() workloads, degrading
>>
>> Will there be a difference between the bounces of an rwlock in the radix
>> tree variant and the cache misses in the hashed-locks variant in the case
>> of a concurrently accessed large file?

Ingo> definitely, because in the case of page buckets there are many locks
Ingo> hashed in a mapping-neutral way. Ie. different parts of the same file will
Ingo> likely map to different spinlocks.

That's why it's likely to miss on each access.

Ingo> In the radix tree case all pages in the inode will map to the
Ingo> same spinlock.

That's why it's likely to bounce on each access.

So, is there any difference ? :)

Regards,
-velco

2002-02-01 17:08:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Fri, 1 Feb 2002, Ingo Molnar wrote:
>
> it's not just databases. It's webservers too, serving content via
> sendfile() from a single, bigger file. Think streaming media servers,
> where the 'movie of the night' sits in a single big binary glob.

Hey guys, be _realistic_.

Don't bother with "oh, this could be bad", when there are absolutely _no_
numbers showing any such badness.

Even databases often use multiple files, and quite frankly, a database
that doesn't use mmap and doesn't try very hard to not cause extra system
calls is going to be bad performance-wise _regardless_ of any page cache
locking.

Radix-trees are cleaner than the alternatives, and all the numbers anybody
has ever shown show them to be at least equal in performance.

Stop making up things that just are NOT problems.

In web-servers, 99% of the content is small files, and if the file is
cached the expensive parts are all elsewhere. Don't make up "worst case
scenarios" that simply do not exist.

Linus

2002-02-01 18:30:18

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> Even databases often use multiple files, and quite frankly, a database
> that doesn't use mmap and doesn't try very hard to not cause extra system
> calls is going to be bad performance-wise _regardless_ of any page cache
> locking.

I've always thought that read(2) and write(2) would in the end wind up
faster than mmap(2)... Tests in my rewritten cp/rm/mv type utilities
seem to bear this out.

Is mmap(2) only preferred for large files/databases?

Jeff



2002-02-01 18:46:01

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

In article <[email protected]> you wrote:
> On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
>> Even databases often use multiple files, and quite frankly, a database
>> that doesn't use mmap and doesn't try very hard to not cause extra system
>> calls is going to be bad performance-wise _regardless_ of any page cache
>> locking.

> I've always thought that read(2) and write(2) would in the end wind up
> faster than mmap(2)... Tests in my rewritten cp/rm/mv type utilities
> seem to bear this out.

the biggest reason for this is that we *suck* at readahead for mmap....

2002-02-01 18:59:31

by Anton Blanchard

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


> A correction, "-splay" is the very first variant I posted, which used
> splay trees for the page cache.

Oops now I remember why I called it -splay :) I have a radix comparison
somewhere, I'll fish it out.

Anton

2002-02-01 19:49:41

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, Feb 01, 2002 at 06:44:50PM +0000, [email protected] wrote:
> In article <[email protected]> you wrote:
> > On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> >> Even databases often use multiple files, and quite frankly, a database
> >> that doesn't use mmap and doesn't try very hard to not cause extra system
> >> calls is going to be bad performance-wise _regardless_ of any page cache
> >> locking.
>
> > I've always thought that read(2) and write(2) would in the end wind up
> > faster than mmap(2)... Tests in my rewritten cp/rm/mv type utilities
> > seem to bear this out.
>
> the biggest reason for this is that we *suck* at readahead for mmap....

Is there not also fault overhead and similar issues related to mmap(2)
in general, that are not present with read(2)/write(2)?

Jeff



2002-02-01 21:52:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Fri, 1 Feb 2002, Linus Torvalds wrote:

> In web-servers, 99% of the content is small files, and if the file is
> cached the expensive parts are all elsewhere. Don't make up "worst
> case schenarios" that simply do no exist.

in fact the locking structure of radix trees has a locking advantage in
the 'multiple small files' case: if one CPU does a sendfile() on one file,
then the lock will likely be CPU-local for the duration of the sendfile(),
while page buckets will access a new spinlock for every page accessed.

Ingo

2002-02-01 21:48:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On 1 Feb 2002, Momchil Velikov wrote:

> Ingo> definitely, because in the case of page buckets there are many locks
> Ingo> hashed in a mapping-neutral way. Ie. different parts of the same file will
> Ingo> likely map to different spinlocks.
>
> That's why it's likely to miss on each access.

yes, you are right.

> Ingo> In the radix tree case all pages in the inode will map to the
> Ingo> same spinlock.
>
> That's why it's likely to bounce on each access.
>
> So, is there any difference ? :)

no difference. I tried to create a testcase that shows the difference
(multiple processes read()ing a single big file on an 8-way box), but
performance was equivalent. So given the clear advantages of radix trees
in other areas, they win hands down. :)

Ingo

2002-02-02 15:40:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, 1 Feb 2002, Jeff Garzik wrote:

> > the biggest reason for this is that we *suck* at readahead for mmap....
>
> Is there not also fault overhead and similar issues related to mmap(2)
> in general, that are not present with read(2)/write(2)?

If a fault is more expensive than a system call, we're doing
something wrong in the page fault path ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-02 18:57:52

by Richard Henderson

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Fri, Feb 01, 2002 at 09:04:45AM -0200, Rik van Riel wrote:
> As for not putting kernel objects everywhere, this comes
> naturally with HIGHMEM ;)

Not for 64-bit targets.


r~

2002-02-02 19:20:13

by Randy Hron

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


Benchmark results on 2.4.17 and 2.4.17 with radix-tree
(ratpagecache) patch. Identical results removed.

dbench 192 was the same. dbench 64 is virtually equal,
considering dbench's normal fluctuation.

dbench 64 processes
2.4.17 ************************* 12.5 MB/sec
2.4.17rat ************************ 12.1 MB/sec


Unixbench-4.1.0
2.4.17 2.4.17rat
Pipe Throughput 387881.3 379702.9
Pipe-based Context Switching 105911.3 91653.3
Process Creation 1180.3 1197.4
System Call Overhead 304463.9 335158.8
Shell Scripts (1 concurrent) 617.4 615.6
C Compiler Throughput 225.7 227.7
Dc: sqrt(2) to 99 decimal places 13988.7 14360.6


LMbench 2.0p2

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
OS null open selct sig sig fork exec sh
call stat clos TCP inst hndl proc proc proc
--------------- ---- ---- ---- ----- ---- ---- ----- ----- -----
Linux 2.4.17 0.43 3.41 6.16 36.8 1.43 3.26 932 3717 13612
Linux 2.4.17rat 0.43 4.56 7.80 55.4 1.45 3.17 901 3580 13239


Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
---------------- ----- ------ ------ ------ ------ ------ ------
Linux 2.4.17 1.20 24.07 188.7 56.5 209.0 59.2 223.3
Linux 2.4.17rat 1.21 22.93 187.9 58.0 208.8 61.8 226.4


*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
OS 2p/0K Pipe AF TCP TCP
ctxsw UNIX conn
--------------- ----- ----- ----- ----- -----
Linux 2.4.17 2.04 10.17 21.21 65.82 288.2
Linux 2.4.17rat 2.04 10.09 20.46 60.76 289.4


File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------------- ------ ------ ------ ------ ------- ----- -----
Linux 2.4.17 123.9 165.6 677.1 242.8 2602 1.076 8.0
Linux 2.4.17rat 128.9 169.7 618.8 256.2 2762 1.051 7.7

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------------- ----- ----- ----- ------ ------ ------ ------ ----- -----
Linux 2.4.17 61.5 49.2 60.5 61.7 237.4 59.2 60.3 237.3 85.0
Linux 2.4.17rat 64.4 42.8 60.3 61.4 237.5 59.1 60.2 237.4 84.7


Memory latencies in nanoseconds - smaller is better
---------------------------------------------------
OS Mhz L1 $ L2 $ Main mem
----------------- ---- ----- ------ --------
Linux 2.4.17 501 4.2 188.1 262.2
Linux 2.4.17rat 501 4.2 195.4 262.0


Threaded I/O Bench
Read, Write, and Seeks are MB/sec
CPU Efficiency (CPU Eff) = (MB/sec) / CPU% (bigger is better)

Num Seq Read CPU Rand Read CPU
Thr Rate (CPU%) Eff Rate (CPU%) Eff
--- ------------------- -----------------
2.4.17 1 13.23 43.2% 30.62 2.74 3.7% 73.88
2.4.17rat 1 12.60 43.0% 29.30 2.75 3.8% 72.45

2.4.17 2 11.56 29.9% 38.66 3.04 3.8% 79.89
2.4.17rat 2 11.07 30.4% 36.41 3.06 4.3% 71.81

2.4.17 4 11.03 25.9% 42.59 3.19 4.1% 78.26
2.4.17rat 4 10.62 26.5% 40.08 3.15 4.2% 74.32

2.4.17 8 10.62 22.8% 46.58 3.29 4.2% 77.99
2.4.17rat 8 10.23 23.4% 43.72 3.26 4.5% 73.05

Num Seq Write CPU Rand Write CPU
Thr Rate (CPU%) Eff Rate (CPU%) Eff
--- ------------------- -----------------
2.4.17 1 11.08 50.5% 21.94 0.69 1.6% 44.10
2.4.17rat 1 7.77 32.8% 23.69 0.53 1.1% 48.44

2.4.17 2 10.83 48.6% 22.28 0.69 1.5% 45.13
2.4.17rat 2 7.51 32.1% 23.38 0.52 1.1% 45.98

2.4.17 4 10.40 45.9% 22.66 0.68 1.5% 44.70
2.4.17rat 4 7.62 32.8% 23.25 0.53 1.2% 45.21

2.4.17 8 10.17 44.8% 22.70 0.67 1.5% 44.73
2.4.17rat 8 7.55 32.5% 23.24 0.53 1.2% 44.66


bonnie++
Version 1.02a ---------------------Sequential Output--------------------
-----Per Char----- ------Block------- -----Rewrite------
Kernel Size MB/sec %CPU Eff MB/sec %CPU Eff MB/sec %CPU Eff
2.4.17 1024 3.41 98.0 3.48 14.56 65.6 22.19 8.83 51.0 17.32
2.4.17rat 1024 3.36 98.0 3.43 10.64 41.6 25.57 6.79 37.2 18.27

Version 1.02a -----------Sequential Input----------- ------Random-----
-----Per Char----- ------Block------- ------Seeks------
Kernel Size MB/sec %CPU Eff MB/sec %CPU Eff /sec %CPU Eff
2.4.17 1024 3.98 97.2 4.10 16.39 60.6 27.05 132 2.0 6609
2.4.17rat 1024 3.92 96.0 4.08 15.46 57.0 27.12 126 2.0 6302

-------Sequential Create-----------
------Create----- -----Delete-----
files /sec %CPU Eff /sec %CPU Eff
2.4.17 16384 4421 96.8 4567 4719 97.4 4845
2.4.17rat 16384 3973 97.6 4070 4531 95.0 4769

-------Random Create----------------
------Create----- -----Delete-----
files /sec %CPU Eff /sec %CPU Eff
2.4.17 16384 4475 98.0 4566 4124 94.0 4387
2.4.17rat 16384 4203 98.0 4288 4005 94.4 4242


Build times. Smaller is better - task completed faster.

perl_build kernel_build
2.4.17 1620 1347
2.4.17rat 1629 1444

radix-tree saves 768k on a 384M machine.

2.4.17rat Memory: 385932k/393216k available (885k kernel code, 6900k reserved
2.4.17 Memory: 385164k/393216k available (884k kernel code, 7668k reserved

Results like these on more kernels at:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html

--
Randy Hron

2002-02-02 21:16:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Sat, 2 Feb 2002, Richard Henderson wrote:
> On Fri, Feb 01, 2002 at 09:04:45AM -0200, Rik van Riel wrote:
> > As for not putting kernel objects everywhere, this comes
> > naturally with HIGHMEM ;)
>
> Not for 64-bit targets.

Agreed. We'll probably want to find something else to fix this
problem ...

(like, allocating kernel area as much contiguously as possible,
leaving space for large freeable areas elsewhere)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-03 14:31:45

by Chris Evans

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5


On Sat, 2 Feb 2002 [email protected] wrote:

> Num Seq Write CPU Rand Write CPU
> Thr Rate (CPU%) Eff Rate (CPU%) Eff
> --- ------------------- -----------------
> 2.4.17 1 11.08 50.5% 21.94 0.69 1.6% 44.10
> 2.4.17rat 1 7.77 32.8% 23.69 0.53 1.1% 48.44

This is a worrying trend your benching has exposed - all the streaming
write tests have taken a performance hit. Above is tiobench, and bonnie
showed the same trend too.

Chris


2002-02-03 23:34:19

by Momchil Velikov

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

>>>>> "rwhron" == rwhron <[email protected]> writes:

rwhron> bonnie++
rwhron> Version 1.02a ---------------------Sequential Output--------------------
rwhron> -----Per Char----- ------Block------- -----Rewrite------
rwhron> Kernel Size MB/sec %CPU Eff MB/sec %CPU Eff MB/sec %CPU Eff
rwhron> 2.4.17 1024 3.41 98.0 3.48 14.56 65.6 22.19 8.83 51.0 17.32
rwhron> 2.4.17rat 1024 3.36 98.0 3.43 10.64 41.6 25.57 6.79 37.2 18.27

rwhron> Version 1.02a -----------Sequential Input----------- ------Random-----
rwhron> -----Per Char----- ------Block------- ------Seeks------
rwhron> Kernel Size MB/sec %CPU Eff MB/sec %CPU Eff /sec %CPU Eff
rwhron> 2.4.17 1024 3.98 97.2 4.10 16.39 60.6 27.05 132 2.0 6609
rwhron> 2.4.17rat 1024 3.92 96.0 4.08 15.46 57.0 27.12 126 2.0 6302

rwhron> -------Sequential Create-----------
rwhron> ------Create----- -----Delete-----
rwhron> files /sec %CPU Eff /sec %CPU Eff
rwhron> 2.4.17 16384 4421 96.8 4567 4719 97.4 4845
rwhron> 2.4.17rat 16384 3973 97.6 4070 4531 95.0 4769

rwhron> -------Random Create----------------
rwhron> ------Create----- -----Delete-----
rwhron> files /sec %CPU Eff /sec %CPU Eff
rwhron> 2.4.17 16384 4475 98.0 4566 4124 94.0 4387
rwhron> 2.4.17rat 16384 4203 98.0 4288 4005 94.4 4242

Hmm, I've got different results with bonnie++, are you sure you didn't
swap the results :)

Linux 2.5.3-dj1
--------------
Version 1.02b ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
merlin 496M 22038 13 11768 8 19219 6 183.9 0
merlin 496M 22495 13 11216 6 22390 6 183.9 0
merlin 496M 22292 13 11713 7 19249 6 188.8 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
merlin 10 2015 99 +++++ +++ +++++ +++ 2526 99 +++++ +++ 8514 99
merlin 10 2354 99 +++++ +++ +++++ +++ 2656 99 +++++ +++ 10195 100
merlin 10 2871 99 +++++ +++ +++++ +++ 3036 99 +++++ +++ 12012 98

Linux 2.5.3-dj1-radix-pagecache
-------------------------------
Version 1.02b ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
merlin 496M 22397 13 11248 7 22595 7 178.6 0
merlin 496M 22088 12 11204 7 22591 7 188.1 0
merlin 496M 24767 14 10966 7 22474 7 191.1 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
merlin 10 2485 99 +++++ +++ +++++ +++ 2456 99 +++++ +++ 9614 100
merlin 10 2653 99 +++++ +++ +++++ +++ 2884 100 +++++ +++ 11101 100
merlin 10 1948 99 +++++ +++ +++++ +++ 2530 99 +++++ +++ 9294 99

2002-02-04 03:55:18

by Randy Hron

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Mon, Feb 04, 2002 at 01:33:19AM +0200, Momchil Velikov wrote:
> Hmm, I've got different results with bonnie++ - are you sure you
> didn't swap the results? :)

I don't think so, but I notice that the bonnie++ Sequential and Random
Create tests fluctuate a lot. These runs are on 16384 small files in my tests.

------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
2.4.17 16 4217 97 +++++ +++ 4664 99 4354 99 +++++ +++ 4090 95
16 4423 99 +++++ +++ 4266 90 4404 95 +++++ +++ 4382 100
16 4468 97 +++++ +++ 4899 99 4522 99 +++++ +++ 4235 95
16 4498 96 +++++ +++ 4777 100 4464 99 +++++ +++ 3854 90
16 4503 95 +++++ +++ 4990 99 4632 98 +++++ +++ 4058 90

2.4.17rat 16 2994 98 +++++ +++ 4548 94 2952 99 +++++ +++ 3967 92
16 3055 97 +++++ +++ 4705 93 4665 99 +++++ +++ 4119 94
16 4463 96 +++++ +++ 4670 100 4452 100 +++++ +++ 3775 94
16 4833 99 +++++ +++ 4823 100 4616 97 +++++ +++ 4054 92
16 4521 98 +++++ +++ 3907 88 4329 95 +++++ +++ 4111 100

The tests against a 1GB (large) file vary somewhat too, but not as much as
the tests above. The results I posted earlier were the averages of 5 runs.

I just uploaded the raw logfiles from my testing (not just 2.4.17 and radix-tree)
to http://home.earthlink.net/~rwhron/kernel (the tar.bz2 files are the logs).

--
Randy Hron

2002-02-05 09:20:00

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Jeff Garzik wrote:
>
> On Fri, Feb 01, 2002 at 09:06:37AM -0800, Linus Torvalds wrote:
> > Even databases often use multiple files, and quite frankly, a database
> > that doesn't use mmap and doesn't try very hard to not cause extra system
> > calls is going to be bad performance-wise _regardless_ of any page cache
> > locking.
>
> I've always thought that read(2) and write(2) would in the end wind up
> faster than mmap(2)... Tests in my rewritten cp/rm/mv type utilities
> seem to bear this out.
>
> Is mmap(2) only preferred for large files/databases?

I've tried to write a faster md5summing program and implemented
several ways of accessing the file - for very large files the fastest
way seemed to be O_DIRECT with threaded precaching.

For fast mmap access I had to implement two parallel mmap'd areas with
madvise(MADV_WILLNEED) - then it was almost as fast as read
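
For what it's worth, a hypothetical sketch of that double-window
scheme (WINDOW, md5_update() and checksum_file() are invented names,
not taken from Zdenek's actual program; error handling is trimmed):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define WINDOW	(4UL * 1024 * 1024)	/* arbitrary, page-size multiple */

extern void md5_update(const unsigned char *buf, size_t len); /* assumed */

int checksum_file(const char *path)
{
	struct stat st;
	char *cur, *next;
	size_t curlen, nextlen = 0;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
		return -1;

	/* map the first window */
	curlen = st.st_size < (off_t)WINDOW ? st.st_size : WINDOW;
	cur = mmap(NULL, curlen, PROT_READ, MAP_SHARED, fd, 0);

	while (cur && cur != MAP_FAILED) {
		off_t noff = off + curlen;

		/* map the following window and ask for prefetch now,
		 * so it is paged in while we checksum the current one */
		next = NULL;
		if (noff < st.st_size) {
			nextlen = st.st_size - noff < (off_t)WINDOW ?
					st.st_size - noff : WINDOW;
			next = mmap(NULL, nextlen, PROT_READ,
					MAP_SHARED, fd, noff);
			if (next != MAP_FAILED)
				madvise(next, nextlen, MADV_WILLNEED);
		}

		md5_update((unsigned char *)cur, curlen);
		munmap(cur, curlen);

		cur = next;
		curlen = nextlen;
		off = noff;
	}
	close(fd);
	return 0;
}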


--
.''`. Which fundamental human right do you want to give up today?
: :' : Debian GNU/Linux maintainer - http://www.debian.{org,cz}
`. `' Zdenek Kabelac kabi@{debian.org, users.sf.net, fi.muni.cz}
`- When in doubt, just blame the Euro. :)

2002-02-05 17:59:25

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Hi!

> > > the biggest reason for this is that we *suck* at readahead for mmap....
> >
> > Is there not also fault overhead and similar issues related to mmap(2)
> > in general, that are not present with read(2)/write(2)?
>
> If a fault is more expensive than a system call, we're doing
> something wrong in the page fault path ;)

You can read 128K at a time, but you can't fault 128K...
Pavel

--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2002-02-05 18:46:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Tue, 5 Feb 2002, Pavel Machek wrote:

> > > > the biggest reason for this is that we *suck* at readahead for mmap....
> > >
> > > Is there not also fault overhead and similar issues related to mmap(2)
> > > in general, that are not present with read(2)/write(2)?
> >
> > If a fault is more expensive than a system call, we're doing
> > something wrong in the page fault path ;)
>
> You can read 128K at a time, but you can't fault 128K...

Why not ?

If the pages are present (read-ahead) and the page table
is present, I see no reason why we couldn't fill in 32
page table entries at once.
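
For illustration, a rough sketch of that idea (fill_fault_cluster()
is an invented name, ptes is assumed to point at the cluster's first
entry, and locking/pte table lookup are elided - this is not the
actual fault path):

#define FAULT_CLUSTER	32	/* fill up to 32 ptes per fault */

/*
 * On a file fault, besides the faulting page, also map any
 * neighbouring pages that are already resident in the page cache.
 */
static void fill_fault_cluster(struct vm_area_struct *vma,
			       unsigned long address, pte_t *ptes)
{
	struct address_space *mapping =
		vma->vm_file->f_dentry->d_inode->i_mapping;
	unsigned long start = address & ~(FAULT_CLUSTER * PAGE_SIZE - 1);
	int i;

	for (i = 0; i < FAULT_CLUSTER; i++) {
		unsigned long addr = start + i * PAGE_SIZE;
		unsigned long idx;
		struct page *page;

		if (addr < vma->vm_start || addr >= vma->vm_end)
			continue;
		if (!pte_none(ptes[i]))		/* already mapped */
			continue;
		idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		page = find_get_page(mapping, idx);
		if (!page)			/* not resident: leave it
						   to readahead */
			continue;
		set_pte(ptes + i, mk_pte(page, vma->vm_page_prot));
	}
}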

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-05 20:32:19

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> On Tue, 5 Feb 2002, Pavel Machek wrote:
>
> > > > > the biggest reason for this is that we *suck* at readahead for mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
>
> Why not ?
>
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.
>
> Rik

Well, filling 32 page table entries at once is certainly a big readahead
for the common case.

Maybe this high number could be the result of a madvise(...,
MADV_SEQUENTIAL) or madvise(..., MADV_WILLNEED) hint.
Solaris does exactly this kind of trick.

Eric


2002-02-06 02:00:14

by Randy Hron

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> Hmm, I've got different results with bonnie++ - are you sure you
> didn't swap the results? :)

I am curious whether the small follow-up patch to 2.5.3-dj1 makes
radix-tree behave more like 2.4.17-ratpagecache. (Overall, it seems
that radix-tree did better in I/O without the small follow-up patch.)

Benchmarks on 2.5.3-dj1, 2.5.3-dj1 with radix-tree (rat),
and with the small follow-up patch (rat2).

dbench 64 processes
2.5.3-dj1rat ************************** 13.1 MB/sec
2.5.3-dj1 ********************** 11.1 MB/sec
2.5.3-dj1rat2 ********************** 11.1 MB/sec

dbench 192 processes
2.5.3-dj1rat ************* 6.8 MB/sec
2.5.3-dj1rat2 ************* 6.7 MB/sec
2.5.3-dj1 ************* 6.6 MB/sec


Unixbench-4.1.0
2.5.3-dj1 2.5.3-dj1rat 2.5.3-dj1rat2
Execl Throughput 316.7 298.5 324.8 lps
Pipe Throughput 307626.1 308399.0 284619.1 lps
Pipe-based Context Switching 95502.3 93398.1 76470.1 lps
Process Creation 1320.7 1280.5 1398.3 lps
System Call Overhead 245040.7 239410.7 251300.1 lps
Shell Scripts (1 concurrent) 643.2 639.0 636.1 lpm
Shell Scripts (8 concurrent) 89.3 88.0 86.9 lpm
Shell Scripts (16 concurrent) 44.7 44.0 44.0 lpm
Shell Scripts (32 concurrent) 22.7 22.0 22.0 lpm
Shell Scripts (64 concurrent) 10.9 10.7 10.7 lpm
Shell Scripts (96 concurrent) 7.1 7.1 6.9 lpm
Shell Scripts (128 concurrent) 5.3 5.1 5.1 lpm
C Compiler Throughput 220.6 225.2 217.1 lpm
Dc: sqrt(2) to 99 decimal places 15838.5 15465.6 15854.2 lpm
Recursion Test--Tower of Hanoi 14157.1 14156.8 14156.2 lps


LMbench-2.0p2 average of 9 samples.
Processor, Processes - times in microseconds - smaller is better
OS null open selct sig sig fork exec sh
call stat clos TCP inst hndl proc proc proc
-------------------- ---- ---- ---- ----- ---- ---- ---- ---- -----
Linux 2.5.3-dj1 0.60 3.56 5.42 39.0 1.49 3.10 777 3250 12082
Linux 2.5.3-dj1rat 0.60 4.65 6.61 37.5 1.44 3.03 809 3294 12580
Linux 2.5.3-dj1rat2 0.59 4.52 7.82 59.5 1.42 3.23 788 3431 12943


Context switching - times in microseconds - smaller is better
OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
------------------- ----- ------ ------ ------ ------ ------ ------
Linux 2.5.3-dj1 1.04 18.98 190.4 55.5 207.3 59.0 224.3
Linux 2.5.3-dj1rat 0.43 19.57 183.7 56.6 204.0 58.1 225.0
Linux 2.5.3-dj1rat2 1.02 20.39 184.9 58.0 212.3 59.5 227.1


*Local* Communication latencies in microseconds - smaller is better
OS Pipe AF TCP TCP
UNIX conn
------------------- ----- ----- ----- -----
Linux 2.5.3-dj1 13.57 27.21 76.68 303.6
Linux 2.5.3-dj1rat 10.46 20.51 72.37 292.6
Linux 2.5.3-dj1rat2 11.41 20.04 75.18 304.6


File & VM system latencies in microseconds - smaller is better
OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
------------------- ------ ------ ------ ------ ------- ----- -----
Linux 2.5.3-dj1 132.3 195.5 709.7 285.9 2821 1.134 6.1
Linux 2.5.3-dj1rat 145.5 198.2 791.6 291.7 2699 1.495 5.6
Linux 2.5.3-dj1rat2 140.1 195.0 723.2 287.4 2792 1.588 5.6


*Local* Communication bandwidths in MB/s - bigger is better
OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
------------------- ---- ---- ---- ------ ------ ------ ------ ----- -----
Linux 2.5.3-dj1 65.7 44.3 42.7 60.3 237.5 59.2 60.4 237.4 85.1
Linux 2.5.3-dj1rat 64.2 45.5 38.4 59.8 237.7 59.5 60.6 237.6 85.7
Linux 2.5.3-dj1rat2 67.0 46.0 45.3 60.1 237.6 59.4 60.5 237.4 85.4


Memory latencies in nanoseconds - smaller is better
OS Mhz L1 $ L2 $ Main mem
------------------- ---- ----- ------ --------
Linux 2.5.3-dj1 501 4.2 195.7 262.3
Linux 2.5.3-dj1rat 501 4.2 191.1 261.9
Linux 2.5.3-dj1rat2 501 4.2 189.9 262.1


Threaded I/O Bench
File Size is 384MB, Read, Write, and Seeks are MB/sec
CPU Efficiency (CPU Eff) = (MB/sec) / CPU% (bigger is better)
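(Worked example from the first read row below, taking CPU% as a
fraction: 12.92 MB/sec at 42.7% CPU gives 12.92 / 0.427 = 30.26.)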

Reads Num Seq Read CPU Rand Read CPU
Thr Rate (CPU%) Eff Rate (CPU%) Eff
--- ------------------- -----------------
2.5.3-dj1 1 12.92 42.7% 30.26 3.21 4.7% 68.63
2.5.3-dj1rat 1 12.99 44.8% 29.00 2.80 3.9% 71.17
2.5.3-dj1rat2 1 12.75 43.2% 29.51 2.75 4.6% 59.29

2.5.3-dj1 2 11.63 32.1% 36.23 3.34 4.4% 75.57
2.5.3-dj1rat 2 11.70 33.3% 35.14 3.00 4.5% 65.93
2.5.3-dj1rat2 2 11.42 32.3% 35.36 2.92 4.9% 59.75

2.5.3-dj1 4 11.26 28.8% 39.10 3.42 4.8% 71.70
2.5.3-dj1rat 4 11.22 30.0% 37.40 3.06 4.7% 65.18
2.5.3-dj1rat2 4 11.14 29.5% 37.76 3.00 4.9% 61.69

2.5.3-dj1 8 11.25 26.8% 41.98 3.61 5.1% 71.02
2.5.3-dj1rat 8 10.96 26.9% 40.74 3.19 4.6% 69.09
2.5.3-dj1rat2 8 10.94 26.9% 40.67 3.09 4.9% 62.91

Writes Num Seq Write CPU Rand Write CPU
Thr Rate (CPU%) Eff Rate (CPU%) Eff
--- ------------------- -----------------
2.5.3-dj1 1 11.38 55.9% 20.36 0.76 2.0% 37.21
2.5.3-dj1rat 1 10.77 54.3% 19.83 0.70 2.0% 35.30
2.5.3-dj1rat2 1 9.01 42.5% 21.21 0.57 1.5% 37.75

2.5.3-dj1 2 11.35 56.4% 20.12 0.77 2.0% 37.32
2.5.3-dj1rat 2 10.54 53.2% 19.81 0.70 2.0% 35.69
2.5.3-dj1rat2 2 7.92 37.0% 21.40 0.52 1.4% 38.15

2.5.3-dj1 4 11.39 57.3% 19.88 0.78 2.1% 36.65
2.5.3-dj1rat 4 10.54 53.2% 19.81 0.71 2.0% 35.91
2.5.3-dj1rat2 4 7.71 36.1% 21.37 0.51 1.4% 37.56

2.5.3-dj1 8 11.28 56.6% 19.93 0.79 2.2% 36.22
2.5.3-dj1rat 8 10.77 54.8% 19.65 0.73 2.0% 36.24
2.5.3-dj1rat2 8 7.93 37.3% 21.27 0.52 1.4% 37.16


bonnie++-1.02a summary
1024 MB file ---------------------Sequential Output--------------------
-----Per Char----- ------Block------- -----Rewrite------
Kernel MB/sec %CPU Eff MB/sec %CPU Eff MB/sec %CPU Eff
2.5.3-dj1 3.38 98.0 3.44 14.20 67.4 21.06 8.52 51.0 16.71
2.5.3-dj1rat 3.35 98.0 3.42 14.01 62.0 22.60 8.19 46.4 17.66
2.5.3-dj1rat2 3.38 98.0 3.44 13.06 56.2 23.24 7.66 42.6 17.98


-----------Sequential Input----------- ------Random-----
-----Per Char----- ------Block------- ------Seeks------
Kernel MB/sec %CPU Eff MB/sec %CPU Eff /sec %CPU Eff
2.5.3-dj1 3.96 97.4 4.07 15.26 57.2 26.68 129 2.0 6460
2.5.3-dj1rat 3.94 97.0 4.06 15.21 57.4 26.49 127 1.8 7051
2.5.3-dj1rat2 3.93 97.0 4.05 15.13 57.4 26.36 128 2.0 6415

bonnie++ small files (16384 files)
--------Sequential Create------------
------Create----- -----Delete-----
/sec %CPU Eff /sec %CPU Eff
2.5.3-dj1 4369 97.4 4485 4176 96.8 4313
2.5.3-dj1rat 4276 98.0 4364 4100 96.0 4271
2.5.3-dj1rat2 4293 97.0 4425 3880 94.4 4110

------------Random Create------------
------Create----- -----Delete-----
/sec %CPU Eff /sec %CPU Eff
2.5.3-dj1 4334 98.2 4413 3782 96.4 3923
2.5.3-dj1rat 4238 97.6 4341 3696 95.0 3890
2.5.3-dj1rat2 4215 97.2 4336 3667 95.0 3860


Time to build/test/run in seconds
version autoconf perl kernel updatedb updatedb*5
2.5.3-dj1 1152 1636 1320 29 38
2.5.3-dj1rat 1167 1684 1336 29 39
2.5.3-dj1rat2 1162 1674 1356 29 37

Memory Info - radix-tree saved 768k on a 384MB machine.
version available totalmem kern_code reserved
2.5.3-dj1rat 385860k 393216k 931k 6972k
2.5.3-dj1rat2 385860k 393216k 931k 6972k
2.5.3-dj1 385092k 393216k 930k 7740k

Details on system, testing method, and other results at:
http://home.earthlink.net/~rwhron/kernel/k6-2-475.html
--
Randy Hron

2002-02-06 09:02:53

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On February 5, 2002 07:45 pm, Rik van Riel wrote:
> On Tue, 5 Feb 2002, Pavel Machek wrote:
> > > > > the biggest reason for this is that we *suck* at readahead for
> > > > > mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
>
> Why not ?
>
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.

Yes, essentially what you want is to schedule a generic_file_readahead, which
we'd need to cook up a mechanism for doing. The other part - much harder -
is deciding when to readahead, and how much.

I'd amend your original statement to just 'we *suck* at readahead'.

--
Daniel

2002-02-06 12:11:36

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

Hi!

> > > > > the biggest reason for this is that we *suck* at readahead for mmap....
> > > >
> > > > Is there not also fault overhead and similar issues related to mmap(2)
> > > > in general, that are not present with read(2)/write(2)?
> > >
> > > If a fault is more expensive than a system call, we're doing
> > > something wrong in the page fault path ;)
> >
> > You can read 128K at a time, but you can't fault 128K...
>
> Why not ?
>
> If the pages are present (read-ahead) and the page table
> is present, I see no reason why we couldn't fill in 32
> page table entries at once.

Ugh.

Okay, the CPU will still have to fill its TLBs (it does not have to
in the read case), but that is a much easier operation.

I did not think about this possibility, sorry.
Pavel
--
(about SSSCA) "I don't say this lightly. However, I really think that the U.S.
no longer is classifiable as a democracy, but rather as a plutocracy." --hpa

2002-02-06 11:45:24

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Tue, 5 Feb 2002 [email protected] wrote:

> I am curious if the small followup patch to 2.5.3-dj1 makes radix-tree
> more like 2.4.17-ratpagecache. (overall, it seems that radix-tree did
> better in I/O without the small followup patch).

It would be useful if you also ran dbench tests with a much
lower number of dbench processes.

Once you get over 'dbench 16' or so, the whole thing basically
becomes an exercise in how well the system can trigger task
starvation in get_request_wait.

I can recommend 'dbench 1' 'dbench 4' 'dbench 16' ;)

> dbench 64 processes
> 2.5.3-dj1rat ************************** 13.1 MB/sec
> 2.5.3-dj1 ********************** 11.1 MB/sec
> 2.5.3-dj1rat2 ********************** 11.1 MB/sec
>
> dbench 192 processes
> 2.5.3-dj1rat ************* 6.8 MB/sec
> 2.5.3-dj1rat2 ************* 6.7 MB/sec
> 2.5.3-dj1 ************* 6.6 MB/sec



Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-06 21:30:54

by Randy Hron

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> Once you get over 'dbench 16' or so, the whole thing basically
> becomes an exercise in how well the system can trigger task
> starvation in get_request_wait.

It's neat you've identified that bottleneck.

dbench 192 also appears to trigger more swapin/swapout than usual
with rmap-based kernels about 12-15 minutes into the test, and it
remains unusually high for the duration of the run (not a huge amount
of swapping, but vmstat 60 shows double-digit numbers rather than the
more typical 0 with occasional single-digit "spikes"). dbench 64
doesn't trigger this behavior on my test box.

I want diversity in the workloads. bonnie++ does a single
thread, and tiobench is doing 1, 2, 4, and 8 threads. dbench
fits in well at the other end of the spectrum.

The newest bench added to the lineup is OSDB on postgresql.
If everything executes properly, 2.5.3-dj3 (which includes
radix-tree) will win the first timer award for OSDB. :)

--
Randy Hron

2002-02-06 21:38:24

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Wed, 6 Feb 2002 [email protected] wrote:
> On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> > Once you get over 'dbench 16' or so, the whole thing basically
> > becomes an exercise in how well the system can trigger task
> > starvation in get_request_wait.
>
> It's neat you've identified that bottleneck.

Umm, there's one thing you need to remember about these
high dbench loads though.

They run fastest when you run each of the dbench forks
sequentially and have the others stuck in get_request_wait.

This, of course, is completely unacceptable for real-world
server scenarios, where all users of the server need to be
serviced fairly.

regards,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-06 22:02:25

by Randy Hron

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

> They run fastest when you run each of the dbench forks
> sequentially and have the others stuck in get_request_wait.

One interesting part of tiotest is the latency measurements.
Latency isn't printed by tiobench.pl though. I think it's
valuable information (and wish I had it).

> This, of course, is completely unacceptable for real-world
> server scenarios, where all users of the server need to be
> serviced fairly.

Agreed. I'm glad kernel hackers focus on latency too. :)

There are _some_ applications where throughput is critical
though. I would prefer to measure both throughput and
latency at the same time, but am not yet clear on how to
deal with the Heisenberg principle.

--
Randy Hron

2002-02-07 12:58:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On February 6, 2002 10:37 pm, Rik van Riel wrote:
> On Wed, 6 Feb 2002 [email protected] wrote:
> > On Wed, Feb 06, 2002 at 09:44:33AM -0200, Rik van Riel wrote:
> > > Once you get over 'dbench 16' or so, the whole thing basically
> > > becomes an exercise in how well the system can trigger task
> > > starvation in get_request_wait.
> >
> > It's neat you've identified that bottleneck.
>
> Umm, there's one thing you need to remember about these
> high dbench loads though.
>
> They run fastest when you run each of the dbench forks
> sequentially and have the others stuck in get_request_wait.
>
> This, of course, is completely unacceptable for real-world
> server scenarios, where all users of the server need to be
> serviced fairly.

Right, and as we just discussed on IRC, it's a useful effect - but only
if we control it, so that stopped or slowed processes do eventually get
forcibly elevated in IO priority and can make progress after being sat
upon by more aggressive/successful processes for long enough. And we can
control this; it's just going to take a few months to get basic issues
of IO queues, RSS accounting, etc. out of the way so we can address it.

The trouble I have with paying a lot of attention to dbench results
right now is that we're measuring effects of kernel behaviour that are,
at this point, uncontrolled and effectively random. IOW, we're not
measuring the effects we're interested in just now. If we need to know
IO throughput, we need benches that test exactly that, and not other
randomly interacting effects. As they said on Laugh-In many moons ago:
'very interesting, but useless'.

--
Daniel

2002-02-16 16:25:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Radix-tree pagecache for 2.5

On Thu, Jan 31, 2002 at 04:01:59PM -0800, David S. Miller wrote:
> --- vanilla/linux/arch/i386/mm/init.c Sun Nov 18 19:59:22 2001
> +++ linux/arch/i386/mm/init.c Mon Nov 12 00:14:00 2001
> @@ -466,6 +466,8 @@
> #endif
> high_memory = (void *) __va(max_low_pfn * PAGE_SIZE);
>
> + page_cache_init(count_free_bootmem());
> +
> /* clear the zero-page */

I wanted to merge it, but pagecache will be in highmem as well, and the
above only takes into account the normal classzone (so it can be off by
63G).

Also I'd prefer page_cache_init() to be called from within
free_all_bootmem/free_all_bootmem_node, so we don't need to change
all archs. To take highmem into account we can use the globals
highstart_pfn/highend_pfn within an #ifdef CONFIG_HIGHMEM.
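
A hypothetical sketch of that accounting (total_cache_pages() is an
invented name, not the actual patch):

/*
 * Count all page frames, not just lowmem: on i386 with
 * CONFIG_HIGHMEM the highstart_pfn/highend_pfn globals give the
 * highmem range, so a 64GB box isn't sized from lowmem alone.
 */
static unsigned long total_cache_pages(unsigned long free_lowmem_pages)
{
	unsigned long pages = free_lowmem_pages;

#ifdef CONFIG_HIGHMEM
	pages += highend_pfn - highstart_pfn;
#endif
	return pages;
}

free_all_bootmem() could then call
page_cache_init(total_cache_pages(...)) itself, so no per-arch
init.c changes would be needed.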

Are you interested in fixing it? Otherwise let me know. Thanks!

Andrea