For the filesystem mavens among us, here is a wealth of new information on my
htree directory index for Ext2, including the paper I presented last month at
ALS, and a series of benchmark results (therein lies a story, I'll get to
that[1]).
The paper is here, in browsable html and postscript:
http://people.nl.linux.org/~phillips/htree/paper/htree.html
http://people.nl.linux.org/~phillips/htree/htree.ps.gz
(A Directory Index for Ext2)
Now the benchmarks. The tests all consist of creating and deleting massive
numbers of empty files, with names lying in a given numerical range.
Sometimes the names are created sequentially, sometimes randomly. I've
included the C program I wrote to create files with 'random' names in this post.
Why should it matter which order files are *created* in? Well, it shouldn't,
and it doesn't for the htree index, but read to the end of this post to find
out why it's interesting. On the other hand, the order in which files are
deleted matters a great deal, to some filesystems. Not htree-indexed Ext2
though, which I intentionally designed so that best and worst cases are
almost the same. (I'll provide additional benchmark results relating to
random deletion in a subsequent post.)
Here come the charts...
The first is basically the same as the one I presented for my first prototype
of the htree index, back in February of this year:
http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.create.jpg
(Indexed ext2 compared to classic ext2 in the range 10,000 to 100,000 files)
The vertical axis is running time, the horizontal axis is number of files.
The only difference between this and February's chart is that this one runs
out to 100,000 files, which I was unable to do with the original prototype
because I'd implemented only a single tree level at the time - the index
became full at about 93,000 files. Oh, and the rendering quality is a lot
better this time, thank you very much staroffice and sct (more on that later).
This chart shows that ext2 with the htree index creates large numbers of
files dramatically faster than classic ext2: classic ext2's total running
time climbs quadratically with the number of files while the indexed version
stays essentially linear, so the difference rapidly becomes so large that the
htree bar charts are barely visible.
The next chart shows the same thing with the classic ext2 results clipped to
the top of the chart so that we can see the linear performance of the htree
index:
http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.create.clipped.jpg
File deletions show a similar result:
http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x10000.delete.jpg
Mind you, this is without Ted's recent hack to cache the Ext2 directory index
position. Unfortunately my patch steps on that, so an update is needed. With
Ted's hack, ext2 file deletion performance should be much closer to linear,
until we start getting into random order deletion. Hopefully I will find the
time to run another benchmark, so we can see what the position-caching hack
does for classic ext2. We are, after all, going to keep running into
unindexed directories for some time.
This next chart is, I think, the most interesting of the bunch:
http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.10x100.create.jpg
This shows that the htree index is *always* as fast as or faster than classic
ext2, even for small directories consisting of a few blocks. This is because
the htree patch actually branches to the old code for any directory
consisting of a single block, but as soon as we go to two blocks, the index
is faster.
This gives us the answer to a question that was raised some months ago:
obviously indexing a directory consisting of a single block will only slow
things down, but where is the cutoff? Several of us speculated about that (I
can't find the url at the moment) but ultimately the prediction was that
htree would squeak ahead of the linear search even at two blocks. This is
now confirmed.
With such small numbers of files, each test had to be run 10 times to get
stable measurements.
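Just to spell out what 'branches to the old code' means above, the decision is
nothing more than a size check, along these lines (a toy illustration of the
idea only, not the actual patch; the names and the 4K blocksize are made up):

#include <stdio.h>

#define BLOCKSIZE 4096  /* illustrative blocksize, not read from the fs */

/* Toy model of the cutoff: a directory that still fits in one block is
 * searched with the old linear scan, anything bigger goes through the
 * index. */
static const char *lookup_method(long dir_bytes)
{
        return dir_bytes <= BLOCKSIZE ? "linear scan (classic ext2 path)"
                                      : "htree index";
}

int main(void)
{
        long sizes[] = { 1024, 4096, 8192, 1L << 20 };
        int i;

        for (i = 0; i < 4; i++)
                printf("%8ld byte directory: %s\n", sizes[i],
                       lookup_method(sizes[i]));
        return 0;
}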
Ok, now for the interesting part. With the htree patch, ext2 can handle
millions of files per directory; so can reiserfs. Which is faster? The
answer isn't completely straightforward:
http://people.nl.linux.org/~phillips/htree/indexed.vs.reiser.8x250000.create.jpg
Here, we're testing directories with up to two million files. Ext2+htree
wins for the first million files, then suffers some kind of bump that puts
reiserfs ahead for the next million. So for *really* huge directories,
advantage reiserfs, or so it seems.
Curiously, Reiserfs actually depends on the spelling of the filename for a
lot of its good performance. Creating files with names that don't follow a
lexicographically contiguous sequence produces very different results:
http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.vs.reiser.10x10000.create.random.jpg
So it seems that for realistic cases, ext2+htree outperforms reiserfs quite
dramatically. (Are you reading, Hans? Fighting words... ;-)
Mind you, I did not try tweaking reiserfs at all; it does have a selection of
hash functions to choose from. It's likely that the default - R5 - is not
very well suited to random names, i.e., names of the kind we'd expect to see
in the wild. If nothing else, this shows the pitfalls of writing hash functions
that make assumptions about the information present in the input string.
I'm sure we'll hear more on this subject.
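For reference, R5 (roughly as I recall it from the reiserfs sources, so treat
this as a sketch rather than the authoritative code) looks like this, and a
quick test shows why it tracks the spelling so closely:

#include <stdio.h>

/* R5, roughly from memory (sketch only): each character nudges the
 * accumulator a little, so lexicographically adjacent names end up with
 * nearby hash values. */
static unsigned r5(const char *msg)
{
        unsigned a = 0;

        while (*msg) {
                a += *msg << 4;
                a += *msg >> 4;
                a *= 11;
                msg++;
        }
        return a;
}

int main(void)
{
        /* adjacent spellings give nearby hashes */
        const char *names[] = { "foo100", "foo101", "foo102", "foo103" };
        int i;

        for (i = 0; i < 4; i++)
                printf("%s -> %u\n", names[i], r5(names[i]));
        return 0;
}

Adjacent spellings land on nearby hash values, which is what makes it
order-preserving, and also what makes the distribution suffer when the names
don't follow that kind of sequence.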
Finally, it seems that ext2+htree does its work using quite a bit less cpu
than reiserfs or ext2 (logarithmic graph):
http://people.nl.linux.org/~phillips/htree/indexed.vs.reiser.10x100000.create.random.cpu.log.jpg
I will supply the full benchmark data in a subsequent post.
Here is the C program I used to create files with 'random' names:
/*
* randfiles.c
*
* Usage: randfiles <basename> <count> <echo?>
*
* For benchmarking create performance - create <count> files with
* numbered names, in random order. The <echo?> flag, if y, echoes
* filenames created. For example, 'randfiles foo 10 y' will create
* 10 empty files with names ranging from foo0 to foo9.
*
* copyleft: Daniel Phillips, Oct 6, 2001, [email protected]
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define swap(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)

int main (int argc, char *argv[])
{
        int n = (argc > 2)? strtol(argv[2], 0, 10): 0;
        int i, size = 50, show = argc > 3 && !strncmp(argv[3], "y", 1);
        char name[size];
        int choose[n];

        /* fill choose[] with 0..n-1, then Fisher-Yates shuffle it; rand()
           is left unseeded, so the 'random' order repeats from run to run */
        for (i = 0; i < n; i++) choose[i] = i;
        for (i = n; i; i--)
        {
                int j = rand() % i;
                swap(choose[i-1], choose[j]);
        }
        for (i = 0; i < n; i++)
        {
                snprintf(name, size, "%s%i", argv[1], choose[i]);
                if (show) printf("create %s\n", name);
                close(open(name, O_CREAT, 0644));       /* 0100 was O_CREAT */
        }
        return 0;
}
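The sequential runs don't need anything fancy; the driver is just the same
loop without the shuffle, something like this (a sketch):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

/* seqfiles.c (sketch) - create <count> empty files named <basename>0,
 * <basename>1, ... in ascending order. */
int main(int argc, char *argv[])
{
        int n = (argc > 2)? strtol(argv[2], 0, 10): 0;
        int i;
        char name[50];

        for (i = 0; i < n; i++) {
                snprintf(name, sizeof name, "%s%i", argv[1], i);
                close(open(name, O_CREAT, 0644));
        }
        return 0;
}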
[1] I'm eternally indebted to Stephen Tweedie for coming to my rescue at the
last minute. Shortly before my ALS presentation I had exactly zero graphs
prepared, let alone slides. I'd arrived at Oakland with files full of
benchmark numbers, secure in the knowledge that I could massage them into
impressive graphics on a moment's notice. Boy was I ever wrong.
It seems it's still early days for kchart and kpresenter. To do the job,
I needed something that actually works. Stephen Tweedie had just
the thing on his laptop: staroffice, and he knows how to use it. So we all
retired to a Chinese restaurant nearby and Stephen went to work - all I had
to do was offer advice on which tables to turn into charts. Most of the time
went by just getting the first chart, as Stephen had never actually tried
this before. Then finally, a few charts magically materialized and, even
better, began to metamorphose into a slide show. We all trooped over to the
conference halls, and continued to work on the slides during the talks preceding mine.
My talk went fine; the charts I've presented here are exactly the ones made
by Stephen that day. Thanks. :-)
--
Daniel
I can't comment on your benchmarks because I was on the way to bed when
I read this. I am sure though that you and Stephen are doing your usual
good programming.
ReiserFS is an Htree by your definition in your paper, yes?
Daniel Phillips wrote:
>
>Curiously, Reiserfs actually depends on the spelling of the filename for a
>lot of its good performance. Creating files with names that don't follow a
>lexicographically contiguous sequence produces very different results:
>
> http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.vs.reiser.10x10000.create.random.jpg
>
>So it seems that for realistic cases, ext2+htree outperforms reiserfs quite
>dramatically. (Are you reading, Hans? Fighting words... ;-)
>
Have you ever seen an application that creates millions of files create
them in random order? Almost always there is some non-randomness in the
order, and our newer hash functions are pretty good at preserving it.
Applications that create millions of files are usually willing to play
nice for an order of magnitude performance gain also.....
I have shared your kpresenter troubles:-)....
Hans
Hi Hans,
On December 6, 2001 04:41 am, you wrote:
> I can't comment on your benchmarks because I was on the way to bed when
> I read this. I am sure though that you and Stephen are doing your usual
> good programming.
>
> ReiserFS is an Htree by your definition in your paper, yes?
You've got a hash-keyed b*tree over there. The htree is fixed depth.
> Daniel Phillips wrote:
> >So it seems that for realistic cases, ext2+htree outperforms reiserfs
> >quite dramatically. (Are you reading, Hans? Fighting words... ;-)
>
> Have you ever seen an application that creates millions of files create
> them in random order?
We haven't seen an application create millions of files yet. However, the
effects I'm describing are readily apparent at much smaller numbers.
> Almost always there is some non-randomness in the
> order, and our newer hash functions are pretty good at preserving it.
> Applications that create millions of files are usually willing to play
> nice for an order of magnitude performance gain also.....
To be fair, I should rerun the tests with your linear-congruential hash; I'll
try to get time for that.
--
Daniel
Daniel Phillips wrote:
>Hi Hans,
>
>On December 6, 2001 04:41 am, you wrote:
>
>>I can't comment on your benchmarks because I was on the way to bed when
>>I read this. I am sure though that you and Stephen are doing your usual
>>good programming.
>>
>>ReiserFS is an Htree by your definition in your paper, yes?
>>
>
>You've got a hash-keyed b*tree over there. The htree is fixed depth.
>
B*trees are fixed depth. B-tree usually means height-balanced.
Best wishes,
Hans
On December 6, 2001 04:56 am, Hans Reiser wrote:
> >On December 6, 2001 04:41 am, you wrote:
> >
> >>ReiserFS is an Htree by your definition in your paper, yes?
> >
> >You've got a hash-keyed b*tree over there. The htree is fixed depth.
> >
>
> B*trees are fixed depth. B-tree usually means height-balanced.
I was relying on definitions like this:
B*-tree
(data structure)
Definition: A B-tree in which nodes are kept 2/3 full by redistributing
keys to fill two child nodes, then splitting them into three nodes.
To tell the truth, I haven't read your code that closely, sorry, but I got
the impression that you're doing rotations for balancing no? If not then
have you really got a b*tree?
--
Daniel
On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser wrote:
> >Curiously, Reiserfs actually depends on the spelling of the filename for a
> >lot of its good performance. Creating files with names that don't follow a
> >lexicographically contiguous sequence produces very different results:
> >
> > http://people.nl.linux.org/~phillips/htree/indexed.vs.classic.vs.reiser.10x10000.create.random.jpg
> >
> >So it seems that for realistic cases, ext2+htree outperforms reiserfs quite
> >dramatically. (Are you reading, Hans? Fighting words... ;-)
>
> Have you ever seen an application that creates millions of files create
> them in random order? Almost always there is some non-randomness in the
> order, and our newer hash functions are pretty good at preserving it.
> Applications that create millions of files are usually willing to play
> nice for an order of magnitude performance gain also.....
There is obviously something missing in this picture, or reiserfs would
be as fast as ext2 for random access and much faster for access in
sequential order by filename spelling.
(a "random" hash should not be significantly better than a hash that
preserves order, because the randomness in the hash is of course not the
same random order in wich the files are accessed. The only exception is
that hashes that preserve order may have a harder time using the full
hash-space evenly)
So, did anyone investigate why ext2 is faster than reiserfs in theese
cases, or try benchmarking ext2 with one of the reiserfs-hashes (eg r5)?
We know from earlier benchmarks on reiserfs (tea vs r5 tests, and r5 vs
maildir-hash) that a hash that preserves order can give a magnitude of
order performance improvement in certain cases.
--
Ragnar Kjørstad
Big Storage
Daniel Phillips wrote:
>On December 6, 2001 04:56 am, Hans Reiser wrote:
>
>>>On December 6, 2001 04:41 am, you wrote:
>>>
>>>>ReiserFS is an Htree by your definition in your paper, yes?
>>>>
>>>You've got a hash-keyed b*tree over there. The htree is fixed depth.
>>>
>>B*trees are fixed depth. B-tree usually means height-balanced.
>>
>
>I was relying on definitions like this:
>
> B*-tree
>
> (data structure)
>
> Definition: A B-tree in which nodes are kept 2/3 full by redistributing
> keys to fill two child nodes, then splitting them into three nodes.
>
This is the strangest definition I have read. Where'd you get it?
>
>
>To tell the truth, I haven't read your code that closely, sorry, but I got
>the impression that you're doing rotations for balancing no? If not then
>
What are rotations?
>
>have you really got a b*tree?
>
>--
>Daniel
>
>
On December 6, 2001 02:44 pm, Hans Reiser wrote:
> Daniel Phillips wrote:
> >On December 6, 2001 04:56 am, Hans Reiser wrote:
> >>>On December 6, 2001 04:41 am, you wrote:
> >>>
> >>>>ReiserFS is an Htree by your definition in your paper, yes?
> >>>>
> >>>You've got a hash-keyed b*tree over there. The htree is fixed depth.
> >>>
> >>B*trees are fixed depth. B-tree usually means height-balanced.
> >
> >I was relying on definitions like this:
> >
> > B*-tree
> >
> > (data structure)
> >
> > Definition: A B-tree in which nodes are kept 2/3 full by redistributing
> > keys to fill two child nodes, then splitting them into three nodes.
>
> This is the strangest definition I have read. Where'd you get it?
This came from:
http://www.nist.gov/dads/HTML/bstartree.html
Here's another, referring to Knuth's original description:
http://www.cise.ufl.edu/~jhammer/classes/b_star.html
> >To tell the truth, I haven't read your code that closely, sorry, but I got
> >the impression that you're doing rotations for balancing no? If not then
>
> What are rotations?
Re-rooting a (sub)tree to balance it. <Pulls Knuth down from shelf>
For a BTree, rotating means shifting keys between siblings rather than
re-parenting. (The latter would change the path lengths and the result
wouldn't be a BTree.)
So getting back to your earlier remark, when examined under a bright light,
an HTree looks quite a lot like a BTree, the principal difference being the
hash and consequent collision handling. An HTree is therefore a BTree with
wrinkles.
<reads more> But wait, you store references to objects along with keys, I
don't. So where does that leave us?
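To put some flesh on that comparison, here is a toy user-space model of the
shape of an HTree lookup: hash the name, binary search a small root table of
hash ranges, land on exactly one leaf, scan it linearly. It is an illustration
only, not the on-disk format, and it leaves out collision handling entirely:

#include <stdio.h>

/* Toy model: a root table maps hash ranges to leaf blocks, and every
 * lookup goes root -> leaf and no deeper - that is what fixed depth
 * means here.  Real details (on-disk layout, collisions, the second
 * index level) are deliberately left out. */

struct index_entry {
        unsigned hash;          /* lower bound of the hash range */
        int block;              /* leaf block covering that range */
};

/* stand-in hash, not the real one */
static unsigned toy_hash(const char *name)
{
        unsigned h = 0;

        while (*name)
                h = h * 31 + (unsigned char)*name++;
        return h;
}

/* binary search for the last entry whose lower bound is <= key */
static int pick_leaf(const struct index_entry *root, int count, unsigned key)
{
        int lo = 0, hi = count - 1;

        while (lo < hi) {
                int mid = (lo + hi + 1) / 2;
                if (root[mid].hash <= key)
                        lo = mid;
                else
                        hi = mid - 1;
        }
        return root[lo].block;  /* then scan that leaf's dirents linearly */
}

int main(void)
{
        struct index_entry root[] = {   /* made-up index over four leaves */
                { 0x00000000, 1 }, { 0x40000000, 2 },
                { 0x80000000, 3 }, { 0xc0000000, 4 },
        };
        const char *names[] = { "foo0", "foo1", "foo12345", "bar" };
        int i;

        for (i = 0; i < 4; i++) {
                unsigned h = toy_hash(names[i]);
                printf("%-10s hash %08x -> leaf block %d\n",
                       names[i], h, pick_leaf(root, 4, h));
        }
        return 0;
}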
By the way, how do you handle collisions? I see it has something to do with
generation numbers, but I haven't fully decoded the algorithm.
Fully understanding your code is going to take some time. This would
go faster if I could find the papers mentioned in the comments, can you point
me at those?
--
Daniel
Daniel Phillips wrote:
>On December 6, 2001 02:44 pm, Hans Reiser wrote:
>
>>Daniel Phillips wrote:
>>
>>>On December 6, 2001 04:56 am, Hans Reiser wrote:
>>>
>>>>>On December 6, 2001 04:41 am, you wrote:
>>>>>
>>>>>>ReiserFS is an Htree by your definition in your paper, yes?
>>>>>>
>>>>>You've got a hash-keyed b*tree over there. The htree is fixed depth.
>>>>>
>>>>B*trees are fixed depth. B-tree usually means height-balanced.
>>>>
>>>I was relying on definitions like this:
>>>
>>> B*-tree
>>>
>>> (data structure)
>>>
>>> Definition: A B-tree in which nodes are kept 2/3 full by redistributing
>>> keys to fill two child nodes, then splitting them into three nodes.
>>>
>>This is the strangest definition I have read. Where'd you get it?
>>
>>
>
>This came from:
>
> http://www.nist.gov/dads/HTML/bstartree.html
>
>Here's another, referring to Knuth's original description:
>
> http://www.cise.ufl.edu/~jhammer/classes/b_star.html
>
Hmmmm, the definition I remember came, I think, from Wood's book
on Algorithms, and unfortunately it disappeared from my office some time
ago.
My memory was that a B*-tree handles variable length records. It seems
I am wrong though. Dear. I'd better go tell the tech writer.
ReiserFS is a B*-tree by both definitions though. (Convenient at the
moment:-) )
>
>
>>>To tell the truth, I haven't read your code that closely, sorry, but I got
>>>the impression that you're doing rotations for balancing no? If not then
>>>
>>What are rotations?
>>
>
>Re-rooting a (sub)tree to balance it. <Pulls Knuth down from shelf>
>For a BTree, rotating means shifting keys between siblings rather than
>re-parenting. (The latter would change the path lengths and the result
>wouldn't be a BTree.)
>
>So getting back to your earlier remark, when examined under a bright light,
>an HTree looks quite a lot like a BTree, the principal difference being the
>hash and consequent collision handling. An HTree is therefore a BTree with
>wrinkles.
>
Hmmm, well we do hashes too. But I see your hash collision handling
resembles (but with some interesting improvements) something once
suggested by one of my programmers. What happens when you have two
collisions in one node, and then you delete one colliding name? Do you
leak collision bits?
When you rehash a large directory, is there a brief service interruption?
>
>
><reads more> But wait, you store references to objects along with keys, I
>don't. So where does that leave us?
>
>By the way, how do you handle collisions? I see it has something to do with
>generation numbers, but I haven't fully decoded the algorithm.
>
>Fully understanding your code is going to take some time. This would
>go faster if I could find the papers mentioned in the comments, can you point
>me at those?
>
Which papers in which comments?
>
>
>--
>Daniel
>
>
>
Yura, can you run some benchmarks on this code?
On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser <[email protected]> wrote:
| Have you ever seen an application that creates millions of files create
| them in random order?
I can readily imagine one. An app which stashes things sent by random
other things (usenet/email attachment trollers? security cameras taking
thousands of still photos a day?). Mail services like hotmail, with a
zillion mail spools, being made and deleted and accessed at random...
| Applications that create millions of files are usually willing to play
| nice for an order of magnitude performance gain also.....
But they shouldn't have to! Specifically, to "play nice" you need to know
about the filesystem attributes. You can obviously do simple things like
a directory hierarchy as for squid proxy caches etc, but it's an ad hoc
thing. Tuning it does require specific knowledge, and the act itself
presumes exactly the sort of inefficiency in the fs implementation that
this htree stuff is aimed at rooting out.
A filesystem _should_ be just a namespace from the app's point of view.
Needing to play silly subdir games is, well, ugly.
--
Cameron Simpson, DoD#743 [email protected] http://www.zip.com.au/~cs/
Were I subject to execution, I would choose to be fired at tremendous
speed from a cannon into the brick wall of an elementary school building,
and insist on full media coverage, however, as I said the choices are
sadly limited. - Jim Del Vecchio
On December 7, 2001 01:13 am, Hans Reiser wrote:
> Daniel Phillips wrote:
> >Fully understanding your code is going to take some time. This would
> >go faster if I could find the papers mentioned in the comments, can you point
> >me at those?
>
> Which papers in which comments?
http://innominate.org/~graichen/projects/lxr/source/include/linux/reiserfs_fs.h?v=v2.4#L1393
1393 create a new node. We implement S1 balancing for the leaf nodes
1394 and S0 balancing for the internal nodes (S1 and S0 are defined in
1395 our papers.)*/
--
Daniel
Cameron Simpson wrote:
>On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser <[email protected]> wrote:
>| Have you ever seen an application that creates millions of files create
>| them in random order?
>
>I can readily imagine one. An app which stashes things sent by random
>other things (usenet/email attachment trollers? security cameras taking
>thousands of still photos a day?). Mail services like hotmail, with a
>zillion mail spools, being made and deleted and accessed at random...
>
Ok, they exist, but they are the 20% not the 80% case, and for that
reason preserving order in hashing is a legitimate optimization.
>
>
>| Applications that create millions of files are usually willing to play
>| nice for an order of magnitude performance gain also.....
>
>But they shouldn't have to! Specifically, to "play nice" you need to know
>about the filesystem attributes. You can obviously do simple things like
>a directory hierarchy as for squid proxy caches etc, but it's an ad hoc
>thing. Tuning it does require specific knowledge, and the act itself
>presumes exactly the sort of inefficiency in the fs implementation that
>this htree stuff is aimed at rooting out.
>
>A filesystem _should_ be just a namespace from the app's point of view.
>Needing to play silly subdir games is, well, ugly.
>
Subdir games won't help you if the names are random ordered.
If names are truly random ordered, then the only optimization that can
help is compression so as to cause the working set to still fit into RAM.
Compression will be sometime later than Reiser4, unless an unexpected
sponsor comes along, but we will do it eventually.
Hans
Daniel Phillips wrote:
>On December 7, 2001 01:13 am, Hans Reiser wrote:
>
>>Daniel Phillips wrote:
>>
>>>Fully understanding your code is going to take some time. This would
>>>go faster if I could find the papers mentioned in the comments, can you point
>>>me at those?
>>>
>>Which papers in which comments?
>>
>
> http://innominate.org/~graichen/projects/lxr/source/include/linux/reiserfs_fs.h?v=v2.4#L1393
>
> 1393 create a new node. We implement S1 balancing for the leaf nodes
> 1394 and S0 balancing for the internal nodes (S1 and S0 are defined in
> 1395 our papers.)*/
>
>--
>Daniel
>
>
How about I just explain it instead? We preserve the criterion that nodes
must be 50% full for internal nodes, and the criterion that no 3 nodes can be
squeezed into 2 nodes for leaf nodes.
A tree that satisfies the criterion that no N nodes can be squeezed into
N-1 nodes is an SN tree. I don't remember where Konstantin Shvachko
published his paper on this, maybe it can be found.
In Reiser4 we abandon the notion that fixed balancing criteria should be
used for leaf nodes.
Hans
On Fri, Dec 07, 2001 at 02:19:13PM +1100, Cameron Simpson wrote:
> On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser <[email protected]> wrote:
> | Have you ever seen an application that creates millions of files create
> | them in random order?
>
> I can readily imagine one. An app which stashes things sent by random
> other things (usenet/email attachment trollers? security cameras taking
> thousands of still photos a day?). Mail services like hotmail, with a
> zillion mail spools, being made and deleted and accessed at random...
I wouldn't think either of those had "random" names.
E.g. a lot of mail systems use maildir for storage, and the filenames
depend on the server receiving the data and a timestamp. It's a very
good example of what can be optimized with some guesses about ordering.
> But they shouldn't have to! Specifically, to "play nice" you need to know
> about the filesystem attributes. You can obviously do simple things like
> a directory hierarchy as for squid proxy caches etc, but it's an ad hoc
> thing. Tuning it does require specific knowledge, and the act itself
> presumes exactly the sort of inefficiency in the fs implementation that
> this htree stuff is aimed at rooting out.
An ordering hash doesn't imply that you _need_ some knowledge about the
filesystem attributes. The hash should not change the worst-case
scenario significantly. The only effect of an ordering hash should be
that you get best-case whenever you access the files in order, and I
believe that tests on reiserfs have shown both that:
* this particular ordering is used in real life applications
and
* the best case is significantly better than the worst case.
(probably because of read-ahead and better cache results)
That said, there are other ways for the filesystem to guess the optimal
order of the files, e.g.:
1 the order in which the files are created
2 a separate interface where userspace programs could specify the optimal
order
Option 2 is probably out of the question because applications would need
to be changed to take advantage of it, but ordering the files on disk in the
order they were created in is probably a very good approximation.
--
Ragnar Kjørstad
Big Storage
On December 7, 2001 01:36 pm, Hans Reiser wrote:
http://innominate.org/~graichen/projects/lxr/source/include/linux/reiserfs_fs.h?v=v2.4#L1393
> >
> > 1393 create a new node. We implement S1 balancing for the leaf nodes
> > 1394 and S0 balancing for the internal nodes (S1 and S0 are defined in
> > 1395 our papers.)*/
>
> How about I just explain it instead? We preserve the criterion that nodes
> must be 50% full for internal nodes, and the criterion that no 3 nodes can be
> squeezed into 2 nodes for leaf nodes.
>
> A tree that satisfies the criterion that no N nodes can be squeezed into
> N-1 nodes is an SN tree. I don't remember where Konstantin Shvachko
> published his paper on this, maybe it can be found.
<nit> Then shouldn't that be "S3 balancing for the leaf nodes and S2
balancing for the internal nodes"?
--
Daniel
On December 7, 2001 11:54 am, Hans Reiser wrote:
> Cameron Simpson wrote:
>
> >On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser <[email protected]> wrote:
> >| Have you ever seen an application that creates millions of files create
> >| them in random order?
> >
> >I can readily imagine one. An app which stashes things sent by random
> >other things (usenet/email attachment trollers? security cameras taking
> >thousands of still photos a day?). Mail services like hotmail, with a
> >zillion mail spools, being made and deleted and accessed at random...
>
> Ok, they exist, but they are the 20% not the 80% case, and for that
> reason preserving order in hashing is a legitimate optimization.
At least, I think you ought to make a random hash the default. You're
suffering badly on the 'random name' case, which I don't think is all that
rare. I'll run that test again with some of your hashes and see what happens.
> If names are truly random ordered, then the only optimization that can
> help is compression so as to cause the working set to still fit into RAM.
You appear to be mixing up the idea of random characters in the names with
random processing order. IMHO, the exact characters in a file name should
not affect processing efficiency at all, and I went out of my way to make
that true with HTree.
On the other hand, the processing order of names does and will always matter
a great deal in terms of cache footprint.
I should have done random stat benchmarks too, since we'll really see the
effects of processing order there. I'll put that on my to-do list.
--
Daniel
On December 6, 2001 12:27 pm, Ragnar Kjørstad wrote:
> There is obviously something missing in this picture, or reiserfs would
> be as fast as ext2 for random access and much faster for access in
> sequential order by filename spelling.
>
> (a "random" hash should not be significantly better than a hash that
> preserves order, because the randomness in the hash is of course not the
> same random order in which the files are accessed. The only exception is
> that hashes that preserve order may have a harder time using the full
> hash-space evenly)
>
> So, did anyone investigate why ext2 is faster than reiserfs in these
> cases, or try benchmarking ext2 with one of the reiserfs hashes (e.g. r5)?
> We know from earlier benchmarks on reiserfs (tea vs r5 tests, and r5 vs
> maildir-hash) that a hash that preserves order can give an order of
> magnitude performance improvement in certain cases.
I did try R5 in htree, and at least a dozen other hashes. R5 was the worst
of the bunch, in terms of uniformity of distribution, and caused a measurable
slowdown in Htree performance. (Not an order of magnitude, mind you,
something closer to 15%.)
An alternative way of looking at this is, rather than R5 causing an order of
magnitude improvement for certain cases, something else is causing an order
of magnitude slowdown for common cases. I'd suggest attempting to root that
out.
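For anyone who wants to repeat the comparison, the measurement itself is
nothing deep: generate a pile of names, hash them, bucket them by the top bits
the way the index effectively does, and look at the spread. A sketch (the
stand-in hash below is just a placeholder for whichever hash is under test,
R5, TEA or otherwise):

#include <stdio.h>

#define NAMES   100000
#define BUCKETS 256             /* pretend the index splits on 8 bits */

/* stand-in hash - substitute the hash being measured */
static unsigned test_hash(const char *s)
{
        unsigned h = 0;

        while (*s)
                h = h * 31 + (unsigned char)*s++;
        return h;
}

int main(void)
{
        static unsigned count[BUCKETS];
        char name[32];
        unsigned min = ~0u, max = 0;
        int i;

        for (i = 0; i < NAMES; i++) {
                snprintf(name, sizeof name, "file%d", i);
                count[test_hash(name) >> 24]++; /* top 8 bits pick a bucket */
        }
        for (i = 0; i < BUCKETS; i++) {
                if (count[i] < min) min = count[i];
                if (count[i] > max) max = count[i];
        }
        printf("ideal %d per bucket, got min %u, max %u\n",
               NAMES / BUCKETS, min, max);
        return 0;
}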
--
Daniel
On Fri, Dec 07, 2001 at 04:51:33PM +0100, Daniel Phillips wrote:
> I did try R5 in htree, and at least a dozen other hashes. R5 was the worst
> of the bunch, in terms of uniformity of distribution, and caused a measurable
> slowdown in Htree performance. (Not an order of magnitude, mind you,
> something closer to 15%.)
That sounds reasonable.
> An alternative way of looking at this is, rather than R5 causing an order of
> magnitude improvement for certain cases, something else is causing an order
> of magnitude slowdown for common cases. I'd suggest attempting to root that
> out.
In the cases I've studied more closely (e.g. maildir cases) the problem
with reiserfs and e.g. the tea hash is that there is no common ordering
between directory entries, stat-data and file-data.
When new files are created in a directory, the file-data tend to be
allocated somewhere after the last allocated file in the directory. The
ordering of the directory-entry and the stat-data (hmm, both?) are
however dependent on the hash. So, with something like the tea hash the
new file will be inserted in the middle of the directory.
In addition to the random lookup type reads, there are three other common
scenarios for reading the files:
1 Reading them in the same order they were created
The cache will probably not be 100% effective on the
directory/stat-data, because it's being accessed in a random-like
order. Read-ahead for the file-data on the other hand will be effective.
2 Reading the files in filename-order
Some applications (say, ls -l) may do this, but I doubt it's a very
common access pattern. Cache-hit for directory-data will be poor, and
cache-hit for file-data will be poor unless the files were created in
the same order.
3 Reading the files in readdir() order.
This is what I think is the most common access-pattern. I would expect a
lot of programs (e.g. mail clients using maildir) to read the directory
and for every filename stat the file and read the data. This will be in
optimal order for directory-caching, but more importantly it will be
random-order like access for the file-data.
I think scenario nr 3 is the one that matters, and I think it is this
scenario that makes r5 faster than tea in real-life applications on
reiserfs. (although most numbers available are from benchmarks and not
real life applications).
The directory content is likely to all fit in cache even for a fairly
large directory, so cache-misses are not that much of a problem. The
file-data itself however, will suffer if read-ahead can't start reading
the next file from disk while the first one is being processed.
I'm counting on Hans or someone else from the reiserfs team to correct
me if I'm wrong.
--
Ragnar Kjørstad
Big Storage
On December 7, 2001 05:47 pm, Ragnar Kjørstad wrote:
> On Fri, Dec 07, 2001 at 04:51:33PM +0100, Daniel Phillips wrote:
> > An alternative way of looking at this is, rather than R5 causing an order
> > of magnitude improvement for certain cases, something else is causing an
> > order of magnitude slowdown for common cases. I'd suggest attempting to
> > root that out.
>
> In the cases I've studied more closely (e.g. maildir cases) the problem
> with reiserfs and e.g. the tea hash is that there is no common ordering
> between directory entries, stat-data and file-data.
>
> When new files are created in a directory, the file-data tend to be
> allocated somewhere after the last allocated file in the directory. The
> ordering of the directory-entry and the stat-data (hmm, both?) are
> however dependent on the hash. So, with something like the tea hash the
> new file will be inserted in the middle of the directory.
>
>
> In addition to the random lookup type reads, there are three other common
> scenarios for reading the files:
>
> 1 Reading them in the same order they were created
> The cache will probably not be 100% effective on the
> directory/stat-data, because it's being accessed in a random-like
> order. Read-ahead for the file-data on the other hand will be effective.
>
> 2 Reading the files in filename-order
> Some applications (say, ls -l) may do this, but I doubt it's a very
> common access pattern. Cache-hit for directory-data will be poor, and
> cache-hit for file-data will be poor unless the files were created in
> the same order.
>
> 3 Reading the files in readdir() order.
> This is what I think is the most common access-pattern. I would expect a
> lot of programs (e.g. mail clients using maildir) to read the directory
> and for every filename stat the file and read the data. This will be in
> optimal order for directory-caching, but more importantly it will be
> random-order like access for the file-data.
>
> I think scenario nr 3 is the one that matters, and I think it is this
> scenario that makes r5 faster than tea in real-life applications on
> reiserfs. (although most numbers available are from benchmarks and not
> real life applications).
>
> The directory content is likely to all fit in cache even for a fairly
> large directory, so cache-misses are not that much of a problem. The
> file-data itself however, will suffer if read-ahead can't start reading
> the next file from disk while the first one is being processed.
I've observed disk cache effects with Ext2, the relevant relationship being
directory entry order vs inode order. Layout of the index itself doesn't
seem to matter much because of its small size, and 'popularity', which tends
to keep it in cache.
Because Ext2 packs multiple entries onto a single inode table block, the
major effect is not due to lack of readahead but to partially processed inode
table blocks being evicted. The easiest thing to do is keep everything
compact, which it is (dirent: 8 bytes + name, inode: 128 bytes, index: 8
bytes per ~250 entries). The next easiest thing is to fix the icache, which
seems to impose an arbitrary limit regardless of the actual amount of memory
available. Beyond that, I have some fancier strategies in mind for making
processing order correspond better to inode table order by tweaking inode
allocation policy. I'll have more to say about that later. Finally, disk
cache effects don't show up with HTree until we get into millions of files,
and even then performance degrades gently, with a worst case that is still
entirely acceptable.
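To put rough numbers on that (assuming ~16 character names, so these are only
ballpark figures), for a million files:

    dirents:     1,000,000 x (8 + 16)  = ~24 MB
    inode table: 1,000,000 x 128       = 128 MB, spread across itable blocks
    index:       1,000,000 / 250 x 8   = ~32 KB

which is why the index itself effectively never falls out of cache, while the
itable blocks are the part that gets squeezed.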
With ReiserFS we see slowdown due to random access even with small
directories. I don't think this is a cache effect.
--
Daniel
On Fri, Dec 07, 2001 at 06:41:53PM +0100, Daniel Phillips wrote:
> I've observed disk cache effects with Ext2, the relevant relationship being
> directory entry order vs inode order. Layout of the index itself doesn't
> seem to matter much because of its small size, and 'popularity', which tends
> to keep it in cache.
Exactly.
And if the files have data in them (all my tests were done with files
with bodies) then there is a third data-type (the allocated blocks)
whose order compared to the entry-order and the inode-order also
matters.
> With ReiserFS we see slowdown due to random access even with small
> directories. I don't think this is a cache effect.
I can't see why the benefit from read-ahead on the file-data should be
affected by the directory-size?
I forgot to mention another important effect of hash-ordering:
If you mostly add new files to the directory it is far less work if you
almost always can add the new entry at the end rather than insert it in
the middle. Well, it depends on your implementation of course, but this
effect is quite noticeable on reiserfs. When untarring a big directory of
maildir the performance difference between the tea hash and a special
maildir hash was approximately 20%. The choice of hash should not affect
the performance on writing the data itself, so it has to be related to
the cost of the insert operation.
--
Ragnar Kjørstad
Big Storage
On December 7, 2001 07:03 pm, Ragnar Kjørstad wrote:
> > With ReiserFS we see slowdown due to random access even with small
> > directories. I don't think this is a cache effect.
>
> I can't see why the benefit from read-ahead on the file-data should be
> affected by the directory-size?
>
> I forgot to mention another important effect of hash-ordering:
> If you mostly add new files to the directory it is far less work if you
> almost always can add the new entry at the end rather than insert it in
> the middle. Well, it depends on your implementation of course, but this
> effect is quite noticeable on reiserfs. When untarring a big directory of
> maildir the performance difference between the tea hash and a special
> maildir hash was approximately 20%. The choice of hash should not affect
> the performance on writing the data itself, so it has to be related to
> the cost of the insert operation.
Yes, I think you're on the right track. HTree on the other hand is optimized
for inserting in arbitrary places, it takes no advantage at all of sequential
insertion. (And doesn't suffer from this, because it all happens in cache
anyway - a million-file indexed directory is around 30 meg.)
--
Daniel
Daniel Phillips wrote:
>
> Because Ext2 packs multiple entries onto a single inode table block, the
> major effect is not due to lack of readahead but to partially processed inode
> table blocks being evicted.
Inode and directory lookups are satisfied direct from the icache/dcache,
and the underlying fs is not informed of a lookup, which confuses the VM.
Possibly, implementing a d_revalidate() method which touches the
underlying block/page when a lookup occurs would help.
-
On December 7, 2001 07:32 pm, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > Because Ext2 packs multiple entries onto a single inode table block, the
> > major effect is not due to lack of readahead but to partially processed
> > inode table blocks being evicted.
>
> Inode and directory lookups are satisfied direct from the icache/dcache,
> and the underlying fs is not informed of a lookup, which confuses the VM.
>
> Possibly, implementing a d_revalidate() method which touches the
> underlying block/page when a lookup occurs would help.
Very interesting point, the same thing happens with file index blocks vs page
cache accesses. You're suggesting we need some kind of mechanism for
propagating hits on cache items, either back to the underlying data or the
information used to regenerate the cache items.
On the other hand, having the underlying itable blocks get evicted can be
looked at as a feature, not a bug - it reduces double storage, allowing more
total items in cache.
There's also a subtle mechanism at work here - if any of the icache items on
an itable block gets evicted, then recreated before the itable block is
evicted, we *will* get an access hit on the itable block and grandma won't
have to sell the farm (I made that last part up).
--
Daniel
Daniel Phillips wrote:
>
> On December 7, 2001 07:32 pm, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > Because Ext2 packs multiple entries onto a single inode table block, the
> > > major effect is not due to lack of readahead but to partially processed
> > > inode table blocks being evicted.
> >
> > Inode and directory lookups are satisfied direct from the icache/dcache,
> > and the underlying fs is not informed of a lookup, which confuses the VM.
> >
> > Possibly, implementing a d_revalidate() method which touches the
> > underlying block/page when a lookup occurs would help.
>
> Very interesting point, the same thing happens with file index blocks vs page
> cache accesses. You're suggesting we need some kind of mechanism for
> propagating hits on cache items, either back to the underlying data or the
> information used to regenerate the cache items.
Not just to regenerate, but in the case of inodes: to write them back.
We have situations in which sync_unlocked_inodes() stalls badly,
because it keeps on doing reads.
-
Daniel Phillips wrote:
>On December 7, 2001 01:36 pm, Hans Reiser wrote:
>http://innominate.org/~graichen/projects/lxr/source/include/linux/reiserfs_fs.h?v=v2.4#L1393
>
>>> 1393 create a new node. We implement S1 balancing for the leaf nodes
>>> 1394 and S0 balancing for the internal nodes (S1 and S0 are defined in
>>> 1395 our papers.)*/
>>>
>>How about I just explain it instead? We preserve the criterion that nodes
>>must be 50% full for internal nodes, and the criterion that no 3 nodes can be
>>squeezed into 2 nodes for leaf nodes.
>>
>>A tree that satisfies the criterion that no N nodes can be squeezed into
>>N-1 nodes is an SN tree. I don't remember where Konstantin Shvachko
>>published his paper on this, maybe it can be found.
>>
>
><nit> Then shouldn't that be "S3 balancing for the leaf nodes and S2
>balancing for the internal nodes"?
>
>--
>Daniel
>
>
Yes, sorry, an SN tree has the property that at the end of balancing the
set of nodes within the sweep composed of nodes N to the left and N to
the right plus the node being balanced cannot be compressed into 2N nodes.
My error.
Hans
Daniel Phillips wrote:
>On December 7, 2001 11:54 am, Hans Reiser wrote:
>
>>Cameron Simpson wrote:
>>
>>>On Thu, Dec 06, 2001 at 06:41:17AM +0300, Hans Reiser <[email protected]>
>>>
>wrote:
>
>>>| Have you ever seen an application that creates millions of files create
>>>| them in random order?
>>>
>>>I can readily imagine one. An app which stashes things sent by random
>>>other things (usenet/email attachment trollers? security cameras taking
>>>thousands of still photos a day?). Mail services like hotmail, with a
>>>zillion mail spools, being made and deleted and accessed at random...
>>>
>>Ok, they exist, but they are the 20% not the 80% case, and for that
>>reason preserving order in hashing is a legitimate optimization.
>>
>
>At least, I think you ought to make a random hash the default. You're
>suffering badly on the 'random name' case, which I don't think is all that
>rare. I'll run that test again with some of your hashes and see what happens.
>
>>If names are truly random ordered, then the only optimization that can
>>help is compression so as to cause the working set to still fit into RAM.
>>
>
>You appear to be mixing up the idea of random characters in the names with
>random processing order. IMHO, the exact characters in a file name should
>not affect processing efficiency at all, and I went out of my way to make
>that true with HTree.
>
If the characters in the name determine the point of insertion, and the
extent to which processing order correlates with the point of insertion
determines how well caching works, then do you see my viewpoint?
Sure, nobody "should" have to engage in locality of reference, but God
was not concerned somehow, and so disk drives make us get all very
worried about locality of reference.
>
>
>On the other hand, the processing order of names does and will always matter
>a great deal in terms of cache footprint.
>
>I should have done random stat benchmarks too, since we'll really see the
>effects of processing order there. I'll put that on my to-do list.
>
>--
>Daniel
>
>
We should give Yura and Green a chance to run some benchmarks before I
get into too much analyzing. I have learned not to comment before
seeing complete benchmarks.
Hans
Ragnar Kjørstad wrote:
>On Fri, Dec 07, 2001 at 04:51:33PM +0100, Daniel Phillips wrote:
>
>>I did try R5 in htree, and at least a dozen other hashes. R5 was the worst
>>of the bunch, in terms of uniformity of distribution, and caused a measurable
>>slowdown in Htree performance. (Not an order of magnitude, mind you,
>>something closer to 15%.)
>>
>
>That sounds reasonable.
>
You are more dependent on hash uniformity than we are. We have a
balancing algorithm that manages space, you use hashing to manage your
space. It is a weakness of your approach (which is not to say your
approach is a bad one).
>
>
>
>>An alternative way of looking at this is, rather than R5 causing an order of
>>magnitude improvement for certain cases, something else is causing an order
>>of magnitude slowdown for common cases. I'd suggest attempting to root that
>>out.
>>
>
>In the cases I've studied more closely (e.g. maildir cases) the problem
>with reiserfs and e.g. the tea hash is that there is no common ordering
>between directory entries, stat-data and file-data.
>
>When new files are created in a directory, the file-data tend to be
>allocated somewhere after the last allocated file in the directory. The
>ordering of the directory-entry and the stat-data (hmm, both?) are
>
no, actually this is a problem for v3. stat data are time of creation
ordered (very roughly speaking)
and directory entries are hash ordered, meaning that ls -l suffers a
major performance penalty.
This might well affect our performance vs. htree, I don't know where
Daniel puts his stat data.
This matter is receiving attention in V4, and Nikita and I need to have
a seminar on it next week.
>
>however dependent on the hash. So, with something like the tea hash the
>new file will be inserted in the middle of the directory.
>
>
>In addition to the random lookup type reads, there are three other common
>scenarios for reading the files:
>
>1 Reading them in the same order they were created
>The cache will probably not be 100% effective on the
>directory/stat-data, because it's being accessed in a random-like
>order. Read-ahead for the file-data on the other hand will be effective.
>
>2 Reading the files in filename-order
>Some applications (say, ls -l) may do this, but I doubt it's a very
>common access pattern. Cache-hit for directory-data will be poor, and
>cache-hit for file-data will be poor unless the files were created in
>the same order.
>
>3 Reading the files in readdir() order.
>This is what I think is the most common access-pattern. I would expect a
>lot of programs (e.g. mail clients using maildir) to read the directory
>and for every filename stat the file and read the data. This will be in
>optimal order for directory-caching, but more importantly it will be
>random-order like access for the file-data.
>
>I think scenario nr 3 is the one that matters, and I think it is this
>scenario that makes r5 faster than tea in real-life applications on
>reiserfs. (although most numbers available are from benchmarks and not
>real life applications).
>
>The directory content is likely to all fit in cache even for a fairly
>large directory, so cache-misses are not that much of a problem. The
>file-data itself however, will suffer if read-ahead can't start reading
>the next file from disk while the first one is being processed.
>
>
>
>I'm counting on Hans or someone else from the reiserfs team to correct
>me if I'm wrong.
>
>
Users who want to speed up reiserfs V3 read/stat performance can do so by
copying directories after creating them, and this way readdir order
equals stat data order. Sad, I know. Only a really fanatic sysadmin is
going to create his reiserfs installs using a master image that has
experienced a cp, but it will make things significantly faster if he
does. Green, add this to the FAQ.
We need to fix this, it is a missed opportunity for higher performance.
V4 I hope.
Hans
Daniel Phillips wrote:
>On December 7, 2001 07:03 pm, Ragnar Kjørstad wrote:
>
>>>With ReiserFS we see slowdown due to random access even with small
>>>directories. I don't think this is a cache effect.
>>>
>>I can't see why the benefit from read-ahead on the file-data should be
>>affected by the directory-size?
>>
>>I forgot to mention another important effect of hash-ordering:
>>If you mostly add new files to the directory it is far less work if you
>>almost always can add the new entry at the end rather than insert it in
>>the middle. Well, it depends on your implementation of course, but this
>>effect is quite noticeable on reiserfs. When untarring a big directory of
>>maildir the performance difference between the tea hash and a special
>>maildir hash was approximately 20%. The choice of hash should not affect
>>the performance on writing the data itself, so it has to be related to
>>the cost of the insert operation.
>>
>
>Yes, I think you're on the right track. HTree on the other hand is optimized
>for inserting in arbitrary places, it takes no advantage at all of sequential
>insertion. (And doesn't suffer from this, because it all happens in cache
>anyway - a million-file indexed directory is around 30 meg.)
>
>--
>Daniel
>
>
And how large is the dcache and all the inodes? I believe large
directory plus small file performance is heavily affected by the
enormous size of struct inode and all the other per file data.
Hans
Daniel Phillips wrote:
>On December 7, 2001 07:03 pm, Ragnar Kjørstad wrote:
>
>>>With ReiserFS we see slowdown due to random access even with small
>>>directories. I don't think this is a cache effect.
>>>
>>I can't see why the benefit from read-ahead on the file-data should be
>>affected by the directory-size?
>>
>>I forgot to mention another important effect of hash-ordering:
>>If you mostly add new files to the directory it is far less work if you
>>almost always can add the new entry at the end rather than insert it in
>>the middle. Well, it depends on your implementation of course, but this
>>effect is quite noticeable on reiserfs. When untarring a big directory of
>>maildir the performance difference between the tea hash and a special
>>maildir hash was approximately 20%. The choice of hash should not affect
>>the performance on writing the data itself, so it has to be related to
>>the cost of the insert operation.
>>
>
>Yes, I think you're on the right track. HTree on the other hand is optimized
>for inserting in arbitrary places, it takes no advantage at all of sequential
>insertion. (And doesn't suffer from this, because it all happens in cache
>anyway - a million-file indexed directory is around 30 meg.)
>
>--
>Daniel
>
>
someday reiserfs might benefit from inserting more airholes into
directories so that insertion is more efficient.
This may be a significant advantage of Htree.
Yet another feature needed.....:-/
Hans
On Sat, Dec 08, 2001 at 12:01:20AM +0300, Hans Reiser wrote:
> >In the cases I've studied more closely (e.g. maildir cases) the problem
> >with reiserfs and e.g. the tea hash is that there is no common ordering
> >between directory entries, stat-data and file-data.
> >
> >When new files are created in a directory, the file-data tend to be
> >allocated somewhere after the last allocated file in the directory. The
> >ordering of the directory-entry and the stat-data (hmm, both?) are
> >
>
> no, actually this is a problem for v3. stat data are time of creation
> ordered (very roughly speaking)
> and directory entries are hash ordered, meaning that ls -l suffers a
> major performance penalty.
Yes, just remember that file-body ordering also has the same problem.
(ref the "find . -type f | xargs cat > /dev/null" test wich I think
represent maildir performance pretty closely)
--
Ragnar Kjørstad
Big Storage
Ragnar Kjørstad wrote:
>On Sat, Dec 08, 2001 at 12:01:20AM +0300, Hans Reiser wrote:
>
>>>In the cases I've studied more closely (e.g. maildir cases) the problem
>>>with reiserfs and e.g. the tea hash is that there is no common ordering
>>>between directory entries, stat-data and file-data.
>>>
>>>When new files are created in a directory, the file-data tend to be
>>>allocated somewhere after the last allocated file in the directory. The
>>>ordering of the directory-entry and the stat-data (hmm, both?) are
>>>
>>no, actually this is a problem for v3. stat data are time of creation
>>ordered (very roughly speaking)
>>and directory entries are hash ordered, meaning that ls -l suffers a
>>major performance penalty.
>>
>
>Yes, just remember that file-body ordering also has the same problem.
>(ref the "find . -type f | xargs cat > /dev/null" test wich I think
>represent maildir performance pretty closely)
>
>
>
So is this a deeply inherent drawback of offering readdir name orders
that differ hugely from time of creation order?
The advantages of sorting for non-linear search time are obvious.....
I suppose we could use objectids based on the hash of the first
assigned filename plus a 60 bit counter global to the FS....
but it is too many bits I think. I think that using substantially less
than the full hash of the name that is used for directory entry keys
doesn't work.... Comments welcome.
Hans
In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>Daniel Phillips wrote:
>>
>> Because Ext2 packs multiple entries onto a single inode table block, the
>> major effect is not due to lack of readahead but to partially processed inode
>> table blocks being evicted.
>
>Inode and directory lookups are satisfied direct from the icache/dcache,
>and the underlying fs is not informed of a lookup, which confuses the VM.
>
>Possibly, implementing a d_revalidate() method which touches the
>underlying block/page when a lookup occurs would help.
Well, the multi-level caching thing is very much "separate levels" on
purpose, one of the whole points of the icache/dcache being accessed
without going to any lower levels is that going all the way to the lower
levels is slow.
And there are cases where it is better to throw away the low-level
information, and keep the high-level cache, if that really is the access
pattern. For example, if we really always hit in the dcache, there is no
reason to keep any backing store around.
For inodes in particular, though, I suspect that we're just wasting
memory copying the ext2 data from the disk block to the "struct inode".
We might be much better off with
- get rid of the duplication between "ext2_inode_info" (in struct
inode) and "ext2_inode" (on-disk representation)
- add "struct ext2_inode *" and a "struct buffer_head *" pointer to
"ext2_inode_info".
- do all inode ops "in place" directly in the buffer cache.
This might actually _improve_ memory usage (avoid duplicate data), and
would make the buffer cache a "slave cache" of the inode cache, which in
turn would improve inode IO (ie writeback) noticeably. It would get rid
of a lot of horrible stuff in "ext2_update_inode()", and we'd never have
to read in a buffer block in order to write out an inode (right now,
because inodes are only partial blocks, write-out becomes a read-modify-
write cycle if the buffer has been evicted).
So "ext2_write_inode()" would basically become somehting like
struct ext2_inode *raw_inode = inode->u.ext2_i.i_raw_inode;
struct buffer_head *bh = inode->u.ext2_i.i_raw_bh;
/* Update the stuff we've brought into the generic part of the inode */
raw_inode->i_size = cpu_to_le32(inode->i_size);
...
mark_buffer_dirty(bh);
with part of the data already in the right place (ie the current
"inode->u.ext2_i.i_data[block]" wouldn't exist, it would just exist as
"raw_inode->i_block[block]" directly in the buffer block.
Linus
On December 8, 2001 08:19 am, Linus Torvalds wrote:
> In article <[email protected]>,
> Andrew Morton <[email protected]> wrote:
> >Daniel Phillips wrote:
> >>
> >> Because Ext2 packs multiple entries onto a single inode table block, the
> >> major effect is not due to lack of readahead but to partially processed inode
> >> table blocks being evicted.
> >
> >Inode and directory lookups are satisfied direct from the icache/dcache,
> >and the underlying fs is not informed of a lookup, which confuses the VM.
> >
> >Possibly, implementing a d_revalidate() method which touches the
> >underlying block/page when a lookup occurs would help.
>
> Well, the multi-level caching thing is very much "separate levels" on
> purpose, one of the whole points of the icache/dcache being accessed
> without going to any lower levels is that going all the way to the lower
> levels is slow.
>
> And there are cases where it is better to throw away the low-level
> information, and keep the high-level cache, if that really is the access
> pattern. For example, if we really always hit in the dcache, there is no
> reason to keep any backing store around.
>
> For inodes in particular, though, I suspect that we're just wasting
> memory copying the ext2 data from the disk block to the "struct inode".
> We might be much better off with
>
> - get rid of the duplication between "ext2_inode_info" (in struct
> inode) and "ext2_inode" (on-disk representation)
> - add "struct ext2_inode *" and a "struct buffer_head *" pointer to
> "ext2_inode_info".
> - do all inode ops "in place" directly in the buffer cache.
>
> This might actually _improve_ memory usage (avoid duplicate data), and
> would make the buffer cache a "slave cache" of the inode cache, which in
> turn would improve inode IO (ie writeback) noticeably. It would get rid
> of a lot of horrible stuff in "ext2_update_inode()", and we'd never have
> to read in a buffer block in order to write out an inode (right now,
> because inodes are only partial blocks, write-out becomes a read-modify-
> write cycle if the buffer has been evicted).
>
> So "ext2_write_inode()" would basically become something like
>
> struct ext2_inode *raw_inode = inode->u.ext2_i.i_raw_inode;
> struct buffer_head *bh = inode->u.ext2_i.i_raw_bh;
>
> /* Update the stuff we've brought into the generic part of the inode */
> raw_inode->i_size = cpu_to_le32(inode->i_size);
> ...
> mark_buffer_dirty(bh);
>
> with part of the data already in the right place (ie the current
> "inode->u.ext2_i.i_data[block]" wouldn't exist, it would just exist as
> "raw_inode->i_block[block]" directly in the buffer block.
I'd then be able to write a trivial program that would eat inode+blocksize
worth of cache for each cached inode, by opening one file on each itable
block.
I'd also regret losing the genericity that comes from the read_inode (unpack)
and update_inode (repack) abstraction. Right now, I don't see any fields in
_info that aren't directly copied, but I expect there soon will be.
An alternative approach: suppose we were to map the itable blocks with
smaller-than-blocksize granularity. We could then fall back to smaller
transfers under cache pressure, eliminating much thrashing.
By the way, we can trivially shrink every inode by 6 bytes, right now, with:
- __u32 i_faddr;
- __u8 i_frag_no;
- __u8 i_frag_size;
--
Daniel
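For what it's worth, the kind of "trivial program" meant here might look like
the sketch below: stat everything in a directory, sort by inode number, then
hold open one file out of every INODES_PER_BLOCK so each pinned itable block
backs exactly one in-core inode. The 32 inodes/block figure (4096-byte blocks,
128-byte inodes) and the fixed table size are assumptions for the example.

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define INODES_PER_BLOCK 32     /* assumed: 4096-byte blocks, 128-byte inodes */
#define MAXFILES 65536          /* arbitrary cap for the example */

struct entry { ino_t ino; char name[256]; };

static int by_ino(const void *a, const void *b)
{
        const struct entry *x = a, *y = b;
        return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
        static struct entry e[MAXFILES];
        const char *dir = argc > 1 ? argv[1] : ".";
        struct dirent *de;
        DIR *d;
        int n = 0, i;

        if (!(d = opendir(dir)) || chdir(dir))
                return 1;
        while (n < MAXFILES && (de = readdir(d)))
                if (de->d_name[0] != '.') {
                        e[n].ino = de->d_ino;
                        snprintf(e[n].name, sizeof(e[n].name), "%s", de->d_name);
                        n++;
                }
        qsort(e, n, sizeof(*e), by_ino);

        /* hold open one file per itable block, then just sit on the cache */
        for (i = 0; i < n; i += INODES_PER_BLOCK)
                open(e[i].name, O_RDONLY);
        pause();
        return 0;
}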
Daniel Phillips wrote:
> Linus wrote:
> > So "ext2_write_inode()" would basically become something like
> >
> > struct ext2_inode *raw_inode = inode->u.ext2_i.i_raw_inode;
> > struct buffer_head *bh = inode->u.ext2_i.i_raw_bh;
> >
> > /* Update the stuff we've brought into the generic part of the inode */
> > raw_inode->i_size = cpu_to_le32(inode->i_size);
> > ...
> > mark_buffer_dirty(bh);
> >
> > with part of the data already in the right place (ie the current
> > "inode->u.ext2_i.i_data[block]" wouldn't exist, it would just exist as
> > "raw_inode->i_block[block]" directly in the buffer block.
note we do this for the superblock already, and it's pretty useful
> I'd then be able to write a trivial program that would eat inode+blocksize
> worth of cache for each cached inode, by opening one file on each itable
> block.
you already have X overhead per inode cached... yes this would increase
X but since there is typically more than one inode per block there is
also sharing. So inode+blocksize is not true.
> I'd also regret losing the genericity that comes from the read_inode (unpack)
> and update_inode (repack) abstraction.
so what is write_inode... re-repack? :)
> Right now, I don't see any fields in
> _info that aren't directly copied, but I expect there soon will be.
i_data[] is copied, and that would be nice to directly access in
inode->u.ext2_i.i_bh...
Also, my ibu fs (you can look at it now in gkernel CVS) uses a
fixed inode size of 512 bytes, with file or extent data packed into that
512 bytes after the fixed header ends. Having the bh right there would
be nice. [note there shouldn't be aliasing problems related to that in
ibu's case, because when data-in-inode is implemented readpage and
writepage handle that case anyway]
> An alternative approach: suppose we were to map the itable blocks with
> smaller-than-blocksize granularity. We could then fall back to smaller
> transfers under cache pressure, eliminating much thrashing.
In ibu fs the entire inode table[1] is accessed via the page cache.
ext2 could do this too. If ext2's per-block-group inode table has
padding at the end, the page calculations get a bit more annoying, but it's
still doable.
Jeff
[1] ibu's inode table is a normal, potentially-fragmented file; thus it
is possibly broken up into chunks spread across the disk, like ext2's block
groups.
--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno
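A rough picture of the arithmetic Jeff alludes to, assuming an ext2-style
layout; the structure and names below are invented for the illustration and
PAGE_SIZE is taken as 4096:

/* Find the page and offset of an on-disk inode, if a group's inode table
 * were accessed through the page cache.  Illustration only. */
struct itable_layout {
        unsigned long inodes_per_group;
        unsigned long inode_size;               /* 128 for classic ext2 */
        unsigned long block_size;
        unsigned long itable_start_block;       /* this group's first itable block */
};

static void locate_inode(const struct itable_layout *l, unsigned long ino,
                         unsigned long *page_index, unsigned long *page_offset)
{
        unsigned long index = (ino - 1) % l->inodes_per_group;
        unsigned long byte  = l->itable_start_block * l->block_size
                            + index * l->inode_size;

        *page_index  = byte / 4096;
        *page_offset = byte % 4096;

        /* If the itable doesn't end on a page boundary (the "padding at the
         * end" case), the last page of a group is partial, which is the
         * annoyance mentioned above. */
}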
On Fri, 2001-12-07 at 07:51, Daniel Phillips wrote:
> I did try R5 in htree, and at least a dozen other hashes. R5 was the worst
> of the bunch, in terms of uniformity of distribution, and caused a measurable
> slowdown in Htree performance. (Not an order of magnitude, mind you,
> something closer to 15%.)
Did you try the ReiserFS teahash? I wrote it specifically to address
the issue you mentioned in the paper of an attacker deliberately
generating collisions; the intention was that each directory (or maybe
filesystem) have its own distinct hashing key.
J
On Sat, Dec 08, 2001 at 03:15:50AM +0300, Hans Reiser wrote:
> >>no, actually this is a problem for v3. stat data are time of creation
> >>ordered (very roughly speaking)
> >>and directory entries are hash ordered, meaning that ls -l suffers a
> >>major performance penalty.
> >
> >Yes, just remember that file-body ordering also has the same problem.
> >(ref the "find . -type f | xargs cat > /dev/null" test which I think
> >represents maildir performance pretty closely)
>
> So is this a deeply inherent drawback of offering readdir name orders
> that differ hugely from time of creation order?
It should not be, if:
* the cache was smart enough to detect that the user is reading all
the files in a directory and started reading files into memory
ahead of time. It sounds ugly, so I don't know if I like it.
or
* file-bodies were ordered by hash as well.
> I suppose we could use objectids based on the hash of the first
> assigned filename plus a 60 bit global to the FS counter....
>
> but it is too many bits I think. I think that using substantially less
> than the full hash of the name that is used for directory entry keys
> doesn't work.... Comments welcome.
The above would sort the file-bodies in optimal order in the tree, but
block-allocation is a separate issue, right? Even if block-allocation
would take objectids into account it would be nearly impossible to get
the optimal order of the file-bodies, because you don't know the number
of files and their sizes before allocating the blocks for the first
files. (unless you would move files around after they are allocated)
So, I think the _only_ way to get the optimal performance for a growing
directory is to do allocation and ordering by creation time.
That said, maybe trying to find algorithms that order sub-sets of the
directory's entries optimally, rather than the whole directory, is more
constructive? Or look at repackers or other utilities to
reorder data in batch instead?
So how is this done in ext2/3 with directory indexes, Daniel? Are there
any papers available, or would I have to decipher the source?
--
Ragnar Kjørstad
Big Storage
Ragnar Kjørstad wrote:
>
>
>So, I think the _only_ way to get the optimal performance for a growing
>directory is to do allocation and ordering by creating-time.
>
>
We could set the key to the starting packing locality plus starting name
hash, check to see if object with that key already exists, and then if
it does already exist we use a generation counter as originally planned
(though now it must start at some number large enough to avoid collision
with the previous technique, which can happen because generation
counters soak up some bits). This way in most practical situations (the
99% case where you don't have lots of files all created with the same
name in the same directory and renamed to a variety of other things) we
win performance-wise. For the 1% case, we can merely perform as well as
we do now. Comments? Maybe this could work..... Hate being slower
than ext2 at ANYTHING.....;-)
I wonder if Daniel is showing that the cost of our having to slide a
whole node sideways for every directory entry insertion is significant.
I'd better wait for some benchmarks before concluding. Leaving
airholes in directories is one of those optimizations we are putting off
until after v4 is very stable (which means fully coded;-) ).
Daniel, you didn't mention though whether leaking collision bits is a
problem for Htrees. Is it? Do you need to rehash every so often to
solve it? Or is it rare enough that the performance cost can be ignored?
Interesting work you do Daniel, good work.
Hans
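If I read the proposal right, the assignment would go roughly like the sketch
below; objectid_in_use() and next_generation_counter() are purely hypothetical
helpers, and the way locality and hash are combined is only a guess at the
intent:

#define GEN_BASE 0x80000000UL   /* assumed: above any hash-derived key */

static unsigned long pick_objectid(unsigned long packing_locality,
                                   unsigned long starting_name_hash)
{
        unsigned long key = packing_locality + starting_name_hash;

        /* the ~99% case: the hash-derived key is free, so readdir order
         * and stat-data order roughly agree */
        if (!objectid_in_use(key))
                return key;

        /* the ~1% case: fall back to a generation counter starting high
         * enough that it cannot collide with hash-derived keys */
        return next_generation_counter(GEN_BASE);
}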
Daniel Phillips wrote:
>
>By the way, we can trivially shrink every inode by 6 bytes, right now, with:
>
>- __u32 i_faddr;
>- __u8 i_frag_no;
>- __u8 i_frag_size;
>
>--
>Daniel
Using a union of filesystems, that might not even be compiled into the
kernel or as modules, in struct inode is just..... bad.
It is really annoying when the filesystems with larger inodes bloat up
the size for those who are careful with their bytes; can we do something
about that generally?
(There are quite a variety of ways to do something about it, if there is
a will.) I have programmers who come to me asking for permission to add
bloat to our part of struct inode, and when they point out that
it does no good to save on bytes unless ext2 does, well..... what can I say?
Hans
On Sat, Dec 08, 2001 at 11:28:00PM +0300, Hans Reiser wrote:
> Using a union of filesystems, that might not even be compiled into the
> kernel or as modules, in struct inode is just..... bad.
>
> It is really annoying when the filesystems with larger inodes bloat up
> the size for those who are careful with their bytes, can we do something
> about that generally?
I believe it has been decided to solve this either by:
* including a filesystem-specific pointer in the general inode
or
* have the filesystem build the inode, and include all the general inode
variables in its data structure.
If it's not already done in 2.5 I think it's just a question of time.
--
Ragnar Kjørstad
Big Storage
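The second option can be sketched like this: each filesystem embeds the
generic inode in its own structure and recovers its private part with a
container_of()-style offset calculation, so only filesystems actually in use
pay for their private fields. The names are illustrative, not from any
particular tree; the VFS would ask the filesystem to allocate and free these
objects.

struct myfs_inode_info {
        __u32           i_data[15];     /* fs-private state */
        __u32           i_flags;
        struct inode    vfs_inode;      /* generic part, embedded */
};

static inline struct myfs_inode_info *MYFS_I(struct inode *inode)
{
        /* recover the containing fs-private structure from the generic inode */
        return container_of(inode, struct myfs_inode_info, vfs_inode);
}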
On December 8, 2001 07:02 pm, Jeremy Fitzhardinge wrote:
> On Fri, 2001-12-07 at 07:51, Daniel Phillips wrote:
> > I did try R5 in htree, and at least a dozen other hashes. R5 was the
> > worst of the bunch, in terms of uniformity of distribution, and caused a
> > measurable slowdown in Htree performance. (Not an order of magnitude,
> > mind you, something closer to 15%.)
>
> Did you try the ReiserFS teahash? I wrote it specifically to address
> the issue you mentioned in the paper of an attacker deliberately
> generating collisions; the intention was that each directory (or maybe
> filesystem) have its own distinct hashing key.
Yes, I tried every hash in ReiserFS. Please have a look at Larry McVoy's
'linear congruential' hash in the bitkeeper code. It's decent. In fact, the
only good hashes I've found after trolling all over the internet are that
one, and the one I wrote, both based on combining a sequence of characters
with a well-known pseudo-random number generation technique.
--
Daniel
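To illustrate the style of hash being talked about (not the actual htree or
BitKeeper code), folding each character into a linear congruential step looks
something like the sketch below; the multiplier and increment are the usual
Numerical Recipes LCG constants and the seed is arbitrary:

static unsigned int lcg_name_hash(const char *name, int len)
{
        unsigned int hash = 0x0badcafe;         /* arbitrary seed */

        while (len--)
                /* fold in the next character, then run one LCG step */
                hash = (hash + (unsigned char)*name++) * 1664525u + 1013904223u;

        return hash;
}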
On December 8, 2001 08:16 pm, Ragnar Kjørstad wrote:
> On Sat, Dec 08, 2001 at 03:15:50AM +0300, Hans Reiser wrote:
> > >>no, actually this is a problem for v3. stat data are time of creation
> > >>ordered (very roughly speaking)
> > >>and directory entries are hash ordered, meaning that ls -l suffers a
> > >>major performance penalty.
> > >
> > >Yes, just remember that file-body ordering also has the same problem.
> > >(ref the "find . -type f | xargs cat > /dev/null" test which I think
> > >represents maildir performance pretty closely)
> >
> > So is this a deeply inherent drawback of offering readdir name orders
> > that differ hugely from time of creation order?
>
> It should not be, if:
> * the cache was smart enough to detect that the user is reading all
> the files in a directory and started reading files into memory
> ahead of time. It sounds ugly, so I don't know if I like it.
> or
> * file-bodies were ordered by hash as well.
>
> > I suppose we could use objectids based on the hash of the first
> > assigned filename plus a 60 bit global to the FS counter....
> >
> > but it is too many bits I think. I think that using substantially less
> > than the full hash of the name that is used for directory entry keys
> > doesn't work.... Comments welcome.
>
> The above would sort the file-bodies in optimal order in the tree, but
> block-allocation is a separate issue, right? Even if block-allocation
> would take objectids into account it would be nearly impossible to get
> the optimal order of the file-bodies, because you don't know the number
> of files and their sizes before allocating the blocks for the first
> files. (unless you would move files around after they are allocated)
>
> So, I think the _only_ way to get the optimal performance for a growing
> directory is to do allocation and ordering by creation time.
>
> That said, maybe trying to find algorithms that order sub-sets of the
> directory's entries optimally, rather than the whole directory, is more
> constructive? Or look at repackers or other utilities to
> reorder data in batch instead?
>
> So how is this done in ext2/3 with directory indexes, Daniel? Are there
> any papers available, or would I have to decipher the source?
You should find this useful:
http://people.nl.linux.org/~phillips/htree/paper/htree.html
http://people.nl.linux.org/~phillips/htree/htree.ps.gz
The coherency between inode order and file body order is handled for me
in the existing Ext2 code base. I haven't really dug into that algorithm but
it seems to produce serviceable results. Note: Al Viro has taken a look at
improving that code. It's an ongoing project that's been discussed on lkml
and ext2-devel.
As far as coherency between readdir order and inode order goes, I've just
left that dangling for the moment. This doesn't hurt until we get over a
million files/directory, and then it doesn't hurt an awful lot. As I
mentioned earlier, I think the increased table thrashing exhibited over the
million file mark is more because of shortcomings in icache handling than
anything else.
In the long run I plan to do some work on inode allocation targets to improve
the correspondence between readdir order and inode order.
--
Daniel
On December 8, 2001 08:55 pm, Hans Reiser wrote:
> Daniel, you didn't mention though whether leaking collision bits is a
> problem for Htrees. Is it? Do you need to rehash every so often to
> solve it? Or is it rare enough that the performance cost can be ignored?
It's a problem of course, and I put considerable effort into handling it. I
never have to rehash because the htree isn't a hash table, it's a tree of
hash ranges. I explicitly flag every instance where a collision could force
a search to continue in a successor block. Even though it's very rare, it
does happen and it has to be handled. This is described in some detail in my
paper.
After all the effort that went into this, the result was just a few lines of
code, which was nice.
--
Daniel
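For readers who haven't gone through the paper, the lookup side of that
handling can be sketched roughly as below. The structures, the flag encoding
and the helpers find_range() and search_leaf() are invented for the
illustration; the real code is in the paper and the patch.

#define CONT_FLAG 1     /* low bit of a stored hash marks a possible continuation */

struct index_entry {
        __u32 hash;     /* start of this entry's hash range (low bit = flag) */
        __u32 block;    /* leaf block covering that range */
};

static int indexed_lookup(struct index_entry *index, int count,
                          __u32 hash, const char *name, int namelen)
{
        int i = find_range(index, count, hash);  /* hypothetical binary search */

        for (;;) {
                if (search_leaf(index[i].block, name, namelen))
                        return 1;                /* found */
                /* continue into the successor block only if it is flagged as
                 * possibly holding more entries with this same hash */
                i++;
                if (i >= count || !(index[i].hash & CONT_FLAG)
                    || (index[i].hash & ~CONT_FLAG) != (hash & ~CONT_FLAG))
                        return 0;
        }
}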
On December 8, 2001 06:54 pm, Jeff Garzik wrote:
> Daniel Phillips wrote:
> > Linus wrote:
> > > So "ext2_write_inode()" would basically become something like
> > >
> > > struct ext2_inode *raw_inode = inode->u.ext2_i.i_raw_inode;
> > > struct buffer_head *bh = inode->u.ext2_i.i_raw_bh;
> > >
> > > /* Update the stuff we've brought into the generic part of the inode */
> > > raw_inode->i_size = cpu_to_le32(inode->i_size);
> > > ...
> > > mark_buffer_dirty(bh);
> > >
> > > with part of the data already in the right place (ie the current
> > > "inode->u.ext2_i.i_data[block]" wouldn't exist, it would just exist as
> > > "raw_inode->i_block[block]" directly in the buffer block.
>
> note we do this for the superblock already, and it's pretty useful
The difference is, there's only one superblock per mount. There are
bazillions of inodes.
> > I'd then be able to write a trivial program that would eat inode+blocksize
> > worth of cache for each cached inode, by opening one file on each itable
> > block.
>
> you already have X overhead per inode cached... yes this would increase
> X but since there is typically more than one inode per block there is
> also sharing as well. So inode+blocksize is not true.
You skipped over my example too fast.
> > I'd also regret losing the genericity that comes from the read_inode (unpack)
> > and update_inode (repack) abstraction.
>
> so what is write_inode... re-repack? :)
It's a trivial shell for update_inode:
http://innominate.org/~graichen/projects/lxr/source/fs/ext2/inode.c?v=v2.4#L1146
> > Right now, I don't see any fields in
> > _info that aren't directly copied, but I expect there soon will be.
>
> i_data[] is copied, and that would be nice to directly access in
> inode->u.ext2_i.i_bh...
Yes, that's the major one, it's 60 bytes, more than the other _info fields
put together. However, almost half the itable block data is going to be
redundant, and the proposal as I understand it is to lock it in cache while
the inode is in cache. This makes things worse, not better - it reduces the
total number of inodes that can be cached. And that's the best case,
when *all* the inodes on an itable block are in cache. Take a look at the
inode distribution in your directories and see if you think that's likely.
> > An alternative approach: suppose we were to map the itable blocks with
> > smaller-than-blocksize granularity. We could then fall back to smaller
> > transfers under cache pressure, eliminating much thrashing.
>
> in ibu fs the entire inode table[1] is accessing via the page cache.
> ext2 could do this too. If ext2's per-block-group inode table has
> padding at the end page calculations get a bit more annoying but it's
> still doable.
That's roughly what I had in mind, for starters.
It's worth keeping in mind that tweaking the icache efficiency in this case
is really just curing a symptom - the underlying problem is a mismatch
between readdir order and inode order.
--
Daniel
On Sun, 9 Dec 2001, Daniel Phillips wrote:
>
> The difference is, there's only one superblock per mount. There are
> bazillions of inodes.
.. and they share buffers.
I bet that having a pointer to the buffer will _reduce_ average
footprint rather than increase it. The inodes are fairly densely
laid out in the buffers, so I dare you to find any real-world (or even
very contrived) usage patterns where it ends up being problematic.
Remember: we'd save 15*4=60 bytes per inode, at the cost of pinning the
block the inode is in. But _usually_ we'd have those blocks in memory
anyway, especially if the inode gets touched (which dirties it, and
updates atime, which forces us to do writeback). For the initial IO we
obviously _have_ to have them in memory.
And we'll get rid of inodes under memory pressure too, so the pinning will
go away when memory is really tight. Yes, you can try to "attack" the
machine by trying to open all the right inodes, but the basic attack is
there already. ulimit is your friend.
> It's worth keeping in mind that tweaking the icache efficiency in this case
> is really just curing a symptom - the underlying problem is a mismatch
> between readdir order and inode order.
Well, the inode writeback read-modify-write synchronization and related
efficiency problems actually have nothing to do with the readdir order.
Linus
> Remember: we'd save 15*4=60 bytes per inode, at the cost of pinning the
> block the inode is in. But _usually_ we'd have those blocks in memory
> anyway, especially if the inode gets touched (which dirties it, and
> updates atime, which forces us to do writeback). For the initial IO we
> obviously _have_ to have them in memory.
Surely the buffer has the on-disk inode format, not the fast-to-handle
in-processor inode format?
Alan
On December 9, 2001 05:19 am, Linus Torvalds wrote:
> On Sun, 9 Dec 2001, Daniel Phillips wrote:
> >
> > The difference is, there's only one superblock per mount. There are
> > bazillions of inodes.
>
> .. and they share buffers.
Not consistently, though this is something that could and should be improved.
Try "ls -i | sort" to see what I mean.
> I bet that having a pointer to the buffer will _reduce_ average
> average footprint rather than increase it. The inodes are fairly densely
> laid out in the buffers, so I dare you to find any real-world (or even
> very contrieved) usage patterns where it ends up being problematic.
I didn't have to look any further than my laptop. /home and /etc look like
near-worst cases (because of slow growth and doubtful inode allocation
policy); /bin looks pretty close to the best case, but still it's very rare
to find a full 32 inodes densely allocated. The tradeoff is this: for each
itable block member that happens to be in cache simultaneously you can save
68 bytes by pinning the itable block buffer and pointing directly at it,
while incurring the cache cost of pinning an additional 60 bytes that are
never accessed outside read/update_inode. For each itable block member that
isn't in cache (perhaps because it wasn't allocated) you pin an extra 128
bytes. The break-even is 6 uncached inodes/block - any more and you're
pinning more cache than you're saving.
The case that hurts is where inode access order is non-coherent, which is the
case we're trying to fix. If inode access order *is* coherent, you don't
have a problem because once you're done with an entire itable block it can be
evicted and not read back any time soon.
You're also touching an extra cacheline/inode by following the pointer, for
what it's worth [not much, because we're doing get_block/truncate when
following the link, making the cacheline hit look microscopic].
Continuing the little warts list, there's Alan's comment re needing endian
reversal on big endian machines. I'm also looking at the locking
suspiciously. Directly operating on the buffer, it looks like we can happily
change the data even in the middle of a DMA. This is murky stuff, I'd be
happy to be corrected on it. I wonder what Ext3 impact we'd have here?
> Remember: we'd save 15*4=60 bytes per inode, at the cost of pinning the
> block the inode is in. But _usually_ we'd have those blocks in memory
> anyway, especially if the inode gets touched (which dirties it, and
> updates atime, which forces us to do writeback).
So by always pinning the itable block you'd be eroding the benefit that can
be obtained through inhibiting atime updating. Even with atime, at least
it's only the in-memory inode being dirtied. When it comes time to update,
itable blocks can be read in as necessary, updated and evicted as the icache
continues to exert pressure. Caveat: I'm rapidly wandering off into obscure
cache balancing issues here, little understood and probably poorly handled.
> For the initial IO we obviously _have_ to have them in memory.
>
> And we'll get rid of inodes under memory pressure too, so the pinning will
> go away when memory is really tight. Yes, you can try to "attack" the
> machine by trying to open all the right inodes, but the basic attack is
> there already. ulimit is your friend.
Unfortunately, it's not an attack, it's a not-so-rare real life situation -
not to the extreme of my example, but close enough to hurt.
> > It's worth keeping in mind that tweaking the icache efficiency in this
> > case is really just curing a symptom - the underlying problem is a
> > mismatch between readdir order and inode order.
>
> Well, the inode writeback read-modify-write synchronization and related
> efficiency problems actually has nothing to do with the readdir order.
I've measured the effect. Ignoring seeks, the worst case access order is:
hit_inum(i%n*m + i/n)
where n is the number of itable blocks and m is inodes/block, i.e., for each
itable block, access a single inode on it, repeat until all inodes have been
accessed. Here, pinning inode blocks is going to multiply the number of
inode evictions, even if we set noatime. Cases that approach this degree of
non-coherence aren't rare at all.
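Spelled out, that is the order generated by the toy loop below: one inode from
each itable block, then wrap around and take the next inode from each block,
so consecutive accesses never share a block (n and m are tiny here just to
make the pattern visible):

#include <stdio.h>

int main(void)
{
        int n = 4, m = 8, i;    /* 4 itable blocks, 8 inodes per block */

        for (i = 0; i < n * m; i++)
                printf("hit inode %2d  (itable block %d)\n",
                       (i % n) * m + i / n,     /* hit_inum(i%n*m + i/n) */
                       i % n);
        return 0;
}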
The bottom line is, I think you can find cases where your strategy for
reducing double caching saves a little, whereas I can find case where it
hurts a lot.
What if we could do inode-granular transfers through an inode address_space?
This would eliminate the double caching entirely, because we're always able to
generate the full on-disk inode from the cached inode. Then it costs little
to evict the inode buffer immediately after every transfer. It seems to me
that the cost of this would be just a table of states per address_space unit,
plus some fiddling with bio, which we're doing anyway.
--
Daniel
On Sun, 9 Dec 2001, Daniel Phillips wrote:
>
> Continuing the little warts list, there's Alan's comment re needing endian
> reversal on big endian machines.
Now that's a load of bollocks.
We _already_ keep the in-memory data in "disk format", and for a very
simple reason: that way we can naturally share all the functions that take
a pointer to a block tree, and they don't need to care whether the block
numbers come from a disk buffer or from the inode.
Which means that we have only _one_ set of routines for handling block
allocation etc, instead of duplicating them all.
Having in-core data in CPU-native byte order is _stupid_. We used to do
that, and I winced every time I had to look at all the duplication of
functions. I think it was early 2.3.x when Ingo did the page cache write
stuff where I fixed that - the people who had done the original ext2
endianness patches were just fairly lacking in brains (Hi, Davem ;), and
didn't realize that keeping inode data in host order was the fundamental
problem that caused them to have to duplicate all the functions.
So the _wart_ is in 2.2.x, which is just stupid and ugly, and keeps block
numbers in host data format - which causes no end of trouble. 2.4.x
doesn't have this problem, and could easily have a pointer to the on-disk
representation.
Linus
On Sun, 9 Dec 2001, Linus Torvalds wrote:
> Having in-core data in CPU-native byte order is _stupid_. We used to do
> that, and I winced every time I had to look at all the duplication of
> functions. I think it was early 2.3.x when Ingo did the page cache write
> stuff where I fixed that - the people who had done the original ext2
> endianness patches were just fairly lacking in brains (Hi, Davem ;), and
> didn't realize that keeping inode data in host order was the fundamental
> problem that caused them to have to duplicate all the functions.
>
> So the _wart_ is in 2.2.x, which is just stupid and ugly, and keeps block
> numbers in host data format - which causes no end of trouble. 2.4.x
> doesn't have this problem, and could easily have a pointer to the on-disk
> representation.
As an aside, anybody who uses SWAB... should cut down on drugs.
Code like
native_endian = fs_endian;
if (need_swap)
native_endian = SWAB16(fs_endian);
or, worse yet, same with ifdef instead of if is fscking braindead.
_Never_ treat data from IO as numeric. Incompatible by assignment. If
you have a little-endian data and CPU and your foo_to_cpu() is identity
mapping - fine, but keep that fact in definition of foo_to_cpu().
Endian-neutral code is _easy_ - you need to try real hard to write something
that would be endian-dependent. All you need is to ask yourself "is it
a number or a piece of on-disk data?" and then stick to that. And these
pointers in inode are obvious pieces of on-disk data - no bloody questions
on that.
For real horror look at 2.2 UFS. At least in ext2 DaveM et al. had enough
sense to use le32_to_cpu() and friends. In UFS it was SWAB...() abortion
all over the place.
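In miniature, the contrast being drawn looks like this (SWAB32 stands in for
whatever byte-swapping macro is at hand): the first form treats the on-disk
field as a number and patches it up afterwards; the second makes the
conversion part of every access, and le32_to_cpu()/cpu_to_le32() compile to
nothing on little-endian hosts.

        /* the pattern being objected to */
        blocknr = raw_inode->i_block[n];
#ifdef __BIG_ENDIAN
        blocknr = SWAB32(blocknr);
#endif

        /* the endian-neutral way: the on-disk field is never used as a
         * number without going through the conversion, on any architecture */
        blocknr = le32_to_cpu(raw_inode->i_block[n]);
        raw_inode->i_size = cpu_to_le32(inode->i_size);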
> > Continuing the little warts list, there's Alan's comment re needing endian
> > reversal on big endian machines.
>
> Now that's a load of bollocks.
Inodes are in native format with lots of unpacked additional info so we
can't just point into the inode. We can certainly do things like writeback
all the inodes in a block when the block has to go for queueing to disk.
Indirect blocks are a totally unrelated item to the discussion I was having
at least.
Alan
On December 10, 2001 07:27 am, Linus Torvalds wrote:
> On Sun, 9 Dec 2001, Daniel Phillips wrote:
> >
> > Continuing the little warts list, there's Alan's comment re needing endian
> > reversal on big endian machines.
>
> Now that's a load of bollocks.
>
> We _already_ keep the in-memory data in "disk format", and for a very
> simple reason: that way we can naturally share all the functions that take
> a pointer to a block tree, and they don't need to care whether the block
> numbers come from a disk buffer or from the inode.
Yes, as the length of an email approaches infinity the chance of saying
something stupid approaches certainty ;)
> Which means that we have only _one_ set of routines for handling block
> allocation etc, instead of duplicating them all.
I don't know how bad it was in the old code, but now the little-endian data is
confined to get_block/truncate, as it should be. However, reading ahead...
> Having in-core data in CPU-native byte order is _stupid_. We used to do
> that, and I winced every time I had to look at all the duplication of
> functions. I think it was early 2.3.x when Ingo did the page cache write
> stuff where I fixed that - the people who had done the original ext2
> endianness patches were just fairly lacking in brains (Hi, Davem ;), and
> didn't realize that keeping inode data in host order was the fundamental
> problem that caused them to have to duplicate all the functions.
I'm not proposing to change this, but I would have chosen Davem's approach in
this case, with the aid of:
typedef struct { u32 value; } le_u32;
This is a no-op for Intel, and it would make things nicer for non-intel
arches, for what that's worth. But it seems I've stepped into an old
flamewar, so I'm getting out while my skin is still non-crispy ;)
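What the wrapper buys, in a short sketch: the on-disk value can no longer be
used as a plain number by accident, because the compiler rejects the
assignment, so the conversion can never be silently forgotten. The accessor
names are made up for the example.

typedef struct { __u32 value; } le_u32;

static inline __u32 get_le_u32(le_u32 x)
{
        return le32_to_cpu(x.value);
}

static inline le_u32 set_le_u32(__u32 x)
{
        le_u32 r;
        r.value = cpu_to_le32(x);
        return r;
}

/* With raw_inode->i_size declared as le_u32:
 *      size = raw_inode->i_size;               -> compile error
 *      size = get_le_u32(raw_inode->i_size);   -> conversion is forced
 */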
> So the _wart_ is in 2.2.x, which is just stupid and ugly, and keeps block
> numbers in host data format - which causes no end of trouble. 2.4.x
> doesn't have this problem, and could easily have a pointer to the on-disk
> representation.
Probably what I should have said is that *I* plan to do some conversions
between on-disk and in-memory format for a patch I'm working on, that it's a
natural thing to do, and that tying the in-memory inode straight to the disk
buffer interferes with that. On the other hand, if the direct pointer really
is a cache win, it would trump the layering issue.
--
Daniel