2001-02-20 19:02:58

by Daniel Phillips

Subject: [rfc] Near-constant time directory index for Ext2

Earlier this month a runaway installation script decided to mail all its
problems to root. After a couple of hours the script aborted, having
created 65535 entries in Postfix's maildrop directory. Removing those
files took an awfully long time. The problem is that Ext2 does each
directory access using a simple, linear search through the entire
directory file, resulting in n**2 behaviour to create/delete n files.
It's about time we fixed that.

Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
for an Ext2 directory index, including the following points:

- Fixed-size hash keys instead of names in the index
- Leaf blocks are normal ext2 directory blocks
- Leaf blocks are sequential, so readdir doesn't have to be changed

Having thought about it on and off since then, I came up with the
following additional design elements:

- Logical addressing
The cost of logical addressing of disk blocks is scarcely higher
than physical addressing, and logical access through the page cache
is actually faster than physical addressing because you don't have
to traverse a tree: you can go straight to the logical block you are
interested in and only traverse the tree if it's not there. The
payoff in terms of not breaking Ext2's existing allocation strategy
is huge, not to mention friendliness to tools such as e2fsck and
e2resize. Finally, logical addressing means Tux2 will support
this feature without modification. :-)

- 8 bytes is sufficient for an index entry
This gives a branching factor of 512 for 4K filesystem blocks
resulting in log512(n) access time, performance that is almost
indistinguishable from constant-time. The 8 bytes allows for a 32
bit hash key (31 bits actually used, see below) and a 4 byte
logical block number, both sufficient for handling billions of
directory entries.

- Uniform-depth tree
Usually, some form of balanced tree is used for a directory index.
I found that a simple, uniform-depth tree provides equivalent
performance with far simpler code. Just two tree levels can handle
millions of directory entries, and for all practical purposes,
such a tree is never out of balance.

So to give this a name, it's a "uniform-depth hash tree", or htree for
short. (It's not a btree.) Such a structure inherits properties of
both trees and hash tables. From the hash table side, the htree
inherits the advantage of compact, fixed-size keys which gives a high
branching factor and enables the use of binary search in interior index
nodes.

It also inherits a big disadvantage of hash tables: key collisions.
Though rare, collisions give rise to a number of corner cases that
are not particularly easy to deal with. (see below)

Index Structure
---------------

The root of the index tree is in the 0th block of the file. Space is
reserved for a second level of the index tree in blocks 1 though 511
(for 4K filesystem blocks). Directory leaf blocks are appended
starting at block 512, thus the tail of the directory file looks like a
normal Ext2 directory and can be processed directly by ext2_readdir.
For directories with less than about 90K files there is a hole running
from block 1 to block 511, so an empty directory has just two blocks in
it, though its size appears to be about 2 Meg in a directory listing.

So a directory file looks like:

0: Root index block
1: Index block/0
2: Index block/0
...
511: Index block/0
512: Dirent block
513: Dirent block
...

Each index block consists of 512 index entries of the form:

hash, block

where hash is a 32 bit hash with a collision flag in its least
significant bit, and block is the logical block number of an index or
leaf block, depending on the tree level.

The hash value of the 0th index entry isn't needed because it can
always be obtained from the level above, so it is used to record the
count of index entries in an index block. This gives a nice round
branching factor of 512; the evenness is a nicety that mainly
satisfies my need for regularity rather than winning any real
performance. (On the other hand, the largeness of the branching factor
matters a great deal.)
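
To make that concrete, here is a minimal user-space sketch of the entry
layout and of the binary search the structure permits (the names and
helpers are mine, not the patch's, and the patch below still uses a
linear scan of the root):

#include <stdint.h>

/* One 8-byte index entry; 512 of them fit in a 4K block. */
struct dx_entry {
	uint32_t hash;		/* 31-bit hash, low bit is the collision flag */
	uint32_t block;		/* logical block of an index or leaf block */
};

/* Entry 0's hash field is reused to hold the count of entries. */
static inline unsigned dx_count(const struct dx_entry *entries)
{
	return entries[0].hash;
}

/* Find the last entry whose hash is <= the search hash.  Entry 0,
 * having no hash of its own, acts as a "minus infinity" sentinel, so
 * the search can never fall off the front of the block. */
static const struct dx_entry *dx_find(const struct dx_entry *entries,
				      uint32_t hash)
{
	unsigned lo = 1, hi = dx_count(entries);

	while (lo < hi) {
		unsigned mid = lo + (hi - lo) / 2;
		if (entries[mid].hash > hash)
			hi = mid;
		else
			lo = mid + 1;
	}
	return entries + lo - 1;
}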

The root index block has the same format as the other index blocks,
with its first 8 bytes reserved for a small header:

1 byte header length (default: 8)
1 byte index type (default: 0)
1 byte hash version (default: 0)
1 byte tree depth (default: 1)
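
In C terms the header might be laid out something like this (purely
illustrative; the patch below gets by with a bare 32-bit entry count in
its dx_root_info instead, and the reserved padding is my assumption):

#include <stdint.h>

/* 8-byte root header, so it occupies exactly one index entry slot and
 * the root still holds 511 full entries. */
struct dx_root_header {
	uint8_t  header_len;	/* default: 8 */
	uint8_t  index_type;	/* default: 0 */
	uint8_t  hash_version;	/* default: 0 */
	uint8_t  tree_depth;	/* default: 1 */
	uint32_t reserved;
};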

The treatment of the header differs slightly in the attached patch. In
particular, only a single level of the index tree (the root) is
implemented here. This turns out to be sufficient to handle more than
90,000 entries, so it is enough for today. When a second level is
added to the tree, capacity will increase to somewhere around 50
million entries, and there is nothing preventing the use of n levels,
should there ever be a reason. It's doubtful that a third level
will ever be required, but if it is, the design provides for it.

Lookup Algorithm
----------------

Lookup is straightforward:

- Compute a hash of the name
- Read the index root
- Use binary search (linear in the current code) to find the
first index or leaf block that could contain the target hash
(in tree order)
- Repeat the above until the lowest tree level is reached
- Read the leaf directory entry block and do a normal Ext2
directory block search in it.
- If the name is found, return its directory entry and buffer
- Otherwise, if the collision bit of the next directory entry is
set, continue searching in the successor block

Normally, two logical blocks of the file will need to be accessed, and
one or two metadata index blocks. The effect of the metadata index
blocks can largely be ignored in terms of disk access time since these
blocks are unlikely to be evicted from cache. There is some small CPU
cost that can be addressed by moving the whole directory into the page
cache.
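
As a rough sketch of those steps in C (this is not the patch's
dx_probe/ext2_find_entry; the block-reading and leaf-scanning helpers
are hypothetical stand-ins for ext2_bread() and the usual dirent scan,
and error handling is omitted):

#include <stdint.h>
#include <stddef.h>

struct dx_entry { uint32_t hash, block; };	/* as sketched earlier */

/* Hypothetical helpers: read one logical block of the directory file,
 * and scan one leaf block for a name the normal Ext2 way. */
extern const char *dir_read_block(void *dir, uint32_t logical_block);
extern void *leaf_find_name(const char *block, const char *name, size_t len);

/* hash is the 31-bit name hash shifted left one bit (low bit clear). */
void *htree_lookup(void *dir, const char *name, size_t len, uint32_t hash)
{
	const struct dx_entry *entries =
		(const struct dx_entry *) dir_read_block(dir, 0);
	unsigned count = entries[0].hash;	/* entry 0 holds the count */
	const struct dx_entry *at = entries;	/* entry 0: minimal hash */

	/* Find the last index entry covering a hash <= the target
	 * (linear here, as in the current code). */
	while (at + 1 < entries + count && (at + 1)->hash <= hash)
		at++;

	for (;;) {
		void *de = leaf_find_name(dir_read_block(dir, at->block),
					  name, len);
		if (de)
			return de;
		/* Does the next index entry cover the same 31-bit hash?
		 * Then a collision chain continues in the successor block. */
		if (++at == entries + count || (at->hash & ~1u) != hash)
			return NULL;
	}
}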

Insert Algorithm
----------------

Insertion of new entries into the directory is considerably more
complex than lookup, due to the need to split leaf blocks when they
become full, and to satisfy the conditions that allow hash key
collisions to be handled reliably and efficiently. I'll just summarize
here:

- Probe the index as for lookup
- If the target leaf block is full, split it and note the block
that will receive the new entry
- Insert the new entry in the leaf block using the normal Ext2
directory entry insertion code.

The details of splitting and hash collision handling are somewhat
messy, but I will be happy to dwell on them at length if anyone is
interested.

Splitting
---------

In brief, when a leaf node fills up and we want to put a new entry into
it, the leaf has to be split, and its share of the hash space has to
be partitioned. The most straightforward way to do this is to sort the
entries by hash value and split somewhere in the middle of the sorted
list. This costs O(n log n) in the number of entries in the leaf, not a
great cost so long as an efficient sorter is used. I used Combsort
for this, although Quicksort would have been just as good in this
case since average case performance is more important than worst case.

An alternative approach would be just to guess a median value for the
hash key, and the partition could be done in linear time, but the
resulting poorer partitioning of hash key space outweighs the small
advantage of the linear partition algorithm. In any event, the number
of entries needing sorting is bounded by the number that fit in a leaf.
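
In sketch form (user-space C, standing in for the patch's
dx_make_map/COMBSORT code; the duplicate-hash adjustment anticipates
the next section):

#include <stdint.h>
#include <stdlib.h>

/* One slot per directory entry in the leaf being split. */
struct dx_map_entry { uint32_t hash, offs; };

static int cmp_hash(const void *a, const void *b)
{
	uint32_t x = ((const struct dx_map_entry *) a)->hash;
	uint32_t y = ((const struct dx_map_entry *) b)->hash;
	return x < y ? -1 : x > y;
}

/* Sort the map by hash and split near the middle, sliding the split
 * point off any run of identical hashes so the run isn't cut in two.
 * Entries [split..count) move to the new block; a result of 0 or count
 * means the block is one big hash run and the caller must handle it
 * (the patch just creates an empty new block in that case). */
static int choose_split(struct dx_map_entry *map, int count, uint32_t new_hash)
{
	int split = count / 2;

	qsort(map, count, sizeof(*map), cmp_hash);
	if (new_hash <= map[split].hash)
		while (split > 0 && map[split].hash == map[split - 1].hash)
			split--;
	else
		while (split < count && map[split].hash == map[split - 1].hash)
			split++;
	return split;
}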

Key Collisions
--------------

Some complexity is introduced by the need to handle sequences of hash
key collisions. It is desirable to avoid splitting such sequences
between blocks, so the split point of a block is adjusted with this in
mind. But the possibility still remains that if the block fills up
with identically-hashed entries, the sequence may still have to be
split. This situation is flagged by placing a 1 in the low bit of the
index entry that points at the successor block, which is naturally
interpreted by the index probe as an intermediate value without any
special coding. Thus, handling the collision problem imposes no real
processing overhead, just some extra code and a slight reduction in the
hash key space. The hash key space remains sufficient for any
conceivable number of directory entries, up into the billions.
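
Spelled out, the flag is just one reserved bit in the stored key (the
macro names are mine, not the patch's):

#include <stdint.h>

/* Bits 1..31 hold the name hash.  Bit 0, set in the index entry that
 * points at the successor block, means "a run of identical hashes
 * continues from the previous block into this one". */
#define DX_HASH_OF(x)		((uint32_t)(x) & ~1u)
#define DX_IS_CONTINUED(x)	((uint32_t)(x) & 1u)
#define DX_MARK_CONTINUED(x)	((uint32_t)(x) | 1u)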

Hash Function
-------------

The exact properties of the hash function critically affect the
performance of this indexing strategy, as I learned by trying a number
of poor hash functions, at times intentionally. A poor hash function
will result in many collisions or poor partitioning of the hash space.
To illustrate why the latter is a problem, consider what happens when a
block is split such that it covers just a few distinct hash values.
The probability of later index entries hashing into the same, small
hash space is very small. Once a block is split, if its
hash space is too small it tends to stay half full forever, an effect I
observed in practice.

After some experimentation I came up with a hash function that gives
reasonably good dispersal of hash keys across the entire 31 bit key
space. This improved the average fullness of leaf blocks considerably,
getting much closer to the theoretical average of 3/4 full.

But the current hash function is just a placeholder, waiting for
a better version based on some solid theory. I currently favor the
idea of using crc32 as the default hash function, but I welcome
suggestions.

Inevitably, no matter how good a hash function I come up with, somebody
will come up with a better one later. For this reason the design
allows for additional hash functions to be added, with backward
compatibility. This is accomplished simply, by including a hash
function number in the index root. If a new, improved hash function is
added, all the previous versions remain available, and previously
created indexes remain readable.
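
A sketch of what that dispatch could look like (the table and names are
illustrative; only the version-0 function roughly mirrors the
placeholder hash in the patch below):

#include <stdint.h>
#include <stddef.h>

typedef uint32_t (*dx_hash_fn)(const char *name, size_t len);

/* Version 0: the placeholder.  New versions get appended to the table
 * and old ones are never changed, so existing indexes stay readable. */
static uint32_t dx_hash_v0(const char *name, size_t len)
{
	uint32_t hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;

	while (len--) {
		uint32_t hash = hash1 + (hash0 ^ ((uint32_t) *name++ * 71523));
		hash1 = hash0;
		hash0 = hash;
	}
	return hash0 << 1;	/* low bit reserved for the collision flag */
}

static const dx_hash_fn dx_hash_fns[] = { dx_hash_v0 /*, dx_hash_v1, ... */ };

/* 'version' is the hash version byte read from the index root. */
static uint32_t dx_hash(unsigned version, const char *name, size_t len)
{
	return dx_hash_fns[version](name, len);
}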

Of course, the best strategy is to have a good hash function right from
the beginning. The initial, quick hack has produced results that
certainly have not been disappointing.

Performance
-----------

OK, if you have read this far then this is no doubt the part you've
been waiting for. In short, the performance improvement over normal
Ext2 has been stunning. With very small directories performance is
similar to standard Ext2, but as directory size increases standard
Ext2 quickly blows up quadratically, while htree-enhanced Ext2
continues to scale linearly.

Uli Luckas ran benchmarks for file creation in various sizes of
directories ranging from 10,000 to 90,000 files. The results are
pleasing: total file creation time stays very close to linear, versus
quadratic increase with normal Ext2.

Time to create:

                  Indexed        Normal
                  =======        ======
10000 Files:     0m1.350s     0m23.670s
20000 Files:     0m2.720s     1m20.470s
30000 Files:     0m4.330s      3m9.320s
40000 Files:     0m5.890s     5m48.750s
50000 Files:     0m7.040s     9m31.270s
60000 Files:     0m8.610s    13m52.250s
70000 Files:     0m9.980s    19m24.070s
80000 Files:    0m12.060s    25m36.730s
90000 Files:    0m13.400s    33m18.550s

A graph is posted at:

http://www.innominate.org/~phillips/htree/performance.png

All of these tests are CPU-bound, which may come as a surprise. The
directories fit easily in cache, and the limiting factor in the case of
standard Ext2 is looking up directory blocks in the buffer cache
and the low-level scan of directory entries. In the case of htree
indexing there are a number of costs to be considered, all of them
pretty well bounded. Notwithstanding, there are a few obvious
optimizations to be done:

- Use binary search instead of linear search in the interior index
nodes.

- If there is only one leaf block in a directory, bypass the index
probe and go straight to the block.

- Map the directory into the page cache instead of the buffer cache.

Each of these optimizations will produce a noticeable improvement in
performance, but naturally it will never be anything like the big jump
going from O(N**2) to O(N*log512(N)) ~= O(N). In time the optimizations will be
applied and we can expect to see another doubling or so in performance.

There will be a very slight performance hit when the directory gets big
enough to need a second level. Because of caching this will be very
small. Traversing the directory's metadata index blocks will be a
bigger cost, and once again, this cost can be reduced by moving the
directory blocks into the page cache.

Typically, we will traverse 3 blocks to read or write a directory
entry, and that number increases to 4-5 with really huge directories.
But this is really nothing compared to normal Ext2, which traverses
several hundred blocks in the same situation.

Current Implementation
----------------------

The current implementation has only a single level of the htree (the
root) and is sufficient to handle a little more than 90,000 files.
This is good enough for benchmarking. There has not been a lot of
stability testing yet and indeed there are a number of unhandled error
conditions in the code, and possibly some buffer leaks as well.

This patch is for kernel 2.4.1, but it should be entirely applicable
to the 2.2 series as well. There it should find a friend: Stephen
Tweedie's Ext3 journalling extension.

To-do List
----------

There is still a fair amount of work remaining before this patch is
ready for regular use. Here is the to-do list as of today:

- finalize the file format
- endianness
- improve the hash function
- INCOMPAT flag handling
- second tree level
- bullet proofing
- testing under load

Additionally, some (small) changes will be required in ext2utils. The
ETA for completion of the items on the to-do list is... pretty soon.

Credits
-------

Thanks to Ted Ts'o for providing the inspiration and essential design
elements. Many thanks to Uli Luckas for spending a large number of
hours drinking beer^H^H^H^H^H^H^H^H^H^H walking through the code with
me, suggesting a number of design improvements and understanding and
fixing at least one part of the code which remains, quite frankly,
beyond me. :-)

Applying and Running the patch
------------------------------

The patch adds a symbol to ext2_fs.h, CONFIG_EXT2_INDEX, which
controls whether the htree index feature is enabled or not - it
defaults to on.

- Use a test machine, not your workstation :-)
- cd to the 2.4.1 source root
- patch -p0 <this.email
- build and install - should have no effect on normal operation
- mount /dev/hdxxx /test -t ext2 -o index

All new directories in the mounted partition will be created indexed.

Here is the patch:

--- ../2.4.1.uml.clean/fs/ext2/dir.c Sat Dec 9 02:35:54 2000
+++ ./fs/ext2/dir.c Tue Feb 20 04:21:25 2001
@@ -67,22 +67,24 @@
{
int error = 0;
unsigned long offset, blk;
- int i, num, stored;
- struct buffer_head * bh, * tmp, * bha[16];
- struct ext2_dir_entry_2 * de;
- struct super_block * sb;
- int err;
+ int i, num, stored = 0, err;
+ struct buffer_head *bh = NULL, *tmp, *bha[16];
+ struct ext2_dir_entry_2 *de;
struct inode *inode = filp->f_dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
+ unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+#ifdef CONFIG_EXT2_INDEX
+ int dir_base = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+ int dir_base = 0;
+#endif

- sb = inode->i_sb;
-
- stored = 0;
- bh = NULL;
offset = filp->f_pos & (sb->s_blocksize - 1);

- while (!error && !stored && filp->f_pos < inode->i_size) {
- blk = (filp->f_pos) >> EXT2_BLOCK_SIZE_BITS(sb);
- bh = ext2_bread (inode, blk, 0, &err);
+ while (!error && !stored && filp->f_pos < inode->i_size - (dir_base << blockshift))
+ {
+ blk = (filp->f_pos) >> blockshift;
+ bh = ext2_bread (inode, dir_base + blk, 0, &err);
if (!bh) {
ext2_error (sb, "ext2_readdir",
"directory #%lu contains a hole at offset %lu",
@@ -95,9 +97,9 @@
* Do the readahead
*/
if (!offset) {
- for (i = 16 >> (EXT2_BLOCK_SIZE_BITS(sb) - 9), num = 0;
- i > 0; i--) {
- tmp = ext2_getblk (inode, ++blk, 0, &err);
+ for (i = 16 >> (blockshift - 9), num = 0; i > 0; i--)
+ {
+ tmp = ext2_getblk (inode, dir_base + ++blk, 0, &err);
if (tmp && !buffer_uptodate(tmp) && !buffer_locked(tmp))
bha[num++] = tmp;
else
@@ -140,8 +142,7 @@
de = (struct ext2_dir_entry_2 *) (bh->b_data + offset);
if (!ext2_check_dir_entry ("ext2_readdir", inode, de,
bh, offset)) {
- /* On error, skip the f_pos to the
- next block. */
+ /* On error, skip the f_pos to the next block. */
filp->f_pos = (filp->f_pos | (sb->s_blocksize - 1))
+ 1;
brelse (bh);
--- ../2.4.1.uml.clean/fs/ext2/namei.c Sat Dec 9 02:35:54 2000
+++ ./fs/ext2/namei.c Tue Feb 20 16:00:53 2001
@@ -18,13 +18,13 @@
* for B-tree directories by Theodore Ts'o ([email protected]), 1998
*/

+#define CONFIG_EXT2_INDEX
+
#include <linux/fs.h>
#include <linux/ext2_fs.h>
#include <linux/locks.h>
#include <linux/quotaops.h>

-
-
/*
* define how far ahead to read directories while searching them.
*/
@@ -33,6 +33,250 @@
#define NAMEI_RA_SIZE (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
#define NAMEI_RA_INDEX(c,b) (((c) * NAMEI_RA_BLOCKS) + (b))

+#ifdef CONFIG_EXT2_INDEX
+#define dxtrace(command)
+#define dxtrace_on(command) command
+#define dxtrace_off(command)
+
+/*
+ * Order n log(n) sort utility with n log(n) worst case
+ */
+
+#ifndef COMBSORT
+#define COMBSORT(size, i, j, COMPARE, EXCHANGE) { \
+ unsigned gap = size, more, i; \
+ do { \
+ if (gap > 1) gap = gap*10/13; \
+ if (gap - 9 < 2) gap = 11; \
+ for (i = size - 1, more = gap > 1; i >= gap; i--) { \
+ int j = i - gap; \
+ if (COMPARE) { EXCHANGE; more = 1; } } \
+ } while (more); }
+#endif
+
+#ifndef exchange
+#define exchange(x, y) do { typeof(x) z = x; x = y; y = z; } while (0)
+#endif
+
+/*
+ * Structure of the directory root block
+ */
+
+struct dx_root
+{
+ struct dx_root_info
+ {
+ u32 total_entries;
+ u32 reserved_zero;
+ }
+ info;
+ struct dx_entry
+ {
+ u32 hash;
+ u32 block;
+ }
+ entries[0];
+};
+
+/*
+ * Bookkeeping for index traversal
+ */
+
+struct dx_frame
+{
+ struct buffer_head *bh;
+ struct dx_entry *entries;
+ struct dx_entry *at;
+ struct dx_root_info *info;
+ unsigned count;
+ unsigned limit;
+};
+
+/*
+ * Sort map for splitting leaf
+ */
+
+struct dx_map_entry
+{
+ u32 hash;
+ u32 offs;
+};
+
+#define MAX_DX_MAP (PAGE_SIZE/EXT2_DIR_REC_LEN(1) + 1)
+/* Assumes file blocksize <= PAGE_SIZE */
+
+#if 1
+unsigned dx_hash (const char *name, int namelen)
+{
+ u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
+ if (!namelen) return 0;
+ while (namelen--)
+ {
+ u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
+ if (hash < 0) hash -= 0x7fffffff;
+ hash1 = hash0;
+ hash0 = hash;
+ }
+ return ((hash0 & -1) << 1);
+}
+#else
+/*
+ * A simple hash // need hash function upgrade support
+ */
+
+int dx_hash (const char *name, int namelen)
+{
+ u32 hash = 0;
+ if (!namelen) BUG();
+ while (namelen--) hash = *(name++) + (hash << 6);
+ return hash << 1;
+}
+#endif
+
+/*
+ * Probe to find a directory leaf block to search
+ */
+
+int dx_probe (struct inode *dir, u32 hash, struct dx_frame *dxframe)
+{
+ int count, search, err;
+ struct buffer_head *bh;
+ struct dx_entry *at, *at0;
+
+ dxtrace(printk("Look up %u.", hash));
+ if (!(bh = ext2_bread (dir, 0, 0, &err)))
+ {
+ dxframe->bh = NULL;
+ return -1;
+ }
+
+ /* First hash field holds count of entries */
+ at = at0 = ((struct dx_root *) (bh->b_data))->entries;
+ if (!(count = *(u32 *) at)) BUG();
+ search = count - 1; // should use binary search
+
+ while (search--)
+ {
+ dxtrace(printk("."));
+ if ((++at)->hash > hash)
+ {
+ at--;
+ break;
+ }
+ }
+ dxtrace(printk(" in %u:%u\n", at - at0, at->block));
+ dxframe->info = (struct dx_root_info *) bh->b_data;
+ dxframe->bh = bh;
+ dxframe->entries = at0;
+ dxframe->at = at;
+ dxframe->count = count;
+ dxframe->limit = (bh->b_size - sizeof(struct dx_root_info)) / sizeof(struct dx_entry);
+ return 0;
+}
+
+/*
+ * Prior to split, finds record offset, computes hash of each dir block record
+ */
+
+static int dx_make_map (struct ext2_dir_entry_2 *de, int size, struct dx_map_entry map[])
+{
+ int count = 0;
+ char *base = (char *) de;
+ while ((char *) de < base + size)
+ {
+ map[count].hash = dx_hash (de->name, de->name_len);
+ map[count].offs = (u32) ((char *) de - base);
+ de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ count++;
+ }
+ return count;
+}
+
+/*
+ * For dir block splitting and compacting
+ */
+
+struct ext2_dir_entry_2 *dx_copy (
+ char *from, char *to, unsigned size, // should pass from, to as de's (uli)
+ struct dx_map_entry map[], int start, int count)
+{
+ struct ext2_dir_entry_2 *de = NULL;
+ char *top = to + size;
+ unsigned rec_len = 0;
+ if (!count) BUG();
+ while (count--)
+ {
+ de = (struct ext2_dir_entry_2 *) (from + map[start++].offs);
+ rec_len = EXT2_DIR_REC_LEN(de->name_len);
+ if (to + rec_len > top) BUG();
+ memcpy (to, de, rec_len);
+ ((struct ext2_dir_entry_2 *) to)->rec_len = rec_len;
+ to += rec_len;
+ }
+ return (struct ext2_dir_entry_2 *) (to - rec_len);
+}
+
+void dx_adjust (struct ext2_dir_entry_2 *de, char *limit)
+{
+ de->rec_len = limit - (char *) de; // need to clear top?
+}
+
+/*
+ * Debug
+ */
+
+void dx_show_index (struct dx_frame *dxframe)
+{
+ struct dx_entry *entries = dxframe->entries;
+ int i = 0;
+ printk("Index: ");
+ for (;i < *(u32 *) entries; i++)
+ {
+ printk("%u@%u ", entries[i].hash, entries[i].block);
+ }
+ printk("\n");
+}
+
+void dx_show_leaf (struct ext2_dir_entry_2 *de, int size)
+{
+ int count = 0;
+ char *base = (char *) de;
+ printk("dirblock: ");
+ while ((char *) de < base + size)
+ {
+ { int n = de->name_len; char *s = de->name; while (n--) printk("%c", *s++); }
+ printk(":%u.%u ", dx_hash (de->name, de->name_len), (u32) ((char *) de - base));
+ de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
+ count++;
+ }
+ printk("(%i)\n", count);
+}
+
+void dx_show_buckets (struct inode *dir)
+{
+ struct super_block *sb = dir->i_sb;
+ int blockshift = EXT2_BLOCK_SIZE_BITS (sb), blocksize = 1 << blockshift;
+ int count, i, err;
+ struct dx_entry *at;
+ struct buffer_head *bh, *dbh;
+ if (!(dbh = ext2_bread (dir, 0, 0, &err))) return;
+ at = ((struct dx_root *) (dbh->b_data))->entries;
+ count = *(u32 *) at;
+ printk("%i indexed blocks...\n", count);
+ for (i = 0; i < count; i++, at++)
+ {
+ u32 hash = i? at->hash: 0;
+ u32 range = i == count - 1? ~at->hash: ((at + 1)->hash - hash);
+ printk("%i:%u hash %u/%u", i, at->block, hash, range);
+ if (!(bh = ext2_bread (dir, at->block, 0, &err))) continue;
+ dx_show_leaf ((struct ext2_dir_entry_2 *) bh->b_data, blocksize);
+ brelse (bh);
+ }
+ brelse(dbh);
+ printk("\n");
+}
+#endif
+
/*
* NOTE! unlike strncmp, ext2_match returns 1 for success, 0 for failure.
*
@@ -49,36 +293,94 @@
return !memcmp(name, de->name, len);
}

+struct ext2_dir_entry_2 *ext2_find_de (struct buffer_head *bh,
+ const char *const name, int namelen,
+ int *err, struct inode *dir, u32 offset)
+ /* dir, offset used only in error report */
+{
+ struct ext2_dir_entry_2 *de = (struct ext2_dir_entry_2 *) bh->b_data;
+ char *top = (char *) de + bh->b_size;
+ while ((char *) de < top) {
+ /* this code may be executed quadratically often */
+ /* do minimal checking `by hand' */
+ int de_len;
+ if ((char *) de + namelen <= top && ext2_match (namelen, name, de)) // is the compare to top really needed??
+ {
+ /* found a match - just to be sure, do a full check */
+ if (!ext2_check_dir_entry("ext2_find_entry", dir, de, bh, offset))
+ goto error;
+ *err = 0;
+ return de;
+ }
+ de_len = le16_to_cpu(de->rec_len);
+ /* prevent looping on a bad block */
+ if (de_len <= 0)
+ goto error;
+ de = (struct ext2_dir_entry_2 *) ((char *) de + de_len);
+ offset += de_len;
+ }
+ *err = 0;
+ return NULL;
+error:
+ *err = 1;
+ return NULL;
+}
+
/*
- * ext2_find_entry()
- *
- * finds an entry in the specified directory with the wanted name. It
- * returns the cache buffer in which the entry was found, and the entry
- * itself (as a parameter - res_dir). It does NOT read the inode of the
- * entry - you'll have to do that yourself if you want to.
- */
-static struct buffer_head * ext2_find_entry (struct inode * dir,
- const char * const name, int namelen,
- struct ext2_dir_entry_2 ** res_dir)
-{
- struct super_block * sb;
- struct buffer_head * bh_use[NAMEI_RA_SIZE];
- struct buffer_head * bh_read[NAMEI_RA_SIZE];
+ * Find an entry in the specified directory with the wanted name. Return
+ * the buffer the entry was found in, and set the entry through a pointer.
+ */
+static struct buffer_head *ext2_find_entry (
+ struct inode *dir,
+ const char *name, int namelen,
+ struct ext2_dir_entry_2 **res_dir)
+{
+ struct super_block *sb = dir->i_sb;
+ struct buffer_head *bh_use[NAMEI_RA_SIZE];
+ struct buffer_head *bh_read[NAMEI_RA_SIZE];
unsigned long offset;
int block, toread, i, err;
+ int blockshift = EXT2_BLOCK_SIZE_BITS (sb);

*res_dir = NULL;
- sb = dir->i_sb;
+ if (namelen > EXT2_NAME_LEN) return NULL;
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ u32 hash = dx_hash (name, namelen);
+ struct ext2_dir_entry_2 *de;
+ struct dx_frame dxframe;
+ struct buffer_head *bh;
+ int err = dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+
+ while (1)
+ {
+ bh = ext2_bread (dir, dxframe.at->block, 0, &err); // don't ignore the error!!
+ de = ext2_find_de (bh, name, namelen, &err, dir, 666); // don't ignore the error!!
+ if (de)
+ {
+ dxtrace(printk("Found %s in %i:%i\n", name,
+ dxframe.at - dxframe.entries, dxframe.at->block));
+ brelse(dxframe.bh);
+ *res_dir = de;
+ return bh;
+ }

- if (namelen > EXT2_NAME_LEN)
+ brelse(bh);
+ /* Same hash continues in next block? Search further. */
+ if (++(dxframe.at) - dxframe.entries == dxframe.count) break;
+ if ((dxframe.at->hash & -2) != hash) break;
+ dxtrace(printk("Try next, block %i\n", dxframe.at->block));
+ }
+ brelse(dxframe.bh);
return NULL;
-
+ }
+#endif
memset (bh_use, 0, sizeof (bh_use));
toread = 0;
for (block = 0; block < NAMEI_RA_SIZE; ++block) {
struct buffer_head * bh;
-
- if ((block << EXT2_BLOCK_SIZE_BITS (sb)) >= dir->i_size)
+ if ((block << blockshift) >= dir->i_size)
break;
bh = ext2_getblk (dir, block, 0, &err);
bh_use[block] = bh;
@@ -86,75 +388,54 @@
bh_read[toread++] = bh;
}

- for (block = 0, offset = 0; offset < dir->i_size; block++) {
+ for (block = 0, offset = 0; offset < dir->i_size; offset += sb->s_blocksize, block++)
+ {
struct buffer_head * bh;
- struct ext2_dir_entry_2 * de;
- char * dlimit;
-
- if ((block % NAMEI_RA_BLOCKS) == 0 && toread) {
+ struct ext2_dir_entry_2 *de;
+ if ((block % NAMEI_RA_BLOCKS) == 0 && toread)
+ {
ll_rw_block (READ, toread, bh_read);
toread = 0;
}
bh = bh_use[block % NAMEI_RA_SIZE];
- if (!bh) {
+ if (!bh)
+ {
#if 0
ext2_error (sb, "ext2_find_entry",
"directory #%lu contains a hole at offset %lu",
dir->i_ino, offset);
#endif
- offset += sb->s_blocksize;
continue;
}
+
wait_on_buffer (bh);
- if (!buffer_uptodate(bh)) {
- /*
- * read error: all bets are off
- */
+
+ /* handle read error */
+ if (!buffer_uptodate(bh))
break;
- }

- de = (struct ext2_dir_entry_2 *) bh->b_data;
- dlimit = bh->b_data + sb->s_blocksize;
- while ((char *) de < dlimit) {
- /* this code is executed quadratically often */
- /* do minimal checking `by hand' */
- int de_len;
-
- if ((char *) de + namelen <= dlimit &&
- ext2_match (namelen, name, de)) {
- /* found a match -
- just to be sure, do a full check */
- if (!ext2_check_dir_entry("ext2_find_entry",
- dir, de, bh, offset))
- goto failure;
- for (i = 0; i < NAMEI_RA_SIZE; ++i) {
- if (bh_use[i] != bh)
- brelse (bh_use[i]);
- }
- *res_dir = de;
- return bh;
- }
- /* prevent looping on a bad block */
- de_len = le16_to_cpu(de->rec_len);
- if (de_len <= 0)
- goto failure;
- offset += de_len;
- de = (struct ext2_dir_entry_2 *)
- ((char *) de + de_len);
+ de = ext2_find_de (bh, name, namelen, &err, dir, offset);
+ if (de)
+ {
+ for (i = 0; i < NAMEI_RA_SIZE; ++i)
+ if (bh_use[i] != bh)
+ brelse (bh_use[i]);
+ *res_dir = de;
+ return bh;
}
-
+ if (err)
+ goto fail;
brelse (bh);
- if (((block + NAMEI_RA_SIZE) << EXT2_BLOCK_SIZE_BITS (sb)) >=
- dir->i_size)
- bh = NULL;
- else
+ if (((block + NAMEI_RA_SIZE) << blockshift) < dir->i_size)
bh = ext2_getblk (dir, block + NAMEI_RA_SIZE, 0, &err);
+ else
+ bh = NULL;
+
bh_use[block % NAMEI_RA_SIZE] = bh;
if (bh && !buffer_uptodate(bh))
bh_read[toread++] = bh;
}
-
-failure:
+fail:
for (i = 0; i < NAMEI_RA_SIZE; ++i)
brelse (bh_use[i]);
return NULL;
@@ -171,7 +452,8 @@

bh = ext2_find_entry (dir, dentry->d_name.name, dentry->d_name.len, &de);
inode = NULL;
- if (bh) {
+ if (bh)
+ {
unsigned long ino = le32_to_cpu(de->inode);
brelse (bh);
inode = iget(dir->i_sb, ino);
@@ -202,37 +484,151 @@
}

/*
- * ext2_add_entry()
- *
* adds a file entry to the specified directory.
*/
+
int ext2_add_entry (struct inode * dir, const char * name, int namelen,
struct inode *inode)
{
unsigned long offset;
- unsigned short rec_len;
+ unsigned short rec_len = EXT2_DIR_REC_LEN(namelen);
struct buffer_head * bh;
- struct ext2_dir_entry_2 * de, * de1;
- struct super_block * sb;
- int retval;
-
- sb = dir->i_sb;
+ struct ext2_dir_entry_2 * de, * de2;
+ struct super_block * sb = dir->i_sb;
+ unsigned blockshift = EXT2_BLOCK_SIZE_BITS(sb);
+ unsigned blocksize = 1 << blockshift;
+ int err;
+#ifdef CONFIG_EXT2_INDEX
+ struct dx_frame dxframe;
+ u32 hash;
+#endif

- if (!namelen)
- return -EINVAL;
- bh = ext2_bread (dir, 0, 0, &retval);
- if (!bh)
- return retval;
- rec_len = EXT2_DIR_REC_LEN(namelen);
+ if (!namelen) return -EINVAL;
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ hash = dx_hash (name, namelen);
+ dx_probe (dir, hash, &dxframe); // don't ignore the error!!
+ if (!dxframe.bh) return EINVAL;
+ if (!(bh = ext2_bread (dir, dxframe.at->block, 0, &err))) return err;
+ }
+ else
+#endif
+ {
+ if (!(bh = ext2_bread (dir, 0, 0, &err))) return err;
+ }
offset = 0;
de = (struct ext2_dir_entry_2 *) bh->b_data;
- while (1) {
- if ((char *)de >= sb->s_blocksize + bh->b_data) {
+ while (1)
+ {
+ if ((char *) de >= bh->b_data + blocksize)
+ {
+#ifdef CONFIG_EXT2_INDEX
+ if (is_dx(dir))
+ {
+ u32 block2 = dir->i_size >> blockshift;
+ struct dx_entry *entries = dxframe.entries, *at = dxframe.at;
+ struct buffer_head *bh2;
+ int count, split;
+ int continued; /* true if identical hashes split between two blocks */
+ u32 hash2;
+ dxtrace_off(printk("entry count %i, limit %i\n", dxframe.count, dxframe.limit));
+
+ if (dxframe.count == dxframe.limit)
+ {
+ brelse(bh);
+ brelse (dxframe.bh);
+ return -ENOENT;
+ }
+
+ if (!(bh2 = ext2_getblk (dir, block2, 1, &err)))
+ {
+ brelse(bh);
+ brelse (dxframe.bh);
+ return err;
+ }
+
+ {
+ char *b1 = bh->b_data, *b2, *b3;
+ struct dx_map_entry map[MAX_DX_MAP];
+ count = dx_make_map ((struct ext2_dir_entry_2 *) b1, blocksize, map);
+ split = count/2; // need to adjust to actual middle
+ COMBSORT(count, i, j, map[i].hash < map[j].hash, exchange(map[i], map[j]));
+
+ /* Don't split between duplicate hashes */
+ if (hash <= map[split].hash)
+ while (split && map[split].hash == map[split-1].hash)
+ split--;
+ else
+ while (split < count && map[split].hash == map[split-1].hash)
+ split++;
+ hash2 = map[split].hash;
+ continued = hash == hash2; // this happens to be valid for now
+ dxtrace(printk("Split block %i at %u, %i/%i\n", dxframe.at->block, hash2, split, count-split));
+
+ b2 = bh2->b_data;
+ dir->i_size = dir->i_size += blocksize;
+
+ if (!split || split == count)
+ {
+ // just create an empty dirblock for now
+ de2 = (struct ext2_dir_entry_2 *) b2;
+ de2->inode = 0;
+ de2->rec_len = le16_to_cpu(blocksize);
+ } else {
+ /* Fancy dance to stay within two buffers */
+ de2 = dx_copy (b1, b2, blocksize, map, split, count - split);
+ b3 = (char *) de2 + de2->rec_len;
+ de = dx_copy (b1, b3, blocksize - (b3 - b2), map, 0, split);
+ memcpy(b1, b3, (char *) de + de->rec_len - b3);
+ de = (struct ext2_dir_entry_2 *) ((char *) de - b3 + b1);
+ dx_adjust (de, b1 + blocksize);
+ dx_adjust (de2, b2 + blocksize);
+ }
+
+ dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b1, blocksize));
+ dxtrace(dx_show_leaf ((struct ext2_dir_entry_2 *) b2, blocksize));
+
+ /* Which block gets the new entry? */
+ dxtrace(printk("Insert %s/%u ", name, hash));
+ if (hash >= hash2 || !split || split == count)
+ {
+ dxtrace(printk("above"));
+ exchange(bh, bh2);
+ de = de2;
+ }
+ dxtrace(printk("\n"));
+ }
+
+ memmove (at + 1, at, (char *) (entries + dxframe.count) - (char *) (at));
+ if (continued && (!split || split == count))
+ {
+ /* assuming we put new identical hash into lower entry's block */
+ (at+1)->hash = hash + 1;
+ if (at != dxframe.entries) at->hash = hash;
+ at->block = block2;
+ } else {
+ at++;
+ at->block = block2;
+ at->hash = hash2;
+ }
+ dxframe.count = entries[0].hash++; /* first hash field is entry count */
+
+ /* Clean up and continue with scan for available space */
+ /* New dirent will be added at de in bh */
+ if (!continued) mark_buffer_dirty (bh2);
+ mark_buffer_dirty (dxframe.bh);
+ brelse (dxframe.bh);
+ brelse (bh2);
+ dxframe.bh = NULL; // oops if come here again
+ dxtrace(dx_show_index (&dxframe));
+ } else {
+#endif
brelse (bh);
bh = NULL;
- bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &retval);
+ bh = ext2_bread (dir, offset >> EXT2_BLOCK_SIZE_BITS(sb), 1, &err);
if (!bh)
- return retval;
+ return err;
if (dir->i_size <= offset) {
if (dir->i_size == 0) {
return -ENOENT;
@@ -244,7 +640,6 @@
de->inode = 0;
de->rec_len = le16_to_cpu(sb->s_blocksize);
dir->i_size = offset + sb->s_blocksize;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
} else {

@@ -252,6 +647,9 @@

de = (struct ext2_dir_entry_2 *) bh->b_data;
}
+#ifdef CONFIG_EXT2_INDEX
+ }
+#endif
}
if (!ext2_check_dir_entry ("ext2_add_entry", dir, de, bh,
offset)) {
@@ -266,12 +664,12 @@
(le16_to_cpu(de->rec_len) >= EXT2_DIR_REC_LEN(de->name_len) + rec_len)) {
offset += le16_to_cpu(de->rec_len);
if (le32_to_cpu(de->inode)) {
- de1 = (struct ext2_dir_entry_2 *) ((char *) de +
+ de2 = (struct ext2_dir_entry_2 *) ((char *) de +
EXT2_DIR_REC_LEN(de->name_len));
- de1->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
+ de2->rec_len = cpu_to_le16(le16_to_cpu(de->rec_len) -
EXT2_DIR_REC_LEN(de->name_len));
de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
- de = de1;
+ de = de2;
}
de->file_type = EXT2_FT_UNKNOWN;
if (inode) {
@@ -293,7 +691,6 @@
* and/or different from the directory change time.
*/
dir->i_mtime = dir->i_ctime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
dir->i_version = ++event;
mark_buffer_dirty_inode(bh, dir);
@@ -380,6 +777,7 @@
return err;
}
d_instantiate(dentry, inode);
+// dx_show_buckets (dir);
return 0;
}

@@ -408,12 +806,19 @@
return err;
}

-static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
+static int ext2_mkdir (struct inode *dir, struct dentry *dentry, int mode)
{
- struct inode * inode;
- struct buffer_head * dir_block;
- struct ext2_dir_entry_2 * de;
+ struct super_block *sb = dir->i_sb;
+ struct inode *inode;
+ struct buffer_head *bh;
+ struct ext2_dir_entry_2 *de;
int err;
+#ifdef CONFIG_EXT2_INDEX
+ int make_dx = test_opt (sb, DXTREE);
+ int dir_blk = make_dx? dx_dir_base(sb): 0;
+#else
+ int dir_blk = 0;
+#endif

if (dir->i_nlink >= EXT2_LINK_MAX)
return -EMLINK;
@@ -425,40 +830,61 @@

inode->i_op = &ext2_dir_inode_operations;
inode->i_fop = &ext2_dir_operations;
- inode->i_size = inode->i_sb->s_blocksize;
+ inode->i_size = sb->s_blocksize;
inode->i_blocks = 0;
- dir_block = ext2_bread (inode, 0, 1, &err);
- if (!dir_block) {
+ bh = ext2_bread (inode, dir_blk, 1, &err);
+ if (!bh)
+ {
inode->i_nlink--; /* is this nlink == 0? */
mark_inode_dirty(inode);
iput (inode);
return err;
}
- de = (struct ext2_dir_entry_2 *) dir_block->b_data;
+ de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
de->inode = cpu_to_le32(inode->i_ino);
de->name_len = 1;
de->rec_len = cpu_to_le16(EXT2_DIR_REC_LEN(de->name_len));
strcpy (de->name, ".");
- ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+ ext2_set_de_type(sb, de, S_IFDIR);
de = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = cpu_to_le16(inode->i_sb->s_blocksize - EXT2_DIR_REC_LEN(1));
+ de->rec_len = cpu_to_le16(sb->s_blocksize - EXT2_DIR_REC_LEN(1));
de->name_len = 2;
strcpy (de->name, "..");
- ext2_set_de_type(dir->i_sb, de, S_IFDIR);
+ ext2_set_de_type (sb, de, S_IFDIR);
+#else
+ de->rec_len = cpu_to_le16(sb->s_blocksize);
+#endif
inode->i_nlink = 2;
- mark_buffer_dirty_inode(dir_block, dir);
- brelse (dir_block);
+ mark_buffer_dirty_inode(bh, dir);
+ brelse (bh);
inode->i_mode = S_IFDIR | mode;
if (dir->i_mode & S_ISGID)
inode->i_mode |= S_ISGID;
mark_inode_dirty(inode);
- err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len,
- inode);
+ err = ext2_add_entry (dir, dentry->d_name.name, dentry->d_name.len, inode);
if (err)
goto out_no_entry;
dir->i_nlink++;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
+#ifdef CONFIG_EXT2_INDEX
+ if (make_dx)
+ {
+ struct buffer_head *bh = ext2_bread (inode, 0, 1, &err);
+ if (bh)
+ {
+ struct dx_entry *entries = ((struct dx_root *) bh->b_data)->entries;
+ dxtrace_on(printk("Making dx indexed directory\n"));
+ inode->i_size = (dx_dir_base(sb) + 1) << sb->s_blocksize_bits;
+ entries[0].block = dx_dir_base(sb);
+ entries[0].hash = 1; /* first hash field is entry count */
+ mark_buffer_dirty(bh);
+ brelse(bh);
+ inode->u.ext2_i.i_flags |= EXT2_INDEX_FL;
+
+ }
+ }
+#endif
mark_inode_dirty(dir);
d_instantiate(dentry, inode);
return 0;
@@ -473,23 +899,27 @@
/*
* routine to check that the specified directory is empty (for rmdir)
*/
-static int empty_dir (struct inode * inode)
+static int ext2_is_empty_dir (struct inode *inode)
{
unsigned long offset;
struct buffer_head * bh;
struct ext2_dir_entry_2 * de, * de1;
- struct super_block * sb;
+ struct super_block * sb = inode->i_sb;
int err;
-
- sb = inode->i_sb;
+#ifdef CONFIG_EXT2_INDEX
+ int start = is_dx(inode)? dx_dir_base(sb): 0;
+#else
+ int start = 0;
+#endif
if (inode->i_size < EXT2_DIR_REC_LEN(1) + EXT2_DIR_REC_LEN(2) ||
- !(bh = ext2_bread (inode, 0, 0, &err))) {
+ !(bh = ext2_bread (inode, start, 0, &err))) {
ext2_warning (inode->i_sb, "empty_dir",
"bad directory (dir #%lu) - no data block",
inode->i_ino);
return 1;
}
de = (struct ext2_dir_entry_2 *) bh->b_data;
+#ifdef CONFIG_EXT2_INDEX
de1 = (struct ext2_dir_entry_2 *) ((char *) de + le16_to_cpu(de->rec_len));
if (le32_to_cpu(de->inode) != inode->i_ino || !le32_to_cpu(de1->inode) ||
strcmp (".", de->name) || strcmp ("..", de1->name)) {
@@ -501,6 +931,7 @@
}
offset = le16_to_cpu(de->rec_len) + le16_to_cpu(de1->rec_len);
de = (struct ext2_dir_entry_2 *) ((char *) de1 + le16_to_cpu(de1->rec_len));
+#endif
while (offset < inode->i_size ) {
if (!bh || (void *) de >= (void *) (bh->b_data + sb->s_blocksize)) {
brelse (bh);
@@ -552,7 +983,7 @@
goto end_rmdir;

retval = -ENOTEMPTY;
- if (!empty_dir (inode))
+ if (!ext2_is_empty_dir (inode))
goto end_rmdir;

retval = ext2_delete_entry(dir, de, bh);
@@ -568,7 +999,6 @@
mark_inode_dirty(inode);
dir->i_nlink--;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);

end_rmdir:
@@ -605,7 +1035,6 @@
if (retval)
goto end_unlink;
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(dir);
inode->i_nlink--;
mark_inode_dirty(inode);
@@ -729,7 +1158,7 @@
if (S_ISDIR(old_inode->i_mode)) {
if (new_inode) {
retval = -ENOTEMPTY;
- if (!empty_dir (new_inode))
+ if (!ext2_is_empty_dir (new_inode))
goto end_rename;
}
retval = -EIO;
@@ -782,7 +1211,6 @@
mark_inode_dirty(new_inode);
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME;
- old_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(old_dir);
if (dir_bh) {
PARENT_INO(dir_bh->b_data) = le32_to_cpu(new_dir->i_ino);
@@ -794,7 +1222,6 @@
mark_inode_dirty(new_inode);
} else {
new_dir->i_nlink++;
- new_dir->u.ext2_i.i_flags &= ~EXT2_BTREE_FL;
mark_inode_dirty(new_dir);
}
}
--- ../2.4.1.uml.clean/fs/ext2/super.c Fri Dec 29 23:36:44 2000
+++ ./fs/ext2/super.c Tue Feb 20 04:56:43 2001
@@ -188,6 +188,12 @@
printk("EXT2 Check option not supported\n");
#endif
}
+ else if (!strcmp (this_char, "index"))
+#ifdef CONFIG_EXT2_INDEX
+ set_opt (*mount_options, DXTREE);
+#else
+ printk("EXT2 Index option not supported\n");
+#endif
else if (!strcmp (this_char, "debug"))
set_opt (*mount_options, DEBUG);
else if (!strcmp (this_char, "errors")) {
--- ../2.4.1.uml.clean/include/linux/ext2_fs.h Tue Jan 30 08:24:55 2001
+++ ./include/linux/ext2_fs.h Tue Feb 20 15:52:54 2001
@@ -40,6 +40,12 @@
#define EXT2FS_VERSION "0.5b"

/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#undef CONFIG_EXT2_INDEX
+
+/*
* Debug code
*/
#ifdef EXT2FS_DEBUG
@@ -53,7 +59,7 @@
#endif

/*
- * Special inodes numbers
+ * Special inode numbers
*/
#define EXT2_BAD_INO 1 /* Bad blocks inode */
#define EXT2_ROOT_INO 2 /* Root inode */
@@ -197,7 +203,7 @@
#define EXT2_NOCOMP_FL 0x00000400 /* Don't compress */
#define EXT2_ECOMPR_FL 0x00000800 /* Compression error */
/* End compression flags --- maybe not all used */
-#define EXT2_BTREE_FL 0x00001000 /* btree format dir */
+#define EXT2_INDEX_FL 0x00001000 /* btree format dir */
#define EXT2_RESERVED_FL 0x80000000 /* reserved for ext2 lib */

#define EXT2_FL_USER_VISIBLE 0x00001FFF /* User visible flags */
@@ -314,6 +320,7 @@
#define EXT2_MOUNT_ERRORS_PANIC 0x0040 /* Panic on errors */
#define EXT2_MOUNT_MINIX_DF 0x0080 /* Mimics the Minix statfs */
#define EXT2_MOUNT_NO_UID32 0x0200 /* Disable 32-bit UIDs */
+#define EXT2_MOUNT_DXTREE 0x0400 /* Enable dx trees */

#define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt
#define set_opt(o, opt) o |= EXT2_MOUNT_##opt
@@ -518,6 +525,16 @@
#define EXT2_DIR_ROUND (EXT2_DIR_PAD - 1)
#define EXT2_DIR_REC_LEN(name_len) (((name_len) + 8 + EXT2_DIR_ROUND) & \
~EXT2_DIR_ROUND)
+
+/*
+ * Hash Tree Directory indexing
+ * (c) Daniel Phillips, 2001
+ */
+#ifdef CONFIG_EXT2_INDEX
+#define is_dx(dir) (dir->u.ext2_i.i_flags & EXT2_INDEX_FL)
+#define dx_entries_per_block(sb) (EXT2_BLOCK_SIZE(sb) >> 3)
+#define dx_dir_base(sb) (dx_entries_per_block(sb) - 1 + 1)
+#endif

#ifdef __KERNEL__
/*


--
Daniel


2001-02-20 20:04:51

by Linus Torvalds

Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <01022020011905.18944@gimli>,
Daniel Phillips <[email protected]> wrote:
>Earlier this month a runaway installation script decided to mail all its
>problems to root. After a couple of hours the script aborted, having
>created 65535 entries in Postfix's maildrop directory. Removing those
>files took an awfully long time. The problem is that Ext2 does each
>directory access using a simple, linear search though the entire
>directory file, resulting in n**2 behaviour to create/delete n files.
>It's about time we fixed that.

Interesting.

However, if you're playing with the directory structure, please consider
getting rid of the "struct buffer_head"-centricity, and using the page
cache instead. The page cache has much nicer caching semantics, and
looking up data in the page cache is much faster because it never needs
to do the "virtual->physical" translation.

Talk to Al Viro about this - he's already posted patches to move the
regular ext2 directory tree into the page cache, and they weren't
applied to 2.4.x only because there was no great feeling of "we _must_
do this for correctness".

I see that you already considered this issue, but I wanted to bring it
up again simply because something like this certainly looks like a
potential candidate for 2.5.x, but I will _refuse_ to add code that
increases our reliance of "struct buffer_head" as a caching entity. So
I'd rather see the page cache conversion happen sooner rather than
later...

Also, just out of interest: if you've already been worrying about
hashes, what's the verdict on just using the native dentry hash value
directly? It has other constraints (_really_ low latency and absolutely
performance critical to calculate for the common case, which is not
needing a real lookup at all), but maybe it is good enough? And if not,
and you have done some statistics on it, I'd love to hear about it ;)

Linus

2001-02-20 21:14:21

by Jeremy Jackson

Subject: Re: [rfc] Near-constant time directory index for Ext2

> In article <01022020011905.18944@gimli>,
> Daniel Phillips <[email protected]> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root. After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory. Removing those
> >files took an awfully long time. The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files.
> >It's about time we fixed that.

In the case of your script I'm not sure this will help, but:
I've seen /home directories organised like /home/a/adamsonj,
/home/a/arthurtone, /home/b/barrettj, etc.
this way (crude) indexing only costs in areas where it's needed,
without kernel modification (the app does it). What other places would we
need indexing *in* the filesystem?

2001-02-20 21:20:41

by Mike Dresser

Subject: Re: [rfc] Near-constant time directory index for Ext2

the way i'm reading this, the problem is there's 65535 files in the directory
/where/postfix/lives. rm * or what have you, is going to take forever and
ever, and bog the machine down while it's doing it. My understanding is you
could do the rm *, and instead of it reading the tree over and over for every
file that has to be deleted, it just jumps one or two blocks to the file that's
being deleted, instead of thousands of files to be scanned for each file
deleted.

Jeremy Jackson wrote:

> > In article <01022020011905.18944@gimli>,
> > Daniel Phillips <[email protected]> wrote:
> > >Earlier this month a runaway installation script decided to mail all its
> > >problems to root. After a couple of hours the script aborted, having
> > >created 65535 entries in Postfix's maildrop directory. Removing those
> > >files took an awfully long time. The problem is that Ext2 does each
> > >directory access using a simple, linear search though the entire
> > >directory file, resulting in n**2 behaviour to create/delete n files.
> > >It's about time we fixed that.
>
> In the case of your script I'm not sure this will help, but:
> I've seen /home directories organised like /home/a/adamsonj,
> /home/a/arthurtone, /home/b/barrettj, etc.
> this way (crude) indexing only costs areas where it's needed,
> without kernel modification. (app does it) What other placed would we
> need indexing *in* the filesystem?

2001-02-20 21:56:20

by Daniel Phillips

Subject: Re: [rfc] Near-constant time directory index for Ext2

On Tue, 20 Feb 2001, Linus Torvalds wrote:
> In article <01022020011905.18944@gimli>,
> Daniel Phillips <[email protected]> wrote:
> >Earlier this month a runaway installation script decided to mail all its
> >problems to root. After a couple of hours the script aborted, having
> >created 65535 entries in Postfix's maildrop directory. Removing those
> >files took an awfully long time. The problem is that Ext2 does each
> >directory access using a simple, linear search though the entire
> >directory file, resulting in n**2 behaviour to create/delete n files.
> >It's about time we fixed that.
>
> Interesting.
>
> However, if you're playing with the directory structure, please consider
> getting rid of the "struct buffer_head"-centricity, and using the page
> cache instead. The page cache has much nicer caching semantics, and
> looking up data in the page cache is much faster because it never needs
> to do the "virtual->physical" translation.

Oh yes, I was planning on it. I started with the buffers version
for two main reasons: 1) it's simple and solid, and 2) it
provides the basis for a backport to 2.2 - after the 2.4/2.5 version is
complete of course.

> Talk to Al Viro about this - he's already posted patches to move the
> regular ext2 directory tree into the page cache, and they weren't
> applied to 2.4.x only because there was no great feeling of "we _must_
> do this for correctness".
>
> I see that you already considered this issue, but I wanted to bring it
> up again simply because something like this certainly looks like a
> potential candidate for 2.5.x, but I will _refuse_ to add code that
> increases our reliance of "struct buffer_head" as a caching entity. So
> I'd rather see the page cache conversion happen sooner rather than
> later...

You are preaching to the converted.

> Also, just out of interest: if you've already been worrying about
> hashes, what's the verdict on just using the native dentry hash value
> directly? It has other constraints (_really_ low latency and absolutely
> performance critical to calculate for the common case, which is not
> needing a real lookup at all), but maybe it is good enough? And if not,
> and you have done some statistics on it, I'd love to hear about it ;)

You mean full_name_hash? I will un-static it and try it. I should have
some statistics tomorrow. I have a couple of simple metrics for
measuring the effectiveness of the hash function: the uniformity of
the hash space splitting (which in turn affects the average fullness
of directory leaves) and speed.

Let the hash races begin.

--
Daniel

2001-02-20 22:43:30

by Jeremy Jackson

Subject: Re: [rfc] Near-constant time directory index for Ext2

Mike Dresser wrote:

> the way i'm reading this, the problem is there's 65535 files in the directory
> /where/postfix/lives. rm * or what have you, is going to take forever and
> ever, and bog the machine down while its doing it. My understanding is you
> could do the rm *, and instead of it reading the tree over and over for every
> file that has to be deleted, it just jumps one or two blocks to the file that's
> being deleted, instead of thousands of files to be scanned for each file
> deleted.
>

I thought about it again, and the performance problem with "rm *" is that
the shell reads and sorts the directory, passes each file as a separate
argument to rm, which then causes the kernel to lookup each file
from a random directory block (random because of previous sort),
modify that directory block, then read another... after a few seconds
the modified blocks start to be written back to disk while new ones
are looked up... disk seek contention. and this becomes hard on the
dir. block cache (wherever this is) since from source each dir entry
is just over 256 bytes (?) 65535 files would require 16MB to
cache dir entries. Plus it has to read in all the inodes, modify,
then write, taking up xxMB more. You're probably swapping
out, with swap partition on same disk, the disk may explode.

If it were truly doing a linear scan, it might be faster. Two
successive mods to same dir block would be merged
onto same write.

Perhaps rm -rf . would be faster? Let rm do glob expansion,
without the sort. Care to recreate those 65535 files and try it?

or use ls with the nosort flag, pipe through xargs, then to rm...
again you lose sorting but don't delete the directory or subdirs.

>
> Jeremy Jackson wrote:
>
> > > In article <01022020011905.18944@gimli>,
> > > Daniel Phillips <[email protected]> wrote:
> > > >Earlier this month a runaway installation script decided to mail all its
> > > >problems to root. After a couple of hours the script aborted, having
> > > >created 65535 entries in Postfix's maildrop directory. Removing those
> > > >files took an awfully long time. The problem is that Ext2 does each
> > > >directory access using a simple, linear search though the entire
> > > >directory file, resulting in n**2 behaviour to create/delete n files.
> > > >It's about time we fixed that.
> >

2001-02-20 23:18:27

by Jonathan Morton

Subject: Re: [rfc] Near-constant time directory index for Ext2

>Perhaps rm -rf . would be faster? Let rm do glob expansion,
>without the sort. Care to recreate those 65535 files and try it?

Perhaps, but I think that form is still fairly slow. It takes an
"uncomfortable" amount of time to remove a complex directory structure
using, eg. "rm -rf /usr/src/linux-obsolete" or "rm -rf
downloads/XFree86-old-and-buggy". I'm not sure, but I would guess it's not
as much quicker than removing each file individually as you might think.

If I had more time on my hands, I'd run some quick benchmarks on some of my
systems.

--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----


2001-02-20 23:37:52

by Daniel Phillips

Subject: Re: [rfc] Near-constant time directory index for Ext2

On Tue, 20 Feb 2001, Jeremy Jackson wrote:
> Mike Dresser wrote:
>
> > the way i'm reading this, the problem is there's 65535 files in the directory
> > /where/postfix/lives. rm * or what have you, is going to take forever and
> > ever, and bog the machine down while its doing it. My understanding is you
> > could do the rm *, and instead of it reading the tree over and over for every
> > file that has to be deleted, it just jumps one or two blocks to the file that's
> > being deleted, instead of thousands of files to be scanned for each file
> > deleted.
>
> I thought about it again, and the proformance problem with "rm *" is that
> the shell reads and sorts the directory, passes each file as a separate
> argument to rm, which then causes the kernel to lookup each file
> from a random directory block (random because of previous sort),
> modify that directory block, then read another... after a few seconds
> the modified blocks start to be written back to disk while new ones
> are looked up... disk seek contention. and this becomes hard on the
> dir. block cache (wherever this is) since from source each dir entry
> is just over 256 bytes (?) 65535 files would require 16MB to
> cache dir entries. Plus it has to read in all the inodes, modify,
> then write, taking up xxMB more. You're probably swapping
> out, with swap partition on same disk, the disk may explode.
>
> If it were truly doing a linear scan, it might be faster. Two
> successive mods to same dir block would be merged
> onto same write.
>
> Perhaps rm -rf . would be faster? Let rm do glob expansion,
> without the sort. Care to recreate those 65535 files and try it?
>
> or use ls with the nosort flag pipe through xargs then to rm...
> again loose sorting but don't delete directory or subdirs.

Indeed, rm -rf is faster. It does a readdir to get all the directory
entries in internal order, then calls unlink to remove them, one at a
time. This removes each entry from the front of the file, shortening
the time that has to be spent scanning forward in the file to find the
target entry. Manfred Spraul observed that this could be speeded up
by caching the file position, and sent me a patch to do that. It
did speed things up - about 20%.

But actually, rm is not the problem, it's open and create. To do a
create you have to make sure the file doesn't already exist, and
without an index you have to scan on average half the directory file.
Open requires a similar scan. Here we are talking about using an index
to speed that up quadratically when operating on N files. That is the
real gravy.

--
Daniel

2001-02-21 00:23:30

by Linus Torvalds

Subject: Re: [rfc] Near-constant time directory index for Ext2



On Tue, 20 Feb 2001, Daniel Phillips wrote:
>
> You mean full_name_hash? I will un-static it and try it. I should have
> some statistics tomorrow. I have a couple of simple metrics for
> measuring the effectiveness of the hash function: the uniformity of
> the hash space splitting (which in turn affects the average fullness
> of directory leaves) and speed.

I was more thinking about just using "dentry->d_name->hash" directly, and
not worrying about how that hash was computed. Yes, for ext2 it will have
the same value as "full_name_hash" - the difference really being that
d_hash has already been precomputed for you anyway.

> Let the hash races begin.

Note that dentry->d_name->hash is really quick (no extra computation), but
I'm not claiming that it has anything like a CRC quality. And it's
probably a bad idea to use it, because in theory at least the VFS layer
might decide to switch the hash function around. I'm more interested in
hearing whether it's a good hash, and maybe we could improve the VFS hash
enough that there's no reason to use anything else..

Linus

2001-02-21 00:27:50

by Alan

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

Reiserfs seems to have done a lot of work on this and is using tea, which is
also nice as tea is non-trivial for a user to abuse to intentionally create
pessimal file searches.
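
For reference, this is the core of the standard TEA block cipher that the
name refers to (the reiserfs variant adapts it to hash variable-length
names, so treat this only as an illustration of why it is hard to steer):

#include <stdint.h>

/* One TEA encryption of a 64-bit block with a 128-bit key:
 * 32 mixing rounds driven by the constant 0x9e3779b9. */
static void tea_encrypt(uint32_t v[2], const uint32_t k[4])
{
	uint32_t v0 = v[0], v1 = v[1], sum = 0;
	const uint32_t delta = 0x9e3779b9;
	int i;

	for (i = 0; i < 32; i++) {
		sum += delta;
		v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
		v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
	}
	v[0] = v0;
	v[1] = v1;
}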

2001-02-21 01:03:17

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus writes:
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> > You mean full_name_hash? I will un-static it and try it. I should have
> > some statistics tomorrow.
>
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.

I _thought_ that's what you meant, but then I was also thinking that the
dentry hash was on the full path name and not just the filename? This
wouldn't be any good for use in the directory index, in case the directory
is renamed. If this is _not_ the case, then it is a definite candidate.

> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around.

I was thinking about this as well. Since the setup Daniel has allows us
to store a hash version, we could run the hash function on a fixed string
at SB init time to give us a hash "version" number. If the hash function
changes we will get a new hash "version". We could inline each new dentry
hash function into the ext2 code (so we can unpack the directories), or,
as a cop-out, if any directory has a hash version not equal to the current
one, we re-hash all the entries in the directory.
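
A rough sketch of that idea (hypothetical names, not actual ext2 code): hash
a fixed probe string once at mount time and keep the result as the index's
hash "version". The stand-in hash below is only there to make the example
self-contained; any name hash would do.

#include <stdio.h>

/* Stand-in for whatever name hash the directory index actually uses. */
static unsigned long name_hash(const unsigned char *name, unsigned int len)
{
	unsigned long hash = 0;

	while (len--) {
		hash = (hash << 4) | (hash >> (8 * sizeof(hash) - 4));
		hash ^= *name++;
	}
	return hash;
}

/* Hash a fixed probe string; if the hash function ever changes, this
 * derived "version" value changes with it, so old directories can be
 * detected and re-hashed (or handled by the matching old function). */
static unsigned long hash_version_tag(void)
{
	static const unsigned char probe[] = "dx-hash-version-probe";

	return name_hash(probe, sizeof(probe) - 1);
}

int main(void)
{
	printf("hash version tag: %08lx\n", hash_version_tag());
	return 0;
}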

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-21 01:04:57

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <01022100361408.18944@gimli> you wrote:
> But actually, rm is not problem, it's open and create. To do a
> create you have to make sure the file doesn't already exist, and
> without an index you have to scan on average half the directory file.

Unless you use a File System which is better for that, like Reiser-FS. Of
course an even better solution is to distribute those files in hashed subdirs.

Greetings
Bernd

2001-02-21 06:39:36

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Alan Cox wrote:

>> probably a bad idea to use it, because in theory at least the VFS layer
>> might decide to switch the hash function around. I'm more interested in
>> hearing whether it's a good hash, and maybe we could improve the VFS hash
>> enough that there's no reason to use anything else..
>
> Reiserfs seems to have done a lot of work on this and be using tea, which is
> also nice as tea is non trivial to abuse as a user to create pessimal file
> searches intentionally

The default in reiserfs is now the R5 hash, but you are right that lots of effort went
into finding this hash. This includes testing various hashes on real directory
structures to see which one worked best. R5 won.
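
For readers who have not seen it, r5 is tiny. A commonly quoted form is
roughly the following (from memory - check the reiserfs sources for the
authoritative version):

#include <stdint.h>

/* Shift-add-multiply-by-11 name hash in the style of reiserfs r5;
 * treat the exact constants as approximate. */
static uint32_t r5_hash(const char *name, int len)
{
	uint32_t a = 0;

	while (len--) {
		a += (unsigned char)*name << 4;
		a += (unsigned char)*name >> 4;
		a *= 11;
		name++;
	}
	return a;
}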

Ed Tomlinson

2001-02-21 16:40:55

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

On Wed, 21 Feb 2001, Bernd Eckenfels wrote:
> In article <01022100361408.18944@gimli> you wrote:
> > But actually, rm is not problem, it's open and create. To do a
> > create you have to make sure the file doesn't already exist, and
> > without an index you have to scan on average half the directory file.
>
> Unless you use a File System which is better for that, like Reiser-FS. Of
> course a even better solution is to distribute those files in hashed subdirs.

Ahem. Please read the first post in the thread. ;-)

--
Daniel

2001-02-21 17:19:41

by Davide Libenzi

[permalink] [raw]
Subject: RE: [rfc] Near-constant time directory index for Ext2


On 20-Feb-2001 Daniel Phillips wrote:
> Earlier this month a runaway installation script decided to mail all its
> problems to root. After a couple of hours the script aborted, having
> created 65535 entries in Postfix's maildrop directory. Removing those
> files took an awfully long time. The problem is that Ext2 does each
> directory access using a simple, linear search though the entire
> directory file, resulting in n**2 behaviour to create/delete n files.
> It's about time we fixed that.
>
> Last fall in Miami, Ted Ts'o mentioned some ideas he was playing with
> for an Ext2 directory index, including the following points:
>
> - Fixed-size hash keys instead of names in the index
> - Leaf blocks are normal ext2 directory blocks
> - Leaf blocks are sequental, so readdir doesn't have to be changed

Have you tried to use skiplists?
In '93 I coded a skiplist-based directory access for Minix and it gave very
interesting performance.
Skiplists have linked-list-like performance when linearly scanned, and overall
good performance in insertion/seek/delete.




- Davide

2001-02-21 21:09:35

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> Have You tried to use skiplists ?
> In 93 I've coded a skiplist based directory access for Minix and it gave very
> interesting performances.
> Skiplists have a link-list like performance when linear scanned, and overall
> good performance in insertion/seek/delete.

Skip list search/insert/delete is O(log N) on average, as skip lists are just a
dynamic version of interval bisection. Good hashing is O(1).

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Entropy isn't what it used to be.

2001-02-21 21:28:40

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> Have You tried to use skiplists ?
>> In 93 I've coded a skiplist based directory access for Minix and it gave
>> very
>> interesting performances.
>> Skiplists have a link-list like performance when linear scanned, and overall
>> good performance in insertion/seek/delete.
>
> Skip list search/insert/delete is O(log N) in average as skip lists are just
> a
> dynamic version of interval bisection. Good hashing is O(1).

To have O(1) you have to have the number of hash entries > number of files and a
really good hashing function.



>
> Have a nice fortnight

To be sincere, it's pretty much daylight here :)



- Davide

2001-02-21 21:33:02

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> To have O(1) you've to have the number of hash entries > number of files and a
> really good hasing function.

No, if you enlarge the hash table to twice its size (and re-hash everything)
every time the table fills up, the load factor of the table stays small and
everything is O(1) amortized - of course, only if you have a good hashing
function. If you are really smart and re-hash incrementally, you can get O(1)
worst case complexity, but the multiplicative constant is large.

> To be sincere, here is pretty daylight :)

:)
Martin

2001-02-21 21:58:26

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> To have O(1) you've to have the number of hash entries > number of files and
>> a
>> really good hasing function.
>
> No, if you enlarge the hash table twice (and re-hash everything) every time
> the
> table fills up, the load factor of the table keeps small and everything is
> O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.

My personal preference goes to skip lists because they don't have fixed (or
growing) tables to handle. You simply have a stub of data together with the
FS data in each direntry.
And performance ( O(log2(n)) ) is the same for whatever number of entries.




- Davide

2001-02-21 22:15:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Followup to: <[email protected]>
By author: Martin Mares <[email protected]>
In newsgroup: linux.dev.kernel
>
> Hello!
>
> > To have O(1) you've to have the number of hash entries > number of files and a
> > really good hasing function.
>
> No, if you enlarge the hash table twice (and re-hash everything) every time the
> table fills up, the load factor of the table keeps small and everything is O(1)
> amortized, of course if you have a good hashing function. If you are really
> smart and re-hash incrementally, you can get O(1) worst case complexity, but
> the multiplicative constant is large.
>

Not true. The rehashing is O(n) and it has to be performed O(log n)
times during insertion. Therefore, insertion is O(log n).

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:27:18

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> My personal preference goes to skiplist coz it doesn't have fixed ( or growing
> ) tables to handle. You've simply a stub of data togheter with FS data in each
> direntry.

Another problem with skip lists is that they require variable sized nodes,
so you either need to keep free chunk lists and lose some space in deleted
nodes kept in these lists, or you choose to shift remaining nodes which is
slow and complicated as you need to keep the inter-node links right. With
hashing, you can separate the control part of the structure and the actual
data and shift data while leaving most of the control part intact.

> And performance ( O(log2(n)) ) are the same for whatever number of entries.

I don't understand this complexity estimate -- it cannot be the same for
whatever number of entries as the complexity function depends on the number
of entries.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
P.C.M.C.I.A. stands for `People Can't Memorize Computer Industry Acronyms'

2001-02-21 22:32:38

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> Not true. The rehashing is O(n) and it has to be performed O(log n)
> times during insertion. Therefore, insertion is O(log n).

Rehashing is O(n), but the "n" is the _current_ number of items, not the
maximum one after all the insertions.

Let's assume you start with a single-entry hash table. You rehash for the
first time after inserting the first item (giving hash table of size 2),
then after the second item (=> size 4), then after the fourth item (=> size 8)
and so on. I.e., when you insert n items, the total cost of rehashing summed
over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
k=floor(log2(n))) <= 2^(k+1) = O(n). That is O(1) operations per item inserted.
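
A quick way to convince yourself: the toy program below doubles a table
starting from size 1 and counts the element copies done by the "rehashes"
while n items are inserted; the total stays below 2*n, i.e. O(1) amortized
per insertion.

#include <stdio.h>

int main(void)
{
	unsigned long n = 1000000, size = 1, used = 0, copies = 0, i;

	for (i = 0; i < n; i++) {
		if (used == size) {	/* table full: "rehash" into a table twice as big */
			copies += used;	/* every existing item gets copied once */
			size *= 2;
		}
		used++;
	}
	printf("items inserted: %lu, elements copied: %lu (%.2f per item)\n",
	       n, copies, (double)copies / n);
	return 0;
}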

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
MIPS: Meaningless Indicator of Processor Speed.

2001-02-21 22:38:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > Not true. The rehashing is O(n) and it has to be performed O(log n)
> > times during insertion. Therefore, insertion is O(log n).
>
> Rehashing is O(n), but the "n" is the _current_ number of items, not the
> maximum one after all the insertions.
>
> Let's assume you start with a single-entry hash table. You rehash for the
> first time after inserting the first item (giving hash table of size 2),
> then after the second item (=> size 4), then after the fourth item (=> size 8)
> and so on. I.e., when you insert n items, the total cost of rehashing summed
> over all the insertions is at most 1 + 2 + 4 + 8 + 16 + ... + 2^k (where
> k=floor(log2(n))) <= 2^k+1 = O(n). That is O(1) operations per item inserted.
>

You're right. However, for each hash table operation to be O(1) the size
of the hash table must be >> n.

I suggested at one point to use B-trees with a hash value as the key.
B-trees are extremely efficient when used on a small constant-size key.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:41:58

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Martin Mares wrote:
> Hello!
>
>> My personal preference goes to skiplist coz it doesn't have fixed ( or
>> growing
>> ) tables to handle. You've simply a stub of data togheter with FS data in
>> each
>> direntry.
>
> Another problem with skip lists is that they require variable sized nodes,
> so you either need to keep free chunk lists and lose some space in deleted
> nodes kept in these lists, or you choose to shift remaining nodes which is
> slow and complicated as you need to keep the inter-node links right. With
> hashing, you can separate the control part of the structure and the actual
> data and shift data while leaving most of the control part intact.

An entry in the skip list table is a u32 direntry offset, and you don't have
to keep free entries; the height of the node simply changes depending on the
number of entries.
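
Something along these lines, presumably (hypothetical layout, just to
illustrate the point):

#include <stdint.h>

#define SKIP_MAX_HEIGHT 16

/* The payload is just the 32-bit offset of the dirent in the directory
 * file; the node's forward-pointer array is as tall as its randomly
 * chosen height, so no separate free list of fixed-size slots is needed. */
struct skip_node {
	uint32_t dirent_off;
	int height;			/* 1..SKIP_MAX_HEIGHT */
	struct skip_node *next[];	/* one forward pointer per level */
};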


>> And performance ( O(log2(n)) ) are the same for whatever number of entries.
>
> I don't understand this complexity estimate -- it cannot be the same for
> whatever number of entries as the complexity function depends on the number
> of entries.

n == number of entries

By constant I mean the formula, not the result.



- Davide

2001-02-21 22:47:29

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Mark Hahn wrote:
>
> > You're right. However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
>
> there's at least one kind of HT where the table starts small
> and gets bigger, but at trivial cost (memcpy). while those
> memcpy's are O(n) each time, it's a little misleading to treat
> them as costing the same as O(n) rehashing.
>

memcpy() isn't exactly trivial, especially not when we're talking about
disk storage. Note, too, that we're talking about storage in a
filesystem, and random access to a large, growable linear space (i.e. a
file) in a filesystem is O(log n) because of the necessary inode indirection.

That's yet another reason I like the idea of using B-trees over hash
values: B-trees are O(log n), but do not need the file inode indirection
to do the job, so what you end up with is very nice and fast.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 22:50:38

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> You're right. However, for each hash table operation to be O(1) the size
> of the hash table must be >> n.

If we are talking about average case complexity (which is the only possibility
with fixed hash function and arbitrary input keys), it suffices to have
hash table size >= c*n for some constant c which gives O(1/c) cost of
all operations.

> I suggested at one point to use B-trees with a hash value as the key.
> B-trees are extremely efficient when used on a small constant-size key.

Although from asymptotic complexity standpoint hashing is much better
than B-trees, I'm not sure at all what will give the best performance for
reasonable directory sizes. Maybe the B-trees are really the better
alternative as they are updated dynamically and the costs of successive
operations are similar as opposed to hashing which is occasionally very
slow due to rehashing unless you try to rehash on-line, but I don't
know any algorithm for on-line rehashing with both inserts and deletes
which wouldn't be awfully complex and slow (speaking of multiplicative
constants, of course -- it's still O(1) per operation, but "the big Oh
is really big there").

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
"#define QUESTION ((bb) || !(bb))" -- Shakespeare

2001-02-21 22:54:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > You're right. However, for each hash table operation to be O(1) the size
> > of the hash table must be >> n.
>
> If we are talking about average case complexity (which is the only possibility
> with fixed hash function and arbitrary input keys), it suffices to have
> hash table size >= c*n for some constant c which gives O(1/c) cost of
> all operations.
>

True. Note too, though, that on a filesystem (which we are, after all,
talking about), if you assume a large linear space you have to create a
file, which means you need to multiply the cost of all random-access
operations with O(log n).

> > I suggested at one point to use B-trees with a hash value as the key.
> > B-trees are extremely efficient when used on a small constant-size key.
>
> Although from asymptotic complexity standpoint hashing is much better
> than B-trees, I'm not sure at all what will give the best performance for
> reasonable directory sizes. Maybe the B-trees are really the better
> alternative as they are updated dynamically and the costs of successive
> operations are similar as opposed to hashing which is occassionally very
> slow due to rehashing unless you try to rehash on-line, but I don't
> know any algorithm for on-line rehashing with both inserts and deletes
> which wouldn't be awfully complex and slow (speaking of multiplicative
> constants, of course -- it's still O(1) per operation, but "the big Oh
> is really big there").

Well, once you multiply with O(log n) for the file indirection (which
B-trees don't need, since they inherently handle blocking and thus can
use block pointers directly) then the asymptotic complexity is the same
as well, and I think the B-trees are the overall winner.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:08:24

by Martin Mares

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hello!

> True. Note too, though, that on a filesystem (which we are, after all,
> talking about), if you assume a large linear space you have to create a
> file, which means you need to multiply the cost of all random-access
> operations with O(log n).

One could avoid this, but it would mean designing the whole filesystem in a
completely different way -- merge all directories to a single gigantic
hash table and use (directory ID,file name) as a key, but we were originally
talking about extending ext2, so such massive changes are out of question
and your log n access argument is right.

Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
COBOL -- Completely Outdated, Badly Overused Language

2001-02-21 23:15:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
>
> Hello!
>
> > True. Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
>
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.
>

It would still be tricky since you have to have actual files in the
filesystem as well.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:15:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Ed Tomlinson <[email protected]> wrote:
>
>The default in reiserfs is now the R5 hash, but you are right that lots of efforts went
>into finding this hash. This includes testing various hashes on real directory
>structures to see which one worked best. R5 won.

That's interesting. The R5 hash is easily also the only one of the
reiser hashes that might be useable for the generic VFS hashing. It's
not so different in spirit from the current one, and if you've done the
work to test it, it's bound to be a lot better.

(The current VFS name hash is probably _really_ stupid - I think it's
still my original one, and nobody probably ever even tried to run it
through any testing. For example, I bet that using a shift factor of 4
is really bad, because it evenly divides a byte, which together with the
xor means that you can really easily generate trivial bad cases).
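
To illustrate how easy the bad cases are with a shift of 4, here is a toy
hash built the way described above (rotate the running value by 4 bits, then
xor in the next character - not copied from the kernel, and assuming a 32-bit
unsigned int). Because 4 evenly divides both the byte and the 32-bit word,
the rotation amount repeats every 8 characters, so swapping two characters
that sit 8 positions apart never changes the result:

#include <stdio.h>

static unsigned int rol4_xor_hash(const char *name)
{
	unsigned int hash = 0;

	while (*name) {
		hash = (hash << 4) | (hash >> 28);	/* rotate left by 4 */
		hash ^= (unsigned char)*name++;		/* fold in the character */
	}
	return hash;
}

int main(void)
{
	/* Characters at positions 0 and 8 swapped - identical hashes. */
	const char *n1 = "a0123456b";
	const char *n2 = "b0123456a";

	printf("%s -> %08x\n", n1, rol4_xor_hash(n1));
	printf("%s -> %08x\n", n2, rol4_xor_hash(n2));
	return 0;
}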

What did you use for a test-case? Real-life directory contents? Did you
do any worst-case analysis too?

Linus

2001-02-21 23:27:16

by Jamie Lokier

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Martin Mares wrote:
> Hello!
>
> > True. Note too, though, that on a filesystem (which we are, after all,
> > talking about), if you assume a large linear space you have to create a
> > file, which means you need to multiply the cost of all random-access
> > operations with O(log n).
>
> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

A gigantic hash table has serious problems with non-locality of
reference. Basically any regular access pattern you started with is
destroyed. This is a problem with pageable RAM, let alone disks with
millisecond seek times.

-- Jamie

2001-02-21 23:33:18

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Linus Torvalds wrote:
> In article <[email protected]>,
> Ed Tomlinson <[email protected]> wrote:
>>
>>The default in reiserfs is now the R5 hash, but you are right that lots of
>>efforts went
>>into finding this hash. This includes testing various hashes on real
>>directory
>>structures to see which one worked best. R5 won.
>
> That's interesting. The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing. It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.
>
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing. For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?

Yep, 4 is not good as a shifting factor. Prime numbers are a better choice
for this stuff.
The issue in getting a good distribution is not only having a good hashing
function, but also giving that function uncorrelated data.
A good hashing function for a Domain A may not be so good for a Domain B.




- Davide

2001-02-21 23:44:00

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Martin Mares wrote:
> >
> > > True. Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key, but we were originally
> > talking about extending ext2, so such massive changes are out of question
> > and your log n access argument is right.
>
> It would still be tricky since you have to have actual files in the
> filesystem as well.

Have you looked at the structure and algorithms I'm using? I would not
call this a hash table, nor is it a btree. It's a 'hash-keyed
uniform-depth tree'. It never needs to be rehashed (though it might be
worthwhile compacting it at some point). It also never needs to be
rebalanced - it's only two levels deep for up to 50 million files.

This thing deserves a name of its own. I call it an 'htree'. The
performance should speak for itself - 150 usec/create across 90,000
files and still a few optimizations to go.

Random access runs at similar speeds too, it's not just taking advantage
of a long sequence of insertions into the same directory.
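
For anyone who hasn't read the patch, a much-simplified sketch of what a
lookup does in such a structure (hypothetical types and names, not the patch
itself; it assumes 8-byte index entries of (hash, logical block) sorted by
ascending hash, with the first entry covering the lowest hashes): binary
search the root on the hash key, read the second-level index block it points
to, binary search that, and then linearly scan the single leaf block the hash
maps to.

#include <stdint.h>

struct dx_entry {			/* 8 bytes: hash key + logical block */
	uint32_t hash;
	uint32_t block;
};

struct dx_block {
	int count;
	struct dx_entry entry[512];	/* sorted by ascending hash */
};

/* Return the block of the last entry whose hash is <= the search key;
 * assumes entry[0] covers the lowest hashes. */
static uint32_t dx_find(const struct dx_block *idx, uint32_t hash)
{
	int lo = 0, hi = idx->count - 1;

	while (lo < hi) {
		int mid = (lo + hi + 1) / 2;

		if (idx->entry[mid].hash <= hash)
			lo = mid;
		else
			hi = mid - 1;
	}
	return idx->entry[lo].block;
}

/* Two index probes, then one leaf block left to scan for the name. */
static uint32_t htree_leaf_for_hash(const struct dx_block *root,
				    const struct dx_block *(*read_index)(uint32_t),
				    uint32_t hash)
{
	const struct dx_block *second = read_index(dx_find(root, hash));

	return dx_find(second, hash);
}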

BTW, the discussion in this thread has been very interesting, it just
isn't entirely relevant to my patch :-)

--
Daniel

2001-02-21 23:49:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> Have you looked at the structure and algorithms I'm using? I would not
> call this a hash table, nor is it a btree. It's a 'hash-keyed
> uniform-depth tree'. It never needs to be rehashed (though it might be
> worthwhile compacting it at some point). It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
>

I'm curious how you do that. It seems each level would have to be 64K
large in order to do that, with a minimum disk space consumption of 128K
for a directory. That seems extremely painful *except* in the case of
hysterically large directories, which tend to be the exception even on
filesystems where they occur.

I think I'd rather take the extra complexity and rebalancing cost of a
B-tree.

> This thing deserves a name of its own. I call it an 'htree'. The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
>
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
>
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-21 23:52:01

by Davide Libenzi

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


On 21-Feb-2001 Daniel Phillips wrote:
> "H. Peter Anvin" wrote:
>>
>> Martin Mares wrote:
>> >
>> > > True. Note too, though, that on a filesystem (which we are, after all,
>> > > talking about), if you assume a large linear space you have to create a
>> > > file, which means you need to multiply the cost of all random-access
>> > > operations with O(log n).
>> >
>> > One could avoid this, but it would mean designing the whole filesystem in
>> > a
>> > completely different way -- merge all directories to a single gigantic
>> > hash table and use (directory ID,file name) as a key, but we were
>> > originally
>> > talking about extending ext2, so such massive changes are out of question
>> > and your log n access argument is right.
>>
>> It would still be tricky since you have to have actual files in the
>> filesystem as well.
>
> Have you looked at the structure and algorithms I'm using? I would not
> call this a hash table, nor is it a btree. It's a 'hash-keyed
> uniform-depth tree'. It never needs to be rehashed (though it might be
> worthwhile compacting it at some point). It also never needs to be
> rebalanced - it's only two levels deep for up to 50 million files.
>
> This thing deserves a name of its own. I call it an 'htree'. The
> performance should speak for itself - 150 usec/create across 90,000
> files and still a few optmizations to go.
>
> Random access runs at similar speeds too, it's not just taking advantage
> of a long sequence of insertions into the same directory.
>
> BTW, the discussion in this thread has been very interesting, it just
> isn't entirely relevant to my patch :-)

Daniel,

I'm not at all saying that your algo is not good.
I use something very like it in my mail server ( XMail ) to index mail queue
files, which use a two-level-deep fs splitting.
Mine was only a hint to try different types of directory indexing.



- Davide

2001-02-21 23:58:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Followup to: <[email protected]>
By author: [email protected] (Linus Torvalds)
In newsgroup: linux.dev.kernel
>
> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing. For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>

Actually, the VFS name hash I think is derived from the "Dragon Book"
hash (via autofs), so it's not like it's completely untested.

-hpa
--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 00:00:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Davide Libenzi <[email protected]> wrote:
>
>Yep, 4 is not good as a shifting factor. Prime number are the better choice for
>this stuff.

Oh, absolutely.

It looks like the hash function was done rather early on in the dcache
lifetime (one of the first things), back when nobody cared about whether
it was really good or not because there were many much more complicated
questions like "how the h*ll will this all ever work" ;)

And at no point did anybody ever go back and verify whether the hash
function made much sense or not.

We had another boo-boo with the actual _folding_ of the "full" hash
value into the actual hash chain pointer that is done when the name
cache is actually looked up, which was even more embarrassing: even if
the hash ended up being ok, we would remove most of the valid bits from
it because it would under certain circumstances (512MB of RAM on x86)
basically xor itself with itself.

That took quite a while to find too - the code still worked fine, it
just had a horrible distribution on machines with half a gig of memory.

>The issue to have a good distribution is not only to have a good hashing
>function, but also to give this function not correlated data.
>Good hashing function for a Domain A may not be so good for a Domain B.

This is not something we can do all that much about. The data we get is
generated by the user, and can basically be a random string of
characters. HOWEVER, there are certainly tons of _usual_ data, and
while there's no way to select the data we can at least try to make sure
that the distribution is good for the normal case (ie regular ASCII
filenames, not forgetting the fact that many people use more interesting
encodings).

Linus

2001-02-22 00:36:27

by Ed Tomlinson

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds <[email protected]> wrote:
>
> Ed Tomlinson <[email protected]> wrote:
> >The default in reiserfs is now the R5 hash, but you are right that lots
> > of effort went into finding this hash.  This includes testing various
> > hashes on real directory structures to see which one worked best.  R5
> > won.
>
> That's interesting.  The R5 hash is easily also the only one of the
> reiser hashes that might be useable for the generic VFS hashing.  It's
> not so different in spirit from the current one, and if you've done the
> work to test it, it's bound to be a lot better.

It was not me personally.  I just remembered the thread (from June 2000) on
the reiserfs list...  I have summarized the results for you below.

For the program see: http://www.jedi.claranet.fr/hash_torture.tar.gz

Ed

PS.  I am still seeing hangs (with 2.4.2pre2, then I switched to ac7 or so,
and have had hangs with all the pre and ac kernels tried, and that is most
of them).  ac20 plus the latest reiserfs fixes has stayed up 8 hours so far -
it can take two or three days to trigger the hang though.  When it hangs it
is really dead: a UPS connected via a serial port cannot shut it down, pings
to the box fail, Alt+SysRQ is dead, and the software watchdog does not
trigger a reboot.  Ideas?

> (The current VFS name hash is probably _really_ stupid - I think it's
> still my original one, and nobody probably ever even tried to run it
> through any testing.  For example, I bet that using a shift factor of 4
> is really bad, because it evenly divides a byte, which together with the
> xor means that you can really easily generate trivial bad cases).
>
> What did you use for a test-case? Real-life directory contents? Did you
> do any worst-case analysis too?
>
>		Linus


Some test results from June 2000, with Hans's summary first.
---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: Hans Reiser <[email protected]>
To: "Yury Yu. Rupasov" <[email protected]>
Cc: Jedi/Sector One <[email protected]>, Petru Paler <[email protected]>,
"[email protected]" <[email protected]>, Yury Shevchuk
<[email protected]>


Ok, based on this benchmark let's put rupasov5 in, and warn users who choose
the currently used rupasov1 hash that rupasov5 has obsoleted it.  Do this in
both 3.6 and 3.5, and fix the delimiting key check bug in 3.5 REISERFS_CHECK
at the same time.  Cut the patch, start testing, and see if you can release
by Monday.  Make rupasov5 the default.  sizif, review the documentation he
creates for users.

Jedi, if you disagree with the benchmarks let me know.  You might try
concatenating two filenames together instead of adding a digit to them, or
running find on a really large FS, to improve these tests.  Thanks for
helping us with analyzing the different hash methods available, Jedi.

Hans

---------------------------------------------------------------
(reiserfs) Re: r5 hash
From: "Yury Yu. Rupasov" <[email protected]>
To: Hans Reiser <[email protected]>
Cc: Jedi/Sector One <[email protected]>, Petru Paler <[email protected]>,
"[email protected]" <[email protected]>, Yury Shevchuk
<[email protected]>


Hans Reiser wrote:
>
> What is the speed of the real filenames, not just the number of collisions.
>

Ok, here are the results for real names:

# find / -type d -exec ls {} \; | sort | uniq > allfiles.txt

# wc -l allfiles.txt
161101 allfiles.txt

Collisions for 161 101 names:

tea_hash  :  784 total,  2 dangerous
jedi_hash2:  957 total,  2 dangerous
r5_hash   : 1191 total,  2 dangerous
r7_hash   : 8439 total, 18 dangerous


The speed for 161 101 real names:

create 161101 files of 10 bytes with names from allfiles.txt

# time create d1 allfiles.txt
# time cp d1 d2 -r
# time rm d1 -r

              create       copy        remove
             ---------------------------------
tea_hash   : 1m27.223s    5m43.069s   2m33.449s
jedi_hash2 : 1m26.062s    5m40.872s   2m32.795s
r5_hash    : 1m16.729s    4m14.967s   1m53.037s
r7_hash    : 1m10.665s    3m34.950s   1m39.756s


As you can see, the results differ, but not too much. :)
The situation changes dramatically if we test 1 million files.

The same test, but with numbers from 0 to 6 appended to each name from
allfiles.txt (1 127 707 files):

              create       copy        remove
             ---------------------------------
tea_hash   : 81m44.449s
jedi_hash2 : 79m46.419s
r5_hash    : 15m56.037s
r7_hash    : 15m30.680s

Dual Celeron 500, 128 MB RAM, 8 GB scsi HDD
Reiserfs-3.5.21, Linux-2.2.15

Thanks,
Yura.
---------------------------------------------------------------
(reiserfs) Torture results
From: Jedi/Sector One <[email protected]>
To: [email protected]


  Here are the results of the hash torture on a Celeron 300.
  Once again, you can subtract 1 from the dangerous collisions numbers.
  Xuan, can you provide a test for the case the Rupasov hash was designed
for?
  Anyway, I don't really see why large directories should have similar
file names, rather than keywords.

  Best regards,
--
        Frank DENIS aka Jedi/Sector One aka DJ Chrysalis <[email protected]>
                -> Software : http://www.jedi.claranet.fr <-
      If Bill Gates had a dime for every time a Windows box crashed...
                ...oh, wait a minute -- he already does.


********************** /usr/dict/words test **********************

Trying with 45402 words

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea                45   1   (ffff980)     0m0.145s   0m0.120s   0m0.010s
rupasov           553   1   (ffffe00)     0m0.297s   0m0.260s   0m0.020s
r5                185   1   (ffae000)     0m0.124s   0m0.080s   0m0.030s
r7               2528   1   (fffd400)     0m0.121s   0m0.100s   0m0.000s
jedi               54   1   (fff9780)     0m0.122s   0m0.100s   0m0.010s
jedi2              93   1   (fff9780)     0m0.122s   0m0.090s   0m0.020s
lookup2            63   1   (ffff480)     0m0.123s   0m0.100s   0m0.000s

********************** Squid names test **********************

Trying with 458752 squid cache entries

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea              6237   1   (fffff80)     0m1.138s   0m1.090s   0m0.030s
rupasov        377520   1   (e32700)      0m2.588s   0m2.550s   0m0.020s
r5             309991   1   (55406b80)    0m0.940s   0m0.880s   0m0.040s
r7             449006   2   (22b16580)    0m0.928s   0m0.840s   0m0.070s
jedi             2771   1   (fffef80)     0m0.928s   0m0.860s   0m0.050s
jedi2               0   1   (ffff80)      0m0.879s   0m0.810s   0m0.050s
lookup2          6203   1   (fffdc00)     0m0.930s   0m0.840s   0m0.080s

********************** Real names test **********************

Trying with 89830 files

Hash       Collisions   Dangerous         real       user       sys
--------   ----------   --------------    --------   --------   --------
tea               237   1   (fff5580)     0m0.276s   0m0.250s   0m0.000s
rupasov          6288   1   (ffee080)     0m0.582s   0m0.560s   0m0.010s
r5               3920   1   (fff4600)     0m0.230s   0m0.190s   0m0.020s
r7              11801   1   (fff580)      0m0.225s   0m0.180s   0m0.030s
jedi              269   1   (fff9f80)     0m0.226s   0m0.200s   0m0.010s
jedi2             415   1   (fff9f80)     0m0.225s   0m0.200s   0m0.010s
lookup2           223   1   (ffff480)     0m0.230s   0m0.210s   0m0.000s

----------------------------------------------------------------------------------------

(reiserfs) hash torture results
From: Petru Paler <[email protected]>
To: [email protected]


Machine: AMD Athlon/650MHz, 128Mb RAM, Quantum Fireball lct15 IDE hdd
(UDMA/66 but that doesn't matter). Kernel 2.4.0-test1-ac10.

The results are interesting, but more interesting would be to see how fast
reiserfs actually is with each of these hashes.

Script output:

********************** /usr/dict/words test **********************

Trying with 45402 words

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea                45   1   (ffff980)     0.00    0.01     0:00.08    11%
rupasov           553   1   (ffffe00)     0.00    0.00     0:00.18     0%
r5                185   1   (ffae000)     0.00    0.00     0:00.08     0%
r7               2528   1   (fffd400)     0.00    0.01     0:00.07    12%
jedi               54   1   (fff9780)     0.00    0.00     0:00.08     0%
jedi2              93   1   (fff9780)     0.00    0.00     0:00.07     0%
lookup2            63   1   (ffff480)     0.00    0.00     0:00.07     0%

********************** Squid names test **********************

Trying with 262144 squid cache entries

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea              2019   1   (ffff880)     0.00    0.01     0:00.47     2%
rupasov        210912   1   (a88f00)      0.00    0.02     0:01.03     1%
r5             171912   1   (54ca7680)    0.00    0.03     0:00.41     7%
r7             256171   6   (22aa0600)    0.00    0.03     0:00.41     7%
jedi              589   1   (fffda00)     0.00    0.02     0:00.42     4%
jedi2               0   1   (ffff80)      0.00    0.00     0:00.40     0%
lookup2          2041   1   (fffdc00)     0.00    0.01     0:00.40     2%

********************** Real names test **********************

find: /proc/31112/fd/4: No such file or directory
Trying with 94836 files

Hash       Collisions   Dangerous         user    system   elapsed    %CPU
--------   ----------   --------------    -----   ------   --------   ----
tea               235   1   (fff5e80)     0.00    0.00     0:00.20     0%
rupasov          2016   1   (fffab80)     0.01    0.00     0:00.46     2%
r5                495   1   (fff8780)     0.00    0.00     0:00.17     0%
r7               8162   1   (fff580)      0.00    0.02     0:00.17    11%
jedi              331   1   (ffe400)      0.00    0.00     0:00.17     0%
jedi2             341   1   (ffe400)      0.00    0.00     0:00.17     0%
lookup2           298   1   (fffb700)     0.00    0.00     0:00.17     0%

-Petru

2001-02-22 01:24:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Daniel Phillips wrote:
> >
> > Have you looked at the structure and algorithms I'm using? I would not
> > call this a hash table, nor is it a btree. It's a 'hash-keyed
> > uniform-depth tree'. It never needs to be rehashed (though it might be
> > worthwhile compacting it at some point). It also never needs to be
> > rebalanced - it's only two levels deep for up to 50 million files.
>
> I'm curious how you do that. It seems each level would have to be 64K
> large in order to do that, with a minimum disk space consumption of 128K
> for a directory. That seems extremely painful *except* in the case of
> hysterically large directories, which tend to be the exception even on
> filesystems where they occur.

Easy: with an average dirent reclen of 16 bytes, each directory leaf block
can hold up to 256 entries. Each index block indexes 512 directory
blocks and the root indexes 511 index blocks. Assuming the leaves are
on average 75% full this gives:

(4096 / 16) * 512 * 511 * .75 = 50,233,344

In practice I'm getting a little more than 90,000 entries indexed by a
*single* index block (the root) so I'm not just making this up.

> I think I'd rather take the extra complexity and rebalancing cost of a
> B-tree.

Do you still think so?

--
Daniel

2001-02-22 01:43:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> "H. Peter Anvin" wrote:
> >
> > Daniel Phillips wrote:
> > >
> > > Have you looked at the structure and algorithms I'm using? I would not
> > > call this a hash table, nor is it a btree. It's a 'hash-keyed
> > > uniform-depth tree'. It never needs to be rehashed (though it might be
> > > worthwhile compacting it at some point). It also never needs to be
> > > rebalanced - it's only two levels deep for up to 50 million files.
> >
> > I'm curious how you do that. It seems each level would have to be 64K
> > large in order to do that, with a minimum disk space consumption of 128K
> > for a directory. That seems extremely painful *except* in the case of
> > hysterically large directories, which tend to be the exception even on
> > filesystems where they occur.
>
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can holds up to 256 entries. Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks. Assuming the leaves are
> on average 75% full this gives:
>
> (4096 / 16) * 512 * 511 * .75 = 50,233,344
>

That's a three-level tree, not a two-level tree.

> I practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.
>
> > I think I'd rather take the extra complexity and rebalancing cost of a
> > B-tree.
>
> Do you still think so?

I think so.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 02:04:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips writes:
> Easy, with average dirent reclen of 16 bytes each directory leaf block
> can holds up to 256 entries. Each index block indexes 512 directory
> blocks and the root indexes 511 index blocks. Assuming the leaves are
> on average 75% full this gives:
>
> (4096 / 16) * 512 * 511 * .75 = 50,233,344
>
> I practice I'm getting a little more than 90,000 entries indexed by a
> *single* index block (the root) so I'm not just making this up.

I was just doing the math for 1k ext2 filesystems, and the numbers aren't
nearly as nice. We get:

(1024 / 16) * 127 * .75 = 6096 # 1 level
(1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels

Basically (IMHO) we will not really get any noticeable benefit with 1-level
index blocks for a 1k filesystem - my estimates at least are that the break-even
point is about 5k files. We _should_ be OK with 780k files in a single
directory for a while. Looks like we will need 2-level indexes sooner than
you would think though. Note that tests on my workstation showed an average
filename length of 10 characters (excluding MP3s at 78 characters), so this
would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 02:30:04

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds wrote:
>
> On Tue, 20 Feb 2001, Daniel Phillips wrote:
> >
> > You mean full_name_hash? I will un-static it and try it. I should have
> > some statistics tomorrow. I have a couple of simple metrics for
> > measuring the effectiveness of the hash function: the uniformity of
> > the hash space splitting (which in turn affects the average fullness
> > of directory leaves) and speed.
>
> I was more thinking about just using "dentry->d_name->hash" directly, and
> not worrying about how that hash was computed. Yes, for ext2 it will have
> the same value as "full_name_hash" - the difference really being that
> d_hash has already been precomputed for you anyway.
>
> > Let the hash races begin.
>
> Note that dentry->d_name->hash is really quick (no extra computation), but
> I'm not claiming that it has anything like a CRC quality. And it's
> probably a bad idea to use it, because in theory at least the VFS layer
> might decide to switch the hash function around. I'm more interested in
> hearing whether it's a good hash, and maybe we could improve the VFS hash
> enough that there's no reason to use anything else..

In the first heat of hash races - creating 20,000 files in one directory
- dentry::hash lost out to my original hack::dx_hash, causing a high
percentage of leaf blocks to remain exactly half full and slowing down
the whole thing by about 5%. (This was under uml - I haven't tried it
native yet but I expect the results to be similar.)

Contender Result
========= ======
dentry::hash Average fullness = 2352 (57%)
hack::dx_hash Average fullness = 2758 (67%)

This suggests that dentry::hash is producing distinctly non-dispersed
results and needs to be subjected to further scrutiny. I'll run the
next heat of hash races tomorrow, probably with R5, and CRC32 too if I
have time.

--
Daniel

2001-02-22 02:41:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
>
> Basically (IMHO) we will not really get any noticable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files. We _should_ be OK with 780k files in a single
> directory for a while.
>

I've had a news server with 2000000 files in one directory. Such a
filesystem is likely to use small blocks, too, because each file is
generally small.

This is an important connection: filesystems which have lots and lots of
small files will have large directories and small block sizes.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 03:10:17

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
>
> Daniel Phillips writes:
> > Easy, with average dirent reclen of 16 bytes each directory leaf block
> > can holds up to 256 entries. Each index block indexes 512 directory
> > blocks and the root indexes 511 index blocks. Assuming the leaves are
> > on average 75% full this gives:
> >
> > (4096 / 16) * 512 * 511 * .75 = 50,233,344
> >
> > I practice I'm getting a little more than 90,000 entries indexed by a
> > *single* index block (the root) so I'm not just making this up.
>
> I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> nearly as nice. We get:
>
> (1024 / 16) * 127 * .75 = 6096 # 1 level
> (1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels
>
> Basically (IMHO) we will not really get any noticable benefit with 1 level
> index blocks for a 1k filesystem - my estimates at least are that the break
> even point is about 5k files. We _should_ be OK with 780k files in a single
> directory for a while. Looks like we will need 2-level indexes sooner than
> you would think though. Note that tests on my workstation showed an average
> filename length of 10 characters (excluding MP3s at 78 characters), so this
> would give 20-byte (or 88-byte) dirents for ext3, reducing the files count
> to 4857 and 621792 (or 78183 and 40029696 for 4k filesystems) at 75% full.

But you are getting over 3/4 million files in one directory on a 1K
blocksize system, and you really shouldn't be using 1K blocks on a
filesystem under that big a load. Is it just to reduce tail block
fragmentation? That's what tail merging is for - it does a much better
job than shrinking the block size.

But if you are *determined* to use 1K blocks and have more than 1/2
million files in one directory then I suppose a 3rd level is what you
need. The uniform-depth tree still works just fine and still doesn't
need to be rebalanced - it's never out of balance.
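
For concreteness, here is the back-of-the-envelope capacity arithmetic
as a small standalone program (just a sketch using the figures quoted
in this thread - 16-byte average dirents, 8-byte index entries, 75%
leaf fullness - not code from the patch):

#include <stdio.h>

/* rough htree capacity: blocksize in bytes, levels of index blocks */
static unsigned long htree_capacity(unsigned long blocksize, int levels,
                                    double fullness)
{
        unsigned long dirents_per_leaf = blocksize / 16; /* avg reclen 16 */
        unsigned long fanout = blocksize / 8;            /* 8-byte entries */
        unsigned long leaves = fanout - 1;               /* root holds header */
        int i;

        for (i = 1; i < levels; i++)
                leaves *= fanout;
        return (unsigned long) (leaves * dirents_per_leaf * fullness);
}

int main(void)
{
        /* 4K blocks, 2 levels: prints 50233344, the figure above */
        printf("%lu\n", htree_capacity(4096, 2, 0.75));
        /* 1K blocks, 2 levels: prints 780288, Andreas' figure above */
        printf("%lu\n", htree_capacity(1024, 2, 0.75));
        return 0;
}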

--
Daniel

2001-02-22 03:31:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Thu, 22 Feb 2001, Daniel Phillips wrote:
>
> In the first heat of hash races - creating 20,000 files in one directory
> - dentry::hash lost out to my original hack::dx_hash, causing a high
> percentage of leaf blocks to remain exactly half full and slowing down
> the whole thing by about 5%. (This was under uml - I haven't tried it
> native yet but I expect the results to be similar.)
>
> Contender Result
> ========= ======
> dentry::hash Average fullness = 2352 (57%)
> hack::dx_hash Average fullness = 2758 (67%)
>
> This suggests that dentry::hash is producing distinctly non-dispersed
> results and needs to be subjected to further scrutiny. I'll run the
> next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> have time.

I'd love to hear the results from R5, as that seems to be the reiserfs
favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
in..

Linus

2001-02-22 03:44:42

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

"H. Peter Anvin" wrote:
>
> Andreas Dilger wrote:
> >
> > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > index blocks for a 1k filesystem - my estimates at least are that the break
> > even point is about 5k files. We _should_ be OK with 780k files in a single
> > directory for a while.
> >
>
> I've had a news server with 2000000 files in one directory. Such a
> filesystem is likely to use small blocks, too, because each file is
> generally small.
>
> This is an important connection: filesystems which have lots and lots of
> small files will have large directories and small block sizes.

I mentioned this earlier but it's worth repeating: the desire to use a
small block size is purely an artifact of the fact that ext2 has no
handling for tail block fragmentation. That's a temporary situation -
once we've dealt with it your 2,000,000 file directory will be happier
with 4K filesystem blocks. There will be a lot fewer metadata index
blocks in your directory file, for one thing. Another practical matter
is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
result friendlier to the page cache and memory manager.

--
Daniel

2001-02-22 04:03:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> "H. Peter Anvin" wrote:
> >
> > Andreas Dilger wrote:
> > >
> > > Basically (IMHO) we will not really get any noticeable benefit with 1 level
> > > index blocks for a 1k filesystem - my estimates at least are that the break
> > > even point is about 5k files. We _should_ be OK with 780k files in a single
> > > directory for a while.
> > >
> >
> > I've had a news server with 2000000 files in one directory. Such a
> > filesystem is likely to use small blocks, too, because each file is
> > generally small.
> >
> > This is an important connection: filesystems which have lots and lots of
> > small files will have large directories and small block sizes.
>
> I mentioned this earlier but it's worth repeating: the desire to use a
> small block size is purely an artifact of the fact that ext2 has no
> handling for tail block fragmentation. That's a temporary situation -
> once we've dealt with it your 2,000,000 file directory will be happier
> with 4K filesystem blocks. There will be a lot fewer metadata index
> blocks in your directory file, for one thing. Another practical matter
> is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> result friendlier to the page cache and memory manager.
>

Well, that's something I really don't expect to see anymore -- this
"purely temporary situation" is now already 7 years old at least.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 04:03:08

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Daniel Phillips <[email protected]> wrote:
>
>I mentioned this earlier but it's worth repeating: the desire to use a
>small block size is purely an artifact of the fact that ext2 has no
>handling for tail block fragmentation. That's a temporary situation -
>once we've dealt with it your 2,000,000 file directory will be happier
>with 4K filesystem blocks.

I'd rather see a whole new filesystem than have ext2 do tail-block
fragmentation.

Once you do tail fragments, you might as well do the whole filesystem
over and have it do fancier stuff than just handling sub-blocking.

Another way of saying this: if you go to the complexity of no longer
being a purely block-based filesystem, please go the whole way. Make the
thing be extent-based, and get away from the notion that you have to
allocate blocks one at a time. Make the blocksize something nice and
big, not just 4kB or 8kB or something.

And don't call it ext2.

Linus

2001-02-22 04:03:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Daniel Phillips wrote:
>
> There will be a lot fewer metadata index
> blocks in your directory file, for one thing.
>

Oh yes, another thing: a B-tree directory structure does not need
metadata index blocks.

-hpa

--
<[email protected]> at work, <[email protected]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt

2001-02-22 05:20:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

In article <[email protected]>,
Linus Torvalds <[email protected]> wrote:
>
>Another way of saying this: if you go to the complexity of no longer
>being a purely block-based filesystem, please go the whole way. Make the
>thing be extent-based, and get away from the notion that you have to
>allocate blocks one at a time. Make the blocksize something nice and
>big, not just 4kB or 8kB or something.

Btw, this is also going to be a VM and performance issue some time in
the future. There are already CPU's that would _love_ to have 64kB
pages etc, and as such a filesystem that doesn't play with the old silly
"everything is a block" rules would be much appreciated with the kind of
people who have multi-gigabyte files and want to read in big chunks at a
time.

So either you have a simple block-based filesystem (current ext2, no
extents, no crapola), or you decide to do it over. Don't do some
half-way thing, please.

Linus

2001-02-22 06:23:29

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Daniel,

Nice work!

A couple of comments. If you make the beginning of each index block
look like an empty directory block (i.e., the first 8 bytes look like
this):

32 bits: ino == 0
16 bits: rec_len == blocksize
16 bits: name_len = 0

... then you will have full backwards compatibility, both for reading
*and* writing. When reading, old kernels will simply ignore the index
blocks, since it looks like it has an unpopulated directory entry. And
if the kernel attempts to write into the directory, it will clear the
BTREE_FL flag, in which case new kernels won't treat the directory as a
tree anymore. (Running a smart e2fsck which knows about directory trees
will be able to restore the tree structure).
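
For concreteness, the dummy entry might be stamped something like this
(a sketch following the description above, using userspace types for
illustration - not code from the actual patch; on disk the fields are
little-endian):

#include <stdint.h>
#include <string.h>

struct fake_dirent {
        uint32_t inode;         /* 0: old kernels treat the entry as unused */
        uint16_t rec_len;       /* blocksize: the "entry" spans the block */
        uint16_t name_len;      /* 0: no name */
};

/* Make an index block look like one big empty dirent to old ext2 code. */
static void stamp_fake_dirent(unsigned char *block, unsigned int blocksize)
{
        struct fake_dirent fake = { 0, (uint16_t) blocksize, 0 };

        memcpy(block, &fake, sizeof(fake));
        /* the 8-byte index entries would follow this 8-byte header */
}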

Is it worth it? Well, it means you lose an index entry from each
directory block, thus reducing your fanout at each node of the tree by a
worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
blocksizes.

- Ted

2001-02-22 07:05:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

HPA writes:
> Daniel Phillips wrote:
> > I mentioned this earlier but it's worth repeating: the desire to use a
> > small block size is purely an artifact of the fact that ext2 has no
> > handling for tail block fragmentation. That's a temporary situation -
> > once we've dealt with it your 2,000,000 file directory will be happier
> > with 4K filesystem blocks. There will be a lot fewer metadata index
> > blocks in your directory file, for one thing. Another practical matter
> > is that 4K filesystem blocks map directly to 4K PAGE_SIZE and are as a
> > result friendlier to the page cache and memory manager.
> >
>
> Well, that's something I really don't expect to see anymore -- this
> "purely temporary situation" is now already 7 years old at least.

Peter, you're barking up the wrong tree - Daniel has had an ext2 tail
merging patch around for 6 months or more... However, from the sounds
of it, Linus may not want such a thing in ext2 (at least not until he
is convinced otherwise). It will be interesting to compare ext2 +
ongoing patches vs. new filesystems like reiserfs, XFS, JFS -- not only
speed, but reliability as well. XFS and JFS have previous implementations
to work with (although the JFS code is not the AIX JFS code), but reiserfs
has a long way to go, just from the standpoint of being run on millions
of machines, and being looked at by thousands of programmers.

I think people will be surprised at how ext2 + patches will continue to
improve. One of the reasons (despite Linus' misgivings, IMHO) is that
ext2 is continually being improved by small measures, has lots of eyes
on the code, and it offers a stable base for each improvement - which
means each improvement is stable and reliable much quicker than if you
were to code a new filesystem from scratch for each new feature.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 07:21:26

by Bill Wendling

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Also sprach H. Peter Anvin:
} Martin Mares wrote:
} >
} > Hello!
} >
} > > True. Note too, though, that on a filesystem (which we are, after all,
} > > talking about), if you assume a large linear space you have to create a
} > > file, which means you need to multiply the cost of all random-access
} > > operations with O(log n).
} >
} > One could avoid this, but it would mean designing the whole filesystem in a
} > completely different way -- merge all directories to a single gigantic
} > hash table and use (directory ID,file name) as a key, but we were originally
} > talking about extending ext2, so such massive changes are out of question
} > and your log n access argument is right.
} >
}
} It would still be tricky since you have to have actual files in the
} filesystem as well.
}
But that's just a user space issue, isn't it.

(Just kidding :-)

--
|| Bill Wendling [email protected]

2001-02-22 07:29:47

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

On Thu, 22 Feb 2001, [email protected] wrote:
> A couple of comments. If you make the beginning of each index block
> look like an empty directory block (i.e., the first 8 bytes look like
> this):
>
> 32 bits: ino == 0
> 16 bits: rec_len == blocksize
> 16 bits: name_len = 0
>
> ... then you will have full backwards compatibility, both for reading
> *and* writing. When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry. And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore. (Running a smart e2fsck which knows about directory trees
> will be able to restore the tree structure).

:-) That's really nice, now I see what you were thinking about with
all those bit clears.

> Is it worth it? Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
> blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution. I'll maintain both
solutions in the code for now so it's just as easy to choose either in the end.

--
Daniel

2001-02-22 08:09:56

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] [LONG] Near-constant time directory index for Ext2

Daniel Phillips writes:
> Andreas Dilger wrote:
> > I was just doing the math for 1k ext2 filesystems, and the numbers aren't
> > nearly as nice. We get:
> >
> > (1024 / 16) * 127 * .75 = 6096 # 1 level
> > (1024 / 16) * 128 * 127 * .75 = 780288 # 2 levels
>
> But if you are *determined* to use 1K blocks and have more than 1/2
> million files in one directory then I suppose a 3rd level is what you
> need. The uniform-depth tree still works just fine and still doesn't
> need to be rebalanced - it's never out of balance.

I would rather simply go to some chained block scheme at that point.
ext2 is already fairly fast at linear searching, so if we index a HUGE
directory we are still linearly searching only 1/2^16 of the directory
(at worst for 1k blocks, 1/2^20 for 4k blocks).

I just had a clever idea - on a single-level index you put the header
and index data in block 0, and put the directory data in the first
indirect block (11 sparse blocks, instead of 511). If you need to go
to a second-level index, you can simply shift the indirect data block to
be a double-indirect block, and start the level-2 index in the first
indirect block. If we ever need a third-level index, you basically do
the same thing - move the double-indirect blocks to triple-indirect,
and put the level-3 index in the double-indirect block. It will always
fit, because the index branching factor is half the indirect block
branching factor, since each index entry carries an extra 4-byte hash value.

Andreas:
>> One thing I was thinking was that you could put "." and ".." in the first
>> block (like usual), and then put the index data after that. This way
>> "." and ".." still exist and e2fsck and the kernel code doesn't complain,
>> except about the sparse directory blocks.

Daniel:
>The kernel code - ext2 fs that is - doesn't complain at the moment
>because I removed the complaint, and everything seems to be fine. All
>references to "." and ".." are now intercepted and never reach the
>filesystem level. If they did then I'd just fix ext2_is_empty_dir to
>tolerate those entries being somewhere other than the first block.
>But, reading ahead, I see you are talking about forward compatibility...

One of the (many) benefits of ext2 is that it has tried to maintain
compatibility as much as possible, if possible. In this case, I
don't see that there is an overwhelming reason to NOT keep compatibility,
and I think Ted agrees:

Ted Ts'o writes:
> E2fsck uses '..' to be able to climb up the directory tree when it needs
> to print a pathname given only a directory inode. So yes, removing it
> will cause e2fsck to not work as well. '.' is not as useful, but it's
> useful as a sanity check.

> Of course, if we completely give up on compatibility, we don't actually
> need to have special directory entries for '.' and '..' complete with
> their names; we can just store the inode numbers for both in a 32bit
> field along with the indexes. (And so magic number for sanity checking;
> magic numbers are good things....)

Having real dirents for "." and ".." only costs 16 more bytes (2 index
entries), compared to only keeping the inode numbers.

Andreas:
> > So, we would have (for the root entry, let's say):
(in directory block 0)

> > ext2_dir_entry_2{ EXT2_ROOT_INO, 12, 1, EXT2_FT_DIR, ".\0\0\0"}
> > ext2_dir_entry_2{ EXT2_ROOT_INO, <blocksize> - 12, 2, EXT2_FT_DIR, "..\0\0"}
> > <index magic (maybe)>
> > <index header>
> > <index data>
> >
> > For the index ext2 kernel code, it would notice the EXT2_INDEX_FL and
> > access the data after the end of the ".." dir entry, and this would also
> > give you read-only "compatibility" of sorts with older kernels (modulo
> > calling ext2_error() for all of the sparse blocks before the start of the
> > actual directory data blocks). You lose 24 bytes of data in the first
> > block, but gain some compatibility. For second-level index blocks, if you
> > want to keep compatibility you lose 8 bytes each block if you start with:
> >
> > ext2_dir_entry_2 { 0, <blocksize>, 0, EXT2_FT_DIR, "" }
> > <index magic (maybe)>
> > <second level index data>

Daniel:
> I really think INCOMPAT is the way to go and if you must mount it with
> an old kernel, do a fsck. Old fsck manages to put things back into a
> form it can understand without too much difficulty, though you do have
> to answer quite a few questions. The exact answers you give don't seem
> to be that important.

You don't always have the luxury to go back to an old kernel (for whatever
reason), and if an incompat flag is set the kernel will refuse to mount
your old filesystem. If this is your root, you can't even run fsck. Yes,
I know - have a rescue disk/partition - but _sometimes_ you are just stuck
and it would be good to not get into that situation in the first place.

Andreas:
> > Will there be a lower limit at which you create indexed directories?

Daniel:
> Yes, I hashed that out today with Al Viro on #kernelnewbies. The
> breakeven happens at 3 directory blocks.

Andreas:
> > I guess the tradeoff is if you have to index all of the existing entries
> > in a non-indexed directory. However, you need to handle this anyways if
> > you are updating an existing non-indexed directory from an old filesystem.

Daniel:
> If I do the optimization just for the first directory block then it's
> very nearly free - just one extra read of the first directory block,
> and it's almost certainly in cache anyway because it was just read to
> see if the entry already exists.

But you still need to handle the case for an arbitrary-sized non-indexed
directory, if you want to be able to upgrade an existing ext2 filesystem.
Since you need this, you may as well only turn on indexing when you are
actually getting a speed benefit, because doing anything else still
wastes space. It may even be that indexing a large existing directory
and _then_ doing the lookup is still faster than doing the lookup on the
original un-indexed directory...

Ted writes:
> A couple of comments. If you make the beginning of each index block
> look like a an empty directory block:
>
> 32 bits: ino == 0
> 16 bits: rec_len == blocksize
> 16 bits: name_len = 0

This is what I also suggested for second-level index blocks above.
However, for a single-level index, blocks 1-511 (1-127 on a 1k filesystem)
will be sparse, because they will be unused - we don't want to have 511
(or 127) real empty dir blocks just for compatibility on a single-level
index. The ext2 dir code handles the case of a sparse directory block
with an ext2_error() and continues. By default ext2_error() is just
a printk, and on the only system I have seen where it is otherwise
(Debian), it is remount-ro for root only.

> ... then you will have full backwards compatibility, both for reading
> *and* writing. When reading, old kernels will simply ignore the index
> blocks, since it looks like it has an unpopulated directory entry. And
> if the kernel attempts to write into the directory, it will clear the
> BTREE_FL flag, in which case new kernels won't treat the directory as a
> tree anymore.

Yes, I had something like this on the tip of my brain as well. When you
boot with a non-index ext2 kernel, it will naturally find free space in
the first block, immediately after "." and ".." (with the setup above).
Not only will it clear BTREE_FL, it will also overwrite the index magic
(if we have one) so we definitely know that the index is not valid.
Since the index head is only using 4 of the 8 bytes needed for alignment,
we could stick in a 4 byte magic before or after the index header, and
still be assured that it will be overwritten by a new dirent.

Full COMPAT support would be a win, IMHO. You could leave it to e2fsck
to do reindexing, or the next time a file is added (or even removed)
from a candidate directory it could do the reindexing, which it needs
to be able to do for compatibility with old filesystems.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 08:35:00

by Rogier Wolff

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

H. Peter Anvin wrote:
> Martin Mares wrote:
> >
> > Hello!
> >
> > > True. Note too, though, that on a filesystem (which we are, after all,
> > > talking about), if you assume a large linear space you have to create a
> > > file, which means you need to multiply the cost of all random-access
> > > operations with O(log n).
> >
> > One could avoid this, but it would mean designing the whole filesystem in a
> > completely different way -- merge all directories to a single gigantic
> > hash table and use (directory ID,file name) as a key,

Novell, NTFS, HFS all do this.

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2137555 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* There are old pilots, and there are bold pilots.
* There are also old, bald pilots.

2001-02-22 10:33:09

by Alan

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

> Daniel Phillips wrote:
> >
> > There will be a lot fewer metadata index
> > blocks in your directory file, for one thing.
> >
>
> Oh yes, another thing: a B-tree directory structure does not need
> metadata index blocks.

Before people get excited about complex tree directory indexes, remember to
solve the other 95% before implementation - recovering from lost blocks,
corruption and the like.

2001-02-22 11:32:05

by Ingo Oeser

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hi Linus,
Hi LKML people,

On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> In article <[email protected]>,
> Linus Torvalds <[email protected]> wrote:
> >allocate blocks one at a time. Make the blocksize something nice and
> >big, not just 4kB or 8kB or something.
>
> Btw, this is also going to be a VM and performance issue some time in
> the future. There are already CPU's that would _love_ to have 64kB
> pages etc, and as such a filesystem that doesn't play with the old silly
> "everything is a block" rules would be much appreciated with the kind of
> people who have multi-gigabyte files and want to read in big chunks at a
> time.

For this we need a block remapper layer that can map any
blocksize n to any blocksize m with only the following constraints:

- n and m are powers of 2
- n is a multiple of m

Both should use the page cache (of size p) of course, so it
becomes 2 layers if n > p.

- translating a buffer of n into some pages
- translating a page into buffers of m (current buffercache)

We could limit the translation to 5 powers of 2 above and 5 powers of 2
below PAGE_CACHE_SIZE so that we can maintain a validity bitmap
(2^5 = 32 bits) for each layer if access is too expensive[1].

Some subsystems could certainly benefit from it.

- loop device (with all the crypto stuff)
- LVM
- FSes that support block sizes != PAGE_CACHE_SIZE
- Devices with blocksize != 512 (they no longer have to be
treated as special cases). There are even some rumors
about very perverse blocksizes of 1M and the like.

Since these remapped buffers will look like merged requests, I
see no problems even with the elevator any more.

The question is, where we implement this infrastructure, esp. if
we consider the last user (devices with blocksize != 512).

This has to be answered by the main architects of Linux before
anyone could start.

> So either you have a simple block-based filesystem (current ext2, no
> extents, no crapola), or you decide to do it over. Don't do some
> half-way thing, please.

Daniel (and others) uses ext2 as a playground, because it is
implemented, tested and not that hard to understand and verify.

Hopefully they will switch to a design of their own later, once they
have sufficiently played around with^W^W^Wtested their ideas.

Regards

Ingo Oeser

[1] In buffer cache we use read-modify-write for partial pages,
which hurts performance for them and is annoying for media
with limited write cycles like flash and CD-RW[2].

[2] Yes I know about packet writing mode ;-)
--
10.+11.03.2001 - 3. Chemnitzer LinuxTag <http://www.tu-chemnitz.de/linux/tag>
<<<<<<<<<<<< come and join the fun >>>>>>>>>>>>

2001-02-22 13:20:36

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Daniel Phillips <[email protected]>
Date: Thu, 22 Feb 2001 08:24:08 +0100
Content-Type: text/plain

> Is it worth it? Well, it means you lose an index entry from each
> directory block, thus reducing your fanout at each node of the tree by a
> worst case of 0.7% (1k blocksize) and 0.2% if you're using 4k
> blocksizes.

I'll leave that up to somebody else - we now have two alternatives, the
100%, no-compromise INCOMPAT solution, and the slightly-bruised but
still largely intact forward compatible solution. I'll maintain both
solutions in the code for now so it's just as easy to choose either in the end.

Well, the $64,000 question is exactly how much performance does it cost?
My guess is that it will be barely measurable, but only benchmarks will
answer that question.

- Ted

2001-02-22 16:34:29

by Chris Mason

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Wednesday, February 21, 2001 07:30:47 PM -0800 Linus Torvalds
<[email protected]> wrote:
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
>>
>
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

Quick details, since I don't think I've seen them on l-k yet. r5 was
chosen because it is more tuned to the reiserfs disk format. The location
of a directory item on disk is determined by the hash of the name, and r5
is designed to put similar names close to each other on disk.

The benchmark that shows this best is creating X number of files in a
single dir (named 0001, 0002, 0003 etc). r5 greatly increases the chances
the directory item for 00006 will be right next to the item for 00007. If
the application accesses these files in the same order they were created,
this has benefits at other times than just creation. The benchmarks Ed
posted give a general idea for other naming patterns, but this one is best
case:

Time to create 100,000 files (10 bytes each) with r5 hash: 48s
Time to create 100,000 files (10 bytes each) with tea: 3m58s

The percentage increase just gets bigger as you create more and more files.
That doesn't mean this is a real world case, but it is what the hash was
designed for.

-chris

2001-02-22 18:17:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Daniel writes:
> All references to "." and ".." are now intercepted and never reach the
> filesystem level.

Ted writes:
> From: Daniel Phillips <[email protected]>
>
> I'll leave that up to somebody else - we now have two alternatives, the
> 100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> still largely intact forward compatible solution. I'll maintain both
> solutions in the code for now so it's just as easy to choose either in the end.
>
> Well, the $64,000 question is exactly how much performance does it cost?
> My guess is that it will be barely measurable, but only benchmarks will
> answer that question.

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2? If so, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-22 18:21:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2



On Thu, 22 Feb 2001, Ingo Oeser wrote:
>
> On Wed, Feb 21, 2001 at 09:19:45PM -0800, Linus Torvalds wrote:
> > In article <[email protected]>,
> > Linus Torvalds <[email protected]> wrote:
> > >allocate blocks one at a time. Make the blocksize something nice and
> > >big, not just 4kB or 8kB or something.
> >
> > Btw, this is also going to be a VM and performance issue some time in
> > the future. There are already CPU's that would _love_ to have 64kB
> > pages etc, and as such a filesystem that doesn't play with the old silly
> > "everything is a block" rules would be much appreciated with the kind of
> > people who have multi-gigabyte files and want to read in big chunks at a
> > time.
>
> For this we need a block remapper layer that can map any
> blocksize n to any blocksize m with only the following constraints:

No, nothing like that at all..

What you can _trivially_ do is to basically act to the VFS and VM layer as
if you're a 1kB block filesystem (or something), and then when you get
called to do a "bmap()" (which only happens for kernel installing and
LILO, not under normal load), you just return the "offset" into a larger
block.

The VFS and MM layers do not care what the _real_ underlying blocksize of
the filesystem is. They will just do "readpage()" and "write()" calls, and
you can implement those any way you want to - never showing that you are
chunking out page-sized pieces from a bigger allocation block.

It's not all that hard. You just have to think a bit differently: don't
think of it as a block-based filesystem that has to have fixed blocks. The
VFS and MM layer don't care. They just want to access it.
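
As a toy illustration of the "offset into a larger block" idea (all
names and numbers here are made up for the example, not taken from any
real filesystem):

#include <stdio.h>

#define SMALL_PER_CHUNK 64      /* e.g. 64kB allocation chunks, 1kB fs blocks */

/* toy map: allocation chunk index -> starting small-block number on disk */
static const long chunk_start[4] = { 1000, 5064, 9128, 13192 };

/* bmap()-style answer in small-block units despite chunky allocation */
static long toy_bmap(long file_block)
{
        long chunk  = file_block / SMALL_PER_CHUNK;
        long offset = file_block % SMALL_PER_CHUNK;

        return chunk_start[chunk] + offset;
}

int main(void)
{
        printf("file block 70 maps to disk block %ld\n", toy_bmap(70));
        return 0;
}

readpage() and write() never need to expose the chunk size at all,
which is the point above.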

> Daniel (and others) uses ext2 as a playground, because it is
> implemented, tested and not that hard to understand and verify.

I realize that. But I get _very_ nervous when people talk about adding
stuff to ext2, because there are a lot of people who do not want to ever
even by mistake run code that is "new" on their filesystem.

Note that I had the same issue with ext3 - for the longest time, Stephen
Tweedie wanted to just extend ext2, and make it an invisible upgrade where
the filesystem would just magically become journalled when the user asked
for it. It _sounds_ appealing, but it doesn't take into account (a)
inevitable bugs and (b) the fact that Reiserfs actually got a head start
at least partly because it didn't need to worry about backwards
compatibility at all (there were other reasons too).

Basically, if there is one thing I've learnt over ten years of Linux, it's
that it is absolutely _evil_ to add more "linkages" or dependencies than
you absolutely have to. It is _much_ easier to create a new filesystem,
and slowly phase out old code that is no longer used. It's been done
several times (both with filesystems and with drivers), and every time
we've had a "new X, phase out old X" kind of situation it has been very
smooth.

In comparison, if you have "new features in X, which also handles the old
cases of X" situation, you not only bind yourself to backwards
compatibility, but you also cause yourself to be unable to ever phase out
the old code. Which means that eventually the whole system is a piece of
crap, full of old garbage that nobody needs to use, but that is part of
the new stuff that everybody _does_ use.

Linus

2001-02-22 20:30:57

by kaih

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] (Martin Mares) wrote on 22.02.01 in <[email protected]>:

> One could avoid this, but it would mean designing the whole filesystem in a
> completely different way -- merge all directories to a single gigantic
> hash table and use (directory ID,file name) as a key, but we were originally
> talking about extending ext2, so such massive changes are out of question
> and your log n access argument is right.

s/hash table/btree/ and you have just described the Macintosh HFS file
system. (Incidentally, it stores file extent indices in a similar manner,
with key = (file id, fork, offset).)


MfG Kai

2001-02-22 20:30:46

by kaih

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] (Daniel Phillips) wrote on 20.02.01 in <01022020011905.18944@gimli>:

> But the current hash function is just a place holder, waiting for
> an better version based on some solid theory. I currently favor the
> idea of using crc32 as the default hash function, but I welcome
> suggestions.

I once liked those things, too - but I've learned better since.

Quoting _Handbook_of_Algorithms_and_Data_Structures_ (Gonnet/Baeza-Yates,
ISBN 0-201-41607-7, Addison-Wesley):

--- snip ---

3.3.1 Practical hashing functions

[...]

A universal class of hashing functions is a class with the property that
given any input, the average performance of all the functions is good.
[...] For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
a universal class of hash functions.
[...]
Keys which are strings or sequences of words (including those which are of
variable length) are best treated by considering them as a number base b.
Let the string s be composed of k characters s1s2...sk. Then

h(s) = ( sum(i=0..k-1) B^i*s(k-i) ) mod m

To obtain a more efficient version of this function we can compute

h(s) = ( ( sum(i=0..k-1) B^i*s(k-i) ) mod 2^w ) mod m

where w is the number of bits in a computer word, and the mod 2^w
operation is done by the hardware. For this function the value B = 131 is
recommended, as B^i has a maximum cycle mod 2^k for 8<=k<=64.

Hashing function for strings

int hashfunction(s)
char *s;
{
        int i;

        for (i = 0; *s; s++)
                i = 131*i + *s;
        return (i % m);
}

--- snip ---

I've actually used that function for a hash containing something like a
million phone numbers as keys, and there were *very* few collisions.
Similarly for another hash containing megabytes of RFC 822 message-ids.

MfG Kai

2001-02-22 22:32:05

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Linus Torvalds wrote:
>
> On Thu, 22 Feb 2001, Daniel Phillips wrote:
> >
> > In the first heat of hash races - creating 20,000 files in one directory
> > - dentry::hash lost out to my original hack::dx_hash, causing a high
> > percentage of leaf blocks to remain exactly half full and slowing down
> > the whole thing by about 5%. (This was under uml - I haven't tried it
> > native yet but I expect the results to be similar.)
> >
> > Contender Result
> > ========= ======
> > dentry::hash Average fullness = 2352 (57%)
> > hack::dx_hash Average fullness = 2758 (67%)
> >
> > This suggests that dentry::hash is producing distinctly non-dispersed
> > results and needs to be subjected to further scrutiny. I'll run the
> > next heat of hash races tomorrow, probably with R5, and CRC32 too if I
> > have time.
>
> I'd love to hear the results from R5, as that seems to be the reiserfs
> favourite, and I'm trying it out in 2.4.2 because it was so easy to plug
> in..

In this round there were two new contenders:

- ReiserFS's R5
- Bob Jenkins' hash

Eirik Fuller pointed me to the latter, the subject of a very interesting
article in Dr. Dobbs, available online here:

http://burtleburtle.net/bob/hash/doobs.html

As before, the runs are for 20,000 creates and I report only the
fullness, because I'm still running these under UML. Suffice to say
that the total running time is roughly related to the average fullness
with a variance of about 15% from the best to the worst. Eventually I
will rerun the entire series of tests natively and provide more detailed
statistics. Here are the results from the second heat of hash races:

Contender          Result
=========          ======
dentry::hash       Average fullness = 2352 (57%)
daniel::hack_hash  Average fullness = 2758 (67%)
bob::hash          Average fullness = 2539 (61%)
reiserfs::r5       Average fullness = 2064 (50%)

Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block.
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks. The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal. But according
to Chris Mason (see his post) it does work very well for ReiserFS. This
provides a little more evidence that my htree scheme is quite
different from other approaches.

u32 r5_hash (const char *msg, int len)
{
        u32 a = 0;
        while (*msg) {
                a += *msg << 4;
                a += *msg >> 4;
                a *= 11;
                msg++;
        }
        return a;
}

I expected more from bob::hash since it's very carefully thought out
in terms of dispersal and avoidance of 'funnelling' (the property that
determines the probability of a collision), but it still fell short of
hack_hash's performance. Oh well. Tomorrow I'll try CRC32.

The bottom line: dx_hack_hash is still the reigning champion. OK, come
out and take a bow:

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash < 0) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}

--
Daniel

2001-02-22 23:05:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Andreas Dilger wrote:
> Daniel writes:
> > All references to "." and ".." are now intercepted and never reach the
> > filesystem level.
>
> Ted writes:
> > From: Daniel Phillips <[email protected]>
> >
> > I'll leave that up to somebody else - we now have two alternatives, the
> > 100%, no-compromise INCOMPAT solution, and the slightly-bruised but
> > still largely intact forward compatible solution. I'll maintain both
> > solutions in the code for now so it's just as easy to choose either in the end.
> >
> > Well, the $64,000 question is exactly how much performance does it cost?
> > My guess is that it will be barely measurable, but only benchmarks will
> > answer that question.
>
> One important question as to the disk format is whether the "." and ".."
> interception by VFS is a new phenomenon in 2.4 or if this also happened
> in 2.2? If so, then having these entries on disk will be important
> for 2.2 compatibility, and you don't want to have different on-disk formats
> between 2.2 and 2.4.

The answer is 'yes', it's been in since at least the beginning of 2.2:


http://innominate.org/cgi-bin/lksr/linux/fs/namei.c?rev=1.1&content-type=text/x-cvsweb-markup&cvsroot=v2.2

Search for '.'.

By the way, our whole linux cvsweb tree is here:

http://lksr.org/

with all versions of linux back to linux-0.97.pl5, with a makefile that
starts out with:

#
# Makefile for linux.
# If you don't have '-mstring-insns' in your gcc (and nobody but me has :-)
# remove them from the CFLAGS defines.
#

Getting back on topic, this makes the idea of getting rid of the actual
on-disk "." and ".." entries a little less scary, though I am keeping in
mind the fact that having those entries on disk could in some extreme
circumstance help fsck recover a corrupted directory tree a little
better and more automatically.

I resolve not to take a position on this subject, and I will carry
forward both a 'squeaky clean' backward-compatible version that sets an
INCOMPAT flag, and a 'slightly tarnished' but very clever version that
is both forward and backward-compatible, along the lines suggested by
Ted. Both flavors have the desirable property that old versions of
fsck with no knowledge of the new index structure can remove the indices
automatically, with fsck -y.

--
Daniel

2001-02-22 23:40:53

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Andreas Dilger <[email protected]>
Date: Thu, 22 Feb 2001 11:16:32 -0700 (MST)

One important question as to the disk format is whether the "." and ".."
interception by VFS is a new phenomenon in 2.4 or if this also happened
in 2.2? If so, then having these entries on disk will be important
for 2.2 compatibility, and you don't want to have different on-disk formats
between 2.2 and 2.4.

Well, you need to have the '.' and '..' there if you want full
backwards compatibility. That's clear.

If you don't care about backwards compatibility, it's important that
there be a way to find the parent directory, but there doesn't have to
be explicit '.' and '..' entries.

So if Daniel is going to try implementing it both ways then that's one
place where the #ifdef's might get a bit more complicated. After it's
done, we should do some benchmarks comparing it both ways; if the
difference is negligible, I'd argue for simply always providing
backwards compatibility. One of the key advantages of ext2/ext3 is its
backwards compatibility, and so if it's not too costly to preserve it
(as I suspect will be the case), we should try to do so.

- Ted

2001-02-23 00:58:52

by Felix von Leitner

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Thus spake Alan Cox ([email protected]):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced.

Trees are nice only theoretically. In practice, the benefits are
outweighed by the nastiness in form of fsck and NFS and bigger code
(normally: more complex -> less reliable).

Felix

2001-02-23 01:53:21

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

>> idea of using crc32 as the default hash function

> I once liked those things, too - but I've learned better since.

> A universal class of hashing functions is a class with the property that
> given any input, the average performance of all the functions is good.
> For example, h(k) = (a * k + b) mod m with integers a != 0 and b is
> a universal class of hash functions.

Here a != 0 should be (a,m) = 1.

> for(i=0; *s; s++) i = 131*i + *s;

Yes, that is a good function.

(Also because 131 has only 3 bits, so that there is a cheap shift and add
implementation.)

I did some random tests, on the one hand on a collection of 557398 file
names (last part of path names) in a file system here.
On the other hand on an artificially generated sequence of file names
with increasing number tails: foo0001, foo0002, ...

On the first collection the choice of multiplier didnt really matter
provided that it was odd and not too close to a power of two.
The smallest number with good behaviour was 11, the winner was 51.

(51 has 4 bits, but is not more expensive because they are evenly spaced:
/* 51x = 17*3*x */
x += (x << 1);
x += (x << 4);
)

On the second collection there were large differences between multipliers.
The clear winner was 11.

Some numbers:

Hash 557398 actual names, using
hash(unsigned char *s) {
        unsigned int h = 0;

        while (*s)
                h = m*h + *s++;
        return h % sz;
}
for various values of m and powers of two sz (so that the % is an AND).
Report min, max, average length of hash chain, and standard deviation.
Of course min and max should be close to average and stddev should be small.

m= 11 sz=2048, min 221, max 324, av 272.17, stddev 254.12
m= 13 sz=2048, min 219, max 322, av 272.17, stddev 259.96
m= 51 sz=2048, min 218, max 325, av 272.17, stddev 265.52
m=131 sz=2048, min 222, max 344, av 272.17, stddev 264.20

m= 11 sz=4096, min 96, max 176, av 136.08, stddev 132.58
m= 13 sz=4096, min 101, max 177, av 136.08, stddev 128.71
m= 51 sz=4096, min 92, max 174, av 136.08, stddev 130.89
m=131 sz=4096, min 85, max 180, av 136.08, stddev 131.99

m= 11 sz=8192, min 38, max 102, av 68.04, stddev 68.26
m= 13 sz=8192, min 42, max 100, av 68.04, stddev 66.30
m= 51 sz=8192, min 41, max 97, av 68.04, stddev 64.98
m=131 sz=8192, min 39, max 102, av 68.04, stddev 66.19

m= 11 sz=16384, min 14, max 57, av 34.02, stddev 33.96
m= 13 sz=16384, min 14, max 58, av 34.02, stddev 33.51
m= 51 sz=16384, min 15, max 60, av 34.02, stddev 32.29
m=131 sz=16384, min 16, max 59, av 34.02, stddev 33.94

m= 11 sz=32768, min 3, max 37, av 17.01, stddev 17.50
m= 13 sz=32768, min 3, max 34, av 17.01, stddev 16.84
m= 51 sz=32768, min 4, max 41, av 17.01, stddev 16.46
m=131 sz=32768, min 3, max 40, av 17.01, stddev 16.90

m= 11 sz=65536, min 0, max 24, av 8.51, stddev 8.70
m= 13 sz=65536, min 0, max 23, av 8.51, stddev 8.56
m= 51 sz=65536, min 0, max 24, av 8.51, stddev 8.31
m=131 sz=65536, min 0, max 23, av 8.51, stddev 8.51

m= 11 sz=131072, min 0, max 17, av 4.25, stddev 4.39
m= 13 sz=131072, min 0, max 16, av 4.25, stddev 4.32
m= 51 sz=131072, min 0, max 16, av 4.25, stddev 4.22
m=131 sz=131072, min 0, max 16, av 4.25, stddev 4.24

m= 11 sz=262144, min 0, max 12, av 2.13, stddev 2.20
m= 13 sz=262144, min 0, max 12, av 2.13, stddev 2.18
m= 51 sz=262144, min 0, max 12, av 2.13, stddev 2.12
m=131 sz=262144, min 0, max 12, av 2.13, stddev 2.12

On the second, nonrandom, collection there are more variations:

m= 11 sz=8192, min 61, max 76, av 68.04, stddev 4.41
m= 13 sz=8192, min 55, max 83, av 68.04, stddev 18.64
m= 51 sz=8192, min 58, max 79, av 68.04, stddev 12.47
m=131 sz=8192, min 52, max 83, av 68.04, stddev 29.05

m= 11 sz=16384, min 26, max 41, av 34.02, stddev 3.61
m= 13 sz=16384, min 24, max 45, av 34.02, stddev 8.76
m= 51 sz=16384, min 25, max 44, av 34.02, stddev 6.32
m=131 sz=16384, min 23, max 47, av 34.02, stddev 14.00

m= 11 sz=32768, min 10, max 23, av 17.01, stddev 4.36
m= 13 sz=32768, min 7, max 28, av 17.01, stddev 8.66
m= 51 sz=32768, min 10, max 25, av 17.01, stddev 4.04
m=131 sz=32768, min 6, max 27, av 17.01, stddev 8.66
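
For reference, a harness of roughly this shape can reproduce such
tables (this is a reconstruction, not the actual test program; it
reads one name per line on stdin and prints both the variance and its
square root, so either can be compared against the columns above):

#include <math.h>
#include <stdio.h>
#include <string.h>

#define SZ   (1 << 13)          /* 8192 buckets, so "mod sz" is just an AND */
#define MULT 11                 /* the multiplier under test */

static unsigned long counts[SZ];

static unsigned int hash(const unsigned char *s)
{
        unsigned int h = 0;

        while (*s)
                h = MULT * h + *s++;
        return h & (SZ - 1);
}

int main(void)
{
        char name[1024];
        unsigned long n = 0, min = ~0UL, max = 0;
        double mean, var = 0.0;
        int i;

        while (fgets(name, sizeof(name), stdin)) {
                name[strcspn(name, "\n")] = '\0';
                counts[hash((const unsigned char *) name)]++;
                n++;
        }
        mean = (double) n / SZ;
        for (i = 0; i < SZ; i++) {
                if (counts[i] < min) min = counts[i];
                if (counts[i] > max) max = counts[i];
                var += (counts[i] - mean) * (counts[i] - mean);
        }
        var /= SZ;
        printf("m=%d sz=%d, min %lu, max %lu, av %.2f, var %.2f, sqrt %.2f\n",
               MULT, SZ, min, max, mean, var, sqrt(var));
        return 0;
}

Fed the same name lists while varying MULT and SZ, this reproduces the
shape of the comparison, though the exact figures depend on the input.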

Andries

2001-02-23 02:49:27

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2


Just looking at R5 I knew it wasn't going to do well in this application
because it's similar to a number of hash functions I tried with the same
idea in mind: to place similar names together in the same leaf block.
That turned out to be not very important compared to achieving a
relatively high fullness of leaf blocks. The problem with R5 when used
with my htree is, it doesn't give very uniform dispersal.

The bottom line: dx_hack_hash is still the reigning champion.

Now that you provide source for r5 and dx_hack_hash, let me feed my
collections to them.
r5: catastrophic
dx_hack_hash: not bad, but the linear hash is better.

E.g.:
Actual file names:

Linear hash, m=11, sz=2048, min 262, max 283, av 272.17, stddev 12.25
dx_hack_hash: sz=2048, min 220, max 330, av 272.17, stddev 280.43
r5: sz=2048, min 205, max 382, av 272.17, stddev 805.18

Linear hash, m=11, sz=65536, min 0, max 24, av 8.51, stddev 8.70
dx_hack_hash: sz=65536, min 0, max 23, av 8.51, stddev 8.51
r5: sz=65536, min 0, max 26, av 8.51, stddev 8.89

Generated consecutive names:

Linear hash, m=11, sz=2048, min 262, max 283, av 272.17, stddev 12.25
dx_hack_hash: sz=2048, min 191, max 346, av 272.17, stddev 636.11
r5: sz=2048, min 0, max 3587, av 272.17, stddev 755222.91

Linear hash, m=11, sz=65536, min 2, max 14, av 8.51, stddev 2.79
dx_hack_hash: sz=65536, min 0, max 26, av 8.51, stddev 12.24
r5: sz=65536, min 0, max 120, av 8.51, stddev 738.08

Andries

2001-02-23 03:43:44

by Daniel Phillips

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

[email protected] wrote:
>
> Just looking at R5 I knew it wasn't going to do well in this application
> because it's similar to a number of hash functions I tried with the same
> idea in mind: to place similar names together in the same leaf block.
> That turned out to be not very important compared to achieving a
> relatively high fullness of leaf blocks. The problem with R5 when used
> with my htree is, it doesn't give very uniform dispersal.
>
> The bottom line: dx_hack_hash is still the reigning champion.
>
> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure.
I will try your hash with the htree index code tomorrow.

--
Daniel

2001-02-23 12:22:54

by Jonathan Morton

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

>Now that you provide source for r5 and dx_hack_hash, let me feed my
>collections to them.
>r5: catastrophic
>dx_hack_hash: not bad, but the linear hash is better.

<snip verbose results>

So, not only does the linear hash normally provide a shorter worst-case
chain, its results are actually more consistent than the other two. Looks
like simple is good here, but is it still possible to produce
"pathological" sets for the linear hash to see how badly it falls down?
I'm no mathematician, so I'll leave that to the gurus...

Meanwhile, let's go back to Linus' comment on compatibility and so on. He
has a *very* good point, which I'll expand on slightly here:

Suppose some stone-age Linux user, running 2.0.35 or something equally old
(which runs ext2), decides to finally bite the bullet and upgrade to the
all-new 2.6.1 (OK, this is some time in the future). 2.6.1 implements some
"enhanced" version of ext2 which makes some incompatible modifications to
the directory structure. However, since the process of upgrading through
such a massive range of kernels also involves upgrading most other software
to boot, this user forgot one or two pieces, and reboots to 2.0.35 to
regain a sufficiently working system that he can build the updated software
- or not, because 2.0.35's old ext2 code suddenly can't read the
filesystem, which was modified by 2.6.1 before the boot process stalled.
e2fsck is no help here either, because he now has an unbootable system with
old software that doesn't understand the new stuff.

I hope people understand this as well as I do - if a filesystem upgrade is
desirable, let the user perform some *specific* action to upgrade it, when
he has an otherwise-working setup *and* sufficient backups. I for one do
not want to be caught out like the hypothetical user I mentioned above.

OTOH, I have my own opinions on the direction of ext2:

- Currently, it's a stable and universally-utilised filesystem which offers
very good performance for most applications. I'm specifically drawing
attention to certain benchmarks which place ext2's disk-based performance
ahead of many commercial UNIX' ram-based filesystem performance.

- There are specific problems with performance when reading and/or
modifying large directories. I haven't tested for this personally, but I
have noticed slowness when using 'rm -rf' on a large *tree* of directories.
The problem appeared to be one of disk access, but may be a side-effect of
poor storage distribution (I haven't examined the ext2 code). Related to
this, rebuilding the slocate database on all my systems appears to be
disk-bound rather than CPU-bound, and takes too long for my liking.

One of the current suggestions, if I've interpreted it correctly, is to
introduce an extension to ext2 which essentially makes a "fast index" of a
large directory, attaches it to the directory in some backwards-compatible
manner, and uses it *in conjunction with* the regular directory structure.
This is probably a good idea, but it needs some thought:

- How much overhead does the "fast index" introduce for modification of the
directory? Large directories are the most likely to have stuff added and
deleted, and it doesn't help if during an "rm *" operation the saving on
the search is negated by the overhead on the unlink.

- If the index gets out of sync with the directory, how is this detected
and recovered from? Assuming the index merely points to the correct
position in the regular directory, some simple sanity checks will suffice
for most cases (is this entry in the directory the same as I asked for?),
and if the entry is not in the index then a standard search of the real
directory can be done. In either case, the index can be marked as invalid
(and removed?) and rebuilt whenever necessary.

- At what threshold of directory size does the index come into play?
(fairly obviously, the index is useless for tiny directories)

- What happens when an old kernel encounters a directory which has an index
attached to it? Does it look like a virtual file, which has no special
properties but whose name is reserved for system use? (cf. lost+found) Or
is it some unidentifiable bits in the directory structure and a few lost
clusters on the disk? If the latter, it will have to be done in a way that
older versions of e2fsck can clean it up easily and older versions of ext2
won't throw up on it, which could be kinda hard. If the former, an
'unused' name will have to be found to avoid conflicts but the big
advantage is *no* inconsistency with old systems.

Answers on a postcard...

--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]

The key to knowledge is not to rely on people to teach you it.

Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/

-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----


2001-02-23 12:39:12

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

From [email protected] Fri Feb 23 04:43:23 2001

> Now that you provide source for r5 and dx_hack_hash, let me feed my
> collections to them.
> r5: catastrophic
> dx_hack_hash: not bad, but the linear hash is better.

I never expected dx_hack_hash to be particularly good at anything, but
we might as well test the version without the mistake in it - I was
previously using < 0 to test the sign bit - on an unsigned variable :-/

unsigned dx_hack_hash (const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--)
        {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}


The correction gained me another 1% in the leaf block fullness measure.
I will try your hash with the htree index code tomorrow.

Basically I find the same results as before.

Actual names (N=557398)
dx_hack+if dx_hack-if best
Size min max av stddev min max av stddev
2048 217 326 272.17 273.45 220 330 272.17 280.43 +
4096 97 191 136.08 138.35 100 182 136.08 138.29 -
8192 40 105 68.04 68.57 36 102 68.04 68.06 -
16384 14 59 34.02 34.36 14 59 34.02 34.08 -
32768 3 37 17.01 17.24 4 36 17.01 17.09 -
65536 0 24 8.51 8.55 0 23 8.51 8.51 -
131072 0 18 4.25 4.24 0 16 4.25 4.26 +
262144 0 13 2.13 2.12 0 11 2.13 2.13 -

Generated names
2048 195 347 272.17 509.38 191 346 272.17 636.11 +
4096 71 218 136.08 645.73 56 224 136.08 995.79 +
8192 23 125 68.04 202.16 23 135 68.04 288.99 +
16384 10 69 34.02 67.47 8 72 34.02 89.29 +
32768 1 42 17.01 26.32 1 43 17.01 31.39 +
65536 0 28 8.51 10.92 0 26 8.51 12.24 +
131072 0 17 4.25 4.93 0 18 4.25 5.28 +
262144 0 12 2.13 2.32 0 13 2.13 2.40 +

In other words, the "broken" version wins on actual names, and the "correct"
version on generated names with a number tail. The differences are small.
(And of course the broken version is faster.)
As a comparison:

Actual names (N=557398)
linear hash (m=11) linear hash (m=51) best of 4
Size min max av stddev min max av stddev
2048 221 324 272.17 254.02 218 325 272.17 265.46 lin-11
4096 96 176 136.08 132.53 92 174 136.08 130.94 lin-51
8192 38 102 68.04 68.26 41 97 68.04 64.98 lin-51
16384 14 57 34.02 33.97 15 60 34.02 32.29 lin-51
32768 3 37 17.01 17.50 4 41 17.01 16.46 lin-51
65536 0 24 8.51 8.70 0 24 8.51 8.31 lin-51
131072 0 17 4.25 4.39 0 16 4.25 4.22 lin-51
262144 0 12 2.13 2.20 0 12 2.13 2.12 lin-51

Generated names
2048 262 283 272.17 12.25 244 298 272.17 136.72 lin-11
4096 128 146 136.08 9.39 119 151 136.08 39.73 lin-11
8192 61 76 68.04 4.41 58 79 68.04 12.47 lin-11
16384 26 41 34.02 3.61 25 44 34.02 6.32 lin-11
32768 10 23 17.01 4.36 10 25 17.01 4.04 lin-51
65536 2 14 8.51 2.79 3 16 8.51 4.54 lin-11
131072 0 8 4.25 1.85 0 10 4.25 1.88 lin-11
262144 0 5 2.13 1.10 0 6 2.13 1.22 lin-11

So both linear hash versions are far superior (if the criterion is uniform
distribution) in the case of generated names, and lin-51 also beats dx_hack
in the case of actual names. Of course it also wins in speed.
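
For anyone who wants to repeat this kind of measurement, here is a minimal
sketch of a test harness. It is not the program that produced the tables
above; the I/O conventions and the reported statistics are assumptions
(names one per line on stdin, reduction modulo the bucket count, and
min/max/mean/variance of the bucket counts).

/* Minimal bucket-occupancy harness (a sketch, not the program behind the
 * tables above).  Reads names one per line from stdin, hashes each into
 * one of sz buckets, prints min/max/mean/variance of the bucket counts. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int u32;

static unsigned dx_hack_hash(const char *name, int len)
{
        u32 hash0 = 0x12a3fe2d, hash1 = 0x37abe8f9;
        while (len--) {
                u32 hash = hash1 + (hash0 ^ (*name++ * 71523));
                if (hash & 0x80000000) hash -= 0x7fffffff;
                hash1 = hash0;
                hash0 = hash;
        }
        return hash0;
}

int main(int argc, char **argv)
{
        int sz = argc > 1 ? atoi(argv[1]) : 2048;   /* bucket count */
        unsigned long *count = calloc(sz, sizeof(*count));
        char name[1024];
        unsigned long n = 0;

        if (!count)
                return 1;
        while (fgets(name, sizeof(name), stdin)) {
                int len = strcspn(name, "\n");      /* strip the newline */
                count[dx_hack_hash(name, len) % sz]++;
                n++;
        }

        unsigned long min = count[0], max = count[0];
        double mean = (double)n / sz, var = 0.0;
        for (int i = 0; i < sz; i++) {
                if (count[i] < min) min = count[i];
                if (count[i] > max) max = count[i];
                var += (count[i] - mean) * (count[i] - mean);
        }
        printf("N=%lu sz=%d min %lu max %lu av %.2f var %.2f\n",
               n, sz, min, max, mean, var / sz);
        free(count);
        return 0;
}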

Andries


2001-02-23 18:58:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Jonathan Morton writes:
> Meanwhile, let's go back to Linus' comment on compatibility and so on. He
> has a *very* good point, which I'll expand on slightly here:
>
> Suppose some stone-age Linux user, running 2.0.35 or something equally old
> (which runs ext2), decides to finally bite the bullet and upgrade...
> ... 2.0.35's old ext2 code suddenly can't read the filesystem, which was
> modified by 2.6.1 before the boot process stalled.

One of the proposals on the table for the indexed directories is read AND
WRITE compatible with older kernels. For 1.0 through 2.4 kernels the read
compatibility is very safe (it will generate some errors on reads because
of "sparse" directory blocks, but the code will continue on correctly;
<= 1.2 kernels won't even generate an error). For < 2.2 kernels, deleting a
file from an indexed directory will work but will leave the index out of
date. For < 2.2 kernels, adding a new file to an indexed directory will
always overwrite the index, so it is also safe.

> I hope people understand this as well as I do - if a filesystem upgrade is
> desirable, let the user perform some *specific* action to upgrade it, when
> he has an otherwise-working setup *and* sufficient backups. I for one do
> not want to be caught out like the hypothetical user I mentioned above.

I am on the side of maintaining compatibility. There _may_ be a _small_
performance (more like capacity) impact on indexed directories for this
compatibility, but it is well worth it, IMHO.

> - Currently, it's a stable and universally-utilised filesystem which offers
> very good performance for most applications. I'm specifically drawing
> attention to certain benchmarks which place ext2's disk-based performance
> ahead of many commercial UNIX' ram-based filesystem performance.

Totally agree.

> - There are specific problems with performance when reading and/or
> modifying large directories. I haven't tested for this personally, but I
> have noticed slowness when using 'rm -rf' on a large *tree* of directories.

That is what the index change will address. Actually, "rm -r" may not
be sped up very much, but "rm *" definitely would be ("rm -r" deletes
files in directory order, but "rm *" deletes each file individually in
alphabetical order).

> One of the current suggestions, if I've interpreted it correctly, is to
> introduce an extension to ext2 which essentially makes a "fast index" of a
> large directory, attaches it to the directory in some backwards-compatible
> manner, and uses it *in conjunction with* the regular directory structure.

Yes, this is essentially true. The on-disk directory entries are exactly
the same. The index itself (in the compatible layout) appears to simply
be empty directory blocks (at the cost of 8 bytes = 1 index entry per block).

> - How much overhead does the "fast index" introduce for modification of the
> directory? Large directories are the most likely to have stuff added and
> deleted, and it doesn't help if during an "rm *" operation the saving on
> the search is negated by the overhead on the unlink.

The index will improve the performance for file add, stat, and delete. For
all of these operations you need to find a directory entry (add needs to
check if a file of the same name already exists before a new file is added).

> - If the index gets out of sync with the directory, how is this detected
> and recovered from? Assuming the index merely points to the correct
> position in the regular directory, some simple sanity checks will suffice
> for most cases (is this entry in the directory the same as I asked for?),
> and if the entry is not in the index then a standard search of the real
> directory can be done. In either case, the index can be marked as invalid
> (and removed?) and rebuilt whenever necessary.

On an index-aware kernel, the index can obviously not get out of sync.
All 2.2+ kernels that don't understand indexing will clear the "has
index" flag if they modify that directory, and the index will disappear.
Since the index is "hidden" (in my proposal at least) inside a totally
normal directory block, it will simply be overwritten by new entries.
As I mentioned above, 1.x and 2.0 kernels will overwrite the index on
an add, but not on a delete, and will not clear the "has index" flag.
This means we need some extra magic at the start of the index to ensure
we have a valid index header.
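
As a sketch of what such a disguised index block might look like - one
possible layout consistent with the description above, not the actual
proposed on-disk format; the struct and field names and the magic value
are illustrative only:

#include <stdint.h>

struct fake_dirent {               /* looks like an ordinary ext2 dirent   */
        uint32_t inode;            /* 0 means "unused entry"               */
        uint16_t rec_len;          /* set to the block size, so an old     */
                                   /* kernel sees one big empty entry      */
        uint8_t  name_len;         /* 0                                    */
        uint8_t  file_type;        /* 0                                    */
};                                 /* 8 bytes - the one lost index entry   */

struct dx_entry {                  /* the 8-byte index entry from the RFC  */
        uint32_t hash;             /* 31-bit hash key                      */
        uint32_t block;            /* logical block within the directory   */
};

struct dx_index_block {
        struct fake_dirent fake;   /* hides the index from old kernels     */
        uint16_t magic;            /* hypothetical value, e.g. 0xd1d1:     */
                                   /* lets an index-aware kernel detect an */
                                   /* index clobbered by an old kernel     */
        uint16_t count;            /* index entries in use                 */
        struct dx_entry entries[]; /* fill the rest of the block           */
};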

> - At what threshold of directory size does the index come into play?
> (fairly obviously, the index is useless for tiny directories)

This remains to be seen. Definitely not for directories 1 block in size
(which is 85%? of all directories). It looks like directories with about
250-300 files or more are needed for indexing to be useful. The good
news is that since indexing is optional, it can be tuned so that it is
only used where it improves directory performance.

> - What happens when an old kernel encounters a directory which has an index
> attached to it? Does it look like a virtual file, which has no special
> properties but whose name is reserved for system use? (cf. lost+found) Or
> is it some unidentifiable bits in the directory structure and a few lost
> clusters on the disk? If the latter, it will have to be done in a way that
> older versions of e2fsck can clean it up easily and older versions of ext2
> won't throw up on it, which could be kinda hard. If the former, an
> 'unused' name will have to be found to avoid conflicts but the big
> advantage is *no* inconsistency with old systems.

This has also been discussed already for the indexing code. The index
is actually stored inside the directory, so no hidden files or anything,
and no "lost blocks" either. The original indexing code wasn't 100%
compatible with the older layout, but a proposal has been made to make
it totally compatible with older kernels (excluding some error messages
that all ext2 directory code handles properly).

If (for whatever reason) you started using an indexed ext2 kernel,
created a huge directory of files, and then never used it again, the
only thing wasted would be your time (when you try to delete the files
in the huge directory without the directory index ;-). The ext2 code
has NEVER released directory blocks, even if all of the files are gone.
All blocks used by the index also appear (in the proposed layout) as
regular directory blocks, so a non-index kernel can use them at any time
(of course the index is destroyed in this case).

As an aside, lost+found is NOT a "special" name in any way to the
filesystem. The only thing that it is used for is e2fsck - to the kernel
it is just a regular directory.

Basically, I think the compatibility is excellent for the functionality
delivered. Another reason why I think ext2 will continue to have a long
and prosperous life, without making sacrifices for performance.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-23 20:41:45

by Theodore Tso

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

From: Daniel Phillips <[email protected]>
Date: Fri, 23 Feb 2001 00:04:02 +0100

> I resolve not to take a position on this subject, and I will carry
> forward both a 'squeaky clean' backward-compatible version that sets an
> INCOMPAT flag, and a 'slightly tarnished' but very clever version that
> is both forward and backward-compatible, along the lines suggested by
> Ted. Both flavors have the desirable property that old versions of
> fsck with no knowledge of the new index structure can remove the indices
> automatically, with fsck -y.

Note that in the long run, the fully compatible version should probably
have a COMPAT feature flag set so that you're forced to use a new enough
version of e2fsck. Otherwise an old e2fsck may end up not noticing
corruptions in an index block which might cause a new kernel to have
serious heartburn.

- Ted

2001-02-23 21:45:55

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Hi,

A while ago I did some experimentation with simple bit-op based string
hash functions, i.e. no multiplications / divides in the hash loop.

The best I found was:

int hash_fn (char * p)
{
        int hash = 0;
        while (*p) {
                hash = hash + *p++;
                // Rotate a 31 bit field 7 bits:
                hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
        }
        return hash;
}

[I haven't kept my test program / data set - if anyone compares the above
to the other functions mentioned on the list, let me know.]

The 31 and 7 were determined experimentally. But the 31 has a theoretical
explanation (which might even be valid):

The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
where P is the polynomial x**31 - 1 (over Z_2).
Presumably the "best" P would be irreducible - but that would have more
bits set in the polynomial, making reduction harder. A compromise is to
choose P in the form x**N - 1 but with relatively few factors.
X**31 - 1 is such a P.

Also, a 32 bit rotate (modulo X**32 - 1, which is equal
to (X - 1) ** 32 over Z_2), came out pretty badly.
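
As a quick sanity check on that equivalence (a throwaway sketch, not part
of the original posting): treat a 31-bit value as a polynomial over Z_2
and spot-check that the rotate-left-by-7 matches multiplication by x**7
reduced modulo x**31 - 1.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* 31-bit rotate left by 7, as in the hash above. */
static uint32_t rotl31_7(uint32_t v)
{
        return ((v << 7) | (v >> 24)) & 0x7fffffff;
}

/* Multiply the Z_2 polynomial v (degree < 31) by x**7 and reduce modulo
 * x**31 - 1: a bit shifted past position 30 wraps around to the bottom,
 * because x**31 == 1 there. */
static uint32_t mul_x7_mod(uint32_t v)
{
        uint32_t r = 0;
        for (int i = 0; i < 31; i++)
                if (v & (1u << i))
                        r ^= 1u << ((i + 7) % 31);
        return r;
}

int main(void)
{
        for (uint32_t v = 0; v < 1000000; v++)  /* spot check */
                assert(rotl31_7(v) == mul_x7_mod(v));
        printf("rotate-by-7 matches multiplication by x**7 mod x**31 - 1\n");
        return 0;
}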

One thing that shouldn't be forgotten about hashing for hash tables
is that you have to reduce the hash value to the required range - doing
that well greatly reduces the difference between various hash functions.

Ralph.


2001-02-23 22:37:43

by Guest section DW

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

On Sat, Feb 24, 2001 at 10:43:16AM +1300, Ralph Loader wrote:

> A while ago I did some experimentation with simple bit-op based string
> hash functions. I.e., no multiplications / divides in the hash loop.
>
> The best I found was:
>
> int hash_fn (char * p)
> {
>         int hash = 0;
>         while (*p) {
>                 hash = hash + *p++;
>                 // Rotate a 31 bit field 7 bits:
>                 hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
>         }
>         return hash;
> }

Hmm. This one goes in the "catastrophic" category.

For actual names:

N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99

For generated names:

N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83

A very non-uniform distribution.

> The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
> where P is the polynomial x**31 - 1 (over Z_2).
> Presumably the "best" P would be irreducible - but that would have more
> bits set in the polynomial, making reduction harder. A compromise is to
> choose P in the form x**N - 1 but with relatively few factors.
> X**31 - 1 is such a P.

It has seven irreducible factors (2 has order 5 mod 31, so x**31 - 1
splits over Z_2 into x - 1 times six irreducible quintics). Hardly
"almost irreducible".

Shifting the 7-bit ASCII characters over 7 bits makes sure that there
is very little interaction to start with. And the final AND that truncates
to the final size of the hash chain kills the high order bits.
No good.
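
To make that concrete, a small sketch (illustrative, not from the original
exchange): for names of three characters or fewer, the rotate-by-7 hash
packs the characters into disjoint 7-bit fields, so masking down to an
11-bit bucket index keeps nothing but the low four bits of the final
character - at most 16 of the 2048 buckets are reachable.

#include <stdio.h>

/* The rotate-by-7 hash from the quoted posting (pointer advance added,
 * written with unsigned arithmetic). */
static unsigned rot31_hash(const char *p)
{
        unsigned hash = 0;
        while (*p) {
                hash = hash + *p++;
                /* rotate a 31-bit field left by 7 */
                hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
        }
        return hash;
}

int main(void)
{
        /* "abc", "xyc" and "uvs" share only the low four bits of their
         * final character, yet all land in bucket 384. */
        const char *names[] = { "abc", "xyc", "uvs", "tmp", "lib" };
        for (int i = 0; i < 5; i++) {
                unsigned h = rot31_hash(names[i]);
                printf("%-4s -> hash %08x  bucket %4u\n", names[i], h, h & 0x7ff);
        }
        return 0;
}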

Andries


2001-02-24 00:33:15

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Ext2-devel] [rfc] Near-constant time directory index for Ext2

Ted writes:
> Note that in the long run, the fully compatible version should probably
> have a COMPAT feature flag set so that you're forced to use a new enough
> version of e2fsck. Otherwise an old e2fsck may end up not noticing
> corruptions in an index block which might cause a new kernel to have
> serious heartburn.

Actually, having a COMPAT flag also helps in other ways:

1) Turning indexing on and off is no longer a mount option (or something
done automatically), as it currently is, so it will quell Linus' fears
about the principle of least surprise (i.e. not converting a filesystem
without user action). A superblock COMPAT flag is more in keeping with
other ext2 features.

2) Running a new e2fsck on a COMPAT_INDEX filesystem could create the
index for existing "large" directories that don't have the BTREE/INDEX
flag set, so the kernel only ever has to deal with incremental indexing
after the first block. The kernel would just do linear access on
existing multi-block directories until e2fsck is run.

3) Clearing the COMPAT flag would make e2fsck remove the indexes, if the
user so desires. I think this would be the behaviour of existing
e2fsck anyways.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-02-24 02:47:46

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andries,

> is very little interaction to start with. And the final AND that truncates
> to the final size of the hash chain kills the high order bits.
> No good.

I didn't realise you were bit-masking down to the required size.

Yes, it would be pretty useless in that case.

Ralph.


2001-02-24 05:35:35

by Ralph Loader

[permalink] [raw]
Subject: Re: [rfc] Near-constant time directory index for Ext2

Andries,


> > int hash_fn (char * p)
> > {
> >         int hash = 0;
> >         while (*p) {
> >                 hash = hash + *p++;
> >                 // Rotate a 31 bit field 7 bits:
> >                 hash = ((hash << 7) | (hash >> 24)) & 0x7fffffff;
> >         }
> >         return hash;
> > }
>
> Hmm. This one goes in the "catastrophic" category.
>
> For actual names:
>
> N=557398, m=51 sz=2048, min 82, max 4002, av 272.17, stddev 45122.99
>
> For generated names:
>
> N=557398, m=51 sz=2048, min 0, max 44800, av 272.17, stddev 10208445.83
>

Instead of masking the hash value down to 11 bits you could try:

index = (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;

I ran a quick test which gave fairly good results with that: 12871
identifiers from a source tree gave a mean square bucket size of 45.65;
the expected value for a random function is 45.78.

That change might improve some of your other hashes as well - there
doesn't seem to be much point in computing a 32 bit value only to throw
20 bits away - stirring in the extra bits makes much more sense to me.
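
Spelled out as code (a trivial sketch; the helper names are illustrative,
only the fold expression itself comes from the suggestion above):

/* Two ways of reducing a 32-bit hash to an 11-bit bucket index
 * (2048 buckets). */
static unsigned bucket_mask_only(unsigned hash)
{
        return hash & 0x7ff;            /* throws away the top 21 bits */
}

static unsigned bucket_folded(unsigned hash)
{
        /* fold the high bits back in before masking, so weak low bits
         * still get stirred by the rest of the hash value */
        return (hash ^ (hash >> 11) ^ (hash >> 22)) & 0x7ff;
}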

> > The rotate is equivalent to a multiplication by x**7 in Z_2[P=0],
> > where P is the polynomial x**31 - 1 (over Z_2).
> > Presumably the "best" P would be irreducible - but that would have more
> > bits set in the polynomial, making reduction harder. A compromise is to
> > choose P in the form x**N - 1 but with relatively few factors.
> > X**31 - 1 is such a P.
>
> It has seven irreducible factors. Hardly "almost irreducible".

I didn't say it was. "Almost irreducible" polynomials with Hamming
weight two are pretty rare... Relative to, say, x**32 - 1 or x**24 - 1,
having 7 factors is good.

Ralph.