2002-06-19 20:15:20

by Andrew Morton

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

Christopher Li wrote:
>
> ...
>
> I have a silly question, where is that ext3 CVS? Under sourceforge
> ext2/ext3 or gkernel?

See http://www.zip.com.au/~akpm/linux/ext3/ - about halfway
down the page.

btw, I merged all the ext3 htree stuff into 2.5.23 yesterday. Haven't
tested it much at all yet.

http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/ext3-truncate-fix.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/ext3-htree.patch
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.23/htree-fixes.patch

-


2002-06-19 22:44:16

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

Hi,

On Wed, Jun 19, 2002 at 01:13:50PM -0700, Andrew Morton wrote:

> > I have a silly question, where is that ext3 CVS? Under sourceforge
> > ext2/ext3 or gkernel?
>
> See http://www.zip.com.au/~akpm/linux/ext3/ - about halfway
> down the page.
>
> btw, I merged all the ext3 htree stuff into 2.5.23 yesterday. Haven't
> tested it much at all yet.

Well, it has some interesting properties, such as the hash function
being a constant:

+ return 80; /* FIXME: for test only */

which I assume was an artifact of some testing Christopher was doing.
:)

I'm checking out a proper hash function at the moment.

Cheers,
Stephen

2002-06-19 23:55:07

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

Hi,

On Wed, Jun 19, 2002 at 11:43:40PM +0100, Stephen C. Tweedie wrote:

> Well, it has some interesting properties, such as the hash function
> being a constant:
>
> + return 80; /* FIXME: for test only */
>
> which I assume was an artifact of some testing Christopher was doing.
> :)
>
> I'm checking out a proper hash function at the moment.

Done, checked into ext3 cvs (features-branch again.)

Deleting and recreating 100,000 files with this kernel:

[root@spock test0]# time xargs rm -f < /root/flist.100000

real 0m14.305s
user 0m0.750s
sys 0m5.430s
[root@spock test0]# time xargs touch < /root/flist.100000

real 0m16.244s
user 0m0.530s
sys 0m6.660s

that's an average of 160usec per create, 140usec per delete elapsed
time, and 66/54usec respectively system time.

I assume the elapsed time is greater only because we're starting to
wrap the journal due to the large amount of metadata being touched
(we're touching a lot of inodes doing the above, which I could avoid
by making hard links instead of new files.) Certainly, limiting the
test to 10,000 files lets it run at 100% cpu.

--Stephen

2002-06-21 03:29:24

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Thursday 20 June 2002 01:54, Stephen C. Tweedie wrote:
> > I'm checking out a proper hash function at the moment.
>
> Done, checked into ext3 cvs (features-branch again.)
>
> Deleting and recreating 100,000 files with this kernel:
>
> [root@spock test0]# time xargs rm -f < /root/flist.100000
>
> real 0m14.305s
> user 0m0.750s
> sys 0m5.430s
> [root@spock test0]# time xargs touch < /root/flist.100000
>
> real 0m16.244s
> user 0m0.530s
> sys 0m6.660s
>
> that's an average of 160usec per create, 140usec per delete elapsed
> time, and 66/54usec respectively system time.
>
> I assume the elapsed time is greater only because we're starting to
> wrap the journal due to the large amount of metadata being touched
> (we're touching a lot of inodes doing the above, which I could avoid
> by making hard links instead of new files.) Certainly, limiting the
> test to 10,000 files lets it run at 100% cpu.

I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2. As
predicted, half-md4 does produce very even bucket distributions. For 200,000
creates:

half-md4: 2872 avg bytes filled per 4k block (70%)
dx_hack_hash: 2853 avg bytes filled per 4k block (69%)

but guess which was faster overall?

half-md4: user 0.43 system 6.88 real 0:07.33 CPU 99%
dx_hack_hash: user 0.43 system 6.40 real 0:06.82 CPU 100%

This is quite reproducible: dx_hack_hash is always faster by about 6%. This
must be due entirely to the difference in hashing cost, since half-md4
produces measurably better distributions. Now what do we do?

By the way, I'm running about 37 usec per create here, on a 1GHz/1GB PIII,
with Ext2. I think most of the difference vs your timings is that your test
code is eating a lot of cpu.

--
Daniel

2002-06-21 15:07:33

by Stephen C. Tweedie

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

Hi,

On Fri, Jun 21, 2002 at 05:28:28AM +0200, Daniel Phillips wrote:

> I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2. As
> predicted, half-md4 does produce very even bucket distributions. For 200,000
> creates:
>
> half-md4: 2872 avg bytes filled per 4k block (70%)
> dx_hack_hash: 2853 avg bytes filled per 4k block (69%)
>
> but guess which was faster overall?
>
> half-md4: user 0.43 system 6.88 real 0:07.33 CPU 99%
> dx_hack_hash: user 0.43 system 6.40 real 0:06.82 CPU 100%
>
> This is quite reproducible: dx_hack_hash is always faster by about 6%. This
> must be due entirely to the difference in hashing cost, since half-md4
> produces measurably better distributions. Now what do we do?

I want to get this thing tested!

There are far too many factors for this to be resolved very quickly.
In reality, there will be a lot of disk cost under load which you
don't see in benchmarks, too. We also know for a fact that the early
hashes used in Reiserfs were quick but were vulnerable to terribly bad
behaviour under certain application workloads. With the half-md4, at
least we can expect decent worst-case behaviour unless we're under
active attack (i.e. only malicious apps get hurt).

I think the md4 is a safer bet until we know more, so I'd vote that we
stick with the ext3 cvs code which uses hash version #1 for that, and
defer anything else until we've seen more --- the hash versioning lets
us do that safely.
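The versioning can be pictured as a simple dispatch on a per-directory
version number (the sketch below is illustrative only; the struct, names and
version numbers are not the actual ext3 code):

```c
/* Illustrative sketch of hash versioning: a version number stored in
 * the directory's dx_info selects the hash function, so a better hash
 * can be introduced later without breaking existing indexed
 * directories. */

#define DX_HASH_LEGACY   0	/* hypothetical version numbers */
#define DX_HASH_HALF_MD4 1

struct dx_hash_info {
	int		version;	/* read from the on-disk dx_info */
	unsigned int	hash;		/* filled in by dx_hash() */
};

/* Placeholder hashes: toy stand-ins for dx_hack_hash and half-MD4. */
unsigned int hash_v0(const char *name, int len)
{
	unsigned int h = 0;
	while (len--)
		h = h * 31 + (unsigned char)*name++;
	return h;
}

unsigned int hash_v1(const char *name, int len)
{
	unsigned int h = 0x67452301;
	while (len--)
		h = (h << 5) ^ (h >> 27) ^ (unsigned char)*name++;
	return h;
}

int dx_hash(struct dx_hash_info *hinfo, const char *name, int len)
{
	switch (hinfo->version) {
	case DX_HASH_LEGACY:
		hinfo->hash = hash_v0(name, len);
		return 0;
	case DX_HASH_HALF_MD4:
		hinfo->hash = hash_v1(name, len);
		return 0;
	default:
		return -1;	/* unknown version: caller must fall back */
	}
}
```

An fsck that rebuilds an index can then rewrite the version field along
with the new hash values, so switching hashes later stays safe.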

> By the way, I'm running about 37 usec per create here, on a 1GHz/1GB PIII,
> with Ext2. I think most of the difference vs your timings is that your test
> code is eating a lot of cpu.

I was getting nearer to 50usec system time, but on an athlon k7-700,
so those timings are pretty comparable. Mine was ext3, too, which
accounts for a bit. The difference between that and wall-clock time
was all just idle time, which I think was due to using "touch"/"rm"
--- ie. there was a lot of inode table write activity due to the files
being created/deleted, and that was forcing a journal wrap before the
end of the test. That effect is not visible on ext2, of course.

--Stephen

2002-06-22 05:55:37

by Andreas Dilger

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Jun 21, 2002 05:28 +0200, Daniel Phillips wrote:
> I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2. As
> predicted, half-md4 does produce very even bucket distributions. For 200,000
> creates:
>
> half-md4: 2872 avg bytes filled per 4k block (70%)
> dx_hack_hash: 2853 avg bytes filled per 4k block (69%)
>
> but guess which was faster overall?
>
> half-md4: user 0.43 system 6.88 real 0:07.33 CPU 99%
> dx_hack_hash: user 0.43 system 6.40 real 0:06.82 CPU 100%
>
> This is quite reproducible: dx_hack_hash is always faster by about 6%. This
> must be due entirely to the difference in hashing cost, since half-md4
> produces measurably better distributions. Now what do we do?

While I normally advocate the "cheapest" way of implementing a given
solution (and dx_hack_hash is definitely the lowest-cost hash function
we could reasonably have), I would still be inclined to go with half-MD4
for this. A few reasons for that:
1) CPUs are getting faster all the time
2) it is a well-understood algorithm that has very good behaviour
3) it is much harder to spoof MD4 than dx_hack_hash
4) it is probably better to have the most uniform hash function we can
find than to do lots more block split/coalesce operations, so the
extra cost of half-MD4 may be a benefit overall

It would be interesting to re-run this test to create a few million
entries, but with periodic deletes.

Hmm, now that I think about it, split/coalesce operations are only
important on create and delete, while the hash cost is paid for each
lookup as well. It would be interesting to see the comparison with
a test something like this (sorry, don't have a system which has both
hash functions working right now):

#!/bin/sh
DEV=/dev/hda7
TESTDIR=/mnt/tmp
date
for d in `seq -f "directory_name_%05g" 1 10000` ; do
        mkdir $TESTDIR/$d
        for f in `seq -f "file_name_%05g" 1 10000` ; do
                touch $TESTDIR/$d/${d}_$f
        done
done
date
umount $TESTDIR
mount -o noatime $DEV $TESTDIR
date
for d in `seq -f "directory_name_%05g" 10000 -1 1` ; do
        for f in `seq -f "file_name_%05g" 10000 -1 1` ; do
                stat $TESTDIR/$d/${d}_$f > /dev/null
        done
done
date

Having the longer filenames will put more load on the hash function,
so we will see if we are really paying a big price for the overhead,
and the stat test will remove all of the creation time and disk dirtying
and just leave us with the pure lookup costs hopefully.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

2002-06-22 21:00:21

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Saturday 22 June 2002 07:53, Andreas Dilger wrote:
> 4) it is probably better to have the most uniform hash function we can
> find than to do lots more block split/coalesce operations, so the
> extra cost of half-MD4 may be a benefit overall

That's the thing: we already know the extra cost of half-MD4 is dominant
here. It only saves about 1.3% of the splitting, and splitting itself is a
small part of overall costs.

Argument (1), "CPUs are getting faster all the time", is a little weak -
directories are getting bigger all the time too, and... you know, it just
isn't stylish to waste cycles in common operations.

However, I buy the stability and predictability arguments, and we should go
with half-md4 for the alpha-test period. But also, we should see what we can
do to trim some of the fat off half-MD4 without degrading its nice properties
too much. During the same period, I'll see if I can find a better
multiplier for dx_hack_hash, to even out its distribution a bit. Armed with
these results, the release version should use the fastest hash that exhibits
acceptable statistical behaviour.

I do agree there's a lot to be said for going with a hash that has been
analyzed rigorously.

> It would be interesting to re-run this test to create a few million
> entries, but with periodic deletes.

Changing the topic slightly, I'm running the following stupid program to test
for directory growth during steady state operation similar to a mailer:

/* steady.c */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
	int target = (argc > 2)? strtol(argv[2], 0, 10): 0;
	int range = (argc > 3)? strtol(argv[3], 0, 10): 0;
	int size = 50, show = argc > 4 && !strncmp(argv[4], "y", 1);
	int balance = 0, deletes = 0;
	char name[size];

	if (!range) return -1;

	while (1)
	{
		snprintf(name, size, "%s%i", argv[1], rand() % range);
		if (balance < target)
		{
			/* O_CREAT|O_EXCL so the create fails if the name
			   already exists, as described below */
			int fd = open(name, O_CREAT | O_EXCL, 0644);
			if (fd >= 0)
			{
				if (show) printf("create %s\n", name);
				close(fd);
				balance++;
			}
		} else
		if (!remove(name))
		{
			if (show) printf("remove %s (%i), ", name, ++deletes);
			balance--;
		}
	}
	return 0;
}

when run with:

steady <basename> <count> <range> [y=print output]

it will create up to <count> files with names <basename><rand> with <rand> in
the range 0 to range-1, then it will start alternately deleting and creating
random files. Its algorithm is braindead - it just keeps trying to delete a
random file until it hits one that exists, and on create, it creates a random
file and tries again if it hits one already there. I'm running it now with:

steady foo 10000 1000000000 y

meaning it's working with 10,000 files, creating/deleting files with names
in the range foo0 to foo999999999, i.e., a billion possible names. It's a
horribly inefficient thing to do, since it has about a hundred thousand
failures for every success, but it was quick to write. I'm watching for
directory expansion. The directory started at 385024 bytes (38.5 bytes/entry)
and grew a block after about 300 steady-state remove/create cycles. Another
block was added around 500 cycles. It's up to 1100 cycles now, and hasn't
added another block. It sort of looks like the expansion is getting slower.
This is with dx_hack_hash. I'll let it run overnight to get a feeling of how
urgent the steady-state problem is.

What I expect to find is that directory expansion is slower with half-md4,
perhaps significantly slower. However, even if it is slower, we can't really
tolerate any steady-state expansion at all, so it's no excuse for not
implementing the coalesce.

> Hmm, now that I think about it, split/coalesce operations are only
> important on create and delete, while the hash cost is paid for each
> lookup as well. It would be interesting to see the comparison with
> a test something like this (sorry, don't have a system which has both
> hash functions working right now):
>
> [code]

Yes, good test: performance against name length. Well, I don't plan to try
it just now, since I've run out of time and must get ready for the pilgrimage
to Ottawa.

--
Daniel

2002-06-23 00:02:26

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Saturday 22 June 2002 22:59, Daniel Phillips wrote:
> ... The directory started at 385024 bytes (38.5 bytes/entry)
> and grew a block after about 300 steady-state remove/create cycles. Another
> block was added around 500 cycles. It's up to 1100 cycles now, and hasn't
> added another block. It sort of looks like the expansion is getting slower.
> This is with dx_hack_hash. I'll let it run overnight to get a feeling of how
> urgent the steady-state problem is.

And now it's up to 5500 cycles, and still hasn't added another block.
Empirically, we're seeing a very steep falloff in the expansion rate,
though I'll readily admit the testing is far from sufficient.

--
Daniel

2002-07-04 04:49:29

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

(I wrote this about 10 days ago and somehow didn't get around to posting it)

(please note my new, permanent email address)

On Friday 21 June 2002 17:06, Stephen C. Tweedie wrote:
> Hi,
>
> On Fri, Jun 21, 2002 at 05:28:28AM +0200, Daniel Phillips wrote:
>
> > I ran a bakeoff between your new half-md4 and dx_hack_hash on Ext2. As
> > predicted, half-md4 does produce very even bucket distributions. For 200,000
> > creates:
> >
> > half-md4: 2872 avg bytes filled per 4k block (70%)
> > dx_hack_hash: 2853 avg bytes filled per 4k block (69%)
> >
> > but guess which was faster overall?
> >
> > half-md4: user 0.43 system 6.88 real 0:07.33 CPU 99%
> > dx_hack_hash: user 0.43 system 6.40 real 0:06.82 CPU 100%
> >
> > This is quite reproducible: dx_hack_hash is always faster by about 6%. This
> > must be due entirely to the difference in hashing cost, since half-md4
> > produces measurably better distributions. Now what do we do?
>
> I want to get this thing tested!

Yes :-)

> There are far too many factors for this to be resolved very quickly.
> In reality, there will be a lot of disk cost under load which you
> don't see in benchmarks, too. We also know for a fact that the early
> hashes used in Reiserfs were quick but were vulnerable to terribly bad
> behaviour under certain application workloads. With the half-md4, at
> least we can expect decent worst-case behaviour unless we're under
> active attack (i.e. only malicious apps get hurt).

OK, anti-hash-attack is on the list of things to do, and it's fairly
clear how to go about it:

- When we create a volume, generate and store some number of bits of
random seed in the superblock, or if we are creating the first-ever
index on an existing volume, generate the seed at that time.

- Every dx_hash starts by seeding the hash function from the value in
the superblock.

The width of the superblock seed doesn't have to match the value needed
by the directory hash exactly since we can easily widen the value by
seeding a pseudo-random generator with the smaller seed. If the
superblock seed is wider than we need, we can fold it down via xor, or
better, just take as much as we need (all bits of our seed are supposed
to be equally random, so xoring parts of the seed together doesn't make
the result any more random).

We would do whatever seed-adjusting we need to at mount time, so the
process of seeding the per-dirop hash is just a memory->memory or
memory->register move. This part of the hashing process, at least,
should impose no measurable overhead.
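As a sketch, the whole scheme costs one xor-fold at mount time and one move
per hash (struct and field names below are hypothetical, as is the toy hash
being seeded):

```c
#include <stdint.h>
#include <stddef.h>

#define SB_SEED_WORDS 4	/* illustrative: 128 random bits in the superblock */

struct mounted_fs {
	uint32_t sb_seed[SB_SEED_WORDS];/* generated at mkfs/first-index time */
	uint32_t hash_seed;		/* prepared once, at mount time */
};

/* Fold the wide superblock seed down to the width the directory hash
 * wants.  All bits of the seed are supposed to be equally random, so
 * taking just one word would do as well; xor-folding loses nothing. */
void prepare_hash_seed(struct mounted_fs *fs)
{
	uint32_t seed = 0;
	size_t i;

	for (i = 0; i < SB_SEED_WORDS; i++)
		seed ^= fs->sb_seed[i];
	fs->hash_seed = seed;
}

/* Per-dirop seeding is then just a memory->register move; the hash body
 * here is a toy stand-in for the real directory hash. */
uint32_t seeded_hash(const struct mounted_fs *fs, const char *name, int len)
{
	uint32_t hash = fs->hash_seed;	/* the only per-operation seed cost */

	while (len--)
		hash = hash * 0x9e3779b1u + (unsigned char)*name++;
	return hash;
}
```

Since the seed never leaves the kernel, an attacker can't construct names
that collide on a particular volume without first learning that volume's
seed.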

Originally, I'd put aside the 'reserve_zero' field in the per-directory
dx_info for this purpose, but I later realized that it doesn't make
sense to do this anti-hash-attack protection at any finer granularity
than a volume, plus we save valuable real estate in the directory root
and gain a little efficiency. Putting it in the superblock is a clear
win.

By the way, the dx_info area is not hardwired to 8 bytes. The current
code is written so that dx_info can be expanded without breaking
forward compatibility. At least it's supposed to be written that way.
We should put that on the list of things to test.

> I think the md4 is a safer bet until we know more, so I'd vote that we
> stick with the ext3 cvs code which uses hash version #1 for that, and
> defer anything else until we've seen more --- the hash versioning lets
> us do that safely.

Yes, I agree we should err on the side of safety. The 6% overhead I see
in my stripped down test will certainly be diluted quite a bit under real
loads. Hash version #1 does not have to be the release version. We can
reasonably require a fsck at the end of the alpha-testing period, which
would rebuild all the hash indexes with the final, release hash. This
would have the added advantage of enlisting our alpha-testers to test
Ted's new e2fsck code as well.

I think we're making progress here. An 'aha' I had while working with
this new hash code is that there's one statistical property we're
looking for more than any other in a good hash: the hash value should
yield as little information as possible about the input string. Then
any statistical change in input strings over time won't cause a
statistical change in output hash values over time - precisely the
property htree wants, in order to keep leaf blocks filled uniformly
and slow the hash range fragmentation rate. This is why cryptographic
hash analysis is the right approach to this problem.

--
Daniel

P.S., in retrospect, pretty much all of the above survived scrutiny
at Ottawa.

2002-07-04 14:06:54

by James Lewis Nance

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Thu, Jul 04, 2002 at 06:48:45AM +0200, Daniel Phillips wrote:
> > behaviour under certain application workloads. With the half-md4, at
> > least we can expect decent worst-case behaviour unless we're under
> > active attack (i.e. only malicious apps get hurt).
>
> OK, anti-hash-attack is on the list of things to do, and it's fairly
> clear how to go about it:

Is it really worth the trouble and complexity to do anti-hash-attack?
What is the worst that could happen if someone managed to create a bunch
of files that hashed to the same value?

Jim

2002-07-05 02:07:26

by Daniel Phillips

Subject: Re: [Ext2-devel] Re: Shrinking ext3 directories

On Thursday 04 July 2002 16:15, [email protected] wrote:
> On Thu, Jul 04, 2002 at 06:48:45AM +0200, Daniel Phillips wrote:
> > > behaviour under certain application workloads. With the half-md4, at
> > > least we can expect decent worst-case behaviour unless we're under
> > > active attack (i.e. only malicious apps get hurt).
> >
> > OK, anti-hash-attack is on the list of things to do, and it's fairly
> > clear how to go about it:
>
> Is it really worth the trouble and complexity to do anti-hash-attack?
> What is the worst that could happen if someone managed to create a bunch
> of files that hashed to the same value?

Just a slowdown, but in some cases it could be a quadratic slowdown
that could conceivably be turned into a denial of service. As risks
go, it's a minor one, but there's a straightforward solution with
insignificant cost in either efficiency or code size, so why not do
it? The overhead is just a data move from the superblock per name
hash.

--
Daniel