2007-09-19 14:46:28

by Jan Kara

[permalink] [raw]
Subject: Enabling h-trees too early?

Hi,

I was just wondering: Currently we start building an h-tree in a directory
as soon as its size exceeds one block. But honestly, it does not seem to
make much sense to use this feature until the directory is much larger
(I'd say at least 16 or 32 KB). It actually slows down some operations,
like deleting the whole directory. So what is the reason for starting to
build the tree so early? Just the simplicity of building it when the
directory is only one block large?
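
To make the proposal concrete, here is a minimal sketch of the two policies
being compared; the constant and function names are invented for illustration
and are not existing kernel symbols (in the real code the conversion is
triggered from the directory-entry add path once the first block fills up):

#include <stdbool.h>

#define DIR_BLOCK_SIZE      4096         /* assume 4 KB directory blocks */
#define HTREE_MIN_DIR_SIZE  (32 * 1024)  /* proposed threshold: 32 KB */

/* Current policy: build the h-tree as soon as the first block overflows. */
static bool should_index_current(unsigned long long dir_size)
{
        return dir_size > DIR_BLOCK_SIZE;
}

/* Proposed policy: keep a plain linear directory until it is "large". */
static bool should_index_proposed(unsigned long long dir_size)
{
        return dir_size >= HTREE_MIN_DIR_SIZE;
}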

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR


2007-09-19 17:34:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Sep 19, 2007 17:07 +0200, Jan Kara wrote:
> I was just wondering: Currently we start building an h-tree in a directory
> as soon as its size exceeds one block. But honestly, it does not seem to
> make much sense to use this feature until the directory is much larger
> (I'd say at least 16 or 32 KB). It actually slows down some operations,
> like deleting the whole directory. So what is the reason for starting to
> build the tree so early? Just the simplicity of building it when the
> directory is only one block large?

Yes, doing it at one block is easy, doing it with more blocks is complex.
At the time we added this feature, tests showed no performance difference
between 2 unhashed blocks and 2 hashed blocks so the easier code was used.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-09-19 18:24:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Wed, Sep 19, 2007 at 05:07:15PM +0200, Jan Kara wrote:
>
> I was just wondering: Currently we start building an h-tree in a directory
> as soon as its size exceeds one block. But honestly, it does not seem to
> make much sense to use this feature until the directory is much larger
> (I'd say at least 16 or 32 KB). It actually slows down some operations,
> like deleting the whole directory. So what is the reason for starting to
> build the tree so early? Just the simplicity of building it when the
> directory is only one block large?

How much is it slowing down operations such as rm -rf? For a small
directory (< 32k), I would assume that the difference would be
relatively small. What sort of differences have you measured, and is
this a common case problem?

Certainly one of the things that we could consider is for small
directories to do an in-memory sort of all of the directory entries at
opendir() time, and keeping that list until it is closed. We can't do
this for really big directories, but we could easily do it for
directories under 32k or 64k.
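
As a rough userspace illustration of the same idea (not the proposed kernel
change, just the equivalent trick applied from user space): read the whole
directory up front, sort the entries by inode number, and only then process
them in that order.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

struct ent {
        ino_t ino;
        char  name[256];
};

static int cmp_ino(const void *a, const void *b)
{
        const struct ent *x = a, *y = b;

        return (x->ino > y->ino) - (x->ino < y->ino);
}

int main(int argc, char **argv)
{
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *de;
        struct ent *list = NULL;
        size_t n = 0, cap = 0;

        if (!d)
                return 1;
        while ((de = readdir(d)) != NULL) {
                if (n == cap) {                 /* grow the array as needed */
                        cap = cap ? cap * 2 : 64;
                        list = realloc(list, cap * sizeof(*list));
                        if (!list)
                                return 1;
                }
                list[n].ino = de->d_ino;
                snprintf(list[n].name, sizeof(list[n].name), "%s", de->d_name);
                n++;
        }
        closedir(d);

        /* Sort by inode number so later processing walks the inode table
         * (and hence the disk) roughly sequentially. */
        qsort(list, n, sizeof(*list), cmp_ino);
        for (size_t i = 0; i < n; i++)
                printf("%10lu %s\n", (unsigned long)list[i].ino, list[i].name);
        free(list);
        return 0;
}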

- Ted

2007-09-20 13:12:59

by Jan Kara

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Wed 19-09-07 14:24:50, Theodore Tso wrote:
> On Wed, Sep 19, 2007 at 05:07:15PM +0200, Jan Kara wrote:
> >
> > I was just wondering: Currently we start building an h-tree in a directory
> > as soon as its size exceeds one block. But honestly, it does not seem to
> > make much sense to use this feature until the directory is much larger
> > (I'd say at least 16 or 32 KB). It actually slows down some operations,
> > like deleting the whole directory. So what is the reason for starting to
> > build the tree so early? Just the simplicity of building it when the
> > directory is only one block large?
>
> How much is it slowing down operations such as rm -rf? For a small
> directory (< 32k), I would assume that the difference would be
> relatively small. What sort of differences have you measured, and is
> this a common case problem?
So, for example, deleting a kernel tree on my computer takes ~14 seconds
with h-trees and less than 9 seconds without them. Also, doing 'cp -lr' of
the kernel tree takes 8 seconds with h-trees and 6.3 seconds without
them... So I think the performance difference is quite measurable.

> Certainly one of the things that we could consider is for small
> directories to do an in-memory sort of all of the directory entries at
> opendir() time, and keeping that list until it is closed. We can't do
> this for really big directories, but we could easily do it for
> directories under 32k or 64k.
Umm, yes. That would probably be feasible. But converting to htrees only
when directories grow larger would also avoid the problem. It does not
seem *that* hard, but maybe I'm missing some nasty details...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-09-20 14:28:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Thu, Sep 20, 2007 at 03:33:50PM +0200, Jan Kara wrote:
> So, for example, deleting a kernel tree on my computer takes ~14 seconds
> with h-trees and less than 9 seconds without them. Also, doing 'cp -lr' of
> the kernel tree takes 8 seconds with h-trees and 6.3 seconds without
> them... So I think the performance difference is quite measurable.

This is in a completely cold cache state? (i.e. mounting and
unmounting the filesystem before doing the rm -rf?)

On my kernel tree, using the command: "lsattr -R | grep -- -I-" shows
that only 8 directories are htree indexed, and they're not that big:

12 drwxr-xr-x 12 tytso tytso 12288 2007-09-14 16:25 ./drivers/char
24 drwxr-xr-x 30 tytso tytso 24576 2007-09-14 16:25 ./drivers/net
20 drwxr-xr-x 2 tytso tytso 20480 2007-09-14 16:25 ./drivers/usb/serial
32 drwxr-xr-x 24 tytso tytso 32768 2007-09-14 16:10 ./include/linux
12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/bridge/netfilter
24 drwxr-xr-x 2 tytso tytso 24576 2007-09-14 16:25 ./net/ipv4/netfilter
12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/ipv6/netfilter
32 drwxr-xr-x 2 tytso tytso 32768 2007-09-14 16:25 ./net/netfilter

... which means if the benchmark only focused on deleting these files,
then presumably the percentage increase would be even worse.

> > Certainly one of the things that we could consider is for small
> > directories to do an in-memory sort of all of the directory entries at
> > opendir() time, and keeping that list until it is closed. We can't do
> > this for really big directories, but we could easily do it for
> > directories under 32k or 64k.
>
> Umm, yes. That would probably be feasible. But converting to htrees only
> when directories grow larger would also avoid the problem. It does not
> seem *that* hard, but maybe I'm missing some nasty details...

The reason I mentioned the caching idea is that we already have code in
the kernel to manage and return directory entries stored in an rbtree,
albeit for a slightly different purpose. So hacking it up to cache all of
the directory entries for directories < 64k and to index them by inode
number instead of hash key would be pretty easy.
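
A minimal sketch of what such an inode-keyed cache could look like, using the
kernel's generic rbtree API; the structure and function names below are made
up for illustration and are not the existing ext3/4 code:

#include <linux/errno.h>
#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/string.h>

struct dir_cache_entry {
        struct rb_node  node;
        unsigned long   ino;            /* sort key: inode number */
        int             name_len;
        char            name[];         /* entry name, not NUL-terminated */
};

/* Insert one directory entry into a per-open-file rbtree, keyed by inode. */
static int dir_cache_insert(struct rb_root *root, unsigned long ino,
                            const char *name, int name_len)
{
        struct rb_node **p = &root->rb_node, *parent = NULL;
        struct dir_cache_entry *ce;

        while (*p) {
                parent = *p;
                ce = rb_entry(parent, struct dir_cache_entry, node);
                if (ino < ce->ino)
                        p = &(*p)->rb_left;
                else
                        p = &(*p)->rb_right;
        }

        ce = kmalloc(sizeof(*ce) + name_len, GFP_KERNEL);
        if (!ce)
                return -ENOMEM;
        ce->ino = ino;
        ce->name_len = name_len;
        memcpy(ce->name, name, name_len);
        rb_link_node(&ce->node, parent, p);
        rb_insert_color(&ce->node, root);
        return 0;
}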

What's nasty about converting to htrees after the directories become
larger is that we need to reserve extra space in the journal for each
block that we need to modify, and then just the fact that we have to
keep track of the multiple buffers. Basically, not impossible but
just a pain in the *ss.
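
For the journal-credit point above, the conversion path would have to start a
handle big enough for every directory block it touches, roughly along these
lines (the helper below is hypothetical; ext3_journal_start() and the credit
macro are the existing ext3/JBD interfaces):

#include <linux/fs.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>

/* Hypothetical helper: reserve one credit per directory block that the
 * conversion will rewrite, plus the usual per-operation overhead. */
static handle_t *start_dir_convert(struct inode *dir, int nr_dir_blocks)
{
        int credits = nr_dir_blocks + EXT3_DATA_TRANS_BLOCKS(dir->i_sb);

        return ext3_journal_start(dir, credits);
}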

- Ted

2007-09-20 14:58:41

by Jan Kara

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

> On Thu, Sep 20, 2007 at 03:33:50PM +0200, Jan Kara wrote:
> > So, for example, deleting a kernel tree on my computer takes ~14 seconds
> > with h-trees and less than 9 seconds without them. Also, doing 'cp -lr' of
> > the kernel tree takes 8 seconds with h-trees and 6.3 seconds without
> > them... So I think the performance difference is quite measurable.
>
> This is in a completely cold cache state? (i.e. mounting and
> unmounting the filesystem before doing the rm -rf?)
Yes.

> On my kernel tree, using the command: "lsattr -R | grep -- -I-" shows
> that only 8 directories are htree indexed, and they're not that big:
>
> 12 drwxr-xr-x 12 tytso tytso 12288 2007-09-14 16:25 ./drivers/char
> 24 drwxr-xr-x 30 tytso tytso 24576 2007-09-14 16:25 ./drivers/net
> 20 drwxr-xr-x 2 tytso tytso 20480 2007-09-14 16:25 ./drivers/usb/serial
> 32 drwxr-xr-x 24 tytso tytso 32768 2007-09-14 16:10 ./include/linux
> 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/bridge/netfilter
> 24 drwxr-xr-x 2 tytso tytso 24576 2007-09-14 16:25 ./net/ipv4/netfilter
> 12 drwxr-xr-x 2 tytso tytso 12288 2007-09-14 16:25 ./net/ipv6/netfilter
> 32 drwxr-xr-x 2 tytso tytso 32768 2007-09-14 16:25 ./net/netfilter
>
> ... which means if the benchmark only focused on deleting these files,
> then presumably the percentage increase would be even worse.
Hmm, strange - I've just looked at my computer and dir_index is set
just for 5 directories in my tree. If I try deleting just them, I also
see some performance decrease but it's less than if I try deleting the
whole tree (and that result seems to be quite consistent)... There's something
fishy there. Maybe I could try seekwatcher or something similar to see
what's really happening.

> > > Certainly one of the things that we could consider is for small
> > > directories to do an in-memory sort of all of the directory entries at
> > > opendir() time, and keeping that list until it is closed. We can't do
> > > this for really big directories, but we could easily do it for
> > > directories under 32k or 64k.
> >
> > Umm, yes. That would probably be feasible. But converting to htrees only
> > when directories grow larger would also avoid the problem. It does not
> > seem *that* hard, but maybe I'm missing some nasty details...
>
> The reason I mentioned the caching idea is that we already have code in
> the kernel to manage and return directory entries stored in an rbtree,
> albeit for a slightly different purpose. So hacking it up to cache all of
> the directory entries for directories < 64k and to index them by inode
> number instead of hash key would be pretty easy.
>
> What's nasty about converting to htrees after the directories become
> larger is that we need to reserve extra space in the journal for each
> block that we need to modify, and then just the fact that we have to
> keep track of the multiple buffers. Basically, not impossible but
> just a pain in the *ss.
I see :).

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-09-20 15:14:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote:
> Hmm, strange - I've just looked at my computer and dir_index is set
> just for 5 directories in my tree.

I looked at a tree that had object files, which is probably why I had
8 directories; I'm guessing you probably just had kernel sources and
no build files.

> If I try deleting just them, I also
> see some performance decrease but it's less than if I try deleting the
> whole tree (and that result seems to be quite consistent)... There's something
> fishy there. Maybe I could try seekwatcher or something similar to see
> what's really happening.

That is very strange.....

- Ted

2007-09-20 15:58:11

by Jan Kara

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Thu 20-09-07 11:14:40, Theodore Tso wrote:
> On Thu, Sep 20, 2007 at 04:58:39PM +0200, Jan Kara wrote:
> > Hmm, strange - I've just looked at my computer and dir_index is set
> > just for 5 directories in my tree.
>
> I looked at a tree that had object files, which is probably why I had
> 8 directories; I'm guessing you probably just had kernel sources and
> no build files.
>
> > If I try deleting just them, I also
> > see some performance decrease but it's less than if I try deleting the
> > whole tree (and that result seems to be quite consistent)... There's something
> > fishy there. Maybe I could try seekwatcher or something similar to see
> > what's really happening.
>
> That is very strange.....
Just a guess: Can't the culprit be the following test in ext3/4_readdir()?
if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) &&
    ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) ||
     ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
        error = ext4_dx_readdir(filp, dirent, filldir);
        if (error != ERR_BAD_DX_DIR) {
                ret = error;
                goto out;
        }
        /*
         * We don't set the inode dirty flag since it's not
         * critical that it get flushed back to the disk.
         */
        EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL;
}
It calls ext4_dx_readdir() for *every* directory with 1 block (we have
1326 of them in the kernel tree). Now ext4_dx_readdir() calls
ext4_htree_fill_tree(), which finds out the directory is not an h-tree
and calls htree_dirblock_to_tree(). So even for 4 KB directories we end
up deleting inodes in hash order! And as a bonus we burn some cycles
building trees, etc. What is the point of this?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2007-09-20 17:02:52

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Thu, Sep 20, 2007 at 06:19:04PM +0200, Jan Kara wrote:
> if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) &&
>     ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) ||
>      ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
>         error = ext4_dx_readdir(filp, dirent, filldir);
>         if (error != ERR_BAD_DX_DIR) {
>                 ret = error;
>                 goto out;
>         }
>         /*
>          * We don't set the inode dirty flag since it's not
>          * critical that it get flushed back to the disk.
>          */
>         EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL;
> }
> It calls ext4_dx_readdir() for *every* directory with 1 block (we have
> 1326 of them in the kernel tree). Now ext4_dx_readdir() calls
> ext4_htree_fill_tree(), which finds out the directory is not an h-tree
> and calls htree_dirblock_to_tree(). So even for 4 KB directories we end
> up deleting inodes in hash order! And as a bonus we burn some cycles
> building trees, etc. What is the point of this?

That was added so we wouldn't get screwed when a directory that was
previously non-htree became an htree directory while the directory fd
is open. So the failure case is one where you do opendir(), readdir()
on 25% of the directory, sleep for 2 hours, and in the meantime, 200
files are added to the directory and it gets converted into an htree
index, causing all of the previously returned readdir() results in
directory order to be completely screwed up now that the directory has
been converted into an htree. (All of the readdir/telldir/seekdir
POSIX requirements cause filesystem designers to tear their hair out.)

What we would need to do to avoid needing this is to read in the
entire directory leaf page into the rbtree, sorted by inode number,
and then to keep that rbtree for the entire life of the open directory
file descriptor. We would also have to change telldir/seekdir to use
something else as a telldir cookie, and readdir would have to be set
up to *only* use the rbtree, and never look at the on-disk directory.
This would also mean that all of the files created or deleted after
the initial opendir() would never be reflected in results returned by
readdir(), but that's allowed by POSIX. And if we do this for a
single block 4k directory, we might as well do it for a 32k or 64k
HTREE directory as well.
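
For illustration, the readdir side of that scheme might look roughly like the
sketch below; the names are invented, and the entry's inode number is used
directly as the telldir()/seekdir() cookie:

#include <linux/rbtree.h>

/* Same per-open-file cache entry layout as in the earlier sketch. */
struct dir_cache_entry {
        struct rb_node  node;
        unsigned long   ino;
        int             name_len;
        char            name[];
};

/*
 * Emit cached entries with ino >= *pos; on return *pos is the cookie of
 * the next entry, so a later seekdir(*pos) resumes correctly even if the
 * on-disk directory was converted to an htree in the meantime.
 */
static int dir_cache_readdir(struct rb_root *root, unsigned long *pos,
                             int (*emit)(void *ctx, const char *name,
                                         int len, unsigned long ino),
                             void *ctx)
{
        struct rb_node *n;

        for (n = rb_first(root); n; n = rb_next(n)) {
                struct dir_cache_entry *ce =
                        rb_entry(n, struct dir_cache_entry, node);

                if (ce->ino < *pos)
                        continue;
                if (emit(ctx, ce->name, ce->name_len, ce->ino))
                        break;          /* the caller's buffer is full */
                *pos = ce->ino + 1;
        }
        return 0;
}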

Does that make sense?

- Ted

2007-09-21 09:02:59

by Andi Kleen

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

Theodore Tso <[email protected]> writes:
>
> Certainly one of the things that we could consider is for small
> directories to do an in-memory sort of all of the directory entries at
> opendir() time, and keeping that list until it is closed. We can't do
> this for really big directories, but we could easily do it for
> directories under 32k or 64k.

I assume you mean sort by inode, because sorting by htree key would
be as bad as htrees.

But wouldn't that break parallel readdir for a directory that grows
from under 32/64 KB to over it? E.g., if the sort moves already-read
entries to after the cursor, readdir would return some entries twice.

I suspect you would need to keep it always sorted after that, no matter
how big it gets. So the 32/64 KB boundary seems useless and you would
need a sorted, potentially partial, in-memory rbtree anyway.

-Andi

2007-09-21 11:45:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

On Fri, Sep 21, 2007 at 11:02:58AM +0200, Andi Kleen wrote:
> I assume you mean sort by inode, because sort by htree key would
> be as bad as htrees.
>
> But wouldn't that break parallel readdir for a directory that just grows
> from <32/64K to over it? e.g. if the sort moves already read
> entries to after the cursor readdir would return some entries twice.

No, it wouldn't break it because after snapshotting the tree, we would
only use the snapshot and not refer to the on-disk format until the
directory file handle is closed or there is an explicit lseek to 0.
To quote from the spec:

    If a file is removed from or added to the directory after the most recent
    call to the opendir() or rewinddir() function, whether a subsequent call to
    the readdir() function returns an entry for that file is unspecified.

So, if some program does the stupid thing and opens a directory,
doesn't use it for five hours, and in the meantime 2,000 files are
created, it's OK if we only return the set of files that were in
existence when the opendir() was originally called. This makes us OK
in terms of spec conformance, and given that real life programs don't
do stuff this stupid, we should be fine.

- Ted

2007-09-21 13:49:03

by Jan Kara

[permalink] [raw]
Subject: Re: Enabling h-trees too early?

> On Thu, Sep 20, 2007 at 06:19:04PM +0200, Jan Kara wrote:
> > if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb, EXT4_FEATURE_COMPAT_DIR_INDEX) &&
> >     ((EXT4_I(inode)->i_flags & EXT4_INDEX_FL) ||
> >      ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
> >         error = ext4_dx_readdir(filp, dirent, filldir);
> >         if (error != ERR_BAD_DX_DIR) {
> >                 ret = error;
> >                 goto out;
> >         }
> >         /*
> >          * We don't set the inode dirty flag since it's not
> >          * critical that it get flushed back to the disk.
> >          */
> >         EXT4_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT4_INDEX_FL;
> > }
> > It calls ext4_dx_readdir() for *every* directory with 1 block (we have
> > 1326 of them in the kernel tree). Now ext4_dx_readdir() calls
> > ext4_htree_fill_tree(), which finds out the directory is not an h-tree
> > and calls htree_dirblock_to_tree(). So even for 4 KB directories we end
> > up deleting inodes in hash order! And as a bonus we burn some cycles
> > building trees, etc. What is the point of this?
>
> That was added so we wouldn't get screwed when a directory that was
> previously non-htree became an htree directory while the directory fd
> is open. So the failure case is one where you do opendir(), readdir()
> on 25% of the directory, sleep for 2 hours, and in the meantime, 200
> files are added to the directory and it gets converted into an htree
> index, causing all of the previously returned readdir() results in
> directory order to be completely screwed up now that the directory has
> been converted into an htree. (All of the readdir/telldir/seekdir
> POSIX requirements cause filesystem designers to tear their hair out.)
Oh, yes. Thanks for the explanation.

> What we would need to do to avoid needing this is to read in the
> entire directory leaf page into the rbtree, sorted by inode number,
> and then to keep that rbtree for the entire life of the open directory
> file descriptor. We would also have to change telldir/seekdir to use
> something else as a telldir cookie, and readdir would have to be set
> up to *only* use the rbtree, and never look at the on-disk directory.
> This would also mean that all of the files created or deleted after
> the initial opendir() would never be reflected in results returned by
> readdir(), but that's allowed by POSIX. And if we do this for a
> single block 4k directory, we might as well do it for a 32k or 64k
> HTREE directory as well.
Yes, this makes sense...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs