2013-01-13 15:25:41

by Radek Pazdera

Subject: [RFC] Optimizing readdir()

Hi!

I am a student at Brno University of Technology working on my "final
project". I'm doing this with the great help of Lukas Czerner, who acts
as my mentor and guides me through the project.

I'd like to address the problem of getdents() returning entries in
hash order, which can have an impact on performance in some cases.
It has been mentioned here a few times already [1][2].

I did some tests [3] and it seems to me that the amount of available
page cache plays a crucial role in this case. It can compensate for
the seeks if the directory is "small enough". However, under memory
pressure, or when something else starts to evict those pages,
performance can drop.

The biggest performance difference can be observed when copying an entire
directory (the ext4-spd results are with the spd_readdir preload - and
it's *fast*):

http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-3-1-block-files/cp.png
http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-3-1-block-files/cp.results

Raw readdir() plus a stat() on every file is OK up to really large dirs
(as long as your page cache can hold all the inode tables, I think).
ext4's dir index is OK even with aged dirs:

http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/readdir-stat_clean-vs-aged.results
http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/readdir-stat.png
http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/aged/readdir-stat.png


My idea was to try to add a second tree to the current index that would
be used to retrieve entries in inode order. Its structure could be very
similar to the current HTree (the hash is 32 bits and so are inode
numbers).

The question is how hard this would hit creates, deletes, and renames.
An isolated create/delete would definitely be slower (ext4 would do
the same thing twice). But the page cache could save the case of
creating/deleting a whole directory. Deletion might in fact benefit
a little from the inode-order readdir (compare ext4 and ext4-spd):

http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/delete.png
http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/clean-1/delete.results

One downside is that it would roughly double the size of directory
metadata, as there would probably have to be two dirents for each entry
(one for each tree). If one tree linked to the other's dirents, node
splits would become extremely expensive. I had an idea about using an
array of dirents shared between the trees, but I'm not really sure how
to manage free space in it efficiently.

The on-disk format would remain the same, apart from the dx_root
structure, which would have to carry a pointer to the root of the second
tree. I think there might be a place for it in the embedded dx_root_info:

struct dx_root_info
{
        __le32 reserved_zero;
        u8 hash_version;
-       u8 info_length;         /* 8 */
+       u8 info_length;         /* 12 */
        u8 indirect_levels;
        u8 unused_flags;
+       __le32 itree_root_block;
};


What do you think about this approach? Is it worth trying?
Any feedback or suggestions are greatly appreciated :).


Radek Pazdera


[1] http://thread.gmane.org/gmane.comp.file-systems.ext4/23794
[2] http://thread.gmane.org/gmane.linux.file-systems/61910
[3] http://www.stud.fit.vutbr.cz/~xpazde00/ext4-tests/


2013-01-14 04:51:57

by Theodore Ts'o

Subject: Re: [RFC] Optimizing readdir()

On Sun, Jan 13, 2013 at 04:22:04PM +0100, Radek Pazdera wrote:
>
> My idea was to try to add a second tree to the current index that would
> be used to retrieve entries in inode order. Its structure could be very
> similar to the current HTree (the hash is 32 bits and so are inode
> numbers).

Something to think about is what the backwards compatibility impacts of
this would be. The current directory entries are indexed using
*logical* block numbers, which means that if we ignore the directory
htree blocks (which were deliberately crafted to look like deleted
directory entries that were the size of the entire directory block),
an htree-oblivious kernel (or the ext2 file system driver) would be
able to interpret the directory entries as a traditional ext2
directory.

When you add this feature, you will almost certainly break this, at
least for read/write forward compatibility.

What I would suggest, if we do go down this path, is to store the
secondary directory tree using physical block numbers, so the
directory blocks are stored outside of the directory entirely. That
means you'll need to have a 64-bit block number, but it means that you
won't have to do a lookup to translate the logical block number to the
physical block number. It also means that there won't be any
confusion about whether a particular directory entry block belongs to
the htree-indexed tree or the inode-number-indexed tree.

If we want to make the file system be backwards compatible, this gets
a bit more difficult, since the current code is not set up to handle
an info_length of anything other than 8. This won't be a problem
if you want to make the feature completely backwards incompatible,
but if you want to allow at least read-only compatibility, you might
want to stick a 64-bit block number at the end of the dx_root block
(this is why the directory checksum is in the dx_tail structure).
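
Purely as an illustration of what I mean (the struct name and layout
here are made up, not an existing format), something like this tacked
onto the end of the dx_root block:

struct dx_itree_tail
{
        __le32 it_reserved;
        __le64 it_root_pblock; /* physical block # of the second tree's root */
};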

I wonder if the better approach is to simply have some easy-to-use
library routines that do a readdir/sort in userspace. The spd_readdir
preload does basically this, and as we can see it's good enough
for most purposes. The problem is the danger when using this in threaded
programs, or if you have programs doing really strange things with
telldir/seekdir, etc.

But it wouldn't be that hard to write a generic library function which,
if it were used for find, ls, tar, and a few other key programs, would
solve the problem for most use cases.
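
As a purely illustrative sketch (not an existing API), even scandir(3)
with an inode-number comparator gets most of the way there for the
simple, non-threaded case:

#include <dirent.h>

/*
 * Comparator for scandir(3): order entries by inode number instead of
 * by name.  The hard parts that spd_readdir has to deal with
 * (telldir()/seekdir() semantics, thread safety) are ignored here.
 */
static int compar_ino(const struct dirent **a, const struct dirent **b)
{
        if ((*a)->d_ino < (*b)->d_ino)
                return -1;
        if ((*a)->d_ino > (*b)->d_ino)
                return 1;
        return 0;
}

/*
 * Usage:  n = scandir(path, &namelist, NULL, compar_ino);
 * then open()/stat() the returned names in that (inode) order.
 */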

Cheers,

- Ted

2013-01-15 07:25:22

by Radek Pazdera

Subject: Re: [RFC] Optimizing readdir()

On Sun, Jan 13, 2013 at 11:51:52PM -0500, Theodore Ts'o wrote:
>On Sun, Jan 13, 2013 at 04:22:04PM +0100, Radek Pazdera wrote:
>>
>> My idea was to try to add a second tree to the current index that would
>> be used to retrieve entries in inode order. Its structure could be very
>> similar to the current HTree (the hash is 32 bits and so are inode
>> numbers).
>
>Something to think about is what the backwards compatibility impacts of
>this would be. The current directory entries are indexed using
>*logical* block numbers, which means that if we ignore the directory
>htree blocks (which were deliberately crafted to look like deleted
>directory entries that were the size of the entire directory block),
>an htree-oblivious kernel (or the ext2 file system driver) would be
>able to interpret the directory entries as a traditional ext2
>directory.
>
>When you add this feature, you will almost certainly break this, at
>least for read/write forward compatibility.
>
>What I would suggest, if we do go down this path, is to store the
>secondary directory tree using physical block numbers, so the
>directory blocks are stored outside of the directory entirely. That
>means you'll need to have a 64-bit block number, but it means that you
>won't have to do a lookup to translate the logical block number to the
>physical block number. It also means that there won't be any
>confusion about whether a particular directory entry block belongs to
>the htree-indexed tree or the inode-number-indexed tree.

Any new blocks added to the directory file would have to be hidden in the
same manner as the current dx_node blocks are. But placing them completely
outside of the directory file sounds much simpler. I hadn't thought of
that.

>If we want to make the file system be backwards compatible, this gets
>a bit more difficult, since the current code is not set up to handle
>an info_length of anything other than 8. This won't be a problem
>if you want to make the feature completely backwards incompatible,
>but if you want to allow at least read-only compatibility, you might
>want to stick a 64-bit block number at the end of the dx_root block
>(this is why the directory checksum is in the dx_tail structure).

Oh, I didn't realize that. I'll need to think about the impact on
backwards and forwards compatibility a little more. I didn't think
that through as much as I thought I did. I would like to break as
few things as possible.

>I wonder if the better approach is to simply have some easy-to-use
>library routines that do a readdir/sort in userspace. The spd_readdir
>preload does basically this, and as we can see it's good enough
>for most purposes. The problem is the danger when using this in threaded
>programs, or if you have programs doing really strange things with
>telldir/seekdir, etc.

I think this approach is great in the specific cases when you know you are
going to have to deal with large dirs and your system can accommodate
the additional memory required to keep the whole directory file. But
directories can grow pretty quickly in the worst-case scenario of really
long names.

The size would have to be limited (probably?) for security reasons (as it
is in the spd_readdir) according to the memory available on the target
machine.

>But it wouldn't be that hard to write a generic library function which,
>if it were used for find, ls, tar, and a few other key programs, would
>solve the problem for most use cases.

I'm not sure if the possibility of allocating a fair amount of memory
would be acceptable for these basic operations. They can be used on a
variety of embedded devices that might have a problem with using
something similar to scandir(3) (as Stewart pointed out) for reading
a directory.

>Cheers,
>
> - Ted

Thank you very much for the feedback!

Cheers,

-Radek

2013-01-15 22:57:31

by Andreas Dilger

Subject: Re: [RFC] Optimizing readdir()

On 2013-01-15, at 12:21 AM, Radek Pazdera wrote:
> On Sun, Jan 13, 2013 at 11:51:52PM -0500, Theodore Ts'o wrote:
>> If we want to make the file system be backwards compatible, this
>> gets a bit more difficult, since the current code is not set up to
>> handle an info_length of anything other than 8. This won't be
>> a problem if you want to make the feature completely backwards
>> incompatible, but if you want to allow at least read-only
>> compatibility, you might want to stick a 64-bit block number at the
>> end of the dx_root block (this is why the directory checksum is in
>> the dx_tail structure).
>
> Oh, I didn't realize that. I'll need to think about the impact on
> backwards and forwards compatibility a little more. I didn't think
> that through as much as I thought I did. I would like to break as
> few things as possible.

Did you consider my proposal to order the inode allocations so
that they are (partially) ordered by the directory hash? That
would not require any on-disk format changes at all. The theory
is that keeping the entries mostly sorted in the inode table is
enough to avoid the pathological case in large directories where
only a single entry in each block is processed per transaction.

Even if you don't get perfect sorting, processing ten entries
in each block instead of one would give a 10x reduction in
journal traffic and cache misses.
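
Just to illustrate the idea (this is not actual ext4 allocator code),
the preferred inode slot within the block group could simply be biased
by the name hash, so that a readdir() in hash order walks the inode
table mostly forward:

/*
 * Illustrative only: map the 32-bit htree hash of the name onto the
 * group's inode range to get a goal slot for the new inode.
 */
static unsigned long preferred_ino_slot(unsigned long inodes_per_group,
                                        __u32 name_hash)
{
        return (unsigned long)(((unsigned long long)name_hash *
                                inodes_per_group) >> 32);
}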

>> I wonder if the better approach is to simply have some easy-to-use
>> library routines that do a readdir/sort in userspace. The spd_readdir
>> preload does basically this, and as we can see it's good enough for
>> most purposes. The problem is the danger when using this in threaded
>> programs, or if you have programs doing really strange things with
>> telldir/seekdir, etc.
>
> I think this approach is great in the specific cases when you know
> you are going to have to deal with large dirs and your system can
> accommodate the additional memory required to keep the whole
> directory file. But directories can grow pretty quickly in the
> worst-case scenario of really long names.
>
> The size would have to be limited (probably?) for security reasons
> (as it is in the spd_readdir) according to the memory available
> on the target machine.

Having an upper limit on the directory cache is OK too. Read all
of the entries that fit into the cache size, sort them, and return
them to the caller. When the caller has processed all of those
entries, read another batch, sort it, return this list, repeat.

As long as the list is piecewise ordered, I suspect it would gain
most of the benefit of linear ordering (sequential inode table
reads, avoiding repeated lookups of blocks). Maybe worthwhile if
you could test this out?
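
Roughly what I have in mind, as a user-space sketch (the names are made
up and error handling is minimal):

#include <dirent.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

struct batch_entry {
        ino_t ino;
        char *name;
};

static int cmp_ino(const void *a, const void *b)
{
        const struct batch_entry *ea = a, *eb = b;

        return (ea->ino > eb->ino) - (ea->ino < eb->ino);
}

/*
 * Piecewise ordering: read up to 'limit' entries from the directory
 * stream, sort that batch by inode number, hand the names to the
 * caller's callback, then continue with the next batch until EOF.
 */
static int walk_dir_batched(const char *path, size_t limit,
                            void (*cb)(const char *dir, const char *name))
{
        DIR *dirp;
        struct batch_entry *batch;
        struct dirent *de;
        size_t i, n;

        if (!limit)
                return -1;
        dirp = opendir(path);
        if (!dirp)
                return -1;
        batch = malloc(limit * sizeof(*batch));
        if (!batch) {
                closedir(dirp);
                return -1;
        }

        do {
                n = 0;
                while (n < limit && (de = readdir(dirp)) != NULL) {
                        batch[n].ino = de->d_ino;
                        batch[n].name = strdup(de->d_name);
                        n++;
                }
                qsort(batch, n, sizeof(*batch), cmp_ino);
                for (i = 0; i < n; i++) {
                        cb(path, batch[i].name);
                        free(batch[i].name);
                }
        } while (n == limit);

        free(batch);
        closedir(dirp);
        return 0;
}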

>> But it wouldn't be that hard to write a generic library function
>> which, if it were used for find, ls, tar, and a few other key
>> programs, would solve the problem for most use cases.
>
> I'm not sure if the possibility of allocating a fair amount of memory
> would be acceptable for these basic operations. They can be used on a
> variety of embedded devices that might have a problem with using
> something similar to scandir(3) (as Stewart pointed out) for reading
> a directory.

At the same time, the smaller the system, the smaller the directory
will typically be, so I don't think we need to go to extremes. If
the piecewise ordering of readdir entries gives a sufficient speedup,
then it would be possible to efficiently process directories of
arbitrary size, and optimally process the most common directories
that fit within the buffer.

Cheers, Andreas

2013-01-17 15:57:57

by Radek Pazdera

Subject: Re: [RFC] Optimizing readdir()

On Tue, Jan 15, 2013 at 03:44:57PM -0700, Andreas Dilger wrote:
>Did you consider my proposal to order the inode allocations so
>that they are (partially) ordered by the directory hash? That
>would not require any on-disk format changes at all. The theory
>is that keeping the entries mostly sorted in the inode table is
>enough to avoid the pathological case in large directories where
>only a single entry in each block is processed per transaction.

I only found a mention in an article about the status of ext3 from
OLS [1], but I didn't understand it at that time. I found the original
thread [2] (at least I think it is the right one). I'll have a look
at it. Thanks for pointing that out!

[1] http://www.kernel.org/doc/ols/2005/ols2005v1-pages-77-104.pdf
[2] http://lwn.net/Articles/25012/

>Having an upper limit on the directory cache is OK too. Read all
>of the entries that fit into the cache size, sort them, and return
>them to the caller. When the caller has processed all of those
>entries, read another batch, sort it, return this list, repeat.
>
>As long as the list is piecewise ordered, I suspect it would gain
>most of the benefit of linear ordering (sequential inode table
>reads, avoiding repeated lookups of blocks). Maybe worthwhile if
>you could test this out?

I will try that out. It shouldn't be hard to modify the spd_readdir
preload from Ted to do just this and run the tests again.

>At the same time, the smaller the system, the smaller the directory
>will typically be, so I don't think we need to go to extremes. If
>the piecewise ordering of readdir entries gives a sufficient speedup,
>then it would be possible to efficiently process directories of
>arbitrary size, and optimally process the most common directories
>that fit within the buffer.

You're right, huge directories are not common on small devices. It just
occurred to me because I am using a Raspberry Pi at home for backups. But
this is probably not that common.

Thank you for your suggestions!

Cheers,

Radek


2013-01-29 16:39:21

by Radek Pazdera

Subject: Re: [RFC] Optimizing readdir()

On Tue, Jan 15, 2013 at 03:44:57PM -0700, Andreas Dilger wrote:
>Having an upper limit on the directory cache is OK too. Read all
>of the entries that fit into the cache size, sort them, and return
>them to the caller. When the caller has processed all of those
>entries, read another batch, sort it, return this list, repeat.
>
>As long as the list is piecewise ordered, I suspect it would gain
>most of the benefit of linear ordering (sequential inode table
>reads, avoiding repeated lookups of blocks). Maybe worthwhile if
>you could test this out?

I did the tests last week. I modified the spd_readdir preload to
read at most $SPD_READDIR_CACHE_LIMIT entries, sort them and repeat.
The patch is here:

http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/
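
(For illustration only, not the literal patch: the limit is simply
taken from the environment when the preload initializes; 0 or an unset
variable keeps the old "cache everything" behaviour.)

#include <stdlib.h>

static unsigned long cache_limit;      /* 0 means no limit */

static void read_cache_limit(void)
{
        const char *env = getenv("SPD_READDIR_CACHE_LIMIT");

        cache_limit = env ? strtoul(env, NULL, 10) : 0;
}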

I tested it with the limit set to 0 (i.e., no limit), 1000, 10000,
50000, and completely without the preload. The test runs were
performed on the same directory, so the results shouldn't be
affected by positioning on disk.

Directory sizes went from 10k to 1.5M. The tests were run twice.
The first run is only with metadata. In the second run, each file
has 4096B of data.

Here are the results:
0B files:
http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/0B-files

4096B files:
http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/4096B-files/

The times seem to decrease accordingly as the limit of the cache
increases. The differences are bigger in case of 4096B files, where
the data blocks start to evict the inode tables. However, copying is
still more than two times slower for 1.5M files when 50000 entries
are cached.

It might be interesting to test what happens when the size of the
files in the directory increases.

Best Regards
Radek

2013-01-30 11:34:28

by Lukas Czerner

Subject: Re: [RFC] Optimizing readdir()

On Tue, 29 Jan 2013, Radek Pazdera wrote:

> Date: Tue, 29 Jan 2013 17:38:46 +0100
> From: Radek Pazdera <[email protected]>
> To: Andreas Dilger <[email protected]>
> Cc: Theodore Ts'o <[email protected]>, [email protected],
> Lukáš Czerner <[email protected]>
> Subject: Re: [RFC] Optimizing readdir()
>
> On Tue, Jan 15, 2013 at 03:44:57PM -0700, Andreas Dilger wrote:
> >Having an upper limit on the directory cache is OK too. Read all
> >of the entries that fit into the cache size, sort them, and return
> >them to the caller. When the caller has processed all of those
> >entries, read another batch, sort it, return this list, repeat.
> >
> >As long as the list is piecewise ordered, I suspect it would gain
> >most of the benefit of linear ordering (sequential inode table
> >reads, avoiding repeated lookups of blocks). Maybe worthwhile if
> >you could test this out?
>
> I did the tests last week. I modified the spd_readdir preload to
> read at most $SPD_READDIR_CACHE_LIMIT entries, sort them and repeat.
> The patch is here:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/
>
> I tested it with the limit set to 0 (i.e., no limit), 1000, 10000,
> 50000, and completely without the preload. The test runs were
> performed on the same directory, so the results shouldn't be
> affected by positioning on disk.
>
> Directory sizes went from 10k to 1.5M. The tests were run twice.
> The first run is only with metadata. In the second run, each file
> has 4096B of data.
>
> Here are the results:
> 0B files:
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/0B-files
>
> 4096B files:
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/4096B-files/
>
> The times seem to decrease accordingly as the limit of the cache
> increases. The differences are bigger in case of 4096B files, where
> the data blocks start to evict the inode tables. However, copying is
> still more than two times slower for 1.5M files when 50000 entries
> are cached.
>
> It might be interesting to test what happens when the size of the
> files in the directory increases.
>
> Best Regards
> Radek

Hi Radek,

those are interesting results, and they support the idea that you can
get most of the performance of a completely sorted inode list by doing
it in "batches", as long as the size of each batch is sufficiently
large. However, I do not think that using spd_readdir is the best
approach to the problem, nor do I think that it should be part of
a generic library. Aside from its "hackish" nature and the fact that
you will never be able to tell how much memory you can actually use
for the sorting, the fact is that other file systems handle this
problem well enough in comparison with ext4, and we should really
focus on fixing it rather than working around it.

Thanks!
-Lukas

2013-02-02 19:45:38

by Andreas Dilger

Subject: Re: [RFC] Optimizing readdir()

On 2013-01-30, at 4:34 AM, Lukáš Czerner wrote:
> On Tue, 29 Jan 2013, Radek Pazdera wrote:
>> Radek Pazdera <[email protected]> wrote:
>> On Tue, Jan 15, 2013 at 03:44:57PM -0700, Andreas Dilger wrote:
>>> Having an upper limit on the directory cache is OK too. Read all
>>> of the entries that fit into the cache size, sort them, and return
>>> them to the caller. When the caller has processed all of those
>>> entries, read another batch, sort it, return this list, repeat.
>>>
>>> As long as the list is piecewise ordered, I suspect it would gain
>>> most of the benefit of linear ordering (sequential inode table
>>> reads, avoiding repeated lookups of blocks). Maybe worthwhile if
>>> you could test this out?
>>
>> I did the tests last week. I modified the spd_readdir preload to
>> read at most $SPD_READDIR_CACHE_LIMIT entries, sort them and repeat.
>> The patch is here:
>>
>> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/dir-index-test-ext4/
>>
>> I tested it with the limit set to 0 (i.e., no limit), 1000, 10000,
>> 50000, and completely without the preload. The test runs were
>> performed on the same directory, so the results shouldn't be
>> affected by positioning on disk.
>>
>> Directory sizes went from 10k to 1.5M. The tests were run twice.
>> The first run is only with metadata. In the second run, each file
>> has 4096B of data.
>>
>> The times seem to decrease accordingly as the limit of the cache
>> increases. The differences are bigger in case of 4096B files, where
>> the data blocks start to evict the inode tables. However, copying is
>> still more than two times slower for 1.5M files when 50000 entries
>> are cached.

Still, caching 50k entries is twice as fast as caching none for
the directory with 1.5M entries. How much memory is that in total?
Maybe 2.5MB, which isn't too bad at all for any kind of modern
system.

>> It might be interesting to test what happens when the size of the
>> files in the directory increases.

Hopefully ext4 will move the large files into a different group.

> those are interesting results, and they support the idea that you can
> get most of the performance of a completely sorted inode list by doing
> it in "batches", as long as the size of each batch is sufficiently
> large. However, I do not think that using spd_readdir is the best
> approach to the problem, nor do I think that it should be part of
> a generic library. Aside from its "hackish" nature and the fact that
> you will never be able to tell how much memory you can actually use
> for the sorting, the fact is that other file systems handle this
> problem well enough in comparison with ext4, and we should really
> focus on fixing it rather than working around it.

I would argue that even if some on-disk optimization is found, it
will not help the majority of users who do not have their files
laid out with the new format. Also, spd_readdir can help all
filesystems, not just ext4. It will help ext2, ext3, ext4, isofs,
etc. It could statfs() to detect the filesystem type, and skip
XFS and Btrfs if this proves not to help for them...

I'm not against fixing this in the filesystem as well, but I think
it will be several years before the majority of users see it in a
kernel they are using.

Cheers, Andreas