2011-03-14 20:24:52

by Phillip Susi

Subject: Large directories and poor order correlation

Shouldn't copying or extracting or otherwise populating a large
directory of many small files at the same time result in a strong
correlation between the order the names appear in the directory and the
order their data blocks are stored on disk, so that read performance
is not negatively impacted by fragmentation?

Background:

While migrating a server to a new system, I noticed that it was taking
forever to rsync my Maildir. This seemed to be caused by very low disk
throughput from excessive seeking. I confirmed it with timing tests,
using both tar and dump to /dev/zero to test reading the files after
dropping cache. I noticed that tar was horribly slow, and dump was
much better. I surmise that this was due to poor correlation between
the order of file names in the directory and the order of their data
blocks on disk. I would expect this on an old fs that has grown slowly
over a few years, and I expected it would mostly go away after copying
the files to a new system. Instead I found some surprises, the big one
being that after copying the files to the new system, they still have a
poor correlation between directory order and inode order.

Details:

The old system was a single disk with sequential throughput of 51 MB/s,
and the new one is a 4-disk RAID-5 with sequential throughput of 160 MB/s.

On the old system, tar took 30 minutes, and dump took 8 minutes. On the
new system, tar took 18 minutes, and dump took a mere 30 seconds!

On just the linux-kernel Maildir, which has 85,364 files taking up 660M
of space, dump on the old system clocks in at 11m41s and only 10s on the
new system.

I wrote a python script to actually measure the correlation between
name and inode order, between inode and data block order, and between
name and data block order. It enumerates the files and counts each one
as in or out of order by comparing its inode number to the previous
one's. I expected to see a much better correlation on the new system,
but I did not.
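
A minimal sketch of the name-to-inode half of such a measurement (a
reconstruction of the idea, not the attached correlation.py; it reads
d_ino straight out of readdir(), so it needs no stat calls):

#include <dirent.h>
#include <stdio.h>

/* Walk a directory in readdir() order and report what fraction of
 * consecutive entries have ascending inode numbers; ~0.5 means name
 * order tells you nothing about inode order. */
int main(int argc, char **argv)
{
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *e;
        ino_t prev = 0;
        long in_order = 0, total = 0;

        if (!d) {
                perror("opendir");
                return 1;
        }
        while ((e = readdir(d))) {
                if (e->d_name[0] == '.')
                        continue;       /* skip ., .. and dotfiles */
                if (total++ && e->d_ino > prev)
                        in_order++;
                prev = e->d_ino;
        }
        closedir(d);
        if (total > 1)
                printf("Name to inode correlation: %f\n",
                       (double)in_order / (total - 1));
        return 0;
}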

On the new system the linux-kernel Maildir gets these results:

Name to inode correlation: 0.50002342908
Name to block correlation: 0.49996485638
Inode to block correlation: 0.889239023476

And on the old system:

Name to inode correlation: 0.499531418397
Name to block correlation: 0.499554847477
Inode to block correlation: 0.987418583946

The other folders get similar results. You can see that the inode to
block correlation improved, but it wasn't very bad to begin with, so
going from 8 minutes to 30 seconds seems to be a good deal more
improvement than this would explain. What concerns me, though, is that
the name to inode correlation went from terrible to slightly worse,
which is why tar is still horribly slow.

Attaching the script for reference.


Attachments:
correlation.py (1.36 kB)

2011-03-14 20:37:23

by Eric Sandeen

Subject: Re: Large directories and poor order correlation

On 3/14/11 3:24 PM, Phillip Susi wrote:
> Shouldn't copying or extracting or otherwise populating a large
> directory of many small files at the same time result in a strong
> correlation between the order the names appear in the directory and the
> order their data blocks are stored on disk, so that read performance
> is not negatively impacted by fragmentation?

No, because htree (dir_index) dirs return names in hash-value
order, not inode number order, i.e. "at random."

As you say, sorting by inode number will work much better...
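
A rough sketch of that approach, assuming POSIX.1-2008 scandir(), which
takes a comparison function (error handling kept minimal): list the
directory, sort by d_ino, then do the opens/stats/reads in that order.

#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

static int by_inode(const struct dirent **a, const struct dirent **b)
{
        return ((*a)->d_ino > (*b)->d_ino) - ((*a)->d_ino < (*b)->d_ino);
}

int main(int argc, char **argv)
{
        struct dirent **list;
        int n = scandir(argc > 1 ? argv[1] : ".", &list, NULL, by_inode);

        if (n < 0) {
                perror("scandir");
                return 1;
        }
        for (int i = 0; i < n; i++) {
                /* open/stat/read the files in this order instead */
                printf("%10lu  %s\n", (unsigned long)list[i]->d_ino,
                       list[i]->d_name);
                free(list[i]);          /* scandir allocates each entry */
        }
        free(list);
        return 0;
}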

-Eric

> Background:
>
> While migrating a server to a new system, I noticed that it was taking
> forever to rsync my Maildir. This seemed to be caused by very low disk
> throughput from excessive seeking. I confirmed it with timing tests,
> using both tar and dump to /dev/zero to test reading the files after
> dropping cache. I noticed that tar was horribly slow, and dump was
> much better. I surmise that this was due to poor correlation between
> the order of file names in the directory and the order of their data
> blocks on disk. I would expect this on an old fs that has grown slowly
> over a few years, and I expected it would mostly go away after copying
> the files to a new system. Instead I found some surprises, the big one
> being that after copying the files to the new system, they still have a
> poor correlation between directory order and inode order.
>
> Details:
>
> The old system was a single disk with sequential throughput of 51 MB/s,
> and the new one is a 4-disk RAID-5 with sequential throughput of 160 MB/s.
>
> On the old system, tar took 30 minutes, and dump took 8 minutes. On the
> new system, tar took 18 minutes, and dump took a mere 30 seconds!
>
> On just the linux-kernel Maildir, which has 85,364 files taking up 660M
> of space, dump on the old system clocks in at 11m41s and only 10s on the
> new system.
>
> I wrote a python script to actually measure the correlation between
> name and inode order, between inode and data block order, and between
> name and data block order. It enumerates the files and counts each one
> as in or out of order by comparing its inode number to the previous
> one's. I expected to see a much better correlation on the new system,
> but I did not.
>
> On the new system the linux-kernel Maildir gets these results:
>
> Name to inode correlation: 0.50002342908
> Name to block correlation: 0.49996485638
> Inode to block correlation: 0.889239023476
>
> And on the old system:
>
> Name to inode correlation: 0.499531418397
> Name to block correlation: 0.499554847477
> Inode to block correlation: 0.987418583946
>
> The other folders get similar results. You can see that the inode to
> block correlation improved, but it wasn't very bad to begin with, so
> going from 8 minutes to 30 seconds seems to be a good deal more
> improvement than this would explain. What concerns me, though, is that
> the name to inode correlation went from terrible to slightly worse,
> which is why tar is still horribly slow.
>
> Attaching the script for reference.
>


2011-03-14 20:52:28

by Phillip Susi

Subject: Re: Large directories and poor order correlation

On 3/14/2011 4:37 PM, Eric Sandeen wrote:
> On 3/14/11 3:24 PM, Phillip Susi wrote:
>> Shouldn't copying or extracting or otherwise populating a large
>> directory of many small files at the same time result in a strong
>> correlation between the order the names appear in the directory and the
>> order their data blocks are stored on disk, so that read performance
>> is not negatively impacted by fragmentation?
>
> No, because htree (dir_index) dirs return names in hash-value
> order, not inode number order, i.e. "at random."

I thought that the htree was used to look up names, but the normal
directory was used to enumerate them? In other words, the htree speeds
up opening a single file, but slows down traversing the entire
directory, so should not be used there.

Also, isn't htree only enabled for large directories? I still see crummy
correlation for small (< 100 files, even one with only 8 entries)
directories.

It seems unreasonable to ask applications to read all directory entries,
then sort them by inode number to achieve reasonable performance. This
seems like something the fs should be doing.


2011-03-14 21:12:54

by Eric Sandeen

Subject: Re: Large directories and poor order correlation

On 3/14/11 3:52 PM, Phillip Susi wrote:
> On 3/14/2011 4:37 PM, Eric Sandeen wrote:
>> On 3/14/11 3:24 PM, Phillip Susi wrote:
>>> Shouldn't copying or extracting or otherwise populating a large
>>> directory of many small files at the same time result in a strong
>>> correlation between the order the names appear in the directory and the
>>> order their data blocks are stored on disk, so that read performance
>>> is not negatively impacted by fragmentation?
>>
>> No, because htree (dir_index) dirs return names in hash-value
>> order, not inode number order, i.e. "at random."
>
> I thought that the htree was used to look up names, but the normal
> directory was used to enumerate them? In other words, the htree speeds
> up opening a single file, but slows down traversing the entire
> directory, so should not be used there.

readdir uses htree / dir_index:

ext3_readdir()
        if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
                                    EXT3_FEATURE_COMPAT_DIR_INDEX) &&
            ((EXT3_I(inode)->i_flags & EXT3_INDEX_FL) ||
             ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
                err = ext3_dx_readdir(filp, dirent, filldir);

Because dir_index places entries into blocks in hash order, reading
it "like a non-dir_index" dir still gives you out of order entries,
I think. IOW it doesn't slow down readdir, it just gives you a nasty
order - slowing down access to those files.

> Also, isn't htree only enabled for large directories? I still see crummy
> correlation for small (< 100 files, even one with only 8 entries)
> directories.

Nope, it's used for all directories AFAIK. Certainly shows the most
improvement on lookups in large directories though...

> It seems unreasonable to ask applications to read all directory entries,
> then sort them by inode number to achieve reasonable performance. This
> seems like something the fs should be doing.

Yeah, this has been a longstanding nastiness...

-Eric

2011-03-14 21:52:54

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation

On Mon, Mar 14, 2011 at 04:52:21PM -0400, Phillip Susi wrote:
> It seems unreasonable to ask applications to read all directory entries,
> then sort them by inode number to achieve reasonable performance. This
> seems like something the fs should be doing.

Unfortunately the kernel can't do it, because a directory could be
arbitrarily big, and kernel memory is non-swappable. In addition,
what if a process opens a directory, starts calling readdir, pauses in
the middle, and then holds onto it for days, weeks, or months?

Combine that with POSIX requirements about how readdir() has to behave
if files are added or deleted during a readdir() session (even a
readdir session which takes days, weeks, or months), and it's a
complete mess.

It's not hard to provide library routines that do the right thing, and
I have written an LD_PRELOAD which intercepts opendir() and readdir()
calls and does the sorting in userspace. Perhaps the right answer is
getting this into libc, but I have exactly two words for you: "Ulrich
Drepper".

- Ted

2011-03-14 23:43:58

by Phillip Susi

Subject: Re: Large directories and poor order correlation

On 03/14/2011 05:52 PM, Ted Ts'o wrote:
> Unfortunately the kernel can't do it, because a directory could be
> arbitrarily big, and kernel memory is non-swappable. In addition,

Buffers/cache is discardable though. Or does the entire htree have to
be kept in slab or something?

> what if a process opens a directory, starts calling readdir, pauses in
> the middle, and then holds onto it for days, weeks, or months?

The same thing that happened before htree?

> It's not hard to provide library routines that do the right thing, and
> I have written an LD_PRELOAD which intercepts opendir() and readdir()
> calls and does the sorting in userspace. Perhaps the right answer is
> getting this into libc, but I have exactly two words for you: "Ulrich
> Drepper".

Wouldn't it be better to just have readdir() use the main directory,
which presumably is in a saner ordering, instead of the htree? That
way you don't have to burn CPU and RAM sorting on every opendir().

Also, I have checked some smaller directories and lsattr reports they
are NOT using indexing, yet still display poor correlation.

2011-03-15 00:14:55

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation

On Mon, Mar 14, 2011 at 07:43:57PM -0400, Phillip Susi wrote:
> > what if a process opens a directory, starts calling readdir, pauses in
> > the middle, and then holds onto it for days, weeks, or months?
>
> The same thing that happened before htree?

No, because the readdir() semantics are exactly set up to mirror a
linear directory tree.

> > It's not hard to provide library routines that do the right thing, and
> > I have written an LD_PRELOAD which intercepts opendir() and readdir()
> > calls and does the sorting in userspace. Perhaps the right answer is
> > getting this into libc, but I have exactly two words for you: "Ulrich
> > Drepper".
>
> Wouldn't it be better to just have readdir() use the main directory,
> which presumably is in a saner ordering, instead of the htree? That
> way you don't have to burn CPU and RAM sorting on every opendir().

The htree is embedded into directory blocks (they appear to ext2 and
older kernels as deleted directory entries). The leaf blocks are in
the "main directory" as well; there is no separate htree.

The reason why we have to traverse the directory tree in htree order
is because of the POSIX requirements of how readdir() works in the face
of file deletes and creations, and what needs to happen if a leaf
block needs to be split. Even if the readdir() started three months
ago, if in the intervening time, leaf nodes have been split, readdir()
is not allowed to return the same file twice.

> Also, I have checked some smaller directories and lsattr reports they
> are NOT using indexing, yet still display poor correlation.

Well, if the file system has been around for a long time, and there
are lots of "holes" in the inode allocation bitmap, it can happen
even without indexing.

As another example, if you have a large maildir directory w/o
indexing, and files get removed, deleted, etc., over time the order of
the directory entries will have very little to do with the inode
number. That's why programs like mutt sort the directory entries by
inode number.

- Ted

2011-03-15 08:06:58

by Florian Weimer

Subject: Re: Large directories and poor order correlation

* Eric Sandeen:

> No, because htree (dir_index) dirs return names in hash-value
> order, not inode number order, i.e. "at random."
>
> As you say, sorting by inode number will work much better...

The dpkg folks tested this and it turns out that you get better
results if you open the file and use FIBMAP to get the first block
number, and sort by that. You could sort by inode number before the
open/fstat calls, but it does not seem to help much.
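
For reference, a small sketch of that FIBMAP query (FIBMAP takes and
returns block numbers in units of the filesystem block size, and on
Linux it needs CAP_SYS_RAWIO, i.e. root):

#include <fcntl.h>
#include <linux/fs.h>                   /* FIBMAP */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* physical block number of logical block 0, or -1 on error */
static long first_block(const char *path)
{
        int fd = open(path, O_RDONLY);
        int blk = 0;            /* in: logical block; out: physical block */
        long ret = -1;

        if (fd < 0)
                return -1;
        if (ioctl(fd, FIBMAP, &blk) == 0)
                ret = blk;
        close(fd);
        return ret;
}

int main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++)
                printf("%10ld  %s\n", first_block(argv[i]), argv[i]);
        return 0;
}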

--
Florian Weimer <[email protected]>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

2011-03-15 11:06:36

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation


On Mar 15, 2011, at 3:59 AM, Florian Weimer wrote:

> * Eric Sandeen:
>
>> No, because htree (dir_index) dirs return names in hash-value
>> order, not inode number order, i.e. "at random."
>>
>> As you say, sorting by inode number will work much better...
>
> The dpkg folks tested this and it turns out that you get better
> results if you open the file and use FIBMAP to get the first block
> number, and sort by that. You could sort by inode number before the
> open/fstat calls, but it does not seem to help much.

It depends on which problem you are trying to solve. If this is a cold
cache situation, and the inode cache is empty, then sorting by inode
number will help since otherwise you'll be seeking all over just to
read in the inode structures. This is true for any kind of readdir+stat
combination, whether it's ls -l, du, or readdir + FIBMAP (I'd
recommend using FIEMAP these days, though).

However, if you need to suck in the information for a large number of
small files (such as all of the files in /var/lib/dpkg/info), then sure, sorting
on the block number can help reduce seeks on the data block side of
things.

So in an absolute cold cache situation, what I'd recommend is readdir,
sort by inode, FIEMAP, sort by block, and then read in the dpkg files.
Of course an RPM partisan might say, "it would help if you guys had
used a real database instead of ab(using) the file system." And then
the dpkg guys could complain about what happens when RPM has to
deal with a corrupted rpm database, and how this allows dpkg to use
shell scripts to access their package information. Life is full of tradeoffs.
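
A sketch of the FIEMAP step of that recipe (only the first-extent query
is shown; the readdir/sort-by-inode pass and the final sort by the
returned offset are ordinary qsort()s, and error handling is elided):

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>                   /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* physical byte offset of the file's first extent, or 0 on failure */
static unsigned long long first_extent(const char *path)
{
        struct fiemap *fm;
        unsigned long long phys = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return 0;
        /* room for the header plus a single extent */
        fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
        fm->fm_extent_count = 1;                /* only want extent 0 */
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents)
                phys = fm->fm_extents[0].fe_physical;
        free(fm);
        close(fd);
        return phys;
}

int main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++)
                printf("%12llu  %s\n", first_extent(argv[i]), argv[i]);
        return 0;
}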

-- Ted



2011-03-15 11:23:18

by Ric Wheeler

Subject: Re: Large directories and poor order correlation

On 03/15/2011 07:06 AM, Theodore Tso wrote:
> On Mar 15, 2011, at 3:59 AM, Florian Weimer wrote:
>
>> * Eric Sandeen:
>>
>>> No, because htree (dir_index) dirs return names in hash-value
>>> order, not inode number order, i.e. "at random."
>>>
>>> As you say, sorting by inode number will work much better...
>> The dpkg folks tested this and it turns out that you get better
>> results if you open the file and use FIBMAP to get the first block
>> number, and sort by that. You could sort by inode number before the
>> open/fstat calls, but it does not seem to help much.
> It depends on which problem you are trying to solve. If this is a cold
> cache situation, and the inode cache is empty, then sorting by inode
> number will help since otherwise you'll be seeking all over just to
> read in the inode structures. This is true for any kind of readdir+stat
> combination, whether it's ls -l, du, or readdir + FIBMAP (I'd
> recommend using FIEMAP these days, though).
>
> However, if you need to suck in the information for a large number of
> small files (such as all of the files in /var/lib/dpkg/info), then sure, sorting
> on the block number can help reduce seeks on the data block side of
> things.
>
> So in an absolute cold cache situation, what I'd recommend is readdir,
> sort by inode, FIEMAP, sort by block, and then read in the dpkg files.
> Of course an RPM partisan might say, "it would help if you guys had
> used a real database instead of ab(using) the file system." And then
> the dpkg guys could complain about what happens when RPM has to
> deal with a corrupted rpm database, and how this allows dpkg to use
> shell scripts to access their package information. Life is full of tradeoffs.
>
> -- Ted
>

I have tested both sorting techniques with very large directories.

Most of the gain came with the simple sorting by inode number, but of course
this relies on the file system allocation policy having a correlation between
the inode numbers and layout (i.e., higher inode numbers correspond to higher
block numbers).

Note that you can get the inode number used in this sorting without doing any
stat calls.

Sorting by first block number also works well, but does have that extra syscall
(probably two - open & fibmap?) per file.

Ric


2011-03-15 11:38:35

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation


On Mar 15, 2011, at 7:23 AM, Ric Wheeler wrote:
>
> I have tested both sorting techniques with very large directories.
>
> Most of the gain came with the simple sorting by inode number, but of course this relies on the file system allocation policy having a correlation between the inode numbers and layout (i.e., higher inode numbers correspond to higher block numbers).

That's not just a matter of file system allocation policy, but of how fragmented the free space is in terms
of inode numbers and block numbers. (Especially in the block group where /var/lib/dpkg/info
lives on Debian systems. That's because that directory is (ab)used as a database, with lots of files
added and deleted, so it's pretty much inevitable that any link between directory order,
inode numbers, and block numbers, would become weaker and weaker over time. Fortunately the
overall database tends to fit into the dentry, inode, and page caches after the first dpkg operation...)

> Note that you can get the inode number used in this sorting without doing any stat calls.

Sure. And in the worst case, you would sort based on the inode number so that when you call
open/FI[BE]MAP, the disk isn't seeking all over the place when you read in the inode structure...
Then you need to sort by block numbers, and assuming that you have enough memory that
all of /var/lib/dpkg/info/* fits in the inode cache, you will minimize seeking while you read the
blocks into memory.

-- Ted


2011-03-15 13:33:29

by Rogier Wolff

Subject: Re: Large directories and poor order correlation

On Tue, Mar 15, 2011 at 07:06:34AM -0400, Theodore Tso wrote:
> So in an absolute cold cache situation, what I'd recommend is
> readdir, sort by inode, FIEMAP, sort by block, and then read in the
> dpkg files. Of course an RPM partisan might say, "it would help if
> you guys had used a real database instead of ab(using) the file
> system." And then the dpkg guys could complain about what happens
> when RPM has to deal with a corrupted rpm database, and how this
> allows dpkg to use shell scripts to access their package
> information. Life is full of tradeoffs.

IMHO, the most important part is "up to and including the stat". It
should be possible to get the directory and inode info all inside the
same "16MB" part of the disk. This would result in (after a few seeks)
the rest of the accesses coming from the disk's cache.

This would mean that you should allocate directory blocks from the end
of the PREVIOUS block group....

Roger.

--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-03-15 14:01:29

by Phillip Susi

Subject: Re: Large directories and poor order correlation

On 3/14/2011 8:14 PM, Ted Ts'o wrote:
> The reason why we have to traverse the directory tree in htree order
> is because of the POSIX requirements of how readdir() works in the face
> of file deletes and creations, and what needs to happen if a leaf
> block needs to be split. Even if the readdir() started three months
> ago, if in the intervening time, leaf nodes have been split, readdir()
> is not allowed to return the same file twice.

This would also be fixed by having readdir() traverse the linear
directory entries rather than the htree.

> Well, if the file system has been around for a long time, and there
> are lots of "holes" in the inode allocation bitmap, it can happen
> even without indexing.

Why is that? Sure, if the inode table is full of small holes I can see
them not being allocated sequentially, but why don't they tend to at
least be allocated in ascending order?

> As another example, if you have a large maildir directory w/o
> indexing, and files get removed, deleted, etc., over time the order of
> the directory entries will have very little to do with the inode
> number. That's why programs like mutt sort the directory entries by
> inode number.

Is this what e2fsck -D fixes? Does it rewrite the directory entries in
inode order? I've been toying with the idea of adding directory
optimization support to e2defrag.

To try and clarify this point a bit, are you saying that applications
like tar and rsync should be patched to sort the directory by inode
number, rather than it being the job of the fs to return entries in a
good order?


2011-03-15 14:33:04

by Rogier Wolff

Subject: Re: Large directories and poor order correlation

On Tue, Mar 15, 2011 at 10:01:24AM -0400, Phillip Susi wrote:
> To try and clarify this point a bit, are you saying that applications
> like tar and rsync should be patched to sort the directory by inode
> number, rather than it being the job of the fs to return entries in a
> good order?

IMHO, it is the job of the FS to perform well in common cases.

Roger.


--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
** Delftechpark 26 2628 XH Delft, The Netherlands. KVK: 27239233 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement.
Does it sit on the couch all day? Is it unemployed? Please be specific!
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ

2011-03-15 14:36:06

by Ric Wheeler

Subject: Re: Large directories and poor order correlation

On 03/15/2011 10:33 AM, Rogier Wolff wrote:
> On Tue, Mar 15, 2011 at 10:01:24AM -0400, Phillip Susi wrote:
>> To try and clarify this point a bit, are you saying that applications
>> like tar and rsync should be patched to sort the directory by inode
>> number, rather than it being the job of the fs to return entries in a
>> good order?
> IMHO, it is the job of the FS to perform well in common cases.
>
> Roger.
>

Some file systems do this well. The thread here is more focused on dealing with
issues for file systems that handle this poorly :)

Ric


2011-03-15 17:08:33

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation

On Tue, Mar 15, 2011 at 10:01:24AM -0400, Phillip Susi wrote:
> On 3/14/2011 8:14 PM, Ted Ts'o wrote:
> > The reason why we have to traverse the directory tree in htree order
> > is because of the POSIX requirements of how readdir() works in the face
> > of file deletes and creations, and what needs to happen if a leaf
> > block needs to be split. Even if the readdir() started three months
> > ago, if in the intervening time, leaf nodes have been split, readdir()
> > is not allowed to return the same file twice.
>
> This would also be fixed by having readdir() traverse the linear
> directory entries rather than the htree.

No, because the directory blocks are the leaf nodes, and in the case
of a node split, we need to copy half of the directory entries in one
block and move them to a newly allocated block. If readdir() was
traversing the linear directory entries, and had already traversed
the directory block that needs to split, then you'll return those
directory entries that got copied into a new leaf (i.e., a new directory
block) a second time.

> > Well, if the file system has been around for a long time, and there
> > are lots of "holes" in the inode allocation bitmap, it can happen
> > even without indexing.
>
> Why is that? Sure, if the inode table is full of small holes I can see
> them not being allocated sequentially, but why don't they tend to at
> least be allocated in ascending order?

Unless some files get deleted in between. Then, depending on the
"holes" in the directory blocks, a new directory entry, even in the
non-htree case, could end up wherever an empty directory entry slot
could be found, or, in the worst case, we might need to allocate a new
block, with the new directory entry added to the end of that block.

I suggest that you try some experiments, using both dir_index and
non-dir_index file systems, and then looking at the directory using
the debugfs "ls" and "htree_dump" commands. Either believe me, or
learn how things really work. :-)

> > As another example, if you have a large maildir directory w/o
> > indexing, and files get removed, deleted, etc., over time the order of
> > the directory entries will have very little to do with the inode
> > number. That's why programs like mutt sort the directory entries by
> > inode number.
>
> Is this what e2fsck -D fixes? Does it rewrite the directory entries in
> inode order? I've been toying with the idea of adding directory
> optimization support to e2defrag.

Yes, e2fsck -D will optimize directory entries, as best it can. In
the non-dir_index case, it will sort the directory entries so they are
in inode order, and squeeze out "holes" in the directory blocks, thus
compacting the directory. In the dir_index case, it will optimize
the hash tree of the directory as much as possible.

> To try and clarify this point a bit, are you saying that applications
> like tar and rsync should be patched to sort the directory by inode
> number, rather than it being the job of the fs to return entries in a
> good order?

That's correct.

I wish the file system could do this for the applications, but there
are some corner cases involving very large directories, and programs
that use readdir() but then stall for days, months, and years, that
make this very hard.

I suppose we could allocate up to some tunable amount worth of
directory space, say 64k or 128k, and do the sorting inside the
kernel. We then have to live with the fact that each badly behaved
program which calls opendir(), and then a single readdir(), and then
stops, will consume 128k of non-swappable kernel memory until the
process gets killed. A process which does this thousands of times
could potentially carry out a resource exhaustion attack on the
system, which we could then try to patch over by, say, creating a new
resource limit on the number of directories a process can keep open at
a time, but then the attacker could just fork some additional child
processes....

If someone wants to try to solve this problem, patches that are clean
and which don't open up the kernel to a nasty DoS attack are welcome.

- Ted


2011-03-15 17:18:54

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation

On Tue, Mar 15, 2011 at 02:33:27PM +0100, Rogier Wolff wrote:
> IMHO, the most important part is "up to and including the stat". It
> should be possible to get the directory and inode info all inside the
> same "16MB" part of the disk. This would result in (after a few seeks)
> the rest of the accesses coming from the disk's cache.

It depends on your workload. In the case of dpkg, everything fits in
cache (usually) so after the first operation this is no longer a
concern. But all of the data blocks of /var/lib/dpkg/info/* together are huge,
since not using a real database means that a 4k block is consumed for
300 bytes of data, so fitting all of the data blocks in memory
generally doesn't work, which is why the dpkg folks are sorting by
block number.

> This would mean that you should allocate directory blocks from the end
> of the PREVIOUS block group....

We do something else, which is to group all directory blocks together
at the beginning of each flex_bg. This tends to reduce free space
fragmentation, and it helps to optimize for large files that are
bigger than a block group, and where you want to have contiguous
regions larger than a bg --- so breaking up the space every bg is not
a great idea. Again, with general purpose file systems you can't just
optimize for one thing, and life is full of tradeoffs.

- Ted

2011-03-15 19:09:03

by Phillip Susi

Subject: Re: Large directories and poor order correlation

On 3/15/2011 1:08 PM, Ted Ts'o wrote:
> No, because the directory blocks are the leaf nodes, and in the case
> of a node split, we need to copy half of the directory entries in one
> block and move them to a newly allocated block. If readdir() was
> traversing the linear directory entries, and had already traversed
> the directory block that needs to split, then you'll return those
> directory entries that got copied into a new leaf (i.e., a new directory
> block) a second time.

When you split the htree node, aren't you just moving around the
"deleted entries"? So the normal names remain in the same order so
readdir() doesn't have a problem when it is ignoring the htree entries
and just walking the normal names?

Also, how do you deal with this when you do end up rebalancing the
htree during a readdir()? I would think that keeping that straight
would be much more difficult than handling the problem with linear
directory entries.

Why was the htree hidden inside the normal directory structure anyway?

> Unless some files get deleted in between. Then, depending on the
> "holes" in the directory blocks, a new directory entry, even in the
> non-htree case, could end up wherever an empty directory entry slot
> could be found, or, in the worst case, we might need to allocate a new
> block, with the new directory entry added to the end of that block.

Right, but on an otherwise idle system, when you make all the files at
once via rsync or untarring an archive, this shouldn't happen and they
should be (generally) in ascending order, shouldn't they?

> I suggest that you try some experiments, using both dir_index and
> non-dir_index file systems, and then looking at the directory using
> the debugfs "ls" and "htree_dump" commands. Either believe me, or
> learn how things really work. :-)

Now THAT sounds interesting. Is this documented somewhere?

Also, why can't chattr set/clear the 'I' flag? Is it just cumbersome
to do at runtime? So setting and clearing the flag with debugfs
followed by a fsck should do the trick? And when is it automatically
enabled?

> I suppose we could allocate up to some tunable amount worth of
> directory space, say 64k or 128k, and do the sorting inside the
> kernel. We then have to live with the fact that each badly behaved
> program which calls opendir(), and then a single readdir(), and then
> stops, will consume 128k of non-swappable kernel memory until the
> process gets killed. A process which does this thousands of times
> could potentially carry out a resource exhaustion attack on the
> system, which we could then try to patch over by, say, creating a new
> resource limit on the number of directories a process can keep open at
> a time, but then the attacker could just fork some additional child
> processes....

I think you are right in that if sorting is to be done at
opendir()/readdir() time, then it should be done in libc, not the
kernel, but it would be even better if the fs made some effort to store the
entries in a good order so no sorting is needed at all.


2011-03-16 01:50:28

by Theodore Ts'o

Subject: Re: Large directories and poor order correlation

On Tue, Mar 15, 2011 at 03:08:53PM -0400, Phillip Susi wrote:
>
> When you split the htree node, aren't you just moving around the
> "deleted entries"? So the normal names remain in the same order so
> readdir() doesn't have a problem when it is ignoring the htree entries
> and just walking the normal names?

No, because the leaf entries of the htree are the actual directory
entries in the directory block.

> Why was the htree hidden inside the normal directory structure anyway?

So that we had full backwards compatibility with ext2. I had planned
this feature in advance, and ext2 (and pre-htree ext3 implementations)
would clear the htree flag if they tried to modify an htree directory.
Since the leaf nodes (which are the ones that get split) are normal
directory blocks, and interior nodes of the tree are directory blocks
that have what appears to ext2 to be a deleted directory entry
covering the entire block, it's fully backwards compatible with
ext2/older ext3 systems.

> > Unless some files get deleted in between. Then, depending on the
> > "holes" in the directory blocks, a new directory entry, even in the
> > non-htree case, could end up wherever an empty directory entry slot
> > could be found, or, in the worst case, we might need to allocate a new
> > block, with the new directory entry added to the end of that block.
>
> Right, but on an otherwise idle system, when you make all the files at
> once via rsync or untarring an archive, this shouldn't happen and they
> should be (generally) in ascending order, shouldn't they?

If the directory is freshly created, yes. But over time, if you are
adding and deleting files, as in a Maildir directory, this will not
be the case forever.

> > I suggest that you try some experiments, using both dir_index and
> > non-dir_index file systems, and then looking at the directory using
> > the debugfs "ls" and "htree_dump" commands. Either believe me, or
> > learn how things really work. :-)
>
> Now THAT sounds interesting. Is this documented somewhere?

The debugfs man page has some documentation.

> Also, why can't chattr set/clear the 'I' flag? Is it just cumbersome
> to do at runtime? So setting and clearing the flag with debugfs
> followed by a fsck should do the trick? And when is it automatically
> enabled?

You can't clear the 'I' flag because I didn't want to bother getting
the locking right so that it would be safe. And turning off the 'I'
flag isn't going to help, since the directory entries are scrambled
anyway --- again, because the leaf nodes of the htree *are* normal
directory blocks. Turning the 'I' flag on would require reindexing
the whole directory, which would be a major headache. You'd have to
lock out all directory accesses, and then do a full copy operation ---
and remember, a directory could be potentially many megabytes, and
kernel memory is non-swappable.

> I think you are right in that if sorting is to be done at
> opendir()/readdir() time, then it should be done in libc, not the
> kernel, but it would be even better if the fs made some effort to store the
> entries in a good order so no sorting is needed at all.

The problem is that we have to store them in hash tree order so the
lookups happen correctly. We could have done what JFS does, which is
to keep three separate b-trees for its directories, where every
single time you add or modify a file, you have to update multiple
b-trees. But (a) this would have required a non-backwards-compatible
file system format change from ext2/older ext3 file systems, and (b)
the extra b-tree updates are their own source of overhead.

- Ted