2012-06-20 00:48:56

by Norbert Preining

Subject: Ext4 slow on links

Dear all

(please Cc)

I recently had to track down a big delay in one of my Debian packages,
and it turned out that it seems to be due to ext4 being *horribly*
slow at dealing with symlinks.

On my system, if I create a directory with 8000 symlinks (this is
a real case of a font package shipping specially encoded files) and
the symlink targets are "far away" (long names), then, after
a reboot, a simple
ls -l
in this directory takes 1m20s, while on the second run it is down to 2s
(nice caching).

I read in the ext4 design document that if the symlink target is
less than 66 (?) chars long, then it is saved right in the inode,
otherwise some other action has to be taken.

Now my questions are:
- is this to be expected and not to be avoided?
- do you have a way around it?
- do other file systems, esp ext2/ext3 behave differently in this respect?

Finally the specs: kernel 3.5.0-rc3 (but was the same with 3.4.0 and
before), mount options rw,noatime,errors=remount-ro,user_xattr

tune2fs -l output:
tune2fs 1.42.4 (12-Jun-2012)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: 961635f4-762d-4136-a3d5-35fca8e4f3d8
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: journal_data_writeback
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 46333952
Block count: 185335808
Reserved block count: 9266789
Free blocks: 104044481
Free inodes: 41749891
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 979
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Sun Nov 15 15:09:13 2009
Last mount time: Tue Jun 19 15:15:48 2012
Last write time: Tue May 29 07:17:52 2012
Mount count: 34
Maximum mount count: 50
Last checked: Tue May 29 07:17:52 2012
Check interval: 15552000 (6 months)
Next check after: Sun Nov 25 07:17:52 2012
Lifetime writes: 2151 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 13246498
Default directory hash: half_md4
Directory Hash Seed: 87ea85d5-2287-4211-a920-f793468c22c1
Journal backup: inode blocks


Anything else I can provide?

Best wishes

Norbert
------------------------------------------------------------------------
Norbert Preining preining@{jaist.ac.jp, logic.at, debian.org}
JAIST, Japan TeX Live & Debian Developer
DSA: 0x09C5B094 fp: 14DF 2E6C 0307 BE6D AD76 A9C0 D2BF 4AA3 09C5 B094
------------------------------------------------------------------------
BALDOCK
The sharp prong on the top of a tree stump where the tree has snapped
off before being completely sawn through.
--- Douglas Adams, The Meaning of Liff


2012-06-20 02:19:17

by Theodore Ts'o

Subject: Re: Ext4 slow on links

On Wed, Jun 20, 2012 at 09:20:14AM +0900, Norbert Preining wrote:
> Dear all
>
> (please Cc)
>
> I recently had to track down a big delay in one of my Debian packages,
> and it turned out that it seems to be due to ext4 being *horribly*
> slow at dealing with symlinks.
>
> On my system, if I create a directory with 8000 symlinks (this is
> a real case of a font package shipping specially encoded files) and
> the symlink targets are "far away" (long names), then, after
> a reboot, a simple
> ls -l
> in this directory takes 1m20s, while on the second run it is down to 2s
> (nice caching).
>
> I read in the ext4 design document that if the symlink target is
> less than 66 (?) chars long, then it is saved right in the inode,
> otherwise some other action has to be taken.

The inode has room for 60 characters; after that, the symlink target
gets stored in an external block. The seek to read in the symlink
target could be one of the causes of the delay. The other is
potentially reading in the inode that the symlink points to.
Both of these will take disk time in a cold cache situation.
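
To see how many links in a directory fall on each side of that limit, something like the following could be used. It is a minimal sketch, assuming the 60-byte figure given above; the function name and the `INLINE_LIMIT` constant are made up for illustration, and whether a target of exactly 60 bytes still fits in the inode is not settled here.

```python
import os

# Assumption: 60 bytes of in-inode space, as stated in the message above.
# The exact boundary case (a target of exactly 60 bytes) is not verified.
INLINE_LIMIT = 60


def split_by_target_length(directory, inline_limit=INLINE_LIMIT):
    """Return (short, long) lists of symlink names found in `directory`.

    `short` holds links whose target is below `inline_limit` bytes and would
    be expected to live in the inode; `long` holds the rest, which would be
    expected to need a separate block.
    """
    short_links, long_links = [], []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.is_symlink():
                continue  # regular files and directories are ignored
            target = os.readlink(entry.path)
            size = len(os.fsencode(target))
            if size < inline_limit:
                short_links.append(entry.name)
            else:
                long_links.append(entry.name)
    return sorted(short_links), sorted(long_links)
```

If most of the 8000 links end up in the second list, each of them costs an extra block read on a cold cache.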

> Now my questions are:
> - is this to be expected and not to be avoided?
> - do you have a way around it?
> - do other file systems, esp ext2/ext3 behave differently in this respect?

Nothing has changed between ext2/ext3 and ext4 here, so ext2/ext3
will behave exactly the same. There are changes in the block and
inode allocation algorithms which might make a minor difference, but
the same is potentially true of a very fragmented file system.

There is a relatively new feature, which is not yet merged into ext4
mainline, called the inline data patch set, which could potentially
allow you to store more than 60 characters in a symlink in large
inodes. This could potentially help, but as a feature it will be a
while before it's ready (it definitely won't make the upcoming Debian
stable freeze) --- and so most of your Debian users won't be able to
take advantage of it for quite a while.

Otherwise, there's not much we can do about this, unfortunately. The
cold cache case is always a hard one, and the simplest ways of
optimizing it would involve changing how the application is storing
its files. In general, trying to use a file system as a poor man's
database is a bad idea, and will only end in tears, and it sounds like
this is what you're running into in terms of very long file names to
symlinks in a font directory.

Regards,

- Ted

2012-06-20 03:15:10

by Eric Sandeen

Subject: Re: Ext4 slow on links

On 6/19/12 7:20 PM, Norbert Preining wrote:
> Dear all
>
> (please Cc)
>
> I recently had to track down a big delay in one of my Debian packages,
> and it turned out that it seems to be due to ext4 being *horribly*
> slow at dealing with symlinks.
>
> On my system, if I create a directory with 8000 symlinks (this is
> a real case of a font package shipping specially encoded files) and
> the symlink targets are "far away" (long names), then, after
> a reboot, a simple
> ls -l
> in this directory takes 1m20s, while on the second run it is down to 2s
> (nice caching).

As Ted said, the targets might be far-flung. If you do /bin/ls -l instead
of maybe an aliased ls which stats everything to make pretty colors,
is that faster?

-Eric

2012-06-20 03:38:37

by Norbert Preining

Subject: Re: Ext4 slow on links

Hi Ted, hi Eric,

thanks for the answers; here are some remarks.

On Tue, 19 Jun 2012, Ted Ts'o wrote:
> The inode has room for 60 characters; after that, the symlink target
> gets stored in an external block. The seek to read in the symlink
> target could be one of the causes of the delay. The other is

Ok.

> Nothing has changed between ext2/ext3 and ext4 here, so ext2/ext3
> will behave exactly the same. There are changes in the block and
> inode allocation algorithms which might make a minor difference, but
> the same is potentially true of a very fragmented file system.

Ok.

Thinking about that, even if I dereference the files, I am still a
bit surprised. For each link we have the following steps:
1- read the inode and determine if it is a link
2- check if the link target fits in the 60 chars
3- read the additional block for a long link target
4- read the target inode

I assume that items 1, 3, and 4 are the time-consuming ones and
take about the same time each.

Now what I don't understand is why doing a
time ls -l >/dev/null
on the directory with the original files takes 1.2s,
but reading the links with ls -l >/dev/null takes 1m13s, both
after reboot on cold cache.

I assume that some data is hashed in the directory inode, so doing
ls -l on the real files only reads the directory inode and not
each file individually, while reading all the links reads all the
individual files.

Is this the explanation? If not, I cannot imagine any way that reading
a list of links and dereferencing them plus reading the targets
takes 60 times as long.

On Tue, 19 Jun 2012, Eric Sandeen wrote:
> As Ted said, the targets might be far-flung. If you do /bin/ls -l instead
> of maybe an aliased ls which stats everything to make pretty colors,
> is that faster?

Might be the problem, but I saw the same with a program doing
opendir, readdir etc., so no alias or external program involved.

Best wishes

Norbert

2012-06-20 03:57:27

by Eric Sandeen

Subject: Re: Ext4 slow on links

On Jun 19, 2012, at 10:38 PM, Norbert Preining wrote:

> Hi Ted, hi Eric,
>
> thanks for the answers; here are some remarks.
>
...

> On Tue, 19 Jun 2012, Eric Sandeen wrote:
>> As Ted said, the targets might be far-flung. If you do /bin/ls -l instead
>> of maybe an aliased ls which stats everything to make pretty colors,
>> is that faster?
>
> Might be the problem, but I saw the same with a program doing
> opendir, readdir etc., so no alias or external program involved.
>
Of course ls -l must stat anyway. I shouldn't compose emails so late. :).

You might check whether the dir itself is badly fragmented (filefrag, or failing that stat in debugfs, would show you the block mapping), and a blktrace of the actions might show you something interesting as well.

Eric


2012-06-20 04:01:55

by Norbert Preining

Subject: Re: Ext4 slow on links

Hi Eric,

On Tue, 19 Jun 2012, Eric Sandeen wrote:
> You might see if the dir itself is badly fragmented

I don't think so, since I did:
for i in /usr/local/font-collection/......./* ; do ln -s $i . ; done

reboot

time ls -l >/dev/null

newly created entries shouldn't be fragmented, I guess.
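
For anyone who wants to reproduce this without the font collection, a comparable test directory could be built as follows. This is only a sketch: the function name, the `font00000.enc` style file names and the target prefix are hypothetical, and the targets do not have to exist, so dangling links exercise readlink() but not the lookup of the target inode.

```python
import os


def make_long_symlinks(link_dir, count, target_prefix):
    """Create `count` symlinks in `link_dir` pointing below `target_prefix`.

    The link names and target names are invented for the test; pick a
    `target_prefix` long enough that every target exceeds the in-inode limit.
    Returns the list of link names that were created.
    """
    os.makedirs(link_dir, exist_ok=True)
    created = []
    for i in range(count):
        name = f"font{i:05d}.enc"  # hypothetical file name pattern
        os.symlink(os.path.join(target_prefix, name),
                   os.path.join(link_dir, name))
        created.append(name)
    return created
```

After creating the links, the cache has to be cold again (for example after a reboot, as above) before timing `ls -l`.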

> (if not filefrag, stat in debugfs would show you block mapping) and
> maybe a blktrace of the actions would show you something interesting as well.

I will investigate, but it is the first time I have looked into that, so
if you have a link to a quick howto, great; otherwise I will read through
the man pages ;-)

Best wishes

Norbert

2012-06-20 05:18:49

by Norbert Preining

Subject: Re: Ext4 slow on links

Hi Eric,

On Tue, 19 Jun 2012, Eric Sandeen wrote:
> blktrace of the actions would show you something interesting as well.

I tried to understand the output, but could not get anything
meaningful out of it.

I rebooted into single user mode, started blktrace on sda, then ran
time ls -l /..../dir/with/links/ >/dev/null, and stopped the blktrace.

Then I ran blkparse and btt etc. to generate a variety of data.
Here is some output of the btt run:
==================== All Devices ====================

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000002654   0.009716548   6.594440648        8953
Q2G               0.000001047   0.000001825   0.000011594        8913
G2I               0.000000908   0.000001561   0.000234317        8913
Q2M               0.000000908   0.000001046   0.000001536          41
I2D               0.000005378   0.001040528   1.572900300        8913
M2D               0.000014527   0.038626776   1.572908333          41
D2C               0.000098337   0.009050273   0.053790446        8954
Q2C               0.000108394   0.010266282   1.577191424        8954

==================== Device Overhead ====================

       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (  8,  0) |   0.0177%   0.0151%   0.0000%  10.0890%  88.1553%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0177%   0.0151%   0.0000%  10.0890%  88.1553%

==================== Device Merge Information ====================

       DEV |       #Q       #D   Ratio |   BLKmin   BLKavg   BLKmax    Total
---------- | -------- -------- ------- | -------- -------- -------- --------
 (  8,  0) |     8954     8913     1.0 |        8        9     1024    86872

==================== Device Q2Q Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |            8954     193362127.3               0 | 0(538)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |            8954     193362127.3               0 | 0(538)

==================== Device D2D Seek Information ====================

       DEV |          NSEEKS            MEAN          MEDIAN | MODE
---------- | --------------- --------------- --------------- | ---------------
 (  8,  0) |            8913     194044831.4               0 | 0(497)
---------- | --------------- --------------- --------------- | ---------------
   Overall |          NSEEKS            MEAN          MEDIAN | MODE
   Average |            8913     194044831.4               0 | 0(497)

==================== Plug Information ====================

       DEV |    # Plugs # Timer Us  | % Time Q Plugged
---------- | ---------- ----------  | ----------------
 (  8,  0) |         81(         0) |   0.002663579%

       DEV |    IOs/Unp   IOs/Unp(to)
---------- | ----------   ----------
 (  8,  0) |        1.2          0.0
---------- | ----------   ----------
   Overall |    IOs/Unp   IOs/Unp(to)
   Average |        1.2          0.0

==================== Active Requests At Q Information ====================

       DEV |  Avg Reqs @ Q
---------- | -------------
 (  8,  0) |           0.1

.....


I don't know if that shows anything of interest, but if you need more,
and want to waste a bit of time looking at the data, I have uploaded
everything created into
http://www.logic.at/people/preining/BlkParse.tar.gz
size 5090853
md5sum 46db8455a04dcc4a602e34d21eecc6bd

In any case, thanks for your patience and support

Norbert


2012-06-20 14:07:33

by Eric Sandeen

Subject: Re: Ext4 slow on links

On 6/20/12 12:18 AM, Norbert Preining wrote:
> Hi Eric,
>
> On Tue, 19 Jun 2012, Eric Sandeen wrote:
>> blktrace of the actions would show you something interesting as well.
>
> I tried to understand the output, but could not get anything
> meaningful out of it.
>
> I rebooted into single user mode, started blktrace on sda, then ran
> time ls -l /..../dir/with/links/ >/dev/null, and stopped the blktrace.
>
> Then I ran blkparse and btt etc. to generate a variety of data.

...

>
> I don't know if that shows anything of interest, but if you need more,
> and want to waste a bit of time looking at the data, I have uploaded
> everything created into
> http://www.logic.at/people/preining/BlkParse.tar.gz

Here are the overall stats:

Total (sda):
 Reads Queued:       8,864,   35,456KiB   Writes Queued:          90,    7,980KiB
 Read Dispatches:    8,864,   35,456KiB   Write Dispatches:       49,    7,980KiB
 Reads Requeued:         0                Writes Requeued:         0
 Reads Completed:    8,864,   35,456KiB   Writes Completed:       59,    7,980KiB
 Read Merges:            0,        0KiB   Write Merges:           41,      164KiB
 IO unplugs:            81                Timer unplugs:           0

so almost all reads, and no read merges; almost 35 megabytes read and every
one was a small 4k IO.

It's doing about 120 seeks/second. I'm a little surprised that there was no read
merging...
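
These figures can be cross-checked with a little arithmetic. The following is a minimal sketch using the numbers from the blkparse totals above and from the btt summary posted earlier; the 73 second elapsed time is an assumption taken from the earlier 1m13s measurement, since the run time of the traced command itself was not stated.

```python
# Figures copied from the blkparse totals and the btt summary in this thread.
READS = 8864
READ_KIB = 35456
COMPLETED = 8954          # N for D2C in the btt summary
AVG_D2C_S = 0.009050273   # average dispatch-to-completion time, seconds

# Assumption: the traced run took about as long as the earlier 1m13s run.
ASSUMED_ELAPSED_S = 73.0


def average_read_kib(total_kib, reads):
    """Average size of one read request in KiB."""
    return total_kib / reads


def total_device_time(requests, avg_seconds):
    """Time the device spent on requests if they were serviced one at a time."""
    return requests * avg_seconds


def requests_per_second(requests, elapsed_seconds):
    """Request rate over the assumed wall-clock time."""
    return requests / elapsed_seconds


if __name__ == "__main__":
    print(average_read_kib(READ_KIB, READS))
    print(round(total_device_time(COMPLETED, AVG_D2C_S), 1))
    print(round(requests_per_second(COMPLETED, ASSUMED_ELAPSED_S), 1))
```

Every read comes out at exactly 4 KiB, the summed service time is roughly 80 seconds, and the rate under the assumed run time is a little over 120 requests per second, so almost the whole run is spent waiting for individual small reads.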

Let me think about this. :)

-Eric

2012-06-20 19:35:54

by Eric Sandeen

Subject: Re: Ext4 slow on links

On 6/19/12 10:57 PM, Eric Sandeen wrote:
> On Jun 19, 2012, at 10:38 PM, Norbert Preining wrote:
>
>> Hi Ted, hi Eric,
>>
>> thanks for the answers; here are some remarks.
>>
> ...
>
>> On Tue, 19 Jun 2012, Eric Sandeen wrote:
>>> As Ted said, the targets might be far-flung. If you do /bin/ls -l instead
>>> of maybe an aliased ls which stats everything to make pretty colors,
>>> is that faster?
>>
>> Might be the problem, but I saw the same with a program doing
>> opendir, readdir etc., so no alias or external program involved.
>>
> Of course ls -l must stat anyway. I shouldn't compose emails so late. :).

Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
order, it's roughly random w.r.t. disk location. Newer utils will sort into
inode order, I think(?) Might be interesting to strace the ls -l and see
if it's doing it in inode order, or not.
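
The idea of visiting entries in inode order can be sketched in a few lines. This is only an illustration of the technique, not what ls or any other utility actually does; the function name is made up. It reads the whole directory first, sorts on the inode number that readdir already reports, and only then touches the individual inodes.

```python
import os
import stat


def lstat_in_inode_order(directory):
    """lstat every entry of `directory`, visiting entries by inode number.

    The sort key is the inode number delivered with each directory entry, so
    the sort itself needs no extra I/O. Returns a list of
    (name, stat_result, link_target_or_None) tuples in visiting order.
    """
    with os.scandir(directory) as it:
        entries = sorted(it, key=lambda e: e.inode())
    results = []
    for entry in entries:
        st = os.lstat(entry.path)
        target = os.readlink(entry.path) if stat.S_ISLNK(st.st_mode) else None
        results.append((entry.name, st, target))
    return results
```

The cost is that the whole directory listing has to be held in memory before the first stat is issued, which is usually acceptable for a few thousand entries.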

-Eric


2012-06-21 02:28:23

by Norbert Preining

Subject: Re: Ext4 slow on links

Hi Eric,

thanks a lot for looking into that.

On Wed, 20 Jun 2012, Eric Sandeen wrote:
> so almost all reads, and no read merges; almost 35 megabytes read and every
> one was a small 4k IO.

Ouch, that hurts.

On Wed, 20 Jun 2012, Eric Sandeen wrote:
> Would you be willing to provide an "e2image -r" image of the filesystem?

OK, it has been running for a few hours now and I guess it is far from
finished, since there are 350+G on the fs, and the compressed image
is 200M by now.

Is it fine to do it on a running system, or do I have to boot
from USB or so?

If it is not too big I will try to upload it to some place where
you can get access to it.

On Wed, 20 Jun 2012, Eric Sandeen wrote:
> Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
> order, it's roughly random w.r.t. disk location. Newer utils will sort into
> inode order, I think(?) Might be interesting to strace the ls -l and see
> if it's doing it in inode order, or not.

Ok, is there a special option to strace, or -trace=all?

Best wishes

Norbert

2012-06-21 04:06:11

by Eric Sandeen

Subject: Re: Ext4 slow on links

On 6/20/12 9:28 PM, Norbert Preining wrote:
> Hi Eric,
>
> thanks a lot for looking into that.
>
> On Wed, 20 Jun 2012, Eric Sandeen wrote:
>> so almost all reads, and no read merges; almost 35 megabytes read and every
>> one was a small 4k IO.
>
> Ouch, that hurts.
>
> On Wed, 20 Jun 2012, Eric Sandeen wrote:
>> Would you be willing to provide an "e2image -r" image of the filesystem?
>
> OK, it has been running for a few hours now and I guess it is far from
> finished, since there are 350+G on the fs, and the compressed image
> is 200M by now.
>
> Is it fine to do it on a running system, or do I have to boot
> from USB or so?

Well, don't bother, sorry. See below. Zach had it right.

> If it is not too big I will try to upload it to some place where
> you can get access to it.
>
> On Wed, 20 Jun 2012, Eric Sandeen wrote:
>> Oh, but Zach Brown reminds me that if we stat the entries in getdents/hash
>> order, it's roughly random w.r.t. disk location. Newer utils will sort into
>> inode order, I think(?) Might be interesting to strace the ls -l and see
>> if it's doing it in inode order, or not.
>
> Ok, is there a special option to strace, or -trace=all?

if you do

# strace -v -o outfile ls -l

you'll see things like:

getdents(3, {{d_ino=249052, d_off=186216735, d_reclen=32, d_name="file3"} {d_ino=245882, d_off=473549160, d_reclen=24, d_name="."} {d_ino=249051, d_off=516459536, d_reclen=32, d_name="file2"} {d_ino=249055, d_off=545762253, d_reclen=32, d_name="file6"} {d_ino=249049, d_off=550416647, d_reclen=32, d_name="file1"} ...

and from there see that the entries returned are not in inode order (and therefore not in disk order).
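
Checking this by eye gets tedious for thousands of entries. A small sketch that pulls the d_ino values out of the strace output and tests whether they are ascending; it assumes the `d_ino=...` format shown above, and the function names are made up.

```python
import re

# Matches the d_ino fields of getdents entries as printed by strace -v.
D_INO = re.compile(r"d_ino=(\d+)")


def inode_sequence(strace_text):
    """Return the d_ino values in the order they appear in the text."""
    return [int(value) for value in D_INO.findall(strace_text)]


def is_inode_ordered(inodes):
    """True if the sequence never decreases."""
    return all(a <= b for a, b in zip(inodes, inodes[1:]))


if __name__ == "__main__":
    # Hypothetical file name; use whatever was passed to strace -o.
    with open("outfile") as fh:
        inodes = inode_sequence(fh.read())
    print(len(inodes), "entries, inode ordered:", is_inode_ordered(inodes))
```

For the getdents line above the sequence starts 249052, 245882, 249051, so the check reports that the entries are not in inode order.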

and lstats after that, also out of order:

# grep lstat outfile
lstat("file3", {st_dev=makedev(8, 8), st_ino=249052, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file2", {st_dev=makedev(8, 8), st_ino=249051, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file6", {st_dev=makedev(8, 8), st_ino=249055, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
lstat("file1", {st_dev=makedev(8, 8), st_ino=249049, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=13, st_atime=2012/06/20-22:13:08, st_mtime=2012/06/20-22:13:07, st_ctime=2012/06/20-22:13:07}) = 0
...

later on you'll see readlinks:

# grep readlink outfile
readlink("file3", "../dir2/file3", 14) = 13
readlink("file2", "../dir2/file2", 14) = 13
readlink("file6", "../dir2/file6", 14) = 13
readlink("file1", "../dir2/file1", 14) = 13
...

etc.

Hm. Upstream coreutils fixed this for rm and some other ops:

http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a

# grep unlink /tmp/rm-strace
unlink("file1") = 0
unlink("file10") = 0
unlink("file2") = 0
unlink("file3") = 0
unlink("file4") = 0
unlink("file5") = 0
unlink("file6") = 0
unlink("file7") = 0
unlink("file8") = 0
unlink("file9") = 0

but maybe not for ls -l

You could see if you could get this LD_PRELOAD working:

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c

build & enable with:

gcc -o spd_readdir.so -fPIC -shared spd_readdir.c -ldl
export LD_PRELOAD=`pwd`/spd_readdir.so

and see if that addresses the problem;

here, it does for me:

# grep readlink outfile2
readlink("file1", "../dir2/file1"..., 14) = 13
readlink("file10", "../dir2/file10"..., 15) = 14
readlink("file2", "../dir2/file2"..., 14) = 13
readlink("file3", "../dir2/file3"..., 14) = 13
readlink("file4", "../dir2/file4"..., 14) = 13
readlink("file5", "../dir2/file5"..., 14) = 13

I'm guessing that operating in inode order should help
you a bit, at least. I tested on a dir w/ 10,000 long symlinks
with and without the sorting, and you can see the difference pretty
clearly.

sorted took 2.6s, unsorted took 52s.
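
That is roughly a factor of 20. A comparison of this kind could be scripted along the following lines; this is a sketch with a made-up function name, not the test that produced the numbers above. Each measurement only means something on a cold cache, so the two orders have to be timed in separate runs with the cache emptied in between (for example after a reboot).

```python
import os
import stat
import time


def time_walk(directory, by_inode):
    """Time lstat() plus readlink() over all entries of `directory`.

    With by_inode=False the entries are visited in the order the directory
    returns them; with by_inode=True they are sorted by inode number first.
    Only the per-entry work is timed, not the directory read or the sort.
    """
    with os.scandir(directory) as it:
        entries = list(it)
    if by_inode:
        entries.sort(key=lambda e: e.inode())
    start = time.perf_counter()
    for entry in entries:
        st = os.lstat(entry.path)
        if stat.S_ISLNK(st.st_mode):
            os.readlink(entry.path)
    return time.perf_counter() - start
```

Calling it twice in the same process would mostly measure the page cache, since the first pass warms everything the second pass needs.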

And you can see why:

http://people.redhat.com/esandeen/sorted_unsorted.png

meanwhile I can ask Jim about coreutils & ls -l.

-Eric


2012-06-21 04:50:09

by Norbert Preining

Subject: Re: Ext4 slow on links

Hi Eric,

wow, thanks again.

On Wed, 20 Jun 2012, Eric Sandeen wrote:
> Hm. Upstream coreutils fixed this for rm and some other ops:

Ok, I see.

> sorted took 2.6s, unsorted took 52s.

Got the idea, and tried it now myself, not with ls etc. but
with the program that generates the chaos, and yes, stracing it
gives the same result: getdents and the subsequent stats are all
*not* in inode order.

So that means it should be fixed in glibc? Right? Ouuchhh...

That means that this behaviour applies to *each* program using getdents
etc ...

Do you have any suggestions? Is there a way to force readdir (I guess
most people use readdir instead of getdents directly) to iterate
in inode order?



Best wishes

Norbert

2012-06-21 05:18:55

by Andreas Dilger

Subject: Re: Ext4 slow on links

On 2012-06-20, at 10:50 PM, Norbert Preining wrote:
> On Wed, 20 Jun 2012, Eric Sandeen wrote:
>> Hm. Upstream coreutils fixed this for rm and some other ops:
>
> Ok, I see.
>
>> sorted took 2.6s, unsorted took 52s.
>
> Got the idea, and tried it now myself, not with ls etc. but
> with the program that generates the chaos, and yes, stracing it
> gives the same result: getdents and the subsequent stats are all
> *not* in inode order.
>
> So that means it should be fixed in glibc? Right? Ouuchhh...
>
> That means that this behaviour applies to *each* program using getdents
> etc ...
>
> Do you have any suggestions? Is there a way to force readdir (I guess
> most people use readdir instead of getdents directly) to iterate
> in inode order?

That's what the LD_PRELOAD library that Eric referenced does - you can
load it for any application, and it will sort the dirents in inode order.

It would definitely be better to do this in glibc, though we've also
been discussing on occasion doing this inside ext4 for small directories.

Cheers, Andreas






2012-06-21 06:55:41

by Norbert Preining

Subject: Re: Ext4 slow on links

On Wed, 20 Jun 2012, Andreas Dilger wrote:
> That's what the LD_PRELOAD library that Eric referenced does - you can
> load it for any application, and it will sort the dirents in inode order.

Yes, hmm, I tried it without success.
I did:
export LD_PRELOAD=/path/to/spd_readdir.so
strace -o ... /usr/bin/texlua /usr/bin/mtxrun --generate
(the bad command) and I still see stats and getdents out of
inode order.

> It would definitely be better to do this in glibc, though we've also
> been discussing on occasion doing this inside ext4 for small directories.

I have now found the thread
Large directories and poor order correlation
from March 2011 on ext4-devel, interesting read.

Anyway, as far as I can see I cannot do much but
fsck -D
the filesystem and see if it gets better, right?

Best wishes

Norbert

2012-06-22 09:53:19

by Bernd Schubert

Subject: Re: Ext4 slow on links

On 06/21/2012 06:05 AM, Eric Sandeen wrote:
> On 6/20/12 9:28 PM, Norbert Preining wrote:
>> Hi Eric,
>>
>
> You could see if you could get this LD_PRELOAD working:
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c
>

Hrmm, I need to look through that commit again, but at first glance I
cannot see code doing the sorting for ext3/ext4 only (e.g. by checking
the fsid). So while I like the general approach, it will have the
opposite effect for some file systems. I will report that back on the
coreutils list.

Thanks a lot for the pointer to the commit!


Cheers,
Bernd


2012-06-22 14:08:25

by Theodore Ts'o

Subject: Re: Ext4 slow on links

On Fri, Jun 22, 2012 at 11:53:13AM +0200, Bernd Schubert wrote:
> On 06/21/2012 06:05 AM, Eric Sandeen wrote:
> >On 6/20/12 9:28 PM, Norbert Preining wrote:
> >>Hi Eric,
> >>
> >
> >You could see if you could get this LD_PRELOAD working:
> >
> >http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob_plain;f=contrib/spd_readdir.c
> >
>
> Hrmm, I need to look through that commit again, but at first
> glance I cannot see code doing the sorting for ext3/ext4 only (e.g.
> by checking the fsid). So while I like the general approach, it will
> have the opposite effect for some file systems. I will report that
> back on the coreutils list.
>
> Thanks a lot for the pointer to the commit!

One warning about spd_readdir. It's not thread-safe, and I've noted
that some programs crash when they try using spd_readdir.so as a
pre-load. I've tried to fix some of the causes, and I think
thread-safety is the primary fix which is missing, but it's possible
that programs which really care about telldir()/seekdir() behaviour as
it relates to readdir() and when files are added to a directory may
also end up getting surprised.

I wrote it primarily as a demonstration of how sorting by inode number
is a big win. It is *not* suitable for use in /etc/ld.so.preload!

If people want to try to make it safer, patches are accepted, but
ultimately it's better to fix this in the application; that way you
will get your performance gains no matter what OS you happen to be
running on, whether it's Linux, Solaris, AIX, OS X, etc.

- Ted