2012-02-29 13:52:28

by Jacek Luczak

Subject: getdents - ext4 vs btrfs performance

Hi All,

/* Sorry for sending an incomplete email, hit the wrong button :) I guess I
can't use Gmail */

Long story short: we've found that operations on a directory structure
holding many dirs take ages on ext4.

The question: why is there such a huge difference between ext4 and btrfs?
See the test results below for real values.

Background: I had to back up a Jenkins directory holding workspaces for a
few projects which were checked out from svn (implying lots of extra .svn
dirs). The copy took a lot of time (more than I expected, at least) and the
process was mostly in D state (disk sleep). I dug deeper and did some extra
tests to see whether this is a regression on the block/fs side. To isolate
the issue I also performed the same tests on btrfs.

Test environment configuration:
1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6-core Intel X5670 with HT
enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
2) Kernels: all tests were done on the following kernels:
- 2.6.39.4-3 -- the build ID (-3) is used here mostly for internal tracking
of config changes. In -3 we've introduced the "fix readahead pipeline break
caused by block plug" patch. Otherwise it's pure 2.6.39.4.
- 3.2.7 -- the latest kernel at the time of testing (3.2.8 has been
released recently).
3) The subject of the tests, a directory holding:
- 54GB of data (measured on ext4)
- 1978149 files
- 844008 directories
4) Mount options:
- ext4 -- errors=remount-ro,noatime,data=writeback
- btrfs -- noatime,nodatacow and, for later investigation of the
compression effect: noatime,nodatacow,compress=lzo

In all tests I've been measuring execution time. The following tests were
performed:
- find . -type d
- find . -type f
- cp -a
- rm -rf

Ext4 results:
| Type     | 2.6.39.4-3 | 3.2.7     |
| Dir cnt  | 17m 40sec  | 11m 20sec |
| File cnt | 17m 36sec  | 11m 22sec |
| Copy     | 1h 28m     | 1h 27m    |
| Remove   | 3m 43sec   | 3m 38sec  |

Btrfs results (without lzo compression):
| Type     | 2.6.39.4-3 | 3.2.7     |
| Dir cnt  | 2m 22sec   | 2m 21sec  |
| File cnt | 2m 26sec   | 2m 23sec  |
| Copy     | 36m 22sec  | 39m 35sec |
| Remove   | 7m 51sec   | 10m 43sec |

From the above one can see that the copy takes close to 1h less on btrfs.
I've run strace counting the time spent in calls; the results are as
follows (from 3.2.7):
1) Ext4 (top elements only):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.01   13.257850           1  15082163           read
 23.40    5.440353           3   1687702           getdents
  6.15    1.430559           0   3672418           lstat
  3.80    0.883767           0  13106961           write
  2.32    0.539959           0   4794099           open
  1.69    0.393589           0    843695           mkdir
  1.28    0.296700           0   5637802           setxattr
  0.80    0.186539           0   7325195           stat

2) Btrfs:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.38    9.486210           1  15179751           read
 11.38    2.021662           1   1688328           getdents
 10.64    1.890234           0   4800317           open
  6.83    1.213723           0  13201590           write
  4.85    0.862731           0   5644314           setxattr
  3.50    0.621194           1    844008           mkdir
  2.75    0.489059           0   3675992         1 lstat
  1.71    0.303544           0   5644314           llistxattr
  1.50    0.265943           0   1978149           utimes
  1.02    0.180585           0   5644314    844008 getxattr

On btrfs getdents takes much less time, which proves that the bottleneck in
ext4's copy time is this syscall. On 2.6.39.4, btrfs shows even less time
spent in getdents:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 50.77   10.978816           1  15033132           read
 14.46    3.125996           1   4733589           open
  7.15    1.546311           0   5566988           setxattr
  5.89    1.273845           0   3626505           lstat
  5.81    1.255858           1   1667050           getdents
  5.66    1.224403           0  13083022           write
  3.40    0.735114           1    833371           mkdir
  1.96    0.424881           0   5566988           llistxattr


Why is there such a huge difference in the getdents timings?
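
(For reference, the interface being timed here boils down to a getdents64
loop like the minimal sketch below. The struct layout follows the
getdents64(2) man page; this is only an illustration of what readdir()-based
tools such as cp and find end up calling, not their actual code.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {           /* layout as documented in getdents64(2) */
    unsigned long long d_ino;     /* 64-bit inode number */
    long long          d_off;     /* offset to the next entry */
    unsigned short     d_reclen;  /* size of this record */
    unsigned char      d_type;    /* file type (DT_DIR, DT_REG, ...) */
    char               d_name[];  /* NUL-terminated name */
};

int main(int argc, char **argv)
{
    char buf[32 * 1024];
    int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (;;) {
        /* one getdents64 call returns a whole buffer full of entries */
        long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
        if (n <= 0)
            break;                /* 0 = end of directory, < 0 = error */
        for (long off = 0; off < n; ) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);
            printf("%llu\t%s\n", d->d_ino, d->d_name);
            off += d->d_reclen;
        }
    }
    close(fd);
    return 0;
}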

-Jacek


2012-02-29 13:55:11

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

Hi Chris,

the last one was borked :) Please check this one.

-jacek

2012/2/29 Jacek Luczak <[email protected]>:
> [...]

2012-02-29 14:07:45

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/2/29 Jacek Luczak <[email protected]>:
> [...]

I will try to answer the question from the broken email I've sent.

@Lukas, it was always a fresh FS on top of LVM logical volume. I've
been cleaning cache/remounting to sync all data before (re)doing
tests.

-Jacek

BTW: Sorry for the email mix-up. I just can't get this Gmail thing to work
(why force top posting? :/). Please use this thread.

2012-02-29 14:21:18

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/2/29 Jacek Luczak <[email protected]>:
> [...]

More observations:
1) Sampling the process state every 10s during the copy shows:
- Ext4: 526 probes done, 34 hits in R state, 492 hits in D state
- Btrfs (2.6.39.4): 218 probes, 83 R, 135 D
- Btrfs (3.2.7): 238 probes, 62 R, 174 D, 2 hits in sleeping state
2) dd write/read of a 55GB file to/from the volume:
- Ext4: write 127MB/s, read 107MB/s
- Btrfs: write 110MB/s, read 176MB/s

-Jacek

2012-02-29 14:42:44

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:

[ btrfs faster than ext for find and cp -a ]

> 2012/2/29 Jacek Luczak <[email protected]>:
>
> I will try to answer the question from the broken email I've sent.
>
> @Lukas, it was always a fresh FS on top of LVM logical volume. I've
> been cleaning cache/remounting to sync all data before (re)doing
> tests.

The next step is to get cp -a out of the picture, in this case you're
benchmarking both the read speed and the write speed (what are you
copying to btw?).

Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
of the read speed.

You can confirm the theory that it is directory order causing problems
by using acp to read the data.

http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

-chris


2012-02-29 14:55:09

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/2/29 Chris Mason <[email protected]>:
> On Wed, Feb 29, 2012 at 03:07:45PM +0100, Jacek Luczak wrote:
>
> [ btrfs faster than ext for find and cp -a ]
>
>> 2012/2/29 Jacek Luczak <[email protected]>:
>>
>> I will try to answer the question from the broken email I've sent.
>>
>> @Lukas, it was always a fresh FS on top of LVM logical volume. I've
>> been cleaning cache/remounting to sync all data before (re)doing
>> tests.
>
> The next step is to get cp -a out of the picture, in this case you're
> benchmarking both the read speed and the write speed (what are you
> copying to btw?).

It's a simple cp -a Jenkins{,.bak}, so a dir-to-dir copy on the same volume.

> Using tar cf /dev/zero <some_dir> is one way to get a consistent picture
> of the read speed.

IMO the problem is not only in read speed. The directory order hits here.
There's a difference in the sequential tests that places btrfs as the
winner, but still this should not have such a huge influence on getdents. I
know a bit about the differences between ext4 and btrfs directory handling
and I would not expect such a huge difference. On the production system
where the issue has been observed, with some real work going on in the
background, the copy takes up to 4h.

For me btrfs looks perfect here; what could be worth checking is the change
in syscall timings between 2.6.39.4 and 3.2.7. Before, getdents was not that
high on the list, while now it jumps to second position, but without a huge
impact on the timings.

> You can confirm the theory that it is directory order causing problems
> by using acp to read the data.
>
> http://oss.oracle.com/~mason/acp/acp-0.6.tar.bz2

I will check this later today and report back.

-jacek

2012-03-01 04:44:25

by Theodore Ts'o

Subject: Re: getdents - ext4 vs btrfs performance

You might try sorting the entries returned by readdir by inode number before you stat them. This is a long-standing weakness in ext3/ext4, and it has to do with how we added hashed tree indexes to directories in a way that was (a) backwards compatible and (b) POSIX compliant with respect to adding and removing directory entries concurrently with reading all of the directory entries using readdir.

You might try compiling spd_readdir from the e2fsprogs source tree (in the contrib directory):

http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209

… and then using that as an LD_PRELOAD, and see how that changes things.

The short version is that we can't easily do this in the kernel, since it's a problem that primarily shows up with very big directories, and using non-swappable kernel memory to store all of the directory entries and then sort them so they can be returned in inode number order just isn't practical. It is something which can easily be done in userspace, though, and a number of programs (including mutt for its Maildir support) do this, and it helps greatly for workloads where you are calling readdir() followed by something that needs to access the inode (e.g., stat, unlink, etc.)
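
A rough userspace sketch of that sort-by-inode approach (my illustration of
the idea, not the spd_readdir code itself):

#define _GNU_SOURCE
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

struct ent {
    ino_t ino;
    char  name[256];
};

static int by_ino(const void *a, const void *b)
{
    ino_t ia = ((const struct ent *)a)->ino;
    ino_t ib = ((const struct ent *)b)->ino;
    return (ia > ib) - (ia < ib);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    /* 1) slurp the whole directory first ... */
    size_t n = 0, cap = 1024;
    struct ent *ents = malloc(cap * sizeof(*ents));
    if (!ents)
        return 1;
    struct dirent *de;
    while ((de = readdir(dir)) != NULL) {
        if (n == cap && !(ents = realloc(ents, (cap *= 2) * sizeof(*ents))))
            return 1;
        ents[n].ino = de->d_ino;
        snprintf(ents[n].name, sizeof(ents[n].name), "%s", de->d_name);
        n++;
    }

    /* 2) ... sort the entries by inode number ... */
    qsort(ents, n, sizeof(*ents), by_ino);

    /* 3) ... and only then stat them, in inode order */
    for (size_t i = 0; i < n; i++) {
        struct stat st;
        if (fstatat(dirfd(dir), ents[i].name, &st, AT_SYMLINK_NOFOLLOW) == 0)
            printf("%10llu  %10lld  %s\n", (unsigned long long)ents[i].ino,
                   (long long)st.st_size, ents[i].name);
    }

    closedir(dir);
    free(ents);
    return 0;
}

spd_readdir gets the same effect for unmodified tools by interposing
opendir()/readdir() via LD_PRELOAD, so it can sit under cp or tar without
recompiling them (built with something like
gcc -shared -fPIC -o spd_readdir.so spd_readdir.c -ldl).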

-- Ted


On Feb 29, 2012, at 8:52 AM, Jacek Luczak wrote:

> [...]

2012-03-01 13:35:59

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/2/29 Jacek Luczak <[email protected]>:
> [...]

While I was about to grab acp I noticed seekwatcher, which made my day :)

seekwatcher run of tar cf to eliminate writes (all done on 3.2.7):
1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

I will send acp results soon.

-Jacek

2012-03-01 13:50:10

by Hillf Danton

Subject: Re: getdents - ext4 vs btrfs performance

On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <[email protected]> wrote:
>
> While I was about to grab acp I noticed seekwatcher, which made my day :)
>
> seekwatcher run of tar cf to eliminate writes (all done on 3.2.7):
> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>
> I will send acp results soon.
>
Would you please take reiserfs into account?

Thanks
-hd

2012-03-01 14:03:55

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/1 Hillf Danton <[email protected]>:
> On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <[email protected]> wrote:
>>
>> While I was about to grab acp I noticed seekwatcher, which made my day :)
>>
>> seekwatcher run of tar cf to eliminate writes (all done on 3.2.7):
>> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
>> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
>> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>>
>> I will send acp results soon.
>>
> Would you please take reiserfs into account?

As of now no (lack of time), but I'm pretty close to bringing XFS into the
game. Whenever I have more time and there's no pressure to give the host
back to production, I will redo the same tests for reiserfs.

Now I'm focused on the userspace sorting results.

-Jacek

2012-03-01 14:18:30

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
> 2012/3/1 Hillf Danton <[email protected]>:
> > On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <[email protected]> wrote:
> >>
> >> While I was about to grab acp I noticed seekwatcher, which made my day :)
> >>
> >> seekwatcher run of tar cf to eliminate writes (all done on 3.2.7):
> >> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
> >> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
> >> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png

Whoa, seekwatcher makes it pretty clear.

> >>
> >> I will send acp results soon.
> >>
> > Would you please take reiserfs into account?
>
> As of now no (lack of time), but I'm pretty close to bringing XFS into the
> game. Whenever I have more time and there's no pressure to give the host
> back to production, I will redo the same tests for reiserfs.
>
> Now I'm focused on the userspace sorting results.

reiserfs should have results very similar to ext4. The directory
hashing used by reiserfs is going to result in a very random read
pattern.

XFS will probably beat btrfs in this test. Their directory indexes
reflect on disk layout very well.

-chris

2012-03-01 14:38:59

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> [ sort the entries returned by readdir by inode number before you stat
>   them; try spd_readdir from e2fsprogs contrib as an LD_PRELOAD ]

For reading the files, the acp program I sent him tries to do something
similar. I had forgotten about spd_readdir though, we should consider
hacking that into cp and tar.

One interesting note is the page cache used to help here. Picture two
tests:

A) time tar cf /dev/zero /home

and

cp -a /home /new_dir_in_new_fs
unmount / flush caches
B) time tar cf /dev/zero /new_dir_in_new_fs

On ext, the time for B used to be much faster than the time for A
because the files would get written back to disk in roughly htree order.
Based on Jacek's data, that isn't true anymore.

-chris

2012-03-01 14:43:41

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/1 Chris Mason <[email protected]>:
> On Thu, Mar 01, 2012 at 03:03:53PM +0100, Jacek Luczak wrote:
>> 2012/3/1 Hillf Danton <[email protected]>:
>> > On Thu, Mar 1, 2012 at 9:35 PM, Jacek Luczak <[email protected]> wrote:
>> >>
>> >> While I was about to grab acp I noticed seekwatcher, which made my day :)
>> >>
>> >> seekwatcher run of tar cf to eliminate writes (all done on 3.2.7):
>> >> 1) btrfs: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs.png
>> >> 2) ext4: http://dozzie.jarowit.net/~dozzie/luczajac/tar_ext4.png
>> >> 3) both merged: http://dozzie.jarowit.net/~dozzie/luczajac/tar_btrfs_ext4.png
>
> Whoa, seekwatcher makes it pretty clear.

Yep, ext4 is close to my wife's closet.

>> >>
>> >> I will send acp results soon.
>> >>
>> > Would you please take reiserfs into account?
>>
>> As of now no (lack of time), but I'm pretty close to bringing XFS into the
>> game. Whenever I have more time and there's no pressure to give the host
>> back to production, I will redo the same tests for reiserfs.
>>
>> Now I'm focused on the userspace sorting results.
>
> reiserfs should have results very similar to ext4. The directory
> hashing used by reiserfs is going to result in a very random read
> pattern.
>
> XFS will probably beat btrfs in this test. Their directory indexes
> reflect on disk layout very well.

True, but not that fast on small files.

Apart from the question I raised in the first mail, there's a point to all
these actions. We are maintaining hosts that are used for building software:
random access, lots of small files and dirs (always a checkout), heavy
parallel IO. We were testing XFS vs ext4 a year ago and XFS was around 10%
slower on build times. We have not - yet - done the same on btrfs. Now we're
looking for a replacement for ext4 as we suffer from those issues - but we
were not aware of that until we stepped into this one.

If you would like me to do some specific tests around ext4 and btrfs,
let me know.

-Jacek

2012-03-01 14:52:01

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> [...]
>
> If you would like me to do some specific tests around ext4 and btrfs,
> let me know.

I'm always curious to see comparisons in real world workloads. You
should definitely consider testing XFS again, the big three filesystems
are under pretty constant improvement. For btrfs, please stick to 3.2
kernels and higher.

This seeky backup performance is somewhat built into ext4, but as Ted
said there are a few workarounds.

-chris

2012-03-01 14:57:52

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/1 Chris Mason <[email protected]>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
>> [...]
>
> I'm always curious to see comparisons in real world workloads. You
> should definitely consider testing XFS again, the big three filesystems
> are under pretty constant improvement. For btrfs, please stick to 3.2
> kernels and higher.

That's the plan, but I'm waiting for more of the brilliant work that
recently popped up in XFS. For btrfs, the changes introduced in 3.2 led me
to give it a try here. I don't have nice pictures and digits in my hand now,
but first tests showed close to 40% timing improvement between 2.6.39.4 and
3.2.7 - keep up the great work guys (and girls if any)!

-Jacek

2012-03-01 18:42:48

by Theodore Ts'o

Subject: Re: getdents - ext4 vs btrfs performance

On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
>
> Yep, ext4 is close to my wife's closet.
>

Were all of the file systems freshly laid down, or was this an aged
ext4 file system?

Also you should beware that if you have a workload which is heavy
parallel I/O, with lots of random, read/write accesses to small files,
a benchmark using tar might not be representative of what you will see
in production --- different file systems have different strengths and
weaknesses --- and the fact that ext3/ext4's readdir() returns inodes
in a non-optimal order for stat(2) or unlink(2) or file copy in the
cold cache case may not matter as much as you think in a build server.
(i.e., the directories that do need to be searched will probably be
serviced out of the dentry cache, etc.)

Regards,

- Ted

2012-03-02 09:51:55

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/1 Ted Ts'o <[email protected]>:
> On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
>>
>> Yep, ext4 is close to my wife's closet.
>>
>
> Were all of the file systems freshly laid down, or was this an aged
> ext4 file system?

Always fresh, recreated for each test - that's why it takes quite a lot of
time, as I had to copy the test dir back into place. The env is kept as
consistent as possible across all tests, so the values should be credible.

> Also you should beware that if you have a workload which is heavy
> parallel I/O, with lots of random, read/write accesses to small files,
> a benchmark using tar might not be representative of what you will see
> in production --- different file systems have different strengths and
> weaknesses --- and the fact that ext3/ext4's readdir() returns inodes
> in a non-optimal order for stat(2) or unlink(2) or file copy in the
> cold cache case may not matter as much as you think in a build server.
> (i.e., the directories that do need to be searched will probably be
> serviced out of the dentry cache, etc.)

The set of tests was not chosen to find the best fs for build purposes.

For pure builds ext4 is as of now the best, and most of the points you've
made above are valid here. We've performed real tests in a clone of the
production environment. The results are not that surprising, as one can
find the same in e.g. Phoronix tests. We've done tests of XFS vs ext[34]
and ext4 vs btrfs. Here, if we take into account only software compilation,
ext4 rocks. Btrfs is only a few seconds slower (max 5 on average). The
choice then was to use ext4 due to its more mature foundations and support
in RHEL. Why are we looking for a new one? Well, the build environment is
not only about building software. There are e.g. some strange tests running
in parallel, code analysis, etc. Users are doing damn odd things there ...
they are using Java, and you can imagine how bad bad bad zombie Java can
be. We've failed to isolate each use case and create a profiled
environment. Thus we need to find something that will provide common sense
for all use cases.

The previous tests showed that ext[34] rocks on compilation timing, but
all-around not really. One also needs to remember that the fs content
changes often. ext3 was ageing within 4-6 months of use. XFS on the other
hand was great all-around, just not in compilation timings. Roughly 10% is
not that much, but if a host is doing builds 24/7 then after a few days we
were well behind the ext[34] clone env. Btrfs was only tested on
compilation timings, not in general use.

We've created a simple test case for compilation. It's not quite the same
as what we have in the real env, but it's a good baseline (the kernel build
system is too perfect). Simply parallel kernel builds with randomly chosen
allyesconfig or allmodconfig. Below are the seekwatcher graphs from around
1h of the tests running. There were 10 builds (kernels 2.6.20-2.6.29)
running in three parallel threads:
1) ext4: http://91.234.146.107/~difrost/seekwatcher/kt_ext4.png
2) btrfs: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs.png
3) both: http://91.234.146.107/~difrost/seekwatcher/kt_btrfs_ext4.png

The above graphs prove that ext4 is ahead in this "competition". I will try
to set up a real build env there to see how those two compare.

-Jacek

2012-03-02 10:05:59

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/1 Chris Mason <[email protected]>:
> [ spd_readdir as an LD_PRELOAD; acp; timing tar cf /dev/zero on the
>   original tree vs. on a fresh cp -a copy ]

I've taken both into the tests. The subjects are acp and spd_readdir used
with tar, all on ext4:
1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png

acp looks much better than spd_readdir, but the directory copy with
spd_readdir decreased to 52m 39sec (30 min less).

-Jacek

2012-03-02 14:00:44

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>
> [...]

Do you have stats on how big these files are, and how fragmented they
are? For acp and spd to give us this, I think something has gone wrong
at writeback time (creating individual fragmented files).

-chris


2012-03-02 14:16:14

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/2 Chris Mason <[email protected]>:
> On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>> [...]
>
> Do you have stats on how big these files are, and how fragmented they
> are? For acp and spd to give us this, I think something has gone wrong
> at writeback time (creating individual fragmented files).

How big? Which files?

-Jacek

2012-03-02 14:26:51

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
> [...]
> >
> > Do you have stats on how big these files are, and how fragmented they
> > are? For acp and spd to give us this, I think something has gone wrong
> > at writeback time (creating individual fragmented files).
>
> How big? Which files?

All the files you're reading ;)

filefrag will tell you how many extents each file has, any file with
more than one extent is interesting. (The ext4 crowd may have better
suggestions on measuring fragmentation).

Since you mention this is a compile farm, I'm guessing there are a bunch
of .o files created by parallel builds. There are a lot of chances for
delalloc and the kernel writeback code to do the wrong thing here.

-chris

2012-03-02 19:32:15

by Theodore Ts'o

Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
>
> filefrag will tell you how many extents each file has, any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation).

You can get a *huge* amount of information (probably more than you'll
want to analyze) by doing this:

e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out

I haven't had time to do this in a while, but a while back I used this
to debug the writeback code with an eye towards reducing
fragmentation. At the time I was trying to optimize the case of
reducing fragmentation in the easiest case possible, where you start
with an empty file system, and then copy all of the data from another
file system onto it using rsync -avH.

It would be worthwhile to see what happens with files written by the
compiler and linker. Given that libelf tends to write .o files
non-sequentially, and without telling us how big the space is in
advance, I could well imagine that we're not doing the best job
avoiding free space fragmentation, which eventually leads to extra
file system aging.

It would be interesting to have a project where someone added
fallocate() support into libelf, and then added some heuristics into
ext4 so that if a file is fallocated to a precise size, or if the file
is fully written and closed before writeback begins, that we use this
to more efficiently pack the space used by the files by the block
allocator. This is a place where I would not be surprised that XFS
has some better code to avoid accelerated file system aging, and where
we could do better with ext4 with some development effort.

Of course, it might also be possible to hack around this by simply
using VPATH and dropping your build files in a separate place from
your source files, and periodically reformatting the file system where
your build tree lives. (As a side note, something that works well for
me is to use an SSD for my source files, and a separate 5400rpm HDD
for my build tree. That allows me to use a smaller and more
affordable SSD, and since the object files can be written
asynchronously by the writeback threads, but the compiler can't move
forward until it gets file data from the .c or .h file, it gets me the
best price/performance for a laptop build environment.)

BTW, I suspect we could make acp even more efficient by teaching it to
use FIEMAP ioctl to map out the data blocks for all of the files in
the source file system, and then copy the files (or perhaps even
parts of files) in a read order which reduced seeking on the source
drive.
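
(A minimal sketch of that kind of FIEMAP query, just fetching the physical
offset of each file's first extent so a copy could sort its read order by
it; this is an illustration of the idea, not acp code:)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>        /* FS_IOC_FIEMAP */
#include <linux/fiemap.h>    /* struct fiemap, struct fiemap_extent */

/* Physical byte offset of the file's first extent, or -1 on any failure. */
static long long first_extent_phys(const char *path)
{
    long long phys = -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* header plus room for a single extent record */
    struct fiemap *fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
    if (fm) {
        fm->fm_start = 0;
        fm->fm_length = ~0ULL;      /* map the whole file */
        fm->fm_extent_count = 1;    /* we only want the first extent */
        if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0 && fm->fm_mapped_extents > 0)
            phys = (long long)fm->fm_extents[0].fe_physical;
        free(fm);
    }
    close(fd);
    return phys;
}

int main(int argc, char **argv)
{
    /* prints "physical-offset <tab> filename"; pipe through sort -n to get
       a read order that roughly follows the on-disk layout */
    for (int i = 1; i < argc; i++)
        printf("%lld\t%s\n", first_extent_phys(argv[i]), argv[i]);
    return 0;
}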

- Ted

2012-03-02 19:50:50

by Chris Mason

Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 02, 2012 at 02:32:15PM -0500, Ted Ts'o wrote:
> On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
> >
> > filefrag will tell you how many extents each file has, any file with
> > more than one extent is interesting. (The ext4 crowd may have better
> > suggestions on measuring fragmentation).
>
> You can get a *huge* amount of information (probably more than you'll
> want to analyze) by doing this:
>
> e2fsck -nf -E fragcheck /dev/XXXX >& /tmp/fragcheck.out
>
> I haven't had time to do this in a while, but a while back I used this
> to debug the writeback code with an eye towards reducing
> fragmentation. At the time I was trying to optimize the case of
> reducing fragmentation in the easiest case possible, where you start
> with an empty file system, and then copy all of the data from another
> file system onto it using rsync -avH.
>
> It would be worthwhile to see what happens with files written by the
> compiler and linker. Given that libelf tends to write .o files
> non-sequentially, and without telling us how big the space is in
> advance, I could well imagine that we're not doing the best job
> avoiding free space fragmentation, which eventually leads to extra
> file system aging.

I just realized that I confused things. He's doing a read on the
results of a cp -a to a fresh FS, so there's no way the compiler/linker
are causing trouble.

>
> It would be interesting to have a project where someone added
> fallocate() support into libelf, and then added some heuristics into
> ext4 so that if a file is fallocated to a precise size, or if the file
> is fully written and closed before writeback begins, that we use this
> to more efficiently pack the space used by the files by the block
> allocator. This is a place where I would not be surprised that XFS
> has some better code to avoid accelerated file system aging, and where
> we could do better with ext4 with some development effort.

The part I don't think any of us have solved is writing back the files
in a good order after we've fallocated the blocks.

So this will probably be great for reads and not so good for writes.

>
> Of course, it might also be possible to hack around this by simply
> using VPATH and dropping your build files in a separate place from
> your source files, and periodically reformatting the file system where
> your build tree lives. (As a side note, something that works well for
> me is to use an SSD for my source files, and a separate 5400rpm HDD
> for my build tree. That allows me to use a smaller and more
> affordable SSD, and since the object files can be written
> asynchronously by the writeback threads, but the compiler can't move
> forward until it gets file data from the .c or .h file, it gets me the
> best price/performance for a laptop build environment.)

mkfs for defrag ;) It's the only way to be sure.

>
> BTW, I suspect we could make acp even more efficient by teaching it to
> use FIEMAP ioctl to map out the data blocks for all of the files in
> the source file system, and then copy the files (or perhaps even
> parts of files) in a read order which reduced seeking on the source
> drive.

acp does have a -b mode where it fibmaps (I was either lazy or it is
older than fiemap, I forget) the first block in the file, and uses that
to sort. It does help if the file blocks aren't ordered well wrt their
inode numbers, but not if the files are fragmented.
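
(For reference, the FIBMAP lookup being described reduces to something like
the sketch below, not acp's actual code; note the ioctl usually needs
root/CAP_SYS_RAWIO and maps one logical block number in place:)

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>   /* FIBMAP */

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        if (fd < 0) {
            perror(argv[i]);
            continue;
        }
        int blk = 0;    /* in: logical block 0, out: physical block number */
        if (ioctl(fd, FIBMAP, &blk) == 0)
            printf("%d\t%s\n", blk, argv[i]);   /* usable as a sort key */
        else
            perror("FIBMAP");                   /* usually needs root */
        close(fd);
    }
    return 0;
}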

It's also worth mentioning that acp doesn't actually cp. I never got
that far. It was supposed to be the perfect example of why everything
should be done via aio, but it just ended up demonstrating that ordering
by inode number and leveraging kernel/hardware readahead were more
important.

-chris

2012-03-03 22:41:46

by Jacek Luczak

Subject: Re: getdents - ext4 vs btrfs performance

2012/3/2 Chris Mason <[email protected]>:
> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
>> 2012/3/2 Chris Mason <[email protected]>:
>> > On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>> >>
>> >> I've took both on tests. The subject is acp and spd_readdir used with
>> >> tar, all on ext4:
>> >> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
>> >> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
>> >> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
>> >>
>> >> The acp looks much better than spd_readdir but directory copy with
>> >> spd_readdir decreased to 52m 39sec (30 min less).
>> >
>> > Do you have stats on how big these files are, and how fragmented they
> are? For acp and spd to give us this, I think something has gone wrong
>> > at writeback time (creating individual fragmented files).
>>
>> How big? Which files?
>
> All the files you're reading ;)
>
> filefrag will tell you how many extents each file has, any file with
> more than one extent is interesting. (The ext4 crowd may have better
> suggestions on measuring fragmentation).
>
> Since you mention this is a compile farm, I'm guessing there are a bunch
> of .o files created by parallel builds. There are a lot of chances for
> delalloc and the kernel writeback code to do the wrong thing here.
>

This took some time but here it goes ...

* B size files - total: 1064462 (53.81%)
| Size range | File cnt | Fragmented | Fragmented % |
| [0,100) | 288702 | 0 | 0 |
| [100,200) | 165997 | 0 | 0 |
| [200,300) | 45993 | 0 | 0 |
| [300,400) | 64883 | 0 | 0 |
| [400,500) | 39282 | 0 | 0 |
| [500,600) | 61612 | 0 | 0 |
| [600,700) | 208822 | 0 | 0 |
| [700,800) | 64016 | 0 | 0 |
| [800,900) | 58181 | 0 | 0 |
| [900,1024) | 66974 | 0 | 0 |

* K size files - total: 907375 (45.86%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,2) | 285481 | 0 | 0 |
| [2,3) | 148461 | 0 | 0 |
| [3,4) | 102569 | 0 | 0 |
| [4,5) | 45089 | 0 | 0 |
| [5,6) | 44223 | 0 | 0 |
| [6,10) | 94202 | 0 | 0 |
| [10,20) | 83760 | 0 | 0 |
| [20,30) | 42548 | 0 | 0 |
| [30,40) | 15274 | 0 | 0 |
| [40,50) | 4060 | 0 | 0 |
| [50,60) | 3517 | 0 | 0 |
| [60,100) | 7834 | 0 | 0 |
| [100,1024) | 30357 | 0 | 0 |

* M size files - total: 6312 (0.33%)
| Size range | File cnt | Fragmented | Fragmented % |
| [1,10) | 5806 | 100 | 1.72 |
| [10,100) | 482 | 200 | 41.49 |
| [100,1024) | 24 | 13 | 54.16 |

All files scanned: 1978149
Files fragmented: 313 (0.015%) where 11 have 3+ extents
Total size of fragmented files: 7GB (~13% of dir size)
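
(For anyone who wants to reproduce this kind of summary: a small sketch of
one way to count extents per file by shelling out to filefrag and parsing
its usual "N extents found" line. This is only an assumption about how such
numbers could be gathered, not the exact script used here.)

import re
import subprocess
import sys

def extent_count(path):
    # filefrag prints e.g. "foo.o: 3 extents found"
    out = subprocess.run(['filefrag', path],
                         capture_output=True, text=True).stdout
    m = re.search(r':\s*(\d+) extents? found', out)
    return int(m.group(1)) if m else 0

paths = [line.rstrip('\n') for line in sys.stdin]
fragmented = [p for p in paths if extent_count(p) > 1]
print('scanned %d files, %d fragmented' % (len(paths), len(fragmented)))

(Feed it a file list, e.g. find . -type f | python3 this_script.py.)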

tar cf on fragmented files:
1) time: 7sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
3) sw graph with spd_readdir:
http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
4) both on one:
http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

tar cf of fragmented files disturbed with [40,50) K files (in total
4373 files). K files before fragmented M files:
1) size: 7.2GB
2) time: 1m 14sec
3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
4) sw graph with spd_readdir:
http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
5) both on one:
http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

I will perform same tests on btrfs tomorrow.

-Jacek

2012-03-04 10:25:08

by Jacek Luczak

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

2012/3/3 Jacek Luczak <[email protected]>:
> 2012/3/2 Chris Mason <[email protected]>:
>> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
>>> 2012/3/2 Chris Mason <[email protected]>:
>>> > On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>>> >>
>>> >> I've took both on tests. The subject is acp and spd_readdir used with
>>> >> tar, all on ext4:
>>> >> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
>>> >> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
>>> >> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
>>> >>
>>> >> The acp looks much better than spd_readdir but directory copy with
>>> >> spd_readdir decreased to 52m 39sec (30 min less).
>>> >
>>> > Do you have stats on how big these files are, and how fragmented they
>> > are? For acp and spd to give us this, I think something has gone wrong
>>> > at writeback time (creating individual fragmented files).
>>>
>>> How big? Which files?
>>
>> All the files you're reading ;)
>>
>> filefrag will tell you how many extents each file has, any file with
>> more than one extent is interesting. (The ext4 crowd may have better
>> suggestions on measuring fragmentation).
>>
>> Since you mention this is a compile farm, I'm guessing there are a bunch
>> of .o files created by parallel builds. There are a lot of chances for
>> delalloc and the kernel writeback code to do the wrong thing here.
>>
>
[Most of files are B and K size]
>
> All files scanned: 1978149
> Files fragmented: 313 (0.015%) where 11 have 3+ extents
> Total size of fragmented files: 7GB (~13% of dir size)

BTRFS: None of the files, according to filefrag, are fragmented - all fit
into one extent.

> tar cf on fragmented files:
> 1) time: 7sec
> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
> 3) sw graph with spd_readdir:
> http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
> 4) both on one:
> http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png

BTRFS: tar on ext4 fragmented files
1) time: 6sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_btrfs.png

> tar cf of fragmented files disturbed with [40,50) K files (in total
> 4373 files). K files before fragmented M files:
> 1) size: 7.2GB
> 2) time: 1m 14sec
> 3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
> 4) sw graph with spd_readdir:
> http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
> 5) both on one:
> http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png

BTRFS: tar on [40,50) K and ext4 fragmented
1) time: 56sec
2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_btrfs.png

New test I've included - randomly selected files:
- size 240MB
1) ext4 (time: 34sec) sw graph:
http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
2) btrfs (time: 55sec) sw graph:
http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

-Jacek

2012-03-05 11:32:47

by Jacek Luczak

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

2012/3/4 Jacek Luczak <[email protected]>:
> 2012/3/3 Jacek Luczak <[email protected]>:
>> 2012/3/2 Chris Mason <[email protected]>:
>>> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
>>>> 2012/3/2 Chris Mason <[email protected]>:
>>>> > On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
>>>> >>
>>>> >> I've took both on tests. The subject is acp and spd_readdir used with
>>>> >> tar, all on ext4:
>>>> >> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
>>>> >> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
>>>> >> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
>>>> >>
>>>> >> The acp looks much better than spd_readdir but directory copy with
>>>> >> spd_readdir decreased to 52m 39sec (30 min less).
>>>> >
>>>> > Do you have stats on how big these files are, and how fragmented they
>>>> > are? For acp and spd to give us this, I think something has gone wrong
>>>> > at writeback time (creating individual fragmented files).
>>>>
>>>> How big? Which files?
>>>
>>> All the files you're reading ;)
>>>
>>> filefrag will tell you how many extents each file has, any file with
>>> more than one extent is interesting. (The ext4 crowd may have better
>>> suggestions on measuring fragmentation).
>>>
>>> Since you mention this is a compile farm, I'm guessing there are a bunch
>>> of .o files created by parallel builds. There are a lot of chances for
>>> delalloc and the kernel writeback code to do the wrong thing here.
>>>
>>
> [Most of files are B and K size]
>>
>> All files scanned: 1978149
>> Files fragmented: 313 (0.015%) where 11 have 3+ extents
>> Total size of fragmented files: 7GB (~13% of dir size)
>
> BTRFS: Non of files according to filefrag are fragmented - all fit
> into one extent.
>
>> tar cf on fragmented files:
>> 1) time: 7sec
>> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented.png
>> 3) sw graph with spd_readdir:
>> http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_spd.png
>> 4) both on one:
>> http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_pure_spd.png
>
> BTRFS: tar on ext4 fragmented files
> 1) time: 6sec
> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_fragmented_btrfs.png
>
>> tar cf of fragmented files disturbed with [40,50) K files (in total
>> 4373 files). K files before fragmented M files:
>> 1) size: 7.2GB
>> 2) time: 1m 14sec
>> 3) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed.png
>> 4) sw graph with spd_readdir:
>> http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_spd.png
>> 5) both on one:
>> http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_pure_spd.png
>
> BTRFS: tar on [40,50) K and ext4 fragmented
> 1) time: 56sec
> 2) sw graph: http://91.234.146.107/~difrost/seekwatcher/tar_disturbed_btrfs.png
>
> New test I've included - randomly selected files:
> - size 240MB
> 1) ext4 (time: 34sec) sw graph:
> http://91.234.146.107/~difrost/seekwatcher/tar_random_ext4.png
> 2) btrfs (time: 55sec) sw graph:
> http://91.234.146.107/~difrost/seekwatcher/tar_random_btrfs.png

Yet another test. The original issue is in the directory data
handling. In my case a lot of dirs are introduced due to the extra .svn
directories. Let's then see how tar on those dirs looks.

Number of .svn directories: 61605
1) Ext4:
- tar time: 10m 53sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_ext4.png
- sw tar graph with spd_readdir:
http://91.234.146.107/~difrost/seekwatcher/svn_dir_spd_ext4.png
2) Btrfs:
- tar time: 4m 35sec
- sw tar graph: http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs.png
- sw tar graph with ext4:
http://91.234.146.107/~difrost/seekwatcher/svn_dir_btrfs_ext4.png

IMO this is not a writeback issue (well, it could be, but then it would
mean writeback is broken in general), and it's not fragmentation. Sorting
files in readdir order helps a bit but is still far behind btrfs.

Any ideas? Is this an issue, or are things just the way they are and one
has to live with it?

-Jacek

2012-03-05 13:10:53

by Jan Kara

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Fri 02-03-12 14:32:15, Ted Tso wrote:
> On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
> It would be interesting to have a project where someone added
> fallocate() support into libelf, and then added some hueristics into
> ext4 so that if a file is fallocated to a precise size, or if the file
> is fully written and closed before writeback begins, that we use this
> to more efficiently pack the space used by the files by the block
> allocator. This is a place where I would not be surprised that XFS
> has some better code to avoid accelerated file system aging, and where
> we could do better with ext4 with some development effort.
AFAIK XFS people actually prefer that applications let them do their work
using delayed allocation and do not interfere with fallocate(2) calls. The
problem they have with fallocate(2) is that it forces you to allocate
blocks while with delayed allocation you can make the decision about
allocation later. So for small files which completely fit into pagecache
before they get pushed out by writeback, they can make better decisions
from delayed allocation. Just dumping my memory from some other thread...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-03-06 00:37:16

by Chris Mason

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Mon, Mar 05, 2012 at 12:32:45PM +0100, Jacek Luczak wrote:
> 2012/3/4 Jacek Luczak <[email protected]>:
> > 2012/3/3 Jacek Luczak <[email protected]>:
> >> 2012/3/2 Chris Mason <[email protected]>:
> >>> On Fri, Mar 02, 2012 at 03:16:12PM +0100, Jacek Luczak wrote:
> >>>> 2012/3/2 Chris Mason <[email protected]>:
> >>>> > On Fri, Mar 02, 2012 at 11:05:56AM +0100, Jacek Luczak wrote:
> >>>> >>
> >>>> >> I've took both on tests. The subject is acp and spd_readdir used with
> >>>> >> tar, all on ext4:
> >>>> >> 1) acp: http://91.234.146.107/~difrost/seekwatcher/acp_ext4.png
> >>>> >> 2) spd_readdir: http://91.234.146.107/~difrost/seekwatcher/tar_ext4_readir.png
> >>>> >> 3) both: http://91.234.146.107/~difrost/seekwatcher/acp_vs_spd_ext4.png
> >>>> >>
> >>>> >> The acp looks much better than spd_readdir but directory copy with
> >>>> >> spd_readdir decreased to 52m 39sec (30 min less).
> >>>> >
> >>>> > Do you have stats on how big these files are, and how fragmented they
> >>>> > are? For acp and spd to give us this, I think something has gone wrong
> >>>> > at writeback time (creating individual fragmented files).
> >>>>
> >>>> How big? Which files?
> >>>
> >>> All the files you're reading ;)
> >>>
> >>> filefrag will tell you how many extents each file has, any file with
> >>> more than one extent is interesting. (The ext4 crowd may have better
> >>> suggestions on measuring fragmentation).
> >>>
> >>> Since you mention this is a compile farm, I'm guessing there are a bunch
> >>> of .o files created by parallel builds. There are a lot of chances for
> >>> delalloc and the kernel writeback code to do the wrong thing here.
> >>>
> >>
> > [Most of files are B and K size]
> >>
> >> All files scanned: 1978149
> >> Files fragmented: 313 (0.015%) where 11 have 3+ extents
> >> Total size of fragmented files: 7GB (~13% of dir size)

Ok, so I don't have a lot of great new ideas. My guess is that inode
order and disk order for the blocks aren't matching up. You can confirm
this with:

acp -b some_dir

You can also try telling acp to make a bigger read ahead window:

acp -s 4096 -r 128 some_dir

You can tell acp to scan all the files in the directory tree first
(warning, this might use a good chunk of ram)

acp -w some_dir

and you can combine all of these together. None of the above
will actually help in your workload, but it'll help narrow down what is
actually seeky on disk.

-chris

2012-03-08 17:02:07

by Phillip Susi

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 2/29/2012 11:44 PM, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number
> before you stat them. This is a long-standing weakness in
> ext3/ext4, and it has to do with how we added hashed tree indexes to
> directories in (a) a backwards compatible way, that (b) was POSIX
> compliant with respect to adding and removing directory entries
> concurrently with reading all of the directory entries using
> readdir.

When I ran into this a while back, I cobbled together this python script
to measure the correlation from name to inode, inode to first data
block, and name to first data block for all of the small files in a
large directory, and found that ext4 gives a very poor correlation due
to that directory hashing. This is one of the reasons I prefer using
dump instead of tar for backups: it rips through my Maildir more
than 10 times faster than tar because it reads the files in inode order.
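
(The script itself is only attached, not inlined; below is a minimal sketch
of the same kind of measurement. The exact metric in correlation.py may
differ; this version just reports how often inode numbers are already
increasing when the names are taken in sorted order, so 1.0 means perfect
correlation and roughly 0.5 means random.)

import os

def ordered_fraction(values):
    # Fraction of adjacent pairs already in increasing order.
    pairs = list(zip(values, values[1:]))
    if not pairs:
        return 1.0
    return sum(1 for a, b in pairs if a <= b) / len(pairs)

def name_to_inode_correlation(directory):
    entries = sorted(os.scandir(directory), key=lambda e: e.name)
    inodes = [e.inode() for e in entries if e.is_file(follow_symlinks=False)]
    return ordered_fraction(inodes)

print(name_to_inode_correlation('.'))

(The inode-to-block and name-to-block figures would additionally need the
first data block of each file, e.g. via the FIBMAP sketch earlier in the
thread.)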


Attachments:
correlation.py (1.36 kB)

2012-03-09 11:29:34

by Lukas Czerner

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, 29 Feb 2012, Jacek Luczak wrote:

> Hi All,
>
> /*Sorry for sending incomplete email, hit wrong button :) I guess I
> can't use Gmail */
>
> Long story short: We've found that operations on a directory structure
> holding many dirs takes ages on ext4.
>
> The Question: Why there's that huge difference in ext4 and btrfs? See
> below test results for real values.
>
> Background: I had to backup a Jenkins directory holding workspace for
> few projects which were co from svn (implies lot of extra .svn dirs).
> The copy takes lot of time (at least more than I've expected) and
> process was mostly in D (disk sleep). I've dig more and done some
> extra test to see if this is not a regression on block/fs site. To
> isolate the issue I've also performed same tests on btrfs.
>
> Test environment configuration:
> 1) HW: HP ProLiant BL460 G6, 48 GB of memory, 2x 6 core Intel X5670 HT
> enabled, Smart Array P410i, RAID 1 on top of 2x 10K RPM SAS HDDs.
> 2) Kernels: All tests were done on following kernels:
> - 2.6.39.4-3 -- the build ID (3) is used here for internal tacking of
> config changes mostly. In -3 we've introduced ,,fix readahead pipeline
> break caused by block plug'' patch. Otherwise it's pure 2.6.39.4.
> - 3.2.7 -- latest kernel at the time of testing (3.2.8 has been
> release recently).
> 3) A subject of tests, directory holding:
> - 54GB of data (measured on ext4)
> - 1978149 files
> - 844008 directories
> 4) Mount options:
> - ext4 -- errors=remount-ro,noatime,
> data=writeback
> - btrfs -- noatime,nodatacow and for later investigation on
> copression effect: noatime,nodatacow,compress=lzo
>
> In all tests I've been measuring time of execution. Following tests
> were performed:
> - find . -type d
> - find . -type f
> - cp -a
> - rm -rf
>
> Ext4 results:
> | Type | 2.6.39.4-3 | 3.2.7
> | Dir cnt | 17m 40sec | 11m 20sec
> | File cnt | 17m 36sec | 11m 22sec
> | Copy | 1h 28m | 1h 27m
> | Remove| 3m 43sec | 3m 38sec
>
> Btrfs results (without lzo comression):
> | Type | 2.6.39.4-3 | 3.2.7
> | Dir cnt | 2m 22sec | 2m 21sec
> | File cnt | 2m 26sec | 2m 23sec
> | Copy | 36m 22sec | 39m 35sec
> | Remove| 7m 51sec | 10m 43sec
>
> From above one can see that copy takes close to 1h less on btrfs. I've
> done strace counting times of calls, results are as follows (from
> 3.2.7):
> 1) Ext4 (only to elements):
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 57.01 13.257850 1 15082163 read
> 23.40 5.440353 3 1687702 getdents
> 6.15 1.430559 0 3672418 lstat
> 3.80 0.883767 0 13106961 write
> 2.32 0.539959 0 4794099 open
> 1.69 0.393589 0 843695 mkdir
> 1.28 0.296700 0 5637802 setxattr
> 0.80 0.186539 0 7325195 stat
>
> 2) Btrfs:
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 53.38 9.486210 1 15179751 read
> 11.38 2.021662 1 1688328 getdents
> 10.64 1.890234 0 4800317 open
> 6.83 1.213723 0 13201590 write
> 4.85 0.862731 0 5644314 setxattr
> 3.50 0.621194 1 844008 mkdir
> 2.75 0.489059 0 3675992 1 lstat
> 1.71 0.303544 0 5644314 llistxattr
> 1.50 0.265943 0 1978149 utimes
> 1.02 0.180585 0 5644314 844008 getxattr
>
> On btrfs getdents takes much less time which prove the bottleneck in
> copy time on ext4 is this syscall. In 2.6.39.4 it shows even less time
> for getdents:
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 50.77 10.978816 1 15033132 read
> 14.46 3.125996 1 4733589 open
> 7.15 1.546311 0 5566988 setxattr
> 5.89 1.273845 0 3626505 lstat
> 5.81 1.255858 1 1667050 getdents
> 5.66 1.224403 0 13083022 write
> 3.40 0.735114 1 833371 mkdir
> 1.96 0.424881 0 5566988 llistxattr
>
>
> Why so huge difference in the getdents timings?
>
> -Jacek

Hi,

I have created a simple script which creates a bunch of files with
random names in a directory and then performs operations like list,
tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
4k files. The result is that ext4 pretty much dominates the
create, tar and find times. However, copy times are a whole
different story unfortunately - they suck badly.

Once we cross the mark of 320000 files in the directory (on my system),
ext4 becomes significantly worse in copy times. And that is where
the hash tree ordering of the directory entries really kicks in.

Here is a simple graph:

http://people.redhat.com/lczerner/files/copy_benchmark.pdf

Here is a data where you can play with it:

https://www.google.com/fusiontables/DataSource?snapid=S425803zyTE

and here is the txt file for convenience:

http://people.redhat.com/lczerner/files/copy_data.txt

I have also run the correlation.py from Phillip Susi on a directory with
100000 4k files and indeed the name to block correlation in ext4 is pretty
much random :)

_ext4_
Name to inode correlation: 0.50002499975
Name to block correlation: 0.50002499975
Inode to block correlation: 0.9999900001

_xfs_
Name to inode correlation: 0.969660303397
Name to block correlation: 0.969660303397
Inode to block correlation: 1.0


So there definitely is a huge space for improvements in ext4.

Thanks!
-Lukas

Here is the script I have used to get the numbers above, just so you can
see what operations I have performed.


#!/bin/bash

dev=$1
mnt=$2
fs=$3
count=$4
size=$5

if [ -z $dev ]; then
    echo "Device was not specified!"
    exit 1
fi

if [ -z $mnt ]; then
    echo "Mount point was not specified!"
    exit 1
fi

if [ -z $fs ]; then
    echo "File system was not specified!"
    exit 1
fi

if [ -z $count ]; then
    count=10000
fi

if [ -z $size ]; then
    size=0
fi

export TIMEFORMAT="%3R"

umount $dev &> /dev/null
umount $mnt &> /dev/null

case $fs in
    "xfs") mkfs.xfs -f $dev &> /dev/null; mount $dev $mnt;;
    "ext3") mkfs.ext3 -F -E lazy_itable_init $dev &> /dev/null; mount $dev $mnt;;
    "ext4") mkfs.ext4 -F -E lazy_itable_init $dev &> /dev/null; mount -o noinit_itable $dev $mnt;;
    "btrfs") mkfs.btrfs $dev &> /dev/null; mount $dev $mnt;;
    *) echo "Unsupported file system";
       exit 1;;
esac


testdir=${mnt}/$$
mkdir $testdir

_remount()
{
    # umount/mount left commented out: only sync and drop caches between phases
    sync
    #umount $mnt
    #mount $dev $mnt
    echo 3 > /proc/sys/vm/drop_caches
}


#echo "[+] Creating $count files"
_remount
create=$((time ./dirtest $testdir $count $size) 2>&1)

#echo "[+] Listing files"
_remount
list=$((time ls $testdir > /dev/null) 2>&1)

#echo "[+] tar the files"
_remount
tar=$((time $(tar -cf - $testdir &> /dev/null)) 2>&1)

#echo "[+] find the files"
_remount
find=$((time $(find $testdir -type f &> /dev/null)) 2>&1)

#echo "[+] Copying files"
_remount
copy=$((time $(cp -a ${testdir} ${mnt}/copy)) 2>&1)

#echo "[+] Removing files"
_remount
remove=$((time $(rm -rf $testdir)) 2>&1)

echo "$fs $count $create $list $tar $find $copy $remove"

2012-03-09 14:34:46

by Chris Mason

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 09, 2012 at 12:29:29PM +0100, Lukas Czerner wrote:
> Hi,
>
> I have created a simple script which creates a bunch of files with
> random names in the directory and then performs operation like list,
> tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
> the 4k size files. And the result is that ext4 pretty much dominates the
> create times, tar times and find times. However copy times is a whole
> different story unfortunately - is sucks badly.
>
> Once we cross the mark of 320000 files in the directory (on my system) the
> ext4 is becoming significantly worse in copy times. And that is where
> the hash tree order in the directory entry really hit in.
>
> Here is a simple graph:
>
> http://people.redhat.com/lczerner/files/copy_benchmark.pdf
>
> Here is a data where you can play with it:
>
> https://www.google.com/fusiontables/DataSource?snapid=S425803zyTE
>
> and here is the txt file for convenience:
>
> http://people.redhat.com/lczerner/files/copy_data.txt
>
> I have also run the correlation.py from Phillip Susi on directory with
> 100000 4k files and indeed the name to block correlation in ext4 is pretty
> much random :)
>
> _ext4_
> Name to inode correlation: 0.50002499975
> Name to block correlation: 0.50002499975
> Inode to block correlation: 0.9999900001
>
> _xfs_
> Name to inode correlation: 0.969660303397
> Name to block correlation: 0.969660303397
> Inode to block correlation: 1.0
>
>
> So there definitely is a huge space for improvements in ext4.

Thanks Lukas, this is great data. There is definitely room for btrfs to
speed up in the other phases as well.

-chris

2012-03-10 00:25:40

by Andreas Dilger

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 2012-03-09, at 3:29, Lukas Czerner <[email protected]> wrote:
>
> I have created a simple script which creates a bunch of files with
> random names in the directory and then performs operation like list,
> tar, find, copy and remove. I have run it for ext4, xfs and btrfs with
> the 4k size files. And the result is that ext4 pretty much dominates the
> create times, tar times and find times. However copy times is a whole
> different story unfortunately - is sucks badly.
>
> Once we cross the mark of 320000 files in the directory (on my system) the
> ext4 is becoming significantly worse in copy times. And that is where
> the hash tree order in the directory entry really hit in.
>
> Here is a simple graph:
>
> http://people.redhat.com/lczerner/files/copy_benchmark.pdf
>
> Here is a data where you can play with it:
>
> https://www.google.com/fusiontables/DataSource?snapid=S425803zyTE
>
> and here is the txt file for convenience:
>
> http://people.redhat.com/lczerner/files/copy_data.txt
>
> I have also run the correlation.py from Phillip Susi on directory with
> 100000 4k files and indeed the name to block correlation in ext4 is pretty
> much random :)

Just reading this on the plane, so I can't find the exact reference that I want, but a solution to this problem with htree was discussed a few years ago between myself and Coly Li.

The basic idea is that for large directories the inode allocator starts by selecting a range of (relatively) free inodes based on the current directory size, and then piecewise maps the hash value for the filename into this inode range and uses that as the goal inode.

When the inode range is (relatively) filled up (as determined by average distance between goal and allocated inode), a new (larger) inode range is selected based on the new (larger) directory size and usage continues as described.

This change is only in-memory allocation policy and does not affect the on disk format, though it is expected to improve the hash->inode mapping coherency significantly.

When the directory is small (below a thousand or so) the allocations will be close to the parent and the ordering doesn't matter significantly because the inode table blocks will all be quickly read or prefetched from disk. It wouldn't be harmful to use the mapping algorithm in this case, but it likely won't show much improvement.

As the directory gets larger, the range of inodes will also get larger. The number of inodes in the smaller range becomes less significant as the range continues to grow.

Once the inode range is hundreds of thousands or larger the mapping of the hash to the inodes will avoid a lot of random IO.
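
(For concreteness, the mapping being described is roughly the following;
the function name, parameters and 32-bit hash width are illustrative only,
not taken from an actual patch.)

def goal_inode(name_hash, range_start, range_len, hash_bits=32):
    # Piecewise-linear map of the directory-entry hash into the inode range
    # reserved for this directory: names that are adjacent in hash (and thus
    # readdir) order get goal inodes that are close together on disk.
    return range_start + (name_hash * range_len) // (1 << hash_bits)

(The allocator would fall back to its normal search if the goal inode is
already taken, and switch to a larger range once the current one fills up,
as described above.)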

When restarting from a new mount, the inode ranges can be found when doing the initial name lookup in the leaf block by checking the allocated inodes for existing dirents.

Unfortunately, the prototype that was developed diverged from this idea and didn't really achieve the results I wanted.

Cheers, Andreas

> _ext4_
> Name to inode correlation: 0.50002499975
> Name to block correlation: 0.50002499975
> Inode to block correlation: 0.9999900001
>
> _xfs_
> Name to inode correlation: 0.969660303397
> Name to block correlation: 0.969660303397
> Inode to block correlation: 1.0
>
>
> So there definitely is a huge space for improvements in ext4.
>
> Thanks!
> -Lukas
>
> Here is a script I have used to get the numbers above, just to see that
> are the operation I have performed.
>
>
> #!/bin/bash
>
> dev=$1
> mnt=$2
> fs=$3
> count=$4
> size=$5
>
> if [ -z $dev ]; then
> echo "Device was not specified!"
> exit 1
> fi
>
> if [ -z $mnt ]; then
> echo "Mount point was not specified!"
> exit 1
> fi
>
> if [ -z $fs ]; then
> echo "File system was not specified!"
> exit 1
> fi
>
> if [ -z $count ]; then
> count=10000
> fi
>
> if [ -z $size ]; then
> size=0
> fi
>
> export TIMEFORMAT="%3R"
>
> umount $dev &> /dev/null
> umount $mnt &> /dev/null
>
> case $fs in
> "xfs") mkfs.xfs -f $dev &> /dev/null; mount $dev $mnt;;
> "ext3") mkfs.ext3 -F -E lazy_itable_init $dev &> /dev/null; mount $dev $mnt;;
> "ext4") mkfs.ext4 -F -E lazy_itable_init $dev &> /dev/null; mount -o noinit_itable $dev $mnt;;
> "btrfs") mkfs.btrfs $dev &> /dev/null; mount $dev $mnt;;
> *) echo "Unsupported file system";
> exit 1;;
> esac
>
>
> testdir=${mnt}/$$
> mkdir $testdir
>
> _remount()
> {
> sync
> #umount $mnt
> #mount $dev $mnt
> echo 3 > /proc/sys/vm/drop_caches
> }
>
>
> #echo "[+] Creating $count files"
> _remount
> create=$((time ./dirtest $testdir $count $size) 2>&1)
>
> #echo "[+] Listing files"
> _remount
> list=$((time ls $testdir > /dev/null) 2>&1)
>
> #echo "[+] tar the files"
> _remount
> tar=$((time $(tar -cf - $testdir &> /dev/null)) 2>&1)
>
> #echo "[+] find the files"
> _remount
> find=$((time $(find $testdir -type f &> /dev/null)) 2>&1)
>
> #echo "[+] Copying files"
> _remount
> copy=$((time $(cp -a ${testdir} ${mnt}/copy)) 2>&1)
>
> #echo "[+] Removing files"
> _remount
> remove=$((time $(rm -rf $testdir)) 2>&1)
>
> echo "$fs $count $create $list $tar $find $copy $remove"
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-03-10 03:52:02

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

Hey Jacek,

I'm curious about the parameters of the set of directories on your production
server. On an ext4 file system, assuming you've copied the
directories over, what are the results of this command pipeline when
you are cd'ed into the top of the directory hierarchy of interest
(your svn tree, as I recall)?

find . -type d -ls | awk '{print $7}' | sort -n | uniq -c

I'm just curious what the distribution of directories sizes are in
practice for your use case...

Thanks,

- Ted



2012-03-10 04:48:09

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
> > I have also run the correlation.py from Phillip Susi on directory with
> > 100000 4k files and indeed the name to block correlation in ext4 is pretty
> > much random :)
>
> Just reading this on the plane, so I can't find the exact reference
> that I want, but a solution to this problem with htree was discussed
> a few years ago between myself and Coly Li.
>
> The basic idea is that for large directories the inode allocator
> starts by selecting a range of (relatively) free inodes based on the
> current directory size, and then piecewise maps the hash value for
> the filename into this inode range and uses that as the goal inode.

I've heard you sketch out this idea before, but it's not been clear to
me how well it would work in practice. The potential downside is that
it fragments the inode table, such that a single inode table block
might not have as many inodes stored in it as it otherwise
would. (Ideally, with 256 byte inodes, there would be 16 of them in each
4k block.) This is something I'd definitely recommend modelling in
simulation to see how well it works first.

I'd observe that if we knew in advance how many files would be in a
particular directory (i.e., via a hint from userspace at mkdir time),
then it would be possible to optimize this case much more easily. In
fact, if we had perfect knowledge --- if the userspace process could
feed us the exact set of filenames that will be used in the directory,
plus the exact file sizes for each of the file names --- then it would
be possible to allocate inode numbers and starting block numbers such
that when the files are read in readdir() order, the inode numbers and
block numbers would line up perfectly into sequential writes.

Of course, the trade off is that by optimizing for read order, the
write order will tend to get painful as well! So there's a tension
here; making the read part of the benchmark faster will make the
process of writing the directory hierarchy require more seeks --- and
this is assuming that you know file names and sizes ahead of time.

(Unless, of course, you play the same game that Hans Reiser did when
he cooked his "tar" benchmark by constructing a kernel source tarball
where the file order in the tarball was perfectly ordered to guarantee
the desired result given Reiser4's hash! :-)


I suspect the best optimization for now is probably something like
this:

1) Since the vast majority of directories are less than (say) 256k
(this would be a tunable value), for directories which are less than
this threshold size, the entire directory is sucked in after the first
readdir() after an opendir() or rewinddir(). The directory contents
are then sorted by inode number (or loaded into an rbtree ordered by
inode number), and returned back to userspace in the inode order via
readdir(). The directory contents will be released on a closedir() or
rewinddir().

2) For directories larger than this size, we don't want to hold that much
kernel memory during an opendir(), so fall back to ext4's current
scheme.

3) If we want to do better than that, we could create a new flag
O_NOTELLDIR, which can be set via fcntl. (This way programs can
do something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd,
SETFL, O_NOTELLDIR);".) For programs that know they don't need to use
telldir/seekdir, they can set this flag, which will allow the above
optimization to be used on large directories.


The other thing we could potentially do, if we really cared about this
issue in ext3/4, would be to define a whole new (incompatible)
directory indexing format which avoided this problem. It would be
an INCOMPAT feature, so it would be a while before it could get used
in practice, which is why I'm not entirely convinced it's worth it ---
especially since I suspect just doing (1) would solve the problem for
the vast majority of ext4's users.


- Ted

2012-03-11 10:30:37

by Andreas Dilger

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 2012-03-09, at 9:48 PM, Ted Ts'o wrote:
> On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
>>
>> Just reading this on the plane, so I can't find the exact reference
>> that I want, but a solution to this problem with htree was discussed
>> a few years ago between myself and Coly Li.
>>
>> The basic idea is that for large directories the inode allocator
>> starts by selecting a range of (relatively) free inodes based on the
>> current directory size, and then piecewise maps the hash value for
>> the filename into this inode range and uses that as the goal inode.
>
> I've heard you sketch out this idea before, but it's not been clean to
> me how well it would work in practice. The potential downside is that
> it fragments the inode table, such that a single inode table block
> might not have as many inodes stored in it as we might otherwise
> would. (Ideally, with 256 byte inodes, there would 16 of them in each
> 4k block.) This is something I'd definitely recommend modelling in
> simulation to see how well it works first.

Very true, I was thinking this as I wrote the email.

> I'd observe that if we knew in advance how many files would be in a
> particular directory (i.e., via a hint from userspace at mkdir time),
> then it would be possible to optimize this case much more easily. In
> fact, if we had perfect knowledge --- if the userspace process could
> feed us the exact set of filenames that will be used in the directory,
> plus the exact file sizes for each of the file names --- then it would
> be possible to allocate inode numbers and starting block numbers such
> that when the files are read in readdir() order, the inode numbers and
> block numbers would line up perfectly into sequential writes.

Except POSIX doesn't allow anything close to this at all. Something like
fallocate() for directories would at least give the kernel a hint about
how large the directory is, but fewer userspace tools have this information
than the file size (e.g. tar doesn't really store a "directory", just a
bunch of pathnames that happen to have largely the same components).

In the cases where size IS known in advance, or at least if the directory
is going to be large, an allocation policy like mine would essentially
"predict" where a uniformly distributed hash function will allocate the
inodes for better reading order.

> Of course, the trade off is that by optimizing for read order, the
> write order will tend to get painful as well! So there's a tension
> here; making the read part of the benchmark faster will make the
> process of writing the directory hierarchy require more seeks --- and
> this assuming that you know file names and sizes ahead of time.
>
> (Unless, of course, you play the same game that Hans Reiser did when
> he cooked his "tar" benchmark by constructing a kernel source tarball
> where the file order in the tarball was perfectly ordered to guarantee
> the desired result given Reiser4's hash! :-)

No, I definitely don't want to go down that path... Optimization for the
sake of benchmarks is broken.

> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value),

Yes, this would be reasonable, and I suspect it could be a lot larger than
256kB. Probably making it variable based on the amount of RAM would be reasonable.
Looking at http://www.pdsi-scidac.org/fsstats/ (which is listed as HPC,
but is a bit old and the filesystems aren't much larger than large SATA
drives today :-) having a 1MB cache would handle virtually every directory
in that survey (99.999%).

The unfortunate part is that this will help small directories, where the
problem is already largely hidden by readahead and the page cache, but it
doesn't help at all for huge directories that do not fit into cache. That
is where my approach would do well, but the difficulty is in knowing at
create time how large the directory will be.

> for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents
> are then sorted by inode number (or loaded into an rbtree ordered by
> inode number), and returned back to userspace in the inode order via
> readdir(). The directory contents will be released on a closedir() or
> rewinddir().

What would the telldir() cookie be in this case? The inode number, or
the file descriptor? What happens if the directory size crosses the
threshold while the cache is in use? I guess in that case the cache
could still be used (according to POSIX) because readdir() doesn't need
to report entries added after it started.

> 2) For files larger than this size, we don't want to hold that much
> kernel memory during an opendir(), so fall back to ext4's current
> scheme.
>
> 3) If we want do to better than that, we could create new flag
> O_NOTELLDIR, which can be set via fcntl. (This way programs can use
> do something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd,
> SETFL, O_NOTELLDIR);"). For files who know they don't need to use
> telldir/seekdir, they can set this flag, which will allow the above
> optimization to be used on large files.

Probably a bit messy. Since telldir() is rarely used outside NFS (AFAIK,
though I recall something dumb happening in GNU libc at one point that
was doing a telldir() and seekdir() for every entry) we could assume no
telldir() users until it actually was used on the filesystem, and that
could set a hint in the superblock EXT2_FLAGS section. The only userspace
programs I could find that call telldir() are smbd and nmbd (and related
tools) and perl (not sure when or why).

> The other thing we could potentially do, if we really cared about this
> issue in ext3/4, would be to define whole new (incompatible) storing
> the directory indexing format which avoided this problem. It would be
> an INCOMPAT feature, so it would be a while before it could get used
> in practice, which is why I'm not entirely convinced it's worth it ---
> especially since I suspect just doing (1) would solve the problem for
> the vast majority of ext4's users.

Agreed. I don't know that we need to go to these extremes.

Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Lustre Engineer http://www.whamcloud.com/





2012-03-11 16:13:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Sun, Mar 11, 2012 at 04:30:37AM -0600, Andreas Dilger wrote:
> > if the userspace process could
> > feed us the exact set of filenames that will be used in the directory,
> > plus the exact file sizes for each of the file names...
>
> Except POSIX doesn't allow anything close to this at all. Something like
> fallocate() for directories would at least give the kernel a hint about
> how large the directory is, but fewer userspace tools have this information
> than the file size (e.g. tar doesn't really store a "directory", just a
> bunch of pathnames that happen to have largely the same components).

Yes, this was purely a thought-experiment idea (although we could
try to extend the interface if it really netted enough gains). My
point was that even if we could do this, we really are trading off
write performance for read performance (since you would now be doing
the extra seeking on the tar extraction phase).

Ultimately, while I believe it makes sense to optimize the "readdir +
seek" case, I don't believe it makes sense to optimize the "copy out
the directory using tar" benchmark --- this is much more of a benchmark
workload than something that happens a huge amount in real life. Even
people who do backups (which, unfortunately, as we know is a too-small
percentage of the ext4 user base) will generally prefer that the "real"
use case for the file system be fast rather than the backup phase. And
in point of fact, as the file system gets more fragmented, even if some
file systems can offer perfect name -> block correlation, I'll bet
their correlation breaks down over time, so ext4's disadvantage over
those file systems will disappear in real-life usage. So I don't
believe it makes that much sense to worry overly about the "name ->
block" correlation for freshly formatted file systems. It's a
nice-to-have, but I really think that's more of a benchmark game than
anything else.

> In the cases where size IS known in advance, or at least if the directory
> is going to be large, an allocation policy like mine would essentially
> "predict" where a uniformly distributed hash function will allocate the
> inodes for better reading order.

Well, yes; if we knew in advance how many files (roughly) would be
stored in a directory, there are some optimizations we could do, the
simplest of which is preallocating the directory blocks so they are
contiguous on disk. I'm not persuaded yet that the optimizations are
worth the effort of trying to define a new interface and then trying
to convince application programmers to use a hypothetical interface
extension *just* for this purpose, though.

> The unfortunate part is that this will help small directories, where the
> problem is already largely hidden by readahead and the page cache, but it
> doesn't help at all for huge directories that do not fit into cache. That
> is where my approach would do well, but the difficulty is in knowing at
> create time how large the directory will be.

Well, my goal in proposing this optimization is that it helps for the
"medium size" directories in the cold cache case. The ext4 user who
first kicked off this thread was using his file system for an SVN
server, as I recall. I could easily believe that he has thousands of
files; maybe even tens of thousands of files in a directory. But that
probably still fits in 256k, or at best 512k worth of directory blocks.

It definitely won't help for the really large directories, yes; but how
common are they? The most common case I can think of is squid caches,
but you can configure a larger number of 1st and 2nd level
subdirectories to keep the actual directory size smaller.

> > for directories which are less than
> > this threshold size, the entire directory is sucked in after the first
> > readdir() after an opendir() or rewinddir(). The directory contents
> > are then sorted by inode number (or loaded into an rbtree ordered by
> > inode number), and returned back to userspace in the inode order via
> > readdir(). The directory contents will be released on a closedir() or
> > rewinddir().
>
> What would the telldir() cookie be in this case? The inode number, or
> the file descriptor? What happens if the directory size crosses the
> threshold while the cache is in use? I guess in that case the cache
> could still be used (according to POSIX) because readdir() doesn't need
> to report entries added after it started.

The telldir() cookie can simply be an ordinal -- so "7" means the 7th
directory entry in the rbtree. And all we need to do, which obeys
POSIX as you've noted, is that we instantiate the rbtree during the
first readdir() after an opendir() or rewinddir(), and free the old
rbtree when we get the rewinddir(), reach the end of the directory
stream, or get the closedir().

> > 3) If we want do to better than that, we could create new flag
> > O_NOTELLDIR, which can be set via fcntl. (This way programs can use
> > do something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd,
> > SETFL, O_NOTELLDIR);"). For files who know they don't need to use
> > telldir/seekdir, they can set this flag, which will allow the above
> > optimization to be used on large files.
>
> Probably a bit messy. Since telldir() is rarely used outside NFS (AFAIK,
> though I recall something dumb happening in GNU libc at one point that
> was doing a telldir() and seekdir() for every entry) we could assume no
> telldir() users until it actually was used on the filesystem, and that
> could set a hint in the superblock EXT2_FLAGS section. The only userspace
> programs I could find that call telldir() are smbd and nmbd (and related
> tools) and perl (not sure when or why).

By the time a process uses telldir(), we're midway through a readdir()
stream, and so by that point it's too late to switch back to a
seekdir-friendly order. So what do we do at that point? If we
continue returning directory entries where we read in each directory
256kb at a time, sorting by inode number, and then returning dirents
in inode number order, then anytime there was a telldir() during a
256kb chunk, we wouldn't be able to discard that rbtree, but rather we
would have to keep it in memory until the next rewinddir(). So a
process which opens a large directory, and then calls telldir() a
huge number of times could end up pinning a large amount of kernel
memory.

We could, of course, set some kind of resource limit on this, and if a
process behaves too badly by calling telldir in a way that it has
pinned too much kernel memory, we could potentially kill the process.
We could then tell application programmers that if they really want to
use telldir() a huge amount, they need to tell us in advance via some
kind of process flag. That is, we could default to being
telldir-unfriendly instead of defaulting to telldir-friendly. If we
really think it's only Samba and Perl, and that we can convince them
to give us this hint, then we might get away with it.

I had been assuming though that we would have to default to
telldir-friendly, which is why I suggested the O_NOTELLDIR flag.

- Ted

2012-03-13 19:05:59

by Phillip Susi

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 3/9/2012 11:48 PM, Ted Ts'o wrote:
> I suspect the best optimization for now is probably something like
> this:
>
> 1) Since the vast majority of directories are less than (say) 256k
> (this would be a tunable value), for directories which are less than
> this threshold size, the entire directory is sucked in after the first
> readdir() after an opendir() or rewinddir(). The directory contents
> are then sorted by inode number (or loaded into an rbtree ordered by
> inode number), and returned back to userspace in the inode order via
> readdir(). The directory contents will be released on a closedir() or
> rewinddir().

Why not just separate the hash table from the conventional, mostly in
inode order directory entries? For instance, the first 200k of the
directory could be the normal entries that would tend to be in inode
order ( and e2fsck -D would reorder ), and the last 56k of the directory
would contain the hash table. Then readdir() just walks the directory
like normal, and namei() can check the hash table.


2012-03-13 19:53:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Tue, Mar 13, 2012 at 03:05:59PM -0400, Phillip Susi wrote:
> Why not just separate the hash table from the conventional, mostly
> in inode order directory entries? For instance, the first 200k of
> the directory could be the normal entries that would tend to be in
> inode order ( and e2fsck -D would reorder ), and the last 56k of the
> directory would contain the hash table. Then readdir() just walks
> the directory like normal, and namei() can check the hash table.

Because that would be a format change.

What we have today is not a hash table; it's a hashed tree, where we
use a fixed-length key for the tree based on the hash of the file
name. Currently the leaf nodes of the tree are the directory blocks
themselves; that is, the lowest level of the index blocks tells you to
look at directory block N, where that block contains the directory
entries for those file names which hash into a particular range (say,
between 0x2325777A and 0x2325801).

If we aren't going to change the ordering of the directory entries,
that means we would need to change things so the leaf nodes contain
the actual directory file names themselves, so that we know whether or
not we've hit the correct entry before we go to read in a
specific directory block (otherwise, you'd have problems dealing with
hash collisions). But in that case, instead of storing the pointer to
the directory entry, since the bulk of the size of a directory entry
is the filename itself, you might as well store the inode number in
the tree itself, and be done with it.

And in that case, since you are replicating the directory information
twice over, and it's going to be an incompatible format change anyway,
you might as well just store the second copy of the directory entries
in *another* btree, except this one is indexed by inode number, and
then you use the second tree for readdir(), and you make the
telldir/seekdir cookie be the inode number. That way readdir() will
always return results in a stat() optimized order, even in the face of
directory fragmentation and file system aging.

**This** is why telldir/seekdir is so evil; in order to do things
correctly in terms of the semantics of readdir() in the face of
telldir/seekdir and file names getting inserted and deleted into the
btree, and the possibility for tree splits/rotations, etc., and the
fact that the cookie is only 32 or 64 bits, you essentially either
need to just do something stupid and have a linear directory a la ext2
and V7 Unix, or you need to store the directory information twice over
in redundant b-trees.

Or, userspace needs to do the gosh-darned sorting by inode, or we do
some hack such as only sorting the inodes using non-swapping kernel
memory if the directory is smaller than some arbitrary size, such as
256k or 512k.
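
(For what it's worth, the userspace version of that sorting is tiny; a
minimal sketch, which is essentially what the sort-by-inode-number advice
earlier in the thread amounts to:)

import os

def scan_in_inode_order(path):
    # Pull the whole directory in first, then sort by inode number so the
    # following stat() pass walks the inode table mostly sequentially.
    entries = sorted(os.scandir(path), key=lambda e: e.inode())
    for entry in entries:
        os.lstat(entry.path)  # replace with whatever per-file work is needed
    return entries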

- Ted

2012-03-13 20:22:55

by Phillip Susi

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 3/13/2012 3:53 PM, Ted Ts'o wrote:
> Because that would be a format change.

I think a format change would be preferable to runtime sorting.

> What we have today is not a hash table; it's a hashed tree, where we
> use a fixed-length key for the tree based on the hash of the file
> name. Currently the leaf nodes of the tree are the directory blocks
> themselves; that is, the lowest level of the index blocks tells you to
> look at directory block N, where that directory contains the directory
> indexes for those file names which are in a particular range (say,
> between 0x2325777A and 0x2325801).

So the index nodes contain the hash ranges for the leaf block, but the
leaf block only contains the regular directory entries, not a hash for
each name? That would mean that adding or removing names would require
moving around the regular directory entries wouldn't it?

> If we aren't going to change the ordering of the directory directory,
> that means we would need to change things so the leaf nodes contain
> the actual directory file names themselves, so that we know whether or
> not we've hit the correct entry or not before we go to read in a
> specific directory block (otherwise, you'd have problems dealing with
> hash collisions). But in that case, instead of storing the pointer to
> the directory entry, since the bulk of the size of a directory entry
> is the filename itself, you might as well store the inode number in
> the tree itself, and be done with it.

I would think that hash collisions are rare enough that reading a
directory block you end up not needing once in a blue moon would be
chalked up under "who cares". So just stick with hash, offset pairs to
map the hash to the normal directory entry.


2012-03-13 21:33:07

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
>
> I think a format change would be preferable to runtime sorting.

Are you volunteering to spearhead the design and coding of such a
thing? Run-time sorting is backwards compatible, and a heck of a lot
easier to code and test...

The reality is we'd probably want to implement run-time sorting
*anyway*, for the vast majority of people who don't want to convert to
a new incompatible file system format. (Even if you can do the
conversion using e2fsck --- which is possible, but it would be even
more code to write --- system administrators tend to be very
conservative about such things, since they might need to boot an older
kernel, or use a rescue CD that doesn't have an uptodate kernel or
file system utilities, etc.)

> So the index nodes contain the hash ranges for the leaf block, but
> the leaf block only contains the regular directory entries, not a
> hash for each name? That would mean that adding or removing names
> would require moving around the regular directory entries wouldn't
> it?

They aren't sorted in the leaf block, so we only need to move around
regular directory entries when we do a node split (and at the moment
we don't support shrinking directories), so we don't have to worry
about the reverse case.

> I would think that hash collisions are rare enough that reading a
> directory block you end up not needing once in a blue moon would be
> chalked up under "who cares". So just stick with hash, offset pairs
> to map the hash to the normal directory entry.

With a 64-bit hash, and if we were actually going to implement this as
a new incompatible feature, you're probably right in terms of
accepting the extra directory block search.

We would still have to implement the case where hash collisions *do*
exist, though, and make sure the right thing happens in that case.
Even if the chance of that happening is 1 in 2**32, with enough
deployed systems (i.e., every Android handset, etc.) it's going to
happen in real life.

- Ted





2012-03-14 02:48:18

by Yongqiang Yang

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

What if we use inode number as the hash value? Does it work?

Yongqiang.

2012-03-14 02:51:08

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 10:48:17AM +0800, Yongqiang Yang wrote:
> What if we use inode number as the hash value? Does it work?

The whole point of using the tree structure is to accelerate filename
-> inode number lookups. So the namei lookup doesn't have the inode
number; the whole point is to use the filename to lookup the inode
number. So we can't use the inode number as the hash value since
that's what we are trying to find out.

We could do this if we have two b-trees, one indexed by filename and
one indexed by inode number, which is what JFS (and I believe btrfs)
does.

- Ted

2012-03-14 08:12:09

by Lukas Czerner

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Tue, 13 Mar 2012, Ted Ts'o wrote:

> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
> >
> > I think a format change would be preferable to runtime sorting.
>
> Are you volunteering to spearhead the design and coding of such a
> thing? Run-time sorting is backwards compatible, and a heck of a lot
> easier to code and test...

Actually I am in the process of creating some projects for the local
universities and this certainly sounds like something that some student
can do. I have already put that into the proposal, so we might have
someone looking into this in the near future. Since the problem has been
there since forever, I think that it's not that critical to solve it
right away.

I kind of like the idea of having a separate btree with inode
numbers for directory reading, just because it does not affect
allocation policy nor write performance, which is a good thing. Also
it has been done before in btrfs and it works very well for them. The
only downside (aside from the format change) is the extra step when
removing a directory entry, but the positives outweigh the negatives.


Maybe this could even be done in a way which does not require a format
change. We can have a new inode flag (say EXT4_INUM_INDEX_FL) which will
tell us that there is an inode number btree to use on directory reads.
Then the pointer to the root of that tree would be stored at the end of
the root block of the htree (of course, if there is still some space for
it) and the rest is easy.

So if we mount an old file system with the new code, the flag will not
be there, but newly added directories could use it anyway. Also
e2fsck could easily add the inumber tree into a directory if it is
missing.

If we mount the new file system with the old code (the one which does
not know about this feature), everything would stay the same and we
would not touch the inumber tree at all. Of course there is a
possibility that we would rewrite the end of the root htree, overwriting
the pointer to the inumber tree and effectively losing one block. We
would not actually care that we rewrote this information because we
do not actually need it, so the only downside is the one block lost,
which is not that bad I guess.

Now, say we have rewritten the pointer to the inumber tree and we're
going to mount it with the new code again. It would expect the inumber
tree to be there, but not blindly. Some kind of magic information would
have to be stored in the inumber root pointer so we can be sure that it
really is what we're looking for, and if it is not there, well, then we
do not care and walk the directory in the old way. And we can always
create the inumber tree in new directories. And again, e2fsck should be
able to fix that for us as well.
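
A hypothetical sketch of that fallback logic (none of these names exist
in ext4; EXT4_INUM_INDEX_FL comes from the paragraph above, and
INUM_ROOT_MAGIC plus the structures are invented for illustration)
might look like this:

/*
 * Hypothetical sketch: if the flag is set and the pointer stored at the
 * end of the htree root block still carries the expected magic, walk
 * the inode-number tree; otherwise silently fall back to the existing
 * readdir path.
 */
#include <stdbool.h>
#include <stdint.h>

#define EXT4_INUM_INDEX_FL  0x00000001u  /* arbitrary bit, this sketch only */
#define INUM_ROOT_MAGIC     0x494e554du  /* "INUM", invented */

struct inum_root_ptr {          /* stored at the tail of the htree root block */
    uint32_t magic;
    uint32_t root_block;        /* logical block of the inumber tree root */
};

struct dir_ctx {
    uint32_t inode_flags;
    struct inum_root_ptr inum;  /* as read from the root block tail */
};

/* Decide which walk to use for this readdir call. */
static bool use_inumber_tree(const struct dir_ctx *dir)
{
    if (!(dir->inode_flags & EXT4_INUM_INDEX_FL))
        return false;           /* old directory, old walk */
    if (dir->inum.magic != INUM_ROOT_MAGIC)
        return false;           /* clobbered by an old kernel */
    return true;                /* trust the secondary tree */
}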

So that is just an idea; I have not looked at the actual code to see if
it would really be possible, but it certainly seems like it. Am I
missing something?


Anyway, if there is going to be a format change, I agree that having a
solution for those who are not comfortable with the format change is
equally important. But maybe there will be a better solution which does
not require the format change.

Thanks!
-Lukas

>
> The reality is we'd probably want to implement run-time sorting
> *anyway*, for the vast majority of people who don't want to convert to
> a new incompatible file system format. (Even if you can do the
> conversion using e2fsck --- which is possible, but it would be even
> more code to write --- system administrators tend to be very
> conservative about such things, since they might need to boot an older
> kernel, or use a rescue CD that doesn't have an uptodate kernel or
> file system utilities, etc.)
>
> > So the index nodes contain the hash ranges for the leaf block, but
> > the leaf block only contains the regular directory entries, not a
> > hash for each name? That would mean that adding or removing names
> > would require moving around the regular directory entries wouldn't
> > it?
>
> They aren't sorted in the leaf block, so we only need to move around
> regular directory entries when we do a node split (and at the moment
> we don't support shrinking directories), so we don't have to worry the
> reverse case.
>
> > I would think that hash collisions are rare enough that reading a
> > directory block you end up not needing once in a blue moon would be
> > chalked up under "who cares". So just stick with hash, offset pairs
> > to map the hash to the normal directory entry.
>
> With a 64-bit hash, and if we were actually going to implement this as
> a new incompatible feature, you're probably right in terms of
> accepting the extra directory block search.
>
> We would still have to implement the case where hash collisions *do*
> exist, though, and make sure the right thing happens in that case.
> Even if the chance of that happening is 1 in 2**32, with enough
> deployed systems (i.e., every Android handset, etc.) it's going to
> happen in real life.
>
> - Ted

2012-03-14 09:29:47

by Yongqiang Yang

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 4:12 PM, Lukas Czerner <[email protected]> wrote:
> On Tue, 13 Mar 2012, Ted Ts'o wrote:
>
>> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
>> >
>> > I think a format change would be preferable to runtime sorting.
>>
>> Are you volunteering to spearhead the design and coding of such a
>> thing? Run-time sorting is backwards compatible, and a heck of a lot
>> easier to code and test...
>
> Actually I am in the process of creating some projects for the local
> universities and this certainly sounds like something that some student
> can do. I have already put that into the proposal, so we might have
> someone looking into this in the near future. Since the problem has been
> there since forever, I think that it's not that critical to solve it
> right away.
>
> I kind of like the idea about having the separate btree with inode
> numbers for the directory reading, just because it does not affect
> allocation policy nor the write performance which is a good thing. Also
> it has been done before in btrfs and it works very well for them. The
> only downside (aside from format change) is the extra step when removing
> the directory entry, but the positives outperform the negatives.
>
>
> Maybe this might be even done in a way which does not require format
> change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will
> tell us that there is a inode number btree to use on directory reads.
> Then the pointer to the root of that tree would be stored at the end of
> the root block of the hree (of course if there is still some space for
> it) and the rest is easy.
dx_root takes 32 bytes from the 60 bytes of i_data. The beginning of
dx_entries is controlled by info_length, that is, dx_root->info +
dx_root->info->info_length. It seems that we can put the new tree
root behind dx_root and add the length of the new tree root to
info_length. Then the old code can handle the new filesystem and the
new code can handle the old filesystem because of the flag you
described above.
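
For reference, here is a userspace restatement of the dx_root header
being discussed, modeled on the definitions in fs/ext4/namei.c of this
era (kernel __le32/u8 types replaced with stdint types; the _sketch
suffix marks it as illustrative):

/*
 * The "." and ".." fake dirents plus dx_root_info add up to 32 bytes,
 * and the dx entries start at info + info_length, so growing
 * info_length is what would let a new tree root hide behind dx_root
 * without confusing old code.
 */
#include <stdint.h>

struct fake_dirent_sketch {
    uint32_t inode;
    uint16_t rec_len;
    uint8_t  name_len;
    uint8_t  file_type;
};

struct dx_root_info_sketch {
    uint32_t reserved_zero;
    uint8_t  hash_version;
    uint8_t  info_length;    /* 8 today; would grow to cover a new root */
    uint8_t  indirect_levels;
    uint8_t  unused_flags;
};

struct dx_root_sketch {
    struct fake_dirent_sketch dot;      /* "." entry */
    char dot_name[4];
    struct fake_dirent_sketch dotdot;   /* ".." entry */
    char dotdot_name[4];
    struct dx_root_info_sketch info;    /* 32 bytes up to here */
    /* dx entries follow at (char *)&info + info.info_length */
};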

If no one attempts to do it, I can do it as a Google Summer of Code project.

Yongqiang.
>
> So If we mount the old file system with the new code, there will not be
> the flag, but with newly added directories it could be used anyway. Also
> e2fsck could easily add the inumber tree into the directory if it is
> missing.
>
> If we mount the new file system with the old code (the one which does
> not know about this feature), everything would stay the same and we
> would not touch the inumber tree at all. Of course there is a
> possibility that we would rewrite the end of the root htree, overwriting
> the pointer to the inumber tree and effectively losing one block. We
> would not actually care that we rewrote this information because we
> do not actually need it, so the only downside is the one block lost,
> which is not that bad I guess.
>
> Now, we have rewritten the pointer to the inumber tree and we're going
> to mount it with the new code again. It would expect the the inumber
> tree to be there, but not blindly. Some king of magic information would
> have to be stored in the inumber root pointer so we can be sure that it
> really is what we're looking for and if it is not there, well then we do
> not care and walk the directory in the old way. And we can alway create
> the inumber tree in the new directory. And again e2fsck should be able
> to fix that for us as well.
>
> So that is just an idea, I have not looked at the actual code to see it
> it would be really possible, but it certainly seems like so. Am I
> missing something ?
>
>
> Anyway, if there is going to be a format change I agree that having
> solution for those who are not comfortable with the format change is
> equally as important. But maybe there will be better solution which does
> not require the format change.
>
> Thanks!
> -Lukas
>
>>
>> The reality is we'd probably want to implement run-time sorting
>> *anyway*, for the vast majority of people who don't want to convert to
>> a new incompatible file system format. (Even if you can do the
>> conversion using e2fsck --- which is possible, but it would be even
>> more code to write --- system administrators tend to be very
>> conservative about such things, since they might need to boot an older
>> kernel, or use a rescue CD that doesn't have an uptodate kernel or
>> file system utilities, etc.)
>>
>> > So the index nodes contain the hash ranges for the leaf block, but
>> > the leaf block only contains the regular directory entries, not a
>> > hash for each name? That would mean that adding or removing names
>> > would require moving around the regular directory entries wouldn't
>> > it?
>>
>> They aren't sorted in the leaf block, so we only need to move around
>> regular directory entries when we do a node split (and at the moment
>> we don't support shrinking directories), so we don't have to worry the
>> reverse case.
>>
>> > I would think that hash collisions are rare enough that reading a
>> > directory block you end up not needing once in a blue moon would be
>> > chalked up under "who cares". So just stick with hash, offset pairs
>> > to map the hash to the normal directory entry.
>>
>> With a 64-bit hash, and if we were actually going to implement this as
>> a new incompatible feature, you're probably right in terms of
>> accepting the extra directory block search.
>>
>> We would still have to implement the case where hash collisions *do*
>> exist, though, and make sure the right thing happens in that case.
>> Even if the chance of that happening is 1 in 2**32, with enough
>> deployed systems (i.e., every Android handset, etc.) it's going to
>> happen in real life.
>>
>> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Best Wishes
Yongqiang Yang

2012-03-14 09:38:24

by Lukas Czerner

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, 14 Mar 2012, Yongqiang Yang wrote:

> On Wed, Mar 14, 2012 at 4:12 PM, Lukas Czerner <[email protected]> wrote:
> > On Tue, 13 Mar 2012, Ted Ts'o wrote:
> >
> >> On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
> >> >
> >> > I think a format change would be preferable to runtime sorting.
> >>
> >> Are you volunteering to spearhead the design and coding of such a
> >> thing? Run-time sorting is backwards compatible, and a heck of a lot
> >> easier to code and test...
> >
> > Actually I am in the process of creating some projects for the local
> > universities and this certainly sounds like something that some student
> > can do. I have already put that into the proposal, so we might have
> > someone looking into this in the near future. Since the problem has been
> > there since forever, I think that it's not that critical to solve it
> > right away.
> >
> > I kind of like the idea about having the separate btree with inode
> > numbers for the directory reading, just because it does not affect
> > allocation policy nor the write performance which is a good thing. Also
> > it has been done before in btrfs and it works very well for them. The
> > only downside (aside from format change) is the extra step when removing
> > the directory entry, but the positives outperform the negatives.
> >
> >
> > Maybe this might be even done in a way which does not require format
> > change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will
> > tell us that there is a inode number btree to use on directory reads.
> > Then the pointer to the root of that tree would be stored at the end of
> > the root block of the hree (of course if there is still some space for
> > it) and the rest is easy.
> dx_root takes 32 bytes from 60 bytes of i_data. The beginning of
> dx_entries is controlled by info_length, that is, dx_root->info +
> dx_root->info->info_length. It seems that we can put the new tree
> root behind dx_root and add the length of the new tree root to
> info_length. Then the old code can handle new filesystem and new code
> can handle old filesystem because of the flag you described above.

Yes, there might be other places where we can store the new root. The
first idea was just off the top of my head; we can certainly polish it a
bit more.

>
> If no one attempts to do it , I can do it as a google summer of code project.

As I wrote in the first paragraph, this project is already taken :). It
will be a student project for our local universities. Actually I have
already finished the proposal. I cannot stop you from doing it as a
SoC project, although I am not sure that in this case it would be good
to have two people working on exactly the same problem.

Thanks!
-Lukas

>
> Yongqiang.
> >
> > So If we mount the old file system with the new code, there will not be
> > the flag, but with newly added directories it could be used anyway. Also
> > e2fsck could easily add the inumber tree into the directory if it is
> > missing.
> >
> > If we mount the new file system with the old code (the one which does
> > not know about this feature), everything would stay the same and we
> > would not touch the inumber tree at all. Of course there is a
> > possibility that we would rewrite the end of the root htree, overwriting
> > the pointer to the inumber tree and effectively losing one block. We
> > would not actually care that we rewrote this information because we
> > do not actually need it, so the only downside is the one block lost,
> > which is not that bad I guess.
> >
> > Now, we have rewritten the pointer to the inumber tree and we're going
> > to mount it with the new code again. It would expect the the inumber
> > tree to be there, but not blindly. Some king of magic information would
> > have to be stored in the inumber root pointer so we can be sure that it
> > really is what we're looking for and if it is not there, well then we do
> > not care and walk the directory in the old way. And we can alway create
> > the inumber tree in the new directory. And again e2fsck should be able
> > to fix that for us as well.
> >
> > So that is just an idea, I have not looked at the actual code to see it
> > it would be really possible, but it certainly seems like so. Am I
> > missing something ?
> >
> >
> > Anyway, if there is going to be a format change I agree that having
> > solution for those who are not comfortable with the format change is
> > equally as important. But maybe there will be better solution which does
> > not require the format change.
> >
> > Thanks!
> > -Lukas
> >
> >>
> >> The reality is we'd probably want to implement run-time sorting
> >> *anyway*, for the vast majority of people who don't want to convert to
> >> a new incompatible file system format. (Even if you can do the
> >> conversion using e2fsck --- which is possible, but it would be even
> >> more code to write --- system administrators tend to be very
> >> conservative about such things, since they might need to boot an older
> >> kernel, or use a rescue CD that doesn't have an uptodate kernel or
> >> file system utilities, etc.)
> >>
> >> > So the index nodes contain the hash ranges for the leaf block, but
> >> > the leaf block only contains the regular directory entries, not a
> >> > hash for each name? That would mean that adding or removing names
> >> > would require moving around the regular directory entries wouldn't
> >> > it?
> >>
> >> They aren't sorted in the leaf block, so we only need to move around
> >> regular directory entries when we do a node split (and at the moment
> >> we don't support shrinking directories), so we don't have to worry the
> >> reverse case.
> >>
> >> > I would think that hash collisions are rare enough that reading a
> >> > directory block you end up not needing once in a blue moon would be
> >> > chalked up under "who cares". So just stick with hash, offset pairs
> >> > to map the hash to the normal directory entry.
> >>
> >> With a 64-bit hash, and if we were actually going to implement this as
> >> a new incompatible feature, you're probably right in terms of
> >> accepting the extra directory block search.
> >>
> >> We would still have to implement the case where hash collisions *do*
> >> exist, though, and make sure the right thing happens in that case.
> >> Even if the chance of that happening is 1 in 2**32, with enough
> >> deployed systems (i.e., every Android handset, etc.) it's going to
> >> happen in real life.
> >>
> >> - Ted
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
>

--

2012-03-14 12:50:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> I kind of like the idea about having the separate btree with inode
> numbers for the directory reading, just because it does not affect
> allocation policy nor the write performance which is a good thing. Also
> it has been done before in btrfs and it works very well for them. The
> only downside (aside from format change) is the extra step when removing
> the directory entry, but the positives outperform the negatives.

Well, there will be extra (journaled!) block reads and writes involved
when adding or removing directory entries. So the performance of
workloads that do a large number of directory adds and removes will
be impacted. By how much is not obvious, but it will be something
measurable.

> Maybe this might be even done in a way which does not require format
> change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will
> tell us that there is a inode number btree to use on directory reads.
> Then the pointer to the root of that tree would be stored at the end of
> the root block of the hree (of course if there is still some space for
> it) and the rest is easy.

You can make it be a RO_COMPAT change instead of an INCOMPAT change,
yes.

And if you want to do it as simply as possible, we could just recycle
the current htree approach for the second tree, and simply store the
root in another set of directory blocks. But by putting the index
nodes in the directory blocks, masquerading as deleted directory
entries, it means that readdir() has to cycle through them and ignore
the index blocks.

The alternate approach is to use physical block numbers instead of
logical block numbers for the index blocks, and to store them separately
from the blocks containing actual directory entries. But if we do
that for the inumber tree, the next question that arises is whether we
should do that for the hash tree as well --- and once you open that
can of worms, it gets a lot more complicated.

So the question that really arises here is how wide open do we want to
leave the design space, and whether we are optimizing for the best
possible layout change ignoring the amount of implementation work it
might require, or whether we keep things as simple as possible from a
code change perspective.

There are good arguments that can be made either way, and a lot
depends on the quality of the students you can recruit, the amount of
time they have, and how much review time it will take out of the core
team during the design and implementation phase.

Regards,

- Ted

2012-03-14 14:17:41

by Zach Brown

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance


> We could do this if we have two b-trees, one indexed by filename and
> one indexed by inode number, which is what JFS (and I believe btrfs)
> does.

Typically the inode number of the destination inode isn't used to index
entries for a readdir tree because of (wait for it) hard links. You end
up right back where you started with multiple entries per key.

A painful solution is to have the key in the readdir tree allocated by
the tree itself -- count key populations in subtrees per child pointer
and use that to find free keys. You still have the problem of
correlating entry keys and the location of inodes, but at least now you
have a structure to navigate to try and do that. You can also trivially
re-use freed keys and have densely packed readdir keys that won't break
32bit f_pos so quickly.

btrfs just punts and has an incrementing 64bit counter per directory
that determines a new entry's position in readdir. Accompanied by the
obligatory "will be smarter some day" comment :).
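
A toy model of that scheme (not btrfs code; the names are invented) is
simply a per-directory 64-bit sequence that stamps each new entry with
its readdir position at create time:

/*
 * Toy illustration: each directory keeps an incrementing 64-bit
 * counter, every new entry is stamped with the next value, and readdir
 * returns entries ordered by that key, i.e. roughly in creation order
 * rather than hash order.
 */
#include <stdint.h>
#include <stdio.h>

struct toy_dir {
    uint64_t next_index;        /* per-directory incrementing counter */
};

struct toy_dirent {
    char name[32];
    uint64_t readdir_index;     /* key in the (conceptual) readdir tree */
};

static void toy_add_entry(struct toy_dir *dir, struct toy_dirent *de,
                          const char *name)
{
    snprintf(de->name, sizeof(de->name), "%s", name);
    de->readdir_index = dir->next_index++;  /* position for readdir */
}

int main(void)
{
    struct toy_dir dir = { .next_index = 2 };   /* low values reserved */
    struct toy_dirent entries[3];

    toy_add_entry(&dir, &entries[0], "zebra");
    toy_add_entry(&dir, &entries[1], "apple");
    toy_add_entry(&dir, &entries[2], "mango");

    /* Entries already sit in readdir order: creation order. */
    for (int i = 0; i < 3; i++)
        printf("%llu %s\n",
               (unsigned long long)entries[i].readdir_index,
               entries[i].name);
    return 0;
}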

- z

2012-03-14 14:28:20

by Phillip Susi

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 3/13/2012 5:33 PM, Ted Ts'o wrote:
> Are you volunteering to spearhead the design and coding of such a
> thing? Run-time sorting is backwards compatible, and a heck of a lot
> easier to code and test...

Do you really think it is that much easier? Even if it is easier, it is
still an ugly kludge. It would be much better to fix the underlying
problem rather than try to paper over it.

> The reality is we'd probably want to implement run-time sorting
> *anyway*, for the vast majority of people who don't want to convert to
> a new incompatible file system format. (Even if you can do the
> conversion using e2fsck --- which is possible, but it would be even
> more code to write --- system administrators tend to be very
> conservative about such things, since they might need to boot an older
> kernel, or use a rescue CD that doesn't have an uptodate kernel or
> file system utilities, etc.)

The same argument could have been made against the current htree
implementation when it was added. I think it carries just as little
weight now as it did then. People who care about the added performance
the new feature provides can use it, those who don't, won't. Eventually
it will become ubiquitous.

> We would still have to implement the case where hash collisions *do*
> exist, though, and make sure the right thing happens in that case.
> Even if the chance of that happening is 1 in 2**32, with enough
> deployed systems (i.e., every Android handset, etc.) it's going to
> happen in real life.

Sure it will happen, but if we read one extra block 1 in 4 billion
times, nobody is going to notice.

2012-03-14 14:34:13

by Lukas Czerner

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, 14 Mar 2012, Ted Ts'o wrote:

> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea about having the separate btree with inode
> > numbers for the directory reading, just because it does not affect
> > allocation policy nor the write performance which is a good thing. Also
> > it has been done before in btrfs and it works very well for them. The
> > only downside (aside from format change) is the extra step when removing
> > the directory entry, but the positives outperform the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance on
> workloads that do a large number of directory adds and removed will
> have to be impacted. By how much is not obvious, but it will be
> something which is measurable.

Yes, it would have to be measured in various workloads.

>
> > Maybe this might be even done in a way which does not require format
> > change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will
> > tell us that there is a inode number btree to use on directory reads.
> > Then the pointer to the root of that tree would be stored at the end of
> > the root block of the hree (of course if there is still some space for
> > it) and the rest is easy.
>
> You can make it be a RO_COMPAT change instead of an INCOMPAT change,
> yes.

Does it have to be a RO_COMPAT change though? It would be both
forward and backward compatible.

>
> And if you want to do it as simply as possible, we could just recycle
> the current htree approach for the second tree, and simply store the
> root in another set of directory blocks. But by putting the index
> nodes in the directory blocks, masquerading as deleted directories, it
> means that readdir() has to cycle through them and ignore the index
> blocks.
>
> The alternate approach is to use physical block numbers instead of
> logical block numbers for the index blocks, and to store it separately
> from the blocks containing actual directory entries. But if we do
> that for the inumber tree, the next question that arise is maybe we
> should do that for the hash tree as well --- and then once you upon
> that can of worms, it gets a lot more complicated.
>
> So the question that really arises here is how wide open do we want to
> leave the design space, and whether we are optimizing for the best
> possible layout change ignoring the amount of implementation work it
> might require, or whether we keep things as simple as possible from a
> code change perspective.

What do you expect to gain by storing the index nodes outside the
directory entries? Only to avoid reading the index nodes of both trees
when walking just one of them? If I understand it correctly, directory
blocks are laid out sequentially, so reading them in a cold cache state
should not pose much trouble, I think. Nor should iterating through all
of the index nodes. However, it would have to be measured first.

>
> There are good arguments that can be made either way, and a lot
> depends on the quality of the students you can recruit, the amount of
> time they have, and how much review time it will take out of the core
> team during the design and implementation phase.

I am always hoping for the best :). We have had good experience with
the students here, but you never know that in advance. They'll have
the whole year to get through this, which is plenty of time, but the
question of course is whether they'll use that time wisely :). But
anyway, I think it is definitely worth a try.

Thanks!
-Lukas

>
> Regards,
>
> - Ted
>

--

2012-03-14 16:48:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
>
> >We could do this if we have two b-trees, one indexed by filename and
> >one indexed by inode number, which is what JFS (and I believe btrfs)
> >does.
>
> Typically the inode number of the destination inode isn't used to index
> entries for a readdir tree because of (wait for it) hard links. You end
> up right back where you started with multiple entries per key.

Well, if you are using 32-bit (or even 48-bit) inode numbers and a
64-bit telldir cookie, it's possible to make the right thing happen.
But yes, if you are using 32-bit inode numbers and a 32-bit telldir
cookie, dealing with what happens when you have multiple hard links to
the same inode in the same directory gets tricky.

> A painful solution is to have the key in the readdir tree allocated by
> the tree itself -- count key populations in subtrees per child pointer
> and use that to find free keys.

One thing that might work is to have an extra 16-bit field in the
directory entry that gives a signed offset to the inode number, such
that inode+offset is a unique value within the btree sorted by
inode+offset. Since this tree is only used for returning entries in an
optimal (or as close to optimal as can be arranged) order, we could get
away with that.
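
A rough sketch of that idea (the structures and the linear probe are
invented for illustration; a real implementation would consult the
on-disk tree itself) could pick the per-entry offset like this:

/*
 * Sketch: each entry carries a small offset chosen at link time so
 * that ino + offset is unique among the readdir keys already present
 * in the directory.
 */
#include <stdbool.h>
#include <stdint.h>

struct rd_entry {
    uint64_t ino;
    int16_t  offset;    /* the proposed 16-bit disambiguator */
};

/* Key used to sort the readdir tree; offsets stay non-negative here. */
static uint64_t entry_key(const struct rd_entry *e)
{
    return e->ino + (uint64_t)e->offset;
}

static bool key_in_use(const struct rd_entry *tree, int n, uint64_t key)
{
    for (int i = 0; i < n; i++)
        if (entry_key(&tree[i]) == key)
            return true;
    return false;
}

/* Pick an offset so that ino + offset is unique; -1 if we give up. */
static int pick_offset(const struct rd_entry *tree, int n, uint64_t ino)
{
    for (int16_t off = 0; off < INT16_MAX; off++)
        if (!key_in_use(tree, n, ino + (uint64_t)off))
            return off;
    return -1;  /* tens of thousands of colliding links; bail out */
}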

- Ted

2012-03-14 16:54:05

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 10:28:20AM -0400, Phillip Susi wrote:
>
> Do you really think it is that much easier? Even if it is easier,
> it is still an ugly kludge. It would be much better to fix the
> underlying problem rather than try to paper over it.

I don't think the choice is obvious. A solution which solves the
problem 90% plus of the time for real-life workloads, without
requiring any file system format changes, is very appealing to me and,
I think, to many customers.

That being said, I can see how it would grate on the nerves of someone
who is looking for perfection.

> The same argument could have been made against the current htree
> implementation when it was added. I think it carries just as little
> weight now as it did then. People who care about the added
> performance the new feature provides can use it, those who don't,
> won't. Eventually it will become ubiquitous.

The original htree implementation tried very hard to be fully
backwards compatible for reading and writing. At the time that was
considered very important, and for some folks it was a major killer
feature. The original ext3 journalling support was designed in a way
where you could back out and mount such a file system as ext2, and
that was considered important back then, too.

The fact that we've been so successful with ext4's upgrade to features
that require RO_COMPAT or INCOMPAT flags does make me more willing
to consider such features, but the reality is that the deployment
time for a new feature like this is measured in 2-3 years, minimum. So
I won't rule out making such a change, but I wouldn't let the fact that
someone wants to sign up to implement a long-term feature be mutually
exclusive with a short-term solution which solves most of the problem
and could be rolled out much more quickly.

- Ted

2012-03-14 17:02:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 03:34:13PM +0100, Lukas Czerner wrote:
> >
> > You can make it be a RO_COMPAT change instead of an INCOMPAT change,
> > yes.
>
> Does it have to be RO_COMPAT change though ? Since this would be both
> forward and backward compatible.

The challenge is how do you notice if the file system is mounted on an
older kernel, which then inserts a directory entry without updating
the secondary tree. The older kernel won't know about the new inode
flag, so it can't clear the flag when it modifies the directory.

We were able to get away with making the original htree feature
read/write compatible because the design for it was anticipated far in
advance, and because it was before the days of enterprise kernels that
had to be supported for seven years. So we added code into ext3 to
clear the htree flag whenever the directory was modified, something
like two years before the htree code made its appearance, and back
then we decided it was fair to assume no one would be using a kernel
that old, or be jumping back and forth between an ancient kernel and a
modern kernel with htree enabled. Yes, that was playing a bit fast
and loose, but early on in the kernel's history we could do that. It's
not something I would do today.

The bigger deal is that, as Zach pointed out, we can't just index it by
inode number, because we have to worry about hard links. Which means
we need either an explicit counter field added to the directory entry,
or some kind of new offset field. That's something we can't just
shoehorn in while retaining backwards compatibility.

And once we break backwards compatibility, we might as well look at
the full design space and see what is the most optimal.

- Ted

2012-03-14 17:37:07

by Zach Brown

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On 03/14/2012 12:48 PM, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
>>
>>> We could do this if we have two b-trees, one indexed by filename and
>>> one indexed by inode number, which is what JFS (and I believe btrfs)
>>> does.
>>
>> Typically the inode number of the destination inode isn't used to index
>> entries for a readdir tree because of (wait for it) hard links. You end
>> up right back where you started with multiple entries per key.
>
> One thing that might work is to have a 16-bit extra field in the
> directory entry that gives an signed offset to the inode number so
> that such that inode+offset is a unique value within the btree sorted
> by inode+offset number. Since this tree is only used for returning
> entries in an optimal (or as close to optimal as can be arranged)
> order, we could get away with that.

Yeah, at least that's nice and simple. Personally, I'm always nervous
when we build in new limits like this. Some joker somewhere always
manages to push up against them.

But such is life. Maybe it's the right thing in the ext* design space.
The fixed number of possible inodes certainly makes it easier to pack
more inode offsets into the telldir cookie.

- z

2012-03-14 19:17:46

by Chris Mason

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Wed, Mar 14, 2012 at 08:50:02AM -0400, Ted Ts'o wrote:
> On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> > I kind of like the idea about having the separate btree with inode
> > numbers for the directory reading, just because it does not affect
> > allocation policy nor the write performance which is a good thing. Also
> > it has been done before in btrfs and it works very well for them. The
> > only downside (aside from format change) is the extra step when removing
> > the directory entry, but the positives outperform the negatives.
>
> Well, there will be extra (journaled!) block reads and writes involved
> when adding or removing directory entries. So the performance on
> workloads that do a large number of directory adds and removed will
> have to be impacted. By how much is not obvious, but it will be
> something which is measurable.

Most of the difference in btrfs rm times comes from our double indexing.
We store the directory name 3 times, once for each dir index and once
in the inode backref. But the inode backref doesn't generate seeks the
same way the extra directory index does, so it is probably only 10% of
the cost.

We also delete file data crcs during rm, so if you're benchmarking
things to see how much it will hurt, benchmark btrfs on zero length
files.

In terms of other costs, most users of readdir will eventually wander
off and stat or open the files in the dir. With your current indexes, the stat
reads the same directory blocks that readdir used.

With the double index, the stat reads the name index instead of the
readdir index. For the common case of normal directories, today's ext4
will have the directory blocks in cache when stat is called. Btrfs has
to do extra reads to pull in the name index.

Josef worked around this by prefilling a dentry during readdir with
details on where to find the inode.

Either way, I'd definitely recommend sitting down with the xfs directory
indexes and looking at how they manage locality without the double
indexing. It's faster than btrfs and without the penalty.

It's also worth mentioning that on an SSD, there is no benefit at all to
the extra index. So when the industry finally floods the markets with
hybrid PCM shingled drives, this might all be moot.

-chris

2012-03-15 07:59:39

by Jacek Luczak

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

2012/3/10 Ted Ts'o <[email protected]>:
> Hey Jacek,
>
> I'm curious about the parameters of the set of directories on your
> production server. On an ext4 file system, assuming you've copied the
> directories over, what are the result of this command pipeline when
> you are cd'ed into the top of the directory hierarchy of interest
> (your svn tree, as I recall)?
>
> find . -type d -ls | awk '{print $7}' | sort -n | uniq -c
>
> I'm just curious what the distribution of directories sizes are in
> practice for your use case...
>

Hi Ted,

below is from the latest upstream on ext4:

843341 4096
165 12288
29 16384
164 20480
26 24576
22 28672
24 32768
118 36864
12 40960
19 45056
3 49152
12 53248
37 57344
10 61440
2 65536
13 69632
1 131072
1 139264
1 143360
1 221184
1 249856
1 258048
1 262144
1 319488
2 335872
1 339968

-Jacek

2012-03-15 10:42:26

by Jacek Luczak

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

2012/3/11 Ted Ts'o <[email protected]>:

>
> Well, my goal in proposing this optimization is that it helps for the
> "medium size" directories in the cold cache case. The ext4 user who
> first kicked off this thread was using his file system for an SVN
> server, as I recall. I could easily believe that he has thousands of
> files; maybe even tens of thousands of files in a directory. But that
> probably still fits in 256k, or at best 512k worth of directory blocks.
>

That was not an SVN server. It was a build host holding checkouts of SVN
projects.

The many files/dirs case is common for VCS, and SVN is not the only
one that would be affected here. AFAIR git.kernel.org was also suffering
from getdents(). The same applies to commercial products that are
heavily stuffed with many files/dirs, e.g. ClearCase or Synergy. In
the latter case we have been running on XFS for a while now, and I
must say that this was a huge improvement, mostly in backup times
(where a dump of Informix and the whole filesystem is ordinarily
compressed into a tarball).

The medium size you are referring to would most probably fit into 256k,
and this could be enough for 90% of cases. Large production systems
running on ext4 need backups, so those would benefit the most here.

-Jacek

2012-03-18 20:56:59

by Theodore Ts'o

[permalink] [raw]
Subject: Re: getdents - ext4 vs btrfs performance

On Thu, Mar 15, 2012 at 11:42:24AM +0100, Jacek Luczak wrote:
>
> That was not a SVN server. It was a build host having checkouts of SVN
> projects.
>
> The many files/dirs case is common for VCS and the SVN is not the only
> that would be affected here.

Well, with SVN it's 2x or 3x the number of files in the checked-out
source code directory, right? So if a particular source tree has
2,000 files in a source directory, then SVN might have at most 6,000
files, and if you assume each directory entry is 64 bytes, we're still
talking about 375k. Do you have more files than that in a directory
in practice with SVN? And if so, why?

> AFAIR git.kernel.org was also suffering from the getdents().

git.kernel.org was suffering from a different problem, which was that
the git.kernel.org administrators didn't feel like automatically doing
a "git gc" on all of the repositories, and a lot of people were just
doing "git pushes" and not bothering to gc their repositories. Since
git.kernel.org users don't have shell access any more, the
git.kernel.org administrators have to be doing automatic git gc's. By
default git is supposed to automatically do a gc when there are more
than 6700 loose object files (which are distributed across 256 1st
level directories, so in practice a .git/objects/XX directory
shouldn't have more than 30 objects in it, with each directory entry
taking 48 bytes). The problem, I believe, is that "git push" commands
weren't checking the gc.auto limit, and so that's why git.kernel.org had
in the past suffered from large directories. This is arguably a git
bug, though, and as I mentioned, since we all don't have shell access
to git.kernel.org, this has to be handled automatically now....

> Same applies to commercial products that are
> heavily stuffed with many files/dirs, e.g. ClearCase or Synergy.

How many files in a directory do we commonly see with these systems?
I'm not familiar with them, and so I don't have a good feel for what
typical directory sizes tend to be.

> A medium size you are referring would most probably fit into 256k and
> this could be enough for 90% of cases. Large production system running
> on ext4 need backups thus those would benefit the most here.

Yeah, 256k or 512k is probably the best. Alternatively, the backup
programs could simply be taught to sort the directory entries by inode
number, and if that's not enough, to grab the initial block numbers
using FIEMAP and then sort by block number. Of course, all of this
optimization may or may not actually give us as much returns as we
think, given that the disk is probably seeking from other workloads
happening in parallel anyway (another reason why I am suspicious that
timing the tar command may not be an accurate way of measuring actual
performance when you have other tasks accessing the file system in
parallel with the backup).
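
A minimal userspace sketch of the first half of that suggestion
(sorting by inode number before stat; the FIEMAP refinement is left
out here) could look like this:

/*
 * Read all the names in a directory, sort them by inode number, and
 * only then stat them, so the inode table is read roughly in order.
 */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

struct ent {
    char name[256];
    ino_t ino;
};

static int by_ino(const void *a, const void *b)
{
    ino_t x = ((const struct ent *)a)->ino;
    ino_t y = ((const struct ent *)b)->ino;
    return (x > y) - (x < y);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(path);
    struct dirent *de;
    struct ent *v = NULL;
    size_t n = 0, cap = 0;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((de = readdir(d)) != NULL) {
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            v = realloc(v, cap * sizeof(*v));
            if (!v) {
                perror("realloc");
                return 1;
            }
        }
        snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
        v[n].ino = de->d_ino;   /* inode number as returned by getdents() */
        n++;
    }
    closedir(d);

    qsort(v, n, sizeof(*v), by_ino);    /* process in inode order */

    for (size_t i = 0; i < n; i++) {
        char full[4096];
        struct stat st;

        snprintf(full, sizeof(full), "%s/%s", path, v[i].name);
        if (stat(full, &st) == 0)
            printf("%llu %s\n",
                   (unsigned long long)st.st_ino, v[i].name);
    }
    free(v);
    return 0;
}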

- Ted