2017-04-20 19:00:53

by Alexey Lyashkov

Subject: some large dir testing results

Hi All,

I ran some testing in my environment with the large dir patches provided by Artem.
Each test runs 11 loops, creating 20680000 mknod objects for the normal dir and 20680000 for the large dir.
The FS was reformatted before each test; files were created in the root dir so that inodes and blocks are allocated from GD#0 and up.
The journal was a 4G internal journal.
The kernel was RHEL 7.2 based with Lustre patches.

Test script code
#!/bin/bash

LOOPS=11

for i in `seq ${LOOPS}`; do
mkfs -t ext4 -F -I 256 -J size=4096 ${DEV}
mount -t ldiskfs ${DEV} ${MNT}
pushd ${MNT}
/usr/lib/lustre/tests/createmany -m test 20680000 >& /tmp/small-mknod${i}
popd
umount ${DEV}
done


for i in `seq ${LOOPS}`; do
mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
mount -t ldiskfs ${DEV} ${MNT}
pushd ${MNT}
/usr/lib/lustre/tests/createmany -m test 206800000 >& /tmp/large-mknod${i}
popd
umount ${DEV}
done

Tests were run on two nodes - the first node has storage on a RAID10 of fast HDD’s, the second node has an NVMe device as the block device.
The current directory code has roughly similar results on both nodes for the first test:
- HDD node: 56k-65k creates/s
- SSD node: ~80k creates/s
But the large_dir test shows a large difference between the nodes:
- HDD node: creation rate drops to 11k creates/s
- SSD node: drops to 46k creates/s

Initial analysis points to several problems:
0) CPU load isn’t high, and perf top says the ldiskfs functions aren’t hot (2%-3% CPU); most of the time is spent in the dir entry checking function.

1) lookup spends a long time reading a directory block to verify that the file does not exist. I think this is because of block fragmentation.
[root@pink03 ~]# cat /proc/100993/stack
[<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
[<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
[<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
[<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
[<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
[<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
[<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
[<ffffffff811e8e7d>] lookup_real+0x1d/0x50
[<ffffffff811e97f2>] __lookup_hash+0x42/0x60
[<ffffffff811ee848>] filename_create+0x98/0x180
[<ffffffff811ef6e1>] user_path_create+0x41/0x60
[<ffffffff811f084a>] SyS_mknodat+0xda/0x220
[<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
[<ffffffff81645549>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
2) Some JBD2 problems, where the create thread waits on a shadow BH from a committed transaction.
[root@pink03 ~]# cat /proc/100993/stack
[<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
[<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
[<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
[<ffffffffa08ce817>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
[<ffffffffa08948c8>] ldiskfs_create+0xd8/0x190 [ldiskfs]
[<ffffffff811eb42d>] vfs_create+0xcd/0x130
[<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
[<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
[<ffffffff81645549>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
[root@pink03 ~]# cat /proc/100993/stack
[<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
[<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
[<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
[<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
[<ffffffffa08a75bd>] ldiskfs_mb_mark_diskspace_used+0x7d/0x4f0 [ldiskfs]
[<ffffffffa08abacc>] ldiskfs_mb_new_blocks+0x2ac/0x5d0 [ldiskfs]
[<ffffffffa08db63d>] ldiskfs_ext_map_blocks+0x49d/0xed0 [ldiskfs]
[<ffffffffa08997d9>] ldiskfs_map_blocks+0x179/0x590 [ldiskfs]
[<ffffffffa0899c55>] ldiskfs_getblk+0x65/0x200 [ldiskfs]
[<ffffffffa0899e17>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
[<ffffffffa088e3be>] ldiskfs_append+0x7e/0x150 [ldiskfs]
[<ffffffffa088fb09>] do_split+0xa9/0x900 [ldiskfs]
[<ffffffffa0892bb2>] ldiskfs_dx_add_entry+0xc2/0xbc0 [ldiskfs]
[<ffffffffa0894154>] ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
[<ffffffffa0894600>] ldiskfs_add_nondir+0x20/0x80 [ldiskfs]
[<ffffffffa0894904>] ldiskfs_create+0x114/0x190 [ldiskfs]
[<ffffffff811eb42d>] vfs_create+0xcd/0x130
[<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
[<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
[<ffffffff81645549>] system_call_fastpath+0x16/0x1b

I know several jbd2 improvements from Kara aren’t landed in RHEL7, but I don’t think they would be a big improvement, as the SSD shows a smaller perf drop.
I think the perf drop is due to the additional seeks required to access the dir data or the inode allocation.
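
One way to double-check the seek theory on the next run is to watch the
per-device read traffic and latency while createmany runs (iostat from
sysstat; r/s and r_await would be expected to climb once the directory no
longer fits in cache):

iostat -x 1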

Alex


2017-04-20 21:10:29

by Andreas Dilger

Subject: Re: some large dir testing results

On Apr 20, 2017, at 1:00 PM, Alexey Lyashkov <[email protected]> wrote:
> I run some testing on my environment with large dir patches provided by Artem.

Alexey, thanks for running these tests.

> Each test run a 11 loops with creating 20680000 mknod objects for normal dir, and 20680000 for large dir.

Just to clarify, here you write that both 2-level and 3-level directories
are creating about 20.7M entries, but in the tests shown below it looks
like the 3-level htree is creating ~207M entries (i.e. 10x as many)?

> FS was reformatted before each test, files was created in root dir to have an allocate inodes and blocks from GD#0 and up.
> Journal have a size - 4G and it was internal journal.
> Kernel was RHEL 7.2 based with lustre patches.

For non-Lustre readers, "createmany" is a single-threaded test that
creates a lot of files with the specified name in the given directory.
It has different options for using mknod(), open(), link(), mkdir(),
and unlink() or rmdir() to create and remove different types of entries,
and prints running stats on the current and overall rate of creation.

> Test script code
> #!/bin/bash
>
> LOOPS=11
>
> for i in `seq ${LOOPS}`; do
> mkfs -t ext4 -F -I 256 -J size=4096 ${DEV}
> mount -t ldiskfs ${DEV} ${MNT}
> pushd ${MNT}
> /usr/lib/lustre/tests/createmany -m test 20680000 >& /tmp/small-mknod${i}
> popd
> umount ${DEV}
> done
>
>
> for i in `seq ${LOOPS}`; do
> mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
> mount -t ldiskfs ${DEV} ${MNT}
> pushd ${MNT}
> /usr/lib/lustre/tests/createmany -m test 206800000 >& /tmp/large-mknod${i}
> popd
> umount ${DEV}
> done
>
> Tests was run on two nodes - first node have a storage with raid10 of fast HDD’s, second node have a NMVE as block device.
> Current directory code have a near of similar results for both nodes for first test:
> - HDD node 56k-65k creates/s
> - SSD node ~80k creates/s
> But large_dir testing have a large differences for nodes.
> - HDD node have a drop a creation rate to 11k create/s
> - SSD node have drop to 46k create/s

Sure, it isn't totally surprising that a larger directory becomes slower,
because the htree hashing is essentially inserting into random blocks.
For 207M entries of ~9 char names this would be about:

entries * (sizeof(ext4_dir_entry) + round_up(name_len, 4)) * use_ratio

= 206800000 * (8 + (4 + 9 + 3)) * 4 / 3 ~= 6.6GB of leaf blocks
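
The same estimate as shell arithmetic, for anyone who wants to plug in a
different name length (the 8-byte dirent header and ~3/4 leaf fill ratio
are the same assumptions as above):

echo "$(( 206800000 * (8 + 16) * 4 / 3 / 1024 / 1024 ))MB of leaf blocks"  # prints ~6311MB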

Unfortunately, all of the leaf blocks need to be kept in RAM in order to
get any good performance, since each entry is inserted into a random leaf.
There also needs to be more RAM for 4GB journal, dcache, inodes, etc.

I guess the good news is that htree performance is also not going to degrade
significantly over time due to further fragmentation since it is already
doing random insert/delete when the directory is very large.

> Initial analyze say about several problems
> 0) CPU load isn’t high, and perf top say ldiskfs functions isn’t hot (2%-3% cpu), most spent for dir entry checking function.
>
> 1) lookup have a large time to read a directory block to verify file not exist. I think it because a block fragmentation.
> [root@pink03 ~]# cat /proc/100993/stack
> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
> [<ffffffff811ee848>] filename_create+0x98/0x180
> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b

I don't think anything can be done here if the RAM size isn't large
enough to hold all of the directory leaf blocks in memory.

Would you be able to re-run this benchmark using the TEA hash? For
workloads like this where filenames are created in a sequential order
(createmany, Lustre object directories, others) the TEA hash can be
an improvement.

In theory, TEA hash entry insertion into the leaf blocks would be mostly
sequential for these workloads. That would localize the insertions into
the directory, which could reduce the number of leaf blocks that are
active at one time and could improve the performance noticeably. This
is only an improvement if the workload is known, but for Lustre OST
object directories that is the case, and is mostly under our control.
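
Concretely, the re-run would only need a tune2fs step between the existing
mkfs and mount lines in the script above (a sketch; hash_alg=tea is the
tune2fs spelling of the TEA hash):

mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
tune2fs -E hash_alg=tea ${DEV}
mount -t ldiskfs ${DEV} ${MNT}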

> 2) Some JBD problems when create thread have a wait a shadow BH from a committed transaction.
> [root@pink03 ~]# cat /proc/100993/stack
> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
> [<ffffffffa08ce817>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
> [<ffffffffa08948c8>] ldiskfs_create+0xd8/0x190 [ldiskfs]
> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b

You might consider using "createmany -l" to link entries (at least 65k
at a time) to the same inode (this would need changes to createmany to
create more than 65k files), so that you are exercising the directory
code and not loading so many inodes into memory?
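
A rough sketch of the same idea with plain ln (much slower than
createmany -l because of the fork/exec per link, but it shows the intended
pattern; staying just under the ext4 per-inode link limit of 65000):

touch ${MNT}/link-target
for n in $(seq 64999); do ln ${MNT}/link-target ${MNT}/link-${n}; done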

> [root@pink03 ~]# cat /proc/100993/stack
> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
> [<ffffffffa08a75bd>] ldiskfs_mb_mark_diskspace_used+0x7d/0x4f0 [ldiskfs]
> [<ffffffffa08abacc>] ldiskfs_mb_new_blocks+0x2ac/0x5d0 [ldiskfs]
> [<ffffffffa08db63d>] ldiskfs_ext_map_blocks+0x49d/0xed0 [ldiskfs]
> [<ffffffffa08997d9>] ldiskfs_map_blocks+0x179/0x590 [ldiskfs]
> [<ffffffffa0899c55>] ldiskfs_getblk+0x65/0x200 [ldiskfs]
> [<ffffffffa0899e17>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
> [<ffffffffa088e3be>] ldiskfs_append+0x7e/0x150 [ldiskfs]
> [<ffffffffa088fb09>] do_split+0xa9/0x900 [ldiskfs]
> [<ffffffffa0892bb2>] ldiskfs_dx_add_entry+0xc2/0xbc0 [ldiskfs]
> [<ffffffffa0894154>] ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
> [<ffffffffa0894600>] ldiskfs_add_nondir+0x20/0x80 [ldiskfs]
> [<ffffffffa0894904>] ldiskfs_create+0x114/0x190 [ldiskfs]
> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b

The other issue here may be that ext4 extent-mapped directories are
not very efficient. Each block takes 12 bytes in the extent tree vs.
only 4 bytes for block-mapped directories. Unfortunately, it isn't
possible to use block-mapped directories for filesystems over 2^32 blocks.

Another option might be to use bigalloc with, say, 16KB or 64KB chunks
so that the directory leaf blocks are not so fragmented and the extent
map can be kept more compact.
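
For the 16KB case that would be something like the following (a sketch only;
bigalloc has its own caveats, and depending on the e2fsprogs version other
feature flags may need adjusting):

mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir,bigalloc -C 16384 ${DEV}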

> I know several jbd2 improvements by Kara isn’t landed into RHEL7, but i don’t think it will big improvement, as SSD have less perf drop.
> I think perf dropped due additional seeks requested to have access to the dir data or inode allocation.

Cheers, Andreas







2017-04-21 08:09:56

by Alexey Lyashkov

Subject: Re: some large dir testing results


> On Apr 21, 2017, at 12:10 AM, Andreas Dilger <[email protected]> wrote:
>
> On Apr 20, 2017, at 1:00 PM, Alexey Lyashkov <[email protected]> wrote:
>> I run some testing on my environment with large dir patches provided by Artem.
>
> Alexey, thanks for running these tests.
>
>> Each test run a 11 loops with creating 20680000 mknod objects for normal dir, and 20680000 for large dir.
>
> Just to clarify, here you write that both 2-level and 3-level directories
> are creating about 20.7M entries, but in the tests shown below it looks
> like the 3-level htree is creating ~207M entries (i.e. 10x as many)?


> 20680000
is the directory size with a 2-level h-tree. It may sometimes be increased a little, but that number guarantees we don’t exit from 2 levels.
And I use ~207M entries to switch to the 3-level h-tree and see how good it is from a file creation perspective.


>> FS was reformatted before each test, files was created in root dir to have an allocate inodes and blocks from GD#0 and up.
>> Journal have a size - 4G and it was internal journal.
>> Kernel was RHEL 7.2 based with lustre patches.
>
> For non-Lustre readers, "createmany" is a single-threaded test that
> creates a lot of files with the specified name in the given directory.
> It has different options for using mknod(), open(), link(), mkdir(),
> and unlink() or rmdir() to create and remove different types of entries,
> and prints running stats on the current and overall rate of creation.
>
Thanks for the clarification, I forgot to describe it.
https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=blob_plain;f=lustre/tests/createmany.c;hb=HEAD


>>
>> Tests was run on two nodes - first node have a storage with raid10 of fast HDD’s, second node have a NMVE as block device.
>> Current directory code have a near of similar results for both nodes for first test:
>> - HDD node 56k-65k creates/s
>> - SSD node ~80k creates/s
>> But large_dir testing have a large differences for nodes.
>> - HDD node have a drop a creation rate to 11k create/s
>> - SSD node have drop to 46k create/s
>
> Sure, it isn't totally surprising that a larger directory becomes slower,
> because the htree hashing is essentially inserting into random blocks.
> For 207M entries of ~9 char names this would be about:
>
> entries * (sizeof(ext4_dir_entry) + round_up(name_len, 4)) * use_ratio
>
> = 206800000 * (8 + (4 + 9 + 3)) * 4 / 3 ~= 6.6GB of leaf blocks
>
> Unfortunately, all of the leaf blocks need to be kept in RAM in order to
> get any good performance, since each entry is inserted into a random leaf.
> There also needs to be more RAM for 4GB journal, dcache, inodes, etc.
The nodes have 64G RAM for the HDD case and 128G RAM for the NVMe case.
That should be enough memory to hold all of this info.


>
> I guess the good news is that htree performance is also not going to degrade
> significantly over time due to further fragmentation since it is already
> doing random insert/delete when the directory is very large.
>
>> Initial analyze say about several problems
>> 0) CPU load isn’t high, and perf top say ldiskfs functions isn’t hot (2%-3% cpu), most spent for dir entry checking function.
>>
>> 1) lookup have a large time to read a directory block to verify file not exist. I think it because a block fragmentation.
>> [root@pink03 ~]# cat /proc/100993/stack
>> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
>> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
>> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
>> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
>> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
>> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
>> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
>> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
>> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
>> [<ffffffff811ee848>] filename_create+0x98/0x180
>> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
>> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>
> I don't think anything can be done here if the RAM size isn't large
> enough to hold all of the directory leaf blocks in memory.
I think 64G/128G RAM is enough to keep it cached; otherwise it is a big problem for anyone who plans to use this feature.

>
> Would you be able to re-run this benchmark using the TEA hash? For
> workloads like this where filenames are created in a sequential order
> (createmany, Lustre object directories, others) the TEA hash can be
> an improvement.
>
Sure. But my testing is not only about a Lustre OST, it is about generic usage.
If we are talking about OSTs, we could introduce a new hash function that knows the file name is a number and uses that knowledge to get a good distribution.

> In theory, TEA hash entry insertion into the leaf blocks would be mostly
> sequential for these workloads. The would localize the insertions into
> the directory, which could reduce the number of leaf blocks that are
> active at one time and could improve the performance noticeably. This
> is only an improvement if the workload is known, but for Lustre OST
> object directories that is the case, and is mostly under our control.
>
>> 2) Some JBD problems when create thread have a wait a shadow BH from a committed transaction.
>> [root@pink03 ~]# cat /proc/100993/stack
>> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
>> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
>> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
>> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
>> [<ffffffffa08ce817>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
>> [<ffffffffa08948c8>] ldiskfs_create+0xd8/0x190 [ldiskfs]
>> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
>> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>
> You might consider to use "createmany -l" to link entries (at least 65k
> at a time) to the same inode (this would need changes to createmany to
> create more than 65k files), so that you are exercising the directory
> code and not loading so many inodes into memory?

That will be the next case.

>
>> [root@pink03 ~]# cat /proc/100993/stack
>> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
>> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
>> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
>> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
>> [<ffffffffa08a75bd>] ldiskfs_mb_mark_diskspace_used+0x7d/0x4f0 [ldiskfs]
>> [<ffffffffa08abacc>] ldiskfs_mb_new_blocks+0x2ac/0x5d0 [ldiskfs]
>> [<ffffffffa08db63d>] ldiskfs_ext_map_blocks+0x49d/0xed0 [ldiskfs]
>> [<ffffffffa08997d9>] ldiskfs_map_blocks+0x179/0x590 [ldiskfs]
>> [<ffffffffa0899c55>] ldiskfs_getblk+0x65/0x200 [ldiskfs]
>> [<ffffffffa0899e17>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
>> [<ffffffffa088e3be>] ldiskfs_append+0x7e/0x150 [ldiskfs]
>> [<ffffffffa088fb09>] do_split+0xa9/0x900 [ldiskfs]
>> [<ffffffffa0892bb2>] ldiskfs_dx_add_entry+0xc2/0xbc0 [ldiskfs]
>> [<ffffffffa0894154>] ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
>> [<ffffffffa0894600>] ldiskfs_add_nondir+0x20/0x80 [ldiskfs]
>> [<ffffffffa0894904>] ldiskfs_create+0x114/0x190 [ldiskfs]
>> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
>> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>
> The other issue here may be that ext4 extent-mapped directories are
> not very efficient. Each block takes 12 bytes in the extent tree vs.
> only 4 bytes for block-mapped directories. Unfortunately, it isn't
> possible to use block-mapped directories for filesystems over 2^32 blocks.
>
> Another option might be to use bigalloc with, say, 16KB or 64KB chunks
> so that the directory leaf blocks are not so fragmented and the extent
> map can be kept more compact.

What about allocating more than one block at once?

>
>> I know several jbd2 improvements by Kara isn’t landed into RHEL7, but i don’t think it will big improvement, as SSD have less perf drop.
>> I think perf dropped due additional seeks requested to have access to the dir data or inode allocation.
>
> Cheers, Andreas
>
>
>
>
>

2017-04-21 14:13:01

by Alexey Lyashkov

Subject: Re: some large dir testing results


> On Apr 21, 2017, at 5:08 PM, Bernd Schubert <[email protected]> wrote:
>>
>> Initial analyze say about several problems
>> 0) CPU load isn’t high, and perf top say ldiskfs functions isn’t hot (2%-3%
>> cpu), most spent for dir entry checking function.
>>
>> 1) lookup have a large time to read a directory block to verify file not
>> exist. I think it because a block fragmentation. [root@pink03 ~]# cat
>> /proc/100993/stack
>> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
>> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
>> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
>> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
>> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
>> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
>> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
>> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
>> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
>> [<ffffffff811ee848>] filename_create+0x98/0x180
>> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
>> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I wrote patches for ext4 a long time ago to get a better caching for that
>
> https://patchwork.ozlabs.org/patch/101200/
>
>
> For FhGFS/BeeGFS we then decided to use a totally different directory layout,
> which totally eliminated the underlying issue for the main requirement or
> large dirs at all. (Personally I would recommend to do the something similar
> for Lustre - using hash dirs to store objects has a much too random access
> pattern once the file system gets used with many files...).
>
> Also, a caching issue has been fixed by Mel Gorman in 3.11 (I didn't check if
> these patches are backported to any vendor kernel).
>
>
Bernd,

Thanks for pointing us to the patches, I will test with them in my next test loop.
As for a different layout - it exists as a separate option.


Alex

2017-04-21 14:13:40

by Bernd Schubert

Subject: Re: some large dir testing results

On Thursday, April 20, 2017 10:00:48 PM CEST Alexey Lyashkov wrote:
> Hi All,
>
> I run some testing on my environment with large dir patches provided by
> Artem. Each test run a 11 loops with creating 20680000 mknod objects for
> normal dir, and 20680000 for large dir. FS was reformatted before each
> test, files was created in root dir to have an allocate inodes and blocks
> from GD#0 and up. Journal have a size - 4G and it was internal journal.
> Kernel was RHEL 7.2 based with lustre patches.
>
> Test script code
> #!/bin/bash
>
> LOOPS=11
>
> for i in `seq ${LOOPS}`; do
> mkfs -t ext4 -F -I 256 -J size=4096 ${DEV}
> mount -t ldiskfs ${DEV} ${MNT}
> pushd ${MNT}
> /usr/lib/lustre/tests/createmany -m test 20680000 >& /tmp/small-mknod${i}
> popd
> umount ${DEV}
> done
>
>
> for i in `seq ${LOOPS}`; do
> mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
> mount -t ldiskfs ${DEV} ${MNT}
> pushd ${MNT}
> /usr/lib/lustre/tests/createmany -m test 206800000 >& /tmp/large-mknod${i}
> popd
> umount ${DEV}
> done
>
> Tests was run on two nodes - first node have a storage with raid10 of fast
> HDD’s, second node have a NMVE as block device. Current directory code have
> a near of similar results for both nodes for first test: - HDD node 56k-65k
> creates/s
> - SSD node ~80k creates/s
> But large_dir testing have a large differences for nodes.
> - HDD node have a drop a creation rate to 11k create/s
> - SSD node have drop to 46k create/s
>
> Initial analyze say about several problems
> 0) CPU load isn’t high, and perf top say ldiskfs functions isn’t hot (2%-3%
> cpu), most spent for dir entry checking function.
>
> 1) lookup have a large time to read a directory block to verify file not
> exist. I think it because a block fragmentation. [root@pink03 ~]# cat
> /proc/100993/stack
> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
> [<ffffffff811ee848>] filename_create+0x98/0x180
> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff

I wrote patches for ext4 a long time ago to get a better caching for that

https://patchwork.ozlabs.org/patch/101200/


For FhGFS/BeeGFS we then decided to use a totally different directory layout,
which totally eliminated the underlying issue for the main requirement of
large dirs at all. (Personally I would recommend doing something similar
for Lustre - using hash dirs to store objects has a much too random access
pattern once the file system gets used with many files...).

Also, a caching issue has been fixed by Mel Gorman in 3.11 (I didn't check if
these patches are backported to any vendor kernel).


Bernd

2017-04-21 19:11:41

by Alexey Lyashkov

Subject: Re: some large dir testing results


> On Apr 21, 2017, at 12:10 AM, Andreas Dilger <[email protected]> wrote:
>
>> 2) Some JBD problems when create thread have a wait a shadow BH from a committed transaction.
>> [root@pink03 ~]# cat /proc/100993/stack
>> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
>> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
>> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
>> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
>> [<ffffffffa08ce817>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
>> [<ffffffffa08948c8>] ldiskfs_create+0xd8/0x190 [ldiskfs]
>> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
>> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>
> You might consider to use "createmany -l" to link entries (at least 65k
> at a time) to the same inode (this would need changes to createmany to
> create more than 65k files), so that you are exercising the directory
> code and not loading so many inodes into memory?
>

I have run a hardlink test now (it hasn’t finished all the loops, but some initial results exist).
The initial creation rate for the HDD case is ~130k hardlinks per second;
after ~83580000 entries it dropped to 30k-40k hard links per second.

Alex.

2017-04-21 20:58:47

by Andreas Dilger

Subject: Re: some large dir testing results

On Apr 21, 2017, at 2:09 AM, Alexey Lyashkov <[email protected]> wrote:
>
>>
>> On Apr 21, 2017, at 12:10 AM, Andreas Dilger <[email protected]> wrote:
>>
>> On Apr 20, 2017, at 1:00 PM, Alexey Lyashkov <[email protected]> wrote:
>>> I run some testing on my environment with large dir patches provided by Artem.
>>
>> Alexey, thanks for running these tests.
>>
>>> Each test run a 11 loops with creating 20680000 mknod objects for normal dir, and 20680000 for large dir.
>>
>> Just to clarify, here you write that both 2-level and 3-level directories
>> are creating about 20.7M entries, but in the tests shown below it looks
>> like the 3-level htree is creating ~207M entries (i.e. 10x as many)?
>
> 20680000 is directory size with 2 level h-tree, It may sometimes increased a little, but it number have a guaranteed we don’t exit from 2 levels.
> and i use ~207M entries to switch to the 3 level h-tree and see how it good from file creation perspective.
>
>
>>> FS was reformatted before each test, files was created in root dir to have an allocate inodes and blocks from GD#0 and up.
>>> Journal have a size - 4G and it was internal journal.
>>> Kernel was RHEL 7.2 based with lustre patches.
>>>
>>> Tests was run on two nodes - first node have a storage with raid10 of fast HDD’s, second node have a NMVE as block device.
>>> Current directory code have a near of similar results for both nodes for first test:
>>> - HDD node 56k-65k creates/s
>>> - SSD node ~80k creates/s
>>> But large_dir testing have a large differences for nodes.
>>> - HDD node have a drop a creation rate to 11k create/s
>>> - SSD node have drop to 46k create/s
>>
>> Sure, it isn't totally surprising that a larger directory becomes slower,
>> because the htree hashing is essentially inserting into random blocks.
>> For 207M entries of ~9 char names this would be about:
>>
>> entries * (sizeof(ext4_dir_entry) + round_up(name_len, 4)) * use_ratio
>>
>> = 206800000 * (8 + (4 + 9 + 3)) * 4 / 3 ~= 6.6GB of leaf blocks
>>
>> Unfortunately, all of the leaf blocks need to be kept in RAM in order to
>> get any good performance, since each entry is inserted into a random leaf.
>> There also needs to be more RAM for 4GB journal, dcache, inodes, etc.
> nodes have a 64G Ram for HDD case, and 128G ram for NVME case.
> It should enough to have enough memory to hold all info.

OK, it is a bit surprising that the htree blocks are being read from
disk so often. I guess the large number of inodes (207M inodes = 50GB of
buffers, x2 for ext4_inode_info/VFS inode, dcache, etc.) is causing the
htree index blocks to be flushed.
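
Back-of-the-envelope for that figure, using the 256-byte inodes from the
mkfs line in the test script:

echo "$(( 206800000 * 256 / 1024 / 1024 / 1024 ))GB of inode table buffers"  # prints ~49GB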

Ted wrote in the thread https://patchwork.ozlabs.org/patch/101200/
that the index buffer blocks were not being marked referenced often
enough, so that probably needs to be fixed?

There is also the thread in https://patchwork.kernel.org/patch/9395211/
about avoiding single-use objects being marked REFERENCED in the cache
LRU to help such workloads to evict inodes faster than htree index blocks.

>> I guess the good news is that htree performance is also not going to degrade
>> significantly over time due to further fragmentation since it is already
>> doing random insert/delete when the directory is very large.
>>
>>> Initial analyze say about several problems
>>> 0) CPU load isn’t high, and perf top say ldiskfs functions isn’t hot (2%-3% cpu), most spent for dir entry checking function.
>>>
>>> 1) lookup have a large time to read a directory block to verify file not exist. I think it because a block fragmentation.
>>> [root@pink03 ~]# cat /proc/100993/stack
>>> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
>>> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
>>> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
>>> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
>>> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
>>> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
>>> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
>>> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
>>> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
>>> [<ffffffff811ee848>] filename_create+0x98/0x180
>>> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
>>> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
>>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>>
>> I don't think anything can be done here if the RAM size isn't large
>> enough to hold all of the directory leaf blocks in memory.
> I think 64G/128G ram is enough to keep, otherwise it big problem which have plan to use this feature.
>
>> Would you be able to re-run this benchmark using the TEA hash? For
>> workloads like this where filenames are created in a sequential order
>> (createmany, Lustre object directories, others) the TEA hash can be
>> an improvement.
>
> Sure. But my testing not about a Lustre OST only, but about generic usage.
> If you talk about OST’s we may introduce a new hash function which know about file name is number and use it knowledge to have a good distribution.

Still, it would be interesting to test this out. The TEA hash already
exists in the ext4 code, it just isn't used very often. You can set it
on an existing filesystem (won't affect existing dir hashes):

tune2fs -E hash_alg=tea

though there doesn't appear to be an equivalent option for mke2fs. I'm
not sure if "e2fsck -fD" will rehash the directory using the new hash or
stick with the existing hash for directories, but if this is a big gain
we might consider adding some option to do this.
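
The change can be confirmed afterward with dumpe2fs (the field name below is
from its -h output; only directories created after the change use the new hash):

dumpe2fs -h ${DEV} | grep -i "directory hash"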

If there is a significant gain from this we should set that as the
default hash for Lustre OSTs. I'm not sure why we didn't think of using
TEA hash years ago for OSTs, but that shouldn't stop us from using it now.

>> In theory, TEA hash entry insertion into the leaf blocks would be mostly
>> sequential for these workloads. The would localize the insertions into
>> the directory, which could reduce the number of leaf blocks that are
>> active at one time and could improve the performance noticeably. This
>> is only an improvement if the workload is known, but for Lustre OST
>> object directories that is the case, and is mostly under our control.
>>
>>> 2) Some JBD problems when create thread have a wait a shadow BH from a committed transaction.
>>> [root@pink03 ~]# cat /proc/100993/stack
>>> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
>>> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
>>> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
>>> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
>>> [<ffffffffa08ce817>] __ldiskfs_new_inode+0x447/0x1300 [ldiskfs]
>>> [<ffffffffa08948c8>] ldiskfs_create+0xd8/0x190 [ldiskfs]
>>> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
>>> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
>>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>>
>> You might consider to use "createmany -l" to link entries (at least 65k
>> at a time) to the same inode (this would need changes to createmany to
>> create more than 65k files), so that you are exercising the directory
>> code and not loading so many inodes into memory?
>
> It will be next case.
>
>>
>>> [root@pink03 ~]# cat /proc/100993/stack
>>> [<ffffffffa06a072e>] sleep_on_shadow_bh+0xe/0x20 [jbd2]
>>> [<ffffffffa06a1bad>] do_get_write_access+0x2dd/0x4e0 [jbd2]
>>> [<ffffffffa06a1dd7>] jbd2_journal_get_write_access+0x27/0x40 [jbd2]
>>> [<ffffffffa08c7cab>] __ldiskfs_journal_get_write_access+0x3b/0x80 [ldiskfs]
>>> [<ffffffffa08a75bd>] ldiskfs_mb_mark_diskspace_used+0x7d/0x4f0 [ldiskfs]
>>> [<ffffffffa08abacc>] ldiskfs_mb_new_blocks+0x2ac/0x5d0 [ldiskfs]
>>> [<ffffffffa08db63d>] ldiskfs_ext_map_blocks+0x49d/0xed0 [ldiskfs]
>>> [<ffffffffa08997d9>] ldiskfs_map_blocks+0x179/0x590 [ldiskfs]
>>> [<ffffffffa0899c55>] ldiskfs_getblk+0x65/0x200 [ldiskfs]
>>> [<ffffffffa0899e17>] ldiskfs_bread+0x27/0xc0 [ldiskfs]
>>> [<ffffffffa088e3be>] ldiskfs_append+0x7e/0x150 [ldiskfs]
>>> [<ffffffffa088fb09>] do_split+0xa9/0x900 [ldiskfs]
>>> [<ffffffffa0892bb2>] ldiskfs_dx_add_entry+0xc2/0xbc0 [ldiskfs]
>>> [<ffffffffa0894154>] ldiskfs_add_entry+0x254/0x6e0 [ldiskfs]
>>> [<ffffffffa0894600>] ldiskfs_add_nondir+0x20/0x80 [ldiskfs]
>>> [<ffffffffa0894904>] ldiskfs_create+0x114/0x190 [ldiskfs]
>>> [<ffffffff811eb42d>] vfs_create+0xcd/0x130
>>> [<ffffffff811f0960>] SyS_mknodat+0x1f0/0x220
>>> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
>>> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
>>
>> The other issue here may be that ext4 extent-mapped directories are
>> not very efficient. Each block takes 12 bytes in the extent tree vs.
>> only 4 bytes for block-mapped directories. Unfortunately, it isn't
>> possible to use block-mapped directories for filesystems over 2^32 blocks.
>>
>> Another option might be to use bigalloc with, say, 16KB or 64KB chunks
>> so that the directory leaf blocks are not so fragmented and the extent
>> map can be kept more compact.
>
> What about allocating more than one block in once?

Yes, that would also be possible. There is s_prealloc_dir_blocks in the
superblock that is intended to allow for this, but I don't think it has
been used since the ext2 days.

It isn't clear if that would help much in the benchmark case, except to
keep the extent tree more compact. At 4-block chunks the extent tree is
more compact than indirect-mapped files. In real-life workloads where
directory blocks are being allocated at the same time as file data blocks
it might be more helpful to have larger preallocation chunks. On Lustre
OSTs even larger chunks would be helpful for all directories.
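
For what it's worth, the fragmentation of the test directory's extent tree
can be checked directly after a run; since the files were created in the
root directory, something like:

debugfs -R "dump_extents /" ${DEV} | tail -20

would show how many extents the directory's blocks ended up spread across.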

Cheers, Andreas







2017-04-24 18:29:55

by Alexey Lyashkov

Subject: Re: some large dir testing results


> On Apr 21, 2017, at 11:58 PM, Andreas Dilger <[email protected]> wrote:

Based on the current investigation step, the journal is the bottleneck now.
jbd2/md66-8-21787 [000] .... 439933.758608: jbd2_commit_logging: dev 9,66 transaction 1311 sync 0
jbd2/md66-8-21787 [008] .... 439940.816308: jbd2_run_stats: dev 9,66 tid 1311 wait 0 request_delay 0 running 5000 locked 0 flushing 0 logging 7060 handle_count 167302 blocks 170369 blocks_logged 170871

So processing takes ~7s here, but sometimes it eats ~100s.

41s
jbd2/md66-8-21787 [001] .... 437328.447185: jbd2_start_commit: dev 9,66 transaction 1197 sync 0
jbd2/md66-8-21787 [007] .... 437369.618933: jbd2_run_stats: dev 9,66 tid 1197 wait 0 request_delay 0 running 5358 locked 5352 flushing 0 logging 35838 handle_count 206731 blocks 206433 blocks_logged 207041
jbd2/md66-8-21787 [007] .... 437369.618936: jbd2_end_commit: dev 9,66 transaction 1197 sync 0 head 1192

~100s
jbd2/md66-8-21787 [000] .... 435933.607806: jbd2_run_stats: dev 9,66 tid 1135 wait 0 request_delay 0 running 7590 locked 114 flushing 0 logging 22 handle_count 0 blocks 0 blocks_logged 0
jbd2/md66-8-21787 [011] .... 436030.850489: jbd2_run_stats: dev 9,66 tid 1136 wait 0 request_delay 0 running 5320 locked 3485 flushing 0 logging 88482 handle_count 123338 blocks 126810 blocks_logged 127183
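
For reference, these numbers come from the standard jbd2 tracepoints; a
minimal sketch of how to collect them (assuming tracefs is mounted at the
usual /sys/kernel/debug/tracing):

echo 1 > /sys/kernel/debug/tracing/events/jbd2/enable
cat /sys/kernel/debug/tracing/trace_pipe | grep md66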