Hi,
With filesystems like ext4, xfs and btrfs, what are the limits on directory
capacity, and how well are they indexed?
The reason I ask is that inside of cachefiles, I insert fanout directories
inside index directories to divide up the space, so that ext2 could cope with
its limits on directory sizes and the fact that (IIRC) it did linear searches.
For some applications, I need to be able to cache over 1M entries (render
farm) and even a kernel tree has over 100k.
What I'd like to do is remove the fanout directories, so that for each logical
"volume"[*] I have a single directory with all the files in it. But that
means sticking massive amounts of entries into a single directory and hoping
it (a) isn't too slow and (b) doesn't hit the capacity limit.
David
[*] What that means is netfs-dependent. For AFS it would be a single volume
within a cell; for NFS, it would be a particular FSID on a server, for
example. Kind of corresponds to a thing that gets its own superblock on the
client.
On Mon, May 17, 2021 at 04:06:58PM +0100, David Howells wrote:
> Hi,
>
> With filesystems like ext4, xfs and btrfs, what are the limits on directory
> capacity, and how well are they indexed?
>
> The reason I ask is that inside of cachefiles, I insert fanout directories
> inside index directories to divide up the space, so that ext2 could cope with
> its limits on directory sizes and the fact that (IIRC) it did linear searches.
Don't do that for XFS. XFS directories have internal hashed btree
indexes that are far more space efficient than using fanout in
userspace. i.e. the XFS hash index uses 8 bytes per dirent, and so
a 4kB directory block can index about 500 entries. And being
O(log N) for lookup, insert and remove, the fan-out within the
directory hash per IO operation is an order of magnitude higher
than using directories in userspace....
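(If it matters, the directory block size is chosen at mkfs time and can be
larger than the filesystem block size. A rough sketch - the device and mount
point are just placeholders:

  # 4kB filesystem blocks, 8kB directory blocks (-n size= must be a power of 2)
  $ mkfs.xfs -b size=4096 -n size=8192 /dev/vdb
  $ mount /dev/vdb /var/cache/fscache

Bigger directory blocks index more entries per block, at the cost of somewhat
more expensive modifications to each block.)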
The capacity limit for XFS is 32GB of dirent data, which generally
equates to somewhere around 300-500 million dirents depending on
filename size. The hash index is separate from this limit (it has its
own 32GB address segment, as does the internal freespace map for the
directory)....
The other directory design characteristic of XFS directories is
that readdir is always a sequential read through the dirent data
with built-in readahead. It does not need to look up the hash index
to determine where to read the next dirents from - that's a straight
"file offset to physical location" lookup in the extent btree, which
is always cached in memory. So that's generally not a limiting
factor, either.
> For some applications, I need to be able to cache over 1M entries (render
> farm) and even a kernel tree has over 100k.
Not a problem for XFS with a single directory, but could definitely
be a problem for others especially as the directory grows and
shrinks. Last I measured, ext4 directory perf drops off at about
80-90k entries using 40 byte file names, but you can get an idea of
XFS directory scalability with large entry counts in commit
756c6f0f7efe ("xfs: reverse search directory freespace indexes").
I'll reproduce the table using a 4kB directory block size here:
File count      create time (sec) / rate (files/s)
    10k               0.41 / 24.3k
    20k               0.75 / 26.7k
   100k               3.27 / 30.6k
   200k               6.71 / 29.8k
     1M              37.67 / 26.5k
     2M              79.55 / 25.2k
    10M             552.89 / 18.1k
So that's single threaded file create, which shows the rough limits
of insert into the large directory. There really isn't a major
drop-off in performance until there are several million entries in
the directory. Remove is roughly the same speed for the same dirent
count.
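(If you want a rough feel for this on your own hardware, a crude
single-threaded approximation - not the benchmark used for the table above,
and the mount point is just a placeholder - is something like:

  $ mkdir /mnt/scratch/bigdir
  $ cd /mnt/scratch/bigdir
  $ time seq -f "file%08.0f" 1000000 | xargs touch

which creates a million 12-character names in a single directory and times
it.)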
> What I'd like to do is remove the fanout directories, so that for each logical
> "volume"[*] I have a single directory with all the files in it. But that
> means sticking massive amounts of entries into a single directory and hoping
> it (a) isn't too slow and (b) doesn't hit the capacity limit.
Note that if you use a single directory, you are effectively single
threading modifications to your file index. You still need to use
fanout directories if you want concurrency during modification for
the cachefiles index, but that's a different design criteria
compared to directory capacity and modification/lookup scalability.
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On May 17, 2021, at 7:22 PM, Dave Chinner <[email protected]> wrote:
>
> On Mon, May 17, 2021 at 04:06:58PM +0100, David Howells wrote:
>
>> What I'd like to do is remove the fanout directories, so that for each logical
>> "volume"[*] I have a single directory with all the files in it. But that
>> means sticking massive amounts of entries into a single directory and hoping
>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>
> Note that if you use a single directory, you are effectively single
> threading modifications to your file index. You still need to use
> fanout directories if you want concurrency during modification for
> the cachefiles index, but that's a different design criteria
> compared to directory capacity and modification/lookup scalability.
Unless you’re doing one subvol per fan directory, the btrfs results should be really similar either way. It’s all getting indexed in the same btree, the keys just change based on the parent dir.
The biggest difference should be what Dave calls out here, where directory locking at the vfs level might be a bottleneck.
-chris
Dave Chinner <[email protected]> wrote:
> > What I'd like to do is remove the fanout directories, so that for each logical
> > "volume"[*] I have a single directory with all the files in it. But that
> > means sticking massive amounts of entries into a single directory and hoping
> > it (a) isn't too slow and (b) doesn't hit the capacity limit.
>
> Note that if you use a single directory, you are effectively single
> threading modifications to your file index. You still need to use
> fanout directories if you want concurrency during modification for
> the cachefiles index, but that's a different design criteria
> compared to directory capacity and modification/lookup scalability.
I knew there was something I was overlooking. This might be a more important
criterion. I should try benchmarking this to see how much eliminating the
extra lookup step (which is probably cheap) buys versus what is lost in
concurrency.
David
On 18/05/2021 02.22, Dave Chinner wrote:
>
>> What I'd like to do is remove the fanout directories, so that for each logical
>> "volume"[*] I have a single directory with all the files in it. But that
>> means sticking massive amounts of entries into a single directory and hoping
>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
> Note that if you use a single directory, you are effectively single
> threading modifications to your file index. You still need to use
> fanout directories if you want concurrency during modification for
> the cachefiles index, but that's a different design criteria
> compared to directory capacity and modification/lookup scalability.
Something that hit us with single-large-directory and XFS is that XFS
will allocate all files in a directory using the same allocation group.
If your entire filesystem is just for that one directory, then that
allocation group will be contended. We saw spurious ENOSPC when that
happened, though that may have been related to bad O_DIRECT management by us.
We ended up creating files in a temporary directory and moving them to
the main directory, since for us the directory layout was mandated by
compatibility concerns.
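(Roughly, the shape of that workaround - paths and sizes here are placeholders
rather than the real layout - was:

  $ mkdir -p /data/scratch.$WORKER_ID            # per-worker scratch directory
  $ dd if=/dev/zero of=/data/scratch.$WORKER_ID/obj.tmp bs=1M count=4 oflag=direct
  $ mv /data/scratch.$WORKER_ID/obj.tmp /data/maindir/obj-$OBJ_ID

so allocation is driven by the scratch directory's location and only the final
rename touches the big directory.)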
We are now happy with XFS large-directory management, but are nowhere
close to a million files.
On Wed, May 19, 2021 at 11:00:03AM +0300, Avi Kivity wrote:
>
> On 18/05/2021 02.22, Dave Chinner wrote:
> >
> > > What I'd like to do is remove the fanout directories, so that for each logical
> > > "volume"[*] I have a single directory with all the files in it. But that
> > > means sticking massive amounts of entries into a single directory and hoping
> > > it (a) isn't too slow and (b) doesn't hit the capacity limit.
> > Note that if you use a single directory, you are effectively single
> > threading modifications to your file index. You still need to use
> > fanout directories if you want concurrency during modification for
> > the cachefiles index, but that's a different design criteria
> > compared to directory capacity and modification/lookup scalability.
>
> Something that hit us with single-large-directory and XFS is that
> XFS will allocate all files in a directory using the same
> allocation group. If your entire filesystem is just for that one
> directory, then that allocation group will be contended.
There is more than one concurrency problem that can arise from using
single large directories. Allocation policy is just another aspect
of the concurrency picture.
Indeed, you can avoid this specific problem simply by using the
inode32 allocator - this policy round-robins files across allocation
groups instead of trying to keep files physically local to their
parent directory. Hence if you just want one big directory with lots
of files that index lots of data, using the inode32 allocator will
allow the files in the filesystem to allocate/free space at maximum
concurrency at all times...
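(inode32 is just a mount option, so for a dedicated cache filesystem it's a
one-line change - the device and mount point below are only examples:

  $ mount -o inode32 /dev/nvme0n1p2 /var/cache/fscache

or the equivalent fstab entry. inode64 has been the default for a long time,
so this does need to be asked for explicitly.)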
> We saw spurious ENOSPC when that happened, though that
> may have been related to bad O_DIRECT management by us.
You should not see spurious ENOSPC at all.
The only time I recall this sort of thing occurring is when large
extent size hints are abused by applying them to every single file
and allocation regardless of whether they are needed, whilst
simultaneously mixing long term and short term data in the same
physical locality. Over time the repeated removal and reallocation
of short term data amongst long term data fragments the crap out of
free space until there are no large contiguous free spaces left to
allocate contiguous extents from.
> We ended up creating files in a temporary directory and moving them to the
> main directory, since for us the directory layout was mandated by
> compatibility concerns.
inode32 would have done effectively the same thing but without
needing to change the application....
> We are now happy with XFS large-directory management, but are nowhere close
> to a million files.
I think you are conflating directory scalability with problems
arising from file allocation policies not being ideal for your data
set organisation, layout and longevity characteristics.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 19/05/2021 15.57, Dave Chinner wrote:
> On Wed, May 19, 2021 at 11:00:03AM +0300, Avi Kivity wrote:
>> On 18/05/2021 02.22, Dave Chinner wrote:
>>>> What I'd like to do is remove the fanout directories, so that for each logical
>>>> "volume"[*] I have a single directory with all the files in it. But that
>>>> means sticking massive amounts of entries into a single directory and hoping
>>>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>>> Note that if you use a single directory, you are effectively single
>>> threading modifications to your file index. You still need to use
>>> fanout directories if you want concurrency during modification for
>>> the cachefiles index, but that's a different design criteria
>>> compared to directory capacity and modification/lookup scalability.
>> Something that hit us with single-large-directory and XFS is that
>> XFS will allocate all files in a directory using the same
>> allocation group. If your entire filesystem is just for that one
>> directory, then that allocation group will be contended.
> There is more than one concurrency problem that can arise from using
> single large directories. Allocation policy is just another aspect
> of the concurrency picture.
>
> Indeed, you can avoid this specific problem simply by using the
> inode32 allocator - this policy round-robins files across allocation
> groups instead of trying to keep files physically local to their
> parent directory. Hence if you just want one big directory with lots
> of files that index lots of data, using the inode32 allocator will
> allow the files in the filesystem to allocate/free space at maximum
> concurrency at all times...
Perhaps a directory attribute could be useful where the filesystem is
created independently of the application (say, by the OS installer).
>
>> We saw spurious ENOSPC when that happened, though that
>> may have been related to bad O_DIRECT management by us.
> You should not see spurious ENOSPC at all.
>
> The only time I recall this sort of thing occurring is when large
> extent size hints are abused by applying them to every single file
> and allocation regardless of whether they are needed, whilst
> simultaneously mixing long term and short term data in the same
> physical locality.
Yes, you remember well.
> Over time the repeated removal and reallocation
> of short term data amongst long term data fragments the crap out of
> free space until there are no large contiguous free spaces left to
> allocate contiguous extents from.
>
>> We ended up creating files in a temporary directory and moving them to the
>> main directory, since for us the directory layout was mandated by
>> compatibility concerns.
> inode32 would have done effectively the same thing but without
> needing to change the application....
It would not have helped the installed base.
>> We are now happy with XFS large-directory management, but are nowhere close
>> to a million files.
> I think you are conflating directory scalability with problems
> arising from file allocation policies not being ideal for your data
> set organisation, layout and longevity characteristics.
Probably, but these problems can happen to others using large
directories. The XFS list can be very helpful in resolving them, but it's
better to be warned ahead of time.
On May 17, 2021, at 9:06 AM, David Howells <[email protected]> wrote:
> With filesystems like ext4, xfs and btrfs, what are the limits on directory
> capacity, and how well are they indexed?
>
> The reason I ask is that inside of cachefiles, I insert fanout directories
> inside index directories to divide up the space, so that ext2 could cope with
> its limits on directory sizes and the fact that (IIRC) it did linear searches.
>
> For some applications, I need to be able to cache over 1M entries (render
> farm) and even a kernel tree has over 100k.
>
> What I'd like to do is remove the fanout directories, so that for each logical
> "volume"[*] I have a single directory with all the files in it. But that
> means sticking massive amounts of entries into a single directory and hoping
> it (a) isn't too slow and (b) doesn't hit the capacity limit.
Ext4 can comfortably handle ~12M entries in a single directory, if the
filenames are not too long (e.g. 32 bytes or so). With the "large_dir"
feature (since 4.13, but not enabled by default) a single directory can
hold around 4B entries, basically all the inodes of a filesystem.
There are performance knees as the index grows to a new level (at roughly
50k and 10M entries, depending on filename length).
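(large_dir can be set at mkfs time or turned on later with tune2fs - the
device name below is just an example, and the e2fsprogs needs to be new
enough to know about the feature:

  $ mkfs.ext4 -O large_dir /dev/vdc
or
  $ tune2fs -O large_dir /dev/vdc
  $ dumpe2fs -h /dev/vdc | grep -i features

Once set, it only changes anything when a directory actually grows past the
existing two-level htree limit.)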
As described elsewhere in the thread, allowing concurrent create and unlink
in a directory (rename probably not needed) would be invaluable for scaling
multi-threaded workloads. Neil Brown posted a prototype patch to add this
to the VFS for NFS:
https://lore.kernel.org/lustre-devel/[email protected]/
Maybe it's time to restart that discussion?
Cheers, Andreas
On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
> On May 17, 2021, at 9:06 AM, David Howells <[email protected]> wrote:
> > With filesystems like ext4, xfs and btrfs, what are the limits on directory
> > capacity, and how well are they indexed?
> >
> > The reason I ask is that inside of cachefiles, I insert fanout directories
> > inside index directories to divide up the space, so that ext2 could cope with
> > its limits on directory sizes and the fact that (IIRC) it did linear searches.
> >
> > For some applications, I need to be able to cache over 1M entries (render
> > farm) and even a kernel tree has over 100k.
> >
> > What I'd like to do is remove the fanout directories, so that for each logical
> > "volume"[*] I have a single directory with all the files in it. But that
> > means sticking massive amounts of entries into a single directory and hoping
> > it (a) isn't too slow and (b) doesn't hit the capacity limit.
>
> Ext4 can comfortably handle ~12M entries in a single directory, if the
> filenames are not too long (e.g. 32 bytes or so). With the "large_dir"
> feature (since 4.13, but not enabled by default) a single directory can
> hold around 4B entries, basically all the inodes of a filesystem.
ext4 definitely seems to be able to handle it. I've seen bottlenecks in
other parts of the storage stack, though.
With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
a few million entries (each pointing to a ~4-8k file) take the better
part of an hour, almost all of it system time in iowait. Also makes any
other concurrent disk writes hang, even a simple "touch x". Turning off
discard speeds it up by several orders of magnitude.
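(For reference, "discard enabled on both layers" means roughly this kind of
setup - device and mapping names are placeholders rather than my exact
configuration:

  $ cryptsetup open --allow-discards /dev/nvme0n1p2 cryptdata
  $ mount -o discard /dev/mapper/cryptdata /mnt/data

i.e. dm-crypt passing discards through, and ext4 issuing them at unlink time.)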
(I don't know if this is a known issue or not, so here are the details
just in case it isn't. Also, if this is already fixed in a newer kernel,
my apologies for the outdated report.)
$ uname -a
Linux s 5.10.0-6-amd64 #1 SMP Debian 5.10.28-1 (2021-04-09) x86_64 GNU/Linux
Reproducer (doesn't take *as* long but still long enough to demonstrate
the issue):
$ mkdir testdir
$ time python3 -c 'for i in range(1000000): open(f"testdir/{i}", "wb").write(b"test data")'
$ time rm -r testdir
dmesg details:
INFO: task rm:379934 blocked for more than 120 seconds.
Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rm state:D stack: 0 pid:379934 ppid:379461 flags:0x00004000
Call Trace:
__schedule+0x282/0x870
schedule+0x46/0xb0
wait_transaction_locked+0x8a/0xd0 [jbd2]
? add_wait_queue_exclusive+0x70/0x70
add_transaction_credits+0xd6/0x2a0 [jbd2]
start_this_handle+0xfb/0x520 [jbd2]
? jbd2__journal_start+0x8d/0x1e0 [jbd2]
? kmem_cache_alloc+0xed/0x1f0
jbd2__journal_start+0xf7/0x1e0 [jbd2]
__ext4_journal_start_sb+0xf3/0x110 [ext4]
ext4_evict_inode+0x24c/0x630 [ext4]
evict+0xd1/0x1a0
do_unlinkat+0x1db/0x2f0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f088f0c3b87
RSP: 002b:00007ffc8d3a27a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
RAX: ffffffffffffffda RBX: 000055ffee46de70 RCX: 00007f088f0c3b87
RDX: 0000000000000000 RSI: 000055ffee46df78 RDI: 0000000000000004
RBP: 000055ffece9daa0 R08: 0000000000000100 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffc8d3a2980 R14: 00007ffc8d3a2980 R15: 0000000000000002
INFO: task touch:379982 blocked for more than 120 seconds.
Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:touch state:D stack: 0 pid:379982 ppid:379969 flags:0x00000000
Call Trace:
__schedule+0x282/0x870
schedule+0x46/0xb0
wait_transaction_locked+0x8a/0xd0 [jbd2]
? add_wait_queue_exclusive+0x70/0x70
add_transaction_credits+0xd6/0x2a0 [jbd2]
? xas_load+0x5/0x70
? find_get_entry+0xd1/0x170
start_this_handle+0xfb/0x520 [jbd2]
? jbd2__journal_start+0x8d/0x1e0 [jbd2]
? kmem_cache_alloc+0xed/0x1f0
jbd2__journal_start+0xf7/0x1e0 [jbd2]
__ext4_journal_start_sb+0xf3/0x110 [ext4]
__ext4_new_inode+0x721/0x1670 [ext4]
ext4_create+0x106/0x1b0 [ext4]
path_openat+0xde1/0x1080
do_filp_open+0x88/0x130
? getname_flags.part.0+0x29/0x1a0
? __check_object_size+0x136/0x150
do_sys_openat2+0x97/0x150
__x64_sys_openat+0x54/0x90
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb2afb8fbe7
RSP: 002b:00007ffee3e287b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007ffee3e28a68 RCX: 00007fb2afb8fbe7
RDX: 0000000000000941 RSI: 00007ffee3e2a340 RDI: 00000000ffffff9c
RBP: 00007ffee3e2a340 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
R13: 00007ffee3e2a340 R14: 0000000000000000 R15: 0000000000000000
On Sat, May 23, 2021 at 10:51:02PM -0700, Josh Triplett wrote:
> On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
> > On May 17, 2021, at 9:06 AM, David Howells <[email protected]> wrote:
> > > With filesystems like ext4, xfs and btrfs, what are the limits on directory
> > > capacity, and how well are they indexed?
> > >
> > > The reason I ask is that inside of cachefiles, I insert fanout directories
> > > inside index directories to divide up the space, so that ext2 could cope with
> > > its limits on directory sizes and the fact that (IIRC) it did linear searches.
> > >
> > > For some applications, I need to be able to cache over 1M entries (render
> > > farm) and even a kernel tree has over 100k.
> > >
> > > What I'd like to do is remove the fanout directories, so that for each logical
> > > "volume"[*] I have a single directory with all the files in it. But that
> > > means sticking massive amounts of entries into a single directory and hoping
> > > it (a) isn't too slow and (b) doesn't hit the capacity limit.
> >
> > Ext4 can comfortably handle ~12M entries in a single directory, if the
> > filenames are not too long (e.g. 32 bytes or so). With the "large_dir"
> > feature (since 4.13, but not enabled by default) a single directory can
> > hold around 4B entries, basically all the inodes of a filesystem.
>
> ext4 definitely seems to be able to handle it. I've seen bottlenecks in
> other parts of the storage stack, though.
>
> With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
> enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
> a few million entries (each pointing to a ~4-8k file) take the better
> part of an hour, almost all of it system time in iowait. Also makes any
> other concurrent disk writes hang, even a simple "touch x". Turning off
> discard speeds it up by several orders of magnitude.
Synchronous discard is slow, even on NVME.
Background discard (aka fstrim in a cron job) isn't quite as bad, at
least in the sense of amortizing a bunch of clearing over an entire week
of not issuing discards. :P
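(i.e. something like a weekly crontab entry:

  # trim all mounted filesystems that support it, Sundays at 04:30
  30 4 * * 0  /usr/sbin/fstrim -av

or, on systemd-based distros, just "systemctl enable --now fstrim.timer".)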
--D
>
> (I don't know if this is a known issue or not, so here are the details
> just in case it isn't. Also, if this is already fixed in a newer kernel,
> my apologies for the outdated report.)
>
> $ uname -a
> Linux s 5.10.0-6-amd64 #1 SMP Debian 5.10.28-1 (2021-04-09) x86_64 GNU/Linux
>
> Reproducer (doesn't take *as* long but still long enough to demonstrate
> the issue):
> $ mkdir testdir
> $ time python3 -c 'for i in range(1000000): open(f"testdir/{i}", "wb").write(b"test data")'
> $ time rm -r testdir
>
> dmesg details:
>
> INFO: task rm:379934 blocked for more than 120 seconds.
> Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:rm state:D stack: 0 pid:379934 ppid:379461 flags:0x00004000
> Call Trace:
> __schedule+0x282/0x870
> schedule+0x46/0xb0
> wait_transaction_locked+0x8a/0xd0 [jbd2]
> ? add_wait_queue_exclusive+0x70/0x70
> add_transaction_credits+0xd6/0x2a0 [jbd2]
> start_this_handle+0xfb/0x520 [jbd2]
> ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
> ? kmem_cache_alloc+0xed/0x1f0
> jbd2__journal_start+0xf7/0x1e0 [jbd2]
> __ext4_journal_start_sb+0xf3/0x110 [ext4]
> ext4_evict_inode+0x24c/0x630 [ext4]
> evict+0xd1/0x1a0
> do_unlinkat+0x1db/0x2f0
> do_syscall_64+0x33/0x80
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f088f0c3b87
> RSP: 002b:00007ffc8d3a27a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000107
> RAX: ffffffffffffffda RBX: 000055ffee46de70 RCX: 00007f088f0c3b87
> RDX: 0000000000000000 RSI: 000055ffee46df78 RDI: 0000000000000004
> RBP: 000055ffece9daa0 R08: 0000000000000100 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007ffc8d3a2980 R14: 00007ffc8d3a2980 R15: 0000000000000002
> INFO: task touch:379982 blocked for more than 120 seconds.
> Not tainted 5.10.0-6-amd64 #1 Debian 5.10.28-1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:touch state:D stack: 0 pid:379982 ppid:379969 flags:0x00000000
> Call Trace:
> __schedule+0x282/0x870
> schedule+0x46/0xb0
> wait_transaction_locked+0x8a/0xd0 [jbd2]
> ? add_wait_queue_exclusive+0x70/0x70
> add_transaction_credits+0xd6/0x2a0 [jbd2]
> ? xas_load+0x5/0x70
> ? find_get_entry+0xd1/0x170
> start_this_handle+0xfb/0x520 [jbd2]
> ? jbd2__journal_start+0x8d/0x1e0 [jbd2]
> ? kmem_cache_alloc+0xed/0x1f0
> jbd2__journal_start+0xf7/0x1e0 [jbd2]
> __ext4_journal_start_sb+0xf3/0x110 [ext4]
> __ext4_new_inode+0x721/0x1670 [ext4]
> ext4_create+0x106/0x1b0 [ext4]
> path_openat+0xde1/0x1080
> do_filp_open+0x88/0x130
> ? getname_flags.part.0+0x29/0x1a0
> ? __check_object_size+0x136/0x150
> do_sys_openat2+0x97/0x150
> __x64_sys_openat+0x54/0x90
> do_syscall_64+0x33/0x80
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fb2afb8fbe7
> RSP: 002b:00007ffee3e287b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> RAX: ffffffffffffffda RBX: 00007ffee3e28a68 RCX: 00007fb2afb8fbe7
> RDX: 0000000000000941 RSI: 00007ffee3e2a340 RDI: 00000000ffffff9c
> RBP: 00007ffee3e2a340 R08: 0000000000000000 R09: 0000000000000000
> R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000941
> R13: 00007ffee3e2a340 R14: 0000000000000000 R15: 0000000000000000
>
>
On Mon, May 24, 2021 at 09:21:36PM -0700, Darrick J. Wong wrote:
> Synchronous discard is slow, even on NVME.
That really depends on the device. Some are specifically optimized for
that workload due to customer requirements.
On May 22, 2021, at 11:51 PM, Josh Triplett <[email protected]> wrote:
>
> On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
>> On May 17, 2021, at 9:06 AM, David Howells <[email protected]> wrote:
>>> With filesystems like ext4, xfs and btrfs, what are the limits on directory
>>> capacity, and how well are they indexed?
>>>
>>> The reason I ask is that inside of cachefiles, I insert fanout directories
>>> inside index directories to divide up the space, so that ext2 could cope with
>>> its limits on directory sizes and the fact that (IIRC) it did linear searches.
>>>
>>> For some applications, I need to be able to cache over 1M entries (render
>>> farm) and even a kernel tree has over 100k.
>>>
>>> What I'd like to do is remove the fanout directories, so that for each logical
>>> "volume"[*] I have a single directory with all the files in it. But that
>>> means sticking massive amounts of entries into a single directory and hoping
>>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>>
>> Ext4 can comfortably handle ~12M entries in a single directory, if the
>> filenames are not too long (e.g. 32 bytes or so). With the "large_dir"
>> feature (since 4.13, but not enabled by default) a single directory can
>> hold around 4B entries, basically all the inodes of a filesystem.
>
> ext4 definitely seems to be able to handle it. I've seen bottlenecks in
> other parts of the storage stack, though.
>
> With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
> enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
> a few million entries (each pointing to a ~4-8k file) take the better
> part of an hour, almost all of it system time in iowait. Also makes any
> other concurrent disk writes hang, even a simple "touch x". Turning off
> discard speeds it up by several orders of magnitude.
>
> (I don't know if this is a known issue or not, so here are the details
> just in case it isn't. Also, if this is already fixed in a newer kernel,
> my apologies for the outdated report.)
Definitely "-o discard" is known to have a measurable performance impact,
simply because it ends up sending a lot more requests to the block device,
and those requests can be slow/block the queue, depending on underlying
storage behavior.
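(How bad it is depends a lot on the device. A quick way to see what discard
support and granularity the kernel thinks a device has - the device name is
just an example:

  $ lsblk --discard /dev/nvme0n1
  $ cat /sys/block/nvme0n1/queue/discard_max_bytes

though that still says nothing about how quickly the firmware actually
services the requests.)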
There was a patch pushed recently that targets "-o discard" performance:
https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
that needs a bit more work, but may be worthwhile to test if it improves
your workload, and help put some weight behind landing it?
Another proposal was made to change "-o discard" from "track every freed
block and submit TRIM" to "(persistently) track modified block groups and
submit background TRIM like fstrim for the whole group". One advantage
of tracking the whole block group is that block group state is already
maintained in the kernel and persistently on disk. This also provides a
middle way between "immediate TRIM" that may not cover a whole erase block
when it is run, and "very lazy fstrim" that aggregates all free blocks in
a group but only happens when fstrim is run (from occasionally to never).
The in-kernel discard+fstrim handling could be smarter than "run every day
from cron" because it can know when the filesystem is busy or not, how much
data has been written and freed, and when a block group has accumulated enough
free space that it is actually worth submitting the TRIM for that group.
The start of that work was posted for discussion on linux-ext4:
https://marc.info/?l=linux-ext4&m=159283169109297&w=4
but ended up focussed on semantics of whether TRIM needs to obey requested
boundaries for security reasons, or not.
Cheers, Andreas
On Tue, May 25, 2021 at 03:13:52PM -0600, Andreas Dilger wrote:
> Definitely "-o discard" is known to have a measurable performance impact,
> simply because it ends up sending a lot more requests to the block device,
> and those requests can be slow/block the queue, depending on underlying
> storage behavior.
>
> There was a patch pushed recently that targets "-o discard" performance:
> https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
> that needs a bit more work, but may be worthwhile to test if it improves
> your workload, and help put some weight behind landing it?
This all seems very complicated. I have chosen with my current laptop
to "short stroke" the drive. That is, I discarded the entire bdev,
then partitioned it roughly in half. The second half has never seen
any writes. This effectively achieves the purpose of TRIM/discard;
there are a lot of unused LBAs, so the underlying flash translation layer
always has plenty of spare space when it needs to empty an erase block.
Since the steady state of hard drives is full, I have to type 'make clean'
in my build trees more often than otherwise and remember to delete iso
images after I've had them lying around for a year, but I'd rather clean
up a little more often than get these weird performance glitches.
And if I really do need half a terabyte of space temporarily, I can
always choose to use the fallow range for a while, then discard it again.
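(Concretely, that's nothing more exotic than the following - the device name
is an example, and obviously this destroys everything on the drive:

  $ blkdiscard /dev/nvme0n1
  $ parted -s /dev/nvme0n1 mklabel gpt mkpart root ext4 0% 50%
  $ mkfs.ext4 /dev/nvme0n1p1

leaving the upper half of the LBA space untouched as implicit
overprovisioning.)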
On Tue, May 25, 2021 at 10:26:17PM +0100, Matthew Wilcox wrote:
> On Tue, May 25, 2021 at 03:13:52PM -0600, Andreas Dilger wrote:
> > Definitely "-o discard" is known to have a measurable performance impact,
> > simply because it ends up sending a lot more requests to the block device,
> > and those requests can be slow/block the queue, depending on underlying
> > storage behavior.
> >
> > There was a patch pushed recently that targets "-o discard" performance:
> > https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
> > that needs a bit more work, but may be worthwhile to test if it improves
> > your workload, and help put some weight behind landing it?
>
> This all seems very complicated. I have chosen with my current laptop
> to "short stroke" the drive. That is, I discarded the entire bdev,
> then partitioned it roughly in half. The second half has never seen
> any writes. This effectively achieves the purpose of TRIM/discard;
> there are a lot of unused LBAs, so the underlying flash translation layer
> always has plenty of spare space when it needs to empty an erase block.
>
> Since the steady state of hard drives is full, I have to type 'make clean'
> in my build trees more often than otherwise and remember to delete iso
> images after I've had them lying around for a year, but I'd rather clean
> up a little more often than get these weird performance glitches.
>
> And if I really do need half a terabyte of space temporarily, I can
> always choose to use the fallow range for a while, then discard it again.
I just let xfs_scrub run FITRIM on Sundays at 4:30am. ;)
--D
Andreas Dilger <[email protected]> wrote:
> As described elsewhere in the thread, allowing concurrent create and unlink
> in a directory (rename probably not needed) would be invaluable for scaling
> multi-threaded workloads. Neil Brown posted a prototype patch to add this
> to the VFS for NFS:
Actually, one thing I'm looking at is using vfs_tmpfile() to create a new file
(or a replacement file when invalidation is required) and then using
vfs_link() to attach directory entries in the background (possibly using
vfs_link() with AT_LINK_REPLACE[1] instead of unlink+link).
Any thoughts on how that might scale? vfs_tmpfile() doesn't appear to require
the directory inode lock. I presume the directory is required for security
purposes in addition to being a way to specify the target filesystem.
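The userspace analogue of what I mean, just to make the shape of it concrete
- paths here are made up, and this is O_TMPFILE/linkat rather than the
in-kernel vfs_tmpfile()/vfs_link() calls:

$ python3 - <<'EOF'
import os
cachedir = "/var/cache/fscache/volume0"      # hypothetical cache directory
# Anonymous file created on the cache filesystem; no directory lock taken yet.
fd = os.open(cachedir, os.O_TMPFILE | os.O_WRONLY, 0o600)
os.write(fd, b"cached data")                 # file I/O can proceed immediately
os.fsync(fd)
# Later (e.g. from a background thread), give the file a name.  os.link() with
# the default follow_symlinks=True is linkat(..., AT_SYMLINK_FOLLOW) on the
# /proc/self/fd path, the documented way to link in an O_TMPFILE file.
os.link(f"/proc/self/fd/{fd}", os.path.join(cachedir, "object-0001"))
os.close(fd)
EOF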
David
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
On May 25, 2021, at 3:26 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, May 25, 2021 at 03:13:52PM -0600, Andreas Dilger wrote:
>> Definitely "-o discard" is known to have a measurable performance impact,
>> simply because it ends up sending a lot more requests to the block device,
>> and those requests can be slow/block the queue, depending on underlying
>> storage behavior.
>>
>> There was a patch pushed recently that targets "-o discard" performance:
>> https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
>> that needs a bit more work, but may be worthwhile to test if it improves
>> your workload, and help put some weight behind landing it?
>
> This all seems very complicated. I have chosen with my current laptop
> to "short stroke" the drive. That is, I discarded the entire bdev,
> then partitioned it roughly in half. The second half has never seen
> any writes. This effectively achieves the purpose of TRIM/discard;
> there are a lot of unused LBAs, so the underlying flash translation layer
> always has plenty of spare space when it needs to empty an erase block.
>
> Since the steady state of hard drives is full, I have to type 'make clean'
> in my build trees more often than otherwise and remember to delete iso
> images after I've had them lying around for a year, but I'd rather clean
> up a little more often than get these weird performance glitches.
>
> And if I really do need half a terabyte of space temporarily, I can
> always choose to use the fallow range for a while, then discard it again.
Sure, that's one solution for a 1TB laptop, but not large filesystems
that may be hundreds of TB per device. I don't think the owners of
Perlmutter (https://www.nersc.gov/systems/perlmutter/) could be convinced
to avoid using 17PB of their flash to avoid the need for TRIM to work. :-)
Cheers, Andreas
On May 25, 2021, at 4:31 PM, David Howells <[email protected]> wrote:
>
> Andreas Dilger <[email protected]> wrote:
>
>> As described elsewhere in the thread, allowing concurrent create and unlink
>> in a directory (rename probably not needed) would be invaluable for scaling
>> multi-threaded workloads. Neil Brown posted a prototype patch to add this
>> to the VFS for NFS:
>
> Actually, one thing I'm looking at is using vfs_tmpfile() to create a new file
> (or a replacement file when invalidation is required) and then using
> vfs_link() to attach directory entries in the background (possibly using
> vfs_link() with AT_LINK_REPLACE[1] instead of unlink+link).
>
> Any thoughts on how that might scale? vfs_tmpfile() doesn't appear to require
> the directory inode lock. I presume the directory is required for security
> purposes in addition to being a way to specify the target filesystem.
I don't see how that would help much? Yes, the tmpfile allocation would be
out-of-line vs. the directory lock, so this may reduce the lock hold time
by some fraction, but this would still need to hold the directory lock
when linking the tmpfile into the directory, in the same way that create
and unlink are serialized against other threads working in the same dir.
Having the directory locking scale with the size of the directory is what
will get orders of magnitude speedups for large concurrent workloads.
In ext4 this means write locking the directory leaf blocks independently,
with read locks for the interior index blocks unless new leaf blocks are
added (they are currently never removed).
It's the same situation as back with the BKL locking the entire kernel,
before we got fine-grained locking throughout the kernel.
>
> David
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
Cheers, Andreas
Andreas Dilger <[email protected]> wrote:
> > Any thoughts on how that might scale? vfs_tmpfile() doesn't appear to
> > require the directory inode lock. I presume the directory is required for
> > security purposes in addition to being a way to specify the target
> > filesystem.
>
> I don't see how that would help much?
When it comes to dealing with a file I don't have cached, I can't probe the
cache file to find out whether it has data that I can read until I've opened
it (or found out it doesn't exist). When it comes to writing to a new cache
file, I can't start writing until the file is created and opened - but this
will potentially hold up close, data sync and writes that conflict (and have
to implicitly sync). If I can use vfs_tmpfile() to defer synchronous
directory accesses, that could be useful.
As mentioned, creating a link to a temporary cache file (ie. modifying the
directory) could be deferred to a background thread whilst allowing file I/O
to progress to the tmpfile.
David
> On May 25, 2021, at 5:13 PM, Andreas Dilger <[email protected]> wrote:
>
> On May 22, 2021, at 11:51 PM, Josh Triplett <[email protected]> wrote:
>>
>> On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
>>> On May 17, 2021, at 9:06 AM, David Howells <[email protected]> wrote:
>>>> With filesystems like ext4, xfs and btrfs, what are the limits on directory
>>>> capacity, and how well are they indexed?
>>>>
>>>> The reason I ask is that inside of cachefiles, I insert fanout directories
>>>> inside index directories to divide up the space, so that ext2 could cope with
>>>> its limits on directory sizes and the fact that (IIRC) it did linear searches.
>>>>
>>>> For some applications, I need to be able to cache over 1M entries (render
>>>> farm) and even a kernel tree has over 100k.
>>>>
>>>> What I'd like to do is remove the fanout directories, so that for each logical
>>>> "volume"[*] I have a single directory with all the files in it. But that
>>>> means sticking massive amounts of entries into a single directory and hoping
>>>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>>>
>>> Ext4 can comfortably handle ~12M entries in a single directory, if the
>>> filenames are not too long (e.g. 32 bytes or so). With the "large_dir"
>>> feature (since 4.13, but not enabled by default) a single directory can
>>> hold around 4B entries, basically all the inodes of a filesystem.
>>
>> ext4 definitely seems to be able to handle it. I've seen bottlenecks in
>> other parts of the storage stack, though.
>>
>> With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
>> enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
>> a few million entries (each pointing to a ~4-8k file) take the better
>> part of an hour, almost all of it system time in iowait. Also makes any
>> other concurrent disk writes hang, even a simple "touch x". Turning off
>> discard speeds it up by several orders of magnitude.
>>
>> (I don't know if this is a known issue or not, so here are the details
>> just in case it isn't. Also, if this is already fixed in a newer kernel,
>> my apologies for the outdated report.)
>
> Definitely "-o discard" is known to have a measurable performance impact,
> simply because it ends up sending a lot more requests to the block device,
> and those requests can be slow/block the queue, depending on underlying
> storage behavior.
>
> There was a patch pushed recently that targets "-o discard" performance:
> https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
> that needs a bit more work, but may be worthwhile to test if it improves
> your workload, and help put some weight behind landing it?
>
This is pretty far off topic from the original message, but we’ve had a long list of discard problems in production:
* Synchronous discards stall under heavy delete loads, especially on lower end drives. Even drives that service the discards entirely in RAM on the host (fusion-io’s best feature imho) had trouble. I’m sure some really high end flash is really high end, but it hasn’t been a driving criterion for us in the fleet.
* XFS async discards decouple the commit latency from the discard latency, which is great. But the backlog of discards wasn’t really limited, so mass deletion events ended up generating stalls for reads and writes that were competing with the discards. We last benchmarked this with v5.2, so it might be different now, but unfortunately it wasn’t usable for us.
* fstrim-from-cron limits the stalls to 2am, which is peak somewhere in the world, so it isn't ideal. On some drives it’s fine, on others it’s a 10 minute lunch break.
For XFS in latency sensitive workloads, we’ve settled on synchronous discards and applications using iterated truncate calls that nibble the ends off of a file bit by bit while calling fsync at reasonable intervals. It hurts to say out loud but is also wonderfully predictable.
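The nibbling itself is nothing fancy - roughly the following, with the path,
chunk size and fsync cadence being illustrative rather than real production
values:

$ python3 - <<'EOF'
import os
path = "/srv/data/file-to-delete"      # placeholder path
chunk = 256 * 1024 * 1024              # shrink 256MB at a time
fd = os.open(path, os.O_WRONLY)
size = os.fstat(fd).st_size
while size > 0:
    size = max(0, size - chunk)
    os.ftruncate(fd, size)             # frees (and discards) a bounded amount
    os.fsync(fd)                       # bounds the pending journal/discard work
os.close(fd)
os.unlink(path)                        # finally remove the now-empty file
EOF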
We generally use btrfs on low end root drives, where discards are a much bigger problem. The btrfs async discard implementation considers re-allocating the block the same as discarding it, so we avoid some discards just by reusing blocks. It sorts pending discards to prefer larger IOs, and dribbles them out slowly to avoid saturating the drive. It’s a giant bag of compromises but avoids latencies and maintains the write amplification targets. We do use it on a few data intensive workloads with higher end flash, but we crank up the iops targets for the discards there.
-chris
On Tue, May 25, 2021 at 03:13:52PM -0600, Andreas Dilger wrote:
> There was a patch pushed recently that targets "-o discard" performance:
> https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
> that needs a bit more work, but may be worthwhile to test if it improves
> your workload, and help put some weight behind landing it?
I just got a chance to test that patch (using the same storage stack,
with ext4 atop dm-crypt on the same SSD). That patch series makes a
*massive* difference; with that patch series (rebased atop latest
5.13.0-rc7) and the test case from my previous mail, `rm -r testdir`
takes the same amount of time (~17s) whether I have discard enabled or
disabled, and doesn't disrupt the rest of the system. Without the
patch, that same removal took many minutes, and stalled out the rest of
the system.
Thanks for the reference; I'll follow up to the thread for that patch
with the same information.
- Josh Triplett