2015-02-04 09:04:49

by Olaf Hering

[permalink] [raw]
Subject: ext3_dx_add_entry complains about Directory index full


To reduce Jan's load I'm sending it here for advice:


Today I got these warnings for the backup partition:

[ 0.000000] Linux version 3.18.5 (abuild@build23) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Mon Jan 19 09:08:56 UTC 2015

[102565.308869] kjournald starting. Commit interval 5 seconds
[102565.315974] EXT3-fs (dm-5): using internal journal
[102565.315980] EXT3-fs (dm-5): mounted filesystem with ordered data mode
[104406.015708] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
[104406.239904] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
[104406.254162] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
[104406.270793] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
[104406.287443] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!

According to google this indicates that the filesystem has more than 32k
subdirectories. According to wikipedia this limit can be avoided by
enabling the dir_index feature. According to dumpe2fs the feature is
enabled already. Does the warning above mean something else?

Jan suggested creating a debug image with "e2image -r /dev/dm-5 - |
xz > ext3-image.e2i.xz", but this creates more than 250G of private data.

I wonder if the math within the kernel is done correctly. If so, I will move the
data to another drive and reformat the thing with another filesystem.
If, however, the math is wrong somewhere, I'm willing to keep it for a while
until the issue is understood.


# dumpe2fs -h /dev/dm-5
dumpe2fs 1.41.14 (22-Dec-2010)
Filesystem volume name: BACKUP_OLH_500G
Last mounted on: /run/media/olaf/BACKUP_OLH_500G
Filesystem UUID: f0d41610-a993-4b77-8845-f0f07e37f61d
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 26214400
Block count: 419430400
Reserved block count: 419430
Free blocks: 75040285
Free inodes: 24328812
First block: 1
Block size: 1024
Fragment size: 1024
Reserved GDT blocks: 256
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 512
Inode blocks per group: 128
Filesystem created: Tue Feb 12 18:24:13 2013
Last mount time: Thu Jan 29 09:15:28 2015
Last write time: Thu Jan 29 09:15:28 2015
Mount count: 161
Maximum mount count: -1
Last checked: Mon May 26 10:09:36 2014
Check interval: 0 (<none>)
Lifetime writes: 299 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 55aeb7a2-43ca-4104-ad21-56d7a523dc8f
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 32M
Journal length: 32768
Journal sequence: 0x000a2725
Journal start: 17366



The backup is done with rsnapshot, which uses hardlinks and rsync to create a
new subdir with just the changed files.

# for t in d f l ; do echo "type $t: `find /media/BACKUP_OLH_500G/ -xdev -type $t | wc -l`" ; done
type d: 1051396
type f: 20824894
type l: 6876
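
Roughly, the hardlink-farm scheme looks like this; this is a simplified
sketch of the general idea rather than rsnapshot's exact commands, and
the source path /home/olh is only a stand-in:

rm -rf /media/BACKUP_OLH_500G/hourly.1                                   # drop the oldest copy
cp -al /media/BACKUP_OLH_500G/hourly.0 /media/BACKUP_OLH_500G/hourly.1   # hardlink-copy the newest snapshot
rsync -a --delete /home/olh/ /media/BACKUP_OLH_500G/hourly.0/            # rewrite only the changed files

So every file that is unchanged between runs just gains one more hard
link per retained snapshot.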

With the hack below I got this output:

[14161.626156] scsi 4:0:0:0: Direct-Access ATA ST3500418AS CC45 PQ: 0 ANSI: 5
[14161.626671] sd 4:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[14161.626762] sd 4:0:0:0: [sdb] Write Protect is off
[14161.626769] sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[14161.626810] sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[14161.628058] sd 4:0:0:0: Attached scsi generic sg1 type 0
[14161.651340] sdb: sdb1
[14161.651978] sd 4:0:0:0: [sdb] Attached SCSI disk
[14176.784403] kjournald starting. Commit interval 5 seconds
[14176.790307] EXT3-fs (dm-5): using internal journal
[14176.790316] EXT3-fs (dm-5): mounted filesystem with ordered data mode
[14596.410693] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full! /hourly.0 localhost/olh/maildir/olh-maildir Maildir/old/xen-devel.old/cur 1422000479.29469_1.probook.fritz.box:2,S
[15335.342389] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full! /hourly.0 localhost/olh/maildir/olh-maildir Maildir/old/xen-devel.old/cur 1422000479.29469_1.probook.fritz.box:2,S


diff --git a/fs/ext3/namei.c b/fs/ext3/namei.c
index f197736..5022eda 100644
--- a/fs/ext3/namei.c
+++ b/fs/ext3/namei.c
@@ -1525,11 +1525,20 @@ static int ext3_dx_add_entry(handle_t *handle, struct dentry *dentry,
struct dx_entry *entries2;
struct dx_node *node2;
struct buffer_head *bh2;
+ struct dentry *parents = dentry->d_parent;
+ struct dentry *parents2;
+ unsigned int i = 4;

if (levels && (dx_get_count(frames->entries) ==
dx_get_limit(frames->entries))) {
+ while (parents && i > 0 && parents->d_parent)
+ i--, parents = parents->d_parent;
+ parents2 = parents;
+ i = 4;
+ while (parents2 && i > 0 && parents2->d_parent)
+ i--, parents2 = parents2->d_parent;
ext3_warning(sb, __func__,
- "Directory index full!");
+ "Directory index full! %pd4 %pd4 %pd4 %pd", parents2, parents, dentry->d_parent, dentry);
err = -ENOSPC;
goto cleanup;
}


This does not dump the inode yet. I suspect it will point to other hardlinks of the dentry above.


Thanks for reading,

Olaf


2015-02-04 10:52:23

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Feb 4, 2015, at 2:04 AM, Olaf Hering <[email protected]> wrote:
> Today I got these warnings for the backup partition:
>
> [ 0.000000] Linux version 3.18.5 (abuild@build23) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Mon Jan 19 09:08:56 UTC 2015
>
> [102565.308869] kjournald starting. Commit interval 5 seconds
> [102565.315974] EXT3-fs (dm-5): using internal journal
> [102565.315980] EXT3-fs (dm-5): mounted filesystem with ordered data mode
> [104406.015708] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
> [104406.239904] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
> [104406.254162] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
> [104406.270793] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
> [104406.287443] EXT3-fs (dm-5): warning: ext3_dx_add_entry: Directory index full!
>
> According to google this indicates that the filesystem has more than 32k
> subdirectories. According to wikipedia this limit can be avoided by
> enabling the dir_index feature. According to dumpe2fs the feature is
> enabled already. Does the warning above mean something else?

How many files/subdirs in this directory? The old ext3 limit was 32000
subdirs, which the dir_index fixed, but the new limit is 65000 subdirs
without "dir_index" enabled.

The 65000 subdir limit can be exceeded by turning on the "dir_nlink"
feature of the filesystem with "tune2fs -O dir_nlink", to allow an
"unlimited" number of subdirs (subject to other directory limits, about
10-12M entries for 16-char filenames).
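
Something like the following (untested sketch, with the device name
taken from your dumpe2fs output; I'd do it on the unmounted filesystem):

umount /dev/dm-5
tune2fs -O dir_nlink /dev/dm-5
dumpe2fs -h /dev/dm-5 | grep -i features   # verify dir_nlink now shows up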


The other potential problem is if you create and delete a large number
of files from this directory, then the hash tables can become full and
the leaf blocks are imbalanced and some become full even when many others
are not (htree only has an average leaf fullness of 3/4 of each block).
This could probably happen if you have more than 5M files in a long-lived
directory in your backup fs. This can be fixed (for some time at least)
via "e2fsck -fD" on the unmounted filesystem to compact the directories.

We do have patches to allow 3-level hash tables for htree directories in
Lustre, instead of the current 2-level maximum. They also increase the
maximum directory size beyond 2GB. The last time I brought this up, it
didn't seem like it was of interest to others, but maybe opinions changed.

http://git.hpdd.intel.com/fs/lustre-release.git/blob/HEAD:/ldiskfs/kernel_patches/patches/sles11sp2/ext4-pdirop.patch

It's tangled together with another feature that allows (for Lustre at
least) concurrent create/lookup/unlink in a single directory, but there
was no interest in getting support for that into the VFS, so we only
use it when multiple clients are accessing the directory concurrently.

Cheers, Andreas

2015-02-04 13:58:34

by Olaf Hering

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Wed, Feb 04, Andreas Dilger wrote:

> How many files/subdirs in this directory? The old ext3 limit was 32000
> subdirs, which the dir_index fixed, but the new limit is 65000 subdirs
> without "dir_index" enabled.

See below:

> > # for t in d f l ; do echo "type $t: `find /media/BACKUP_OLH_500G/ -xdev -type $t | wc -l`" ; done
> > type d: 1051396
> > type f: 20824894
> > type l: 6876

> The 65000 subdir limit can be exceeded by turning on the "dir_nlink"
> feature of the filesystem with "tune2fs -O dir_nlink", to allow an
> "unlimited" number of subdirs (subject to other directory limits, about
> 10-12M entries for 16-char filenames).

I enabled this using another box, which turned the thing into an ext4
filesystem. Now ext4_dx_add_entry complains.

> The other potential problem is if you create and delete a large number
> of files from this directory, then the hash tables can become full and
> the leaf blocks are imbalanced and some become full even when many others
> are not (htree only has an average leaf fullness of 3/4 of each block).
> This could probably happen if you have more than 5M files in a long-lived
> directory in your backup fs. This can be fixed (for some time at least)
> via "e2fsck -fD" on the unmounted filesystem to compact the directories.

Ok, will try that. Thanks.

Olaf

2015-02-04 16:30:50

by Olaf Hering

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Wed, Feb 04, Olaf Hering wrote:

> Ok, will try that. Thanks.

root@linux-fceg:~ # time env -i /sbin/e2fsck -fDvv /dev/mapper/luks-861f1f73-7037-486a-9a8a-8588367fcf33
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Pass 5: Checking group summary information

BACKUP_OLH_500G: ***** FILE SYSTEM WAS MODIFIED *****

1886589 inodes used (7.20%, out of 26214400)
38925 non-contiguous files (2.1%)
28851 non-contiguous directories (1.5%)
# of inodes with ind/dind/tind blocks: 163156/45817/319
344807093 blocks used (82.21%, out of 419430400)
0 bad blocks
8 large files

859307 regular files
1026949 directories
0 character device files
0 block device files
0 fifos
19504583 links
322 symbolic links (316 fast symbolic links)
2 sockets
------------
21391163 files

real 78m31.853s
user 3m24.616s
sys 1m20.599s

root@probook:~ # dumpe2fs -h /dev/dm-5
dumpe2fs 1.41.14 (22-Dec-2010)
Filesystem volume name: BACKUP_OLH_500G
Last mounted on: /media/BACKUP_OLH_500G
Filesystem UUID: f0d41610-a993-4b77-8845-f0f07e37f61d
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file dir_nlink
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 26214400
Block count: 419430400
Reserved block count: 419430
Free blocks: 74623307
Free inodes: 24327811
First block: 1
Block size: 1024
Fragment size: 1024
Reserved GDT blocks: 256
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 512
Inode blocks per group: 128
Filesystem created: Tue Feb 12 18:24:13 2013
Last mount time: Wed Feb 4 17:02:29 2015
Last write time: Wed Feb 4 17:02:29 2015
Mount count: 1
Maximum mount count: -1
Last checked: Wed Feb 4 15:08:34 2015
Check interval: 0 (<none>)
Lifetime writes: 5039 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 55aeb7a2-43ca-4104-ad21-56d7a523dc8f
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 32M
Journal length: 32768
Journal sequence: 0x000a3a58
Journal start: 1

But still:

[44220.530001] scsi 4:0:0:0: Direct-Access ATA ST3500418AS CC45 PQ: 0 ANSI: 5
[44220.530455] sd 4:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/465 GiB)
[44220.530548] sd 4:0:0:0: [sdb] Write Protect is off
[44220.530557] sd 4:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[44220.530596] sd 4:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[44220.534670] sd 4:0:0:0: Attached scsi generic sg1 type 0
[44220.549014] sdb: sdb1
[44220.549721] sd 4:0:0:0: [sdb] Attached SCSI disk
[44238.004550] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null)
[45191.549831] EXT4-fs warning (device dm-5): ext4_dx_add_entry:1990: Directory index full!


Guess it's time to wipe it and go with something else.


Olaf

2015-02-04 21:32:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Feb 4, 2015, at 6:52 AM, Olaf Hering <[email protected]> wrote:
> On Wed, Feb 04, Andreas Dilger wrote:
>
>> How many files/subdirs in this directory? The old ext3 limit was 32000
>> subdirs, which the dir_index fixed, but the new limit is 65000 subdirs
>> without "dir_index" enabled.
>
> See below:
>
>>> # for t in d f l ; do echo "type $t: `find /media/BACKUP_OLH_500G/ -xdev -type $t | wc -l`" ; done
>>> type d: 1051396
>>> type f: 20824894
>>> type l: 6876

Is "BACKUP_OLH_500G" a single large directory with 1M directories and
20M files in it? In that case, you are hitting the limits for the
current ext4 directory size with 20M+ entries.

Otherwise, I would expect you have subdirectories, and the link/count
limits are per directory, so getting these numbers for the affected
directory is what is important.

Running something like http://www.pdsi-scidac.org/fsstats/ can give
you a good idea of the min/max/avg distributions of file/directory
sizes, ages, and counts for your filesystem.

Finding the largest directories with something like:

find /media/BACKUP_OLH_500G -type d -size +10M -ls

would tell us how big your directories actually are. The fsstats data
will also tell you what the min/max/avg filename length is, which may
also be a factor.

It would be surprising if you had such a large directory in a single
backup. We typically test up to 10M files in a single directory.

> root@linux-fceg:~ # time env -i /sbin/e2fsck -fDvv /dev/mapper/luks-861f1f73-7037-486a-9a8a-8588367fcf33
> e2fsck 1.42.12 (29-Aug-2014)
> 859307 regular files
> 1026949 directories
> 19504583 links

This implies that you have only 1.8M in-use files, while the above
reports 20M filenames, almost all of them hard links (about 23 links
per file). That said, the error being reported is on the name insert
and not on the link counts, so either there are some directories with
huge numbers of files, or the file names are so long that they cause
the directory leaves to fill up very quickly.
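
Quick sanity check of that ratio from your e2fsck output:

echo $(( 19504583 / 859307 ))   # integer result 22, i.e. roughly 22-23 names per in-use file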

> Block size: 1024

AH! This is the root of your problem. Formatting with 1024-byte
blocks means that the two-level directory hash tree can only hold
about 128^2 * (1024 / filename_length * 3 / 4) entries, maybe 500k
entries or less if the names are long.
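
To put rough numbers on that: a dx index entry is 8 bytes, so a 1KB
index block holds about 128 of them, and a leaf block holds roughly
blocksize/entry_size entries at ~3/4 fullness. The entry sizes below
are just my assumptions (about 48 bytes for long maildir-style names,
16 bytes for short ones):

echo $(( 128 * 128 * (1024 / 48) * 3 / 4 ))   # ~258k entries with long (~40-char) names
echo $(( 128 * 128 * (1024 / 16) * 3 / 4 ))   # ~786k entries with short names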

This wouldn't be the default for a 500GB filesystem, but maybe you
picked that to optimize space usage of small files a bit? Definitely
1KB blocksize is not optimal for performance, and 4KB is much better.

Unfortunately, you need to reformat to get to 4kB blocks.

Cheers, Andreas






2015-02-05 09:19:36

by Olaf Hering

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Wed, Feb 04, Andreas Dilger wrote:

> On Feb 4, 2015, at 6:52 AM, Olaf Hering <[email protected]> wrote:
> > On Wed, Feb 04, Andreas Dilger wrote:
> >
> >> How many files/subdirs in this directory? The old ext3 limit was 32000
> >> subdirs, which the dir_index fixed, but the new limit is 65000 subdirs
> >> without "dir_index" enabled.
> >
> > See below:
> >
> >>> # for t in d f l ; do echo "type $t: `find /media/BACKUP_OLH_500G/ -xdev -type $t | wc -l`" ; done
> >>> type d: 1051396
> >>> type f: 20824894
> >>> type l: 6876
>
> Is "BACKUP_OLH_500G" a single large directory with 1M directories and
> 20M files in it? In that case, you are hitting the limits for the
> current ext4 directory size with 20M+ entries.

It's organized in subdirs named hourly.{0..23}, daily.{0..6}, weekly.{0..3}
and monthly.{0..11}.

> Finding the largest directories with something like:
>
> find /media/BACKUP_OLH_500G -type d -size +10M -ls
>
> would tell us how big your directories actually are. The fsstats data
> will also tell you what the min/max/avg filename length is, which may
> also be a factor.

There is no output from this find command for large directories.

> > Block size: 1024
>
> AH! This is the root of your problem. Formatting with 1024-byte
> blocks means that the two-level directory hash tree can only hold
> about 128^2 * (1024 / filename_length * 3 / 4) entries, maybe 500k
> entries or less if the names are long.
>
> This wouldn't be the default for a 500GB filesystem, but maybe you
> picked that to optimize space usage of small files a bit? Definitely
> 1KB blocksize is not optimal for performance, and 4KB is much better.

Yes, I used 1024 blocksize to not waste space for the many small files.

I wonder what other filesystem would be able to cope. Does xfs or btrfs
do any better with this kind of data?

Thanks for the feedback!

Olaf

2015-02-06 06:52:38

by Andreas Dilger

[permalink] [raw]
Subject: Re: ext3_dx_add_entry complains about Directory index full

On Feb 5, 2015, at 2:19 AM, Olaf Hering <[email protected]> wrote:
> On Wed, Feb 04, Andreas Dilger wrote:
>>
>> Finding the largest directories with something like:
>>
>> find /media/BACKUP_OLH_500G -type d -size +10M -ls
>>
>> would tell us how big your directories actually are. The fsstats data
>> will also tell you what the min/max/avg filename length is, which may
>> also be a factor.
>
> There is no output from this find command for large directories.

I tested a 1KB blocksize filesystem, and the actual directory size was
only about 1.8MB when it ran out of space in the htree. That worked
out to be about 250k 12-character filenames in a single directory.

Even doubling the blocksize to 2KB would give you 2^3 = 8x as many
entries in the directory (twice as many index entries in each of
the two htree levels, and the leaf blocks are twice as large).
That works out to about 2M entries in a single directory, and I
doubt it would significantly impact the space usage unless you are
mostly backing up small files.
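
Using the same kind of back-of-the-envelope estimate with 2KB blocks
and an assumed ~48-byte entry size for long names:

echo $(( 256 * 256 * (2048 / 48) * 3 / 4 ))   # ~2.06M entries, about 8x the 1KB figure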

>>> Block size: 1024
>>
>> AH! This is the root of your problem. Formatting with 1024-byte
>> blocks means that the two-level directory hash tree can only hold
>> about 128^2 * (1024 / filename_length * 3 / 4) entries, maybe 500k
>> entries or less if the names are long.
>>
>> This wouldn't be the default for a 500GB filesystem, but maybe you
>> picked that to optimize space usage of small files a bit? Definitely
>> 1KB blocksize is not optimal for performance, and 4KB is much better.
>
> Yes, I used 1024 blocksize to not waste space for the many small files.

>>> Inode count: 26214400
>>> Block count: 419430400
>>> Reserved block count: 419430
>>> Free blocks: 75040285
>>> Free inodes: 24328812

You are using (419430400 - 75040285 - 419430) = 343970685 blocks
for (26214400 - 24328812) = 1885588 files, which is an average
file size of 182KB. You currently "waste" about half a block per
file (0.5KB/file), so 1885588 * 0.5KB = 920MB, or about 1/500 = 0.2%
of your filesystem, due to partially-used blocks at the end of every file.
With a 2KB blocksize this would increase to about 1840MB or 0.4%,
which really isn't very much space on a modern drive.
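
Checking those numbers:

echo $(( (419430400 - 75040285 - 419430) / (26214400 - 24328812) ))   # 182 blocks (1KB each) per file on average
echo $(( 1885588 / 2 / 1024 ))                                        # ~920MB wasted at 0.5KB per file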

>>> # of inodes with ind/dind/tind blocks: 163156/45817/319


However, there would also be increased efficiency because of fewer
index blocks. These indirect blocks are currently at least
163156 + 45817 * (1024 / 4 / 2 + 1) + 319 * (1024 / 4 + 1) = 6155532 KB
or 6011MB of space, which is much more than you have saved due to
having the small blocksize.
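
The same calculation as a one-liner:

echo $(( 163156 + 45817 * (1024 / 4 / 2 + 1) + 319 * (1024 / 4 + 1) ))   # 6155532 1KB blocks, about 6011MB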

If you formatted the filesystem with "-t ext4" (which enables "extents" among
other things) there would likely be no indirect/index blocks at all, since
extent-mapped inodes can address 256MB directly from the inode
(assuming fragmentation is not too bad) on a 2KB blocksize filesystem.

You get other benefits from reformatting with "-t ext4" like flex_bg
and uninit_bg that can speed up e2fsck times significantly.
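
A reformat along those lines could be as simple as the following (this
wipes the device, so only after the data has been copied off; the label
just reuses your current one):

mke2fs -t ext4 -b 4096 -L BACKUP_OLH_500G /dev/dm-5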

> I wonder what other filesystem would be able to cope. Does xfs or btrfs
> do any better with this kind of data?

I can't really say, since I've never used those filesystems. I suspect
you would do much better by increasing the blocksize on ext4 than with
what you have now.

Cheers, Andreas