2021-11-12 20:37:07

by Mark Hills

Subject: Maildir quickly hitting max htree

Surprised to hit a limit when handling a modest Maildir case; does this
reflect a bug?

rsync'ing to a new mail server, after fewer than 100,000 files there are
intermittent failures:

rsync: [receiver] open "/home/mark/Maildir/.robot/cur/1633731549.M990187P7732.yello.[redacted],S=69473,W=70413:2," failed: No space left on device (28)
rsync: [receiver] rename "/home/mark/Maildir/.robot/cur/.1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,.oBphKA" -> ".robot/cur/1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,": No space left on device (28)

The kernel:

EXT4-fs warning (device dm-4): ext4_dx_add_entry:2351: Directory (ino: 225811) index full, reach max htree level :2
EXT4-fs warning (device dm-4): ext4_dx_add_entry:2355: Large directory feature is not enabled on this filesystem

Reaching for 'large_dir' seems premature, as that feature is reported as
being useful for 10M+ files, and this directory is far smaller.

A 'bad' filename fails consistently. Assuming the absolute limit really
is 10M+ entries, is the tree grossly imbalanced?

Intuitively, 'htree level :2' does not sound particularly deep.

The source folder contains 195,000 files -- large, but not crazy. rsync
eventually hit a ceiling having written 177,482 of the files. I can still
create new files on the command line with non-Maildir names.

Ruled out quotas by disabling them with "tune2fs -O ^quota" and
remounting.

See below for additional info.

--
Mark


$ uname -a
Linux floyd 5.10.78-0-virt #1-Alpine SMP Thu, 11 Nov 2021 14:31:09 +0000 x86_64 GNU/Linux

$ mke2fs -q -t ext4 /dev/vg0/home

$ rsync -va --exclude 'dovecot*' yello:Maildir/. $HOME/Maildir

$ ls | head -15
1605139205.M487508P91922.yello.[redacted],S=7625,W=7775:2,
1605139440.M413280P92363.yello.[redacted],S=7632,W=7782:2,
1605139466.M699663P92402.yello.[redacted],S=7560,W=7710:2,
1605139479.M651510P92421.yello.[redacted],S=7474,W=7623:2,
1605139508.M934351P92514.yello.[redacted],S=7626,W=7776:2,
1605139596.M459228P92713.yello.[redacted],S=7559,W=7709:2,
1605139645.M57446P92736.yello.[redacted],S=7632,W=7782:2,
1605139670.M964535P92758.yello.[redacted],S=7628,W=7778:2,
1605139697.M273694P92807.yello.[redacted],S=7632,W=7782:2,
1605139748.M607989P92853.yello.[redacted],S=7560,W=7710:2,
1605139759.M655635P92868.yello.[redacted],S=5912,W=6018:2,
1605139808.M338286P93071.yello.[redacted],S=7628,W=7778:2,
1605139961.M915501P93235.yello.[redacted],S=7625,W=7775:2,
1605140303.M219848P93591.yello.[redacted],S=6898,W=7023:2,
1605140580.M166212P93921.yello.[redacted],S=6896,W=7021:2,

$ touch abc
[success]

$ touch 1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,
touch: cannot touch '1624626598.M748388P84607.yello.[redacted],S=17049,W=17352:2,': No space left on device

$ dumpe2fs /dev/vg0/home
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: ad26c968-d057-4d44-bef9-1e2df347580e
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags: signed_directory_hash
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 5225472
Block count: 21229568
Reserved block count: 851459
Overhead clusters: 22361
Free blocks: 8058180
Free inodes: 4799979
First block: 1
Block size: 1024
Fragment size: 1024
Group descriptor size: 64
Reserved GDT blocks: 96
Blocks per group: 8192
Fragments per group: 8192
Inodes per group: 2016
Inode blocks per group: 504
Flex block group size: 16
Filesystem created: Mon Nov 8 13:14:56 2021
Last mount time: Fri Nov 12 18:43:14 2021
Last write time: Fri Nov 12 18:43:14 2021
Mount count: 27
Maximum mount count: -1
Last checked: Mon Nov 8 13:14:56 2021
Check interval: 0 (<none>)
Lifetime writes: 14 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 32
Desired extra isize: 32
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 839d2871-b97e-456d-9724-096db15931b8
Journal backup: inode blocks
Checksum type: crc32c
Checksum: 0x5974a8b1
Journal features: journal_incompat_revoke journal_64bit journal_checksum_v3
Total journal size: 4096k
Total journal blocks: 4096
Max transaction length: 4096
Fast commit length: 0
Journal sequence: 0x00000a2a
Journal start: 702
Journal checksum type: crc32c
Journal checksum: 0x4d693e79




2021-11-13 12:05:12

by Mark Hills

Subject: Re: Maildir quickly hitting max htree

Andreas, thanks for such a prompt reply.

On Fri, 12 Nov 2021, Andreas Dilger wrote:

> On Nov 12, 2021, at 11:37, Mark Hills <[email protected]> wrote:
> >
> > Surprised to hit a limit when handling a modest Maildir case; does
> > this reflect a bug?
> >
> > rsync'ing to a new mail server, after fewer than 100,000 files there
> > are intermittent failures:
>
> This is probably because you are using 1KB blocksize instead of 4KB,
> which reduces the size of each tree level by the cube of the ratio, so
> 64x. I guess that was selected because of very small files in the
> maildir?

Interesting! The 1KB block size was not explicitly chosen. There was no
plan other than using the defaults.

However, I did forget that this is a VM installed from a base image. The
root cause is likely to be that the /home partition has been enlarged from
a small size to 32GB.

Is block size the only factor? If so, a patch like the one below
(untested) could make it clear that it's relevant, and would have saved
the question in this case.
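
As a very rough back-of-envelope (my own assumptions: roughly 8 bytes per
interior index entry and ~100 bytes per leaf entry for filenames this
long, ignoring block headers), a two-level htree gives approximately:

$ echo $(( 128 * 128 * 10 ))    # 1k blocks: ~128-way fanout, ~10 names per leaf
163840
$ echo $(( 512 * 512 * 40 ))    # 4k blocks: ~512-way fanout, ~40 names per leaf
10485760

which would be consistent with rsync stalling at 177,482 files here, and
with the 64x ratio you describe.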

[...]
> If you have a relatively recent kernel, you can enable the "large_dir"
> feature to allow 3-level htree, which would be enough for another factor
> of 1024/8 = 128 more entries than now (~12M).

The system is not yet in use, so I think it's better we reformat here, and
get a block size chosen by the experts :)
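
(For the record, my understanding is that enabling it on the existing
filesystem would be a one-liner with a new enough e2fsprogs, something
along the lines of:

$ tune2fs -O large_dir /dev/vg0/home

but a sane block size seems the better fix while the filesystem is still
empty.)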

These days I think VMs make it more common to enlarge a filesystem from a
small size. We could have picked this up earlier with a warning from
resize2fs; e.g. if the block size no longer matches the one that would be
chosen by default for the new size. That would catch it before anyone
puts a 1KB block size into production.

Thanks for identifying the issue.

--
Mark


From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
From: Mark Hills <[email protected]>
Date: Sat, 13 Nov 2021 11:46:50 +0000
Subject: [PATCH] Block size has an effect on the index size

---
fs/ext4/namei.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f3bbcd4efb56..8965bed4d7ff 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
}
if (add_level && levels == ext4_dir_htree_level(sb)) {
ext4_warning(sb, "Directory (ino: %lu) index full, "
- "reach max htree level :%d",
- dir->i_ino, levels);
+ "reach max htree level :%d"
+ "with block size %lu",
+ dir->i_ino, levels, sb->s_blocksize);
if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
ext4_warning(sb, "Large directory feature is "
"not enabled on this "
--
2.33.1

2021-11-13 17:20:02

by Andreas Dilger

Subject: Re: Maildir quickly hitting max htree

On Nov 13, 2021, at 04:05, Mark Hills <[email protected]> wrote:
>
> Andreas, thanks for such a prompt reply.
>
>> On Fri, 12 Nov 2021, Andreas Dilger wrote:
>>
>>> On Nov 12, 2021, at 11:37, Mark Hills <[email protected]> wrote:
>>>
>>> Surprised to hit a limit when handling a modest Maildir case; does
>>> this reflect a bug?
>>>
>>> rsync'ing to a new mail server, after fewer than 100,000 files there
>>> are intermittent failures:
>>
>> This is probably because you are using 1KB blocksize instead of 4KB,
>> which reduces the size of each tree level by the cube of the ratio, so
>> 64x. I guess that was selected because of very small files in the
>> maildir?
>
> Interesting! The 1Kb block size was not explicitly chosen. There was no
> plan other than using the defaults.
>
> However I did forget that this is a VM installed from a base image. The
> root cause is likely to be that the /home partition has been enlarged from
> a small size to 32Gb.
>
> Is block size the only factor? If so, a patch like below (untested) could
> make it clear it's relevant, and saved the question in this case.

The patch looks reasonable, but should be submitted separately with
[patch] in the subject so that it will not be lost.

You can also add on your patch:

Reviewed-by: Andreas Dilger <[email protected]>


Cheers, Andreas

>
> [...]
>> If you have a relatively recent kernel, you can enable the "large_dir"
>> feature to allow 3-level htree, which would be enough for another factor
>> of 1024/8 = 128 more entries than now (~12M).
>
> The system is not yet in use, so I think it's better we reformat here, and
> get a block size chosen by the experts :)
>
> These days I think VMs make it more common to enlarge a filesystem from a
> small size. We could have picked this up earlier with a warning from
> resize2fs; eg. if the block size will no longer match the one that would
> be chosen by default. That would pick it up before anyone puts 1Kb block
> size into production.
>
> Thanks for identifying the issue.
>
> --
> Mark
>
>
> From 8604c50be77a4bc56a91099598c409d5a3c1fdbe Mon Sep 17 00:00:00 2001
> From: Mark Hills <[email protected]>
> Date: Sat, 13 Nov 2021 11:46:50 +0000
> Subject: [PATCH] Block size has an effect on the index size
>
> ---
> fs/ext4/namei.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index f3bbcd4efb56..8965bed4d7ff 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2454,8 +2454,9 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
> }
> if (add_level && levels == ext4_dir_htree_level(sb)) {
> ext4_warning(sb, "Directory (ino: %lu) index full, "
> - "reach max htree level :%d",
> - dir->i_ino, levels);
> + "reach max htree level :%d"
> + "with block size %lu",
> + dir->i_ino, levels, sb->s_blocksize);
> if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
> ext4_warning(sb, "Large directory feature is "
> "not enabled on this "
> --
> 2.33.1

2021-11-14 17:44:24

by Theodore Ts'o

Subject: Re: Maildir quickly hitting max htree

On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
>
> Interesting! The 1Kb block size was not explicitly chosen. There was no
> plan other than using the defaults.
>
> However I did forget that this is a VM installed from a base image. The
> root cause is likely to be that the /home partition has been enlarged from
> a small size to 32Gb.

How small was the base image? As documented in the man page for
mke2fs.conf, for file systems that are smaller than 3MB, mke2fs uses
the parameters in /etc/mke2fs.conf for type "floppy" (back when 3.5
inch floppies were either 1.44MB or 2.88MB). So it must have been a
really tiny base image to begin with.

> These days I think VMs make it more common to enlarge a filesystem from a
> small size. We could have picked this up earlier with a warning from
> resize2fs; eg. if the block size will no longer match the one that would
> be chosen by default. That would pick it up before anyone puts 1Kb block
> size into production.

It would be a bit tricky for resize2fs to do that, since it doesn't
know what might have been in the mke2fs.conf file at the time the file
system was created. Distributions or individual system administrators
are free to modify that config file.

It is a good idea for resize2fs to give a warning, though. What I'm
thinking might make sense is that if resize2fs is expanding the file
system by more than, say, a factor of 10x (e.g., expanding a file
system from 10MB to 100MB, or 3MB to 20GB), it gives a warning that
inflating file systems is an anti-pattern that will not necessarily
result in the best file system performance. Even if the blocksize
isn't 1k, when a file system is shrunk to a very small size, and then
expanded to a very large size, the file system will not be optimal.

For example, the default size of the journal is based on the file
system size. Like the block size, it can be overridden on the
command-line, but it's unlikely that most people preparing the file
system image will remember to consider this.

More importantly, when a file system is shrunk, the data blocks are
moved without a whole lot of optimization, and then when the file
system is expanded, files that were pre-loaded into the image, and
were located in the parts of the file system that had to be evacuated
as part of the shrinking process, remain in whatever fragmented form
they were left in after the shrink operation.

The way things work for Amazon's and Google's cloud is that the image
is created with a size of 8GB or 10GB, and best practice would be to
create a separate EBS volume for the data partition. This allows the
easy upgrade or replacement of the root file system. For example, after
you check your project keys into a public repo (or you fail to apply a
security upgrade for an actively exploited zero-day) and your system
gets rooted to a fare-thee-well, it's much simpler to completely throw
away the root image and reinstall with a fresh system image, without
having to separate your data files from the system image.

- Ted

2021-11-16 17:52:58

by Mark Hills

Subject: Re: Maildir quickly hitting max htree

On Sat, 13 Nov 2021, Andreas Dilger wrote:

> >>> On Nov 12, 2021, at 11:37, Mark Hills <[email protected]> wrote:
> >>>
> >>> Surprised to hit a limit when handling a modest Maildir case; does
> >>> this reflect a bug?
> >>>
> >>> rsync'ing to a new mail server, after fewer than 100,000 files there
> >>> are intermittent failures:
> >>
> >> This is probably because you are using 1KB blocksize instead of 4KB,
[...]
> > Is block size the only factor? If so, a patch like below (untested) could
> > make it clear it's relevant, and saved the question in this case.
>
> The patch looks reasonable, but should be submitted separately with
> [patch] in the subject so that it will not be lost.
>
> You can also add on your patch:
>
> Reviewed-by: Andreas Dilger <[email protected]>

Thanks. When I get a moment I'll aim to test the patch and submit it
properly.

--
Mark

2021-11-16 19:31:15

by Mark Hills

Subject: Re: Maildir quickly hitting max htree

On Sun, 14 Nov 2021, Theodore Ts'o wrote:

> On Sat, Nov 13, 2021 at 12:05:07PM +0000, Mark Hills wrote:
> >
> > Interesting! The 1Kb block size was not explicitly chosen. There was no
> > plan other than using the defaults.
> >
> > However I did forget that this is a VM installed from a base image. The
> > root cause is likely to be that the /home partition has been enlarged from
> > a small size to 32Gb.
>
> How small was the base image?

/home was created with 256MB, never shrunk.

> As documented in the man page for mke2fs.conf, for file systems that are
> smaller than 3mb, mke2fs use the parameters in /etc/mke2fs.conf for type
> "floppy" (back when 3.5 inch floppies were either 1.44MB or 2.88MB).
> So it must have been a really tiny base image to begin with.

Small, but not microscopic :)

I see a definition in mke2fs.conf for "small" which uses 1024 blocksize,
and I assume it originated there and not "floppy".
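
For reference, the stock e2fsprogs stanza looks like this (our copy may
differ in the details):

$ grep -A 3 'small =' /etc/mke2fs.conf
	small = {
		blocksize = 1024
		inode_ratio = 4096
	}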

> > These days I think VMs make it more common to enlarge a filesystem from a
> > small size. We could have picked this up earlier with a warning from
> > resize2fs; eg. if the block size will no longer match the one that would
> > be chosen by default. That would pick it up before anyone puts 1Kb block
> > size into production.
>
> It's would be a bit tricky for resize2fs to do that, since it doesn't
> know what might be in the mke2fs.conf file at the time when the file
> system when the file system was creaeted. Distributions or individual
> system adminsitrators are free to modify that config file.

No need to time travel back -- it's complicated, and actually less
relevant?

I haven't looked at the resize2fs code, so this comes just from a user's
point of view, but... if it is already reading mke2fs.conf, it could make
comparisons using an equivalent new filesystem as a benchmark.

In the spirit of, e.g. "your resized filesystem will have a block size of
1024, but a new filesystem of this size would use 4096".

Then you can compare any absolute metric of the filesystem that way.
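
(You can approximate that check by hand today -- roughly what I should
have done -- by comparing the existing superblock against a dry run of
mke2fs, which prints the parameters it would choose without writing
anything:

$ dumpe2fs -h /dev/vg0/home | grep -i 'block size'
$ mke2fs -n -t ext4 /dev/vg0/home

but having resize2fs do the comparison would catch it for everyone.)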

The advantage being...

> It is a good idea for resize2fs to give a warning, though. What I'm
> thinking that what might sense is if resize2fs is expanding the file
> system by more than, say a factor of 10x (e.g., expanding a file system
> from 10mb to 100mb, or 3mb to 20gb)

... that the benchmark gives you a comparison that won't drift, e.g. if
you resize by +90% several times.

And it reflects any preferences that may be set in the configuration.

> to give a warning that inflating file systems is an anti-pattern that
> will not necessarily result in the best file system performance.

I imagine it's not a panacea, but it would be good to be more concrete on
what the gotchas are; "bad performance" is vague, and since the tool
exists it must be possible to use it properly.

I'll need to consult the docs, but so far have been made aware of:

* block size
(which has a knock-on effect on the limit of files per directory)

* journal size
(not in the configuration file -- can this be adjusted? see the note
after this list)

* files get fragmented when shrinking a filesystem
(but this is similar to any full file system?)
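
On the journal size: I believe (untested here) it can be changed after
the fact by removing and re-creating the journal on the unmounted
filesystem, something like:

$ tune2fs -O ^has_journal /dev/vg0/home
$ tune2fs -j -J size=64 /dev/vg0/home

with the size in megabytes; 64 is just an illustrative figure.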

These are all things I'm generally aware of, along with their
implications; they are just easy to miss when you're busy and focused on
other aspects (it completely escaped me that the filesystem had been
enlarged when I began this thread!)

That's why the patch in the other thread is not a bad idea; it's just a
reminder that block size is relevant.

For info, our use case here is a base image used to deploy persistent
VMs which use very different disk sizes. The base image is built using
packer+QEMU, managed as code. It is then written out using "dd", and the
LVM partitions are expanded without needing to go single-user or take the
system offline. This method is appealing because it allows us to
pre-populate /home with some small amount of data; SSH keys etc.

For the case that started this thread, we just wiped the filesystem and
made a new one at the target size of 32GB.
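
That was simply a fresh mke2fs, which at this size picks a 4KB block size
on its own; forcing it explicitly would also have worked:

$ mke2fs -q -t ext4 -b 4096 /dev/vg0/home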

> Even if the blocksize isn't 1k, when a file system is shrunk
[...more on shrinking]

Many thanks,

--
Mark

2021-11-17 05:21:05

by Theodore Ts'o

Subject: Re: Maildir quickly hitting max htree

On Tue, Nov 16, 2021 at 07:31:10PM +0000, Mark Hills wrote:
>
> I see a definition in mke2fs.conf for "small" which uses 1024 blocksize,
> and I assume it originated there and not "floppy".

Ah, yes, I forgot that we also had the "small" config for file systems
less than 512MB.


There are a bunch of anti-patterns that I've seen with users using
VMs. And I'm trying to understand them, so we can better document
why folks shouldn't be doing things like that. For example, one of
the anti-patterns that I see on Cloud systems (e.g., at Amazon, Google
Cloud, Azure, etc.) is people who start with a super-tiny file system,
say, 10GB, and then let it grow until it's 99% full, and then they grow
it by another GB, so it's now 11GB, and then they fill it until it's
99% full, and then they grow it by another GB. That's because (for
example) Google's PD Standard is 4 cents per GB per month, and they
are trying to cheap out by not wanting to spend that extra 4 cents per
month until they absolutely, positively have to. Unfortunately, that
leaves the file system horribly fragmented, and performance is
terrible. (BTW, this is true no matter what file system they use:
ext4, xfs, etc.)

File systems were originally engineered assuming that resizing would
be done in fairly big chunks. For example, you might have a 50 TB
disk array, and you add another 10TB disk to the array, and you grow
the file system by 10TB. You can grow it in smaller chunks, but
nothing comes for free, and trying to save 4 cents per month as
opposed to growing a file system from say, 10GB to 20GB on Google
Cloud, and paying an extra, princely *forty* cents (USD) per month
will probably result in far better performance, which you'll more than
make up when you consider the cost of the CPU and memory of said
VM....

> I haven't looked at resize2fs code, so this comes just from a user's
> point-of-view but... if it is already reading mke2fs.conf, it could make
> comparisons using an equivalent new filesystem as benchmark.

Resize2fs doesn't read mke2fs.conf, and my point was that the system
where resize2fs is run is not necessarily the same as the system where
mke2fs was run, especially when it comes to cloud images for the root
file system.

> I imagine it's not a panacea, but it would be good to be more concrete on
> what the gotchas are; "bad performance" is vague, and since the tool
> exists it must be possible to use it properly.

Well, we can document the issues in much greater detail in a man page,
or in an LWN article, but it's a bit complicated to explain it all in
warning messages built into resize2fs. There's the failure mode of
starting with a 100MB file system containing a root file system,
dropping it on a 10TB disk, or even worse, a 100TB RAID array, and
trying to blow the 100MB file system up to 100TB. There's the
failure mode of waiting until the file system is 99% full, and then
expanding it one GB at a time, repeatedly, until it's several hundred
GB or TB, and then users wonder why performance is crap.

There are so many different ways one can shoot one's foot off, and
until I understand why people are designing their own particular
foot-guns, it's hard to write a man page warning about all of the
particular bad ways one can be a system administrator. Unfortunately,
my imagination is not necessarily up to the task of figuring them all
out. For example...


> For info, our use case here is the base image used to deploy persistent
> VMs which use very different disk sizes. The base image is build using
> packer+QEMU managed as code. Then written using "dd" and LVM partitions
> expanded without needing to go single-user or take the system offline.
> This method is appealling because it allows to pre-populate /home with
> some small amount of data; SSH keys etc.

May I suggest using a tar.gz file instead and unpacking it onto a
freshly created file system? It's easier to inspect and update the
contents of the tarball, and it's actually going to be smaller than
using a file system image and then trying to expand it using
resize2fs....

To be honest, that particular use case didn't even *occur* to me,
since there are so many more efficient ways it can be done. I take it
that you're trying to do this before the VM is launched, as opposed to
unpacking it as part of the VM boot process?

If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
directory on the host, and launch the VM using a virtio-9p. This can
be done by launching qemu with arguments like this:

qemu ... \
-fsdev local,id=v_tmp,path=/tmp/kvm-xfstests-tytso,security_model=none \
-device virtio-9p-pci,fsdev=v_tmp,mount_tag=v_tmp

and then in the guest's /etc/fstab, you might have an entry like this:

v_tmp /vtmp 9p trans=virtio,version=9p2000.L,msize=262144,nofail,x-systemd.device-timeout=1 0 0

This will result in everything in /tmp/kvm-xfstests-tytso on the host
system being visible as /vtmp in the guest. A worked example of this
can be found at:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L115
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/kvm-xfstests#L175
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/etc/fstab#L7

If you are using Google Cloud Platform or AWS, you could use Google
Cloud Storage or Amazon S3, respectively, and then just copy the
tar.gz file into /run and unpack it. An example of how this might be
done can be found here for Google Cloud Storage:

https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/usr/local/lib/gce-load-kernel#L65

Cheers,

- Ted

2021-11-17 13:13:16

by Mark Hills

Subject: Re: Maildir quickly hitting max htree

Ted, I can't speak for everyone, but perhaps I can give some insight:

> There are a bunch of anti-patterns that I've seen with users using VM's.
> And I'm trying to understand them, so we can better document why folks
> shouldn't be doing things like that.

I wouldn't be so negative :) My default position is 'cloud sceptic', but
dynamic resources change the trade-offs between layers of abstraction. I
wouldn't expect to always be correct about someone's overall business when
saying "don't do that".

> For example, one of the anti-patterns that I see on Cloud systems (e.g.,
> at Amazon, Google Cloud, Azure, etc.) is people who start with a
> super-tiny file system, say, 10GB,

It sounds interesting that hosting VMs has renewed interest in smaller
file systems -- more, smaller hosts. 10GB seems a lot to me ;)

> and then let it grow until it's 99 full, and then they grow it by
> another GB, so it's now 11GB, and then they fill it until it's 99% full,
> and then they grow it by another GB. That's because (for example)
> Google's PD Standard is 4 cents per GB per month, and they are trying to
> cheap out by not wanting to spend that extra 4 cents per month until
> they absolutely, positively have to.

One reason: people are rarely working with single hosts. Small costs are
multiplied up. 4 cents is not a lot per host, but everything is a %
increase to overall costs.

Cloud providers sold for many years on "use only what you need", so it
would not be surprising for people to be tightly optimising for that.

[...]
> > I haven't looked at resize2fs code, so this comes just from a user's
> > point-of-view but... if it is already reading mke2fs.conf, it could make
> > comparisons using an equivalent new filesystem as benchmark.
>
> Resize2fs doens't read mke2fs.conf, and my point was that the system
> where resize2fs is run is not necessary same as the system where
> mke2fs is run, especially when it comes to cloud images for the root
> file system.

I totally understood your point; it sounds like maybe I wasn't clear in
mine (which got trimmed):

There's no need to worry about the state of mke2fs.conf on the host which
originally created the filesystem -- that system is no longer relevant.

At the point of expanding you have all the relevant information for an
appropriate warning.

> > I imagine it's not a panacea, but it would be good to be more concrete
> > on what the gotchas are; "bad performance" is vague, and since the
> > tool exists it must be possible to use it properly.
>
> Well, we can document the issues in much greater detail in a man page,
> or in LWN article, but we need it's a bit complicated to explain it all
> warning messages built into resize2fs. There's the failure mode of
> starting with a 100MB file system containing a root file system,
> dropping it on a 10TB disk, or even worse, a 100TB raid array, and
> trying to blow it up the 100MB file system to 100TB. There's the
> failure mode of waiting until the file system is 99% full, and then
> expanding it one GB at a time, repeatedly, until it's several hundred GB
> or TB, and then users wonder why performance is crap.
>
> There are so many different ways one can shoot one's foot off, and
> until I understand why people are desining their own particular
> foot-guns, it's hard to write a man page warning about all of the
> particular bad ways one can be a system administrator. Unfortunately,
> my imagination is not necessarily up to the task of figuring them all
> out

Neither your imagination, nor mine, nor anyone else's :) People will do
cranky things.

But also nobody reasonable is expecting you to do their job for them.

You haven't really said which major underlying properties of the
filesystem are inflexible when re-sizing, so I'm keen for more detail.

It's the sort of information that makes for better reasoning: at least a
hint of the appropriate trade-offs, and not having to keep explaining on
mailing lists ;)

But saying "performance is crap" or "failure mode" is vague and so I can
understand why people persevere (either knowingly or unknowingly) -- in my
experience, many people work on the basis that something 'seems' to work
ok.

Trying to be tangible here, I made a start on a section for the resize2fs
man page. Would it be worthwhile to flesh this out and would you consider
helping to do that?

CAVEATS

Re-sizing a filesystem should not be assumed to result in a filesystem
with an identical specification to one created at the new size. More
often than not, it will differ.

Specifically, enlarging or shrinking a filesystem does not adjust
these properties:

* block size, which impacts directory indexes and the upper limit
on number of files in a directory;

* journal size which affects [write performance?]

* files which have become fragmented due to space constraints;
see e2freefrag(8) and e4defrag(8)

* [any more? or is this even comprehensive? Only
the major contributors needed to begin with]

It really doesn't need to be very long; just a signpost in the right
direction.

In my case, I have some knowledge of filesystem internals (much less than
you, but probably more than most) but had completely forgotten that this
was a resized filesystem (as well as one resized more than the original
image ever intended). It just takes a nudge or reminder in the right
direction, not much more.

The tiny patch to dmesg (elsewhere in the thread) would have indicated
the relevance of the block size, reminding me of the resize; and a tiny
addition to the man page would help decide what action to take --
reformat, or not.

> For example...
>
> > For info, our use case here is the base image used to deploy persistent
> > VMs which use very different disk sizes. The base image is build using
> > packer+QEMU managed as code. Then written using "dd" and LVM partitions
> > expanded without needing to go single-user or take the system offline.
> > This method is appealling because it allows to pre-populate /home with
> > some small amount of data; SSH keys etc.
>
> May I suggest using a tar.gz file instead and unpacking it onto a
> freshly created file sysetem? It's easier to inspect and update the
> contents of the tarball, and it's actually going to be smaller than
> using a file system iamge and then trying to expand it using
> resize2fs....
>
> To be honest, that particular use case didn't even *occur* to me,
> since there are so many more efficient ways it can be done.

But this considers 'efficiency' in the context of filesystem performance
only.

Unpacking a tar.gz requires custom scripting, is slow, and extra 'one
off' steps on boot-up introduce complexity. These are the 'installers'
that everyone hates :)

Also I presume there is COW at the image level on some infrastructure.

And none of this covers changing the size of an online system.

> I take it that you're trying to do this before the VM is launched, as
> opposed to unpacking it as part of the VM boot process?

Yes; and I think that's really the spirit of an "image", right?

> If you're using qemu/KVM, perhaps you could drop the tar.gz file in a
> directory on the host, and launch the VM using a virtio-9p. This can
> be done by launching qemu with arguments like this:
>
> qemu ... \
[...]

We use qemu+Makefile to build the images, but when running on cloud
infrastructure most providers are limited; controls like this are not
available.

> If you are using Google Cloud Platform or AWS, you could use Google
> Cloud Storage or Amazon S3, respectively, and then just copy the tar.gz
> file into /run and unpack it.

We're not using Google or AWS. In general I can envisage extra work to
construct a secured side-channel to distribute supplementary .tar.gz
files.

I don't think you should be disheartened by people resizing in these ways.
I'm not an advocate, and understand your points.

But it sounds like it is allowing people to achieve things, and it _is_ a
positive sign that the abstractions are well designed -- leaving users
unaware of the caveats, which are hidden from them until they bite.

A rhetorical question, and with no prior knowledge, but: if there are
benefits to these extreme resizes then, rather than say "don't do that",
could it be possible to generalise the maintenance of any filesystem as
creating a 'zero-sized' one and resizing it upwards? i.e. the real code
exists in resize, not in creation.

Thanks

--
Mark