2023-06-26 09:48:00

by Roberto Ragusa

Subject: packed_meta_blocks=1 incompatible with resize2fs?

Hi,

by using

mkfs.ext4 -E packed_meta_blocks=1

all the fs metadata is placed at the start of the disk.
Unfortunately I have not found a way to enlarge the fs
and maintain this property: new metadata is allocated from the
added space instead.
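
For reference, this is roughly the sequence I am testing (device/LV names
are just placeholders):

# create the fs with packed metadata on a small LV
mkfs.ext4 -E packed_meta_blocks=1 /dev/vg0/data

# later: grow the underlying LV, then the filesystem
lvextend -L +2T /dev/vg0/data
resize2fs /dev/vg0/data

# inspect where the metadata of the new block groups ended up
dumpe2fs /dev/vg0/data | grep -E 'Block bitmap at|Inode bitmap at|Inode table at'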

My attempts to work around the issue have failed:
- adding resize=4290772992 in mkfs doesn't help
- creating a bigger fs with packed_meta_blocks, then shrinking it,
then re-extending it to the original size still allocates from the
new space

Is there a way to have metadata at the beginning of the disk
and be able to enlarge the fs later?
Planning to do this on big filesystems, placing the initial part
on SSD extents; reformat+copy is not an option.

This mailing list appears to be the best place to ask; sorry if this is
considered off topic.
Thanks.

--
Roberto Ragusa mail at robertoragusa.it


2023-06-27 08:41:35

by Roberto Ragusa

Subject: Re: packed_meta_blocks=1 incompatible with resize2fs?

On 6/26/23 11:15, Roberto Ragusa wrote:
> My attempts to work around the issue have failed:
> - adding resize=4290772992 in mkfs doesn't help
> - creating a bigger fs with packed_meta_blocks, then shrinking it,
> then re-extending it to the original size still allocates from the
> new space

An additional attempt still does not fully reach the objective.

mkfs.ext4 -E resize=1073741824 -G 32768 /dev/mydev

I've tried raising -G to have a single flex_bg group, and this is
partially working: resizing places the "block bitmap" and "inode bitmap"
in the initial part of the disk, while the "inode table" is still taken
from the added space. (Tested on a rather full fs.)

Do I have any other option?

It looks like resize2fs has no way to force the inode table to be
where a fresh mkfs would put it, pushing existing blocks out of
the way.
What complexity can I expect to find if I try to add this
feature to resize2fs?

Is there a way to NOT add more inodes when resizing? I would
be happy to keep the same number I originally got, but apparently
each new block group is expected to have some inodes.

My use case should be of really wide interest: metadata on fast
hardware, data on slow hardware. Is there no solution?

Regards.

--
Roberto Ragusa mail at robertoragusa.it


2023-06-28 00:06:20

by Theodore Ts'o

Subject: Re: packed_meta_blocks=1 incompatible with resize2fs?

On Mon, Jun 26, 2023 at 11:15:14AM +0200, Roberto Ragusa wrote:
>
> mkfs.ext4 -E packed_meta_blocks=1
>
> all the fs metadata is placed at the start of the disk.
> Unfortunately I have not found a way to enlarge the fs
> and maintain this property: new metadata is allocated from the
> added space instead.

Yeah, sorry, there isn't a way of doing this. packed_meta_blocks=1 is
an mkfs.ext4 (aka mke2fs) option, and it influences where it chooses to
allocate the metadata blocks when the file system is created.
Unfortunately, (a) there is no place where the fact that the file
system was created with this mkfs option is recorded in the
superblock, and (b) once the file system starts getting used, the
blocks where the metadata would need to be allocated at the start of
the disk will get used for directory and data blocks.

> Is there a way to have metadata at the beginning of the disk
> and be able to enlarge the fs later?
> Planning to do this on big filesystems, placing the initial part
> on SSD extents; reformat+copy is not an option.

OK, so I think what you're trying to do is to create a RAID0 device
where the first part of the md raid device is stored on SSD, and after
that there would be one or more HDD devices. Is that right?

In theory, it would be possible to add a file system extension where
the first portion of the MD device is not allowed to be used for
normal block allocations, so that when you later grow the raid0
device, the SSD blocks are available for use for the extended
metadata. This would require adding support for this in the
superblock format, which would have to be an RO_COMPAT feature (that
is, kernels that didn't understand the meaning of the feature bit
would be prohibited from mounting the file system read/write, and
older versions of fsck.ext4 would be prohibited from touching the file
system at all). We would then have to add support for off-line and
on-line resizing for using the reserved SSD space for this use case.

The downside of this particular approach is that the SSD space would
be "wasted" until the file system is resized, and you have to know up
front how big you might want to grow the file system eventually. I
could imagine another approach might be that when you grow the file
system, if you are using an LVM-type approach, you would append a
certain number of LVM stripes backed by SSD, and a certain number
backed by HDD's, and then give a hint to the resizing code where the
metadata blocks should be allocated, and you would need to know ahead
of time how many SSD-backed LV stripes to allocate to support the
additional number of HDD-backed LV stripes.

This would require a bunch of abstraction violations, and it's a
little bit gross. More to the point, it would require a bunch of
development work, and I'm not sure there is interest in the ext4
development community, or the companies that back those developers,
for implementing such a feature.

Cheers,

- Ted

2023-06-28 14:39:19

by Roberto Ragusa

Subject: Re: packed_meta_blocks=1 incompatible with resize2fs?

On 6/28/23 02:03, Theodore Ts'o wrote:

> Unfortunately, (a) there is no place where the fact that the file
> system was created with this mkfs option is recorded in the
> superblock, and (b) once the file system starts getting used, the
> blocks where the metadata would need to be allocated at the start of
> the disk will get used for directory and data blocks.

Isn't resize2fs already capable of migrating directory and data blocks
away? According to the comments at the beginning of resize2fs.c, I mean.

> OK, so I think what you're trying to do is to create a RAID0 device
> where the first part of the md raid device is stored on SSD, and after
> that there would be one or more HDD devices. Is that right?

More or less.
Using LVM and having the first extents of the LV on a fast PV and all the
others on a slow PV.

> In theory, it would be possible to add a file system extension where
> the first portion of the MD device is not allowed to be used for
> normal block allocations[...]

I would not aim at such a complex approach.
My hope was that one of these two was possible:
1. reserve the bitmaps and inode table space since the beginning (with mke2fs
option resize, for example)
2. push things out of the way when the expansion is done

I could attempt to code something to do 2., but I would either have to
study resize2fs code, which is not trivial, or write something from scratch,
based only on the layout docs, which would also be complex and not easily
mergeable into resize2fs.

Other options may have been:
3. do not add new inodes when expanding (impossible by design, right?)
4. have an offline way (custom tool, or detecting conflicting files and
temporarily removing them, ...) to free the needed blocks

At the moment the best option I have is to continue doing what I've been
doing for years already: use dumpe2fs and debugfs to discover which bg
contain metadata+journal and selectively use "pvmove" to migrate
those extents (PE) to the fast PV. Automatable, but still messy.
Discovering "packed_meta_blocks" turned out to be not as great a find as I
was hoping, given that you then can't resize.
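
For reference, the dumpe2fs+pvmove workaround is roughly the following
(a sketch only; it assumes a 4KiB block size, 4MiB LVM extents and a
linear LV starting at PE 0 of the slow PV, and it ignores the
superblock/GDT and journal blocks, which I handle separately via debugfs):

# list bitmap and inode table locations, convert fs blocks to PE ranges
dumpe2fs /dev/vg0/data 2>/dev/null | awk -v bs=4096 -v pe=4194304 '
    /Block bitmap at|Inode bitmap at|Inode table at/ {
        n = split($4, r, "-")                  # "start" or "start-end"
        s = r[1]; e = (n == 2) ? r[2] : r[1]
        printf "%d-%d\n", int(s * bs / pe), int(e * bs / pe)
    }' | sort -u > meta_pe_ranges.txt

# move the physical extents holding that metadata to the fast PV
while read range; do
    pvmove /dev/slow_pv:$range /dev/fast_pv
done < meta_pe_ranges.txt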

Thanks.
--
Roberto Ragusa mail at robertoragusa.it


2023-06-28 15:55:19

by Theodore Ts'o

Subject: Re: packed_meta_blocks=1 incompatible with resize2fs?

On Wed, Jun 28, 2023 at 04:35:50PM +0200, Roberto Ragusa wrote:
> On 6/28/23 02:03, Theodore Ts'o wrote:
>
> > Unfortunately, (a) there is no place where the fact that the file
> > system was created with this mkfs option is recorded in the
> > superblock, and (b) once the file system starts getting used, the
> > blocks where the metadata would need to be allocated at the start of
> > the disk will get used for directory and data blocks.
>
> Isn't resize2fs already capable of migrating directory and data blocks
> away? According to the comments at the beginning of resize2fs.c, I mean.

Yes, but (a) that can only be done off-line (while the file system is
unmounted), and (b) migrating directory and data blocks is quite slow
and inefficient, and it doesn't necessarily leave the data files in the
most optimal layout (it didn't do as much as it could to minimize file
fragmentation during the migration process). It was intended for
moving a very small number of blocks, and while it could be improved,
that would be additional software engineering investment.

> 1. reserve the bitmaps and inode table space since the beginning (with mke2fs
> option resize, for example)
> 3. do not add new inodes when expanding (impossible by design, right?)

This would require file system format changes in the kernel, the
kernel on-line resizing code, e2fsck, and resize2fs for off-line
resizing. And while we've considered doing (3) for other reasons,
that's not sufficient for this use case, because when we add new block
groups, we have to add block and inode allocation bitmaps, the inode
table, and the block group descriptor blocks. It's not just the inode
table.

> 2. push things out of the way when the expansion is done
>
> I could attempt to code something to do 2., but I would either have to
> study resize2fs code, which is not trivial, or write something from scratch,
> based only on the layout docs, which would also be complex and not easily
> mergeable into resize2fs.
>
> 4. have an offline way (custom tool, or detecting conflicting files and
> temporarily removing them, ...) to free the needed blocks
>
> At the moment the best option I have is to continue doing what I've been
> doing for years already: use dumpe2fs and debugfs to discover which bg
> contain metadata+journal and selectively use "pvmove" to migrate
> those extents (PE) to the fast PV. Automatable, but still messy.
> Discovering "packed_meta_blocks" turned out to be not as great a find as I
> was hoping, given that you then can't resize.

Honestly, I suspect automating the determination of which block group
descriptor, inode table, and allocation bitmap blocks correspond to the
PEs that should be migrated to the fast PV is probably
the simplest thing to do. You should be able to do this using just
dumpe2fs; the journal is generally not going to move during a
migration.
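
(If it helps, the journal's current location can be checked with debugfs,
assuming the standard journal inode number 8 and a placeholder device name:

debugfs -R 'stat <8>' /dev/vg0/data

which prints the journal inode's extents/blocks.)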

- Ted

2023-06-28 21:49:01

by Andreas Dilger

Subject: Re: packed_meta_blocks=1 incompatible with resize2fs?

On Jun 27, 2023, at 6:03 PM, Theodore Ts'o <[email protected]> wrote:
>
> On Mon, Jun 26, 2023 at 11:15:14AM +0200, Roberto Ragusa wrote:
>> Is there a way to have metadata at the beginning of the disk
>> and be able to enlarge the fs later?
>> Planning to do this on big filesystems, placing the initial part
>> on SSD extents; reformat+copy is not an option.
>
> OK, so I think what you're trying to do is to create a RAID0 device
> where the first part of the md raid device is stored on SSD, and after
> that there would be one or more HDD devices. Is that right?
>
> In theory, it would be possible to add a file system extension where
> the first portion of the MD device is not allowed to be used for
> normal block allocations, so that when you later grow the raid0
> device, the SSD blocks are available for use for the extended
> metadata. This would require adding support for this in the
> superblock format, which would have to be an RO_COMPAT feature (that
> is, kernels that didn't understand the meaning of the feature bit
> would be prohibited from mounting the file system read/write, and
> older versions of fsck.ext4 would be prohibited from touching the file
> system at all). We would then have to add support for off-line and
> on-line resizing for using the reserved SSD space for this use case.

We already have 600TB+ Lustre storage targets with declustered RAID
volumes and will hit 1 PiB very soon (60 * 18TB drives). This results
in millions of block groups. This can cause issues with mounting, block
allocation, unlinking large files, etc., due to loading the metadata
from disk. Being able to store the filesystem metadata on flash will
avoid performance contention from HDD seeking without the cost of
all-flash storage (some Lustre filesystems are over 700PiB already).


We are investigating hybrid flash+HDD OSTs with sparse_super2 to put
all static metadata at the start of the LV on flash, and keep regular
data on HDDs. I think there isn't a huge amount of work needed to
get this working reasonably well, and it can be done incrementally.


Modify mke2fs to not force meta_bg to be enabled on filesystems over
256TiB. Locate the sparse_super2 superblock/GDT backup #1 in one of
the sparse_super backup groups (3^n, 5^n, 7^n) instead of always group
#1, so it can be found by e2fsck easily like sparse_super filesystems.

Locate the sparse_super2 backup #2 group in the last ({3,5,7}^n) group
within the filesystem (instead of the *actual* last group). That puts
it in the "slow" part of the device, but it is rarely accessed, and
is still far enough from the start of the device to avoid corruption.

In addition to avoiding the group #0/#1 GDT collision for backup #1 on
very large filesystems, this separates superblock/GDT copies further
apart for safety, and makes the #2 backup easier to find in case of
filesystem resize. The current use of the last filesystem group
is not "correct" after a resize, and is not easily found in this case.

This allows "mke2fs -O sparse_super2 -E packed_meta_blocks" allocation
of static metadata (superblocks, GDTs, bitmaps, inode tables, journal)
at the start of the block device, which puts most of the metadata IOPS
onto flash instead of HDD. No ext4/e2fsck changes are needed for this.
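
For example (device name is a placeholder; both the sparse_super2 feature
and the packed_meta_blocks option already exist in current e2fsprogs):

mke2fs -t ext4 -O sparse_super2 -E packed_meta_blocks=1 /dev/vg0/hybrid_ost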


Update e2fsck to check sparse_super groups for backup superblocks/GDTs.
I think this is useful independent of the sparse_super2 changes. This was
previously submitted as "ext2fs: automaticlly open backup superblocks"
but needs some updates before it can land:
https://patchwork.ozlabs.org/project/linux-ext4/patch/[email protected]/



To handle the allocation of *dynamic* metadata on the flash device
(indirect/index blocks, directories, xattr blocks) the ext4 mballoc
code needs to be informed about which groups are on the high IOPS
device. I think it makes sense to add a new EXT4_BG_IOPS = 0x0010
group descriptor flag to mark groups that are on "high IOPS" media,
and then mballoc can use these groups when allocating the metadata.
Normal file data blocks would not be allocated from these groups.
Using a flag instead of "last group" marker in the superblock allows
more flexibility (e.g. for resize after filesystem creation).

mke2fs would be modified to add a "-E iops=[START-]END[,START-END,...]"
option to know which groups to mark with the IOPS flag. The START/END
would be given with either block numbers, or with unit suffixes, so
it can be specified easily by the user, and converted internally to
group numbers based on other filesystem parameters.

The normal case would be a single "END" argument (default START=0),
which puts all the IOPS capacity at the start of the device. The ability
to specify multiple [,START-END] ranges is only there for completeness
and flexibility; I don't expect that we would be using it ourselves.
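
A hypothetical invocation of the proposed option (the iops= extended
option does not exist in mke2fs today), marking the first 512GiB of the
device as high-IOPS groups, could look like:

mke2fs -t ext4 -O sparse_super2 -E packed_meta_blocks=1,iops=512G /dev/vg0/hybrid_ost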

This dynamic metadata allocation on flash is not strictly necessary,
but still covers about 1/5 of metadata IOPS per write in my testing.

> The downside of this particular approach is that the SSD space would
> be "wasted" until the file system is resized, and you have to know up
> front how big you might want to grow the file system eventually.

I agree it doesn't make sense to hugely over-provision flash capacity.
It could be used for dynamic metadata allocations (as above), but that
also defeats the purpose of reserving this space. My calculation is
that the IOPS capacity should be about 0.5% of the filesystem size to
handle the static and dynamic metadata, depending on inode size/count,
journal size, etc.
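(As a rough example, 0.5% of a 1 PiB filesystem would be around 5 TiB of
flash capacity.)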

> I could imagine another approach might be that when you grow the file
> system, if you are using an LVM-type approach, you would append a
> certain number of LVM stripes backed by SSD, and a certain number
> backed by HDD's, and then give a hint to the resizing code where the
> metadata blocks should be allocated, and you would need to know ahead
> of time how many SSD-backed LV stripes to allocate to support the
> additional number of HDD-backed LV stripes.

That is what I was thinking also - if you are resizing (which is itself
an uncommon operation), then it makes sense to also add a large chunk
of flash at the end of the current device followed by the bulk of the
capacity on HDDs, and then have the resize code locate all of the new
static metadata on the new flash groups, like '-E packed_meta_blocks'.

In the past I also tried 1x 128MiB flash + 255x 128MiB HDD, matching
the flex_bg layout exactly to this, but it became complex very quickly.
LVM block remapping also slows down significantly with thousands of
segments, probably due to linear list walking, and still didn't get
to the level of 100k different segments needed for a 1PiB+ fs.


> This would require a bunch of abstraction violations, and it's a
> little bit gross. More to the point, it would require a bunch of
> development work, and I'm not sure there is interest in the ext4
> development community, or the companies that back those developers,
> for implementing such a feature.

We're definitely interested in hybrid storage devices like this, but
resizing is not a priority for us.


Cheers, Andreas





