2012-02-22 13:31:16

by Roberto Ragusa

Subject: Mkfs option to choose where metadata will be stored

Hi, [please CC me, I'm not subscribed]

Is there any way to force allocation of metadata to a different device
or to a specific part (e.g. the beginning) of the partition?

I see there is -O journal_dev to redirect the journal.
Can I do something different for metadata?

My idea is to have metadata on SSD and data on HDD.
With a linear RAID mapping, I would get a device which is a few GB of
SSD followed by a lot of HDD space.

Alternatively, I'm going to experiment with an approach where a volume group
is built on two PVs: one HDD and one SSD. The idea is to create the LV on
the HDD and then move some extents (for example 0,1,64,65,128,129,...) to
the SSD, so that metadata happens to be on the SSD.
From what I found about the on-disk format, this is highly approximate
and surely inelegant, so I wonder if a simpler solution exists.

Thanks.

--
Roberto Ragusa mail at robertoragusa.it


2012-02-22 16:54:22

by Andreas Dilger

Subject: Re: Mkfs option to choose where metadata will be stored

On 2012-02-22, at 6:20, Roberto Ragusa <[email protected]> wrote:

> Hi, [please CC me, I'm not subscribed]
>
> Is there any way to force allocation of metadata to a different device
> or to a specific part (e.g. the beginning) of the partition?
>
> I see there is -O journal_dev to redirect the journal.
> Can I do something different for metadata?
>
> My idea is to have metadata on SSD and data on HDD.
> With a linear RAID mapping, I would get a device which is a few GB of
> SSD followed by a lot of HDD space.

I've tested something similar to this myself. The way I did it is to use the "flex_bg" option "-G 256" to put the metadata into a single 128MB group, which is allocated on an SSD LVM PV, then 255 x 128MB on an HDD PV.

This pattern repeats for the entire LV size, and can easily be created with a 128MB LV on the SSD, then alternating lvextend of (255 * 128MB) from the HDD PV and 128MB from the SSD PV until the desired size is reached or you run out of space on one of the PVs.
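
For example, something along these lines should build that layout (an untested
sketch; the device, VG and LV names are placeholders):

  # 128MB physical extents, so LVM extents line up with the 128MB block groups
  pvcreate /dev/ssd /dev/hdd
  vgcreate -s 128m vg_mix /dev/ssd /dev/hdd
  # group 0 (metadata) goes on the SSD
  lvcreate -l 1 -n lv_fs vg_mix /dev/ssd
  # alternate 255 data groups on the HDD with 1 metadata group on the SSD
  while lvextend -l +255 vg_mix/lv_fs /dev/hdd &&
        lvextend -l +1   vg_mix/lv_fs /dev/ssd; do :; done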

The exact formatting options I used are:

mke2fs -t ext4 -i 69905 -G 256 -E resize=4290772992 {dev}

this will lay everything out on the LV nicely. Note that it assumes an average file size of about 69kB here. Increasing this is fine, but making it smaller would disrupt the layout.
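
Roughly, the arithmetic behind that -i value (assuming 4KB blocks and the
default 256-byte ext4 inodes) goes like this:

  134217728 bytes per 128MB group / 69905 bytes-per-inode ~= 1920 inodes/group
  1920 inodes * 256 bytes = 491520 bytes = 120 blocks of inode table per group
  256 groups * (120 itable + 1 block bitmap + 1 inode bitmap) = 31232 blocks ~= 122MB

so the bitmaps and inode tables of a whole 256-group flex group, plus the
group descriptors and the GDT blocks reserved for the 16TB resize limit,
fit inside the first 128MB group.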

> Alternatively, I'm going to experiment with an approach where a volume group
> is built on two PVs: one HDD and one SSD. The idea is to create the LV on
> the HDD and then move some extents (for example 0,1,64,65,128,129,...) to
> the SSD, so that metadata happens to be on the SSD.
> From what I found about the on-disk format, this is highly approximate
> and surely inelegant, so I wonder if a simpler solution exists.
>
> Thanks.
>
> --
> Roberto Ragusa mail at robertoragusa.it

2012-02-22 22:14:01

by Roberto Ragusa

Subject: Re: Mkfs option to choose where metadata will be stored

On 02/22/2012 05:54 PM, Andreas Dilger wrote:
> On 2012-02-22, at 6:20, Roberto Ragusa <[email protected]> wrote:
>
>> My idea is to have metadata on SSD and data on HDD.
>> With a linear RAID mapping, I would get a device which is a few GB of
>> SSD followed by a lot of HDD space.
>
> I've tested something similar to this myself. The way I did it is to use the "flex_bg" option "-G 256" to put the metadata into a single 128MB group, which is allocated on an SSD LVM PV, then 255 x 128MB on an HDD PV.

I actually discovered flex_bg a few minutes after sending my mail. :-)
I tested -G 1048576 (that is "infinity") and played a little
with -i to keep down the SSD usage (my current average filesize
is 3MB, so I can have a big value), discovering that the bitmaps are
in any case dominant.

> This pattern repeats for the entire LV size, and can easily be created with a 128MB LV on the SSD, then alternating lvextend of (255 * 128MB) from the HDD PV and 128MB from the SSD PV until the desired size is reached or you run out of space on one of the PVs.

This is a nice trick. I was thinking about only one big initial metadata zone, but
your approach will give me back lvextend (which is useful on terabyte-range filesystems).
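
To grow later, I suppose it is just the same alternation plus an online resize,
something like this (VG/LV and device names are placeholders, and assuming the
LV currently ends on a flex group boundary):

  lvextend -l +1   vg_mix/lv_fs /dev/ssd    # next metadata group
  lvextend -l +255 vg_mix/lv_fs /dev/hdd    # its 255 data groups
  resize2fs /dev/vg_mix/lv_fs               # grow the filesystem to the LV size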

> The exact formatting options I used are:
>
> mke2fs -t ext4 -i 69905 -G 256 -E resize=4290772992 {dev}
>
> this will lay everything out on the LV nicely. Note that it assumes an average file size of about 69kB here. Increasing this is fine, but making it smaller would disrupt the layout.

You really tuned -i to perfection. :-)

It sounds very interesting that you only get 1/256 metadata overhead,
because my tests were around 1/10 (which certainly seems like a lot!).
I just discovered that -G 1048576 allocates a lot of expansion space, even if
you set -E resize to a reasonable value.
(so please disregard my previous sentence about bitmaps :-) )

Your refinements turned my bizarre idea into a really nice solution; I'm looking
forward to implementing it in production.
Even a tiny SSD can hold the metadata for a lot of HDD disk space. Maybe I will put
a couple of SSDs in RAID-1, as I'm still not confident about their robustness.
(The data is backed up, in any case.)

One last thing: you didn't worry about the journal. What would you suggest?
Using an external journal seems a little dirty; maybe we could just force
it to live in the second 128MB extent and place that one on the SSD too.
Can this be done?
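
For comparison, the external-journal route I'd rather avoid would presumably
look something like this (a sketch; the LV names and the 1GB journal size are
placeholders):

  lvcreate -L 1g -n lv_journal vg_mix /dev/ssd
  mke2fs -O journal_dev -b 4096 /dev/vg_mix/lv_journal
  mke2fs -t ext4 -i 69905 -G 256 -E resize=4290772992 \
         -J device=/dev/vg_mix/lv_journal /dev/vg_mix/lv_fs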

Really thank you.

(This should indeed be better documented; it can have dramatic performance
implications, and some optimized parameters (or a spreadsheet or web form to
calculate them) would be useful to a wider audience [I mean people who are
not ready to use dumpe2fs to reverse engineer the layout like I did].)
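
For the curious: the resulting placement is easy to verify, e.g. with something
like this (the device name is a placeholder):

  dumpe2fs /dev/vg_mix/lv_fs | grep -E '^Group |bitmap at |Inode table at '

which lists, group by group, the blocks holding the bitmaps and inode tables.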

--
Roberto Ragusa mail at robertoragusa.it

2012-02-23 04:46:11

by Andreas Dilger

Subject: Re: Mkfs option to choose where metadata will be stored

On 2012-02-22, at 15:13, Roberto Ragusa <[email protected]> wrote:
> On 02/22/2012 05:54 PM, Andreas Dilger wrote:
>> On 2012-02-22, at 6:20, Roberto Ragusa <[email protected]> wrote:
>>
>>> My idea is to have metadata on SSD and data on HDD.
>>> With a linear RAID mapping, I would get a device which is a few GB of
>>> SSD followed by a lot of HDD space.
>>
>> I've tested something similar to this myself. The way I did it is to use the "flex_bg" option "-G 256" to put the metadata into a single 128MB group, which is allocated on an SSD LVM PV, then 255 x 128MB on an HDD PV.
>
> I actually discovered flex_bg a few minutes after sending my mail. :-)
> I tested -G 1048576 (that is "infinity") and played a little
> with -i to keep down the SSD usage (my current average filesize
> is 3MB, so I can have a big value), discovering that the bitmaps are
> in any case dominant.
>
>> This pattern repeats for the entire LV size, and can easily be created with a 128MB LV on the SSD, then alternating lvextend of (255 * 128MB) from the HDD PV and 128MB from the SSD PV until the desired size is reached or you run out of space on one of the PVs.
>
> This is a nice trick. I was thinking about only one big initial metadata zone, but
> your approach will give me back lvextend (which is useful on terabyte-range filesystems).

Exactly. And the pattern is constant up to 16TB, so it can be used for any filesystem size below that limit.

>> The exact formatting options I used are:
>>
>> mke2fs -t ext4 -i 69905 -G 256 -E resize=4290772992 {dev}
>>
>> this will lay everything out on the LV nicely. Note that it assumes an average file size of about 69kB here. Increasing this is fine, but making it smaller would disrupt the layout.
>
> You really tuned -i to perfection. :-)

Along with the resize option it aligns all the metadata nicely on 1MB boundaries for RAID-6 HDD LUNs.

The resize trick won't work past 16TB, however, and the metadata size is also different.

> It sounds very interesting that you only get 1/256 metadata overhead,
> because my tests were around 1/10 (which certainly seems like a lot!).
> I just discovered that -G 1048576 allocates a lot of expansion space, even if
> you set -E resize to a reasonable value.
> (so please disregard my previous sentence about bitmaps :-) )
>
> Your refinements turned my bizarre idea into a really nice solution; I'm looking
> forward to implementing it in production.
> Even a tiny SSD can hold the metadata for a lot of HDD disk space.

Yes, the SSD only needs to hold about 1/256 of the total size (roughly 1/255 of the HDD space).

> Maybe I will put
> a couple of SSDs in RAID-1, as I'm still not confident about their robustness.
> (The data is backed up, in any case.)

Definitely prudent.

> One last thing: you didn't worry about the journal. What would you suggest?
> Using an external journal seems a little dirty; maybe we could just force
> it to live in the second 128MB extent and place that one on the SSD too.
> Can this be done?

I looked into this once, and it would be desirable to have a mke2fs option to specify the starting journal block. I can't remember why I didn't finish it, but it wasn't very complex, if you want to give it a shot.

> Really thank you.
>
> (This should indeed be better documented; it can have dramatic performance
> implications, and some optimized parameters (or a spreadsheet or web form to
> calculate them) would be useful to a wider audience [I mean people who are
> not ready to use dumpe2fs to reverse engineer the layout like I did].)

Ideally it would be built into mke2fs.

Cheers, Andreas