2011-09-20 10:31:47

by torn5

Subject: What to put for unknown stripe-width?

Hello there,
I am planning some ext4 filesystems which are currently in (LVM over) a
RAID array having:

stride=1MiB stripe-width=4MiB

BUT ... their RAID could be enlarged with more disks in the future
and/or they are likely to be moved around live (leveraging LVM, which I
will align) to other RAIDs which will also have stride=1MiB but an
unknown, as of now, stripe-width.
That's on HDDs, with platters.

What do you suggest for stripe-width?
I don't really know how ext4 works, unfortunately. But I think the
answer should be among the following values:

- a) 60MiB: so as to be an exact multiple of most stripe-widths, in
particular when the number of data disks is any of
1, 2, 3, 4, 5, 10, 12, 15, 20, 30, 60. I expect some longer-than-normal
seeks of the HDD heads with 60MiB, though.

- b) 7MiB or 11MiB: prime numbers, so *not* likely to be a multiple of
most stripe-widths, in the hope of seeing data eventually spread equally
across the various disks (maybe?). That's the opposite reasoning to (a),
so one of these two must be wrong.

- c) 1MiB: so as to be an exact *divisor* (and not a multiple) of the
stripe-width for any possible number of disks. This is wrong, isn't it?

- d) use the current optimum of 4MiB, then use tune2fs to alter the
stripe-width when the underlying stripes change. This should be fine for
new writes, but I am not sure what the impact is on reads of old data.
(A sketch of the commands I mean follows this list.)
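
Something like this, I guess, assuming a 4KiB filesystem block size (so
1MiB = 256 blocks and 4MiB = 1024 blocks, since mkfs/tune2fs take these
values in filesystem blocks; the device name is just a placeholder):

  # current geometry: 1MiB chunk, 4 data disks, 4MiB stripe
  mkfs.ext4 -E stride=256,stripe_width=1024 /dev/vg0/lv0

  # later, if the array ends up with e.g. 6 data disks (6MiB stripe)
  tune2fs -E stripe_width=1536 /dev/vg0/lv0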

Thanks for your help
T5


2011-09-20 12:46:59

by Theodore Ts'o

Subject: Re: What to put for unknown stripe-width?


On Sep 20, 2011, at 6:30 AM, torn5 wrote:

> Hello there,
> I am planning some ext4 filesystems which are currently in (LVM over) a RAID array having:
>
> stride=1MiB stripe-width=4MiB
>
> BUT ... their RAID could be enlarged with more disks in the future and/or they are likely to be moved around live (leveraging LVM, which I will align) to other RAIDs which will also have stride=1MiB but an unknown, as of now, stripe-width.
> That's on HDDs, with platters.

You can't change the stripe width without essentially recreating the RAID system. Let's assume you're doing RAID 5 with four disks. Then the picture looks like this:

      DISK A              DISK B              DISK C              DISK D
| --- chunk 0 ---|  | --- chunk 1 ---|  | --- chunk 2 ---|  | --- Parity ---|
| --- chunk 3 ---|  | --- chunk 4 ---|  | --- Parity ---|   | --- chunk 5 ---|
| --- chunk 6 ---|  | --- Parity ---|   | --- chunk 7 ---|  | --- chunk 8 ---|
| --- Parity ---|   | --- chunk 9 ---|  | --- chunk 10 --|  | --- chunk 11 --|

Each of the chunks is 1MB, and the stripe size is 3MB. But if you add a new hypothetical disk E, you can't just add it next to Disk D, because then you'd have to renumber all of the chunks. Instead you might add disks F, G, and H as well, and then tack them on to the end of disks A, B, C, and D. But the stripe size stays the same without some serious surgery. And even if you did the surgery, there's the problem of what to do with the file system.
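
(For concreteness, a layout like the one above is roughly what you'd get
from an MD array created along these lines; the device names are just
placeholders:)

  # 4-disk RAID 5 with a 1MB chunk (mdadm takes the chunk size in KiB)
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=1024 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1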

Why does the file system care about the RAID parameters? Well, suppose you do a 1MB write to replace chunk #0. You will also need to read chunks #1 and #2 so you can calculate the parity chunk on disk D. This is called a read/modify/write cycle, and it's inefficient.

To avoid it, you need to do writes which are aligned on the stripe and sized for the stripe. So for the best performance, the file system should align its files so that they begin at the beginning of a RAID stripe. So even if you could change the parameters of your RAID array (which would require radical surgery involving lots of data movement), the file system would also need to be changed to support optimized writes on the new stripe size. But that's OK, because I don't know of any RAID array that supports this kind of radical surgery in parameters in the first place. :-)

The other thing to consider is small writes. If you are doing small writes, a large stripe size is a disaster, because a 32k random write by a program like MySQL will turn into a 3MB read + 3MB write request. Large stripe sizes are good for sequential writes of large files, but not so much for small random writes.

- Ted




2011-09-20 15:29:58

by torn5

Subject: Re: What to put for unknown stripe-width?

On 09/20/11 14:47, Theodore Tso wrote:
> But that's OK, because I don't know of any RAID array that supports
> this kind of radical surgery in parameters in the first place. :-)

Ted, thanks for your reply,

Linux MD raid supports this, it's called reshape. Most parameter
changes are supported; in particular, the addition of a new disk and
restriping of a raid5 are supported *live*. It's not very stable though...

But apart from the MD live reshape/restripe, what I'm more likely to do
is move such a filesystem *live* across the various RAIDs I have,
leveraging LVM's "pvmove". Those RAIDs almost all have a 1MB stride, but
with varying numbers of elements, hence they have different stripe-widths.
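
To make that concrete, the two operations I have in mind look roughly
like this (device and LV names are just placeholders):

  # MD reshape: add a fifth disk and restripe the raid5 live
  mdadm --add /dev/md0 /dev/sde1
  mdadm --grow /dev/md0 --raid-devices=5 --backup-file=/root/md0-reshape.bak

  # or: live migration of one LV between RAIDs, via LVM
  # (both /dev/md0 and /dev/md1 being PVs of the same volume group)
  pvmove -n somelv /dev/md0 /dev/md1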


> The other thing to consider is small writes. If you are doing small writes, a large stripe size is a disaster, because a 32k random write by a program like MySQL will turn into a 3MB read + 3MB write request.

No, this is not correct, for MD at least.
MD uses strips to compute parity, which are always 4k wide for each
device. The reads in your example would be a 32k read from two devices,
followed by a 32k write to two devices. I am testing this now with
iostat, using a single 4k write issued with dd, to confirm what I'm
saying: I see various spurious reads and writes (probably due to MD and
LVM accounting, dirty flags etc.) which sum up to about 108k read and
18k written (that's the aggregate across all drives) for a single 4k
write to the MD device. That's definitely nowhere near even a single
chunk, which is 1MB.
What the chunk size does is regulate how much data is laid down before
the placement of parity rotates (i.e. your ASCII-art picture was
correct). A large chunk size like the one I use means that reads
smaller than 1MB hopefully come from one spindle only. This is useful
for us.
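
For the record, the test is roughly this (mount point and file name are
placeholders):

  # terminal 1: watch per-device throughput, 1-second intervals
  iostat -dxk 1

  # terminal 2: a single 4k write that bypasses the page cache
  dd if=/dev/zero of=/mnt/test/seekfile bs=4k count=1 oflag=direct conv=notrunc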


So, regarding my original problem, the way you use stride-size in ext4
is that you begin every new file at the start of a stripe?

For growing an existing file, what do you do: do you continue to write
it from where it was, without holes, or do you put in a hole, select a
new location at the start of a new stripe and continue from there?

Regarding multiple very small files written together by pdflush, what do
they do? Are they packed together on the same stripe without holes, or
does each one go to a different stripe?

Is the change of stripe-width with tune2fs supported on a live, mounted
fs? (I mean maybe with a mount -o remount but no umount)

Thanks for your help,

T.

2011-09-20 16:00:38

by Theodore Ts'o

Subject: Re: What to put for unknown stripe-width?

On Tue, Sep 20, 2011 at 05:29:34PM +0200, torn5 wrote:
>
> No, this is not correct, for MD at least.
> MD uses strips to compute parity, which are always 4k wide for each
> device. The reads in your example would be a 32k read from two
> devices, followed by a 32k write to two devices.

Then where the heck are these 1MB numbers that you cited originally
coming from? If by that you mean the LVM PE size, that doesn't matter
to the file system. It matters for the efficiency of doing LVM
snapshots, but what matters from the perspective of the file system's
stripe-width (stride doesn't really matter with the ext4 flex_bg
layout) are the RAID parameters at the MD level, NOT the LVM
parameters. (This is assuming LVM is properly aligning its PEs with
the RAID stripe sizes, but I'll assume for the purposes of this
discussion that LVM is competently implemented...)
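
(If you want to verify that, the PE alignment is easy enough to check;
the device names below are just placeholders:)

  # where does the first physical extent start on this PV? (in 512-byte sectors)
  pvs -o +pe_start --units s /dev/md0

  # alignment can also be forced at PV creation time, e.g. to a 4MiB stripe
  pvcreate --dataalignment 4m /dev/md0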

> So, regarding my original problem, the way you use stride-size in
> ext4 is that you begin every new file at the start of a stripe?

The design goal (not completely implemented at the moment) is that
block allocation will try (but may not succeed if the free space is
fragmented, etc.) to align files to begin at the start of a stripe,
and that file writes will try to avoid RAID 5 read/modify/write
cycles.
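
(If you want to see what the allocator actually did for a particular
file, filefrag will show the extents in filesystem blocks; with 4KiB
blocks and a 4MiB stripe, a stripe-aligned extent starts at a physical
block number that is a multiple of 1024. The path below is just a
placeholder.)

  filefrag -v /path/to/largefile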

> For growing an existing file, what do you do: do you continue to
> write it from where it was, without holes, or do you put in a hole,
> select a new location at the start of a new stripe and continue from there?

We will try to arrange things so that subsequent reads and writes are
well aligned. So the assumption here is that the application will
also be intelligent enough to do the right thing. (i.e., if you have
a 32k stripe width, writes need to be 32k aligned and in multiples of
32k at the logical block layer, which the file system will translate
into appropriately aligned reads and writes to the RAID array.)
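
(As a trivial illustration with the 4MiB stripe from your original
mail, a streaming writer that issues stripe-sized, stripe-aligned
direct writes could look like this; the file name is a placeholder:)

  # write 1GB in 4MiB aligned chunks, bypassing the page cache
  dd if=/dev/zero of=/path/to/stream bs=4M count=256 oflag=direct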

> Regarding multiple very small files written together by pdflush, what
> do they do? Are they packed together on the same stripe without
> holes, or does each one go to a different stripe?

pdflush has no idea about RAID, so for small files it's all up in the
air.

> Is the change of stripe-width with tune2fs supported on a live,
> mounted fs? (I mean maybe with a mount -o remount but no umount)

It's supported, but it's not going to change the layout of the live,
mounted file system.

I'm going to strongly caution you about using fancy-shmancy features
like this MD reshape. You yourself have said it's not stable. But do
you really need it? You haven't said what your application is, but it
may be much better to simply add new disks in multiples of your
existing stripe width, and just keep things simple.

Also, why are you using RAID? Is it for reliability, or performance,
or both? Have you seen some of the recent (well, from the last 4-5
years) papers about the reliability of disks and RAID? Especially
interesting are comments such as

"Protecting online data only via RAID 5 today verges on
professional malpractice"

http://storagemojo.com/2007/02/26/netapp-weighs-in-on-disks/

Regards,

- Ted

2011-09-20 23:29:56

by Andreas Dilger

Subject: Re: What to put for unknown stripe-width?

On 2011-09-20, at 9:29 AM, torn5 wrote:
> On 09/20/11 14:47, Theodore Tso wrote:
>> But that's OK, because I don't know of any RAID array that supports this kind of radical surgery in parameters in the first place. :-)
>
> Ted, thanks for your reply,
>
> Linux MD raid supports this, it's called reshape. Most parameter changes are supported; in particular, the addition of a new disk and restriping of a raid5 are supported *live*. It's not very stable though...
>
> But apart from the MD live reshape/restripe, what I'm more likely to do is move such a filesystem *live* across the various RAIDs I have, leveraging LVM's "pvmove". Those RAIDs almost all have a 1MB stride, but with varying numbers of elements, hence they have different stripe-widths.

Just FYI, we use 1MB stripe width for Lustre by default, and this is large
enough for very good IO performance, and is very well tested.

>> The other thing to consider is small writes. If you are doing small writes, a large stripe size is a disaster, because a 32k random write by a program like MySQL will turn into a 3MB read + 3MB write request.
>
> So, regarding my original problem, the way you use stride-size in ext4 is that you begin every new file at the start of a stripe?

With ext4 and flex_bg it is nearly irrelevant. I try to use a flex_bg size
(-G 256) that matches the stripe_width, so that the bitmap load on all the
disks is even. It would be good to have a patch to do this by default, but
I haven't gotten around to that.
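
(i.e. something like the following at mkfs time; apart from -G 256, the
numbers are just example values assuming 4KiB blocks and a 4MiB stripe,
and the device name is a placeholder:)

  mkfs.ext4 -G 256 -E stride=256,stripe_width=1024 /dev/vg0/lv0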

> For growing an existing file, what do you do: do you continue to write it from where it was, without holes, or do you put in a hole, select a new location at the start of a new stripe and continue from there?

Large files will be allocated starting at a stripe_width (1MB) boundary,
and (IIRC) small files will be packed together into a 1MB chunk, to minimize
read-modify-write on the RAID.

> Regarding multiple very small files written together by pdflush, what do they do? Are they packed together on the same stripe without holes, or does each one go to a different stripe?

I'm not 100% sure that the small files are still being handled correctly,
since it is a long time since I looked at that code and it has undergone
a lot of changes.

> Is the change of stripe-width with tune2fs supported on a live, mounted fs? (I mean maybe with a mount -o remount but no umount)

I think it is cached in the in-memory superblock to include any mount-time
parameters and sanity checks, so I don't think it will take effect.

Cheers, Andreas