2007-06-15 02:59:52

by David Lang

[permalink] [raw]
Subject: limits on raid

what is the limit for the number of devices that can be in a single array?

I'm trying to build a 45x750G array and want to experiment with the
different configurations. I'm trying to start with raid6, but mdadm is
complaining about an invalid number of drives

David Lang


2007-06-15 03:05:28

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Thursday June 14, [email protected] wrote:
> what is the limit for the number of devices that can be in a single array?
>
> I'm trying to build a 45x750G array and want to experiment with the
> different configurations. I'm trying to start with raid6, but mdadm is
> complaining about an invalid number of drives
>
> David Lang

"man mdadm" search for "limits". (forgive typos).

NeilBrown

2007-06-15 03:44:20

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Fri, 15 Jun 2007, Neil Brown wrote:

> On Thursday June 14, [email protected] wrote:
>> what is the limit for the number of devices that can be in a single array?
>>
>> I'm trying to build a 45x750G array and want to experiment with the
>> different configurations. I'm trying to start with raid6, but mdadm is
>> complaining about an invalid number of drives
>>
>> David Lang
>
> "man mdadm" search for "limits". (forgive typos).

thanks.

why does it still default to the old format after so many new versions?
(by the way, the documentation said 28 devices, but I couldn't get it to
accept more than 27)
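
for reference, a rough sketch of the workaround -- device names below are
placeholders -- is to ask for a version-1 superblock, which is what lifts
the device-count limit of the default 0.90 format:

  mdadm --create /dev/md0 --metadata=1.2 --level=6 \
        --raid-devices=45 /dev/sd[b-z]1 /dev/sda[a-t]1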

it's now churning away 'rebuilding' the brand new array.

a few questions/thoughts.

why does it need to do a rebuild when making a new array? couldn't it
just zero all the drives instead? (or better still just record most of the
space as 'unused' and initialize it as it starts using it?)

while I consider zfs to be ~80% hype, one advantage it could have (but I
don't know if it has) is that since the filesystem and raid are integrated
into one layer they can optimize the case where files are being written
onto unallocated space and instead of reading blocks from disk to
calculate the parity they could just put zeros in the unallocated space,
potentially speeding up the system by reducing the amount of disk I/O.

this wouldn't work if the filesystem is crowded, but a lot of large
arrays are used for storing large files (i.e. sequential writes of large
amounts of data) and it would seem that this could be a substantial win in
these cases.

is there any way that linux would be able to do this sort of thing? or is
it impossible due to the layering preventing the necessary knowledge from
being in the right place?

David Lang

2007-06-15 03:58:38

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Thursday June 14, [email protected] wrote:
> On Fri, 15 Jun 2007, Neil Brown wrote:
>
> > On Thursday June 14, [email protected] wrote:
> >> what is the limit for the number of devices that can be in a single array?
> >>
> >> I'm trying to build a 45x750G array and want to experiment with the
> >> different configurations. I'm trying to start with raid6, but mdadm is
> >> complaining about an invalid number of drives
> >>
> >> David Lang
> >
> > "man mdadm" search for "limits". (forgive typos).
>
> thanks.
>
> why does it still default to the old format after so many new versions?
> (by the way, the documentation said 28 devices, but I couldn't get it to
> accept more than 27)

Dunno - maybe I can't count...

>
> it's now churning away 'rebuilding' the brand new array.
>
> a few questions/thoughts.
>
> why does it need to do a rebuild when making a new array? couldn't it
> just zero all the drives instead? (or better still just record most of the
> space as 'unused' and initialize it as it starts using it?)

Yes, it could zero all the drives first. But that would take the same
length of time (unless p/q generation was very very slow), and you
wouldn't be able to start writing data until it had finished.
You can "dd" /dev/zero onto all drives and then create the array with
--assume-clean if you want to. You could even write a shell script to
do it for you.
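
An untested sketch of that approach (device names and counts are just
examples):

  # zero every member in parallel, then create the array already 'clean'
  for d in /dev/sd[b-k]1; do
      dd if=/dev/zero of="$d" bs=1M &
  done
  wait
  mdadm --create /dev/md0 --level=6 --raid-devices=10 \
        --assume-clean /dev/sd[b-k]1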

Yes, you could record which space is used vs unused, but I really
don't think the complexity is worth it.

>
> while I consider zfs to be ~80% hype, one advantage it could have (but I
> don't know if it has) is that since the filesystem and raid are integrated
> into one layer they can optimize the case where files are being written
> onto unallocated space and instead of reading blocks from disk to
> calculate the parity they could just put zeros in the unallocated space,
> potentially speeding up the system by reducing the amount of disk I/O.

Certainly. But the raid doesn't need to be tightly integrated
into the filesystem to achieve this. The filesystem need only know
the geometry of the RAID and when it comes to write, it tries to write
full stripes at a time. If that means writing some extra blocks full
of zeros, it can try to do that. This would require a little bit
better communication between filesystem and raid, but not much. If
anyone has a filesystem that they want to be able to talk to raid
better, they need only ask...

> is there any way that linux would be able to do this sort of thing? or is
> it impossible due to the layering preventing the necessary knowledge from
> being in the right place?

Linux can do anything we want it to. Interfaces can be changed. All
it takes is a fairly well defined requirement, and the will to make it
happen (and some technical expertise, and lots of time .... and
coffee?).

NeilBrown

2007-06-15 09:13:41

by David Chinner

[permalink] [raw]
Subject: Re: limits on raid

On Fri, Jun 15, 2007 at 01:58:20PM +1000, Neil Brown wrote:
> Certainly. But the raid doesn't need to be tightly integrated
> into the filesystem to achieve this. The filesystem need only know
> the geometry of the RAID and when it comes to write, it tries to write
> full stripes at a time.

XFS already knows this (stripe unit, stripe width) and already
does stripe unit sized and aligned allocation where it can.
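
(For example -- the numbers are purely illustrative -- the geometry is
handed to mkfs.xfs as a stripe unit and width, here a 64k chunk across
the 43 data disks of a 45-disk raid6:

  mkfs.xfs -d su=64k,sw=43 /dev/md0
)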

> If that means writing some extra blocks full
> of zeros, it can try to do that. This would require a little bit
> better communication between filesystem and raid, but not much. If
> anyone has a filesystem that they want to be able to talk to raid
> better, they need only ask...

I'm interested in what you think is necessary here, Neil.

But I think there's more to it than this - the filesystem only
writes back what the generic writeback code passes it (i.e. a page
at a time). XFS writes back extra adjacent pages in each I/O
if they are in the same state, but it might take multiple
I/Os to write out a full stripe unit if we are doing things
like writing across a hole.

Also, there is no guarantee that the first page the filesystem
receives lies at the start of a stripe boundary, so that
sort of information really needs to be propagated into
the generic writeback code above the filesystem as well....

IOWs, the filesystem can easily end I/Os on stripe boundaries
but it is much harder to start them there because that is
not in the control of the filesystem.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-15 11:53:47

by Avi Kivity

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
>
>> while I consider zfs to be ~80% hype, one advantage it could have (but I
>> don't know if it has) is that since the filesystem and raid are integrated
>> into one layer they can optimize the case where files are being written
>> onto unallocated space and instead of reading blocks from disk to
>> calculate the parity they could just put zeros in the unallocated space,
>> potentially speeding up the system by reducing the amount of disk I/O.
>>
>
> Certainly. But the raid doesn't need to be tightly integrated
> into the filesystem to achieve this. The filesystem need only know
> the geometry of the RAID and when it comes to write, it tries to write
> full stripes at a time. If that means writing some extra blocks full
> of zeros, it can try to do that. This would require a little bit
> better communication between filesystem and raid, but not much. If
> anyone has a filesystem that they want to be able to talk to raid
> better, they need only ask...
>

Some things are not achievable with block-level raid. For example, with
redundancy integrated into the filesystem, you can have three copies for
metadata, two copies for small files, and parity blocks for large files,
effectively using different raid levels for different types of data on
the same filesystem.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-06-15 16:24:04

by Jan Engelhardt

[permalink] [raw]
Subject: Re: limits on raid


On Jun 15 2007 14:10, Avi Kivity wrote:
>
>Some things are not achievable with block-level raid. For example, with
>redundancy integrated into the filesystem, you can have three copies for
>metadata, two copies for small files, and parity blocks for large files,
>effectively using different raid levels for different types of data on
>the same filesystem.

Sounds like you want RAIF, not RAID.



Jan
--

2007-06-15 17:21:10

by Avi Kivity

[permalink] [raw]
Subject: Re: limits on raid

Jan Engelhardt wrote:
> On Jun 15 2007 14:10, Avi Kivity wrote:
>
>> Some things are not achievable with block-level raid. For example, with
>> redundancy integrated into the filesystem, you can have three copies for
>> metadata, two copies for small files, and parity blocks for large files,
>> effectively using different raid levels for different types of data on
>> the same filesystem.
>>
>
> Sounds like you want RAIF, not RAID.
>
>

If you mean taking a bunch of single-disk filesystems and layering
another filesystem on top, then no. The underlying filesystems will
only serve as object allocators, and all the directory hierarchy,
journalling, and other capabilities will end up as overhead. Fairly
significant overhead, too -- I once worked on a similar environment.
Performance sucked until the underlying filesystems were removed.

Abstraction layers are good for dividing large problems, but they have
their costs. Consider NUMA for example: you can treat it as a large
blob of memory, but much performance can be gained if you don't.
Similarly, with disks, you can put them in a big RAID and treat them as
a large disk, but if you don't, there are large rewards.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-06-15 21:59:58

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Friday June 15, [email protected] wrote:
> Neil Brown wrote:
> >
> >> while I consider zfs to be ~80% hype, one advantage it could have (but I
> >> don't know if it has) is that since the filesystem and raid are integrated
> >> into one layer they can optimize the case where files are being written
> >> onto unallocated space and instead of reading blocks from disk to
> >> calculate the parity they could just put zeros in the unallocated space,
> >> potentially speeding up the system by reducing the amount of disk I/O.
> >>
> >
> > Certainly. But the raid doesn't need to be tightly integrated
> > into the filesystem to achieve this. The filesystem need only know
> > the geometry of the RAID and when it comes to write, it tries to write
> > full stripes at a time. If that means writing some extra blocks full
> > of zeros, it can try to do that. This would require a little bit
> > better communication between filesystem and raid, but not much. If
> > anyone has a filesystem that they want to be able to talk to raid
> > better, they need only ask...
> >
>
> Some things are not achievable with block-level raid. For example, with
> redundancy integrated into the filesystem, you can have three copies for
> metadata, two copies for small files, and parity blocks for large files,
> effectively using different raid levels for different types of data on
> the same filesystem.

Absolutely. And doing that is a very good idea quite independent of
underlying RAID. Even ext2 stores multiple copies of the superblock.

Having the filesystem duplicate data, store checksums, and be able to
find a different copy if the first one it chose was bad is very
sensible and cannot be done by just putting the filesystem on RAID.

Having the filesystem keep multiple copies of each data block so that
when one drive dies, another block is used does not excite me quite so
much. If you are going to do that, then you want to be able to
reconstruct the data that should be on a failed drive onto a new
drive.
For a RAID system, that reconstruction can go at the full speed of the
drive subsystem - but needs to copy every block, whether used or not.
For in-filesystem duplication, it is easy to imagine that being quite
slow and complex. It would depend a lot on how you arrange data,
and maybe there is some clever approach to data layout that I haven't
thought of. But I think that sort of thing is much easier to do in a
RAID layer below the filesystem.

Combining these thoughts, it would make a lot of sense for the
filesystem to be able to say to the block device "That block looks
wrong - can you find me another copy to try?". That is an example of
the sort of closer integration between filesystem and RAID that would
make sense.

NeilBrown

2007-06-15 22:21:47

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Friday June 15, [email protected] wrote:
> On Fri, Jun 15, 2007 at 01:58:20PM +1000, Neil Brown wrote:
> > Certainly. But the raid doesn't need to be tightly integrated
> > into the filesystem to achieve this. The filesystem need only know
> > the geometry of the RAID and when it comes to write, it tries to write
> > full stripes at a time.
>
> XFS already knows this (stripe unit, stripe width) and already
> does stripe unit sized and aligned allocation where it can.

What happens if the device gets restriped (e.g. add a drive to raid5)?
Is it possible to tell XFS about the new shape, or is it a mkfs only
thing?

I think it would be nice if the filesystem could query the device to
find out this geometry, and even that the device could tell the
filesystem that the geometry has changed. But we don't have well
defined interfaces for that (yet).
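
(Userspace can at least see the geometry today via sysfs, e.g. something
like the following, though that is not an interface a filesystem can use
from inside the kernel:

  cat /sys/block/md0/md/level
  cat /sys/block/md0/md/raid_disks
  cat /sys/block/md0/md/chunk_size    # in bytes
)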

>
> > If that means writing some extra blocks full
> > of zeros, it can try to do that. This would require a little bit
> > better communication between filesystem and raid, but not much. If
> > anyone has a filesystem that they want to be able to talk to raid
> > better, they need only ask...
>
> I'm interested in what you think is necessary here, Neil.

I see it more as "What does the filesystem think is necessary".
Some possibilities I see:
- if the filesystem keeps checksums of some blocks, and finds that a
block looks wrong, it might want to ask the RAID system "Do you
have another copy of that I could try".
- If the filesystem uses a journal, it might benefit from always
doing strip-wide(*) writes. With a journal, read speed is not a big
issue, and padding is quite possible. So the filesystem might want
to find out the exact geometry so it can write a full strip each
time, and it might want an interface so it can say "I am writing a
full strip - don't start processing until I say 'go'".

>
> But I think there's more to it than this - the filesystem only
> writes back what the generic writeback code passes it (i.e. a page
> at a time). XFS writes back extra adjacent pages in each I/O
> if they are in the same state, but it might take multiple
> I/Os to write out a full stripe unit if we are doing things
> like writing across a hole.

I think the suggestion was that if the filesystem knows the contents
of other blocks in the stripe, then it might be more efficient to
write them out as well, even if they aren't dirty. e.g. if we know a
neighbouring block is unallocated, write zeros. If we have a
neighbouring block in the page cache that is clean, write it out as
well as doing so will reduce the pre-reading required for a full
write.

>
> Also, there is no guarantee that the first page the filesystem
> receives lies at the start of a stripe boundary, so that
> sort of information really needs to be propagated into
> the generic writeback code above the filesystem as well....
>
> IOWs, the filesystem can easily end I/Os on stripe boundaries
> but it is much harder to start them there because that is
> not in the control of the filesystem.

This of course depends on the layout used by the filesystem.
For a traditional layout where files are allocated to locations on
storage and stay there, you are absolutely correct.
For copy-on-write filesystems, there is a lot more room to choose
size and alignment for all writes. We seem to have a growth spurt in
this style of filesystem at the moment. It will be interesting to see
if they can deliver equal performance (copy-on-write tends to risk
fragmented reads). I suspect these new filesystems will have more
interest in closer integration with RAID.

NeilBrown

(*) A 'strip-wide' write is different from a 'stripe-wide' write. It
is one block from each device, rather than one chunk. These blocks
will likely not be consecutive in device-space, but writing them as a
group will be faster than writing a similar number of blocks that are
consecutive. Laying out a journal so that consecutive blocks follow
strips might make writes faster.

2007-06-16 02:07:23

by Wakko Warner

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
> On Thursday June 14, [email protected] wrote:
> > why does it need to do a rebuild when making a new array? couldn't it
> > just zero all the drives instead? (or better still just record most of the
> > space as 'unused' and initialize it as it starts using it?)
>
> Yes, it could zero all the drives first. But that would take the same
> length of time (unless p/q generation was very very slow), and you
> wouldn't be able to start writing data until it had finished.
> You can "dd" /dev/zero onto all drives and then create the array with
> --assume-clean if you want to. You could even write a shell script to
> do it for you.

I still fail to see the reason to actually "resync" the drives. I've dealt
with some hardware raid devices and they do not force a resync but they do
recommend it.

If there were no resync (I have not found a way to force the kernel not to do
this), the parity would not be correct. Well, in this case that's fine; the
data is pretty much useless anyway. (I'm assuming newly created arrays,
not attempting to recreate an array that had some failures)

No one zeros out a new hard drive because of what might be on it. You just
fdisk (or lvm or whatever), mkfs and use it. Assume this is performed on an
array that has not been resync'd (or initialized as some hardware raids call
it). You fdisk it, mkfs it, and start using it. As I understand the way
raid works, when you write a block to the array, it will have to read all
the other blocks in the stripe and recalculate the parity and write it out.
(I also assume that if you write lots of data at a time, there may not be a
read since those sectors will be overwritten anyway).

Ok, so we have a device with some parity information "correct" due to writes
of some of the sectors but not all of them since there was no resync. What
happens if a drive fails and you replace it? Ok, now we have to resync.
The strips that were never written to may now have random data. Well, does
that *really* matter? After all, it was never written to.

In the past, I have used arrays that were not initialized (hardware raids)
and had to rebuild the array due to a disk failure. There were no problems
with the data that was already allocated on the array.

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???

2007-06-16 03:47:27

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Friday June 15, [email protected] wrote:

> As I understand the way
> raid works, when you write a block to the array, it will have to read all
> the other blocks in the stripe and recalculate the parity and write it out.

Your understanding is incomplete.
For raid5 on an array with more than 3 drives, if you attempt to write
a single block, it will:

- read the current value of the block, and the parity block.
- "subtract" the old value of the block from the parity, and "add"
the new value.
- write out the new data and the new parity.
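
(As a one-byte illustration of that arithmetic -- the values are made up --
the "subtract" and "add" are both just xor:

  D_old=0x5A; D_new=0x3C; P_old=0x99
  P_new=$(( P_old ^ D_old ^ D_new ))     # xor out the old data, xor in the new
  printf 'new parity: 0x%02X\n' "$P_new"
)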

If the parity was wrong before, it will still be wrong. If you then
lose a drive, you lose your data.

With the current implementation in md, this only affects RAID5. RAID6
will always behave as you describe. But I don't promise that won't
change with time.

It would be possible to have a 'this is not initialised' flag on the
array, and if that is not set, always do a reconstruct-write rather
than a read-modify-write. But the first time you have an unclean
shutdown you are going to resync all the parity anyway (unless you
have a bitmap....) so you may as well resync at the start.

And why is it such a big deal anyway? The initial resync doesn't stop
you from using the array. I guess if you wanted to put an array into
production instantly and couldn't afford any slowdown due to resync,
then you might want to skip the initial resync.... but is that really
likely?

NeilBrown

2007-06-16 04:40:29

by Dan Merillat

[permalink] [raw]
Subject: Re: limits on raid

> For raid5 on an array with more than 3 drives, if you attempt to write
> a single block, it will:
>
> - read the current value of the block, and the parity block.
> - "subtract" the old value of the block from the parity, and "add"
> the new value.
> - write out the new data and the new parity.
>
> If the parity was wrong before, it will still be wrong. If you then
> lose a drive, you lose your data.

Wow, that really needs to be put somewhere in 120 point red blinking
text. A lot of us are used to uninitialized disks calculating the
parity-on-first-write, but if linux MD is forgoing that,
'dangerous-no-resync' sounds really REALLY bad. How about at least a
'Warning: unlike other systems this WILL cause corruption if you
forego reconstruction' on mkraid?

2007-06-16 07:51:44

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Sat, 16 Jun 2007, Neil Brown wrote:

> It would be possible to have a 'this is not initialised' flag on the
> array, and if that is not set, always do a reconstruct-write rather
> than a read-modify-write. But the first time you have an unclean
> shutdown you are going to resync all the parity anyway (unless you
> have a bitmap....) so you may as well resync at the start.
>
> And why is it such a big deal anyway? The initial resync doesn't stop
> you from using the array. I guess if you wanted to put an array into
> production instantly and couldn't afford any slowdown due to resync,
> then you might want to skip the initial resync.... but is that really
> likely?

in my case it takes 2+ days to resync the array before I can do any
performance testing with it. for some reason it's only doing the rebuild
at ~5M/sec (even though I've increased the min and max rebuild speeds and
a dd to the array seems to be ~44M/sec, even during the rebuild)
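
(the knobs I mean are the usual ones -- the numbers below are only
illustrative, in KB/sec:

  echo 50000  > /proc/sys/dev/raid/speed_limit_min
  echo 200000 > /proc/sys/dev/raid/speed_limit_max
  # per-array equivalents: /sys/block/mdX/md/sync_speed_{min,max}
)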

I want to test several configurations, from a 45 disk raid6 to a 45 disk
raid0. at 2-3 days per test (or longer, depending on the tests) this
becomes a very slow process.

also, when a rebuild is slow enough (and has enough of a performance
impact) it's not uncommon to want to operate in degraded mode just long
enough to get to a maintenance window and then recreate the array and
reload from backup.

David Lang

2007-06-16 13:33:37

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
> On Friday June 15, [email protected] wrote:
>
>> As I understand the way
>> raid works, when you write a block to the array, it will have to read all
>> the other blocks in the stripe and recalculate the parity and write it out.
>
> Your understanding is incomplete.

Does this help?
[for future reference so you can paste a url and save the typing for code :) ]

http://linux-raid.osdl.org/index.php/Initial_Array_Creation

David



Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity
is what's called the "initial resync".

The kernel takes one (or two for raid6) disks and marks them as 'spare'; it then
creates the array in degraded mode. It then marks spare disks as 'rebuilding'
and starts to read from the 'good' disks, calculates the parity, determines
what should be on any spare disks and then writes it. Once all this is done the
array is clean and all disks are active.

This can take quite a time and the array is not fully resilient whilst this is
happening (it is however fully useable).

--assume-clean

Some people have noticed the --assume-clean option in mdadm and speculated that
this can be used to skip the initial resync. Which it does. But this is a bad
idea in some cases - and a *very* bad idea in others.

raid5

For raid5 especially it is NOT safe to skip the initial sync. The raid5
implementation optimises use of the component disks and it is possible for all
updates to be "read-modify-write" updates which assume the parity is correct. If
it is wrong, it stays wrong. Then when you lose a drive, the parity blocks are
wrong so the data you recover using them is wrong. In other words - you will get
data corruption.

For raid5 on an array with more than 3 drives, if you attempt to write a single
block, it will:

* read the current value of the block, and the parity block.
* "subtract" the old value of the block from the parity, and "add" the new
value.
* write out the new data and the new parity.

If the parity was wrong before, it will still be wrong. If you then lose a
drive, you lose your data.

linear, raid0,1,10

These raid levels do not need an initial sync.

linear and raid0 have no redundancy.

raid1 always writes all data to all disks.

raid10 always writes all data to all relevant disks.


Other raid levels

Probably the most noticeable effect for the other raid levels is that if you
don't sync first, then every check will find lots of errors. (Of course you
could 'repair' instead of 'check'. Or do that once. Or something.)

For raid6 it is also safe to not sync first, though with the same caveat. Raid6
always updates parity by reading all blocks in the stripe that aren't known and
calculating P and Q. So the first write to a stripe will make P and Q correct
for that stripe. This is current behaviour. There is no guarantee it will never
change (so theoretically one day you may upgrade your kernel and suffer data
corruption on an old raid6 array).

Summary

In summary, it is safe to use --assume-clean on a raid1 or raid10, though a
"repair" is recommended before too long. For other raid levels it is best avoided.

Potential 'Solutions'

There have been 'solutions' suggested including the use of bitmaps to
efficiently store 'not yet synced' information about the array. It would be
possible to have a 'this is not initialised' flag on the array, and if that is
not set, always do a reconstruct-write rather than a read-modify-write. But the
first time you have an unclean shutdown you are going to resync all the parity
anyway (unless you have a bitmap....) so you may as well resync at the start. So
essentially, at the moment, there is no interest in implementing this since the
added complexity is not justified.

What's the problem anyway?

First of all RAID is all about being safe with your data.

And why is it such a big deal anyway? The initial resync doesn't stop you from
using the array. If you wanted to put an array into production instantly and
couldn't afford any slowdown due to resync, then you might want to skip the
initial resync.... but is that really likely?

So what is --assume-clean for then?

Disaster recovery. If you want to build an array from components that used to be
in a raid then this stops the kernel from scribbling on them. As the man page says:

"Use this only if you really know what you are doing."

2007-06-16 13:38:51

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Sat, 16 Jun 2007, Neil Brown wrote:
>
> I want to test several configurations, from a 45 disk raid6 to a 45 disk
> raid0. at 2-3 days per test (or longer, depending on the tests) this
> becomes a very slow process.
Are you suggesting the code that is written to enhance data integrity is
optimised (or even touched) to support this kind of test scenario?
Seriously? :)

> also, when a rebuild is slow enough (and has enough of a performance
> impact) it's not uncommon to want to operate in degraded mode just long
> enough to get to a maintenance window and then recreate the array and
> reload from backup.

so would mdadm --remove the rebuilding disk help?

David

2007-06-16 14:13:00

by Wakko Warner

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
> On Friday June 15, [email protected] wrote:
>
> > As I understand the way
> > raid works, when you write a block to the array, it will have to read all
> > the other blocks in the stripe and recalculate the parity and write it out.
>
> Your understanding is incomplete.
> For raid5 on an array with more than 3 drives, if you attempt to write
> a single block, it will:
>
> - read the current value of the block, and the parity block.
> - "subtract" the old value of the block from the parity, and "add"
> the new value.
> - write out the new data and the new parity.
>
> If the parity was wrong before, it will still be wrong. If you then
> lose a drive, you lose your data.

I see, I didn't know that md's raid5 did this.

> And why is it such a big deal anyway? The initial resync doesn't stop
> you from using the array. I guess if you wanted to put an array into
> production instantly and couldn't afford any slowdown due to resync,
> then you might want to skip the initial resync.... but is that really
> likely?

When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
always slowed the system down when booting up. Quite significantly I must
say. I wait until I can login and change the rebuild max speed to slow it
down while I'm using it. But that is another thing.

Thanks for the clarification on raid5.

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???

2007-06-16 17:19:15

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Sat, 16 Jun 2007, David Greaves wrote:

> [email protected] wrote:
>> On Sat, 16 Jun 2007, Neil Brown wrote:
>>
>> I want to test several configurations, from a 45 disk raid6 to a 45 disk
>> raid0. at 2-3 days per test (or longer, depending on the tests) this
>> becomes a very slow process.
> Are you suggesting the code that is written to enhance data integrity is
> optimised (or even touched) to support this kind of test scenario?
> Seriously? :)

actually, if it can be done without a huge impact to the maintainability
of the code I think it would be a good idea for the simple reason that I
think the increased experimentation would result in people finding out
what raid level is really appropriate for their needs.

there is a _lot_ of confusion around about what the performance
implications of different raid levels are (especially when you consider
things like raid 10/50/60 where you have two layers combined) and anything
that encourages experimentation would be a good thing.

>> also, when a rebuild is slow enough (and has enough of a performance
>> impact) it's not uncommon to want to operate in degraded mode just long
>> enough to get to a maintenance window and then recreate the array and
>> reload from backup.
>
> so would mdadm --remove the rebuilding disk help?

no. let me try again

drive fails monday morning

scenario 1

replace the failed drive, start the rebuild. system will be slow (degraded
mode + rebuild) for the next three days.

scenario 2

leave it in degraded mode until monday night (accepting the speed penalty
for degraded mode, but not the rebuild penalty)

monday night shutdown the system, put in the new drive, reinitialize the
array, reload the system from backup.

system is back to full speed tuesday morning.

scenario 2 isn't supported with md today, although it sounds as if the
skip rebuild could do this except for raid 5

on my test system, the rebuild says it's running at 5M/s; a dd to a file on
the array says it's doing 45M/s (even while the rebuild is running), so it
seems to me that there may be value in this approach.

David Lang

2007-06-16 17:23:19

by Avi Kivity

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
>>>
>>>
>> Some things are not achievable with block-level raid. For example, with
>> redundancy integrated into the filesystem, you can have three copies for
>> metadata, two copies for small files, and parity blocks for large files,
>> effectively using different raid levels for different types of data on
>> the same filesystem.
>>
>
> Absolutely. And doing that is a very good idea quite independent of
> underlying RAID. Even ext2 stores multiple copies of the superblock.
>
> Having the filesystem duplicate data, store checksums, and be able to
> find a different copy if the first one it chose was bad is very
> sensible and cannot be done by just putting the filesystem on RAID.
>

It would need to know a lot about the RAID geometry in order not to put
the copies on the same disks.

> Having the filesystem keep multiple copies of each data block so that
> when one drive dies, another block is used does not excite me quite so
> much. If you are going to do that, then you want to be able to
> reconstruct the data that should be on a failed drive onto a new
> drive.
> For a RAID system, that reconstruction can go at the full speed of the
> drive subsystem - but needs to copy every block, whether used or not.
> For in-filesystem duplication, it is easy to imagine that being quite
> slow and complex. It would depend a lot on how you arrange data,
> and maybe there is some clever approach to data layout that I haven't
> thought of. But I think that sort of thing is much easier to do in a
> RAID layer below the filesystem.
>

You'd need a reverse mapping of extents to files. While maintaining
that is expensive, it brings a lot of benefits:

- rebuild a failed drive, without rebuilding free space
- evacuate a drive in anticipation of taking it offline
- efficient defragmentation

Reverse mapping storage could serve as free space store too.

> Combining these thoughts, it would make a lot of sense for the
> filesystem to be able to say to the block device "That block looks
> wrong - can you find me another copy to try?". That is an example of
> the sort of closer integration between filesystem and RAID that would
> make sense.
>

It's a step forward, but still quite limited compared to combining the
two layers together. Sticking with the example above, you still can't
have a mix of parity-protected files and mirror-protected files; the
RAID decides that for you.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2007-06-17 01:44:23

by dean gaudet

[permalink] [raw]
Subject: Re: limits on raid

On Sat, 16 Jun 2007, David Greaves wrote:

> Neil Brown wrote:
> > On Friday June 15, [email protected] wrote:
> >
> > > As I understand the way
> > > raid works, when you write a block to the array, it will have to read all
> > > the other blocks in the stripe and recalculate the parity and write it
> > > out.
> >
> > Your understanding is incomplete.
>
> Does this help?
> [for future reference so you can paste a url and save the typing for code :) ]
>
> http://linux-raid.osdl.org/index.php/Initial_Array_Creation

i fixed a typo and added one more note which i think is quite fair:

It is also safe to use --assume-clean if you are performing
performance measurements of different raid configurations. Just
be sure to rebuild your array without --assume-clean when you
decide on your final configuration.

-dean

2007-06-17 01:47:43

by dean gaudet

[permalink] [raw]
Subject: Re: limits on raid

On Sat, 16 Jun 2007, Wakko Warner wrote:

> When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> always slowed the system down when booting up. Quite significantly I must
> say. I wait until I can login and change the rebuild max speed to slow it
> down while I'm using it. But that is another thing.

i use an external write-intent bitmap on a raid1 to avoid this... you
could use internal bitmap but that slows down i/o too much for my tastes.
i also use an external xfs journal for the same reason. 2 disk raid1 for
root/journal/bitmap, N disk raid5 for bulk storage. no spindles in
common.

-dean

2007-06-17 12:04:39

by Andi Kleen

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown <[email protected]> writes:
>
> Having the filesystem duplicate data, store checksums, and be able to
> find a different copy if the first one it chose was bad is very
> sensible and cannot be done by just putting the filesystem on RAID.

Apropos checksums: since RAID5 copies/xors anyways it would
be nice to combine that with the file system. During the xor
a simple checksum could be computed in parallel and stored
in the file system.

And the copy/checksum passes will hopefully at some
point be combined.

-Andi

2007-06-17 13:32:33

by Wakko Warner

[permalink] [raw]
Subject: Re: limits on raid

dean gaudet wrote:
> On Sat, 16 Jun 2007, Wakko Warner wrote:
>
> > When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> > always slowed the system down when booting up. Quite significantly I must
> > say. I wait until I can login and change the rebuild max speed to slow it
> > down while I'm using it. But that is another thing.
>
> i use an external write-intent bitmap on a raid1 to avoid this... you
> could use internal bitmap but that slows down i/o too much for my tastes.
> i also use an external xfs journal for the same reason. 2 disk raid1 for
> root/journal/bitmap, N disk raid5 for bulk storage. no spindles in
> common.

I must remember this if I have to rebuild the array. Although I'm
considering moving to a hardware raid solution when I upgrade my storage.

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???

2007-06-17 17:14:14

by Bill Davidsen

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
> On Thursday June 14, [email protected] wrote:
>
>> On Fri, 15 Jun 2007, Neil Brown wrote:
>>
>>
>>> On Thursday June 14, [email protected] wrote:
>>>
>>>> what is the limit for the number of devices that can be in a single array?
>>>>
>>>> I'm trying to build a 45x750G array and want to experiment with the
>>>> different configurations. I'm trying to start with raid6, but mdadm is
>>>> complaining about an invalid number of drives
>>>>
>>>> David Lang
>>>>
>>> "man mdadm" search for "limits". (forgive typos).
>>>
>> thanks.
>>
>> why does it still default to the old format after so many new versions?
>> (by the way, the documentation said 28 devices, but I couldn't get it to
>> accept more than 27)
>>
>
> Dunno - maybe I can't count...
>
>
>> it's now churning away 'rebuilding' the brand new array.
>>
>> a few questions/thoughts.
>>
>> why does it need to do a rebuild when making a new array? couldn't it
>> just zero all the drives instead? (or better still just record most of the
>> space as 'unused' and initialize it as it starts using it?)
>>
>
> Yes, it could zero all the drives first. But that would take the same
> length of time (unless p/q generation was very very slow), and you
> wouldn't be able to start writing data until it had finished.
> You can "dd" /dev/zero onto all drives and then create the array with
> --assume-clean if you want to. You could even write a shell script to
> do it for you.
>
> Yes, you could record which space is used vs unused, but I really
> don't think the complexity is worth it.
>
>
How about a simple solution which would get an array online and still
be safe? All it would take is a flag which forced reconstruct writes for
RAID-5. You could set it with an option, or automatically if someone
puts --assume-clean with --create, and leave it in the superblock until the
first "repair" runs to completion. And for repair you could make some
assumptions about bad parity not being caused by error but just being unwritten.

Thought 2: I think the unwritten bit is easier than you think: you only
need it on parity blocks for RAID5, not on data blocks. When a write is
done, if the bit is set do a reconstruct, write the parity block, and
clear the bit. Keeping a bit per data block is madness, and appears to
be unnecessary as well.
>> while I consider zfs to be ~80% hype, one advantage it could have (but I
>> don't know if it has) is that since the filesystem and raid are integrated
>> into one layer they can optimize the case where files are being written
>> onto unallocated space and instead of reading blocks from disk to
>> calculate the parity they could just put zeros in the unallocated space,
>> potentially speeding up the system by reducing the amount of disk I/O.
>>
>
> Certainly. But the raid doesn't need to be tightly integrated
> into the filesystem to achieve this. The filesystem need only know
> the geometry of the RAID and when it comes to write, it tries to write
> full stripes at a time. If that means writing some extra blocks full
> of zeros, it can try to do that. This would require a little bit
> better communication between filesystem and raid, but not much. If
> anyone has a filesystem that they want to be able to talk to raid
> better, they need only ask...
>
>
>> is there any way that linux would be able to do this sort of thing? or is
>> it impossible due to the layering preventing the necessary knowledge from
>> being in the right place?
>>
>
> Linux can do anything we want it to. Interfaces can be changed. All
> it takes is a fairly well defined requirement, and the will to make it
> happen (and some technical expertise, and lots of time .... and
> coffee?).
>
Well, I gave you two thoughts, one which would be slow until a repair
but sounds easy to do, and one which is slightly harder but works better
and minimizes performance impact.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-06-17 17:16:36

by Bill Davidsen

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Sat, 16 Jun 2007, Neil Brown wrote:
>
>> It would be possible to have a 'this is not initialised' flag on the
>> array, and if that is not set, always do a reconstruct-write rather
>> than a read-modify-write. But the first time you have an unclean
>> shutdown you are going to resync all the parity anyway (unless you
>> have a bitmap....) so you may as well resync at the start.
>>
>> And why is it such a big deal anyway? The initial resync doesn't stop
>> you from using the array. I guess if you wanted to put an array into
>> production instantly and couldn't afford any slowdown due to resync,
>> then you might want to skip the initial resync.... but is that really
>> likely?
>
> in my case it takes 2+ days to resync the array before I can do any
> performance testing with it. for some reason it's only doing the
> rebuild at ~5M/sec (even though I've increased the min and max rebuild
> speeds and a dd to the array seems to be ~44M/sec, even during the
> rebuild)
>
> I want to test several configurations, from a 45 disk raid6 to a 45
> disk raid0. at 2-3 days per test (or longer, depending on the tests)
> this becomes a very slow process.
>
I've been doing stuff like this, but I just build the array on a
partition per drive so the init is livable. For the stuff I'm doing a
total of 500-100GB is ample to do performance testing.
> also, when a rebuild is slow enough (and has enough of a performance
> impact) it's not uncommon to want to operate in degraded mode just
> long enough to get to a maintenance window and then recreate the array
> and reload from backup.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-06-17 17:28:19

by dean gaudet

[permalink] [raw]
Subject: Re: limits on raid

On Sun, 17 Jun 2007, Wakko Warner wrote:

> dean gaudet wrote:
> > On Sat, 16 Jun 2007, Wakko Warner wrote:
> >
> > > When I've had an unclean shutdown on one of my systems (10x 50gb raid5) it's
> > > always slowed the system down when booting up. Quite significantly I must
> > > say. I wait until I can login and change the rebuild max speed to slow it
> > > down while I'm using it. But that is another thing.
> >
> > i use an external write-intent bitmap on a raid1 to avoid this... you
> > could use internal bitmap but that slows down i/o too much for my tastes.
> > i also use an external xfs journal for the same reason. 2 disk raid1 for
> > root/journal/bitmap, N disk raid5 for bulk storage. no spindles in
> > common.
>
> I must remember this if I have to rebuild the array. Although I'm
> considering moving to a hardware raid solution when I upgrade my storage.

you can do it without a rebuild -- that's in fact how i did it the first
time.

to add an external bitmap:

mdadm --grow --bitmap /bitmapfile /dev/mdX

plus add "bitmap=/bitmapfile" to mdadm.conf... as in:

ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

you can also easily move an ext3 journal to an external journal with
tune2fs (see man page).
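
roughly like this (untested sketch, device names are examples; the
filesystem must be unmounted first):

  mke2fs -O journal_dev /dev/md1            # format the journal device
  tune2fs -O ^has_journal /dev/sda3         # drop the internal journal
  tune2fs -j -J device=/dev/md1 /dev/sda3   # attach the external one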

if you use XFS it's a bit more of a challenge to convert from internal to
external, but see this thread:

http://marc.theaimsgroup.com/?l=linux-xfs&m=106929781232520&w=2

i found that i had to do "sb 1", "sb 2", ..., "sb N" for all sb rather
than just the "sb 0" that email instructed me to do.

-dean

2007-06-17 19:35:31

by Wakko Warner

[permalink] [raw]
Subject: Re: limits on raid

dean gaudet wrote:
> On Sun, 17 Jun 2007, Wakko Warner wrote:
>
> > > i use an external write-intent bitmap on a raid1 to avoid this... you
> > > could use internal bitmap but that slows down i/o too much for my tastes.
> > > i also use an external xfs journal for the same reason. 2 disk raid1 for
> > > root/journal/bitmap, N disk raid5 for bulk storage. no spindles in
> > > common.
> >
> > I must remember this if I have to rebuild the array. Although I'm
> > considering moving to a hardware raid solution when I upgrade my storage.
>
> you can do it without a rebuild -- that's in fact how i did it the first
> time.
>
> to add an external bitmap:
>
> mdadm --grow --bitmap /bitmapfile /dev/mdX
>
> plus add "bitmap=/bitmapfile" to mdadm.conf... as in:
>
> ARRAY /dev/md4 bitmap=/bitmap.md4 UUID=dbc3be0b:b5853930:a02e038c:13ba8cdc

I used evms to set up mine. I have used mdadm in the past. I use lvm on top
of it, which evms makes a little easier to maintain. I have 3 arrays
total (only the raid5 was configured by evms, the other 2 raid1s were done
by hand)

> you can also easily move an ext3 journal to an external journal with
> tune2fs (see man page).

I only have 2 ext3 file systems (one of which is mounted R/O since it's
full); all my others are reiserfs (v3).

What benefit would I gain by using an external journal and how big would it
need to be?

> if you use XFS it's a bit more of a challenge to convert from internal to
> external, but see this thread:

I specifically didn't use XFS (or JFS) since neither one at the time could
be shrunk.

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???

2007-06-17 19:54:23

by dean gaudet

[permalink] [raw]
Subject: Re: limits on raid

On Sun, 17 Jun 2007, Wakko Warner wrote:

> What benefit would I gain by using an external journal and how big would it
> need to be?

i don't know how big the journal needs to be... i'm limited by xfs'
maximum journal size of 128MiB.

i don't have much benchmark data -- but here are some rough notes i took
when i was evaluating a umem NVRAM card. since the pata disks in the
raid1 have write caching enabled it's somewhat of an unfair comparison,
but the important info is the 88 seconds for internal journal vs. 81
seconds for external journal.

-dean

time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

xfs journal   raid5 bitmap   times
internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total


raid5:
- 4x seagate 7200.10 400GB on marvell MV88SX6081
- mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1

raid1:
- 2x maxtor 6Y200P0 on 3ware 7504
- two 128MiB partitions starting at cyl 1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 /dev/sd[fg]1
- mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 /dev/sd[fg]2
- md1 is used for external xfs journal
- md2 has an ext3 filesystem for the external md4 bitmap

xfs:
- mkfs.xfs issued before each run using the defaults (aside from -l logdev=/dev/md1)
- mount -o noatime,nodiratime[,logdev=/dev/md1]

umem:
- 512MiB Micro Memory MM-5415CN
- 2 partitions similar to the raid1 setup

2007-06-17 20:47:01

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Sun, 17 Jun 2007, Wakko Warner wrote:

>> you can also easily move an ext3 journal to an external journal with
>> tune2fs (see man page).
>
> I only have 2 ext3 file systems (One of which is mounted R/O since it's
> full), all my others are reiserfs (v3).
>
>> What benefit would I gain by using an external journal and how big would it
> need to be?

if you have the journal on a drive by itself you end up doing (almost)
sequential reads and writes to the journal and the disk head doesn't need
to move much.

this can greatly increase your write speeds since

1. the journal gets written faster (completing the write as far as your
software is concerned)

2. the heads don't need to seek back and forth from the journal to the
final location that the data gets written.

as for how large it should be, it all depends on the volume of your
writes; once the journal fills up all writes stall until space is freed in
the journal. IIRC ext3 is limited to 128M, and with today's drive sizes I don't
see any reason to make it any smaller.

David Lang

2007-06-17 20:49:20

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Sun, 17 Jun 2007, dean gaudet wrote:

> On Sun, 17 Jun 2007, Wakko Warner wrote:
>
>> What benefit would I gain by using an external journal and how big would it
>> need to be?
>
> i don't know how big the journal needs to be... i'm limited by xfs'
> maximum journal size of 128MiB.
>
> i don't have much benchmark data -- but here are some rough notes i took
> when i was evaluating a umem NVRAM card. since the pata disks in the
> raid1 have write caching enabled it's somewhat of an unfair comparison,
> but the important info is the 88 seconds for internal journal vs. 81
> seconds for external journal.

if you turn on disk write caching the difference will be much larger.

> -dean
>
> time sh -c 'tar xf /var/tmp/linux-2.6.20.tar; sync'

I know that sync will force everything to get as far as the journal; will
it force the journal to be flushed?

David Lang

>
> xfs journal   raid5 bitmap   times
> internal      none           0.18s user 2.14s system 2% cpu 1:27.95 total
> internal      internal       0.16s user 2.16s system 1% cpu 2:01.12 total
> raid1         none           0.07s user 2.02s system 2% cpu 1:20.62 total
> raid1         internal       0.14s user 2.01s system 1% cpu 1:55.18 total
> raid1         raid1          0.14s user 2.03s system 2% cpu 1:20.61 total
> umem          none           0.13s user 2.07s system 2% cpu 1:20.77 total
> umem          internal       0.15s user 2.16s system 2% cpu 1:51.28 total
> umem          umem           0.12s user 2.13s system 2% cpu 1:20.50 total
>
>
> raid5:
> - 4x seagate 7200.10 400GB on marvell MV88SX6081
> - mdadm --create --level=5 --raid-devices=4 /dev/md4 /dev/sd[abcd]1
>
> raid1:
> - 2x maxtor 6Y200P0 on 3ware 7504
> - two 128MiB partitions starting at cyl 1
> - mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md1 /dev/sd[fg]1
> - mdadm --create --level=1 --raid-disks=2 --auto=yes --assume-clean /dev/md2 /dev/sd[fg]2
> - md1 is used for external xfs journal
> - md2 has an ext3 filesystem for the external md4 bitmap
>
> xfs:
> - mkfs.xfs issued before each run using the defaults (aside from -l logdev=/dev/md1)
> - mount -o noatime,nodiratime[,logdev=/dev/md1]
>
> umem:
> - 512MiB Micro Memory MM-5415CN
> - 2 partitions similar to the raid1 setup
>

2007-06-18 04:58:53

by David Chinner

[permalink] [raw]
Subject: Re: limits on raid

On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> Combining these thoughts, it would make a lot of sense for the
> filesystem to be able to say to the block device "That block looks
> wrong - can you find me another copy to try?". That is an example of
> the sort of closer integration between filesystem and RAID that would
> make sense.

I think that this would only be useful on devices that store
discrete copies of the blocks on different devices i.e. mirrors. If
it's an XOR based RAID, you don't have another copy you can
retrieve....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-18 17:21:01

by Brendan Conoboy

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> in my case it takes 2+ days to resync the array before I can do any
> performance testing with it. for some reason it's only doing the rebuild
> at ~5M/sec (even though I've increased the min and max rebuild speeds
> and a dd to the array seems to be ~44M/sec, even during the rebuild)

With performance like that, it sounds like you're saturating a bus
somewhere along the line. If you're using scsi, for instance, it's very
easy for a long chain of drives to overwhelm a channel. You might also
want to consider some other RAID layouts like 1+0 or 5+0 depending upon
your space vs. reliability needs.

--
Brendan Conoboy / Red Hat, Inc. / [email protected]

2007-06-18 17:31:21

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Brendan Conoboy wrote:

> [email protected] wrote:
>> in my case it takes 2+ days to resync the array before I can do any
>> performance testing with it. for some reason it's only doing the rebuild
>> at ~5M/sec (even though I've increased the min and max rebuild speeds and
>> a dd to the array seems to be ~44M/sec, even during the rebuild)
>
> With performance like that, it sounds like you're saturating a bus somewhere
> along the line. If you're using scsi, for instance, it's very easy for a
> long chain of drives to overwhelm a channel. You might also want to consider
> some other RAID layouts like 1+0 or 5+0 depending upon your space vs.
> reliability needs.

I plan to test the different configurations.

however, if I was saturating the bus with the reconstruct how can I fire
off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
reconstruct to ~4M/sec?

I'm putting 10x as much data through the bus at that point, it would seem
to prove that it's not the bus that's saturated.

David Lang

2007-06-18 18:03:39

by Lennart Sorensen

[permalink] [raw]
Subject: Re: limits on raid

On Mon, Jun 18, 2007 at 10:28:38AM -0700, [email protected] wrote:
> I plan to test the different configurations.
>
> however, if I was saturating the bus with the reconstruct how can I fire
> off a dd if=/dev/zero of=/mnt/test and get ~45M/sec while only slowing the
> reconstruct to ~4M/sec?
>
> I'm putting 10x as much data through the bus at that point, it would seem
> to prove that it's not the bus that's saturated.

dd 45MB/s from the raid sounds reasonable.

If you have 45 drives, doing a resync of raid5 or raid6 should probably
involve reading all the disks, and writing new parity data to one drive.
So if you are writing 5MB/s, then you are reading 44*5MB/s from the
other drives, which is 220MB/s. If your resync drops to 4MB/s when
doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
capacity, which surprisingly seems to match the dd speed you are
getting. Seems like you are indeed very much saturating a bus
somewhere. The numbers certainly agree with that theory.

What kind of setup is the drives connected to?

--
Len Sorensen

2007-06-18 18:08:25

by Brendan Conoboy

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> I plan to test the different configurations.
>
> however, if I was saturating the bus with the reconstruct how can I fire
> off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing
> the reconstruct to ~4M/sec?
>
> I'm putting 10x as much data through the bus at that point, it would
> seem to proove that it's not the bus that's saturated.

I am unconvinced. If you take ~1MB/s for each active drive and add in SCSI
overhead, 45M/sec seems reasonable. Have you looked at a running iostat
while all this is going on? Try it out: add up the kB/s from each drive
and see how close you are to your maximum theoretical IO.
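
Something like the following would do as a rough stand-in for watching
iostat - it samples /proc/diskstats twice and sums per-disk throughput
(the "sd" name filter and the 5-second interval are just assumptions of
the sketch):

import time

def sample():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            if not name.startswith("sd") or name[-1].isdigit():
                continue                              # whole disks only, skip partitions
            stats[name] = (int(fields[5]), int(fields[9]))   # sectors read, written
    return stats

before = sample()
time.sleep(5)
after = sample()

total = 0.0
for name in sorted(after):
    rd = after[name][0] - before[name][0]
    wr = after[name][1] - before[name][1]
    kbps = (rd + wr) * 512.0 / 1024 / 5               # 512-byte sectors over 5 seconds
    total += kbps
    print("%-8s %10.0f kB/s" % (name, kbps))
print("%-8s %10.0f kB/s" % ("total", total))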

Also, how's your CPU utilization?

--
Brendan Conoboy / Red Hat, Inc. / [email protected]

2007-06-18 18:15:21

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Lennart Sorensen wrote:

> On Mon, Jun 18, 2007 at 10:28:38AM -0700, [email protected] wrote:
>> I plan to test the different configurations.
>>
>> however, if I was saturating the bus with the reconstruct how can I fire
>> off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing the
>> reconstruct to ~4M/sec?
>>
>> I'm putting 10x as much data through the bus at that point, it would seem
>> to proove that it's not the bus that's saturated.
>
> dd 45MB/s from the raid sounds reasonable.
>
> If you have 45 drives, doing a resync of raid5 or radi6 should probably
> involve reading all the disks, and writing new parity data to one drive.
> So if you are writing 5MB/s, then you are reading 44*5MB/s from the
> other drives, which is 220MB/s. If your resync drops to 4MB/s when
> doing dd, then you have 44*4MB/s which is 176MB/s or 44MB/s less read
> capacity, which surprisingly seems to match the dd speed you are
> getting. Seems like you are indeed very much saturating a bus
> somewhere. The numbers certainly agree with that theory.
>
> What kind of setup is the drives connected to?

simple ultra-wide SCSI to a single controller.

I didn't realize that the rate reported by /proc/mdstat was the write
speed that was taking place; I thought it was the total data rate (reads
+ writes). The next time this message gets changed it would be a good
thing to clarify this.

David Lang

2007-06-18 18:19:00

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Brendan Conoboy wrote:

> [email protected] wrote:
>> I plan to test the different configurations.
>>
>> however, if I was saturating the bus with the reconstruct how can I fire
>> off a dd if=/dev/zero of=/mnt/test and get ~45M/sec whild only slowing the
>> reconstruct to ~4M/sec?
>>
>> I'm putting 10x as much data through the bus at that point, it would seem
>> to proove that it's not the bus that's saturated.
>
> I am unconvinced. If you take ~1MB/s for each active drive, add in SCSI
> overhead, 45M/sec seems reasonable. Have you look at a running iostat while
> all this is going on? Try it out- add up the kb/s from each drive and see
> how close you are to your maximum theoretical IO.

I didn't try iostat; I did look at vmstat, and there the numbers look even
worse: the bo column is ~500 for the resync by itself, but with the dd
it's ~50,000. When I get access to the box again I'll try iostat to get
more details.

> Also, how's your CPU utilization?

~30% of one cpu for the raid 6 thread, ~5% of one cpu for the resync
thread

David Lang

2007-06-18 18:33:36

by Lennart Sorensen

[permalink] [raw]
Subject: Re: limits on raid

On Mon, Jun 18, 2007 at 11:12:45AM -0700, [email protected] wrote:
> simple ultra-wide SCSI to a single controller.

Hmm, isn't ultra-wide limited to 40MB/s? Is it Ultra320 wide? That
could do a lot more, and 220MB/s sounds plausible for Ultra320 SCSI.

> I didn't realize that the rate reported by /proc/mdstat was the write
> speed that was takeing place, I thought it was the total data rate (reads
> + writes). the next time this message gets changed it would be a good
> thing to clarify this.

Well, I suppose it could make sense to show the rate of rebuild, which you can
then compare against the total size of the raid, or you can show the rate of
write, which you then compare against the size of the drive being
synced. Certainly I would expect much higher speeds if it were the
overall raid size, while the numbers seem pretty reasonable as a write
speed. 4MB/s would take forever if it were the overall raid resync
speed. I usually see SATA raid1 resync at 50 to 60MB/s or so, which
matches the read and write speeds of the drives in the raid.

--
Len Sorensen

2007-06-18 18:43:30

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Lennart Sorensen wrote:

> On Mon, Jun 18, 2007 at 11:12:45AM -0700, [email protected] wrote:
>> simple ultra-wide SCSI to a single controller.
>
> Hmm, isn't ultra-wide limited to 40MB/s? Is it Ultra320 wide? That
> could do a lot more, and 220MB/s sounds plausable for 320 scsi.

Yes, sorry, Ultra320 wide.

>> I didn't realize that the rate reported by /proc/mdstat was the write
>> speed that was takeing place, I thought it was the total data rate (reads
>> + writes). the next time this message gets changed it would be a good
>> thing to clarify this.
>
> Well I suppose itcould make sense to show rate of rebuild which you can
> then compare against the total size of tha raid, or you can have rate of
> write, which you then compare against the size of the drive being
> synced. Certainly I would expect much higer speeds if it was the
> overall raid size, while the numbers seem pretty reasonable as a write
> speed. 4MB/s would take for ever if it was the overall raid resync
> speed. I usually see SATA raid1 resync at 50 to 60MB/s or so, which
> matches the read and write speeds of the drives in the raid.

As I read it right now, what happens is the worst of the options: you show
the total size of the array for the amount of work that needs to be done,
but then show only the write speed for the rate of progress being made
through the job.

Total rebuild time was estimated at ~3200 min.

David Lang

2007-06-18 19:11:55

by Brendan Conoboy

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> yes, sorry, ultra 320 wide.

Exactly how many channels and drives?

--
Brendan Conoboy / Red Hat, Inc. / [email protected]

2007-06-18 20:54:37

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Brendan Conoboy wrote:

> [email protected] wrote:
>> yes, sorry, ultra 320 wide.
>
> Exactly how many channels and drives?

One channel, with 2 OS drives plus the 45 drives in the array.

Yes, I realize there will be bottlenecks with this; the large capacity
is to handle longer history (it's going to be a 30TB circular buffer being
fed by a pair of OC-12 links).

It appears that my big mistake was not understanding what /proc/mdstat is
telling me.

David Lang

2007-06-18 21:51:26

by Wakko Warner

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Mon, 18 Jun 2007, Brendan Conoboy wrote:
>
> >[email protected] wrote:
> >> yes, sorry, ultra 320 wide.
> >
> >Exactly how many channels and drives?
>
> one channel, 2 OS drives plus the 45 drives in the array.

Given that the drives only have 4 ID bits, how can you have 47 drives on 1
cable? You'd need a minimum of 3 channels for 47 drives. Do you have some
sort of external box that holds X number of drives and only uses a single
ID?

--
Lab tests show that use of micro$oft causes cancer in lab animals
Got Gas???

2007-06-18 21:56:40

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Mon, 18 Jun 2007, Wakko Warner wrote:

> Subject: Re: limits on raid
>
> [email protected] wrote:
>> On Mon, 18 Jun 2007, Brendan Conoboy wrote:
>>
>>> [email protected] wrote:
>>>> yes, sorry, ultra 320 wide.
>>>
>>> Exactly how many channels and drives?
>>
>> one channel, 2 OS drives plus the 45 drives in the array.
>
> Given that the drives only have 4 ID bits, how can you have 47 drives on 1
> cable? You'd need a minimum of 3 channels for 47 drives. Do you have some
> sort of external box that holds X number of drives and only uses a single
> ID?

Yes, I'm using Promise drive shelves; I have them configured to export
the 15 drives as 15 LUNs on a single ID.

I'm going to be using this as a huge circular buffer that will just be
overwritten eventually 99% of the time, but once in a while I will need to
go back into the buffer and extract and process the data.

David Lang

2007-06-18 22:01:28

by Brendan Conoboy

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> yes, I'm useing promise drive shelves, I have them configured to export
> the 15 drives as 15 LUNs on a single ID.

Well, that would account for it. Your bus is very, very saturated. If
all your drives are active, you can't get more than ~7MB/s per disk
under perfect conditions.

--
Brendan Conoboy / Red Hat, Inc. / [email protected]

2007-06-19 15:07:56

by Phillip Susi

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> one channel, 2 OS drives plus the 45 drives in the array.

Huh? You can only have 16 devices on a SCSI bus, counting the host
adapter. And I don't think you can even manage that many reliably with
the newer, higher-speed versions, at least not without some very special
cables.

> yes I realize that there will be bottlenecks with this, the large
> capacity is to handle longer history (it's going to be a 30TB circular
> buffer being fed by a pair of OC-12 links)

Building one of those nice packet sniffers for the NSA to install on
AT&Ts network eh? ;)


2007-06-19 19:29:22

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Tue, 19 Jun 2007, Phillip Susi wrote:

> [email protected] wrote:
>> one channel, 2 OS drives plus the 45 drives in the array.
>
> Huh? You can only have 16 devices on a scsi bus, counting the host adapter.
> And I don't think you can even manage that much reliably with the newer
> higher speed versions, at least not without some very special cables.

Six devices on the bus (2 OS drives, 3 Promise drive shelves, and the
controller card).

>> yes I realize that there will be bottlenecks with this, the large capacity
>> is to handle longer history (it's going to be a 30TB circular buffer being
>> fed by a pair of OC-12 links)
>
> Building one of those nice packet sniffers for the NSA to install on AT&Ts
> network eh? ;)

Just for going back in time to track hacker actions at a bank.

I'm hoping that once I figure out the drives, the rest of the software
will basically boil down to tcpdump with the right options to write to a
circular buffer of files.

David Lang

2007-06-19 20:11:39

by Lennart Sorensen

[permalink] [raw]
Subject: Re: limits on raid

On Mon, Jun 18, 2007 at 02:56:10PM -0700, [email protected] wrote:
> yes, I'm useing promise drive shelves, I have them configured to export
> the 15 drives as 15 LUNs on a single ID.
>
> I'm going to be useing this as a huge circular buffer that will just be
> overwritten eventually 99% of the time, but once in a while I will need to
> go back into the buffer and extract and process the data.

I would guess that if you ran 15 drives per channel on 3 different
channels, you would resync in 1/3 the time. Well, unless you end up
saturating the PCI bus instead.

Hardware raid of course has an advantage there in that it doesn't have
to go across the bus to do the work (although if you put 45 drives on
one SCSI channel on hardware raid, it will still be limited).

--
Len Sorensen

2007-06-19 20:52:13

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Tue, 19 Jun 2007, Lennart Sorensen wrote:

> On Mon, Jun 18, 2007 at 02:56:10PM -0700, [email protected] wrote:
>> yes, I'm useing promise drive shelves, I have them configured to export
>> the 15 drives as 15 LUNs on a single ID.
>>
>> I'm going to be useing this as a huge circular buffer that will just be
>> overwritten eventually 99% of the time, but once in a while I will need to
>> go back into the buffer and extract and process the data.
>
> I would guess that if you ran 15 drives per channel on 3 different
> channels, you would resync in 1/3 the time. Well unless you end up
> saturating the PCI bus instead.
>
> hardware raid of course has an advantage there in that it doesn't have
> to go across the bus to do the work (although if you put 45 drives on
> one scsi channel on hardware raid, it will still be limited).

I fully realize that the channel will be the bottleneck; I just didn't
understand what /proc/mdstat was telling me. I thought that it was telling
me that the resync was processing 5M/sec, not that it was writing 5M/sec
to each of the two parity locations.

David Lang

2007-06-21 02:57:40

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Monday June 18, [email protected] wrote:
> On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> > Combining these thoughts, it would make a lot of sense for the
> > filesystem to be able to say to the block device "That blocks looks
> > wrong - can you find me another copy to try?". That is an example of
> > the sort of closer integration between filesystem and RAID that would
> > make sense.
>
> I think that this would only be useful on devices that store
> discrete copies of the blocks on different devices i.e. mirrors. If
> it's an XOR based RAID, you don't have another copy you can
> retreive....

You could reconstruct the block in question from all the other blocks
(including parity) and see if that differs from the data block read
from disk... For RAID6, there would be a number of different ways to
calculate alternate blocks. Not convinced that it is actually
something we want to do, but it is a possibility.
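
A toy sketch of that check for the simple XOR (raid5) case - just the
idea, nothing to do with the actual md code:

import os

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [os.urandom(16) for _ in range(4)]    # pretend data chunks of one stripe
parity = xor_blocks(data)

suspect = 2                                  # the chunk the filesystem distrusts
others = [c for i, c in enumerate(data) if i != suspect]
rebuilt = xor_blocks(others + [parity])

print("reconstructed copy matches the one read from disk:", rebuilt == data[suspect])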

I have that - apparently naive - idea that drives use strong checksums,
and will never return bad data, only good data or an error. If this
isn't right, then it would really help to understand what the causes of
other failures are before working out how to handle them....

NeilBrown

2007-06-21 03:02:16

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Saturday June 16, [email protected] wrote:
> Neil Brown wrote:
> > On Friday June 15, [email protected] wrote:
> >
> >> As I understand the way
> >> raid works, when you write a block to the array, it will have to read all
> >> the other blocks in the stripe and recalculate the parity and write it out.
> >
> > Your understanding is incomplete.
>
> Does this help?
> [for future reference so you can paste a url and save the typing for code :) ]
>
> http://linux-raid.osdl.org/index.php/Initial_Array_Creation
>
> David
>
>
>
> Initial Creation
>
> When mdadm asks the kernel to create a raid array the most noticeable activity
> is what's called the "initial resync".
>
> The kernel takes one (or two for raid6) disks and marks them as 'spare'; it then
> creates the array in degraded mode. It then marks spare disks as 'rebuilding'
> and starts to read from the 'good' disks, calculate the parity and determines
> what should be on any spare disks and then writes it. Once all this is done the
> array is clean and all disks are active.

This isn't quite right.
Firstly, it is mdadm which decides to make one drive a 'spare' for
raid5, not the kernel.
Secondly, it only applies to raid5, not raid6 or raid1 or raid10.

For raid6, the initial resync (just like the resync after an unclean
shutdown) reads all the data blocks, and writes all the P and Q
blocks.
raid5 can do that, but it is faster to read all but one disk, and
write to that one disk.
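
For reference, the per-stripe arithmetic a raid6 resync has to do looks
roughly like this - the textbook GF(2^8) math, not the md implementation:

def gf_mul2(x):                         # multiply by the generator g = 2 in GF(2^8)
    x <<= 1
    if x & 0x100:
        x ^= 0x11d                      # the usual raid6 field polynomial
    return x

def compute_pq(chunks):                 # chunks: equal-length data chunks of one stripe
    size = len(chunks[0])
    p, q = bytearray(size), bytearray(size)
    for i in range(size):
        pb = qb = 0
        for chunk in reversed(chunks):  # Horner's rule for Q = sum g^j * D[j]
            pb ^= chunk[i]
            qb = gf_mul2(qb) ^ chunk[i]
        p[i], q[i] = pb, qb
    return bytes(p), bytes(q)

data = [bytes([v] * 8) for v in (0x11, 0x22, 0x33, 0x44)]
p, q = compute_pq(data)
print(p.hex(), q.hex())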

NeilBrown

2007-06-21 06:40:16

by David Chinner

[permalink] [raw]
Subject: Re: limits on raid

On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:
> On Monday June 18, [email protected] wrote:
> > On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
> > > Combining these thoughts, it would make a lot of sense for the
> > > filesystem to be able to say to the block device "That blocks looks
> > > wrong - can you find me another copy to try?". That is an example of
> > > the sort of closer integration between filesystem and RAID that would
> > > make sense.
> >
> > I think that this would only be useful on devices that store
> > discrete copies of the blocks on different devices i.e. mirrors. If
> > it's an XOR based RAID, you don't have another copy you can
> > retreive....
>
> You could reconstruct the block in question from all the other blocks
> (including parity) and see if that differs from the data block read
> from disk... For RAID6, there would be a number of different ways to
> calculate alternate blocks. Not convinced that it is actually
> something we want to do, but it is a possibility.

Agreed - it's not as straightforward as a mirror, and it kind of assumes
that you have software RAID.

/me had his head stuck in hw raid land ;)

> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error. If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them....

The drive is not the only source of errors, though. You could
have a path problem that is corrupting random bits between the drive
and the filesystem. So the data on the disk might be fine, and
reading it via a redundant path might be all that is needed.

Yeah, so I can see how having a different retry semantic would be a
good idea, i.e. if we do a READ_VERIFY I/O, the underlying device
attempts to verify the data is good in as many ways as possible
before returning the verified data or an error.

I guess a filesystem read would become something like this:

verified = 0
error = read(block)
if (error) {
read_verify:
        error = read_verify(block)
        if (error) {
                OMG THE SKY IS FALLING
                return error
        }
        verified = 1
}
/* check contents */
if (contents are bad) {
        if (!verified)
                goto read_verify
        OMG THE SKY HAS FALLEN
        return -EIO
}

Is this the sort of error handling and re-issuing of
I/O that you had in mind?

FWIW, I don't think this really removes the need for a filesystem to
be able to keep multiple copies of stuff about. If the copy(s) on a
device are gone, you've still got to have another copy somewhere
else to get it back...

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-21 06:45:51

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Thu, 21 Jun 2007, David Chinner wrote:

> On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:
>
>> I have that - apparently naive - idea that drives use strong checksum,
>> and will never return bad data, only good data or an error. If this
>> isn't right, then it would really help to understand what the cause of
>> other failures are before working out how to handle them....
>
> The drive is not the only source of errors, though. You could
> have a path problem that is corrupting random bits between the drive
> and the filesystem. So the data on the disk might be fine, and
> reading it via a redundant path might be all that is needed.

One of the 'killer features' of ZFS is that it does checksums of every
file on disk, so apparently many people don't consider the disk infallible.

Several other filesystems also do checksums.

Both BitKeeper and git do checksums of files to detect disk corruption.

As David C. points out, there are many points in the path where the data
could get corrupted besides on the platter.

David Lang

2007-06-21 08:49:21

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
>
> This isn't quite right.
Thanks :)

> Firstly, it is mdadm which decided to make one drive a 'spare' for
> raid5, not the kernel.
> Secondly, it only applies to raid5, not raid6 or raid1 or raid10.
>
> For raid6, the initial resync (just like the resync after an unclean
> shutdown) reads all the data blocks, and writes all the P and Q
> blocks.
> raid5 can do that, but it is faster the read all but one disk, and
> write to that one disk.

How about this:

Initial Creation

When mdadm asks the kernel to create a raid array the most noticeable activity
is what's called the "initial resync".

Raid level 0 doesn't have any redundancy so there is no initial resync.

For raid levels 1, 4, 6 and 10, mdadm creates the array and starts a resync. The
raid algorithm then reads the data blocks and writes the appropriate
parity/mirror (P+Q) blocks across all the relevant disks. There is some sample
output in a section below...

For raid5 there is an optimisation: mdadm takes one of the disks and marks it as
'spare'; it then creates the array in degraded mode. The kernel marks the spare
disk as 'rebuilding', reads from the 'good' disks, calculates the parity,
determines what should be on the spare disk and then just writes it out.

Once all this is done the array is clean and all disks are active.

This can take quite a time and the array is not fully resilient whilst this is
happening (it is, however, fully usable).





Also, is raid4 like raid5 or raid6 in this respect?

2007-06-21 09:00:05

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Thu, 21 Jun 2007, David Chinner wrote:
> one of the 'killer features' of zfs is that it does checksums of every
> file on disk. so many people don't consider the disk infallable.
>
> several other filesystems also do checksums
>
> both bitkeeper and git do checksums of files to detect disk corruption

How different is that to raid1/5/6 being set to a 'paranoid' "read-verify" mode
(as per Dan's recent email) where a read reads from _all_ spindles and verifies
(and with R6 maybe corrects) the stripe before returning it?

Doesn't solve DaveC's issue about the fs doing redundancy but isn't that
essentially just fs level mirroring?

David

2007-06-21 11:00:47

by David Chinner

[permalink] [raw]
Subject: Re: limits on raid

On Thu, Jun 21, 2007 at 04:39:36PM +1000, David Chinner wrote:
> FWIW, I don't think this really removes the need for a filesystem to
> be able to keep multiple copies of stuff about. If the copy(s) on a
> device are gone, you've still got to have another copy somewhere
> else to get it back...

Speaking of knowing where you can safely put multiple copies, I'm in
the process of telling XFS about linear alignment of the underlying
array so that we can:

- spread out the load across it faster.
- provide workload isolation
- know where *not* to put duplicate or EDAC data

I'm aiming at identical subvolumes so it's simple to implement. All
I need to know is the size of each subvolume. I can supply that at
mkfs time or in a mount option, but I want something that works
automatically, so I need to query dm to find out the size of each
underlying device during mount. We should also pass stripe
unit/width with the same interface while we are at it...
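
For md, at least, some of that geometry is already visible from userspace
via sysfs; a rough sketch (the parity count per level is an assumption of
the sketch, and it obviously says nothing about the dm side):

def md_geometry(dev="md0"):
    base = "/sys/block/%s/md/" % dev
    def attr(name):
        return open(base + name).read().strip()
    level = attr("level")                        # e.g. "raid6"
    disks = int(attr("raid_disks"))
    chunk = int(attr("chunk_size"))              # bytes
    parity = {"raid4": 1, "raid5": 1, "raid6": 2}.get(level, 0)
    return level, chunk, chunk * (disks - parity)

level, su, full_stripe = md_geometry()
print("%s: stripe unit %d bytes, full stripe %d bytes" % (level, su, full_stripe))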

What's the correct and safe way to get this information from dm
both into the kernel and out to userspace (mkfs)?

FWIW, my end goal is to be able to map the underlying block device
address spaces directly into the filesystem so that the filesystem
is able to use the underlying devices intelligently and I can
logically separate caches and writeback for the separate subdevices.
A struct address_space per subdevice would be ideal - anyone got
ideas on how to get that?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-06-21 13:06:44

by Mattias Wadenstein

[permalink] [raw]
Subject: Re: limits on raid

On Thu, 21 Jun 2007, Neil Brown wrote:

> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error. If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them....

In theory, that's how storage should work. In practice, silent data
corruption does happen. If not from the disks themselves, somewhere along
the path of cables, controllers, drivers, buses, etc. If you add in fcal,
you'll get even more sources of failure, but usually you can avoid SANs
(if you care about your data).

Well, here are a couple of the issues that I've seen myself:

A hw-raid controller returning every 64th bit as 0, no matter what's on
disk. With no error condition at all. (I've also heard from a colleague
about this on every 64k, but not seen that myself.)

An fcal switch occasionally resetting, garbling the blocks in transit with
random data. Lost a few TB of user data that way.

Add to this the random driver breakage that happens now and then. I've
also had a few broken filesystems from in-memory corruption caused by bad
RAM; not sure there is much hope of fixing that though.

Also, this presentation is pretty worrying on the frequency of silent data
corruption:

https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257

/Mattias Wadenstein

2007-06-21 14:41:03

by Justin Piszcz

[permalink] [raw]
Subject: Re: limits on raid



On Thu, 21 Jun 2007, Mattias Wadenstein wrote:

> On Thu, 21 Jun 2007, Neil Brown wrote:
>
>> I have that - apparently naive - idea that drives use strong checksum,
>> and will never return bad data, only good data or an error. If this
>> isn't right, then it would really help to understand what the cause of
>> other failures are before working out how to handle them....
>
> In theory, that's how storage should work. In practice, silent data
> corruption does happen. If not from the disks themselves, somewhere along the
> path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll
> get even more sources of failure, but usually you can avoid SANs (if you care
> about your data).
>
> Well, here is a couple of the issues that I've seen myself:
>
> A hw-raid controller returning every 64th bit as 0, no matter what's on disk.
> With no error condition at all. (I've also heard from a collegue about this
> on every 64k, but not seen that myself.)
>
> An fcal switch occasionally resetting, garbling the blocks in transit with
> random data. Lost a few TB of user data that way.
>
> Add to this the random driver breakage that happens now and then. I've also
> had a few broken filesystems due to in-memory corruption due to bad ram, not
> sure there is much hope of fixing that though.
>
> Also, this presentation is pretty worrying on the frequency of silent data
> corruption:
>
> https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257
>
> /Mattias Wadenstein

Very interesting slides/presentation; I'm going to watch it shortly.

2007-06-21 16:49:52

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Thu, 21 Jun 2007, Mattias Wadenstein wrote:

> On Thu, 21 Jun 2007, Neil Brown wrote:
>
>> I have that - apparently naive - idea that drives use strong checksum,
>> and will never return bad data, only good data or an error. If this
>> isn't right, then it would really help to understand what the cause of
>> other failures are before working out how to handle them....
>
> In theory, that's how storage should work. In practice, silent data
> corruption does happen. If not from the disks themselves, somewhere along the
> path of cables, controllers, drivers, buses, etc. If you add in fcal, you'll
> get even more sources of failure, but usually you can avoid SANs (if you care
> about your data).

Heh, the pitch I get from the self-proclaimed experts is that if you care
about your data you put it on the SAN (so you can take advantage of the
more expensive disk arrays, various backup advantages, and replication
features that tend to be focused on the SAN because it's a big target).

David Lang

> Well, here is a couple of the issues that I've seen myself:
>
> A hw-raid controller returning every 64th bit as 0, no matter what's on disk.
> With no error condition at all. (I've also heard from a collegue about this
> on every 64k, but not seen that myself.)
>
> An fcal switch occasionally resetting, garbling the blocks in transit with
> random data. Lost a few TB of user data that way.
>
> Add to this the random driver breakage that happens now and then. I've also
> had a few broken filesystems due to in-memory corruption due to bad ram, not
> sure there is much hope of fixing that though.
>
> Also, this presentation is pretty worrying on the frequency of silent data
> corruption:
>
> https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257
>
> /Mattias Wadenstein
>
>

2007-06-21 17:00:48

by Mark Lord

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Thu, 21 Jun 2007, David Chinner wrote:
>
>> On Thu, Jun 21, 2007 at 12:56:44PM +1000, Neil Brown wrote:
>>
>>> I have that - apparently naive - idea that drives use strong checksum,
>>> and will never return bad data, only good data or an error. If this
>>> isn't right, then it would really help to understand what the cause of
>>> other failures are before working out how to handle them....
>>
>> The drive is not the only source of errors, though. You could
>> have a path problem that is corrupting random bits between the drive
>> and the filesystem. So the data on the disk might be fine, and
>> reading it via a redundant path might be all that is needed.
>
> one of the 'killer features' of zfs is that it does checksums of every
> file on disk. so many people don't consider the disk infallable.
>
> several other filesystems also do checksums
>
> both bitkeeper and git do checksums of files to detect disk corruption

No, all of those checksums are to detect *filesystem* corruption,
not device corruption (a mere side-effect).

> as david C points out there are many points in the path where the data
> could get corrupted besides on the platter.

Yup, that too.

But drives either return good data, or an error.

Cheers

2007-06-21 18:31:29

by Martin K. Petersen

[permalink] [raw]
Subject: Re: limits on raid

>>>>> "Mattias" == Mattias Wadenstein <[email protected]> writes:

Mattias> In theory, that's how storage should work. In practice,
Mattias> silent data corruption does happen. If not from the disks
Mattias> themselves, somewhere along the path of cables, controllers,
Mattias> drivers, buses, etc. If you add in fcal, you'll get even more
Mattias> sources of failure, but usually you can avoid SANs (if you
Mattias> care about your data).

Oracle cares a lot about people's data 8). And we've seen many cases
of silent data corruption. Often the problem goes unnoticed for
months. And by the time you find out about it you may have gone
through your backup cycle so the data is simply lost.

The Oracle database in combination with certain high-end arrays
supports a technology called HARD (Hardware Assisted Resilient Data)
which allows the array front end to verify the integrity of an I/O
before committing it to disk. The downside to HARD is that it's
proprietary and only really high-end customers use it (many
enterprises actually mandate HARD).

A couple of years ago some changes started to trickle into the SCSI
Block Commands spec. And as some of you know I've been working on
implementing support for this Data Integrity Field in Linux.

What DIF allows you to do is to attach some integrity metadata to an
I/O. We can attach this metadata all the way up in the userland
application context where the risk of corruption is relatively small.
The metadata passes all the way through the I/O stack, gets verified
by the HBA firmware, through the fabric, gets verified by the array
front end and finally again by the disk drive before the change is
committed to platter. Any discrepancy will cause the I/O to be
failed. And thanks to the intermediate checks you also get fault
isolation.

The DIF integrity metadata contains a CRC of the data block as well as
a reference tag that (for Type 1) needs to match the target sector on
disk. This way the common problem of misdirected writes can be
alleviated.
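
As a rough illustration (not the spec), the 8-byte tuple that rides along
with each 512-byte sector looks something like this - a 16-bit guard CRC,
a 16-bit application tag and a 32-bit reference tag taken from the target
LBA for Type 1:

import struct

def crc16_t10dif(data, poly=0x8bb7):             # T10 CRC: poly 0x8BB7, init 0
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xffff if crc & 0x8000 else (crc << 1) & 0xffff
    return crc

def dif_tuple(sector, lba, app_tag=0):
    guard = crc16_t10dif(sector)
    return struct.pack(">HHI", guard, app_tag, lba & 0xffffffff)

print(dif_tuple(bytes(512), lba=1234).hex())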

Initially, DIF is going to be offered in the FC/SAS space. But I
encourage everybody to lean on their SATA drive manufacturer of choice
and encourage them to provide a similar functionality on consumer or
at the very least nearline drives.


Note there's a difference between FS checksums and DIF. Filesystem
checksums (plug: http://oss.oracle.com/projects/btrfs/) allows the
filesystem to detect that it read something bad. And as discussed
earlier we can potentially retry the read from another mirror or
reconstruct in the case of RAID5/6.

DIF, however, is a proactive technology. It prevents bad stuff from
being written to disk in the first place. You'll know right away when
corruption happens, not 4 months later when you try to read the data
back.

So DIF and filesystem checksumming go hand in hand in preventing data
corruption...

--
Martin K. Petersen Oracle Linux Engineering

2007-06-21 20:09:33

by Nix

[permalink] [raw]
Subject: Re: limits on raid

On 21 Jun 2007, Neil Brown stated:
> I have that - apparently naive - idea that drives use strong checksum,
> and will never return bad data, only good data or an error. If this
> isn't right, then it would really help to understand what the cause of
> other failures are before working out how to handle them....

Look at the section `Disks and errors' in Val Henson's excellent report
on last year's filesystems workshop: <http://lwn.net/Articles/190223/>.
Most of the error modes given there lead to valid checksums and wrong
data...

(while you're there, read the first part too :) )

--
`... in the sense that dragons logically follow evolution so they would
be able to wield metal.' --- Kenneth Eng's colourless green ideas sleep
furiously

2007-06-21 23:03:22

by Bill Davidsen

[permalink] [raw]
Subject: Re: limits on raid

I didn't get a comment on my suggestion for a quick and dirty fix for
--assume-clean issues...

Bill Davidsen wrote:
> Neil Brown wrote:
>> On Thursday June 14, [email protected] wrote:
>>
>>> it's now churning away 'rebuilding' the brand new array.
>>>
>>> a few questions/thoughts.
>>>
>>> why does it need to do a rebuild when makeing a new array? couldn't
>>> it just zero all the drives instead? (or better still just record
>>> most of the space as 'unused' and initialize it as it starts useing
>>> it?)
>>>
>>
>> Yes, it could zero all the drives first. But that would take the same
>> length of time (unless p/q generation was very very slow), and you
>> wouldn't be able to start writing data until it had finished.
>> You can "dd" /dev/zero onto all drives and then create the array with
>> --assume-clean if you want to. You could even write a shell script to
>> do it for you.
>>
>> Yes, you could record which space is used vs unused, but I really
>> don't think the complexity is worth it.
>>
>>
> How about a simple solution which would get an array on line and still
> be safe? All it would take is a flag which forced reconstruct writes
> for RAID-5. You could set it with an option, or automatically if
> someone puts --assume-clean with --create, leave it in the superblock
> until the first "repair" runs to completion. And for repair you could
> make some assumptions about bad parity not being caused by error but
> just unwritten.
>
> Thought 2: I think the unwritten bit is easier than you think, you
> only need it on parity blocks for RAID5, not on data blocks. When a
> write is done, if the bit is set do a reconstruct, write the parity
> block, and clear the bit. Keeping a bit per data block is madness, and
> appears to be unnecessary as well.
>>> while I consider zfs to be ~80% hype, one advantage it could have
>>> (but I don't know if it has) is that since the filesystem an raid
>>> are integrated into one layer they can optimize the case where files
>>> are being written onto unallocated space and instead of reading
>>> blocks from disk to calculate the parity they could just put zeros
>>> in the unallocated space, potentially speeding up the system by
>>> reducing the amount of disk I/O.
>>>
>>
>> Certainly. But the raid doesn't need to be tightly integrated
>> into the filesystem to achieve this. The filesystem need only know
>> the geometry of the RAID and when it comes to write, it tries to write
>> full stripes at a time. If that means writing some extra blocks full
>> of zeros, it can try to do that. This would require a little bit
>> better communication between filesystem and raid, but not much. If
>> anyone has a filesystem that they want to be able to talk to raid
>> better, they need only ask...
>>
>>
>>> is there any way that linux would be able to do this sort of thing?
>>> or is it impossible due to the layering preventing the nessasary
>>> knowledge from being in the right place?
>>>
>>
>> Linux can do anything we want it to. Interfaces can be changed. All
>> it takes is a fairly well defined requirement, and the will to make it
>> happen (and some technical expertise, and lots of time .... and
>> coffee?).
>>
> Well, I gave you two thoughts, one which would be slow until a repair
> but sounds easy to do, and one which is slightly harder but works
> better and minimizes performance impact.
>


--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-06-22 02:25:16

by NeilBrown

[permalink] [raw]
Subject: Re: limits on raid

On Thursday June 21, [email protected] wrote:
> I didn't get a comment on my suggestion for a quick and dirty fix for
> -assume-clean issues...
>
> Bill Davidsen wrote:
> > How about a simple solution which would get an array on line and still
> > be safe? All it would take is a flag which forced reconstruct writes
> > for RAID-5. You could set it with an option, or automatically if
> > someone puts --assume-clean with --create, leave it in the superblock
> > until the first "repair" runs to completion. And for repair you could
> > make some assumptions about bad parity not being caused by error but
> > just unwritten.

It is certainly possible, and probably not a lot of effort. I'm not
really excited about it though.

So if someone were to submit a patch that did the right stuff, I would
probably accept it, but I am unlikely to do it myself.


> >
> > Thought 2: I think the unwritten bit is easier than you think, you
> > only need it on parity blocks for RAID5, not on data blocks. When a
> > write is done, if the bit is set do a reconstruct, write the parity
> > block, and clear the bit. Keeping a bit per data block is madness, and
> > appears to be unnecessary as well.

Where do you propose storing those bits? And how many would you cache
in memory? And what performance hit would you suffer for accessing
them? And would it be worth it?

NeilBrown

2007-06-22 08:11:15

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

Neil Brown wrote:
> On Thursday June 21, [email protected] wrote:
>> I didn't get a comment on my suggestion for a quick and dirty fix for
>> -assume-clean issues...
>>
>> Bill Davidsen wrote:
>>> How about a simple solution which would get an array on line and still
>>> be safe? All it would take is a flag which forced reconstruct writes
>>> for RAID-5. You could set it with an option, or automatically if
>>> someone puts --assume-clean with --create, leave it in the superblock
>>> until the first "repair" runs to completion. And for repair you could
>>> make some assumptions about bad parity not being caused by error but
>>> just unwritten.
>
> It is certainly possible, and probably not a lot of effort. I'm not
> really excited about it though.
>
> So if someone to submit a patch that did the right stuff, I would
> probably accept it, but I am unlikely to do it myself.
>
>
>>> Thought 2: I think the unwritten bit is easier than you think, you
>>> only need it on parity blocks for RAID5, not on data blocks. When a
>>> write is done, if the bit is set do a reconstruct, write the parity
>>> block, and clear the bit. Keeping a bit per data block is madness, and
>>> appears to be unnecessary as well.
>
> Where do you propose storing those bits? And how many would you cache
> in memory? And what performance hit would you suffer for accessing
> them? And would it be worth it?

Sometimes I think one of the problems with Linux is that it tries to do
everything for everyone.

That's not a bad thing - until you look at the complexity it brings - and then
consider the impact and exceptions when you do, eg hardware acceleration? md
information fed up to the fs layer for xfs? simple long term maintenance?

Often these problems are well worth the benefits of the feature.

I _wonder_ if this is one where the right thing is to "just say no" :)

David

2007-06-22 09:51:19

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Fri, 22 Jun 2007, David Greaves wrote:

> That's not a bad thing - until you look at the complexity it brings - and
> then consider the impact and exceptions when you do, eg hardware
> acceleration? md information fed up to the fs layer for xfs? simple long term
> maintenance?
>
> Often these problems are well worth the benefits of the feature.
>
> I _wonder_ if this is one where the right thing is to "just say no" :)

In this case I think the advantage of a higher-level system knowing what
block sizes are efficient for reads and writes can potentially be HUGE.

If the upper levels know that you have a 6-disk raid6 array with a 64K
chunk size, then reads and writes in 256K chunks (aligned) should be able
to be done at basically the speed of a 4-disk raid0 array.
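
A tiny illustration of what "aligned" buys you there (chunk size and disk
count as above; a full-stripe write needs no reads at all):

chunk = 64 * 1024
disks, parity = 6, 2
full_stripe = chunk * (disks - parity)              # 262144 bytes = 256K

def is_full_stripe_write(offset, length):
    return offset % full_stripe == 0 and length % full_stripe == 0

print(is_full_stripe_write(0, 256 * 1024))          # True  -> no read-modify-write
print(is_full_stripe_write(64 * 1024, 256 * 1024))  # False -> partial stripes, reads needed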

What's even more impressive is that this could be done even if the array
is degraded (if you know which drives have failed, you don't even try to read
from them, and you only have to reconstruct the missing info once per
stripe).

The current approach doesn't give the upper levels any chance to operate
in this mode; they just don't have enough information to do so.

The part about wanting to know the raid0 chunk size, so that the upper layers
can be sure that data that's supposed to be redundant is on separate
drives, is also possible.

Storage technology is headed in the direction of having the system do more
and more of the layout decisions, and re-stripe the array as conditions
change (similar to what md can already do with enlarging raid5/6 arrays),
but unless you want to eventually put all that decision logic into the md
layer, you should make it possible for other layers to make queries to find
out what's what, and then they can give directions for what they want to
have happen.

So for several reasons I don't see this as something that deserves
an automatic 'no'.

David Lang

2007-06-22 12:39:50

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

[email protected] wrote:
> On Fri, 22 Jun 2007, David Greaves wrote:
>
>> That's not a bad thing - until you look at the complexity it brings -
>> and then consider the impact and exceptions when you do, eg hardware
>> acceleration? md information fed up to the fs layer for xfs? simple
>> long term maintenance?
>>
>> Often these problems are well worth the benefits of the feature.
>>
>> I _wonder_ if this is one where the right thing is to "just say no" :)
> so for several reasons I don't see this as something that's deserving of
> an atomatic 'no'
>
> David Lang

Err, re-read it, I hope you'll see that I agree with you - I actually just meant
the --assume-clean workaround stuff :)

If you end up 'fiddling' in md because someone specified --assume-clean on a
raid5 [in this case just to save a few minutes of *testing time* on a system with
a heavily choked bus!] then that adds *even more* complexity and exception cases
into all the stuff you described.

I'm very much for the fs layer reading the lower block structure so I don't have
to fiddle with arcane tuning parameters - yes, *please* help make xfs self-tuning!

Keeping life as straightforward as possible low down makes the upwards interface
more manageable and that goal more realistic...

David

2007-06-22 16:00:06

by Bill Davidsen

[permalink] [raw]
Subject: Re: limits on raid

David Greaves wrote:
> [email protected] wrote:
>> On Fri, 22 Jun 2007, David Greaves wrote:
>>
>>> That's not a bad thing - until you look at the complexity it brings
>>> - and then consider the impact and exceptions when you do, eg
>>> hardware acceleration? md information fed up to the fs layer for
>>> xfs? simple long term maintenance?
>>>
>>> Often these problems are well worth the benefits of the feature.
>>>
>>> I _wonder_ if this is one where the right thing is to "just say no" :)
>> so for several reasons I don't see this as something that's deserving
>> of an atomatic 'no'
>>
>> David Lang
>
> Err, re-read it, I hope you'll see that I agree with you - I actually
> just meant the --assume-clean workaround stuff :)
>
> If you end up 'fiddling' in md because someone specified
> --assume-clean on a raid5 [in this case just to save a few minutes
> *testing time* on system with a heavily choked bus!] then that adds
> *even more* complexity and exception cases into all the stuff you
> described.

A "few minutes?" Are you reading the times people are seeing with
multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days.
And as soon as you believe that the array is actually "usable" you cut
that rebuild rate, perhaps in half, and get dog-slow performance from
the array. It's usable in the sense that reads and writes work, but for
useful work it's pretty painful. You either fail to understand the
magnitude of the problem or wish to trivialize it for some reason.

By delaying parity computation until the first write to a stripe, only
the growth of a filesystem is slowed, and all data are protected without
waiting for the lengthy check. The rebuild speed can be set very low,
because on-demand rebuild will do most of the work.
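
Roughly the bookkeeping I have in mind, as a sketch (md keeps no such
bitmap today, and all the names here are made up):

class LazyParityArray(object):
    def __init__(self, stripes):
        self.parity_valid = [False] * stripes     # one bit per stripe

    def write(self, stripe, rmw, reconstruct):
        if self.parity_valid[stripe]:
            rmw(stripe)                           # normal read-modify-write path
        else:
            reconstruct(stripe)                   # write data plus freshly computed parity
            self.parity_valid[stripe] = True

    def repair_worklist(self):                    # what a background "repair" still has to touch
        return [s for s, ok in enumerate(self.parity_valid) if not ok]

arr = LazyParityArray(stripes=8)
arr.write(3, rmw=lambda s: None, reconstruct=lambda s: print("reconstruct-write", s))
print("stripes still needing repair:", arr.repair_worklist())
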
>
> I'm very much for the fs layer reading the lower block structure so I
> don't have to fiddle with arcane tuning parameters - yes, *please*
> help make xfs self-tuning!
>
> Keeping life as straightforward as possible low down makes the upwards
> interface more manageable and that goal more realistic...

Those two paragraphs are mutually exclusive. The fs can be simple
because it rests on a simple device, even if the "simple device" is
provided by LVM or md. And LVM and md can stay simple because they rest
on simple devices, even if they are provided by PATA, SATA, nbd, etc.
Independent layers make each layer more robust. If you want to
compromise the layer separation, some approach like ZFS with full
integration would seem to be promising. Note that layers allow
specialized features at each point, trading integration for flexibility.

My feeling is that full integration and independent layers each have
benefits, as you connect the layers to expose operational details you
need to handle changes in those details, which would seem to make layers
more complex. What I'm looking for here is better performance in one
particular layer, the md RAID5 layer. I like to avoid unnecessary
complexity, but I feel that the current performance suggests room for
improvement.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2007-06-22 16:56:07

by David Greaves

[permalink] [raw]
Subject: Re: limits on raid

Bill Davidsen wrote:
> David Greaves wrote:
>> [email protected] wrote:
>>> On Fri, 22 Jun 2007, David Greaves wrote:
>> If you end up 'fiddling' in md because someone specified
>> --assume-clean on a raid5 [in this case just to save a few minutes
>> *testing time* on system with a heavily choked bus!] then that adds
>> *even more* complexity and exception cases into all the stuff you
>> described.
>
> A "few minutes?" Are you reading the times people are seeing with
> multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB... three days.
Yes. But we are talking initial creation here.

> And as soon as you believe that the array is actually "usable" you cut
> that rebuild rate, perhaps in half, and get dog-slow performance from
> the array. It's usable in the sense that reads and writes work, but for
> useful work it's pretty painful. You either fail to understand the
> magnitude of the problem or wish to trivialize it for some reason.
I do understand the problem and I'm not trying to trivialise it :)

I _suggested_ that it's worth thinking about things rather than jumping in to
say "oh, we can code up a clever algorithm that keeps track of what stripes have
valid parity and which don't and we can optimise the read/copy/write for valid
stripes and use the raid6 type read-all/write-all for invalid stripes and then
we can write a bit extra on the check code to set the bitmaps......"

Phew - and that lets us run the array at semi-degraded performance (raid6-like)
for 3 days rather than either waiting before we put it into production or
running it very slowly.
Now we run this system for 3 years and we saved 3 days - hmmm IS IT WORTH IT?

What happens in those 3 years when we have a disk fail? The solution doesn't
apply then - it's 3 days to rebuild - like it or not.

> By delaying parity computation until the first write to a stripe only
> the growth of a filesystem is slowed, and all data are protected without
> waiting for the lengthly check. The rebuild speed can be set very low,
> because on-demand rebuild will do most of the work.
I am not saying you are wrong.
I ask merely if the balance of benefit outweighs the balance of complexity.

If the benefit were 24x7 then sure - eg using hardware assist in the raid calcs
- very useful indeed.

>> I'm very much for the fs layer reading the lower block structure so I
>> don't have to fiddle with arcane tuning parameters - yes, *please*
>> help make xfs self-tuning!
>>
>> Keeping life as straightforward as possible low down makes the upwards
>> interface more manageable and that goal more realistic...
>
> Those two paragraphs are mutually exclusive. The fs can be simple
> because it rests on a simple device, even if the "simple device" is
> provided by LVM or md. And LVM and md can stay simple because they rest
> on simple devices, even if they are provided by PATA, SATA, nbd, etc.
> Independent layers make each layer more robust. If you want to
> compromise the layer separation, some approach like ZFS with full
> integration would seem to be promising. Note that layers allow
> specialized features at each point, trading integration for flexibility.

That's a simplistic summary.
You *can* loosely couple the layers. But you can enrich the interface and
tightly couple them too - XFS is capable (I guess) of understanding md more
fully than say ext2.
XFS would still work on a less 'talkative' block device where performance wasn't
as important (USB flash maybe, dunno).


> My feeling is that full integration and independent layers each have
> benefits, as you connect the layers to expose operational details you
> need to handle changes in those details, which would seem to make layers
> more complex.
Agreed.

> What I'm looking for here is better performance in one
> particular layer, the md RAID5 layer. I like to avoid unnecessary
> complexity, but I feel that the current performance suggests room for
> improvement.

I agree there is room for improvement.
I suggest that it may be more fruitful to write a tool called "raid5prepare"
that writes zeroes/ones as appropriate to all component devices, after which you can
use --assume-clean without concern. That could look to see if the devices are
SCSI or whatever and take advantage of the hyperfast block writes that can be done.
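
Something along these lines, as a sketch - the device names, sizes and
the exact mdadm call are placeholders, not something to run as-is:

import subprocess

def zero_device(path, length, chunk=1024 * 1024):
    with open(path, "wb") as dev:                 # overwrite the whole component with zeroes
        remaining = length
        while remaining > 0:
            n = min(chunk, remaining)
            dev.write(b"\0" * n)
            remaining -= n

components = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1"]   # placeholders
for dev in components:
    zero_device(dev, length=750 * 10**9)

subprocess.check_call(["mdadm", "--create", "/dev/md0", "--level=5",
                       "--raid-devices=%d" % len(components),
                       "--assume-clean"] + components)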

David

2007-06-22 18:41:45

by David Lang

[permalink] [raw]
Subject: Re: limits on raid

On Fri, 22 Jun 2007, Bill Davidsen wrote:

> By delaying parity computation until the first write to a stripe only the
> growth of a filesystem is slowed, and all data are protected without waiting
> for the lengthly check. The rebuild speed can be set very low, because
> on-demand rebuild will do most of the work.
>>
>> I'm very much for the fs layer reading the lower block structure so I
>> don't have to fiddle with arcane tuning parameters - yes, *please* help
>> make xfs self-tuning!
>>
>> Keeping life as straightforward as possible low down makes the upwards
>> interface more manageable and that goal more realistic...
>
> Those two paragraphs are mutually exclusive. The fs can be simple because it
> rests on a simple device, even if the "simple device" is provided by LVM or
> md. And LVM and md can stay simple because they rest on simple devices, even
> if they are provided by PATA, SATA, nbd, etc. Independent layers make each
> layer more robust. If you want to compromise the layer separation, some
> approach like ZFS with full integration would seem to be promising. Note that
> layers allow specialized features at each point, trading integration for
> flexibility.
>
> My feeling is that full integration and independent layers each have
> benefits, as you connect the layers to expose operational details you need to
> handle changes in those details, which would seem to make layers more
> complex. What I'm looking for here is better performance in one particular
> layer, the md RAID5 layer. I like to avoid unnecessary complexity, but I feel
> that the current performance suggests room for improvement.

They both have benefits, but it shouldn't have to be either-or.

If you build the separate layers and provide ways for the upper
layers to query the lower layers to find out what's efficient, then you can
have some upper layers that don't care about this and treat the lower
layer as a simple block device, while other upper layers find out what
sort of things are more efficient to do and use the same lower layer in a
more complex manner.

The alternative is to duplicate effort (and code) to have two codebases
that try to do the same thing, one stand-alone and one as part of an
integrated solution (and it gets even worse if there end up being multiple
integrated solutions).

David Lang