2005-01-27 03:59:24

by Marc Lehmann

Subject: critical bugs in md raid5

Hi,

I want to report a number of problems in the current raid5 code, some of
which are pretty annoying, some of which require a superblock reformat.

Here's my setup:

- dual AMD opteron with 64-bit kernel, 2.6.10/2.6.8.1
- 5 raid disks, 4 standard ide on hda..hdd, one sata-device
(that setup gives a much higher performance than putting every device
on its own ide port).

First, the really bad ones (happened with 2.6.10):

Today, I had a crash that required a hard reset. After the next start, the
raid started to rebuild as expected (and, as usual, at 5 times the speed
that I get when reading from the raid array, but that's an old problem
that I had on 2.4, too).

I only remember that I had all five disks with an up-to-date sign
[UUUUU] during the whole rebuild, which made sense to me, as the data
should be up-to-date, despite the raid being out-of-sync. I rebooted
during the time, at around 46%. After the next reboot, resyncing properly
continued.

Then the rebuild finished:

md: md0: sync done.

*immediately* after that line, I got an ide error:

hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=312581110, high=18, low=10591222, sector=312580370
ide: failed opcode was: unknown
end_request: I/O error, dev hda, sector 312580370

I did verify that the disk block indeed was unreadable. The raid did react:

raid5: Disk failure on hda1, disabling device. Operation continuing on 4 devices
RAID5 conf printout:
--- rd:5 wd:4 fd:1
disk 0, o:0, dev:hda1
disk 1, o:1, dev:sda1
disk 2, o:1, dev:hdd1
disk 3, o:1, dev:hdc1
disk 4, o:1, dev:hdb1
RAID5 conf printout:
--- rd:5 wd:4 fd:1
disk 1, o:1, dev:sda1
disk 2, o:1, dev:hdd1
disk 3, o:1, dev:hdc1
disk 4, o:1, dev:hdb1

Interestingly, it immediately started to rebuild the array, with only four
disks (/proc/mdstat showed hda1 as F (faulty)):

.<6>md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 0 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 99999 KB/sec)
for reconstruction.
md: using 128k window, over a total of 156290816 blocks.
md: resuming recovery of md0 from checkpoint.

Indeed, /proc/mdstat showed that it had 46% (last checkpoint) already in
sync, while at the same time only showing 4 disks.

I did:

raidhotremove /dev/md0 /dev/hda1

which worked (rebuild continued, though), then

raidhotadd /dev/md0 /dev/hda1

which worked (rebuild continued, though). I then decided to reboot.

After the reboot, I no longer had a raid, because the array was dirty and
degraded.

This is an important bug, IMHO, as it required me to manually mdadm -C the
array, which I am fairly proficient with, but it's still rather risky,
because of the following issues:

Sometimes, when the machine crashes, it seems that a dirty array is just
as unsafe as a degraded array: when the rebuild starts and hits a bad
block somewhere on the disks it chose to read from (bad blocks are very
common during hard crashes, as the disk might not have time to properly
write the block), it will mark the disk as faulty. Unfortunately, as the
array seems to be in some artificial degraded mode, it seems to think yet
another disk is faulty, and then stop working, as it thinks it has lost
two disks - reformat required.

Then, the reformat might be more difficult than necessary, as, of course,
you need to know the "index" numbers of your disks, I mean these numbers:

md0 : active raid5 hda1[5] sda1[1] hdd1[2] hdc1[3] hdb1[4]

mdadm can be made to print the same numbers. Unfortunately, neither mkraid
nor mdadm has any way of working with disk indices that are outside the
normal range (i.e. only 0..4 are valid on my 5 disk-array when building,
but every raid failure will give the disks a different number, which is in
no relation to the original order).

That means that every replacement disk will need paper and pencil work, as
the required data is not readily available with user-level commands.
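
As a rough illustration of collecting those index numbers by script rather
than on paper, here is a minimal Python sketch that parses a /proc/mdstat
line like the one above. The function name parse_mdstat_members is made up
for this example, and the regular expression assumes only the simple
device[index] form shown here, with an optional trailing (F) for a faulty
member; it is not a complete parser for everything md can print.

```python
import re

def parse_mdstat_members(line):
    """Return {device: (index, faulty)} for one md line from /proc/mdstat."""
    members = {}
    # Matches tokens such as "hda1[5]" or "hda1[5](F)".
    for dev, idx, flag in re.findall(r"(\w+)\[(\d+)\](\(F\))?", line):
        members[dev] = (int(idx), bool(flag))
    return members

line = "md0 : active raid5 hda1[5] sda1[1] hdd1[2] hdc1[3] hdb1[4]"
members = parse_mdstat_members(line)

# List the members ordered by the index md currently assigns to them.
for dev, (idx, faulty) in sorted(members.items(), key=lambda kv: kv[1][0]):
    print(dev, idx, "faulty" if faulty else "ok")
```

Note that this only records the numbers md reports at the moment; it does
not recover the original order once a failure has renumbered a disk.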

In one case, I had two unavailable disks (generated by the problem above)
with indices 5 and 6. Simple guesswork (formatting the array superblocks
in degraded mode - not a well-documented thing, readonly-mount, retry
with different order) gave me the correct order, but it's a tedious and
error-prone task. I'd say many admins might just format the superblocks
in non-degraded mode, and when they detect a problem, it's already too
late because the sync has already started (although it could be that a
sync will not damage the disks in any way, after all, it's simple xor).
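
The "simple xor" remark can be checked with a small Python sketch. It
assumes a toy stripe of four-byte blocks and a made-up helper xor_blocks.
It shows only that in a consistent stripe every block equals the XOR of
the others, so recomputing "parity" at the wrong position would rewrite
the bytes that are already there. It says nothing about stripes left
inconsistent by a crash, nor about what md actually does during a resync.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # four data blocks
stripe = data + [xor_blocks(data)]            # fifth block is parity

# Pretend each position in turn is the parity slot and recompute it.
for slot in range(len(stripe)):
    others = stripe[:slot] + stripe[slot + 1:]
    assert xor_blocks(others) == stripe[slot]

print("every block equals the XOR of the other four")
```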

The summary seems to be that the linux raid driver only protects your data
as long as all disks are fine and the machine never crashes.

The last annoying issue is not a bug per se, and has been reported in much
greater detail earlier: my array can rebuild at a top speed of 165MB/s, but
dd or other tools never read more than about 25-35MB/s top, which is much
less than the speed of a single disk (dd'ing from a single disk gives a
speed of >50MB/s, and dd'ing from, say, 4 or 5 disks gives me well over
200MB/s).

Of course, this last issue is not critical at all - I have been living with
this problem since the 2.4 days :)

Thanks for all the good work that already went into linux, though!

Hope this helps,

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ [email protected]
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE


2005-01-27 05:11:47

by Andi Kleen

Subject: Re: critical bugs in md raid5

Marc Lehmann <[email protected]> writes:
>
> The summary seems to be that the linux raid driver only protects your data
> as long as all disks are fine and the machine never crashes.

"as long as the machine never crashes". That's correct. If you think
about how RAID 5 works there is no way around it. When a write to
a single stripe is interrupted (machine crash) and you lose a disk
during the recovery a lot of data (even unrelated to the data just written)
is lost. That is because there is no way to figure out what part
of the data on the stripe belonged to the old and what part to
the new write.
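
A minimal Python sketch of this failure mode, assuming a toy stripe of
four data blocks plus one XOR parity block (the helper xor_blocks and the
block contents are invented for illustration and are not md's code):

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# One stripe on a toy 5-disk array: four data blocks plus parity.
disks = [b"old0", b"old1", b"old2", b"old3"]
disks.append(xor_blocks(disks))

# A write updates block 0, but the machine crashes before the
# matching parity update reaches the disk.
disks[0] = b"new0"

# During recovery, disk 2 (never touched by the write) is lost and is
# rebuilt from the remaining blocks and the stale parity.
survivors = [blk for i, blk in enumerate(disks) if i != 2]
rebuilt = xor_blocks(survivors)

print(rebuilt == b"old2")   # False: the untouched block comes back wrong
```

The rebuilt block is old2 XORed with the difference between the old and
new contents of block 0, which is why data unrelated to the interrupted
write is affected.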

But that's nothing inherent in Linux RAID5. It's a generic problem.
Pretty much all Software RAID5 implementations have it.

The only way around it is to journal all writes, to make stripe
updates atomic, but in general that's too slow unless you have a
battery backed up journal device.

There are some tricks to avoid this (e.g. always write to a new disk
location and update a disk index atomically), but they tend to be
heavily patented and are slower too. They also go far beyond RAID-5
(use disk space less efficiently etc.) and typically need support
from the file system to be efficient.

RAID-1 helps a bit, because you either get the old or the new data,
but not some corruption. In practice even old data can be a big
problem though (e.g. when file system metadata is affected).

Moral: if you really care about your data, back up very often and
use RAID-1 or get an expensive hardware RAID with battery backup
(all the cheap "hardware RAIDs" are equally useless for this).

-Andi

2005-01-27 06:31:37

by Marc Lehmann

Subject: Re: critical bugs in md raid5

On Thu, Jan 27, 2005 at 06:11:34AM +0100, Andi Kleen <[email protected]> wrote:
> Marc Lehmann <[email protected]> writes:
> > The summary seems to be that the linux raid driver only protects your data
> > as long as all disks are fine and the machine never crashes.
>
> "as long as the machine never crashes". That's correct. If you think
> about how RAID 5 works there is no way around it. When a write to

I disagree. When not working in degraded mode, it's absolutely reasonable
to e.g. use only the non-parity data. A crash with raid5 is in no way
different to a crash without raid5 then: either the old data is on the
disk, the new data is on the disk, or you had some catastrophic disk event
and no data is on the disk.

The case I reported was not a catastrophic failure: either the old or new
data was on the disk, and the filesystem journaling (which is ext3) will
take care of it. Even if the parity information is not in sync, either old or
new data is on the disk.
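
A minimal Python sketch of the non-degraded case argued here, assuming a
toy stripe of four data blocks plus XOR parity. The names xor_blocks and
resync are made up for illustration; this models the argument, not the md
driver.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def resync(stripe):
    """Recompute parity (last block) from the data blocks; data is kept as-is."""
    data = stripe[:-1]
    return data + [xor_blocks(data)]

# The crash left block 0 with new data while parity still matches the old data.
old = [b"old0", b"old1", b"old2", b"old3"]
stripe = [b"new0"] + old[1:] + [xor_blocks(old)]

fixed = resync(stripe)
assert fixed[:-1] == stripe[:-1]        # data blocks are old or new, unchanged
assert xor_blocks(fixed) == bytes(4)    # parity is consistent again
```

With all five disks readable, only the parity block changes; the trouble
starts only if a data disk has to be reconstructed from the stale parity.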

> a single stripe is interrupted (machine crash) and you lose a disk
> during the recovery a lot of data (even unrelated to the data just written)
> is lost.

This is not what I described, in fact, I haven't lost any data, despite
having had a number of such problems (I did verify that afterwards, and
found no differences. Maybe this is luck, but it seems to happen in the
majority of cases, and I had a similar problem at least 5 or 6 times
because I didn't encounter the bug I reported).

> But that's nothing inherent in Linux RAID5. It's a generic problem.
> Pretty much all Software RAID5 implementations have it.

Indeed, but I think linux' behaviour is especially poor. For example, the
renumbering of the devices or the strange rebuild-restart behaviour (which
is definitely a bug) will make recovery unnecessarily complicated.

> RAID-1 helps a bit, because you either get the old or the new data,
> but not some corruption.

You don't get any magical corruption with RAID5 either... the data contents
will either be old, or new. The difference is that you cannot trust parity.

> In practice even old data can be a big
> problem though (e.g. when file system metadata is affected)

Of course, but that's supposed to be worked around by using a journaling
file system, right?

> Morale: if you really care about your data backup very often and
> use RAID-1 or get an expensive hardware RAID with battery backup
> (all the cheap "hardware RAIDs" are equally useless for this)

Yes, I have been thinking of that for some time now, but always had a problem
because the affordable ones have low performance. But given linux'
effective slower-than-a-single-disk performance it shouldn't be hard to
beat nowadays.

There is, however, at least the resyncing with only 4 out of 5 disks, that
is doubtlessly a bug somewhere.

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ [email protected]
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

2005-01-27 06:48:46

by Marc Lehmann

Subject: Re: critical bugs in md raid5

On Thu, Jan 27, 2005 at 06:11:34AM +0100, Andi Kleen <[email protected]> wrote:
> Marc Lehmann <[email protected]> writes:
> >
> > The summary seems to be that the linux raid driver only protects your data
> > as long as all disks are fine and the machine never crashes.
>
> "as long as the machine never crashes". That's correct. If you think

Thanks for your thoughts, btw :)

I forgot to mention that even if data is known to be lost it's much better
to return, say, EIO to higher levels than to completely shut down the
device (after all, this is no different from how other block devices behave).
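
A minimal Python sketch of that behaviour, assuming a toy in-memory block
device that knows which blocks are unrecoverable. The class ToyArray and
its fields are hypothetical and only illustrate failing a single request
with EIO while the rest of the device keeps working.

```python
import errno

class ToyArray:
    """Toy block device that knows which blocks are unrecoverable."""

    def __init__(self, blocks, lost):
        self.blocks = blocks
        self.lost = set(lost)

    def read(self, n):
        if n in self.lost:
            # Fail only this request; the rest of the device stays usable.
            raise OSError(errno.EIO, "block %d is unrecoverable" % n)
        return self.blocks[n]

array = ToyArray([b"a", b"b", b"c"], lost=[1])
for n in range(3):
    try:
        print(n, array.read(n))
    except OSError as e:
        print(n, "EIO" if e.errno == errno.EIO else e)
```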

Also, it's still likely that some old error can be repaired, as the broken
non-parity block might be old. This is probably better to be handled in
userspace, though, with special tools. But for them it might be vital to
get the correct disk index, to be able to detect the stripe layout.

It's usually much faster to repair and verify, as opposed to format and
restore, of course.

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ [email protected]
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

2005-01-27 09:51:10

by Andi Kleen

Subject: Re: critical bugs in md raid5

> I disagree. When not working in degraded mode, it's absolutely reasonable
> to e.g. use only the non-parity data. A crash with raid5 is in no way

Yep. But when you go into degraded mode during the crash recovery
(before the RAID is fully synced again) you lose.

> different to a crash without raid5 then: either the old data is on the
> disk, the new data is on the disk, or you had some catastrophic disk event
> and no data is on the disk.

No, that's not how RAID-5 works. For its redundancy it requires
coordinated writes of full stripes (= bigger than fs block) over
multiple disks. When you crash in the middle of a write and you
lose a disk during crash recovery there is no way to fully
reconstruct all the data because the XOR data recovery requires
valid data on all disks.

The nasty part there is that it can affect completely unrelated
data too (on a traditional disk you normally only lose the data
that is currently being written) because of the relationship
between stripes on different disks.

>
> The case I reported was not a catastrophic failure: either the old or new
> data was on the disk, and the filesystem journaling (which is ext3) will
> take care of it. Even if the parity information is not in sync, either old or
> new data is on the disk.

But you lost a disk in the middle of recovery (any IO error is
a lost disk)

> Indeed, but I think linux' behaviour is especially poor. For example, the
> renumbering of the devices or the strange rebuild-restart behaviour (which
> is definitely a bug) will make recovery unnecessarily complicated.

There were some suggestions in the past
to be a bit nicer on read IO errors - often if a read fails and you rewrite
the block from the reconstructed data the disk would allocate a new block
and then be error free again.

The problem is just that when there are user visible IO errors
on a modern disk something is very wrong and it will likely run quickly out
of replacement blocks, and will eventually fail. That is why
Linux "forces" early replacement of the disk on any error - it is the
safest thing to do.


> > problem though (e.g. when file system metadata is affected)
>
> Of course, but that's supposed to be worked around by using a journaling
> file system, right?

Nope, journaling is no magical fix for meta data corruption.

-Andi

2005-01-27 16:34:12

by Marc Lehmann

Subject: Re: critical bugs in md raid5 and ATA disk failure/recovery modes

On Thu, Jan 27, 2005 at 10:51:02AM +0100, Andi Kleen <[email protected]> wrote:
> > I disagree. When not working in degraded mode, it's absolutely reasonable
> > to e.g. use only the non-parity data. A crash with raid5 is in no way
>
> Yep. But when you go into degraded mode during the crash recovery
> (before the RAID is fully synced again) you lose.

Hi, see below.

>
> > different to a crash without raid5 then: either the old data is on the
> > disk, the new data is on the disk, or you had some catastrophic disk event
> > and no data is on the disk.
>
> No, that's not how RAID-5 works. For its redundancy it requires

Hi, I think we might have misunderstood each other.

In fact, I fully agree with you on how raid5 works. However, the current
linux raid behaviour is highly suboptimal, and offers much less than your
description of raid5 would enable it to do:

- it shouldn't do any kind of re-syncing with two failed disks (as it did
in the case I described). That makes no sense and possibly destroys more
data.

- it should still satisfy read requests whenever possible (right now,
the device is often fully dead, despite the data being there in maybe
100% of the cases).

The thing is, the md raid5 driver itself is able to satisfy most read
requests and even some write requests in >= two-failed-disk mode, but it
is not so when the failure happens during reconstruction, so in a case
where much *more* data can safely be provided it offers *less* than when
two disks have failed.

- Vital information about the disk order that might be required for repairing
is being destroyed.

I think all of these points are valid: despite the deficiencies in raid5
protection in theory, the linux raid behaviour is much worse in practice.

> The nasty part there is that it can affect completely unrelated
> data too (on a traditional disk you normally only lose the data
> that is currently being written) because of of the relationship
> between stripes on different disks.

Hmm.. indeed, I do not understand this. My reasoning would be as follows:

If I had a bad block, I either lose parity (== no data loss) or I lose a
data block (== this data block is lost when the machine crashes).

If unrelated data can get lost (as it is right now, as the device
basically is lost), then this seems like a deficiency in the driver.

> > The case I reported was not a catastrophic failure: either the old or new
> > data was on the disk, and the filesystem journaling (which is ext3) will
> > take care of it. Even if the parity information is not in sync, either old or
> > new data is on the disk.
>
> But you lost a disk in the middle of recovery (any IO error is
> a lost disk)

Yes, and I hit a bug, which I reported.

> > Indeed, but I think linux' behaviour is especially poor. For example, the
> > renumbering of the devices or the strange rebuild-restart behaviour (which
> > is definitely a bug) will make recovery unnecessarily complicated.
>
> There were some suggestions in the past
> to be a bit nicer on read IO errors - often if a read fails and you rewrite
> the block from the reconstructed data the disk would allocate a new block
> and then be error free again.

(I am not asking for this kind of automatic recovery, but here are some
thoughts on the above):

With modern IDE drives, trying to correct is actually the right thing to
do.

At least when the device indicates that it is still working fine, as
opposed to being in pre-failure mode, for example due to lack of
replacement blocks.

> The problem is just that when there are user visible IO errors
> on a modern disk something is very wrong and it will likely run quickly out

No, the disk will likely just re-write the block. There are different
failure modes on IDE drives, the most likely failure on a crash is that
some block couldn't be written due to, say, a power outage, or a hard or
soft reset in the middle of a write (sad, but true, many ide disks act
like that, there have been discussions about this on LK).

No replacement block will need to be allocated in that case, just the
currently written data is lost. And nothing at all will be wrong with the
disk in that case, either. So I dispute the "on a modern disk something is
very wrong" because that is normal operation of an ATA disk, mandated by
standards.

Also, the number of used replacement blocks can be queried on basically
all modern ATA disks, and there is a method in place to warn about
possible failures, namely SMART.
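
As a minimal sketch of such a query, the Python snippet below pulls the
reallocated-sector count out of a SMART attribute table. It assumes text
in the column layout printed by smartctl -A from smartmontools; the
function name reallocated_sectors is made up, and the sample lines are
invented for illustration and do not come from any disk in this thread.

```python
def reallocated_sectors(smart_text):
    """Return the raw Reallocated_Sector_Ct value, or None if absent."""
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            return int(fields[9])
    return None

# Invented sample in the attribute-table layout; not real measurements.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       3
194 Temperature_Celsius     0x0022   040   045   000    Old_age   Always       -       40
"""

count = reallocated_sectors(sample)
print(count)
```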

Please note that most disks can be made to regularly scan their surface and
replace blocks, so the device might run out of replacement blocks without any
write access from the driver.

So this kind of danger is already possible and likely without linux trying to
repair the block, so repairing the block is just normal operation for the
drive.

What the drive does in many failures is simply tag the block as unreadable
(mostly because the checksum/ecc data does not match) and correct this on
write. Most drives will also check the surface and allocate a replacement
block automatically if required.

> of replacement blocks, and will eventually fail. That is why

Then the drive would be very buggy. If it runs out of replacement blocks it
will not suddenly fail, but only be unable to repair the block.

> Linux "forces" early replacement of the disk on any error - it is the
> safest thing to do.

That is certainly untrue. The safest thing to do would doubtlessly be to
make a warning that the disk needs to be replaced but still provide the
data as long as possible, instead of killing the device.

It would certainly make sense to not touch the disk in write mode, or, if
one is paranoid, in read mode, but right now the device is simply lost.

> > Of course, but that's supposed to be worked around by using a journaling
> > file system, right?
>
> Nope, journaling is no magical fix for meta data corruption.

Meta data corruption of what? The raid device, then yes, the filesystem,
then no.

raid5 works by relying on error detection of the underlying device. It
will suffer from the same kind of corruption that a normal device suffers,
i.e. if data gets corrupted silently it's gone. However, in other cases
(loud error reporting), the raid device will not corrupt data, as it can
always know which data is there and which isn't, just as with a normal
disk.

What raid provides is just more redundant data in normal operation - it
doesn't suffer from silent data corruption more than a normal disk.

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ [email protected]
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

2005-01-27 16:59:27

by Marc Lehmann

Subject: Re: critical bugs in md raid5

On Thu, Jan 27, 2005 at 10:51:02AM +0100, Andi Kleen <[email protected]> wrote:
> The nasty part there is that it can affect completely unrelated
> data too (on a traditional disk you normally only lose the data
> that is currently being written) because of of the relationship
> between stripes on different disks.

Sorry, I must be a bit dense at times. I understand that now: you meant
the case where parity is lost and you have an I/O error in other cases.

> There were some suggestions in the past
> to be a bit nicer on read IO errors - often if a read fails and you rewrite
> the block from the reconstructed data the disk would allocate a new block
> and then be error free again.
>
> The problem is just that when there are user visible IO errors
> on a modern disk something is very wrong and it will likely run quickly out

Also, linux already does re-write failed parity blocks automatically on
a crash, so whatever damage you might think might be done to the disk
will already be done on numerous occasions, as neither linux in general nor
the raid driver in particular checks for bad blocks before rewriting (I don't
suggest that it does, just that linux already rewrites failed blocks if it
doesn't know about them, and this hasn't been a particularly bad problem).

--
The choice of a
-----==- _GNU_
----==-- _ generation Marc Lehmann
---==---(_)__ __ ____ __ [email protected]
--==---/ / _ \/ // /\ \/ / http://schmorp.de/
-=====/_/_//_/\_,_/ /_/\_\ XX11-RIPE

2005-01-29 18:35:30

by Pavel Machek

Subject: Re: critical bugs in md raid5 and ATA disk failure/recovery modes

Hi!

> > The nasty part there is that it can affect completely unrelated
> > data too (on a traditional disk you normally only lose the data
> > that is currently being written) because of of the relationship
> > between stripes on different disks.

Well, you could set stripe size to 512B; that way, RAID-5 would be
*very* slow, but it should have same characteristics as normal disc
w.r.t. crash. Unrelated data would not be lost, and you'd either get
old data or new data...

Nasty part might be that if it went to degraded mode (before resync is
done), data on disk might silently change; that's bad I guess.

Performance would not be good, also.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!

2005-01-29 18:37:38

by Andi Kleen

Subject: Re: critical bugs in md raid5 and ATA disk failure/recovery modes

> Well, you could set stripe size to 512B; that way, RAID-5 would be
> *very* slow, but it should have same characteristics as normal disc
> w.r.t. crash. Unrelated data would not be lost, and you'd either get
> old data or new data...

When you lose a disk during recovery you can still lose
unrelated data (any "sibling" in a stripe set because its parity
information is incomplete). RAID-1 doesn't have this problem though.

-Andi

2005-01-29 18:55:33

by Pavel Machek

Subject: Re: critical bugs in md raid5 and ATA disk failure/recovery modes

Hi!

> > Well, you could set stripe size to 512B; that way, RAID-5 would be
> > *very* slow, but it should have same characteristics as normal disc
> > w.r.t. crash. Unrelated data would not be lost, and you'd either get
> > old data or new data...
>
> When you lose a disk during recovery you can still lose
> unrelated data (any "sibling" in a stripe set because its parity
> information is incomplete). RAID-1 doesn't have this problem though.

You are right, I'd have to do something very special... Like if I know
it is 4K filesystem, raid-5 from 5 disks could do the trick. Like

Disk1        Disk2      Disk3       Disk4       Disk5
bytes0-511   512-1023   1024-1535   1536-2047   parity
....

....no, even that does not work. Adding a single bit for each 4K
saying "this stripe is being written" (with barriers etc.) and returning
read errors if the bit is set might actually do the trick, but that's no
longer raid-5... (Can ext3 handle error in journal?)
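
A minimal Python sketch of that "being written" bit, assuming a toy
in-memory stripe. The class ToyStripe and its methods are hypothetical; in
a real design the flag would have to reach stable storage, ordered with
barriers, before the stripe itself is touched.

```python
import errno

class ToyStripe:
    """One stripe guarded by a 'being written' flag (toy model)."""

    def __init__(self, data):
        self.data = data
        self.in_flight = False   # would have to live on stable storage

    def begin_write(self):
        self.in_flight = True    # set (and flushed) before touching the stripe

    def finish_write(self, data):
        self.data = data
        self.in_flight = False   # cleared only once data and parity are on disk

    def read(self):
        if self.in_flight:
            # A crash happened mid-write: the contents cannot be trusted.
            raise OSError(errno.EIO, "stripe was being written")
        return self.data

s = ToyStripe(b"old")
s.begin_write()          # simulate a crash here: finish_write never runs
try:
    s.read()
except OSError as e:
    print("read failed with EIO:", e.errno == errno.EIO)
```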
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!