2005-05-29 22:53:29

by Mikulas Patocka

Subject: RAID-5 design bug (or misfeature)

Hi

RAID-5 has a rather serious design bug --- when two disks become temporarily
inaccessible (as happened to me because of high temperature in the server
room), Linux writes information about these errors to the remaining disks,
and when the failed disks are online again, the RAID-5 array won't ever be
accessible.

The RAID-HOWTO lists some actions that can be taken in this case, but none of
them can be done if the root filesystem is on RAID --- the machine just won't
boot.
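
For the record, a RAID root without an initrd is normally brought up either by
0xfd partition-type autodetection or by the md= boot option --- roughly like
this on the kernel command line, with made-up device names:

    md=0,/dev/sda1,/dev/sdb1,/dev/sdc1 root=/dev/md0

Neither of those helps once the superblocks say the array is dead, which is
exactly the problem.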

I think Linux should stop accessing all disks in a RAID-5 array if two disks
fail, rather than writing "this array is dead" into the superblocks of the
remaining disks and thereby effectively destroying the whole array.

Mikulas


2005-05-29 23:05:09

by Wakko Warner

Subject: Re: RAID-5 design bug (or misfeature)

Mikulas Patocka wrote:
> RAID-5 has a rather serious design bug --- when two disks become temporarily
> inaccessible (as happened to me because of high temperature in the server
> room), Linux writes information about these errors to the remaining disks,
> and when the failed disks are online again, the RAID-5 array won't ever be
> accessible.

I ran into this myself; however, I had 10 disks (5 per channel) and one
channel went down. OK, my array was dead at that point and I had to reboot.
What luck --- the array wasn't usable anymore. My /usr was on that array, but
my / was not. I did not want to go through the initrd/initramfs thing at
the time to set up my / with raid5, plus the fact that you truly cannot boot
from it (thus partitioning and setting aside a slice wasn't viable for me).

> The RAID-HOWTO lists some actions that can be taken in this case, but none of
> them can be done if the root filesystem is on RAID --- the machine just won't
> boot.

I had to reconstruct the array by hand with mdadm; evms wouldn't touch it.
Fortunately, I had a copy of each disk's information and the raid5's
information in files, so it was quite easy to rebuild. I did have backups,
but that wasn't really what I wanted to do. (It did take over 2 hours
before I could return to normal. evms can't handle a raid5 that was in
reconstruction; I think newer versions have this fixed.)
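
If anyone wants to keep the same kind of notes, something along these lines is
enough --- the device names below are only an example, and the output needs to
live somewhere that is not on the array itself:

    mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1 > /root/raid-disk-info.txt
    mdadm --detail /dev/md0 > /root/raid-md0-info.txt
    mdadm --detail --scan > /root/raid-arrays.txt

With the original superblock contents saved, re-creating or force-assembling
the array later is much less of a guessing game.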

> I think Linux should stop accessing all disks in a RAID-5 array if two disks
> fail, rather than writing "this array is dead" into the superblocks of the
> remaining disks and thereby effectively destroying the whole array.

That'd be nice =)

--
Lab tests show that use of micro$oft causes cancer in lab animals

2005-05-29 23:58:12

by Bernd Eckenfels

Subject: Re: RAID-5 design bug (or misfeature)

In article <[email protected]> you wrote:
> I think Linux should stop accessing all disks in a RAID-5 array if two disks
> fail, rather than writing "this array is dead" into the superblocks of the
> remaining disks and thereby effectively destroying the whole array.

I agree with you, however it is a pretty damned stupid idea to use raid-5
for a root disk (I was about to say it is not a good idea to use raid-5 on
linux at all :)

Gruss
Bernd

2005-05-30 02:48:00

by Mikulas Patocka

Subject: Re: RAID-5 design bug (or misfeature)

> In article <[email protected]> you wrote:
> > I think Linux should stop accessing all disks in a RAID-5 array if two disks
> > fail, rather than writing "this array is dead" into the superblocks of the
> > remaining disks and thereby effectively destroying the whole array.
>
> I agree with you, however it is a pretty damned stupid idea to use raid-5
> for a root disk (I was about to say it is not a good idea to use raid-5 on
> linux at all :)

But the root disk might fail too... This way, the system can't be taken down
by any single disk crash.

Mikulas

2005-05-30 03:00:43

by Bernd Eckenfels

Subject: Re: RAID-5 design bug (or misfeature)

On Mon, May 30, 2005 at 04:47:58AM +0200, Mikulas Patocka wrote:
> But the root disk might fail too... This way, the system can't be taken down
> by any single disk crash.

Yes, mirroring has good properties here: most boot loaders work with it, it
is less susceptible to silent corruption, and you can use a 1+0 configuration
for additional protection against multi-disk failures.
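
A 1+0 set is just a stripe over two mirrors, so with md it is roughly the
following (a sketch only, with made-up device names):

    # two RAID-1 mirrors ...
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
    # ... striped together into the 1+0 array
    mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1

Any single disk can fail, and a second failure is survivable as long as it
hits the other mirror pair.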

Greetings
Bernd

2005-05-30 11:58:03

by Alan

Subject: Re: RAID-5 design bug (or misfeature)

On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > In article <[email protected]> you wrote:
> > > I think Linux should stop accessing all disks in a RAID-5 array if two disks
> > > fail, rather than writing "this array is dead" into the superblocks of the
> > > remaining disks and thereby effectively destroying the whole array.

It discovered the disks had failed because they had outstanding I/O that
failed to complete and errored. At that point your stripes *are*
inconsistent. If it didn't mark them as failed then you wouldn't know it
was corrupted after a power restore. You can then clean it, fsck it,
restore it, and use mdadm as appropriate to restore the volume and check it.
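
Roughly something like this once the dropped disks are visible again (a sketch
only --- the device names are made up, and you want to look at the event
counters from --examine before forcing anything):

    # see what each member's superblock thinks happened
    mdadm --examine /dev/sda1 /dev/sdb1 /dev/sdc1
    # force assembly from the members that only dropped out temporarily
    mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
    # then check the filesystem before trusting the data
    fsck -f /dev/md0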

> But the root disk might fail too... This way, the system can't be taken down
> by any single disk crash.

It only takes one disk in an array shorting the 12V and 5V rails due to a
component failure to total the entire disk array, and with both IDE and SCSI
a drive failure can hang the entire bus anyway.

Alan

2005-05-30 13:23:16

by Stephen Frost

Subject: Re: RAID-5 design bug (or misfeature)

* Alan Cox ([email protected]) wrote:
> On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > > In article <[email protected]> you wrote:
> > > > I think Linux should stop accessing all disks in a RAID-5 array if two disks
> > > > fail, rather than writing "this array is dead" into the superblocks of the
> > > > remaining disks and thereby effectively destroying the whole array.
>
> It discovered the disks had failed because they had outstanding I/O that
> failed to complete and errored. At that point your stripes *are*
> inconsistent. If it didn't mark them as failed then you wouldn't know it
> was corrupted after a power restore. You can then clean it, fsck it,
> restore it, and use mdadm as appropriate to restore the volume and check it.

Could that I/O be backed out when it's discovered that there are too many
dead disks for the array to be kept online anymore?

Just a thought,

Stephen



2005-05-30 16:09:46

by Mikulas Patocka

Subject: Re: RAID-5 design bug (or misfeature)

On Mon, 30 May 2005, Alan Cox wrote:

> On Llu, 2005-05-30 at 03:47, Mikulas Patocka wrote:
> > > In article <[email protected]> you wrote:
> > > > I think Linux should stop accessing all disks in a RAID-5 array if two disks
> > > > fail, rather than writing "this array is dead" into the superblocks of the
> > > > remaining disks and thereby effectively destroying the whole array.
>
> It discovered the disks had failed because they had outstanding I/O that
> failed to complete and errored.

I think that's another problem --- when RAID-5 is operating in degraded
mode, the machine must not crash or the volume will be damaged (sectors
that were not being written may be damaged this way). Has anybody developed
a method to deal with this (i.e. something like journaling on RAID)? What
do hardware RAID controllers do in this situation?

> At that point your stripes *are*
> inconsistent. If it didn't mark them as failed then you wouldn't know it
> was corrupted after a power restore. You can then clean it, fsck it,
> restore it, and use mdadm as appropriate to restore the volume and check it.

I can't, because mdadm is on that volume... I solved it by booting from a
floppy and editing the RAID superblocks with a disk hex editor, but not every
user wants to do that; there should at least be a kernel boot parameter for it.

> > But the root disk might fail too... This way, the system can't be taken down
> > by any single disk crash.
>
> It only takes one disk in an array shorting the 12V and 5V rails due to a
> component failure to total the entire disk array, and with both IDE and SCSI
> a drive failure can hang the entire bus anyway.

I meant mechanical failure, which is more common. Of course --- anything
can happen in the case of an electrical failure in the
disk/controller/bus/mainboard...

Mikulas

> Alan
>

2005-05-31 07:59:59

by Helge Hafting

Subject: Re: RAID-5 design bug (or misfeature)

Mikulas Patocka wrote:

>
>I think that's another problem --- when RAID-5 is operating in degraded
>mode, the machine must not crash or the volume will be damaged (sectors
>that were not being written may be damaged this way). Has anybody developed
>a method to deal with this (i.e. something like journaling on RAID)? What
>do hardware RAID controllers do in this situation?
>
>
Hot spares can keep the degraded time to a minimum. If you want to
keep the risk to a minimum, unmount the RAID fs until it is
resynchronized. If you need more safety, there are options like RAID-6
or mirrors of the entire RAID-5 set.

Some hardware controllers have a battery-backed cache. Even a power loss
won't ruin the RAID - the I/O will simply sit in that cache until the
disks become available again. The I/O operation that was in effect when
power was lost can then be retried. Not that this saves you from everything;
the fs could be inconsistent anyway due to the OS being killed in the
middle of its updates. A journalled fs can help with that, though.
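
Adding a hot spare to an existing md array is a one-liner, by the way ---
roughly (the device names are only an example):

    # md rebuilds onto the spare automatically when a member fails
    mdadm /dev/md0 --add /dev/sde1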

Helge Hafting

2005-05-31 21:41:24

by Pavel Machek

Subject: Re: RAID-5 design bug (or misfeature)

Hi!

> > At that point your stripes *are*
> > inconsistent. If it didn't mark them as failed then you wouldn't know it
> > was corrupted after a power restore. You can then clean it, fsck it,
> > restore it, and use mdadm as appropriate to restore the volume and check it.
>
> I can't, because mdadm is on that volume... I solved it by booting from a
> floppy and editing the RAID superblocks with a disk hex editor, but not every
> user wants to do that; there should at least be a kernel boot parameter for
> it.

Well, you should not use hexedit... just boot from a rescue CD and run
mdadm from it. No need to pollute the kernel with that one.

Pavel

2005-06-01 01:44:02

by Mikulas Patocka

Subject: Re: RAID-5 design bug (or misfeature)



On Tue, 31 May 2005, Pavel Machek wrote:

> Hi!
>
> > > At that point your stripes *are*
> > > inconsistent. If it didn't mark them as failed then you wouldn't know it
> > > was corrupted after a power restore. You can then clean it, fsck it,
> > > restore it, and use mdadm as appropriate to restore the volume and check it.
> >
> > I can't, because mdadm is on that volume... I solved it by booting from a
> > floppy and editing the RAID superblocks with a disk hex editor, but not
> > every user wants to do that; there should at least be a kernel boot
> > parameter for it.
>
> Well, you should not use hexedit... just boot from a rescue CD and run
> mdadm from it. No need to pollute the kernel with that one.

Hi!

I think editing the superblock with hexedit is less dangerous than using
raid-tools --- with an editor I know what changes I have made and I can
revert them. With raid-tools, if you create a wrong /etc/raidtab (the
original was on the failed volume too), it will trash the superblocks
completely.
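
For anyone who has never seen one, /etc/raidtab looks roughly like this (the
device names here are only an example); when re-creating a damaged array the
device order and parity-algorithm have to match the original exactly, or
mkraid happily writes new, wrong superblocks over the old ones:

    raiddev /dev/md0
            raid-level              5
            nr-raid-disks           3
            nr-spare-disks          0
            persistent-superblock   1
            chunk-size              64
            parity-algorithm        left-symmetric
            device                  /dev/sda1
            raid-disk               0
            device                  /dev/sdb1
            raid-disk               1
            device                  /dev/sdc1
            raid-disk               2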

I still think it's stupid that Linux modifies RAID superblocks into an
irreversible state.

BTW, that server doesn't have a CD drive. It was installed from the network.

Mikulas

> Pavel
>