2006-01-17 19:35:49

by Cynbe ru Taren

Subject: FYI: RAID5 unusably unstable through 2.6.14


Just in case the RAID5 maintainers aren't aware of it:

The current Linux kernel RAID5 implementation is just
too fragile to be used for most of the applications
where it would be most useful.

In principle, RAID5 should allow construction of a
disk-based store which is considerably MORE reliable
than any individual drive.

In my experience, at least, using Linux RAID5 results
in a disk storage system which is considerably LESS
reliable than the underlying drives.

What happens repeatedly, at least in my experience over
a variety of boxes running a variety of 2.4 and 2.6
Linux kernel releases, is that any transient I/O problem
results in a critical mass of RAID5 drives being marked
'failed', at which point there is no longer any supported
way of retrieving the data on the RAID5 device, even
though the underlying drives are all fine, and the underlying
data on those drives almost certainly intact.

This has just happened to me for at least the sixth time,
this time in a brand new RAID5 consisting of 8 200G hotswap
SATA drives backing up the contents of about a dozen onsite
and offsite boxes via dirvish, which took me the better part
of December to get initialized and running, and now two weeks
later I'm back to square one.

I'm currently digging through the md kernel source code
trying to work out some ad-hoc recovery method, but this
level of flakiness just isn't acceptable on systems where
reliable mass storage is a must -- and when else would
one bother with RAID5?

I run a RAID1 mirrored boot and/or root partition on all
the boxes I run RAID5 on -- and lots more as well -- and
RAID1 -does- work as one would hope, providing a disk
store -more- reliable than the underlying drives. A
Linux RAID1 system will ride out any sort of sequence
of hardware problems, and if the hardware is physically
capable of running at all, the RAID1 system will pop
right back like a cork coming out of white water.

I've NEVER had a RAID1 throw a temper tantrum and go
into apoptosis mode the way RAID5s do given the slightest
opportunity.

We need RAID5 to be equally resilient in the face of
real-world problems, people -- it isn't enough to
just be able to function under ideal lab conditions!

A design bug is -still- a bug, and -still- needs to
get fixed.

Something HAS to be done to make the RAID5 logic
MUCH more conservative about destroying RAID5
systems in response to a transient burst of I/O
errors, before it can in good conscience be declared
ready for production use -- or at MINIMUM to provide
a SUPPORTED way of restoring a butchered RAID5 to
last-known-good configuration or such once transient
hardware issues have been resolved.

There was a time when Unix filesystems disintegrated
on the slightest excuse, requiring guru-level inode
hand-editing to fix. fsck basically ended that,
allowing any idiot to successfully maintain a unix
filesystem in the face of real-life problems like
power failures and kernel crashes. Maybe we need
a mdfsck which can fix sick RAID5 subsystems?

In the meantime, IMHO Linux RAID5 should be prominently flagged
EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
building up ill-will and undeserved distrust of Linux
software quality generally.

Pending some quantum leap in Linux RAID5 resistance to
collapse, I'm switching to RAID1 everywhere: Doubling
my diskspace hardware costs is a SMALL price to pay to
avoid weeks of system downtime and rebuild effort annually.
I like to spend my time writing open source, not
rebuilding servers. :) (Yes, I could become an md
maintainer myself. But only at the cost of defaulting
on pre-existing open source commitments. We all have
full plates.)

Anyhow -- kudos to everyone involved: I've been using
Unix since v7 on PDP-11, Irix since its 68020 days,
and Linux since booting off floppy was mandatory, and
in general I'm happy as a bug in a rug with the fleet
of Debian Linux boxes I manage, with uptimes often exceeding
a year, typically limited only by hardware or software
upgrades -- great work all around, everyone!

Life is Good!

-- Cynbe




2006-01-17 19:43:21

by Benjamin LaHaise

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Tue, Jan 17, 2006 at 01:35:46PM -0600, Cynbe ru Taren wrote:
> In principle, RAID5 should allow construction of a
> disk-based store which is considerably MORE reliable
> than any individual drive.
>
> In my experience, at least, using Linux RAID5 results
> in a disk storage system which is considerably LESS
> reliable than the underlying drives.

That is a function of how RAID5 works. A properly configured RAID5 array
will have a spare disk to take over in case one of the members fails, as
otherwise you run a serious risk of not being able to recover any data.

> What happens repeatedly, at least in my experience over
> a variety of boxes running a variety of 2.4 and 2.6
> Linux kernel releases, is that any transient I/O problem
> results in a critical mass of RAID5 drives being marked
> 'failed', at which point there is no longer any supported
> way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying
> data on those drives almost certainly intact.

Underlying disks should not be experiencing transient failures. Are you
sure the problem isn't with the disk controller you're building your array
on top of? At the very least any bug report requires that information to
be able to provide even a basic analysis of what is going wrong.

Personally, I am of the opinion that RAID5 should not be used by the
vast majority of people as the failure modes it entails are far too
complex for most people to cope with.

-ben
--
"You know, I've seen some crystals do some pretty trippy shit, man."
Don't Email: <[email protected]>.

2006-01-17 19:56:07

by Kyle Moffett

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Jan 17, 2006, at 14:35, Cynbe ru Taren wrote:
> What happens repeatedly, at least in my experience over a variety
> of boxes running a variety of 2.4 and 2.6 Linux kernel releases, is
> that any transient I/O problem results in a critical mass of RAID5
> drives being marked 'failed', at which point there is no longer any
> supported way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying data
> on those drives almost certainly intact.

Insufficient detail. Please provide a full bug report detailing the
problem, then we can help you.

> I've NEVER had a RAID1 throw a temper tantrum and go into
> apoptosis mode the way RAID5s do given the slightest opportunity.

I've never had either RAID1 _or_ RAID5 throw temper tantrums on me,
_including_ during drive failures. In fact, I've dealt easily with
Linux RAID multi-drive failures that threw all our shiny 3ware RAID
hardware into fits it took me an hour to work out.

> Something HAS to be done to make the RAID5 logic MUCH more
> conservative about destroying RAID5
> systems in response to a transient burst of I/O errors, before it
> can in good conscience be declared ready for production use -- or
> at MINIMUM to provide a SUPPORTED way of restoring a butchered
> RAID5 to last-known-good configuration or such once transient
> hardware issues have been resolved.

The problem is that such errors are _rarely_ transient, or indicate
deeper media problems. Have you then verified your disks using
smartctl? There already _is_ such a way to restore said "butchered"
RAID5: "mdadm --assemble --force". In any case, I suspect your RAID-
on-SATA problems are more due to the primitive nature of the SATA
error handling; much of the code does not do more than a basic bus
reset before failing the whole I/O.

> In the meantime, IMHO Linux RAID5 should be prominently flagged
> EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
> building up ill-will and undeserved distrust of Linux software
> quality generally.

It works great for me, and for a lot of other people too, including
production servers. In fact, I've had fewer issues with Linux RAID5
than with a lot of hardware RAIDs, especially when the HW raid
controller died and the company was no longer in business :-\. If
you can provide actual bug reports, we'd be happy to take a look at
your problems, but as it is, we can't help you.

Cheers,
Kyle Moffett

--
There is no way to make Linux robust with unreliable memory
subsystems, sorry. It would be like trying to make a human more
robust with an unreliable O2 supply. Memory just has to work.
-- Andi Kleen


2006-01-17 19:59:09

by David R

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Cynbe ru Taren wrote:
> The current Linux kernel RAID5 implementation is just
> too fragile to be used for most of the applications
> where it would be most useful.

I'm not sure I agree.

> What happens repeatedly, at least in my experience over
> a variety of boxes running a variety of 2.4 and 2.6
> Linux kernel releases, is that any transient I/O problem
> results in a critical mass of RAID5 drives being marked
> 'failed', at which point there is no longer any supported

What "transient" I/O problem would this be? I've had loads of issues with
flaky motherboard/PCI bus implementations that make RAID using add-in cards
(all 5 slots filled with other devices) a nightmare. The built-in controllers
seem to be more reliable.

> way of retrieving the data on the RAID5 device, even
> though the underlying drives are all fine, and the underlying
> data on those drives almost certainly intact.

This is no problem, just use something like

mdadm --assemble --force /dev/md5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

(Then of course do a fsck)

You can even do this with (nr.drives-1), then add in the last one to be
sync'ed up in the background.

> This has just happened to me for at least the sixth time,
> this time in a brand new RAID5 consisting of 8 200G hotswap
> SATA drives backing up the contents of about a dozen onsite
> and offsite boxes via dirvish, which took me the better part
> of December to get initialized and running, and now two weeks
> later I'm back to square one.

:-( .. maybe try the force assemble?

> I'm currently digging through the md kernel source code
> trying to work out some ad-hoc recovery method, but this
> level of flakiness just isn't acceptable on systems where
> reliable mass storage is a must -- and when else would
> one bother with RAID5?

It isn't flaky for me now I'm using a better quality motherboard, in fact it's
saved me through 3 near simultaneous failures of WD 250GB drives.

> We need RAID5 to be equally resilient in the face of
> real-world problems, people -- it isn't enough to
> just be able to function under ideal lab conditions!

I think it is. The automatics are paranoid (as they should be) when failures
are noticed. The array can be assembled manually though.

> A design bug is -still- a bug, and -still- needs to
> get fixed.

It's not a design bug - in my opinion.

> Something HAS to be done to make the RAID5 logic
> MUCH more conservative about destroying RAID5
> systems in response to a transient burst of I/O
> errors, before it can in good conscience be declared

If such things are common you should investigate the hardware.

> ready for production use -- or at MINIMUM to provide
> a SUPPORTED way of restoring a butchered RAID5 to
> last-known-good configuration or such once transient
> hardware issues have been resolved.

It is. See above.

> In the meantime, IMHO Linux RAID5 should be prominently flagged
> EXPERIMENTAL -- NONCRITICAL USE ONLY or some such, to avoid
> building up ill-will and undeserved distrust of Linux
> software quality generally.

I'd calm down if I were you.

Cheers
David



2006-01-17 20:00:50

by Kyle Moffett

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

BTW, when you get this message via the list (because I refuse to
touch challenge-response email systems), shut off your damn email
challenge-response system before posting to this list again. Such
systems are exceptionally poor etiquette on a public mailing list,
and more than likely most posters will flag your "challenge" as junk
mail (as I did, and will continue doing).

Cheers,
Kyle Moffett

--
Premature optimization is the root of all evil in programming
-- C.A.R. Hoare



2006-01-17 20:14:04

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Tue, 17 Jan 2006, Benjamin LaHaise wrote:

> On Tue, Jan 17, 2006 at 01:35:46PM -0600, Cynbe ru Taren wrote:
> > In principle, RAID5 should allow construction of a
> > disk-based store which is considerably MORE reliable
> > than any individual drive.
> >
> > In my experience, at least, using Linux RAID5 results
> > in a disk storage system which is considerably LESS
> > reliable than the underlying drives.
>
> That is a function of how RAID5 works. A properly configured RAID5 array
> will have a spare disk to take over in case one of the members fails, as
> otherwise you run a serious risk of not being able to recover any data.
>
> > What happens repeatedly, at least in my experience over
> > a variety of boxes running a variety of 2.4 and 2.6
> > Linux kernel releases, is that any transient I/O problem
> > results in a critical mass of RAID5 drives being marked
> > 'failed', at which point there is no longer any supported
> > way of retrieving the data on the RAID5 device, even
> > though the underlying drives are all fine, and the underlying
> > data on those drives almost certainly intact.
>
> Underlying disks should not be experiencing transient failures. Are you
> sure the problem isn't with the disk controller you're building your array
> on top of? At the very least any bug report requires that information to
> be able to provide even a basic analysis of what is going wrong.

Well, I had a similar experience lately with an Adaptec AAC-2410SA RAID
5 array. Due to the CPU overheating, the whole box was suddenly shut down
by the CPU damage protection mechanism. Since there is no battery backup
on this particular RAID controller, the sudden poweroff caused some very
localized inconsistency on one disk in the RAID. The configuration was
1x160 GB and 3x120 GB, with the 160 GB disk being split into a 120 GB part
within the RAID 5 and a 40 GB part as a separate volume. The inconsistency
happened in the 40 GB part of the 160 GB HDD (as reported by the Adaptec
BIOS media check). In particular, the problem was in /dev/sda2 (with
/dev/sda being the 40 GB volume, /dev/sda1 being an NTFS Windows system,
and /dev/sda2 being an ext3 Linux system).

Now, what is interesting is that Linux completely refused any possible
access to every byte within /dev/sda: not even dd(1) reading from any
position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
ended up with lots of the following messages:

sd 0:0:0:0: SCSI error: return code = 0x8000002
sda: Current: sense key: Hardware Error
Additional sense: Internal target failure
Info fld=0x0
end_request: I/O error, dev sda, sector <some sector number>

I've consulted this with Mark Salyzyn, because I thought it was a problem
of the AACRAID driver. But I was told, that there is nothing that AACRAID
can possibly do about it, and that it is a problem of the upper Linux
layers (block device layer?) that are strictly fault intolerant, and
though the problem was just an inconsistency of one particular localized
region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
single byte from the ENTIRE VOLUME (/dev/sda)!

And now for the best part: From Windows, I was able to access the ENTIRE
VOLUME without the slightest problem. Not only did Windows boot entirely
from the /dev/sda1, but using Total Commander's ext3 plugin I was also
able to access the ENTIRE /dev/sda2 and at least extract the most
important data and configurations, before I did the complete low-level
formatting of the drive, which fixed the inconsistency problem.

I call this "AN IRONY" to be forced to use Windows to extract information
from Linux partition, wouldn't you? ;)

(Besides, even GRUB (using BIOS) accessed the /dev/sda without
complications - as it was the bootable volume. Only Linux failed here a
100%. :()

Martin

2006-01-17 23:28:08

by Michael Loftis

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14



--On January 17, 2006 1:35:46 PM -0600 Cynbe ru Taren <[email protected]> wrote:

>
> Just in case the RAID5 maintainers aren't aware of it:
>
> The current Linux kernel RAID5 implementation is just
> too fragile to be used for most of the applications
> where it would be most useful.
>
> In principle, RAID5 should allow construction of a
> disk-based store which is considerably MORE reliable
> than any individual drive.

Absolutely not. The more spindles the more chances of a double failure.
Simple statistics will mean that unless you have mirrors the more drives
you add the more chance of two of them (really) failing at once and choking
the whole system.

That said, there very well could be (are?) cases where md needs to do a
better job of handling the world unravelling.

2006-01-17 23:39:41

by Michael Loftis

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14



--On January 17, 2006 9:13:49 PM +0100 Martin Drab
<[email protected]> wrote:

> I've consulted this with Mark Salyzyn, because I thought it was a problem
> of the AACRAID driver. But I was told, that there is nothing that AACRAID
> can possibly do about it, and that it is a problem of the upper Linux
> layers (block device layer?) that are strictly fault intolerant, and
> though the problem was just an inconsistency of one particular localized
> region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
> single byte from the ENTIRE VOLUME (/dev/sda)!

Actually...this is also related to how the controller reports the error.
If it reports a device level death/failure rather than a read error, Linux
is just taking that on face value. Yup, it should retry though. Other
possibilities exist including the volume going offline at the controller
level, having nothing to do with Linux, this is most often the problem I
see with RAIDs.

2006-01-18 00:13:13

by Kyle Moffett

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Jan 17, 2006, at 18:27, Michael Loftis wrote:
> --On January 17, 2006 1:35:46 PM -0600 Cynbe ru Taren
> <[email protected]> wrote:
>> Just in case the RAID5 maintainers aren't aware of it:
>>
>> The current Linux kernel RAID5 implementation is just too fragile
>> to be used for most of the applications where it would be most
>> useful.
>>
>> In principle, RAID5 should allow construction of a disk-based
>> store which is considerably MORE reliable than any individual drive.
>
> Absolutely not. The more spindles the more chances of a double
> failure. Simple statistics will mean that unless you have mirrors
> the more drives you add the more chance of two of them (really)
> failing at once and choking the whole system.

The most reliable RAID-5 you can build is a 3-drive system. For each
byte of data, you have a half-byte of parity, meaning that half the
data-space (not including the parity) can fail without data loss.
I'm ignoring the issue of rotating parity drive for simplicity, but
that only affects performance, not the algorithm. If you want any
kind of _real_ reliability and speed, you should buy a couple good
hardware RAID-5 units and mirror them in software.

Cheers,
Kyle Moffett

--
If you don't believe that a case based on [nothing] could potentially
drag on in court for _years_, then you have no business playing with
the legal system at all.
-- Rob Landley



2006-01-18 00:21:51

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Your understanding of statistics leaves something to be desired. As you
add disks the probability of a single failure grows linearly, but the
probability of double failure grows much more slowly. For example:

If 1 disk has a 1/1000 chance of failure, then
2 disks have a (1/1000)^2 chance of double failure, and
3 disks have a (1/1000)^2 * 3 chance of double failure
4 disks have a (1/1000)^2 * 6 chance of double failure

Thus the probability of double failure on this 4 drive array is ~167
times less than the odds of a single drive failing. As the probability of
a single drive failing becomes more remote, then the ratio of that
probability to the probability of double fault in the array grows
exponentially.

( I think I did that right in my head... will check on a real calculator
later )

This is why raid-5 was created: because the array has a much lower
probability of double failure, and thus, data loss, than a single drive.
Then of course, if you are really paranoid, you can go with raid-6 ;)
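
As a cross-check of the arithmetic above, here is a minimal sketch that
computes the chance of two or more failures among n disks exactly from the
binomial distribution. It assumes that failures are independent and that
every disk fails with the same probability p over the period of interest;
both are simplifying assumptions, and the function name is made up for
illustration. The leading term is C(n,2) * p^2, i.e. 1, 3 and 6 pairs for
2, 3 and 4 disks.

```python
from math import comb


def p_at_least_two_failures(n: int, p: float) -> float:
    """Probability that 2 or more of n independent disks fail."""
    p_none = (1 - p) ** n                   # no disk fails
    p_one = n * p * (1 - p) ** (n - 1)      # exactly one disk fails
    return 1 - p_none - p_one


p = 1 / 1000
for n in (2, 3, 4, 8):
    multi = p_at_least_two_failures(n, p)
    print(f"{n} disks: P(>=2 failures) = {multi:.3e}, "
          f"disk pairs = {comb(n, 2)}, "
          f"a single-disk failure is {p / multi:.0f}x more likely")
```

The ratio in the last column shrinks as disks are added, but for small
arrays it stays far above 1, which is the point being made here.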


Michael Loftis wrote:
> Absolutely not. The more spindles the more chances of a double failure.
> Simple statistics will mean that unless you have mirrors the more drives
> you add the more chance of two of them (really) failing at once and
> choking the whole system.
>
> That said, there very well could be (are?) cases where md needs to do a
> better job of handling the world unravelling.

2006-01-18 00:30:24

by Michael Loftis

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14



--On January 17, 2006 7:21:45 PM -0500 Phillip Susi <[email protected]>
wrote:

> Your understanding of statistics leaves something to be desired. As you
> add disks the probability of a single failure grows linearly, but the
> probability of double failure grows much more slowly. For example:

What about what I said was inaccurate? I never said that it increases
exponentially or anything like that, just that it does increase, which
you've proven. I was speaking in the case of a RAID-5 set, where the
minimum is 3 drives, so every additional drive increases the chance of a
double fault condition. Now if we're including mirrors and stripes/etc,
then that means we do have to look at the 2 spindle case, but the third
spindle and beyond keeps increasing. If you've a 1% failure rate, and you
have 100+ drives, chances are pretty good you're going to see a failure.
Yes it's a LOT more complicated than that.
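
The "100+ drives at 1%" remark can be put into numbers with a small
sketch. It assumes independent failures with the same per-drive
probability, which real hardware does not guarantee, and the function name
is hypothetical. Under that assumption the chance of seeing at least one
failure among 100 drives comes out at roughly 63%.

```python
def p_any_failure(n_drives: int, p_fail: float) -> float:
    """Chance that at least one of n independent drives fails."""
    return 1 - (1 - p_fail) ** n_drives


for n in (1, 3, 8, 100):
    print(f"{n:3d} drives: {p_any_failure(n, 0.01):.1%} "
          "chance of at least one failure")
```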

>
> If 1 disk has a 1/1000 chance of failure, then
> 2 disks have a (1/1000)^2 chance of double failure, and
> 3 disks have a (1/1000)^2 * 3 chance of double failure
> 4 disks have a (1/1000)^2 * 6 chance of double failure
>
> Thus the probability of double failure on this 4 drive array is ~167
> times less than the odds of a single drive failing. As the probability of a
> single drive failing becomes more remote, then the ratio of that
> probability to the probability of double fault in the array grows
> exponentially.
>
> ( I think I did that right in my head... will check on a real calculator
> later )
>
> This is why raid-5 was created: because the array has a much lower
> probability of double failure, and thus, data loss, than a single drive.
> Then of course, if you are really paranoid, you can go with raid-6 ;)
>
>
> Michael Loftis wrote:
>> Absolutely not. The more spindles the more chances of a double failure.
>> Simple statistics will mean that unless you have mirrors the more drives
>> you add the more chance of two of them (really) failing at once and
>> choking the whole system.
>>
>> That said, there very well could be (are?) cases where md needs to do a
>> better job of handling the world unravelling.
>



--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler

2006-01-18 02:10:59

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Michael Loftis wrote:

> What about what I said was inaccurate? I never said that it increases
> exponentially or anything like that, just that it does increase, which
> you've proven. I was speaking in the case of a RAID-5 set, where the
> minimum is 3 drives, so every additional drive increases the chance of
> a double fault condition. Now if we're including mirrors and
> stripes/etc, then that means we do have to look at the 2 spindle case,
> but the third spindle and beyond keeps increasing. If you've a 1%
> failure rate, and you have 100+ drives, chances are pretty good you're
> going to see a failure. Yes it's a LOT more complicated than that.
>

I understood you to be saying that a raid-5 was less reliable than a
single disk, which it is not. Maybe I did not read correctly. Yes, a 3
+ n disk raid-5 has a higher chance of failure than a 3 disk raid-5, but
only slightly so, and in any case, a 3 disk raid-5 is FAR more reliable
than a single drive, and only slightly less reliable than a two disk
raid-1 ( though you get 3x the space for only 50% higher cost, so 6x
cheaper cost per byte of storage ).




2006-01-18 02:30:55

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Tue, 17 Jan 2006, Michael Loftis wrote:
> --On January 17, 2006 9:13:49 PM +0100 Martin Drab <[email protected]> wrote:
>
> > I've consulted this with Mark Salyzyn, because I thought it was a problem
> > of the AACRAID driver. But I was told, that there is nothing that AACRAID
> > can possibly do about it, and that it is a problem of the upper Linux
> > layers (block device layer?) that are strictly fault intolerant, and
> > though the problem was just an inconsistency of one particular localized
> > region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
> > single byte from the ENTIRE VOLUME (/dev/sda)!
>
> Actually...this is also related to how the controller reports the error. If it
> reports a device level death/failure rather than a read error, Linux is just

Yes, but that wasn't the case here. I've witnessed that a while ago, but
this time, no. Just a read error, no device death nor going off-line.
Otherwise I wouldn't be that much surprised that Linux didn't even try.
The controller didn't do anything that would prevent the system from reading.
Windows used that and worked; Linux unfortunately didn't even try. That's
why I'm talking about it here.

> taking that on face value. Yup, it should retry though. Other possibilities
> exist including the volume going offline at the controller level, having
> nothing to do with Linux, this is most often the problem I see with RAIDs.

Martin

2006-01-18 03:01:39

by Michael Loftis

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14



--On January 17, 2006 9:10:56 PM -0500 Phillip Susi <[email protected]>
wrote:

> I understood you to be saying that a raid-5 was less reliable than a
> single disk, which it is not. Maybe I did not read correctly. Yes, a 3
> + n disk raid-5 has a higher chance of failure than a 3 disk raid-5, but
> only slightly so, and in any case, a 3 disk raid-5 is FAR more reliable
> than a single drive, and only slightly less reliable than a two disk
> raid-1 ( though you get 3x the space for only 50% higher cost, so 6x
> cheaper cost per byte of storage ).


Yup we're on the same page, we just didn't think we were. It happens :)
R-5 (in theory) could be less reliable than a mirror or possibly a single
drive, but it'd take a pretty obscene number of drives with excessively
large strip size.

2006-01-18 10:49:28

by Helge Hafting

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Cynbe ru Taren wrote:

>Just in case the RAID5 maintainers aren't aware of it:
>
>The current Linux kernel RAID5 implementation is just
>too fragile to be used for most of the applications
>where it would be most useful.
>
>In principle, RAID5 should allow construction of a
>disk-based store which is considerably MORE reliable
>than any individual drive.
>
>In my experience, at least, using Linux RAID5 results
>in a disk storage system which is considerably LESS
>reliable than the underlying drives.
>
>What happens repeatedly, at least in my experience over
>a variety of boxes running a variety of 2.4 and 2.6
>Linux kernel releases, is that any transient I/O problem
>results in a critical mass of RAID5 drives being marked
>'failed',
>
What kind of "transient io error" would that be?
That is not supposed to happen regularly. . .

You do replace failed drives immediately? Allowing
systems to run "for a while" in degraded mode is
surely a recipe for disaster. Degraded mode
has no safety at all, it is just raid-0 with a performance
overhead added in. :-/

Having hot spares is a nice way of replacing the failed
drive quickly.

>at which point there is no longer any supported
>way of retrieving the data on the RAID5 device, even
>though the underlying drives are all fine, and the underlying
>data on those drives almost certainly intact.
>
>
As others have shown - "mdadm" can reassemble your
broken raid - and it'll work well in those cases where
the underlying drives indeed are ok. It will fail
spectacularly if you have a real double fault though,
but then nothing short of raid-6 can save you.


Helge Hafting

2006-01-18 11:24:35

by Erik Mouw

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Tue, Jan 17, 2006 at 07:12:57PM -0500, Kyle Moffett wrote:
> The most reliable RAID-5 you can build is a 3-drive system. For each
> byte of data, you have a half-byte of parity, meaning that half the
> data-space (not including the parity) can fail without data loss.
> I'm ignoring the issue of rotating parity drive for simplicity, but
> that only affects performance, not the algorithm. If you want any
> kind of _real_ reliability and speed, you should buy a couple good
> hardware RAID-5 units and mirror them in software.

Actually, the most reliable RAID-5 is a 2 drive system, where you have
a full byte of redundancy for each byte of data. Two drive RAID-5
systems are usually called RAID-1, but if you write out the formulas it
becomes clear that RAID-1 is just a special case of RAID-5.
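
Writing out the formulas can be sketched in a few lines. This is an
illustration of XOR parity on a single stripe only, not the md driver's
code: the helper names and block contents are invented, and parity
rotation is ignored. With two data blocks the parity block lets either one
be rebuilt; with a single data block the "parity" is just a copy of the
data, which is a mirror.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity(data_blocks):
    """RAID-5 style parity for one stripe: XOR of all data blocks."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks):
    """Recover the single missing block from all survivors (data + parity)."""
    return xor_blocks(surviving_blocks)


# 3-drive stripe: two data blocks plus one parity block.
d0, d1 = b"RAID", b"five"
p = parity([d0, d1])
assert rebuild([d1, p]) == d0   # drive holding d0 lost
assert rebuild([d0, p]) == d1   # drive holding d1 lost

# 2-drive stripe: one data block, so the parity equals the data itself.
assert parity([d0]) == d0
```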


Erik

--
+-- Erik Mouw -- http://www.harddisk-recovery.com -- +31 70 370 12 90 --
| Lab address: Delftechpark 26, 2628 XH, Delft, The Netherlands
| Data lost? Stay calm and contact Harddisk-recovery.com

2006-01-18 16:15:28

by Mark Lord

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Helge Hafting wrote:
>
> As others have shown - "mdadm" can reassemble your
> broken raid - and it'll work well in those cases where
> the underlying drives indeed are ok. It will fail
> spectacularly if you have a real double fault though,
> but then nothing short of raid-6 can save you.

No, actually there are several things we *could* do,
if only the will-to-do-so existed.

For example, one bad sector on a drive doesn't mean that
the entire drive has failed. It just means that one 512-byte
chunk of the drive has failed.

We could rewrite the failed area of the drive, allowing the
onboard firmware to repair the fault internally, likely by
remapping physical sectors. This is nothing unusual, as all
drives these days ship from the factory with many bad sectors
that have already been remapped to "fix" them. One or two
more in the field is no reason to toss a perfectly good drive.

Mind you, if it's more than just one or two bad sectors,
then the drive really should get tossed regardless. And the case
can be made that even for the first one or two bad sectors,
a prudent sysadmin would schedule replacement of the whole drive.

But until the drive is replaced, it could be repaired and kept
in use as added redundancy, helping us cope far more reliably
with multiple failures.

Sure, nobody's demanding double-fault protection -- where the SAME
sector of data fails on multiple drives, and nothing can be done
to recover it then. But we really could/should handle the case
of two *different* unrelated single-faults, at least when those
are just soft failures of unrelated sectors.

Just need somebody motivated to actually fix it,
rather than bitch about how impossible/stupid it would be.

Cheers

2006-01-18 16:47:17

by Krzysztof Halasa

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Phillip Susi writes:

> but only slightly so, and in any case, a 3 disk raid-5 is FAR more
> reliable than a single drive, and only slightly less reliable than a
> two disk raid-1 ( though you get 3x the space for only 50% higher
> cost, so 6x cheaper cost per byte of storage ).

Actually with 3-disk RAID5 you get 2x the space of RAID1 for 1.5 x cost,
so the factor is 1.5/2 = 0.75, i.e., you save only 25% on RAID5 or RAID1
is 33% more expensive.
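
The arithmetic above can be checked with a short sketch. It assumes identical drives of equal price and capacity, with a two-disk RAID-1 giving one drive of usable space and a three-disk RAID-5 giving two; the function name is made up for illustration.

```python
from fractions import Fraction


def cost_per_usable_drive(total_drives, usable_drives):
    """Drives paid for per drive's worth of usable space (assumes equal drives)."""
    return Fraction(total_drives, usable_drives)


raid1 = cost_per_usable_drive(2, 1)  # two-disk mirror: 2 drives bought, 1 usable
raid5 = cost_per_usable_drive(3, 2)  # three-disk RAID-5: 3 drives bought, 2 usable

ratio = raid5 / raid1        # RAID-5 cost per byte relative to RAID-1
saving = 1 - ratio           # fraction saved by choosing RAID-5
premium = raid1 / raid5 - 1  # fraction extra paid by choosing RAID-1

print(ratio, saving, premium)
```

The same ratio seen from the two sides gives the two figures quoted: 25% saved on RAID-5, or 33% extra for RAID-1.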
--
Krzysztof Halasa

2006-01-18 16:49:53

by Krzysztof Halasa

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Michael Loftis writes:

> Yup we're on the same page, we just didn't think we were. It happens
> :) R-5 (in theory) could be less reliable than a mirror

Statistically, RAID-5 with 3 or more disks is always less reliable than
a mirror. Strip size doesn't matter.

> or possibly a
> single drive,

With a lot of drives.
--
Krzysztof Halasa

2006-01-18 17:33:09

by Alan

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Mer, 2006-01-18 at 11:15 -0500, Mark Lord wrote:
> For example, one bad sector on a drive doesn't mean that
> the entire drive has failed. It just means that one 512-byte
> chunk of the drive has failed.

You don't actually know what failed. Truth be told, it is probably a lot
more than a 512-byte speck of disk nowadays.

> We could rewrite the failed area of the drive, allowing the
> onboard firmware to repair the fault internally, likely by

We should do so definitely but you probably want to rewrite the stripe
as a whole so that you fix up the other sectors in the physical sector
that went poof.

> Just need somebody motivated to actually fix it,
> rather than bitch about how impossible/stupid it would be.

Send patches ;)

PS: How is the delkin_cb driver - does it know how to do modes and stuff
yet ? Just wondering if I should pull a version for libata whacking

Alan

2006-01-18 23:37:16

by NeilBrown

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Wednesday January 18, Mark Lord wrote:
> Helge Hafting wrote:
> >
> > As other have showed - "mdadm" can reassemble your
> > broken raid - and it'll work well in those cases where
> > the underlying drives indeed are ok. It will fail
> > spectacularly if you have a real double fault though,
> > but then nothing short of raid-6 can save you.
>
> No, actually there are several things we *could* do,
> if only the will-to-do-so existed.

You not only need the will. You also need the ability and the time,
and the three must be combined into the one person...

>
> For example, one bad sector on a drive doesn't mean that
> the entire drive has failed. It just means that one 512-byte
> chunk of the drive has failed.
>
> We could rewrite the failed area of the drive, allowing the
> onboard firmware to repair the fault internally, likely by
> remapping physical sectors. This is nothing unusual, as all
> drives these days ship from the factory with many bad sectors
> that have already been remapped to "fix" them. One or two
> more in the field is no reason to toss a perfectly good drive.

Very recent 2.6 kernels do exactly this. They don't drop a drive on a
read error, only on a write error. On a read error they generate the
data from elsewhere and schedule a write, then a re-read.
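
As a rough illustration of "generate the data from elsewhere": in RAID-5 the parity block of a stripe is the XOR of its data blocks, so any one unreadable block equals the XOR of all the surviving blocks in that stripe. The sketch below shows only that arithmetic; it is not the md driver's code, and the function names and sample blocks are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(surviving_blocks):
    """Rebuild the one missing block of a stripe from all surviving blocks
    (the remaining data blocks plus the parity block)."""
    return xor_blocks(surviving_blocks)


# A stripe with three data blocks and one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Pretend the read of data[1] failed: rebuild it from the other blocks.
rebuilt = reconstruct_missing([data[0], data[2], parity])
print(rebuilt == data[1])
```

The rebuilt block is what would then be written back to the drive that reported the read error, followed by the re-read.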

NeilBrown

2006-01-19 00:13:30

by NeilBrown

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Tuesday January 17, Cynbe ru Taren wrote:
>
> Just in case the RAID5 maintainers aren't aware of it:

Others have replied, but just so that you know that the "RAID5
maintainer" is listening, I will too.

You refer to the "current" implementation and then talk about "a variety
of 2.4 and 2.6" releases... Not all of them are 'current'.

The 'current' raid5 (in 2.6.15) is much more resilient against read
errors than earlier versions.

If you are having transient errors that really are very transient,
then the device driver should be retrying more I expect.

If you are having random connectivity error causing transient errors,
then your hardware is too unreliable for raid5 to cope with.

As has been said, there *is* a supported way to regain a raid5 after
connectivity problems - mdadm --assemble --force.

The best way to help with the improvement of md/raid5 is to give
precise details of situations where md/raid5 doesn't live up to your
expectations. Without precise details it is hard to make progress.

Thank you for your interest.

NeilBrown

2006-01-19 15:54:05

by Mark Lord

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Neil Brown wrote:
>
> Very recent 2.6 kernels do exactly this. They don't drop a drive on a
> read error, only on a write error. On a read error they generate the
> data from elsewhere and schedule a write, then a re-read.

Well done, then. Further to this:

Pardon me for not looking at the specifics of the code here,
but experience shows that rewriting just the single sector
is often not enough to repair an error. The drive often just
continues to fail when only the bad sector is rewritten by itself.

Dumb drives, or what, I don't know, but they seem to respond
better when the entire physical track is rewritten.

Since we rarely know what a physical track is these days,
this often boils down to simply rewriting a 64KB chunk
centered on the failed sector. So far, this strategy has
always worked for me.
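
A minimal sketch of how such a window could be computed, assuming 512-byte sectors and a 64KB window clamped to the ends of the device. The function name and defaults are illustrative only; this is not taken from any driver.

```python
SECTOR_SIZE = 512          # bytes per sector (assumed)
WINDOW_BYTES = 64 * 1024   # size of the region to rewrite


def rewrite_window(failed_sector, total_sectors,
                   window_bytes=WINDOW_BYTES, sector_size=SECTOR_SIZE):
    """Return (start, end) sector numbers, end exclusive, of a window
    centered on failed_sector and clamped to the device boundaries."""
    window_sectors = window_bytes // sector_size
    start = max(0, failed_sector - window_sectors // 2)
    end = min(total_sectors, start + window_sectors)
    # If the window was cut off at the end of the device, slide it back
    # so it still covers a full window where the device is large enough.
    start = max(0, end - window_sectors)
    return start, end


print(rewrite_window(1000, 1_000_000))
print(rewrite_window(10, 1_000_000))
```

Every sector in the returned range would be read (or reconstructed) and written back, not just the one that failed.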

Cheers

2006-01-19 15:59:20

by Mark Lord

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Alan Cox wrote:
> PS: How is the delkin_cb driver - does it know how to do modes and stuff
> yet ? Just wondering if I should pull a version for libata whacking

I whacked at it for libata a while back, and then shelved it while awaiting
PIO to appear in a released libata version. Now that we've got PIO, I ought
to add a couple of lines to bind in the right functions and release it.

No knowledge of "modes" and stuff -- but the basic register settings I
reverse engineered seem to work adequately on the cards I have here.

But the card is a total slug unless the host does 32-bit PIO to/from it.
Do we have that capability in libata yet?

My last hack at it (without the necessary libata PIO bindings) is attached,
but this is several revisions behind libata now, and probably needs some
updates to compile. Suggestions welcomed.

Cheers


Attachments:
pata_delkin_cb.c (7.02 kB)

2006-01-19 16:25:33

by Alan

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Iau, 2006-01-19 at 10:59 -0500, Mark Lord wrote:
> But the card is a total slug unless the host does 32-bit PIO to/from it.
> Do we have that capability in libata yet?

Very very easy to sort out. Just need a ->pio_xfer method set. Would
then eliminate some of the core driver flags and let us do vlb sync for
legacy hw

2006-02-02 20:32:57

by Bill Davidsen

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Martin Drab wrote:

> Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID
> 5 array. Due to the CPU overheating the whole box was suddenly shot down
> by the CPU damage protection mechanism. While there is no battery backup
> on this particular RAID controller, the sudden poweroff caused some very
> localized inconsistency of one disk in the RAID. The configuration was
> 1x160 GB and 3x120GB, with the 160 GB being split into 120 GB part within
> the RAID 5 and a 40 GB part as a separate volume. The inconsistency
> happend in the 40 GB part of the 160 GB HDD (as reported by the Adaptec
> BIOS media check). In particular the problem was in the /dev/sda2 (with
> /dev/sda being the 40 GB Volume, /dev/sda1 being an NTFS Windows system,
> and /dev/sda2 being ext3 Linux system).
>
> Now, what is interesting, is that Linux completely refused any possible
> access to every byte within /dev/sda, not even dd(1) reading from any
> position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
> ended up with lots of following messages:
>
> sd 0:0:0:0: SCSI error: return code = 0x8000002
> sda: Current: sense key: Hardware Error
> Additional sense: Internal target failure
> Info fld=0x0
> end_request: I/O error, dev sda, sector <some sector number>

But /dev/sda is not a Linux filesystem, running fsck on it makes no
sense. You wanted to run on /dev/sda2.
>
> I've consulted this with Mark Salyzyn, because I thought it was a problem
> of the AACRAID driver. But I was told, that there is nothing that AACRAID
> can possibly do about it, and that it is a problem of the upper Linux
> layers (block device layer?) that are strictly fault intollerant, and
> thouth the problem was just an inconsistency of one particular localized
> region inside /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a
> single byte from the ENTIRE VOLUME (/dev/sda)!

The obvious test of this "it's not us" statement is to connect that one
drive to another type controller and see if the upper level code
recovers. I'm assuming that "sda" is a real drive and not some
pseudo-drive which exists only in the firmware of the RAID controller.
That message is curious, did you cat /proc/scsi/scsi to see what the
system thought was there? Use the infamous "cdrecord -scanbus" command?

>
> And now for the best part: From Windows, I was able to access the ENTIRE
> VOLUME without the slightest problem. Not only did Windows boot entirely
> from the /dev/sda1, but using Total Commander's ext3 plugin I was also
> able to access the ENTIRE /dev/sda2 and at least extract the most
> important data and configurations, before I did the complete low-level
> formatting of the drive, which fixed the inconsistency problem.
>
> I call this "AN IRONY" to be forced to use Windows to extract information
> from Linux partition, wouldn't you? ;)
>
> (Besides, even GRUB (using BIOS) accessed the /dev/sda without
> complications - as it was the bootable volume. Only Linux failed here a
> 100%. :()

From the way you say sda when you presumably mean sda1 or sda2 it's not
clear if you don't understand the difference between drive and partition
access or are just so pissed off you are not taking the time to state
the distinction clearly.

There was a problem with recovery from errors in RAID-5 which is
addressed by recent changes to fail a sector, try rewriting it, etc. I
would have to read linux-raid archives to explain it, so I'll stop with
the overview. I don't think that's the issue here, you're using a RAID
controller rather than the software RAID, so it should not apply.

I assume that the problem is gone, so we can't do any more analysis
after the fact.

--
-bill davidsen
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2006-02-02 22:08:53

by Bill Davidsen

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Phillip Susi wrote:
> Your understanding of statistics leaves something to be desired. As you
> add disks the probability of a single failure grows linearly, but the
> probability of double failure grows much more slowly. For example:
>
> If 1 disk has a 1/1000 chance of failure, then
> 2 disks have a (1/1000)^2 chance of double failure, and
> 3 disks have a (1/1000)^2 * 3 chance of double failure
> 4 disks have a (1/1000)^2 * 7 chance of double failure

After the first drive fails you have no redundancy, the chance of an
additional failure is linear to the number of remaining drives.

Assume:
p - probability of a drive failing in unit time
n - number of drives
F - probability of double failure

The chance of a single drive failure is n*p. After that you have a new
"independent trial" for the failure of any one of the n-1 remaining drives,
so the chance of a double drive failure is actually:
F = (n*p) * (n-1)*p

But wait, there's more:
p - chance of a drive failing in unit time
n - number of drives
R - the time to rebuild to a hot spare in the same units as p
F - probability of double failure

So:

F = n*p * (n-1)*(R * p)
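
The formula is easy to evaluate directly. The sketch below simply codes F = n*p * (n-1)*(R*p) as written above, using exact fractions; the function name is invented, and the value R = 1/10 is an arbitrary example rather than a measured rebuild time.

```python
from fractions import Fraction


def double_failure_probability(n, p, rebuild_time=1):
    """F = n*p * (n-1)*(R*p): first failure among n drives, then a second
    failure among the remaining n-1 drives during the rebuild window R
    (R expressed in the same unit of time as p)."""
    return (n * p) * (n - 1) * (rebuild_time * p)


p = Fraction(1, 1000)  # per-drive failure probability, as in the quoted post

# Four drives, rebuild window equal to one full unit of time.
full_window = double_failure_probability(4, p)

# Four drives, rebuild finishing in a tenth of that time (assumed value).
short_window = double_failure_probability(4, p, Fraction(1, 10))

print(full_window, short_window)
```

Shortening the rebuild window scales F down in direct proportion.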

If you rebuild a track at a time, each track takes the time to read the
slowest drive plus the time to write the spare. If the array remains in
use, the load increases those times.

And the ugly part is that p is changing all the time, there's infant
mortality on new drives, fairly constant electronic probability and
increasing probability of mechanical failure over time. If all of your
drives are the same age they are less reliable than mixed age drives.

>
> Thus the probability of double failure on this 4 drive array is ~142
> times less than the odds of a single drive failing. As the probability of
> a single drive failing becomes more remote, then the ratio of that
> probability to the probability of double fault in the array grows
> exponentially.
>
> ( I think I did that right in my head... will check on a real calculator
> later )
>
> This is why raid-5 was created: because the array has a much lower
> probability of double failure, and thus, data loss, than a single drive.
> Then of course, if you are really paranoid, you can go with raid-6 ;)

If you're paranoid you mirror over two RAID-5 arrays. The mirrors are on
independent controllers. RAID-10.

>
>
> Michael Loftis wrote:
>
>> Absolutely not. The more spindles the more chances of a double
>> failure. Simple statistics will mean that unless you have mirrors the
>> more drives you add the more chance of two of them (really) failing at
>> once and choking the whole system.
>>
>> That said, there very well could be (are?) cases where md needs to do
>> a better job of handling the world unravelling.
>> -
A small graph of the effect of the rebuild time on RAID-5 is attached. It
assumes a probability of failure of 1/1000, per the original post, and
plots the probability of double failure for various rebuild times.

--
-bill davidsen
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me


Attachments:
2dfail-2.png (3.03 kB)

2006-02-03 00:58:04

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Thu, 2 Feb 2006, Bill Davidsen wrote:

Just to state it clearly up front: I've already solved the problem
by low-level formatting the entire disk that the inconsistent array in
question was part of.

So now everything is back to normal. Unfortunately, that means I cannot
do any more tests on the device in the non-working state.

I mention this problem here just to let you know that Linux behaves in
this problematic (and IMO flawed) way in such circumstances, and that
perhaps you may want to consider such situations when doing further
improvements and development in the design of the block device layer (or
wherever the problem may come from).

I also hope you understand that I won't try to create that state again
deliberately, since my main system is running on that array and I won't
risk losing more data because of this.

However, maybe someone at Adaptec or somewhere else has a similar system
at their disposal on which they could experiment on demand without any
serious risk of losing anything important.

What I can say is that it is an Adaptec 2410SA with 8205 firmware and
without a battery backup system (which is probably the crucial thing).
The inconsistency was caused by the motherboard's CPU overheat protection
shutting the system down: I had started the system and booted Linux from
the array in question (which consists of just one part of one disk), but
had forgotten to turn on the water cooling of the CPU and northbridge. So
after about 3 minutes the system automatically shut down, and Linux was
probably doing some writing at that very moment, which could not complete
fully (most probably due to the lack of the battery backup system on the
RAID controller). So my guess is that this may be artificially reproduced
by suddenly switching off the system's power source while Linux is doing
some writing to the array.

My arrays in particular are:

HDD1 (160 GB): 120 GB Array 1, 40 GB Array 2
HDD2 (120 GB): 120 GB Array 1
HDD3 (120 GB): 120 GB Array 1
HDD4 (120 GB): 120 GB Array 1

Where Array 1 is a RAID 5 array /dev/sdb (labeled as "Data 1"), which
contains just one 330 GB partition /dev/sdb1, and Array 2 is a bootable
(in Adaptec BIOS setup so called) Volume array (i.e. no RAID) /dev/sda
(labeled as "SYSTEM"), which contains /dev/sda1 (NTFS Windows), /dev/sda2
(ext3 Linux), /dev/sda3 (Linux swap). Problem was accessing the whole Array 2.
Array 1 from Linux worked well.

Then, when I tried it, the array checking function within the BIOS of the
Adaptec controller found an inconsistency somewhere in the middle of
/dev/sda, so somewhere within /dev/sda2 in particular.
So I low-level formatted the entire HDD1, resynced Array 1 (which is
RAID 5, so no problem) and reinstalled both systems in Array 2, and now it
is all back to normal again.

> Martin Drab wrote:
>
> > Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 5
> > array. Due to the CPU overheating the whole box was suddenly shot down by
> > the CPU damage protection mechanism. While there is no battery backup on
> > this particular RAID controller, the sudden poweroff caused some very
> > localized inconsistency of one disk in the RAID. The configuration was 1x160
> > GB and 3x120GB, with the 160 GB being split into 120 GB part within the RAID
> > 5 and a 40 GB part as a separate volume. The inconsistency happend in the 40
> > GB part of the 160 GB HDD (as reported by the Adaptec BIOS media check). In
> > particular the problem was in the /dev/sda2 (with /dev/sda being the 40 GB
> > Volume, /dev/sda1 being an NTFS Windows system, and /dev/sda2 being ext3
> > Linux system).
> >
> > Now, what is interesting, is that Linux completely refused any possible
> > access to every byte within /dev/sda, not even dd(1) reading from any
> > position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
                                        ^^^^^^^^^^^^^^
> > ended up with lots of following messages:
> >
> > sd 0:0:0:0: SCSI error: return code = 0x8000002
> > sda: Current: sense key: Hardware Error
> > Additional sense: Internal target failure
> > Info fld=0x0
> > end_request: I/O error, dev sda, sector <some sector number>
>
> But /dev/sda is not a Linux filesystem, running fsck on it makes no sense. You
> wanted to run on /dev/sda2.

But I was talking about fdisk(1). This wasn't a problematic behaviour of a
filesystem, but of the entire block device.

> > I've consulted this with Mark Salyzyn, because I thought it was a problem of
> > the AACRAID driver. But I was told, that there is nothing that AACRAID can
> > possibly do about it, and that it is a problem of the upper Linux layers
> > (block device layer?) that are strictly fault intollerant, and thouth the
> > problem was just an inconsistency of one particular localized region inside
> > /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a single byte from
> > the ENTIRE VOLUME (/dev/sda)!
>
> The obvious test of this "it's not us" statement is to connect that one drive
> to another type controller and see if the upper level code recovers. I'm
> assuming that "sda" is a real drive and not some pseudo-drive which exists
> only in the firmware of the RAID controller.

/dev/sda is a 40 GB RAID array consisting of just one 40 GB part of one
160 GB drive. But it is in fact a virtual device supplied by the
controller, i.e. this 40 GB part of that disc behaves as an entire
hard disk (with its own MBR etc.). And it is at the end of the drive, so
it may be a little tricky to find the exact position of the partitions
there, but it may be possible.

> That message is curious, did you
> cat /proc/scsi/scsi to see what the system thought was there? Use the infamous
> "cdrecord -scanbus" command?

----------
$ cdrecord -scanbus
Cdrecord-Clone 2.01.01a03-dvd (i686-pc-linux-gnu) Copyright (C) 1995-2005 Jörg Schilling
Note: This version is an unofficial (modified) version with DVD support
Note: and therefore may have bugs that are not present in the original.
Note: Please send bug reports or support requests to warly at mandriva.com.
Note: The author of cdrecord should not be bothered with problems in this
version.
Linux sg driver version: 3.5.33
Using libscg version 'schily-0.8'.
scsibus0:
0,0,0 0) 'Adaptec ' 'SYSTEM ' 'V1.0' Disk
0,1,0 1) 'Adaptec ' 'Data 1 ' 'V1.0' Disk
0,2,0 2) *
0,3,0 3) *
0,4,0 4) *
0,5,0 5) *
0,6,0 6) *
0,7,0 7) *

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: Adaptec Model: SYSTEM Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: Adaptec Model: Data 1 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
-----------

The 0,0,0 is /dev/sda. This output is from now, after low-level
formatting of the previously inconsistent disc, but the indications back
then were just the same. That is, every indication behaved as usual. Both
arrays were properly identified. But when I accessed the inconsistent
one, i.e. /dev/sda, in any way (even just raw bytes, this has nothing to do
with any filesystems) the error messages mentioned above appeared. I'm not
sure what exactly was generating them, but I've CC'd Mark Salyzyn, maybe
he can explain more about it.

> > And now for the best part: From Windows, I was able to access the ENTIRE
> > VOLUME without the slightest problem. Not only did Windows boot entirely
> > from the /dev/sda1, but using Total Commander's ext3 plugin I was also able
> > to access the ENTIRE /dev/sda2 and at least extract the most important data
> > and configurations, before I did the complete low-level formatting of the
> > drive, which fixed the inconsistency problem.
> >
> > I call this "AN IRONY" to be forced to use Windows to extract information
> > from Linux partition, wouldn't you? ;)
> >
> > (Besides, even GRUB (using BIOS) accessed the /dev/sda without complications
> > - as it was the bootable volume. Only Linux failed here a 100%. :()
>
> From the way you say sda when you presumably mean sda1 or sda2 it's not clear
> if you don't understand the difference between drive and partition access or
> are just so pissed off you are not taking the time to state the distinction
> clearly.

No, I understand the differences very clearly. But maybe I was just
unclear in my expressions (for which I apologize). What I mean is that
the problem was with the entire RAID array /dev/sda. So whenever ANY
access to ANY part of /dev/sda was attempted, which of course also includes
accesses to all of /dev/sda1, /dev/sda2, and /dev/sda3, the error messages
appeared and no access was performed. That includes even accesses like
"dd if=/dev/sda of=/dev/null bs=512 count=1" and any other possible
accesses. So the problem was with the entire device /dev/sda.
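
For anyone wanting to script the same kind of raw probe, here is a rough Python equivalent of that dd command. It is only a sketch: the function name is made up, and the device path in the usage comment is an example. It reads one 512-byte sector from the start of a device and reports the errno on failure, which is where an I/O error like the one above would surface.

```python
import os


def read_first_sector(path, sector_size=512):
    """Try to read one sector from the start of path.
    Returns (data, None) on success or (None, errno) if the open or read fails."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        return None, exc.errno
    try:
        return os.read(fd, sector_size), None
    except OSError as exc:
        return None, exc.errno
    finally:
        os.close(fd)


# Example use (needs read permission on the device):
#   data, err = read_first_sector("/dev/sda")
#   print("ok" if err is None else os.strerror(err))
```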

> There was a problem with recovery from errors in RAID-5 which is addressed by
> recent changes to fail a sector, try rewriting it, etc.

Maybe this was again my bad explanation, but this wasn't a problem of a
RAID 5 array, and much less of a software array. Adaptec 2410SA is a
4-channel HW SATA-I RAID controller.

> I would have to read linux-raid archives to explain it, so I'll stop
> with the overview. I don't think that's the issue here, you're using a
> RAID controller rather than the software RAID, so it should not apply.

Yes, exactly. And again, I've solved it by low-level formatting.

> I assume that the problem is gone, so we can't do any more analysis after the
> fact.

Unfortunately, yes. But I've described above how it happened, so maybe
someone at Adaptec would be able to reproduce it, Mark?

Martin

2006-02-03 01:13:32

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Martin Drab wrote:

> On Thu, 2 Feb 2006, Bill Davidsen wrote:
>
> Just to state clearly in the first place. I've allready solved the problem
> by low-level formatting the entire disk that this inconsistent array in
> question was part of.
>
> So now everything is back to normal. So unforunatelly I would not be able
> to do any more tests on the device in the non-working state.
>
> I mentioned this problem here now just to let you konw that there is such
> a problematic Linux behviour (and IMO flawed) in such circumstances, and
> that perhaps it may let you think of such situations when doing further
> improvements and development in the design of the block device layer (or
> wherever the problem may possibly come from).

Or perhaps rather of the SCSI layer than of the block layer?

> And I also hope you would understand, that I wouldn't try to create that
> state again deliberatelly, since my main system is running on that array
> and I wouldn't risk loosing some more data because of this.
>
> However maybe someone perhaps in Adaptec or smewhere else may have
> some simillar system at the disposal on which he could allow to experiment
> on demand without any serious risk of loosing anything important.
>
> So what I may say is that it is an Adaptec 2410SA with 8205 firmware and
> without a battery backup system (which is probably the crutial thing).
> And the inconsistency was caused by a MB protection of CPU overheat
> shutdown, because I've started the system and booted Linux from the array
> in question (which consisted by just one part of one disk), while I've
> forgotten to turn on the water cooling of the CPU and northbridge. So
> after about 3 minutes the system automatically shut down and Linux was
> probably doing some writing in that very moment, which wasn't able to
> complete fully (most probably due to the lack of the battery backup system
> on the RAID controller). So my guess is that this may be artificially
> reproduced when you suddenly switch off a power source of the system while
> Linux is doing some writing to the array.
>
> My arrays in particular are:
>
> HDD1 (160 GB): 120 GB Array 1, 40 GB Array 2
> HDD2 (120 GB): 120 GB Array 1
> HDD3 (120 GB): 120 GB Array 1
> HDD4 (120 GB): 120 GB Array 1
>
> Where Array 1 is a RAID 5 array /dev/sdb (labeled as "Data 1"), which
> contains just one 330 GB partition /dev/sdb1, and Array 2 is a bootable
> (in Adaptec BIOS setup so called) Volume array (i.e. no RAID) /dev/sda
> (labeled as "SYSTEM"), which contains /dev/sda1 (NTFS Windows), /dev/sda2
> (ext3 Linux), /dev/sda3 (Linux swap). Problem was accessing the whole Array 2.
> Array 1 from Linux worked well.
>
> Then, when I tried, the array checking function within the BIOS of the
> Adaptec controller found an inconsistency on the position somewhere in the
> middle of the /dev/sda, so somewhere within the /dev/sda2 in particular.
> So I low-level formatted the entire HDD1, resynced the Array 1 (which is
> RAID 5, so no problem) and reinstalled both systems in Array 2, and now it
> is all back to normal again.
>
> > Martin Drab wrote:
> >
> > > Well, I had a similar experience lately with the Adaptec AAC-2410SA RAID 5
> > > array. Due to the CPU overheating the whole box was suddenly shot down by
> > > the CPU damage protection mechanism. While there is no battery backup on
> > > this particular RAID controller, the sudden poweroff caused some very
> > > localized inconsistency of one disk in the RAID. The configuration was 1x160
> > > GB and 3x120GB, with the 160 GB being split into 120 GB part within the RAID
> > > 5 and a 40 GB part as a separate volume. The inconsistency happend in the 40
> > > GB part of the 160 GB HDD (as reported by the Adaptec BIOS media check). In
> > > particular the problem was in the /dev/sda2 (with /dev/sda being the 40 GB
> > > Volume, /dev/sda1 being an NTFS Windows system, and /dev/sda2 being ext3
> > > Linux system).
> > >
> > > Now, what is interesting, is that Linux completely refused any possible
> > > access to every byte within /dev/sda, not even dd(1) reading from any
> > > position within /dev/sda, not even "fdisk /dev/sda", nothing. Everything
>                                         ^^^^^^^^^^^^^^
> > > ended up with lots of following messages:
> > >
> > > sd 0:0:0:0: SCSI error: return code = 0x8000002
> > > sda: Current: sense key: Hardware Error
> > > Additional sense: Internal target failure
> > > Info fld=0x0
> > > end_request: I/O error, dev sda, sector <some sector number>
> >
> > But /dev/sda is not a Linux filesystem, running fsck on it makes no sense. You
> > wanted to run on /dev/sda2.
>
> But I was talking about fdisk(1). This wasn't a problematic behaviour of a
> filesystem, but of the entire block device.
>
> > > I've consulted this with Mark Salyzyn, because I thought it was a problem of
> > > the AACRAID driver. But I was told, that there is nothing that AACRAID can
> > > possibly do about it, and that it is a problem of the upper Linux layers
> > > (block device layer?) that are strictly fault intollerant, and thouth the
> > > problem was just an inconsistency of one particular localized region inside
> > > /dev/sda2, Linux was COMPLETELY UNABLE (!!!!!) to read a single byte from
> > > the ENTIRE VOLUME (/dev/sda)!
> >
> > The obvious test of this "it's not us" statement is to connect that one drive
> > to another type controller and see if the upper level code recovers. I'm
> > assuming that "sda" is a real drive and not some pseudo-drive which exists
> > only in the firmware of the RAID controller.
>
> /dev/sda is a 40 GB RAID array consisting of just one 40 GB part of one
> 160 GB drive. But it is in fact a virtual device supplied by the
> controller. I.e. this 40 GB part of that disc behaves as an entire
> harddisk (with it's own MBR etc.). And it is at the end of the drive, so
> it may be a little tricky to find the exact position of the partitions
> there, but it may be possible.
>
> > That message is curious, did you
> > cat /proc/scsi/scsi to see what the system thought was there? Use the infamous
> > "cdrecord -scanbus" command?
>
> ----------
> $ cdrecord -scanbus
> Cdrecord-Clone 2.01.01a03-dvd (i686-pc-linux-gnu) Copyright (C) 1995-2005 Jörg Schilling
> Note: This version is an unofficial (modified) version with DVD support
> Note: and therefore may have bugs that are not present in the original.
> Note: Please send bug reports or support requests to warly at mandriva.com.
> Note: The author of cdrecord should not be bothered with problems in this
> version.
> Linux sg driver version: 3.5.33
> Using libscg version 'schily-0.8'.
> scsibus0:
> 0,0,0 0) 'Adaptec ' 'SYSTEM ' 'V1.0' Disk
> 0,1,0 1) 'Adaptec ' 'Data 1 ' 'V1.0' Disk
> 0,2,0 2) *
> 0,3,0 3) *
> 0,4,0 4) *
> 0,5,0 5) *
> 0,6,0 6) *
> 0,7,0 7) *
>
> $ cat /proc/scsi/scsi
> Attached devices:
> Host: scsi0 Channel: 00 Id: 00 Lun: 00
> Vendor: Adaptec Model: SYSTEM Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
> Host: scsi0 Channel: 00 Id: 01 Lun: 00
> Vendor: Adaptec Model: Data 1 Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
> -----------
>
> The 0,0,0 is the /dev/sda. And even though this is now, after low-level
> formatting of the previously inconsistent disc, the indications back then
> were just the same. Which means every indication behaved as usual. Both
> arrays were properly identified. But when I was accessing the inconsistent
> one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do
> with any filesystems) the error messages mentioned above appeared. I'm not
> sure what exactly was generating them, but I've CC'd Mark Salyzyn, maybe
> he can explain more to it.
>
> > > And now for the best part: From Windows, I was able to access the ENTIRE
> > > VOLUME without the slightest problem. Not only did Windows boot entirely
> > > from the /dev/sda1, but using Total Commander's ext3 plugin I was also able
> > > to access the ENTIRE /dev/sda2 and at least extract the most important data
> > > and configurations, before I did the complete low-level formatting of the
> > > drive, which fixed the inconsistency problem.
> > >
> > > I call this "AN IRONY" to be forced to use Windows to extract information
> > > from Linux partition, wouldn't you? ;)
> > >
> > > (Besides, even GRUB (using BIOS) accessed the /dev/sda without complications
> > > - as it was the bootable volume. Only Linux failed here a 100%. :()
> >
> > From the way you say sda when you presumably mean sda1 or sda2 it's not clear
> > if you don't understand the difference between drive and partition access or
> > are just so pissed off you are not taking the time to state the distinction
> > clearly.
>
> No, I understand the differences very clearly. But maybe I was just
> unclear in my expressions (for which I apologize). What I mean is that
> the problem was with the entire RAID array /dev/sda. So on ANY
> access to ANY part of /dev/sda, which of course also includes accesses to
> all of /dev/sda1, /dev/sda2, and /dev/sda3, the error messages appeared
> and no access was performed. That includes even accesses like this
> "dd if=/dev/sda of=/dev/null bs=512 count=1" and any other possible
> accesses. So the problem was with the entire device /dev/sda.
>
> > There was a problem with recovery from errors in RAID-5 which is addressed by
> > recent changes to fail a sector, try rewriting it, etc.
>
> Maybe this was again my bad explanation, but this wasn't a problem of a
> RAID 5 array, and much less of a software array. Adaptec 2410SA is a
> 4-channel HW SATA-I RAID controller.
>
> > I would have to read linux-raid archives to explain it, so I'll stop
> > with the overview. I don't think that's the issue here, you're using a
> > RAID controller rather than the software RAID, so it should not apply.
>
> Yes, exactly. And again, I've solved it by low-level formatting.
>
> > I assume that the problem is gone, so we can't do any more analysis after the
> > fact.
>
> Unfortunately, yes. But I've described above how it happened, so maybe
> someone at Adaptec would be able to reproduce it, Mark?
>
> Martin

====================================================
Martin Drab
Department of Solid State Engineering
Department of Mathematics
Faculty of Nuclear Sciences and Physical Engineering
Czech Technical University in Prague
Trojanova 13
120 00 Praha 2, Czech Republic
Tel: +420 22435 8649
Fax: +420 22435 8601
E-mail: [email protected]
====================================================

2006-02-03 15:42:51

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Martin Drab wrote:
> On Thu, 2 Feb 2006, Bill Davidsen wrote:
>
> Just to state clearly in the first place. I've already solved the problem
> by low-level formatting the entire disk that this inconsistent array in
> question was part of.
>
> So now everything is back to normal. So unfortunately I would not be able
> to do any more tests on the device in the non-working state.
>
> I mentioned this problem here now just to let you know that there is such
> a problematic Linux behaviour (and IMO flawed) in such circumstances, and
> that perhaps it may let you think of such situations when doing further
> improvements and development in the design of the block device layer (or
> wherever the problem may possibly come from).
>
>

It looks like the problem is in that controller card and its driver.
Was this a proprietary closed source driver? Linux is perfectly happy
to access the rest of the disk when some parts of it have gone bad;
people do this all the time. It looks like your raid controller decided
to take the entire virtual disk that it presents to the kernel offline
because it detected errors.

<snip>
> The 0,0,0 is the /dev/sda. And even though this is now, after low-level
> formatting of the previously inconsistent disc, the indications back then
> were just the same. Which means every indication behaved as usual. Both
> arrays were properly identified. But when I was accessing the inconsistent
> one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do
> with any filesystems) the error messages mentioned above appeared. I'm not
> sure what exactly was generating them, but I've CC'd Mark Salyzyn, maybe
> he can explain more to it.
>
>

How did you low level format the drive? These days disk manufacturers
ship drives already low level formatted and end users can not perform a
low level format. The last time I remember being able to low level
format a drive was with MFM and RLL drives, prior to IDE. My guess is
what you actually did was simply write out zeros to every sector on the
disk, which replaced the corrupted data in the affected sector with good
data, rendering it repaired. Usually drives will fail reads to bad
sectors but when you write to that sector, it will write and read that
sector to see if it is fine after being written again, or if the media
is bad in which case it will remap the sector to a spare.
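
A minimal sketch of the "keep reading the good parts" behaviour described above, assuming a 512-byte sector size and a per-sector read loop. The FlakyDisk class is a hypothetical stand-in for a device with bad sectors, not anything from the kernel or the aacraid driver; on a real system the same loop could be pointed at a device node opened read-only.

```python
import io

SECTOR = 512  # assumed logical sector size


def scan_readable(dev, total_sectors, sector_size=SECTOR):
    """Read every sector of an already opened binary file object.

    Returns the sector numbers whose read raised OSError. A failed
    read is recorded and the scan continues with the next sector.
    """
    bad = []
    for n in range(total_sectors):
        try:
            dev.seek(n * sector_size)
            dev.read(sector_size)
        except OSError:
            bad.append(n)
    return bad


class FlakyDisk(io.BytesIO):
    """Hypothetical test double: reads at the chosen sectors fail with EIO."""

    def __init__(self, data, bad_sectors):
        super().__init__(data)
        self.bad_sectors = set(bad_sectors)

    def read(self, size=-1):
        if self.tell() // SECTOR in self.bad_sectors:
            raise OSError(5, "Input/output error")
        return super().read(size)


disk = FlakyDisk(bytes(8 * SECTOR), bad_sectors={2, 5})
print(scan_readable(disk, 8))  # prints [2, 5]
```

The point of the sketch is only that an error on one sector need not stop access to the others. Whether a given controller or driver behaves this way is exactly the open question in this thread.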


2006-02-03 16:13:42

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Phillip Susi wrote:

> Martin Drab wrote:
> > On Thu, 2 Feb 2006, Bill Davidsen wrote:
> >
> > Just to state clearly in the first place. I've already solved the problem
> > by low-level formatting the entire disk that this inconsistent array in
> > question was part of.
> >
> > So now everything is back to normal. So unfortunately I would not be able to
> > do any more tests on the device in the non-working state.
> >
> > I mentioned this problem here now just to let you know that there is such a
> > problematic Linux behaviour (and IMO flawed) in such circumstances, and that
> > perhaps it may let you think of such situations when doing further
> > improvements and development in the design of the block device layer (or
> > wherever the problem may possibly come from).
> >
> >
>
> It looks like the problem is in that controller card and its driver. Was this
> a proprietary closed source driver?

No, it was the kernel's AACRAID driver (drivers/scsi/aacraid/*). And I've
consulted that with Mark Salyzyn who told me that it is the problem of the
upper layers which are only zero fault tolerant and that driver can do
nothing about it.

So as I understand it, the RAID controller did signal some error with
respect to the inconsistency of that particular array and the upper layers
weren't probably able to distinguish the real condition and just
interpreted it as an error and so refused to access the device
altogether. But understand that this is just my way of interpreting what
I think might have happened without any knowledge of SCSI protocol or
functionality of the SCSI or other related layers.

> Linux is perfectly happy to access the
> rest of the disk when some parts of it have gone bad; people do this all the
> time. It looks like your raid controller decided to take the entire virtual
> disk that it presents to the kernel offline because it detected errors.

No, it wasn't offline. No such messages appeared in the kernel. And if it
had been offline, the kernel/driver would certainly have reported that, as
I've already witnessed such a situation in the past (however for a totally
different reason).

> <snip>
> > The 0,0,0 is the /dev/sda. And even though this is now, after low-level
> > formatting of the previously inconsistent disc, the indications back then
> > were just the same. Which means every indication behaved as usual. Both
> > arrays were properly identified. But when I was accessing the inconsistent
> > one, i.e. /dev/sda, in any way (even just bytes, this has nothing to do with
> > any filesystems) the error messages mentioned above appeared. I'm not sure
> > what exactly was generating them, but I've CC'd Mark Salyzyn, maybe he can
> > explain more to it.
>
> How did you low level format the drive?

The BIOS of the RAID controller has this option.

> These days disk manufacturers ship
> drives already low level formatted and end users can not perform a low level
> format. The last time I remember being able to low level format a drive was
> with MFM and RLL drives, prior to IDE. My guess is what you actually did was
> simply write out zeros to every sector on the disk, which replaced the
> corrupted data in the affected sector with good data, rendering it repaired.

That may very well be true. I do not know what the Adaptec BIOS does under
the "Low-Level Format" option. Maybe someone from Adaptec would know that.
Mark?

> Usually drives will fail reads to bad sectors but when you write to that
> sector, it will write and read that sector to see if it is fine after being
> written again, or if the media is bad in which case it will remap the sector
> to a spare.

No, I don't think this was a case of physically bad sectors. I think
it was just an inconsistency of the RAID controller's metadata (or
something similar) related to that particular array.

Martin

2006-02-03 16:39:05

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Martin Drab wrote:
> On Fri, 3 Feb 2006, Phillip Susi wrote:
>
>> It looks like the problem is in that controller card and its driver. Was this
>> a proprietary closed source driver?
>>
>
> No, it was the kernel's AACRAID driver (drivers/scsi/aacraid/*). And I've
> consulted that with Mark Salyzyn who told me that it is the problem of the
> upper layers which are only zero fault tolerant and that driver can do
> nothing about it.
>

That's a strange statement, maybe we could get some clarification on
it? From the dmesg lines you posted before, it appeared that the
hardware was failing the request with a bad disk sense code. As I said
before, normally Linux has no problem reading the good parts of a
partially bad disk, so I wonder exactly what Mark means by "upper layers
which are only zero fault tolerant"?



2006-02-03 17:00:07

by Mark Salyzyn

Subject: RE: FYI: RAID5 unusably unstable through 2.6.14

Martin Drab [mailto:[email protected]] sez:
> That may very well be true. I do not know what the Adaptec
> BIOS does under the "Low-Level Format" option. Maybe someone from
> Adaptec would know that.

The drive is low level formatted. This resolved the problem you were
having.

> No, I don't think this was a case of physically bad
> sectors. I think it was just an inconsistency of the RAID controller's
> metadata (or something similar) related to that particular array.

It was a case of a set of physically bad sectors in a non-redundant
formation resulting in a non-recoverable situation, from what I could
tell. Read failures do not take the array offline, write failures do.
Instead the adapter responds with a hardware failure to the read
responses. Writing the data would have re-assigned the bad blocks. (RAID
controllers do reassign media bad blocks automatically, but set them as
inconsistent under some scenarios, requiring a write to mark them
consistent again. This is no different from how single-drive media react
to fault or corruption issues).

The bad sectors were localized only affecting the Linux partition, the
accesses were to directory or superblock nodes if memory serves. Another
system partition was unaffected because the errors were not localized to
its area.

Besides low level formatting, there is not much anyone can do about this
issue except ask for a less catastrophic response from the Linux File
system drivers. I make no offer or suggestion regarding the changes that
would be necessary to support the system limping along when file system
data has been corrupted; UNIX policy in general is to walk away as
quickly as possible and do the least continuing damage.

Except this question: If a superblock can not be read in, what about the
backup copies? Could an fsck play games with backup copies to result in
a write to close inconsistencies?
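
For reference, a rough sketch of where ext2/ext3 backup superblocks usually sit. It assumes the sparse_super feature (backups in group 1 and in groups that are powers of 3, 5 and 7) and the default of 8 * block_size blocks per group. The function names are illustrative only, and real values for a given filesystem should be taken from dumpe2fs or mke2fs -n rather than from this calculation.

```python
def is_power_of(n, base):
    """True if n == base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblock_blocks(total_blocks, block_size=4096, blocks_per_group=None):
    """List block numbers of backup superblocks under the assumptions above."""
    if blocks_per_group is None:
        # One bitmap block describes 8 * block_size blocks.
        blocks_per_group = 8 * block_size
    # With 1 KiB blocks the primary superblock occupies block 1, so groups start there.
    first_data_block = 1 if block_size == 1024 else 0
    groups = -(-(total_blocks - first_data_block) // blocks_per_group)  # ceiling division
    result = []
    for g in range(1, groups):
        if g == 1 or any(is_power_of(g, b) for b in (3, 5, 7)):
            result.append(first_data_block + g * blocks_per_group)
    return result


print(backup_superblock_blocks(1_000_000))
# prints [32768, 98304, 163840, 229376, 294912, 819200, 884736]
```

One of these block numbers can be handed to e2fsck with its -b option to check a filesystem from a backup copy when the primary superblock is unreadable.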

-- Mark Salyzyn

2006-02-03 17:12:10

by Roger Heflin

Subject: RE: FYI: RAID5 unusably unstable through 2.6.14



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Phillip Susi
> Sent: Friday, February 03, 2006 10:38 AM
> To: Martin Drab
> Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List;
> Salyzyn, Mark
> Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
>
> Martin Drab wrote:
> > On Fri, 3 Feb 2006, Phillip Susi wrote:
> >
> >> It looks like the problem is in that controller card and its driver.
> >> Was this a proprietary closed source driver?
> >>
> >
> > No, it was the kernel's AACRAID driver (drivers/scsi/aacraid/*). And
> > I've consulted that with Mark Salyzyn who told me that it is the
> > problem of the upper layers which are only zero fault tolerant and
> > that driver can do nothing about it.
> >
>
> That's a strange statement, maybe we could get some
> clarification on it? From the dmesg lines you posted before,
> it appeared that the hardware was failing the request with a
> bad disk sense code. As I said before, normally Linux has no
> problem reading the good parts of a partially bad disk, so I
> wonder exactly what Mark means by "upper layers which are
> only zero fault tolerant"?


Some of the fakeraid controllers will kill the disk when the
disk returns a failure like that.

On top of that usually (even if the controller were not to
kill the disk) the application will get a fatal disk error
also, causing the application to die.

The best I have been able to hope for (this is a raid0 stripe
case) is that the fakeraid controller does not kill the disk,
returns the disk error to the higher levels and lets the application
be killed, at least in this case you will likely know the disk
has a fatal error, rather than (in the raid0 case) having the
machine crash, and have to debug it to determine exactly
what the nature of the failure was.

The same may need to be applied when the array is already
in degraded mode ... limping along with some lost data and messages
indicating such is a lot better than losing all of the data.

Roger

2006-02-03 17:39:42

by Martin Drab

Subject: RE: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Salyzyn, Mark wrote:

> Martin Drab [mailto:[email protected]] sez:
> > That may very well be true. I do not know what the Adaptec
> > BIOS does under the "Low-Level Format" option. Maybe someone from
> > Adaptec would know that.
>
> The drive is low level formatted. This resolved the problem you were
> having.
>
> > No, I don't think this was a case of physically bad
> > sectors. I think it was just an inconsistency of the RAID controller's
> > metadata (or something similar) related to that particular array.
>
> It was a case of a set of physically bad sectors in a non-redundant
> formation resulting in a non-recoverable situation, from what I could
> tell. Read failures do not take the array offline, write failures do.

Again, neither reads nor writes resulted in the disk going offline. (Even though
I'm not quite positive about having tried writing under Linux.) And it
definitely wasn't caused by the controller, since I was doing both reads
and writes to that "faulty" array from Windows and all those operations
completed without any problem.

> Instead the adapter responds with a hardware failure to the read
> responses. Writing the data would have re-assigned the bad blocks. (RAID
> controllers do reassign media bad blocks automatically, but set them as
> inconsistent under some scenarios, requiring a write to mark them
> consistent again. This is no different from how single-drive media react
> to fault or corruption issues).
>
> The bad sectors were localized only affecting the Linux partition, the
> accesses were to directory or superblock nodes if memory serves. Another
> system partition was unaffected because the errors were not localized to
> its area.

However, I was able to read the Linux Ext3 data (from the /dev/sda2
partition) using the Total Commander's ext2 plugin from Windows, and that
worked well for the entire partition (both reads and writes).

Are you 100% certain it must have been bad physical sectors? Because I'm
not all that sure.

> Besides low level formatting, there is not much anyone can do about this
> issue except ask for a less catastrophic response from the Linux File
> system drivers.

This has nothing to do with filesystems, since no access at all was
possible to that block device.

> I make no offer or suggestion regarding the changes that
> would be necessary to support the system limping along when file system
> data has been corrupted; UNIX policy in general is to walk away as
> quickly as possible and do the least continuing damage.
>
> Except this question: If a superblock can not be read in, what about the
> backup copies? Could an fsck play games with backup copies to result in
> a write to close inconsistencies?

OK, this is probably also something that needs to be improved if there is
a problem there as well, but it is a totally different case from what happened
here. This certainly had nothing to do with filesystems, since (as I've
mentioned earlier) not even plain access to the whole /dev/sda using dd(1)
was working.

Martin

2006-02-03 17:51:21

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Martin Drab wrote:

> On Fri, 3 Feb 2006, Phillip Susi wrote:
>
> > Usually drives will fail reads to bad sectors but when you write to that
> > sector, it will write and read that sector to see if it is fine after being
> > written again, or if the media is bad in which case it will remap the sector
> > to a spare.
>
> No, I don't think this was a case of physically bad sectors. I think
> it was just an inconsistency of the RAID controller's metadata (or
> something similar) related to that particular array.

Or is such a situation not possible at all? Are bad sectors the only
reason that might have caused this? That sounds a little strange to me,
that would have been a very unlikely concentration of coincidences, IMO.
That's why I still think there are no bad sectors at all (at least not
because of this). Is there any way to actually find out?

Martin

2006-02-03 19:00:21

by Roger Heflin

Subject: RE: FYI: RAID5 unusably unstable through 2.6.14



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Martin Drab
> Sent: Friday, February 03, 2006 11:51 AM
> To: Phillip Susi
> Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List;
> Salyzyn, Mark
> Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
>
> On Fri, 3 Feb 2006, Martin Drab wrote:
>
> > On Fri, 3 Feb 2006, Phillip Susi wrote:
> >
> > > Usually drives will fail reads to bad sectors but when you write to
> > > that sector, it will write and read that sector to see if it is fine
> > > after being written again, or if the media is bad in which case it
> > > will remap the sector to a spare.
> >
> > No, I don't think this was a case of physically bad sectors. I
> > think it was just an inconsistency of the RAID controller's metadata
> > (or something similar) related to that particular array.
>
> Or is such a situation not possible at all? Are bad sectors
> the only reason that might have caused this? That sounds a
> little strange to me, that would have been a very unlikely
> concentration of coincidences, IMO.
> That's why I still think there are no bad sectors at all (at
> least not because of this). Is there any way to actually find out?


Some of the drive manufacturers have tools that will read out
"log" files from the disks, and these log files include stuff
such as how many bad block errors were returned to the host
over the life of the disk.

You would need a decent contact with the disk manufacturer, and
you might be able to get them to tell you, maybe.

Roger

2006-02-03 19:13:08

by Martin Drab

Subject: RE: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Roger Heflin wrote:

> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Martin Drab
> > Sent: Friday, February 03, 2006 11:51 AM
> > To: Phillip Susi
> > Cc: Bill Davidsen; Cynbe ru Taren; Linux Kernel Mailing List;
> > Salyzyn, Mark
> > Subject: Re: FYI: RAID5 unusably unstable through 2.6.14
> >
> > On Fri, 3 Feb 2006, Martin Drab wrote:
> >
> > > On Fri, 3 Feb 2006, Phillip Susi wrote:
> > >
> > > > Usually drives will fail reads to bad sectors but when you write to
> > > > that sector, it will write and read that sector to see if it is fine
> > > > after being written again, or if the media is bad in which case it
> > > > will remap the sector to a spare.
> > >
> > > No, I don't think this was a case of physically bad sectors. I
> > > think it was just an inconsistency of the RAID controller's metadata
> > > (or something similar) related to that particular array.
> >
> > Or is such a situation not possible at all? Are bad sectors
> > the only reason that might have caused this? That sounds a
> > little strange to me, that would have been a very unlikely
> > concentration of coincidences, IMO.
> > That's why I still think there are no bad sectors at all (at
> > least not because of this). Is there any way to actually find out?
>
>
> Some of the drive manufacturers have tools that will read out
> "log" files from the disks, and these log files include stuff
> such as how many bad block errors were returned to the host
> over the life of the disk.

S.M.A.R.T. should be able to do this. But the last time I checked, it wasn't
working with Linux and SCSI/SATA. Is this working now?

> You would need a decent contact with the disk manufacturer, and
> you might be able to get them to tell you, maybe.

Well it's a WD 1600SD.

Martin

2006-02-03 19:39:11

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

I fail to see how this is a reply to my message. I was asking for
clarification on what "higher layer" supposedly resulted in this
behavior ( of not being able to access any part of the disk ) because as
far as I know, all the higher layers are quite happy to access the non
broken parts of the disk, and return the appropriate error to the
calling application for the bad parts of the disk.

Roger Heflin wrote:
>> That's a strange statement, maybe we could get some
>> clarification on it? From the dmesg lines you posted before,
>> it appeared that the hardware was failing the request with a
>> bad disk sense code. As I said before, normally Linux has no
>> problem reading the good parts of a partially bad disk, so I
>> wonder exactly what Mark means by "upper layers which are
>> only zero fault tolerant"?
>>
>
>
> Some of the fakeraid controllers will kill the disk when the
> disk returns a failure like that.
>
> On top of that usually (even if the controller were not to
> kill the disk) the application will get a fatal disk error
> also, causing the application to die.
>
> The best I have been able to hope for (this is a raid0 stripe
> case) is that the fakeraid controller does not kill the disk,
> returns the disk error to the higher levels and lets the application
> be killed, at least in this case you will likely know the disk
> has a fatal error, rather than (in the raid0 case) having the
> machine crash, and have to debug it to determine exactly
> what the nature of the failure was.
>
> The same may need to be applied when the array is already
> in degraded mode ... limping along with some lost data and messages
> indicating such is a lot better than losing all of the data.
>
> Roger

2006-02-03 19:42:04

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Martin Drab wrote:
> S.M.A.R.T. should be able to do this. But last time I've checked it wasn't
> working with Linux and SCSI/SATA. Is this working now?
>
>

Yes, it is working now. The smartmontools package returns all kinds of
handy information from the drive and can force the drive to perform a
low level disk check on request. It likely won't pass through a
hardware raid controller however.
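
As a rough illustration, assuming the smartctl tool from smartmontools is installed and the drive is directly visible as a device node: "smartctl -a" prints the drive's SMART information and "smartctl -t long" asks the drive to start an extended self-test. The wrapper functions and the device path below are hypothetical, and running smartctl normally needs root.

```python
import shutil
import subprocess


def smartctl_cmd(device, long_test=False):
    """Build the smartctl argument list for one device."""
    if long_test:
        # Ask the drive to start its extended (long) self-test.
        return ["smartctl", "-t", "long", device]
    # Print all SMART information the drive reports.
    return ["smartctl", "-a", device]


def run_smartctl(device, long_test=False):
    """Run smartctl if it is installed; return its text output, else None."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_cmd(device, long_test),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag drive problems
    )
    return result.stdout


# Example call (the device name is only a placeholder):
# print(run_smartctl("/dev/sda"))
```

Behind a hardware RAID controller the plain device node usually represents the array rather than a physical disk, so this may report nothing useful there.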

> Well it's a WD 1600SD.
>
> Martin

2006-02-03 19:45:41

by Martin Drab

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Fri, 3 Feb 2006, Phillip Susi wrote:

> Martin Drab wrote:
> > S.M.A.R.T. should be able to do this. But last time I've checked it wasn't
> > working with Linux and SCSI/SATA. Is this working now?
>
> Yes, it is working now. The smartmontools package returns all kinds of handy
> information from the drive and can force the drive to perform a low level disk
> check on request. It likely won't pass through a hardware raid controller
> however.

Yes, that may be another issue. It depends on whether AACRAID is ready
for that or not. (Adaptec declares that the controller is SMART capable.)

Martin

2006-02-03 19:47:44

by Phillip Susi

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Salyzyn, Mark wrote:
> The drive is low level formatted. This resolved the problem you were
> having.
>
>

Could you define what you mean by "low level format"? AFAIK, IDE drives
do not provide a command to low level format them ( like MFM and RLL
drives required ), so the best you can do is write zeroes to all sectors
on the disk.


2006-02-08 14:44:44

by Alan

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

On Iau, 2006-01-19 at 16:25 +0000, Alan Cox wrote:
> On Iau, 2006-01-19 at 10:59 -0500, Mark Lord wrote:
> > But the card is a total slug unless the host does 32-bit PIO to/from it.
> > Do we have that capability in libata yet?
>
> Very very easy to sort out. Just need a ->pio_xfer method set. Would
> then eliminate some of the core driver flags and let us do vlb sync for
> legacy hw


This is now all done and present in the 2.6.16 libata PATA patches.

2006-02-09 17:06:31

by Pavel Machek

Subject: Re: FYI: RAID5 unusably unstable through 2.6.14

Hi!

> >If 1 disk has a 1/1000 chance of failure, then
> >2 disks have a (1/1000)^2 chance of double failure, and
> >3 disks have a (1/1000)^2 * 3 chance of double failure
> >4 disks have a (1/1000)^2 * 7 chance of double failure
>
> After the first drive fails you have no redundancy, the
> chance of an additional failure is linear to the number
> of remaining drives.
>
> Assume:
> p - probability of a drive failing in unit time
> n - number of drives
> F - probability of double failure
>
> The chance of a single drive failure is n*p. After that

<pedantic>
Actually it is not. Imagine 100 drives with a 10% failure rate each. You
can't have a probability of 1000%...
</>
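
To put numbers on that, here is a small sketch that assumes drives fail independently with the same probability p in the unit of time. Under that assumption the chance that at least one of n drives fails is 1 - (1 - p)^n, and n*p is only a good approximation while n*p is small. The function name is just for illustration.

```python
def p_any_failure(p, n):
    """Probability that at least one of n independent drives fails."""
    return 1.0 - (1.0 - p) ** n


# The 100-drive, 10% case: n*p would claim 10.0, i.e. 1000%.
print(f"{p_any_failure(0.10, 100):.5f}")  # prints 0.99997

# For small n*p the linear estimate is close: 8 * 0.001 = 0.008.
print(p_any_failure(0.001, 8) < 8 * 0.001)  # prints True
```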

Pavel
--
Thanks, Sharp!