2008-11-23 00:02:49

by Justin Piszcz

Subject: Why does the md/raid subsystem not remap bad sectors in a raid array?

I asked before but it was kind of clobbered in the velociraptor mess:

On a colleague's box:

Aug 02, 2008 12:15.30AM(0x04:0x0023): Sector repair completed: port=7,
LBA=0x4A0387F5

SMART Self-test log structure revision number 0
Warning: ATA Specification requires self-test log structure revision number = 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%              305          1241745397

Even though this disk has a bad sector:
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
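
For reference, the self-test log and the attribute line above are
smartmontools output; with the device name adjusted for the controller,
something like this reproduces them:

    smartctl -t long /dev/sdX      # queue an extended offline self-test
    smartctl -l selftest /dev/sdX  # print the self-test log afterwards
    smartctl -A /dev/sdX           # print attributes such as Offline_Uncorrectable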

The controller does not drop the drive from the array when it hits an
error; the 3ware card "takes care of it" and the user need not worry about
it. With md/raid, on the other hand, every time it hits a bad sector it
breaks the raid and the array goes degraded. Is this correct? Will/can
something like what 3ware does be possible in a sw-raid based
configuration, or is a HW raid card required?

Justin.


2008-11-23 00:13:47

by Jon Nelson

Subject: Re: Why does the md/raid subsystem not remap bad sectors in a raid array?

There are a few reasons, but my guess is this: md tries to use the
entire available storage (of the smallest element) in a given set of
devices, which means there is no room for remappery.

However, if MD could be told to set aside some percentage of this
value, or some fixed amount (like, say, 10MB), then remapping blocks
becomes possible.

To add this functionality, though, one would have to consider the following:

1. how much to set aside?
2. where? beginning, end, middle, staggered in chunks?
3. how to tell MD that block A maps to block B on device C? Should it
be done as an exception list (all blocks not in list X refer to their
actual block, otherwise they refer to a redirected block), or as a
direct map (or something else)? A toy sketch of the exception-list
option follows this list.
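
To make the exception-list idea in (3) a bit more concrete, here is a
toy sketch in C; the names and numbers are invented and this is nothing
like real md metadata, but it shows the shape of the lookup (a direct
map would instead be a full table indexed by block number):

    /* Toy exception-list remap: blocks not present in the table map to
     * themselves; a hit is redirected to a block in the reserved spare
     * area.  Purely illustrative, not md code. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    struct remap_entry {
        uint64_t bad_block;    /* original (failing) block on the member */
        uint64_t spare_block;  /* replacement block in the spare area    */
    };

    /* tiny fixed table; md would have to persist this in its metadata */
    static struct remap_entry remap_table[] = {
        { 123456789ULL, 900000001ULL },
        { 123456790ULL, 900000002ULL },
    };

    static uint64_t remap_lookup(uint64_t block)
    {
        for (size_t i = 0; i < sizeof(remap_table) / sizeof(remap_table[0]); i++)
            if (remap_table[i].bad_block == block)
                return remap_table[i].spare_block;
        return block;  /* not in the exception list: block maps to itself */
    }

    int main(void)
    {
        printf("%llu -> %llu\n", 123456789ULL,
               (unsigned long long)remap_lookup(123456789ULL));
        printf("%llu -> %llu\n", 42ULL,
               (unsigned long long)remap_lookup(42ULL));
        return 0;
    }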

Perhaps an alternative would be to add a new block layer which takes
an existing block device X and exposes a new, automatic remapper-y
block device Y: bad reads might continue to return errors, but a write
to a previously failed block could go to a new block, and subsequent
reads would then follow the redirect.
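
A rough model of what that layer's read/write policy could look like,
again with invented names and in-memory stand-ins for the devices:

    /* Toy model of the proposed remapping layer: a read that fails is
     * remembered; a later write to that block is redirected to a spare
     * block, and subsequent reads follow the redirect. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NBLOCKS 8
    #define NSPARE  4

    enum state { NORMAL, KNOWN_BAD, REMAPPED };

    static int  lower[NBLOCKS];          /* device X, one int per "block" */
    static int  spare[NSPARE];           /* reserved spare area           */
    static bool media_error[NBLOCKS];    /* simulated unreadable blocks   */
    static enum state st[NBLOCKS];
    static int  redirect[NBLOCKS];       /* spare index when REMAPPED     */
    static int  next_spare;

    static bool dev_read(int blk, int *val)
    {
        if (st[blk] == REMAPPED) { *val = spare[redirect[blk]]; return true; }
        if (st[blk] == KNOWN_BAD) return false;       /* still unreadable */
        if (media_error[blk]) { st[blk] = KNOWN_BAD; return false; }
        *val = lower[blk];
        return true;
    }

    static void dev_write(int blk, int val)
    {
        if (st[blk] == KNOWN_BAD && next_spare < NSPARE) {
            redirect[blk] = next_spare++;             /* redirect to spare */
            st[blk] = REMAPPED;
        }
        if (st[blk] == REMAPPED)
            spare[redirect[blk]] = val;
        else
            lower[blk] = val;
    }

    int main(void)
    {
        int v = 0;
        media_error[3] = true;                        /* block 3 goes bad  */
        printf("first read of 3:  %s\n", dev_read(3, &v) ? "ok" : "error");
        dev_write(3, 42);                             /* rewrite redirects */
        if (dev_read(3, &v))
            printf("read after write: ok, value %d\n", v);
        else
            printf("read after write: error\n");
        return 0;
    }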

Perhaps the easiest way to test this would be to hack NBD or AoE and
build a raid out of such devices.

Just ramblin' here.


--
Jon

2008-11-23 02:00:26

by Robert Hancock

Subject: Re: Why does the md/raid subsystem not remap bad sectors in a raid array?

Justin Piszcz wrote:
> I asked before but it was kind of clobbered in the velociraptor mess:
>
> On a colleague's box:
>
> Aug 02, 2008 12:15.30AM(0x04:0x0023): Sector repair completed: port=7,
> LBA=0x4A0387F5
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision number = 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure        90%              305          1241745397
>
> Even though this disk has a bad sector:
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       1
>
> The controller does not drop the drive from the array when it hits an
> error; the 3ware card "takes care of it" and the user need not worry
> about it. With md/raid, on the other hand, every time it hits a bad
> sector it breaks the raid and the array goes degraded. Is this correct?
> Will/can something like what 3ware does be possible in a sw-raid based
> configuration, or is a HW raid card required?

Presumably all it's doing is writing that sector's contents back from
the other drive(s) in the array when the read error is detected; this is
something that software could do just as well. Drives only remap bad
sectors when they are written over, as a read failure doesn't
necessarily mean that the sector is entirely unreadable, but could be
due to environmental factors such as high temperature, vibration, etc.
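
A minimal sketch of that recovery step for a two-disk mirror, as a toy
in-memory model rather than anything resembling md's actual code:

    /* Toy two-disk mirror illustrating the idea above: on a read error
     * from one member, return the good copy from the other member and
     * write it back to the failing sector so the drive can reallocate
     * it.  In-memory model only, not md code. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NSECT 4

    static int  disk[2][NSECT];           /* two mirrored members          */
    static bool unreadable[2][NSECT];     /* simulated pending bad sectors */

    static bool disk_read(int d, int s, int *val)
    {
        if (unreadable[d][s]) return false;
        *val = disk[d][s];
        return true;
    }

    static void disk_write(int d, int s, int val)
    {
        disk[d][s] = val;
        unreadable[d][s] = false;         /* the write "reallocates" it    */
    }

    /* read array sector s, preferring member 0 */
    static bool array_read(int s, int *val)
    {
        if (disk_read(0, s, val)) return true;
        if (!disk_read(1, s, val)) return false;   /* both copies bad      */
        disk_write(0, s, *val);           /* write good data back instead  */
        return true;                      /* of kicking the member out     */
    }

    int main(void)
    {
        int v = 0;
        disk[0][2] = disk[1][2] = 7;
        unreadable[0][2] = true;          /* member 0 grows a bad sector   */
        if (array_read(2, &v))
            printf("array read: ok, value %d\n", v);
        else
            printf("array read: error\n");
        printf("member 0 sector readable again: %s\n",
               unreadable[0][2] ? "no" : "yes");
        return 0;
    }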

Just rewriting the sector seems a bit questionable though; if a drive
in your array is growing read errors, that's not really a good thing...

>
> Justin.
>

2008-11-23 04:34:42

by Brad Campbell

Subject: Re: Why does the md/raid subsystem not remap bad sectors in a raid array?

Robert Hancock wrote:
>> The controller does not drop the drive from the array when it hits an
>> error; the 3ware card "takes care of it" and the user need not worry
>> about it. With md/raid, on the other hand, every time it hits a bad
>> sector it breaks the raid and the array goes degraded. Is this correct?
>> Will/can something like what 3ware does be possible in a sw-raid based
>> configuration, or is a HW raid card required?
>
> Presumably all it's doing is writing that sector's contents back from
> the other drive(s) in the array when the read error is detected; this is
> something that software could do just as well. Drives only remap bad
> sectors when they are written over, as a read failure doesn't
> necessarily mean that the sector is entirely unreadable, but could be
> due to environmental factors such as high temperature, vibration, etc.
>
> Just rewriting the sector seems a bit questionable though; if a drive
> in your array is growing read errors, that's not really a good thing...

md has done this for a while now though. If it encounters a read error in the array it will make an
attempt to write the reconstructed data back to that disk attempting to force a reallocation. I've
seen it work quite well here on disks that have the occasional grown defect.

It's certainly _much_ nicer than having the disk booted from the array on a single read error.

If the disk is haemorrhaging sectors then you will find out about it sooner or later through other
means.

Brad
--
Dolphins are so intelligent that within a few weeks they can
train Americans to stand at the edge of the pool and throw them
fish.

by Henrique Holschuh

Subject: Re: Why does the md/raid subsystem not remap bad sectors in a raid array?

On Sun, 23 Nov 2008, Brad Campbell wrote:
> md has done this for a while now though. If it encounters a read error in
> the array it will make an attempt to write the reconstructed data back to
> that disk attempting to force a reallocation. I've seen it work quite
> well here on disks that have the occasional grown defect.

Indeed, but it does so in the "check array" mode (which distros like
Debian now enable once a month or so; I always up that to once a
week :p)
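
For the record, that check can also be kicked off by hand through
sysfs whenever you like (md0 is just an example name):

    echo check > /sys/block/md0/md/sync_action    # scrub, count mismatches
    echo repair > /sys/block/md0/md/sync_action   # scrub and rewrite mismatches
    cat /sys/block/md0/md/mismatch_cnt            # result of the last pass

Progress shows up in /proc/mdstat while it runs.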

Does md repair bitrotten sectors ALSO outside of check mode? That's
what is being asked in this thread...

> If the disk is haemorrhaging sectors then you will find out about it
> sooner or later through other means.

Like a weekly SMART long test. That's what our maintenance windows are
for :) Everything is kept on-line, but allowed to run in degraded
performance mode, so we kick in SMART offline and long tests, RAID array
scrubbing, etc (not at the same time, though!).

That reminds me to file a bug against smartmontools to DISABLE auto
offline mode on disks, and to enable it one disk at a time at random
intervals with at least one hour between disks. Otherwise, all the
disks enter SMART auto-offline testing at the same time.
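
In the meantime smartd.conf can already express most of that: turn
automatic offline testing off and schedule the long self-tests at
staggered hours (fixed rather than random, but at least an hour apart;
the device names and times below are only examples):

    # disable automatic offline testing (-o off) and stagger the weekly
    # long self-tests so only one disk is busy at a time (Saturdays here)
    /dev/sda -a -o off -s L/../../6/03
    /dev/sdb -a -o off -s L/../../6/04
    /dev/sdc -a -o off -s L/../../6/05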

Hmm, it would be good to teach md to measure disk throughput using a
sliding window (of, say, 5 minutes) and reduce the read priority of
disks that are slow...
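
The bookkeeping for that would be simple enough; a sketch of the
sliding window, with invented names and nothing to do with md's
internals:

    /* Keep per-second byte counts for the last WINDOW seconds and
     * compare a member's average throughput against its peers'. */
    #include <stdio.h>
    #include <stdint.h>

    #define WINDOW 300                    /* 5 minutes, one slot per second */

    struct disk_stats {
        uint64_t bytes[WINDOW];           /* ring buffer of byte counts     */
        int cur;                          /* slot for the current second    */
    };

    static void tick(struct disk_stats *d)  /* call once per second         */
    {
        d->cur = (d->cur + 1) % WINDOW;
        d->bytes[d->cur] = 0;
    }

    static void account(struct disk_stats *d, uint64_t n)
    {
        d->bytes[d->cur] += n;
    }

    static double throughput(const struct disk_stats *d)  /* bytes/second  */
    {
        uint64_t total = 0;
        for (int i = 0; i < WINDOW; i++)
            total += d->bytes[i];
        return (double)total / WINDOW;
    }

    int main(void)
    {
        struct disk_stats fast = {{0}, 0}, slow = {{0}, 0};
        for (int s = 0; s < WINDOW; s++) {
            account(&fast, 100u << 20);   /* ~100 MB/s                      */
            account(&slow, 20u << 20);    /* ~20 MB/s: candidate to avoid   */
            tick(&fast);
            tick(&slow);
        }
        printf("fast: %.1f MB/s, slow: %.1f MB/s\n",
               throughput(&fast) / (1 << 20), throughput(&slow) / (1 << 20));
        return 0;
    }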

--
"One disk to rule them all, One disk to find them. One disk to bring
them all and in the darkness grind them. In the Land of Redmond
where the shadows lie." -- The Silicon Valley Tarot
Henrique Holschuh