I've got a problem. Two drives in my array developed bad sectors at
the same time. The bad sectors are in completely different places, but
I can't read the array because md simply fails both drives out and
marks the array unusable.
Now, I can either buy new drives and dd the raw partition over, or I
can hack the kernel to make it a bit smarter about unrecoverable
reads.
Obviously, if I have raid5_error not mark the drive bad, it will hammer
away on it, failing over and over. My thought was to let the read
fail, catch it in raid5_end_read_request, and tag the stripe_head
with the device that failed. If another device has already failed on
that stripe, return EIO. That way further reads on the stripe_head will
be reconstructed from the remaining disks plus parity (until the stripe
is eventually freed; one I/O error per stripe isn't too harsh a price
to pay for disaster recovery).
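Roughly what I have in mind is sketched below -- the 'failed_dev' field
and the 'dev' index are made-up names just to show the idea, not the
real 2.4.20 structures:

    /* In raid5_end_read_request(), when a member read comes back bad,
     * instead of calling md_error() and failing the whole drive: */
    if (sh->failed_dev < 0 || sh->failed_dev == dev) {
        /* First bad sector seen on this stripe: remember which disk it
         * was, so the block gets rebuilt from the other members plus
         * parity instead of being retried on the bad disk. */
        sh->failed_dev = dev;
    } else {
        /* A second, different disk has failed within the same stripe:
         * there is nothing left to reconstruct from, so complete just
         * this request with EIO instead of taking the array down. */
        /* ... return EIO to the caller here ... */
    }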
In 2.4.20, I'm at raid5.c:421, where we're about to call md_error.
What happens to the bh from that point? Obviously it's not up-to-date,
so when one drive fails, how does the read get re-issued so the block
can be reconstructed from the remaining drives and parity?
Please CC me; I read via the archives.
Thanks,
--Dan
Obviously, this would ONLY be for recovery in the face of bad sectors.
The bad drives still need to be replaced as quickly as possible.
On Thursday December 19, [email protected] wrote:
>
> I've got a problem. Two drives in my array developed bad sectors at
> the same time. The bad sectors are in completely different places, but
> I can't read the array because md simply fails both drives out and
> marks the array unusable.
>
> Now, I can either buy new drives and dd the raw partition over, or I
> can hack the kernel to make it a bit smarter about unrecoverable
> reads.
>
> Obviously, if I have raid5_error not mark the drive bad, it will hammer
> away on it, failing over and over. My thought was to let the read
> fail, catch it in raid5_end_read_request, and tag the stripe_head
> with the device that failed. If another device has already failed on
> that stripe, return EIO. That way further reads on the stripe_head will
> be reconstructed from the remaining disks plus parity (until the stripe
> is eventually freed; one I/O error per stripe isn't too harsh a price
> to pay for disaster recovery).
>
> In 2.4.20, I'm at raid5.c:421, where we're about to call md_error.
> What happens to the bh from that point? Obviously it's not up-to-date,
> so when one drive fails, how does the read get re-issued so the block
> can be reconstructed from the remaining drives and parity?
Nothing much happens to the bh. It just has Uptodate cleared.
A little later (line 433) we call release_stripe. Once all the active
IO requests have finished the stripe will be completely released and
handle_stripe gets called (from raid5d) to handle the stripe.
It notices there is a block that is being read (bh_read) but that
isn't uptodate, and so tries to schedule a read. Line 949 notices
that the block it wants to read is on a failed drive, so it causes all
blocks to be read in. Once they are read in, handle_stripe gets
called again, and this time it gets to line 955 where it computes the
block that you want.
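The computation itself is nothing magic: the missing block is just the
xor of every other block in the stripe, parity included. Conceptually
(this is only an illustration with placeholder names, not the real
compute_block code):

    /* Rebuild the block that lives on the failed disk by xor-ing
     * together the corresponding blocks from every other disk in the
     * stripe (the parity block is treated just like the data blocks). */
    memset(missing, 0, blocksize);
    for (d = 0; d < disks; d++) {
        if (d == failed_disk)
            continue;
        for (n = 0; n < blocksize; n++)
            missing[n] ^= block[d][n];
    }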
What you want to do is add a 'failed_drive' field to the stripe_head
which gets initialised to -1 in init_stripe, and set at line 421
instead of calling md_error. Then in handle_stripe, line 875 changes
from

    if (!conf->disks[i].operational) {

to

    if (!conf->disks[i].operational || sh->failed_drive == i) {

and it might just work for you.
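Pulled together, the whole change amounts to something like this
(untested, and I'm recalling the context around line 421 from memory,
so the variable names there may differ):

    /* 1) new field in struct stripe_head */
    int failed_drive;        /* member with a bad sector, or -1 */

    /* 2) in init_stripe() */
    sh->failed_drive = -1;

    /* 3) at raid5.c:421, where md_error() is currently called for the
     *    failed read, just record which disk it was instead: */
    sh->failed_drive = i;    /* i being the disk index at that point */

    /* 4) in handle_stripe(), line 875 */
    if (!conf->disks[i].operational || sh->failed_drive == i) {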
Good luck
NeilBrown
On Fri, 20 Dec 2002, Neil Brown wrote:
> > fail, catch it in raid5_end_read_request, and tag the stripe_head
> > with the device that failed. If another device has already failed on
> > that stripe, return EIO. That way further reads on the stripe_head will
> > be reconstructed from the remaining disks plus parity (until the stripe
> > is eventually freed; one I/O error per stripe isn't too harsh a price
> > to pay for disaster recovery).
> >
> > In 2.4.20, I'm at raid5.c:421, where we're about to call md_error.
> > What happens to the bh from that point? Obviously it's not up-to-date,
> > so when one drive fails, how does the read get re-issued so the block
> > can be reconstructed from the remaining drives and parity?
>
> Nothing much happens to the bh. It just has Uptodate cleared.
>
> A little later (line 433) we call release_stripe. Once all the active
> IO requests have finished the stripe will be completely released and
> handle_stripe gets called (from raid5d) to handle the stripe.
> It notices there is a block that is being read (bh_read) but that
> isn't uptodate, and so tries to schedule a read. Line 949 notices
> that the block it wants to read is on a failed drive, so it causes all
> blocks to be read in. Once they are read in, handle_stripe gets
> called again, and this time it gets to line 955 where it computes the
> block that you want.
OK, I was unclear on what the retry magic was: we stall a request on a
failure, mark the drive 'failed', and eventually notice "hey, that
wasn't finished" and restart it, rebuilding from the good drives.
I'll try to toss a preliminary patch up in the near future. Is there a
patch for loopback to simulate errors on demand? I've already done
drive forensics to recover the disks in question, and I'd much rather
test on a set of 500MB loop devices than on a 60GB RAID (for obvious
reasons).
Thanks,
--Dan