2007-01-31 00:47:11

by Mark Lord

Subject: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR. It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portion of the request.

This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directory lookup also failed.

This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encountered, SCSI will now work around it again,
failing *only* the bad sector itself.

Signed-off-by: Mark Lord <[email protected]>
---
diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
--- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500
+++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500
@@ -865,6 +865,12 @@
*/
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+ case MEDIUM_ERROR:
+ // Bad sector. Fail it, and then continue the rest of the request:
+ if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
+ cmd->retries = 0; // go around again..
+ return;
+ }
case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change. Set a bit


2007-01-31 01:12:16

by Mark Lord

Subject: [PATCH] RESEND scsi_lib.c: continue after MEDIUM_ERROR

Fixed for 80-columns, and copying linux-scsi this time.

In ancient kernels, the SCSI disk code used to continue after
encountering a MEDIUM_ERROR. It would "complete" the good
sectors before the error, fail the bad sector/block, and then
continue with the rest of the request.

Kernels since about 2.6.16 or so have been broken in this regard.
They "complete" the good sectors before the error,
and then fail the entire remaining portion of the request.

This is very risky behaviour, as a request is often a merge
of several bios, and just because one application hits a bad sector
is no reason to pretend that (for example) an adjacent directory lookup also failed.

This patch fixes the behaviour to be similar to what we had originally.

When a bad sector is encountered, SCSI will now work around it again,
failing *only* the bad sector itself.

Signed-off-by: Mark Lord <[email protected]>
---
--- old/drivers/scsi/scsi_lib.c 2007-01-30 20:06:15.000000000 -0500
+++ linux/drivers/scsi/scsi_lib.c 2007-01-30 20:06:59.000000000 -0500
@@ -865,6 +865,13 @@
*/
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+ case MEDIUM_ERROR:
+ /* Bad sector. Fail it, and continue on with the rest */
+ if (scsi_end_request(cmd, 0,
+ cmd->device->sector_size, 1) == NULL) {
+ cmd->retries = 0; /* go around again.. */
+ return;
+ }
case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change. Set a bit

2007-01-31 01:16:44

by James Bottomley

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

First off, please send SCSI patches to the SCSI list:
<[email protected]>

On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:
> In ancient kernels, the SCSI disk code used to continue after
> encountering a MEDIUM_ERROR. It would "complete" the good
> sectors before the error, fail the bad sector/block, and then
> continue with the rest of the request.
>
> Kernels since about 2.6.16 or so have been broken in this regard.
> They "complete" the good sectors before the error,
> and then fail the entire remaining portion of the request.

What was the commit that introduced the change? ... I have a vague
memory of it being deliberate.

> This is very risky behaviour, as a request is often a merge
> of several bios, and just because one application hits a bad sector
> is no reason to pretend that (for example) an adjacent directory lookup also failed.
>
> This patch fixes the behaviour to be similar to what we had originally.
>
> When a bad sector is encountered, SCSI will now work around it again,
> failing *only* the bad sector itself.

Erm, but the corollary is that if we get a large read failure because of
a bad track, you're going to chunk through it a sector at a time, and
forcing an individual error for each sector is going to annoy some
people ... particularly removable-medium devices, which return this
error if the medium isn't present ... Are you sure this is really what
we want to do?

> Signed-off-by: Mark Lord <[email protected]>
> ---
> diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
> --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500
> +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500
> @@ -865,6 +865,12 @@
> */
> if (sense_valid && !sense_deferred) {
> switch (sshdr.sense_key) {
> + case MEDIUM_ERROR:
> + // Bad sector. Fail it, and then continue the rest of the request:
> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {

The sense key may have come with additional information. I think we want
to parse that (if it exists) rather than just blindly failing the first
sector of the request.

> + cmd->retries = 0; // go around again..
> + return;
> + }

This would drop through to the UNIT_ATTENTION case if scsi_end_request()
fails ... I don't think that's correct.

> case UNIT_ATTENTION:
> if (cmd->device->removable) {
> /* Detected disc change. Set a bit

2007-01-31 01:36:07

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

James Bottomley wrote:
> First off, please send SCSI patches to the SCSI list:
> <[email protected]>

Fixed already, thanks!

>> This patch fixes the behaviour to be similar to what we had originally.
>>
>> When a bad sector is encountered, SCSI will now work around it again,
>> failing *only* the bad sector itself.
>
> Erm, but the corollary is that if we get a large read failure because of
> a bad track, you're going to chunk through it a sector at a time

That's better than the huge data-loss scenario that we currently
have for single-sector errors. MUCH better.

> forcing an individual error for each sector is going to annoy some
> people ... particularly removable medium ones which return this error if
> the medium isn't present ... Are you sure this is really what we want to
> do?

No, for removed-medium everything just fails right away.
This patch is *only* for media errors, not any other failures.

Cheers

2007-01-31 01:41:45

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Eric D. Mudama wrote:
>
> Actually, it's possibly worse, since each failure in libata will
> generate 3-4 retries. With existing ATA error recovery in the drives,
> that's about 3 seconds per retry on average, or 12 seconds per failure.
> Multiply that by the number of blocks past the error to complete the
> request..

It really beats the alternative of a forced reboot
due to, say, superblock I/O failing because it happened
to get merged with an unrelated I/O which then failed..
Etc..

Definitely an improvement.

The number of retries is an entirely separate issue.
If we really care about it, then we should fix SD_MAX_RETRIES.

The current value of 5 is *way* too high. It should be zero or one.

Cheers

2007-01-31 03:52:46

by Ric Wheeler

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR



Mark Lord wrote:

> Eric D. Mudama wrote:
>
>>
>> Actually, it's possibly worse, since each failure in libata will
>> generate 3-4 retries. With existing ATA error recovery in the
>> drives, that's about 3 seconds per retry on average, or 12 seconds
>> per failure. Multiply that by the number of blocks past the error to
>> complete the request..
>
>
> It really beats the alternative of a forced reboot
> due to, say, superblock I/O failing because it happened
> to get merged with an unrelated I/O which then failed..
> Etc..
>
> Definitely an improvement.
>
> The number of retries is an entirely separate issue.
> If we really care about it, then we should fix SD_MAX_RETRIES.
>
> The current value of 5 is *way* too high. It should be zero or one.
>
> Cheers
>
I think that drives retry enough; we should leave retry at zero for
normal (non-removable) drives. Should this be a policy we can set, like
we do with NCQ queue depth, via /sys ?

We need to be able to layer things like MD on top of normal drive errors
in a way that will produce a system that provides reasonable response
time despite any possible IO error on a single component. Another case
that we end up doing on a regular basis is drive recovery. Errors need
to be limited in scope to just the impacted area and dispatched up to
the application layer as quickly as we can so that you don't spend days
watching a copy of a huge drive (think 750GB or more) ;-)

ric

2007-01-31 04:21:34

by James Bottomley

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Tue, 2007-01-30 at 22:20 -0500, Ric Wheeler wrote:
> Mark Lord wrote:
> > The number of retries is an entirely separate issue.
> > If we really care about it, then we should fix SD_MAX_RETRIES.
> >
> > The current value of 5 is *way* too high. It should be zero or one.
> >
> > Cheers
> >
> I think that drives retry enough, we should leave retry at zero for
> normal (non-removable) drives. Should this be a policy we can set like
> we do with NCQ queue depth via /sys ?

I don't disagree that it should be settable. However, retries occur for
reasons other than failures inside the device. The most common ones
are unit attentions generated because of other activity (target reset
etc.). The key to the problem is retrying only operations that are
genuinely retryable, which the mid-layer doesn't do a very good job of.

> We need to be able to layer things like MD on top of normal drive errors
> in a way that will produce a system that provides reasonable response
> time despite any possible IO error on a single component. Another case
> that we end up doing on a regular basis is drive recovery. Errors need
> to be limited in scope to just the impacted area and dispatched up to
> the application layer as quickly as we can so that you don't spend days
> watching a copy of huge drive (think 750GB or more) ;-)

For the MD case, this is what REQ_FAILFAST is for.

James


2007-01-31 05:09:48

by Douglas Gilbert

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Ric Wheeler wrote:
>
>
> Mark Lord wrote:
>
>> Eric D. Mudama wrote:
>>
>>>
>>> Actually, it's possibly worse, since each failure in libata will
>>> generate 3-4 retries. With existing ATA error recovery in the
>>> drives, that's about 3 seconds per retry on average, or 12 seconds
>>> per failure. Multiply that by the number of blocks past the error to
>>> complete the request..
>>
>>
>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
>>
>> Definitely an improvement.
>>
>> The number of retries is an entirely separate issue.
>> If we really care about it, then we should fix SD_MAX_RETRIES.
>>
>> The current value of 5 is *way* too high. It should be zero or one.
>>
>> Cheers
>>
> I think that drives retry enough, we should leave retry at zero for
> normal (non-removable) drives. Should this be a policy we can set like
> we do with NCQ queue depth via /sys ?

The transport might also want a say. I see ABORTED COMMAND
errors often enough with SAS (e.g. due to expander congestion)
to warrant at least one retry (which works in my testing).
SATA disks behind SAS infrastructure would also be
susceptible to the same "random" failures.

Transport Layer Retries (TLR) in SAS should remove this class
of transport errors but only SAS tape drives support TLR as
far as I know.

Doug Gilbert

> We need to be able to layer things like MD on top of normal drive errors
> in a way that will produce a system that provides reasonable response
> time despite any possible IO error on a single component. Another case
> that we end up doing on a regular basis is drive recovery. Errors need
> to be limited in scope to just the impacted area and dispatched up to
> the application layer as quickly as we can so that you don't spend days
> watching a copy of huge drive (think 750GB or more) ;-)
>
> ric


2007-01-31 09:30:49

by Jeff Garzik

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Mark Lord wrote:
> Eric D. Mudama wrote:
>>
>> Actually, it's possibly worse, since each failure in libata will
>> generate 3-4 retries. With existing ATA error recovery in the drives,
>> that's about 3 seconds per retry on average, or 12 seconds per
>> failure. Multiply that by the number of blocks past the error to
>> complete the request..
>
> It really beats the alternative of a forced reboot
> due to, say, superblock I/O failing because it happened
> to get merged with an unrelated I/O which then failed..
> Etc..


FWIW -- speaking generally -- I think there are inevitable areas where
libata error handling combined with SCSI error handling results in
suboptimal error handling.

Just creating a list of "<this behavior> should be handled <this way>,
but in reality is handled in <this silly way>" would be very helpful.

Error handling is tough to get right, because the code is exercised so
infrequently. Tejun has actually done an above-average job here, by
making device probe, hotplug and other "exceptions" go through the
libata EH code, thereby exercising the EH code more than one might
normally assume.

Some errors in libata probably should not be retried more than once,
when we have a definitive diagnosis. Suggestions for improvements are
welcome.

Jeff


2007-01-31 14:37:14

by Ric Wheeler

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR



Jeff Garzik wrote:
> Mark Lord wrote:
>> Eric D. Mudama wrote:
>>>
>>> Actually, it's possibly worse, since each failure in libata will
>>> generate 3-4 retries. With existing ATA error recovery in the
>>> drives, that's about 3 seconds per retry on average, or 12 seconds
>>> per failure. Multiply that by the number of blocks past the error to
>>> complete the request..
>>
>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
>
>
> FWIW -- speaking generally -- I think there are inevitable areas where
> libata error handling combined with SCSI error handling results in
> suboptimal error handling.
>
> Just creating a list of "<this behavior> should be handled <this way>,
> but in reality is handled in <this silly way>" would be very helpful.

I agree - Tejun has given us a great base. The next step is to get
clarity on what the types of errors are and how to differentiate between
them (and maybe how that would change by class of device?).

>
> Error handling is tough to get right, because the code is exercised so
> infrequently. Tejun has actually done an above-average job here, by
> making device probe, hotplug and other "exceptions" go through the
> libata EH code, thereby exercising the EH code more than one might
> normally assume.
>
> Some errors in libata probably should not be retried more than once,
> when we have a definitive diagnosis. Suggestions for improvements are
> welcome.
>
> Jeff

One thing that we find really useful is to inject real errors into devices. Mark
has some patches that let us inject media errors; we also bring back failed
drives, run them through testing, and occasionally get to use analyzers, etc.,
to inject odd-ball errors.

Hopefully, we will get some time to brainstorm about this at the workshop,

ric

2007-01-31 15:08:16

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Ric Wheeler wrote:
> Mark Lord wrote:
>> Eric D. Mudama wrote:
>>> Actually, it's possibly worse, since each failure in libata will
>>> generate 3-4 retries.

(note: libata does *not* generate retries for medium errors;
the looping is driven by the SCSI mid-layer code).

>> It really beats the alternative of a forced reboot
>> due to, say, superblock I/O failing because it happened
>> to get merged with an unrelated I/O which then failed..
>> Etc..
>>
>> Definitely an improvement.
>>
>> The number of retries is an entirely separate issue.
>> If we really care about it, then we should fix SD_MAX_RETRIES.
>>
>> The current value of 5 is *way* too high. It should be zero or one.
..
> I think that drives retry enough, we should leave retry at zero for
> normal (non-removable) drives. Should this be a policy we can set like
> we do with NCQ queue depth via /sys ?

Or perhaps we could have the mid-layer always "early-exit"
without retries for "MEDIUM_ERROR", and still do retries for the rest.

When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
as the drive itself has already done internal retries (libata uses the
"with retry" ATA opcodes for this).

But meanwhile, we still have the original issue too, where a single stray
bad sector can blow a system out of the water, because the mid-layer
currently aborts everything after it from a large merged request.

Thus the original patch from this thread. :)

Cheers

2007-01-31 15:10:59

by Alan

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
> as the drive itself has already done internal retries (libata uses the
> "with retry" ATA opcodes for this).

This depends on the firmware. Some of the "raid firmware" drives don't
appear to do retries in firmware.

> But meanwhile, we still have the original issue too, where a single stray
> bad sector can blow a system out of the water, because the mid-layer
> currently aborts everything after it from a large merged request.
>
> Thus the original patch from this thread. :)

Agreed

2007-01-31 15:13:25

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

James Bottomley wrote:
>
> For the MD case, this is what REQ_FAILFAST is for.

I cannot find where SCSI honours that flag. James?

And for that matter, even when I patch SCSI so that it *does* honour it,
I don't actually see the flag making it into the SCSI layer from above.

And I don't see where/how the block layer takes care when considering
merging FAILFAST/READA requests with non-FAILFAST/READA requests.
To me, it looks perfectly happy to add non-FAILFAST/READA bios
to a FAILFAST request, risking data loss if a lower layer decides
to honour the FAILFAST/READA flags.

So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)


2007-01-31 15:22:10

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Mark Lord wrote:
> James Bottomley wrote:
>>
>> For the MD case, this is what REQ_FAILFAST is for.
>
> I cannot find where SCSI honours that flag. James?

Scratch that thought.. SCSI honours it in scsi_end_request().

But I'm not certain that the block layer handles it correctly,
at least not in the 2.6.16/2.6.18 kernel that I'm working with today.

Cheers

2007-01-31 15:24:38

by James Bottomley

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Wed, 2007-01-31 at 10:13 -0500, Mark Lord wrote:
> James Bottomley wrote:
> >
> > For the MD case, this is what REQ_FAILFAST is for.

> I cannot find where SCSI honours that flag. James?

Er, it's in scsi_error.c:scsi_decide_disposition():

maybe_retry:

/* we requeue for retry because the error was retryable, and
* the request was not marked fast fail. Note that above,
* even if the request is marked fast fail, we still requeue
* for queue congestion conditions (QUEUE_FULL or BUSY) */
if ((++scmd->retries) <= scmd->allowed
&& !blk_noretry_request(scmd->request)) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
return NEEDS_RETRY;
} else {
/*
* no more retries - report this one back to upper level.
*/
return SUCCESS;
}

> And for that matter, even when I patch SCSI so that it *does* honour it,
> I don't actually see the flag making it into the SCSI layer from above.
>
> And I don't see where/how the block layer takes care when considering
> merge FAILFAST/READA requests with non FAILFAST/READA requests.
> To me, it looks perfectly happy to add non-FAILFAST/READA bios
> to a FAILFAST request, risking data loss if a lower-layer decides
> to honour the FAILFAST/READA flags.
>
> So it's a pretty Good Thing(tm) that SCSI doesn't currently honour it. ;)

James


2007-01-31 15:29:08

by Douglas Gilbert

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Ric Wheeler wrote:
>
>
> Jeff Garzik wrote:
>> Mark Lord wrote:
>>> Eric D. Mudama wrote:
>>>>
>>>> Actually, it's possibly worse, since each failure in libata will
>>>> generate 3-4 retries. With existing ATA error recovery in the
>>>> drives, that's about 3 seconds per retry on average, or 12 seconds
>>>> per failure. Multiply that by the number of blocks past the error
>>>> to complete the request..
>>>
>>> It really beats the alternative of a forced reboot
>>> due to, say, superblock I/O failing because it happened
>>> to get merged with an unrelated I/O which then failed..
>>> Etc..
>>
>>
>> FWIW -- speaking generally -- I think there are inevitable areas where
>> libata error handling combined with SCSI error handling results in
>> suboptimal error handling.
>>
>> Just creating a list of "<this behavior> should be handled <this way>,
>> but in reality is handled in <this silly way>" would be very helpful.
>
> I agree - Tejun has done a great job at giving us a great base. Next
> step is to get clarity on what the types of errors are and how to
> differentiate between them (and maybe how that would change by class of
> device?).
>
>>
>> Error handling is tough to get right, because the code is exercised so
>> infrequently. Tejun has actually done an above-average job here, by
>> making device probe, hotplug and other "exceptions" go through the
>> libata EH code, thereby exercising the EH code more than one might
>> normally assume.
>>
>> Some errors in libata probably should not be retried more than once,
>> when we have a definitive diagnosis. Suggestions for improvements are
>> welcome.
>>
>> Jeff
>
> One thing that we find really useful is to inject real errors into
> devices. Mark has some patches that let us inject media errors, we also
> bring back failed drives and run them through testing and occasionally
> get to use analyzers, etc to inject odd ball errors.

Ric,
Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
command support to flag a block as "uncorrectable". There
is no need to send bad "long" data to it and suppress the
disk's automatic re-allocation logic.

In the case of ATA it is the WRITE UNCORRECTABLE command.
In the case of SCSI it is the WR_UNCOR bit in the WRITE
LONG command.

It seems that due to SAT any useful capability in the ATA
command set will soon appear in the corresponding SCSI
command set, if it is not already there.

Doug Gilbert

2007-01-31 15:38:51

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Douglas Gilbert wrote:
>
> Ric,
> Both ATA (ATA8-ACS) and SCSI (SBC-3) have recently added
> command support to flag a block as "uncorrectable". There
> is no need to send bad "long" data to it and suppress the
> disk's automatic re-allocation logic.

That'll be useful in a couple of years, once drives that have it
become more common. For now, though, we're hacking current drives
using READ/WRITE LONG commands, with a corresponding patch to libata
to allow for the longer "sector size" involved.

Having real bad sectors, exactly where we want them on the media,
sure does make testing / fixing the EH mechanisms a lot more feasible.

Cheers

2007-01-31 16:35:36

by Ric Wheeler

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR



Alan wrote:
>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>> as the drive itself has already done internal retries (libata uses the
>> "with retry" ATA opcodes for this).
>
> This depends on the firmware. Some of the "raid firmware" drives don't
> appear to do retries in firmware.

I think that even for these devices, the actual drives behind the controller
will do retries. In any case, it would be reasonable to be able to set this
retry/no-retry via /sys to handle exceptional cases...

>
>> But meanwhile, we still have the original issue too, where a single stray
>> bad sector can blow a system out of the water, because the mid-layer
>> currently aborts everything after it from a large merged request.
>>
>> Thus the original patch from this thread. :)
>
> Agreed


2007-01-31 17:57:57

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Alan wrote:
>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>> as the drive itself has already done internal retries (libata uses the
>> "with retry" ATA opcodes for this).
>
> This depends on the firmware. Some of the "raid firmware" drives don't
> appear to do retries in firmware.

One way to tell if this is true, is simply to time how long
the failed operation takes. If the drive truly does not do retries,
then the media error should be reported more or less instantly
(assuming drive was already spun up).

If the failure takes more than a few hundred milliseconds to be reported,
or in this case 4-7 seconds typically, then we know the drive was doing
retries before it reported back.

I haven't seen any drive fail instantly yet.
Can anyone with those newfangled "RAID edition" drives try it
and report back? Oh.. you'll need a way to create a bad sector.
I've got patches and a command-line utility for the job.

If your drive supports "WRITE UNCORRECTABLE" ("hdparm -I", w/latest hdparm),
then the patches aren't needed.

>> But meanwhile, we still have the original issue too, where a single stray
>> bad sector can blow a system out of the water, because the mid-layer
>> currently aborts everything after it from a large merged request.
>>
>> Thus the original patch from this thread. :)
>
> Agreed

2007-01-31 18:13:26

by James Bottomley

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:
> Alan wrote:
> >> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
> >> as the drive itself has already done internal retries (libata uses the
> >> "with retry" ATA opcodes for this).
> >
> > This depends on the firmware. Some of the "raid firmware" drives don't
> > appear to do retries in firmware.
>
> One way to tell if this is true, is simply to time how long
> the failed operation takes. If the drive truly does not do retries,
> then the media error should be reported more or less instantly
> (assuming drive was already spun up).

Well, the simpler way (and one we have a hope of implementing) is to
examine the ASC/ASCQ codes to see if the error is genuinely unretryable.

I seem to have dropped the ball on this one: I thought I'd applied the
scsi_error.c pieces of this patch

http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885

Apparently I didn't, so I'll go back and put them in.

James


2007-01-31 18:38:01

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

James Bottomley wrote:
> On Wed, 2007-01-31 at 12:57 -0500, Mark Lord wrote:
>> Alan wrote:
>>>> When libata reports a MEDIUM_ERROR to us, we *know* it's non-recoverable,
>>>> as the drive itself has already done internal retries (libata uses the
>>>> "with retry" ATA opcodes for this).
>>> This depends on the firmware. Some of the "raid firmware" drives don't
>>> appear to do retries in firmware.
>> One way to tell if this is true, is simply to time how long
>> the failed operation takes. If the drive truly does not do retries,
>> then the media error should be reported more or less instantly
>> (assuming drive was already spun up).
>
> Well, the simpler way (and one we have a hope of implementing) is to
> examine the ASC/ASCQ codes to see if the error is genuinely unretryable.

My suggestion above was not for a kernel fix,
but rather just as a way of determining if drives
which claim "no retries" actually do them or not. :)

> I seem to have dropped the ball on this one in that the scsi_error.c
> pieces of this patch
>
> http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885
>
> I thought I'd applied. Apparently I didn't, so I'll go back and put
> them in.

Good. That would be a useful supplement to the patch I posted here.

Cheers

2007-02-01 20:02:26

by Mark Lord

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

James Bottomley wrote:
> On Tue, 2007-01-30 at 19:47 -0500, Mark Lord wrote:
>> Kernels since about 2.6.16 or so have been broken in this regard.
>> They "complete" the good sectors before the error,
>> and then fail the entire remaining portion of the request.
>
> What was the commit that introduced the change? ... I have a vague
> memory of it being deliberate.

I believe you made the first change in response to my prodding, back
when libata was not returning valid sense data (no LBA) for media errors.
The SCSI EH handling of that was rather poor,
and so having it not retry the remaining sectors was actually
a very good fix at the time.

But now, libata *does* return valid sense data for LBA/DMA drives,
and the workaround from circa 2.6.16 is no longer the best we can do.
Now that we know which sector failed, we ought to be able to skip
over it, and continue with the rest of the merged request.

One thing that could be even better than the patch below would be to
have it perhaps skip the entire bio that includes the failed sector,
rather than only the bad sector itself.

I think doing that might address most concerns expressed here.
Have you got an alternate suggestion, James?

..
>> Signed-off-by: Mark Lord <[email protected]>
>> ---
>> diff -u --recursive --new-file --exclude-from=linux_17//Documentation/dontdiff old/drivers/scsi/scsi_lib.c linux/drivers/scsi/scsi_lib.c
>> --- old/drivers/scsi/scsi_lib.c 2007-01-30 13:58:05.000000000 -0500
>> +++ linux/drivers/scsi/scsi_lib.c 2007-01-30 18:30:01.000000000 -0500
>> @@ -865,6 +865,12 @@
>> */
>> if (sense_valid && !sense_deferred) {
>> switch (sshdr.sense_key) {
>> + case MEDIUM_ERROR:
>> + // Bad sector. Fail it, and then continue the rest of the request:
>> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
>
> The sense key may have come with additional information. I think we want
> to parse that (if it exists) rather than just blindly failing the first
> sector of the request.
>
>> + cmd->retries = 0; // go around again..
>> + return;
>> + }
>
> This would drop through to the UNIT_ATTENTION case if scsi_end_request()
> fails ... I don't think that's correct.
>
>> case UNIT_ATTENTION:
>> if (cmd->device->removable) {
>> /* Detected disc change. Set a bit
>

2007-02-01 21:55:27

by James Bottomley

Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
> I believe you made the first change in response to my prodding at the time,
> when libata was not returning valid sense data (no LBA) for media errors.
> The SCSI EH handling of that was rather poor at the time,
> and so having it not retry the remaining sectors was actually
> a very good fix at the time.
>
> But now, libata *does* return valid sense data for LBA/DMA drives,
> and the workaround from circa 2.6.16 is no longer the best we can do.
> Now that we know which sector failed, we ought to be able to skip
> over it, and continue with the rest of the merged request.

We can ... the big concern with your approach, which you haven't
addressed, is the time factor. For most SCSI devices, returning a fatal
MEDIUM ERROR means we're out of remapping table, and also that there's
probably a bunch of sectors on the track that are now out. Thus, there
are almost always multiple sector failures. In Linux, the average
request size on a filesystem is around 64-128 KB; that's 128-256 sectors.
If we fail at the initial sector, we have to go through another 128-256
attempts, with the internal device retries, before we fail the entire
request. Some devices can take a second or so for each read before they
finally give up and decide they really can't read the sector, so you're
looking at 2-5 minutes before the machine finally fails this one
request ... and much worse for devices that retry more times.

> One thing that could be even better than the patch below,
> would be to have it perhaps skip the entire bio that includes
> the failed sector, rather than only the bad sector itself.

Er ... define "skip over the bio". A bio is simply a block
representation for a bunch of sg elements coming in to the elevator.
Mostly what we see in SCSI is a single bio per request, so skipping the
bio is really the current behaviour (to fail the rest of the request).

> I think doing that might address most concerns expressed here.
> Have you got an alternate suggestion, James?

James


2007-02-02 02:48:31

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

James Bottomley wrote:
> On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
>..
>> One thing that could be even better than the patch below,
>> would be to have it perhaps skip the entire bio that includes
>> the failed sector, rather than only the bad sector itself.
>
> Er ... define "skip over the bio". A bio is simply a block
> representation for a bunch of sg elements coming in to the elevator.

Exactly. Or rather, a block of sg_elements from a single point
of request, is it not?

> Mostly what we see in SCSI is a single bio per request, so skipping the
> bio is really the current behaviour (to fail the rest of the request).

Very good. That's what it's supposed to do.

But if each request contained only a single bio, then all of Jens'
work on IO scheduling would be for nothing, would it not?

In the case where a request consists of multiple bio's
which have been merged under a single request struct,
we really should give at least one attempt to each bio.

This way, in most cases, only the process that requested the
failed sector(s) will see an error, not the innocent victims
that happened to get merged onto the end. Which could be very
critical stuff (or not -- it could be quite random).

So the time factor works out to one disk I/O timeout per failed bio.
That's what would have happened with the NOP scheduler anyway.

On the systems I'm working with, I don't see huge numbers of bad sectors.
What they tend to show is just one or two bad sectors, widely scattered.

So:
>> I think doing that might address most concerns expressed here.
>> Have you got an alternate suggestion, James?

Cheers

2007-02-02 12:20:24

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR



James Bottomley wrote:

>On Thu, 2007-02-01 at 15:02 -0500, Mark Lord wrote:
>
>
>>I believe you made the first change in response to my prodding at the time,
>>when libata was not returning valid sense data (no LBA) for media errors.
>>The SCSI EH handling of that was rather poor at the time,
>>and so having it not retry the remaining sectors was actually
>>a very good fix at the time.
>>
>>But now, libata *does* return valid sense data for LBA/DMA drives,
>>and the workaround from circa 2.6.16 is no longer the best we can do.
>>Now that we know which sector failed, we ought to be able to skip
>>over it, and continue with the rest of the merged request.
>>
>>
>
>We can ... the big concern with your approach, which you haven't
>addressed is the time factor. For most SCSI devices, returning a fatal
>MEDIUM ERROR means we're out of remapping table, and also that there's
>probably a bunch of sectors on the track that are now out. Thus, there
>are almost always multiple sector failures. In Linux, the average
>request size on a filesystem is around 64-128 KB; that's 128-256 sectors.
>If we fail at the initial sector, we have to go through another 128-256
>attempts, with the internal device retries, before we fail the entire
>request. Some devices can take a second or so for each read before they
>finally give up and decide they really can't read the sector, so you're
>looking at 2-5 minutes before the machine finally fails this one
>request ... and much worse for devices that retry more times.
>
>

This is not the case on a read error - we commonly see transient errors
on reads from disks. What we push our vendors to do is to try to keep
the "worst case" response down to tens of seconds as they retry, etc.,
internally within the device. When they take that long (and they do),
adding retries up the stack can translate into minutes per sector.

The interesting point of this question is about the typical pattern of
IO errors. On a read, it is safe to assume that you will have issues
with some bounded number of adjacent sectors.

>
>
>>One thing that could be even better than the patch below,
>>would be to have it perhaps skip the entire bio that includes
>>the failed sector, rather than only the bad sector itself.
>>
>>
>
>Er ... define "skip over the bio". A bio is simply a block
>representation for a bunch of sg elements coming in to the elevator.
>Mostly what we see in SCSI is a single bio per request, so skipping the
>bio is really the current behaviour (to fail the rest of the request).
>
>
This is really a tricky one - what can happen when we fail merged IO
requests is really unpredictable behavior up at the application level,
since the IO error might not be at all relevant to my part of the
request. Merging can produce a request that is much larger than any
normal drive error.

I really like the idea of being able to set this kind of policy on a per
drive instance since what you want here will change depending on what
your system requirements are, what the system is trying to do (i.e.,
when trying to recover a failing but not dead yet disk, IO errors should
be as quick as possible and we should choose an IO scheduler that does
not combine IO's).

>
>
>>I think doing that might address most concerns expressed here.
>>Have you got an alternate suggestion, James?
>>
>>
>
>James
>
>

2007-02-02 14:29:54

by Alan

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

> The interesting point of this question is about the typical pattern of
> IO errors. On a read, it is safe to assume that you will have issues
> with some bounded number of adjacent sectors.

Which in theory you can get by asking the drive for the real sector size
from the ATA7 info. (We ought to dig this out more as it's relevant for
partition layout too).

> I really like the idea of being able to set this kind of policy on a per
> drive instance since what you want here will change depending on what
> your system requirements are, what the system is trying to do (i.e.,
> when trying to recover a failing but not dead yet disk, IO errors should
> be as quick as possible and we should choose an IO scheduler that does
> not combine IO's).

That seems to be arguing for a bounded "live" time including retry run
time for a command. That's also more intuitive for real time work and for
end user setup. "Either work or fail within n seconds"


Alan

2007-02-02 14:37:34

by Alan

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

> your system requirements are, what the system is trying to do (i.e.,
> when trying to recover a failing but not dead yet disk, IO errors should
> be as quick as possible and we should choose an IO scheduler that does
> not combine IO's).

If this is the right strategy for disk recovery for a given type of
device then this ought to be an automatic strategy. Most end users will
not have the knowledge to frob about in sysfs, and if the bad sector hits
at the wrong moment a sensible automatic recovery strategy is going to do
the right thing faster than human intervention, which may be some hours
away.

2007-02-02 14:53:36

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Fri, 2007-02-02 at 14:42 +0000, Alan wrote:
> > The interesting point of this question is about the typical pattern of
> > IO errors. On a read, it is safe to assume that you will have issues
> > with some bounded number of adjacent sectors.
>
> Which in theory you can get by asking the drive for the real sector size
> from the ATA7 info. (We ought to dig this out more as it's relevant for
> partition layout too).
>
> > I really like the idea of being able to set this kind of policy on a per
> > drive instance since what you want here will change depending on what
> > your system requirements are, what the system is trying to do (i.e.,
> > when trying to recover a failing but not dead yet disk, IO errors should
> > be as quick as possible and we should choose an IO scheduler that does
> > not combine IO's).
>
> That seems to be arguing for a bounded "live" time including retry run
> time for a command. That's also more intuitive for real time work and for
> end user setup. "Either work or fail within n seconds"

Actually, then I think perhaps we could use the allowed retries for this ...

So you would fail a single sector and count it against the retries.
Once you've done this allowed-retries times, you fail the rest of the
request.

James


2007-02-02 16:06:22

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Alan wrote:
>
> If this is the right strategy for disk recovery for a given type of
> device then this ought to be an automatic strategy. Most end users will
> not have the knowledge to frob about in sysfs, and if the bad sector hits
> at the wrong moment a sensible automatic recovery strategy is going to do
> the right thing faster than human intervention, which may be some hours
> away.

I think what we're really discussing here are the opposing aims
of enterprise RAID (the traditional SCSI market) versus desktop filesystems
(now the number one user of Linux SCSI, courtesy of libata).

With RAID, it's safe to suggest that a very fast, bounded exit from EH
is almost always what the customer / end-user wants, because the higher
level RAID management can then deal with the failed drive offline.

But for a desktop filesystem, failing out quickly and bulk-failing megabytes
around a couple of bad sectors is not good -- no redundancy, and this will
lead to unneeded data loss.

It's beginning to look like this needs to be run-time tuneable,
per block minor device (per-partition), through sysfs at a minimum.
The RAID tools could then automatically choose settings to bias more
towards an "instant exit" when errors are found, whereas a non-RAID
desktop could select a more reliable recovery strategy.

Right now, with James's patch (earlier in this thread), the whole scheme
is heavily weighted towards the RAID "instant exit" strategy, which is
making me quite nervous about the data on my laptop.

Cheers

2007-02-02 16:16:39

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR



James Bottomley wrote:
> On Fri, 2007-02-02 at 14:42 +0000, Alan wrote:
>
>>> The interesting point of this question is about the typical pattern of
>>> IO errors. On a read, it is safe to assume that you will have issues
>>> with some bounded number of adjacent sectors.
>>>
>> Which in theory you can get by asking the drive for the real sector size
>> from the ATA7 info. (We ought to dig this out more as it's relevant for
>> partition layout too).
>>

Actually, my point is that damage typically impacts a cluster of disk
sectors that are adjacent. Think of a drive that has junk on the platter
or some such thing - the contamination is likely to be localized.
>>
>>> I really like the idea of being able to set this kind of policy on a per
>>> drive instance since what you want here will change depending on what
>>> your system requirements are, what the system is trying to do (i.e.,
>>> when trying to recover a failing but not dead yet disk, IO errors should
>>> be as quick as possible and we should choose an IO scheduler that does
>>> not combine IO's).
>>>
>> That seems to be arguing for a bounded "live" time including retry run
>> time for a command. That's also more intuitive for real time work and for
>> end user setup. "Either work or fail within n seconds"
>>
>
> Actually, then I think perhaps we use the allowed retries for this ...
>
I really am not a big fan of retries for most modern drives - the drive
will try really, really hard to complete an IO for us, and multiple
retries can just delay the higher-level application's recovery.
> So you would fail a single sector and count it against the retries.
> When you've done this allowed retries times, you fail the rest of the
> request.
>
> James
>
>
I think that we need to play with some of these possible solutions on
some real-world bad drives and see how they react.

We should definitely talk more about this at the workshop ;-)

ric

2007-02-02 20:02:49

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Fri, Feb 02, 2007 at 11:06:19AM -0500, Mark Lord wrote:
> Alan wrote:
> >
> >If this is the right strategy for disk recovery for a given type of
> >device then this ought to be an automatic strategy. Most end users will
> >not have the knowledge to frob about in sysfs, and if the bad sector hits
> >at the wrong moment a sensible automatic recovery strategy is going to do
> >the right thing faster than human intervention, which may be some hours
> >away.
>
> I think something we seem to be discussing here are the opposite aims
> of enterprise RAID (traditional SCSI market) versus desktop filesystems
> (now the number one user of Linux SCSI, courtesy of libata).
>
> With RAID, it's safe to suggest that a very fast, bounded exit from EH
> is almost always what the customer / end-user wants, because the higher
> level RAID management can then deal with the failed drive offline.
>
> But for a desktop filesystem, failing out quickly and bulk-failing megabytes
> around a couple of bad sectors is not good -- no redundancy, and this will
> lead to unneeded data loss.
>
> It's beginning to look like this needs to be run-time tuneable,
> per block minor device (per-partition), through sysfs at a minimum.
> The RAID tools could then automatically choose settings to bias more
> towards an "instant exit" when errors are found, whereas a non-RAID
> desktop could select a more reliable recovery strategy.
>
> Right now, with Jame's patch (earlier in this thread), the whole scheme
> is heavily weighted towards the RAID "instant exit" strategy, which is
> making me quite nervous about the data on my laptop.

Also worth considering is that spending minutes trying to reread
damaged sectors is likely to accelerate your death spiral. More data
may be recoverable if you give up quickly in a first pass, then go
back and manually retry damaged bits with smaller I/Os.

Another approach is to have two separate knobs at the block device
level: retries per hard sector, and hard errors per MB. We don't want
the same number of retries for a 64KB block as for a 1MB block, and we
certainly don't want to do 2K retries in a row.

So for instance, I could have the first set to 1 and the second set to
16. For a 256KB read, this gets scaled down to 4, which means we'll
retry each sector once and fail the whole I/O when we hit the 5th
sector error.

More reasonable numbers might be 0 and 1, meaning "don't do OS-level
retries on sectors, and fail the whole I/O on the second sector error
(or immediately for smaller reads)".

It might also be informative to add a kernel message reporting if a retry
(in the non-transient cases) ever actually succeeds.

--
Mathematics is the supreme nostalgia of our time.

2007-02-02 20:17:36

by Douglas Gilbert

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Alan wrote:
>> The interesting point of this question is about the typical pattern of
>> IO errors. On a read, it is safe to assume that you will have issues
>> with some bounded number of adjacent sectors.
>
> Which in theory you can get by asking the drive for the real sector size
> from the ATA7 info. (We ought to dig this out more as it's relevant for
> partition layout too).
>
>> I really like the idea of being able to set this kind of policy on a per
>> drive instance since what you want here will change depending on what
>> your system requirements are, what the system is trying to do (i.e.,
>> when trying to recover a failing but not dead yet disk, IO errors should
>> be as quick as possible and we should choose an IO scheduler that does
>> not combine IO's).
>
> That seems to be arguing for a bounded "live" time including retry run
> time for a command. That's also more intuitive for real time work and for
> end user setup. "Either work or fail within n seconds"

Which is more or less the "streaming" feature set in recent
ATA standards. [Alas, streaming and NCQ/TCQ can't be done
with the same access.] SCSI has its Read Write Error Recovery
mode page which doesn't have timeouts but does have Read
and Write Retry Counts amongst other fields that control
the amount (and indirectly the time) of attempted error
recovery.

Doug Gilbert


2007-02-02 22:58:11

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

Matt Mackall wrote:
> ..
> Also worth considering is that spending minutes trying to reread
> damaged sectors is likely to accelerate your death spiral. More data
> may be recoverable if you give up quickly in a first pass, then go
> back and manually retry damaged bits with smaller I/Os.

All good input. But what was being debated here is not so much
the retrying of known-bad sectors, but rather what to do about
the kiBs or MiBs of sectors remaining in a merged request after
hitting a single bad sector mid-way.

Currently, SCSI just abandons the entire remaining workload.

Cheers

2007-02-02 23:20:10

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

On Fri, Feb 02, 2007 at 05:58:04PM -0500, Mark Lord wrote:
> Matt Mackall wrote:
> >..
> >Also worth considering is that spending minutes trying to reread
> >damaged sectors is likely to accelerate your death spiral. More data
> >may be recoverable if you give up quickly in a first pass, then go
> >back and manually retry damaged bits with smaller I/Os.
>
> All good input. But what was being debated here is not so much
> the retrying of known-bad sectors, but rather what to do about
> the kiBs or MiBs of sectors remaining in a merged request after
> hitting a single bad sector mid-way.

Yep, that's precisely what was addressed in the part you snipped.

My main point being that what to do about the remaining workload
should be dependent on the size of the I/O. If we encounter errors on
sectors 4,5,6,7,8.. of a 1MB request, we should have a threshold for
giving up. It's not unreasonable for that threshold to be larger than
1, but it should not be 2048.

And if we do the I/O as four 256KB requests, we should have
approximately the same number of retries (assuming the whole region's
bad).

--
Mathematics is the supreme nostalgia of our time.