2009-03-20 03:21:32

by Norman Diamond

[permalink] [raw]
Subject: Overagressive failing of disk reads, both LIBATA and IDE

For months I was wondering how a disk could do this:
dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails

It turns out the disk isn't doing that. Linux is. The old IDE drivers did
it, but with LIBATA the same thing happens to /dev/sda. In later examples
also, the same happens to /dev/sda as /dev/hda.

Here's what the disk is really responsible for:
dd if=/dev/hda of=/dev/null bs=512 skip=551562 count=1 # really fails

Here's Linux to blame again:
dd if=/dev/hda of=/dev/null bs=512 skip=551561 count=1 # fails

When the drive reports an uncorrectable media error, Linux correctly records
it in the log. But when the app didn't ask for that block, when blocks that
the app asked for were all read, Linux incorrectly reports failure to the
app.

I don't know how Linux decides how many blocks to read ahead, but no matter
how many it chooses, read ahead is read ahead. Go ahead and record it in
the log. I'd also like to suggest that if a user is logged in on the screen
(whether X11 or text) see if we can warn them that their disk is dying. But
don't return a failure to the app. If the blocks that the app asked for
were read, we should give them to the app, successfully.

Sheesh.

P.S.
One would expect this to persuade the hard drive to relocate the block:
dd if=/dev/zero of=/dev/hda bs=512 seek=551562 count=1
But it doesn't because Linux wants to read 4 blocks, modify 1, and write 4
blocks. The read fails.

One would expect this to persuade the hard drive to relocate the block:
dd if=/dev/zero of=/dev/hda bs=512 seek=551560 count=4
But it doesn't because the hard drive reports success. If an app tries to
read the bad sector again it still fails. So the drive has egregiously bad
firmware. That doesn't excuse Linux.

--------------------------------------
Power up the Internet with Yahoo! Toolbar.
http://pr.mail.yahoo.co.jp/toolbar/


2009-03-20 03:33:20

by Mark Lord

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

Norman Diamond wrote:
> For months I was wondering how a disk could do this:
> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
> dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails
>
> It turns out the disk isn't doing that. Linux is. The old IDE drivers did
> it, but with LIBATA the same thing happens to /dev/sda. In later examples
> also, the same happens to /dev/sda as /dev/hda.
..

You can blame me for the IDE driver not doing that properly.
But for libata, it's the SCSI layer.

I've been patching this for years for my clients,
and will be updating the patch soon-ish and trying
again to get it into upstream kernels.

Here's the (now ancient) 2.6.20 version for SLES10:

* * *

Allow SCSI to continue with the remaining blocks of a request
after encountering a media error. Otherwise, it may just fail
the entire request, even though some blocks were fine and needed
by a completely different process than the one that wanted the bad block(s).

Signed-off-by: Mark Lord <[email protected]>

--- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
+++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
@@ -888,6 +888,12 @@
*/
if (sense_valid && !sense_deferred) {
switch (sshdr.sense_key) {
+ case MEDIUM_ERROR:
+ /* Bad sector. Fail it, and then continue the rest of the request. */
+ if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
+ cmd->retries = 0; // go around again..
+ return;
+ }
case UNIT_ATTENTION:
if (cmd->device->removable) {
/* Detected disc change. Set a bit

2009-03-20 10:13:20

by Andrew Morton

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

On Thu, 19 Mar 2009 23:32:55 -0400 Mark Lord <[email protected]> wrote:

> Norman Diamond wrote:
> > For months I was wondering how a disk could do this:
> > dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
> > dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
> > dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails
> >
> > It turns out the disk isn't doing that. Linux is. The old IDE drivers did
> > it, but with LIBATA the same thing happens to /dev/sda. In later examples
> > also, the same happens to /dev/sda as /dev/hda.
> ..
>
> You can blame me for the IDE driver not doing that properly.
> But for libata, it's the SCSI layer.
>
> I've been patching this for years for my clients,
> and will be updating the patch soon-ish and trying
> again to get it into upstream kernels.
>
> Here's the (now ancient) 2.6.20 version for SLES10:
>
> * * *
>
> Allow SCSI to continue with the remaining blocks of a request
> after encountering a media error. Otherwise, it may just fail
> the entire request, even though some blocks were fine and needed
> by a completely different process than the one that wanted the bad block(s).
>
> Signed-off-by: Mark Lord <[email protected]>
>
> --- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
> +++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
> @@ -888,6 +888,12 @@
> */
> if (sense_valid && !sense_deferred) {
> switch (sshdr.sense_key) {
> + case MEDIUM_ERROR:
> + /* Bad sector. Fail it, and then continue the rest of the request. */
> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
> + cmd->retries = 0; // go around again..
> + return;
> + }
> case UNIT_ATTENTION:
> if (cmd->device->removable) {
> /* Detected disc change. Set a bit

Once upon a time the VFS would fall back to single page reads when a large
readahead request failed. That's probably still the case.

It was more by accident than by design, but it had (has) the desired effect?

2009-03-20 13:09:58

by Mark Lord

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

Andrew Morton wrote:
> On Thu, 19 Mar 2009 23:32:55 -0400 Mark Lord <[email protected]> wrote:
>
>> Norman Diamond wrote:
>>> For months I was wondering how a disk could do this:
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails
>>>
>>> It turns out the disk isn't doing that. Linux is. The old IDE drivers did
>>> it, but with LIBATA the same thing happens to /dev/sda. In later examples
>>> also, the same happens to /dev/sda as /dev/hda.
>> ..
>>
>> You can blame me for the IDE driver not doing that properly.
>> But for libata, it's the SCSI layer.
>>
>> I've been patching this for years for my clients,
>> and will be updating the patch soon-ish and trying
>> again to get it into upstream kernels.
>>
>> Here's the (now ancient) 2.6.20 version for SLES10:
>>
>> * * *
>>
>> Allow SCSI to continue with the remaining blocks of a request
>> after encountering a media error. Otherwise, it may just fail
>> the entire request, even though some blocks were fine and needed
>> by a completely different process than the one that wanted the bad block(s).
>>
>> Signed-off-by: Mark Lord <[email protected]>
>>
>> --- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
>> +++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
>> @@ -888,6 +888,12 @@
>> */
>> if (sense_valid && !sense_deferred) {
>> switch (sshdr.sense_key) {
>> + case MEDIUM_ERROR:
>> + /* Bad sector. Fail it, and then continue the rest of the request. */
>> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
>> + cmd->retries = 0; // go around again..
>> + return;
>> + }
>> case UNIT_ATTENTION:
>> if (cmd->device->removable) {
>> /* Detected disc change. Set a bit
>
> Once upon a time the VFS would fall back to single page reads when a large
> readahead request failed. That's probably still the case.
>
> It was more by accident than by design, but it had (has) the desired effect?
..

Ahh.. but the block layer efficiently merges adjacent sectors
from multiple processes into single requests. Generally a good thing,
that, but it does mean that one bad sector can currently trigger I/O
failures for several processes which aren't even interested in the
bad sector.

-ml

2009-03-21 14:22:29

by James Bottomley

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

On Thu, 2009-03-19 at 23:32 -0400, Mark Lord wrote:
> Norman Diamond wrote:
> > For months I was wondering how a disk could do this:
> > dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
> > dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
> > dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails

This basically means the drive doesn't report where in the requested
transfer the error occurred. If we have that information, we'd return
all sectors up to that LBA as OK and all at or beyond as -EIO, so the
readahead wouldn't matter.

> > It turns out the disk isn't doing that. Linux is. The old IDE drivers did
> > it, but with LIBATA the same thing happens to /dev/sda. In later examples
> > also, the same happens to /dev/sda as /dev/hda.
> ..
>
> You can blame me for the IDE driver not doing that properly.
> But for libata, it's the SCSI layer.
>
> I've been patching this for years for my clients,
> and will be updating the patch soon-ish and trying
> again to get it into upstream kernels.
>
> Here's the (now ancient) 2.6.20 version for SLES10:
>
> * * *
>
> Allow SCSI to continue with the remaining blocks of a request
> after encountering a media error. Otherwise, it may just fail
> the entire request, even though some blocks were fine and needed
> by a completely different process than the one that wanted the bad block(s).
>
> Signed-off-by: Mark Lord <[email protected]>
>
> --- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
> +++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
> @@ -888,6 +888,12 @@
> */
> if (sense_valid && !sense_deferred) {
> switch (sshdr.sense_key) {
> + case MEDIUM_ERROR:
> + /* Bad sector. Fail it, and then continue the rest of the request. */
> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
> + cmd->retries = 0; // go around again..
> + return;
> + }

But we've been over this. You can't apply something like this because
it ignores retries and chunks up the request a sector at a time. For
the enterprise that can increase failure time from a few seconds to
hours for 512k transfers.

Using the disk supplied data about where the error occurred (provided
the disk returns it) eliminates all the readahead problems like the one
above. Perhaps just turning of readahead for disks that don't supply
error location information would be a reasonable workaround?

James

2009-03-21 14:56:21

by Mark Lord

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

James Bottomley wrote:
> On Thu, 2009-03-19 at 23:32 -0400, Mark Lord wrote:
..
>> Allow SCSI to continue with the remaining blocks of a request
>> after encountering a media error. Otherwise, it may just fail
>> the entire request, even though some blocks were fine and needed
>> by a completely different process than the one that wanted the bad block(s).
>>
>> Signed-off-by: Mark Lord <[email protected]>
>>
>> --- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
>> +++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
>> @@ -888,6 +888,12 @@
>> */
>> if (sense_valid && !sense_deferred) {
>> switch (sshdr.sense_key) {
>> + case MEDIUM_ERROR:
>> + /* Bad sector. Fail it, and then continue the rest of the request. */
>> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
>> + cmd->retries = 0; // go around again..
>> + return;
>> + }
>
> But we've been over this. You can't apply something like this because
> it ignores retries and chunks up the request a sector at a time. For
> the enterprise that can increase failure time from a few seconds to
> hours for 512k transfers.
>
> Using the disk supplied data about where the error occurred (provided
> the disk returns it) eliminates all the readahead problems like the one
> above. Perhaps just turning of readahead for disks that don't supply
> error location information would be a reasonable workaround?
..

The patch *does* use the disk supplied data about the error,
and returns success for sectors up to that point. Where it differs
from mainline SCSI, is that it then continues attempting the remaining
2000 sectors (or whatever) of the request, hoping that not all of
them are bad.

It's not perfect, and likely no longer applies/works cleanly on the
latest kernels. And perhaps it really ought to fail a "block" rather
than a "sector" at a time as it seeks clean media after the fault.

I think it could be even more clever, using a binary search or something
on the remaining chunks, so that it takes less time to skip over a huge
stretch of sequential bad sectors (scratch on media). That would probably
be best all around.

Cheers

2009-03-21 15:01:59

by Mark Lord

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

Mark Lord wrote:
..
> I think it could be even more clever, using a binary search or something
> on the remaining chunks, so that it takes less time to skip over a huge
> stretch of sequential bad sectors (scratch on media). That would probably
> be best all around.
..

Heh.. pity that spinning media cannot do a "READ BACKWARDS" type of operation.

:)

2009-03-21 15:08:57

by James Bottomley

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

On Sat, 2009-03-21 at 10:55 -0400, Mark Lord wrote:
> James Bottomley wrote:
> > On Thu, 2009-03-19 at 23:32 -0400, Mark Lord wrote:
> ..
> >> Allow SCSI to continue with the remaining blocks of a request
> >> after encountering a media error. Otherwise, it may just fail
> >> the entire request, even though some blocks were fine and needed
> >> by a completely different process than the one that wanted the bad block(s).
> >>
> >> Signed-off-by: Mark Lord <[email protected]>
> >>
> >> --- linux-2.6.16.60-0.6/drivers/scsi/scsi_lib.c 2008-03-10 13:46:03.000000000 -0400
> >> +++ linux/drivers/scsi/scsi_lib.c 2008-03-21 11:54:09.000000000 -0400
> >> @@ -888,6 +888,12 @@
> >> */
> >> if (sense_valid && !sense_deferred) {
> >> switch (sshdr.sense_key) {
> >> + case MEDIUM_ERROR:
> >> + /* Bad sector. Fail it, and then continue the rest of the request. */
> >> + if (scsi_end_request(cmd, 0, cmd->device->sector_size, 1) == NULL) {
> >> + cmd->retries = 0; // go around again..
> >> + return;
> >> + }
> >
> > But we've been over this. You can't apply something like this because
> > it ignores retries and chunks up the request a sector at a time. For
> > the enterprise that can increase failure time from a few seconds to
> > hours for 512k transfers.
> >
> > Using the disk supplied data about where the error occurred (provided
> > the disk returns it) eliminates all the readahead problems like the one
> > above. Perhaps just turning of readahead for disks that don't supply
> > error location information would be a reasonable workaround?
> ..
>
> The patch *does* use the disk supplied data about the error,
> and returns success for sectors up to that point. Where it differs
> from mainline SCSI, is that it then continues attempting the remaining
> 2000 sectors (or whatever) of the request, hoping that not all of
> them are bad.

Um, but so does SCSI without your patch ... that was my point.

> It's not perfect, and likely no longer applies/works cleanly on the
> latest kernels. And perhaps it really ought to fail a "block" rather
> than a "sector" at a time as it seeks clean media after the fault.
>
> I think it could be even more clever, using a binary search or something
> on the remaining chunks, so that it takes less time to skip over a huge
> stretch of sequential bad sectors (scratch on media). That would probably
> be best all around.

I don't really think we'd do that. The problem, as you say is request
combination. I think if we really wanted to do this, we'd have block do
it. Each separate request that's merged gets a separate bio, and block
already has capabilities to pick up per bio errors, so we'd do the
partial completion of the failing bio then skip to the next one in the
request to try. That would completely solve both readahead problems and
request merging ones.

James

2009-03-21 15:09:51

by Alan

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

> Using the disk supplied data about where the error occurred (provided
> the disk returns it) eliminates all the readahead problems like the one

ATA disks do provide sector information generally, as do CD-ROMs, in fact
we actually decode it for error reporting so probably all the bits are
there to improve any reporting for most read side cases.

>From some of the traces I have been debugging I am not convinced the scsi
mid layer does the right thing any more. It uses to be handling CD-R
media (where the end of media is not well defined) but nowdays seems to
be reporting errors even when told that the I/O partly succeeded. I need
to debug that case further however but as I don't have one of the problem
drives its a bit tricky.

> above. Perhaps just turning of readahead for disks that don't supply
> error location information would be a reasonable workaround?

Not really. From a performance perspective Mark's patch is vastly
superior because it punishes the incredibly rare error case not the
routine working performance. Avoiding the need to do either would be even
better - as would fixing the block layer not to mess up retries in this
situation.

For low speed devices like MMC cards and flash it might make sense not to
merge unrelated requests however so that only the relevant one fails.

Alan

2009-03-21 15:18:43

by James Bottomley

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

On Sat, 2009-03-21 at 15:10 +0000, Alan Cox wrote:
> > Using the disk supplied data about where the error occurred (provided
> > the disk returns it) eliminates all the readahead problems like the one
>
> ATA disks do provide sector information generally, as do CD-ROMs, in fact
> we actually decode it for error reporting so probably all the bits are
> there to improve any reporting for most read side cases.
>
> >From some of the traces I have been debugging I am not convinced the scsi
> mid layer does the right thing any more. It uses to be handling CD-R
> media (where the end of media is not well defined) but nowdays seems to
> be reporting errors even when told that the I/O partly succeeded. I need
> to debug that case further however but as I don't have one of the problem
> drives its a bit tricky.

OK, so in modern kernels, this is done in the ULD ... specifically for
disks in sd.c:sd_done and for CD/DVD in sr.c:sr_done().

The sr.c one looks crufty in that I don't think it handles descriptor
sense at all, so perhaps it should be updated to match the sd one?

> > above. Perhaps just turning of readahead for disks that don't supply
> > error location information would be a reasonable workaround?
>
> Not really. From a performance perspective Mark's patch is vastly
> superior because it punishes the incredibly rare error case not the
> routine working performance. Avoiding the need to do either would be even
> better - as would fixing the block layer not to mess up retries in this
> situation.
>
> For low speed devices like MMC cards and flash it might make sense not to
> merge unrelated requests however so that only the relevant one fails.

See other email for suggestion how to do this at the block level.

James

2009-03-21 15:22:51

by Mark Lord

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

James Bottomley wrote:
> On Sat, 2009-03-21 at 10:55 -0400, Mark Lord wrote:
..
>> The patch *does* use the disk supplied data about the error,
>> and returns success for sectors up to that point. Where it differs
>> from mainline SCSI, is that it then continues attempting the remaining
>> 2000 sectors (or whatever) of the request, hoping that not all of
>> them are bad.
>
> Um, but so does SCSI without your patch ... that was my point.
..

Does it? I thought it still just failed everything after the first
bad sector? Kudos are due if that's working now.

..
> I don't really think we'd do that. The problem, as you say is request
> combination. I think if we really wanted to do this, we'd have block do
> it. Each separate request that's merged gets a separate bio, and block
> already has capabilities to pick up per bio errors, so we'd do the
> partial completion of the failing bio then skip to the next one in the
> request to try. That would completely solve both readahead problems and
> request merging ones.
..

Yeah, that's a reasonable way to tackle. And you're right, we *did* discuss
this back two years ago. It just never made it as far as new code. :)

Something else that might be good here, would be to have the md layer
pass down a (per-bio?) flag indicating whether it has redundacy capability
or not for the I/O. Eg. healthy RAID1/4/5/10 etc.. would set the flag,
and SCSI could then just abort immediately on a bad sector, with NO retries
beyond the first bad one.

On RAID0, or a degraded (no spares) RAID1 etc, it would not set the flag,
so SCSI would try harder to recover the data, as we're discussing above.

This sounds like FAST_FAIL, but is different. And the hint needs to
come from the upper layer that is performing redundancy.


2009-03-22 02:16:19

by Norman Diamond

[permalink] [raw]
Subject: Re: Overagressive failing of disk reads, both LIBATA and IDE

James Bottomley wrote:
> On Thu, 2009-03-19 at 23:32 -0400, Mark Lord wrote:
>> Norman Diamond wrote:
>>> For months I was wondering how a disk could do this:
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=4 # succeeds
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551544 count=4 # succeeds
>>> dd if=/dev/hda of=/dev/null bs=512 skip=551540 count=8 # fails
>
> This basically means the drive doesn't report where in the requested
> transfer the error occurred. If we have that information, we'd return all
> sectors up to that LBA as OK and all at or beyond as -EIO, so the
> readahead wouldn't matter.

That's exactly what my submission suggested Linux should do, because that's
exactly what Linux isn't doing. The defective sector number is 551562.
Linux makes varying decisions on how much to read ahead, and when its
readahead includes the defective sector Linux doesn't do what you and I want
it to do.

The way I discovered the actual defective sector number is that one time
last week I noticed it in dmesg output. After noticing it, investigation
became a lot easier. I don't remember if I noticed it for hda (old IDE) or
sda (LIBATA) but either way the drive put the defective sector number in its
error report. When readahead was long enough to reach sector 551562 the
drive told the PC.

Regarding other threads of this discussion, I/Os are not being merged with
other processes. I'm running either Slax or Knoppix from a live CD, and the
only one accessing the hard drive is me.

In cases where Slax or Knoppix includes a sufficiently recent hdparm, I
could attempt reads of individual sectors. 551561 is OK. 551563 is OK.
551562 has an uncorrectable media error. I had mentioned that the drive has
egregiously bad firmware (which doesn't excuse Linux). That includes an
effort to relocate the sector by using hdparm to write sector 551562,
whereupon Hitachi drives me crazy. The drive reports success but subsequent
reads still fail.

--------------------------------------
Power up the Internet with Yahoo! Toolbar.
http://pr.mail.yahoo.co.jp/toolbar/