LinuxLists.cc - RFC: short reads on block devices

2010-12-17 16:48:36

Subject: RFC: short reads on block devices

Recently while testing with the scsi_debug driver
I was able to trick the block layer into reading
random data which the block layer thought was
valid ***.

Best to start with an example, say LBA ** 4660 has
an unrecoverable error (aka medium error) and
the block layer fires off a SCSI READ for 8
blocks (512 byte variety) at LBA 4656. The response
will be a medium error with the sense buffer info
field indicating LBA 4660. Now are the 4 blocks
that precede it (i.e. LBA 4656 to 4659) possibly
sitting in the data-in buffer and valid??

The block layer thinks they are. This is what my
term "short read" in the title alludes to. So I put
this question to the T10 reflector:
http://www.t10.org/t10r.htm
titled "sbc: reading blocks prior to a medium error".
And the answers were pretty clear. And the one from
George Penokie of LSI is interesting because Linux's
block layer assumption breaks some of LSI's equipment.

On the other hand, big array vendors and database vendors
want exactly what the block layer is doing at the moment.
So those guys don't want a change. [Please correct me
if that is too sweeping.] Also I'm informed some other
OSes do this as well.

I would like to propose a solution, at least in the SCSI
subsystem context. The 'resid' field was added 11 years
ago and is used by a HBA driver to indicate how many bytes
less than requested were placed in the scatter gather
list (i.e. the data-in buffer). It defaults to zero
(meaning all requested bytes have been read). Usually
for a medium error one would not bother setting resid
(so resid would remain 0). Somewhat surprisingly the
block layer has always ignored resid. I propose in the
case of a short read caused by a MEDIUM ERROR the block
layer checks resid. And if resid equals the requested
number of bytes then that means no data in the scatter
gather list is valid. So the block layer should act on
this information.

To this end I propose to change the scsi_debug driver
to set resid equal to bufflen when it simulates a
medium error.

Changes in the block layer and drivers from vendors who
want the strict "T10" handling of medium errors would
also be required. Maybe the USB mass storage (and UAS)
folks might also check if this impacts them.

Doug Gilbert

** LBA is Logical Block Address (origin 0)

*** Using 'modprobe scsi_debug opts=2' will set up a
pseudo device which the example in the second
paragraph is based on. Write a known pattern into the
pseudo device (only 8 MB long) and use dd to read
that device. Due to the 4 KB blocks used by the block
layer, the read ends at LBA 4655. In my tests LBAs
4576 through to 4655 are corrupted (i.e. not what is
actually on the pseudo device).
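
[Illustration] The arithmetic behind the footnote above can be sketched in user-space C. This is not kernel code: the helper names are hypothetical, and the sketch assumes 512-byte logical blocks and that the block layer reads in aligned 4 KB (8-block) chunks, as the footnote describes.

```c
#include <assert.h>
#include <stdint.h>

#define LOGICAL_BLOCK_SIZE 512u  /* bytes per LBA, as in the example */
#define BLOCKS_PER_CHUNK   8u    /* 4 KB block-layer chunk / 512 bytes */

/* Hypothetical helper: first LBA of the aligned 4 KB chunk holding bad_lba. */
static uint64_t chunk_start_lba(uint64_t bad_lba)
{
    return (bad_lba / BLOCKS_PER_CHUNK) * BLOCKS_PER_CHUNK;
}

/* Hypothetical helper: last LBA that lies before the chunk holding bad_lba. */
static uint64_t last_clean_lba(uint64_t bad_lba)
{
    uint64_t start = chunk_start_lba(bad_lba);

    assert(start > 0);  /* otherwise no chunk precedes the bad one */
    return start - 1;
}

/* Hypothetical helper: bytes readable before the chunk holding bad_lba. */
static uint64_t bytes_before_bad_chunk(uint64_t bad_lba)
{
    return chunk_start_lba(bad_lba) * LOGICAL_BLOCK_SIZE;
}
```

With bad_lba = 4660 the chunk starts at LBA 4656, so the last LBA a chunk-aligned reader can return cleanly is 4655, and 4656 * 512 = 2383872 bytes precede the bad chunk.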

2010-12-17 20:37:12

by James Bottomley

Subject: Re: RFC: short reads on block devices

On Fri, 2010-12-17 at 11:48 -0500, Douglas Gilbert wrote:
> Recently while testing with the scsi_debug driver
> I was able to trick the block layer into reading
> random data which the block layer thought was
> valid ***.
>
> Best to start with an example, say LBA ** 4660 has
> an unrecoverable error (aka medium error) and
> the block layer fires off a SCSI READ for 8
> blocks (512 byte variety) at LBA 4656. The response
> will be a medium error with the sense buffer info
> field indicating LBA 4660. Now are the 4 blocks
> that precede it (i.e. LBA 4656 to 4659) possibly
> sitting in the data-in buffer and valid??
>
> The block layer thinks they are. This is what my
> term "short read" in the title alludes to. So I put
> this question to the T10 reflector:
> http://www.t10.org/t10r.htm
> titled "sbc: reading blocks prior to a medium error".
> And the answers were pretty clear. And the one from
> George Penokie of LSI is interesting because Linux's
> block layer assumption breaks some of LSI's equipment.

Well, unsurprisingly, I was aware of the issue via Novell customer
interactions. Since you've outed LSI, we can discuss it openly.

The fact is that for medium errors, every other array returns valid data
up to the erroring sector.

> On the other hand, big array vendors and database vendors
> want exactly what the block layer is doing at the moment.
> So those guys don't want a change. [Please correct me
> if that is too sweeping.] Also I'm informed some other
> OSes do this as well.

Plus all disk devices transfer up to the error sector. Additionally,
Martin Petersen uses the same code for DIF and he's secured external
agreement from the DIF based arrays that notwithstanding the ambiguity
in the SCSI standards, all DIF arrays return valid data up to the sector
with the DIF error.

> I would like to propose a solution, at least in the SCSI
> subsystem context. The 'resid' field was added 11 years
> ago and is used by a HBA driver to indicate how many bytes
> less than requested were placed in the scatter gather
> list (i.e. the data-in buffer). It defaults to zero
> (meaning all requested bytes have been read). Usually
> for a medium error one would not bother setting resid
> (so resid would remain 0). Somewhat surprisingly the
> block layer has always ignored resid. I propose in the
> case of a short read caused by a MEDIUM ERROR the block
> layer checks resid. And if resid equals the requested
> number of bytes then that means no data in the scatter
> gather list is valid. So the block layer should act on
> this information.
>
> To this end I propose to change the scsi_debug driver
> to set resid equal to bufflen when it simulates a
> medium error.
>
> Changes in the block layer and drivers from vendors who
> want the strict "T10" handling of medium errors would
> also be required. Maybe the USB mass storage (and UAS)
> folks might also check if this impacts them.

OK, so I checked, and I think all of the major in-use HBA drivers today
do set the residue, so I'd be reasonably happy with a modification like
the following. It basically believes the lower of either the
transferred data or the listed error (assuming the listed error is
valid ... if it's invalid, we still assume we can't trust anything).
This should mean that HBA drivers that set the residue work for all
arrays and those that don't work as they do today.

James

---

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 9564961..d41eaa2 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1175,6 +1175,12 @@ static unsigned int sd_completed_bytes(struct scsi_cmnd *scmd)
 	u64 end_lba = blk_rq_pos(scmd->request) + (scsi_bufflen(scmd) / 512);
 	u64 bad_lba;
 	int info_valid;
+	/*
+	 * resid is optional but mostly filled in.  When it's unused,
+	 * its value is zero, so we assume the whole buffer transferred
+	 */
+	unsigned int transferred = scsi_bufflen(scmd) - scsi_get_resid(scmd);
+	unsigned int good_bytes;
 
 	if (scmd->request->cmd_type != REQ_TYPE_FS)
 		return 0;
@@ -1208,7 +1214,8 @@ static unsigned int sd_completed_bytes(struct scsi_cmnd *scmd)
 	/* This computation should always be done in terms of
 	 * the resolution of the device's medium.
 	 */
-	return (bad_lba - start_lba) * scmd->device->sector_size;
+	good_bytes = (bad_lba - start_lba) * scmd->device->sector_size;
+	return min(good_bytes, transferred);
 }

/**
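
[Illustration] The effect of the patch can be modelled outside the kernel. The sketch below is a simplified user-space stand-in, not the sd driver itself: `struct read_result` and `completed_bytes()` are hypothetical names, all LBAs are assumed to be in units of the device's sector size, and the out-of-range check on the reported LBA is an assumption of this sketch rather than something shown in the hunks above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the patch consults (not a kernel type). */
struct read_result {
    uint64_t start_lba;   /* first LBA of the request */
    uint32_t bufflen;     /* bytes requested */
    uint32_t resid;       /* bytes NOT transferred; 0 if the HBA driver never set it */
    int      info_valid;  /* non-zero if the sense INFORMATION field is usable */
    uint64_t bad_lba;     /* LBA reported in the sense INFORMATION field */
    uint32_t sector_size; /* device logical block size in bytes */
};

/* Bytes to treat as good after a medium error: min(bytes before bad LBA, bytes transferred). */
static uint32_t completed_bytes(const struct read_result *r)
{
    uint64_t end_lba, good_bytes;
    uint32_t transferred;

    assert(r->sector_size > 0 && r->resid <= r->bufflen);
    transferred = r->bufflen - r->resid;

    if (!r->info_valid)
        return 0;  /* no trustworthy error location: trust nothing */

    end_lba = r->start_lba + r->bufflen / r->sector_size;
    if (r->bad_lba < r->start_lba || r->bad_lba >= end_lba)
        return 0;  /* assumed check: reported LBA lies outside the request */

    good_bytes = (r->bad_lba - r->start_lba) * r->sector_size;
    return good_bytes < transferred ? (uint32_t)good_bytes : transferred;
}
```

For the 8-block read at LBA 4656 with the error at LBA 4660, a driver that leaves resid at 0 yields 2048 good bytes (today's behaviour), while a driver that sets resid equal to bufflen yields 0.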

2010-12-18 01:19:55

by Douglas Gilbert

Subject: Re: RFC: short reads on block devices

On 10-12-17 03:36 PM, James Bottomley wrote:
> On Fri, 2010-12-17 at 11:48 -0500, Douglas Gilbert wrote:
>> Recently while testing with the scsi_debug driver
>> I was able to trick the block layer into reading
>> random data which the block layer thought was
>> valid ***.
>>
>> Best to start with an example, say LBA ** 4660 has
>> an unrecoverable error (aka medium error) and
>> the block layer fires off a SCSI READ for 8
>> blocks (512 byte variety) at LBA 4656. The response
>> will be a medium error with the sense buffer info
>> field indicating LBA 4660. Now are the 4 blocks
>> that precede it (i.e. LBA 4656 to 4659) possibly
>> sitting in the data-in buffer and valid??
>>
>> The block layer thinks they are. This is what my
>> term "short read" in the title alludes to. So I put
>> this question to the T10 reflector:
>> http://www.t10.org/t10r.htm
>> titled "sbc: reading blocks prior to a medium error".
>> And the answers were pretty clear. And the one from
>> George Penokie of LSI is interesting because Linux's
>> block layer assumption breaks some of LSI's equipment.
>
> Well, unsurprisingly, I was aware of the issue via Novell customer
> interactions. Since you've outed LSI, we can discuss it openly.
>
> The fact is that for medium errors, every other array returns valid data
> up to the erroring sector.
>
>> On the other hand, big array vendors and database vendors
>> want exactly what the block layer is doing at the moment.
>> So those guys don't want a change. [Please correct me
>> if that is too sweeping.] Also I'm informed some other
>> OSes do this as well.
>
> Plus all disk devices transfer up to the error sector. Additionally,
> Martin Petersen uses the same code for DIF and he's secured external
> agreement from the DIF based arrays that nothwithstanding the ambiguity
> in the SCSI standards, all DIF arrays return valid data up to the sector
> with the DIF error.
>
>> I would like to propose a solution, at least in the SCSI
>> subsystem context. The 'resid' field was added 11 years
>> ago and is used by a HBA driver to indicate how many bytes
>> less than requested were placed in the scatter gather
>> list (i.e. the data-in buffer). It defaults to zero
>> (meaning all requested bytes have been read). Usually
>> for a medium error one would not bother setting resid
>> (so resid would remain 0). Somewhat surprisingly the
>> block layer has always ignored resid. I propose in the
>> case of a short read caused by a MEDIUM ERROR the block
>> layer checks resid. And if resid equals the requested
>> number of bytes then that means no data in the scatter
>> gather list is valid. So the block layer should act on
>> this information.
>>
>> To this end I propose to change the scsi_debug driver
>> to set resid equal to bufflen when it simulates a
>> medium error.
>>
>> Changes in the block layer and drivers from vendors who
>> want the strict "T10" handling of medium errors would
>> also be required. Maybe the USB mass storage (and UAS)
>> folks might also check if this impacts them.
>
> OK, so I checked, and I think all of the major in-use HBA drivers today
> do set the residue, so I'd be reasonably happy with a modification like
> the following. It basically believes the lower of either the
> transferred data or the listed error (assuming the listed error is
> valid ... if it's invalid, we still assume we can't trust anything).
> This should mean that HBA drivers that set the residue work for all
> arrays and those that don't work as they do today.
>
> James
>
> ---
>
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 9564961..d41eaa2 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1175,6 +1175,12 @@ static unsigned int sd_completed_bytes(struct scsi_cmnd *scmd)
> u64 end_lba = blk_rq_pos(scmd->request) + (scsi_bufflen(scmd) / 512);
> u64 bad_lba;
> int info_valid;
> + /*
> + * resid is optional but mosly filled in. When it's unused,
> + * its value is zero, so we assume the whole buffer transferred
> + */
> + unsigned int transferred = scsi_bufflen(scmd) - scsi_get_resid(scmd);
> + unsigned int good_bytes;
>
> if (scmd->request->cmd_type != REQ_TYPE_FS)
> return 0;
> @@ -1208,7 +1214,8 @@ static unsigned int sd_completed_bytes(struct scsi_cmnd *scmd)
> /* This computation should always be done in terms of
> * the resolution of the device's medium.
> */
> - return (bad_lba - start_lba) * scmd->device->sector_size;
> + good_bytes = (bad_lba - start_lba) * scmd->device->sector_size;
> + return min(good_bytes, transferred);
> }
>
> /**

James,
This patch to the sd driver together with the one I made
to the scsi_debug driver (sent to the linux-scsi list)
fixes the corruption problem in lk 2.6.36.

Below is a dd command and the resulting log file
output. The log shows the actual SCSI READs sent
to the scsi_debug pseudo device (the medium error
is at LBA 0x1234):

# dd if=/dev/sdc of=ttt.img
dd: reading `/dev/sdc': Input/output error
4656+0 records in
4656+0 records out
2383872 bytes (2.4 MB) copied, 0.0722803 s, 33.0 MB/s

....
scsi_debug: cmd 28 00 00 00 0f e0 00 01 00 00
scsi_debug: cmd 28 00 00 00 10 e0 00 01 00 00
scsi_debug: cmd 28 00 00 00 11 e0 00 01 00 00
scsi_debug: [sense_key,asc,ascq]: [0x3,0x11,0x0]
scsi_debug: <8 0 0 0> non-zero result=0x8000002
sd 8:0:0:0: [sdc] Unhandled sense code
sd 8:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 8:0:0:0: [sdc] Sense Key : Medium Error [current]
Info fld=0x1234
sd 8:0:0:0: [sdc] Add. Sense: Unrecovered read error
sd 8:0:0:0: [sdc] CDB: Read(10): 28 00 00 00 11 e0 00 01 00 00
quiet_error: 23 callbacks suppressed
scsi_debug: cmd 28 00 00 00 12 e0 00 01 00 00
scsi_debug: cmd 28 00 00 00 11 e0 00 00 08 00
scsi_debug: cmd 28 00 00 00 11 e8 00 00 08 00
scsi_debug: cmd 28 00 00 00 11 f0 00 00 08 00
scsi_debug: cmd 28 00 00 00 11 f8 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 00 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 08 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 10 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 18 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 20 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 28 00 00 08 00
scsi_debug: cmd 28 00 00 00 12 30 00 00 08 00
scsi_debug: [sense_key,asc,ascq]: [0x3,0x11,0x0]
scsi_debug: <8 0 0 0> non-zero result=0x8000002
sd 8:0:0:0: [sdc] Unhandled sense code
sd 8:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 8:0:0:0: [sdc] Sense Key : Medium Error [current]
Info fld=0x1234
sd 8:0:0:0: [sdc] Add. Sense: Unrecovered read error

Almost perfect. It was reading 256 blocks at a time and
hit trouble at LBA 0x11e0. So then it reads 4 KB (8
blocks) at a time starting at LBA 0x11e0 until it hits
the medium error again. Well it didn't need to read
the medium error again. And what was it thinking when
it read 256 blocks starting at LBA 0x12e0 (i.e. after
the medium error)? Correct but not optimal.
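
[Illustration] The "cmd 28 ..." lines in the log are READ(10) CDBs: byte 0 is the opcode (0x28), bytes 2 to 5 hold the big-endian LBA, and bytes 7 and 8 hold the big-endian transfer length in blocks. A minimal user-space decoder, with hypothetical names (`struct read10`, `decode_read10()`), might look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two READ(10) fields of interest. */
struct read10 {
    uint32_t lba;      /* starting logical block address */
    uint16_t nblocks;  /* transfer length in logical blocks */
};

/* Decode a READ(10) CDB. Returns 0 on success, -1 if it is not a 10-byte READ(10). */
static int decode_read10(const uint8_t *cdb, size_t len, struct read10 *out)
{
    assert(cdb != NULL && out != NULL);

    if (len != 10 || cdb[0] != 0x28)
        return -1;

    out->lba = ((uint32_t)cdb[2] << 24) | ((uint32_t)cdb[3] << 16) |
               ((uint32_t)cdb[4] << 8)  |  (uint32_t)cdb[5];
    out->nblocks = (uint16_t)(((unsigned)cdb[7] << 8) | cdb[8]);
    return 0;
}
```

Applied to the log, "28 00 00 00 11 e0 00 01 00 00" is a 256-block read at LBA 0x11e0 (4576), and "28 00 00 00 12 30 00 00 08 00" is an 8-block read at LBA 0x1230 (4656), the chunk that contains the bad LBA 0x1234 (4660).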

Moving right along, these words are written in the t13.org
document: d2015r3 otherwise known as ACS-2 for ATA
(including SATA) disks:

"If an unrecoverable error occurs while the device is
processing this command, then the device shall return
command completion with the Error bit set to one and
the LBA field set to the LBA of the logical sector where
the first unrecoverable error occurred. The validity
of the data transferred is indeterminate."

That is for the READ DMA EXT command. When I check
drivers/ata/libata-scsi.c it doesn't touch the resid
field. That means resid is at its default value of
zero implying the data-in buffer is "all good". So
do we know what common practice is for ATA disks in
this situation?

Doug Gilbert

2010-12-20 09:38:23

by Hannes Reinecke

Subject: Re: RFC: short reads on block devices

On 12/17/2010 09:36 PM, James Bottomley wrote:
> On Fri, 2010-12-17 at 11:48 -0500, Douglas Gilbert wrote:
>> Recently while testing with the scsi_debug driver
>> I was able to trick the block layer into reading
>> random data which the block layer thought was
>> valid ***.
>>
>> Best to start with an example, say LBA ** 4660 has
>> an unrecoverable error (aka medium error) and
>> the block layer fires off a SCSI READ for 8
>> blocks (512 byte variety) at LBA 4656. The response
>> will be a medium error with the sense buffer info
>> field indicating LBA 4660. Now are the 4 blocks
>> that precede it (i.e. LBA 4656 to 4659) possibly
>> sitting in the data-in buffer and valid??
>>
>> The block layer thinks they are. This is what my
>> term "short read" in the title alludes to. So I put
>> this question to the T10 reflector:
>> http://www.t10.org/t10r.htm
>> titled "sbc: reading blocks prior to a medium error".
>> And the answers were pretty clear. And the one from
>> George Penokie of LSI is interesting because Linux's
>> block layer assumption breaks some of LSI's equipment.
>
> Well, unsurprisingly, I was aware of the issue via Novell customer
> interactions. Since you've outed LSI, we can discuss it openly.
>
> The fact is that for medium errors, every other array returns valid data
> up to the erroring sector.
>
>> On the other hand, big array vendors and database vendors
>> want exactly what the block layer is doing at the moment.
>> So those guys don't want a change. [Please correct me
>> if that is too sweeping.] Also I'm informed some other
>> OSes do this as well.
>
> Plus all disk devices transfer up to the error sector. Additionally,
> Martin Petersen uses the same code for DIF and he's secured external
> agreement from the DIF based arrays that nothwithstanding the ambiguity
> in the SCSI standards, all DIF arrays return valid data up to the sector
> with the DIF error.
>
>> I would like to propose a solution, at least in the SCSI
>> subsystem context. The 'resid' field was added 11 years
>> ago and is used by a HBA driver to indicate how many bytes
>> less than requested were placed in the scatter gather
>> list (i.e. the data-in buffer). It defaults to zero
>> (meaning all requested bytes have been read). Usually
>> for a medium error one would not bother setting resid
>> (so resid would remain 0). Somewhat surprisingly the
>> block layer has always ignored resid. I propose in the
>> case of a short read caused by a MEDIUM ERROR the block
>> layer checks resid. And if resid equals the requested
>> number of bytes then that means no data in the scatter
>> gather list is valid. So the block layer should act on
>> this information.
>>
>> To this end I propose to change the scsi_debug driver
>> to set resid equal to bufflen when it simulates a
>> medium error.
>>
>> Changes in the block layer and drivers from vendors who
>> want the strict "T10" handling of medium errors would
>> also be required. Maybe the USB mass storage (and UAS)
>> folks might also check if this impacts them.
>
> OK, so I checked, and I think all of the major in-use HBA drivers today
> do set the residue, so I'd be reasonably happy with a modification like
> the following. It basically believes the lower of either the
> transferred data or the listed error (assuming the listed error is
> valid ... if it's invalid, we still assume we can't trust anything).
> This should mean that HBA drivers that set the residue work for all
> arrays and those that don't work as they do today.
>
Okay, having been part of the discussion (and the resulting
specification digging) I'm not quite convinced that this minimal
patch does entirely the right thing.

After all, T10 said we should consider the buffer as invalid in the
face of read or write errors, despite the fact that some bits in
there _may_ be valid.

So the correct approach here would be to retry the command
with a short read up to the size indicated in the sense code;
that should avoid the error and the buffer would be filled with
correct data.

And with that approach we would be keeping everyone happy.

Hmm. Someone should do a patch here :-)

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

2010-12-20 16:00:57

by James Bottomley

Subject: Re: RFC: short reads on block devices

On Mon, 2010-12-20 at 10:43 +0100, Hannes Reinecke wrote:
> On 12/17/2010 09:36 PM, James Bottomley wrote:
> > On Fri, 2010-12-17 at 11:48 -0500, Douglas Gilbert wrote:
> >> Recently while testing with the scsi_debug driver
> >> I was able to trick the block layer into reading
> >> random data which the block layer thought was
> >> valid ***.
> >>
> >> Best to start with an example, say LBA ** 4660 has
> >> an unrecoverable error (aka medium error) and
> >> the block layer fires off a SCSI READ for 8
> >> blocks (512 byte variety) at LBA 4656. The response
> >> will be a medium error with the sense buffer info
> >> field indicating LBA 4660. Now are the 4 blocks
> >> that precede it (i.e. LBA 4656 to 4659) possibly
> >> sitting in the data-in buffer and valid??
> >>
> >> The block layer thinks they are. This is what my
> >> term "short read" in the title alludes to. So I put
> >> this question to the T10 reflector:
> >> http://www.t10.org/t10r.htm
> >> titled "sbc: reading blocks prior to a medium error".
> >> And the answers were pretty clear. And the one from
> >> George Penokie of LSI is interesting because Linux's
> >> block layer assumption breaks some of LSI's equipment.
> >
> > Well, unsurprisingly, I was aware of the issue via Novell customer
> > interactions. Since you've outed LSI, we can discuss it openly.
> >
> > The fact is that for medium errors, every other array returns valid data
> > up to the erroring sector.
> >
> >> On the other hand, big array vendors and database vendors
> >> want exactly what the block layer is doing at the moment.
> >> So those guys don't want a change. [Please correct me
> >> if that is too sweeping.] Also I'm informed some other
> >> OSes do this as well.
> >
> > Plus all disk devices transfer up to the error sector. Additionally,
> > Martin Petersen uses the same code for DIF and he's secured external
> > agreement from the DIF based arrays that nothwithstanding the ambiguity
> > in the SCSI standards, all DIF arrays return valid data up to the sector
> > with the DIF error.
> >
> >> I would like to propose a solution, at least in the SCSI
> >> subsystem context. The 'resid' field was added 11 years
> >> ago and is used by a HBA driver to indicate how many bytes
> >> less than requested were placed in the scatter gather
> >> list (i.e. the data-in buffer). It defaults to zero
> >> (meaning all requested bytes have been read). Usually
> >> for a medium error one would not bother setting resid
> >> (so resid would remain 0). Somewhat surprisingly the
> >> block layer has always ignored resid. I propose in the
> >> case of a short read caused by a MEDIUM ERROR the block
> >> layer checks resid. And if resid equals the requested
> >> number of bytes then that means no data in the scatter
> >> gather list is valid. So the block layer should act on
> >> this information.
> >>
> >> To this end I propose to change the scsi_debug driver
> >> to set resid equal to bufflen when it simulates a
> >> medium error.
> >>
> >> Changes in the block layer and drivers from vendors who
> >> want the strict "T10" handling of medium errors would
> >> also be required. Maybe the USB mass storage (and UAS)
> >> folks might also check if this impacts them.
> >
> > OK, so I checked, and I think all of the major in-use HBA drivers today
> > do set the residue, so I'd be reasonably happy with a modification like
> > the following. It basically believes the lower of either the
> > transferred data or the listed error (assuming the listed error is
> > valid ... if it's invalid, we still assume we can't trust anything).
> > This should mean that HBA drivers that set the residue work for all
> > arrays and those that don't work as they do today.
> >
> Okay, having been part of the discussion (and the resulting
> specification digging) I'm not quite convinced that this minimal
> patch does entirely the right thing.
>
> After all, T10 said we should consider the buffer as invalid in the
> face of read or write errors, despite the fact that some bits in
> there _may_ be valid.

Well, T10 always gives the most conservative approach. All other array
(and certainly all disk) vendors seem to confirm valid data up to the
error ... certainly for the DIF case, Martin secured this agreement.

> So the correct approach here would be to retry the command
> with a short read up to the size indicated in the sense code;
> that should avoid the error and the buffer would be filled with
> correct data.
>
> And with that approach we would be keeping everyone happy.
>
> Hmm. Someone should do a patch here :-)

Such a patch is harder than you think. We certainly don't want to
bother disks with this ... the read with medium error probably took
several seconds, so we never want to hit that sector again.

I think the correct approach is:

1. If we got data up to (and possibly including) the error sector,
we're done and we assume good data and terminate the command.
2. If we got an error sector and less good data than we should,
then we might need to repeat. But currently failing the entire
transaction at that point with what we have is good.
3. For the outlier repeat path, I think reconstructing the SCSI
command to request data up to the error sector is best. We
still have to recognise this on return so as not to emit another
read for the bad sector.

The proposed patch already does 1 & 2 ... I think we can do 3 later
depending on just how common the outlier arrays are.
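
[Illustration] The three cases above can be sketched as a small decision function. This is a user-space model under stated assumptions, not a proposed kernel interface: the names (`enum read_action`, `struct read_plan`, `plan_after_medium_error()`) are hypothetical, the reported bad LBA is assumed valid and inside the request, and the `retry_enabled` flag merely stands in for "case 3 has been implemented".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum read_action {
    ACTION_COMPLETE_GOOD,  /* case 1: data reached the error sector */
    ACTION_FAIL_PARTIAL,   /* case 2: fail now with what was transferred */
    ACTION_RETRY_SHORT     /* case 3: re-request only up to the error sector */
};

struct read_plan {
    enum read_action action;
    uint32_t bytes_ok;      /* bytes to complete as good right now */
    uint32_t retry_blocks;  /* blocks to re-request (ACTION_RETRY_SHORT only) */
};

static struct read_plan plan_after_medium_error(uint64_t start_lba,
                                                uint64_t bad_lba,
                                                uint32_t sector_size,
                                                uint32_t transferred,
                                                bool retry_enabled)
{
    struct read_plan p = { ACTION_COMPLETE_GOOD, 0, 0 };
    uint64_t good_bytes;

    assert(bad_lba >= start_lba && sector_size > 0);
    good_bytes = (bad_lba - start_lba) * sector_size;

    if (transferred >= good_bytes) {
        /* Case 1: everything before the bad sector arrived. */
        p.bytes_ok = (uint32_t)good_bytes;
        return p;
    }
    if (!retry_enabled) {
        /* Case 2: short of the error sector; keep only what arrived. */
        p.action = ACTION_FAIL_PARTIAL;
        p.bytes_ok = transferred;
        return p;
    }
    /* Case 3: reissue a read that stops just before the bad sector,
     * so the bad sector itself is never requested again. */
    p.action = ACTION_RETRY_SHORT;
    p.retry_blocks = (uint32_t)(bad_lba - start_lba);
    return p;
}
```

In this sketch the retry path re-requests the whole range before the bad sector rather than only the missing tail; that is a simplification chosen for clarity.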

James

2010-12-22 18:09:56

by Martin K. Petersen

Subject: Re: RFC: short reads on block devices

>>>>> "James" == James Bottomley <[email protected]> writes:

James> Well, T10 always gives the most conservative approach. All other
James> array (and certainly all disk) vendors seem to confirm valid data
James> up to the error ... certainly for the DIF case, Martin secured
James> this agreement.

Well, secured is probably too strong a word. Nobody objected when it was
put in the DIX spec...

--
Martin K. Petersen Oracle Linux Engineering