2006-11-30 01:22:50

by Luben Tuikov

[permalink] [raw]
Subject: Infinite retries reading the partition table

Suppose reading sector 0 always reports an error,
sense key HARDWARE ERROR.

What I'm observing is that the request to read sector 0,
reading partition information, is retried forever, ad infinitum.

Does anyone have a patch to resolve this? (2.6.19-rc6)

Thanks,
Luben


2006-11-30 01:53:27

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- Luben Tuikov <[email protected]> wrote:

> Suppose reading sector 0 always reports an error,
> sense key HARDWARE ERROR.
>
> What I'm observing is that the request to read sector 0,
> reading partition information, is retried forever, ad infinitum.
>
> Does anyone have a patch to resolve this? (2.6.19-rc6)

Actually the device sends SK: MEDIUM ERROR, ASC: UNRECOVERED READ ERR,
but SCSI Core seems to retry reading the partition table (sector 0)
forever.

Anyone seen this and/or has a patch in their tree for it?

Luben
P.S. This is fairly straightforward to inject/test.

2006-12-01 05:47:19

by Andrew Morton

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

On Wed, 29 Nov 2006 17:22:48 -0800 (PST)
Luben Tuikov <[email protected]> wrote:

> Suppose reading sector 0 always reports an error,
> sense key HARDWARE ERROR.
>
> What I'm observing is that the request to read sector 0,
> reading partition information, is retried forever, ad infinitum.
>
> Does anyone have a patch to resolve this? (2.6.19-rc6)
>

Please send a backtrace so we can see where the offending loop occurs.

2006-12-01 06:34:58

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- Andrew Morton <[email protected]> wrote:
> On Wed, 29 Nov 2006 17:22:48 -0800 (PST)
> Luben Tuikov <[email protected]> wrote:
>
> > Suppose reading sector 0 always reports an error,
> > sense key HARDWARE ERROR.
> >
> > What I'm observing is that the request to read sector 0,
> > reading partition information, is retried forever, ad infinitum.
> >
> > Does anyone have a patch to resolve this? (2.6.19-rc6)
> >
>
> Please send a backtrace so we can see where the offending loop occurs.

I posted a patch to linux-scsi which resolves this issue:
http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885&w=2

Luben

2006-12-01 07:29:19

by Andrew Morton

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

On Thu, 30 Nov 2006 22:34:57 -0800 (PST)
Luben Tuikov <[email protected]> wrote:

> --- Andrew Morton <[email protected]> wrote:
> > On Wed, 29 Nov 2006 17:22:48 -0800 (PST)
> > Luben Tuikov <[email protected]> wrote:
> >
> > > Suppose reading sector 0 always reports an error,
> > > sense key HARDWARE ERROR.
> > >
> > > What I'm observing is that the request to read sector 0,
> > > reading partition information, is retried forever, ad infinitum.
> > >
> > > Does anyone have a patch to resolve this? (2.6.19-rc6)
> > >
> >
> > Please send a backtrace so we can see where the offending loop occurs.
>
> I posted a patch to linux-scsi

hm. Does sending patches to linux-scsi get them applied? It might, I
don't know.

> which resolves this issue:
> http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885&w=2

That looks like it prevents the IO error. But why was an IO error causing
an infinite loop? What piece of code was initiating the retries?

2006-12-01 08:32:33

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- Andrew Morton <[email protected]> wrote:
> On Thu, 30 Nov 2006 22:34:57 -0800 (PST)
> Luben Tuikov <[email protected]> wrote:
>
> > --- Andrew Morton <[email protected]> wrote:
> > > On Wed, 29 Nov 2006 17:22:48 -0800 (PST)
> > > Luben Tuikov <[email protected]> wrote:
> > >
> > > > Suppose reading sector 0 always reports an error,
> > > > sense key HARDWARE ERROR.
> > > >
> > > > What I'm observing is that the request to read sector 0,
> > > > reading partition information, is retried forever, ad infinitum.
> > > >
> > > > Does anyone have a patch to resolve this? (2.6.19-rc6)
> > > >
> > >
> > > Please send a backtrace so we can see where the offending loop occurs.
> >
> > I posted a patch to linux-scsi
>
> hm. Does sending patches to linux-scsi get them applied? It might, I
> don't know.

Good question -- I don't know either.

> > which resolves this issue:
> > http://marc.theaimsgroup.com/?l=linux-scsi&m=116485834119885&w=2
>
> That looks like it prevents the IO error. But why was an IO error causing
> an infinite loop? What piece of code was initiating the retries?

Here is what happens: sector 0 is broken -- the device cannot read
the media at that location. The device properly returns a certain
type of uncorrectable MEDIUM ERROR (ASC: UNRECOVERABLE READ ERR).

SCSI Core loops around its retries (which this patch fixes) and
eventually gives up and sends it for "completion". This is what
happens when scsi_check_sense() returns NEEDS_RETRY to
scsi_decide_disposition() to scsi_softirq(). The first chunk
of the patch fixes this.

We end up in scsi_io_completion(), where good_bytes = 0, and
result = 0x08000002 (DRIVER SENSE and CHECK CONDITION).

This statement in scsi_io_completion() causes the infinite retry loop:
if (scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
return;
substitute to get: scsi_end_request(cmd, uptodate=1, uptodate bytes=0, retry=1)
Yeah, but it doesn't make sense to call scsi_end_request() with uptodate=1 and
uptodate bytes = 0. This causes the infinite retry, since the code
tries to re-read the whole xfer size (0 bytes were uptodate and retry=1),
from the bad media.

That is, we want to set uptodate=1 iff there was at least 1 byte up to date.
Else if nothing was read, uptodate bytes = 0, then we should pass
uptodate = 0, uptodate_bytes = total xfer, to mean the whole xfer is
not uptodate; and retry iff there was no error. (This is the very bottom
of the function.)

... I know, I know, but that's what we've got.

See this commit 03aba2f79594ca94d159c8bab454de9bcc385b76 as well.

Luben

2006-12-05 20:40:55

by Michael Reed

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table



Luben Tuikov wrote:
...snip...
> This statement in scsi_io_completion() causes the infinite retry loop:
> if (scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
> return;

The code in 2.6.19 is "result==0", not "!!result", which is logically
the same as "result!=0". Did you mean to change the logic here?
Am I missing something?

Mike

2006-12-06 05:00:23

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- Michael Reed <[email protected]> wrote:
> Luben Tuikov wrote:
> ...snip...
> > This statement in scsi_io_completion() causes the infinite retry loop:
> > if (scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
> > return;
>
> The code in 2.6.19 is "result==0", not "!!result", which is logically
> the same as "result!=0". Did you mean to change the logic here?
> Am I missing something?

Hmm, I think my trees have !!result from an earlier patch I posted.

In this case it would appear that the second chunk of the patch
wouldn't be necessary, since result==0 would be false, and it
wouldn't retry.

Luben



2006-12-06 05:08:58

by Andrew Morton

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

> On Tue, 5 Dec 2006 21:00:20 -0800 (PST) Luben Tuikov <[email protected]> wrote:
> --- Michael Reed <[email protected]> wrote:
> > Luben Tuikov wrote:
> > ...snip...
> > > This statement in scsi_io_completion() causes the infinite retry loop:
> > > if (scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
> > > return;
> >
> > The code in 2.6.19 is "result==0", not "!!result", which is logically
> > the same as "result!=0". Did you mean to change the logic here?
> > Am I missing something?
>
> Hmm, I think my trees have !!result from an earlier patch I posted.
>
> In this case it would appear that the second chunk of the patch
> wouldn't be necessary, since result==0 would be false, and it
> wouldn't retry.
>

I fixed things up. The below is as-intended, yes?


diff -puN drivers/scsi/scsi_error.c~fix-sense-key-medium-error-processing-and-retry drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c~fix-sense-key-medium-error-processing-and-retry
+++ a/drivers/scsi/scsi_error.c
@@ -359,6 +359,11 @@ static int scsi_check_sense(struct scsi_
return SUCCESS;

case MEDIUM_ERROR:
+ if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
+ sshdr.asc == 0x13 || /* AMNF DATA FIELD */
+ sshdr.asc == 0x14) { /* RECORD NOT FOUND */
+ return SUCCESS;
+ }
return NEEDS_RETRY;

case HARDWARE_ERROR:
diff -puN drivers/scsi/scsi_lib.c~fix-sense-key-medium-error-processing-and-retry drivers/scsi/scsi_lib.c
--- a/drivers/scsi/scsi_lib.c~fix-sense-key-medium-error-processing-and-retry
+++ a/drivers/scsi/scsi_lib.c
@@ -871,7 +871,8 @@ void scsi_io_completion(struct scsi_cmnd
* are leftovers and there is some kind of error
* (result != 0), retry the rest.
*/
- if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
+ if (good_bytes &&
+ scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
return;

/* good_bytes = 0, or (inclusive) there were leftovers and
_

2006-12-06 06:10:54

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- Andrew Morton <[email protected]> wrote:
> > On Tue, 5 Dec 2006 21:00:20 -0800 (PST) Luben Tuikov <[email protected]> wrote:
> > --- Michael Reed <[email protected]> wrote:
> > > Luben Tuikov wrote:
> > > ...snip...
> > > > This statement in scsi_io_completion() causes the infinite retry loop:
> > > > if (scsi_end_request(cmd, 1, good_bytes, !!result) == NULL)
> > > > return;
> > >
> > > The code in 2.6.19 is "result==0", not "!!result", which is logically
> > > the same as "result!=0". Did you mean to change the logic here?
> > > Am I missing something?
> >
> > Hmm, I think my trees have !!result from an earlier patch I posted.
> >
> > In this case it would appear that the second chunk of the patch
> > wouldn't be necessary, since result==0 would be false, and it
> > wouldn't retry.
> >
>
> I fixed things up. The below is as-intended, yes?

Yes, thanks!

Luben

>
>
> diff -puN drivers/scsi/scsi_error.c~fix-sense-key-medium-error-processing-and-retry
> drivers/scsi/scsi_error.c
> --- a/drivers/scsi/scsi_error.c~fix-sense-key-medium-error-processing-and-retry
> +++ a/drivers/scsi/scsi_error.c
> @@ -359,6 +359,11 @@ static int scsi_check_sense(struct scsi_
> return SUCCESS;
>
> case MEDIUM_ERROR:
> + if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
> + sshdr.asc == 0x13 || /* AMNF DATA FIELD */
> + sshdr.asc == 0x14) { /* RECORD NOT FOUND */
> + return SUCCESS;
> + }
> return NEEDS_RETRY;
>
> case HARDWARE_ERROR:
> diff -puN drivers/scsi/scsi_lib.c~fix-sense-key-medium-error-processing-and-retry
> drivers/scsi/scsi_lib.c
> --- a/drivers/scsi/scsi_lib.c~fix-sense-key-medium-error-processing-and-retry
> +++ a/drivers/scsi/scsi_lib.c
> @@ -871,7 +871,8 @@ void scsi_io_completion(struct scsi_cmnd
> * are leftovers and there is some kind of error
> * (result != 0), retry the rest.
> */
> - if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> + if (good_bytes &&
> + scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> return;
>
> /* good_bytes = 0, or (inclusive) there were leftovers and
> _
>
>

2006-12-06 10:48:29

by Etienne.Vogt

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table



On Thu, 30 Nov 2006, Andrew Morton wrote:

> That looks like it prevents the IO error. But why was an IO error causing
> an infinite loop? What piece of code was initiating the retries?

Maybe it's related to the infinite domain validation retry loop problem
I reported on linux-scsi a few weeks ago ?

--
Etienne Vogt ([email protected])
Observatoire de Paris-Meudon, France

2006-12-06 16:00:00

by James Bottomley

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

On Tue, 2006-12-05 at 21:08 -0800, Andrew Morton wrote:
> case MEDIUM_ERROR:
> + if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
> + sshdr.asc == 0x13 || /* AMNF DATA FIELD */
> + sshdr.asc == 0x14) { /* RECORD NOT FOUND */
> + return SUCCESS;
> + }
> return NEEDS_RETRY;

If the complaint is true; i.e. infinite retries, this is just a bandaid
not a fix. What it's doing is marking the unrecoverable medium errors
for no retry. However, what we really need to know is why NEEDS_RETRY
isn't terminating after its allotted number of retries. Can we please
have a trace of this?

> - if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> + if (good_bytes &&
> + scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> return;

What exactly is this supposed to be doing? its result is identical to
the code it's replacing (because of the way scsi_end_request() processes
its second argument), so it can't have any effect on the stated problem.

James


2006-12-06 20:25:04

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- James Bottomley <[email protected]> wrote:
> On Tue, 2006-12-05 at 21:08 -0800, Andrew Morton wrote:
> > case MEDIUM_ERROR:
> > + if (sshdr.asc == 0x11 || /* UNRECOVERED READ ERR */
> > + sshdr.asc == 0x13 || /* AMNF DATA FIELD */
> > + sshdr.asc == 0x14) { /* RECORD NOT FOUND */
> > + return SUCCESS;
> > + }
> > return NEEDS_RETRY;
>
> If the complaint is true; i.e. infinite retries, this is just a bandaid
> not a fix. What it's doing is marking the unrecoverable medium errors
> for no retry. However, what we really need to know is why NEEDS_RETRY
> isn't terminating after its allotted number of retries. Can we please
> have a trace of this?

NEEDS_RETRY _does_ terminate, after it exhausts the retries. But since
by the ASC value we know that no amount of retries is going to work,
this chunk of the patch resolves it quicker, i.e. eliminates the
"NEEDS_RETRY" pointless retries (given the SK/ASC combination).

> > - if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > + if (good_bytes &&
> > + scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > return;
>
> What exactly is this supposed to be doing? its result is identical to
> the code it's replacing (because of the way scsi_end_request() processes
> its second argument), so it can't have any effect on the stated problem.

I suppose this is true, but I'd rather it not even go in
scsi_end_request as (cmd, uptodate=1, good_bytes=0, retry=0) and complete
at the bottom as (cmd, uptodate=0, total_xfer, retry=0).

Luben

2006-12-07 18:17:03

by James Bottomley

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

On Wed, 2006-12-06 at 12:24 -0800, Luben Tuikov wrote:
> NEEDS_RETRY _does_ terminate, after it exhausts the retries. But since
> by the ASC value we know that no amount of retries is going to work,
> this chunk of the patch resolves it quicker, i.e. eliminates the
> "NEEDS_RETRY" pointless retries (given the SK/ASC combination).

I agree that it's useful behaviour. However, the change header should
be something like "scsi_error: don't retry for unrecoverable medium
errors" not "infinite retries .."

> > > - if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > > + if (good_bytes &&
> > > + scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > > return;
> >
> > What exactly is this supposed to be doing? its result is identical to
> > the code it's replacing (because of the way scsi_end_request() processes
> > its second argument), so it can't have any effect on the stated problem.
>
> I suppose this is true, but I'd rather it not even go in
> scsi_end_request as (cmd, uptodate=1, good_bytes=0, retry=0) and complete
> at the bottom as (cmd, uptodate=0, total_xfer, retry=0).

But, logically, this isn't part of the change set ... the behaviour
you're altering is unrelated to the change set details, so this piece
shouldn't be in.

James


2006-12-07 19:14:24

by Luben Tuikov

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

--- James Bottomley <[email protected]> wrote:
> On Wed, 2006-12-06 at 12:24 -0800, Luben Tuikov wrote:
> > NEEDS_RETRY _does_ terminate, after it exhausts the retries. But since
> > by the ASC value we know that no amount of retries is going to work,
> > this chunk of the patch resolves it quicker, i.e. eliminates the
> > "NEEDS_RETRY" pointless retries (given the SK/ASC combination).
>
> I agree that it's useful behaviour. However, the change header should
> be something like "scsi_error: don't retry for unrecoverable medium
> errors" not "infinite retries .."
>
> > > > - if (scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > > > + if (good_bytes &&
> > > > + scsi_end_request(cmd, 1, good_bytes, result == 0) == NULL)
> > > > return;
> > >
> > > What exactly is this supposed to be doing? its result is identical to
> > > the code it's replacing (because of the way scsi_end_request() processes
> > > its second argument), so it can't have any effect on the stated problem.
> >
> > I suppose this is true, but I'd rather it not even go in
> > scsi_end_request as (cmd, uptodate=1, good_bytes=0, retry=0) and complete
> > at the bottom as (cmd, uptodate=0, total_xfer, retry=0).
>
> But, logically, this isn't part of the change set ... the behaviour
> you're altering is unrelated to the change set details, so this piece
> shouldn't be in.

It is. If good_bytes=0 then nothing is up to date and uptodate should
be set to 0.

Look at my comment before the function call:
/* A number of bytes were successfully read. ...

I repeat again: it doesn't make sense to call scsi_end_request
with uptodate=1 and good_bytes=0, since _no bytes are uptodate_.

2006-12-07 21:15:14

by James Bottomley

[permalink] [raw]
Subject: Re: Infinite retries reading the partition table

On Thu, 2006-12-07 at 11:14 -0800, Luben Tuikov wrote:
> It is. If good_bytes=0 then nothing is up to date and uptodate should
> be set to 0.

That's not a correct assumption. Zero transfer commands, like TEST UNIT
READY are perfectly happy to complete successfully with good_bytes == 0.

> Look at my comment before the function call:
> /* A number of bytes were successfully read. ...
>
> I repeat again: it doesn't make sense to call scsi_end_request
> with uptodate=1 and good_bytes=0, since _no bytes are uptodate_.

We can certainly debate that, but it's not appropriate to do it as part
of an unrelated patch.

James