2003-03-06 23:29:57

by Andries E. Brouwer

[permalink] [raw]
Subject: [PATCH] scsi_error fix

Below:
imm.c: spelling
scsi.h: remove old and now incorrect comment
scsi_scan.c: remove superfluous final return
scsi_error.c: apart from similar trivialities the only change:

If a command fails (e.g. because it belongs to a newer
SCSI version than the device), it is fed to
scsi_decide_disposition(). That routine must return
SUCCESS, unless the error handler should be invoked.

In the situation where host_byte is DID_OK, and message_byte
is COMMAND_COMPLETE, and status is CHECK_CONDITION, there is
no reason at all to invoke aborts and resets. The situation
is normal. I see here UNIT ATTENTION, Power on occurred
and ILLEGAL REQUEST, Invalid field in cdb.

The 2.5.64 code does not return SUCCESS, but it returns the
return code of scsi_check_sense(), and that may be FAILED
in case we do not have valid sense.

Further discussion is possible here, but I changed the return
into "return SUCCESS" and 2.5.64 boots fine.

Andries

[Further discussion and things I did not yet investigate:
What was changed to make this fail first in 2.5.63?
Experience shows that we get into a loop when something else
than SUCCESS is returned here. Probably that is a bug elsewhere.
Probably the commands that cause problems should never have been
sent in the first place.]

diff -u --recursive --new-file -X /linux/dontdiff a/drivers/scsi/imm.c b/drivers/scsi/imm.c
--- a/drivers/scsi/imm.c Sat Feb 15 20:41:59 2003
+++ b/drivers/scsi/imm.c Thu Mar 6 23:50:31 2003
@@ -174,6 +174,7 @@
parport_unregister_device(imm_hosts[i].dev);
continue;
}
+
/* now the glue ... */
switch (imm_hosts[i].mode) {
case IMM_NIBBLE:
@@ -218,7 +219,7 @@

/* This is to give the imm driver a way to modify the timings (and other
* parameters) by writing to the /proc/scsi/imm/0 file.
- * Very simple method really... (To simple, no error checking :( )
+ * Very simple method really... (Too simple, no error checking :( )
* Reason: Kernel hackers HATE having to unload and reload modules for
* testing...
* Also gives a method to use a script to obtain optimum timings (TODO)
@@ -948,7 +949,7 @@
unsigned char l = 0, h = 0;
int retv, x;

- /* First check for any errors that may of occurred
+ /* First check for any errors that may have occurred
* Here we check for internal errors
*/
if (tmp->failed)
diff -u --recursive --new-file -X /linux/dontdiff a/drivers/scsi/scsi.h b/drivers/scsi/scsi.h
--- a/drivers/scsi/scsi.h Wed Mar 5 10:47:26 2003
+++ b/drivers/scsi/scsi.h Thu Mar 6 23:50:31 2003
@@ -280,26 +280,7 @@
#define SCSI_SET_IOCTL_LOGGING(LEVEL) \
SCSI_SET_LOGGING(SCSI_LOG_IOCTL_SHIFT, SCSI_LOG_IOCTL_BITS, LEVEL);

-/*
- * the return of the status word will be in the following format :
- * The low byte is the status returned by the SCSI command,
- * with vendor specific bits masked.
- *
- * The next byte is the message which followed the SCSI status.
- * This allows a stos to be used, since the Intel is a little
- * endian machine.
- *
- * The final byte is a host return code, which is one of the following.
- *
- * IE
- * lsb msb
- * status msg host code
- *
- * Our errors returned by OUR driver, NOT SCSI message. Or'd with
- * SCSI message passed back to driver <IF any>.
- */
-
-
+/* host byte codes */
#define DID_OK 0x00 /* NO error */
#define DID_NO_CONNECT 0x01 /* Couldn't connect before timeout period */
#define DID_BUS_BUSY 0x02 /* BUS stayed busy through time out period */
diff -u --recursive --new-file -X /linux/dontdiff a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
--- a/drivers/scsi/scsi_error.c Mon Feb 24 23:02:52 2003
+++ b/drivers/scsi/scsi_error.c Thu Mar 6 23:50:31 2003
@@ -92,10 +92,8 @@
*
* Notes:
* This should be turned into an inline function. Each scsi command
- * has it's own timer, and as it is added to the queue, we set up the
- * timer. When the command completes, we cancel the timer. Pretty
- * simple, really, especially compared to the old way of handling this
- * crap.
+ * has its own timer, and as it is added to the queue, we set up the
+ * timer. When the command completes, we cancel the timer.
**/
void scsi_add_timer(struct scsi_cmnd *scmd, int timeout,
void (*complete)(struct scsi_cmnd *))
@@ -249,6 +247,7 @@
{
if (!SCSI_SENSE_VALID(scmd))
return FAILED;
+
if (scmd->sense_buffer[2] & 0xe0)
return SUCCESS;

@@ -1193,12 +1192,12 @@
* Notes:
* This is *only* called when we are examining the status after sending
* out the actual data command. any commands that are queued for error
- * recovery (i.e. test_unit_ready) do *not* come through here.
+ * recovery (e.g. test_unit_ready) do *not* come through here.
*
* When this routine returns failed, it means the error handler thread
- * is woken. in cases where the error code indicates an error that
+ * is woken. In cases where the error code indicates an error that
* doesn't require the error handler read (i.e. we don't need to
- * abort/reset), then this function should return SUCCESS.
+ * abort/reset), this function should return SUCCESS.
**/
int scsi_decide_disposition(struct scsi_cmnd *scmd)
{
@@ -1214,11 +1213,11 @@
__FUNCTION__));
return SUCCESS;
}
+
/*
* first check the host byte, to see if there is anything in there
* that would indicate what we need to do.
*/
-
switch (host_byte(scmd->result)) {
case DID_PASSTHROUGH:
/*
@@ -1296,11 +1295,11 @@
/*
* next, check the message byte.
*/
- if (msg_byte(scmd->result) != COMMAND_COMPLETE) {
+ if (msg_byte(scmd->result) != COMMAND_COMPLETE)
return FAILED;
- }
+
/*
- * now, check the status byte to see if this indicates anything special.
+ * check the status byte to see if this indicates anything special.
*/
switch (status_byte(scmd->result)) {
case QUEUE_FULL:
@@ -1321,10 +1320,11 @@
return SUCCESS;
case CHECK_CONDITION:
rtn = scsi_check_sense(scmd);
- if (rtn == NEEDS_RETRY) {
+ if (rtn == NEEDS_RETRY)
goto maybe_retry;
- }
- return rtn;
+ /* if rtn == FAILED, we have no sense information */
+ /* was: return rtn; */
+ return SUCCESS;
case CONDITION_GOOD:
case INTERMEDIATE_GOOD:
case INTERMEDIATE_C_GOOD:
diff -u --recursive --new-file -X /linux/dontdiff a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
--- a/drivers/scsi/scsi_scan.c Wed Mar 5 10:47:26 2003
+++ b/drivers/scsi/scsi_scan.c Fri Mar 7 00:02:09 2003
@@ -960,7 +960,6 @@
SCSI_LOG_SCAN_BUS(3, printk(KERN_INFO "scsi scan: host %d channel %d"
" id %d lun %d name/id: '%s'\n", sdev->host->host_no,
sdev->channel, sdev->id, sdev->lun, sdev->sdev_driverfs_dev.name));
- return;
}

/**


2003-03-07 01:12:56

by Rob Radez

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

On Fri, Mar 07, 2003 at 12:37:57AM +0100, [email protected] wrote:
> Below:
> scsi_error.c: apart from similar trivialities the only change:
>
> If a command fails (e.g. because it belongs to a newer
> SCSI version than the device), it is fed to
> scsi_decide_disposition(). That routine must return
> SUCCESS, unless the error handler should be invoked.
[snip patch]

This patch gets me booting again.

Regards,
Rob Radez

2003-03-07 19:43:18

by Patrick Mansfield

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

On Fri, Mar 07, 2003 at 12:37:57AM +0100, [email protected] wrote:

> scsi_error.c: apart from similar trivialities the only change:
>
> If a command fails (e.g. because it belongs to a newer
> SCSI version than the device), it is fed to
> scsi_decide_disposition(). That routine must return
> SUCCESS, unless the error handler should be invoked.
>
> In the situation where host_byte is DID_OK, and message_byte
> is COMMAND_COMPLETE, and status is CHECK_CONDITION, there is
> no reason at all to invoke aborts and resets. The situation
> is normal. I see here UNIT ATTENTION, Power on occurred
> and ILLEGAL REQUEST, Invalid field in cdb.
>
> The 2.5.64 code does not return SUCCESS, but it returns the
> return code of scsi_check_sense(), and that may be FAILED
> in case we do not have valid sense.
>
> Further discussion is possible here, but I changed the return
> into "return SUCCESS" and 2.5.64 boots fine.
>
> Andries
>
> [Further discussion and things I did not yet investigate:
> What was changed to make this fail first in 2.5.63?
> Experience shows that we get into a loop when something else
> than SUCCESS is returned here. Probably that is a bug elsewhere.
> Probably the commands that cause problems should never have been
> sent in the first place.]

The scsi error handler is also used to retrieve sense data for
adapters/drivers that do not auto retrieve it. In such cases, it should
not issue any aborts, resets etc.

Your change effectively disables that support - we never hit the code in
scsi_eh_get_sense() to request sense. It would be very nice if we could
fix (or audit) all the scsi drivers, apply your change and remove
scsi_eh_get_sense, but AFAIK that has not and is not happening.

I don't know if your adapter/driver has auto sense retrieval (imm.c?).

I would guess a change in scsi_error.c related to sense (scsi_eh_get_sense)
handling is causing problems, and if your driver has auto sense then
something is wrong with the scmd->sense_buffer.

It would still be useful to see the console log output with scsi_logging=1
for your failure case.

-- Patrick Mansfield

2003-03-07 20:09:08

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

From: Patrick Mansfield <[email protected]>

> [Further discussion and things I did not yet investigate:
> What was changed to make this fail first in 2.5.63?
> Experience shows that we get into a loop when something else
> than SUCCESS is returned here. Probably that is a bug elsewhere.
> Probably the commands that cause problems should never have been
> sent in the first place.]

The scsi error handler is also used to retrieve sense data for
adapters/drivers that do not auto retrieve it. In such cases, it should
not issue any aborts, resets etc.

Indeed.

Your change effectively disables that support - we never hit the code in
scsi_eh_get_sense() to request sense. It would be very nice if we could
fix (or audit) all the scsi drivers, apply your change and remove
scsi_eh_get_sense, but AFAIK that has not and is not happening.

No. What happened before was that we got into an infinite loop.
The right action is to read the code, understand why it gets
into a loop, and fix it. Once that has happened we may decide
to undo my change. Or we may decide to ask for sense at that very spot.

Today both James and Mike say that they can reproduce the loop,
so probably they'll fix that part. If not, I'll have a look again.

Andries

2003-03-07 21:05:12

by Mike Anderson

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

[email protected] [[email protected]] wrote:
> From: Patrick Mansfield <[email protected]>
>
> > [Further discussion and things I did not yet investigate:
> > What was changed to make this fail first in 2.5.63?
> > Experience shows that we get into a loop when something else
> > than SUCCESS is returned here. Probably that is a bug elsewhere.
> > Probably the commands that cause problems should never have been
> > sent in the first place.]
>
> The scsi error handler is also used to retrieve sense data for
> adapters/drivers that do not auto retrieve it. In such cases, it should
> not issue any aborts, resets etc.
>
> Indeed.
>
> Your change effectively disables that support - we never hit the code in
> scsi_eh_get_sense() to request sense. It would be very nice if we could
> fix (or audit) all the scsi drivers, apply your change and remove
> scsi_eh_get_sense, but AFAIK that has not and is not happening.
>
> No. What happened before was that we got into an infinite loop.
> The right action is to read the code, understand why it gets
> into a loop, and fix it. Once that has happened we may decide
> to undo my change. Or we may decide to ask for sense at that very spot.
>
> Today both James and Mike say that they can reproduce the loop,
> so probably they'll fix that part. If not, I'll have a look again.

Sorry about that Patrick. I had sent some mail last might and thought it
went to the list.

Both James and I can reproduce this problem. I was able to reproduce
using a hack to scsi_debug.

The loop problem is related to scsi error handling using the common code
of scsi_queue_insert / scsi_requesT_fn . When a command gets started the
scsi_init_cmd_errh function is called which sets retries to 0.

There maybe another issue with the scsi_eh_get_sense function, but I am
still looking at it.

I believe James is still pursuing a solution also.

-andmike
--
Michael Anderson
[email protected]

2003-03-07 22:11:00

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

===== drivers/scsi/scsi_error.c 1.38 vs edited =====
--- 1.38/drivers/scsi/scsi_error.c Sat Feb 22 10:17:01 2003
+++ edited/drivers/scsi/scsi_error.c Fri Mar 7 15:39:51 2003
@@ -515,7 +515,7 @@
* actually did complete normally.
*/
if (rtn == SUCCESS) {
- int rtn = scsi_eh_completed_normally(scmd);
+ rtn = scsi_eh_completed_normally(scmd);
SCSI_LOG_ERROR_RECOVERY(3,
printk("%s: scsi_eh_completed_normally %x\n",
__FUNCTION__, rtn));
@@ -545,20 +545,20 @@
static int scsi_request_sense(struct scsi_cmnd *scmd)
{
static unsigned char generic_sense[6] =
- {REQUEST_SENSE, 0, 0, 0, 255, 0};
- unsigned char scsi_result0[256], *scsi_result = &scsi_result0[0];
+ {REQUEST_SENSE, 0, 0, 0, 254, 0};
+ unsigned char *scsi_result;
int saved_result;
int rtn;

memcpy(scmd->cmnd, generic_sense, sizeof(generic_sense));

- if (scmd->device->host->hostt->unchecked_isa_dma) {
- scsi_result = kmalloc(512, GFP_ATOMIC | __GFP_DMA);
- if (unlikely(!scsi_result)) {
- printk(KERN_ERR "%s: cannot allocate scsi_result.\n",
- __FUNCTION__);
- return FAILED;
- }
+ scsi_result = kmalloc(254, GFP_ATOMIC | (scmd->device->host->hostt->unchecked_isa_dma) ? __GFP_DMA : 0);
+
+
+ if (unlikely(!scsi_result)) {
+ printk(KERN_ERR "%s: cannot allocate scsi_result.\n",
+ __FUNCTION__);
+ return FAILED;
}

/*
@@ -568,11 +568,11 @@
* address (db). 0 is not a valid sense code.
*/
memset(scmd->sense_buffer, 0, sizeof(scmd->sense_buffer));
- memset(scsi_result, 0, 256);
+ memset(scsi_result, 0, 254);

saved_result = scmd->result;
scmd->request_buffer = scsi_result;
- scmd->request_bufflen = 256;
+ scmd->request_bufflen = 254;
scmd->use_sg = 0;
scmd->cmd_len = COMMAND_SIZE(scmd->cmnd[0]);
scmd->sc_data_direction = SCSI_DATA_READ;
@@ -581,13 +581,12 @@
rtn = scsi_send_eh_cmnd(scmd, SENSE_TIMEOUT);

/* last chance to have valid sense data */
- if (!SCSI_SENSE_VALID(scmd)) {
+ if(!SCSI_SENSE_VALID(scmd)) {
memcpy(scmd->sense_buffer, scmd->request_buffer,
- sizeof(scmd->sense_buffer));
+ sizeof(scmd->sense_buffer));
}

- if (scsi_result != &scsi_result0[0])
- kfree(scsi_result);
+ kfree(scsi_result);

/*
* when we eventually call scsi_finish, we really wish to complete
@@ -702,8 +701,14 @@
* if the result was normal, then just pass it along to the
* upper level.
*/
- if (rtn == SUCCESS)
+ if (rtn == SUCCESS) {
+ /* we don't want this command reissued, just
+ * finished with the sense data, so set
+ * retries to the max allowed to ensure it
+ * won't get reissued */
+ scmd->retries = scmd->allowed;
scsi_eh_finish_cmd(scmd, done_q);
+ }
if (rtn != NEEDS_RETRY)
continue;

===== drivers/scsi/scsi_lib.c 1.73 vs edited =====
--- 1.73/drivers/scsi/scsi_lib.c Sat Feb 22 14:35:33 2003
+++ edited/drivers/scsi/scsi_lib.c Fri Mar 7 15:19:46 2003
@@ -265,7 +265,6 @@
cmd->serial_number = 0;
cmd->serial_number_at_timeout = 0;
cmd->flags = 0;
- cmd->retries = 0;
cmd->abort_reason = 0;

memset(cmd->sense_buffer, 0, sizeof cmd->sense_buffer);


Attachments:
tmp.diff (3.02 kB)

2003-03-07 22:32:26

by Mike Anderson

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

James Bottomley [[email protected]] wrote:

> @@ -702,8 +701,14 @@
> * if the result was normal, then just pass it along to the
> * upper level.
> */
> - if (rtn == SUCCESS)
> + if (rtn == SUCCESS) {
> + /* we don't want this command reissued, just
> + * finished with the sense data, so set
> + * retries to the max allowed to ensure it
> + * won't get reissued */
> + scmd->retries = scmd->allowed;
> scsi_eh_finish_cmd(scmd, done_q);
> + }
> if (rtn != NEEDS_RETRY)
> continue;
>

We might need to cover the case down below if we succeed on retry without
reaching allowed.

Another option is to look into re-sending the command from the
scsi_queue_insert handler like we do with timeouts. The interface to the
device should be the same from the LLDD's point of view just delayed
some.


-andmike
--
Michael Anderson
[email protected]

2003-03-07 22:46:33

by James Bottomley

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

On Fri, 2003-03-07 at 16:43, Mike Anderson wrote:
> James Bottomley [[email protected]] wrote:
>
> > @@ -702,8 +701,14 @@
> > * if the result was normal, then just pass it along to the
> > * upper level.
> > */
> > - if (rtn == SUCCESS)
> > + if (rtn == SUCCESS) {
> > + /* we don't want this command reissued, just
> > + * finished with the sense data, so set
> > + * retries to the max allowed to ensure it
> > + * won't get reissued */
> > + scmd->retries = scmd->allowed;
> > scsi_eh_finish_cmd(scmd, done_q);
> > + }
> > if (rtn != NEEDS_RETRY)
> > continue;
> >
>
> We might need to cover the case down below if we succeed on retry without
> reaching allowed.
>
> Another option is to look into re-sending the command from the
> scsi_queue_insert handler like we do with timeouts. The interface to the
> device should be the same from the LLDD's point of view just delayed
> some.

The attached should sweep up all of this. For NEEDS_RETRY, we will
retry until we reach the max retry count.

James


Attachments:
tmp.diff (3.40 kB)

2003-03-09 01:32:09

by Osamu Tomita

[permalink] [raw]
Subject: Re: [PATCH] scsi_error fix

James Bottomley wrote:
> This is against vanilla 2.5.64 (you have to take Andries work around out
> to see the fix in the BK version).

Thanks!
My old PC98 box with wd33c93 and SCSI-1 HD/CD re-works fine
after apllied you patch.
With cset-1.1099 patch, kernel can boot but mis detect SCSI-1
CD drive.

Regards,
Osamu Tomita