2002-11-13 20:54:00

by Luben Tuikov

[permalink] [raw]
Subject: [PATCH] 3w-xxxx: additional ata->sense codes, avoid driver lockup

A few of our IDE hds were generating those errors.

The first one locks up the 3w-xxxx driver (into an inf. loop)
when the error register = 0x11; this also resulted sometimes
in kernel lockup, also when doing raidstop...

The second one (0x51 move) is for completely broken
sectors/medium -- this makes SCSI core figure this one out
quickly and do the appropriate action, rather than retry
the command several times + timeout, of course without
success.

For the status byte, we're more generous (actually only interested
in the error bit); if found, good, else the effect is the same.

--
Luben

--- linux-2.5.47/drivers/scsi/3w-xxxx.h Sun Nov 10 22:28:06 2002
+++ linux/drivers/scsi/3w-xxxx.h Wed Nov 13 15:07:28 2002
@@ -108,7 +108,9 @@
{0x01, 0x03, 0x13, 0x00}, // Address mark not found Address mark not found for data field
{0x04, 0x0b, 0x00, 0x00}, // Aborted command Aborted command
{0x10, 0x0b, 0x14, 0x00}, // ID not found Recorded entity not found
+ {0x11, 0x0b, 0x12, 0x00}, // Address Mark Not Found Address Mark Not Found For ID Field
{0x40, 0x03, 0x11, 0x00}, // Uncorrectable ECC error Unrecovered read error
+ {0x51, 0x03, 0x31, 0x00}, // Uncorrectable Data Error Medium Format Corrupted
{0x61, 0x04, 0x00, 0x00}, // Device fault Hardware error
{0x84, 0x0b, 0x47, 0x00}, // Data CRC error SCSI parity error
{0xd0, 0x0b, 0x00, 0x00}, // Device busy Aborted command
@@ -118,7 +120,6 @@
// 3ware Error SCSI Error
{0x09, 0x0b, 0x00, 0x00}, // Unrecovered disk error Aborted command
{0x37, 0x0b, 0x04, 0x00}, // Unit offline Logical unit not ready
- {0x51, 0x0b, 0x00, 0x00} // Unspecified Aborted command
};

/* Control register bit definitions */
--- linux-2.5.47/drivers/scsi/3w-xxxx.c Sun Nov 10 22:28:30 2002
+++ linux/drivers/scsi/3w-xxxx.c Wed Nov 13 15:07:28 2002
@@ -716,7 +716,7 @@

/* Attempt to return intelligent sense information */
if (fill_sense) {
- if ((command->status == 0xc7) || (command->status == 0xcb)) {
+ if (command->status & 0xC1) {
for (i=0;i<(sizeof(tw_sense_table)/sizeof(tw_sense_table[0]));i++) {
if (command->flags == tw_sense_table[i][0]) {


2002-11-13 22:17:32

by Luben Tuikov

[permalink] [raw]
Subject: Re: [PATCH] 3w-xxxx: additional ata->sense codes, avoid driver lockup

Adam Radford wrote:
>
> Luben,
>
> Thanks for submitting the patch, however, it appears part of it is wrong:
>
> - if ((command->status == 0xc7) || (command->status == 0xcb))
> {
> + if (command->status & 0xC1) {
>
> What makes you think you should not check for c7 or cb, and only check c1?

Hi Adam,

I don't really ``only'' check for 0xc1, it just shows which bits I'm
interested in (0xc1 is a mask anded with the status).

In fact, I'm only interested in the error bit (ERR), but saw what you did
and decided to stay as close to 0xc7 and 0xcb, both of whom are in the
subset of status & 0xC1, (0xC1 == BSY|DRDY|ERR). So in effect 0xC7 and 0xCB
still match.

Anyway, if you are throwing away

if (command->status & 0xC1)

then you might as well throw away flags 0x11 from tw_sense_table[]
and then we're back at ``square 1'' -- this is exactly when the
driver gets into an inf. loop and eventually locks up the machine
and the serial console prints ...
3w-xxxx: scsiX: Command failed: status = 0xc1, flags = 0x11, unit #Y.
... ad infinitum.

I was just trying to avoid this deadlock.

--
Luben

2002-11-13 23:46:49

by Luben Tuikov

[permalink] [raw]
Subject: Re: [PATCH] 3w-xxxx: additional ata->sense codes, avoid driver lockup

Adam Radford wrote:
>
> While there may need to be a fix so you don't loop on status=c1,flags=0x11,
> you should know that:
>
> command_packet->status is not a scsi or ATA register value at all.
>
> (0xC1 == BSY|DRDY|ERR).
> ^^^^^^^^^^^^^^^^^^^^^^^^ this is not true.

Really? Last time I checked the ATA spec, C1h comes
out to BSY=80h | DRDY=40h | ERR=1h.

I was *merely* trying to fix the loop on status=C1h, flags=11h.

By the use of flags (error register) and status's bits, I concluded
that while status is not *the* ATA status register, it's bits
are pretty close. For this reason I used the C1h *mask* to make
everyone happy.

Yes, I did assume that it is massaged by the controller,
but with a closed hardware spec and a bug, I had to start
somewhere.

--
Luben