2003-07-24 11:02:09

by Tugrul Galatali

Subject: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

After months of using 2.5.x with stability on my box, and using
2.6.0-test1 since the day after its release (with the 20030714 ACPI
patch), I had two seemingly random SCSI hangs today. One shortly after
I booted the box in the afternoon, and one after about 15 hours of
uptime. I was busy the first time around, but the second time I managed
to scp out a copy of the current dmesg to another box before a hard
power down.

Can somebody translate the error in the dmesg into English and advise
me on whether I want to change something in the software or the
hardware?

http://acm.cs.nyu.edu/~tugrul/scsi/

Thanks in advance,
Tugrul Galatali


2003-07-24 17:01:31

by Justin T. Gibbs

Subject: Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

> After months of using 2.5.x with stability on my box, and using
> 2.6.0-test1 since the day after its release (with the 20030714 ACPI patch),
> I had two seemingly random SCSI hangs today. One shortly after I booted the
> box in the afternoon, and one after about 15 hours of uptime. I was busy the
> first time around, but the second time I managed to scp out a copy of the
> current dmesg to another box before a hard power down.
>
> Can somebody translate the error in the dmesg into English and advise
> me on whether I want to change something in the software or the hardware?

What the controller is saying is that the drive attempted to complete
a command it knew nothing about. At the time of the failure, the only
command outstanding on the device had tag identifier 0x3c. The drive
came back with a tag identifier of 0x20. This looks like a drive
firmware bug, but a bug in the aic7xxx driver cannot be completely
ruled out without a SCSI bus trace of the failure. All of the state in the
aic7xxx driver is consistent (disconnected cache matches the pending list)
which leads me to conclude that a drive firmware bug is more likely. Why
would this happen now? Most drive firmware bugs are load dependent. They
often will only occur when two commands with just the right characteristics
overlap. It may well be that a recent change in the 2.5/2.6 kernel has
caused a subtle change in I/O behavior that exposes this issue.
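
To make the mismatch concrete, here is a minimal sketch of the kind of check
involved. It is an illustration only, not the aic7xxx driver's code: the
PendingCommands class, its method names, and the "READ(10)" label are all
hypothetical. The host keeps a list of the tags it has issued, and a
completion that names a tag not on that list is the inconsistency described
above.

```python
from dataclasses import dataclass, field


@dataclass
class PendingCommands:
    """Hypothetical stand-in for a host adapter's per-device pending list."""
    by_tag: dict = field(default_factory=dict)

    def issue(self, tag, description):
        # Record a command the host has sent and is still waiting on.
        self.by_tag[tag] = description

    def complete(self, tag):
        # The target reconnects and names a tag; the host must find a match.
        if tag not in self.by_tag:
            outstanding = ", ".join(hex(t) for t in sorted(self.by_tag))
            raise LookupError(
                f"target completed unknown tag {hex(tag)}; "
                f"outstanding: {outstanding or 'none'}"
            )
        return self.by_tag.pop(tag)


def demo():
    pending = PendingCommands()
    # The only outstanding command; "READ(10)" is an assumed example label.
    pending.issue(0x3C, "READ(10)")
    try:
        # The drive answers with a tag the host never issued.
        pending.complete(0x20)
    except LookupError as err:
        return str(err)
    return "ok"
```

Calling demo() reproduces the situation in the report: one command pending
under tag 0x3c, and a completion arriving for tag 0x20. The sketch cannot say
which side is at fault, which is why a bus trace would be needed to rule out
the driver.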

--
Justin

2003-07-25 00:47:27

by Tugrul Galatali

Subject: Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

On Thu, 2003-07-24 at 13:17, Justin T. Gibbs wrote:
[snip snip]
> came back with a tag identifier of 0x20. This looks like a drive
> firmware bug, but a bug in the aic7xxx driver cannot be completely
> ruled out without a SCSI bus trace of the failure.
[snip snip]

SCSI bus trace = logging? I started poking around online for how that
works, and I found a repeatable case of what I hope is the same error (one
tar from the bad SCSI disk piping into another tar onto a good SCSI
disk). One problem I ran into is that scsi_logging=X as a kernel
parameter doesn't seem to work in 2.6.0-test1, so I put in an S00 init
script to do the following:

echo "scsi log all" > /proc/scsi/scsi

The resulting /var/log/messages is ~18M, compressed down to 300k.

http://acm.cs.nyu.edu/~tugrul/scsi/messages.bz2

Is this what you need?

I did a quick test of the above case on a 2.4.21 kernel and it didn't
seem to trigger anything evil.

If it turns out to be a firmware problem, is the firmware upgradeable
or do I have to buy a new drive, in which case is there a blacklist?

Tugrul Galatali




2003-07-25 13:28:33

by Cress, Andrew R

Subject: RE: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief

Tugrul,

If it is a firmware problem, the firmware is upgradable, but you have to get
the firmware from IBM rather than Seagate. IBM has special firmware for
their ST (Seagate) OEM'd disks.

You can use the IBM utility (runs from a CD in DOS), or the sgdskfl utility
under Linux from scsirastools.sf.net.

But do verify the SCSI cabling/termination first.

Andy

-----Original Message-----
From: Tugrul Galatali [mailto:[email protected]]
Sent: Thursday, July 24, 2003 9:02 PM
To: Justin T. Gibbs
Cc: [email protected]
Subject: Re: 2.6.0-test1 Adaptec aic7899 Ultra160 SCSI grief


[... snip ...]

If it turns out to be a firmware problem, is the firmware upgradeable
or do I have to buy a new drive, in which case is there a blacklist?

Tugrul Galatali