LinuxLists.cc - Raid/Adaptec/SCSI errors, obvious explanation isn't

2001-10-31 20:11:20

Subject: Raid/Adaptec/SCSI errors, obvious explanation isn't

We can consistently generate 1-2 of the following errors per hour:

Oct 31 10:08:30 ccfs2 kernel: SCSI disk error : host 2 channel 0 id 9 lun 0 return code = 800
Oct 31 10:08:30 ccfs2 kernel: Current sd08:51: sense key Hardware Error
Oct 31 10:08:30 ccfs2 kernel: Additional sense indicates Internal target failure
Oct 31 10:08:30 ccfs2 kernel: I/O error: dev 08:51, sector 35371392

The errors occur on most of the 14 SCSI disks on two JBODs.
Multiple errors on the same disk always reference different sectors.
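
For readers decoding the log above: the "dev 08:51" field is a major:minor pair printed in hexadecimal. The sketch below assumes the conventional Linux numbering for the first SCSI disk major (major 8, 16 minors per disk); the function name decode_sd_dev is made up for illustration.

```python
def decode_sd_dev(dev: str) -> str:
    """Map a hex 'major:minor' string to an sdX[N] name.

    Assumes the first SCSI disk major (8) with 16 minors per disk,
    where minor % 16 == 0 is the whole disk.
    """
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError("only the first SCSI disk major (8) is handled here")
    disk, part = divmod(minor, 16)
    name = "sd" + chr(ord("a") + disk)
    return name if part == 0 else f"{name}{part}"


print(decode_sd_dev("08:51"))
```

Under those assumptions, 08:51 is the first partition of the sixth SCSI disk.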

The errors occur 1-2 per hour when we rsync a remote machine to a local
file-system.
We've produced this error only once running Bonnie++.
No other I/O activity (cp, nfs serving, etc) has caused an error.

Our hardware config:
IBM NetFinity 5100 with Dual 866 MHz CPUs (86584RY)
NetGear GA620 Gigabit Ethernet
Two IBM EXP15 Fast Wide Ultra SCSI JBODs
Adaptec 2940 Ultra SCSI adapters
Each JBOD has 7 x 36 GB 7200 RPM IBM Ultrastar 36XP drives (DRHS-36D)
Adaptec BIOS configured for 40 MB/s

Our software config:
RedHat 7.1
Kernel 2.4.9-ac9

The 14 disks are configured as two 7-disk RAID0 volumes formatted with ext3. We've
reproduced the problem with one RAID0 volume per JBOD, and also splitting
volumes so they span both JBODs (4 disks on one and 3 on the other). We did
this because we suspect the errors may be caused by excessive I/O load on a
JBOD.

This error started happening *after* we upgraded to the above from the following
software:
RedHat 6.2
Kernel 2.4.2
Two RAID0 volumes with ReiserFS

Previous postings have suggested hardware (disk) failures or a bug in the RAID
<-> Adaptec driver interaction. We think disk failures are unlikely since they
are happening on multiple disks and only after a software upgrade.

We once tested 15K drives on these EXP15 JBODs and encountered SCSI disks/driver
errors, so we've suspected some type of JBOD problem under high load.

Anyhow, does anyone have a clue as to what might be causing these errors, what
tests we could conduct to shed light on the problem, or additional information
we could provide that would be useful?

We're considering running the following tests:

- reduce the SCSI disk transfer rates to below 40 MB/s
- try 2 and 3 disk RAID0/EXT3 file-systems
- other kernels?

Regards,

JP Navarro

2001-10-31 20:49:40

by Justin T. Gibbs


Subject: Re: Raid/Adaptec/SCSI errors, obvious explanation isn't

>We can consistently generate 1-2 of the following errors per hour:
>
>Oct 31 10:08:30 ccfs2 kernel: SCSI disk error : host 2 channel 0 id 9 lun 0 return code = 800
>Oct 31 10:08:30 ccfs2 kernel: Current sd08:51: sense key Hardware Error
>Oct 31 10:08:30 ccfs2 kernel: Additional sense indicates Internal target failure
>Oct 31 10:08:30 ccfs2 kernel: I/O error: dev 08:51, sector 35371392

...
>Previous postings have suggested hardware (disk) failures or a bug in the RAID
><-> Adaptec driver interaction. We think disk failures are unlikely since
>they are happening on multiple disks and only after a software upgrade.
>
>We once tested 15K drives on these EXP15 JBODs and encountered SCSI disks/driver
>errors, so we've suspected some type of JBOD problem under high load.
>
>Anyhow, does anyone have a clue as to what might be causing these errors, what
>tests we could conduct to shed light on the problem, or additional information
>we could provide that would be useful?

It's hard for me to believe that the aic7xxx driver could "make up" sense
information returned from a drive that actually parsed correctly into a
valid set of error codes. What I can believe is that after one error
occurs, this error shows up in commands that completed normally. The
Linux SCSI mid-layer assumes that if the first byte of the sense buffer
is non-zero, it has been filled in regardless of the SCSI status byte
that is returned by the driver. Up until the 6.2.0 aic7xxx driver, the
sense buffer's first byte was not zeroed out prior to executing a new
command. This could result in false positives in certain situations.
If you ask me, the DRIVER_SENSE flag should only be set by the low level
driver in the case of auto-sense, or by the mid-layer when manual sense
recovery is successful (this latter case is somewhat questionable since
a driver that can do autosense but failed may have cleared the real sense
info already). Poking around in a potentially unused buffer and guessing
that its contents imply one thing or the other is bad design.
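
The false-positive mechanism described above can be illustrated with a toy model. This is only a sketch, not the actual aic7xxx or mid-layer code: the names ToyCommand, toy_driver_run and midlayer_thinks_sense_valid are hypothetical, and the only behaviour modelled is "a reused sense buffer whose first byte is checked for non-zero".

```python
GOOD = 0x00
CHECK_CONDITION = 0x02


class ToyCommand:
    """Stand-in for a reused command structure with an 18-byte sense buffer."""

    def __init__(self):
        self.sense = bytearray(18)


def toy_driver_run(cmd, status, sense=b"", zero_first_byte=False):
    """Hypothetical low-level driver: optionally clears sense[0] before the
    command, and fills the buffer only when the target reports an error."""
    if zero_first_byte:
        cmd.sense[0] = 0
    if status == CHECK_CONDITION:
        cmd.sense[: len(sense)] = sense
    return status


def midlayer_thinks_sense_valid(cmd):
    """The heuristic criticised above: a non-zero first byte means 'filled in',
    regardless of the status the driver returned."""
    return cmd.sense[0] != 0


# Sense data for a real failure: fixed format, Hardware Error, ASC 0x44.
real_sense = bytes([0x70, 0, 0x04, 0, 0, 0, 0, 0x0A, 0, 0, 0, 0, 0x44, 0x00])

old = ToyCommand()
toy_driver_run(old, CHECK_CONDITION, real_sense)       # genuine error
toy_driver_run(old, GOOD)                              # next command succeeds
stale_hit = midlayer_thinks_sense_valid(old)           # stale data still there

new = ToyCommand()
toy_driver_run(new, CHECK_CONDITION, real_sense, zero_first_byte=True)
toy_driver_run(new, GOOD, zero_first_byte=True)
clean_hit = midlayer_thinks_sense_valid(new)

print("without zeroing:", stale_hit, "| with zeroing:", clean_hit)
```

In this model the one genuine error is reported again for a later command that completed normally unless the first byte is cleared before each command.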

Anyway, the hardware error is in part real. If you modify the code that
prints out the error information to include the ASC and ASCQ code, the
drive vendor may be able to tell you exactly what is going wrong with your
drive. If you upgrade to a later version of the aic7xxx driver (6.2.4 is
the latest), the number of errors you encounter may decrease due to the
bug I listed above.

--
Justin

2001-11-07 16:53:35

by JP Navarro


Subject: Re: Raid/Adaptec/SCSI errors, obvious explanation isn't

Justin,

We upgraded to 2.4.13-ac8 with aic7xxx 6.2.4. We also modified the kernel to
print the ASC and ASCQ codes. We've now seen the error only once:

SCSI disk error : host 2 channel 0 id 8 lun 0 return code = 8000002
Current sd08:51: sense key Hardware Error
Additional sense indicates Internal target failure
ASC=44 ASCQ= 0
I/O error: dev 08:51, sector 65536016

How can we find out what ASC=44 means?
Is this a retriable failure and is it recovering?
Would this error cause corruption? (we haven't seen any)
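
As an aside on the "return code" field in the log above, here is a minimal sketch of how it could be split. It assumes the value is printed in hexadecimal and packs the status, message, host and driver bytes from low to high, as the Linux SCSI mid-layer headers of that era describe; the constant values are quoted from memory of those headers and should be checked against the kernel source in use.

```python
DRIVER_SENSE = 0x08      # driver byte: sense data is available (assumed value)
CHECK_CONDITION = 0x02   # SCSI status byte: target has sense data to report


def split_result(result: int) -> dict:
    """Split a 32-bit SCSI result word into its four bytes."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


fields = split_result(int("8000002", 16))
print(fields)
print("DRIVER_SENSE set:", bool(fields["driver"] & DRIVER_SENSE))
print("CHECK CONDITION:", fields["status"] == CHECK_CONDITION)
```

Read this way, 8000002 is a CHECK CONDITION status with the DRIVER_SENSE flag set, which fits a command that really did return sense data.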

On a side note:

Repeated attempts to produce the error have failed due to some other kernel
problem: two processes doing I/O to separate raid/ext3 volumes stop in "D" and
"S" states. The D state process is doing a simple copy from local ext2 to local
raid/ext3, the S state process is doing an rsync of a remote machine. The D
process can't be killed, but when the S process is killed the D process
continues. These processes are otherwise unrelated (they aren't even doing I/O to
the same file-system).

Thanks,

JP

2001-11-08 21:42:59

by Cress, Andrew R


Subject: RE: Raid/Adaptec/SCSI errors, obvious explanation isn't

JP,

Sense Key = 04 (Hardware Error)
ASC = 0x44, ASCQ = 0x00
0x44,00 = Internal target failure
See SCSI-3 spec, Table 66 for descriptions. This text is also what the
error message returns.
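
For illustration, the three fields above can be pulled out of a fixed-format sense buffer as sketched below. The byte offsets are those of the standard fixed sense format; the lookup tables contain only the values seen in this thread, and the sample buffer is constructed for the example rather than captured from the failing drive.

```python
# Only the entries discussed in this thread; the full tables are in the spec.
SENSE_KEYS = {0x4: "Hardware Error"}
ASC_ASCQ = {(0x44, 0x00): "Internal target failure"}


def parse_fixed_sense(buf: bytes):
    """Return (sense_key, asc, ascq) from fixed-format sense data."""
    if len(buf) < 14:
        raise ValueError("sense buffer too short to hold ASC/ASCQ")
    if (buf[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return buf[2] & 0x0F, buf[12], buf[13]


def describe(buf: bytes) -> str:
    key, asc, ascq = parse_fixed_sense(buf)
    key_text = SENSE_KEYS.get(key, "unknown")
    asc_text = ASC_ASCQ.get((asc, ascq), "unknown")
    return (f"sense key 0x{key:X} ({key_text}), "
            f"ASC=0x{asc:02X} ASCQ=0x{ascq:02X} ({asc_text})")


# Constructed sample: current error, Hardware Error, ASC/ASCQ 44/00.
sample = bytearray(18)
sample[0] = 0x70    # response code: current error, fixed format
sample[2] = 0x04    # sense key in the low nibble
sample[7] = 0x0A    # additional sense length (18 - 8)
sample[12] = 0x44   # ASC
sample[13] = 0x00   # ASCQ

print(describe(bytes(sample)))
```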

The problem should be reported to the disk manufacturer. They will want to
see a SCSI trace of the problem, the model, firmware level, and mode page
configuration of the disk. There are some conditions on the disk itself
that are dependent on timing & configuration, so your software change could
have exposed a disk problem that was previously latent. Some errors can be
resolved with disk firmware upgrades or mode page changes; this may be one
of those.

BTW, I always turn off SMART & WCE in the mode pages for maximum disk
reliability.

Andy
