2007-01-15 17:16:07

by Olivier Galibert

Subject: What does this scsi error mean ?

sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
ASC=0x42 ASCQ=0x0
Info fld=0x400802c
end_request: I/O error, dev sda, sector 202369
Aborting journal on device sda1.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only


It always happens on a journal write, and SMART on the disk doesn't see
a thing (no error log; short and long SMART tests pass).
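
For readers decoding the log above: the "return code" word can be split
into four bytes. The sketch below assumes the layout used by the Linux
SCSI mid-layer in 2.6-era kernels (driver, host, message and status bytes,
from high to low). The function name decode_result is made up for
illustration and is not a kernel API.

```python
# Split a Linux SCSI result word into its four bytes.
# Assumed layout (high to low): driver byte, host byte, message byte, status byte.

DRIVER_SENSE = 0x08              # driver byte flag: sense data is available
SAM_STAT_CHECK_CONDITION = 0x02  # status byte: device returned CHECK CONDITION


def decode_result(result):
    """Return the four fields of a SCSI result word as a dict (hypothetical helper)."""
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }


fields = decode_result(0x08000002)
print(fields)
print("sense data available:", bool(fields["driver"] & DRIVER_SENSE))
print("check condition:", fields["status"] == SAM_STAT_CHECK_CONDITION)
```

Under that assumption, 0x08000002 only says "the device reported CHECK
CONDITION and sense data is attached"; the actual cause is in the sense
key and ASC/ASCQ lines that follow it in the log.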

In case it is relevant (it's an IBM LS20 blade):
00:06.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8111 PCI (rev 07)
00:07.0 ISA bridge: Advanced Micro Devices [AMD] AMD-8111 LPC (rev 05)
00:07.3 Bridge: Advanced Micro Devices [AMD] AMD-8111 ACPI (rev 05)
00:0a.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0a.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:0b.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
00:0b.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:00.0 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:00.1 USB Controller: Advanced Micro Devices [AMD] AMD-8111 USB (rev 0b)
01:04.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
02:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 10)
02:01.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704S Gigabit Ethernet (rev 10)
02:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)

ioc0: LSI53C1030, FwRev=01032700h, Ports=1, MaxQ=222

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: IBM-ESXS Model: ST936701LC FN Rev: B41D
Type: Direct-Access ANSI SCSI revision: 04

OG.


2007-01-15 18:34:10

by Alan

Subject: Re: What does this scsi error mean ?

On Mon, 15 Jan 2007 18:16:02 +0100
Olivier Galibert wrote:

> sd 0:0:0:0: SCSI error: return code = 0x08000002
> sda: Current: sense key: Hardware Error
> ASC=0x42 ASCQ=0x0

I'll give you a clue: The words "Hardware Error".

Run a SCSI verify pass on the drive with some drive utilities and see
what happens. If you are lucky it'll just reallocate blocks and decide
the drive is OK; if not, well, see what the SMART data thinks.

Alan


2007-01-15 21:45:06

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Mon, Jan 15, 2007 at 06:45:40PM +0000, Alan wrote:
> On Mon, 15 Jan 2007 18:16:02 +0100
> Olivier Galibert wrote:
>
> > sd 0:0:0:0: SCSI error: return code = 0x08000002
> > sda: Current: sense key: Hardware Error
> > ASC=0x42 ASCQ=0x0
>
> I'll give you a clue: The words "Hardware Error".
>
> Run a SCSI verify pass on the drive with some drive utilities and see
> what happens. If you are lucky it'll just reallocate blocks and decide
> the drive is ok, if not well see what the smart data thinks.

Both smart and the internal blade diagnostics say "everything is a-ok
with the drive, there hasn't been any error ever except a bunch of
corrected ECC ones, and no more than with a similar drive in another
working blade". Hence my initial post. "Hardware error" is kinda
imprecise, so I was wondering whether it was unexpected controller
answer, detected transmission error, block write error, sector not
found... Is there a way to have more information?

OG.

2007-01-15 23:03:16

by Alan

Subject: Re: What does this scsi error mean ?

> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?

Well, the right place to look would indeed have been the SMART data,
provided the drive didn't get into a state where it couldn't update it.
Hardware error comes from the drive deciding something is wrong (or a
RAID card faking it, I guess). That covers everything from power
fluctuations and overheating through firmware consistency failures and
more.

If you pull the drive and test it in another box, does it show the same?
And what does a SCSI verify have to say?


Alan

2007-01-15 23:27:23

by Stefan Richter

Subject: Re: What does this scsi error mean ?

On 15 Jan, Olivier Galibert wrote:
> sd 0:0:0:0: SCSI error: return code = 0x08000002
> sda: Current: sense key: Hardware Error
> ASC=0x42 ASCQ=0x0

The Additional Sense Code means "power-on or self-test failure" FWIW.
(SPC-4 annex D)
--
Stefan Richter
-=====-=-=== ---= =----
http://arcgraph.de/sr/
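
The lookup described in the message above can be sketched as a small
table. This is only an illustration: the tables are deliberately partial
(the complete lists are in the SPC standard), the function name
describe_sense is hypothetical, and the numeric sense key 0x4 for
"Hardware Error" comes from SPC, since the kernel log prints only its name.

```python
# Partial lookup tables for the sense data quoted in this thread.
# Only a handful of entries are included; SPC defines many more.

SENSE_KEYS = {
    0x0: "No Sense",
    0x1: "Recovered Error",
    0x2: "Not Ready",
    0x3: "Medium Error",
    0x4: "Hardware Error",
    0x5: "Illegal Request",
    0x6: "Unit Attention",
    0xB: "Aborted Command",
}

# (ASC, ASCQ) pairs; just the one discussed here.
ASC_ASCQ = {
    (0x42, 0x00): "power-on or self-test failure",
}


def describe_sense(sense_key, asc, ascq):
    """Turn sense key + ASC/ASCQ into a readable string (hypothetical helper)."""
    key_name = SENSE_KEYS.get(sense_key, "unknown sense key 0x%X" % sense_key)
    detail = ASC_ASCQ.get(
        (asc, ascq), "ASC=0x%02X ASCQ=0x%02X (not in this table)" % (asc, ascq)
    )
    return "%s: %s" % (key_name, detail)


print(describe_sense(0x4, 0x42, 0x00))
```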

2007-01-15 23:35:08

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Tue, Jan 16, 2007 at 12:27:17AM +0100, Stefan Richter wrote:
> On 15 Jan, Olivier Galibert wrote:
> > sd 0:0:0:0: SCSI error: return code = 0x08000002
> > sda: Current: sense key: Hardware Error
> > ASC=0x42 ASCQ=0x0
>
> The Additional Sense Code means "power-on or self-test failure" FWIW.
> (SPC-4 annex D)

Given that it happens between 3 days and a week after bootup, on the root
drive, it's obviously not the "power-on" part. It's kinda annoying that
nothing appears in the SMART logs, though:

smartctl version 5.36 [x86_64-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

Device: IBM-ESXS ST936701LC FN Version: B41D
Serial number: 3LC0C8P000007647WLMV
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Tue Jan 16 00:33:09 2007 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK

Current Drive Temperature: 33 C
Drive Trip Temperature: 60 C
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 16206797
Blocks received from initiator = 83607272
Blocks read from cache and sent to initiator = 3311410
Number of read and write commands whose size <= segment size = 2801896
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 533.07
number of minutes until next internal SMART test = 112

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:      10474        0         0     10474      10474         61.360           0
write:         0        0         0         0          0         58.647           2

Non-medium error count: 1457822

SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background long   Completed          -       407              - [-   -    -]
# 2  Background short  Completed          -       243              - [-   -    -]

Long (extended) Self Test duration: 793 seconds [13.2 minutes]


OG.

2007-01-16 00:10:18

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> If you pull the drive and test it in another box does it show the same ?

I'm going to try that. The problem requires 3-7 days to appear, so I
won't know immediately.


> And what does a scsi verify have to say ?

Running, looks like it's gonna take a little while.

OG.

2007-01-16 15:16:32

by linux-os (Dick Johnson)

Subject: Re: What does this scsi error mean ?


On Mon, 15 Jan 2007, Olivier Galibert wrote:

> On Mon, Jan 15, 2007 at 06:45:40PM +0000, Alan wrote:
>> On Mon, 15 Jan 2007 18:16:02 +0100
>> Olivier Galibert wrote:
>>
>>> sd 0:0:0:0: SCSI error: return code = 0x08000002
>>> sda: Current: sense key: Hardware Error
>>> ASC=0x42 ASCQ=0x0
>>
>> I'll give you a clue: The words "Hardware Error".
>>
>> Run a SCSI verify pass on the drive with some drive utilities and see
>> what happens. If you are lucky it'll just reallocate blocks and decide
>> the drive is ok, if not well see what the smart data thinks.
>
> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?
>
> OG.

Correctable SCSI errors show that the data in a sector was not properly
read, but the device was able to fix the data error because of the
redundancy in the ECC. The error could be permanently fixed if you
rewrote the sector. You probably don't know where the bad sector is
without adding a printk() to driver code. Some BIOS SCSI utilities
(Adaptec) have the capability of reading an entire drive and fixing
bad sectors either by rewrite or relocation. Since drives can be
accessed as files, you could write a utility that opens the raw
device while it is NOT mounted, reads a bunch of sectors, then writes
them back. To do this, you need to verify that lseek() works on
your particular drive because you need to write the data back to
the same offset that you read it from. I mention this because
the raw r/w of an early Adaptec (aha1542) driver didn't implement
lseek; it just returned 'okay'. You can imagine the mess I made of
a drive with that controller!

Once you verify that lseek works, the rest of the code is trivial.
I suggest reading then writing 64 kilobytes at a time. It will seem
to take 'forever', but the retries on these relatively short groups
of sectors (128 sectors) will be short when errors are encountered.

Make sure the drive is either not mounted or mounted r/o.
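
A minimal sketch of the read-then-rewrite utility described above, in
Python rather than C. It is an illustration, not a tested recovery tool.
The function name rewrite_in_place is made up, the 64 KiB chunk assumes
512-byte sectors, and I/O goes through the page cache (bypassing it, for
example with O_DIRECT, is not attempted here). Try it on a scratch file
before pointing it at a device such as /dev/sdX, and only on an
unmounted device.

```python
import os

CHUNK = 64 * 1024  # 64 KiB, i.e. 128 sectors of 512 bytes (assumed sector size)


def rewrite_in_place(path, chunk=CHUNK):
    """Read `path` chunk by chunk and write each chunk back to the same offset.

    Returns the number of bytes rewritten. Hypothetical helper for illustration.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        offset = 0
        while True:
            # Seek explicitly and check where we landed, since the whole
            # scheme depends on lseek() really working on this device.
            if os.lseek(fd, offset, os.SEEK_SET) != offset:
                raise OSError("lseek did not land on offset %d" % offset)
            data = os.read(fd, chunk)
            if not data:
                break  # end of file or device
            # Go back to where this chunk started before writing it out.
            if os.lseek(fd, offset, os.SEEK_SET) != offset:
                raise OSError("lseek did not return to offset %d" % offset)
            written = 0
            while written < len(data):
                written += os.write(fd, data[written:])
            offset += len(data)
        os.fsync(fd)  # push the rewritten data out to the device
        return offset
    finally:
        os.close(fd)
```

A read error on a bad chunk surfaces as an OSError from os.read(); a real
tool would probably catch it, log the offset, and retry with a smaller
chunk to narrow down the failing sector.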

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.67 BogoMips).
New book: http://www.AbominableFirebug.com/

2007-01-16 15:36:06

by Alan

Subject: Re: What does this scsi error mean ?

> Correctable SCSI errors show that the data in a sector was not properly
> read, but the device was able to fix the data error because of the
> redundancy in the ECC. The error could be permanently fixed if you
> rewrote the sector. You probably don't know where the bad sector is

The drives do that automatically, and the SCSI verify did it for him too
if there were any other problems.

2007-01-16 17:25:16

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Tue, Jan 16, 2007 at 03:47:52PM +0000, Alan wrote:
> The drives do that automatically, and the SCSI verify did it for him too
> if there were any other problems.

The SCSI verify didn't see a thing; I'm gonna do the disk-swapping
dance.

OG.

2007-01-18 14:08:50

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> > Both smart and the internal blade diagnostics say "everything is a-ok
> > with the drive, there hasn't been any error ever except a bunch of
> > corrected ECC ones, and no more than with a similar drive in another
> > working blade". Hence my initial post. "Hardware error" is kinda
> > imprecise, so I was wondering whether it was unexpected controller
> > answer, detected transmission error, block write error, sector not
> > found... Is there a way to have more information?
>
> Well the right place to look would indeed have been the SMART data
> providing the drive didn't get into a state it couldn't update it.
> Hardware error comes from the drive deciding something is wrong (or a
> raid card faking it I guess). That covers everything from power
> fluctuations and overheating through firmware consistency failures and
> more.
>
> If you pull the drive and test it in another box does it show the same ?

OK, swapped the disks and got a crash of the same blade with the other
disk, so the problem is not the drive itself. Gonna try swapping two
blades to check if it's the power supply connector/rail.

OG.

2007-02-07 17:43:54

by Olivier Galibert

Subject: Re: What does this scsi error mean ?

On Thu, Jan 18, 2007 at 03:08:46PM +0100, Olivier Galibert wrote:
> On Mon, Jan 15, 2007 at 11:14:52PM +0000, Alan wrote:
> > > Both smart and the internal blade diagnostics say "everything is a-ok
> > > with the drive, there hasn't been any error ever except a bunch of
> > > corrected ECC ones, and no more than with a similar drive in another
> > > working blade". Hence my initial post. "Hardware error" is kinda
> > > imprecise, so I was wondering whether it was unexpected controller
> > > answer, detected transmission error, block write error, sector not
> > > found... Is there a way to have more information?
> >
> > Well the right place to look would indeed have been the SMART data
> > providing the drive didn't get into a state it couldn't update it.
> > Hardware error comes from the drive deciding something is wrong (or a
> > raid card faking it I guess). That covers everything from power
> > fluctuations and overheating through firmware consistency failures and
> > more.
> >
> > If you pull the drive and test it in another box does it show the same ?
>
> Ok, inverted the disks, got a crash of the same blade with the new
> disk, so the problem is not the drive itself. Gonna try inverting two
> blades to check if it's the power supply connector/rail.

...and it is the power supply/connector. Failure is linked to the
position of the blade in the box (as in the blade in the first
position always fails). Now that's a cute failure. Having the
support act on it is going to be fun.

OG.

PS: Yes, I did forget to send that email :-)