2002-08-18 19:10:59

by Shane

Subject: 2.4.18-rc3aa3: dma_intr: status=0x51 errors

Hello,

I just tried running Cerberus for 15-20s and I got these errors in the
logs. I do use the nasty binary drivers but I replicated the errors from
a fresh boot without them ever being loaded. Can someone tell me what
these errors mean? And are they dangerous? Are there some docs on these
error codes such that I could translate them myself without having to
bother you guys?

The motherboard is an MSI KT133A
I use LVM on that drive and ext3
The controller the drive is on is a Promise Ultra 133 TX2
The drive is:

/dev/hdg:

Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664133005196
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3
ATA-4 ATA-5


Aug 18 14:49:58 mars kernel: invalidate: busy buffer
Aug 18 14:49:58 mars last message repeated 21 times
Aug 18 14:50:01 mars CROND[1863]: (root) CMD (/usr/lib/sa/sa1 1 1)
Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61193, sector=61192
Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61192
Aug 18 14:50:55 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:55 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61195, sector=61194
Aug 18 14:50:55 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61194

I also ran badblocks -v -s -n -b 4096 -c 128 /dev/hdg1 65000 55000 and
it found nothing.

More info:

00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
(rev 03)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South]
(rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
(rev 40)
00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink]
(rev 74)
00:0c.0 Multimedia video controller: Brooktree Corporation Bt848 TV with
DMA push (rev 12)
00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev
05)
00:0d.1 Input device controller: Creative Labs SB Live! (rev 05)
00:0e.0 Unknown mass storage controller: Promise Technology, Inc.:
Unknown device 4d69 (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV11 (GeForce2 MX)
(rev a1)

Regards,

Shane


2002-08-18 19:24:37

by Alan

Subject: Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors

On Sun, 2002-08-18 at 20:10, Shane wrote:
> I just tried running Cerberus for 15-20s and I got these errors in the
> logs. I do use the nasty binary drivers but I replicated the errors from
> a fresh boot without them ever being loaded. Can someone tell me what
> these errors mean? And are they dangerous? Are there some docs on these
> error codes such that I could translate them myself without having to
> bother you guys?


> Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61193, sector=61192
> Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61192

That's the drive logging a bad block on logical sector 61192 (be careful
with the 512-byte/1K conversions here when using badblocks).
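
On the question of translating these codes yourself: the status and error values are bit masks, and the names in braces are simply the bits that are set. A minimal sketch in Python, assuming the bit-to-name mapping matches the labels the 2.4 IDE driver prints in its log lines (the tables and the decode() helper are illustrative, not kernel code, and error bits the driver does not label are left out):

```python
# Status register bits, named after the labels seen in the kernel log.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Error register bits; only meaningful when the Error status bit is set.
ERROR_BITS = {
    0x80: "BadSector",  # reported as a CRC error instead when 0x04 is also set
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```

Both results match the text in braces in the log above: 0x51 is 0x40 + 0x10 + 0x01, and 0x40 in the error register is the uncorrectable-data bit.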


2002-08-18 20:11:09

by Andre Hedrick

Subject: Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors


Because it is a hardware error.
Your drive is attempting to reallocate sectors and is failing.

On 18 Aug 2002, Shane wrote:

> Hello,
>
> I just tried running Cerberus for 15-20s and I got these errors in the
> logs. I do use the nasty binary drivers but I replicated the errors from
> a fresh boot without them ever being loaded. Can someone tell me what
> these errors mean? And are they dangerous? Are there some docs on these
> error codes such that I could translate them myself without having to
> bother you guys?
>
> The motherboard is an MSI KT133A
> I use LVM on that drive and ext3
> The controller the drive is on is a Promise Ultra 133 TX2
> The drive is:
>
> /dev/hdg:
>
> Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664133005196
> Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
> RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
> BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
> CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
> IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
> PIO modes: pio0 pio1 pio2 pio3 pio4
> DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
> AdvancedPM=no WriteCache=enabled
> Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3
> ATA-4 ATA-5
>
>
> Aug 18 14:49:58 mars kernel: invalidate: busy buffer
> Aug 18 14:49:58 mars last message repeated 21 times
> Aug 18 14:50:01 mars CROND[1863]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61193, sector=61192
> Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61192
> Aug 18 14:50:55 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:55 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61195, sector=61194
> Aug 18 14:50:55 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61194
>
> I also ran badblocks -v -s -n -b 4096 -c 128 /dev/hdg1 65000 55000 and
> it found nothing.
>
> More info:
>
> 00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
> (rev 03)
> 00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
> 00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South]
> (rev 40)
> 00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
> 00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
> 00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
> 00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
> (rev 40)
> 00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink]
> (rev 74)
> 00:0c.0 Multimedia video controller: Brooktree Corporation Bt848 TV with
> DMA push (rev 12)
> 00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev
> 05)
> 00:0d.1 Input device controller: Creative Labs SB Live! (rev 05)
> 00:0e.0 Unknown mass storage controller: Promise Technology, Inc.:
> Unknown device 4d69 (rev 02)
> 01:00.0 VGA compatible controller: nVidia Corporation NV11 (GeForce2 MX)
> (rev a1)
>
> Regards,
>
> Shane
>

Andre Hedrick
LAD Storage Consulting Group

2002-08-18 22:43:03

by Shane

Subject: Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors

Thanks for the answers. Also, the kernel is 2.4.19-rc3aa3, not what's in
the subject.

The man page for badblocks encourages me to use e2fsck -c to run
badblocks and to not run it directly. That, in addition to the hint
from Alan that I was testing the wrong range with badblocks, led me to
want to run e2fsck -c -c /dev/vg01/biglv. Running it led to:

kernel: lvm -- lvm_blk_ioctl: unknown command 0x24b
last message repeated 1443 times
message repeated 1422 times

Then I thought, e2fsck works on filesystems and badblocks works on
partitions so maybe this is not a good idea after all? What are the
correct numbers to feed to badblocks to get it to test that portion of
the disk?
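
The arithmetic behind Alan's hint can be sketched as follows. The kernel reports 512-byte sectors counted from the start of the whole disk (dev 22:00 is hdg itself), while badblocks counts blocks of the size given with -b from the start of the device it is handed. The helper name sector_to_block and the partition start of 63 sectors are assumptions for illustration; the real start of hdg1 has to be read from the partition table (for example with fdisk -lu).

```python
SECTOR_SIZE = 512  # bytes per sector, the unit used in the kernel log


def sector_to_block(disk_sector, block_size, partition_start=0):
    """Map an absolute 512-byte disk sector to a block number inside a partition.

    partition_start is the partition's first sector on the disk, in
    512-byte sectors (0 when testing the whole disk).
    """
    relative = disk_sector - partition_start
    if relative < 0:
        raise ValueError("sector lies before the start of the partition")
    return (relative * SECTOR_SIZE) // block_size


# Sectors from the log, 4096-byte blocks as in the badblocks command above.
# The partition start of 63 is an assumed value, not taken from this system.
for sector in (61192, 61194):
    print(sector, "->", sector_to_block(sector, 4096, partition_start=63))
# 61192 -> 7641
# 61194 -> 7641
```

With -b 4096, the range 55000 to 65000 covers 512-byte sectors 440000 to 520000 of the partition, so it never touched the sectors in the log. Under the assumed offset a window of a few hundred blocks around 7641 would cover them; badblocks takes the last block before the first block, as in the command above. Block numbers inside an LVM logical volume are offset again and do not map this directly.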

Then some of these popped up a few minutes later:

smartd: Device: /dev/hda, S.M.A.R.T. Attribute: 231 Changed 11
smartd: Device: /dev/hde, S.M.A.R.T. Attribute: 231 Changed 8
smartd: Device: /dev/hdg, S.M.A.R.T. Attribute: 7 Changed -53

This reminded me that I had previously enabled SMART on this box. I
don't know much about SMART and I can't seem to find much about which
of the errors below are truly fatal and what's normal. I did a short
self test too.

I see the raw read error rate is 0 but it failed the self test in the
read element!?

Is 11964485 a large number for Hardware ECC Recovered?

The drive is totally pooched I guess? Any light you could shed on which
of the below numbers are the tell-tale signs that the drive is dying
would be appreciated.

# smartctl -a /dev/hdg
Device: MAXTOR 6L080J4 Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.

General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity
                                        was never started
Self-test execution status:      ( 112) The previous self-test completed
                                        having failed the read element
                                        of the test
Total time to complete off-line
data collection:                 (  34) Seconds
Offline data collection
Capabilities:                    (0x1b) SMART EXECUTE OFF-LINE IMMEDIATE
                                        Automatic timer ON/OFF support
                                        Suspend Offline Collection upon
                                        new command
                                        Offline surface scan supported
                                        Self-test supported
Smart Capablilities:           (0x0003) Saves SMART data before entering
                                        power-saving mode
                                        Supports SMART auto save timer
Error logging capability:        (0x01) Error logging supported
Short self-test routine
recommended polling time:        (   2) Minutes
Extended self-test routine
recommended polling time:        (  40) Minutes

Vendor Specific SMART Attributes with Thresholds:
Revision Number: 11
Attribute                      Flag    Value Worst Threshold  Raw Value
(  1)Raw Read Error Rate       0x0029  100   253   020        0
(  3)Spin Up Time              0x0027  063   063   020        4659
(  4)Start Stop Count          0x0032  100   100   008        192
(  5)Reallocated Sector Ct     0x0033  097   097   020        18
(  7)Seek Error Rate           0x000b  100   047   023        0
(  9)Power On Hours            0x0012  096   096   001        2980
( 10)Spin Retry Count          0x0026  100   100   000        0
( 11)Calibration Retry Count   0x0013  100   100   020        0
( 12)Power Cycle Count         0x0032  100   100   008        166
( 13)Read Soft Error Rate      0x000b  100   100   023        0
(194)Temperature               0x0022  082   077   042        48
(195)Hardware ECC Recovered    0x001a  100   005   000        11964485
(196)Reallocated Event Count   0x0010  100   100   020        0
(197)Current Pending Sector    0x0032  100   100   020        3
(198)Offline Uncorrectable     0x0010  100   253   000        0
(199)UDMA CRC Error Count      0x001a  197   197   000        3
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 04
ATA Error Count: 109
Non-Fatal Count: 0

Thanks,

Shane
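
One way to read the attribute table above is to compare each normalized Value (and Worst) with its Threshold, since an attribute counts as failed once the value drops to or below a non-zero threshold. Raw values are vendor-specific, so a figure like the one for Hardware ECC Recovered cannot be judged on its own. The sketch below applies that rule to a few rows from the table; the function name, the reason strings and the choice to also flag non-zero raw sector counts for attributes 5, 197 and 198 are assumptions for illustration, not an official SMART rule.

```python
# Attributes whose raw value is commonly read as a plain sector count.
WATCH_RAW = {5, 197, 198}  # reallocated, pending, offline-uncorrectable


def flag_attributes(rows):
    """Return (id, name, reason) for every attribute that deserves a look.

    rows: iterable of (attr_id, name, value, worst, threshold, raw).
    """
    flags = []
    for attr_id, name, value, worst, threshold, raw in rows:
        if threshold > 0 and value <= threshold:
            flags.append((attr_id, name, "value at or below threshold"))
        elif threshold > 0 and worst <= threshold:
            flags.append((attr_id, name, "worst value reached threshold"))
        elif attr_id in WATCH_RAW and raw > 0:
            flags.append((attr_id, name, "non-zero raw sector count"))
    return flags


# A subset of the rows from the smartctl output above.
rows = [
    (5, "Reallocated Sector Ct", 97, 97, 20, 18),
    (7, "Seek Error Rate", 100, 47, 23, 0),
    (195, "Hardware ECC Recovered", 100, 5, 0, 11964485),
    (197, "Current Pending Sector", 100, 100, 20, 3),
    (198, "Offline Uncorrectable", 100, 253, 0, 0),
]

for attr_id, name, reason in flag_attributes(rows):
    print(attr_id, name, "-", reason)
# 5 Reallocated Sector Ct - non-zero raw sector count
# 197 Current Pending Sector - non-zero raw sector count
```

No attribute here has crossed its threshold, which is consistent with the "Check S.M.A.R.T. Passed." line, but the 18 reallocated and 3 pending sectors together with the failed self-test fit the read errors in the kernel log.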


2002-08-20 18:45:20

by Gunther Mayer

Subject: Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors

Andre Hedrick wrote:

>Because it is a hardware error.
>Your drive is attempting to reallocate sectors and is failing.
>
The drive cannot relocate on an "uncorrectable read error",
as this must be communicated to the user, so he can get
the data from backup.



2002-08-21 07:27:03

by Andre Hedrick

Subject: Re: 2.4.18-rc3aa3: dma_intr: status=0x51 errors

On Tue, 20 Aug 2002, Gunther Mayer wrote:

> Andre Hedrick wrote:
>
> >Because it is a hardware error.
> >Your drive is attempting to reallocate sectors and is failing.
> >
> The drive cannot relocate on an "uncorrectable read error",
> as this must be communicated to the user, so he can get
> the data from backup.

Gunther,

Where are we in disagreement?

me: the error is reported because the drive failed to reallocate sector(s)
you: the drive cannot relocate with this error.

Oh I have a noise maker patch for Erik Anderson, I just need to add it.

Cheers,

Andre Hedrick
LAD Storage Consulting Group