Hello,
I just tried running Cerberus for 15-20s and I got these errors in the
logs. I do use the nasty binary drivers but I replicated the errors from
a fresh boot without them ever being loaded. Can someone tell me what
these errors mean? And are they dangerous? Are there some docs on these
error codes such that I could translate them myself without having to
bother you guys?
The motherboard is an MSI KT133A
I use LVM on that drive and ext3
The controller the drive is on is a Promise Ultra 133 TX2
The drive is:
/dev/hdg:
Model=MAXTOR 6L080J4, FwRev=A93.0500, SerialNo=664133005196
Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
RawCHS=16383/16/63, TrkSize=32256, SectSize=21298, ECCbytes=4
BuffType=DualPortCache, BuffSize=1819kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=156355584
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive Supports : ATA/ATAPI-5 T13 1321D revision 1 : ATA-1 ATA-2 ATA-3
ATA-4 ATA-5
Aug 18 14:49:58 mars kernel: invalidate: busy buffer
Aug 18 14:49:58 mars last message repeated 21 times
Aug 18 14:50:01 mars CROND[1863]: (root) CMD (/usr/lib/sa/sa1 1 1)
Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61193, sector=61192
Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61192
Aug 18 14:50:55 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
SeekComplete Error }
Aug 18 14:50:55 mars kernel: hdg: dma_intr: error=0x40 {
UncorrectableError }, LBAsect=61195, sector=61194
Aug 18 14:50:55 mars kernel: end_request: I/O error, dev 22:00 (hdg),
sector 61194
I also ran badblocks -v -s -n -b 4096 -c 128 /dev/hdg1 65000 55000 and
it found nothing.
More info:
00:00.0 Host bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133]
(rev 03)
00:01.0 PCI bridge: VIA Technologies, Inc. VT8363/8365 [KT133/KM133 AGP]
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C686 [Apollo Super South]
(rev 40)
00:07.1 IDE interface: VIA Technologies, Inc. Bus Master IDE (rev 06)
00:07.2 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.3 USB Controller: VIA Technologies, Inc. UHCI USB (rev 16)
00:07.4 Host bridge: VIA Technologies, Inc. VT82C686 [Apollo Super ACPI]
(rev 40)
00:0a.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink]
(rev 74)
00:0c.0 Multimedia video controller: Brooktree Corporation Bt848 TV with
DMA push (rev 12)
00:0d.0 Multimedia audio controller: Creative Labs SB Live! EMU10k1 (rev
05)
00:0d.1 Input device controller: Creative Labs SB Live! (rev 05)
00:0e.0 Unknown mass storage controller: Promise Technology, Inc.:
Unknown device 4d69 (rev 02)
01:00.0 VGA compatible controller: nVidia Corporation NV11 (GeForce2 MX)
(rev a1)
Regards,
Shane
On Sun, 2002-08-18 at 20:10, Shane wrote:
> I just tried running Cerberus for 15-20s and I got these errors in the
> logs. I do use the nasty binary drivers but I replicated the errors from
> a fresh boot without them ever being loaded. Can someone tell me what
> these errors mean? And are they dangerous? Are there some docs on these
> error codes such that I could translate them myself without having to
> bother you guys?
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: status=0x51 { DriveReady
> SeekComplete Error }
> Aug 18 14:50:50 mars kernel: hdg: dma_intr: error=0x40 {
> UncorrectableError }, LBAsect=61193, sector=61192
> Aug 18 14:50:50 mars kernel: end_request: I/O error, dev 22:00 (hdg),
> sector 61192
That's the drive logging a bad block on logical sector 61192 (be careful
with the 512-byte/1K conversions here when using badblocks).
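The conversion mentioned above can be sketched roughly as follows. This is only an illustration, not anything from the kernel or e2fsprogs: the function name is made up, it assumes the logged sector is counted in 512-byte units from the start of the whole disk (as the minor number 0 in "dev 22:00" suggests), and the partition start of 63 sectors is a guess that should be checked against the real partition table.

```python
SECTOR_SIZE = 512  # bytes; the kernel log counts 512-byte sectors


def sector_to_block(disk_sector, block_size, partition_start=0):
    """Map an absolute 512-byte disk sector to a block number of
    block_size bytes, counted from the start of the partition.

    Hypothetical helper for illustration only.
    """
    if block_size % SECTOR_SIZE != 0:
        raise ValueError("block size must be a multiple of 512")
    relative = disk_sector - partition_start
    if relative < 0:
        raise ValueError("sector lies before the partition start")
    return relative * SECTOR_SIZE // block_size


# ASSUMPTION: hdg1 starts at sector 63. Verify this on the real disk
# before trusting any of the numbers printed below.
PARTITION_START = 63

for sector in (61192, 61194):
    blk4k = sector_to_block(sector, 4096, PARTITION_START)
    blk1k = sector_to_block(sector, 1024, PARTITION_START)
    print(f"sector {sector}: 4096-byte block {blk4k}, 1024-byte block {blk1k}")
```

Under those assumptions both failing sectors fall in 4096-byte block 7641 of the partition. badblocks takes the last block before the first block, so "65000 55000" with -b 4096 covers blocks 55000 through 65000, which would be well past that spot.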
Because it is a hardware error.
Your drive is attempting to reallocate sectors and is failing.
Andre Hedrick
LAD Storage Consulting Group
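For translating the status= and error= values by hand, a small decoder along these lines may help. It is a sketch, not the kernel's code: the bit positions follow the ATA status and error register layout, the names that appear in the log above are spelled the way the log spells them, and the remaining names (in particular "MediaChanged" and "MediaChangeRequested") are labels assumed here for illustration.

```python
# Bit -> name tables for the ATA status and error registers.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

ERROR_BITS = {
    0x80: "BadSector/BadCRC",      # meaning depends on the transfer mode
    0x40: "UncorrectableError",
    0x20: "MediaChanged",          # assumed label
    0x10: "SectorIdNotFound",
    0x08: "MediaChangeRequested",  # assumed label
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [table[bit] for bit in sorted(table, reverse=True) if value & bit]


print("status=0x51 {", " ".join(decode(0x51, STATUS_BITS)), "}")
print("error=0x40  {", " ".join(decode(0x40, ERROR_BITS)), "}")
```

Fed the two values from the log, this yields the same names the kernel printed: DriveReady, SeekComplete and Error for the status, and UncorrectableError for the error register.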
Thanks for the answers. Also, the kernel is 2.4.19-rc3aa3, not what's in
the subject.
The man page for badblocks encourages me to use e2fsck -c to run
badblocks rather than running it directly. That, in addition to the hint
from Alan that I was testing the incorrect range with badblocks, led me
to run e2fsck -c -c /dev/vg01/biglv. That led to:
kernel: lvm -- lvm_blk_ioctl: unknown command 0x24b
last message repeated 1443 times
message repeated 1422 times
Then I thought, e2fsck works on filesystems and badblocks works on
partitions so maybe this is not a good idea after all? What are the
correct numbers to feed to badblocks to get it to test that portion of
the disk?
Then some of these popped up a few minutes later:
smartd: Device: /dev/hda, S.M.A.R.T. Attribute: 231 Changed 11
smartd: Device: /dev/hde, S.M.A.R.T. Attribute: 231 Changed 8
smartd: Device: /dev/hdg, S.M.A.R.T. Attribute: 7 Changed -53
This clued me in to the fact that I had previously enabled SMART on this
box. I don't know much about SMART and I can't seem to find much about
which of the errors below are truly fatal and what's normal. I did a
short self test too.
I see the raw read error rate is 0 but it failed the self test in the
read element!?
Is 11964485 a large number for Hardware ECC Recovered?
The drive is totally pooched I guess? Any light you could shed on which
of the below numbers are the tell-tale signs that the drive is dying
would be appreciated.
# smartctl -a /dev/hdg
Device: MAXTOR 6L080J4 Supports ATA Version 5
Drive supports S.M.A.R.T. and is enabled
Check S.M.A.R.T. Passed.
General Smart Values:
Off-line data collection status: (0x00) Offline data collection activity
was never started
Self-test execution status: ( 112) The previous self-test completed
having failed
the read element of the test
Total time to complete off-line
data collection: ( 34) Seconds
Offline data collection
Capabilities: (0x1b)SMART EXECUTE OFF-LINE IMMEDIATE
Automatic timer ON/OFF support
Suspend Offline Collection upon
new command
Offline surface scan supported
Self-test supported
Smart Capablilities: (0x0003) Saves SMART data before entering
power-saving mode
Supports SMART auto save timer
Error logging capability: (0x01) Error logging supported
Short self-test routine
recommended polling time: ( 2) Minutes
Extended self-test routine
recommended polling time: ( 40) Minutes
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 11
Attribute                    Flag    Value  Worst  Threshold  Raw Value
(  1)Raw Read Error Rate     0x0029  100    253    020        0
(  3)Spin Up Time            0x0027  063    063    020        4659
(  4)Start Stop Count        0x0032  100    100    008        192
(  5)Reallocated Sector Ct   0x0033  097    097    020        18
(  7)Seek Error Rate         0x000b  100    047    023        0
(  9)Power On Hours          0x0012  096    096    001        2980
( 10)Spin Retry Count        0x0026  100    100    000        0
( 11)Calibration Retry Count 0x0013  100    100    020        0
( 12)Power Cycle Count       0x0032  100    100    008        166
( 13)Read Soft Error Rate    0x000b  100    100    023        0
(194)Temperature             0x0022  082    077    042        48
(195)Hardware ECC Recovered  0x001a  100    005    000        11964485
(196)Reallocated Event Count 0x0010  100    100    020        0
(197)Current Pending Sector  0x0032  100    100    020        3
(198)Offline Uncorrectable   0x0010  100    253    000        0
(199)UDMA CRC Error Count    0x001a  197    197    000        3
SMART Error Log:
SMART Error Logging Version: 1
Error Log Data Structure Pointer: 04
ATA Error Count: 109
Non-Fatal Count: 0
Thanks,
Shane
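One rough way to read the attribute table above is sketched below. It is not part of smartctl, and the function name and the set of "watched" attributes are choices made here for illustration. It relies on the general SMART rule that an attribute counts as failing when its normalized value is at or below its threshold, and it additionally assumes that non-zero raw counts for attributes 5, 197 and 198 are worth a closer look. How raw values are encoded is vendor specific, so treat the result as a hint only.

```python
import re

# Matches rows such as "( 5)Reallocated Sector Ct 0x0033 097 097 020 18".
ROW = re.compile(
    r"\(\s*(\d+)\)\s*(.+?)\s+(0x[0-9a-fA-F]{4})"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# ASSUMPTION: raw counts of these attributes are plain sector counts.
WATCH_RAW = {5, 197, 198}


def review(lines):
    """Return (id, name, reason) tuples for rows that deserve attention."""
    findings = []
    for line in lines:
        m = ROW.match(line.strip())
        if not m:
            continue  # not an attribute row
        attr_id, name = int(m.group(1)), m.group(2)
        value, worst, thresh, raw = (int(m.group(i)) for i in (4, 5, 6, 7))
        if value <= thresh:
            findings.append((attr_id, name, "value at or below threshold"))
        elif worst <= thresh:
            findings.append((attr_id, name, "worst value reached threshold"))
        if attr_id in WATCH_RAW and raw > 0:
            findings.append((attr_id, name, f"raw count {raw}"))
    return findings


sample = [
    "( 5)Reallocated Sector Ct 0x0033 097 097 020 18",
    "( 7)Seek Error Rate 0x000b 100 047 023 0",
    "(195)Hardware ECC Recovered 0x001a 100 005 000 11964485",
    "(197)Current Pending Sector 0x0032 100 100 020 3",
    "(198)Offline Uncorrectable 0x0010 100 253 000 0",
]

for attr_id, name, reason in review(sample):
    print(f"{attr_id:3d} {name}: {reason}")
```

On the sample rows taken from the output above, no attribute has crossed its threshold, but the reallocated (18) and pending (3) sector counts are flagged.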
Andre Hedrick wrote:
>Because it is a hardware error.
>Your drive is attempting to reallocate sectors and is failing.
>
The drive cannot relocate on an "uncorrectable read error",
as this must be communicated to the user, so he can get
the data from backup.
On Tue, 20 Aug 2002, Gunther Mayer wrote:
> Andre Hedrick wrote:
>
> >Because it is a hardware error.
> >Your drive is attempting to reallocate sectors and is failing.
> >
> The drive cannot relocate on an "uncorrectable read error",
> as this must be communicated to the user, so he can get
> the data from backup.
Gunther,
Where are we in disagreement?
me: the error is reported because the drive failed to reallocate sector(s)
you: the drive cannot relocate with this error.
Oh I have a noise maker patch for Erik Anderson, I just need to add it.
Cheers,
Andre Hedrick
LAD Storage Consulting Group