2006-11-02 08:57:47

by Manu Abraham

[permalink] [raw]
Subject: hdb lost interrupt

Hi,

I have been seeing this message (twice) "hdb: lost interrupt" on my
console. This happened overnight, so in the morning i had this click
click sound from my HDD, (my heart stopped for a moment on hearing the
click-click-click, thought was a dead HDD) not responding to anything
else other than a hard RESET. On RESET, things came back to normal.




On checking the syslog, the only thing that i could find in the log that
was relevant was this.

Nov 1 12:01:22 Orbit01 kernel: [20715026.956000] hdb: dma_intr:
status=0x30 { DeviceFault SeekComplete }
Nov 1 12:01:22 Orbit01 kernel: [20715026.956000] ide: failed opcode
was: unknown
Nov 1 12:01:22 Orbit01 kernel: [20715026.956000] hda: DMA disabled
Nov 1 12:01:22 Orbit01 kernel: [20715026.956000] hdb: DMA disabled
Nov 1 12:01:29 Orbit01 kernel: [20715034.348000] ide0: reset: success



The log on a reboot,



Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] Uniform Multi-Platform
E-IDE driver Revision: 7.00alpha2
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ide: Assuming 33MHz
system bus speed for PIO modes; override with idebus=xx
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ICH5: IDE controller
at PCI slot 0000:00:1f.1
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] PCI: Enabling device
0000:00:1f.1 (0005 -> 0007)
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ACPI: PCI Interrupt
0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 17
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ICH5: chipset revision 2
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ICH5: not 100% native
mode: will probe irqs later
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ide0: BM-DMA at
0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:DMA
Nov 2 11:08:51 Orbit01 kernel: [17179571.232000] ide1: BM-DMA at
0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA
Nov 2 11:08:51 Orbit01 kernel: [17179571.520000] hda: ST340014A, ATA
DISK drive
Nov 2 11:08:51 Orbit01 kernel: [17179571.800000] hdb: WDC
WD2500BB-00DWA0, ATA DISK drive
Nov 2 11:08:51 Orbit01 kernel: [17179571.856000] ide0 at
0x1f0-0x1f7,0x3f6 on irq 14
Nov 2 11:08:51 Orbit01 kernel: [17179572.148000] hdc: ST3160212A, ATA
DISK drive
Nov 2 11:08:51 Orbit01 kernel: [17179572.932000] hdd: HL-DT-ST DVDRAM
GSA-4163B, ATAPI CD/DVD-ROM drive
Nov 2 11:08:51 Orbit01 kernel: [17179572.988000] ide1 at
0x170-0x177,0x376 on irq 15
Nov 2 11:08:51 Orbit01 kernel: [17179572.988000] hda: max request size:
512KiB
Nov 2 11:08:52 Orbit01 kernel: [17179572.992000] hda: 78165360 sectors
(40020 MB) w/2048KiB Cache, CHS=16383/255/63, UDMA(33)
Nov 2 11:08:52 Orbit01 kernel: [17179572.992000] hda: cache flushes
supported
Nov 2 11:08:52 Orbit01 kernel: [17179572.992000] hda: hda1 hda2 < hda5
hda6 >
Nov 2 11:08:52 Orbit01 kernel: [17179573.024000] hdb: max request size:
512KiB
Nov 2 11:08:52 Orbit01 kernel: [17179573.040000] hdb: 488397168 sectors
(250059 MB) w/2048KiB Cache, CHS=30401/255/63, UDMA(33)
Nov 2 11:08:52 Orbit01 kernel: [17179573.044000] hdb: cache flushes
supported
Nov 2 11:08:52 Orbit01 kernel: [17179573.044000] hdb: hdb1
Nov 2 11:08:52 Orbit01 kernel: [17179573.048000] hdc: max request size:
512KiB
Nov 2 11:08:52 Orbit01 kernel: [17179573.080000] hdc: 312581808 sectors
(160041 MB) w/2048KiB Cache, CHS=19457/255/63, UDMA(33)
Nov 2 11:08:52 Orbit01 kernel: [17179573.104000] hdc: cache flushes
supported
Nov 2 11:08:52 Orbit01 kernel: [17179573.104000] hdc: hdc1
Nov 2 11:08:52 Orbit01 kernel: [17179573.132000] hdd: ATAPI 40X DVD-ROM
DVD-R-RAM CD-R/RW drive, 2048kB Cache, UDMA(33)
Nov 2 11:08:52 Orbit01 kernel: [17179573.132000] Uniform CD-ROM driver
Revision: 3.20
Nov 2 11:08:52 Orbit01 kernel: [17179573.140000] mice: PS/2 mouse
device common for all mice
Nov 2 11:08:52 Orbit01 kernel: [17179573.140000] oprofile: using NMI
interrupt.
Nov 2 11:08:52 Orbit01 kernel: [17179573.140000] NET: Registered
protocol family 2
Nov 2 11:08:52 Orbit01 kernel: [17179573.164000] input: AT Translated
Set 2 keyboard as /class/input/input0
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] IP route cache hash
table entries: 32768 (order: 5, 131072 bytes)
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] TCP established hash
table entries: 131072 (order: 8, 1572864 bytes)
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] TCP bind hash table
entries: 65536 (order: 7, 786432 bytes)
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] TCP: Hash tables
configured (established 131072 bind 65536)
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] TCP reno registered
Nov 2 11:08:52 Orbit01 kernel: [17179573.168000] ip_conntrack version
2.4 (7168 buckets, 57344 max) - 172 bytes per conntrack
Nov 2 11:08:52 Orbit01 kernel: [17179573.272000] TCP bic registered
Nov 2 11:08:52 Orbit01 kernel: [17179573.272000] NET: Registered
protocol family 1
Nov 2 11:08:52 Orbit01 kernel: [17179573.272000] NET: Registered
protocol family 17
Nov 2 11:08:52 Orbit01 kernel: [17179573.272000] Using IPI Shortcut mode
Nov 2 11:08:52 Orbit01 kernel: [17179573.272000] Freeing unused kernel
memory: 212k freed
Nov 2 11:08:52 Orbit01 kernel: [17179573.464000] logips2pp: Detected
unknown logitech mouse model 90
Nov 2 11:08:52 Orbit01 kernel: [17179573.500000] ACPI: PCI Interrupt
0000:00:1f.2[A] -> GSI 18 (level, low) -> IRQ 17
Nov 2 11:08:52 Orbit01 kernel: [17179573.500000] ata1: SATA max
UDMA/133 cmd 0xEC00 ctl 0xE802 bmdma 0xDC00 irq 17
Nov 2 11:08:52 Orbit01 kernel: [17179573.500000] ata2: SATA max
UDMA/133 cmd 0xE400 ctl 0xE002 bmdma 0xDC08 irq 17
Nov 2 11:08:52 Orbit01 kernel: [17179573.664000] ATA: abnormal status
0x7F on port 0xEC07
Nov 2 11:08:52 Orbit01 kernel: [17179573.664000] scsi0 : ata_piix
Nov 2 11:08:52 Orbit01 kernel: [17179573.828000] ATA: abnormal status
0x7F on port 0xE407
Nov 2 11:08:52 Orbit01 kernel: [17179573.828000] scsi1 : ata_piix




the kernel/gcc versions,

Nov 1 12:05:50 Orbit01 kernel: [17179569.184000] Linux version 2.6.16
(root@Orbit01) (gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)) #9 SMP
PREEMPT Thu Jul 20 16:15:33 GST 2006


The hardware what i have is a Pentium 4 processor on an Intel D865 PERL
motherboard
(http://www.intel.com/products/motherboard/D865PERL/index.htm)


[root@Orbit01 proc]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Pentium(R) 4 CPU 2.40GHz
stepping : 7
cpu MHz : 2394.314
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 4796.49



Any idea as to what could be causing the lost interrupt ?

Regards,
Manu


2006-11-02 13:18:46

by Alan

[permalink] [raw]
Subject: Re: hdb lost interrupt

Ar Iau, 2006-11-02 am 12:57 +0400, ysgrifennodd Manu Abraham:
> Any idea as to what could be causing the lost interrupt ?

It may have been a drive fault. Check the SMART information on the drive
and see what the disk has logged recently.


2006-11-02 15:30:58

by Manu Abraham

[permalink] [raw]
Subject: Re: hdb lost interrupt

Alan Cox wrote:
> Ar Iau, 2006-11-02 am 12:57 +0400, ysgrifennodd Manu Abraham:
>> Any idea as to what could be causing the lost interrupt ?
>
> It may have been a drive fault. Check the SMART information on the drive
> and see what the disk has logged recently.
>
>

running smartctl -a gave me this ..

[root@Orbit01 ~]# smartctl -a /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce
Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model: WDC WD2500BB-00DWA0
Serial Number: WD-WMAEH2436571
Firmware Version: 15.05R15
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Nov 2 19:08:15 2006 GST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection:
Enabled.
Self-test execution status: ( 0) The previous self-test routine
completed
without error or no self-test
has ever
been run.
Total time to complete Offline
data collection: (7599) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection
support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
No General Purpose Logging support.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 95) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 200 200 051 Pre-fail Always
- 0
3 Spin_Up_Time 0x0007 152 119 021 Pre-fail Always
- 2933
4 Start_Stop_Count 0x0032 100 100 040 Old_age Always
- 127
5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always
- 1
7 Seek_Error_Rate 0x000b 200 200 051 Pre-fail Always
- 0
9 Power_On_Hours 0x0032 083 083 000 Old_age Always
- 12676
10 Spin_Retry_Count 0x0013 100 100 051 Pre-fail Always
- 0
11 Calibration_Retry_Count 0x0013 100 253 051 Pre-fail Always
- 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always
- 97
194 Temperature_Celsius 0x0022 118 253 000 Old_age Always
- 32
196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always
- 1
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always
- 5
198 Offline_Uncorrectable 0x0012 200 200 000 Old_age Always
- 0
199 UDMA_CRC_Error_Count 0x000a 200 253 000 Old_age Always
- 0
200 Multi_Zone_Error_Rate 0x0009 200 155 051 Pre-fail
Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


[root@Orbit01 ~]# smartctl -H /dev/hdb
smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce
Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED




After which i ran,

[root@Orbit01 ~]# smartctl -t long /dev/hdb

This started to give the same click's as earlier, dmesg, has quite some
info ..



[17207074.632000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.632000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.632000] ide: failed opcode was: unknown
[17207074.632000] end_request: I/O error, dev hdd, sector 0
[17207074.632000] Buffer I/O error on device hdd, logical block 0
[17207074.632000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.632000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.632000] ide: failed opcode was: unknown
[17207074.632000] end_request: I/O error, dev hdd, sector 0
[17207074.632000] Buffer I/O error on device hdd, logical block 0
[17207074.636000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.636000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.636000] ide: failed opcode was: unknown
[17207074.636000] end_request: I/O error, dev hdd, sector 0
[17207074.636000] Buffer I/O error on device hdd, logical block 0
[17207074.636000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.636000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.636000] ide: failed opcode was: unknown
[17207074.636000] end_request: I/O error, dev hdd, sector 0
[17207074.636000] Buffer I/O error on device hdd, logical block 0
[17207074.640000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.640000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.640000] ide: failed opcode was: unknown
[17207074.640000] end_request: I/O error, dev hdd, sector 0
[17207074.640000] Buffer I/O error on device hdd, logical block 0

..

[17207074.796000] end_request: I/O error, dev hdd, sector 0
[17207074.796000] hdd: command error: status=0x51 { DriveReady
SeekComplete Error }
[17207074.796000] hdd: command error: error=0x54 { AbortedCommand
LastFailedSense=0x05 }
[17207074.796000] ide: failed opcode was: unknown
[17207074.796000] end_request: I/O error, dev hdd, sector 0
[17207293.072000] ISO 9660 Extensions: Microsoft Joliet Level 3
[17207293.108000] ISO 9660 Extensions: RRIP_1991A
[17207531.412000] hdb: dma_intr: status=0x30 { DeviceFault SeekComplete }
[17207531.412000] ide: failed opcode was: unknown
[17207531.412000] hda: DMA disabled
[17207531.412000] hdb: DMA disabled
[17207534.628000] ide0: reset: success
[17208781.840000] hdb: drive_cmd: status=0x30 { DeviceFault SeekComplete }
[17208781.840000] ide: failed opcode was: 0xb0


stopping smartctrl (-X) caused the clicks to go away.


Regards,
Manu

2006-11-02 15:47:17

by Alan

[permalink] [raw]
Subject: Re: hdb lost interrupt

Ar Iau, 2006-11-02 am 19:30 +0400, ysgrifennodd Manu Abraham:
> running smartctl -a gave me this ..

Thanks

> [17207074.632000] hdd: command error: status=0x51 { DriveReady
> SeekComplete Error }
> [17207074.632000] hdd: command error: error=0x54 { AbortedCommand
> LastFailedSense=0x05 }

CD stuff so not related.

> [17207531.412000] hdb: dma_intr: status=0x30 { DeviceFault SeekComplete }
> [17207531.412000] ide: failed opcode was: unknown
> [17207531.412000] hda: DMA disabled
> [17207531.412000] hdb: DMA disabled
> [17207534.628000] ide0: reset: success
> [17208781.840000] hdb: drive_cmd: status=0x30 { DeviceFault SeekComplete }
> [17208781.840000] ide: failed opcode was: 0xb0

No idea, it appears the drive got cross and went for a sulk but I've no
idea why and the diagnostics aren't sufficient to tell

2006-11-02 17:20:19

by Manu Abraham

[permalink] [raw]
Subject: Re: hdb lost interrupt

Alan Cox wrote:
> Ar Iau, 2006-11-02 am 19:30 +0400, ysgrifennodd Manu Abraham:
>> running smartctl -a gave me this ..
>
> Thanks
>
>> [17207074.632000] hdd: command error: status=0x51 { DriveReady
>> SeekComplete Error }
>> [17207074.632000] hdd: command error: error=0x54 { AbortedCommand
>> LastFailedSense=0x05 }
>
> CD stuff so not related.
>
>> [17207531.412000] hdb: dma_intr: status=0x30 { DeviceFault SeekComplete }
>> [17207531.412000] ide: failed opcode was: unknown
>> [17207531.412000] hda: DMA disabled
>> [17207531.412000] hdb: DMA disabled
>> [17207534.628000] ide0: reset: success
>> [17208781.840000] hdb: drive_cmd: status=0x30 { DeviceFault SeekComplete }
>> [17208781.840000] ide: failed opcode was: 0xb0
>
> No idea, it appears the drive got cross and went for a sulk but I've no
> idea why and the diagnostics aren't sufficient to tell
>

Is there something that i can look out for, such that debugging this
would be easier ?


Thanks,
Manu


2006-11-02 19:50:41

by Alan

[permalink] [raw]
Subject: Re: hdb lost interrupt

Ar Iau, 2006-11-02 am 21:20 +0400, ysgrifennodd Manu Abraham:
> > No idea, it appears the drive got cross and went for a sulk but I've no
> > idea why and the diagnostics aren't sufficient to tell
> >
>
> Is there something that i can look out for, such that debugging this
> would be easier ?

Not unless you can reproduce it reliably

2006-11-05 04:07:11

by Brad Campbell

[permalink] [raw]
Subject: Re: hdb lost interrupt

Manu Abraham wrote:

> 196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always
> - 1
> 197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always
> - 5

You have grown defects pending reallocation. A read to any one of these will cause the drive to
return an error.

Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams