[please cc: me as I am not subscribed to lkml]
I've recently started getting errors like this (this example is from
2.4.20-pre3-ac2):
Jan 9 14:20:48 carthage kernel: hda: dma_timer_expiry: dma status == 0x61
Jan 9 14:20:48 carthage kernel: hdc: dma_timer_expiry: dma status == 0x21
Jan 9 14:20:58 carthage kernel: hda: timeout waiting for DMA
Jan 9 14:20:58 carthage kernel: PDC202XX: Primary channel reset.
Jan 9 14:20:58 carthage kernel: PDC202XX: Secondary channel reset.
Jan 9 14:20:58 carthage kernel: hda: DMA disabled
Jan 9 14:20:58 carthage kernel: hda: timeout waiting for DMA
Jan 9 14:20:58 carthage kernel: blk: queue c03c2860, I/O limit 4095Mb (mask 0xffffffff)
Jan 9 14:20:58 carthage kernel: hdc: timeout waiting for DMA
Jan 9 14:20:58 carthage kernel: PDC202XX: Secondary channel reset.
Jan 9 14:20:58 carthage kernel: PDC202XX: Primary channel reset.
Jan 9 14:20:58 carthage kernel: hdc: DMA disabled
Jan 9 14:20:58 carthage kernel: hdc: timeout waiting for DMA
Jan 9 14:20:58 carthage kernel: blk: queue c03c2cac, I/O limit 4095Mb (mask 0xffffffff)
I have a Promise 20267 PCI IDE controller card on an Epox 8RDA motherboard.
The motherboard is brand new and I never got these kinds of errors with
my previous MSI K7T Turbo board. There are two drives on the card:
hda: WDC WD400BB-00AUA1, ATA DISK drive
hdc: WDC WD400BB-00DEA0, ATA DISK drive
which are each alone on separate channels of the controller. I've tried
both 2.4 and 2.5 kernels (2.4.20, 2.4.20-ac2, 2.4.20-pre3-ac2,
2.5.[53-55]) and get the same errors.
Does anyone have any idea what is causing this? I can offer more
information (.config etc.) if necessary.
--
James Curbo <[email protected]> <[email protected]>
http://www.adtrw.org/blogs/hannibal/
James Curbo wrote:
>[please cc: me as I am not subscribed to lkml]
>
>I've recently started getting errors like this (this example is from
>2.4.20-pre3-ac2):
>
>Jan 9 14:20:48 carthage kernel: hda: dma_timer_expiry: dma status == 0x61
>Jan 9 14:20:48 carthage kernel: hdc: dma_timer_expiry: dma status == 0x21
>
>
I believe the low bit set in the dma_status means that the DMA transfer
is still in progress. Since the timer has expired, that means it's been
in progress for 10 seconds. Odds are the drive has stopped responding.
Since it's a Western Digital drive, it probably needs to be powercycled
to come back.
I don't think this is a problem with the controller card, but I could be
wrong.
Ross
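
For anyone reading along, here is a rough sketch of how the two status bytes from the log could be decoded. It assumes the standard bus-master IDE status register layout (SFF-8038i), where bit 0 is the "active" flag Ross refers to; the Promise chip may well define extra vendor bits, and the names `BM_STATUS_BITS` and `decode_dma_status` are made up for this illustration, not taken from the kernel.

```python
# Assumed bit layout of the bus-master IDE status register (SFF-8038i).
# Vendor-specific controllers may differ; treat this as illustrative only.
BM_STATUS_BITS = {
    0x01: "DMA transfer active",
    0x02: "DMA error",
    0x04: "interrupt raised",
    0x20: "drive 0 DMA capable",
    0x40: "drive 1 DMA capable",
    0x80: "simplex only",
}


def decode_dma_status(status):
    """Return the names of the bits set in a bus-master status byte."""
    return [name for bit, name in BM_STATUS_BITS.items() if status & bit]


# The two values reported by dma_timer_expiry in the log above.
for label, value in (("hda", 0x61), ("hdc", 0x21)):
    print(f"{label}: 0x{value:02x} -> {', '.join(decode_dma_status(value))}")
```

Under that assumption, both values have the active bit set and neither has the error bit set, which fits the reading that the transfers were still pending when the timer fired.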
On Jan 09, Manish Lachwani wrote:
> Can you also get the SMART data from the drives using smartctl? Also, it
> looks like the errors are happening on both the drives. Which UDMA mode are
> you operating in?
>
> Thanks
> Manish
UDMA 5 for both of them. Here is the smartctl data:
carthage:/home/james# smartctl -v /dev/hda
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                     Flag   Value Worst Threshold Raw Value
(  1)Raw Read Error Rate      0x000b 200   199   051       0
(  3)Spin Up Time             0x0007 102   095   021       3733
(  4)Start Stop Count         0x0032 100   100   040       419
(  5)Reallocated Sector Ct    0x0032 160   160   112       160
(  7)Seek Error Rate          0x000b 100   253   051       0
(  9)Power On Hours           0x0032 079   079   000       15858
( 10)Spin Retry Count         0x0013 100   099   051       2
( 11)Calibration Retry Count  0x0013 100   100   051       0
( 12)Power Cycle Count        0x0032 100   100   000       358
(196)Reallocated Event Count  0x0032 126   126   000       74
(197)Current Pending Sector   0x0012 200   200   000       1
(198)Offline Uncorrectable    0x0012 200   200   000       0
(199)UDMA CRC Error Count     0x000a 200   253   000       65884
(200)Unknown Attribute        0x0009 200   199   051       1
carthage:/home/james# smartctl -v /dev/hdc
Vendor Specific SMART Attributes with Thresholds:
Revision Number: 16
Attribute                     Flag   Value Worst Threshold Raw Value
(  1)Raw Read Error Rate      0x000b 200   200   051       0
(  3)Spin Up Time             0x0007 101   093   021       2300
(  4)Start Stop Count         0x0032 100   100   040       96
(  5)Reallocated Sector Ct    0x0033 200   200   140       0
(  7)Seek Error Rate          0x000b 200   200   051       0
(  9)Power On Hours           0x0032 096   096   000       3325
( 10)Spin Retry Count         0x0013 100   253   051       0
( 11)Calibration Retry Count  0x0013 100   253   051       0
( 12)Power Cycle Count        0x0032 100   100   000       93
(196)Reallocated Event Count  0x0032 200   200   000       0
(197)Current Pending Sector   0x0012 200   200   000       0
(198)Offline Uncorrectable    0x0012 200   200   000       0
(199)UDMA CRC Error Count     0x000a 200   253   000       0
(200)Unknown Attribute        0x0009 200   200   051       0
--
James Curbo <[email protected]> <[email protected]>
http://www.adtrw.org/blogs/hannibal/
On Jan 09, Ross Biro wrote:
> I believe the low bit set in the dma_status means that the DMA transfer
> is still in progress. Since the timer has expired, that means it's been
> in progress for 10 seconds. Odds are the drive has stopped responding.
> Since it's a Western Digital drive, it probably needs to be powercycled
> to come back.
>
> I don't think this is a problem with the controller card, but I could be
> wrong.
>
> Ross
>
Well, I have had the first drive for about a year and a half (I think)
and the second drive since August. I never had any problems out of them
with the same controller card on my previous motherboard (MSI K7T
Turbo). The problems didn't arise until the other day when I got my new
board.
The errors occur over and over; the drive will come back for a few
seconds and then the error will occur again. I usually reboot at this
point.
--
James Curbo <[email protected]> <[email protected]>
http://www.adtrw.org/blogs/hannibal/
Check the CRC error count on the first drive, hda. It's 65884! That's
huge. CRC errors like these result in UDMA downgrades. Also, check the
reallocated sector count. A high value here means possible timeouts:
with many reallocated sectors, the drive may have to follow multiple
remappings to get to the proper sector. You should change the drive hda
and also the cable. Then try again.
Thanks
Manish
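
As an aside, the check described above can be sketched in a few lines of Python. This is only an illustration, assuming attribute lines shaped like the smartctl output posted earlier in the thread (newer smartctl versions print a different layout). The helper name `suspicious_attributes` and the choice of watched attribute IDs are assumptions made for the example, not anything smartctl itself provides.

```python
import re

# Matches lines like "(199)UDMA CRC Error Count 0x000a 200 253 000 65884".
# Groups: id, name, value, worst, threshold, raw value.
LINE_RE = re.compile(
    r"^\(\s*(\d+)\)(.+?)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$"
)

# Assumption for this sketch: a non-zero raw value in any of these
# attributes (reallocations, pending/uncorrectable sectors, CRC errors)
# is worth a closer look.
WATCHED = {5, 196, 197, 198, 199}


def suspicious_attributes(report):
    """Return (id, name, raw) for watched attributes with a non-zero raw value."""
    findings = []
    for line in report.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip headers and anything that is not an attribute row
        attr_id, name, raw = int(m.group(1)), m.group(2).strip(), int(m.group(6))
        if attr_id in WATCHED and raw > 0:
            findings.append((attr_id, name, raw))
    return findings


# A few rows copied from the hda report in this thread.
HDA_REPORT = """\
(  5)Reallocated Sector Ct 0x0032 160 160 112 160
(196)Reallocated Event Count 0x0032 126 126 000 74
(197)Current Pending Sector 0x0012 200 200 000 1
(198)Offline Uncorrectable 0x0012 200 200 000 0
(199)UDMA CRC Error Count 0x000a 200 253 000 65884
"""

for attr_id, name, raw in suspicious_attributes(HDA_REPORT):
    print(f"{attr_id:>3} {name}: raw value {raw}")
```

A large UDMA CRC error count points at the link between controller and drive (cable, connector, seating) rather than at the platters, so the cable is usually the cheapest thing to check first.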
On Jan 09, Manish Lachwani wrote:
> You should change the drive hda and also the cable. Then try again.
>
Oops! One of my IDE cables wasn't seated properly. Thanks for the help!
At least it wasn't a kernel bug :)
--
James Curbo <[email protected]> <[email protected]>
http://www.adtrw.org/blogs/hannibal/