Hi,
I'm using a Promise controller driving four IDE hard disks, set up as a
software RAID0 array. Lately I've been getting interrupt problems that
look like this:
Oct 18 18:03:06 lion kernel: hdg: dma_timer_expiry: dma status == 0x61
Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: status=0x51 {
DriveReady SeekComplete Error }
Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: error=0x40 {
UncorrectableError }, LBAsect=53500655, sector=53500520
Oct 18 18:03:16 lion kernel: end_request: I/O error, dev 22:01 (hdg),
sector 53500520
Oct 18 18:03:16 lion kernel: blk: queue c030c85c, I/O limit 4095Mb (mask
0xffffffff)
Oct 18 18:03:21 lion kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Oct 18 18:03:21 lion kernel: hdg: read_intr: error=0x40 {
UncorrectableError }, LBAsect=53500655, sector=53500592
Oct 18 18:03:21 lion kernel: end_request: I/O error, dev 22:01 (hdg),
sector 53500592
System info:
lion:~# uname -a
Linux lion 2.4.25 #4 Mon Aug 9 15:30:49 CEST 2004 i586 GNU/Linux
lion:~# gcc --version
gcc (GCC) 3.3.2 (Debian)
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
lion:~# lspci -v
00:00.0 Host bridge: VIA Technologies, Inc. VT82C598 [Apollo MVP3] (rev 04)
Flags: bus master, medium devsel, latency 16
Memory at e0000000 (32-bit, prefetchable) [size=64M]
Capabilities: [a0] AGP version 1.0
00:01.0 PCI bridge: VIA Technologies, Inc. VT82C598/694x [Apollo
MVP3/Pro133x AGP] (prog-if 00 [Normal decode])
Flags: bus master, 66Mhz, medium devsel, latency 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 00009000-00009fff
00:07.0 ISA bridge: VIA Technologies, Inc. VT82C596 ISA [Mobile South]
(rev 12)
Subsystem: VIA Technologies, Inc. VT82C596/A/B PCI to ISA Bridge
Flags: bus master, stepping, medium devsel, latency 0
00:07.1 IDE interface: VIA Technologies, Inc.
VT82C586A/B/VT82C686/A/B/VT8233/A/C/VT8235 PIPC Bus Master IDE (rev 06)
(prog-if 8a [Master SecP PriP])
Flags: bus master, medium devsel, latency 64
I/O ports at a000 [size=16]
00:07.2 USB Controller: VIA Technologies, Inc. USB (rev 08) (prog-if 00
[UHCI])
Subsystem: VIA Technologies, Inc. (Wrong ID) USB Controller
Flags: bus master, medium devsel, latency 64, IRQ 11
I/O ports at a400 [size=32]
00:07.3 Host bridge: VIA Technologies, Inc. VT82C596 Power Management
(rev 20)
Flags: medium devsel
00:09.0 Unknown mass storage controller: Promise Technology, Inc. 20268
(rev 02) (prog-if 85)
Subsystem: Promise Technology, Inc. Ultra100TX2
Flags: bus master, 66Mhz, slow devsel, latency 64, IRQ 12
I/O ports at a800 [size=8]
I/O ports at ac00 [size=4]
I/O ports at b000 [size=8]
I/O ports at b400 [size=4]
I/O ports at b800 [size=16]
Memory at eb100000 (32-bit, non-prefetchable) [size=16K]
Expansion ROM at e8000000 [disabled] [size=16K]
Capabilities: [60] Power Management version 1
lion:~# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 5
model : 9
model name : AMD-K6(tm) 3D+ Processor
stepping : 1
cpu MHz : 451.036
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr mce cx8 pge mmx syscall 3dnow
k6_mtrr
bogomips : 897.84
hde: WDC WD800BB-32BSA0, ATA DISK drive
hdf: WDC WD800BB-32BSA0, ATA DISK drive
blk: queue c030c408, I/O limit 4095Mb (mask 0xffffffff)
blk: queue c030c544, I/O limit 4095Mb (mask 0xffffffff)
hdg: WDC WD800BB-32BSA0, ATA DISK drive
hdh: WDC WD800BB-32CCB0, ATA DISK drive
blk: queue c030c85c, I/O limit 4095Mb (mask 0xffffffff)
blk: queue c030c998, I/O limit 4095Mb (mask 0xffffffff)
Well, that is all the info I can think of.
As you can see the system is a:
AMD K6-3 450 MHz with VIA Apollo MVP3 chipset.
Promise Ultra100 TX2 controller
4 x Western Digital 80 GB ATA100
I thought the controller was dying so I bought a new one but with the
same result. Can it be that hdg is dying?
Please, CC me as I'm not subscribed to this list.
Regards,
Johan Groth
On Mon, 18 Oct 2004 20:43:23 +0100, Johan Groth <[email protected]> wrote:
> Hi,
> I'm using a Promise controller driving four IDE hard disks, set up as a
> software RAID0 array. Lately I've been getting interrupt problems that
> look like this:
>
> Oct 18 18:03:06 lion kernel: hdg: dma_timer_expiry: dma status == 0x61
> Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: status=0x51 {
> DriveReady SeekComplete Error }
> Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: error=0x40 {
> UncorrectableError }, LBAsect=53500655, sector=53500520
> Oct 18 18:03:16 lion kernel: end_request: I/O error, dev 22:01 (hdg),
> sector 53500520
> Oct 18 18:03:16 lion kernel: blk: queue c030c85c, I/O limit 4095Mb (mask
> 0xffffffff)
> Oct 18 18:03:21 lion kernel: hdg: read_intr: status=0x59 { DriveReady
> SeekComplete DataRequest Error }
> Oct 18 18:03:21 lion kernel: hdg: read_intr: error=0x40 {
> UncorrectableError }, LBAsect=53500655, sector=53500592
> Oct 18 18:03:21 lion kernel: end_request: I/O error, dev 22:01 (hdg),
> sector 53500592
...
> I thought the controller was dying so I bought a new one but with the
> same result. Can it be that hdg is dying?
Yes, you can use smartmontools (http://smartmontools.sf.net) to check the drive.
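As a rough illustration of that check, the sketch below filters the attribute table printed by `smartctl -A` down to the three counters that usually matter for bad sectors (IDs 5, 197 and 198). The helper name `sector_attrs` is made up for this example, and whether this particular drive reports all three attributes is an assumption.

```shell
#!/bin/sh
# sector_attrs: hypothetical helper, not part of smartmontools.
# Reads `smartctl -A` output on stdin and prints "ID NAME RAW_VALUE" for
#   5   Reallocated_Sector_Ct   (sectors already remapped to spares)
#   197 Current_Pending_Sector  (unreadable sectors waiting for a rewrite)
#   198 Offline_Uncorrectable   (sectors that failed an offline scan)
# The raw value is the 10th column of the attribute table.
sector_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $1, $2, $10 }'
}

# Intended use on the real machine (needs root; /dev/hdg is the suspect
# drive from the log):
#   smartctl -H /dev/hdg                 # overall health verdict
#   smartctl -A /dev/hdg | sector_attrs  # the counters above
#   smartctl -t long /dev/hdg            # start an extended self-test
#   smartctl -l selftest /dev/hdg        # read the self-test log afterwards
```

A non-zero pending or reallocated count would be consistent with the UncorrectableError lines in the log, though a clean SMART report does not prove the drive is healthy.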
On Mon, 18 Oct 2004 22:22:38 +0200, Bartlomiej Zolnierkiewicz
<[email protected]> wrote:
> On Mon, 18 Oct 2004 20:43:23 +0100, Johan Groth <[email protected]> wrote:
> > Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: error=0x40 {
> > UncorrectableError }, LBAsect=53500655, sector=53500520
The Uncorrectable Error is a dead giveaway. You have a bad sector on
your drive.
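To make the hex values in the log easier to read, here is a small sketch that decodes the ATA status and error registers into the names the kernel prints. The bit positions are the standard ATA ones. The names for bits that do not appear in this thread's log (for example DeviceFault or SectorIdNotFound) are my assumption of the driver's wording, and the function names are made up for this example.

```shell
#!/bin/sh
# names_for VALUE "MASK:Name MASK:Name ..."
# Prints the names whose mask bit is set in VALUE, highest bit first.
names_for() {
    value=$(( $1 ))
    out=""
    for pair in $2; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( value & $mask )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

decode_status() {
    # With Busy (0x80) set the other bits are not meaningful, which
    # matches the bare "{ Busy }" lines in the log.
    if [ $(( $1 & 0x80 )) -ne 0 ]; then
        echo "Busy"
        return 0
    fi
    names_for "$1" "0x40:DriveReady 0x20:DeviceFault 0x10:SeekComplete 0x08:DataRequest 0x04:CorrectedError 0x02:Index 0x01:Error"
}

decode_error() {
    names_for "$1" "0x80:BadSector 0x40:UncorrectableError 0x10:SectorIdNotFound 0x04:DriveStatusError 0x02:TrackZeroNotFound 0x01:AddrMarkNotFound"
}

decode_status 0x51   # DriveReady SeekComplete Error
decode_status 0x59   # DriveReady SeekComplete DataRequest Error
decode_error  0x40   # UncorrectableError
```

Bit 0x40 of the error register (UNC) is the drive itself reporting that it could not read the data back even after its internal error correction, which is why it points at the disk rather than at the controller or the driver.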
Ross Biro wrote:
> On Mon, 18 Oct 2004 22:22:38 +0200, Bartlomiej Zolnierkiewicz
> <[email protected]> wrote:
>
>>On Mon, 18 Oct 2004 20:43:23 +0100, Johan Groth <[email protected]> wrote:
>>
>>>Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: error=0x40 {
>>>UncorrectableError }, LBAsect=53500655, sector=53500520
>
>
> The Uncorrectable Error is a dead giveaway. You have a bad sector on
> your drive.
>
How am I supposed to fix those blocks? I've tried with e2fsck -c -c -y
/dev/md0 but that yields the following printout in the log.
Oct 19 18:12:13 lion kernel: hdg: dma_timer_expiry: dma status == 0x61
Oct 19 18:12:23 lion kernel: hdg: dma timeout retry: status=0x51 {
DriveReady SeekComplete Error }
Oct 19 18:12:23 lion kernel: hdg: dma timeout retry: error=0x40 {
UncorrectableError }, LBAsect=156145, sector=156064
Oct 19 18:12:23 lion kernel: end_request: I/O error, dev 22:01 (hdg),
sector 156064
Oct 19 18:12:24 lion kernel: blk: queue c8828afc, I/O limit 4095Mb (mask
0xffffffff)
Oct 19 18:12:29 lion kernel: hdg: read_intr: status=0x59 { DriveReady
SeekComplete DataRequest Error }
Oct 19 18:12:29 lion kernel: hdg: read_intr: error=0x40 {
UncorrectableError }, LBAsect=156145, sector=156082
Oct 19 18:12:29 lion kernel: end_request: I/O error, dev 22:01 (hdg),
sector 156082
Oct 19 18:12:49 lion kernel: hdg: dma_timer_expiry: dma status == 0x61
Oct 19 18:12:59 lion kernel: hdg: dma timeout retry: status=0x51 {
DriveReady SeekComplete Error }
Oct 19 18:12:59 lion kernel: hdg: dma timeout retry: error=0x40 {
UncorrectableError }, LBAsect=156145, sector=156072
Oct 19 18:12:59 lion kernel: end_request: I/O error, dev 22:01 (hdg),
sector 156072
This goes on for a while and after that the following appears in the log.
Oct 19 18:14:29 lion kernel:
Oct 19 18:14:29 lion kernel: hdg: status timeout: status=0xd0 { Busy }
Oct 19 18:14:29 lion kernel:
Oct 19 18:14:29 lion kernel: hdh: DMA disabled
Oct 19 18:14:29 lion kernel: PDC202XX: Secondary channel reset.
Oct 19 18:14:29 lion kernel: ide3: reset: success
Oct 19 18:14:34 lion kernel: hdh: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Oct 19 18:14:34 lion kernel:
Oct 19 18:14:34 lion kernel: hdh: status timeout: status=0xd0 { Busy }
Oct 19 18:14:34 lion kernel:
Oct 19 18:14:34 lion kernel: PDC202XX: Secondary channel reset.
Oct 19 18:14:34 lion kernel: ide3: reset: success
Oct 19 18:14:40 lion kernel: hdh: status error: status=0x58 { DriveReady
SeekComplete DataRequest }
Oct 19 18:14:40 lion kernel:
Oct 19 18:14:40 lion kernel: hdh: status timeout: status=0xd0 { Busy }
And that goes on as long as e2fsck runs. So it takes forever just to
check a couple of blocks and as it has to check > 78E6 blocks it will
take weeks.
Is there something wrong with the drivers, the controller or the hard disks?
Please CC me as I'm not on the list.
/Johan
Ross Biro wrote:
[snip]
>
> The drive still has a bad sector. You are having trouble because the
> error recovery in the Linux IDE code is not the same as Windows, and
> most drive vendors care about Windows, not the ATA spec. On top of
> that, Linux switches out of DMA mode once it hits a bad sector, so the
> drive will be very slow from then on.
>
> The only way you are going to fix the problem is if your drive has
> some spare sectors still available, and you do a write without a read
> to the bad sector.
OK, I'm pretty sure it has spare sectors. How do I write to that sector
without a read, and how do I find which sector is bad?
Sorry for all these questions, but this is the first time I've ever had
this kind of problem. SCSI disks fix bad blocks by themselves, so you
don't have to do anything.
Regards,
Johan
On Tue, 19 Oct 2004 17:17:33 +0100, Johan Groth <[email protected]> wrote:
> Ross Biro wrote:
>
>
> > On Mon, 18 Oct 2004 22:22:38 +0200, Bartlomiej Zolnierkiewicz
> > <[email protected]> wrote:
> >
> >>On Mon, 18 Oct 2004 20:43:23 +0100, Johan Groth <[email protected]> wrote:
> >>
> >>>Oct 18 18:03:16 lion kernel: hdg: dma timeout retry: error=0x40 {
> >>>UncorrectableError }, LBAsect=53500655, sector=53500520
> >
> >
> > The Uncorrectable Error is a dead giveaway. You have a bad sector on
> > your drive.
> >
> How am I supposed to fix those blocks? I've tried with e2fsck -c -c -y
> /dev/md0 but that yields the following printout in the log.
>
The drive still has a bad sector. You are having trouble because the
error recovery in the Linux IDE code is not the same as Windows, and
most drive vendors care about Windows, not the ATA spec. On top of
that, Linux switches out of DMA mode once it hits a bad sector, so the
drive will be very slow from then on.
The only way you are going to fix the problem is if your drive has
some spare sectors still available, and you do a write without a read
to the bad sector.
Ross
On Tue, 19 Oct 2004 18:23:07 +0100, Johan Groth <[email protected]> wrote:
> Ross Biro wrote:
> [snip]
>
> >
> > The drive still has a bad sector. You are having trouble because the
> > error recovery in the Linux IDE code is not the same as Windows, and
> > most drive vendors care about Windows, not the ATA spec. On top of
> > that, Linux switches out of DMA mode once it hits a bad sector, so the
> > drive will be very slow from then on.
> >
> > The only way you are going to fix the problem is if your drive has
> > some spare sectors still available, and you do a write without a read
> > to the bad sector.
>
> OK, I'm pretty sure it has spare sectors. How do I write to that sector
> without a read, and how do I find which sector is bad?
That part is easy. It's in your error message. 156064 is the bad
sector. I would use dd if=/dev/zero of=/dev/hd???? bs=512 seek=?????
count=1 to write the sector, but before I did that, I would be very
sure of my sector number. The best way I can think of to do that is
to turn off read-ahead for that device and attempt to read one sector
at a time until you find the bad one. Then reboot, double check,
reboot again, and finally write that sector out. Then you'll need to
do an fsck to fix the file system. You will have lost some data, but
it may not be clear which file(s) have been damaged.
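The steps above can be sketched roughly as follows. One thing worth checking first: in the retried single-sector reads in both logs, LBAsect minus sector is exactly 63 (156145 - 156082, and 53500655 - 53500592). That suggests, as my inference rather than anything stated in the thread, that LBAsect is the drive's absolute address and sector is relative to the partition hdg1, which would then start at sector 63. Under that assumption the sector to probe is 156145 on /dev/hdg (or 156082 on /dev/hdg1), and 156064 is only where the larger failed request began. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: verify every number by probing before writing anything.

# abs_lba PART_START REL_SECTOR
# Absolute LBA on the whole disk for a partition-relative sector.
abs_lba() {
    echo $(( $1 + $2 ))
}

# probe_sector DEVICE LBA
# Reads exactly one 512-byte sector; returns non-zero if the read fails.
# Read-ahead can widen the failing range, hence the advice to turn it
# off first (for example `hdparm -a 0 /dev/hdg`, assuming hdparm is
# installed).
probe_sector() {
    dd if="$1" of=/dev/null bs=512 skip="$2" count=1 2>/dev/null
}

# zero_sector DEVICE LBA
# Overwrites exactly one sector with zeros without reading it first, so
# the drive gets a chance to remap it to a spare. DESTROYS that sector.
# conv=notrunc only matters when DEVICE is an ordinary image file.
zero_sector() {
    dd if=/dev/zero of="$1" bs=512 seek="$2" count=1 conv=notrunc 2>/dev/null
}

# Intended order on the real machine, array stopped, backups in hand:
#   lba=$(abs_lba 63 156082)            # 156145 if hdg1 starts at 63
#   probe_sector /dev/hdg "$lba"        # should fail
#   probe_sector /dev/hdg $((lba - 1))  # neighbours should succeed
#   probe_sector /dev/hdg $((lba + 1))
#   zero_sector  /dev/hdg "$lba"        # only once the number is certain
#   probe_sector /dev/hdg "$lba"        # should now succeed
```

Because the disk is one member of a RAID0 set, zeroing a sector on /dev/hdg corrupts whatever part of /dev/md0 is striped onto it, so the fsck afterwards has to run on the array, not on the member disk.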
If you are very confident in your backups, you could just dd
if=/dev/zero of=/dev/hd???? bs=something big and wipe the whole drive.
That will remap all of the bad sectors; then just mke2fs the device
and start over.
Be careful doing any of the above: if you do it wrong, you lose data.
Even if you do it right, you lose some data, just not as much.
Ross
On Tue, 19 Oct 2004, Johan Groth wrote:
> Ross Biro wrote:
> [snip]
>
>>
>> The drive still has a bad sector. You are having trouble because the
>> error recovery in the Linux IDE code is not the same as Windows, and
>> most drive vendors care about Windows, not the ATA spec. On top of
>> that, Linux switches out of DMA mode once it hits a bad sector, so the
>> drive will be very slow from then on.
>>
>> The only way you are going to fix the problem is if your drive has
>> some spare sectors still available, and you do a write without a read
>> to the bad sector.
>
> OK, I'm pretty sure it has spare sectors. How do I write to that sector without
> a read, and how do I find which sector is bad?
>
> Sorry for all these questions, but this is the first time I've ever had this kind
> of problem. SCSI disks fix bad blocks by themselves, so you don't have
> to do anything.
>
> Regards,
> Johan
man `badblocks`
Also, if you get a BIOS screen while the machine is booting that
offers tools for SCSI (Adaptec has this), then you can use the
SCSI disk utility to replace any bad blocks. Generally, it
reads everything and relocates anything it can't read. You
may end up with corrupt files, but the disk ends up clean.
Cheers,
Dick Johnson
Penguin : Linux version 2.6.9 on an i686 machine (5537.79 GrumpyMips).
98.36% of all statistics are fiction.