2006-12-26 12:57:19

by Erik Ohrnberger

Subject: System / libata IDE controller woes (long)

First off, Merry Christmas, Season's Greetings and Happy Holidays!

Hang on, this is a bit of a long story, but I think that you'll need the
information and background.

I want what amounts to a NAS, which I'd like to build on Gentoo Linux. I'm
familiar with Gentoo and with EVMS, so I think I'm pretty well
prepared from this perspective.

Earlier this year, when I started putting it together, I gathered my
hardware. A decent 2 GHz Athlon system with 512 MB RAM, DVD drive, a 40 GB
system drive, and a 500 Watt power supply. Then I started adding hard
disks. To date, I've got 5 80 GB PATA, 2 200 GB PATA, and 1 60 GB PATA.

I mounted the drives on a set of aluminum rails that I had a friend make for
me. They run vertically, and have slots through which screws are tightened
into the drives' standard mounting holes. All the data cables
are 80-conductor cables, and run pretty much straight to the controller cards,
while the power pigtails fan out on the side of the 'tower'.

With all these hard drives, I also got 3 Promise 20269 IDE controllers.
After I put it all together, I created 2 logical volumes: one linked EVMS
LV, and one RAID5 across the 5 80 GB drives. To support this configuration, I
connected the drives in the following manner (using /dev/hdX notation):

ide0: /dev/hda = System boot disk (Motherboard)
      /dev/hdb = DVD ROM
ide1: /dev/hdc = nothing
      /dev/hdd = nothing
ide2: /dev/hde = raid disk (First Promise card)
      /dev/hdf = lvm disk
ide3: /dev/hdg = raid disk
      /dev/hdh = lvm disk
ide4: /dev/hdi = raid disk (Second Promise card)
      /dev/hdj = lvm disk
ide5: /dev/hdk = raid disk
      /dev/hdl = nothing
ide6: /dev/hdm = raid disk (Third Promise card)
      /dev/hdn = nothing
ide7: /dev/hdo = nothing
      /dev/hdp = nothing

From what I understood, this is how you want to connect a set of RAID drives
so that no one controller is overloaded with I/O. But I had to use the
other ports to connect the LVM disks.
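
For readers unfamiliar with the /dev/hdX notation, here is a minimal Python
sketch of how the old IDE driver names devices: each channel ideN carries two
devices, a master and a slave, lettered consecutively from hda. The function
name ide_channel_devices is made up for this illustration.

```python
import string


def ide_channel_devices(channel: int) -> tuple[str, str]:
    """Return the (master, slave) /dev/hdX names for old-IDE channel ideN."""
    letters = string.ascii_lowercase
    master = letters[2 * channel]      # ide0 -> 'a', ide1 -> 'c', ...
    slave = letters[2 * channel + 1]   # ide0 -> 'b', ide1 -> 'd', ...
    return f"/dev/hd{master}", f"/dev/hd{slave}"


# Print the mapping for the eight channels listed in the layout (ide0..ide7).
for n in range(8):
    master, slave = ide_channel_devices(n)
    print(f"ide{n}: master={master} slave={slave}")
```

So ide3, the channel that shows up in the error log, carries /dev/hdg and
/dev/hdh.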

I started to get 'dma_timer_expiry' errors (see message file extract below):

Dec 22 21:29:33 livecd hdg: dma_timer_expiry: dma status == 0x21
Dec 22 21:29:43 livecd hdg: DMA timeout error
Dec 22 21:29:43 livecd hdg: dma timeout error: status=0x50 { DriveReady SeekComplete }
Dec 22 21:29:43 livecd ide: failed opcode was: unknown
Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
Dec 22 21:29:43 livecd ide: failed opcode was: unknown
Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
Dec 22 21:29:43 livecd ide: failed opcode was: unknown
Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
Dec 22 21:29:43 livecd ide: failed opcode was: unknown
Dec 22 21:29:43 livecd hdg: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Dec 22 21:29:43 livecd hdg: task_in_intr: error=0x04 { DriveStatusError }
Dec 22 21:29:43 livecd ide: failed opcode was: unknown
Dec 22 21:29:43 livecd PDC202XX: Secondary channel reset.
Dec 22 21:29:43 livecd ide3: reset: success
Dec 22 21:30:03 livecd hdg: dma_timer_expiry: dma status == 0x21
Dec 22 21:30:15 livecd hdg: DMA timeout error
Dec 22 21:30:15 livecd hdg: dma timeout error: status=0x80 { Busy }
Dec 22 21:30:15 livecd ide: failed opcode was: unknown
Dec 22 21:30:15 livecd hdg: DMA disabled
Dec 22 21:30:15 livecd PDC202XX: Secondary channel reset.
Dec 22 21:30:20 livecd ide3: reset: success
Dec 22 21:36:58 livecd hdg: irq timeout: status=0x80 { Busy }
Dec 22 21:36:58 livecd ide: failed opcode was: unknown
Dec 22 21:36:58 livecd PDC202XX: Secondary channel reset.
Dec 22 21:37:33 livecd ide3: reset timed-out, status=0x80
Dec 22 21:37:33 livecd hdg: status timeout: status=0x80 { Busy }
Dec 22 21:37:33 livecd ide: failed opcode was: unknown
Dec 22 21:37:33 livecd PDC202XX: Secondary channel reset.
Dec 22 21:37:33 livecd hdg: drive not ready for command
Dec 22 21:37:48 livecd ide3: reset: success
Dec 22 21:37:58 livecd hdg: lost interrupt
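
The status= values in the log above are the ATA status register, one flag per
bit. A minimal Python sketch of the decoding follows. The names for 0x80,
0x40, 0x10 and 0x01 are taken from the log itself; the names for the
remaining bits follow the standard ATA status register layout and are labels
assumed for this illustration. The "dma status == 0x21" value is a different
register (the bus-master DMA status) and is not decoded here.

```python
# ATA status register bits, highest bit first.
# Names for bits not printed in the log above are assumed labels.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",     # assumed label
    0x10: "SeekComplete",
    0x08: "DataRequest",     # assumed label
    0x04: "CorrectedError",  # assumed label
    0x02: "Index",           # assumed label
    0x01: "Error",
}


def decode_status(status: int) -> list[str]:
    """Return the names of the flags set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS.items() if status & bit]


for value in (0x50, 0x51, 0x80):
    print(f"status=0x{value:02x} {{ {' '.join(decode_status(value))} }}")
```

Read this way, 0x50 is an idle, ready drive, 0x51 is a ready drive reporting
an error, and 0x80 is a drive that stays busy and never completes the command.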

These errors caused the RAID array to crash repeatedly, so I gave up on that,
changed the RAID to an EVMS drive-linked logical volume, and changed
the connections as follows:

ide0: /dev/hda = System boot disk (motherboard)
      /dev/hdb = DVD ROM
ide1: /dev/hdc = nothing
      /dev/hdd = nothing
ide2: /dev/hde = lvm1 (first Promise card)
      /dev/hdf = lvm1
ide3: /dev/hdg = lvm1
      /dev/hdh = lvm1
ide4: /dev/hdi = lvm1 (second Promise card)
      /dev/hdj = nothing
ide5: /dev/hdk = lvm2
      /dev/hdl = lvm2
ide6: /dev/hdm = lvm2 (third Promise card)
      /dev/hdn = nothing
ide7: /dev/hdo = nothing
      /dev/hdp = nothing

I still got the same dma_timer_expiry errors. I consulted this list as to how
to resolve them. The wisdom of the list recommended that I try libata
rather than the old IDE driver code. So I patched the kernel, and all
was well, for quite some time.

But then I started to get random lockups. I upgraded the kernel to 2.6.19,
which has all the libata code in it, and ran it. This didn't help. I
enabled nmi_watchdog in order to track down which drive was causing the
problems. It helped, and pointed to a drive (see message log file extract
below):

Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
Dec 25 03:13:23 storage ATA: abnormal status 0x80 on port 0xE0A817DF
Dec 25 03:13:53 storage ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
Dec 25 03:13:53 storage ata5.01: (BMDMA stat 0x1)
Dec 25 03:13:53 storage ata5.01: tag 0 cmd 0xc8 Emask 0x4 stat 0x40 err 0x0 (timeout)
Dec 25 03:14:00 storage ata5: port is slow to respond, please be patient (Status 0x80)
Dec 25 03:14:23 storage ata5: port failed to respond (30 secs, Status 0x80)
Dec 25 03:14:23 storage ata5: soft resetting port
Dec 25 03:14:30 storage ata5: port is slow to respond, please be patient (Status 0xd0)
Dec 25 03:14:53 storage ata5: port failed to respond (30 secs, Status 0xd0)
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:14:53 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:23 storage ata5.00: qc timeout (cmd 0xec)
Dec 25 03:15:23 storage ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 25 03:15:23 storage ata5.00: revalidation failed (errno=-5)
Dec 25 03:15:23 storage ata5: failed to recover some devices, retrying in 5 secs
Dec 25 03:15:28 storage ata5: soft resetting port
Dec 25 03:15:35 storage ata5: port is slow to respond, please be patient (Status 0xd0)
Dec 25 03:15:58 storage ata5: port failed to respond (30 secs, Status 0xd0)
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:15:58 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:16:28 storage ata5.00: qc timeout (cmd 0xec)
Dec 25 03:16:28 storage ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 25 03:16:28 storage ata5.00: revalidation failed (errno=-5)
Dec 25 03:16:28 storage ata5: failed to recover some devices, retrying in 5 secs
Dec 25 03:16:33 storage ata5: soft resetting port
Dec 25 03:16:41 storage ata5: port is slow to respond, please be patient (Status 0xd0)
Dec 25 03:17:04 storage ata5: port failed to respond (30 secs, Status 0xd0)
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:04 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:17:34 storage ata5.00: qc timeout (cmd 0xec)
Dec 25 03:17:34 storage ata5.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 25 03:17:34 storage ata5.00: revalidation failed (errno=-5)
Dec 25 03:17:34 storage ata5.00: disabled
Dec 25 03:17:34 storage ata5: failed to recover some devices, retrying in 5 secs
Dec 25 03:17:39 storage ATA: abnormal status 0x80 on port 0xE0A817DF
Dec 25 03:17:39 storage ata5.01: failed to IDENTIFY (I/O error, err_mask=0x40)
Dec 25 03:17:39 storage ata5.01: revalidation failed (errno=-5)
Dec 25 03:17:39 storage ata5: failed to recover some devices, retrying in 5 secs
Dec 25 03:17:51 storage ata5: port is slow to respond, please be patient (Status 0x80)
Dec 25 03:18:14 storage ata5: port failed to respond (30 secs, Status 0x80)
Dec 25 03:18:14 storage ata5: soft resetting port
Dec 25 03:18:21 storage ata5: port is slow to respond, please be patient (Status 0xd0)
Dec 25 03:18:44 storage ata5: port failed to respond (30 secs, Status 0xd0)
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:18:44 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:19:14 storage ata5.01: qc timeout (cmd 0xec)
Dec 25 03:19:14 storage ata5.01: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 25 03:19:14 storage ata5.01: revalidation failed (errno=-5)
Dec 25 03:19:14 storage ata5: failed to recover some devices, retrying in 5 secs
Dec 25 03:19:19 storage ata5: soft resetting port
Dec 25 03:19:26 storage ata5: port is slow to respond, please be patient (Status 0xd0)
Dec 25 03:19:49 storage ata5: port failed to respond (30 secs, Status 0xd0)
Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:19:49 storage ATA: abnormal status 0xD2 on port 0xE0A817DF
Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:19:49 storage ATA: abnormal status 0xD0 on port 0xE0A817DF
Dec 25 03:20:19 storage ata5.01: qc timeout (cmd 0xec)
Dec 25 03:20:19 storage ata5.01: failed to IDENTIFY (I/O error, err_mask=0x4)
Dec 25 03:20:19 storage ata5.01: revalidation failed (errno=-5)
Dec 25 03:20:19 storage ata5.01: disabled
Dec 25 03:20:20 storage ata5: EH complete
Dec 25 03:20:20 storage sd 4:0:1:0: SCSI error: return code = 0x00040000
Dec 25 03:20:20 storage end_request: I/O error, dev sdg, sector 271144
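
With logs this repetitive, a per-device tally can make it easier to see which
port or drive the errors cluster on. Below is a minimal Python sketch,
assuming syslog lines shaped like the extracts in this message. The regular
expression, the keyword list and the name tally_errors are choices made for
this illustration, not part of any kernel tool.

```python
import re
from collections import Counter

# Device token followed by a colon, e.g. "ata5.00:", "ata5:" or "hdg:".
DEVICE_RE = re.compile(r"\b(ata\d+(?:\.\d+)?|hd[a-z]|sd[a-z]):")
# Substrings treated as a sign of trouble (an assumption for this sketch).
ERROR_HINTS = ("timeout", "failed", "error", "disabled", "lost interrupt")


def tally_errors(lines):
    """Count error-looking log lines per device name."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match and any(hint in line.lower() for hint in ERROR_HINTS):
            counts[match.group(1)] += 1
    return counts


# A few lines copied from the extracts in this message.
sample = [
    "Dec 25 03:15:23 storage ata5.00: qc timeout (cmd 0xec)",
    "Dec 25 03:15:23 storage ata5.00: revalidation failed (errno=-5)",
    "Dec 25 03:20:19 storage ata5.01: disabled",
    "Dec 25 03:14:23 storage ata5: soft resetting port",
    "Dec 22 21:37:58 livecd hdg: lost interrupt",
]
print(tally_errors(sample))
```

To run it over a whole log, pass an open file object (for example the system
message file) instead of the sample list.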

However, when I take the drives out of that system, put them on another
machine's motherboard IDE connection, and run badblocks in read-only mode on
all the drives, I get no errors, not a single one on any of the drives. So
if it's not the physical I/O that's causing problems, it must be the interface?

Something is clearly wrong here, and I'm at a loss as to what it may be or
how to resolve it.

1). Could it be that this is just too many PCI IDE drive controllers in one
system? That they are fighting each other for resources when they are all
being read from and written to?

2). If it is in fact that there are too many IDE controllers, is there any
advantage in a single board with many drive connections over many boards?

3). Are there any BIOS or kernel settings that I could make that would
resolve what may be this resource contention? A slower CPU? (I have a
similar 900 MHz Athlon system and file serving is hardly CPU intensive).

4). Is it that the Promise 20269 is a bad choice of controller and I should
change to a different one? If so, what is a good choice?

5). Should I consider migrating everything to SATA with SIL 3114
controllers?

6). Should I consider reducing the number of smaller disks in favor of fewer
larger ones?

7). What if I add a SIL 3114 SATA controller and SATA disks to migrate onto?
Will I cause the same issue by adding yet another PCI hard disk controller?
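
On question 1, a rough back-of-envelope calculation may help frame the
bus-contention worry. This Python sketch assumes the Promise cards sit on a
conventional 32-bit, 33 MHz PCI bus (the lspci listing shows all three on bus
01, but the bus type is not stated in this thread), and it ignores protocol
overhead, so the numbers are upper bounds only.

```python
PCI_BUS_WIDTH_BYTES = 4     # assumption: 32-bit conventional PCI
PCI_CLOCK_HZ = 33_333_333   # assumption: 33 MHz bus clock


def pci_peak_mb_per_s() -> float:
    """Theoretical peak of the shared bus in MB/s (no overhead modelled)."""
    return PCI_BUS_WIDTH_BYTES * PCI_CLOCK_HZ / 1_000_000


def per_drive_share(active_drives: int) -> float:
    """Even split of the peak among drives that are streaming at once."""
    return pci_peak_mb_per_s() / active_drives


print(f"shared bus peak: {pci_peak_mb_per_s():.1f} MB/s")
print(f"8 drives active: {per_drive_share(8):.1f} MB/s each")
```

Under those assumptions the shared ceiling is about 133 MB/s, or roughly
17 MB/s per drive if all 8 data drives stream at once. Contention of that kind
would normally be expected to show up as lower throughput rather than as DMA
timeouts, so at most it is a partial explanation.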

lspci output for further reference:
00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 0 (rev c1)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a4)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
01:06.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
01:07.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
01:08.0 Mass storage controller: Promise Technology, Inc. 20269 (rev 02)
01:0a.0 Ethernet controller: Accton Technology Corporation SMC2-1211TX (rev 10)
02:00.0 VGA compatible controller: ATI Technologies Inc Radeon R200 QM [Radeon 9100]


2006-12-26 19:43:12

by Justin Piszcz

Subject: Re: System / libata IDE controller woes (long)

I had the same problem you did when I put 3 identical controllers
together. To get around that problem I used 2 TX133s and 1 TX100x2. I
believe this is the root cause of your problems.

Justin.


2006-12-27 03:42:13

by Tejun Heo

Subject: Re: System / libata IDE controller woes (long)

Hello,

Erik Ohrnberger wrote:
> Earlier this year, when I started putting it together, I gathered my
> hardware. A decent 2 GHz Athlon system with 512 MB RAM, DVD drive, a 40 GB
> system drive, and a 500 Watt power supply. Then I started adding hard
> disks. To date, I've got 5 80 GB PATA, 2 200 GB PATA, and 1 60 GB PATA.

That's 9 hard drives. How did you hook up your power supply? My
dual-rail 450w PS has a lot of trouble driving 9 drives no matter how I
hook it up, while my 350w power supply can happily handle the load. I
suspect it's because of how the separate 12v rails are configured in the PS.

It's nothing concrete, but I want to rule out a PS issue first. If you've got an
extra power supply (buy a cheap 350w one if you don't have one), hook half
of the drives to it and see what happens. Using a PS without a motherboard
is easy. Just ask Google.

Happy holidays.

--
tejun

2006-12-27 05:51:05

by Erik Ohrnberger

Subject: RE: System / libata IDE controller woes (long)

Hi There!
Yeah, I thought that it might be power related as well, so I moved
half of the drives from the 500 Watt power supply onto a separate one, and it
did not change any of the symptoms. So I think that it's been ruled out.

Thanks,
Erik.

2006-12-27 06:02:42

by Gene Heskett

Subject: Re: System / libata IDE controller woes (long)

On Wednesday 27 December 2006 00:50, Erik Ohrnberger wrote:
>Hi There!
> Yea, I thought that it might be power related as well, so I moved
>1/2 of the drives from the 500 Watt power supply onto a separate one,
> and it did not change any of the symptoms. So I think that it's been
> ruled out.
>
> Thanks,
> Erik.

Cable lengths can be a bear too, particularly if the drive set as master
is NOT on the end connector of the cable.


--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2006 by Maurice Eugene Heskett, all rights reserved.

2007-01-08 01:41:01

by Erik Ohrnberger

Subject: RE: System / libata IDE controller woes (long)

Well, after more mucking about, and copying data off of the production LVM
on to the backup LVM, I noticed that no matter where I put this one Seagate
drive, it caused dma_timer_expiry errors. Once I replaced this drive,
everything settled down again, and has been running normally.

So it's not that the old IDE driver code can't handle that many controllers; it
can. It's also no problem for libata in a similar configuration. Both work
and work well.

I have to admit that it sure took me a long time to figure out that the
drive was the problem. I guess that sort of thing should move higher in the
diagnosis decision tree. You live, you learn.
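
The swap test that eventually exposed the drive can be written down as a
small elimination rule: a component is suspect if every trial it took part in
showed errors. Here is a minimal Python sketch of that idea. The function
name, the trial records and the labels drive-A, drive-B and the port names
are all hypothetical, invented for this illustration rather than taken from
the actual machine.

```python
from collections import defaultdict


def suspect_components(trials):
    """Return components seen only in failing trials.

    trials: iterable of (drive, port, had_errors) tuples.
    """
    seen = defaultdict(lambda: {"bad": 0, "good": 0})
    for drive, port, had_errors in trials:
        outcome = "bad" if had_errors else "good"
        for component in (("drive", drive), ("port", port)):
            seen[component][outcome] += 1
    return sorted(c for c, n in seen.items() if n["bad"] > 0 and n["good"] == 0)


# Hypothetical swap test: two drives, each tried on two ports.
trials = [
    ("drive-A", "ide3-master", True),
    ("drive-A", "ide5-master", True),
    ("drive-B", "ide3-master", False),
    ("drive-B", "ide5-master", False),
]
print(suspect_components(trials))
```

In this made-up data only drive-A fails on every port it is tried on, while
both ports work fine with drive-B, so the drive is the one component left
standing as a suspect.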

Thanks for everyone's patience and help in this. It helped me keep my
sanity through all this.

Cheers,
Erik.