2002-03-24 08:03:40

by Eyal Lebedinsky

Subject: 2.4.18: many IDE errors

I have six disks in the machine, each a master on its own cable. Two are
on the on-board controller and four are on two PCI ATA-100 cards.

I am getting disk error (BadCRC) on all disks, intermittently.

I upgraded from RH 2.4.9 (with SGI xfs) to 2.4.18+xfs. Same problem.
I then applied the 2.4.18-rc1 IDE patches with no improvement (well,
the four 160GB disks are now fully visible and not clipped to 28bits
which is a nice surprise).

I checked the memory, replaced the power supply with a hefty 400W,
I even used a recent mobo (Gigabyte GA-6VTXE). No beef, practically
the same rate of errors.

I do not believe all six cables are bad (80w all). This is a UP, and
I did not enable local APIC. Should I?

Any ideas where to go from here? On request I have boot messages
(with errors) and lspci output - I prefer not to overload the list.

The setup is two WD 60GB disks (hda/hdc) which host the root fs (ext2),
a working area (md2=RAID0, most of the space) and an xfs log
(md1=RAID1).

hde/g/i/k are Maxtor 160GB, RAID5, xfs (with external log on md0).

I get errors intermittently on all, here is an example after a boot
following the creation of the raid5 (so a sync was running for about
two hours).

hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide0: reset: success
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide1: reset: success
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
md: md1: sync done.
raid5: resync finished.
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
invalidate: busy buffer


--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>


2002-03-24 08:22:35

by Andre Hedrick

Subject: Re: 2.4.18: many IDE errors


It is not a case of bad cables but maybe cable routing.
Also, four 160GB disks eat power!

I have a dual Athlon box with a similar setup and a 460W PS.
I have to wait for the PS to warm up or there is not enough juice to
properly spin up the last drive. However, if I replace the four 160GB disks
with four 20GB Seagates, no problem.

You are going to need at least a 400W PS w/ almost no ripple to make it
work. If you have this then check the cable routing.

Also run hdparm -i /dev/hdX to see if their transfer rates are reduced.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:

> I have six disks in the machine, each a master on the cable. Two are
> on the on-board controller and Four are on two PCI ATA-100 cards.
>
> I am getting disk error (BadCRC) on all disks, intermittently.
>
> I upgraded from RH 2.4.9 (with SGI xfs) to 2.4.18+xfs. Same problem.
> I then applied the 2.4.18-rc1 IDE patches with no improvement (well,
> the four 160GB disks are now fully visible and not clipped to 28bits
> which is a nice surprise).
>
> I checked the memory, replaced the power supply with a hefty 400W,
> I even used a recent mobo (Gigabyte GA-6VTXE). No beef, practically
> the same rate of errors.
>
> I do not believe all six cables are bad (80w all). This is a UP, and
> I did not enable local APIC. Should I?
>
> Any ideas where to go from here? On request I have boot messages
> (with errors) and lspci output - I prefer not to overload the list.
>
> The setup is two WD 60GB disks (hda/hdc) which host the root fs (ext2)
> a working area (md2=RAID0, most of the space) and an xfs log
> (md1=RAID1).
>
> hde/g/i/k are Maxtor 160GB, RAID5, xfs (with external log on md0).
>
> I get errors intermittently on all, here is an example after a boot
> following the creation of the raid5 (so a sync was running for about
> two hours).
>
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> ide0: reset: success

Safe transfer rate downgrade!

> hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
> ide1: reset: success

Safe transfer rate downgrade!

> hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
> hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
> md: md1: sync done.
> raid5: resync finished.
> hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
> invalidate: busy buffer

GAK!


2002-03-24 12:28:29

by Eyal Lebedinsky

Subject: Re: 2.4.18: many IDE errors

If I understand correctly, the basic answer is that this is not
a driver issue, and not a general kernel (irq's etc.) either, but
a true hardware problem.


Andre Hedrick wrote:
>
> It is not a case of bad cables but maybe cable routing.

I am not clear on what cable routing means. Can you elaborate?

> Also, four 160GB disks eat power!

Well, these disks (4G160J8) are 5400rpm. They spin up just fine
and according to spec they each use under 300mA@12V and 400mA@5V
(active, 5.2W actually) which is not that much.

That said, just because the power supply can deliver the power
does not mean the power is clean enough.
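
As a rough check on the spec figures quoted above, the worst-case draw per
disk can be bounded with simple arithmetic. This is only a sketch in Python:
the inputs are the 300 mA at 12 V and 400 mA at 5 V maxima from this message,
and the function name is made up for illustration.

```python
# Rough upper bound on per-disk power draw from the quoted spec maxima.
# Currents are kept in milliamps so the arithmetic stays in exact integers.
RAILS_MA = {12: 300, 5: 400}  # rail voltage (V) -> maximum current (mA)

def disk_power_mw(rails_ma):
    """Return total power in milliwatts: sum of volts * milliamps per rail."""
    return sum(volts * milliamps for volts, milliamps in rails_ma.items())

per_disk_mw = disk_power_mw(RAILS_MA)
four_disks_mw = 4 * per_disk_mw

print(f"per disk: {per_disk_mw / 1000:.1f} W, four disks: {four_disks_mw / 1000:.1f} W")
```

The bound comes out slightly above the 5.2W active figure, which fits the
"under" in the spec numbers. It says nothing about ripple or per-rail quality.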

> I have a box dual athlon similar setup w/ 460W ps
> I have to wait for the PS to warm up or there is not enough juice to
> properly spin up the last drive. However if I replace the four 160GB's
> with four 20GB Seagate's no problem.

A bootup is always clean, and when not stressed it can go for a long
while without any errors. However once pushed I start seeing these failures.

>
> You are going to need at least a 400W PS w/ almost no ripple to make it
> work. If you have this then check the cable routing.
>
> Also hdparm -i /dev/hdX to see if their transfer rates are reduced.

I will check the situation. I know they come up ATA-5. Here are the
relevant bootup messages:

Uniform Multi-Platform E-IDE driver Revision: 6.31
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: IDE controller on PCI bus 00 dev 39
VP_IDE: chipset revision 6
VP_IDE: not 100% native mode: will probe irqs later
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA
CMD649: IDE controller on PCI bus 00 dev 50
PCI: Found IRQ 5 for device 00:0a.0
CMD649: chipset revision 1
CMD649: not 100% native mode: will probe irqs later
CMD649: ROM enabled at 0xdff00000
ide2: BM-DMA at 0xc000-0xc007, BIOS settings: hde:pio, hdf:pio
ide3: BM-DMA at 0xc008-0xc00f, BIOS settings: hdg:pio, hdh:pio
CMD649: IDE controller on PCI bus 00 dev 60
PCI: Found IRQ 12 for device 00:0c.0
CMD649: chipset revision 2
CMD649: not 100% native mode: will probe irqs later
CMD649: ROM enabled at 0xdfe80000
ide4: BM-DMA at 0xac00-0xac07, BIOS settings: hdi:pio, hdj:pio
ide5: BM-DMA at 0xac08-0xac0f, BIOS settings: hdk:pio, hdl:pio
hda: WDC WD600BB-00BSA0, ATA DISK drive
hdc: WDC WD600BB-00BSA0, ATA DISK drive
hdd: ATAPI CD ROM DRIVE 50X MAX, ATAPI CD/DVD-ROM drive
hde: Maxtor 4G160J8, ATA DISK drive
hdg: Maxtor 4G160J8, ATA DISK drive
hdi: Maxtor 4G160J8, ATA DISK drive
hdk: Maxtor 4G160J8, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide1 at 0x170-0x177,0x376 on irq 15
ide2 at 0xd000-0xd007,0xcc02 on irq 5
ide3 at 0xc800-0xc807,0xc402 on irq 5
ide4 at 0xbc00-0xbc07,0xb802 on irq 12
ide5 at 0xb400-0xb407,0xb002 on irq 12
hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=7297/255/63, UDMA(100)
hdc: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=116301/16/63, UDMA(100)
hde: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63, UDMA(100)
hdg: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63, UDMA(100)
hdi: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63, UDMA(100)
hdk: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63, UDMA(100)

> Cheers,
>
> Andre Hedrick
> LAD Storage Consulting Group
>
> On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:
>
> > I have six disks in the machine, each a master on the cable. Two are
> > on the on-board controller and Four are on two PCI ATA-100 cards.
> >
> > I am getting disk error (BadCRC) on all disks, intermittently.
> >
> > I upgraded from RH 2.4.9 (with SGI xfs) to 2.4.18+xfs. Same problem.
> > I then applied the 2.4.18-rc1 IDE patches with no improvement (well,
> > the four 160GB disks are now fully visible and not clipped to 28bits
> > which is a nice surprise).
> >
> > I checked the memory, replaced the power supply with a hefty 400W,
> > I even used a recent mobo (Gigabyte GA-6VTXE). No beef, practically
> > the same rate of errors.
> >
> > I do not believe all six cables are bad (80w all). This is a UP, and
> > I did not enable local APIC. Should I?
> >
> > Any ideas where to go from here? On request I have boot messages
> > (with errors) and lspci output - I prefer not to overload the list.
> >
> > The setup is two WD 60GB disks (hda/hdc) which host the root fs (ext2)
> > a working area (md2=RAID0, most of the space) and an xfs log
> > (md1=RAID1).
> >
> > hde/g/i/k are Maxtor 160GB, RAID5, xfs (with external log on md0).
> >
> > I get errors intermittently on all, here is an example after a boot
> > following the creation of the raid5 (so a sync was running for about
> > two hours).
> >
> > hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
[errors trimmed]

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>

2002-03-24 22:19:29

by Vojtech Pavlik

Subject: Re: 2.4.18: many IDE errors

On Sun, Mar 24, 2002 at 11:27:59PM +1100, Eyal Lebedinsky wrote:
> If I understand correctly, the basic answer is that this is not
> a driver issue, and not a general kernel (irq's etc.) either, but
> a true hardware problem.
>
>
> Andre Hedrick wrote:
> >
> > It is not a case of bad cables but maybe cable routing.
>
> I am not clear on what cable routing means. Can you elaborate?

Make sure the cables aren't running close together and parallel to each other.
Interference can be expected then. If needed, you can apply (grounded)
aluminium foil or conductive tape on both flat sides of the cables.

> > Also, four 160GB disks eat power!
>
> Well, these disks (4G160J8) are 5400rpm, They spin up just fine
> and according to spec they each use under 300ma@12v and 400ma@5v
> (active, 5.2W actually) which is not that much.
>
> This is not to say that just because the power supply can deliver
> the power, the power is clean enough.

You can check with a scope, or at least with on-board HW monitor.

> > I have a box dual athlon similar setup w/ 460W ps
> > I have to wait for the PS to warm up or there is not enough juice to
> > properly spin up the last drive. However if I replace the four 160GB's
> > with four 20GB Seagate's no problem.
>
> A bootup is always clean, and when not stressed it can go for a long
> while
> without any errors. However once pushed I start seeing these failures.

On what controllers do you see the failures? Only the onboard or also
on the addon cards? They're UDMA transfer errors - correctable by
retrying.
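
To answer the "which controllers" question from a saved kernel log, the
BadCRC lines can be tallied per drive and mapped to their interface. The
following is a minimal Python sketch, not a tool from this thread: the sample
text is a short excerpt in the format of the messages quoted earlier, and the
helper names are hypothetical. The drive-to-interface mapping assumes two
drives per channel (hda/hdb on ide0, hdc/hdd on ide1, and so on), as the boot
messages show.

```python
import re
from collections import Counter

# Short excerpt in the same format as the kernel messages quoted in this thread.
SAMPLE_LOG = """\
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide0: reset: success
hdc: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdc: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
hdi: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hdi: dma_intr: error=0x84 { DriveStatusError BadCRC }
"""

# Matches only the "error=" line of each pair, and only when BadCRC is listed.
ERROR_LINE = re.compile(r"^(hd[a-z]): dma_intr: error=0x[0-9a-fA-F]+ \{[^}]*BadCRC[^}]*\}")

def interface_of(drive):
    """Map hda/hdb -> ide0, hdc/hdd -> ide1, ... (two drives per channel)."""
    return f"ide{(ord(drive[2]) - ord('a')) // 2}"

def count_crc_errors(log_text):
    """Count BadCRC error lines per drive name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

per_drive = count_crc_errors(SAMPLE_LOG)
for drive, n in sorted(per_drive.items()):
    print(f"{drive} ({interface_of(drive)}): {n} BadCRC")
```

Feeding it the full log instead of the excerpt would show whether the errors
cluster on the on-board channels or are spread across the add-on cards too.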

> > You are going to need at least a 400W PS w/ almost no ripple to make it
> > work. If you have this then check the cable routing.
> >
> > Also hdparm -i /dev/hdX to see if their transfer rates are reduced.
>
> I will check the situation. I know they come up ATA-5. Here are the
> relevant bootup messages:
>
> Uniform Multi-Platform E-IDE driver Revision: 6.31
> ide: Assuming 33MHz system bus speed for PIO modes; override with
> idebus=xx
> VP_IDE: IDE controller on PCI bus 00 dev 39
> VP_IDE: chipset revision 6
> VP_IDE: not 100% native mode: will probe irqs later
> ide: Assuming 33MHz system bus speed for PIO modes; override with
> idebus=xx
> VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
> ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
> ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA
> CMD649: IDE controller on PCI bus 00 dev 50
> PCI: Found IRQ 5 for device 00:0a.0
> CMD649: chipset revision 1
> CMD649: not 100% native mode: will probe irqs later
> CMD649: ROM enabled at 0xdff00000
> ide2: BM-DMA at 0xc000-0xc007, BIOS settings: hde:pio, hdf:pio
> ide3: BM-DMA at 0xc008-0xc00f, BIOS settings: hdg:pio, hdh:pio
> CMD649: IDE controller on PCI bus 00 dev 60
> PCI: Found IRQ 12 for device 00:0c.0
> CMD649: chipset revision 2
> CMD649: not 100% native mode: will probe irqs later
> CMD649: ROM enabled at 0xdfe80000
> ide4: BM-DMA at 0xac00-0xac07, BIOS settings: hdi:pio, hdj:pio
> ide5: BM-DMA at 0xac08-0xac0f, BIOS settings: hdk:pio, hdl:pio
> hda: WDC WD600BB-00BSA0, ATA DISK drive
> hdc: WDC WD600BB-00BSA0, ATA DISK drive
> hdd: ATAPI CD ROM DRIVE 50X MAX, ATAPI CD/DVD-ROM drive
> hde: Maxtor 4G160J8, ATA DISK drive
> hdg: Maxtor 4G160J8, ATA DISK drive
> hdi: Maxtor 4G160J8, ATA DISK drive
> hdk: Maxtor 4G160J8, ATA DISK drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> ide1 at 0x170-0x177,0x376 on irq 15
> ide2 at 0xd000-0xd007,0xcc02 on irq 5
> ide3 at 0xc800-0xc807,0xc402 on irq 5
> ide4 at 0xbc00-0xbc07,0xb802 on irq 12
> ide5 at 0xb400-0xb407,0xb002 on irq 12
> hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=7297/255/63,
> UDMA(100)
> hdc: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=116301/16/63,
> UDMA(100)
> hde: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdg: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdi: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdk: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
>
> > Cheers,
> >
> > Andre Hedrick
> > LAD Storage Consulting Group
> >
> > On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:
> >
> > > I have six disks in the machine, each a master on the cable. Two are
> > > on the on-board controller and Four are on two PCI ATA-100 cards.
> > >
> > > I am getting disk error (BadCRC) on all disks, intermittently.
> > >
> > > I upgraded from RH 2.4.9 (with SGI xfs) to 2.4.18+xfs. Same problem.
> > > I then applied the 2.4.18-rc1 IDE patches with no improvement (well,
> > > the four 160GB disks are now fully visible and not clipped to 28bits
> > > which is a nice surprise).
> > >
> > > I checked the memory, replaced the power supply with a hefty 400W,
> > > I even used a recent mobo (Gigabyte GA-6VTXE). No beef, practically
> > > the same rate of errors.
> > >
> > > I do not believe all six cables are bad (80w all). This is a UP, and
> > > I did not enable local APIC. Should I?
> > >
> > > Any ideas where to go from here? On request I have boot messages
> > > (with errors) and lspci output - I prefer not to overload the list.
> > >
> > > The setup is two WD 60GB disks (hda/hdc) which host the root fs (ext2)
> > > a working area (md2=RAID0, most of the space) and an xfs log
> > > (md1=RAID1).
> > >
> > > hde/g/i/k are Maxtor 160GB, RAID5, xfs (with external log on md0).
> > >
> > > I get errors intermittently on all, here is an example after a boot
> > > following the creation of the raid5 (so a sync was running for about
> > > two hours).
> > >
> > > hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
> > > hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
> [errors trimmed]
>
> --
> Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>

--
Vojtech Pavlik
SuSE Labs

2002-03-24 22:24:58

by Eyal Lebedinsky

Subject: Re: 2.4.18: many IDE errors

Roy Sigurd Karlsbakk wrote:
>
> > The PCI IDE cards I use are ATA-100, so this is the max speed available
> > to me. The four large disks can do ATA-133.
> >
> > The 48bit addressing (to allow >137GB) seems to be unrelated, and it
> > works with these cards. But I needed to apply:
> > http://www.kernel.org/pub/linux/kernel/people/hedrick/ide-2.4.18/
> > ide.2.4.18-rc1.02152002.patch.bz2
>
> but ...
>
> how can that work? i mean - 48bit addressing is in the udma133 standard
> but not in udma100...
>
> how does the /proc/mdstat and /proc/partitions look?

I am not in front of the machine, but let me tell you that mdstat shows
a full 480GB RAID5. There are no partitions (hde/g/i/k are used raw).
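
The 480GB figure matches the usual RAID5 rule of thumb, where one member's
worth of space goes to parity. A one-function Python sketch, using the nominal
160GB member size (the function name is illustrative only):

```python
def raid5_usable(disks, size_gb):
    """RAID5 spends one member's worth of space on parity: (n - 1) * size."""
    return (disks - 1) * size_gb

print(raid5_usable(4, 160))  # nominal usable GB for four 160GB members
```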

Again, I think udma133 and 48bit addressing are two independent issues.
The first is an electronic spec for the hardware, the second is an api
standard which, it seems, can run at any speed.
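
The addressing limit itself is plain arithmetic and has nothing to do with
transfer speed. As a small Python sketch, assuming the traditional 512-byte
sector and using the sector count the kernel reported for the 160GB drives in
this thread:

```python
SECTOR_BYTES = 512  # traditional ATA sector size (assumed here)

def max_bytes(lba_bits):
    """Largest capacity addressable with the given number of LBA bits."""
    return (2 ** lba_bits) * SECTOR_BYTES

lba28_limit = max_bytes(28)             # ceiling with 28-bit addressing
disk_bytes = 320173056 * SECTOR_BYTES   # sector count from the boot messages

print(lba28_limit, disk_bytes, disk_bytes > lba28_limit)
```

The 28-bit ceiling works out to roughly 137 GB, which is why these disks were
clipped before the patch, whatever UDMA mode the link runs at.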

Maybe Andre can make a clear statement here?

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>

2002-03-25 09:23:10

by Vojtech Pavlik

Subject: Re: 2.4.18: many IDE errors

On Mon, Mar 25, 2002 at 09:24:38AM +1100, Eyal Lebedinsky wrote:
> Roy Sigurd Karlsbakk wrote:
> >
> > > The PCI IDE cards I use are ATA-100, so this is the max speed available
> > > to me. The four large disks can do ATA-133.
> > >
> > > The 48bit addressing (to allow >137GB) seems to be unrelated, and it
> > > works with these cards. But I needed to apply:
> > > http://www.kernel.org/pub/linux/kernel/people/hedrick/ide-2.4.18/
> > > ide.2.4.18-rc1.02152002.patch.bz2
> >
> > but ...
> >
> > how can that work? i mean - 48bit addressing is in the udma133 standard
> > but not in udma100...
> >
> > how does the /proc/mdstat and /proc/partitions look?
>
> I am not in front of the machine, but let me tell you that mdstat shows
> a full 480GB RAID5. There are no partitions (hde/g/i/k are used raw).
>
> Again, I think udma133 and 48bit addressing are two, independent issues.
> The first is an electronic spec for the hardware, the second is an api
> standard which, it seems, can run at any speed.
>
> Maybe Andre can make a clear statement here?

You're right.

--
Vojtech Pavlik
SuSE Labs

2002-03-25 09:59:29

by Eyal Lebedinsky

Subject: Re: 2.4.18: many IDE errors

I reported CRC errors, intermittently, on all six IDE disks, each
a master on a separate cable.

Andre Hedrick wrote:
> It is not a case of bad cables but maybe cable routing.
> Also, four 160GB disks eat power!

I reorganized the cables so that each takes a different path from the
mobo to the disk. I still see errors, so far on hda. I will see if
the rate is really reduced.

> I have a box dual athlon similar setup w/ 460W ps
> I have to wait for the PS to warm up or there is not enough juice to
> properly spin up the last drive. However if I replace the four 160GB's
> with four 20GB Seagate's no problem.

I do not have spinup problems. Neither does the computer :-)

> You are going to need at least a 400W PS w/ almost no ripple to make it
> work. If you have this then check the cable routing.

The new PS should do it then.

> Also hdparm -i /dev/hdX to see if their transfer rates are reduced.

OK, I checked, and after a bus reset the disk in error was dropped
from udma5 to udma4. I did not see more errors after that (but the
setup is up for less than one day so far). I plan to push it hard
tomorrow.
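
For anyone who wants to script this check: hdparm -i marks the currently
active transfer mode with a '*'. The Python sketch below parses that marker.
The sample text is a hypothetical excerpt written for illustration, not output
captured from this machine, and the helper names are made up.

```python
import re

# Hypothetical excerpt of `hdparm -i` output; the active mode carries a '*'.
SAMPLE = """\
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 *udma4 udma5
"""

def active_mode(hdparm_i_text):
    """Return the mode name marked with '*', or None if none is marked."""
    match = re.search(r"\*(\w+)", hdparm_i_text)
    return match.group(1) if match else None

def highest_udma(hdparm_i_text):
    """Return the highest UDMA mode number the drive lists, or None."""
    modes = [int(n) for n in re.findall(r"\budma(\d)\b", hdparm_i_text)]
    return max(modes) if modes else None

def is_downgraded(hdparm_i_text):
    """True if the active mode is a UDMA mode below the highest one listed."""
    mode = active_mode(hdparm_i_text)
    top = highest_udma(hdparm_i_text)
    if mode is None or top is None or not mode.startswith("udma"):
        return False
    return int(mode[4:]) < top

print(active_mode(SAMPLE), highest_udma(SAMPLE), is_downgraded(SAMPLE))
```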

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>

2002-03-25 10:04:29

by Andre Hedrick

Subject: Re: 2.4.18: many IDE errors

On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:

> If I understand correctly, the basic answer is that this is not
> a driver issue, and not a general kernel (irq's etc.) either, but
> a true hardware problem.
>
>
> Andre Hedrick wrote:
> >
> > It is not a case of bad cables but maybe cable routing.
>
> I am not clear on what cable routing means. Can you elaborate?

If you have many cables (note there are random ways to construct them, odd/even),
you can get crosstalk (aka 0x51/0x84 kernel noise). These messages are
telling you the hardware checksum between the HOST and DEVICE failed.
The solution is to retry, and the driver does.
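
For readers decoding the 0x51/0x84 pair by hand, here is a small Python
sketch that maps the register bits to the names the driver prints. The names
for 0x51 and 0x84 are taken from the messages in this thread; the names for
the remaining bits are assumptions based on the usual ATA register layout, and
the table is illustrative rather than a copy of the driver's logic.

```python
# (mask, name) pairs, listed in the order the names appear in the log lines.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# 0x80 is labeled BadCRC here, as in the thread's messages for UDMA transfers.
ERROR_BITS = [
    (0x04, "DriveStatusError"),
    (0x80, "BadCRC"),
    (0x40, "UncorrectableError"),
    (0x10, "SectorIdNotFound"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]

def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]

print("status=0x51 {", " ".join(decode(0x51, STATUS_BITS)), "}")
print("error=0x84 {", " ".join(decode(0x84, ERROR_BITS)), "}")
```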

Now the problem becomes that if you have too many of these, your IO falls to the
floor. Thus I architected the original auto-DMA downgrade solution and
shared the results w/ Intel. They and I had the same idea, and I am not sure
who had it first, but Linux (me) did it first and proved it could be done.

That is why you see a snappy reset message.

It tests for any other errors and if there are only iCRCs, the driver
intercepts the error path and redirects it to reduce the IO transfer rate.

This is all perfectly safe and legal in the standard.

We suspend IO in the error handler, and because it is an iCRC the
device is expecting a retry, but first we clean up the DMA messes.
Next we respeed the host-device pair down by one transfer rate.
Then we end the request and/or reissue it to retry back through the main loop.

It is simple but had not been done before Linux, and now it is very much
the right thing to do.

Now this process repeats until the crosstalk goes away.
That is what you are seeing.
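
The shape of that loop can be shown with a toy model. This Python sketch is
not the kernel code: transfer_ok is a hypothetical callback standing in for
"retry the request at this mode and report whether it finished without a CRC
error", and the rate table lists the nominal UDMA burst rates, rounded.

```python
# Nominal UDMA mode -> burst rate in MB/s (rounded).
UDMA_RATE_MBPS = {0: 16.7, 1: 25.0, 2: 33.3, 3: 44.4, 4: 66.7, 5: 100.0, 6: 133.0}

def step_down_until_clean(start_mode, transfer_ok):
    """Lower the UDMA mode one step per CRC failure until a transfer succeeds.

    Returns (final_mode, history), or (None, history) if even mode 0 fails.
    """
    history = []
    mode = start_mode
    while mode >= 0:
        history.append(mode)
        if transfer_ok(mode):
            return mode, history
        mode -= 1  # respeed the host-device pair one rate lower, then retry
    return None, history

# Toy link that only runs cleanly at UDMA4 or below.
final, tried = step_down_until_clean(5, lambda mode: mode <= 4)
print(final, tried, UDMA_RATE_MBPS[final])
```

In this toy run the link settles one step below where it started, and nothing
raises it again afterwards.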

To prove it is real, you can tell your drives to jump back into mode 5.

hdparm -X69 /dev/hde /dev/hdg /dev/hdi /dev/hdk

Do a simple read w/ '-t' and watch it repeat the downgrade again.
One of the things not added is a performance return based on load drops.
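
The 69 in the hdparm -X69 command above follows the numbering described in the
hdparm manual: 8 + n for PIO modes, 32 + n for multiword DMA modes and 64 + n
for UltraDMA modes. A tiny Python helper, whose name is made up for
illustration, makes the mapping explicit:

```python
def hdparm_x_value(kind, mode):
    """Return the numeric argument for `hdparm -X` for a transfer mode.

    kind is "pio", "mdma" or "udma"; mode is the mode number n.
    """
    base = {"pio": 8, "mdma": 32, "udma": 64}[kind]
    return base + mode

print(hdparm_x_value("udma", 5))  # value that requests UDMA mode 5
print(hdparm_x_value("udma", 4))  # one step down
```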

I hope this helps.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

PS other info below too.


> > Also, four 160GB disks eat power!
>
> Well, these disks (4G160J8) are 5400rpm, They spin up just fine
> and according to spec they each use under 300ma@12v and 400ma@5v
> (active, 5.2W actually) which is not that much.
>
> This is not to say that just because the power supply can deliver
> the power, the power is clean enough.

But individual lines will draw different values.
Do you know what each line looks like when you load the system?

> > I have a box dual athlon similar setup w/ 460W ps
> > I have to wait for the PS to warm up or there is not enough juice to
> > properly spin up the last drive. However if I replace the four 160GB's
> > with four 20GB Seagate's no problem.
>
> A bootup is always clean, and when not stressed it can go for a long
> while
> without any errors. However once pushed I start seeing these failures.

This is system loading and if a PS becomes marginal on any line, drives
suffer. The mode 6 clocking is very fragile and sensitive, even mode 5
had issues in the past.

> >
> > You are going to need at least a 400W PS w/ almost no ripple to make it
> > work. If you have this then check the cable routing.
> >
> > Also hdparm -i /dev/hdX to see if their transfer rates are reduced.
>
> I will check the situation. I know they come up ATA-5. Here are the
> relevant bootup messages:
>
> Uniform Multi-Platform E-IDE driver Revision: 6.31
> ide: Assuming 33MHz system bus speed for PIO modes; override with
> idebus=xx
> VP_IDE: IDE controller on PCI bus 00 dev 39
> VP_IDE: chipset revision 6
> VP_IDE: not 100% native mode: will probe irqs later
> ide: Assuming 33MHz system bus speed for PIO modes; override with
> idebus=xx
> VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
> ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
> ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA

First, disable unmasking on the VIA; it makes a mess. :-/

This (CMD649) is a good HOST and Linux can deal nicely.

> CMD649: IDE controller on PCI bus 00 dev 50
> PCI: Found IRQ 5 for device 00:0a.0
> CMD649: chipset revision 1
> CMD649: not 100% native mode: will probe irqs later
> CMD649: ROM enabled at 0xdff00000
> ide2: BM-DMA at 0xc000-0xc007, BIOS settings: hde:pio, hdf:pio
> ide3: BM-DMA at 0xc008-0xc00f, BIOS settings: hdg:pio, hdh:pio
> CMD649: IDE controller on PCI bus 00 dev 60
> PCI: Found IRQ 12 for device 00:0c.0
> CMD649: chipset revision 2
> CMD649: not 100% native mode: will probe irqs later
> CMD649: ROM enabled at 0xdfe80000
> ide4: BM-DMA at 0xac00-0xac07, BIOS settings: hdi:pio, hdj:pio
> ide5: BM-DMA at 0xac08-0xac0f, BIOS settings: hdk:pio, hdl:pio
> hda: WDC WD600BB-00BSA0, ATA DISK drive
> hdc: WDC WD600BB-00BSA0, ATA DISK drive
> hdd: ATAPI CD ROM DRIVE 50X MAX, ATAPI CD/DVD-ROM drive
> hde: Maxtor 4G160J8, ATA DISK drive
> hdg: Maxtor 4G160J8, ATA DISK drive
> hdi: Maxtor 4G160J8, ATA DISK drive
> hdk: Maxtor 4G160J8, ATA DISK drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> ide1 at 0x170-0x177,0x376 on irq 15
> ide2 at 0xd000-0xd007,0xcc02 on irq 5
> ide3 at 0xc800-0xc807,0xc402 on irq 5
> ide4 at 0xbc00-0xbc07,0xb802 on irq 12
> ide5 at 0xb400-0xb407,0xb002 on irq 12
> hda: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=7297/255/63,
> UDMA(100)
> hdc: 117231408 sectors (60022 MB) w/2048KiB Cache, CHS=116301/16/63,
> UDMA(100)
> hde: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdg: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdi: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)
> hdk: 320173056 sectors (163929 MB) w/2048KiB Cache, CHS=19929/255/63,
> UDMA(100)

2002-03-25 10:26:50

by Eyal Lebedinsky

Subject: Re: 2.4.18: many IDE errors

Andre Hedrick wrote:
>
> On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:
>
> > If I understand correctly, the basic answer is that this is not
> > a driver issue, and not a general kernel (irq's etc.) either, but
> > a true hardware problem.
> >
> >
> > Andre Hedrick wrote:
> > >
> > > It is not a case of bad cables but maybe cable routing.
> >
> > I am not clear on what cable routing means. Can you elaborate?
>
> If you have many cable (note there are random ways to construct, odd/even)
> You can get crosstalk (aka 0x51/0x84 kernel noise). These messages are
> telling you the hardware checksum between the HOST and DEVICE failed.
> The solution is to retry, and the driver does.

[explanation of rate backoff trimmed]

Yes, I see this happening, and I agree it is a good idea.

> > I know they come up ATA-5. Here are the relevant bootup messages:
> >
> > Uniform Multi-Platform E-IDE driver Revision: 6.31
> > ide: Assuming 33MHz system bus speed for PIO modes; override with
> > idebus=xx
> > VP_IDE: IDE controller on PCI bus 00 dev 39
> > VP_IDE: chipset revision 6
> > VP_IDE: not 100% native mode: will probe irqs later
> > ide: Assuming 33MHz system bus speed for PIO modes; override with
> > idebus=xx
> > VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
> > ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
> > ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA
>
> First disable unmasking VIA it makes a mess. :-/
>
> This (CMD649) is a good HOST and Linux can deal nicely.

You mean IRQ unmasking? This is interesting. After bootup I see hda/c
(on VP_IDE, ide0/1) having
I/O support: 1
unmasking: 1
but hde/g/i/k (on CMD649, ide2/3/4/5) have:
I/O support: 0
unmasking: 0
So VIA unmasking is NOT disabled (I will do so) but why is the CMD649
disabled if you say it is a good chipset?

--
Eyal Lebedinsky ([email protected]) <http://samba.org/eyal/>

2002-03-25 10:33:00

by Andre Hedrick

Subject: Re: 2.4.18: many IDE errors

On Mon, 25 Mar 2002, Eyal Lebedinsky wrote:

> Andre Hedrick wrote:
> >
> > On Sun, 24 Mar 2002, Eyal Lebedinsky wrote:
> >
> > > If I understand correctly, the basic answer is that this is not
> > > a driver issue, and not a general kernel (irq's etc.) either, but
> > > a true hardware problem.
> > >
> > >
> > > Andre Hedrick wrote:
> > > >
> > > > It is not a case of bad cables but maybe cable routing.
> > >
> > > I am not clear on what cable routing means. Can you elaborate?
> >
> > If you have many cable (note there are random ways to construct, odd/even)
> > You can get crosstalk (aka 0x51/0x84 kernel noise). These messages are
> > telling you the hardware checksum between the HOST and DEVICE failed.
> > The solution is to retry, and the driver does.
>
> [explanation of rate backoff trimmed]
>
> Yes, I see this happening, and I agree it is a good idea.
>
> > > I know they come up ATA-5. Here are the relevant bootup messages:
> > >
> > > Uniform Multi-Platform E-IDE driver Revision: 6.31
> > > ide: Assuming 33MHz system bus speed for PIO modes; override with
> > > idebus=xx
> > > VP_IDE: IDE controller on PCI bus 00 dev 39
> > > VP_IDE: chipset revision 6
> > > VP_IDE: not 100% native mode: will probe irqs later
> > > ide: Assuming 33MHz system bus speed for PIO modes; override with
> > > idebus=xx
> > > VP_IDE: VIA vt82c686b (rev 40) IDE UDMA100 controller on pci00:07.1
> > > ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
> > > ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:DMA, hdd:DMA
> >
> > First disable unmasking VIA it makes a mess. :-/
> >
> > This (CMD649) is a good HOST and Linux can deal nicely.
>
> You mean IRQ unmasking? This is interesting. After bootup I see hda/c
> (on VP_IDE, ide0/1) having
> I/O support: 1
> unmasking: 1
> but hde/g/i/k (on CMD649, ide2/3/4/5) have:
> I/O support: 0
> unmasking: 0
> So VIA unmasking is NOT disabled (I will do so) but why is the CMD649
> disabled if you say it is a good chipset?

Because not all drivers are safe and have an interrupt parser to claim or
reject ownership of the interrupt.

In the past the Adaptec driver was the worst interrupt eater on the planet, but
things have changed in that driver.

General paranoia from the past.

Cheers,

Andre Hedrick
LAD Storage Consulting Group