2002-12-25 11:15:33

by Mikael Olenfalk

[permalink] [raw]
Subject: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52


Hi Everybody,

I wish you all a nice Christmas and a Happy New Year!


Now to the stuff that matters (for me 8) ):

I recently bought four 80G IDE disks which I planned to combine into a
software RAID (RAID5). The four disks are connected via a Promise
Ultra100 TX2 (driver: PDC202xx, PDC20268 - I believe ;) ). It all works
fine for me, except that the data throughput (i.e. speed) has been very
variable.

I also had an older 80G IDE disk (still capable of UDMA5) which I wanted
to use as a spare disk.

This is my configuration:

raiddev /dev/md0
        raid-level              5
        nr-raid-disks           4
        nr-spare-disks          1
        persistent-superblock   1
        parity-algorithm        left-symmetric
        chunk-size              64

        ## ALL NEW DISCS :)
        device                  /dev/hde
        raid-disk               0
        device                  /dev/hdf
        raid-disk               1
        device                  /dev/hdg
        raid-disk               2
        device                  /dev/hdh
        raid-disk               3

        ## GOOD, OLD, BRAVE DISC ;)
        device                  /dev/hdc
        spare-disk              0



Now to the funny part:

For some funny reason, a 2.4.20 kernel refuses to set the DMA level on
the new disks (all connected to a UDMA5-capable Ultra100 TX2 controller)
to UDMA5, 4 or 3 and settles for UDMA2, which is the highest mode
supported by the OLD onboard controller (but NOT by the Promise card).
A recent 2.5.52 gives me :) UDMA5 on the new disks while (correctly)
using UDMA2 for the old drive.

This is generally not a very big problem; I can live 8( with my fine new
disks only running at UDMA2 (instead of 5).

After I initialized the array using 'mkraid /dev/md0' I opened up 'while
true; do clear; cat /proc/mdstat; sleep 1; done' in one terminal to
watch the progress.
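(The same loop can also be spelled with the watch utility from procps:

  watch -n1 cat /proc/mdstat

)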

The first try I gave it was (very satisfyingly) giving me 15MB/sec at
the beginning. After about 30-40% the speed fell to an unsatisfying
100-200KB/sec (nothing CPU-intensive running besides raid5d).

I had been having problems with the older controller and I was not sure
about the throughput of the old drive, so I stopped the syncing, stopped
the raid and ran five synced (i.e. with a semaphore) bonnie++ processes
to benchmark the disks. They all performed about equally well (the old
one was a little slower at random seeks, but that was indeed expected).
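(A synced run of that sort might look roughly like the sketch below --
the mount points and size are made up for illustration, and a fifo
serves as the start barrier so all five runs begin at once:

  mkfifo /tmp/go
  for dir in /mnt/hdc /mnt/hde /mnt/hdf /mnt/hdg /mnt/hdh; do
      ( cat /tmp/go >/dev/null     # blocks until the fifo gets a writer
        bonnie++ -d $dir -s 1024 ) &
  done
  sleep 2            # crude: give every subshell time to open the fifo
  : >/tmp/go         # open and close the fifo: all readers see EOF at once
  wait
  rm /tmp/go

)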

I tried the same thing on a 2.4.18 kernel (bf2.4 flavour from Debian
SID) but that gave me funny DMA timeout errors (something like DMA
timeout func 14 only supported) and sporadically kicked one of my newer
drives out of the array.

I wanted to try a 2.5.x kernel and settled for the newest one: 2.5.52
(vanilla). This wonderful kernel detected the UDMA5 capability of the
new drives and the new controller (veeery satisfying =) ) and gave me a
throughput of about 20-22MB/sec in the beginning. Sadly the system
Oopsed at 30%, giving me some DMA errors.

I have been wondering if it could be the power supply: I have a 360W
(max load) power supply, 4 new IBM 80G disks, 1 older 80G SEAGATE, and
1 20G MAXTOR.

Hmm, there it was again: one of my new drives got kicked out of the
array before the initial sync was finished.

The errors were (2.4.20):

<DATE> webber kernel: hdg: dma_intr: status=0x51
<DATE> webber kernel: hdg: end_request: I/O error, dev 22:00 (hdg) sector 58033224
<DATE> webber kernel: hdg: dma_intr: status=0x51
<DATE> webber kernel: hdg: dma_intr: error=0x40 LBAsect=58033238, sector=58033232
<DATE> webber kernel: end_request: I/O error, dev 22:00 (hdg), sector 58033232


Hmm, what could it be? This didn't happen while running bonnie++ :(



Any help will be appreciated,

Regards,
Mikael


--
Mikael Olenfalk <[email protected]>
Netgineers


2002-12-25 11:50:10

by Tomas Szepe

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

> For some funny reason, a 2.4.20 kernel refuses to set the DMA level on
> the new disks (all connected to a UDMA5-capable Ultra100 TX2 controller)
> to UDMA5, 4 or 3 and settles for UDMA2, which is the highest mode
> supported by the OLD onboard controller (but NOT by the Promise card).

You need to boot 2.4.19 and 2.4.20 with 'ideX=ata66' where X is the
number of the channel where you wish to use transfer modes above UDMA2.
For instance, "ide0=ata66 ide1=ata66" will do the trick for the first two
channels. This is a well-known bug in the PDC driver that has never been
fixed. And don't worry, 2.4.21 will most likely feature a PDC driver that
won't work at all (read Andre Hedrick's posts from last week), but if you
say 2.5.52 mostly works then you might be lucky as 2.5 IDE is basically
the same codebase as 2.4.21-pre.
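(With LILO, for example, that means something like

  append="ide2=ata66 ide3=ata66"

in lilo.conf -- assuming, as is typical with two onboard channels, that
the Promise card's channels probe as ide2 and ide3 -- or the same
parameters typed at the boot prompt.)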

> The errors were (2.4.20):
>
> <DATE> webber kernel: hdg: dma_intr: status=0x51
> <DATE> webber kernel: hdg: end_request: I/O error, dev 22:00 (hdg) sector 58033224
> <DATE> webber kernel: hdg: dma_intr: status=0x51
> <DATE> webber kernel: hdg: dma_intr: error=0x40 LBAsect=58033238, sector=58033232
> <DATE> webber kernel: end_request: I/O error, dev 22:00 (hdg), sector 58033232

Hmmm, I can't help you with these.

--
Tomas Szepe <[email protected]>

2002-12-26 12:29:01

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Wed, Dec 25, 2002 at 12:58:20PM +0100, Tomas Szepe wrote:
> > For some funny reason, a 2.4.20 kernel refuses to set the DMA level on
> > the new disks (all connected to a UDMA5-capable Ultra100 TX2 controller)
> > to UDMA5, 4 or 3 and settles for UDMA2, which is the highest mode
> > supported by the OLD onboard controller (but NOT by the Promise card).
>
> You need to boot 2.4.19 and 2.4.20 with 'ideX=ata66' where X is the
> number of the channel where you wish to use transfer modes above UDMA2.
> For instance, "ide0=ata66 ide1=ata66" will do the trick for the first two

hdparm -X69 /dev/hda will put it into UDMA5/ata100 mode as well
(69 == 64 + UDMA mode). No need to specify it at boot time.
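In other words, the -X argument is 64 plus the UDMA mode number, e.g.
(the device name is just an example):

  hdparm -X66 /dev/hde    # 64 + 2 -> UDMA2
  hdparm -X68 /dev/hde    # 64 + 4 -> UDMA4/ata66
  hdparm -X69 /dev/hde    # 64 + 5 -> UDMA5/ata100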

(this discussion reminded me of my own TX2 adapter and 100GB disk:
adjusting its settings improved sequential disk reads from 24MB/sec to
36MB/sec)

--
Frank

2002-12-26 13:14:17

by Tomas Szepe

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

> On Wed, Dec 25, 2002 at 12:58:20PM +0100, Tomas Szepe wrote:
> > > For some funny reason, a 2.4.20 kernel refuses to set the DMA level on
> > > the new disks (all connected to a UDMA5-capable Ultra100 TX2 controller)
> > > to UDMA5, 4 or 3 and settles for UDMA2, which is the highest mode
> > > supported by the OLD onboard controller (but NOT by the Promise card).
> >
> > You need to boot 2.4.19 and 2.4.20 with 'ideX=ata66' where X is the
> > number of the channel where you wish to use transfer modes above UDMA2.
> > For instance, "ide0=ata66 ide1=ata66" will do the trick for the first two
>
> hdparm -X69 /dev/hda will put it into UDMA5/ata100 mode as well
> (69 == 64 + UDMA mode). No need to specify it at boot time.

Not true. You definitely need to use the ideX boot param AND
run hdparm -X?? /dev/hd? to make use of UDMA3+ on newer PDC
controllers (unless you apply the patch posted on Dec 24 by
Nikolai Zhubr).
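(That is, something along these lines -- channel numbers and device
names are examples only:

  # kernel boot parameters, e.g. in lilo.conf:
  append="ide2=ata66 ide3=ata66"
  # and then, once the system is up, for each drive on the card:
  hdparm -X69 /dev/hde

)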

--
Tomas Szepe <[email protected]>

2002-12-26 16:34:58

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Thu, Dec 26, 2002 at 02:22:28PM +0100, Tomas Szepe wrote:
> >
> > hdparm -X69 /dev/hda will put it into UDMA5/ata100 mode as well
> > (69 == 64 + UDMA mode). No need to specify it at boot time.
>
> Not true. You definitely need to use the ideX boot param AND
> run hdparm -X?? /dev/hd? to make use of UDMA3+ on newer PDC
> controllers (unless you apply the patch posted on Dec 24 by
> Nikolai Zhubr).

Driver says otherwise on RH8.0 + 2.4.20:
iapetus /proc/ide# cat pdc202xx

PROMISE Ultra series driver Ver 1.20.0.7 2002-05-23 Adapter: Ultra100 TX2
--------------- Primary Channel ---------------- Secondary Channel -------------
                enabled                          enabled
66 Clocking     enabled                          enabled
           Mode MASTER                           MASTER
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
DMA enabled:    yes              no              no                no
UDMA Mode:      5                0               0                 0
PIO Mode:       4                0               0                 0


--
Frank

2002-12-26 17:27:16

by Tomas Szepe

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

> On Thu, Dec 26, 2002 at 02:22:28PM +0100, Tomas Szepe wrote:
> > >
> > > hdparm -X69 /dev/hda will put it into UDMA5/ata100 mode as well
> > > (69 == 64 + UDMA mode). No need to specify it at boot time.
> >
> > Not true. You definitely need to use the ideX boot param AND
> > run hdparm -X?? /dev/hd? to make use of UDMA3+ on newer PDC
> > controllers (unless you apply the patch posted on Dec 24 by
> > Nikolai Zhubr).
>
> Driver says otherwise on RH8.0 + 2.4.20:
> iapetus /proc/ide# cat pdc202xx
>
> PROMISE Ultra series driver Ver 1.20.0.7 2002-05-23 Adapter: Ultra100 TX2
> --------------- Primary Channel ---------------- Secondary Channel -------------
>                 enabled                          enabled
> 66 Clocking     enabled                          enabled
>            Mode MASTER                           MASTER
> --------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
> DMA enabled:    yes              no              no                no
> UDMA Mode:      5                0               0                 0
> PIO Mode:       4                0               0                 0

Fair enough. Can you give me your PCI ids, Promise BIOS version
and "hdparm -Iv /dev/hda"?

--
Tomas Szepe <[email protected]>

2002-12-26 18:33:38

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Thu, Dec 26, 2002 at 06:35:28PM +0100, Tomas Szepe wrote:
>
> Fair enough. Can you give me your PCI ids, Promise BIOS version

Don't know how to obtain the BIOS # without a reboot; will have to wait
half an hour for this.

lspci -n header:
00:12.0 Class 0180: 105a:4d68 (rev 02) (prog-if 85)

lspci -vvv:
00:12.0 Unknown mass storage controller: Promise Technology, Inc. 20268 (rev 02) (prog-if 85)
        Subsystem: Promise Technology, Inc. Ultra100TX2
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=slow >TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 32 (1000ns min, 4500ns max), cache line size 08
        Interrupt: pin A routed to IRQ 18
        Region 0: I/O ports at d800 [size=8]
        Region 1: I/O ports at dc00 [size=4]
        Region 2: I/O ports at e000 [size=8]
        Region 3: I/O ports at e400 [size=4]
        Region 4: I/O ports at e800 [size=16]
        Region 5: Memory at e4000000 (32-bit, non-prefetchable) [size=16K]
        Expansion ROM at e2000000 [disabled] [size=16K]
        Capabilities: [60] Power Management version 1
                Flags: PMEClk- DSI+ D1+ D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 PME-Enable- DSel=0 DScale=0 PME-


> and "hdparm -Iv /dev/hda"?

/dev/hde:
 multcount    =  0 (off)
 IO_support   =  3 (32-bit w/sync)
 unmaskirq    =  0 (off)
 using_dma    =  1 (on)
 keepsettings =  0 (off)
 readonly     =  0 (off)
 readahead    =  8 (on)
 geometry     = 193821/16/63, sectors = 195371568, start = 0

ATA device, with non-removable media
        Model Number:       WDC WD1000BB-00CCB0
        Serial Number:      WD-WMA9P1116311
        Firmware Revision:  22.04A22
Standards:
        Supported: 5 4 3 2
        Likely used: 6
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  195371568
        device size with M = 1024*1024:       95396 MBytes
        device size with M = 1000*1000:      100030 MBytes (100 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        bytes avail on r/w long: 40     Queue depth: 1
        Standby timer values: spec'd by Standard, with device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Recommended acoustic management value: 128, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
                Automatic Acoustic Management feature set
                SET MAX security extension
           *    DOWNLOAD MICROCODE cmd
Security:
        supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
HW reset results:
        CBLID- above Vih
        Device num = 0 determined by CSEL
Checksum: correct

--
Frank

2002-12-26 19:20:08

by Frank van Maarseveen

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Thu, Dec 26, 2002 at 06:35:28PM +0100, Tomas Szepe wrote:
>
> Fair enough. Can you give me your PCI ids, Promise BIOS version

Ultra100TX2 BIOS version is 2.20.0.11

dmesg says:
PDC20268: IDE controller on PCI bus 00 dev 90
PDC20268: chipset revision 2
PDC20268: not 100% native mode: will probe irqs later
PDC20268: ROM enabled at 0xe2000000
PDC20268: (U)DMA Burst Bit ENABLED Primary MASTER Mode Secondary MASTER Mode.

--
Frank

2002-12-27 13:11:26

by Mikael Olenfalk

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Thu, 2002-12-26 at 13:37, Frank van Maarseveen wrote:
> On Wed, Dec 25, 2002 at 12:58:20PM +0100, Tomas Szepe wrote:
> > > For some funny reason, a 2.4.20 kernel refuses to set the DMA level on
> > > the new disks (all connected to a UDMA5-capable Ultra100 TX2 controller)
> > > to UDMA5, 4 or 3 and settles for UDMA2, which is the highest mode
> > > supported by the OLD onboard controller (but NOT by the Promise card).
> >
> > You need to boot 2.4.19 and 2.4.20 with 'ideX=ata66' where X is the
> > number of the channel where you wish to use transfer modes above UDMA2.
> > For instance, "ide0=ata66 ide1=ata66" will do the trick for the first two
>
> hdparm -X69 /dev/hda will put it into UDMA5/ata100 mode as well
> (69 == 64 + UDMA mode). No need to specify it at boot time.
>
> (this discussion reminded me of my own TX2 adapter and 100GB disk:
> adjusting its settings improved sequential disk reads from 24MB/sec to
> 36MB/sec)

I can only set UDMA3,4,5 if I pass the ide{2,3}=ata66 kernel boot
parameter. Actually I don't really care that much about the speed for
now, I would rather like the thing to work at all :)

The PDC20268 sporadically gives me DMA errors when doing the first
parity sync of my software RAID5. The last few times it has always been
the same disk, but that is not a rule (sometimes hde 02:00, sometimes
hdg 03:00, less often one of the slave disks on a channel).

But the only pattern seems to be that it either always bails out at
30-36% of the parity sync, or, if it does not bail out, the speed drops
to 60-80kB/sec, which would finish the sync after 16,000 or so minutes
(definitely too long).

I thought of returning the IBM drives and getting MAXTOR instead, as I
have heard rumors about the bad quality of the IBM drives. Still, this
seems to be a problem with the controller and/or its combination with MD.
The drives give me NO problems when just writing and/or reading them
with dd if=/dev/zero of=/dev/hd[efgh]. Even running the bonnie++
benchmark on all of them simultaneously with a 20GB file gives no errors.
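(Something like the sketch below exercises all four drives at once --
block size and count are made up, and it of course destroys whatever is
on the disks:

  for d in hde hdf hdg hdh; do
      dd if=/dev/zero of=/dev/$d bs=1024k count=20000 &   # ~20GB each
  done
  wait

)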


Thank you for your help so far.

/Regards, Mikael

--
Mikael Olenfalk <[email protected]>
Netgineers

2002-12-27 15:05:44

by jw schultz

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Fri, Dec 27, 2002 at 02:14:45PM +0100, Mikael Olenfalk wrote:
> I can only set UDMA3,4,5 if I pass the ide{2,3}=ata66 kernel boot
> parameter. Actually I don't really care that much about the speed for
> now, I would rather like the thing to work at all :)
>
> The PDC20268 sporadically gives me DMA errors when doing the first
> parity sync of my software RAID5. The last few times it has always been
> the same disk, but that is not a rule (sometimes hde 02:00, sometimes
> hdg 03:00, less often one of the slave disks on a channel).
>
> But the only pattern seems to be that it either always bails out at
> 30-36% of the parity sync, or, if it does not bail out, the speed drops
> to 60-80kB/sec, which would finish the sync after 16,000 or so minutes
> (definitely too long).
>
> I thought of returning the IBM drives and getting MAXTOR instead, as I
> have heard rumors about the bad quality of the IBM drives. Still, this
> seems to be a problem with the controller and/or its combination with MD.
> The drives give me NO problems when just writing and/or reading them
> with dd if=/dev/zero of=/dev/hd[efgh]. Even running the bonnie++
> benchmark on all of them simultaneously with a 20GB file gives no errors.

I'm a bit surprised no one else has mentioned this, so I will.
Your RAID description seems to indicate it is built of four
drives on a two-channel HBA. In other words you have two
pairs of drives, each pair sharing a cable (master/slave).
It is my understanding that that is at best a recipe for
poor performance. From what I have heard, ata66 and above is
problematic (out of spec) in that configuration. I imagine
such a configuration might also cause poor interactions
between the drives.

Aside from the reputed problems with PDC and with the IBM
"deathstar" drives, you might first try adding another HBA
and using better cables before you scrap the drives.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-12-27 22:35:42

by Mikael Olenfalk

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

jw schultz wrote:
> I'm a bit surprised no one else has mentioned this, so I will.
> Your RAID description seems to indicate it is built of four
> drives on a two-channel HBA. In other words you have two
> pairs of drives, each pair sharing a cable (master/slave).
> It is my understanding that that is at best a recipe for
> poor performance. From what I have heard, ata66 and above is
> problematic (out of spec) in that configuration. I imagine
> such a configuration might also cause poor interactions
> between the drives.

I've tried another setup with One Disk Per Channel(tm); I'm at 37.1% of
the parity sync now, with an amazing speed of 40K/sec and 20314.20
minutes left.

The new setup is:

>
> Aside from the reputed problems with PDC and with the IBM
> "deathstar" drives, you might first try adding another HBA
> and using better cables before you scrap the drives.

What reputed problems? I've heard of problems with IBM disks (like dying
after just one year of use and so on), but I've never heard of any PDC
problems (BTW I have never heard of any success stories either 8) )...
Can you recommend another controller? What about the HPT, is that one
usable?


Thanks for your time, Mikael


--
Mikael Olenfalk <[email protected]>
Netgineers

2002-12-28 00:38:42

by Alan

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

Master/Slave is OK with 80-pin cables on UDMA66, so that should not be
the problem area. Some combinations with old hardware can be
problematic, but this is a pair of new same-vendor drives.

Odd indeed

2002-12-28 14:29:19

by Mikael Olenfalk

[permalink] [raw]
Subject: Re: A lot of DMA errors in 2.4.18, 2.4.20 and 2.5.52

On Sat, 2002-12-28 at 01:45, Alan Cox wrote:
> Master/Slave is OK with 80-pin cables on UDMA66, so that should not be
> the problem area. Some combinations with old hardware can be
> problematic, but this is a pair of new same-vendor drives.
>
> Odd indeed
>

I've tried an array with three drives (two RAID5 drives, one spare),
each on a channel of its own. The results were (hmm, I feel like I've
already been telling this, hmm) almost the same; funnily though, there
were some speed diffs:

1) kernel version 2.4.20 -- 4+1 config, the four drives on two channels,
the spare drive on a channel of its own:

- 12-15 MB/sec, with UDMA2

- 20-22 MB/sec, when forcing UDMA5 through kernel boot param
ide{2,3}=ata66

always at 36.1% (of the initial parity sync) the speed begins dropping
from maximum speed, reaching 40-60KB/sec at 37%

2) kernel version 2.4.20 -- 2+1 config, each on a channel of its own

- 20-22MB/sec, when forcing UDMA5 through kernel boot param

at 36.1% the speed drops again

3) kernel version 2.4.20 -- 4+1/2+1 config: occasionally the system gives
me DMA errors; this has almost only happened when forcing UDMA5

4) kernel version 2.5.52 -- 4+1 config

- runs at a beautiful speed of 33-36MB/sec, filling my heart with joy,
but gives me DMA errors or kernel Oopses at about 36-39%

5) kernel version 2.5.53 -- 2+1 config

- again wonderful speeds of up to 36MB/sec; same errors as with 4)



The only thing that I see in common between all configs is that they
almost always bail out somewhere around 36-37%.

I don't know the internals of the MD driver, but could it be that the
initial parity sync is done linearly through the disks, meaning that the
parity information on the first disk is computed first, then, once it is
complete, the parity information on the second disk is computed, and so
on?

Given the fact that the sync always fails at roughly the same position,
it sounds to me like just one bad little disk is bailing out.

I'll try the 2+1 config again today, just with the two other disks, and
I'll come back and report the success story to you on Sunday or Monday
evening (I have to inspect some UMTS antennas tomorrow -- measure cable
quality and stuff, so no Linux joy for me 8) )


I really appreciate your help and advice,

Regards, Mikael

--
Mikael Olenfalk <[email protected]>
Netgineers