2002-11-19 21:40:11

by Steven Timm

Subject: Serverworks dma_intr: error=0x40 { UncorrectableError }


The problems with the Serverworks OSB4 chipset are well documented
on this list. It is known that changing to multi-word DMA mode
greatly reduces the problems. However, since we need high
throughput in our application, we are still running at high DMA speeds.
The configuration in question is:

2.4.9-31smp kernel, Tyan 2518 motherboard, a Western Digital 20 GB system
disk as hda, and a 40 GB Western Digital drive each as hdc and hdd.

The error usually happens on the system disk, and there are usually
some corrupted files at that location of the disk as a result.
But when we either reinstall the system disk (formatting with a check
for bad blocks) or low-level format the disk and reinstall, the
system is able to go on for quite some length of time
without these errors repeating.

My question--is there any way that the Serverworks OSB4 could be
causing a soft error on the disk such that the disk appears to
be bad (sometimes to the point that SMART puts up an alert in
the BIOS saying it is bad) but yet can still be used effectively?

(And yes, I am aware that the 2.4.18 kernels trap this condition and
usually stop the file corruption before it happens.)

Nov 7 05:08:10 fnd0102 kernel: Curious - OSB4 thinks the DMA is still running.
Nov 7 05:08:10 fnd0102 kernel: OSB4 wait exit.
Nov 7 05:08:10 fnd0102 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Nov 7 05:08:10 fnd0102 kernel: hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=4457382, sector=4457312
Nov 7 05:08:10 fnd0102 kernel: end_request: I/O error, dev 03:01 (hda), sector 4457312
Nov 7 05:08:10 fnd0102 kernel: EXT2-fs error (device ide0(3,1)):
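
The status=0x51 and error=0x40 values in the log above are bit masks of the ATA status and error registers. As a rough sketch of how the driver arrives at the names in braces, the decoding can be reproduced as below. The bit names are written from memory of the 2.4-era IDE driver and should be treated as an assumption; only the three status bits and the one error bit seen in this log are confirmed by the thread.

```python
# Bit masks of the ATA status register, with the names the Linux IDE
# driver prints (names assumed from the 2.4-era driver output).
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]

# Bit masks of the ATA error register (only valid when the Error
# status bit is set). 0x80 is reported as BadSector or BadCRC
# depending on context.
ERROR_BITS = [
    (0x80, "BadSector/BadCRC"),
    (0x40, "UncorrectableError"),
    (0x20, "MediaChanged"),
    (0x10, "SectorIdNotFound"),
    (0x08, "MediaChangeRequested"),
    (0x04, "DriveStatusError"),
    (0x02, "TrackZeroNotFound"),
    (0x01, "AddrMarkNotFound"),
]


def decode(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


print(decode(0x51, STATUS_BITS))  # ['DriveReady', 'SeekComplete', 'Error']
print(decode(0x40, ERROR_BITS))   # ['UncorrectableError']
```

Note that error=0x40 is reported by the drive itself, which is why it shows up as a media problem even when the controller is suspected.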

Steve Timm


------------------------------------------------------------------
Steven C. Timm (630) 840-8525 [email protected] http://home.fnal.gov/~timm/
Fermilab Computing Division/Operating Systems Support
Scientific Computing Support Group--Computing Farms Operations


2002-11-19 21:52:47

by Manish Lachwani

[permalink] [raw]
Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

0x40 indicates an uncorrectable ECC error. Check the SMART data from the drive
using the smartctl utility. You should probably see a nonzero Current Pending
Sector count. As long as the pending sector is not reallocated, you will continue
to see this problem when reading that specific sector. What you can do is
write to that sector to get it remapped. Once it is remapped, these
problems on that sector won't occur.
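
A minimal sketch of the SMART check described above, assuming a smartmontools-style smartctl whose `-A` attribute table has ten whitespace-separated columns with the raw value last (the smartctl builds of 2002 may print a different layout). The function names and the /dev/hda default are illustrative only.

```python
import subprocess


def pending_sectors(smartctl_output):
    """Return the raw Current_Pending_Sector value, or None if not listed.

    Assumes rows of the form:
    ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            return int(fields[9])
    return None


def read_pending(device="/dev/hda"):
    """Run `smartctl -A` on the device (needs root) and parse its output."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses nonzero exit codes as status bit flags
    )
    return pending_sectors(result.stdout)
```

A nonzero result would be consistent with sectors waiting to be remapped on the next write. A result of None only means the attribute was not found in the output.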

Also, we had been using the TYAN S2518 with OSB4 in UDMA 2. However, it causes
data corruption when there is I/O with even one drive. We used two drives in
master-master and master-slave mode and it did not solve the problem. Even
at UDMA 0, we experienced the same data corruption. One does not
need to have 0x40 errors to see this corruption. If you want to see this
corruption, use the dt utility and then run:

./dt of=/dev/hdc bs=300M incr=512 enable=raw iodir=forward pattern=iot
log=tst_hdc.log dispose=keep iodir=reverse iotype=sequential pattern=iot
enable=compare,coredump,debug,raw,verify,verbose,pstats &

on hda and hdc. Also, try to have network traffic in parallel. This will
result in data getting shifted by 4 bytes ...


2002-11-19 23:53:37

by Alan

Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

On Tue, 2002-11-19 at 21:59, Manish Lachwani wrote:
> Also, we had been using the TYAN S2518 with OSB4 in UDMA 2. However, it causes
> data corruption when there is I/O with even one drive. We used two drives in
> master-master and master-slave mode and it did not solve the problem. Even
> at UDMA 0, we experienced the same data corruption. One does not
> need to have 0x40 errors to see this corruption. If you want to see this
> corruption, use the dt utility and then run:

Known. The newer kernels will panic in this case to avoid corruption,
and the 2.4.20-ac/2.5.4x kernels will not use UDMA for disks on OSB4.
The newer Serverworks chipsets (CSB5/CSB6) don't have this problem, btw.
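
On kernels that predate this change, the transfer mode can be lowered by hand. The sketch below only builds the hdparm command line and does not run it. It relies on the documented hdparm `-X` encoding (PIO mode plus 8, multiword DMA mode plus 32, UltraDMA mode plus 64). The helper name and the device path are assumptions for illustration, and changing modes on a live disk carries its own risk.

```python
# Base values of the hdparm -X transfer-mode encoding.
XFER_BASE = {"pio": 8, "mdma": 32, "udma": 64}


def hdparm_xfer_args(device, family, mode):
    """Build an hdparm argument list that selects the given transfer mode.

    family is "pio", "mdma" or "udma"; mode is the mode number within
    that family. The command is returned, not executed.
    """
    if family not in XFER_BASE:
        raise ValueError("unknown transfer family: %r" % family)
    code = XFER_BASE[family] + mode
    return ["hdparm", "-d1", "-X%d" % code, device]


# Multiword DMA mode 2 instead of UDMA on the system disk.
print(hdparm_xfer_args("/dev/hda", "mdma", 2))
# ['hdparm', '-d1', '-X34', '/dev/hda']
```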

2002-11-19 23:57:58

by Manish Lachwani

Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

Yes, CSB5/CSB6 does not have this problem. I have been using this with the
GC-LE chipset.

However, there was something that I noticed when I was experimenting with
the FreeBSD kernel for producing these corruptions. I could not reproduce
these corruptions with FreeBSD 4.6 and at UDMA 2. I also noticed that the
PCI config space settings for the IDE controller were different in FreeBSD
and Linux 2.4.17.


2002-11-20 00:07:23

by Alan

Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

On Wed, 2002-11-20 at 00:04, Manish Lachwani wrote:
> Yes, CSB5/CSB6 does not have this problem. I have been using this with the
> GC-LE chipset.
>
> However, there was something that I noticed when I was experimenting with
> the FreeBSD kernel for producing these corruptions. I could not reproduce
> these corruptions with FreeBSD 4.6 and at UDMA 2. I also noticed that the
> PCI config space settings for the IDE controller were different in FreeBSD
> and Linux 2.4.17.

It might be interesting to know what the differences are. Certainly the
bug is a very strange one.

2002-11-20 00:13:29

by Manish Lachwani

Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

I will try to repeat the experiment with FreeBSD and send the config space
information.

btw, I had also noticed the corruption in PIO4 mode. This is sample
output from dt that shows the 4-byte shift in PIO4:

Original Buffer:
=====================

0x51972be0 35 a7 a4 81 36 a8 a5 82 37 a9 a6 83 38 aa a7 84
0x51972bf0 39 ab a8 85 3a ac a9 86 3b ad aa 87 3c ae ab 88
0x51972c00 *be 2e 2c 09 bf 2f 2d 0a c0 30 2e 0b c1 31 2f 0c
0x51972c10 c2 32 30 0d c3 33 31 0e c4 34 32 0f c5 35 33 10

dt: The incorrect data starts at address 0x3ed70c00 (marked by asterisk '*')
dt: Dumping Verify Buffer (base = 0x3d6af000, offset = 23862272, limit = 64 bytes):

Verify buffer
=================

0x3ed70be0 35 a7 a4 81 36 a8 a5 82 37 a9 a6 83 38 aa a7 84
0x3ed70bf0 39 ab a8 85 3a ac a9 86 3b ad aa 87 3c ae ab 88
0x3ed70c00 *3c ae ab 88 be 2e 2c 09 bf 2f 2d 0a c0 30 2e 0b
0x3ed70c10 c1 31 2f 0c c2 32 30 0d c3 33 31 0e c4 34 32 0f
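
The shift in the two dumps above can be checked mechanically. The following is only a sketch: the byte strings are transcribed from the dt output, while the helper names and the max_shift limit are made up for illustration. It finds the first differing byte and then looks for the smallest offset at which the read-back data lines up with the expected data again.

```python
# The 64 bytes of each buffer, transcribed from the dt dump.
ORIGINAL = bytes.fromhex(
    "35a7a48136a8a58237a9a68338aaa784"
    "39aba8853aaca9863badaa873caeab88"
    "be2e2c09bf2f2d0ac0302e0bc1312f0c"
    "c232300dc333310ec434320fc5353310"
)
VERIFY = bytes.fromhex(
    "35a7a48136a8a58237a9a68338aaa784"
    "39aba8853aaca9863badaa873caeab88"
    "3caeab88be2e2c09bf2f2d0ac0302e0b"
    "c1312f0cc232300dc333310ec434320f"
)


def first_mismatch(expected, actual):
    """Return the index of the first differing byte, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    return None


def find_shift(expected, actual, max_shift=16):
    """Return (mismatch_offset, shift), where `shift` is the number of
    extra bytes after which `actual` matches `expected` again.
    Returns None if the buffers agree, or (offset, None) if no shift fits.
    """
    start = first_mismatch(expected, actual)
    if start is None:
        return None
    for shift in range(1, max_shift + 1):
        tail = actual[start + shift:]
        if tail and tail == expected[start:start + len(tail)]:
            return start, shift
    return start, None


print(find_shift(ORIGINAL, VERIFY))  # (32, 4)
```

Offset 32 corresponds to the line marked with the asterisk. In this dump the four extra bytes (3c ae ab 88) repeat the word that immediately precedes them.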

Thanks
Manish


2002-11-22 18:49:34

by Manish Lachwani

Subject: RE: Serverworks dma_intr: error=0x40 { UncorrectableError }

http://www.bit-net.com/~rmiller/dt.html

-----Original Message-----
From: Ingo Oeser [mailto:[email protected]]
Sent: Friday, November 22, 2002 2:44 AM
To: Manish Lachwani
Subject: Re: Serverworks dma_intr: error=0x40 { UncorrectableError }


Dear Mr. Lachwani,

On Wed, Nov 20, 2002 at 11:07:11AM -0800, Manish Lachwani wrote:
> Just hunt on google for datatest download ...

Didn't work as expected. I get 51 non-relevant results and
your other suggestion at lkml reveals 36.000 matches.

What's the problem with just giving an exact URL? Or with
putting it up somewhere, so people can download it?

If bandwidth or traffic limit is a problem, please send me an
email containing the sources of this tool and I'll host it, since
my traffic limit would be 150Terabyte and bandwidth is >=155MBit/s.

So please help us out here.

Thanks & Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. ---
D.E.Knuth