2004-01-09 20:05:16

by Chuck Berg

Subject: HPT372 DMA corruption

I have an HPT372 onboard a Soyo Dragon KT400 board. I get corruption when
reading from drives on this controller. (I don't dare write, so I don't
know if writes are affected as well). I'm using 2.6.1, but have experienced
this problem with older kernels (at least it's not crashing anymore).

Disabling DMA on these drives stops the corruption.

It only happens when I heavily exercise certain other devices on the system
at the same time. This includes both a network and a SCSI PCI card that
share an irq with the HPT372, but also a firewire card that does not. Load
on the other onboard IDE interfaces (which do not demonstrate the
corruption) does not trigger it.

I have four identical drives in the system. I've switched cables around to
verify that the problem follows the HPT372. I have no problems when I move
the drives to a Promise card. I have run memtest86 for over 24 hours with
no errors.

It happens whether I read the /dev/hd* devices directly, read a file on the
filesystem, or use /dev/raw.

Here's an example of the corruption. (cmp -l output, left is bad right is
good). It always consists of bytes being replaced with 0. It's usually 4,
but sometimes 64 bytes at a time. On average, about 256 bytes per gigabyte
read are corrupt.

51642365 0 211
51642366 0 154
51642367 0 163
51642368 0 120
63700989 0 100
63700990 0 153
63700991 0 216
63700992 0 2
89260029 0 31
89260030 0 327
89260031 0 200
89260032 0 13
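
One way to look for a pattern in the cmp -l output above is to group the
differing offsets into runs and check how the end of each run is aligned.
The Python sketch below does that for the twelve lines shown. The function
names are illustrative and do not come from any tool mentioned in this
thread. It assumes the usual cmp -l format: 1-based decimal offsets followed
by octal byte values.

```python
from typing import List, Tuple

# The twelve lines of `cmp -l bad good` output quoted above.
CMP_OUTPUT = """\
51642365 0 211
51642366 0 154
51642367 0 163
51642368 0 120
63700989 0 100
63700990 0 153
63700991 0 216
63700992 0 2
89260029 0 31
89260030 0 327
89260031 0 200
89260032 0 13
"""


def parse_cmp_l(text: str) -> List[Tuple[int, int, int]]:
    """Parse `cmp -l` lines into (zero_based_offset, left_byte, right_byte)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        # Offsets are 1-based decimal; byte values are printed in octal.
        rows.append((int(parts[0]) - 1, int(parts[1], 8), int(parts[2], 8)))
    return rows


def group_runs(offsets: List[int]) -> List[Tuple[int, int]]:
    """Collapse sorted offsets into (start, length) runs of adjacent bytes."""
    runs: List[Tuple[int, int]] = []
    for off in offsets:
        if runs and off == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((off, 1))
    return runs


def largest_pow2_alignment(value: int, limit: int = 1 << 20) -> int:
    """Largest power of two (capped at limit) that divides value."""
    align = 1
    while align < limit and value % (align * 2) == 0:
        align *= 2
    return align


runs = group_runs([row[0] for row in parse_cmp_l(CMP_OUTPUT)])
for start, length in runs:
    end = start + length  # offset of the first byte after the run
    print(f"start={start} length={length} "
          f"end_alignment={largest_pow2_alignment(end)}")
```

For this sample, each of the three runs is 4 bytes long and ends exactly on
a 128 KiB (131072-byte) boundary. Three runs is a small sample, so this is a
hint about where to look rather than a conclusion.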

More information (.config, /proc/{interrupts,cpuinfo,ioports,iomem,ide}, bootup
messages, lspci -vvv) is at:
http://www.encinc.com/~chuck/kt400/2.6.1/

Thanks for any help.


2004-01-09 20:24:39

by Richard B. Johnson

Subject: Re: HPT372 DMA corruption

On Fri, 9 Jan 2004, Chuck Berg wrote:

> I have an HPT372 onboard a Soyo Dragon KT400 board. I get corruption when
> reading from drives on this controller. (I don't dare write, so I don't
> know if writes are affected as well). I'm using 2.6.1, but have experienced
> this problem with older kernels (at least it's not crashing anymore).
>
> Disabling DMA on these drives stops the corruption.
>
[SNIPPED...]

>
> 51642365 0 211
> 51642366 0 154
> 51642367 0 163
> 51642368 0 120
> 63700989 0 100
> 63700990 0 153
> 63700991 0 216
> 63700992 0 2
> 89260029 0 31
> 89260030 0 327
> 89260031 0 200
> 89260032 0 13
>

Since whole bytes are not written, this looks strangely like
an attempt to DMA to cached RAM! Since the CPU didn't write
to RAM, the cache doesn't "know" that somebody wrote to it
so the subsequent read comes from cache, not RAM. Somebody
who knows the KT400 software well should verify that DMA-able
buffers are being used and that the driver isn't writing directly
to what is ultimately a user buffer.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2004-01-13 23:30:33

by Chuck Berg

Subject: Re: HPT372 DMA corruption

On Fri, Jan 09, 2004 at 03:24:28PM -0500, Richard B. Johnson wrote:
[cmp -l bad good]
> > 89260029 0 31
> > 89260030 0 327
> > 89260031 0 200
> > 89260032 0 13
>
> Since whole bytes are not written, this looks strangely like
> an attempt to DMA to cached RAM! Since the CPU didn't write

I tested this by reading with O_DIRECT, and immediately after each read(),
read all of a 1MB array (my cache is only 256kB), and then checking the
data. The same corruption occurs.

Via had a DMA corruption bug a couple years ago with similar symptoms,
apparently with the VT82C686B southbridge. Mine is a VT82C586B (which some
people also reported problems with). My board dates long after these
problems were discovered, so I sure hope it's not the same bug. I'll try
upgrading my BIOS to the latest version in case Soyo's changelog is not
entirely honest.

I did learn some more about the pattern of corruption. The data is not
being written to memory - the "bad" data is whatever happened to be there
before. It usually happens in 4, but sometimes 64 or 32 byte chunks.

When I read from the device with O_DIRECT, the corruption only appears at the
very end of the read. I've confirmed this for reads of 512 bytes through 256k
at multiples of 512 bytes.
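
Since the bad bytes are whatever was in the buffer beforehand, one way to
locate them without a known-good copy is to pre-fill the buffer with a
sentinel value and look for sentinel runs after the read. Below is a minimal
Python sketch of that idea. SENTINEL, unwritten_spans,
simulate_short_transfer and direct_read_check are hypothetical names. The
O_DIRECT part assumes Linux and a size that is a multiple of the device's
logical block size, and it has not been run against the hardware discussed
here.

```python
import mmap
import os
from typing import List, Tuple

SENTINEL = 0xA5  # arbitrary fill value; real data can contain it by chance


def unwritten_spans(buf, min_len: int = 4) -> List[Tuple[int, int]]:
    """Return (offset, length) for runs of at least min_len sentinel bytes."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, b in enumerate(buf):
        if b == SENTINEL:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i - start))
            start = None
    if start is not None and len(buf) - start >= min_len:
        spans.append((start, len(buf) - start))
    return spans


def simulate_short_transfer(payload: bytes, missing: int) -> bytearray:
    """Model a transfer that leaves the last `missing` bytes untouched."""
    buf = bytearray([SENTINEL]) * len(payload)
    written = len(payload) - missing
    buf[:written] = payload[:written]
    return buf


def direct_read_check(path: str, size: int) -> List[Tuple[int, int]]:
    """Read `size` bytes with O_DIRECT into a pre-filled, aligned buffer.

    Assumption: Linux only; `size` must suit the device's block size.
    """
    buf = mmap.mmap(-1, size)  # anonymous mappings are page-aligned
    buf.write(bytes([SENTINEL]) * size)
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        got = os.readv(fd, [buf])
        data = buf[:got]  # copy out before the mapping is closed
    finally:
        os.close(fd)
        buf.close()
    return unwritten_spans(data)


if __name__ == "__main__":
    demo = simulate_short_transfer(bytes(range(256)) * 2, missing=4)
    print(unwritten_spans(demo))
```

A single sentinel byte inside genuine data is ignored because of min_len,
but a longer coincidental run would be reported as a false positive, so any
hit is worth confirming against a second read.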

2004-01-14 00:08:10

by Måns Rullgård

Subject: Re: HPT372 DMA corruption

Chuck Berg <[email protected]> writes:

> On Fri, Jan 09, 2004 at 03:24:28PM -0500, Richard B. Johnson wrote:
> [cmp -l bad good]
>> > 89260029 0 31
>> > 89260030 0 327
>> > 89260031 0 200
>> > 89260032 0 13
>>
>> Since whole bytes are not written, this looks strangely like
>> an attempt to DMA to cached RAM! Since the CPU didn't write
>
> I tested this by reading with O_DIRECT, and immediately after each read(),
> read all of a 1MB array (my cache is only 256kB), and then checking the
> data. The same corruption occurs.
>
> Via had a DMA corruption bug a couple years ago with similar symptoms,
> apparently with the VT82C686B southbridge. Mine is a VT82C586B (which some
> people also reported problems with). My board dates long after these
> problems were discovered, so I sure hope it's not the same bug. I'll try
> upgrading my BIOS to the latest version in case Soyo's changelog is not
> entirely honest.

Well, VIA never did have a good reputation.

> I did learn some more about the pattern of corruption. The data is not
> being written to memory - the "bad" data is whatever happened to be there
> before. It usually happens in 4, but sometimes 64 or 32 byte chunks.

Is it always a multiple of 4 bytes? Is there any pattern in the
position of the corruption, such as always aligned to some value?

> When I read from the device with O_DIRECT, the corruption only
> appears at the very end of the read. I've confirmed this for reads
> of 512 bytes through 256k at multiples of 512 bytes.

Could something be cutting off the DMA transfer too early?

--
Måns Rullgård
[email protected]

2004-01-23 02:30:36

by Andre Hedrick

Subject: Re: HPT372 DMA corruption


It has NOTHING to do with VIA!

It has everything to do with a missing function in the hpt366.c code.

It is all about what the FIFO thresholds are with respect to when interrupts
are issued, and a pre-emptive-like notification.

I am waiting for one of my customers to report that they are happy with the
fix, and will ship the fix before I release it to the public. I have a serious
problem with testing volatile patches on the general masses.

In my opinion, after several weeks of hard testing, the changes are
clean, correct, and exact.

Note that this type of patch would not be doable had I not split the
individual DMA operations way back when.

And yes for the remainder of the peanut gallery, I will "SHUT UP" for now.

Cheers,

Andre Hedrick
LAD Storage Consulting Group

On Wed, 14 Jan 2004, Måns Rullgård wrote:

> Chuck Berg <[email protected]> writes:
>
> > On Fri, Jan 09, 2004 at 03:24:28PM -0500, Richard B. Johnson wrote:
> > [cmp -l bad good]
> >> > 89260029 0 31
> >> > 89260030 0 327
> >> > 89260031 0 200
> >> > 89260032 0 13
> >>
> >> Since whole bytes are not written, this looks strangely like
> >> an attempt to DMA to cached RAM! Since the CPU didn't write
> >
> > I tested this by reading with O_DIRECT, and immediately after each read(),
> > read all of a 1MB array (my cache is only 256kB), and then checking the
> > data. The same corruption occurs.
> >
> > Via had a DMA corruption bug a couple years ago with similar symptoms,
> > apparently with the VT82C686B southbridge. Mine is a VT82C586B (which some
> > people also reported problems with). My board dates long after these
> > problems were discovered, so I sure hope it's not the same bug. I'll try
> > upgrading my BIOS to the latest version in case Soyo's changelog is not
> > entirely honest.
>
> Well, VIA never did have a good reputation.
>
> > I did learn some more about the pattern of corruption. The data is not
> > being written to memory - the "bad" data is whatever happened to be there
> > before. It usually happens in 4, but sometimes 64 or 32 byte chunks.
>
> Is it always a multiple of 4 bytes? Is there any pattern in the
> position of the corruption, such as always aligned to some value?
>
> > When I read from the device with O_DIRECT, the corruption only
> > appears at the very end of the read. I've confirmed this for reads
> > of 512 bytes through 256k at multiples of 512 bytes.
>
> Could something be cutting off the DMA transfer too early?
>
> --
> Måns Rullgård
> [email protected]