2000-12-04 17:10:17

by Gerard Sharp

[permalink] [raw]
Subject: HPT366 + SMP = slight corruption in 2.3.99 - 2.4.0-11

Hello Again

Some more details, now that I scraped a few minutes on the weekend to
look into this.
Same hardware configuration as my earlier post; with a newer bios
version on the motherboard; with no improvement.

Some long winded test results, and my conclusions:
[abridged] output from diff for one flawed copy:
===
linux/fs/cramfs/inflate/inffixed.h
- {{{84,7}},99}, {{{0,8}},127}, {{{0,8}},63}, {{{0,9}},223},
+ {{{84,7}},99}, {{{0,8}},127}{0,8}},63}, {{{0,9}},223},
linux/fs/jffs/intrep.c
- jffs_fmfree_partly(fmc, fm, total_data_size);
- jffs_fm_write_unlock(fmc);
+ jffs_fmfree_partly(fmc, fm,
total_data_sizport jffs_fm_write_unlock(fmc);
linux/kernel/resource.c
-int allocate_resource(struct resource *root, struct resource *new,
+int allocate_rrce(struct resource *root, struct resource *new,
===
Closer inspection of the three "corrupt" files, using the command
od -tc <file> | less
revealed that in all cases, the corruption was the four bytes
immediately preceding an exact multiple of 4096 - the block size for the
(ext2) fs...
I may well go back and read the "corruption" thread which gave me the
idea for the comparison when I wake up :(
for inffixed.h, the correct dump reads
0017740 { { { 8 4 , 7 } } , 9 9 } , {
0017760 { { 0 , 8 } } , 1 2 7 } , { {
0020000 { 0 , 8 } } , 6 3 } , { { { 0
0020020 , 9 } } , 2 2 3 } , \n {
and the flawed dump reads
0017740 { { { 8 4 , 7 } } , 9 9 } , {
0017760 { { 0 , 8 } } , 1 2 7 } \0 \0 \0 \0
0020000 { 0 , 8 } } , 6 3 } , { { { 0
0020020 , 9 } } , 2 2 3 } , \n {
0x0020000 -> 131072 -> 32 x 4k

likewise for intrep.c
0177740 r t l y ( f m c , f m , t o
0177760 t a l _ d a t a _ s i z e ) ; \n
0200000 \t \t \t j f f s _ f m _ w r i t e
0200020 _ u n l o c k ( f m c ) ; \n \t \t
and the flawed dump reads
0177740 r t l y ( f m c , f m , t o
0177760 t a l _ d a t a _ s i z p o r t
0200000 \t \t \t j f f s _ f m _ w r i t e
0200020 _ u n l o c k ( f m c ) ; \n \t \t
0x0200000 -> 2097152 -> 512 x 4k


I have a couple more examples; the corruption is still present after a
reboot; but I have yet to see what fsck makes of it...
[Addendum: corruption is still present after fsck]

So, in summary; when using the HPT366 controller with my claimed ATA66
drive; using an SMP kernel with two Celeron 466's (at 466), under load I
find intermittant corruption of the ext2 filesystem.
always four bytes; and apparently commonly (maybe always?) the four
before an exact multiple of 4096 bytes - the filesystem block size.
The values that are written instead of the correct data do not appear to
be random; but rather data from the (memory) cache; I've noticed one or
two previously that look "familiar"; but that may just be my tired eyes.

That's all I can think (ramble?) of at this time; Awaiting bright ideas;
or more free time to fiddle more. Thanks in Advance (for all the bright
ideas :) )


Gerard Sharp
Two Penguins at 1024x768


2000-12-05 21:12:03

by Winfried Truemper

[permalink] [raw]
Subject: Re: HPT366 + SMP = slight corruption in 2.3.99 - 2.4.0-11


What helped me was to use old IDE cables to enforce UDMA-33.
Switching to an 80-conductor cable makes the machine freeze
regardless of what I set in the HPT-BIOS.

Upgrading the MB BIOS, fans on the chipset and a 435 watts :)
power supply are reported to help some cases. I had no luck.


Regards
-Winfried