2002-07-18 01:57:35

by J. Hart

Subject: File Corruption in Kernel 2.4.18


A large directory tree (70652 files, 7.6G) is copied recursively to an
empty destination directory using the following commands :

mkdir aminet1/
cp -a aminet aminet1/

The source and destination directories are then compared using
the following commands:

diff -r aminet aminet1/aminet > difflist

A few of the files at the copy destination, typically three or four, will
usually be corrupt while the source files will be correct. Occasionally the
copy will be done without any corrupt files at the destination. The
mem=nopentium option appears to have no effect on this. An overnight test using
the memtest86 utility shows no memory errors. The corruption in each file
occurs in precise 4096 byte blocks. System logs show no evidence of any
trouble, and no kernel
panics, warning messages or crashes are observed. If there is any other user
activity while the copy is running, the system will frequently lock up requiring
a hard reset and reboot. This forces a file system check due to the lack of a
clean unmount. System logs also show no evidence of any trouble after the
lockup, and no kernel panics or other messages have been observed.
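
To pin down which 4096-byte blocks of a corrupt copy differ from the source, a block-by-block comparison can help. The following is a minimal Python sketch, not something used in the original report; the function name differing_blocks and the demo file names are made up for illustration, and the demo runs on throwaway files rather than the real tree.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # block size mentioned in the report; adjust if needed


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Return the indices of block_size-byte blocks whose contents differ."""
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            chunk_a = fa.read(block_size)
            chunk_b = fb.read(block_size)
            if not chunk_a and not chunk_b:
                break  # both files are exhausted
            if chunk_a != chunk_b:
                bad.append(index)
            index += 1
    return bad


if __name__ == "__main__":
    # Demo on temporary files: build a 4-block file and damage block 2 of a copy.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")  # hypothetical file names
        dst = os.path.join(tmp, "dst.bin")
        data = bytes(range(256)) * 64  # 16384 bytes = 4 blocks of 4096
        with open(src, "wb") as f:
            f.write(data)
        damaged = bytearray(data)
        damaged[2 * BLOCK_SIZE:3 * BLOCK_SIZE] = b"\0" * BLOCK_SIZE
        with open(dst, "wb") as f:
            f.write(damaged)
        print(differing_blocks(src, dst))
```

If the reported block indices always start and end on 4096-byte boundaries, that would be consistent with whole pages or filesystem blocks being damaged rather than individual bytes.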

If a tar file is made of the source directory and then extracted, and the
resultant extracted directory compared with the original, similar effects are
observed.

Are there any kernel boot or build parameters which could be used
to give additional diagnostics ?

motherboard : ASUS A7V
Linux version : Slackware 8
Kernel : 2.4.18
hard disk : ATA100 IBM-DTLA-307045 45gb
hd controller : Promise Technology, Inc. 20265
cpu : 900mhz AMD Athlon


2002-07-18 03:08:28

by Kelledin

Subject: Re: File Corruption in Kernel 2.4.18

This could possibly be a problem with your hard drive. Judging
from the model number, you have a 45GB IBM DeskStar 75GXP, one
of the first IBM drives to earn the nickname "DeathStar" for its
high failure rate. What does IBM's Drive Fitness Test tell you?

I'll see about performing your test tonight; I've got a hefty
little DivX directory I can throw around as I wait for
j2sdk-1.4.0 to finish compiling. Such a test should be
sufficient...

This could also be a recurrence of ye olde VIA686B PCI+IDE issue.
IIRC, some VIA686B motherboards that had that flaw were
effectively unfixable, simply because certain motherboard
manufacturers spotted the problem before everyone else (even
VIA?) and tried their own partial kludge fixes for it. Gotta
love VIA.

On Wednesday 17 July 2002 09:00 pm, J. Hart wrote:
> A large directory tree (70652 files, 7.6G) is copied
> recursively to an empty destination directory using the
> following commands :
>
> mkdir aminet1/
> cp -a aminet aminet1/
>
> The source and destination directories are then compared
> using the following commands:
>
> diff -r aminet aminet1/aminet > difflist
>
> A few of the files at the copy destination, typically
> three or four, will usually be corrupt while the source files
> will be correct. Occasionally the copy will be done without
> any corrupt files at the destination. The mem=nopentium
> option appears to have no effect on this. An overnight test
> using the memtest86 utility shows no memory errors. The
> corruption in each file occurs in precise 4096 byte blocks.
> System logs show no evidence of any trouble, and
> no kernel panics, warning messages or crashes are observed.
> If there is any other user activity while the copy is running,
> the system will frequently lock up requiring a hard reset and
> reboot. This forces a file system check due to the lack of a
> clean unmount. System logs also show no evidence of any
> trouble after the lockup, and no kernel panics or other
> messages have been observed.
>
> If a tar file is made of the source directory and then
> extracted, and the resultant extracted directory compared with
> the original, similar effects are observed.
>
> Are there any kernel boot or build parameters which could
> be used to give additional diagnostics ?
>
> motherboard : ASUS A7V
> Linux version : Slackware 8
> Kernel : 2.4.18
> hard disk : ATA100 IBM-DTLA-307045 45gb
> hd controller : Promise Technology, Inc. 20265
> cpu : 900mhz AMD Athlon

--
Kelledin
"If a server crashes in a server farm and no one pings it, does
it still cost four figures to fix?"

2002-07-18 03:21:28

by Hell.Surfers

Subject: RE:Re: File Corruption in Kernel 2.4.18

I disagree, it's a bug in bash, I can just feel it.

- I found it hard, it was hard to find, ohwell, whatever, nevermind. Kurt Cobain.

On Wed, 17 Jul 2002 22:11:13 -0500 Kelledin <[email protected]> wrote:



2002-07-18 04:13:09

by Kelledin

Subject: Re: File Corruption in Kernel 2.4.18

Ok, the test:

I chose directory /home/kelledin/gnutella. It contains
approximately 10GB of files, ranging in size from ~5MB to
~700MB. Most are ~600-650MB.

System specs are here:
http://www.anandtech.com/mysystemrig.html?rigid=5092

Only out-of-date info on that page is the kernel--I'm running
2.4.18+XFS-1.0.2+RML-preempt, compiled with gcc-2.95.3. Kernel
was booted with "acpi=no-idle mem=nopentium" options.

While I was compiling jdk-1.4.0, I did the following:

[ kelledin@valhalla ~ ] # mkdir gnutella2
[ kelledin@valhalla ~ ] # cp -a gnutella gnutella2
[ kelledin@valhalla ~ ] # for FNAME in gnutella/*; do cmp "$FNAME" "gnutella2/$FNAME"; done

The "cp -a" operation took 19 minutes, during which the system
load reached approximately 4.0 and the CPU temperature held at
54 C. Ambient case temperature held at 26 C. Swap usage did
not change. System was somewhat sluggish but responsive enough
to play an mp3 and allow me to open terminal windows. j2sdk
compile is still apparently going strong.

The comparison check...well, it finished while I was away
getting a snack. It printed no output, which means the check
probably completed successfully. Maybe I'll run some md5 sums
later, just to be sure.
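
A checksum pass of the kind mentioned here could look roughly like the sketch below. It is only an illustration in Python, not what was actually run; the names md5_of and mismatched_files are hypothetical. Note that files read back shortly after a copy may be served from the page cache rather than from the disk, so a clean result right after copying is weaker evidence than one obtained after a remount or reboot.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """MD5 hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_files(src_root, dst_root):
    """Return sorted relative paths under src_root that are missing from
    dst_root or whose MD5 sums differ."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or md5_of(src) != md5_of(dst):
                bad.append(rel)
    return sorted(bad)


if __name__ == "__main__":
    # Demo on temporary directories with one deliberately altered file.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        os.makedirs(src)
        os.makedirs(dst)
        for root, payload in ((src, b"original"), (dst, b"0riginal")):
            with open(os.path.join(root, "file.bin"), "wb") as f:
                f.write(payload)
        print(mismatched_files(src, dst))
```

For the test described above, the call would be along the lines of mismatched_files("gnutella", "gnutella2/gnutella"), since "cp -a gnutella gnutella2" places the copy one level down.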

System load stayed at about 3.0, and temperatures remained
approximately the same as during the copy operation.

The relevant software:

kernel...well, you know.
glibc-2.2.5+linuxthreads+LSB+blowfish+math patches
libacl-2.0.11
libattr-2.0.8
bash-2.05a (Just for you, Hell.Surfers, just for you ;)
fileutils 4.1.8 with ACL patches and a Kelledin special.
Tarball can be found at:

ftp://skarpsey.dyndns.org/fileutils-4.1.8acl-kelledin.tar.bz2

Things that might be causing the corruption in our friend
J.Hart's case:

Buggy chipset (damn VIA!!!)
Faulty CPU (heat damage, chipped core?)
Faulty hard drive (hey, it's a DeathStar.)
Faulty IDE controller (if using offboard IDE)
Flaky cable (80-conductor ATA cable doesn't like being folded,
stacked, crumpled, etc., not even slightly)
Buggy IDE driver in the kernel
Buggy filesystem driver
Buggy fileutils
Buggy VM

I can't really test any of the possible software problems,
because I'm all SCSI, all XFS, bleeding-edge fileutils, and
didn't have any really significant swapping going on. There's a
production server I could possibly test it on, but...well...it's
a production machine. Maybe I'll repeat the test a few times
later.

On Wednesday 17 July 2002 10:11 pm, Kelledin wrote:
> This could possibly be a problem with your hard drive.
> Judging from the model number, you have a 45GB IBM DeskStar
> 75GXP, one of the first IBM drives to earn the nickname
> "DeathStar" for its high failure rate. What does IBM's Drive
> Fitness Test tell you?
>
> I'll see about performing your test tonight; I've got a hefty
> little DivX directory I can throw around as I wait for
> j2sdk-1.4.0 to finish compiling. Such a test should be
> sufficient...
>
> This could also be a recurrence of ye olde VIA686B PCI+IDE
> issue. IIRC, some VIA686B motherboards that had that flaw were
> effectively unfixable, simply because certain motherboard
> manufacturers spotted the problem before everyone else (even
> VIA?) and tried their own partial kludge fixes for it. Gotta
> love VIA.

--
Kelledin
"If a server crashes in a server farm and no one pings it, does
it still cost four figures to fix?"

2002-07-18 07:19:15

by Ville Herva

Subject: Re: File Corruption in Kernel 2.4.18

On Thu, Jul 18, 2002 at 11:00:05AM +0900, you [J. Hart] wrote:
>
> A few of the files at the copy destination, typically three or four, will
> usually be corrupt while the source files will be correct. Occasionally the
> copy will be done without any corrupt files at the destination. The
> mem=nopentium option appears to have no effect on this. An overnight test using
> the memtest86 utility shows no memory errors. The corruption in each file
> occurs in precise 4096 byte blocks.

> motherboard : ASUS A7V

Asus A7V is Via KT133 based, right? Does it have an additional Promise ide
controller?

> Linux version : Slackware 8
> Kernel : 2.4.18

Stock 2.4.18, no patches? Which filesystem are you using? Ext2, ext3, other?

> hard disk : ATA100 IBM-DTLA-307045 45gb
> hd controller : Promise Technology, Inc. 20265

So the harddisk is connected to Promise, not Via? You have no other
harddisks?

> cpu : 900mhz AMD Athlon

I had enormous trouble with a KT133 (A or not) based mobo (Abit KT7(A)-RAID)
in the past - it would just corrupt data when transferring big files from the
additional ide controller (HPT370 in this case). The Via ide controller
didn't show this behaviour.

- This happened on 2.2.20, 2.4.15, 2.4.18preX + ide-patch.
- Memtest86 showed nothing
- Network activity seemed to have something to do with it
- Changing the NIC to another PCI slot and tweaking bios params seemed to
help, but eventually it happened again
- I eventually concluded that KT133 corrupts PCI transfers under load, which
was found out by others in 'net as well.
- Tried bios updates and contacting Via, Highpoint, Abit. Highpoint and Abit
never cared to answer. Neither did Via until I spotted a Via employee on
viahardware.com forum. She said they'd investigate the issue; I have never
heard from her since.
- Ditched the mobo for good, bought an Abit ST6R, and never had a problem
since. You may be lucky just switching the drive to Via ide.

First reports on the corruption:

http://marc.theaimsgroup.com/?l=linux-kernel&m=100651892331843&w=2
http://marc.theaimsgroup.com/?l=linux-kernel&m=100669782329815&w=2
http://groups.google.com/groups?q=We+first+reported+disk+corruption+with+a+VIA+KT133A+based+board&hl=en&lr=&ie=UTF-8&oe=utf-8&selm=linux.kernel.00c201c1a033%241cf46700%24b71c64c2%40viasys.com&rnum=3

There was a long thread on forums.viahardware.com as well
(http://forums.viaarena.com/messageview.cfm?catid=6&threadid=7171&start=21),
but it seems they have censored it away for good.

I've also received reports of similar experiences from a number of people
since I wrote to linux-kernel about this.

I reproduced the problem with the wrchk utility I wrote
(http://iki.fi/v/tmp/wrchk.c) but it seems you can do it with your directory
tree copying.
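
The general write-then-verify idea behind such a test can be sketched as follows. This is NOT the wrchk source, whose implementation is not shown in this thread; it is a hedged Python illustration with made-up names (pattern_block, write_pattern, verify_pattern) that writes a known per-block pattern to a scratch file and reads it back. Pointing anything like this at a raw device such as /dev/hde destroys the data on it.

```python
import hashlib
import os
import tempfile


def pattern_block(index, block_size):
    """Deterministic contents for block `index`, so a misplaced or damaged
    block can be told apart from its neighbours."""
    seed = hashlib.sha256(str(index).encode()).digest()  # 32 bytes
    reps = block_size // len(seed) + 1
    return (seed * reps)[:block_size]


def write_pattern(path, n_blocks, block_size=4096):
    """Write n_blocks pattern blocks to path and push them out to storage."""
    with open(path, "wb") as f:
        for i in range(n_blocks):
            f.write(pattern_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, n_blocks, block_size=4096):
    """Return the indices of blocks that do not match the expected pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad


if __name__ == "__main__":
    # Demo on a small scratch file; a real stress run would need far more
    # data than fits in RAM so that reads actually hit the disk.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "scratch.bin")  # hypothetical path
        write_pattern(scratch, 256)
        print(verify_pattern(scratch, 256))
```

Running two such writers in parallel against files on different IDE channels would be the closest analogue to the parallel-write scenario, under the assumption that the load on the PCI bus is what matters.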


-- v --

[email protected]

2002-07-18 07:44:41

by Wilfried Weissmann

Subject: Re: File Corruption in Kernel 2.4.18

Ville Herva wrote:
> I had enormous trouble with a KT133 (A or not) based mobo (Abit KT7(A)-RAID)
> in the past - it would just corrupt data when transferring big files from the
> additional ide controller (HPT370 in this case). The Via ide controller
> didn't show this behaviour.

I have an Abit KT7-RAID with an AMD Thunderbird 800 and have also seen lots
of trouble. I finally figured out that reducing the memory bus
clock to 100MHz (instead of 133MHz) makes my system pretty stable (My
memory modules can take 133MHz! I checked the specs.). Maybe the
chipset memory tweaks that the linux kernel does are not enough to fix
all memory problems...

[snip]
> - Ditched the mobo for good, bought an Abit ST6R, and never had a problem
> since. You may be lucky just switching the drive to Via ide.

Well, after messing around with the mobo for almost 2 years, it finally
seems to be stable. But I wish I could have done useful stuff with my
computer during that time.

[snip]
> I reproduced the problem with the wrchk utility I wrote
> (http://iki.fi/v/tmp/wrchk.c) but it seems you can do it with your directory
> tree copying.

I got to check this out!

bye,
Wilfried

2002-07-21 02:50:56

by J. Hart

Subject: Re: File Corruption in Kernel 2.4.18


I must apologize for the delay in replying to the many helpful
responses I received on this problem, and I'd like to say "thank you
very much" (Domo Arigato Gozaimasu) to the many who looked into this on
my behalf....:-)

Here's the status so far:

I ran the e2fsck utility, which did not detect any problems on the
hard drive. I had not known about the IBM DFT utility until it was
suggested by one of the responses, so I picked that up and tried it. I
ran the quick test, which immediately indicated two corrupt sectors. I
ran the Corrupt Sector Repair Utility (which does not seem to be
documented in the manual that comes with DFT) after backing up, and
repeated the tests a couple of times. There were no more complaints
from DFT. I will be reloading my test directory, which I had to dump
before the backup, and I will repeat the copy test on Monday after that
to see if the file corruption still occurs.

I do not know what caused the corrupted sectors, but I am giving
serious thought to a new mother board (to get rid of the chipset and
Promise controller), and perhaps a replacement for the "IBM DeathStar".
I'll let you all know the outcome of the tests on Monday. I am still
curious about the precise 4k damaged blocks.





2002-07-22 10:07:31

by Wilfried Weissmann

Subject: Re: File Corruption in Kernel 2.4.18

Ville Herva wrote:
> On Thu, Jul 18, 2002 at 09:47:33AM +0200, you [Wilfried Weissmann] wrote:
>
>>[snip]
>>
>>>I reproduced the problem with the wrchk utility I wrote
>>>(http://iki.fi/v/tmp/wrchk.c) but it seems you can do it with your directory
>>>tree copying.
>>
>>I got to check this out!
>
>
> I had the problem to appear almost certainly when doing wrchk to raw disks
> (you should be able to use large files just as well), two writes in parallel
> (eg. /dev/hde, /dev/hdg). Occasionally it took ~50GB of writing before it
> happened (multiple rounds), but it always did.

I did a simultaneous:
wrchk /dev/hd[fh] 0 64 2
The two disks were connected to the HPT-370 controller and both were
configured as slaves (the masters are configured into an ataraid-0 and
contain my root partition). The test disks were IBM DTLA-307030 (30GB)
with updated firmware. These disks are locked down to ata-44 by the
kernel and I only got a maximum I/O speed of 21.7 MB/s. During the read
phase one of the disks always slowed down, while the other disk
proceeded at normal speed. In the first run I got 7.2 MB/s, and in the
second run the other disk slowed down to a crawling 5.3 MB/s, but the test
completed without any errors. *joy* However, I am not sure that the test
stressed the chipset enough to trigger the error, because the
throughput is so low.
My mainboard is an Abit KT7-RAID (VIA KT133 chipset), BIOS version 3R.
The memory bus was reduced to 100 MHz (SDR). Linux kernel 2.4.18, tainted by
NVidia(TM). ;)
DivX 5.0 seems to be a good stability test for VIA chipset based
motherboards. It finds errors that not even memtest could detect.

greetings,
Wilfried

PS: I will do another run on my raid-0 root partition. The 2 disks that
are part of the raid run at ata-100 (Maxtor 40GB).

2002-07-23 02:53:37

by J. Hart

Subject: Re: File Corruption in Kernel 2.4.18


Here is a further update on the file corruption question :

I ran the DFT utility which picked up two bad sectors, which I then
repaired. A rerun of DFT after that gave no further reports of any problems. I
then tried the directory tree copy (cp -a aminet aminet1) which produced one
corrupted file at the destination. All the other destination files in the tree
(70652 files, 7.6G) appeared to be correct.

An additional rerun of the IBM DFT utility after this reported no problems
despite the corrupt copy.

In order to resolve this issue, my employer is considering the replacement
of my current machine with a new one having the following specifications :


motherboard: Asus P4T AGP Pro/4X
ram : 1Gb
OS : Linux 2.4.7-10 i686 unknown
CPU : Intel(R) Pentium(R) 4 CPU 1800MHz
Gfx : Matrox Graphics, Inc. MGA G400 AGP
drives : Seagate 40gb UATA ST340810A (two of these)
controller : Intel PIIX4 Ultra 100 Chipset
: (Intel Corporation 82801BA IDE U100)
chipset : Intel Corporation 82850 850 (Tehama) Chipset Host Bridge (MCH)

Are there any outstanding issues with machines of this new configuration as
there seemed to be with my old machine ?

With Thanks,

J. Hart

2002-07-23 03:01:13

by Thunder from the hill

Subject: Re: File Corruption in Kernel 2.4.18

Hi,

On Tue, 23 Jul 2002, J. Hart wrote:
> OS : Linux 2.4.7-10 i686 unknown
>
> Are there any outstanding issues with machines of this new
> configuration

Maybe a new kernel. I think the rest should be OK.

Regards,
Thunder
--
(Use http://www.ebb.org/ungeek if you can't decode)
------BEGIN GEEK CODE BLOCK------
Version: 3.12
GCS/E/G/S/AT d- s++:-- a? C++$ ULAVHI++++$ P++$ L++++(+++++)$ E W-$
N--- o? K? w-- O- M V$ PS+ PE- Y- PGP+ t+ 5+ X+ R- !tv b++ DI? !D G
e++++ h* r--- y-
------END GEEK CODE BLOCK------