2002-04-09 03:15:48

by Russell Miller

Subject: 2.2.18 data corruption issues

I'm not subscribed to the list, so please CC me on any responses.

I am running the stock 2.4.18 kernel, downloaded a few days ago from the
kernel mailing list. The kernel was custom-built to my specifications, using
the default RH7.2 gcc (config available upon request). The machine is a dual
pentium-III 1000 MHz, one scsi drive (sym53cxxx driver) and two ide drives.
All filesystems are ext3 journaling.

We copied several very large partitions from one machine to another in an
attempt to put a new machine in service. Just for kicks, we attempted to
verify the copy. It turns out that a small number of files, about 60 to 100
on a 17 gig partition, are corrupted. Mod times, owners, and even file sizes
are exactly the same. Pretty consistently, four null characters (and
occasionally other characters, or a different number of them) are prepended
to the file, and the last four characters are rolled off the end. We ran the
copy (and rsync and stuff) multiple times.
Each time different files were modified, in a seemingly random fashion, but
with a fairly consistent pattern of corruption.
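
The check described above can be sketched roughly as follows. This is only
an illustration, not the procedure actually used for the verification: the
names shift_offset and compare_trees are hypothetical, and files are read
whole into memory for simplicity, which would need chunking for very large
files.

```python
import os
from typing import Dict, Optional


def shift_offset(original: bytes, copy: bytes, max_shift: int = 16) -> Optional[int]:
    """Return k if `copy` is `original` pushed right by k bytes.

    "Pushed right" means: k arbitrary bytes at the front, followed by the
    original data with its last k bytes dropped, so the size is unchanged.
    Returns 0 for identical data and None if no such shift explains it.
    """
    if len(original) != len(copy):
        return None
    if original == copy:
        return 0
    # Only consider shifts that leave at least one original byte to match.
    for k in range(1, min(max_shift, len(original) - 1) + 1):
        if copy[k:] == original[:len(original) - k]:
            return k
    return None


def compare_trees(src_root: str, dst_root: str) -> Dict[str, str]:
    """Compare every file under src_root with its counterpart in dst_root."""
    report: Dict[str, str] = {}
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst):
                report[rel] = "missing"
                continue
            with open(src, "rb") as f:
                a = f.read()
            with open(dst, "rb") as f:
                b = f.read()
            if a == b:
                continue
            k = shift_offset(a, b)
            report[rel] = f"shifted by {k} bytes" if k else "differs"
    return report
```

Running compare_trees on the source and destination mount points would list
only the files that differ, and would flag the ones that match the
"four bytes in front, four bytes lost at the end" pattern.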

I have turned off DMA on the disk drives to no effect. I have replaced the
ide cables with higher quality cables. The problem seems to be occurring on
both the scsi and ide drives, which to me eliminates the ide or scsi
controllers, drivers, or anything on the back end of them as the source of
the problem. This same machine was in service previously, minus one disk
drive, and this problem never manifested itself, leading me to believe it is
either something to do with ext3 journaling, or with the 2.4.18 kernel.

Does anyone have any tips on how to debug this? I have administrative access
to the machine, and although it is running production, I am very keen on
getting this resolved and will provide any information you need. If this is
a kernel or ext3 problem, as I suspect, I imagine you want to get this
resolved as much as I do.

Thank you in advance for your help.

--Russell



--
Russell Miller
[email protected]
Somewhere in Northwestern Iowa


2002-04-09 06:07:36

by Denis Vlasenko

Subject: Re: 2.2.18 data corruption issues

On 9 April 2002 00:52, Russell Miller wrote:
> I'm not subscribed to the list, so please CC me on any responses.
>
> I am running the stock 2.4.18 kernel, downloaded a few days ago from the
> kernel mailing list. The kernel was custom-built to my specifications,
> using the default RH7.2 gcc (config available upon request). The machine
> is a dual pentium-III 1000 MHz, one scsi drive (sym53cxxx driver) and two
> ide drives. All filesystems are ext3 journaling.

What is your GCC version?

> We copied several very large partitions from one machine to another in an
> attempt to put a new machine in service. Just for kicks, we attempted to
> verify the copy. It turns out that a small amount of files, about 60 to
> 100 on a 17 gig partition, are corrupted. Mod times are exactly the same,
> owners, even file size. It turns out that pretty consistently four null
> characters (and occasionally other characters and a different number) are
> appended to the beginning of the file, and the last four characters are
> rolled off the end. We ran the copy (and rsync and stuff) multiple times.
> Each time different files were modified, in a seemingly random fashion, but
> with a fairly consistent pattern of corruption.
>
> I have turned off DMA on the disk drives to no effect. I have replaced the
> ide cables with higher quality cables. The problem seems to be occurring on
> both the scsi and ide drives, which to me eliminates the ide or scsi
> controllers, drivers, or anything on the back end of them as the source of
> the problem. This same machine was in service previously, minus one disk
> drive, and this problem never manifested itself, leading me to believe it
> is either something to do with the ext3 jfs, or with the 2.4.18 kernel.

It was Linux? What kernel version? Did you try copying with that kernel?

> Does anyone have any tips on how to debug this? I have administrative
> access to the machine, and although it is running production, I am very
> keen on getting this resolved and will provide any information you need.
> If this is a kernel or ext3 problem as I suspect I imagine you want to get
> this resolved as much as I do.

You may try to repeat your test with:
* newer / older kernel (maybe this is a kernel bug?)
* newer GCC (miscompiled kernel?)
* different fs (ext3 bug?)
* different hardware (last resort to rule out hw problems)
--
vda

2002-04-09 19:26:50

by Andrew Burgess

Subject: Re: 2.2.18 data corruption issues


I have a very similar setup and I ran 2.4.18 for about a month with similar
corruptions. Things seem better with 2.4.19pre6. I wrote some test programs
to copy and compare files on 5 drives at once as well as a large footprint
program that checksums an array the size of RAM and then sleeps to let it
page out and then repeats, to beat up on swap. No failures on 2.4.19pre6 except
I did hit the infamous 3ware 7810 Tyan 2460 lockup and had to remove the 3ware
array from my test.
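
For anyone who wants to try something similar, the large-footprint test
described above could look roughly like this. It is a sketch written for
illustration, not the actual test program: the name memory_soak and its
parameters are hypothetical, and the buffer size has to be chosen by hand
to roughly match RAM so that the data is likely to be paged out during
the sleep.

```python
import os
import time
import zlib


def memory_soak(size_bytes: int, rounds: int = 3,
                sleep_seconds: float = 60.0, chunk: int = 1 << 20) -> int:
    """Fill a buffer, checksum it, then sleep and re-checksum repeatedly.

    Returns the number of rounds in which the checksum no longer matched,
    i.e. in which the data changed while sitting in memory or swap.
    """
    buf = bytearray(size_bytes)
    # Fill with random data one chunk at a time to avoid a second huge copy.
    for off in range(0, size_bytes, chunk):
        n = min(chunk, size_bytes - off)
        buf[off:off + n] = os.urandom(n)
    expected = zlib.crc32(buf)

    failures = 0
    for _ in range(rounds):
        # Sleeping gives the kernel a chance to page the buffer out.
        time.sleep(sleep_seconds)
        if zlib.crc32(buf) != expected:
            failures += 1
    return failures
```

Something like memory_soak(1 << 30, rounds=10) would exercise a 1 GB buffer;
a non-zero result would point at memory or swap rather than at the
filesystem.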

>I am running the stock 2.4.18 kernel, downloaded a few days ago from the
>kernel mailing list. The kernel was custom-built to my specifications, using
>the default RH7.2 gcc (config available upon request). The machine is a dual
>pentium-III 1000 MHz, one scsi drive (sym53cxxx driver) and two ide drives.
>All filesystems are ext3 journaling.
>
>We copied several very large partitions from one machine to another in an
>attempt to put a new machine in service. Just for kicks, we attempted to
>verify the copy. It turns out that a small amount of files, about 60 to 100
>on a 17 gig partition, are corrupted. Mod times are exactly the same,
>owners, even file size. It turns out that pretty consistently four null
>characters (and occasionally other characters and a different number) are
>appended to the beginning of the file, and the last four characters are
>rolled off the end. We ran the copy (and rsync and stuff) multiple times.
>Each time different files were modified, in a seemingly random fashion, but
>with a fairly consistent pattern of corruption.
>

2002-04-09 23:21:44

by Russell Miller

Subject: Re: 2.2.18 data corruption issues

On Tuesday 09 April 2002 06:08 am, Denis Vlasenko wrote:

> What is your GCC version?
>
2.96 20000731 (Redhat Linux 7.1 2.96-98)

> It was Linux? What kernel version? Did you try copying with that kernel?
>
Kernel 2.2.16. I believe we did, but we didn't verify the copies. Regardless,
we never found any problems with it.

> You may try to repeat your test with:
> * newer / older kernel (maybe this is a kernel bug?)
> * newer GCC (miscompiled kernel?)
> * different fs (ext3 bug?)
> * different hardware (last resort to rule out hw problems)

I'll do this. I doubt it's a miscompiled kernel; gcc problems usually cause
crashes, not such subtle problems.

--Russell

--
Russell Miller
[email protected]
Somewhere in Northwestern Iowa