2002-02-03 05:38:14

by Luis A. Montes

Subject: 2.4.17 filesystem corruption

Hi there,

I have been experiencing filesystem corruption very frequently with
2.4.17. I've probably reinstalled my system more than 10 times in as
many days. So far it seems to be related to the kernel version and
perhaps to the UDMA settings. I haven't been able to crash the system
running 2.4.5 or 2.2.19, but it has crashed with 2.4.17 every time,
regardless of cpu optimization (Athlon, K6 or i386), AGP (built-in, as
a module or not built), filesystem (ext2 or xfs). The last kernel I tried
was 2.4.17 with an ext2 fs and a patch by Lionel Bouton that I found on
this list to handle my SiS 735 chipset. It did seem more stable for a
while, until I decided to try and enable ultra dma 66 on my primary
drive. The two partitions that I had mounted got completely corrupted
(on boot the kernel tried to mount them as a UMSDOS fs) and e2fsck
wasn't able to fix them. It did seem to work with udma 33: I compiled
the kernel without a problem as a test of disk IO, but I can't really
tell for sure that there wasn't some subtle disk corruption just waiting
to crop up.
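
A kernel compile that finishes does not prove that every block came back intact. One way to look for the subtle corruption mentioned above is to write files with known checksums to a test partition and compare them later. This is only a minimal sketch: the function names and file layout are made up for illustration, and nothing in this thread used it.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 16):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_test_files(directory, count=8, size=1 << 20):
    """Write `count` files of random bytes; return {path: expected digest}."""
    os.makedirs(directory, exist_ok=True)
    manifest = {}
    for index in range(count):
        path = os.path.join(directory, "blob%04d.bin" % index)
        data = os.urandom(size)
        with open(path, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # ask the kernel to push the data to disk
        manifest[path] = hashlib.sha256(data).hexdigest()
    return manifest


def find_corrupted(manifest):
    """Return the paths that are missing or no longer match their digest."""
    return [path for path, expected in sorted(manifest.items())
            if not os.path.exists(path) or sha256_of(path) != expected]
```

Reading the files back right away will mostly be served from the page cache, so the check only says something about the disk after a remount or reboot. For that the manifest has to be saved somewhere outside the partition under test.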

My system is as follows:

ECS K7S5A motherboard with the SiS 735 chipset, 128MB of PC133 SDRAM
and an Athlon XP 1700+ processor at 1.4-something GHz. Memory is good,
tested with memtest86 overnight for several full passes.

hda: Western Digital Caviar WDC AC313000R (it is *not* in the udma
black list, should it be?)

hdb: Western Digital Caviar WDC AC23200L (this one is in the black
list, but is not being mounted, so it shouldn't matter, right?)


Software: Straight Slackware 8 install, with XFree86 from cvs. But
lately I haven't even tried glx, dri et al at all ...

Questions:

- There is a patch by the IDE maintainer (Andre Hedrick?), but I don't
know if it is supposed to make the system behave better or is a
new major architectural change (if it is the latter I probably don't
want to compound my problem, do I?). Although at this point I'm
willing to try almost anything, even windows ;-)

- Has anybody gotten a system similar to mine to work on these
kernels? This same kernel (2.4.17-xfs with the rml patch) was rock
solid on my old motherboard, a VIA Apollo K6-III motherboard with
the same HD's.

- Was there some change between 2.4.5 and 2.4.17 that could have
introduced problems in the IDE layer? I really tried to test 2.4.5
to the limits, compiling two versions of the kernel and XFree86
simultaneously, and the filesystem survived. But unfortunately a
negative result is not proof of stability.


2002-02-03 06:04:10

by Pierre Rousselet

Subject: Re: 2.4.17 filesystem corruption

Luis A. Montes wrote:
> I have been experiencing filesystem corruption very frequently with
> 2.4.17. I've probably reinstalled my system more than 10 times in as
> many days. So far it seems to be related to the kernel version and
> perhaps to the UDMA settings. I haven't been able to crash the system
> running 2.4.5 or 2.2.19, but it has crashed with 2.4.17 every time,
> regardless of cpu optimization (Athlon, K6 or i386), AGP (built-in, as
> a module or not built), filesystem (ext2 or xfs).

The ext2 code diff in 2.4.18-pre7 solved it for me. I don't use xfs.

Pierre
--
------------------------------------------------
Pierre Rousselet <[email protected]>
------------------------------------------------

2002-02-03 13:59:16

by Alan

Subject: Re: 2.4.17 filesystem corruption

> this list to handle my SiS 735 chipset. It did seem more stable for a
> while, until I decided to try and enable ultra dma 66 on my primary
> drive. The two partitions that I had mounted got completely corrupted

How did you switch on UDMA66 ?

> hda: Western Digital Caviar WDC AC313000R (it is *not* in the udma
> black list, should it be?)

There is certainly no evidence it should be

> hdb: Western Digital Caviar WDC AC23200L (this one is in the black
> list, but is not being mounted, so it shouldn't matter, right?)

Unknown. But you can test that

> - Was there some change between 2.4.5 and 2.4.17 that could have
> introduced problems in the IDE layer? I really tried to test 2.4.5

For the SiS possibly.

2002-02-04 12:57:08

by Denis Vlasenko

Subject: Re: 2.4.17 filesystem corruption

On 3 February 2002 03:38, Luis A. Montes wrote:
> I have been experiencing filesystem corruption very frequently with
> 2.4.17. I've probably reinstalled my system more than 10 times in as
> many days. So far it seems to be related to the kernel version and
> perhaps to the UDMA settings. I haven't been able to crash the system
> running 2.4.5 or 2.2.19, but it has crashed with 2.4.17 every time,
> regardless of cpu optimization (Athlon, K6 or i386), AGP (built-in, as
> a module or not built), filesystem (ext2 or xfs). The last kernel I tried
> was 2.4.17 with an ext2 fs and a patch by Lionel Bouton that I found on
> this list to handle my SiS 735 chipset. It did seem more stable for a
> while, until I decided to try and enable ultra dma 66 on my primary
> drive. The two partitions that I had mounted got completely corrupted
> (on boot the kernel tried to mount them as a UMSDOS fs) and e2fsck
> wasn't able to fix them. It did seem to work with udma 33: I compiled
> the kernel without a problem as a test of disk IO, but I can't really
> tell for sure that there wasn't some subtle disk corruption just waiting
> to crop up.
>
> My system is as follows:
>
> ECS K7S5A motherboard with the SiS 735 chipset, 128MB of PC133 SDRAM
> and an Athlon XP 1700+ processor at 1.4-something GHz. Memory is good,
> tested with memtest86 overnight for several full passes.
>
> hda: Western Digital Caviar WDC AC313000R (it is *not* in the udma
> black list, should it be?)

Maybe. Your report might lead to that. Can you test with some hdd that is
known to work with UDMA66+ in another box?

> hdb: Western Digital Caviar WDC AC23200L (this one is in the black
> list, but is not being mounted, so it shouldn't matter, right?)

Trying to disconnect it and provoke fs corruption on a test partition
on the first drive sounds like a good idea...

> Software: Straight Slackware 8 install, with XFree86 from cvs. But
> lately I haven't even tried glx, dri et al at all ...
>
> Questions:
>
> - There is a patch by the IDE maintainer (Andre Hedrick?), but I don't
> know if it is supposed to make the system behave better or is a
> new major architectural change (if it is the latter I probably don't
> want to compound my problem, do I?). Although at this point I'm
> willing to try almost anything, even windows ;-)

Consider CC'ing Andre and Jens Axboe:
[email protected]
Jens Axboe <[email protected]>
--
vda

2002-02-09 08:46:50

by Luis A. Montes

Subject: Re: 2.4.17 filesystem corruption

On 2002.02.03 05:44 Alan Cox wrote:
> > this list to handle my SiS 735 chipset. It did seem more stable for a
> > while, until I decided to try and enable ultra dma 66 on my primary
> > drive. The two partitions that I had mounted got completely corrupted
>
> How did you switch on UDMA66 ?
Well, when you mentioned this I remembered a couple of things that I probably
shouldn't have done. That's why it took me so long to answer: I went back and
systematically tested different kernels on a test partition, taking note of
the hdparm settings I used. I've still got a couple of kernels I want to test,
so I will post more complete results tomorrow. Still, the answer seems to be
that 2.4.5 is stable while 2.4.17 is not.
But to answer your question, I downloaded a utility from WD that switches the
drive from udma 33 to udma 66, and I then boot linux and type
hdparm -c 1 -d 1 -m 8
The last time before I wrote I also used -X66, but I'm not sure that's a good
idea ...
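
On the -X66 doubt: as far as the hdparm man page describes it, the numeric -X argument is an offset plus the mode number (8 for PIO, 32 for multiword DMA, 64 for UltraDMA). Read that way, 66 selects UltraDMA mode 2, which is the udma 33 mode, and the udma 66 mode would be UltraDMA mode 4, i.e. 68. A small sketch of that arithmetic follows. The function and table names are made up for illustration and it does not touch any drive.

```python
# Offsets that hdparm adds to the mode number to form the -X argument.
XFER_BASE = {"pio": 8, "mdma": 32, "udma": 64}

# Common names for some UltraDMA mode numbers.
UDMA_NAMES = {2: "ATA/33", 4: "ATA/66", 5: "ATA/100"}


def xfer_value(kind, mode):
    """Return the numeric -X argument for a transfer kind and mode number."""
    return XFER_BASE[kind] + mode


def decode_xfer_value(value):
    """Split a numeric -X argument back into (kind, mode)."""
    for kind in ("udma", "mdma", "pio"):  # try the highest offset first
        base = XFER_BASE[kind]
        if value >= base:
            return kind, value - base
    raise ValueError("values below 8 are not covered by this sketch")


kind, mode = decode_xfer_value(66)
print(kind, mode, UDMA_NAMES.get(mode, "?"))  # prints: udma 2 ATA/33
print(xfer_value("udma", 4))                  # prints: 68
```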

>
> > hda: Western Digital Caviar WDC AC313000R (it is *not* in the udma
> > black list, should it be?)
>
> There is certainly no evidence it should be

>
> > hdb: Western Digital Caviar WDC AC23200L (this one is in the black
> > list, but is not being mounted, so it shouldn't matter, right?)
>
> Unknown. But you can test that
>
> > - Was there some change between 2.4.5 and 2.4.17 that could have
> > introduced problems in the IDE layer? I really tried to test 2.4.5
>
> For the SiS possibly.
>
There is something in vanilla 2.4.17. Using the exact same .config and
filesystem as for 2.4.5, it crashes while 2.4.5 remains stable.
OTOH, I've finally managed to have a stable system for a week (I'm
actually using it right now) with 2.4.17. The difference is that
it's got the sis5513.c patch from Lionel Bouton that I found on
lkml. I'm not using dma yet, though! The driver, like the previous
driver for this chipset, disables everything by default (but even
that didn't help for plain 2.4.17). Enabling dma is what I'm going to
test next, using the above mentioned hdparm line.

2002-02-09 19:08:05

by Luis A. Montes

Subject: Re: 2.4.17 filesystem corruption

Well, I have been testing different kernels and filesystems on my
system lately. To recap, my system is an ECS mobo with the SiS 735
chipset, Athlon CPU, pc133 memory and IDE caviar hd. The problem is
that the kernel 2.4.17 produces massive filesystem corruption. Things
I have been able to eliminate as sources of the problem:

- Memory/CPU: ran memtest86, several full passes, without a problem. I'm
not overclocking. It's also been stable with other kernels.

- CPU optimization in the kernel: Corruption has happened with i386, K6
and K7 optimization, and stable kernels have had K7 enabled without
problem.

- filesystem: I've tried XFS-only systems, mixtures of XFS and ext2 and
ext2-only systems with identical results.

- I've got a second hd plugged into the system, and I did wonder whether
it could be the problem as it happens to be on the black list, but
again, whether or not I have it connected doesn't seem to change things.

- My harddrive itself. It passes the Western Digital test, so there doesn't
seem to be anything physically wrong with it. And it's worked fine with
other kernels.

After more tests during this week, I think I have something close to a
smoking gun. I compiled a 2.4.5 and a 2.4.17 kernel with identical config
files. I ran 2.4.5 for about 24 hours with continuous i/o (compiling
kernels and XFree86), and the filesystem survived. I ran 2.4.17 and it
crashed within the first hour. In either case I didn't use hdparm to
change anything; the hdparm settings were as they are set by default,
everything off. Again keeping an identical config file, I compiled 2.4.17
changing only the drivers/ide/sis5513.c file as per Lionel Bouton's patch.
The system has been running for a week now with my normal load, compiling
lots of stuff. I still didn't want to turn on dma (this is my workstation, I
need to have it mostly up!). Then I tried the exact same kernel but with dma
enabled via hdparm -c 1 -d 1 -m 8 and ran the same test I did for the
other kernels, and it didn't crash (it ran fine for about 12 hours). I'll
probably test it longer before using it on my good partitions, but I'm
confident it will survive, since crashes usually occurred within the first
hour.

Conclusion: It seems to me that something within sis5513.c got broken
between 2.4.5 and 2.4.17 and was repaired by Bouton's patch.

Please let me know if I should do some more tests. I still have the
"sacrificial" partition around and I'm willing to test patches/whatever.
But it seems to me that Bouton's patch just fixed it.

Thanks to everybody who answered.