2001-12-22 22:03:27

by Christian Ohm

Subject: file corruption in 2.4.16/17


hi.

i've recently bought a new 80gb ide drive, and am now getting corrupted
files on it. i've made three partitions for linux on it, a small ext2 one as
root, and two larger ones with reiserfs as /usr and /var. the problem is
that now some files get randomly corrupted; they are the right size, but
contain some random garbage (searching the archive for this list just came
up with some issues around july / kernel 2.4.6), which makes the system
pretty much unusable.

my old setup with a 20gb ide drive and a 4.5gb scsi drive worked flawlessly
for at least a year with reiserfs, so this seems to be a problem with
reiserfs and large drives (i haven't found a corrupted file on the ext2
partition (yet)). my hardware is: a nmc (now enmic) 8tax+ mainboard with via
kt133 chipset (newest bios), a maxtor d540x-4k 80gb harddrive and a quantum
lct15 20gb harddrive. i used kernel 2.4.16 with the preemption patch, but
2.4.17 seems to have the same problem.

windows had a problem with the maxtor drive, too. i made a fat32 partition
and copied the files from the old drive under linux. worked perfectly, but
when reading the partition with windows, it showed a corrupted file system.
i had to install a special ide driver not included in the via 4in1 drivers
to read it correctly, but now it works without problems.

i'd be happy if there's a solution for this, as, like i said, the system now
is pretty much unusable.

bye
christian ohm

ps.: i'm not subscribed to this list, so please cc me on any replies to this
thread. thanks.


2001-12-22 23:07:51

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> Hmm. Can you be more precise on that "special ide driver"?

it's the one on 'http://www.viaarena.com/?PageID=2' called 'IDE miniport
driver v3.0.14'. neither the original win98se one nor the one in the via
4in1 drivers (4.36) worked correctly.

bye
christian ohm

2001-12-23 04:23:15

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> there is no problem with these disks or chipsets. have you checked your
> ide cable (*always* must be 18" or less, with *both* ends plugged in)?
> also, do you have the via-specific ide driver?

the ide cable is the one that came with the mainboard, it worked perfectly
for one year with the 20gb hd. and yes, i'm using the via-driver.

the strange thing about this is that it all worked perfectly before i added
the 80gb disk. and it corrupts files only on that disk.

anyway, i've recompiled 2.4.17 from a fresh source tree. until now, i
haven't discovered any corrupted files of which i _know_ that they have to
be corrupted since i used this kernel. so probably this was a problem of the
preemption patch and reiserfs and large disks and via chipsets, but i'm not
100% sure about this. the changelog for 2.4.17 mentioned some reiserfs
fixes; are any of those related to corrupted files?

bye
christian ohm

2001-12-23 06:49:24

by Andre Hedrick

Subject: Re: file corruption in 2.4.16/17


Christian,

Don't take the tone personally, it is directed at the primary kernel
maintainers.

Well I will promise you that it is not my driver! Additionally there has
been a private blanket test to authenticate it is doing the correct thing.
Now the legacy driver in the stock kernels has the ability to fail, but
very rarely.

So I suggest you use a corrected driver found at http://www.linuxdiskcert.org.

There will be an update for 2.4.17, but if you choose to use the legacy
driver and you have FSC, TOUGH! I have offered out a tested and domain
validated driver and nobody wants it.

Regards,

Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project


On Sat, 22 Dec 2001, Christian Ohm wrote:

>
> hi.
>
> i've recently bought a new 80gb ide drive, and am now getting corrupted
> files on it. i've made three partitions for linux on it, a small ext2 one as
> root, and two larger ones with reiserfs as /usr and /var. the problem is
> that now some files get randomly corrupted; they are the right size, but
> contain some random garbage (searching the archive for this list just came
> up with some issues around july / kernel 2.4.6), which makes the system
> pretty much unusable.
>
> my old setup with a 20gb ide drive and a 4.5gb scsi drive worked flawlessly
> for at least a year with reiserfs, so this seems to be a problem with
> reiserfs and large drives (i haven't found a corrupted file on the ext2
> partition (yet)). my hardware is: a nmc (now enmic) 8tax+ mainboard with via
> kt133 chipset (newest bios), a maxtor d540x-4k 80gb harddrive and a quantum
> lct15 20gb harddrive. i used kernel 2.4.16 with the preemption patch, but
> 2.4.17 seems to have the same problem.
>
> windows had a problem with the maxtor drive, too. i made a fat32 partition
> and copied the files from the old drive under linux. worked perfectly, but
> when reading the partition with windows, it showed a corrupted file system.
> i had to install a special ide driver not included in the via 4in1 drivers
> to read it correctly, but now it works without problems.
>
> i'd be happy if there's a solution for this, as, like i said, the system now
> is pretty much unusable.
>
> bye
> christian ohm
>
> ps.: i'm not subscribed to this list, so please cc me on any replies to this
> thread. thanks.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2001-12-23 13:29:45

by Alan

Subject: Re: file corruption in 2.4.16/17

> So I suggest you use a corrected driver found at http://www.linuxdiskcert.org.

I don't think it's remotely related to the IDE layer.

A number of VIA boards have hardware/bios problems that cause corruption
under certain very high PCI loads. Current 2.4 should set the registers
to do the work arounds needed on the various boards we know have the
problems as do the VIA 4in1 drivers for windows in most afflicted cases.

The description sounds extremely like that is the problem with this board.


Alan

2001-12-24 01:05:22

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> none of that is meaningful (genuine companies ship invalid cables,
> and the 20 could easily have had a lower max udma mode, etc.)

hmmm, i'll try a new one after christmas... the bios showed them both as
'udma 66'.

> are they both on the same channel? also, when you boot, do you see
> a message about the "athlon bug stomper"? (which actually corrects
> a bogus kt133/etal hostbridge setting.)

they are on the same channel. and i've never seen a message like this.

> I can't imagine any reason to use the preemption patch, since it
> completely corrupts all atomicity guarantees that the kernel has
> been coded on. there definitely is no problem with large disks and
> via (and 80G is certainly not a large disk any more.)

well, plain 2.4.17 corrupted files, too.

bye
christian ohm

2001-12-24 01:12:53

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> I don't think it's remotely related to the IDE layer.
>
> A number of VIA boards have hardware/bios problems that cause corruption
> under certain very high PCI loads. Current 2.4 should set the registers
> to do the work arounds needed on the various boards we know have the
> problems as do the VIA 4in1 drivers for windows in most afflicted cases.
>
> The description sounds extremely like that is the problem with this board.

hmmm, could be... i had a problem with ide before... i connected an old 4x
cdrom drive, and when accessing it, the sound slowed down (soundblaster
awe64 pnp, alsa driver). and most corruption could be caused during a high
pci load, so this could be a hardware problem. anything you might want to
know about the board?

bye
christian ohm

2001-12-24 09:20:07

by Hans Reiser

Subject: Re: file corruption in 2.4.16/17

ReiserFS does not have a problem with large hard drives. If you crash
while writing a file, you can damage it. Not sure if that is your
problem, but maybe.


Hans


2001-12-25 00:45:34

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> ReiserFS does not have a problem with large hard drives. If you crash
> while writing a file, you can damage it. Not sure if that is your
> problem, but maybe.

no, just using it normally, copying files from the old to the new drive,
running apt-get dist-upgrade etc., and then, some files contain garbage.
only on reiserfs partitions (i copied about 3gb reiserfs to reiserfs and
about 15gb fat32 to fat32, both under linux), and seemingly random... as i
said, the file system looks good (right filesize etc.), the files just
contain garbage. so i think the contents of the files gets corrupted, and
not the file system entries for them. as for the reason why, i have no idea
(obviously).

bye & merry christmas
christian ohm

2001-12-25 00:39:34

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> drives from different vendors often don't get along on the same channel;
> you should definitely try disconnecting the cdrom.

judging by the look of the drives, quantum was bought by maxtor (i think
they really did); they look nearly identical. but i'll try to connect them
to different channels and use a new cable. and the cdrom is a scsi drive...

> if you have a kt133/kt133a/kt266/kt266a, you should probably look for:
>
> PCI: Probing PCI hardware
> Trying to stomp on Athlon bug...
> Unknown bridge resource 0: assuming transparent
> PCI: Using IRQ router VIA [1106/0686] at 00:04.0
> Applying VIA southbridge workaround.

for me, it looks like this:

PCI: Probing PCI hardware
Unknown bridge resource 0: assuming transparent
PCI: Using IRQ router VIA [1106/0686] at 00:07.0
PCI: Disabling Via external APIC routing

> (the second and fifth lines are relevant, though it depends on which
> particular chips you have as to whether you should see them.)

nothing about athlons (i have a duron, though this shouldn't make a
difference)...
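
The comparison above (looking for the "stomp" and "southbridge workaround" lines in the boot messages) can be automated. A minimal sketch, assuming the boot log is available as text (for example, saved from dmesg); the two marker strings are taken from the quoted log, and the function name is made up for illustration:

```python
# Marker strings copied from the quoted boot log. Whether a given board
# should print them depends on the chipset, as the quoted text notes.
EXPECTED_MARKERS = (
    "Trying to stomp on Athlon bug",
    "Applying VIA southbridge workaround",
)


def find_workaround_markers(log_text, markers=EXPECTED_MARKERS):
    """Map each marker to True if at least one log line contains it."""
    lines = log_text.splitlines()
    return {marker: any(marker in line for line in lines) for marker in markers}


# The four lines reported in the message above, used as sample input.
SAMPLE_LOG = """\
PCI: Probing PCI hardware
Unknown bridge resource 0: assuming transparent
PCI: Using IRQ router VIA [1106/0686] at 00:07.0
PCI: Disabling Via external APIC routing
"""

for marker, seen in find_workaround_markers(SAMPLE_LOG).items():
    print(("found:   " if seen else "missing: ") + marker)
```

For the sample log both markers are reported as missing, which matches the observation that the kernel did not apply either workaround on this board.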

> rest assured that this is specific to your config. there are many,
> many perfect-functioning linux boxes out there, along with probably
> several tens of thousands running 2.4.17 already.

yeah, i know... and that's what makes this so difficult to track down...

it doesn't seem to be a hardware problem. i copied about 15gb from the old
to the new drive (with linux, on fat32 partitions). i diff'ed them, just to
be sure, and not a single file was corrupted... the only corruption i found
was on reiserfs partitions (though my ext2 partition is only 50mb). i ran
the maxtor power diagnostics, including a 6 hour burn in test, and it said
the drive works perfectly... i'm really out of ideas here...

bye
christian ohm

2001-12-25 10:25:32

by Hans Reiser

Subject: Re: file corruption in 2.4.16/17

So if I understand right, Andre Hedrick thinks it might be whatever
driver is in the 2.4.16 kernel?

(There are several ways to read his email.)

If you can reproduce it for 2.4.17 we will eagerly debug it.

Best,

Hans


2001-12-26 01:01:20

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> Force Athlon bug stomper to be executed at startup (look into pci-pc.c).
> Your kernel thinks you don't need it. It may be wrong.
> Please report back.

like i said, i don't really think this is hardware related, unless someone
convinces me of the opposite. anyway, first i'll try to reproduce it to be
sure 2.4.17 still behaves this way. then comes the fun part: searching for
the cause...

bye
christian ohm

2001-12-26 00:54:40

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> So if I understand right, Andre Hedrick thinks it might be whatever
> driver is in the 2.4.16 kernel?

if it is, it seems to be the way reiserfs uses it, and not a general issue
of the driver itself. i don't really think this is a hardware problem,
unless anyone can give me a convincing reason why it should/could be.

> If you can reproduce it for 2.4.17 we will eagerly debug it.

i'll try, though i'm not really eager to get my files corrupted. so i think
i'll just copy some files from the old to the new drive and diff them to see
if they get corrupted with a plain 2.4.17 kernel. if they do, any ideas how
to track it down further?
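
A minimal sketch of the copy-and-diff test described above, assuming the source tree and its copy are both mounted and readable; the function name is hypothetical. Since the corrupted files kept the right size, the comparison has to read file contents and cannot rely on size or timestamps:

```python
import filecmp
import os


def find_differing_files(src_root, dst_root):
    """Return sorted relative paths under src_root whose copy under
    dst_root is missing or differs in content."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            # shallow=False forces a byte-by-byte comparison instead of
            # trusting matching size and modification time.
            if not os.path.isfile(dst) or not filecmp.cmp(src, dst, shallow=False):
                bad.append(rel)
    return sorted(bad)
```

A comparison made right after the copy may be answered from cached data instead of the disk, so it is probably safer to unmount and remount the target (or reboot) before checking.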

bye
christian ohm

2001-12-26 06:20:50

by Oleg Drokin

Subject: Re: file corruption in 2.4.16/17

Hello!

On Wed, Dec 26, 2001 at 01:53:27AM +0100, Christian Ohm wrote:
> > So if I understand right, Andre Hedrick thinks it might be whatever
> > driver is in the 2.4.16 kernel?
> if it is, it seems to be the way reiserfs uses it, and not a general issue
> of the driver itself. i don't really think this is a hardware problem,
> unless anyone can give me a convincing reason why it should/could be.
Well, there were not so many reports like yours on this list recently,
which should mean something.

> > If you can reproduce it for 2.4.17 we will eagerly debug it.
> i'll try, though i'm not really eager to get my files corrupted. so i think
You can make a backup first.

> i'll just copy some files from the old to the new drive and diff them to see
> if they get corrupted with a plain 2.4.17 kernel. if they do, any ideas how
You've just described the way I tracked down memory problems several years ago ;)

> to track it down further?
Perhaps several samples of corrupted files (and their original versions),
also make sure that while this corruption occurs with reiserfs on the new drive, it
does not occur with a non-reiserfs filesystem on the new drive.
Also try to copy your files and see if you get randomly corrupted files or if the
list of corrupted files is the same every time.
Look into your log for any strange messages. (you might even recompile your
kernel with CONFIG_REISERFS_CHECK enabled to allow for more checks to be made)

Bye,
Oleg

2001-12-27 03:12:52

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

> > i'll just copy some files from the old to the new drive and diff them to see
> > if they get corrupted with a plain 2.4.17 kernel. if they do, any ideas how

well, tried that. got corrupted files.

> > to track it down further?
> Perhaps several samples of corrupted files (and their original versions),

i examined some of those files. results: it seems like some parts of the
files i copied get written to the wrong destination on the disk, since i
discovered something that looked like kernel source (one of the things i
copied) in one of those files. sometimes there's some of the original
content still at the end of the file.
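
To see which parts of a file were replaced (for instance, whether only the beginning is foreign data and the tail is still original), the original and the copy can be compared block by block. A minimal sketch; the 4096-byte block size is an assumption about the file system, and the function name is made up:

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Return the indices of fixed-size blocks at which two files differ.

    A block that exists in only one file counts as differing.
    """
    diffs = []
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break  # both files are exhausted
            if a != b:
                diffs.append(index)
            index += 1
    return diffs
```

If the differences cluster on whole blocks, that would fit the idea of blocks being written to the wrong place; differences scattered at odd offsets would point elsewhere.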

> also make sure that while this corruption occurs with reiserfs on the new drive, it
> does not occur with a non-reiserfs filesystem on the new drive.

as i said, i copied about 15gb to a fat32 partition and about 3gb to a
reiserfs partition. i diffed the fat32 and found not a single error. on the
reiserfs partition some of the present files got corrupted, and some of the
copied files, too. though i don't know if they got corrupted because of
other files that were copied or because some of their blocks ended up in
other files.

that was /var. on /usr (my second reiserfs partition), otoh, i haven't found
corrupted files yet.

> Also try to copy your files and see if you get randomly corrupted files or if the
> list of corrupted files is the same every time.

i copied the same files again, and got the same files (or at least files in
the same dir) corrupted again.
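
Whether the same files are hit on each run can be checked by comparing the lists of corrupted paths from two copy attempts. A small sketch with hypothetical names; the input lists would come from whatever comparison was used to find the damaged files:

```python
def compare_runs(first_run, second_run):
    """Split the corrupted paths from two runs into those seen in both
    runs and those seen in only one of them."""
    first, second = set(first_run), set(second_run)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }
```

A large "both" group suggests something tied to particular files or disk locations, while mostly disjoint lists would look more like random damage.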

> Look into your log for any strange messages. (you might even recompile your
> kernel with CONFIG_REISERFS_CHECK enabled to allow for more checks to be made)

after some copying, i discovered this nice message:
---
vs-4080: reiserfs_free_block: free_block (0306:801711)[dev:blocknr]: bit
already cleared
---

then i copied the same files again, overwriting the old copy. after a while,
the system hung with 100% cpu used, and i got lots of those on the console:
---
vs-3050: wait_buffer_until_released: nobody releases buffer (dev 03:06, size
4096 blocknr 688128, count 3, list 0, state 0x10019, page c17a0540,
(UPTODATE, CLEAN, UNLOCKED)). still waiting (-150000000) JDIRTY !JWAIT
---
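
Both messages name the same device: 0306 in the first and 03:06 in the second are major 3, minor 6, which on the first IDE interface is /dev/hda6. A minimal sketch of that decoding, assuming the 16-bit major/minor split used by 2.4 kernels and that the block number in the message is decimal; the function names and the regular expression are made up for illustration:

```python
import re

# Matches the "(0306:801711)[dev:blocknr]" part of the vs-4080 message.
FREE_BLOCK_RE = re.compile(r"\(([0-9a-fA-F]{4}):(\d+)\)\[dev:blocknr\]")


def decode_kdev(dev_hex):
    """Split a 16-bit device number such as '0306' into (major, minor)."""
    value = int(dev_hex, 16)
    return value >> 8, value & 0xFF


def ide0_device_name(major, minor):
    """Name a device on the first IDE interface (major 3), where
    minors 0-63 belong to hda and minors 64-127 to hdb."""
    if major != 3 or not 0 <= minor < 128:
        raise ValueError("only major 3 with minors 0-127 is handled here")
    disk = "hda" if minor < 64 else "hdb"
    partition = minor % 64
    return "/dev/" + disk + (str(partition) if partition else "")


def parse_free_block_message(line):
    """Return (major, minor, block number) from a vs-4080 line, or None."""
    match = FREE_BLOCK_RE.search(line)
    if match is None:
        return None
    major, minor = decode_kdev(match.group(1))
    return major, minor, int(match.group(2))
```

Applied to the first message, this gives major 3, minor 6 and block 801711, so both errors refer to the same partition.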

after a reboot, i tried it again. similar result, though i didn't see any
messages anymore. another reboot, and the file system seemed to be garbage
(wouldn't mount), yet after yet another reboot, i could mount it again, but
with lots of errors when accessing it. then i scrapped another partition,
created a new reiserfs one and copied the files over.

i think i'll create a fat32 partition where the old reiserfs one was, to see
if i get some errors there. if i do, this could be a hardware issue after
all...

bye
christian ohm

2001-12-27 11:08:52

by Hans Reiser

Subject: Re: file corruption in 2.4.16/17

It sounds like you get reiserfs corruptions easily, without crashing the
machine or anything unusual, and in that case you surely have hardware
problems. Please note in our FAQ the discussion of how reiserfs runs
hotter than ext2, and it is common for improperly cooled CPUs to work
well for ext2 and not reiserfs (tail combining heats the CPU).

Hans



2001-12-28 00:24:54

by Christian Ohm

Subject: Re: file corruption in 2.4.16/17

On Thu, Dec 27, 2001 at 02:06:27PM +0300, Hans Reiser wrote:
> It sounds like you get reiserfs corruptions easily, without crashing the
> machine or anything unusual, and in that case you surely have hardware
> problems. Please note in our FAQ the discussion of how reiserfs runs
> hotter than ext2, and it is common for improperly cooled CPUs to work
> well for ext2 and not reiserfs (tail combining heats the CPU).

not likely; it's a duron 700 and lm-sensors says < 37 degrees celsius with
open case (as it is now) and about 45 with closed case.

anyway, i've created a new fat32 partition where the old reiserfs one was,
copied lots of files to it and diffed them. no difference. i copied the same
files as earlier to the new reiserfs partition and diffed them. no
difference. no other corrupted files elsewhere, as far as i have seen now.
so whatever was the cause of this seems to have gone now (always the same
kernel, nothing changed with the hardware). perhaps the reiserfs file system
structure got corrupted somehow and thus caused this. don't know. if i have
nothing else to do, i'll create another reiserfs partition where the old one
was and try to corrupt some files.

i'll report back if i get corrupted files again. until then, thanks to
everyone trying to help me and sorry for taking your time.

bye
christian ohm

2001-12-28 01:35:22

by Hans Reiser

Subject: Re: file corruption in 2.4.16/17

Oh, this is frustrating. Ok, let me throw out my last lame possible
diagnosis and suggest that maybe in this case it was not bad CPU but bad
harddrive, and the drive has now remapped the sector so you can't get
the corruption again. Or it could be a corruption due to a software bug
that continued to affect subsequent writes. I apologize that we failed
to give you a good diagnosis on this one. If I remember right there was
a corruption-causing bug in 2.4.16 that wasn't due to reiserfs but
affected us (and unfortunately I can't recall at the moment what it was,
maybe somebody can.)

My apologies for wasting your time with poor diagnostics.

Hans