2003-08-12 03:58:08

by maney

Subject: 2.4.22-rc2 ext2 filesystem corruption


Okay, further testing is clearly indicated (and I'm recompiling a test
kernel while writing this to try to narrow it down a little), but I've
got a very repeatable file corruption under 2.4.22-rc2 that does not
manifest under 2.4.21. My repeatable test case only (so far?) causes
the data in the file to be corrupted, but I suspect metadata can get
hit as well, and I have seen some filesystem errors that were probably
caused by this, but not so that I can say so with certainty.

The recipe is simple: cp a large file across filesystems. All looks
well (md5sums match, etc.), but the whole file is still cached in memory.
But if you then unmount the destination filesystem to invalidate the
buffers, after mounting the file data will have changed. I'm pretty
certain that I have observed the same effect without the mass
invalidation of umount in a couple of cases, but I haven't replicated
that.
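
The recipe above can be sketched as a short Python script. This is only
a minimal sketch under stated assumptions, not the commands used in the
original test: the names md5_of, copy_and_recheck and drop_cache are
hypothetical, and drop_cache stands in for whatever invalidates the
cached copy (umount followed by mount in the report above).

```python
import hashlib
import shutil


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_recheck(src, dst, drop_cache):
    """Copy src to dst, then checksum dst before and after drop_cache().

    drop_cache is a caller-supplied hook (an assumption of this sketch).
    It must force the next read of dst to come from the disk, for
    example by unmounting and remounting the destination filesystem.
    """
    shutil.copyfile(src, dst)
    before = md5_of(dst)   # most likely served from memory
    drop_cache()
    after = md5_of(dst)    # should now be read back from the disk
    return md5_of(src), before, after
```

On an affected system the first two checksums would be expected to
agree while the third differs; on a healthy one all three match.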

In all cases I have investigated, the corruption seems to take the form
of four bytes of garbage at the beginning of a block; two small scripts
have been observed with 4 NULLs prepended and the last four characters
truncated. In another case I found a block of over 100 bytes (I got
tired of wading through it after a while) in the same form - four bytes
were inserted into the corrupted file, pushing the data back. In
hindsight I wish I had investigated that case further; as it is, I'm
not positive the dislocation was at a disk block boundary.

(I have one example I saved that appears NOT to begin at a block
boundary, with a dislocation that continues for at least 8KB (by spot
checking of cmp --verbose output).)
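
A check for this "four bytes inserted" pattern could look roughly like
the following. It is a sketch, not a tool used in the thread: the
function names are hypothetical, and the 4096-byte block size and the
64-byte comparison window are assumptions (ext2 can also use 1024- or
2048-byte blocks).

```python
def first_difference(good, bad):
    """Return the offset of the first differing byte, or None.

    Only the common length of the two byte strings is compared.
    """
    for offset, (a, b) in enumerate(zip(good, bad)):
        if a != b:
            return offset
    return None


def describe_shift(good, bad, shift=4, block_size=4096, window=64):
    """Report where the copies diverge and whether the bad copy looks
    like the good one with `shift` extra bytes inserted at that point."""
    offset = first_difference(good, bad)
    if offset is None:
        return None
    # If `shift` bytes were inserted at `offset`, the bad data just past
    # them should repeat the good data starting at `offset`.
    expected = good[offset:offset + window]
    actual = bad[offset + shift:offset + shift + window]
    n = min(len(expected), len(actual))
    return {
        "offset": offset,
        "on_block_boundary": offset % block_size == 0,
        "looks_shifted": n > 0 and expected[:n] == actual[:n],
    }
```

One caveat: if the inserted bytes happen to equal the original bytes at
that position, the reported offset lands slightly after the real start
of the dislocation.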

The machine is a PIII/850 on an Asus 440BX board with a Promise 20265
controller; the Seagate ST340016A is the only device connected to the
Promise's ports. There's 640MB of ECC'd memory on board, and I haven't
had an SBE reported on this system in a year or so (the last hardware
change was two or three months ago). (I disabled the ECC monitoring
module while verifying this problem; made no difference.)

The "large file" I've been using (because it was where I first observed
an issue) was the XFree86 4.2.1 source archive. At 54MB, it is less
than 1/10th the size of physical RAM.

--
There is nothing perhaps so generally consoling to a man as a
well-established grievance; a feeling of having been injured,
on which his mind can brood from hour to hour, allowing him
to plead his own cause in his own court, within his own heart,
and always to plead it successfully. -- Anthony Trollope


2003-08-12 06:20:12

by Alex Davis

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

Do you have any disks on the
standard IDE controllers and, if so,
does the FS corruption occur on those
disks?


2003-08-12 10:18:49

by Stephan von Krawczynski

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Mon, 11 Aug 2003 22:58:03 -0500
Martin Maney wrote:

>
> Okay, further testing is clearly indicated (and I'm recompiling a test
> kernel while writing this to try to narrow it down a little), but I've
> got a very repeatable file corruption under 2.4.22-rc2 that does not
> manifest under 2.4.21. My repeatable test case only (so far?) causes
> the data in the file to be corrupted, but I suspect metadata can get
> hit as well, and I have seen some filesystem errors that were probably
> caused by this, but not so that I can say so with certainty.

Did you do a long check of your system memory with memtest? Long meaning some
days...

Regards,
Stephan

2003-08-12 13:13:11

by Marcelo Tosatti

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption



On Mon, 11 Aug 2003, Martin Maney wrote:

> The recipe is simple: cp a large file across filesystems. All looks
> well (md5sums match, etc), but the file is all still present in memory.
> But if you then unmount the destination filesystem to invalidate the
> buffers, after mounting the file data will have changed. I'm pretty
> certain that I have observed the same effect without the mass
> invalidation of umount in a couple of cases, but I haven't replicated
> that.
>
> In all cases I have investigated, the corruption seems to take the form
> of four bytes of garbage at the beginning of a block; two small scripts
> have been observed with 4 NULLs prepended and the last four characters
> truncated. In another case I found a block of over 100 bytes (I got
> tired of wading through it after a while) in the same form - four bytes
> were inserted into the corrupted file, pushing the data back. In
> hindsight I wish I had investigated that case further; as it is, I'm
> not positive the dislocation was at a disk block boundary.

Martin,

Can you tell me exactly how I can try to reproduce the problem you're
seeing?

With just cp and unmount you can see the corruption?

2003-08-12 13:42:32

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Tue, Aug 12, 2003 at 10:12:19AM -0300, Marcelo Tosatti wrote:
> Can you tell me exactly how I can try to reproduce the problem you're
> seeing?
>
> With just cp and unmount you can see the corruption?

Yes. With the c. 50MB file it happens every time (now out of a couple
dozen tests). A 3MB file did not get corrupted in half a dozen trials,
including ones where both were copied before the umount.

The age & condition of the target filesystem don't seem to matter; at
least I have replicated this immediately following mke2fs of the
target. The original observed corruption was on much older and more
cluttered filesystems - the first sign of trouble was when a local
build of XFree failed.

In case I wasn't perfectly clear (it was late, so that may well be), I
used the umount/mount only to invalidate the buffers; merely syncing
after copying wouldn't produce any immediate effect. The copy always
looks good until the data has to be read back from the target
filesystem.

One other item which I didn't think to mention is that the compiler was
"gcc version 2.95.4 20011002" - Debian's normal compiler in the Woody
release. Of course that's been used for every other 2.4 kernel I've
built here as well.

--
the warfare on the cutting edge of any science draws attention
away from the huge uncontested background, the dull metal heft
of the axe that gives the cutting edge its power. -- Dennett

2003-08-12 14:08:06

by Marcelo Tosatti

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption



On Tue, 12 Aug 2003, Martin Maney wrote:

> On Tue, Aug 12, 2003 at 10:12:19AM -0300, Marcelo Tosatti wrote:
> > Can you tell me exactly how I can try to reproduce the problem you're
> > seeing?
> >
> > With just cp and unmount you can see the corruption?
>
> Yes. With the c. 50MB file it happens every time (now out of a couple
> dozen tests). A 3MB file did not get corrupted in half a dozen trials,
> including ones where both were copied before the umount.
>
> The age & condition of the target filesystem don't seem to matter; at
> least I have replicated this immediately following mke2fs of the
> target. The original observed corruption was on much older and more
> cluttered filesystems - the first sign of trouble was when a local
> build of XFree failed.
>
> In case I wasn't perfectly clear (it was late, so that may well be), I
> used the umount/mount only to invalidate the buffers; merely syncing
> after copying wouldn't produce any immediate effect. The copy always
> looks good until the data has to be read back from the target
> filesystem.
>
> One other item which I didn't think to mention is that the compiler was
> "gcc version 2.95.4 20011002" - Debian's normal compiler in the Woody
> release. Of course that's been used for every other 2.4 kernel I've
> built here as well.

I'll try to reproduce it here. In the meantime, can you try to isolate
the corruption? You said it didn't happen with 2.4.21 -- which -pre first
shows the problem?

2003-08-12 15:14:34

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Tue, Aug 12, 2003 at 11:10:51AM -0300, Marcelo Tosatti wrote:
> I'll try to reproduce it here. In the meantime, can you try to isolate
> the corruption? You said it didn't happen with 2.4.21 -- which -pre first
> shows the problem?

Yes, that's on my list. Unfortunately this has so far only been seen
on my workstation, and I have to get a bit of work done before I can
pursue this. At least I can get some candidates built in the
background.

I have tried a few things quickly with 22-rc2, and the short summary
is:

* noapic makes no difference (don't recall why I had UP APIC enabled)

* disabling DMA w/hdparm stops the corruption (all normal operation
and previous testing has been with DMA enabled)

* mounting w/ -o sync makes no difference except to slow things down

--
Show me your flowcharts and conceal your tables, and I shall continue
to be mystified. Show me your tables, and I won't usually need your
flowcharts; they'll be obvious. -- Brooks

2003-08-12 16:56:36

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Tue, Aug 12, 2003 at 11:10:51AM -0300, Marcelo Tosatti wrote:
> I'll try to reproduce it here. In the meantime, can you try to isolate
> the corruption? You said it didn't happen with 2.4.21 -- which -pre first
> shows the problem?

The problem appears only in rc2 (okay, assuming it's not a
regression). With 2.4.22-rc1 the file corruption I've been seeing does
not happen. From what Stephan has said I think I should try some more
varied tests. At this point I plan to do that a little later; I will
also try an rc2 with unnecessary features omitted from the build. So
far I've stayed with the base config, but it's a config shared by most
of the machines on the LAN and thus has plenty of extras.

--
Self-pity can make one weep, as can onions. -- Fodor

2003-08-12 17:07:30

by Marcelo Tosatti

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption



On Tue, 12 Aug 2003, Martin Maney wrote:

> On Tue, Aug 12, 2003 at 11:10:51AM -0300, Marcelo Tosatti wrote:
> > I'll try to reproduce it here. In the meantime, can you try to isolate
> > the corruption? You said it didn't happen with 2.4.21 -- which -pre first
> > shows the problem?
>
> The problem appears only in rc2 (okay, assuming it's not a
> regression). With 2.4.22-rc1 the file corruption I've been seeing does
> not happen. From what Stephan has said I think I should try some more
> varied tests. At this point I plan to do that a little later; I will
> also try an rc2 with unnecessary features omitted from the build. So
> far I've stayed with the base config, but it's a config shared by most
> of the machines on the LAN and thus has plenty of extras.

Well, rc2 had a Promise change. I'm not sure if it could be the cause, but
let's check.

Alan?

Please try -rc2 with the following patch unapplied (patch -R):



# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.1070 -> 1.1071
# drivers/ide/pci/pdc202xx_old.c 1.5 -> 1.6
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 03/08/08 [email protected] 1.1071
# [PATCH] PATCH: Promise cable
#
# The old driver used to check .id was NULL to detect drive absent
# (which is wrong but generally worked) with the IDE changes it always
# got it wrong. This fixes it to test .present instead.
#
# Without this fix it mistakenly assumes that the empty drive slot
# cannot do UDMA66/100/133
# --------------------------------------------
#
diff -Nru a/drivers/ide/pci/pdc202xx_old.c b/drivers/ide/pci/pdc202xx_old.c
--- a/drivers/ide/pci/pdc202xx_old.c Tue Aug 12 14:08:21 2003
+++ b/drivers/ide/pci/pdc202xx_old.c Tue Aug 12 14:08:21 2003
@@ -423,9 +423,9 @@
 		if (ultra_66) {
 			/*
 			 * check to make sure drive on same channel
-			 * is u66 capable
+			 * is u66 capable. Ignore empty slots.
 			 */
-			if (hwif->drives[!(drive->dn%2)].id) {
+			if (hwif->drives[!(drive->dn%2)].present) {
 				if (hwif->drives[!(drive->dn%2)].id->dma_ultra & 0x0078) {
 					hwif->OUTB(CLKSPD | mask, (hwif->dma_master + 0x11));
 				} else {

2003-08-12 16:56:17

by Marcelo Tosatti

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption



On Tue, 12 Aug 2003, Martin Maney wrote:

> On Tue, Aug 12, 2003 at 11:10:51AM -0300, Marcelo Tosatti wrote:
> > I'll try to reproduce it here. In the meantime, can you try to isolate
> > the corruption? You said it didn't happen with 2.4.21 -- which -pre first
> > shows the problem?
>
> Yes, that's on my list. Unfortunately this has so far only been seen
> on my workstation, and I have to get a bit of work done before I can
> pursue this. At least I can get some candidates built in the
> background.
>
> I have tried a few things quickly with 22-rc2, and the short summary
> is:
>
> * noapic makes no difference (don't recall why I had UP APIC enabled)
>
> * disabling DMA w/hdparm stops the corruption (all normal operation
> and previous testing has been with DMA enabled)

Okay, so it's probably the Promise driver.

Alan, have you ever heard of corruption with the newer Promise driver?

2003-08-12 17:15:25

by Alan

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Maw, 2003-08-12 at 17:58, Marcelo Tosatti wrote:
> Okay, so it's probably the Promise driver.
>
> Alan, have you ever heard of corruption with the newer Promise driver?

Other than people who've had Promise problems for aeons - no.

2003-08-12 21:38:37

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Tue, Aug 12, 2003 at 02:09:53PM -0300, Marcelo Tosatti wrote:
> Well, rc2 had a Promise change. I'm not sure if it could be the cause, but
> let's check.

> Please try -rc2 with the following patch unapplied (patch -R):

Oops, I overlooked the change. Tried it with a relatively
stripped-down 22-rc2 build (slimmed the vmlinuz down by about 100K),
but that made no difference. I popped a CMD648-based card in, disabled
the on-board Promise chip, and it booted right up and works fine with
22-rc2. So if the .id -> .present is the only change that affected the
Promise driver (I did some looking for obvious, but gave up after
realizing that unless the change actually had a /* borks Promise IDE
controllers*/ in it I wouldn't be likely to recognize it), then I guess
that's it.

> # [PATCH] PATCH: Promise cable
> #
> # The old driver used to check .id was NULL to detect drive absent
> # (which is wrong but generally worked) with the IDE changes it always
> # got it wrong. This fixes it to test .present instead.
> #
> # Without this fix it mistakenly assumes that the empty drive slot
> # cannot do UDMA66/100/133

Does this really mean that the Promise has been running at only 33MHz
all along, and that with this fix it stopped choking the speed and
that's the cause of the problem? Back when I first set up this drive
on the Promise (over a year ago - I'm pretty sure I was running
2.2.latest back then and had to jumper the drive to get around the 64K
cylinders problem) I know I saw transfer speeds greater than 33MB/s
reported by hdparm -T.

Okay, for completeness I should back out that change and retest it with
the Promise, and I'll try to remember to throw a quick throughput test
in just to see what it's been doing to me. ;-)

BTW, yes, I am (and have been) using an 80-conductor cable with this drive.

--
Faced with the choice between changing one's mind and proving there is
no need to do so, almost everyone gets busy on the proof. -- JKG

2003-08-13 11:37:24

by Stephan von Krawczynski

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On 13 Aug 2003 12:28:31 +0100
Alan Cox wrote:

> On Maw, 2003-08-12 at 22:36, Martin Maney wrote:
> > but that made no difference. I popped a CMD648-based card in, disabled
> > the on-board Promise chip, and it booted right up and works fine with
> > 22-rc2. So if the .id -> .present is the only change that affected the
> > Promise driver (I did some looking for obvious, but gave up after
> > realizing that unless the change actually had a /* borks Promise IDE
> > controllers*/ in it I wouldn't be likely to recognize it), then I guess
> > that's it.
>
> That change simply turns
>
> speed = random()?33:66 (but never > drive allows)
>
> to
> speed = correct value
>
> in the pdc202xx_old driver. There are many things it can trigger but I
> cannot conceive how it can be wrong itself. And not fixing it leaves it
> definitely wrong

Maybe try another controller of the same type to verify whether it's a general
problem or tied to this specific piece of hardware. Could be an awful hw timing
bug only showing up at full DMA speed...

Regards,
Stephan

2003-08-13 11:29:28

by Alan

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Maw, 2003-08-12 at 22:36, Martin Maney wrote:
> but that made no difference. I popped a CMD648-based card in, disabled
> the on-board Promise chip, and it booted right up and works fine with
> 22-rc2. So if the .id -> .present is the only change that affected the
> Promise driver (I did some looking for obvious, but gave up after
> realizing that unless the change actually had a /* borks Promise IDE
> controllers*/ in it I wouldn't be likely to recognize it), then I guess
> that's it.

That change simply turns

speed = random()?33:66 (but never > drive allows)

to
speed = correct value

in the pdc202xx_old driver. There are many things it can trigger but I
cannot conceive how it can be wrong itself. And not fixing it leaves it
definitely wrong
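
The effect described above can be modelled in a few lines of Python.
This is a toy model, not driver code, and it rests on two assumptions
that the thread does not confirm: that after the IDE changes an empty
slot still carries an identify structure with no UDMA capability bits
set, and that the parts of the function not shown in the patch hunk
clear the fast clock when the neighbour is not capable and set it when
the drive is alone. Slot and high_clock_enabled are hypothetical names,
and the model is deterministic where the old test was effectively
unpredictable.

```python
from dataclasses import dataclass

# Bits 3-6 of the IDENTIFY dma_ultra word: UDMA modes 3 to 6 supported.
U66_MASK = 0x0078


@dataclass
class Slot:
    present: bool        # a drive is actually attached
    has_id: bool         # the driver holds an identify structure
    dma_ultra: int = 0   # supported-UDMA-modes word, 0 for an empty slot


def high_clock_enabled(other, use_present):
    """Decide whether the fast clock is enabled for a UDMA66+ drive.

    `other` is the second slot on the same channel. use_present=False
    models the old `.id` test, use_present=True the patched `.present`
    test.
    """
    occupied = other.present if use_present else other.has_id
    if occupied:
        # A second drive shares the channel: it must be capable too.
        return bool(other.dma_ultra & U66_MASK)
    # The drive is alone on the channel: nothing holds the clock back.
    return True
```

Under these assumptions, an empty neighbouring slot keeps the clock low
with the old test and lets it go high with the patched one, which
matches the behaviour reported for rc1 versus rc2.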


2003-08-13 14:56:18

by Marcelo Tosatti

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption



On Tue, 12 Aug 2003, Martin Maney wrote:

> Okay, for completeness I should back out that change and retest it with
> the Promise, and I'll try to remember to throw a quick throughput test
> in just to see what it's been doing to me. ;-)

Please back out just that change and see if you still get the
corruption.

That way we can be sure.

2003-08-13 18:13:52

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Wed, Aug 13, 2003 at 11:55:52AM -0300, Marcelo Tosatti wrote:
> Please back out just that change and see if you still get the
> corruption.
>
> That way we can be sure.

At this point the outcome was pretty much a foregone conclusion, but
yep, reverting to ".id" stopped the corruption for this test case. As
Alan said, it "fixed" it only because that incorrect test happens to
force the driver to use the lower DMA speed. I had been about to
report on that when your request for the explicit test arrived, but in
short it's that rc1 (and earlier) were disabling the "66" clock speed,
while rc2 was, correctly, finding no reason not to enable it. The real
bug, be it hardware or software, is that enabling the higher speed
causes the corruption.

I suppose the obvious bandaid would be to add a config option or yet
another /proc/something kluge to let Promise chips be throttled on
purpose, rather than fortuitously. For my own use, I think I'm just
going to reconfigure to avoid the Promise controller on this machine.
I would be willing, in principle, to try any proposed fixes, but for a
while longer I would flinch at trying any untested code that I didn't
feel I understood. Later on this hardware ought to be more available
for testing, at least until it gets repurposed again.

I do have one casual question, if someone should have the answer. The
driver only talks about a 66MHz high speed; does that mean that the
20265 never gets run at its full speed under Linux, or is it just old
terminology from back when UDMA66 was the top speed?

--
An education that does not teach clear, coherent writing
cannot provide our world with thoughtful adults; it gives us instead,
at the best, clever children of all ages. -- Richard Mitchell

2003-08-13 19:40:56

by Alan

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Mer, 2003-08-13 at 19:13, Martin Maney wrote:
> At this point the outcome was pretty much a foregone conclusion, but
> yep, reverting to ".id" stopped the corruption for this test case. As
> Alan said, it "fixed" it only because that incorrect test happens to
> force the driver to use the lower DMA speed.

Ok

> I suppose the obvious bandaid would be to add a config option or yet
> another /proc/something kluge to let Promise chips be throttled on
> purpose, rather than fortuitously. For my own use, I think I'm just
> going to reconfigure to avoid the Promise controller on this machine.

I think the real thing is to find the bug. I guess pdc202xx_old.c needs
an audit at this point.

> I do have one casual question, if someone should have the answer. The
> driver only talks about a 66MHz high speed; does that mean that the
> 20265 never gets run at its full speed under Linux, or is it just old
> terminology from back when UDMA66 was the top speed?

The latter.

2003-08-13 22:58:12

by Nerijus Baliūnas

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Wed, 13 Aug 2003 13:13:30 -0500 Martin Maney wrote:

> At this point the outcome was pretty much a foregone conclusion, but
> yep, reverting to ".id" stopped the corruption for this test case. As
> Alan said, it "fixed" it only because that incorrect test happens to
> force the driver to use the lower DMA speed. I had been about to
> report on that when your request for the explicit test arrived, but in
> short it's that rc1 (and earlier) were disabling the "66" clock speed,
> while rc2 was, correctly, finding no reason not to enable it. The real
> bug, be it hardware or software, is that enabling the higher speed
> causes the corruption.

Do you have the latest Promise BIOS? If not, does it still happen with
the latest one?

Regards,
Nerijus

2003-08-16 06:35:25

by maney

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Wed, Aug 13, 2003 at 08:40:13PM +0100, Alan Cox wrote:
> I think the real thing is to find the bug. I guess pdc202xx_old.c needs
> an audit at this point.

I've got no problem with that, of course. I guess I was thinking
pessimistic thoughts about finding a fix for this in time for .22.

> > driver only talks about a 66MHz high speed; does that mean that the
> > 20265 never gets run at its full speed under Linux, or is it just old
> > terminology from back when UDMA66 was the top speed?
>
> The latter.

So what I'm seeing is a failure at 100MHz operation. Is there any way
to put the Promise into 66MHz mode (other than using a drive that runs
no faster)? Because at this point I don't have any practical way to
rule out the possibility that the cable/drive are what's marginal at
100MHz; aside from the Promise, I don't have anything faster than a
UDMA66 card (which works fine with them).

I also managed to find a useful non-broken link on ASUS's web site and
found that they had snuck out a BIOS update later than the latest I had
previously known of, and it did include a minor change in the Promise
BIOS (from 2.01 build 19 to build 35). Made no difference, though: the
50MB copy still failed as usual with 22-rc2.

--
The Internet discourages reflection and deep thought. It
encourages just glossing over, as quick as possible. The
Internet is a terrific way to look up facts and a terrible
way to get a story. -- Clifford Stoll

2003-08-17 00:09:38

by Mike Fedyk

Subject: Re: 2.4.22-rc2 ext2 filesystem corruption

On Sat, Aug 16, 2003 at 01:35:12AM -0500, Martin Maney wrote:
> So what I'm seeing is a failure at 100MHz operation. Is there any way
> to put the Promise into 66MHz mode (other than using a drive that runs
> no faster)? Because at this point I don't have any practical way to
> rule out the possibility that the cable/drive are what's marginal at
> 100MHz; aside from the Promise, I don't have anything faster than a
> UDMA66 card (which works fine with them).

Just so you know: IDE cards are rated in MB/s, not MHz. That's
megabytes per second.
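
For reference, the nominal peak rates of the Ultra DMA modes can be
written down as a small Python table. The rates are the standard
figures for each mode; the helper name seconds_to_transfer is
hypothetical, and the result is a best-case number that ignores seek
time and protocol overhead.

```python
# Nominal peak transfer rate of each Ultra DMA mode, in MB/s.
UDMA_MB_PER_S = {
    0: 16.7,
    1: 25.0,
    2: 33.3,   # "UDMA33"
    3: 44.4,
    4: 66.7,   # "UDMA66"
    5: 100.0,  # "UDMA100"
    6: 133.0,  # "UDMA133"
}


def seconds_to_transfer(megabytes, mode):
    """Best-case time to move `megabytes` at a mode's nominal rate."""
    return megabytes / UDMA_MB_PER_S[mode]
```

For the 54MB test file, seconds_to_transfer(54, 5) gives the best-case
time at the UDMA100 rate and seconds_to_transfer(54, 4) the time at the
UDMA66 rate.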