LinuxLists.cc - corruption

2000-11-29 04:39:21

by Andries E. Brouwer

[permalink] [raw]

Subject: corruption

I did again a large test comparing two identical trees.
Found again corruption, and, upon inspection, the disk
files did not differ - this is in-core corruption only.

A few days ago:

diff -r /c2/linux/linux-2.4.0-test10/linux/include/asm-sparc/ecc.h /g1/linux/li\
nux-2.4.0-test10/linux/include/asm-sparc/ecc.h
80,83c80,95
< #define ECC_FADDR0_CACHE 0x00000800
< #define ECC_FADDR0_SIZE 0x00000700
< #define ECC_FADDR0_TYPE 0x000000f0
< #define ECC_FADDR0_PADDR 0x0000000f
---
> #define ECC_FADDR0_Ccount << RATIO_SCALE_LOG;
> if (db->bytes_out != 0)
> {
> new_ratio /= db->bytes_out;
> }
>
> if (new_ratio < db->ratio || new_ratio < 1 * RATIO_SCALE)
> {
> bsd_clear (db);
> return 1;
> }
> db->ratio = new_ratio;
> }
> }
> return 0;
> }

Here the corruption starts precisely 3072 bytes into the file
(which lives on a filesystem with 1024-byte blocks).

But the tail is a fragment of drivers/isdn/isdn_bsdcomp.c
starting at an offset of 7168 bytes.

The former lives on blocks 6373895 6373896 6373897 6373898 6373899,
the other on blocks 2475568...2475579,2475616...2475628.

Today:

diff -r /g1/linux/linux-2.4.0-test11vanilla/linux/net/sched/sch_cbq.c /c2/linux\
/linux-2.4.0-test11vanilla/linux/net/sched/sch_cbq.c
2000c2000,2115
< cbq_destr\201XM^@\202XM^@^@^@^@^@^@^@^@^@^@^@...

(lots of nulls)

with corruption starting at an offset of 47104=46*1024 bytes.
Don't know where the corruption part is from.

Andries

2000-11-29 05:40:16

by Linus Torvalds

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000 [email protected] wrote:
>
> I did again a large test comparing two identical trees.
> Found again corruption, and, upon inspection, the disk
> files did not differ - this is in-core corruption only.

Ok. It definitely looks like the 1kB thing has become broken somehow.

The fact that it is in-core only doesn't mean that much - it could still
easily be just problems at read-time, and if you have an IDE disk I would
strongly suggest you try out the patch that Jens Axboe posted,
re-initializing the "head" pointer when doing a re-merge.

That said, the VM/ext2 angle should definitely be looked at too. Nothing
has really changed there in some time - can you give a rough estimate on
when you suspect you started seeing it? Ie is it new to one of the test11
pre-kernels, or does it happen so occasionally that you can't tell whether
it happened much earlier too?

Linus

2000-11-29 09:39:22

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Tue, 28 Nov 2000, Linus Torvalds wrote:

> The fact that it is in-core only doesn't mean that much - it could still
> easily be just problems at read-time, and if you have an IDE disk I would
> strongly suggest you try out the patch that Jens Axboe posted,
> re-initializing the "head" pointer when doing a re-merge.
>
> That said, the VM/ext2 angle should definitely be looked at too. Nothing
> has really changed there in some time - can you give a rough estimate on
> when you suspect you started seeing it? Ie is it new to one of the test11
> pre-kernels, or does it happen so occasionally that you can't tell whether
> it happened much earlier too?

Problem fixed by Jens' patch had been there since March, so if it's a
mix of __make_request() screwing up and something else... Urgh.

I'ld really like to see details on the box with ext2 corruption on SCSI.
Tigran, IIRC you had it on SCSI boxen, right? Could you send me relevant
part of logs?

2000-11-29 09:49:48

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Alexander Viro wrote:
>
> I'ld really like to see details on the box with ext2 corruption on SCSI.
> Tigran, IIRC you had it on SCSI boxen, right? Could you send me relevant
> part of logs?
>

I definitely did have this very corruption on a 4xXeon SCSI-only box. But
the bad news is that I reinstalled redhat7 on it immediately after this
happened so I don't have the logs. _However_, I don't need that particular
root filesystem there anymore (since more disks arrive today and I'm
rearranging stuff) so I'll try and corrupt it for you right now. Using
test12-pre3, unless you have better suggestions on what to do to help.

Regards,
Tigran

2000-11-29 09:56:51

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:

> On Wed, 29 Nov 2000, Alexander Viro wrote:
> >
> > I'ld really like to see details on the box with ext2 corruption on SCSI.
> > Tigran, IIRC you had it on SCSI boxen, right? Could you send me relevant
> > part of logs?
> >
>
> I definitely did have this very corruption on a 4xXeon SCSI-only box. But

"This" as in "range of blocks duplicated onto another range", "random
crap in indirect blocks" or both?

> the bad news is that I reinstalled redhat7 on it immediately after this
> happened so I don't have the logs. _However_, I don't need that particular
> root filesystem there anymore (since more disks arrive today and I'm
> rearranging stuff) so I'll try and corrupt it for you right now. Using
> test12-pre3, unless you have better suggestions on what to do to help.

Could you look for duplicates too?
TIA,
Al

2000-11-29 11:21:31

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Alexander Viro wrote:
> Could you look for duplicates too?

will do. One useful finding so far -- trying simultaneous mke2fs /dev/sdX1
for X = {b,c,d,e,f} deadlocks the machine dead (and without kdb such death
was in vain). (each disk is 37G, RAM is 6G)

I know this is offtopic for this thread but not for this list.

I am continuing to pursue the corruption. Nothing yet.

Regards,
Tigran

2000-11-29 11:47:35

by Andries E. Brouwer

[permalink] [raw]

Subject: Re: corruption

> can you give a rough estimate on when you suspect you started seeing it?

I reported both cases. That is, I started seeing it a few days ago.
(But there is no problem during daily work. Also for example a
diff between two kernel trees never gave corruption so far.
It was only with diff between trees several GB in size, and I
don't do that very often.)

Andries

2000-11-29 14:15:00

by Andries E. Brouwer

[permalink] [raw]

Subject: Re: corruption

I just tried 2.4.0test12pre3, which has Jens' fix,
and no corruption to be seen. Will test a bit more,
but perhaps this did it.

Andries

2000-11-29 14:39:14

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000 [email protected] wrote:

> I just tried 2.4.0test12pre3, which has Jens' fix,
> and no corruption to be seen. Will test a bit more,
> but perhaps this did it.
>

I have also been testing very hard on the SMP (4xXeon/6G) machine with
test12-pre3 and also cannot reproduce the problem. This is a SCSI-only
machine and I don't know what Jens' fix is and whether it is applicable or
not.

Regards,
Tigran

2000-11-29 14:47:15

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:

> On Wed, 29 Nov 2000 [email protected] wrote:
>
> > I just tried 2.4.0test12pre3, which has Jens' fix,
> > and no corruption to be seen. Will test a bit more,
> > but perhaps this did it.
> >
>
> I have also been testing very hard on the SMP (4xXeon/6G) machine with
> test12-pre3 and also cannot reproduce the problem. This is a SCSI-only
> machine and I don't know what Jens' fix is and whether it is applicable or
> not.

Change in the __make_request() and no, it shouldn't change the behaviour on
SCSI.

2000-11-29 14:58:05

[permalink] [raw]

Subject: Re: corruption

On Wed, Nov 29 2000, Tigran Aivazian wrote:
> On Wed, 29 Nov 2000 [email protected] wrote:
>
> > I just tried 2.4.0test12pre3, which has Jens' fix,
> > and no corruption to be seen. Will test a bit more,
> > but perhaps this did it.
> >
>
> I have also been testing very hard on the SMP (4xXeon/6G) machine with
> test12-pre3 and also cannot reproduce the problem. This is a SCSI-only
> machine and I don't know what Jens' fix is and whether it is applicable or
> not.

No, the fix could only really make a difference on IDE. So it can't
possibly account for all the corruption issues reported, but I'm hoping
at least some of them... The fix was posted in the other corruption
thread, and it's in test12-pre3 too.

--
* Jens Axboe <[email protected]>
* SuSE Labs

2000-11-29 18:19:20

by Linus Torvalds

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000 [email protected] wrote:
>
> > can you give a rough estimate on when you suspect you started seeing it?
>
> I reported both cases. That is, I started seeing it a few days ago.

I wasn't trying to imply that you hadn't reported them well.

It's just that I was born with a highly developed case of Altzheimers, and
I have trouble keeping details around in my head for more than about five
minutes.

I'm half serious, btw. It's not that I don't have a good memory, but I
tend to remember patterns and how things work, and I'm _really_ bad at
keeping track of details. This is why I absolutely depend on people like
Alan Cox etc who maintain lists of problems, and who are good at gathering
reports on what kinds of machines see it etc. I just suck at it. I really
do.

Anyway, it tentatively sounds like it might have been request corruption
by the new re-merge code. It fits the details, you having IDE and all. I
see that you can't at least easily reproduce it in pre3 any more, but if
it turns out later that you still can, please holler. Loudly.

That still leaves the SCSI corruption, which could not have been due to
the request issue. What's the pattern there for people?

Linus

2000-11-29 18:25:50

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Linus Torvalds wrote:
> That still leaves the SCSI corruption, which could not have been due to
> the request issue. What's the pattern there for people?

Linus, I confess that at the time (when I reproduced this problem on my
SCSI-only 4way/6G machine) I did not realize the importance of observing
the pattern or even just saving the log. No, I was _not_ just being stupid
but rather it was _so_ easy to panic Linux at the time (for various
reasons) that this one looked like just "yet another panic" somewhere.

Now, I am trying hard (lots of kernel compiles, bonnies, diff -urN between
linux trees, cp -a linuxA linuxB etc etc) to reproduce it and I can't.

All I remember from memory was those messages about "freeing stuff not in
datazone" etc. They were the same messages as I had on an IDE system and
the same as Mohammad and others reported on the list recently.

Regards,
Tigran

2000-11-29 18:37:52

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:

> On Wed, 29 Nov 2000, Linus Torvalds wrote:
> > That still leaves the SCSI corruption, which could not have been due to
> > the request issue. What's the pattern there for people?

one more thing I remember when this happened:

a) lots of ld processes from kernel compilation were failing with ENOSPC
although df(1) was showing plenty of memory and I could manually "touch
ok" in the same filesystem just fine.

b) immediately restarting "make -j4 bzImage" would go on for quite a bit
and then hit the same set of .c files and "run out of space" again.

Regards,
Tigran

2000-11-29 18:38:32

by Zdenek Kabelac

[permalink] [raw]

Subject: Re: corruption

Linus Torvalds wrote:
>
> On Wed, 29 Nov 2000 [email protected] wrote:
> >
> > > can you give a rough estimate on when you suspect you started seeing it?
> >
> > I reported both cases. That is, I started seeing it a few days ago.

I'm seeing this kind of corruption during the tar zxf.
$ tar zxf linux-2.4.0-test11.tar.gz

gzip: stdin: invalid compressed data--format violated
tar: Unexpected EOF in archive
tar: Child returned status 1
tar: Error exit delayed from previous errors

Currently running kernel is 2.4.0-test11-ac4 and the file is correct.
After the reboot there is no problem with uncompressing this file.
It possible that this problem is fixed with test12-pre3, but in
case its not I'm reporting this now (this already happend to me at least
three times with this kernel - also there are no messages in the kernel
log.

--
There are three types of people in the world:
those who can count, and those who can't.
Zdenek Kabelac http://i.am/kabi/ [email protected] {debian.org; fi.muni.cz}

2000-11-29 18:43:12

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:
> although df(1) was showing plenty of memory and I could manually "touch
~~~~~~
space, I meant. *blush*

Tigran

2000-11-29 18:48:12

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:

> On Wed, 29 Nov 2000, Tigran Aivazian wrote:
>
> > On Wed, 29 Nov 2000, Linus Torvalds wrote:
> > > That still leaves the SCSI corruption, which could not have been due to
> > > the request issue. What's the pattern there for people?
>
> one more thing I remember when this happened:
>
> a) lots of ld processes from kernel compilation were failing with ENOSPC
> although df(1) was showing plenty of memory and I could manually "touch
> ok" in the same filesystem just fine.

Consistent with bitmap corruption - counters are out of sync with the
block/inode bitmaps.

2000-11-29 19:09:45

by Linus Torvalds

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Tigran Aivazian wrote:

> On Wed, 29 Nov 2000, Tigran Aivazian wrote:
>
> > On Wed, 29 Nov 2000, Linus Torvalds wrote:
> > > That still leaves the SCSI corruption, which could not have been due to
> > > the request issue. What's the pattern there for people?
>
> one more thing I remember when this happened:
>
> a) lots of ld processes from kernel compilation were failing with ENOSPC
> although df(1) was showing plenty of memory and I could manually "touch
> ok" in the same filesystem just fine.
>
> b) immediately restarting "make -j4 bzImage" would go on for quite a bit
> and then hit the same set of .c files and "run out of space" again.

Ehh, this is a stupid question, but I've had that happen too, and it
turned out my /tmp filesystem was full, and it runs out of space only with
certain large link cases (never anything else, because all the other
stages of compilation are done with -pipe and do not use /tmp files).

I'm embarrassed to even mention this, but I'v ebeen confused myself.

Linus

2000-11-29 19:16:47

by Tigran Aivazian

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Linus Torvalds wrote:
> Ehh, this is a stupid question, but I've had that happen too, and it
> turned out my /tmp filesystem was full, and it runs out of space only with
> certain large link cases (never anything else, because all the other
> stages of compilation are done with -pipe and do not use /tmp files).
>
> I'm embarrassed to even mention this, but I'v ebeen confused myself.
>

No, I did check my /tmp, I can assure you. Well, no, I did _not_ but I do
not have a separate /tmp. I like huge root filesystems (until recently,
when I realized that I most often corrupt/work in /usr/src so it's now
separate and there are multiple roots for disaster recovery) so checking
that root was OK implied /tmp was OK.

Also, running out of space in /tmp can hardly cause a bunch of ext2
messages about freeing blocks not in datazone etc.

Regards,
Tigran

PS. I do have another filesystem which is separate from root, i.e. /boot,
again shared amongst all roots but I doubt ld(1) is storing anything in
/boot :)

2000-11-29 19:27:28

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: corruption

On Wed, Nov 29, 2000 at 04:08:26AM -0500, Alexander Viro wrote:
> Problem fixed by Jens' patch had been there since March, so if it's a

No, it's there only since Jens fixed the request merging bug in test11 or so.

With previous kernel the head pointer couldn't change so that change
was unnecessary and initializing it outside the critical section was
a micro scalability optimization :).

Andrea

2000-11-29 19:36:59

by Rik van Riel

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Andrea Arcangeli wrote:
> On Wed, Nov 29, 2000 at 04:08:26AM -0500, Alexander Viro wrote:
> > Problem fixed by Jens' patch had been there since March, so if it's a
>
> No, it's there only since Jens fixed the request merging bug in
> test11 or so.
>
> With previous kernel the head pointer couldn't change so that
> change was unnecessary and initializing it outside the critical
> section was a micro scalability optimization :).

To be honest, I have a big problem with micro optimisations
that prevent the big optimisations from happening.

Would it be an idea to explicitly comment such dangerous
micro optimisations so people implementing the big optimisations
later on won't run into nasty surprises?

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-11-29 19:56:59

by Linus Torvalds

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Alexander Viro wrote:
>
> Problem fixed by Jens' patch had been there since March, so if it's a
> mix of __make_request() screwing up and something else... Urgh.

No, the bug really got introduced in test11 due to the request merging
stuff.

The patch may _look_ like it fixed a generic problem that has been there
forever, but we didn't actually need the spinlock for initializing "head"
at all. It's initialized to a constant offset within the unchaning request
queue, so we can happily do it outside the spinlock.

The reason the initialization was moved inside the spinlock was really
just that it had to be re-initialized for the case where we re-did the
merge, so it had to be moved down to inside the loop - and it just happens
to happen inside the spinlock now.

So the spinlock protection was never relevant to the bug - forgetting to
re-initialize a variable when a straight-line code was turned into a loop
was the bug.

> I'ld really like to see details on the box with ext2 corruption on SCSI.
> Tigran, IIRC you had it on SCSI boxen, right? Could you send me relevant
> part of logs?

I suspect that Tigran may have seen other instability (of which we had
lots back when he saw it), and that the current rash is for the IDE
problem only.

Which is not to say that there might not be SCSI issues or other issues
too, but I'm also not convinced that the SCSI thing might not just be a
red herring at this point.

Linus

2000-11-29 19:58:39

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: corruption

On Wed, Nov 29, 2000 at 05:05:20PM -0200, Rik van Riel wrote:
> To be honest, I have a big problem with micro optimisations
> that prevent the big optimisations from happening.
>
> Would it be an idea to explicitly comment such dangerous
> micro optimisations so people implementing the big optimisations
> later on won't run into nasty surprises?

Did you read the code we're talking about?

Andrea

2000-11-29 20:27:59

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Linus Torvalds wrote:

>
>
> On Wed, 29 Nov 2000, Alexander Viro wrote:
> >
> > Problem fixed by Jens' patch had been there since March, so if it's a
> > mix of __make_request() screwing up and something else... Urgh.
>
> No, the bug really got introduced in test11 due to the request merging
> stuff.
>
> The patch may _look_ like it fixed a generic problem that has been there
> forever, but we didn't actually need the spinlock for initializing "head"

Sure.

> at all. It's initialized to a constant offset within the unchaning request
> queue, so we can happily do it outside the spinlock.

Actually, I was not thinking about spinlock. What I missed was the fact
that again: was quite recent. My apologies...

> > I'ld really like to see details on the box with ext2 corruption on SCSI.
> > Tigran, IIRC you had it on SCSI boxen, right? Could you send me relevant
> > part of logs?
>
> I suspect that Tigran may have seen other instability (of which we had
> lots back when he saw it), and that the current rash is for the IDE
> problem only.
>
> Which is not to say that there might not be SCSI issues or other issues
> too, but I'm also not convinced that the SCSI thing might not just be a
> red herring at this point.

There are two quite distinct patterns: duplicated range vs. crap in metadata.
The former looks like a bug caught by Jens. The latter (especially in
bitmaps) seems to be older[1] and independent from elevator stuff. _That_ may
be a fs/buffer.c or fs/ext2/* bug. The former definitely lives below the
fs/buffer.c level.

[1] "older" may mean "shared with 2.2" here - ISTR bug reports looking like
that and IIRC they were never resolved. BTW, if you know some searchable
l-k archive... DN sucks coprolites through the straw these days ;-/

2000-11-29 20:33:49

by Rik van Riel

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000, Andrea Arcangeli wrote:
> On Wed, Nov 29, 2000 at 05:05:20PM -0200, Rik van Riel wrote:
> > To be honest, I have a big problem with micro optimisations
> > that prevent the big optimisations from happening.
> >
> > Would it be an idea to explicitly comment such dangerous
> > micro optimisations so people implementing the big optimisations
> > later on won't run into nasty surprises?
>
> Did you read the code we're talking about?

This particular piece of code may be a bad example of
my "complaint", but I guess we can just as easily take
something like shrink_mmap() as our example ...

regards,

Rik
--
Hollywood goes for world dumbination,
Trailer at 11.

http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/

2000-11-29 21:07:20

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: corruption

On Wed, Nov 29, 2000 at 02:57:11PM -0500, Alexander Viro wrote:
> that again: was quite recent. My apologies...

Never mind, strict patch reading was obviously misleading in this case.

> [1] "older" may mean "shared with 2.2" here - ISTR bug reports looking like
> that and IIRC they were never resolved. [..]

I don't recall any report after 2.2.1x series but I might have missed them.
Previous reports may be due the bugs that got fixed there.

Andrea

2000-11-29 21:41:10

by Andreas Dilger

[permalink] [raw]

Subject: Re: searchable l-k archive

Al Viro writes:
> BTW, if you know some searchable l-k archive...
> DN sucks coprolites through the straw these days ;-/

There are archives of pretty much all of the linux- and oss-development
related mailing lists at:

http://marc.theaimsgroup.com/

For l-k in particular, I use
http://marc.theaimsgroup.com/?l=linux-kernel&r=1&w=4

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2000-11-29 22:25:35

by Andries E. Brouwer

[permalink] [raw]

Subject: Re: corruption

> ISTR bug reports looking like that and IIRC they were never resolved.

Have you looked at the report by Daniel Phillips?

http://marc.theaimsgroup.com/?l=linux-kernel&m=95162877201890&w=2

2000-11-29 22:49:39

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Wed, 29 Nov 2000 [email protected] wrote:

> > ISTR bug reports looking like that and IIRC they were never resolved.
>
> Have you looked at the report by Daniel Phillips?

Yes. The problem is real, but the fix... I'm doing a cleanup there and
I'll post the patch when I'll give it some testing.

2000-11-30 14:49:18

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

In thread "File corruption part deux", Lawrence Walton wrote:
>
> my system has been acting slightly odd on all the pre 12 kernels
> with the fs going read only with out any messages until now.
> no opps or anything like that, but I did get this just now.
>
> EXT2-fs error (device sd(8,2)): ext2_readdir:
> bad entry in directory #458430: directory entry
> across blocks - offset=152, inode=3393794200,
> rec_len=12440, name_len=73
>

3393794200 == 0xca493098. A kernel address. And 152 is 0x98,
which is equal to N * 0x20 + 0x18. Read on...

I am somewhat reluctant to report this problem because I always
run kernels with the lowish latency patch, but having reviewed
the effects of that patch on fs/*.c I don't think it's to blame.
Plus it's been 100% stable for months.

I believe that the problem I've observed is caused by or exposed
by the O_SYNC changes. Or maybe not.

Running test11-ac4 on *very* vanilla machines. x86 UP, IDE, 3c905
and really nothing else. No APM, fat, vfat, isofs, USB, audio, etc.
It has happened on two different machines which have been 100% reliable
for a year.

The problem is corruption of in-core files. It has only started
happening in the past few days. It happened after two days uptime.
In the most recent case my /bin/ls went bad. I took a copy and
rebooted. After reboot /bin/ls had a correct MD5 sum. Here's
the diff:

--- ls.good Thu Nov 30 15:07:11 2000
+++ ls.bad Thu Nov 30 15:07:04 2000
@@ -1589,7 +1589,7 @@
006340: C7 85 F8 BF FF FF 00 00 00 00 E9 EA 02 00 00 90 >@@@@@@@@@@@@@@@@<
006350: 8B BD FC BF FF FF 8D B5 00 E0 FF FF 57 68 00 20 >@@@@@@@@@@@@Wh@ <
006360: 00 00 56 E8 3C B2 FF FF 83 C4 0C 85 C0 0F 84 DD >@@V@<@@@@@@@@@@@<
-006370: 02 00 00 6A 0A 56 E8 49 B0 FF FF 83 C4 08 85 C0 >@@@j@V@I@@@@@@@@<
+006370: 02 00 00 6A 0A 56 E8 49 78 73 62 C6 78 73 62 C6 >@@@j@V@Ixsb@xsb@<
006380: 75 2E 8D 9D 00 C0 FF FF 8B BD FC BF FF FF 57 68 >u.@@@@@@@@@@@@Wh<
006390: 00 20 00 00 53 E8 0A B2 FF FF 83 C4 0C 85 C0 74 >@ @@S@@@@@@@@@@t<
0063A0: 0F 6A 0A 53 E8 1B B0 FF FF 83 C4 08 85 C0 74 D8 >@j@S@@@@@@@@@@t@<
@@ -1709,7 +1709,7 @@
006AC0: 00 00 00 FF 75 DF 83 E8 03 40 40 2B 44 24 58 83 >@@@@u@@@@@@+D$X@<
006AD0: C0 02 89 44 24 14 EB 08 C7 44 24 14 01 00 00 00 >@@@D$@@@@D$@@@@@<
006AE0: 8B 4C 24 3C F6 C1 01 74 5B 8B 44 24 5C 8B 74 24 >@L$<@@@t[@D$\@t$<
-006AF0: 14 89 C2 83 E0 03 74 16 7A 0F 83 F8 02 74 05 38 >@@@@@@t@z@@@@t@8<
+006AF0: 14 89 C2 83 E0 03 74 16 F8 7A 62 C6 F8 7A 62 C6 >@@@@@@t@@zb@@zb@<
006B00: 22 74 2F 42 38 22 74 2A 42 38 22 74 25 42 8B 02 >"t/B8"t*B8"t%B@@<
006B10: 84 E0 75 08 84 C0 74 1A 84 E4 74 15 A9 00 00 FF >@@u@@@t@@@t@@@@@<
006B20: 00 74 0D 83 C2 04 A9 00 00 00 FF 75 E1 83 EA 03 >@t@@@@@@@@@u@@@@<
@@ -1733,7 +1733,7 @@
006C40: 4C 24 54 40 51 50 E8 C9 A7 FF FF 83 C4 08 83 7C >L$T@QP@@@@@@@@@|<
006C50: 24 1C 00 74 38 C6 00 2C 8B 5C 24 3C 40 8B 4C 24 >$@@t8@@,@\$<@@L$<
006C60: 3C 83 E3 01 F6 C1 02 74 0E 8B 74 24 58 56 50 E8 ><@@@@@@t@@t$XVP@<
-006C70: A0 A7 FF FF 83 C4 08 85 DB 74 12 C6 00 5F 8B 4C >@@@@@@@@@t@@@_@L<
+006C70: A0 A7 FF FF 83 C4 08 85 78 7C 62 C6 78 7C 62 C6 >@@@@@@@@x|b@x|b@<
006C80: 24 5C 40 51 50 E8 8A A7 FF FF 83 C4 08 C6 00 2F >$\@QP@@@@@@@@@@/<
006C90: 31 FF 8B 74 24 60 40 56 50 E8 76 A7 FF FF 83 C4 >1@@t$`@VP@v@@@@@<
006CA0: 08 8B 4C 24 30 8B 29 85 ED 74 31 90 8D 74 26 00 >@@L$0@)@@t1@@t&@<

Note that in both my cases (and, apparently, Lawrence's) the
corrupted data consists of two identical kernel addresses which
have the value

N * 0x20 + 0x18

and they are always equal. And they occur at a file offset
of

N * 0x20 + 0x18

Which leads one to believe that someone somewhere is doing
an init_list_head() on a wild pointer.

Or, more likely, someone is doing a list_del() on a list_head
which points at recycled memory, and that list_head resides
within a structure at offset 0x18.

And that description perfectly matches the new i_dirty_buffers
field in struct inode.

Which would perhaps indicate that one of the following statements:

- the list_del in buffer_insert_inode_queue() or

- the list_del in __remove_inode_queue()

- the list_del in fsync_inode_buffers()

has gotten itself a wild pointer.

Other possible candidates apart from i_dirty_buffers which
have a list_head at offset 0x18 and whose list_dels should
be reviewed are:

- request_queue.elevator.queue
- dentry.d_hash
- anything which has a timer_list at offset 0x18
- anything which has a waitqueue at offset 0x14

There may be others which have list_heads at 0x38, 0x58, ...

This doesn't just happen a single time. The first time it happened
during a CVS commit at least eight files on the server ended up with
this corruption, as did /usr/lib/netscape/netscape-communicator,
so we had a whole bunch of corruptions happening in a short
period of time.

It takes a very bad kernel bug to be able to crash netscape.

Anyway, something to be thinking about. I've written the
canonical list_head debugging code. I'll run that overnight
and finish it off tomorrow.

2000-11-30 19:24:12

by Jonathan Hudson

[permalink] [raw]

Subject: Re: corruption

In article <[email protected]>,
Andrew Morton <[email protected]> writes:
AM> In thread "File corruption part deux", Lawrence Walton wrote:
>>
>> my system has been acting slightly odd on all the pre 12 kernels
>> with the fs going read only with out any messages until now.
>> no opps or anything like that, but I did get this just now.
>>
>> EXT2-fs error (device sd(8,2)): ext2_readdir:
>> bad entry in directory #458430: directory entry
>> across blocks - offset=152, inode=3393794200,
>> rec_len=12440, name_len=73
>>
AM>
AM> 3393794200 == 0xca493098. A kernel address. And 152 is 0x98,
AM> which is equal to N * 0x20 + 0x18. Read on...
AM>

Don't know what these do for your analysis, observed on
2.4.0test12pre2, compiling mozilla.

EXT2-fs error (device ide0(3,11)):
ext2_readdir: bad entry in directory #409870: directory entry across blocks
- offset=88, inode=3284439128, rec_len=36952, name_len=196

EXT2-fs error (device ide0(3,11)):
ext2_add_entry: bad entry in directory #344273: rec_len % 4 != 0 - offset=0,
inode=1769234798, rec_len=28271, name_len=85

Recompiling it with 2.4.0test12pre3 last night did not cause any fs
problems, at least that I've noticed.

2000-11-30 20:06:10

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Thu, 30 Nov 2000, Jonathan Hudson wrote:

>
> In article <[email protected]>,
> Andrew Morton <[email protected]> writes:
> AM> In thread "File corruption part deux", Lawrence Walton wrote:
> >>
> >> my system has been acting slightly odd on all the pre 12 kernels
> >> with the fs going read only with out any messages until now.
> >> no opps or anything like that, but I did get this just now.
> >>
> >> EXT2-fs error (device sd(8,2)): ext2_readdir:
> >> bad entry in directory #458430: directory entry
> >> across blocks - offset=152, inode=3393794200,
> >> rec_len=12440, name_len=73
> >>
> AM>
> AM> 3393794200 == 0xca493098. A kernel address. And 152 is 0x98,
> AM> which is equal to N * 0x20 + 0x18. Read on...
> AM>
>
> Don't know what these do for your analysis, observed on
> 2.4.0test12pre2, compiling mozilla.
>
> EXT2-fs error (device ide0(3,11)):
> ext2_readdir: bad entry in directory #409870: directory entry across blocks
> - offset=88, inode=3284439128, rec_len=36952, name_len=196

offset 0x58, data: 58 90 c4 c3 58 90 c4

Confirms. That's definitely an empty list_head at address 0xc3c49058 and -pre2
has O_SYNC patches.

> EXT2-fs error (device ide0(3,11)):
> ext2_add_entry: bad entry in directory #344273: rec_len % 4 != 0 - offset=0,
> inode=1769234798, rec_len=28271, name_len=85

data: 6e 61 74 69 6f 6e 55, i.e. "nationU"

... and that one looks like a duplicated blocks effect, but there may be a lot
of other reasons for that.

> Recompiling it with 2.4.0test12pre3 last night did not cause any fs
> problems, at least that I've noticed.

2000-11-30 22:03:47

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Alexander Viro wrote:
>
> Confirms. That's definitely an empty list_head at address 0xc3c49058 and -pre2
> has O_SYNC patches.

foo. The overnight run wedged tight in mmap002. No progress.

I bet this'll catch it:

--- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
+++ include/linux/list.h Fri Dec 1 08:33:55 2000
@@ -90,6 +90,7 @@
static __inline__ void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
+ entry->next = entry->prev = 0;
}

/**

First person to send a ksymoops trace gets a cookie.

2000-12-01 01:29:04

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Andrew Morton wrote:
>
> I bet this'll catch it:
>
> --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> @@ -90,6 +90,7 @@
> static __inline__ void list_del(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> + entry->next = entry->prev = 0;
> }
>
> /**
>
> First person to send a ksymoops trace gets a cookie.

mmmm... choc-chip.

With the above patch applied the machine crashed after an hour. Crashed
a second time during the e2fsck. gdb backtrace:

(gdb) l *0xc01755d4
0xc01755d4 is in __make_request (ll_rw_blk.c:744).
739 * skip first entry, for devices with active queue head
740 */
741 if (q->head_active && !q->plugged)
742 head = head->next;
743
744 if (list_empty(head)) {
745 q->plug_device_fn(q, bh->b_rdev); /* is atomic */
746 goto get_rq;
747 }
748
(gdb) l *0xc0175c3d
0xc0175c3d is in generic_make_request (ll_rw_blk.c:882).
877 buffer_IO_error(bh);
878 break;
879 }
880
881 }
882 while (q->make_request_fn(q, rw, bh));
883 }
884
885 /* This function can be used to request a number of buffers from a block
886 device. Currently the only restriction is that all buffers must belong to
(gdb) l *0xc0175da1
0xc0175da1 is in ll_rw_block (ll_rw_blk.c:924).
919 printk(KERN_NOTICE "Can't write to read-only device %s\n",
920 kdevname(bhs[0]->b_dev));
921 goto sorry;
922 }
923
924 for (i = 0; i < nr; i++) {
925 bh = bhs[i];
926
927 /* Only one thread can actually submit the I/O. */
928 if (test_and_set_bit(BH_Lock, &bh->b_state))
(gdb) l *0xc01309e9
0xc01309e9 is in sync_buffers (/uhome/morton/akpm/linux/include/asm/atomic.h:65).
60 :"=m" (v->counter)
61 :"m" (v->counter));
62 }
63
64 static __inline__ void atomic_dec(atomic_t *v)
65 {
66 __asm__ __volatile__(
67 LOCK "decl %0"
68 :"=m" (v->counter)
69 :"m" (v->counter));
(gdb) l *0xc0130af7
0xc0130af7 is in fsync_dev (/uhome/morton/akpm/linux/include/asm/current.h:9).
4 struct task_struct;
5
6 static inline struct task_struct * get_current(void)
7 {
8 struct task_struct *current;
9 __asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
10 return current;
11 }
12
13 #define current get_current()
(gdb) l *0xc0137339
0xc0137339 is in block_fsync (block_dev.c:353).
348 * since the vma has no handle.
349 */
350
351 static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
352 {
353 return fsync_dev(dentry->d_inode->i_rdev);
354 }
355
356 /*
357 * bdev cache handling - shamelessly stolen from inode.c
(gdb) l *0xc0130c76
0xc0130c76 is in sys_fsync (buffer.c:373).
368 if (!file->f_op || !file->f_op->fsync)
369 goto out_putf;
370
371 /* We need to protect against concurrent writers.. */
372 down(&inode->i_sem);
373 err = file->f_op->fsync(file, dentry, 0);
374 up(&inode->i_sem);
375
376 out_putf:
377 fput(file);

2000-12-01 12:49:47

[permalink] [raw]

Subject: Re: corruption

On Fri, Dec 01 2000, Andrew Morton wrote:
> Andrew Morton wrote:
> >
> > I bet this'll catch it:
> >
> > --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> > +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> > @@ -90,6 +90,7 @@
> > static __inline__ void list_del(struct list_head *entry)
> > {
> > __list_del(entry->prev, entry->next);
> > + entry->next = entry->prev = 0;
> > }
> >
> > /**
> >
> > First person to send a ksymoops trace gets a cookie.
>
> mmmm... choc-chip.
>
> With the above patch applied the machine crashed after an hour. Crashed
> a second time during the e2fsck. gdb backtrace:

Very interesting. IDE / SCSI?

--
* Jens Axboe <[email protected]>
* SuSE Labs

2000-12-01 12:50:46

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Andrew Morton wrote:
>
> Andrew Morton wrote:
> >
> > I bet this'll catch it:
> >
> > --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> > +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> > @@ -90,6 +90,7 @@
> > static __inline__ void list_del(struct list_head *entry)
> > {
> > __list_del(entry->prev, entry->next);
> > + entry->next = entry->prev = 0;
> > }
> >
> > /**
> >
> > First person to send a ksymoops trace gets a cookie.
>
> mmmm... choc-chip.
>
> With the above patch applied the machine crashed after an hour. Crashed
> a second time during the e2fsck. gdb backtrace:
>

This sync_buffers corruption is fixed by Jens' patch, which is present
in test12-pre3.

However it is possible that the original list_head-based corruption which
I reported will not be fixed by this patch. This is because none of the
structure offsets match up with what I observed, and because Lawrence's
sytem is "SCSI-only" - no SCSI drivers are headactive.

Lawrence, did you see this problem with test12-pre3?

2000-12-01 13:01:20

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Jens Axboe wrote:
>
> On Fri, Dec 01 2000, Andrew Morton wrote:
> > Andrew Morton wrote:
> > >
> > > I bet this'll catch it:
> > >
> > > --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> > > +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> > > @@ -90,6 +90,7 @@
> > > static __inline__ void list_del(struct list_head *entry)
> > > {
> > > __list_del(entry->prev, entry->next);
> > > + entry->next = entry->prev = 0;
> > > }
> > >
> > > /**
> > >
> > > First person to send a ksymoops trace gets a cookie.
> >
> > mmmm... choc-chip.
> >
> > With the above patch applied the machine crashed after an hour. Crashed
> > a second time during the e2fsck. gdb backtrace:
>
> Very interesting. IDE / SCSI?

hmm.. Overlapping emails.

The crash with e2fsck was easily repeatable with the above patch. Just
dirty a few buffers and run /sbin/sync. It's due to the __make_request
queue_head thing which you fixed in test12-pre3. Yes, this was IDE.

However the original problem of a list_del being performed on a wild
pointer is being seen on SCSI systems. I expect the above patch will
catch it if it's still happening.

2000-12-01 13:08:41

[permalink] [raw]

Subject: Re: corruption

On Fri, Dec 01 2000, Andrew Morton wrote:
> > > mmmm... choc-chip.
> > >
> > > With the above patch applied the machine crashed after an hour. Crashed
> > > a second time during the e2fsck. gdb backtrace:
> >
> > Very interesting. IDE / SCSI?
>
> hmm.. Overlapping emails.
>
> The crash with e2fsck was easily repeatable with the above patch. Just
> dirty a few buffers and run /sbin/sync. It's due to the __make_request
> queue_head thing which you fixed in test12-pre3. Yes, this was IDE.

Ah ok, I thought this was on test12-pre3.

> However the original problem of a list_del being performed on a wild
> pointer is being seen on SCSI systems. I expect the above patch will
> catch it if it's still happening.

Indeed, and I don't think it's request queue_head related anymore. I
will look forward to seeing a trace, though :-)

--
* Jens Axboe <[email protected]>
* SuSE Labs

2000-12-01 14:49:19

by Stephen C. Tweedie

[permalink] [raw]

Subject: Re: corruption

Hi,

On Fri, Dec 01, 2000 at 08:35:41AM +1100, Andrew Morton wrote:
>
> I bet this'll catch it:
>
> static __inline__ void list_del(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> + entry->next = entry->prev = 0;
> }

No, because the buffer hash list is never referenced unless
buffer->b_inode is non-null, so we can't ever do a double-list_del on
the buffer.

The patch below should fix it. It has been sent to Linus. The
important part is the first hunk of the inode.c diff.

Cheers,
Stephen

fsync-fix2.diff:
--- fs/buffer.c.~1~ Wed Nov 29 15:16:43 2000
+++ fs/buffer.c Fri Dec 1 00:41:28 2000
@@ -871,10 +871,11 @@
else {
bh->b_inode = &tmp;
list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
- atomic_inc(&bh->b_count);
if (buffer_dirty(bh)) {
+ atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
ll_rw_block(WRITE, 1, &bh);
+ brelse(bh);
spin_lock(&lru_list_lock);
}
}
@@ -883,6 +884,7 @@
while (!list_empty(&tmp.i_dirty_buffers)) {
bh = BH_ENTRY(tmp.i_dirty_buffers.prev);
remove_inode_queue(bh);
+ atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
@@ -929,9 +931,9 @@
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
wait_on_buffer(bh);
- brelse(bh);
if (!buffer_uptodate(bh))
err = -EIO;
+ brelse(bh);
spin_lock(&lru_list_lock);
goto repeat;
}
--- fs/inode.c.~1~ Wed Nov 29 15:16:43 2000
+++ fs/inode.c Fri Dec 1 00:40:26 2000
@@ -77,7 +77,13 @@

#define alloc_inode() \
((struct inode *) kmem_cache_alloc(inode_cachep, SLAB_KERNEL))
-#define destroy_inode(inode) kmem_cache_free(inode_cachep, (inode))
+static void destroy_inode(struct inode *inode)
+{
+ if (!list_empty(&inode->i_dirty_buffers))
+ BUG();
+ kmem_cache_free(inode_cachep, (inode));
+}
+

/*
* These are initializations that only need to be done
@@ -348,6 +354,12 @@

void clear_inode(struct inode *inode)
{
+ if (!list_empty(&inode->i_dirty_buffers)) {
+ if (inode->i_nlink)
+ BUG();
+ invalidate_inode_buffers(inode);
+ }
+
if (inode->i_data.nrpages)
BUG();
if (!(inode->i_state & I_FREEING))
@@ -407,6 +419,7 @@
inode = list_entry(tmp, struct inode, i_list);
if (inode->i_sb != sb)
continue;
+ invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
list_del(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_hash);

2000-12-01 15:34:37

by Lawrence Walton

[permalink] [raw]

Subject: Re: corruption

Andrew Morton [[email protected]] wrote:
> Andrew Morton wrote:
> >
> > Andrew Morton wrote:
> > >
> > > I bet this'll catch it:
> > >
> > > --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> > > +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> > > @@ -90,6 +90,7 @@
> > > static __inline__ void list_del(struct list_head *entry)
> > > {
> > > __list_del(entry->prev, entry->next);
> > > + entry->next = entry->prev = 0;
> > > }
> > >
> > > /**
> > >
> > > First person to send a ksymoops trace gets a cookie.
> >
> > mmmm... choc-chip.
> >
> > With the above patch applied the machine crashed after an hour. Crashed
> > a second time during the e2fsck. gdb backtrace:
> >
>
>
> This sync_buffers corruption is fixed by Jens' patch, which is present
> in test12-pre3.
>
> However it is possible that the original list_head-based corruption which
> I reported will not be fixed by this patch. This is because none of the
> structure offsets match up with what I observed, and because Lawrence's
> system is "SCSI-only" - no SCSI drivers are headactive.
>
> Lawrence, did you see this problem with test12-pre3?

Yes all of my current problems are with pre3;
Is there anything you would like me to test
specific?
The best way to cause problems so far is
compiling mozilla.
--
*--* Mail: [email protected]
*--* Voice: 425.739.4247
*--* Fax: 425.827.9577
*--* HTTP://www.otak-k.com/~lawrence/
--------------------------------------
- - - - - - O t a k i n c . - - - - -

2000-12-01 18:01:15

[permalink] [raw]

Subject: Re: corruption

On Fri, 1 Dec 2000, Andrew Morton wrote:

> Alexander Viro wrote:
> >
> > Confirms. That's definitely an empty list_head at address 0xc3c49058 and -pre2
> > has O_SYNC patches.
>
> foo. The overnight run wedged tight in mmap002. No progress.
>
> I bet this'll catch it:
>
> --- include/linux/list.h.orig Fri Dec 1 08:33:36 2000
> +++ include/linux/list.h Fri Dec 1 08:33:55 2000
> @@ -90,6 +90,7 @@
> static __inline__ void list_del(struct list_head *entry)
> {
> __list_del(entry->prev, entry->next);
> + entry->next = entry->prev = 0;
> }
>

Or just call list_del_init instead...

2000-12-01 23:55:42

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Fri, Dec 01, 2000 at 08:35:41AM +1100, Andrew Morton wrote:
> >
> > I bet this'll catch it:
> >
> > static __inline__ void list_del(struct list_head *entry)
> > {
> > __list_del(entry->prev, entry->next);
> > + entry->next = entry->prev = 0;
> > }
>
> No, because the buffer hash list is never referenced unless
> buffer->b_inode is non-null, so we can't ever do a double-list_del on
> the buffer.

You are right. An overnight diskthrash with the above patch didn't
oops, but it turned up three instances of the EXT2 directory warnings.

Incidentally, there's something wrong in blockdev/VM land. The
overnight run consisted of:

- one looping instance of `dbench 4 ; sleep 120'
- one looping instance of 'lmbench ; sleep 120'
- one looping instance of `bonnie++ ; sleep 120'
- one looping instance of `mmap001;mmap002;misc001;sleep 120'

Things which went wrong (after ten hours):

- the first dbench run never completed.

- the first bonnie++ run never completed.

- I then killed everything with ALT-SYSRQ-T. It's been 20 minutes
now and the disk LED is *still* hard on. This machine has 256 megs
and the hdparm disk bandwidth is 20 megs/sec.

You can observe the latter problem pretty easily by running `misc001' on
its own. Kill it after 20 seconds and the disk remains active for *ages*.
Usually ninety seconds. Long enough to write all physical memory out
ten times...

> The patch below should fix it. It has been sent to Linus. The
> important part is the first hunk of the inode.c diff.

Okay, will test. (25 minutes now. It's fsck time...)

2000-12-02 01:01:10

[permalink] [raw]

Subject: Re: corruption

Hi, Andrew,

Andrew Morton writes:
> - I then killed everything with ALT-SYSRQ-T. It's been 20 minutes
> now and the disk LED is *still* hard on. This machine has 256 megs
> and the hdparm disk bandwidth is 20 megs/sec.
>
> You can observe the latter problem pretty easily by running `misc001' on
> its own. Kill it after 20 seconds and the disk remains active for *ages*.
> Usually ninety seconds. Long enough to write all physical memory out
> ten times...

If the benchmark writes blocks completely random, the random writing
performance should be used instead of the bandwidth. It is tipically
30blk/s to 200blk/s depends on the seek speed.

90 second writing corresponds to 10MB to 72MB dirty buffer
(4K fs block), it is not so crazy. But 30 minutes is still
too long on 256MB physical memory.

I think they are different, aren't they?

--
Computer Systems Laboratory, Fujitsu Labs.
[email protected]

2000-12-02 04:26:59

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Fri, Dec 01, 2000 at 08:35:41AM +1100, Andrew Morton wrote:
> >
> > I bet this'll catch it:
> >
> > static __inline__ void list_del(struct list_head *entry)
> > {
> > __list_del(entry->prev, entry->next);
> > + entry->next = entry->prev = 0;
> > }
>
> No, because the buffer hash list is never referenced unless
> buffer->b_inode is non-null, so we can't ever do a double-list_del on
> the buffer.
>
> The patch below should fix it. It has been sent to Linus. The
> important part is the first hunk of the inode.c diff.

Testing test11-pre3 with your inode.c patch and the above list_del
patch. x86 dual processor, IDE. Same workload as before, except
I cut out misc001 (and the machine recovered almost immediately
when I killed everything! Need more testing to characterise this).

kernel BUG at inode.c:83! The trace is below. Now, this was
probably triggered by my list_del change. If so it means
that we're doing a list_empty() test on a list_head which
has actually been deleted from a list. So it's possibly the
actual assertion in destroy_inode() which is a little bogus.

But the wierd thing is that this BUG only hit a single time,
after three hours of intensive testing. If my theory is
right, the BUG should hit every time. Will investigate further...

ksymoops 0.7c on i686 2.4.0-test12-pre3. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.0-test12-pre3/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Reading Oops report from the terminal
Dec 2 13:15:45 mnm kernel: invalid operand: 0000
Dec 2 13:15:45 mnm kernel: CPU: 0
Dec 2 13:15:45 mnm kernel: EIP: 0010:[<c014570e>]
Using defaults from ksymoops -t elf32-i386 -a i386
Dec 2 13:15:45 mnm kernel: EFLAGS: 00010282
Dec 2 13:15:45 mnm kernel: eax: 0000001a ebx: c78686e0 ecx: 00000000 edx: 0000002f
Dec 2 13:15:45 mnm kernel: esi: c025b800 edi: cd950960 ebp: c7869160 esp: ce611f3c
Dec 2 13:15:45 mnm kernel: ds: 0018 es: 0018 ss: 0018
Dec 2 13:15:45 mnm kernel: Process dbench (pid: 13094, stackpage=ce611000)
Dec 2 13:15:45 mnm kernel: Stack: c021b7e5 c021b8a5 00000053 c78686e0 c0146916 c78686e0 c7869160 c78686e0
Dec 2 13:15:45 mnm kernel: c0145096 c78686e0 00000000 ce610000 c013de4d c7869160 c7869160 c9b1e000
Dec 2 13:15:45 mnm kernel: ce611fa4 c7869160 cd9509d0 c013df25 cd950960 c7869160 ce610000 bffff5ca
Dec 2 13:15:45 mnm kernel: Call Trace: [<c021b7e5>] [<c021b8a5>] [<c0146916>] [<c0145096>] [<c013de4d>] [<c013df25>
Dec 2 13:15:45 mnm kernel: [<c010002b>]
Dec 2 13:15:45 mnm kernel: Code: 0f 0b 83 c4 0c 53 a1 10 d1 2a c0 50 e8 cd 3d fe ff 83 c4 08

>>EIP; c014570e <destroy_inode+1e/34> <=====
Trace; c021b7e5 <tvecs+5a3d/1a3d8>
Trace; c021b8a5 <tvecs+5afd/1a3d8>
Trace; c0146916 <iput+18e/194>
Trace; c0145096 <d_delete+66/ac>
Trace; c013de4d <vfs_unlink+18d/1c0>
Trace; c013df25 <sys_unlink+a5/118>
Trace; c010002b <startup_32+2b/cb>
Code; c014570e <destroy_inode+1e/34>
00000000 <_EIP>:
Code; c014570e <destroy_inode+1e/34> <=====
0: 0f 0b ud2a <=====
Code; c0145710 <destroy_inode+20/34>
2: 83 c4 0c add $0xc,%esp
Code; c0145713 <destroy_inode+23/34>
5: 53 push %ebx
Code; c0145714 <destroy_inode+24/34>
6: a1 10 d1 2a c0 mov 0xc02ad110,%eax
Code; c0145719 <destroy_inode+29/34>
b: 50 push %eax
Code; c014571a <destroy_inode+2a/34>
c: e8 cd 3d fe ff call fffe3dde <_EIP+0xfffe3dde> c01294ec <kmem_cache_free+0/7c>
Code; c014571f <destroy_inode+2f/34>
11: 83 c4 08 add $0x8,%esp

1 warning issued. Results may not be reliable.

2000-12-02 14:30:47

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Andrew Morton wrote:
>
> "Stephen C. Tweedie" wrote:
> >
> > Hi,
> >
> > On Fri, Dec 01, 2000 at 08:35:41AM +1100, Andrew Morton wrote:
> > >
> > > I bet this'll catch it:
> > >
> > > static __inline__ void list_del(struct list_head *entry)
> > > {
> > > __list_del(entry->prev, entry->next);
> > > + entry->next = entry->prev = 0;
> > > }
> >
> > No, because the buffer hash list is never referenced unless
> > buffer->b_inode is non-null, so we can't ever do a double-list_del on
> > the buffer.
> >
> > The patch below should fix it. It has been sent to Linus. The
> > important part is the first hunk of the inode.c diff.
>
> Testing test11-pre3 with your inode.c patch and the above list_del
> patch. x86 dual processor, IDE. Same workload as before, except
> I cut out misc001 (and the machine recovered almost immediately
> when I killed everything! Need more testing to characterise this).
>
> kernel BUG at inode.c:83! The trace is below. Now, this was
> probably triggered by my list_del change. If so it means
> that we're doing a list_empty() test on a list_head which
> has actually been deleted from a list. So it's possibly the
> actual assertion in destroy_inode() which is a little bogus.
>
> But the wierd thing is that this BUG only hit a single time,
> after three hours of intensive testing. If my theory is
> right, the BUG should hit every time. Will investigate further...
>

It appears that this problem is not fixed.

My destroy_inode() is now

static void destroy_inode(struct inode *inode)
{
if (!list_empty(&inode->i_dirty_buffers)) {
printk("&inode->i_dirty_buffers=0x%p\n", &inode->i_dirty_buffers);
printk("next=0x%p\n", inode->i_dirty_buffers.next);
printk("prev=0x%p\n", inode->i_dirty_buffers.prev);
BUG();
}
kmem_cache_free(inode_cachep, (inode));
}

After 45 minutes of running the previously described tests:

Dec 2 16:14:30 mnm kernel: &inode->i_dirty_buffers=0xcfe16878
Dec 2 16:14:30 mnm kernel: next=0xc16f3678
Dec 2 16:14:30 mnm kernel: prev=0xc16f3678
Dec 2 16:14:30 mnm kernel: kernel BUG at inode.c:86!

We're throwing away an inode which has live data on its
dirty buffers list.

This is 2.4.0-test11-pre3 + list_del patch + sct's inode
patch (buffer.c, inode.c). x86 dual processor. gcc 2.91.66.
I rediffed my tree. No rogue patches.

It's not an SMP thing - the address patterns match up with the
previously reported uniprocessor corruption.

2000-12-02 16:04:26

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Sun, 3 Dec 2000, Andrew Morton wrote:

> It appears that this problem is not fixed.

Sure, it isn't. Place where the shit hits the fan: fs/buffer.c::unmap_buffer().
Add the call of remove_inode_queue(bh) there and see if it helps. I.e.

ed fs/buffer.c <<EOF
/unmap_buffer/
/}/i
remove_inode_queue(bh);
.
wq
EOF

Linus, could you apply that? We are leaving the unmapped buffers on the
inode queue. I.e. every truncate_inode_pages() leaves a lot of junk around.
Now, guess what happens when we destroy the last link to inode that nobody
keeps open...

2000-12-02 16:06:16

by Petr Vandrovec

[permalink] [raw]

Subject: Re: corruption

Andrew Morton wrote:
> Andrew Morton wrote:
> >
> > actual assertion in destroy_inode() which is a little bogus.
> >
> > But the wierd thing is that this BUG only hit a single time,
> > after three hours of intensive testing. If my theory is
> > right, the BUG should hit every time. Will investigate further...
> >
>
> It appears that this problem is not fixed.
>
> My destroy_inode() is now
>
> static void destroy_inode(struct inode *inode)
> {
> if (!list_empty(&inode->i_dirty_buffers)) {
> printk("&inode->i_dirty_buffers=0x%p\n", &inode->i_dirty_buffers);
> printk("next=0x%p\n", inode->i_dirty_buffers.next);
> printk("prev=0x%p\n", inode->i_dirty_buffers.prev);
> BUG();
> }
> kmem_cache_free(inode_cachep, (inode));
> }

I used do { if (inode_has_buffers(inode)) { printstate(); } kmem_cache_free.... } while (0)

and machine complained very loudly during boot...

> This is 2.4.0-test11-pre3 + list_del patch + sct's inode
> patch (buffer.c, inode.c). x86 dual processor. gcc 2.91.66.
> I rediffed my tree. No rogue patches.

test12-pre3, NULL in list_del, destroy_inode as above, UP, 2.95.2

So I thought that adding fsync_inode_buffers() added into iput() just below
atomic_dec_and_lock(&inode->i_count...) would be good idea. It is not, bug
was still trigerred. So there are oopses... I removed disassembled code,
as it is same for all oopses (as my printstate dumps itself).

Before fsync_inode_buffers() it was almost same, there were also traces
through sys_close() in additon to this. But maybe that I just did not
trigger this code path during testing.

I think that buffer_insert_inode_queue and __remove_inode_queue should
also do iget() and iput(), but maybe I'm wrong.

And I have no idea why fsync_inode_buffers() does not work. I thought that
inode should not have any buffers attached after this function returns if
inode use count was zero... Maybe it is a bit complicated when inode
is going to cease...
Best regards,
Petr Vandrovec
[email protected]

ksymoops 2.3.4 on i686 2.4.0-test9-smp. Options used
-v /usr/src/linux/vmlinux (specified)
-k none (specified)
-l none (specified)
-o /lib/modules/2.4.0-test12-pre3-smp/ (specified)
-m /boot/System.map (specified)

Error (regular_file): read_ksyms stat none failed
No modules in ksyms, skipping objects
No ksyms, skipping lsmod
CPU: 0
EIP: 0010:[<c010b9b5>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000202
eax: 00000001 ebx: c13fe9a0 ecx: c022ce90 edx: c13fe9b8
esi: c022f500 edi: c7d940a0 ebp: c7d69fbc esp: c7d69f38
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 34, stackpage=c7d69000)
Stack: c7d69f3c c0140018 c01475b5 c13ffd40 c13fe9a0 c0145296 c13fe9a0 c7d68000
00000000 c01400b2 c13ffd40 c7d68000 08058fd0 08059930 c123d920 c13ffd40
c123d920 c7d12000 c7f45000 c123d920 c7f7b8c0 c7d12005 00000004 0006a272
Call Trace: [<c0140018>] [<c01475b5>] [<c0145296>] [<c01400b2>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145296 <dput+116/174>
Trace; c01400b2 <sys_rename+1ee/270>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>
0000000000000000 <_EIP>:
Code; c010b9b5 <printstate+9/2c> <=====
0: 50 push %eax <=====
Code; c010b9b6 <printstate+a/2c>
1: 1e push %ds
Code; c010b9b7 <printstate+b/2c>
2: 06 push %es
Code; c010b9b8 <printstate+c/2c>
3: 50 push %eax
Code; c010b9b9 <printstate+d/2c>
4: 55 push %ebp
Code; c010b9ba <printstate+e/2c>
5: 57 push %edi
Code; c010b9bb <printstate+f/2c>
6: 56 push %esi
Code; c010b9bc <printstate+10/2c>
7: 52 push %edx
Code; c010b9bd <printstate+11/2c>
8: 51 push %ecx
Code; c010b9be <printstate+12/2c>
9: 53 push %ebx
Code; c010b9bf <printstate+13/2c>
a: 89 e0 mov %esp,%eax
Code; c010b9c1 <printstate+15/2c>
c: 50 push %eax
Code; c010b9c2 <printstate+16/2c>
d: e8 a9 fe ff ff call fffffebb <_EIP+0xfffffebb> c010b870 <show_registers+0/13c>
Code; c010b9c7 <printstate+1b/2c>
12: 83 c4 00 add $0x0,%esp

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7c5cb40 ecx: c022ce90 edx: c7c5cb58
esi: c022f500 edi: c7f3d4e0 ebp: c132a460 esp: c7d69f4c
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 34, stackpage=c7d69000)
Stack: c7d69f50 c0140018 c01475b5 c132a460 c7c5cb40 c0145df8 c7c5cb40 c7d68000
00000000 c013f110 c132a460 c132a460 c132a460 c7f45000 c7d69fa4 c013f1e7
c7f3d4e0 c132a460 c7d68000 08058fd0 08059930 bffffb8c c123d920 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7c5cb40 ecx: c022ce90 edx: c7c5cb58
esi: c022f500 edi: c7f3d4e0 ebp: c132a460 esp: c7d69f4c
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 35, stackpage=c7d69000)
Stack: c7d69f50 c0140018 c01475b5 c132a460 c7c5cb40 c0145df8 c7c5cb40 c7d68000
00000000 c013f110 c132a460 c132a460 c132a460 c7f45000 c7d69fa4 c013f1e7
c7f3d4e0 c132a460 c7d68000 08058f60 08052c71 00000000 c123d920 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0ce0 ecx: c022ce90 edx: c7ad0cf8
esi: c022f500 edi: c7ad0e60 ebp: c7d6c3e0 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 41, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7d6c3e0 c7ad0ce0 c0145df8 c7ad0ce0 c7c06000
00000000 c013f110 c7d6c3e0 c7d6c3e0 c7d6c3e0 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7d6c3e0 c7c06000 400165a4 bffffd4c bffffc54 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0ce0 ecx: c022ce90 edx: c7ad0cf8
esi: c022f500 edi: c7ad0e60 ebp: c7d844a0 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 41, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7d844a0 c7ad0ce0 c0145df8 c7ad0ce0 c7c06000
00000000 c013f110 c7d844a0 c7d844a0 c7d844a0 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7d844a0 c7c06000 400165a4 bffffd4c bffffc54 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0ce0 ecx: c022ce90 edx: c7ad0cf8
esi: c022f500 edi: c7ad0e60 ebp: c7d6c560 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 41, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7d6c560 c7ad0ce0 c0145df8 c7ad0ce0 c7c06000
00000000 c013f110 c7d6c560 c7d6c560 c7d6c560 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7d6c560 c7c06000 400165a4 bffffd4c bffffc54 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0b60 ecx: c022ce90 edx: c7ad0b78
esi: c022f500 edi: c7ad0e60 ebp: c7d6c660 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 47, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7d6c660 c7ad0b60 c0145df8 c7ad0b60 c7c06000
00000000 c013f110 c7d6c660 c7d6c660 c7d6c660 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7d6c660 c7c06000 400165a4 bffffd2c bffffc34 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0b60 ecx: c022ce90 edx: c7ad0b78
esi: c022f500 edi: c7ad0e60 ebp: c7d846a0 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 47, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7d846a0 c7ad0b60 c0145df8 c7ad0b60 c7c06000
00000000 c013f110 c7d846a0 c7d846a0 c7d846a0 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7d846a0 c7c06000 400165a4 bffffd2c bffffc34 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7ad0b60 ecx: c022ce90 edx: c7ad0b78
esi: c022f500 edi: c7ad0e60 ebp: c7df0140 esp: c7c07f4c
ds: 0018 es: 0018 ss: 0018
Process ksymoops (pid: 47, stackpage=c7c07000)
Stack: c7c07f50 c0140018 c01475b5 c7df0140 c7ad0b60 c0145df8 c7ad0b60 c7c06000
00000000 c013f110 c7df0140 c7df0140 c7df0140 c7f45000 c7c07fa4 c013f1e7
c7ad0e60 c7df0140 c7c06000 400165a4 bffffd2c bffffc34 c7d84420 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c7b279c0 ecx: c022ce90 edx: c7b279d8
esi: c022f500 edi: c7f3d4e0 ebp: c132a460 esp: c77abf4c
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 89, stackpage=c77ab000)
Stack: c77abf50 c0140018 c01475b5 c132a460 c7b279c0 c0145df8 c7b279c0 c77aa000
00000000 c013f110 c132a460 c132a460 c132a460 c7f45000 c77abfa4 c013f1e7
c7f3d4e0 c132a460 c77aa000 08057f60 08057fc8 00000000 c123d920 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c78681c0 ecx: c022ce90 edx: c78681d8
esi: c022f500 edi: c7f3d4e0 ebp: c132a460 esp: c7a41f4c
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 92, stackpage=c7a41000)
Stack: c7a41f50 c0140018 c01475b5 c132a460 c78681c0 c0145df8 c78681c0 c7a40000
00000000 c013f110 c132a460 c132a460 c132a460 c7f45000 c7a41fa4 c013f1e7
c7f3d4e0 c132a460 c7a40000 08073d98 08052c71 00000000 c123d920 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c78681c0 ecx: c022ce90 edx: c78681d8
esi: c022f500 edi: c7f3d4e0 ebp: c132a460 esp: c7a41f4c
ds: 0018 es: 0018 ss: 0018
Process mount (pid: 92, stackpage=c7a41000)
Stack: c7a41f50 c0140018 c01475b5 c132a460 c78681c0 c0145df8 c78681c0 c7a40000
00000000 c013f110 c132a460 c132a460 c132a460 c7f45000 c7a41fa4 c013f1e7
c7f3d4e0 c132a460 c7a40000 080762e0 08052c71 00000000 c123d920 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c76e03a0 ecx: c022ce90 edx: c76e03b8
esi: c022f500 edi: c76e0520 ebp: c76c0840 esp: c7a6ff4c
ds: 0018 es: 0018 ss: 0018
Process rm (pid: 105, stackpage=c7a6f000)
Stack: c7a6ff50 c0140018 c01475b5 c76c0840 c76e03a0 c0145df8 c76e03a0 c7a6e000
00000000 c013f110 c76c0840 c76c0840 c76c0840 c7f45000 c7a6ffa4 c013f1e7
c76e0520 c76c0840 c7a6e000 bffffec1 00000000 bffffc44 c76c08c0 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c76154c0 ecx: c022ce90 edx: c76154d8
esi: c022f500 edi: c7f45000 ebp: c751bfa4 esp: c751bf60
ds: 0018 es: 0018 ss: 0018
Process rm (pid: 143, stackpage=c751b000)
Stack: c751bf64 c0140018 c01475b5 c77fe220 c76154c0 c0145296 c76154c0 c77fe220
00000000 c013ef77 c77fe220 c7ad0e60 c77fe220 c751a000 0804ec80 bffffbcc
bffffc34 c7d84420 c7f7b8c0 c7f45002 00000009 b23a38c3 00000010 00000000
Call Trace: [<c0140018>] [<c01475b5>] [<c0145296>] [<c013ef77>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145296 <dput+116/174>
Trace; c013ef77 <sys_rmdir+cb/104>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c752fae0 ecx: c022ce90 edx: c752faf8
esi: c022f500 edi: c13fe520 ebp: c7684dc0 esp: c7529f4c
ds: 0018 es: 0018 ss: 0018
Process rm (pid: 149, stackpage=c7529000)
Stack: c7529f50 c0140018 c01475b5 c7684dc0 c752fae0 c0145df8 c752fae0 c7528000
00000000 c013f110 c7684dc0 c7684dc0 c7684dc0 c7f45000 c7529fa4 c013f1e7
c13fe520 c7684dc0 c7528000 bffffeb4 00000000 bffffc34 c7d6cf60 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c752f7e0 ecx: c022ce90 edx: c752f7f8
esi: c022f500 edi: c13fe520 ebp: c7684c40 esp: c7529f4c
ds: 0018 es: 0018 ss: 0018
Process rm (pid: 150, stackpage=c7529000)
Stack: c7529f50 c0140018 c01475b5 c7684c40 c752f7e0 c0145df8 c752f7e0 c7528000
00000000 c013f110 c7684c40 c7684c40 c7684c40 c7f45000 c7529fa4 c013f1e7
c13fe520 c7684c40 c7528000 bffffea7 00000000 bffffc24 c7d6cf60 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

CPU: 0
EIP: 0010:[<c010b9b5>]
EFLAGS: 00000202
eax: 00000001 ebx: c752fde0 ecx: c022ce90 edx: c752fdf8
esi: c022f500 edi: c13fe520 ebp: c7684ec0 esp: c74edf4c
ds: 0018 es: 0018 ss: 0018
Process rm (pid: 164, stackpage=c74ed000)
Stack: c74edf50 c0140018 c01475b5 c7684ec0 c752fde0 c0145df8 c752fde0 c74ec000
00000000 c013f110 c7684ec0 c7684ec0 c7684ec0 c7f45000 c74edfa4 c013f1e7
c13fe520 c7684ec0 c74ec000 bffffec4 00000000 bffffc44 c7d6cf60 c7f7b8c0
Call Trace: [<c0140018>] [<c01475b5>] [<c0145df8>] [<c013f110>] [<c013f1e7>] [<c010b513>]
Code: 50 1e 06 50 55 57 56 52 51 53 89 e0 50 e8 a9 fe ff ff 83 c4

>>EIP; c010b9b5 <printstate+9/2c> <=====
Trace; c0140018 <sys_rename+154/270>
Trace; c01475b5 <iput+1a1/1b4>
Trace; c0145df8 <d_delete+60/a8>
Trace; c013f110 <vfs_unlink+160/190>
Trace; c013f1e7 <sys_unlink+a7/120>
Trace; c010b513 <system_call+33/38>
Code; c010b9b5 <printstate+9/2c>

1 error issued. Results may not be reliable.

2000-12-02 17:11:58

by Petr Vandrovec

[permalink] [raw]

Subject: Re: corruption

On Sat, Dec 02, 2000 at 10:33:36AM -0500, Alexander Viro wrote:
> On Sun, 3 Dec 2000, Andrew Morton wrote:
>
> > It appears that this problem is not fixed.
>
> Sure, it isn't. Place where the shit hits the fan: fs/buffer.c::unmap_buffer().
> Add the call of remove_inode_queue(bh) there and see if it helps. I.e.
>
> ed fs/buffer.c <<EOF
> /unmap_buffer/
> /}/i
> remove_inode_queue(bh);
> .
> wq
> EOF
>
> Linus, could you apply that? We are leaving the unmapped buffers on the
> inode queue. I.e. every truncate_inode_pages() leaves a lot of junk around.
> Now, guess what happens when we destroy the last link to inode that nobody
> keeps open...

Nothing new (was it meant to run remove_inode_queue() conditionaly inside
buffer_mapped() branch? ed did it that way). First is list of buffers at
time of destroy_inode, then process. If you want full oops trace, it is
available at http://platan.vc.cvut.cz/oops3.txt, but last part is always
iput. For now I'm back on test9, as I lost inetd.conf again :-( Someone
should shoot sendmail Debian maintainer... Running update-inetd at startup
is really bad idea, as fsck then usually removes both old and new inetd.conf,
so I'm back on inetd.conf from 25 Aug 1999 :-(((

Fields printed from buffer_head are b_next, b_blocknr, b_size, b_list,
b_count, b_state and b_inode. (oops now I see that I left
remove_inode_queue(bh) in printing loop (I copied it from
invalidate_inode_buffers()), but it should not hurt, I believe. Dirty buffers
should find its way to disk anyway, or not?)
Best regards,
Petr Vandrovec
[email protected]

BTW, running 'ksymoops < oops > oops.txt' is great source of errors below,
as it (probably) uses couple of unlinked temorary files...

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7bf3ce0

Process mount (pid: 30, stackpage=c7df3000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7bf3ce0

Process mount (pid: 31, stackpage=c7df3000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7c1a860

Process mount (pid: 68, stackpage=c7997000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7b54c80

Process mount (pid: 70, stackpage=c7997000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7b54c80

Process mount (pid: 70, stackpage=c7997000)

next=00000000, nr=3375109, size=4096, list=2
count=0
state=0000001B inode=c77e2260

Process rm (pid: 82, stackpage=c7b35000)

next=00000000, nr=3506180, size=4096, list=2
count=0
state=0000001B inode=c77eb9c0

Process rm (pid: 121, stackpage=c776d000)

next=00000000, nr=3506180, size=4096, list=2
count=0
state=0000001B inode=c77ebb40
next=00000000, nr=3507147, size=4096, list=2
count=0
state=0000001B inode=c77ebb40

Process rm (pid: 122, stackpage=c776d000)

next=00000000, nr=1179657, size=4096, list=2
count=0
state=0000001B inode=c77ebb40

Process rm (pid: 123, stackpage=c776d000)

next=00000000, nr=294919, size=4096, list=2
count=0
state=0000001B inode=c77eb240

Process rm (pid: 129, stackpage=c775d000)

next=00000000, nr=294919, size=4096, list=2
count=0
state=0000001B inode=c77eb0c0

Process rm (pid: 130, stackpage=c775d000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7889c60

Process mv (pid: 138, stackpage=c7985000)

next=c796fce0, nr=294916, size=4096, list=2
count=0
state=0000001B inode=c77eb6c0

Process rm (pid: 142, stackpage=c7721000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c75e83a0

Process update-inetd (pid: 273, stackpage=c715d000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7b359a0

Process rm (pid: 305, stackpage=c6b33000)

next=00000000, nr=2457654, size=4096, list=2
count=0
state=0000001B inode=c7b356a0

Process rm (pid: 305, stackpage=c6b33000)

next=c6ed8980, nr=753674, size=4096, list=2
count=0
state=0000001B inode=c655a3a0

Process mc (pid: 330, stackpage=c6703000)

next=00000000, nr=196617, size=4096, list=2
count=0
state=0000001B inode=c656c4e0

Process mc (pid: 330, stackpage=c6703000)

2000-12-02 18:20:48

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Sat, 2 Dec 2000, Petr Vandrovec wrote:

> Nothing new (was it meant to run remove_inode_queue() conditionaly inside
> buffer_mapped() branch? ed did it that way). First is list of buffers at
> time of destroy_inode, then process. If you want full oops trace, it is

what list of buffers? ->i_dirty_buffers one?

> available at http://platan.vc.cvut.cz/oops3.txt, but last part is always
> iput. For now I'm back on test9, as I lost inetd.conf again :-( Someone
> should shoot sendmail Debian maintainer... Running update-inetd at startup
> is really bad idea, as fsck then usually removes both old and new inetd.conf,
> so I'm back on inetd.conf from 25 Aug 1999 :-(((
>
> Fields printed from buffer_head are b_next, b_blocknr, b_size, b_list,
> b_count, b_state and b_inode. (oops now I see that I left
> remove_inode_queue(bh) in printing loop (I copied it from
> invalidate_inode_buffers()), but it should not hurt, I believe. Dirty buffers
> should find its way to disk anyway, or not?)

When you delete the inode? Why would they? Petr, could you send me the
diff between the variant you've run and pristine 12-pre3? I'ld really
like to see what exactly was doing the printks...

2000-12-02 18:29:59

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Sat, 2 Dec 2000, Petr Vandrovec wrote:

[I wrote:]

> > ed fs/buffer.c <<EOF
> > /unmap_buffer/
> > /}/i
spin_lock(&lru_list_lock);
> > remove_inode_queue(bh);
spin_unlock(&lru_list_lock);
> > .
> > wq
> > EOF

Damn. I claim the sudden idiocy attack - didn't look at the locking
rules for the ->b_inode_buffers. My apologies.

2000-12-03 20:55:14

by Jonathan Hudson

[permalink] [raw]

Subject: Re: corruption

In article <[email protected]>,
Alexander Viro <[email protected]> writes:
AV>
>> > ed fs/buffer.c <<EOF
>> > /unmap_buffer/
>> > /}/i
AV> spin_lock(&lru_list_lock);
>> > remove_inode_queue(bh);
AV> spin_unlock(&lru_list_lock);
>> > .
>> > wq
>> > EOF
AV>

I applied this on top the the previous SCT patch, and have thrashed
the system harder than I would have dared previously. It's still
running. I feel very comfortable with this, much more so than any
prior 2.4.0t*.

2000-12-03 22:12:25

by Andrew Morton

[permalink] [raw]

Subject: Re: corruption

Alexander Viro wrote:
>
> On Sun, 3 Dec 2000, Andrew Morton wrote:
>
> > It appears that this problem is not fixed.
>
> Sure, it isn't. Place where the shit hits the fan: fs/buffer.c::unmap_buffer().
> Add the call of remove_inode_queue(bh) there and see if it helps. I.e.
>
> ed fs/buffer.c <<EOF
> /unmap_buffer/
> /}/i
> remove_inode_queue(bh);
> .
> wq
> EOF
>

Sorry, it's still failing. It took three hours.

&inode->i_dirty_buffers=0xca9e63f8
next=0xc30a2598
prev=0xc30a2598
kernel BUG at inode.c:86!

The ksymoops output is here, as is my current diff wrt test12-pre3.

ksymoops 0.7c on i686 2.4.0-test12-pre3. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.0-test12-pre3/ (default)
-m /usr/src/linux/System.map (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Reading Oops report from the terminal
Dec 4 01:56:02 mnm kernel: EIP: 0010:[<c0145758>]
Using defaults from ksymoops -t elf32-i386 -a i386
Dec 4 01:56:02 mnm kernel: EFLAGS: 00010282
Dec 4 01:56:02 mnm kernel: eax: 0000001a ebx: ca9e63e0 ecx: 00000001 edx: 00000000
Dec 4 01:56:02 mnm kernel: esi: c025b8a0 edi: cfbcb040 ebp: ce2b0260 esp: ced4df3c
Dec 4 01:56:02 mnm kernel: ds: 0018 es: 0018 ss: 0018
Dec 4 01:56:02 mnm kernel: Process lat_fs (pid: 17559, stackpage=ced4d000) Dec 4 01:56:02 mnm kernel: Stack: c021b845 c021b939 00000056 ca9e63e0 c0146966 ca9e63e0 ce2b0260 ca9e63e0
Dec 4 01:56:02 mnm kernel: Call Trace: [<c021b845>] [<c021b939>] [<c0146966>] [<c01450b6>] [<c013de6d>] [<c013df45>] [<c0108fdf>]
Dec 4 01:56:02 mnm kernel: Code: 0f 0b 83 c4 0c 8d 76 00 53 a1 10 d1 2a c0 50 e8 80 3d fe ff

>>EIP; c0145758 <destroy_inode+48/64> <=====
Trace; c021b845 <tvecs+5a3d/1a418>
Trace; c021b939 <tvecs+5b31/1a418>
Trace; c0146966 <iput+18e/194>
Trace; c01450b6 <d_delete+66/ac>
Trace; c013de6d <vfs_unlink+18d/1c0>
Trace; c013df45 <sys_unlink+a5/118>
Trace; c0108fdf <system_call+33/38>
Code; c0145758 <destroy_inode+48/64>
00000000 <_EIP>:
Code; c0145758 <destroy_inode+48/64> <=====
0: 0f 0b ud2a <=====
Code; c014575a <destroy_inode+4a/64>
2: 83 c4 0c add $0xc,%esp
Code; c014575d <destroy_inode+4d/64>
5: 8d 76 00 lea 0x0(%esi),%esi
Code; c0145760 <destroy_inode+50/64>
8: 53 push %ebx
Code; c0145761 <destroy_inode+51/64>
9: a1 10 d1 2a c0 mov 0xc02ad110,%eax
Code; c0145766 <destroy_inode+56/64>
e: 50 push %eax
Code; c0145767 <destroy_inode+57/64>
f: e8 80 3d fe ff call fffe3d94 <_EIP+0xfffe3d94> c01294ec <kmem_cache_free+0/7c>

1 warning issued. Results may not be reliable.

--- linux-2.4.0-test12-pre3/include/linux/list.h Fri Aug 11 19:06:12 2000
+++ linux-akpm/include/linux/list.h Fri Dec 1 17:31:35 2000
@@ -90,6 +90,7 @@
static __inline__ void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
+ entry->next = entry->prev = 0;
}

/**
--- linux-2.4.0-test12-pre3/fs/buffer.c Wed Nov 29 18:23:19 2000
+++ linux-akpm/fs/buffer.c Sun Dec 3 22:36:18 2000
@@ -871,10 +871,11 @@
else {
bh->b_inode = &tmp;
list_add(&bh->b_inode_buffers, &tmp.i_dirty_buffers);
- atomic_inc(&bh->b_count);
if (buffer_dirty(bh)) {
+ atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
ll_rw_block(WRITE, 1, &bh);
+ brelse(bh);
spin_lock(&lru_list_lock);
}
}
@@ -883,6 +884,7 @@
while (!list_empty(&tmp.i_dirty_buffers)) {
bh = BH_ENTRY(tmp.i_dirty_buffers.prev);
remove_inode_queue(bh);
+ atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
wait_on_buffer(bh);
if (!buffer_uptodate(bh))
@@ -929,9 +931,9 @@
atomic_inc(&bh->b_count);
spin_unlock(&lru_list_lock);
wait_on_buffer(bh);
- brelse(bh);
if (!buffer_uptodate(bh))
err = -EIO;
+ brelse(bh);
spin_lock(&lru_list_lock);
goto repeat;
}
@@ -1459,6 +1461,9 @@
clear_bit(BH_Mapped, &bh->b_state);
clear_bit(BH_Req, &bh->b_state);
clear_bit(BH_New, &bh->b_state);
+ spin_lock(&lru_list_lock);
+ remove_inode_queue(bh);
+ spin_unlock(&lru_list_lock);
}
}

--- linux-2.4.0-test12-pre3/fs/inode.c Wed Nov 29 18:23:19 2000
+++ linux-akpm/fs/inode.c Sat Dec 2 15:34:51 2000
@@ -77,7 +77,17 @@

#define alloc_inode() \
((struct inode *) kmem_cache_alloc(inode_cachep, SLAB_KERNEL))
-#define destroy_inode(inode) kmem_cache_free(inode_cachep, (inode))
+static void destroy_inode(struct inode *inode)
+{
+ if (!list_empty(&inode->i_dirty_buffers)) {
+ printk("&inode->i_dirty_buffers=0x%p\n", &inode->i_dirty_buffers);
+ printk("next=0x%p\n", inode->i_dirty_buffers.next);
+ printk("prev=0x%p\n", inode->i_dirty_buffers.prev);
+ BUG();
+ }
+ kmem_cache_free(inode_cachep, (inode));
+}
+

/*
* These are initializations that only need to be done
@@ -348,6 +358,12 @@

void clear_inode(struct inode *inode)
{
+ if (!list_empty(&inode->i_dirty_buffers)) {
+ if (inode->i_nlink)
+ BUG();
+ invalidate_inode_buffers(inode);
+ }
+
if (inode->i_data.nrpages)
BUG();
if (!(inode->i_state & I_FREEING))
@@ -407,6 +423,7 @@
inode = list_entry(tmp, struct inode, i_list);
if (inode->i_sb != sb)
continue;
+ invalidate_inode_buffers(inode);
if (!atomic_read(&inode->i_count)) {
list_del(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_hash);

2000-12-03 23:16:54

by Alexander Viro

[permalink] [raw]

Subject: [resync?] Re: corruption

On Mon, 4 Dec 2000, Andrew Morton wrote:

> Sorry, it's still failing. It took three hours.

Yes. For one thing, original was plain wrong wrt locking (lru_list_lock
should be held). For another, it does not take care of metadata. And
that's way more serious. What really happens:

ext2_truncate() got a buffer_head of indirect block that is going to
die. Fine, we release the blocks refered from it and... do bforget()
on our block. Notice that we are not guaranteed that bh will actually
die here. buffer.c code might bump its ->b_count for a while, it might
be written out right now, etc. As the result, bforget() leaves the
sucker alive. It's not a big deal, since we will do unmap_underlying_metadata()
before we write anything there (if it will be reused for data) or we'll
just pick the bh and zero the buffer out (if it will be reused for metadata).

Unfortunately, we also leave it on the per-inode dirty blocks list. Guess
what happens if inode is destroyed, page that used to hold it gets reused
and bh gets finally written? Exactly.

Suggested fix: void bforget_inode(struct buffer_head *bh) that would
be a copy of __bforget(), except that it would call remove_inode_queue(bh)
unconditionally. And replace bforget() with bforget_inode() in those places
of ext2/inode.c that are followed by freeing the block.

Comments? I'll do a patch, but I'ld really like to know what had already
gone into the main tree. Linus, could you put the 12-pre4-dont-use on
ftp.kernel.org?

2000-12-04 00:31:25

by Jeff V. Merkey

[permalink] [raw]

Subject: Re: [resync?] Re: corruption

On Sun, Dec 03, 2000 at 05:45:57PM -0500, Alexander Viro wrote:
>
>
> On Mon, 4 Dec 2000, Andrew Morton wrote:
>
> > Sorry, it's still failing. It took three hours.
>
> Yes. For one thing, original was plain wrong wrt locking (lru_list_lock
> should be held). For another, it does not take care of metadata. And
> that's way more serious. What really happens:
>
> ext2_truncate() got a buffer_head of indirect block that is going to
> die. Fine, we release the blocks refered from it and... do bforget()
> on our block. Notice that we are not guaranteed that bh will actually
> die here. buffer.c code might bump its ->b_count for a while, it might
> be written out right now, etc. As the result, bforget() leaves the
> sucker alive. It's not a big deal, since we will do unmap_underlying_metadata()
> before we write anything there (if it will be reused for data) or we'll
> just pick the bh and zero the buffer out (if it will be reused for metadata).
>
> Unfortunately, we also leave it on the per-inode dirty blocks list. Guess
> what happens if inode is destroyed, page that used to hold it gets reused
> and bh gets finally written? Exactly.
>
> Suggested fix: void bforget_inode(struct buffer_head *bh) that would
> be a copy of __bforget(), except that it would call remove_inode_queue(bh)
> unconditionally. And replace bforget() with bforget_inode() in those places
> of ext2/inode.c that are followed by freeing the block.
>
> Comments? I'll do a patch, but I'ld really like to know what had already
> gone into the main tree. Linus, could you put the 12-pre4-dont-use on
> ftp.kernel.org?

Al,

I am always amazed at how rapidly you seem to be able to run down some
of these file system corruption problems. You seem to understand the
interaction of this layer extremely well. :-)

Jeff

>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-12-04 15:35:46

by Stephen C. Tweedie

[permalink] [raw]

Subject: Re: corruption

Hi,

On Sat, Dec 02, 2000 at 10:33:36AM -0500, Alexander Viro wrote:
>
> On Sun, 3 Dec 2000, Andrew Morton wrote:
>
> > It appears that this problem is not fixed.
> Sure, it isn't. Place where the shit hits the fan: fs/buffer.c::unmap_buffer().
> Add the call of remove_inode_queue(bh) there and see if it helps. I.e.

unmap_buffer() calls mark_buffer_clean() calls refile_buffer() calls
remove_inode_queue(), which is why we don't see this all the time.

However, refile_buffer() is only calling the remove_inode_queue() if
the buffer disposition changes. I'm looking to see where we may be
going wrong here --- the refile_buffer() is not atomic wrt. the
bh->b_inode structures.

--Stephen

2000-12-04 15:49:47

by Alexander Viro

[permalink] [raw]

Subject: Re: corruption

On Mon, 4 Dec 2000, Stephen C. Tweedie wrote:

> unmap_buffer() calls mark_buffer_clean() calls refile_buffer() calls
> remove_inode_queue(), which is why we don't see this all the time.

Not enough, since you can hit the window between the request completion
(bh is marked clean) and getting it picked by flush_dirty_buffers() et.al.
If you get destroy_inode() before that window will close...

> However, refile_buffer() is only calling the remove_inode_queue() if
> the buffer disposition changes. I'm looking to see where we may be
> going wrong here --- the refile_buffer() is not atomic wrt. the
> bh->b_inode structures.

See above. Point about the metadata (bforget() is not enough) also stands,
ditto for ext2_update_inode() one.