2003-08-08 14:52:09

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Fri, 8 Aug 2003, Stephan von Krawczynski wrote:

> On Thu, 7 Aug 2003 14:49:17 -0700
> Andrew Morton <[email protected]> wrote:
>
> > Marcelo Tosatti <[email protected]> wrote:
> > >
> > > Anyway, you seem to be getting random memory corruption and I have no idea
> > > what the hell may be causing it.
> > >
> > > Andrea? Andrew? Alan? _Any_ helpful comments?
> >
> > Not really, sorry. Ugly.
> >
> > What was the last kernel which didn't crash?
> >
> > You're showing a huge set of reiserfs diffs there, mostly cosmetic though.
> >
> > Running memtest86 for 12 hours is needed.
> >
> > Going back to the last-known-kernel would be useful, just to verify that
> > the hardware is still good (some connector could have become resistive, or
> > the power supply could have drifted, etc).
> >
> > Would it be possible to try a different filesystem on that box?
> >
> > Do we know of other people who are using late 2.4 kernels on server-grade
> > hardware? If so, are they doing OK?
>
> I can give you this additional info:
> I tried about everything back to 2.4.21 release, and even this crashes on the
> box. BUT it is _not_ the only box I can crash 2.4.21. I have another hardware
> (also SMP) based not on Serverworks but on VIA chipset and with no 64 bit pci
> and it crashes with 2.4.21 around every 10 - 20 days. It definitely does not
> with 2.4.19.

Do you have any traces of the other box crash?

> The only requirement for my usual test-box is a working tg3 driver for the GBit
> ethernet link.

> Ah yes, and from the long series of tests I can tell that the box won't crash
> with UP kernel. I can re-check that with rc1 if this is useful.

Okay. That's useful information. How hard would it be for you to try ext3
as the filesystem (as Andrew suggested) ?


2003-08-08 15:05:45

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Fri, 8 Aug 2003 11:54:39 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

> > I can give you this additional info:
> > I tried about everything back to 2.4.21 release, and even this crashes on
> > the box. BUT it is _not_ the only box I can crash 2.4.21. I have another
> > hardware(also SMP) based not on Serverworks but on VIA chipset and with no
> > 64 bit pci and it crashes with 2.4.21 around every 10 - 20 days. It
> > definitely does not with 2.4.19.
>
> Do you have any traces of the other box crash?

Not at hand, but can prepare for the next crash during the weekend.

> > The only requirement for my usual test-box is a working tg3 driver for the
> > GBit ethernet link.
>
> > Ah yes, and from the long series of tests I can tell that the box won't
> > crash with UP kernel. I can re-check that with rc1 if this is useful.
>
> Okay. That's useful information. How hard would it be for you to try ext3
> as the filesystem (as Andrew suggested) ?

Well, if that provides further info I will do it. I will try to get it done over
the weekend; I need some spare volumes for conversion (by copy) :-)

Regards,
Stephan

2003-08-08 15:30:39

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Fri, 8 Aug 2003, Stephan von Krawczynski wrote:

> On Fri, 8 Aug 2003 11:54:39 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> > > I can give you this additional info:
> > > I tried about everything back to 2.4.21 release, and even this crashes on
> > > the box. BUT it is _not_ the only box I can crash 2.4.21. I have another
> > > hardware(also SMP) based not on Serverworks but on VIA chipset and with no
> > > 64 bit pci and it crashes with 2.4.21 around every 10 - 20 days. It
> > > definitely does not with 2.4.19.
> >
> > Do you have any traces of the other box crash?
>
> Not at hand, but can prepare for the next crash during the weekend.
>
> > > The only requirement for my usual test-box is a working tg3 driver for the
> > > GBit ethernet link.
> >
> > > Ah yes, and from the long series of tests I can tell that the box won't
> > > crash with UP kernel. I can re-check that with rc1 if this is useful.
> >
> > Okay. That's useful information. How hard would it be for you to try ext3
> > as the filesystem (as Andrew suggested) ?
>
> Well, if that provides further info I will do it. I will try to get it done over
> the weekend; I need some spare volumes for conversion (by copy) :-)

That will provide further information yes. We can then know if the problem
is reiserfs specific or not, which is VERY useful.

Again, thanks for your efforts helping us track down the problem.

2003-08-10 14:23:49

by Keith Owens

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Fri, 8 Aug 2003 17:05:36 +0200,
Stephan von Krawczynski <[email protected]> wrote:
>Well, if that provides further info I will do it. I will try to get it done over
>the weekend; I need some spare volumes for conversion (by copy) :-)

FWIW, there are kdb patches for 2.4.22-pre8 onwards. They also fit
2.4.22-rc1.

ftp://oss.sgi.com/projects/kdb/download/v4.3/kdb-v4.3-2.4.22-pre8-common-8.bz2
ftp://oss.sgi.com/projects/kdb/download/v4.3/kdb-v4.3-2.4.22-pre8-i386-5.bz2

2003-08-10 21:35:37

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

> > > > Ah yes, and from the long series of tests I can tell that the box won't
> > > > crash with UP kernel. I can re-check that with rc1 if this is useful.
> > >
> > > Okay. That's useful information.

During this weekend I did several tests around SMP and UP, and I can definitely
confirm the box does not crash under rc2-UP kernel, but collapses within hours
under rc2-SMP.

> > > How hard would it be for you to try ext3
> > > as the filesystem (as Andrew suggested) ?

I spent half the weekend to turn the setup from reiserfs over to ext3
completely preserving the data. The box runs now with rc2-SMP-ext3 (no reiserfs
present any longer). I will send notice if/when it crashes.

From looking at the tests so far I would say the setup is remarkably slower in
terms of writing to ext3 via nfs and sync option set. I think especially the
"sync" is very visible - unlike reiserfs.

Regards,
Stephan

2003-08-13 10:55:14

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

> That will provide further information yes. We can then know if the problem
> is reiserfs specific or not, which is VERY useful.
>
> Again, thanks for your efforts helping us track down the problem.

Status update:

uptime:
12:45pm up 2 days 19:39, 18 users, load average: 2.02, 2.05, 2.06

Running SMP. So far no crash happened under ext3.
Still I see the tar-verification errors. None on the first day, 2 on the second
and 2 today so far.
I see a growing possibility that the former crashes are directly linked to a
reiserfs problem, maybe broken SMP-locking.
If it survives until sunday I will revert all ext3 back to reiserfs to be sure
it still crashes, then ideas for patches will be welcome :-)

Up to sunday I can try to look deeper into the verification troubles. To be
honest I already doubt today that I will see a crash with ext3 until sunday...

Regards,
Stephan

2003-08-13 14:59:45

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Wed, Aug 13, 2003 at 11:53:09AM -0300, Marcelo Tosatti wrote:

> > Running SMP. So far no crash happened under ext3.
> > Still I see the tar-verification errors. None on the first day, 2 on the second

But tar verification errors are still bad, right?

> > it still crashes, then ideas for patches will be welcome :-)
> Great you tracked it down. Your previous traces almost always involved
> reiserfs calls, which is another indicator that reiserfs is probably the
> problem here.

reiserfs is probably just a bit more sensitive to memory corruption.

> Chris, Oleg, it might be nice if you guys could look at previous oops
> reports by Stephan.

All of them looked like memory corruptions of unknown reason to me.

Bye,
Oleg

2003-08-13 14:53:40

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Wed, 13 Aug 2003, Stephan von Krawczynski wrote:

> On Fri, 8 Aug 2003 12:33:28 -0300 (BRT)
> Marcelo Tosatti <[email protected]> wrote:
>
> > That will provide further information yes. We can then know if the problem
> > is reiserfs specific or not, which is VERY useful.
> >
> > Again, thanks for your efforts helping us track down the problem.
>
> Status update:
>
> uptime:
> 12:45pm up 2 days 19:39, 18 users, load average: 2.02, 2.05, 2.06
>
> Running SMP. So far no crash happened under ext3.
> Still I see the tar-verification errors. None on the first day, 2 on the second
> and 2 today so far.
> I see a growing possibility that the former crashes are directly linked to a
> reiserfs problem, maybe broken SMP-locking.
> If it survives until sunday I will revert all ext3 back to reiserfs to be sure
> it still crashes, then ideas for patches will be welcome :-)

Great you tracked it down. Your previous traces almost always involved
reiserfs calls, which is another indicator that reiserfs is probably the
problem here.

Chris, Oleg, it might be nice if you guys could look at previous oops
reports by Stephan.

2003-08-13 15:12:28

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, 13 Aug 2003 18:59:40 +0400
Oleg Drokin <[email protected]> wrote:

> Hello!
>
> On Wed, Aug 13, 2003 at 11:53:09AM -0300, Marcelo Tosatti wrote:
>
> > > Running SMP. So far no crash happened under ext3.
> > > Still I see the tar-verification errors. None on the first day, 2 on the
> > > second
>
> But tar verification errors are still bad, right?

Sure. Maybe both topics are unrelated. I can't tell.

> > > it still crashes, then ideas for patches will be welcome :-)
> > Great you tracked it down. Your previous traces almost always involved
> > reiserfs calls, which is another indicator that reiserfs is probably the
> > problem here.
>
> reiserfs is probably just a bit more sensitive to memory corruption.
>
> > Chris, Oleg, it might be nice if you guys could look at previous oops
> > reports by Stephan.
>
> All of them looked like memory corruptions of unknown reason to me.

Well, that's exactly the reason why I am awaiting some more days of
up-and-running ext3. After how many days will you be convinced that a random
memory corruption should have hit the ext3 system that bad, that it should have
crashed?
I can add another week if you want me to, just tell me. The only thing I don't
want is that any doubts are left after testing ...
Still, current 2 days uptime is early stage, so let's give it some more time.

Regards,
Stephan

2003-08-13 15:30:14

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:

> Well, that's exactly the reason why I am awaiting some more days of
> up-and-running ext3. After how many days will you be convinced that a random
> memory corruption should have hit the ext3 system that bad, that it should have
> crashed?

Well, I'd prefer that you spend time to figure out at which exact
2.4.21-pre version the crashes in reiserfs started to appear. ;)

> I can add another week if you want me to, just tell me. The only thing I don't
> want is that any doubts are left after testing ...

It would be interesting to look at fsck results on the fs after some time of
testing.
Probably it would be easier for you to make it crash (if there is any crash
possibility at all) if you enable JBD debugging.

Bye,
Oleg

2003-08-13 16:04:11

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, 13 Aug 2003 19:30:09 +0400
Oleg Drokin <[email protected]> wrote:

> Hello!
>
> On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:
>
> > Well, that's exactly the reason why I am awaiting some more days of
> > up-and-running ext3. After how many days will you be convinced that a
> > random memory corruption should have hit the ext3 system that bad, that it
> > should have crashed?
>
> Well, I'd prefer that you spend time to figure out at which exact
> 2.4.21-pre version the crashes in reiserfs started to appear. ;)

Well, Oleg, I'd love to, but there is an immanent problem with that. If
I check pre-X and it crashes, everything is fine, because I have a certain
result of the test. If it does not crash within 3 days, then I have a problem.
How long do I wait before stating the pre is good? It could take months to test
10 pre's ... That cannot be the way to find out what is going on.
On the other hand:
- no UP kernel ever crashed. So we can at least talk about an SMP-race.
- 2.4.20 does not crash
- 2.4.21 does crash
If we can add "ext3 does not crash" to the list, then I really hope we can use
some brain and give good selection of patches between 2.4.20 and 2.4.21 that
may cause the troubles.
How many suspects do we have? We can at least begin to create a list of things
that went in between .20 and .21, or not?
If possible I can then patch out all of them and retry. So there is much less
time spent for testing.
I mean, have you looked at the length of this thread already?
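
One way to put a number on the "how long do I wait" question is to treat
crashes as random arrivals. The sketch below is only an illustration under
that assumption (a Poisson process with a fixed mean time between crashes,
which nothing in this thread establishes); the function names and the sample
means are hypothetical. It also shows how bisecting an ordered list of
prereleases reduces the number of kernels that need testing.

```python
import math


def days_to_wait(mean_days_between_crashes, confidence=0.95):
    """Crash-free days needed before calling a kernel "good".

    Assumption: on a bad kernel crashes arrive as a Poisson process, so the
    chance of seeing no crash in t days is exp(-t / mean). Solving
    exp(-t / mean) = 1 - confidence for t gives the value returned here.
    """
    return -mean_days_between_crashes * math.log(1.0 - confidence)


def bisection_steps(candidates):
    """Kernels to test when bisecting an ordered list of candidates."""
    if candidates <= 1:
        return 0
    return math.ceil(math.log2(candidates))


if __name__ == "__main__":
    # Hypothetical mean times between crashes, in days.
    for mean in (0.5, 5.0, 15.0):
        wait = days_to_wait(mean)
        print(f"mean {mean:4.1f} days between crashes -> wait {wait:5.1f} days")
    print("kernels to test when bisecting 10 prereleases:", bisection_steps(10))
```

Under this model a crash-free run of roughly three times the mean time
between crashes is needed for 95% confidence, which is why a slow-to-trigger
failure makes testing every prerelease in turn so expensive.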

> > I can add another week if you want me to, just tell me. The only thing I
> > don't want is that any doubts are left after testing ...
>
> It would be interesting to look at fsck results on the fs after some time of
> testing.

You mean I should do an fsck on sunday?

> Probably it would be easier for you to make it crash (if there is any crash
> possibility at all) if you enable JBD debugging.

I have never seen this in real life. Is it possible to turn this on when
handling >100 GB of data or will some debug output flood the box?

Regards,
Stephan

2003-08-13 16:34:56

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Wed, Aug 13, 2003 at 06:04:05PM +0200, Stephan von Krawczynski wrote:
> > > Well, that's exactly the reason why I am awaiting some more days of
> > > up-and-running ext3. After how many days will you be convinced that a
> > > random memory corruption should have hit the ext3 system that bad, that it
> > > should have crashed?
> > Well, I'd prefer that you spend time to figure out at which exact
> > 2.4.21-pre version the crashes in reiserfs started to appear. ;)
> Well, Oleg, I'd love to, but there is an immanent problem with that. If
> I check pre-X and it crashes, everything is fine, because I have a certain
> result of the test. If it does not crash within 3 days, then I have a problem.
> How long do I wait before stating the pre is good? It could take months to test

You seem to be getting corruptions in at least 2 days for now, though.
And reiserfs seems to trigger the problem even faster (and maybe
even faster still if you enable CONFIG_REISERFS_CHECK).

> 10 pre's ... That cannot be the way to find out what is going on.
> On the other hand:
> - no UP kernel ever crashed. So we can at least talk about an SMP-race.

There is still a huge field to look at.

> - 2.4.20 does not crash
> - 2.4.21 does crash

diff is 20M in size.

> If we can add "ext3 does not crash" to the list, then I really hope we can use
> some brain and give good selection of patches between 2.4.20 and 2.4.21 that
> may cause the troubles.

There were not many changes in reiserfs. All those patches can easily be
reverted just for verification purposes. Let me know when you are ready/want
to test this variant and I will send you a diff.

> How many suspects do we have? We can at least begin to create a list of things

Well, suspects are all used drivers, VM, filesystem itself, arch code.

> that went in between .20 and .21, or not?

Lots of changes, 2.4.20->2.4.21 was a long trip.

> If possible I can then patch out all of them and retry. So there is much less
> time spent for testing.
> I mean, have you looked at the length of this thread already?

Yes, I did.
Now if only we can get someone to reproduce your problems...

> > > I can add another week if you want me to, just tell me. The only thing I
> > > don't want is that any doubts are left after testing ...
> > It would be interesting to look at fsck results on the fs after some time of
> > testing.
> You mean I should do an fsck on sunday?

Yes, whenever you decide you have waited long enough (provided that it won't
crash) and decide to stop testing, please run fsck on that testing fs.

> > Probably it would be easier for you to make it crash (if there is any crash
> > possibility at all) if you enable JBD debugging.
> I have never seen this in real life. Is it possible to turn this on when
> handling >100 GB of data or will some debug output flood the box?

It only enables some more checks, not debug output.

Bye,
Oleg

2003-08-13 22:19:59

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, 13 Aug 2003 20:34:52 +0400
Oleg Drokin <[email protected]> wrote:

> You seem to be getting corruptions in at least 2 days for now, though.
> And reiserfs seems to trigger the problem even faster (and maybe
> even faster still if you enable CONFIG_REISERFS_CHECK).

Well, I have an idea how to find out more about these verify problems. Basically
I would try to patch tar to output the differing areas to stdout in hexdump
format or the like. Only I need some time to make this work out. I hope to find
some pattern in this corruption.
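
The hexdump idea can also be tried outside tar. The following is a minimal
sketch, assuming the original file and the copy read back from the archive
are both available as regular files; it is not a patch to tar, and
`compare_files`, `differing_ranges` and the 64 KiB chunk size are
illustrative choices.

```python
import sys

CHUNK = 64 * 1024  # bytes read per step; an arbitrary illustrative size


def differing_ranges(a, b, base=0):
    """Return [start, end) offsets where two byte strings differ.

    Only the common prefix length is compared; `base` is added to every
    offset so that results can be reported as absolute file positions.
    """
    ranges = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            if start is None:
                start = i
        elif start is not None:
            ranges.append((base + start, base + i))
            start = None
    if start is not None:
        ranges.append((base + start, base + min(len(a), len(b))))
    return ranges


def hex_line(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in data)


def compare_files(path_a, path_b, out=sys.stdout):
    """Print each differing range of two files; return the mismatch count.

    A run of differing bytes that crosses a chunk boundary is reported as
    two adjacent ranges.
    """
    offset = 0
    mismatches = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(CHUNK), fb.read(CHUNK)
            if not a and not b:
                break
            for start, end in differing_ranges(a, b, offset):
                lo, hi = start - offset, end - offset
                print(f"{start:#012x}-{end:#012x} ({end - start} bytes)", file=out)
                print(f"  a: {hex_line(a[lo:hi])}", file=out)
                print(f"  b: {hex_line(b[lo:hi])}", file=out)
                mismatches += 1
            if len(a) != len(b):
                shorter = offset + min(len(a), len(b))
                print(f"file lengths differ at offset {shorter:#x}", file=out)
                mismatches += 1
                break
            offset += len(a)
    return mismatches


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(1 if compare_files(sys.argv[1], sys.argv[2]) else 0)
```

The offsets and lengths of the reported ranges (page-sized, sector-sized or
oddly aligned runs, single flipped bits) are the kind of pattern that might
point at a cause.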

> > 10 pre's ... That cannot be the way to find out what is going on.
> > On the other hand:
> > - no UP kernel ever crashed. So we can at least talk about an SMP-race.
>
> There is still a huge field to look at.
>
> > - 2.4.20 does not crash
> > - 2.4.21 does crash
>
> diff is 20M in size.
>
> > If we can add "ext3 does not crash" to the list, then I really hope we can
> > use some brain and give good selection of patches between 2.4.20 and 2.4.21
> > that may cause the troubles.
>
> There were not many changes in reiserfs. All those patches can easily be
> reverted just for verification purposes. Let me know when you are ready/want
> to test this variant and I will send you a diff.

Hm, my primary belief is that something _around_ reiserfs has changed
semantics.

> > If possible I can then patch out all of them and retry. So there is much
> > less time spent for testing.
> > I mean, have you looked at the length of this thread already?
>
> Yes, I did.
> Now if only we can get someone to reproduce your problems...

Hm, I believe nobody has in fact tried a setup like mine. I have clear
indication that I can trigger it simply by using an SMP box, installing SuSE
8.2, compiling a stock 2.4.22-rc2 kernel, exporting some reiserfs to an
nfs-client of your choice and copying data with sizes around 100GB back and
forth.

> > > > I can add another week if you want me to, just tell me. The only thing
> > > > I don't want is that any doubts are left after testing ...
> > > It would be interesting to look at fsck results on the fs after some time
> > > of testing.
> > You mean I should do an fsck on sunday?
>
> Yes, whenever you decide you have waited long enough (provided that it won't
> crash) and decide to stop testing, please run fsck on that testing fs.

Ok, will do that.

>
> > > Probably it would be easier for you to make it crash (if there is any crash
> > > possibility at all) if you enable JBD debugging.
> > I have never seen this in real life. Is it possible to turn this on when
> > handling >100 GB of data or will some debug output flood the box?
>
> It only enables some more checks, not debug output.

Does this work for ext3, reiserfs or both?

Regards,
Stephan

2003-08-14 08:45:22

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

> > You seem to be getting corruptions in at least 2 days for now, though.
> > And reiserfs seems to trigger the problem even faster (and maybe
> > even faster still if you enable CONFIG_REISERFS_CHECK).
> Well, I have an idea how to find out more about these verify problems. Basically
> I would try to patch tar to output the differing areas to stdout in hexdump
> format or the like. Only I need some time to make this work out. I hope to find
> some pattern in this corruption.

Yes, that would be interesting.

> > > If we can add "ext3 does not crash" to the list, then I really hope we can
> > > use some brain and give good selection of patches between 2.4.20 and 2.4.21
> > > that may cause the troubles.
> > There were not many changes in reiserfs. All those patches can easily be
> > reverted just for verification purposes. Let me know when you are ready/want
> > to test this variant and I will send you a diff.
> Hm, my primary belief is that something _around_ reiserfs has changed
> semantics.

Well. Might be, but this is unlikely. And I do not remember anything like that.
I will take a closer look, though.

> > > If possible I can then patch out all of them and retry. So there is much
> > > less time spent for testing.
> > > I mean, have you looked at the length of this thread already?
> > Yes, I did.
> > Now if only we can get someone to reproduce your problems...
> Hm, I believe nobody has in fact tried a setup like mine. I have clear
> indication that I can trigger it simply by using an SMP box, installing SuSE
> 8.2, compiling a stock 2.4.22-rc2 kernel, exporting some reiserfs to an
> nfs-client of your choice and copying data with sizes around 100GB back and
> forth.

Sounds like quite a typical setup for some tasks (like clusters, I guess).

> > > > Probably it would be easier for you to make it crash (if there is any crash
> > > > possibility at all) if you enable JBD debugging.
> > > I have never seen this in real life. Is it possible to turn this on when
> > > handling >100 GB of data or will some debug output flood the box?
> > It only enables some more checks, not debug output.
> Does this work for ext3, reiserfs or both?

This works for ext3.
For reiserfs we have a similar compile-time option called
CONFIG_REISERFS_CHECK.

Thank you for all the time and efforts you are putting into finding out
the cause.

Bye,
Oleg

2003-08-14 17:26:56

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Thu, 14 Aug 2003, Oleg Drokin wrote:

> Thank you for all the time and efforts you are putting into finding out
> the cause.

Stephan,

How are things going? Is the machine still alive and well?


2003-08-14 17:42:29

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Thu, 14 Aug 2003 14:26:33 -0300 (BRT)
Marcelo Tosatti <[email protected]> wrote:

>
>
> On Thu, 14 Aug 2003, Oleg Drokin wrote:
>
> > Thank you for all the time and efforts you are putting into finding out
> > the cause.
>
> Stephan,
>
> How are things going? Is the machine still alive and well?

Hello Marcelo,

the system is up and running, currently:

7:40pm up 4 days 2:34, 21 users, load average: 2.07, 2.10, 2.06

there is still the verification issue, today I added another 50 GB to the data
stream, and therefore got additional 3 verification errors. But this seems to
have no influence on the stability. Box feels ok, reacts completely normal, no
strange output in any logs.

Regards,
Stephan

2003-08-15 02:09:24

by Chris Mason

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Thu, 2003-08-14 at 13:42, Stephan von Krawczynski wrote:

> Hello Marcelo,
>
> the system is up and running, currently:
>
> 7:40pm up 4 days 2:34, 21 users, load average: 2.07, 2.10, 2.06
>
> there is still the verification issue, today I added another 50 GB to the data
> stream, and therefore got additional 3 verification errors. But this seems to
> have no influence on the stability. Box feels ok, reacts completely normal, no
> strange output in any logs.

Just to second Oleg's messages so far, the verification issues are still
serious, it could be the same kind of memory corruptions that could be
causing crashes on reiserfs, just in a different place.

We need to find out if a specific kernel release is causing these
corruptions. There are lots of different ways to go about it, I would
suggest a combination of fsx (triggers IO and does verification) and
usemem (sucks down ram) from the ext3 cvs progs.

When you can reliably cause either fsx-linux errors or system hangs in a
short period of time, then we can try different prereleases to find the
offending code.

(download details here: http://www.zipworld.com.au/~akpm/linux/ext3/)

Run 4 or so fsx-linux programs (each to its own file) and use usemem to
put your box into swap. That should hit it pretty quickly, and any
errors from fsx indicate problems.
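
The procedure above could be scripted roughly as follows. This is a sketch
under assumptions: the path to the fsx-linux binary, the test directory and
the complete usemem command line are supplied by the caller (usemem's options
are not given in this thread, so none are guessed here), and fsx is assumed
to be left running until it is stopped or hits an error, so any fsx exit is
treated as worth investigating. The function names are hypothetical.

```python
import subprocess
import sys
import time


def fsx_commands(fsx_binary, directory, count=4):
    """Build one argv per fsx instance, each pointed at its own file."""
    return [[fsx_binary, f"{directory}/fsx-{i}.dat"] for i in range(count)]


def first_exit(procs, poll_seconds=5.0):
    """Block until any process exits; return (index, return code)."""
    while True:
        for index, proc in enumerate(procs):
            code = proc.poll()
            if code is not None:
                return index, code
        time.sleep(poll_seconds)


def run(fsx_binary, directory, memory_hog_argv, count=4):
    """Start the memory hog and the fsx instances, report the first fsx exit."""
    hog = subprocess.Popen(memory_hog_argv)
    procs = []
    try:
        for argv in fsx_commands(fsx_binary, directory, count):
            procs.append(subprocess.Popen(argv))
        return first_exit(procs)
    finally:
        # Stop whatever is still running, including the memory hog.
        for proc in procs + [hog]:
            if proc.poll() is None:
                proc.terminate()


if __name__ == "__main__" and len(sys.argv) >= 4:
    # usage: script.py FSX_BINARY TEST_DIR MEMORY_HOG_COMMAND...
    index, code = run(sys.argv[1], sys.argv[2], sys.argv[3:])
    print(f"fsx instance {index} exited with status {code}")
```

Giving each instance its own file keeps the verifiers from interfering with
each other, so a reported mismatch cannot be blamed on two writers sharing a
file.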

-chris


2003-08-15 09:40:35

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On 14 Aug 2003 22:08:58 -0400
Chris Mason <[email protected]> wrote:

> On Thu, 2003-08-14 at 13:42, Stephan von Krawczynski wrote:
>
> > Hello Marcelo,
> >
> > the system is up and running, currently:
> >
> > 7:40pm up 4 days 2:34, 21 users, load average: 2.07, 2.10, 2.06
> >
> > there is still the verification issue, today I added another 50 GB to the
> > data stream, and therefore got additional 3 verification errors. But this
> > seems to have no influence on the stability. Box feels ok, reacts
> > completely normal, no strange output in any logs.
>
> Just to second Oleg's messages so far, the verification issues are still
> serious, it could be the same kind of memory corruptions that could be
> causing crashes on reiserfs, just in a different place.

Well, as you expected, I have an oops for you; it happened just this morning:

ksymoops 2.4.8 on i686 2.4.22-rc2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-rc2/ (default)
-m /boot/System.map-2.4.22-rc2 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

NMI Watchdog detected LOCKUP on CPU0, eip c01457c3, registers:
CPU: 0
EIP: 0010:[<c01457c3>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000046
eax: 00000019 ebx: effc5c7c ecx: 00000000 edx: effc6c7c
esi: 00000001 edi: 00000202 ebp: c13956c0 esp: f6ae1e8c
ds: 0018 es: 0018 ss: 0018
Process setiathome (pid: 2696, stackpage=f6ae1000)
Stack: f79ba218 effc5c7c f710eab8 00000008 c02165ea effc5c7c 00000001 ffffffff
f79ba298 f79ba218 00000001 00000010 00000001 f710ea00 c0216a0f f710ea00
00000001 00000000 00000001 00000001 ffffffff ffffffff 0000001c 00000000
Call Trace: [<c02165ea>] [<c0216a0f>] [<c024a47a>] [<c020f6b8>] [<c020f568>]
[<c01226da>] [<c0122563>] [<c01222d6>] [<c0109508>] [<c010c048>]
Code: 75 eb a8 01 0f 44 f1 8b 52 28 39 da 75 ea c6 05 64 5d 30 c0


>>EIP; c01457c3 <end_buffer_io_async+63/b0> <=====

>>ebx; effc5c7c <_end+2fbfe61c/38462a00>
>>edx; effc6c7c <_end+2fbff61c/38462a00>
>>ebp; c13956c0 <_end+fce060/38462a00>
>>esp; f6ae1e8c <_end+3671a82c/38462a00>

Trace; c02165ea <__scsi_end_request+ba/250>
Trace; c0216a0f <scsi_io_completion+15f/430>
Trace; c024a47a <rw_intr+5a/200>
Trace; c020f6b8 <scsi_finish_command+98/d0>
Trace; c020f568 <scsi_bottom_half_handler+c8/f0>
Trace; c01226da <bh_action+6a/70>
Trace; c0122563 <tasklet_hi_action+53/a0>
Trace; c01222d6 <do_softirq+76/e0>
Trace; c0109508 <do_IRQ+d8/f0>
Trace; c010c048 <call_do_IRQ+5/d>

Code; c01457c3 <end_buffer_io_async+63/b0>
00000000 <_EIP>:
Code; c01457c3 <end_buffer_io_async+63/b0> <=====
0: 75 eb jne ffffffed <_EIP+0xffffffed> <=====
Code; c01457c5 <end_buffer_io_async+65/b0>
2: a8 01 test $0x1,%al
Code; c01457c7 <end_buffer_io_async+67/b0>
4: 0f 44 f1 cmove %ecx,%esi
Code; c01457ca <end_buffer_io_async+6a/b0>
7: 8b 52 28 mov 0x28(%edx),%edx
Code; c01457cd <end_buffer_io_async+6d/b0>
a: 39 da cmp %ebx,%edx
Code; c01457cf <end_buffer_io_async+6f/b0>
c: 75 ea jne fffffff8 <_EIP+0xfffffff8>
Code; c01457d1 <end_buffer_io_async+71/b0>
e: c6 05 64 5d 30 c0 00 movb $0x0,0xc0305d64


1 warning issued. Results may not be reliable.


Obviously the problem seems a lot harder to trigger with ext3, but nevertheless
comes up (this time around 5 days). I will try Chris' suggestions and see what
happens. I'll keep you informed.

Regards,
Stephan

2003-08-15 10:13:49

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello Oleg,

there was a question about fsck'ing the ext3 filesystems. Since it crashed
today I did check them now and no errors or warnings showed up. Everything
seems clean. I don't exactly understand what that tells you. I guess you mean
the fs metadata may have been hit, too. Seems not.

Regards,
Stephan

2003-08-15 10:28:52

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On 14 Aug 2003 22:08:58 -0400
Chris Mason <[email protected]> wrote:

> Run 4 or so fsx-linux programs (each to its own file) and use usemem to
> put your box into swap. That should hit it pretty quickly, and any
> errors from fsx indicate problems.

Question: how do I make fsx-linux use big filesizes (GB range) ?

Regards,
Stephan

2003-08-15 10:31:11

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Fri, Aug 15, 2003 at 12:13:21PM +0200, Stephan von Krawczynski wrote:

> there was a question about fsck'ing the ext3 filesystems. Since it crashed
> today I did check them now and no errors or warnings showed up. Everything
> seems clean. I don't exactly understand what that tells you. I guess you mean
> the fs metadata may have been hit, too. Seems not.

Yes. And from what I remember, all the oopses on reiserfs were about some
list corruption and that sort of thing, so not metadata, but kernel
data was damaged somehow.
And your last oops confirms that.
end_buffer_io_async has a loop that runs with irqs disabled.
And this loop in your case should only have one iteration (you run with 4k
blocksize, I presume), going through the one buffer attached to a page.
Also at least one of the oopses you posted prior to that had signs of
buffer list corruption (maybe even two).
So it seems something changes buffer lists under our feet without doing
proper locking.
I am not sure how this relates to data corruption, though.
Ok, at least now there seems to be something definite to look for in changes.
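
The hang described above can be modelled in user space. The sketch below is
not kernel code: `BufferHead`, `this_page` and the iteration limit are
stand-ins chosen for illustration, assuming the buffers of a page form a
circular singly linked list and that the completion handler walks it until it
gets back to the buffer it started from. If the ring is rewired without
locking, the walk may never reach its starting buffer again, and with
interrupts disabled that would show up as the kind of lockup the NMI watchdog
reports.

```python
class BufferHead:
    """Stand-in for a buffer head; only the ring pointer is modelled."""

    def __init__(self, name):
        self.name = name
        self.this_page = self  # a lone buffer points back at itself


def make_ring(names):
    """Link one BufferHead per name into a circular list."""
    bufs = [BufferHead(name) for name in names]
    for cur, nxt in zip(bufs, bufs[1:] + bufs[:1]):
        cur.this_page = nxt
    return bufs


def walk_ring(start, limit=64):
    """Return the names of the other buffers on the ring.

    The unbounded form of this loop is just `while tmp is not start`. The
    limit is a hypothetical guard so that a corrupted ring raises an error
    here instead of spinning forever.
    """
    seen = []
    tmp = start.this_page
    while tmp is not start:
        if len(seen) >= limit:
            raise RuntimeError("ring never returns to its start")
        seen.append(tmp.name)
        tmp = tmp.this_page
    return seen


if __name__ == "__main__":
    ring = make_ring(["a", "b", "c"])
    print("intact ring, walking from a:", walk_ring(ring[0]))
    # Simulate an unlocked update: c now points at b, so a is unreachable.
    ring[2].this_page = ring[1]
    try:
        walk_ring(ring[0])
    except RuntimeError as err:
        print("corrupted ring:", err)
```

With one buffer per page the intact walk does no iterations at all, so a CPU
stuck in this loop suggests the pointer no longer leads back to the starting
buffer.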

Thank you.

Bye,
Oleg

2003-08-15 12:55:21

by Chris Mason

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Fri, 2003-08-15 at 06:28, Stephan von Krawczynski wrote:
> On 14 Aug 2003 22:08:58 -0400
> Chris Mason <[email protected]> wrote:
>
> > Run 4 or so fsx-linux programs (each to its own file) and use usemem to
> > put your box into swap. That should hit it pretty quickly, and any
> > errors from fsx indicate problems.
>
> Question: how do I make fsx-linux use big filesizes (GB range) ?

You don't really need to, fsx-linux is pretty good at triggering
problems with its default file size. Usually you just need some other
load in place to chew up ram.

-chris


2003-08-18 18:07:25

by Andrea Arcangeli

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, Aug 13, 2003 at 06:04:05PM +0200, Stephan von Krawczynski wrote:
> On Wed, 13 Aug 2003 19:30:09 +0400
> Oleg Drokin <[email protected]> wrote:
>
> > Hello!
> >
> > On Wed, Aug 13, 2003 at 05:12:24PM +0200, Stephan von Krawczynski wrote:
> >
> > > Well, that's exactly the reason why I am awaiting some more days of
> > > up-and-running ext3. After how many days will you be convinced that a
> > > random memory corruption should have hit the ext3 system that bad, that it
> > > should have crashed?
> >
> > Well, I'd prefer that you spend time to figure out at which exact
> > 2.4.21-pre version the crashes in reiserfs started to appear. ;)
>
> Well, Oleg, I'd love to, but there is an inherent problem with that. If
> I check pre-X and it crashes, everything is fine, because I have a certain
> result of the test. If it does not crash within 3 days, then I have a problem.
> How long do I wait before stating the pre is good? It could take months to test
> 10 pre's ... That cannot be the way to find out what is going on.
> On the other hand:
> - no UP kernel ever crashed. So we can at least talk about an SMP-race.
> - 2.4.20 does not crash
> - 2.4.21 does crash

An SMP kernel puts double the stress on the memory bus, so it might
still be RAM that went bad around the time you upgraded from 2.4.19. Or
it may simply be a buggy SMP motherboard, or whatever.

Of course I can't be sure but we can't exclude it.

Andrea
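
The quoted question of how long to wait before calling a pre-release good can be made concrete under one simplifying assumption that is not from the thread: on a bad kernel, crashes arrive as a Poisson process with a known mean time between crashes. The sketch below computes how long a crash-free run has to be before the chance of wrongly calling a bad kernel good drops below a chosen limit. All function names are illustrative.

```python
import math


def survival_probability(days_waited, mean_days_to_crash):
    """Chance that a *bad* kernel runs this long without crashing
    (exponential waiting time, an assumption)."""
    return math.exp(-days_waited / mean_days_to_crash)


def days_to_wait(mean_days_to_crash, max_false_good=0.05):
    """Crash-free days needed so that a bad kernel is mislabelled
    'good' with probability at most max_false_good."""
    return mean_days_to_crash * math.log(1.0 / max_false_good)


def bisection_steps(n_candidates):
    """Number of kernels to test when bisecting n_candidates releases."""
    return math.ceil(math.log2(n_candidates))
```

With a mean of 15 days between crashes (the middle of the 10 - 20 day range reported for the VIA box), days_to_wait(15) is about 45 days per kernel, and bisecting 10 pre-releases takes 4 such steps. Under these assumptions "months" is a fair estimate.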

2003-08-18 20:19:24

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Mon, 18 Aug 2003 17:06:25 +0200
Andrea Arcangeli <[email protected]> wrote:

> An SMP kernel puts double the stress on the memory bus, so it might
> still be RAM that went bad around the time you upgraded from 2.4.19. Or
> it may simply be a buggy SMP motherboard, or whatever.
>
> Of course I can't be sure but we can't exclude it.

It is unlikely for bad ram to survive memtest for several hours.

Regards,
Stephan

2003-08-18 20:58:28

by Mike Fedyk

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> On Mon, 18 Aug 2003 17:06:25 +0200
> Andrea Arcangeli <[email protected]> wrote:
>
> > An SMP kernel puts double the stress on the memory bus, so it might
> > still be RAM that went bad around the time you upgraded from 2.4.19. Or
> > it may simply be a buggy SMP motherboard, or whatever.
> >
> > Of course I can't be sure but we can't exclude it.
>
> It is unlikely for bad ram to survive memtest for several hours.

How many hours?

Are you using memtest 3.0, which supports larger amounts of memory and has
specific tests for ECC (i.e. disabling it)?

Are you doing a full run with all tests, and not just the standard tests?
(You should let it complete one pass, or preferably two or three, in this mode.)

2003-08-18 22:50:42

by Andrea Arcangeli

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> On Mon, 18 Aug 2003 17:06:25 +0200
> Andrea Arcangeli <[email protected]> wrote:
>
> > An SMP kernel puts double the stress on the memory bus, so it might
> > still be RAM that went bad around the time you upgraded from 2.4.19. Or
> > it may simply be a buggy SMP motherboard, or whatever.
> >
> > Of course I can't be sure but we can't exclude it.
>
> It is unlikely for bad ram to survive memtest for several hours.

memtest is single threaded, UP kernel works fine too.

Andrea

2003-08-19 01:12:19

by Mike Fedyk

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Tue, Aug 19, 2003 at 12:31:27AM +0200, Andrea Arcangeli wrote:
> On Mon, Aug 18, 2003 at 10:19:21PM +0200, Stephan von Krawczynski wrote:
> > On Mon, 18 Aug 2003 17:06:25 +0200
> > Andrea Arcangeli <[email protected]> wrote:
> >
> > > An SMP kernel puts double the stress on the memory bus, so it might
> > > still be RAM that went bad around the time you upgraded from 2.4.19. Or
> > > it may simply be a buggy SMP motherboard, or whatever.
> > >
> > > Of course I can't be sure but we can't exclude it.
> >
> > It is unlikely for bad ram to survive memtest for several hours.
>
> memtest is single threaded, UP kernel works fine too.

Are you saying that one CPU can't saturate the memory bus? Or maybe we're
hitting something on the CPU bus, or just that SMP will change the timings
and stress things differently? Or that if memtest doesn't test from the
second CPU then it could be a faulty cpu/L2?

Grr, has anything been done to verify the hardware is running within spec
and isn't too hot?

2003-08-19 07:13:40

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Mon, 18 Aug 2003 18:12:08 -0700
Mike Fedyk <[email protected]> wrote:

> > > It is unlikely for bad ram to survive memtest for several hours.
> >
> > memtest is single threaded, UP kernel works fine too.
>
> Are you saying that one CPU can't saturate the memory bus? Or maybe we're
> hitting something on the CPU bus, or just that SMP will change the timings
> and stress things differently? Or that if memtest doesn't test from the
> second CPU then it could be a faulty cpu/L2?

Well, if memtest does not use a second available CPU then probably we should
ask the author about this...

> Grr, has anything been done to verify the hardware is running within spec
> and isn't too hot?

In fact we are talking about a datacenter environment with air conditioning and
the like.
Besides the favourite test box I have others (already mentioned in this thread)
- SMP with completely different hw - where I can make 2.4.21 and above crash,
too.

Regards,
Stephan

2003-08-19 14:12:48

by Andrea Arcangeli

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Tue, Aug 19, 2003 at 09:12:43AM +0200, Stephan von Krawczynski wrote:
> Besides the favourite test box I have others (already mentioned in this thread)
> - SMP with completely different hw - where I can make 2.4.21 and above crash,
> too.

Did you post any backtrace for those other boxes yet? It would be
especially useful if you could demonstrate the same random mm corruption
with different ram/motherboard/cpus (I mean all of them different). If
the devices are the same that's ok (since it could be a software bug in
a driver).

At the moment I doubt a bug in the common code since AFAIK you are the
only one running into this sort of corruption and at the very least I
can't trigger it here (OTOH maybe it triggers with only one certain
application).

(just for clarity: with my previous posts I didn't mean it's not a
software bug, I only wanted to point out that with the current info we
cannot completely exclude a hardware issue yet)

Andrea

2003-08-19 15:56:35

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On 19 Aug 2003 14:10:22 +0100
Alan Cox <[email protected]> wrote:

> On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > > Are you saying that one CPU can't saturate the memory bus? Or maybe
> > > we're hitting something on the CPU bus, or just that SMP will change the
> > > timings and stress things differently? Or that if memtest doesn't test
> > > from the second CPU then it could be a faulty cpu/L2?
> >
> > Well, if memtest does not use a second available CPU then probably we
> > should ask the author about this...
>
> I'm sure he'd give you a quote for adding SMP support if you asked.

Well, actually I don't want to burn his time as long as I don't see a need
for it. Since I am pretty confident I can make the box work in SMP under 2.4.20,
a memtest will most certainly not give any additional information, be it running
UP or SMP.
Instead I will invest another day and convert the whole system back to
reiserfs, because the ext3 fs cannot be used under 2.4.20 - I don't know why.
Additionally reiserfs is better for testing possible patches because it crashes
in much less time than the ext3 setup.
The 2.4.20 setup gives me a simple test case to prove right or wrong the people
who are talking about a hardware issue.

Regards,
Stephan

2003-08-19 16:18:40

by Alan

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > Are you saying that one CPU can't saturate the memory bus? Or maybe we're
> > hitting something on the CPU bus, or just that SMP will change the timings
> > and stress things differently? Or that if memtest doesn't test from the
> > second CPU then it could be a faulty cpu/L2?
>
> Well, if memtest does not use a second available CPU then probably we should
> ask the author about this...

I'm sure he'd give you a quote for adding SMP support if you asked.

2003-08-19 18:10:14

by Mike Fedyk

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Tue, Aug 19, 2003 at 04:18:32PM +0200, Stephan von Krawczynski wrote:
> On 19 Aug 2003 14:10:22 +0100
> Alan Cox <[email protected]> wrote:
>
> > On Maw, 2003-08-19 at 08:12, Stephan von Krawczynski wrote:
> > > > Are you saying that one CPU can't saturate the memory bus? Or maybe
> > > > we're hitting something on the CPU bus, or just that SMP will change the
> > > > timings and stress things differently? Or that if memtest doesn't test
> > > > from the second CPU then it could be a faulty cpu/L2?
> > >
> > > Well, if memtest does not use a second available CPU then probably we
> > > should ask the author about this...
> >
> > I'm sure he'd give you a quote for adding SMP support if you asked.
>
> Well, actually I don't want to burn his time as long as I don't see a need
> for it. Since I am pretty confident I can make the box work in SMP under 2.4.20,
> a memtest will most certainly not give any additional information, be it running
> UP or SMP.
> Instead I will invest another day and convert the whole system back to
> reiserfs, because the ext3 fs cannot be used under 2.4.20 - I don't know why.
> Additionally reiserfs is better for testing possible patches because it crashes
> in much less time than the ext3 setup.
> The 2.4.20 setup gives me a simple test case to prove right or wrong the people
> who are talking about a hardware issue.

Are you doing a lot of directory operations, or is it mostly just large
amounts of data transferring over NFS?

The reason I ask is that I know at least JFS, and possibly XFS, use
trees for their directory structures and might show similar problems (given
their heavy use of trees) if you did a lot of directory operations on
those filesystems.

Then maybe it could rule out reiserfs. Though it still did show up on ext3...

2003-08-19 21:59:23

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Tue, 19 Aug 2003 11:00:28 -0700
Mike Fedyk <[email protected]> wrote:

> Are you doing a lot of directory operations, or is it mostly just large
> amounts of data transferring over NFS?

In fact almost no directory operations take place. Large data movement is the
primary test.

Regards,
Stephan

2003-08-20 14:22:08

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (yet another oops for rc2)

Hello all,

today's oops is:

ksymoops 2.4.8 on i686 2.4.22-rc2. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-rc2/ (default)
-m /boot/System.map-2.4.22-rc2 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

kernel BUG at slab.c:1225!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c0137ebd>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 00000005 ebx: 00000005 ecx: 00000088 edx: 00000000
esi: f6df2000 edi: f6df20a0 ebp: f6df2348 esp: c345df04
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c345d000)
Stack: f6df234c f6df2348 f6df23cc f6df2000 c0139107 c342b4d0 f6df2000 f6df2348
c342b4d0 0000007d c346040c c3460400 c01384e2 c342b4d0 f6df234c 00000000
00000001 00000000 00000000 00000000 00000020 000001d0 00000020 00000006
Call Trace: [<c0139107>] [<c01384e2>] [<c0139c78>] [<c0139d2e>] [<c0139e3c>]
[<c0139ec8>] [<c0139ff8>] [<c0139f60>] [<c0105000>] [<c010592e>] [<c0139f60>]
Code: 0f 0b c9 04 44 92 2c c0 8b 44 86 18 83 f8 ff 75 eb 89 f6 8b


>>EIP; c0137ebd <kmem_extra_free_checks+6d/a0> <=====

>>esi; f6df2000 <_end+36a2a9a0/38462a00>
>>edi; f6df20a0 <_end+36a2aa40/38462a00>
>>ebp; f6df2348 <_end+36a2ace8/38462a00>
>>esp; c345df04 <_end+30968a4/38462a00>

Trace; c0139107 <kmem_cache_free_one+f7/220>
Trace; c01384e2 <kmem_cache_reap+b2/290>
Trace; c0139c78 <shrink_caches+28/a0>
Trace; c0139d2e <try_to_free_pages_zone+3e/60>
Trace; c0139e3c <kswapd_balance_pgdat+4c/b0>
Trace; c0139ec8 <kswapd_balance+28/40>
Trace; c0139ff8 <kswapd+98/c0>
Trace; c0139f60 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0139f60 <kswapd+0/c0>

Code; c0137ebd <kmem_extra_free_checks+6d/a0>
00000000 <_EIP>:
Code; c0137ebd <kmem_extra_free_checks+6d/a0> <=====
0: 0f 0b ud2a <=====
Code; c0137ebf <kmem_extra_free_checks+6f/a0>
2: c9 leave
Code; c0137ec0 <kmem_extra_free_checks+70/a0>
3: 04 44 add $0x44,%al
Code; c0137ec2 <kmem_extra_free_checks+72/a0>
5: 92 xchg %eax,%edx
Code; c0137ec3 <kmem_extra_free_checks+73/a0>
6: 2c c0 sub $0xc0,%al
Code; c0137ec5 <kmem_extra_free_checks+75/a0>
8: 8b 44 86 18 mov 0x18(%esi,%eax,4),%eax
Code; c0137ec9 <kmem_extra_free_checks+79/a0>
c: 83 f8 ff cmp $0xffffffff,%eax
Code; c0137ecc <kmem_extra_free_checks+7c/a0>
f: 75 eb jne fffffffc <_EIP+0xfffffffc>
Code; c0137ece <kmem_extra_free_checks+7e/a0>
11: 89 f6 mov %esi,%esi
Code; c0137ed0 <kmem_extra_free_checks+80/a0>
13: 8b 00 mov (%eax),%eax


1 warning issued. Results may not be reliable.


This is still with ext3 and about 24 hours uptime (rough guess).

Regards,
Stephan
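
A note on reading the "Code:" line of this oops. The sketch below assumes the 2.4 i386 BUG() macro layout: the two-byte ud2 instruction (0f 0b) followed by a 16-bit line number and a 32-bit pointer to the file name string. Under that assumption, the bytes after ud2 are data, and the leave/add/xchg/sub lines in the ksymoops disassembly are an artefact of decoding them as instructions. The function name is illustrative.

```python
import struct


def decode_bug_trailer(code_hex):
    """Decode the data that follows ud2 in a 2.4 i386 BUG() site.

    Assumed layout (little-endian): 0f 0b, u16 line, u32 file-name pointer.
    Returns (line_number, file_name_pointer).
    """
    raw = bytes.fromhex(code_hex.replace(" ", ""))
    if raw[:2] != b"\x0f\x0b":
        raise ValueError("code does not start with ud2 (0f 0b)")
    line, file_ptr = struct.unpack_from("<HI", raw, 2)
    return line, file_ptr


# First bytes of the "Code:" line from the oops above.
line, file_ptr = decode_bug_trailer("0f 0b c9 04 44 92 2c c0")
print(line, hex(file_ptr))  # 1225 0xc02c9244
```

The decoded line number, 1225, matches the "kernel BUG at slab.c:1225!" message, which supports reading those bytes as the BUG() trailer and not as real instructions.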

2003-09-05 09:24:09

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (case closed)

Hello all,

I would like to give you the last update on the story:

short: hardware problem

long:
The box had two different types of RAM (both registered ECC) in it. Two modules
were 1 GByte and four were 256 MByte, for a total of 3 GByte. I found that the
box runs flawlessly when using only the GByte modules _or_ only the 256 MByte
modules, but not the mix. All modules are from the same vendor. The problem in
the mixed setup does not show up in UP mode (memtest works!). It does not even
show up straight away; it takes days, but it is always there.
In fact - even though I sank weeks of work into it - I am pretty happy that it
turned out not to be a kernel problem.
For the other setups that showed SMP-specific weirdness TeJun may have found
interesting explanations. I updated them all to 2.4.22 and have not seen any
problem yet.
For me it was really interesting to see that reiserfs setups obviously have a
completely different memory footprint than ext3, and altogether there seems to
be a remarkable difference between later kernels and earlier ones. The problem
showed up very seldom on 2.4.21 and below but within 2 days with 2.4.22.
Thanks to all who lent me their ears on the topic and sorry for wasting your
time.

Regards,
Stephan

PS: Obviously there are rare cases where SMP support in memtest _could_ make
a difference ;-)

2003-09-05 13:37:51

by Andrea Arcangeli

Subject: Re: 2.4.22-pre lockups (case closed)

On Fri, Sep 05, 2003 at 11:24:00AM +0200, Stephan von Krawczynski wrote:
> Hello all,
>
> I would like to give you the last update on the story:
>
> short: hardware problem
>
> long:
> The box had two different types of RAM (both registered ECC) in it. Two modules
> were 1 GByte and four were 256 MByte, for a total of 3 GByte. I found that the
> box runs flawlessly when using only the GByte modules _or_ only the 256 MByte
> modules, but not the mix. All modules are from the same vendor. The problem in
> the mixed setup does not show up in UP mode (memtest works!). It does not even
> show up straight away; it takes days, but it is always there.
> In fact - even though I sank weeks of work into it - I am pretty happy that it
> turned out not to be a kernel problem.

thanks for demonstrating this.

> For the other setups that showed SMP-specific weirdness TeJun may have found
> interesting explanations. I updated them all to 2.4.22 and have not seen any
> problem yet.
> For me it was really interesting to see that reiserfs setups obviously have a
> completely different memory footprint than ext3, and altogether there seems to
> be a remarkable difference between later kernels and earlier ones. The problem
> showed up very seldom on 2.4.21 and below but within 2 days with 2.4.22.

Normally that indicates the kernel is somehow using the resources more
efficiently. It's usually a good sign from a kernel standpoint; I have heard
of things like this happening during major upgrades too, like from 2.2
to 2.4.

> Thanks to all who lent me their ears on the topic and sorry for wasting your
> time.

you're very welcome.

> PS: Obviously there are rare cases where SMP support in memtest _could_ make
> a difference ;-)

;)

Andrea

/*
* If you refuse to depend on closed software for a critical
* part of your business, these links may be useful:
*
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.5/
* rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.4/
* http://www.cobite.com/cvsps/
*
* svn://svn.kernel.org/linux-2.6/trunk
* svn://svn.kernel.org/linux-2.4/trunk
*/