2002-02-10 07:34:19

by Pawel Worach

Subject: 2.4.18-pre9

during a cvs update of the mozilla source tree the machine oops'ed and
hung hard. system is an abit bp6 with two intel celeron cpus. this seems
to be swap/cache related; memory has been tested with memtest86 without
any faults.

../Pawel

decoded oops:

VM: killing process sendmail
swap_free: Unused swap offset entry 00004000
Unable to handle kernel NULL pointer dereference at virtual address 00000000
c0149674
*pde = 00000000
Oops: 0000
CPU: 1
EIP: 0010:[d_lookup+116/304] Not tainted
EIP: 0010:[<c0149674>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010213
eax: c1580000 ebx: fffffff0 ecx: 00000010 edx: d5139bab
esi: c6efb4e0 edi: c6ee2600 ebp: 00000000 esp: c28b1ee8
ds: 0018 es: 0018 ss: 0018
Process cvs (pid: 31350, stackpage=c28b1000)
Stack: c15e10c8 c4ff2008 d5139bab 00000007 c28b1f54 c6efb4e0 c6ee2600
c28b1f90
c0140610 c6efb4e0 c28b1f54 c28b1f54 c0140e2c c6efb4e0 c28b1f54 00000000
00000009 c4ff200f 00000000 000041ed 00000002 c6de14c0 00001000 fffffff4
Call Trace: [cached_lookup+16/80] [link_path_walk+1452/2048]
[getname+94/160] [__user_walk+51/80] [sys_access+126/288]
Call Trace: [<c0140610>] [<c0140e2c>] [<c014034e>] [<c01414f3>]
[<c01357ee>]
[<c010700b>]
Code: 8b 6d 00 39 53 44 0f 85 83 00 00 00 8b 44 24 24 39 43 0c 75

>>EIP; c0149674 <d_lookup+74/130> <=====
Trace; c0140610 <cached_lookup+10/50>
Trace; c0140e2c <link_path_walk+5ac/800>
Trace; c014034e <getname+5e/a0>
Trace; c01414f3 <__user_walk+33/50>
Trace; c01357ee <sys_access+7e/120>
Trace; c010700b <system_call+33/38>
Code; c0149674 <d_lookup+74/130>
00000000 <_EIP>:
Code; c0149674 <d_lookup+74/130> <=====
0: 8b 6d 00 mov 0x0(%ebp),%ebp <=====
Code; c0149677 <d_lookup+77/130>
3: 39 53 44 cmp %edx,0x44(%ebx)
Code; c014967a <d_lookup+7a/130>
6: 0f 85 83 00 00 00 jne 8f <_EIP+0x8f> c0149703
<d_lookup+103/130>
Code; c0149680 <d_lookup+80/130>
c: 8b 44 24 24 mov 0x24(%esp,1),%eax
Code; c0149684 <d_lookup+84/130>
10: 39 43 0c cmp %eax,0xc(%ebx)
Code; c0149687 <d_lookup+87/130>
13: 75 00 jne 15 <_EIP+0x15> c0149689
<d_lookup+89/130>
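
An aside on reading the decoded output: the kernel prints the faulting
location as d_lookup+116/304 in decimal, while ksymoops prints
d_lookup+74/130 in hex, and both name the same byte. The sketch below
(Python, purely illustrative; the helper names are made up here) redoes
that arithmetic from the values in the oops. The last check rests on an
assumption that is not stated anywhere in this thread: that ebx is
computed from ebp by subtracting a 0x10-byte member offset, which would
be one way to explain ebx = fffffff0 while ebp is 0.

```python
# Values copied from the oops above.
EIP = 0xC0149674
KERNEL_OFFSET, KERNEL_SIZE = 116, 304        # d_lookup+116/304 (decimal)
KSYMOOPS_OFFSET, KSYMOOPS_SIZE = 0x74, 0x130  # d_lookup+74/130 (hex)
EBP = 0x00000000
EBX = 0xFFFFFFF0


def symbol_start(eip, offset):
    """Start address of the function that contains eip."""
    return eip - offset


def wrap32(value):
    """Truncate to 32 bits, the way an i386 register would."""
    return value & 0xFFFFFFFF


# Both notations describe the same offset and function size.
print(KERNEL_OFFSET == KSYMOOPS_OFFSET, KERNEL_SIZE == KSYMOOPS_SIZE)

# Where d_lookup begins, according to the EIP and its offset.
print(hex(symbol_start(EIP, KSYMOOPS_OFFSET)))

# ASSUMPTION (not from the thread): ebx = ebp - 0x10.
# With ebp == 0 this wraps around to the ebx value in the register dump.
ASSUMED_MEMBER_OFFSET = 0x10
print(wrap32(EBP - ASSUMED_MEMBER_OFFSET) == EBX)
```

The faulting instruction, mov 0x0(%ebp),%ebp, reads from the address
held in ebp. With ebp = 00000000 that is a read from address 0, which
matches the "NULL pointer dereference at virtual address 00000000" line.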



2002-02-10 07:50:35

by Andi Kleen

Subject: Re: 2.4.18-pre9

Pawel Worach writes:

> .... abit bp6 with two intel celeron cpus....

...

>
> VM: killing process sendmail
> swap_free: Unused swap offset entry 00004000
                                      ^^^^^^^^
Very much looks like a single bit memory corruption. And an unsupported
SMP configuration with a known-to-be-problematic board too.
I would suspect hardware.


-Andi
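
For readers wondering why 00004000 suggests a single-bit error: the
value has exactly one bit set. A minimal sketch of that check follows
(Python, illustrative only; the function names are made up here). It
shows only that the value is consistent with one flipped bit, under the
assumption that the field should have been zero. It does not prove that
a bit flip is what actually happened.

```python
def is_single_bit(value):
    """True if exactly one bit is set, i.e. value is a power of two."""
    return value != 0 and (value & (value - 1)) == 0


def bit_index(value):
    """Zero-based position of the highest set bit."""
    return value.bit_length() - 1


# Swap entry value from the "swap_free: Unused swap offset entry" line.
ENTRY = 0x00004000

print(is_single_bit(ENTRY))  # True
print(bit_index(ENTRY))      # 14
```

Any power of two passes this test, so the check narrows the possibilities
rather than settling the question of hardware versus software.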

2002-02-10 07:51:45

by Andrew Morton

Subject: Re: 2.4.18-pre9

Pawel Worach wrote:
>
> during a cvs update of the mozilla source tree the machine oops'ed and
> hung hard. system is an abit bp6 with two intel celeron cpus. this seems
> to be swap/cache related; memory has been tested with memtest86 without
> any faults.
>
> ../Pawel
>
> decoded oops:
>
> VM: killing process sendmail

Well that's interesting. You got an out-of-memory kill. Does
that happen often?

For how long did you run memtest86? If it was less than 12 hours,
could you please give it an overnight run and let me know the
results?

Also, could you please send me your .config, and a description
of what sort of things the machine is used for?

Are you using netfilter? If so, was it in use at the time
of the crash? And if so, what netfilter modules were you
using?

Thanks!

2002-02-10 08:00:45

by Pawel Worach

Subject: Re: 2.4.18-pre9

Andi Kleen wrote:
>>.... abit bp6 with two intel celeron cpus....


>>VM: killing process sendmail
>>swap_free: Unused swap offset entry 00004000
>>
>                                     ^^^^^^^^
> Very much looks like a single bit memory corruption. And an unsupported
> SMP configuration with a known-to-be-problematic board too.
> I would suspect hardware.
>

This system has been running linux for about 2 years without any problems
at all, and the hardware configuration has not changed one bit, so I have
a hard time believing this is hardware. I booted back into -pre7 and
everything worked fine.

../Pawel

2002-02-10 11:21:37

by Matthias Andree

Subject: Re: 2.4.18-pre9

On Sun, 10 Feb 2002, Pawel Worach wrote:

> This system has been running linux for about 2 years without any problems
> at all, and the hardware configuration has not changed one bit, so I have
> a hard time believing this is hardware. I booted back into -pre7 and
> everything worked fine.

At Dortmund University, one of the machines I look after had a solid
configuration: a good board (UP) and no-name memory (256 MB PC-100
DIMM). It was rock solid for almost a year, and then all of a sudden,
without even being touched, it started to mess things up: processes
crashed, the machine wound itself up, some characters came out
uppercased, and its file systems got corrupted during e2fsck runs.

Memtest86 quickly showed that some memory line was faulty.
Regrettably, at the time that DIMM was bought, only a 6-month warranty
was obligatory in Germany, so bad luck :-(

Consequence: don't buy memory that doesn't come with an extended
warranty.

So, get memtest86 onto floppy with a different computer and check the
memory of that box that crashed before claiming it's not the hardware.
(Sure, memtest86 will only find some memory problems, but still, it's
useful.)

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin