2001-10-30 21:34:05

by Dave Airlie

[permalink] [raw]
Subject: oops on 2.4.13-pre5 in prune_dcache


I just got this and I'm a few kernels out .. but I thought I'd throw this
way in case it hasn't popped up or been fixed..

I was compiling mopd on a AMD K6-400, 64MB RAM and it oops on the first
gcc.. machine was doing nothing much, no X, 3 sshs to a remote machine...

Mandrake 8.0.

Dave.

ksymoops 2.3.7 on i586 2.4.13-pre5. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.13-pre5/ (default)
-m /boot/System.map-2.4.13-pre5 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Warning (compare_maps): mismatch on symbol sb_be_quiet , sb_lib says c4899904, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sb_lib.o says c4897fa4. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sb_lib.o entry
Warning (compare_maps): mismatch on symbol smw_free , sb_lib says c4899910, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sb_lib.o says c4897fb0. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sb_lib.o entry
Warning (compare_maps): mismatch on symbol audio_devs , sound says c488bc40, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b5e0. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol midi_devs , sound says c488bcb0, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b650. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol mixer_devs , sound says c488bc58, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b5f8. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol num_audiodevs , sound says c488bc54, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b5f4. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol num_midis , sound says c488bcc8, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b668. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol num_mixers , sound says c488bc6c, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b60c. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol num_synths , sound says c488bcac, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b64c. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol synth_devs , sound says c488bc80, /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o says c488b620. Ignoring /lib/modules/2.4.13-pre5/kernel/drivers/sound/sound.o entry
Warning (compare_maps): mismatch on symbol nlmsvc_ops , lockd says c486a470, /lib/modules/2.4.13-pre5/kernel/fs/lockd/lockd.o says c48698dc. Ignoring /lib/modules/2.4.13-pre5/kernel/fs/lockd/lockd.o entry
Warning (compare_maps): mismatch on symbol nfs_debug , sunrpc says c485d24c, /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o says c485cf4c. Ignoring /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nfsd_debug , sunrpc says c485d250, /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o says c485cf50. Ignoring /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol nlm_debug , sunrpc says c485d254, /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o says c485cf54. Ignoring /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol rpc_debug , sunrpc says c485d248, /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o says c485cf48. Ignoring /lib/modules/2.4.13-pre5/kernel/net/sunrpc/sunrpc.o entry
Warning (compare_maps): mismatch on symbol packet_socks_nr , af_packet says c484dc48, /lib/modules/2.4.13-pre5/kernel/net/packet/af_packet.o says c484da24. Ignoring /lib/modules/2.4.13-pre5/kernel/net/packet/af_packet.o entry
Unable to handle kernel NULL pointer dereference at virtual address 00000054
c013e227
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c013e227>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000040 ebx: c23ded38 ecx: c23da670 edx: c23da670
esi: c23ded20 edi: c23da660 ebp: 0000013b esp: c1173f5c
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 4, stackpage=c1173000)
Stack: 00000019 000003d0 00000006 00000380 c013e4fb 00000662 c0128857 00000006
000003d0 00000006 00000020 00000000 000003d0 00002a06 c012888d c01dc55c
00000001 c01dc548 c1172000 c012894d c01dc4a0 00000000 c1173fe0 0008e000
Call Trace: [<c013e4fb>] [<c0128857>] [<c012888d>] [<c012894d>] [<c01289be>]
[<c0128ae7>] [<c0105000>] [<c01054bb>]
Code: 8b 40 14 85 c0 74 09 57 56 ff d0 83 c4 08 eb 09 57 e8 43 1e

>>EIP; c013e227 <prune_dcache+ab/12c> <=====
Trace; c013e4fb <shrink_dcache_memory+1b/34>
Trace; c0128857 <shrink_caches+6f/88>
Trace; c012888d <try_to_free_pages+1d/4c>
Trace; c012894d <kswapd_balance_pgdat+55/b0>
Trace; c01289be <kswapd_balance+16/3c>
Trace; c0128ae7 <kswapd+a3/cc>
Trace; c0105000 <_stext+0/0>
Trace; c01054bb <kernel_thread+23/30>
Code; c013e227 <prune_dcache+ab/12c>
00000000 <_EIP>:
Code; c013e227 <prune_dcache+ab/12c> <=====
0: 8b 40 14 mov 0x14(%eax),%eax <=====
Code; c013e22a <prune_dcache+ae/12c>
3: 85 c0 test %eax,%eax
Code; c013e22c <prune_dcache+b0/12c>
5: 74 09 je 10 <_EIP+0x10> c013e237 <prune_dcache+bb/12c>
Code; c013e22e <prune_dcache+b2/12c>
7: 57 push %edi
Code; c013e22f <prune_dcache+b3/12c>
8: 56 push %esi
Code; c013e230 <prune_dcache+b4/12c>
9: ff d0 call *%eax
Code; c013e232 <prune_dcache+b6/12c>
b: 83 c4 08 add $0x8,%esp
Code; c013e235 <prune_dcache+b9/12c>
e: eb 09 jmp 19 <_EIP+0x19> c013e240 <prune_dcache+c4/12c>
Code; c013e237 <prune_dcache+bb/12c>
10: 57 push %edi
Code; c013e238 <prune_dcache+bc/12c>
11: e8 43 1e 00 00 call 1e59 <_EIP+0x1e59> c0140080 <iput+0/178>


17 warnings issued. Results may not be reliable.


--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / [email protected]
pam_smb / Linux DecStation / Linux VAX / ILUG person



2001-10-31 00:56:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache

In article <Pine.LNX.4.32.0110302132020.14320-100000@skynet> you write:
>Unable to handle kernel NULL pointer dereference at virtual address 00000054
>c013e227
>*pde = 00000000
>Oops: 0000
>CPU: 0
>EIP: 0010:[<c013e227>] Not tainted
>Using defaults from ksymoops -t elf32-i386 -a i386
>EFLAGS: 00010202
>eax: 00000040 ebx: c23ded38 ecx: c23da670 edx: c23da670
>esi: c23ded20 edi: c23da660 ebp: 0000013b esp: c1173f5c
>ds: 0018 es: 0018 ss: 0018
>Process kswapd (pid: 4, stackpage=c1173000)
>Stack: 00000019 000003d0 00000006 00000380 c013e4fb 00000662 c0128857 00000006
> 000003d0 00000006 00000020 00000000 000003d0 00002a06 c012888d c01dc55c
> 00000001 c01dc548 c1172000 c012894d c01dc4a0 00000000 c1173fe0 0008e000
>Call Trace: [<c013e4fb>] [<c0128857>] [<c012888d>] [<c012894d>] [<c01289be>]
> [<c0128ae7>] [<c0105000>] [<c01054bb>]
>Code: 8b 40 14 85 c0 74 09 57 56 ff d0 83 c4 08 eb 09 57 e8 43 1e
>
>>>EIP; c013e227 <prune_dcache+ab/12c> <=====
>Trace; c013e4fb <shrink_dcache_memory+1b/34>
>Trace; c0128857 <shrink_caches+6f/88>
>Trace; c012888d <try_to_free_pages+1d/4c>
>Trace; c012894d <kswapd_balance_pgdat+55/b0>
>Trace; c01289be <kswapd_balance+16/3c>
>Trace; c0128ae7 <kswapd+a3/cc>
>Trace; c0105000 <_stext+0/0>
>Trace; c01054bb <kernel_thread+23/30>
>Code; c013e227 <prune_dcache+ab/12c>
> 0: 8b 40 14 mov 0x14(%eax),%eax <=====
> 3: 85 c0 test %eax,%eax
> 5: 74 09 je 10 <_EIP+0x10> c013e237 <prune_dcache+bb/12c>

This is

dentry->d_op->d_iput

with dentry->d_op (%eax) having the bogus value 0x00000040.

It's almost certainly _supposed_ to be a NULL pointer, but has bit 6
set.

So we do

if (dentry->d_op && dentry->d_op->d_iput)

and because dentry->d_op isn't NULL, we oops on the d_op->d_iput
dereference.

Something is setting a bit in your dentry. Either RAM errors (do you
have ECC memory or a history of SIGSEGV's to give any indication either
way?) or a wild "set_bit()" pointer or similar.

Linus

2001-10-31 01:13:30

by Alexander Viro

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache



On Tue, 30 Oct 2001, Linus Torvalds wrote:

> It's almost certainly _supposed_ to be a NULL pointer, but has bit 6
> set.
>
> So we do
>
> if (dentry->d_op && dentry->d_op->d_iput)
>
> and because dentry->d_op isn't NULL, we oops on the d_op->d_iput
> dereference.
>
> Something is setting a bit in your dentry. Either RAM errors (do you
> have ECC memory or a history of SIGSEGV's to give any indication either
> way?) or a wild "set_bit()" pointer or similar.

... or corrupted pointer in the list, in which case we are at the address
that had never been a dentry.

Frankly, it looks like adding magic to struct dentry and struct inode
might be a good idea. Both icache and dcache lists tend to be long,
so any memory corruption is very likely to show up in the code that
walks them.

2001-10-31 01:19:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache


On Tue, 30 Oct 2001, Alexander Viro wrote:
>
> Frankly, it looks like adding magic to struct dentry and struct inode
> might be a good idea. Both icache and dcache lists tend to be long,
> so any memory corruption is very likely to show up in the code that
> walks them.

I don't like magic numbers. They showed that somebody corrupted something,
but little else. There's nothing sane you can do except warn about magic
mismatch (and ignore it or exit early), and quite frankly, an oops dump
tends to be as useful as the warning.

We used to have magic numbers for tty's and some other things, and they
never helped anybody.

Now, doing _poisoning_ etc of dynamical memory allocation, _that_ is a
magic number use I heartily agree with (if your oops dump has pointers
that look like the poison pattern, that's a useful piece of information)

If you only want to use magic numbers for memory sanity testing, you're
better off doing things like poisoning, and checking that the poison
pattern is there when you re-allocate etc.

Linus

2001-10-31 01:20:31

by Udo A. Steinberg

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache

Linus Torvalds wrote:
>
> Something is setting a bit in your dentry. Either RAM errors (do you
> have ECC memory or a history of SIGSEGV's to give any indication either
> way?) or a wild "set_bit()" pointer or similar.

For what it's worth - I've had a very similar oops ages ago. Back then it
was blamed on bad RAM, but ever since then I've run numerous memtest's
over it without finding anything and never had any problems later either.

See here:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0101.0/0303.html

Call Trace is very similar. Maybe it's just coincidence, but who knows.

Regards,
Udo.

2001-10-31 01:26:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache


On Wed, 31 Oct 2001, Udo A. Steinberg wrote:
>
> For what it's worth - I've had a very similar oops ages ago. Back then it
> was blamed on bad RAM, but ever since then I've run numerous memtest's
> over it without finding anything and never had any problems later either.
>
> See here:
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0101.0/0303.html

Indeed. Same exact thing, different bit.

It could easily be a wild pointer corruption - single-bit errors in RAM
are not entirely uncommon (and as Al points out, they usually end up
showing up in things like the dcache which can have long lists that are
traversed fairly infrequently).

But blaming the thing on bad RAM is not a good strategy if there are many
of these reports. I'd love to see more of a pattern, though, because
without a pattern there is nothing to really start from..

The pattern might be something as simple as "we're both using minixfs"
or whatever.

Linus

2001-10-31 02:43:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache

On Tue, Oct 30, 2001 at 05:23:59PM -0800, Linus Torvalds wrote:
> But blaming the thing on bad RAM is not a good strategy if there are many
> of these reports. I'd love to see more of a pattern, though, because

Dunno why, but usually all bitflips triggers during heavy list walking,
so it's not too much surprisingly. I recall the most frequent bitflips
were happening walking the buffer header lists in 2.2, but I recall
dcache walks also oopsing due bitflips in 2.2.

Andrea

2001-10-31 02:51:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache


On Wed, 31 Oct 2001, Andrea Arcangeli wrote:
>
> Dunno why, but usually all bitflips triggers during heavy list walking,
> so it's not too much surprisingly. I recall the most frequent bitflips
> were happening walking the buffer header lists in 2.2, but I recall
> dcache walks also oopsing due bitflips in 2.2.

I suspect it's just that the list walking is (a) the operation that
touches the most uncached memory and (b) also is inherently the one that,
through pointer following, ends up showing the effects of flipped bits the
most as oopses.

For example, most of the actual _memory_ is obviously in the data caches
or user space pages, but if those are corrupt you'd never see an oops.
You'd see filesystem corruption (and see the reports of changing md5sums
about how this does happen), or you'd see strange SIGSEGV's in user space.

In contrast, a kernel oops is almost always accompanied by pointer
dereferencing, and the thing that dereferences the most pointers is
obviously pointer chasing - ie list walking.

So I don't think that this has anything to do with "certain lists are more
likely to get into trouble", but more of a "certain lists have long chains
of pointers, and as such they are more likely to show up in oopses".

Self-selection, in short - the bane of all statistics gathering.

Linus

2001-10-31 10:37:25

by Dave Airlie

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache

> and because dentry->d_op isn't NULL, we oops on the d_op->d_iput
> dereference.
>
> Something is setting a bit in your dentry. Either RAM errors (do you
> have ECC memory or a history of SIGSEGV's to give any indication either
> way?) or a wild "set_bit()" pointer or similar.
>

well I've never seen any OOPS on this machine before or segv on this
machine and I use it for cross-compiling glibc/gdb/binutils/gcc/kernels
etc for the VAX port of Linux so I'm sure I'd have seen one before if it
was memory... I can run memtest on it at some stage over the next couple
of days but I'm pretty confident the RAM is okay.. granted its not ECC...

I have nothing unusual in my config I don't think,

ext2, nfsd, vfat, sb16, parport, ne2k-pci (x2 cards)...I can post my
.config later on today when I get back home..

Dave.


>

--
David Airlie, Software Engineer
http://www.skynet.ie/~airlied / [email protected]
pam_smb / Linux DecStation / Linux VAX / ILUG person


2001-11-05 20:41:32

by Pavel Machek

[permalink] [raw]
Subject: Re: oops on 2.4.13-pre5 in prune_dcache

Hi!

> We used to have magic numbers for tty's and some other things, and they
> never helped anybody.

They are certainly helping me debugging swsusp.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.