2003-08-02 12:27:39

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (decoded oops for pre8)

Hello Marcelo, hello Andrea,

after some days of running 2.4.22-pre8 I finally got the crash (freeze as
usual). This time the debugging setup worked and I got:


ksymoops 2.4.8 on i686 2.4.22-pre8. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-pre8/ (default)
-m /boot/System.map-2.4.22-pre8 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Unable to handle kernel paging request at virtual address 4129b0fc
c0130084
*pde = 313f6067
Oops: 0002
CPU: 1
EIP: 0010:[<c0130084>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: c2cfdba0 ecx: 00000000 edx: 4129b0fc
esi: d5fb0a24 edi: 0001ca22 ebp: c02eaaa8 esp: c345df30
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c345d000)
Stack: c2cfdba0 d5fb0a24 c2cfdba0 c013924f c2cfdba0 000001d0 00000200 000001d0
00000006 00000020 000001d0 00000020 00000006 c0139493 00000006 00000001
c02eaaa8 000001d0 00000006 c02eaaa8 00000000 c013950e 00000020 c02eaaa8
Call Trace: [<c013924f>] [<c0139493>] [<c013950e>] [<c013961c>] [<c01396a8>]
[<c01397d8>] [<c0139740>] [<c0105000>] [<c010592e>] [<c0139740>]
Code: 89 02 c7 43 24 00 00 00 00 f0 ff 0d 9c a5 37 c0 5a 5b 5e c3


>>EIP; c0130084 <__remove_inode_page+44/60> <=====

>>ebx; c2cfdba0 <_end+2952980/3852ee40>
>>esi; d5fb0a24 <_end+15c05804/3852ee40>
>>ebp; c02eaaa8 <contig_page_data+168/340>
>>esp; c345df30 <_end+30b2d10/3852ee40>

Trace; c013924f <shrink_cache+2df/3b0>
Trace; c0139493 <shrink_caches+63/a0>
Trace; c013950e <try_to_free_pages_zone+3e/60>
Trace; c013961c <kswapd_balance_pgdat+4c/b0>
Trace; c01396a8 <kswapd_balance+28/40>
Trace; c01397d8 <kswapd+98/c0>
Trace; c0139740 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0139740 <kswapd+0/c0>

Code; c0130084 <__remove_inode_page+44/60>
00000000 <_EIP>:
Code; c0130084 <__remove_inode_page+44/60> <=====
0: 89 02 mov %eax,(%edx) <=====
Code; c0130086 <__remove_inode_page+46/60>
2: c7 43 24 00 00 00 00 movl $0x0,0x24(%ebx)
Code; c013008d <__remove_inode_page+4d/60>
9: f0 ff 0d 9c a5 37 c0 lock decl 0xc037a59c
Code; c0130094 <__remove_inode_page+54/60>
10: 5a pop %edx
Code; c0130095 <__remove_inode_page+55/60>
11: 5b pop %ebx
Code; c0130096 <__remove_inode_page+56/60>
12: 5e pop %esi
Code; c0130097 <__remove_inode_page+57/60>
13: c3 ret


1 warning issued. Results may not be reliable.


Hope this helps.
Anything further I can do?

Regards,
Stephan


2003-08-03 07:25:42

by Willy Tarreau

Subject: Re: 2.4.22-pre lockups (decoded oops for pre8)

Hi Stephan,

This is in remove_page_from_hash_queue() at filemap.c:114 :
*pprev = next;

pprev is taken from page->pprev_hash and is considered invalid here (4129b0fc).
Assuming it has been corrupted earlier, it seems that the only files able to
touch this either directly or indirectly are :
- mm/filemap.c (add_page_to_hash_queue, add_to_page_cache*)
- mm/shmem.c (add_to_page_cache_unique)
- mm/swap_state.c (idem)
- fs/ext3/inode.c and fs/buffer.c (find_or_create_page)

So the problem may be narrowed down to a few files. Perhaps digging through
the VM changes since before you had a problem will give you more clues...
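
For readers following the analysis above, a minimal user-space sketch of a pprev-style hash unlink may help. It is not the kernel source: `struct page_sketch`, `hash_add` and `hash_del` are hypothetical names, and only the two hash-chain fields mentioned in this thread are modelled. The mapping of the two stores to the disassembly in the oops is an interpretation, not something confirmed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page; only the two
 * hash-chain fields named in the thread are modelled. */
struct page_sketch {
    struct page_sketch *next_hash;   /* next page in the same bucket */
    struct page_sketch **pprev_hash; /* address of the pointer that points at us */
};

/* Insert a page at the head of a hash bucket. */
static void hash_add(struct page_sketch **bucket, struct page_sketch *page)
{
    struct page_sketch *next = *bucket;

    assert(bucket != NULL && page != NULL);
    *bucket = page;
    page->next_hash = next;
    page->pprev_hash = bucket;
    if (next)
        next->pprev_hash = &page->next_hash;
}

/* Unlink a page from its bucket. The store through pprev is the one
 * that faults if pprev_hash has been overwritten with garbage. */
static void hash_del(struct page_sketch *page)
{
    struct page_sketch *next = page->next_hash;
    struct page_sketch **pprev = page->pprev_hash;

    if (next)
        next->pprev_hash = pprev;
    *pprev = next;            /* presumably "mov %eax,(%edx)" in the oops */
    page->pprev_hash = NULL;  /* presumably "movl $0x0,0x24(%ebx)" */
}
```

If `pprev_hash` were overwritten with an arbitrary value such as 4129b0fc, the marked store in `hash_del` would write through that value, which would be consistent with a kernel write fault (Oops: 0002) at exactly that address.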

Cheers,
Willy

On Sat, Aug 02, 2003 at 02:27:34PM +0200, Stephan von Krawczynski wrote:
> Unable to handle kernel paging request at virtual address 4129b0fc
> c0130084
> *pde = 313f6067
> Oops: 0002
> CPU: 1
> EIP: 0010:[<c0130084>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010246
> eax: 00000000 ebx: c2cfdba0 ecx: 00000000 edx: 4129b0fc
> esi: d5fb0a24 edi: 0001ca22 ebp: c02eaaa8 esp: c345df30
> ds: 0018 es: 0018 ss: 0018
> Process kswapd (pid: 5, stackpage=c345d000)
> Stack: c2cfdba0 d5fb0a24 c2cfdba0 c013924f c2cfdba0 000001d0 00000200 000001d0
> 00000006 00000020 000001d0 00000020 00000006 c0139493 00000006 00000001
> c02eaaa8 000001d0 00000006 c02eaaa8 00000000 c013950e 00000020 c02eaaa8
> Call Trace: [<c013924f>] [<c0139493>] [<c013950e>] [<c013961c>] [<c01396a8>]
> [<c01397d8>] [<c0139740>] [<c0105000>] [<c010592e>] [<c0139740>]
> Code: 89 02 c7 43 24 00 00 00 00 f0 ff 0d 9c a5 37 c0 5a 5b 5e c3
>
>
> >>EIP; c0130084 <__remove_inode_page+44/60> <=====
>
> >>ebx; c2cfdba0 <_end+2952980/3852ee40>
> >>esi; d5fb0a24 <_end+15c05804/3852ee40>
> >>ebp; c02eaaa8 <contig_page_data+168/340>
> >>esp; c345df30 <_end+30b2d10/3852ee40>
>
> Trace; c013924f <shrink_cache+2df/3b0>
> Trace; c0139493 <shrink_caches+63/a0>
> Trace; c013950e <try_to_free_pages_zone+3e/60>
> Trace; c013961c <kswapd_balance_pgdat+4c/b0>
> Trace; c01396a8 <kswapd_balance+28/40>
> Trace; c01397d8 <kswapd+98/c0>
> Trace; c0139740 <kswapd+0/c0>
> Trace; c0105000 <_stext+0/0>
> Trace; c010592e <arch_kernel_thread+2e/40>
> Trace; c0139740 <kswapd+0/c0>
>
> Code; c0130084 <__remove_inode_page+44/60>
> 00000000 <_EIP>:
> Code; c0130084 <__remove_inode_page+44/60> <=====
> 0: 89 02 mov %eax,(%edx) <=====
> Code; c0130086 <__remove_inode_page+46/60>
> 2: c7 43 24 00 00 00 00 movl $0x0,0x24(%ebx)
> Code; c013008d <__remove_inode_page+4d/60>
> 9: f0 ff 0d 9c a5 37 c0 lock decl 0xc037a59c
> Code; c0130094 <__remove_inode_page+54/60>
> 10: 5a pop %edx
> Code; c0130095 <__remove_inode_page+55/60>
> 11: 5b pop %ebx
> Code; c0130096 <__remove_inode_page+56/60>
> 12: 5e pop %esi
> Code; c0130097 <__remove_inode_page+57/60>
> 13: c3 ret
>
>
> 1 warning issued. Results may not be reliable.
>
>
> Hope this helps.
> Anything further I can do?
>
> Regards,
> Stephan

2003-08-03 09:40:57

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (decoded oops for pre8)

On Sun, 3 Aug 2003 09:25:25 +0200
Willy Tarreau wrote:

> Hi Stephan,
>
> This is in remove_page_from_hash_queue() at filemap.c:114 :
> *pprev = next;
>
> pprev is taken from page->pprev_hash and is considered invalid here
> (4129b0fc). Assuming it has been corrupted earlier, it seems that the only
> files able to touch this either directly or indirectly are :
> - mm/filemap.c (add_page_to_hash_queue, add_to_page_cache*)
> - mm/shmem.c (add_to_page_cache_unique)
> - mm/swap_state.c (idem)
> - fs/ext3/inode.c and fs/buffer.c (find_or_create_page)

Ext3 is unlikely to be related, the box never saw ext3. Ext2 is only used on
/boot (so very unlikely, too), everything else is reiserfs.


>
> So the problem may be narrowed down to a few files. Perhaps digging through
> the VM changes since before you had a problem will give you more clues...
>
> Cheers,
> Willy

Thanks for commenting, the problem really is annoying because I _know_ the box
will freeze, only it takes time, this time 4 days...

Regards,
Stephan

2003-08-05 16:39:11

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (decoded oops for pre8)


Stephan,

Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ?

On Sat, 2 Aug 2003, Stephan von Krawczynski wrote:

> Hello Marcelo, hello Andrea,
>
> after some days of running 2.4.22-pre8 I finally got the crash (freeze as
> usual). This time the debugging setup worked and I got:
>
> Unable to handle kernel paging request at virtual address 4129b0fc
> c0130084
> *pde = 313f6067
> Oops: 0002
> CPU: 1
> EIP: 0010:[<c0130084>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010246
> eax: 00000000 ebx: c2cfdba0 ecx: 00000000 edx: 4129b0fc
> esi: d5fb0a24 edi: 0001ca22 ebp: c02eaaa8 esp: c345df30
> ds: 0018 es: 0018 ss: 0018
> Process kswapd (pid: 5, stackpage=c345d000)
> Stack: c2cfdba0 d5fb0a24 c2cfdba0 c013924f c2cfdba0 000001d0 00000200 000001d0
> 00000006 00000020 000001d0 00000020 00000006 c0139493 00000006 00000001
> c02eaaa8 000001d0 00000006 c02eaaa8 00000000 c013950e 00000020 c02eaaa8
> Call Trace: [<c013924f>] [<c0139493>] [<c013950e>] [<c013961c>] [<c01396a8>]
> [<c01397d8>] [<c0139740>] [<c0105000>] [<c010592e>] [<c0139740>]
> Code: 89 02 c7 43 24 00 00 00 00 f0 ff 0d 9c a5 37 c0 5a 5b 5e c3
>
>
> >>EIP; c0130084 <__remove_inode_page+44/60> <=====
>
> >>ebx; c2cfdba0 <_end+2952980/3852ee40>
> >>esi; d5fb0a24 <_end+15c05804/3852ee40>
> >>ebp; c02eaaa8 <contig_page_data+168/340>
> >>esp; c345df30 <_end+30b2d10/3852ee40>
>
> Trace; c013924f <shrink_cache+2df/3b0>
> Trace; c0139493 <shrink_caches+63/a0>
> Trace; c013950e <try_to_free_pages_zone+3e/60>
> Trace; c013961c <kswapd_balance_pgdat+4c/b0>
> Trace; c01396a8 <kswapd_balance+28/40>
> Trace; c01397d8 <kswapd+98/c0>
> Trace; c0139740 <kswapd+0/c0>
> Trace; c0105000 <_stext+0/0>
> Trace; c010592e <arch_kernel_thread+2e/40>
> Trace; c0139740 <kswapd+0/c0>
>
> Code; c0130084 <__remove_inode_page+44/60>
> 00000000 <_EIP>:
> Code; c0130084 <__remove_inode_page+44/60> <=====
> 0: 89 02 mov %eax,(%edx) <=====
> Code; c0130086 <__remove_inode_page+46/60>
> 2: c7 43 24 00 00 00 00 movl $0x0,0x24(%ebx)
> Code; c013008d <__remove_inode_page+4d/60>
> 9: f0 ff 0d 9c a5 37 c0 lock decl 0xc037a59c
> Code; c0130094 <__remove_inode_page+54/60>
> 10: 5a pop %edx
> Code; c0130095 <__remove_inode_page+55/60>
> 11: 5b pop %ebx
> Code; c0130096 <__remove_inode_page+56/60>
> 12: 5e pop %esi
> Code; c0130097 <__remove_inode_page+57/60>
> 13: c3 ret

2003-08-06 02:37:11

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (decoded oops for pre8)

On Tue, 5 Aug 2003 13:40:48 -0300 (BRT)
Marcelo Tosatti wrote:

>
> Stephan,
>
> Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ?

This was from a pre8. There were no strange modules and no vmware involved.
Everything clean, kernel 2.4.22-pre8 on top of SuSE 8.2 distro.
Output was created via serial console.

Regards,
Stephan


>
> On Sat, 2 Aug 2003, Stephan von Krawczynski wrote:
>
> > Hello Marcelo, hello Andrea,
> >
> > after some days of running 2.4.22-pre8 I finally got the crash (freeze as
> > usual). This time the debugging setup worked and I got:
> >
> > Unable to handle kernel paging request at virtual address 4129b0fc
> > c0130084
> > *pde = 313f6067
> > Oops: 0002
> > CPU: 1
> > EIP: 0010:[<c0130084>] Not tainted
> > Using defaults from ksymoops -t elf32-i386 -a i386
> > EFLAGS: 00010246
> > eax: 00000000 ebx: c2cfdba0 ecx: 00000000 edx: 4129b0fc
> > esi: d5fb0a24 edi: 0001ca22 ebp: c02eaaa8 esp: c345df30
> > ds: 0018 es: 0018 ss: 0018
> > Process kswapd (pid: 5, stackpage=c345d000)
> > Stack: c2cfdba0 d5fb0a24 c2cfdba0 c013924f c2cfdba0 000001d0 00000200
> > 000001d0
> > 00000006 00000020 000001d0 00000020 00000006 c0139493 00000006
> > 00000001 c02eaaa8 000001d0 00000006 c02eaaa8 00000000 c013950e
> > 00000020 c02eaaa8
> > Call Trace: [<c013924f>] [<c0139493>] [<c013950e>] [<c013961c>]
> > [<c01396a8>]
> > [<c01397d8>] [<c0139740>] [<c0105000>] [<c010592e>] [<c0139740>]
> > Code: 89 02 c7 43 24 00 00 00 00 f0 ff 0d 9c a5 37 c0 5a 5b 5e c3
> >
> >
> > >>EIP; c0130084 <__remove_inode_page+44/60> <=====
> >
> > >>ebx; c2cfdba0 <_end+2952980/3852ee40>
> > >>esi; d5fb0a24 <_end+15c05804/3852ee40>
> > >>ebp; c02eaaa8 <contig_page_data+168/340>
> > >>esp; c345df30 <_end+30b2d10/3852ee40>
> >
> > Trace; c013924f <shrink_cache+2df/3b0>
> > Trace; c0139493 <shrink_caches+63/a0>
> > Trace; c013950e <try_to_free_pages_zone+3e/60>
> > Trace; c013961c <kswapd_balance_pgdat+4c/b0>
> > Trace; c01396a8 <kswapd_balance+28/40>
> > Trace; c01397d8 <kswapd+98/c0>
> > Trace; c0139740 <kswapd+0/c0>
> > Trace; c0105000 <_stext+0/0>
> > Trace; c010592e <arch_kernel_thread+2e/40>
> > Trace; c0139740 <kswapd+0/c0>
> >
> > Code; c0130084 <__remove_inode_page+44/60>
> > 00000000 <_EIP>:
> > Code; c0130084 <__remove_inode_page+44/60> <=====
> > 0: 89 02 mov %eax,(%edx) <=====
> > Code; c0130086 <__remove_inode_page+46/60>
> > 2: c7 43 24 00 00 00 00 movl $0x0,0x24(%ebx)
> > Code; c013008d <__remove_inode_page+4d/60>
> > 9: f0 ff 0d 9c a5 37 c0 lock decl 0xc037a59c
> > Code; c0130094 <__remove_inode_page+54/60>
> > 10: 5a pop %edx
> > Code; c0130095 <__remove_inode_page+55/60>
> > 11: 5b pop %ebx
> > Code; c0130096 <__remove_inode_page+56/60>
> > 12: 5e pop %esi
> > Code; c0130097 <__remove_inode_page+57/60>
> > 13: c3 ret
>

2003-08-06 07:41:58

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Tue, 5 Aug 2003 13:40:48 -0300 (BRT)
Marcelo Tosatti wrote:

>
> Stephan,
>
> Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ?

Hello Marcelo,

today I have a fresh -pre10 oops for you.

Everything seems to start with (there is no i/o error or the like, is it
possible that the fs got damaged during former crashes?):

sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478481)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478445)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478441)[dev:blocknr]:
bit already cleared
sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478348)[dev:blocknr]:
bit already cleared

And then:

ksymoops 2.4.8 on i686 2.4.22-pre10. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-pre10/ (default)
-m /boot/System.map-2.4.22-pre10 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Unable to handle kernel NULL pointer dereference at virtual address 00000006
c0144b14
*pde = 00000000
Oops: 0002
CPU: 1
EIP: 0010:[<c0144b14>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010246
eax: 00000000 ebx: f0f66540 ecx: f0f66540 edx: 00000006
esi: f0f66540 edi: f0f66540 ebp: c2ce0350 esp: c345df24
ds: 0018 es: 0018 ss: 0018
Process kswapd (pid: 5, stackpage=c345d000)
Stack: c0147ddf f0f66540 00000000 c2ce0350 0001bcad c02eab68 c0139228 c2ce0350
000001d0 00000200 000001d0 00000016 00000020 000001d0 00000020 00000006
c01394b3 00000006 c345c000 c02eab68 000001d0 00000006 c02eab68 00000000
Call Trace: [<c0147ddf>] [<c0139228>] [<c01394b3>] [<c013952e>] [<c013963c>]
[<c01396c8>] [<c01397f8>] [<c0139760>] [<c0105000>] [<c010592e>] [<c0139760>]
Code: 89 02 c7 41 30 00 00 00 00 89 4c 24 04 e9 7a ff ff ff 8d 76


>>EIP; c0144b14 <__remove_from_queues+14/30> <=====

>>ebx; f0f66540 <_end+30bbb320/3852ee40>
>>ecx; f0f66540 <_end+30bbb320/3852ee40>
>>esi; f0f66540 <_end+30bbb320/3852ee40>
>>edi; f0f66540 <_end+30bbb320/3852ee40>
>>ebp; c2ce0350 <_end+2935130/3852ee40>
>>esp; c345df24 <_end+30b2d04/3852ee40>

Trace; c0147ddf <try_to_free_buffers+7f/170>
Trace; c0139228 <shrink_cache+298/3b0>
Trace; c01394b3 <shrink_caches+63/a0>
Trace; c013952e <try_to_free_pages_zone+3e/60>
Trace; c013963c <kswapd_balance_pgdat+4c/b0>
Trace; c01396c8 <kswapd_balance+28/40>
Trace; c01397f8 <kswapd+98/c0>
Trace; c0139760 <kswapd+0/c0>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0139760 <kswapd+0/c0>

Code; c0144b14 <__remove_from_queues+14/30>
00000000 <_EIP>:
Code; c0144b14 <__remove_from_queues+14/30> <=====
0: 89 02 mov %eax,(%edx) <=====
Code; c0144b16 <__remove_from_queues+16/30>
2: c7 41 30 00 00 00 00 movl $0x0,0x30(%ecx)
Code; c0144b1d <__remove_from_queues+1d/30>
9: 89 4c 24 04 mov %ecx,0x4(%esp,1)
Code; c0144b21 <__remove_from_queues+21/30>
d: e9 7a ff ff ff jmp ffffff8c <_EIP+0xffffff8c>
Code; c0144b26 <__remove_from_queues+26/30>
12: 8d 76 00 lea 0x0(%esi),%esi


1 warning issued. Results may not be reliable.

Regards,
Stephan


2003-08-06 08:58:21

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:

> > Is this _STOCK_ 2.4.22-pre10 (no vmware, no other modules) ?
> Hello Marcelo,
> today I have a fresh -pre10 oops for you.
> Everything seems to start with (there is no i/o error or the like, is it
> possible that the fs got damaged during former crashes?):

Well, you'd better run reiserfsck after crashes with binary modules just to make sure everything is ok.

> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478481)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478445)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478441)[dev:blocknr]:
> bit already cleared
> sd(8,17):vs-4080: reiserfs_free_block: free_block (0811:14478348)[dev:blocknr]:
> bit already cleared

Bye,
Oleg

2003-08-06 09:09:35

by Willy Tarreau

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:

> Code; c0144b14 <__remove_from_queues+14/30>
> 00000000 <_EIP>:
> Code; c0144b14 <__remove_from_queues+14/30> <=====
> 0: 89 02 mov %eax,(%edx) <=====
> Code; c0144b16 <__remove_from_queues+16/30>
> 2: c7 41 30 00 00 00 00 movl $0x0,0x30(%ecx)
> Code; c0144b1d <__remove_from_queues+1d/30>
> 9: 89 4c 24 04 mov %ecx,0x4(%esp,1)
> Code; c0144b21 <__remove_from_queues+21/30>
> d: e9 7a ff ff ff jmp ffffff8c <_EIP+0xffffff8c>
> Code; c0144b26 <__remove_from_queues+26/30>
> 12: 8d 76 00 lea 0x0(%esi),%esi

Once again, it's *pprev=next which is causing trouble, with pprev=6 this
time (fs/buffer.c:523). There really seems to be something playing badly with
this...

I find it amazing that such widely used portions of code only trigger panics on
your system! Either it's a rare combination of several components/drivers, or
a strange hardware problem, although I can't imagine which (cpu? bus locking?).

Cheers,
Willy

2003-08-06 09:37:10

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, 6 Aug 2003 11:09:20 +0200
Willy Tarreau wrote:

> On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:
>
> > Code; c0144b14 <__remove_from_queues+14/30>
> > 00000000 <_EIP>:
> > Code; c0144b14 <__remove_from_queues+14/30> <=====
> > 0: 89 02 mov %eax,(%edx) <=====
> > Code; c0144b16 <__remove_from_queues+16/30>
> > 2: c7 41 30 00 00 00 00 movl $0x0,0x30(%ecx)
> > Code; c0144b1d <__remove_from_queues+1d/30>
> > 9: 89 4c 24 04 mov %ecx,0x4(%esp,1)
> > Code; c0144b21 <__remove_from_queues+21/30>
> > d: e9 7a ff ff ff jmp ffffff8c <_EIP+0xffffff8c>
> > Code; c0144b26 <__remove_from_queues+26/30>
> > 12: 8d 76 00 lea 0x0(%esi),%esi
>
> Once again, it's *pprev=next which is causing trouble, with pprev=6 this
> time (fs/buffer.c:523). There really seems to be something playing badly with
> this...
>
> I find it amazing that such widely used portions of code only trigger panics on
> your system! Either it's a rare combination of several components/drivers,
> or a strange hardware problem, although I can't imagine which (cpu? bus
> locking?).

Hm, the hardware may not be that widespread. I guess not many people are really
using SMP, 64 bit PCI network, 3 GB RAM, 3ware RAID5 and a serverworks board
all together in one box. I can't fight the impression it has something to do with
locking issues. It doesn't look exactly like a hardware problem; you would not
expect crashes in the same type of code then.
The question is: what additional information is needed to find the underlying
problem?

Regards,
Stephan

2003-08-06 12:45:57

by Willy Tarreau

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

> Hm, the hardware may not be that widespread. I guess not many people are really
> using SMP, 64 bit PCI network, 3 GB RAM, 3ware RAID5 and serverworks board
> altogether in one box. I can't fight the impression it has something to do with
> locking issues. It doesn't look exactly like a hardware problem, you would not
> expect crashes on the same type of code then.

Well, it depends... I once had an overclocked CPU which died only in one
case, it was a car simulator, and it always crashed exactly on the same race,
at the same position in the round ! I even knew that if I could pass that
position, it was ok for another round ! So I later used that game as a
reliability test when I was not sure about the origin of a crash :-)
It seems that a particular sequence of data and/or code could reliably trigger
it, although parallel makes never hurt it.

> The question is: what additional information is needed to find the underlying
> problem?

Perhaps cache poisoning could help. Alan has already used this technique
extensively in the past, and might still have a patch which could apply to your
kernel without too many changes. Alan ?

On the other hand, you could also do it by hand, but it's a little hard. You
have to pick every place there's a free, and write particular data before the
free, if possible, data which can identify who has freed the page.

Then after the next crash, you can identify who used the page last. It can
sometimes lead you to some driver missing a lock. But that's not certain.
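
The by-hand approach described above could be sketched roughly as follows. This is a user-space illustration only, not a kernel patch: the page size, the `free_site` tags, the 0xDEAD prefix and the function names are all assumptions made up for the example. The idea is to fill a page with a recognisable word that encodes the freeing call site just before it is freed, so that a later oops showing that word in a register or on the stack hints at who released the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size */

/* Hypothetical tags, one per place in the code that frees a page. */
enum free_site {
    SITE_FILEMAP = 0xA1,
    SITE_BUFFER  = 0xB2,
    SITE_DRIVER  = 0xC3
};

/* Poison word: fixed marker in the high half, call-site tag in the low half. */
static uint32_t poison_word(enum free_site site)
{
    return 0xDEAD0000u | (uint32_t)site;
}

/* Fill the whole page with the poison word; call this right before the free. */
static void poison_before_free(void *page, enum free_site site)
{
    uint32_t word = poison_word(site);
    unsigned char *p = page;
    size_t i;

    assert(page != NULL);
    for (i = 0; i + sizeof word <= SKETCH_PAGE_SIZE; i += sizeof word)
        memcpy(p + i, &word, sizeof word);
}

/* Given a suspicious value from an oops, return the call-site tag if the
 * value looks like our poison, or -1 if it does not. */
static int poison_site(uint32_t value)
{
    if ((value & 0xFFFF0000u) != 0xDEAD0000u)
        return -1;
    return (int)(value & 0xFFFFu);
}
```

With something like this in place, a bad pointer whose value decodes to a known tag would point at a use-after-free of a page released at that site. Values such as 4129b0fc or 00000006 from the oopses in this thread would not decode, since no such poisoning was active when they were captured.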

Cheers,
Willy

2003-08-06 18:12:54

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Wed, 6 Aug 2003, Stephan von Krawczynski wrote:

> Unable to handle kernel NULL pointer dereference at virtual address 00000006
> c0144b14
> *pde = 00000000
> Oops: 0002
> CPU: 1
> EIP: 0010:[<c0144b14>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010246
> eax: 00000000 ebx: f0f66540 ecx: f0f66540 edx: 00000006
> esi: f0f66540 edi: f0f66540 ebp: c2ce0350 esp: c345df24
> ds: 0018 es: 0018 ss: 0018
> Process kswapd (pid: 5, stackpage=c345d000)
> Stack: c0147ddf f0f66540 00000000 c2ce0350 0001bcad c02eab68 c0139228 c2ce0350
> 000001d0 00000200 000001d0 00000016 00000020 000001d0 00000020 00000006
> c01394b3 00000006 c345c000 c02eab68 000001d0 00000006 c02eab68 00000000
> Call Trace: [<c0147ddf>] [<c0139228>] [<c01394b3>] [<c013952e>] [<c013963c>]
> [<c01396c8>] [<c01397f8>] [<c0139760>] [<c0105000>] [<c010592e>] [<c0139760>]
> Code: 89 02 c7 41 30 00 00 00 00 89 4c 24 04 e9 7a ff ff ff 8d 76
>
>
> >>EIP; c0144b14 <__remove_from_queues+14/30> <=====
>
> >>ebx; f0f66540 <_end+30bbb320/3852ee40>
> >>ecx; f0f66540 <_end+30bbb320/3852ee40>
> >>esi; f0f66540 <_end+30bbb320/3852ee40>
> >>edi; f0f66540 <_end+30bbb320/3852ee40>
> >>ebp; c2ce0350 <_end+2935130/3852ee40>
> >>esp; c345df24 <_end+30b2d04/3852ee40>

Stephan,

I'm pretty worried about this problem.

Your oopses seem to be the result of some kind of memory corruption. On
the other oopses we could see the kernel oopsing on
remove_page_from_hash_queue due to corrupted pointers (as Willy pointed
out).

Can you please try to crash your box again with

CONFIG_DEBUG_SLAB=y

Again, thanks a lot for your reports.

2003-08-07 02:14:45

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, 6 Aug 2003 15:15:39 -0300 (BRT)
Marcelo Tosatti wrote:

> Stephan,
>
> I'm pretty worried about this problem.
>
> Your oopses seem to be the result of some kind of memory corruption. On
> the other oopses we could see the kernel oopsing on
> remove_page_from_hash_queue due to corrupted pointers (as Willy pointed
> out).
>
> Can you please try to crash your box again with
>
> CONFIG_DEBUG_SLAB=y
>
> Again, thanks a lot for your reports.

Ok, I have two things.
First, another oops. I upgraded the system to rc1 yesterday and it did not
survive a single day. Here's the decoded oops, the box was "clean" meaning no
weird modules or the like:


ksymoops 2.4.8 on i686 2.4.22-rc1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-rc1/ (default)
-m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Unable to handle kernel NULL pointer dereference at virtual address 00000004
c0145060
*pde = 00000000
Oops: 0002
CPU: 1
EIP: 0010:[<c0145060>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010283
eax: 00000000 ebx: c822feb4 ecx: c822fe60 edx: e07e7780
esi: 00000000 edi: e07e7780 ebp: f59bfe3c esp: f59bfe2c
ds: 0018 es: 0018 ss: 0018
Process nfsd (pid: 1737, stackpage=f59bf000)
Stack: f0cce7a0 00000001 f59bfe38 c822fe60 f0cce7f4 eec54ef4 00000000 e07e7760
f59be000 f59bfea8 c0183ef5 e07e7780 e07e77cc c02ed880 e07e7760 f8c84fc8
f59bfea8 dfe6c960 00000000 e07e7760 dfe6c960 00000000 f59c6e04 f59bfea8
Call Trace: [<c0183ef5>] [<f8c84fc8>] [<f8c856f1>] [<f8c8cee4>] [<f8c8e295>]
[<f8c923f4>] [<f8c80699>] [<f8c65938>] [<f8c923f4>] [<f8c91a38>] [<f8c91a58>]
[<f8c80411>] [<c010592e>] [<f8c80210>]
Code: 89 50 04 c7 41 54 00 00 00 00 c7 43 04 00 00 00 00 8b 44 24


>>EIP; c0145060 <fsync_buffers_list+50/1b0> <=====

>>ebx; c822feb4 <_end+7e84c94/3852ee40>
>>ecx; c822fe60 <_end+7e84c40/3852ee40>
>>edx; e07e7780 <_end+2043c560/3852ee40>
>>edi; e07e7780 <_end+2043c560/3852ee40>
>>ebp; f59bfe3c <_end+35614c1c/3852ee40>
>>esp; f59bfe2c <_end+35614c0c/3852ee40>

Trace; c0183ef5 <reiserfs_sync_file+65/d0>
Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
Trace; f8c856f1 <[nfsd]nfsd_commit+a1/b0>
Trace; f8c8cee4 <[nfsd]nfsd3_proc_commit+94/130>
Trace; f8c8e295 <[nfsd]nfs3svc_decode_commitargs+35/e0>
Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
Trace; f8c80699 <[nfsd]nfsd_dispatch+119/21d>
Trace; f8c65938 <[sunrpc]svc_process+4d8/570>
Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
Trace; f8c91a38 <[nfsd]nfsd_version3+0/10>
Trace; f8c91a58 <[nfsd]nfsd_program+0/28>
Trace; f8c80411 <[nfsd]nfsd+201/370>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; f8c80210 <[nfsd]nfsd+0/370>

Code; c0145060 <fsync_buffers_list+50/1b0>
00000000 <_EIP>:
Code; c0145060 <fsync_buffers_list+50/1b0> <=====
0: 89 50 04 mov %edx,0x4(%eax) <=====
Code; c0145063 <fsync_buffers_list+53/1b0>
3: c7 41 54 00 00 00 00 movl $0x0,0x54(%ecx)
Code; c014506a <fsync_buffers_list+5a/1b0>
a: c7 43 04 00 00 00 00 movl $0x0,0x4(%ebx)
Code; c0145071 <fsync_buffers_list+61/1b0>
11: 8b 44 24 00 mov 0x0(%esp,1),%eax


1 warning issued. Results may not be reliable.


As you can see reiserfs seems involved. Regarding reiserfs and my last postings
I can assure you that all reiserfs partitions were checked via reiserfsck right
before installation of rc1 - as Oleg advised - and found:
"Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
I was told to use --fix-fixable option which I did and it indeed fixed the
problem. Trying reiserfsck after that found no errors any more. So I see no
chance that corrupt data on the media (through former crashes) is responsible
for this one. Hint: spelling in reiserfsck should be checked ;-)

Second, I am re-installing the box with CONFIG_DEBUG_SLAB=y right now. Please tell
me if I should perform special steps (SYSRQ or the like) after the next crash
happens, or if the decoded oops will be sufficient.

Regards,
Stephan

2003-08-07 05:35:49

by Oleg Drokin

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

Hello!

On Thu, Aug 07, 2003 at 04:14:40AM +0200, Stephan von Krawczynski wrote:

> Unable to handle kernel NULL pointer dereference at virtual address 00000004

Hm, a NULL pointer in the j_dirty_buffers list. This cannot happen, basically.
This is a cyclically linked list of buffers. And we add stuff to it via standard
functions, so the linkage happens by itself.
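
To illustrate why a NULL link should be impossible in such a list, here is a small user-space sketch of a circular doubly linked list. The names (`list_node`, `list_add_head`, `list_unlink`, `list_is_consistent`) are hypothetical and this is not the kernel's list implementation. Reading the faulting store as the first store of an unlink whose `next` pointer is NULL is an assumption on my part, based only on the address 00000004 and eax being zero in the oops.

```c
#include <assert.h>
#include <stddef.h>

/* Two-pointer node; on 32-bit i386 the prev field sits at offset 4. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* An empty circular list points at itself in both directions. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head; both neighbours are always non-NULL. */
static void list_add_head(struct list_node *head, struct list_node *node)
{
    assert(head != NULL && node != NULL);
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Unlink node. If node->next were NULL, the first store would write to
 * address 0 + offsetof(struct list_node, prev). */
static void list_unlink(struct list_node *node)
{
    node->next->prev = node->prev;
    node->prev->next = node->next;
}

/* Walk the ring once and check that every link is non-NULL and mirrored. */
static int list_is_consistent(const struct list_node *head)
{
    const struct list_node *n = head;

    do {
        if (n->next == NULL || n->next->prev != n)
            return 0;
        n = n->next;
    } while (n != head);
    return 1;
}
```

As long as nodes are only ever added and removed through functions like these, `list_is_consistent` holds after every operation, so a NULL `next` would have to come from something outside the list code writing into the node.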

> Trace; c0183ef5 <reiserfs_sync_file+65/d0>
> Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
> Code; c0145060 <fsync_buffers_list+50/1b0>
> 00000000 <_EIP>:
> Code; c0145060 <fsync_buffers_list+50/1b0> <=====
> 0: 89 50 04 mov %edx,0x4(%eax) <=====

> As you can see reiserfs seems involved. Regarding reiserfs and my last postings
> I can assure you that all reiserfs partitions were checked via reiserfsck right
> before installation of rc1 - as Oleg advised - and found:
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"

That might explain your prior "freeing already free block" messages.

> I was told to use --fix-fixable option which I did and it indeed fixed the
> problem. Trying reiserfsck after that found no errors any more. So I see no
> chance that corrupt data on the media (through former crashes) is responsible
> for this one. Hint: spelling in reiserfsck should be checked ;-)

Yes, but how the condition that triggered the oops came about is totally unclear to me.

Bye,
Oleg

2003-08-07 12:42:52

by Marcelo Tosatti

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)



On Thu, 7 Aug 2003, Stephan von Krawczynski wrote:

> On Wed, 6 Aug 2003 15:15:39 -0300 (BRT)
> Marcelo Tosatti wrote:
>
> > Stephan,
> >
> > I'm pretty worried about this problem.
> >
> > Your oopses seem to be the result of some kind of memory corruption. On
> > the other oopses we could see the kernel oopsing on
> > remove_page_from_hash_queue due to corrupted pointers (as Willy pointed
> > out).
> >
> > Can you please try to crash your box again with
> >
> > CONFIG_DEBUG_SLAB=y
> >
> > Again, thanks a lot for your reports.
>
> Ok, I have two things.
> First, another oops. I upgraded the system to rc1 yesterday and it did not
> survive a single day. Here's the decoded oops, the box was "clean" meaning no
> weird modules or the like:
>
>
> ksymoops 2.4.8 on i686 2.4.22-rc1. Options used
> -V (default)
> -k /proc/ksyms (default)
> -l /proc/modules (default)
> -o /lib/modules/2.4.22-rc1/ (default)
> -m /boot/System.map-2.4.22-rc1 (default)
>
> Warning: You did not tell me where to find symbol information. I will
> assume that the log matches the kernel and modules that are running
> right now and I'll use the default options above for symbol resolution.
> If the current kernel and/or modules do not match the log, you can get
> more accurate output by telling me the kernel version and where to find
> map, modules, ksyms etc. ksymoops -h explains the options.
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000004
> c0145060
> *pde = 00000000
> Oops: 0002
> CPU: 1
> EIP: 0010:[<c0145060>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010283
> eax: 00000000 ebx: c822feb4 ecx: c822fe60 edx: e07e7780
> esi: 00000000 edi: e07e7780 ebp: f59bfe3c esp: f59bfe2c
> ds: 0018 es: 0018 ss: 0018
> Process nfsd (pid: 1737, stackpage=f59bf000)
> Stack: f0cce7a0 00000001 f59bfe38 c822fe60 f0cce7f4 eec54ef4 00000000 e07e7760
> f59be000 f59bfea8 c0183ef5 e07e7780 e07e77cc c02ed880 e07e7760 f8c84fc8
> f59bfea8 dfe6c960 00000000 e07e7760 dfe6c960 00000000 f59c6e04 f59bfea8
> Call Trace: [<c0183ef5>] [<f8c84fc8>] [<f8c856f1>] [<f8c8cee4>] [<f8c8e295>]
> [<f8c923f4>] [<f8c80699>] [<f8c65938>] [<f8c923f4>] [<f8c91a38>] [<f8c91a58>]
> [<f8c80411>] [<c010592e>] [<f8c80210>]
> Code: 89 50 04 c7 41 54 00 00 00 00 c7 43 04 00 00 00 00 8b 44 24
>
>
> >>EIP; c0145060 <fsync_buffers_list+50/1b0> <=====
>
> >>ebx; c822feb4 <_end+7e84c94/3852ee40>
> >>ecx; c822fe60 <_end+7e84c40/3852ee40>
> >>edx; e07e7780 <_end+2043c560/3852ee40>
> >>edi; e07e7780 <_end+2043c560/3852ee40>
> >>ebp; f59bfe3c <_end+35614c1c/3852ee40>
> >>esp; f59bfe2c <_end+35614c0c/3852ee40>
>
> Trace; c0183ef5 <reiserfs_sync_file+65/d0>
> Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
> Trace; f8c856f1 <[nfsd]nfsd_commit+a1/b0>
> Trace; f8c8cee4 <[nfsd]nfsd3_proc_commit+94/130>
> Trace; f8c8e295 <[nfsd]nfs3svc_decode_commitargs+35/e0>
> Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
> Trace; f8c80699 <[nfsd]nfsd_dispatch+119/21d>
> Trace; f8c65938 <[sunrpc]svc_process+4d8/570>
> Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
> Trace; f8c91a38 <[nfsd]nfsd_version3+0/10>
> Trace; f8c91a58 <[nfsd]nfsd_program+0/28>
> Trace; f8c80411 <[nfsd]nfsd+201/370>
> Trace; c010592e <arch_kernel_thread+2e/40>
> Trace; f8c80210 <[nfsd]nfsd+0/370>
>
> Code; c0145060 <fsync_buffers_list+50/1b0>
> 00000000 <_EIP>:
> Code; c0145060 <fsync_buffers_list+50/1b0> <=====
> 0: 89 50 04 mov %edx,0x4(%eax) <=====
> Code; c0145063 <fsync_buffers_list+53/1b0>
> 3: c7 41 54 00 00 00 00 movl $0x0,0x54(%ecx)
> Code; c014506a <fsync_buffers_list+5a/1b0>
> a: c7 43 04 00 00 00 00 movl $0x0,0x4(%ebx)
> Code; c0145071 <fsync_buffers_list+61/1b0>
> 11: 8b 44 24 00 mov 0x0(%esp,1),%eax
>
>
> 1 warning issued. Results may not be reliable.
>
>
> As you can see reiserfs seems involved. Regarding reiserfs and my last postings
> I can assure you that all reiserfs partitions were checked via reiserfsck right
> before installation of rc1 - as Oleg advised - and found:
> "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
> I was told to use --fix-fixable option which I did and it indeed fixed the
> problem. Trying reiserfsck after that found no errors any more. So I see no
> chance that corrupt data on the media (through former crashes) is responsible
> for this one. Hint: spelling in reiserfsck should be checked ;-)

It might be a problem in reiserfs. You're getting oopses in different
places with different stack traces, which is weird.

I'll take a closer look at this oops now.

> Second, I am re-installing the box with CONFIG_DEBUG_SLAB=y right now. Please
> tell me if I should perform special steps (SysRq or the like) after the next
> crash happens, or if the decoded oops will be sufficient.

The decoded oops should be sufficient.

2003-08-07 15:54:55

by Stephan von Krawczynski

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Thu, 7 Aug 2003 09:45:36 -0300 (BRT)
Marcelo Tosatti wrote:

> The decoded oops should be sufficient.

Well, how about this one:


ksymoops 2.4.8 on i686 2.4.22-rc1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-rc1/ (default)
-m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

Unable to handle kernel paging request at virtual address 63eabdb3
c0145f31
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c0145f31>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: 00000000 ebx: 00000000 ecx: 00000061 edx: 63eabd93
esi: 00000000 edi: 00001000 ebp: 00000000 esp: c34f7e60
ds: 0018 es: 0018 ss: 0018
Process kupdated (pid: 7, stackpage=c34f7000)
Stack: 00000000 f7afb1f0 c0146018 00000000 c01312e9 00000000 c1849dd0 00001000
00001000 00000803 c014823a c1849dd0 00001000 00000000 f79b7fa4 00001e18
c0148428 f79b7fa4 00001e18 00001000 e9640000 00000000 00000803 00001000
Call Trace: [<c0146018>] [<c01312e9>] [<c014823a>] [<c0148428>] [<c0145b36>]
[<c0197328>] [<c019ceb9>] [<c019c4f5>] [<c0188e94>] [<c01498cb>] [<c014887c>]
[<c0148be9>] [<c0105000>] [<c010592e>] [<c0148af0>]
Code: 8b 42 20 a3 30 c6 37 c0 8d 41 ff a3 34 c6 37 c0 c6 05 c0 bb


>>EIP; c0145f31 <get_unused_buffer_head+21/b0> <=====

>>esp; c34f7e60 <_end+314cc40/3852ee40>

Trace; c0146018 <create_buffers+28/100>
Trace; c01312e9 <find_or_create_page+109/110>
Trace; c014823a <grow_dev_page+7a/c0>
Trace; c0148428 <grow_buffers+98/110>
Trace; c0145b36 <getblk+46/80>
Trace; c0197328 <journal_getblk+28/30>
Trace; c019ceb9 <do_journal_end+139/bb0>
Trace; c019c4f5 <flush_old_commits+135/1d0>
Trace; c0188e94 <reiserfs_write_super+64/90>
Trace; c01498cb <sync_supers+14b/170>
Trace; c014887c <sync_old_buffers+3c/b0>
Trace; c0148be9 <kupdate+f9/130>
Trace; c0105000 <_stext+0/0>
Trace; c010592e <arch_kernel_thread+2e/40>
Trace; c0148af0 <kupdate+0/130>

Code; c0145f31 <get_unused_buffer_head+21/b0>
00000000 <_EIP>:
Code; c0145f31 <get_unused_buffer_head+21/b0> <=====
0: 8b 42 20 mov 0x20(%edx),%eax <=====
Code; c0145f34 <get_unused_buffer_head+24/b0>
3: a3 30 c6 37 c0 mov %eax,0xc037c630
Code; c0145f39 <get_unused_buffer_head+29/b0>
8: 8d 41 ff lea 0xffffffff(%ecx),%eax
Code; c0145f3c <get_unused_buffer_head+2c/b0>
b: a3 34 c6 37 c0 mov %eax,0xc037c634
Code; c0145f41 <get_unused_buffer_head+31/b0>
10: c6 05 c0 bb 00 00 00 movb $0x0,0xbbc0


1 warning issued. Results may not be reliable.
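
For readers decoding this by hand, the faulting address and the "Oops:" value can be cross-checked against the register dump. The sketch below is only an illustration: the helper names are made up, and the bit meanings are those of the i386 page-fault error code (bit 0 protection vs. not present, bit 1 write, bit 2 user mode).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Bits of the i386 page-fault error code printed after "Oops:". */
#define PF_PROT  0x1u /* clear: page not present, set: protection violation */
#define PF_WRITE 0x2u /* clear: read access, set: write access */
#define PF_USER  0x4u /* clear: fault in kernel mode, set: user mode */

/* Hypothetical helper: print the error code in words. */
static void print_oops_code(uint32_t code)
{
    printf("Oops %04x: %s, %s, %s\n", (unsigned)code,
           (code & PF_PROT)  ? "protection violation" : "page not present",
           (code & PF_WRITE) ? "write" : "read",
           (code & PF_USER)  ? "user mode" : "kernel mode");
}

/* Address touched by a base+displacement operand such as 0x20(%edx). */
static uint32_t effective_address(uint32_t base, uint32_t disp)
{
    return base + disp;
}

/* Hypothetical self-check against the kupdated register dump above. */
static void check_kupdated_oops(void)
{
    /* mov 0x20(%edx),%eax with edx = 63eabd93 */
    assert(effective_address(0x63eabd93u, 0x20u) == 0x63eabdb3u);
    /* "Oops: 0000": a read, in kernel mode, of a page that is not present */
    assert((0x0000u & (PF_PROT | PF_WRITE | PF_USER)) == 0);
}
```

So the load through edx (63eabd93, which does not look like a kernel pointer) plus the 0x20 displacement gives exactly the reported address 63eabdb3. The same arithmetic fits the earlier nfsd oops: eax = 0 plus displacement 4 gives address 00000004, and its code 0002 has the write bit set, matching the store instruction.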


After that I received this one:


ksymoops 2.4.8 on i686 2.4.22-rc1. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.22-rc1/ (default)
-m /boot/System.map-2.4.22-rc1 (default)

Warning: You did not tell me where to find symbol information. I will
assume that the log matches the kernel and modules that are running
right now and I'll use the default options above for symbol resolution.
If the current kernel and/or modules do not match the log, you can get
more accurate output by telling me the kernel version and where to find
map, modules, ksyms etc. ksymoops -h explains the options.

NMI Watchdog detected LOCKUP on CPU1, eip c011a747, registers:
CPU: 1
EIP: 0010:[<c011a747>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000082
eax: cef0b8dc ebx: cef0b894 ecx: 00000001 edx: 00000003
esi: 00000008 edi: cef0b8dc ebp: ec8efe48 esp: ec8efe28
ds: 0018 es: 0018 ss: 0018
Process tar (pid: 13603, stackpage=ec8ef000)
Stack: 00000000 cef0b894 00000000 00000282 00000003 cef0b894 00000008 cef0b8dc
00000000 c01c4f41 00000000 cef0b894 00000000 0001679d cef0b894 00001000
c0146c87 00000000 cef0b894 cef0b894 00000004 cef0b894 ec8ee000 00000001
Call Trace: [<c01c4f41>] [<c0146c87>] [<c013ae92>] [<c0119630>] [<c0130d7e>]
[<c017ff50>] [<c013146f>] [<c0131751>] [<c0131d50>] [<c0131ffc>] [<c0131d50>]
[<c014328b>] [<c010782f>]
Code: 7e f9 e9 d9 ec ff ff 80 38 00 f3 90 7e f9 e9 5d ed ff ff 80


>>EIP; c011a747 <.text.lock.sched+3f/178> <=====

>>eax; cef0b8dc <_end+eb606bc/3852ee40>
>>ebx; cef0b894 <_end+eb60674/3852ee40>
>>edi; cef0b8dc <_end+eb606bc/3852ee40>
>>ebp; ec8efe48 <_end+2c544c28/3852ee40>
>>esp; ec8efe28 <_end+2c544c08/3852ee40>

Trace; c01c4f41 <submit_bh+a1/c0>
Trace; c0146c87 <block_read_full_page+2d7/2f0>
Trace; c013ae92 <__alloc_pages+42/190>
Trace; c0119630 <wait_for_completion+70/b0>
Trace; c0130d7e <page_cache_read+be/e0>
Trace; c017ff50 <reiserfs_get_block+0/1490>
Trace; c013146f <generic_file_readahead+af/1a0>
Trace; c0131751 <do_generic_file_read+1c1/470>
Trace; c0131d50 <file_read_actor+0/110>
Trace; c0131ffc <generic_file_read+19c/1b0>
Trace; c0131d50 <file_read_actor+0/110>
Trace; c014328b <sys_read+9b/180>
Trace; c010782f <system_call+33/38>

Code; c011a747 <.text.lock.sched+3f/178>
00000000 <_EIP>:
Code; c011a747 <.text.lock.sched+3f/178> <=====
0: 7e f9 jle fffffffb <_EIP+0xfffffffb> <=====
Code; c011a749 <.text.lock.sched+41/178>
2: e9 d9 ec ff ff jmp ffffece0 <_EIP+0xffffece0>
Code; c011a74e <.text.lock.sched+46/178>
7: 80 38 00 cmpb $0x0,(%eax)
Code; c011a751 <.text.lock.sched+49/178>
a: f3 90 repz nop
Code; c011a753 <.text.lock.sched+4b/178>
c: 7e f9 jle 7 <_EIP+0x7>
Code; c011a755 <.text.lock.sched+4d/178>
e: e9 5d ed ff ff jmp ffffed70 <_EIP+0xffffed70>
Code; c011a75a <.text.lock.sched+52/178>
13: 80 00 00 addb $0x0,(%eax)


1 warning issued. Results may not be reliable.
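
A note on reading this one: the code around EIP (cmpb $0x0,(%eax); repz nop; jle) appears to be the out-of-line spinlock wait loop, and EFLAGS 00000082 has the interrupt flag clear, so CPU1 was apparently spinning on a lock with interrupts off until the NMI watchdog fired. The following is a rough userspace sketch of that kind of lock using C11 atomics. The type and function names are hypothetical, and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 1 means free, 0 or negative means held. */
typedef struct {
    atomic_schar slock;
} sketch_spinlock;

static void sketch_spin_lock(sketch_spinlock *l)
{
    for (;;) {
        /* Fast path (like "lock; decb"): 1 -> 0 means we own the lock. */
        if (atomic_fetch_sub(&l->slock, 1) == 1)
            return;
        /* Slow path: poll until the lock looks free, then retry. */
        while (atomic_load(&l->slock) <= 0)
            ; /* a CPU "pause" hint (rep; nop) would go here */
    }
}

static void sketch_spin_unlock(sketch_spinlock *l)
{
    assert(atomic_load(&l->slock) <= 0); /* must currently be held */
    atomic_store(&l->slock, 1);
}
```

If the holder never releases the lock, or the lock byte itself has been overwritten, the polling loop never exits, which would look just like the lockup reported here.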


There were no I/O errors or any other spectacular things happening. It just
died while I was sitting right next to it during the verify run of tar.

Regards,
Stephan

2003-08-18 17:34:06

by Andrea Arcangeli

Subject: Re: 2.4.22-pre lockups (now decoded oops for pre10)

On Wed, Aug 06, 2003 at 11:09:20AM +0200, Willy Tarreau wrote:
> On Wed, Aug 06, 2003 at 09:41:50AM +0200, Stephan von Krawczynski wrote:
>
> > Code; c0144b14 <__remove_from_queues+14/30>
> > 00000000 <_EIP>:
> > Code; c0144b14 <__remove_from_queues+14/30> <=====
> > 0: 89 02 mov %eax,(%edx) <=====
> > Code; c0144b16 <__remove_from_queues+16/30>
> > 2: c7 41 30 00 00 00 00 movl $0x0,0x30(%ecx)
> > Code; c0144b1d <__remove_from_queues+1d/30>
> > 9: 89 4c 24 04 mov %ecx,0x4(%esp,1)
> > Code; c0144b21 <__remove_from_queues+21/30>
> > d: e9 7a ff ff ff jmp ffffff8c <_EIP+0xffffff8c>
> > Code; c0144b26 <__remove_from_queues+26/30>
> > 12: 8d 76 00 lea 0x0(%esi),%esi
>
> once again, it's *pprev=next which is causing trouble, with pprev=6 this
> time (fs/buffer.c:523). There really seems to be something playing badly with
> this...
>
> I find it amazing that such widely used portions of code only trigger panics on
> your system! Either it's a rare combination of several components/drivers, or
> a strange hardware problem, although I can't imagine which (cpu? bus locking?).

Normally it's bad RAM (or at any rate a problem with the memory) when bugs
trigger in that place reproducibly. The list walking trashes the L2 cache,
and that puts more stress on the RAM. If it were random memory corruption
(software), it would more likely crash in different places (though that's
not guaranteed ;).
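
To make the failing store concrete, here is a minimal userspace sketch of a chain that keeps a pointer to the previous link, unlinked with the same "*pprev = next" pattern discussed above. The struct and function names are hypothetical (the kernel's buffer_head uses b_next and b_pprev), so take it as an illustration rather than the code in fs/buffer.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node standing in for a hashed buffer head. */
struct node {
    struct node *next;   /* next entry in the chain, or NULL */
    struct node **pprev; /* address of the pointer that points at us */
};

/* Insert n at the head of the chain whose head pointer is *head. */
static void chain_add(struct node **head, struct node *n)
{
    assert(n->pprev == NULL); /* n must not already be on a chain */
    n->next = *head;
    if (n->next)
        n->next->pprev = &n->next;
    *head = n;
    n->pprev = head;
}

/* Unlink n. The "*pprev = next" store is the one that faults when
 * n->pprev has been corrupted (for example pprev = 6). */
static void chain_unlink(struct node *n)
{
    struct node **pprev = n->pprev;
    if (pprev) {
        struct node *next = n->next;
        if (next)
            next->pprev = pprev;
        *pprev = next;
        n->pprev = NULL;
    }
}
```

The unlink trusts whatever is stored in pprev, so a single flipped or overwritten word in one node is enough to send the store to a wild address, whether the damage came from hardware or from software.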

Andrea