2002-06-06 20:44:46

by Martin J. Bligh

Subject: Panic from 2.4.19-pre9-aa2

Panic below - crashed on third kernel compile since boot.
Worked fine on -pre8-aa2

M.

-------------------------------------------------------

Unable to handle kernel paging request at virtual address fffff85e
c648ff38
*pde = 00005063
Oops: 0000
CPU: 3
EIP: 0060:[<c648ff38>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: c648e000
eax: 00000000 ebx: c623a000 ecx: fffff83e edx: c623a380
esi: 00000001 edi: c0297520 ebp: c0117bf6 esp: c648ff00
ds: 0018 es: 0018 ss: 0018
Process cpp (pid: 21583, stackpage=c648f000)
Stack: c648e000 c63473a0 c634740c 00000000 c01163f8 bfffeed4 c649e000 c648e000
00000040 c648e000 00000002 c62b75e0 c4ad2f20 c648e000 c648ff60 c0147dad
00001000 c4ba54e0 c63473a0 000415b4 00000000 c648e000 00000000 00000000
Call Trace: [<c01163f8>] [<c0147dad>] [<c0148180>] [<c013e308>] [<c013e937>]
[<c0108a7b>]
Code: 60 ff 48 c6 ad 7d 14 c0 00 10 00 00 e0 54 ba c4 a0 73 34 c6

>>EIP; c648ff38 <END_OF_CODE+6196040/????> <=====
Trace; c01163f8 <do_page_fault+0/670>
Trace; c0147dac <pipe_wait+7c/a4>
Trace; c0148180 <pipe_write+1cc/294>
Trace; c013e308 <filp_close+9c/a8>
Trace; c013e936 <sys_write+8e/100>
Trace; c0108a7a <system_call+2e/34>
Code; c648ff38 <END_OF_CODE+6196040/????>
00000000 <_EIP>:
Code; c648ff38 <END_OF_CODE+6196040/????>
0: 60 pusha
Code; c648ff38 <END_OF_CODE+6196040/????> <=====
1: ff 48 c6 decl 0xffffffc6(%eax) <=====
Code; c648ff3c <END_OF_CODE+6196044/????>
4: ad lods %ds:(%esi),%eax
Code; c648ff3c <END_OF_CODE+6196044/????>
5: 7d 14 jge 1b <_EIP+0x1b> c648ff52 <END_OF_CODE+619605a/????>
Code; c648ff3e <END_OF_CODE+6196046/????>
7: c0 00 10 rolb $0x10,(%eax)
Code; c648ff42 <END_OF_CODE+619604a/????>
a: 00 00 add %al,(%eax)
Code; c648ff44 <END_OF_CODE+619604c/????>
c: e0 54 loopne 62 <_EIP+0x62> c648ff9a <END_OF_CODE+61960a2/????>
Code; c648ff46 <END_OF_CODE+619604e/????>
e: ba c4 a0 73 34 mov $0x3473a0c4,%edx
Code; c648ff4a <END_OF_CODE+6196052/????>
13: c6 00 00 movb $0x0,(%eax)



2002-06-06 21:20:16

by Andrea Arcangeli

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, Jun 06, 2002 at 01:44:45PM -0700, Martin J. Bligh wrote:
> Panic below - crashed on third kernel compile since boot.
> Worked fine on -pre8-aa2
>
> M.
>
> -------------------------------------------------------
>
> Unable to handle kernel paging request at virtual address fffff85e
> c648ff38
> *pde = 00005063
> Oops: 0000
> CPU: 3
> EIP: 0060:[<c648ff38>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: c648e000
> eax: 00000000 ebx: c623a000 ecx: fffff83e edx: c623a380
> esi: 00000001 edi: c0297520 ebp: c0117bf6 esp: c648ff00
> ds: 0018 es: 0018 ss: 0018
> Process cpp (pid: 21583, stackpage=c648f000)
> Stack: c648e000 c63473a0 c634740c 00000000 c01163f8 bfffeed4 c649e000 c648e000
> 00000040 c648e000 00000002 c62b75e0 c4ad2f20 c648e000 c648ff60 c0147dad
> 00001000 c4ba54e0 c63473a0 000415b4 00000000 c648e000 00000000 00000000
> Call Trace: [<c01163f8>] [<c0147dad>] [<c0148180>] [<c013e308>] [<c013e937>]
> [<c0108a7b>]
> Code: 60 ff 48 c6 ad 7d 14 c0 00 10 00 00 e0 54 ba c4 a0 73 34 c6
>
> >>EIP; c648ff38 <END_OF_CODE+6196040/????> <=====
> Trace; c01163f8 <do_page_fault+0/670>
> Trace; c0147dac <pipe_wait+7c/a4>

ok, so the crash is at pipe_wait+7c. Can you disassemble pipe_wait?
(shouldn't be very big) (i use gcc 3.1.1 so my assembly wouldn't match)
apparently a part of the inode got corrupted, and somebody is reading at
offset 0x20 of a structure inside the inode.
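One way to sanity-check an offset like this against the register dump is to subtract a candidate base register from the faulting address. The sketch below is only an illustration: it assumes %ecx (fffff83e) was the base register for the faulting access, which the oops itself does not confirm, and `fault_displacement` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Displacement between the faulting virtual address and the value held
 * in a candidate base register. Unsigned 32-bit arithmetic matches the
 * i386 address width.
 */
uint32_t fault_displacement(uint32_t fault_address, uint32_t base_register)
{
    return fault_address - base_register;
}

/* Values copied from the oops: fault at fffff85e, ecx = fffff83e. */
void check_first_oops(void)
{
    assert(fault_displacement(UINT32_C(0xfffff85e), UINT32_C(0xfffff83e)) == 0x20);
}
```

Under that assumption the displacement comes out as 0x20, which is consistent with a read at offset 0x20 from a bogus pointer.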

not really sure what could be the problem, it would be interesting to
see if you can reproduce it. Also, if for example you enabled numa-q, you
may want to try disabling it to see whether the problem goes away w/o
discontigmem. If we could isolate it to a config option, it would help a lot.

> Trace; c0148180 <pipe_write+1cc/294>
> Trace; c013e308 <filp_close+9c/a8>
> Trace; c013e936 <sys_write+8e/100>
> Trace; c0108a7a <system_call+2e/34>
> Code; c648ff38 <END_OF_CODE+6196040/????>
> 00000000 <_EIP>:
> Code; c648ff38 <END_OF_CODE+6196040/????>
> 0: 60 pusha
> Code; c648ff38 <END_OF_CODE+6196040/????> <=====
> 1: ff 48 c6 decl 0xffffffc6(%eax) <=====
> Code; c648ff3c <END_OF_CODE+6196044/????>
> 4: ad lods %ds:(%esi),%eax
> Code; c648ff3c <END_OF_CODE+6196044/????>
> 5: 7d 14 jge 1b <_EIP+0x1b> c648ff52 <END_OF_CODE+619605a/????>
> Code; c648ff3e <END_OF_CODE+6196046/????>
> 7: c0 00 10 rolb $0x10,(%eax)
> Code; c648ff42 <END_OF_CODE+619604a/????>
> a: 00 00 add %al,(%eax)
> Code; c648ff44 <END_OF_CODE+619604c/????>
> c: e0 54 loopne 62 <_EIP+0x62> c648ff9a <END_OF_CODE+61960a2/????>
> Code; c648ff46 <END_OF_CODE+619604e/????>
> e: ba c4 a0 73 34 mov $0x3473a0c4,%edx
> Code; c648ff4a <END_OF_CODE+6196052/????>
> 13: c6 00 00 movb $0x0,(%eax)
>


Andrea

2002-06-06 21:53:44

by Martin J. Bligh

Subject: Re: Panic from 2.4.19-pre9-aa2

>> Unable to handle kernel paging request at virtual address fffff85e
>> c648ff38
>> *pde = 00005063
>> Oops: 0000
>> CPU: 3
>> EIP: 0060:[<c648ff38>] Not tainted
>> Using defaults from ksymoops -t elf32-i386 -a i386
>> EFLAGS: c648e000
>> eax: 00000000 ebx: c623a000 ecx: fffff83e edx: c623a380
>> esi: 00000001 edi: c0297520 ebp: c0117bf6 esp: c648ff00
>> ds: 0018 es: 0018 ss: 0018
>> Process cpp (pid: 21583, stackpage=c648f000)
>> Stack: c648e000 c63473a0 c634740c 00000000 c01163f8 bfffeed4 c649e000 c648e000
>> 00000040 c648e000 00000002 c62b75e0 c4ad2f20 c648e000 c648ff60 c0147dad
>> 00001000 c4ba54e0 c63473a0 000415b4 00000000 c648e000 00000000 00000000
>> Call Trace: [<c01163f8>] [<c0147dad>] [<c0148180>] [<c013e308>] [<c013e937>]
>> [<c0108a7b>]
>> Code: 60 ff 48 c6 ad 7d 14 c0 00 10 00 00 e0 54 ba c4 a0 73 34 c6
>>
>> >> EIP; c648ff38 <END_OF_CODE+6196040/????> <=====
>> Trace; c01163f8 <do_page_fault+0/670>
>> Trace; c0147dac <pipe_wait+7c/a4>
>
> ok, so the crash is at pipe_wait+7c. Can you disassemble pipe_wait?
> (shouldn't be very big) (i use gcc 3.1.1 so my assembly wouldn't match)
> apparently a part of the inode got corrupted, and somebody is reading at
> offset 0x20 of a structure inside the inode.

(gdb) disassemble pipe_wait
Dump of assembler code for function pipe_wait:
0xc0147d30 <pipe_wait>: sub $0x20,%esp
0xc0147d33 <pipe_wait+3>: push %ebp
0xc0147d34 <pipe_wait+4>: push %edi
0xc0147d35 <pipe_wait+5>: push %esi
0xc0147d36 <pipe_wait+6>: push %ebx
0xc0147d37 <pipe_wait+7>: mov $0xffffe000,%ebx
0xc0147d3c <pipe_wait+12>: and %esp,%ebx
0xc0147d3e <pipe_wait+14>: lea 0x20(%esp,1),%ebp
0xc0147d42 <pipe_wait+18>: mov %ebp,%edx
0xc0147d44 <pipe_wait+20>: mov 0x34(%esp,1),%esi
0xc0147d48 <pipe_wait+24>: movl $0x0,0x10(%esp,1)
0xc0147d50 <pipe_wait+32>: movl $0x0,0x14(%esp,1)
0xc0147d58 <pipe_wait+40>: movl $0x0,0x18(%esp,1)
0xc0147d60 <pipe_wait+48>: movl $0x0,0x1c(%esp,1)
0xc0147d68 <pipe_wait+56>: mov %ebx,0x14(%esp,1)
0xc0147d6c <pipe_wait+60>: movl $0x0,0x20(%esp,1)
0xc0147d74 <pipe_wait+68>: mov %ebx,0x24(%esp,1)
0xc0147d78 <pipe_wait+72>: movl $0x0,0x28(%esp,1)
0xc0147d80 <pipe_wait+80>: movl $0x0,0x2c(%esp,1)
0xc0147d88 <pipe_wait+88>: movl $0x1,(%ebx)
0xc0147d8e <pipe_wait+94>: mov 0xf8(%esi),%eax
0xc0147d94 <pipe_wait+100>: call 0xc01199c0 <add_wait_queue>
0xc0147d99 <pipe_wait+105>: lea 0x6c(%esi),%edi
0xc0147d9c <pipe_wait+108>: mov %edi,%ecx
0xc0147d9e <pipe_wait+110>: lock incl 0x6c(%esi)
0xc0147da2 <pipe_wait+114>: jle 0xc014891b <.text.lock.pipe>
0xc0147da8 <pipe_wait+120>: call 0xc0117ae8 <schedule>
0xc0147dad <pipe_wait+125>: mov 0xf8(%esi),%eax
0xc0147db3 <pipe_wait+131>: mov %ebp,%edx
0xc0147db5 <pipe_wait+133>: call 0xc0119a28 <remove_wait_queue>
0xc0147dba <pipe_wait+138>: movl $0x0,(%ebx)
0xc0147dc0 <pipe_wait+144>: mov %edi,%ecx
0xc0147dc2 <pipe_wait+146>: lock decl 0x6c(%esi)
0xc0147dc6 <pipe_wait+150>: js 0xc0148925 <.text.lock.pipe+10>
0xc0147dcc <pipe_wait+156>: pop %ebx
0xc0147dcd <pipe_wait+157>: pop %esi
0xc0147dce <pipe_wait+158>: pop %edi
0xc0147dcf <pipe_wait+159>: pop %ebp
0xc0147dd0 <pipe_wait+160>: add $0x20,%esp
0xc0147dd3 <pipe_wait+163>: ret
End of assembler dump.

> not really sure what could be the problem, it would be interesting to
> see if you can reproduce it. Also if for example you enabled numa-q you
> may want to try to disable it and see if w/o discontigmem the problem
> goes away, if we could isolate it to a config option, it would help a lot.

OK, I'll play around some more and try to build up a pattern.

Not sure why ksymoops is printing c0147dac from the trace, whilst
the stack says c0147dad, which seems to be the schedule call -
would make sense, as that's what you just changed?

M.

2002-06-06 23:15:20

by Andrea Arcangeli

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, Jun 06, 2002 at 02:53:40PM -0700, Martin J. Bligh wrote:
> >> Unable to handle kernel paging request at virtual address fffff85e
> >> c648ff38
> >> *pde = 00005063
> >> Oops: 0000
> >> CPU: 3
> >> EIP: 0060:[<c648ff38>] Not tainted
> >> Using defaults from ksymoops -t elf32-i386 -a i386
> >> EFLAGS: c648e000
> >> eax: 00000000 ebx: c623a000 ecx: fffff83e edx: c623a380
> >> esi: 00000001 edi: c0297520 ebp: c0117bf6 esp: c648ff00
> >> ds: 0018 es: 0018 ss: 0018
> >> Process cpp (pid: 21583, stackpage=c648f000)
> >> Stack: c648e000 c63473a0 c634740c 00000000 c01163f8 bfffeed4 c649e000 c648e000
> >> 00000040 c648e000 00000002 c62b75e0 c4ad2f20 c648e000 c648ff60 c0147dad
> >> 00001000 c4ba54e0 c63473a0 000415b4 00000000 c648e000 00000000 00000000
> >> Call Trace: [<c01163f8>] [<c0147dad>] [<c0148180>] [<c013e308>] [<c013e937>]
                                ^^^^^^^^
> >> [<c0108a7b>]
> >> Code: 60 ff 48 c6 ad 7d 14 c0 00 10 00 00 e0 54 ba c4 a0 73 34 c6
> >>
> >> >> EIP; c648ff38 <END_OF_CODE+6196040/????> <=====
> >> Trace; c01163f8 <do_page_fault+0/670>
> >> Trace; c0147dac <pipe_wait+7c/a4>
            ^^^^^^^^
> >
> > ok, so the crash is at pipe_wait+7c. Can you disassemble pipe_wait?
> > (shouldn't be very big) (i use gcc 3.1.1 so my assembly wouldn't match)
> > apparently a part of the inode got corrupted, and somebody is reading at
> > offset 0x20 of a structure inside the inode.
>
> (gdb) disassemble pipe_wait
> Dump of assembler code for function pipe_wait:
> 0xc0147d30 <pipe_wait>: sub $0x20,%esp
                              ^^^^^ this should be 0x30!!!!!! not 0x20
> 0xc0147d33 <pipe_wait+3>: push %ebp
> 0xc0147d34 <pipe_wait+4>: push %edi
> 0xc0147d35 <pipe_wait+5>: push %esi
> 0xc0147d36 <pipe_wait+6>: push %ebx
> 0xc0147d37 <pipe_wait+7>: mov $0xffffe000,%ebx
> 0xc0147d3c <pipe_wait+12>: and %esp,%ebx
> 0xc0147d3e <pipe_wait+14>: lea 0x20(%esp,1),%ebp
> 0xc0147d42 <pipe_wait+18>: mov %ebp,%edx
> 0xc0147d44 <pipe_wait+20>: mov 0x34(%esp,1),%esi
> 0xc0147d48 <pipe_wait+24>: movl $0x0,0x10(%esp,1)
> 0xc0147d50 <pipe_wait+32>: movl $0x0,0x14(%esp,1)
^^^^^^^^^^^^^^^^^^^^^^^^ (what's this?)
> 0xc0147d58 <pipe_wait+40>: movl $0x0,0x18(%esp,1)
> 0xc0147d60 <pipe_wait+48>: movl $0x0,0x1c(%esp,1)
> 0xc0147d68 <pipe_wait+56>: mov %ebx,0x14(%esp,1)
^^^^^^^^^^^^^^^^^^^^^^^^
> 0xc0147d6c <pipe_wait+60>: movl $0x0,0x20(%esp,1)
> 0xc0147d74 <pipe_wait+68>: mov %ebx,0x24(%esp,1)
> 0xc0147d78 <pipe_wait+72>: movl $0x0,0x28(%esp,1)
> 0xc0147d80 <pipe_wait+80>: movl $0x0,0x2c(%esp,1)
> 0xc0147d88 <pipe_wait+88>: movl $0x1,(%ebx)
> 0xc0147d8e <pipe_wait+94>: mov 0xf8(%esi),%eax
> 0xc0147d94 <pipe_wait+100>: call 0xc01199c0 <add_wait_queue>
> 0xc0147d99 <pipe_wait+105>: lea 0x6c(%esi),%edi
> 0xc0147d9c <pipe_wait+108>: mov %edi,%ecx
> 0xc0147d9e <pipe_wait+110>: lock incl 0x6c(%esi)
> 0xc0147da2 <pipe_wait+114>: jle 0xc014891b <.text.lock.pipe>
> 0xc0147da8 <pipe_wait+120>: call 0xc0117ae8 <schedule>
> 0xc0147dad <pipe_wait+125>: mov 0xf8(%esi),%eax
                                  ^^^^^^^^^^

At first glance this looks like a miscompilation, a compiler bug, not a bug in
2.4.19pre9aa2 (which would clearly explain why you're the only one reproducing
this weird oops). It even sounds like ksymoops is buggy: ksymoops should have
said c0147dad (+7d), not c0147dac and +7c (maybe you compiled ksymoops
with the same compiler as the kernel? If not, Keith should have a look
here).

besides the stupid zeroing of 0x14(esp) (my compiler isn't doing that),
the initial sub seems wrong. pipe_wait has just one argument, and that's
at offset 0x34, so it should be sub 0x30, not sub 0x20, or we will
corrupt the underlying stack and we also won't read the
inode at all (hence the oops: the inode wasn't corrupted as I
guessed in the previous email, the problem is at the previous step. We use
random memory as a pointer to the inode structure, so we oops while we try to
read the inode contents).

Of course the code reads the inode at offset 0x34, but at 0x34 there's
no inode, there's something else random, because the prologue did
sub 0x20 so the inode was at 0x24, not 0x34! The prologue clearly had to
do sub 0x30 instead (that's the miscompilation).

What compiler are you using? Maybe 2.96?

3.1.1 20020530 works fine for me with the kernel, as did the previous
gcc 3.1. I never had a single problem with the kernel in the whole
development cycle of 3.0 and 3.1, and now with 3.1.1. If you need to play it
safe with the kernel for x86 you should use only 2.95 or egcs 1.1.2;
however, I can reassure people that gcc 3.1.1 seems rock solid, even if
I wouldn't use it for anything mission critical yet.

I CC'ed Honza (x86-64/x86 gcc guru) and Keith, in case I misread something.

Honza, this is the pipe_wait C code:

void pipe_wait(struct inode * inode)
{
        DECLARE_WAITQUEUE(wait, current);
        current->state = TASK_INTERRUPTIBLE;
        add_wait_queue(PIPE_WAIT(*inode), &wait);
                       ^^^^^^^^^^^^^^^^^ we bug here while dereferencing inode->i_pipe
        up(PIPE_SEM(*inode));
        schedule();
        remove_wait_queue(PIPE_WAIT(*inode), &wait);
        current->state = TASK_RUNNING;
        down(PIPE_SEM(*inode));
}


note, wait is at offset 0 of i_pipe, and i_pipe is at offset 0xf8 of the
inode. So it is indeed doing inode->i_pipe when it oops, because the
inode address passed on the stack (first and only argument) was at 0x24 not 0x34.
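To make the offsets concrete, here is a minimal sketch in C of why a single load from 0xf8(%esi) yields the wait queue head. Everything in it is a hypothetical stand-in: the `mock_*` types are not the real 2.4 structures, the padding exists only to place `i_pipe` at 0xf8 as in this particular build, and `MOCK_PIPE_WAIT` merely imitates the shape of PIPE_WAIT.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real kernel definitions. */
struct mock_wait_queue_head {
    int lock;
    void *next;
    void *prev;
};

struct mock_pipe_info {
    struct mock_wait_queue_head wait;   /* first member, so offset 0 */
    char *base;
};

struct mock_inode {
    char other_fields[0xf8];            /* padding so i_pipe lands at 0xf8 */
    struct mock_pipe_info *i_pipe;
};

/* Same shape as PIPE_WAIT(*inode): the address of i_pipe->wait. */
#define MOCK_PIPE_WAIT(inode) (&(inode).i_pipe->wait)

struct mock_wait_queue_head *mock_pipe_wait_head(struct mock_inode *inode)
{
    /* Loads inode->i_pipe; because wait sits at offset 0, nothing is added. */
    return MOCK_PIPE_WAIT(*inode);
}
```

With this layout `offsetof(struct mock_inode, i_pipe)` is 0xf8 and the returned pointer equals `inode->i_pipe` itself, so a bogus inode pointer shows up as a bad load at that one offset.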



> 0xc0147db3 <pipe_wait+131>: mov %ebp,%edx
> 0xc0147db5 <pipe_wait+133>: call 0xc0119a28 <remove_wait_queue>
> 0xc0147dba <pipe_wait+138>: movl $0x0,(%ebx)
> 0xc0147dc0 <pipe_wait+144>: mov %edi,%ecx
> 0xc0147dc2 <pipe_wait+146>: lock decl 0x6c(%esi)
> 0xc0147dc6 <pipe_wait+150>: js 0xc0148925 <.text.lock.pipe+10>
> 0xc0147dcc <pipe_wait+156>: pop %ebx
> 0xc0147dcd <pipe_wait+157>: pop %esi
> 0xc0147dce <pipe_wait+158>: pop %edi
> 0xc0147dcf <pipe_wait+159>: pop %ebp
> 0xc0147dd0 <pipe_wait+160>: add $0x20,%esp
> 0xc0147dd3 <pipe_wait+163>: ret
> End of assembler dump.
>
> > not really sure what could be the problem, it would be interesting to
> > see if you can reproduce it. Also if for example you enabled numa-q you
> > may want to try to disable it and see if w/o discontigmem the problem
> > goes away, if we could isolate it to a config option, it would help a lot.
>
> OK, I'll play around some more and try to build up a pattern.
>
> Not sure why ksymoops is printing c0147dac from the trace, whilst
> the stack says c0147dad, which seems to be the schedule call -
> would make sense, as that's what you just changed?

yes, that's wrong, but that is a ksymoops mistake not related to the
original oops (possibly due to the same broken compiler, but maybe not).

>
> M.


Andrea

2002-06-06 23:18:09

by Martin J. Bligh

Subject: Re: Panic from 2.4.19-pre9-aa2

> not really sure what could be the problem, it would be interesting to
> see if you can reproduce it.

Yup, do 2 or 3 kernel compiles and it crashes again. Here's a slightly
different oops:

Unable to handle kernel NULL pointer dereference at virtual address 00000282
c0117feb
*pde = 00000000
Oops: 0000
CPU: 6
EIP: 0010:[<c0117feb>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: c6369f6c ebx: 00000282 ecx: c029a488 edx: c4ff5b24
esi: c4ff5b20 edi: 00000282 ebp: c6227f70 esp: c6227f54
ds: 0018 es: 0018 ss: 0018
Process cpp (pid: 16679, stackpage=c6227000)
Stack: 00001000 c4ff5b20 c5773180 00000001 c4ff5b24 00000282 00000001 000526a9
c0148311 00000000 ffffffea c5eab160 000536a9 c6526000 c6226000 c57731ec
00001000 00001000 c013ead7 c5eab160 4011000c 000536a9 c5eab180 c6226000
Call Trace: [<c0148311>] [<c013ead7>] [<c0108a7b>]
Code: 8b 3b 0f 18 07 3b 5d f4 75 d0 c6 06 01 ff 75 f8 9d 8d 74 26

>>EIP; c0117fea <__wake_up+5a/7c> <=====
Trace; c0148310 <pipe_write+1bc/294>
Trace; c013ead6 <sys_write+8e/100>
Trace; c0108a7a <system_call+2e/34>
Code; c0117fea <__wake_up+5a/7c>
00000000 <_EIP>:
Code; c0117fea <__wake_up+5a/7c> <=====
0: 8b 3b mov (%ebx),%edi <=====
Code; c0117fec <__wake_up+5c/7c>
2: 0f 18 07 prefetchnta (%edi)
Code; c0117fee <__wake_up+5e/7c>
5: 3b 5d f4 cmp 0xfffffff4(%ebp),%ebx
Code; c0117ff2 <__wake_up+62/7c>
8: 75 d0 jne ffffffda <_EIP+0xffffffda> c0117fc4 <__wake_up+34/7c>
Code; c0117ff4 <__wake_up+64/7c>
a: c6 06 01 movb $0x1,(%esi)
Code; c0117ff6 <__wake_up+66/7c>
d: ff 75 f8 pushl 0xfffffff8(%ebp)
Code; c0117ffa <__wake_up+6a/7c>
10: 9d popf
Code; c0117ffa <__wake_up+6a/7c>
11: 8d 74 26 00 lea 0x0(%esi,1),%esi

> Also if for example you enabled numa-q you
> may want to try to disable it and see if w/o discontigmem the problem
> goes away, if we could isolate it to a config option, it would help a lot.

OK, will see if I can do that - I'm out for a few days, so it may be next
Tuesday before I can do this

M.

2002-06-06 23:29:07

by Hugh Dickins

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, 6 Jun 2002, Martin J. Bligh wrote:
>
> Not sure why ksymoops is printing c0147dac from the trace, whilst
> the stack says c0147dad, which seems to be the schedule call -

Bug in ksymoops (had a misinitialized truncate_mask, which
removed the low bit by mistake): fixed in ksymoops 2.4.4.
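The effect described here can be reproduced with a one-line mask. This is only a sketch: the constant `BAD_TRUNCATE_MASK` is an assumed value chosen to clear bit 0, and the function name is hypothetical; it is not taken from the ksymoops source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed illustration of the faulty mask: every bit set except bit 0. */
#define BAD_TRUNCATE_MASK  UINT32_C(0xfffffffe)
/* A mask that keeps all 32 bits of an i386 address. */
#define GOOD_TRUNCATE_MASK UINT32_C(0xffffffff)

uint32_t apply_truncate_mask(uint32_t address, uint32_t mask)
{
    return address & mask;
}

/* Address pairs copied from the Call Trace line and the decoded Trace lines. */
void check_trace_addresses(void)
{
    assert(apply_truncate_mask(UINT32_C(0xc0147dad), BAD_TRUNCATE_MASK) == UINT32_C(0xc0147dac));
    assert(apply_truncate_mask(UINT32_C(0xc013e937), BAD_TRUNCATE_MASK) == UINT32_C(0xc013e936));
    assert(apply_truncate_mask(UINT32_C(0xc0108a7b), BAD_TRUNCATE_MASK) == UINT32_C(0xc0108a7a));
}
```

Even addresses such as c01163f8 and c0148180 pass through unchanged, which matches the decoded trace: only the odd return addresses came out one byte low.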

Hugh

2002-06-06 23:36:41

by Andrea Arcangeli

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, Jun 06, 2002 at 04:18:01PM -0700, Martin J. Bligh wrote:
> > not really sure what could be the problem, it would be interesting to
> > see if you can reproduce it.
>
> Yup, do 2 or 3 kernel compiles and it crashes again. Here's a slightly
> different oops:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000282
> c0117feb
> *pde = 00000000
> Oops: 0000
> CPU: 6
> EIP: 0010:[<c0117feb>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010046
> eax: c6369f6c ebx: 00000282 ecx: c029a488 edx: c4ff5b24
> esi: c4ff5b20 edi: 00000282 ebp: c6227f70 esp: c6227f54
> ds: 0018 es: 0018 ss: 0018
> Process cpp (pid: 16679, stackpage=c6227000)
> Stack: 00001000 c4ff5b20 c5773180 00000001 c4ff5b24 00000282 00000001 000526a9
> c0148311 00000000 ffffffea c5eab160 000536a9 c6526000 c6226000 c57731ec
> 00001000 00001000 c013ead7 c5eab160 4011000c 000536a9 c5eab180 c6226000
> Call Trace: [<c0148311>] [<c013ead7>] [<c0108a7b>]
> Code: 8b 3b 0f 18 07 3b 5d f4 75 d0 c6 06 01 ff 75 f8 9d 8d 74 26
>
> >>EIP; c0117fea <__wake_up+5a/7c> <=====
> Trace; c0148310 <pipe_write+1bc/294>

no doubt it crashes again here: the pipe_write stack gets corrupted by
pipe_wait. Actually we were very lucky that it previously crashed in
the buggy place, so you showed me the buggy assembler immediately. If it
had crashed in __wake_up the first time, maybe __wake_up wasn't miscompiled
and it would have been much harder to guess it was not a kernel mistake... :)

> Trace; c013ead6 <sys_write+8e/100>
> Trace; c0108a7a <system_call+2e/34>
> Code; c0117fea <__wake_up+5a/7c>
> 00000000 <_EIP>:
> Code; c0117fea <__wake_up+5a/7c> <=====
> 0: 8b 3b mov (%ebx),%edi <=====
> Code; c0117fec <__wake_up+5c/7c>
> 2: 0f 18 07 prefetchnta (%edi)
> Code; c0117fee <__wake_up+5e/7c>
> 5: 3b 5d f4 cmp 0xfffffff4(%ebp),%ebx
> Code; c0117ff2 <__wake_up+62/7c>
> 8: 75 d0 jne ffffffda <_EIP+0xffffffda> c0117fc4 <__wake_up+34/7c>
> Code; c0117ff4 <__wake_up+64/7c>
> a: c6 06 01 movb $0x1,(%esi)
> Code; c0117ff6 <__wake_up+66/7c>
> d: ff 75 f8 pushl 0xfffffff8(%ebp)
> Code; c0117ffa <__wake_up+6a/7c>
> 10: 9d popf
> Code; c0117ffa <__wake_up+6a/7c>
> 11: 8d 74 26 00 lea 0x0(%esi,1),%esi
>
> > Also if for example you enabled numa-q you
> > may want to try to disable it and see if w/o discontigmem the problem
> > goes away, if we could isolate it to a config option, it would help a lot.
>
> OK, will see if I can do that - I'm out for a few days, so it may be next
> Tuesday before I can do this
>
> M.


Andrea

2002-06-06 23:45:57

by Martin J. Bligh

Subject: Re: Panic from 2.4.19-pre9-aa2

> At first glance this seems a miscompilation, a compiler bug, not bug in
> 2.4.19pre9aa2 (this clearly explains why you're the only one reproducing
> this weird oops). it even sounds like ksymoops is buggy, ksymoops had to
> say c0147dad (+7d), not c0147dac and +7c (maybe you compiled ksymoops
> with the same compiler of the kernel? If not Keith should have a look
> here).
>
> What compiler are you using? Maybe 2.96?

Errm .... Redhat 6.2 default ... egcs-2.91.66 .... time to upgrade ?? ;-) ;-)
Pah ... reinstalling these machines is a pain in the ass .... ;-)

M.

2002-06-06 23:53:05

by Andrea Arcangeli

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, Jun 06, 2002 at 04:45:13PM -0700, Martin J. Bligh wrote:
> > At first glance this seems a miscompilation, a compiler bug, not bug in
> > 2.4.19pre9aa2 (this clearly explains why you're the only one reproducing
> > this weird oops). it even sounds like ksymoops is buggy, ksymoops had to
> > say c0147dad (+7d), not c0147dac and +7c (maybe you compiled ksymoops
> > with the same compiler of the kernel? If not Keith should have a look
> > here).
> >
> > What compiler are you using? Maybe 2.96?
>
> Errm .... Redhat 6.2 default ... egcs-2.91.66 .... time to upgrade ?? ;-) ;-)

hmm, that's bad news. That's egcs 1.1.2. Strange, it was supposed to
be safe. Oh well, OTOH I'm not too surprised nobody noticed, because I
doubt many people still compile 2.4 with egcs.

> Pah ... reinstalling these machines is a pain in the ass .... ;-)

Could you try compiling on another machine with gcc 2.95 and see if
you can still reproduce it? If it's a race condition and a real kernel
bug it should be easily reproducible no matter the compiler.

Andrea

2002-06-07 01:33:56

by Keith Owens

Subject: Re: Panic from 2.4.19-pre9-aa2

On Thu, 06 Jun 2002 14:53:40 -0700,
"Martin J. Bligh" wrote:
>Not sure why ksymoops is printing c0147dac from the trace, whilst
>the stack says c0147dad, which seems to be the schedule call -
>would make sense, as that's what you just changed?

Truncate mask bug, fixed in ksymoops 2.4.4. Current is 2.4.5.