2003-02-27 06:22:27

by Dave Hansen

[permalink] [raw]
Subject: Oops running oprofile in 2.5.62

This happened while running dbench on 2.5.62. I haven't seen it before,
but I thought I'd report it anyway. I'm using the 0.5 version of the
userspace tools.

I'm pretty sure it happened on this line in oprofile_add_sample():
cpu_buf->buffer[cpu_buf->pos].eip = eip;

Unable to handle kernel paging request at virtual address f8c3c000
c0212022
*pde = 00000000
Oops: 0002
CPU: 13
EIP: 0060:[<c0212022>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010046
eax: 40082d94 ebx: 00000340 ecx: 00002000 edx: f8c2c000
esi: 00000000 edi: c0310f00 ebp: 00000000 esp: e2fbdf60
ds: 007b es: 007b ss: 0068
Stack: 00000000 00000000 c03a22c8 c02139f0 40082d94 00000000 00000000
0000000d e2fbc000 e2fbdfc4 bffff7a0 bffffbb8 40082d94 c0213291
0000000d c03a22c8 e2fbdfc4 c010a1eb e2fbdfc4 0000000d 40122020
00000000 c010962e e2fbdfc4
Call Trace:
[<c02139f0>] ppro_check_ctrs+0x4c/0x80
[<c0213291>] nmi_callback+0x21/0x28
[<c010a1eb>] do_nmi+0x2b/0x48
[<c010962e>] nmi+0x1e/0x30
Code: 89 04 ca 8b 8b c4 0b 31 c0 8b 93 d0 0b 31 c0 8b 44 24 18 89
p

>>EIP; c0212022 <oprofile_add_sample+102/128> <=====

>>edi; c0310f00 <cpu_buffer+340/800>

Trace; c02139f0 <ppro_check_ctrs+4c/80>
Trace; c0213291 <nmi_callback+21/28>
Trace; c010a1eb <do_nmi+2b/48>
Trace; c010962e <nmi+1e/30>

Code; c0212022 <oprofile_add_sample+102/128>
00000000 <_EIP>:
Code; c0212022 <oprofile_add_sample+102/128> <=====
0: 89 04 ca mov %eax,(%edx,%ecx,8) <=====
Code; c0212025 <oprofile_add_sample+105/128>
3: 8b 8b c4 0b 31 c0 mov 0xc0310bc4(%ebx),%ecx
Code; c021202b <oprofile_add_sample+10b/128>
9: 8b 93 d0 0b 31 c0 mov 0xc0310bd0(%ebx),%edx
Code; c0212031 <oprofile_add_sample+111/128>
f: 8b 44 24 18 mov 0x18(%esp,1),%eax
Code; c0212035 <oprofile_add_sample+115/128>
13: 89 00 mov %eax,(%eax)



--
Dave Hansen
[email protected]


2003-02-27 17:26:50

by John Levon

[permalink] [raw]
Subject: Re: Oops running oprofile in 2.5.62

On Wed, Feb 26, 2003 at 10:29:43PM -0800, Dave Hansen wrote:

> I'm pretty sure it happened on this line in oprofile_add_sample():
> cpu_buf->buffer[cpu_buf->pos].eip = eip;
>
> Unable to handle kernel paging request at virtual address f8c3c000

Odd. UP or SMP ? Were you shutting down oprofile at the time (or did the
daemon crash ? check /var/lib/oprofile/oprofiled.log)

The only thing I can think is that the buffer got freed then we got an
NMI afterwards (which obviously isn't supposed to happen...)

UP oprofile in current linus is still totally broken, but I can't
imagine it causing the above.

regards
john

2003-02-27 17:48:52

by Dave Hansen

[permalink] [raw]
Subject: Re: Oops running oprofile in 2.5.62

John Levon wrote:
> On Wed, Feb 26, 2003 at 10:29:43PM -0800, Dave Hansen wrote:
>
>>I'm pretty sure it happened on this line in oprofile_add_sample():
>> cpu_buf->buffer[cpu_buf->pos].eip = eip;
>>
>>Unable to handle kernel paging request at virtual address f8c3c000
>
> Odd. UP or SMP ? Were you shutting down oprofile at the time (or did the
> daemon crash ? check /var/lib/oprofile/oprofiled.log)

16-way NUMAQ :) It was in the middle of a dbench run, so nothing
critical was happening with oprofile.

> The only thing I can think is that the buffer got freed then we got an
> NMI afterwards (which obviously isn't supposed to happen...)

The irqbalance code is causing some funniness on these machines, so it
might be related. I'll speak up if I see it again.

--
Dave Hansen
[email protected]

2003-03-05 19:01:36

by Philippe Elie

[permalink] [raw]
Subject: Re: Oops running oprofile in 2.5.62

Dave Hansen wrote:
> This happened while running dbench on 2.5.62. I haven't seen it before,
> but I thought I'd report it anyway. I'm using the 0.5 version of the
> userspace tools.
>
> I'm pretty sure it happened on this line in oprofile_add_sample():
> cpu_buf->buffer[cpu_buf->pos].eip = eip;

yes, in the last chunk of code in oprofile_add_sample()

> Unable to handle kernel paging request at virtual address f8c3c000
> c0212022
> *pde = 00000000
> Oops: 0002
> CPU: 13
> EIP: 0060:[<c0212022>] Not tainted
> Using defaults from ksymoops -t elf32-i386 -a i386
> EFLAGS: 00010046
> eax: 40082d94 ebx: 00000340 ecx: 00002000 edx: f8c2c000
^^^^^^^^

buffer overrrun by one entry (8192 entry by default of 8 bytes
each on x86), potentially oprofile_add_sample() add 3 events
in buffer but the protection at begin of code protect against
two addition not three

The bug is rare because add_sample use three entry in rare case,
and thing are wrong only when cpu_buf->pos == buffer_size - 2
the code is not fixed in 2.5.64, John if you have not already
a patch pending for this can you push it in your tree ?

void oprofile_add_sample(unsigned long eip, unsigned int is_kernel,
unsigned long event, int cpu)
.....
- if (cpu_buf->pos > buffer_size - 2) {
+
if (cpu_buf->pos > buffer_size - 3) {
cpu_buf->sample_lost_overflow++;
goto out;
}

>>>EIP; c0212022 <oprofile_add_sample+102/128> <=====
>>
>
>>>edi; c0310f00 <cpu_buffer+340/800>
>>
>
> Trace; c02139f0 <ppro_check_ctrs+4c/80>
> Trace; c0213291 <nmi_callback+21/28>
> Trace; c010a1eb <do_nmi+2b/48>
> Trace; c010962e <nmi+1e/30>
>
> Code; c0212022 <oprofile_add_sample+102/128>
> 00000000 <_EIP>:
> Code; c0212022 <oprofile_add_sample+102/128> <=====
> 0: 89 04 ca mov %eax,(%edx,%ecx,8) <=====

ecx == cpu_bufffer->pos == buffer_size ... boom ...

regards,
Philippe Elie