2010-11-12 21:21:00

by Sylvain GENEVES

Subject: Oprofile bug ?

Hello,

I'm encountering unexpected behaviour with OProfile when the profiled
system is under heavy load: "BUG: unable to handle kernel paging request
at 0000000000004cc3" (full console message is attached).

I'm using Linux kernel 2.6.32 with both OProfile 0.9.6 from the Debian
repositories and OProfile 0.9.7 from CVS, on an Intel Xeon E5440 (Harpertown).

It happens when I profile a webserver under heavy load, with some cores
disabled, and with the following event:
--event=CPU_CLK_UNHALTED:1600000000000

I also use the --separate=all and --callgraph=3 options, and my kernel
has debug symbols and frame pointers enabled.

Does anyone have any idea what is happening?
Thanks,
Regards,
Sylvain


Attachments:
oprofile_bug_stacktrace (8.54 kB)

2010-11-12 21:45:32

by Frank Ch. Eigler

Subject: Re: Oprofile bug ?

"Sylvain GENEVES" <[email protected]> writes:

> [...]
> I'm encountering unexpected behaviour with OProfile when the profiled
> system is under heavy load: "BUG: unable to handle kernel paging request
> at 0000000000004cc3" (full console message is attached).
> [...]
> Does anyone have any idea what is happening?

Just glancing at that oops & my local random kernel build, it appears
as though this part of arch/x86/kernel/time.c:profile_pc is failing:

unsigned long profile_pc(struct pt_regs *regs)
{
        unsigned long pc = instruction_pointer(regs);

        if (!user_mode_vm(regs) && in_lock_functions(pc)) {
#ifdef CONFIG_FRAME_POINTER
                return *(unsigned long *)(regs->bp + sizeof(long));
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#else
[...]

regs->bp must have been 0x4cbb, which this code turns into an
unchecked dereference at 0x4cbb+8 = 0x4cc3. I don't have a theory
as to why regs->bp should have that value in it, but the kernel
should probably use probe_kernel_read() or somesuch to validate the
value before dereferencing it.

- FChE
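
To make the arithmetic and the "validate before dereferencing" idea
concrete, here is a small userspace sketch in C. It is not kernel code
and does not call probe_kernel_read(); the helper names
return_slot_addr() and checked_read_word() are hypothetical, and the
bounds check against a known buffer only stands in for whatever
validation the kernel would actually perform.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Address that the CONFIG_FRAME_POINTER branch of profile_pc() reads:
 * the saved return address sits one machine word above the saved frame
 * pointer. word_size is 8 on x86-64. (Hypothetical helper.)
 */
static uint64_t return_slot_addr(uint64_t bp, uint64_t word_size)
{
    return bp + word_size;
}

/*
 * Hypothetical userspace analogue of a checked read: copy one word from
 * base + offset only if the whole word lies inside [base, base + len).
 * Returns false instead of faulting when the range is out of bounds.
 */
static bool checked_read_word(const unsigned char *base, size_t len,
                              uint64_t offset, unsigned long *out)
{
    if (offset > len || len - offset < sizeof(*out))
        return false;
    memcpy(out, base + offset, sizeof(*out));
    return true;
}
```

With these helpers, return_slot_addr(0x4cbb, 8) gives 0x4cc3, the
address reported in the oops, and checked_read_word() refuses an offset
of 0x4cc3 into a 32-byte buffer rather than reading it.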

2010-11-13 16:18:26

by Peter Zijlstra

Subject: Re: Oprofile bug ?

On Fri, 2010-11-12 at 22:11 +0100, Sylvain GENEVES wrote:
> Does anyone have any idea what is happening?

Yeah, the oprofile code is terminally broken: it uses
__copy_from_user_inatomic() from NMI context.