Hi,
there has been a bug [1] recently which caused NMIs to be stuck on certain
Intel CPUs, resultung in oprofile not working.
This has been fixed by
commit 7c64ade53a6f977d73f16243865c42ceae999aea
Author: Andi Kleen <[email protected]>
Date: Fri Nov 7 14:02:49 2008 +0100
oprofile: Fix p6 counter overflow check
but we are seeing the very same symptoms still on the following CPU, even
with 2.6.28-rc4 (which includes 7c64ade5):
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 47
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 0
After starting oprofile, no NMI is delivered at all:
vulture:~ # opcontrol --start
Using 2.6+ OProfile kernel interface.
Reading module info.
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
vulture:~ # grep NMI /proc/interrupts
NMI: 0 Non-maskable interrupts
vulture:~ # grep NMI /proc/interrupts
NMI: 0 Non-maskable interrupts
on a quite similar CPU, everything seems to be working smoothly though
with the very same kernel -- NMIs are being delivered, and oprofile
performs proper profiling:
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 4
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 8
I haven't yet found a time to start bisecting this.
[1] http://lkml.org/lkml/2008/10/30/319
--
Jiri Kosina
SUSE Labs
* Jiri Kosina <[email protected]> wrote:
> I haven't yet found a time to start bisecting this.
Would be nice to identify a commit to revert - in case we run out of
time fixing it.
Ingo
On Thu, 13 Nov 2008, Ingo Molnar wrote:
> > I haven't yet found a time to start bisecting this.
> Would be nice to identify a commit to revert - in case we run out of
> time fixing it.
Yup, I first wanted to make this known to the public in hope that it will
ring a bell somewhere.
If noone sees an obvous reason for this, I will do my best to bisect this
tomorrow.
--
Jiri Kosina
SUSE Labs
Ingo Molnar wrote:
> * Jiri Kosina <[email protected]> wrote:
>
>> I haven't yet found a time to start bisecting this.
>
> Would be nice to identify a commit to revert - in case we run out of
> time fixing it.
Robert, Jason,
Can one of you take a look at this? It appears to be an AMD issue right now.
The first message in the thread can be found at:
http://lkml.org/lkml/2008/11/13/336
Thanks.
-Maynard
>
> Ingo
>
> -------------------------------------------------------------------------
> This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
> Build the coolest Linux based applications with Moblin SDK & win great prizes
> Grand prize is a trip for two to an Open Source event anywhere in the world
> http://moblin-contest.org/redirect.php?banner_id=100&url=/
> _______________________________________________
> oprofile-list mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/oprofile-list
* Jiri Kosina <[email protected]> wrote:
> On Thu, 13 Nov 2008, Ingo Molnar wrote:
>
> > > I haven't yet found a time to start bisecting this.
> > Would be nice to identify a commit to revert - in case we run out of
> > time fixing it.
>
> Yup, I first wanted to make this known to the public in hope that it
> will ring a bell somewhere.
>
> If noone sees an obvous reason for this, I will do my best to bisect
> this tomorrow.
We've got the one patch below pending, but that's not for AMD cpus so
it shouldnt impact your case.
But ... some change made it all much more fragile. I'm curious why
things became more fragile.
Ingo
--------------->
Subject: oprofile: un-mask APIC before resetting counter in ppro_check_ctrs()
From: Eric Dumazet <[email protected]>
Date: Tue, 11 Nov 2008 09:32:12 +0100
While using oprofile on my HP BL460c G1, (two quad core intel E5450 CPU),
I noticed that one CPU after the other could not get anymore NMI.
After a while, all cores where blocked (ie not generating events for oprofile)
I tried all major linux versions and all where affected by this freeze.
I found that we have to un-mask APIC *before* writing to MSR counter
when we get event notification, because we use APIC_LVTPC in edge triggered mode.
Signed-off-by: Eric Dumazet <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/oprofile/op_model_ppro.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
Index: tip/arch/x86/oprofile/op_model_ppro.c
===================================================================
--- tip.orig/arch/x86/oprofile/op_model_ppro.c
+++ tip/arch/x86/oprofile/op_model_ppro.c
@@ -126,6 +126,12 @@ static int ppro_check_ctrs(struct pt_reg
u64 val;
int i;
+ /*
+ * We need to unmask the apic vector *before* writing reset_value
+ * to msr counter, because we use edge trigger
+ */
+ apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
+
for (i = 0 ; i < num_counters; ++i) {
if (!reset_value[i])
continue;
@@ -136,10 +142,6 @@ static int ppro_check_ctrs(struct pt_reg
}
}
- /* Only P6 based Pentium M need to re-unmask the apic vector but it
- * doesn't hurt other P6 variant */
- apic_write(APIC_LVTPC, apic_read(APIC_LVTPC) & ~APIC_LVT_MASKED);
-
/* We can't work out if we really handled an interrupt. We
* might have caught a *second* counter just after overflowing
* the interrupt for this counter then arrives
diff --git a/arch/x86/oprofile/op_model_ppro.c b/arch/x86/oprofile/op_model_ppro.c
index 3f1b81a..716d26f 100644
--- a/arch/x86/oprofile/op_model_ppro.c
+++ b/arch/x86/oprofile/op_model_ppro.c
@@ -69,7 +69,7 @@ static void ppro_setup_ctrs(struct op_msrs const * const msrs)
int i;
if (!reset_value) {
- reset_value = kmalloc(sizeof(unsigned) * num_counters,
+ reset_value = kmalloc(sizeof(reset_value[0]) * num_counters,
GFP_ATOMIC);
if (!reset_value)
return;
On 13.11.08 22:27:52, Jiri Kosina wrote:
> On Thu, 13 Nov 2008, Ingo Molnar wrote:
>
> > > I haven't yet found a time to start bisecting this.
> > Would be nice to identify a commit to revert - in case we run out of
> > time fixing it.
>
> Yup, I first wanted to make this known to the public in hope that it will
> ring a bell somewhere.
>
> If noone sees an obvous reason for this, I will do my best to bisect this
> tomorrow.
Could you at least test with v2.6.27? I could not reproduce it on the
hardware I am using.
Thanks,
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]
On Fri, 14 Nov 2008, Robert Richter wrote:
> Could you at least test with v2.6.27?
Oprofile in 2.6.27 doesn't work on this system, as far as I know.
> I could not reproduce it on the hardware I am using.
Yes, it seems to be very system-specific. As I have said in my original
mail, the machine we are seeing this problem on is
vendor_id : AuthenticAMD
cpu family : 15
model : 47
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 0
but on the other hand on the following CPU, the oprofile works with 2.6.27
and 2.6.28-rc4 without any problems, NMIs are coming flawlessly
vendor_id : AuthenticAMD
cpu family : 15
model : 4
model name : AMD Athlon(tm) 64 Processor 3000+
stepping : 8
--
Jiri Kosina
SUSE Labs
Very strange results on my Intel E5450 CPU
I found that I have to bias regs->ip by 16 bytes in order to have
correct profiling.
Does it rings a bell to anyone ?
opcontrol --version
opcontrol: oprofile 0.9.4 compiled on Nov 3 2008 06:45:57
Thank you
diff --git a/arch/x86/oprofile/op_model_ppro.c b/arch/x86/oprofile/op_model_ppro.c
index 3f1b81a..051fb16 100644
--- a/arch/x86/oprofile/op_model_ppro.c
+++ b/arch/x86/oprofile/op_model_ppro.c
@@ -131,7 +131,9 @@ static int ppro_check_ctrs(struct pt_regs * const regs,
continue;
rdmsrl(msrs->counters[i].addr, val);
if (CTR_OVERFLOWED(val)) {
+ regs->ip -= 0x10;
oprofile_add_sample(regs, i);
+ regs->ip += 0x10;
wrmsrl(msrs->counters[i].addr, -reset_value[i]);
}
}
On Fri, Nov 14, 2008 at 04:25:51PM +0100, Eric Dumazet wrote:
> Very strange results on my Intel E5450 CPU
>
> I found that I have to bias regs->ip by 16 bytes in order to have
> correct profiling.
Can you expand a bit how you determined if the profile is right or wrong?
-Andi
Andi Kleen a ?crit :
> On Fri, Nov 14, 2008 at 04:25:51PM +0100, Eric Dumazet wrote:
>> Very strange results on my Intel E5450 CPU
>>
>> I found that I have to bias regs->ip by 16 bytes in order to have
>> correct profiling.
>
> Can you expand a bit how you determined if the profile is right or wrong?
>
> -Andi
>
>
opannotate -a vmlinux > FILE
And check results.
With the patch I get :
c041d270 <dst_release>: /* dst_release total: 16523 1.2978 */
252 0.0198 :c041d270: push %ebp
654 0.0514 :c041d271: test %eax,%eax
:c041d273: mov %esp,%ebp
47 0.0037 :c041d275: mov %eax,%edx
42 0.0033 :c041d277: je c041d289 <dst_release+0x19>
:c041d279: mov $0xffffffff,%eax
3 2.4e-04 :c041d27e: lock xadd %eax,0x80(%edx)
14766 1.1598 :c041d286: dec %eax
1 7.9e-05 :c041d287: js c041d28b <dst_release+0x1b>
53 0.0042 :c041d289: pop %ebp
705 0.0554 :c041d28a: ret
:c041d28b: mov $0x10e,%edx
:c041d290: mov $0xc0561ed5,%eax
:c041d295: call c022d1a0 <warn_on_slowpath>
:c041d29a: pop %ebp
:c041d29b: ret
:c041d29c: lea 0x0(%esi),%esi
This results please me : The cost is right after the "lock xadd" as expected.
Without patch, all addresses were off by 16 bytes.
On dst_release() it was on the call on warn_on_slowpath(),
and I am pretty sure it is not possible at all this WARN_ON is hit
Eric Dumazet writes:
> Very strange results on my Intel E5450 CPU
>
> I found that I have to bias regs->ip by 16 bytes in order to have
> correct profiling.
>
> Does it rings a bell to anyone ?
The term is "skid". All OO processors are susceptible to it,
as was known already to the Alpha "continous profiling"
performance counter folks.
And no, blindly subtracting 16 from IP is not a fix.
>
> opcontrol --version
> opcontrol: oprofile 0.9.4 compiled on Nov 3 2008 06:45:57
>
> Thank you
>
> diff --git a/arch/x86/oprofile/op_model_ppro.c b/arch/x86/oprofile/op_model_ppro.c
> index 3f1b81a..051fb16 100644
> --- a/arch/x86/oprofile/op_model_ppro.c
> +++ b/arch/x86/oprofile/op_model_ppro.c
> @@ -131,7 +131,9 @@ static int ppro_check_ctrs(struct pt_regs * const regs,
> continue;
> rdmsrl(msrs->counters[i].addr, val);
> if (CTR_OVERFLOWED(val)) {
> + regs->ip -= 0x10;
> oprofile_add_sample(regs, i);
> + regs->ip += 0x10;
> wrmsrl(msrs->counters[i].addr, -reset_value[i]);
> }
> }
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
Mikael Pettersson a ?crit :
> Eric Dumazet writes:
> > Very strange results on my Intel E5450 CPU
> >
> > I found that I have to bias regs->ip by 16 bytes in order to have
> > correct profiling.
> >
> > Does it rings a bell to anyone ?
>
> The term is "skid". All OO processors are susceptible to it,
> as was known already to the Alpha "continous profiling"
> performance counter folks.
Cool. Maybe this can explain the other bug I have with oprofile.
>
> And no, blindly subtracting 16 from IP is not a fix.
Who mentioned a fix ? I am only giving more fuel to Intel guys so they
hopefully can give us a working oprofile.
>
> >
> > opcontrol --version
> > opcontrol: oprofile 0.9.4 compiled on Nov 3 2008 06:45:57
> >
> > Thank you
> >
> > diff --git a/arch/x86/oprofile/op_model_ppro.c b/arch/x86/oprofile/op_model_ppro.c
> > index 3f1b81a..051fb16 100644
> > --- a/arch/x86/oprofile/op_model_ppro.c
> > +++ b/arch/x86/oprofile/op_model_ppro.c
> > @@ -131,7 +131,9 @@ static int ppro_check_ctrs(struct pt_regs * const regs,
> > continue;
> > rdmsrl(msrs->counters[i].addr, val);
> > if (CTR_OVERFLOWED(val)) {
> > + regs->ip -= 0x10;
> > oprofile_add_sample(regs, i);
> > + regs->ip += 0x10;
> > wrmsrl(msrs->counters[i].addr, -reset_value[i]);
> > }
> > }
> >
> > --
> >And no, blindly subtracting 16 from IP is not a fix.
>
> Who mentioned a fix ? I am only giving more fuel to Intel guys so they
> hopefully can give us a working oprofile.
You would need to implement PEBS support to avoid that problem. But it's a big
task. perfmon2 implements it already.
-Andi
Andi Kleen a ?crit :
>>> And no, blindly subtracting 16 from IP is not a fix.
>> Who mentioned a fix ? I am only giving more fuel to Intel guys so they
>> hopefully can give us a working oprofile.
>
> You would need to implement PEBS support to avoid that problem. But it's a big
> task. perfmon2 implements it already.
>
Thanks for the information.
Hum, so I grabbed perfmon2 git tree, installed various tools...
I am quite new to pfmon and tried :
# pfmon --system-wide
sizeof=64 44
<press ENTER to stop session>
Then started "tbench 8", and got a kernel panic after 6 seconds.
I was using oprofile like this
opcontrol --vmlinux=/path/vmlinux --start
// doing some benchmarking...
opreport -l vmlinux | head -n 40
What would be a working equivalent for perfmon2 based tools ?
Thanks
On Sat, Nov 15, 2008 at 05:30:58PM +0100, Eric Dumazet wrote:
> Andi Kleen a ?crit :
> >>>And no, blindly subtracting 16 from IP is not a fix.
> >>Who mentioned a fix ? I am only giving more fuel to Intel guys so they
> >>hopefully can give us a working oprofile.
> >
> >You would need to implement PEBS support to avoid that problem. But it's a
> >big
> >task. perfmon2 implements it already.
> >
>
> Thanks for the information.
>
> Hum, so I grabbed perfmon2 git tree, installed various tools...
>
> I am quite new to pfmon and tried :
>
> # pfmon --system-wide
> sizeof=64 44
> <press ENTER to stop session>
>
> Then started "tbench 8", and got a kernel panic after 6 seconds.
>
>
> I was using oprofile like this
>
> opcontrol --vmlinux=/path/vmlinux --start
> // doing some benchmarking...
> opreport -l vmlinux | head -n 40
>
>
> What would be a working equivalent for perfmon2 based tools ?
Probably getting a perfmon tree that works. I guess Stephane
can help (cc'ed). Or just deal with imprecise events for now.
-Andi
--
[email protected]
Hello,
I have not seen the beginning of that discussion so my comments may beslightly off.It seems Eric has problems with accuracy of instruction addresses whensampling with the PMU.
This is an inherent limitation of the PMU. It can be mitigated but notcompletely eliminated. The coreissue is that it takes several cycles between the moment a counteroverflows and posting of the PMUinterrupt. During that time, the CPU keeps on executing instructions.The interrupt IP you get, reflectsthe place you were when it triggered. That can be far away from whereit was posted and where thecounter actually overflowed. Of course, if you are stalled thatdistance is usually 0 or off by a smallnumber of instructions. But it can be very large when overflow happensduring a kernel critical sectionwhere interrupts are off. There is nothing SW can do about all of this.
Andi mentioned PEBS. I don't know if you are familiar with what itdoes. Let me summarize. This isa hardware/microcode feature which implements a hardware-managedbuffer where samples arestored. The OS points the CPU to a memory region where PMU samples arestored. No PMUinterrupt is generated until the buffer becomes full. That partaddresses some of the overheadassociated with interrupt-based sampling. Unfortunately, PEBS doesnot point to the instruction wherethe counter overflowed, it will still be a few instructions off. Butthis time, you get the machine state at thelast retired instruction. Furthermore, PEBS can record samples whilein kernel critical sections. A limitationof PEBS is that it does not work with all the PMU events. Only ahandful are available.
As for perfmon, if you pull from the perfmon2 GIT tree, this shouldwork. Don't know what happen inyou case.
Perfmon and the pfmon can do simple counting or also collect profiles.
$ pfmon date
Counts cycles at the user level only for the process date
$ pfmon --system-wide -t10
Counts elapsed cycles at user level for all CPU for 10s. Results are per-cpu
$ pfmon --long-smpl-periods=240000 date
Collect a flat profile of process date. Period is 240,000 elapsed user cycle
$ pfmon --system-wide --long-smpl-periods=240000 -t 10
Collect a flat profile on each online CPU during 10s. Period is240,000 user elapsed cycles. Results are per-cpu
You have a lot more examples on the perfmon web site, Following thedocumentation and pfmon users' guide.
Perfmon/pfmon can use PEBS on Intel Core processors. First step is toinsert the kernel module for it: # modprobe perfmon_pebs_core_smpl
Then use pfmon, we use instruction_retired because elapsed cycles doesnot support PEBS:
$ pfmon --smpl-module=pebs -einstructions_retired--long-smpl-periods=120000 date
Hope this helps.
On Sat, Nov 15, 2008 at 7:36 PM, Andi Kleen <[email protected]> wrote:> On Sat, Nov 15, 2008 at 05:30:58PM +0100, Eric Dumazet wrote:>> Andi Kleen a écrit :>> >>>And no, blindly subtracting 16 from IP is not a fix.>> >>Who mentioned a fix ? I am only giving more fuel to Intel guys so they>> >>hopefully can give us a working oprofile.>> >>> >You would need to implement PEBS support to avoid that problem. But it's a>> >big>> >task. perfmon2 implements it already.>> >>>>> Thanks for the information.>>>> Hum, so I grabbed perfmon2 git tree, installed various tools...>>>> I am quite new to pfmon and tried :>>>> # pfmon --system-wide>> sizeof=64 44>> <press ENTER to stop session>>>>> Then started "tbench 8", and got a kernel panic after 6 seconds.>>>>>> I was using oprofile like this>>>> opcontrol --vmlinux=/path/vmlinux --start>> // doing some benchmarking...>> opreport -l vmlinux | head -n 40>>>>>> What would be a working equivalent for perfmon2 based tools ?>> Probably getting a perfmon tree that works. I guess Stephane> can help (cc'ed). Or just deal with imprecise events for now.>> -Andi> --> [email protected]>????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?
On Mon, Nov 17, 2008 at 04:02:19PM +0100, stephane eranian wrote:
Thanks for explaining it better than I could.
Just a minor correction.
> But it can be very large when overflow happens
> during a kernel critical section
> where interrupts are off.
Actually oprofile uses the NMI to avoid that particular problem.
It still has limits, e.g. no profiling of SMM sections.
> There is nothing SW can do about all of this.
Yes profiling on a out of order machine is hard.
> Furthermore, PEBS can record samples while
> in kernel critical sections.
Again with oprofile that works too without PEBS.
-Andi
stephane eranian a écrit :
> Hello,
>
> I have not seen the beginning of that discussion so my comments may be
> slightly off.
> It seems Eric has problems with accuracy of instruction addresses when
> sampling with the PMU.
>
> This is an inherent limitation of the PMU. It can be mitigated but not
> completely eliminated. The core
> issue is that it takes several cycles between the moment a counter
> overflows and posting of the PMU
> interrupt. During that time, the CPU keeps on executing instructions.
> The interrupt IP you get, reflects
> the place you were when it triggered. That can be far away from where
> it was posted and where the
> counter actually overflowed. Of course, if you are stalled that
> distance is usually 0 or off by a small
> number of instructions. But it can be very large when overflow happens
> during a kernel critical section
> where interrupts are off. There is nothing SW can do about all of this.
>
> Andi mentioned PEBS. I don't know if you are familiar with what it
> does. Let me summarize. This is
> a hardware/microcode feature which implements a hardware-managed
> buffer where samples are
> stored. The OS points the CPU to a memory region where PMU samples are
> stored. No PMU
> interrupt is generated until the buffer becomes full. That part
> addresses some of the overhead
> associated with interrupt-based sampling. Unfortunately, PEBS does
> not point to the instruction where
> the counter overflowed, it will still be a few instructions off. But
> this time, you get the machine state at the
> last retired instruction. Furthermore, PEBS can record samples while
> in kernel critical sections. A limitation
> of PEBS is that it does not work with all the PMU events. Only a
> handful are available.
>
> As for perfmon, if you pull from the perfmon2 GIT tree, this should
> work. Don't know what happen in
> you case.
>
> Perfmon and the pfmon can do simple counting or also collect profiles.
>
> $ pfmon date
>
> Counts cycles at the user level only for the process date
>
> $ pfmon --system-wide -t10
>
> Counts elapsed cycles at user level for all CPU for 10s. Results are per-cpu
>
> $ pfmon --long-smpl-periods=240000 date
>
> Collect a flat profile of process date. Period is 240,000 elapsed user cycle
>
> $ pfmon --system-wide --long-smpl-periods=240000 -t 10
>
> Collect a flat profile on each online CPU during 10s. Period is
> 240,000 user elapsed cycles. Results are per-cpu
>
> You have a lot more examples on the perfmon web site, Following the
> documentation and pfmon users' guide.
>
> Perfmon/pfmon can use PEBS on Intel Core processors. First step is to
> insert the kernel module for it:
> # modprobe perfmon_pebs_core_smpl
>
> Then use pfmon, we use instruction_retired because elapsed cycles does
> not support PEBS:
>
> $ pfmon --smpl-module=pebs -einstructions_retired
> --long-smpl-periods=120000 date
>
> Hope this helps.
>
Sure it does ! Thanks a lot Stephane, I am going to try all this stuff.