2013-03-11 14:28:51

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] perf fixes

Linus,

Please pull the latest perf-urgent-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf-urgent-for-linus

HEAD: cb16b91a449afd01b85ec4e59f30449d11c4acd7 s390: Fix a header dependencies related build error

Misc minor fixes mostly related to tracing.

Thanks,

Ingo

------------------>
Hiraku Toyooka (1):
tracing: update documentation of snapshot utility

Li Zefan (1):
s390: Fix a header dependencies related build error

Steven Rostedt (1):
ftrace: Update the kconfig for DYNAMIC_FTRACE

Steven Rostedt (Red Hat) (2):
tracing: Add help of snapshot feature when snapshot is empty
tracing: Do not return EINVAL in snapshot when not allocated


Documentation/trace/ftrace.txt | 2 +-
arch/s390/include/asm/cpu_mf.h | 1 +
kernel/trace/Kconfig | 24 ++++++++++++++----------
kernel/trace/trace.c | 27 ++++++++++++++++++++++++---
4 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 53d6a3c..a372304 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -1873,7 +1873,7 @@ feature:

status\input | 0 | 1 | else |
--------------+------------+------------+------------+
- not allocated |(do nothing)| alloc+swap | EINVAL |
+ not allocated |(do nothing)| alloc+swap |(do nothing)|
--------------+------------+------------+------------+
allocated | free | swap | clear |
--------------+------------+------------+------------+
diff --git a/arch/s390/include/asm/cpu_mf.h b/arch/s390/include/asm/cpu_mf.h
index 35f0020..c7c9bf6 100644
--- a/arch/s390/include/asm/cpu_mf.h
+++ b/arch/s390/include/asm/cpu_mf.h
@@ -12,6 +12,7 @@
#ifndef _ASM_S390_CPU_MF_H
#define _ASM_S390_CPU_MF_H

+#include <linux/errno.h>
#include <asm/facility.h>

#define CPU_MF_INT_SF_IAE (1 << 31) /* invalid entry address */
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 3656756..b516a8e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -429,24 +429,28 @@ config PROBE_EVENTS
def_bool n

config DYNAMIC_FTRACE
- bool "enable/disable ftrace tracepoints dynamically"
+ bool "enable/disable function tracing dynamically"
depends on FUNCTION_TRACER
depends on HAVE_DYNAMIC_FTRACE
default y
help
- This option will modify all the calls to ftrace dynamically
- (will patch them out of the binary image and replace them
- with a No-Op instruction) as they are called. A table is
- created to dynamically enable them again.
+ This option will modify all the calls to function tracing
+ dynamically (will patch them out of the binary image and
+ replace them with a No-Op instruction) on boot up. During
+ compile time, a table is made of all the locations that ftrace
+ can function trace, and this table is linked into the kernel
+ image. When this is enabled, functions can be individually
+ enabled, and the functions not enabled will not affect
+ performance of the system.
+
+ See the files in /sys/kernel/debug/tracing:
+ available_filter_functions
+ set_ftrace_filter
+ set_ftrace_notrace

This way a CONFIG_FUNCTION_TRACER kernel is slightly larger, but
otherwise has native performance as long as no tracing is active.

- The changes to the code are done by a kernel thread that
- wakes up once a second and checks to see if any ftrace calls
- were made. If so, it runs stop_machine (stops all CPUS)
- and modifies the code to jump over the call to ftrace.
-
config DYNAMIC_FTRACE_WITH_REGS
def_bool y
depends on DYNAMIC_FTRACE
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c2e2c23..1f835a8 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2400,6 +2400,27 @@ static void test_ftrace_alive(struct seq_file *m)
seq_printf(m, "# MAY BE MISSING FUNCTION EVENTS\n");
}

+#ifdef CONFIG_TRACER_MAX_TRACE
+static void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter)
+{
+ if (iter->trace->allocated_snapshot)
+ seq_printf(m, "#\n# * Snapshot is allocated *\n#\n");
+ else
+ seq_printf(m, "#\n# * Snapshot is freed *\n#\n");
+
+ seq_printf(m, "# Snapshot commands:\n");
+ seq_printf(m, "# echo 0 > snapshot : Clears and frees snapshot buffer\n");
+ seq_printf(m, "# echo 1 > snapshot : Allocates snapshot buffer, if not already allocated.\n");
+ seq_printf(m, "# Takes a snapshot of the main buffer.\n");
+ seq_printf(m, "# echo 2 > snapshot : Clears snapshot buffer (but does not allocate)\n");
+ seq_printf(m, "# (Doesn't have to be '2' works with any number that\n");
+ seq_printf(m, "# is not a '0' or '1')\n");
+}
+#else
+/* Should never be called */
+static inline void print_snapshot_help(struct seq_file *m, struct trace_iterator *iter) { }
+#endif
+
static int s_show(struct seq_file *m, void *v)
{
struct trace_iterator *iter = v;
@@ -2411,7 +2432,9 @@ static int s_show(struct seq_file *m, void *v)
seq_puts(m, "#\n");
test_ftrace_alive(m);
}
- if (iter->trace && iter->trace->print_header)
+ if (iter->snapshot && trace_empty(iter))
+ print_snapshot_help(m, iter);
+ else if (iter->trace && iter->trace->print_header)
iter->trace->print_header(m);
else
trace_default_header(m);
@@ -4144,8 +4167,6 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
default:
if (current_trace->allocated_snapshot)
tracing_reset_online_cpus(&max_tr);
- else
- ret = -EINVAL;
break;
}


2013-03-14 20:32:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Mon, Feb 25, 2013 at 11:02 PM, Ingo Molnar <[email protected]> wrote:
>
> Stephane Eranian (1):
> perf/x86: Add Intel IvyBridge event scheduling constraints

So thanks to my new Chromebook, I'm trying to do perf stuff on
IvyBridge. I was hoping the whole exact cycle counting would work
better, since the last machine it was reliable for me was my old
Westmere desktop, and my previous laptop (Sandybridge MacBook Air)
didn't work very well due to CPU errata.

Sadly, my new Ivybridge laptop doesn't seem to do very well either,
but I suspect it's the kernel, not the hardware this time. A simple

perf record -f -e cycles:pp make -j

in the kernel directory (fully built kernel, so the load doesn't
actually go crazy despite the "-j") will *sometimes* totally hang the
machine with current git.

I get a NULL pointer dereference in intel_pmu_enable_all, so I don't
think it's a hardware bug.

And to make things interesting, I seem to be able to only reproduce
this *after* a suspend cycle. That may be just happenstance, since it
seemed to be hard to replicate and most of the time it has happened
under X with no messages visible at all, but that *seems* to be the
pattern.

And the one time I got it to happen on the text console, things
scrolled off (watchdog warnings due to lockups), but I did get a NULL
pointer dereference in intel_pmu_enable_all().

I'll try to reproduce it and get a picture, but thought I'd send out
the email with just this, since maybe the "suspend cycle" plus "NULL
pointer in intel_pmu_enable_all()" makes somebody go "Duh!", or at
least replicate it...

Linus

2013-03-14 21:06:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 1:32 PM, Linus Torvalds
<[email protected]> wrote:
>
> And to make things interesting, I seem to be able to only reproduce
> this *after* a suspend cycle. That may be just happenstance, since it
> seemed to be hard to replicate and most of the time it has happened
> under X with no messages visible at all, but that *seems* to be the
> pattern.
>
> And the one time I got it to happen on the text console, things
> scrolled off (watchdog warnings due to lockups), but I did get a NULL
> pointer dereference in intel_pmu_enable_all().
>
> I'll try to reproduce it and get a picture,

Theory more or less confirmed.

It does need a suspend/resume cycle, and I have a picture. The oops
happens immediately when trying to do any perf work after the first
suspend, before suspending I seem to be able to reliably use perf. It
could still be just random flakiness, but I don't think so.

The NULL pointer dereference is at intel_pmu_enable_all+0x4d/0xa0 for
me, which seems to be the load of the

if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))

thing. It says

BUG: unable to handle NULL pointer dereference at 0000000000000028

But that error makes no sense. The code at that EIP is

48 8b 83 00 02 00 00 mov 0x200(%rbx),%rax <-- trapping instruction

and the value printed out for %rbx is 0xffff80014f20b8e0, so it should
*not* be a NULL pointer dereference (and "cpuc" was also used just
before the wrmsrl).

So I suspect that the "wrmsrl" that was just before that instruction
does something odd, and the PMU is in some odd state, so that the NULL
pointer dereference actually has something to do with *that*, rather
than the instruction itself.

The callchain looks normal. It's

finish_task_switch ->
__perf_event_task_sched_in ->
perf_event_context_sched_in ->
perf_pmu_enable ->
x86_pmu_enable ->
intel_pmu_enable_all()

The immediately preceding wrmsrl was done with rax=0xf, rdx=0x7,
rcx=0x38f according to the register dump (but the picture isn't great,
so the numbers aren't 100% reliable).

Does this give any clues?

Linus

2013-03-14 22:10:03

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

Hi,


On Thu, Mar 14, 2013 at 10:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Mar 14, 2013 at 1:32 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > And to make things interesting, I seem to be able to only reproduce
> > this *after* a suspend cycle. That may be just happenstance, since it
> > seemed to be hard to replicate and most of the time it has happened
> > under X with no messages visible at all, but that *seems* to be the
> > pattern.
> >
> > And the one time I got it to happen on the text console, things
> > scrolled off (watchdog warnings due to lockups), but I did get a NULL
> > pointer dereference in intel_pmu_enable_all().
> >
> > I'll try to reproduce it and get a picture,
>
> Theory more or less confirmed.
>
> It does need a suspend/resume cycle, and I have a picture. The oops
> happens immediately when trying to do any perf work after the first
> suspend, before suspending I seem to be able to reliably use perf. It
> could still be just random flakiness, but I don't think so.
>
Could be related to suspend/resume. But were you running perf across
that resume/suspend cycle?



But still don't see how a wrmsrl could corrupt a cpuc.


>
> The NULL pointer dereference is at intel_pmu_enable_all+0x4d/0xa0 for
> me, which seems to be the load of the
>
> if (test_bit(INTEL_PMC_IDX_FIXED_BTS, cpuc->active_mask))
>
> thing. It says
>
> BUG: unable to handle NULL pointer dereference at 0000000000000028
>
> But that error makes no sense. The code at that EIP is
>
> 48 8b 83 00 02 00 00 mov 0x200(%rbx),%rax <-- trapping instruction
>
> and the value printed out for %rbx is 0xffff80014f20b8e0, so it should
> *not* be a NULL pointer dereference (and "cpuc" was also used just
> before the wrmsrl).


>
> So I suspect that the "wrmsrl" that was just before that instruction
> does something odd, and the PMU is in some odd state, so that the NULL
> pointer dereference actually has something to do with *that*, rather
> than the instruction itself.
>
> The callchain looks normal. It's
>
> finish_task_switch ->
> __perf_event_task_sched_in ->
> perf_event_context_sched_in ->
> perf_pmu_enable ->
> x86_pmu_enable ->
> intel_pmu_enable_all()
>
> The immediately preceding wrmsrl was done with rax=0xf, rdx=0x7,
> rcx=0x38f according to the register dump (but the picture isn't great,
> so the numbers aren't 100% reliable).
>
Value 0x38f for GLOBAL_CTRL is valid. And 0x70000000f is valid too
for the counter bitmask (4 generic counters + 3 fixed counters).

Let's see if we can reproduce the problem on the same ChromeBook you
have. Don't have one myself.

> Does this give any clues?
>
> Linus

2013-03-14 22:17:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>
> Could be related to suspend/resume. But were you running perf across
> that resume/suspend cycle?

No.

In most cases I was running a perf record before and after (but not
*while* suspending)

In at least one other crash, I didn't run perf before at all, so the
first time I used perf was after the resume.

So in no cases did I actually have any perf stuff active over the
suspend itself.

> Let's see if we can reproduce the problem on the same ChromeBook you
> have. Don't have one myself.

I don't imagine it should be about chromebook per se, because afaik
all of pmu suspend/resume is done by the kernel, no firmware involved.

So I'd assume it should happen with any IvyBridge.

Linus

2013-03-14 22:19:16

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 11:17 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>>
>> Could be related to suspend/resume. But were you running perf across
>> that resume/suspend cycle?
>
> No.
>
> In most cases I was running a perf record before and after (but not
> *while* suspending)
>
> In at least one other crash, I didn't run perf before at all, so the
> first time I used perf was after the resume.
>
> So in no cases did I actually have any perf stuff active over the
> suspend itself.
>
Ok, simpler test case then.

>> Let's see if we can reproduce the problem on the same ChromeBook you
>> have. Don't have one myself.
>
> I don't imagine it should be about chromebook per se, because afaik
> all of pmu suspend/resume is done by the kernel, no firmware involved.
>
> So I'd assume it should happen with any IvyBridge.
>
Will try on a desktop IvyBridge too.

2013-03-14 22:42:37

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 11:19 PM, Stephane Eranian <[email protected]> wrote:
> On Thu, Mar 14, 2013 at 11:17 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>>>
>>> Could be related to suspend/resume. But were you running perf across
>>> that resume/suspend cycle?
>>
>> No.
>>
>> In most cases I was running a perf record before and after (but not
>> *while* suspending)
>>
>> In at least one other crash, I didn't run perf before at all, so the
>> first time I used perf was after the resume.
>>
>> So in no cases did I actually have any perf stuff active over the
>> suspend itself.
>>
> Ok, simpler test case then.
>
>>> Let's see if we can reproduce the problem on the same ChromeBook you
>>> have. Don't have one myself.
>>
>> I don't imagine it should be about chromebook per se, because afaik
>> all of pmu suspend/resume is done by the kernel, no firmware involved.
>>
>> So I'd assume it should happen with any IvyBridge.
>>
> Will try on a desktop IvyBridge too.

Ok, it happens on my IVB desktop too, so I can investigate...

2013-03-14 22:53:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 11:42 PM, Stephane Eranian <[email protected]> wrote:
> On Thu, Mar 14, 2013 at 11:19 PM, Stephane Eranian <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 11:17 PM, Linus Torvalds
>> <[email protected]> wrote:
>>> On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>>>>
>>>> Could be related to suspend/resume. But were you running perf across
>>>> that resume/suspend cycle?
>>>
>>> No.
>>>
>>> In most cases I was running a perf record before and after (but not
>>> *while* suspending)
>>>
>>> In at least one other crash, I didn't run perf before at all, so the
>>> first time I used perf was after the resume.
>>>
>>> So in no cases did I actually have any perf stuff active over the
>>> suspend itself.
>>>
>> Ok, simpler test case then.
>>
>>>> Let's see if we can reproduce the problem on the same ChromeBook you
>>>> have. Don't have one myself.
>>>
>>> I don't imagine it should be about chromebook per se, because afaik
>>> all of pmu suspend/resume is done by the kernel, no firmware involved.
>>>
>>> So I'd assume it should happen with any IvyBridge.
>>>
>> Will try on a desktop IvyBridge too.
>
> Ok, it happens on my IVB desktop too, so I can investigate...

It's not specific to IVB either, it hangs on my Nehalem desktop as well.

2013-03-14 23:11:57

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 11:53 PM, Stephane Eranian <[email protected]> wrote:
> On Thu, Mar 14, 2013 at 11:42 PM, Stephane Eranian <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 11:19 PM, Stephane Eranian <[email protected]> wrote:
>>> On Thu, Mar 14, 2013 at 11:17 PM, Linus Torvalds
>>> <[email protected]> wrote:
>>>> On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>>>>>
>>>>> Could be related to suspend/resume. But were you running perf across
>>>>> that resume/suspend cycle?
>>>>
>>>> No.
>>>>
>>>> In most cases I was running a perf record before and after (but not
>>>> *while* suspending)
>>>>
>>>> In at least one other crash, I didn't run perf before at all, so the
>>>> first time I used perf was after the resume.
>>>>
>>>> So in no cases did I actually have any perf stuff active over the
>>>> suspend itself.
>>>>
>>> Ok, simpler test case then.
>>>
>>>>> Let's see if we can reproduce the problem on the same ChromeBook you
>>>>> have. Don't have one myself.
>>>>
>>>> I don't imagine it should be about chromebook per se, because afaik
>>>> all of pmu suspend/resume is done by the kernel, no firmware involved.
>>>>
>>>> So I'd assume it should happen with any IvyBridge.
>>>>
>>> Will try on a desktop IvyBridge too.
>>
>> Ok, it happens on my IVB desktop too, so I can investigate...
>
> It's not specific to IVB either, it hangs on my Nehalem desktop as well.

Looks related to PEBS. If I drop the :pp the machine does not hang. Even
a single :p hangs it. So it is possible something is not properly
restored in the
DS state after a resume or is corrupted by the suspend.

2013-03-15 00:24:04

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

Linus,

I bet if you force the affinity of your perf record to be on
a CPU other than CPU0, you will not get the crash.

This is what I am seeing now. I appears on resume,
CPU0 hotplug callbacks for perf_events are not invoked
leaving DS_AREA MSR to 0.

Can you confirm on your machine?



On Fri, Mar 15, 2013 at 12:11 AM, Stephane Eranian <[email protected]> wrote:
> On Thu, Mar 14, 2013 at 11:53 PM, Stephane Eranian <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 11:42 PM, Stephane Eranian <[email protected]> wrote:
>>> On Thu, Mar 14, 2013 at 11:19 PM, Stephane Eranian <[email protected]> wrote:
>>>> On Thu, Mar 14, 2013 at 11:17 PM, Linus Torvalds
>>>> <[email protected]> wrote:
>>>>> On Thu, Mar 14, 2013 at 3:09 PM, Stephane Eranian <[email protected]> wrote:
>>>>>>
>>>>>> Could be related to suspend/resume. But were you running perf across
>>>>>> that resume/suspend cycle?
>>>>>
>>>>> No.
>>>>>
>>>>> In most cases I was running a perf record before and after (but not
>>>>> *while* suspending)
>>>>>
>>>>> In at least one other crash, I didn't run perf before at all, so the
>>>>> first time I used perf was after the resume.
>>>>>
>>>>> So in no cases did I actually have any perf stuff active over the
>>>>> suspend itself.
>>>>>
>>>> Ok, simpler test case then.
>>>>
>>>>>> Let's see if we can reproduce the problem on the same ChromeBook you
>>>>>> have. Don't have one myself.
>>>>>
>>>>> I don't imagine it should be about chromebook per se, because afaik
>>>>> all of pmu suspend/resume is done by the kernel, no firmware involved.
>>>>>
>>>>> So I'd assume it should happen with any IvyBridge.
>>>>>
>>>> Will try on a desktop IvyBridge too.
>>>
>>> Ok, it happens on my IVB desktop too, so I can investigate...
>>
>> It's not specific to IVB either, it hangs on my Nehalem desktop as well.
>
> Looks related to PEBS. If I drop the :pp the machine does not hang. Even
> a single :p hangs it. So it is possible something is not properly
> restored in the
> DS state after a resume or is corrupted by the suspend.

2013-03-15 01:06:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Thu, Mar 14, 2013 at 5:24 PM, Stephane Eranian <[email protected]> wrote:
>
> I bet if you force the affinity of your perf record to be on
> a CPU other than CPU0, you will not get the crash.
>
> This is what I am seeing now. I appears on resume,
> CPU0 hotplug callbacks for perf_events are not invoked
> leaving DS_AREA MSR to 0.
>
> Can you confirm on your machine?

I'm not even going to bother confirming it, because I think you're
right, and I think the reason is clear: the DS initialization code
uses the CPU_UP notifiers.

And that's sufficient for CPU hotplug, which is what suspend/resume
ends up doing for all but the boot CPU. But the boot CPU is not
hotplugged.

Using CPU_UP notifiers is wrong, and they get called too late anyway.

The code should use a real resume method. Or, better yet, just do it
right, and do it from __restore_processor_state().

Those f*cking CPU notifiers are a pain in the ass, and the tend to be
invariably broken, and they have their own idiotic hacks that are
equally broken (ie that x86_pmu_notifier() thing seems to make up its
own suspend/resume with
"x86_pmu.cpu_prepare/cpu_starting/cpu_dying/cpu_dead" things.

I guess we could make the BP do a fake cpu notifier thing around the
suspend of the boot processor as well, but most of the per-CPU stuff
seems to be perfectly fine without it (ie mtrr, apic, etc etc all use
the suspend/resume infrastructure) and doesn't need that kind of
stuff.

Linus

2013-03-15 08:01:34

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Fri, Mar 15, 2013 at 2:06 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Mar 14, 2013 at 5:24 PM, Stephane Eranian <[email protected]> wrote:
>>
>> I bet if you force the affinity of your perf record to be on
>> a CPU other than CPU0, you will not get the crash.
>>
>> This is what I am seeing now. I appears on resume,
>> CPU0 hotplug callbacks for perf_events are not invoked
>> leaving DS_AREA MSR to 0.
>>
>> Can you confirm on your machine?
>
> I'm not even going to bother confirming it, because I think you're
> right, and I think the reason is clear: the DS initialization code
> uses the CPU_UP notifiers.
>
Ok, I instrumented the pebs_enable() function and I confirm that
DS_AREA=0 on resume.

So what seems broken here for me is that on suspend, the cpu notifier
ends up calling fini_debug_store() to clear DS_AREA for CPU0. But
on resume, the same notifier does NOT call the init_debug_store().
I don't understand this asymmetry. You either do neither or you do
both.


> And that's sufficient for CPU hotplug, which is what suspend/resume
> ends up doing for all but the boot CPU. But the boot CPU is not
> hotplugged.
>
> Using CPU_UP notifiers is wrong, and they get called too late anyway.
>
> The code should use a real resume method. Or, better yet, just do it
> right, and do it from __restore_processor_state().
>
> Those f*cking CPU notifiers are a pain in the ass, and the tend to be
> invariably broken, and they have their own idiotic hacks that are
> equally broken (ie that x86_pmu_notifier() thing seems to make up its
> own suspend/resume with
> "x86_pmu.cpu_prepare/cpu_starting/cpu_dying/cpu_dead" things.
>
> I guess we could make the BP do a fake cpu notifier thing around the
> suspend of the boot processor as well, but most of the per-CPU stuff
> seems to be perfectly fine without it (ie mtrr, apic, etc etc all use
> the suspend/resume infrastructure) and doesn't need that kind of
> stuff.
>
> Linus

2013-03-15 10:50:24

by Stephane Eranian

[permalink] [raw]
Subject: Re: [GIT PULL] perf fixes

On Fri, Mar 15, 2013 at 9:01 AM, Stephane Eranian <[email protected]> wrote:
> On Fri, Mar 15, 2013 at 2:06 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Mar 14, 2013 at 5:24 PM, Stephane Eranian <[email protected]> wrote:
>>>
>>> I bet if you force the affinity of your perf record to be on
>>> a CPU other than CPU0, you will not get the crash.
>>>
>>> This is what I am seeing now. I appears on resume,
>>> CPU0 hotplug callbacks for perf_events are not invoked
>>> leaving DS_AREA MSR to 0.
>>>
>>> Can you confirm on your machine?
>>
>> I'm not even going to bother confirming it, because I think you're
>> right, and I think the reason is clear: the DS initialization code
>> uses the CPU_UP notifiers.
>>
> Ok, I instrumented the pebs_enable() function and I confirm that
> DS_AREA=0 on resume.
>
> So what seems broken here for me is that on suspend, the cpu notifier
> ends up calling fini_debug_store() to clear DS_AREA for CPU0. But
> on resume, the same notifier does NOT call the init_debug_store().
> I don't understand this asymmetry. You either do neither or you do
> both.
>
Ok, corrections. I ran some more tests. On the suspend path, the cpu
notifier is not called for CPU0. However when the machine comes back
up, DS_AREA is 0 which is the power-up default value. And given that
the notifier is not called for CPU0, that is the value we inherit later on
and which causes the crash.

>
>> And that's sufficient for CPU hotplug, which is what suspend/resume
>> ends up doing for all but the boot CPU. But the boot CPU is not
>> hotplugged.
>>
>> Using CPU_UP notifiers is wrong, and they get called too late anyway.
>>
>> The code should use a real resume method. Or, better yet, just do it
>> right, and do it from __restore_processor_state().
>>
I will produce a patch to use this function, it's simple enough.

>> Those f*cking CPU notifiers are a pain in the ass, and the tend to be
>> invariably broken, and they have their own idiotic hacks that are
>> equally broken (ie that x86_pmu_notifier() thing seems to make up its
>> own suspend/resume with
>> "x86_pmu.cpu_prepare/cpu_starting/cpu_dying/cpu_dead" things.
>>
>> I guess we could make the BP do a fake cpu notifier thing around the
>> suspend of the boot processor as well, but most of the per-CPU stuff
>> seems to be perfectly fine without it (ie mtrr, apic, etc etc all use
>> the suspend/resume infrastructure) and doesn't need that kind of
>> stuff.
>>
>> Linus