When large PEBS is enabled, the following warning is triggered:
[6070379.453697] WARNING: CPU: 23 PID: 42379 at arch/x86/events/core.c:1466 x86_pmu_stop+0x95/0xa0
...
[6070379.453831] Call Trace:
[6070379.453840] x86_pmu_del+0x50/0x150
[6070379.453845] event_sched_out.isra.0+0x95/0x200
[6070379.453848] group_sched_out.part.0+0x53/0xd0
[6070379.453851] __perf_event_disable+0xee/0x1e0
[6070379.453854] event_function+0x89/0xd0
[6070379.453859] remote_function+0x3e/0x50
[6070379.453866] generic_exec_single+0x91/0xd0
[6070379.453870] smp_call_function_single+0xd1/0x110
[6070379.453874] event_function_call+0x11c/0x130
[6070379.453877] ? task_ctx_sched_out+0x20/0x20
[6070379.453880] ? perf_mux_hrtimer_handler+0x370/0x370
[6070379.453882] ? event_function_call+0x130/0x130
[6070379.453886] perf_event_for_each_child+0x34/0x80
[6070379.453889] ? event_function_call+0x130/0x130
[6070379.453891] _perf_ioctl+0x24b/0x6a0
[6070379.453898] ? sched_setaffinity+0x1ad/0x2a0
[6070379.453904] ? _cond_resched+0x15/0x30
[6070379.453906] perf_ioctl+0x3d/0x60
[6070379.453912] ksys_ioctl+0x87/0xc0
[6070379.453917] __x64_sys_ioctl+0x16/0x20
[6070379.453923] do_syscall_64+0x52/0x180
[6070379.453928] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Commit 3966c3feca3f ("x86/perf/amd: Remove need to check "running"
bit in NMI handler") introduced this. It seems x86_pmu_stop() can be
called recursively (e.g. when it loses some samples) like below:
x86_pmu_stop
intel_pmu_disable_event (x86_pmu_disable)
intel_pmu_pebs_disable
intel_pmu_drain_pebs_buffer
x86_pmu_stop
It seems the change is only needed for AMD, so add a new bit that
controls when x86_pmu_stop() should clear the active mask.
Fixes: 3966c3feca3f ("x86/perf/amd: Remove need to check "running" bit in NMI handler")
Reported-by: John Sperbeck <[email protected]>
Cc: "Lendacky, Thomas" <[email protected]>
Signed-off-by: Namhyung Kim <[email protected]>
---
arch/x86/events/amd/core.c | 1 +
arch/x86/events/core.c | 9 +++++++--
arch/x86/events/perf_event.h | 3 ++-
3 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/events/amd/core.c b/arch/x86/events/amd/core.c
index 39eb276d0277..c545fbd423df 100644
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -927,6 +927,7 @@ static __initconst const struct x86_pmu amd_pmu = {
 	.max_period		= (1ULL << 47) - 1,
 	.get_event_constraints	= amd_get_event_constraints,
 	.put_event_constraints	= amd_put_event_constraints,
+	.late_nmi		= 1,
 
 	.format_attrs		= amd_format_attr,
 	.events_sysfs_show	= amd_event_sysfs_show,
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b802a778014..a6c12bd71e66 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -1514,8 +1514,13 @@ void x86_pmu_stop(struct perf_event *event, int flags)
 	struct hw_perf_event *hwc = &event->hw;
 
 	if (test_bit(hwc->idx, cpuc->active_mask)) {
-		x86_pmu.disable(event);
-		__clear_bit(hwc->idx, cpuc->active_mask);
+		if (x86_pmu.late_nmi) {
+			x86_pmu.disable(event);
+			__clear_bit(hwc->idx, cpuc->active_mask);
+		} else {
+			__clear_bit(hwc->idx, cpuc->active_mask);
+			x86_pmu.disable(event);
+		}
 		cpuc->events[hwc->idx] = NULL;
 		WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
 		hwc->state |= PERF_HES_STOPPED;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 10032f023fcc..1ffaa0fcd521 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -682,7 +682,8 @@ struct x86_pmu {
/* PMI handler bits */
unsigned int late_ack :1,
enabled_ack :1,
- counter_freezing :1;
+ counter_freezing :1,
+ late_nmi :1;
/*
* sysfs attrs
*/
--
2.29.2.454.gaff20da3a2-goog
On Sat, Nov 21, 2020 at 11:50:11AM +0900, Namhyung Kim wrote:
> When large PEBS is enabled, the below warning is triggered:
>
> [...]
>
> The commit 3966c3feca3f ("x86/perf/amd: Remove need to check "running"
> bit in NMI handler") introduced this. It seems x86_pmu_stop can be
> called recursively (like when it loses some samples) like below:
>
> x86_pmu_stop
> intel_pmu_disable_event (x86_pmu_disable)
> intel_pmu_pebs_disable
> intel_pmu_drain_pebs_buffer
> x86_pmu_stop
>
This shouldn't be possible; intel_pmu_drain_pebs_buffer() calls
drain_pebs(.iregs=NULL), which means that __intel_pmu_pebs_event()
should not end up calling x86_pmu_stop().
Are you running some old kernel?
Hi Peter,
On Mon, Nov 23, 2020 at 11:23 PM Peter Zijlstra <[email protected]> wrote:
>
> On Sat, Nov 21, 2020 at 11:50:11AM +0900, Namhyung Kim wrote:
> > When large PEBS is enabled, the below warning is triggered:
> >
> > [...]
> >
> > The commit 3966c3feca3f ("x86/perf/amd: Remove need to check "running"
> > bit in NMI handler") introduced this. It seems x86_pmu_stop can be
> > called recursively (like when it loses some samples) like below:
> >
> > x86_pmu_stop
> > intel_pmu_disable_event (x86_pmu_disable)
> > intel_pmu_pebs_disable
> > intel_pmu_drain_pebs_buffer
> > x86_pmu_stop
> >
>
> This shouldn't be possible; intel_pmu_drain_pebs_buffer() calls
> drain_pebs(.iregs=NULL), which means that __intel_pmu_pebs_event()
> should not end up calling x86_pmu_stop().
>
> Are you running some old kernel?
Well, it's actually 5.7.17 but I think the latest version has the same problem.
Yes, it's not about __intel_pmu_pebs_event(). I'm looking at
intel_pmu_drain_pebs_nhm() specifically. There's code like
	/* log dropped samples number */
	if (error[bit]) {
		perf_log_lost_samples(event, error[bit]);

		if (perf_event_account_interrupt(event))
			x86_pmu_stop(event, 0);
	}

	if (counts[bit]) {
		__intel_pmu_pebs_event(event, iregs, base,
				       top, bit, counts[bit],
				       setup_pebs_fixed_sample_data);
	}
There's a path to x86_pmu_stop() when an error bit is set.
Thanks,
Namhyung
Hi,
Another remark on the PEBS drainage code: it seems to me that one of
its tests is not quite correct:
intel_pmu_drain_pebs_nhm()
{
	...
	if (p->status != (1ULL << bit)) {
		for_each_set_bit(i, (unsigned long *)&pebs_status, size)
			error[i]++;
		continue;
	}
The kernel cannot disambiguate when 2+ PEBS counters overflow at the
same time. This is what the comment for this code suggests.
However, I see the comparison is done with the unfiltered p->status
which is a copy of IA32_PERF_GLOBAL_STATUS at the time of
the sample. This register contains more than the PEBS counter overflow
bits. It also includes many other bits which could also be set.
Shouldn't this test use pebs_status instead (which covers only the
PEBS counters)?
if (pebs_status != (1ULL << bit)) {
}
Or am I missing something?
Thanks.
On Tue, Nov 24, 2020 at 12:09 AM Peter Zijlstra <[email protected]> wrote:
>
> [...]
On Tue, Nov 24, 2020 at 02:01:39PM +0900, Namhyung Kim wrote:
> Yes, it's not about __intel_pmu_pebs_event(). I'm looking at
> intel_pmu_drain_pebs_nhm() specifically. There's code like
>
> 	/* log dropped samples number */
> 	if (error[bit]) {
> 		perf_log_lost_samples(event, error[bit]);
> 
> 		if (perf_event_account_interrupt(event))
> 			x86_pmu_stop(event, 0);
> 	}
> 
> 	if (counts[bit]) {
> 		__intel_pmu_pebs_event(event, iregs, base,
> 				       top, bit, counts[bit],
> 				       setup_pebs_fixed_sample_data);
> 	}
>
> There's a path to x86_pmu_stop() when an error bit is on.
That would seem to suggest you try something like this:
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 31b9e58b03fe..8c6ee8be8b6e 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1945,7 +1945,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
 		if (error[bit]) {
 			perf_log_lost_samples(event, error[bit]);
 
-			if (perf_event_account_interrupt(event))
+			if (iregs && perf_event_account_interrupt(event))
 				x86_pmu_stop(event, 0);
 		}
On Tue, Nov 24, 2020 at 5:10 PM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Nov 24, 2020 at 02:01:39PM +0900, Namhyung Kim wrote:
>
> > [...]
>
> That would seem to suggest you try something like this:
>
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 31b9e58b03fe..8c6ee8be8b6e 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1945,7 +1945,7 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
>  		if (error[bit]) {
>  			perf_log_lost_samples(event, error[bit]);
> 
> -			if (perf_event_account_interrupt(event))
> +			if (iregs && perf_event_account_interrupt(event))
>  				x86_pmu_stop(event, 0);
>  		}
>
That would work too, and it's much simpler!
Thanks,
Namhyung
On Tue, Nov 24, 2020 at 12:19:34AM -0800, Stephane Eranian wrote:
> Hi,
>
> Another remark on the PEBS drainage code, it seems to me like a test
> is not quite correct:
> [...]
>
> The kernel cannot disambiguate when 2+ PEBS counters overflow at the
> same time. This is what the comment for this code suggests.
> However, I see the comparison is done with the unfiltered p->status
> which is a copy of IA32_PERF_GLOBAL_STATUS at the time of
> the sample. This register contains more than the PEBS counter overflow
> bits. It also includes many other bits which could also be set.
>
> Shouldn't this test use pebs_status instead (which covers only the
> PEBS counters)?
Hmm, yes, think so.