2016-03-10 14:39:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:

> With these patches syz-kaller can still trigger some fail; most notably some
> NMI watchdog triggers and a very sporadic unthrottle bug (much like last time).

So the below seems to make the sporadic unthrottle thing much less
likely in that I haven't seen it in several hours, my machine keeps
dying on NMI watchdog bits.

Boris, who has been running syz-kaller on AMD hardware and was hitting a
very similar bug with the AMD-IBS code, says its not fixed it for him,
so maybe there's still more to find.

---
Subject: perf: Fix unthrottle

Its possible to IOC_PERIOD while the event is throttled, this would
re-start the event and the next tick would then try to unthrottle it,
and find the event wasn't actually stopped anymore.

This would tickle a WARN in the x86-pmu code which isn't expecting to
start a !stopped event.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 712570dddacd..d39477390415 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4210,6 +4210,14 @@ static void __perf_event_period(struct perf_event *event,
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
perf_pmu_disable(ctx->pmu);
+ /*
+ * We could be throttled; unthrottle now to avoid the tick
+ * trying to unthrottle while we already re-started the event.
+ */
+ if (event->hw.interrupts == MAX_INTERRUPTS) {
+ event->hw.interrupts = 0;
+ perf_log_throttle(event, 1);
+ }
event->pmu->stop(event, PERF_EF_UPDATE);
}




2016-03-10 14:43:27

by Vince Weaver

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Thu, 10 Mar 2016, Peter Zijlstra wrote:

> On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
>
> Boris, who has been running syz-kaller on AMD hardware and was hitting a
> very similar bug with the AMD-IBS code, says its not fixed it for him,
> so maybe there's still more to find.

sorry I am not being much help with this big syz-kaller bug hunt, but I
just wanted to chime in that there's a (probably the same) long standing
IBS bug that perf_fuzzer hits all the time too, it was bad enough
that I had to stop running perf_fuzzer on my AMD box.

Vince

2016-03-11 10:13:26

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

Peter Zijlstra <[email protected]> writes:

> On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
>
>> With these patches syz-kaller can still trigger some fail; most notably some
>> NMI watchdog triggers and a very sporadic unthrottle bug (much like last time).
>
> So the below seems to make the sporadic unthrottle thing much less
> likely in that I haven't seen it in several hours, my machine keeps
> dying on NMI watchdog bits.
>
> Boris, who has been running syz-kaller on AMD hardware and was hitting a
> very similar bug with the AMD-IBS code, says its not fixed it for him,
> so maybe there's still more to find.
>
> ---
> Subject: perf: Fix unthrottle
>
> Its possible to IOC_PERIOD while the event is throttled, this would
> re-start the event and the next tick would then try to unthrottle it,
> and find the event wasn't actually stopped anymore.
>
> This would tickle a WARN in the x86-pmu code which isn't expecting to
> start a !stopped event.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

FWIW,

Reviewed-by: Alexander Shishkin <[email protected]>

Cheers,
--
Alex

2016-03-11 14:24:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Thu, Mar 10, 2016 at 09:44:30AM -0500, Vince Weaver wrote:
> On Thu, 10 Mar 2016, Peter Zijlstra wrote:
>
> > On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
> >
> > Boris, who has been running syz-kaller on AMD hardware and was hitting a
> > very similar bug with the AMD-IBS code, says its not fixed it for him,
> > so maybe there's still more to find.
>
> sorry I am not being much help with this big syz-kaller bug hunt, but I
> just wanted to chime in that there's a (probably the same) long standing
> IBS bug that perf_fuzzer hits all the time too, it was bad enough
> that I had to stop running perf_fuzzer on my AMD box.

The below seems to fix the IBS issue tickled by syz-kaller on my
machine. I've not yet ran perf-fuzzer, which seems able to tickle a
different set of bugs.

---
Subject: perf, amb: Fix IBS throttle

When the IBS IRQ handler get a !0 return from perf_event_overflow;
meaning it should throttle the event, it only disables it, it doesn't
call perf_ibs_stop().

This confuses the state machine, as we'll use pmu::start() ->
perf_ibs_start() to unthrottle.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/amd/ibs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 51087c29b2c2..7956d29762ef 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -599,7 +599,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
throttle = perf_event_overflow(event, &data, &regs);
out:
if (throttle)
- perf_ibs_disable_event(perf_ibs, hwc, *config);
+ perf_ibs_stop(event, 0);
else
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);


2016-03-11 15:42:14

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Fri, Mar 11, 2016 at 03:23:46PM +0100, Peter Zijlstra wrote:
> The below seems to fix the IBS issue tickled by syz-kaller on my
> machine. I've not yet ran perf-fuzzer, which seems able to tickle a
> different set of bugs.
>
> ---
> Subject: perf, amb: Fix IBS throttle

amd

but you know already.

>
> When the IBS IRQ handler get a !0 return from perf_event_overflow;
> meaning it should throttle the event, it only disables it, it doesn't
> call perf_ibs_stop().
>
> This confuses the state machine, as we'll use pmu::start() ->
> perf_ibs_start() to unthrottle.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Tested-by: Borislav Petkov <[email protected]>

I'll leave it running for an additional hour but if I haven't said
anything until then, all is good.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2016-03-15 15:40:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Thu, Mar 10, 2016 at 09:44:30AM -0500, Vince Weaver wrote:
> On Thu, 10 Mar 2016, Peter Zijlstra wrote:
>
> > On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
> >
> > Boris, who has been running syz-kaller on AMD hardware and was hitting a
> > very similar bug with the AMD-IBS code, says its not fixed it for him,
> > so maybe there's still more to find.
>
> sorry I am not being much help with this big syz-kaller bug hunt, but I
> just wanted to chime in that there's a (probably the same) long standing
> IBS bug that perf_fuzzer hits all the time too, it was bad enough
> that I had to stop running perf_fuzzer on my AMD box.

Running perf_fuzzer on that AMD box is still producing lots of fail, I
seen long strings of dazed and confused msgs, indicating we have a
'spurious' NMI problem somewhere.

And occasionally it locks up..

So we're not there yet.

2016-03-16 22:59:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Tue, Mar 15, 2016 at 04:38:30PM +0100, Peter Zijlstra wrote:

> Running perf_fuzzer on that AMD box is still producing lots of fail, I
> seen long strings of dazed and confused msgs, indicating we have a
> 'spurious' NMI problem somewhere.
>
> And occasionally it locks up..
>
> So we're not there yet.

So the below appears to alleviate some of this; but the hangs are
quicker now, so maybe I just made it worse.

---
Subject: perf, ibs: Fix race with IBS_STARTING state
From: Peter Zijlstra <[email protected]>
Date: Wed Mar 16 23:55:21 CET 2016

While tracing the IBS bits I saw the NMI hitting between clearing
IBS_STARTING and the actual MSR writes to disable the counter.

Since IBS_STARTING was cleared, the handler assumed these were spurious
NMIs and because STOPPING wasn't set yet either, insta-triggered an
"Unknown NMI".

Cure this by clearing IBS_STARTING after disabling the hardware.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/amd/ibs.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)

--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -376,7 +376,13 @@ static void perf_ibs_start(struct perf_e
hwc->state = 0;

perf_ibs_set_period(perf_ibs, hwc, &period);
+ /*
+ * Set STARTED before enabling the hardware, such that
+ * a subsequent NMI must observe it. Then clear STOPPING
+ * such that we don't consume NMIs by accident.
+ */
set_bit(IBS_STARTED, pcpu->state);
+ clear_bit(IBS_STOPPING, pcpu->state);
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);

perf_event_update_userpage(event);
@@ -390,7 +396,7 @@ static void perf_ibs_stop(struct perf_ev
u64 config;
int stopping;

- stopping = test_and_clear_bit(IBS_STARTED, pcpu->state);
+ stopping = test_bit(IBS_STARTED, pcpu->state);

if (!stopping && (hwc->state & PERF_HES_UPTODATE))
return;
@@ -398,8 +404,24 @@ static void perf_ibs_stop(struct perf_ev
rdmsrl(hwc->config_base, config);

if (stopping) {
+ /*
+ * Set STOPPING before disabling the hardware, such that it
+ * must be visible to NMIs the moment we clear the EN bit,
+ * at which point we can generate an !VALID sample which
+ * we need to consume.
+ */
set_bit(IBS_STOPPING, pcpu->state);
perf_ibs_disable_event(perf_ibs, hwc, config);
+ /*
+ * Clear STARTED after disabling the hardware; if it were
+ * cleared before an NMI hitting after the clear but before
+ * clearing the EN bit might think it a spurious NMI and not
+ * handle it.
+ *
+ * Clearing it after, however, creates the problem of the NMI
+ * handler seeing STARTED but not having a valid sample.
+ */
+ clear_bit(IBS_STARTED, pcpu->state);
WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
hwc->state |= PERF_HES_STOPPED;
}
@@ -527,20 +549,24 @@ static int perf_ibs_handle_irq(struct pe
u64 *buf, *config, period;

if (!test_bit(IBS_STARTED, pcpu->state)) {
+fail:
/*
* Catch spurious interrupts after stopping IBS: After
* disabling IBS there could be still incoming NMIs
* with samples that even have the valid bit cleared.
* Mark all this NMIs as handled.
*/
- return test_and_clear_bit(IBS_STOPPING, pcpu->state) ? 1 : 0;
+ if (test_and_clear_bit(IBS_STOPPING, pcpu->state))
+ return 1;
+
+ return 0;
}

msr = hwc->config_base;
buf = ibs_data.regs;
rdmsrl(msr, *buf);
if (!(*buf++ & perf_ibs->valid_mask))
- return 0;
+ goto fail;

config = &ibs_data.regs[0];
perf_ibs_event_update(perf_ibs, event, config);

2016-03-16 23:11:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Wed, Mar 16, 2016 at 11:59:33PM +0100, Peter Zijlstra wrote:
> Subject: perf, ibs: Fix race with IBS_STARTING state
> From: Peter Zijlstra <[email protected]>
> Date: Wed Mar 16 23:55:21 CET 2016
>
> While tracing the IBS bits I saw the NMI hitting between clearing
> IBS_STARTING and the actual MSR writes to disable the counter.
>
> Since IBS_STARTING was cleared, the handler assumed these were spurious
> NMIs and because STOPPING wasn't set yet either, insta-triggered an
> "Unknown NMI".
>
> Cure this by clearing IBS_STARTING after disabling the hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/events/amd/ibs.c | 32 +++++++++++++++++++++++++++++---
> 1 file changed, 29 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/events/amd/ibs.c
> +++ b/arch/x86/events/amd/ibs.c
> @@ -376,7 +376,13 @@ static void perf_ibs_start(struct perf_e
> hwc->state = 0;
>
> perf_ibs_set_period(perf_ibs, hwc, &period);
> + /*
> + * Set STARTED before enabling the hardware, such that
> + * a subsequent NMI must observe it. Then clear STOPPING
> + * such that we don't consume NMIs by accident.
> + */
> set_bit(IBS_STARTED, pcpu->state);
> + clear_bit(IBS_STOPPING, pcpu->state);
> perf_ibs_enable_event(perf_ibs, hwc, period >> 4);

Also, all those atomic ops are probably entirely overkill and we could
use the non-atomic ops. This is all strictly cpu local. But I didn't
want to change too much at once, esp. while there's still problems.

2016-03-17 11:54:36

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 00/12] perf: more fixes

On Wed, Mar 16, 2016 at 11:59:33PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 04:38:30PM +0100, Peter Zijlstra wrote:
>
> > Running perf_fuzzer on that AMD box is still producing lots of fail, I
> > seen long strings of dazed and confused msgs, indicating we have a
> > 'spurious' NMI problem somewhere.
> >
> > And occasionally it locks up..
> >
> > So we're not there yet.
>
> So the below appears to alleviate some of this; but the hangs are
> quicker now, so maybe I just made it worse.
>
> ---
> Subject: perf, ibs: Fix race with IBS_STARTING state
> From: Peter Zijlstra <[email protected]>
> Date: Wed Mar 16 23:55:21 CET 2016
>
> While tracing the IBS bits I saw the NMI hitting between clearing
> IBS_STARTING and the actual MSR writes to disable the counter.
>
> Since IBS_STARTING was cleared, the handler assumed these were spurious
> NMIs and because STOPPING wasn't set yet either, insta-triggered an
> "Unknown NMI".
>
> Cure this by clearing IBS_STARTING after disabling the hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Tested-by: Borislav Petkov <[email protected]>

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Subject: [tip:perf/urgent] perf/core: Fix the unthrottle logic

Commit-ID: 1e02cd40f15190b78fcc6b3f50c952fb4028e9a5
Gitweb: http://git.kernel.org/tip/1e02cd40f15190b78fcc6b3f50c952fb4028e9a5
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:39:24 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:14 +0100

perf/core: Fix the unthrottle logic

Its possible to IOC_PERIOD while the event is throttled, this would
re-start the event and the next tick would then try to unthrottle it,
and find the event wasn't actually stopped anymore.

This would tickle a WARN in the x86-pmu code which isn't expecting to
start a !stopped event.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 712570d..d394773 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4210,6 +4210,14 @@ static void __perf_event_period(struct perf_event *event,
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
perf_pmu_disable(ctx->pmu);
+ /*
+ * We could be throttled; unthrottle now to avoid the tick
+ * trying to unthrottle while we already re-started the event.
+ */
+ if (event->hw.interrupts == MAX_INTERRUPTS) {
+ event->hw.interrupts = 0;
+ perf_log_throttle(event, 1);
+ }
event->pmu->stop(event, PERF_EF_UPDATE);
}


Subject: [tip:perf/urgent] perf/x86/ibs: Fix IBS throttle

Commit-ID: 0158b83f7508851cda9e62c4b14678c5bd492cb2
Gitweb: http://git.kernel.org/tip/0158b83f7508851cda9e62c4b14678c5bd492cb2
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 11 Mar 2016 15:23:46 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:15 +0100

perf/x86/ibs: Fix IBS throttle

When the IBS IRQ handler get a !0 return from perf_event_overflow;
meaning it should throttle the event, it only disables it, it doesn't
call perf_ibs_stop().

This confuses the state machine, as we'll use pmu::start() ->
perf_ibs_start() to unthrottle.

Tested-by: Borislav Petkov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/amd/ibs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 51087c2..7956d29 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -599,7 +599,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
throttle = perf_event_overflow(event, &data, &regs);
out:
if (throttle)
- perf_ibs_disable_event(perf_ibs, hwc, *config);
+ perf_ibs_stop(event, 0);
else
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);