On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
> With these patches syz-kaller can still trigger some fail; most notably some
> NMI watchdog triggers and a very sporadic unthrottle bug (much like last time).
So the below seems to make the sporadic unthrottle thing much less
likely in that I haven't seen it in several hours, my machine keeps
dying on NMI watchdog bits.
Boris, who has been running syz-kaller on AMD hardware and was hitting a
very similar bug with the AMD-IBS code, says its not fixed it for him,
so maybe there's still more to find.
---
Subject: perf: Fix unthrottle
Its possible to IOC_PERIOD while the event is throttled, this would
re-start the event and the next tick would then try to unthrottle it,
and find the event wasn't actually stopped anymore.
This would tickle a WARN in the x86-pmu code which isn't expecting to
start a !stopped event.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/events/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 712570dddacd..d39477390415 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4210,6 +4210,14 @@ static void __perf_event_period(struct perf_event *event,
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
perf_pmu_disable(ctx->pmu);
+ /*
+ * We could be throttled; unthrottle now to avoid the tick
+ * trying to unthrottle while we already re-started the event.
+ */
+ if (event->hw.interrupts == MAX_INTERRUPTS) {
+ event->hw.interrupts = 0;
+ perf_log_throttle(event, 1);
+ }
event->pmu->stop(event, PERF_EF_UPDATE);
}
On Thu, 10 Mar 2016, Peter Zijlstra wrote:
> On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
>
> Boris, who has been running syz-kaller on AMD hardware and was hitting a
> very similar bug with the AMD-IBS code, says its not fixed it for him,
> so maybe there's still more to find.
sorry I am not being much help with this big syz-kaller bug hunt, but I
just wanted to chime in that there's a (probably the same) long standing
IBS bug that perf_fuzzer hits all the time too, it was bad enough
that I had to stop running perf_fuzzer on my AMD box.
Vince
Peter Zijlstra <[email protected]> writes:
> On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
>
>> With these patches syz-kaller can still trigger some fail; most notably some
>> NMI watchdog triggers and a very sporadic unthrottle bug (much like last time).
>
> So the below seems to make the sporadic unthrottle thing much less
> likely in that I haven't seen it in several hours, my machine keeps
> dying on NMI watchdog bits.
>
> Boris, who has been running syz-kaller on AMD hardware and was hitting a
> very similar bug with the AMD-IBS code, says its not fixed it for him,
> so maybe there's still more to find.
>
> ---
> Subject: perf: Fix unthrottle
>
> Its possible to IOC_PERIOD while the event is throttled, this would
> re-start the event and the next tick would then try to unthrottle it,
> and find the event wasn't actually stopped anymore.
>
> This would tickle a WARN in the x86-pmu code which isn't expecting to
> start a !stopped event.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
FWIW,
Reviewed-by: Alexander Shishkin <[email protected]>
Cheers,
--
Alex
On Thu, Mar 10, 2016 at 09:44:30AM -0500, Vince Weaver wrote:
> On Thu, 10 Mar 2016, Peter Zijlstra wrote:
>
> > On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
> >
> > Boris, who has been running syz-kaller on AMD hardware and was hitting a
> > very similar bug with the AMD-IBS code, says its not fixed it for him,
> > so maybe there's still more to find.
>
> sorry I am not being much help with this big syz-kaller bug hunt, but I
> just wanted to chime in that there's a (probably the same) long standing
> IBS bug that perf_fuzzer hits all the time too, it was bad enough
> that I had to stop running perf_fuzzer on my AMD box.
The below seems to fix the IBS issue tickled by syz-kaller on my
machine. I've not yet ran perf-fuzzer, which seems able to tickle a
different set of bugs.
---
Subject: perf, amb: Fix IBS throttle
When the IBS IRQ handler get a !0 return from perf_event_overflow;
meaning it should throttle the event, it only disables it, it doesn't
call perf_ibs_stop().
This confuses the state machine, as we'll use pmu::start() ->
perf_ibs_start() to unthrottle.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/amd/ibs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 51087c29b2c2..7956d29762ef 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -599,7 +599,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
throttle = perf_event_overflow(event, &data, ®s);
out:
if (throttle)
- perf_ibs_disable_event(perf_ibs, hwc, *config);
+ perf_ibs_stop(event, 0);
else
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
On Fri, Mar 11, 2016 at 03:23:46PM +0100, Peter Zijlstra wrote:
> The below seems to fix the IBS issue tickled by syz-kaller on my
> machine. I've not yet ran perf-fuzzer, which seems able to tickle a
> different set of bugs.
>
> ---
> Subject: perf, amb: Fix IBS throttle
amd
but you know already.
>
> When the IBS IRQ handler get a !0 return from perf_event_overflow;
> meaning it should throttle the event, it only disables it, it doesn't
> call perf_ibs_stop().
>
> This confuses the state machine, as we'll use pmu::start() ->
> perf_ibs_start() to unthrottle.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Borislav Petkov <[email protected]>
I'll leave it running for an additional hour but if I haven't said
anything until then, all is good.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Thu, Mar 10, 2016 at 09:44:30AM -0500, Vince Weaver wrote:
> On Thu, 10 Mar 2016, Peter Zijlstra wrote:
>
> > On Wed, Feb 24, 2016 at 06:45:39PM +0100, Peter Zijlstra wrote:
> >
> > Boris, who has been running syz-kaller on AMD hardware and was hitting a
> > very similar bug with the AMD-IBS code, says its not fixed it for him,
> > so maybe there's still more to find.
>
> sorry I am not being much help with this big syz-kaller bug hunt, but I
> just wanted to chime in that there's a (probably the same) long standing
> IBS bug that perf_fuzzer hits all the time too, it was bad enough
> that I had to stop running perf_fuzzer on my AMD box.
Running perf_fuzzer on that AMD box is still producing lots of fail, I
seen long strings of dazed and confused msgs, indicating we have a
'spurious' NMI problem somewhere.
And occasionally it locks up..
So we're not there yet.
On Tue, Mar 15, 2016 at 04:38:30PM +0100, Peter Zijlstra wrote:
> Running perf_fuzzer on that AMD box is still producing lots of fail, I
> seen long strings of dazed and confused msgs, indicating we have a
> 'spurious' NMI problem somewhere.
>
> And occasionally it locks up..
>
> So we're not there yet.
So the below appears to alleviate some of this; but the hangs are
quicker now, so maybe I just made it worse.
---
Subject: perf, ibs: Fix race with IBS_STARTING state
From: Peter Zijlstra <[email protected]>
Date: Wed Mar 16 23:55:21 CET 2016
While tracing the IBS bits I saw the NMI hitting between clearing
IBS_STARTING and the actual MSR writes to disable the counter.
Since IBS_STARTING was cleared, the handler assumed these were spurious
NMIs and because STOPPING wasn't set yet either, insta-triggered an
"Unknown NMI".
Cure this by clearing IBS_STARTING after disabling the hardware.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
arch/x86/events/amd/ibs.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -376,7 +376,13 @@ static void perf_ibs_start(struct perf_e
hwc->state = 0;
perf_ibs_set_period(perf_ibs, hwc, &period);
+ /*
+ * Set STARTED before enabling the hardware, such that
+ * a subsequent NMI must observe it. Then clear STOPPING
+ * such that we don't consume NMIs by accident.
+ */
set_bit(IBS_STARTED, pcpu->state);
+ clear_bit(IBS_STOPPING, pcpu->state);
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
perf_event_update_userpage(event);
@@ -390,7 +396,7 @@ static void perf_ibs_stop(struct perf_ev
u64 config;
int stopping;
- stopping = test_and_clear_bit(IBS_STARTED, pcpu->state);
+ stopping = test_bit(IBS_STARTED, pcpu->state);
if (!stopping && (hwc->state & PERF_HES_UPTODATE))
return;
@@ -398,8 +404,24 @@ static void perf_ibs_stop(struct perf_ev
rdmsrl(hwc->config_base, config);
if (stopping) {
+ /*
+ * Set STOPPING before disabling the hardware, such that it
+ * must be visible to NMIs the moment we clear the EN bit,
+ * at which point we can generate an !VALID sample which
+ * we need to consume.
+ */
set_bit(IBS_STOPPING, pcpu->state);
perf_ibs_disable_event(perf_ibs, hwc, config);
+ /*
+ * Clear STARTED after disabling the hardware; if it were
+ * cleared before an NMI hitting after the clear but before
+ * clearing the EN bit might think it a spurious NMI and not
+ * handle it.
+ *
+ * Clearing it after, however, creates the problem of the NMI
+ * handler seeing STARTED but not having a valid sample.
+ */
+ clear_bit(IBS_STARTED, pcpu->state);
WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
hwc->state |= PERF_HES_STOPPED;
}
@@ -527,20 +549,24 @@ static int perf_ibs_handle_irq(struct pe
u64 *buf, *config, period;
if (!test_bit(IBS_STARTED, pcpu->state)) {
+fail:
/*
* Catch spurious interrupts after stopping IBS: After
* disabling IBS there could be still incoming NMIs
* with samples that even have the valid bit cleared.
* Mark all this NMIs as handled.
*/
- return test_and_clear_bit(IBS_STOPPING, pcpu->state) ? 1 : 0;
+ if (test_and_clear_bit(IBS_STOPPING, pcpu->state))
+ return 1;
+
+ return 0;
}
msr = hwc->config_base;
buf = ibs_data.regs;
rdmsrl(msr, *buf);
if (!(*buf++ & perf_ibs->valid_mask))
- return 0;
+ goto fail;
config = &ibs_data.regs[0];
perf_ibs_event_update(perf_ibs, event, config);
On Wed, Mar 16, 2016 at 11:59:33PM +0100, Peter Zijlstra wrote:
> Subject: perf, ibs: Fix race with IBS_STARTING state
> From: Peter Zijlstra <[email protected]>
> Date: Wed Mar 16 23:55:21 CET 2016
>
> While tracing the IBS bits I saw the NMI hitting between clearing
> IBS_STARTING and the actual MSR writes to disable the counter.
>
> Since IBS_STARTING was cleared, the handler assumed these were spurious
> NMIs and because STOPPING wasn't set yet either, insta-triggered an
> "Unknown NMI".
>
> Cure this by clearing IBS_STARTING after disabling the hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> ---
> arch/x86/events/amd/ibs.c | 32 +++++++++++++++++++++++++++++---
> 1 file changed, 29 insertions(+), 3 deletions(-)
>
> --- a/arch/x86/events/amd/ibs.c
> +++ b/arch/x86/events/amd/ibs.c
> @@ -376,7 +376,13 @@ static void perf_ibs_start(struct perf_e
> hwc->state = 0;
>
> perf_ibs_set_period(perf_ibs, hwc, &period);
> + /*
> + * Set STARTED before enabling the hardware, such that
> + * a subsequent NMI must observe it. Then clear STOPPING
> + * such that we don't consume NMIs by accident.
> + */
> set_bit(IBS_STARTED, pcpu->state);
> + clear_bit(IBS_STOPPING, pcpu->state);
> perf_ibs_enable_event(perf_ibs, hwc, period >> 4);
Also, all those atomic ops are probably entirely overkill and we could
use the non-atomic ops. This is all strictly cpu local. But I didn't
want to change too much at once, esp. while there's still problems.
On Wed, Mar 16, 2016 at 11:59:33PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 04:38:30PM +0100, Peter Zijlstra wrote:
>
> > Running perf_fuzzer on that AMD box is still producing lots of fail, I
> > seen long strings of dazed and confused msgs, indicating we have a
> > 'spurious' NMI problem somewhere.
> >
> > And occasionally it locks up..
> >
> > So we're not there yet.
>
> So the below appears to alleviate some of this; but the hangs are
> quicker now, so maybe I just made it worse.
>
> ---
> Subject: perf, ibs: Fix race with IBS_STARTING state
> From: Peter Zijlstra <[email protected]>
> Date: Wed Mar 16 23:55:21 CET 2016
>
> While tracing the IBS bits I saw the NMI hitting between clearing
> IBS_STARTING and the actual MSR writes to disable the counter.
>
> Since IBS_STARTING was cleared, the handler assumed these were spurious
> NMIs and because STOPPING wasn't set yet either, insta-triggered an
> "Unknown NMI".
>
> Cure this by clearing IBS_STARTING after disabling the hardware.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
Commit-ID: 1e02cd40f15190b78fcc6b3f50c952fb4028e9a5
Gitweb: http://git.kernel.org/tip/1e02cd40f15190b78fcc6b3f50c952fb4028e9a5
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 10 Mar 2016 15:39:24 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:14 +0100
perf/core: Fix the unthrottle logic
Its possible to IOC_PERIOD while the event is throttled, this would
re-start the event and the next tick would then try to unthrottle it,
and find the event wasn't actually stopped anymore.
This would tickle a WARN in the x86-pmu code which isn't expecting to
start a !stopped event.
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/events/core.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 712570d..d394773 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4210,6 +4210,14 @@ static void __perf_event_period(struct perf_event *event,
active = (event->state == PERF_EVENT_STATE_ACTIVE);
if (active) {
perf_pmu_disable(ctx->pmu);
+ /*
+ * We could be throttled; unthrottle now to avoid the tick
+ * trying to unthrottle while we already re-started the event.
+ */
+ if (event->hw.interrupts == MAX_INTERRUPTS) {
+ event->hw.interrupts = 0;
+ perf_log_throttle(event, 1);
+ }
event->pmu->stop(event, PERF_EF_UPDATE);
}
Commit-ID: 0158b83f7508851cda9e62c4b14678c5bd492cb2
Gitweb: http://git.kernel.org/tip/0158b83f7508851cda9e62c4b14678c5bd492cb2
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 11 Mar 2016 15:23:46 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Mon, 21 Mar 2016 09:08:15 +0100
perf/x86/ibs: Fix IBS throttle
When the IBS IRQ handler get a !0 return from perf_event_overflow;
meaning it should throttle the event, it only disables it, it doesn't
call perf_ibs_stop().
This confuses the state machine, as we'll use pmu::start() ->
perf_ibs_start() to unthrottle.
Tested-by: Borislav Petkov <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Denys Vlasenko <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: Vince Weaver <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/events/amd/ibs.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/events/amd/ibs.c b/arch/x86/events/amd/ibs.c
index 51087c2..7956d29 100644
--- a/arch/x86/events/amd/ibs.c
+++ b/arch/x86/events/amd/ibs.c
@@ -599,7 +599,7 @@ static int perf_ibs_handle_irq(struct perf_ibs *perf_ibs, struct pt_regs *iregs)
throttle = perf_event_overflow(event, &data, ®s);
out:
if (throttle)
- perf_ibs_disable_event(perf_ibs, hwc, *config);
+ perf_ibs_stop(event, 0);
else
perf_ibs_enable_event(perf_ibs, hwc, period >> 4);