Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S935252AbcCPW75 (ORCPT ); Wed, 16 Mar 2016 18:59:57 -0400 Received: from bombadil.infradead.org ([198.137.202.9]:57421 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934945AbcCPW7z (ORCPT ); Wed, 16 Mar 2016 18:59:55 -0400 Date: Wed, 16 Mar 2016 23:59:33 +0100 From: Peter Zijlstra To: Vince Weaver Cc: mingo@kernel.org, alexander.shishkin@linux.intel.com, eranian@google.com, linux-kernel@vger.kernel.org, dvyukov@google.com, andi@firstfloor.org, jolsa@redhat.com, panand@redhat.com, sasha.levin@oracle.com, oleg@redhat.com, Borislav Petkov Subject: Re: [PATCH 00/12] perf: more fixes Message-ID: <20160316225933.GS6375@twins.programming.kicks-ass.net> References: <20160224174539.570749654@infradead.org> <20160310143924.GR6356@twins.programming.kicks-ass.net> <20160315153830.GA6356@twins.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160315153830.GA6356@twins.programming.kicks-ass.net> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3544 Lines: 109 On Tue, Mar 15, 2016 at 04:38:30PM +0100, Peter Zijlstra wrote: > Running perf_fuzzer on that AMD box is still producing lots of fail, I > seen long strings of dazed and confused msgs, indicating we have a > 'spurious' NMI problem somewhere. > > And occasionally it locks up.. > > So we're not there yet. So the below appears to alleviate some of this; but the hangs are quicker now, so maybe I just made it worse. --- Subject: perf, ibs: Fix race with IBS_STARTING state From: Peter Zijlstra Date: Wed Mar 16 23:55:21 CET 2016 While tracing the IBS bits I saw the NMI hitting between clearing IBS_STARTING and the actual MSR writes to disable the counter. Since IBS_STARTING was cleared, the handler assumed these were spurious NMIs and because STOPPING wasn't set yet either, insta-triggered an "Unknown NMI". Cure this by clearing IBS_STARTING after disabling the hardware. Signed-off-by: Peter Zijlstra (Intel) --- arch/x86/events/amd/ibs.c | 32 +++++++++++++++++++++++++++++--- 1 file changed, 29 insertions(+), 3 deletions(-) --- a/arch/x86/events/amd/ibs.c +++ b/arch/x86/events/amd/ibs.c @@ -376,7 +376,13 @@ static void perf_ibs_start(struct perf_e hwc->state = 0; perf_ibs_set_period(perf_ibs, hwc, &period); + /* + * Set STARTED before enabling the hardware, such that + * a subsequent NMI must observe it. Then clear STOPPING + * such that we don't consume NMIs by accident. + */ set_bit(IBS_STARTED, pcpu->state); + clear_bit(IBS_STOPPING, pcpu->state); perf_ibs_enable_event(perf_ibs, hwc, period >> 4); perf_event_update_userpage(event); @@ -390,7 +396,7 @@ static void perf_ibs_stop(struct perf_ev u64 config; int stopping; - stopping = test_and_clear_bit(IBS_STARTED, pcpu->state); + stopping = test_bit(IBS_STARTED, pcpu->state); if (!stopping && (hwc->state & PERF_HES_UPTODATE)) return; @@ -398,8 +404,24 @@ static void perf_ibs_stop(struct perf_ev rdmsrl(hwc->config_base, config); if (stopping) { + /* + * Set STOPPING before disabling the hardware, such that it + * must be visible to NMIs the moment we clear the EN bit, + * at which point we can generate an !VALID sample which + * we need to consume. + */ set_bit(IBS_STOPPING, pcpu->state); perf_ibs_disable_event(perf_ibs, hwc, config); + /* + * Clear STARTED after disabling the hardware; if it were + * cleared before an NMI hitting after the clear but before + * clearing the EN bit might think it a spurious NMI and not + * handle it. + * + * Clearing it after, however, creates the problem of the NMI + * handler seeing STARTED but not having a valid sample. + */ + clear_bit(IBS_STARTED, pcpu->state); WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED); hwc->state |= PERF_HES_STOPPED; } @@ -527,20 +549,24 @@ static int perf_ibs_handle_irq(struct pe u64 *buf, *config, period; if (!test_bit(IBS_STARTED, pcpu->state)) { +fail: /* * Catch spurious interrupts after stopping IBS: After * disabling IBS there could be still incoming NMIs * with samples that even have the valid bit cleared. * Mark all this NMIs as handled. */ - return test_and_clear_bit(IBS_STOPPING, pcpu->state) ? 1 : 0; + if (test_and_clear_bit(IBS_STOPPING, pcpu->state)) + return 1; + + return 0; } msr = hwc->config_base; buf = ibs_data.regs; rdmsrl(msr, *buf); if (!(*buf++ & perf_ibs->valid_mask)) - return 0; + goto fail; config = &ibs_data.regs[0]; perf_ibs_event_update(perf_ibs, event, config);