Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751427Ab3IJL2r (ORCPT ); Tue, 10 Sep 2013 07:28:47 -0400 Received: from mail-ie0-f173.google.com ([209.85.223.173]:42034 "EHLO mail-ie0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751069Ab3IJL2p (ORCPT ); Tue, 10 Sep 2013 07:28:45 -0400 MIME-Version: 1.0 Reply-To: eranian@gmail.com In-Reply-To: <20130909100544.GI31370@twins.programming.kicks-ass.net> References: <20130903132933.GA24955@gmail.com> <20130909100544.GI31370@twins.programming.kicks-ass.net> Date: Tue, 10 Sep 2013 13:28:44 +0200 Message-ID: Subject: Re: [GIT PULL] perf changes for v3.12 From: Stephane Eranian To: Peter Zijlstra Cc: Linus Torvalds , Ingo Molnar , Linux Kernel Mailing List , Arnaldo Carvalho de Melo , Thomas Gleixner , Andi Kleen Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 11464 Lines: 281 Hi, And what was the perf record command line for this crash? On Mon, Sep 9, 2013 at 12:05 PM, Peter Zijlstra wrote: > On Sat, Sep 07, 2013 at 07:17:28PM -0700, Linus Torvalds wrote: >> This is new for me, but I suspect it is more related to the new >> Haswell CPU I have than necessarily the 3.12 perf pull request. >> >> Regardless, nothing bad happened, but my dmesg has this in it: >> >> Unexpected number of pebs records 10 >> >> when I was profiling some git workloads. Full trace appended. > >> ------------[ cut here ]------------ >> WARNING: CPU: 1 PID: 3270 at >> arch/x86/kernel/cpu/perf_event_intel_ds.c:1003 >> intel_pmu_drain_pebs_hsw+0x91/0xa0() >> Unexpected number of pebs records 10 >> Modules linked in: fuse ip6t_REJECT nf_conntrack_ipv4 >> nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack >> ip6table_filter nf_conntrack ip6_tables snd_hda_codec_realtek >> snd_hda_codec_hdmi snd_hda_intel snd_hda_codec snd_hwdep snd_seq >> snd_seq_device snd_pcm e1000e serio_raw snd_page_alloc snd_timer ptp >> lpc_ich snd i2c_i801 pps_core mfd_core soundcore uinput nfsd >> auth_rpcgss nfs_acl lockd sunrpc binfmt_misc hid_logitech_dj i915 >> i2c_algo_bit drm_kms_helper drm i2c_core video >> CPU: 1 PID: 3270 Comm: git Not tainted 3.11.0-07749-ge7d33bb5ea82-dirty #4 >> Hardware name: /DH87RL, BIOS >> RLH8710H.86A.0317.2013.0426.1724 04/26/2013 >> 0000000000000009 ffff88041fa46ca0 ffffffff815fc637 ffff88041fa46ce8 >> ffff88041fa46cd8 ffffffff81051e78 ffff88040f42f000 ffff88040f42f780 >> ffff88041fa46ef8 ffffffff81c15860 0000000000000002 ffff88041fa46d38 >> Call Trace: >> [] dump_stack+0x45/0x56 >> [] warn_slowpath_common+0x78/0xa0 >> [] warn_slowpath_fmt+0x47/0x50 >> [] intel_pmu_drain_pebs_hsw+0x91/0xa0 >> [] intel_pmu_handle_irq+0x210/0x390 >> [] perf_event_nmi_handler+0x2b/0x50 >> [] nmi_handle.isra.3+0x80/0x180 >> [] do_nmi+0xd0/0x310 >> [] end_repeat_nmi+0x1e/0x2e >> <> [] perf_events_lapic_init+0x2f/0x40 >> [] x86_pmu_enable+0x260/0x310 >> [] perf_pmu_enable+0x27/0x30 >> [] perf_event_context_sched_in+0x80/0xc0 >> [] __perf_event_task_sched_in+0x16b/0x180 >> [] finish_task_switch+0x70/0xa0 >> [] __schedule+0x368/0x7c0 >> [] schedule+0x24/0x70 >> [] retint_careful+0xd/0x21 >> ---[ end trace 52dad7a9d8d96b5f ]--- > > So I'm not entirely sure how this can happen. We do _not_ set > ds->pebs_event_reset[] so once a PEBS assist happens we do not > automatically re-engage. We wait for the PMI to happen, consume the data > and reprogram the hardware. > > We allow for up to x86_pmu.max_pebs_events (4 on hsw afaik) of PEBS > entries to be in the buffer so each PEBS capable counter can trigger the > assist once while we 'wait' for the PMI to hit. > > There's a little 'funny' with the hsw PEBS code in that it doesn't > unconditionally clear the buffer in intel_pmu_drain_pebs_hsw() but that > would only make a difference if n actually was < 0 -- again something > that shouldn't ever happen. > > That said, there's two peculiar things about your report. First and > foremost HSW; I'm not at all sure how it deals with a transaction being > undone if a PEBS assist had happened during it. Does it reset the > hardware but not remove the data? If so, this could cause excess > entries. > > Stephane, Andi, it this at all possible?, section 18.10.5.1 (June 2013) > seems to suggest not; the way I read it PEBS assist will cause an abort > (and the PMI certainly would) and we'll process the entire thing outside > of transaction context. > > That said, 18.10.5.1 mentions that entries from inside a transaction > will only have 0xb0 and 0xb8 aka perf_record::real_ip and > perf_record::tsx_tuning set, I don't think __intel_pmu_drain_pebs_nhm() > will do the right thing for them. In particular, when no > perf_record::status bit is set we'll drop the event on the ground. > > Dropping them means we do not do intel_pmu_save_and_restart() to > reprogram the hardware and the counter would get 'stuck'. Then again, > without a status bit we have no idea what event to re-program so we're > pretty much up shit creek without no stinkin' paddle. > > Another possibility is that the meaning of pebs_record::status has > changed -- it was 'broken' and Peggy agreed and said she would look into > it, not sure this has been 'fixed' but I couldn't find anything on it. > That said, I think the code always assumed it was 'sane' so it having > been fixed shouldn't matter. > > Secondly you seem to have triggered the nested NMI replay path. I'm not > immediately sure how that could have messed up the PEBS state, but it > sure is a complication. > > I think we can do something like the below as a pure code reduction -- > completely untested: > > --- > arch/x86/kernel/cpu/perf_event_intel_ds.c | 109 +++++++++--------------------- > 1 file changed, 32 insertions(+), 77 deletions(-) > > diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c > index 63438aa..bfda50e 100644 > --- a/arch/x86/kernel/cpu/perf_event_intel_ds.c > +++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c > @@ -178,20 +178,14 @@ struct pebs_record_nhm { > u64 status, dla, dse, lat; > }; > > -/* > - * Same as pebs_record_nhm, with two additional fields. > - */ > struct pebs_record_hsw { > - struct pebs_record_nhm nhm; > - /* > - * Real IP of the event. In the Intel documentation this > - * is called eventingrip. > - */ > - u64 real_ip; > - /* > - * TSX tuning information field: abort cycles and abort flags. > - */ > - u64 tsx_tuning; > + u64 flags, ip; > + u64 ax, bx, cx, dx; > + u64 si, di, bp, sp; > + u64 r8, r9, r10, r11; > + u64 r12, r13, r14, r15; > + u64 status, dla, dse, lat; > + u64 real_ip, tsx_tuning; > }; > > void init_debug_store_on_cpu(int cpu) > @@ -789,12 +783,11 @@ static void __intel_pmu_pebs_event(struct perf_event *event, > struct pt_regs *iregs, void *__pebs) > { > /* > - * We cast to pebs_record_nhm to get the load latency data > - * if extra_reg MSR_PEBS_LD_LAT_THRESHOLD used > + * We cast to the biggest pebs_record but are careful not to > + * unconditionally access the 'extra' entries. > */ > struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > - struct pebs_record_nhm *pebs = __pebs; > - struct pebs_record_hsw *pebs_hsw = __pebs; > + struct pebs_record_hsw *pebs = __pebs; > struct perf_sample_data data; > struct pt_regs regs; > u64 sample_type; > @@ -853,7 +846,7 @@ static void __intel_pmu_pebs_event(struct perf_event *event, > regs.sp = pebs->sp; > > if (event->attr.precise_ip > 1 && x86_pmu.intel_cap.pebs_format >= 2) { > - regs.ip = pebs_hsw->real_ip; > + regs.ip = pebs->real_ip; > regs.flags |= PERF_EFLAGS_EXACT; > } else if (event->attr.precise_ip > 1 && intel_pmu_pebs_fixup_ip(®s)) > regs.flags |= PERF_EFLAGS_EXACT; > @@ -912,17 +905,34 @@ static void intel_pmu_drain_pebs_core(struct pt_regs *iregs) > __intel_pmu_pebs_event(event, iregs, at); > } > > -static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at, > - void *top) > +static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) > { > struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > struct debug_store *ds = cpuc->ds; > struct perf_event *event = NULL; > + void *at, *top; > u64 status = 0; > - int bit; > + int bit, n; > + > + if (!x86_pmu.pebs_active) > + return; > + > + at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base; > + top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index; > > ds->pebs_index = ds->pebs_buffer_base; > > + n = (top - at) / x86_pmu.pebs_record_size; > + if (n <= 0) > + return; > + > + /* > + * Should not happen, we program the threshold at 1 and do not > + * set a reset value. > + */ > + WARN_ONCE(n > x86_pmu.max_pebs_events, > + "Unexpected number of pebs records %d\n", n); > + > for (; at < top; at += x86_pmu.pebs_record_size) { > struct pebs_record_nhm *p = at; > > @@ -950,61 +960,6 @@ static void __intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, void *at, > } > } > > -static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs) > -{ > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > - struct debug_store *ds = cpuc->ds; > - struct pebs_record_nhm *at, *top; > - int n; > - > - if (!x86_pmu.pebs_active) > - return; > - > - at = (struct pebs_record_nhm *)(unsigned long)ds->pebs_buffer_base; > - top = (struct pebs_record_nhm *)(unsigned long)ds->pebs_index; > - > - ds->pebs_index = ds->pebs_buffer_base; > - > - n = top - at; > - if (n <= 0) > - return; > - > - /* > - * Should not happen, we program the threshold at 1 and do not > - * set a reset value. > - */ > - WARN_ONCE(n > x86_pmu.max_pebs_events, > - "Unexpected number of pebs records %d\n", n); > - > - return __intel_pmu_drain_pebs_nhm(iregs, at, top); > -} > - > -static void intel_pmu_drain_pebs_hsw(struct pt_regs *iregs) > -{ > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events); > - struct debug_store *ds = cpuc->ds; > - struct pebs_record_hsw *at, *top; > - int n; > - > - if (!x86_pmu.pebs_active) > - return; > - > - at = (struct pebs_record_hsw *)(unsigned long)ds->pebs_buffer_base; > - top = (struct pebs_record_hsw *)(unsigned long)ds->pebs_index; > - > - n = top - at; > - if (n <= 0) > - return; > - /* > - * Should not happen, we program the threshold at 1 and do not > - * set a reset value. > - */ > - WARN_ONCE(n > x86_pmu.max_pebs_events, > - "Unexpected number of pebs records %d\n", n); > - > - return __intel_pmu_drain_pebs_nhm(iregs, at, top); > -} > - > /* > * BTS, PEBS probe and setup > */ > @@ -1039,7 +994,7 @@ void intel_ds_init(void) > case 2: > pr_cont("PEBS fmt2%c, ", pebs_type); > x86_pmu.pebs_record_size = sizeof(struct pebs_record_hsw); > - x86_pmu.drain_pebs = intel_pmu_drain_pebs_hsw; > + x86_pmu.drain_pebs = intel_pmu_drain_pebs_nhm; > break; > > default: -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/