Date: Thu, 15 Dec 2016 20:37:46 -0500 (EST)
From: Vince Weaver
To: Jiri Olsa
cc: Peter Zijlstra, Andi Kleen, lkml, Alexander Shishkin, Ingo Molnar
Subject: Re: [PATCHv2] perf/x86/intel: Account interrupts for PEBS errors
In-Reply-To: <20161215154356.GB9173@krava>
References: <20161214165036.GB9180@krava>
 <20161214180730.GR3124@twins.programming.kicks-ass.net>
 <20161214181636.GA14741@krava>
 <20161214193239.GS3124@twins.programming.kicks-ass.net>
 <20161215154356.GB9173@krava>

On Thu, 15 Dec 2016, Jiri Olsa wrote:

> It's possible to setup PEBS events and get only errors and not
> a single data, like on SNB-X (model 45) and IVB-EP (model 62)
> via 2 perf commands running simultaneously:
>
> taskset -c 1 ./perf record -c 4 -e branches:pp -j any -C 10
>
> This leads to soft lock up, because the error path of the
> intel_pmu_drain_pebs_nhm does not account event->hw.interrupt
> for error PEBS interrupts so the event is not eventually
> stopped when it gets over the max_samples_per_tick limit.
>
> NMI watchdog: BUG: soft lockup - CPU#22 stuck for 22s! [perf_fuzzer:5816]
> ...
> task: ffff880273148000 task.stack: ffffc90002d58000
> RIP: 0010:[] [] smp_call_function_single+0xe2/0x140
> RSP: 0018:ffffc90002d5bd60 EFLAGS: 00000202
> ...
> Call Trace:
> ? trace_hardirqs_on_caller+0xf5/0x1b0
> ? perf_cgroup_attach+0x70/0x70
> perf_install_in_context+0x199/0x1b0
> ? ctx_resched+0x90/0x90
> SYSC_perf_event_open+0x641/0xf90
> SyS_perf_event_open+0x9/0x10
> do_syscall_64+0x6c/0x1f0
> entry_SYSCALL64_slow_path+0x25/0x25

I'll have to try this. With all the recent fixes, I am down to NMI
lockups like this being the major cause of fuzzer issues on my Intel
machines. My AMD and ARM machines are now fuzzing for weeks w/o problems.

I also finally got a power8 machine and it crashes really quickly when
fuzzing, but I haven't had a chance to track things down yet because it
sounds like a jet plane taking off and I can't really leave it fuzzing
like that when students are sitting nearby. Maybe over break.

Vince