Date: Fri, 8 May 2015 09:53:47 +0200
From: Ingo Molnar
To: Vince Weaver
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, Arnaldo Carvalho de Melo, Jiri Olsa, Ingo Molnar, Paul Mackerras
Subject: Re: perf: WARNING perfevents: irq loop stuck!
Message-ID: <20150508075347.GB5403@gmail.com>
References: <20150501070226.GB18957@gmail.com>

* Vince Weaver wrote:

> On Fri, 1 May 2015, Ingo Molnar wrote:
>
> > So 0000fffffffffffe corresponds to 2 events left until overflow,
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we
> > allow these super short periods.
> >
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well,
> > one similar to bdw_limit_period()? Something like the patch below?
> >
> > Totally untested and such. I picked 128 because of Broadwell, but
> > lower values might work as well. You could try to increase it to 3 and
> > upwards and see which one stops triggering stuck NMI loops?
>
> I spent a lot of time trying to come up with a test case that
> triggered this more reliably but failed.
>
> It definitely is an issue with PMC0 being -2 causing the PMC0 bit in
> the status register getting stuck and no clearing. Often there is
> also a PEBS event active at the same time but that might be
> coincidence.
>
> With your patch applied I can't trigger the issue.
> I haven't tried narrowing down the exact value yet.

So how about I change it from 128U to 2U and apply it upstream? I.e.
use the minimal threshold that we have observed to cause problems.

That way should it ever trigger in different circumstances we'll
eventually trigger it or hear about it.

Thanks,

	Ingo