Date: Fri, 8 May 2015 09:53:47 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Vince Weaver <vincent.weaver@maine.edu>
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra <peterz@infradead.org>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Jiri Olsa <jolsa@redhat.com>, Ingo Molnar <mingo@redhat.com>,
        Paul Mackerras <paulus@samba.org>
Subject: Re: perf: WARNING perfevents: irq loop stuck!
Message-ID: <20150508075347.GB5403@gmail.com>
References: <alpine.DEB.2.11.1504301656160.30466@vincent-weaver-1.umelst.maine.edu>
 <20150501070226.GB18957@gmail.com>
 <alpine.DEB.2.11.1505080018140.26907@vincent-weaver-1.umelst.maine.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.11.1505080018140.26907@vincent-weaver-1.umelst.maine.edu>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1593
Lines: 41


* Vince Weaver <vincent.weaver@maine.edu> wrote:

> On Fri, 1 May 2015, Ingo Molnar wrote:
> 
> > So 0000fffffffffffe corresponds to 2 events left until overflow, 
> > right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we 
> > allow these super short periods.
> > 
> > Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well, 
> > one similar to bdw_limit_period()? Something like the patch below?
> > 
> > Totally untested and such. I picked 128 because of Broadwell, but 
> > lower values might work as well. You could try to increase it to 3 and 
> > upwards and see which one stops triggering stuck NMI loops?
> 
> I spent a lot of time trying to come up with a test case that 
> triggered this more reliably but failed.
> 
> It definitely is an issue with PMC0 being -2, causing the PMC0 bit in 
> the status register to get stuck and not clear.  Often there is 
> also a PEBS event active at the same time but that might be 
> coincidence.
> 
> With your patch applied I can't trigger the issue. I haven't tried 
> narrowing down the exact value yet.

So how about I change it from 128U to 2U and apply it upstream?

I.e. use the minimal threshold that we have observed to cause 
problems. That way, should the problem ever show up in different 
circumstances, we'll eventually trigger it or hear about it.
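
To make the arithmetic concrete, here is a rough user-space sketch (not 
the kernel patch itself). It assumes 48-bit wide counters, which is what 
the 0000fffffffffffe value above suggests, and it models a limit_period 
style quirk as a plain clamp. The names counter_value_for(), 
events_until_overflow() and limit_period_sketch() are made up for 
illustration, and the threshold is left as a parameter since the exact 
value is still open:

```c
#include <stdint.h>

/* Assumption: 48-bit wide counters, as implied by 0000fffffffffffe. */
#define CNTVAL_BITS 48
#define CNTVAL_MASK ((UINT64_C(1) << CNTVAL_BITS) - 1)

/*
 * Raw value to program into a counter so that it overflows after
 * 'left' more events: -left, truncated to the counter width.
 */
static uint64_t counter_value_for(uint64_t left)
{
	return (-left) & CNTVAL_MASK;
}

/*
 * Inverse of the above: how many events remain before a counter
 * holding 'raw' overflows. (raw == 0 wraps to 0 in this sketch.)
 */
static uint64_t events_until_overflow(uint64_t raw)
{
	return (-raw) & CNTVAL_MASK;
}

/*
 * Hypothetical stand-in for a limit_period style quirk: never
 * program a period shorter than 'min_period'.
 */
static uint64_t limit_period_sketch(uint64_t left, uint64_t min_period)
{
	return left < min_period ? min_period : left;
}
```

With these helpers, a period of 2 maps to the raw counter value 
0x0000fffffffffffe, and a threshold of 128 turns that period into 128 
while leaving longer periods untouched.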

Thanks,

	Ingo