From: Vince Weaver <vincent.weaver@maine.edu>
Date: Fri, 1 May 2015 13:20:17 -0400 (EDT)
To: Ingo Molnar <mingo@kernel.org>
cc: Vince Weaver <vincent.weaver@maine.edu>, linux-kernel@vger.kernel.org,
        Peter Zijlstra <peterz@infradead.org>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        Jiri Olsa <jolsa@redhat.com>, Ingo Molnar <mingo@redhat.com>,
        Paul Mackerras <paulus@samba.org>
Subject: Re: perf: WARNING perfevents: irq loop stuck!
In-Reply-To: <20150501070226.GB18957@gmail.com>
Message-ID: <alpine.DEB.2.11.1505011316420.2300@vincent-weaver-1.umelst.maine.edu>
References: <alpine.DEB.2.11.1504301656160.30466@vincent-weaver-1.umelst.maine.edu> <20150501070226.GB18957@gmail.com>
User-Agent: Alpine 2.11 (DEB 23 2013-08-11)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1606
Lines: 42

On Fri, 1 May 2015, Ingo Molnar wrote:

> 
> * Vince Weaver <vincent.weaver@maine.edu> wrote:
> 
> > So this is just a warning, and I've reported it before, but the 
> > perf_fuzzer triggers this fairly regularly on my Haswell system.
> > 
> > It looks like fixed counter 0 (retired instructions) being set to 
> > 0000fffffffffffe occasionally causes an irq loop storm and gets 
> > stuck until the PMU state is cleared.
> 
> So 0000fffffffffffe corresponds to 2 events left until overflow, 
> right? And on Haswell we don't set x86_pmu.limit_period AFAICS, so we 
> allow these super short periods.
> 
> Maybe like on Broadwell we need a quirk on Nehalem/Haswell as well, 
> one similar to bdw_limit_period()? Something like the patch below?

I spent the morning trying to get a reproducer for this, and it turns out 
to be complex.  It seems that, in addition to fixed counter 0 being set to 
-2, at least one other (non-fixed) counter must be about to overflow.

For example, in this case gen-PMC2 is also poised to overflow at the same 
time.

CPU#0: gen-PMC2 ctrl:  00000003ff96764b
CPU#0: gen-PMC2 count: 0000000000000001
gen-PMC2 left:         0000ffffffffffff
...
[ 2408.612442] CPU#0: fixed-PMC0 count: 0000fffffffffffe


It's not always PMC2, but in the warnings there's at least one other 
gen-PMC about to overflow at the exact same time as the fixed one.
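
For reference, here is a minimal sketch of the arithmetic behind reading
0000fffffffffffe as "-2", i.e. two events left until overflow.  It assumes
the counters are 48 bits wide (inferred from the 12 significant hex digits
in the dump above, not stated there), and the helper names are
hypothetical; this is not the kernel's code.

```c
#include <stdint.h>

/* Mask covering the low `width` bits of a counter (width in 1..63). */
static uint64_t counter_mask(unsigned width)
{
    return (UINT64_C(1) << width) - 1;
}

/* Raw value to program so the counter overflows after `left` events:
 * -left, truncated to the counter width (assumed 48 bits here). */
static uint64_t counter_start_value(uint64_t left, unsigned width)
{
    return (0 - left) & counter_mask(width);
}

/* Events remaining before a counter holding `count` wraps to zero. */
static uint64_t events_until_overflow(uint64_t count, unsigned width)
{
    uint64_t mask = counter_mask(width);

    return (mask - (count & mask)) + 1;
}
```

With width 48, counter_start_value(2, 48) gives 0000fffffffffffe, and
events_until_overflow(0x0000fffffffffffe, 48) gives 2.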

Vince