Date: Tue, 1 Dec 2015 08:28:26 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Wangnan (F)" <wangnan0@huawei.com>, Jiri Olsa <jolsa@kernel.org>,
        Arnaldo Carvalho de Melo <acme@kernel.org>,
        David Ahern <dsahern@gmail.com>, Milian Wolff <milian.wolff@kdab.com>,
        linux-kernel@vger.kernel.org, pi3orama <pi3orama@163.com>,
        lizefan 00213767 <lizefan@huawei.com>
Subject: Re: [BUG REPORT] perf tools: x86_64: Broken calllchain when sampling
 taken at 'callq' instruction
Message-ID: <20151201072826.GB28270@gmail.com>
References: <564C3011.8090002@huawei.com>
 <20151118082033.GA24726@gmail.com>
 <564C3A0E.3030502@huawei.com>
 <564C3BAA.4040806@huawei.com>
 <20151119063709.GA14852@gmail.com>
 <564D6FF9.3030105@huawei.com>
 <20151119102300.GA2830@gmail.com>
 <20151119112315.GL3816@twins.programming.kicks-ass.net>
 <20151127083811.GA26257@gmail.com>
 <20151130092843.GF17308@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20151130092843.GF17308@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3646
Lines: 81


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 27, 2015 at 09:38:11AM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Thu, Nov 19, 2015 at 11:23:00AM +0100, Ingo Molnar wrote:
> > > > PEBS is an asynchronous hardware tracing mechanism, when batched PEBS is used it 
> > > > might not even result in any interruption of execution. The 'pt_regs' does not 
> > > > necessarily correspond to an interrupted, restartable context - we take the RIP 
> > > > from the PEBS machinery and also use LBR and disassembly to determine the previous 
> > > > instruction, before reporting it to user-space.
> > > 
> > > Note that modern PEBS hardware (hsw+) does the rollback in hardware.
> > > Prior to that we indeed do it manually using the LBR.
> > > 
> > > As to pt_regs, we construct a franken pt_regs based on the actual PEBS
> > > buffer overflow PMI and bits from the PEBS record (which also includes
> > > some register state). See
> > > arch/x86/kernel/cpu/perf_event_intel_ds.c:setup_pebs_sample_data().
> > > 
> > > We always copy the flags, ip, bp and sp from the PEBS record into the
> > > interrupt pt_regs.
> > > 
> > > And note that the PEBS record is constructed at instruction retirement,
> > > so it shows the state _after_ the instruction, with exception of the
> > > (hsw+) real_ip field.
> > > 
> > > So the unwinder will have to be taught that if the IP points at a stack
> > > altering instruction (call, push, etc.) it will have to 'undo' the
> > > effects on the actual stack (I appreciate this might be 'interesting'
> > > for things like: pop, ret, etc.).
> > 
> > So do we dump both the 'real' and the actual RIP, to not force tooling into having 
> > to decode instructions and such?
> 
> Nope, we only expose the corrected one.
> 
> > (Which is pretty hard and fragile and not always 
> > possible with instructions that destroy the original RIP, like JMP, etc.)
> 
> Not sure what you're getting at here. We don't need the uncorrected
> instruction.

Well, we need it for stack unwinding, as you point out:

> But the problem here is that we rewind the instruction stream, but not
> the stack. And the stack unwinder is (obviously) interested in the stack
> state.

Unwinding the stack state would fix it as well - but an equivalent solution would 
be to pass along the original RIP: we'd then have a self-consistent pair of 
RIP/RSP.

Especially since unwinding the RSP is probably hard:

> I'm not sure we want (or need) to go undo the specific instruction's
> stack effect in-kernel. If the !DWARF unwinders are similarly confused
> we might need to put it in kernel (expensive *groan*). If it's only the
> DWARF muck then it's something that can be done in userspace just
> fine, although we might need to copy slightly more of the stack than SP
> is pointing at, such that we can undo RET/POP etc. which would have data
> beyond the head of stack.
> 
> The easiest solution might be to figure out the biggest stack offset for
> any instruction and always capture that much over the head of stack.

So I think the problem here is that the RSP does not match up to the RIP. We can 
either pass along the original RIP+RSP, or the fixed up one - but what we do 
currently is that we pass along only half of it - which corrupts DWARF unwinding 
state that doesn't tolerate such errors.
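
To illustrate the "fixed up" variant for the callq case from the bug report, 
here is a minimal user-space sketch. It is not kernel or perf code: the struct 
layout and the names sample_regs, insn_is_direct_call() and fixup_for_unwind() 
are made up for this example, and the opcode check only recognizes a plain 
direct near call (0xE8 rel32). A real implementation would need a proper x86 
decoder and would have to handle indirect calls, push, pop, ret and so on.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample layout for illustration; this is not the perf ABI. */
struct sample_regs {
	uint64_t ip;	/* rewound IP: points at the sampled instruction */
	uint64_t sp;	/* SP as recorded at retirement, after the instruction */
};

/*
 * Rough check for a direct near call (opcode 0xE8 followed by rel32).
 * Assumption: no prefixes, and indirect calls (0xFF /2) are ignored.
 */
static bool insn_is_direct_call(const uint8_t *insn, size_t len)
{
	return len >= 5 && insn[0] == 0xe8;
}

/*
 * Make SP consistent with the rewound IP. In 64-bit mode a near call
 * pushes an 8-byte return address, so the SP before the call is 8 bytes
 * higher than the SP recorded after it retired.
 */
static struct sample_regs fixup_for_unwind(struct sample_regs regs,
					   const uint8_t *insn, size_t len)
{
	if (insn_is_direct_call(insn, len))
		regs.sp += 8;

	return regs;
}
```

The other option, passing along the original (post-instruction) RIP together 
with the recorded RSP, would need no decoding at all, which is why it looks 
like the cheaper way to get a consistent pair.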

Thanks,

	Ingo