Date: Thu, 13 Jul 2017 11:19:11 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>,
        Andres Freund <andres@anarazel.de>, x86@kernel.org,
        linux-kernel@vger.kernel.org, live-patching@vger.kernel.org,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Jiri Slaby <jslaby@suse.cz>,
        "H. Peter Anvin" <hpa@zytor.com>, Mike Galbraith <efault@gmx.de>,
        Jiri Olsa <jolsa@redhat.com>,
        Arnaldo Carvalho de Melo <acme@infradead.org>,
        Namhyung Kim <namhyung@kernel.org>,
        Alexander Shishkin <alexander.shishkin@linux.intel.com>
Subject: Re: [PATCH v3 00/10] x86: ORC unwinder (previously undwarf)
Message-ID: <20170713091911.aj7e7dvrbqcyxh7l@gmail.com>
References: <cover.1499786555.git.jpoimboe@redhat.com>
 <20170712214920.5droainfqjmq7sgu@alap3.anarazel.de>
 <20170712223225.zkq7tdb7pzgb3wy7@treble>
 <20170713071253.a3slz3j5tcgy3rkk@hirez.programming.kicks-ass.net>
 <20170713085015.yjjv5ig2znplx5jl@hirez.programming.kicks-ass.net>
 <20170713085114.h4vjgg7jjbl6dohb@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170713085114.h4vjgg7jjbl6dohb@hirez.programming.kicks-ass.net>
User-Agent: NeoMutt/20170113 (1.7.2)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1743
Lines: 51


* Peter Zijlstra <peterz@infradead.org> wrote:

> > One gloriously ugly hack would be to delay the userspace unwind to 
> > return-to-userspace, at which point we have a schedulable context and can take 
> > faults.

I don't think it's ugly.

> > Of course, then you have to somehow identify this later unwind sample with all 
> > relevant prior samples and stitch the whole thing back together, but that 
> > should be doable.
> > 
> > In fact, it would not be at all hard to do, just queue a task_work from the 
> > NMI and have that do the EH based unwind.

This would have several advantages:

 - as you mention, being able to fault in debug info and generally do 
   IO/scheduling,

 - profiling overhead would be accounted to the task context that generates it,
   not the NMI context,

 - there would be a natural batching/coalescing optimization if multiple events
   hit the same system call: the user-space backtrace would only have to be looked 
   up once for all samples that got collected.
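
To make the coalescing concrete, here is a toy user-space model in Python. It is
not kernel code, and every name in it (Task, nmi_sample, return_to_user,
unwind_pending, the frame names) is made up for illustration. The pending flag
merely stands in for "a task_work is already queued"; the real mechanism would
look different.

```python
# Toy model of deferred user-space unwinding. All names are hypothetical.

class Task:
    def __init__(self, pid):
        self.pid = pid
        self.unwind_pending = False  # stands in for an already queued task_work


def nmi_sample(task, ring, kernel_stack):
    """NMI context: record only the kernel stack and defer the user unwind."""
    ring.append(("kernel", task.pid, kernel_stack))
    task.unwind_pending = True  # setting it again is a no-op: this is the coalescing


def return_to_user(task, ring, unwind_user):
    """Schedulable context: the unwinder may fault or sleep here."""
    if task.unwind_pending:
        ring.append(("user", task.pid, unwind_user(task)))
        task.unwind_pending = False


ring = []
unwind_calls = []


def fake_unwind(task):
    # Placeholder for the expensive EH/debug-info based unwind.
    unwind_calls.append(task.pid)
    return ["read", "do_work", "main"]


task = Task(pid=42)
# Three samples hit during the same system call ...
nmi_sample(task, ring, ["vfs_read", "sys_read"])
nmi_sample(task, ring, ["ext4_file_read_iter", "sys_read"])
nmi_sample(task, ring, ["copy_to_user", "sys_read"])
# ... but only one user-space unwind happens, on the way back out.
return_to_user(task, ring, fake_unwind)
```

In this model the ring ends up with three kernel records followed by a single
user record, and fake_unwind is called once rather than three times.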

This could be done by splitting the user-space backtrace out into its own event, 
and perf tooling would then apply that user-space backtrace to all preceding 
kernel samples that do not have one yet.

I.e. the ring-buffer would have trace entries like:

 [ kernel sample #1, with kernel backtrace #1 ]
 [ kernel sample #2, with kernel backtrace #2 ]
 [ kernel sample #3, with kernel backtrace #3 ]
 [ user-space backtrace #1 at syscall return ]
 ...

Note how the three kernel samples didn't have to do any user-space unwinding at 
all, so the user-space unwinding overhead got reduced by a factor of 3.

Tooling would know that 'user-space backtrace #1' applies to the previous three 
kernel samples.
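
The tooling side of that could look roughly like the Python sketch below. The
record layout (kind, pid, stack) and the stitch() helper are assumptions made
for the example, not an existing perf format or API. Matching by pid is also an
assumption: with several tasks sharing one buffer, the tool needs some key to
decide which earlier samples a user-space backtrace belongs to.

```python
# Sketch of post-processing: attach each deferred user-space backtrace to the
# kernel samples that preceded it. Record layout and names are hypothetical.

def stitch(records):
    pending = {}   # pid -> kernel stacks still waiting for a user backtrace
    stitched = []  # (pid, full callchain), innermost frame first
    for kind, pid, stack in records:
        if kind == "kernel":
            pending.setdefault(pid, []).append(stack)
        else:  # "user": applies to all pending kernel samples of that task
            for kernel_stack in pending.pop(pid, []):
                stitched.append((pid, kernel_stack + stack))
    # Samples that never got a user backtrace keep their kernel part only.
    for pid, kernel_stacks in pending.items():
        for kernel_stack in kernel_stacks:
            stitched.append((pid, kernel_stack))
    return stitched


# Illustrative sample data, frame names chosen only for readability.
records = [
    ("kernel", 42, ["vfs_read", "sys_read"]),
    ("kernel", 42, ["ext4_file_read_iter", "sys_read"]),
    ("kernel", 7, ["schedule"]),
    ("kernel", 42, ["copy_to_user", "sys_read"]),
    ("user", 42, ["read", "do_work", "main"]),
]
chains = stitch(records)
```

Here the three samples of pid 42 each get the same user-space frames appended,
while the sample of pid 7 stays kernel-only because no user-space backtrace
for it has shown up.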

Or so?

Thanks,

	Ingo