Date: Tue, 25 Jul 2017 13:55:12 +0200
From: Peter Zijlstra
To: Ingo Molnar
Cc: Josh Poimboeuf, Andres Freund, x86@kernel.org,
	linux-kernel@vger.kernel.org, live-patching@vger.kernel.org,
	Linus Torvalds, Andy Lutomirski, Jiri Slaby, "H. Peter Anvin",
	Mike Galbraith, Jiri Olsa, Arnaldo Carvalho de Melo,
	Namhyung Kim, Alexander Shishkin
Subject: [RFC] perf: Delayed userspace unwind (Was: [PATCH v3 00/10] x86: ORC unwinder)
Message-ID: <20170725115512.k3wk62fdnxdd6enu@hirez.programming.kicks-ass.net>
In-Reply-To: <20170713091911.aj7e7dvrbqcyxh7l@gmail.com>

On Thu, Jul 13, 2017 at 11:19:11AM +0200, Ingo Molnar wrote:
> 
> * Peter Zijlstra wrote:
> 
> > > One gloriously ugly hack would be to delay the userspace unwind to
> > > return-to-userspace, at which point we have a schedulable context and can
> > > take faults.
> 
> I don't think it's ugly, and it has various advantages:
> 
> > > Of course, then you have to somehow identify this later unwind sample with
> > > all relevant prior samples and stitch the whole thing back together, but
> > > that should be doable.
> > > 
> > > In fact, it would not be at all hard to do, just queue a task_work from the
> > > NMI and have that do the EH based unwind.
> 
> This would have a couple of advantages:
> 
>  - as you mention, being able to fault in debug info and generally do
>    IO/scheduling,
> 
>  - profiling overhead would be accounted to the task context that generates it,
>    not the NMI context,
> 
>  - there would be a natural batching/coalescing optimization if multiple events
>    hit the same system call: the user-space backtrace would only have to be
>    looked up once for all samples that got collected.
> 
> This could be done by separating the user-space backtrace into a separate
> event, and perf tooling would then apply the same user-space backtrace to all
> prior kernel samples.
> 
> I.e. the ring-buffer would have trace entries like:
> 
>  [ kernel sample #1, with kernel backtrace #1 ]
>  [ kernel sample #2, with kernel backtrace #2 ]
>  [ kernel sample #3, with kernel backtrace #3 ]
>  [ user-space backtrace #1 at syscall return ]
>  ...
> 
> Note how the three kernel samples didn't have to do any user-space unwinding
> at all, so the user-space unwinding overhead got reduced by a factor of 3.
> 
> Tooling would know that 'user-space backtrace #1' applies to the previous
> three kernel samples.
> 
> Or so?

Find compile-tested patch below; someone needs to teach the userspace tooling
about it though.. Not sure I can still make sense of that code.
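Roughly, the tool side would have to buffer kernel samples per task and stitch
the trailing PERF_RECORD_CALLCHAIN back onto them, along the lines Ingo
describes above. Something like the below -- untested sketch only, not
tools/perf code; the record structs are made-up stand-ins for the real
perf.data layouts, just to show the stitching logic:

/*
 * Untested sketch, not tools/perf code: buffer kernel samples per task
 * and attach the delayed user backtrace (PERF_RECORD_CALLCHAIN) to all
 * of them once it shows up.  Record layouts are simplified stand-ins.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING	128

struct sample {				/* stand-in for PERF_RECORD_SAMPLE */
	uint32_t pid, tid;
	uint64_t ip;			/* kernel IP; kernel chain elided */
	const uint64_t *user_ips;	/* filled in by the stitching */
	uint64_t user_nr;
};

struct callchain_rec {			/* stand-in for PERF_RECORD_CALLCHAIN */
	uint32_t pid, tid;
	uint64_t nr;
	const uint64_t *ips;
};

static struct sample pending[MAX_PENDING];
static size_t npending;

/* kernel sample seen; park it until the user callchain arrives */
static void queue_sample(const struct sample *s)
{
	if (npending < MAX_PENDING)
		pending[npending++] = *s;
}

/* PERF_RECORD_CALLCHAIN seen; stitch it onto all queued samples of this task */
static void flush_callchain(const struct callchain_rec *cc)
{
	size_t i, j;

	for (i = 0, j = 0; i < npending; i++) {
		if (pending[i].tid != cc->tid) {
			pending[j++] = pending[i];	/* other task, keep waiting */
			continue;
		}
		pending[i].user_ips = cc->ips;
		pending[i].user_nr  = cc->nr;
		printf("sample ip=%#llx gets %llu user frames\n",
		       (unsigned long long)pending[i].ip,
		       (unsigned long long)pending[i].user_nr);
	}
	npending = j;
}

int main(void)
{
	static const uint64_t uips[] = { 0x400500, 0x400123 };
	struct sample s1 = { .pid = 1, .tid = 1, .ip = 0xffffffff81000100ULL };
	struct sample s2 = { .pid = 1, .tid = 1, .ip = 0xffffffff81000200ULL };
	struct callchain_rec cc = { .pid = 1, .tid = 1, .nr = 2, .ips = uips };

	queue_sample(&s1);
	queue_sample(&s2);
	flush_callchain(&cc);	/* both samples get user backtrace #1 */
	return 0;
}

A real tool would key on the sample_id (pid/tid/time) that the
PERF_RECORD_CALLCHAIN record carries, rather than a bare tid as above.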
---
 include/linux/perf_event.h      |  1 +
 include/uapi/linux/perf_event.h | 14 ++++++-
 kernel/events/callchain.c       | 86 ++++++++++++++++++++++++++++++++++++++---
 kernel/events/core.c            | 18 +++------
 4 files changed, 100 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a3b873fc59e4..241251533e39 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -682,6 +682,7 @@ struct perf_event {
 	int				pending_disable;
 	struct irq_work			pending;
 
+	struct callback_head		pending_callchain;
 	atomic_t			event_limit;
 
 	/* address range filters */
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 642db5fa3286..342def57ef34 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -368,7 +368,8 @@ struct perf_event_attr {
 				context_switch :  1, /* context switch data */
 				write_backward :  1, /* Write ring buffer from end to beginning */
 				namespaces     :  1, /* include namespaces data */
-				__reserved_1   : 35;
+				delayed_user_callchain : 1, /* ... */
+				__reserved_1   : 34;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -915,6 +916,17 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_NAMESPACES			= 16,
 
+	/*
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	{ u64			nr,
+	 *	  u64			ips[nr];  } && PERF_SAMPLE_CALLCHAIN
+	 *	struct sample_id		sample_id;
+	 * };
+	 *
+	 */
+	PERF_RECORD_CALLCHAIN			= 17,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c
index 1b2be63c8528..c98a12f3592c 100644
--- a/kernel/events/callchain.c
+++ b/kernel/events/callchain.c
@@ -12,6 +12,7 @@
 #include <linux/perf_event.h>
 #include <linux/slab.h>
 #include <linux/sched/task_stack.h>
+#include <linux/task_work.h>
 
 #include "internal.h"
 
@@ -178,19 +179,94 @@ put_callchain_entry(int rctx)
 	put_recursion_context(this_cpu_ptr(callchain_recursion), rctx);
 }
 
+static struct perf_callchain_entry __empty = { .nr = 0, };
+
+static void perf_callchain_work(struct callback_head *work)
+{
+	struct perf_event *event = container_of(work, struct perf_event, pending_callchain);
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	size_t size;
+	int ret;
+
+	struct {
+		struct perf_event_header header;
+	} callchain_event = {
+		.header = {
+			.type = PERF_RECORD_CALLCHAIN,
+			.misc = 0,
+			.size = sizeof(callchain_event),
+		},
+	};
+
+	perf_event_header__init_id(&callchain_event.header, &sample, event);
+
+	sample.callchain = get_perf_callchain(task_pt_regs(current),
+					      /* init_nr   */ 0,
+					      /* kernel    */ false,
+					      /* user      */ true,
+					      event->attr.sample_max_stack,
+					      /* crosstask */ false,
+					      /* add_mark  */ true);
+
+	if (!sample.callchain)
+		sample.callchain = &__empty;
+
+	size = sizeof(u64) * (1 + sample.callchain->nr);
+	callchain_event.header.size += size;
+
+	ret = perf_output_begin(&handle, event, callchain_event.header.size);
+	if (ret)
+		return;
+
+	perf_output_put(&handle, callchain_event);
+	__output_copy(&handle, sample.callchain, size);
+	perf_event__output_id_sample(event, &handle, &sample);
+	perf_output_end(&handle);
+
+	barrier();
+	work->func = NULL; /* done */
+}
+
 struct perf_callchain_entry *
 perf_callchain(struct perf_event *event, struct pt_regs *regs)
 {
-	bool kernel = !event->attr.exclude_callchain_kernel;
-	bool user   = !event->attr.exclude_callchain_user;
+	bool kernel  = !event->attr.exclude_callchain_kernel;
+	bool user    = !event->attr.exclude_callchain_user;
+	bool delayed = event->attr.delayed_user_callchain;
+
 	/* Disallow cross-task user callchains. */
 	bool crosstask = event->ctx->task && event->ctx->task != current;
 	const u32 max_stack = event->attr.sample_max_stack;
 
-	if (!kernel && !user)
-		return NULL;
+	struct perf_callchain_entry *callchain = NULL;
+
+	if (user && delayed && !crosstask) {
+		struct callback_head *work = &event->pending_callchain;
+
+		if (!work->func) {
+			work->func = perf_callchain_work;
+			/*
+			 * We cannot do set_notify_resume() from NMI context,
+			 * also, knowing we are already in an interrupted
+			 * context and will pass return to userspace, we can
+			 * simply set TIF_NOTIFY_RESUME.
+			 */
+			task_work_add(current, work, false);
+			set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
+		}
+		user = false;
+	}
+
+	if (kernel || user) {
+		callchain = get_perf_callchain(regs, 0, kernel, user,
+					       max_stack, crosstask, true);
+	}
+
+	if (!callchain)
+		callchain = &__empty;
 
-	return get_perf_callchain(regs, 0, kernel, user, max_stack, crosstask, true);
+	return callchain;
 }
 
 struct perf_callchain_entry *
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 426c2ffba16d..26aed7bfbb6a 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5857,19 +5857,12 @@ void perf_output_sample(struct perf_output_handle *handle,
 		perf_output_read(handle, event);
 
 	if (sample_type & PERF_SAMPLE_CALLCHAIN) {
-		if (data->callchain) {
-			int size = 1;
-
-			if (data->callchain)
-				size += data->callchain->nr;
+		int size = 1;
 
-			size *= sizeof(u64);
+		size += data->callchain->nr;
+		size *= sizeof(u64);
 
-			__output_copy(handle, data->callchain, size);
-		} else {
-			u64 nr = 0;
-			perf_output_put(handle, nr);
-		}
+		__output_copy(handle, data->callchain, size);
 	}
 
 	if (sample_type & PERF_SAMPLE_RAW) {
@@ -6010,8 +6003,7 @@ void perf_prepare_sample(struct perf_event_header *header,
 
 		data->callchain = perf_callchain(event, regs);
 
-		if (data->callchain)
-			size += data->callchain->nr;
+		size += data->callchain->nr;
 
 		header->size += size * sizeof(u64);
 	}
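
For poking at it from userspace, something like the below could do -- again
only a sketch, and it assumes a uapi header rebuilt from the patched
include/uapi/linux/perf_event.h above, otherwise the new bit simply lands in
__reserved_1:

/*
 * Sketch only: request delayed user callchains for a self-profiling
 * software event.  Assumes a uapi header built from the patch above;
 * without it, delayed_user_callchain does not exist as a field.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size		= sizeof(attr);
	attr.type		= PERF_TYPE_SOFTWARE;
	attr.config		= PERF_COUNT_SW_CPU_CLOCK;
	attr.sample_period	= 100000;
	attr.sample_type	= PERF_SAMPLE_IP | PERF_SAMPLE_TID |
				  PERF_SAMPLE_CALLCHAIN;
	attr.sample_id_all	= 1;	 /* PERF_RECORD_CALLCHAIN carries sample_id */
	attr.exclude_callchain_user = 0; /* we do want user chains, just delayed */
	attr.delayed_user_callchain = 1; /* new bit from this patch */

	fd = syscall(__NR_perf_event_open, &attr, 0 /* self */, -1 /* any cpu */,
		     -1 /* no group */, 0 /* flags */);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* ... mmap the ring buffer, expect PERF_RECORD_CALLCHAIN records ... */
	close(fd);
	return 0;
}

Reading the mmap'ed ring buffer and matching the PERF_RECORD_CALLCHAIN records
against the preceding samples is then the tool-side stitching from the sketch
further up.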