From: Adrian Hunter <adrian.hunter@intel.com>
To: Peter Zijlstra
Cc: Andi Kleen, Arnaldo Carvalho de Melo, Ingo Molnar,
    linux-kernel@vger.kernel.org, Jiri Olsa, Stephane Eranian,
    mathieu.poirier@linaro.org, Pawel Moll
Subject: [RFC PATCH] perf: Add PERF_RECORD_SWITCH to indicate context switches
Date: Tue, 9 Jun 2015 17:21:10 +0300
Message-Id: <1433859670-10806-1-git-send-email-adrian.hunter@intel.com>

There are already two events for context switches, namely the tracepoint
sched:sched_switch and the software event context_switches.  Unfortunately
neither is suitable for use by non-privileged users for the purpose of
synchronizing hardware trace data (e.g. Intel PT) to the context switch.

Tracepoints are of no use at all to non-privileged users because they need
either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.  Kernel
software events, on the other hand, need either CAP_SYS_ADMIN or
/proc/sys/kernel/perf_event_paranoid <= 1.

Many distributions do default perf_event_paranoid to 1, which makes
context_switches a contender, except that it has another problem (shared
with sched:sched_switch): it is generated before perf schedules events out
instead of after perf schedules events in.  A privileged user can see all
the events anyway, but a non-privileged user only sees events for their own
processes; in other words they see when their process was scheduled out,
not when it was scheduled in.  That presents two problems for using the
event:

 1. the information comes too late, so tools have to look ahead in the
    event stream to find out what the current state is

 2. if they are unlucky, tracing might have stopped before the
    context-switches event is recorded.

The new PERF_RECORD_SWITCH event does not have those problems, and it also
has a couple of other small advantages.  It is easier to use because it is
an auxiliary event (like the mmap, comm and task events) that can be
enabled by setting a single attribute bit.  It is smaller than
sched:sched_switch and easier to parse.

This implementation has one quirk: where possible, it scrounges the event
data from the ID sample instead of deriving it twice.

I have not tested this patch yet, so it is RFC for the moment.
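As an illustration of the intended usage (not part of the patch): a
non-privileged tool could request the new records roughly as follows.  The
context_switch attribute bit is the one added below; the choice of a dummy
software event, the ring-buffer size and the minimal error handling are
assumptions made purely for this example.

/*
 * Example only: open a dummy software event with the new context_switch
 * bit set and mmap the ring buffer.  Requires the patched uapi header
 * for the context_switch bitfield to exist.
 */
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
	struct perf_event_attr attr;
	void *base;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_SOFTWARE;
	attr.config = PERF_COUNT_SW_DUMMY;	/* no counting, records only */
	attr.sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME;
	attr.sample_id_all = 1;			/* append sample_id to records */
	attr.context_switch = 1;		/* the bit added by this patch */

	/* Measure the calling thread only; no CAP_SYS_ADMIN needed. */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	/* 1 metadata page + 8 data pages, as perf tools typically use. */
	base = mmap(NULL, 9 * sysconf(_SC_PAGESIZE), PROT_READ | PROT_WRITE,
		    MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... read PERF_RECORD_SWITCH records from the ring buffer ... */
	return 0;
}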
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
---
 include/uapi/linux/perf_event.h |  15 +++++-
 kernel/events/core.c            | 108 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 122 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 31b10b02db75..f5403660c9f8 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -332,7 +332,8 @@ struct perf_event_attr {
 				mmap2          :  1, /* include mmap with inode data     */
 				comm_exec      :  1, /* flag comm events that are due to an exec */
 				use_clockid    :  1, /* use @clockid for time fields */
-				__reserved_1   : 38;
+				context_switch :  1, /* context switch data */
+				__reserved_1   : 37;
 
 	union {
 		__u32		wakeup_events;	  /* wakeup every n events */
@@ -812,6 +813,18 @@ enum perf_event_type {
 	 */
 	PERF_RECORD_ITRACE_START		= 12,
 
+	/*
+	 * Records a context switch (generated when the task is scheduled in).
+	 *
+	 * struct {
+	 *	struct perf_event_header	header;
+	 *	u32				pid, tid;
+	 *	u64				time;
+	 *	struct sample_id		sample_id;
+	 * };
+	 */
+	PERF_RECORD_SWITCH			= 13,
+
 	PERF_RECORD_MAX,			/* non-ABI */
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 30c7374bd263..eda26c464f7c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -161,6 +161,7 @@ static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
 static atomic_t nr_task_events __read_mostly;
 static atomic_t nr_freq_events __read_mostly;
+static atomic_t nr_switch_events __read_mostly;
 
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
@@ -2822,6 +2823,8 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	perf_ctx_unlock(cpuctx, ctx);
 }
 
+static void perf_event_switch(struct task_struct *task);
+
 /*
  * Called from scheduler to add the events of the current task
  * with interrupts disabled.
@@ -2856,6 +2859,9 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 
 	if (__this_cpu_read(perf_sched_cb_usages))
 		perf_pmu_sched_task(prev, task, true);
+
+	if (atomic_read(&nr_switch_events))
+		perf_event_switch(task);
 }
 
 static u64 perf_calculate_period(struct perf_event *event, u64 nsec, u64 count)
@@ -3479,6 +3485,10 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_task_events);
 	if (event->attr.freq)
 		atomic_dec(&nr_freq_events);
+	if (event->attr.context_switch) {
+		static_key_slow_dec_deferred(&perf_sched_events);
+		atomic_dec(&nr_switch_events);
+	}
 	if (is_cgroup_event(event))
 		static_key_slow_dec_deferred(&perf_sched_events);
 	if (has_branch_stack(event))
@@ -6205,6 +6215,100 @@ void perf_event_aux_event(struct perf_event *event, unsigned long head,
 }
 
 /*
+ * context_switch tracking
+ */
+
+struct perf_switch_event {
+	struct task_struct	*task;
+
+	struct {
+		struct perf_event_header	header;
+
+		u32				pid;
+		u32				tid;
+		u64				time;
+	} event_id;
+};
+
+static int perf_event_switch_match(struct perf_event *event)
+{
+	return event->attr.context_switch;
+}
+
+static void perf_event_switch_output(struct perf_event *event, void *data)
+{
+	struct perf_switch_event *switch_event = data;
+	struct perf_output_handle handle;
+	struct perf_sample_data sample;
+	struct task_struct *task = switch_event->task;
+	int size = switch_event->event_id.header.size;
+	int ret;
+
+	if (!perf_event_switch_match(event))
+		return;
+
+	sample.tid_entry.pid = -1;
+	sample.tid_entry.tid = -1;
+	sample.time = -1;
+
+	perf_event_header__init_id(&switch_event->event_id.header, &sample, event);
+
+	ret = perf_output_begin(&handle, event,
+				switch_event->event_id.header.size);
+	if (ret)
+		goto out;
+
+	if (sample.tid_entry.pid == -1)
+		switch_event->event_id.pid = perf_event_pid(event, task);
+	else
+		switch_event->event_id.pid = sample.tid_entry.pid;
+
+	if (sample.tid_entry.tid == -1)
+		switch_event->event_id.tid = perf_event_tid(event, task);
+	else
+		switch_event->event_id.tid = sample.tid_entry.tid;
+
+	if (sample.time == (u64)-1)
+		switch_event->event_id.time = perf_event_clock(event);
+	else
+		switch_event->event_id.time = sample.time;
+
+	perf_output_put(&handle, switch_event->event_id);
+
+	perf_event__output_id_sample(event, &handle, &sample);
+
+	perf_output_end(&handle);
+out:
+	switch_event->event_id.header.size = size;
+}
+
+static void perf_event_switch(struct task_struct *task)
+{
+	struct perf_switch_event switch_event;
+
+	if (!atomic_read(&nr_switch_events))
+		return;
+
+	switch_event = (struct perf_switch_event){
+		.task		= task,
+		.event_id	= {
+			.header = {
+				.type = PERF_RECORD_SWITCH,
+				.misc = 0,
+				.size = sizeof(switch_event.event_id),
+			},
+			/* .pid */
+			/* .tid */
+			/* .time */
+		},
+	};
+
+	perf_event_aux(perf_event_switch_output,
+		       &switch_event,
+		       NULL);
+}
+
+/*
  * IRQ throttle logging
  */
 
@@ -7707,6 +7811,10 @@ static void account_event(struct perf_event *event)
 		if (atomic_inc_return(&nr_freq_events) == 1)
 			tick_nohz_full_kick_all();
 	}
+	if (event->attr.context_switch) {
+		atomic_inc(&nr_switch_events);
+		static_key_slow_inc(&perf_sched_events.key);
+	}
 	if (has_branch_stack(event))
 		static_key_slow_inc(&perf_sched_events.key);
 	if (is_cgroup_event(event))
-- 
1.9.1
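For completeness, here is a rough sketch of how a consumer might decode the
new record from the mmap'ed ring buffer, following the layout documented in
the perf_event.h hunk above.  The struct name is invented for the example,
the record type value comes from this patch, and the variable-sized
sample_id tail (which depends on attr.sample_type and attr.sample_id_all)
is deliberately left undecoded.

/*
 * Example only: decode a PERF_RECORD_SWITCH record.  The struct name is
 * hypothetical; the trailing sample_id is not parsed here because its
 * size depends on the event attributes.
 */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>

struct switch_record {
	struct perf_event_header header;
	uint32_t pid, tid;
	uint64_t time;
	/* followed by struct sample_id, if attr.sample_id_all is set */
};

static void handle_record(const struct perf_event_header *hdr)
{
	if (hdr->type == 13 /* PERF_RECORD_SWITCH in this patch */) {
		const struct switch_record *sw = (const void *)hdr;

		printf("sched-in: pid=%u tid=%u time=%llu\n",
		       sw->pid, sw->tid, (unsigned long long)sw->time);
	}
	/* advance by hdr->size to move to the next record */
}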