Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen,
 Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland,
 Stephane Eranian, David Carrillo-Cisneros, linux-kernel, Vince Weaver,
 Thomas Gleixner
References: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com>
 <20170803150052.za2vofyqfgarukdr@hirez.programming.kicks-ass.net>
 <20170822204743.GR32112@worktop.programming.kicks-ass.net>
 <2a426aa2-42c8-e839-1cec-aa3971651f3e@linux.intel.com>
 <20170831171837.njnc6r6elsvkl7lt@hirez.programming.kicks-ass.net>
From: Alexey Budankov
Organization: Intel Corp.
Message-ID: <75717ebd-f9a7-07af-26a8-3938f841380f@linux.intel.com>
Date: Fri, 1 Sep 2017 14:17:17 +0300
In-Reply-To: <20170831171837.njnc6r6elsvkl7lt@hirez.programming.kicks-ass.net>

On 31.08.2017 20:18, Peter Zijlstra wrote:
> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>>> The key thing in the patch is explicit updating of tstamp fields for
>>>> INACTIVE events in update_event_times().
>>>
>>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>>> 	    event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>>> 		return;
>>>>
>>>> +	if (event->state == PERF_EVENT_STATE_INACTIVE)
>>>> +		perf_event_tstamp_update(event);
>>>> +
>>>> 	/*
>>>> 	 * in cgroup mode, time_enabled represents
>>>> 	 * the time the event was enabled AND active
>>>
>>> But why!? I thought the whole point was to not need to do this.
>>
>> update_event_times() is not called from timer interrupt handler
>> thus it is not on the critical path which is optimized in this patch set.
>>
>> But update_event_times() is called in the context of read() syscall so
>> this is the place where we may update event times for INACTIVE events
>> instead of timer interrupt.
>>
>> Also update_event_times() is called on thread context switch out so
>> we get event times also updated when the thread migrates to other CPU.
>>
>>>
>>> The thing I outlined earlier would only need to update timestamps when
>>> events change state and at no other point in time.
>>
>> But we still may request times while event is in INACTIVE state
>> thru read() syscall and event timings need to be up-to-date.
>
> Sure, read() also updates.
>
> So the below completely rewrites timekeeping (and probably breaks
> world) but does away with the need to touch events that don't get
> scheduled.
>
> Esp the cgroup stuff is entirely untested since I simply don't know how
> to operate that.
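As a side note on the read() path discussed above: what finally reaches user
space are exactly these time_enabled/time_running values, e.g. via
read_format. Below is a minimal, untested user space sketch (my illustration,
not part of the patch) of why an INACTIVE event still has to report
up-to-date times on read():

/*
 * Userspace sketch: read() on a perf fd returns the time_enabled /
 * time_running values that the kernel-side timekeeping maintains.
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

struct read_fmt {		/* layout for the two TOTAL_TIME read_format bits */
	uint64_t value;
	uint64_t time_enabled;
	uint64_t time_running;
};

int main(void)
{
	struct perf_event_attr attr;
	struct read_fmt rf;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
			   PERF_FORMAT_TOTAL_TIME_RUNNING;

	/* count instructions of the current task on any CPU */
	fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	if (read(fd, &rf, sizeof(rf)) != sizeof(rf)) {
		perror("read");
		return 1;
	}

	printf("count=%llu enabled=%llu running=%llu\n",
	       (unsigned long long)rf.value,
	       (unsigned long long)rf.time_enabled,
	       (unsigned long long)rf.time_running);
	close(fd);
	return 0;
}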
>
> I did run Vince's tests on it, and I think it doesn't
> regress, but I'm near a migraine so I can't really see straight atm.
>
> Vince, Stephane, could you guys have a peek?
>
> (There's a few other bits in, I'll break up into patches and write
> comments and Changelogs later, I think its can be split in some 5
> patches).
>
> The basic idea is really simple, we have a single timestamp and
> depending on the state we update enabled/running. This obviously only
> requires updates when we change state and when we need up-to-date
> timestamps (read).
>
> No more weird and wonderful mind bending interaction between 3 different
> timestamps with arcane update rules.
>
> ---
>  include/linux/perf_event.h |  25 +-
>  kernel/events/core.c       | 551 ++++++++++++++++-----------------------
>  2 files changed, 192 insertions(+), 384 deletions(-)
>

Tried to apply on top of this:

perf/core 1b2f76d77a277bb70d38ad0991ed7f16bbc115a9 [origin/perf/core] Merge tag 'perf-core-for-mingo-4.14-20170829' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core

but failed:

Checking patch include/linux/perf_event.h...
Hunk #1 succeeded at 498 (offset 13 lines).
Hunk #2 succeeded at 591 (offset 13 lines).
Hunk #3 succeeded at 601 (offset 13 lines).
Hunk #4 succeeded at 696 (offset 13 lines).
Checking patch kernel/events/core.c...
error: while searching for:
	return event_cpu;
}

/*
 * Cross CPU call to read the hardware event
 */
static void __perf_event_read(void *info)
{
	struct perf_read_data *data = info;
	struct perf_event *sub, *event = data->event;
	struct perf_event_context *ctx = event->ctx;
	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
	struct pmu *pmu = event->pmu;

	/*
	 * If this is a task context, we need to check whether it is
	 * the current task context of this cpu. If not it has been
	 * scheduled out before the smp call arrived. In that case
	 * event->count would have been updated to a recent sample
	 * when the event was scheduled out.
	 */
	if (ctx->task && cpuctx->task_ctx != ctx)
		return;

	raw_spin_lock(&ctx->lock);
	if (ctx->is_active) {
		update_context_time(ctx);
		update_cgrp_time_from_event(event);
	}

	update_event_times(event);
	if (event->state != PERF_EVENT_STATE_ACTIVE)
		goto unlock;

	if (!data->group) {
		pmu->read(event);
		data->ret = 0;
		goto unlock;
	}

	pmu->start_txn(pmu, PERF_PMU_TXN_READ);

	pmu->read(event);

	list_for_each_entry(sub, &event->sibling_list, group_entry) {
		update_event_times(sub);
		if (sub->state == PERF_EVENT_STATE_ACTIVE) {
			/*
			 * Use sibling's PMU rather than @event's since
			 * sibling could be on different (eg: software) PMU.
*/ sub->pmu->read(sub); } } data->ret = pmu->commit_txn(pmu); unlock: raw_spin_unlock(&ctx->lock); } static inline u64 perf_event_count(struct perf_event *event) { return local64_read(&event->count) + atomic64_read(&event->child_count); error: patch failed: kernel/events/core.c:3613 error: kernel/events/core.c: patch does not apply > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index 8e22f24ded6a..2a6ae48a1a96 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -485,9 +485,9 @@ struct perf_addr_filters_head { > }; > > /** > - * enum perf_event_active_state - the states of a event > + * enum perf_event_state - the states of a event > */ > -enum perf_event_active_state { > +enum perf_event_state { > PERF_EVENT_STATE_DEAD = -4, > PERF_EVENT_STATE_EXIT = -3, > PERF_EVENT_STATE_ERROR = -2, > @@ -578,7 +578,7 @@ struct perf_event { > struct pmu *pmu; > void *pmu_private; > > - enum perf_event_active_state state; > + enum perf_event_state state; > unsigned int attach_state; > local64_t count; > atomic64_t child_count; > @@ -588,26 +588,10 @@ struct perf_event { > * has been enabled (i.e. eligible to run, and the task has > * been scheduled in, if this is a per-task event) > * and running (scheduled onto the CPU), respectively. > - * > - * They are computed from tstamp_enabled, tstamp_running and > - * tstamp_stopped when the event is in INACTIVE or ACTIVE state. > */ > u64 total_time_enabled; > u64 total_time_running; > - > - /* > - * These are timestamps used for computing total_time_enabled > - * and total_time_running when the event is in INACTIVE or > - * ACTIVE state, measured in nanoseconds from an arbitrary point > - * in time. > - * tstamp_enabled: the notional time when the event was enabled > - * tstamp_running: the notional time when the event was scheduled on > - * tstamp_stopped: in INACTIVE state, the notional time when the > - * event was scheduled off. 
> - */ > - u64 tstamp_enabled; > - u64 tstamp_running; > - u64 tstamp_stopped; > + u64 tstamp; > > /* > * timestamp shadows the actual context timing but it can > @@ -699,7 +683,6 @@ struct perf_event { > > #ifdef CONFIG_CGROUP_PERF > struct perf_cgroup *cgrp; /* cgroup event is attach to */ > - int cgrp_defer_enabled; > #endif > > struct list_head sb_list; > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 294f1927f944..e968b3eab9c7 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event) > return event->clock(); > } > > +/* > + * XXX comment about timekeeping goes here > + */ > + > +static __always_inline enum perf_event_state > +__perf_effective_state(struct perf_event *event) > +{ > + struct perf_event *leader = event->group_leader; > + > + if (leader->state <= PERF_EVENT_STATE_OFF) > + return leader->state; > + > + return event->state; > +} > + > +static __always_inline void > +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running) > +{ > + enum perf_event_state state = __perf_effective_state(event); > + u64 delta = now - event->tstamp; > + > + *enabled = event->total_time_enabled; > + if (state >= PERF_EVENT_STATE_INACTIVE) > + *enabled += delta; > + > + *running = event->total_time_running; > + if (state >= PERF_EVENT_STATE_ACTIVE) > + *running += delta; > +} > + > +static void perf_event_update_time(struct perf_event *event) > +{ > + u64 now = perf_event_time(event); > + > + __perf_update_times(event, now, &event->total_time_enabled, > + &event->total_time_running); > + event->tstamp = now; > +} > + > +static void perf_event_update_sibling_time(struct perf_event *leader) > +{ > + struct perf_event *sibling; > + > + list_for_each_entry(sibling, &leader->sibling_list, group_entry) > + perf_event_update_time(sibling); > +} > + > +static void > +perf_event_set_state(struct perf_event *event, enum perf_event_state state) > +{ > + if (event->state == state) > + return; > + > + perf_event_update_time(event); > + /* > + * If a group leader gets enabled/disabled all its siblings > + * are affected too. > + */ > + if ((event->state < 0) ^ (state < 0)) > + perf_event_update_sibling_time(event); > + > + WRITE_ONCE(event->state, state); > +} > + > #ifdef CONFIG_CGROUP_PERF > > static inline bool > @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now) > event->shadow_ctx_time = now - t->timestamp; > } > > -static inline void > -perf_cgroup_defer_enabled(struct perf_event *event) > -{ > - /* > - * when the current task's perf cgroup does not match > - * the event's, we need to remember to call the > - * perf_mark_enable() function the first time a task with > - * a matching perf cgroup is scheduled in. 
> - */ > - if (is_cgroup_event(event) && !perf_cgroup_match(event)) > - event->cgrp_defer_enabled = 1; > -} > - > -static inline void > -perf_cgroup_mark_enabled(struct perf_event *event, > - struct perf_event_context *ctx) > -{ > - struct perf_event *sub; > - u64 tstamp = perf_event_time(event); > - > - if (!event->cgrp_defer_enabled) > - return; > - > - event->cgrp_defer_enabled = 0; > - > - event->tstamp_enabled = tstamp - event->total_time_enabled; > - list_for_each_entry(sub, &event->sibling_list, group_entry) { > - if (sub->state >= PERF_EVENT_STATE_INACTIVE) { > - sub->tstamp_enabled = tstamp - sub->total_time_enabled; > - sub->cgrp_defer_enabled = 0; > - } > - } > -} > - > /* > * Update cpuctx->cgrp so that it is set when first cgroup event is added and > * cleared when last cgroup event is removed. > @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event) > } > > static inline void > -perf_cgroup_defer_enabled(struct perf_event *event) > -{ > -} > - > -static inline void > -perf_cgroup_mark_enabled(struct perf_event *event, > - struct perf_event_context *ctx) > -{ > -} > - > -static inline void > list_update_cgroup_event(struct perf_event *event, > struct perf_event_context *ctx, bool add) > { > @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event) > return ctx ? ctx->time : 0; > } > > -/* > - * Update the total_time_enabled and total_time_running fields for a event. > - */ > -static void update_event_times(struct perf_event *event) > -{ > - struct perf_event_context *ctx = event->ctx; > - u64 run_end; > - > - lockdep_assert_held(&ctx->lock); > - > - if (event->state < PERF_EVENT_STATE_INACTIVE || > - event->group_leader->state < PERF_EVENT_STATE_INACTIVE) > - return; > - > - /* > - * in cgroup mode, time_enabled represents > - * the time the event was enabled AND active > - * tasks were in the monitored cgroup. This is > - * independent of the activity of the context as > - * there may be a mix of cgroup and non-cgroup events. > - * > - * That is why we treat cgroup events differently > - * here. > - */ > - if (is_cgroup_event(event)) > - run_end = perf_cgroup_event_time(event); > - else if (ctx->is_active) > - run_end = ctx->time; > - else > - run_end = event->tstamp_stopped; > - > - event->total_time_enabled = run_end - event->tstamp_enabled; > - > - if (event->state == PERF_EVENT_STATE_INACTIVE) > - run_end = event->tstamp_stopped; > - else > - run_end = perf_event_time(event); > - > - event->total_time_running = run_end - event->tstamp_running; > - > -} > - > -/* > - * Update total_time_enabled and total_time_running for all events in a group. 
> - */ > -static void update_group_times(struct perf_event *leader) > -{ > - struct perf_event *event; > - > - update_event_times(leader); > - list_for_each_entry(event, &leader->sibling_list, group_entry) > - update_event_times(event); > -} > - > static enum event_type_t get_event_type(struct perf_event *event) > { > struct perf_event_context *ctx = event->ctx; > @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx) > WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT); > event->attach_state |= PERF_ATTACH_CONTEXT; > > + event->tstamp = perf_event_time(event); > + > /* > * If we're a stand alone event or group leader, we go to the context > * list, group events are kept attached to the group so that > @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) > if (event->group_leader == event) > list_del_init(&event->group_entry); > > - update_group_times(event); > - > /* > * If event was in error state, then keep it > * that way, otherwise bogus counts will be > @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) > * of the event > */ > if (event->state > PERF_EVENT_STATE_OFF) > - event->state = PERF_EVENT_STATE_OFF; > + perf_event_set_state(event, PERF_EVENT_STATE_OFF); > > ctx->generation++; > } > @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event, > struct perf_cpu_context *cpuctx, > struct perf_event_context *ctx) > { > - u64 tstamp = perf_event_time(event); > - u64 delta; > + enum perf_event_state state = PERF_EVENT_STATE_INACTIVE; > > WARN_ON_ONCE(event->ctx != ctx); > lockdep_assert_held(&ctx->lock); > > - /* > - * An event which could not be activated because of > - * filter mismatch still needs to have its timings > - * maintained, otherwise bogus information is return > - * via read() for time_enabled, time_running: > - */ > - if (event->state == PERF_EVENT_STATE_INACTIVE && > - !event_filter_match(event)) { > - delta = tstamp - event->tstamp_stopped; > - event->tstamp_running += delta; > - event->tstamp_stopped = tstamp; > - } > - > if (event->state != PERF_EVENT_STATE_ACTIVE) > return; > > perf_pmu_disable(event->pmu); > > - event->tstamp_stopped = tstamp; > event->pmu->del(event, 0); > event->oncpu = -1; > - event->state = PERF_EVENT_STATE_INACTIVE; > + > if (event->pending_disable) { > event->pending_disable = 0; > - event->state = PERF_EVENT_STATE_OFF; > + state = PERF_EVENT_STATE_OFF; > } > + perf_event_set_state(event, state); > > if (!is_software_event(event)) > cpuctx->active_oncpu--; > @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event, > struct perf_event_context *ctx) > { > struct perf_event *event; > - int state = group_event->state; > + > + if (group_event->state != PERF_EVENT_STATE_ACTIVE) > + return; > > perf_pmu_disable(ctx->pmu); > > @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event, > > perf_pmu_enable(ctx->pmu); > > - if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive) > + if (group_event->attr.exclusive) > cpuctx->exclusive = 0; > } > > @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event, > { > unsigned long flags = (unsigned long)info; > > + if (ctx->is_active & EVENT_TIME) { > + update_context_time(ctx); > + update_cgrp_time_from_cpuctx(cpuctx); > + } > + > event_sched_out(event, cpuctx, ctx); > if (flags & DETACH_GROUP) > perf_group_detach(event); > @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event, > if 
(event->state < PERF_EVENT_STATE_INACTIVE) > return; > > - update_context_time(ctx); > - update_cgrp_time_from_event(event); > - update_group_times(event); > + if (ctx->is_active & EVENT_TIME) { > + update_context_time(ctx); > + update_cgrp_time_from_cpuctx(cpuctx); > + } > + > if (event == event->group_leader) > group_sched_out(event, cpuctx, ctx); > else > event_sched_out(event, cpuctx, ctx); > - event->state = PERF_EVENT_STATE_OFF; > + > + perf_event_set_state(event, PERF_EVENT_STATE_OFF); > } > > /* > @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event) > } > > static void perf_set_shadow_time(struct perf_event *event, > - struct perf_event_context *ctx, > - u64 tstamp) > + struct perf_event_context *ctx) > { > /* > * use the correct time source for the time snapshot > @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event, > * is cleaner and simpler to understand. > */ > if (is_cgroup_event(event)) > - perf_cgroup_set_shadow_time(event, tstamp); > + perf_cgroup_set_shadow_time(event, event->tstamp); > else > - event->shadow_ctx_time = tstamp - ctx->timestamp; > + event->shadow_ctx_time = event->tstamp - ctx->timestamp; > } > > #define MAX_INTERRUPTS (~0ULL) > @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event, > struct perf_cpu_context *cpuctx, > struct perf_event_context *ctx) > { > - u64 tstamp = perf_event_time(event); > int ret = 0; > > lockdep_assert_held(&ctx->lock); > @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event, > * is visible. > */ > smp_wmb(); > - WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE); > + perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE); > > /* > * Unthrottle events, since we scheduled we might have missed several > @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event, > event->hw.interrupts = 0; > } > > - /* > - * The new state must be visible before we turn it on in the hardware: > - */ > - smp_wmb(); > - > perf_pmu_disable(event->pmu); > > - perf_set_shadow_time(event, ctx, tstamp); > + perf_set_shadow_time(event, ctx); > > perf_log_itrace_start(event); > > if (event->pmu->add(event, PERF_EF_START)) { > - event->state = PERF_EVENT_STATE_INACTIVE; > + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); > event->oncpu = -1; > ret = -EAGAIN; > goto out; > } > > - event->tstamp_running += tstamp - event->tstamp_stopped; > - > if (!is_software_event(event)) > cpuctx->active_oncpu++; > if (!ctx->nr_active++) > @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event, > { > struct perf_event *event, *partial_group = NULL; > struct pmu *pmu = ctx->pmu; > - u64 now = ctx->time; > - bool simulate = false; > > if (group_event->state == PERF_EVENT_STATE_OFF) > return 0; > @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event, > /* > * Groups can be scheduled in as one unit only, so undo any > * partial group before returning: > - * The events up to the failed event are scheduled out normally, > - * tstamp_stopped will be updated. > - * > - * The failed events and the remaining siblings need to have > - * their timings updated as if they had gone thru event_sched_in() > - * and event_sched_out(). This is required to get consistent timings > - * across the group. This also takes care of the case where the group > - * could never be scheduled by ensuring tstamp_stopped is set to mark > - * the time the event was actually stopped, such that time delta > - * calculation in update_event_times() is correct. 
> + * The events up to the failed event are scheduled out normally. > */ > list_for_each_entry(event, &group_event->sibling_list, group_entry) { > if (event == partial_group) > - simulate = true; > + break; > > - if (simulate) { > - event->tstamp_running += now - event->tstamp_stopped; > - event->tstamp_stopped = now; > - } else { > - event_sched_out(event, cpuctx, ctx); > - } > + event_sched_out(event, cpuctx, ctx); > } > event_sched_out(group_event, cpuctx, ctx); > > @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event, > return can_add_hw; > } > > -/* > - * Complement to update_event_times(). This computes the tstamp_* values to > - * continue 'enabled' state from @now, and effectively discards the time > - * between the prior tstamp_stopped and now (as we were in the OFF state, or > - * just switched (context) time base). > - * > - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and > - * cannot have been scheduled in yet. And going into INACTIVE state means > - * '@event->tstamp_stopped = @now'. > - * > - * Thus given the rules of update_event_times(): > - * > - * total_time_enabled = tstamp_stopped - tstamp_enabled > - * total_time_running = tstamp_stopped - tstamp_running > - * > - * We can insert 'tstamp_stopped == now' and reverse them to compute new > - * tstamp_* values. > - */ > -static void __perf_event_enable_time(struct perf_event *event, u64 now) > -{ > - WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE); > - > - event->tstamp_stopped = now; > - event->tstamp_enabled = now - event->total_time_enabled; > - event->tstamp_running = now - event->total_time_running; > -} > - > static void add_event_to_ctx(struct perf_event *event, > struct perf_event_context *ctx) > { > - u64 tstamp = perf_event_time(event); > - > list_add_event(event, ctx); > perf_group_attach(event); > - /* > - * We can be called with event->state == STATE_OFF when we create with > - * .disabled = 1. In that case the IOC_ENABLE will call this function. > - */ > - if (event->state == PERF_EVENT_STATE_INACTIVE) > - __perf_event_enable_time(event, tstamp); > } > > static void ctx_sched_out(struct perf_event_context *ctx, > @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx, > } > > /* > - * Put a event into inactive state and update time fields. > - * Enabling the leader of a group effectively enables all > - * the group members that aren't explicitly disabled, so we > - * have to update their ->tstamp_enabled also. > - * Note: this works for group members as well as group leaders > - * since the non-leader members' sibling_lists will be empty. 
> - */ > -static void __perf_event_mark_enabled(struct perf_event *event) > -{ > - struct perf_event *sub; > - u64 tstamp = perf_event_time(event); > - > - event->state = PERF_EVENT_STATE_INACTIVE; > - __perf_event_enable_time(event, tstamp); > - list_for_each_entry(sub, &event->sibling_list, group_entry) { > - /* XXX should not be > INACTIVE if event isn't */ > - if (sub->state >= PERF_EVENT_STATE_INACTIVE) > - __perf_event_enable_time(sub, tstamp); > - } > -} > - > -/* > * Cross CPU call to enable a performance event > */ > static void __perf_event_enable(struct perf_event *event, > @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event, > if (ctx->is_active) > ctx_sched_out(ctx, cpuctx, EVENT_TIME); > > - __perf_event_mark_enabled(event); > + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); > > if (!ctx->is_active) > return; > > if (!event_filter_match(event)) { > - if (is_cgroup_event(event)) > - perf_cgroup_defer_enabled(event); > ctx_sched_in(ctx, cpuctx, EVENT_TIME, current); > return; > } > @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event, > * we know the event must be on the current CPU, therefore we > * don't need to use it. > */ > - switch (event->state) { > - case PERF_EVENT_STATE_ACTIVE: > + if (event->state == PERF_EVENT_STATE_ACTIVE) > event->pmu->read(event); > - /* fall-through */ > > - case PERF_EVENT_STATE_INACTIVE: > - update_event_times(event); > - break; > - > - default: > - break; > - } > + perf_event_update_time(event); > > /* > * In order to keep per-task stats reliable we need to flip the event > @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, > if (!event_filter_match(event)) > continue; > > - /* may need to reset tstamp_enabled */ > - if (is_cgroup_event(event)) > - perf_cgroup_mark_enabled(event, ctx); > - > if (group_can_go_on(event, cpuctx, 1)) > group_sched_in(event, cpuctx, ctx); > > @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, > * If this pinned group hasn't been scheduled, > * put it in error state. 
> */ > - if (event->state == PERF_EVENT_STATE_INACTIVE) { > - update_group_times(event); > - event->state = PERF_EVENT_STATE_ERROR; > - } > + if (event->state == PERF_EVENT_STATE_INACTIVE) > + perf_event_set_state(event, PERF_EVENT_STATE_ERROR); > } > } > > @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx, > if (!event_filter_match(event)) > continue; > > - /* may need to reset tstamp_enabled */ > - if (is_cgroup_event(event)) > - perf_cgroup_mark_enabled(event, ctx); > - > if (group_can_go_on(event, cpuctx, can_add_hw)) { > if (group_sched_in(event, cpuctx, ctx)) > can_add_hw = 0; > @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event, > if (event->state >= PERF_EVENT_STATE_INACTIVE) > return 0; > > - __perf_event_mark_enabled(event); > + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); > > return 1; > } > @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn) > put_ctx(clone_ctx); > } > > -struct perf_read_data { > - struct perf_event *event; > - bool group; > - int ret; > -}; > - > static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) > { > u16 local_pkg, event_pkg; > @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) > return event_cpu; > } > > -/* > - * Cross CPU call to read the hardware event > - */ > -static void __perf_event_read(void *info) > -{ > - struct perf_read_data *data = info; > - struct perf_event *sub, *event = data->event; > - struct perf_event_context *ctx = event->ctx; > - struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); > - struct pmu *pmu = event->pmu; > - > - /* > - * If this is a task context, we need to check whether it is > - * the current task context of this cpu. If not it has been > - * scheduled out before the smp call arrived. In that case > - * event->count would have been updated to a recent sample > - * when the event was scheduled out. > - */ > - if (ctx->task && cpuctx->task_ctx != ctx) > - return; > - > - raw_spin_lock(&ctx->lock); > - if (ctx->is_active) { > - update_context_time(ctx); > - update_cgrp_time_from_event(event); > - } > - > - update_event_times(event); > - if (event->state != PERF_EVENT_STATE_ACTIVE) > - goto unlock; > - > - if (!data->group) { > - pmu->read(event); > - data->ret = 0; > - goto unlock; > - } > - > - pmu->start_txn(pmu, PERF_PMU_TXN_READ); > - > - pmu->read(event); > - > - list_for_each_entry(sub, &event->sibling_list, group_entry) { > - update_event_times(sub); > - if (sub->state == PERF_EVENT_STATE_ACTIVE) { > - /* > - * Use sibling's PMU rather than @event's since > - * sibling could be on different (eg: software) PMU. 
> - */ > - sub->pmu->read(sub); > - } > - } > - > - data->ret = pmu->commit_txn(pmu); > - > -unlock: > - raw_spin_unlock(&ctx->lock); > -} > - > static inline u64 perf_event_count(struct perf_event *event) > { > return local64_read(&event->count) + atomic64_read(&event->child_count); > @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value) > return ret; > } > > -static int perf_event_read(struct perf_event *event, bool group) > +struct perf_read_data { > + struct perf_event *event; > + bool group; > + int ret; > +}; > + > +static void __perf_event_read(struct perf_event *event, > + struct perf_cpu_context *cpuctx, > + struct perf_event_context *ctx, > + void *data) > { > - int event_cpu, ret = 0; > + struct perf_read_data *prd = data; > + struct pmu *pmu = event->pmu; > + struct perf_event *sibling; > > - /* > - * If event is enabled and currently active on a CPU, update the > - * value in the event structure: > - */ > - if (event->state == PERF_EVENT_STATE_ACTIVE) { > - struct perf_read_data data = { > - .event = event, > - .group = group, > - .ret = 0, > - }; > + if (ctx->is_active & EVENT_TIME) { > + update_context_time(ctx); > + update_cgrp_time_from_cpuctx(cpuctx); > + } > > - event_cpu = READ_ONCE(event->oncpu); > - if ((unsigned)event_cpu >= nr_cpu_ids) > - return 0; > + perf_event_update_time(event); > + if (prd->group) > + perf_event_update_sibling_time(event); > > - preempt_disable(); > - event_cpu = __perf_event_read_cpu(event, event_cpu); > + if (event->state != PERF_EVENT_STATE_ACTIVE) > + return; > > + if (!prd->group) { > + pmu->read(event); > + prd->ret = 0; > + return; > + } > + > + pmu->start_txn(pmu, PERF_PMU_TXN_READ); > + > + pmu->read(event); > + list_for_each_entry(sibling, &event->sibling_list, group_entry) { > + if (sibling->state == PERF_EVENT_STATE_ACTIVE) { > + /* > + * Use sibling's PMU rather than @event's since > + * sibling could be on different (eg: software) PMU. > + */ > + sibling->pmu->read(sibling); > + } > + } > + > + prd->ret = pmu->commit_txn(pmu); > +} > + > +static int perf_event_read(struct perf_event *event, bool group) > +{ > + struct perf_read_data prd = { > + .event = event, > + .group = group, > + .ret = 0, > + }; > + > + if (event->ctx->task) { > + event_function_call(event, __perf_event_read, &prd); > + } else { > /* > - * Purposely ignore the smp_call_function_single() return > - * value. > - * > - * If event_cpu isn't a valid CPU it means the event got > - * scheduled out and that will have updated the event count. > - * > - * Therefore, either way, we'll have an up-to-date event count > - * after this. > - */ > - (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); > - preempt_enable(); > - ret = data.ret; > - } else if (event->state == PERF_EVENT_STATE_INACTIVE) { > - struct perf_event_context *ctx = event->ctx; > - unsigned long flags; > - > - raw_spin_lock_irqsave(&ctx->lock, flags); > - /* > - * may read while context is not active > - * (e.g., thread is blocked), in that case > - * we cannot update context time > + * For uncore events (which are per definition per-cpu) > + * allow a different read CPU from event->cpu. 
> */ > - if (ctx->is_active) { > - update_context_time(ctx); > - update_cgrp_time_from_event(event); > - } > - if (group) > - update_group_times(event); > - else > - update_event_times(event); > - raw_spin_unlock_irqrestore(&ctx->lock, flags); > + struct event_function_struct efs = { > + .event = event, > + .func = __perf_event_read, > + .data = &prd, > + }; > + int cpu = __perf_event_read_cpu(event, event->cpu); > + > + cpu_function_call(cpu, event_function, &efs); > } > > - return ret; > + return prd.ret; > } > > /* > @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file) > return 0; > } > > -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) > +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) > { > struct perf_event *child; > u64 total = 0; > @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) > > return total; > } > + > +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) > +{ > + struct perf_event_context *ctx; > + u64 count; > + > + ctx = perf_event_ctx_lock(event); > + count = __perf_event_read_value(event, enabled, running); > + perf_event_ctx_unlock(event, ctx); > + > + return count; > +} > EXPORT_SYMBOL_GPL(perf_event_read_value); > > static int __perf_read_group_add(struct perf_event *leader, > @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader, > if (ret) > return ret; > > + raw_spin_lock_irqsave(&ctx->lock, flags); > + > /* > * Since we co-schedule groups, {enabled,running} times of siblings > * will be identical to those of the leader, so we only publish one > @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader, > if (read_format & PERF_FORMAT_ID) > values[n++] = primary_event_id(leader); > > - raw_spin_lock_irqsave(&ctx->lock, flags); > - > list_for_each_entry(sub, &leader->sibling_list, group_entry) { > values[n++] += perf_event_count(sub); > if (read_format & PERF_FORMAT_ID) > @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event, > u64 values[4]; > int n = 0; > > - values[n++] = perf_event_read_value(event, &enabled, &running); > + values[n++] = __perf_event_read_value(event, &enabled, &running); > if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) > values[n++] = enabled; > if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) > @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event, > > *now = perf_clock(); > ctx_time = event->shadow_ctx_time + *now; > - *enabled = ctx_time - event->tstamp_enabled; > - *running = ctx_time - event->tstamp_running; > + __perf_update_times(event, ctx_time, enabled, running); > } > > static void perf_event_init_userpage(struct perf_event *event) > @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event, > if (parent_event) > perf_group_detach(child_event); > list_del_event(child_event, child_ctx); > - child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */ > + perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */ > raw_spin_unlock_irq(&child_ctx->lock); > > /* > @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event, > struct perf_event *group_leader, > struct perf_event_context *child_ctx) > { > - enum perf_event_active_state parent_state = parent_event->state; > + enum perf_event_state parent_state = parent_event->state; > struct perf_event *child_event; > unsigned long flags; > > @@ -11090,6 +10914,7 
@@ static void __perf_event_exit_context(void *__info) > struct perf_event *event; > > raw_spin_lock(&ctx->lock); > + ctx_sched_out(ctx, cpuctx, EVENT_TIME); > list_for_each_entry(event, &ctx->event_list, event_entry) > __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP); > raw_spin_unlock(&ctx->lock); >
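
To double-check the semantics of the new scheme, here is a small stand-alone
user space model of it (my sketch, with assumptions: groups, cgroups and the
ctx->time/shadow-time plumbing are left out, and plain integers stand in for
perf_event_time()); it only mirrors the __perf_update_times() /
perf_event_set_state() idea from the patch above:

/*
 * Model of the single-timestamp scheme: one tstamp, and on every state
 * change (or read) the elapsed delta is folded into enabled/running
 * according to the *previous* state.
 */
#include <stdint.h>
#include <stdio.h>

enum ev_state {			/* mirrors the relevant perf_event_state values */
	EV_OFF		= -1,
	EV_INACTIVE	=  0,
	EV_ACTIVE	=  1,
};

struct ev {
	enum ev_state	state;
	uint64_t	tstamp;			/* time of last state change / update */
	uint64_t	total_time_enabled;
	uint64_t	total_time_running;
};

/* fold the time since ->tstamp into the totals, based on the current state */
static void ev_update_time(struct ev *e, uint64_t now)
{
	uint64_t delta = now - e->tstamp;

	if (e->state >= EV_INACTIVE)
		e->total_time_enabled += delta;
	if (e->state >= EV_ACTIVE)
		e->total_time_running += delta;
	e->tstamp = now;
}

/* state changes (and reads) are the only places that touch the clock */
static void ev_set_state(struct ev *e, enum ev_state state, uint64_t now)
{
	if (e->state == state)
		return;
	ev_update_time(e, now);
	e->state = state;
}

int main(void)
{
	struct ev e = { .state = EV_OFF };

	ev_set_state(&e, EV_INACTIVE, 100);	/* enabled at t=100 */
	ev_set_state(&e, EV_ACTIVE,   130);	/* scheduled in at t=130 */
	ev_set_state(&e, EV_INACTIVE, 180);	/* scheduled out at t=180 */

	ev_update_time(&e, 200);		/* read() at t=200 */

	printf("enabled=%llu running=%llu\n",
	       (unsigned long long)e.total_time_enabled,
	       (unsigned long long)e.total_time_running);
	return 0;
}

With OFF->INACTIVE at t=100, INACTIVE->ACTIVE at t=130, ACTIVE->INACTIVE at
t=180 and a read at t=200 this prints enabled=100 running=50, i.e. the totals
advance only on state changes and reads, which is the whole point of the
single tstamp.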