From: Stephane Eranian
Date: Tue, 5 Sep 2017 00:51:35 -0700
Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping
To: Peter Zijlstra
Cc: Alexey Budankov, Ingo Molnar, Arnaldo Carvalho de Melo, Alexander Shishkin, Andi Kleen, Kan Liang, Dmitri Prokhorov, Valery Cherepennikov, Mark Rutland, David Carrillo-Cisneros, linux-kernel, Vince Weaver, Thomas Gleixner

On Thu, Aug 31, 2017 at 12:51 PM, Stephane Eranian wrote:
> Hi,
>
> On Thu, Aug 31, 2017 at 10:18 AM, Peter Zijlstra wrote:
>> On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote:
>>> On 22.08.2017 23:47, Peter Zijlstra wrote:
>>> > On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote:
>>> >> The key thing in the patch is explicit updating of tstamp fields for
>>> >> INACTIVE events in update_event_times().
>>> >
>>> >> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event)
>>> >> event->group_leader->state < PERF_EVENT_STATE_INACTIVE)
>>> >> return;
>>> >>
>>> >> + if (event->state == PERF_EVENT_STATE_INACTIVE)
>>> >> + perf_event_tstamp_update(event);
>>> >> +
>>> >> /*
>>> >> * in cgroup mode, time_enabled represents
>>> >> * the time the event was enabled AND active
>>> >
>>> > But why!? I thought the whole point was to not need to do this.
>>>
>>> update_event_times() is not called from timer interrupt handler
>>> thus it is not on the critical path which is optimized in this patch set.
>>>
>>> But update_event_times() is called in the context of read() syscall so
>>> this is the place where we may update event times for INACTIVE events
>>> instead of timer interrupt.
>>>
>>> Also update_event_times() is called on thread context switch out so
>>> we get event times also updated when the thread migrates to other CPU.
>>>
>>> >
>>> > The thing I outlined earlier would only need to update timestamps when
>>> > events change state and at no other point in time.
>>>
>>> But we still may request times while event is in INACTIVE state
>>> thru read() syscall and event timings need to be up-to-date.
>>
>> Sure, read() also updates.
>>
>> So the below completely rewrites timekeeping (and probably breaks
>> world) but does away with the need to touch events that don't get
>> scheduled.
>>
>> Esp the cgroup stuff is entirely untested since I simply don't know how
>> to operate that. I did run Vince's tests on it, and I think it doesn't
>> regress, but I'm near a migraine so I can't really see straight atm.
>>
>> Vince, Stephane, could you guys have a peek?
>>
> okay, I will run some tests with cgroups on my systems.
>
I ran some cgroup tests, including multiplexing, and so far it appears
to work normally.

It is easy to create a cgroup and move a shell into it:

$ mount -t cgroup none /sys/fs/cgroups
$ cd /sys/fs/cgroups/perf_events
$ mkdir memtoy
$ cd memtoy
$ echo $$ >tasks

At this point your shell is part of the cgroup. Then you can use perf
to monitor globally or inside the cgroup:

$ perf stat -a -e cycles,cycles -G memtoy -I 1000 sleep 1000

That monitors cycles on all CPUs twice: once counting only while a
member of the memtoy cgroup is running, and once globally.

>> (There's a few other bits in, I'll break up into patches and write
>> comments and Changelogs later, I think it can be split in some 5
>> patches).
>>
>> The basic idea is really simple, we have a single timestamp and
>> depending on the state we update enabled/running. This obviously only
>> requires updates when we change state and when we need up-to-date
>> timestamps (read).
>>
>> No more weird and wonderful mind bending interaction between 3 different
>> timestamps with arcane update rules.
>>
>> ---
>> include/linux/perf_event.h | 25 +-
>> kernel/events/core.c | 551 ++++++++++++++++-----------------------------
>> 2 files changed, 192 insertions(+), 384 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 8e22f24ded6a..2a6ae48a1a96 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -485,9 +485,9 @@ struct perf_addr_filters_head {
>> };
>>
>> /**
>> - * enum perf_event_active_state - the states of a event
>> + * enum perf_event_state - the states of a event
>> */
>> -enum perf_event_active_state {
>> +enum perf_event_state {
>> PERF_EVENT_STATE_DEAD = -4,
>> PERF_EVENT_STATE_EXIT = -3,
>> PERF_EVENT_STATE_ERROR = -2,
>> @@ -578,7 +578,7 @@ struct perf_event {
>> struct pmu *pmu;
>> void *pmu_private;
>>
>> - enum perf_event_active_state state;
>> + enum perf_event_state state;
>> unsigned int attach_state;
>> local64_t count;
>> atomic64_t child_count;
>> @@ -588,26 +588,10 @@ struct perf_event {
>> * has been enabled (i.e. eligible to run, and the task has
>> * been scheduled in, if this is a per-task event)
>> * and running (scheduled onto the CPU), respectively.
>> - *
>> - * They are computed from tstamp_enabled, tstamp_running and
>> - * tstamp_stopped when the event is in INACTIVE or ACTIVE state.
>> */
>> u64 total_time_enabled;
>> u64 total_time_running;
>> -
>> - /*
>> - * These are timestamps used for computing total_time_enabled
>> - * and total_time_running when the event is in INACTIVE or
>> - * ACTIVE state, measured in nanoseconds from an arbitrary point
>> - * in time.
>> - * tstamp_enabled: the notional time when the event was enabled
>> - * tstamp_running: the notional time when the event was scheduled on
>> - * tstamp_stopped: in INACTIVE state, the notional time when the
>> - * event was scheduled off.
>> - */ >> - u64 tstamp_enabled; >> - u64 tstamp_running; >> - u64 tstamp_stopped; >> + u64 tstamp; >> >> /* >> * timestamp shadows the actual context timing but it can >> @@ -699,7 +683,6 @@ struct perf_event { >> >> #ifdef CONFIG_CGROUP_PERF >> struct perf_cgroup *cgrp; /* cgroup event is attach to */ >> - int cgrp_defer_enabled; >> #endif >> >> struct list_head sb_list; >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 294f1927f944..e968b3eab9c7 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -582,6 +582,70 @@ static inline u64 perf_event_clock(struct perf_event *event) >> return event->clock(); >> } >> >> +/* >> + * XXX comment about timekeeping goes here >> + */ >> + >> +static __always_inline enum perf_event_state >> +__perf_effective_state(struct perf_event *event) >> +{ >> + struct perf_event *leader = event->group_leader; >> + >> + if (leader->state <= PERF_EVENT_STATE_OFF) >> + return leader->state; >> + >> + return event->state; >> +} >> + >> +static __always_inline void >> +__perf_update_times(struct perf_event *event, u64 now, u64 *enabled, u64 *running) >> +{ >> + enum perf_event_state state = __perf_effective_state(event); >> + u64 delta = now - event->tstamp; >> + >> + *enabled = event->total_time_enabled; >> + if (state >= PERF_EVENT_STATE_INACTIVE) >> + *enabled += delta; >> + >> + *running = event->total_time_running; >> + if (state >= PERF_EVENT_STATE_ACTIVE) >> + *running += delta; >> +} >> + >> +static void perf_event_update_time(struct perf_event *event) >> +{ >> + u64 now = perf_event_time(event); >> + >> + __perf_update_times(event, now, &event->total_time_enabled, >> + &event->total_time_running); >> + event->tstamp = now; >> +} >> + >> +static void perf_event_update_sibling_time(struct perf_event *leader) >> +{ >> + struct perf_event *sibling; >> + >> + list_for_each_entry(sibling, &leader->sibling_list, group_entry) >> + perf_event_update_time(sibling); >> +} >> + >> +static void >> +perf_event_set_state(struct perf_event *event, enum perf_event_state state) >> +{ >> + if (event->state == state) >> + return; >> + >> + perf_event_update_time(event); >> + /* >> + * If a group leader gets enabled/disabled all its siblings >> + * are affected too. >> + */ >> + if ((event->state < 0) ^ (state < 0)) >> + perf_event_update_sibling_time(event); >> + >> + WRITE_ONCE(event->state, state); >> +} >> + >> #ifdef CONFIG_CGROUP_PERF >> >> static inline bool >> @@ -841,40 +905,6 @@ perf_cgroup_set_shadow_time(struct perf_event *event, u64 now) >> event->shadow_ctx_time = now - t->timestamp; >> } >> >> -static inline void >> -perf_cgroup_defer_enabled(struct perf_event *event) >> -{ >> - /* >> - * when the current task's perf cgroup does not match >> - * the event's, we need to remember to call the >> - * perf_mark_enable() function the first time a task with >> - * a matching perf cgroup is scheduled in. 
>> - */ >> - if (is_cgroup_event(event) && !perf_cgroup_match(event)) >> - event->cgrp_defer_enabled = 1; >> -} >> - >> -static inline void >> -perf_cgroup_mark_enabled(struct perf_event *event, >> - struct perf_event_context *ctx) >> -{ >> - struct perf_event *sub; >> - u64 tstamp = perf_event_time(event); >> - >> - if (!event->cgrp_defer_enabled) >> - return; >> - >> - event->cgrp_defer_enabled = 0; >> - >> - event->tstamp_enabled = tstamp - event->total_time_enabled; >> - list_for_each_entry(sub, &event->sibling_list, group_entry) { >> - if (sub->state >= PERF_EVENT_STATE_INACTIVE) { >> - sub->tstamp_enabled = tstamp - sub->total_time_enabled; >> - sub->cgrp_defer_enabled = 0; >> - } >> - } >> -} >> - >> /* >> * Update cpuctx->cgrp so that it is set when first cgroup event is added and >> * cleared when last cgroup event is removed. >> @@ -973,17 +1003,6 @@ static inline u64 perf_cgroup_event_time(struct perf_event *event) >> } >> >> static inline void >> -perf_cgroup_defer_enabled(struct perf_event *event) >> -{ >> -} >> - >> -static inline void >> -perf_cgroup_mark_enabled(struct perf_event *event, >> - struct perf_event_context *ctx) >> -{ >> -} >> - >> -static inline void >> list_update_cgroup_event(struct perf_event *event, >> struct perf_event_context *ctx, bool add) >> { >> @@ -1396,60 +1415,6 @@ static u64 perf_event_time(struct perf_event *event) >> return ctx ? ctx->time : 0; >> } >> >> -/* >> - * Update the total_time_enabled and total_time_running fields for a event. >> - */ >> -static void update_event_times(struct perf_event *event) >> -{ >> - struct perf_event_context *ctx = event->ctx; >> - u64 run_end; >> - >> - lockdep_assert_held(&ctx->lock); >> - >> - if (event->state < PERF_EVENT_STATE_INACTIVE || >> - event->group_leader->state < PERF_EVENT_STATE_INACTIVE) >> - return; >> - >> - /* >> - * in cgroup mode, time_enabled represents >> - * the time the event was enabled AND active >> - * tasks were in the monitored cgroup. This is >> - * independent of the activity of the context as >> - * there may be a mix of cgroup and non-cgroup events. >> - * >> - * That is why we treat cgroup events differently >> - * here. >> - */ >> - if (is_cgroup_event(event)) >> - run_end = perf_cgroup_event_time(event); >> - else if (ctx->is_active) >> - run_end = ctx->time; >> - else >> - run_end = event->tstamp_stopped; >> - >> - event->total_time_enabled = run_end - event->tstamp_enabled; >> - >> - if (event->state == PERF_EVENT_STATE_INACTIVE) >> - run_end = event->tstamp_stopped; >> - else >> - run_end = perf_event_time(event); >> - >> - event->total_time_running = run_end - event->tstamp_running; >> - >> -} >> - >> -/* >> - * Update total_time_enabled and total_time_running for all events in a group. 
>> - */ >> -static void update_group_times(struct perf_event *leader) >> -{ >> - struct perf_event *event; >> - >> - update_event_times(leader); >> - list_for_each_entry(event, &leader->sibling_list, group_entry) >> - update_event_times(event); >> -} >> - >> static enum event_type_t get_event_type(struct perf_event *event) >> { >> struct perf_event_context *ctx = event->ctx; >> @@ -1492,6 +1457,8 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx) >> WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT); >> event->attach_state |= PERF_ATTACH_CONTEXT; >> >> + event->tstamp = perf_event_time(event); >> + >> /* >> * If we're a stand alone event or group leader, we go to the context >> * list, group events are kept attached to the group so that >> @@ -1699,8 +1666,6 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) >> if (event->group_leader == event) >> list_del_init(&event->group_entry); >> >> - update_group_times(event); >> - >> /* >> * If event was in error state, then keep it >> * that way, otherwise bogus counts will be >> @@ -1709,7 +1674,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx) >> * of the event >> */ >> if (event->state > PERF_EVENT_STATE_OFF) >> - event->state = PERF_EVENT_STATE_OFF; >> + perf_event_set_state(event, PERF_EVENT_STATE_OFF); >> >> ctx->generation++; >> } >> @@ -1808,38 +1773,24 @@ event_sched_out(struct perf_event *event, >> struct perf_cpu_context *cpuctx, >> struct perf_event_context *ctx) >> { >> - u64 tstamp = perf_event_time(event); >> - u64 delta; >> + enum perf_event_state state = PERF_EVENT_STATE_INACTIVE; >> >> WARN_ON_ONCE(event->ctx != ctx); >> lockdep_assert_held(&ctx->lock); >> >> - /* >> - * An event which could not be activated because of >> - * filter mismatch still needs to have its timings >> - * maintained, otherwise bogus information is return >> - * via read() for time_enabled, time_running: >> - */ >> - if (event->state == PERF_EVENT_STATE_INACTIVE && >> - !event_filter_match(event)) { >> - delta = tstamp - event->tstamp_stopped; >> - event->tstamp_running += delta; >> - event->tstamp_stopped = tstamp; >> - } >> - >> if (event->state != PERF_EVENT_STATE_ACTIVE) >> return; >> >> perf_pmu_disable(event->pmu); >> >> - event->tstamp_stopped = tstamp; >> event->pmu->del(event, 0); >> event->oncpu = -1; >> - event->state = PERF_EVENT_STATE_INACTIVE; >> + >> if (event->pending_disable) { >> event->pending_disable = 0; >> - event->state = PERF_EVENT_STATE_OFF; >> + state = PERF_EVENT_STATE_OFF; >> } >> + perf_event_set_state(event, state); >> >> if (!is_software_event(event)) >> cpuctx->active_oncpu--; >> @@ -1859,7 +1810,9 @@ group_sched_out(struct perf_event *group_event, >> struct perf_event_context *ctx) >> { >> struct perf_event *event; >> - int state = group_event->state; >> + >> + if (group_event->state != PERF_EVENT_STATE_ACTIVE) >> + return; >> >> perf_pmu_disable(ctx->pmu); >> >> @@ -1873,7 +1826,7 @@ group_sched_out(struct perf_event *group_event, >> >> perf_pmu_enable(ctx->pmu); >> >> - if (state == PERF_EVENT_STATE_ACTIVE && group_event->attr.exclusive) >> + if (group_event->attr.exclusive) >> cpuctx->exclusive = 0; >> } >> >> @@ -1893,6 +1846,11 @@ __perf_remove_from_context(struct perf_event *event, >> { >> unsigned long flags = (unsigned long)info; >> >> + if (ctx->is_active & EVENT_TIME) { >> + update_context_time(ctx); >> + update_cgrp_time_from_cpuctx(cpuctx); >> + } >> + >> event_sched_out(event, cpuctx, ctx); >> if (flags & DETACH_GROUP) >> 
perf_group_detach(event); >> @@ -1955,14 +1913,17 @@ static void __perf_event_disable(struct perf_event *event, >> if (event->state < PERF_EVENT_STATE_INACTIVE) >> return; >> >> - update_context_time(ctx); >> - update_cgrp_time_from_event(event); >> - update_group_times(event); >> + if (ctx->is_active & EVENT_TIME) { >> + update_context_time(ctx); >> + update_cgrp_time_from_cpuctx(cpuctx); >> + } >> + >> if (event == event->group_leader) >> group_sched_out(event, cpuctx, ctx); >> else >> event_sched_out(event, cpuctx, ctx); >> - event->state = PERF_EVENT_STATE_OFF; >> + >> + perf_event_set_state(event, PERF_EVENT_STATE_OFF); >> } >> >> /* >> @@ -2019,8 +1980,7 @@ void perf_event_disable_inatomic(struct perf_event *event) >> } >> >> static void perf_set_shadow_time(struct perf_event *event, >> - struct perf_event_context *ctx, >> - u64 tstamp) >> + struct perf_event_context *ctx) >> { >> /* >> * use the correct time source for the time snapshot >> @@ -2048,9 +2008,9 @@ static void perf_set_shadow_time(struct perf_event *event, >> * is cleaner and simpler to understand. >> */ >> if (is_cgroup_event(event)) >> - perf_cgroup_set_shadow_time(event, tstamp); >> + perf_cgroup_set_shadow_time(event, event->tstamp); >> else >> - event->shadow_ctx_time = tstamp - ctx->timestamp; >> + event->shadow_ctx_time = event->tstamp - ctx->timestamp; >> } >> >> #define MAX_INTERRUPTS (~0ULL) >> @@ -2063,7 +2023,6 @@ event_sched_in(struct perf_event *event, >> struct perf_cpu_context *cpuctx, >> struct perf_event_context *ctx) >> { >> - u64 tstamp = perf_event_time(event); >> int ret = 0; >> >> lockdep_assert_held(&ctx->lock); >> @@ -2077,7 +2036,7 @@ event_sched_in(struct perf_event *event, >> * is visible. >> */ >> smp_wmb(); >> - WRITE_ONCE(event->state, PERF_EVENT_STATE_ACTIVE); >> + perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE); >> >> /* >> * Unthrottle events, since we scheduled we might have missed several >> @@ -2089,26 +2048,19 @@ event_sched_in(struct perf_event *event, >> event->hw.interrupts = 0; >> } >> >> - /* >> - * The new state must be visible before we turn it on in the hardware: >> - */ >> - smp_wmb(); >> - >> perf_pmu_disable(event->pmu); >> >> - perf_set_shadow_time(event, ctx, tstamp); >> + perf_set_shadow_time(event, ctx); >> >> perf_log_itrace_start(event); >> >> if (event->pmu->add(event, PERF_EF_START)) { >> - event->state = PERF_EVENT_STATE_INACTIVE; >> + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); >> event->oncpu = -1; >> ret = -EAGAIN; >> goto out; >> } >> >> - event->tstamp_running += tstamp - event->tstamp_stopped; >> - >> if (!is_software_event(event)) >> cpuctx->active_oncpu++; >> if (!ctx->nr_active++) >> @@ -2132,8 +2084,6 @@ group_sched_in(struct perf_event *group_event, >> { >> struct perf_event *event, *partial_group = NULL; >> struct pmu *pmu = ctx->pmu; >> - u64 now = ctx->time; >> - bool simulate = false; >> >> if (group_event->state == PERF_EVENT_STATE_OFF) >> return 0; >> @@ -2163,27 +2113,13 @@ group_sched_in(struct perf_event *group_event, >> /* >> * Groups can be scheduled in as one unit only, so undo any >> * partial group before returning: >> - * The events up to the failed event are scheduled out normally, >> - * tstamp_stopped will be updated. >> - * >> - * The failed events and the remaining siblings need to have >> - * their timings updated as if they had gone thru event_sched_in() >> - * and event_sched_out(). This is required to get consistent timings >> - * across the group. 
This also takes care of the case where the group >> - * could never be scheduled by ensuring tstamp_stopped is set to mark >> - * the time the event was actually stopped, such that time delta >> - * calculation in update_event_times() is correct. >> + * The events up to the failed event are scheduled out normally. >> */ >> list_for_each_entry(event, &group_event->sibling_list, group_entry) { >> if (event == partial_group) >> - simulate = true; >> + break; >> >> - if (simulate) { >> - event->tstamp_running += now - event->tstamp_stopped; >> - event->tstamp_stopped = now; >> - } else { >> - event_sched_out(event, cpuctx, ctx); >> - } >> + event_sched_out(event, cpuctx, ctx); >> } >> event_sched_out(group_event, cpuctx, ctx); >> >> @@ -2225,46 +2161,11 @@ static int group_can_go_on(struct perf_event *event, >> return can_add_hw; >> } >> >> -/* >> - * Complement to update_event_times(). This computes the tstamp_* values to >> - * continue 'enabled' state from @now, and effectively discards the time >> - * between the prior tstamp_stopped and now (as we were in the OFF state, or >> - * just switched (context) time base). >> - * >> - * This further assumes '@event->state == INACTIVE' (we just came from OFF) and >> - * cannot have been scheduled in yet. And going into INACTIVE state means >> - * '@event->tstamp_stopped = @now'. >> - * >> - * Thus given the rules of update_event_times(): >> - * >> - * total_time_enabled = tstamp_stopped - tstamp_enabled >> - * total_time_running = tstamp_stopped - tstamp_running >> - * >> - * We can insert 'tstamp_stopped == now' and reverse them to compute new >> - * tstamp_* values. >> - */ >> -static void __perf_event_enable_time(struct perf_event *event, u64 now) >> -{ >> - WARN_ON_ONCE(event->state != PERF_EVENT_STATE_INACTIVE); >> - >> - event->tstamp_stopped = now; >> - event->tstamp_enabled = now - event->total_time_enabled; >> - event->tstamp_running = now - event->total_time_running; >> -} >> - >> static void add_event_to_ctx(struct perf_event *event, >> struct perf_event_context *ctx) >> { >> - u64 tstamp = perf_event_time(event); >> - >> list_add_event(event, ctx); >> perf_group_attach(event); >> - /* >> - * We can be called with event->state == STATE_OFF when we create with >> - * .disabled = 1. In that case the IOC_ENABLE will call this function. >> - */ >> - if (event->state == PERF_EVENT_STATE_INACTIVE) >> - __perf_event_enable_time(event, tstamp); >> } >> >> static void ctx_sched_out(struct perf_event_context *ctx, >> @@ -2496,28 +2397,6 @@ perf_install_in_context(struct perf_event_context *ctx, >> } >> >> /* >> - * Put a event into inactive state and update time fields. >> - * Enabling the leader of a group effectively enables all >> - * the group members that aren't explicitly disabled, so we >> - * have to update their ->tstamp_enabled also. >> - * Note: this works for group members as well as group leaders >> - * since the non-leader members' sibling_lists will be empty. 
>> - */ >> -static void __perf_event_mark_enabled(struct perf_event *event) >> -{ >> - struct perf_event *sub; >> - u64 tstamp = perf_event_time(event); >> - >> - event->state = PERF_EVENT_STATE_INACTIVE; >> - __perf_event_enable_time(event, tstamp); >> - list_for_each_entry(sub, &event->sibling_list, group_entry) { >> - /* XXX should not be > INACTIVE if event isn't */ >> - if (sub->state >= PERF_EVENT_STATE_INACTIVE) >> - __perf_event_enable_time(sub, tstamp); >> - } >> -} >> - >> -/* >> * Cross CPU call to enable a performance event >> */ >> static void __perf_event_enable(struct perf_event *event, >> @@ -2535,14 +2414,12 @@ static void __perf_event_enable(struct perf_event *event, >> if (ctx->is_active) >> ctx_sched_out(ctx, cpuctx, EVENT_TIME); >> >> - __perf_event_mark_enabled(event); >> + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); >> >> if (!ctx->is_active) >> return; >> >> if (!event_filter_match(event)) { >> - if (is_cgroup_event(event)) >> - perf_cgroup_defer_enabled(event); >> ctx_sched_in(ctx, cpuctx, EVENT_TIME, current); >> return; >> } >> @@ -2862,18 +2739,10 @@ static void __perf_event_sync_stat(struct perf_event *event, >> * we know the event must be on the current CPU, therefore we >> * don't need to use it. >> */ >> - switch (event->state) { >> - case PERF_EVENT_STATE_ACTIVE: >> + if (event->state == PERF_EVENT_STATE_ACTIVE) >> event->pmu->read(event); >> - /* fall-through */ >> >> - case PERF_EVENT_STATE_INACTIVE: >> - update_event_times(event); >> - break; >> - >> - default: >> - break; >> - } >> + perf_event_update_time(event); >> >> /* >> * In order to keep per-task stats reliable we need to flip the event >> @@ -3110,10 +2979,6 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, >> if (!event_filter_match(event)) >> continue; >> >> - /* may need to reset tstamp_enabled */ >> - if (is_cgroup_event(event)) >> - perf_cgroup_mark_enabled(event, ctx); >> - >> if (group_can_go_on(event, cpuctx, 1)) >> group_sched_in(event, cpuctx, ctx); >> >> @@ -3121,10 +2986,8 @@ ctx_pinned_sched_in(struct perf_event_context *ctx, >> * If this pinned group hasn't been scheduled, >> * put it in error state. 
>> */ >> - if (event->state == PERF_EVENT_STATE_INACTIVE) { >> - update_group_times(event); >> - event->state = PERF_EVENT_STATE_ERROR; >> - } >> + if (event->state == PERF_EVENT_STATE_INACTIVE) >> + perf_event_set_state(event, PERF_EVENT_STATE_ERROR); >> } >> } >> >> @@ -3146,10 +3009,6 @@ ctx_flexible_sched_in(struct perf_event_context *ctx, >> if (!event_filter_match(event)) >> continue; >> >> - /* may need to reset tstamp_enabled */ >> - if (is_cgroup_event(event)) >> - perf_cgroup_mark_enabled(event, ctx); >> - >> if (group_can_go_on(event, cpuctx, can_add_hw)) { >> if (group_sched_in(event, cpuctx, ctx)) >> can_add_hw = 0; >> @@ -3541,7 +3400,7 @@ static int event_enable_on_exec(struct perf_event *event, >> if (event->state >= PERF_EVENT_STATE_INACTIVE) >> return 0; >> >> - __perf_event_mark_enabled(event); >> + perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE); >> >> return 1; >> } >> @@ -3590,12 +3449,6 @@ static void perf_event_enable_on_exec(int ctxn) >> put_ctx(clone_ctx); >> } >> >> -struct perf_read_data { >> - struct perf_event *event; >> - bool group; >> - int ret; >> -}; >> - >> static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) >> { >> u16 local_pkg, event_pkg; >> @@ -3613,64 +3466,6 @@ static int __perf_event_read_cpu(struct perf_event *event, int event_cpu) >> return event_cpu; >> } >> >> -/* >> - * Cross CPU call to read the hardware event >> - */ >> -static void __perf_event_read(void *info) >> -{ >> - struct perf_read_data *data = info; >> - struct perf_event *sub, *event = data->event; >> - struct perf_event_context *ctx = event->ctx; >> - struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); >> - struct pmu *pmu = event->pmu; >> - >> - /* >> - * If this is a task context, we need to check whether it is >> - * the current task context of this cpu. If not it has been >> - * scheduled out before the smp call arrived. In that case >> - * event->count would have been updated to a recent sample >> - * when the event was scheduled out. >> - */ >> - if (ctx->task && cpuctx->task_ctx != ctx) >> - return; >> - >> - raw_spin_lock(&ctx->lock); >> - if (ctx->is_active) { >> - update_context_time(ctx); >> - update_cgrp_time_from_event(event); >> - } >> - >> - update_event_times(event); >> - if (event->state != PERF_EVENT_STATE_ACTIVE) >> - goto unlock; >> - >> - if (!data->group) { >> - pmu->read(event); >> - data->ret = 0; >> - goto unlock; >> - } >> - >> - pmu->start_txn(pmu, PERF_PMU_TXN_READ); >> - >> - pmu->read(event); >> - >> - list_for_each_entry(sub, &event->sibling_list, group_entry) { >> - update_event_times(sub); >> - if (sub->state == PERF_EVENT_STATE_ACTIVE) { >> - /* >> - * Use sibling's PMU rather than @event's since >> - * sibling could be on different (eg: software) PMU. 
>> - */ >> - sub->pmu->read(sub); >> - } >> - } >> - >> - data->ret = pmu->commit_txn(pmu); >> - >> -unlock: >> - raw_spin_unlock(&ctx->lock); >> -} >> - >> static inline u64 perf_event_count(struct perf_event *event) >> { >> return local64_read(&event->count) + atomic64_read(&event->child_count); >> @@ -3733,63 +3528,81 @@ int perf_event_read_local(struct perf_event *event, u64 *value) >> return ret; >> } >> >> -static int perf_event_read(struct perf_event *event, bool group) >> +struct perf_read_data { >> + struct perf_event *event; >> + bool group; >> + int ret; >> +}; >> + >> +static void __perf_event_read(struct perf_event *event, >> + struct perf_cpu_context *cpuctx, >> + struct perf_event_context *ctx, >> + void *data) >> { >> - int event_cpu, ret = 0; >> + struct perf_read_data *prd = data; >> + struct pmu *pmu = event->pmu; >> + struct perf_event *sibling; >> >> - /* >> - * If event is enabled and currently active on a CPU, update the >> - * value in the event structure: >> - */ >> - if (event->state == PERF_EVENT_STATE_ACTIVE) { >> - struct perf_read_data data = { >> - .event = event, >> - .group = group, >> - .ret = 0, >> - }; >> + if (ctx->is_active & EVENT_TIME) { >> + update_context_time(ctx); >> + update_cgrp_time_from_cpuctx(cpuctx); >> + } >> >> - event_cpu = READ_ONCE(event->oncpu); >> - if ((unsigned)event_cpu >= nr_cpu_ids) >> - return 0; >> + perf_event_update_time(event); >> + if (prd->group) >> + perf_event_update_sibling_time(event); >> >> - preempt_disable(); >> - event_cpu = __perf_event_read_cpu(event, event_cpu); >> + if (event->state != PERF_EVENT_STATE_ACTIVE) >> + return; >> >> + if (!prd->group) { >> + pmu->read(event); >> + prd->ret = 0; >> + return; >> + } >> + >> + pmu->start_txn(pmu, PERF_PMU_TXN_READ); >> + >> + pmu->read(event); >> + list_for_each_entry(sibling, &event->sibling_list, group_entry) { >> + if (sibling->state == PERF_EVENT_STATE_ACTIVE) { >> + /* >> + * Use sibling's PMU rather than @event's since >> + * sibling could be on different (eg: software) PMU. >> + */ >> + sibling->pmu->read(sibling); >> + } >> + } >> + >> + prd->ret = pmu->commit_txn(pmu); >> +} >> + >> +static int perf_event_read(struct perf_event *event, bool group) >> +{ >> + struct perf_read_data prd = { >> + .event = event, >> + .group = group, >> + .ret = 0, >> + }; >> + >> + if (event->ctx->task) { >> + event_function_call(event, __perf_event_read, &prd); >> + } else { >> /* >> - * Purposely ignore the smp_call_function_single() return >> - * value. >> - * >> - * If event_cpu isn't a valid CPU it means the event got >> - * scheduled out and that will have updated the event count. >> - * >> - * Therefore, either way, we'll have an up-to-date event count >> - * after this. >> - */ >> - (void)smp_call_function_single(event_cpu, __perf_event_read, &data, 1); >> - preempt_enable(); >> - ret = data.ret; >> - } else if (event->state == PERF_EVENT_STATE_INACTIVE) { >> - struct perf_event_context *ctx = event->ctx; >> - unsigned long flags; >> - >> - raw_spin_lock_irqsave(&ctx->lock, flags); >> - /* >> - * may read while context is not active >> - * (e.g., thread is blocked), in that case >> - * we cannot update context time >> + * For uncore events (which are per definition per-cpu) >> + * allow a different read CPU from event->cpu. 
>> */ >> - if (ctx->is_active) { >> - update_context_time(ctx); >> - update_cgrp_time_from_event(event); >> - } >> - if (group) >> - update_group_times(event); >> - else >> - update_event_times(event); >> - raw_spin_unlock_irqrestore(&ctx->lock, flags); >> + struct event_function_struct efs = { >> + .event = event, >> + .func = __perf_event_read, >> + .data = &prd, >> + }; >> + int cpu = __perf_event_read_cpu(event, event->cpu); >> + >> + cpu_function_call(cpu, event_function, &efs); >> } >> >> - return ret; >> + return prd.ret; >> } >> >> /* >> @@ -4388,7 +4201,7 @@ static int perf_release(struct inode *inode, struct file *file) >> return 0; >> } >> >> -u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) >> +static u64 __perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) >> { >> struct perf_event *child; >> u64 total = 0; >> @@ -4416,6 +4229,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) >> >> return total; >> } >> + >> +u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running) >> +{ >> + struct perf_event_context *ctx; >> + u64 count; >> + >> + ctx = perf_event_ctx_lock(event); >> + count = __perf_event_read_value(event, enabled, running); >> + perf_event_ctx_unlock(event, ctx); >> + >> + return count; >> +} >> EXPORT_SYMBOL_GPL(perf_event_read_value); >> >> static int __perf_read_group_add(struct perf_event *leader, >> @@ -4431,6 +4256,8 @@ static int __perf_read_group_add(struct perf_event *leader, >> if (ret) >> return ret; >> >> + raw_spin_lock_irqsave(&ctx->lock, flags); >> + >> /* >> * Since we co-schedule groups, {enabled,running} times of siblings >> * will be identical to those of the leader, so we only publish one >> @@ -4453,8 +4280,6 @@ static int __perf_read_group_add(struct perf_event *leader, >> if (read_format & PERF_FORMAT_ID) >> values[n++] = primary_event_id(leader); >> >> - raw_spin_lock_irqsave(&ctx->lock, flags); >> - >> list_for_each_entry(sub, &leader->sibling_list, group_entry) { >> values[n++] += perf_event_count(sub); >> if (read_format & PERF_FORMAT_ID) >> @@ -4518,7 +4343,7 @@ static int perf_read_one(struct perf_event *event, >> u64 values[4]; >> int n = 0; >> >> - values[n++] = perf_event_read_value(event, &enabled, &running); >> + values[n++] = __perf_event_read_value(event, &enabled, &running); >> if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) >> values[n++] = enabled; >> if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) >> @@ -4897,8 +4722,7 @@ static void calc_timer_values(struct perf_event *event, >> >> *now = perf_clock(); >> ctx_time = event->shadow_ctx_time + *now; >> - *enabled = ctx_time - event->tstamp_enabled; >> - *running = ctx_time - event->tstamp_running; >> + __perf_update_times(event, ctx_time, enabled, running); >> } >> >> static void perf_event_init_userpage(struct perf_event *event) >> @@ -10516,7 +10340,7 @@ perf_event_exit_event(struct perf_event *child_event, >> if (parent_event) >> perf_group_detach(child_event); >> list_del_event(child_event, child_ctx); >> - child_event->state = PERF_EVENT_STATE_EXIT; /* is_event_hup() */ >> + perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */ >> raw_spin_unlock_irq(&child_ctx->lock); >> >> /* >> @@ -10754,7 +10578,7 @@ inherit_event(struct perf_event *parent_event, >> struct perf_event *group_leader, >> struct perf_event_context *child_ctx) >> { >> - enum perf_event_active_state parent_state = parent_event->state; >> + enum perf_event_state parent_state 
= parent_event->state; >> struct perf_event *child_event; >> unsigned long flags; >> >> @@ -11090,6 +10914,7 @@ static void __perf_event_exit_context(void *__info) >> struct perf_event *event; >> >> raw_spin_lock(&ctx->lock); >> + ctx_sched_out(ctx, cpuctx, EVENT_TIME); >> list_for_each_entry(event, &ctx->event_list, event_entry) >> __perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP); >> raw_spin_unlock(&ctx->lock);
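
A note for anyone else testing this: with the rewrite, enabled/running are
only brought up to date on state changes and on read(), so the easiest way to
poke at the accounting from user space is through the read_format times.
Below is a minimal, illustrative sketch (not part of the patch; the event
choice and names are arbitrary) that does exactly that:

/* Sketch: read time_enabled/time_running for a single self-monitoring event. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

struct read_fmt {
        unsigned long long value;
        unsigned long long time_enabled;
        unsigned long long time_running;
};

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr attr;
        struct read_fmt rf;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING;

        fd = perf_event_open(&attr, 0, -1, -1, 0);      /* self, any CPU */
        if (fd < 0) {
                perror("perf_event_open");
                return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);    /* OFF -> INACTIVE/ACTIVE */
        sleep(1);                               /* accumulate some time   */
        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);   /* back to OFF            */

        if (read(fd, &rf, sizeof(rf)) != sizeof(rf)) {  /* read() updates times */
                perror("read");
                return 1;
        }

        printf("count=%llu enabled=%llu running=%llu\n",
               rf.value, rf.time_enabled, rf.time_running);

        close(fd);
        return 0;
}

Under multiplexing, time_running ends up smaller than time_enabled and the
ratio is what tools use to scale the raw count, so inconsistencies in the new
perf_event_update_time()/__perf_update_times() path show up there quickly.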
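
Since the cgroup paths are the untested part, here is the syscall-level
counterpart of the perf stat -G example earlier in this mail, i.e. a sketch of
how a cgroup-constrained event is created (the path follows the mount layout
shown above; everything else is illustrative only):

/* Sketch: open a cycles event restricted to the "memtoy" cgroup on CPU 0. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        struct {
                unsigned long long value, time_enabled, time_running;
        } rf;
        int cgrp_fd, fd;

        /* fd on the cgroup directory created earlier (illustrative path) */
        cgrp_fd = open("/sys/fs/cgroups/perf_events/memtoy", O_RDONLY);
        if (cgrp_fd < 0) {
                perror("open cgroup dir");
                return 1;
        }

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
                           PERF_FORMAT_TOTAL_TIME_RUNNING;

        /*
         * With PERF_FLAG_PID_CGROUP the pid argument carries the cgroup fd
         * and the event must be per-CPU; one event per CPU is needed for
         * real coverage, CPU 0 is enough for a smoke test.  Requires the
         * usual privileges for CPU-wide events (perf_event_paranoid/root).
         */
        fd = syscall(__NR_perf_event_open, &attr, cgrp_fd, 0, -1,
                     PERF_FLAG_PID_CGROUP);
        if (fd < 0) {
                perror("perf_event_open (cgroup)");
                return 1;
        }

        sleep(1);       /* counts only while memtoy tasks run on CPU 0 */

        if (read(fd, &rf, sizeof(rf)) == sizeof(rf))
                printf("count=%llu enabled=%llu running=%llu\n",
                       rf.value, rf.time_enabled, rf.time_running);

        close(fd);
        close(cgrp_fd);
        return 0;
}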