Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753520AbdIDKqy (ORCPT ); Mon, 4 Sep 2017 06:46:54 -0400 Received: from mga05.intel.com ([192.55.52.43]:19686 "EHLO mga05.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753411AbdIDKqx (ORCPT ); Mon, 4 Sep 2017 06:46:53 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,474,1498546800"; d="scan'208";a="131487793" Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping To: Peter Zijlstra Cc: Ingo Molnar , Arnaldo Carvalho de Melo , Alexander Shishkin , Andi Kleen , Kan Liang , Dmitri Prokhorov , Valery Cherepennikov , Mark Rutland , Stephane Eranian , David Carrillo-Cisneros , linux-kernel , Vince Weaver , Thomas Gleixner References: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com> <20170803150052.za2vofyqfgarukdr@hirez.programming.kicks-ass.net> <20170822204743.GR32112@worktop.programming.kicks-ass.net> <2a426aa2-42c8-e839-1cec-aa3971651f3e@linux.intel.com> <20170831171837.njnc6r6elsvkl7lt@hirez.programming.kicks-ass.net> From: Alexey Budankov Organization: Intel Corp. Message-ID: <46f72a3f-f18b-0227-3d78-fb23c8a6e18e@linux.intel.com> Date: Mon, 4 Sep 2017 13:46:45 +0300 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <20170831171837.njnc6r6elsvkl7lt@hirez.programming.kicks-ass.net> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4651 Lines: 144 Hi, On 31.08.2017 20:18, Peter Zijlstra wrote: > On Wed, Aug 23, 2017 at 11:54:15AM +0300, Alexey Budankov wrote: >> On 22.08.2017 23:47, Peter Zijlstra wrote: >>> On Thu, Aug 10, 2017 at 06:57:43PM +0300, Alexey Budankov wrote: >>>> The key thing in the patch is explicit updating of tstamp fields for >>>> INACTIVE events in update_event_times(). >>> >>>> @@ -1405,6 +1426,9 @@ static void update_event_times(struct perf_event *event) >>>> event->group_leader->state < PERF_EVENT_STATE_INACTIVE) >>>> return; >>>> >>>> + if (event->state == PERF_EVENT_STATE_INACTIVE) >>>> + perf_event_tstamp_update(event); >>>> + >>>> /* >>>> * in cgroup mode, time_enabled represents >>>> * the time the event was enabled AND active >>> >>> But why!? I thought the whole point was to not need to do this. >> >> update_event_times() is not called from timer interrupt handler >> thus it is not on the critical path which is optimized in this patch set. >> >> But update_event_times() is called in the context of read() syscall so >> this is the place where we may update event times for INACTIVE events >> instead of timer interrupt. >> >> Also update_event_times() is called on thread context switch out so >> we get event times also updated when the thread migrates to other CPU. >> >>> >>> The thing I outlined earlier would only need to update timestamps when >>> events change state and at no other point in time. >> >> But we still may request times while event is in INACTIVE state >> thru read() syscall and event timings need to be up-to-date. > > Sure, read() also updates. > > So the below completely rewrites timekeeping (and probably breaks > world) but does away with the need to touch events that don't get > scheduled. We still need and do iterate thru all events at some points e.g. on context switches. > > Esp the cgroup stuff is entirely untested since I simply don't know how > to operate that. I did run Vince's tests on it, and I think it doesn't > regress, but I'm near a migraine so I can't really see straight atm. > > Vince, Stephane, could you guys have a peek? > > (There's a few other bits in, I'll break up into patches and write > comments and Changelogs later, I think its can be split in some 5 > patches). > > The basic idea is really simple, we have a single timestamp and > depending on the state we update enabled/running. This obviously only > requires updates when we change state and when we need up-to-date > timestamps (read). I would prefer to have this rework in a FSM similar to that below, so state transition and the corresponding tstamp, total_time_enabled and total_time_running manipulation logic would be consolidated in one place and adjacent lines of code. >From the table below event->state FSM is not as simple as it may seem on the first sight so in order to avoid regressions after rework we better keep that in mind and explicitly implement allowed and disallowed state transitions. A I O E X D U A Te+,Tr+ Te+,Tr+ Te+,Tr+ Te+,Tr+ Te+,Tr+ Te+,Tr+ --- ts ts ts ts ts ts I Te+,ts Te+,ts Te+,ts Te+,ts Te+,ts Te+,ts --- O Te=0,Tr=0, Te=0,Tr=0, Te=0,Tr=0 Te=0,Tr=0 Te=0,Tr=0 Te=0,Tr=0 --- ts ts ts ts ts ts E Te=0,Tr=0, Te=0,Tr=0, Te=0,Tr=0 Te=0,Tr=0 Te=0,Tr=0 Te=0,Tr=0 --- ts ts ts ts ts ts X --- --- --- --- --- --- --- D --- --- --- --- --- --- --- U --- Te=0,Tr=0 Te=0,Tr=0 --- --- --- --- ts ts LEGEND: U - allocation, A - ACTIVE, I - INACTIVE, O - OFF, E - ERROR, X - EXIT, D - DEAD, Te=0 - event->total_time_enabled = 0 Te+ - event->total_time_enabled += delta Tr=0 - event->total_time_running = 0 Tr+ - event->total_time_running += delta ts - event->tstamp = perf_event_time(event) static void perf_event_change_state(struct perf_event *event, enum perf_event_state state) { u64 delta = 0; u64 now = perf_event_time(event); delta = now - event->tstamp; event->tstamp = now; switch(event->state) { case A: switch(state) { case A: ... break; case I: event->total_time_enabled += delta; event->total_time_running += delta; event->state = state; break; case O: ... break; case E: ... ... case I: ... break; ... } } --- Regards, Alexey