Subject: Re: [PATCH] perf_events: improve Intel event scheduling
From: Peter Zijlstra
To: Stephane Eranian
Cc: eranian@gmail.com, linux-kernel@vger.kernel.org, mingo@elte.hu,
    paulus@samba.org, perfmon2-devel@lists.sf.net, "David S. Miller"
References: <1255964630-5878-1-git-send-email-eranian@gmail.com>
	 <1258561957.3918.661.camel@laptop>
	 <7c86c4470912110300n44650d98ke52ec56cf4d925c1@mail.gmail.com>
	 <7c86c4470912110359i5a4416c2t9075eaa47d25865a@mail.gmail.com>
	 <1261410040.4314.178.camel@laptop>
	 <1261423869.4314.198.camel@laptop>
Date: Mon, 21 Dec 2009 22:18:52 +0100
Message-ID: <1261430332.4314.216.camel@laptop>

On Mon, 2009-12-21 at 21:59 +0100, Stephane Eranian wrote:
> On Mon, Dec 21, 2009 at 8:31 PM, Peter Zijlstra wrote:
> > On Mon, 2009-12-21 at 20:00 +0100, Stephane Eranian wrote:
> >
> >> > perf_disable() <-- shut down the full pmu
> >> >
> >> > pmu->disable() <-- hey someone got removed (easy, free the reg)
> >> > pmu->enable() <-- hey someone got added (harder, check constraints)
> >> >
> >> > hw_perf_group_sched_in() <-- hey a full group got added
> >> >                              (better than multiple ->enable)
> >> >
> >> > perf_enable() <-- re-enable pmu
> >> >
> >> >
> >> > So ->disable() is used to track freeing, ->enable is used to add
> >> > individual counters, check constraints etc.
> >> >
> >> > hw_perf_group_sched_in() is used to optimize the full group enable.
> >>
> >> Does that mean that after a disable() I can assume that there won't
> >> be an enable() without a group_sched_in()?
> >>
> >> I suspect not. In fact, there is a counter-example in
> >> perf_ctx_adjust_freq().
> >
> > Why would that be a problem?
> >
> > A perf_disable() will simply disable the pmu, but not alter its
> > configuration; perf_enable() will program the pmu and re-enable it
> > (with the obvious optimization that if the current state and the new
> > state match we can skip the actual programming).
> >
> > If a ->disable() was observed in between, it will simply not re-enable
> > that particular counter, that is, it will remove that counter's config,
> > and perf_enable() can do a smart reprogram by simply not re-enabling
> > that counter.
> >
> > If an ->enable() or hw_perf_group_sched_in() was observed in between,
> > you have to recompute the full state in order to validate the
> > availability; if that fails, no state change, if it succeeds you have
> > a new pmu state and perf_enable() will program that.
>
> Ok, so what you are suggesting is that the assignment is actually done
> incrementally in ->enable(). hw_perf_group_sched_in() would simply
> validate that a single group is sane (i.e., can be scheduled if it was
> alone).

hw_perf_group_sched_in() can be used to do a whole group at once (see
how a !0 return value will short-circuit the whole incremental group
buildup), but yes, ->enable() is incremental.
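To make that concrete, something like the below is what I mean. This is
a sketch only, not the actual arch/x86 code: x86_find_counter() is made
up, and the cpu_hw_events layout and the hw_perf_group_sched_in()
return convention are approximated from memory:

	/* ->disable(): easy, just free the reg */
	static void x86_pmu_disable(struct perf_event *event)
	{
		struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);

		__clear_bit(event->hw.idx, cpuc->used_mask);
	}

	/* ->enable(): incrementally add one counter, checking its
	 * constraints against everything already scheduled here */
	static int x86_pmu_enable(struct perf_event *event)
	{
		struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
		int idx;

		idx = x86_find_counter(cpuc, event);	/* constraint check */
		if (idx < 0)
			return -EAGAIN;		/* no fit, no state change */

		__set_bit(idx, cpuc->used_mask);	/* claim the reg */
		event->hw.idx = idx;

		/* actual MSR programming is deferred to perf_enable() */
		return 0;
	}

	/* whole group at once; a !0 return short-circuits the
	 * incremental buildup in the core code */
	int hw_perf_group_sched_in(struct perf_event *leader,
				   struct perf_cpu_context *cpuctx,
				   struct perf_event_context *ctx, int cpu)
	{
		struct perf_event *sub, *fail;

		if (x86_pmu_enable(leader))
			return -EAGAIN;

		list_for_each_entry(sub, &leader->sibling_list, group_entry) {
			if (x86_pmu_enable(sub))
				goto undo;
		}
		return 1;	/* group fully scheduled */

	undo:
		/* free everything claimed so far; nothing changed */
		x86_pmu_disable(leader);
		list_for_each_entry(fail, &leader->sibling_list, group_entry) {
			if (fail == sub)
				break;
			x86_pmu_disable(fail);
		}
		return -EAGAIN;
	}

Note the failure paths never leave partial state behind: either the
whole group is accepted, or every claimed reg is freed again.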
> > In the case of perf_ctx_adjust_freq():
> >
> >	if (!interrupts) {
> >		perf_disable();
> >		event->pmu->disable(event);
> >		atomic64_set(&hwc->period_left, 0);
> >		event->pmu->enable(event);
> >		perf_enable();
> >	}
> >
> > You'll very likely end up with the same state you had before, but if
> > not, who cares -- it's supposed to be an unlikely case.
> >
> > That is, the ->disable() will clear the cpu_hw_events::used_mask bit.
> > The ->enable() will want to place the counter on its last known reg,
> > which (not so very surprisingly) will be available, hence it will
>
> Not in the case I am looking at on AMD. The counter you need to grab
> depends on what is going on on the other cores on the socket. So
> there is no guarantee that you will get the same one. Something
> similar exists on Nehalem, but it has to do with HT.

Well, the above code is assumed correct, so we need to make it true,
which should not be too hard to do.

If there are cross-cpu constraints (AMD has per-node constraints IIRC),
then we need a structure that describes these (a per-node structure for
AMD, a per-core structure for HT); if we take a lock on this structure
the first time we touch it in a perf_disable() region, and release it
on perf_enable(), then we ensure those constraints remain invariant.

With something like that in place the above code will still be valid,
since the act of removing the counter will freeze all the state needed
to re-instate it.

Does that make sense?
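Something along these lines, perhaps -- again just a sketch, completely
untested, and all the names (shared_constraints, cpuc->shared,
cpuc->shared_locked) are made up:

	/* one instance per node (AMD) or per physical core (HT),
	 * describing the cross-cpu counter resources */
	struct shared_constraints {
		spinlock_t		 lock;
		struct perf_event	*owner[X86_PMC_IDX_MAX];
	};

	/* first touch inside a perf_disable() region takes the lock */
	static struct shared_constraints *
	get_shared_constraints(struct cpu_hw_events *cpuc)
	{
		struct shared_constraints *sc = cpuc->shared; /* set at cpu init */

		if (!cpuc->shared_locked) {
			spin_lock(&sc->lock);
			cpuc->shared_locked = 1;
		}
		return sc;
	}

	/* hw_perf_enable(), called from perf_enable(), drops it again */
	static void put_shared_constraints(struct cpu_hw_events *cpuc)
	{
		if (cpuc->shared_locked) {
			cpuc->shared_locked = 0;
			spin_unlock(&cpuc->shared->lock);
		}
	}

Every ->enable()/->disable()/hw_perf_group_sched_in() that needs the
cross-cpu state goes through get_shared_constraints(), so the whole
perf_disable()..perf_enable() region sees a stable view of it, and
re-instating the counter in the perf_ctx_adjust_freq() case above
cannot fail.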