Date: Thu, 12 Feb 2015 16:58:56 +0100
From: Peter Zijlstra
To: Sukadev Bhattiprolu
Cc: mingo@kernel.org, Michael Ellerman, Anton Blanchard, Stephane Eranian,
	Jiri Olsa, Arnaldo Carvalho de Melo, linux-kernel@vger.kernel.org
Subject: Re: [RFC][PATCH] perf: Implement read_group() PMU operation
Message-ID: <20150212155856.GC21418@twins.programming.kicks-ass.net>
References: <20150206025915.GA31650@us.ibm.com>
In-Reply-To: <20150206025915.GA31650@us.ibm.com>

On Thu, Feb 05, 2015 at 06:59:15PM -0800, Sukadev Bhattiprolu wrote:
> From: Sukadev Bhattiprolu
> Date: Thu Feb 5 20:56:20 EST 2015 -0300
> Subject: [RFC][PATCH] perf: Implement read_group() PMU operation
>
> This is a lightly tested, exploratory patch to allow PMUs to return
> several counters at once. Appreciate any comments :-)
>
> Unlike normal hardware PMCs, the 24x7 counters[1] in Power8 are stored
> in memory and accessed via a hypervisor call (HCALL). A major aspect
> of the HCALL is that it allows retrieving _SEVERAL_ counters at once
> (unlike regular PMCs, which are read one at a time).
>
> This patch implements a ->read_group() PMU operation that tries to
> take advantage of this ability to read several counters at once. A
> PMU that implements the ->read_group() operation would allow users
> to retrieve several counters at once and get a more consistent
> snapshot.
>
> NOTE: This patch has a TODO in h_24x7_event_read_group() in that it
>       still does multiple HCALLs. I think that can be optimized
>       independently, once the pmu->read_group() interface itself is
>       finalized.
>
> Appreciate comments on the ->read_group interface and on how best to
> manage the interfaces between the core and PMU layers - e.g. is it OK
> for the hv-24x7 PMU to walk the ->sibling_list?
>
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -3549,10 +3549,43 @@ static int perf_event_read_group(struct perf_event *event,

You also want perf_output_read_group().

> 	struct perf_event *leader = event->group_leader, *sub;
> 	int n = 0, size = 0, ret = -EFAULT;
> 	struct perf_event_context *ctx = leader->ctx;
> +	u64 *valuesp;
> 	u64 values[5];
> +	int use_group_read;
> 	u64 count, enabled, running;
> +	struct pmu *pmu = event->pmu;
> +
> +	/*
> +	 * If PMU supports group read and group read is requested,
> +	 * allocate memory before taking the mutex.
> +	 */
> +	use_group_read = 0;
> +	if ((read_format & PERF_FORMAT_GROUP) && pmu->read_group) {
> +		use_group_read++;
> +	}
> +
> +	if (use_group_read) {
> +		valuesp = kzalloc(leader->nr_siblings * sizeof(u64), GFP_KERNEL);
> +		if (!valuesp)
> +			return -ENOMEM;
> +	}

This seems 'sad'; the hardware already knows how many it can maximally
use at once and can preallocate, right?
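Something like the sketch below is what I mean; this is purely
illustrative, not from the patch. The names H_24X7_MAX_COUNTERS,
h_24x7_read_buf and hv_24x7_pmu_init() are made up, and h_24x7_pmu
stands for the driver's existing struct pmu. The point is only that the
driver can size and allocate the buffer once, at registration time,
instead of the core doing a kzalloc() on every read:

	/*
	 * Hypothetical driver-side preallocation, for illustration only.
	 * Assumes the driver knows its maximum number of simultaneously
	 * readable counters up front.
	 */
	#include <linux/slab.h>
	#include <linux/perf_event.h>

	#define H_24X7_MAX_COUNTERS	256	/* assumed hardware limit */

	static u64 *h_24x7_read_buf;		/* reused for every group read */

	static int hv_24x7_pmu_init(void)
	{
		h_24x7_read_buf = kcalloc(H_24X7_MAX_COUNTERS,
					  sizeof(*h_24x7_read_buf), GFP_KERNEL);
		if (!h_24x7_read_buf)
			return -ENOMEM;

		return perf_pmu_register(&h_24x7_pmu, "hv_24x7", -1);
	}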
>
> 	mutex_lock(&ctx->mutex);
> +
> +	if (use_group_read) {
> +		ret = pmu->read_group(leader, valuesp, leader->nr_siblings);
> +		if (ret >= 0) {
> +			size = ret * sizeof(u64);
> +
> +			ret = size;
> +			if (copy_to_user(buf, valuesp, size))
> +				ret = -EFAULT;
> +		}
> +
> +		kfree(valuesp);
> +		goto unlock;
> +	}
> +
> 	count = perf_event_read_value(leader, &enabled, &running);
>
> 	values[n++] = 1 + leader->nr_siblings;

Since ->read() has a void return value, we can delay its effect, so I'm
currently thinking we might want to extend the transaction interface for
this; give pmu::start_txn() a flags argument to indicate scheduling (add)
or reading (read).

So we'd end up with something like:

	pmu->start_txn(pmu, PMU_TXN_READ);

	leader->read();
	for_each_sibling()
		sibling->read();

	pmu->commit_txn();

after which we can use the values updated by the read calls.

The trivial no-support implementation lets read do its immediate thing
like it does now. A more complex driver can then collect the actual
counter values and execute one hypercall using its pre-allocated memory.

So no allocations in the core code, and no sibling iterations in the
driver code.

Would that work for you?
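Concretely, the driver side could look something like the sketch below.
PMU_TXN_ADD/PMU_TXN_READ are the flags I'm proposing above, but
everything else here is made up for illustration: the h_24x7_* names,
the txn structure, the array size, and the single_24x7_request() /
bulk_24x7_request() helpers are not actual hv-24x7 code.

	/* Hypothetical hv-24x7 side of the proposed PMU_TXN_READ transaction. */
	#include <linux/perf_event.h>
	#include <asm/local64.h>

	#define PMU_TXN_ADD		0x1	/* txn to add/schedule events */
	#define PMU_TXN_READ		0x2	/* txn to read event counts */

	#define H_24X7_MAX_COUNTERS	256	/* assumed hardware limit */

	/* hypothetical wrappers around the 24x7 HCALL */
	void single_24x7_request(struct perf_event *event, u64 *value);
	void bulk_24x7_request(struct perf_event **events, unsigned int nr,
			       u64 *values);

	struct h_24x7_txn {
		int			flags;
		unsigned int		nr;	/* events queued so far */
		struct perf_event	*events[H_24X7_MAX_COUNTERS];
		u64			values[H_24X7_MAX_COUNTERS];	/* pre-allocated */
	};

	static struct h_24x7_txn txn;		/* per-cpu in real life */

	static void h_24x7_start_txn(struct pmu *pmu, int flags)
	{
		txn.flags = flags;
		txn.nr = 0;
	}

	static void h_24x7_event_read(struct perf_event *event)
	{
		u64 val;

		if (txn.flags & PMU_TXN_READ) {
			/* defer: just queue the event, no HCALL yet */
			txn.events[txn.nr++] = event;
			return;
		}

		/* no read transaction: one HCALL for this single counter */
		single_24x7_request(event, &val);
		local64_set(&event->count, val);
	}

	static int h_24x7_commit_txn(struct pmu *pmu)
	{
		unsigned int i;

		if (txn.flags & PMU_TXN_READ) {
			/* one HCALL for all queued counters */
			bulk_24x7_request(txn.events, txn.nr, txn.values);

			for (i = 0; i < txn.nr; i++)
				local64_set(&txn.events[i]->count,
					    txn.values[i]);
		}

		return 0;
	}

The core would then only bracket the leader->read()/sibling->read()
loop with start_txn(PMU_TXN_READ)/commit_txn() and read the updated
event counts afterwards, so both the allocation and the sibling walk
stay out of the generic code.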