Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Peter Zijlstra
To: Corey Ashford
Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo,
    Masami Hiramatsu
Date: Thu, 28 Jan 2010 11:57:24 +0100
Message-ID: <1264676244.4283.2093.camel@laptop>
In-Reply-To: <4B60990C.1030804@linux.vnet.ibm.com>
References: <4B560ACD.4040206@linux.vnet.ibm.com> <1263994448.4283.1052.camel@laptop>
    <1264023204.4283.1124.camel@laptop> <4B57907E.5000207@linux.vnet.ibm.com>
    <20100121072118.GA10585@elte.hu> <4B58A750.2060607@linux.vnet.ibm.com>
    <4B58AAF7.60507@linux.vnet.ibm.com> <20100127102834.GA27357@elte.hu>
    <4B60990C.1030804@linux.vnet.ibm.com>

On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> On 1/27/2010 2:28 AM, Ingo Molnar wrote:
> >
> > * Corey Ashford wrote:
> >
> >> On 1/21/2010 11:13 AM, Corey Ashford wrote:
> >>>
> >>> On 1/20/2010 11:21 PM, Ingo Molnar wrote:
> >>>>
> >>>> * Corey Ashford wrote:
> >>>>
> >>>>> I really think we need some sort of data structure which is passed
> >>>>> from the kernel to user space to represent the topology of the
> >>>>> system, and give useful information to be able to identify each PMU
> >>>>> node. Whether this is done with a sysfs-style tree, a table in a
> >>>>> file, XML, etc... it doesn't really matter much, but it needs to be
> >>>>> something that can be parsed relatively easily and *contains just
> >>>>> enough information* for the user to be able to correctly choose
> >>>>> PMUs, and for the kernel to be able to relate that back to actual
> >>>>> PMU hardware.
> >>>>
> >>>> The right way would be to extend the current event description under
> >>>> /debug/tracing/events with hardware descriptors and (maybe) to
> >>>> formalise this into a separate /proc/events/ or into a separate
> >>>> filesystem.
> >>>>
> >>>> The advantage of this is that in the grand scheme of things we
> >>>> _really_ don't want to limit performance events to 'hardware'
> >>>> hierarchies, or to devices/sysfs, some existing /proc scheme, or any
> >>>> other arbitrary (and fundamentally limiting) object enumeration.
> >>>>
> >>>> We want a unified, logical enumeration of all events and objects
> >>>> that we care about from a performance monitoring and analysis point
> >>>> of view, shaped for the purpose of and parsed by perf user-space.
> >>>> And since the current event descriptors are already rather rich as
> >>>> they enumerate all sorts of things:
> >>>>
> >>>>  - tracepoints
> >>>>  - hw-breakpoints
> >>>>  - dynamic probes
> >>>>
> >>>> etc., and are well used by tooling, we should expand those with real
> >>>> hardware structure.
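(For context, this is roughly how perf user-space consumes those
descriptors today: each tracepoint directory carries an "id" file, and
that id goes straight into perf_event_open() as a PERF_TYPE_TRACEPOINT
config. A minimal sketch, assuming debugfs is mounted at /debug and that
the headers define __NR_perf_event_open; the helper name is made up and
most error handling is skipped:)

/*
 * Sketch only: turn a tracepoint descriptor into a counter.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_tracepoint(const char *subsys, const char *event)
{
        char path[256];
        unsigned long long id;
        struct perf_event_attr attr;
        FILE *f;
        int ret;

        snprintf(path, sizeof(path),
                 "/debug/tracing/events/%s/%s/id", subsys, event);
        f = fopen(path, "r");
        if (!f)
                return -1;
        ret = fscanf(f, "%llu", &id);
        fclose(f);
        if (ret != 1)
                return -1;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_TRACEPOINT;
        attr.size   = sizeof(attr);
        attr.config = id;               /* the id read from debugfs */

        /* count this tracepoint for the current task, on any cpu */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

Growing that tree with real hardware descriptors, as suggested above,
would leave this flow unchanged; only the enumeration gets richer.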
> >>>
> >>> This is an intriguing idea; I like the idea of generalizing all of
> >>> this info into one structure.
> >>>
> >>> So you think that this structure should contain event info as well?
> >>> If these structures are created by the kernel, I think that would
> >>> necessitate placing large event tables into the kernel, which is
> >>> something I think we'd prefer to avoid because of the amount of
> >>> memory it would take. Keep in mind that we need not only event names,
> >>> but event descriptions, encodings, attributes (e.g. unit masks),
> >>> attribute descriptions, etc. I suppose the kernel could read a file
> >>> from the file system, and then add this info to the tree, but that
> >>> just seems bad. Are there existing places in the kernel where it
> >>> reads a user space file to create a user space pseudo filesystem?
> >>>
> >>> I think keeping event naming in user space, and PMU naming in kernel
> >>> space, might be a better idea: the kernel exposes the available PMUs
> >>> to user space via some structure, and a user space library tries to
> >>> recognize the exposed PMUs and provide event lists and other needed
> >>> info. The perf tool would use this library to be able to list
> >>> available events to users.
> >>>
> >>
> >> Perhaps another way of handling this would be to have the kernel
> >> dynamically load a specific "PMU kernel module" once it has detected
> >> that it has a particular PMU in the hardware. The module would consist
> >> only of a data structure, and a simple API to access the event data.
> >> This way, only the PMUs that actually exist in the hardware would need
> >> to be loaded into memory, and perhaps then only temporarily (just long
> >> enough to create the pseudo fs nodes).
> >>
> >> Still, though, since it's a pseudo fs, all of that event data would be
> >> taking up kernel memory.
> >>
> >> Another model, perhaps, would be to actually write this data out to a
> >> real file system upon every boot up, so that it wouldn't need to be
> >> held in memory. That seems rather ugly and time consuming, though.
> >
> > I don't think memory consumption is a problem at all. The structure of
> > the monitored hardware/software state is information we _want_ the
> > kernel to provide, mainly because there's no unified repository for
> > user-space to get this info from.
> >
> > If someone doesn't want it on some ultra-embedded box then sure, a
> > .config switch can be provided to allow it to be turned off.
> >
> > 	Ingo
>
> Ok, just so that we quantify things a bit, let's say I have 20 different
> types of PMUs totalling 2000 different events, each of which has a name
> and text description, averaging 300 characters. Along with that, let's
> say there are 4 64-bit words of metadata per event describing encoding,
> which attributes apply to the event, and any other needed info. I don't
> know how much memory each pseudo fs node takes up. Let me guess and say
> 128 bytes for each event node (the amount taken for the PMU nodes would
> be negligible compared with the event nodes).
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>
> Let's assume that the correct event module can be loaded dynamically, so
> that we don't need to have all of the possible event sets for a
> particular arch kernel build.
>
> Any opinions on whether allocating this amount of kernel memory would be
> acceptable? It seems like a lot of kernel memory to me, but I come from
> an embedded systems background.
> Granted, most systems are going to use a fraction of that amount of
> memory (<100KB) due to having far fewer PMUs and therefore fewer
> distinct event types.
>
> There's at least one more dimension to this. Let's say I have 16 uncore
> PMUs, all of the same type, each of which has, for example, 8 events.
> As a very crude pseudo fs, let's say we have a structure like this:
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/                   (path name to here is the name of the pmu and event)
>             description           (file)
>             applicable_attributes (file)
>         event1/
>             description
>             applicable_attributes
>         event2/
>         ...
>         event7/
>         ...
>     uncore_pmu1/
>         event0/
>             description
>             applicable_attributes
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the topology of
the machine, which is already available in /sys someplace.

Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.
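Very roughly, and purely as a sketch of the interface that would give us
(PERF_TYPE_NODE and the two generic event ids below are made-up names,
nothing of the sort exists in the ABI yet), user-space could then do:

/*
 * Sketch only: hypothetical PERF_TYPE_NODE usage.  The PERF_TYPE_NODE
 * and PERF_COUNT_NODE_* values are placeholders, not real ABI.
 */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#define PERF_TYPE_NODE                  6   /* hypothetical new attr.type  */
#define PERF_COUNT_NODE_LOCAL_MEM       0   /* hypothetical generic events */
#define PERF_COUNT_NODE_REMOTE_MEM      1

static int open_node_counter(int cpu, unsigned long long config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_NODE;   /* kernel resolves cpu -> node -> uncore PMU */
        attr.size   = sizeof(attr);
        attr.config = config;           /* e.g. PERF_COUNT_NODE_REMOTE_MEM */

        /* system-wide counter bound to this cpu; the node is implied by the cpu */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

Whatever cpu->node mapping the tools need for aggregation is already
exported under /sys/devices/system/node/, so no new topology description
would be required for this.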