Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units
From: Peter Zijlstra
To: Corey Ashford
Cc: Ingo Molnar, LKML, Andi Kleen, Paul Mackerras, Stephane Eranian,
    Frederic Weisbecker, Xiao Guangrong, Dan Terpstra, Philip Mucci,
    Maynard Johnson, Carl Love, Steven Rostedt, Arnaldo Carvalho de Melo,
    Masami Hiramatsu
Date: Thu, 28 Jan 2010 11:57:24 +0100
Message-ID: <1264676244.4283.2093.camel@laptop>
In-Reply-To: <4B60990C.1030804@linux.vnet.ibm.com>
References: <4B560ACD.4040206@linux.vnet.ibm.com> <1263994448.4283.1052.camel@laptop>
    <1264023204.4283.1124.camel@laptop> <4B57907E.5000207@linux.vnet.ibm.com>
    <20100121072118.GA10585@elte.hu> <4B58A750.2060607@linux.vnet.ibm.com>
    <4B58AAF7.60507@linux.vnet.ibm.com> <20100127102834.GA27357@elte.hu>
    <4B60990C.1030804@linux.vnet.ibm.com>

On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> On 1/27/2010 2:28 AM, Ingo Molnar wrote:
> >
> > * Corey Ashford wrote:
> >
> >> On 1/21/2010 11:13 AM, Corey Ashford wrote:
> >>>
> >>> On 1/20/2010 11:21 PM, Ingo Molnar wrote:
> >>>>
> >>>> * Corey Ashford wrote:
> >>>>
> >>>>> I really think we need some sort of data structure which is passed
> >>>>> from the kernel to user space to represent the topology of the
> >>>>> system, and give useful information to be able to identify each PMU
> >>>>> node. Whether this is done with a sysfs-style tree, a table in a
> >>>>> file, XML, etc... it doesn't really matter much, but it needs to be
> >>>>> something that can be parsed relatively easily and *contains just
> >>>>> enough information* for the user to be able to correctly choose
> >>>>> PMUs, and for the kernel to be able to relate that back to actual
> >>>>> PMU hardware.
> >>>>
> >>>> The right way would be to extend the current event description under
> >>>> /debug/tracing/events with hardware descriptors and (maybe) to
> >>>> formalise this into a separate /proc/events/ or into a separate
> >>>> filesystem.
> >>>>
> >>>> The advantage of this is that in the grand scheme of things we
> >>>> _really_ don't want to limit performance events to 'hardware'
> >>>> hierarchies, or to devices/sysfs, some existing /proc scheme, or any
> >>>> other arbitrary (and fundamentally limiting) object enumeration.
> >>>>
> >>>> We want a unified, logical enumeration of all events and objects
> >>>> that we care about from a performance monitoring and analysis point
> >>>> of view, shaped for the purpose of and parsed by perf user-space.
> >>>> And since the current event descriptors are already rather rich as
> >>>> they enumerate all sorts of things:
> >>>>
> >>>>  - tracepoints
> >>>>  - hw-breakpoints
> >>>>  - dynamic probes
> >>>>
> >>>> etc., and are well used by tooling, we should expand those with real
> >>>> hardware structure.
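(For context, this is roughly how perf user-space consumes those
descriptors today: each tracepoint directory carries an "id" file, and
that id goes straight into perf_event_open() as a PERF_TYPE_TRACEPOINT
config. A minimal sketch, assuming debugfs is mounted at /debug and that
the headers define __NR_perf_event_open; the helper name is made up and
most error handling is skipped:)

/*
 * Sketch only: turn a tracepoint descriptor into a counter.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_tracepoint(const char *subsys, const char *event)
{
        char path[256];
        unsigned long long id;
        struct perf_event_attr attr;
        FILE *f;
        int ret;

        snprintf(path, sizeof(path),
                 "/debug/tracing/events/%s/%s/id", subsys, event);
        f = fopen(path, "r");
        if (!f)
                return -1;
        ret = fscanf(f, "%llu", &id);
        fclose(f);
        if (ret != 1)
                return -1;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_TRACEPOINT;
        attr.size   = sizeof(attr);
        attr.config = id;               /* the id read from debugfs */

        /* count this tracepoint for the current task, on any cpu */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

Growing that tree with real hardware descriptors, as suggested above,
would leave this flow unchanged; only the enumeration gets richer.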
> >>>
> >>> This is an intriguing idea; I like the idea of generalizing all of
> >>> this info into one structure.
> >>>
> >>> So you think that this structure should contain event info as well?
> >>> If these structures are created by the kernel, I think that would
> >>> necessitate placing large event tables into the kernel, which is
> >>> something I think we'd prefer to avoid because of the amount of
> >>> memory it would take. Keep in mind that we need not only event names,
> >>> but event descriptions, encodings, attributes (e.g. unit masks),
> >>> attribute descriptions, etc. I suppose the kernel could read a file
> >>> from the file system, and then add this info to the tree, but that
> >>> just seems bad. Are there existing places in the kernel where it
> >>> reads a user space file to create a user space pseudo filesystem?
> >>>
> >>> I think keeping event naming in user space, and PMU naming in kernel
> >>> space, might be a better idea: the kernel exposes the available PMUs
> >>> to user space via some structure, and a user space library tries to
> >>> recognize the exposed PMUs and provide event lists and other needed
> >>> info. The perf tool would use this library to be able to list
> >>> available events to users.
> >>>
> >>
> >> Perhaps another way of handling this would be to have the kernel
> >> dynamically load a specific "PMU kernel module" once it has detected
> >> that it has a particular PMU in the hardware. The module would consist
> >> only of a data structure, and a simple API to access the event data.
> >> This way, only the PMUs that actually exist in the hardware would need
> >> to be loaded into memory, and perhaps then only temporarily (just long
> >> enough to create the pseudo fs nodes).
> >>
> >> Still, though, since it's a pseudo fs, all of that event data would be
> >> taking up kernel memory.
> >>
> >> Another model, perhaps, would be to actually write this data out to a
> >> real file system upon every boot up, so that it wouldn't need to be
> >> held in memory. That seems rather ugly and time consuming, though.
> >
> > I don't think memory consumption is a problem at all. The structure of
> > the monitored hardware/software state is information we _want_ the
> > kernel to provide, mainly because there's no unified repository for
> > user-space to get this info from.
> >
> > If someone doesn't want it on some ultra-embedded box then sure, a
> > .config switch can be provided to allow it to be turned off.
> >
> > 	Ingo
>
> Ok, just so that we quantify things a bit, let's say I have 20 different
> types of PMUs totalling 2000 different events, each of which has a name
> and text description, averaging 300 characters. Along with that, let's
> say there are 4 64-bit words of metadata per event describing encoding,
> which attributes apply to the event, and any other needed info. I don't
> know how much memory each pseudo fs node takes up. Let me guess and say
> 128 bytes for each event node (the amount taken for the PMU nodes would
> be negligible compared with the event nodes).
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>
> Let's assume that the correct event module can be loaded dynamically, so
> that we don't need to have all of the possible event sets for a
> particular arch kernel build.
>
> Any opinions on whether allocating this amount of kernel memory would be
> acceptable? It seems like a lot of kernel memory to me, but I come from
> an embedded systems background.
> Granted, most systems are going to use a fraction of that amount of
> memory (<100KB) due to having far fewer PMUs and therefore fewer
> distinct event types.
>
> There's at least one more dimension to this. Let's say I have 16 uncore
> PMUs, all of the same type, each of which has, for example, 8 events.
> As a very crude pseudo fs, let's say we have a structure like this:
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/                   (path name to here is the name of the pmu and event)
>             description           (file)
>             applicable_attributes (file)
>         event1/
>             description
>             applicable_attributes
>         event2/
>         ...
>         event7/
>         ...
>     uncore_pmu1/
>         event0/
>             description
>             applicable_attributes
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the topology of
the machine, which is already available in /sys someplace.

Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.
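Very roughly, and purely as a sketch of the interface that would give us
(PERF_TYPE_NODE and the two generic event ids below are made-up names,
nothing of the sort exists in the ABI yet), user-space could then do:

/*
 * Sketch only: hypothetical PERF_TYPE_NODE usage.  The PERF_TYPE_NODE
 * and PERF_COUNT_NODE_* values are placeholders, not real ABI.
 */
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

#define PERF_TYPE_NODE                  6   /* hypothetical new attr.type  */
#define PERF_COUNT_NODE_LOCAL_MEM       0   /* hypothetical generic events */
#define PERF_COUNT_NODE_REMOTE_MEM      1

static int open_node_counter(int cpu, unsigned long long config)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_NODE;   /* kernel resolves cpu -> node -> uncore PMU */
        attr.size   = sizeof(attr);
        attr.config = config;           /* e.g. PERF_COUNT_NODE_REMOTE_MEM */

        /* system-wide counter bound to this cpu; the node is implied by the cpu */
        return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
}

Whatever cpu->node mapping the tools need for aggregation is already
exported under /sys/devices/system/node/, so no new topology description
would be required for this.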