From: Robert Bragg
Date: Thu, 4 Jun 2015 19:53:19 +0100
Subject: Re: [Intel-gfx] [RFC PATCH 00/11] drm/i915: Expose OA metrics via perf PMU
To: Peter Zijlstra
Cc: intel-gfx@lists.freedesktop.org, Daniel Vetter, Jani Nikula,
 David Airlie, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo,
 linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org,
 linux-api@vger.kernel.org
In-Reply-To: <20150527153914.GC3644@twins.programming.kicks-ass.net>

On Wed, May 27, 2015 at 4:39 PM, Peter Zijlstra wrote:
> On Thu, May 21, 2015 at 12:17:48AM +0100, Robert Bragg wrote:
>> >
>> > So for me the 'natural' way to represent this in perf would be through
>> > event groups. Create a perf event for every single event -- yes this is
>> > 53 events.
>>
>> So when I was first looking at this work I had considered the
>> possibility of separate events, and these are some of the things that
>> in the end made me forward the hardware's raw report data via a single
>> event instead...
>>
>> There are 100s of possible B counters depending on the MUX
>> configuration + fixed-function logic in addition to the A counters.
>
> There's only 8 B counters, what there is 100s of is configurations, and
> that's no different from any other PMU. There's 100s of events to select
> on the CPU side as well - yet only 4 (general purpose) counters (8 if
> you disable HT on recent machines) per CPU.

To clarify one thing here: in the PRM you'll see references to
'Reserved' slots in the report layouts, and these effectively
correspond to a further 8 configurable counters. These get referred to
as 'C' counters, for 'custom', and like the B counters they are defined
by the MUX configuration, except they skip the boolean logic pipeline.

Thanks for the comparison with the CPU. My main thought here is to
beware of considering these configurable counters 'general purpose' if
that might also imply orthogonality. All of our configurable counters
build on top of a common, shared MUX configuration, so I'd tend to
think of them more as a 'general purpose counter set'.

> Of course, you could create more than 8 events at any one time, and then
> perf will round-robin the configuration for you.

This hardware really isn't designed to allow round-robin
reconfiguration. If we change the metric set then we lose the aggregate
values of our B counters. Besides writing the large MUX config +
boolean logic state, we would also need to save the current aggregated
counter values (only maintained by the OA unit) so they could be
restored later, as well as internal state that's simply not accessible
to us.
If we reconfigure to try and round-robin between sets of counters, each
reconfiguration will trash state that we have no way of restoring.
Since there's no going back once we reconfigure, the counters have to
no longer be in use before we can switch.

>> A choice would need to be made about whether to expose events for the
>> configurable counters that aren't inherently associated with any
>> semantics, or instead defining events for counters with specific
>> semantics (with 100s of possible counters to define). The latter would
>> seem more complex for userspace and the kernel if they both now have
>> to understand the constraints on what counters can be used together.
>
> Again, no different from existing PMUs. The most 'interesting' part of
> any PMU driver is event scheduling.
>
> The explicit design of perf was to put event scheduling _in_ the kernel
> and not allow userspace direct access to the hardware (as previous PMU
> models did).

I'm thinking your 'event scheduling' here refers to the kernel being
able to handle multiple concurrent users of the same pmu, but then we
may be talking at cross purposes... I was describing how (if we exposed
events for counters with well defined semantics - as opposed to just 16
events for the B/C counters) the driver would need to do some kind of
matching based on the events added to a group to deduce which MUX
configuration should be used, and saying that in this case userspace
would need to be very aware of which events belong to a specific MUX
configuration.

I don't think we have the hardware flexibility to support scheduling in
terms of multiple concurrent users, unless they happen to want exactly
the same configuration. I also don't think we have a use case for
accessing gpu metrics where concurrent access is really important. It's
possible to think of cases where it might be nice, except they are
unlikely to involve the same configuration, so given the current OA
unit design I haven't had plans to try and support concurrent access
and haven't had any requests for it so far.

>> I guess with either approach we would also need to have some form of
>> dedicated group leader event accepting attributes for configuring the
>> state that affects the group as a whole, such as the counter
>> configuration (3D vs GPGPU vs media etc).
>
> That depends a bit on how flexible you want to switch between these
> modes; with the cost being in the 100s of MMIO register writes, it might
> just not make sense to make this too dynamic.
>
> An option might be to select the mode through the PMU driver's sysfs
> files. Another option might be to expose the B counters 3 times and have
> each set be mutually exclusive. That is, as soon as you've created a 3D
> event, you can no longer create GPGPU/Media events.

Hmm, interesting idea to have global/pmu state be controlled via
sysfs... I suppose the security model doesn't seem as clear, with sysfs
access gated by file permissions while we want to gate some control
based on whether you have access to a gpu context. There's also a risk
that with multiple users each setting a global mode via sysfs before
opening their group of events, someone's events could silently end up
configured with the wrong mode. Maybe the mode could become immutable
once one event is opened, so you could at least double check
afterwards.
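
To make that concrete, I imagine the kernel side of the
'immutable while in use' idea looking roughly like the sketch below.
This is only an illustration; the i915_oa_pmu struct, the 'metric_set'
attribute name and the n_active_events bookkeeping are invented for the
example rather than taken from the current driver:

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/spinlock.h>

/*
 * Hypothetical sketch only: a sysfs metric-set selector on the pmu
 * device that refuses to switch while any events are open.
 */
static ssize_t metric_set_store(struct device *dev,
				struct device_attribute *attr,
				const char *buf, size_t count)
{
	struct i915_oa_pmu *oa_pmu = dev_get_drvdata(dev);
	ssize_t ret;
	int set;

	ret = kstrtoint(buf, 0, &set);
	if (ret)
		return ret;

	spin_lock(&oa_pmu->lock);
	if (oa_pmu->n_active_events) {
		/* Don't reprogram the MUX under an open event's feet. */
		ret = -EBUSY;
	} else {
		oa_pmu->metric_set = set;
		ret = count;
	}
	spin_unlock(&oa_pmu->lock);

	return ret;
}
static DEVICE_ATTR_WO(metric_set);

Even with something like that, two processes could still race between
writing the mode and opening their first event, which is why I'd want
userspace to be able to read the selected mode back and double check.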

Besides the mode/MUX config choice, other state that is shared by the
group includes:

- single context vs system wide profiling
- the timer exponent

Using sysfs I think we'd have to be careful not to expose configuration
options that relate to the security policy if there's a risk of a
non-privileged process affecting it.

>> I'm not sure where we would
>> handle the context-id + drm file descriptor attributes for initiating
>> single context profiling but guess we'd need to authenticate each
>> individual event open.
>
> Event open is a slow path, so that should not be a problem, right?

Yeah, I don't think I'd be concerned about duplicating the checks
per-event from a performance pov. I'm more concerned about
differentiating security checks that relate to the final configuration
as a whole vs individual events. In this case we use a drm file
descriptor + context handle to allow profiling a single gpu context
that the current process has access to, but the choice of profiling a
single context or across all contexts affects the final hardware
configuration too, which relates to the group. Even if a process has
the privileges to additionally open an event for system wide metrics,
this choice is part of the OA unit configuration and incompatible with
single-context filtering.

For reference, on Haswell the reports don't include a context
identifier, which makes it difficult to try and emulate a
single-context filtering event with a system wide hardware
configuration.

I can see that we'd be able to cross-check event compatibility for each
addition to the group, but I also can't help but see this as an example
of how tightly coupled the configuration of these counters is.

>> It's not clear if we'd configure the report
>> layout via the group leader, or try to automatically choose the most
>> compact format based on the group members. I'm not sure how pmus
>> currently handle the opening of enabled events on an enabled group but
>
> With the PERF_FORMAT_GROUP layout changing in-flight. I would recommend
> not doing that -- decoding the output will be 'interesting' but not
> impossible.

Right, I wouldn't want to have to handle such corner cases, so I think
I'd want to be able to report an error back to userspace if it attempts
to extend a group that's ever been activated.

>> I think there would need to be limitations in our case that new
>> members can't result in a reconfigure of the counters if that might
>> lose the current counter values known to userspace.
>
> I'm not entirely sure what you mean here.

This is referring to the issue of us not being able to explicitly save
and restore the state of the OA unit so as to allow reconfiguring the
counters while they are in use. The counter values for B/C counters are
only maintained by the OA unit, and in the case of B counters, which
support e.g. referencing delayed values, there is state held in the
unit that we have no way to explicitly access and save.

>> From a user's pov, there's no real freedom to mix and match which
>> counters are configured together, and there's only some limited
>> ability to ignore some of the currently selected counters by not
>> including them in reports.
>
> It is 'impossible' to create a group that is not programmable. That is,
> the pmu::event_init() call _should_ verify that the addition of the
> event to the group (as given by event->group_leader) is valid.
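
For illustration, the kind of cross-check that implies for us might
look roughly like the sketch below, where any event whose config maps
to a different MUX + boolean logic configuration than the group
leader's gets rejected. The i915_oa_event_init() and
i915_metric_set_for_config() names are invented for the sketch, not
taken from the RFC:

#include <linux/perf_event.h>

/*
 * Rough sketch (invented names, not the current driver) of group
 * validation in pmu::event_init(): reject any event whose config
 * implies a different MUX + boolean logic configuration than the one
 * already implied by the group leader.
 */
static int i915_oa_event_init(struct perf_event *event)
{
	struct perf_event *leader = event->group_leader;
	int metric_set = i915_metric_set_for_config(event->attr.config);

	if (metric_set < 0)
		return -EINVAL;

	/*
	 * We can't reprogram the MUX without losing the OA unit's
	 * internal state, so every group member must want the same
	 * metric set as the leader.
	 */
	if (leader != event &&
	    i915_metric_set_for_config(leader->attr.config) != metric_set)
		return -EINVAL;

	return 0;
}

That check is certainly doable; it just underlines that every event in
the group necessarily maps back to the same underlying configuration.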

>> Something to understand here is that we have to work with sets of
>> pre-defined MUX + fixed-function logic configurations that have been
>> validated to give useful metrics for specific use cases, such as
>> benchmarking 3D rendering, GPGPU or media workloads.
>
> This is fine; as stated above. Since these are limited pieces of
> 'firmware' which are expensive to load, you don't have to do a fully
> dynamic solution here.
>
>> As it is currently the kernel doesn't need to know anything about the
>> semantics of individual counters being selected, so it's currently
>> convenient that we can aim to maintain all the counter meta data we
>> have in userspace according to the changing needs of tools or drivers
>> (e.g. names, descriptions, units, max values, normalization
>> equations), de-coupled from the kernel, instead of splitting it
>> between the kernel and userspace.
>
> And that fully violates the design premise of perf. The kernel should be
> in control of resources, not userspace.

This isn't referring to programmable resources though; it's only
talking about where the higher-level meta data about these counters is
maintained, including names and descriptions detailing the specific
counter semantics, as well as descriptions of the equations/expressions
that should be used to normalize these counters to be useful to users.
This seems very comparable to other PMUs that expose _RAW,
hardware-specific data via samples, which rely on a userspace library
like libpfm4 to understand.

I see that different tools have slightly different needs and may want
to normalize some counters differently. This data is still evolving
over time, and overall I feel it's going to be more
practical/maintainable for us to try and keep this semantic data
consolidated in userspace.

For reference, what I'm really referring to here is an established XML
description of OA counters maintained and validated by engineers
closely involved with the hardware. I think it makes sense for us to
factor into this trade-off the benefit we get from being able to easily
leverage this data, but considering its relative lack of stability
makes me prefer to limit ourselves to just putting the MUX and boolean
logic configurations in the kernel.

> If we'd have put userspace in charge, we could now not profile the same
> task by two (or more) different observers. But with kernel side counter
> management that is no problem at all.

Sorry, I think my comment must have come across wrong, because I wasn't
talking about a choice that affects who's responsible for configuring
the counters or whether we support multiple observers. I was talking
about a trade-off over where we maintain the higher level information
about the counters that can be exposed, especially the code for
normalizing individual counters.

>> A benefit of being able to change the report size is to reduce memory
>> bandwidth usage that can skew measurements. It's possible to request
>> the gpu to write out periodic snapshots at a very high frequency (we
>> can program a period as low as 160 nanoseconds) and higher frequencies
>> can start to expose some interesting details about how the gpu is
>> utilized - though with notable observer effects too. How careful we
>> are to not waste bandwidth is expected to determine what sampling
>> resolutions we can achieve before significantly impacting what we are
>> measuring.
>>
>> Splitting the counters up looked like it could increase the bandwidth
>> we use quite a bit.
>> The main difference comes from requiring 64bit values instead of the
>> 32bit values in our raw reports. This can be offset partly since
>> there are quite a few 'reserved'/redundant A counters that don't need
>> forwarding. As an example in the most extreme case, instead of an
>> 8 byte perf_event_header + 4 byte raw_size + 256 byte reports +
>> 4 byte padding every 160ns ~= 1.5GB/s, we might have 33 A counters
>> (ignoring redundant ones) + 16 configurable counters = 400
>
> In that PDF there's only 8 configurable 'B' counters.

Sorry that it's another example of being incomplete. The current
document only says enough about B counters to show how to access the
gpu clock that is required to normalise many of the A counters. There
are also the 8 C counters I mentioned earlier, derived from the MUX
configuration but without the boolean logic.

>> byte struct read_format (using PERF_FORMAT_GROUP) + 8 byte
>> perf_event_header every 160ns ~= 2.4GB/s. On the other hand though we
>> could choose to forward only 2 or 3 counters of interest at these high
>> frequencies which isn't possible currently.
>
> Right; although you'd have to spend some cpu cycles on the format shift.
> Which might make it moot again. That said; I would really prefer we
> start out with trying to make the generic format stuff work before
> trying to come up with special case hacks.

>> > Use the MMIO reads for the regular read() interface, and use a hrtimer
>> > placing MI_REPORT_PERF_COUNT commands, with a counter select mask
>> > covering all the events in the current group, for sampling.
>>
>> Unfortunately due to the mmio limitations and the need to relate
>> counters I can't imagine many use cases for directly accessing the
>> counters individually via the read() interface.
>
> Fair enough.
>
>> MI_REPORT_PERF_COUNT commands are really only intended for collecting
>> reports in sync with a command stream. We are currently experimenting
>> with an extension of my PMU driver that emits MI_REPORT_PERF_COUNT
>> commands automatically around the batches of commands submitted by
>> userspace so we can do a better job of filtering metrics across many
>> gpu contexts, but for now the expectation is that the kernel shouldn't
>> be emitting MI_REPORT_PERF_COUNT commands. We emit
>> MI_REPORT_PERF_COUNT commands within Mesa for example to implement the
>> GL_INTEL_performance_query extension, at the start and end of a query
>> around a sequence of commands that the application is interested in
>> measuring.
>
> I will try and read up on the GL_INTEL_performance_query thing.

You can see my code for this here, in case that's helpful:
https://github.com/rib/mesa/tree/wip/rib/oa-hsw-4.0.0

You might also be interested in my more recent 'codegen' branch too
(work in progress though):
https://github.com/rib/mesa/tree/wip/rib/oa-hsw-codegen

Here you can see an example of the meta data I mentioned earlier, e.g.
describing the equations for normalizing the counters for one metric
set:
https://github.com/rib/mesa/blob/wip/rib/oa-hsw-codegen/src/mesa/drivers/dri/i965/brw_oa_hsw.xml

This set corresponds to the "3D" metric set enabled in the i915_oa pmu
driver.
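
Just to give a flavour of what those equations end up expressing (this
example is invented purely for illustration, not a real Haswell
equation), many of them involve scaling a raw counter against the gpu
clock read via a B/C counter:

#include <stdint.h>

/*
 * Purely illustrative, made-up example of the kind of normalization
 * the XML equations describe: scaling a raw A counter against the gpu
 * clock to get a percentage-style metric.
 */
static float normalize_to_percentage(uint64_t raw_a_counter,
                                     uint64_t gpu_clocks)
{
    if (!gpu_clocks)
        return 0.0f;

    return 100.0f * (float)raw_a_counter / (float)gpu_clocks;
}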

More example data can also be seen in my gputop tool, with 'render
basic', 'compute basic', 'compute extended', 'memory reads', 'memory
writes' and 'sampler balance' metric sets:
https://github.com/rib/gputop/blob/641d7644df64a871f36f3c283cfa18a2f1530813/gputop/oa-hsw.xml

>> > You can use the perf_event_attr::config to select the counter (A0-A44,
>> > B0-B7) and use perf_event_attr::config1 (low and high dword) for the
>> > corresponding CEC registers.
>>
>> Hopefully covered above, but since the fixed-function state is so
>> dependent on the MUX configuration I think it currently makes sense to
>> treat the MUX plus logic state (including the CEC state) as a tightly
>> coupled unit.
>
> Oh wait, the A counters are also affected by the MUX programming!
> That wasn't clear to me before this.

Er, actually you were right before: the raw A counters aren't directly
affected by the MUX configuration. In that paragraph 'fixed-function
state' was referring to the boolean logic state for the B counters, in
case that was unclear, and the CEC ('custom event counter') registers
are part of that boolean logic state.

That said though, the raw A counters aren't really /useful/ without the
MUX programming, because most of them are so closely related to the gpu
clock (which we access via a B or C counter) that we have to program
the MUX before we can normalize the A counters to be meaningful to
users.

> Same difference though; you could still do either the sysfs aided flips
> or the mutually exclusive counter types to deal with this; all still
> assuming reprogramming all that is 'expensive'.

>> The Flexible EU Counters for Broadwell+ could be more amenable to this
>> kind of independent configuration, as I don't believe they are
>> dependent on the MUX configuration.
>
> That sounds good :-)
>
>> One idea that's come up a lot though is having the possibility of
>> being able to configure an event with a full MUX + fixed-function
>> state description.
>
> Right; seeing how there's only a very limited number of these MUX
> programs validated (3D, GPGPU, Media, any others?) this should be
> doable. And as mentioned before, you could disable the other types once
> you create one in order to limit the reprogramming thing.

Here I meant being able to support userspace supplying a full MUX
configuration and boolean logic configuration (i.e. large arrays of
data), so that userspace tools can experiment with new configurations
during the early stages of building and validating them. This is mostly
interesting to the engineers responsible for defining and testing these
configurations in the first place, since it means they can share
experimental configurations with tools, with the caveat that you'd
likely need root privileges to use the interface. It's appealing to
have a low barrier for people to test new configurations without
needing to build a custom kernel. I'm not worried about supporting this
in the short term, but it's still a use case that could be helpful to
support eventually, and that I want to keep in mind at least.

About the number of different configurations, I've been using 3D, gpgpu
and media as summary examples, but there are quite a few configurations
really... On Haswell these are the names of some of the sets I'm aware
of:

  Render Metrics Basic Gen7.5
  Compute Metrics Basic Gen7.5
  Compute Metrics Extended Gen7.5
  Render Metrics Slice Balance Gen7.5
  Memory Reads Distribution Gen7.5
  Memory Writes Distribution Gen7.5
  Metric set SamplerBalance
  Memory Reads on Write Port Distribution Gen7.5
  Stencil PMA Hold Metrics Gen7.5
  Media Memory Reads Distribution Gen7.5
  Media Memory Writes Distribution Gen7.5
  Media VME Pipe Gen7.5

For Broadwell we have more; for example...

  Render Metrics Basic Gen8
  Compute Metrics Basic Gen8
  Render Metrics for 3D Pipeline Profile
  Memory Reads Distribution Gen8
  Memory Writes Distribution Gen8
  Compute Metrics Extended Gen8
  Compute Metrics L3 Cache Gen8
  Data Port Reads Coalescing Gen8
  Data Port Writes Coalescing Gen8
  Metric set HDCAndSF
  Metric set L3_1
  Metric set L3_2
  Metric set L3_3
  Metric set L3_4
  Metric set RasterizerAndPixelBackend
  Metric set Sampler_1
  Metric set Sampler_2
  Metric set TDL_1
  Metric set TDL_2
  Stencil PMA Hold Metrics Gen8
  Media Memory Reads Distribution Gen8
  Media Memory Writes Distribution Gen8
  Media VME Pipe Gen8
  HDC URB Coalescing Metrics Gen8

I'm not sure it will end up making sense to publish all of these,
depending on what kind of testing they have been through to validate
that they give helpful data, and on the other hand these are evolving
and there may be more over time, but hopefully it at least gives a
ballpark idea of the number of configurations that could be interesting
to enable between Haswell and Broadwell.

>> >
>> > This does not require random per driver ABI extensions for
>> > perf_event_attr, nor your custom output format.
>> >
>> > Am I missing something obvious here?
>>
>> Definitely nothing 'obvious' since the current documentation is
>> notably incomplete a.t.m, but I don't think we were on the same page
>> about how the hardware works and our use cases.
>>
>> Hopefully some of my above comments help clarify some details.
>
> Yes, thanks!

Ok, I think the clarifications about the nature of our OA hardware that
we're drilling into here also help explain more of the trade-offs that
were considered in forwarding raw OA hardware reports.

I wonder if it could be good for us to take a bit of a step back here
just to re-consider the implications of enabling a pmu that really
doesn't relate to the cpu or to tasks running on cpus - since I think
this is the first example of that, and it's also had a bearing on how
we currently forward samples via perf.

I think the basic implications to consider are that for our use cases
we wouldn't expect to be requesting cpu centric info in samples, such
as _IP, _CALLCHAIN, _CPU, _BRANCH_STACK and _REGS/STACK_USER, and that
we wouldn't open events for specific pids/tasks.

I don't think we'd expect to use existing userspace tools such as perf
with this pmu, and instead:

- We're accessing perf from within OpenGL to implement a performance
  query extension that applications/games/tools can then use to get
  metrics for a single gpu context owned by OpenGL.

- We're creating new tools that are geared for viewing gpu metrics,
  including gputop and grafips.

I have a feeling that Ingo and yourself may currently be thinking of
the OA unit like an uncore pmu, with the idea that tools like perf
should Just Work™ with OA counters, while I don't think that's going to
be the case.
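
To make that concrete, a simplified sketch of how I'd expect a tool to
open the single raw-report event follows; the config encoding and
sample flags here are illustrative rather than the exact ABI of the
RFC, and the pmu type would be discovered from sysfs at runtime:

#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Illustrative sketch of opening the single raw-report i915_oa event:
 * no cpu-centric sample types like PERF_SAMPLE_IP or _CALLCHAIN, no
 * pid/task association, just raw OA reports plus a timestamp.  The
 * 'config' value is only a placeholder for the driver-specific
 * attributes, not the real ABI.
 */
static int open_i915_oa_event(int i915_oa_pmu_type, uint64_t config)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = i915_oa_pmu_type;   /* e.g. from .../i915_oa/type */
	attr.config = config;           /* placeholder for OA config */
	attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME;
	attr.sample_period = 1;

	/* Not tied to a task: pid = -1, a single cpu, no group fd. */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}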

Even so, I still feel that we benefit from being able to leverage the
perf infrastructure to support our use cases. Even though perf is
currently quite cpu centric, I've been happy that it's turned out to be
a good fit for exposing device metrics too, and ignoring/bypassing some
of the cpu specific bits for our needs has been straightforward.

I don't bring this up as a tangent to the question of having one
raw-report event vs multiple single counter events, but rather to
highlight that our gpu focused use cases might be a bit different to
what you're used to, and I hope our tooling and OpenGL work can inform
this discussion too, representing the practical ends that we're most
eager to support.

Regards,
- Robert