2010-01-19 19:41:19

by Corey Ashford

Subject: [RFC] perf_events: support for uncore a.k.a. nest units

-----
Intro
-----
One subject that hasn't been addressed since the introduction of perf_events in
the Linux kernel is support for "uncore" or "nest" unit events. "Uncore" is the
term Intel engineers use for units that are off-core but still on the same die
as the cores, and "nest" means exactly the same thing to IBM Power processor
engineers. I will use the term uncore for brevity and because it's in common
parlance, but the issues and design possibilities below are relevant to both. I
will also broaden the term to cover PMUs that are off the processor chip
altogether.

Contents
--------
1. Why support PMUs in uncore units? Is there anything interesting to look at?
2. How do uncore events differ from core events?
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
4. How do you encode uncore events?
5. How do you address a particular uncore PMU?
6. Event rotation issues with uncore PMUs
7. Other issues?
8. Feedback?

----
1. Why support PMUs in uncore units? Is there anything interesting to look at?
----

Today, many x86 chips contain uncore units, and we think that it's likely that
the trend will continue, as more devices - I/O, memory interfaces, shared
caches, accelerators, etc. - are integrated onto multi-core chips. As these
devices become more sophisticated and more workload is diverted off-core,
engineers and performance analysts are going to want to look at what's happening
in these units so that they can find bottlenecks.

In addition, we think that even off-chip I/O and interconnect devices are likely
to gain PMUs because engineers will want to find bottlenecks in their massively
parallel systems.

----
2. How do uncore events differ from core events?
----

The main difference is that uncore events are most likely not going to be tied
to a particular Linux task, or even to a CPU context. Uncore units are resources
that are in some sense system-wide, though on some architectures they may not
really be accessible system-wide. In the case of accelerators and I/O devices,
it's likely they will run asynchronously from the cores, so keeping track of
events on a per-task basis doesn't make much sense. The other existing mode in
perf_events is a per-CPU context, and that mode does match up well with uncore
units, though the choice of which CPU should manage a given uncore unit will
need to be arch-dependent and may involve other considerations as well, such as
minimizing access latency between the uncore unit and the CPU that manages it.

----
3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
----

* The control registers of the uncore unit's PMU need to be read and written,
and that may be possible only from a subset of processors in the system.
* A processor is needed to rotate the event list on the uncore unit on every
tick for the purposes of event scheduling.
* Because of access latency issues, we may want the CPU to be close in locality
to the PMU.

It seems like a good idea to let the kernel decide which CPU to use to monitor a
particular uncore event, based on the location of the uncore unit, and possibly
current system load balance. The user will not want to have to figure out this
detailed information.

----
4. How do you encode uncore events?
----
Uncore events will need to be encoded in the config field of the perf_event_attr
struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
config field, and that may be sufficient to support events on most systems.
However, due to the proliferation and added complexity of PMUs we envision, we
might want to add another 64-bit config (perhaps call it config_extra or
config2) field to encode any extra attributes that might be needed. The exact
encoding used, just as for the current encoding for core events, will be on a
per-arch and possibly per-system basis.

----
5. How do you address a particular uncore PMU?
----

This one is going to be very system- and arch-dependent, but it seems fairly
clear that we need some sort of addressing scheme that can be
system/arch-defined by the kernel.

From a hierarchical perspective, here's an example of possible uncore PMU
locations in a large system:

1) Per-core - units that are shared between all hardware threads in a core
2) Per-node - units that are shared between all cores in a node
3) Per-chip - units that are shared between all nodes in a chip
4) Per-blade - units that are shared between all chips on a blade
5) Per-rack - units that are shared between all blades in a rack

Addressing option 1)

Reuse the cpu argument: cpu would be interpreted differently if an uncore unit
is specified (via the perf_event_attr struct's config field).

For the hypothetical system described above, we'd want to have an address that
contains enough address bits for each of the above. For example:

bits    field
------  -----
 3..0   PMU number 0-15   /* specifies which of several identical PMUs is
                             being addressed */
 7..4   core id    0-15
 8..8   node id    0-1
11..9   chip id    0-7
16..12  blade id   0-31
23..17  rack id    0-127

These fields would be exposed via /usr/include/linux/perf_events_uncore_addr.h
(for example). How you actually assign these numbers to actual hardware is,
again, system-design dependent, and may be influenced by the use of a
hypervisor, or other software which allocates resources available to the system
dynamically.

How does the user discover the mapping between the hardware made available to
the system and the addresses shown above? Again, this is system-dependent, and
probably outside the scope of this proposal. In other words, I don't know how
to do this in a general way, though I could probably put something together for
a particular system.

Addressing Option 2)

Have the kernel create nodes for each uncore PMU in /sys/devices/system or other
pseudo file system, such as the existing /proc/device-tree on Power systems.
/sys/devices/system or /proc/device-tree could be explored by the
user tool, and the user could then specify the path of the requested PMU via a
string which the kernel could interpret. To be overly simplistic, something
like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we settled on a
common tree root to use, we could specify only the relative path name,
"blade4/cpu0/vectorcopro1".

One way to provide this extra "PMU path" argument to sys_perf_event_open()
would be to add a bit to the flags argument that says a PMU path string is
appended to the end of the argument list.

This path-string-based addressing option seems more flexible in the long run,
and does not have as serious an issue in mapping PMUs to user space; the
kernel essentially exposes to user space all of the available PMUs for the
current partition. This might create more work on the kernel side, but should
make the system more transparent for user-space tools. Another system- or at
least arch-dependent tool would have to be written to help users navigate the
device tree and find the PMU they want to use. I don't think it would make
sense to build that capability into perf, because the software would be arch-
or system-dependent.

It could be argued that we should use a common user space tree to represent PMUs
for all architectures and systems, so that the arch-independent perf code would
be able to display available uncore PMUs. That may be a goal that's very hard
to achieve because of the wide variation in architectures. Any thoughts on that?

----
6. Event rotation issues with uncore PMUs
----

Currently, the perf_events code rotates the set of events assigned to a CPU or
task on every system tick, so that event scheduling collisions on a PMU are
mitigated. This turns out to cause problems for uncore units for two reasons -
inefficiency and CPU load.

a) Rotation of a set of events across more than one PMU causes inefficient rotation.

Consider the following event list; the letter designates the PMU and the number
is the event number on that PMU.
A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4 C5

after one rotation, you can see that the event list will be:

C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3 C4

and then

C4 C5 A1 A2 A3 B1 B2 B3 B4 C1 C2 C3

Notice how the relative positions for the A and B PMU events haven't changed
even after two (or even five) rotations, so they will schedule the events in the
same order for some time. This will skew the multiplexing so that some events
will be scheduled much less often than they should or could be.

What we'd like to have happen is that events for each PMU be rotated in their
own lists. For example, before rotation:

A1 A2 A3
B1 B2 B3 B4
C1 C2 C3 C4 C5

After rotation:

A3 A1 A2
B2 B3 B4 B1
C2 C3 C4 C5 C1

We've got some ideas about how to make this happen, using either separate lists,
or placing them on separate CPUs.

b) Access to some uncore PMUs may be quite slow due to the interconnect that is
used. This can place a burden on the CPU if it happens on every system tick.

This can be addressed by keeping a counter on a per-PMU-context basis that
reduces the rate of event rotations. Setting the rotation period to three, for
example, would cause event rotations in that context to happen on every third
tick instead of every tick. The kernel could also measure how long a rotation
takes and dynamically decrease the rotation rate if it's taking too long;
"rotation rate throttling", in other words.

----
7. Other issues?
----

This section left blank for now.

----
8. Feedback?
----

I'd appreciate any feedback you might have on this topic. You can contact me
directly at the email address below, or better yet, reply to LKML.

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]


2010-01-20 00:44:35

by Andi Kleen

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
> One subject that hasn't been addressed since the introduction of
> perf_events in the Linux kernel is that of support for "uncore" or "nest"
> unit events. Uncore is the term used by the Intel engineers for their
> off-core units but are still on the same die as the cores, and "nest" means
> exactly the same thing for IBM Power processor engineers. I will use the
> term uncore for brevity and because it's in common parlance, but the issues
> and design possibilities below are relevant to both. I will also broaden
> the term by stating that uncore will also refer to PMUs that are completely
> off of the processor chip altogether.

Yes, e.g. chipsets commonly have their own PMUs too.

> The main difference is that uncore events are mostly likely not going to be
> tied to a particular Linux task, or even a CPU context. Uncore units are
> resources that are in some sense system-wide, though, they may not really
> be accessible system-wide in some architectures. In the case of
> accelerators and I/O devices, it's likely they will run asynchronously from
> the cores, and thus keeping track of events on a per-task basis doesn't
> make a lot of sense. The other existing mode in perf_events is a per-CPU
> context, and it turns out that this mode does match up with uncore units
> well, though the choice of which CPU to use to manage that uncore unit is
> going to need to be arch-dependent and may involve other issues as well,
> such as minimizing access latency between the uncore unit and the CPU which
> is managing it.

What the user needs to know is which CPUs are affected by that uncore
event. For example the integrated memory controller counters that count local
accesses should be somehow associated with the local CPUs.

> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the
> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
> are available in the config field, and that may be sufficient to support
> events on most systems. However, due to the proliferation and added
> complexity of PMUs we envision, we might want to add another 64-bit config
> (perhaps call it config_extra or config2) field to encode any extra
> attributes that might be needed. The exact encoding used, just as for the
> current encoding for core events, will be on a per-arch and possibly
> per-system basis.

I don't think a raw hex number will scale anywhere. You'll need a human
readable event list / sub event masks with help texts.

Often uncore events have specific restrictions, and that needs
to be enforced somewhere too.

Doing that all in a clean way that is also usable
by programs likely needs a lot more thinking.


> bits field
> ------ -----
> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
> addressed */
> 7..4 core id 0-15
> 8..8 node id 0-1
> 11..9 chip id 0-7
> 16..12 blade id 0-31
> 23..17 rack id 0-128

Such a compressed addressing scheme doesn't seem very future-proof,
e.g. 4 bits for the core id is already obsolete (see the "80 core chip" that
was recently announced)


> probably put something together for a particular system.
>
> Addressing Option 2)
>
> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
> other pseudo file system, such as the existing /proc/device-tree on Power
> systems. /sys/devices/system or /proc/device-tree could be explored by the
> user tool, and the user could then specify the path of the requested PMU
> via a string which the kernel could interpret. To be overly simplistic,
> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
> settled on a common tree root to use, we could specify only the relative
> path name, "blade4/cpu0/vectorcopro1".

That's a more workable scheme, but you still need to find a clean
way to describe topology (see above). The existing examples in sysfs
are unfortunately all clumsy imho.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-20 01:49:45

by Corey Ashford

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 1/19/2010 4:44 PM, Andi Kleen wrote:
> On Tue, Jan 19, 2010 at 11:41:01AM -0800, Corey Ashford wrote:
>> 4. How do you encode uncore events?
>> ----
>> Uncore events will need to be encoded in the config field of the
>> perf_event_attr struct using the existing PERF_TYPE_RAW encoding. 64 bits
>> are available in the config field, and that may be sufficient to support
>> events on most systems. However, due to the proliferation and added
>> complexity of PMUs we envision, we might want to add another 64-bit config
>> (perhaps call it config_extra or config2) field to encode any extra
>> attributes that might be needed. The exact encoding used, just as for the
>> current encoding for core events, will be on a per-arch and possibly
>> per-system basis.
>
> I don't think a raw hex number will scale anywhere. You'll need a human
> readable event list / sub event masks with help texts.
>
> Often uncore events have specific restrictions, and that needs
> to be enforced somewhere too.
>
> Doing that all in a clean way that is also usable
> by programs likely needs a lot more thinking.

I left out one critical detail here: I had in mind that we'd be using a library
like libpfm for handling the issue of event names + attributes to raw code
translation. In fact, we are using libpfm today for this purpose in the
PAPI/perf_events substrate implementation.

>
>
>> bits field
>> ------ -----
>> 3..0 PMU number 0-15 /* specifies which of several identical PMUs being
>> addressed */
>> 7..4 core id 0-15
>> 8..8 node id 0-1
>> 11..9 chip id 0-7
>> 16..12 blade id 0-31
>> 23..17 rack id 0-128
>
> Such a compressed addressing scheme doesn't seem very future proof.
> e.g. core 4 bits for the core is already obsolete (see the "80 core chip" that
> was recently announced)

Agreed. If the designer is very generous with the size of each field, it could
hold up for quite a while, but there's still the problem of relating these
addresses to actual hardware.

>
>
>> probably put something together for a particular system.
>>
>> Addressing Option 2)
>>
>> Have the kernel create nodes for each uncore PMU in /sys/devices/system or
>> other pseudo file system, such as the existing /proc/device-tree on Power
>> systems. /sys/devices/system or /proc/device-tree could be explored by the
>> user tool, and the user could then specify the path of the requested PMU
>> via a string which the kernel could interpret. To be overly simplistic,
>> something like "/sys/devices/system/pmus/blade4/cpu0/vectorcopro1". If we
>> settled on a common tree root to use, we could specify only the relative
>> path name, "blade4/cpu0/vectorcopro1".
>
> That's a more workable scheme, but you still need to find a clean
> way to describe topology (see above). The existing examples in sysfs
> are unfortuately all clumpsy imho.
>

Yes, I agree. Also it's easy to construct a system design that doesn't have a
hierarchical topology. A simple example would be a cluster of 32 nodes, each of
which is connected to its 31 neighbors. Perhaps for the purposes of just
enumerating PMUs, a tree might be sufficient, but it's not clear to me that it
is mathematically sufficient for all topologies, not to mention if it's
intuitive enough to use. For example, highly-interconnected components might
require that PMU leaf nodes be duplicated in multiple branches, i.e. PMU paths
might not be unique in some topologies.

I'm certainly open to better alternatives!


Thanks for your thoughts,

- Corey

2010-01-20 09:36:04

by Andi Kleen

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

> Yes, I agree. Also it's easy to construct a system design that doesn't
> have a hierarchical topology. A simple example would be a cluster of 32
> nodes, each of which is connected to its 31 neighbors. Perhaps for the

I doubt it's needed or useful to describe all details of an interconnect.

If detailed distance information is needed a simple table like
the SLIT table exported by ACPI would seem easier to handle.

But at least some degree of locality (e.g. "local memory controller")
would make sense.


> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
> clear to me that it is mathematically sufficient for all topologies, not to
> mention if it's intuitive enough to use. For example,
> highly-interconnected components might require that PMU leaf nodes be
> duplicated in multiple branches, i.e. PMU paths might not be unique in some
> topologies.

We already have cyclical graphs in sysfs using symlinks. I'm not
sure they are all that easy to parse/handle, but at least they
can be described.

-Andi

--
[email protected] -- Speaking for myself only.

2010-01-20 13:34:43

by Peter Zijlstra

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Tue, 2010-01-19 at 11:41 -0800, Corey Ashford wrote:

> ----
> 3. Why does a CPU need to be assigned to manage a particular uncore unit's events?
> ----
>
> * The control registers of the uncore unit's PMU need to be read and written,
> and that may be possible only from a subset of processors in the system.
> * A processor is needed to rotate the event list on the uncore unit on every
> tick for the purposes of event scheduling.
> * Because of access latency issues, we may want the CPU to be close in locality
> to the PMU.
>
> It seems like a good idea to let the kernel decide which CPU to use to monitor a
> particular uncore event, based on the location of the uncore unit, and possibly
> current system load balance. The user will not want to have to figure out this
> detailed information.

Well, to some extent the user will have to participate. For example,
which uncore pmu is selected depends on the cpu you're attaching
the event to, according to the cpu-to-node map.

Furthermore the intel uncore thing has curious interrupt routing
capabilities which could be tied into this mapping.

> ----
> 4. How do you encode uncore events?
> ----
> Uncore events will need to be encoded in the config field of the perf_event_attr
> struct using the existing PERF_TYPE_RAW encoding. 64 bits are available in the
> config field, and that may be sufficient to support events on most systems.
> However, due to the proliferation and added complexity of PMUs we envision, we
> might want to add another 64-bit config (perhaps call it config_extra or
> config2) field to encode any extra attributes that might be needed. The exact
> encoding used, just as for the current encoding for core events, will be on a
> per-arch and possibly per-system basis.

Lets cross that bridge when we get there.

> ----
> 5. How do you address a particular uncore PMU?
> ----
>
> This one is going to be very system- and arch-dependent, but it seems fairly
> clear that we need some sort of addressing scheme that can be
> system/arch-defined by the kernel.
>
> From a hierarchical perspective, here's an example of possible uncore PMU
> locations in a large system:
>
> 1) Per-core - units that are shared between all hardware threads in a core
> 2) Per-node - units that are shared between all cores in a node
> 3) Per-chip - units that are shared between all nodes in a chip
> 4) Per-blade - units that are shared between all chips on a blade
> 5) Per-rack - units that are shared between all blades in a rack

So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

> ----
> 6. Event rotation issues with uncore PMUs
> ----
>
> Currently, the perf_events code rotates the set of events assigned to a CPU or
> task on every system tick, so that event scheduling collisions on a PMU are
> mitigated. This turns out to cause problems for uncore units for two reasons -
> inefficiency and CPU load.

Well, if you give these things a cpumask and put them all onto the
context of first cpu of that mask things seem to collect nicely.

> b) Access to some PMU uncore units may be quite slow due to the interconnect
> that is used. This can place a burden on the CPU if it is done every system tick.
>
> This can be addressed by keeping a counter, on a per-PMU context basis that
> reduces the rate of event rotations. Setting the rotation period to three, for
> example, would cause event rotations in that context to happen on every third
> tick, instead of every tick. We think that the kernel could measure the amount
> of time it is taking to do a rotate, and then dynamically decrease the rotation
> rate if it's taking too long; "rotation rate throttling" in other words.

The better solution is to generalize the whole rr on tick scheme (which
has already been discussed).

2010-01-20 19:29:26

by Corey Ashford

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units



On 1/20/2010 1:35 AM, Andi Kleen wrote:
>> Yes, I agree. Also it's easy to construct a system design that doesn't
>> have a hierarchical topology. A simple example would be a cluster of 32
>> nodes, each of which is connected to its 31 neighbors. Perhaps for the
>
> I doubt it's needed or useful to describe all details of an interconnect.
>

At this point, I don't see a need for that level of detail either.

> If detailed distance information is needed a simple table like
> the SLIT table exported by ACPI would seem easier to handle.
>

Thanks for the pointer. I didn't know about the ACPI SLIT and SRAT tables until
your post. Having had a quick look at them, I don't think they'd be that
helpful to us, at least at this point.

> But at least some degree of locality (e.g. "local memory controller")
> would make sense.

I think locality could be determined by looking at the device tree. For
example, a memory controller for a particular processor chip would be a
subdirectory of that chip.

>
>> purposes of just enumerating PMUs, a tree might be sufficient, but it's not
>> clear to me that it is mathematically sufficient for all topologies, not to
>> mention if it's intuitive enough to use. For example,
>> highly-interconnected components might require that PMU leaf nodes be
>> duplicated in multiple branches, i.e. PMU paths might not be unique in some
>> topologies.
>
> We already have cyclical graphs in sysfs using symlinks. I'm not
> sure they are all that easy to parse/handle, but at least they
> can be described.

Good point.

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2010-01-20 21:33:45

by Peter Zijlstra

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:

> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?

OK, so I read most of the intel uncore stuff, and it seems to suggest
you need a regular pmu event to receive uncore events (chained setup),
this seems rather retarded since it wastes a perfectly good pmu event
and makes configuring all this more intricate...

Ah well, nothing to be done about that, I guess..


2010-01-20 23:23:50

by Corey Ashford

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units



On 1/20/2010 1:33 PM, Peter Zijlstra wrote:
> On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
>
>> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?
>
> OK, so I read most of the intel uncore stuff, and it seems to suggest
> you need a regular pmu event to receive uncore events (chained setup),
> this seems rather retarded since it wastes a perfectly good pmu event
> and makes configuring all this more intricate...
>
> A well, nothing to be done about that I guess..

Yes, we have a similar situation where in addition to events that are counted on
core PMU counters, we also have counters that are off-core; in some cases the
counters are in off-core units which take their actual events from other
off-core units, in addition to their own events. So you can see that this can
be almost arbitrarily complex.

As for the PERF_TYPE_{CORE,NODE,SOCKET} idea, that could still work, even
though, for example, a socket event may be counted on a core PMU. Using more
encodings for the type field, as you've suggested, would allow us to reuse the
64-bit config space multiple times. Were you thinking that, within each
PERF_TYPE_* space, we'd still re-use the "cpu" argument for the actual PMU
address? If so, that's an interesting idea, but I think it still leaves open
the problem of how to relate those addresses to the real hardware, especially
when a hypervisor has provided you a small subset of the physical hardware in
the system.

I really think we need some sort of data structure which is passed from the
kernel to user space to represent the topology of the system, and give useful
information to be able to identify each PMU node. Whether this is done with a
sysfs-style tree, a table in a file, XML, etc... it doesn't really matter much,
but it needs to be something that can be parsed relatively easily and *contains
just enough information* for the user to be able to correctly choose PMUs, and
for the kernel to be able to relate that back to actual PMU hardware.

In our case, we are looking at /proc/device-tree, and it actually does appear to
contain enough information for us. However, since /proc/device-tree is not
available anywhere but Power arch (/proc/device-tree originates from a data
structure passed into the OS from the Open Firmware) we'd like to have a more
general approach that can be used on x86 and other arches.

- Corey

2010-01-21 07:21:52

by Ingo Molnar

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units


* Corey Ashford <[email protected]> wrote:

> I really think we need some sort of data structure which is passed from the
> kernel to user space to represent the topology of the system, and give
> useful information to be able to identify each PMU node. Whether this is
> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
> really matter much, but it needs to be something that can be parsed
> relatively easily and *contains just enough information* for the user to be
> able to correctly choose PMUs, and for the kernel to be able to relate that
> back to actual PMU hardware.

The right way would be to extend the current event description under
/debug/tracing/events with hardware descriptors and (maybe) to formalise this
into a separate /proc/events/ or into a separate filesystem.

The advantage of this is that, in the grand scheme of things, we _really_ don't
want to limit performance events to 'hardware' hierarchies, or to
devices/sysfs, some existing /proc scheme, or any other arbitrary (and
fundamentally limiting) object enumeration.

We want a unified, logical enumeration of all events and objects that we care
about from a performance monitoring and analysis point of view, shaped for the
purpose of and parsed by perf user-space. And since the current event
descriptors are already rather rich as they enumerate all sorts of things:

- tracepoints
- hw-breakpoints
- dynamic probes

etc., and are well used by tooling we should expand those with real hardware
structure.

Thanks,

Ingo

2010-01-21 08:36:25

by Peter Zijlstra

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-01-20 at 15:23 -0800, Corey Ashford wrote:
> If so, that's an interesting idea, but I think it still
> leaves open the problem of how to actually relate those address to the real
> hardware, especially in the case of using a hypervisor which has provided you a
> small subset of the physical hardware in the system.

Well, I'm tempted to say that's a problem for the virt guys :-)

One way one could solve that is by having the topology information
include the virt<->phys map, so that you can find the physical node from
the virtual cpu number.

Going to be interesting though, but then, virt seems to be about
creating problems where there were none before.

2010-01-21 08:47:46

by Stephane Eranian

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, Jan 20, 2010 at 10:33 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-01-20 at 14:34 +0100, Peter Zijlstra wrote:
>
>> So how about PERF_TYPE_{CORE,NODE,SOCKET} like things?
>
> OK, so I read most of the intel uncore stuff, and it seems to suggest
> you need a regular pmu event to receive uncore events (chained setup),
> this seems rather retarded since it wastes a perfectly good pmu event
> and makes configuring all this more intricate...
>
I don't think that is correct. You can be using the uncore PMU on Nehalem
without any core PMU event. The only thing to realize is that uncore PMU
shares the same interrupt vector as core PMU. You need to configure which
core the uncore is going to interrupt on. This is done via a bitmask, so you
can interrupt more than one core at a time. Several strategies are possible.


> A well, nothing to be done about that I guess..
>
>
>
>

2010-01-21 08:59:46

by Peter Zijlstra

Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote:
> I don't think that is correct. You can be using the uncore PMU on Nehalem
> without any core PMU event. The only thing to realize is that uncore PMU
> shares the same interrupt vector as core PMU. You need to configure which
> core the uncore is going to interrupt on. This is done via a bitmask, so you
> can interrupt more than one core at a time. Several strategies are possible.

Ah, sharing the IRQ line is no problem. But from reading I got the
impression you need to configure an Offcore counter. See 30.6.2.1:

• EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor
core n is programmed to receive an interrupt signal from any interrupt enabled
uncore counter. PMI delivery due to an uncore counter overflow is enabled by
setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1.

Which seems to indicate a link with the off-core response thing.

However I would be very glad to be wrong :-)

2010-01-21 09:16:43

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

The offcore_response register is a CORE PMU register.
I know the name is confusing.

OFFCORE_RESPONSE_0 is not a counter but rather an
extension of the filtering capabilities of regular counters.
To use offcore_response_0, you will need to program
a generic counter (1 config, 1 data) + offcore_response_0.
This is a very useful feature to understand memory traffic.

There is an associated difficulty though. The
offcore_response_0 MSR is shared between HT threads.
Thus the kernel needs to arbitrate. That should be taken
care of by the event scheduling code which I am working
on.

In the context of perf_event, you need to find some encoding
of the event to pass as RAW. The raw event code + umask
is 0x01B7. Then, you need to encode the value for the
offcore_response_0 MSR, which is 16 bits on NHM/WSM.
You could either stash this into the config field or use an extended
field. The kernel would detect 0x01B7 and use the value in the
extra config field.

As described in vol3b, Intel Westmere adds a second
offcore_response register. It behaves the same way with
the same restrictions.

The debugctl bit controls uncore activation, not offcore.

On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote:
>> I don't think that is correct. You can be using the uncore PMU on Nehalem
>> without any core PMU event. The only thing to realize is that uncore PMU
>> shares the same interrupt vector as core PMU. You need to configure which
>> core the uncore is going to interrupt on. This is done via a bitmask, so you
>> can interrupt more than one core at a time. Several strategies are possible.
>
> Ah, sharing the IRQ line is no problem. But from reading I got the
> impression you need to configure an Offcore counter. See 30.6.2.1:
>
> • EN_PMI_COREn (bit n, n = 0, 3 if four cores are present): When set, processor
> core n is programmed to receive an interrupt signal from any interrupt enabled
> uncore counter. PMI delivery due to an uncore counter overflow is enabled by
> setting IA32_DEBUG_CTL.Offcore_PMI_EN to 1.
>
> Which seems to indicate a link with the off-core response thing.
>
> However I would be very glad to be wrong :-)

2010-01-21 09:43:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Thu, Jan 21, 2010 at 9:59 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2010-01-21 at 09:47 +0100, stephane eranian wrote:
>> I don't think that is correct. You can be using the uncore PMU on Nehalem
>> without any core PMU event. The only thing to realize is that uncore PMU
>> shares the same interrupt vector as core PMU. You need to configure which
>> core the uncore is going to interrupt on. This is done via a bitmask, so you
>> can interrupt more than one core at a time. Several strategies are possible.
>
> Ah, sharing the IRQ line is no problem. But from reading I got the

Given the PMU sharing model of perf_events, it seems you may have
multiple consumers of the uncore PMU at the same time. That means you
will need to direct the interrupt onto all the CPUs for which you currently
have a user. You may have multiple users per CPU, so you need some
reference count to track all of that. The alternative is to systematically
broadcast the uncore PMU interrupt; each core then checks whether or
not it has uncore users.

Note that all of this is independent of the type of event, i.e., per-thread
or system-wide.

2010-01-21 19:13:43

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units



On 1/20/2010 11:21 PM, Ingo Molnar wrote:
>
> * Corey Ashford<[email protected]> wrote:
>
>> I really think we need some sort of data structure which is passed from the
>> kernel to user space to represent the topology of the system, and give
>> useful information to be able to identify each PMU node. Whether this is
>> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
>> really matter much, but it needs to be something that can be parsed
>> relatively easily and *contains just enough information* for the user to be
>> able to correctly choose PMUs, and for the kernel to be able to relate that
>> back to actual PMU hardware.
>
> The right way would be to extend the current event description under
> /debug/tracing/events with hardware descriptors and (maybe) to formalise this
> into a separate /proc/events/ or into a separate filesystem.
>
> The advantage of this is that in the grand scheme of things we _really_ don't
> want to limit performance events to 'hardware' hierarchies, or to
> devices/sysfs, some existing /proc scheme, or any other arbitrary (and
> fundamentally limiting) object enumeration.
>
> We want a unified, logical enumeration of all events and objects that we care
> about from a performance monitoring and analysis point of view, shaped for the
> purpose of and parsed by perf user-space. And since the current event
> descriptors are already rather rich as they enumerate all sorts of things:
>
> - tracepoints
> - hw-breakpoints
> - dynamic probes
>
> etc., and are well used by tooling we should expand those with real hardware
> structure.

This is an intriguing idea; I like the notion of generalizing all of this info
into one structure.

So you think that this structure should contain event info as well? If these
structures are created by the kernel, I think that would necessitate placing
large event tables into the kernel, which is something I think we'd prefer to
avoid because of the amount of memory it would take. Keep in mind that we need
not only event names, but event descriptions, encodings, attributes (e.g. unit
masks), attribute descriptions, etc. I suppose the kernel could read a file
from the file system, and then add this info to the tree, but that just seems
bad. Are there existing places in the kernel where it reads a user space file
to create a user space pseudo filesystem?

I think keeping event naming in user space, and PMU naming in kernel space might
be a better idea: the kernel exposes the available PMUs to user space via some
structure, and a user space library tries to recognize the exposed PMUs and
provide event lists and other needed info. The perf tool would use this library
to be able to list available events to users.

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2010-01-21 19:29:14

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 1/21/2010 11:13 AM, Corey Ashford wrote:
>
> [...]
>
> This is an intriguing idea; I like the idea of generalizing all of this
> info into one structure.
>
> So you think that this structure should contain event info as well? If
> these structures are created by the kernel, I think that would
> necessitate placing large event tables into the kernel, which is
> something I think we'd prefer to avoid because of the amount of memory
> it would take. Keep in mind that we need not only event names, but event
> descriptions, encodings, attributes (e.g. unit masks), attribute
> descriptions, etc. I suppose the kernel could read a file from the file
> system, and then add this info to the tree, but that just seems bad. Are
> there existing places in the kernel where it reads a user space file to
> create a user space pseudo filesystem?
>
> I think keeping event naming in user space, and PMU naming in kernel
> space might be a better idea: the kernel exposes the available PMUs to
> user space via some structure, and a user space library tries to
> recognize the exposed PMUs and provide event lists and other needed
> info. The perf tool would use this library to be able to list available
> events to users.
>

Perhaps another way of handling this would be to have the kernel dynamically load
a specific "PMU kernel module" once it has detected that it has a particular PMU
in the hardware. The module would consist only of a data structure and a simple
API to access the event data. This way, only the PMUs that actually exist in the
hardware would need to be loaded into memory, and perhaps then only temporarily
(just long enough to create the pseudo fs nodes).

Still, though, since it's a pseudo fs, all of that event data would be taking up
kernel memory.

Another model, perhaps, would be to actually write this data out to a real file
system on every boot, so that it wouldn't need to be held in memory. That seems
rather ugly and time-consuming, though.

--
Regards,

- Corey

2010-01-27 10:29:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units


* Corey Ashford <[email protected]> wrote:

> [...]
>
> Perhaps another way of handling this would be to have the kernel dynamically
> load a specific "PMU kernel module" once it has detected that it has a
> particular PMU in the hardware. The module would consist only of a data
> structure and a simple API to access the event data. This way, only the
> PMUs that actually exist in the hardware would need to be loaded into
> memory, and perhaps then only temporarily (just long enough to create the
> pseudo fs nodes).
>
> Still, though, since it's a pseudo fs, all of that event data would be
> taking up kernel memory.
>
> Another model, perhaps, would be to actually write this data out to a real
> file system upon every boot up, so that it wouldn't need to be held in
> memory. That seems rather ugly and time consuming, though.

I don't think memory consumption is a problem at all. The structure of the
monitored hardware/software state is information we _want_ the kernel to
provide, mainly because there's no unified repository for user-space to get
this info from.

If someone doesn't want it on some ultra-embedded box then sure, a .config
switch can be provided to allow it to be turned off.

Ingo

2010-01-27 19:51:06

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 1/27/2010 2:28 AM, Ingo Molnar wrote:
>
> * Corey Ashford<[email protected]> wrote:
>
>> [...]
>
> I don't think memory consumption is a problem at all. The structure of the
> monitored hardware/software state is information we _want_ the kernel to
> provide, mainly because there's no unified repository for user-space to get
> this info from.
>
> If someone doesn't want it on some ultra-embedded box then sure, a .config
> switch can be provided to allow it to be turned off.
>
> Ingo

Ok, just so that we quantify things a bit, let's say I have 20 different types
of PMUs totalling 2000 different events, each of which has a name and text
description averaging 300 characters. Along with that, let's say there are 4
64-bit words of metadata per event describing the encoding, which attributes
apply to the event, and any other needed info. I don't know how much memory each
pseudo fs node takes up; let me guess and say 128 bytes for each event node
(the amount taken for the PMU nodes would be negligible compared with the event
nodes).

So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.

Let's assume that the correct event module can be loaded dynamically, so that we
don't need to have all of the possible event sets for a particular arch kernel
build.

Any opinions on whether allocating this amount of kernel memory would be
acceptable? It seems like a lot of kernel memory to me, but I come from an
embedded systems background. Granted, most systems are going to use a fraction
of that amount of memory (<100KB) due to having far fewer PMUs and therefore
fewer distinct event types.

There's at least one more dimension to this. Let's say I have 16 uncore PMUs,
all of the same type, each of which has, for example, 8 events. As a very crude
pseudo fs, let's say we have a structure like this:


/sys/devices/pmus/
    uncore_pmu0/
        event0/    (path name to here is the name of the pmu and event)
            description               (file)
            applicable_attributes     (file)
        event1/
            description
            applicable_attributes
        event2/
        ...
        event7/
        ...
    uncore_pmu1/
        event0/
            description
            applicable_attributes
        ...
    ...
    uncore_pmu15/
        ...

Now, you can see that there's a lot of replication here, because the event
descriptions and attributes are the same for each uncore PMU. We can use
symlinks to link them to the same descriptions and attribute data, but these
symlinks take up space too, which can add up if there are a lot of identical
PMUs. So for complex and large systems, we might consume several megabytes of
memory for the pseudo fs.

Note that I'm taking some liberty with the applicable_attributes file. I know
some attribute info has to be in there, but I don't have any sort of concrete
idea as to how to encode it at this point. The point of this email is to get an
idea as to how much memory the pseudo fs would consume.

--
Regards,

- Corey

2010-01-28 10:58:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
> [...]
>
> Ok, just so that we quantify things a bit, let's say I have 20 different types
> of PMUs totalling 2000 different events, each of which has a name and text
> description, averaging 300 characters. Along with that, there's let's say 4
> 64-bit words of metadata per event describing encoding, which attributes apply
> to the event, and any other needed info. I don't know how much memory each
> pseudo fs node takes up. Let me guess and say 128 bytes for each event node
> (the amount taken for the PMU nodes would be negligible compared with the event
> nodes).
>
> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>
> Let's assume that the correct event module can be loaded dynamically, so that we
> don't need to have all of the possible event sets for a particular arch kernel
> build.
>
> Any opinions on whether allocating this amount of kernel memory would be
> acceptable? It seems like a lot of kernel memory to me, but I come from an
> embedded systems background. Granted, most systems are going to use a fraction
> of that amount of memory (<100KB) due to having far fewer PMUs and therefore
> fewer distinct event types.
>
> There's at least one more dimension to this. Let's say I have 16 uncore PMUs
> all of the same type, each of which has, for example 8 events. As a very crude
> pseudo fs, let's say we have a structure like this:
>
>
> /sys/devices/pmus/
>     uncore_pmu0/
>         event0/    (path name to here is the name of the pmu and event)
>             description               (file)
>             applicable_attributes     (file)
>         event1/
>             description
>             applicable_attributes
>         event2/
>         ...
>         event7/
>         ...
>     uncore_pmu1/
>         event0/
>             description
>             applicable_attributes
>         ...
>     ...
>     uncore_pmu15/
>         ...

I really don't like this. The cpu->uncore map is fixed by the
topology of the machine, which is already available in /sys some place.

Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
something like that. We can start with 2 generic events for that type,
local/remote memory accesses, and take it from there.


2010-01-28 18:01:07

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 01/28/2010 02:57 AM, Peter Zijlstra wrote:
> On Wed, 2010-01-27 at 11:50 -0800, Corey Ashford wrote:
>> On 1/27/2010 2:28 AM, Ingo Molnar wrote:
>>>
>>> * Corey Ashford<[email protected]> wrote:
>>>
>>>> On 1/21/2010 11:13 AM, Corey Ashford wrote:
>>>>>
>>>>>
>>>>> On 1/20/2010 11:21 PM, Ingo Molnar wrote:
>>>>>>
>>>>>> * Corey Ashford<[email protected]> wrote:
>>>>>>
>>>>>>> I really think we need some sort of data structure which is passed
>>>>>> >from the
>>>>>>> kernel to user space to represent the topology of the system, and give
>>>>>>> useful information to be able to identify each PMU node. Whether this is
>>>>>>> done with a sysfs-style tree, a table in a file, XML, etc... it doesn't
>>>>>>> really matter much, but it needs to be something that can be parsed
>>>>>>> relatively easily and *contains just enough information* for the user
>>>>>>> to be
>>>>>>> able to correctly choose PMUs, and for the kernel to be able to
>>>>>>> relate that
>>>>>>> back to actual PMU hardware.
>>>>>>
>>>>>> The right way would be to extend the current event description under
>>>>>> /debug/tracing/events with hardware descriptors and (maybe) to
>>>>>> formalise this
>>>>>> into a separate /proc/events/ or into a separate filesystem.
>>>>>>
>>>>>> The advantage of this is that in the grand scheme of things we
>>>>>> _really_ dont
>>>>>> want to limit performance events to 'hardware' hierarchies, or to
>>>>>> devices/sysfs, some existing /proc scheme, or any other arbitrary (and
>>>>>> fundamentally limiting) object enumeration.
>>>>>>
>>>>>> We want a unified, logical enumeration of all events and objects that
>>>>>> we care
>>>>>> about from a performance monitoring and analysis point of view, shaped
>>>>>> for the
>>>>>> purpose of and parsed by perf user-space. And since the current event
>>>>>> descriptors are already rather rich as they enumerate all sorts of
>>>>>> things:
>>>>>>
>>>>>> - tracepoints
>>>>>> - hw-breakpoints
>>>>>> - dynamic probes
>>>>>>
>>>>>> etc., and are well used by tooling we should expand those with real
>>>>>> hardware
>>>>>> structure.
>>>>>
>>>>> This is an intriguing idea; I like the idea of generalizing all of this
>>>>> info into one structure.
>>>>>
>>>>> So you think that this structure should contain event info as well? If
>>>>> these structures are created by the kernel, I think that would
>>>>> necessitate placing large event tables into the kernel, which is
>>>>> something I think we'd prefer to avoid because of the amount of memory
>>>>> it would take. Keep in mind that we need not only event names, but event
>>>>> descriptions, encodings, attributes (e.g. unit masks), attribute
>>>>> descriptions, etc. I suppose the kernel could read a file from the file
>>>>> system, and then add this info to the tree, but that just seems bad. Are
>>>>> there existing places in the kernel where it reads a user space file to
>>>>> create a user space pseudo filesystem?
>>>>>
>>>>> I think keeping event naming in user space, and PMU naming in kernel
>>>>> space might be a better idea: the kernel exposes the available PMUs to
>>>>> user space via some structure, and a user space library tries to
>>>>> recognize the exposed PMUs and provide event lists and other needed
>>>>> info. The perf tool would use this library to be able to list available
>>>>> events to users.
>>>>>
>>>>
>>>> Perhaps another way of handling this would be to have the kernel dynamically
>>>> load a specific "PMU kernel module" once it has detected that it has a
>>>> particular PMU in the hardware. The module would consist only of a data
>>>> structure, and a simple API to access the event data. This way, only
>>>> the PMUs that actually exist in the hardware would need to be loaded into
>>>> memory, and perhaps then only temporarily (just long enough to create the
>>>> pseudo fs nodes).
>>>>
>>>> Still, though, since it's a pseudo fs, all of that event data would be
>>>> taking up kernel memory.
>>>>
>>>> Another model, perhaps, would be to actually write this data out to a real
>>>> file system upon every boot up, so that it wouldn't need to be held in
>>>> memory. That seems rather ugly and time consuming, though.
>>>
>>> I don't think memory consumption is a problem at all. The structure of the
>>> monitored hardware/software state is information we _want_ the kernel to
>>> provide, mainly because there's no unified repository for user-space to get
>>> this info from.
>>>
>>> If someone doesn't want it on some ultra-embedded box then sure a .config
>>> switch can be provided to allow it to be turned off.
>>>
>>> Ingo
>>
>> Ok, just so that we quantify things a bit, let's say I have 20 different types
>> of PMUs totalling 2000 different events, each of which has a name and text
>> description, averaging 300 characters. Along with that, there's let's say 4
>> 64-bit words of metadata per event describing encoding, which attributes apply
>> to the event, and any other needed info. I don't know how much memory each
>> pseudo fs node takes up. Let me guess and say 128 bytes for each event node
>> (the amount taken for the PMU nodes would be negligible compared with the event
>> nodes).
>>
>> So that's 2000 * (300 + 32 + 128) bytes ~= 920KB of memory.
>>
>> Let's assume that the correct event module can be loaded dynamically, so that we
>> don't need to have all of the possible event sets for a particular arch kernel
>> build.
>>
>> Any opinions on whether allocating this amount of kernel memory would be
>> acceptable? It seems like a lot of kernel memory to me, but I come from an
>> embedded systems background. Granted, most systems are going to use a fraction
>> of that amount of memory (<100KB) due to having far fewer PMUs and therefore
>> fewer distinct event types.
>>
>> There's at least one more dimension to this. Let's say I have 16 uncore PMUs
>> all of the same type, each of which has, for example 8 events. As a very crude
>> pseudo fs, let's say we have a structure like this:
>>
>>
>> /sys/devices/pmus/
>>     uncore_pmu0/
>>         event0/                  (path name to here is the name of the pmu and event)
>>             description          (file)
>>             applicable_attributes (file)
>>         event1/
>>             description
>>             applicable_attributes
>>         event2/
>>         ...
>>         event7/
>>         ...
>>     uncore_pmu1/
>>         event0/
>>             description
>>             applicable_attributes
>>         ...
>>     ...
>>     uncore_pmu15/
>>         ...
>
> I really don't like this. The cpu->uncore map is fixed by the
> topology of the machine, which is already available in /sys some place.
>
> Let's simply use the cpu->node mapping and use PERF_TYPE_NODE{,_RAW} or
> something like that. We can start with 2 generic events for that type,
> local/remote memory accesses and take it from there.
>

I don't quite get what you're saying here. Perhaps you are thinking
that all uncore units are associated with a particular cpu node, or a
set of cpu nodes? And that there's only one uncore unit per cpu (or set
of cpus) that needs to be addressed, i.e. no ambiguity?

That is not going to be the case for all systems. We can have uncore
units that are associated with the entire system, for example PMUs in an
I/O device. And we can have multiple uncore units of a particular
type, for example multiple vector coprocessors, each with its own PMU,
and are associated with a single cpu or a set of cpus.

perf_events needs an addressing scheme that covers these cases.

- Corey

2010-01-28 19:07:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote:
>
> I don't quite get what you're saying here. Perhaps you are thinking
> that all uncore units are associated with a particular cpu node, or a
> set of cpu nodes? And that there's only one uncore unit per cpu (or set
> of cpus) that needs to be addressed, i.e. no ambiguity?

Well, I was initially thinking of the intel uncore thing, which is a memory
controller, so node level.

But all system-topology-bound pmus can be done that way.

> That is not going to be the case for all systems. We can have uncore
> units that are associated with the entire system,

Right, but that's simple too.

> for example PMUs in an I/O device.

> And we can have multiple uncore units of a particular
> type, for example multiple vector coprocessors, each with its own PMU,
> and are associated with a single cpu or a set of cpus.
>
> perf_events needs an addressing scheme that covers these cases.

You could possibly add a u64 pmu_id field to perf_event_attr and use
that together with things like:

PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding
PERF_TYPE_SPU, attr.pmu_id = spu-id

But before we go there the perf core needs to be extended to deal with
multiple hardware pmus, something which isn't too hard but we need to be
careful not to bloat the normal code paths for these somewhat esoteric
use cases.
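As a concrete illustration, the domain:bus:device:function suggestion above could be packed into the proposed u64 pmu_id along the following lines. This is only a sketch: the field widths and the pci_pmu_id helper are illustrative assumptions, since no attr.pmu_id field or PERF_TYPE_PCI type existed at the time of this thread.

```c
#include <stdint.h>

/* Pack a PCI location into a single u64, mirroring the usual
 * dddd:bb:dd.f notation: 16-bit domain, 8-bit bus, 5-bit device,
 * 3-bit function.  The layout is an assumption for illustration. */
static inline uint64_t pci_pmu_id(uint16_t domain, uint8_t bus,
                                  uint8_t dev, uint8_t fn)
{
	return ((uint64_t)domain << 16) |
	       ((uint64_t)bus    <<  8) |
	       ((uint64_t)(dev & 0x1f) << 3) |
	       (uint64_t)(fn & 0x7);
}
```

For the device at 0000:03:00.1 this yields 0x301; the kernel side would decode the same fields to locate the PMU behind that PCI function.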


2010-01-28 19:44:43

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 1/28/2010 11:06 AM, Peter Zijlstra wrote:
> On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote:
>>
>> I don't quite get what you're saying here. Perhaps you are thinking
>> that all uncore units are associated with a particular cpu node, or a
>> set of cpu nodes? And that there's only one uncore unit per cpu (or set
>> of cpus) that needs to be addressed, i.e. no ambiguity?
>
> Well, I was initially thinking of the intel uncore thing which is memory
> controller, so node, level.
>
> But all system topology bound pmus can be done that way.
>
>> That is not going to be the case for all systems. We can have uncore
>> units that are associated with the entire system,
>
> Right, but that's simple too.
>
>> for example PMUs in an I/O device.
>
>> And we can have multiple uncore units of a particular
>> type, for example multiple vector coprocessors, each with its own PMU,
>> and are associated with a single cpu or a set of cpus.
>>
>> perf_events needs an addressing scheme that covers these cases.
>
> You could possibly add a u64 pmu_id field to perf_event_attr and use
> that together with things like:
>
> PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding
> PERF_TYPE_SPU, attr.pmu_id = spu-id
>

Thank you for that clarification.

One of Ingo's comments was that he wants perf to be able to expose all of the
available PMUs via the perf tool. That perf should be able to parse some data
structure (somewhere) that would contain all of the info the user would need to
choose a particular PMU. Do you have some ideas about how that could be
accomplished using the above encoding scheme? I can see how it would be fairly
easy to come up with a PERF_TYPE_* encoding per-topology, and then interpret all
of those bits correctly within the kernel (which is saavy to that topology), but
I don't see how there would be a straight-forward way to expose that structure
to perf. How would perf know which of those encodings apply to the current
system, how many PMUs there are of each type, etc.

That's why I'm leaning toward a /sys/devices-style pseudo fs at the moment. If
there's a simpler, better way, I'm open to it.

> But before we go there the perf core needs to be extended to deal with
> multiple hardware pmus, something which isn't too hard but we need to be
> careful not to bloat the normal code paths for these somewhat esoteric
> use cases.

Is this something you are looking into?

- Corey

2010-01-28 22:08:25

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On 1/28/2010 11:06 AM, Peter Zijlstra wrote:
> On Thu, 2010-01-28 at 10:00 -0800, Corey Ashford wrote:
>>
>> I don't quite get what you're saying here. Perhaps you are thinking
>> that all uncore units are associated with a particular cpu node, or a
>> set of cpu nodes? And that there's only one uncore unit per cpu (or set
>> of cpus) that needs to be addressed, i.e. no ambiguity?
>
> Well, I was initially thinking of the intel uncore thing which is memory
> controller, so node, level.
>
> But all system topology bound pmus can be done that way.
>
>> That is not going to be the case for all systems. We can have uncore
>> units that are associated with the entire system,
>
> Right, but that's simple too.
>
>> for example PMUs in an I/O device.
>
>> And we can have multiple uncore units of a particular
>> type, for example multiple vector coprocessors, each with its own PMU,
>> and are associated with a single cpu or a set of cpus.
>>
>> perf_events needs an addressing scheme that covers these cases.
>
> You could possibly add a u64 pmu_id field to perf_event_attr and use
> that together with things like:
>
> PERF_TYPE_PCI, attr.pmu_id = domain:bus:device:function encoding
> PERF_TYPE_SPU, attr.pmu_id = spu-id
>

Thank you for the clarification.

One of Ingo's comments in this thread was that he wants perf to be able to
display to the user the available PMUs along with their respective events. That
perf would parse some machine-independent data structure (somewhere) to get this
info. This same info would provide the user a method of specifying which PMU
he wants to address. He'd also like all of the event info data to reside in the
same place. I hope I am paraphrasing him correctly.

I can see that with the scheme you have proposed above, it would be
straightforward to encode PMU ids for a particular new PERF_TYPE_* system
topology, but I don't see a clear way of providing perf with enough information
to tell it which particular topology is being used, how many units of each PMU
type exist, and so on. Do you have any ideas as to how to accomplish this goal
with the method you are suggesting?

This is one of the reasons why I am leaning toward a /sys/devices-style data
structure; the kernel could easily build it based on the pmus that it discovers
(through whatever means), and the user can fairly easily choose a pmu from this
structure to open, and it's unambiguous to the kernel as to which pmu the user
really wants.

I am not convinced that this is the right place to put the event info for each PMU.

> But before we go there the perf core needs to be extended to deal with
> multiple hardware pmus, something which isn't too hard but we need to be
> careful not to bloat the normal code paths for these somewhat esoteric
> use cases.
>

Is this something you've looked into? If so, what sort of issues have you
discovered?

Thanks,

- Corey

2010-01-29 09:53:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote:

> This is one of the reasons why I am leaning toward a /sys/devices-style data
> structure; the kernel could easily build it based on the pmus that it discovers
> (through whatever means), and the user can fairly easily choose a pmu from this
> structure to open, and it's unambiguous to the kernel as to which pmu the user
> really wants.

Well, the dumb way is simply probing all of them and seeing who responds.
Another might be adding a pmu attribute (showing the pmu-id) to the
existing sysfs topology layouts (system topology, pci, spu, are all
already available in sysfs iirc).

> I am not convinced that this is the right place to put the event info for each PMU.

Right, I'm not at all sure the kernel wants to know about any events
beyond those needed for pmu scheduling constraints and possible generic
event maps.

Clearly it needs to know about all software events, but I don't think we
need nor want exhaustive hardware event lists in the kernel.

> > But before we go there the perf core needs to be extended to deal with
> > multiple hardware pmus, something which isn't too hard but we need to be
> > careful not to bloat the normal code paths for these somewhat esoteric
> > use cases.
> >
>
> Is this something you've looked into? If so, what sort of issues have you
> discovered?

I've poked at it a little, yes. While simply abstracting the current hw
interface and making it a list of pmus isn't hard at all, it does add
overhead to a few key locations.

Another aspect is event scheduling: you'd want to separate the event
lists for the various pmus so that the RR thing works as expected; this
again adds overhead because you now need to abstract out the event lists
as well.

The main fast path affected by both these things is the task switch
event scheduling where you have to iterate all active events and their
pmus.

So while the abstraction itself isn't too hard, doing it so as to
minimize the bloat on the key paths does make it interesting.


2010-01-29 23:05:52

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units



On 1/29/2010 1:52 AM, Peter Zijlstra wrote:
> On Thu, 2010-01-28 at 14:08 -0800, Corey Ashford wrote:
>
>> This is one of the reasons why I am leaning toward a /sys/devices-style data
>> structure; the kernel could easily build it based on the pmus that it discovers
>> (through whatever means), and the user can fairly easily choose a pmu from this
>> structure to open, and it's unambiguous to the kernel as to which pmu the user
>> really wants.
>
> Well, the dumb way is simply probing all of them and see who responds.

That can work, but it's still fuzzy to me how a user would relate a PMU address
that he's encoded to some actual device in the system he's using. How would he
know that he's addressing the correct device (besides that the PMU type
matches), given that we're likely to have hypervisors as middle-men.

> Another might be adding a pmu attribute (showing the pmu-id) to the
> existing sysfs topology layouts (system topology, pci, spu, are all
> already available in sysfs iirc).

So you'd read the id from the sysfs topology tree, and then pass that id to the
interface? That's an interesting approach that eliminates the need to pass a
string pmu path to the kernel.

I like this idea, but I need to read more deeply about the topology entries to
understand how they work.

>> I am not convinced that this is the right place to put the event info for each PMU.
>
> Right, I'm not at all sure the kernel wants to know about any events
> beyond those needed for pmu scheduling constraints and possible generic
> event maps.
>
> Clearly it needs to know about all software events, but I don't think we
> need nor want exhaustive hardware event lists in the kernel.
>
>> > But before we go there the perf core needs to be extended to deal with
>> > multiple hardware pmus, something which isn't too hard but we need to be
>> > careful not to bloat the normal code paths for these somewhat esoteric
>> > use cases.
>> >
>>
>> Is this something you've looked into? If so, what sort of issues have you
>> discovered?
>
> I've poked at it a little yes, while simply abstracting the current hw
> interface and making it a list of pmu's isn't hard at all, it does add
> overhead to a few key locations.
>
> Another aspect is event scheduling, you'd want to separate the event
> lists for the various pmus so that the RR thing works as expected, this
> again adds overhead because you now need to abstract out the event lists
> as well.
>
> The main fast path affected by both these things is the task switch
> event scheduling where you have to iterate all active events and their
> pmus.
>
> So while the abstraction itself isn't too hard, doing it so as to
> minimize the bloat on the key paths does make it interesting.
>

Interesting.

Thanks for your comments.

- Corey

2010-01-30 08:43:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote:
> So you'd read the id from the sysfs topology tree, and then pass that id to the
> interface? That's an interesting approach that eliminates the need to pass a
> string pmu path to the kernel.

No, the attr.pmu_id would reflect the location in the tree (pci
location, or spu number); the pmu id reported would identify the kind of
pmu driver used for that particular device.

I realized this confusion after sending but didn't clarify, we should
come up with a good alternative name for either (or both) uses.


2010-02-01 19:39:53

by Corey Ashford

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units



On 1/30/2010 12:42 AM, Peter Zijlstra wrote:
> On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote:
>> So you'd read the id from the sysfs topology tree, and then pass that id to the
>> interface? That's an interesting approach that eliminates the need to pass a
>> string pmu path to the kernel.
>
> No, the attr.pmu_id would reflect the location in the tree (pci
> location, or spu number), the pmu id reported would identify the kind of
> pmu driver used for that particular device.
>
> I realized this confusion after sending but didn't clarify, we should
> come up with a good alternative name for either (or both) uses.
>

Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded
bit field?

- Corey

2010-02-01 19:54:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Mon, 2010-02-01 at 11:39 -0800, Corey Ashford wrote:
>
> On 1/30/2010 12:42 AM, Peter Zijlstra wrote:
> > On Fri, 2010-01-29 at 15:05 -0800, Corey Ashford wrote:
> >> So you'd read the id from the sysfs topology tree, and then pass that id to the
> >> interface? That's an interesting approach that eliminates the need to pass a
> >> string pmu path to the kernel.
> >
> > No, the attr.pmu_id would reflect the location in the tree (pci
> > location, or spu number), the pmu id reported would identify the kind of
> > pmu driver used for that particular device.
> >
> > I realized this confusion after sending but didn't clarify, we should
> > come up with a good alternative name for either (or both) uses.
> >
>
> Ok, just so I'm clear here, is attr.pmu_id a (char *) or some sort of encoded
> bit field?

Right, currently on x86 we have x86_pmu.name which basically tells us
what kind of pmu it is, but we really don't export that since it's
trivial to see that from /proc/cpuinfo. The ARM people expressed
interest in this, though, because decoding the cpuid equivalent on arm is
nasty business.

But once we go add other funny pmu's, it becomes interesting to know
what kind of pmu it is and tie possible event sets to them.

2010-04-15 21:47:58

by Gary.Mohr

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units


> On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote:
>
> Right, I've got some definite ideas on how to go here, just need some
> time to implement them.
>
> The first thing that needs to be done is get rid of all the __weak
> functions (with exception of perf_callchain*, since that really is arch
> specific).
>
> For hw_perf_event_init() we need to create a pmu registration facility
> and lookup a pmu_id, either passed as an actual id found in sysfs or an
> open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards
> compat).
>
> hw_perf_disable/enable() would become struct pmu functions and
> perf_disable/enable need to become per-pmu, most functions operate on a
> specific event, for those we know the pmu and hence can call the per-pmu
> version. (XXX find those sites where this is not true).
>
> Then we can move to context, yes I think we want new context for new
> PMUs, otherwise we get very funny RR interleaving problems. My idea was
> to move find_get_context() into struct pmu as well, this allows you to
> have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts
> because then things like perf_event_task_sched_out() would get rather
> complex.
>
> For RR we can move away from perf_event_task_tick and let the pmu
> install a (hr)timer for this on their own.
>
> I've been planning to implement this for more than a week now, its just
> that other stuff keeps getting in the way.
>

Hi Peter,

My name is Gary Mohr and I work for Bull Information Systems. I have been
following your discussions with Corey (and others) about how to implement
support for nest PMU's in the linux kernel.

My company feels that support for Intel Nehalem uncore events is very
important to our customers. Has the "other stuff" mentioned above quieted
down enough to allow you to get started on building support for these
features? If development is actually in progress, would you be willing to
guess which version of the kernel may offer the new capabilities?

As I said, we are interested, so if there is any way we can assist you,
please let us know. We would be happy to take experimental patch sets and
validate, test, and debug any problems we encounter if that would help
your development.

Thanks for your time.
Gary

2010-04-16 13:23:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Thu, 2010-04-15 at 14:16 -0700, [email protected] wrote:
> > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote:
> >
> > Right, I've got some definite ideas on how to go here, just need some
> > time to implement them.
> >
> > The first thing that needs to be done is get rid of all the __weak
> > functions (with exception of perf_callchain*, since that really is arch
> > specific).
> >
> > For hw_perf_event_init() we need to create a pmu registration facility
> > and lookup a pmu_id, either passed as an actual id found in sysfs or an
> > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards
> > compat).
> >
> > hw_perf_disable/enable() would become struct pmu functions and
> > perf_disable/enable need to become per-pmu, most functions operate on a
> > specific event, for those we know the pmu and hence can call the per-pmu
> > version. (XXX find those sites where this is not true).
> >
> > Then we can move to context, yes I think we want new context for new
> > PMUs, otherwise we get very funny RR interleaving problems. My idea was
> > to move find_get_context() into struct pmu as well, this allows you to
> > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts
> > because then things like perf_event_task_sched_out() would get rather
> > complex.
> >
> > For RR we can move away from perf_event_task_tick and let the pmu
> > install a (hr)timer for this on their own.
> >
> > I've been planning to implement this for more than a week now, its just
> > that other stuff keeps getting in the way.
> >
>
> Hi Peter,
>
> My name is Gary Mohr and I work for Bull Information Systems. I have been
> following your discussions with Corey (and others) about how to implement
> support for nest PMU's in the linux kernel.
>
> My company feels that support for Intel Nehalem uncore events is very
> > important to our customers. Has the "other stuff" mentioned above quieted down to
> allow you to get started on building support for these features ??

Sadly no.

> If development
> is actually in progress, would you be willing to make a guess as to which
> version of the kernel may offer the new capabilities ??
>
> As I said we are interested so if there is any way we can assist you,
> please let us know. We would be happy to take experimental patch sets and
> validate, test, and debug any problems we encounter if that would help your
> development.

Supply patches to make the above happen ;-)

One thing not on that list, which should happen first I guess, is to
remove hw_perf_group_sched_in(). The idea is to add some sort of
transactional API to the struct pmu, so that we can delay the
schedulability check until commit time (and roll back when it fails).

Something as simple as:

struct pmu {
	void start_txn(struct pmu *);
	void commit_txn(struct pmu *);

	...
};

and then change group_sched_in() to use this instead of
hw_perf_group_sched_in(), whose implementations mostly replicate
group_sched_in() in various buggy ways anyway.

2010-04-19 09:07:22

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote:
> On Thu, 2010-04-15 at 14:16 -0700, [email protected] wrote:
> > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote:
> > >
> > > Right, I've got some definite ideas on how to go here, just need some
> > > time to implement them.
> > >
> > > The first thing that needs to be done is get rid of all the __weak
> > > functions (with exception of perf_callchain*, since that really is arch
> > > specific).
> > >
> > > For hw_perf_event_init() we need to create a pmu registration facility
> > > and lookup a pmu_id, either passed as an actual id found in sysfs or an
> > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards
> > > compat).
> > >
> > > hw_perf_disable/enable() would become struct pmu functions and
> > > perf_disable/enable need to become per-pmu, most functions operate on a
> > > specific event, for those we know the pmu and hence can call the per-pmu
> > > version. (XXX find those sites where this is not true).
> > >
> > > Then we can move to context, yes I think we want new context for new
> > > PMUs, otherwise we get very funny RR interleaving problems. My idea was
> > > to move find_get_context() into struct pmu as well, this allows you to
> > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts
> > > because then things like perf_event_task_sched_out() would get rather
> > > complex.
> > >
> > > For RR we can move away from perf_event_task_tick and let the pmu
> > > install a (hr)timer for this on their own.
> > >
> > > I've been planning to implement this for more than a week now, its just
> > > that other stuff keeps getting in the way.
> > >
> >
> > Hi Peter,
> >
> > My name is Gary Mohr and I work for Bull Information Systems. I have been
> > following your discussions with Corey (and others) about how to implement
> > support for nest PMU's in the linux kernel.
> >
> > My company feels that support for Intel Nehalem uncore events is very
> > important to our customers. Has the "other stuff" mentioned above quieted down to
> > allow you to get started on building support for these features ??
>
> Sadly no.
>
> > If development
> > is actually in progress, would you be willing to make a guess as to which
> > version of the kernel may offer the new capabilities ??
> >
> > As I said we are interested so if there is any way we can assist you,
> > please let us know. We would be happy to take experimental patch sets and
> > validate, test, and debug any problems we encounter if that would help your
> > development.
>
> Supply patches to make the above happen ;-)

Hi,

I have been also looking at this for some time.
I'll send a draft patch later this week to support multiple hw pmus.

Lin Ming

2010-04-19 09:28:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Mon, 2010-04-19 at 17:08 +0800, Lin Ming wrote:

> I have been also looking at this for some time.
> I'll send a draft patch later this week to support multiple hw pmus.

Well, that's going to be a mighty large patch.

I'd rather see smaller patches, say one patch per removal of a weak
hw_perf interface, etc...

2010-04-20 11:55:22

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Fri, 2010-04-16 at 21:24 +0800, Peter Zijlstra wrote:
> On Thu, 2010-04-15 at 14:16 -0700, [email protected] wrote:
> > > On Tue, 2010-03-30 at 09:49 -0700, Corey Ashford wrote:
> > >
> > > Right, I've got some definite ideas on how to go here, just need some
> > > time to implement them.
> > >
> > > The first thing that needs to be done is get rid of all the __weak
> > > functions (with exception of perf_callchain*, since that really is arch
> > > specific).
> > >
> > > For hw_perf_event_init() we need to create a pmu registration facility
> > > and lookup a pmu_id, either passed as an actual id found in sysfs or an
> > > open file handle from sysfs (the cpu pmu would be pmu_id 0 for backwards
> > > compat).
> > >
> > > hw_perf_disable/enable() would become struct pmu functions and
> > > perf_disable/enable need to become per-pmu, most functions operate on a
> > > specific event, for those we know the pmu and hence can call the per-pmu
> > > version. (XXX find those sites where this is not true).
> > >
> > > Then we can move to context, yes I think we want new context for new
> > > PMUs, otherwise we get very funny RR interleaving problems. My idea was
> > > to move find_get_context() into struct pmu as well, this allows you to
> > > have per-pmu contexts. Initially I'd not allow per-pmu-per-task contexts
> > > because then things like perf_event_task_sched_out() would get rather
> > > complex.
> > >
> > > For RR we can move away from perf_event_task_tick and let the pmu
> > > install a (hr)timer for this on their own.
> > >
> > > I've been planning to implement this for more than a week now, its just
> > > that other stuff keeps getting in the way.
> > >
> >
> > Hi Peter,
> >
> > My name is Gary Mohr and I work for Bull Information Systems. I have been
> > following your discussions with Corey (and others) about how to implement
> > support for nest PMU's in the linux kernel.
> >
> > My company feels that support for Intel Nehalem uncore events is very
> > important to our customers. Has the "other stuff" mentioned above quieted down to
> > allow you to get started on building support for these features ??
>
> Sadly no.
>
> > If development
> > is actually in progress, would you be willing to make a guess as to which
> > version of the kernel may offer the new capabilities ??
> >
> > As I said we are interested so if there is any way we can assist you,
> > please let us know. We would be happy to take experimental patch sets and
> > validate, test, and debug any problems we encounter if that would help your
> > development.
>
> Supply patches to make the above happen ;-)
>
> One thing not on that list, which should happen first I guess, is to
> remove hw_perf_group_sched_in(). The idea is to add some sort of
> transactional API to the struct pmu, so that we can delay the
> schedulability check until commit time (and roll back when it fails).
>
> Something as simple as:
>
> struct pmu {
> void start_txn(struct pmu *);
> void commit_txn(struct pmu *);
>
> ,,,
> };

Could you please explain a bit more?

Does it mean that "start_txn" performs the event scheduling
and "commit_txn" performs the event assignment?

Does "commit time" mean the actual activation in hw_perf_enable?

Thanks,
Lin Ming

>
> and then change group_sched_in() to use this instead of
> hw_perf_group_sched_in(), whose implementations mostly replicate
> group_sched_in() in various buggy ways anyway.
>

2010-04-20 12:03:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote:

> > One thing not on that list, which should happen first I guess, is to
> > remove hw_perf_group_sched_in(). The idea is to add some sort of
> > transactional API to the struct pmu, so that we can delay the
> > schedulability check until commit time (and roll back when it fails).
> >
> > Something as simple as:
> >
> > struct pmu {
> >         void start_txn(struct pmu *);
> >         void commit_txn(struct pmu *);
> >
> >         ...
> > };
>
> Could you please explain a bit more?
>
> Does it mean that "start_txn" performs the schedule-events step
> and "commit_txn" performs the assign-events step?
>
> Does "commit time" mean the actual activation in hw_perf_enable?

No, the idea behind hw_perf_group_sched_in() is to not perform
schedulability tests on each event in the group, but to add the group as
a whole and then perform one test.

Of course, when that test fails, you'll have to roll back the whole
group again.

So start_txn (or a better name) would simply toggle a flag in the pmu
implementation that will make pmu::enable() not perform the
schedulability test.

Then commit_txn() will perform the schedulability test (so note the
method has to have a !void return value, my mistake in the earlier
email).

This will allow us to use the regular
kernel/perf_event.c::group_sched_in() and all the rollback code.
Currently each hw_perf_group_sched_in() implementation duplicates all
the rollback code (with various bugs).



We must get rid of all weak hw_perf_*() functions before we can properly
consider multiple struct pmu implementations.
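
[Editorial note: the transactional pattern described above can be modeled in a few lines. This is a sketch in plain user-space C, not kernel code: `model_pmu`, the capacity check standing in for the real schedulability test, and every name in it are illustrative.]

```c
/*
 * Editorial sketch of the transactional add/commit/rollback pattern.
 * The "schedulability test" is reduced to a simple capacity check.
 */
#define MODEL_NUM_COUNTERS 2

struct model_pmu {
	int txn_active;		/* set between start_txn and commit/rollback */
	int n_events;		/* events collected so far */
};

static void start_txn(struct model_pmu *pmu)
{
	/* pmu_enable() will skip the per-event schedulability test */
	pmu->txn_active = 1;
}

/* Add one event; test schedulability only outside a transaction. */
static int pmu_enable(struct model_pmu *pmu)
{
	if (!pmu->txn_active && pmu->n_events >= MODEL_NUM_COUNTERS)
		return -1;	/* event would not fit */
	pmu->n_events++;
	return 0;
}

/* One schedulability test for the whole group; note the non-void return. */
static int commit_txn(struct model_pmu *pmu)
{
	pmu->txn_active = 0;
	return pmu->n_events > MODEL_NUM_COUNTERS ? -1 : 0;
}

/* On commit failure the caller undoes everything added since start_txn. */
static void rollback_txn(struct model_pmu *pmu, int n_added)
{
	pmu->n_events -= n_added;
	pmu->txn_active = 0;
}
```

[In these terms, group_sched_in() would bracket its existing per-event loop with start_txn()/commit_txn() and fall back to its normal rollback path when the commit fails.]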

2010-04-21 08:08:01

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote:
> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote:
>
> > > One thing not on that list, which should happen first I guess, is to
> > > remove hw_perf_group_sched_in(). The idea is to add some sort of
> > > transactional API to the struct pmu, so that we can delay the
> > > schedulability check until commit time (and roll back when it fails).
> > >
> > > Something as simple as:
> > >
> > > struct pmu {
> > >         void start_txn(struct pmu *);
> > >         void commit_txn(struct pmu *);
> > >
> > >         ...
> > > };
> >
> > Could you please explain a bit more?
> >
> > Does it mean that "start_txn" performs the schedule-events step
> > and "commit_txn" performs the assign-events step?
> >
> > Does "commit time" mean the actual activation in hw_perf_enable?
>
> No, the idea behind hw_perf_group_sched_in() is to not perform
> schedulability tests on each event in the group, but to add the group as
> a whole and then perform one test.
>
> Of course, when that test fails, you'll have to roll-back the whole
> group again.
>
> So start_txn (or a better name) would simply toggle a flag in the pmu
> implementation that will make pmu::enable() not perform the
> schedulability test.
>
> Then commit_txn() will perform the schedulability test (so note the
> method has to have a !void return value, my mistake in the earlier
> email).
>
> This will allow us to use the regular
> kernel/perf_event.c::group_sched_in() and all the rollback code.
> Currently each hw_perf_group_sched_in() implementation duplicates all
> the rollback code (with various bugs).
>
>
>
> We must get rid of all weak hw_perf_*() functions before we can properly
> consider multiple struct pmu implementations.
>

Thanks for the clear explanation.

Does the patch below show what you mean?

I've only touched the x86 arch code for now.

---
 arch/x86/kernel/cpu/perf_event.c |  161 +++++++++++--------------------------
 include/linux/perf_event.h       |   10 ++-
 kernel/perf_event.c              |   28 +++----
 3 files changed, 67 insertions(+), 132 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 626154a..62aa9a1 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event)
       if (n < 0)
               return n;

+       if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED))
+               goto out;
+
       ret = x86_pmu.schedule_events(cpuc, n, assign);
       if (ret)
               return ret;
@@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event)
        */
       memcpy(cpuc->assign, assign, n*sizeof(int));

+out:
       cpuc->n_events = n;
       cpuc->n_added += n - n0;

@@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
       return &unconstrained;
 }

-static int x86_event_sched_in(struct perf_event *event,
-                         struct perf_cpu_context *cpuctx)
-{
-       int ret = 0;
-
-       event->state = PERF_EVENT_STATE_ACTIVE;
-       event->oncpu = smp_processor_id();
-       event->tstamp_running += event->ctx->time - event->tstamp_stopped;
-
-       if (!is_x86_event(event))
-               ret = event->pmu->enable(event);
-
-       if (!ret && !is_software_event(event))
-               cpuctx->active_oncpu++;
-
-       if (!ret && event->attr.exclusive)
-               cpuctx->exclusive = 1;
-
-       return ret;
-}
-
-static void x86_event_sched_out(struct perf_event *event,
-                           struct perf_cpu_context *cpuctx)
-{
-       event->state = PERF_EVENT_STATE_INACTIVE;
-       event->oncpu = -1;
-
-       if (!is_x86_event(event))
-               event->pmu->disable(event);
-
-       event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
-
-       if (!is_software_event(event))
-               cpuctx->active_oncpu--;
-
-       if (event->attr.exclusive || !cpuctx->active_oncpu)
-               cpuctx->exclusive = 0;
-}
-
-/*
- * Called to enable a whole group of events.
- * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
- * Assumes the caller has disabled interrupts and has
- * frozen the PMU with hw_perf_save_disable.
- *
- * called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-int hw_perf_group_sched_in(struct perf_event *leader,
-              struct perf_cpu_context *cpuctx,
-              struct perf_event_context *ctx)
-{
-       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
-       struct perf_event *sub;
-       int assign[X86_PMC_IDX_MAX];
-       int n0, n1, ret;
-
-       if (!x86_pmu_initialized())
-               return 0;
-
-       /* n0 = total number of events */
-       n0 = collect_events(cpuc, leader, true);
-       if (n0 < 0)
-               return n0;
-
-       ret = x86_pmu.schedule_events(cpuc, n0, assign);
-       if (ret)
-               return ret;
-
-       ret = x86_event_sched_in(leader, cpuctx);
-       if (ret)
-               return ret;
-
-       n1 = 1;
-       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-               if (sub->state > PERF_EVENT_STATE_OFF) {
-                       ret = x86_event_sched_in(sub, cpuctx);
-                       if (ret)
-                               goto undo;
-                       ++n1;
-               }
-       }
-       /*
-        * copy new assignment, now we know it is possible
-        * will be used by hw_perf_enable()
-        */
-       memcpy(cpuc->assign, assign, n0*sizeof(int));
-
-       cpuc->n_events  = n0;
-       cpuc->n_added  += n1;
-       ctx->nr_active += n1;
-
-       /*
-        * 1 means successful and events are active
-        * This is not quite true because we defer
-        * actual activation until hw_perf_enable() but
-        * this way we* ensure caller won't try to enable
-        * individual events
-        */
-       return 1;
-undo:
-       x86_event_sched_out(leader, cpuctx);
-       n0  = 1;
-       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
-                       x86_event_sched_out(sub, cpuctx);
-                       if (++n0 == n1)
-                               break;
-               }
-       }
-       return ret;
-}
-
 #include "perf_event_amd.c"
 #include "perf_event_p6.c"
 #include "perf_event_p4.c"
@@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event)
       x86_perf_event_update(event);
 }

+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(struct pmu *pmu)
+{
+       pmu->flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(struct pmu *pmu)
+{
+       pmu->flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if the commit transaction succeeds
+ */
+static int x86_pmu_commit_txn(struct pmu *pmu)
+{
+       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+       int assign[X86_PMC_IDX_MAX];
+       int n, ret;
+
+       n = cpuc->n_events;
+
+       if (!x86_pmu_initialized())
+               return -EAGAIN;
+
+       ret = x86_pmu.schedule_events(cpuc, n, assign);
+       if (ret)
+               return ret;
+
+       /*
+        * copy new assignment, now we know it is possible
+        * will be used by hw_perf_enable()
+        */
+       memcpy(cpuc->assign, assign, n*sizeof(int));
+
+       return 0;
+}
+
 static const struct pmu pmu = {
       .enable         = x86_pmu_enable,
       .disable        = x86_pmu_disable,
@@ -1461,6 +1393,9 @@ static const struct pmu pmu = {
       .stop           = x86_pmu_stop,
       .read           = x86_pmu_read,
       .unthrottle     = x86_pmu_unthrottle,
+       .start_txn      = x86_pmu_start_txn,
+       .stop_txn       = x86_pmu_stop_txn,
+       .commit_txn     = x86_pmu_commit_txn,
 };

 /*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..93aa8d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {

 struct perf_event;

+#define PERF_EVENT_TRAN_STARTED 1
+
 /**
  * struct pmu - generic performance monitoring unit
  */
@@ -534,6 +536,11 @@ struct pmu {
       void (*stop)                    (struct perf_event *event);
       void (*read)                    (struct perf_event *event);
       void (*unthrottle)              (struct perf_event *event);
+       void (*start_txn)               (struct pmu *pmu);
+       void (*stop_txn)                (struct pmu *pmu);
+       int (*commit_txn)               (struct pmu *pmu);
+
+       u8 flag;
 };

 /**
@@ -799,9 +806,6 @@ extern void perf_disable(void);
 extern void perf_enable(void);
 extern int perf_event_task_disable(void);
 extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
-              struct perf_cpu_context *cpuctx,
-              struct perf_event_context *ctx);
 extern void perf_event_update_userpage(struct perf_event *event);
 extern int perf_event_release_kernel(struct perf_event *event);
 extern struct perf_event *
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..4537676 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
 void __weak hw_perf_disable(void)              { barrier(); }
 void __weak hw_perf_enable(void)               { barrier(); }

-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
-              struct perf_cpu_context *cpuctx,
-              struct perf_event_context *ctx)
-{
-       return 0;
-}
-
 void __weak perf_event_print_debug(void)       { }

 static DEFINE_PER_CPU(int, perf_disable_count);
@@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event,
              struct perf_event_context *ctx)
 {
       struct perf_event *event, *partial_group;
+       struct pmu *pmu = (struct pmu *)group_event->pmu;
       int ret;

       if (group_event->state == PERF_EVENT_STATE_OFF)
               return 0;

-       ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
-       if (ret)
-               return ret < 0 ? ret : 0;
+       pmu->start_txn(pmu);

       if (event_sched_in(group_event, cpuctx, ctx))
               return -EAGAIN;
@@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event,
               }
       }

-       return 0;
+       ret = pmu->commit_txn(pmu);
+       if (!ret) {
+               pmu->stop_txn(pmu);
+               return 0;
+       }

 group_error:
+       pmu->stop_txn(pmu);
+
       /*
-        * Groups can be scheduled in as one unit only, so undo any
-        * partial group before returning:
+        * Commit transaction fails, rollback
+        * Groups can be scheduled in as one unit only, so undo
+        * whole group before returning:
        */
       list_for_each_entry(event, &group_event->sibling_list, group_entry) {
-               if (event == partial_group)
-                       break;
               event_sched_out(event, cpuctx, ctx);
       }
       event_sched_out(group_event, cpuctx, ctx);

2010-04-21 08:33:00

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

Seems to me that struct pmu is a shared resource across all CPUs.
I don't understand why scheduling on one CPU would have to impact
all the other CPUs, unless I am missing something here.


On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <[email protected]> wrote:
> [prior discussion and patch quoted in full -- snipped]

2010-04-21 08:39:34

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote:
> Seems to me that struct pmu is a shared resource across all CPUs.
> I don't understand why scheduling on one CPU would have to impact
> all the other CPUs, unless I am missing something here.

Do you mean the pmu->flag?

You are right, pmu->flag should be per-CPU data.

Will update the patch.

Thanks,
Lin Ming
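
[Editorial note: the reason the flag must move is that one struct pmu instance is shared by every CPU, so a transaction started on one CPU would suppress the schedulability test on all of them. Below is a sketch of the per-CPU alternative in plain user-space C; the fixed-size array, NR_MODEL_CPUS, and all names are illustrative stand-ins for the kernel's per-CPU machinery, not the actual interfaces.]

```c
/*
 * Editorial sketch: keeping the transaction flag in per-CPU state
 * instead of in the shared struct pmu.  The array models per-CPU
 * data; every name here is illustrative.
 */
#define NR_MODEL_CPUS		4
#define PERF_EVENT_TXN_STARTED	1

struct model_cpu_hw_events {
	unsigned int group_flag;	/* transaction state, private to one CPU */
};

static struct model_cpu_hw_events model_cpu_hw_events[NR_MODEL_CPUS];

static void model_start_txn(int cpu)
{
	model_cpu_hw_events[cpu].group_flag |= PERF_EVENT_TXN_STARTED;
}

static void model_cancel_txn(int cpu)
{
	model_cpu_hw_events[cpu].group_flag &= ~PERF_EVENT_TXN_STARTED;
}

static int model_txn_active(int cpu)
{
	return model_cpu_hw_events[cpu].group_flag & PERF_EVENT_TXN_STARTED;
}
```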

>
>
> On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <[email protected]> wrote:
> > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote:
> >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote:
> >>
> >> > > One thing not on that list, which should happen first I guess, is to
> >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of
> >> > > transactional API to the struct pmu, so that we can delay the
> >> > > schedulability check until commit time (and roll back when it fails).
> >> > >
> >> > > Something as simple as:
> >> > >
> >> > > struct pmu {
> >> > >         void start_txn(struct pmu *);
> >> > >         void commit_txn(struct pmu *);
> >> > >
> >> > >         ...
> >> > > };
> >> >
> >> > Could you please explain a bit more?
> >> >
> >> > Does it mean that "start_txn" performs the schedule-events step
> >> > and "commit_txn" performs the assign-events step?
> >> >
> >> > Does "commit time" mean the actual activation in hw_perf_enable?
> >>
> >> No, the idea behind hw_perf_group_sched_in() is to not perform
> >> schedulability tests on each event in the group, but to add the group as
> >> a whole and then perform one test.
> >>
> >> Of course, when that test fails, you'll have to roll-back the whole
> >> group again.
> >>
> >> So start_txn (or a better name) would simply toggle a flag in the pmu
> >> implementation that will make pmu::enable() not perform the
> >> schedulability test.
> >>
> >> Then commit_txn() will perform the schedulability test (so note the
> >> method has to have a !void return value, my mistake in the earlier
> >> email).
> >>
> >> This will allow us to use the regular
> >> kernel/perf_event.c::group_sched_in() and all the rollback code.
> >> Currently each hw_perf_group_sched_in() implementation duplicates all
> >> the rollback code (with various bugs).
> >>
> >>
> >>
> >> We must get rid of all weak hw_perf_*() functions before we can properly
> >> consider multiple struct pmu implementations.
> >>
> >
> > Thanks for the clear explanation.
> >
> > Does below patch show what you mean?
> >
> > I only touch the x86 arch code now.
> >
> > ---
> > arch/x86/kernel/cpu/perf_event.c | 161 +++++++++++--------------------------
> > include/linux/perf_event.h | 10 ++-
> > kernel/perf_event.c | 28 +++----
> > 3 files changed, 67 insertions(+), 132 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> > index 626154a..62aa9a1 100644
> > --- a/arch/x86/kernel/cpu/perf_event.c
> > +++ b/arch/x86/kernel/cpu/perf_event.c
> > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event)
> > if (n < 0)
> > return n;
> >
> > + if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED))
> > + goto out;
> > +
> > ret = x86_pmu.schedule_events(cpuc, n, assign);
> > if (ret)
> > return ret;
> > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event)
> > */
> > memcpy(cpuc->assign, assign, n*sizeof(int));
> >
> > +out:
> > cpuc->n_events = n;
> > cpuc->n_added += n - n0;
> >
> > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
> > return &unconstrained;
> > }
> >
> > -static int x86_event_sched_in(struct perf_event *event,
> > - struct perf_cpu_context *cpuctx)
> > -{
> > - int ret = 0;
> > -
> > - event->state = PERF_EVENT_STATE_ACTIVE;
> > - event->oncpu = smp_processor_id();
> > - event->tstamp_running += event->ctx->time - event->tstamp_stopped;
> > -
> > - if (!is_x86_event(event))
> > - ret = event->pmu->enable(event);
> > -
> > - if (!ret && !is_software_event(event))
> > - cpuctx->active_oncpu++;
> > -
> > - if (!ret && event->attr.exclusive)
> > - cpuctx->exclusive = 1;
> > -
> > - return ret;
> > -}
> > -
> > -static void x86_event_sched_out(struct perf_event *event,
> > - struct perf_cpu_context *cpuctx)
> > -{
> > - event->state = PERF_EVENT_STATE_INACTIVE;
> > - event->oncpu = -1;
> > -
> > - if (!is_x86_event(event))
> > - event->pmu->disable(event);
> > -
> > - event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
> > -
> > - if (!is_software_event(event))
> > - cpuctx->active_oncpu--;
> > -
> > - if (event->attr.exclusive || !cpuctx->active_oncpu)
> > - cpuctx->exclusive = 0;
> > -}
> > -
> > -/*
> > - * Called to enable a whole group of events.
> > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
> > - * Assumes the caller has disabled interrupts and has
> > - * frozen the PMU with hw_perf_save_disable.
> > - *
> > - * called with PMU disabled. If successful and return value 1,
> > - * then guaranteed to call perf_enable() and hw_perf_enable()
> > - */
> > -int hw_perf_group_sched_in(struct perf_event *leader,
> > - struct perf_cpu_context *cpuctx,
> > - struct perf_event_context *ctx)
> > -{
> > - struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > - struct perf_event *sub;
> > - int assign[X86_PMC_IDX_MAX];
> > - int n0, n1, ret;
> > -
> > - if (!x86_pmu_initialized())
> > - return 0;
> > -
> > - /* n0 = total number of events */
> > - n0 = collect_events(cpuc, leader, true);
> > - if (n0 < 0)
> > - return n0;
> > -
> > - ret = x86_pmu.schedule_events(cpuc, n0, assign);
> > - if (ret)
> > - return ret;
> > -
> > - ret = x86_event_sched_in(leader, cpuctx);
> > - if (ret)
> > - return ret;
> > -
> > - n1 = 1;
> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) {
> > - if (sub->state > PERF_EVENT_STATE_OFF) {
> > - ret = x86_event_sched_in(sub, cpuctx);
> > - if (ret)
> > - goto undo;
> > - ++n1;
> > - }
> > - }
> > - /*
> > - * copy new assignment, now we know it is possible
> > - * will be used by hw_perf_enable()
> > - */
> > - memcpy(cpuc->assign, assign, n0*sizeof(int));
> > -
> > - cpuc->n_events = n0;
> > - cpuc->n_added += n1;
> > - ctx->nr_active += n1;
> > -
> > - /*
> > - * 1 means successful and events are active
> > - * This is not quite true because we defer
> > - * actual activation until hw_perf_enable() but
> > - * this way we* ensure caller won't try to enable
> > - * individual events
> > - */
> > - return 1;
> > -undo:
> > - x86_event_sched_out(leader, cpuctx);
> > - n0 = 1;
> > - list_for_each_entry(sub, &leader->sibling_list, group_entry) {
> > - if (sub->state == PERF_EVENT_STATE_ACTIVE) {
> > - x86_event_sched_out(sub, cpuctx);
> > - if (++n0 == n1)
> > - break;
> > - }
> > - }
> > - return ret;
> > -}
> > -
> > #include "perf_event_amd.c"
> > #include "perf_event_p6.c"
> > #include "perf_event_p4.c"
> > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event)
> > x86_perf_event_update(event);
> > }
> >
> > +/*
> > + * Set the flag to make pmu::enable() not perform the
> > + * schedulablilty test.
> > + */
> > +static void x86_pmu_start_txn(struct pmu *pmu)
> > +{
> > + pmu->flag |= PERF_EVENT_TRAN_STARTED;
> > +}
> > +
> > +static void x86_pmu_stop_txn(struct pmu *pmu)
> > +{
> > + pmu->flag &= ~PERF_EVENT_TRAN_STARTED;
> > +}
> > +
> > +/*
> > + * Return 0 if commit transaction success
> > + */
> > +static int x86_pmu_commit_txn(struct pmu *pmu)
> > +{
> > + struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
> > + int assign[X86_PMC_IDX_MAX];
> > + int n, ret;
> > +
> > + n = cpuc->n_events;
> > +
> > + if (!x86_pmu_initialized())
> > + return -EAGAIN;
> > +
> > + ret = x86_pmu.schedule_events(cpuc, n, assign);
> > + if (ret)
> > + return ret;
> > +
> > + /*
> > + * copy new assignment, now we know it is possible
> > + * will be used by hw_perf_enable()
> > + */
> > + memcpy(cpuc->assign, assign, n*sizeof(int));
> > +
> > + return 0;
> > +}
> > +
> > static const struct pmu pmu = {
> > .enable = x86_pmu_enable,
> > .disable = x86_pmu_disable,
> > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = {
> > .stop = x86_pmu_stop,
> > .read = x86_pmu_read,
> > .unthrottle = x86_pmu_unthrottle,
> > + .start_txn = x86_pmu_start_txn,
> > + .stop_txn = x86_pmu_stop_txn,
> > + .commit_txn = x86_pmu_commit_txn,
> > };
> >
> > /*
> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> > index bf896d0..93aa8d8 100644
> > --- a/include/linux/perf_event.h
> > +++ b/include/linux/perf_event.h
> > @@ -524,6 +524,8 @@ struct hw_perf_event {
> >
> > struct perf_event;
> >
> > +#define PERF_EVENT_TRAN_STARTED 1
> > +
> > /**
> > * struct pmu - generic performance monitoring unit
> > */
> > @@ -534,6 +536,11 @@ struct pmu {
> > void (*stop) (struct perf_event *event);
> > void (*read) (struct perf_event *event);
> > void (*unthrottle) (struct perf_event *event);
> > + void (*start_txn) (struct pmu *pmu);
> > + void (*stop_txn) (struct pmu *pmu);
> > + int (*commit_txn) (struct pmu *pmu);
> > +
> > + u8 flag;
> > };
> >
> > /**
> > @@ -799,9 +806,6 @@ extern void perf_disable(void);
> > extern void perf_enable(void);
> > extern int perf_event_task_disable(void);
> > extern int perf_event_task_enable(void);
> > -extern int hw_perf_group_sched_in(struct perf_event *group_leader,
> > - struct perf_cpu_context *cpuctx,
> > - struct perf_event_context *ctx);
> > extern void perf_event_update_userpage(struct perf_event *event);
> > extern int perf_event_release_kernel(struct perf_event *event);
> > extern struct perf_event *
> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> > index 07b7a43..4537676 100644
> > --- a/kernel/perf_event.c
> > +++ b/kernel/perf_event.c
> > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
> > void __weak hw_perf_disable(void) { barrier(); }
> > void __weak hw_perf_enable(void) { barrier(); }
> >
> > -int __weak
> > -hw_perf_group_sched_in(struct perf_event *group_leader,
> > - struct perf_cpu_context *cpuctx,
> > - struct perf_event_context *ctx)
> > -{
> > - return 0;
> > -}
> > -
> > void __weak perf_event_print_debug(void) { }
> >
> > static DEFINE_PER_CPU(int, perf_disable_count);
> > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event,
> > struct perf_event_context *ctx)
> > {
> > struct perf_event *event, *partial_group;
> > + struct pmu *pmu = (struct pmu *)group_event->pmu;
> > int ret;
> >
> > if (group_event->state == PERF_EVENT_STATE_OFF)
> > return 0;
> >
> > - ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
> > - if (ret)
> > - return ret < 0 ? ret : 0;
> > + pmu->start_txn(pmu);
> >
> > if (event_sched_in(group_event, cpuctx, ctx))
> > return -EAGAIN;
> > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event,
> > }
> > }
> >
> > - return 0;
> > + ret = pmu->commit_txn(pmu);
> > + if (!ret) {
> > + pmu->stop_txn(pmu);
> > + return 0;
> > + }
> >
> > group_error:
> > + pmu->stop_txn(pmu);
> > +
> > /*
> > - * Groups can be scheduled in as one unit only, so undo any
> > - * partial group before returning:
> > + * Commit transaction failed, roll back:
> > + * Groups can be scheduled in as one unit only, so undo the
> > + * whole group before returning:
> > */
> > list_for_each_entry(event, &group_event->sibling_list, group_entry) {
> > - if (event == partial_group)
> > - break;
> > event_sched_out(event, cpuctx, ctx);
> > }
> > event_sched_out(group_event, cpuctx, ctx);
> >
> >
> >

2010-04-21 08:45:03

by Stephane Eranian

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <[email protected]> wrote:
> On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote:
>> Seems to me that struct pmu is a shared resource across all CPUs.
>> I don't understand why scheduling on one CPU would have to impact
>> all the other CPUs, unless I am missing something here.
>
> Do you mean the pmu->flag?

Yes.

> You are right, pmu->flag should be per cpu data.
>
> Will update the patch.
>
> Thanks,
> Lin Ming
>
>>
>>
>> On Wed, Apr 21, 2010 at 10:08 AM, Lin Ming <[email protected]> wrote:
>> > On Tue, 2010-04-20 at 20:03 +0800, Peter Zijlstra wrote:
>> >> On Tue, 2010-04-20 at 19:55 +0800, Lin Ming wrote:
>> >>
>> >> > > One thing not on that list, which should happen first I guess, is to
>> >> > > remove hw_perf_group_sched_in(). The idea is to add some sort of
>> >> > > transactional API to the struct pmu, so that we can delay the
>> >> > > schedulability check until commit time (and roll back when it fails).
>> >> > >
>> >> > > Something as simple as:
>> >> > >
>> >> > >   struct pmu {
>> >> > >     void start_txn(struct pmu *);
>> >> > >     void commit_txn(struct pmu *);
>> >> > >
>> >> > >     ,,,
>> >> > >   };
>> >> >
>> >> > Could you please explain a bit more?
>> >> >
>> >> > Does it mean that "start_txn" perform the schedule events stuff
>> >> > and "commit_txn" perform the assign events stuff?
>> >> >
>> >> > Does "commit time" mean the actual activation in hw_perf_enable?
>> >>
>> >> No, the idea behind hw_perf_group_sched_in() is to not perform
>> >> schedulability tests on each event in the group, but to add the group as
>> >> a whole and then perform one test.
>> >>
>> >> Of course, when that test fails, you'll have to roll-back the whole
>> >> group again.
>> >>
>> >> So start_txn (or a better name) would simply toggle a flag in the pmu
>> >> implementation that will make pmu::enable() not perform the
>> >> schedulability test.
>> >>
>> >> Then commit_txn() will perform the schedulability test (so note the
>> >> method has to have a !void return value, my mistake in the earlier
>> >> email).
>> >>
>> >> This will allow us to use the regular
>> >> kernel/perf_event.c::group_sched_in() and all the rollback code.
>> >> Currently each hw_perf_group_sched_in() implementation duplicates all
>> >> the rollback code (with various bugs).
>> >>
>> >>
>> >>
>> >> We must get rid of all weak hw_perf_*() functions before we can properly
>> >> consider multiple struct pmu implementations.
>> >>
>> >
>> > Thanks for the clear explanation.
>> >
>> > Does below patch show what you mean?
>> >
>> > I only touch the x86 arch code now.
>> >
>> > ---
>> >  arch/x86/kernel/cpu/perf_event.c |  161 +++++++++++--------------------------
>> >  include/linux/perf_event.h       |   10 ++-
>> >  kernel/perf_event.c              |   28 +++----
>> >  3 files changed, 67 insertions(+), 132 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
>> > index 626154a..62aa9a1 100644
>> > --- a/arch/x86/kernel/cpu/perf_event.c
>> > +++ b/arch/x86/kernel/cpu/perf_event.c
>> > @@ -944,6 +944,9 @@ static int x86_pmu_enable(struct perf_event *event)
>> >        if (n < 0)
>> >                return n;
>> >
>> > +       if (!(event->pmu->flag & PERF_EVENT_TRAN_STARTED))
>> > +               goto out;
>> > +
>> >        ret = x86_pmu.schedule_events(cpuc, n, assign);
>> >        if (ret)
>> >                return ret;
>> > @@ -953,6 +956,7 @@ static int x86_pmu_enable(struct perf_event *event)
>> >         */
>> >        memcpy(cpuc->assign, assign, n*sizeof(int));
>> >
>> > +out:
>> >        cpuc->n_events = n;
>> >        cpuc->n_added += n - n0;
>> >
>> > @@ -1210,119 +1214,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
>> >        return &unconstrained;
>> >  }
>> >
>> > -static int x86_event_sched_in(struct perf_event *event,
>> > -                         struct perf_cpu_context *cpuctx)
>> > -{
>> > -       int ret = 0;
>> > -
>> > -       event->state = PERF_EVENT_STATE_ACTIVE;
>> > -       event->oncpu = smp_processor_id();
>> > -       event->tstamp_running += event->ctx->time - event->tstamp_stopped;
>> > -
>> > -       if (!is_x86_event(event))
>> > -               ret = event->pmu->enable(event);
>> > -
>> > -       if (!ret && !is_software_event(event))
>> > -               cpuctx->active_oncpu++;
>> > -
>> > -       if (!ret && event->attr.exclusive)
>> > -               cpuctx->exclusive = 1;
>> > -
>> > -       return ret;
>> > -}
>> > -
>> > -static void x86_event_sched_out(struct perf_event *event,
>> > -                           struct perf_cpu_context *cpuctx)
>> > -{
>> > -       event->state = PERF_EVENT_STATE_INACTIVE;
>> > -       event->oncpu = -1;
>> > -
>> > -       if (!is_x86_event(event))
>> > -               event->pmu->disable(event);
>> > -
>> > -       event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
>> > -
>> > -       if (!is_software_event(event))
>> > -               cpuctx->active_oncpu--;
>> > -
>> > -       if (event->attr.exclusive || !cpuctx->active_oncpu)
>> > -               cpuctx->exclusive = 0;
>> > -}
>> > -
>> > -/*
>> > - * Called to enable a whole group of events.
>> > - * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
>> > - * Assumes the caller has disabled interrupts and has
>> > - * frozen the PMU with hw_perf_save_disable.
>> > - *
>> > - * called with PMU disabled. If successful and return value 1,
>> > - * then guaranteed to call perf_enable() and hw_perf_enable()
>> > - */
>> > -int hw_perf_group_sched_in(struct perf_event *leader,
>> > -              struct perf_cpu_context *cpuctx,
>> > -              struct perf_event_context *ctx)
>> > -{
>> > -       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>> > -       struct perf_event *sub;
>> > -       int assign[X86_PMC_IDX_MAX];
>> > -       int n0, n1, ret;
>> > -
>> > -       if (!x86_pmu_initialized())
>> > -               return 0;
>> > -
>> > -       /* n0 = total number of events */
>> > -       n0 = collect_events(cpuc, leader, true);
>> > -       if (n0 < 0)
>> > -               return n0;
>> > -
>> > -       ret = x86_pmu.schedule_events(cpuc, n0, assign);
>> > -       if (ret)
>> > -               return ret;
>> > -
>> > -       ret = x86_event_sched_in(leader, cpuctx);
>> > -       if (ret)
>> > -               return ret;
>> > -
>> > -       n1 = 1;
>> > -       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>> > -               if (sub->state > PERF_EVENT_STATE_OFF) {
>> > -                       ret = x86_event_sched_in(sub, cpuctx);
>> > -                       if (ret)
>> > -                               goto undo;
>> > -                       ++n1;
>> > -               }
>> > -       }
>> > -       /*
>> > -        * copy new assignment, now we know it is possible
>> > -        * will be used by hw_perf_enable()
>> > -        */
>> > -       memcpy(cpuc->assign, assign, n0*sizeof(int));
>> > -
>> > -       cpuc->n_events  = n0;
>> > -       cpuc->n_added  += n1;
>> > -       ctx->nr_active += n1;
>> > -
>> > -       /*
>> > -        * 1 means successful and events are active
>> > -        * This is not quite true because we defer
>> > -        * actual activation until hw_perf_enable() but
>> > -        * this way we* ensure caller won't try to enable
>> > -        * individual events
>> > -        */
>> > -       return 1;
>> > -undo:
>> > -       x86_event_sched_out(leader, cpuctx);
>> > -       n0  = 1;
>> > -       list_for_each_entry(sub, &leader->sibling_list, group_entry) {
>> > -               if (sub->state == PERF_EVENT_STATE_ACTIVE) {
>> > -                       x86_event_sched_out(sub, cpuctx);
>> > -                       if (++n0 == n1)
>> > -                               break;
>> > -               }
>> > -       }
>> > -       return ret;
>> > -}
>> > -
>> >  #include "perf_event_amd.c"
>> >  #include "perf_event_p6.c"
>> >  #include "perf_event_p4.c"
>> > @@ -1454,6 +1345,47 @@ static inline void x86_pmu_read(struct perf_event *event)
>> >        x86_perf_event_update(event);
>> >  }
>> >
>> > +/*
>> > + * Set the flag to make pmu::enable() not perform the
>> > + * schedulability test.
>> > + */
>> > +static void x86_pmu_start_txn(struct pmu *pmu)
>> > +{
>> > +       pmu->flag |= PERF_EVENT_TRAN_STARTED;
>> > +}
>> > +
>> > +static void x86_pmu_stop_txn(struct pmu *pmu)
>> > +{
>> > +       pmu->flag &= ~PERF_EVENT_TRAN_STARTED;
>> > +}
>> > +
>> > +/*
>> > + * Return 0 if the transaction commit succeeds
>> > + */
>> > +static int x86_pmu_commit_txn(struct pmu *pmu)
>> > +{
>> > +       struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
>> > +       int assign[X86_PMC_IDX_MAX];
>> > +       int n, ret;
>> > +
>> > +       n = cpuc->n_events;
>> > +
>> > +       if (!x86_pmu_initialized())
>> > +               return -EAGAIN;
>> > +
>> > +       ret = x86_pmu.schedule_events(cpuc, n, assign);
>> > +       if (ret)
>> > +               return ret;
>> > +
>> > +       /*
>> > +        * copy new assignment, now we know it is possible
>> > +        * will be used by hw_perf_enable()
>> > +        */
>> > +       memcpy(cpuc->assign, assign, n*sizeof(int));
>> > +
>> > +       return 0;
>> > +}
>> > +
>> >  static const struct pmu pmu = {
>> >        .enable         = x86_pmu_enable,
>> >        .disable        = x86_pmu_disable,
>> > @@ -1461,6 +1393,9 @@ static const struct pmu pmu = {
>> >        .stop           = x86_pmu_stop,
>> >        .read           = x86_pmu_read,
>> >        .unthrottle     = x86_pmu_unthrottle,
>> > +       .start_txn      = x86_pmu_start_txn,
>> > +       .stop_txn       = x86_pmu_stop_txn,
>> > +       .commit_txn     = x86_pmu_commit_txn,
>> >  };
>> >
>> >  /*
>> > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> > index bf896d0..93aa8d8 100644
>> > --- a/include/linux/perf_event.h
>> > +++ b/include/linux/perf_event.h
>> > @@ -524,6 +524,8 @@ struct hw_perf_event {
>> >
>> >  struct perf_event;
>> >
>> > +#define PERF_EVENT_TRAN_STARTED 1
>> > +
>> >  /**
>> >  * struct pmu - generic performance monitoring unit
>> >  */
>> > @@ -534,6 +536,11 @@ struct pmu {
>> >        void (*stop)                    (struct perf_event *event);
>> >        void (*read)                    (struct perf_event *event);
>> >        void (*unthrottle)              (struct perf_event *event);
>> > +       void (*start_txn)               (struct pmu *pmu);
>> > +       void (*stop_txn)                (struct pmu *pmu);
>> > +       int (*commit_txn)               (struct pmu *pmu);
>> > +
>> > +       u8 flag;
>> >  };
>> >
>> >  /**
>> > @@ -799,9 +806,6 @@ extern void perf_disable(void);
>> >  extern void perf_enable(void);
>> >  extern int perf_event_task_disable(void);
>> >  extern int perf_event_task_enable(void);
>> > -extern int hw_perf_group_sched_in(struct perf_event *group_leader,
>> > -              struct perf_cpu_context *cpuctx,
>> > -              struct perf_event_context *ctx);
>> >  extern void perf_event_update_userpage(struct perf_event *event);
>> >  extern int perf_event_release_kernel(struct perf_event *event);
>> >  extern struct perf_event *
>> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
>> > index 07b7a43..4537676 100644
>> > --- a/kernel/perf_event.c
>> > +++ b/kernel/perf_event.c
>> > @@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
>> >  void __weak hw_perf_disable(void)              { barrier(); }
>> >  void __weak hw_perf_enable(void)               { barrier(); }
>> >
>> > -int __weak
>> > -hw_perf_group_sched_in(struct perf_event *group_leader,
>> > -              struct perf_cpu_context *cpuctx,
>> > -              struct perf_event_context *ctx)
>> > -{
>> > -       return 0;
>> > -}
>> > -
>> >  void __weak perf_event_print_debug(void)       { }
>> >
>> >  static DEFINE_PER_CPU(int, perf_disable_count);
>> > @@ -642,14 +634,13 @@ group_sched_in(struct perf_event *group_event,
>> >               struct perf_event_context *ctx)
>> >  {
>> >        struct perf_event *event, *partial_group;
>> > +       struct pmu *pmu = (struct pmu *)group_event->pmu;
>> >        int ret;
>> >
>> >        if (group_event->state == PERF_EVENT_STATE_OFF)
>> >                return 0;
>> >
>> > -       ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
>> > -       if (ret)
>> > -               return ret < 0 ? ret : 0;
>> > +       pmu->start_txn(pmu);
>> >
>> >        if (event_sched_in(group_event, cpuctx, ctx))
>> >                return -EAGAIN;
>> > @@ -664,16 +655,21 @@ group_sched_in(struct perf_event *group_event,
>> >                }
>> >        }
>> >
>> > -       return 0;
>> > +       ret = pmu->commit_txn(pmu);
>> > +       if (!ret) {
>> > +               pmu->stop_txn(pmu);
>> > +               return 0;
>> > +       }
>> >
>> >  group_error:
>> > +       pmu->stop_txn(pmu);
>> > +
>> >        /*
>> > -        * Groups can be scheduled in as one unit only, so undo any
>> > -        * partial group before returning:
>> > +        * Commit transaction failed, roll back:
>> > +        * Groups can be scheduled in as one unit only, so undo the
>> > +        * whole group before returning:
>> >         */
>> >        list_for_each_entry(event, &group_event->sibling_list, group_entry) {
>> > -               if (event == partial_group)
>> > -                       break;
>> >                event_sched_out(event, cpuctx, ctx);
>> >        }
>> >        event_sched_out(group_event, cpuctx, ctx);
>> >
>> >
>> >
>
>

2010-04-21 09:42:37

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 16:44 +0800, stephane eranian wrote:
> On Wed, Apr 21, 2010 at 10:39 AM, Lin Ming <[email protected]> wrote:
> > On Wed, 2010-04-21 at 16:32 +0800, stephane eranian wrote:
> >> Seems to me that struct pmu is a shared resource across all CPUs.
> >> I don't understand why scheduling on one CPU would have to impact
> >> all the other CPUs, unless I am missing something here.
> >
> > Do you mean the pmu->flag?
>
> Yes.

Thanks for the review.

Changelog:

v2.
pmu->flag should be per cpu (Stephane Eranian)

change definition of "const struct pmu" to "struct pmu" since it's not
read only now.

v1.
remove hw_perf_group_sched_in() based on Peter's idea.

---
arch/x86/kernel/cpu/perf_event.c | 183 ++++++++++++++------------------------
include/linux/perf_event.h | 16 +++-
kernel/perf_event.c | 55 ++++++------
3 files changed, 106 insertions(+), 148 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 626154a..1113fd5 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -598,7 +598,7 @@ static void x86_pmu_enable_all(int added)
}
}

-static const struct pmu pmu;
+static struct pmu pmu;

static inline int is_x86_event(struct perf_event *event)
{
@@ -935,6 +935,7 @@ static int x86_pmu_enable(struct perf_event *event)
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct hw_perf_event *hwc;
int assign[X86_PMC_IDX_MAX];
+ u8 *flag;
int n, n0, ret;

hwc = &event->hw;
@@ -944,6 +945,10 @@ static int x86_pmu_enable(struct perf_event *event)
if (n < 0)
return n;

+ flag = perf_pmu_flag((struct pmu *)event->pmu);
+ if (!(*flag & PERF_EVENT_TRAN_STARTED))
+ goto out;
+
ret = x86_pmu.schedule_events(cpuc, n, assign);
if (ret)
return ret;
@@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event)
*/
memcpy(cpuc->assign, assign, n*sizeof(int));

+out:
cpuc->n_events = n;
cpuc->n_added += n - n0;

@@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
return &unconstrained;
}

-static int x86_event_sched_in(struct perf_event *event,
- struct perf_cpu_context *cpuctx)
-{
- int ret = 0;
-
- event->state = PERF_EVENT_STATE_ACTIVE;
- event->oncpu = smp_processor_id();
- event->tstamp_running += event->ctx->time - event->tstamp_stopped;
-
- if (!is_x86_event(event))
- ret = event->pmu->enable(event);
-
- if (!ret && !is_software_event(event))
- cpuctx->active_oncpu++;
-
- if (!ret && event->attr.exclusive)
- cpuctx->exclusive = 1;
-
- return ret;
-}
-
-static void x86_event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx)
-{
- event->state = PERF_EVENT_STATE_INACTIVE;
- event->oncpu = -1;
-
- if (!is_x86_event(event))
- event->pmu->disable(event);
-
- event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
-
- if (!is_software_event(event))
- cpuctx->active_oncpu--;
-
- if (event->attr.exclusive || !cpuctx->active_oncpu)
- cpuctx->exclusive = 0;
-}
-
-/*
- * Called to enable a whole group of events.
- * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
- * Assumes the caller has disabled interrupts and has
- * frozen the PMU with hw_perf_save_disable.
- *
- * called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-int hw_perf_group_sched_in(struct perf_event *leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct perf_event *sub;
- int assign[X86_PMC_IDX_MAX];
- int n0, n1, ret;
-
- if (!x86_pmu_initialized())
- return 0;
-
- /* n0 = total number of events */
- n0 = collect_events(cpuc, leader, true);
- if (n0 < 0)
- return n0;
-
- ret = x86_pmu.schedule_events(cpuc, n0, assign);
- if (ret)
- return ret;
-
- ret = x86_event_sched_in(leader, cpuctx);
- if (ret)
- return ret;
-
- n1 = 1;
- list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state > PERF_EVENT_STATE_OFF) {
- ret = x86_event_sched_in(sub, cpuctx);
- if (ret)
- goto undo;
- ++n1;
- }
- }
- /*
- * copy new assignment, now we know it is possible
- * will be used by hw_perf_enable()
- */
- memcpy(cpuc->assign, assign, n0*sizeof(int));
-
- cpuc->n_events = n0;
- cpuc->n_added += n1;
- ctx->nr_active += n1;
-
- /*
- * 1 means successful and events are active
- * This is not quite true because we defer
- * actual activation until hw_perf_enable() but
- * this way we* ensure caller won't try to enable
- * individual events
- */
- return 1;
-undo:
- x86_event_sched_out(leader, cpuctx);
- n0 = 1;
- list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state == PERF_EVENT_STATE_ACTIVE) {
- x86_event_sched_out(sub, cpuctx);
- if (++n0 == n1)
- break;
- }
- }
- return ret;
-}
-
#include "perf_event_amd.c"
#include "perf_event_p6.c"
#include "perf_event_p4.c"
@@ -1397,6 +1290,14 @@ void __init init_hw_perf_events(void)
return;
}

+ /*
+ * TBD: will move this to pmu register function
+ * when multiple hw pmu support is added
+ */
+ pmu.flag = alloc_percpu(u8);
+ if (!pmu.flag)
+ return;
+
pmu_check_apic();

pr_cont("%s PMU driver.\n", x86_pmu.name);
@@ -1454,13 +1355,61 @@ static inline void x86_pmu_read(struct perf_event *event)
x86_perf_event_update(event);
}

-static const struct pmu pmu = {
+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(struct pmu *pmu)
+{
+ u8 *flag = perf_pmu_flag(pmu);
+
+ *flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(struct pmu *pmu)
+{
+ u8 *flag = perf_pmu_flag(pmu);
+
+ *flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if the transaction commit succeeds
+ */
+static int x86_pmu_commit_txn(struct pmu *pmu)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ int assign[X86_PMC_IDX_MAX];
+ int n, ret;
+
+ n = cpuc->n_events;
+
+ if (!x86_pmu_initialized())
+ return -EAGAIN;
+
+ ret = x86_pmu.schedule_events(cpuc, n, assign);
+ if (ret)
+ return ret;
+
+ /*
+ * copy new assignment, now we know it is possible
+ * will be used by hw_perf_enable()
+ */
+ memcpy(cpuc->assign, assign, n*sizeof(int));
+
+ return 0;
+}
+
+static struct pmu pmu = {
.enable = x86_pmu_enable,
.disable = x86_pmu_disable,
.start = x86_pmu_start,
.stop = x86_pmu_stop,
.read = x86_pmu_read,
.unthrottle = x86_pmu_unthrottle,
+ .start_txn = x86_pmu_start_txn,
+ .stop_txn = x86_pmu_stop_txn,
+ .commit_txn = x86_pmu_commit_txn,
};

/*
@@ -1537,9 +1486,9 @@ out:
return ret;
}

-const struct pmu *hw_perf_event_init(struct perf_event *event)
+struct pmu *hw_perf_event_init(struct perf_event *event)
{
- const struct pmu *tmp;
+ struct pmu *tmp;
int err;

err = __hw_perf_event_init(event);
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..9fa3f46 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {

struct perf_event;

+#define PERF_EVENT_TRAN_STARTED 1
+
/**
* struct pmu - generic performance monitoring unit
*/
@@ -534,6 +536,12 @@ struct pmu {
void (*stop) (struct perf_event *event);
void (*read) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
+ void (*start_txn) (struct pmu *pmu);
+ void (*stop_txn) (struct pmu *pmu);
+ int (*commit_txn) (struct pmu *pmu);
+
+ /* percpu flag */
+ u8 *flag;
};

/**
@@ -610,7 +618,7 @@ struct perf_event {
int group_flags;
struct perf_event *group_leader;
struct perf_event *output;
- const struct pmu *pmu;
+ struct pmu *pmu;

enum perf_event_active_state state;
atomic64_t count;
@@ -782,7 +790,7 @@ struct perf_output_handle {
*/
extern int perf_max_events;

-extern const struct pmu *hw_perf_event_init(struct perf_event *event);
+extern struct pmu *hw_perf_event_init(struct perf_event *event);

extern void perf_event_task_sched_in(struct task_struct *task);
extern void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next);
@@ -799,9 +807,6 @@ extern void perf_disable(void);
extern void perf_enable(void);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx);
extern void perf_event_update_userpage(struct perf_event *event);
extern int perf_event_release_kernel(struct perf_event *event);
extern struct perf_event *
@@ -811,6 +816,7 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
perf_overflow_handler_t callback);
extern u64 perf_event_read_value(struct perf_event *event,
u64 *enabled, u64 *running);
+extern u8 *perf_pmu_flag(struct pmu *pmu);

struct perf_sample_data {
u64 type;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..4820090 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -75,7 +75,7 @@ static DEFINE_SPINLOCK(perf_resource_lock);
/*
* Architecture provided APIs - weak aliases:
*/
-extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
+extern __weak struct pmu *hw_perf_event_init(struct perf_event *event)
{
return NULL;
}
@@ -83,15 +83,14 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
void __weak hw_perf_disable(void) { barrier(); }
void __weak hw_perf_enable(void) { barrier(); }

-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
+void __weak perf_event_print_debug(void) { }
+
+u8 *perf_pmu_flag(struct pmu *pmu)
{
- return 0;
-}
+ int cpu = smp_processor_id();

-void __weak perf_event_print_debug(void) { }
+ return per_cpu_ptr(pmu->flag, cpu);
+}

static DEFINE_PER_CPU(int, perf_disable_count);

@@ -642,14 +641,13 @@ group_sched_in(struct perf_event *group_event,
struct perf_event_context *ctx)
{
struct perf_event *event, *partial_group;
+ struct pmu *pmu = group_event->pmu;
int ret;

if (group_event->state == PERF_EVENT_STATE_OFF)
return 0;

- ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
- if (ret)
- return ret < 0 ? ret : 0;
+ pmu->start_txn(pmu);

if (event_sched_in(group_event, cpuctx, ctx))
return -EAGAIN;
@@ -664,16 +662,21 @@ group_sched_in(struct perf_event *group_event,
}
}

- return 0;
+ ret = pmu->commit_txn(pmu);
+ if (!ret) {
+ pmu->stop_txn(pmu);
+ return 0;
+ }

group_error:
+ pmu->stop_txn(pmu);
+
/*
- * Groups can be scheduled in as one unit only, so undo any
- * partial group before returning:
+ * Commit transaction failed, roll back:
+ * Groups can be scheduled in as one unit only, so undo the
+ * whole group before returning:
*/
list_for_each_entry(event, &group_event->sibling_list, group_entry) {
- if (event == partial_group)
- break;
event_sched_out(event, cpuctx, ctx);
}
event_sched_out(group_event, cpuctx, ctx);
@@ -4139,7 +4142,7 @@ static void perf_swevent_disable(struct perf_event *event)
hlist_del_rcu(&event->hlist_entry);
}

-static const struct pmu perf_ops_generic = {
+static struct pmu perf_ops_generic = {
.enable = perf_swevent_enable,
.disable = perf_swevent_disable,
.read = perf_swevent_read,
@@ -4250,7 +4253,7 @@ static void cpu_clock_perf_event_read(struct perf_event *event)
cpu_clock_perf_event_update(event);
}

-static const struct pmu perf_ops_cpu_clock = {
+static struct pmu perf_ops_cpu_clock = {
.enable = cpu_clock_perf_event_enable,
.disable = cpu_clock_perf_event_disable,
.read = cpu_clock_perf_event_read,
@@ -4307,7 +4310,7 @@ static void task_clock_perf_event_read(struct perf_event *event)
task_clock_perf_event_update(event, time);
}

-static const struct pmu perf_ops_task_clock = {
+static struct pmu perf_ops_task_clock = {
.enable = task_clock_perf_event_enable,
.disable = task_clock_perf_event_disable,
.read = task_clock_perf_event_read,
@@ -4448,7 +4451,7 @@ static void tp_perf_event_destroy(struct perf_event *event)
swevent_hlist_put(event);
}

-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
{
int err;

@@ -4505,7 +4508,7 @@ static int perf_tp_event_match(struct perf_event *event,
return 1;
}

-static const struct pmu *tp_perf_event_init(struct perf_event *event)
+static struct pmu *tp_perf_event_init(struct perf_event *event)
{
return NULL;
}
@@ -4527,7 +4530,7 @@ static void bp_perf_event_destroy(struct perf_event *event)
release_bp_slot(event);
}

-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
{
int err;

@@ -4551,7 +4554,7 @@ void perf_bp_event(struct perf_event *bp, void *data)
perf_swevent_add(bp, 1, 1, &sample, regs);
}
#else
-static const struct pmu *bp_perf_event_init(struct perf_event *bp)
+static struct pmu *bp_perf_event_init(struct perf_event *bp)
{
return NULL;
}
@@ -4573,9 +4576,9 @@ static void sw_perf_event_destroy(struct perf_event *event)
swevent_hlist_put(event);
}

-static const struct pmu *sw_perf_event_init(struct perf_event *event)
+static struct pmu *sw_perf_event_init(struct perf_event *event)
{
- const struct pmu *pmu = NULL;
+ struct pmu *pmu = NULL;
u64 event_id = event->attr.config;

/*
@@ -4637,7 +4640,7 @@ perf_event_alloc(struct perf_event_attr *attr,
perf_overflow_handler_t overflow_handler,
gfp_t gfpflags)
{
- const struct pmu *pmu;
+ struct pmu *pmu;
struct perf_event *event;
struct hw_perf_event *hwc;
long err;


>
> > You are right, pmu->flag should be per cpu data.
> >
> > Will update the patch.
> >
> > Thanks,
> > Lin Ming
> >
> >>

2010-04-21 09:57:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> + /*
> + * TBD: will move this to pmu register function
> + * when multiple hw pmu support is added
> + */
> + pmu.flag = alloc_percpu(u8);
> + if (!pmu.flag)
> + return;

Why not simply use a field in struct cpu_hw_events?

That's where we track all per-cpu pmu state.

2010-04-21 14:13:12

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 17:57 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 17:42 +0800, Lin Ming wrote:
> > + /*
> > + * TBD: will move this to pmu register function
> > + * when multiple hw pmu support is added
> > + */
> > + pmu.flag = alloc_percpu(u8);
> > + if (!pmu.flag)
> > + return;
>
> Why not simply use a field in struct cpu_hw_events?
>
> That's where we track all per-cpu pmu state.

That's a good idea!

Thanks for the review.

Changelog:

v3.
track per-cpu pmu transaction state in cpu_hw_events (Peter Zijlstra)
move pmu definition back to const ("struct pmu" -> "const struct pmu")

v2.
pmu->flag should be per cpu (Stephane Eranian)

change definition of "const struct pmu" to "struct pmu" since it's not
read only now.

v1.
remove hw_perf_group_sched_in() based on Peter's idea.

---
arch/x86/kernel/cpu/perf_event.c | 167 ++++++++++++-------------------------
include/linux/perf_event.h | 8 +-
kernel/perf_event.c | 23 +++---
3 files changed, 69 insertions(+), 129 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 626154a..4a1dc62 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -110,6 +110,8 @@ struct cpu_hw_events {
u64 tags[X86_PMC_IDX_MAX];
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */

+ u8 flag;
+
/*
* Intel DebugStore bits
*/
@@ -944,6 +946,9 @@ static int x86_pmu_enable(struct perf_event *event)
if (n < 0)
return n;

+ if (cpuc->flag & PERF_EVENT_TRAN_STARTED)
+ goto out;
+
ret = x86_pmu.schedule_events(cpuc, n, assign);
if (ret)
return ret;
@@ -953,6 +958,7 @@ static int x86_pmu_enable(struct perf_event *event)
*/
memcpy(cpuc->assign, assign, n*sizeof(int));

+out:
cpuc->n_events = n;
cpuc->n_added += n - n0;

@@ -1210,119 +1216,6 @@ x86_get_event_constraints(struct cpu_hw_events *cpuc, struct perf_event *event)
return &unconstrained;
}

-static int x86_event_sched_in(struct perf_event *event,
- struct perf_cpu_context *cpuctx)
-{
- int ret = 0;
-
- event->state = PERF_EVENT_STATE_ACTIVE;
- event->oncpu = smp_processor_id();
- event->tstamp_running += event->ctx->time - event->tstamp_stopped;
-
- if (!is_x86_event(event))
- ret = event->pmu->enable(event);
-
- if (!ret && !is_software_event(event))
- cpuctx->active_oncpu++;
-
- if (!ret && event->attr.exclusive)
- cpuctx->exclusive = 1;
-
- return ret;
-}
-
-static void x86_event_sched_out(struct perf_event *event,
- struct perf_cpu_context *cpuctx)
-{
- event->state = PERF_EVENT_STATE_INACTIVE;
- event->oncpu = -1;
-
- if (!is_x86_event(event))
- event->pmu->disable(event);
-
- event->tstamp_running -= event->ctx->time - event->tstamp_stopped;
-
- if (!is_software_event(event))
- cpuctx->active_oncpu--;
-
- if (event->attr.exclusive || !cpuctx->active_oncpu)
- cpuctx->exclusive = 0;
-}
-
-/*
- * Called to enable a whole group of events.
- * Returns 1 if the group was enabled, or -EAGAIN if it could not be.
- * Assumes the caller has disabled interrupts and has
- * frozen the PMU with hw_perf_save_disable.
- *
- * called with PMU disabled. If successful and return value 1,
- * then guaranteed to call perf_enable() and hw_perf_enable()
- */
-int hw_perf_group_sched_in(struct perf_event *leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
-{
- struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
- struct perf_event *sub;
- int assign[X86_PMC_IDX_MAX];
- int n0, n1, ret;
-
- if (!x86_pmu_initialized())
- return 0;
-
- /* n0 = total number of events */
- n0 = collect_events(cpuc, leader, true);
- if (n0 < 0)
- return n0;
-
- ret = x86_pmu.schedule_events(cpuc, n0, assign);
- if (ret)
- return ret;
-
- ret = x86_event_sched_in(leader, cpuctx);
- if (ret)
- return ret;
-
- n1 = 1;
- list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state > PERF_EVENT_STATE_OFF) {
- ret = x86_event_sched_in(sub, cpuctx);
- if (ret)
- goto undo;
- ++n1;
- }
- }
- /*
- * copy new assignment, now we know it is possible
- * will be used by hw_perf_enable()
- */
- memcpy(cpuc->assign, assign, n0*sizeof(int));
-
- cpuc->n_events = n0;
- cpuc->n_added += n1;
- ctx->nr_active += n1;
-
- /*
- * 1 means successful and events are active
- * This is not quite true because we defer
- * actual activation until hw_perf_enable() but
- * this way we* ensure caller won't try to enable
- * individual events
- */
- return 1;
-undo:
- x86_event_sched_out(leader, cpuctx);
- n0 = 1;
- list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state == PERF_EVENT_STATE_ACTIVE) {
- x86_event_sched_out(sub, cpuctx);
- if (++n0 == n1)
- break;
- }
- }
- return ret;
-}
-
#include "perf_event_amd.c"
#include "perf_event_p6.c"
#include "perf_event_p4.c"
@@ -1454,6 +1347,51 @@ static inline void x86_pmu_read(struct perf_event *event)
x86_perf_event_update(event);
}

+/*
+ * Set the flag to make pmu::enable() not perform the
+ * schedulability test.
+ */
+static void x86_pmu_start_txn(const struct pmu *pmu)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ cpuc->flag |= PERF_EVENT_TRAN_STARTED;
+}
+
+static void x86_pmu_stop_txn(const struct pmu *pmu)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ cpuc->flag &= ~PERF_EVENT_TRAN_STARTED;
+}
+
+/*
+ * Return 0 if the transaction was committed successfully
+ */
+static int x86_pmu_commit_txn(const struct pmu *pmu)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+ int assign[X86_PMC_IDX_MAX];
+ int n, ret;
+
+ n = cpuc->n_events;
+
+ if (!x86_pmu_initialized())
+ return -EAGAIN;
+
+ ret = x86_pmu.schedule_events(cpuc, n, assign);
+ if (ret)
+ return ret;
+
+ /*
+ * copy new assignment, now we know it is possible
+ * will be used by hw_perf_enable()
+ */
+ memcpy(cpuc->assign, assign, n*sizeof(int));
+
+ return 0;
+}
+
static const struct pmu pmu = {
.enable = x86_pmu_enable,
.disable = x86_pmu_disable,
@@ -1461,6 +1399,9 @@ static const struct pmu pmu = {
.stop = x86_pmu_stop,
.read = x86_pmu_read,
.unthrottle = x86_pmu_unthrottle,
+ .start_txn = x86_pmu_start_txn,
+ .stop_txn = x86_pmu_stop_txn,
+ .commit_txn = x86_pmu_commit_txn,
};

/*
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index bf896d0..862b965 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -524,6 +524,8 @@ struct hw_perf_event {

struct perf_event;

+#define PERF_EVENT_TRAN_STARTED 1
+
/**
* struct pmu - generic performance monitoring unit
*/
@@ -534,6 +536,9 @@ struct pmu {
void (*stop) (struct perf_event *event);
void (*read) (struct perf_event *event);
void (*unthrottle) (struct perf_event *event);
+ void (*start_txn) (const struct pmu *pmu);
+ void (*stop_txn) (const struct pmu *pmu);
+ int (*commit_txn) (const struct pmu *pmu);
};

/**
@@ -799,9 +804,6 @@ extern void perf_disable(void);
extern void perf_enable(void);
extern int perf_event_task_disable(void);
extern int perf_event_task_enable(void);
-extern int hw_perf_group_sched_in(struct perf_event *group_leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx);
extern void perf_event_update_userpage(struct perf_event *event);
extern int perf_event_release_kernel(struct perf_event *event);
extern struct perf_event *
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 07b7a43..1503174 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -83,14 +83,6 @@ extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
void __weak hw_perf_disable(void) { barrier(); }
void __weak hw_perf_enable(void) { barrier(); }

-int __weak
-hw_perf_group_sched_in(struct perf_event *group_leader,
- struct perf_cpu_context *cpuctx,
- struct perf_event_context *ctx)
-{
- return 0;
-}
-
void __weak perf_event_print_debug(void) { }

static DEFINE_PER_CPU(int, perf_disable_count);
@@ -641,15 +633,14 @@ group_sched_in(struct perf_event *group_event,
struct perf_cpu_context *cpuctx,
struct perf_event_context *ctx)
{
- struct perf_event *event, *partial_group;
+ struct perf_event *event, *partial_group = NULL;
+ const struct pmu *pmu = group_event->pmu;
int ret;

if (group_event->state == PERF_EVENT_STATE_OFF)
return 0;

- ret = hw_perf_group_sched_in(group_event, cpuctx, ctx);
- if (ret)
- return ret < 0 ? ret : 0;
+ pmu->start_txn(pmu);

if (event_sched_in(group_event, cpuctx, ctx))
return -EAGAIN;
@@ -664,9 +655,15 @@ group_sched_in(struct perf_event *group_event,
}
}

- return 0;
+ ret = pmu->commit_txn(pmu);
+ if (!ret) {
+ pmu->stop_txn(pmu);
+ return 0;
+ }

group_error:
+ pmu->stop_txn(pmu);
+
/*
* Groups can be scheduled in as one unit only, so undo any
* partial group before returning:
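Condensed from the patch above, the whole transaction protocol can be modelled as a small userspace toy (a hypothetical simplification: the schedulability test is just a counter limit, and the per-cpu state is a single global). With a transaction open, enable() only collects events; the all-or-nothing scheduling test is deferred to commit_txn().

```c
#include <assert.h>
#include <string.h>

#define TRAN_STARTED	0x1
#define NR_COUNTERS	2

struct cpu_hw_events {
	unsigned flag;			/* transaction state */
	int n_events;			/* events collected so far */
	int assign[NR_COUNTERS];
};

static struct cpu_hw_events cpuc;	/* stands in for the per-cpu data */

/* pmu->enable(): with a transaction open, just collect the event and
 * defer the schedulability test to commit_txn(). */
static int pmu_enable(void)
{
	cpuc.n_events++;
	if (cpuc.flag & TRAN_STARTED)
		return 0;			/* test deferred */
	return cpuc.n_events <= NR_COUNTERS ? 0 : -1;
}

static void pmu_start_txn(void) { cpuc.flag |= TRAN_STARTED; }
static void pmu_stop_txn(void)  { cpuc.flag &= ~TRAN_STARTED; }

/* Schedule all collected events at once; 0 on success. */
static int pmu_commit_txn(void)
{
	int i;

	if (cpuc.n_events > NR_COUNTERS)
		return -1;			/* -EAGAIN in the real code */
	for (i = 0; i < cpuc.n_events; i++)
		cpuc.assign[i] = i;		/* fake counter assignment */
	return 0;
}

/* group_sched_in() in miniature: all-or-nothing group scheduling. */
static int group_sched_in(int group_size)
{
	int i, ret;

	memset(&cpuc, 0, sizeof(cpuc));
	pmu_start_txn();
	for (i = 0; i < group_size; i++)
		if (pmu_enable())
			goto rollback;
	ret = pmu_commit_txn();
	pmu_stop_txn();
	if (!ret)
		return 0;
rollback:
	pmu_stop_txn();
	cpuc.n_events = 0;			/* undo the partial group */
	return -1;
}
```

A group of two events fits the two fake counters and commits; a group of three fails at commit_txn() and is rolled back, with the transaction flag cleared on both paths.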

2010-04-21 14:22:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> + ret = pmu->commit_txn(pmu);
> + if (!ret) {
> + pmu->stop_txn(pmu);
> + return 0;
> + }
>
> group_error:
> + pmu->stop_txn(pmu);

If you let commit_txn() also clear the state you can save some logic and
a method.

But yes, this looks good. If you don't remove the weak interface just
yet, you can do a patch per architecture that uses this (at least
powerpc and sparc do), and remove the weak thing at the end once all
users are gone.
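A sketch of the variant Peter is suggesting (hypothetical, not the posted patch): commit_txn() clears the transaction flag itself on success, so the success path needs no separate stop_txn() call; a cancel-style call remains only for the rollback path.

```c
#include <assert.h>

#define TRAN_STARTED	0x1

static unsigned flag;
static int schedulable;		/* fake result of the scheduling test */

static void start_txn(void)  { flag |= TRAN_STARTED; }
static void cancel_txn(void) { flag &= ~TRAN_STARTED; }

static int commit_txn(void)
{
	if (!schedulable)
		return -1;		/* caller must cancel_txn() to roll back */
	flag &= ~TRAN_STARTED;		/* success clears the state in one place */
	return 0;
}
```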


2010-04-21 14:38:36

by Lin Ming

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 22:22 +0800, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 22:12 +0000, Lin Ming wrote:
> > + ret = pmu->commit_txn(pmu);
> > + if (!ret) {
> > + pmu->stop_txn(pmu);
> > + return 0;
> > + }
> >
> > group_error:
> > + pmu->stop_txn(pmu);
>
> If you let commit_txn() also clear the state you can save some logic and
> a method.

I added ->stop_txn(pmu) because the rollback code also needs to clear
the state.

>
> But yes, this looks good. If you don't remove the weak interface just
> yet, you can do a patch per architecture that uses this (at least
> powerpc and sparc do), and remove the weak thing at the end once all
> users are gone.
>

OK, I'll do that for powerpc and sparc.

And I need to check whether other arches use transaction-like methods in
group_sched_in; if so, the calls could be made conditional, as below.

group_sched_in(...)
{
	if (pmu->start_txn)
		pmu->start_txn(pmu);

	if (pmu->commit_txn)
		pmu->commit_txn(pmu);
}
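The NULL-checked dispatch sketched above can be made concrete in a small userspace model (hypothetical names): arches that don't implement the transaction hooks simply leave the pointers NULL, and group scheduling falls back to per-event behaviour.

```c
#include <assert.h>
#include <stddef.h>

struct pmu {
	void (*start_txn)(struct pmu *pmu);
	int  (*commit_txn)(struct pmu *pmu);
};

static int txn_calls;	/* counts how often the hooks fire */

static void x86_start_txn(struct pmu *pmu)  { (void)pmu; txn_calls++; }
static int  x86_commit_txn(struct pmu *pmu) { (void)pmu; txn_calls++; return 0; }

static int group_sched_in(struct pmu *pmu)
{
	if (pmu->start_txn)
		pmu->start_txn(pmu);

	/* ... schedule the group members ... */

	if (pmu->commit_txn)
		return pmu->commit_txn(pmu);
	return 0;		/* no txn support: nothing to commit */
}
```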

Thanks,
Lin Ming

2010-04-21 14:54:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] perf_events: support for uncore a.k.a. nest units

On Wed, 2010-04-21 at 22:38 +0000, Lin Ming wrote:
>
> I added ->stop_txn(pmu) because the rollback code also needs to clear
> the state.

Ah, yes, because strictly speaking ->enable() could still fail for
another reason. OK.