2009-06-16 17:42:47

by Stephane Eranian

Subject: v2 of comments on Performance Counters for Linux (PCL)

Hi,

Here is an updated version of my comments on PCL. Compared to the
previous version, I have removed all the issues that were fixed or
clarified. I have kept all the issues and open questions which I think
are not solved yet, and I added a few more.


I/ General API comments

 1/ System calls

     * ioctl()

       You have defined 5 ioctls() so far to operate on an existing event.
       I was under the impression that ioctl() should not be used except for
       drivers.

       How do you justify your usage of ioctl() in this context?

 2/ Grouping

       By design, an event can only be part of one group at a time. Events in
       a group are guaranteed to be active on the PMU at the same time. That
       means a group cannot have more events than there are available counters
       on the PMU. Tools may want to know the number of counters available in
       order to group their events accordingly, such that reliable ratios
       could be computed. It seems the only way to know this is by trial and
       error. This is not practical.

 3/ Multiplexing and system-wide

       Multiplexing is time-based and it is hooked into the timer tick. At
       every tick, the kernel tries to schedule another set of event groups.

        In tickless kernels, if a CPU is idle, no timer tick is generated and
        therefore no multiplexing occurs. This is incorrect. The fact that the
        CPU is idle does not mean there are no interesting PMU events to
        measure. Parts of the CPU may still be active, e.g., caches and buses,
        and thus multiplexing is still expected to happen.

       You need to hook up the timer source for multiplexing to something else
       which is not affected by tickless. You cannot simply disable tickless
       during a measurement because you would not be measuring the system as
       it actually behaves.

  4/ Controlling group multiplexing

        Although multiplexing is exposed to users via the timing information,
        events are not necessarily grouped arbitrarily by tools, nor are
        groups necessarily ordered arbitrarily.

        I know of tools which craft the sequence of groups carefully, placing
        related events in neighboring groups so that they measure similar
        parts of the execution. This way, you can mitigate the fluctuations
        introduced by multiplexing. In other words, some tools may want to
        control the order in which groups are scheduled on the PMU.

       You mentioned that groups are multiplexed in creation order. But which
       creation order? As far as I know, multiple distinct tools may be
       attaching to the same thread at the same time and their groups may be
       interleaved in the list. Therefore, I believe 'creation order' refers
       to the global group creation order which is only visible to the kernel.
       Each tool may see a different order. Let's take an example.

       Tool A creates group G1, G2, G3 and attaches them to thread T0. At the
       same time tool B creates group G4, G5. The actual global order may
       be: G1, G4, G2, G5, G3. This is what the kernel is going to multiplex.
       Each group will be multiplexed in the right order from the point of view
       of each tool. But there will be gaps. It would be nice to have a way
       to ensure that the sequence is either: G1, G2, G3, G4, G5 or G4, G5,
       G1, G2, G3. In other words, avoid the interleaving.

  5/ Mmaped count

        It is possible to read counts directly from user space for
        self-monitoring threads. This leverages a HW capability present on
        some processors. On X86, this is possible via RDPMC.

       The full 64-bit count is constructed by combining the hardware value
       extracted with an assembly instruction and a base value made available
       thru the mmap. There is an atomic generation count available to deal
       with the race condition.

        I believe there is a problem with this approach given that the PMU
        is shared and that events can be multiplexed. That means that even
        though you are self-monitoring, events get replaced on the PMU. The
        assembly instruction is unaware of that: it reads a register, not an
        event.

       On x86, assume event A is hosted in counter 0, thus you need RDPMC(0)
       to extract the count. But then, the event is replaced by another one
       which reuses counter 0. At the user level, you will still use RDPMC(0)
       but it will read the HW value from a different event and combine it
       with a base count from another one.

        To avoid this, you need to pin the event so it stays on the PMU at
        all times. Now, here is something unclear to me. Pinning does not
        mean staying in the SAME register; it means the event stays on the
        PMU but may possibly change registers. To prevent that, I believe you
        also need to set exclusive, so that no other group can be scheduled
        and thus possibly use the same counter.

        It looks like this is the only way you can make this actually work.
        Not setting pinned+exclusive is another pitfall many people will
        fall into.
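
        For concreteness, here is a minimal sketch of what I mean. The
        attribute bit names follow the current patches as I read them, and
        sys_perf_counter_open() stands for a thin wrapper around the raw
        system call:

        struct perf_counter_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.type      = PERF_TYPE_HARDWARE;
        attr.config    = PERF_COUNT_HW_CPU_CYCLES;
        attr.pinned    = 1;     /* event stays on the PMU at all times  */
        attr.exclusive = 1;     /* no other group may be co-scheduled,  */
                                /* so it should not change counters     */

        /* self-monitoring: pid = 0 (current task), any CPU, new group */
        fd = sys_perf_counter_open(&attr, 0, -1, -1, 0);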

  6/ Group scheduling

       Looking at the existing code, it seems to me there is a risk of
       starvation for groups, i.e., groups never scheduled on the PMU.

       My understanding of the scheduling algorithm is:

               - first try to  schedule pinned groups. If a pinned group
                 fails, put it in error mode. read() will fail until the
                 group gets another chance at being scheduled.

               - then try to schedule the remaining groups. If a group fails
                 just skip it.

        If the group list does not change, then certain groups may always fail.
        However, the ordering of the list changes because at every tick it is
        rotated: the head becomes the tail. Therefore, each group eventually
        gets the first position and thus gets the full PMU to assign its
        events.

        This works as long as there is a guarantee the list will ALWAYS
        rotate. If a thread does not run long enough to see a tick, the list
        may never rotate.

  7/ Group validity checking

        At the user level, an application is only concerned with events and
        the grouping of those events. The assignment logic is performed by
        the kernel.

        For a group to be scheduled, all its events must be compatible with
        each other, otherwise the group will never be scheduled. It is not
        clear to me when that sanity check will be performed if I create the
        group such that it is stopped.

        If such a group goes all the way to scheduling, it will simply never
        be scheduled. Counts will be zero and the users will have no idea
        why. If the group is put in error state, read will not be possible.
        But again, how will the user know why?


  8/ Generalized cache events

        In recent days, you have added support for what you call 'generalized
        cache events'.

        The log defines:
                new event type: PERF_TYPE_HW_CACHE

                This is a 3-dimensional space:
                { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
                { load, store, prefetch } x
                { accesses, misses }

        Those generic events are then mapped by the kernel onto actual PMU
        events if possible.
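
        For illustration, this is how I read the encoding of one point in
        that space, e.g., L1-D load misses. The (id | op << 8 | result << 16)
        packing and the enum names are my reading of the patch, not a
        definitive reference:

        struct perf_counter_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HW_CACHE;
        /* pack the three dimensions: cache-id | op << 8 | result << 16 */
        attr.config = PERF_COUNT_HW_CACHE_L1D |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

        int fd = sys_perf_counter_open(&attr, 0, -1, -1, 0);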

        I don't see any justification for adding this, especially in the
        kernel.

        What's the motivation and goal of this?

        If you define generic events, you need to provide a clear definition
        of what they are actually measuring. This is especially true for
        caches, because there are many cache events and many different
        behaviors.

        If the goal is to make comparisons easier, I believe this is doomed
        to fail, because different caches behave differently and events
        capture different subtle things, e.g., HW prefetch vs. SW prefetch.
        If, to actually understand what the generic event is counting, I
        need to know the mapping, then this whole feature is useless.

  9/ Group reading

        It is possible to start/stop an event group simply via ioctl() on the
        group leader. However, it is not possible to read all the counts with
        a single read() system call. That seems odd. Furthermore, I believe
        you want reads to be as atomic as possible.

  10/ Event buffer minimal useful size

        As it stands, the buffer header occupies the first page, even though
        the buffer header struct is 32 bytes long. That's a lot of precious
        RLIMIT_MEMLOCK memory wasted.

        The actual buffer (data) starts at the next page (from builtin-top.c):

       static void mmap_read_counter(struct mmap_data *md)
       {
               unsigned int head = mmap_read_head(md);
               unsigned int old = md->prev;
               unsigned char *data = md->base + page_size;


        Given that the buffer "full" notifications are sent on page-crossing
        boundaries, if the actual buffer payload size is 1 page, you are
        guaranteed to have your samples overwritten.

        This leads me to believe that the minimal buffer size to get useful
        data is 3 pages. This is per event group per thread. That puts a lot
        of pressure on RLIMIT_MEMLOCK, which is usually set fairly low by
        distros.

   11/ Missing definitions for generic hardware events

        As soon as you define generic events, you need to provide a clear and
        precise definition as to what they measure. This is crucial to make
        them useful. I have not seen such a definition yet.

II/ X86 comments

  1/ Fixed counters on Intel

        You cannot simply fall back to generic counters if you cannot find
        a fixed counter. There are model-specific bugs: for instance,
        UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing on
        Nehalem when it is used in fixed counter 2 as when it is used in a
        generic counter. The same is true on Core.

        You cannot simply look at the event field code to determine whether
        this is an event supported by a fixed counter. You must look at the
        other fields such as edge, invert, cnt-mask. If those are present,
        then you have to fall back to using a generic counter, as fixed
        counters only support priv level filtering. As indicated above,
        though, programming UNHALTED_REFERENCE_CYCLES on a generic counter
        does not count the same thing, therefore you need to fail if filters
        other than priv levels are present on this event.
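
        As a sketch of the kind of check I have in mind (the bit positions
        follow the architectural PERFEVTSEL layout; the helper name is made
        up):

        #define EVTSEL_EDGE     (1ULL << 18)
        #define EVTSEL_INV      (1ULL << 23)
        #define EVTSEL_CMASK    (0xffULL << 24)

        /* fixed counters only support priv-level filtering */
        static int fits_fixed_counter(u64 evtsel)
        {
                return !(evtsel & (EVTSEL_EDGE | EVTSEL_INV | EVTSEL_CMASK));
        }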

  2/ Event knowledge missing

        There are constraints on events in Intel processors. Different
        constraints also exist on AMD64 processors, especially with
        uncore-related events.

        In your model, those need to be taken care of by the kernel. Should
        the kernel make the wrong decision, there would be no work-around for
        user tools. Take the example I outlined just above with Intel fixed
        counters.

        The current code base does not have any constrained event support,
        therefore bogus counts may be returned depending on the event
        measured.

III/ Requests

  1/ Sampling period randomization

       It is our experience (on Itanium, for instance), that for certain
       sampling measurements, it is beneficial to randomize the sampling
       period a bit. This is in particular the case when sampling on an
       event that happens very frequently and which is not related to
       timing, e.g., branch_instructions_retired. Randomization helps mitigate
       the bias. You do not need something sophisticated. But when you are using
       a kernel-level sampling buffer, you need to have the kernel randomize.
       Randomization needs to be supported per event.

IV/ Open questions

  1/ Support for model-specific uncore PMU monitoring capabilities

        Recent processors have multiple PMUs. Typically one per core, but
        also one at the socket level, e.g., Intel Nehalem. It is expected that
        this API will provide access to these PMUs as well.

        It seems like with the current API, raw events for those PMUs would
        need a new architecture-specific type, as the event encoding by itself
        may not be enough to disambiguate between a core and an uncore PMU
        event.

       How are those events going to be supported?

  2/ Features impacting all counters

        On some PMU models, e.g., Itanium, there are certain features which
        have an influence on all counters that are active. For instance, there
        is a way to restrict monitoring to a contiguous range of code or data
        addresses, using both some PMU registers and the debug registers.

       Given that the API exposes events (counters) as independent of each
       other, I wonder how range restriction could be implemented.

        Similarly, on Itanium, there are global behaviors. For instance, on
        counter overflow the entire PMU freezes all at once. That seems to
        contradict the design of the API, which creates the illusion of
        independence.

       What solutions do you propose?

  3/ AMD IBS

       How is AMD IBS going to be implemented?

       IBS has two separate sets of registers. One to capture fetch related
       data and another one to capture instruction execution data. For each,
       there is one config register but multiple data registers. In each mode,
       there is a specific sampling period and IBS can interrupt.

        It looks like you could define two pseudo events or event types and
        then define a new record_format and read_format. Those formats would
        only be valid for an IBS event.

       Is that how you intend to support IBS?

  4/ Intel PEBS

       Since Netburst-based processors, Intel PMUs support a hardware sampling
       buffer mechanism called PEBS.

       PEBS really became useful with Nehalem.

       Not all events support PEBS. Up until Nehalem, only one counter supported
       PEBS (PMC0). The format of the hardware buffer has changed between Core
       and Nehalem. It is not yet architected, thus it can still evolve with
       future PMU models.

       On Nehalem, there is a new PEBS-based feature called Load Latency
       Filtering which captures where data cache misses occur
       (similar to Itanium D-EAR). Activating this feature requires setting a
       latency threshold hosted in a separate PMU MSR.

       On Nehalem, given that all 4 generic counters support PEBS, the
       sampling buffer may contain samples generated by any of the 4 counters.
       The buffer includes a bitmask of registers to determine the source
       of the samples. Multiple bits may be set in the bitmask.


        How will PEBS be supported by this new API?

  5/ Intel Last Branch Record (LBR)

       Intel processors since Netburst have a cyclic buffer hosted in
       registers which can record taken branches. Each taken branch is stored
        into a pair of LBR registers (source, destination). Up until Nehalem,
        there were no filtering capabilities for LBR. LBR is not an architected
        PMU feature.

        There is no counter associated with LBR. Nehalem has an LBR_SELECT MSR.
        However, there are some constraints on it, given that it is shared
        between hardware threads.

       LBR is only useful when sampling and therefore must be combined with a
       counter. LBR must also be configured to freeze on PMU interrupt.

       How is LBR going to be supported?


2009-06-22 11:48:49

by Ingo Molnar

Subject: Re: v2 of comments on Performance Counters for Linux (PCL)


hi Stephane,

Thanks for the extensive feedback! Your numerous comments cover
twenty sub-topics, so we've tabulated the summary of Peter's and my
replies, to give a quick overview:

-------------------------------------------------------------------------
Topic/question you raised                     #  Our (current) reply
-------------------------------------------------------------------------
I/ General API comments
......
 1/ System calls - ioctl                      #  agree, willfix
 2/ Grouping                                  #  agree, questions asked
 3/ Multiplexing and system-wide              #  agree, disagree on severity
 4/ Controlling group multiplexing            #  agree, disagree on relevance
 5/ Mmaped count                              #  disagree
 6/ Group scheduling                          #  agree, disagree on severity
 7/ Group validity checking                   #  questions answered
 8/ Generalized cache events                  #  disagree
 9/ Group reading                             #  already possible - extend on need
10/ Event buffer minimal useful size          #  questions answered
11/ Missing definitions for generic events    #  reply question asked

II/ X86 comments
......
 1/ Fixed counters on Intel                   #  agree, willfix
 2/ Event knowledge missing                   #  disagree

III/ Requests
......
 1/ Sampling period randomization             #  support possible, outlined

IV/ Open questions
......
 1/ Support for model-specific uncore PMU     #  support possible, outlined
 2/ Features impacting all counters           #  support possible, outlined
 3/ AMD IBS                                   #  support possible, outlined
 4/ Intel PEBS                                #  support possible, outlined
 5/ Intel Last Branch Record (LBR)            #  support possible, outlined
-------------------------------------------------------------------------

( We'll send the individual replies in separate mails to keep mail
size manageable. )

At the outset, it appears that there are two major themes to your
questions (with a few other topics as well which I don't mention
here):

- Expressing/handling constraints with certain PMU features
- Supporting various non-counter PMU features

In general we'd like to observe that our primary design goal is to
offer a coherent, comprehensive and still simple to use performance
measurement and profiling framework for Linux. We also provide an
integrated tool in tools/perf/ to give a complete solution.

Many of the perfcounters features use no PMU at all, and in fact
it's entirely possible to use most 'perf' features and get a
meaningful analysis/profile of the system without any PMU.

In other words, we consider the PMU to be the means to enhance
perfcounters, not the end goal. A PMU will make a good difference to
quality and precision, but we do not let the sanity of our
interfaces be dictated by PMU feature quirkiness.

We cherry-pick the PMU features that look most useful, and expose
them via rich performance measurement abstractions. It does not mean
that we strive to immediately expose _all_ PMU features and drown
users in lots of conflicting hw facilities (different on each
platform) as quickly as possible.

Having said that, we do think that pretty much any sane PMU feature
_can_ be supported via perfcounters (and most insane ones as well)
and that the limit is 'do we want to', not 'can we'.

Even heavily constrained PMUs can be supported: PowerPC, which is a
CPU family with heavily constrained PMU variants, is widely supported
by perfcounters here and today.

Ingo, Peter

2009-06-22 11:49:54

by Ingo Molnar

Subject: Re: I.1 - System calls - ioctl


> I/ General API comments
>
> 1/ System calls
>
> * ioctl()
>
> You have defined 5 ioctls() so far to operate on an existing
> event. I was under the impression that ioctl() should not be
> used except for drivers.
>
> How do you justify your usage of ioctl() in this context?

We can certainly do a separate sys_perf_counter_ctrl() syscall - and
we will do that if people think the extra syscall slot is worth it
in this case.

The (mild) counter-argument so far was that the current ioctls are
very simple controls over the "IO" attributes of counters:

- enable
- disable
- reset
- refresh
- set-period

So they could be considered 'IO controls' in the classic sense and
act as a (mild) exception to the 'don't use ioctls' rule.

They are not some weird tacked-on syscall functionality - they
modify the IO properties of counters: on/off, value and rate. If
they go beyond that we'll put it all into a separate syscall and
deprecate the ioctl (which will have a relatively short half-life
due to the tools being hosted in the kernel repo).

This could happen right now in fact, if people think it's worth it.

2009-06-22 11:50:42

by Ingo Molnar

Subject: Re: I.2 - Grouping

> 2/ Grouping
>
> By design, an event can only be part of one group at a time.
> Events in a group are guaranteed to be active on the PMU at the
> same time. That means a group cannot have more events than there
> are available counters on the PMU. Tools may want to know the
> number of counters available in order to group their events
> accordingly, such that reliable ratios could be computed. It seems
> the only way to know this is by trial and error. This is not
> practical.

Groups are there to support heavily constrained PMUs, and for them
this is the only way, as there is no simple linear expression for
how many counters one can load on the PMU.

The ideal model for tooling is relatively independent PMU registers
(counters) with few constraints - most modern CPUs meet that
model.

All the existing tooling (tools/perf/) operates on that model and
this leads to easy programmability and flexible results. This model
needs no grouping of counters.

Could you please cite specific examples in terms of tools/perf/?
What feature do you think needs to know more about constraints? What
is the specific win in precision we could achieve via that?

2009-06-22 11:51:26

by Ingo Molnar

Subject: Re: I.3 - Multiplexing and system-wide

> 3/ Multiplexing and system-wide
>
> Multiplexing is time-based and it is hooked into the timer tick.
> At every tick, the kernel tries to schedule another set of event
> groups.
>
> In tickless kernels if a CPU is idle, no timer tick is generated,
> therefore no multiplexing occurs. This is incorrect. It's not
> because the CPU is idle, that there aren't any interesting PMU
> events to measure. Parts of the CPU may still be active, e.g.,
> caches and buses. And thus, it is expected that multiplexing still
> happens.
>
> You need to hook up the timer source for multiplexing to something
> else which is not affected by tickless. You cannot simply disable
> tickless during a measurement because you would not be measuring
> the system as it actually behaves.

There is only a very small difference between inhibiting nohz and
waking up the machine using timers on the same interval.

Using the scheduler tick to round-robin counters is a first-level
approximation we use: it in essence corresponds to multiplexing
counters along a cycle cost metric. If there's sufficient interest
we can drive the round-robining by the PMU itself: this makes it HZ
and dynticks invariant.

The simplified multiplexing we offer now certainly works well enough
in practice - if there's interest we can expand it.

2009-06-22 11:51:53

by Ingo Molnar

Subject: Re: I.4 - Controlling group multiplexing

> 4/ Controlling group multiplexing
>
> Although multiplexing is exposed to users via the timing
> information, events may not necessarily be grouped at random by
> tools. Groups may not be ordered at random either.
>
> I know of tools which craft the sequence of groups carefully such
> that related events are in neighboring groups such that they
> measure similar parts of the execution. This way, you can mitigate
> the fluctuations introduced by multiplexing. In other words, some
> tools may want to control the order in which groups are scheduled
> on the PMU.
>
> You mentioned that groups are multiplexed in creation order. But
> which creation order? As far as I know, multiple distinct tools
> may be attaching to the same thread at the same time and their
> groups may be interleaved in the list. Therefore, I believe
> 'creation order' refers to the global group creation order which
> is only visible to the kernel. Each tool may see a different
> order. Let's take an example.
>
> Tool A creates group G1, G2, G3 and attaches them to thread T0. At
> the same time tool B creates group G4, G5. The actual global order
> may be: G1, G4, G2, G5, G3. This is what the kernel is going to
> multiplex. Each group will be multiplexed in the right order from
> the point of view of each tool. But there will be gaps. It would
> be nice to have a way to ensure that the sequence is either: G1,
> G2, G3, G4, G5 or G4, G5, G1, G2, G3. In other words, avoid the
> interleaving.

Since it's all sampling, what is the statistical relevance of
scheduling groups back to back when interleaved with other groups?

I appreciate that people might want this, but is that want
justified?

That is, A1 A2 B1 B2 vs A1 B1 A2 B2 - what statistically relevant
feature does one have over the other, especially since the RR
interval is outside of program control?

2009-06-22 11:53:11

by Ingo Molnar

Subject: Re: I.5 - Mmaped count

> 5/ Mmaped count
>
> It is possible to read counts directly from user space for
> self-monitoring threads. This leverages a HW capability present on
> some processors. On X86, this is possible via RDPMC.
>
> The full 64-bit count is constructed by combining the hardware
> value extracted with an assembly instruction and a base value made
> available thru the mmap. There is an atomic generation count
> available to deal with the race condition.
>
> I believe there is a problem with this approach given that the PMU
> is shared and that events can be multiplexed. That means that even
> though you are self-monitoring, events get replaced on the PMU.
> The assembly instruction is unaware of that, it reads a register
> not an event.
>
> On x86, assume event A is hosted in counter 0, thus you need
> RDPMC(0) to extract the count. But then, the event is replaced by
> another one which reuses counter 0. At the user level, you will
> still use RDPMC(0) but it will read the HW value from a different
> event and combine it with a base count from another one.
>
> To avoid this, you need to pin the event so it stays in the PMU at
> all times. Now, here is something unclear to me. Pinning does not
> mean stay in the SAME register, it means the event stays on the
> PMU but it can possibly change register. To prevent that, I
> believe you need to also set exclusive so that no other group can
> be scheduled, and thus possibly use the same counter.
>
> Looks like this is the only way you can make this actually work.
> Not setting pinned+exclusive, is another pitfall in which many
> people will fall into.

        do {
                seq = pc->lock;

                barrier();
                if (pc->index) {
                        count = pmc_read(pc->index - 1);
                        count += pc->offset;
                } else
                        goto regular_read;

                barrier();
        } while (pc->lock != seq);

We don't see the hole you are referring to. The sequence lock
ensures you get a consistent view.

2009-06-22 11:53:36

by Ingo Molnar

Subject: Re: I.6 - Group scheduling

> 6/ Group scheduling
>
> Looking at the existing code, it seems to me there is a risk of
> starvation for groups, i.e., groups never scheduled on the PMU.
>
> My understanding of the scheduling algorithm is:
>
> - first try to schedule pinned groups. If a pinned group
> fails, put it in error mode. read() will fail until the
> group gets another chance at being scheduled.
>
> - then try to schedule the remaining groups. If a group fails
> just skip it.
>
> If the group list does not change, then certain groups may always
> fail. However, the ordering of the list changes because at every
> tick, it is rotated. The head becomes the tail. Therefore, each
> group eventually gets the first position and therefore gets the
> full PMU to assign its events.
>
> This works as long as there is a guarantee the list will ALWAYS
> rotate. If a thread does not run long enough for a tick, it may
> never rotate.

You would need to ensure the task never runs during the tick; this is
a statistical property that is bound to be untrue for any 'normal'
application.

This is similar to the old cputime hiding tricks. We don't think
this will be a problem: you really need to actively avoid the tick
to end up with a significantly lower tick rate.

In any case this can also be excluded via a natural extension
mentioned above: scheduling based not on the scheduler tick but
based on one of the counters.

2009-06-22 11:54:28

by Ingo Molnar

Subject: Re: I.7 - Group validity checking

> 7/ Group validity checking
>
> At the user level, an application is only concerned with events
> and grouping of those events. The assignment logic is performed by
> the kernel.
>
> For a group to be scheduled, all its events must be compatible
> with each other, otherwise the group will never be scheduled. It
> is not clear to me when that sanity check will be performed if I
> create the group such that it is stopped.

The hardware implementation should verify that the group fits on the
PMU for every new group member.

Like said before, we think this and I.2/ are in conflict.

> If the group goes all the way to scheduling, it will never be
> scheduled. Counts will be zero and the users will have no idea
> why. If the group is put in error state, read will not be
> possible. But again, how will the user know why?

It should not be possible to create such a situation. Can you see it
being possible?
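
( To illustrate: if creation-time validation behaves as described
  above, a tool can even discover the effective group capacity by
  probing. A rough sketch, assuming 'attr' is already set up and
  assuming the open call really does fail once the group no longer
  fits: )

        int leader, fd, n = 0;

        leader = sys_perf_counter_open(&attr, 0, -1, -1, 0);
        if (leader >= 0) {
                n = 1;
                for (;;) {
                        fd = sys_perf_counter_open(&attr, 0, -1, leader, 0);
                        if (fd < 0)
                                break;  /* the group no longer fits */
                        n++;            /* real code would close these fds */
                }
        }
        /* n approximates how many copies of this event fit in one group */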

2009-06-22 11:55:00

by Ingo Molnar

Subject: Re: I.8 - Generalized cache events

> 8/ Generalized cache events
>
> In recent days, you have added support for what you call
> 'generalized cache events'.
>
> The log defines:
> new event type: PERF_TYPE_HW_CACHE
>
> This is a 3-dimensional space:
> { L1-D, L1-I, L2, ITLB, DTLB, BPU } x
> { load, store, prefetch } x
> { accesses, misses }
>
> Those generic events are then mapped by the kernel onto actual PMU
> events if possible.
>
> I don't see any justification for adding this and especially in
> the kernel.
>
> What's the motivation and goal of this?
>
> If you define generic events, you need to provide a clear
> definition of what they are actually measuring. This is especially
> true for caches because there are many cache events and many
> different behaviors.
>
> If the goal is to make comparisons easier. I believe this is
> doomed to fail. Because different caches behave differently,
> events capture different subtle things, e.g, HW prefetch vs. sw
> prefetch. If to actually understand what the generic event is
> counting I need to know the mapping, then this whole feature is
> useless.

[ we since renamed L2 to LL ]

I beg to differ: subtle differences don't matter for big-picture
performance indications.

If you measure significant last level cache misses by any metric,
you know you're hosed. Any work done to reduce this metric will
improve your workload.

Why do I need to care about the exact details of the event to make
valid use of this?

Sure, some people might be interested, and yes there are certainly
cases where you do want all detail available, but this interface
does not prohibit that usage - you can still use the full raw
spectrum of events, thousands of them if you so wish, easily
accessed both via the syscall and via the tools.

This categorization we added merely enables a somewhat simpler high
level view (that tries to be CPU model invariant) that suffices for
a lot of things.

2009-06-22 11:55:31

by Ingo Molnar

Subject: Re: I.9 - Group reading

> 9/ Group reading
>
> It is possible to start/stop an event group simply via ioctl() on
> the group leader. However, it is not possible to read all the
> counts with a single read() system call. That seems odd.
> Furthermore, I believe you want reads to be as atomic as possible.

If you want an atomic snapshot you can do it: disable the group,
read out the counts, enable the group.
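
A rough sketch of that sequence - the ioctl names follow the current
perf_counter.h; group_fd[] (leader at index 0, then the members) and
NR_MEMBERS are illustrative, and a group-wide flag argument may be
needed depending on the exact ioctl semantics:

        __u64 value[NR_MEMBERS];
        int i;

        /* stopping the leader stops the whole group */
        ioctl(group_fd[0], PERF_COUNTER_IOC_DISABLE, 0);

        for (i = 0; i < NR_MEMBERS; i++)
                read(group_fd[i], &value[i], sizeof(value[i]));

        /* resume counting */
        ioctl(group_fd[0], PERF_COUNTER_IOC_ENABLE, 0);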

But, as your other comment under I/5 indicates, there are ways to
read out the PMU directly, via RDPMC instructions. Those are not
atomic either if used for multiple counters. Is your argument that
they are thus useless?

But if there is a strong use-case we can add PERF_FORMAT_GROUP.

2009-06-22 11:56:20

by Ingo Molnar

Subject: Re: I.10 - Event buffer minimal useful size

> 10/ Event buffer minimal useful size
>
> As it stands, the buffer header occupies the first page, even
> though the buffer header struct is 32-byte long. That's a lot of
> precious RLIMIT_MEMLOCK memory wasted.
>
> The actual buffer (data) starts at the next page (from builtin-top.c):
>
> static void mmap_read_counter(struct mmap_data *md)
> {
>         unsigned int head = mmap_read_head(md);
>         unsigned int old = md->prev;
>         unsigned char *data = md->base + page_size;
>
> Given that the buffer "full" notifications are sent on page-crossing
> boundaries, if the actual buffer payload size is 1 page, you are
> guaranteed to have your samples overwritten.
>
> This leads me to believe that the minimal buffer size to get
> useful data is 3 pages. This is per event group per thread. That
> puts a lot of pressure on RLIMIT_MEMLOCK, which is usually set
> fairly low by distros.

Regarding notifications: you can actually tell it to generate
wakeups every 'n' events instead of at page boundaries. (See:
attr.wakeup_events).
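
For example (a sketch - sys_perf_counter_open() stands for the raw
syscall, and fd, pid, base and page_size are assumed to be set up by
the caller; the 1 header page + 2^n data pages layout follows the
discussion above):

        attr.wakeup_events = 64;        /* wake the reader every 64 samples */

        fd = sys_perf_counter_open(&attr, pid, -1, -1, 0);

        /* 1 header page + 8 data pages */
        base = mmap(NULL, (1 + 8) * page_size,
                    PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);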

Regarding locked pages: there is an extra per-user mlock limit for
perf counters, found in sysctl_perf_counter_mlock, so it can be
configured independently. Furthermore, the accounting is done on a
per struct user basis, not on a per task basis - which makes the use
of locked pages far less of an issue for distros.

Regarding using 4K for the sampling data header: we are making use
of that property. We recently extended the header to include more
fields and made it writable - so the 4K MMU separation between
data header and data pages has very much paid off already - and it
could provide more advantages in the future.

2009-06-22 11:56:52

by Ingo Molnar

Subject: Re: I.11 - Missing definitions for generic events

> 11/ Missing definitions for generic hardware events
>
> As soon as you define generic events, you need to provide a clear
> and precise definition as to what they measure. This is crucial to
> make them useful. I have not seen such a definition yet.

Do you mean their names aren't clear enough? :-)

Please suggest alternatives and/or clearer definitions.

2009-06-22 11:57:31

by Ingo Molnar

Subject: Re: II.1 - Fixed counters on Intel

> II/ X86 comments
>
> 1/ Fixed counters on Intel
>
> You cannot simply fall back to generic counters if you cannot find
> a fixed counter. There are model-specific bugs, for instance
> UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same
> thing on Nehalem when it is used in fixed counter 2 or a generic
> counter. The same is true on Core.

This could be handled via a model specific quirk, if the erratum is
serious enough.

> You cannot simply look at the event field code to determine
> whether this is an event supported by a fixed counter. You must
> look at the other fields such as edge, invert, cnt-mask. If those
> are present then you have to fall back to using a generic counter
> as fixed counters only support priv level filtering. As indicated
> above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
> counter does not count the same thing, therefore you need to fail
> if filters other than priv levels are present on this event.

Agreed, we'll fix this.

2009-06-22 11:58:00

by Ingo Molnar

Subject: Re: II.2 - Event knowledge missing

> 2/ Event knowledge missing
>
> There are constraints on events in Intel processors. Different
> constraints do exist on AMD64 processors, especially with
> uncore-related events.

You raise the issue of uncore events in IV.1, but let us reply here
primarily.

Uncore counters and events seem to be somewhat uninteresting to
us. (Patches from those who find them interesting are welcome of
course!)

So they weren't in the first line of PMU features to cherry-pick
from.

The main problem with uncore events is that they are per physical
package, and hence tying a physical metric exposed via them to a
particular workload is hard - unless full-system analysis is
performed. 'Task driven' metrics seem far more useful to performance
analysis (and those are the preferred analysis method of most
user-space developers), as they allow particularized sampling and a
tight connection between workload and metric.

If, despite our expectations, uncore events prove to be useful,
popular and required elements of performance analysis, they can be
supported in perfcounters via various levels:

- a special raw ID range on x86, available only to per-CPU counters.
The low-level implementation reserves the uncore PMCs, so overlapping
allocation (and interaction between the cores via the MSRs) is
not possible.

- generic enumeration with full tooling support, time-sharing and
the whole set of features. The low-level backend would time-share
the resource between interested CPUs.

There is no limitation in the perfcounters design that somehow makes
uncore events harder to support. The uncore counters _themselves_
are limited to begin with - so rich features cannot be offered on
top of them.

> In your model, those need to be taken care of by the kernel.
> Should the kernel make the wrong decision, there would be no
> work-around for user tools. Take the example I outlined just above
> with Intel fixed counters.

Yes. Why should there be a work-around for user tools if the fix is
for the kernel? We don't try to work around random driver bugs in
userspace either; we simply fix the kernel.

> The current code-base does not have any constrained event support,
> therefore bogus counts may be returned depending on the event
> measured.

Then we'll need to grow some when we run into them :-)

2009-06-22 11:59:02

by Ingo Molnar

Subject: Re: III.1 - Sampling period randomization

> III/ Requests
>
> 1/ Sampling period randomization
>
> It is our experience (on Itanium, for instance), that for certain
> sampling measurements, it is beneficial to randomize the sampling
> period a bit. This is in particular the case when sampling on an
> event that happens very frequently and which is not related to
> timing, e.g., branch_instructions_retired. Randomization helps
> mitigate the bias. You do not need something sophisticated. But
> when you are using a kernel-level sampling buffer, you need to
> have the kernel randomize. Randomization needs to be supported per
> event.

This is on the todo list; it should be pretty straightforward to
implement. Basically add a u64 randomize:1 flag and a u64 rand_size,
and add some noise to hwc->sample_period in perf_counter_overflow(),
just like the .freq path does.

In fact the auto-freq sampling counters effectively randomize
already, as the path of stabilization/convergence is never exactly
the same. We could add a little bit of constant noise to the freq
counter and it would auto-correct automatically.

Since you seem to care about this sub-feature, would you be
interested in sending a patch for that? The essential steps are:

 - Take a new bit off perf_counter_attr::__reserved_1 and name
   it attr.randomize.

 - Inject a few bits of trivial noise into the period calculated
   in kernel/perf_counter.c:perf_adjust_period(). The best place
   would be to add it right before this line:

        hwc->sample_period = sample_period;

   If you add a small relative fuzz of below 1% to sample_period,
   then the code will auto-correct.

 - All the tools deal just fine with variable periods already, so
   there are no tooling updates needed beyond adding a --randomize
   flag to tools/perf/builtin-record.c.

Let us know if you need any help with any of this!
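
( For reference, a minimal sketch of the noise injection in step two,
  placed right before the line quoted above - the randomness source
  and the exact arithmetic are only illustrative: )

        if (counter->attr.randomize && sample_period > 128) {
                s64 fuzz;

                /* roughly +/- 0.8% of relative fuzz */
                fuzz = (s64)(get_random_int() % (sample_period >> 6))
                        - (s64)(sample_period >> 7);
                sample_period += fuzz;
        }

        hwc->sample_period = sample_period;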

2009-06-22 11:59:21

by Ingo Molnar

Subject: Re: IV.1 - Support for model-specific uncore PMU

> IV/ Open questions
>
> 1/ Support for model-specific uncore PMU monitoring capabilities
>
> Recent processors have multiple PMUs. Typically one per core, but
> also one at the socket level, e.g., Intel Nehalem. It is expected
> that this API will provide access to these PMUs as well.
>
> It seems like with the current API, raw events for those PMU would
> need a new architecture-specific type as the event encoding by
> itself may not be enough to disambiguate between a core and uncore
> PMU event.
>
> How are those events going to be supported?

(Please see our reply to II.2.)

2009-06-22 12:00:20

by Ingo Molnar

Subject: Re: IV.2 - Features impacting all counters

> 2/ Features impacting all counters
>
> On some PMU models, e.g., Itanium, there are certain features which
> have an influence on all counters that are active. For instance,
> there is a way to restrict monitoring to a range of continuous
> code or data addresses using both some PMU registers and the debug
> registers.
>
> Given that the API exposes events (counters) as independent of
> each other, I wonder how range restriction could be implemented.

A solution is to make it a per-counter attribute and fail to
schedule multiple counters at the same time when these constraints
differ.

> Similarly, on Itanium, there are global behaviors. For instance,
> on counter overflow the entire PMU freezes all at once. That seems
> to be contradictory with the design of the API which creates the
> illusion of independence.
>
> What solutions do you propose?

We propose the same solution as last time: live with the small
imprecisions caused by such hardware limitations. We submit that
natural workload noise is probably far bigger than any such effect.
We could certainly list it as a platform limitation.

2009-06-22 12:01:16

by Ingo Molnar

Subject: Re: IV.3 - AMD IBS

> 3/ AMD IBS
>
> How is AMD IBS going to be implemented?
>
> IBS has two separate sets of registers. One to capture fetch
> related data and another one to capture instruction execution
> data. For each, there is one config register but multiple data
> registers. In each mode, there is a specific sampling period and
> IBS can interrupt.
>
> It looks like you could define two pseudo events or event types
> and then define a new record_format and read_format. Those formats
> would only be valid for an IBS event.
>
> Is that how you intend to support IBS?

That is indeed one of the ways we thought of, not really nice, but
then, IBS is really weird, what were those AMD engineers thinking
:-)

The point is - weird hardware gets expressed as a ... weird feature
under perfcounters too. Not all hardware weirdnesses can be
engineered away.

2009-06-22 12:01:28

by Ingo Molnar

Subject: Re: IV.4 - Intel PEBS

> 4/ Intel PEBS
>
> Since Netburst-based processors, Intel PMUs support a hardware
> sampling buffer mechanism called PEBS.
>
> PEBS really became useful with Nehalem.
>
> Not all events support PEBS. Up until Nehalem, only one counter
> supported PEBS (PMC0). The format of the hardware buffer has
> changed between Core and Nehalem. It is not yet architected, thus
> it can still evolve with future PMU models.
>
> On Nehalem, there is a new PEBS-based feature called Load Latency
> Filtering which captures where data cache misses occur (similar to
> Itanium D-EAR). Activating this feature requires setting a latency
> threshold hosted in a separate PMU MSR.
>
> On Nehalem, given that all 4 generic counters support PEBS, the
> sampling buffer may contain samples generated by any of the 4
> counters. The buffer includes a bitmask of registers to determine
> the source of the samples. Multiple bits may be set in the
> bitmask.
>
> How PEBS will be supported for this new API?

Note, the relevance of PEBS (or IBS) should not be over-stated: for
example it fundamentally cannot do precise call-chain recording (it
only records the RIP, not any of the return frames), which detracts
from its utility. Another limitation is that only a few basic
hardware event types are supported by PEBS.

Having said that, PEBS is a hardware sampling feature that is
definitely saner than AMD's IBS. There's two immediate incremental
uses of it in perfcounters:

- it makes flat sampling lower overhead by avoiding an NMI for all
sample points.

- it makes flat sampled data more precise. (I.e. it can avoid the
1-2 instructions 'skidding' of a sample position, for a handful
of PEBS-capable events.)

As such its primary support form would be 'transparent enablement':
i.e. on those (relatively few) events that are PEBS supported it
would be enabled automatically, and would result in more precise
(and possibly, cheaper) samples.

No separate APIs are needed really - the kernel can abstract it away
and can provide the user what the user wants: good and fast samples.

Regarding demultiplexing on Nehalem: PEBS goes into the DS (Data
Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
records in the same stream of data. One possible model to support
them is to set the PEBS threshold to one, and hence generate an
interrupt for each PEBS record. At offset 0x90 of the PEBS record we
have a snapshot of the global status register:

0x90 IA32_PERF_GLOBAL_STATUS

This tells us, relative to the previous PEBS record in the DS,
which counter overflowed. If this were not reliable, we could still
poll all active counters for overflows and get an occasionally
imprecise but still statistically meaningful and precise
demultiplexing.

As to enabling PEBS with the (CPU-)global latency recording filters,
we can do this transparently for every PEBS-supported event, or can
mandate PEBS scheduling when a PEBS-only feature like load latency
is requested.

This means that for most purposes PEBS will be transparent.

2009-06-22 12:01:58

by Ingo Molnar

Subject: Re: IV.5 - Intel Last Branch Record (LBR)

> 5/ Intel Last Branch Record (LBR)
>
> Intel processors since Netburst have a cyclic buffer hosted in
> registers which can record taken branches. Each taken branch is
> stored into a pair of LBR registers (source, destination). Up
> until Nehalem, there was not filtering capabilities for LBR. LBR
> is not an architected PMU feature.
>
> There is no counter associated with LBR. Nehalem has a LBR_SELECT
> MSR. However there are some constraints on it given it is shared
> by threads.
>
> LBR is only useful when sampling and therefore must be combined
> with a counter. LBR must also be configured to freeze on PMU
> interrupt.
>
> How is LBR going to be supported?

If there's interest then one sane way to support it would be to
expose it as a new sampling format (PERF_SAMPLE_*).

Regarding the constraints - if we choose to expose the branch-type
filtering capabilities of Nehalem, then that puts a constraint on
counter scheduling: two counters with conflicting constraints should
not be scheduled at once, but should be time-shared via the usual
mechanism.

The typical use-case would be to have no or compatible LBR filter
attributes between counters though - so having the conflicts is not
an issue as long as it works according to the usual model.

2009-06-22 12:16:27

by Andi Kleen

Subject: Re: IV.4 - Intel PEBS

On Mon, Jun 22, 2009 at 02:00:52PM +0200, Ingo Molnar wrote:
> Having said that, PEBS is a hardware sampling feature that is
> definitely saner than AMD's IBS. There's two immediate incremental
> uses of it in perfcounters:
>
> - it makes flat sampling lower overhead by avoiding an NMI for all
> sample points.
>
> - it makes flat sampled data more precise. (I.e. it can avoid the
> 1-2 instructions 'skidding' of a sample position, for a handful

There are realistic examples where the non-PEBS "shadow" can be far more
than that, even giving systematic error and hiding complete basic blocks.

> of PEBS-capable events.)

There are more reasons: especially on Nehalem there are some useful things
you can only measure with PEBS, e.g. memory latency or address
histograms (although the latter is quite complicated). Also it
has a lot more PEBS-capable events than older parts.

Long term the trend is likely that more and more advanced PMU features
will require PEBS.

> Regarding demultiplexing on Nehalem: PEBS goes into the DS (Data
> Store), and indeed on Nehalem all PEBS counters 'mix' their PEBS
> records in the same stream of data. One possible model to support
> them is to set the PEBS threshold to one, and hence generate an
> interrupt for each PEBS record. At offset 0x90 of the PEBS record we

Then you would need the NMIs again; the NMI avoidance in PEBS only
works with higher thresholds.

> As to enabling PEBS with the (CPU-)global latency recording filters,
> we can do this transparantly for every PEBS supported event, or can
> mandate PEBS scheduling when a PEBS only feature like load latency
> is requested.
>
> This means that for most purposes PEBS will be transparant.

One disadvantage here is that you're giving up a lot of the measurement
overhead advantage: interrupts are always much more costly than a PEBS
event directly written by the CPU.

But when you support batching multiple PEBS events I suspect the consumer
needs to be aware of the limitations, e.g. no precise time stamps.

-Andi
--
[email protected] -- Speaking for myself only.

2009-06-22 12:25:39

by Stephane Eranian

Subject: Re: I.5 - Mmaped count

On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<[email protected]> wrote:
>> 5/ Mmaped count
>>
>> It is possible to read counts directly from user space for
>> self-monitoring threads. This leverages a HW capability present on
>> some processors. On X86, this is possible via RDPMC.
>>
>> The full 64-bit count is constructed by combining the hardware
>> value extracted with an assembly instruction and a base value made
>> available thru the mmap. There is an atomic generation count
>> available to deal with the race condition.
>>
>> I believe there is a problem with this approach given that the PMU
>> is shared and that events can be multiplexed. That means that even
>> though you are self-monitoring, events get replaced on the PMU.
>> The assembly instruction is unaware of that, it reads a register
>> not an event.
>>
>> On x86, assume event A is hosted in counter 0, thus you need
>> RDPMC(0) to extract the count. But then, the event is replaced by
>> another one which reuses counter 0. At the user level, you will
>> still use RDPMC(0) but it will read the HW value from a different
>> event and combine it with a base count from another one.
>>
>> To avoid this, you need to pin the event so it stays in the PMU at
>> all times. Now, here is something unclear to me. Pinning does not
>> mean stay in the SAME register, it means the event stays on the
>> PMU but it can possibly change register. To prevent that, I
>> believe you need to also set exclusive so that no other group can
>> be scheduled, and thus possibly use the same counter.
>>
>> Looks like this is the only way you can make this actually work.
>> Not setting pinned+exclusive, is another pitfall in which many
>> people will fall into.
>
>   do {
>     seq = pc->lock;
>
>     barrier()
>     if (pc->index) {
>       count = pmc_read(pc->index - 1);
>       count += pc->offset;
>     } else
>       goto regular_read;
>
>     barrier();
>   } while (pc->lock != seq);
>
> We don't see the hole you are referring to. The sequence lock
> ensures you get a consistent view.
>
Let's take an example, with two groups, one event in each group.
Both events are scheduled on counter0, i.e., rdpmc(0). The 2 groups
are multiplexed, one each tick. The user gets 2 file descriptors
and thus two mmap'ed pages.

Suppose the user wants to read, using the above loop, the value of the
event in the first group BUT it's the 2nd group that is currently active
and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.

Unless you tell me that pc->index is marked invalid (0) when the
event is not scheduled, I don't see how you can avoid reading
the wrong value. I am assuming that if the event is not scheduled,
lock remains constant.

Assuming the event is active when you enter the loop and you
read a value: how do you get the timing information to scale the
count?

2009-06-22 12:35:59

by Peter Zijlstra

Subject: Re: I.5 - Mmaped count

On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<[email protected]> wrote:
> >> 5/ Mmaped count
> >>
> >> It is possible to read counts directly from user space for
> >> self-monitoring threads. This leverages a HW capability present on
> >> some processors. On X86, this is possible via RDPMC.
> >>
> >> The full 64-bit count is constructed by combining the hardware
> >> value extracted with an assembly instruction and a base value made
> >> available thru the mmap. There is an atomic generation count
> >> available to deal with the race condition.
> >>
> >> I believe there is a problem with this approach given that the PMU
> >> is shared and that events can be multiplexed. That means that even
> >> though you are self-monitoring, events get replaced on the PMU.
> >> The assembly instruction is unaware of that, it reads a register
> >> not an event.
> >>
> >> On x86, assume event A is hosted in counter 0, thus you need
> >> RDPMC(0) to extract the count. But then, the event is replaced by
> >> another one which reuses counter 0. At the user level, you will
> >> still use RDPMC(0) but it will read the HW value from a different
> >> event and combine it with a base count from another one.
> >>
> >> To avoid this, you need to pin the event so it stays in the PMU at
> >> all times. Now, here is something unclear to me. Pinning does not
> >> mean stay in the SAME register, it means the event stays on the
> >> PMU but it can possibly change register. To prevent that, I
> >> believe you need to also set exclusive so that no other group can
> >> be scheduled, and thus possibly use the same counter.
> >>
> >> Looks like this is the only way you can make this actually work.
> >> Not setting pinned+exclusive, is another pitfall in which many
> >> people will fall into.
> >
> > do {
> > seq = pc->lock;
> >
> > barrier()
> > if (pc->index) {
> > count = pmc_read(pc->index - 1);
> > count += pc->offset;
> > } else
> > goto regular_read;
> >
> > barrier();
> > } while (pc->lock != seq);
> >
> > We don't see the hole you are referring to. The sequence lock
> > ensures you get a consistent view.
> >
> Let's take an example, with two groups, one event in each group.
> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
> are multiplexed, one each tick. The user gets 2 file descriptors
> and thus two mmap'ed pages.
>
> Suppose the user wants to read, using the above loop, the value of the
> event in the first group BUT it's the 2nd group that is currently active
> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
>
> Unless you tell me that pc->index is marked invalid (0) when the
> event is not scheduled. I don't see how you can avoid reading
> the wrong value. I am assuming that is the event is not scheduled
> lock remains constant.

Indeed, pc->index == 0 means it's not currently available.

> Assuming the event is active when you enter the loop and you
> read a value. How to get the timing information to scale the
> count?

I think we would have to add that to the data page... something like the
below?

Paulus?

---
Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
        __u32   lock;                   /* seqlock for synchronization */
        __u32   index;                  /* hardware counter identifier */
        __s64   offset;                 /* add to hardware counter value */
+       __u64   total_time;             /* total time counter active */
+       __u64   running_time;           /* time counter on cpu */
+
+       __u64   __reserved[123];        /* align at 1k */

        /*
         * Control data for the mmap() data buffer.
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
        if (counter->state == PERF_COUNTER_STATE_ACTIVE)
                userpg->offset -= atomic64_read(&counter->hw.prev_count);

+       userpg->total_time = counter->total_time_enabled +
+               atomic64_read(&counter->child_total_time_enabled);
+
+       userpg->running_time = counter->total_time_running +
+               atomic64_read(&counter->child_total_time_running);
+
        barrier();
        ++userpg->lock;
        preempt_enable();
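
The user-space side could then extend the read loop quoted earlier to
pick up these two fields under the same seqlock and scale the count -
a sketch:

        do {
                seq = pc->lock;
                barrier();

                enabled = pc->total_time;
                running = pc->running_time;

                if (pc->index) {
                        count = pmc_read(pc->index - 1);
                        count += pc->offset;
                } else
                        goto regular_read;

                barrier();
        } while (pc->lock != seq);

        if (running)
                count = count * enabled / running;  /* scale for multiplexing */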

2009-06-22 12:54:29

by Stephane Eranian

Subject: Re: I.5 - Mmaped count

On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra<[email protected]> wrote:
> On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
>> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<[email protected]> wrote:
>> >> 5/ Mmaped count
>> >>
>> >> It is possible to read counts directly from user space for
>> >> self-monitoring threads. This leverages a HW capability present on
>> >> some processors. On X86, this is possible via RDPMC.
>> >>
>> >> The full 64-bit count is constructed by combining the hardware
>> >> value extracted with an assembly instruction and a base value made
>> >> available thru the mmap. There is an atomic generation count
>> >> available to deal with the race condition.
>> >>
>> >> I believe there is a problem with this approach given that the PMU
>> >> is shared and that events can be multiplexed. That means that even
>> >> though you are self-monitoring, events get replaced on the PMU.
>> >> The assembly instruction is unaware of that, it reads a register
>> >> not an event.
>> >>
>> >> On x86, assume event A is hosted in counter 0, thus you need
>> >> RDPMC(0) to extract the count. But then, the event is replaced by
>> >> another one which reuses counter 0. At the user level, you will
>> >> still use RDPMC(0) but it will read the HW value from a different
>> >> event and combine it with a base count from another one.
>> >>
>> >> To avoid this, you need to pin the event so it stays in the PMU at
>> >> all times. Now, here is something unclear to me. Pinning does not
>> >> mean stay in the SAME register, it means the event stays on the
>> >> PMU but it can possibly change register. To prevent that, I
>> >> believe you need to also set exclusive so that no other group can
>> >> be scheduled, and thus possibly use the same counter.
>> >>
>> >> Looks like this is the only way you can make this actually work.
>> >> Not setting pinned+exclusive, is another pitfall in which many
>> >> people will fall into.
>> >
>> >   do {
>> >     seq = pc->lock;
>> >
>> >     barrier()
>> >     if (pc->index) {
>> >       count = pmc_read(pc->index - 1);
>> >       count += pc->offset;
>> >     } else
>> >       goto regular_read;
>> >
>> >     barrier();
>> >   } while (pc->lock != seq);
>> >
>> > We don't see the hole you are referring to. The sequence lock
>> > ensures you get a consistent view.
>> >
>> Let's take an example, with two groups, one event in each group.
>> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
>> are multiplexed, one each tick. The user gets 2 file descriptors
>> and thus two mmap'ed pages.
>>
>> Suppose the user wants to read, using the above loop, the value of the
>> event in the first group BUT it's the 2nd group  that is currently active
>> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
>>
>> Unless you tell me that pc->index is marked invalid (0) when the
>> event is not scheduled. I don't see how you can avoid reading
>> the wrong value. I am assuming that is the event is not scheduled
>> lock remains constant.
>
> Indeed, pc->index == 0 means its not currently available.

I don't see where you clear that field on x86.
Looks like it comes from hwc->idx. I suspect you need
to do something in x86_pmu_disable() to be symmetrical
with x86_pmu_enable().

I suspect something similar needs to be done on Power.


>
>> Assuming the event is active when you enter the loop and you
>> read a value. How to get the timing information to scale the
>> count?
>
> I think we would have to add that do the data page,.. something like the
> below?
>
Yes.



> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
>        __u32   lock;                   /* seqlock for synchronization */
>        __u32   index;                  /* hardware counter identifier */
>        __s64   offset;                 /* add to hardware counter value */
> +       __u64   total_time;             /* total time counter active */
> +       __u64   running_time;           /* time counter on cpu */
> +
> +       __u64   __reserved[123];        /* align at 1k */
>
>        /*
>         * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
>        if (counter->state == PERF_COUNTER_STATE_ACTIVE)
>                userpg->offset -= atomic64_read(&counter->hw.prev_count);
>
> +       userpg->total_time = counter->total_time_enabled +
> +                       atomic64_read(&counter->child_total_time_enabled);
> +
> +       userpg->running_time = counter->total_time_running +
> +                       atomic64_read(&counter->child_total_time_running);
> +
>        barrier();
>        ++userpg->lock;
>        preempt_enable();
>
>
>

2009-06-22 12:58:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Mon, Jun 22, 2009 at 01:49:31PM +0200, Ingo Molnar wrote:
> > How do you justify your usage of ioctl() in this context?
>
> We can certainly do a separate sys_perf_counter_ctrl() syscall - and
> we will do that if people think the extra syscall slot is worth it
> in this case.
>
> The (mild) counter-argument so far was that the current ioctls are
> very simple over "IO" attributes of counters:
>
> - enable
> - disable
> - reset
> - refresh
> - set-period
>
> So they could be considered 'IO controls' in the classic sense and
> act as a (mild) exception to the 'dont use ioctls' rule.
>
> They are not some weird tacked-on syscall functionality - they
> modify the IO properties of counters: on/off, value and rate. If
> they go beyond that we'll put it all into a separate syscall and
> deprecate the ioctl (which will have a relatively short half-time
> due to the tools being hosted in the kernel repo).
>
> This could happen right now in fact, if people think it's worth it.

Yet another multiplexer doesn't buy us anything over ioctls unless it
adds more structure. PERF_COUNTER_IOC_ENABLE/PERF_COUNTER_IOC_DISABLE/
PERF_COUNTER_IOC_RESET are calls without any argument, so it's kinda
impossible to add more structure. perf_counter_refresh has an integer
argument, and perf_counter_period as well (with a slightly more
complicated calling convention due to passing a pointer to the 64-bit
integer). I don't see how moving this to syscalls would improve
things.

But talking about syscalls, the sys_perf_counter_open prototype is
really ugly - it uses either the pid or the cpu argument, which is a pretty
clear indicator that it should actually be two syscalls.

Incomplete patch (not touching the actual wire-up) below to
demonstrate it:


Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c 2009-06-22 14:43:35.323966162 +0200
+++ linux-2.6/kernel/perf_counter.c 2009-06-22 14:57:30.223807475 +0200
@@ -1396,41 +1396,14 @@ __perf_counter_init_context(struct perf_
ctx->task = task;
}

-static struct perf_counter_context *find_get_context(pid_t pid, int cpu)
+static struct perf_counter_context *find_get_pid_context(pid_t pid)
{
struct perf_counter_context *parent_ctx;
struct perf_counter_context *ctx;
- struct perf_cpu_context *cpuctx;
struct task_struct *task;
unsigned long flags;
int err;

- /*
- * If cpu is not a wildcard then this is a percpu counter:
- */
- if (cpu != -1) {
- /* Must be root to operate on a CPU counter: */
- if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
- return ERR_PTR(-EACCES);
-
- if (cpu < 0 || cpu > num_possible_cpus())
- return ERR_PTR(-EINVAL);
-
- /*
- * We could be clever and allow to attach a counter to an
- * offline CPU and activate it when the CPU comes up, but
- * that's for later.
- */
- if (!cpu_isset(cpu, cpu_online_map))
- return ERR_PTR(-ENODEV);
-
- cpuctx = &per_cpu(perf_cpu_context, cpu);
- ctx = &cpuctx->ctx;
- get_ctx(ctx);
-
- return ctx;
- }
-
rcu_read_lock();
if (!pid)
task = current;
@@ -3727,6 +3700,16 @@ static int perf_copy_attr(struct perf_co
if (attr->read_format & ~(PERF_FORMAT_MAX-1))
return -EINVAL;

+ if (!attr->exclude_kernel) {
+ if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
+ return -EACCES;
+ }
+
+ if (attr->freq) {
+ if (attr->sample_freq > sysctl_perf_counter_sample_rate)
+ return -EINVAL;
+ }
+
out:
return ret;

@@ -3736,52 +3719,16 @@ err_size:
goto out;
}

-/**
- * sys_perf_counter_open - open a performance counter, associate it to a task/cpu
- *
- * @attr_uptr: event type attributes for monitoring/sampling
- * @pid: target pid
- * @cpu: target cpu
- * @group_fd: group leader counter fd
- */
-SYSCALL_DEFINE5(perf_counter_open,
- struct perf_counter_attr __user *, attr_uptr,
- pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+static int do_perf_counter_open(struct perf_counter_attr *attr,
+ struct perf_counter_context *ctx, int cpu, int group_fd)
{
struct perf_counter *counter, *group_leader;
- struct perf_counter_attr attr;
- struct perf_counter_context *ctx;
struct file *counter_file = NULL;
struct file *group_file = NULL;
int fput_needed = 0;
int fput_needed2 = 0;
int ret;

- /* for future expandability... */
- if (flags)
- return -EINVAL;
-
- ret = perf_copy_attr(attr_uptr, &attr);
- if (ret)
- return ret;
-
- if (!attr.exclude_kernel) {
- if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
- return -EACCES;
- }
-
- if (attr.freq) {
- if (attr.sample_freq > sysctl_perf_counter_sample_rate)
- return -EINVAL;
- }
-
- /*
- * Get the target context (task or percpu):
- */
- ctx = find_get_context(pid, cpu);
- if (IS_ERR(ctx))
- return PTR_ERR(ctx);
-
/*
* Look up the group leader (we will attach this counter to it):
*/
@@ -3810,11 +3757,11 @@ SYSCALL_DEFINE5(perf_counter_open,
/*
* Only a group leader can be exclusive or pinned
*/
- if (attr.exclusive || attr.pinned)
+ if (attr->exclusive || attr->pinned)
goto err_put_context;
}

- counter = perf_counter_alloc(&attr, cpu, ctx, group_leader,
+ counter = perf_counter_alloc(attr, cpu, ctx, group_leader,
GFP_KERNEL);
ret = PTR_ERR(counter);
if (IS_ERR(counter))
@@ -3857,6 +3804,68 @@ err_put_context:
goto out_fput;
}

+SYSCALL_DEFINE4(perf_counter_open_pid,
+ struct perf_counter_attr __user *, attr_uptr,
+ pid_t, pid, int, group_fd, unsigned long, flags)
+{
+ struct perf_counter_attr attr;
+ struct perf_counter_context *ctx;
+ int ret;
+
+ /* for future expandability... */
+ if (flags)
+ return -EINVAL;
+
+ ret = perf_copy_attr(attr_uptr, &attr);
+ if (ret)
+ return ret;
+
+ ctx = find_get_pid_context(pid);
+ if (IS_ERR(ctx))
+ return PTR_ERR(ctx);
+
+ return do_perf_counter_open(&attr, ctx, -1, group_fd);
+}
+
+SYSCALL_DEFINE4(perf_counter_open_cpu,
+ struct perf_counter_attr __user *, attr_uptr,
+ int, cpu, int, group_fd, unsigned long, flags)
+{
+ struct perf_counter_attr attr;
+ struct perf_counter_context *ctx;
+ struct perf_cpu_context *cpuctx;
+ int ret;
+
+ /* for future expandability... */
+ if (flags)
+ return -EINVAL;
+
+ ret = perf_copy_attr(attr_uptr, &attr);
+ if (ret)
+ return ret;
+
+ /* Must be root to operate on a CPU counter: */
+ if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+ return -EACCES;
+
+ if (cpu < 0 || cpu > num_possible_cpus())
+ return -EINVAL;
+
+ /*
+ * We could be clever and allow to attach a counter to an
+ * offline CPU and activate it when the CPU comes up, but
+ * that's for later.
+ */
+ if (!cpu_isset(cpu, cpu_online_map))
+ return -ENODEV;
+
+ cpuctx = &per_cpu(perf_cpu_context, cpu);
+ ctx = &cpuctx->ctx;
+ get_ctx(ctx);
+
+ return do_perf_counter_open(&attr, ctx, cpu, group_fd);
+}
+
/*
* inherit a counter from parent task to child task:
*/
@@ -4027,7 +4036,7 @@ void perf_counter_exit_task(struct task_
__perf_counter_task_sched_out(child_ctx);

/*
- * Take the context lock here so that if find_get_context is
+ * Take the context lock here so that if find_get_pid_context is
* reading child->perf_counter_ctxp, we wait until it has
* incremented the context's refcount before we do put_ctx below.
*/


2009-06-22 13:56:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl


* Christoph Hellwig <[email protected]> wrote:

> On Mon, Jun 22, 2009 at 01:49:31PM +0200, Ingo Molnar wrote:
> > > How do you justify your usage of ioctl() in this context?
> >
> > We can certainly do a separate sys_perf_counter_ctrl() syscall -
> > and we will do that if people think the extra syscall slot is
> > worth it in this case.
> >
> > The (mild) counter-argument so far was that the current ioctls
> > are very simple over "IO" attributes of counters:
> >
> > - enable
> > - disable
> > - reset
> > - refresh
> > - set-period
> >
> > So they could be considered 'IO controls' in the classic sense
> > and act as a (mild) exception to the 'dont use ioctls' rule.
> >
> > They are not some weird tacked-on syscall functionality - they
> > modify the IO properties of counters: on/off, value and rate. If
> > they go beyond that we'll put it all into a separate syscall and
> > deprecate the ioctl (which will have a relatively short
> > half-time due to the tools being hosted in the kernel repo).
> >
> > This could happen right now in fact, if people think it's worth it.
>
> Yet another multiplexer doesn't buy as anything over ioctls unless
> it adds more structure.
> PERF_COUNTER_IOC_ENABLE/PERF_COUNTER_IOC_DISABLE/
> PERF_COUNTER_IOC_RESET are calls without any argument, so it's
> kinda impossible to add more structure. perf_counter_refresh has
> an integer argument, and perf_counter_period aswell (with a
> slightly more complicated calling convention due to passing a
> pointer to the 64bit integer). I don't see how moving this to
> syscalls would improve things.

Yes - this is what kept us from moving it until now. But we are
ready and willing to add a sys_perf_counter_chattr() syscall to
change attributes. We are in the 'avoid ioctls' camp, but we are
not dogmatic about that. As you seem to agree, this seems to be one
of the narrow special cases where ioctls still make sense.

There _is_ another, more theoretical argument in favor of
sys_perf_counter_chattr(): it is quite conceivable that as usage of
perfcounters expands we want to change more and more attributes. So
even though right now the ioctl just about manages to serve this
role, it would be more future-proof to use sys_perf_counter_chattr()
and deprecate the ioctl() straight away - to not even leave a
_chance_ for some ioctl crap to seep into the API.

So ... we are on two minds about this, and if people dont mind a
second syscall entry, we are glad to add it.

> But talking about syscalls the sys_perf_counter_open prototype is
> really ugly - it uses either the pid or cpu argument which is a
> pretty clear indicator it should actually be two sys calls.
>
> Incomplete patch without touching the actuall wire-up below to
> demonstrate it:

Dunno, not sure i agree here. 'CPU ID' is a pretty natural expansion
of the usage of 'scope of counter'. The scope can be:

- task-self
- specific PID
- specific PID with inheritance
- specific CPU

It is not really 'multiplexing' completely unrelated things: a CPU
is an 'all PIDs running on a specific CPU' specifier. It is providing
an expanded definition of 'target context to be monitored'. Just
like signals have PID, group-PID and controlling-terminal type of
extensions. We dont really have syscalls for each of those either.

Also note that the syscall does not have different meanings
depending on whether it's a CPU counter or a specific-task counter
or a task-and-all-child-tasks counter. So it is not the ugly kind of
multiplexing that makes most ioctls such a jumbled mess.

If we were to unroll and expand _all_ such things in our current
300+ syscalls in the kernel we'd have thousands of syscalls. Do we
want that? Dunno. No strong feelings - but i dont think our current
syscalls are unclean, and i dont think we should insist on a model
that would have resulted in so many syscalls, were this enforced
from the get go.

Furthermore, having a 'target ID' concept does not make it harder to
expand the range of targets that we might want to monitor. Do we
want to expand the PID one with a PID-group notion perhaps? Or do we
want to add IDs for specific VMs perhaps? It does not change the
semantics, it only changes the (pretty orthogonal and isolated)
'target context' field's meaning.

Hope this helps,

Ingo

2009-06-22 14:19:25

by Rob Fowler

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS



Ingo Molnar wrote:
>> 3/ AMD IBS
>>
>> How is AMD IBS going to be implemented?
>>
>> IBS has two separate sets of registers. One to capture fetch
>> related data and another one to capture instruction execution
>> data. For each, there is one config register but multiple data
>> registers. In each mode, there is a specific sampling period and
>> IBS can interrupt.
>>
>> It looks like you could define two pseudo events or event types
>> and then define a new record_format and read_format. That formats
>> would only be valid for an IBS event.
>>
>> Is that how you intend to support IBS?
>
> That is indeed one of the ways we thought of, not really nice, but
> then, IBS is really weird, what were those AMD engineers thinking
> :-)

Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
processors. Those of us who used it there found it to be an extremely
powerful, low-overhead mechanism for directly collecting information about
how well the micro-architecture is performing. In particular, it can tell
you, not only which instructions take a long time to traverse the pipe, but
it also tells you which instructions delay other instructions and by how much.
This is extremely valuable if you are either working on instruction scheduling
in a compiler, or are modifying a program to give the compiler the opportunity
to do a good job.

A core group of engineers who worked on Alpha went on to AMD.

An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
open source community.

Code Analyst from AMD, also involving ex-DEC engineers, is
the only place where it is supported at all decently throughout the tool chain.
Last time I looked, there was a tweaked version of oprofile underlying CA.
I haven't checked to see whether the tweaks have migrated back into the oprofile
trunk.


>
> The point is - weird hardware gets expressed as a ... weird feature
> under perfcounters too. Not all hardware weirdnesses can be
> engineered away.
>

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[email protected]

2009-06-22 14:27:40

by Stephane Eranian

[permalink] [raw]
Subject: Re: II.1 - Fixed counters on Intel

On Mon, Jun 22, 2009 at 1:57 PM, Ingo Molnar<[email protected]> wrote:
>> II/ X86 comments
>>
>>  1/ Fixed counters on Intel
>>
>> You cannot simply fall back to generic counters if you cannot find
>> a fixed counter. There are model-specific bugs, for instance
>> UNHALTED_REFERENCE_CYCLES (0x013c), does not measure the same
>> thing on Nehalem when it is used in fixed counter 2 or a generic
>> counter. The same is true on Core.
>
> This could be handled via a model specific quirk, if the erratum is
> serious enough.

Better demonstrated with an actual example on a 2.4GHz Quad: 10s noploop

$ pfmon -v --us-c
-eunhalted_core_cycles,unhalted_reference_cycles,cpu_clk_unhalted:bus
-u noploop 10
[FIXED_CTRL(pmc16)=0xaa0 pmi0=1 en0=0x0 pmi1=1 en1=0x2 pmi2=1 en2=0x2]
UNHALTED_CORE_CYCLES UNHALTED_REFERENCE_CYCLES
[PERFEVTSEL0(pmc0)=0x51013c event_sel=0x3c umask=0x1 os=0 usr=1 en=1
int=1 inv=0 edge=0 cnt_mask=0] CPU_CLK_UNHALTED
noploop for 10 seconds
23,833,577,042 UNHALTED_CORE_CYCLES
23,853,100,716 UNHALTED_REFERENCE_CYCLES
2,650,345,853 CPU_CLK_UNHALTED:BUS

0x013c on fixed counter2 = unhalted_reference_cycles
0x013c on generic counter = cpu_clk_unhalted:bus

Difference is significant, isn't it?

And the question becomes: how do I express that I want the fixed counter version
of the event using the current PCL API, given both have the same encoding?
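
To make the ambiguity concrete, here is a sketch (not taken from any patch)
of the attribute a tool would pass today; nothing in it says which flavor
of the event is wanted:

        struct perf_counter_attr attr = {
                .type   = PERF_TYPE_RAW,
                .config = 0x013c,       /* UNHALTED_REFERENCE_CYCLES if it
                                           ends up on fixed counter 2, but
                                           CPU_CLK_UNHALTED:BUS if it ends up
                                           on a generic counter */
        };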


>
>> You cannot simply look at the event field code to determine
>> whether this is an event supported by a fixed counters. You must
>> look at the other fields such as edge, invert, cnt-mask. If those
>> are present then you have to fall back to using a generic counter
>> as fixed counters only support priv level filtering. As indicated
>> above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
>> counter does not count the same thing, therefore you need to fail
>> if filters other than priv levels are present on this event.
>
> Agreed, we'll fix this.
>

2009-06-22 14:39:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

On Mon, 2009-06-22 at 14:54 +0200, stephane eranian wrote:
> On Mon, Jun 22, 2009 at 2:35 PM, Peter Zijlstra<[email protected]> wrote:
> > On Mon, 2009-06-22 at 14:25 +0200, stephane eranian wrote:
> >> On Mon, Jun 22, 2009 at 1:52 PM, Ingo Molnar<[email protected]> wrote:
> >> >> 5/ Mmaped count
> >> >>
> >> >> It is possible to read counts directly from user space for
> >> >> self-monitoring threads. This leverages a HW capability present on
> >> >> some processors. On X86, this is possible via RDPMC.
> >> >>
> >> >> The full 64-bit count is constructed by combining the hardware
> >> >> value extracted with an assembly instruction and a base value made
> >> >> available thru the mmap. There is an atomic generation count
> >> >> available to deal with the race condition.
> >> >>
> >> >> I believe there is a problem with this approach given that the PMU
> >> >> is shared and that events can be multiplexed. That means that even
> >> >> though you are self-monitoring, events get replaced on the PMU.
> >> >> The assembly instruction is unaware of that, it reads a register
> >> >> not an event.
> >> >>
> >> >> On x86, assume event A is hosted in counter 0, thus you need
> >> >> RDPMC(0) to extract the count. But then, the event is replaced by
> >> >> another one which reuses counter 0. At the user level, you will
> >> >> still use RDPMC(0) but it will read the HW value from a different
> >> >> event and combine it with a base count from another one.
> >> >>
> >> >> To avoid this, you need to pin the event so it stays in the PMU at
> >> >> all times. Now, here is something unclear to me. Pinning does not
> >> >> mean stay in the SAME register, it means the event stays on the
> >> >> PMU but it can possibly change register. To prevent that, I
> >> >> believe you need to also set exclusive so that no other group can
> >> >> be scheduled, and thus possibly use the same counter.
> >> >>
> >> >> Looks like this is the only way you can make this actually work.
> >> >> Not setting pinned+exclusive, is another pitfall in which many
> >> >> people will fall into.
> >> >
> >> > do {
> >> > seq = pc->lock;
> >> >
> >> > barrier()
> >> > if (pc->index) {
> >> > count = pmc_read(pc->index - 1);
> >> > count += pc->offset;
> >> > } else
> >> > goto regular_read;
> >> >
> >> > barrier();
> >> > } while (pc->lock != seq);
> >> >
> >> > We don't see the hole you are referring to. The sequence lock
> >> > ensures you get a consistent view.
> >> >
> >> Let's take an example, with two groups, one event in each group.
> >> Both events scheduled on counter0, i.e,, rdpmc(0). The 2 groups
> >> are multiplexed, one each tick. The user gets 2 file descriptors
> >> and thus two mmap'ed pages.
> >>
> >> Suppose the user wants to read, using the above loop, the value of the
> >> event in the first group BUT it's the 2nd group that is currently active
> >> and loaded on counter0, i.e., rdpmc(0) returns the value of the 2nd event.
> >>
> >> Unless you tell me that pc->index is marked invalid (0) when the
> >> event is not scheduled. I don't see how you can avoid reading
> >> the wrong value. I am assuming that is the event is not scheduled
> >> lock remains constant.
> >
> > Indeed, pc->index == 0 means its not currently available.
>
> I don't see where you clear that field on x86.

x86 doesn't have this feature fully implemented yet... it's still on the
todo list. Paulus started this on power, so it should work there.

> Looks like it comes from hwc->idx. I suspect you need
> to do something in x86_pmu_disable() to be symmetrical
> with x86_pmu_enable().

Right.

> I suspect something similar needs to be done on Power.

It looks like the power disable method does indeed do this:

if (counter->hw.idx) {
write_pmc(counter->hw.idx, 0);
counter->hw.idx = 0;
}
perf_counter_update_userpage(counter);


The below might suffice for x86... but it's not really nice.
Power already has that whole +1 thing in its ->idx field, x86 does not.
So I either munge x86 or add something like I did below...

Paul, any suggestions?

---
Index: linux-2.6/arch/x86/kernel/cpu/perf_counter.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_counter.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_counter.c
@@ -912,6 +912,8 @@ x86_perf_counter_set_period(struct perf_
err = checking_wrmsrl(hwc->counter_base + idx,
(u64)(-left) & x86_pmu.counter_mask);

+ perf_counter_update_userpage(counter);
+
return ret;
}

@@ -1041,6 +1043,8 @@ try_generic:
x86_perf_counter_set_period(counter, hwc, idx);
x86_pmu.enable(hwc, idx);

+ perf_counter_update_userpage(counter);
+
return 0;
}

@@ -1133,6 +1137,8 @@ static void x86_pmu_disable(struct perf_
x86_perf_counter_update(counter, hwc, idx);
cpuc->counters[idx] = NULL;
clear_bit(idx, cpuc->used_mask);
+
+ perf_counter_update_userpage(counter);
}

/*
Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1753,6 +1753,14 @@ int perf_counter_task_disable(void)
return 0;
}

+static int perf_counter_index(struct perf_counter *counter)
+{
+ if (counter->state != PERF_COUNTER_STATE_ACTIVE)
+ return 0;
+
+ return counter->hw.idx + 1 - PERF_COUNTER_INDEX_OFFSET;
+}
+
/*
* Callers need to ensure there can be no nesting of this function, otherwise
* the seqlock logic goes bad. We can not serialize this because the arch
@@ -1777,7 +1785,7 @@ void perf_counter_update_userpage(struct
preempt_disable();
++userpg->lock;
barrier();
- userpg->index = counter->hw.idx;
+ userpg->index = perf_counter_index(counter);
userpg->offset = atomic64_read(&counter->count);
if (counter->state == PERF_COUNTER_STATE_ACTIVE)
userpg->offset -= atomic64_read(&counter->hw.prev_count);
Index: linux-2.6/arch/powerpc/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/perf_counter.h
+++ linux-2.6/arch/powerpc/include/asm/perf_counter.h
@@ -61,6 +61,8 @@ struct pt_regs;
extern unsigned long perf_misc_flags(struct pt_regs *regs);
extern unsigned long perf_instruction_pointer(struct pt_regs *regs);

+#define PERF_COUNTER_INDEX_OFFSET 1
+
/*
* Only override the default definitions in include/linux/perf_counter.h
* if we have hardware PMU support.
Index: linux-2.6/arch/x86/include/asm/perf_counter.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_counter.h
+++ linux-2.6/arch/x86/include/asm/perf_counter.h
@@ -87,6 +87,9 @@ union cpuid10_edx {
#ifdef CONFIG_PERF_COUNTERS
extern void init_hw_perf_counters(void);
extern void perf_counters_lapic_init(void);
+
+#define PERF_COUNTER_INDEX_OFFSET 0
+
#else
static inline void init_hw_perf_counters(void) { }
static inline void perf_counters_lapic_init(void) { }

2009-06-22 14:54:21

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.11 - Missing definitions for generic events

On Mon, Jun 22, 2009 at 1:56 PM, Ingo Molnar<[email protected]> wrote:
>> 11/ Missing definitions for generic hardware events
>>
>> As soon as you define generic events, you need to provide a clear
>> and precise definition at to what they measure. This is crucial to
>> make them useful. I have not seen such a definition yet.
>
> Do you mean their names aren't clear enough? :-)
>
No, I'd like to see a definition behind every name:

PERF_COUNT_HW_CPU_CYCLES: impacted by freq scaling or not?
PERF_COUNT_HW_INSTRUCTIONS
PERF_COUNT_HW_CACHE_REFERENCES: what cache level, data, code?
PERF_COUNT_HW_CACHE_MISSES: what cache level, data, code?
PERF_COUNT_HW_BRANCH_INSTRUCTIONS: taken, not-taken?
PERF_COUNT_HW_BRANCH_MISSES
PERF_COUNT_HW_BUS_CYCLES: what bus?

Take BUS_CYCLES: based on my example on Core with UNHALTED_REFERENCE_CYCLES,
without a clear definition it seems hard to know whether you need to map it
to 0x13c on a generic counter or on fixed counter 2.

Having clearly spelled out definitions helps port PCL to other processors,
and it also helps users understand which event they need to select. Users
should not have to dig through the code to find the actual mapping for each
PMU to understand what the events actually measure.

2009-06-22 17:42:51

by Arnd Bergmann

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Monday 22 June 2009, Ingo Molnar wrote:
> There is another, more theoretical argument in favor of
> sys_perf_counter_chattr(): it is quite conceivable that as usage of
> perfcounters expands we want to change more and more attributes. So
> even though right now the ioctl just about manages to serve this
> role, it would be more future-proof to use sys_perf_counter_chattr()
> and deprecate the ioctl() straight away - to not even leave a
> chance for some ioctl crap to seep into the API.
>
> So ... we are on two minds about this, and if people dont mind a
> second syscall entry, we are glad to add it.

I think adding one or more system calls is definitely worth it
if that means getting rid of the ioctl interface here. While I
don't generally mind adding ioctl calls, I would much prefer to
restrict their use to device files, sockets and to the existing
cases for regular files.

Conceptually, ioctl is a different class of interface from the
'new system call' case, in a number of ways. For new subsystems
I would just never mix them by allowing ioctl on something that
was not returned by open() or socket().

Arnd <><

2009-06-22 17:58:24

by Maynard Johnson

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS

Rob Fowler <[email protected]> wrote on 06/22/2009 09:08:34 AM:

>
>
> Ingo Molnar wrote:
> >> 3/ AMD IBS
> >>
> >> How is AMD IBS going to be implemented?
> >>
> >> IBS has two separate sets of registers. One to capture fetch
> >> related data and another one to capture instruction execution
> >> data. For each, there is one config register but multiple data
> >> registers. In each mode, there is a specific sampling period and
> >> IBS can interrupt.
> >>
> >> It looks like you could define two pseudo events or event types
> >> and then define a new record_format and read_format. That formats
> >> would only be valid for an IBS event.
> >>
> >> Is that how you intend to support IBS?
> >
> > That is indeed one of the ways we thought of, not really nice, but
> > then, IBS is really weird, what were those AMD engineers thinking
> > :-)
>
> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
> processors. Those of us who used it there found it to be an extremely
> powerful, low-overhead mechanism for directly collecting information about
> how well the micro-architecture is performing. In particular, it can tell
> you, not only which instructions take a long time to traverse the pipe, but
> it also tells you which instructions delay other instructions and by how much.
> This is extremely valuable if you are either working on instruction scheduling
> in a compiler, or are modifying a program to give the compiler the opportunity
> to do a good job.
>
> A core group of engineers who worked on Alpha went on to AMD.
>
> An unfortunate problem with IBS on AMD is that good support isn't common
> in the "mainstream" open source community.
>
> Code Analyst from AMD, also involving ex-DEC engineers, is
> the only place where it is supported at all decently throughout the
> tool chain.
> Last time I looked, there was a tweaked version of oprofile underlying CA.
> I haven't checked to see whether the tweaks have migrated back into
> the oprofile trunk.
Yes, IBS on AMD is now supported upstream in mainline, contributed by
Suravee Suthikulpanit.

-Maynard
>
>
> >
> > The point is - weird hardware gets expressed as a ... weird feature
> > under perfcounters too. Not all hardware weirdnesses can be
> > engineered away.

2009-06-22 19:17:54

by Stephane Eranian

[permalink] [raw]
Subject: Re: IV.3 - AMD IBS

On Mon, Jun 22, 2009 at 2:00 PM, Ingo Molnar<[email protected]> wrote:
>> 3/ AMD IBS
>>
>> How is AMD IBS going to be implemented?
>>
>> IBS has two separate sets of registers. One to capture fetch
>> related data and another one to capture instruction execution
>> data. For each, there is one config register but multiple data
>> registers. In each mode, there is a specific sampling period and
>> IBS can interrupt.
>>
>> It looks like you could define two pseudo events or event types
>> and then define a new record_format and read_format. That formats
>> would only be valid for an IBS event.
>>
>> Is that how you intend to support IBS?
>
> That is indeed one of the ways we thought of, not really nice, but
> then, IBS is really weird, what were those AMD engineers thinking
> :-)
>

What's your problem with IBS? You need to elaborate when you make
this kind of comment. Remember what we discussed back in December.
If hardware designers put that in, it's because they think it can deliver
value-add. It may be difficult to use, but it delivers interesting data.


> The point is - weird hardware gets expressed as a ... weird feature
> under perfcounters too. Not all hardware weirdnesses can be
> engineered away.
>

2009-06-22 19:45:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.2 - Grouping

On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<[email protected]> wrote:
>> 2/ Grouping
>>
>> By design, an event can only be part of one group at a time.

As I read this again, another question came up. Is the statement
above also true for the group leader?


>> Events in a group are guaranteed to be active on the PMU at the
>> same time. That means a group cannot have more events than there
>> are available counters on the PMU. Tools may want to know the
>> number of counters available in order to group their events
>> accordingly, such that reliable ratios could be computed. It seems
>> the only way to know this is by trial and error. This is not
>> practical.
>
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.
>
But then, does that mean that users need to be aware of constraints
to form groups? Groups are formed by users. I thought the whole point
of the API was to hide that kind of hardware complexity.

Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
For instance, I sample the IP and the values of the other events in
my group. Grouping looks like the only way to sample the Itanium
branch buffers (BTB), for instance. You program an event in a generic
counter which counts the number of entries recorded in the buffer.
Thus, to sample the BTB, you sample on this event, and when it overflows
you grab the content of the BTB. Here, the event and the BTB are tied
together. You cannot count the event in one group and have the BTB
in another one (BTB = 1 config reg + 32 data registers + 1 position reg).



> The ideal model to tooling is relatively independent PMU registers
> (counters) with little constraints - most modern CPUs meet that
> model.
>
I agree that would be really good. But that's not what we have.

> All the existing tooling (tools/perf/) operates on that model and
> this leads to easy programmability and flexible results. This model
> needs no grouping of counters.
>
> Could you please cite specific examples in terms of tools/perf/?
> What feature do you think needs to know more about constraints? What
> is the specific win in precision we could achieve via that?
>
See above example.

2009-06-22 20:02:46

by Stephane Eranian

[permalink] [raw]
Subject: Re: IV.5 - Intel Last Branch Record (LBR)

On Mon, Jun 22, 2009 at 2:01 PM, Ingo Molnar<[email protected]> wrote:
>> 5/ Intel Last Branch Record (LBR)
>>
>> Intel processors since Netburst have a cyclic buffer hosted in
>> registers which can record taken branches. Each taken branch is
>> stored into a pair of LBR registers (source, destination). Up
>> until Nehalem, there was not filtering capabilities for LBR. LBR
>> is not an architected PMU feature.
>>
>> There is no counter associated with LBR. Nehalem has a LBR_SELECT
>> MSR. However there are some constraints on it given it is shared
>> by threads.
>>
>> LBR is only useful when sampling and therefore must be combined
>> with a counter. LBR must also be configured to freeze on PMU
>> interrupt.
>>
>> How is LBR going to be supported?
>
> If there's interest then one sane way to support it would be to
> expose it as a new sampling format (PERF_SAMPLE_*).
>
LBR is very important; it becomes usable with Nehalem, where you
can filter on priv level. It is important for statistical basic block
profiling, for instance. Another important feature is its ability to freeze
on PMU interrupt.

LBR is also interesting because it yields a path to an event.

LBR on NHM (and others) is not that easy to handle because:
- need to read-modify-write IA32_DEBUGCTL
- LBR_TOS, the position pointer is read-only
- LBR_SELECT to configure LBR is shared at the core-level on NHM

but it is very much worthwhile.

> Regarding the constraints - if we choose to expose the branch-type
> filtering capabilities of Nehalem, then that puts a constraint on
> counter scheduling: two counters with conflicting constraints should
> not be scheduled at once, but should be time-shared via the usual
> mechanism.
>
You need to expose the branch filtering in some way. The return
branch filter is useful for statistical call graph sampling, for instance.
If you can't disable the other types of branches, then the relevance
of the data drops.

> The typical use-case would be to have no or compatible LBR filter
> attributes between counters though - so having the conflicts is not
> an issue as long as it works according to the usual model.
>
Conflict arises when two events request different filter values. The conflict
happens in both per-thread and per-cpu mode when HT is on. In per-cpu
mode it can be controlled at the core level. But in per-thread mode, it
has to be managed globally as a thread may migrate. Multiplexing
as it is implemented manages groups attached to one thread or CPU.
Here it would have to look across a pair of CPUs (or threads) and
ensure that no two groups using LBR with conflicting filter selections
can be scheduled at the same time on the two hyper-threads of the
same core. I think it may be easier, for now, to say LBR_SELECT
is global: first come, first served.

2009-06-22 21:38:34

by Corey Ashford

[permalink] [raw]
Subject: Re: I.2 - Grouping



Ingo Molnar wrote:
>> 2/ Grouping
>>
>> By design, an event can only be part of one group at a time.
>> Events in a group are guaranteed to be active on the PMU at the
>> same time. That means a group cannot have more events than there
>> are available counters on the PMU. Tools may want to know the
>> number of counters available in order to group their events
>> accordingly, such that reliable ratios could be computed. It seems
>> the only way to know this is by trial and error. This is not
>> practical.
>
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.
>
> The ideal model to tooling is relatively independent PMU registers
> (counters) with little constraints - most modern CPUs meet that
> model.
>
> All the existing tooling (tools/perf/) operates on that model and
> this leads to easy programmability and flexible results. This model
> needs no grouping of counters.
>
> Could you please cite specific examples in terms of tools/perf/?
> What feature do you think needs to know more about constraints? What
> is the specific win in precision we could achieve via that?


An example of this is that a user wants to monitor 10 events, and we have four
counters to work with. Let's assume there is some mapping of events to counters
where you need only 3 groups to schedule the 10 events onto the PMU. If you
leave it to the kernel (and don't group the events from user space), depending
on the kernel's fast event scheduling algorithm, it may take 6 groups to get all
of the requested events counted. This leads to lower counts in the counters,
and more chance for the counters to miss event bursts, which leads to less
accurate scaled results.

Currently the PAPI substrate for PCL does do this partitioning using a very dumb
algorithm. But it could be improved, particularly if there was some better way
to get feedback from the kernel other than a "yes, these fit" or "no, these
don't fit". I'm not sure what that way would be, though. Perhaps an ioctl that
does some sort of "dry scheduling" of events to groups in an optimal way.
This call would not need to lock any resources, and just use the kernel's
algorithm for event constraint checking.

To me, this is not a big issue, but some sort of better mechanism might be
considered for a future update.
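
Purely to illustrate that "dry scheduling" idea, here is a hypothetical
sketch; neither the structure nor the ioctl exists in PCL and the names
are invented:

        /* hypothetical, for illustration only */
        struct perf_dry_schedule {
                __u32                            nr_events; /* in:  attrs supplied */
                __u32                            nr_groups; /* out: groups the kernel
                                                               would need for them */
                struct perf_counter_attr __user *attrs;     /* in:  events to partition */
        };

        #define PERF_COUNTER_IOC_DRY_SCHEDULE _IOWR('$', 7, struct perf_dry_schedule)

Such a call could reuse the kernel's own constraint checking without
locking any PMU resources.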

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
Beaverton, OR
503-578-3507
[email protected]

2009-06-22 22:04:18

by Corey Ashford

[permalink] [raw]
Subject: Re: I.2 - Grouping

stephane eranian wrote:
> On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<[email protected]> wrote:
>>> 2/ Grouping
>>>
>>> By design, an event can only be part of one group at a time.
>
> As I read this again, another question came up. Is the statement
> above also true for the group leader?
>
>
>>> Events in a group are guaranteed to be active on the PMU at the
>>> same time. That means a group cannot have more events than there
>>> are available counters on the PMU. Tools may want to know the
>>> number of counters available in order to group their events
>>> accordingly, such that reliable ratios could be computed. It seems
>>> the only way to know this is by trial and error. This is not
>>> practical.
>> Groups are there to support heavily constrained PMUs, and for them
>> this is the only way, as there is no simple linear expression for
>> how many counters one can load on the PMU.
>>
> But then, does that mean that users need to be aware of constraints
> to form groups. Group are formed by users. I thought, the whole point
> of the API was to hide that kind of hardware complexity.
>
> Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
> For instance, I sample the IP and the values of the other events in
> my group. Grouping looks like the only way to sampling on Itanium
> branch buffers (BTB), for instance. You program an event in a generic
> counter which counts the number of entries recorded in the buffer.
> Thus, to sample the BTB, you sample on this event, when it overflows
> you grab the content of the BTB. Here, the event and the BTB are tied
> together. You cannot count the event in one group, and have the BTB
> in another one (BTB = 1 config reg + 32 data registers + 1 position reg).
>
>
>

Since I'm getting these posts from several different sources, I missed Stephane's reply.

I can see the issue is deeper and more subtle than I had imagined.

Stephane, if you were to just place into groups those events which must be
correlated, and just leave all of the others not grouped, wouldn't this solve
the problem? The kernel would be free to schedule the other events how and when
it can, but would guarantee that your correlated events are on the PMU
simultaneously.

--
Regards,

- Corey

Corey Ashford
Software Engineer
IBM Linux Technology Center, Linux Toolchain
[email protected]

2009-06-23 00:33:20

by Paul Mackerras

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

stephane eranian writes:

> Unless you tell me that pc->index is marked invalid (0) when the
> event is not scheduled. I don't see how you can avoid reading
> the wrong value. I am assuming that is the event is not scheduled
> lock remains constant.

That's what happens; when pc->index == 0, the event is not scheduled
and the current count is in pc->offset.

> Assuming the event is active when you enter the loop and you
> read a value. How to get the timing information to scale the
> count?

At present you'd have to do a read(). It wouldn't be hard to add
fields to the mmapped page to enable the user program to compute
up-to-date timing information from the TSC (x86) or timebase (powerpc)
value. We'll do that.

Paul.

2009-06-23 00:47:17

by Paul Mackerras

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

stephane eranian writes:

> I don't see where you clear that field on x86.
> Looks like it comes from hwc->idx. I suspect you need
> to do something in x86_pmu_disable() to be symmetrical
> with x86_pmu_enable().
>
> I suspect something similar needs to be done on Power.

power_pmu_disable() already calls perf_counter_update_userpage,
so I think we're OK there.

Paul.

2009-06-23 00:47:31

by Paul Mackerras

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

Peter Zijlstra writes:

> I think we would have to add that do the data page,.. something like the
> below?
>
> Paulus?
>
> ---
> Index: linux-2.6/include/linux/perf_counter.h
> ===================================================================
> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
> __u32 lock; /* seqlock for synchronization */
> __u32 index; /* hardware counter identifier */
> __s64 offset; /* add to hardware counter value */
> + __u64 total_time; /* total time counter active */
> + __u64 running_time; /* time counter on cpu */
> +
> + __u64 __reserved[123]; /* align at 1k */
>
> /*
> * Control data for the mmap() data buffer.
> Index: linux-2.6/kernel/perf_counter.c
> ===================================================================
> --- linux-2.6.orig/kernel/perf_counter.c
> +++ linux-2.6/kernel/perf_counter.c
> @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
> if (counter->state == PERF_COUNTER_STATE_ACTIVE)
> userpg->offset -= atomic64_read(&counter->hw.prev_count);
>
> + userpg->total_time = counter->total_time_enabled +
> + atomic64_read(&counter->child_total_time_enabled);
> +
> + userpg->running_time = counter->total_time_running +
> + atomic64_read(&counter->child_total_time_running);

Hmmm, when the counter is running, what you want is not so much the
total time so far as a way to compute the total time so far from the
current TSC/timebase value. So we would need to export tstamp_enabled
and tstamp_running plus a scale/offset for converting the TSC/timebase
value to nanoseconds consistent with ctx->time. On powerpc that's
pretty straightforward because the timebases, but on x86 I gather the
offset and maybe also the scale would need to be per-cpu (which is OK,
because all the values in the mmapped page are only useful on one
specific CPU).

How would we compute the scale and offset on x86, given the current
TSC value and ctx->time?

Paul.

2009-06-23 05:16:46

by Paul Mackerras

[permalink] [raw]
Subject: Re: I.2 - Grouping

Ingo Molnar writes:

> > 2/ Grouping
> >
> > By design, an event can only be part of one group at a time.

To clarify this statement of Stephane's, a _counter_ can only be in
one group. You can have multiple counters counting the same _event_
and those counters can (obviously) be in different groups.

> > Events in a group are guaranteed to be active on the PMU at the
> > same time. That means a group cannot have more events than there
> > are available counters on the PMU. Tools may want to know the
> > number of counters available in order to group their events
> > accordingly, such that reliable ratios could be computed. It seems
> > the only way to know this is by trial and error. This is not
> > practical.
>
> Groups are there to support heavily constrained PMUs, and for them
> this is the only way, as there is no simple linear expression for
> how many counters one can load on the PMU.

That's not the only reason for having groups, or even the main reason
IMO. The main reason for having groups is to provide a way to ask the
scheduler to ensure that two or more counters are always scheduled
together, so that you can meaningfully do arithmetic operations on the
counter values that would be sensitive to the statistical noise
introduced by the scheduling, such as ratios and differences.

In other words, grouping is there because we don't guarantee to have
all counters scheduled onto the PMU whenever possible. Heavily
constrained PMUs increase the need for scheduling, but even if
counters are completely orthogonal there are only a fixed number of
them so we still need to schedule counters at some point.

Paul.
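
As a concrete illustration of that point, grouping two counters so that a
miss ratio is computed over exactly the same scheduling intervals could
look like the fragment below; perf_counter_open() stands in for the raw
system call used elsewhere in this thread, and the fragment is assumed to
sit inside a setup function:

        struct perf_counter_attr refs = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CACHE_REFERENCES,
        };
        struct perf_counter_attr misses = {
                .type   = PERF_TYPE_HARDWARE,
                .config = PERF_COUNT_HW_CACHE_MISSES,
        };

        /* the leader creates the group; the sibling joins via group_fd */
        int leader  = perf_counter_open(&refs,   getpid(), -1, -1,     0);
        int sibling = perf_counter_open(&misses, getpid(), -1, leader, 0);

        /* both counters go on and off the PMU together, so misses/refs
           is not distorted by scheduling noise */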

2009-06-23 06:13:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

On Tue, 2009-06-23 at 10:39 +1000, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > I think we would have to add that do the data page,.. something like the
> > below?
> >
> > Paulus?
> >
> > ---
> > Index: linux-2.6/include/linux/perf_counter.h
> > ===================================================================
> > --- linux-2.6.orig/include/linux/perf_counter.h
> > +++ linux-2.6/include/linux/perf_counter.h
> > @@ -232,6 +232,10 @@ struct perf_counter_mmap_page {
> > __u32 lock; /* seqlock for synchronization */
> > __u32 index; /* hardware counter identifier */
> > __s64 offset; /* add to hardware counter value */
> > + __u64 total_time; /* total time counter active */
> > + __u64 running_time; /* time counter on cpu */
> > +
> > + __u64 __reserved[123]; /* align at 1k */
> >
> > /*
> > * Control data for the mmap() data buffer.
> > Index: linux-2.6/kernel/perf_counter.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/perf_counter.c
> > +++ linux-2.6/kernel/perf_counter.c
> > @@ -1782,6 +1782,12 @@ void perf_counter_update_userpage(struct
> > if (counter->state == PERF_COUNTER_STATE_ACTIVE)
> > userpg->offset -= atomic64_read(&counter->hw.prev_count);
> >
> > + userpg->total_time = counter->total_time_enabled +
> > + atomic64_read(&counter->child_total_time_enabled);
> > +
> > + userpg->running_time = counter->total_time_running +
> > + atomic64_read(&counter->child_total_time_running);
>
> Hmmm, when the counter is running, what you want is not so much the
> total time so far as a way to compute the total time so far from the
> current TSC/timebase value. So we would need to export tstamp_enabled
> and tstamp_running plus a scale/offset for converting the TSC/timebase
> value to nanoseconds consistent with ctx->time. On powerpc that's
> pretty straightforward because the timebases, but on x86 I gather the
> offset and maybe also the scale would need to be per-cpu (which is OK,
> because all the values in the mmapped page are only useful on one
> specific CPU).
>
> How would we compute the scale and offset on x86, given the current
> TSC value and ctx->time?

With pain and suffering ;-)

The userpage would have to provide a multiplier and offset, and we'd
have to register a cpufreq notifier hook and iterate all active counters
and update these mult,offset bits when the cpu freq changes.

An alternative could be to simply ensure we update these timestamps at
least once per RR interval (tick); that way the times are more or
less recent and could still be used for scaling purposes.

The most important data in these timestamps is their ratio, not their
absolute value, therefore if we keep the ratio statistically significant
we're good enough.
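
In user-space terms, the scaling that this ratio feeds into would be
something like the sketch below, assuming the total_time/running_time
fields proposed earlier in the thread:

        /* scale a count for the fraction of its enabled time the counter
           actually spent on the PMU; only the ratio of the two times matters */
        static unsigned long long
        scale_count(unsigned long long count,
                    unsigned long long total_time,
                    unsigned long long running_time)
        {
                if (!running_time)
                        return count;   /* never scheduled, nothing to scale */

                return (unsigned long long)
                        ((double)count * total_time / running_time);
        }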

2009-06-23 06:19:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS

On Mon, 2009-06-22 at 10:08 -0400, Rob Fowler wrote:
> Ingo Molnar wrote:
> >> 3/ AMD IBS
> >>
> >> How is AMD IBS going to be implemented?
> >>
> >> IBS has two separate sets of registers. One to capture fetch
> >> related data and another one to capture instruction execution
> >> data. For each, there is one config register but multiple data
> >> registers. In each mode, there is a specific sampling period and
> >> IBS can interrupt.
> >>
> >> It looks like you could define two pseudo events or event types
> >> and then define a new record_format and read_format. That formats
> >> would only be valid for an IBS event.
> >>
> >> Is that how you intend to support IBS?
> >
> > That is indeed one of the ways we thought of, not really nice, but
> > then, IBS is really weird, what were those AMD engineers thinking
> > :-)
>
> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
> processors. Those of us who used it there found it to be an extremely
> powerful, low-overhead mechanism for directly collecting information about
> how well the micro-architecture is performing. In particular, it can tell
> you, not only which instructions take a long time to traverse the pipe, but
> it also tells you which instructions delay other instructions and by how much.
> This is extremely valuable if you are either working on instruction scheduling
> in a compiler, or are modifying a program to give the compiler the opportunity
> to do a good job.
>
> A core group of engineers who worked on Alpha went on to AMD.
>
> An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
> open source community.

The 'problem' I have with IBS is that it's basically a cycle counter
coupled with a pretty arbitrary number of output dimensions separated
into two groups, ops and fetches.

This is a very weird configuration in that it has a mismatch with the
traditional one-value-per-counter model.

The most natural way to support IBS would be to have a special sampling
cycle counter and use that as group lead and add non sampling siblings
to that group to get individual elements.

This is however quite cumbersome.

One thing to consider when building an IBS interface is its future
extensibility. In which fashion would IBS be extended: additional
output dimensions, or something else altogether?
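
A rough sketch of that "sampling leader plus non-sampling siblings" shape;
the IBS pseudo-event encodings are invented for illustration and
perf_counter_open() stands in for the raw system call:

        /* IBS_OP_PSEUDO_EVENT and IBS_OP_DATA_DIMENSION are hypothetical */
        struct perf_counter_attr leader_attr = {
                .type          = PERF_TYPE_RAW,
                .config        = IBS_OP_PSEUDO_EVENT,
                .sample_period = 100000,
                .sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_GROUP,
        };
        struct perf_counter_attr sibling_attr = {
                .type   = PERF_TYPE_RAW,
                .config = IBS_OP_DATA_DIMENSION,
        };

        int leader  = perf_counter_open(&leader_attr,  getpid(), -1, -1,     0);
        int sibling = perf_counter_open(&sibling_attr, getpid(), -1, leader, 0);

Each sample from the leader would then carry the sibling values via
PERF_SAMPLE_GROUP, which is roughly the cumbersome shape described above.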

2009-06-23 07:36:35

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.2 - Grouping

On Tue, Jun 23, 2009 at 7:16 AM, Paul Mackerras<[email protected]> wrote:
> Ingo Molnar writes:
>
>> > 2/ Grouping
>> >
>> > By design, an event can only be part of one group at a time.
>
> To clarify this statement of Stephane's, a _counter_ can only be in
> one group.  You can have multiple counters counting the same _event_
> and those counters can (obviously) be in different groups.
>
Okay.

What happens if I do:
fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);

And then:
fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);

>> > Events in a group are guaranteed to be active on the PMU at the
>> > same time. That means a group cannot have more events than there
>> > are available counters on the PMU. Tools may want to know the
>> > number of counters available in order to group their events
>> > accordingly, such that reliable ratios could be computed. It seems
>> > the only way to know this is by trial and error. This is not
>> > practical.
>>
>> Groups are there to support heavily constrained PMUs, and for them
>> this is the only way, as there is no simple linear expression for
>> how many counters one can load on the PMU.
>
> That's not the only reason for having groups, or even the main reason
> IMO.  The main reason for having groups is to provide a way to ask the
> scheduler to ensure that two or more counters are always scheduled
> together, so that you can meaningfully do arithmetic operations on the
> counter values that would be sensitive to the statistical noise
> introduced by the scheduling, such as ratios and differences.
>
> In other words, grouping is there because we don't guarantee to have
> all counters scheduled onto the PMU whenever possible.  Heavily
> constrained PMUs increase the need for scheduling, but even if
> counters are completely orthogonal there are only a fixed number of
> them so we still need to schedule counters at some point.
>
I agree completely.

2009-06-23 07:40:32

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.5 - Mmaped count

Paul,

On Tue, Jun 23, 2009 at 2:39 AM, Paul Mackerras<[email protected]> wrote:
>
> Hmmm, when the counter is running, what you want is not so much the
> total time so far as a way to compute the total time so far from the
> current TSC/timebase value.  So we would need to export tstamp_enabled
> and tstamp_running plus a scale/offset for converting the TSC/timebase
> value to nanoseconds consistent with ctx->time.  On powerpc that's
> pretty straightforward because the timebases, but on x86 I gather the
> offset and maybe also the scale would need to be per-cpu (which is OK,
> because all the values in the mmapped page are only useful on one
> specific CPU).
>
I think you should make reading via mmap and read() equivalent; one is
just lower overhead than the other. Otherwise it would make things more
difficult for tools in the multiplexing case, where they could fall back
to read() and would then not get the same information.
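
As an aside, this is what the read() fallback looks like for a tool when
PERF_FORMAT_TOTAL_TIME_ENABLED and PERF_FORMAT_TOTAL_TIME_RUNNING are
set (a sketch; it assumes the {value, time_enabled, time_running}
record layout):

#include <stdint.h>
#include <unistd.h>

struct scaled_read {
        uint64_t value;         /* raw count                      */
        uint64_t time_enabled;  /* ns the counter was enabled     */
        uint64_t time_running;  /* ns it was actually on the PMU  */
};

/* Scale a multiplexed count up to an estimate of its full value. */
static uint64_t read_scaled_count(int fd)
{
        struct scaled_read r;

        if (read(fd, &r, sizeof(r)) != sizeof(r) || !r.time_running)
                return 0;

        return (uint64_t)((double)r.value * r.time_enabled / r.time_running);
}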


> How would we compute the scale and offset on x86, given the current
> TSC value and ctx->time?
>
> Paul.
>

2009-06-23 08:19:28

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS

On Tue, Jun 23, 2009 at 8:19 AM, Peter Zijlstra<[email protected]> wrote:
>
> The 'problem' I have with IBS is that its basically a cycle counter
> coupled with a pretty arbitrary number of output dimensions separated
> into two groups, ops and fetches.
>
Well, that's your view. Mine is different.

You have two independent cycle counters (one for fetch, one for op),
each coupled with a value. It is just that the value does not fit into
64 bits, and the cycle count is not hosted in a generic counter but in
its own register. The captured data for fetch is constructed from
IBSFETCHCTL, IBSFETCHLINAD and IBSFETCHPHYSAD. The three registers are
tied together; the values they contain represent the same fetch event.
Same thing with IBS op. I don't see the problem, because your API is
able to cope with variable-length output data; the sampling buffer
certainly can. Of course, internally you'd have to manage it in a
special way, but you already do this for fixed counters, don't you?

> This is a very weird configuration in that it has a miss-match with the
> traditional one value per counter thing.
>
This is not the universal model. I can give lots of examples on Itanium
where you have one config register and multiple data registers to
capture the event: branch trace buffer (1 config, 33 data), Data EAR
(cache/TLB miss sampling, 1 config, 3 data), Instruction EAR
(cache/TLB miss sampling, 1 config, 2 data).

> The most natural way to support IBS would be to have a special sampling
> cycle counter and use that as group lead and add non sampling siblings
> to that group to get individual elements.
>
As discussed in my message, I think the way to support IBS is to create two
pseudo-events (like your perf_hw_event_ids), one for fetch and one for op
(because they could be measured simultaneously). The sample_period field
would be used to express the IBS*CTL maxcnt, subject to the verification
that the bottom 4 bits must be 0. And then, you add two new sampling formats
PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
with IBS pseudo events. Once you have the randomize option in perf_counter_attr,
you could even enable IBSFETCH randomization.

What is wrong with this approach?
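
To make the proposal concrete, a sketch of what the attribute setup
could look like; PERF_COUNT_HW_IBS_FETCH and PERF_SAMPLE_IBSFETCH are
the names proposed above and do not exist in the current API:

static int open_ibs_fetch(void)
{
        struct perf_counter_attr attr = { 0 };

        attr.type          = PERF_TYPE_HARDWARE;
        attr.config        = PERF_COUNT_HW_IBS_FETCH;   /* proposed pseudo-event            */
        attr.sample_period = 0x10000;                   /* maxcnt: bottom 4 bits must be 0  */
        attr.sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_IBSFETCH; /* proposed format      */

        return sys_perf_counter_open(&attr, getpid(), -1, -1, 0);
}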

Another question is: how do you present the values contained in the IBS data
registers:
1 - leave it as raw (tool parses the raw register values)
2 - decode it in the kernel and expose your own format

With 1/, you'd automatically pick up new fields if AMD adds some.
With 2/, you'd have to change your format if AMD changes theirs.



> This is however quite cumbersome.
>
> One thing to consider when building an IBS interface is its future
> extensibility. In which fashion would IBS be extended?, additional
> output dimensions or something else all-together?
>
I don't know, but it would be nice to provide better filtering
capabilities. For that, they could use some of the reserved bits they
already have; no need to add more data registers.

2009-06-23 08:26:53

by Paul Mackerras

[permalink] [raw]
Subject: Re: I.2 - Grouping

stephane eranian writes:

> What happens if I do:
> fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
> fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
>
> And then:
> fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);

That perf_counter_open call will fail with an EINVAL error because fd1
is not a group leader.

Paul.

2009-06-23 08:31:06

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.2 - Grouping

On Tue, Jun 23, 2009 at 10:26 AM, Paul Mackerras<[email protected]> wrote:
> stephane eranian writes:
>
>> What happens if I do:
>>    fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
>>    fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
>>
>> And then:
>>    fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);
>
> That perf_counter_open call will fail with an EINVAL error because fd1
> is not a group leader.
>
Okay. Thanks.

But that leads me to another question related to grouping,
which was hinted at by Peter's comment about IBS.

Can you have multiple sampling events in a group?

2009-06-23 13:18:59

by Stephane Eranian

[permalink] [raw]
Subject: Re: II.2 - Event knowledge missing

On Mon, Jun 22, 2009 at 1:57 PM, Ingo Molnar<[email protected]> wrote:
>> 2/ Event knowledge missing
>>
>> There are constraints on events in Intel processors. Different
>> constraints do exist on AMD64 processors, especially with
>> uncore-releated events.
>
> You raise the issue of uncore events in IV.1, but let us reply here
> primarily.
>
> Un-core counters and events seem to be somewhat un-interesting to
> us. (Patches from those who find them interesting are welcome of
> course!)
>
That is your opinion, but not mine. I believe uncore is useful, though
it is harder to manage than the core PMU. I know that because I have
implemented the support for Nehalem. But going back to our discussion
from December: if it's there, it's because it provides some added value;
why would the hardware designers have bothered otherwise?

It is true that if you've only read the uncore description in Volume 3b, it
is not clear what this can actually do. Therefore, I recommend you take
a look at section B.2.5 of the Intel optimization manual:

http://www.intel.com/Assets/PDF/manual/248966.pdf

It shows a bunch of interesting metrics one can collect using uncore,
metrics which you cannot get any other way. Some people do care about
those; otherwise they would not be explained.

>
> The main problem with uncore events is that they are per physical
> package, and hence tying a piece of physical metric exposed via them
> to a particular workload is hard - unless full-system analysis is
> performed. 'Task driven' metrics seem far more useful to performance
> analysis (and those are the preferred analysis method of most
> user-space developers), as they allow particularized sampling and
> allow the tight connection between workload and metric.
>
That is the nature of the beast. There is not much you can do about
this. But it is still useful, especially if you have a symmetrical
workload, as many scientific applications do.

Note that uncore also exists on AMD64, though it is not as clearly
separated. Some events count at the package level, yet they use core
PMU counters. And those come with restrictions as well; see Section
3.12 (description of PERFCTL) in the BKDG for Family 10h.

> If, despite our expecations, uncore events prove to be useful,
> popular and required elements of performance analysis, they can be
> supported in perfcounters via various levels:
>
>  - a special raw ID range on x86, only to per CPU counters. The
>   low-level implementation reserves the uncore PMCs, so overlapping
>   allocation (and interaction between the cores via the MSRs) is
>   not possible.
>
I agree this is for CPU counters only, not per-thread. It could be any
core in the package. In fact, multiple per-CPU "sessions" could co-exist
in the same package. But there is one difficulty with allowing this: the
uncore does not interrupt directly. You need to designate which core(s)
it will interrupt via a bitmask. It could interrupt ALL CPUs in the
package at once (which is another interesting usage model of uncore).
So I believe the choice is between one CPU and all CPUs.

Uncore events have no constraints, except for the single fixed-counter
event (UNC_CLK_UNHALTED). Thus, you could still use your event model,
overcommit the uncore and multiplex groups on it. You could reject
events in a group once you reach 8 (the maximum number of counters). I
don't see the difference there. The only issue is managing the
interrupt.



>  - generic enumeration with full tooling support, time-sharing and
>   the whole set of features. The low-level backend would time-share
>   the resource between interested CPUs.
>
> There is no limitation in the perfcounters design that somehow makes
> uncore events harder to support. The uncore counters _themselves_
> are limited to begin with - so rich features cannot be offered on
> top of them.
>
I would say they are limited; that is all you can do, given where they
are sourced from.


>
>> The current code-base does not have any constrained event support,
>> therefore bogus counts may be returned depending on the event
>> measured.
>
> Then we'll need to grow some when we run into them :-)

FYI, here is the list of constrained events for Intel Core.
Counter [ 0 ] means generic counter 0, [ 1 ] means generic counter 1.
If you do not put these events in the right counter, they do not count
what they are supposed to, and they fail silently.

Name : FP_COMP_OPS_EXE
Code : 0x10
Counters : [ 0 ]
Desc : Floating point computational micro-ops executed
PEBS : No

Name : FP_ASSIST
Code : 0x11
Counters : [ 1 ]
Desc : Floating point assists
PEBS : No

Name : MUL
Code : 0x12
Counters : [ 1 ]
Desc : Multiply operations executed
PEBS : No

Name : DIV
Code : 0x13
Counters : [ 1 ]
Desc : Divide operations executed
PEBS : No

Name : CYCLES_DIV_BUSY
Code : 0x14
Counters : [ 0 ]
Desc : Cycles the divider is busy
PEBS : No

Name : IDLE_DURING_DIV
Code : 0x18
Counters : [ 0 ]
Desc : Cycles the divider is busy and all other execution units are idle
PEBS : No

Name : DELAYED_BYPASS
Code : 0x19
Counters : [ 1 ]
Desc : Delayed bypass
Umask-00 : 0x00 : [FP] : Delayed bypass to FP operation
Umask-01 : 0x01 : [SIMD] : Delayed bypass to SIMD operation
Umask-02 : 0x02 : [LOAD] : Delayed bypass to load operation
PEBS : No

Name : MEM_LOAD_RETIRED
Code : 0xcb
Counters : [ 0 ]
Desc : Retired loads that miss the L1 data cache
Umask-00 : 0x01 : [L1D_MISS] : Retired loads that miss the L1 data
cache (precise event)
Umask-01 : 0x02 : [L1D_LINE_MISS] : L1 data cache line missed by
retired loads (precise event)
Umask-02 : 0x04 : [L2_MISS] : Retired loads that miss the L2 cache
(precise event)
Umask-03 : 0x08 : [L2_LINE_MISS] : L2 cache line missed by retired
loads (precise event)
Umask-04 : 0x10 : [DTLB_MISS] : Retired loads that miss the DTLB (precise event)
PEBS : [L1D_MISS] [L1D_LINE_MISS] [L2_MISS] [L2_LINE_MISS] [DTLB_MISS]
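
For illustration only (this is not code from the current perfcounters
tree), the list above boils down to a small table a low-level backend
could consult, mapping each event code to a bitmask of the generic
counters it may use:

struct event_constraint {
        unsigned int code;      /* event select code         */
        unsigned int cmask;     /* allowed generic counters  */
};

static const struct event_constraint core_constraints[] = {
        { 0x10, 0x1 },  /* FP_COMP_OPS_EXE: counter 0 only */
        { 0x11, 0x2 },  /* FP_ASSIST:       counter 1 only */
        { 0x12, 0x2 },  /* MUL                             */
        { 0x13, 0x2 },  /* DIV                             */
        { 0x14, 0x1 },  /* CYCLES_DIV_BUSY                 */
        { 0x18, 0x1 },  /* IDLE_DURING_DIV                 */
        { 0x19, 0x2 },  /* DELAYED_BYPASS                  */
        { 0xcb, 0x1 },  /* MEM_LOAD_RETIRED                */
};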

2009-06-23 14:06:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS


* stephane eranian <[email protected]> wrote:

> > The most natural way to support IBS would be to have a special
> > sampling cycle counter and use that as group lead and add non
> > sampling siblings to that group to get individual elements.
> >
> As discussed in my message, I think the way to support IBS is to
> create two pseudo-events (like your perf_hw_event_ids), one for
> fetch and one for op (because they could be measured
> simultaneously). The sample_period field would be used to express
> the IBS*CTL maxcnt, subject to the verification that the bottom 4
> bits must be 0. And then, you add two new sampling formats
> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
> with IBS pseudo events. Once you have the randomize option in
> perf_counter_attr, you could even enable IBSFETCH randomization.

I'd suggest to start smaller, and first express the 'precise' nature
of IBS transparently, by simply mapping it to one of the generic
events. (cycles and instructions both appear to be possible)

No extra sampling, no extra events - just a transparent side channel
implementation for the specific case of PERF_COUNT_HW_CPU_CYCLES. (A
bit like the fixed-purpose counters are done on the Intel side - a
special-case - but none of the generic code knows about it.)

This gives us immediate results with less code, and also gives us
the platform to see how IBS is structured, what kind of general
problems/quirks it has, and how popular its precision is, etc. We
can always add extra sampling formats on top of that (I'm not
opposed to that), to expose more and more of IBS.

The same can be done on the PEBS side as well.

Would you be interested in pursuing this?

Ingo

2009-06-23 14:25:44

by Stephane Eranian

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS

On Tue, Jun 23, 2009 at 4:05 PM, Ingo Molnar<[email protected]> wrote:
>
> * stephane eranian <[email protected]> wrote:
>
>> > The most natural way to support IBS would be to have a special
>> > sampling cycle counter and use that as group lead and add non
>> > sampling siblings to that group to get individual elements.
>> >
>> As discussed in my message, I think the way to support IBS is to
>> create two pseudo-events (like your perf_hw_event_ids), one for
>> fetch and one for op (because they could be measured
>> simultaneously). The sample_period field would be used to express
>> the IBS*CTL maxcnt, subject to the verification that the bottom 4
>> bits must be 0. And then, you add two new sampling formats
>> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
>> with IBS pseudo events. Once you have the randomize option in
>> perf_counter_attr, you could even enable IBSFETCH randomization.
>
> I'd suggest to start smaller, and first express the 'precise' nature
> of IBS transparently, by simply mapping it to one of the generic
> events. (cycles and instructions both appears to be possible)
>
IBS is precise by nature. It does not work like PEBS: it tags an
instruction and then collects info about it. When the instruction
retires, IBS freezes and triggers an interrupt like a regular counter
interrupt. Except this time you don't care about the interrupted IP; you
use the instruction address in the IBS data register, which is
guaranteed to correspond to the tagged instruction.

The sampling period expresses the delay before picking the instruction
to tag. And as I said before, it is only 20 bits and the bottom 4 bits must
be zero (as they cannot be encoded).
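
A minimal sketch of the validity check this implies for the requested
period, under the 20-bit / bottom-4-bits-zero constraint described
above:

/* Return non-zero if 'period' can be encoded in the IBS maxcnt field. */
static int ibs_period_valid(unsigned long long period)
{
        if (period == 0 || (period & 0xfULL))   /* bottom 4 bits cannot be encoded */
                return 0;
        if (period >> 20)                       /* only 20 bits are available      */
                return 0;
        return 1;
}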



> No extra sampling, no extra events - just a transparent side channel
> implementation for the specific case of PERF_COUNT_HW_CPU_CYCLES. (A
> bit like the fixed-purpose counters are done on the Intel side - a
> special-case - but none of the generic code knows about it.)
>

> This gives us immediate results with less code, and also gives us
> the platform to see how IBS is structured, what kind of general
> problems/quirks it has, and how popular its precision is, etc. We
> can always add extra sampling formats on top of that (i'm not
> opposed to that), to expose more and more of IBS.
>
> The same can be done on the PEBS side as well.
>
> Would you be interested in pursuing this?
>
>        Ingo
>

2009-06-23 14:41:59

by Rob Fowler

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS

I'm up to my neck in other stuff, so this will be short.

Yes, IBS is a very different model of performance measurement that
doesn't fit well with the traditional one. It does do what the HW
engineers need for understanding multi-unit, out-of-order processors,
though.

The separation of fetch and op monitoring is an artifact of the
separation and de-coupling of the front-end and back-end pipelines. The
front-end IBS events deal with what happens during fetch: TLB and cache
misses, mispredictions, etc. The back-end IBS events deal with
computation and data fetch/store.

There are two "conventional" counters involved: the tag-to-retire and
completion-to-retire counts. These can be accumulated, histogrammed,
etc., just like any conventional event, though you need to add the
counter contents to the accumulator rather than just increment it.

The rest of the bits are predicates that can be used to filter the events into bins.
With n bits, you might need 2^n bins to accumulate all possibilities.
Sections 5 and 6 of the AMD software optimization guide provide some useful boolean expressions
for defining meaningful derived events.
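
A sketch of the kind of binning this describes, with invented names; the
key point is that the per-sample cycle count is added to the bin rather
than the bin being incremented:

#define IBS_PRED_BITS 4                 /* illustrative number of predicate bits */

static unsigned long long bins[1 << IBS_PRED_BITS];

/* 'predicates' are the predicate bits extracted from one IBS sample,
 * 'tag_to_retire' the tag-to-retire cycle count from the same sample. */
static void account_ibs_sample(unsigned int predicates,
                               unsigned long long tag_to_retire)
{
        bins[predicates & ((1u << IBS_PRED_BITS) - 1)] += tag_to_retire;
}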

In an old version of Rice HPCToolkit (now disappeared from the web) we
had a tool called xprof that processed DEC DCPI/ProfileMe binary files to
produce profiles with ~20 derived events that we thought would be useful. The cost
of collecting all of this didn't vary by the amount we collected, so you would select
the ones you wanted to view at analysis time, not at execute time. There was also a
mechanism for specifying other events. Nathan Tallent can provide details.

The Linear and Physical Address registers are an opportunity for someone to build data profiling
tools, or a combined instructions and data tool.

The critical thing is for the kernel, driver, and library builders to not do something that will
stand in the way of this.

Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 10:08 -0400, Rob Fowler wrote:
>> Ingo Molnar wrote:
>>>> 3/ AMD IBS
>>>>
>>>> How is AMD IBS going to be implemented?
>>>>
>>>> IBS has two separate sets of registers. One to capture fetch
>>>> related data and another one to capture instruction execution
>>>> data. For each, there is one config register but multiple data
>>>> registers. In each mode, there is a specific sampling period and
>>>> IBS can interrupt.
>>>>
>>>> It looks like you could define two pseudo events or event types
>>>> and then define a new record_format and read_format. That formats
>>>> would only be valid for an IBS event.
>>>>
>>>> Is that how you intend to support IBS?
>>> That is indeed one of the ways we thought of, not really nice, but
>>> then, IBS is really weird, what were those AMD engineers thinking
>>> :-)
>> Actually, IBS has roots in DEC's "ProfileMe" for Alpha EV67 and later
>> processors. Those of us who used it there found it to be an extremely
>> powerful, low-overhead mechanism for directly collecting information about
>> how well the micro-architecture is performing. In particular, it can tell
>> you, not only which instructions take a long time to traverse the pipe, but
>> it also tells you which instructions delay other instructions and by how much.
>> This is extremely valuable if you are either working on instruction scheduling
>> in a compiler, or are modifying a program to give the compiler the opportunity
>> to do a good job.
>>
>> A core group of engineers who worked on Alpha went on to AMD.
>>
>> An unfortunate problem with IBS on AMD is that good support isn't common in the "mainstream"
>> open source community.
>
> The 'problem' I have with IBS is that its basically a cycle counter
> coupled with a pretty arbitrary number of output dimensions separated
> into two groups, ops and fetches.
>
> This is a very weird configuration in that it has a miss-match with the
> traditional one value per counter thing.
>
> The most natural way to support IBS would be to have a special sampling
> cycle counter and use that as group lead and add non sampling siblings
> to that group to get individual elements.
>
> This is however quite cumbersome.
>
> One thing to consider when building an IBS interface is its future
> extensibility. In which fashion would IBS be extended?, additional
> output dimensions or something else all-together?

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
[email protected]

2009-06-23 14:55:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [perfmon2] IV.3 - AMD IBS


* stephane eranian <[email protected]> wrote:

> On Tue, Jun 23, 2009 at 4:05 PM, Ingo Molnar<[email protected]> wrote:
> >
> > * stephane eranian <[email protected]> wrote:
> >
> >> > The most natural way to support IBS would be to have a special
> >> > sampling cycle counter and use that as group lead and add non
> >> > sampling siblings to that group to get individual elements.
> >> >
> >> As discussed in my message, I think the way to support IBS is to
> >> create two pseudo-events (like your perf_hw_event_ids), one for
> >> fetch and one for op (because they could be measured
> >> simultaneously). The sample_period field would be used to express
> >> the IBS*CTL maxcnt, subject to the verification that the bottom 4
> >> bits must be 0. And then, you add two new sampling formats
> >> PERF_SAMPLE_IBSFETCH, PERF_SAMPLE_IBSOP. Those would only work
> >> with IBS pseudo events. Once you have the randomize option in
> >> perf_counter_attr, you could even enable IBSFETCH randomization.
> >
> > I'd suggest to start smaller, and first express the 'precise'
> > nature of IBS transparently, by simply mapping it to one of the
> > generic events. (cycles and instructions both appears to be
> > possible)
>
> IBS is precise by nature.

(yes. Did you understand my comments above as saying the opposite?)

> [...] It does not work like PEBS. It tags an instruction and then
> collects info about it. When it retires, IBS freezes and triggers
> an interrupt like a regular counter interrupt. Except this time,
> you don't care about the interrupted IP, you use the instruction
> address in the IBS data register, it is guaranteed to correspond
> to the tagged instruction.
>
> The sampling period expresses the delay before picking the
> instruction to tag. And as I said before, it is only 20 bits and
> the bottom 4 bits must be zero (as they cannot be encoded).

The 20-bit delay is in cycles, right? So this in itself still lends
itself to being transparently provided as a PERF_COUNT_HW_CPU_CYCLES
counter.

Ingo

2009-06-23 16:25:17

by Corey Ashford

[permalink] [raw]
Subject: Re: I.2 - Grouping

stephane eranian wrote:
> On Tue, Jun 23, 2009 at 10:26 AM, Paul Mackerras<[email protected]> wrote:
>> stephane eranian writes:
>>
>>> What happens if I do:
>>> fd0 = perf_counter_open(&hwc1, getpid(), -1, -1, 0);
>>> fd1 = perf_counter_open(&hwc2, getpid(), -1, fd0, 0);
>>>
>>> And then:
>>> fd2 = perf_counter_open(&hwc2, getpid(), -1, fd1, 0);
>> That perf_counter_open call will fail with an EINVAL error because fd1
>> is not a group leader.
>>
> Okay. Thanks.
>
> But that leads me to another question related to grouping
> which was hinted for by Peter's comment about IBS.
>
> Can you have multiple sampling events in a group?

There's a PAPI ctest which exercises this - profile_two_events - and it
does work with the PCL substrate.

- Corey

2009-06-23 17:51:19

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.2 - Grouping

Corey,

On Tue, Jun 23, 2009 at 12:04 AM, Corey
Ashford<[email protected]> wrote:
> stephane eranian wrote:
>>
>> On Mon, Jun 22, 2009 at 1:50 PM, Ingo Molnar<[email protected]> wrote:
>>>>
>>>> 2/ Grouping
>>>>
>>>> By design, an event can only be part of one group at a time.
>>
>> As I read this again, another question came up. Is the statement
>> above also true for the group leader?
>>
>>
>>>> Events in a group are guaranteed to be active on the PMU at the
>>>> same time. That means a group cannot have more events than there
>>>> are available counters on the PMU. Tools may want to know the
>>>> number of counters available in order to group their events
>>>> accordingly, such that reliable ratios could be computed. It seems
>>>> the only way to know this is by trial and error. This is not
>>>> practical.
>>>
>>> Groups are there to support heavily constrained PMUs, and for them
>>> this is the only way, as there is no simple linear expression for
>>> how many counters one can load on the PMU.
>>>
>> But then, does that mean that users need to be aware of constraints
>> to form groups. Group are formed by users. I thought, the whole point
>> of the API was to hide that kind of hardware complexity.
>>
>> Groups are needed when sampling, e.g., PERF_SAMPLE_GROUP.
>> For instance, I sample the IP and the values of the other events in
>> my group.  Grouping looks like the only way to sampling on Itanium
>> branch buffers (BTB), for instance. You program an event in a generic
>> counter which counts the number of entries recorded in the buffer.
>> Thus, to sample the BTB, you sample on this event, when it overflows
>> you grab the content of the BTB. Here, the event and the BTB are tied
>> together. You cannot count the event in one group, and have the BTB
>> in another one (BTB = 1 config reg + 32 data registers + 1 position reg).
>>
>
> Stephane, if you were to just place into groups those events which must be
> correlated, and just leave all of the others not grouped, wouldn't this
> solve the problem?  The kernel would be free to schedule the other events
> how and when it can, but would guarantee that your correlated events are on
> the PMU simultaneously.
>
It would work.
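
For concreteness, a sketch of that arrangement with the current
prototype (attribute setup elided): only the two events whose ratio
matters share a group, the third one is opened standalone so the kernel
can schedule it freely:

static void open_partially_grouped(void)
{
        struct perf_counter_attr hwc_a = { 0 }, hwc_b = { 0 }, hwc_c = { 0 };
        int fd_a, fd_b, fd_c;

        /* hwc_a and hwc_b must be measured together, e.g. to form a ratio. */
        fd_a = sys_perf_counter_open(&hwc_a, getpid(), -1, -1,   0);    /* group leader */
        fd_b = sys_perf_counter_open(&hwc_b, getpid(), -1, fd_a, 0);    /* same group   */

        /* hwc_c is independent, so it stays in its own group of one. */
        fd_c = sys_perf_counter_open(&hwc_c, getpid(), -1, -1,   0);
}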

2009-07-13 10:53:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> But talking about syscalls the sys_perf_counter_open prototype is
> really ugly - it uses either the pid or cpu argument which is a pretty
> clear indicator it should actually be two sys calls.

Would something like the below be any better?

It would allow us to later add something like PERF_TARGET_SOCKET and
things like that.

(utterly untested)

---
include/linux/perf_counter.h |   12 ++++++++++++
include/linux/syscalls.h     |    3 ++-
kernel/perf_counter.c        |   24 +++++++++++++++++++++++-
tools/perf/builtin-record.c  |   11 ++++++++++-
tools/perf/builtin-stat.c    |   12 ++++++++++--
tools/perf/builtin-top.c     |   11 ++++++++++-
tools/perf/perf.h            |    7 +++----
7 files changed, 70 insertions(+), 10 deletions(-)

diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
index 5e970c7..f7dd22e 100644
--- a/include/linux/perf_counter.h
+++ b/include/linux/perf_counter.h
@@ -189,6 +189,18 @@ struct perf_counter_attr {
__u64 __reserved_3;
};

+enum perf_counter_target_ids {
+ PERF_TARGET_PID = 0,
+ PERF_TARGET_CPU = 1,
+
+ PERF_TARGET_MAX /* non-ABI */
+};
+
+struct perf_counter_target {
+ __u32 id;
+ __u64 val;
+};
+
/*
* Ioctls that can be done on a perf counter fd:
*/
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 80de700..670edad 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -760,5 +760,6 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[]);

asmlinkage long sys_perf_counter_open(
struct perf_counter_attr __user *attr_uptr,
- pid_t pid, int cpu, int group_fd, unsigned long flags);
+ struct perf_counter_target __user *target,
+ int group_fd, unsigned long flags);
#endif
diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
index fa20a9d..205f0b6 100644
--- a/kernel/perf_counter.c
+++ b/kernel/perf_counter.c
@@ -3969,15 +3969,18 @@ err_size:
*/
SYSCALL_DEFINE5(perf_counter_open,
struct perf_counter_attr __user *, attr_uptr,
- pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
+ struct perf_counter_target __user *, target_uptr,
+ int, group_fd, unsigned long, flags)
{
struct perf_counter *counter, *group_leader;
struct perf_counter_attr attr;
+ struct perf_counter_target target;
struct perf_counter_context *ctx;
struct file *counter_file = NULL;
struct file *group_file = NULL;
int fput_needed = 0;
int fput_needed2 = 0;
+ int pid, cpu;
int ret;

/* for future expandability... */
@@ -3998,6 +4001,25 @@ SYSCALL_DEFINE5(perf_counter_open,
return -EINVAL;
}

+ ret = copy_from_user(&target, target_uptr, sizeof(target));
+ if (ret)
+ return ret;
+
+ switch (target.id) {
+ case PERF_TARGET_PID:
+ pid = target.val;
+ cpu = -1;
+ break;
+
+ case PERF_TARGET_CPU:
+ cpu = target.val;
+ pid = -1;
+ break;
+
+ default:
+ return -EINVAL;
+ }
+
/*
* Get the target context (task or percpu):
*/
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 4ef78a5..b172d30 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -374,6 +374,7 @@ static struct perf_header_attr *get_header_attr(struct perf_counter_attr *a, int
static void create_counter(int counter, int cpu, pid_t pid)
{
struct perf_counter_attr *attr = attrs + counter;
+ struct perf_counter_target target;
struct perf_header_attr *h_attr;
int track = !counter; /* only the first counter needs these */
struct {
@@ -409,8 +410,16 @@ static void create_counter(int counter, int cpu, pid_t pid)
attr->inherit = (cpu < 0) && inherit;
attr->disabled = 1;

+ if (cpu < 0) {
+ target.id = PERF_TARGET_PID;
+ target.val = pid;
+ } else {
+ target.id = PERF_TARGET_CPU;
+ target.val = cpu;
+ }
+
try_again:
- fd[nr_cpu][counter] = sys_perf_counter_open(attr, pid, cpu, group_fd, 0);
+ fd[nr_cpu][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);

if (fd[nr_cpu][counter] < 0) {
int err = errno;
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 27921a8..342e642 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -106,6 +106,7 @@ static u64 runtime_cycles_noise;
static void create_perf_stat_counter(int counter, int pid)
{
struct perf_counter_attr *attr = attrs + counter;
+ struct perf_counter_target target;

if (scale)
attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
@@ -114,8 +115,12 @@ static void create_perf_stat_counter(int counter, int pid)
if (system_wide) {
unsigned int cpu;

+ target.id = PERF_TARGET_CPU;
+
for (cpu = 0; cpu < nr_cpus; cpu++) {
- fd[cpu][counter] = sys_perf_counter_open(attr, -1, cpu, -1, 0);
+ target.val = cpu;
+
+ fd[cpu][counter] = sys_perf_counter_open(attr, &target, -1, 0);
if (fd[cpu][counter] < 0 && verbose)
fprintf(stderr, ERR_PERF_OPEN, counter,
fd[cpu][counter], strerror(errno));
@@ -125,7 +130,10 @@ static void create_perf_stat_counter(int counter, int pid)
attr->disabled = 1;
attr->enable_on_exec = 1;

- fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
+ target.id = PERF_TARGET_PID;
+ target.val = pid;
+
+ fd[0][counter] = sys_perf_counter_open(attr, &target, -1, 0);
if (fd[0][counter] < 0 && verbose)
fprintf(stderr, ERR_PERF_OPEN, counter,
fd[0][counter], strerror(errno));
diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 95d5c0a..facc870 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -548,6 +548,7 @@ int group_fd;

static void start_counter(int i, int counter)
{
+ struct perf_counter_target target;
struct perf_counter_attr *attr;
unsigned int cpu;

@@ -560,8 +561,16 @@ static void start_counter(int i, int counter)
attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
attr->freq = freq;

+ if (cpu < 0) {
+ target.id = PERF_TARGET_PID;
+ target.val = target_pid;
+ } else {
+ target.id = PERF_TARGET_CPU;
+ target.val = cpu;
+ }
+
try_again:
- fd[i][counter] = sys_perf_counter_open(attr, target_pid, cpu, group_fd, 0);
+ fd[i][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);

if (fd[i][counter] < 0) {
int err = errno;
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index e5148e2..3d663d7 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -85,12 +85,11 @@ static inline unsigned long long rdclock(void)

static inline int
sys_perf_counter_open(struct perf_counter_attr *attr,
- pid_t pid, int cpu, int group_fd,
- unsigned long flags)
+ struct perf_counter_target *target,
+ int group_fd, unsigned long flags)
{
attr->size = sizeof(*attr);
- return syscall(__NR_perf_counter_open, attr, pid, cpu,
- group_fd, flags);
+ return syscall(__NR_perf_counter_open, attr, target, group_fd, flags);
}

#define MAX_COUNTERS 256

2009-07-13 17:31:30

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [perfmon2] I.1 - System calls - ioctl

On Monday 13 July 2009, Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > But talking about syscalls the sys_perf_counter_open prototype is
> > really ugly - it uses either the pid or cpu argument which is a pretty
> > clear indicator it should actually be two sys calls.
>
> Would something like the below be any better?
>
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.

I don't think it helps on the ugliness side. You basically make the
two arguments a union, but instead of adding another flag and directly
passing a union, you also add interface complexity.

A strong indication for the complexity is that you got it wrong ;-) :

> +struct perf_counter_target {
> + __u32 id;
> + __u64 val;
> +};

This structure is not compatible between 32 and 64 bit user space on x86,
because everything except i386 adds implicit padding between id and val.
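
To make the pitfall concrete: on i386 __u64 is only 4-byte aligned, so
the structure above is 12 bytes there and 16 bytes on 64-bit ABIs. One
way to keep the layout identical (a sketch, not part of the patch) is to
make the padding explicit:

struct perf_counter_target {
        __u32   id;
        __u32   __reserved;     /* explicit padding: same layout on 32- and 64-bit */
        __u64   val;
};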

Other than that, making it extensible sounds reasonable. How about just
using a '__u64 *target' and a bit in the 'flags' argument?

Arnd <><

2009-07-13 17:34:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [perfmon2] I.1 - System calls - ioctl

On Mon, 2009-07-13 at 19:30 +0200, Arnd Bergmann wrote:
> On Monday 13 July 2009, Peter Zijlstra wrote:
> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > > But talking about syscalls the sys_perf_counter_open prototype is
> > > really ugly - it uses either the pid or cpu argument which is a pretty
> > > clear indicator it should actually be two sys calls.
> >
> > Would something like the below be any better?
> >
> > It would allow us to later add something like PERF_TARGET_SOCKET and
> > things like that.
>
> I don't think it helps on the ugliness side. You basically make the
> two arguments a union, but instead of adding another flag and directly
> passing a union, you also add interface complexity.
>
> A strong indication for the complexity is that you got it wrong ;-) :
>
> > +struct perf_counter_target {
> > + __u32 id;
> > + __u64 val;
> > +};
>
> This structure is not compatible between 32 and 64 bit user space on x86,
> because everything except i386 adds implicit padding between id and val.

Humm, __u64 doesn't have natural alignment? That would break more than
just this I think -- it sure surprises me.

> Other than that, making it extensible sounds reasonable. How about just
> using a '__u64 *target' and a bit in the 'flags' argument?

Would there still be a point in having it be a pointer in that case?
But yeah, that might work too.

2009-07-13 17:54:24

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [perfmon2] I.1 - System calls - ioctl

On Monday 13 July 2009, Peter Zijlstra wrote:
> On Mon, 2009-07-13 at 19:30 +0200, Arnd Bergmann wrote:
> > > +struct perf_counter_target {
> > > + __u32 id;
> > > + __u64 val;
> > > +};
> >
> > This structure is not compatible between 32 and 64 bit user space on x86,
> > because everything except i386 adds implicit padding between id and val.
>
> Humm, __u64 doesn't have natural alignment? That would break more than
> just this I think -- it sure surprises me.

Yes, nobody expects this, so it is a frequent source of bugs in the ABI.
Look for compat_u64 and __packed in the definition of compat ioctl and
syscall interfaces for how we had to work around this elsewhere.

> > Other than that, making it extensible sounds reasonable. How about just
> > using a '__u64 *target' and a bit in the 'flags' argument?
>
> Would there still be a point in having it a pointer in that case?, but
> yeah, that might work too?

Passing 64-bit arguments directly to system calls is a bit complicated,
because some 32-bit architectures can only pass them in certain register
pairs; see the trouble we go through for llseek, sync_file_range or
preadv.

If you can directly pass an 'unsigned long' instead, that would work
fine, though.

Arnd <><

2009-07-14 13:51:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Mon, Jul 13, 2009 at 12:53:13PM +0200, Peter Zijlstra wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> > But talking about syscalls the sys_perf_counter_open prototype is
> > really ugly - it uses either the pid or cpu argument which is a pretty
> > clear indicator it should actually be two sys calls.
>
> Would something like the below be any better?
>
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.
>
> (utterly untested)

And it makes the code a couple of orders of magnitude uglier. If you
really add a completely new target, adding a new system call wouldn't be
a big issue. Although a socket seems like a pretty logical flag
extension to the cpu syscall (or sub-syscall currently), as it would
still be specified as "trace the whole cpu socket for this given cpuid".

2009-07-30 13:58:52

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<[email protected]> wrote:
> On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
>> But talking about syscalls the sys_perf_counter_open prototype is
>> really ugly - it uses either the pid or cpu argument which is a pretty
>> clear indicator it should actually be two sys calls.
>
> Would something like the below be any better?
>
> It would allow us to later add something like PERF_TARGET_SOCKET and
> things like that.

One thing is for sure: you need to provision for future targets, and
SOCKET is an obvious one. But given that your goal is to have a generic
API, not just one for the CPU PMU, you need to be able to add new
targets, e.g., socket, chipset, GPU. The current model with pid and cpu
is too limited, and relying on flags to add more parameters for a target
is not pretty.

Given that an event can only be attached to a single target at a time,
it seems you could either:
- encapsulate target type + target into a struct and pass that
(this is your proposal here);
- add a generic int target parameter and pass the target type in flags;
- provide one syscall per target type (this seems to be Hellwig's
preference).

Your scheme makes it possible to pass a 64-bit target id. It is not
clear whether this is really needed, and it adds yet another data
structure to your API.

>
> (utterly untested)
>
> ---
>  include/linux/perf_counter.h |   12 ++++++++++++
>  include/linux/syscalls.h     |    3 ++-
>  kernel/perf_counter.c        |   24 +++++++++++++++++++++++-
>  tools/perf/builtin-record.c  |   11 ++++++++++-
>  tools/perf/builtin-stat.c    |   12 ++++++++++--
>  tools/perf/builtin-top.c     |   11 ++++++++++-
>  tools/perf/perf.h            |    7 +++----
>  7 files changed, 70 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/perf_counter.h b/include/linux/perf_counter.h
> index 5e970c7..f7dd22e 100644
> --- a/include/linux/perf_counter.h
> +++ b/include/linux/perf_counter.h
> @@ -189,6 +189,18 @@ struct perf_counter_attr {
>        __u64                   __reserved_3;
>  };
>
> +enum perf_counter_target_ids {
> +       PERF_TARGET_PID         = 0,
> +       PERF_TARGET_CPU         = 1,
> +
> +       PERF_TARGET_MAX         /* non-ABI */
> +};
> +
> +struct perf_counter_target {
> +       __u32                   id;
> +       __u64                   val;
> +};
> +
>  /*
>  * Ioctls that can be done on a perf counter fd:
>  */
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 80de700..670edad 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -760,5 +760,6 @@ int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
>
>  asmlinkage long sys_perf_counter_open(
>                struct perf_counter_attr __user *attr_uptr,
> -               pid_t pid, int cpu, int group_fd, unsigned long flags);
> +               struct perf_counter_target __user *target,
> +               int group_fd, unsigned long flags);
>  #endif
> diff --git a/kernel/perf_counter.c b/kernel/perf_counter.c
> index fa20a9d..205f0b6 100644
> --- a/kernel/perf_counter.c
> +++ b/kernel/perf_counter.c
> @@ -3969,15 +3969,18 @@ err_size:
>  */
>  SYSCALL_DEFINE5(perf_counter_open,
>                struct perf_counter_attr __user *, attr_uptr,
> -               pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
> +               struct perf_counter_target __user *, target_uptr,
> +               int, group_fd, unsigned long, flags)
>  {
>        struct perf_counter *counter, *group_leader;
>        struct perf_counter_attr attr;
> +       struct perf_counter_target target;
>        struct perf_counter_context *ctx;
>        struct file *counter_file = NULL;
>        struct file *group_file = NULL;
>        int fput_needed = 0;
>        int fput_needed2 = 0;
> +       int pid, cpu;
>        int ret;
>
>        /* for future expandability... */
> @@ -3998,6 +4001,25 @@ SYSCALL_DEFINE5(perf_counter_open,
>                        return -EINVAL;
>        }
>
> +       ret = copy_from_user(&target, target_uptr, sizeof(target));
> +       if (ret)
> +               return ret;
> +
> +       switch (target.id) {
> +       case PERF_TARGET_PID:
> +               pid = target.val;
> +               cpu = -1;
> +               break;
> +
> +       case PERF_TARGET_CPU:
> +               cpu = target.val;
> +               pid = -1;
> +               break;
> +
> +       default:
> +               return -EINVAL;
> +       }
> +
>        /*
>         * Get the target context (task or percpu):
>         */
> diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
> index 4ef78a5..b172d30 100644
> --- a/tools/perf/builtin-record.c
> +++ b/tools/perf/builtin-record.c
> @@ -374,6 +374,7 @@ static struct perf_header_attr *get_header_attr(struct perf_counter_attr *a, int
>  static void create_counter(int counter, int cpu, pid_t pid)
>  {
>        struct perf_counter_attr *attr = attrs + counter;
> +       struct perf_counter_target target;
>        struct perf_header_attr *h_attr;
>        int track = !counter; /* only the first counter needs these */
>        struct {
> @@ -409,8 +410,16 @@ static void create_counter(int counter, int cpu, pid_t pid)
>        attr->inherit           = (cpu < 0) && inherit;
>        attr->disabled          = 1;
>
> +       if (cpu < 0) {
> +               target.id = PERF_TARGET_PID;
> +               target.val = pid;
> +       } else {
> +               target.id = PERF_TARGET_CPU;
> +               target.val = cpu;
> +       }
> +
>  try_again:
> -       fd[nr_cpu][counter] = sys_perf_counter_open(attr, pid, cpu, group_fd, 0);
> +       fd[nr_cpu][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
>
>        if (fd[nr_cpu][counter] < 0) {
>                int err = errno;
> diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
> index 27921a8..342e642 100644
> --- a/tools/perf/builtin-stat.c
> +++ b/tools/perf/builtin-stat.c
> @@ -106,6 +106,7 @@ static u64                  runtime_cycles_noise;
>  static void create_perf_stat_counter(int counter, int pid)
>  {
>        struct perf_counter_attr *attr = attrs + counter;
> +       struct perf_counter_target target;
>
>        if (scale)
>                attr->read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
> @@ -114,8 +115,12 @@ static void create_perf_stat_counter(int counter, int pid)
>        if (system_wide) {
>                unsigned int cpu;
>
> +               target.id = PERF_TARGET_CPU;
> +
>                for (cpu = 0; cpu < nr_cpus; cpu++) {
> -                       fd[cpu][counter] = sys_perf_counter_open(attr, -1, cpu, -1, 0);
> +                       target.val = cpu;
> +
> +                       fd[cpu][counter] = sys_perf_counter_open(attr, &target, -1, 0);
>                        if (fd[cpu][counter] < 0 && verbose)
>                                fprintf(stderr, ERR_PERF_OPEN, counter,
>                                        fd[cpu][counter], strerror(errno));
> @@ -125,7 +130,10 @@ static void create_perf_stat_counter(int counter, int pid)
>                attr->disabled       = 1;
>                attr->enable_on_exec = 1;
>
> -               fd[0][counter] = sys_perf_counter_open(attr, pid, -1, -1, 0);
> +               target.id = PERF_TARGET_PID;
> +               target.val = pid;
> +
> +               fd[0][counter] = sys_perf_counter_open(attr, &target, -1, 0);
>                if (fd[0][counter] < 0 && verbose)
>                        fprintf(stderr, ERR_PERF_OPEN, counter,
>                                fd[0][counter], strerror(errno));
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 95d5c0a..facc870 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -548,6 +548,7 @@ int group_fd;
>
>  static void start_counter(int i, int counter)
>  {
> +       struct perf_counter_target target;
>        struct perf_counter_attr *attr;
>        unsigned int cpu;
>
> @@ -560,8 +561,16 @@ static void start_counter(int i, int counter)
>        attr->sample_type       = PERF_SAMPLE_IP | PERF_SAMPLE_TID;
>        attr->freq              = freq;
>
> +       if (cpu < 0) {
> +               target.id = PERF_TARGET_PID;
> +               target.val = target_pid;
> +       } else {
> +               target.id = PERF_TARGET_CPU;
> +               target.val = cpu;
> +       }
> +
>  try_again:
> -       fd[i][counter] = sys_perf_counter_open(attr, target_pid, cpu, group_fd, 0);
> +       fd[i][counter] = sys_perf_counter_open(attr, &target, group_fd, 0);
>
>        if (fd[i][counter] < 0) {
>                int err = errno;
> diff --git a/tools/perf/perf.h b/tools/perf/perf.h
> index e5148e2..3d663d7 100644
> --- a/tools/perf/perf.h
> +++ b/tools/perf/perf.h
> @@ -85,12 +85,11 @@ static inline unsigned long long rdclock(void)
>
>  static inline int
>  sys_perf_counter_open(struct perf_counter_attr *attr,
> -                     pid_t pid, int cpu, int group_fd,
> -                     unsigned long flags)
> +                     struct perf_counter_target *target,
> +                     int group_fd, unsigned long flags)
>  {
>        attr->size = sizeof(*attr);
> -       return syscall(__NR_perf_counter_open, attr, pid, cpu,
> -                      group_fd, flags);
> +       return syscall(__NR_perf_counter_open, attr, target, group_fd, flags);
>  }
>
>  #define MAX_COUNTERS                   256
>
>
>

2009-07-30 14:10:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Thu, 2009-07-30 at 15:58 +0200, stephane eranian wrote:
> On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<[email protected]> wrote:
> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
> >> But talking about syscalls the sys_perf_counter_open prototype is
> >> really ugly - it uses either the pid or cpu argument which is a pretty
> >> clear indicator it should actually be two sys calls.
> >
> > Would something like the below be any better?
> >
> > It would allow us to later add something like PERF_TARGET_SOCKET and
> > things like that.
>
> One thing for sure, you need to provision for future targets, SOCKET
> is an obvious
> one. But given that your goal is to have a generic API, not just for
> CPU PMU, then you
> need to be able to add new targets, e.g., socket, chipset, GPU. The
> current model with
> pid and cpu is too limited, relying on flags to add more parameters
> for a target is not pretty.
>
> Given that an event can only be attached to a single target at a time, it seems
> you could either do:
> - encapsulate target type + target into a struct and pass that.
> This is your proposal here.

*nod*

> - add a generic int target parameter and pass the target type in flags

This would mean reserving a number of bits in the flags field for this
target id. We can do that; I figure 8 should do ("640kb should be
enough" comes to mind, though :-).

> - provide one syscall per target type (seems to be Hellwig's)
>
> Your scheme makes it possible to pass 64-bit target id. Not clear if
> this is really needed.

Yeah, not sure either; pids, fds and similar are all 32-bit IIRC. We do
have 64-bit resource ids, but those are for things like inodes, and I'm
not sure it'd make sense to attach a counter to an inode :-)

2009-07-30 16:17:41

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Thu, Jul 30, 2009 at 4:13 PM, Peter Zijlstra<[email protected]> wrote:
> On Thu, 2009-07-30 at 15:58 +0200, stephane eranian wrote:
>> On Mon, Jul 13, 2009 at 12:53 PM, Peter Zijlstra<[email protected]> wrote:
>> > On Mon, 2009-06-22 at 08:58 -0400, Christoph Hellwig wrote:
>> >> But talking about syscalls the sys_perf_counter_open prototype is
>> >> really ugly - it uses either the pid or cpu argument which is a pretty
>> >> clear indicator it should actually be two sys calls.
>> >
>> > Would something like the below be any better?
>> >
>> > It would allow us to later add something like PERF_TARGET_SOCKET and
>> > things like that.
>>
>> One thing for sure, you need to provision for future targets, SOCKET
>> is an obvious
>> one. But given that your goal is to have a generic API, not just for
>> CPU PMU, then you
>> need to be able to add new targets, e.g., socket, chipset, GPU. The
>> current model with
>> pid and cpu is too limited, relying on flags to add more parameters
>> for a target is not pretty.
>>
>> Given that an event can only be attached to a single target at a time, it seems
>> you could either do:
>>    - encapsulate target type + target into a struct and pass that.
>> This is your proposal here.
>
> *nod*
>
>>    - add a generic int target parameter and pass the target type in flags
>
> This would mean reserving a number of bits in the flags field for this
> target id. We can do that, I figure 8 should do.. (640kb should be
> enough comes to mind though :-).
>
I meant:

long sys_perf_counter_open(
        struct perf_counter_attr *attr,
        int target,
        int group_fd,
        unsigned long flags);

If you reserve 8 bits in flags, that gives you 256 types of targets,
given that you can only measure an event on one target.

The target "name" would be passed in target.

per-thread (self-monitoring):
        sys_perf_counter_open(&attr, getpid(), 0, 0);

cpu-wide (monitoring CPU1):
        sys_perf_counter_open(&attr, 1, 0, PERF_TARGET_CPU);

socket-wide (socket 2):
        sys_perf_counter_open(&attr, 2, 0, PERF_TARGET_SOCKET);

I also suspect you can get by with fewer than 8 bits for the type of target.

Because in this scheme part of flags would not be treated as a bitmask
but rather as a bitfield, it may be better to just separate out this
target field, as in:

long sys_perf_counter_open(
        struct perf_counter_attr *attr,
        enum perf_target_type target_type,
        int target_id,
        int group_fd,
        unsigned long flags);

Which is what you had, except without the struct.

Then it boils down to whether 32 bits are enough to express a target id.
Obviously, 64 bits would be the safest, but I understand this causes
issues on 32-bit systems.


>>    - provide one syscall per target type (seems to be Hellwig's)
>>
>> Your scheme makes it possible to pass 64-bit target id. Not clear if
>> this is really needed.
>
> Yeah, not sure either, pids, fds and similar are all 32bit iirc. We do
> have 64bit resource ids but that's for things like inodes, and I'm not
> sure it'd make sense to attach a counter to an inode :-)
>
>
>

2009-07-30 16:41:56

by Arnd Bergmann

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Thursday 30 July 2009, stephane eranian wrote:
> long sys_perf_counter_open(
> struct perf_counter_attr *attr,
> enum perf_target_type target_type,
> int target_id,
> int group_fd,
> unsigned long flags);
>
> Which is what you had, except without the struct.
>
> Then, it boils down to whether expressing a target id in 32 bits is enough.
> Obviously, 64-bit would be the safest but then I understand this causes issues
> on 32-bit systems.

Just make it an unsigned long then, that still covers all cases
where you only need the 64-bit type on 64-bit systems.

Arnd <><

2009-07-30 16:53:33

by Stephane Eranian

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Thu, Jul 30, 2009 at 6:40 PM, Arnd Bergmann<[email protected]> wrote:
> On Thursday 30 July 2009, stephane eranian wrote:
>> long sys_perf_counter_open(
>>        struct perf_counter_attr *attr,
>>        enum perf_target_type  target_type,
>>        int target_id,
>>        int group_fd,
>>        unsigned long flags);
>>
>> Which is what you had, except without the struct.
>>
>> Then, it boils down to whether expressing a target id in 32 bits is enough.
>> Obviously, 64-bit would be the safest but then I understand this causes issues
>> on 32-bit systems.
>
> Just make it an unsigned long then, that still covers all cases
> where you only need the 64-bit type on 64-bit systems.
>
But that won't always work in the case of a 32-bit monitoring tool
running on top of a 64-bit OS. Imagine the target id is indeed 64-bit,
e.g., an inode number (as suggested by Peter). Being a 32-bit tool does
not mean you cannot name a monitoring resource in a 64-bit OS.



>        Arnd <><
>

2009-07-30 17:21:27

by Arnd Bergmann

[permalink] [raw]
Subject: Re: I.1 - System calls - ioctl

On Thursday 30 July 2009, stephane eranian wrote:
> But that won't always work in the case of a 32-bit monitoring tool
> running on top of
> a 64-bit OS. Imagine the target id is indeed 64-bit, e.g., inode
> number (as suggested
> by Peter). It's not because you are a 32-bit tool than you cannot name
> a monitoring
> resource in a 64-bit OS.

Right, there are obviously things that you cannot address with a
'long', but there are potentially other things that you could address
with a 'long' and not with an 'int', e.g. an opaque user token
(representing a user pointer) that you can get back in the sample data.

In the worst case, if you use a 'long' now, you could still redefine the
argument in the future as a transparent union of a long and a pointer.
AFAICT, there is no advantage to using an 'int' instead of a 'long', but
there are disadvantages to using a 'long long'.
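
For reference, the transparent-union idiom mentioned above looks roughly
like this (a sketch; nothing in this thread commits to it):

/* GCC extension: callers can pass either member directly, and the union
 * is passed with the calling convention of its first member. */
typedef union {
        unsigned long    val;   /* numeric target id  */
        void            *ptr;   /* opaque user token  */
} perf_target_arg __attribute__((__transparent_union__));

long sys_perf_counter_open(struct perf_counter_attr *attr,
                           perf_target_arg target,
                           int group_fd, unsigned long flags);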

Arnd <><