Hi,
Intel Last Branch Record (LBR) is a cyclic taken branch buffer hosted
in registers. It is present in Core 2, Atom, and Nehalem processors, each
generation adding some nice improvements over its predecessor.
LBR is very useful to capture the path that leads to an event. Although
the number of recorded branches is limited (4 on Core 2, 16 on Nehalem),
it is very valuable information.
One nice feature of LBR, unlike BTS, is that it can be set to freeze on PMU
interrupt. This is the way one can capture a path that leads to an event or
more precisely to a PMU interrupt.
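In MSR terms, that freeze-on-PMI setup boils down to roughly the following (a
minimal sketch; the DEBUGCTL bit positions follow the Intel SDM, bit 0 for LBR
and bit 11 for FREEZE_LBRS_ON_PMI):

#include <asm/msr.h>

/* DEBUGCTL bits controlling LBR recording (per the Intel SDM). */
#define X86_DEBUGCTL_LBR			(1 << 0)
#define X86_DEBUGCTL_FREEZE_LBRS_ON_PMI		(1 << 11)

static void enable_lbr_freeze_on_pmi(void)
{
	u64 debugctl;

	rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
	debugctl |= X86_DEBUGCTL_LBR | X86_DEBUGCTL_FREEZE_LBRS_ON_PMI;
	wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
}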
I started looking into how to add LBR support to perf_events. We have LBR
support in perfmon and it has proven very useful for some measurements.
The usage model is that you always couple LBR with sampling on an event.
You want the LBR state dumped into the sample on overflow. When you resume,
after an overflow, you clear LBR and you restart it.
One obvious implementation would be to add a new sample type such as
PERF_SAMPLE_TAKEN_BRANCHES. That would generate a sample with
a body containing an array of 4x2 up to 16x2 u64 addresses. Internally, the
hw_perf_event_structure would have to store the LBR state so it could be
saved and restored on context switch in per-thread mode.
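For illustration only, such a sample body could look like this (the struct and
field names are hypothetical, not part of the ABI):

#include <linux/types.h>

/*
 * Hypothetical layout of a PERF_SAMPLE_TAKEN_BRANCHES sample body: a count
 * followed by one {from, to} address pair per recorded branch
 * (4 pairs on Core 2/Atom, up to 16 on Nehalem).
 */
struct taken_branch_entry {
	__u64	from;	/* address of the taken branch */
	__u64	to;	/* branch target address */
};

struct taken_branch_sample {
	__u64				nr;		/* valid entries, 4..16 */
	struct taken_branch_entry	entries[16];
};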
There is one problem with this approach. On Nehalem, the LBR can be configured
to capture only certain types of branches + priv levels. That is about
8 config bits + priv levels. Where do we pass those config options?
One solution would be to provide as many PERF_SAMPLE bits as the hardware
has, OR provide some config field for it in perf_event_attr. All of this
would have to remain very generic.
An alternative approach is to define a new type of (pseudo)-event, e.g.,
PERF_TYPE_HW_BRANCH and provide variations very much like this is
done for the generic cache events. That event would be associated with a
new fixed-purpose counter (similar to BTS). It would go through scheduling
via a specific constraint (similar to BTS). The hw_perf_event structure
would provide the storage area for dumping LBR state.
To sample on LBR with the event approach, the LBR event would have to
be in the same event group. The sampling event would then simply add
sample_type = PERF_SAMPLE_GROUP.
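A rough userland sketch of that grouping model, assuming the pseudo-event
existed (PERF_TYPE_HW_BRANCH below is a placeholder value, and PERF_SAMPLE_READ
plus PERF_FORMAT_GROUP stands in for the PERF_SAMPLE_GROUP spelling used above):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Placeholder for the proposed pseudo-event type; not part of the ABI. */
#define PERF_TYPE_HW_BRANCH	6

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Open a sampling leader and attach the hypothetical LBR pseudo-event to
 * the same group so its state can be dumped with each sample. */
static int open_lbr_group(pid_t pid)
{
	struct perf_event_attr sampler, lbr;
	int leader_fd;

	memset(&sampler, 0, sizeof(sampler));
	sampler.size = sizeof(sampler);
	sampler.type = PERF_TYPE_HARDWARE;
	sampler.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	sampler.sample_period = 100000;
	sampler.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_READ;
	sampler.read_format = PERF_FORMAT_GROUP;

	leader_fd = perf_event_open(&sampler, pid, -1, -1, 0);
	if (leader_fd < 0)
		return -1;

	memset(&lbr, 0, sizeof(lbr));
	lbr.size = sizeof(lbr);
	lbr.type = PERF_TYPE_HW_BRANCH;	/* hypothetical */
	lbr.config = 0;			/* e.g. TAKEN:ANY */

	if (perf_event_open(&lbr, pid, -1, leader_fd, 0) < 0)
		return -1;

	return leader_fd;
}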
The second approach looks more extensible and flexible than the first one. But
it runs into a major problem with the current perf_event API/ABI and
implementation. The current assumption is that all events never return more
than 64-bit worth of data. In the case of LBR, we would need to return way
more than this.
A long time ago, I mentioned LBR as a key feature to support but we never
got to a solution as to how to support it with perf_events.
What's your take on this?
Stephane,
On 10.02.10 12:31:16, Stephane Eranian wrote:
> I started looking into how to add LBR support to perf_events. We have LBR
> support in perfmon and it has proven very useful for some measurements.
>
> The usage model is that you always couple LBR with sampling on an event.
> You want the LBR state dumped into the sample on overflow. When you resume,
> after an overflow, you clear LBR and you restart it.
>
> One obvious implementation would be to add a new sample type such as
> PERF_SAMPLE_TAKEN_BRANCHES. That would generate a sample with
> a body containing an array of 4x2 up to 16x2 u64 addresses. Internally, the
> hw_perf_event_structure would have to store the LBR state so it could be
> saved and restored on context switch in per-thread mode.
>
> There is one problem with this approach. On Nehalem, the LBR can be configured
> to capture only certain types of branches + priv levels. That is about
> 8 config bits
> + priv levels. Where do we pass those config options?
I have a solution for IBS in mind and am trying to implement it. The problem
is that the current development on perf is so fast and the changes are so
intrusive that I am not able to publish a working version due to merge
conflicts. So I need a bit of time to rework my existing implementation and
review your changes.
The basic idea for IBS is to define special pmu events that have a
different behaviour than standard events (on x86 these are performance
counters). The 64 bit configuration value of such an event is simply
marked as a special event. The pmu detects the type of the model
specific event and passes its value to the hardware. Doing so you can
pass any kind of configuration data to a certain pmu.
The sample data you get in this case could be either packed into the
standard perf_event sampling format, or if this does not fit, the pmu
may return raw samples in a special format the userland knows about.
The interface extension is adopting the perfmon2 model specific pmu
setup where you can pass config values to the pmu and return
performance data from it. The implementation is architecture
independent and compatible with the current interface. The only change
to the api is an additional bit to the perf_event_attr to mark the raw
config value as model specific.
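As a rough sketch of that idea (the model_spec bit and the IBS config encoding
below are hypothetical, they do not exist in the current perf_event_attr):

#include <linux/perf_event.h>
#include <string.h>

/*
 * Hypothetical setup of a model-specific (IBS) event: the raw config value
 * is passed through unmodified, and a new attr bit would tell the PMU code
 * not to interpret it as a performance-counter event select.
 */
static void setup_ibs_op_event(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_RAW;
	/* attr->model_spec = 1;	proposed new bit, not in the ABI yet */
	attr->config = 0x100000ULL;	/* illustrative IBS-op config encoding */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_RAW;
}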
> One solution would have to provide as many PERF_SAMPLE bits as the hardware
> OR provide some config field for it in perf_event_attr. All of this
> would have to
> remain very generic.
>
> An alternative approach is to define a new type of (pseudo)-event, e.g.,
> PERF_TYPE_HW_BRANCH and provide variations very much like this is
> done for the generic cache events. That event would be associated with a
> new fixed-purpose counter (similar to BTS). It would go through scheduling
> via a specific constraint (similar to BTS). The hw_perf_event structure
> would provide the storage area for dumping LBR state.
>
> To sample on LBR with the event approach, the LBR event would have to
> be in the same event group. The sampling event would then simply add
> sample_type = PERF_SAMPLE_GROUP.
>
> The second approach looks more extensible, flexible than the first one. But
> it runs into a major problem with the current perf_event API/ABI and
> implementation. The current assumption is that all events never return more
> than 64-bit worth of data. In the case of LBR, we would need to return way
> more than this.
My implementation just needs one 64-bit config value, but it could be
extended to use more than one config value.
I will try to send working sample code soon, but I need a 'somewhat
stable' perf tree for this. It would also help if you would publish
patch sets with many small patches instead of one big change. This
reduces merge and rebase effort.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]
On Wed, Feb 10, 2010 at 4:46 PM, Robert Richter <[email protected]> wrote:
> Stephane,
>
> On 10.02.10 12:31:16, Stephane Eranian wrote:
>> I started looking into how to add LBR support to perf_events. We have LBR
>> support in perfmon and it has proven very useful for some measurements.
>>
>> The usage model is that you always couple LBR with sampling on an event.
>> You want the LBR state dumped into the sample on overflow. When you resume,
>> after an overflow, you clear LBR and you restart it.
>>
>> One obvious implementation would be to add a new sample type such as
>> PERF_SAMPLE_TAKEN_BRANCHES. That would generate a sample with
>> a body containing an array of 4x2 up to 16x2 u64 addresses. Internally, the
>> hw_perf_event_structure would have to store the LBR state so it could be
>> saved and restored on context switch in per-thread mode.
>>
>> There is one problem with this approach. On Nehalem, the LBR can be configured
>> to capture only certain types of branches + priv levels. That is about
>> 8 config bits
>> + priv levels. Where do we pass those config options?
>
I was referring to the fact that if I enable LBR via a PERF_SAMPLE_* bit, I
will actually need more than one bit because there are configuration options.
I was not talking about event_attr.config.
> The basic idea for IBS is to define special pmu events that have a
> different behaviour than standard events (on x86 these are performance
> counters). The 64 bit configuration value of such an event is simply
> marked as a special event. The pmu detects the type of the model
> specific event and passes its value to the hardware. Doing so you can
> pass any kind of configuration data to a certain pmu.
>
Isn't that what the event_attr.type field is used for? There is a RAW type.
I use it all the time. As for passing to the PMU-specific code, this is
already what it does based on event_attr.type.
> The sample data you get in this case could be either packed into the
> standard perf_event sampling format, or if this does not fit, the pmu
> may return raw samples in a special format the userland knows about.
>
There is a PERF_SAMPLE_RAW (used by tracing?). It can return opaque
data of variable length.
There is a slight difference between IBS and LBR. LBR in itself does not
generate any interrupts. It has no associated period you arm. It is a free
running cyclic buffer. To be useful, it needs to be associated with a regular
counting event, e.g., BRANCH_INSTRUCTIONS_RETIRED. Thus, you
would need to set PERF_SAMPLE_TAKEN_BRANCH on this event, and
then you would expect the LBR data coming back as PERF_SAMPLE_RAW.
If you use the other approach, with a dedicated event type, for instance:
event.type = PERF_TYPE_HW_BRANCH;
event.config = PERF_HW_BRANCH:TAKEN:ANY
I used a symbolic name to make things clearer (but it is the same model as
for the cache events).
Then you need to group this event with BRANCH_INSTRUCTIONS_RETIRED
and set PERF_SAMPLE_GROUP to collect the values of the other member
of the group. In that case, the other member is LBR but it has a value that
is more than 64 bits. That does not work with the current code.
> The interface extension is adopting the perfmon2 model specific pmu
> setup where you can pass config values to the pmu and return
> performance data from it. The implementation is architecture
> independent and compatible with the current interface. The only change
> to the api is an additional bit to the perf_event_attr to mark the raw
> config value as model specific.
>
>> An alternative approach is to define a new type of (pseudo)-event, e.g.,
>> PERF_TYPE_HW_BRANCH and provide variations very much like this is
>> done for the generic cache events. That event would be associated with a
>> new fixed-purpose counter (similar to BTS). It would go through scheduling
>> via a specific constraint (similar to BTS). The hw_perf_event structure
>> would provide the storage area for dumping LBR state.
>>
>> To sample on LBR with the event approach, the LBR event would have to
>> be in the same event group. The sampling event would then simply add
>> sample_type = PERF_SAMPLE_GROUP.
>>
>> The second approach looks more extensible, flexible than the first one. But
>> it runs into a major problem with the current perf_event API/ABI and
>> implementation. The current assumption is that all events never return more
>> than 64-bit worth of data. In the case of LBR, we would need to return way
>> more than this.
>
> My implementation just need one 64 bit config value, but it could be
> extended to use more than one config value too.
>
Ok, I'll wait for the code then.
On 10.02.10 17:01:45, Stephane Eranian wrote:
> I was referring to the fact that if I enable LBR via a PERF_SAMPLE_* bit, I
> will actually need more than one bit because there are configuration options.
> I was not talking about event_attr.config.
I am not sure how big an LBR sample would be, but couldn't you send the
whole sample to the userland as a raw sample? If this is too much
overhead and you need to configure the format, you could set this up
using a small part of the config value.
> > The basic idea for IBS is to define special pmu events that have a
> > different behaviour than standard events (on x86 these are performance
> > counters). The 64 bit configuration value of such an event is simply
> > marked as a special event. The pmu detects the type of the model
> > specific event and passes its value to the hardware. Doing so you can
> > pass any kind of configuration data to a certain pmu.
> Isn't that what the event_attr.type field is used for? there is a RAW type.
> I use it all the time. As for passing to the PMU specific code, this is
> already what it does based on event_attr.type.
I mean, you could set up the pmu with a raw config value. The samples
you return are in raw format too. Doing so, you could put all the
information, including that about the sample format, into your
configuration. Of course there must be a way for values of more than 64
bits.
The problem with the current x86 implementation is that it expects a
raw config value in the performance counter format. To mark the config
as different, I would simply introduce a bit in event_attr that marks
it as a special event.
> > The sample data you get in this case could be either packed into the
> > standard perf_event sampling format, or if this does not fit, the pmu
> > may return raw samples in a special format the userland knows about.
> >
> There is a PERF_SAMPLE_RAW (used by tracing?). It can return opaque
> data of variable length.
>
> There is a slight difference between IBS and LBR. LBR in itself does not
> generate any interrupts. It has no associated period you arm. It is a free
> running cyclic buffer. To be useful, it needs to be associated with a regular
> counting event, e.g, BRANCH_INSTRUCTIONS_RETIRED. Thus, you
> would need to set PERF_SAMPLE_TAKEN_BRANCH on this event, and
> then you would expect the LBR data coming back as PERF_SAMPLE_RAW.
>
>
> If you use the other approach with a dedicated event type. For instance:
>
> event.type = PERF_TYPE_HW_BRANCH;
> event.config = PERF_HW_BRANCH:TAKEN:ANY
>
> I used a symbolic name to make things clearer (but it is the same model as
> for the cache events).
>
> Then you need to group this event with BRANCH_INSTRUCTIONS_RETIRED
> and set PERF_SAMPLE_GROUP to collect the values of the other member
> of the group. In that case, the other member is LBR but it has a value that
> is more than 64 bits. That does not work with the current code.
There are several questions: How to attach additional setup options to
an event? Grouping seems to be a solution for this. How to pass config
values with more than 64 bits to the pmu? An extension of the api is
probably needed, or grouping could work too. How to get samples back?
The raw sample format is the best to use here. For IBS the difference
is that the configuration has nothing to do with performance counters
and a raw config value needs different handling.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
email: [email protected]
Robert,
On Thu, Feb 11, 2010 at 11:24 PM, Robert Richter <[email protected]> wrote:
> On 10.02.10 17:01:45, Stephane Eranian wrote:
>> I was referring to the fact that if I enable LBR via a PERF_SAMPLE_* bit, I
>> will actually need more than one bit because there are configuration options.
>> I was not talking about event_attr.config.
>
> I am not sure how big a LBR sample would be, but couldn't you send the
> whole sample to the userland as a raw sample? If this is too much
> overhead and you need to configure the formate, you could set up this
> using a small part of the config value.
>
>> > The basic idea for IBS is to define special pmu events that have a
>> > different behaviour than standard events (on x86 these are performance
>> > counters). The 64 bit configuration value of such an event is simply
>> > marked as a special event. The pmu detects the type of the model
>> > specific event and passes its value to the hardware. Doing so you can
>> > pass any kind of configuration data to a certain pmu.
>
>> Isn't that what the event_attr.type field is used for? there is a RAW type.
>> I use it all the time. As for passing to the PMU specific code, this is
>> already what it does based on event_attr.type.
>
> I mean, you could setup the pmu with a raw config value. The samples
> you return are in raw format too. Doing so, you could put in all
> information, also that about the sample format into you
> configuration. Of course there must be a way for values more than 64
> bits.
Not quite for LBR. But I would do that for IBS. I mean define
pseudo-events with unique event selects that can be identified by
the kernel. Then for the rest, I would do:
- The IBS periods can be passed in attr.period. The frequency mode may be doable.
- I would ignore the random mode of IBSFETCH for now. Randomization must
be added in the general case anyway, thus we could leverage that later on.
- Then use PERF_SAMPLE_RAW to collect the IBS data.
Internally, the kernel would identify, in the scheduling code for AMD, these
special events, very much like what is done for BTS in
intel_special_constraints(). IBSFETCH and IBSOP would have pseudo
fixed-purpose counters assigned (similar to BTS). They would go through the
normal x86_schedule_events() routine. Given they are provided by only one
fixed-purpose counter, that would automatically reject attempts to use
IBSOP/IBSFETCH multiple times per event group. On overflow, the handler would
dump the IBS data registers into the data.raw area.
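A minimal sketch of that overflow path, assuming a hypothetical ibs_op_sample
register block (only the perf_raw_record/data.raw usage follows the existing
perf_event conventions):

#include <linux/perf_event.h>

/* Illustrative layout of the IBS-op registers captured at overflow. */
struct ibs_op_sample {
	u64	op_ctl;
	u64	op_rip;
	u64	op_data;
	u64	op_data2;
	u64	op_data3;
	u64	dc_lin_addr;
	u64	dc_phys_addr;
};

static void ibs_op_overflow(struct perf_event *event, struct pt_regs *regs,
			    struct ibs_op_sample *ibs)
{
	struct perf_sample_data data;
	struct perf_raw_record raw;

	data.addr = 0;
	data.period = event->hw.last_period;

	/* Hand the raw IBS register block back via PERF_SAMPLE_RAW. */
	raw.size = sizeof(*ibs);
	raw.data = ibs;
	data.raw = &raw;

	perf_event_overflow(event, 1, &data, regs);
}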
There are things, however, that you cannot do with non-counting events. You cannot
count and therefore you cannot aggregate across threads.
But here is a key difference with LBR: if you use a pseudo-event for LBR, you
cannot use PERF_SAMPLE_RAW. That's because LBR does NOT interrupt. You
always need to associate it with another counting event, so it must be used
as part of an event pair. Setting PERF_SAMPLE_RAW on the counting event does
not make sense; there is no raw data associated with the counting event. You
need to use PERF_SAMPLE_READ+PERF_FORMAT_GROUP instead.
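For reference, with PERF_FORMAT_GROUP the group readout delivers one 64-bit
value per member, roughly shaped like this (the id field is only present with
PERF_FORMAT_ID), which is exactly why a multi-entry LBR stack does not fit
through this path today:

#include <linux/types.h>

/* Shape of the data returned for a group leader opened with
 * read_format = PERF_FORMAT_GROUP | PERF_FORMAT_ID. */
struct read_group_format {
	__u64	nr;		/* number of events in the group */
	struct {
		__u64	value;	/* 64-bit counter value */
		__u64	id;	/* event id, with PERF_FORMAT_ID */
	} cnt[];
};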
If you go with the PERF_SAMPLE_LBR sample_type approach, you are right:
you would need to encode LBR settings into the config field. But that's awkward.
The config field relates to the event and not to its sample_type bitmask.
And AFAIK, the sample_type is meant to carry generic features, not
model-specific ones. Internally it would also be more difficult to manage
because you would need an extra per-event storage area to save/restore LBR.
One thing that is possible, though, is to define a pseudo model-specific event
for LBR, e.g. LBR_EVENT, instead of defining a new event type
(PERF_TYPE_HW_BRANCH). That would leave this as a model-specific feature,
which I think it is for now, though I believe some of the LBR setup is now
architected by Intel.
Anyway, I am working on getting LBR support. I got some promising results
already. Will update you once I have a clean and working solution.
> The problem with the current x86 implementation is that it expects a
> raw config value in the performance counter format. To mark the config
> as different, I would simply introduce a bit in event_attr that marks
> it as special event.
>
>> > The sample data you get in this case could be either packed into the
>> > standard perf_event sampling format, or if this does not fit, the pmu
>> > may return raw samples in a special format the userland knows about.
>> >
>> There is a PERF_SAMPLE_RAW (used by tracing?). It can return opaque
>> data of variable length.
>>
>> There is a slight difference between IBS and LBR. LBR in itself does not
>> generate any interrupts. It has no associated period you arm. It is a free
>> running cyclic buffer. To be useful, it needs to be associated with a regular
>> counting event, e.g, BRANCH_INSTRUCTIONS_RETIRED. Thus, you
>> would need to set PERF_SAMPLE_TAKEN_BRANCH on this event, and
>> then you would expect the LBR data coming back as PERF_SAMPLE_RAW.
>>
>>
>> If you use the other approach with a dedicated event type. For instance:
>>
>> event.type = PERF_TYPE_HW_BRANCH;
>> event.config = PERF_HW_BRANCH:TAKEN:ANY
>>
>> I used a symbolic name to make things clearer (but it is the same model as
>> for the cache events).
>>
>> Then you need to group this event with BRANCH_INSTRUCTIONS_RETIRED
>> and set PERF_SAMPLE_GROUP to collect the values of the other member
>> of the group. In that case, the other member is LBR but it has a value that
>> is more than 64 bits. That does not work with the current code.
>
> There are several questions: How to attach additional setup options to
> an event? Grouping seems to be a solution for this. How to pass config
> values with more than 64 bits to the pmu? An extension of the api is
> probably needed, or grouping could work too. How to get samples back?
> The raw sample format is the best to use here. For IBS the difference
> is that the configuration has nothing to do with performance counters
> and a raw config value needs differen handling.
>
> -Robert
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
> email: [email protected]
>
>
--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00
On Wed, 2010-02-10 at 12:31 +0100, Stephane Eranian wrote:
> Intel Last Branch Record (LBR) is a cyclic taken branch buffer hosted
> in registers. It is present in Core 2, Atom, and Nehalem processors. Each
> one adding some nice improvements over its predecessor.
>
> LBR is very useful to capture the path that leads to an event. Although
> the number of recorded branches is limited (4 on Core2 but 16 in Nehalem)
> it is very valuable information.
>
> One nice feature of LBR, unlike BTS, is that it can be set to freeze on PMU
> interrupt. This is the way one can capture a path that leads to an event or
> more precisely to a PMU interrupt.
Right, it allows computing the actual IP for the IP+1 PEBS issue, among
other things, although that requires using a PEBS threshold of 1 record,
I figure.
> The usage model is that you always couple LBR with sampling on an event.
> You want the LBR state dumped into the sample on overflow. When you resume,
> after an overflow, you clear LBR and you restart it.
>
> One obvious implementation would be to add a new sample type such as
> PERF_SAMPLE_TAKEN_BRANCHES. That would generate a sample with
> a body containing an array of 4x2 up to 16x2 u64 addresses. Internally, the
> hw_perf_event_structure would have to store the LBR state so it could be
> saved and restored on context switch in per-thread mode.
x3 actually (like the BTS record), because we cannot keep the flags in
the from address like the hardware does; we need to split them out into
a separate word, otherwise we'll run into trouble the moment someone
makes a machine with a 64-bit virtual address space.
> There is one problem with this approach. On Nehalem, the LBR can be configured
> to capture only certain types of branches + priv levels. That is about
> 8 config bits + priv levels. Where do we pass those config options?
Right, this config stuff really messes things up on various levels.
> One solution would have to provide as many PERF_SAMPLE bits as the hardware
> OR provide some config field for it in perf_event_attr. All of this
> would have to remain very generic.
The problem with this LBR config stuff is that it creates inter-counter
constraints, because each counter wanting LBR samples needs to have the
same config.
Dealing with context switches is also going to be tricky, where we have
to save and 'restore' LBR stacks for per-task counters.
FWIW, I'm tempted to stick with the !config variant, that's going to be
interesting enough to implement. Also, I'd really like to see a sensible
use case for these config bits that would justify their complexity.
> An alternative approach is to define a new type of (pseudo)-event, e.g.,
> PERF_TYPE_HW_BRANCH and provide variations very much like this is
> done for the generic cache events. That event would be associated with a
> new fixed-purpose counter (similar to BTS). It would go through scheduling
> via a specific constraint (similar to BTS). The hw_perf_event structure
> would provide the storage area for dumping LBR state.
>
> To sample on LBR with the event approach, the LBR event would have to
> be in the same event group. The sampling event would then simply add
> sample_type = PERF_SAMPLE_GROUP.
>
> The second approach looks more extensible, flexible than the first one. But
> it runs into a major problem with the current perf_event API/ABI and
> implementation. The current assumption is that all events never return more
> than 64-bit worth of data. In the case of LBR, we would need to return way
> more than this.
Agreed, that is also not a very attractive model.
On Sun, 2010-02-14 at 11:12 +0100, Peter Zijlstra wrote:
>
> Dealing with context switches is also going to be tricky, where we have
> to safe and 'restore' LBR stacks for per-task counters.
OK, so I poked at the LBR hardware a bit, sadly the TOS really doesn't
count beyond the few bits it requires :-(
I had hopes it would, since that would make it easier to share the LBR,
simply take a TOS snapshot when you schedule the counter in, and never
roll back further for that particular counter.
As it stands we'll have to wipe the full LBR state every time we 'touch'
it, which makes it less useful for cpu-bound counters.
Also, not all hw (core and pentium-m) supports the freeze_lbrs_on_pmi
bit, what we could do for those is stick an unconditional LBR disable
very early in the NMI path and simply roll back the stack until we hit a
branch into the NMI vector, that should leave a few usable LBR entries.
For AMD and P6 there is only a single LBR record, AMD seems to freeze
the thing on #DB traps but the PMI isn't qualified as one afaict,
rendering the single entry useless (didn't look at the P6 details).
hackery below..
---
arch/x86/include/asm/perf_event.h | 24 +++
arch/x86/kernel/cpu/perf_event.c | 233 +++++++++++++++++++++++++++++++++++---
arch/x86/kernel/traps.c | 3
include/linux/perf_event.h | 7 -
4 files changed, 251 insertions(+), 16 deletions(-)
Index: linux-2.6/arch/x86/kernel/cpu/perf_event.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/cpu/perf_event.c
+++ linux-2.6/arch/x86/kernel/cpu/perf_event.c
@@ -104,6 +104,10 @@ struct amd_nb {
struct event_constraint event_constraints[X86_PMC_IDX_MAX];
};
+struct lbr_entry {
+ u64 from, to, flags;
+};
+
struct cpu_hw_events {
struct perf_event *events[X86_PMC_IDX_MAX]; /* in counter order */
unsigned long active_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
@@ -117,6 +121,10 @@ struct cpu_hw_events {
u64 tags[X86_PMC_IDX_MAX];
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
struct amd_nb *amd_nb;
+
+ int lbr_users;
+ int lbr_entries;
+ struct lbr_entry lbr_stack[16];
};
#define __EVENT_CONSTRAINT(c, n, m, w) {\
@@ -187,6 +195,19 @@ struct x86_pmu {
void (*put_event_constraints)(struct cpu_hw_events *cpuc,
struct perf_event *event);
struct event_constraint *event_constraints;
+
+ unsigned long lbr_tos;
+ unsigned long lbr_from, lbr_to;
+ int lbr_nr;
+ int lbr_ctl;
+ int lbr_format;
+};
+
+enum {
+ LBR_FORMAT_32 = 0x00,
+ LBR_FORMAT_LIP = 0x01,
+ LBR_FORMAT_EIP = 0x02,
+ LBR_FORMAT_EIP_FLAGS = 0x03,
};
static struct x86_pmu x86_pmu __read_mostly;
@@ -1203,6 +1224,52 @@ static void intel_pmu_disable_bts(void)
update_debugctlmsr(debugctlmsr);
}
+static void __intel_pmu_enable_lbr(void)
+{
+ u64 debugctl;
+
+ rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+ debugctl |= x86_pmu.lbr_ctl;
+ wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+}
+
+static void intel_pmu_enable_lbr(void)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
+
+ if (!cpuc->lbr_users)
+ __intel_pmu_enable_lbr();
+
+ cpuc->lbr_users++;
+}
+
+static void __intel_pmu_disable_lbr(void)
+{
+ u64 debugctl;
+
+ rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+ debugctl &= ~x86_pmu.lbr_ctl;
+ wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+}
+
+static void intel_pmu_disable_lbr(void)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (!x86_pmu.lbr_nr)
+ return;
+
+ cpuc->lbr_users--;
+
+ BUG_ON(cpuc->lbr_users < 0);
+
+ if (!cpuc->lbr_users)
+ __intel_pmu_disable_lbr();
+}
+
static void intel_pmu_pebs_enable(struct hw_perf_event *hwc)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -1402,6 +1469,9 @@ void hw_perf_disable(void)
cpuc->enabled = 0;
barrier();
+ if (cpuc->lbr_users)
+ __intel_pmu_disable_lbr();
+
x86_pmu.disable_all();
}
@@ -1703,6 +1773,10 @@ void hw_perf_enable(void)
barrier();
x86_pmu.enable_all();
+
+ // XXX
+ if (cpuc->lbr_users = 1)
+ __intel_pmu_enable_lbr();
}
static inline u64 intel_pmu_get_status(void)
@@ -2094,7 +2168,6 @@ static void intel_pmu_drain_pebs_core(st
struct perf_event_header header;
struct perf_sample_data data;
struct pt_regs regs;
- u64
if (!event || !ds || !x86_pmu.pebs)
return;
@@ -2114,7 +2187,7 @@ static void intel_pmu_drain_pebs_core(st
perf_prepare_sample(&header, &data, event, ®s);
- event.hw.interrupts += (top - at);
+ event->hw.interrupts += (top - at);
atomic64_add((top - at) * event->hw.last_period, &event->count);
if (perf_output_begin(&handle, event, header.size * (top - at), 1, 1))
@@ -2188,6 +2261,84 @@ static void intel_pmu_drain_pebs_nhm(str
}
}
+static inline u64 intel_pmu_lbr_tos(void)
+{
+ u64 tos;
+
+ rdmsrl(x86_pmu.lbr_tos, tos);
+ return tos;
+}
+
+static void
+intel_pmu_read_lbr_32(struct cpu_hw_events *cpuc, struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long mask = x86_pmu.lbr_nr - 1;
+ u64 tos = intel_pmu_lbr_tos();
+ int i;
+
+ for (i = 0; tos > hwc->lbr_tos && i < x86_pmu.lbr_nr; i++, tos--) {
+ unsigned long lbr_idx = (tos - i) & mask;
+ union {
+ struct {
+ u32 from;
+ u32 to;
+ };
+ u64 lbr;
+ } msr_lastbranch;
+
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
+
+ cpuc->lbr_stack[i].from = msr_lastbranch.from;
+ cpuc->lbr_stack[i].to = msr_lastbranch.to;
+ cpuc->lbr_stack[i].flags = 0;
+ }
+ cpuc->lbr_entries = i;
+}
+
+#define LBR_FROM_FLAG_MISPRED (1ULL << 63)
+
+/*
+ * Due to lack of segmentation in Linux the effective address (offset)
+ * is the same as the linear address, allowing us to merge the LIP and EIP
+ * LBR formats.
+ */
+static void
+intel_pmu_read_lbr_64(struct cpu_hw_events *cpuc, struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long mask = x86_pmu.lbr_nr - 1;
+ u64 tos = intel_pmu_lbr_tos();
+ int i;
+
+ for (i = 0; tos > hwc->lbr_tos && i < x86_pmu.lbr_nr; i++, tos--) {
+ unsigned long lbr_idx = (tos - i) & mask;
+ u64 from, to, flags = 0;
+
+ rdmsrl(x86_pmu.lbr_from + lbr_idx, from);
+ rdmsrl(x86_pmu.lbr_to + lbr_idx, to);
+
+ if (x86_pmu.lbr_format == LBR_FORMAT_EIP_FLAGS) {
+ flags = !!(from & LBR_FROM_FLAG_MISPRED);
+ from = (u64)((((s64)from) << 1) >> 1);
+ }
+
+ cpuc->lbr_stack[i].from = from;
+ cpuc->lbr_stack[i].to = to;
+ cpuc->lbr_stack[i].flags = flags;
+ }
+ cpuc->lbr_entries = i;
+}
+
+static void
+intel_pmu_read_lbr(struct cpu_hw_events *cpuc, struct perf_event *event)
+{
+ if (x86_pmu.lbr_format == LBR_FORMAT_32)
+ intel_pmu_read_lbr_32(cpuc, event);
+ else
+ intel_pmu_read_lbr_64(cpuc, event);
+}
+
static void x86_pmu_stop(struct perf_event *event)
{
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
@@ -2456,11 +2607,26 @@ perf_event_nmi_handler(struct notifier_b
* If the first NMI handles both, the latter will be empty and daze
* the CPU.
*/
+ trace_printk("LBR TOS: %Ld\n", intel_pmu_lbr_tos());
x86_pmu.handle_irq(regs);
return NOTIFY_STOP;
}
+static __read_mostly struct notifier_block perf_event_nmi_notifier = {
+ .notifier_call = perf_event_nmi_handler,
+ .next = NULL,
+ .priority = 1
+};
+
+void perf_nmi_exit(void)
+{
+ struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
+
+ if (cpuc->lbr_users)
+ __intel_pmu_enable_lbr();
+}
+
static struct event_constraint unconstrained; /* can schedule */
static struct event_constraint null_constraint; /* can't schedule */
static struct event_constraint bts_constraint =
@@ -2761,12 +2927,6 @@ undo:
return ret;
}
-static __read_mostly struct notifier_block perf_event_nmi_notifier = {
- .notifier_call = perf_event_nmi_handler,
- .next = NULL,
- .priority = 1
-};
-
static __initconst struct x86_pmu p6_pmu = {
.name = "p6",
.handle_irq = x86_pmu_handle_irq,
@@ -2793,7 +2953,7 @@ static __initconst struct x86_pmu p6_pmu
.event_bits = 32,
.event_mask = (1ULL << 32) - 1,
.get_event_constraints = intel_get_event_constraints,
- .event_constraints = intel_p6_event_constraints
+ .event_constraints = intel_p6_event_constraints,
};
static __initconst struct x86_pmu core_pmu = {
@@ -2873,18 +3033,26 @@ static __init int p6_pmu_init(void)
case 7:
case 8:
case 11: /* Pentium III */
+ x86_pmu = p6_pmu;
+
+ break;
case 9:
- case 13:
- /* Pentium M */
+ case 13: /* Pentium M */
+ x86_pmu = p6_pmu;
+
+ x86_pmu.lbr_nr = 8;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR;
+ x86_pmu.lbr_from = 0x40;
+
break;
+
default:
pr_cont("unsupported p6 CPU model %d ",
boot_cpu_data.x86_model);
return -ENODEV;
}
- x86_pmu = p6_pmu;
-
return 0;
}
@@ -2925,6 +3093,9 @@ static __init int intel_pmu_init(void)
x86_pmu.event_bits = eax.split.bit_width;
x86_pmu.event_mask = (1ULL << eax.split.bit_width) - 1;
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, capabilities);
+ x86_pmu.lbr_format = capabilities & 0x1f;
+
/*
* Quirk: v2 perfmon does not report fixed-purpose events, so
* assume at least 3 events:
@@ -2973,6 +3144,10 @@ no_datastore:
*/
switch (boot_cpu_data.x86_model) {
case 14: /* 65 nm core solo/duo, "Yonah" */
+ x86_pmu.lbr_nr = 8;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR;
+ x86_pmu.lbr_from = 0x40;
pr_cont("Core events, ");
break;
@@ -2980,6 +3155,13 @@ no_datastore:
case 22: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
case 23: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
case 29: /* six-core 45 nm xeon "Dunnington" */
+ x86_pmu.lbr_nr = 4;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR |
+ X86_DEBUGCTL_FREEZE_LBRS_ON_PMI;
+ x86_pmu.lbr_from = 0x40;
+ x86_pmu.lbr_to = 0x60;
+
memcpy(hw_cache_event_ids, core2_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
@@ -2989,13 +3171,28 @@ no_datastore:
case 26: /* 45 nm nehalem, "Bloomfield" */
case 30: /* 45 nm nehalem, "Lynnfield" */
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR |
+ X86_DEBUGCTL_FREEZE_LBRS_ON_PMI;
+ x86_pmu.lbr_from = 0x680;
+ x86_pmu.lbr_to = 0x6c0;
+
memcpy(hw_cache_event_ids, nehalem_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
x86_pmu.event_constraints = intel_nehalem_event_constraints;
pr_cont("Nehalem/Corei7 events, ");
break;
- case 28:
+
+ case 28: /* Atom */
+ x86_pmu.lbr_nr = 8;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR |
+ X86_DEBUGCTL_FREEZE_LBRS_ON_PMI;
+ x86_pmu.lbr_from = 0x40;
+ x86_pmu.lbr_to = 0x60;
+
memcpy(hw_cache_event_ids, atom_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
@@ -3005,12 +3202,20 @@ no_datastore:
case 37: /* 32 nm nehalem, "Clarkdale" */
case 44: /* 32 nm nehalem, "Gulftown" */
+ x86_pmu.lbr_nr = 16;
+ x86_pmu.lbr_tos = 0x01c9;
+ x86_pmu.lbr_ctl = X86_DEBUGCTL_LBR |
+ X86_DEBUGCTL_FREEZE_LBRS_ON_PMI;
+ x86_pmu.lbr_from = 0x680;
+ x86_pmu.lbr_to = 0x6c0;
+
memcpy(hw_cache_event_ids, westmere_hw_cache_event_ids,
sizeof(hw_cache_event_ids));
x86_pmu.event_constraints = intel_westmere_event_constraints;
pr_cont("Westmere events, ");
break;
+
default:
/*
* default constraints for v2 and up
Index: linux-2.6/arch/x86/include/asm/perf_event.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/perf_event.h
+++ linux-2.6/arch/x86/include/asm/perf_event.h
@@ -1,6 +1,8 @@
#ifndef _ASM_X86_PERF_EVENT_H
#define _ASM_X86_PERF_EVENT_H
+#include <asm/msr.h>
+
/*
* Performance event hw details:
*/
@@ -122,11 +124,31 @@ union cpuid10_edx {
extern void init_hw_perf_events(void);
extern void perf_events_lapic_init(void);
+#define X86_DEBUGCTL_LBR (1 << 0)
+#define X86_DEBUGCTL_FREEZE_LBRS_ON_PMI (1 << 11)
+
+static __always_inline void perf_nmi_enter(void)
+{
+ u64 debugctl;
+
+ /*
+ * Unconditionally disable LBR so as to minimally pollute the LBR stack.
+ * XXX: paravirt will screw us over massive
+ */
+ rdmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+ debugctl &= ~X86_DEBUGCTL_LBR;
+ wrmsrl(MSR_IA32_DEBUGCTLMSR, debugctl);
+}
+
+extern void perf_nmi_exit(void);
+
#define PERF_EVENT_INDEX_OFFSET 0
#else
static inline void init_hw_perf_events(void) { }
-static inline void perf_events_lapic_init(void) { }
+static inline void perf_events_lapic_init(void) { }
+static inline void perf_nmi_enter(void) { }
+static inline void perf_nmi_exit(void) { }
#endif
#endif /* _ASM_X86_PERF_EVENT_H */
Index: linux-2.6/arch/x86/kernel/traps.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/traps.c
+++ linux-2.6/arch/x86/kernel/traps.c
@@ -45,6 +45,7 @@
#endif
#include <asm/kmemcheck.h>
+#include <asm/perf_event.h>
#include <asm/stacktrace.h>
#include <asm/processor.h>
#include <asm/debugreg.h>
@@ -442,6 +443,7 @@ static notrace __kprobes void default_do
dotraplinkage notrace __kprobes void
do_nmi(struct pt_regs *regs, long error_code)
{
+ perf_nmi_enter();
nmi_enter();
inc_irq_stat(__nmi_count);
@@ -450,6 +452,7 @@ do_nmi(struct pt_regs *regs, long error_
default_do_nmi(regs);
nmi_exit();
+ perf_nmi_exit();
}
void stop_nmi(void)
Index: linux-2.6/include/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/linux/perf_event.h
+++ linux-2.6/include/linux/perf_event.h
@@ -125,8 +125,9 @@ enum perf_event_sample_format {
PERF_SAMPLE_PERIOD = 1U << 8,
PERF_SAMPLE_STREAM_ID = 1U << 9,
PERF_SAMPLE_RAW = 1U << 10,
+ PERF_SAMPLE_LBR = 1U << 11,
- PERF_SAMPLE_MAX = 1U << 11, /* non-ABI */
+ PERF_SAMPLE_MAX = 1U << 12, /* non-ABI */
};
/*
@@ -396,6 +397,9 @@ enum perf_event_type {
* { u64 nr,
* u64 ips[nr]; } && PERF_SAMPLE_CALLCHAIN
*
+ * { u64 nr;
+ * struct lbr_format lbr[nr]; } && PERF_SAMPLE_LBR
+ *
* #
* # The RAW record below is opaque data wrt the ABI
* #
@@ -483,6 +487,7 @@ struct hw_perf_event {
int idx;
int last_cpu;
int pebs;
+ u64 lbr_tos;
};
struct { /* software */
s64 remaining;
Hi,
On Thu, Feb 18, 2010 at 11:25 PM, Peter Zijlstra <[email protected]> wrote:
> On Sun, 2010-02-14 at 11:12 +0100, Peter Zijlstra wrote:
>>
>> Dealing with context switches is also going to be tricky, where we have
>> to safe and 'restore' LBR stacks for per-task counters.
>
> OK, so I poked at the LBR hardware a bit, sadly the TOS really doesn't
> count beyond the few bits it requires :-(
>
The TOS is also a read-only MSR.
> I had hopes it would, since that would make it easier to share the LBR,
> simply take a TOS snapshot when you schedule the counter in, and never
> roll back further for that particular counter.
>
> As it stands we'll have to wipe the full LBR state every time we 'touch'
> it, which makes it less useful for cpu-bound counters.
>
Yes, you need to clean it up each time you snapshot it and each time
you restore it.
The patch does not seem to handle LBR context switches.
> Also, not all hw (core and pentium-m) supports the freeze_lbrs_on_pmi
> bit, what we could do for those is stick an unconditional LBR disable
> very early in the NMI path and simply roll back the stack until we hit a
> branch into the NMI vector, that should leave a few usable LBR entries.
>
You need to be consistent across the CPUs. If a CPU does not provide
freeze_on_pmi, then I would simply not support it as a first approach.
Same thing if the LBR is less than 4-deep. I don't think you'll get anything
useful out of it.
> For AMD and P6 there is only a single LBR record, AMD seems to freeze
> the thing on #DB traps but the PMI isn't qualified as one afaict,
> rendering the single entry useless (didn't look at the P6 details).
>
> hackery below..
The patch does not address the configuration options available on Intel
Nehalem/Westmere, i.e., LBR_SELECT (see Vol 3a table 16-9). We can
handle priv level separately as it can be derived from the event exclude_*.
But if you want to allow multiple events in a group to use PERF_SAMPLE_LBR
then you need to ensure LBR_SELECT is set to the same value, priv levels
included.
Furthermore, LBR_SELECT is shared between HT threads. We need to either
add another field in perf_event_attr or encode this in the config
field, though it is ugly because unrelated to the event but rather to the
sample_type.
The patch is missing the sampling part, i.e., dump of the LBR (in sequential
order) into the sampling buffer.
I would also select a better name than PERF_SAMPLE_LBR. LBR is an
Intel thing. Maybe PERF_SAMPLE_TAKEN_BRANCH.
> [full quote of the patch above snipped]
On Mon, 2010-02-22 at 15:07 +0100, Stephane Eranian wrote:
> On Thu, Feb 18, 2010 at 11:25 PM, Peter Zijlstra <[email protected]> wrote:
> > On Sun, 2010-02-14 at 11:12 +0100, Peter Zijlstra wrote:
> >>
> >> Dealing with context switches is also going to be tricky, where we have
> >> to safe and 'restore' LBR stacks for per-task counters.
> >
> > OK, so I poked at the LBR hardware a bit, sadly the TOS really doesn't
> > count beyond the few bits it requires :-(
> >
>
> The TOS is also a read-only MSR.
well, r/o is fine.
> > I had hopes it would, since that would make it easier to share the LBR,
> > simply take a TOS snapshot when you schedule the counter in, and never
> > roll back further for that particular counter.
> >
> > As it stands we'll have to wipe the full LBR state every time we 'touch'
> > it, which makes it less useful for cpu-bound counters.
> >
> Yes, you need to clean it up each time you snapshot it and each time
> you restore it.
>
> The patch does not seem to handle LBR context switches.
Well, it does, but sadly not in a viable way: it assumes the TOS counts
more than the required bits and stops the unwind at the hwc->lbr_tos
snapshot. Except that the TOS doesn't work that way.
This whole PEBS/LBR stuff is a massive trainwreck from a design pov.
> > Also, not all hw (core and pentium-m) supports the freeze_lbrs_on_pmi
> > bit, what we could do for those is stick an unconditional LBR disable
> > very early in the NMI path and simply roll back the stack until we hit a
> > branch into the NMI vector, that should leave a few usable LBR entries.
> >
> You need to be consistent across the CPUs. If a CPU does not provide
> freeze_on_pmi, then I would simply not support it as a first approach.
> Same thing if the LBR is less than 4-deep. I don't think you'll get anything
> useful out of it.
Well, if at the first branch into the NMI handler you do an
unconditional LBR disable, you should still have 3 usable records. But
yeah, the 1 deep LBR chips (p6 and amd) are pretty useless for this
purpose and are indeed not supported.
> The patch does not address the configuration options available on Intel
> Nehalem/Westmere, i.e., LBR_SELECT (see Vol 3a table 16-9). We can
> handle priv level separately as it can be derived from the event exclude_*.
> But it you want to allow multiple events in a group to use PERF_SAMPLE_LBR
> then you need to ensure LBR_SELECT is set to the same value, priv levels
> included.
Yes, I explicitly skipped that because of the HT thing and because like
I argued in an earlier reply, I don't see much use for it, that is, it
significantly complicates matters for not much (if any) benefit.
As it stands LBR seems much more like a hw-breakpoint feature than a PMU
feature, except for this trainwreck called PEBS.
> Furthermore, LBR_SELECT is shared between HT threads. We need to either
> add another field in perf_event_attr or encode this in the config
> field, though it
> is ugly because unrelated to the event but rather to the sample_type.
>
> The patch is missing the sampling part, i.e., dump of the LBR (in sequential
> order) into the sampling buffer.
Yes, I just hacked enough stuff together to poke at the hardware a bit,
never said it was anywhere near complete.
> I would also select a better name than PERF_SAMPLE_LBR. LBR is an
> Intel thing. Maybe PERF_SAMPLE_TAKEN_BRANCH.
Either LAST_BRANCH (suggesting a single entry), or BRANCH_STACK
(suggesting >1 possible entries) seem more appropriate.
Supporting only a single entry, LAST_BRANCH, seems like an attractive
enough option, the use of multiple steps back seem rather pointless for
interpreting the sample.
On Mon, Feb 22, 2010 at 3:29 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-02-22 at 15:07 +0100, Stephane Eranian wrote:
>> On Thu, Feb 18, 2010 at 11:25 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Sun, 2010-02-14 at 11:12 +0100, Peter Zijlstra wrote:
>> >>
>> >> Dealing with context switches is also going to be tricky, where we have
>> >> to safe and 'restore' LBR stacks for per-task counters.
>> >
>> > OK, so I poked at the LBR hardware a bit, sadly the TOS really doesn't
>> > count beyond the few bits it requires :-(
>> >
>>
>> The TOS is also a read-only MSR.
>
> well, r/o is fine.
>
We need to restore or stitch LBR entries at some point to get the full
sequential history. This is needed when a thread migrates from one CPU to
another.
>> > I had hopes it would, since that would make it easier to share the LBR,
>> > simply take a TOS snapshot when you schedule the counter in, and never
>> > roll back further for that particular counter.
>> >
>> > As it stands we'll have to wipe the full LBR state every time we 'touch'
>> > it, which makes it less useful for cpu-bound counters.
>> >
>> Yes, you need to clean it up each time you snapshot it and each time
>> you restore it.
>>
>> The patch does not seem to handle LBR context switches.
>
> Well, it does, but sadly not in a viable way, it assumes the TOS counts
> more than the required bits and stops the unwind on hwc->lbr_tos
> snapshot. Except that the TOS doesn't work that way.
>
Yes, you cannot simply record a point in time and extract the difference
with the current TOS value. The LBR may wrap around multiple times. You need
to do the basic save and restore.
> This whole PEBS/LBR stuff is a massive trainwreck from a design pov.
LBR is unrelated to PEBS. LBR provides quite some value-add. Thus, it
needs to be supported.
>
>> > Also, not all hw (core and pentium-m) supports the freeze_lbrs_on_pmi
>> > bit, what we could do for those is stick an unconditional LBR disable
>> > very early in the NMI path and simply roll back the stack until we hit a
>> > branch into the NMI vector, that should leave a few usable LBR entries.
>> >
>> You need to be consistent across the CPUs. If a CPU does not provide
>> freeze_on_pmi, then I would simply not support it as a first approach.
>> Same thing if the LBR is less than 4-deep. I don't think you'll get anything
>> useful out of it.
>
> Well, if at the first branch into the NMI handler you do an
> unconditional LBR disable, you should still have 3 usable records. But
> yeah, the 1 deep LBR chips (p6 and amd) are pretty useless for this
> purpose and are indeed not supported.
>
I doubt that by the time you get to the NMI handler, you have not at least
executed 3 branches in some assembly code. I would simply not support
LBR on those processors.
>> The patch does not address the configuration options available on Intel
>> Nehalem/Westmere, i.e., LBR_SELECT (see Vol 3a table 16-9). We can
>> handle priv level separately as it can be derived from the event exclude_*.
>> But it you want to allow multiple events in a group to use PERF_SAMPLE_LBR
>> then you need to ensure LBR_SELECT is set to the same value, priv levels
>> included.
>
> Yes, I explicitly skipped that because of the HT thing and because like
> I argued in an earlier reply, I don't see much use for it, that is, it
> significantly complicates matters for not much (if any) benefit.
>
Well, I want to be able to filter the type of branches captured by LBR
and in particular return branches. Useful if you want to collect a statistical
call graph, for instance. We did that on Itanium a very long time ago, using
their equivalent to LBR (called BTB) and it gave very good results.
Without filtering,
code with loops will inevitably pollute the LBR and the data will be useless for
building a statistical call graph.
> As it stands LBR seems much more like a hw-breakpoint feature than a PMU
> feature, except for this trainwreck called PEBS.
>
I don't understand your comparison. LBR is just a free running cyclic buffer
recording taken branches. You simply want to snapshot it on PMU interrupt.
It is totally independent of PEBS. It does not operate in the same way.
>> Furthermore, LBR_SELECT is shared between HT threads. We need to either
>> add another field in perf_event_attr or encode this in the config
>> field, though it
>> is ugly because unrelated to the event but rather to the sample_type.
>>
>> The patch is missing the sampling part, i.e., dump of the LBR (in sequential
>> order) into the sampling buffer.
>
> Yes, I just hacked enough stuff together to poke at the hardware a bit,
> never said it was anywhere near complete.
>
>> I would also select a better name than PERF_SAMPLE_LBR. LBR is an
>> Intel thing. Maybe PERF_SAMPLE_TAKEN_BRANCH.
>
> Either LAST_BRANCH (suggesting a single entry), or BRANCH_STACK
> (suggesting >1 possible entries) seem more appropriate.
>
> Supporting only a single entry, LAST_BRANCH, seems like an attractive
> enough option, the use of multiple steps back seem rather pointless for
> interpreting the sample.
I would vote for BRANCH_STACK.