This patch series contains fixes for v2.6.39. They are related to the
AMD family 15h PMU but also modify x86 generic code. See the diffstat
below.
If possible I would like to see patch #4 in .39 as well, though it is
more than just a one-liner and makes many changes to the x86 event
scheduling code. Without this patch, scheduling multiple events on AMD
family 15h may fail under certain conditions even though enough
counters are available.
-Robert
The following changes since commit e566b76ed30768140df8f0023904aed5a41244f7:
perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() (2011-04-11 11:07:55 +0200)
Andre Przywara (1):
perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
Robert Richter (3):
perf, x86: Fix AMD family 15h FPU event constraints
perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
perf, x86: Fix event scheduler to solve complex scheduling problems
arch/x86/kernel/cpu/perf_event.c | 63 +++++++++++++++++++++++++++------
arch/x86/kernel/cpu/perf_event_amd.c | 23 ++++++++++--
2 files changed, 70 insertions(+), 16 deletions(-)
Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
an extra pointer chase and data cache hit.
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index eed3673a..224a84f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -31,6 +31,7 @@
#include <asm/nmi.h>
#include <asm/compat.h>
#include <asm/smp.h>
+#include <asm/alternative.h>
#if 0
#undef wrmsrl
@@ -363,12 +364,18 @@ again:
return new_raw_count;
}
-/* using X86_FEATURE_PERFCTR_CORE to later implement ALTERNATIVE() here */
static inline int x86_pmu_addr_offset(int index)
{
- if (boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
- return index << 1;
- return index;
+ int offset;
+
+ /* offset = X86_FEATURE_PERFCTR_CORE ? index << 1 : index */
+ alternative_io(ASM_NOP2,
+ "shll $1, %%eax",
+ X86_FEATURE_PERFCTR_CORE,
+ "=a" (offset),
+ "a" (index));
+
+ return offset;
}
static inline unsigned int x86_pmu_config_addr(int index)
--
1.7.3.4
The current x86 event scheduler fails to resolve scheduling problems
for certain combinations of events and constraints. This happens
especially for events with complex constraints such as those of the
AMD family 15h PMU. The scheduler then fails to find an existing
solution.
Examples are:
        event code      counter         failure         possible solution

1)      0x043           PMC[2:0]        0               1
        0x02E           PMC[3,0]        3               0
        0x003           PMC3            FAIL            3

2)      0x02E           PMC[3,0]        0               3
        0x043           PMC[2:0]        1               0
        0x045           PMC[2:0]        2               1
        0x046           PMC[2:0]        FAIL            2
Scheduling events on counters is a Hamiltonian path problem. To find a
possible solution we must traverse all existing paths. This patch
implements this.
We need to save the states of all paths already walked. If we fail to
schedule an event, we now roll back to the previous state and try to
use another free counter, until all paths have been analysed.
We might consider removing the constraint weight implementation
completely later, but I left this out as it is a much bigger and
riskier change than this fix.
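For illustration only, the backtracking idea can be sketched in user space
roughly as below (a recursive toy version with made-up names; the patch
itself unrolls this iteratively with an explicit state log):

#include <stdio.h>

#define NCOUNTERS 4

/*
 * Try to place event i and all following events; masks[] holds the
 * allowed-counter bitmask of each event, used is the bitmask of
 * counters already taken.
 */
static int try_assign(const unsigned masks[], int assign[], int n,
                      int i, unsigned used)
{
        int c;

        if (i == n)
                return 1;                       /* all events placed */

        for (c = 0; c < NCOUNTERS; c++) {
                if (!(masks[i] & (1u << c)) || (used & (1u << c)))
                        continue;               /* not allowed or taken */
                assign[i] = c;
                if (try_assign(masks, assign, n, i + 1, used | (1u << c)))
                        return 1;
                /* dead end: fall through and try the next free counter */
        }
        return 0;
}

int main(void)
{
        /* example 2 above: PMC[3,0] followed by three PMC[2:0] events */
        unsigned masks[] = { 0x9, 0x7, 0x7, 0x7 };
        int assign[4], i;

        if (try_assign(masks, assign, 4, 0, 0)) {
                for (i = 0; i < 4; i++)
                        printf("event %d -> counter %d\n", i, assign[i]);
        } else {
                printf("not schedulable\n");
        }
        return 0;
}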
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 48 +++++++++++++++++++++++++++++++------
1 files changed, 40 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 224a84f..887a500 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -770,11 +770,19 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
}
+struct sched_log
+{
+ int i;
+ int w;
+ int idx;
+};
+
static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
{
struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int i, j, w, wmax, num = 0;
+ struct sched_log sched_log[X86_PMC_IDX_MAX];
+ int i, idx, w, wmax, num = 0, scheduled = 0;
struct hw_perf_event *hwc;
bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -815,6 +823,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
*/
bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+ memset(&sched_log, 0, sizeof(sched_log));
/*
* weight = number of possible counters
@@ -838,25 +847,48 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
for (w = 1, num = n; num && w <= wmax; w++) {
/* for each event */
for (i = 0; num && i < n; i++) {
+ redo:
c = constraints[i];
hwc = &cpuc->event_list[i]->hw;
if (c->weight != w)
continue;
- for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
- if (!test_bit(j, used_mask))
+ idx = sched_log[scheduled].idx;
+ /* for each bit in idxmsk starting from idx */
+ while (idx < X86_PMC_IDX_MAX) {
+ idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
+ idx);
+ if (idx == X86_PMC_IDX_MAX)
+ break;
+ if (!__test_and_set_bit(idx, used_mask))
break;
+ idx++;
}
- if (j == X86_PMC_IDX_MAX)
- break;
-
- __set_bit(j, used_mask);
+ if (idx >= X86_PMC_IDX_MAX) {
+ /* roll back and try next free counter */
+ if (!scheduled)
+ /* no free counters anymore */
+ break;
+ sched_log[scheduled].idx = 0;
+ scheduled--;
+ num++;
+ clear_bit(sched_log[scheduled].idx, used_mask);
+ i = sched_log[scheduled].i;
+ w = sched_log[scheduled].w;
+ sched_log[scheduled].idx++;
+ goto redo;
+ }
if (assign)
- assign[i] = j;
+ assign[i] = idx;
+
num--;
+ sched_log[scheduled].i = i;
+ sched_log[scheduled].w = w;
+ sched_log[scheduled].idx = idx;
+ scheduled++;
}
}
done:
--
1.7.3.4
Depending on the unit mask settings, some FPU events may be scheduled
only on CPU counter #3. This patch fixes this.
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/perf_event_amd.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 4e16138..758cccc 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -427,7 +427,9 @@ static __initconst const struct x86_pmu amd_pmu = {
*
* Exceptions:
*
+ * 0x000 FP PERF_CTL[3], PERF_CTL[5:3] (*)
* 0x003 FP PERF_CTL[3]
+ * 0x004 FP PERF_CTL[3], PERF_CTL[5:3] (*)
* 0x00B FP PERF_CTL[3]
* 0x00D FP PERF_CTL[3]
* 0x023 DE PERF_CTL[2:0]
@@ -448,6 +450,8 @@ static __initconst const struct x86_pmu amd_pmu = {
* 0x0DF LS PERF_CTL[5:0]
* 0x1D6 EX PERF_CTL[5:0]
* 0x1D8 EX PERF_CTL[5:0]
+ *
+ * (*) depending on the umask all FPU counters may be used
*/
static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0);
@@ -460,18 +464,29 @@ static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);
static struct event_constraint *
amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event)
{
- unsigned int event_code = amd_get_event_code(&event->hw);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned int event_code = amd_get_event_code(hwc);
switch (event_code & AMD_EVENT_TYPE_MASK) {
case AMD_EVENT_FP:
switch (event_code) {
+ case 0x000:
+ if (!(hwc->config & 0x0000F000ULL))
+ break;
+ if (!(hwc->config & 0x00000F00ULL))
+ break;
+ return &amd_f15_PMC3;
+ case 0x004:
+ if (hweight_long(hwc->config &
+ ARCH_PERFMON_EVENTSEL_UMASK) <= 1)
+ break;
+ return &amd_f15_PMC3;
case 0x003:
case 0x00B:
case 0x00D:
return &amd_f15_PMC3;
- default:
- return &amd_f15_PMC53;
}
+ return &amd_f15_PMC53;
case AMD_EVENT_LS:
case AMD_EVENT_DC:
case AMD_EVENT_EX_LS:
--
1.7.3.4
From: Andre Przywara <[email protected]>
With AMD CPU family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to properly count data cache misses.
Now we set this bit for all families and models. In case a PMU does
not implement a unit mask for event 0x041, the bit is ignored.
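As a side note, the raw value 0x0141 used in the hunk below simply combines
the event select with that unit-mask bit. A minimal sketch, assuming the
usual PERF_CTL layout (event select in bits [7:0], unit mask in bits [15:8]);
the macro names are illustrative, not the kernel's:

#include <stdio.h>

#define DC_MISS_EVENT   0x41u   /* Data Cache Misses event select */
#define DC_MISS_UMASK   0x01u   /* bit 0: first miss or streaming store */

int main(void)
{
        /* prints 0x0141, the value used in the patch below */
        printf("0x%04x\n", (DC_MISS_UMASK << 8) | DC_MISS_EVENT);
        return 0;
}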
Signed-off-by: Andre Przywara <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 461f62b..4e16138 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -8,7 +8,7 @@ static __initconst const u64 amd_hw_cache_event_ids
[ C(L1D) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
- [ C(RESULT_MISS) ] = 0x0041, /* Data Cache Misses */
+ [ C(RESULT_MISS) ] = 0x0141, /* Data Cache Misses */
},
[ C(OP_WRITE) ] = {
[ C(RESULT_ACCESS) ] = 0x0142, /* Data Cache Refills :system */
--
1.7.3.4
On Sat, 2011-04-16 at 02:27 +0200, Robert Richter wrote:
> The current x86 event scheduler fails to resolve scheduling problems
> of certain combinations of events and constraints. This happens esp.
> for events with complex constraints such as those of the AMD family
> 15h pmu. The scheduler does not find then an existing solution.
> Examples are:
>
> event code counter failure possible
> solution
>
> 1) 0x043 PMC[2:0] 0 1
> 0x02E PMC[3,0] 3 0
> 0x003 PMC3 FAIL 3
>
> 2) 0x02E PMC[3,0] 0 3
> 0x043 PMC[2:0] 1 0
> 0x045 PMC[2:0] 2 1
> 0x046 PMC[2:0] FAIL 2
>
> Scheduling events on counters is a Hamiltonian path problem. To find a
> possible solution we must traverse all existing paths. This patch
> implements this.
>
> We need to save all states of already walked paths. If we fail to
> schedule an event we now rollback the previous state and try to use
> another free counter until we have analysed all paths.
>
> We might consider to later remove the constraint weight implementation
> completely, but I left this out as this is a much bigger and more
> risky change than this fix.
Argh, crap. That's because AMD is now the first with overlapping
constraints. Be sure to let your hardware guys know that they went from
top to bottom on my appreciation list. AMD used to have no constraints
and now they have the absolute worst.
I'd really prefer not to do this for .39, and I'll have to sit down and
actually read this code. It looks like we went from O(n^2) to O(n!) or
somesuch, also not much of an improvement. I'll have to analyze the
solver to see what it does for 'simple' constraints set to see if it
will indeed be more expensive than the O(n^2) solver we had.
Also, I think this code could do with a tiny bit of comments ;-)
* Peter Zijlstra <[email protected]> wrote:
> On Sat, 2011-04-16 at 02:27 +0200, Robert Richter wrote:
> > The current x86 event scheduler fails to resolve scheduling problems
> > of certain combinations of events and constraints. This happens esp.
> > for events with complex constraints such as those of the AMD family
> > 15h pmu. The scheduler does not find then an existing solution.
> > Examples are:
> >
> > event code counter failure possible
> > solution
> >
> > 1) 0x043 PMC[2:0] 0 1
> > 0x02E PMC[3,0] 3 0
> > 0x003 PMC3 FAIL 3
> >
> > 2) 0x02E PMC[3,0] 0 3
> > 0x043 PMC[2:0] 1 0
> > 0x045 PMC[2:0] 2 1
> > 0x046 PMC[2:0] FAIL 2
> >
> > Scheduling events on counters is a Hamiltonian path problem. To find a
> > possible solution we must traverse all existing paths. This patch
> > implements this.
> >
> > We need to save all states of already walked paths. If we fail to
> > schedule an event we now rollback the previous state and try to use
> > another free counter until we have analysed all paths.
> >
> > We might consider to later remove the constraint weight implementation
> > completely, but I left this out as this is a much bigger and more
> > risky change than this fix.
>
> Argh, crap. That's because AMD is now the first with overlapping
> constraints. Be sure to let your hardware guys know that they went from
> top to bottom om my appreciation list. AMD used to have no constraints
> and now they have the absolute worst.
>
> I'd really prefer not to do this for .39, and I'll have to sit down and
> actually read this code. It looks like we went from O(n^2) to O(n!) or
> somesuch, also not much of an improvement. I'll have to analyze the
> solver to see what it does for 'simple' constraints set to see if it
> will indeed be more expensive than the O(n^2) solver we had.
>
> Also, I think this code could do with a tiny bit of comments ;-)
I'd also prefer if we first had actual testcases in 'perf test' for all these
failures - it took an *awfully* long time to find these regressions (the event
scheduler code has been committed for months), while with proper testcases it
would only take a second to run 'perf test'.
Thanks,
Ingo
On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote:
> I'd also prefer if we first had actual testcases in 'perf test' for all these
> failures - it took an *awfully* long time to find these regressions (the event
> scheduler code has been committed for months), while with proper testcases it
> would only take a second to run 'perf test'.
These cases only exist on AMD F15, I don't think there's many people
with such systems around.
* Peter Zijlstra <[email protected]> wrote:
> On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote:
> > I'd also prefer if we first had actual testcases in 'perf test' for all these
> > failures - it took an *awfully* long time to find these regressions (the event
> > scheduler code has been committed for months), while with proper testcases it
> > would only take a second to run 'perf test'.
>
> These cases only exist on AMD F15, I don't think there's many people
> with such systems around.
Well, if the trend continues we'll have more twisted constraints and more bugs
of this sort, so having a testsuite sure cannot hurt, right?
Thanks,
Ingo
On Sat, 2011-04-16 at 12:14 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Sat, 2011-04-16 at 11:43 +0200, Ingo Molnar wrote:
> > > I'd also prefer if we first had actual testcases in 'perf test' for all these
> > > failures - it took an *awfully* long time to find these regressions (the event
> > > scheduler code has been committed for months), while with proper testcases it
> > > would only take a second to run 'perf test'.
> >
> > These cases only exist on AMD F15, I don't think there's many people
> > with such systems around.
>
> Well, if the trend continues we'll have more twisted constraints and more bugs
> of this sort, so having a testsuite sure cannot hurt, right?
For sure. But the problem with writing test cases at the perf userspace
level is that they're very hardware specific. It would be much easier to
write unit tests for the solver itself.
On Sat, 16 Apr 2011 12:08:40 +0200, Peter Zijlstra said:
> These cases only exist on AMD F15, I don't think there's many people
> with such systems around.
Didn't they say similar things about the first Itaniums? :)
On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <[email protected]> wrote:
> The current x86 event scheduler fails to resolve scheduling problems
> of certain combinations of events and constraints. This happens esp.
> for events with complex constraints such as those of the AMD family
> 15h pmu. The scheduler does not find then an existing solution.
> Examples are:
>
> event code counter failure possible solution
>
> 1) 0x043 PMC[2:0] 0 1
> 0x02E PMC[3,0] 3 0
> 0x003 PMC3 FAIL 3
>
I am not sure I understand this failure case. If I recall correctly,
the scheduler looks at the weight of each event first:
                                weight
1)      0x043           PMC[2:0]  3
        0x02E           PMC[3,0]  2
        0x003           PMC3      1
Then, it schedules in increasing weight order. So it will
schedule weights 1, 2, 3. For weight 1, it will find counter 3,
for weight 2, it will take counter 0, and for weight 3, it will
take counter 1, given that counter 0 is already used.
Or am I reading your example the wrong way?
The fact that counters have overlapping constraints
should not matter. In fact this is what happens with
counters without constraints.
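To make that walkthrough concrete, here is a small stand-alone sketch
(made-up names, not the kernel code) of such a weight-ordered pass applied
to example 1; it assigns counter 3, then counter 0, then counter 1 as
described above:

#include <stdio.h>
#include <strings.h>    /* ffs() */

#define NCOUNTERS 4

int main(void)
{
        /* example 1 by weight: PMC3 (1), PMC[3,0] (2), PMC[2:0] (3) */
        unsigned masks[]  = { 0x8, 0x9, 0x7 };
        int      weight[] = { 1,   2,   3   };
        int assign[3] = { -1, -1, -1 };
        unsigned used = 0;
        int w, i, c;

        /* schedule in increasing weight order, greedy first-free counter */
        for (w = 1; w <= NCOUNTERS; w++) {
                for (i = 0; i < 3; i++) {
                        if (weight[i] != w)
                                continue;
                        c = ffs(masks[i] & ~used);
                        if (c) {
                                assign[i] = c - 1;
                                used |= 1u << (c - 1);
                        }
                }
        }

        for (i = 0; i < 3; i++)
                printf("mask 0x%x -> counter %d\n", masks[i], assign[i]);

        return 0;
}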
> 2) 0x02E PMC[3,0] 0 3
> 0x043 PMC[2:0] 1 0
> 0x045 PMC[2:0] 2 1
> 0x046 PMC[2:0] FAIL 2
>
> Scheduling events on counters is a Hamiltonian path problem. To find a
> possible solution we must traverse all existing paths. This patch
> implements this.
>
> We need to save all states of already walked paths. If we fail to
> schedule an event we now rollback the previous state and try to use
> another free counter until we have analysed all paths.
>
> We might consider to later remove the constraint weight implementation
> completely, but I left this out as this is a much bigger and more
> risky change than this fix.
>
> Cc: Stephane Eranian <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event.c | 48 +++++++++++++++++++++++++++++++------
> 1 files changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 224a84f..887a500 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -770,11 +770,19 @@ static inline int is_x86_event(struct perf_event *event)
> return event->pmu == &pmu;
> }
>
> +struct sched_log
> +{
> + int i;
> + int w;
> + int idx;
> +};
> +
> static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> {
> struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
> unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
> - int i, j, w, wmax, num = 0;
> + struct sched_log sched_log[X86_PMC_IDX_MAX];
> + int i, idx, w, wmax, num = 0, scheduled = 0;
> struct hw_perf_event *hwc;
>
> bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> @@ -815,6 +823,7 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> */
>
> bitmap_zero(used_mask, X86_PMC_IDX_MAX);
> + memset(&sched_log, 0, sizeof(sched_log));
>
> /*
> * weight = number of possible counters
> @@ -838,25 +847,48 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> for (w = 1, num = n; num && w <= wmax; w++) {
> /* for each event */
> for (i = 0; num && i < n; i++) {
> + redo:
> c = constraints[i];
> hwc = &cpuc->event_list[i]->hw;
>
> if (c->weight != w)
> continue;
>
> - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> - if (!test_bit(j, used_mask))
> + idx = sched_log[scheduled].idx;
> + /* for each bit in idxmsk starting from idx */
> + while (idx < X86_PMC_IDX_MAX) {
> + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> + idx);
> + if (idx == X86_PMC_IDX_MAX)
> + break;
> + if (!__test_and_set_bit(idx, used_mask))
> break;
> + idx++;
> }
>
> - if (j == X86_PMC_IDX_MAX)
> - break;
> -
> - __set_bit(j, used_mask);
> + if (idx >= X86_PMC_IDX_MAX) {
> + /* roll back and try next free counter */
> + if (!scheduled)
> + /* no free counters anymore */
> + break;
> + sched_log[scheduled].idx = 0;
> + scheduled--;
> + num++;
> + clear_bit(sched_log[scheduled].idx, used_mask);
> + i = sched_log[scheduled].i;
> + w = sched_log[scheduled].w;
> + sched_log[scheduled].idx++;
> + goto redo;
> + }
>
> if (assign)
> - assign[i] = j;
> + assign[i] = idx;
> +
> num--;
> + sched_log[scheduled].i = i;
> + sched_log[scheduled].w = w;
> + sched_log[scheduled].idx = idx;
> + scheduled++;
> }
> }
> done:
> --
> 1.7.3.4
>
>
>
* Robert Richter <[email protected]> wrote:
> > I'd really prefer not to do this for .39, and I'll have to sit down and
> > actually read this code. It looks like we went from O(n^2) to O(n!) or
> > somesuch, also not much of an improvement. I'll have to analyze the solver
> > to see what it does for 'simple' constraints set to see if it will indeed
> > be more expensive than the O(n^2) solver we had.
>
> It wont be more expensive, if there is a solution. But if there is no one we
> walk all possible ways now which is something like O(n!).
So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320,
with 10 counters a loop of 3628800 ... O(n!) is not fun.
Thanks,
Ingo
On 16.04.11 04:51:17, Peter Zijlstra wrote:
> Argh, crap. That's because AMD is now the first with overlapping
> constraints. Be sure to let your hardware guys know that they went from
> top to bottom om my appreciation list. AMD used to have no constraints
> and now they have the absolute worst.
Yes, the overlapping constraints are the problem.
> I'd really prefer not to do this for .39, and I'll have to sit down and
> actually read this code. It looks like we went from O(n^2) to O(n!) or
> somesuch, also not much of an improvement. I'll have to analyze the
> solver to see what it does for 'simple' constraints set to see if it
> will indeed be more expensive than the O(n^2) solver we had.
It won't be more expensive if there is a solution. But if there is
none, we now walk all possible paths, which is something like O(n!).
Yes, we can shift this general solution after .39. Will try to find a
solution that handles the special case for family 15h as a fix for
.39.
> Also, I think this code could do with a tiny bit of comments ;-)
Will comment on the code inline in the patch.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On 16.04.11 11:52:54, Stephane Eranian wrote:
> On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <[email protected]> wrote:
> >        event code      counter         failure         possible solution
> >
> > 1)     0x043           PMC[2:0]        0               1
> >        0x02E           PMC[3,0]        3               0
> >        0x003           PMC3            FAIL            3
> >
> I am not sure I understand this failure case. If I recall
> the scheduler looks at the weight of each event first:
>
>                                 weight
> 1)     0x043           PMC[2:0]   3
>        0x02E           PMC[3,0]   2
>        0x003           PMC3       1
>
> Then, it schedules in increasing weight order. So it will
> schedule weight 1, 2, 3. For weight 1, it will find counter3,
> for weight 2, it will take counter0, for weight 3, it will
> take counter 1 given 0 is already used.
>
> Or am I reading your example the wrong way?
No, I have to admit I made this one up and picked a wrong example
for the particular problem I was thinking about. The above works as
you described because of the constraints' weights.
I don't have an example with existing constraints, but consider the
following (theoretical) one:
counter: 3210 Failure: Solution:
e1 xx xo ox
e2 xx xo ox
e3 x x o x o x
e4 x x x F x o
The special thing about the above is that two events (e1 and e2) must
be rescheduled to schedule e4. This means that swapping counters of
only one already scheduled event is not sufficient. A counter of a
third event must be freed; this counter is then taken by the second
event.
> The fact that counter have overlapping constraints
> should not matter. In fact this is what happens with
> counters without constraints.
An event set containing constraints that meet the following condition
is problematic:

(c1->weight <= c2->weight && ((c1->cmask & c2->cmask) != c1->cmask))

Basically this means the two constraints do not overlap completely.
You then cannot select the correct counter without knowing which
events with higher weights will be scheduled later.
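A toy version of that check could look like this (the struct is a stand-in,
not the kernel's event_constraint); for the family 15h masks PMC30 (0x09,
weight 2) and PMC20 (0x07, weight 3) it reports the problematic overlap:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the kernel's event_constraint, illustrative only */
struct constraint {
        unsigned long cmask;    /* bitmask of usable counters */
        int weight;             /* number of usable counters  */
};

/* true if the lighter constraint's mask is not a subset of the heavier one's */
static bool pair_overlaps(const struct constraint *c1,
                          const struct constraint *c2)
{
        return c1->weight <= c2->weight &&
               (c1->cmask & c2->cmask) != c1->cmask;
}

int main(void)
{
        struct constraint pmc30 = { 0x09, 2 };  /* like amd_f15_PMC30 */
        struct constraint pmc20 = { 0x07, 3 };  /* like amd_f15_PMC20 */

        printf("%d\n", pair_overlaps(&pmc30, &pmc20));  /* prints 1 */
        return 0;
}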
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On Sun, 2011-04-17 at 10:18 +0200, Ingo Molnar wrote:
> * Robert Richter <[email protected]> wrote:
>
> > > I'd really prefer not to do this for .39, and I'll have to sit down and
> > > actually read this code. It looks like we went from O(n^2) to O(n!) or
> > > somesuch, also not much of an improvement. I'll have to analyze the solver
> > > to see what it does for 'simple' constraints set to see if it will indeed
> > > be more expensive than the O(n^2) solver we had.
> >
> > It wont be more expensive, if there is a solution. But if there is no one we
> > walk all possible ways now which is something like O(n!).
>
> So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320,
> with 10 counters a loop of 3628800 ... O(n!) is not fun.
Right, and we'll hit this case at least once when scheduling an
over-committed system. Intel Sandy Bridge can have 8 counters per core +
3 fixed counters, giving an n=11 situation. You do _NOT_ want to have
one 39916800 cycle loop before we determine the PMU isn't schedulable,
that's simply unacceptable.
There's a fine point between maximum PMU utilization and acceptable
performance here, and an O(n!) algorithm is really not acceptable. If
you can find a polynomial algorithm that improves the AMD-F15 situation
we can talk.
As it stands I'm tempted to have AMD suffer its terrible PMU design
decisions; if you want this fixed, fix the silicon.
On Sun, Apr 17, 2011 at 1:44 AM, Robert Richter <[email protected]> wrote:
> On 16.04.11 11:52:54, Stephane Eranian wrote:
>> On Fri, Apr 15, 2011 at 5:27 PM, Robert Richter <[email protected]> wrote:
>
>> > event code counter failure possible solution
>> >
>> > 1) 0x043 PMC[2:0] 0 1
>> > 0x02E PMC[3,0] 3 0
>> > 0x003 PMC3 FAIL 3
>> >
>> I am not sure I understand this failure case. If I recall
>> the scheduler looks at the weight of each event first:
>>
>> weight
>> 1) 0x043 PMC[2:0] 3
>> 0x02E PMC[3,0] 2
>> 0x003 PMC3 1
>>
>> Then, it schedules in increasing weight order. So it will
>> schedule weight 1, 2, 3. For weight 1, it will find counter3,
>> for weight 2, it will take counter0, for weight 3, it will
>> take counter 1 given 0 is already used.
>>
>> Or am I reading your example the wrong way?
>
> No, I have to admit this one was taken out of my mind and I picked
> a wrong example for the special problem I was thinking about. The
> above works as you described because of the constraint's weights.
>
> I don't have an example with existing constraints, but consider the
> following (theoretical) one:
>
> counter: 3210 Failure: Solution:
>
> e1 xx xo ox
> e2 xx xo ox
> e3 x x o x o x
> e4 x x x F x o
>
> The special with the above is that two events (e1 and e2) must be
> rescheduled to schedule e4. This means that swapping counters of only
> one already scheduled event is not sufficient. A counter of a third
> event must be freed, this counter is then taken by the second event.
Ok, I understand this one better now. But you admit yourself, you made it
up. Is it even possible with Fam15h?
The thing is, I think, this is not a catastrophic problem. What will
happen is that e4 cannot be scheduled with e1, e2, e3. If the events
are not in the same group, e4 will eventually get a chance to be
scheduled due to round-robin on the event list. If the events are all
in one group, then you won't be able to add e4 to the group because of
the group scheduling sanity check on creation.
So either you have to shuffle your event groups or you will incur
multiplexing. It's not like e4 will be created but never be able to
count anything.
>> The fact that counter have overlapping constraints
>> should not matter. In fact this is what happens with
>> counters without constraints.
>
> An event set containing constraints with following conditions is
> problematic:
>
> (c1->weight <= c2->weight && (c1->cmask & c2->cmask != c1->cmask))
>
> Basically this means both constraints do not overlap completly. You
> can't select then the correct counter without knowing the events with
> higher weights that will be scheduled later.
>
> -Robert
>
> --
> Advanced Micro Devices, Inc.
> Operating System Research Center
>
>
On 17.04.11 04:53:32, Peter Zijlstra wrote:
> On Sun, 2011-04-17 at 10:18 +0200, Ingo Molnar wrote:
> > So with 6 counters it would be a loop of 720, with 8 counters a loop of 40320,
> > with 10 counters a loop of 3628800 ... O(n!) is not fun.
>
> Right, and we'll hit this case at least once when scheduling a
> over-committed system. Intel Sandy Bridge can have 8 counters per core +
> 3 fixed counters, giving an n=11 situation. You do _NOT_ want to have
> one 39916800 cycle loop before we determine the PMU isn't schedulable,
> that's simply unacceptable.
Of course it is not that bad, as the algorithm is already optimized
and we only walk through possible paths. Also, the more constraints we
have, the less we have to walk. So let's assume a worst case of 8
unconstrained counters: I reimplemented the algorithm in the attached
perl script and counted 251 loops. Here are the numbers I got
depending on the number of counters:
$ perl counter-scheduling.pl | grep Num
Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
Number of counters: 3, loops: 26, redos: 7, ratio: 3.7
Number of counters: 4, loops: 53, redos: 11, ratio: 4.8
Number of counters: 5, loops: 89, redos: 15, ratio: 5.9
Number of counters: 6, loops: 134, redos: 19, ratio: 7.1
Number of counters: 7, loops: 188, redos: 23, ratio: 8.2
Number of counters: 8, loops: 251, redos: 27, ratio: 9.3
Number of counters: 9, loops: 323, redos: 31, ratio: 10.4
Number of counters: 10, loops: 404, redos: 35, ratio: 11.5
Number of counters: 11, loops: 494, redos: 39, ratio: 12.7
Number of counters: 12, loops: 593, redos: 43, ratio: 13.8
It seems the algorithm is about number-of-counters times slower than
the current one. I think this is worth some further consideration.
There is also some room for improvement in my algorithm.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On 17.04.11 13:23:25, Robert Richter wrote:
> $ perl counter-scheduling.pl | grep Num
> Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
> Number of counters: 3, loops: 26, redos: 7, ratio: 3.7
> Number of counters: 4, loops: 53, redos: 11, ratio: 4.8
> Number of counters: 5, loops: 89, redos: 15, ratio: 5.9
> Number of counters: 6, loops: 134, redos: 19, ratio: 7.1
> Number of counters: 7, loops: 188, redos: 23, ratio: 8.2
> Number of counters: 8, loops: 251, redos: 27, ratio: 9.3
> Number of counters: 9, loops: 323, redos: 31, ratio: 10.4
> Number of counters: 10, loops: 404, redos: 35, ratio: 11.5
> Number of counters: 11, loops: 494, redos: 39, ratio: 12.7
> Number of counters: 12, loops: 593, redos: 43, ratio: 13.8
Wrong, wrong, wrong!
Before you waste your time with this: the script has a bug in storing
the correct state:
@@ -35,7 +35,7 @@ while ($scheduled < $num_events) {
}
$used_mask |= (1 << $idx);
- push @sched_log, $idx;
+ $sched_log[$scheduled] = $idx;
printf "Scheduling event #%d on counter #%d\n", $scheduled, $idx;
$scheduled++;
}
It's not possible to do the calculation for 11 counters in reasonable
time:
$ perl counter-scheduling.pl | grep Num
Number of counters: 2, loops: 10, redos: 4, ratio: 2.5
Number of counters: 3, loops: 48, redos: 15, ratio: 3.2
Number of counters: 4, loops: 260, redos: 64, ratio: 4.1
Number of counters: 5, loops: 1630, redos: 325, ratio: 5.0
Number of counters: 6, loops: 11742, redos: 1956, ratio: 6.0
Number of counters: 7, loops: 95900, redos: 13699, ratio: 7.0
Number of counters: 8, loops: 876808, redos: 109600, ratio: 8.0
Number of counters: 9, loops: 8877690, redos: 986409, ratio: 9.0
Number of counters: 10, loops: 98641010, redos: 9864100, ratio: 10.0
Updated script attached.
Sorry,
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
Robert Richter <[email protected]> writes:
> Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
> an extra pointer chase and data cache hit.
Is that really a performance critical path?
Seems more like unnecessary obfuscation to me.
-Andi
--
[email protected] -- Speaking for myself only
This version 2 implements an algorithm which does not increase the
loop count for systems with no overlapping-counter constraints. With
overlapping-counter constraints the loop count is reduced to the
minimum and also limited by a stack depth of 2 states.
See below.
-Robert
From cd70a579995fa7da885ec25db8ebe04ba4a0c30e Mon Sep 17 00:00:00 2001
From: Robert Richter <[email protected]>
Date: Fri, 15 Apr 2011 11:04:06 +0200
Subject: [PATCH] perf, x86: Fix event scheduler for constraints with overlapping counters
The current x86 event scheduler fails to resolve scheduling problems
for certain combinations of events and constraints. This happens if
the counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. these
constraints of the AMD family 15h PMU:
                        counter mask    weight

  amd_f15_PMC30         0x09            2       <--- overlapping counters
  amd_f15_PMC20         0x07            3
  amd_f15_PMC53         0x38            3
The scheduler then fails to find an existing solution. Here is an
example:
        event code      counter         failure         possible solution

        0x02E           PMC[3,0]        0               3
        0x043           PMC[2:0]        1               0
        0x045           PMC[2:0]        2               1
        0x046           PMC[2:0]        FAIL            2
The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may then fail to schedule the events.
To solve this, we now save the scheduler state for events with
overlapping counter constraints. If we fail to schedule the events,
we roll back to those states and try to use another free counter.

Constraints with overlapping counters are marked with a newly
introduced redo flag. We set the redo flag for such constraints to
give the scheduler a hint about which events to select for counter
rescheduling. The EVENT_CONSTRAINT_REDO() macro can be used for this.
Care must be taken, as the rescheduling algorithm is O(n!), which will
dramatically increase scheduling cycles for an over-committed system.
The number of such EVENT_CONSTRAINT_REDO() macros and their counter
masks must be kept to a minimum. Thus, the current stack is limited to
2 states to bound the number of loops the algorithm takes in the worst
case.
On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.
Cc: Stephane Eranian <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 97 ++++++++++++++++++++++++++++++----
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
2 files changed, 87 insertions(+), 12 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 224a84f..2d4cfae 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -86,6 +86,7 @@ struct event_constraint {
u64 code;
u64 cmask;
int weight;
+ int redo;
};
struct amd_nb {
@@ -144,15 +145,40 @@ struct cpu_hw_events {
struct amd_nb *amd_nb;
};
-#define __EVENT_CONSTRAINT(c, n, m, w) {\
+#define __EVENT_CONSTRAINT(c, n, m, w, r) {\
{ .idxmsk64 = (n) }, \
.code = (c), \
.cmask = (m), \
.weight = (w), \
+ .redo = (r), \
}
#define EVENT_CONSTRAINT(c, n, m) \
- __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n))
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 0)
+
+/*
+ * The redo flag marks event constraints with overlapping counter
+ * masks. This is the case if the counter mask of such an event is not
+ * a subset of any other counter mask of a constraint with an equal or
+ * higher weight, e.g.:
+ *
+ * c_overlaps = EVENT_CONSTRAINT_REDO(0, 0x09, 0);
+ * c_another1 = EVENT_CONSTRAINT(0, 0x07, 0);
+ * c_another2 = EVENT_CONSTRAINT(0, 0x38, 0);
+ *
+ * The event scheduler may not select the correct counter in the first
+ * cycle because it needs to know which subsequent events will be
+ * scheduled. It may fail to schedule the events then. So we set the
+ * redo flag for such constraints to give the scheduler a hint which
+ * events to select for counter rescheduling.
+ *
+ * Care must be taken as the rescheduling algorithm is O(n!) which
+ * will increase scheduling cycles for an over-commited system
+ * dramatically. The number of such EVENT_CONSTRAINT_REDO() macros
+ * and its counter masks must be kept at a minimum.
+ */
+#define EVENT_CONSTRAINT_REDO(c, n, m) \
+ __EVENT_CONSTRAINT(c, n, m, HWEIGHT(n), 1)
/*
* Constraint on the Event code.
@@ -770,11 +796,24 @@ static inline int is_x86_event(struct perf_event *event)
return event->pmu == &pmu;
}
+struct sched_state
+{
+ int i;
+ int w;
+ int idx;
+ int num;
+ unsigned long used[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+};
+
+/* Total max is X86_PMC_IDX_MAX, but we are O(n!) limited */
+#define SCHED_STATES_MAX 2
+
static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
{
struct event_constraint *c, *constraints[X86_PMC_IDX_MAX];
unsigned long used_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
- int i, j, w, wmax, num = 0;
+ struct sched_state sched_state[SCHED_STATES_MAX];
+ int i, idx, w, wmax, num = 0, state = 0;
struct hw_perf_event *hwc;
bitmap_zero(used_mask, X86_PMC_IDX_MAX);
@@ -838,25 +877,61 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
for (w = 1, num = n; num && w <= wmax; w++) {
/* for each event */
for (i = 0; num && i < n; i++) {
+ idx = 0;
+ redo:
c = constraints[i];
hwc = &cpuc->event_list[i]->hw;
if (c->weight != w)
continue;
- for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
- if (!test_bit(j, used_mask))
+ /* for each bit in idxmsk starting from idx */
+ while (idx < X86_PMC_IDX_MAX) {
+ idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
+ idx);
+ if (idx == X86_PMC_IDX_MAX)
break;
+ if (!__test_and_set_bit(idx, used_mask))
+ break;
+ idx++;
}
- if (j == X86_PMC_IDX_MAX)
- break;
-
- __set_bit(j, used_mask);
+ if (idx >= X86_PMC_IDX_MAX) {
+ /*
+ * roll back and try to reschedule
+ * events on other counters
+ */
+ if (!state)
+ /* no events to reschedule */
+ goto done;
+ state--;
+ /* restore previous state: */
+ i = sched_state[state].i;
+ w = sched_state[state].w;
+ idx = sched_state[state].idx;
+ num = sched_state[state].num;
+ bitmap_copy(used_mask, sched_state[state].used, n);
+ /* try next counter: */
+ clear_bit(idx, used_mask);
+ num++;
+ idx++;
+ goto redo;
+ }
if (assign)
- assign[i] = j;
+ assign[i] = idx;
+
num--;
+
+ if (c->redo && state < SCHED_STATES_MAX) {
+ /* store scheduler state: */
+ sched_state[state].i = i;
+ sched_state[state].w = w;
+ sched_state[state].idx = idx;
+ sched_state[state].num = num;
+ bitmap_copy(sched_state[state].used, used_mask, n);
+ state++;
+ }
}
}
done:
@@ -1535,7 +1610,7 @@ static int __init init_hw_perf_events(void)
unconstrained = (struct event_constraint)
__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
- 0, x86_pmu.num_counters);
+ 0, x86_pmu.num_counters, 0);
if (x86_pmu.event_constraints) {
for_each_event_constraint(c, x86_pmu.event_constraints) {
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 758cccc..21821eb 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -457,7 +457,7 @@ static __initconst const struct x86_pmu amd_pmu = {
static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0);
static struct event_constraint amd_f15_PMC20 = EVENT_CONSTRAINT(0, 0x07, 0);
static struct event_constraint amd_f15_PMC3 = EVENT_CONSTRAINT(0, 0x08, 0);
-static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT(0, 0x09, 0);
+static struct event_constraint amd_f15_PMC30 = EVENT_CONSTRAINT_REDO(0, 0x09, 0);
static struct event_constraint amd_f15_PMC50 = EVENT_CONSTRAINT(0, 0x3F, 0);
static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);
--
1.7.3.4
--
Advanced Micro Devices, Inc.
Operating System Research Center
On 18.04.11 16:00:57, Andi Kleen wrote:
> Robert Richter <[email protected]> writes:
>
> > Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
> > an extra pointer chase and data cache hit.
>
> Is that really a performance critical path?
>
> Seems more like unnecessary obfuscation to me.
The hottest path is in perf_pmu_disable(), which happens at least on
every task switch when calling the event scheduler.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
* Robert Richter <[email protected]> wrote:
> + if (!state)
> + /* no events to reschedule */
> + goto done;
Hm, that's 5 levels of indentation. Would it be possible to factor out the gist
of the logic into a helper and such? My eyes would be much happier!
Thanks,
Ingo
Commit-ID: 83112e688f5f05dea1e63787db9a6c16b2887a1d
Gitweb: http://git.kernel.org/tip/83112e688f5f05dea1e63787db9a6c16b2887a1d
Author: Andre Przywara <[email protected]>
AuthorDate: Sat, 16 Apr 2011 02:27:53 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 10:07:54 +0200
perf, x86: Fix pre-defined cache-misses event for AMD family 15h cpus
With AMD CPU family 15h a unit mask was introduced for the Data Cache
Miss event (0x041/L1-dcache-load-misses). We need to enable bit 0
(first data cache miss or streaming store to a 64 B cache line) of
this mask to properly count data cache misses.
Now we set this bit for all families and models. In case a PMU does
not implement a unit mask for event 0x041, the bit is ignored.
Signed-off-by: Andre Przywara <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_amd.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 461f62b..4e16138 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -8,7 +8,7 @@ static __initconst const u64 amd_hw_cache_event_ids
[ C(L1D) ] = {
[ C(OP_READ) ] = {
[ C(RESULT_ACCESS) ] = 0x0040, /* Data Cache Accesses */
- [ C(RESULT_MISS) ] = 0x0041, /* Data Cache Misses */
+ [ C(RESULT_MISS) ] = 0x0141, /* Data Cache Misses */
},
[ C(OP_WRITE) ] = {
[ C(RESULT_ACCESS) ] = 0x0142, /* Data Cache Refills :system */
Commit-ID: 855357a21744e488cbee23a47d2b124035160a87
Gitweb: http://git.kernel.org/tip/855357a21744e488cbee23a47d2b124035160a87
Author: Robert Richter <[email protected]>
AuthorDate: Sat, 16 Apr 2011 02:27:54 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 10:07:55 +0200
perf, x86: Fix AMD family 15h FPU event constraints
Depending on the unit mask settings, some FPU events may be scheduled
only on CPU counter #3. This patch fixes this.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Stephane Eranian <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event_amd.c | 20 +++++++++++++++++---
1 files changed, 17 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c
index 4e16138..cf4e369 100644
--- a/arch/x86/kernel/cpu/perf_event_amd.c
+++ b/arch/x86/kernel/cpu/perf_event_amd.c
@@ -427,7 +427,9 @@ static __initconst const struct x86_pmu amd_pmu = {
*
* Exceptions:
*
+ * 0x000 FP PERF_CTL[3], PERF_CTL[5:3] (*)
* 0x003 FP PERF_CTL[3]
+ * 0x004 FP PERF_CTL[3], PERF_CTL[5:3] (*)
* 0x00B FP PERF_CTL[3]
* 0x00D FP PERF_CTL[3]
* 0x023 DE PERF_CTL[2:0]
@@ -448,6 +450,8 @@ static __initconst const struct x86_pmu amd_pmu = {
* 0x0DF LS PERF_CTL[5:0]
* 0x1D6 EX PERF_CTL[5:0]
* 0x1D8 EX PERF_CTL[5:0]
+ *
+ * (*) depending on the umask all FPU counters may be used
*/
static struct event_constraint amd_f15_PMC0 = EVENT_CONSTRAINT(0, 0x01, 0);
@@ -460,18 +464,28 @@ static struct event_constraint amd_f15_PMC53 = EVENT_CONSTRAINT(0, 0x38, 0);
static struct event_constraint *
amd_get_event_constraints_f15h(struct cpu_hw_events *cpuc, struct perf_event *event)
{
- unsigned int event_code = amd_get_event_code(&event->hw);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned int event_code = amd_get_event_code(hwc);
switch (event_code & AMD_EVENT_TYPE_MASK) {
case AMD_EVENT_FP:
switch (event_code) {
+ case 0x000:
+ if (!(hwc->config & 0x0000F000ULL))
+ break;
+ if (!(hwc->config & 0x00000F00ULL))
+ break;
+ return &amd_f15_PMC3;
+ case 0x004:
+ if (hweight_long(hwc->config & ARCH_PERFMON_EVENTSEL_UMASK) <= 1)
+ break;
+ return &amd_f15_PMC3;
case 0x003:
case 0x00B:
case 0x00D:
return &amd_f15_PMC3;
- default:
- return &amd_f15_PMC53;
}
+ return &amd_f15_PMC53;
case AMD_EVENT_LS:
case AMD_EVENT_DC:
case AMD_EVENT_EX_LS:
Commit-ID: c8e5910edf8bbe2e5c6c35a4ef2a578cc7893b25
Gitweb: http://git.kernel.org/tip/c8e5910edf8bbe2e5c6c35a4ef2a578cc7893b25
Author: Robert Richter <[email protected]>
AuthorDate: Sat, 16 Apr 2011 02:27:55 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 19 Apr 2011 10:08:12 +0200
perf, x86: Use ALTERNATIVE() to check for X86_FEATURE_PERFCTR_CORE
Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
an extra pointer chase and data cache hit.
Signed-off-by: Robert Richter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index eed3673a..224a84f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -31,6 +31,7 @@
#include <asm/nmi.h>
#include <asm/compat.h>
#include <asm/smp.h>
+#include <asm/alternative.h>
#if 0
#undef wrmsrl
@@ -363,12 +364,18 @@ again:
return new_raw_count;
}
-/* using X86_FEATURE_PERFCTR_CORE to later implement ALTERNATIVE() here */
static inline int x86_pmu_addr_offset(int index)
{
- if (boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
- return index << 1;
- return index;
+ int offset;
+
+ /* offset = X86_FEATURE_PERFCTR_CORE ? index << 1 : index */
+ alternative_io(ASM_NOP2,
+ "shll $1, %%eax",
+ X86_FEATURE_PERFCTR_CORE,
+ "=a" (offset),
+ "a" (index));
+
+ return offset;
}
static inline unsigned int x86_pmu_config_addr(int index)
On 19.04.11 07:29:33, Ingo Molnar wrote:
>
> * Robert Richter <[email protected]> wrote:
>
> > + if (!state)
> > + /* no events to reschedule */
> > + goto done;
>
> Hm, that's 5 levels of indentation. Would it be possible to factor out the gist
> of the logic into a helper and such? My eyes would be much happier!
Hmm, I was thinking about this too, but it is not easy since we also
control the loops with gotos and updates of the loop state.
Refactoring out the inner block would mean updating state variables
via pointers and introducing return codes which carry control flow
information. I don't think this will become easier.
We could probably merge both for-loops into one while-loop, but at the
cost of code readability.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
On Tue, Apr 19, 2011 at 12:39:27PM +0200, Robert Richter wrote:
> On 18.04.11 16:00:57, Andi Kleen wrote:
> > Robert Richter <[email protected]> writes:
> >
> > > Using ALTERNATIVE() when checking for X86_FEATURE_PERFCTR_CORE avoids
> > > an extra pointer chase and data cache hit.
> >
> > Is that really a performance critical path?
> >
> > Seems more like unnecessary obfuscation to me.
>
> We hotest path is in perf_pmu_disable(), which happens at least with
> every task switch when calling the event scheduler.
Yes, but that's already a slow path, isn't it? It had better be,
because the MSR accesses alone are incredibly expensive. I guess your
test and jump isn't even on the radar after that ...
-Andi
On 19.04.11 15:55:18, Robert Richter wrote:
> On 19.04.11 07:29:33, Ingo Molnar wrote:
> >
> > * Robert Richter <[email protected]> wrote:
> >
> > > + if (!state)
> > > + /* no events to reschedule */
> > > + goto done;
> >
> > Hm, that's 5 levels of indentation. Would it be possible to factor out the gist
> > of the logic into a helper and such? My eyes would be much happier!
>
> Hmm, I was thinking about this too, but it is not easy since we
> control also the loops with goto's and updates of the loop
> states. Refactoring out the inner block would mean to update state
> varables via pointers and to introduce return codes which contains
> control flow information. Don't think this will become easier.
>
> We could probably merge both for-loops in one while-loop but also on
> costs of code readability.
Ingo, do you still want me to rework it?
Any other comments on this patch?
Thanks,
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
So I'm mostly ok with this, just a few nits..
On Tue, 2011-04-19 at 12:26 +0200, Robert Richter wrote:
> @@ -838,25 +877,61 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> for (w = 1, num = n; num && w <= wmax; w++) {
> /* for each event */
> for (i = 0; num && i < n; i++) {
> + idx = 0;
> + redo:
labels go on column 0.
> c = constraints[i];
> hwc = &cpuc->event_list[i]->hw;
>
> if (c->weight != w)
> continue;
>
> - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> - if (!test_bit(j, used_mask))
> + /* for each bit in idxmsk starting from idx */
> + while (idx < X86_PMC_IDX_MAX) {
> + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> + idx);
I'd be mighty tempted to ignore that 80 column rule here ;-)
> + if (idx == X86_PMC_IDX_MAX)
> break;
> + if (!__test_and_set_bit(idx, used_mask))
> + break;
> + idx++;
> }
>
> - if (j == X86_PMC_IDX_MAX)
> - break;
> -
> - __set_bit(j, used_mask);
> + if (idx >= X86_PMC_IDX_MAX) {
> + /*
> + * roll back and try to reschedule
> + * events on other counters
> + */
> + if (!state)
> + /* no events to reschedule */
> + goto done;
Multi-line indents get { } even if not strictly needed; alternatively
we can place the comment on the same line as the goto.
> + state--;
> + /* restore previous state: */
> + i = sched_state[state].i;
> + w = sched_state[state].w;
> + idx = sched_state[state].idx;
> + num = sched_state[state].num;
> + bitmap_copy(used_mask, sched_state[state].used, n);
> + /* try next counter: */
> + clear_bit(idx, used_mask);
> + num++;
Suppose we put the num--; bit below the if (c->redo) block, then we can
remove this num++;, right?
> + idx++;
> + goto redo;
> + }
>
> if (assign)
> - assign[i] = j;
> + assign[i] = idx;
> +
> num--;
> +
> + if (c->redo && state < SCHED_STATES_MAX) {
> + /* store scheduler state: */
> + sched_state[state].i = i;
> + sched_state[state].w = w;
> + sched_state[state].idx = idx;
> + sched_state[state].num = num;
> + bitmap_copy(sched_state[state].used, used_mask, n);
> + state++;
> + }
> }
> }
* Peter Zijlstra <[email protected]> wrote:
> > if (c->weight != w)
> > continue;
> >
> > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> > - if (!test_bit(j, used_mask))
> > + /* for each bit in idxmsk starting from idx */
> > + while (idx < X86_PMC_IDX_MAX) {
> > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> > + idx);
>
> I'd be mighty tempted to ignore that 80 column rule here ;-)
Please put the body of the loop into a helper function, the function is large
and there are countless col80 uglinesses in it!
Thanks,
Ingo
On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > > if (c->weight != w)
> > > continue;
> > >
> > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> > > - if (!test_bit(j, used_mask))
> > > + /* for each bit in idxmsk starting from idx */
> > > + while (idx < X86_PMC_IDX_MAX) {
> > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> > > + idx);
> >
> > I'd be mighty tempted to ignore that 80 column rule here ;-)
>
> Please put the body of the loop into a helper function, the function is large
> and there are countless col80 uglinesses in it!
I just tried that, it's really ugly due to the amount of state you
need to pass around.
On 18.05.11 17:36:53, Peter Zijlstra wrote:
> On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > if (c->weight != w)
> > > > continue;
> > > >
> > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> > > > - if (!test_bit(j, used_mask))
> > > > + /* for each bit in idxmsk starting from idx */
> > > > + while (idx < X86_PMC_IDX_MAX) {
> > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> > > > + idx);
> > >
> > > I'd be mighty tempted to ignore that 80 column rule here ;-)
> >
> > Please put the body of the loop into a helper function, the function is large
> > and there are countless col80 uglinesses in it!
>
> I just tried that, its real ugly due to the amount of state you need to
> pass around.
I will try to improve the patch and send a new version.
Thanks for the review.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
* Peter Zijlstra <[email protected]> wrote:
> On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <[email protected]> wrote:
> >
> > > > if (c->weight != w)
> > > > continue;
> > > >
> > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> > > > - if (!test_bit(j, used_mask))
> > > > + /* for each bit in idxmsk starting from idx */
> > > > + while (idx < X86_PMC_IDX_MAX) {
> > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> > > > + idx);
> > >
> > > I'd be mighty tempted to ignore that 80 column rule here ;-)
> >
> > Please put the body of the loop into a helper function, the function is large
> > and there are countless col80 uglinesses in it!
>
> I just tried that, its real ugly due to the amount of state you need to
> pass around.
Does it help if you put that state into a helper structure?
Thanks,
Ingo
On 19.05.11 14:06:50, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2011-05-18 at 23:20 +0200, Ingo Molnar wrote:
> > > * Peter Zijlstra <[email protected]> wrote:
> > >
> > > > > if (c->weight != w)
> > > > > continue;
> > > > >
> > > > > - for_each_set_bit(j, c->idxmsk, X86_PMC_IDX_MAX) {
> > > > > - if (!test_bit(j, used_mask))
> > > > > + /* for each bit in idxmsk starting from idx */
> > > > > + while (idx < X86_PMC_IDX_MAX) {
> > > > > + idx = find_next_bit(c->idxmsk, X86_PMC_IDX_MAX,
> > > > > + idx);
> > > >
> > > > I'd be mighty tempted to ignore that 80 column rule here ;-)
> > >
> > > Please put the body of the loop into a helper function, the function is large
> > > and there are countless col80 uglinesses in it!
> >
> > I just tried that, its real ugly due to the amount of state you need to
> > pass around.
>
> Does it help if you put that state into a helper structure?
Yes, this is what I have in mind too. We could iterate on such a state
structure instead of a couple of individual variables. Storing and
restoring the state would then just be copying the structure.
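A minimal sketch of what such a helper structure could look like (names are
illustrative, and the state is assumed to boil down to the loop variables
plus the used-counter bitmap):

#include <stdbool.h>

#define SCHED_STATES_MAX 2      /* as in the patch: stack of two states */

struct sched_state {
        int i, w, idx, num;     /* loop variables of the scheduler  */
        unsigned long used;     /* bitmap of counters already taken */
};

struct sched_stack {
        struct sched_state entries[SCHED_STATES_MAX];
        int nr;
};

/* save: a single structure copy */
static bool push_state(struct sched_stack *st, const struct sched_state *s)
{
        if (st->nr >= SCHED_STATES_MAX)
                return false;
        st->entries[st->nr++] = *s;
        return true;
}

/* restore: another structure copy, or nothing left to roll back to */
static bool pop_state(struct sched_stack *st, struct sched_state *s)
{
        if (!st->nr)
                return false;
        *s = st->entries[--st->nr];
        return true;
}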
-Robert
>
> Thanks,
>
> Ingo
>
--
Advanced Micro Devices, Inc.
Operating System Research Center