2010-01-21 16:44:29

by Stephane Eranian

Subject: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

Note that unlike previous versions, this patch is incremental on top of
the v5 patch.

This patch improves event scheduling by maximizing the use
of PMU registers regardless of the order in which events are
created in a group.

The algorithm takes into account the list of counter constraints
for each event. It assigns events to counters from the most
constrained (i.e., can run on only one counter) to the least
constrained (i.e., can run on any counter).

Intel fixed-counter events and the special BTS event are also
handled via this algorithm, which is designed to be fairly generic.

The patch also updates the validation of an event to use the
scheduling algorithm, so that groups which cannot be scheduled
fail early in perf_event_open().
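
To illustrate the idea, here is a simplified, standalone sketch of the
weight-ordered assignment (illustration only, not the kernel code;
sched_events(), NUM_COUNTERS and the constraint encoding below are made
up for the example):

    /*
     * Sketch of weight-ordered counter assignment: each event carries a
     * bitmask of the counters it may use; events are placed from the
     * smallest mask (most constrained) to the largest.
     */
    #include <stdio.h>

    #define NUM_COUNTERS 4

    static int weight(unsigned int mask)
    {
        return __builtin_popcount(mask);    /* number of usable counters */
    }

    /* returns the number of events that could not be placed */
    static int sched_events(const unsigned int *constraint, int n, int *assign)
    {
        unsigned int used = 0;
        int w, i, j, left = n;

        for (w = 1; left && w <= NUM_COUNTERS; w++) {
            for (i = 0; left && i < n; i++) {
                if (weight(constraint[i]) != w)
                    continue;
                /* pick the first free counter allowed by the constraint */
                for (j = 0; j < NUM_COUNTERS; j++) {
                    if ((constraint[i] & (1u << j)) && !(used & (1u << j))) {
                        used |= 1u << j;
                        if (assign)        /* assign == NULL: simulation only */
                            assign[i] = j;
                        left--;
                        break;
                    }
                }
            }
        }
        return left;
    }

    int main(void)
    {
        /* event 0 is tied to counter 0, the other two can go anywhere */
        unsigned int constraint[3] = { 0x1, 0xf, 0xf };
        int assign[3], i;

        if (!sched_events(constraint, 3, assign))
            for (i = 0; i < 3; i++)
                printf("event %d -> counter %d\n", i, assign[i]);
        return 0;
    }

Running the same routine without an assign array exercises the
constraints without touching any state, which is what lets the
validation path reject unschedulable groups early.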

The 2nd version of this patch follows the model used by PPC:
the scheduling algorithm and the actual assignment are run
separately. Actual assignment takes place in hw_perf_enable(),
whereas scheduling is implemented in hw_perf_group_sched_in()
and x86_pmu_enable().

The 3rd version does:
- correct handling of pmu->enable() error in hw_perf_group_sched_in()
- simplified event scheduling
- constraint storage in x86_schedule_events() (dynamic constraints)
- new put_event_constraints() callback to release a dynamic constraint

The 4th version does:
- remove leftover debug code in x86_perf_event_set_period()

The 5th version does:
- fix missing bitmask initialization in intel_get_event_constraints()
- use bitmap_copy() instead of memcpy() to copy bitmasks
- call pmu->disable() in x86_event_sched_out()

The 6th version does:
- implement correct fastpath scheduling, i.e., reuse previous assignment
- skip reprogramming counters in hw_perf_enable() with a generation number
- define is_x86_event() to filter out non-x86 PMU events

Signed-off-by: Stephane Eranian <[email protected]>

--
arch/x86/kernel/cpu/perf_event.c | 160 +++++++++++++++++++++++----------------
include/linux/perf_event.h | 2
2 files changed, 100 insertions(+), 62 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 03a888d..a961b1f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -87,6 +87,7 @@ struct cpu_hw_events {
int n_events;
int n_added;
int assign[X86_PMC_IDX_MAX]; /* event to counter assignment */
+ u64 tags[X86_PMC_IDX_MAX];
struct perf_event *event_list[X86_PMC_IDX_MAX]; /* in enabled order */
};

@@ -140,6 +141,7 @@ static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
static int x86_perf_event_set_period(struct perf_event *event,
struct hw_perf_event *hwc, int idx);

+static const struct pmu pmu;
/*
* Not sure about some of these
*/
@@ -159,6 +161,12 @@ static u64 p6_pmu_event_map(int hw_event)
return p6_perfmon_event_map[hw_event];
}

+static inline int is_x86_event(struct perf_event *event)
+{
+ return event->pmu == &pmu;
+}
+
+
/*
* Event setting that is specified not to count anything.
* We use this to effectively disable a counter.
@@ -1010,6 +1018,8 @@ static int __hw_perf_event_init(struct perf_event *event)
hwc->config = ARCH_PERFMON_EVENTSEL_INT;

hwc->idx = -1;
+ hwc->last_cpu = -1;
+ hwc->last_tag = ~0ULL;

/*
* Count user and OS events unless requested not to.
@@ -1235,6 +1245,44 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
}

/*
+ * fastpath, try to reuse previous register
+ */
+ for(i=0, num = n; i < n; i++, num--) {
+ hwc = &cpuc->event_list[i]->hw;
+ c = (unsigned long *)constraints[i];
+
+ /* never assigned */
+ if (hwc->idx == -1)
+ break;
+
+ /* constraint still honored */
+ if (!test_bit(hwc->idx, c))
+ break;
+
+ /* not already used */
+ if (test_bit(hwc->idx, used_mask))
+ break;
+
+ pr_debug("CPU%d fast config=0x%llx idx=%d assign=%c\n",
+ smp_processor_id(),
+ hwc->config,
+ hwc->idx,
+ assign ? 'y' : 'n');
+
+ set_bit(hwc->idx, used_mask);
+ if (assign)
+ assign[i] = hwc->idx;
+ }
+ if (!num)
+ goto done;
+
+ /*
+ * begin slow path
+ */
+
+ bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+
+ /*
* weight = number of possible counters
*
* 1 = most constrained, only works on one counter
@@ -1253,11 +1301,10 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (x86_pmu.num_events_fixed)
wmax++;

- num = n;
- for(w=1; num && w <= wmax; w++) {
+ for(w=1, num = n; num && w <= wmax; w++) {

/* for each event */
- for(i=0; i < n; i++) {
+ for(i=0; num && i < n; i++) {
c = (unsigned long *)constraints[i];
hwc = &cpuc->event_list[i]->hw;

@@ -1265,28 +1312,6 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (weight != w)
continue;

- /*
- * try to reuse previous assignment
- *
- * This is possible despite the fact that
- * events or events order may have changed.
- *
- * What matters is the level of constraints
- * of an event and this is constant for now.
- *
- * This is possible also because we always
- * scan from most to least constrained. Thus,
- * if a counter can be reused, it means no,
- * more constrained events, needed it. And
- * next events will either compete for it
- * (which cannot be solved anyway) or they
- * have fewer constraints, and they can use
- * another counter.
- */
- j = hwc->idx;
- if (j != -1 && !test_bit(j, used_mask))
- goto skip;
-
for_each_bit(j, c, X86_PMC_IDX_MAX) {
if (!test_bit(j, used_mask))
break;
@@ -1294,20 +1319,21 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

if (j == X86_PMC_IDX_MAX)
break;
-skip:
- set_bit(j, used_mask);

- pr_debug("CPU%d config=0x%llx idx=%d assign=%c\n",
+ pr_debug("CPU%d slow config=0x%llx idx=%d assign=%c\n",
smp_processor_id(),
hwc->config,
j,
assign ? 'y' : 'n');

+ set_bit(j, used_mask);
+
if (assign)
assign[i] = j;
num--;
}
}
+done:
/*
* scheduling failed or is just a simulation,
* free resources if necessary
@@ -1335,7 +1361,7 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
/* current number of events already accepted */
n = cpuc->n_events;

- if (!is_software_event(leader)) {
+ if (is_x86_event(leader)) {
if (n >= max_count)
return -ENOSPC;
cpuc->event_list[n] = leader;
@@ -1345,8 +1371,8 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
return n;

list_for_each_entry(event, &leader->sibling_list, group_entry) {
- if (is_software_event(event) ||
- event->state == PERF_EVENT_STATE_OFF)
+ if (!is_x86_event(event) ||
+ event->state <= PERF_EVENT_STATE_OFF)
continue;

if (n >= max_count)
@@ -1358,11 +1384,15 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,
return n;
}

-
static inline void x86_assign_hw_event(struct perf_event *event,
- struct hw_perf_event *hwc, int idx)
+ struct cpu_hw_events *cpuc,
+ int idx)
{
+ struct hw_perf_event *hwc = &event->hw;
+
hwc->idx = idx;
+ hwc->last_cpu = smp_processor_id();
+ hwc->last_tag = ++cpuc->tags[idx];

if (hwc->idx == X86_PMC_IDX_FIXED_BTS) {
hwc->config_base = 0;
@@ -1383,6 +1413,11 @@ static inline void x86_assign_hw_event(struct perf_event *event,

void hw_perf_enable(void)
{
+#define match_prev_assignment(h, c, i) \
+ ((h)->idx == (c)->assign[i] \
+ && (h)->last_cpu == smp_processor_id() \
+ && (h)->last_tag == (c)->tags[i])
+
struct cpu_hw_events *cpuc = &__get_cpu_var(cpu_hw_events);
struct perf_event *event;
struct hw_perf_event *hwc;
@@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
* apply assignment obtained either from
* hw_perf_group_sched_in() or x86_pmu_enable()
*
- * step1: save events moving to new counters
- * step2: reprogram moved events into new counters
+ * We either re-enable or re-program and re-enable.
+ * All events are disabled by the time we come here.
+ * That means their state has been saved already.
*/
for(i=0; i < cpuc->n_events; i++) {

event = cpuc->event_list[i];
hwc = &event->hw;

- if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
- continue;
-
- x86_pmu.disable(hwc, hwc->idx);
-
- clear_bit(hwc->idx, cpuc->active_mask);
- barrier();
- cpuc->events[hwc->idx] = NULL;
-
- x86_perf_event_update(event, hwc, hwc->idx);
-
- hwc->idx = -1;
- }
-
- for(i=0; i < cpuc->n_events; i++) {
-
- event = cpuc->event_list[i];
- hwc = &event->hw;
-
- if (hwc->idx == -1) {
- x86_assign_hw_event(event, hwc, cpuc->assign[i]);
+ /*
+ * we can avoid reprogramming counter if:
+ * - assigned same counter as last time
+ * - running on same CPU as last time
+ * - no other event has used the counter since
+ */
+ if (!match_prev_assignment(hwc, cpuc, i)) {
+ x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
x86_perf_event_set_period(event, hwc, hwc->idx);
}
/*
* need to mark as active because x86_pmu_disable()
- * clear active_mask and eventsp[] yet it preserves
+ * clear active_mask and events[] yet it preserves
* idx
*/
set_bit(hwc->idx, cpuc->active_mask);
@@ -2180,6 +2203,8 @@ static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event,
u64 *idxmsk)
{
+ /* no constraints, means supports all generic counters */
+ bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
}

static int x86_event_sched_in(struct perf_event *event,
@@ -2191,10 +2216,10 @@ static int x86_event_sched_in(struct perf_event *event,
event->oncpu = cpu;
event->tstamp_running += event->ctx->time - event->tstamp_stopped;

- if (is_software_event(event))
+ if (!is_x86_event(event))
ret = event->pmu->enable(event);

- if (!ret && !is_software_event(event))
+ if (!ret && is_x86_event(event))
cpuctx->active_oncpu++;

if (!ret && event->attr.exclusive)
@@ -2209,12 +2234,12 @@ static void x86_event_sched_out(struct perf_event *event,
event->state = PERF_EVENT_STATE_INACTIVE;
event->oncpu = -1;

- if (is_software_event(event))
+ if (!is_x86_event(event))
event->pmu->disable(event);

event->tstamp_running -= event->ctx->time - event->tstamp_stopped;

- if (!is_software_event(event))
+ if (is_x86_event(event))
cpuctx->active_oncpu--;

if (event->attr.exclusive || !cpuctx->active_oncpu)
@@ -2254,7 +2279,7 @@ int hw_perf_group_sched_in(struct perf_event *leader,

n1 = 1;
list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state != PERF_EVENT_STATE_OFF) {
+ if (sub->state > PERF_EVENT_STATE_OFF) {
ret = x86_event_sched_in(sub, cpuctx, cpu);
if (ret)
goto undo;
@@ -2609,12 +2634,23 @@ static int validate_group(struct perf_event *event)

const struct pmu *hw_perf_event_init(struct perf_event *event)
{
+ const struct pmu *tmp;
int err;

err = __hw_perf_event_init(event);
if (!err) {
+ /*
+ * we temporarily connect event to its pmu
+ * such that validate_group() can classify
+ * it as an x86 event using is_x86_event()
+ */
+ tmp = event->pmu;
+ event->pmu = &pmu;
+
if (event->group_leader != event)
err = validate_group(event);
+
+ event->pmu = tmp;
}
if (err) {
if (event->destroy)
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9a1d276..d8a1d34 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -478,9 +478,11 @@ struct hw_perf_event {
union {
struct { /* hardware */
u64 config;
+ u64 last_tag;
unsigned long config_base;
unsigned long event_base;
int idx;
+ int last_cpu;
};
struct { /* software */
s64 remaining;


2010-01-22 20:28:17

by Peter Zijlstra

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
> @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
> * apply assignment obtained either from
> * hw_perf_group_sched_in() or x86_pmu_enable()
> *
> - * step1: save events moving to new counters
> - * step2: reprogram moved events into new counters
> + * We either re-enable or re-program and re-enable.
> + * All events are disabled by the time we come here.
> + * That means their state has been saved already.
> */

I'm not seeing how it is true.

Suppose a core2 with counter0 active counting a non-restricted event,
say cpu_cycles. Then we do:

perf_disable()
  hw_perf_disable()
    intel_pmu_disable_all
      wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
  x86_pmu_enable()
    collect_events()
    x86_schedule_events()
    n_added = 1

    /* also slightly confused about this */
    if (hwc->idx != -1)
      x86_perf_event_set_period()

perf_enable()
  hw_perf_enable()

    /* and here we'll assign the new event to counter0
     * except we never disabled it... */

    intel_pmu_enable_all()
      wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, intel_ctrl)

Or am I missing something?

> for(i=0; i < cpuc->n_events; i++) {
>
> event = cpuc->event_list[i];
> hwc = &event->hw;
>
> - if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
> - continue;
> -
> - x86_pmu.disable(hwc, hwc->idx);
> -
> - clear_bit(hwc->idx, cpuc->active_mask);
> - barrier();
> - cpuc->events[hwc->idx] = NULL;
> -
> - x86_perf_event_update(event, hwc, hwc->idx);
> -
> - hwc->idx = -1;
> - }
> -
> - for(i=0; i < cpuc->n_events; i++) {
> -
> - event = cpuc->event_list[i];
> - hwc = &event->hw;
> -
> - if (hwc->idx == -1) {
> - x86_assign_hw_event(event, hwc, cpuc->assign[i]);
> + /*
> + * we can avoid reprogramming counter if:
> + * - assigned same counter as last time
> + * - running on same CPU as last time
> + * - no other event has used the counter since
> + */
> + if (!match_prev_assignment(hwc, cpuc, i)) {
> + x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
> x86_perf_event_set_period(event, hwc, hwc->idx);
> }
> /*
> * need to mark as active because x86_pmu_disable()
> - * clear active_mask and eventsp[] yet it preserves
> + * clear active_mask and events[] yet it preserves
> * idx
> */
> set_bit(hwc->idx, cpuc->active_mask);

2010-01-25 15:01:15

by Peter Zijlstra

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Fri, 2010-01-22 at 21:27 +0100, Peter Zijlstra wrote:
> On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
> > @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
> > * apply assignment obtained either from
> > * hw_perf_group_sched_in() or x86_pmu_enable()
> > *
> > - * step1: save events moving to new counters
> > - * step2: reprogram moved events into new counters
> > + * We either re-enable or re-program and re-enable.
> > + * All events are disabled by the time we come here.
> > + * That means their state has been saved already.
> > */
>
> I'm not seeing how it is true.
>
> Suppose a core2 with counter0 active counting a non-restricted event,
> say cpu_cycles. Then we do:
>
> perf_disable()
> hw_perf_disable()
> intel_pmu_disable_all
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>
> ->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
> x86_pmu_enable()
> collect_events()
> x86_schedule_events()
> n_added = 1
>
> /* also slightly confused about this */
> if (hwc->idx != -1)
> x86_perf_event_set_period()
>
> perf_enable()
> hw_perf_enable()
>
> /* and here we'll assign the new event to counter0
> * except we never disabled it... */
>
> intel_pmu_enable_all()
> wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, intel_ctrl)
>
> Or am I missing something?
>
> > for(i=0; i < cpuc->n_events; i++) {
> >
> > event = cpuc->event_list[i];
> > hwc = &event->hw;
> >
> > - if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
> > - continue;
> > -
> > - x86_pmu.disable(hwc, hwc->idx);
> > -
> > - clear_bit(hwc->idx, cpuc->active_mask);
> > - barrier();
> > - cpuc->events[hwc->idx] = NULL;
> > -
> > - x86_perf_event_update(event, hwc, hwc->idx);
> > -
> > - hwc->idx = -1;
> > - }
> > -

I've split your -v6 delta in two, one part doing that fastpath
scheduling, and one part this hw_perf_enable optimization, for now I've
dropped the second part.

On top of that I did a patch that shares the above code with
x86_pmu_disable() so that we don't have that sequence twice.



2010-01-25 17:12:24

by Stephane Eranian

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Fri, Jan 22, 2010 at 9:27 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
>> @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
>>                  * apply assignment obtained either from
>>                  * hw_perf_group_sched_in() or x86_pmu_enable()
>>                  *
>> -                * step1: save events moving to new counters
>> -                * step2: reprogram moved events into new counters
>> +                * We either re-enable or re-program and re-enable.
>> +                * All events are disabled by the time we come here.
>> +                * That means their state has been saved already.
>>                  */
>
> I'm not seeing how it is true.

> Suppose a core2 with counter0 active counting a non-restricted event,
> say cpu_cycles. Then we do:
>
> perf_disable()
>  hw_perf_disable()
>    intel_pmu_disable_all
>      wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>
Everything is disabled globally, even though the individual enable bit of
counter0 is not cleared. But the global disable is enough to stop it.

> ->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
>  x86_pmu_enable()
>    collect_events()
>    x86_schedule_events()
>    n_added = 1
>
>    /* also slightly confused about this */
>    if (hwc->idx != -1)
>      x86_perf_event_set_period()
>

In x86_pmu_enable(), we have not yet actually assigned the
counter to hwc->idx. This is only accomplished by hw_perf_enable().
Yet, x86_perf_event_set_period() is going to write the MSR.

My understanding is that you never call enable(event) in code
outside of a perf_disable()/perf_enable() section.

> perf_enable()
>  hw_perf_enable()
>
>    /* and here we'll assign the new event to counter0
>     * except we never disabled it... */
>
You will have two events scheduled: cycles in counter1 and
mem_load_retired in counter0. Neither hwc->idx will match its
previous state, and thus both will be rewritten.

I think the case you are worried about is different. It is the
case where you would move an event to a new counter without
replacing it with a new event. Given that the individual enable
bit (MSR.en) would still be 1, and that enable_all() enables all
counters (even the ones not actively used), we would get a
runaway counter, so to speak.

It seems a solution would be to call x86_pmu_disable() before
assigning an event to a new counter for all events which are
moving. This is because we cannot assume all events have been
previously disabled individually. Something like

if (!match_prev_assignment(hwc, cpuc, i)) {
   if (hwc->idx != -1)
      x86_pmu.disable(hwc, hwc->idx);
   x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
   x86_perf_event_set_period(event, hwc, hwc->idx);
}

>    intel_pmu_enable_all()
>      wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, intel_ctrl)
>
> Or am I missing something?
>

2010-01-25 17:26:08

by Peter Zijlstra

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Mon, 2010-01-25 at 18:12 +0100, stephane eranian wrote:
> On Fri, Jan 22, 2010 at 9:27 PM, Peter Zijlstra <[email protected]> wrote:
> > On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
> >> @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
> >> * apply assignment obtained either from
> >> * hw_perf_group_sched_in() or x86_pmu_enable()
> >> *
> >> - * step1: save events moving to new counters
> >> - * step2: reprogram moved events into new counters
> >> + * We either re-enable or re-program and re-enable.
> >> + * All events are disabled by the time we come here.
> >> + * That means their state has been saved already.
> >> */
> >
> > I'm not seeing how it is true.
>
> > Suppose a core2 with counter0 active counting a non-restricted event,
> > say cpu_cycles. Then we do:
> >
> > perf_disable()
> > hw_perf_disable()
> > intel_pmu_disable_all
> > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >
> everything is disabled globally, yet individual counter0 is not.
> But that's enough to stop it.
>
> > ->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
> > x86_pmu_enable()
> > collect_events()
> > x86_schedule_events()
> > n_added = 1
> >
> > /* also slightly confused about this */
> > if (hwc->idx != -1)
> > x86_perf_event_set_period()
> >
>
> In x86_pmu_enable(), we have not yet actually assigned the
> counter to hwc->idx. This is only accomplished by hw_perf_enable().
> Yet, x86_perf_event_set_period() is going to write the MSR.
>
> My understanding is that you never call enable(event) in code
> outside of a perf_disable()/perf_enable() section.

That should be so, yes; last time I verified, it was. Hence I'm a bit
puzzled by that set_period(): hw_perf_enable() will assign ->idx and do
set_period(), so why also do it here...

> > perf_enable()
> > hw_perf_enable()
> >
> > /* and here we'll assign the new event to counter0
> > * except we never disabled it... */
> >
> You will have two events, scheduled, cycles in counter1
> and mem_load_retired in counter0. Neither hwc->idx
> will match previous state and thus both will be rewritten.

And by programming mem_load_retired you just wiped the counter value of
the cycles counter; there should be an x86_perf_event_update() in between
stopping the counter and moving that configuration.

> I think the case you are worried about is different. It is the
> case where you would move an event to a new counter
> without replacing it with a new event. Given that the individual
> MSR.en would still be 1 AND that enable_all() enables all
> counters (even the ones not actively used), then we would
> get a runaway counter so to speak.
>
> It seems a solution would be to call x86_pmu_disable() before
> assigning an event to a new counter for all events which are
> moving. This is because we cannot assume all events have been
> previously disabled individually. Something like
>
> if (!match_prev_assignment(hwc, cpuc, i)) {
> if (hwc->idx != -1)
> x86_pmu.disable(hwc, hwc->idx);
> x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
> x86_perf_event_set_period(event, hwc, hwc->idx);
> }

Yes and no; my worry is not that it's not counting, but that we didn't
store the actual counter value before overwriting it with the new
configuration.

As to your suggestion,
1) we would have to do x86_pmu_disable() since that does
x86_perf_event_update().
2) I worried about the case where we basically swap two counters;
there we cannot do the x86_perf_event_update() in a single pass, since
programming the first counter will destroy the value of the second.

Now possibly the scenario in 2 isn't possible because the event
scheduling is stable enough for this never to happen, but I wasn't
feeling too sure about that, so I skipped this part for now.

2010-01-25 17:48:18

by Stephane Eranian

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Mon, Jan 25, 2010 at 6:25 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-01-25 at 18:12 +0100, stephane eranian wrote:
>> On Fri, Jan 22, 2010 at 9:27 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Thu, 2010-01-21 at 17:39 +0200, Stephane Eranian wrote:
>> >> @@ -1395,40 +1430,28 @@ void hw_perf_enable(void)
>> >>                  * apply assignment obtained either from
>> >>                  * hw_perf_group_sched_in() or x86_pmu_enable()
>> >>                  *
>> >> -                * step1: save events moving to new counters
>> >> -                * step2: reprogram moved events into new counters
>> >> +                * We either re-enable or re-program and re-enable.
>> >> +                * All events are disabled by the time we come here.
>> >> +                * That means their state has been saved already.
>> >>                  */
>> >
>> > I'm not seeing how it is true.
>>
>> > Suppose a core2 with counter0 active counting a non-restricted event,
>> > say cpu_cycles. Then we do:
>> >
>> > perf_disable()
>> >  hw_perf_disable()
>> >    intel_pmu_disable_all
>> >      wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> >
>> everything is disabled globally, yet individual counter0 is not.
>> But that's enough to stop it.
>>
>> > ->enable(MEM_LOAD_RETIRED) /* constrained to counter0 */
>> >  x86_pmu_enable()
>> >    collect_events()
>> >    x86_schedule_events()
>> >    n_added = 1
>> >
>> >    /* also slightly confused about this */
>> >    if (hwc->idx != -1)
>> >      x86_perf_event_set_period()
>> >
>>
>> In x86_pmu_enable(), we have not yet actually assigned the
>> counter to hwc->idx. This is only accomplished by hw_perf_enable().
>> Yet, x86_perf_event_set_period() is going to write the MSR.
>>
>> My understanding is that you never call enable(event) in code
>> outside of a perf_disable()/perf_enable() section.
>
> That should be so yes, last time I verified that is was. Hence I'm a bit
> puzzled by that set_period(), hw_perf_enable() will assign ->idx and do
> set_period() so why also do it here...
>

Ok, so I think we can drop set_period() from enable(event).

>> > perf_enable()
>> >  hw_perf_enable()
>> >
>> >    /* and here we'll assign the new event to counter0
>> >     * except we never disabled it... */
>> >
>> You will have two events, scheduled, cycles in counter1
>> and mem_load_retired in counter0. Neither hwc->idx
>> will match previous state and thus both will be rewritten.
>
> And by programming mem_load_retires you just wiped the counter value of
> the cycle counter, there should be an x86_perf_event_update() in between
> stopping the counter and moving that configuration.
>
>> I think the case you are worried about is different. It is the
>> case where you would move an event to a new counter
>> without replacing it with a new event. Given that the individual
>> MSR.en would still be 1 AND that enable_all() enables all
>> counters (even the ones not actively used), then we would
>> get a runaway counter so to speak.
>>
>> It seems a solution would be to call x86_pmu_disable() before
>> assigning an event to a new counter for all events which are
>> moving. This is because we cannot assume all events have been
>> previously disabled individually. Something like
>>
>> if (!match_prev_assignment(hwc, cpuc, i)) {
>>    if (hwc->idx != -1)
>>       x86_pmu.disable(hwc, hwc->idx);
>>    x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
>>    x86_perf_event_set_period(event, hwc, hwc->idx);
>> }
>
> Yes and no, my worry is not that its not counting, but that we didn't
> store the actual counter value before over-writing it with the new
> configuration.
>
> As to your suggestion,
>  1) we would have to do x86_pmu_disable() since that does
> x86_perf_event_update().
>  2) I worried about the case where we basically switch two counters,
> there we cannot do the x86_perf_event_update() in a single pass since
> programming the first counter will destroy the value of the second.
>
> Now possibly the scenario in 2 isn't possible because the event
> scheduling is stable enough for this never to happen, but I wasn't
> feeling too sure about that, so I skipped this part for now.
>
I think what adds to the complexity here is that there are two distinct
disable() mechanisms: perf_disable() and x86_pmu.disable(). They do not
operate the same way. You would think that calling hw_perf_disable()
stops the individual events as well (thus saving their values). In fact,
if you do perf_disable() and then read the count, you will not get an
up-to-date value in event->count; you need pmu->disable(event) to
ensure that.

So my understanding is that perf_disable() is meant for a temporary stop,
thus no need to save the count.

As for 2, I believe this can happen if you add two new events which have
more restrictions. For instance, on Core you were measuring cycles and inst
in generic counters, then you add two events which can only be measured on
generic counters. That will cause cycles and inst to be moved to the fixed
counters.

So we have to modify hw_perf_enable() to first disable all events which
are moving, then reprogram them. I suspect it may be possible to optimize
this if we detect that those events had already been stopped individually
(as opposed to perf_disable()), i.e., already had their counts saved.
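
As a rough sketch only (based on the snippet earlier in the thread and on
match_prev_assignment() from the v6 patch; not a tested patch), the
reprogramming in hw_perf_enable() could be split into two passes, so that
every moving event is stopped and its count saved before any counter is
rewritten:

        /*
         * pass 1: stop every moving event and save its count, so that
         * nothing is lost when counters are rewritten in pass 2
         */
        for (i = 0; i < cpuc->n_events; i++) {
                event = cpuc->event_list[i];
                hwc = &event->hw;

                if (hwc->idx == -1 || match_prev_assignment(hwc, cpuc, i))
                        continue;

                x86_pmu.disable(hwc, hwc->idx);
                clear_bit(hwc->idx, cpuc->active_mask);
                cpuc->events[hwc->idx] = NULL;
                x86_perf_event_update(event, hwc, hwc->idx);
        }

        /*
         * pass 2: assign new counters and reprogram the moved events
         */
        for (i = 0; i < cpuc->n_events; i++) {
                event = cpuc->event_list[i];
                hwc = &event->hw;

                if (!match_prev_assignment(hwc, cpuc, i)) {
                        x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
                        x86_perf_event_set_period(event, hwc, hwc->idx);
                }
                set_bit(hwc->idx, cpuc->active_mask);
                cpuc->events[hwc->idx] = event;
        }

The two passes would also cover the counter-swap case mentioned above,
since all counts are saved before any counter is reprogrammed.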




2010-01-25 17:59:57

by Peter Zijlstra

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Mon, 2010-01-25 at 18:48 +0100, stephane eranian wrote:

> >> It seems a solution would be to call x86_pmu_disable() before
> >> assigning an event to a new counter for all events which are
> >> moving. This is because we cannot assume all events have been
> >> previously disabled individually. Something like
> >>
> >> if (!match_prev_assignment(hwc, cpuc, i)) {
> >> if (hwc->idx != -1)
> >> x86_pmu.disable(hwc, hwc->idx);
> >> x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
> >> x86_perf_event_set_period(event, hwc, hwc->idx);
> >> }
> >
> > Yes and no, my worry is not that its not counting, but that we didn't
> > store the actual counter value before over-writing it with the new
> > configuration.
> >
> > As to your suggestion,
> > 1) we would have to do x86_pmu_disable() since that does
> > x86_perf_event_update().
> > 2) I worried about the case where we basically switch two counters,
> > there we cannot do the x86_perf_event_update() in a single pass since
> > programming the first counter will destroy the value of the second.
> >
> > Now possibly the scenario in 2 isn't possible because the event
> > scheduling is stable enough for this never to happen, but I wasn't
> > feeling too sure about that, so I skipped this part for now.
> >
> I think what adds to the complexity here is that there are two distinct
> disable() mechanisms: perf_disable() and x86_pmu.disable(). They
> don't operate the same way. You would think that by calling hw_perf_disable()
> you would stop individual events as well (thus saving their values). That
> means that if you do perf_disable() and then read the count, you will not
> get the up-to-date value in event->count. you need pmu->disable(event)
> to ensure that.

No, a read is actually good enough, since it does x86_perf_event_update(),
but we're not guaranteeing that such a read is present.

So yes, perf_disable() is basically a local_irq_disable() but for perf
events; all it has to do is ensure there's no concurrency. ->disable()
will really tear the configuration down.

Either ->disable() or ->read() will end up calling
x86_perf_event_update() which is needed to read the actual hw counter
value and propagate it into event storage.

> So my understanding is that perf_disable() is meant for a temporary stop,
> thus no need to save the count.
>
> As for 2, I believe this can happen if you add 2 new events which have more
> restrictions. For instance on Core, you were measuring cycles, inst in generic
> counters, then you add 2 events which can only be measured on generic counters.
> That will cause cycles, inst to be moved to fixed counters.
>
> So we have to modify hw_perf_enable() to first disable all events
> which are moving,
> then reprogram them. I suspect it may be possible to optimize this if
> we detect that
> those events had already been stopped individually (as opposed to
> perf_disable()), i.e.,
> already had their counts saved.

Right, I see no fundamentally impossible things at all, we just need to
be careful here.

Anyway, I poked at the stack I've got now and it seems to hold up when I
poke at it with various combinations of constraint events, so I'll push
that off to Ingo and then we can go from there.

Thanks for working on this!

2010-01-26 09:17:52

by Stephane Eranian

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Mon, Jan 25, 2010 at 6:59 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-01-25 at 18:48 +0100, stephane eranian wrote:
>> So we have to modify hw_perf_enable() to first disable all events
>> which are moving,
>> then reprogram them. I suspect it may be possible to optimize this if
>> we detect that
>> those events had already been stopped individually (as opposed to
>> perf_disable()), i.e.,
>> already had their counts saved.
>
> Right, I see no fundamentally impossible things at all, we just need to
> be careful here.
>
> Anyway, I poked at the stack I've got now and it seems to hold up when I
> poke at it with various combinations of constraint events, so I'll push
> that off to Ingo and then we can go from there.
>
> Thanks for working on this!

Ok, so I think the best way to proceed here is to first wait until all of this
is checked in. Then I'll see what is missing based on what we discussed
here.

2010-01-29 09:27:41

by Stephane Eranian

Subject: [tip:perf/core] perf_events: Add fast-path to the rescheduling code

Commit-ID: 8113070d6639d2245c6c79afb8df42cedab30540
Gitweb: http://git.kernel.org/tip/8113070d6639d2245c6c79afb8df42cedab30540
Author: Stephane Eranian <[email protected]>
AuthorDate: Thu, 21 Jan 2010 17:39:01 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 29 Jan 2010 09:01:34 +0100

perf_events: Add fast-path to the rescheduling code

Implement correct fastpath scheduling, i.e., reuse previous assignment.

Signed-off-by: Stephane Eranian <[email protected]>
[ split from larger patch]
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/cpu/perf_event.c | 91 +++++++++++++++++++++++++------------
1 files changed, 61 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 995ac4a..0bd23d0 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1245,6 +1245,46 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
}

/*
+ * fastpath, try to reuse previous register
+ */
+ for (i = 0, num = n; i < n; i++, num--) {
+ hwc = &cpuc->event_list[i]->hw;
+ c = (unsigned long *)constraints[i];
+
+ /* never assigned */
+ if (hwc->idx == -1)
+ break;
+
+ /* constraint still honored */
+ if (!test_bit(hwc->idx, c))
+ break;
+
+ /* not already used */
+ if (test_bit(hwc->idx, used_mask))
+ break;
+
+#if 0
+ pr_debug("CPU%d fast config=0x%llx idx=%d assign=%c\n",
+ smp_processor_id(),
+ hwc->config,
+ hwc->idx,
+ assign ? 'y' : 'n');
+#endif
+
+ set_bit(hwc->idx, used_mask);
+ if (assign)
+ assign[i] = hwc->idx;
+ }
+ if (!num)
+ goto done;
+
+ /*
+ * begin slow path
+ */
+
+ bitmap_zero(used_mask, X86_PMC_IDX_MAX);
+
+ /*
* weight = number of possible counters
*
* 1 = most constrained, only works on one counter
@@ -1263,10 +1303,9 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (x86_pmu.num_events_fixed)
wmax++;

- num = n;
- for (w = 1; num && w <= wmax; w++) {
+ for (w = 1, num = n; num && w <= wmax; w++) {
/* for each event */
- for (i = 0; i < n; i++) {
+ for (i = 0; num && i < n; i++) {
c = (unsigned long *)constraints[i];
hwc = &cpuc->event_list[i]->hw;

@@ -1274,28 +1313,6 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
if (weight != w)
continue;

- /*
- * try to reuse previous assignment
- *
- * This is possible despite the fact that
- * events or events order may have changed.
- *
- * What matters is the level of constraints
- * of an event and this is constant for now.
- *
- * This is possible also because we always
- * scan from most to least constrained. Thus,
- * if a counter can be reused, it means no,
- * more constrained events, needed it. And
- * next events will either compete for it
- * (which cannot be solved anyway) or they
- * have fewer constraints, and they can use
- * another counter.
- */
- j = hwc->idx;
- if (j != -1 && !test_bit(j, used_mask))
- goto skip;
-
for_each_bit(j, c, X86_PMC_IDX_MAX) {
if (!test_bit(j, used_mask))
break;
@@ -1303,22 +1320,23 @@ static int x86_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)

if (j == X86_PMC_IDX_MAX)
break;
-skip:
- set_bit(j, used_mask);

#if 0
- pr_debug("CPU%d config=0x%llx idx=%d assign=%c\n",
+ pr_debug("CPU%d slow config=0x%llx idx=%d assign=%c\n",
smp_processor_id(),
hwc->config,
j,
assign ? 'y' : 'n');
#endif

+ set_bit(j, used_mask);
+
if (assign)
assign[i] = j;
num--;
}
}
+done:
/*
* scheduling failed or is just a simulation,
* free resources if necessary
@@ -1357,7 +1375,7 @@ static int collect_events(struct cpu_hw_events *cpuc, struct perf_event *leader,

list_for_each_entry(event, &leader->sibling_list, group_entry) {
if (!is_x86_event(event) ||
- event->state == PERF_EVENT_STATE_OFF)
+ event->state <= PERF_EVENT_STATE_OFF)
continue;

if (n >= max_count)
@@ -2184,6 +2202,8 @@ static void amd_get_event_constraints(struct cpu_hw_events *cpuc,
struct perf_event *event,
u64 *idxmsk)
{
+ /* no constraints, means supports all generic counters */
+ bitmap_fill((unsigned long *)idxmsk, x86_pmu.num_events);
}

static int x86_event_sched_in(struct perf_event *event,
@@ -2258,7 +2278,7 @@ int hw_perf_group_sched_in(struct perf_event *leader,

n1 = 1;
list_for_each_entry(sub, &leader->sibling_list, group_entry) {
- if (sub->state != PERF_EVENT_STATE_OFF) {
+ if (sub->state > PERF_EVENT_STATE_OFF) {
ret = x86_event_sched_in(sub, cpuctx, cpu);
if (ret)
goto undo;
@@ -2613,12 +2633,23 @@ static int validate_group(struct perf_event *event)

const struct pmu *hw_perf_event_init(struct perf_event *event)
{
+ const struct pmu *tmp;
int err;

err = __hw_perf_event_init(event);
if (!err) {
+ /*
+ * we temporarily connect event to its pmu
+ * such that validate_group() can classify
+ * it as an x86 event using is_x86_event()
+ */
+ tmp = event->pmu;
+ event->pmu = &pmu;
+
if (event->group_leader != event)
err = validate_group(event);
+
+ event->pmu = tmp;
}
if (err) {
if (event->destroy)

2010-01-29 23:08:19

by Stephane Eranian

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

I think there is a problem with the following code:

void hw_perf_enable(void)
                for (i = 0; i < cpuc->n_events; i++) {

                        event = cpuc->event_list[i];
                        hwc = &event->hw;

                        if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
                                continue;

Here you are looking for events which are moving. I think the 2nd
part of the if is not good enough. The fact that hwc->idx matches the
assignment does not mean the event is still programmed on that counter.
It may have been there in the past, then scheduled out and replaced at
that idx by another event. When it comes back, it gets its spot back,
but it needs to be reprogrammed.

That is why, in the v6 incremental patch, I added last_cpu and last_tag
to provide stronger checks via match_prev_assignment().

Somehow this is missing from the series you've committed, unless
I am overlooking something.
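
For reference, the stronger check added in the v6 patch (see the hunks
above) is:

        /* counter reuse is only safe when all three conditions hold */
        #define match_prev_assignment(h, c, i)                  \
                ((h)->idx == (c)->assign[i]                     \
                 && (h)->last_cpu == smp_processor_id()         \
                 && (h)->last_tag == (c)->tags[i])

        if (!match_prev_assignment(hwc, cpuc, i)) {
                x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
                x86_perf_event_set_period(event, hwc, hwc->idx);
        }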


On Mon, Jan 25, 2010 at 6:59 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, 2010-01-25 at 18:48 +0100, stephane eranian wrote:
>
>> >> It seems a solution would be to call x86_pmu_disable() before
>> >> assigning an event to a new counter for all events which are
>> >> moving. This is because we cannot assume all events have been
>> >> previously disabled individually. Something like
>> >>
>> >> if (!match_prev_assignment(hwc, cpuc, i)) {
>> >>    if (hwc->idx != -1)
>> >>       x86_pmu.disable(hwc, hwc->idx);
>> >>    x86_assign_hw_event(event, cpuc, cpuc->assign[i]);
>> >>    x86_perf_event_set_period(event, hwc, hwc->idx);
>> >> }

2010-01-30 08:56:59

by Peter Zijlstra

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

On Sat, 2010-01-30 at 00:08 +0100, Stephane Eranian wrote:
> I think there is a problem with this following code:
>
> void hw_perf_enable(void)
> for (i = 0; i < cpuc->n_events; i++) {
>
> event = cpuc->event_list[i];
> hwc = &event->hw;
>
> if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
> continue;
>
> Here you are looking for events which are moving. I think the 2nd
> part of the if is not good enough. It is not because hwc->idx is
> identical to the assignment, that you can assume the event was
> already there. It may have been there in the past, then scheduled
> out and replaced at idx by another event. When it comes back,
> it gets its spot back, but it needs to be reprogrammed.
>
> That is why in v6 incremental, I have added last_cpu, last_tag
> to have a stronger checks and match_prev_assignment().
>
> Somehow it is missing in the series you've committed unless
> I am missing something.

Right, that went missing because I assumed it was only there for the
optimization of reducing things to one loop. And since I didn't see that
one-loop version work, I left that part out.

(The risk of doing more than one thing in one patch)

Still, it shouldn't be hard to correct; I'll look at doing a patch for this
on Monday, unless you beat me to it :-)

2010-01-30 18:22:02

by Stephane Eranian

Subject: Re: [PATCH] perf_events: improve x86 event scheduling (v6 incremental)

I will resubmit a patch on Monday. Same thing for AMD NB events.


On Sat, Jan 30, 2010 at 9:56 AM, Peter Zijlstra <[email protected]> wrote:
> On Sat, 2010-01-30 at 00:08 +0100, Stephane Eranian wrote:
>> I think there is a problem with this following code:
>>
>> void hw_perf_enable(void)
>>                 for (i = 0; i < cpuc->n_events; i++) {
>>
>>                         event = cpuc->event_list[i];
>>                         hwc = &event->hw;
>>
>>                         if (hwc->idx == -1 || hwc->idx == cpuc->assign[i])
>>                                 continue;
>>
>> Here you are looking for events which are moving. I think the 2nd
>> part of the if is not good enough. It is not because hwc->idx is
>> identical to the assignment, that you can assume the event was
>> already there. It may have been there in the past, then scheduled
>> out and replaced at idx by another event. When it comes back,
>> it gets its spot back, but it needs to be reprogrammed.
>>
>> That is why in v6 incremental, I have added last_cpu, last_tag
>> to have a stronger checks and match_prev_assignment().
>>
>> Somehow it is missing in the series you've committed unless
>> I am missing something.
>
> Right, that went missing because I was assuming that was for the
> optimization of reducing to one loop. And since I didn't see that one
> loop version work I left that part out.
>
> (The risk of doing more than one thing in one patch)
>
> Still, shouldn't be hard to correct, I'll look at doing a patch for this
> on monday, unless you beat me to it :-)
>
>



--
Stephane Eranian | EMEA Software Engineering
Google France | 38 avenue de l'Opéra | 75002 Paris
Tel : +33 (0) 1 42 68 53 00