2010-01-10 01:38:26

by Frederic Weisbecker

Subject: [PATCH 0/6] perf: Various event scheduling improvements

Hi,

These patches bring (I hope) a bit more scalability and fairness
to the perf events scheduling.

But this is only an introduction as there is still some work to
do, like ensuring all pinned events have been scheduled before
flexible ones (for now we schedule in order cpu pinned, cpu flexible,
task pinned, task flexible), among other improvements.


Frederic Weisbecker (6):
perf/core: Split context's event group list into pinned and non-pinned lists
list: Introduce list_rotate_left()
perf: Round robin groups of events using list_rotate_left()
perf: Export software-only event group characteristic as a flag
perf: Don't rotate pinned groups
perf: Increase round-robin fairness of flexible events

include/linux/list.h | 14 +++
include/linux/perf_event.h | 8 +-
kernel/perf_event.c | 262 +++++++++++++++++++++++++++-----------------
3 files changed, 184 insertions(+), 100 deletions(-)


2010-01-10 01:38:32

by Frederic Weisbecker

Subject: [PATCH 1/6 v2] perf/core: Split context's event group list into pinned and non-pinned lists

Split-up struct perf_event_context::group_list into pinned_groups
and flexible_groups (non-pinned).

At first this may appear useless, as it duplicates various loops around
the group list handling.

But it scales better in the fast path of perf_sched_in(): we no longer
iterate twice through the entire list to separate pinned from
non-pinned scheduling. Instead we iterate through two distinct lists.

Another desired effect is that it makes it easier to define distinct
scheduling rules for each list.

Changes in v2:
- Rename pinned_grp_list and volatile_grp_list to pinned_groups
and flexible_groups respectively, as per Ingo's suggestion.
- Various cleanups

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
include/linux/perf_event.h | 3 +-
kernel/perf_event.c | 227 +++++++++++++++++++++++++++++---------------
2 files changed, 153 insertions(+), 77 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9a1d276..cdbc2aa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -683,7 +683,8 @@ struct perf_event_context {
*/
struct mutex mutex;

- struct list_head group_list;
+ struct list_head pinned_groups;
+ struct list_head flexible_groups;
struct list_head event_list;
int nr_events;
int nr_active;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 27f69a0..c9f8a75 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -289,6 +289,15 @@ static void update_event_times(struct perf_event *event)
event->total_time_running = run_end - event->tstamp_running;
}

+static struct list_head *
+ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
+{
+ if (event->attr.pinned)
+ return &ctx->pinned_groups;
+ else
+ return &ctx->flexible_groups;
+}
+
/*
* Add a event from the lists for its context.
* Must be called with ctx->mutex and ctx->lock held.
@@ -303,9 +312,12 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
* add it straight to the context's event list, or to the group
* leader's sibling list:
*/
- if (group_leader == event)
- list_add_tail(&event->group_entry, &ctx->group_list);
- else {
+ if (group_leader == event) {
+ struct list_head *list;
+
+ list = ctx_group_list(event, ctx);
+ list_add_tail(&event->group_entry, list);
+ } else {
list_add_tail(&event->group_entry, &group_leader->sibling_list);
group_leader->nr_siblings++;
}
@@ -355,8 +367,10 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
* to the context list directly:
*/
list_for_each_entry_safe(sibling, tmp, &event->sibling_list, group_entry) {
+ struct list_head *list;

- list_move_tail(&sibling->group_entry, &ctx->group_list);
+ list = ctx_group_list(event, ctx);
+ list_move_tail(&sibling->group_entry, list);
sibling->group_leader = sibling;
}
}
@@ -1056,7 +1070,10 @@ void __perf_event_sched_out(struct perf_event_context *ctx,

perf_disable();
if (ctx->nr_active) {
- list_for_each_entry(event, &ctx->group_list, group_entry)
+ list_for_each_entry(event, &ctx->pinned_groups, group_entry)
+ group_sched_out(event, cpuctx, ctx);
+
+ list_for_each_entry(event, &ctx->flexible_groups, group_entry)
group_sched_out(event, cpuctx, ctx);
}
perf_enable();
@@ -1271,9 +1288,8 @@ __perf_event_sched_in(struct perf_event_context *ctx,
* First go through the list and put on any pinned groups
* in order to give them the best chance of going on.
*/
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (event->state <= PERF_EVENT_STATE_OFF ||
- !event->attr.pinned)
+ list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;
if (event->cpu != -1 && event->cpu != cpu)
continue;
@@ -1291,15 +1307,10 @@ __perf_event_sched_in(struct perf_event_context *ctx,
}
}

- list_for_each_entry(event, &ctx->group_list, group_entry) {
- /*
- * Ignore events in OFF or ERROR state, and
- * ignore pinned events since we did them already.
- */
- if (event->state <= PERF_EVENT_STATE_OFF ||
- event->attr.pinned)
+ list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
+ /* Ignore events in OFF or ERROR state */
+ if (event->state <= PERF_EVENT_STATE_OFF)
continue;
-
/*
* Listen to the 'cpu' scheduling filter constraint
* of events:
@@ -1453,8 +1464,13 @@ static void rotate_ctx(struct perf_event_context *ctx)
* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();
- list_for_each_entry(event, &ctx->group_list, group_entry) {
- list_move_tail(&event->group_entry, &ctx->group_list);
+ list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->pinned_groups);
+ break;
+ }
+
+ list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
+ list_move_tail(&event->group_entry, &ctx->flexible_groups);
break;
}
perf_enable();
@@ -1490,6 +1506,21 @@ void perf_event_task_tick(struct task_struct *curr)
perf_event_task_sched_in(curr);
}

+static int event_enable_on_exec(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ if (!event->attr.enable_on_exec)
+ return 0;
+
+ event->attr.enable_on_exec = 0;
+ if (event->state >= PERF_EVENT_STATE_INACTIVE)
+ return 0;
+
+ __perf_event_mark_enabled(event, ctx);
+
+ return 1;
+}
+
/*
* Enable all of a task's events that have been marked enable-on-exec.
* This expects task == current.
@@ -1500,6 +1531,7 @@ static void perf_event_enable_on_exec(struct task_struct *task)
struct perf_event *event;
unsigned long flags;
int enabled = 0;
+ int ret;

local_irq_save(flags);
ctx = task->perf_event_ctxp;
@@ -1510,14 +1542,16 @@ static void perf_event_enable_on_exec(struct task_struct *task)

raw_spin_lock(&ctx->lock);

- list_for_each_entry(event, &ctx->group_list, group_entry) {
- if (!event->attr.enable_on_exec)
- continue;
- event->attr.enable_on_exec = 0;
- if (event->state >= PERF_EVENT_STATE_INACTIVE)
- continue;
- __perf_event_mark_enabled(event, ctx);
- enabled = 1;
+ list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
+ ret = event_enable_on_exec(event, ctx);
+ if (ret)
+ enabled = 1;
+ }
+
+ list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
+ ret = event_enable_on_exec(event, ctx);
+ if (ret)
+ enabled = 1;
}

/*
@@ -1591,7 +1625,8 @@ __perf_event_init_context(struct perf_event_context *ctx,
{
raw_spin_lock_init(&ctx->lock);
mutex_init(&ctx->mutex);
- INIT_LIST_HEAD(&ctx->group_list);
+ INIT_LIST_HEAD(&ctx->pinned_groups);
+ INIT_LIST_HEAD(&ctx->flexible_groups);
INIT_LIST_HEAD(&ctx->event_list);
atomic_set(&ctx->refcount, 1);
ctx->task = task;
@@ -5032,7 +5067,11 @@ void perf_event_exit_task(struct task_struct *child)
mutex_lock_nested(&child_ctx->mutex, SINGLE_DEPTH_NESTING);

again:
- list_for_each_entry_safe(child_event, tmp, &child_ctx->group_list,
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->pinned_groups,
+ group_entry)
+ __perf_event_exit_task(child_event, child_ctx, child);
+
+ list_for_each_entry_safe(child_event, tmp, &child_ctx->flexible_groups,
group_entry)
__perf_event_exit_task(child_event, child_ctx, child);

@@ -5041,7 +5080,8 @@ again:
* its siblings to the list, but we obtained 'tmp' before that which
* will still point to the list head terminating the iteration.
*/
- if (!list_empty(&child_ctx->group_list))
+ if (!list_empty(&child_ctx->pinned_groups) ||
+ !list_empty(&child_ctx->flexible_groups))
goto again;

mutex_unlock(&child_ctx->mutex);
@@ -5049,6 +5089,24 @@ again:
put_ctx(child_ctx);
}

+static void perf_free_event(struct perf_event *event,
+ struct perf_event_context *ctx)
+{
+ struct perf_event *parent = event->parent;
+
+ if (WARN_ON_ONCE(!parent))
+ return;
+
+ mutex_lock(&parent->child_mutex);
+ list_del_init(&event->child_list);
+ mutex_unlock(&parent->child_mutex);
+
+ fput(parent->filp);
+
+ list_del_event(event, ctx);
+ free_event(event);
+}
+
/*
* free an unexposed, unused context as created by inheritance by
* init_task below, used by fork() in case of fail.
@@ -5063,36 +5121,70 @@ void perf_event_free_task(struct task_struct *task)

mutex_lock(&ctx->mutex);
again:
- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry) {
- struct perf_event *parent = event->parent;
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
+ perf_free_event(event, ctx);

- if (WARN_ON_ONCE(!parent))
- continue;
+ list_for_each_entry_safe(event, tmp, &ctx->flexible_groups,
+ group_entry)
+ perf_free_event(event, ctx);

- mutex_lock(&parent->child_mutex);
- list_del_init(&event->child_list);
- mutex_unlock(&parent->child_mutex);
+ if (!list_empty(&ctx->pinned_groups) ||
+ !list_empty(&ctx->flexible_groups))
+ goto again;

- fput(parent->filp);
+ mutex_unlock(&ctx->mutex);

- list_del_event(event, ctx);
- free_event(event);
+ put_ctx(ctx);
+}
+
+static int
+inherit_task_group(struct perf_event *event, struct task_struct *parent,
+ struct perf_event_context *parent_ctx,
+ struct task_struct *child,
+ int *inherited_all)
+{
+ int ret;
+ struct perf_event_context *child_ctx = child->perf_event_ctxp;
+
+ if (!event->attr.inherit) {
+ *inherited_all = 0;
+ return 0;
}

- if (!list_empty(&ctx->group_list))
- goto again;
+ if (!child_ctx) {
+ /*
+ * This is executed from the parent task context, so
+ * inherit events that have been marked for cloning.
+ * First allocate and initialize a context for the
+ * child.
+ */

- mutex_unlock(&ctx->mutex);
+ child_ctx = kzalloc(sizeof(struct perf_event_context),
+ GFP_KERNEL);
+ if (!child_ctx)
+ return -ENOMEM;

- put_ctx(ctx);
+ __perf_event_init_context(child_ctx, child);
+ child->perf_event_ctxp = child_ctx;
+ get_task_struct(child);
+ }
+
+ ret = inherit_group(event, parent, parent_ctx,
+ child, child_ctx);
+
+ if (ret)
+ *inherited_all = 0;
+
+ return ret;
}

+
/*
* Initialize the perf_event context in task_struct
*/
int perf_event_init_task(struct task_struct *child)
{
- struct perf_event_context *child_ctx = NULL, *parent_ctx;
+ struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
struct perf_event *event;
struct task_struct *parent = current;
@@ -5130,41 +5222,22 @@ int perf_event_init_task(struct task_struct *child)
* We dont have to disable NMIs - we are only looking at
* the list, not manipulating it:
*/
- list_for_each_entry(event, &parent_ctx->group_list, group_entry) {
-
- if (!event->attr.inherit) {
- inherited_all = 0;
- continue;
- }
-
- if (!child->perf_event_ctxp) {
- /*
- * This is executed from the parent task context, so
- * inherit events that have been marked for cloning.
- * First allocate and initialize a context for the
- * child.
- */
-
- child_ctx = kzalloc(sizeof(struct perf_event_context),
- GFP_KERNEL);
- if (!child_ctx) {
- ret = -ENOMEM;
- break;
- }
-
- __perf_event_init_context(child_ctx, child);
- child->perf_event_ctxp = child_ctx;
- get_task_struct(child);
- }
+ list_for_each_entry(event, &parent_ctx->pinned_groups, group_entry) {
+ ret = inherit_task_group(event, parent, parent_ctx, child,
+ &inherited_all);
+ if (ret)
+ break;
+ }

- ret = inherit_group(event, parent, parent_ctx,
- child, child_ctx);
- if (ret) {
- inherited_all = 0;
+ list_for_each_entry(event, &parent_ctx->flexible_groups, group_entry) {
+ ret = inherit_task_group(event, parent, parent_ctx, child,
+ &inherited_all);
+ if (ret)
break;
- }
}

+ child_ctx = child->perf_event_ctxp;
+
if (child_ctx && inherited_all) {
/*
* Mark the child context as a clone of the parent
@@ -5213,7 +5286,9 @@ static void __perf_event_exit_cpu(void *info)
struct perf_event_context *ctx = &cpuctx->ctx;
struct perf_event *event, *tmp;

- list_for_each_entry_safe(event, tmp, &ctx->group_list, group_entry)
+ list_for_each_entry_safe(event, tmp, &ctx->pinned_groups, group_entry)
+ __perf_event_remove_from_context(event);
+ list_for_each_entry_safe(event, tmp, &ctx->flexible_groups, group_entry)
__perf_event_remove_from_context(event);
}
static void perf_event_exit_cpu(int cpu)
--
1.6.2.3

2010-01-10 01:38:34

by Frederic Weisbecker

Subject: [PATCH 4/6] perf: Export software-only event group characteristic as a flag

Before scheduling an event group, we check whether the group can go
on. Part of that check is whether the group is made of software-only
events, in which case it can always be scheduled in.

For that purpose, we currently iterate through the whole group, which
is wasteful: we could do this check once, when an event is added to or
removed from a group.

So we create a group_flags field in struct perf_event that can hold
characteristics of a group of events, starting with a first
PERF_GROUP_SOFTWARE flag that makes the fast-path check cheap.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/perf_event.c | 30 +++++++++++-------------------
2 files changed, 16 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index cdbc2aa..c6f812e 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -565,6 +565,10 @@ typedef void (*perf_overflow_handler_t)(struct perf_event *, int,
struct perf_sample_data *,
struct pt_regs *regs);

+enum perf_group_flag {
+ PERF_GROUP_SOFTWARE = 0x1,
+};
+
/**
* struct perf_event - performance event kernel representation:
*/
@@ -574,6 +578,7 @@ struct perf_event {
struct list_head event_entry;
struct list_head sibling_list;
int nr_siblings;
+ int group_flags;
struct perf_event *group_leader;
struct perf_event *output;
const struct pmu *pmu;
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index d7254f5..7d00676 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -315,9 +315,16 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
if (group_leader == event) {
struct list_head *list;

+ if (is_software_event(event))
+ event->group_flags |= PERF_GROUP_SOFTWARE;
+
list = ctx_group_list(event, ctx);
list_add_tail(&event->group_entry, list);
} else {
+ if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
+ !is_software_event(event))
+ group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;
+
list_add_tail(&event->group_entry, &group_leader->sibling_list);
group_leader->nr_siblings++;
}
@@ -372,6 +379,9 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
list = ctx_group_list(event, ctx);
list_move_tail(&sibling->group_entry, list);
sibling->group_leader = sibling;
+
+ /* Inherit group flags from the previous leader */
+ sibling->group_flags = event->group_flags;
}
}

@@ -700,24 +710,6 @@ group_error:
}

/*
- * Return 1 for a group consisting entirely of software events,
- * 0 if the group contains any hardware events.
- */
-static int is_software_only_group(struct perf_event *leader)
-{
- struct perf_event *event;
-
- if (!is_software_event(leader))
- return 0;
-
- list_for_each_entry(event, &leader->sibling_list, group_entry)
- if (!is_software_event(event))
- return 0;
-
- return 1;
-}
-
-/*
* Work out whether we can put this event group on the CPU now.
*/
static int group_can_go_on(struct perf_event *event,
@@ -727,7 +719,7 @@ static int group_can_go_on(struct perf_event *event,
/*
* Groups consisting entirely of software events can always go on.
*/
- if (is_software_only_group(event))
+ if (event->group_flags & PERF_GROUP_SOFTWARE)
return 1;
/*
* If an exclusive group is already on, no other hardware
--
1.6.2.3

2010-01-10 01:38:48

by Frederic Weisbecker

Subject: [PATCH 3/6] perf: Round robin groups of events using list_rotate_left()

This is more proper than doing it through a list_for_each_entry()
that breaks after the first entry.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
kernel/perf_event.c | 13 +++----------
1 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index c9f8a75..d7254f5 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1454,8 +1454,6 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx)
*/
static void rotate_ctx(struct perf_event_context *ctx)
{
- struct perf_event *event;
-
if (!ctx->nr_events)
return;

@@ -1464,15 +1462,10 @@ static void rotate_ctx(struct perf_event_context *ctx)
* Rotate the first entry last (works just fine for group events too):
*/
perf_disable();
- list_for_each_entry(event, &ctx->pinned_groups, group_entry) {
- list_move_tail(&event->group_entry, &ctx->pinned_groups);
- break;
- }

- list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
- list_move_tail(&event->group_entry, &ctx->flexible_groups);
- break;
- }
+ list_rotate_left(&ctx->pinned_groups);
+ list_rotate_left(&ctx->flexible_groups);
+
perf_enable();

raw_spin_unlock(&ctx->lock);
--
1.6.2.3

2010-01-10 01:38:56

by Frederic Weisbecker

Subject: [PATCH 6/6] perf: Increase round-robin fairness of flexible events

Groups of flexible events are round-robined on each tick so that
each group has a chance to be scheduled. But this per-group
fairness doesn't propagate inside the groups
themselves.

If only the first events of each group get a chance to make
their way, the remaining ones will never be scheduled.

Hence this patch propagates the round-robin to the events
inside the groups.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
kernel/perf_event.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 584e854..407137a 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1446,16 +1446,21 @@ static void perf_ctx_adjust_freq(struct perf_event_context *ctx)
*/
static void rotate_ctx(struct perf_event_context *ctx)
{
+ struct perf_event *leader;
+
if (!ctx->nr_events)
return;

raw_spin_lock(&ctx->lock);
/*
* Rotate the first entry last of non-pinned groups
+ * and events inside these groups
*/
perf_disable();

list_rotate_left(&ctx->flexible_groups);
+ list_for_each_entry(leader, &ctx->flexible_groups, group_entry)
+ list_rotate_left(&leader->sibling_list);

perf_enable();

--
1.6.2.3

2010-01-10 01:39:20

by Frederic Weisbecker

Subject: [PATCH 2/6] list: Introduce list_rotate_left()

Bring in a new list_rotate_left() helper that rotates a list to
the left. This is useful for code that needs to round-robin
elements whose queue priority increases from tail to head.
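
For illustration only (not part of this patch): a standalone userspace
mock of the helper, showing the round-robin effect on a list whose head
has the highest "scheduling priority".

#include <stdio.h>

struct list_head {
	struct list_head *prev, *next;
};

static void list_add_tail(struct list_head *new, struct list_head *head)
{
	new->prev = head->prev;
	new->next = head;
	head->prev->next = new;
	head->prev = new;
}

static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Same behaviour as the proposed helper: move the first entry to the tail */
static void list_rotate_left(struct list_head *head)
{
	struct list_head *first;

	if (head->next != head) {	/* !list_empty(head) */
		first = head->next;
		list_del(first);
		list_add_tail(first, head);
	}
}

struct group {
	struct list_head entry;	/* first member, so a cast mimics container_of() */
	const char *name;
};

int main(void)
{
	struct list_head groups = { &groups, &groups };
	struct group a = { .name = "groupA" };
	struct group b = { .name = "groupB" };
	struct group c = { .name = "groupC" };
	int tick;

	list_add_tail(&a.entry, &groups);
	list_add_tail(&b.entry, &groups);
	list_add_tail(&c.entry, &groups);

	/* On each "tick", the group at the head of the list goes first */
	for (tick = 0; tick < 3; tick++) {
		struct group *first = (struct group *)groups.next;

		printf("tick %d: %s goes first\n", tick, first->name);
		list_rotate_left(&groups);
	}
	return 0;
}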

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
include/linux/list.h | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index 969f6e9..5d9c655 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -206,6 +206,20 @@ static inline int list_empty_careful(const struct list_head *head)
}

/**
+ * list_rotate_left - rotate the list to the left
+ * @head: the head of the list
+ */
+static inline void list_rotate_left(struct list_head *head)
+{
+ struct list_head *first;
+
+ if (!list_empty(head)) {
+ first = head->next;
+ list_move_tail(first, head);
+ }
+}
+
+/**
* list_is_singular - tests whether a list has just one entry.
* @head: the list to test.
*/
--
1.6.2.3

2010-01-10 01:39:23

by Frederic Weisbecker

Subject: [PATCH 5/6] perf: Don't rotate pinned groups

We currently round-robin pinned groups, which makes them behave
just like non-pinned groups. In practice, almost the only
difference between pinned and non-pinned events is that the
former are scheduled before the latter. And the latter also
stop scheduling non-software-only groups once one of them couldn't
make it.

Anyway, pinned groups don't need to be round-robined, because if
a pinned group can't be scheduled it is put in an error state,
following the pinned logic: it is either always or never
scheduled in a context.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Paul Mackerras <[email protected]>
---
kernel/perf_event.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 7d00676..584e854 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1451,11 +1451,10 @@ static void rotate_ctx(struct perf_event_context *ctx)

raw_spin_lock(&ctx->lock);
/*
- * Rotate the first entry last (works just fine for group events too):
+ * Rotate the first entry last of non-pinned groups
*/
perf_disable();

- list_rotate_left(&ctx->pinned_groups);
list_rotate_left(&ctx->flexible_groups);

perf_enable();
--
1.6.2.3

2010-01-10 22:07:25

by Paul Mackerras

Subject: Re: [PATCH 6/6] perf: Increase round-robin fairness of flexible events

Frederic,

Nice to see someone working on the event scheduling in perf.

But I don't think this patch makes sense:

> Groups of flexible events are round-robined on each tick so that
> each group has a chance to be scheduled. But this per-group
> fairness doesn't propagate inside the groups
> themselves.
>
> If only the first events of each group get a chance to make
> their way, the remaining ones will never be scheduled.
>
> Hence this patch propagates the round-robin to the events
> inside the groups.

The semantic of a group is that either all of the events in the group
are scheduled in, or none of them are. So it doesn't make sense to
talk about fairness within a group, and I don't see any point to
rotating the elements of the sibling_list. Or have I misunderstood
what you're aiming at?

Paul.

2010-01-10 23:58:07

by Frederic Weisbecker

Subject: Re: [PATCH 6/6] perf: Increase round-robin fairness of flexible events

On Mon, Jan 11, 2010 at 09:04:40AM +1100, Paul Mackerras wrote:
> Frederic,
>
> Nice to see someone working on the event scheduling in perf.
>
> But I don't think this patch makes sense:
>
> > Groups of flexible events are round-robined on each tick so that
> > each group has a chance to be scheduled. But this per-group
> > fairness doesn't propagate inside the groups
> > themselves.
> >
> > If only the first events of each group get a chance to make
> > their way, the remaining ones will never be scheduled.
> >
> > Hence this patch propagates the round-robin to the events
> > inside the groups.
>
> The semantic of a group is that either all of the events in the group
> are scheduled in, or none of them are. So it doesn't make sense to
> talk about fairness within a group, and I don't see any point to
> rotating the elements of the sibling_list. Or have I misunderstood
> what you're aiming at?


You're right. I forgot that a group that is only partially scheduled will
have its scheduled events cancelled.

But is that a sane behaviour, considering the nature of non-pinned events?

Let's take an example. On x86 we have 4 breakpoint registers available.

We have 3 pinned breakpoint events in one group and, say, 2 groups
of flexible breakpoints:


Pinned     Flexible0    Flexible1
  |            |            |
 Bp1          Bp4          Bp6
 Bp2          Bp5          Bp7
 Bp3


The flexible ones will never get scheduled, because
we only have 4 available slots and we would need 5. And if
we try to schedule Flexible0, Bp4 will make it, but
not Bp5, so Bp4 gets cancelled, and so on.
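
For concreteness, here is a rough userspace sketch of that layout using
the perf_event_open() syscall. The watched variables and the bp_event()
helper are made up for the example, and error handling is omitted:

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

static int bp_event(void *addr, int pinned, int group_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_W;		/* break on write access */
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_4;
	attr.pinned = pinned;			/* only meaningful on a group leader */

	/* pid = 0: monitor the current task; cpu = -1: on any CPU */
	return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

static int wa, wb, wc, wd, we, wf, wg;		/* watched variables */

int main(void)
{
	/* Pinned group: Bp1, Bp2, Bp3 */
	int pinned = bp_event(&wa, 1, -1);
	bp_event(&wb, 0, pinned);
	bp_event(&wc, 0, pinned);

	/* Flexible0: Bp4, Bp5 */
	int flex0 = bp_event(&wd, 0, -1);
	bp_event(&we, 0, flex0);

	/* Flexible1: Bp6, Bp7 */
	int flex1 = bp_event(&wf, 0, -1);
	bp_event(&wg, 0, flex1);

	/* ... run the workload, then read() the counts from the fds ... */
	return 0;
}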

But the semantics of non-pinned counters are about
time-sharing them.

If we don't cancel only-partially-scheduled flexible
groups and if we round-robin inside flexible groups,
then they will all make it.

I think the constraint of "either all or none get
scheduled in a group" makes a lot of sense for pinned
groups.

But I don't see the point in applying this
rule inside flexible groups, because the nature
of flexible events implies they have been created to
compete for a limited resource. So if this competition
happens only between groups, it amounts to
voluntary starvation.

Or... or... maybe I just realized too late that the semantics
of a group imply that all events inside must always be
counted simultaneously? In which case I agree with you,
this patch makes no sense and must be dropped.

2010-01-11 00:39:13

by Paul Mackerras

Subject: Re: [PATCH 6/6] perf: Increase round-robin fairness of flexible events

On Mon, Jan 11, 2010 at 12:57:59AM +0100, Frederic Weisbecker wrote:

> I think the constraint of "either all or none get
> scheduled in a group" makes a lot of sense for pinned
> groups.
>
> But I don't see the point in applying this
> rule inside flexible groups, because the nature
> of flexible events implies they have been created to
> compete for a limited resource. So if this competition
> happens only between groups, it amounts to
> voluntary starvation.
>
> Or... or... maybe I just realized too late that the semantics
> of a group imply that all events inside must always be
> counted simultaneously? In which case I agree with you,
> this patch makes no sense and must be dropped.

The original idea of the groups was for situations where you want to
take the difference or ratio of two counts. For example, if you want
to measure cache hits but the hardware can only count cache accesses
and cache misses. In that situation you want to compute accesses
minus misses, but if the counters for accesses and for misses are
independently scheduled, statistical fluctuations can mean there is a
lot of noise in the result, and it might even be negative. Putting
the two counters into one group means that you can meaningfully
compute the difference or ratio since the two counter values relate to
the same set of instructions (even if that isn't the whole execution
of the program).
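
To make that concrete, here is a rough userspace sketch (not from this
thread, error handling omitted) that groups cache references and cache
misses so both counts cover exactly the same scheduled-in periods:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int hw_event(__u64 config, int group_fd)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = config;
	/* keep the group off until the leader is explicitly enabled */
	attr.disabled = (group_fd == -1);

	return syscall(__NR_perf_event_open, &attr, 0, -1, group_fd, 0);
}

int main(void)
{
	uint64_t refs, misses;
	int leader = hw_event(PERF_COUNT_HW_CACHE_REFERENCES, -1);
	int miss = hw_event(PERF_COUNT_HW_CACHE_MISSES, leader);

	ioctl(leader, PERF_EVENT_IOC_ENABLE, 0);
	/* ... workload under measurement ... */
	ioctl(leader, PERF_EVENT_IOC_DISABLE, 0);

	read(leader, &refs, sizeof(refs));
	read(miss, &misses, sizeof(misses));

	printf("hits ~= %llu\n", (unsigned long long)(refs - misses));
	return 0;
}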

The default situation is that each event is in its own group, so the
starvation you talk about won't arise. If the user has gone to the
trouble of putting two events into one group, then they are saying
that they need the events to be scheduled on and off together, and if
that leads to starvation, that's unfortunate but we can't do any
better within the limitations of the hardware.

Paul.

2010-01-11 01:56:40

by Frederic Weisbecker

Subject: Re: [PATCH 6/6] perf: Increase round-robin fairness of flexible events

On Mon, Jan 11, 2010 at 11:39:05AM +1100, Paul Mackerras wrote:
> On Mon, Jan 11, 2010 at 12:57:59AM +0100, Frederic Weisbecker wrote:
>
> > I think the constraint of "either all or none get
> > scheduled in a group" makes a lot of sense for pinned
> > groups.
> >
> > But I don't see the point in applying this
> > rule inside flexible groups, because the nature
> > of flexible events implies they have been created to
> > compete for a limited resource. So if this competition
> > happens only between groups, it amounts to
> > voluntary starvation.
> >
> > Or... or... maybe I just realized too late that the semantics
> > of a group imply that all events inside must always be
> > counted simultaneously? In which case I agree with you,
> > this patch makes no sense and must be dropped.
>
> The original idea of the groups was for situations where you want to
> take the difference or ratio of two counts. For example, if you want
> to measure cache hits but the hardware can only count cache accesses
> and cache misses. In that situation you want to compute accesses
> minus misses, but if the counters for accesses and for misses are
> independently scheduled, statistical fluctuations can mean there is a
> lot of noise in the result, and it might even be negative. Putting
> the two counters into one group means that you can meaningfully
> compute the difference or ratio since the two counter values relate to
> the same set of instructions (even if that isn't the whole execution
> of the program).
>
> The default situation is that each event is in its own group, so the
> starvation you talk about won't arise. If the user has gone to the
> trouble of putting two events into one group, then they are saying
> that they need the events to be scheduled on and off together, and if
> that leads to starvation, that's unfortunate but we can't do any
> better within the limitations of the hardware.



Agreed. This patch came from my misunderstanding of the purpose of
groups.

2010-01-14 12:25:39

by Peter Zijlstra

Subject: Re: [PATCH 3/6] perf: Round robin groups of events using list_rotate_left()

On Sun, 2010-01-10 at 02:38 +0100, Frederic Weisbecker wrote:
> This is more proper than doing it through a list_for_each_entry()
> that breaks after the first entry.
>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Arnaldo Carvalho de Melo <[email protected]>
> Cc: Paul Mackerras <[email protected]>
> ---

> + list_rotate_left(&ctx->pinned_groups);
> + list_rotate_left(&ctx->flexible_groups);

Just wondering, pinned events were supposed to be on the pmu at all
times, right?

2010-01-14 12:29:29

by Frederic Weisbecker

Subject: Re: [PATCH 3/6] perf: Round robin groups of events using list_rotate_left()

On Thu, Jan 14, 2010 at 01:25:22PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-01-10 at 02:38 +0100, Frederic Weisbecker wrote:
> > This is more proper than doing it through a list_for_each_entry()
> > that breaks after the first entry.
> >
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Arnaldo Carvalho de Melo <[email protected]>
> > Cc: Paul Mackerras <[email protected]>
> > ---
>
> > + list_rotate_left(&ctx->pinned_groups);
> > + list_rotate_left(&ctx->flexible_groups);
>
> Just wondering, pinned events were supposed to be on the pmu at all
> times, right?


Yeah, but the previous version, which used a global
list, rotated the whole thing. So, to keep the same
behaviour, I also rotate the pinned groups, even
if it makes no sense :-)

But there is a subsequent patch in the same set
that removes the rotating of pinned groups.

2010-01-14 12:32:36

by Peter Zijlstra

Subject: Re: [PATCH 3/6] perf: Round robin groups of events using list_rotate_left()

On Thu, 2010-01-14 at 13:29 +0100, Frederic Weisbecker wrote:
>
> But there is a subsequent patch in the same set
> that removes the rotating of pinned groups.

ok, let me read on ;-)

2010-01-14 12:49:32

by Peter Zijlstra

Subject: Re: [PATCH 0/6] perf: Various event scheduling improvements

On Sun, 2010-01-10 at 02:38 +0100, Frederic Weisbecker wrote:
> Hi,
>
> These patches bring (I hope) a bit more scalability and fairness
> to the perf events scheduling.
>
> But this is only an introduction as there is still some work to
> do, like ensuring all pinned events have been scheduled before
> flexible ones (for now we schedule in order cpu pinned, cpu flexible,
> task pinned, task flexible), among other improvements.

Right, so 1-5 look fine (although I'd have folded 5 into 3), thanks!

2010-01-14 13:09:54

by Frederic Weisbecker

Subject: Re: [PATCH 0/6] perf: Various event scheduling improvements

On Thu, Jan 14, 2010 at 01:49:14PM +0100, Peter Zijlstra wrote:
> On Sun, 2010-01-10 at 02:38 +0100, Frederic Weisbecker wrote:
> > Hi,
> >
> > These patches bring (I hope) a bit more scalability and fairness
> > to the perf events scheduling.
> >
> > But this is only an introduction as there is still some work to
> > do, like ensuring all pinned events have been scheduled before
> > flexible ones (for now we schedule in order cpu pinned, cpu flexible,
> > task pinned, task flexible), among other improvements.
>
> Right, so 1-5 look fine (although I'd have folded 5 into 3), thanks!


Ok, I'll merge 5 into 3 and send it as a pull request to Ingo
with your ack.

Thanks.