2015-04-15 15:10:02

by Liang, Kan

Subject: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

From: Kan Liang <[email protected]>

A pmu marked as perf_invalid_context doesn't have any state to switch on
context switch. Everything is global, so it is OK for its events to be
part of sw/hw groups.
In sched_out/sched_in, del/add must still be called, so the
perf_invalid_context event can be disabled/enabled accordingly during
context switch. The event count is only read when the event is already
sched_in.

However, group read doesn't work with mixed events.

For example,
perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
It always fails with EINVAL.

This patch set intends to fix this issue.
perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.202 MB perf.data (12 samples) ]

This patch special-cases invalid context events and allows them to be part
of sw/hw groups.
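
For illustration, opening such a mixed group from user space looks roughly
like the sketch below. The uncore type/config values are placeholders that
would normally be read from sysfs (e.g.
/sys/bus/event_source/devices/uncore_imc_0/type), so treat this as an
illustrative sketch rather than exact tool code:

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                           int cpu, int group_fd, unsigned long flags)
{
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
        struct perf_event_attr leader = {
                .size           = sizeof(struct perf_event_attr),
                .type           = PERF_TYPE_HARDWARE,
                .config         = PERF_COUNT_HW_CPU_CYCLES,
                .disabled       = 1,
                .read_format    = PERF_FORMAT_GROUP | PERF_FORMAT_ID,
        };
        struct perf_event_attr member = {
                .size           = sizeof(struct perf_event_attr),
                .type           = 15,           /* placeholder: uncore_imc_0 type from sysfs */
                .config         = 0x304,        /* placeholder: cas_count_read encoding */
                .read_format    = PERF_FORMAT_GROUP | PERF_FORMAT_ID,
        };
        int leader_fd, member_fd;

        /* system-wide on cpu 0: open the leader, then join via group_fd */
        leader_fd = perf_event_open(&leader, -1, 0, -1, 0);
        member_fd = perf_event_open(&member, -1, 0, leader_fd, 0);

        /* without this patch the second open fails with EINVAL because
         * the uncore event lives in a perf_invalid_context */
        if (leader_fd < 0 || member_fd < 0)
                perror("perf_event_open");
        return 0;
}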

Signed-off-by: Kan Liang <[email protected]>
---

Changes since V1
- For a pure invalid context event group, use the leader's pmu as the
event members' pmu.

include/linux/perf_event.h | 8 +++++
kernel/events/core.c | 76 ++++++++++++++++++++++++++++++++++------------
2 files changed, 65 insertions(+), 19 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 61992cf..ecc80fa 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -742,6 +742,14 @@ static inline bool is_sampling_event(struct perf_event *event)
/*
* Return 1 for a software event, 0 for a hardware event
*/
+static inline int is_invalid_context_event(struct perf_event *event)
+{
+ return event->pmu->task_ctx_nr == perf_invalid_context;
+}
+
+/*
+ * Return 1 for a software event, 0 for a hardware event
+ */
static inline int is_software_event(struct perf_event *event)
{
return event->pmu->task_ctx_nr == perf_sw_context;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 06917d5..519ae0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1351,7 +1351,7 @@ static void perf_group_attach(struct perf_event *event)
WARN_ON_ONCE(group_leader->ctx != event->ctx);

if (group_leader->group_flags & PERF_GROUP_SOFTWARE &&
- !is_software_event(event))
+ !is_software_event(event) && !is_invalid_context_event(event))
group_leader->group_flags &= ~PERF_GROUP_SOFTWARE;

list_add_tail(&event->group_entry, &group_leader->sibling_list);
@@ -7946,8 +7946,11 @@ SYSCALL_DEFINE5(perf_event_open,
account_event(event);

/*
- * Special case software events and allow them to be part of
- * any hardware group.
+ * Special case software events and invalid context events.
+ * Allow software events to be part of any hardware group.
+ * Invalid context events can only lead a pure invalid context
+ * event group, but they can be members of any software/hardware
+ * group.
*/
pmu = event->pmu;

@@ -7958,27 +7961,62 @@ SYSCALL_DEFINE5(perf_event_open,
}

if (group_leader &&
- (is_software_event(event) != is_software_event(group_leader))) {
- if (is_software_event(event)) {
+ (group_leader->pmu->task_ctx_nr != event->pmu->task_ctx_nr)) {
+ if (is_invalid_context_event(group_leader)) {
+ err = -EINVAL;
+ goto err_alloc;
+ } else if (is_software_event(group_leader)) {
+ if (is_invalid_context_event(event)) {
+ if (group_leader->group_flags & PERF_GROUP_SOFTWARE) {
+ /*
+ * If group_leader is software event
+ * and event is invalid context event
+ * allow the addition of invalid
+ * context event to software groups.
+ */
+ pmu = group_leader->pmu;
+ } else {
+ /*
+ * The group leader is a software event,
+ * but the group is not a pure software group.
+ * There must be a hardware event in the group;
+ * find it and use its pmu.
+ */
+ struct perf_event *tmp;
+
+ list_for_each_entry(tmp, &group_leader->sibling_list, group_entry) {
+ if (tmp->pmu->task_ctx_nr == perf_hw_context) {
+ pmu = tmp->pmu;
+ break;
+ }
+ }
+ if (pmu == event->pmu)
+ goto err_alloc;
+ }
+ } else {
+ if (group_leader->group_flags & PERF_GROUP_SOFTWARE) {
+ /*
+ * In case the group is pure software group,
+ * and we try to add a hardware event,
+ * move the whole group to hardware context.
+ */
+ move_group = 1;
+ }
+ }
+ } else {
/*
- * If event and group_leader are not both a software
- * event, and event is, then group leader is not.
- *
- * Allow the addition of software events to !software
- * groups, this is safe because software events never
- * fail to schedule.
+ * If group_leader is hardware event and event is not,
+ * allow the addition of !hardware events to hardware
+ * groups. This is safe because software events and
+ * invalid context events never fail to schedule.
*/
pmu = group_leader->pmu;
- } else if (is_software_event(group_leader) &&
- (group_leader->group_flags & PERF_GROUP_SOFTWARE)) {
- /*
- * In case the group is a pure software group, and we
- * try to add a hardware event, move the whole group to
- * the hardware context.
- */
- move_group = 1;
}
}
+ if (group_leader &&
+ is_invalid_context_event(group_leader) &&
+ is_invalid_context_event(event))
+ pmu = group_leader->pmu;

/*
* Get the target context (task or percpu):
--
1.8.3.1


2015-04-15 15:09:25

by Liang, Kan

Subject: [PATCH V2 2/6] perf evsel: Set evsel->cpus to the evlist->cpus when not constrained

From: Arnaldo Carvalho de Melo <[email protected]>

So that we don't need to know about the evlist all the time and can cope
with cases where evsel->cpus was set because the event is on a PMU with
a cpumask.

Reported-by: Stephane Eranian <[email protected]>
Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>
Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-stat.c | 36 ++++++++++-----------------
tools/perf/util/evlist.c | 63 +++++++++++++++++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index f7b8218..2c1456e 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -163,16 +163,6 @@ static inline void diff_timespec(struct timespec *r, struct timespec *a,
}
}

-static inline struct cpu_map *perf_evsel__cpus(struct perf_evsel *evsel)
-{
- return (evsel->cpus && !target.cpu_list) ? evsel->cpus : evsel_list->cpus;
-}
-
-static inline int perf_evsel__nr_cpus(struct perf_evsel *evsel)
-{
- return perf_evsel__cpus(evsel)->nr;
-}
-
static void perf_evsel__reset_stat_priv(struct perf_evsel *evsel)
{
int i;
@@ -202,7 +192,7 @@ static int perf_evsel__alloc_prev_raw_counts(struct perf_evsel *evsel)
size_t sz;

sz = sizeof(*evsel->counts) +
- (perf_evsel__nr_cpus(evsel) * sizeof(struct perf_counts_values));
+ (cpu_map__nr(evsel->cpus) * sizeof(struct perf_counts_values));

addr = zalloc(sz);
if (!addr)
@@ -235,7 +225,7 @@ static int perf_evlist__alloc_stats(struct perf_evlist *evlist, bool alloc_raw)

evlist__for_each(evlist, evsel) {
if (perf_evsel__alloc_stat_priv(evsel) < 0 ||
- perf_evsel__alloc_counts(evsel, perf_evsel__nr_cpus(evsel)) < 0 ||
+ perf_evsel__alloc_counts(evsel, cpu_map__nr(evsel->cpus)) < 0 ||
(alloc_raw && perf_evsel__alloc_prev_raw_counts(evsel) < 0))
goto out_free;
}
@@ -269,7 +259,7 @@ static void perf_stat__reset_stats(struct perf_evlist *evlist)

evlist__for_each(evlist, evsel) {
perf_evsel__reset_stat_priv(evsel);
- perf_evsel__reset_counts(evsel, perf_evsel__nr_cpus(evsel));
+ perf_evsel__reset_counts(evsel, cpu_map__nr(evsel->cpus));
}

memset(runtime_nsecs_stats, 0, sizeof(runtime_nsecs_stats));
@@ -302,7 +292,7 @@ static int create_perf_stat_counter(struct perf_evsel *evsel)
attr->inherit = !no_inherit;

if (target__has_cpu(&target))
- return perf_evsel__open_per_cpu(evsel, perf_evsel__cpus(evsel));
+ return perf_evsel__open_per_cpu(evsel, evsel->cpus);

if (!target__has_task(&target) && perf_evsel__is_group_leader(evsel)) {
attr->disabled = 1;
@@ -398,7 +388,7 @@ static void zero_per_pkg(struct perf_evsel *counter)
static int check_per_pkg(struct perf_evsel *counter, int cpu, bool *skip)
{
unsigned long *mask = counter->per_pkg_mask;
- struct cpu_map *cpus = perf_evsel__cpus(counter);
+ struct cpu_map *cpus = counter->cpus;
int s;

*skip = false;
@@ -509,7 +499,7 @@ static int read_counter_aggr(struct perf_evsel *counter)
static int read_counter(struct perf_evsel *counter)
{
int nthreads = thread_map__nr(evsel_list->threads);
- int ncpus = perf_evsel__nr_cpus(counter);
+ int ncpus = cpu_map__nr(counter->cpus);
int cpu, thread;

if (!counter->supported)
@@ -733,13 +723,13 @@ static int __run_perf_stat(int argc, const char **argv)
if (aggr_mode == AGGR_GLOBAL) {
evlist__for_each(evsel_list, counter) {
read_counter_aggr(counter);
- perf_evsel__close_fd(counter, perf_evsel__nr_cpus(counter),
+ perf_evsel__close_fd(counter, cpu_map__nr(counter->cpus),
thread_map__nr(evsel_list->threads));
}
} else {
evlist__for_each(evsel_list, counter) {
read_counter(counter);
- perf_evsel__close_fd(counter, perf_evsel__nr_cpus(counter), 1);
+ perf_evsel__close_fd(counter, cpu_map__nr(counter->cpus), 1);
}
}

@@ -831,7 +821,7 @@ static void aggr_printout(struct perf_evsel *evsel, int id, int nr)
case AGGR_NONE:
fprintf(output, "CPU%*d%s",
csv_output ? 0 : -4,
- perf_evsel__cpus(evsel)->map[id], csv_sep);
+ evsel->cpus->map[id], csv_sep);
break;
case AGGR_GLOBAL:
default:
@@ -1239,8 +1229,8 @@ static void print_aggr(char *prefix)
evlist__for_each(evsel_list, counter) {
val = ena = run = 0;
nr = 0;
- for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
- cpu2 = perf_evsel__cpus(counter)->map[cpu];
+ for (cpu = 0; cpu < cpu_map__nr(counter->cpus); cpu++) {
+ cpu2 = counter->cpus->map[cpu];
s2 = aggr_get_id(evsel_list->cpus, cpu2);
if (s2 != id)
continue;
@@ -1353,7 +1343,7 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
double uval;
int cpu;

- for (cpu = 0; cpu < perf_evsel__nr_cpus(counter); cpu++) {
+ for (cpu = 0; cpu < cpu_map__nr(counter->cpus); cpu++) {
val = counter->counts->cpu[cpu].val;
ena = counter->counts->cpu[cpu].ena;
run = counter->counts->cpu[cpu].run;
@@ -1364,7 +1354,7 @@ static void print_counter(struct perf_evsel *counter, char *prefix)
if (run == 0 || ena == 0) {
fprintf(output, "CPU%*d%s%*s%s",
csv_output ? 0 : -4,
- perf_evsel__cpus(counter)->map[cpu], csv_sep,
+ counter->cpus->map[cpu], csv_sep,
csv_output ? 0 : 18,
counter->supported ? CNTR_NOT_COUNTED : CNTR_NOT_SUPPORTED,
csv_sep);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 080be93..a065150 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -110,13 +110,46 @@ void perf_evlist__exit(struct perf_evlist *evlist)
fdarray__exit(&evlist->pollfd);
}

+static void perf_evlist__reset_cpus(struct perf_evlist *evlist)
+{
+ struct perf_evsel *evsel;
+
+ evlist__for_each(evlist, evsel) {
+ if (evsel->cpus == evlist->cpus)
+ evsel->cpus = NULL;
+ }
+
+ evlist->cpus = NULL;
+}
+
+static void perf_evlist__set_cpus(struct perf_evlist *evlist,
+ struct cpu_map *cpus)
+{
+ struct perf_evsel *evsel;
+
+ if (evlist->cpus != NULL)
+ perf_evlist__reset_cpus(evlist);
+ /*
+ * If, when parsing events, the evsel->cpus wasn't constrained to a
+ * cpulist, say, because it is on a PMU that has a cpumask, then set it
+ * to the evlist cpu_map, so that we can access evsel->cpus and get the
+ * cpu_map this evsel works with.
+ */
+ evlist__for_each(evlist, evsel) {
+ if (evsel->cpus == NULL)
+ evsel->cpus = cpus;
+ }
+
+ evlist->cpus = cpus;
+}
+
void perf_evlist__delete(struct perf_evlist *evlist)
{
perf_evlist__munmap(evlist);
perf_evlist__close(evlist);
cpu_map__delete(evlist->cpus);
+ perf_evlist__reset_cpus(evlist);
thread_map__delete(evlist->threads);
- evlist->cpus = NULL;
evlist->threads = NULL;
perf_evlist__purge(evlist);
perf_evlist__exit(evlist);
@@ -129,6 +162,15 @@ void perf_evlist__add(struct perf_evlist *evlist, struct perf_evsel *entry)
entry->idx = evlist->nr_entries;
entry->tracking = !entry->idx;

+ /*
+ * If, when parsing events, the evsel->cpus wasn't constrained to a
+ * cpulist, say, because it is on a PMU that has a cpumask, then set it
+ * to the evlist cpu_map, so that we can access evsel->cpus and get the
+ * cpu_map this evsel works with.
+ */
+ if (entry->cpus == NULL)
+ entry->cpus = evlist->cpus;
+
if (!evlist->nr_entries++)
perf_evlist__set_id_pos(evlist);
}
@@ -1028,6 +1070,8 @@ int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages,

int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
{
+ struct cpu_map *cpus;
+
evlist->threads = thread_map__new_str(target->pid, target->tid,
target->uid);

@@ -1035,13 +1079,15 @@ int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
return -1;

if (target__uses_dummy_map(target))
- evlist->cpus = cpu_map__dummy_new();
+ cpus = cpu_map__dummy_new();
else
- evlist->cpus = cpu_map__new(target->cpu_list);
+ cpus = cpu_map__new(target->cpu_list);

- if (evlist->cpus == NULL)
+ if (cpus == NULL)
goto out_delete_threads;

+ perf_evlist__set_cpus(evlist, cpus);
+
return 0;

out_delete_threads:
@@ -1255,6 +1301,7 @@ void perf_evlist__close(struct perf_evlist *evlist)

static int perf_evlist__create_syswide_maps(struct perf_evlist *evlist)
{
+ struct cpu_map *cpus;
int err = -ENOMEM;

/*
@@ -1266,8 +1313,8 @@ static int perf_evlist__create_syswide_maps(struct perf_evlist *evlist)
* error, and we may not want to do that fallback to a
* default cpu identity map :-\
*/
- evlist->cpus = cpu_map__new(NULL);
- if (evlist->cpus == NULL)
+ cpus = cpu_map__new(NULL);
+ if (cpus == NULL)
goto out;

evlist->threads = thread_map__new_dummy();
@@ -1275,11 +1322,11 @@ static int perf_evlist__create_syswide_maps(struct perf_evlist *evlist)
goto out_free_cpus;

err = 0;
+ perf_evlist__set_cpus(evlist, cpus);
out:
return err;
out_free_cpus:
- cpu_map__delete(evlist->cpus);
- evlist->cpus = NULL;
+ cpu_map__delete(cpus);
goto out;
}

--
1.8.3.1

2015-04-15 15:09:11

by Liang, Kan

Subject: [PATCH V2 3/6] perf,tools: get real cpu id for print_aggr

From: Kan Liang <[email protected]>

Events no longer share the same evlist->cpus, so the same cpu id may
have a different index in different evsels' cpu maps.
In print_aggr, aggr_get_id needs the index to get the core/pkg id.
perf_evsel__get_cpumap_index is introduced to look up the index by cpu id.

Signed-off-by: Kan Liang <[email protected]>
---

Changes since V1
- Rebase to Arnaldo's code

tools/perf/builtin-stat.c | 1 +
tools/perf/util/cpumap.c | 4 ++--
tools/perf/util/evsel.c | 14 ++++++++++++++
tools/perf/util/evsel.h | 3 +++
4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 2c1456e..c4f8fd2 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1231,6 +1231,7 @@ static void print_aggr(char *prefix)
nr = 0;
for (cpu = 0; cpu < cpu_map__nr(counter->cpus); cpu++) {
cpu2 = counter->cpus->map[cpu];
+ cpu2 = perf_evsel__get_cpumap_index(cpu2, evsel_list->cpus);
s2 = aggr_get_id(evsel_list->cpus, cpu2);
if (s2 != id)
continue;
diff --git a/tools/perf/util/cpumap.c b/tools/perf/util/cpumap.c
index c4e55b7..82f3ea8 100644
--- a/tools/perf/util/cpumap.c
+++ b/tools/perf/util/cpumap.c
@@ -211,7 +211,7 @@ int cpu_map__get_socket(struct cpu_map *map, int idx)
char path[PATH_MAX];
int cpu, ret;

- if (idx > map->nr)
+ if ((idx > map->nr) || (idx < 0))
return -1;

cpu = map->map[idx];
@@ -274,7 +274,7 @@ int cpu_map__get_core(struct cpu_map *map, int idx)
char path[PATH_MAX];
int cpu, ret, s;

- if (idx > map->nr)
+ if ((idx > map->nr) || (idx < 0))
return -1;

cpu = map->map[idx];
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 33e3fd8..e177f43 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -991,6 +991,20 @@ int __perf_evsel__read_on_cpu(struct perf_evsel *evsel,
return 0;
}

+int perf_evsel__get_cpumap_index(int cpu, struct cpu_map *evsel_cpus)
+{
+ int i;
+
+ if (evsel_cpus == NULL)
+ return -1;
+
+ for (i = 0; i < evsel_cpus->nr; i++) {
+ if (cpu == evsel_cpus->map[i])
+ return i;
+ }
+ return -1;
+}
+
static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
{
struct perf_evsel *leader = evsel->leader;
diff --git a/tools/perf/util/evsel.h b/tools/perf/util/evsel.h
index e486151..5e9d347 100644
--- a/tools/perf/util/evsel.h
+++ b/tools/perf/util/evsel.h
@@ -8,6 +8,7 @@
#include <linux/types.h>
#include "xyarray.h"
#include "symbol.h"
+#include "cpumap.h"

struct perf_counts_values {
union {
@@ -366,4 +367,6 @@ typedef int (*attr__fprintf_f)(FILE *, const char *, const char *, void *);
int perf_event_attr__fprintf(FILE *fp, struct perf_event_attr *attr,
attr__fprintf_f attr__fprintf, void *priv);

+int perf_evsel__get_cpumap_index(int cpu, struct cpu_map *evsel_cpus);
+
#endif /* __PERF_EVSEL_H */
--
1.8.3.1

2015-04-15 15:09:04

by Liang, Kan

Subject: [PATCH V2 4/6] perf,tools: check and re-organize evsel cpu maps

From: Kan Liang <[email protected]>

"-C" can be used to set cpu list. The cpu list may be incompatible with
the event's cpumask. This patch discards the incompatible cpus; only
available cpus are stored in evsel->cpus->map.
If no cpu from the cpu list is compatible with the event's cpumask, it
will error out.
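
As an illustration, the result is the intersection of the user's cpu list
with the event's cpumask. A much-simplified sketch of that idea (a
hypothetical standalone helper, not the in-place algorithm used by
perf_evlist__check_evsel_cpus below) could look like:

/*
 * Keep only the cpus that appear in both the user's "-C" list and the
 * PMU's cpumask. Illustrative only.
 */
int intersect_cpus(const int *user, int nr_user,
                   const int *pmu, int nr_pmu, int *out)
{
        int i, j, nr = 0;

        for (i = 0; i < nr_pmu; i++)
                for (j = 0; j < nr_user; j++)
                        if (pmu[i] == user[j])
                                out[nr++] = pmu[i];

        return nr;      /* 0 means no compatible cpu: error out */
}

For example, with "-C 0-4" and an uncore cpumask of 0, the evsel map would
become {0}; with "-C 1-4" it would be empty and map creation fails.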

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/builtin-top.c | 6 ++--
tools/perf/util/evlist.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 75 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 1cb3436..5aba1b2 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -1220,9 +1220,6 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
if (target__none(target))
target->system_wide = true;

- if (perf_evlist__create_maps(top.evlist, target) < 0)
- usage_with_options(top_usage, options);
-
if (!top.evlist->nr_entries &&
perf_evlist__add_default(top.evlist) < 0) {
ui__error("Not enough memory for event selector list\n");
@@ -1231,6 +1228,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)

symbol_conf.nr_events = top.evlist->nr_entries;

+ if (perf_evlist__create_maps(top.evlist, target) < 0)
+ usage_with_options(top_usage, options);
+
if (top.delay_secs < 1)
top.delay_secs = 1;

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index a065150..16319b4 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1068,6 +1068,74 @@ int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages,
return perf_evlist__mmap_per_cpu(evlist, &mp);
}

+static int cmp_ids(const void *a, const void *b)
+{
+ return *(int *)a - *(int *)b;
+}
+
+/*
+ * Check evsel cpu map according to pmu cpumask and input
+ * Only available cpu can be stored in evsel->cpus->map.
+ */
+static int perf_evlist__check_evsel_cpus(struct perf_evlist *evlist)
+{
+ const struct cpu_map *cpus = evlist->cpus;
+ const int ncpus = cpu_map__nr(evlist->cpus);
+ struct perf_evsel *evsel;
+ int i, j, cpu_nr, tmp;
+
+ /* ensure we process id in increasing order */
+ qsort(evlist->cpus->map, evlist->cpus->nr, sizeof(int), cmp_ids);
+
+ evlist__for_each(evlist, evsel) {
+ if (evsel->cpus == cpus)
+ continue;
+
+ cpu_nr = 0;
+ j = 0;
+ for (i = 0; i < cpu_map__nr(evsel->cpus);) {
+
+ if (j >= ncpus) {
+ evsel->cpus->map[i++] = -1;
+ continue;
+ }
+ for (; j < ncpus; j++) {
+ if (cpus->map[j] < evsel->cpus->map[i])
+ continue;
+ if (cpus->map[j] == evsel->cpus->map[i]) {
+ cpu_nr++;
+ j++;
+ i++;
+ } else
+ evsel->cpus->map[i++] = -1;
+ break;
+ }
+ }
+
+ if (cpu_nr == cpu_map__nr(evsel->cpus))
+ continue;
+ if (cpu_nr == 0) {
+ perror("failed to create CPUs map, please check cpumask");
+ return -1;
+ }
+
+ tmp = 0;
+ for (i = 0; i < cpu_nr; i++) {
+ if (evsel->cpus->map[i] == -1) {
+ while (evsel->cpus->map[tmp] == -1) {
+ tmp++;
+ BUG_ON(tmp >= cpu_map__nr(evsel->cpus));
+ }
+ evsel->cpus->map[i] = evsel->cpus->map[tmp];
+ evsel->cpus->map[tmp] = -1;
+ }
+ tmp++;
+ }
+ evsel->cpus->nr = cpu_nr;
+ }
+ return 0;
+}
+
int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
{
struct cpu_map *cpus;
@@ -1088,6 +1156,10 @@ int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)

perf_evlist__set_cpus(evlist, cpus);

+ if (target->cpu_list &&
+ (perf_evlist__check_evsel_cpus(evlist) < 0))
+ goto out_delete_threads;
+
return 0;

out_delete_threads:
--
1.8.3.1

2015-04-15 15:08:53

by Liang, Kan

Subject: [PATCH V2 5/6] perf,tools: open/mmap event uses event's cpu map

From: Kan Liang <[email protected]>

In perf_evlist__mmap, the leader's fd has to be mmaped before the members'
fds, so evlist__for_each must be the outermost loop.
Since different events' cpu lists can vary, we cannot rely on the index
to get the group leader's fd. In get_group_fd, the cpu id is used to get
the correct fd.

Signed-off-by: Kan Liang <[email protected]>
---
tools/perf/util/evlist.c | 104 +++++++++++++++++++++++++++--------------------
tools/perf/util/evsel.c | 28 +++++++++----
2 files changed, 80 insertions(+), 52 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 16319b4..637fcf4 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -834,51 +834,48 @@ static int __perf_evlist__mmap(struct perf_evlist *evlist, int idx,
return 0;
}

-static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist, int idx,
- struct mmap_params *mp, int cpu,
- int thread, int *output)
+static int perf_evlist__mmap_per_evsel(struct perf_evlist *evlist,
+ struct perf_evsel *evsel,
+ int idx, struct mmap_params *mp,
+ int cpu, int thread, int *output)
{
- struct perf_evsel *evsel;
+ int fd;

- evlist__for_each(evlist, evsel) {
- int fd;
+ if (evsel->system_wide && thread)
+ return 0;

- if (evsel->system_wide && thread)
- continue;
+ fd = FD(evsel, cpu, thread);

- fd = FD(evsel, cpu, thread);
+ if (*output == -1) {
+ *output = fd;
+ if (__perf_evlist__mmap(evlist, idx, mp, *output) < 0)
+ return -1;
+ } else {
+ if (ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, *output) != 0)
+ return -1;

- if (*output == -1) {
- *output = fd;
- if (__perf_evlist__mmap(evlist, idx, mp, *output) < 0)
- return -1;
- } else {
- if (ioctl(fd, PERF_EVENT_IOC_SET_OUTPUT, *output) != 0)
- return -1;
+ perf_evlist__mmap_get(evlist, idx);
+ }

- perf_evlist__mmap_get(evlist, idx);
- }
+ /*
+ * The system_wide flag causes a selected event to be opened
+ * always without a pid. Consequently it will never get a
+ * POLLHUP, but it is used for tracking in combination with
+ * other events, so it should not need to be polled anyway.
+ * Therefore don't add it for polling.
+ */
+ if (!evsel->system_wide &&
+ __perf_evlist__add_pollfd(evlist, fd, idx) < 0) {
+ perf_evlist__mmap_put(evlist, idx);
+ return -1;
+ }

- /*
- * The system_wide flag causes a selected event to be opened
- * always without a pid. Consequently it will never get a
- * POLLHUP, but it is used for tracking in combination with
- * other events, so it should not need to be polled anyway.
- * Therefore don't add it for polling.
- */
- if (!evsel->system_wide &&
- __perf_evlist__add_pollfd(evlist, fd, idx) < 0) {
- perf_evlist__mmap_put(evlist, idx);
+ if (evsel->attr.read_format & PERF_FORMAT_ID) {
+ if (perf_evlist__id_add_fd(evlist, evsel, cpu, thread,
+ fd) < 0)
return -1;
- }
-
- if (evsel->attr.read_format & PERF_FORMAT_ID) {
- if (perf_evlist__id_add_fd(evlist, evsel, cpu, thread,
- fd) < 0)
- return -1;
- perf_evlist__set_sid_idx(evlist, evsel, idx, cpu,
- thread);
- }
+ perf_evlist__set_sid_idx(evlist, evsel, idx, cpu,
+ thread);
}

return 0;
@@ -890,23 +887,37 @@ static int perf_evlist__mmap_per_cpu(struct perf_evlist *evlist,
int cpu, thread;
int nr_cpus = cpu_map__nr(evlist->cpus);
int nr_threads = thread_map__nr(evlist->threads);
+ int *output = malloc(nr_cpus * sizeof(int));
+ int evlist_cpu;
+ struct perf_evsel *evsel;

pr_debug2("perf event ring buffer mmapped per cpu\n");
- for (cpu = 0; cpu < nr_cpus; cpu++) {
- int output = -1;

- for (thread = 0; thread < nr_threads; thread++) {
- if (perf_evlist__mmap_per_evsel(evlist, cpu, mp, cpu,
- thread, &output))
+ for (cpu = 0; cpu < nr_cpus; cpu++)
+ output[cpu] = -1;
+
+ evlist__for_each(evlist, evsel) {
+ for (cpu = 0; cpu < cpu_map__nr(evsel->cpus); cpu++) {
+ evlist_cpu = perf_evsel__get_cpumap_index(evsel->cpus->map[cpu], evlist->cpus);
+ if (evlist_cpu < 0)
goto out_unmap;
+
+ for (thread = 0; thread < nr_threads; thread++) {
+ if (perf_evlist__mmap_per_evsel(evlist, evsel, evlist_cpu,
+ mp, cpu, thread,
+ &output[evlist_cpu]))
+ goto out_unmap;
+ }
}
}

+ free(output);
return 0;

out_unmap:
for (cpu = 0; cpu < nr_cpus; cpu++)
__perf_evlist__munmap(evlist, cpu);
+ free(output);
return -1;
}

@@ -915,14 +926,17 @@ static int perf_evlist__mmap_per_thread(struct perf_evlist *evlist,
{
int thread;
int nr_threads = thread_map__nr(evlist->threads);
+ struct perf_evsel *evsel;

pr_debug2("perf event ring buffer mmapped per thread\n");
for (thread = 0; thread < nr_threads; thread++) {
int output = -1;

- if (perf_evlist__mmap_per_evsel(evlist, thread, mp, 0, thread,
- &output))
- goto out_unmap;
+ evlist__for_each(evlist, evsel) {
+ if (perf_evlist__mmap_per_evsel(evlist, evsel, thread,
+ mp, 0, thread, &output))
+ goto out_unmap;
+ }
}

return 0;
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index e177f43..44a663c 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1008,7 +1008,7 @@ int perf_evsel__get_cpumap_index(int cpu, struct cpu_map *evsel_cpus)
static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
{
struct perf_evsel *leader = evsel->leader;
- int fd;
+ int fd, leader_cpu;

if (perf_evsel__is_group_leader(evsel))
return -1;
@@ -1019,7 +1019,15 @@ static int get_group_fd(struct perf_evsel *evsel, int cpu, int thread)
*/
BUG_ON(!leader->fd);

- fd = FD(leader, cpu, thread);
+ if (cpu < 0)
+ fd = FD(leader, 0, thread);
+ else {
+ leader_cpu = perf_evsel__get_cpumap_index(cpu, leader->cpus);
+ if (leader_cpu >= 0)
+ fd = FD(leader, leader_cpu, thread);
+ else
+ return -1;
+ }
BUG_ON(fd == -1);

return fd;
@@ -1151,15 +1159,21 @@ static int __perf_evsel__open(struct perf_evsel *evsel, struct cpu_map *cpus,
int cpu, thread, nthreads;
unsigned long flags = PERF_FLAG_FD_CLOEXEC;
int pid = -1, err;
+ struct cpu_map *cpumap;
enum { NO_CHANGE, SET_TO_MAX, INCREASED_MAX } set_rlimit = NO_CHANGE;

+ if (evsel->cpus)
+ cpumap = evsel->cpus;
+ else
+ cpumap = cpus;
+
if (evsel->system_wide)
nthreads = 1;
else
nthreads = threads->nr;

if (evsel->fd == NULL &&
- perf_evsel__alloc_fd(evsel, cpus->nr, nthreads) < 0)
+ perf_evsel__alloc_fd(evsel, cpumap->nr, nthreads) < 0)
return -ENOMEM;

if (evsel->cgrp) {
@@ -1191,7 +1205,7 @@ retry_sample_id:
fprintf(stderr, "%.60s\n", graph_dotted_line);
}

- for (cpu = 0; cpu < cpus->nr; cpu++) {
+ for (cpu = 0; cpu < cpumap->nr; cpu++) {

for (thread = 0; thread < nthreads; thread++) {
int group_fd;
@@ -1199,14 +1213,14 @@ retry_sample_id:
if (!evsel->cgrp && !evsel->system_wide)
pid = threads->map[thread];

- group_fd = get_group_fd(evsel, cpu, thread);
+ group_fd = get_group_fd(evsel, cpumap->map[cpu], thread);
retry_open:
pr_debug2("sys_perf_event_open: pid %d cpu %d group_fd %d flags %#lx\n",
- pid, cpus->map[cpu], group_fd, flags);
+ pid, cpumap->map[cpu], group_fd, flags);

FD(evsel, cpu, thread) = sys_perf_event_open(&evsel->attr,
pid,
- cpus->map[cpu],
+ cpumap->map[cpu],
group_fd, flags);
if (FD(evsel, cpu, thread) < 0) {
err = -errno;
--
1.8.3.1

2015-04-15 15:09:18

by Liang, Kan

Subject: [PATCH V2 6/6] perf/x86/intel/uncore: do not implicitly set uncore event cpu

From: Kan Liang <[email protected]>

There is a cpumask exposed in the uncore pmu sysfs directory. The user
should set the cpu according to that cpumask. The kernel should not
implicitly change event->cpu.

Signed-off-by: Kan Liang <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_uncore.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
index c635b8b..cd80731 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
@@ -621,9 +621,8 @@ static int uncore_pmu_event_init(struct perf_event *event)
if (event->cpu < 0)
return -EINVAL;
box = uncore_pmu_to_box(pmu, event->cpu);
- if (!box || box->cpu < 0)
+ if (!box || box->cpu < 0 || (box->cpu != event->cpu))
return -EINVAL;
- event->cpu = box->cpu;

event->hw.idx = -1;
event->hw.last_tag = ~0ULL;
--
1.8.3.1

2015-04-15 16:15:55

by Peter Zijlstra

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Wed, Apr 15, 2015 at 03:56:11AM -0400, Kan Liang wrote:
> From: Kan Liang <[email protected]>
>
> The pmu marked as perf_invalid_context don't have any state to switch on
> context switch. Everything is global. So it is OK to be part of sw/hw
> groups.
> In sched_out/sched_in, del/add must be called, so the
> perf_invalid_context event can be disabled/enabled accordingly during
> context switch. The event count only be read when the event is already
> sched_in.
>
> However group read doesn't work with mix events.
>
> For example,
> perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> It always gets EINVAL.
>
> This patch set intends to fix this issue.
> perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.202 MB perf.data (12 samples) ]
>
> This patch special case invalid context events and allow them to be part
> of sw/hw groups.

I don't get it. What, Why?

2015-04-15 16:21:23

by Andi Kleen

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Wed, Apr 15, 2015 at 06:15:28PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 15, 2015 at 03:56:11AM -0400, Kan Liang wrote:
> > From: Kan Liang <[email protected]>
> >
> > The pmu marked as perf_invalid_context don't have any state to switch on
> > context switch. Everything is global. So it is OK to be part of sw/hw
> > groups.
> > In sched_out/sched_in, del/add must be called, so the
> > perf_invalid_context event can be disabled/enabled accordingly during
> > context switch. The event count only be read when the event is already
> > sched_in.
> >
> > However group read doesn't work with mix events.
> >
> > For example,
> > perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> > It always gets EINVAL.
> >
> > This patch set intends to fix this issue.
> > perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.202 MB perf.data (12 samples) ]
> >
> > This patch special case invalid context events and allow them to be part
> > of sw/hw groups.
>
> I don't get it. What, Why?

Without the patch you can't mix uncore and cpu core events in the same
group.

Collecting uncore in PMIs is useful, for example to get memory
bandwidth over time.

-Andi

--
[email protected] -- Speaking for myself only.

2015-04-15 16:56:00

by Peter Zijlstra

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Wed, Apr 15, 2015 at 06:21:11PM +0200, Andi Kleen wrote:
> On Wed, Apr 15, 2015 at 06:15:28PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 15, 2015 at 03:56:11AM -0400, Kan Liang wrote:
> > > From: Kan Liang <[email protected]>
> > >
> > > The pmu marked as perf_invalid_context don't have any state to switch on
> > > context switch. Everything is global. So it is OK to be part of sw/hw
> > > groups.
> > > In sched_out/sched_in, del/add must be called, so the
> > > perf_invalid_context event can be disabled/enabled accordingly during
> > > context switch. The event count only be read when the event is already
> > > sched_in.
> > >
> > > However group read doesn't work with mix events.
> > >
> > > For example,
> > > perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> > > It always gets EINVAL.
> > >
> > > This patch set intends to fix this issue.
> > > perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> > > [ perf record: Woken up 1 times to write data ]
> > > [ perf record: Captured and wrote 0.202 MB perf.data (12 samples) ]
> > >
> > > This patch special case invalid context events and allow them to be part
> > > of sw/hw groups.
> >
> > I don't get it. What, Why?
>
> Without the patch you can't mix uncore and cpu core events in the same
> group.
>
> Collecting uncore in PMIs is useful, for example to get memory
> bandwidth over time.

Well, start with a coherent changelog, why do you still think those are
optional?

2015-04-15 17:29:11

by Peter Zijlstra

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Wed, Apr 15, 2015 at 03:56:11AM -0400, Kan Liang wrote:
> The event count only be read when the event is already
> sched_in.

Yeah, so no. This breaks what groups are. Group events _must_ be
co-scheduled. You cannot guarantee you can schedule events from another
PMU.

Also, I cannot see how this can possibly work, you cannot put these
things on the same event_context.

Also, how does this work wrt cpumasks, regular events are per cpu,
uncore events are per node.

There is so much broken stuff here without explanation its not funny.

Please explain how this does not completely wreck everything?

2015-04-16 14:53:58

by Liang, Kan

Subject: RE: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups



>
> On Wed, Apr 15, 2015 at 03:56:11AM -0400, Kan Liang wrote:
> > The event count only be read when the event is already sched_in.
>
> Yeah, so no. This breaks what groups are. Group events _must_ be co-
> scheduled. You cannot guarantee you can schedule events from another
> PMU.

Why? I think it's similar to a sw and hw mixed event group.
group_sched_in will co-schedule all events in the sibling_list.
Invalid context events are already added to the sibling_list in
perf_install_in_context.
So all group events will be scheduled together.
event->pmu guarantees the proper add is called.
For invalid context events, everything is global. They never fail to schedule.

>
> Also, I cannot see how this can possibly work, you cannot put these things
> on the same event_context.

Why not? For a sw and hw mixed event group, they use the same
hw event_context.

Actually, invalid context events don't have any state to switch
on context switch. They can attach to any event_context.

>
> Also, how does this work wrt cpumasks, regular events are per cpu, uncore
> events are per node.

Currently, uncore events are only available on the first cpu of the node.
So if it's a hw and uncore mixed group, only cpu 0 does the group.
On the other CPUs, there will only be the hw event.
The perf tool will guarantee it. Here are some examples.

[perf]$ sudo perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -C 0-4 -vvv -- sleep 1
------------------------------------------------------------
perf_event_attr:
size 112
{ sample_period, sample_freq } 4000
sample_type IP|TID|TIME|READ|ID|CPU|PERIOD
read_format ID|GROUP
disabled 1
mmap 1
comm 1
freq 1
task 1
sample_id_all 1
exclude_guest 1
mmap2 1
comm_exec 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd -1 flags 0x8
sys_perf_event_open: pid -1 cpu 1 group_fd -1 flags 0x8
sys_perf_event_open: pid -1 cpu 2 group_fd -1 flags 0x8
sys_perf_event_open: pid -1 cpu 3 group_fd -1 flags 0x8
sys_perf_event_open: pid -1 cpu 4 group_fd -1 flags 0x8
------------------------------------------------------------
perf_event_attr:
type 15
size 112
config 304
sample_type IP|TID|TIME|READ|ID|CPU|PERIOD
read_format ID|GROUP
freq 1
sample_id_all 1
exclude_guest 1
------------------------------------------------------------
sys_perf_event_open: pid -1 cpu 0 group_fd 3 flags 0x8
mmap size 528384B
perf event ring buffer mmapped per cpu



[perf]$ sudo perf stat -e '{cycles,uncore_imc_0/cas_count_read/}' -C 0-4 --per-core -- sleep 1

Performance counter stats for 'CPU(s) 0-4':

S0-C0 1 1367712 cycles (100.00%)
S0-C0 1 0.17 MiB uncore_imc_0/cas_count_read/
S0-C1 1 982923 cycles
S0-C1 0 <not counted> MiB uncore_imc_0/cas_count_read/
S0-C2 1 958585 cycles
S0-C2 0 <not counted> MiB uncore_imc_0/cas_count_read/
S0-C3 1 978161 cycles
S0-C3 0 <not counted> MiB uncore_imc_0/cas_count_read/
S0-C4 1 976012 cycles
S0-C4 0 <not counted> MiB uncore_imc_0/cas_count_read/

1.001151765 seconds time elapsed


>
> There is so much broken stuff here without explanation its not funny.
>
> Please explain how this does not completely wreck everything?

OK. I will refine the description once the questions above are addressed.

Thanks,
Kan

2015-04-16 16:32:07

by Mark Rutland

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

Hi,

If you're going to fundamentally change the behaviour of
perf_invalid_context, please Cc authors of other system PMU drivers.
Intel aren't the only ones with such PMUs.

For instance, this affects the ARM CCI and CCN PMU drivers.

On Wed, Apr 15, 2015 at 08:56:11AM +0100, Kan Liang wrote:
> From: Kan Liang <[email protected]>
>
> The pmu marked as perf_invalid_context don't have any state to switch on
> context switch. Everything is global. So it is OK to be part of sw/hw
> groups.
> In sched_out/sched_in, del/add must be called, so the
> perf_invalid_context event can be disabled/enabled accordingly during
> context switch. The event count only be read when the event is already
> sched_in.
>
> However group read doesn't work with mix events.
>
> For example,
> perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> It always gets EINVAL.

From my PoV that makes sense. One is CPU-affine, the other is not, and
the two cannot be scheduled in the same PMU transaction by the nature of
the hardware. Fundamentally, you cannot provide group semantics due to
this.

Even if you ignore the fundamental semantics of groups, there are other
problems with allowing shared contexts:

* The *_txn functions only get called on the group leader's PMU. If your
system PMU has these functions, they are not called.

* Event rotation is per ctx, but now you could have some events in a CPU
PMU's context, and some in the uncore PMU's context. So those can race
with each other.

* Throttling is also per-context. So those can race with each other too.

> This patch set intends to fix this issue.
> perf record -e '{cycles,uncore_imc_0/cas_count_read/}:S' -a sleep 1
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.202 MB perf.data (12 samples) ]

You can already count the events concurrently without grouping them, and
the above implies that this patch just ends up misleading the user
w.r.t. group semantics.

If you want to be able to sample the events with a single read, then you
can attach the FDs.

I don't see that this solves a real problem. I see that it introduces a
new set of problems in addition to complicating existing code.

Mark.

2015-04-16 16:33:27

by Mark Rutland

Subject: Re: [PATCH V2 4/6] perf,tools: check and re-organize evsel cpu maps

On Wed, Apr 15, 2015 at 08:56:14AM +0100, Kan Liang wrote:
> From: Kan Liang <[email protected]>
>
> "-C" can be used to set cpu list. The cpu list may be incompatible with
> event's cpumask. This patch discard the incompatible cpu. Only available
> cpu can be stored in evsel->cpus->map.
> If there is no cpu from cpu list compatible with event's cpumask. It
> will error out.

So we do something fundamentally different to what the user asked for,
silently?

Mark.

>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> tools/perf/builtin-top.c | 6 ++--
> tools/perf/util/evlist.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 75 insertions(+), 3 deletions(-)
>
> diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
> index 1cb3436..5aba1b2 100644
> --- a/tools/perf/builtin-top.c
> +++ b/tools/perf/builtin-top.c
> @@ -1220,9 +1220,6 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
> if (target__none(target))
> target->system_wide = true;
>
> - if (perf_evlist__create_maps(top.evlist, target) < 0)
> - usage_with_options(top_usage, options);
> -
> if (!top.evlist->nr_entries &&
> perf_evlist__add_default(top.evlist) < 0) {
> ui__error("Not enough memory for event selector list\n");
> @@ -1231,6 +1228,9 @@ int cmd_top(int argc, const char **argv, const char *prefix __maybe_unused)
>
> symbol_conf.nr_events = top.evlist->nr_entries;
>
> + if (perf_evlist__create_maps(top.evlist, target) < 0)
> + usage_with_options(top_usage, options);
> +
> if (top.delay_secs < 1)
> top.delay_secs = 1;
>
> diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
> index a065150..16319b4 100644
> --- a/tools/perf/util/evlist.c
> +++ b/tools/perf/util/evlist.c
> @@ -1068,6 +1068,74 @@ int perf_evlist__mmap(struct perf_evlist *evlist, unsigned int pages,
> return perf_evlist__mmap_per_cpu(evlist, &mp);
> }
>
> +static int cmp_ids(const void *a, const void *b)
> +{
> + return *(int *)a - *(int *)b;
> +}
> +
> +/*
> + * Check evsel cpu map according to pmu cpumask and input
> + * Only available cpu can be stored in evsel->cpus->map.
> + */
> +static int perf_evlist__check_evsel_cpus(struct perf_evlist *evlist)
> +{
> + const struct cpu_map *cpus = evlist->cpus;
> + const int ncpus = cpu_map__nr(evlist->cpus);
> + struct perf_evsel *evsel;
> + int i, j, cpu_nr, tmp;
> +
> + /* ensure we process id in increasing order */
> + qsort(evlist->cpus->map, evlist->cpus->nr, sizeof(int), cmp_ids);
> +
> + evlist__for_each(evlist, evsel) {
> + if (evsel->cpus == cpus)
> + continue;
> +
> + cpu_nr = 0;
> + j = 0;
> + for (i = 0; i < cpu_map__nr(evsel->cpus);) {
> +
> + if (j >= ncpus) {
> + evsel->cpus->map[i++] = -1;
> + continue;
> + }
> + for (; j < ncpus; j++) {
> + if (cpus->map[j] < evsel->cpus->map[i])
> + continue;
> + if (cpus->map[j] == evsel->cpus->map[i]) {
> + cpu_nr++;
> + j++;
> + i++;
> + } else
> + evsel->cpus->map[i++] = -1;
> + break;
> + }
> + }
> +
> + if (cpu_nr == cpu_map__nr(evsel->cpus))
> + continue;
> + if (cpu_nr == 0) {
> + perror("failed to create CPUs map, please check cpumask");
> + return -1;
> + }
> +
> + tmp = 0;
> + for (i = 0; i < cpu_nr; i++) {
> + if (evsel->cpus->map[i] == -1) {
> + while (evsel->cpus->map[tmp] == -1) {
> + tmp++;
> + BUG_ON(tmp >= cpu_map__nr(evsel->cpus));
> + }
> + evsel->cpus->map[i] = evsel->cpus->map[tmp];
> + evsel->cpus->map[tmp] = -1;
> + }
> + tmp++;
> + }
> + evsel->cpus->nr = cpu_nr;
> + }
> + return 0;
> +}
> +
> int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
> {
> struct cpu_map *cpus;
> @@ -1088,6 +1156,10 @@ int perf_evlist__create_maps(struct perf_evlist *evlist, struct target *target)
>
> perf_evlist__set_cpus(evlist, cpus);
>
> + if (target->cpu_list &&
> + (perf_evlist__check_evsel_cpus(evlist) < 0))
> + goto out_delete_threads;
> +
> return 0;
>
> out_delete_threads:
> --
> 1.8.3.1
>

2015-04-16 16:37:06

by Mark Rutland

Subject: Re: [PATCH V2 6/6] perf/x86/intel/uncore: do not implicitly set uncore event cpu

On Wed, Apr 15, 2015 at 08:56:16AM +0100, Kan Liang wrote:
> From: Kan Liang <[email protected]>
>
> There is cpumask exposed to the uncore pmu sysfs directory. User should
> set the cpu according to the cpumask. Kernel should not implicitly
> change the event->cpu.
>
> Signed-off-by: Kan Liang <[email protected]>
> ---
> arch/x86/kernel/cpu/perf_event_intel_uncore.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.c b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
> index c635b8b..cd80731 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_uncore.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.c
> @@ -621,9 +621,8 @@ static int uncore_pmu_event_init(struct perf_event *event)
> if (event->cpu < 0)
> return -EINVAL;
> box = uncore_pmu_to_box(pmu, event->cpu);
> - if (!box || box->cpu < 0)
> + if (!box || box->cpu < 0 || (box->cpu != event->cpu))
> return -EINVAL;
> - event->cpu = box->cpu;

There are no existing users relying on this behaviour?

If hotplug occurs between userspace reading the mask and trying to open
the event, it will get -EINVAL back, for no apparent reason, where
previously it would have been silently fixed up. Given the event
migration logic, the old behaviour was more consistent in allowing the
user to not care about hotplug.

Mark.

2015-04-16 16:44:08

by Mark Rutland

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

> Even if you ignore the fundamental semantics of groups, there are other
> problems with allowing shared contexts:
>
> * The *_txn functions only get called on the group leader's PMU. If your
> system PMU has these functions, they are not called.
>
> * Event rotation is per ctx, but now you could have some events in a CPU
> PMU's context, and some in the uncore PMU's context. So those can race
> with each other.
>
> * Throttling is also per-context. So those can race with each other too.

There's also a break down of behaviour: events in the uncore context
will get migrated to another CPU in the event of a hot unplug, while
events that are grouped with CPU events (and hence live in the CPU
context) will be destroyed.

Mark.

2015-04-16 17:17:05

by Mark Rutland

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Thu, Apr 16, 2015 at 05:31:40PM +0100, Mark Rutland wrote:
> Hi,
>
> If you're going to fundamentally change the behaviour of
> perf_invalid_context, please Cc authors of other system PMU drivers.
> Intel aren't the only ones with such PMUs.
>
> For instance, this affects the ARM CCI and CCN PMU drivers.

As an aside, is there any chance of a [email protected] list?

My work email becomes a little useless when subscribed to the full LKML
firehose, so while I'd like to keep up with perf core changes I don't
subscribe. In this case I stumbled across these patches by chance when
browsing an archive, then had to have the messages bounced to my
account.

Having a dedicated list would make things easier for myself, at least.

Mark.

2015-04-16 21:23:51

by Andi Kleen

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

> From my PoV that makes sense. One is CPU-affine, the other is not, and
> the two cannot be scheduled in the same PMU transaction by the nature of
> the hardware. Fundamentally, you cannot provide group semantics due to
> this.

Actually you can. Just use it like a free running counter, and the
different groups sample it. This will work from the different CPUs,
as long as the event is the same everywhere.

The implementation may not be quite right yet, but the basic concept
should work, and is useful.

-Andi

2015-04-17 09:47:56

by Mark Rutland

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Thu, Apr 16, 2015 at 10:23:42PM +0100, Andi Kleen wrote:
> > From my PoV that makes sense. One is CPU-affine, the other is not, and
> > the two cannot be scheduled in the same PMU transaction by the nature of
> > the hardware. Fundamentally, you cannot provide group semantics due to
> > this.
>
> Actually you can. Just use it like a free running counter, and the
> different groups sample it. This will work from the different CPUs,
> as long as the event is the same everywhere.

... which would give you arbitrary skew, because one counter is
free-running and the other is not (when scheduling a context in or out we stop
the PMU).

From my PoV that violates group semantics, because now the events aren't
always counting at the same time (which would be the reason I grouped
them in the first place).

> The implemention may not be quite right yet, but the basic concept
> should work, and is useful.

I can see that associating counts from different PMUs at points in time
may be useful, even if they aren't sampled at the precise same time, and
you have weaker guarantees than the current group semantics.

However, it is the case that you cannot offer group semantics.

Mark.

2015-04-18 00:47:09

by Andi Kleen

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

> ... which would give you arbitrary skew, because one counter is
> free-running and the other is not (when scheduling a context in or out we stop
> the PMU)

Everyone just reads the counter and subtracts it from
the last value they've seen.

That's the same as how any other shared free running counter works,
such as the standard timer.

The only thing that perf needs to enforce is that the counters
are running with the same event.

It also wouldn't work for sampling, but the uncore doesn't do
sampling anyways.
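
A rough sketch of that reading scheme (names here are illustrative, not
perf API): each reader remembers the last raw value it saw of the shared
free running counter and only ever consumes differences.

#include <stdint.h>

struct reader {
        uint64_t last_seen;     /* raw counter value at the previous read */
};

/* 'now' is the current raw (monotonically increasing) counter value */
uint64_t consume_delta(struct reader *r, uint64_t now)
{
        uint64_t delta = now - r->last_seen;    /* u64 wraparound is fine */

        r->last_seen = now;
        return delta;   /* what accumulated since this reader last looked */
}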

> From my PoV that violates group semantics, because now the events aren't
> always counting at the same time (which would be the reason I grouped
> them in the first place).

You never use the absolute value, just differences. The differences
effectively count only when the group runs.

> However, it is the case that you cannot offer group semantics.

I don't think that's true.

-Andi

2015-04-20 10:16:03

by Mark Rutland

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

On Sat, Apr 18, 2015 at 01:47:06AM +0100, Andi Kleen wrote:
> > ... which would give you arbitrary skew, because one counter is
> > free-running and the other is not (when scheduling a context in or out we stop
> > the PMU)
>
> Everyone just reads the counter and subtracts it from
> the last value they've seen.
>
> That's the same how any other shared free running counter work,
> such as the standard timer.
>
> The only thing that perf needs to enforce is that the counters
> are running with the same event.
>
> It also wouldn't work for sampling, but the uncore doesn't do
> sampling anyways.

If you don't care about sampling and only care about totals, then you
can just open the events concurrently *without* grouping them, as I
stated previously.

> > From my PoV that violates group semantics, because now the events aren't
> > always counting at the same time (which would be the reason I grouped
> > them in the first place).
>
> You never use the absolute value, just differences. The differences
> effectively count only when the group runs.

Except that the uncore PMU is counting during the periods the CPU PMU is
disabled (e.g. when it is being programmed, read, or written). There's a
race there that you cannot solve given the two are independent agents.

> > However, it is the case that you cannot offer group semantics.
>
> I don't think that's true.

I believe that it is. I also believe it doesn't matter since you don't
care about those semantics anyway. Given that you can just open
independent events, which is already possible.

Mark.

2015-04-20 16:57:26

by Andi Kleen

Subject: Re: [PATCH V2 1/6] perf,core: allow invalid context events to be part of sw/hw groups

> If you don't care about sampling and only care about totals, then you
> can just open the events concurrently *without* grouping them, as I
> stated previously.

perf record doesn't really support that. We need some group reader
that runs regularly. The best choice for the leader is a CPU sample event,
which also has the advantage that it won't disturb idle states.

> > > From my PoV that violates group semantics, because now the events aren't
> > > always counting at the same time (which would be the reason I grouped
> > > them in the first place).
> >
> > You never use the absolute value, just differences. The differences
> > effectively count only when the group runs.
>
> Except that the uncore PMU is counting during the periods the CPU PMU is
> disabled (e.g. when it is being programmed, read, or written). There's a
> race there that you cannot solve given the two are indepedent agents.

We get useful information, like

"memory bandwidth during the group run time".

You're right that formulas between different PMUs are problematic, though.
But in most cases time relation is good enough.

-Andi

--
[email protected] -- Speaking for myself only.