2014-11-07 16:26:26

by Mark Rutland

Subject: [PATCH 00/11] arm: perf: add support for heterogeneous PMUs

In systems with heterogeneous CPUs (e.g. big.LITTLE) the associated PMUs
also differ in terms of the supported set of events, the precise
behaviour of each of those events, and the number of event counters.
Thus it is not possible to expose these PMUs as a single logical PMU.

Instead a logical PMU is created per CPU microarchitecture, which events
can target directly:

$ perf stat \
-e armv7_cortex_a7/config=0x11/ \
-e armv7_cortex_a15/config=0x11/ \
./test

Performance counter stats for './test':

7980455 armv7_cortex_a7/config=0x11/ [27.29%]
9947934 armv7_cortex_a15/config=0x11/ [72.66%]

0.016734833 seconds time elapsed

This series is based atop of my recent preparatory rework [1,2].

Thanks,
Mark.

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-October/295820.html
[2] https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/log/?h=perf/updates

Mark Rutland (11):
of: Add empty of_get_next_parent stub
perf: allow for PMU-specific event filtering
arm: perf: treat PMUs as CPU affine
arm: perf: filter unschedulable events
arm: perf: reject multi-pmu groups
arm: perf: probe number of counters on affine CPUs
arm: perf: document PMU affinity binding
arm: perf: add functions to parse affinity from dt
arm: perf: parse cpu affinity from dt
arm: perf: remove singleton PMU restriction
arm: dts: vexpress: describe all PMUs in TC2 dts

Documentation/devicetree/bindings/arm/pmu.txt | 104 +++++++-
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 36 ++-
arch/arm/include/asm/pmu.h | 13 +
arch/arm/kernel/perf_event.c | 61 ++++-
arch/arm/kernel/perf_event_cpu.c | 356 +++++++++++++++++++++-----
arch/arm/kernel/perf_event_v7.c | 41 +--
include/linux/of.h | 5 +
include/linux/perf_event.h | 5 +
kernel/events/core.c | 8 +-
9 files changed, 534 insertions(+), 95 deletions(-)

--
1.9.1


2014-11-07 16:26:40

by Mark Rutland

Subject: [PATCH 01/11] of: Add empty of_get_next_parent stub

There's no stub version of of_get_next_parent, so its use in code built
with !CONFIG_OF will cause the kernel build to fail.

This patch adds a stub version of of_get_next_parent, as is done for
the other !CONFIG_OF stub functions, so that such code can compile
without CONFIG_OF.
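
As an illustrative sketch only (not part of this patch; the helper and
variable names are examples), a caller walking up towards the root of
the tree, much as a later patch in this series does, might look like:

#include <linux/of.h>

/*
 * Sketch of a caller that walks towards the root. With !CONFIG_OF this
 * only compiles because a stub of_get_next_parent() definition exists.
 */
static bool node_has_ancestor(struct device_node *node,
                              struct device_node *ancestor)
{
        struct device_node *np;

        /* of_get_next_parent drops np's refcount, so take one first */
        for (np = of_node_get(node); np; np = of_get_next_parent(np)) {
                if (np == ancestor) {
                        of_node_put(np);
                        return true;
                }
        }

        return false;
}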

Signed-off-by: Mark Rutland <[email protected]>
Cc: Rob Herring <[email protected]>
Cc: Grant Likely <[email protected]>
---
include/linux/of.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/include/linux/of.h b/include/linux/of.h
index 6545e7a..2d4b7e0 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -392,6 +392,11 @@ static inline struct device_node *of_get_parent(const struct device_node *node)
return NULL;
}

+static inline struct device_node *of_get_next_parent(struct device_node *node)
+{
+ return NULL;
+}
+
static inline struct device_node *of_get_next_child(
const struct device_node *node, struct device_node *prev)
{
--
1.9.1

2014-11-07 16:27:02

by Mark Rutland

Subject: [PATCH 03/11] arm: perf: treat PMUs as CPU affine

In multi-cluster systems, the PMUs can be different across clusters, and
so our logical PMU may not be able to schedule events on all CPUs.

This patch adds a cpumask to encode which CPUs a PMU driver supports
controlling events for, limiting the driver to scheduling events, and
to enabling and disabling the physical PMUs, on those CPUs only.
Currently the cpumask is set to match all CPUs.
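
As an illustrative sketch (not part of this patch; the pmu and mask
names are hypothetical), a heterogeneous system would eventually
restrict each logical PMU's mask to the CPUs of one cluster, with later
patches in the series deriving this from the DT:

/*
 * Sketch only: restrict each logical PMU to its own cluster's CPUs.
 * The *_pmu and *_cpus names are examples; later patches populate
 * supported_cpus from devicetree affinity data.
 */
cpumask_copy(&a15_pmu->supported_cpus, &a15_cluster_cpus);
cpumask_copy(&a7_pmu->supported_cpus, &a7_cluster_cpus);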

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/include/asm/pmu.h | 1 +
arch/arm/kernel/perf_event.c | 25 +++++++++++++++++++++++++
arch/arm/kernel/perf_event_cpu.c | 10 +++++++++-
3 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b1596bd..b630a44 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -92,6 +92,7 @@ struct pmu_hw_events {
struct arm_pmu {
struct pmu pmu;
cpumask_t active_irqs;
+ cpumask_t supported_cpus;
char *name;
irqreturn_t (*handle_irq)(int irq_num, void *dev);
void (*enable)(struct perf_event *event);
diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index e34934f..9ad21ab 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -11,6 +11,7 @@
*/
#define pr_fmt(fmt) "hw perfevents: " fmt

+#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/platform_device.h>
#include <linux/pm_runtime.h>
@@ -223,6 +224,10 @@ armpmu_add(struct perf_event *event, int flags)
int idx;
int err = 0;

+ /* An event following a process won't be stopped earlier */
+ if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
+ return -ENOENT;
+
perf_pmu_disable(event->pmu);

/* If we don't have a space for the counter then finish early. */
@@ -439,6 +444,17 @@ static int armpmu_event_init(struct perf_event *event)
int err = 0;
atomic_t *active_events = &armpmu->active_events;

+ /*
+ * Reject CPU-affine events for CPUs that are of a different class to
+ * that which this PMU handles. Process-following events (where
+ * event->cpu == -1) can be migrated between CPUs, and thus we have to
+ * reject them later (in armpmu_add) if they're scheduled on a
+ * different class of CPU.
+ */
+ if (event->cpu != -1 &&
+ !cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
+ return -ENOENT;
+
/* does not support taken branch sampling */
if (has_branch_stack(event))
return -EOPNOTSUPP;
@@ -474,6 +490,10 @@ static void armpmu_enable(struct pmu *pmu)
struct pmu_hw_events *hw_events = this_cpu_ptr(armpmu->hw_events);
int enabled = bitmap_weight(hw_events->used_mask, armpmu->num_events);

+ /* For task-bound events we may be called on other CPUs */
+ if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
+ return;
+
if (enabled)
armpmu->start(armpmu);
}
@@ -481,6 +501,11 @@ static void armpmu_enable(struct pmu *pmu)
static void armpmu_disable(struct pmu *pmu)
{
struct arm_pmu *armpmu = to_arm_pmu(pmu);
+
+ /* For task-bound events we may be called on other CPUs */
+ if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
+ return;
+
armpmu->stop(armpmu);
}

diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index 59c0642..ce35149 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -169,11 +169,15 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler)
static int cpu_pmu_notify(struct notifier_block *b, unsigned long action,
void *hcpu)
{
+ int cpu = (unsigned long)hcpu;
struct arm_pmu *pmu = container_of(b, struct arm_pmu, hotplug_nb);

if ((action & ~CPU_TASKS_FROZEN) != CPU_STARTING)
return NOTIFY_DONE;

+ if (!cpumask_test_cpu(cpu, &pmu->supported_cpus))
+ return NOTIFY_DONE;
+
if (pmu->reset)
pmu->reset(pmu);
else
@@ -209,7 +213,8 @@ static int cpu_pmu_init(struct arm_pmu *cpu_pmu)

/* Ensure the PMU has sane values out of reset. */
if (cpu_pmu->reset)
- on_each_cpu(cpu_pmu->reset, cpu_pmu, 1);
+ on_each_cpu_mask(&cpu_pmu->supported_cpus, cpu_pmu->reset,
+ cpu_pmu, 1);

/* If no interrupts available, set the corresponding capability flag */
if (!platform_get_irq(cpu_pmu->plat_device, 0))
@@ -311,6 +316,9 @@ static int cpu_pmu_device_probe(struct platform_device *pdev)
cpu_pmu = pmu;
cpu_pmu->plat_device = pdev;

+ /* Assume by default that we're on a homogeneous system */
+ cpumask_setall(&pmu->supported_cpus);
+
if (node && (of_id = of_match_node(cpu_pmu_of_device_ids, pdev->dev.of_node))) {
init_fn = of_id->data;
ret = init_fn(pmu);
--
1.9.1

2014-11-07 16:27:07

by Mark Rutland

Subject: [PATCH 05/11] arm: perf: reject multi-pmu groups

An event group spanning multiple CPU PMUs can never be scheduled, as at
least one of its events will always fail to schedule; such groups are
therefore nonsensical. Additionally, groups spanning multiple PMUs would
require additional validation logic throughout the driver to prevent CPU
PMUs from stepping on each other's internal state. Given that such
groups are nonsensical to begin with, the simple option is to reject
them entirely. Groups consisting of software events and CPU PMU events
are benign so long as the CPU PMU events all target a single CPU PMU.

This patch ensures that we reject the creation of event groups which
span multiple CPU PMUs, avoiding the issues described above. The
addition of this_pmu to the validation logic made the fake_pmu more
confusing than it already was, so it is renamed to the more accurate
hw_events. As hw_events is being modified anyway, its used_mask
initialisation is also simplified by using a designated initializer
rather than the existing memset.

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/kernel/perf_event.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index b00f6aa..41dcfc0 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -258,13 +258,17 @@ out:
}

static int
-validate_event(struct pmu_hw_events *hw_events,
+validate_event(struct pmu *this_pmu,
+ struct pmu_hw_events *hw_events,
struct perf_event *event)
{
struct arm_pmu *armpmu = to_arm_pmu(event->pmu);

if (is_software_event(event))
return 1;
+
+ if (event->pmu != this_pmu)
+ return 0;

if (event->state < PERF_EVENT_STATE_OFF)
return 1;
@@ -279,23 +283,20 @@ static int
validate_group(struct perf_event *event)
{
struct perf_event *sibling, *leader = event->group_leader;
- struct pmu_hw_events fake_pmu;
-
- /*
- * Initialise the fake PMU. We only need to populate the
- * used_mask for the purposes of validation.
- */
- memset(&fake_pmu.used_mask, 0, sizeof(fake_pmu.used_mask));
+ struct pmu *this_pmu = event->pmu;
+ struct pmu_hw_events hw_events = {
+ .used_mask = { 0 },
+ };

- if (!validate_event(&fake_pmu, leader))
+ if (!validate_event(this_pmu, &hw_events, leader))
return -EINVAL;

list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
- if (!validate_event(&fake_pmu, sibling))
+ if (!validate_event(this_pmu, &hw_events, sibling))
return -EINVAL;
}

- if (!validate_event(&fake_pmu, event))
+ if (!validate_event(this_pmu, &hw_events, event))
return -EINVAL;

return 0;
--
1.9.1

2014-11-07 16:27:16

by Mark Rutland

Subject: [PATCH 07/11] arm: perf: document PMU affinity binding

To describe the various ways CPU PMU interrupts might be wired up, we
can refer to the topology information in the device tree.

This patch adds a new property to the PMU binding, interrupts-affinity,
which describes the relationship between CPUs and interrupts. This
information is necessary to handle systems with heterogeneous PMU
implementations (e.g. big.LITTLE). Documentation is added describing the
use of said property.

Signed-off-by: Mark Rutland <[email protected]>
---
Documentation/devicetree/bindings/arm/pmu.txt | 104 +++++++++++++++++++++++++-
1 file changed, 103 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt
index 75ef91d..23a0675 100644
--- a/Documentation/devicetree/bindings/arm/pmu.txt
+++ b/Documentation/devicetree/bindings/arm/pmu.txt
@@ -24,12 +24,114 @@ Required properties:

Optional properties:

+- interrupts-affinity : A list of phandles to topology nodes (see topology.txt) describing
+ the set of CPUs associated with the interrupt at the same index.
- qcom,no-pc-write : Indicates that this PMU doesn't support the 0xc and 0xd
events.

-Example:
+Example 1 (A single CPU):

pmu {
compatible = "arm,cortex-a9-pmu";
interrupts = <100 101>;
};
+
+Example 2 (Multiple clusters with single interrupts):
+
+cpus {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ CPU0: cpu@0 {
+ reg = <0x0>;
+ compatible = "arm,cortex-a15-pmu";
+ };
+
+ CPU1: cpu@1 {
+ reg = <0x1>;
+ compatible = "arm,cotex-a15-pmu";
+ };
+
+ CPU100: cpu@100 {
+ reg = <0x100>;
+ compatible = "arm,cortex-a7-pmu";
+ };
+
+ cpu-map {
+ cluster0 {
+ CORE_0_0: core0 {
+ cpu = <&CPU0>;
+ };
+ CORE_0_1: core1 {
+ cpu = <&CPU1>;
+ };
+ };
+ cluster1 {
+ CORE_1_0: core0 {
+ cpu = <&CPU100>;
+ };
+ };
+ };
+};
+
+pmu_a15 {
+ compatible = "arm,cortex-a15-pmu";
+ interrupts = <100>, <101>;
+ interrupts-affinity = <&CORE0>, <&CORE1>;
+};
+
+pmu_a7 {
+ compatible = "arm,cortex-a7-pmu";
+ interrupts = <105>;
+ interrupts-affinity = <&CORE_1_0>;
+};
+
+Example 3 (Multiple clusters with per-cpu interrupts):
+
+cpus {
+ #address-cells = <1>;
+ #size-cells = <1>;
+
+ CPU0: cpu@0 {
+ reg = <0x0>;
+ compatible = "arm,cortex-a15-pmu";
+ };
+
+ CPU1: cpu@1 {
+ reg = <0x1>;
+ compatible = "arm,cotex-a15-pmu";
+ };
+
+ CPU100: cpu@100 {
+ reg = <0x100>;
+ compatible = "arm,cortex-a7-pmu";
+ };
+
+ cpu-map {
+ CLUSTER0: cluster0 {
+ core0 {
+ cpu = <&CPU0>;
+ };
+ core1 {
+ cpu = <&CPU1>;
+ };
+ };
+ CLUSTER1: cluster1 {
+ core0 {
+ cpu = <&CPU100>;
+ };
+ };
+ };
+};
+
+pmu_a15 {
+ compatible = "arm,cortex-a15-pmu";
+ interrupts = <100>;
+ interrupts-affinity = <&CLUSTER0>;
+};
+
+pmu_a7 {
+ compatible = "arm,cortex-a7-pmu";
+ interrupts = <105>;
+ interrupts-affinity = <&CLUSTER1>;
+};
--
1.9.1

2014-11-07 16:27:13

by Mark Rutland

Subject: [PATCH 06/11] arm: perf: probe number of counters on affine CPUs

In heterogeneous systems, the number of counters may differ across
clusters. To find the number of counters for a cluster, we must probe
the PMU from a CPU in that cluster.

Signed-off-by: Mark Rutland <[email protected]>
Reviewed-by: Will Deacon <[email protected]>
---
arch/arm/kernel/perf_event_v7.c | 41 +++++++++++++++++++++--------------------
1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/arch/arm/kernel/perf_event_v7.c b/arch/arm/kernel/perf_event_v7.c
index fa76b25..dccd108 100644
--- a/arch/arm/kernel/perf_event_v7.c
+++ b/arch/arm/kernel/perf_event_v7.c
@@ -994,15 +994,22 @@ static void armv7pmu_init(struct arm_pmu *cpu_pmu)
cpu_pmu->max_period = (1LLU << 32) - 1;
};

-static u32 armv7_read_num_pmnc_events(void)
+static void armv7_read_num_pmnc_events(void *info)
{
- u32 nb_cnt;
+ int *nb_cnt = info;

/* Read the nb of CNTx counters supported from PMNC */
- nb_cnt = (armv7_pmnc_read() >> ARMV7_PMNC_N_SHIFT) & ARMV7_PMNC_N_MASK;
+ *nb_cnt = (armv7_pmnc_read() >> ARMV7_PMNC_N_SHIFT) & ARMV7_PMNC_N_MASK;

- /* Add the CPU cycles counter and return */
- return nb_cnt + 1;
+ /* Add the CPU cycles counter */
+ *nb_cnt += 1;
+}
+
+static int armv7_probe_num_events(struct arm_pmu *arm_pmu)
+{
+ return smp_call_function_any(&arm_pmu->supported_cpus,
+ armv7_read_num_pmnc_events,
+ &arm_pmu->num_events, 1);
}

static int armv7_a8_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1010,8 +1017,7 @@ static int armv7_a8_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a8";
cpu_pmu->map_event = armv7_a8_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a9_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1019,8 +1025,7 @@ static int armv7_a9_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a9";
cpu_pmu->map_event = armv7_a9_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a5_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1028,8 +1033,7 @@ static int armv7_a5_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a5";
cpu_pmu->map_event = armv7_a5_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a15_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1037,9 +1041,8 @@ static int armv7_a15_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a15";
cpu_pmu->map_event = armv7_a15_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
cpu_pmu->set_event_filter = armv7pmu_set_event_filter;
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a7_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1047,9 +1050,8 @@ static int armv7_a7_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a7";
cpu_pmu->map_event = armv7_a7_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
cpu_pmu->set_event_filter = armv7pmu_set_event_filter;
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a12_pmu_init(struct arm_pmu *cpu_pmu)
@@ -1057,16 +1059,15 @@ static int armv7_a12_pmu_init(struct arm_pmu *cpu_pmu)
armv7pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a12";
cpu_pmu->map_event = armv7_a12_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
cpu_pmu->set_event_filter = armv7pmu_set_event_filter;
- return 0;
+ return armv7_probe_num_events(cpu_pmu);
}

static int armv7_a17_pmu_init(struct arm_pmu *cpu_pmu)
{
- armv7_a12_pmu_init(cpu_pmu);
+ int ret = armv7_a12_pmu_init(cpu_pmu);
cpu_pmu->name = "armv7_cortex_a17";
- return 0;
+ return ret;
}

/*
@@ -1453,7 +1454,7 @@ static int krait_pmu_init(struct arm_pmu *cpu_pmu)
cpu_pmu->map_event = krait_map_event_no_branch;
else
cpu_pmu->map_event = krait_map_event;
- cpu_pmu->num_events = armv7_read_num_pmnc_events();
+ cpu_pmu->num_events = armv7_probe_num_events(cpu_pmu);
cpu_pmu->set_event_filter = armv7pmu_set_event_filter;
cpu_pmu->reset = krait_pmu_reset;
cpu_pmu->enable = krait_pmu_enable_event;
--
1.9.1

2014-11-07 16:27:28

by Mark Rutland

Subject: [PATCH 08/11] arm: perf: add functions to parse affinity from dt

Depending on hardware configuration, some devices may only be accessible
from certain CPUs, may have interrupts wired up to a subset of CPUs, or
may have operations which affect subsets of CPUs. To handle these
devices it is necessary to describe this affinity information in
devicetree.

This patch adds functions to handle parsing the CPU affinity of
properties from devicetree, based on Lorenzo's topology binding,
allowing subsets of CPUs to be associated with interrupts, hardware
ports, etc. The functions can be used to build cpumasks and also to test
whether an affinity property only targets one CPU independent of the
current configuration (e.g. when the kernel supports fewer CPUs than are
physically present). This is useful for dealing with mixed SPI/PPI
devices.

A device may have an arbitrary number of affinity properties, the
meaning of which is device-specific and should be specified in a given
device's binding document.

For example, an affinity property describing interrupt routing may
consist of a phandle pointing to a subtree of the topology nodes,
indicating the set of CPUs an interrupt originates from or may be taken
on. Bindings may have restrictions on the topology nodes referenced -
for describing coherency controls an affinity property may indicate a
whole cluster (including any non-CPU logic it contains) is affected by
some configuration.
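
As a usage sketch (the device_node variable and error handling here are
illustrative; the real caller is added in the next patch), building a
cpumask from one entry of an affinity property might look like:

/*
 * Sketch only: parse the 0th entry of an affinity property into a
 * cpumask using the helper added below. The mask must start out
 * empty, as arm_dt_affine_build_mask checks for this.
 */
cpumask_t cpus;
int err;

cpumask_clear(&cpus);
err = arm_dt_affine_get_mask(np, "interrupts-affinity", 0, &cpus);
if (err)
        pr_warn("failed to parse affinity property: %d\n", err);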

Signed-off-by: Mark Rutland <[email protected]>
Cc: Grant Likely <[email protected]>
Cc: Rob Herring <[email protected]>
---
arch/arm/kernel/perf_event_cpu.c | 127 +++++++++++++++++++++++++++++++++++++++
1 file changed, 127 insertions(+)

diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index ce35149..dfcaba5 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -22,6 +22,7 @@
#include <linux/export.h>
#include <linux/kernel.h>
#include <linux/of.h>
+#include <linux/of_device.h>
#include <linux/platform_device.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
@@ -294,6 +295,132 @@ static int probe_current_pmu(struct arm_pmu *pmu)
return ret;
}

+/*
+ * Test if the node is within the topology tree.
+ * Walk up to the root, keeping refcounts balanced.
+ */
+static bool is_topology_node(struct device_node *node)
+{
+ struct device_node *np, *cpu_map;
+ bool ret = false;
+
+ cpu_map = of_find_node_by_path("/cpus/cpu-map");
+ if (!cpu_map)
+ return false;
+
+ /*
+ * of_get_next_parent decrements the refcount of the provided node.
+ * Increment it first to keep things balanced.
+ */
+ for (np = of_node_get(node); np; np = of_get_next_parent(np)) {
+ if (np != cpu_map)
+ continue;
+
+ ret = true;
+ break;
+ }
+
+ of_node_put(np);
+ of_node_put(cpu_map);
+ return ret;
+}
+
+static int cpu_node_to_id(struct device_node *node)
+{
+ int cpu;
+ for_each_possible_cpu(cpu)
+ if (of_cpu_device_node_get(cpu) == node)
+ return cpu;
+
+ return -EINVAL;
+}
+
+static int arm_dt_affine_build_mask(struct device_node *affine,
+ cpumask_t *mask)
+{
+ struct device_node *child, *parent = NULL;
+ int ret = -EINVAL;
+
+ if (!is_topology_node(affine))
+ return -EINVAL;
+
+ child = of_node_get(affine);
+ if (!child)
+ goto out_invalid;
+
+ parent = of_get_parent(child);
+ if (!parent)
+ goto out_invalid;
+
+ if (!cpumask_empty(mask))
+ goto out_invalid;
+
+ /*
+ * Depth-first search over the topology tree, iterating over leaf nodes
+ * and adding all referenced CPUs to the cpumask. Almost all of the
+ * of_* iterators are built for breadth-first search, which means we
+ * have to do a little more work to ensure refcounts are balanced.
+ */
+ do {
+ struct device_node *tmp, *cpu_node;
+ int cpu;
+
+ /* head down to the leaf */
+ while ((tmp = of_get_next_child(child, NULL))) {
+ of_node_put(parent);
+ parent = child;
+ child = tmp;
+ }
+
+ /*
+ * In some cases cpu_node might be NULL, but cpu_node_to_id
+ * will handle this (albeit slowly) and we don't need another
+ * error path.
+ */
+ cpu_node = of_parse_phandle(child, "cpu", 0);
+ cpu = cpu_node_to_id(cpu_node);
+
+ if (cpu < 0)
+ pr_warn("Invalid or unused node in topology description '%s', skipping\n",
+ child->full_name);
+ else
+ cpumask_set_cpu(cpu, mask);
+
+ of_node_put(cpu_node);
+
+ /*
+ * Find the next sibling, or transitively a parent's sibling.
+ * Don't go further up the tree than the affine node we were
+ * handed.
+ */
+ while (child != affine &&
+ !(child = of_get_next_child(parent, child))) {
+ child = parent;
+ parent = of_get_parent(parent);
+ }
+
+ } while (child != affine); /* all children covered. Time to stop */
+
+ ret = 0;
+
+out_invalid:
+ of_node_put(child);
+ of_node_put(parent);
+ return ret;
+}
+
+static int arm_dt_affine_get_mask(struct device_node *node, char *prop,
+ int idx, cpumask_t *mask)
+{
+ int ret = -EINVAL;
+ struct device_node *affine = of_parse_phandle(node, prop, idx);
+
+ ret = arm_dt_affine_build_mask(affine, mask);
+
+ of_node_put(affine);
+ return ret;
+}
+
static int cpu_pmu_device_probe(struct platform_device *pdev)
{
const struct of_device_id *of_id;
--
1.9.1

2014-11-07 16:27:34

by Mark Rutland

Subject: [PATCH 09/11] arm: perf: parse cpu affinity from dt

The current way we read interrupts from devicetree assumes that
interrupts are in increasing order of logical cpu id (MPIDR.Aff{2,1,0}),
and that these logical ids are in a contiguous block. This may not be
the case in general - after a kexec cpu ids may be arbitrarily assigned,
and multi-cluster systems do not have a contiguous range of cpu ids.

This patch parses cpu affinity information for interrupts from an
optional "interrupts-affinity" devicetree property described in the
devicetree binding document. Support for existing dts and board files
remains.

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/include/asm/pmu.h | 12 +++
arch/arm/kernel/perf_event_cpu.c | 196 +++++++++++++++++++++++++++++----------
2 files changed, 161 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
index b630a44..92fc1da 100644
--- a/arch/arm/include/asm/pmu.h
+++ b/arch/arm/include/asm/pmu.h
@@ -12,6 +12,7 @@
#ifndef __ARM_PMU_H__
#define __ARM_PMU_H__

+#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/perf_event.h>

@@ -89,6 +90,15 @@ struct pmu_hw_events {
struct arm_pmu *percpu_pmu;
};

+/*
+ * For systems with heterogeneous PMUs, we need to know which CPUs each
+ * (possibly percpu) IRQ targets. Map between them with an array of these.
+ */
+struct cpu_irq {
+ cpumask_t cpus;
+ int irq;
+};
+
struct arm_pmu {
struct pmu pmu;
cpumask_t active_irqs;
@@ -118,6 +128,8 @@ struct arm_pmu {
struct platform_device *plat_device;
struct pmu_hw_events __percpu *hw_events;
struct notifier_block hotplug_nb;
+ int nr_irqs;
+ struct cpu_irq *irq_map;
};

#define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index dfcaba5..f09c8a0 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -85,20 +85,27 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
struct platform_device *pmu_device = cpu_pmu->plat_device;
struct pmu_hw_events __percpu *hw_events = cpu_pmu->hw_events;

- irqs = min(pmu_device->num_resources, num_possible_cpus());
+ irqs = cpu_pmu->nr_irqs;

- irq = platform_get_irq(pmu_device, 0);
- if (irq >= 0 && irq_is_percpu(irq)) {
- on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
- free_percpu_irq(irq, &hw_events->percpu_pmu);
- } else {
- for (i = 0; i < irqs; ++i) {
- if (!cpumask_test_and_clear_cpu(i, &cpu_pmu->active_irqs))
- continue;
- irq = platform_get_irq(pmu_device, i);
- if (irq >= 0)
- free_irq(irq, per_cpu_ptr(&hw_events->percpu_pmu, i));
+ for (i = 0; i < irqs; i++) {
+ struct cpu_irq *map = &cpu_pmu->irq_map[i];
+ irq = map->irq;
+
+ if (irq <= 0)
+ continue;
+
+ if (irq_is_percpu(irq)) {
+ on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
+ free_percpu_irq(irq, &hw_events->percpu_pmu);
+ return;
}
+
+ if (!cpumask_test_and_clear_cpu(i, &cpu_pmu->active_irqs))
+ continue;
+
+ irq = platform_get_irq(pmu_device, i);
+ if (irq >= 0)
+ free_irq(irq, per_cpu_ptr(&hw_events->percpu_pmu, i));
}
}

@@ -111,51 +118,52 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, irq_handler_t handler)
if (!pmu_device)
return -ENODEV;

- irqs = min(pmu_device->num_resources, num_possible_cpus());
+ irqs = cpu_pmu->nr_irqs;
if (irqs < 1) {
printk_once("perf/ARM: No irqs for PMU defined, sampling events not supported\n");
return 0;
}

- irq = platform_get_irq(pmu_device, 0);
- if (irq >= 0 && irq_is_percpu(irq)) {
- err = request_percpu_irq(irq, handler, "arm-pmu",
- &hw_events->percpu_pmu);
- if (err) {
- pr_err("unable to request IRQ%d for ARM PMU counters\n",
- irq);
- return err;
- }
- on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1);
- } else {
- for (i = 0; i < irqs; ++i) {
- err = 0;
- irq = platform_get_irq(pmu_device, i);
- if (irq < 0)
- continue;
-
- /*
- * If we have a single PMU interrupt that we can't shift,
- * assume that we're running on a uniprocessor machine and
- * continue. Otherwise, continue without this interrupt.
- */
- if (irq_set_affinity(irq, cpumask_of(i)) && irqs > 1) {
- pr_warn("unable to set irq affinity (irq=%d, cpu=%u)\n",
- irq, i);
- continue;
- }
+ for (i = 0; i < irqs; i++) {
+ struct cpu_irq *map = &cpu_pmu->irq_map[i];
+ irq = map->irq;

- err = request_irq(irq, handler,
- IRQF_NOBALANCING | IRQF_NO_THREAD, "arm-pmu",
- per_cpu_ptr(&hw_events->percpu_pmu, i));
+ if (irq <= 0)
+ continue;
+
+ if (irq_is_percpu(map->irq)) {
+ err = request_percpu_irq(irq, handler, "arm-pmu",
+ &hw_events->percpu_pmu);
if (err) {
pr_err("unable to request IRQ%d for ARM PMU counters\n",
irq);
return err;
}
+ on_each_cpu(cpu_pmu_enable_percpu_irq, &irq, 1);
+ return 0;
+ }
+
+ /*
+ * If we have a single PMU interrupt that we can't shift,
+ * assume that we're running on a uniprocessor machine and
+ * continue. Otherwise, continue without this interrupt.
+ */
+ if (irq_set_affinity(irq, &map->cpus) && irqs > 1) {
+ pr_warn("unable to set irq affinity (irq=%d, cpu=%u)\n",
+ irq, cpumask_first(&map->cpus));
+ continue;
+ }

- cpumask_set_cpu(i, &cpu_pmu->active_irqs);
+ err = request_irq(irq, handler,
+ IRQF_NOBALANCING | IRQF_NO_THREAD, "arm-pmu",
+ per_cpu_ptr(&hw_events->percpu_pmu, i));
+ if (err) {
+ pr_err("unable to request IRQ%d for ARM PMU counters\n",
+ irq);
+ return err;
}
+
+ cpumask_set_cpu(i, &cpu_pmu->active_irqs);
}

return 0;
@@ -421,6 +429,97 @@ static int arm_dt_affine_get_mask(struct device_node *node, char *prop,
return ret;
}

+static int cpu_pmu_parse_interrupt(struct arm_pmu *pmu, int idx)
+{
+ struct cpu_irq *map = &pmu->irq_map[idx];
+ struct platform_device *pdev = pmu->plat_device;
+ struct device_node *np = pdev->dev.of_node;
+
+ map->irq = platform_get_irq(pdev, idx);
+ if (map->irq <= 0)
+ return -ENOENT;
+
+ cpumask_clear(&map->cpus);
+
+ if (!of_property_read_bool(np, "interrupts-affinity")) {
+ /*
+ * If we don't have any affinity information, assume a
+ * homogeneous system. We assume that CPUs are ordered as in
+ * the DT, even in the absence of affinity information.
+ */
+ if (irq_is_percpu(map->irq))
+ cpumask_setall(&map->cpus);
+ else
+ cpumask_set_cpu(idx, &map->cpus);
+ } else {
+ return arm_dt_affine_get_mask(np, "interrupts-affinity", idx,
+ &map->cpus);
+ }
+
+ return 0;
+}
+
+static int cpu_pmu_parse_interrupts(struct arm_pmu *pmu)
+{
+ struct platform_device *pdev = pmu->plat_device;
+ int ret;
+ int i, irqs;
+
+ /*
+ * Figure out how many IRQs there are. This may be larger than NR_CPUS,
+ * and this may be in any arbitrary order...
+ */
+ for (irqs = 0; platform_get_irq(pdev, irqs) > 0; irqs++);
+ if (!irqs) {
+ pr_warn("Unable to find interrupts\n");
+ return -EINVAL;
+ }
+
+ pmu->nr_irqs = irqs;
+ pmu->irq_map = kmalloc_array(irqs, sizeof(*pmu->irq_map), GFP_KERNEL);
+ if (!pmu->irq_map) {
+ pr_warn("Unable to allocate irqmap data\n");
+ return -ENOMEM;
+ }
+
+ /*
+ * Some platforms are insane enough to mux all the PMU IRQs into a
+ * single IRQ. To enable handling of those cases, assume that if we
+ * have a single interrupt it targets all CPUs.
+ */
+ if (irqs == 1 && num_possible_cpus() > 1) {
+ cpumask_copy(&pmu->irq_map[0].cpus, cpu_present_mask);
+ } else {
+ for (i = 0; i < irqs; i++) {
+ ret = cpu_pmu_parse_interrupt(pmu, i);
+ if (ret)
+ goto out_free;
+ }
+ }
+
+ if (of_property_read_bool(pdev->dev.of_node, "interrupts-affinity")) {
+ /* The PMU can work on any CPU for which it has an interrupt. */
+ for (i = 0; i < irqs; i++) {
+ struct cpu_irq *map = &pmu->irq_map[i];
+ cpumask_or(&pmu->supported_cpus, &pmu->supported_cpus,
+ &map->cpus);
+ }
+ } else {
+ /*
+ * Without affinity info, assume a homogeneous system with
+ * potentially missing interrupts, to keep existing DTBs
+ * working.
+ */
+ cpumask_setall(&pmu->supported_cpus);
+ }
+
+ return 0;
+
+out_free:
+ kfree(pmu->irq_map);
+ return ret;
+}
+
static int cpu_pmu_device_probe(struct platform_device *pdev)
{
const struct of_device_id *of_id;
@@ -443,8 +542,9 @@ static int cpu_pmu_device_probe(struct platform_device *pdev)
cpu_pmu = pmu;
cpu_pmu->plat_device = pdev;

- /* Assume by default that we're on a homogeneous system */
- cpumask_setall(&pmu->supported_cpus);
+ ret = cpu_pmu_parse_interrupts(pmu);
+ if (ret)
+ goto out_free_pmu;

if (node && (of_id = of_match_node(cpu_pmu_of_device_ids, pdev->dev.of_node))) {
init_fn = of_id->data;
@@ -471,8 +571,10 @@ static int cpu_pmu_device_probe(struct platform_device *pdev)
out_destroy:
cpu_pmu_destroy(cpu_pmu);
out_free:
- pr_info("failed to register PMU devices!\n");
+ kfree(pmu->irq_map);
+out_free_pmu:
kfree(pmu);
+ pr_info("failed to register PMU devices!\n");
return ret;
}

--
1.9.1

2014-11-07 16:27:41

by Mark Rutland

Subject: [PATCH 11/11] arm: dts: vexpress: describe all PMUs in TC2 dts

The dts for the CoreTile Express A15x2 A7x3 (TC2) only describes the
PMUs of the Cortex-A15 CPUs, and not the Cortex-A7 CPUs.

Now that we have a mechanism for describing disparate PMUs and their
interrupts in device tree, this patch makes use of these to describe the
PMUs for all CPUs in the system.

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 36 +++++++++++++++++++++++++++++-
1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 322fd15..52416f9 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -90,6 +90,28 @@
min-residency-us = <2500>;
};
};
+
+ cpu-map {
+ cluster0 {
+ core_0_0: core0 {
+ cpu = <&cpu0>;
+ };
+ core_0_1: core1 {
+ cpu = <&cpu1>;
+ };
+ };
+ cluster1 {
+ core_1_0: core0 {
+ cpu = <&cpu2>;
+ };
+ core_1_1: core1 {
+ cpu = <&cpu3>;
+ };
+ core_1_2: core2 {
+ cpu = <&cpu4>;
+ };
+ };
+ };
};

memory@80000000 {
@@ -187,10 +209,22 @@
<1 10 0xf08>;
};

- pmu {
+ pmu_a15 {
compatible = "arm,cortex-a15-pmu";
interrupts = <0 68 4>,
<0 69 4>;
+ interrupts-affinity = <&core_0_0>,
+ <&core_0_1>;
+ };
+
+ pmu_a7 {
+ compatible = "arm,cortex-a7-pmu";
+ interrupts = <0 128 4>,
+ <0 129 4>,
+ <0 130 4>;
+ interrupts-affinity = <&core_1_0>,
+ <&core_1_1>,
+ <&core_1_2>;
};

oscclk6a: oscclk6a {
--
1.9.1

2014-11-07 16:27:56

by Mark Rutland

Subject: [PATCH 10/11] arm: perf: remove singleton PMU restriction

Now that we can describe PMUs in heterogeneous systems, the only item in
the way of perf support for big.LITTLE is the singleton cpu_pmu variable
used for OProfile compatibility.

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/kernel/perf_event_cpu.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
index f09c8a0..09de0e6 100644
--- a/arch/arm/kernel/perf_event_cpu.c
+++ b/arch/arm/kernel/perf_event_cpu.c
@@ -34,7 +34,7 @@
#include <asm/pmu.h>

/* Set at runtime when we know what CPU type we are. */
-static struct arm_pmu *cpu_pmu;
+static struct arm_pmu *__oprofile_cpu_pmu;

/*
* Despite the names, these two functions are CPU-specific and are used
@@ -42,10 +42,10 @@ static struct arm_pmu *cpu_pmu;
*/
const char *perf_pmu_name(void)
{
- if (!cpu_pmu)
+ if (!__oprofile_cpu_pmu)
return NULL;

- return cpu_pmu->name;
+ return __oprofile_cpu_pmu->name;
}
EXPORT_SYMBOL_GPL(perf_pmu_name);

@@ -53,8 +53,8 @@ int perf_num_counters(void)
{
int max_events = 0;

- if (cpu_pmu != NULL)
- max_events = cpu_pmu->num_events;
+ if (__oprofile_cpu_pmu != NULL)
+ max_events = __oprofile_cpu_pmu->num_events;

return max_events;
}
@@ -528,19 +528,16 @@ static int cpu_pmu_device_probe(struct platform_device *pdev)
struct arm_pmu *pmu;
int ret = -ENODEV;

- if (cpu_pmu) {
- pr_info("attempt to register multiple PMU devices!\n");
- return -ENOSPC;
- }
-
pmu = kzalloc(sizeof(struct arm_pmu), GFP_KERNEL);
if (!pmu) {
pr_info("failed to allocate PMU device!\n");
return -ENOMEM;
}

- cpu_pmu = pmu;
- cpu_pmu->plat_device = pdev;
+ if (!__oprofile_cpu_pmu)
+ __oprofile_cpu_pmu = pmu;
+
+ pmu->plat_device = pdev;

ret = cpu_pmu_parse_interrupts(pmu);
if (ret)
@@ -558,18 +555,18 @@ static int cpu_pmu_device_probe(struct platform_device *pdev)
goto out_free;
}

- ret = cpu_pmu_init(cpu_pmu);
+ ret = cpu_pmu_init(pmu);
if (ret)
goto out_free;

- ret = armpmu_register(cpu_pmu, -1);
+ ret = armpmu_register(pmu, -1);
if (ret)
goto out_destroy;

return 0;

out_destroy:
- cpu_pmu_destroy(cpu_pmu);
+ cpu_pmu_destroy(pmu);
out_free:
kfree(pmu->irq_map);
out_free_pmu:
--
1.9.1

2014-11-07 16:29:15

by Mark Rutland

Subject: [PATCH 04/11] arm: perf: filter unschedulable events

Different CPU microarchitectures implement different PMU events, and
thus events which can be scheduled on one microarchitecture cannot be
scheduled on another, and vice versa. Some architected events behave
differently across microarchitectures, and thus cannot be meaningfully
summed. Due to this, we reject the scheduling of an event on a CPU of a
different microarchitecture to that which the event targets.

When the core perf code is scheduling events and encounters an event
which cannot be scheduled, it stops attempting to schedule events. As
the perf core periodically rotates the list of events, for some
proportion of the time events which are unschedulable will block events
which are schedulable, resulting in low utilisation of the hardware
counters.

This patch implements a pmu::filter_match callback such that we can
detect and skip such events early during scheduling, before they can
block the schedulable events. This prevents the low HW counter
utilisation issue.

Signed-off-by: Mark Rutland <[email protected]>
---
arch/arm/kernel/perf_event.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/arch/arm/kernel/perf_event.c b/arch/arm/kernel/perf_event.c
index 9ad21ab..b00f6aa 100644
--- a/arch/arm/kernel/perf_event.c
+++ b/arch/arm/kernel/perf_event.c
@@ -509,6 +509,18 @@ static void armpmu_disable(struct pmu *pmu)
armpmu->stop(armpmu);
}

+/*
+ * In heterogeneous systems, events are specific to a particular
+ * microarchitecture, and aren't suitable for another. Thus, only match CPUs of
+ * the same microarchitecture.
+ */
+static int armpmu_filter_match(struct perf_event *event)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
+ unsigned int cpu = smp_processor_id();
+ return cpumask_test_cpu(cpu, &armpmu->supported_cpus);
+}
+
#ifdef CONFIG_PM_RUNTIME
static int armpmu_runtime_resume(struct device *dev)
{
@@ -549,6 +561,7 @@ static void armpmu_init(struct arm_pmu *armpmu)
.start = armpmu_start,
.stop = armpmu_stop,
.read = armpmu_read,
+ .filter_match = armpmu_filter_match,
};
}

--
1.9.1

2014-11-07 16:27:05

by Mark Rutland

Subject: [PATCH 02/11] perf: allow for PMU-specific event filtering

In certain circumstances it may not be possible to schedule particular
events due to constraints other than a lack of hardware counters (e.g.
on big.LITTLE systems where CPUs support different events). The core
perf event code does not distinguish these cases and pessimistically
assumes that any failure to schedule an event is due to a lack of
hardware counters, ending event group scheduling early despite hardware
counters remaining available.

When such an unschedulable event exists in a ctx->flexible_groups list
it can unnecessarily prevent event groups following it in the list from
being scheduled until it is rotated to the end of the list. This can
result in events being scheduled for only a portion of the time they
would otherwise be eligible, and for short running programs unfortunate
initial list ordering can result in no events being counted.

This patch adds a new (optional) filter_match function pointer to struct
pmu which backends can use to tell the perf core whether or not it is
worth attempting to schedule an event. This plugs into the existing
event_filter_match logic, and makes it possible to avoid the scheduling
problem described above.
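
As a sketch of the backend side (the example_pmu naming is
illustrative; the ARM implementation added later in this series follows
the same shape), a PMU limited to a subset of CPUs could implement the
callback as:

/*
 * Sketch only: report whether the event can be scheduled on the
 * current CPU. Each backend defines its own notion of a match; here it
 * is a cpumask of supported CPUs.
 */
static int example_pmu_filter_match(struct perf_event *event)
{
        struct example_pmu *pmu = to_example_pmu(event->pmu);

        return cpumask_test_cpu(smp_processor_id(), &pmu->supported_cpus);
}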

Signed-off-by: Mark Rutland <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
---
include/linux/perf_event.h | 5 +++++
kernel/events/core.c | 8 +++++++-
2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 893a0d0..80c5f5f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -263,6 +263,11 @@ struct pmu {
* flush branch stack on context-switches (needed in cpu-wide mode)
*/
void (*flush_branch_stack) (void);
+
+ /*
+ * Filter events for PMU-specific reasons.
+ */
+ int (*filter_match) (struct perf_event *event); /* optional */
};

/**
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2b02c9f..770b276 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1428,11 +1428,17 @@ static int __init perf_workqueue_init(void)

core_initcall(perf_workqueue_init);

+static inline int pmu_filter_match(struct perf_event *event)
+{
+ struct pmu *pmu = event->pmu;
+ return pmu->filter_match ? pmu->filter_match(event) : 1;
+}
+
static inline int
event_filter_match(struct perf_event *event)
{
return (event->cpu == -1 || event->cpu == smp_processor_id())
- && perf_cgroup_match(event);
+ && perf_cgroup_match(event) && pmu_filter_match(event);
}

static void
--
1.9.1

2014-11-17 11:15:00

by Will Deacon

Subject: Re: [PATCH 07/11] arm: perf: document PMU affinity binding

Hi Mark,

On Fri, Nov 07, 2014 at 04:25:32PM +0000, Mark Rutland wrote:
> To describe the various ways CPU PMU interrupts might be wired up, we
> can refer to the topology information in the device tree.
>
> This patch adds a new property to the PMU binding, interrupts-affinity,
> which describes the relationship between CPUs and interrupts. This
> information is necessary to handle systems with heterogeneous PMU
> implementations (e.g. big.LITTLE). Documentation is added describing the
> use of said property.

I'm not entirely comfortable with using interrupt affinity to convey
PMU affinity. It seems perfectly plausible for somebody to play the usual
trick of ORing all the irq lines together, despite having a big/little
PMU configuration.

Can you describe such a system with this binding?

> +Example 2 (Multiple clusters with single interrupts):
> +
> +cpus {
> + #address-cells = <1>;
> + #size-cells = <1>;
> +
> + CPU0: cpu@0 {
> + reg = <0x0>;
> + compatible = "arm,cortex-a15-pmu";
> + };
> +
> + CPU1: cpu@1 {
> + reg = <0x1>;
> + compatible = "arm,cotex-a15-pmu";

cortex

> + };
> +
> + CPU100: cpu@100 {
> + reg = <0x100>;
> + compatible = "arm,cortex-a7-pmu";
> + };
> +
> + cpu-map {
> + cluster0 {
> + CORE_0_0: core0 {
> + cpu = <&CPU0>;
> + };
> + CORE_0_1: core1 {
> + cpu = <&CPU1>;
> + };
> + };
> + cluster1 {
> + CORE_1_0: core0 {
> + cpu = <&CPU100>;
> + };
> + };
> + };
> +};
> +
> +pmu_a15 {
> + compatible = "arm,cortex-a15-pmu";
> + interrupts = <100>, <101>;
> + interrupts-affinity = <&CORE0>, <&CORE1>;
> +};
> +
> +pmu_a7 {
> + compatible = "arm,cortex-a7-pmu";
> + interrupts = <105>;
> + interrupts-affinity = <&CORE_1_0>;
> +};
> +
> +Example 3 (Multiple clusters with per-cpu interrupts):
> +
> +cpus {
> + #address-cells = <1>;
> + #size-cells = <1>;
> +
> + CPU0: cpu@0 {
> + reg = <0x0>;
> + compatible = "arm,cortex-a15-pmu";
> + };
> +
> + CPU1: cpu@1 {
> + reg = <0x1>;
> + compatible = "arm,cotex-a15-pmu";

Same here.

Will

2014-11-17 11:16:28

by Will Deacon

Subject: Re: [PATCH 08/11] arm: perf: add functions to parse affinity from dt

On Fri, Nov 07, 2014 at 04:25:33PM +0000, Mark Rutland wrote:
> Depending on hardware configuration, some devices may only be accessible
> from certain CPUs, may have interrupts wired up to a subset of CPUs, or
> may have operations which affect subsets of CPUs. To handle these
> devices it is necessary to describe this affinity information in
> devicetree.
>
> This patch adds functions to handle parsing the CPU affinity of
> properties from devicetree, based on Lorenzo's topology binding,
> allowing subsets of CPUs to be associated with interrupts, hardware
> ports, etc. The functions can be used to build cpumasks and also to test
> whether an affinity property only targets one CPU independent of the
> current configuration (e.g. when the kernel supports fewer CPUs than are
> physically present). This is useful for dealing with mixed SPI/PPI
> devices.
>
> A device may have an arbitrary number of affinity properties, the
> meaning of which is device-specific and should be specified in a given
> device's binding document.
>
> For example, an affinity property describing interrupt routing may
> consist of a phandle pointing to a subtree of the topology nodes,
> indicating the set of CPUs an interrupt originates from or may be taken
> on. Bindings may have restrictions on the topology nodes referenced -
> for describing coherency controls an affinity property may indicate a
> whole cluster (including any non-CPU logic it contains) is affected by
> some configuration.
>
> Signed-off-by: Mark Rutland <[email protected]>
> Cc: Grant Likely <[email protected]>
> Cc: Rob Herring <[email protected]>
> ---
> arch/arm/kernel/perf_event_cpu.c | 127 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 127 insertions(+)
>
> diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
> index ce35149..dfcaba5 100644
> --- a/arch/arm/kernel/perf_event_cpu.c
> +++ b/arch/arm/kernel/perf_event_cpu.c
> @@ -22,6 +22,7 @@
> #include <linux/export.h>
> #include <linux/kernel.h>
> #include <linux/of.h>
> +#include <linux/of_device.h>
> #include <linux/platform_device.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> @@ -294,6 +295,132 @@ static int probe_current_pmu(struct arm_pmu *pmu)
> return ret;
> }
>
> +/*
> + * Test if the node is within the topology tree.
> + * Walk up to the root, keeping refcounts balanced.
> + */
> +static bool is_topology_node(struct device_node *node)
> +{
> + struct device_node *np, *cpu_map;
> + bool ret = false;
> +
> + cpu_map = of_find_node_by_path("/cpus/cpu-map");
> + if (!cpu_map)
> + return false;
> +
> + /*
> + * of_get_next_parent decrements the refcount of the provided node.
> + * Increment it first to keep things balanced.
> + */
> + for (np = of_node_get(node); np; np = of_get_next_parent(np)) {
> + if (np != cpu_map)
> + continue;
> +
> + ret = true;
> + break;
> + }
> +
> + of_node_put(np);
> + of_node_put(cpu_map);
> + return ret;
> +}

Wouldn't this be more at home in topology.c, or somewhere where others can
make use of it?

Will

2014-11-17 11:20:44

by Will Deacon

Subject: Re: [PATCH 09/11] arm: perf: parse cpu affinity from dt

On Fri, Nov 07, 2014 at 04:25:34PM +0000, Mark Rutland wrote:
> The current way we read interrupts from devicetree assumes that
> interrupts are in increasing order of logical cpu id (MPIDR.Aff{2,1,0}),
> and that these logical ids are in a contiguous block. This may not be
> the case in general - after a kexec cpu ids may be arbitrarily assigned,
> and multi-cluster systems do not have a contiguous range of cpu ids.
>
> This patch parses cpu affinity information for interrupts from an
> optional "interrupts-affinity" devicetree property described in the
> devicetree binding document. Support for existing dts and board files
> remains.
>
> Signed-off-by: Mark Rutland <[email protected]>
> ---
> arch/arm/include/asm/pmu.h | 12 +++
> arch/arm/kernel/perf_event_cpu.c | 196 +++++++++++++++++++++++++++++----------
> 2 files changed, 161 insertions(+), 47 deletions(-)
>
> diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
> index b630a44..92fc1da 100644
> --- a/arch/arm/include/asm/pmu.h
> +++ b/arch/arm/include/asm/pmu.h
> @@ -12,6 +12,7 @@
> #ifndef __ARM_PMU_H__
> #define __ARM_PMU_H__
>
> +#include <linux/cpumask.h>
> #include <linux/interrupt.h>
> #include <linux/perf_event.h>
>
> @@ -89,6 +90,15 @@ struct pmu_hw_events {
> struct arm_pmu *percpu_pmu;
> };
>
> +/*
> + * For systems with heterogeneous PMUs, we need to know which CPUs each
> + * (possibly percpu) IRQ targets. Map between them with an array of these.
> + */
> +struct cpu_irq {
> + cpumask_t cpus;
> + int irq;
> +};
> +
> struct arm_pmu {
> struct pmu pmu;
> cpumask_t active_irqs;
> @@ -118,6 +128,8 @@ struct arm_pmu {
> struct platform_device *plat_device;
> struct pmu_hw_events __percpu *hw_events;
> struct notifier_block hotplug_nb;
> + int nr_irqs;
> + struct cpu_irq *irq_map;
> };
>
> #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
> diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
> index dfcaba5..f09c8a0 100644
> --- a/arch/arm/kernel/perf_event_cpu.c
> +++ b/arch/arm/kernel/perf_event_cpu.c
> @@ -85,20 +85,27 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
> struct platform_device *pmu_device = cpu_pmu->plat_device;
> struct pmu_hw_events __percpu *hw_events = cpu_pmu->hw_events;
>
> - irqs = min(pmu_device->num_resources, num_possible_cpus());
> + irqs = cpu_pmu->nr_irqs;
>
> - irq = platform_get_irq(pmu_device, 0);
> - if (irq >= 0 && irq_is_percpu(irq)) {
> - on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
> - free_percpu_irq(irq, &hw_events->percpu_pmu);
> - } else {
> - for (i = 0; i < irqs; ++i) {
> - if (!cpumask_test_and_clear_cpu(i, &cpu_pmu->active_irqs))
> - continue;
> - irq = platform_get_irq(pmu_device, i);
> - if (irq >= 0)
> - free_irq(irq, per_cpu_ptr(&hw_events->percpu_pmu, i));
> + for (i = 0; i < irqs; i++) {
> + struct cpu_irq *map = &cpu_pmu->irq_map[i];
> + irq = map->irq;
> +
> + if (irq <= 0)
> + continue;
> +
> + if (irq_is_percpu(irq)) {
> + on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);

Hmm, ok, so we're assuming that all the PMUs will be wired with PPIs in this
case. I have a patch allowing per-cpu interrupts to be requested for a
cpumask, but I suppose that can wait until it's actually needed.

Will

2014-11-17 11:24:55

by Will Deacon

Subject: Re: [PATCH 00/11] arm: perf: add support for heterogeneous PMUs

On Fri, Nov 07, 2014 at 04:25:25PM +0000, Mark Rutland wrote:
> In systems with heterogeneous CPUs (e.g. big.LITTLE) the associated PMUs
> also differ in terms of the supported set of events, the precise
> behaviour of each of those events, and the number of event counters.
> Thus it is not possible to expose these PMUs as a single logical PMU.
>
> Instead a logical PMU is created per CPU microarchitecture, which events
> can target directly:
>
> $ perf stat \
> -e armv7_cortex_a7/config=0x11/ \
> -e armv7_cortex_a15/config=0x11/ \
> ./test
>
> Performance counter stats for './test':
>
> 7980455 armv7_cortex_a7/config=0x11/ [27.29%]
> 9947934 armv7_cortex_a15/config=0x11/ [72.66%]
>
> 0.016734833 seconds time elapsed
>
> This series is based atop of my recent preparatory rework [1,2].

Modulo the patches I commented on, the ARM perf bits look fine to me. For
those:

Acked-by: Will Deacon <[email protected]>

However, you need to get the event_filter_match change into the core code
before I can queue anything.

Will

2014-11-17 14:33:20

by Rob Herring

Subject: Re: [PATCH 07/11] arm: perf: document PMU affinity binding

On Fri, Nov 7, 2014 at 10:25 AM, Mark Rutland <[email protected]> wrote:
> To describe the various ways CPU PMU interrupts might be wired up, we
> can refer to the topology information in the device tree.
>
> This patch adds a new property to the PMU binding, interrupts-affinity,
> which describes the relationship between CPUs and interrupts. This
> information is necessary to handle systems with heterogeneous PMU
> implementations (e.g. big.LITTLE). Documentation is added describing the
> use of said property.
>
> Signed-off-by: Mark Rutland <[email protected]>
> ---
> Documentation/devicetree/bindings/arm/pmu.txt | 104 +++++++++++++++++++++++++-
> 1 file changed, 103 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt
> index 75ef91d..23a0675 100644
> --- a/Documentation/devicetree/bindings/arm/pmu.txt
> +++ b/Documentation/devicetree/bindings/arm/pmu.txt
> @@ -24,12 +24,114 @@ Required properties:
>
> Optional properties:
>
> +- interrupts-affinity : A list of phandles to topology nodes (see topology.txt) describing
> + the set of CPUs associated with the interrupt at the same index.

Are there cases beyond PMUs we need to handle? I would think so, so we
should document this generically.

> - qcom,no-pc-write : Indicates that this PMU doesn't support the 0xc and 0xd
> events.
>
> -Example:
> +Example 1 (A single CPU):

Isn't this a single cluster of 2 cpus?

>
> pmu {
> compatible = "arm,cortex-a9-pmu";
> interrupts = <100 101>;
> };
> +
> +Example 2 (Multiple clusters with single interrupts):

The meaning of single could be made a bit more clear especially if you
consider Will's case. But I haven't really thought of better
wording...

> +
> +cpus {
> + #address-cells = <1>;
> + #size-cells = <1>;
> +
> + CPU0: cpu@0 {
> + reg = <0x0>;
> + compatible = "arm,cortex-a15-pmu";
> + };
> +
> + CPU1: cpu@1 {
> + reg = <0x1>;
> + compatible = "arm,cotex-a15-pmu";
> + };
> +
> + CPU100: cpu@100 {
> + reg = <0x100>;
> + compatible = "arm,cortex-a7-pmu";
> + };
> +
> + cpu-map {
> + cluster0 {
> + CORE_0_0: core0 {
> + cpu = <&CPU0>;
> + };
> + CORE_0_1: core1 {
> + cpu = <&CPU1>;
> + };
> + };
> + cluster1 {
> + CORE_1_0: core0 {
> + cpu = <&CPU100>;
> + };
> + };
> + };
> +};
> +
> +pmu_a15 {
> + compatible = "arm,cortex-a15-pmu";
> + interrupts = <100>, <101>;
> + interrupts-affinity = <&CORE0>, <&CORE1>;

The phandle names are wrong here.

> +};
> +
> +pmu_a7 {
> + compatible = "arm,cortex-a7-pmu";
> + interrupts = <105>;
> + interrupts-affinity = <&CORE_1_0>;
> +};
> +
> +Example 3 (Multiple clusters with per-cpu interrupts):
> +
> +cpus {
> + #address-cells = <1>;
> + #size-cells = <1>;
> +
> + CPU0: cpu@0 {
> + reg = <0x0>;
> + compatible = "arm,cortex-a15-pmu";
> + };
> +
> + CPU1: cpu@1 {
> + reg = <0x1>;
> + compatible = "arm,cotex-a15-pmu";
> + };
> +
> + CPU100: cpu@100 {
> + reg = <0x100>;
> + compatible = "arm,cortex-a7-pmu";
> + };
> +
> + cpu-map {
> + CLUSTER0: cluster0 {
> + core0 {
> + cpu = <&CPU0>;
> + };
> + core1 {
> + cpu = <&CPU1>;
> + };
> + };
> + CLUSTER1: cluster1 {
> + core0 {
> + cpu = <&CPU100>;
> + };
> + };
> + };
> +};
> +
> +pmu_a15 {
> + compatible = "arm,cortex-a15-pmu";
> + interrupts = <100>;
> + interrupts-affinity = <&CLUSTER0>;
> +};
> +
> +pmu_a7 {
> + compatible = "arm,cortex-a7-pmu";
> + interrupts = <105>;
> + interrupts-affinity = <&CLUSTER1>;
> +};
> --
> 1.9.1
>
>

2014-11-17 15:02:32

by Mark Rutland

Subject: Re: [PATCH 07/11] arm: perf: document PMU affinity binding

Hi Rob,

I appear to have typo'd your address when posting this. Sorry about
that; I'll make sure it doesn't happen again.

On Mon, Nov 17, 2014 at 02:32:57PM +0000, Rob Herring wrote:
> On Fri, Nov 7, 2014 at 10:25 AM, Mark Rutland <[email protected]> wrote:
> > To describe the various ways CPU PMU interrupts might be wired up, we
> > can refer to the topology information in the device tree.
> >
> > This patch adds a new property to the PMU binding, interrupts-affinity,
> > which describes the relationship between CPUs and interrupts. This
> > information is necessary to handle systems with heterogeneous PMU
> > implementations (e.g. big.LITTLE). Documentation is added describing the
> > use of said property.
> >
> > Signed-off-by: Mark Rutland <[email protected]>
> > ---
> > Documentation/devicetree/bindings/arm/pmu.txt | 104 +++++++++++++++++++++++++-
> > 1 file changed, 103 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt
> > index 75ef91d..23a0675 100644
> > --- a/Documentation/devicetree/bindings/arm/pmu.txt
> > +++ b/Documentation/devicetree/bindings/arm/pmu.txt
> > @@ -24,12 +24,114 @@ Required properties:
> >
> > Optional properties:
> >
> > +- interrupts-affinity : A list of phandles to topology nodes (see topology.txt) describing
> > + the set of CPUs associated with the interrupt at the same index.
>
> Are there cases beyond PMUs we need to handle? I would think so, so we
> should document this generically.

That was what I tried way back when I first tried to upstream all of
this, but in the meantime I've not encountered other devices which are
really CPU-affine, use SPIs, and hence need a CPU<->IRQ relationship
described.

That said, I'm happy to document whatever approach for referring to a
set of CPUs that we settle on, if that seems more general than PMU IRQ
mapping.

> > -Example:
> > +Example 1 (A single CPU):
>
> Isn't this a single cluster of 2 cpus?

Yes, it is. My bad.

> > pmu {
> > compatible = "arm,cortex-a9-pmu";
> > interrupts = <100 101>;
> > };
> > +
> > +Example 2 (Multiple clusters with single interrupts):
>
> The meaning of single could be made a bit more clear especially if you
> consider Will's case. But I haven't really thought of better
> wording...

How about "A cluster of homogeneous CPUs"?

> > +
> > +cpus {
> > + #address-cells = <1>;
> > + #size-cells = <1>;
> > +
> > + CPU0: cpu@0 {
> > + reg = <0x0>;
> > + compatible = "arm,cortex-a15-pmu";
> > + };
> > +
> > + CPU1: cpu@1 {
> > + reg = <0x1>;
> > + compatible = "arm,cotex-a15-pmu";
> > + };
> > +
> > + CPU100: cpu@100 {
> > + reg = <0x100>;
> > + compatible = "arm,cortex-a7-pmu";
> > + };
> > +
> > + cpu-map {
> > + cluster0 {
> > + CORE_0_0: core0 {
> > + cpu = <&CPU0>;
> > + };
> > + CORE_0_1: core1 {
> > + cpu = <&CPU1>;
> > + };
> > + };
> > + cluster1 {
> > + CORE_1_0: core0 {
> > + cpu = <&CPU100>;
> > + };
> > + };
> > + };
> > +};
> > +
> > +pmu_a15 {
> > + compatible = "arm,cortex-a15-pmu";
> > + interrupts = <100>, <101>;
> > + interrupts-affinity = <&CORE0>, <&CORE1>;
>
> The phandle names are wrong here.

Whoops. I've fixed that up locally now.

Thanks,
Mark.

2014-11-17 15:03:45

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 08/11] arm: perf: add functions to parse affinity from dt

On Mon, Nov 17, 2014 at 11:16:25AM +0000, Will Deacon wrote:
> On Fri, Nov 07, 2014 at 04:25:33PM +0000, Mark Rutland wrote:
> > Depending on hardware configuration, some devices may only be accessible
> > from certain CPUs, may have interrupts wired up to a subset of CPUs, or
> > may have operations which affect subsets of CPUs. To handle these
> > devices it is necessary to describe this affinity information in
> > devicetree.
> >
> > This patch adds functions to parse the CPU affinity of devicetree
> > properties, based on Lorenzo's topology binding,
> > allowing subsets of CPUs to be associated with interrupts, hardware
> > ports, etc. The functions can be used to build cpumasks and also to test
> > whether an affinity property only targets one CPU independent of the
> > current configuration (e.g. when the kernel supports fewer CPUs than are
> > physically present). This is useful for dealing with mixed SPI/PPI
> > devices.
> >
> > A device may have an arbitrary number of affinity properties, the
> > meaning of which is device-specific and should be specified in a given
> > device's binding document.
> >
> > For example, an affinity property describing interrupt routing may
> > consist of a phandle pointing to a subtree of the topology nodes,
> > indicating the set of CPUs an interrupt originates from or may be taken
> > on. Bindings may have restrictions on the topology nodes referenced -
> > for describing coherency controls an affinity property may indicate a
> > whole cluster (including any non-CPU logic it contains) is affected by
> > some configuration.
> >
> > Signed-off-by: Mark Rutland <[email protected]>
> > Cc: Grant Likely <[email protected]>
> > Cc: Rob Herring <[email protected]>
> > ---
> > arch/arm/kernel/perf_event_cpu.c | 127 +++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 127 insertions(+)
> >
> > diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
> > index ce35149..dfcaba5 100644
> > --- a/arch/arm/kernel/perf_event_cpu.c
> > +++ b/arch/arm/kernel/perf_event_cpu.c
> > @@ -22,6 +22,7 @@
> > #include <linux/export.h>
> > #include <linux/kernel.h>
> > #include <linux/of.h>
> > +#include <linux/of_device.h>
> > #include <linux/platform_device.h>
> > #include <linux/slab.h>
> > #include <linux/spinlock.h>
> > @@ -294,6 +295,132 @@ static int probe_current_pmu(struct arm_pmu *pmu)
> > return ret;
> > }
> >
> > +/*
> > + * Test if the node is within the topology tree.
> > + * Walk up to the root, keeping refcounts balanced.
> > + */
> > +static bool is_topology_node(struct device_node *node)
> > +{
> > + struct device_node *np, *cpu_map;
> > + bool ret = false;
> > +
> > + cpu_map = of_find_node_by_path("/cpus/cpu-map");
> > + if (!cpu_map)
> > + return false;
> > +
> > + /*
> > + * of_get_next_parent decrements the refcount of the provided node.
> > + * Increment it first to keep things balanced.
> > + */
> > + for (np = of_node_get(node); np; np = of_get_next_parent(np)) {
> > + if (np != cpu_map)
> > + continue;
> > +
> > + ret = true;
> > + break;
> > + }
> > +
> > + of_node_put(np);
> > + of_node_put(cpu_map);
> > + return ret;
> > +}
>
> Wouldn't this be more at home in topology.c, or somewhere where others can
> make use of it?

Perhaps. I'll need this for arm64 too and I don't know where that should
live.
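
For reference, this is roughly how I picture the pieces fitting
together when resolving an "interrupts-affinity" entry. It's a sketch
rather than code from the series: topology_node_to_cpumask() is a
hypothetical helper, and the leaf handling assumes the "cpu" phandles
described in topology.txt.

static void topology_node_to_cpumask(struct device_node *node,
				     cpumask_t *mask)
{
	struct device_node *child, *cpu_node;
	int cpu;

	/* Leaf core/thread nodes carry a "cpu" phandle (see topology.txt). */
	cpu_node = of_parse_phandle(node, "cpu", 0);
	if (cpu_node) {
		for_each_possible_cpu(cpu) {
			struct device_node *t = of_get_cpu_node(cpu, NULL);

			if (t == cpu_node)
				cpumask_set_cpu(cpu, mask);
			of_node_put(t);
		}
		of_node_put(cpu_node);
	}

	/* A cluster node covers every core/thread beneath it. */
	for_each_child_of_node(node, child)
		topology_node_to_cpumask(child, mask);
}

The caller would validate each phandle with is_topology_node() before
walking it, e.g.:

	aff = of_parse_phandle(pmu_node, "interrupts-affinity", i);
	if (!aff)
		return -EINVAL;
	if (is_topology_node(aff))
		topology_node_to_cpumask(aff, &mask);
	of_node_put(aff);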

Mark.

2014-11-17 15:08:48

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH 09/11] arm: perf: parse cpu affinity from dt

On Mon, Nov 17, 2014 at 11:20:35AM +0000, Will Deacon wrote:
> On Fri, Nov 07, 2014 at 04:25:34PM +0000, Mark Rutland wrote:
> > The current way we read interrupts from devicetree assumes that
> > interrupts are in increasing order of logical cpu id (MPIDR.Aff{2,1,0}),
> > and that these logical ids are in a contiguous block. This may not be
> > the case in general - after a kexec cpu ids may be arbitrarily assigned,
> > and multi-cluster systems do not have a contiguous range of cpu ids.
> >
> > This patch parses cpu affinity information for interrupts from an
> > optional "interrupts-affinity" devicetree property described in the
> > devicetree binding document. Support for existing dts and board files
> > remains.
> >
> > Signed-off-by: Mark Rutland <[email protected]>
> > ---
> > arch/arm/include/asm/pmu.h | 12 +++
> > arch/arm/kernel/perf_event_cpu.c | 196 +++++++++++++++++++++++++++++----------
> > 2 files changed, 161 insertions(+), 47 deletions(-)
> >
> > diff --git a/arch/arm/include/asm/pmu.h b/arch/arm/include/asm/pmu.h
> > index b630a44..92fc1da 100644
> > --- a/arch/arm/include/asm/pmu.h
> > +++ b/arch/arm/include/asm/pmu.h
> > @@ -12,6 +12,7 @@
> > #ifndef __ARM_PMU_H__
> > #define __ARM_PMU_H__
> >
> > +#include <linux/cpumask.h>
> > #include <linux/interrupt.h>
> > #include <linux/perf_event.h>
> >
> > @@ -89,6 +90,15 @@ struct pmu_hw_events {
> > struct arm_pmu *percpu_pmu;
> > };
> >
> > +/*
> > + * For systems with heterogeneous PMUs, we need to know which CPUs each
> > + * (possibly percpu) IRQ targets. Map between them with an array of these.
> > + */
> > +struct cpu_irq {
> > + cpumask_t cpus;
> > + int irq;
> > +};
> > +
> > struct arm_pmu {
> > struct pmu pmu;
> > cpumask_t active_irqs;
> > @@ -118,6 +128,8 @@ struct arm_pmu {
> > struct platform_device *plat_device;
> > struct pmu_hw_events __percpu *hw_events;
> > struct notifier_block hotplug_nb;
> > + int nr_irqs;
> > + struct cpu_irq *irq_map;
> > };
> >
> > #define to_arm_pmu(p) (container_of(p, struct arm_pmu, pmu))
> > diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
> > index dfcaba5..f09c8a0 100644
> > --- a/arch/arm/kernel/perf_event_cpu.c
> > +++ b/arch/arm/kernel/perf_event_cpu.c
> > @@ -85,20 +85,27 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
> > struct platform_device *pmu_device = cpu_pmu->plat_device;
> > struct pmu_hw_events __percpu *hw_events = cpu_pmu->hw_events;
> >
> > - irqs = min(pmu_device->num_resources, num_possible_cpus());
> > + irqs = cpu_pmu->nr_irqs;
> >
> > - irq = platform_get_irq(pmu_device, 0);
> > - if (irq >= 0 && irq_is_percpu(irq)) {
> > - on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
> > - free_percpu_irq(irq, &hw_events->percpu_pmu);
> > - } else {
> > - for (i = 0; i < irqs; ++i) {
> > - if (!cpumask_test_and_clear_cpu(i, &cpu_pmu->active_irqs))
> > - continue;
> > - irq = platform_get_irq(pmu_device, i);
> > - if (irq >= 0)
> > - free_irq(irq, per_cpu_ptr(&hw_events->percpu_pmu, i));
> > + for (i = 0; i < irqs; i++) {
> > + struct cpu_irq *map = &cpu_pmu->irq_map[i];
> > + irq = map->irq;
> > +
> > + if (irq <= 0)
> > + continue;
> > +
> > + if (irq_is_percpu(irq)) {
> > + on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
>
> Hmm, ok, so we're assuming that all the PMUs will be wired with PPIs in this
> case. I have a patch allowing per-cpu interrupts to be requested for a
> cpumask, but I suppose that can wait until it's actually needed.

I wasn't too keen on assuming all CPUs, but I didn't have the facility
to request a PPI on a subset of CPUs. If you can point me at your patch,
I'd be happy to take a look.

I should have the target CPU mask decoded from whatever the binding
settles on, so at this point it's just plumbing.
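
To make that concrete, the SPI side of the request path would end up
looking something like the below. This is a rough sketch against the
cpu_irq map from this patch rather than tested code:
cpu_pmu_request_spi_irqs() is a made-up name, the PPI case is omitted,
and the flags/affinity handling are assumptions.

static int cpu_pmu_request_spi_irqs(struct arm_pmu *cpu_pmu,
				    irq_handler_t handler)
{
	struct pmu_hw_events __percpu *hw_events = cpu_pmu->hw_events;
	int i, err;

	for (i = 0; i < cpu_pmu->nr_irqs; i++) {
		struct cpu_irq *map = &cpu_pmu->irq_map[i];
		int cpu;

		if (map->irq <= 0 || irq_is_percpu(map->irq))
			continue;

		/* Deliver the SPI on one of the CPUs it is affine to. */
		cpu = cpumask_first(&map->cpus);
		err = request_irq(map->irq, handler, IRQF_NOBALANCING,
				  "arm-pmu",
				  per_cpu_ptr(&hw_events->percpu_pmu, cpu));
		if (err)
			return err;

		irq_set_affinity(map->irq, &map->cpus);
	}

	return 0;
}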

Thanks,
Mark.

2014-11-18 10:40:24

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 09/11] arm: perf: parse cpu affinity from dt

On Mon, Nov 17, 2014 at 03:08:04PM +0000, Mark Rutland wrote:
> On Mon, Nov 17, 2014 at 11:20:35AM +0000, Will Deacon wrote:
> > On Fri, Nov 07, 2014 at 04:25:34PM +0000, Mark Rutland wrote:
> > > diff --git a/arch/arm/kernel/perf_event_cpu.c b/arch/arm/kernel/perf_event_cpu.c
> > > index dfcaba5..f09c8a0 100644
> > > --- a/arch/arm/kernel/perf_event_cpu.c
> > > +++ b/arch/arm/kernel/perf_event_cpu.c
> > > @@ -85,20 +85,27 @@ static void cpu_pmu_free_irq(struct arm_pmu *cpu_pmu)
> > > struct platform_device *pmu_device = cpu_pmu->plat_device;
> > > struct pmu_hw_events __percpu *hw_events = cpu_pmu->hw_events;
> > >
> > > - irqs = min(pmu_device->num_resources, num_possible_cpus());
> > > + irqs = cpu_pmu->nr_irqs;
> > >
> > > - irq = platform_get_irq(pmu_device, 0);
> > > - if (irq >= 0 && irq_is_percpu(irq)) {
> > > - on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
> > > - free_percpu_irq(irq, &hw_events->percpu_pmu);
> > > - } else {
> > > - for (i = 0; i < irqs; ++i) {
> > > - if (!cpumask_test_and_clear_cpu(i, &cpu_pmu->active_irqs))
> > > - continue;
> > > - irq = platform_get_irq(pmu_device, i);
> > > - if (irq >= 0)
> > > - free_irq(irq, per_cpu_ptr(&hw_events->percpu_pmu, i));
> > > + for (i = 0; i < irqs; i++) {
> > > + struct cpu_irq *map = &cpu_pmu->irq_map[i];
> > > + irq = map->irq;
> > > +
> > > + if (irq <= 0)
> > > + continue;
> > > +
> > > + if (irq_is_percpu(irq)) {
> > > + on_each_cpu(cpu_pmu_disable_percpu_irq, &irq, 1);
> >
> > Hmm, ok, so we're assuming that all the PMUs will be wired with PPIs in this
> > case. I have a patch allowing per-cpu interrupts to be requested for a
> > cpumask, but I suppose that can wait until it's actually needed.
>
> I wasn't too keen on assuming all CPUs, but I didn't have the facility
> to request a PPI on a subset of CPUs. If you can point me at your patch,
> I'd be happy to take a look.

The patch is here:

https://git.kernel.org/cgit/linux/kernel/git/will/linux.git/commit/?h=irq&id=774f7bc54577b6875d96e670ee34580077fc10be

But I think we can avoid it until we find a platform that needs it. I can't
see a DT/ABI issue with that, can you?

Will