2018-01-02 11:25:47

by Suzuki K Poulose

Subject: [PATCH v11 0/8] perf: Support for ARM DynamIQ Shared Unit

This series adds support for the PMU in ARM DynamIQ Shared Unit (DSU).
The DSU integrates one or more cores with an L3 memory system, control
logic, and external interfaces to form a multicore cluster. The PMU
allows counting various events related to the L3 cache, SCU, etc., using
independent 32-bit counters along with a 64-bit cycle counter.

The PMU can only be accessed via CPU system registers, which are common
to the cores connected to the same DSU. The PMU registers mostly follow
the semantics of the ARMv8 PMU, with the exception that the counters
record cluster-wide events.

Tested on a Fast model with DSU. The driver only supports ARM64 at the
moment. It can be extended to support ARM32 by providing register
accessors like we do in arch/arm64/include/asm/arm_dsu_pmu.h.

The firmware should set up the appropriate bits in ACTLR_EL3/EL2 to
allow EL1 access to the PMU registers.
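
A hypothetical sketch of the firmware side, written with Linux-style
register accessors purely for illustration (the exact ACTLR_EL3/EL2 bit
that gates EL1 access to the CLUSTERPM* registers is implementation
defined; CLUSTERPMU_ACCESS_BIT is a placeholder name, see the DSU TRM
for the real position):

	u64 actlr = read_sysreg(actlr_el3);

	/* Placeholder bit: consult the DSU TRM for the actual position */
	actlr |= CLUSTERPMU_ACCESS_BIT;
	write_sysreg(actlr, actlr_el3);
	isb();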

Series applies on v4.15-rc6 and is also available at:

git://linux-arm.org/linux-skp.git dsu/v11


Changes since V10:
- Rebased to 4.15-rc6
- Added Acked-by tag for Patch 1

Changes since V9:
- Rely on cpuhp callback for probing the PMU.
- Clear the overflow mask whenever the first CPU is brought up.
- Remove dsu_pmu_get_online_cpu(), which is not needed anymore.
- Flip the order of context migration and setting the active CPU.

Changes since V8:
- Include required header files (Mark Rutland)
- Remove Kconfig dependency on PERF_EVENTS (Mark Rutland)
- Fix typo in event name, bus_acesss => bus_access (Mark Rutland)
- Use find_first_zero_bit instead of find_next_zero_bit (Mark Rutland)
- Change order of checks in dsu_pmu_event_init (Mark Rutland)
- Allow lazy initialisation of DSU PMU to handle cases where CPUs
may be brought up later (e.g., maxcpus=N) - Mark Rutland.
- Change the CPU check to "associated_cpus" from "active_cpus",
as when we migrate the perf context we will access the DSU
from two different CPUs (source and destination).
- Fill in the "module" field for the PMU to prevent the module unload
when the PMU is active.

Changes since V7:
- No changes to the Core DSU PMU code.
- Rebased to v4.14-rc4
- Added Tested-by from Leo
- Convert arm64 CPU topology parsing, ARM PMU irq_affinity parsing
to use the new helper.

Changes since V6:
- Rebased to v4.14-rc3
- Use of_cpu_device_node_get() instead of the slower of_get_cpu_node();
the former uses per_cpu data when available and falls back to the
latter otherwise.


Suzuki K Poulose (8):
perf: Export perf_event_update_userpage
of: Add helper for mapping device node to logical CPU number
coresight: of: Use of_cpu_node_to_id helper
irqchip: gic-v3: Use of_cpu_node_to_id helper
arm64: Use of_cpu_node_to_id helper for CPU topology parsing
arm_pmu: Use of_cpu_node_to_id helper
dt-bindings: Document devicetree binding for ARM DSU PMU
perf: ARM DynamIQ Shared Unit PMU support

.../devicetree/bindings/arm/arm-dsu-pmu.txt | 27 +
Documentation/perf/arm_dsu_pmu.txt | 28 +
arch/arm64/include/asm/arm_dsu_pmu.h | 129 ++++
arch/arm64/kernel/topology.c | 16 +-
drivers/hwtracing/coresight/of_coresight.c | 15 +-
drivers/irqchip/irq-gic-v3.c | 29 +-
drivers/of/base.c | 26 +
drivers/perf/Kconfig | 9 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_dsu_pmu.c | 843 +++++++++++++++++++++
drivers/perf/arm_pmu_platform.c | 15 +-
include/linux/of.h | 7 +
kernel/events/core.c | 1 +
13 files changed, 1085 insertions(+), 61 deletions(-)
create mode 100644 Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt
create mode 100644 Documentation/perf/arm_dsu_pmu.txt
create mode 100644 arch/arm64/include/asm/arm_dsu_pmu.h
create mode 100644 drivers/perf/arm_dsu_pmu.c

--
2.13.6


2018-01-02 11:25:52

by Suzuki K Poulose

Subject: [PATCH v11 1/8] perf: Export perf_event_update_userpage

Export perf_event_update_userpage() so that PMU drivers using it
can be built as modules.

Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
kernel/events/core.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4df5b695bf0d..64ec2c9d2c44 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4904,6 +4904,7 @@ void perf_event_update_userpage(struct perf_event *event)
unlock:
rcu_read_unlock();
}
+EXPORT_SYMBOL_GPL(perf_event_update_userpage);

static int perf_mmap_fault(struct vm_fault *vmf)
{
--
2.13.6

2018-01-02 11:25:55

by Suzuki K Poulose

Subject: [PATCH v11 3/8] coresight: of: Use of_cpu_node_to_id helper

Reuse the new generic helper, of_cpu_node_to_id(), to map a
given CPU phandle to a logical CPU number.

Acked-by: Mathieu Poirier <[email protected]>
Tested-by: Leo Yan <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V4:
- Fix a regression introduced in v4, reported by bugrobot
Changes since V3:
- Reflect the renaming of the helper and return value changes
---
drivers/hwtracing/coresight/of_coresight.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/hwtracing/coresight/of_coresight.c b/drivers/hwtracing/coresight/of_coresight.c
index a18794128bf8..7c375443ede6 100644
--- a/drivers/hwtracing/coresight/of_coresight.c
+++ b/drivers/hwtracing/coresight/of_coresight.c
@@ -104,26 +104,17 @@ static int of_coresight_alloc_memory(struct device *dev,
int of_coresight_get_cpu(const struct device_node *node)
{
int cpu;
- bool found;
- struct device_node *dn, *np;
+ struct device_node *dn;

dn = of_parse_phandle(node, "cpu", 0);
-
/* Affinity defaults to CPU0 */
if (!dn)
return 0;
-
- for_each_possible_cpu(cpu) {
- np = of_cpu_device_node_get(cpu);
- found = (dn == np);
- of_node_put(np);
- if (found)
- break;
- }
+ cpu = of_cpu_node_to_id(dn);
of_node_put(dn);

/* Affinity to CPU0 if no cpu nodes are found */
- return found ? cpu : 0;
+ return (cpu < 0) ? 0 : cpu;
}
EXPORT_SYMBOL_GPL(of_coresight_get_cpu);

--
2.13.6

2018-01-02 11:26:12

by Suzuki K Poulose

Subject: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

Add support for the Cluster PMU part of the ARM DynamIQ Shared Unit (DSU).
The DSU integrates one or more cores with an L3 memory system, control
logic, and external interfaces to form a multicore cluster. The PMU
allows counting various events related to the L3 cache, SCU, etc., along
with providing a cycle counter.

The PMU can be accessed via system registers, which are common
to the cores in the same cluster. The PMU registers mostly follow
the semantics of the ARMv8 PMU, with the exception that the counters
record cluster-wide events.

This driver is mostly based on the ARMv8 and CCI PMU drivers.
The driver only supports ARM64 at the moment. It can be extended
to support ARM32 by providing register accessors like we do in
arch/arm64/include/asm/arm_dsu_pmu.h.

Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Jonathan Cameron <[email protected]>
Reviewed-by: Mark Rutland <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V9:
- Rely on cpuhp callback for probing the PMU.
- Clear the overflow mask whenever the first CPU is brought up.
- Remove dsu_pmu_get_online_cpu(), which is not needed anymore.
- Flip the order of context migration and setting the active CPU.

Changes since V8:
- Include required header files (Mark Rutland)
- Remove Kconfig dependency on PERF_EVENTS (Mark Rutland)
- Fix typo in event name, bus_acesss => bus_access (Mark Rutland)
- Use find_first_zero_bit instead of find_next_zero_bit (Mark Rutland)
- Change order of checks in dsu_pmu_event_init (Mark Rutland)
- Allow lazy initialisation of DSU PMU to handle cases where CPUs
may be brought up later (e.g., maxcpus=N) - Mark Rutland.
- Clear the interrupt overflow status upon initialisation (Mark Rutland)
- Change the CPU check to "associated_cpus" from "active_cpus",
as when we migrate the perf context we will access the DSU
from two different CPUs (source and destination).
- Fill in the "module" field for the PMU to prevent the module unload
when the PMU is active.
Changes since V6:
- Address comments from Jonathan
- Add Reviewed-by tags from Jonathan
Changes since V5:
- Address comments on V5 by Mark.
- Use IRQ_NOBALANCING for IRQ handler
- Don't expose events which could be unimplemented.
- Get rid of dsu_pmu_event_supported and allow raw event
code to be used without validating whether it is supported.
- Rename "supported_cpus" mask to "associated_cpus"
- Add Documentation for the PMU driver
- Don't disable IRQ for dsu_pmu_{enable/disable}_counters
- Use consistent return codes for validate_event/group calls.
- Check PERF_ATTACH_TASK flag in event_init.
- Allow missing CPUs in dsu_pmu_dt_get_cpus, to handle cases
where kernel could have capped nr_cpus.
- Cleanup sanity checking for the CPU before accessing DSU
- Reject events with counting CPU not associated with the DSU.
Changes since V4:
- Reflect the changed generic helper for mapping CPU id
Changes since V2:
- Cleanup dsu_pmu_device_probe error handling.
- Fix event validate_group to invert the result check of validate_event
- Return errors if we failed to parse CPUs in the DSU.
- Add MODULE_DEVICE_TABLE entry
- Use hlist_entry_safe for converting cpuhp_node to dsu_pmu.
---
Documentation/perf/arm_dsu_pmu.txt | 28 ++
arch/arm64/include/asm/arm_dsu_pmu.h | 129 ++++++
drivers/perf/Kconfig | 9 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_dsu_pmu.c | 843 +++++++++++++++++++++++++++++++++++
5 files changed, 1010 insertions(+)
create mode 100644 Documentation/perf/arm_dsu_pmu.txt
create mode 100644 arch/arm64/include/asm/arm_dsu_pmu.h
create mode 100644 drivers/perf/arm_dsu_pmu.c

diff --git a/Documentation/perf/arm_dsu_pmu.txt b/Documentation/perf/arm_dsu_pmu.txt
new file mode 100644
index 000000000000..d611e15f5add
--- /dev/null
+++ b/Documentation/perf/arm_dsu_pmu.txt
@@ -0,0 +1,28 @@
+ARM DynamIQ Shared Unit (DSU) PMU
+==================================
+
+ARM DynamIQ Shared Unit integrates one or more cores with an L3 memory system,
+control logic and external interfaces to form a multicore cluster. The PMU
+allows counting the various events related to the L3 cache, Snoop Control Unit
+etc, using 32bit independent counters. It also provides a 64bit cycle counter.
+
+The PMU can only be accessed via CPU system registers, which are common to the
+cores connected to the same DSU. Like most of the other uncore PMUs, DSU
+PMU doesn't support process specific events and cannot be used in sampling mode.
+
+The DSU provides a bitmap for a subset of implemented events via hardware
+registers. There is no way for the driver to determine if the other events
+are available or not. Hence the driver exposes only those events advertised
+by the DSU, in the "events" directory under:
+
+ /sys/bus/event_source/devices/arm_dsu_<N>/
+
+The user should refer to the TRM of the product to figure out the supported events
+and use the raw event code for the unlisted events.
+
+The driver also exposes the CPUs connected to the DSU instance in "associated_cpus".
+
+
+Example usage:
+
+ perf stat -a -e arm_dsu_0/cycles/
diff --git a/arch/arm64/include/asm/arm_dsu_pmu.h b/arch/arm64/include/asm/arm_dsu_pmu.h
new file mode 100644
index 000000000000..82e5cc3356bf
--- /dev/null
+++ b/arch/arm64/include/asm/arm_dsu_pmu.h
@@ -0,0 +1,129 @@
+/*
+ * ARM DynamIQ Shared Unit (DSU) PMU Low level register access routines.
+ *
+ * Copyright (C) ARM Limited, 2017.
+ *
+ * Author: Suzuki K Poulose <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2, as published by the Free Software Foundation.
+ */
+
+#include <linux/bitops.h>
+#include <linux/build_bug.h>
+#include <linux/compiler.h>
+#include <linux/types.h>
+#include <asm/barrier.h>
+#include <asm/sysreg.h>
+
+
+#define CLUSTERPMCR_EL1 sys_reg(3, 0, 15, 5, 0)
+#define CLUSTERPMCNTENSET_EL1 sys_reg(3, 0, 15, 5, 1)
+#define CLUSTERPMCNTENCLR_EL1 sys_reg(3, 0, 15, 5, 2)
+#define CLUSTERPMOVSSET_EL1 sys_reg(3, 0, 15, 5, 3)
+#define CLUSTERPMOVSCLR_EL1 sys_reg(3, 0, 15, 5, 4)
+#define CLUSTERPMSELR_EL1 sys_reg(3, 0, 15, 5, 5)
+#define CLUSTERPMINTENSET_EL1 sys_reg(3, 0, 15, 5, 6)
+#define CLUSTERPMINTENCLR_EL1 sys_reg(3, 0, 15, 5, 7)
+#define CLUSTERPMCCNTR_EL1 sys_reg(3, 0, 15, 6, 0)
+#define CLUSTERPMXEVTYPER_EL1 sys_reg(3, 0, 15, 6, 1)
+#define CLUSTERPMXEVCNTR_EL1 sys_reg(3, 0, 15, 6, 2)
+#define CLUSTERPMMDCR_EL1 sys_reg(3, 0, 15, 6, 3)
+#define CLUSTERPMCEID0_EL1 sys_reg(3, 0, 15, 6, 4)
+#define CLUSTERPMCEID1_EL1 sys_reg(3, 0, 15, 6, 5)
+
+static inline u32 __dsu_pmu_read_pmcr(void)
+{
+ return read_sysreg_s(CLUSTERPMCR_EL1);
+}
+
+static inline void __dsu_pmu_write_pmcr(u32 val)
+{
+ write_sysreg_s(val, CLUSTERPMCR_EL1);
+ isb();
+}
+
+static inline u32 __dsu_pmu_get_reset_overflow(void)
+{
+ u32 val = read_sysreg_s(CLUSTERPMOVSCLR_EL1);
+ /* Clear the bit */
+ write_sysreg_s(val, CLUSTERPMOVSCLR_EL1);
+ isb();
+ return val;
+}
+
+static inline void __dsu_pmu_select_counter(int counter)
+{
+ write_sysreg_s(counter, CLUSTERPMSELR_EL1);
+ isb();
+}
+
+static inline u64 __dsu_pmu_read_counter(int counter)
+{
+ __dsu_pmu_select_counter(counter);
+ return read_sysreg_s(CLUSTERPMXEVCNTR_EL1);
+}
+
+static inline void __dsu_pmu_write_counter(int counter, u64 val)
+{
+ __dsu_pmu_select_counter(counter);
+ write_sysreg_s(val, CLUSTERPMXEVCNTR_EL1);
+ isb();
+}
+
+static inline void __dsu_pmu_set_event(int counter, u32 event)
+{
+ __dsu_pmu_select_counter(counter);
+ write_sysreg_s(event, CLUSTERPMXEVTYPER_EL1);
+ isb();
+}
+
+static inline u64 __dsu_pmu_read_pmccntr(void)
+{
+ return read_sysreg_s(CLUSTERPMCCNTR_EL1);
+}
+
+static inline void __dsu_pmu_write_pmccntr(u64 val)
+{
+ write_sysreg_s(val, CLUSTERPMCCNTR_EL1);
+ isb();
+}
+
+static inline void __dsu_pmu_disable_counter(int counter)
+{
+ write_sysreg_s(BIT(counter), CLUSTERPMCNTENCLR_EL1);
+ isb();
+}
+
+static inline void __dsu_pmu_enable_counter(int counter)
+{
+ write_sysreg_s(BIT(counter), CLUSTERPMCNTENSET_EL1);
+ isb();
+}
+
+static inline void __dsu_pmu_counter_interrupt_enable(int counter)
+{
+ write_sysreg_s(BIT(counter), CLUSTERPMINTENSET_EL1);
+ isb();
+}
+
+static inline void __dsu_pmu_counter_interrupt_disable(int counter)
+{
+ write_sysreg_s(BIT(counter), CLUSTERPMINTENCLR_EL1);
+ isb();
+}
+
+
+static inline u32 __dsu_pmu_read_pmceid(int n)
+{
+ switch (n) {
+ case 0:
+ return read_sysreg_s(CLUSTERPMCEID0_EL1);
+ case 1:
+ return read_sysreg_s(CLUSTERPMCEID1_EL1);
+ default:
+ BUILD_BUG();
+ return 0;
+ }
+}
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index b8f44b068fc6..da5724cd89cf 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -17,6 +17,15 @@ config ARM_PMU_ACPI
depends on ARM_PMU && ACPI
def_bool y

+config ARM_DSU_PMU
+ tristate "ARM DynamIQ Shared Unit (DSU) PMU"
+ depends on ARM64
+ help
+ Provides support for the performance monitor unit in the ARM DynamIQ
+ Shared Unit (DSU). The DSU integrates one or more cores with an L3
+ memory system and control logic. The PMU allows counting various
+ events related to the DSU.
+
config HISI_PMU
bool "HiSilicon SoC PMU"
depends on ARM64 && ACPI
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 710a0135bd61..c2f27419bdf0 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_ARM_DSU_PMU) += arm_dsu_pmu.o
obj-$(CONFIG_ARM_PMU) += arm_pmu.o arm_pmu_platform.o
obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
obj-$(CONFIG_HISI_PMU) += hisilicon/
diff --git a/drivers/perf/arm_dsu_pmu.c b/drivers/perf/arm_dsu_pmu.c
new file mode 100644
index 000000000000..37c0526c93d5
--- /dev/null
+++ b/drivers/perf/arm_dsu_pmu.c
@@ -0,0 +1,843 @@
+/*
+ * ARM DynamIQ Shared Unit (DSU) PMU driver
+ *
+ * Copyright (C) ARM Limited, 2017.
+ *
+ * Based on ARM CCI-PMU, ARMv8 PMU-v3 drivers.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#define PMUNAME "arm_dsu"
+#define DRVNAME PMUNAME "_pmu"
+#define pr_fmt(fmt) DRVNAME ": " fmt
+
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/of_device.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#include <asm/arm_dsu_pmu.h>
+#include <asm/local64.h>
+
+/* PMU event codes */
+#define DSU_PMU_EVT_CYCLES 0x11
+#define DSU_PMU_EVT_CHAIN 0x1e
+
+#define DSU_PMU_MAX_COMMON_EVENTS 0x40
+
+#define DSU_PMU_MAX_HW_CNTRS 32
+#define DSU_PMU_HW_COUNTER_MASK (DSU_PMU_MAX_HW_CNTRS - 1)
+
+#define CLUSTERPMCR_E BIT(0)
+#define CLUSTERPMCR_P BIT(1)
+#define CLUSTERPMCR_C BIT(2)
+#define CLUSTERPMCR_N_SHIFT 11
+#define CLUSTERPMCR_N_MASK 0x1f
+#define CLUSTERPMCR_IDCODE_SHIFT 16
+#define CLUSTERPMCR_IDCODE_MASK 0xff
+#define CLUSTERPMCR_IMP_SHIFT 24
+#define CLUSTERPMCR_IMP_MASK 0xff
+#define CLUSTERPMCR_RES_MASK 0x7e8
+#define CLUSTERPMCR_RES_VAL 0x40
+
+#define DSU_ACTIVE_CPU_MASK 0x0
+#define DSU_ASSOCIATED_CPU_MASK 0x1
+
+/*
+ * We use the index of the counters as they appear in the counter
+ * bit maps in the PMU registers (e.g CLUSTERPMSELR).
+ * i.e,
+ * counter 0 - Bit 0
+ * counter 1 - Bit 1
+ * ...
+ * Cycle counter - Bit 31
+ */
+#define DSU_PMU_IDX_CYCLE_COUNTER 31
+
+/* All event counters are 32bit, with a 64bit Cycle counter */
+#define DSU_PMU_COUNTER_WIDTH(idx) \
+ (((idx) == DSU_PMU_IDX_CYCLE_COUNTER) ? 64 : 32)
+
+#define DSU_PMU_COUNTER_MASK(idx) \
+ GENMASK_ULL((DSU_PMU_COUNTER_WIDTH((idx)) - 1), 0)
+
+#define DSU_EXT_ATTR(_name, _func, _config) \
+ (&((struct dev_ext_attribute[]) { \
+ { \
+ .attr = __ATTR(_name, 0444, _func, NULL), \
+ .var = (void *)_config \
+ } \
+ })[0].attr.attr)
+
+#define DSU_EVENT_ATTR(_name, _config) \
+ DSU_EXT_ATTR(_name, dsu_pmu_sysfs_event_show, (unsigned long)_config)
+
+#define DSU_FORMAT_ATTR(_name, _config) \
+ DSU_EXT_ATTR(_name, dsu_pmu_sysfs_format_show, (char *)_config)
+
+#define DSU_CPUMASK_ATTR(_name, _config) \
+ DSU_EXT_ATTR(_name, dsu_pmu_cpumask_show, (unsigned long)_config)
+
+struct dsu_hw_events {
+ DECLARE_BITMAP(used_mask, DSU_PMU_MAX_HW_CNTRS);
+ struct perf_event *events[DSU_PMU_MAX_HW_CNTRS];
+};
+
+/*
+ * struct dsu_pmu - DSU PMU descriptor
+ *
+ * @pmu_lock : Protects accesses to DSU PMU register from normal vs
+ * interrupt handler contexts.
+ * @hw_events : Holds the event counter state.
+ * @associated_cpus : CPUs attached to the DSU.
+ * @active_cpu : CPU to which the PMU is bound for accesses.
+ * @cpuhp_node : Node for CPU hotplug notifier link.
+ * @num_counters : Number of event counters implemented by the PMU,
+ * excluding the cycle counter.
+ * @irq : Interrupt line for counter overflow.
+ * @cpmceid_bitmap : Bitmap for the availability of architected common
+ * events (event_code < 0x40).
+ */
+struct dsu_pmu {
+ struct pmu pmu;
+ struct device *dev;
+ raw_spinlock_t pmu_lock;
+ struct dsu_hw_events hw_events;
+ cpumask_t associated_cpus;
+ cpumask_t active_cpu;
+ struct hlist_node cpuhp_node;
+ u8 num_counters;
+ int irq;
+ DECLARE_BITMAP(cpmceid_bitmap, DSU_PMU_MAX_COMMON_EVENTS);
+};
+
+static unsigned long dsu_pmu_cpuhp_state;
+
+static inline struct dsu_pmu *to_dsu_pmu(struct pmu *pmu)
+{
+ return container_of(pmu, struct dsu_pmu, pmu);
+}
+
+static ssize_t dsu_pmu_sysfs_event_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dev_ext_attribute *eattr = container_of(attr,
+ struct dev_ext_attribute, attr);
+ return snprintf(buf, PAGE_SIZE, "event=0x%lx\n",
+ (unsigned long)eattr->var);
+}
+
+static ssize_t dsu_pmu_sysfs_format_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dev_ext_attribute *eattr = container_of(attr,
+ struct dev_ext_attribute, attr);
+ return snprintf(buf, PAGE_SIZE, "%s\n", (char *)eattr->var);
+}
+
+static ssize_t dsu_pmu_cpumask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
+ struct dev_ext_attribute *eattr = container_of(attr,
+ struct dev_ext_attribute, attr);
+ unsigned long mask_id = (unsigned long)eattr->var;
+ const cpumask_t *cpumask;
+
+ switch (mask_id) {
+ case DSU_ACTIVE_CPU_MASK:
+ cpumask = &dsu_pmu->active_cpu;
+ break;
+ case DSU_ASSOCIATED_CPU_MASK:
+ cpumask = &dsu_pmu->associated_cpus;
+ break;
+ default:
+ return 0;
+ }
+ return cpumap_print_to_pagebuf(true, buf, cpumask);
+}
+
+static struct attribute *dsu_pmu_format_attrs[] = {
+ DSU_FORMAT_ATTR(event, "config:0-31"),
+ NULL,
+};
+
+static const struct attribute_group dsu_pmu_format_attr_group = {
+ .name = "format",
+ .attrs = dsu_pmu_format_attrs,
+};
+
+static struct attribute *dsu_pmu_event_attrs[] = {
+ DSU_EVENT_ATTR(cycles, 0x11),
+ DSU_EVENT_ATTR(bus_access, 0x19),
+ DSU_EVENT_ATTR(memory_error, 0x1a),
+ DSU_EVENT_ATTR(bus_cycles, 0x1d),
+ DSU_EVENT_ATTR(l3d_cache_allocate, 0x29),
+ DSU_EVENT_ATTR(l3d_cache_refill, 0x2a),
+ DSU_EVENT_ATTR(l3d_cache, 0x2b),
+ DSU_EVENT_ATTR(l3d_cache_wb, 0x2c),
+ NULL,
+};
+
+static umode_t
+dsu_pmu_event_attr_is_visible(struct kobject *kobj, struct attribute *attr,
+ int unused)
+{
+ struct pmu *pmu = dev_get_drvdata(kobj_to_dev(kobj));
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
+ struct dev_ext_attribute *eattr = container_of(attr,
+ struct dev_ext_attribute, attr.attr);
+ unsigned long evt = (unsigned long)eattr->var;
+
+ return test_bit(evt, dsu_pmu->cpmceid_bitmap) ? attr->mode : 0;
+}
+
+static const struct attribute_group dsu_pmu_events_attr_group = {
+ .name = "events",
+ .attrs = dsu_pmu_event_attrs,
+ .is_visible = dsu_pmu_event_attr_is_visible,
+};
+
+static struct attribute *dsu_pmu_cpumask_attrs[] = {
+ DSU_CPUMASK_ATTR(cpumask, DSU_ACTIVE_CPU_MASK),
+ DSU_CPUMASK_ATTR(associated_cpus, DSU_ASSOCIATED_CPU_MASK),
+ NULL,
+};
+
+static const struct attribute_group dsu_pmu_cpumask_attr_group = {
+ .attrs = dsu_pmu_cpumask_attrs,
+};
+
+static const struct attribute_group *dsu_pmu_attr_groups[] = {
+ &dsu_pmu_cpumask_attr_group,
+ &dsu_pmu_events_attr_group,
+ &dsu_pmu_format_attr_group,
+ NULL,
+};
+
+static int dsu_pmu_get_online_cpu_any_but(struct dsu_pmu *dsu_pmu, int cpu)
+{
+ struct cpumask online_supported;
+
+ cpumask_and(&online_supported,
+ &dsu_pmu->associated_cpus, cpu_online_mask);
+ return cpumask_any_but(&online_supported, cpu);
+}
+
+static inline bool dsu_pmu_counter_valid(struct dsu_pmu *dsu_pmu, u32 idx)
+{
+ return (idx < dsu_pmu->num_counters) ||
+ (idx == DSU_PMU_IDX_CYCLE_COUNTER);
+}
+
+static inline u64 dsu_pmu_read_counter(struct perf_event *event)
+{
+ u64 val;
+ unsigned long flags;
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+ int idx = event->hw.idx;
+
+ if (WARN_ON(!cpumask_test_cpu(smp_processor_id(),
+ &dsu_pmu->associated_cpus)))
+ return 0;
+
+ if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
+ dev_err(event->pmu->dev,
+ "Trying reading invalid counter %d\n", idx);
+ return 0;
+ }
+
+ raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
+ if (idx == DSU_PMU_IDX_CYCLE_COUNTER)
+ val = __dsu_pmu_read_pmccntr();
+ else
+ val = __dsu_pmu_read_counter(idx);
+ raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
+
+ return val;
+}
+
+static void dsu_pmu_write_counter(struct perf_event *event, u64 val)
+{
+ unsigned long flags;
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+ int idx = event->hw.idx;
+
+ if (WARN_ON(!cpumask_test_cpu(smp_processor_id(),
+ &dsu_pmu->associated_cpus)))
+ return;
+
+ if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
+ dev_err(event->pmu->dev,
+ "writing to invalid counter %d\n", idx);
+ return;
+ }
+
+ raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
+ if (idx == DSU_PMU_IDX_CYCLE_COUNTER)
+ __dsu_pmu_write_pmccntr(val);
+ else
+ __dsu_pmu_write_counter(idx, val);
+ raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
+}
+
+static int dsu_pmu_get_event_idx(struct dsu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ int idx;
+ unsigned long evtype = event->attr.config;
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+ unsigned long *used_mask = hw_events->used_mask;
+
+ if (evtype == DSU_PMU_EVT_CYCLES) {
+ if (test_and_set_bit(DSU_PMU_IDX_CYCLE_COUNTER, used_mask))
+ return -EAGAIN;
+ return DSU_PMU_IDX_CYCLE_COUNTER;
+ }
+
+ idx = find_first_zero_bit(used_mask, dsu_pmu->num_counters);
+ if (idx >= dsu_pmu->num_counters)
+ return -EAGAIN;
+ set_bit(idx, hw_events->used_mask);
+ return idx;
+}
+
+static void dsu_pmu_enable_counter(struct dsu_pmu *dsu_pmu, int idx)
+{
+ __dsu_pmu_counter_interrupt_enable(idx);
+ __dsu_pmu_enable_counter(idx);
+}
+
+static void dsu_pmu_disable_counter(struct dsu_pmu *dsu_pmu, int idx)
+{
+ __dsu_pmu_disable_counter(idx);
+ __dsu_pmu_counter_interrupt_disable(idx);
+}
+
+static inline void dsu_pmu_set_event(struct dsu_pmu *dsu_pmu,
+ struct perf_event *event)
+{
+ int idx = event->hw.idx;
+ unsigned long flags;
+
+ if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
+ dev_err(event->pmu->dev,
+ "Trying to set invalid counter %d\n", idx);
+ return;
+ }
+
+ raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
+ __dsu_pmu_set_event(idx, event->hw.config_base);
+ raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
+}
+
+static void dsu_pmu_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 delta, prev_count, new_count;
+
+ do {
+ /* We may also be called from the irq handler */
+ prev_count = local64_read(&hwc->prev_count);
+ new_count = dsu_pmu_read_counter(event);
+ } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
+ prev_count);
+ delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
+ local64_add(delta, &event->count);
+}
+
+static void dsu_pmu_read(struct perf_event *event)
+{
+ dsu_pmu_event_update(event);
+}
+
+static inline u32 dsu_pmu_get_reset_overflow(void)
+{
+ return __dsu_pmu_get_reset_overflow();
+}
+
+/**
+ * dsu_pmu_set_event_period: Set the period for the counter.
+ *
+ * All DSU PMU event counters, except the cycle counter are 32bit
+ * counters. To handle cases of extreme interrupt latency, we program
+ * the counter with half of the max count for the counters.
+ */
+static void dsu_pmu_set_event_period(struct perf_event *event)
+{
+ int idx = event->hw.idx;
+ u64 val = DSU_PMU_COUNTER_MASK(idx) >> 1;
+
+ local64_set(&event->hw.prev_count, val);
+ dsu_pmu_write_counter(event, val);
+}
+
+static irqreturn_t dsu_pmu_handle_irq(int irq_num, void *dev)
+{
+ int i;
+ bool handled = false;
+ struct dsu_pmu *dsu_pmu = dev;
+ struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
+ unsigned long overflow;
+
+ overflow = dsu_pmu_get_reset_overflow();
+ if (!overflow)
+ return IRQ_NONE;
+
+ for_each_set_bit(i, &overflow, DSU_PMU_MAX_HW_CNTRS) {
+ struct perf_event *event = hw_events->events[i];
+
+ if (!event)
+ continue;
+ dsu_pmu_event_update(event);
+ dsu_pmu_set_event_period(event);
+ handled = true;
+ }
+
+ return IRQ_RETVAL(handled);
+}
+
+static void dsu_pmu_start(struct perf_event *event, int pmu_flags)
+{
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+
+ /* We always reprogram the counter */
+ if (pmu_flags & PERF_EF_RELOAD)
+ WARN_ON(!(event->hw.state & PERF_HES_UPTODATE));
+ dsu_pmu_set_event_period(event);
+ if (event->hw.idx != DSU_PMU_IDX_CYCLE_COUNTER)
+ dsu_pmu_set_event(dsu_pmu, event);
+ event->hw.state = 0;
+ dsu_pmu_enable_counter(dsu_pmu, event->hw.idx);
+}
+
+static void dsu_pmu_stop(struct perf_event *event, int pmu_flags)
+{
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+ dsu_pmu_disable_counter(dsu_pmu, event->hw.idx);
+ dsu_pmu_event_update(event);
+ event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+}
+
+static int dsu_pmu_add(struct perf_event *event, int flags)
+{
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+ struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx;
+
+ if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
+ &dsu_pmu->associated_cpus)))
+ return -ENOENT;
+
+ idx = dsu_pmu_get_event_idx(hw_events, event);
+ if (idx < 0)
+ return idx;
+
+ hwc->idx = idx;
+ hw_events->events[idx] = event;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (flags & PERF_EF_START)
+ dsu_pmu_start(event, PERF_EF_RELOAD);
+
+ perf_event_update_userpage(event);
+ return 0;
+}
+
+static void dsu_pmu_del(struct perf_event *event, int flags)
+{
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+ struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx = hwc->idx;
+
+ dsu_pmu_stop(event, PERF_EF_UPDATE);
+ hw_events->events[idx] = NULL;
+ clear_bit(idx, hw_events->used_mask);
+ perf_event_update_userpage(event);
+}
+
+static void dsu_pmu_enable(struct pmu *pmu)
+{
+ u32 pmcr;
+ unsigned long flags;
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
+
+ /* If no counters are added, skip enabling the PMU */
+ if (bitmap_empty(dsu_pmu->hw_events.used_mask, DSU_PMU_MAX_HW_CNTRS))
+ return;
+
+ raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
+ pmcr = __dsu_pmu_read_pmcr();
+ pmcr |= CLUSTERPMCR_E;
+ __dsu_pmu_write_pmcr(pmcr);
+ raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
+}
+
+static void dsu_pmu_disable(struct pmu *pmu)
+{
+ u32 pmcr;
+ unsigned long flags;
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
+
+ raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
+ pmcr = __dsu_pmu_read_pmcr();
+ pmcr &= ~CLUSTERPMCR_E;
+ __dsu_pmu_write_pmcr(pmcr);
+ raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
+}
+
+static bool dsu_pmu_validate_event(struct pmu *pmu,
+ struct dsu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ if (is_software_event(event))
+ return true;
+ /* Reject groups spanning multiple HW PMUs. */
+ if (event->pmu != pmu)
+ return false;
+ return dsu_pmu_get_event_idx(hw_events, event) >= 0;
+}
+
+/*
+ * Make sure the group of events can be scheduled at once
+ * on the PMU.
+ */
+static bool dsu_pmu_validate_group(struct perf_event *event)
+{
+ struct perf_event *sibling, *leader = event->group_leader;
+ struct dsu_hw_events fake_hw;
+
+ if (event->group_leader == event)
+ return true;
+
+ memset(fake_hw.used_mask, 0, sizeof(fake_hw.used_mask));
+ if (!dsu_pmu_validate_event(event->pmu, &fake_hw, leader))
+ return false;
+ list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
+ if (!dsu_pmu_validate_event(event->pmu, &fake_hw, sibling))
+ return false;
+ }
+ return dsu_pmu_validate_event(event->pmu, &fake_hw, event);
+}
+
+static int dsu_pmu_event_init(struct perf_event *event)
+{
+ struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* We don't support sampling */
+ if (is_sampling_event(event)) {
+ dev_dbg(dsu_pmu->pmu.dev, "Can't support sampling events\n");
+ return -EOPNOTSUPP;
+ }
+
+ /* We cannot support task bound events */
+ if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) {
+ dev_dbg(dsu_pmu->pmu.dev, "Can't support per-task counters\n");
+ return -EINVAL;
+ }
+
+ if (has_branch_stack(event) ||
+ event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest) {
+ dev_dbg(dsu_pmu->pmu.dev, "Can't support filtering\n");
+ return -EINVAL;
+ }
+
+ if (!cpumask_test_cpu(event->cpu, &dsu_pmu->associated_cpus)) {
+ dev_dbg(dsu_pmu->pmu.dev,
+ "Requested cpu is not associated with the DSU\n");
+ return -EINVAL;
+ }
+ /*
+ * Choose the current active CPU to read the events. We don't want
+ * to migrate the event contexts, irq handling etc to the requested
+ * CPU. As long as the requested CPU is within the same DSU, we
+ * are fine.
+ */
+ event->cpu = cpumask_first(&dsu_pmu->active_cpu);
+ if (event->cpu >= nr_cpu_ids)
+ return -EINVAL;
+ if (!dsu_pmu_validate_group(event))
+ return -EINVAL;
+
+ event->hw.config_base = event->attr.config;
+ return 0;
+}
+
+static struct dsu_pmu *dsu_pmu_alloc(struct platform_device *pdev)
+{
+ struct dsu_pmu *dsu_pmu;
+
+ dsu_pmu = devm_kzalloc(&pdev->dev, sizeof(*dsu_pmu), GFP_KERNEL);
+ if (!dsu_pmu)
+ return ERR_PTR(-ENOMEM);
+
+ raw_spin_lock_init(&dsu_pmu->pmu_lock);
+ /*
+ * Initialise the number of counters to -1, until we probe
+ * the real number on a connected CPU.
+ */
+ dsu_pmu->num_counters = -1;
+ return dsu_pmu;
+}
+
+/**
+ * dsu_pmu_dt_get_cpus: Get the list of CPUs in the cluster.
+ */
+static int dsu_pmu_dt_get_cpus(struct device_node *dev, cpumask_t *mask)
+{
+ int i = 0, n, cpu;
+ struct device_node *cpu_node;
+
+ n = of_count_phandle_with_args(dev, "cpus", NULL);
+ if (n <= 0)
+ return -ENODEV;
+ for (; i < n; i++) {
+ cpu_node = of_parse_phandle(dev, "cpus", i);
+ if (!cpu_node)
+ break;
+ cpu = of_cpu_node_to_id(cpu_node);
+ of_node_put(cpu_node);
+ /*
+ * We have to ignore the failures here and continue scanning
+ * the list to handle cases where the nr_cpus could be capped
+ * in the running kernel.
+ */
+ if (cpu < 0)
+ continue;
+ cpumask_set_cpu(cpu, mask);
+ }
+ return 0;
+}
+
+/*
+ * dsu_pmu_probe_pmu: Probe the PMU details on a CPU in the cluster.
+ */
+static void dsu_pmu_probe_pmu(struct dsu_pmu *dsu_pmu)
+{
+ u64 num_counters;
+ u32 cpmceid[2];
+
+ num_counters = (__dsu_pmu_read_pmcr() >> CLUSTERPMCR_N_SHIFT) &
+ CLUSTERPMCR_N_MASK;
+ /* We can only support up to 31 independent counters */
+ if (WARN_ON(num_counters > 31))
+ num_counters = 31;
+ dsu_pmu->num_counters = num_counters;
+ if (!dsu_pmu->num_counters)
+ return;
+ cpmceid[0] = __dsu_pmu_read_pmceid(0);
+ cpmceid[1] = __dsu_pmu_read_pmceid(1);
+ bitmap_from_u32array(dsu_pmu->cpmceid_bitmap,
+ DSU_PMU_MAX_COMMON_EVENTS,
+ cpmceid,
+ ARRAY_SIZE(cpmceid));
+}
+
+static void dsu_pmu_set_active_cpu(int cpu, struct dsu_pmu *dsu_pmu)
+{
+ cpumask_set_cpu(cpu, &dsu_pmu->active_cpu);
+ if (irq_set_affinity_hint(dsu_pmu->irq, &dsu_pmu->active_cpu))
+ pr_warn("Failed to set irq affinity to %d\n", cpu);
+}
+
+/*
+ * dsu_pmu_init_pmu: Initialise the DSU PMU configurations if
+ * we haven't done it already.
+ */
+static void dsu_pmu_init_pmu(struct dsu_pmu *dsu_pmu)
+{
+ if (dsu_pmu->num_counters == -1)
+ dsu_pmu_probe_pmu(dsu_pmu);
+ /* Reset the interrupt overflow mask */
+ dsu_pmu_get_reset_overflow();
+}
+
+static int dsu_pmu_device_probe(struct platform_device *pdev)
+{
+ int irq, rc;
+ struct dsu_pmu *dsu_pmu;
+ char *name;
+ static atomic_t pmu_idx = ATOMIC_INIT(-1);
+
+ dsu_pmu = dsu_pmu_alloc(pdev);
+ if (IS_ERR(dsu_pmu))
+ return PTR_ERR(dsu_pmu);
+
+ rc = dsu_pmu_dt_get_cpus(pdev->dev.of_node, &dsu_pmu->associated_cpus);
+ if (rc) {
+ dev_warn(&pdev->dev, "Failed to parse the CPUs\n");
+ return rc;
+ }
+
+ irq = platform_get_irq(pdev, 0);
+ if (irq < 0) {
+ dev_warn(&pdev->dev, "Failed to find IRQ\n");
+ return -EINVAL;
+ }
+
+ name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "%s_%d",
+ PMUNAME, atomic_inc_return(&pmu_idx));
+ if (!name)
+ return -ENOMEM;
+ rc = devm_request_irq(&pdev->dev, irq, dsu_pmu_handle_irq,
+ IRQF_NOBALANCING, name, dsu_pmu);
+ if (rc) {
+ dev_warn(&pdev->dev, "Failed to request IRQ %d\n", irq);
+ return rc;
+ }
+
+ dsu_pmu->irq = irq;
+ platform_set_drvdata(pdev, dsu_pmu);
+ rc = cpuhp_state_add_instance(dsu_pmu_cpuhp_state,
+ &dsu_pmu->cpuhp_node);
+ if (rc)
+ return rc;
+
+ dsu_pmu->pmu = (struct pmu) {
+ .task_ctx_nr = perf_invalid_context,
+ .module = THIS_MODULE,
+ .pmu_enable = dsu_pmu_enable,
+ .pmu_disable = dsu_pmu_disable,
+ .event_init = dsu_pmu_event_init,
+ .add = dsu_pmu_add,
+ .del = dsu_pmu_del,
+ .start = dsu_pmu_start,
+ .stop = dsu_pmu_stop,
+ .read = dsu_pmu_read,
+
+ .attr_groups = dsu_pmu_attr_groups,
+ };
+
+ rc = perf_pmu_register(&dsu_pmu->pmu, name, -1);
+ if (rc) {
+ cpuhp_state_remove_instance(dsu_pmu_cpuhp_state,
+ &dsu_pmu->cpuhp_node);
+ irq_set_affinity_hint(dsu_pmu->irq, NULL);
+ }
+
+ return rc;
+}
+
+static int dsu_pmu_device_remove(struct platform_device *pdev)
+{
+ struct dsu_pmu *dsu_pmu = platform_get_drvdata(pdev);
+
+ perf_pmu_unregister(&dsu_pmu->pmu);
+ cpuhp_state_remove_instance(dsu_pmu_cpuhp_state, &dsu_pmu->cpuhp_node);
+ irq_set_affinity_hint(dsu_pmu->irq, NULL);
+
+ return 0;
+}
+
+static const struct of_device_id dsu_pmu_of_match[] = {
+ { .compatible = "arm,dsu-pmu", },
+ {},
+};
+
+static struct platform_driver dsu_pmu_driver = {
+ .driver = {
+ .name = DRVNAME,
+ .of_match_table = of_match_ptr(dsu_pmu_of_match),
+ },
+ .probe = dsu_pmu_device_probe,
+ .remove = dsu_pmu_device_remove,
+};
+
+static int dsu_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+ struct dsu_pmu *dsu_pmu = hlist_entry_safe(node, struct dsu_pmu,
+ cpuhp_node);
+
+ if (!cpumask_test_cpu(cpu, &dsu_pmu->associated_cpus))
+ return 0;
+
+ /* If the PMU is already managed, there is nothing to do */
+ if (!cpumask_empty(&dsu_pmu->active_cpu))
+ return 0;
+
+ dsu_pmu_init_pmu(dsu_pmu);
+ dsu_pmu_set_active_cpu(cpu, dsu_pmu);
+
+ return 0;
+}
+
+static int dsu_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
+{
+ int dst;
+ struct dsu_pmu *dsu_pmu = hlist_entry_safe(node, struct dsu_pmu,
+ cpuhp_node);
+
+ if (!cpumask_test_and_clear_cpu(cpu, &dsu_pmu->active_cpu))
+ return 0;
+
+ dst = dsu_pmu_get_online_cpu_any_but(dsu_pmu, cpu);
+ /* If there are no active CPUs in the DSU, leave IRQ disabled */
+ if (dst >= nr_cpu_ids) {
+ irq_set_affinity_hint(dsu_pmu->irq, NULL);
+ return 0;
+ }
+
+ perf_pmu_migrate_context(&dsu_pmu->pmu, cpu, dst);
+ dsu_pmu_set_active_cpu(dst, dsu_pmu);
+
+ return 0;
+}
+
+static int __init dsu_pmu_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+ DRVNAME,
+ dsu_pmu_cpu_online,
+ dsu_pmu_cpu_teardown);
+ if (ret < 0)
+ return ret;
+ dsu_pmu_cpuhp_state = ret;
+ return platform_driver_register(&dsu_pmu_driver);
+}
+
+static void __exit dsu_pmu_exit(void)
+{
+ platform_driver_unregister(&dsu_pmu_driver);
+ cpuhp_remove_multi_state(dsu_pmu_cpuhp_state);
+}
+
+module_init(dsu_pmu_init);
+module_exit(dsu_pmu_exit);
+
+MODULE_DEVICE_TABLE(of, dsu_pmu_of_match);
+MODULE_DESCRIPTION("Perf driver for ARM DynamIQ Shared Unit");
+MODULE_AUTHOR("Suzuki K Poulose <[email protected]>");
+MODULE_LICENSE("GPL v2");
--
2.13.6

2018-01-02 11:26:32

by Suzuki K Poulose

Subject: [PATCH v11 7/8] dt-bindings: Document devicetree binding for ARM DSU PMU

This patch documents the devicetree binding for the ARM DSU PMU.

Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: [email protected]
Cc: [email protected]
Acked-by: Rob Herring <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V3:
- Fixed node name in the example, suggested by Rob
---
.../devicetree/bindings/arm/arm-dsu-pmu.txt | 27 ++++++++++++++++++++++
1 file changed, 27 insertions(+)
create mode 100644 Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt

diff --git a/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt b/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt
new file mode 100644
index 000000000000..6efabba530f1
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/arm-dsu-pmu.txt
@@ -0,0 +1,27 @@
+* ARM DynamIQ Shared Unit (DSU) Performance Monitor Unit (PMU)
+
+ARM DynamIQ Shared Unit (DSU) integrates one or more CPU cores
+with a shared L3 memory system, control logic and external interfaces to
+form a multicore cluster. The PMU enables gathering various statistics on
+the operations of the DSU. The PMU provides independent 32bit counters that
+can count any of the supported events, along with a 64bit cycle counter.
+The PMU is accessed via CPU system registers and has no MMIO component.
+
+** DSU PMU required properties:
+
+- compatible : should be one of :
+
+ "arm,dsu-pmu"
+
+- interrupts : Exactly 1 SPI must be listed.
+
+- cpus : List of phandles for the CPUs connected to this DSU instance.
+
+
+** Example:
+
+dsu-pmu-0 {
+ compatible = "arm,dsu-pmu";
+ interrupts = <GIC_SPI 02 IRQ_TYPE_LEVEL_HIGH>;
+ cpus = <&cpu_0>, <&cpu_1>;
+};
--
2.13.6

2018-01-02 11:26:03

by Suzuki K Poulose

Subject: [PATCH v11 6/8] arm_pmu: Use of_cpu_node_to_id helper

Use the new generic helper, of_cpu_node_to_id(), to map a
phandle to the logical CPU number while parsing the
PMU irq affinity.

Cc: Will Deacon <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
drivers/perf/arm_pmu_platform.c | 15 +++------------
1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/drivers/perf/arm_pmu_platform.c b/drivers/perf/arm_pmu_platform.c
index 91b224eced18..46501cc79fd7 100644
--- a/drivers/perf/arm_pmu_platform.c
+++ b/drivers/perf/arm_pmu_platform.c
@@ -82,19 +82,10 @@ static int pmu_parse_irq_affinity(struct device_node *node, int i)
return -EINVAL;
}

- /* Now look up the logical CPU number */
- for_each_possible_cpu(cpu) {
- struct device_node *cpu_dn;
-
- cpu_dn = of_cpu_device_node_get(cpu);
- of_node_put(cpu_dn);
-
- if (dn == cpu_dn)
- break;
- }
-
- if (cpu >= nr_cpu_ids) {
+ cpu = of_cpu_node_to_id(dn);
+ if (cpu < 0) {
pr_warn("failed to find logical CPU for %s\n", dn->name);
+ cpu = nr_cpu_ids;
}

of_node_put(dn);
--
2.13.6

2018-01-02 11:26:00

by Suzuki K Poulose

Subject: [PATCH v11 5/8] arm64: Use of_cpu_node_to_id helper for CPU topology parsing

Make use of the new generic helper to convert the of_node of a CPU
to its logical CPU id when parsing the topology.

Cc: Catalin Marinas <[email protected]>
Cc: Leo Yan <[email protected]>
Cc: Will Deacon <[email protected]>
Acked-by: Mark Rutland <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
arch/arm64/kernel/topology.c | 16 ++++++----------
1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 8d48b233e6ce..21868530018e 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -37,18 +37,14 @@ static int __init get_cpu_for_node(struct device_node *node)
if (!cpu_node)
return -1;

- for_each_possible_cpu(cpu) {
- if (of_get_cpu_node(cpu, NULL) == cpu_node) {
- topology_parse_cpu_capacity(cpu_node, cpu);
- of_node_put(cpu_node);
- return cpu;
- }
- }
-
- pr_crit("Unable to find CPU node for %pOF\n", cpu_node);
+ cpu = of_cpu_node_to_id(cpu_node);
+ if (cpu >= 0)
+ topology_parse_cpu_capacity(cpu_node, cpu);
+ else
+ pr_crit("Unable to find CPU node for %pOF\n", cpu_node);

of_node_put(cpu_node);
- return -1;
+ return cpu;
}

static int __init parse_core(struct device_node *core, int cluster_id,
--
2.13.6

2018-01-02 11:25:58

by Suzuki K Poulose

Subject: [PATCH v11 4/8] irqchip: gic-v3: Use of_cpu_node_to_id helper

Use the new generic helper of_cpu_node_to_id() instead
of using our own version to map a device node to a logical CPU
number.

Acked-by: Marc Zyngier <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V3:
- Reflect the change in the helper name and return value.
---
drivers/irqchip/irq-gic-v3.c | 29 ++---------------------------
1 file changed, 2 insertions(+), 27 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index b56c3e23f0af..9a7a15049903 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1070,31 +1070,6 @@ static int __init gic_validate_dist_version(void __iomem *dist_base)
return 0;
}

-static int get_cpu_number(struct device_node *dn)
-{
- const __be32 *cell;
- u64 hwid;
- int cpu;
-
- cell = of_get_property(dn, "reg", NULL);
- if (!cell)
- return -1;
-
- hwid = of_read_number(cell, of_n_addr_cells(dn));
-
- /*
- * Non affinity bits must be set to 0 in the DT
- */
- if (hwid & ~MPIDR_HWID_BITMASK)
- return -1;
-
- for_each_possible_cpu(cpu)
- if (cpu_logical_map(cpu) == hwid)
- return cpu;
-
- return -1;
-}
-
/* Create all possible partitions at boot time */
static void __init gic_populate_ppi_partitions(struct device_node *gic_node)
{
@@ -1145,8 +1120,8 @@ static void __init gic_populate_ppi_partitions(struct device_node *gic_node)
if (WARN_ON(!cpu_node))
continue;

- cpu = get_cpu_number(cpu_node);
- if (WARN_ON(cpu == -1))
+ cpu = of_cpu_node_to_id(cpu_node);
+ if (WARN_ON(cpu < 0))
continue;

pr_cont("%pOF[%d] ", cpu_node, cpu);
--
2.13.6

2018-01-02 11:27:36

by Suzuki K Poulose

Subject: [PATCH v11 2/8] of: Add helper for mapping device node to logical CPU number

Add a helper to map a device node to a logical CPU number to avoid
duplication. Currently this is open coded in different places (e.g.
gic-v3, coresight). The helper tries to map the device node to a
"possible" logical CPU id, which may not be online yet. It is the
responsibility of the user to make sure that the CPU is online. The
helper uses of_cpu_device_node_get() to retrieve the device node for a
given CPU (which uses per_cpu data if available and falls back to the
slower of_get_cpu_node() otherwise).
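
A minimal sketch of the intended calling pattern (mirroring the
coresight conversion later in this series); error handling is trimmed
and "node" is a placeholder for the caller's device node:

	struct device_node *dn = of_parse_phandle(node, "cpu", 0);
	int cpu = -ENODEV;

	if (dn) {
		/* Returns the logical CPU number, or -ENODEV if not found */
		cpu = of_cpu_node_to_id(dn);
		of_node_put(dn);
	}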

Cc: [email protected]
Cc: Frank Rowand <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Sudeep Holla <[email protected]>
Reviewed-by: Marc Zyngier <[email protected]>
Reviewed-by: Rob Herring <[email protected]>
Signed-off-by: Suzuki K Poulose <[email protected]>
---
Changes since V6:
- Use the faster of_cpu_device_node_get() instead of of_get_cpu_node();
the former now falls back to the latter if called early.
Changes since V3:
- Renamed the helper to of_cpu_node_to_id(), suggested by Rob
- Return -ENODEV on failure rather than nr_cpu_ids
---
drivers/of/base.c | 26 ++++++++++++++++++++++++++
include/linux/of.h | 7 +++++++
2 files changed, 33 insertions(+)

diff --git a/drivers/of/base.c b/drivers/of/base.c
index 26618ba8f92a..a9d6fe86585b 100644
--- a/drivers/of/base.c
+++ b/drivers/of/base.c
@@ -316,6 +316,32 @@ struct device_node *of_get_cpu_node(int cpu, unsigned int *thread)
EXPORT_SYMBOL(of_get_cpu_node);

/**
+ * of_cpu_node_to_id: Get the logical CPU number for a given device_node
+ *
+ * @cpu_node: Pointer to the device_node for CPU.
+ *
+ * Returns the logical CPU number of the given CPU device_node.
+ * Returns -ENODEV if the CPU is not found.
+ */
+int of_cpu_node_to_id(struct device_node *cpu_node)
+{
+ int cpu;
+ bool found = false;
+ struct device_node *np;
+
+ for_each_possible_cpu(cpu) {
+ np = of_cpu_device_node_get(cpu);
+ found = (cpu_node == np);
+ of_node_put(np);
+ if (found)
+ return cpu;
+ }
+
+ return -ENODEV;
+}
+EXPORT_SYMBOL(of_cpu_node_to_id);
+
+/**
* __of_device_is_compatible() - Check if the node matches given constraints
* @device: pointer to node
* @compat: required compatible string, NULL or "" for any match
diff --git a/include/linux/of.h b/include/linux/of.h
index d3dea1d1e3a9..173102dafb07 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -544,6 +544,8 @@ const char *of_prop_next_string(struct property *prop, const char *cur);

bool of_console_check(struct device_node *dn, char *name, int index);

+extern int of_cpu_node_to_id(struct device_node *np);
+
#else /* CONFIG_OF */

static inline void of_core_init(void)
@@ -916,6 +918,11 @@ static inline void of_property_clear_flag(struct property *p, unsigned long flag
{
}

+static inline int of_cpu_node_to_id(struct device_node *np)
+{
+ return -ENODEV;
+}
+
#define of_match_ptr(_ptr) NULL
#define of_match_node(_matches, _node) NULL
#endif /* CONFIG_OF */
--
2.13.6

2018-02-22 02:34:18

by Saravana Kannan

Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> Add support for the Cluster PMU part of the ARM DynamIQ Shared Unit (DSU).
> The DSU integrates one or more cores with an L3 memory system, control
> logic, and external interfaces to form a multicore cluster. The PMU
> allows counting various events related to the L3 cache, SCU, etc., along
> with providing a cycle counter.
>
> The PMU can be accessed via system registers, which are common
> to the cores in the same cluster. The PMU registers mostly follow
> the semantics of the ARMv8 PMU, with the exception that the counters
> record cluster-wide events.
>
> This driver is mostly based on the ARMv8 and CCI PMU drivers.
> The driver only supports ARM64 at the moment. It can be extended
> to support ARM32 by providing register accessors like we do in
> arch/arm64/include/asm/arm_dsu_pmu.h.
>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Reviewed-by: Jonathan Cameron <[email protected]>
> Reviewed-by: Mark Rutland <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V9:
> - Rely on cpuhp callback for probing the PMU.
> - Clear the overflow mask whenever the first CPU is brought up.
> - Remove dsu_pmu_get_online_cpu(), which is not needed anymore.
> - Flip the order of context migration and setting the active CPU.
>
> Changes since V8:
> - Include required header files (Mark Rutland)
> - Remove Kconfig dependency on PERF_EVENTS (Mark Rutland)
> - Fix typo in event name, bus_acesss => bus_access (Mark Rutland)
> - Use find_first_zero_bit instead of find_next_zero_bit (Mark Rutland)
> - Change order of checks in dsu_pmu_event_init (Mark Rutland)
> - Allow lazy initialisation of DSU PMU to handle cases where CPUs
> may be brought up later (e.g., maxcpus=N) - Mark Rutland.
> - Clear the interrupt overflow status upon initialisation (Mark Rutland)
> - Change the CPU check to "associated_cpus" from "active_cpus",
> as when we migrate the perf context we will access the DSU
> from two different CPUs (source and destination).
> - Fill in the "module" field for the PMU to prevent the module unload
> when the PMU is active.
> Changes since V6:
> - Address comments from Jonathan
> - Add Reviewed-by tags from Jonathan
> Changes since V5:
> - Address comments on V5 by Mark.
> - Use IRQ_NOBALANCING for IRQ handler
> - Don't expose events which could be unimplemented.
> - Get rid of dsu_pmu_event_supported and allow raw event
> code to be used without validating whether it is supported.
> - Rename "supported_cpus" mask to "associated_cpus"
> - Add Documentation for the PMU driver
> - Don't disable IRQ for dsu_pmu_{enable/disable}_counters
> - Use consistent return codes for validate_event/group calls.
> - Check PERF_ATTACH_TASK flag in event_init.
> - Allow missing CPUs in dsu_pmu_dt_get_cpus, to handle cases
> where kernel could have capped nr_cpus.
> - Cleanup sanity checking for the CPU before accessing DSU
> - Reject events with counting CPU not associated with the DSU.
> Changes since V4:
> - Reflect the changed generic helper for mapping CPU id
> Changes since V2:
> - Cleanup dsu_pmu_device_probe error handling.
> - Fix event validate_group to invert the result check of validate_event
> - Return errors if we failed to parse CPUs in the DSU.
> - Add MODULE_DEVICE_TABLE entry
> - Use hlist_entry_safe for converting cpuhp_node to dsu_pmu.
> ---
> Documentation/perf/arm_dsu_pmu.txt | 28 ++
> arch/arm64/include/asm/arm_dsu_pmu.h | 129 ++++++
> drivers/perf/Kconfig | 9 +
> drivers/perf/Makefile | 1 +
> drivers/perf/arm_dsu_pmu.c | 843 +++++++++++++++++++++++++++++++++++
> 5 files changed, 1010 insertions(+)
> create mode 100644 Documentation/perf/arm_dsu_pmu.txt
> create mode 100644 arch/arm64/include/asm/arm_dsu_pmu.h
> create mode 100644 drivers/perf/arm_dsu_pmu.c
>

<SNIP>

> +
> +static int dsu_pmu_event_init(struct perf_event *event)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;

You are checking if the caller set the attr.type "correctly".

> +
> + /* We don't support sampling */
> + if (is_sampling_event(event)) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support sampling events\n");
> + return -EOPNOTSUPP;
> + }
> +
> + /* We cannot support task bound events */
> + if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support per-task counters\n");
> + return -EINVAL;
> + }
> +
> + if (has_branch_stack(event) ||
> + event->attr.exclude_user ||
> + event->attr.exclude_kernel ||
> + event->attr.exclude_hv ||
> + event->attr.exclude_idle ||
> + event->attr.exclude_host ||
> + event->attr.exclude_guest) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support filtering\n");
> + return -EINVAL;
> + }
> +
> + if (!cpumask_test_cpu(event->cpu, &dsu_pmu->associated_cpus)) {
> + dev_dbg(dsu_pmu->pmu.dev,
> + "Requested cpu is not associated with the DSU\n");
> + return -EINVAL;
> + }
> + /*
> + * Choose the current active CPU to read the events. We don't want
> + * to migrate the event contexts, irq handling etc to the requested
> + * CPU. As long as the requested CPU is within the same DSU, we
> + * are fine.
> + */
> + event->cpu = cpumask_first(&dsu_pmu->active_cpu);
> + if (event->cpu >= nr_cpu_ids)
> + return -EINVAL;
> + if (!dsu_pmu_validate_group(event))
> + return -EINVAL;
> +
> + event->hw.config_base = event->attr.config;
> + return 0;
> +}
> +

<SNIP>

> +
> +static int dsu_pmu_device_probe(struct platform_device *pdev)
> +{
> + int irq, rc;
> + struct dsu_pmu *dsu_pmu;
> + char *name;
> + static atomic_t pmu_idx = ATOMIC_INIT(-1);
> +
> + dsu_pmu = dsu_pmu_alloc(pdev);
> + if (IS_ERR(dsu_pmu))
> + return PTR_ERR(dsu_pmu);
> +
> + rc = dsu_pmu_dt_get_cpus(pdev->dev.of_node, &dsu_pmu->associated_cpus);
> + if (rc) {
> + dev_warn(&pdev->dev, "Failed to parse the CPUs\n");
> + return rc;
> + }
> +
> + irq = platform_get_irq(pdev, 0);
> + if (irq < 0) {
> + dev_warn(&pdev->dev, "Failed to find IRQ\n");
> + return -EINVAL;
> + }
> +
> + name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "%s_%d",
> + PMUNAME, atomic_inc_return(&pmu_idx));
> + if (!name)
> + return -ENOMEM;
> + rc = devm_request_irq(&pdev->dev, irq, dsu_pmu_handle_irq,
> + IRQF_NOBALANCING, name, dsu_pmu);
> + if (rc) {
> + dev_warn(&pdev->dev, "Failed to request IRQ %d\n", irq);
> + return rc;
> + }
> +
> + dsu_pmu->irq = irq;
> + platform_set_drvdata(pdev, dsu_pmu);
> + rc = cpuhp_state_add_instance(dsu_pmu_cpuhp_state,
> + &dsu_pmu->cpuhp_node);
> + if (rc)
> + return rc;
> +
> + dsu_pmu->pmu = (struct pmu) {
> + .task_ctx_nr = perf_invalid_context,
> + .module = THIS_MODULE,
> + .pmu_enable = dsu_pmu_enable,
> + .pmu_disable = dsu_pmu_disable,
> + .event_init = dsu_pmu_event_init,
> + .add = dsu_pmu_add,
> + .del = dsu_pmu_del,
> + .start = dsu_pmu_start,
> + .stop = dsu_pmu_stop,
> + .read = dsu_pmu_read,
> +
> + .attr_groups = dsu_pmu_attr_groups,
> + };
> +
> + rc = perf_pmu_register(&dsu_pmu->pmu, name, -1);

You are passing in -1 here, which means the event type is assigned by
the perf framework. The perf framework uses idr_alloc(&pmu_idr, ...) to
get the id, so the id assigned is going to depend on the probe order
among the different PMU drivers on the board/platform. This seems
pretty random.

How is the caller supposed to know what to set the "type" to?

You also can't just delete the check in dsu_pmu_event_init() because the
event numbers you expose overlap with the per-CPU event numbers.

I'm not exactly sure if we can add entries to perf_type_id. If that's
allowed maybe we need to add something like PERF_TYPE_DSU and use that?

Or if that's not allowed then would it be better to offset the DSU PMU
events by some number (say 0x1000) and then delete the event type check
or pass PERF_TYPE_RAW to perf_pmu_register()?

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-02-22 11:36:02

by Mark Rutland

Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Wed, Feb 21, 2018 at 06:32:46PM -0800, Saravana Kannan wrote:
> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> > +static int dsu_pmu_event_init(struct perf_event *event)
> > +{
> > + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> > +
> > + if (event->attr.type != event->pmu->type)
> > + return -ENOENT;
>
> You are checking if the caller set the attr.type "correctly".

This is necessary for the case where perf_init_event() falls back to
iterating over the list of PMUs, if event->attr.type wasn't found in the
idr.

Without this, we'd erroneously check events intended for other PMUs.
So this is correct, and necessary.

[...]

> > +static int dsu_pmu_device_probe(struct platform_device *pdev)

> > + rc = perf_pmu_register(&dsu_pmu->pmu, name, -1);
>
> You are passing in -1 here, which means the event type is assigned by the
> perf framework. The perf framework uses idr_alloc(&pmu_idr, ...) to get the
> id, so the id assigned is going to depend on the probe order among the
> different PMU drivers on the board/platform. This seems pretty random.

The dynamic IDs are supposed to be looked up by name.

Each PMU has a folder: /sys/bus/event_source/devices/$PMU

... with /sys/bus/event_source/devices/$PMU/type giving the type.

> How is the caller supposed to know what to set the "type" to?

The perf tools understand this already. If you do:

perf stat -e $PMU/config=0xf00/

... they will look up the type for that PMU and use it automatically.
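
For reference, a minimal sketch of that lookup in C (the PMU name and
event code below are only illustrative):

	/*
	 * Read the dynamic type from sysfs, then open a counting event
	 * on that PMU. Error handling kept minimal for brevity.
	 */
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/perf_event.h>

	static int open_pmu_event(const char *pmu, __u64 config, int cpu)
	{
		struct perf_event_attr attr;
		char path[128];
		FILE *f;
		int type;

		snprintf(path, sizeof(path),
			 "/sys/bus/event_source/devices/%s/type", pmu);
		f = fopen(path, "r");
		if (!f)
			return -1;
		if (fscanf(f, "%d", &type) != 1) {
			fclose(f);
			return -1;
		}
		fclose(f);

		memset(&attr, 0, sizeof(attr));
		attr.size = sizeof(attr);
		attr.type = type;	/* namespace for the event ID */
		attr.config = config;	/* e.g. 0xf00, as above */

		/* uncore PMU: CPU-bound (cpu >= 0), not task-bound */
		return syscall(__NR_perf_event_open, &attr, -1, cpu, -1, 0);
	}

e.g. open_pmu_event("arm_dsu_0", 0xf00, 0) mirrors the perf stat
invocation above.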

> You also can't just delete the check in dsu_pmu_event_init() because the
> event numbers you expose overlap with the per-CPU event numbers.

The type check is necessary and cannot be deleted. It provides a
namespace for the event IDs.

> I'm not exactly sure if we can add entries to perf_type_id. If that's
> allowed maybe we need to add something like PERF_TYPE_DSU and use that?
>
> Or if that's not allowed then would it be better to offset the DSU PMU
> events by some number (say 0x1000) and then delete the event type check or
> pass PERF_TYPE_RAW to perf_pmu_register()?

As above, neither of these should be necessary.

Thanks,
Mark.

2018-02-22 20:40:11

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 02/22/2018 03:33 AM, Mark Rutland wrote:
> On Wed, Feb 21, 2018 at 06:32:46PM -0800, Saravana Kannan wrote:
>> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>>> +static int dsu_pmu_event_init(struct perf_event *event)
>>> +{
>>> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
>>> +
>>> + if (event->attr.type != event->pmu->type)
>>> + return -ENOENT;
>>
>> You are checking if the caller set the attr.type "correctly".
>
> This is necessary for the case where perf_init_event() falls back to
> iterating over the list of PMUs, if event->attr.type wasn't found in the
> idr.
>
> Without this, we'd erroneously check events intended for other PMUs.
> So this is correct, and necessary.

Right, I'm aware of this. Which is why I also mentioned below that we
can't just blindly delete this.

>>> +static int dsu_pmu_device_probe(struct platform_device *pdev)
>
>>> + rc = perf_pmu_register(&dsu_pmu->pmu, name, -1);
>>
>> You are passing in -1 here, which means the event type is assigned by the
>> perf framework. The perf framework uses idr_alloc(&pmu_idr, ...) to get the id.
>> So the id assigned is going to depend on the probe order among the different
>> PMU drivers in the board/platform. So, this seems pretty random.
>
> The dynamic IDs are supposed to be looked up by name.
>
> Each PMU has a folder: /sys/bus/event_source/devices/$PMU
>
> ... with /sys/bus/event_source/devices/$PMU/type giving the type.
>
>> How is the caller supposed to know what to set the "type" to?
>
> The perf tools understand this already. If you do:
>
> perf stat -e $PMU/config=0xf00/
>
> ... they will look up the type for that PMU and use it automatically.
>

Ah, thanks! This finally explains how this is supposed to work from
userspace.

>> You also can't just delete the check in dsu_pmu_event_init() because the
>> event numbers you expose overlap with the per-CPU event numbers.
>
> The type check is necessary and cannot be deleted. It provides a
> namespace for the event IDs.

Right. Which is my point too.

>> I'm not exactly sure if we can add entries to perf_type_id. If that's
>> allowed maybe we need to add something like PERF_TYPE_DSU and use that?
>>
>> Or if that's not allowed then would it be better to offset the DSU PMU
>> events by some number (say 0x1000) and then delete the event type check or
>> pass PERF_TYPE_RAW to perf_pmu_register()?
>
> As above, neither of these should be necessary.
>

For the userspace interface. How about the kernel interface though?
perf_event_create_kernel_counter() takes attr.type as an input. But
there's no way to look up the DSU PMU's "type".

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-02-23 11:36:56

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Thu, Feb 22, 2018 at 12:38:39PM -0800, Saravana Kannan wrote:
> On 02/22/2018 03:33 AM, Mark Rutland wrote:
> > On Wed, Feb 21, 2018 at 06:32:46PM -0800, Saravana Kannan wrote:
> > > I'm not exactly sure if we can add entries to perf_type_id. If that's
> > > allowed maybe we need to add something like PERF_TYPE_DSU and use that?
> > >
> > > Or if that's not allowed then would it be better to offset the DSU PMU
> > > events by some number (say 0x1000) and then delete the event type check or
> > > pass PERF_TYPE_RAW to perf_pmu_register()?
> >
> > As above, neither of these should be necessary.
>
> For the userspace interface. How about the kernel interface though?
> perf_event_create_kernel_counter() takes attr.type as an input. But there's
> no way to look up the DSU PMU's "type".

There is no lookup mechanism currently.

I take it you want to use DSU events from a kernel module?

It would be possible to have a similar name -> type lookup mechanism
kernel-side, if necessary.
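
For illustration, such a helper would let a module do something like
the sketch below (perf_pmu_type_by_name() is hypothetical; no such
function exists in mainline today):

	#include <linux/err.h>
	#include <linux/perf_event.h>

	static struct perf_event *open_dsu_counter(int cpu, u64 config)
	{
		struct perf_event_attr attr = {
			.size	= sizeof(attr),
			.config	= config,	/* raw DSU event code */
		};
		int type;

		/* hypothetical name -> type lookup, as discussed */
		type = perf_pmu_type_by_name("arm_dsu_0");
		if (type < 0)
			return ERR_PTR(type);
		attr.type = type;

		/* counting event on a CPU associated with the DSU */
		return perf_event_create_kernel_counter(&attr, cpu, NULL,
							NULL, NULL);
	}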

Thanks,
Mark.

2018-02-23 21:47:51

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 02/23/2018 03:35 AM, Mark Rutland wrote:
> On Thu, Feb 22, 2018 at 12:38:39PM -0800, Saravana Kannan wrote:
>> On 02/22/2018 03:33 AM, Mark Rutland wrote:
>>> On Wed, Feb 21, 2018 at 06:32:46PM -0800, Saravana Kannan wrote:
>>>> I'm not exactly sure if we can add entries to perf_type_id. If that's
>>>> allowed maybe we need to add something like PERF_TYPE_DSU and use that?
>>>>
>>>> Or if that's not allowed then would it be better to offset the DSU PMU
>>>> events by some number (say 0x1000) and then delete the event type check or
>>>> pass PERF_TYPE_RAW to perf_pmu_register()?
>>>
>>> As above, neither of these should be necessary.
>>
>> For the userspace interface. How about the kernel interface though?
>> perf_event_create_kernel_counter() takes attr.type as an input. But there's
>> no way to look up the DSU PMU's "type".
>
> There is no lookup mechanism currently.
>
> I take it you want to use DSU events from a kernel module?

Yes

>
> It would be possible to have a similar name -> type lookup mechanism
> kernel-side, if necessary.

Yeah, this is what I was thinking once you explained how user space
handles this. Just wanted to hear it from this thread. I can submit a patch.

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-02-24 00:54:59

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> Add support for the Cluster PMU part of the ARM DynamIQ Shared Unit (DSU).
> The DSU integrates one or more cores with an L3 memory system, control
> logic, and external interfaces to form a multicore cluster. The PMU
> allows counting the various events related to L3, SCU etc, along with
> providing a cycle counter.
>
> The PMU can be accessed via system registers, which are common
> to the cores in the same cluster. The PMU registers follow the
> semantics of the ARMv8 PMU, mostly, with the exception that
> the counters record the cluster wide events.
>
> This driver is mostly based on the ARMv8 and CCI PMU drivers.
> The driver only supports ARM64 at the moment. It can be extended
> to support ARM32 by providing register accessors like we do in
> arch/arm64/include/arm_dsu_pmu.h.
>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Reviewed-by: Jonathan Cameron <[email protected]>
> Reviewed-by: Mark Rutland <[email protected]>
> Signed-off-by: Suzuki K Poulose <[email protected]>
> ---
> Changes since V9:
> - Rely on cpuhp callback for probing the PMU.
> - Clear the overflow mask whenever the first CPU is brought up.
> - Remove dsu_pmu_get_online_cpu(), which is not needed anymore.
> - Flip the order of context migration and setting the active CPU.
>
> Changes since V8:
> - Include required header files (Mark Rutland)
> - Remove Kconfig dependency on PERF_EVENTS (Mark Rutland)
> - Fix typo in event name, bus_acesss => bus_access (Mark Rutland)
> - Use find_first_zero_bit instead of find_next_zero_bit (Mark Rutland)
> - Change order of checks in dsu_pmu_event_init (Mark Rutland)
> - Allow lazy initialisation of DSU PMU to handle cases where CPUs
> may be brought up later (e.g, maxcpus=N)- Mark Rutland.
> - Clear the interrupt overflow status upon initialisation (Mark Rutland)
> - Change the CPU check to "associated_cpus" from "active_cpus",
> as when we migrate the perf context we will access the DSU
> from two different CPUs (source and destination).
> - Fill in the "module" field for the PMU to prevent the module unload
> when the PMU is active.
> Changes since V6:
> - Address comments from Jonathan
> - Add Reviewed-by tags from Jonathan
> Changes since V5:
> - Address comments on V5 by Mark.
> - Use IRQ_NOBALANCING for IRQ handler
> - Don't expose events which could be unimplemented.
> - Get rid of dsu_pmu_event_supported and allow raw event
> code to be used without validating whether it is supported.
> - Rename "supported_cpus" mask to "associated_cpus"
> - Add Documentation for the PMU driver
> - Don't disable IRQ for dsu_pmu_{enable/disable}_counters
> - Use consistent return codes for validate_event/group calls.
> - Check PERF_ATTACH_TASK flag in event_init.
> - Allow missing CPUs in dsu_pmu_dt_get_cpus, to handle cases
> where kernel could have capped nr_cpus.
> - Cleanup sanity checking for the CPU before accessing DSU
> - Reject events with counting CPU not associated with the DSU.
> Changes since V4:
> - Reflect the changed generic helper for mapping CPU id
> Changes since V2:
> - Cleanup dsu_pmu_device_probe error handling.
> - Fix event validate_group to invert the result check of validate_event
> - Return errors if we failed to parse CPUs in the DSU.
> - Add MODULE_DEVICE_TABLE entry
> - Use hlist_entry_safe for converting cpuhp_node to dsu_pmu.
> ---
> Documentation/perf/arm_dsu_pmu.txt | 28 ++
> arch/arm64/include/asm/arm_dsu_pmu.h | 129 ++++++
> drivers/perf/Kconfig | 9 +
> drivers/perf/Makefile | 1 +
> drivers/perf/arm_dsu_pmu.c | 843 +++++++++++++++++++++++++++++++++++
> 5 files changed, 1010 insertions(+)
> create mode 100644 Documentation/perf/arm_dsu_pmu.txt
> create mode 100644 arch/arm64/include/asm/arm_dsu_pmu.h
> create mode 100644 drivers/perf/arm_dsu_pmu.c
>
> diff --git a/Documentation/perf/arm_dsu_pmu.txt b/Documentation/perf/arm_dsu_pmu.txt
> new file mode 100644
> index 000000000000..d611e15f5add
> --- /dev/null
> +++ b/Documentation/perf/arm_dsu_pmu.txt
> @@ -0,0 +1,28 @@
> +ARM DynamIQ Shared Unit (DSU) PMU
> +==================================
> +
> +ARM DynamIQ Shared Unit integrates one or more cores with an L3 memory system,
> +control logic and external interfaces to form a multicore cluster. The PMU
> +allows counting the various events related to the L3 cache, Snoop Control Unit
> +etc, using 32bit independent counters. It also provides a 64bit cycle counter.
> +
> +The PMU can only be accessed via CPU system registers, which are common to the
> +cores connected to the same DSU. Like most of the other uncore PMUs, DSU
> +PMU doesn't support process specific events and cannot be used in sampling mode.
> +
> +The DSU provides a bitmap for a subset of implemented events via hardware
> +registers. There is no way for the driver to determine if the other events
> +are available or not. Hence the driver exposes only those events advertised
> +by the DSU, in the "events" directory under:
> +
> + /sys/bus/event_source/devices/arm_dsu_<N>/
> +
> +The user should refer to the TRM of the product to figure out the supported events
> +and use the raw event code for the unlisted events.
> +
> +The driver also exposes the CPUs connected to the DSU instance in "associated_cpus".
> +
> +
> +e.g. usage:
> +
> + perf stat -a -e arm_dsu_0/cycles/
> diff --git a/arch/arm64/include/asm/arm_dsu_pmu.h b/arch/arm64/include/asm/arm_dsu_pmu.h
> new file mode 100644
> index 000000000000..82e5cc3356bf
> --- /dev/null
> +++ b/arch/arm64/include/asm/arm_dsu_pmu.h
> @@ -0,0 +1,129 @@
> +/*
> + * ARM DynamIQ Shared Unit (DSU) PMU Low level register access routines.
> + *
> + * Copyright (C) ARM Limited, 2017.
> + *
> + * Author: Suzuki K Poulose <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2, as published by the Free Software Foundation.
> + */
> +
> +#include <linux/bitops.h>
> +#include <linux/build_bug.h>
> +#include <linux/compiler.h>
> +#include <linux/types.h>
> +#include <asm/barrier.h>
> +#include <asm/sysreg.h>
> +
> +
> +#define CLUSTERPMCR_EL1 sys_reg(3, 0, 15, 5, 0)
> +#define CLUSTERPMCNTENSET_EL1 sys_reg(3, 0, 15, 5, 1)
> +#define CLUSTERPMCNTENCLR_EL1 sys_reg(3, 0, 15, 5, 2)
> +#define CLUSTERPMOVSSET_EL1 sys_reg(3, 0, 15, 5, 3)
> +#define CLUSTERPMOVSCLR_EL1 sys_reg(3, 0, 15, 5, 4)
> +#define CLUSTERPMSELR_EL1 sys_reg(3, 0, 15, 5, 5)
> +#define CLUSTERPMINTENSET_EL1 sys_reg(3, 0, 15, 5, 6)
> +#define CLUSTERPMINTENCLR_EL1 sys_reg(3, 0, 15, 5, 7)
> +#define CLUSTERPMCCNTR_EL1 sys_reg(3, 0, 15, 6, 0)
> +#define CLUSTERPMXEVTYPER_EL1 sys_reg(3, 0, 15, 6, 1)
> +#define CLUSTERPMXEVCNTR_EL1 sys_reg(3, 0, 15, 6, 2)
> +#define CLUSTERPMMDCR_EL1 sys_reg(3, 0, 15, 6, 3)
> +#define CLUSTERPMCEID0_EL1 sys_reg(3, 0, 15, 6, 4)
> +#define CLUSTERPMCEID1_EL1 sys_reg(3, 0, 15, 6, 5)
> +
> +static inline u32 __dsu_pmu_read_pmcr(void)
> +{
> + return read_sysreg_s(CLUSTERPMCR_EL1);
> +}
> +
> +static inline void __dsu_pmu_write_pmcr(u32 val)
> +{
> + write_sysreg_s(val, CLUSTERPMCR_EL1);
> + isb();
> +}
> +
> +static inline u32 __dsu_pmu_get_reset_overflow(void)
> +{
> + u32 val = read_sysreg_s(CLUSTERPMOVSCLR_EL1);
> + /* Clear the bit */
> + write_sysreg_s(val, CLUSTERPMOVSCLR_EL1);
> + isb();
> + return val;
> +}
> +
> +static inline void __dsu_pmu_select_counter(int counter)
> +{
> + write_sysreg_s(counter, CLUSTERPMSELR_EL1);
> + isb();
> +}
> +
> +static inline u64 __dsu_pmu_read_counter(int counter)
> +{
> + __dsu_pmu_select_counter(counter);
> + return read_sysreg_s(CLUSTERPMXEVCNTR_EL1);
> +}
> +
> +static inline void __dsu_pmu_write_counter(int counter, u64 val)
> +{
> + __dsu_pmu_select_counter(counter);
> + write_sysreg_s(val, CLUSTERPMXEVCNTR_EL1);
> + isb();
> +}
> +
> +static inline void __dsu_pmu_set_event(int counter, u32 event)
> +{
> + __dsu_pmu_select_counter(counter);
> + write_sysreg_s(event, CLUSTERPMXEVTYPER_EL1);
> + isb();
> +}
> +
> +static inline u64 __dsu_pmu_read_pmccntr(void)
> +{
> + return read_sysreg_s(CLUSTERPMCCNTR_EL1);
> +}
> +
> +static inline void __dsu_pmu_write_pmccntr(u64 val)
> +{
> + write_sysreg_s(val, CLUSTERPMCCNTR_EL1);
> + isb();
> +}
> +
> +static inline void __dsu_pmu_disable_counter(int counter)
> +{
> + write_sysreg_s(BIT(counter), CLUSTERPMCNTENCLR_EL1);
> + isb();
> +}
> +
> +static inline void __dsu_pmu_enable_counter(int counter)
> +{
> + write_sysreg_s(BIT(counter), CLUSTERPMCNTENSET_EL1);
> + isb();
> +}
> +
> +static inline void __dsu_pmu_counter_interrupt_enable(int counter)
> +{
> + write_sysreg_s(BIT(counter), CLUSTERPMINTENSET_EL1);
> + isb();
> +}
> +
> +static inline void __dsu_pmu_counter_interrupt_disable(int counter)
> +{
> + write_sysreg_s(BIT(counter), CLUSTERPMINTENCLR_EL1);
> + isb();
> +}
> +
> +
> +static inline u32 __dsu_pmu_read_pmceid(int n)
> +{
> + switch (n) {
> + case 0:
> + return read_sysreg_s(CLUSTERPMCEID0_EL1);
> + case 1:
> + return read_sysreg_s(CLUSTERPMCEID1_EL1);
> + default:
> + BUILD_BUG();
> + return 0;
> + }
> +}
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index b8f44b068fc6..da5724cd89cf 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -17,6 +17,15 @@ config ARM_PMU_ACPI
> depends on ARM_PMU && ACPI
> def_bool y
>
> +config ARM_DSU_PMU
> + tristate "ARM DynamIQ Shared Unit (DSU) PMU"
> + depends on ARM64
> + help
> + Provides support for performance monitor unit in ARM DynamIQ Shared
> + Unit (DSU). The DSU integrates one or more cores with an L3 memory
> + system, control logic. The PMU allows counting various events related
> + to DSU.
> +
> config HISI_PMU
> bool "HiSilicon SoC PMU"
> depends on ARM64 && ACPI
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 710a0135bd61..c2f27419bdf0 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -1,4 +1,5 @@
> # SPDX-License-Identifier: GPL-2.0
> +obj-$(CONFIG_ARM_DSU_PMU) += arm_dsu_pmu.o
> obj-$(CONFIG_ARM_PMU) += arm_pmu.o arm_pmu_platform.o
> obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
> obj-$(CONFIG_HISI_PMU) += hisilicon/
> diff --git a/drivers/perf/arm_dsu_pmu.c b/drivers/perf/arm_dsu_pmu.c
> new file mode 100644
> index 000000000000..37c0526c93d5
> --- /dev/null
> +++ b/drivers/perf/arm_dsu_pmu.c
> @@ -0,0 +1,843 @@
> +/*
> + * ARM DynamIQ Shared Unit (DSU) PMU driver
> + *
> + * Copyright (C) ARM Limited, 2017.
> + *
> + * Based on ARM CCI-PMU, ARMv8 PMU-v3 drivers.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * version 2 as published by the Free Software Foundation.
> + */
> +
> +#define PMUNAME "arm_dsu"
> +#define DRVNAME PMUNAME "_pmu"
> +#define pr_fmt(fmt) DRVNAME ": " fmt
> +
> +#include <linux/bitmap.h>
> +#include <linux/bitops.h>
> +#include <linux/bug.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/of_device.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>
> +
> +#include <asm/arm_dsu_pmu.h>
> +#include <asm/local64.h>
> +
> +/* PMU event codes */
> +#define DSU_PMU_EVT_CYCLES 0x11
> +#define DSU_PMU_EVT_CHAIN 0x1e
> +
> +#define DSU_PMU_MAX_COMMON_EVENTS 0x40
> +
> +#define DSU_PMU_MAX_HW_CNTRS 32
> +#define DSU_PMU_HW_COUNTER_MASK (DSU_PMU_MAX_HW_CNTRS - 1)
> +
> +#define CLUSTERPMCR_E BIT(0)
> +#define CLUSTERPMCR_P BIT(1)
> +#define CLUSTERPMCR_C BIT(2)
> +#define CLUSTERPMCR_N_SHIFT 11
> +#define CLUSTERPMCR_N_MASK 0x1f
> +#define CLUSTERPMCR_IDCODE_SHIFT 16
> +#define CLUSTERPMCR_IDCODE_MASK 0xff
> +#define CLUSTERPMCR_IMP_SHIFT 24
> +#define CLUSTERPMCR_IMP_MASK 0xff
> +#define CLUSTERPMCR_RES_MASK 0x7e8
> +#define CLUSTERPMCR_RES_VAL 0x40
> +
> +#define DSU_ACTIVE_CPU_MASK 0x0
> +#define DSU_ASSOCIATED_CPU_MASK 0x1
> +
> +/*
> + * We use the index of the counters as they appear in the counter
> + * bit maps in the PMU registers (e.g CLUSTERPMSELR).
> + * i.e,
> + * counter 0 - Bit 0
> + * counter 1 - Bit 1
> + * ...
> + * Cycle counter - Bit 31
> + */
> +#define DSU_PMU_IDX_CYCLE_COUNTER 31
> +
> +/* All event counters are 32bit, with a 64bit Cycle counter */
> +#define DSU_PMU_COUNTER_WIDTH(idx) \
> + (((idx) == DSU_PMU_IDX_CYCLE_COUNTER) ? 64 : 32)
> +
> +#define DSU_PMU_COUNTER_MASK(idx) \
> + GENMASK_ULL((DSU_PMU_COUNTER_WIDTH((idx)) - 1), 0)
> +
> +#define DSU_EXT_ATTR(_name, _func, _config) \
> + (&((struct dev_ext_attribute[]) { \
> + { \
> + .attr = __ATTR(_name, 0444, _func, NULL), \
> + .var = (void *)_config \
> + } \
> + })[0].attr.attr)
> +
> +#define DSU_EVENT_ATTR(_name, _config) \
> + DSU_EXT_ATTR(_name, dsu_pmu_sysfs_event_show, (unsigned long)_config)
> +
> +#define DSU_FORMAT_ATTR(_name, _config) \
> + DSU_EXT_ATTR(_name, dsu_pmu_sysfs_format_show, (char *)_config)
> +
> +#define DSU_CPUMASK_ATTR(_name, _config) \
> + DSU_EXT_ATTR(_name, dsu_pmu_cpumask_show, (unsigned long)_config)
> +
> +struct dsu_hw_events {
> + DECLARE_BITMAP(used_mask, DSU_PMU_MAX_HW_CNTRS);
> + struct perf_event *events[DSU_PMU_MAX_HW_CNTRS];
> +};
> +
> +/*
> + * struct dsu_pmu - DSU PMU descriptor
> + *
> + * @pmu_lock : Protects accesses to DSU PMU register from normal vs
> + * interrupt handler contexts.
> + * @hw_events : Holds the event counter state.
> + * @associated_cpus : CPUs attached to the DSU.
> + * @active_cpu : CPU to which the PMU is bound for accesses.
> + * @cpuhp_node : Node for CPU hotplug notifier link.
> + * @num_counters : Number of event counters implemented by the PMU,
> + * excluding the cycle counter.
> + * @irq : Interrupt line for counter overflow.
> + * @cpmceid_bitmap : Bitmap for the availability of architected common
> + * events (event_code < 0x40).
> + */
> +struct dsu_pmu {
> + struct pmu pmu;
> + struct device *dev;
> + raw_spinlock_t pmu_lock;
> + struct dsu_hw_events hw_events;
> + cpumask_t associated_cpus;
> + cpumask_t active_cpu;
> + struct hlist_node cpuhp_node;
> + u8 num_counters;
> + int irq;
> + DECLARE_BITMAP(cpmceid_bitmap, DSU_PMU_MAX_COMMON_EVENTS);
> +};
> +
> +static unsigned long dsu_pmu_cpuhp_state;
> +
> +static inline struct dsu_pmu *to_dsu_pmu(struct pmu *pmu)
> +{
> + return container_of(pmu, struct dsu_pmu, pmu);
> +}
> +
> +static ssize_t dsu_pmu_sysfs_event_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dev_ext_attribute *eattr = container_of(attr,
> + struct dev_ext_attribute, attr);
> + return snprintf(buf, PAGE_SIZE, "event=0x%lx\n",
> + (unsigned long)eattr->var);
> +}
> +
> +static ssize_t dsu_pmu_sysfs_format_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dev_ext_attribute *eattr = container_of(attr,
> + struct dev_ext_attribute, attr);
> + return snprintf(buf, PAGE_SIZE, "%s\n", (char *)eattr->var);
> +}
> +
> +static ssize_t dsu_pmu_cpumask_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct pmu *pmu = dev_get_drvdata(dev);
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
> + struct dev_ext_attribute *eattr = container_of(attr,
> + struct dev_ext_attribute, attr);
> + unsigned long mask_id = (unsigned long)eattr->var;
> + const cpumask_t *cpumask;
> +
> + switch (mask_id) {
> + case DSU_ACTIVE_CPU_MASK:
> + cpumask = &dsu_pmu->active_cpu;
> + break;
> + case DSU_ASSOCIATED_CPU_MASK:
> + cpumask = &dsu_pmu->associated_cpus;
> + break;
> + default:
> + return 0;
> + }
> + return cpumap_print_to_pagebuf(true, buf, cpumask);
> +}
> +
> +static struct attribute *dsu_pmu_format_attrs[] = {
> + DSU_FORMAT_ATTR(event, "config:0-31"),
> + NULL,
> +};
> +
> +static const struct attribute_group dsu_pmu_format_attr_group = {
> + .name = "format",
> + .attrs = dsu_pmu_format_attrs,
> +};
> +
> +static struct attribute *dsu_pmu_event_attrs[] = {
> + DSU_EVENT_ATTR(cycles, 0x11),
> + DSU_EVENT_ATTR(bus_access, 0x19),
> + DSU_EVENT_ATTR(memory_error, 0x1a),
> + DSU_EVENT_ATTR(bus_cycles, 0x1d),
> + DSU_EVENT_ATTR(l3d_cache_allocate, 0x29),
> + DSU_EVENT_ATTR(l3d_cache_refill, 0x2a),
> + DSU_EVENT_ATTR(l3d_cache, 0x2b),
> + DSU_EVENT_ATTR(l3d_cache_wb, 0x2c),
> + NULL,
> +};
> +
> +static umode_t
> +dsu_pmu_event_attr_is_visible(struct kobject *kobj, struct attribute *attr,
> + int unused)
> +{
> + struct pmu *pmu = dev_get_drvdata(kobj_to_dev(kobj));
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
> + struct dev_ext_attribute *eattr = container_of(attr,
> + struct dev_ext_attribute, attr.attr);
> + unsigned long evt = (unsigned long)eattr->var;
> +
> + return test_bit(evt, dsu_pmu->cpmceid_bitmap) ? attr->mode : 0;
> +}
> +
> +static const struct attribute_group dsu_pmu_events_attr_group = {
> + .name = "events",
> + .attrs = dsu_pmu_event_attrs,
> + .is_visible = dsu_pmu_event_attr_is_visible,
> +};
> +
> +static struct attribute *dsu_pmu_cpumask_attrs[] = {
> + DSU_CPUMASK_ATTR(cpumask, DSU_ACTIVE_CPU_MASK),
> + DSU_CPUMASK_ATTR(associated_cpus, DSU_ASSOCIATED_CPU_MASK),
> + NULL,
> +};
> +
> +static const struct attribute_group dsu_pmu_cpumask_attr_group = {
> + .attrs = dsu_pmu_cpumask_attrs,
> +};
> +
> +static const struct attribute_group *dsu_pmu_attr_groups[] = {
> + &dsu_pmu_cpumask_attr_group,
> + &dsu_pmu_events_attr_group,
> + &dsu_pmu_format_attr_group,
> + NULL,
> +};
> +
> +static int dsu_pmu_get_online_cpu_any_but(struct dsu_pmu *dsu_pmu, int cpu)
> +{
> + struct cpumask online_supported;
> +
> + cpumask_and(&online_supported,
> + &dsu_pmu->associated_cpus, cpu_online_mask);
> + return cpumask_any_but(&online_supported, cpu);
> +}
> +
> +static inline bool dsu_pmu_counter_valid(struct dsu_pmu *dsu_pmu, u32 idx)
> +{
> + return (idx < dsu_pmu->num_counters) ||
> + (idx == DSU_PMU_IDX_CYCLE_COUNTER);
> +}
> +
> +static inline u64 dsu_pmu_read_counter(struct perf_event *event)
> +{
> + u64 val;
> + unsigned long flags;
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> + int idx = event->hw.idx;
> +
> + if (WARN_ON(!cpumask_test_cpu(smp_processor_id(),
> + &dsu_pmu->associated_cpus)))
> + return 0;
> +
> + if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
> + dev_err(event->pmu->dev,
> + "Trying reading invalid counter %d\n", idx);
> + return 0;
> + }
> +
> + raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
> + if (idx == DSU_PMU_IDX_CYCLE_COUNTER)
> + val = __dsu_pmu_read_pmccntr();
> + else
> + val = __dsu_pmu_read_counter(idx);
> + raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
> +
> + return val;
> +}
> +
> +static void dsu_pmu_write_counter(struct perf_event *event, u64 val)
> +{
> + unsigned long flags;
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> + int idx = event->hw.idx;
> +
> + if (WARN_ON(!cpumask_test_cpu(smp_processor_id(),
> + &dsu_pmu->associated_cpus)))
> + return;
> +
> + if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
> + dev_err(event->pmu->dev,
> + "writing to invalid counter %d\n", idx);
> + return;
> + }
> +
> + raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
> + if (idx == DSU_PMU_IDX_CYCLE_COUNTER)
> + __dsu_pmu_write_pmccntr(val);
> + else
> + __dsu_pmu_write_counter(idx, val);
> + raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
> +}
> +
> +static int dsu_pmu_get_event_idx(struct dsu_hw_events *hw_events,
> + struct perf_event *event)
> +{
> + int idx;
> + unsigned long evtype = event->attr.config;
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> + unsigned long *used_mask = hw_events->used_mask;
> +
> + if (evtype == DSU_PMU_EVT_CYCLES) {
> + if (test_and_set_bit(DSU_PMU_IDX_CYCLE_COUNTER, used_mask))
> + return -EAGAIN;
> + return DSU_PMU_IDX_CYCLE_COUNTER;
> + }
> +
> + idx = find_first_zero_bit(used_mask, dsu_pmu->num_counters);
> + if (idx >= dsu_pmu->num_counters)
> + return -EAGAIN;
> + set_bit(idx, hw_events->used_mask);
> + return idx;
> +}
> +
> +static void dsu_pmu_enable_counter(struct dsu_pmu *dsu_pmu, int idx)
> +{
> + __dsu_pmu_counter_interrupt_enable(idx);
> + __dsu_pmu_enable_counter(idx);
> +}
> +
> +static void dsu_pmu_disable_counter(struct dsu_pmu *dsu_pmu, int idx)
> +{
> + __dsu_pmu_disable_counter(idx);
> + __dsu_pmu_counter_interrupt_disable(idx);
> +}
> +
> +static inline void dsu_pmu_set_event(struct dsu_pmu *dsu_pmu,
> + struct perf_event *event)
> +{
> + int idx = event->hw.idx;
> + unsigned long flags;
> +
> + if (!dsu_pmu_counter_valid(dsu_pmu, idx)) {
> + dev_err(event->pmu->dev,
> + "Trying to set invalid counter %d\n", idx);
> + return;
> + }
> +
> + raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
> + __dsu_pmu_set_event(idx, event->hw.config_base);
> + raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
> +}
> +
> +static void dsu_pmu_event_update(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + u64 delta, prev_count, new_count;
> +
> + do {
> + /* We may also be called from the irq handler */
> + prev_count = local64_read(&hwc->prev_count);
> + new_count = dsu_pmu_read_counter(event);
> + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
> + prev_count);
> + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
> + local64_add(delta, &event->count);
> +}
> +
> +static void dsu_pmu_read(struct perf_event *event)
> +{
> + dsu_pmu_event_update(event);
> +}
> +
> +static inline u32 dsu_pmu_get_reset_overflow(void)
> +{
> + return __dsu_pmu_get_reset_overflow();
> +}
> +
> +/**
> + * dsu_pmu_set_event_period: Set the period for the counter.
> + *
> + * All DSU PMU event counters, except the cycle counter are 32bit
> + * counters. To handle cases of extreme interrupt latency, we program
> + * the counter with half of the max count for the counters.
> + */
> +static void dsu_pmu_set_event_period(struct perf_event *event)
> +{
> + int idx = event->hw.idx;
> + u64 val = DSU_PMU_COUNTER_MASK(idx) >> 1;
> +
> + local64_set(&event->hw.prev_count, val);
> + dsu_pmu_write_counter(event, val);
> +}
> +
> +static irqreturn_t dsu_pmu_handle_irq(int irq_num, void *dev)
> +{
> + int i;
> + bool handled = false;
> + struct dsu_pmu *dsu_pmu = dev;
> + struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
> + unsigned long overflow;
> +
> + overflow = dsu_pmu_get_reset_overflow();
> + if (!overflow)
> + return IRQ_NONE;
> +
> + for_each_set_bit(i, &overflow, DSU_PMU_MAX_HW_CNTRS) {
> + struct perf_event *event = hw_events->events[i];
> +
> + if (!event)
> + continue;
> + dsu_pmu_event_update(event);
> + dsu_pmu_set_event_period(event);
> + handled = true;
> + }
> +
> + return IRQ_RETVAL(handled);
> +}
> +
> +static void dsu_pmu_start(struct perf_event *event, int pmu_flags)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> +
> + /* We always reprogram the counter */
> + if (pmu_flags & PERF_EF_RELOAD)
> + WARN_ON(!(event->hw.state & PERF_HES_UPTODATE));
> + dsu_pmu_set_event_period(event);
> + if (event->hw.idx != DSU_PMU_IDX_CYCLE_COUNTER)
> + dsu_pmu_set_event(dsu_pmu, event);
> + event->hw.state = 0;
> + dsu_pmu_enable_counter(dsu_pmu, event->hw.idx);
> +}
> +
> +static void dsu_pmu_stop(struct perf_event *event, int pmu_flags)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> +
> + if (event->hw.state & PERF_HES_STOPPED)
> + return;
> + dsu_pmu_disable_counter(dsu_pmu, event->hw.idx);
> + dsu_pmu_event_update(event);
> + event->hw.state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +}
> +
> +static int dsu_pmu_add(struct perf_event *event, int flags)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> + struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
> + struct hw_perf_event *hwc = &event->hw;
> + int idx;
> +
> + if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
> + &dsu_pmu->associated_cpus)))
> + return -ENOENT;
> +
> + idx = dsu_pmu_get_event_idx(hw_events, event);
> + if (idx < 0)
> + return idx;
> +
> + hwc->idx = idx;
> + hw_events->events[idx] = event;
> + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +
> + if (flags & PERF_EF_START)
> + dsu_pmu_start(event, PERF_EF_RELOAD);
> +
> + perf_event_update_userpage(event);
> + return 0;
> +}
> +
> +static void dsu_pmu_del(struct perf_event *event, int flags)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> + struct dsu_hw_events *hw_events = &dsu_pmu->hw_events;
> + struct hw_perf_event *hwc = &event->hw;
> + int idx = hwc->idx;
> +
> + dsu_pmu_stop(event, PERF_EF_UPDATE);
> + hw_events->events[idx] = NULL;
> + clear_bit(idx, hw_events->used_mask);
> + perf_event_update_userpage(event);
> +}
> +
> +static void dsu_pmu_enable(struct pmu *pmu)
> +{
> + u32 pmcr;
> + unsigned long flags;
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
> +
> + /* If no counters are added, skip enabling the PMU */
> + if (bitmap_empty(dsu_pmu->hw_events.used_mask, DSU_PMU_MAX_HW_CNTRS))
> + return;
> +
> + raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
> + pmcr = __dsu_pmu_read_pmcr();
> + pmcr |= CLUSTERPMCR_E;
> + __dsu_pmu_write_pmcr(pmcr);
> + raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
> +}
> +
> +static void dsu_pmu_disable(struct pmu *pmu)
> +{
> + u32 pmcr;
> + unsigned long flags;
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(pmu);
> +
> + raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
> + pmcr = __dsu_pmu_read_pmcr();
> + pmcr &= ~CLUSTERPMCR_E;
> + __dsu_pmu_write_pmcr(pmcr);
> + raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
> +}
> +
> +static bool dsu_pmu_validate_event(struct pmu *pmu,
> + struct dsu_hw_events *hw_events,
> + struct perf_event *event)
> +{
> + if (is_software_event(event))
> + return true;
> + /* Reject groups spanning multiple HW PMUs. */
> + if (event->pmu != pmu)
> + return false;
> + return dsu_pmu_get_event_idx(hw_events, event) >= 0;
> +}
> +
> +/*
> + * Make sure the group of events can be scheduled at once
> + * on the PMU.
> + */
> +static bool dsu_pmu_validate_group(struct perf_event *event)
> +{
> + struct perf_event *sibling, *leader = event->group_leader;
> + struct dsu_hw_events fake_hw;
> +
> + if (event->group_leader == event)
> + return true;
> +
> + memset(fake_hw.used_mask, 0, sizeof(fake_hw.used_mask));
> + if (!dsu_pmu_validate_event(event->pmu, &fake_hw, leader))
> + return false;
> + list_for_each_entry(sibling, &leader->sibling_list, group_entry) {
> + if (!dsu_pmu_validate_event(event->pmu, &fake_hw, sibling))
> + return false;
> + }
> + return dsu_pmu_validate_event(event->pmu, &fake_hw, event);
> +}
> +
> +static int dsu_pmu_event_init(struct perf_event *event)
> +{
> + struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /* We don't support sampling */
> + if (is_sampling_event(event)) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support sampling events\n");
> + return -EOPNOTSUPP;
> + }
> +
> + /* We cannot support task bound events */
> + if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support per-task counters\n");
> + return -EINVAL;
> + }
> +
> + if (has_branch_stack(event) ||
> + event->attr.exclude_user ||
> + event->attr.exclude_kernel ||
> + event->attr.exclude_hv ||
> + event->attr.exclude_idle ||
> + event->attr.exclude_host ||
> + event->attr.exclude_guest) {
> + dev_dbg(dsu_pmu->pmu.dev, "Can't support filtering\n");
> + return -EINVAL;
> + }
> +
> + if (!cpumask_test_cpu(event->cpu, &dsu_pmu->associated_cpus)) {
> + dev_dbg(dsu_pmu->pmu.dev,
> + "Requested cpu is not associated with the DSU\n");
> + return -EINVAL;
> + }


I sent out a patch that'll allow PMUs to set an event flag to avoid
unnecessary smp calls when the event can be read from any CPU. You could
just always set that if you can't have multiple DSUs running the kernel
(I don't know if the current ARM designs support having multiple DSUs in
a SoC/system) or set it if associated_cpus == cpu_present_mask.

Without that, every single call is going to cause an IPI to whatever CPU
that was passed in (assuming task == NULL) as part of creating the event.
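
In code, the suggestion amounts to something like this in
dsu_pmu_event_init() (PERF_EV_CAP_READ_ANY_CPU is a made-up name
standing in for the flag from that patch, not an existing perf core
capability):

	/* Hypothetical capability from the proposed patch: allow the
	 * core to read this event without an IPI to a specific CPU. */
	if (cpumask_equal(&dsu_pmu->associated_cpus, cpu_present_mask))
		event->event_caps |= PERF_EV_CAP_READ_ANY_CPU;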

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-02-25 14:37:58

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> > +static void dsu_pmu_event_update(struct perf_event *event)
> > +{
> > + struct hw_perf_event *hwc = &event->hw;
> > + u64 delta, prev_count, new_count;
> > +
> > + do {
> > + /* We may also be called from the irq handler */
> > + prev_count = local64_read(&hwc->prev_count);
> > + new_count = dsu_pmu_read_counter(event);
> > + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
> > + prev_count);
> > + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
> > + local64_add(delta, &event->count);
> > +}
> > +
> > +static void dsu_pmu_read(struct perf_event *event)
> > +{
> > + dsu_pmu_event_update(event);
> > +}

> I sent out a patch that'll allow PMUs to set an event flag to avoid
> unnecessary smp calls when the event can be read from any CPU. You could
> just always set that if you can't have multiple DSUs running the kernel (I
> don't know if the current ARM designs support having multiple DSUs in a
> SoC/system) or set it if associated_cpus == cpu_present_mask.

As-is, that won't be safe, given the read function calls the event_update()
function, which has side-effects on hwc->prev_count and event->count. Those
need to be serialized somehow.

Thanks,
Mark.

2018-02-28 22:19:10

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 02/25/2018 06:36 AM, Mark Rutland wrote:
> On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
>> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>>> +static void dsu_pmu_event_update(struct perf_event *event)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + u64 delta, prev_count, new_count;
>>> +
>>> + do {
>>> + /* We may also be called from the irq handler */
>>> + prev_count = local64_read(&hwc->prev_count);
>>> + new_count = dsu_pmu_read_counter(event);
>>> + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
>>> + prev_count);
>>> + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
>>> + local64_add(delta, &event->count);
>>> +}
>>> +
>>> +static void dsu_pmu_read(struct perf_event *event)
>>> +{
>>> + dsu_pmu_event_update(event);
>>> +}
>
>> I sent out a patch that'll allow PMUs to set an event flag to avoid
>> unnecessary smp calls when the event can be read from any CPU. You could
>> just always set that if you can't have multiple DSUs running the kernel (I
>> don't know if the current ARM designs support having multiple DSUs in a
>> SoC/system) or set it if associated_cpus == cpu_present_mask.
>
> As-is, that won't be safe, given the read function calls the event_update()
> function, which has side-effects on hwc->prev_count and event->count. Those
> need to be serialized somehow.

You have to grab the dsu_pmu->pmu_lock spin lock anyway because the
system registers are shared across all CPUs. So, just expanding it a bit
to cover the hwc->prev_count and event->count updates doesn't seem to be
any worse. In fact, it's better than sending pointless IPIs.

The local64_read/cmpxchg/add etc makes sense when you have per-cpu
system registers like in the case of the ARM CPU PMU registers. It
doesn't really buy us much for registers shared across the CPUs.
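
Roughly, that would look like the sketch below. It assumes the locking
currently done inside dsu_pmu_read_counter() is hoisted out to this
caller (otherwise it would self-deadlock on pmu_lock), and it replaces
the cmpxchg loop with a plain update under the lock:

	static void dsu_pmu_event_update(struct perf_event *event)
	{
		struct dsu_pmu *dsu_pmu = to_dsu_pmu(event->pmu);
		struct hw_perf_event *hwc = &event->hw;
		u64 delta, prev_count, new_count;
		unsigned long flags;
		int idx = hwc->idx;

		/* pmu_lock now serializes both the shared counter
		 * registers and the prev_count/count update, so read()
		 * could be allowed from any associated CPU. */
		raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
		prev_count = local64_read(&hwc->prev_count);
		new_count = (idx == DSU_PMU_IDX_CYCLE_COUNTER) ?
				__dsu_pmu_read_pmccntr() :
				__dsu_pmu_read_counter(idx);
		local64_set(&hwc->prev_count, new_count);
		delta = (new_count - prev_count) &
			DSU_PMU_COUNTER_MASK(idx);
		local64_add(delta, &event->count);
		raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
	}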

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-03-01 11:50:10

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Wed, Feb 28, 2018 at 02:17:33PM -0800, Saravana Kannan wrote:
> On 02/25/2018 06:36 AM, Mark Rutland wrote:
> > On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
> > > On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> > > > +static void dsu_pmu_event_update(struct perf_event *event)
> > > > +{
> > > > + struct hw_perf_event *hwc = &event->hw;
> > > > + u64 delta, prev_count, new_count;
> > > > +
> > > > + do {
> > > > + /* We may also be called from the irq handler */
> > > > + prev_count = local64_read(&hwc->prev_count);
> > > > + new_count = dsu_pmu_read_counter(event);
> > > > + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
> > > > + prev_count);
> > > > + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
> > > > + local64_add(delta, &event->count);
> > > > +}
> > > > +
> > > > +static void dsu_pmu_read(struct perf_event *event)
> > > > +{
> > > > + dsu_pmu_event_update(event);
> > > > +}
> >
> > > I sent out a patch that'll allow PMUs to set an event flag to avoid
> > > unnecessary smp calls when the event can be read from any CPU. You could
> > > just always set that if you can't have multiple DSUs running the kernel (I
> > > don't know if the current ARM designs support having multiple DSUs in a
> > > SoC/system) or set it if associated_cpus == cpu_present_mask.
> >
> > As-is, that won't be safe, given the read function calls the event_update()
> > function, which has side-effects on hwc->prev_count and event->count. Those
> > need to be serialized somehow.
>
> You have to grab the dsu_pmu->pmu_lock spin lock anyway because the system
> registers are shared across all CPUs.

I believe that lock is currently superfluous, because the perf core
ensures operations are cpu-affine, and that interrupts are disabled in most
cases (thanks to the context lock).

> So, just expanding it a bit to cover the hwc->prev_count and
> event->count updates doesn't seem to be any worse. In fact, it's
> better than sending pointless IPIs.

That's a fair point.

I'll leave it to Suzuki to decide.

> The local64_read/cmpxchg/add etc makes sense when you have per-cpu system
> registers like in the case of the ARM CPU PMU registers. It doesn't really
> buy us much for registers shared across the CPUs.

Theoretically, because operations are currently cpu-affine, they
potentially reduce the overhead of serialization and synchronization.
In practice for arm64 they're just LL/SC loops, so I agree we don't lose
much.

Thanks,
Mark.

2018-03-01 20:37:20

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 03/01/2018 03:49 AM, Mark Rutland wrote:
> On Wed, Feb 28, 2018 at 02:17:33PM -0800, Saravana Kannan wrote:
>> On 02/25/2018 06:36 AM, Mark Rutland wrote:
>>> On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
>>>> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>>>>> +static void dsu_pmu_event_update(struct perf_event *event)
>>>>> +{
>>>>> + struct hw_perf_event *hwc = &event->hw;
>>>>> + u64 delta, prev_count, new_count;
>>>>> +
>>>>> + do {
>>>>> + /* We may also be called from the irq handler */
>>>>> + prev_count = local64_read(&hwc->prev_count);
>>>>> + new_count = dsu_pmu_read_counter(event);
>>>>> + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
>>>>> + prev_count);
>>>>> + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
>>>>> + local64_add(delta, &event->count);
>>>>> +}
>>>>> +
>>>>> +static void dsu_pmu_read(struct perf_event *event)
>>>>> +{
>>>>> + dsu_pmu_event_update(event);
>>>>> +}
>>>
>>>> I sent out a patch that'll allow PMUs to set an event flag to avoid
>>>> unnecessary smp calls when the event can be read from any CPU. You could
>>>> just always set that if you can't have multiple DSUs running the kernel (I
>>>> don't know if the current ARM designs support having multiple DSUs in a
>>>> SoC/system) or set it if associated_cpus == cpu_present_mask.
>>>
>>> As-is, that won't be safe, given the read function calls the event_update()
>>> function, which has side-effects on hwc->prev_count and event->count. Those
>>> need to be serialized somehow.
>>
>> You have to grab the dsu_pmu->pmu_lock spin lock anyway because the system
>> registers are shared across all CPUs.
>
> I believe that lock is currently superfluous, because the perf core
> ensures operations are cpu-affine, and that interrupts are disabled in most
> cases (thanks to the context lock).

I don't think it's superfluous. You have a common "event counter"
selection register and a common "event counter value" register. You can have
two CPUs racing to read two unrelated event counters and end up causing
one of them to read a bogus value from the wrong event counter.

AFAIK, the *DSU* PMU event selection registers are not per-CPU (the
per-CPU CPU PMU event selection registers are). If this understanding is
correct, you definitely need the spinlock.
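
Concretely, the race in question is an interleaving like this
(illustrative):

	CPU0: write_sysreg_s(5, CLUSTERPMSELR_EL1);  /* select counter 5 */
	CPU1: write_sysreg_s(2, CLUSTERPMSELR_EL1);  /* select counter 2 */
	CPU0: read_sysreg_s(CLUSTERPMXEVCNTR_EL1);   /* reads counter 2! */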

>> So, just expanding it a bit to cover the hwc->prev_count and
>> event->count updates doesn't seem to be any worse. In fact, it's
>> better than sending pointless IPIs.
>
> That's a fair point.
>
> I'll leave it to Suzuki to decide.
>
>> The local64_read/cmpxchg/add etc makes sense when you have per-cpu system
>> registers like in the case of the ARM CPU PMU registers. It doesn't really
>> buy us much for registers shared across the CPUs.
>
> Theoretically, because operations are currently cpu-affine, they
> potentially reduce the overhead of serialization and synchronization.
> In practice for arm64 they're just LL/SC loops, so I agree we don't lose
> much.

See my point above. Serialization isn't optional AFAIK.

Suzuki,

Are you open to using per event CPU masks if I send a patch for that? So
that we can reduce IPIs and not mess up power measurements?


Thanks,
Saravana


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-03-02 14:53:30

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Thu, Mar 01, 2018 at 12:35:49PM -0800, Saravana Kannan wrote:
> On 03/01/2018 03:49 AM, Mark Rutland wrote:
> > On Wed, Feb 28, 2018 at 02:17:33PM -0800, Saravana Kannan wrote:
> > > On 02/25/2018 06:36 AM, Mark Rutland wrote:
> > > > On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
> > > > > On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> > > > > > +static void dsu_pmu_event_update(struct perf_event *event)
> > > > > > +{
> > > > > > + struct hw_perf_event *hwc = &event->hw;
> > > > > > + u64 delta, prev_count, new_count;
> > > > > > +
> > > > > > + do {
> > > > > > + /* We may also be called from the irq handler */
> > > > > > + prev_count = local64_read(&hwc->prev_count);
> > > > > > + new_count = dsu_pmu_read_counter(event);
> > > > > > + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
> > > > > > + prev_count);
> > > > > > + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
> > > > > > + local64_add(delta, &event->count);
> > > > > > +}
> > > > > > +
> > > > > > +static void dsu_pmu_read(struct perf_event *event)
> > > > > > +{
> > > > > > + dsu_pmu_event_update(event);
> > > > > > +}
> > > >
> > > > > I sent out a patch that'll allow PMUs to set an event flag to avoid
> > > > > unnecessary smp calls when the event can be read from any CPU. You could
> > > > > just always set that if you can't have multiple DSUs running the kernel (I
> > > > > don't know if the current ARM designs support having multiple DSUs in a
> > > > > SoC/system) or set it if associated_cpus == cpu_present_mask.
> > > >
> > > > As-is, that won't be safe, given the read function calls the event_update()
> > > > function, which has side-effects on hwc->prev_count and event->count. Those
> > > > need to be serialized somehow.
> > >
> > > You have to grab the dsu_pmu->pmu_lock spin lock anyway because the system
> > > registers are shared across all CPUs.
> >
> > I believe that lock is currently superfluous, because the perf core
> > ensures operations are cpu-affine, and that interrupts are disabled in most
> > cases (thanks to the context lock).
>
> I don't think it's superfluous. You have a common "event counter" selection
> register and a common "event counter value" register. You can have two CPUs
> racing to read two unrelated event counters and end up causing one of them
> to read a bogus value from the wrong event counter.

It's important to note that the DSU PMU's event_init() ensures events
are affine to a single CPU, and the perf core code serializes operations
on those events via the context lock.

Therefore, two CPUs *won't* try to access the registers simultaneously.

If events could be active on multiple CPUs simultaneously, I agree that
the lock would be necessary. However, there would also be other problems
to deal with in that case.

If we want to allow pmu::read() from arbitrary CPUs the DSU is affine
to, I agree we'd need the lock to serialize accesses to the registers
and data structures.

Thanks,
Mark.

2018-03-02 21:33:58

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 03/02/2018 02:42 AM, Mark Rutland wrote:
> On Thu, Mar 01, 2018 at 12:35:49PM -0800, Saravana Kannan wrote:
>> On 03/01/2018 03:49 AM, Mark Rutland wrote:
>>> On Wed, Feb 28, 2018 at 02:17:33PM -0800, Saravana Kannan wrote:
>>>> On 02/25/2018 06:36 AM, Mark Rutland wrote:
>>>>> On Fri, Feb 23, 2018 at 04:53:18PM -0800, Saravana Kannan wrote:
>>>>>> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>>>>>>> +static void dsu_pmu_event_update(struct perf_event *event)
>>>>>>> +{
>>>>>>> + struct hw_perf_event *hwc = &event->hw;
>>>>>>> + u64 delta, prev_count, new_count;
>>>>>>> +
>>>>>>> + do {
>>>>>>> + /* We may also be called from the irq handler */
>>>>>>> + prev_count = local64_read(&hwc->prev_count);
>>>>>>> + new_count = dsu_pmu_read_counter(event);
>>>>>>> + } while (local64_cmpxchg(&hwc->prev_count, prev_count, new_count) !=
>>>>>>> + prev_count);
>>>>>>> + delta = (new_count - prev_count) & DSU_PMU_COUNTER_MASK(hwc->idx);
>>>>>>> + local64_add(delta, &event->count);
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void dsu_pmu_read(struct perf_event *event)
>>>>>>> +{
>>>>>>> + dsu_pmu_event_update(event);
>>>>>>> +}
>>>>>
>>>>>> I sent out a patch that'll allow PMUs to set an event flag to avoid
>>>>>> unnecessary smp calls when the event can be read from any CPU. You could
>>>>>> just always set that if you can't have multiple DSUs running the kernel (I
>>>>>> don't know if the current ARM designs support having multiple DSUs in a
>>>>>> SoC/system) or set it if associated_cpus == cpu_present_mask.
>>>>>
>>>>> As-is, that won't be safe, given the read function calls the event_update()
>>>>> function, which has side-effects on hwc->prev_count and event->count. Those
>>>>> need to be serialized somehow.
>>>>
>>>> You have to grab the dsu_pmu->pmu_lock spin lock anyway because the system
>>>> registers are shared across all CPUs.
>>>
>>> I believe that lock is currently superfluous, because the perf core
>>> ensures operations are cpu-affine, and that interrupts are disabled in most
>>> cases (thanks to the context lock).
>>
>> I don't think it's superfluous. You have a common "event counter" selection
>> register and a common "event counter value" register. You can have two CPUs
>> racing to read two unrelated event counters and end up causing one of them
>> to read a bogus value from the wrong event counter.
>
> It's important to note that the DSU PMU's event_init() ensures events
> are affine to a single CPU, and the perf core code serializes operations
> on those events via the context lock.

Ah, I see that now. Thanks!

> Therefore, two CPUs *won't* try to access the registers simultaneously.

Right, and this driver seems to be going through a lot of work to make
sure all events are read in one CPU.

Do you even have an upstream target where there are multiple DSUs in a
system? If not, we can simplify a ton of this code (no hotplug
notifiers, no migrating PMUs, no SMP calls, etc) by just adding a
spinlock and letting any CPU read these DSU counters.

If you need to support a system with multiple DSUs, I think it's still
useful to add a CPU mask for events and let the perf framework read
events on any of those CPUs.

> If events could be active on multiple CPUs simultaneously, I agree that
> the lock would be necessary. However, there would also be other problems
> to deal with in that case.
>
> If we want to allow pmu::read() from arbitrary CPUs the DSU is affine
> to, I agree we'd need the lock to serialize accesses to the registers
> and data structures.

Agreed.

So, depending on how many DSUs you want to support in the mainline
kernel, we can simplify it a ton. And if not, we can still try to remove
the need for smp calls so that we don't cause power impact when trying
to profile while measuring power.

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-03-05 13:42:06

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Fri, Mar 02, 2018 at 11:19:56AM -0800, Saravana Kannan wrote:
> On 03/02/2018 02:42 AM, Mark Rutland wrote:
> > It's important to note that the DSU PMU's event_init() ensures events
> > are affine to a single CPU, and the perf core code serializes operations
> > on those events via the context lock.
>
> Ah, I see that now. Thanks!
>
> > Therefore, two CPUs *won't* try to access the registers simultaneously.
>
> Right, and this driver seems to be going through a lot of work to make sure
> all events are read in one CPU.
>
> Do you even have an upstream target where there are multiple DSUs in a
> system?

I have no idea, though I do believe that it's possible for a system to
have multiple DSUs.

> If not, we can simplify a ton of this code (no hotplug notifiers, no
> migrating PMUs, no SMP calls, etc) by just adding a spinlock and letting any
> CPU read these DSU counters.

Regardless of whether we allow arbitrary CPUs to read the counters,
other logic still needs to be CPU affine, and we'll still need hotplug
notifiers and event migration.

I am not necessarily opposed to allowing read() calls from associated
CPUs, but as before, I'll leave that to Suzuki.

Thanks,
Mark.

2018-03-05 22:11:36

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 03/05/2018 02:59 AM, Mark Rutland wrote:
> On Fri, Mar 02, 2018 at 11:19:56AM -0800, Saravana Kannan wrote:
>> On 03/02/2018 02:42 AM, Mark Rutland wrote:
>>> It's important to note that the DSU PMU's event_init() ensures events
>>> are affine to a single CPU, and the perf core code serializes operations
>>> on those events via the context lock.
>>
>> Ah, I see that now. Thanks!
>>
>>> Therefore, two CPUs *won't* try to access the registers simultaneously.
>>
>> Right, and this driver seems to be going through a lot of work to make sure
>> all events are read in one CPU.
>>
>> Do you even have an upstream target where there are multiple DSUs in a
>> system?
>
> I have no idea, though I do believe that it's possible for a system to
> have multiple DSUs.
>
>> If not, we can simplify a ton of this code (no hotplug notifiers, no
>> migrating PMUs, no SMP calls, etc) by just adding a spinlock and letting any
>> CPU read these DSU counters.
>
> Regardless of whether we allow arbitrary CPUs to read the counters,
> other logic still needs to be CPU affine, and we'll still need hotplug
> notifiers and event migration.

If you have to support multiple DSUs in a system, then the need is
obvious. But if you don't have to support multiple DSU, it's not obvious
to me on why you still need CPU affining or hotplug notifiers. Could you
please provide me pointers for general understanding?


> I am not necessarily opposed to allowing read() calls from associated
> CPUs, but as before, I'll leave that to Suzuki.

Can you make sure this thread is on Suzuki's radar?

Suzuki,

Could you please respond? IPIs for DSUs are definitely a pain for us.

Thanks,
Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

2018-03-07 15:02:11

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support



Hi Saravana,

Sorry for the late response, I was out on vacation.

On 05/03/18 22:10, Saravana Kannan wrote:
> On 03/05/2018 02:59 AM, Mark Rutland wrote:
>> On Fri, Mar 02, 2018 at 11:19:56AM -0800, Saravana Kannan wrote:
>>> On 03/02/2018 02:42 AM, Mark Rutland wrote:
>>>> It's important to note that the DSU PMU's event_init() ensures events
>>>> are affine to a single CPU, and the perf core code serializes operations
>>>> on those events via the context lock.
>>>
>>> Ah, I see that now. Thanks!
>>>
>>>> Therefore, two CPUs *won't* try to access the registers simultaneously.
>>>
>>> Right, and this driver seems to be going through a lot of work to make sure
>>> all events are read in one CPU.
>>>
>>> Do you even have an upstream target where there are multiple DSUs in a
>>> system?
>>
>> I have no idea, though I do believe that it's possible for a system to
>> have multiple DSUs.
>>
>>> If not, we can simplify a ton of this code (no hotplug notifiers, no
>>> migrating PMUs, no SMP calls, etc) by just adding a spinlock and letting any
>>> CPU read these DSU counters.
>>
>> Regardless of whether we allow arbitrary CPUs to read the counters,
>> other logic still needs to be CPU affine, and we'll still need hotplug
>> notifiers and event migration.
>
> If you have to support multiple DSUs in a system, then the need is obvious. But if you don't have to support multiple DSUs, it's not obvious to me why you still need CPU affining or hotplug notifiers. Could you please provide me pointers for general understanding?
>

We need to support multiple DSUs as such configurations are possible.

>
>> I am not necessarily opposed to allowing read() calls from associated
>> CPUs, but as before, I'll leave that to Suzuki.

I am fine with reading the registers from any of the associated CPUs.
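
For the record, a rough sketch of what such an any-associated-CPU read
could look like is below. This is a hypothetical sketch only: the
sketch_* names are made up, and only __dsu_pmu_read_counter() is a real
accessor (arch/arm64/include/asm/arm_dsu_pmu.h).

	static void sketch_dsu_pmu_read(struct perf_event *event)
	{
		struct sketch_dsu_pmu *dsu_pmu = to_sketch_dsu_pmu(event->pmu);
		struct hw_perf_event *hwc = &event->hw;
		unsigned long flags;
		u64 prev, now;
		int cpu;

		/* The PMU sysregs are only visible from CPUs connected to this DSU */
		cpu = get_cpu();
		if (WARN_ON(!cpumask_test_cpu(cpu, &dsu_pmu->associated_cpus))) {
			put_cpu();
			return;
		}

		/* Serialise counter access against other associated CPUs */
		raw_spin_lock_irqsave(&dsu_pmu->pmu_lock, flags);
		now = __dsu_pmu_read_counter(hwc->idx);
		raw_spin_unlock_irqrestore(&dsu_pmu->pmu_lock, flags);
		put_cpu();

		/* Standard perf delta accounting */
		prev = local64_xchg(&hwc->prev_count, now);
		local64_add(now - prev, &event->count);
	}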

>
> Can you make sure this thread is on Suzuki's radar?
>
> Suzuki,
>
> Could you please respond? IPIs for DSU reads are definitely a pain for us.

Thanks
Suzuki


2018-03-07 21:38:01

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 03/07/2018 06:59 AM, Suzuki K Poulose wrote:
>
>
> Hi Saravana,
>
> Sorry for the late response, I was out on vacation.
>
> On 05/03/18 22:10, Saravana Kannan wrote:
>> On 03/05/2018 02:59 AM, Mark Rutland wrote:
>>> [...]
>>>
>>> Regardless of whether we allow arbitrary CPUs to read the counters,
>>> other logic still needs to be CPU affine, and we'll still need hotplug
>>> notifiers and event migration.
>>
>> If you have to support multiple DSUs in a system, then the need is
>> obvious. But if you don't have to support multiple DSUs, it's not
>> obvious to me why you still need CPU affinity handling or hotplug
>> notifiers. Could you please provide some pointers, for general understanding?
>>
>
> We need to support multiple DSUs as such configurations are possible.
>
>>
>>> I am not necessarily opposed to allowing read() calls from associated
>>> CPUs, but as before, I'll leave that to Suzuki.
>
> I am fine with reading the registers from any of the associated CPUs.
>

If that's the case, can you please use my patch? And if it looks good to
you, give it an Ack and ask Peter to pull it in, since you'd be the user?

Thanks,
Saravana



2018-03-08 11:43:55

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Mon, Mar 05, 2018 at 02:10:03PM -0800, Saravana Kannan wrote:
> On 03/05/2018 02:59 AM, Mark Rutland wrote:
> > [...]
> >
> > Regardless of whether we allow arbitrary CPUs to read the counters,
> > other logic still needs to be CPU affine, and we'll still need hotplug
> > notifiers and event migration.
>
> If you have to support multiple DSUs in a system, then the need is obvious.
> But if you don't have to support multiple DSUs, it's not obvious to me why
> you still need CPU affinity handling or hotplug notifiers. Could you please
> provide some pointers, for general understanding?

There are a number of reasons. Off the top of my head:

* The perf core relies on the interrupt handler being serialised w.r.t.
operations on the relevant perf_event_context and
perf_event_cpu_context by way of these being affine to the same CPU.
For this to be the case, events *must* be managed on the CPU the
interrupt handler is affine to.

* The perf core rotates events on a per-cpu basis. To keep this fair and
reasonable, the perf core needs to manage *all* events for a PMU on
the same CPU.

* We expose a cpumask to userspace, so that it attempts to open events
on a single CPU (and doesn't open redundant events that would result
in misleading figures).

This must be an online CPU for the perf core to allow events to be
created, so this must be updated when associated CPUs are hotplugged.
We must choose *some* arbitrary associated CPU for this.

* If the arbitrarily-chosen CPU is hotplugged out, but other associated
CPUs are online, we should keep the events active by choosing another
arbitrary associated CPU, and migrating the events (see
perf_pmu_migrate_context). Note that we must also fiddle with the
interrupt affinity.

* If a hotplug event occurs between userspace reading the cpumask and
opening an event, it may try to open events on a CPU that is not the
currently arbitrarily chosen CPU. To ameliorate this, in
pmu::event_init we re-write event->cpu so long as it was *some*
valid associated CPU.

There might be some other reasons, too...
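
To make the hotplug points above concrete, a rough sketch of the
teardown path follows, loosely modelled on drivers/perf/arm_dsu_pmu.c.
The sketch_* names are hypothetical; perf_pmu_migrate_context(),
irq_set_affinity_hint() and the cpuhp callback machinery are real
kernel interfaces.

	/* cpuhp teardown callback: runs when an associated CPU goes offline */
	static int sketch_dsu_pmu_cpu_teardown(unsigned int cpu,
					       struct hlist_node *node)
	{
		struct sketch_dsu_pmu *dsu_pmu =
			hlist_entry_safe(node, struct sketch_dsu_pmu, cpuhp_node);
		struct cpumask online_assoc;
		int dst;

		/* Nothing to do unless this was the arbitrarily chosen CPU */
		if (!cpumask_test_and_clear_cpu(cpu, &dsu_pmu->active_cpu))
			return 0;

		/* Pick another online CPU associated with the same DSU, if any */
		cpumask_and(&online_assoc, &dsu_pmu->associated_cpus,
			    cpu_online_mask);
		dst = cpumask_any_but(&online_assoc, cpu);
		if (dst >= nr_cpu_ids) {
			/* Whole cluster going down: leave the IRQ unbound */
			irq_set_affinity_hint(dsu_pmu->irq, NULL);
			return 0;
		}

		/* Move the events, then the interrupt, to the new active CPU */
		perf_pmu_migrate_context(&dsu_pmu->pmu, cpu, dst);
		irq_set_affinity_hint(dsu_pmu->irq, cpumask_of(dst));
		cpumask_set_cpu(dst, &dsu_pmu->active_cpu);

		return 0;
	}

For the last point, the corresponding fragment in pmu::event_init would
be along the lines of:

	/* Accept any associated CPU, but force the event onto the
	 * currently chosen active CPU. */
	if (!cpumask_test_cpu(event->cpu, &dsu_pmu->associated_cpus))
		return -EINVAL;
	event->cpu = cpumask_first(&dsu_pmu->active_cpu);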

Thanks,
Mark.

2018-03-09 00:01:35

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> Add support for the Cluster PMU part of the ARM DynamIQ Shared Unit (DSU).
> The DSU integrates one or more cores with an L3 memory system, control
> logic, and external interfaces to form a multicore cluster. The PMU
> allows counting the various events related to L3, SCU etc, along with
> providing a cycle counter.
>
> The PMU can be accessed via system registers, which are common
> to the cores in the same cluster. The PMU registers follow the
> semantics of the ARMv8 PMU, mostly, with the exception that
> the counters record the cluster wide events.
>
> This driver is mostly based on the ARMv8 and CCI PMU drivers.
> The driver only supports ARM64 at the moment. It can be extended
> to support ARM32 by providing register accessors like we do in
> arch/arm64/include/arm_dsu_pmu.h.
>
> [...]
>

Looking at the code, I didn't see any specific handling of cluster power
collapse. AFAIK, the HW counters do not retain config (what event they
are counting) or value (the current count) across power collapse.
Wouldn't you need to register for some kind of PM_ENTER/EXIT notifiers
to handle that?

Thanks,
Saravana


2018-03-09 10:54:24

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

+ Cc: Lorenzo, Charles.

On 08/03/18 23:59, Saravana Kannan wrote:
> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>> [...]

>
> Looking at the code, I didn't see any specific handling of cluster power collapse. AFAIK, the HW counters do not retain config (what event they are counting) or value (the current count) across power collapse. Wouldn't you need to register for some kind of PM_ENTER/EXIT notifiers to handle that?

Good point, yes, *somebody* needs to save/restore the registers. But who? As far
as the kernel is concerned, it doesn't control the DSU power states. Also, as of now
there is no reliable way to get ENTER/EXIT notifications for DSU power
domain state changes. All we do is use the PMU, assuming it is available. AFAICT,
it should really be done at EL3, which manages the DSU, but maybe I am wrong.

Sudeep, Lorenzo, Charles,

Please feel free to share your thoughts.

Cheers
Suzuki

2018-03-09 13:37:46

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Fri, Mar 09, 2018 at 10:53:14AM +0000, Suzuki K Poulose wrote:
> + Cc: Lorenzo, Charles.
>
> On 08/03/18 23:59, Saravana Kannan wrote:
> > On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
> > > [...]
>
> >
> > Looking at the code, I didn't see any specific handling of cluster power collapse. AFAIK, the HW counters do not retain config (what event they are counting) or value (the current count) across power collapse. Wouldn't you need to register for some kind of PM_ENTER/EXIT notifiers to handle that?
>
> Good point, yes, *somebody* needs to save/restore the registers. But who? As far
> as the kernel is concerned, it doesn't control the DSU power states. Also, as of now
> there is no reliable way to get ENTER/EXIT notifications for DSU power
> domain state changes. All we do is use the PMU, assuming it is available. AFAICT,
> it should really be done at EL3, which manages the DSU, but maybe I am wrong.

Given this can happen behind the back of the kernel, if FW doesn't
save/restore this state, we'll have to inhibit cpuidle on a CPU
associated with the DSU PMU whenever it has active events, which would
keep the cluster online.

Thanks,
Mark.

2018-03-09 22:51:06

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 03/09/2018 05:35 AM, Mark Rutland wrote:
> On Fri, Mar 09, 2018 at 10:53:14AM +0000, Suzuki K Poulose wrote:
>> + Cc: Lorenzo, Charles.
>>
>> On 08/03/18 23:59, Saravana Kannan wrote:
>>> On 01/02/2018 03:25 AM, Suzuki K Poulose wrote:
>>>> [...]
>>
>>>
>>> Looking at the code, I didn't see any specific handling of cluster power collapse. AFAIK, the HW counters do not retain config (what event they are counting) or value (the current count) across power collapse. Wouldn't you need to register for some kind of PM_ENTER/EXIT notifiers to handle that?
>>
>> Good point, yes, *somebody* needs to save/restore the registers. But who? As far
>> as the kernel is concerned, it doesn't control the DSU power states. Also, as of now
>> there is no reliable way to get ENTER/EXIT notifications for DSU power
>> domain state changes. All we do is use the PMU, assuming it is available. AFAICT,
>> it should really be done at EL3, which manages the DSU, but maybe I am wrong.
>
> Given this can happen behind the back of the kernel, if FW doesn't
> save/restore this state, we'll have to inhibit cpuidle on a CPU
> associated with the DSU PMU whenever it has active events, which would
> keep the cluster online.
>

PMU support should be designed to have the least impact on
power/performance; otherwise, using PMUs to profile and debug issues
becomes impossible. Disabling cpuidle would significantly affect power
and performance.

Why not use CPU_CLUSTER_PM_ENTER similar to how arm-pmu.c uses
CPU_PM_ENTER for saving and restoring the counters? Technically a lot of
stuff could be pushed to FW. Doesn't mean we should. At worst, we'll
save and restore for a few cases where we didn't need to.
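
For illustration only, such a hook might look roughly like the sketch
below. CPU_CLUSTER_PM_{ENTER,EXIT,ENTER_FAILED} and
cpu_pm_register_notifier() are the real cpu_pm interfaces; the sketch_*
names are hypothetical placeholders for the actual save/restore of the
DSU PMU registers.

	#include <linux/cpu_pm.h>

	static int sketch_dsu_pmu_pm_notify(struct notifier_block *nb,
					    unsigned long cmd, void *v)
	{
		struct sketch_dsu_pmu *dsu_pmu =
			container_of(nb, struct sketch_dsu_pmu, cpu_pm_nb);

		switch (cmd) {
		case CPU_CLUSTER_PM_ENTER:
			/* Stash counter values and event configuration */
			sketch_dsu_pmu_save_regs(dsu_pmu);
			break;
		case CPU_CLUSTER_PM_ENTER_FAILED:
		case CPU_CLUSTER_PM_EXIT:
			/* Reprogram and restart the counters after power collapse */
			sketch_dsu_pmu_restore_regs(dsu_pmu);
			break;
		default:
			return NOTIFY_DONE;
		}
		return NOTIFY_OK;
	}

	/* Registered once at probe time:
	 *	dsu_pmu->cpu_pm_nb.notifier_call = sketch_dsu_pmu_pm_notify;
	 *	cpu_pm_register_notifier(&dsu_pmu->cpu_pm_nb);
	 */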

Thanks,
Saravana


2018-03-10 15:47:23

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On Fri, Mar 09, 2018 at 02:49:00PM -0800, Saravana Kannan wrote:
> > > > Looking at the code, I didn't see any specific handling of cluster
> > > > power collapse. AFAIK, the HW counters do not retain config (what event
> > > > they are counting) or value (the current count) across power collapse.
> > > > Wouldn't you need to register for some kind of PM_ENTER/EXIT notifiers
> > > > to handle that?
> > >
> > > Good point, yes, *somebody* needs to save/restore the registers. But who? As far
> > > as the kernel is concerned, it doesn't control the DSU power states. Also, as of now
> > > there is no reliable way to get ENTER/EXIT notifications for DSU power
> > > domain state changes. All we do is use the PMU, assuming it is available. AFAICT,
> > > it should really be done at EL3, which manages the DSU, but maybe I am wrong.
> >
> > Given this can happen behind the back of the kernel, if FW doesn't
> > save/restore this state, we'll have to inhibit cpuidle on a CPU
> > associated with the DSU PMU whenever it has active events, which would
> > keep the cluster online.
>
> PMU support should be designed to have the least impact on power/performance;
> otherwise, using PMUs to profile and debug issues becomes impossible.
> Disabling cpuidle would significantly affect power and performance.
>
> Why not use CPU_CLUSTER_PM_ENTER similar to how arm-pmu.c uses CPU_PM_ENTER
> for saving and restoring the counters?

The CPU_CLUSTER_PM_{ENTER,EXIT} notifications only exist on particular 32-bit
platforms where the kernel directly manages CPU power states. These do not
exist on systems where FW manages the cluster power state (e.g. on any arm64
platform with PSCI). So we cannot rely on CPU_CLUSTER_PM_{ENTER,EXIT} in the
DSU PMU driver.

The arm-pmu code manages state which is strictly CPU affine, so we can rely on
the CPU_PM_{ENTER,EXIT} notifications there.

We cannot rely on CPU_PM_{ENTER,EXIT} notifications to manage per-cluster
state, as CPUs can race to enter/exit idle, and we cannot track the cluster
state accurately in the kernel without serializing idle entry/exit across CPUs,
which will affect power/performance.

Thanks,
Mark.

2018-03-19 09:52:06

by Suzuki K Poulose

[permalink] [raw]
Subject: Re: [PATCH v11 8/8] perf: ARM DynamIQ Shared Unit PMU support

On 07/03/18 21:36, Saravana Kannan wrote:
> On 03/07/2018 06:59 AM, Suzuki K Poulose wrote:
>>
>>
>> Hi Saravana,
>>
>> Sorry for the late response, I was out on vacation.
>>
>> On 05/03/18 22:10, Saravana Kannan wrote:
>> [...]
>>
>> I am fine with reading the registers from any of the associated CPUs.
>>
>
> If that's the case, can you please use my patch?

The DSU driver is already mainline. Please feel free to send a patch with me in Cc.

> And if it looks good to you, give it an Ack and ask Peter to pull it in, since you'd be the user?

As mentioned above, when you send a patch, please do keep me in Cc, and I will
do the needful.

Thanks
Suzuki