2017-03-02 21:25:34

by Agustin Vega-Frias

[permalink] [raw]
Subject: [PATCH V3] perf: qcom: Add L3 cache PMU driver

This adds a new dynamic PMU to the Perf Events framework to program
and control the L3 cache PMUs in some Qualcomm Technologies SOCs.

The driver supports a distributed cache architecture where the overall
cache for a socket is comprised of multiple slices each with its own PMU.
Access to each individual PMU is provided even though all CPUs share all
the slices. User space needs to aggregate to individual counts to provide
a global picture.

The driver exports formatting and event information to sysfs so it can
be used by the perf user space tools with the syntaxes:
perf stat -a -e l3cache_0_0/read-miss/
perf stat -a -e l3cache_0_0/event=0x21/

Signed-off-by: Agustin Vega-Frias <[email protected]>
---
drivers/perf/Kconfig | 10 +
drivers/perf/Makefile | 1 +
drivers/perf/qcom_l3_pmu.c | 755 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/cpuhotplug.h | 1 +
4 files changed, 767 insertions(+)
create mode 100644 drivers/perf/qcom_l3_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 4d5c5f9..7df32f7 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -12,6 +12,16 @@ config ARM_PMU
Say y if you want to use CPU performance monitors on ARM-based
systems.

+config QCOM_L3_PMU
+ bool "Qualcomm Technologies L3-cache PMU"
+ depends on ARCH_QCOM && ARM64 && PERF_EVENTS && ACPI
+ select QCOM_IRQ_COMBINER
+ help
+ Provides support for the L3 cache performance monitor unit (PMU)
+ in Qualcomm Technologies processors.
+ Adds the L3 cache PMU into the perf events subsystem for
+ monitoring L3 cache events.
+
config XGENE_PMU
depends on PERF_EVENTS && ARCH_XGENE
bool "APM X-Gene SoC PMU"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b116e98..89a2daa 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -1,2 +1,3 @@
obj-$(CONFIG_ARM_PMU) += arm_pmu.o
+obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
diff --git a/drivers/perf/qcom_l3_pmu.c b/drivers/perf/qcom_l3_pmu.c
new file mode 100644
index 0000000..207f174
--- /dev/null
+++ b/drivers/perf/qcom_l3_pmu.c
@@ -0,0 +1,755 @@
+/* Copyright (c) 2015-2017, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/bitops.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/list.h>
+#include <linux/acpi.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+
+/*
+ * Driver for the L3 cache PMUs in Qualcomm Technologies chips.
+ *
+ * The driver supports a distributed cache architecture where the overall
+ * cache for a socket is comprised of multiple slices each with its own PMU.
+ * Access to each individual PMU is provided even though all CPUs share all
+ * the slices. User space needs to aggregate to individual counts to provide
+ * a global picture.
+ *
+ * The hardware supports counter chaining to provide the user a way to avoid
+ * overhead of software counter maintenance. This is exposed via a the 'lc'
+ * flag field in perf_event_attr.config.
+ *
+ * The hardware also supports a feature that asserts the IRQ on the toggling
+ * of the most significanty bit in the 32bit counter. This feature is used
+ * to improve precision and to avoid counter reprogramming operations, since
+ * we can leave the counters as free running.
+ */
+
+/*
+ * General constants
+ */
+
+#define L3_NUM_COUNTERS 8
+#define L3_MAX_EVTYPE 0xFF
+
+/*
+ * Register offsets
+ */
+
+/* Perfmon registers */
+#define L3_HML3_PM_CR 0x000
+#define L3_HML3_PM_EVCNTR(__cntr) (0x420 + ((__cntr) & 0x7) * 8)
+#define L3_HML3_PM_CNTCTL(__cntr) (0x120 + ((__cntr) & 0x7) * 8)
+#define L3_HML3_PM_EVTYPE(__cntr) (0x220 + ((__cntr) & 0x7) * 8)
+#define L3_HML3_PM_FILTRA 0x300
+#define L3_HML3_PM_FILTRB 0x308
+#define L3_HML3_PM_FILTRC 0x310
+#define L3_HML3_PM_FILTRAM 0x304
+#define L3_HML3_PM_FILTRBM 0x30C
+#define L3_HML3_PM_FILTRCM 0x314
+
+/* Basic counter registers */
+#define L3_M_BC_CR 0x500
+#define L3_M_BC_SATROLL_CR 0x504
+#define L3_M_BC_CNTENSET 0x508
+#define L3_M_BC_CNTENCLR 0x50C
+#define L3_M_BC_INTENSET 0x510
+#define L3_M_BC_INTENCLR 0x514
+#define L3_M_BC_GANG 0x718
+#define L3_M_BC_OVSR 0x740
+#define L3_M_BC_IRQCTL 0x96C
+
+/*
+ * Bit field definitions
+ */
+
+/* L3_HML3_PM_CR */
+#define PM_CR_RESET (0)
+
+/* L3_HML3_PM_XCNTCTL/L3_HML3_PM_CNTCTLx */
+#define PMCNT_RESET (0)
+
+/* L3_HML3_PM_EVTYPEx */
+#define EVSEL(__val) ((u32)((__val) & 0xFF))
+
+/* Reset value for all the filter registers */
+#define PM_FLTR_RESET (0)
+
+/* L3_M_BC_CR */
+#define BC_RESET (((u32)1) << 1)
+#define BC_ENABLE ((u32)1)
+
+/* L3_M_BC_SATROLL_CR */
+#define BC_SATROLL_CR_RESET (0)
+
+/* L3_M_BC_CNTENSET */
+#define PMCNTENSET(__cntr) (((u32)1) << ((__cntr) & 0x7))
+
+/* L3_M_BC_CNTENCLR */
+#define PMCNTENCLR(__cntr) (((u32)1) << ((__cntr) & 0x7))
+#define BC_CNTENCLR_RESET (0xFF)
+
+/* L3_M_BC_INTENSET */
+#define PMINTENSET(__cntr) (((u32)1) << ((__cntr) & 0x7))
+
+/* L3_M_BC_INTENCLR */
+#define PMINTENCLR(__cntr) (((u32)1) << ((__cntr) & 0x7))
+#define BC_INTENCLR_RESET (0xFF)
+
+/* L3_M_BC_GANG */
+#define GANG_EN(__cntr) (((u32)1) << ((__cntr) & 0x7))
+#define BC_GANG_RESET (0)
+
+/* L3_M_BC_OVSR */
+#define PMOVSRCLR(__cntr) (((u32)1) << ((__cntr) & 0x7))
+#define PMOVSRCLR_RESET (0xFF)
+
+/* L3_M_BC_IRQCTL */
+#define PMIRQONMSBEN(__cntr) (((u32)1) << ((__cntr) & 0x7))
+#define BC_IRQCTL_RESET (0x0)
+
+/*
+ * Events
+ */
+
+#define L3_CYCLES 0x01
+#define L3_READ_HIT 0x20
+#define L3_READ_MISS 0x21
+#define L3_READ_HIT_D 0x22
+#define L3_READ_MISS_D 0x23
+#define L3_WRITE_HIT 0x24
+#define L3_WRITE_MISS 0x25
+
+/*
+ * Decoding of settings from perf_event_attr
+ *
+ * The config format for perf events is:
+ * - config: bits 0-7: event type
+ * bit 32: HW counter size requested, 0: 32 bits, 1: 64 bits
+ */
+static inline u32 get_event_type(struct perf_event *event)
+{
+ return (event->attr.config) & L3_MAX_EVTYPE;
+}
+
+static inline int get_hw_counter_size(struct perf_event *event)
+{
+ return event->attr.config >> 32 & 1;
+}
+
+/*
+ * Hardware counter interface.
+ *
+ * This interface allows operations on counters to be polymorphic.
+ * The hardware supports counter chaining to allow 64 bit virtual counters.
+ * We expose this capability as a config option for each event, that way
+ * a user can create perf events that use 32 bit counters for events that
+ * increment at a slower rate, and perf events that use 64 bit counters
+ * for events that increment faster and avoid IRQ overhead.
+ */
+struct l3cache_pmu_hwc {
+ struct perf_event *event;
+ /* Called to start event monitoring */
+ void (*start)(struct perf_event *event);
+ /* Called to stop event monitoring */
+ void (*stop)(struct perf_event *event, int flags);
+ /* Called to update the perf_event */
+ void (*update)(struct perf_event *event);
+};
+
+/*
+ * Main PMU, inherits from the core perf PMU type
+ */
+struct l3cache_pmu {
+ struct pmu perf_pmu;
+ struct hlist_node node;
+ void __iomem *regs;
+ struct l3cache_pmu_hwc counters[L3_NUM_COUNTERS];
+ unsigned long used_mask[BITS_TO_LONGS(L3_NUM_COUNTERS)];
+ cpumask_t cpumask;
+};
+
+#define to_l3cache_pmu(p) (container_of(p, struct l3cache_pmu, perf_pmu))
+
+/*
+ * 64 bit counter implementation
+ */
+
+static void qcom_l3_cache__64bit_counter_start(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 evsel = get_event_type(event);
+ u64 evcnt = local64_read(&event->count);
+ u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
+
+ writel_relaxed(gang | GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
+
+ writel_relaxed(evcnt >> 32, pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
+ writel_relaxed((u32)evcnt, pmu->regs + L3_HML3_PM_EVCNTR(idx));
+
+ writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(idx+1));
+ writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
+
+ writel_relaxed(PMCNTENSET(idx+1), pmu->regs + L3_M_BC_CNTENSET);
+ writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
+}
+
+static void qcom_l3_cache__64bit_counter_stop(struct perf_event *event,
+ int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
+
+ writel_relaxed(gang & ~GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
+ writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
+ writel_relaxed(PMCNTENCLR(idx+1), pmu->regs + L3_M_BC_CNTENCLR);
+}
+
+static void qcom_l3_cache__64bit_counter_update(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 hi_new, hi_old, lo;
+ int i, retries = 2;
+
+ hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
+ hi_old = hi_new + 1;
+ for (i = 0; (i < retries) && (hi_old != hi_new); i++) {
+ hi_old = hi_new;
+ lo = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
+ hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
+ }
+
+ local64_set(&event->count, ((u64)hi_new << 32) | lo);
+}
+
+/*
+ * 32 bit counter interface implementation
+ */
+
+static void qcom_l3_cache__32bit_counter_start(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 evsel = get_event_type(event);
+ u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
+
+ local64_set(&event->hw.prev_count, 0);
+ writel_relaxed(0, pmu->regs + L3_HML3_PM_EVCNTR(idx));
+ writel_relaxed(irqctl | PMIRQONMSBEN(idx), pmu->regs + L3_M_BC_IRQCTL);
+ writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
+ writel_relaxed(PMINTENSET(idx), pmu->regs + L3_M_BC_INTENSET);
+ writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
+}
+
+static void qcom_l3_cache__32bit_counter_stop(struct perf_event *event,
+ int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
+
+ writel_relaxed(irqctl & ~PMIRQONMSBEN(idx), pmu->regs + L3_M_BC_IRQCTL);
+ writel_relaxed(PMINTENCLR(idx), pmu->regs + L3_M_BC_INTENCLR);
+ writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
+}
+
+static void qcom_l3_cache__32bit_counter_update(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ int idx = event->hw.idx;
+ u32 delta, prev, now;
+
+ do {
+ prev = local64_read(&event->hw.prev_count);
+ now = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
+ } while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);
+
+ delta = now - prev;
+ local64_add(delta, &event->count);
+}
+
+/*
+ * Top level PMU functions.
+ */
+
+static inline void qcom_l3_cache__init(struct l3cache_pmu *pmu)
+{
+ int i;
+
+ writel_relaxed(BC_RESET, pmu->regs + L3_M_BC_CR);
+
+ /*
+ * Use writel for the first programming command to ensure the basic
+ * counter unit is stopped before proceeding
+ */
+ writel(BC_SATROLL_CR_RESET, pmu->regs + L3_M_BC_SATROLL_CR);
+
+ writel_relaxed(BC_CNTENCLR_RESET, pmu->regs + L3_M_BC_CNTENCLR);
+ writel_relaxed(BC_INTENCLR_RESET, pmu->regs + L3_M_BC_INTENCLR);
+ writel_relaxed(PMOVSRCLR_RESET, pmu->regs + L3_M_BC_OVSR);
+ writel_relaxed(BC_GANG_RESET, pmu->regs + L3_M_BC_GANG);
+ writel_relaxed(BC_IRQCTL_RESET, pmu->regs + L3_M_BC_IRQCTL);
+ writel_relaxed(PM_CR_RESET, pmu->regs + L3_HML3_PM_CR);
+
+ for (i = 0; i < L3_NUM_COUNTERS; ++i) {
+ writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(i));
+ writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(i));
+ }
+
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRA);
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRAM);
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRB);
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRBM);
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRC);
+ writel_relaxed(PM_FLTR_RESET, pmu->regs + L3_HML3_PM_FILTRCM);
+
+ /*
+ * Use writel here to ensure all programming commands are done
+ * before proceeding
+ */
+ writel(BC_ENABLE, pmu->regs + L3_M_BC_CR);
+}
+
+static irqreturn_t qcom_l3_cache__handle_irq(int irq_num, void *data)
+{
+ struct l3cache_pmu *pmu = data;
+ u32 status = readl_relaxed(pmu->regs + L3_M_BC_OVSR);
+ int idx;
+
+ if (status == 0)
+ return IRQ_NONE;
+
+ writel_relaxed(status, pmu->regs + L3_M_BC_OVSR);
+ while (status) {
+ struct perf_event *event;
+
+ idx = __ffs(status);
+ status &= ~(1 << idx);
+ event = pmu->counters[idx].event;
+ if (!event)
+ continue;
+
+ qcom_l3_cache__32bit_counter_update(event);
+ }
+
+ return IRQ_HANDLED;
+}
+
+/*
+ * Implementation of abstract pmu functionality required by
+ * the core perf events code.
+ */
+
+static void qcom_l3_cache__pmu_enable(struct pmu *perf_pmu)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
+ int i;
+
+ /*
+ * Re-write CNTCTL for all existing events to re-assert
+ * the start trigger.
+ */
+ for (i = 0; i < L3_NUM_COUNTERS; i++)
+ if (pmu->counters[i].event)
+ writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(i));
+
+ /* Ensure all programming commands are done before proceeding */
+ writel(BC_ENABLE, pmu->regs + L3_M_BC_CR);
+}
+
+static void qcom_l3_cache__pmu_disable(struct pmu *perf_pmu)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
+
+ writel_relaxed(0, pmu->regs + L3_M_BC_CR);
+
+ /* Ensure the basic counter unit is stopped before proceeding */
+ wmb();
+}
+
+static int qcom_l3_cache__event_init(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu;
+ struct hw_perf_event *hwc = &event->hw;
+
+ /*
+ * Is the event for this PMU?
+ */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /*
+ * There are no per-counter mode filters in the PMU.
+ */
+ if (event->attr.exclude_user || event->attr.exclude_kernel ||
+ event->attr.exclude_hv || event->attr.exclude_idle)
+ return -EINVAL;
+
+ hwc->idx = -1;
+
+ /*
+ * Sampling not supported since these events are not core-attributable.
+ */
+ if (hwc->sample_period)
+ return -EINVAL;
+
+ /*
+ * Task mode not available, we run the counters as socket counters,
+ * not attributable to any CPU and therefore cannot attribute per-task.
+ */
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ /*
+ * Many perf core operations (eg. events rotation) operate on a
+ * single CPU context. This is obvious for CPU PMUs, where one
+ * expects the same sets of events being observed on all CPUs,
+ * but can lead to issues for off-core PMUs, like this one, where
+ * each event could be theoretically assigned to a different CPU.
+ * To mitigate this, we enforce CPU assignment to one designated
+ * processor (the one described in the "cpumask" attribute exported
+ * by the PMU device). perf user space tools honor this and avoid
+ * opening more than one copy of the events.
+ */
+ pmu = to_l3cache_pmu(event->pmu);
+ event->cpu = cpumask_first(&pmu->cpumask);
+
+ return 0;
+}
+
+static void qcom_l3_cache__event_start(struct perf_event *event, int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+
+ hwc->state = 0;
+
+ pmu->counters[hwc->idx].start(event);
+}
+
+static void qcom_l3_cache__event_stop(struct perf_event *event, int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (!(hwc->state & PERF_HES_STOPPED)) {
+ pmu->counters[hwc->idx].stop(event, flags);
+
+ if (flags & PERF_EF_UPDATE)
+ pmu->counters[hwc->idx].update(event);
+ hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+ }
+}
+
+static int qcom_l3_cache__event_add(struct perf_event *event, int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ int idx;
+ int sz;
+
+ /*
+ * Try to allocate a counter.
+ */
+ sz = get_hw_counter_size(event);
+ idx = bitmap_find_free_region(pmu->used_mask, L3_NUM_COUNTERS, sz);
+ if (idx < 0)
+ /* The counters are all in use. */
+ return -EAGAIN;
+
+ hwc->idx = idx;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (sz == 0)
+ pmu->counters[idx] = (struct l3cache_pmu_hwc) {
+ .event = event,
+ .start = qcom_l3_cache__32bit_counter_start,
+ .stop = qcom_l3_cache__32bit_counter_stop,
+ .update = qcom_l3_cache__32bit_counter_update
+ };
+ else {
+ pmu->counters[idx] = (struct l3cache_pmu_hwc) {
+ .event = event,
+ .start = qcom_l3_cache__64bit_counter_start,
+ .stop = qcom_l3_cache__64bit_counter_stop,
+ .update = qcom_l3_cache__64bit_counter_update
+ };
+ pmu->counters[idx+1] = pmu->counters[idx];
+ }
+
+ if (flags & PERF_EF_START)
+ qcom_l3_cache__event_start(event, 0);
+
+ /* Propagate changes to the userspace mapping. */
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void qcom_l3_cache__event_del(struct perf_event *event, int flags)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ int sz;
+
+ qcom_l3_cache__event_stop(event, flags | PERF_EF_UPDATE);
+ sz = get_hw_counter_size(event);
+ pmu->counters[hwc->idx].event = NULL;
+ if (sz)
+ pmu->counters[hwc->idx+1].event = NULL;
+ bitmap_release_region(pmu->used_mask, hwc->idx, sz);
+
+ perf_event_update_userpage(event);
+}
+
+static void qcom_l3_cache__event_read(struct perf_event *event)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+
+ pmu->counters[hwc->idx].update(event);
+}
+
+/*
+ * Add support for creating events symbolically when using the perf
+ * user space tools command line. E.g.:
+ * perf stat -a -e l3cache/event=read-miss/ ls
+ * perf stat -a -e l3cache/event=0x21/ ls
+ */
+
+ssize_t l3cache_pmu_event_sysfs_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ struct perf_pmu_events_attr *pmu_attr;
+
+ pmu_attr = container_of(attr, struct perf_pmu_events_attr, attr);
+ return sprintf(page, "event=0x%02llx\n", pmu_attr->id);
+}
+
+#define L3CACHE_EVENT_VAR(__id) pmu_event_attr_##__id
+#define L3CACHE_EVENT_PTR(__id) (&L3CACHE_EVENT_VAR(__id).attr.attr)
+
+#define L3CACHE_EVENT_ATTR(__name, __id) \
+ PMU_EVENT_ATTR(__name, L3CACHE_EVENT_VAR(__id), __id, \
+ l3cache_pmu_event_sysfs_show)
+
+
+L3CACHE_EVENT_ATTR(cycles, L3_CYCLES);
+L3CACHE_EVENT_ATTR(read-hit, L3_READ_HIT);
+L3CACHE_EVENT_ATTR(read-miss, L3_READ_MISS);
+L3CACHE_EVENT_ATTR(read-hit-d-side, L3_READ_HIT_D);
+L3CACHE_EVENT_ATTR(read-miss-d-side, L3_READ_MISS_D);
+L3CACHE_EVENT_ATTR(write-hit, L3_WRITE_HIT);
+L3CACHE_EVENT_ATTR(write-miss, L3_WRITE_MISS);
+
+static struct attribute *qcom_l3_cache_pmu_events[] = {
+ L3CACHE_EVENT_PTR(L3_CYCLES),
+ L3CACHE_EVENT_PTR(L3_READ_HIT),
+ L3CACHE_EVENT_PTR(L3_READ_MISS),
+ L3CACHE_EVENT_PTR(L3_READ_HIT_D),
+ L3CACHE_EVENT_PTR(L3_READ_MISS_D),
+ L3CACHE_EVENT_PTR(L3_WRITE_HIT),
+ L3CACHE_EVENT_PTR(L3_WRITE_MISS),
+ NULL
+};
+
+static struct attribute_group qcom_l3_cache_pmu_events_group = {
+ .name = "events",
+ .attrs = qcom_l3_cache_pmu_events,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+PMU_FORMAT_ATTR(lc, "config:32");
+
+static struct attribute *qcom_l3_cache_pmu_formats[] = {
+ &format_attr_event.attr,
+ &format_attr_lc.attr,
+ NULL,
+};
+
+static struct attribute_group qcom_l3_cache_pmu_format_group = {
+ .name = "format",
+ .attrs = qcom_l3_cache_pmu_formats,
+};
+
+static ssize_t qcom_l3_cache_pmu_cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct l3cache_pmu *pmu = to_l3cache_pmu(dev_get_drvdata(dev));
+
+ return cpumap_print_to_pagebuf(true, buf, &pmu->cpumask);
+}
+
+static struct device_attribute qcom_l3_cache_pmu_cpumask_attr =
+ __ATTR(cpumask, 0444, qcom_l3_cache_pmu_cpumask_show, NULL);
+
+static struct attribute *qcom_l3_cache_pmu_cpumask_attrs[] = {
+ &qcom_l3_cache_pmu_cpumask_attr.attr,
+ NULL,
+};
+
+static struct attribute_group qcom_l3_cache_pmu_cpumask_attr_group = {
+ .attrs = qcom_l3_cache_pmu_cpumask_attrs,
+};
+
+
+static const struct attribute_group *qcom_l3_cache_pmu_attr_grps[] = {
+ &qcom_l3_cache_pmu_format_group,
+ &qcom_l3_cache_pmu_events_group,
+ &qcom_l3_cache_pmu_cpumask_attr_group,
+ NULL,
+};
+
+/*
+ * Probing functions and data.
+ */
+
+static int qcom_l3_cache_pmu_online_cpu(unsigned int cpu, struct hlist_node *node)
+{
+ struct l3cache_pmu *pmu = hlist_entry_safe(node, struct l3cache_pmu, node);
+
+ /* If there is not a CPU/PMU association pick this CPU */
+ if (cpumask_empty(&pmu->cpumask))
+ cpumask_set_cpu(cpu, &pmu->cpumask);
+
+ return 0;
+}
+
+static int qcom_l3_cache_pmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
+{
+ struct l3cache_pmu *pmu = hlist_entry_safe(node, struct l3cache_pmu, node);
+ unsigned int target;
+
+ if (!cpumask_test_and_clear_cpu(cpu, &pmu->cpumask))
+ return 0;
+ target = cpumask_any_but(cpu_online_mask, cpu);
+ if (target >= nr_cpu_ids)
+ return 0;
+ perf_pmu_migrate_context(&pmu->perf_pmu, cpu, target);
+ cpumask_set_cpu(target, &pmu->cpumask);
+ return 0;
+}
+
+static int qcom_l3_cache_pmu_probe(struct platform_device *pdev)
+{
+ struct l3cache_pmu *pmu;
+ struct acpi_device *acpi_dev;
+ struct resource *memrc;
+ int rc;
+ char *name;
+
+ /* Initialize the PMU data structures */
+
+ acpi_dev = ACPI_COMPANION(&pdev->dev);
+ if (!acpi_dev)
+ return -ENODEV;
+
+ pmu = devm_kzalloc(&pdev->dev, sizeof(*pmu), GFP_KERNEL);
+ name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "l3cache_%s_%s",
+ acpi_dev->parent->pnp.unique_id, acpi_dev->pnp.unique_id);
+ if (!pmu || !name)
+ return -ENOMEM;
+
+ pmu->perf_pmu = (struct pmu) {
+ .task_ctx_nr = perf_invalid_context,
+
+ .pmu_enable = qcom_l3_cache__pmu_enable,
+ .pmu_disable = qcom_l3_cache__pmu_disable,
+ .event_init = qcom_l3_cache__event_init,
+ .add = qcom_l3_cache__event_add,
+ .del = qcom_l3_cache__event_del,
+ .start = qcom_l3_cache__event_start,
+ .stop = qcom_l3_cache__event_stop,
+ .read = qcom_l3_cache__event_read,
+
+ .attr_groups = qcom_l3_cache_pmu_attr_grps,
+ };
+
+ memrc = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+ pmu->regs = devm_ioremap_resource(&pdev->dev, memrc);
+ if (IS_ERR(pmu->regs)) {
+ dev_err(&pdev->dev, "Can't map PMU @%pa\n", &memrc->start);
+ return PTR_ERR(pmu->regs);
+ }
+
+ qcom_l3_cache__init(pmu);
+
+ rc = platform_get_irq(pdev, 0);
+ if (rc <= 0) {
+ dev_err(&pdev->dev, "Failed to get valid irq for PMU @%pa\n",
+ &memrc->start);
+ return rc;
+ }
+
+ rc = devm_request_irq(&pdev->dev, rc, qcom_l3_cache__handle_irq, 0,
+ name, pmu);
+ if (rc) {
+ dev_err(&pdev->dev, "Request for IRQ failed for slice @%pa\n",
+ &memrc->start);
+ return rc;
+ }
+
+ /* Add this instance to the list used by the offline callback */
+ rc = cpuhp_state_add_instance(CPUHP_AP_PERF_QCOM_L3CACHE_ONLINE, &pmu->node);
+ if (rc) {
+ dev_err(&pdev->dev, "Error %d registering hotplug", rc);
+ return rc;
+ }
+
+ rc = perf_pmu_register(&pmu->perf_pmu, name, -1);
+ if (rc < 0) {
+ dev_err(&pdev->dev, "Failed to register L3 cache PMU (%d)\n", rc);
+ return rc;
+ }
+
+ dev_info(&pdev->dev, "Registered %s, type: %d\n", name, pmu->perf_pmu.type);
+
+ return 0;
+}
+
+static const struct acpi_device_id qcom_l3_cache_pmu_acpi_match[] = {
+ { "QCOM8081", },
+ { }
+};
+MODULE_DEVICE_TABLE(acpi, qcom_l3_cache_pmu_acpi_match);
+
+static struct platform_driver qcom_l3_cache_pmu_driver = {
+ .driver = {
+ .name = "qcom-l3cache-pmu",
+ .owner = THIS_MODULE,
+ .acpi_match_table = ACPI_PTR(qcom_l3_cache_pmu_acpi_match),
+ },
+ .probe = qcom_l3_cache_pmu_probe,
+};
+
+static int __init register_qcom_l3_cache_pmu_driver(void)
+{
+ int ret;
+
+ /* Install a hook to update the reader CPU in case it goes offline */
+ ret = cpuhp_setup_state_multi(CPUHP_AP_PERF_QCOM_L3CACHE_ONLINE,
+ "perf/qcom/l3cache:online",
+ qcom_l3_cache_pmu_online_cpu,
+ qcom_l3_cache_pmu_offline_cpu);
+ if (ret)
+ return ret;
+
+ return platform_driver_register(&qcom_l3_cache_pmu_driver);
+}
+device_initcall(register_qcom_l3_cache_pmu_driver);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 921acaa..be83438 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -137,6 +137,7 @@ enum cpuhp_state {
CPUHP_AP_PERF_ARM_CCI_ONLINE,
CPUHP_AP_PERF_ARM_CCN_ONLINE,
CPUHP_AP_PERF_ARM_L2X0_ONLINE,
+ CPUHP_AP_PERF_QCOM_L3CACHE_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
CPUHP_AP_ONLINE_DYN,
--
Qualcomm Datacenter Technologies, Inc. on behalf of the Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.


2017-03-03 16:46:05

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V3] perf: qcom: Add L3 cache PMU driver

Hi Augustin,

On Thu, Mar 02, 2017 at 03:58:32PM -0500, Agustin Vega-Frias wrote:
> This adds a new dynamic PMU to the Perf Events framework to program
> and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
>
> The driver supports a distributed cache architecture where the overall
> cache for a socket is comprised of multiple slices each with its own PMU.
> Access to each individual PMU is provided even though all CPUs share all
> the slices. User space needs to aggregate to individual counts to provide
> a global picture.
>
> The driver exports formatting and event information to sysfs so it can
> be used by the perf user space tools with the syntaxes:
> perf stat -a -e l3cache_0_0/read-miss/
> perf stat -a -e l3cache_0_0/event=0x21/

As a high-level thing, while we can't do aggregation of counts across
slices, and while we'd have to reject cross-slice groups, we *could*
have a single struct PMU that covers all those slices, with userspace
selecting which slice an event targets via
perf_event_attr::config{,1,2}. i.e. we'd treat those as independent
sub-units under the PMU.

With some sysfs and userspace work, it would then be possible to have
the perf tool infer automatically that it should open an event on each
sub-unit as it currently does per-cpu.

>
> Signed-off-by: Agustin Vega-Frias <[email protected]>
> ---
> drivers/perf/Kconfig | 10 +
> drivers/perf/Makefile | 1 +
> drivers/perf/qcom_l3_pmu.c | 755 +++++++++++++++++++++++++++++++++++++++++++++
> include/linux/cpuhotplug.h | 1 +
> 4 files changed, 767 insertions(+)
> create mode 100644 drivers/perf/qcom_l3_pmu.c

This could do with documentation, as we have for the L2 PMU in
Documentation/perf/qcom_l2_pmu.txt, e.g. with hhow to user the long
counter option.

IIRC this also has a flat event space, rather than the row/column style
that the L2 PMU has. That would be worth mentioning explicitly, too.

[...]

> diff --git a/drivers/perf/qcom_l3_pmu.c b/drivers/perf/qcom_l3_pmu.c
> new file mode 100644
> index 0000000..207f174
> --- /dev/null
> +++ b/drivers/perf/qcom_l3_pmu.c
> @@ -0,0 +1,755 @@
> +/* Copyright (c) 2015-2017, The Linux Foundation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 and
> + * only version 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/bitops.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/list.h>
> +#include <linux/acpi.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>

Nit: please keep includes ordered alphabetically.

Also, as a general stylistic note, please keep the start of the file
file organised as:

1) description of the file
2) copyright
3) includes
4) code

(perhaps with 1 & 2 being the same comment block)

That's what we do in the other PMU drivers in this directory.

> +/*
> + * Driver for the L3 cache PMUs in Qualcomm Technologies chips.
> + *
> + * The driver supports a distributed cache architecture where the overall
> + * cache for a socket is comprised of multiple slices each with its own PMU.
> + * Access to each individual PMU is provided even though all CPUs share all
> + * the slices. User space needs to aggregate to individual counts to provide
> + * a global picture.
> + *
> + * The hardware supports counter chaining to provide the user a way to avoid
> + * overhead of software counter maintenance. This is exposed via a the 'lc'
> + * flag field in perf_event_attr.config.

We should be able to refer to the documentation file here.

> + *
> + * The hardware also supports a feature that asserts the IRQ on the toggling
> + * of the most significanty bit in the 32bit counter.

Nit: s/significanty/significant/

This is just an overflow interrupt, right?

Or does this interrupt also trigger when bit 31 goes from 0 to 1?

[...]

> +/*
> + * Decoding of settings from perf_event_attr
> + *
> + * The config format for perf events is:
> + * - config: bits 0-7: event type
> + * bit 32: HW counter size requested, 0: 32 bits, 1: 64 bits
> + */

Not really a problem, but is there a specific reason to avoid bits 31-8?

> +static inline u32 get_event_type(struct perf_event *event)
> +{
> + return (event->attr.config) & L3_MAX_EVTYPE;
> +}
> +
> +static inline int get_hw_counter_size(struct perf_event *event)
> +{
> + return event->attr.config >> 32 & 1;
> +}

This would be better as something like

#define L3_EVENT_ATTR_LC_BIT 32

static inline bool event_uses_long_counter(struct perf_event *event)
{
return !!(event->attr.config & BIT_ULL(L3_EVENT_ATTR_LC_BIT));
}

That way it's clear what the return value means, and you can reuse the
definition for the sysfs attr.

That means qcom_l3_cache__event_add() has to be more explicit w.r.t. the
order parameter for the bitmap search, but that's a good thing.

[...]

> +struct l3cache_pmu_hwc {
> + struct perf_event *event;
> + /* Called to start event monitoring */
> + void (*start)(struct perf_event *event);
> + /* Called to stop event monitoring */
> + void (*stop)(struct perf_event *event, int flags);
> + /* Called to update the perf_event */
> + void (*update)(struct perf_event *event);
> +};

Even if having separate ops makes handling chaining simpler, I don't
think we need a copy of all these per-event.

We can instead have something like:

struct l3cache_event_ops {
void (*start)(struct perf_event *event);
void (*stop)(struct perf_event *event, int flags);
void (*update)(struct perf_event *event);
};

struct l3cache_event_ops event_ops_std;
struct l3cache_event_ops event_ops_long;

static struct l3cache_event_ops *l3cache_event_get_ops(struct perf_event *event)
{
if (event_uses_long_counter(event))
return &event_ops_long;
else
return &event_ops_std;
}

... and callers the need the ops can consult
l3cache_event_get_ops(event).

[...]

> +
> +/*
> + * Main PMU, inherits from the core perf PMU type
> + */
> +struct l3cache_pmu {
> + struct pmu perf_pmu;

Please call this pmu.

More generally, for function arguments and vars, please name a struct
pmu as 'pmu', and an l3cache_pmu as 'l3pmu' (or something like that).

That'll keep things consistent and easier to follow, even for those
unfamiliar with this particular driver.

> + struct hlist_node node;
> + void __iomem *regs;
> + struct l3cache_pmu_hwc counters[L3_NUM_COUNTERS];
> + unsigned long used_mask[BITS_TO_LONGS(L3_NUM_COUNTERS)];
> + cpumask_t cpumask;
> +};
> +
> +#define to_l3cache_pmu(p) (container_of(p, struct l3cache_pmu, perf_pmu))
> +
> +/*
> + * 64 bit counter implementation
> + */
> +
> +static void qcom_l3_cache__64bit_counter_start(struct perf_event *event)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 evsel = get_event_type(event);
> + u64 evcnt = local64_read(&event->count);
> + u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
> +
> + writel_relaxed(gang | GANG_EN(idx), pmu->regs + L3_M_BC_GANG);

I take it this is what actually configures the chaining behaviour?

> + writel_relaxed(evcnt >> 32, pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
> + writel_relaxed((u32)evcnt, pmu->regs + L3_HML3_PM_EVCNTR(idx));


It's not safe to use perf_event::count in this manner. The core perf
code can change the count value at any time it wants (e.g. see
PERF_EVENT_IOC_RESET), so this will result in bogus counts.

Drivers should keep track of the HW value using
hw_perf_event::prev_count, and add the delta into perf_event::count.

Please do so here.

> + writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(idx+1));
> + writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
> +
> + writel_relaxed(PMCNTENSET(idx+1), pmu->regs + L3_M_BC_CNTENSET);
> + writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
> +}

IIUC, we're not enabling the interrupt here, is that correct?

If that's deliberate, there should be a comment as to why.

These are all relaxed, so what ensures the counter is actually started
when we return from this function?

> +static void qcom_l3_cache__64bit_counter_stop(struct perf_event *event,
> + int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
> +
> + writel_relaxed(gang & ~GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
> + writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
> + writel_relaxed(PMCNTENCLR(idx+1), pmu->regs + L3_M_BC_CNTENCLR);
> +}
> +
> +static void qcom_l3_cache__64bit_counter_update(struct perf_event *event)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 hi_new, hi_old, lo;
> + int i, retries = 2;
> +
> + hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
> + hi_old = hi_new + 1;
> + for (i = 0; (i < retries) && (hi_old != hi_new); i++) {
> + hi_old = hi_new;
> + lo = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
> + hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
> + }

This looks overly complicated, and would be far simpler using a pattern
like:

do {
hi = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
lo = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
} while (hi != readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1)));

If we need a bounded number of retries, we should warn rather than
exiting the loop ant potentially returning junk.

Elsewhere when we read a 64-bit free-running counter in two halves (e.g.
in arch_counter_get_cntvct_mem()), it's assumed that 32-bit overflow
takes sufficiently long that it's unlikely to occur often enough to
result in a livelock, and we don't need to explicitly bound the loop.

> +
> + local64_set(&event->count, ((u64)hi_new << 32) | lo);

As mentioned above for qcom_l3_cache__64bit_counter_start(), please use
hw_perf_event::prev_count in the usual way, and accumulate the delta
into perf_event::count when updating.

Doing so means you can share most of the logic for update.

> +}
> +
> +/*
> + * 32 bit counter interface implementation
> + */
> +
> +static void qcom_l3_cache__32bit_counter_start(struct perf_event *event)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 evsel = get_event_type(event);
> + u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
> +
> + local64_set(&event->hw.prev_count, 0);

Please follow the example set elsewhere, and start the counter at half
its max value (e.g. 0x80000000 here). So long as the IRQ is taken before
another 2^31 events occur, that caters for extreme IRQ latency, and the
IRQ handler doesn't have to treat the overflow as an implicit extra
counter bit.

> + writel_relaxed(0, pmu->regs + L3_HML3_PM_EVCNTR(idx));
> + writel_relaxed(irqctl | PMIRQONMSBEN(idx), pmu->regs + L3_M_BC_IRQCTL);
> + writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
> + writel_relaxed(PMINTENSET(idx), pmu->regs + L3_M_BC_INTENSET);
> + writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
> +}
> +
> +static void qcom_l3_cache__32bit_counter_stop(struct perf_event *event,
> + int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
> +
> + writel_relaxed(irqctl & ~PMIRQONMSBEN(idx), pmu->regs + L3_M_BC_IRQCTL);
> + writel_relaxed(PMINTENCLR(idx), pmu->regs + L3_M_BC_INTENCLR);
> + writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
> +}
> +
> +static void qcom_l3_cache__32bit_counter_update(struct perf_event *event)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + int idx = event->hw.idx;
> + u32 delta, prev, now;
> +
> + do {
> + prev = local64_read(&event->hw.prev_count);
> + now = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
> + } while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);
> +
> + delta = now - prev;
> + local64_add(delta, &event->count);
> +}

This doesn't take overflow into account, so we're potentially losing
2^32 events here when called from the IRQ handler.

If you start the counter at half of its max value, this should be ok
as-is.

[...]

> +static irqreturn_t qcom_l3_cache__handle_irq(int irq_num, void *data)
> +{
> + struct l3cache_pmu *pmu = data;
> + u32 status = readl_relaxed(pmu->regs + L3_M_BC_OVSR);
> + int idx;
> +
> + if (status == 0)
> + return IRQ_NONE;
> +
> + writel_relaxed(status, pmu->regs + L3_M_BC_OVSR);

I take it this clears the overflow status bits? i.e. it's write to
clear?

It would be worth a comment.

> + while (status) {
> + struct perf_event *event;
> +
> + idx = __ffs(status);
> + status &= ~(1 << idx);

It would be better to have status in an unsigned long, and use
for_each_set_bit() over that.

> + event = pmu->counters[idx].event;
> + if (!event)
> + continue;
> +
> + qcom_l3_cache__32bit_counter_update(event);

It would be worth a comment as to why we don't consider 64-bit events.

Even given that, for consistency it'd be worth updating the counter
using the usual ops indirection.

> + }
> +
> + return IRQ_HANDLED;
> +}
> +
> +/*
> + * Implementation of abstract pmu functionality required by
> + * the core perf events code.
> + */
> +
> +static void qcom_l3_cache__pmu_enable(struct pmu *perf_pmu)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
> + int i;
> +
> + /*
> + * Re-write CNTCTL for all existing events to re-assert
> + * the start trigger.
> + */
> + for (i = 0; i < L3_NUM_COUNTERS; i++)
> + if (pmu->counters[i].event)
> + writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(i));

Could you elaborate on this, please?

Why do we need to poke each event? What happens if we didn't do this?

> +
> + /* Ensure all programming commands are done before proceeding */
> + writel(BC_ENABLE, pmu->regs + L3_M_BC_CR);
> +}
> +
> +static void qcom_l3_cache__pmu_disable(struct pmu *perf_pmu)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
> +
> + writel_relaxed(0, pmu->regs + L3_M_BC_CR);
> +
> + /* Ensure the basic counter unit is stopped before proceeding */
> + wmb();

You presumably want likewise on the enable path, before enabling the
PMU.

> +}
> +
> +static int qcom_l3_cache__event_init(struct perf_event *event)
> +{
> + struct l3cache_pmu *pmu;
> + struct hw_perf_event *hwc = &event->hw;
> +
> + /*
> + * Is the event for this PMU?
> + */
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /*
> + * There are no per-counter mode filters in the PMU.
> + */
> + if (event->attr.exclude_user || event->attr.exclude_kernel ||
> + event->attr.exclude_hv || event->attr.exclude_idle)
> + return -EINVAL;
> +
> + hwc->idx = -1;
> +

Please do this after the remaining return conditions below.

> + /*
> + * Sampling not supported since these events are not core-attributable.
> + */
> + if (hwc->sample_period)
> + return -EINVAL;
> +
> + /*
> + * Task mode not available, we run the counters as socket counters,
> + * not attributable to any CPU and therefore cannot attribute per-task.
> + */
> + if (event->cpu < 0)
> + return -EINVAL;
> +

You also need to check the event grouping here. Please look at the QC L2
PMU driver's l2_cache_event_init().

> + /*
> + * Many perf core operations (eg. events rotation) operate on a
> + * single CPU context. This is obvious for CPU PMUs, where one
> + * expects the same sets of events being observed on all CPUs,
> + * but can lead to issues for off-core PMUs, like this one, where
> + * each event could be theoretically assigned to a different CPU.
> + * To mitigate this, we enforce CPU assignment to one designated
> + * processor (the one described in the "cpumask" attribute exported
> + * by the PMU device). perf user space tools honor this and avoid
> + * opening more than one copy of the events.
> + */
> + pmu = to_l3cache_pmu(event->pmu);
> + event->cpu = cpumask_first(&pmu->cpumask);
> +
> + return 0;
> +}
> +
> +static void qcom_l3_cache__event_start(struct perf_event *event, int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> +
> + hwc->state = 0;
> +
> + pmu->counters[hwc->idx].start(event);
> +}
> +
> +static void qcom_l3_cache__event_stop(struct perf_event *event, int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> +
> + if (!(hwc->state & PERF_HES_STOPPED)) {

Please use an early return to handle this, e.g.

if (hwc->state & PERF_HES_STOPPED)
return;

That makes it easier to follow the rest of the function, which doesn't
need to be in a block, indented.

> + pmu->counters[hwc->idx].stop(event, flags);
> +
> + if (flags & PERF_EF_UPDATE)
> + pmu->counters[hwc->idx].update(event);
> + hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
> + }
> +}
> +
> +static int qcom_l3_cache__event_add(struct perf_event *event, int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> + int idx;
> + int sz;
> +
> + /*
> + * Try to allocate a counter.
> + */
> + sz = get_hw_counter_size(event);
> + idx = bitmap_find_free_region(pmu->used_mask, L3_NUM_COUNTERS, sz);

Is it strictly necessary for these to be an even/odd pair, or can
chained events be used with any two counters?

> + if (idx < 0)
> + /* The counters are all in use. */
> + return -EAGAIN;
> +
> + hwc->idx = idx;
> + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +
> + if (sz == 0)
> + pmu->counters[idx] = (struct l3cache_pmu_hwc) {
> + .event = event,
> + .start = qcom_l3_cache__32bit_counter_start,
> + .stop = qcom_l3_cache__32bit_counter_stop,
> + .update = qcom_l3_cache__32bit_counter_update
> + };
> + else {
> + pmu->counters[idx] = (struct l3cache_pmu_hwc) {
> + .event = event,
> + .start = qcom_l3_cache__64bit_counter_start,
> + .stop = qcom_l3_cache__64bit_counter_stop,
> + .update = qcom_l3_cache__64bit_counter_update
> + };
> + pmu->counters[idx+1] = pmu->counters[idx];
> + }

This should be able to go, if we use the ops indirection suggested
above.

In the long counter case, do we even need to touch the additional
pmu->counters[] slot?

AFAICT, we'll never use it for anything. The irq handler implicitly
ignores all pmu->counters[] slots for long events, since they don't
generate irqs, and elsewhere we don't sseem to use this.

> +
> + if (flags & PERF_EF_START)
> + qcom_l3_cache__event_start(event, 0);
> +
> + /* Propagate changes to the userspace mapping. */
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +static void qcom_l3_cache__event_del(struct perf_event *event, int flags)
> +{
> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> + int sz;
> +
> + qcom_l3_cache__event_stop(event, flags | PERF_EF_UPDATE);
> + sz = get_hw_counter_size(event);
> + pmu->counters[hwc->idx].event = NULL;
> + if (sz)
> + pmu->counters[hwc->idx+1].event = NULL;

As above, it's not clear to me if we need to touch this slot at all.

[...]

> +#define L3CACHE_EVENT_VAR(__id) pmu_event_attr_##__id
> +#define L3CACHE_EVENT_PTR(__id) (&L3CACHE_EVENT_VAR(__id).attr.attr)
> +
> +#define L3CACHE_EVENT_ATTR(__name, __id) \
> + PMU_EVENT_ATTR(__name, L3CACHE_EVENT_VAR(__id), __id, \
> + l3cache_pmu_event_sysfs_show)
> +
> +
> +L3CACHE_EVENT_ATTR(cycles, L3_CYCLES);
> +L3CACHE_EVENT_ATTR(read-hit, L3_READ_HIT);
> +L3CACHE_EVENT_ATTR(read-miss, L3_READ_MISS);
> +L3CACHE_EVENT_ATTR(read-hit-d-side, L3_READ_HIT_D);
> +L3CACHE_EVENT_ATTR(read-miss-d-side, L3_READ_MISS_D);
> +L3CACHE_EVENT_ATTR(write-hit, L3_WRITE_HIT);
> +L3CACHE_EVENT_ATTR(write-miss, L3_WRITE_MISS);
> +
> +static struct attribute *qcom_l3_cache_pmu_events[] = {
> + L3CACHE_EVENT_PTR(L3_CYCLES),
> + L3CACHE_EVENT_PTR(L3_READ_HIT),
> + L3CACHE_EVENT_PTR(L3_READ_MISS),
> + L3CACHE_EVENT_PTR(L3_READ_HIT_D),
> + L3CACHE_EVENT_PTR(L3_READ_MISS_D),
> + L3CACHE_EVENT_PTR(L3_WRITE_HIT),
> + L3CACHE_EVENT_PTR(L3_WRITE_MISS),
> + NULL
> +};

Please follow the approach taken by drivers/perf/xgene_pmu.c's
XGENE_PMU_EVENT_ATTR(), so that they can be definied in-place.

> +
> +static struct attribute_group qcom_l3_cache_pmu_events_group = {
> + .name = "events",
> + .attrs = qcom_l3_cache_pmu_events,
> +};
> +
> +PMU_FORMAT_ATTR(event, "config:0-7");
> +PMU_FORMAT_ATTR(lc, "config:32");
> +
> +static struct attribute *qcom_l3_cache_pmu_formats[] = {
> + &format_attr_event.attr,
> + &format_attr_lc.attr,
> + NULL,
> +};

Likewise, see XGENE_PMU_FORMAT_ATTR().

Thanks,
Mark.

2017-03-06 15:29:52

by Agustin Vega-Frias

[permalink] [raw]
Subject: Re: [PATCH V3] perf: qcom: Add L3 cache PMU driver

Hi Mark,

On 2017-03-03 09:50, Mark Rutland wrote:
> Hi Augustin,
>
> On Thu, Mar 02, 2017 at 03:58:32PM -0500, Agustin Vega-Frias wrote:
>> This adds a new dynamic PMU to the Perf Events framework to program
>> and control the L3 cache PMUs in some Qualcomm Technologies SOCs.
>>
>> The driver supports a distributed cache architecture where the overall
>> cache for a socket is comprised of multiple slices each with its own
>> PMU.
>> Access to each individual PMU is provided even though all CPUs share
>> all
>> the slices. User space needs to aggregate to individual counts to
>> provide
>> a global picture.
>>
>> The driver exports formatting and event information to sysfs so it can
>> be used by the perf user space tools with the syntaxes:
>> perf stat -a -e l3cache_0_0/read-miss/
>> perf stat -a -e l3cache_0_0/event=0x21/
>
> As a high-level thing, while we can't do aggregation of counts across
> slices, and while we'd have to reject cross-slice groups, we *could*
> have a single struct PMU that covers all those slices, with userspace
> selecting which slice an event targets via
> perf_event_attr::config{,1,2}. i.e. we'd treat those as independent
> sub-units under the PMU.
>
> With some sysfs and userspace work, it would then be possible to have
> the perf tool infer automatically that it should open an event on each
> sub-unit as it currently does per-cpu.
>

Agreed, I will be looking at that change later, which can be seen as
related
Andi Kleen's work on uncore events [1].

In my opinion having the separate perf PMUs fits more cleanly into that
work, since there is currently no way to translate something like:

l3cache/read-miss,slice=0xF/ # slices 0-3

into the individual events:

l3cache/read-miss,slice=0x1/ # slice 0
l3cache/read-miss,slice=0x2/ # slice 1
l3cache/read-miss,slice=0x4/ # slice 2
l3cache/read-miss,slice=0x8/ # slice 3

I'd prefer something like:

l3cache_0_[0123]/read-miss/ # slices 0-3 on socket 0
l3cache_0_*/read-miss/ # all slices on socket 0

Thoughts?

>>
>> Signed-off-by: Agustin Vega-Frias <[email protected]>
>> ---
>> drivers/perf/Kconfig | 10 +
>> drivers/perf/Makefile | 1 +
>> drivers/perf/qcom_l3_pmu.c | 755
>> +++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/cpuhotplug.h | 1 +
>> 4 files changed, 767 insertions(+)
>> create mode 100644 drivers/perf/qcom_l3_pmu.c
>
> This could do with documentation, as we have for the L2 PMU in
> Documentation/perf/qcom_l2_pmu.txt, e.g. with hhow to user the long
> counter option.
>
> IIRC this also has a flat event space, rather than the row/column style
> that the L2 PMU has. That would be worth mentioning explicitly, too.
>

Yes this has a flat event space.
I will add the document.

> [...]
>
>> diff --git a/drivers/perf/qcom_l3_pmu.c b/drivers/perf/qcom_l3_pmu.c
>> new file mode 100644
>> index 0000000..207f174
>> --- /dev/null
>> +++ b/drivers/perf/qcom_l3_pmu.c
>> @@ -0,0 +1,755 @@
>> +/* Copyright (c) 2015-2017, The Linux Foundation. All rights
>> reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> modify
>> + * it under the terms of the GNU General Public License version 2 and
>> + * only version 2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + */
>> +
>> +#include <linux/module.h>
>> +#include <linux/bitops.h>
>> +#include <linux/interrupt.h>
>> +#include <linux/io.h>
>> +#include <linux/list.h>
>> +#include <linux/acpi.h>
>> +#include <linux/perf_event.h>
>> +#include <linux/platform_device.h>
>
> Nit: please keep includes ordered alphabetically.
>
> Also, as a general stylistic note, please keep the start of the file
> file organised as:
>
> 1) description of the file
> 2) copyright
> 3) includes
> 4) code
>
> (perhaps with 1 & 2 being the same comment block)
>
> That's what we do in the other PMU drivers in this directory.
>

Will fix this.

>> +/*
>> + * Driver for the L3 cache PMUs in Qualcomm Technologies chips.
>> + *
>> + * The driver supports a distributed cache architecture where the
>> overall
>> + * cache for a socket is comprised of multiple slices each with its
>> own PMU.
>> + * Access to each individual PMU is provided even though all CPUs
>> share all
>> + * the slices. User space needs to aggregate to individual counts to
>> provide
>> + * a global picture.
>> + *
>> + * The hardware supports counter chaining to provide the user a way
>> to avoid
>> + * overhead of software counter maintenance. This is exposed via a
>> the 'lc'
>> + * flag field in perf_event_attr.config.
>
> We should be able to refer to the documentation file here.
>

Will do.

>> + *
>> + * The hardware also supports a feature that asserts the IRQ on the
>> toggling
>> + * of the most significanty bit in the 32bit counter.
>
> Nit: s/significanty/significant/
>
> This is just an overflow interrupt, right?
>
> Or does this interrupt also trigger when bit 31 goes from 0 to 1?
>

Yes the interrupt also triggers when bit 31 goes from 0 to 1. This
feature
was added so we can avoid counter reprogramming and leave the counters
as
free running and still handle the deltas properly.


> [...]
>
>> +/*
>> + * Decoding of settings from perf_event_attr
>> + *
>> + * The config format for perf events is:
>> + * - config: bits 0-7: event type
>> + * bit 32: HW counter size requested, 0: 32 bits, 1: 64
>> bits
>> + */
>
> Not really a problem, but is there a specific reason to avoid bits
> 31-8?
>

No, just leaving some space in case the event types are expanded.

>> +static inline u32 get_event_type(struct perf_event *event)
>> +{
>> + return (event->attr.config) & L3_MAX_EVTYPE;
>> +}
>> +
>> +static inline int get_hw_counter_size(struct perf_event *event)
>> +{
>> + return event->attr.config >> 32 & 1;
>> +}
>
> This would be better as something like
>
> #define L3_EVENT_ATTR_LC_BIT 32
>
> static inline bool event_uses_long_counter(struct perf_event *event)
> {
> return !!(event->attr.config & BIT_ULL(L3_EVENT_ATTR_LC_BIT));
> }
>
> That way it's clear what the return value means, and you can reuse the
> definition for the sysfs attr.
>
> That means qcom_l3_cache__event_add() has to be more explicit w.r.t.
> the
> order parameter for the bitmap search, but that's a good thing.
>

I was implementing it this way so I can just directly use it as a
parameter
to bitmap_find_free_region, but I can take a look to see how to make
this
cleaner.

> [...]
>
>> +struct l3cache_pmu_hwc {
>> + struct perf_event *event;
>> + /* Called to start event monitoring */
>> + void (*start)(struct perf_event *event);
>> + /* Called to stop event monitoring */
>> + void (*stop)(struct perf_event *event, int flags);
>> + /* Called to update the perf_event */
>> + void (*update)(struct perf_event *event);
>> +};
>
> Even if having separate ops makes handling chaining simpler, I don't
> think we need a copy of all these per-event.
>
> We can instead have something like:
>
> struct l3cache_event_ops {
> void (*start)(struct perf_event *event);
> void (*stop)(struct perf_event *event, int flags);
> void (*update)(struct perf_event *event);
> };
>
> struct l3cache_event_ops event_ops_std;
> struct l3cache_event_ops event_ops_long;
>
> static struct l3cache_event_ops *l3cache_event_get_ops(struct
> perf_event *event)
> {
> if (event_uses_long_counter(event))
> return &event_ops_long;
> else
> return &event_ops_std;
> }
>
> ... and callers the need the ops can consult
> l3cache_event_get_ops(event).
>

Understood. How about a single pointer to ops in the counter struct?
E.g.:

struct l3cache_pmu_hwc {
struct perf_event *event;
struct l3cache_event_ops *ops;
};

Then instead of the current:
pmu->counters[hwc->idx].start(event);
we have:
pmu->counters[hwc->idx].ops->start(event);

> [...]
>
>> +
>> +/*
>> + * Main PMU, inherits from the core perf PMU type
>> + */
>> +struct l3cache_pmu {
>> + struct pmu perf_pmu;
>
> Please call this pmu.
>
> More generally, for function arguments and vars, please name a struct
> pmu as 'pmu', and an l3cache_pmu as 'l3pmu' (or something like that).
>
> That'll keep things consistent and easier to follow, even for those
> unfamiliar with this particular driver.
>

Will do.

>> + struct hlist_node node;
>> + void __iomem *regs;
>> + struct l3cache_pmu_hwc counters[L3_NUM_COUNTERS];
>> + unsigned long used_mask[BITS_TO_LONGS(L3_NUM_COUNTERS)];
>> + cpumask_t cpumask;
>> +};
>> +
>> +#define to_l3cache_pmu(p) (container_of(p, struct l3cache_pmu,
>> perf_pmu))
>> +
>> +/*
>> + * 64 bit counter implementation
>> + */
>> +
>> +static void qcom_l3_cache__64bit_counter_start(struct perf_event
>> *event)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 evsel = get_event_type(event);
>> + u64 evcnt = local64_read(&event->count);
>> + u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
>> +
>> + writel_relaxed(gang | GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
>
> I take it this is what actually configures the chaining behaviour?
>

Correct, that's the register name, I might just change GANG to CHAIN
even if the HW guys yell at me ;o)

>> + writel_relaxed(evcnt >> 32, pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
>> + writel_relaxed((u32)evcnt, pmu->regs + L3_HML3_PM_EVCNTR(idx));
>
>
> It's not safe to use perf_event::count in this manner. The core perf
> code can change the count value at any time it wants (e.g. see
> PERF_EVENT_IOC_RESET), so this will result in bogus counts.
>
> Drivers should keep track of the HW value using
> hw_perf_event::prev_count, and add the delta into perf_event::count.
>
> Please do so here.
>

Will fix this, I totally bungled this in trying for efficiency.

>> + writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(idx+1));
>> + writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
>> +
>> + writel_relaxed(PMCNTENSET(idx+1), pmu->regs + L3_M_BC_CNTENSET);
>> + writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
>> +}
>
> IIUC, we're not enabling the interrupt here, is that correct?
>
> If that's deliberate, there should be a comment as to why.

The IRQ is not needed because in essence we have a 64bit hardware
counter
which matches the software counter size. I will add a comment.

>
> These are all relaxed, so what ensures the counter is actually started
> when we return from this function?
>

Given how the framework works at a high level, once all events are
scheduled
into the PMU there is a pmu->enable call, which will use a non-relaxed
write
to do the final enable. Am I missing something?

>> +static void qcom_l3_cache__64bit_counter_stop(struct perf_event
>> *event,
>> + int flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
>> +
>> + writel_relaxed(gang & ~GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
>> + writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
>> + writel_relaxed(PMCNTENCLR(idx+1), pmu->regs + L3_M_BC_CNTENCLR);
>> +}
>> +
>> +static void qcom_l3_cache__64bit_counter_update(struct perf_event
>> *event)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 hi_new, hi_old, lo;
>> + int i, retries = 2;
>> +
>> + hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
>> + hi_old = hi_new + 1;
>> + for (i = 0; (i < retries) && (hi_old != hi_new); i++) {
>> + hi_old = hi_new;
>> + lo = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
>> + hi_new = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
>> + }
>
> This looks overly complicated, and would be far simpler using a pattern
> like:
>
> do {
> hi = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1));
> lo = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
> } while (hi != readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx+1)));
>
> If we need a bounded number of retries, we should warn rather than
> exiting the loop ant potentially returning junk.
>
> Elsewhere when we read a 64-bit free-running counter in two halves
> (e.g.
> in arch_counter_get_cntvct_mem()), it's assumed that 32-bit overflow
> takes sufficiently long that it's unlikely to occur often enough to
> result in a livelock, and we don't need to explicitly bound the loop.
>

Will fix this.

>> +
>> + local64_set(&event->count, ((u64)hi_new << 32) | lo);
>
> As mentioned above for qcom_l3_cache__64bit_counter_start(), please use
> hw_perf_event::prev_count in the usual way, and accumulate the delta
> into perf_event::count when updating.
>
> Doing so means you can share most of the logic for update.
>

Will do

>> +}
>> +
>> +/*
>> + * 32 bit counter interface implementation
>> + */
>> +
>> +static void qcom_l3_cache__32bit_counter_start(struct perf_event
>> *event)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 evsel = get_event_type(event);
>> + u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
>> +
>> + local64_set(&event->hw.prev_count, 0);
>
> Please follow the example set elsewhere, and start the counter at half
> its max value (e.g. 0x80000000 here). So long as the IRQ is taken
> before
> another 2^31 events occur, that caters for extreme IRQ latency, and the
> IRQ handler doesn't have to treat the overflow as an implicit extra
> counter bit.

We can start at zero given that we enable the feature that asserts the
IRQ
on toggle of the MSB. In essence we are doing the same thing other
drivers
do with hardware support that obviates the need to adjust the counter on
every overflow. I will add a block comment to explain this.

>
>> + writel_relaxed(0, pmu->regs + L3_HML3_PM_EVCNTR(idx));
>> + writel_relaxed(irqctl | PMIRQONMSBEN(idx), pmu->regs +
>> L3_M_BC_IRQCTL);
>> + writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
>> + writel_relaxed(PMINTENSET(idx), pmu->regs + L3_M_BC_INTENSET);
>> + writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
>> +}
>> +
>> +static void qcom_l3_cache__32bit_counter_stop(struct perf_event
>> *event,
>> + int flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
>> +
>> + writel_relaxed(irqctl & ~PMIRQONMSBEN(idx), pmu->regs +
>> L3_M_BC_IRQCTL);
>> + writel_relaxed(PMINTENCLR(idx), pmu->regs + L3_M_BC_INTENCLR);
>> + writel_relaxed(PMCNTENCLR(idx), pmu->regs + L3_M_BC_CNTENCLR);
>> +}
>> +
>> +static void qcom_l3_cache__32bit_counter_update(struct perf_event
>> *event)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + int idx = event->hw.idx;
>> + u32 delta, prev, now;
>> +
>> + do {
>> + prev = local64_read(&event->hw.prev_count);
>> + now = readl_relaxed(pmu->regs + L3_HML3_PM_EVCNTR(idx));
>> + } while (local64_cmpxchg(&event->hw.prev_count, prev, now) != prev);
>> +
>> + delta = now - prev;
>> + local64_add(delta, &event->count);
>> +}
>
> This doesn't take overflow into account, so we're potentially losing
> 2^32 events here when called from the IRQ handler.
>
> If you start the counter at half of its max value, this should be ok
> as-is.

Using IRQ on MSB feature ensures this works. I will explain all of this
in a block comment.

>
> [...]
>
>> +static irqreturn_t qcom_l3_cache__handle_irq(int irq_num, void *data)
>> +{
>> + struct l3cache_pmu *pmu = data;
>> + u32 status = readl_relaxed(pmu->regs + L3_M_BC_OVSR);
>> + int idx;
>> +
>> + if (status == 0)
>> + return IRQ_NONE;
>> +
>> + writel_relaxed(status, pmu->regs + L3_M_BC_OVSR);
>
> I take it this clears the overflow status bits? i.e. it's write to
> clear?
>
> It would be worth a comment.
>

Will do

>> + while (status) {
>> + struct perf_event *event;
>> +
>> + idx = __ffs(status);
>> + status &= ~(1 << idx);
>
> It would be better to have status in an unsigned long, and use
> for_each_set_bit() over that.

Will do, however I have seen that the AARCH64 compiler can optimize
better
if we use __ffs directly. It might be the version that I'm using, I will
double-check.

>
>> + event = pmu->counters[idx].event;
>> + if (!event)
>> + continue;
>> +
>> + qcom_l3_cache__32bit_counter_update(event);
>
> It would be worth a comment as to why we don't consider 64-bit events.
>

Will add a comment.

> Even given that, for consistency it'd be worth updating the counter
> using the usual ops indirection.
>

Yes, that will also save some operations.

>> + }
>> +
>> + return IRQ_HANDLED;
>> +}
>> +
>> +/*
>> + * Implementation of abstract pmu functionality required by
>> + * the core perf events code.
>> + */
>> +
>> +static void qcom_l3_cache__pmu_enable(struct pmu *perf_pmu)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
>> + int i;
>> +
>> + /*
>> + * Re-write CNTCTL for all existing events to re-assert
>> + * the start trigger.
>> + */
>> + for (i = 0; i < L3_NUM_COUNTERS; i++)
>> + if (pmu->counters[i].event)
>> + writel_relaxed(PMCNT_RESET, pmu->regs + L3_HML3_PM_CNTCTL(i));
>
> Could you elaborate on this, please?
>
> Why do we need to poke each event? What happens if we didn't do this?
>

The counters can be started via a CTI trigger or via a software
trigger. In case of the software trigger it means we need to rewrite
the CNTCTL register after a disable/enable sequence.

>> +
>> + /* Ensure all programming commands are done before proceeding */
>> + writel(BC_ENABLE, pmu->regs + L3_M_BC_CR);
>> +}
>> +
>> +static void qcom_l3_cache__pmu_disable(struct pmu *perf_pmu)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
>> +
>> + writel_relaxed(0, pmu->regs + L3_M_BC_CR);
>> +
>> + /* Ensure the basic counter unit is stopped before proceeding */
>> + wmb();
>
> You presumably want likewise on the enable path, before enabling the
> PMU.
>

I am doing that, I am using writel on the last operation of pmu_enable,
I will make it a separate barrier to be more explicit.

>> +}
>> +
>> +static int qcom_l3_cache__event_init(struct perf_event *event)
>> +{
>> + struct l3cache_pmu *pmu;
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + /*
>> + * Is the event for this PMU?
>> + */
>> + if (event->attr.type != event->pmu->type)
>> + return -ENOENT;
>> +
>> + /*
>> + * There are no per-counter mode filters in the PMU.
>> + */
>> + if (event->attr.exclude_user || event->attr.exclude_kernel ||
>> + event->attr.exclude_hv || event->attr.exclude_idle)
>> + return -EINVAL;
>> +
>> + hwc->idx = -1;
>> +
>
> Please do this after the remaining return conditions below.
>

Will do.

>> + /*
>> + * Sampling not supported since these events are not
>> core-attributable.
>> + */
>> + if (hwc->sample_period)
>> + return -EINVAL;
>> +
>> + /*
>> + * Task mode not available, we run the counters as socket counters,
>> + * not attributable to any CPU and therefore cannot attribute
>> per-task.
>> + */
>> + if (event->cpu < 0)
>> + return -EINVAL;
>> +
>
> You also need to check the event grouping here. Please look at the QC
> L2
> PMU driver's l2_cache_event_init().
>

Will do.

>> + /*
>> + * Many perf core operations (eg. events rotation) operate on a
>> + * single CPU context. This is obvious for CPU PMUs, where one
>> + * expects the same sets of events being observed on all CPUs,
>> + * but can lead to issues for off-core PMUs, like this one, where
>> + * each event could be theoretically assigned to a different CPU.
>> + * To mitigate this, we enforce CPU assignment to one designated
>> + * processor (the one described in the "cpumask" attribute exported
>> + * by the PMU device). perf user space tools honor this and avoid
>> + * opening more than one copy of the events.
>> + */
>> + pmu = to_l3cache_pmu(event->pmu);
>> + event->cpu = cpumask_first(&pmu->cpumask);
>> +
>> + return 0;
>> +}
>> +
>> +static void qcom_l3_cache__event_start(struct perf_event *event, int
>> flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + hwc->state = 0;
>> +
>> + pmu->counters[hwc->idx].start(event);
>> +}
>> +
>> +static void qcom_l3_cache__event_stop(struct perf_event *event, int
>> flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + if (!(hwc->state & PERF_HES_STOPPED)) {
>
> Please use an early return to handle this, e.g.
>
> if (hwc->state & PERF_HES_STOPPED)
> return;
>
> That makes it easier to follow the rest of the function, which doesn't
> need to be in a block, indented.
>

Will do

>> + pmu->counters[hwc->idx].stop(event, flags);
>> +
>> + if (flags & PERF_EF_UPDATE)
>> + pmu->counters[hwc->idx].update(event);
>> + hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> + }
>> +}
>> +
>> +static int qcom_l3_cache__event_add(struct perf_event *event, int
>> flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + struct hw_perf_event *hwc = &event->hw;
>> + int idx;
>> + int sz;
>> +
>> + /*
>> + * Try to allocate a counter.
>> + */
>> + sz = get_hw_counter_size(event);
>> + idx = bitmap_find_free_region(pmu->used_mask, L3_NUM_COUNTERS, sz);
>
> Is it strictly necessary for these to be an even/odd pair, or can
> chained events be used with any two counters?
>

Only even/odd pairings are supported.

>> + if (idx < 0)
>> + /* The counters are all in use. */
>> + return -EAGAIN;
>> +
>> + hwc->idx = idx;
>> + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> +
>> + if (sz == 0)
>> + pmu->counters[idx] = (struct l3cache_pmu_hwc) {
>> + .event = event,
>> + .start = qcom_l3_cache__32bit_counter_start,
>> + .stop = qcom_l3_cache__32bit_counter_stop,
>> + .update = qcom_l3_cache__32bit_counter_update
>> + };
>> + else {
>> + pmu->counters[idx] = (struct l3cache_pmu_hwc) {
>> + .event = event,
>> + .start = qcom_l3_cache__64bit_counter_start,
>> + .stop = qcom_l3_cache__64bit_counter_stop,
>> + .update = qcom_l3_cache__64bit_counter_update
>> + };
>> + pmu->counters[idx+1] = pmu->counters[idx];
>> + }
>
> This should be able to go, if we use the ops indirection suggested
> above.
>
> In the long counter case, do we even need to touch the additional
> pmu->counters[] slot?
>
> AFAICT, we'll never use it for anything. The irq handler implicitly
> ignores all pmu->counters[] slots for long events, since they don't
> generate irqs, and elsewhere we don't sseem to use this.
>

This was me in paranoid mode, I just didn't want to leave dangling
or stale pointers. I'll clean it up.

>> +
>> + if (flags & PERF_EF_START)
>> + qcom_l3_cache__event_start(event, 0);
>> +
>> + /* Propagate changes to the userspace mapping. */
>> + perf_event_update_userpage(event);
>> +
>> + return 0;
>> +}
>> +
>> +static void qcom_l3_cache__event_del(struct perf_event *event, int
>> flags)
>> +{
>> + struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
>> + struct hw_perf_event *hwc = &event->hw;
>> + int sz;
>> +
>> + qcom_l3_cache__event_stop(event, flags | PERF_EF_UPDATE);
>> + sz = get_hw_counter_size(event);
>> + pmu->counters[hwc->idx].event = NULL;
>> + if (sz)
>> + pmu->counters[hwc->idx+1].event = NULL;
>
> As above, it's not clear to me if we need to touch this slot at all.
>

I will clean it up

> [...]
>
>> +#define L3CACHE_EVENT_VAR(__id) pmu_event_attr_##__id
>> +#define L3CACHE_EVENT_PTR(__id) (&L3CACHE_EVENT_VAR(__id).attr.attr)
>> +
>> +#define L3CACHE_EVENT_ATTR(__name, __id) \
>> + PMU_EVENT_ATTR(__name, L3CACHE_EVENT_VAR(__id), __id, \
>> + l3cache_pmu_event_sysfs_show)
>> +
>> +
>> +L3CACHE_EVENT_ATTR(cycles, L3_CYCLES);
>> +L3CACHE_EVENT_ATTR(read-hit, L3_READ_HIT);
>> +L3CACHE_EVENT_ATTR(read-miss, L3_READ_MISS);
>> +L3CACHE_EVENT_ATTR(read-hit-d-side, L3_READ_HIT_D);
>> +L3CACHE_EVENT_ATTR(read-miss-d-side, L3_READ_MISS_D);
>> +L3CACHE_EVENT_ATTR(write-hit, L3_WRITE_HIT);
>> +L3CACHE_EVENT_ATTR(write-miss, L3_WRITE_MISS);
>> +
>> +static struct attribute *qcom_l3_cache_pmu_events[] = {
>> + L3CACHE_EVENT_PTR(L3_CYCLES),
>> + L3CACHE_EVENT_PTR(L3_READ_HIT),
>> + L3CACHE_EVENT_PTR(L3_READ_MISS),
>> + L3CACHE_EVENT_PTR(L3_READ_HIT_D),
>> + L3CACHE_EVENT_PTR(L3_READ_MISS_D),
>> + L3CACHE_EVENT_PTR(L3_WRITE_HIT),
>> + L3CACHE_EVENT_PTR(L3_WRITE_MISS),
>> + NULL
>> +};
>
> Please follow the approach taken by drivers/perf/xgene_pmu.c's
> XGENE_PMU_EVENT_ATTR(), so that they can be definied in-place.
>

Will do.

>> +
>> +static struct attribute_group qcom_l3_cache_pmu_events_group = {
>> + .name = "events",
>> + .attrs = qcom_l3_cache_pmu_events,
>> +};
>> +
>> +PMU_FORMAT_ATTR(event, "config:0-7");
>> +PMU_FORMAT_ATTR(lc, "config:32");
>> +
>> +static struct attribute *qcom_l3_cache_pmu_formats[] = {
>> + &format_attr_event.attr,
>> + &format_attr_lc.attr,
>> + NULL,
>> +};
>
> Likewise, see XGENE_PMU_FORMAT_ATTR().

Will do.

Thanks for the thorough feedback, I will submit V4 ASAP,
Agustin.


[1] https://lkml.org/lkml/2017/1/27/954

--
Qualcomm Datacenter Technologies, Inc. on behalf of the Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
Linux Foundation Collaborative Project.

2017-03-06 17:35:59

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH V3] perf: qcom: Add L3 cache PMU driver

Hi,

On Mon, Mar 06, 2017 at 10:29:25AM -0500, Agustin Vega-Frias wrote:
> On 2017-03-03 09:50, Mark Rutland wrote:
> >Hi Augustin,

Apologies for the typo here, btw.

> >On Thu, Mar 02, 2017 at 03:58:32PM -0500, Agustin Vega-Frias wrote:
> >>The driver exports formatting and event information to sysfs so it can
> >>be used by the perf user space tools with the syntaxes:
> >> perf stat -a -e l3cache_0_0/read-miss/
> >> perf stat -a -e l3cache_0_0/event=0x21/
> >
> >As a high-level thing, while we can't do aggregation of counts across
> >slices, and while we'd have to reject cross-slice groups, we *could*
> >have a single struct PMU that covers all those slices, with userspace
> >selecting which slice an event targets via
> >perf_event_attr::config{,1,2}. i.e. we'd treat those as independent
> >sub-units under the PMU.
> >
> >With some sysfs and userspace work, it would then be possible to have
> >the perf tool infer automatically that it should open an event on each
> >sub-unit as it currently does per-cpu.
>
> Agreed, I will be looking at that change later, which can be seen as
> related Andi Kleen's work on uncore events [1].

Interesting; I had not spotted this series, and will have to look at it
in more detail.

> In my opinion having the separate perf PMUs fits more cleanly into that
> work, since there is currently no way to translate something like:
>
> l3cache/read-miss,slice=0xF/ # slices 0-3
>
> into the individual events:
>
> l3cache/read-miss,slice=0x1/ # slice 0
> l3cache/read-miss,slice=0x2/ # slice 1
> l3cache/read-miss,slice=0x4/ # slice 2
> l3cache/read-miss,slice=0x8/ # slice 3
>
> I'd prefer something like:
>
> l3cache_0_[0123]/read-miss/ # slices 0-3 on socket 0
> l3cache_0_*/read-miss/ # all slices on socket 0
>
> Thoughts?

At present, I don't have a strong feeling either way.

I'll have to consider this more deeply.

[...]

> >>+ *
> >>+ * The hardware also supports a feature that asserts the IRQ on
> >>the toggling
> >>+ * of the most significanty bit in the 32bit counter.
> >
> >Nit: s/significanty/significant/
> >
> >This is just an overflow interrupt, right?
> >
> >Or does this interrupt also trigger when bit 31 goes from 0 to 1?
> >
>
> Yes the interrupt also triggers when bit 31 goes from 0 to 1. This
> feature was added so we can avoid counter reprogramming and leave the
> counters as free running and still handle the deltas properly.

I see. That is somewhat surprising, so it might be worth having more
detail in the comment.

I guess that makes sense, though it's somewhat annoying that this means
this has to be different to every other PMU w.r.t. interrupt handling.
I'll have to try to recall the various races our usual approach avoids.

> >>+/*
> >>+ * Decoding of settings from perf_event_attr
> >>+ *
> >>+ * The config format for perf events is:
> >>+ * - config: bits 0-7: event type
> >>+ * bit 32: HW counter size requested, 0: 32 bits,
> >>1: 64 bits
> >>+ */
> >
> >Not really a problem, but is there a specific reason to avoid bits
> >31-8?
>
> No, just leaving some space in case the event types are expanded.

Ok. If you could drop in a comment to that effect, that would be great.

[...]

> >>+static inline int get_hw_counter_size(struct perf_event *event)
> >>+{
> >>+ return event->attr.config >> 32 & 1;
> >>+}
> >
> >This would be better as something like
> >
> >#define L3_EVENT_ATTR_LC_BIT 32
> >
> >static inline bool event_uses_long_counter(struct perf_event *event)
> >{
> > return !!(event->attr.config & BIT_ULL(L3_EVENT_ATTR_LC_BIT));
> >}

> I was implementing it this way so I can just directly use it as a
> parameter to bitmap_find_free_region,

Sure. I found this made those calls harder to read, since it wasn't
obvious what the size variable was exactly.

> but I can take a look to see how to make this cleaner.

Great. FWIW, I think it would be clearer to either have a wrapper of
bitmap_find_free_region that took a 'long' boolean parameter, or just
have a ternary when calling it to figure out the size parameter.

[...]

> >>+struct l3cache_pmu_hwc {
> >>+ struct perf_event *event;
> >>+ /* Called to start event monitoring */
> >>+ void (*start)(struct perf_event *event);
> >>+ /* Called to stop event monitoring */
> >>+ void (*stop)(struct perf_event *event, int flags);
> >>+ /* Called to update the perf_event */
> >>+ void (*update)(struct perf_event *event);
> >>+};
> >
> >Even if having separate ops makes handling chaining simpler, I don't
> >think we need a copy of all these per-event.
> >
> >We can instead have something like:
> >
> >struct l3cache_event_ops {
> > void (*start)(struct perf_event *event);
> > void (*stop)(struct perf_event *event, int flags);
> > void (*update)(struct perf_event *event);
> >};
> >
> >struct l3cache_event_ops event_ops_std;
> >struct l3cache_event_ops event_ops_long;
> >
> >static struct l3cache_event_ops *l3cache_event_get_ops(struct
> >perf_event *event)
> >{
> > if (event_uses_long_counter(event))
> > return &event_ops_long;
> > else
> > return &event_ops_std;
> >}
> >
> >... and callers the need the ops can consult
> >l3cache_event_get_ops(event).
>
> Understood. How about a single pointer to ops in the counter struct?
> E.g.:
>
> struct l3cache_pmu_hwc {
> struct perf_event *event;
> struct l3cache_event_ops *ops;
> };
>
> Then instead of the current:
> pmu->counters[hwc->idx].start(event);
> we have:
> pmu->counters[hwc->idx].ops->start(event);

I'd rather that we figured this out dynamically, since that way we don't
need a special l3cache_pmu_hwc type, and it's not possible for the ops
to becomes out of sync with the event somehow.

[...]

> >>+static void qcom_l3_cache__64bit_counter_start(struct
> >>perf_event *event)
> >>+{
> >>+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> >>+ int idx = event->hw.idx;
> >>+ u32 evsel = get_event_type(event);
> >>+ u64 evcnt = local64_read(&event->count);
> >>+ u32 gang = readl_relaxed(pmu->regs + L3_M_BC_GANG);
> >>+
> >>+ writel_relaxed(gang | GANG_EN(idx), pmu->regs + L3_M_BC_GANG);
> >
> >I take it this is what actually configures the chaining behaviour?
>
> Correct, that's the register name, I might just change GANG to CHAIN
> even if the HW guys yell at me ;o)

The name is fine as-is; it should match the HW naming.

It might be worth a comment, but that's about it.

[...]

> >>+ writel_relaxed(EVSEL(0), pmu->regs + L3_HML3_PM_EVTYPE(idx+1));
> >>+ writel_relaxed(EVSEL(evsel), pmu->regs + L3_HML3_PM_EVTYPE(idx));
> >>+
> >>+ writel_relaxed(PMCNTENSET(idx+1), pmu->regs + L3_M_BC_CNTENSET);
> >>+ writel_relaxed(PMCNTENSET(idx), pmu->regs + L3_M_BC_CNTENSET);
> >>+}
> >
> >IIUC, we're not enabling the interrupt here, is that correct?
> >
> >If that's deliberate, there should be a comment as to why.
>
> The IRQ is not needed because in essence we have a 64bit hardware
> counter which matches the software counter size. I will add a comment.

Ok. I take it that the (maximum potential) rate of increment is such
that this never feasibly overflows.

> >These are all relaxed, so what ensures the counter is actually started
> >when we return from this function?
>
> Given how the framework works at a high level, once all events are
> scheduled into the PMU there is a pmu->enable call, which will use a
> non-relaxed write to do the final enable. Am I missing something?

Good point; I failed to consider that.

[...]

> >>+}
> >>+
> >>+/*
> >>+ * 32 bit counter interface implementation
> >>+ */
> >>+
> >>+static void qcom_l3_cache__32bit_counter_start(struct
> >>perf_event *event)
> >>+{
> >>+ struct l3cache_pmu *pmu = to_l3cache_pmu(event->pmu);
> >>+ int idx = event->hw.idx;
> >>+ u32 evsel = get_event_type(event);
> >>+ u32 irqctl = readl_relaxed(pmu->regs + L3_M_BC_IRQCTL);
> >>+
> >>+ local64_set(&event->hw.prev_count, 0);
> >
> >Please follow the example set elsewhere, and start the counter at half
> >its max value (e.g. 0x80000000 here). So long as the IRQ is taken
> >before
> >another 2^31 events occur, that caters for extreme IRQ latency, and the
> >IRQ handler doesn't have to treat the overflow as an implicit extra
> >counter bit.
>
> We can start at zero given that we enable the feature that asserts the
> IRQ on toggle of the MSB. In essence we are doing the same thing other
> drivers do with hardware support that obviates the need to adjust the
> counter on every overflow. I will add a block comment to explain this.

Ok. I think that's ok, but as above I'll need to consider this a bit
more.

[...]

> >>+static irqreturn_t qcom_l3_cache__handle_irq(int irq_num, void *data)
> >>+{
> >>+ struct l3cache_pmu *pmu = data;
> >>+ u32 status = readl_relaxed(pmu->regs + L3_M_BC_OVSR);
> >>+ int idx;
> >>+
> >>+ if (status == 0)
> >>+ return IRQ_NONE;
> >>+
> >>+ writel_relaxed(status, pmu->regs + L3_M_BC_OVSR);
> >
> >I take it this clears the overflow status bits? i.e. it's write to
> >clear?
> >
> >It would be worth a comment.
> >
>
> Will do
>
> >>+ while (status) {
> >>+ struct perf_event *event;
> >>+
> >>+ idx = __ffs(status);
> >>+ status &= ~(1 << idx);
> >
> >It would be better to have status in an unsigned long, and use
> >for_each_set_bit() over that.
>
> Will do, however I have seen that the AARCH64 compiler can optimize
> better
> if we use __ffs directly. It might be the version that I'm using, I will
> double-check.

I'd prefer the for_each_set_bit() approach regardless. It's clearer, and
I can't imagine that it has significant overhead here.

[...]

> >>+static void qcom_l3_cache__pmu_disable(struct pmu *perf_pmu)
> >>+{
> >>+ struct l3cache_pmu *pmu = to_l3cache_pmu(perf_pmu);
> >>+
> >>+ writel_relaxed(0, pmu->regs + L3_M_BC_CR);
> >>+
> >>+ /* Ensure the basic counter unit is stopped before proceeding */
> >>+ wmb();
> >
> >You presumably want likewise on the enable path, before enabling the
> >PMU.
>
> I am doing that, I am using writel on the last operation of pmu_enable,

Ah, true.

> I will make it a separate barrier to be more explicit.

It would also be fine to drop a comment in the pmu_enable() use of
writel(); either way is fine.

Thanks,
Mark.