2018-04-25 09:03:22

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: [PATCH v4 0/2] Add ThunderX2 SoC Performance Monitoring Unit driver

This patchset adds PMU driver for Cavium's ThunderX2 SoC UNCORE devices.
The SoC has PMU support in L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

v4:
-Incroporated review comments from Mark Rutland[1]

[1] https://www.spinics.net/lists/arm-kernel/msg588563.html

v3:
- fixed warning reported by kbuild robot

v2:
- rebased to 4.12-rc1
- Removed Arch VULCAN dependency.
- update SMC call parameters as per latest firmware.

v1:
-Initial patch

Ganapatrao Kulkarni (2):
perf: uncore: Adding documentation for ThunderX2 pmu uncore driver
ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Documentation/perf/thunderx2-pmu.txt | 66 +++
drivers/perf/Kconfig | 8 +
drivers/perf/Makefile | 1 +
drivers/perf/thunderx2_pmu.c | 958 +++++++++++++++++++++++++++++++++++
4 files changed, 1033 insertions(+)
create mode 100644 Documentation/perf/thunderx2-pmu.txt
create mode 100644 drivers/perf/thunderx2_pmu.c

--
2.9.4



2018-04-25 09:03:38

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

This patch adds a perf driver for the PMU UNCORE devices DDR4 Memory
Controller(DMC) and Level 3 Cache(L3C).

ThunderX2 has 8 independent DMC PMUs to capture performance events
corresponding to 8 channels of DDR4 Memory Controller and 16 independent
L3C PMUs to capture events corresponding to 16 tiles of L3 cache.
Each PMU supports up to 4 counters. All counters lack overflow interrupt
and are sampled periodically.

Signed-off-by: Ganapatrao Kulkarni <[email protected]>
---
drivers/perf/Kconfig | 8 +
drivers/perf/Makefile | 1 +
drivers/perf/thunderx2_pmu.c | 958 +++++++++++++++++++++++++++++++++++++++++++
3 files changed, 967 insertions(+)
create mode 100644 drivers/perf/thunderx2_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 28bb5a0..eafd0fc 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -85,6 +85,14 @@ config QCOM_L3_PMU
Adds the L3 cache PMU into the perf events subsystem for
monitoring L3 cache events.

+config THUNDERX2_PMU
+ bool "Cavium ThunderX2 SoC PMU UNCORE"
+ depends on ARCH_THUNDER2 && PERF_EVENTS && ACPI
+ help
+ Provides support for ThunderX2 UNCORE events.
+ The SoC has PMU support in its L3 cache controller (L3C) and
+ in the DDR4 Memory Controller (DMC).
+
config XGENE_PMU
depends on ARCH_XGENE
bool "APM X-Gene SoC PMU"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index b3902bd..909f27f 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -7,5 +7,6 @@ obj-$(CONFIG_ARM_PMU_ACPI) += arm_pmu_acpi.o
obj-$(CONFIG_HISI_PMU) += hisilicon/
obj-$(CONFIG_QCOM_L2_PMU) += qcom_l2_pmu.o
obj-$(CONFIG_QCOM_L3_PMU) += qcom_l3_pmu.o
+obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
diff --git a/drivers/perf/thunderx2_pmu.c b/drivers/perf/thunderx2_pmu.c
new file mode 100644
index 0000000..92d19b5
--- /dev/null
+++ b/drivers/perf/thunderx2_pmu.c
@@ -0,0 +1,958 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CAVIUM THUNDERX2 SoC PMU UNCORE
+ *
+ * Copyright (C) 2018 Cavium Inc.
+ * Author: Ganapatrao Kulkarni <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include <linux/acpi.h>
+#include <linux/arm-smccc.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+
+/* L3c and DMC has 16 and 8 channels per socket respectively.
+ * Each Channel supports UNCORE PMU device and consists of
+ * 4 independent programmable counters. Counters are 32 bit
+ * and does not support overflow interrupt, they needs to be
+ * sampled before overflow(i.e, at every 2 seconds).
+ */
+
+#define UNCORE_MAX_COUNTERS 4
+#define UNCORE_L3_MAX_TILES 16
+#define UNCORE_DMC_MAX_CHANNELS 8
+
+#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)
+#define GET_EVENTID(ev) ((ev->hw.config) & 0x1ff)
+#define GET_COUNTERID(ev) ((ev->hw.idx) & 0xf)
+#define GET_CHANNELID(pmu_uncore) (pmu_uncore->channel)
+
+#define DMC_COUNTER_CTL 0x234
+#define DMC_COUNTER_DATA 0x240
+#define L3C_COUNTER_CTL 0xA8
+#define L3C_COUNTER_DATA 0xAC
+
+#define THUNDERX2_SMC_CALL_ID 0xC200FF00
+#define THUNDERX2_SMC_SET_CHANNEL 0xB010
+
+
+enum thunderx2_uncore_l3_events {
+ L3_EVENT_NONE,
+ L3_EVENT_NBU_CANCEL,
+ L3_EVENT_DIB_RETRY,
+ L3_EVENT_DOB_RETRY,
+ L3_EVENT_DIB_CREDIT_RETRY,
+ L3_EVENT_DOB_CREDIT_RETRY,
+ L3_EVENT_FORCE_RETRY,
+ L3_EVENT_IDX_CONFLICT_RETRY,
+ L3_EVENT_EVICT_CONFLICT_RETRY,
+ L3_EVENT_BANK_CONFLICT_RETRY,
+ L3_EVENT_FILL_ENTRY_RETRY,
+ L3_EVENT_EVICT_NOT_READY_RETRY,
+ L3_EVENT_L3_RETRY,
+ L3_EVENT_READ_REQ,
+ L3_EVENT_WRITE_BACK_REQ,
+ L3_EVENT_INVALIDATE_NWRITE_REQ,
+ L3_EVENT_INV_REQ,
+ L3_EVENT_SELF_REQ,
+ L3_EVENT_REQ,
+ L3_EVENT_EVICT_REQ,
+ L3_EVENT_INVALIDATE_NWRITE_HIT,
+ L3_EVENT_INVALIDATE_HIT,
+ L3_EVENT_SELF_HIT,
+ L3_EVENT_READ_HIT,
+ L3_EVENT_MAX,
+};
+
+enum thunderx2_uncore_dmc_events {
+ DMC_EVENT_NONE,
+ DMC_EVENT_COUNT_CYCLES,
+ DMC_EVENT_RES2,
+ DMC_EVENT_RES3,
+ DMC_EVENT_RES4,
+ DMC_EVENT_RES5,
+ DMC_EVENT_RES6,
+ DMC_EVENT_RES7,
+ DMC_EVENT_RES8,
+ DMC_EVENT_READ_64B_TXNS,
+ DMC_EVENT_READ_BELOW_64B_TXNS,
+ DMC_EVENT_WRITE_TXNS,
+ DMC_EVENT_TXN_CYCLES,
+ DMC_EVENT_DATA_TRANSFERS,
+ DMC_EVENT_CANCELLED_READ_TXNS,
+ DMC_EVENT_CONSUMED_READ_TXNS,
+ DMC_EVENT_MAX,
+};
+
+enum thunderx2_uncore_type {
+ PMU_TYPE_L3C,
+ PMU_TYPE_DMC,
+ PMU_TYPE_INVALID,
+};
+
+struct active_timer {
+ struct perf_event *event;
+ struct hrtimer hrtimer;
+};
+
+/*
+ * pmu on each socket has 2 uncore devices(dmc and l3),
+ * each uncore device has up to 16 channels, each channel can sample
+ * events independently with counters up to 4.
+ *
+ * struct thunderx2_pmu_uncore_channel created per channel.
+ * struct thunderx2_pmu_uncore_dev per uncore device.
+ */
+struct thunderx2_pmu_uncore_channel {
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+ struct pmu pmu;
+ int counter;
+ int channel;
+ DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
+ struct active_timer *active_timers;
+ /* to sync counter alloc/release */
+ raw_spinlock_t lock;
+};
+
+struct thunderx2_pmu_uncore_dev {
+ char *name;
+ struct device *dev;
+ enum thunderx2_uncore_type type;
+ unsigned long base;
+ int node;
+ u32 max_counters;
+ u32 max_channels;
+ u32 max_events;
+ u64 hrtimer_interval;
+ /* this lock synchronizes across channels */
+ raw_spinlock_t lock;
+ const struct attribute_group **attr_groups;
+ void (*init_cntr_base)(struct perf_event *event,
+ struct thunderx2_pmu_uncore_dev *uncore_dev);
+ void (*select_channel)(struct perf_event *event);
+ void (*stop_event)(struct perf_event *event);
+ void (*start_event)(struct perf_event *event, int flags);
+};
+
+static inline struct thunderx2_pmu_uncore_channel *
+pmu_to_thunderx2_pmu_uncore(struct pmu *pmu)
+{
+ return container_of(pmu, struct thunderx2_pmu_uncore_channel, pmu);
+}
+
+/*
+ * sysfs format attributes
+ */
+static ssize_t thunderx2_pmu_format_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_ext_attribute *eattr;
+
+ eattr = container_of(attr, struct dev_ext_attribute, attr);
+ return sprintf(buf, "%s\n", (char *) eattr->var);
+}
+
+#define FORMAT_ATTR(_name, _config) \
+ (&((struct dev_ext_attribute[]) { \
+ { \
+ .attr = __ATTR(_name, 0444, thunderx2_pmu_format_show, NULL), \
+ .var = (void *) _config, \
+ } \
+ })[0].attr.attr)
+
+static struct attribute *l3c_pmu_format_attrs[] = {
+ FORMAT_ATTR(event, "config:0-4"),
+ NULL,
+};
+
+static struct attribute *dmc_pmu_format_attrs[] = {
+ FORMAT_ATTR(event, "config:0-4"),
+ NULL,
+};
+
+static const struct attribute_group l3c_pmu_format_attr_group = {
+ .name = "format",
+ .attrs = l3c_pmu_format_attrs,
+};
+
+static const struct attribute_group dmc_pmu_format_attr_group = {
+ .name = "format",
+ .attrs = dmc_pmu_format_attrs,
+};
+
+/*
+ * sysfs event attributes
+ */
+static ssize_t thunderx2_pmu_event_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dev_ext_attribute *eattr;
+
+ eattr = container_of(attr, struct dev_ext_attribute, attr);
+ return sprintf(buf, "config=0x%lx\n", (unsigned long) eattr->var);
+}
+
+#define EVENT_ATTR(_name, _config) \
+ (&((struct dev_ext_attribute[]) { \
+ { \
+ .attr = __ATTR(_name, 0444, thunderx2_pmu_event_show, NULL), \
+ .var = (void *) _config, \
+ } \
+ })[0].attr.attr)
+
+static struct attribute *l3c_pmu_events_attrs[] = {
+ EVENT_ATTR(nbu_cancel, L3_EVENT_NBU_CANCEL),
+ EVENT_ATTR(dib_retry, L3_EVENT_DIB_RETRY),
+ EVENT_ATTR(dob_retry, L3_EVENT_DOB_RETRY),
+ EVENT_ATTR(dib_credit_retry, L3_EVENT_DIB_CREDIT_RETRY),
+ EVENT_ATTR(dob_credit_retry, L3_EVENT_DOB_CREDIT_RETRY),
+ EVENT_ATTR(force_retry, L3_EVENT_FORCE_RETRY),
+ EVENT_ATTR(idx_conflict_retry, L3_EVENT_IDX_CONFLICT_RETRY),
+ EVENT_ATTR(evict_conflict_retry, L3_EVENT_EVICT_CONFLICT_RETRY),
+ EVENT_ATTR(bank_conflict_retry, L3_EVENT_BANK_CONFLICT_RETRY),
+ EVENT_ATTR(fill_entry_retry, L3_EVENT_FILL_ENTRY_RETRY),
+ EVENT_ATTR(evict_not_ready_retry, L3_EVENT_EVICT_NOT_READY_RETRY),
+ EVENT_ATTR(l3_retry, L3_EVENT_L3_RETRY),
+ EVENT_ATTR(read_request, L3_EVENT_READ_REQ),
+ EVENT_ATTR(write_back_request, L3_EVENT_WRITE_BACK_REQ),
+ EVENT_ATTR(inv_nwrite_request, L3_EVENT_INVALIDATE_NWRITE_REQ),
+ EVENT_ATTR(inv_request, L3_EVENT_INV_REQ),
+ EVENT_ATTR(self_request, L3_EVENT_SELF_REQ),
+ EVENT_ATTR(request, L3_EVENT_REQ),
+ EVENT_ATTR(evict_request, L3_EVENT_EVICT_REQ),
+ EVENT_ATTR(inv_nwrite_hit, L3_EVENT_INVALIDATE_NWRITE_HIT),
+ EVENT_ATTR(inv_hit, L3_EVENT_INVALIDATE_HIT),
+ EVENT_ATTR(self_hit, L3_EVENT_SELF_HIT),
+ EVENT_ATTR(read_hit, L3_EVENT_READ_HIT),
+ NULL,
+};
+
+static struct attribute *dmc_pmu_events_attrs[] = {
+ EVENT_ATTR(cnt_cycles, DMC_EVENT_COUNT_CYCLES),
+ EVENT_ATTR(read_64b_txns, DMC_EVENT_READ_64B_TXNS),
+ EVENT_ATTR(read_below_64b_txns, DMC_EVENT_READ_BELOW_64B_TXNS),
+ EVENT_ATTR(write_txns, DMC_EVENT_WRITE_TXNS),
+ EVENT_ATTR(txn_cycles, DMC_EVENT_TXN_CYCLES),
+ EVENT_ATTR(data_transfers, DMC_EVENT_DATA_TRANSFERS),
+ EVENT_ATTR(cancelled_read_txns, DMC_EVENT_CANCELLED_READ_TXNS),
+ EVENT_ATTR(consumed_read_txns, DMC_EVENT_CONSUMED_READ_TXNS),
+ NULL,
+};
+
+static const struct attribute_group l3c_pmu_events_attr_group = {
+ .name = "events",
+ .attrs = l3c_pmu_events_attrs,
+};
+
+static const struct attribute_group dmc_pmu_events_attr_group = {
+ .name = "events",
+ .attrs = dmc_pmu_events_attrs,
+};
+
+/*
+ * sysfs cpumask attributes
+ */
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct cpumask cpu_mask;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore =
+ pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
+
+ /* Pick first online cpu from the node */
+ cpumask_clear(&cpu_mask);
+ cpumask_set_cpu(cpumask_first(
+ cpumask_of_node(pmu_uncore->uncore_dev->node)),
+ &cpu_mask);
+
+ return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
+}
+
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *thunderx2_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL,
+};
+
+static const struct attribute_group pmu_cpumask_attr_group = {
+ .attrs = thunderx2_pmu_cpumask_attrs,
+};
+
+/*
+ * Per PMU device attribute groups
+ */
+static const struct attribute_group *l3c_pmu_attr_groups[] = {
+ &l3c_pmu_format_attr_group,
+ &pmu_cpumask_attr_group,
+ &l3c_pmu_events_attr_group,
+ NULL
+};
+
+static const struct attribute_group *dmc_pmu_attr_groups[] = {
+ &dmc_pmu_format_attr_group,
+ &pmu_cpumask_attr_group,
+ &dmc_pmu_events_attr_group,
+ NULL
+};
+
+static inline struct active_timer *get_active_timer(struct hrtimer *hrt)
+{
+ return container_of(hrt, struct active_timer, hrtimer);
+}
+
+static inline u32 reg_readl(unsigned long addr)
+{
+ return readl((void __iomem *)addr);
+}
+
+static inline void reg_writel(u32 val, unsigned long addr)
+{
+ writel(val, (void __iomem *)addr);
+}
+
+static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
+{
+ int counter;
+
+ raw_spin_lock(&pmu_uncore->lock);
+ counter = find_first_zero_bit(pmu_uncore->counter_mask,
+ pmu_uncore->uncore_dev->max_counters);
+ if (counter == pmu_uncore->uncore_dev->max_counters) {
+ raw_spin_unlock(&pmu_uncore->lock);
+ return -ENOSPC;
+ }
+ set_bit(counter, pmu_uncore->counter_mask);
+ raw_spin_unlock(&pmu_uncore->lock);
+ return counter;
+}
+
+static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
+ int counter)
+{
+ raw_spin_lock(&pmu_uncore->lock);
+ clear_bit(counter, pmu_uncore->counter_mask);
+ raw_spin_unlock(&pmu_uncore->lock);
+}
+
+/*
+ * DMC and L3 counter interface is muxed across all channels.
+ * hence we need to select the channel before accessing counter
+ * data/control registers.
+ *
+ * L3 Tile and DMC channel selection is through SMC call
+ * SMC call arguments,
+ * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
+ * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
+ * x2 = Node id
+ * x3 = DMC(1)/L3C(0)
+ * x4 = channel Id
+ */
+static void uncore_select_channel(struct perf_event *event)
+{
+ struct arm_smccc_res res;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore =
+ pmu_to_thunderx2_pmu_uncore(event->pmu);
+
+ arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
+ pmu_uncore->uncore_dev->node,
+ pmu_uncore->uncore_dev->type,
+ pmu_uncore->channel, 0, 0, 0, &res);
+
+}
+
+static void uncore_start_event_l3c(struct perf_event *event, int flags)
+{
+ u32 val;
+ struct hw_perf_event *hwc = &event->hw;
+
+ /* event id encoded in bits [07:03] */
+ val = GET_EVENTID(event) << 3;
+ reg_writel(val, hwc->config_base);
+
+ if (flags & PERF_EF_RELOAD) {
+ u64 prev_raw_count =
+ local64_read(&event->hw.prev_count);
+ reg_writel(prev_raw_count, hwc->event_base);
+ }
+ local64_set(&event->hw.prev_count,
+ reg_readl(hwc->event_base));
+}
+
+static void uncore_start_event_dmc(struct perf_event *event, int flags)
+{
+ u32 val, event_shift = 8;
+ struct hw_perf_event *hwc = &event->hw;
+
+ /* enable and start counters and read current value in prev_count */
+ val = reg_readl(hwc->config_base);
+
+ /* 8 bits for each counter,
+ * bits [05:01] of a counter to set event type.
+ */
+ reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) *
+ event_shift) + 1))) |
+ (GET_EVENTID(event) <<
+ (((GET_COUNTERID(event)) * event_shift) + 1)),
+ hwc->config_base);
+
+ if (flags & PERF_EF_RELOAD) {
+ u64 prev_raw_count =
+ local64_read(&event->hw.prev_count);
+ reg_writel(prev_raw_count, hwc->event_base);
+ }
+ local64_set(&event->hw.prev_count,
+ reg_readl(hwc->event_base));
+}
+
+static void uncore_stop_event_l3c(struct perf_event *event)
+{
+ reg_writel(0, event->hw.config_base);
+}
+
+static void uncore_stop_event_dmc(struct perf_event *event)
+{
+ u32 val, event_shift = 8;
+ struct hw_perf_event *hwc = &event->hw;
+
+ val = reg_readl(hwc->config_base);
+ reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))),
+ hwc->config_base);
+}
+
+static void init_cntr_base_l3c(struct perf_event *event,
+ struct thunderx2_pmu_uncore_dev *uncore_dev)
+{
+
+ struct hw_perf_event *hwc = &event->hw;
+
+ /* counter ctrl/data reg offset at 8 */
+ hwc->config_base = uncore_dev->base
+ + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
+ hwc->event_base = uncore_dev->base
+ + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event));
+}
+
+static void init_cntr_base_dmc(struct perf_event *event,
+ struct thunderx2_pmu_uncore_dev *uncore_dev)
+{
+
+ struct hw_perf_event *hwc = &event->hw;
+
+ hwc->config_base = uncore_dev->base
+ + DMC_COUNTER_CTL;
+ /* counter data reg offset at 0xc */
+ hwc->event_base = uncore_dev->base
+ + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event));
+}
+
+static void thunderx2_uncore_update(struct perf_event *event)
+{
+ s64 prev, new = 0;
+ u64 delta;
+ struct hw_perf_event *hwc = &event->hw;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ enum thunderx2_uncore_type type;
+
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+ type = pmu_uncore->uncore_dev->type;
+
+ if (pmu_uncore->uncore_dev->select_channel)
+ pmu_uncore->uncore_dev->select_channel(event);
+
+ new = reg_readl(hwc->event_base);
+ prev = local64_xchg(&hwc->prev_count, new);
+
+ /* handles rollover of 32 bit counter */
+ delta = (u32)(((1UL << 32) - prev) + new);
+
+ local64_add(delta, &event->count);
+}
+
+enum thunderx2_uncore_type get_uncore_device_type(struct acpi_device *adev)
+{
+ int i = 0;
+ struct acpi_uncore_device {
+ __u8 id[ACPI_ID_LEN];
+ enum thunderx2_uncore_type type;
+ } devices[] = {
+ {"CAV901D", PMU_TYPE_L3C},
+ {"CAV901F", PMU_TYPE_DMC},
+ {"", PMU_TYPE_INVALID},
+ };
+
+ while (devices[i].type != PMU_TYPE_INVALID) {
+ if (!strcmp(acpi_device_hid(adev), devices[i].id))
+ return devices[i].type;
+ i++;
+ }
+ return PMU_TYPE_INVALID;
+}
+
+/*
+ * We must NOT create groups containing events from multiple hardware PMUs,
+ * although mixing different software and hardware PMUs is allowed.
+ */
+static bool thunderx2_uncore_validate_event_group(struct perf_event *event)
+{
+ struct pmu *pmu = event->pmu;
+ struct perf_event *leader = event->group_leader;
+ struct perf_event *sibling;
+ int counters = 0;
+
+ if (leader->pmu != event->pmu && !is_software_event(leader))
+ return false;
+
+ counters++;
+
+ for_each_sibling_event(sibling, event->group_leader) {
+ if (is_software_event(sibling))
+ continue;
+ if (sibling->pmu != pmu)
+ return false;
+ counters++;
+ }
+
+ /*
+ * If the group requires more counters than the HW has,
+ * it cannot ever be scheduled.
+ */
+ return counters <= UNCORE_MAX_COUNTERS;
+}
+
+static int thunderx2_uncore_event_init(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+
+ /* Test the event attr type check for PMU enumeration */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /*
+ * SOC PMU counters are shared across all cores.
+ * Therefore, it does not support per-process mode.
+ * Also, it does not support event sampling mode.
+ */
+ if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK)
+ return -EINVAL;
+
+ /* SOC counters do not have usr/os/guest/host bits */
+ if (event->attr.exclude_user || event->attr.exclude_kernel ||
+ event->attr.exclude_host || event->attr.exclude_guest)
+ return -EINVAL;
+
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+
+ if (!pmu_uncore)
+ return -ENODEV;
+
+ /* Pick first online cpu from the node */
+ event->cpu = cpumask_first(
+ cpumask_of_node(pmu_uncore->uncore_dev->node));
+
+ if (event->cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (event->attr.config >= pmu_uncore->uncore_dev->max_events)
+ return -EINVAL;
+
+ /* store event id */
+ hwc->config = event->attr.config;
+
+ /* Validate the group */
+ if (!thunderx2_uncore_validate_event_group(event))
+ return -EINVAL;
+
+ return 0;
+}
+
+static void thunderx2_uncore_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+ unsigned long irqflags;
+ struct active_timer *active_timer;
+
+ hwc->state = 0;
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+ uncore_dev = pmu_uncore->uncore_dev;
+
+ raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
+
+ if (uncore_dev->select_channel)
+ uncore_dev->select_channel(event);
+ uncore_dev->start_event(event, flags);
+ raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
+
+ perf_event_update_userpage(event);
+ active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)];
+ active_timer->event = event;
+
+ hrtimer_start(&active_timer->hrtimer,
+ ns_to_ktime(uncore_dev->hrtimer_interval),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void thunderx2_uncore_stop(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+ unsigned long irqflags;
+
+ if (hwc->state & PERF_HES_UPTODATE)
+ return;
+
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+ uncore_dev = pmu_uncore->uncore_dev;
+
+ raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
+
+ if (uncore_dev->select_channel)
+ uncore_dev->select_channel(event);
+ uncore_dev->stop_event(event);
+
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ if ((flags & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ thunderx2_uncore_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+ raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
+
+ hrtimer_cancel(
+ &pmu_uncore->active_timers[GET_COUNTERID(event)].hrtimer);
+ pmu_uncore->active_timers[GET_COUNTERID(event)].event = NULL;
+}
+
+static int thunderx2_uncore_add(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+ uncore_dev = pmu_uncore->uncore_dev;
+
+ /* Allocate a free counter */
+ hwc->idx = alloc_counter(pmu_uncore);
+ if (hwc->idx < 0)
+ return -EAGAIN;
+
+ /* set counter control and data registers base address */
+ uncore_dev->init_cntr_base(event, uncore_dev);
+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+ if (flags & PERF_EF_START)
+ thunderx2_uncore_start(event, PERF_EF_RELOAD);
+
+ return 0;
+}
+
+static void thunderx2_uncore_del(struct perf_event *event, int flags)
+{
+ struct thunderx2_pmu_uncore_channel *pmu_uncore =
+ pmu_to_thunderx2_pmu_uncore(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+
+ thunderx2_uncore_stop(event, PERF_EF_UPDATE);
+
+ /* clear the assigned counter */
+ free_counter(pmu_uncore, GET_COUNTERID(event));
+
+ perf_event_update_userpage(event);
+ hwc->idx = -1;
+}
+
+static void thunderx2_uncore_read(struct perf_event *event)
+{
+ unsigned long irqflags;
+ struct thunderx2_pmu_uncore_channel *pmu_uncore =
+ pmu_to_thunderx2_pmu_uncore(event->pmu);
+
+ raw_spin_lock_irqsave(&pmu_uncore->uncore_dev->lock, irqflags);
+ thunderx2_uncore_update(event);
+ raw_spin_unlock_irqrestore(&pmu_uncore->uncore_dev->lock, irqflags);
+}
+
+static enum hrtimer_restart thunderx2_uncore_hrtimer_callback(
+ struct hrtimer *hrt)
+{
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ struct perf_event *event;
+ unsigned long irqflags;
+ struct active_timer *active_timer;
+
+ active_timer = get_active_timer(hrt);
+ event = active_timer->event;
+
+ pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
+
+ raw_spin_lock_irqsave(&pmu_uncore->uncore_dev->lock, irqflags);
+ thunderx2_uncore_update(event);
+ raw_spin_unlock_irqrestore(&pmu_uncore->uncore_dev->lock, irqflags);
+
+ hrtimer_forward_now(hrt,
+ ns_to_ktime(
+ pmu_uncore->uncore_dev->hrtimer_interval));
+ return HRTIMER_RESTART;
+}
+
+static int thunderx2_pmu_uncore_register(
+ struct thunderx2_pmu_uncore_channel *pmu_uncore)
+{
+ struct device *dev = pmu_uncore->uncore_dev->dev;
+ char *name = pmu_uncore->uncore_dev->name;
+ int channel = pmu_uncore->channel;
+
+ /* Perf event registration */
+ pmu_uncore->pmu = (struct pmu) {
+ .attr_groups = pmu_uncore->uncore_dev->attr_groups,
+ .task_ctx_nr = perf_invalid_context,
+ .event_init = thunderx2_uncore_event_init,
+ .add = thunderx2_uncore_add,
+ .del = thunderx2_uncore_del,
+ .start = thunderx2_uncore_start,
+ .stop = thunderx2_uncore_stop,
+ .read = thunderx2_uncore_read,
+ };
+
+ pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL,
+ "%s_%d", name, channel);
+
+ return perf_pmu_register(&pmu_uncore->pmu, pmu_uncore->pmu.name, -1);
+}
+
+static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev,
+ int channel)
+{
+ struct thunderx2_pmu_uncore_channel *pmu_uncore;
+ int ret;
+ int counter;
+
+ pmu_uncore = devm_kzalloc(uncore_dev->dev, sizeof(*pmu_uncore),
+ GFP_KERNEL);
+ if (!pmu_uncore)
+ return -ENOMEM;
+
+ pmu_uncore->uncore_dev = uncore_dev;
+ pmu_uncore->channel = channel;
+
+ /* we can run up to (max_counters * max_channels) events
+ * simultaneously, allocate hrtimers per channel.
+ */
+ pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev,
+ sizeof(struct active_timer) * uncore_dev->max_counters,
+ GFP_KERNEL);
+
+ for (counter = 0; counter < uncore_dev->max_counters; counter++) {
+ hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer,
+ CLOCK_MONOTONIC,
+ HRTIMER_MODE_REL);
+ pmu_uncore->active_timers[counter].hrtimer.function =
+ thunderx2_uncore_hrtimer_callback;
+ }
+
+ ret = thunderx2_pmu_uncore_register(pmu_uncore);
+ if (ret) {
+ dev_err(uncore_dev->dev, "%s PMU: Failed to init driver\n",
+ uncore_dev->name);
+ return -ENODEV;
+ }
+
+ dev_dbg(uncore_dev->dev, "%s PMU UNCORE registered\n",
+ pmu_uncore->pmu.name);
+ return ret;
+}
+
+static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev(
+ struct device *dev, acpi_handle handle,
+ struct acpi_device *adev, u32 type)
+{
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+ unsigned long base;
+ struct resource res;
+ struct resource_entry *rentry;
+ struct list_head list;
+ int ret;
+
+ INIT_LIST_HEAD(&list);
+ ret = acpi_dev_get_resources(adev, &list, NULL, NULL);
+ if (ret <= 0) {
+ dev_err(dev, "failed to parse _CRS method, error %d\n", ret);
+ return NULL;
+ }
+
+ list_for_each_entry(rentry, &list, node) {
+ if (resource_type(rentry->res) == IORESOURCE_MEM) {
+ res = *rentry->res;
+ break;
+ }
+ }
+
+ if (!rentry->res)
+ return NULL;
+
+ acpi_dev_free_resource_list(&list);
+
+ base = (unsigned long)devm_ioremap_resource(dev, &res);
+ if (IS_ERR((void *)base)) {
+ dev_err(dev, "PMU type %d: Fail to map resource\n", type);
+ return NULL;
+ }
+
+ uncore_dev = devm_kzalloc(dev, sizeof(*uncore_dev), GFP_KERNEL);
+ if (!uncore_dev)
+ return NULL;
+
+ uncore_dev->dev = dev;
+ uncore_dev->type = type;
+ uncore_dev->base = base;
+ uncore_dev->node = dev_to_node(dev);
+
+ raw_spin_lock_init(&uncore_dev->lock);
+
+ switch (uncore_dev->type) {
+ case PMU_TYPE_L3C:
+ uncore_dev->max_counters = UNCORE_MAX_COUNTERS;
+ uncore_dev->max_channels = UNCORE_L3_MAX_TILES;
+ uncore_dev->max_events = L3_EVENT_MAX;
+ uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL;
+ uncore_dev->attr_groups = l3c_pmu_attr_groups;
+ uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL,
+ "uncore_l3c_%d", uncore_dev->node);
+ uncore_dev->init_cntr_base = init_cntr_base_l3c;
+ uncore_dev->start_event = uncore_start_event_l3c;
+ uncore_dev->stop_event = uncore_stop_event_l3c;
+ uncore_dev->select_channel = uncore_select_channel;
+ break;
+ case PMU_TYPE_DMC:
+ uncore_dev->max_counters = UNCORE_MAX_COUNTERS;
+ uncore_dev->max_channels = UNCORE_DMC_MAX_CHANNELS;
+ uncore_dev->max_events = DMC_EVENT_MAX;
+ uncore_dev->hrtimer_interval = UNCORE_HRTIMER_INTERVAL;
+ uncore_dev->attr_groups = dmc_pmu_attr_groups;
+ uncore_dev->name = devm_kasprintf(dev, GFP_KERNEL,
+ "uncore_dmc_%d", uncore_dev->node);
+ uncore_dev->init_cntr_base = init_cntr_base_dmc;
+ uncore_dev->start_event = uncore_start_event_dmc;
+ uncore_dev->stop_event = uncore_stop_event_dmc;
+ uncore_dev->select_channel = uncore_select_channel;
+ break;
+ case PMU_TYPE_INVALID:
+ devm_kfree(dev, uncore_dev);
+ uncore_dev = NULL;
+ break;
+ }
+
+ return uncore_dev;
+}
+
+static acpi_status thunderx2_pmu_uncore_dev_add(acpi_handle handle, u32 level,
+ void *data, void **return_value)
+{
+ struct thunderx2_pmu_uncore_dev *uncore_dev;
+ struct acpi_device *adev;
+ enum thunderx2_uncore_type type;
+ int channel;
+
+ if (acpi_bus_get_device(handle, &adev))
+ return AE_OK;
+ if (acpi_bus_get_status(adev) || !adev->status.present)
+ return AE_OK;
+
+ type = get_uncore_device_type(adev);
+ if (type == PMU_TYPE_INVALID)
+ return AE_OK;
+
+ uncore_dev = init_pmu_uncore_dev((struct device *)data, handle,
+ adev, type);
+
+ if (!uncore_dev)
+ return AE_ERROR;
+
+ for (channel = 0; channel < uncore_dev->max_channels; channel++) {
+ if (thunderx2_pmu_uncore_add(uncore_dev, channel)) {
+ /* Can't add the PMU device, abort */
+ return AE_ERROR;
+ }
+ }
+ return AE_OK;
+}
+
+static int thunderx2_uncore_dev_probe(struct device *dev)
+{
+ acpi_handle handle;
+ acpi_status status;
+
+ if (!has_acpi_companion(dev))
+ return -ENODEV;
+
+ handle = ACPI_HANDLE(dev);
+ if (!handle)
+ return -EINVAL;
+
+ status = acpi_walk_namespace(ACPI_TYPE_DEVICE, handle, 1,
+ thunderx2_pmu_uncore_dev_add,
+ NULL, dev, NULL);
+ if (ACPI_FAILURE(status)) {
+ dev_err(dev, "failed to probe PMU devices\n");
+ return_ACPI_STATUS(status);
+ }
+
+ dev_info(dev, "node%d: pmu uncore registered\n", dev_to_node(dev));
+ return 0;
+}
+
+static const struct acpi_device_id thunderx2_uncore_acpi_match[] = {
+ {"CAV901C", 0},
+ {},
+};
+MODULE_DEVICE_TABLE(acpi, thunderx2_uncore_acpi_match);
+
+static int thunderx2_uncore_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct arm_smccc_res res;
+
+ set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev)));
+
+ /* Make sure firmware supports DMC/L3C set channel smc call */
+ arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
+ dev_to_node(dev), 0, 0, 0, 0, 0, &res);
+ if (res.a0) {
+ dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n",
+ dev_to_node(dev));
+ return -ENODEV;
+ }
+
+ /* Walk through the tree for all PMU UNCORE devices */
+ return thunderx2_uncore_dev_probe(dev);
+}
+
+static struct platform_driver thunderx2_uncore_driver = {
+ .probe = thunderx2_uncore_probe,
+ .driver = {
+ .name = "thunderx2-uncore-pmu",
+ .acpi_match_table = ACPI_PTR(thunderx2_uncore_acpi_match),
+ },
+};
+
+builtin_platform_driver(thunderx2_uncore_driver);
--
2.9.4


2018-04-25 09:04:00

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: [PATCH v4 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
The SoC has PMU support in its L3 cache controller (L3C) and in the
DDR4 Memory Controller (DMC).

Signed-off-by: Ganapatrao Kulkarni <[email protected]>
---
Documentation/perf/thunderx2-pmu.txt | 66 ++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
create mode 100644 Documentation/perf/thunderx2-pmu.txt

diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
new file mode 100644
index 0000000..9e9f535
--- /dev/null
+++ b/Documentation/perf/thunderx2-pmu.txt
@@ -0,0 +1,66 @@
+
+Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
+==========================================================================
+
+ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
+as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
+
+It has 8 independent DMC PMUs to capture performance events corresponding
+to 8 channels of DDR4 Memory Controller. There are 16 independent L3C PMUs
+to capture events corresponding to 16 tiles of L3 cache. Each PMU supports
+up to 4 counters.
+
+Counters are independent programmable and can be started and stopped
+individually. Each counter can be set to sample specific perf events.
+Counters are 32 bit and does not support overflow interrupt, they are
+sampled at every 2 seconds. The Counters register access are multiplexed
+across channels of DMC and L3C. The muxing(select channel) is done through
+write to a Secure register using smcc calls.
+
+PMU UNCORE (perf) driver:
+
+The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
+Each of the PMU provides description of its available events
+and configuration options in sysfs.
+ see /sys/devices/uncore_<l3c_S_X/dmc_S_X/>
+
+S is socket id and X represents channel number.
+Each PMU can be used to sample up to 4 events simultaneously.
+
+The "format" directory describes format of the config (event ID).
+The "events" directory provides configuration templates for all
+supported event types that can be used with perf tool.
+
+For example, "uncore_dmc_0_0/cnt_cycles/" is an
+equivalent of "uncore_dmc_0_0/config=0x1/".
+
+Each perf driver also provides a "cpumask" sysfs attribute, which contains a
+single CPU ID of the processor which is likely to be used to handle all the
+PMU events. It will be the first online CPU from the NUMA node of PMU device.
+
+Example for perf tool use:
+
+perf stat -a -e \
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_1/cnt_cycles/,\
+uncore_dmc_0_2/cnt_cycles/,\
+uncore_dmc_0_3/cnt_cycles/,\
+uncore_dmc_0_4/cnt_cycles/,\
+uncore_dmc_0_5/cnt_cycles/,\
+uncore_dmc_0_6/cnt_cycles/,\
+uncore_dmc_0_7/cnt_cycles/ sleep 1
+
+perf stat -a -e \
+uncore_dmc_0_0/cancelled_read_txns/,\
+uncore_dmc_0_0/cnt_cycles/,\
+uncore_dmc_0_0/consumed_read_txns/,\
+uncore_dmc_0_0/data_transfers/ sleep 1
+
+perf stat -a -e \
+uncore_l3c_0_0/l3_retry/,\
+uncore_l3c_0_0/read_hit/,\
+uncore_l3c_0_0/read_request/,\
+uncore_l3c_0_0/inv_request/ sleep 1
+
+The driver does not support sampling, therefore "perf record" will
+not work. Per-task (without "-a") perf sessions are not supported.
--
2.9.4


2018-04-26 11:02:18

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi,

On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
> +
> +/* L3c and DMC has 16 and 8 channels per socket respectively.
> + * Each Channel supports UNCORE PMU device and consists of
> + * 4 independent programmable counters. Counters are 32 bit
> + * and does not support overflow interrupt, they needs to be
> + * sampled before overflow(i.e, at every 2 seconds).
> + */
> +
> +#define UNCORE_MAX_COUNTERS 4
> +#define UNCORE_L3_MAX_TILES 16
> +#define UNCORE_DMC_MAX_CHANNELS 8
> +
> +#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)

How was a period of two seconds chosen?

What's the maximum clock speed for the L3C and DMC?

Given that all channels compete for access to the muxed register
interface, I suspect we need to try more often than once every 2
seconds...

[...]

> +struct active_timer {
> + struct perf_event *event;
> + struct hrtimer hrtimer;
> +};
> +
> +/*
> + * pmu on each socket has 2 uncore devices(dmc and l3),
> + * each uncore device has up to 16 channels, each channel can sample
> + * events independently with counters up to 4.
> + *
> + * struct thunderx2_pmu_uncore_channel created per channel.
> + * struct thunderx2_pmu_uncore_dev per uncore device.
> + */
> +struct thunderx2_pmu_uncore_channel {
> + struct thunderx2_pmu_uncore_dev *uncore_dev;
> + struct pmu pmu;

Can we put the pmu first in the struct, please?

> + int counter;

AFAICT, this counter field is never used.

> + int channel;
> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
> + struct active_timer *active_timers;

You should only need a single timer per channel, rather than one per
event.

I think you can get rid of the active_timer structure, and have:

struct perf_event *events[UNCORE_MAX_COUNTERS];
struct hrtimer timer;

> + /* to sync counter alloc/release */
> + raw_spinlock_t lock;
> +};
> +
> +struct thunderx2_pmu_uncore_dev {
> + char *name;
> + struct device *dev;
> + enum thunderx2_uncore_type type;
> + unsigned long base;

This should be:

void __iomem *base;

[...]

> +static ssize_t cpumask_show(struct device *dev,
> + struct device_attribute *attr, char *buf)
> +{
> + struct cpumask cpu_mask;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
> +
> + /* Pick first online cpu from the node */
> + cpumask_clear(&cpu_mask);
> + cpumask_set_cpu(cpumask_first(
> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
> + &cpu_mask);
> +
> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
> +}

AFAICT cpumask_of_node() returns a mask that can include offline CPUs.

Regardless, I don't think that you can keep track of the management CPU
this way. Please keep track of the CPU the PMU should be managed by,
either with a cpumask or int embedded within the PMU structure.

At hotplug time, you'll need to update the management CPU, calling
perf_pmu_migrate_context() when it is offlined.

[...]

> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
> +{
> + int counter;
> +
> + raw_spin_lock(&pmu_uncore->lock);
> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
> + pmu_uncore->uncore_dev->max_counters);
> + if (counter == pmu_uncore->uncore_dev->max_counters) {
> + raw_spin_unlock(&pmu_uncore->lock);
> + return -ENOSPC;
> + }
> + set_bit(counter, pmu_uncore->counter_mask);
> + raw_spin_unlock(&pmu_uncore->lock);
> + return counter;
> +}
> +
> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
> + int counter)
> +{
> + raw_spin_lock(&pmu_uncore->lock);
> + clear_bit(counter, pmu_uncore->counter_mask);
> + raw_spin_unlock(&pmu_uncore->lock);
> +}

I don't believe that locking is required in either of these, as the perf
core serializes pmu::add() and pmu::del(), where these get called.

> +
> +/*
> + * DMC and L3 counter interface is muxed across all channels.
> + * hence we need to select the channel before accessing counter
> + * data/control registers.

Are there separate interfaces for all-dmc-channels and all-l3c-channels?

... or is a single interface used by all-dmc-and-l3c-channels?

> + *
> + * L3 Tile and DMC channel selection is through SMC call
> + * SMC call arguments,
> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
> + * x2 = Node id

How do we map Linux node IDs to the firmware's view of node IDs?

I don't believe the two are necessarily the same -- Linux's node IDs are
a Linux-specific construct.

It would be much nicer if we could pass something based on the MPIDR,
which is a known HW construct, or if this implicitly affected the
current node.

It would be vastly more sane for this to not be muxed at all. :/

> + * x3 = DMC(1)/L3C(0)
> + * x4 = channel Id
> + */
> +static void uncore_select_channel(struct perf_event *event)
> +{
> + struct arm_smccc_res res;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
> + pmu_to_thunderx2_pmu_uncore(event->pmu);
> +
> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
> + pmu_uncore->uncore_dev->node,
> + pmu_uncore->uncore_dev->type,
> + pmu_uncore->channel, 0, 0, 0, &res);
> +
> +}

Why aren't we checking the return value of the SMC call?

The muxing and SMC sound quite scary. :/

> +
> +static void uncore_start_event_l3c(struct perf_event *event, int flags)
> +{
> + u32 val;
> + struct hw_perf_event *hwc = &event->hw;
> +
> + /* event id encoded in bits [07:03] */
> + val = GET_EVENTID(event) << 3;
> + reg_writel(val, hwc->config_base);
> +
> + if (flags & PERF_EF_RELOAD) {
> + u64 prev_raw_count =
> + local64_read(&event->hw.prev_count);
> + reg_writel(prev_raw_count, hwc->event_base);
> + }
> + local64_set(&event->hw.prev_count,
> + reg_readl(hwc->event_base));

It would be simpler to ignore PERF_EF_RELOAD, and always reprogram the
prev_count and HW to zero.

> +}
> +
> +static void uncore_start_event_dmc(struct perf_event *event, int flags)
> +{
> + u32 val, event_shift = 8;
> + struct hw_perf_event *hwc = &event->hw;
> +
> + /* enable and start counters and read current value in prev_count */
> + val = reg_readl(hwc->config_base);
> +
> + /* 8 bits for each counter,
> + * bits [05:01] of a counter to set event type.
> + */
> + reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) *
> + event_shift) + 1))) |
> + (GET_EVENTID(event) <<
> + (((GET_COUNTERID(event)) * event_shift) + 1)),
> + hwc->config_base);

This would be far more legible if val were constructed before the
reg_writel(), especially if there were a helper for the event field
shifting, e.g.

#define DMC_EVENT_CFG(idx, val) ((val) << (((idx) * 8) + 1))

int id = GET_COUNTERID(event);
int event = GET_EVENTID(event);

val = reg_readl(hwc->config_base);
val &= ~DMC_EVENT_CFG(id, 0x1f);
val |= DMCC_EVENT_CFG(id, event);
reg_writel(val, hwc->config_base));

What are bits 7:6 and 1 used for?

> +
> + if (flags & PERF_EF_RELOAD) {
> + u64 prev_raw_count =
> + local64_read(&event->hw.prev_count);
> + reg_writel(prev_raw_count, hwc->event_base);
> + }
> + local64_set(&event->hw.prev_count,
> + reg_readl(hwc->event_base));


As with the L3C events, it would be simpler to always reprogram the
prev_count and HW to zero.

> +}
> +
> +static void uncore_stop_event_l3c(struct perf_event *event)
> +{
> + reg_writel(0, event->hw.config_base);
> +}
> +
> +static void uncore_stop_event_dmc(struct perf_event *event)
> +{
> + u32 val, event_shift = 8;
> + struct hw_perf_event *hwc = &event->hw;
> +
> + val = reg_readl(hwc->config_base);
> + reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))),
> + hwc->config_base);


This could be simplified with the helper proposed above.

> +}
> +
> +static void init_cntr_base_l3c(struct perf_event *event,
> + struct thunderx2_pmu_uncore_dev *uncore_dev)
> +{
> +
> + struct hw_perf_event *hwc = &event->hw;
> +
> + /* counter ctrl/data reg offset at 8 */

Offset 8, or stride 8?

What does the register layout look like?

> + hwc->config_base = uncore_dev->base
> + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
> + hwc->event_base = uncore_dev->base
> + + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event));
> +}
> +
> +static void init_cntr_base_dmc(struct perf_event *event,
> + struct thunderx2_pmu_uncore_dev *uncore_dev)
> +{
> +
> + struct hw_perf_event *hwc = &event->hw;
> +
> + hwc->config_base = uncore_dev->base
> + + DMC_COUNTER_CTL;
> + /* counter data reg offset at 0xc */

A stride of 0xc seems unusual.

What does the register layout look like?

> + hwc->event_base = uncore_dev->base
> + + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event));
> +}
> +
> +static void thunderx2_uncore_update(struct perf_event *event)
> +{
> + s64 prev, new = 0;
> + u64 delta;
> + struct hw_perf_event *hwc = &event->hw;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
> + enum thunderx2_uncore_type type;
> +
> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
> + type = pmu_uncore->uncore_dev->type;

AFAICT this variable is not used below.

> +
> + if (pmu_uncore->uncore_dev->select_channel)
> + pmu_uncore->uncore_dev->select_channel(event);

This should always be non-NULL, right?

[...]

> +static bool thunderx2_uncore_validate_event_group(struct perf_event *event)
> +{
> + struct pmu *pmu = event->pmu;
> + struct perf_event *leader = event->group_leader;
> + struct perf_event *sibling;
> + int counters = 0;
> +
> + if (leader->pmu != event->pmu && !is_software_event(leader))
> + return false;
> +
> + counters++;

I don't think this is right when event != leader and the leader is a SW
event. In that case, the leader doesn't take a HW counter.

> +
> + for_each_sibling_event(sibling, event->group_leader) {
> + if (is_software_event(sibling))
> + continue;
> + if (sibling->pmu != pmu)
> + return false;
> + counters++;
> + }
> +
> + /*
> + * If the group requires more counters than the HW has,
> + * it cannot ever be scheduled.
> + */
> + return counters <= UNCORE_MAX_COUNTERS;
> +}

[...]

> +static int thunderx2_uncore_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore;

> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
> +
> + if (!pmu_uncore)
> + return -ENODEV;

This cannot happen, given pmu_to_thunderx2_pmu_uncore() is a wrapper
around container_of().

> +
> + /* Pick first online cpu from the node */
> + event->cpu = cpumask_first(
> + cpumask_of_node(pmu_uncore->uncore_dev->node));

I don't believe this is safe. You must keep track of which CPU is
managing the PMU, with hotplug callbacks.

[...]

> +static void thunderx2_uncore_start(struct perf_event *event, int flags)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
> + struct thunderx2_pmu_uncore_dev *uncore_dev;
> + unsigned long irqflags;
> + struct active_timer *active_timer;
> +
> + hwc->state = 0;
> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
> + uncore_dev = pmu_uncore->uncore_dev;
> +
> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
> +
> + if (uncore_dev->select_channel)
> + uncore_dev->select_channel(event);
> + uncore_dev->start_event(event, flags);
> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
> +
> + perf_event_update_userpage(event);
> + active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)];
> + active_timer->event = event;
> +
> + hrtimer_start(&active_timer->hrtimer,
> + ns_to_ktime(uncore_dev->hrtimer_interval),
> + HRTIMER_MODE_REL_PINNED);
> +}

Please use a single hrtimer, and update *all* of the events when it
fires.

I *think* that can be done in the pmu::pmu_enable() and
pmu::pmu_disable() callbacks.

Are there control bits to enable/disable all counters, or can that only
be done through the event configuration registers?

> +static void thunderx2_uncore_stop(struct perf_event *event, int flags)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
> + struct thunderx2_pmu_uncore_dev *uncore_dev;
> + unsigned long irqflags;
> +
> + if (hwc->state & PERF_HES_UPTODATE)
> + return;
> +
> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
> + uncore_dev = pmu_uncore->uncore_dev;
> +
> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
> +
> + if (uncore_dev->select_channel)
> + uncore_dev->select_channel(event);

AFAICT this cannot be NULL.

[...]

> +static int thunderx2_pmu_uncore_register(
> + struct thunderx2_pmu_uncore_channel *pmu_uncore)
> +{
> + struct device *dev = pmu_uncore->uncore_dev->dev;
> + char *name = pmu_uncore->uncore_dev->name;
> + int channel = pmu_uncore->channel;
> +
> + /* Perf event registration */
> + pmu_uncore->pmu = (struct pmu) {
> + .attr_groups = pmu_uncore->uncore_dev->attr_groups,
> + .task_ctx_nr = perf_invalid_context,
> + .event_init = thunderx2_uncore_event_init,
> + .add = thunderx2_uncore_add,
> + .del = thunderx2_uncore_del,
> + .start = thunderx2_uncore_start,
> + .stop = thunderx2_uncore_stop,
> + .read = thunderx2_uncore_read,
> + };
> +
> + pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL,
> + "%s_%d", name, channel);

Does the channel idx take the NUMA node into account?

[...]

> +static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev,
> + int channel)
> +{

> + /* we can run up to (max_counters * max_channels) events
> + * simultaneously, allocate hrtimers per channel.
> + */
> + pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev,
> + sizeof(struct active_timer) * uncore_dev->max_counters,
> + GFP_KERNEL);

Please just fold a single hrtimer into the thunderx2_pmu_uncore_channel
structure, and you can get rid of this allocation...

> +
> + for (counter = 0; counter < uncore_dev->max_counters; counter++) {
> + hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer,
> + CLOCK_MONOTONIC,
> + HRTIMER_MODE_REL);
> + pmu_uncore->active_timers[counter].hrtimer.function =
> + thunderx2_uncore_hrtimer_callback;
> + }

... and simplify this initialization.

[...]

> +static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev(
> + struct device *dev, acpi_handle handle,
> + struct acpi_device *adev, u32 type)
> +{
> + struct thunderx2_pmu_uncore_dev *uncore_dev;
> + unsigned long base;

> + base = (unsigned long)devm_ioremap_resource(dev, &res);
> + if (IS_ERR((void *)base)) {
> + dev_err(dev, "PMU type %d: Fail to map resource\n", type);
> + return NULL;
> + }

Please treat this as a void __iomem *base.

[...]

> +static int thunderx2_uncore_probe(struct platform_device *pdev)
> +{
> + struct device *dev = &pdev->dev;
> + struct arm_smccc_res res;
> +
> + set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev)));
> +
> + /* Make sure firmware supports DMC/L3C set channel smc call */
> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
> + dev_to_node(dev), 0, 0, 0, 0, 0, &res);
> + if (res.a0) {
> + dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n",
> + dev_to_node(dev));
> + return -ENODEV;
> + }

Please re-use the uncore_select_channel() wrapper rather than
open-coding this.

Which FW supports this interface?

Thanks,
Mark.

2018-04-26 20:57:31

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

Hi,

Just a few typo corrections...

On 04/25/2018 02:00 AM, Ganapatrao Kulkarni wrote:
> Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
> The SoC has PMU support in its L3 cache controller (L3C) and in the
> DDR4 Memory Controller (DMC).
>
> Signed-off-by: Ganapatrao Kulkarni <[email protected]>
> ---
> Documentation/perf/thunderx2-pmu.txt | 66 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 66 insertions(+)
> create mode 100644 Documentation/perf/thunderx2-pmu.txt
>
> diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
> new file mode 100644
> index 0000000..9e9f535
> --- /dev/null
> +++ b/Documentation/perf/thunderx2-pmu.txt
> @@ -0,0 +1,66 @@
> +
> +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
> +==========================================================================
> +
> +ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
> +as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
> +
> +It has 8 independent DMC PMUs to capture performance events corresponding
> +to 8 channels of DDR4 Memory Controller. There are 16 independent L3C PMUs
> +to capture events corresponding to 16 tiles of L3 cache. Each PMU supports
> +up to 4 counters.
> +
> +Counters are independent programmable and can be started and stopped

independently

> +individually. Each counter can be set to sample specific perf events.
> +Counters are 32 bit and does not support overflow interrupt, they are

do not interrupt; they are


> +sampled at every 2 seconds. The Counters register access are multiplexed
> +across channels of DMC and L3C. The muxing(select channel) is done through
> +write to a Secure register using smcc calls.
> +
> +PMU UNCORE (perf) driver:
> +
> +The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
> +Each of the PMU provides description of its available events

of the PMUs

> +and configuration options in sysfs.
> + see /sys/devices/uncore_<l3c_S_X/dmc_S_X/>
> +
> +S is socket id and X represents channel number.
> +Each PMU can be used to sample up to 4 events simultaneously.
> +
> +The "format" directory describes format of the config (event ID).
> +The "events" directory provides configuration templates for all
> +supported event types that can be used with perf tool.
> +
> +For example, "uncore_dmc_0_0/cnt_cycles/" is an
> +equivalent of "uncore_dmc_0_0/config=0x1/".
> +
> +Each perf driver also provides a "cpumask" sysfs attribute, which contains a
> +single CPU ID of the processor which is likely to be used to handle all the
> +PMU events. It will be the first online CPU from the NUMA node of PMU device.
> +
> +Example for perf tool use:
> +
> +perf stat -a -e \
> +uncore_dmc_0_0/cnt_cycles/,\
> +uncore_dmc_0_1/cnt_cycles/,\
> +uncore_dmc_0_2/cnt_cycles/,\
> +uncore_dmc_0_3/cnt_cycles/,\
> +uncore_dmc_0_4/cnt_cycles/,\
> +uncore_dmc_0_5/cnt_cycles/,\
> +uncore_dmc_0_6/cnt_cycles/,\
> +uncore_dmc_0_7/cnt_cycles/ sleep 1
> +
> +perf stat -a -e \
> +uncore_dmc_0_0/cancelled_read_txns/,\
> +uncore_dmc_0_0/cnt_cycles/,\
> +uncore_dmc_0_0/consumed_read_txns/,\
> +uncore_dmc_0_0/data_transfers/ sleep 1
> +
> +perf stat -a -e \
> +uncore_l3c_0_0/l3_retry/,\
> +uncore_l3c_0_0/read_hit/,\
> +uncore_l3c_0_0/read_request/,\
> +uncore_l3c_0_0/inv_request/ sleep 1
> +
> +The driver does not support sampling, therefore "perf record" will
> +not work. Per-task (without "-a") perf sessions are not supported.
>


--
~Randy

2018-04-26 22:07:55

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Wed, 25 Apr 2018 14:30:47 +0530
Ganapatrao Kulkarni <[email protected]> wrote:

> +static int thunderx2_uncore_event_init(struct perf_event *event)
...
> + /*
> + * SOC PMU counters are shared across all cores.
> + * Therefore, it does not support per-process mode.
> + * Also, it does not support event sampling mode.
> + */
> + if (is_sampling_event(event) || event->attach_state & PERF_ATTACH_TASK)
> + return -EINVAL;
> +
> + /* SOC counters do not have usr/os/guest/host bits */
> + if (event->attr.exclude_user || event->attr.exclude_kernel ||
> + event->attr.exclude_host || event->attr.exclude_guest)
> + return -EINVAL;
> +
> + if (event->cpu < 0)
> + return -EINVAL;
> +
> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
> +
> + if (!pmu_uncore)
> + return -ENODEV;
> +
> + /* Pick first online cpu from the node */
> + event->cpu = cpumask_first(
> + cpumask_of_node(pmu_uncore->uncore_dev->node));
> +
> + if (event->cpu >= nr_cpu_ids)
> + return -EINVAL;
> +
> + if (event->attr.config >= pmu_uncore->uncore_dev->max_events)
> + return -EINVAL;
> +
> + /* store event id */
> + hwc->config = event->attr.config;
> +
> + /* Validate the group */
> + if (!thunderx2_uncore_validate_event_group(event))
> + return -EINVAL;

This PMU driver can be made more user-friendly by not just silently
returning an error code such as -EINVAL, but by emitting a useful
message describing the specific error via dmesg.

Thanks,

Kim

2018-04-27 04:54:02

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH v4 1/2] perf: uncore: Adding documentation for ThunderX2 pmu uncore driver

On Fri, Apr 27, 2018 at 2:25 AM, Randy Dunlap <[email protected]> wrote:
> Hi,
>
> Just a few typo corrections...
>
> On 04/25/2018 02:00 AM, Ganapatrao Kulkarni wrote:
>> Documentation for the UNCORE PMUs on Cavium's ThunderX2 SoC.
>> The SoC has PMU support in its L3 cache controller (L3C) and in the
>> DDR4 Memory Controller (DMC).
>>
>> Signed-off-by: Ganapatrao Kulkarni <[email protected]>
>> ---
>> Documentation/perf/thunderx2-pmu.txt | 66 ++++++++++++++++++++++++++++++++++++
>> 1 file changed, 66 insertions(+)
>> create mode 100644 Documentation/perf/thunderx2-pmu.txt
>>
>> diff --git a/Documentation/perf/thunderx2-pmu.txt b/Documentation/perf/thunderx2-pmu.txt
>> new file mode 100644
>> index 0000000..9e9f535
>> --- /dev/null
>> +++ b/Documentation/perf/thunderx2-pmu.txt
>> @@ -0,0 +1,66 @@
>> +
>> +Cavium ThunderX2 SoC Performance Monitoring Unit (PMU UNCORE)
>> +==========================================================================
>> +
>> +ThunderX2 SoC PMU consists of independent system wide per Socket PMUs such
>> +as Level 3 Cache(L3C) and DDR4 Memory Controller(DMC).
>> +
>> +It has 8 independent DMC PMUs to capture performance events corresponding
>> +to 8 channels of DDR4 Memory Controller. There are 16 independent L3C PMUs
>> +to capture events corresponding to 16 tiles of L3 cache. Each PMU supports
>> +up to 4 counters.
>> +
>> +Counters are independent programmable and can be started and stopped
>
> independently

thanks, will update.
>
>> +individually. Each counter can be set to sample specific perf events.
>> +Counters are 32 bit and does not support overflow interrupt, they are
>
> do not interrupt; they are
>
>
>> +sampled at every 2 seconds. The Counters register access are multiplexed
>> +across channels of DMC and L3C. The muxing(select channel) is done through
>> +write to a Secure register using smcc calls.
>> +
>> +PMU UNCORE (perf) driver:
>> +
>> +The thunderx2-pmu driver registers several perf PMUs for DMC and L3C devices.
>> +Each of the PMU provides description of its available events
>
> of the PMUs
>
>> +and configuration options in sysfs.
>> + see /sys/devices/uncore_<l3c_S_X/dmc_S_X/>
>> +
>> +S is socket id and X represents channel number.
>> +Each PMU can be used to sample up to 4 events simultaneously.
>> +
>> +The "format" directory describes format of the config (event ID).
>> +The "events" directory provides configuration templates for all
>> +supported event types that can be used with perf tool.
>> +
>> +For example, "uncore_dmc_0_0/cnt_cycles/" is an
>> +equivalent of "uncore_dmc_0_0/config=0x1/".
>> +
>> +Each perf driver also provides a "cpumask" sysfs attribute, which contains a
>> +single CPU ID of the processor which is likely to be used to handle all the
>> +PMU events. It will be the first online CPU from the NUMA node of PMU device.
>> +
>> +Example for perf tool use:
>> +
>> +perf stat -a -e \
>> +uncore_dmc_0_0/cnt_cycles/,\
>> +uncore_dmc_0_1/cnt_cycles/,\
>> +uncore_dmc_0_2/cnt_cycles/,\
>> +uncore_dmc_0_3/cnt_cycles/,\
>> +uncore_dmc_0_4/cnt_cycles/,\
>> +uncore_dmc_0_5/cnt_cycles/,\
>> +uncore_dmc_0_6/cnt_cycles/,\
>> +uncore_dmc_0_7/cnt_cycles/ sleep 1
>> +
>> +perf stat -a -e \
>> +uncore_dmc_0_0/cancelled_read_txns/,\
>> +uncore_dmc_0_0/cnt_cycles/,\
>> +uncore_dmc_0_0/consumed_read_txns/,\
>> +uncore_dmc_0_0/data_transfers/ sleep 1
>> +
>> +perf stat -a -e \
>> +uncore_l3c_0_0/l3_retry/,\
>> +uncore_l3c_0_0/read_hit/,\
>> +uncore_l3c_0_0/read_request/,\
>> +uncore_l3c_0_0/inv_request/ sleep 1
>> +
>> +The driver does not support sampling, therefore "perf record" will
>> +not work. Per-task (without "-a") perf sessions are not supported.
>>
>
>
> --
> ~Randy

thanks
Ganapat

2018-04-27 09:32:54

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Kim,

On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> On Wed, 25 Apr 2018 14:30:47 +0530
> Ganapatrao Kulkarni <[email protected]> wrote:
>
> > +static int thunderx2_uncore_event_init(struct perf_event *event)

> This PMU driver can be made more user-friendly by not just silently
> returning an error code such as -EINVAL, but by emitting a useful
> message describing the specific error via dmesg.

As has previously been discussed on several occasions, patches which log
to dmesg in a pmu::event_init() path at any level above pr_debug() are
not acceptable -- dmesg is not intended as a mechanism to inform users
of driver-specific constraints.

I would appreciate if in future you could qualify your suggestion with
the requirement that pr_debug() is used.

Thanks,
Mark.

2018-04-27 13:16:52

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Fri, 27 Apr 2018 10:30:27 +0100
Mark Rutland <[email protected]> wrote:

> Hi Kim,
>
> On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> > On Wed, 25 Apr 2018 14:30:47 +0530
> > Ganapatrao Kulkarni <[email protected]> wrote:
> >
> > > +static int thunderx2_uncore_event_init(struct perf_event *event)
>
> > This PMU driver can be made more user-friendly by not just silently
> > returning an error code such as -EINVAL, but by emitting a useful
> > message describing the specific error via dmesg.
>
> As has previously been discussed on several occasions, patches which log
> to dmesg in a pmu::event_init() path at any level above pr_debug() are
> not acceptable -- dmesg is not intended as a mechanism to inform users
> of driver-specific constraints.

I disagree - drivers do it all the time, using dev_err(), dev_warn(), etc.

> I would appreciate if in future you could qualify your suggestion with
> the requirement that pr_debug() is used.

It shouldn't - the driver isn't being debugged, it's in regular use.

Thanks,

Kim

2018-04-27 14:38:18

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Fri, Apr 27, 2018 at 08:15:25AM -0500, Kim Phillips wrote:
> On Fri, 27 Apr 2018 10:30:27 +0100
> Mark Rutland <[email protected]> wrote:
> > On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> > > On Wed, 25 Apr 2018 14:30:47 +0530
> > > Ganapatrao Kulkarni <[email protected]> wrote:
> > >
> > > > +static int thunderx2_uncore_event_init(struct perf_event *event)
> >
> > > This PMU driver can be made more user-friendly by not just silently
> > > returning an error code such as -EINVAL, but by emitting a useful
> > > message describing the specific error via dmesg.
> >
> > As has previously been discussed on several occasions, patches which log
> > to dmesg in a pmu::event_init() path at any level above pr_debug() are
> > not acceptable -- dmesg is not intended as a mechanism to inform users
> > of driver-specific constraints.
>
> I disagree - drivers do it all the time, using dev_err(), dev_warn(), etc.
>
> > I would appreciate if in future you could qualify your suggestion with
> > the requirement that pr_debug() is used.
>
> It shouldn't - the driver isn't being debugged, it's in regular use.

For anything under drivers/perf/, I'd prefer not to have these prints
and instead see efforts to improve error reporting via the perf system
call interface.

Anyway, I think this driver has bigger problems that need addressing.

Will

2018-04-27 15:47:54

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Fri, 27 Apr 2018 15:37:20 +0100
Will Deacon <[email protected]> wrote:

> On Fri, Apr 27, 2018 at 08:15:25AM -0500, Kim Phillips wrote:
> > On Fri, 27 Apr 2018 10:30:27 +0100
> > Mark Rutland <[email protected]> wrote:
> > > On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> > > > On Wed, 25 Apr 2018 14:30:47 +0530
> > > > Ganapatrao Kulkarni <[email protected]> wrote:
> > > >
> > > > > +static int thunderx2_uncore_event_init(struct perf_event *event)
> > >
> > > > This PMU driver can be made more user-friendly by not just silently
> > > > returning an error code such as -EINVAL, but by emitting a useful
> > > > message describing the specific error via dmesg.
> > >
> > > As has previously been discussed on several occasions, patches which log
> > > to dmesg in a pmu::event_init() path at any level above pr_debug() are
> > > not acceptable -- dmesg is not intended as a mechanism to inform users
> > > of driver-specific constraints.
> >
> > I disagree - drivers do it all the time, using dev_err(), dev_warn(), etc.
> >
> > > I would appreciate if in future you could qualify your suggestion with
> > > the requirement that pr_debug() is used.
> >
> > It shouldn't - the driver isn't being debugged, it's in regular use.
>
> For anything under drivers/perf/, I'd prefer not to have these prints
> and instead see efforts to improve error reporting via the perf system
> call interface.

We'd all prefer that, and for all PMU drivers, why should ones under
drivers/perf be treated differently?

As you are already aware, I've personally tried to fix this problem -
that has existed since before the introduction of the perf tool (I
consider it a syscall-independent enhanced error interface), multiple
times, and failed.

So until someone comes up with a solution that works for everyone
up to and including Linus Torvalds (who hasn't put up a problem
pulling PMU drivers emitting things to dmesg so far, by the way), this
keep PMU drivers' errors silent preference of yours is unnecessarily
impeding people trying to measure system performance on Arm based
machines - all other archs' maintainers are fine with PMU drivers using
dmesg.

> Anyway, I think this driver has bigger problems that need addressing.

To me it represents yet another PMU driver submission - as the years go
by - that is lacking in the user messaging area. Which reminds me, can
you take another look at applying this?:

https://patchwork.kernel.org/patch/10068535/

Thanks,

Kim

2018-04-27 16:10:34

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Kim,

[Ganapat: please don't let this discussion disrupt your PMU driver
development. You can safely ignore it for now :)]

On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
> On Fri, 27 Apr 2018 15:37:20 +0100
> Will Deacon <[email protected]> wrote:
>
> > On Fri, Apr 27, 2018 at 08:15:25AM -0500, Kim Phillips wrote:
> > > On Fri, 27 Apr 2018 10:30:27 +0100
> > > Mark Rutland <[email protected]> wrote:
> > > > On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> > > > > On Wed, 25 Apr 2018 14:30:47 +0530
> > > > > Ganapatrao Kulkarni <[email protected]> wrote:
> > > > >
> > > > > > +static int thunderx2_uncore_event_init(struct perf_event *event)
> > > >
> > > > > This PMU driver can be made more user-friendly by not just silently
> > > > > returning an error code such as -EINVAL, but by emitting a useful
> > > > > message describing the specific error via dmesg.
> > > >
> > > > As has previously been discussed on several occasions, patches which log
> > > > to dmesg in a pmu::event_init() path at any level above pr_debug() are
> > > > not acceptable -- dmesg is not intended as a mechanism to inform users
> > > > of driver-specific constraints.
> > >
> > > I disagree - drivers do it all the time, using dev_err(), dev_warn(), etc.
> > >
> > > > I would appreciate if in future you could qualify your suggestion with
> > > > the requirement that pr_debug() is used.
> > >
> > > It shouldn't - the driver isn't being debugged, it's in regular use.
> >
> > For anything under drivers/perf/, I'd prefer not to have these prints
> > and instead see efforts to improve error reporting via the perf system
> > call interface.
>
> We'd all prefer that, and for all PMU drivers, why should ones under
> drivers/perf be treated differently?

Because they're the ones I maintain...

> As you are already aware, I've personally tried to fix this problem -
> that has existed since before the introduction of the perf tool (I
> consider it a syscall-independent enhanced error interface), multiple
> times, and failed.

Why is that my problem? Try harder?

> So until someone comes up with a solution that works for everyone
> up to and including Linus Torvalds (who hasn't put up a problem
> pulling PMU drivers emitting things to dmesg so far, by the way), this
> keep PMU drivers' errors silent preference of yours is unnecessarily
> impeding people trying to measure system performance on Arm based
> machines - all other archs' maintainers are fine with PMU drivers using
> dmesg.

Good for them, although I'm pretty sure that at least the x86 folks are
against this crap too.

> > Anyway, I think this driver has bigger problems that need addressing.
>
> To me it represents yet another PMU driver submission - as the years go
> by - that is lacking in the user messaging area. Which reminds me, can
> you take another look at applying this?:

As I said before, I'm not going to take anything that logs above pr_debug
for things that are directly triggerable from userspace. Spin a version
using pr_debug and I'll queue it.

Have a good weekend,

Will

2018-04-27 16:58:20

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Fri, 27 Apr 2018 17:09:14 +0100
Will Deacon <[email protected]> wrote:

> Kim,
>
> [Ganapat: please don't let this discussion disrupt your PMU driver
> development. You can safely ignore it for now :)]
>
> On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
> > On Fri, 27 Apr 2018 15:37:20 +0100
> > Will Deacon <[email protected]> wrote:
> >
> > > On Fri, Apr 27, 2018 at 08:15:25AM -0500, Kim Phillips wrote:
> > > > On Fri, 27 Apr 2018 10:30:27 +0100
> > > > Mark Rutland <[email protected]> wrote:
> > > > > On Thu, Apr 26, 2018 at 05:06:24PM -0500, Kim Phillips wrote:
> > > > > > On Wed, 25 Apr 2018 14:30:47 +0530
> > > > > > Ganapatrao Kulkarni <[email protected]> wrote:
> > > > > >
> > > > > > > +static int thunderx2_uncore_event_init(struct perf_event *event)
> > > > >
> > > > > > This PMU driver can be made more user-friendly by not just silently
> > > > > > returning an error code such as -EINVAL, but by emitting a useful
> > > > > > message describing the specific error via dmesg.
> > > > >
> > > > > As has previously been discussed on several occasions, patches which log
> > > > > to dmesg in a pmu::event_init() path at any level above pr_debug() are
> > > > > not acceptable -- dmesg is not intended as a mechanism to inform users
> > > > > of driver-specific constraints.
> > > >
> > > > I disagree - drivers do it all the time, using dev_err(), dev_warn(), etc.
> > > >
> > > > > I would appreciate if in future you could qualify your suggestion with
> > > > > the requirement that pr_debug() is used.
> > > >
> > > > It shouldn't - the driver isn't being debugged, it's in regular use.
> > >
> > > For anything under drivers/perf/, I'd prefer not to have these prints
> > > and instead see efforts to improve error reporting via the perf system
> > > call interface.
> >
> > We'd all prefer that, and for all PMU drivers, why should ones under
> > drivers/perf be treated differently?
>
> Because they're the ones I maintain...

You represent a minority on your opinion on this matter though.

> > As you are already aware, I've personally tried to fix this problem -
> > that has existed since before the introduction of the perf tool (I
> > consider it a syscall-independent enhanced error interface), multiple
> > times, and failed.
>
> Why is that my problem? Try harder?

It's your problem because we're here reviewing a patch that happens to
fall under your maintainership. I'll be the first person to tell you
I'm obviously incompetent and haven't been able to come up with a
solution that is acceptable for everyone up to and including Linus
Torvalds. I'm just noticing a chronic usability problem that can be
easily alleviated in the context of this patch review.

> > So until someone comes up with a solution that works for everyone
> > up to and including Linus Torvalds (who hasn't put up a problem
> > pulling PMU drivers emitting things to dmesg so far, by the way), this
> > keep PMU drivers' errors silent preference of yours is unnecessarily
> > impeding people trying to measure system performance on Arm based
> > machines - all other archs' maintainers are fine with PMU drivers using
> > dmesg.
>
> Good for them, although I'm pretty sure that at least the x86 folks are
> against this crap too.

Unfortunately, it doesn't affect them nearly as much as it does our
more diverse platforms, which is why I don't think they care to do
much about it.

> > > Anyway, I think this driver has bigger problems that need addressing.
> >
> > To me it represents yet another PMU driver submission - as the years go
> > by - that is lacking in the user messaging area. Which reminds me, can
> > you take another look at applying this?:
>
> As I said before, I'm not going to take anything that logs above pr_debug
> for things that are directly triggerable from userspace. Spin a version

Why? There are plenty of things that emit stuff into dmesg that are
directly triggerable from userspace. Is it because it upsets fuzzing
tests? How about those be run with a patched kernel that somehow
mitigates the printing?

> using pr_debug and I'll queue it.

How about using a ratelimited dev_err variant?

> Have a good weekend,

You too.

Kim

2018-05-01 11:54:15

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Kim,

On Fri, Apr 27, 2018 at 11:56:25AM -0500, Kim Phillips wrote:
> On Fri, 27 Apr 2018 17:09:14 +0100
> Will Deacon <[email protected]> wrote:
> > On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
> > > On Fri, 27 Apr 2018 15:37:20 +0100
> > > Will Deacon <[email protected]> wrote:
> > > > For anything under drivers/perf/, I'd prefer not to have these prints
> > > > and instead see efforts to improve error reporting via the perf system
> > > > call interface.
> > >
> > > We'd all prefer that, and for all PMU drivers, why should ones under
> > > drivers/perf be treated differently?
> >
> > Because they're the ones I maintain...
>
> You represent a minority on your opinion on this matter though.

I'm not sure that's true -- you tend to be the only one raising this issue;
I think most people don't actually care one way or the other. If you know of
others who care about it too then I suggest you work together as a group to
solve the problem in a way which is amenable to the upstream maintainers.
Then you won't have to worry about fixing it personally. You could even
propose a topic for plumbers if there is enough interest in a general
framework for propagating error messages to userspace.

> > > As you are already aware, I've personally tried to fix this problem -
> > > that has existed since before the introduction of the perf tool (I
> > > consider it a syscall-independent enhanced error interface), multiple
> > > times, and failed.
> >
> > Why is that my problem? Try harder?
>
> It's your problem because we're here reviewing a patch that happens to
> fall under your maintainership. I'll be the first person to tell you
> I'm obviously incompetent and haven't been able to come up with a
> solution that is acceptable for everyone up to and including Linus
> Torvalds. I'm just noticing a chronic usability problem that can be
> easily alleviated in the context of this patch review.

I don't see how incompetence has anything to do with it and, like most
people (lucky gits...), I doubt Torvalds cares too deeply about PMU driver
internals. No arguments on the usability problem, but this ain't the fix
you're looking for: it's a bodge. We've been over this many times before.

> > > So until someone comes up with a solution that works for everyone
> > > up to and including Linus Torvalds (who hasn't put up a problem
> > > pulling PMU drivers emitting things to dmesg so far, by the way), this
> > > keep PMU drivers' errors silent preference of yours is unnecessarily
> > > impeding people trying to measure system performance on Arm based
> > > machines - all other archs' maintainers are fine with PMU drivers using
> > > dmesg.
> >
> > Good for them, although I'm pretty sure that at least the x86 folks are
> > against this crap too.
>
> Unfortunately, it doesn't affect them nearly as much as it does our
> more diverse platforms, which is why I don't think they care to do
> much about it.

x86 has its fair share of PMUs too, so I don't think we're special in this
regard. The main difference seems to be that they have better support in
the perf tool for diagnosing problems.

> > > > Anyway, I think this driver has bigger problems that need addressing.
> > >
> > > To me it represents yet another PMU driver submission - as the years go
> > > by - that is lacking in the user messaging area. Which reminds me, can
> > > you take another look at applying this?:
> >
> > As I said before, I'm not going to take anything that logs above pr_debug
> > for things that are directly triggerable from userspace. Spin a version
>
> Why? There are plenty of things that emit stuff into dmesg that are
> directly triggerable from userspace. Is it because it upsets fuzzing
> tests? How about those be run with a patched kernel that somehow
> mitigates the printing?

[we've been over this before]

There are two camps that might find this information useful:

1. People writing userspace support for a new PMU using the
perf_event_open syscall

2. People trying to use a tool which abstracts the PMU details to some
degree (e.g. perf tool)

Those in (1) can get by with pr_debug. Sure, they have to recompile, but
it's not like there are many people in this camp and they'll probably be
working with the PMU driver writer to some degree anyway and taking new
kernel drops.

Those in (2) may not have access to dmesg, absolutely should not be able
to spam it (since this could hide other important messages), will probably
struggle to match an internal message from the PMU driver to their
invocation of the high-level tool and may well be in a multi-user
environment so will struggle to identify the messages that they are
responsible for. What they actually want is for the perf tool to give
them more information, and dmesg is not the right channel for that, for
similar reasons.

> > using pr_debug and I'll queue it.
>
> How about using a ratelimited dev_err variant?

Nah, I don't want dmesg being parsed by users of perf tool. pr_debug should
be sufficient for perf tool developers working with a new PMU type.

Will

2018-05-04 00:32:08

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Tue, 1 May 2018 12:54:05 +0100
Will Deacon <[email protected]> wrote:

> Hi Kim,

Hi Will, thanks for responding.

> On Fri, Apr 27, 2018 at 11:56:25AM -0500, Kim Phillips wrote:
> > On Fri, 27 Apr 2018 17:09:14 +0100
> > Will Deacon <[email protected]> wrote:
> > > On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
> > > > On Fri, 27 Apr 2018 15:37:20 +0100
> > > > Will Deacon <[email protected]> wrote:
> > > > > For anything under drivers/perf/, I'd prefer not to have these prints
> > > > > and instead see efforts to improve error reporting via the perf system
> > > > > call interface.
> > > >
> > > > We'd all prefer that, and for all PMU drivers, why should ones under
> > > > drivers/perf be treated differently?
> > >
> > > Because they're the ones I maintain...
> >
> > You represent a minority on your opinion on this matter though.
>
> I'm not sure that's true

I haven't seen another arch/PMU driver maintainer actively oppose it
other than you (and Mark).

> -- you tend to be the only one raising this issue;

If you're basing that on list activity, I wouldn't, since most Arm perf
users don't subscribe to LKML. I'm trying to represent current and
future Arm perf users that shouldn't be expected to hang out here on
LKML and review PMU drivers. Couldn't tell you why PMU driver authors
themselves don't chime in, but will note that some drivers in
drivers/perf do print things from their event_init functions.

Also, you have been cc'd on off-list email threads where SPE users had
to debug the SPE driver's event_init() manually. There are other SPE
users I know that have had to go through the same painstaking process of
debugging the SPE driver. Last but not least, there's me, and I'd sure
appreciate more verbose PMU drivers, based on my real-life performance
monitoring experience(s).

> I think most people don't actually care one way or the other. If you know of

Like I said, I think you're limiting your audience to LKML subscribers,
who are more amenable to being convinced they need to debug their
kernel.

> others who care about it too then I suggest you work together as a group to
> solve the problem in a way which is amenable to the upstream maintainers.

I know a few off-list people that would like a more user-friendly Arm
perf experience, but we're not qualified to solve the larger problem
that perf has had since its inception, and I think it's a little unfair
for you to suggest that, esp. given you're one of the maintainers, and
there's Linus on top of you.

> Then you won't have to worry about fixing it personally. You could even
> propose a topic for plumbers if there is enough interest in a general
> framework for propagating error messages to userspace.

I'm not worried about fixing it personally or not. I'm worried about
our perf users' experience given your objection to using the vehicle
already in use on other architectures and PMU drivers.

If you - or anyone else for that matter - know of a way that'll get us
past this, by all means, say something.

If it helps, I recently asked acme if we could borrow bits from struct
perf_event and got no response.

> > > > As you are already aware, I've personally tried to fix this problem -
> > > > that has existed since before the introduction of the perf tool (I
> > > > consider it a syscall-independent enhanced error interface), multiple
> > > > times, and failed.
> > >
> > > Why is that my problem? Try harder?
> >
> > It's your problem because we're here reviewing a patch that happens to
> > fall under your maintainership. I'll be the first person to tell you
> > I'm obviously incompetent and haven't been able to come up with a
> > solution that is acceptable for everyone up to and including Linus
> > Torvalds. I'm just noticing a chronic usability problem that can be
> > easily alleviated in the context of this patch review.
>
> I don't see how incompetence has anything to do with it and, like most

Well you told me to try harder, and I'm trying to let you know that I
can try harder - ad infinitum - and still not get it, so telling me to
try harder isn't going to necessarily resolve the issue.

> people (lucky gits...), I doubt Torvalds cares too deeply about PMU driver
> internals. No arguments on the usability problem, but this ain't the fix

Like acme, I doubt Linus runs perf on real Arm h/w, so yes, I don't
think they care that much, but if they did try to do it one day, I'd
like to that they'd be able to have dmesg contain that extra error
information for them.

As you know, David Howells wrote a supplemental error syscall facility
[1], that Linus nacked (off list no less). Does Linus care about what
David Howells was developing it for? I don't know, but he does have
enough experience to know whether a certain approach to a problem is
sustainable in his kernel or not.

> you're looking for: it's a bodge. We've been over this many times before.

I think it's OK - necessary even - until this error infrastructure
problem perf has is fixed, and I think you're being unrealistic if you
think it will be fixed anytime soon.

> > > > So until someone comes up with a solution that works for everyone
> > > > up to and including Linus Torvalds (who hasn't put up a problem
> > > > pulling PMU drivers emitting things to dmesg so far, by the way), this
> > > > keep PMU drivers' errors silent preference of yours is unnecessarily
> > > > impeding people trying to measure system performance on Arm based
> > > > machines - all other archs' maintainers are fine with PMU drivers using
> > > > dmesg.
> > >
> > > Good for them, although I'm pretty sure that at least the x86 folks are
> > > against this crap too.
> >
> > Unfortunately, it doesn't affect them nearly as much as it does our
> > more diverse platforms, which is why I don't think they care to do
> > much about it.
>
> x86 has its fair share of PMUs too, so I don't think we're special in this
> regard. The main difference seems to be that they have better support in
> the perf tool for diagnosing problems.

Indeed, their software is as vertically integrated as their hardware,
and, well, they have more people with more experience working and caring
about perf running and being more usable.

> > > > > Anyway, I think this driver has bigger problems that need addressing.
> > > >
> > > > To me it represents yet another PMU driver submission - as the years go
> > > > by - that is lacking in the user messaging area. Which reminds me, can
> > > > you take another look at applying this?:
> > >
> > > As I said before, I'm not going to take anything that logs above pr_debug
> > > for things that are directly triggerable from userspace. Spin a version
> >
> > Why? There are plenty of things that emit stuff into dmesg that are
> > directly triggerable from userspace. Is it because it upsets fuzzing
> > tests? How about those be run with a patched kernel that somehow
> > mitigates the printing?
>
> [we've been over this before]
> There are two camps that might find this information useful:
>
> 1. People writing userspace support for a new PMU using the
> perf_event_open syscall
>
> 2. People trying to use a tool which abstracts the PMU details to some
> degree (e.g. perf tool)
>
> Those in (1) can get by with pr_debug. Sure, they have to recompile, but
> it's not like there are many people in this camp and they'll probably be
> working with the PMU driver writer to some degree anyway and taking new
> kernel drops.

No, please, that's a worse user experience than necessary.

FWIW, I see this type of person as e.g., a JIT developer - a
compiler/toolchain person - who has decided to use h/w monitoring to
help the JIT engine find hotspots. The first place they should start
is to start using the perf tool itself, experimenting with events and
PMUs of interest. When satisfied, they should clone the parameters
the tool makes to the perf_event_open syscall in their JIT engine. We
should not be wasting their time forcing them to hack around in the
kernel trying to debug perf driver semantics.

> Those in (2) may not have access to dmesg, absolutely should not be able

I don't think you're being realistic here: we're talking about people
trying to improve performance: They would likely have kptr_restrict ==
0 (unrestricted), so why would they have dmesg_restrict set? Btw, the
rationale for dmesg_restrict was kernel pointer visibility [2], but now
%Kp hashes addresses before printing them, so it's even less likely to
be used. Curious, where have you seen dmesg_restrict being set?

> to spam it (since this could hide other important messages), will probably

How will it hide other important messages? Run a logging daemon?
We're not talking about a lot of messages here: one line per perf
invocation: see e.g. runs in [2].

> struggle to match an internal message from the PMU driver to their
> invocation of the high-level tool and may well be in a multi-user
> environment so will struggle to identify the messages that they are
> responsible for.

Again, I think you're being unrealistic. My experience - in
multiple performance analysis roles - was that I was always the only
person using the machine at the time, and can easily correlate perf
invocations with dmesg output, particularly if I run 'dmesg -w &' prior
to invoking perf. In one of those roles, having to mess with the
kernel was almost inconceivable - it literally 'just worked' for
everything else. I was lucky that I had prior kernel experience in
order to debug the driver and be able find out how to invoke perf.

> What they actually want is for the perf tool to give
> them more information, and dmesg is not the right channel for that, for
> similar reasons.

We'd all like the perf tool to do things better, and dmesg is all we
have for now, and now I'm sounding like I did 1 1/2 years ago.

> > > using pr_debug and I'll queue it.
> >
> > How about using a ratelimited dev_err variant?
>
> Nah, I don't want dmesg being parsed by users of perf tool. pr_debug should
> be sufficient for perf tool developers working with a new PMU type.

OK the question was to the ratelimited part, but I think that even
might be the default these days [4].

For reasons (different perceptions?) already mentioned above, I also
don't agree that this has to do with just perf tool developers, and
only when working with a new PMU type. My main concern is new/future
users to perf on Arm, working with any PMU type - new or old. If there
is a perf tool-side component to this driver, I don't see it. Having
said that, I think the owners of this and other PMU drivers should have
a say in what type of experience they want for their users in this
regard...is that fair?

Thanks,

Kim

[1] https://patchwork.kernel.org/patch/9907463/
[2] https://lwn.net/Articles/414813/
[3] https://lkml.org/lkml/2017/11/21/385
[4] https://lwn.net/Articles/693010/

2018-05-04 17:11:31

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Kim,

On 04/05/18 01:30, Kim Phillips wrote:
> On Tue, 1 May 2018 12:54:05 +0100
> Will Deacon <[email protected]> wrote:
>
>> Hi Kim,
>
> Hi Will, thanks for responding.
>
>> On Fri, Apr 27, 2018 at 11:56:25AM -0500, Kim Phillips wrote:
>>> On Fri, 27 Apr 2018 17:09:14 +0100
>>> Will Deacon <[email protected]> wrote:
>>>> On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
>>>>> On Fri, 27 Apr 2018 15:37:20 +0100
>>>>> Will Deacon <[email protected]> wrote:
>>>>>> For anything under drivers/perf/, I'd prefer not to have these prints
>>>>>> and instead see efforts to improve error reporting via the perf system
>>>>>> call interface.
>>>>>
>>>>> We'd all prefer that, and for all PMU drivers, why should ones under
>>>>> drivers/perf be treated differently?
>>>>
>>>> Because they're the ones I maintain...
>>>
>>> You represent a minority on your opinion on this matter though.
>>
>> I'm not sure that's true
>
> I haven't seen another arch/PMU driver maintainer actively oppose it
> other than you (and Mark).
>
>> -- you tend to be the only one raising this issue;
>
> If you're basing that on list activity, I wouldn't, since most Arm perf
> users don't subscribe to LKML. I'm trying to represent current and
> future Arm perf users that shouldn't be expected to hang out here on
> LKML and review PMU drivers. Couldn't tell you why PMU driver authors
> themselves don't chime in, but will note that some drivers in
> drivers/perf do print things from their event_init functions.

As a PMU driver author who *has* already invested not-inconsiderable
time and effort in trying to explain to you the technical issues from a
non-maintainer perspective (and is about to do so again), I can't help
but feel rather slighted by that comment.

> Also, you have been cc'd on off-list email threads where SPE users had
> to debug the SPE driver's event_init() manually. There are other SPE
> users I know that have had to go through the same painstaking process of
> debugging the SPE driver. Last but not least, there's me, and I'd sure
> appreciate more verbose PMU drivers, based on my real-life performance
> monitoring experience(s).
>
>> I think most people don't actually care one way or the other. If you know of
>
> Like I said, I think you're limiting your audience to LKML subscribers,
> who are more amenable to being convinced they need to debug their
> kernel.
>
>> others who care about it too then I suggest you work together as a group to
>> solve the problem in a way which is amenable to the upstream maintainers.
>
> I know a few off-list people that would like a more user-friendly Arm
> perf experience, but we're not qualified to solve the larger problem
> that perf has had since its inception, and I think it's a little unfair
> for you to suggest that, esp. given you're one of the maintainers, and
> there's Linus on top of you.
>
>> Then you won't have to worry about fixing it personally. You could even
>> propose a topic for plumbers if there is enough interest in a general
>> framework for propagating error messages to userspace.
>
> I'm not worried about fixing it personally or not. I'm worried about
> our perf users' experience given your objection to using the vehicle
> already in use on other architectures and PMU drivers.
>
> If you - or anyone else for that matter - know of a way that'll get us
> past this, by all means, say something.
>
> If it helps, I recently asked acme if we could borrow bits from struct
> perf_event and got no response.
>
>>>>> As you are already aware, I've personally tried to fix this problem -
>>>>> that has existed since before the introduction of the perf tool (I
>>>>> consider it a syscall-independent enhanced error interface), multiple
>>>>> times, and failed.
>>>>
>>>> Why is that my problem? Try harder?
>>>
>>> It's your problem because we're here reviewing a patch that happens to
>>> fall under your maintainership. I'll be the first person to tell you
>>> I'm obviously incompetent and haven't been able to come up with a
>>> solution that is acceptable for everyone up to and including Linus
>>> Torvalds. I'm just noticing a chronic usability problem that can be
>>> easily alleviated in the context of this patch review.
>>
>> I don't see how incompetence has anything to do with it and, like most
>
> Well you told me to try harder, and I'm trying to let you know that I
> can try harder - ad infinitum - and still not get it, so telling me to
> try harder isn't going to necessarily resolve the issue.
>
>> people (lucky gits...), I doubt Torvalds cares too deeply about PMU driver
>> internals. No arguments on the usability problem, but this ain't the fix
>
> Like acme, I doubt Linus runs perf on real Arm h/w, so yes, I don't
> think they care that much, but if they did try to do it one day, I'd
> like to that they'd be able to have dmesg contain that extra error
> information for them.
>
> As you know, David Howells wrote a supplemental error syscall facility
> [1], that Linus nacked (off list no less). Does Linus care about what
> David Howells was developing it for? I don't know, but he does have
> enough experience to know whether a certain approach to a problem is
> sustainable in his kernel or not.
>
>> you're looking for: it's a bodge. We've been over this many times before.
>
> I think it's OK - necessary even - until this error infrastructure
> problem perf has is fixed, and I think you're being unrealistic if you
> think it will be fixed anytime soon.
>
>>>>> So until someone comes up with a solution that works for everyone
>>>>> up to and including Linus Torvalds (who hasn't put up a problem
>>>>> pulling PMU drivers emitting things to dmesg so far, by the way), this
>>>>> keep PMU drivers' errors silent preference of yours is unnecessarily
>>>>> impeding people trying to measure system performance on Arm based
>>>>> machines - all other archs' maintainers are fine with PMU drivers using
>>>>> dmesg.
>>>>
>>>> Good for them, although I'm pretty sure that at least the x86 folks are
>>>> against this crap too.
>>>
>>> Unfortunately, it doesn't affect them nearly as much as it does our
>>> more diverse platforms, which is why I don't think they care to do
>>> much about it.
>>
>> x86 has its fair share of PMUs too, so I don't think we're special in this
>> regard. The main difference seems to be that they have better support in
>> the perf tool for diagnosing problems.
>
> Indeed, their software is as vertically integrated as their hardware,
> and, well, they have more people with more experience working and caring
> about perf running and being more usable.

I note that uncore_pmu_event_init() and snb_uncore_imc_event_init(), to
pick a couple of Intel examples, contain a lot of the same logic to the
ARM system PMU drivers you repeatedly request should be more verbose.
Now, surely either perf tool already has some magic which somehow makes
those routines "user-friendly", in which case there seems no reason that
at least any generic event attribute stuff couldn't be applied to ARM
PMUs as well; or if not then you could easily send a patch adding
"helpful" printk()s to those drivers too and actively solicit the
maintainer feedback you lament not having.

>>>>>> Anyway, I think this driver has bigger problems that need addressing.
>>>>>
>>>>> To me it represents yet another PMU driver submission - as the years go
>>>>> by - that is lacking in the user messaging area. Which reminds me, can
>>>>> you take another look at applying this?:
>>>>
>>>> As I said before, I'm not going to take anything that logs above pr_debug
>>>> for things that are directly triggerable from userspace. Spin a version
>>>
>>> Why? There are plenty of things that emit stuff into dmesg that are
>>> directly triggerable from userspace. Is it because it upsets fuzzing
>>> tests? How about those be run with a patched kernel that somehow
>>> mitigates the printing?
>>
>> [we've been over this before]
>> There are two camps that might find this information useful:
>>
>> 1. People writing userspace support for a new PMU using the
>> perf_event_open syscall
>>
>> 2. People trying to use a tool which abstracts the PMU details to some
>> degree (e.g. perf tool)
>>
>> Those in (1) can get by with pr_debug. Sure, they have to recompile, but
>> it's not like there are many people in this camp and they'll probably be
>> working with the PMU driver writer to some degree anyway and taking new
>> kernel drops.
>
> No, please, that's a worse user experience than necessary.
>
> FWIW, I see this type of person as e.g., a JIT developer - a
> compiler/toolchain person - who has decided to use h/w monitoring to
> help the JIT engine find hotspots. The first place they should start
> is to start using the perf tool itself, experimenting with events and
> PMUs of interest. When satisfied, they should clone the parameters
> the tool makes to the perf_event_open syscall in their JIT engine. We
> should not be wasting their time forcing them to hack around in the
> kernel trying to debug perf driver semantics.
>
>> Those in (2) may not have access to dmesg, absolutely should not be able
>
> I don't think you're being realistic here: we're talking about people
> trying to improve performance: They would likely have kptr_restrict ==
> 0 (unrestricted), so why would they have dmesg_restrict set? Btw, the
> rationale for dmesg_restrict was kernel pointer visibility [2], but now
> %Kp hashes addresses before printing them, so it's even less likely to
> be used. Curious, where have you seen dmesg_restrict being set?
>
>> to spam it (since this could hide other important messages), will probably
>
> How will it hide other important messages? Run a logging daemon?
> We're not talking about a lot of messages here: one line per perf
> invocation: see e.g. runs in [2].

Assuming you meant [3], that's a pretty trivial case which doesn't
really illustrate the whole scope of the problem.

Just for you, I added prints in 4 places where event_init validation
returns -EINVAL in my not-yet-upstream SMMUv2 PMU driver. Let's capture
a few events on my Juno board to get some arbitrary cross-sections of
system-wide I/O activity:

[root@space-channel-5 ~]# perf stat -e arm_smmu/cycles/ -e
arm_smmu/access,tcefcfg=1/ -e arm_smmu/access,tcefcfg=1,smr_id=1/ -e
arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/ -e
arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/ find /usr > /dev/null

Performance counter stats for 'system wide':

729128494 arm_smmu/cycles/
(12.08%)
841301489 arm_smmu/cycles/
(28.10%)
834458758 arm_smmu/cycles/
(23.40%)
774373529 arm_smmu/cycles/
(15.69%)
1028501665 arm_smmu/cycles/
(13.84%)
5401 arm_smmu/access,tcefcfg=1/
(27.11%)
0 arm_smmu/access,tcefcfg=1/
(14.73%)
0 arm_smmu/access,tcefcfg=1/
(23.27%)
0 arm_smmu/access,tcefcfg=1/
(20.54%)
220788 arm_smmu/access,tcefcfg=1/
(17.37%)
0 arm_smmu/access,tcefcfg=1,smr_id=1/
(28.18%)
0 arm_smmu/access,tcefcfg=1,smr_id=1/
(24.10%)
0 arm_smmu/access,tcefcfg=1,smr_id=1/
(12.52%)
0 arm_smmu/access,tcefcfg=1,smr_id=1/
(16.17%)
0 arm_smmu/access,tcefcfg=1,smr_id=1/
(19.71%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
(19.68%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
(23.69%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
(29.15%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
(29.19%)
12092 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
(27.70%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
(16.97%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
(13.36%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
(16.01%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
(24.70%)
0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
(28.39%)

2.315150449 seconds time elapsed


Wonderful, it worked OK. And yet in that brief time it spammed
*sixty-one* lines to the kernel log:

[ 2096.220934] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.226893] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.232852] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.238798] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.244744] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.250690] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.256641] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.262594] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.268546] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.274492] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.280437] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.286381] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.292326] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.298271] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.304215] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.310164] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.316109] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.322053] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.327997] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.333941] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.339887] event_source arm_smmu_3: Incompatible SMR value
[ 2096.345408] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.351361] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.357313] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.363263] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.369214] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.375171] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.381124] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.387078] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.393029] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.398984] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.404936] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.410890] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.416840] event_source arm_smmu_4: Incompatible context bank index
[ 2096.423130] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.429075] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.435019] event_source arm_smmu_1: Incompatible context bank index
[ 2096.441308] event_source arm_smmu_3: Incompatible context bank index
[ 2096.447611] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.453557] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.459501] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.465445] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.471390] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.477342] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.483286] event_source arm_smmu_2: Incompatible SMR value
[ 2096.488800] event_source arm_smmu_0: Incompatible SMR value
[ 2096.494316] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.500319] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.571799] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.577749] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.583705] event_source arm_smmu_4: Incompatible SMR value
[ 2096.589221] event_source arm_smmu_2: Incompatible SMR value
[ 2096.594736] event_source arm_smmu_0: Incompatible SMR value
[ 2096.600256] event_source arm_smmu_1: Incompatible SMR value
[ 2096.605770] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.611718] event_source arm_smmu_4: Incompatible filtering mode
[ 2096.617663] event_source arm_smmu_2: Incompatible filtering mode
[ 2096.623607] event_source arm_smmu_0: Incompatible filtering mode
[ 2096.629550] event_source arm_smmu_1: Incompatible filtering mode
[ 2096.635494] event_source arm_smmu_3: Incompatible filtering mode
[ 2096.641443] event_source arm_smmu_4: Incompatible context bank index

(and in case it's not clear, that output is *continual* during the perf
run; empirically, the runtime only needs to reach about 6 seconds or so
for the dmesg ringbuffer to wrap so I have to fall back to journalctl to
see what I missed)

Now how about we try counting those exact same events a slightly
different way:

[root@space-channel-5 ~]# perf stat -e
\{arm_smmu/cycles/,arm_smmu/access,tcefcfg=1/,arm_smmu/access,tcefcfg=1,smr_id=1/,arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/,arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/\}
find /usr > /dev/null

Performance counter stats for 'system wide':

<not counted> arm_smmu/cycles/

<not supported> arm_smmu/cycles/

<not supported> arm_smmu/cycles/

<not supported> arm_smmu/cycles/

<not supported> arm_smmu/cycles/

<not supported> arm_smmu/access,tcefcfg=1/

<not supported> arm_smmu/access,tcefcfg=1/

<not supported> arm_smmu/access,tcefcfg=1/

<not supported> arm_smmu/access,tcefcfg=1/

<not supported> arm_smmu/access,tcefcfg=1/

<not supported> arm_smmu/access,tcefcfg=1,smr_id=1/

<not supported> arm_smmu/access,tcefcfg=1,smr_id=1/

<not supported> arm_smmu/access,tcefcfg=1,smr_id=1/

<not supported> arm_smmu/access,tcefcfg=1,smr_id=1/

<not supported> arm_smmu/access,tcefcfg=1,smr_id=1/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/

<not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/


0.721310193 seconds time elapsed


Oh, that didn't work. Let's check dmesg to see why:

[ 2416.494951] event_source arm_smmu_3: Incompatible filtering mode
[ 2416.500933] event_source arm_smmu_3: Incompatible filtering mode
[ 2416.506958] event_source arm_smmu_3: Incompatible filtering mode
[ 2416.513276] event_source arm_smmu_3: Incompatible filtering mode


Well, it said the same thing continually when it worked correctly, so I
guess it must have failed for some other reason. As the user I'm now not
only confused but unknowingly heading towards the *wrong* conclusion
thanks to those "helpful" messages. And that's not even contemplating
the "what if multiple programs did this at the same time" case.

>> struggle to match an internal message from the PMU driver to their
>> invocation of the high-level tool and may well be in a multi-user
>> environment so will struggle to identify the messages that they are
>> responsible for.
>
> Again, I think you're being unrealistic. My experience - in
> multiple performance analysis roles - was that I was always the only
> person using the machine at the time, and can easily correlate perf
> invocations with dmesg output, particularly if I run 'dmesg -w &' prior
> to invoking perf. In one of those roles, having to mess with the
> kernel was almost inconceivable - it literally 'just worked' for
> everything else. I was lucky that I had prior kernel experience in
> order to debug the driver and be able find out how to invoke perf.
>
>> What they actually want is for the perf tool to give
>> them more information, and dmesg is not the right channel for that, for
>> similar reasons.
>
> We'd all like the perf tool to do things better, and dmesg is all we
> have for now, and now I'm sounding like I did 1 1/2 years ago.
>
>>>> using pr_debug and I'll queue it.
>>>
>>> How about using a ratelimited dev_err variant?
>>
>> Nah, I don't want dmesg being parsed by users of perf tool. pr_debug should
>> be sufficient for perf tool developers working with a new PMU type.
>
> OK the question was to the ratelimited part, but I think that even
> might be the default these days [4].
>
> For reasons (different perceptions?) already mentioned above, I also
> don't agree that this has to do with just perf tool developers, and
> only when working with a new PMU type. My main concern is new/future
> users to perf on Arm, working with any PMU type - new or old. If there
> is a perf tool-side component to this driver, I don't see it. Having
> said that, I think the owners of this and other PMU drivers should have
> a say in what type of experience they want for their users in this
> regard...is that fair?

As I've said before, a significant portion of what you want to improve
involves generic perf_event properties which would already be much
better validated in the core than by every driver individually, so the
first straightforward step would be to improve the interface between
drivers and perf core, such that the "hard" problem reduces to
communicating from a single part of perf core to userspace. As it is,
PMU drivers alone simply don't have enough context to know when an
event_init "failure" is actually significant to the user or not, thus if
they display user-visible messages directly then half the time the user
will be thoroughly misled unless they are intimately familiar with how
perf core grouping and event rotation work (see above), yet required
knowledge of perf internals is exactly what you're arguing against!

Robin.

>
> Thanks,
>
> Kim
>
> [1] https://patchwork.kernel.org/patch/9907463/
> [2] https://lwn.net/Articles/414813/
> [3] https://lkml.org/lkml/2017/11/21/385
> [4] https://lwn.net/Articles/693010/
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

2018-05-04 18:48:25

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Mark,

On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
> Hi,
>
> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>> +
>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>> + * Each Channel supports UNCORE PMU device and consists of
>> + * 4 independent programmable counters. Counters are 32 bit
>> + * and does not support overflow interrupt, they needs to be
>> + * sampled before overflow(i.e, at every 2 seconds).
>> + */
>> +
>> +#define UNCORE_MAX_COUNTERS 4
>> +#define UNCORE_L3_MAX_TILES 16
>> +#define UNCORE_DMC_MAX_CHANNELS 8
>> +
>> +#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)
>
> How was a period of two seconds chosen?

It has been suggested from hw team to sample before 3 to 4 seconds.

>
> What's the maximum clock speed for the L3C and DMC?

L3C at 1.5GHz and DMC at 1.2GHz
>
> Given that all channels compete for access to the muxed register
> interface, I suspect we need to try more often than once every 2
> seconds...

2 seconds seems to be sufficient. So far testing looks good.

>
> [...]
>
>> +struct active_timer {
>> + struct perf_event *event;
>> + struct hrtimer hrtimer;
>> +};
>> +
>> +/*
>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>> + * each uncore device has up to 16 channels, each channel can sample
>> + * events independently with counters up to 4.
>> + *
>> + * struct thunderx2_pmu_uncore_channel created per channel.
>> + * struct thunderx2_pmu_uncore_dev per uncore device.
>> + */
>> +struct thunderx2_pmu_uncore_channel {
>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>> + struct pmu pmu;
>
> Can we put the pmu first in the struct, please?

ok
>
>> + int counter;
>
> AFAICT, this counter field is never used.

hmm ok, will remove.
>
>> + int channel;
>> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
>> + struct active_timer *active_timers;
>
> You should only need a single timer per channel, rather than one per
> event.
>
> I think you can get rid of the active_timer structure, and have:
>
> struct perf_event *events[UNCORE_MAX_COUNTERS];
> struct hrtimer timer;
>

thanks, will change as suggested.

>> + /* to sync counter alloc/release */
>> + raw_spinlock_t lock;
>> +};
>> +
>> +struct thunderx2_pmu_uncore_dev {
>> + char *name;
>> + struct device *dev;
>> + enum thunderx2_uncore_type type;
>> + unsigned long base;
>
> This should be:
>
> void __iomem *base;

ok
>
> [...]
>
>> +static ssize_t cpumask_show(struct device *dev,
>> + struct device_attribute *attr, char *buf)
>> +{
>> + struct cpumask cpu_mask;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>> +
>> + /* Pick first online cpu from the node */
>> + cpumask_clear(&cpu_mask);
>> + cpumask_set_cpu(cpumask_first(
>> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
>> + &cpu_mask);
>> +
>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>> +}
>
> AFAICT cpumask_of_node() returns a mask that can include offline CPUs.
>
> Regardless, I don't think that you can keep track of the management CPU
> this way. Please keep track of the CPU the PMU should be managed by,
> either with a cpumask or int embedded within the PMU structure.

thanks, will add hotplug callbacks.
>
> At hotplug time, you'll need to update the management CPU, calling
> perf_pmu_migrate_context() when it is offlined.

ok
>
> [...]
>
>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> +{
>> + int counter;
>> +
>> + raw_spin_lock(&pmu_uncore->lock);
>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>> + pmu_uncore->uncore_dev->max_counters);
>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>> + raw_spin_unlock(&pmu_uncore->lock);
>> + return -ENOSPC;
>> + }
>> + set_bit(counter, pmu_uncore->counter_mask);
>> + raw_spin_unlock(&pmu_uncore->lock);
>> + return counter;
>> +}
>> +
>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>> + int counter)
>> +{
>> + raw_spin_lock(&pmu_uncore->lock);
>> + clear_bit(counter, pmu_uncore->counter_mask);
>> + raw_spin_unlock(&pmu_uncore->lock);
>> +}
>
> I don't believe that locking is required in either of these, as the perf
> core serializes pmu::add() and pmu::del(), where these get called.

thanks, it seems to be not required.
>
>> +
>> +/*
>> + * DMC and L3 counter interface is muxed across all channels.
>> + * hence we need to select the channel before accessing counter
>> + * data/control registers.
>
> Are there separate interfaces for all-dmc-channels and all-l3c-channels?

separate interface for DMC and L3c.
channels within DMC/L3C are muxed.

>
> ... or is a single interface used by all-dmc-and-l3c-channels?
>
>> + *
>> + * L3 Tile and DMC channel selection is through SMC call
>> + * SMC call arguments,
>> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>> + * x2 = Node id
>
> How do we map Linux node IDs to the firmware's view of node IDs?
>
> I don't believe the two are necessarily the same -- Linux's node IDs are
> a Linux-specific construct.

both are same, it is numa node id from ACPI/firmware.

>
> It would be much nicer if we could pass something based on the MPIDR,
> which is a known HW construct, or if this implicitly affected the
> current node.

IMO, node id is sufficient.

>
> It would be vastly more sane for this to not be muxed at all. :/

i am helpless due to crappy hw design!

>
>> + * x3 = DMC(1)/L3C(0)
>> + * x4 = channel Id
>> + */
>> +static void uncore_select_channel(struct perf_event *event)
>> +{
>> + struct arm_smccc_res res;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>> + pmu_to_thunderx2_pmu_uncore(event->pmu);
>> +
>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>> + pmu_uncore->uncore_dev->node,
>> + pmu_uncore->uncore_dev->type,
>> + pmu_uncore->channel, 0, 0, 0, &res);
>> +
>> +}
>
> Why aren't we checking the return value of the SMC call?

thanks, i will add.

>
> The muxing and SMC sound quite scary. :/

i too agree!, however i have no other option since the muxing register
is a secure register.
>
>> +
>> +static void uncore_start_event_l3c(struct perf_event *event, int flags)
>> +{
>> + u32 val;
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + /* event id encoded in bits [07:03] */
>> + val = GET_EVENTID(event) << 3;
>> + reg_writel(val, hwc->config_base);
>> +
>> + if (flags & PERF_EF_RELOAD) {
>> + u64 prev_raw_count =
>> + local64_read(&event->hw.prev_count);
>> + reg_writel(prev_raw_count, hwc->event_base);
>> + }
>> + local64_set(&event->hw.prev_count,
>> + reg_readl(hwc->event_base));
>
> It would be simpler to ignore PERF_EF_RELOAD, and always reprogram the
> prev_count and HW to zero.

ok, will change
>
>> +}
>> +
>> +static void uncore_start_event_dmc(struct perf_event *event, int flags)
>> +{
>> + u32 val, event_shift = 8;
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + /* enable and start counters and read current value in prev_count */
>> + val = reg_readl(hwc->config_base);
>> +
>> + /* 8 bits for each counter,
>> + * bits [05:01] of a counter to set event type.
>> + */
>> + reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) *
>> + event_shift) + 1))) |
>> + (GET_EVENTID(event) <<
>> + (((GET_COUNTERID(event)) * event_shift) + 1)),
>> + hwc->config_base);
>
> This would be far more legible if val were constructed before the
> reg_writel(), especially if there were a helper for the event field
> shifting, e.g.
>
> #define DMC_EVENT_CFG(idx, val) ((val) << (((idx) * 8) + 1))
>
> int id = GET_COUNTERID(event);
> int event = GET_EVENTID(event);
>
> val = reg_readl(hwc->config_base);
> val &= ~DMC_EVENT_CFG(id, 0x1f);
> val |= DMCC_EVENT_CFG(id, event);
> reg_writel(val, hwc->config_base));
>
> What are bits 7:6 and 1 used for?

not used/reserved bits.

>
>> +
>> + if (flags & PERF_EF_RELOAD) {
>> + u64 prev_raw_count =
>> + local64_read(&event->hw.prev_count);
>> + reg_writel(prev_raw_count, hwc->event_base);
>> + }
>> + local64_set(&event->hw.prev_count,
>> + reg_readl(hwc->event_base));
>
>
> As with the L3C events, it would be simpler to always reprogram the
> prev_count and HW to zero.

ok
>
>> +}
>> +
>> +static void uncore_stop_event_l3c(struct perf_event *event)
>> +{
>> + reg_writel(0, event->hw.config_base);
>> +}
>> +
>> +static void uncore_stop_event_dmc(struct perf_event *event)
>> +{
>> + u32 val, event_shift = 8;
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + val = reg_readl(hwc->config_base);
>> + reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))),
>> + hwc->config_base);
>
>
> This could be simplified with the helper proposed above.

ok
>
>> +}
>> +
>> +static void init_cntr_base_l3c(struct perf_event *event,
>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>> +{
>> +
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + /* counter ctrl/data reg offset at 8 */
>
> Offset 8, or stride 8?

stride 8
>
> What does the register layout look like?
>
>> + hwc->config_base = uncore_dev->base
>> + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
>> + hwc->event_base = uncore_dev->base
>> + + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event));
>> +}
>> +
>> +static void init_cntr_base_dmc(struct perf_event *event,
>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>> +{
>> +
>> + struct hw_perf_event *hwc = &event->hw;
>> +
>> + hwc->config_base = uncore_dev->base
>> + + DMC_COUNTER_CTL;
>> + /* counter data reg offset at 0xc */
>
> A stride of 0xc seems unusual.
>
> What does the register layout look like?

the offsets are at 0x640, 0x64c, 0x658, 0x664,
>
>> + hwc->event_base = uncore_dev->base
>> + + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event));
>> +}
>> +
>> +static void thunderx2_uncore_update(struct perf_event *event)
>> +{
>> + s64 prev, new = 0;
>> + u64 delta;
>> + struct hw_perf_event *hwc = &event->hw;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>> + enum thunderx2_uncore_type type;
>> +
>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>> + type = pmu_uncore->uncore_dev->type;
>
> AFAICT this variable is not used below.

thanks.
>
>> +
>> + if (pmu_uncore->uncore_dev->select_channel)
>> + pmu_uncore->uncore_dev->select_channel(event);
>
> This should always be non-NULL, right?

yes, unwanted check at the moment, will remove.

>
> [...]
>
>> +static bool thunderx2_uncore_validate_event_group(struct perf_event *event)
>> +{
>> + struct pmu *pmu = event->pmu;
>> + struct perf_event *leader = event->group_leader;
>> + struct perf_event *sibling;
>> + int counters = 0;
>> +
>> + if (leader->pmu != event->pmu && !is_software_event(leader))
>> + return false;
>> +
>> + counters++;
>
> I don't think this is right when event != leader and the leader is a SW
> event. In that case, the leader doesn't take a HW counter.

I think this check to avoid the grouping of multiple hw PMUs ,
however it is allowed to group sw events along with hw PMUs.

>
>> +
>> + for_each_sibling_event(sibling, event->group_leader) {
>> + if (is_software_event(sibling))
>> + continue;
>> + if (sibling->pmu != pmu)
>> + return false;
>> + counters++;
>> + }
>> +
>> + /*
>> + * If the group requires more counters than the HW has,
>> + * it cannot ever be scheduled.
>> + */
>> + return counters <= UNCORE_MAX_COUNTERS;
>> +}
>
> [...]
>
>> +static int thunderx2_uncore_event_init(struct perf_event *event)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>
>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>> +
>> + if (!pmu_uncore)
>> + return -ENODEV;
>
> This cannot happen, given pmu_to_thunderx2_pmu_uncore() is a wrapper
> around container_of().

thanks, unnecessary check!
>
>> +
>> + /* Pick first online cpu from the node */
>> + event->cpu = cpumask_first(
>> + cpumask_of_node(pmu_uncore->uncore_dev->node));
>
> I don't believe this is safe. You must keep track of which CPU is
> managing the PMU, with hotplug callbacks.

ok, will add callbacks.
>
> [...]
>
>> +static void thunderx2_uncore_start(struct perf_event *event, int flags)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>> + unsigned long irqflags;
>> + struct active_timer *active_timer;
>> +
>> + hwc->state = 0;
>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>> + uncore_dev = pmu_uncore->uncore_dev;
>> +
>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>> +
>> + if (uncore_dev->select_channel)
>> + uncore_dev->select_channel(event);
>> + uncore_dev->start_event(event, flags);
>> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
>> +
>> + perf_event_update_userpage(event);
>> + active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)];
>> + active_timer->event = event;
>> +
>> + hrtimer_start(&active_timer->hrtimer,
>> + ns_to_ktime(uncore_dev->hrtimer_interval),
>> + HRTIMER_MODE_REL_PINNED);
>> +}
>
> Please use a single hrtimer, and update *all* of the events when it
> fires.

sure
>
> I *think* that can be done in the pmu::pmu_enable() and
> pmu::pmu_disable() callbacks.

ok
>
> Are there control bits to enable/disable all counters, or can that only
> be done through the event configuration registers?

only through config register.
>
>> +static void thunderx2_uncore_stop(struct perf_event *event, int flags)
>> +{
>> + struct hw_perf_event *hwc = &event->hw;
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>> + unsigned long irqflags;
>> +
>> + if (hwc->state & PERF_HES_UPTODATE)
>> + return;
>> +
>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>> + uncore_dev = pmu_uncore->uncore_dev;
>> +
>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>> +
>> + if (uncore_dev->select_channel)
>> + uncore_dev->select_channel(event);
>
> AFAICT this cannot be NULL.

ok.
>
> [...]
>
>> +static int thunderx2_pmu_uncore_register(
>> + struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> +{
>> + struct device *dev = pmu_uncore->uncore_dev->dev;
>> + char *name = pmu_uncore->uncore_dev->name;
>> + int channel = pmu_uncore->channel;
>> +
>> + /* Perf event registration */
>> + pmu_uncore->pmu = (struct pmu) {
>> + .attr_groups = pmu_uncore->uncore_dev->attr_groups,
>> + .task_ctx_nr = perf_invalid_context,
>> + .event_init = thunderx2_uncore_event_init,
>> + .add = thunderx2_uncore_add,
>> + .del = thunderx2_uncore_del,
>> + .start = thunderx2_uncore_start,
>> + .stop = thunderx2_uncore_stop,
>> + .read = thunderx2_uncore_read,
>> + };
>> +
>> + pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL,
>> + "%s_%d", name, channel);
>
> Does the channel idx take the NUMA node into account?

name already has node id suffix.
>
> [...]
>
>> +static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev,
>> + int channel)
>> +{
>
>> + /* we can run up to (max_counters * max_channels) events
>> + * simultaneously, allocate hrtimers per channel.
>> + */
>> + pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev,
>> + sizeof(struct active_timer) * uncore_dev->max_counters,
>> + GFP_KERNEL);
>
> Please just fold a single hrtimer into the thunderx2_pmu_uncore_channel
> structure, and you can get rid of this allocation...

sure, i will rewrite code to have timer per channel and all active
counters are updated when timer fires.

>
>> +
>> + for (counter = 0; counter < uncore_dev->max_counters; counter++) {
>> + hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer,
>> + CLOCK_MONOTONIC,
>> + HRTIMER_MODE_REL);
>> + pmu_uncore->active_timers[counter].hrtimer.function =
>> + thunderx2_uncore_hrtimer_callback;
>> + }
>
> ... and simplify this initialization.

ok
>
> [...]
>
>> +static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev(
>> + struct device *dev, acpi_handle handle,
>> + struct acpi_device *adev, u32 type)
>> +{
>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>> + unsigned long base;
>
>> + base = (unsigned long)devm_ioremap_resource(dev, &res);
>> + if (IS_ERR((void *)base)) {
>> + dev_err(dev, "PMU type %d: Fail to map resource\n", type);
>> + return NULL;
>> + }
>
> Please treat this as a void __iomem *base.

ok
>
> [...]
>
>> +static int thunderx2_uncore_probe(struct platform_device *pdev)
>> +{
>> + struct device *dev = &pdev->dev;
>> + struct arm_smccc_res res;
>> +
>> + set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev)));
>> +
>> + /* Make sure firmware supports DMC/L3C set channel smc call */
>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>> + dev_to_node(dev), 0, 0, 0, 0, 0, &res);
>> + if (res.a0) {
>> + dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n",
>> + dev_to_node(dev));
>> + return -ENODEV;
>> + }
>
> Please re-use the uncore_select_channel() wrapper rather than
> open-coding this.

ok.
>
> Which FW supports this interface?

it is through vendor specific ATF runtime services.
>
> Thanks,
> Mark.

thanks,
Ganapat

2018-05-10 01:10:28

by Kim Phillips

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Fri, 4 May 2018 18:10:44 +0100
Robin Murphy <[email protected]> wrote:

> Hi Kim,

Hi Robin,

> On 04/05/18 01:30, Kim Phillips wrote:
> > On Tue, 1 May 2018 12:54:05 +0100
> > Will Deacon <[email protected]> wrote:
> >> On Fri, Apr 27, 2018 at 11:56:25AM -0500, Kim Phillips wrote:
> >>> On Fri, 27 Apr 2018 17:09:14 +0100
> >>> Will Deacon <[email protected]> wrote:
> >>>> On Fri, Apr 27, 2018 at 10:46:29AM -0500, Kim Phillips wrote:
> >>>>> On Fri, 27 Apr 2018 15:37:20 +0100
> >>>>> Will Deacon <[email protected]> wrote:
> >>>>>> For anything under drivers/perf/, I'd prefer not to have these prints
> >>>>>> and instead see efforts to improve error reporting via the perf system
> >>>>>> call interface.
> >>>>>
> >>>>> We'd all prefer that, and for all PMU drivers, why should ones under
> >>>>> drivers/perf be treated differently?
> >>>>
> >>>> Because they're the ones I maintain...
> >>>
> >>> You represent a minority on your opinion on this matter though.
> >>
> >> I'm not sure that's true
> >
> > I haven't seen another arch/PMU driver maintainer actively oppose it
> > other than you (and Mark).
> >
> >> -- you tend to be the only one raising this issue;
> >
> > If you're basing that on list activity, I wouldn't, since most Arm perf
> > users don't subscribe to LKML. I'm trying to represent current and
> > future Arm perf users that shouldn't be expected to hang out here on
> > LKML and review PMU drivers. Couldn't tell you why PMU driver authors
> > themselves don't chime in, but will note that some drivers in
> > drivers/perf do print things from their event_init functions.
>
> As a PMU driver author who *has* already invested not-inconsiderable
> time and effort in trying to explain to you the technical issues from a
> non-maintainer perspective (and is about to do so again), I can't help
> but feel rather slighted by that comment.

Very sorry! You're quite right. I believe I had considered you as a
kernel developer first, and was trying to keep LKML activity affined to
LKML, vs. other activity on other mailing lists.

> > Also, you have been cc'd on off-list email threads where SPE users had
> > to debug the SPE driver's event_init() manually. There are other SPE
> > users I know that have had to go through the same painstaking process of
> > debugging the SPE driver. Last but not least, there's me, and I'd sure
> > appreciate more verbose PMU drivers, based on my real-life performance
> > monitoring experience(s).
> >
> >> I think most people don't actually care one way or the other. If you know of
> >
> > Like I said, I think you're limiting your audience to LKML subscribers,
> > who are more amenable to being convinced they need to debug their
> > kernel.
> >
> >> others who care about it too then I suggest you work together as a group to
> >> solve the problem in a way which is amenable to the upstream maintainers.
> >
> > I know a few off-list people that would like a more user-friendly Arm
> > perf experience, but we're not qualified to solve the larger problem
> > that perf has had since its inception, and I think it's a little unfair
> > for you to suggest that, esp. given you're one of the maintainers, and
> > there's Linus on top of you.
> >
> >> Then you won't have to worry about fixing it personally. You could even
> >> propose a topic for plumbers if there is enough interest in a general
> >> framework for propagating error messages to userspace.
> >
> > I'm not worried about fixing it personally or not. I'm worried about
> > our perf users' experience given your objection to using the vehicle
> > already in use on other architectures and PMU drivers.
> >
> > If you - or anyone else for that matter - know of a way that'll get us
> > past this, by all means, say something.
> >
> > If it helps, I recently asked acme if we could borrow bits from struct
> > perf_event and got no response.
> >
> >>>>> As you are already aware, I've personally tried to fix this problem -
> >>>>> that has existed since before the introduction of the perf tool (I
> >>>>> consider it a syscall-independent enhanced error interface), multiple
> >>>>> times, and failed.
> >>>>
> >>>> Why is that my problem? Try harder?
> >>>
> >>> It's your problem because we're here reviewing a patch that happens to
> >>> fall under your maintainership. I'll be the first person to tell you
> >>> I'm obviously incompetent and haven't been able to come up with a
> >>> solution that is acceptable for everyone up to and including Linus
> >>> Torvalds. I'm just noticing a chronic usability problem that can be
> >>> easily alleviated in the context of this patch review.
> >>
> >> I don't see how incompetence has anything to do with it and, like most
> >
> > Well you told me to try harder, and I'm trying to let you know that I
> > can try harder - ad infinitum - and still not get it, so telling me to
> > try harder isn't going to necessarily resolve the issue.
> >
> >> people (lucky gits...), I doubt Torvalds cares too deeply about PMU driver
> >> internals. No arguments on the usability problem, but this ain't the fix
> >
> > Like acme, I doubt Linus runs perf on real Arm h/w, so yes, I don't
> > think they care that much, but if they did try to do it one day, I'd
> > like to that they'd be able to have dmesg contain that extra error
> > information for them.
> >
> > As you know, David Howells wrote a supplemental error syscall facility
> > [1], that Linus nacked (off list no less). Does Linus care about what
> > David Howells was developing it for? I don't know, but he does have
> > enough experience to know whether a certain approach to a problem is
> > sustainable in his kernel or not.
> >
> >> you're looking for: it's a bodge. We've been over this many times before.
> >
> > I think it's OK - necessary even - until this error infrastructure
> > problem perf has is fixed, and I think you're being unrealistic if you
> > think it will be fixed anytime soon.
> >
> >>>>> So until someone comes up with a solution that works for everyone
> >>>>> up to and including Linus Torvalds (who hasn't put up a problem
> >>>>> pulling PMU drivers emitting things to dmesg so far, by the way), this
> >>>>> keep PMU drivers' errors silent preference of yours is unnecessarily
> >>>>> impeding people trying to measure system performance on Arm based
> >>>>> machines - all other archs' maintainers are fine with PMU drivers using
> >>>>> dmesg.
> >>>>
> >>>> Good for them, although I'm pretty sure that at least the x86 folks are
> >>>> against this crap too.
> >>>
> >>> Unfortunately, it doesn't affect them nearly as much as it does our
> >>> more diverse platforms, which is why I don't think they care to do
> >>> much about it.
> >>
> >> x86 has its fair share of PMUs too, so I don't think we're special in this
> >> regard. The main difference seems to be that they have better support in
> >> the perf tool for diagnosing problems.
> >
> > Indeed, their software is as vertically integrated as their hardware,
> > and, well, they have more people with more experience working and caring
> > about perf running and being more usable.
>
> I note that uncore_pmu_event_init() and snb_uncore_imc_event_init(), to
> pick a couple of Intel examples, contain a lot of the same logic to the
> ARM system PMU drivers you repeatedly request should be more verbose.
> Now, surely either perf tool already has some magic which somehow makes
> those routines "user-friendly", in which case there seems no reason that
> at least any generic event attribute stuff couldn't be applied to ARM
> PMUs as well; or if not then you could easily send a patch adding
> "helpful" printk()s to those drivers too and actively solicit the
> maintainer feedback you lament not having.

Those Intel drivers? IIRC, those drivers have been developed
historically more in sync with the tool. I believe early drivers
started out with four simple error cases- each with its own -Evalue -
and as the infrastructure grew, they still enjoyed more-or-less a
one-to-one correspondence between the -Evalue they returned and what
the tool did in response. Periodically the fall-back (retry) code and
message texts get amended as drivers get added.

In the Arm infrastructure, things are generally more 'disabling' to
the user - they'd rather error out instead of be more accommodating
(bit.LITTLE is a good example of this), so we have people copying
others' very strict event_inits that return the most generic -EINVAL
for most things, without trying to see what can be done to (a) either
continue, going "we knew they meant this," or (b) looking at returning
a different -Evalue, and/or eventually printing something to dmesg (in
the case it's pretty clear that the user will have no idea what
esoteric thing happened).

So whilst what you're saying may be true, I care about Arm more, as I
don't see that big an error messaging problem on Intel right now.

> >>>>>> Anyway, I think this driver has bigger problems that need addressing.
> >>>>>
> >>>>> To me it represents yet another PMU driver submission - as the years go
> >>>>> by - that is lacking in the user messaging area. Which reminds me, can
> >>>>> you take another look at applying this?:
> >>>>
> >>>> As I said before, I'm not going to take anything that logs above pr_debug
> >>>> for things that are directly triggerable from userspace. Spin a version
> >>>
> >>> Why? There are plenty of things that emit stuff into dmesg that are
> >>> directly triggerable from userspace. Is it because it upsets fuzzing
> >>> tests? How about those be run with a patched kernel that somehow
> >>> mitigates the printing?
> >>
> >> [we've been over this before]
> >> There are two camps that might find this information useful:
> >>
> >> 1. People writing userspace support for a new PMU using the
> >> perf_event_open syscall
> >>
> >> 2. People trying to use a tool which abstracts the PMU details to some
> >> degree (e.g. perf tool)
> >>
> >> Those in (1) can get by with pr_debug. Sure, they have to recompile, but
> >> it's not like there are many people in this camp and they'll probably be
> >> working with the PMU driver writer to some degree anyway and taking new
> >> kernel drops.
> >
> > No, please, that's a worse user experience than necessary.
> >
> > FWIW, I see this type of person as e.g., a JIT developer - a
> > compiler/toolchain person - who has decided to use h/w monitoring to
> > help the JIT engine find hotspots. The first place they should start
> > is to start using the perf tool itself, experimenting with events and
> > PMUs of interest. When satisfied, they should clone the parameters
> > the tool makes to the perf_event_open syscall in their JIT engine. We
> > should not be wasting their time forcing them to hack around in the
> > kernel trying to debug perf driver semantics.
> >
> >> Those in (2) may not have access to dmesg, absolutely should not be able
> >
> > I don't think you're being realistic here: we're talking about people
> > trying to improve performance: They would likely have kptr_restrict ==
> > 0 (unrestricted), so why would they have dmesg_restrict set? Btw, the
> > rationale for dmesg_restrict was kernel pointer visibility [2], but now
> > %Kp hashes addresses before printing them, so it's even less likely to
> > be used. Curious, where have you seen dmesg_restrict being set?
> >
> >> to spam it (since this could hide other important messages), will probably
> >
> > How will it hide other important messages? Run a logging daemon?
> > We're not talking about a lot of messages here: one line per perf
> > invocation: see e.g. runs in [2].
>
> Assuming you meant [3], that's a pretty trivial case which doesn't
> really illustrate the whole scope of the problem.

You're right, they're trivial cases that have trivial messages that
enable the trivial user. It's not the whole scope of the problem, but
it's a very good start.

> Just for you, I added prints in 4 places where event_init validation
> returns -EINVAL in my not-yet-upstream SMMUv2 PMU driver. Let's capture
> a few events on my Juno board to get some arbitrary cross-sections of
> system-wide I/O activity:
>
> [root@space-channel-5 ~]# perf stat -e arm_smmu/cycles/ -e
> arm_smmu/access,tcefcfg=1/ -e arm_smmu/access,tcefcfg=1,smr_id=1/ -e
> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/ -e
> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/ find /usr > /dev/null
>
> Performance counter stats for 'system wide':
>
> 729128494 arm_smmu/cycles/
> (12.08%)
> 841301489 arm_smmu/cycles/
> (28.10%)
> 834458758 arm_smmu/cycles/
> (23.40%)
> 774373529 arm_smmu/cycles/
> (15.69%)
> 1028501665 arm_smmu/cycles/
> (13.84%)
> 5401 arm_smmu/access,tcefcfg=1/
> (27.11%)
> 0 arm_smmu/access,tcefcfg=1/
> (14.73%)
> 0 arm_smmu/access,tcefcfg=1/
> (23.27%)
> 0 arm_smmu/access,tcefcfg=1/
> (20.54%)
> 220788 arm_smmu/access,tcefcfg=1/
> (17.37%)
> 0 arm_smmu/access,tcefcfg=1,smr_id=1/
> (28.18%)
> 0 arm_smmu/access,tcefcfg=1,smr_id=1/
> (24.10%)
> 0 arm_smmu/access,tcefcfg=1,smr_id=1/
> (12.52%)
> 0 arm_smmu/access,tcefcfg=1,smr_id=1/
> (16.17%)
> 0 arm_smmu/access,tcefcfg=1,smr_id=1/
> (19.71%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
> (19.68%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
> (23.69%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
> (29.15%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
> (29.19%)
> 12092 arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
> (27.70%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
> (16.97%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
> (13.36%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
> (16.01%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
> (24.70%)
> 0 arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
> (28.39%)
>
> 2.315150449 seconds time elapsed
>
>
> Wonderful, it worked OK. And yet in that brief time it spammed
> *sixty-one* lines to the kernel log:
>
> [ 2096.220934] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.226893] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.232852] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.238798] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.244744] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.250690] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.256641] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.262594] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.268546] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.274492] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.280437] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.286381] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.292326] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.298271] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.304215] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.310164] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.316109] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.322053] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.327997] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.333941] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.339887] event_source arm_smmu_3: Incompatible SMR value
> [ 2096.345408] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.351361] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.357313] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.363263] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.369214] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.375171] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.381124] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.387078] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.393029] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.398984] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.404936] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.410890] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.416840] event_source arm_smmu_4: Incompatible context bank index
> [ 2096.423130] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.429075] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.435019] event_source arm_smmu_1: Incompatible context bank index
> [ 2096.441308] event_source arm_smmu_3: Incompatible context bank index
> [ 2096.447611] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.453557] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.459501] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.465445] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.471390] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.477342] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.483286] event_source arm_smmu_2: Incompatible SMR value
> [ 2096.488800] event_source arm_smmu_0: Incompatible SMR value
> [ 2096.494316] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.500319] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.571799] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.577749] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.583705] event_source arm_smmu_4: Incompatible SMR value
> [ 2096.589221] event_source arm_smmu_2: Incompatible SMR value
> [ 2096.594736] event_source arm_smmu_0: Incompatible SMR value
> [ 2096.600256] event_source arm_smmu_1: Incompatible SMR value
> [ 2096.605770] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.611718] event_source arm_smmu_4: Incompatible filtering mode
> [ 2096.617663] event_source arm_smmu_2: Incompatible filtering mode
> [ 2096.623607] event_source arm_smmu_0: Incompatible filtering mode
> [ 2096.629550] event_source arm_smmu_1: Incompatible filtering mode
> [ 2096.635494] event_source arm_smmu_3: Incompatible filtering mode
> [ 2096.641443] event_source arm_smmu_4: Incompatible context bank index
>
> (and in case it's not clear, that output is *continual* during the perf
> run; empirically, the runtime only needs to reach about 6 seconds or so
> for the dmesg ringbuffer to wrap so I have to fall back to journalctl to
> see what I missed)
>
> Now how about we try counting those exact same events a slightly
> different way:
>
> [root@space-channel-5 ~]# perf stat -e
> \{arm_smmu/cycles/,arm_smmu/access,tcefcfg=1/,arm_smmu/access,tcefcfg=1,smr_id=1/,arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/,arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/\}
> find /usr > /dev/null
>
> Performance counter stats for 'system wide':
>
> <not counted> arm_smmu/cycles/
>
> <not supported> arm_smmu/cycles/
>
> <not supported> arm_smmu/cycles/
>
> <not supported> arm_smmu/cycles/
>
> <not supported> arm_smmu/cycles/
>
> <not supported> arm_smmu/access,tcefcfg=1/
>
> <not supported> arm_smmu/access,tcefcfg=1/
>
> <not supported> arm_smmu/access,tcefcfg=1/
>
> <not supported> arm_smmu/access,tcefcfg=1/
>
> <not supported> arm_smmu/access,tcefcfg=1/
>
> <not supported> arm_smmu/access,tcefcfg=1,smr_id=1/
>
> <not supported> arm_smmu/access,tcefcfg=1,smr_id=1/
>
> <not supported> arm_smmu/access,tcefcfg=1,smr_id=1/
>
> <not supported> arm_smmu/access,tcefcfg=1,smr_id=1/
>
> <not supported> arm_smmu/access,tcefcfg=1,smr_id=1/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=0/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
>
> <not supported> arm_smmu/tlb_alloc,tcefcfg=2,ndx=1/
>
>
> 0.721310193 seconds time elapsed
>
>
> Oh, that didn't work. Let's check dmesg to see why:
>
> [ 2416.494951] event_source arm_smmu_3: Incompatible filtering mode
> [ 2416.500933] event_source arm_smmu_3: Incompatible filtering mode
> [ 2416.506958] event_source arm_smmu_3: Incompatible filtering mode
> [ 2416.513276] event_source arm_smmu_3: Incompatible filtering mode
>
>
> Well, it said the same thing continually when it worked correctly, so I
> guess it must have failed for some other reason. As the user I'm now not
> only confused but unknowingly heading towards the *wrong* conclusion
> thanks to those "helpful" messages. And that's not even contemplating
> the "what if multiple programs did this at the same time" case.

Like I said in other threads, multiple invocations aren't a concern vs.
experimenting with multiple atomic invocations to get the pattern of
events and their parameters down to a valid, successfully executing
(hopefully without dmesg output?) state.

WRT the rest of what you say above, I'd have to check the individual
error condition details, but yes, you're right to think the user's
confused, but - and forgive the pun here - armed with the information
coming from the driver, they are able to adapt and learn to know what's
possible and what's not when combining different event groups.

The volume of messaging doesn't concern me much either:

- it doesn't change with dev_dbg vs. dev_warn
- using a logger to scroll far back isn't generally a problem
- users tend to focus on one event at a time when experimenting before
composing and invoking the final event list.

> >> struggle to match an internal message from the PMU driver to their
> >> invocation of the high-level tool and may well be in a multi-user
> >> environment so will struggle to identify the messages that they are
> >> responsible for.
> >
> > Again, I think you're being unrealistic. My experience - in
> > multiple performance analysis roles - was that I was always the only
> > person using the machine at the time, and can easily correlate perf
> > invocations with dmesg output, particularly if I run 'dmesg -w &' prior
> > to invoking perf. In one of those roles, having to mess with the
> > kernel was almost inconceivable - it literally 'just worked' for
> > everything else. I was lucky that I had prior kernel experience in
> > order to debug the driver and be able find out how to invoke perf.
> >
> >> What they actually want is for the perf tool to give
> >> them more information, and dmesg is not the right channel for that, for
> >> similar reasons.
> >
> > We'd all like the perf tool to do things better, and dmesg is all we
> > have for now, and now I'm sounding like I did 1 1/2 years ago.
> >
> >>>> using pr_debug and I'll queue it.
> >>>
> >>> How about using a ratelimited dev_err variant?
> >>
> >> Nah, I don't want dmesg being parsed by users of perf tool. pr_debug should
> >> be sufficient for perf tool developers working with a new PMU type.
> >
> > OK the question was to the ratelimited part, but I think that even
> > might be the default these days [4].
> >
> > For reasons (different perceptions?) already mentioned above, I also
> > don't agree that this has to do with just perf tool developers, and
> > only when working with a new PMU type. My main concern is new/future
> > users to perf on Arm, working with any PMU type - new or old. If there
> > is a perf tool-side component to this driver, I don't see it. Having
> > said that, I think the owners of this and other PMU drivers should have
> > a say in what type of experience they want for their users in this
> > regard...is that fair?
>
> As I've said before, a significant portion of what you want to improve
> involves generic perf_event properties which would already be much
> better validated in the core than by every driver individually, so the
> first straightforward step would be to improve the interface between
> drivers and perf core, such that the "hard" problem reduces to
> communicating from a single part of perf core to userspace. As it is,
> PMU drivers alone simply don't have enough context to know when an
> event_init "failure" is actually significant to the user or not, thus if
> they display user-visible messages directly then half the time the user
> will be thoroughly misled unless they are intimately familiar with how
> perf core grouping and event rotation work (see above), yet required
> knowledge of perf internals is exactly what you're arguing against!

You're right, perf is inherently confusing to use, but more information
is better than none, esp. for the trivial cases.

Thanks,

Kim

2018-05-15 10:33:52

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Mark,


On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <[email protected]> wrote:
> Hi Mark,
>
> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
>> Hi,
>>
>> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>> +
>>> +/* L3c and DMC has 16 and 8 channels per socket respectively.
>>> + * Each Channel supports UNCORE PMU device and consists of
>>> + * 4 independent programmable counters. Counters are 32 bit
>>> + * and does not support overflow interrupt, they needs to be
>>> + * sampled before overflow(i.e, at every 2 seconds).
>>> + */
>>> +
>>> +#define UNCORE_MAX_COUNTERS 4
>>> +#define UNCORE_L3_MAX_TILES 16
>>> +#define UNCORE_DMC_MAX_CHANNELS 8
>>> +
>>> +#define UNCORE_HRTIMER_INTERVAL (2 * NSEC_PER_SEC)
>>
>> How was a period of two seconds chosen?
>
> It has been suggested from hw team to sample before 3 to 4 seconds.
>
>>
>> What's the maximum clock speed for the L3C and DMC?
>
> L3C at 1.5GHz and DMC at 1.2GHz
>>
>> Given that all channels compete for access to the muxed register
>> interface, I suspect we need to try more often than once every 2
>> seconds...
>
> 2 seconds seems to be sufficient. So far testing looks good.
>
>>
>> [...]
>>
>>> +struct active_timer {
>>> + struct perf_event *event;
>>> + struct hrtimer hrtimer;
>>> +};
>>> +
>>> +/*
>>> + * pmu on each socket has 2 uncore devices(dmc and l3),
>>> + * each uncore device has up to 16 channels, each channel can sample
>>> + * events independently with counters up to 4.
>>> + *
>>> + * struct thunderx2_pmu_uncore_channel created per channel.
>>> + * struct thunderx2_pmu_uncore_dev per uncore device.
>>> + */
>>> +struct thunderx2_pmu_uncore_channel {
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + struct pmu pmu;
>>
>> Can we put the pmu first in the struct, please?
>
> ok
>>
>>> + int counter;
>>
>> AFAICT, this counter field is never used.
>
> hmm ok, will remove.
>>
>>> + int channel;
>>> + DECLARE_BITMAP(counter_mask, UNCORE_MAX_COUNTERS);
>>> + struct active_timer *active_timers;
>>
>> You should only need a single timer per channel, rather than one per
>> event.
>>
>> I think you can get rid of the active_timer structure, and have:
>>
>> struct perf_event *events[UNCORE_MAX_COUNTERS];
>> struct hrtimer timer;
>>
>
> thanks, will change as suggested.
>
>>> + /* to sync counter alloc/release */
>>> + raw_spinlock_t lock;
>>> +};
>>> +
>>> +struct thunderx2_pmu_uncore_dev {
>>> + char *name;
>>> + struct device *dev;
>>> + enum thunderx2_uncore_type type;
>>> + unsigned long base;
>>
>> This should be:
>>
>> void __iomem *base;
>
> ok
>>
>> [...]
>>
>>> +static ssize_t cpumask_show(struct device *dev,
>>> + struct device_attribute *attr, char *buf)
>>> +{
>>> + struct cpumask cpu_mask;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(dev_get_drvdata(dev));
>>> +
>>> + /* Pick first online cpu from the node */
>>> + cpumask_clear(&cpu_mask);
>>> + cpumask_set_cpu(cpumask_first(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node)),
>>> + &cpu_mask);
>>> +
>>> + return cpumap_print_to_pagebuf(true, buf, &cpu_mask);
>>> +}
>>
>> AFAICT cpumask_of_node() returns a mask that can include offline CPUs.
>>
>> Regardless, I don't think that you can keep track of the management CPU
>> this way. Please keep track of the CPU the PMU should be managed by,
>> either with a cpumask or int embedded within the PMU structure.
>
> thanks, will add hotplug callbacks.
>>
>> At hotplug time, you'll need to update the management CPU, calling
>> perf_pmu_migrate_context() when it is offlined.
>
> ok
>>
>> [...]
>>
>>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + int counter;
>>> +
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>>> + pmu_uncore->uncore_dev->max_counters);
>>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return -ENOSPC;
>>> + }
>>> + set_bit(counter, pmu_uncore->counter_mask);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> + return counter;
>>> +}
>>> +
>>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>>> + int counter)
>>> +{
>>> + raw_spin_lock(&pmu_uncore->lock);
>>> + clear_bit(counter, pmu_uncore->counter_mask);
>>> + raw_spin_unlock(&pmu_uncore->lock);
>>> +}
>>
>> I don't believe that locking is required in either of these, as the perf
>> core serializes pmu::add() and pmu::del(), where these get called.

without this locking, i am seeing "BUG: scheduling while atomic" when
i run perf with more events together than the maximum counters
supported

>
> thanks, it seems to be not required.
>>
>>> +
>>> +/*
>>> + * DMC and L3 counter interface is muxed across all channels.
>>> + * hence we need to select the channel before accessing counter
>>> + * data/control registers.
>>
>> Are there separate interfaces for all-dmc-channels and all-l3c-channels?
>
> separate interface for DMC and L3c.
> channels within DMC/L3C are muxed.
>
>>
>> ... or is a single interface used by all-dmc-and-l3c-channels?
>>
>>> + *
>>> + * L3 Tile and DMC channel selection is through SMC call
>>> + * SMC call arguments,
>>> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>>> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>>> + * x2 = Node id
>>
>> How do we map Linux node IDs to the firmware's view of node IDs?
>>
>> I don't believe the two are necessarily the same -- Linux's node IDs are
>> a Linux-specific construct.
>
> both are same, it is numa node id from ACPI/firmware.
>
>>
>> It would be much nicer if we could pass something based on the MPIDR,
>> which is a known HW construct, or if this implicitly affected the
>> current node.
>
> IMO, node id is sufficient.
>
>>
>> It would be vastly more sane for this to not be muxed at all. :/
>
> i am helpless due to crappy hw design!
>
>>
>>> + * x3 = DMC(1)/L3C(0)
>>> + * x4 = channel Id
>>> + */
>>> +static void uncore_select_channel(struct perf_event *event)
>>> +{
>>> + struct arm_smccc_res res;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore =
>>> + pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> +
>>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>>> + pmu_uncore->uncore_dev->node,
>>> + pmu_uncore->uncore_dev->type,
>>> + pmu_uncore->channel, 0, 0, 0, &res);
>>> +
>>> +}
>>
>> Why aren't we checking the return value of the SMC call?
>
> thanks, i will add.
>
>>
>> The muxing and SMC sound quite scary. :/
>
> i too agree!, however i have no other option since the muxing register
> is a secure register.
>>
>>> +
>>> +static void uncore_start_event_l3c(struct perf_event *event, int flags)
>>> +{
>>> + u32 val;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* event id encoded in bits [07:03] */
>>> + val = GET_EVENTID(event) << 3;
>>> + reg_writel(val, hwc->config_base);
>>> +
>>> + if (flags & PERF_EF_RELOAD) {
>>> + u64 prev_raw_count =
>>> + local64_read(&event->hw.prev_count);
>>> + reg_writel(prev_raw_count, hwc->event_base);
>>> + }
>>> + local64_set(&event->hw.prev_count,
>>> + reg_readl(hwc->event_base));
>>
>> It would be simpler to ignore PERF_EF_RELOAD, and always reprogram the
>> prev_count and HW to zero.
>
> ok, will change
>>
>>> +}
>>> +
>>> +static void uncore_start_event_dmc(struct perf_event *event, int flags)
>>> +{
>>> + u32 val, event_shift = 8;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* enable and start counters and read current value in prev_count */
>>> + val = reg_readl(hwc->config_base);
>>> +
>>> + /* 8 bits for each counter,
>>> + * bits [05:01] of a counter to set event type.
>>> + */
>>> + reg_writel((val & ~(0x1f << (((GET_COUNTERID(event)) *
>>> + event_shift) + 1))) |
>>> + (GET_EVENTID(event) <<
>>> + (((GET_COUNTERID(event)) * event_shift) + 1)),
>>> + hwc->config_base);
>>
>> This would be far more legible if val were constructed before the
>> reg_writel(), especially if there were a helper for the event field
>> shifting, e.g.
>>
>> #define DMC_EVENT_CFG(idx, val) ((val) << (((idx) * 8) + 1))
>>
>> int id = GET_COUNTERID(event);
>> int event = GET_EVENTID(event);
>>
>> val = reg_readl(hwc->config_base);
>> val &= ~DMC_EVENT_CFG(id, 0x1f);
>> val |= DMCC_EVENT_CFG(id, event);
>> reg_writel(val, hwc->config_base));
>>
>> What are bits 7:6 and 1 used for?
>
> not used/reserved bits.
>
>>
>>> +
>>> + if (flags & PERF_EF_RELOAD) {
>>> + u64 prev_raw_count =
>>> + local64_read(&event->hw.prev_count);
>>> + reg_writel(prev_raw_count, hwc->event_base);
>>> + }
>>> + local64_set(&event->hw.prev_count,
>>> + reg_readl(hwc->event_base));
>>
>>
>> As with the L3C events, it would be simpler to always reprogram the
>> prev_count and HW to zero.
>
> ok
>>
>>> +}
>>> +
>>> +static void uncore_stop_event_l3c(struct perf_event *event)
>>> +{
>>> + reg_writel(0, event->hw.config_base);
>>> +}
>>> +
>>> +static void uncore_stop_event_dmc(struct perf_event *event)
>>> +{
>>> + u32 val, event_shift = 8;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + val = reg_readl(hwc->config_base);
>>> + reg_writel((val & ~(0xff << ((GET_COUNTERID(event)) * event_shift))),
>>> + hwc->config_base);
>>
>>
>> This could be simplified with the helper proposed above.
>
> ok
>>
>>> +}
>>> +
>>> +static void init_cntr_base_l3c(struct perf_event *event,
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>>> +{
>>> +
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + /* counter ctrl/data reg offset at 8 */
>>
>> Offset 8, or stride 8?
>
> stride 8
>>
>> What does the register layout look like?
>>
>>> + hwc->config_base = uncore_dev->base
>>> + + L3C_COUNTER_CTL + (8 * GET_COUNTERID(event));
>>> + hwc->event_base = uncore_dev->base
>>> + + L3C_COUNTER_DATA + (8 * GET_COUNTERID(event));
>>> +}
>>> +
>>> +static void init_cntr_base_dmc(struct perf_event *event,
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev)
>>> +{
>>> +
>>> + struct hw_perf_event *hwc = &event->hw;
>>> +
>>> + hwc->config_base = uncore_dev->base
>>> + + DMC_COUNTER_CTL;
>>> + /* counter data reg offset at 0xc */
>>
>> A stride of 0xc seems unusual.
>>
>> What does the register layout look like?
>
> the offsets are at 0x640, 0x64c, 0x658, 0x664,
>>
>>> + hwc->event_base = uncore_dev->base
>>> + + DMC_COUNTER_DATA + (0xc * GET_COUNTERID(event));
>>> +}
>>> +
>>> +static void thunderx2_uncore_update(struct perf_event *event)
>>> +{
>>> + s64 prev, new = 0;
>>> + u64 delta;
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + enum thunderx2_uncore_type type;
>>> +
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + type = pmu_uncore->uncore_dev->type;
>>
>> AFAICT this variable is not used below.
>
> thanks.
>>
>>> +
>>> + if (pmu_uncore->uncore_dev->select_channel)
>>> + pmu_uncore->uncore_dev->select_channel(event);
>>
>> This should always be non-NULL, right?
>
> yes, unwanted check at the moment, will remove.
>
>>
>> [...]
>>
>>> +static bool thunderx2_uncore_validate_event_group(struct perf_event *event)
>>> +{
>>> + struct pmu *pmu = event->pmu;
>>> + struct perf_event *leader = event->group_leader;
>>> + struct perf_event *sibling;
>>> + int counters = 0;
>>> +
>>> + if (leader->pmu != event->pmu && !is_software_event(leader))
>>> + return false;
>>> +
>>> + counters++;
>>
>> I don't think this is right when event != leader and the leader is a SW
>> event. In that case, the leader doesn't take a HW counter.
>
> I think this check to avoid the grouping of multiple hw PMUs ,
> however it is allowed to group sw events along with hw PMUs.
>
>>
>>> +
>>> + for_each_sibling_event(sibling, event->group_leader) {
>>> + if (is_software_event(sibling))
>>> + continue;
>>> + if (sibling->pmu != pmu)
>>> + return false;
>>> + counters++;
>>> + }
>>> +
>>> + /*
>>> + * If the group requires more counters than the HW has,
>>> + * it cannot ever be scheduled.
>>> + */
>>> + return counters <= UNCORE_MAX_COUNTERS;
>>> +}
>>
>> [...]
>>
>>> +static int thunderx2_uncore_event_init(struct perf_event *event)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> +
>>> + if (!pmu_uncore)
>>> + return -ENODEV;
>>
>> This cannot happen, given pmu_to_thunderx2_pmu_uncore() is a wrapper
>> around container_of().
>
> thanks, unnecessary check!
>>
>>> +
>>> + /* Pick first online cpu from the node */
>>> + event->cpu = cpumask_first(
>>> + cpumask_of_node(pmu_uncore->uncore_dev->node));
>>
>> I don't believe this is safe. You must keep track of which CPU is
>> managing the PMU, with hotplug callbacks.
>
> ok, will add callbacks.
>>
>> [...]
>>
>>> +static void thunderx2_uncore_start(struct perf_event *event, int flags)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long irqflags;
>>> + struct active_timer *active_timer;
>>> +
>>> + hwc->state = 0;
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + uncore_dev = pmu_uncore->uncore_dev;
>>> +
>>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>>> +
>>> + if (uncore_dev->select_channel)
>>> + uncore_dev->select_channel(event);
>>> + uncore_dev->start_event(event, flags);
>>> + raw_spin_unlock_irqrestore(&uncore_dev->lock, irqflags);
>>> +
>>> + perf_event_update_userpage(event);
>>> + active_timer = &pmu_uncore->active_timers[GET_COUNTERID(event)];
>>> + active_timer->event = event;
>>> +
>>> + hrtimer_start(&active_timer->hrtimer,
>>> + ns_to_ktime(uncore_dev->hrtimer_interval),
>>> + HRTIMER_MODE_REL_PINNED);
>>> +}
>>
>> Please use a single hrtimer, and update *all* of the events when it
>> fires.
>
> sure
>>
>> I *think* that can be done in the pmu::pmu_enable() and
>> pmu::pmu_disable() callbacks.
>
> ok
>>
>> Are there control bits to enable/disable all counters, or can that only
>> be done through the event configuration registers?
>
> only through config register.
>>
>>> +static void thunderx2_uncore_stop(struct perf_event *event, int flags)
>>> +{
>>> + struct hw_perf_event *hwc = &event->hw;
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore;
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long irqflags;
>>> +
>>> + if (hwc->state & PERF_HES_UPTODATE)
>>> + return;
>>> +
>>> + pmu_uncore = pmu_to_thunderx2_pmu_uncore(event->pmu);
>>> + uncore_dev = pmu_uncore->uncore_dev;
>>> +
>>> + raw_spin_lock_irqsave(&uncore_dev->lock, irqflags);
>>> +
>>> + if (uncore_dev->select_channel)
>>> + uncore_dev->select_channel(event);
>>
>> AFAICT this cannot be NULL.
>
> ok.
>>
>> [...]
>>
>>> +static int thunderx2_pmu_uncore_register(
>>> + struct thunderx2_pmu_uncore_channel *pmu_uncore)
>>> +{
>>> + struct device *dev = pmu_uncore->uncore_dev->dev;
>>> + char *name = pmu_uncore->uncore_dev->name;
>>> + int channel = pmu_uncore->channel;
>>> +
>>> + /* Perf event registration */
>>> + pmu_uncore->pmu = (struct pmu) {
>>> + .attr_groups = pmu_uncore->uncore_dev->attr_groups,
>>> + .task_ctx_nr = perf_invalid_context,
>>> + .event_init = thunderx2_uncore_event_init,
>>> + .add = thunderx2_uncore_add,
>>> + .del = thunderx2_uncore_del,
>>> + .start = thunderx2_uncore_start,
>>> + .stop = thunderx2_uncore_stop,
>>> + .read = thunderx2_uncore_read,
>>> + };
>>> +
>>> + pmu_uncore->pmu.name = devm_kasprintf(dev, GFP_KERNEL,
>>> + "%s_%d", name, channel);
>>
>> Does the channel idx take the NUMA node into account?
>
> name already has node id suffix.
>>
>> [...]
>>
>>> +static int thunderx2_pmu_uncore_add(struct thunderx2_pmu_uncore_dev *uncore_dev,
>>> + int channel)
>>> +{
>>
>>> + /* we can run up to (max_counters * max_channels) events
>>> + * simultaneously, allocate hrtimers per channel.
>>> + */
>>> + pmu_uncore->active_timers = devm_kzalloc(uncore_dev->dev,
>>> + sizeof(struct active_timer) * uncore_dev->max_counters,
>>> + GFP_KERNEL);
>>
>> Please just fold a single hrtimer into the thunderx2_pmu_uncore_channel
>> structure, and you can get rid of this allocation...
>
> sure, i will rewrite code to have timer per channel and all active
> counters are updated when timer fires.
>
>>
>>> +
>>> + for (counter = 0; counter < uncore_dev->max_counters; counter++) {
>>> + hrtimer_init(&pmu_uncore->active_timers[counter].hrtimer,
>>> + CLOCK_MONOTONIC,
>>> + HRTIMER_MODE_REL);
>>> + pmu_uncore->active_timers[counter].hrtimer.function =
>>> + thunderx2_uncore_hrtimer_callback;
>>> + }
>>
>> ... and simplify this initialization.
>
> ok
>>
>> [...]
>>
>>> +static struct thunderx2_pmu_uncore_dev *init_pmu_uncore_dev(
>>> + struct device *dev, acpi_handle handle,
>>> + struct acpi_device *adev, u32 type)
>>> +{
>>> + struct thunderx2_pmu_uncore_dev *uncore_dev;
>>> + unsigned long base;
>>
>>> + base = (unsigned long)devm_ioremap_resource(dev, &res);
>>> + if (IS_ERR((void *)base)) {
>>> + dev_err(dev, "PMU type %d: Fail to map resource\n", type);
>>> + return NULL;
>>> + }
>>
>> Please treat this as a void __iomem *base.
>
> ok
>>
>> [...]
>>
>>> +static int thunderx2_uncore_probe(struct platform_device *pdev)
>>> +{
>>> + struct device *dev = &pdev->dev;
>>> + struct arm_smccc_res res;
>>> +
>>> + set_dev_node(dev, acpi_get_node(ACPI_HANDLE(dev)));
>>> +
>>> + /* Make sure firmware supports DMC/L3C set channel smc call */
>>> + arm_smccc_smc(THUNDERX2_SMC_CALL_ID, THUNDERX2_SMC_SET_CHANNEL,
>>> + dev_to_node(dev), 0, 0, 0, 0, 0, &res);
>>> + if (res.a0) {
>>> + dev_err(dev, "No Firmware support for PMU UNCORE(%d)\n",
>>> + dev_to_node(dev));
>>> + return -ENODEV;
>>> + }
>>
>> Please re-use the uncore_select_channel() wrapper rather than
>> open-coding this.
>
> ok.
>>
>> Which FW supports this interface?
>
> it is through vendor specific ATF runtime services.
>>
>> Thanks,
>> Mark.
>
> thanks,
> Ganapat

thanks
Ganapat

2018-05-21 10:37:45

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Ganapat,


Sorry for the delay in replying; I was away most of last week.

On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
> On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <[email protected]> wrote:
> > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
> >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:

> >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
> >>> +{
> >>> + int counter;
> >>> +
> >>> + raw_spin_lock(&pmu_uncore->lock);
> >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
> >>> + pmu_uncore->uncore_dev->max_counters);
> >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> + return -ENOSPC;
> >>> + }
> >>> + set_bit(counter, pmu_uncore->counter_mask);
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> + return counter;
> >>> +}
> >>> +
> >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
> >>> + int counter)
> >>> +{
> >>> + raw_spin_lock(&pmu_uncore->lock);
> >>> + clear_bit(counter, pmu_uncore->counter_mask);
> >>> + raw_spin_unlock(&pmu_uncore->lock);
> >>> +}
> >>
> >> I don't believe that locking is required in either of these, as the perf
> >> core serializes pmu::add() and pmu::del(), where these get called.
>
> without this locking, i am seeing "BUG: scheduling while atomic" when
> i run perf with more events together than the maximum counters
> supported

Did you manage to get to the bottom of this?

Do you have a backtrace?

It looks like in your latest posting you reserve counters through the
userspace ABI, which doesn't seem right to me, and I'd like to
understand the problem.

Thanks,
Mark.

2018-05-21 10:40:39

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Mon, May 21, 2018 at 11:37:12AM +0100, Mark Rutland wrote:
> Hi Ganapat,
>
>
> Sorry for the delay in replying; I was away most of last week.
>
> On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
> > On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <[email protected]> wrote:
> > > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
> > >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>
> > >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
> > >>> +{
> > >>> + int counter;
> > >>> +
> > >>> + raw_spin_lock(&pmu_uncore->lock);
> > >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
> > >>> + pmu_uncore->uncore_dev->max_counters);
> > >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> + return -ENOSPC;
> > >>> + }
> > >>> + set_bit(counter, pmu_uncore->counter_mask);
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> + return counter;
> > >>> +}
> > >>> +
> > >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
> > >>> + int counter)
> > >>> +{
> > >>> + raw_spin_lock(&pmu_uncore->lock);
> > >>> + clear_bit(counter, pmu_uncore->counter_mask);
> > >>> + raw_spin_unlock(&pmu_uncore->lock);
> > >>> +}
> > >>
> > >> I don't believe that locking is required in either of these, as the perf
> > >> core serializes pmu::add() and pmu::del(), where these get called.
> >
> > without this locking, i am seeing "BUG: scheduling while atomic" when
> > i run perf with more events together than the maximum counters
> > supported
>
> Did you manage to get to the bottom of this?
>
> Do you have a backtrace?
>
> It looks like in your latest posting you reserve counters through the
> userspace ABI, which doesn't seem right to me, and I'd like to
> understand the problem.

Looks like I misunderstood -- those are still allocated kernel-side.

I'll follow that up in the v5 posting.

Thanks,
Mark.

2018-05-21 10:57:14

by Mark Rutland

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Sat, May 05, 2018 at 12:16:13AM +0530, Ganapatrao Kulkarni wrote:
> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
> > On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:

> >> + *
> >> + * L3 Tile and DMC channel selection is through SMC call
> >> + * SMC call arguments,
> >> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
> >> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
> >> + * x2 = Node id
> >
> > How do we map Linux node IDs to the firmware's view of node IDs?
> >
> > I don't believe the two are necessarily the same -- Linux's node IDs are
> > a Linux-specific construct.
>
> both are same, it is numa node id from ACPI/firmware.

I am very wary about assuming that the Linux nid will always be the same
as the ACPI node id.

For that to *potentially* be true, this driver should depend on
CONFIG_NUMA, NUMA must not be disabled on the command line, etc, or the
node id will always be NUMA_NO_NODE.

I would be *much* happier if we had an explicit mapping somewhere to the
ID the FW expects.

> > It would be much nicer if we could pass something based on the MPIDR,
> > which is a known HW construct, or if this implicitly affected the
> > current node.
>
> IMO, node id is sufficient.

I agree that *a* node ID is sufficient, I just don't think that we're
guaranteed to have the specific node ID the FW wants.

> > It would be vastly more sane for this to not be muxed at all. :/
>
> i am helpless due to crappy hw design!

I'm certainly not blaming you for this! :)

I hope the HW designers don't make the same mistake in future, though...

Thanks,
Mark.

2018-05-21 12:36:09

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

Hi Mark,

On Mon, May 21, 2018 at 4:25 PM, Mark Rutland <[email protected]> wrote:
> On Sat, May 05, 2018 at 12:16:13AM +0530, Ganapatrao Kulkarni wrote:
>> On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
>> > On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>
>> >> + *
>> >> + * L3 Tile and DMC channel selection is through SMC call
>> >> + * SMC call arguments,
>> >> + * x0 = THUNDERX2_SMC_CALL_ID (Vendor SMC call Id)
>> >> + * x1 = THUNDERX2_SMC_SET_CHANNEL (Id to set DMC/L3C channel)
>> >> + * x2 = Node id
>> >
>> > How do we map Linux node IDs to the firmware's view of node IDs?
>> >
>> > I don't believe the two are necessarily the same -- Linux's node IDs are
>> > a Linux-specific construct.
>>
>> both are same, it is numa node id from ACPI/firmware.
>
> I am very wary about assuming that the Linux nid will always be the same
> as the ACPI node id.
>
> For that to *potentially* be true, this driver should depend on
> CONFIG_NUMA, NUMA must not be disabled on the command line, etc, or the
> node id will always be NUMA_NO_NODE.

ok, i can check the node id which we get from ACPI helpers in probe.
if it is NUMA_NO_NODE, I will init first socket uncore only and nid
param to fw is always zero?

>
> I would be *much* happier if we had an explicit mapping somewhere to the
> ID the FW expects.
>
>> > It would be much nicer if we could pass something based on the MPIDR,
>> > which is a known HW construct, or if this implicitly affected the
>> > current node.
>>
>> IMO, node id is sufficient.
>
> I agree that *a* node ID is sufficient, I just don't think that we're
> guaranteed to have the specific node ID the FW wants.

for thunderx2 which is 2 socket only platform, pxm and nid should be
same(either 0 or 1)
however, i can send PXM id(node_to_pxm) to firmware to make it more sane.

>
>> > It would be vastly more sane for this to not be muxed at all. :/
>>
>> i am helpless due to crappy hw design!
>
> I'm certainly not blaming you for this! :)
>
> I hope the HW designers don't make the same mistake in future, though...
>
> Thanks,
> Mark.

thanks
Ganapat

2018-05-21 12:42:53

by Ganapatrao Kulkarni

[permalink] [raw]
Subject: Re: [PATCH v4 2/2] ThunderX2: Add Cavium ThunderX2 SoC UNCORE PMU driver

On Mon, May 21, 2018 at 4:10 PM, Mark Rutland <[email protected]> wrote:
> On Mon, May 21, 2018 at 11:37:12AM +0100, Mark Rutland wrote:
>> Hi Ganapat,
>>
>>
>> Sorry for the delay in replying; I was away most of last week.
>>
>> On Tue, May 15, 2018 at 04:03:19PM +0530, Ganapatrao Kulkarni wrote:
>> > On Sat, May 5, 2018 at 12:16 AM, Ganapatrao Kulkarni <[email protected]> wrote:
>> > > On Thu, Apr 26, 2018 at 4:29 PM, Mark Rutland <[email protected]> wrote:
>> > >> On Wed, Apr 25, 2018 at 02:30:47PM +0530, Ganapatrao Kulkarni wrote:
>>
>> > >>> +static int alloc_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore)
>> > >>> +{
>> > >>> + int counter;
>> > >>> +
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + counter = find_first_zero_bit(pmu_uncore->counter_mask,
>> > >>> + pmu_uncore->uncore_dev->max_counters);
>> > >>> + if (counter == pmu_uncore->uncore_dev->max_counters) {
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return -ENOSPC;
>> > >>> + }
>> > >>> + set_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> + return counter;
>> > >>> +}
>> > >>> +
>> > >>> +static void free_counter(struct thunderx2_pmu_uncore_channel *pmu_uncore,
>> > >>> + int counter)
>> > >>> +{
>> > >>> + raw_spin_lock(&pmu_uncore->lock);
>> > >>> + clear_bit(counter, pmu_uncore->counter_mask);
>> > >>> + raw_spin_unlock(&pmu_uncore->lock);
>> > >>> +}
>> > >>
>> > >> I don't believe that locking is required in either of these, as the perf
>> > >> core serializes pmu::add() and pmu::del(), where these get called.
>> >
>> > without this locking, i am seeing "BUG: scheduling while atomic" when
>> > i run perf with more events together than the maximum counters
>> > supported
>>
>> Did you manage to get to the bottom of this?
>>
>> Do you have a backtrace?
>>
>> It looks like in your latest posting you reserve counters through the
>> userspace ABI, which doesn't seem right to me, and I'd like to
>> understand the problem.
>
> Looks like I misunderstood -- those are still allocated kernel-side.
>
> I'll follow that up in the v5 posting.

please review v5.
>
> Thanks,
> Mark.

thanks
Ganapat