2022-09-17 12:31:06

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v1 0/3] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.

Shuai Xue (3):
docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
drivers/perf: add DesignWare PCIe PMU driver
MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

.../admin-guide/perf/dwc_pcie_pmu.rst | 61 ++
Documentation/admin-guide/perf/index.rst | 1 +
MAINTAINERS | 6 +
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 976 ++++++++++++++++++
6 files changed, 1052 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
create mode 100644 drivers/perf/dwc_pcie_pmu.c

--
2.20.1.12.g72788fdb


2022-09-17 13:26:47

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v1 3/3] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 936490dcc97b..2a6965e97be7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19731,6 +19731,12 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/dw_mmc*

+SYNOPSYS SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M: Shuai Xue <[email protected]>
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
M: Eugeniy Paltsev <[email protected]>
S: Supported
--
2.20.1.12.g72788fdb

2023-04-10 03:19:13

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 0/3] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

Changes since v1:

1. address comments from Jonathan:
- drop marco for PMU name and VSEC version
- simplify code with PCI standard marco
- simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
- name register filed with single _ instead double
- wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
- check vendor id while matching vesc with pci_find_vsec_capability()
- remove RP_NUM_MAX and use a list to organize PMU devices for rootports
- replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
- comments on riping register together

2. address comments from Bjorn:
- rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
- rename cap_pos to ras_des
- simplify declare of device_attribute with DEVICE_ATTR_RO
- simplify code with PCI standard macro and API like pcie_get_width_cap()
- fix some code style problem and typo
- drop meaningless snaity check of container_of

3. address comments from Yicong:
- use sysfs_emit() to replace sprintf()
- simplify iteration of pci device with for_each_pci_dev
- pick preferred CPUs on a near die and add comments
- unregister PMU drivers only for failed ones
- log on behalf PMU device and give more hint
- fix some code style problem

(Thanks for all comments and they are very valuable to me)

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.

Shuai Xue (3):
docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
drivers/perf: add DesignWare PCIe PMU driver
MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

.../admin-guide/perf/dwc_pcie_pmu.rst | 61 ++
Documentation/admin-guide/perf/index.rst | 1 +
MAINTAINERS | 6 +
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 877 ++++++++++++++++++
6 files changed, 953 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
create mode 100644 drivers/perf/dwc_pcie_pmu.c

--
2.20.1.12.g72788fdb

2023-04-10 03:20:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver

This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
Core controller IP which provides statistics feature. The PMU is not a PCIe
Root Complex integrated End Point(RCiEP) device but only register counters
provided by each PCIe Root Port.

To facilitate collection of statistics the controller provides the
following two features for each Root Port:

- Time Based Analysis (RX/TX data throughput and time spent in each
low-power LTSSM state)
- Event counters (Error and Non-Error for lanes)

Note, only one counter for each type and does not overflow interrupt.

This driver adds PMU devices for each PCIe Root Port. And the PMU device is
named based the BDF of Root Port. For example,

30:03.0 PCI bridge: Device 1ded:8000 (rev 01)

the PMU device name for this Root Port is dwc_rootport_3018.

Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

$# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/

average RX bandwidth can be calculated like this:

PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 877 ++++++++++++++++++++++++++++++++++++
3 files changed, 885 insertions(+)
create mode 100644 drivers/perf/dwc_pcie_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 66c259000a44..57bce3880cba 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
Enable perf support for Marvell DDR Performance monitoring
event on CN10K platform.

+config DWC_PCIE_PMU
+ tristate "Enable Synopsys DesignWare PCIe PMU Support"
+ depends on ARM64 || (COMPILE_TEST && 64BIT)
+ help
+ Enable perf support for Synopsys DesignWare PCIe PMU Performance
+ monitoring event on Yitian 710 platform.
+
source "drivers/perf/arm_cspmu/Kconfig"

source "drivers/perf/amlogic/Kconfig"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 13e45da61100..3f233e96524e 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
+obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
new file mode 100644
index 000000000000..c8c09f120d4e
--- /dev/null
+++ b/drivers/perf/dwc_pcie_pmu.c
@@ -0,0 +1,877 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synopsys DesignWare PCIe PMU driver
+ *
+ * Copyright (C) 2021, 2022 Alibaba Inc.
+ */
+
+#include <linux/pci.h>
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cpuhotplug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#define PCI_VENDOR_ID_ALIBABA 0x1ded
+
+#define ATTRI_NAME_MAX_SIZE 32
+#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
+
+#define DWC_PCIE_EVENT_CNT_CTL 0x8
+#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
+#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
+#define DWC_PCIE_CNT_STATUS BIT(7)
+#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
+#define DWC_PCIE_PER_EVENT_OFF FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
+#define DWC_PCIE_PER_EVENT_ON FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
+#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
+#define DWC_PCIE_EVENT_PER_CLEAR 0x1
+
+#define DWC_PCIE_EVENT_CNT_DATA 0xC
+
+#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
+#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
+#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
+#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
+#define DWC_PCIE_DURATION_1MS 0x1
+#define DWC_PCIE_DURATION_10MS 0x2
+#define DWC_PCIE_DURATION_100MS 0x3
+#define DWC_PCIE_DURATION_1S 0x4
+#define DWC_PCIE_DURATION_2S 0x5
+#define DWC_PCIE_DURATION_4S 0x6
+#define DWC_PCIE_DURATION_4US 0xff
+#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
+
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
+
+/* Event attributes */
+#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
+#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
+#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
+
+#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
+#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
+#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
+
+#define DWC_PCIE_PMU_HAS_REGISTER 1
+
+enum dwc_pcie_event_type {
+ DWC_PCIE_TYPE_INVALID,
+ DWC_PCIE_TIME_BASE_EVENT,
+ DWC_PCIE_LANE_EVENT,
+};
+
+struct dwc_event_counters {
+ const char name[32];
+ u32 event_id;
+};
+
+struct dwc_pcie_pmu {
+ struct hlist_node node;
+ unsigned int on_cpu;
+ struct pmu pmu;
+ struct device *dev;
+};
+
+struct dwc_pcie_rp_info {
+ u32 bdf;
+ u32 ras_des;
+ u32 num_lanes;
+
+ struct list_head rp_node;
+ struct pci_dev *pdev;
+ struct dwc_pcie_pmu pcie_pmu;
+ u8 pmu_is_register;
+ struct perf_event *event;
+
+ struct dwc_pcie_event_attr *lane_event_attrs;
+ struct attribute **pcie_pmu_event_attrs;
+ struct attribute_group pcie_pmu_event_attrs_group;
+ const struct attribute_group *pcie_pmu_attr_groups[4];
+};
+
+struct dwc_pcie_pmu_priv {
+ struct device *dev;
+ u32 pcie_ctrl_num;
+ struct list_head rp_infos;
+};
+
+#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
+
+static struct platform_device *dwc_pcie_pmu_dev;
+static char *event_attr_name = "events";
+
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
+
+ return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+#define DEVICE_ATTR_RO(_name) \
+ struct device_attribute dev_attr_##_name = __ATTR_RO(_name)
+#define __ATTR_RO(_name) { \
+ .attr = { .name = __stringify(_name), .mode = 0444 }, \
+ .show = _name##_show, \
+}
+#define __ATTR(_name, _mode, _show, _store) { \
+ .attr = {.name = __stringify(_name), \
+ .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
+ .show = _show, \
+ .store = _store, \
+}
+
+static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL
+};
+
+static struct attribute_group pcie_pmu_cpumask_attrs_group = {
+ .attrs = dwc_pcie_pmu_cpumask_attrs,
+};
+
+struct dwc_pcie_format_attr {
+ struct device_attribute attr;
+ u64 field;
+ int config;
+};
+
+static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
+ int lo = __ffs(fmt->field), hi = __fls(fmt->field);
+
+ if (lo == hi)
+ return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
+
+ if (!fmt->config)
+ return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
+
+ return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
+ hi);
+}
+
+#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
+ (&((struct dwc_pcie_format_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
+ .config = _cfg, \
+ .field = _fld, \
+ }})[0].attr.attr)
+
+#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
+
+static struct attribute *dwc_pcie_format_attrs[] = {
+ dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
+ dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
+ dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
+ NULL,
+};
+
+static struct attribute_group pcie_pmu_format_attrs_group = {
+ .name = "format",
+ .attrs = dwc_pcie_format_attrs,
+};
+
+struct dwc_pcie_event_attr {
+ struct device_attribute attr;
+ enum dwc_pcie_event_type type;
+ u16 eventid;
+ u8 lane;
+};
+
+ssize_t dwc_pcie_event_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ struct dwc_pcie_event_attr *eattr;
+
+ eattr = container_of(attr, typeof(*eattr), attr);
+
+ if (eattr->type == DWC_PCIE_LANE_EVENT)
+ return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
+ (unsigned long)eattr->eventid,
+ (unsigned long)eattr->type,
+ (unsigned long)eattr->lane);
+ else
+ return sprintf(page, "eventid=0x%lx, type=0x%lx",
+ (unsigned long)eattr->eventid,
+ (unsigned long)eattr->type);
+}
+
+#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
+ (&((struct dwc_pcie_event_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
+ .type = _type, \
+ .eventid = _eventid, \
+ .lane = _lane, \
+ }})[0].attr.attr)
+
+#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
+
+static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
+ /* Group #0 */
+ DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
+ /* Group #1 */
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
+ NULL
+};
+
+static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
+ struct attribute *attr,
+ int unuse)
+{
+ return attr->mode;
+}
+
+static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
+{
+ return (pci_is_pcie(pdev) &&
+ pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
+}
+
+static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
+{
+ int index = 0;
+ struct pci_dev *pdev = NULL;
+ struct dwc_pcie_rp_info *rp_info;
+
+ INIT_LIST_HEAD(&priv->rp_infos);
+
+ /* Match the rootport with VSEC_RAS_DES_ID */
+ for_each_pci_dev(pdev) {
+ u16 vsec;
+ u32 val;
+
+ if (!pci_dev_is_rootport(pdev))
+ continue;
+
+ rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
+ if (!rp_info)
+ return -ENOMEM;
+
+ rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
+ rp_info->pdev = pdev;
+
+ vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
+ DWC_PCIE_VSEC_RAS_DES_ID);
+ if (!vsec)
+ continue;
+
+ pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
+ if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
+ PCI_VNDR_HEADER_LEN(val) != 0x100)
+ continue;
+ pci_dbg(pdev,
+ "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
+
+ rp_info->ras_des = vsec;
+
+ pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &val);
+ rp_info->num_lanes = pcie_get_width_cap(pdev);
+
+ list_add(&rp_info->rp_node, &priv->rp_infos);
+ index++;
+ }
+
+ if (!index)
+ return -ENODEV;
+
+ priv->pcie_ctrl_num = index;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
+ int event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val &= ~DWC_PCIE_CNT_ENABLE;
+ val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
+ int lane, int event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val &= ~DWC_PCIE_CNT_LANE_SEL;
+ val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
+ u32 enable)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val &= ~DWC_PCIE_CNT_ENABLE;
+
+ if (enable)
+ val |= DWC_PCIE_PER_EVENT_ON;
+ else
+ val |= DWC_PCIE_PER_EVENT_OFF;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
+ u32 enable)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, &val);
+
+ if (enable)
+ val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ else
+ val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, val);
+}
+
+static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
+ u64 *counter)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
+ *counter = val;
+}
+
+/* The results are cleared when next measurement starts. */
+static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
+ u16 ras_des, u64 *counter)
+{
+ u32 val;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
+ &val);
+ *counter = val;
+ *counter <<= 32;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
+ &val);
+
+ *counter += val;
+}
+
+static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
+ u16 ras_des, u32 event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ &val);
+
+ val &= ~DWC_PCIE_TIME_BASED_REPORT_SEL;
+ val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
+ val &= ~DWC_PCIE_TIME_BASED_DURATION_SEL;
+
+ /*
+ * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
+ * use it with any manually controlled duration.
+ */
+ val &= ~(DWC_PCIE_TIME_BASED_DURATION_SEL);
+ val |= DWC_PCIE_DURATION_MANUAL_CTL;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ val);
+}
+
+static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
+{
+ struct dwc_pcie_rp_info *rp_info;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
+
+ rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
+
+ return rp_info;
+}
+
+static void dwc_pcie_pmu_event_update(struct perf_event *event)
+{
+ u64 counter;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ u64 delta, prev, now;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
+ &counter);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+
+ now = counter;
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ delta = now - prev;
+
+ local64_add(delta, &event->count);
+}
+
+static int dwc_pcie_pmu_event_init(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct perf_event *sibling;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ if (hwc->sample_period) {
+ dev_dbg(pcie_pmu->dev, "Sampling not supported\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (event->cpu < 0) {
+ dev_dbg(pcie_pmu->dev, "Per-task mode not supported\n");
+ return -EOPNOTSUPP;
+ }
+
+ event->cpu = pcie_pmu->on_cpu;
+
+ if (event->group_leader != event &&
+ !is_software_event(event->group_leader))
+ return -EINVAL;
+
+ for_each_sibling_event(sibling, event->group_leader) {
+ if (sibling != event && !is_software_event(sibling))
+ return -EINVAL;
+ }
+
+ hwc->idx = -1;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
+{
+ local64_set(&hwc->prev_count, 0);
+}
+
+static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ hwc->state = 0;
+ dwc_pcie_pmu_set_period(hwc);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+}
+
+static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+
+ dwc_pcie_pmu_event_update(event);
+}
+
+static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ int event_id = DWC_PCIE_EVENT_ID(event);
+ int lane = DWC_PCIE_EVENT_LANE(event);
+
+ /* Only one counter and it is in use */
+ if (rp_info->event)
+ return -ENOSPC;
+
+ rp_info->event = event;
+
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
+ dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
+ dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
+ dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
+ } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
+ dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
+ } else {
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+ return -EINVAL;
+ }
+
+ if (flags & PERF_EF_START)
+ dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+
+ dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
+ perf_event_update_userpage(event);
+ rp_info->event = NULL;
+}
+
+static void dwc_pcie_pmu_event_read(struct perf_event *event)
+{
+ dwc_pcie_pmu_event_update(event);
+}
+
+static struct dwc_event_counters event_array[] = {
+ {"tx_ack_dllp", 0x600},
+ {"tx_update_fc_dllp", 0x601},
+ {"rx_ack_dllp", 0x602},
+ {"rx_update_fc_dllp", 0x603},
+ {"rx_nulified_tlp", 0x604},
+ {"tx_nulified_tlp", 0x605},
+ {"rx_duplicate_tlp", 0x606},
+ {"tx_memory_write", 0x700},
+ {"tx_memory_read", 0x701},
+ {"tx_configuration_write", 0x702},
+ {"tx_configuration_read", 0x703},
+ {"tx_io_write", 0x704},
+ {"tx_io_read", 0x705},
+ {"tx_completion_without_data", 0x706},
+ {"tx_completion_with_data", 0x707},
+ {"tx_message_tlp", 0x708},
+ {"tx_atomic", 0x709},
+ {"tx_tlp_with_prefix", 0x70A},
+ {"rx_memory_write", 0x70B},
+ {"rx_memory_read", 0x70C},
+ {"rx_io_write", 0x70F},
+ {"rx_io_read", 0x710},
+ {"rx_completion_without_data", 0x711},
+ {"rx_completion_with_data", 0x712},
+ {"rx_message_tlp", 0x713},
+ {"rx_atomic", 0x714},
+ {"rx_tlp_with_prefix", 0x715},
+ {"tx_ccix_tlp", 0x716},
+ {"rx_ccix_tlp", 0x717},
+};
+
+static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
+ struct dwc_pcie_rp_info *rp_info)
+{
+ int i, j;
+ char lane[8];
+ const char tmp[64];
+ int events_per_lane;
+ int num_lane_events;
+ int time_base_count;
+ int num_attrs, attr_idx;
+ struct dwc_pcie_event_attr *lane_attrs;
+ struct attribute **pmu_attrs;
+
+ memset((void *)tmp, 0, sizeof(tmp));
+ memset((void *)lane, 0, sizeof(lane));
+ time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
+ events_per_lane = ARRAY_SIZE(event_array);
+ num_lane_events = rp_info->num_lanes * events_per_lane;
+ num_attrs = time_base_count + num_lane_events;
+
+ rp_info->lane_event_attrs =
+ devm_kcalloc(priv->dev, num_lane_events,
+ sizeof(struct dwc_pcie_event_attr),
+ GFP_KERNEL);
+ if (!rp_info->lane_event_attrs)
+ return -ENOMEM;
+ lane_attrs = rp_info->lane_event_attrs;
+ rp_info->pcie_pmu_event_attrs =
+ devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
+ GFP_KERNEL);
+ if (!rp_info->pcie_pmu_event_attrs)
+ return -ENOMEM;
+ pmu_attrs = rp_info->pcie_pmu_event_attrs;
+
+ for (i = 0; i < num_lane_events; i++) {
+ lane_attrs[i].attr.attr.name =
+ devm_kzalloc(priv->dev, sizeof(char)
+ * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
+ if (!lane_attrs[i].attr.attr.name)
+ return -ENOMEM;
+ }
+
+ attr_idx = 0;
+ for (i = 0; i < rp_info->num_lanes; i++) {
+ sprintf(lane, "_lane%d", i);
+
+ for (j = 0; j < events_per_lane; j++) {
+ int pos = i * events_per_lane + j;
+
+ strcat((char *)tmp, event_array[j].name);
+ strcat((char *)tmp, lane);
+ memcpy((void *)lane_attrs[pos].attr.attr.name,
+ (void *)tmp,
+ sizeof(tmp));
+
+ lane_attrs[pos].attr.attr.mode =
+ VERIFY_OCTAL_PERMISSIONS(0444);
+ lane_attrs[pos].attr.show = dwc_pcie_event_show;
+ lane_attrs[pos].attr.store = NULL;
+ lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
+ lane_attrs[pos].eventid = event_array[j].event_id;
+ lane_attrs[pos].lane = i;
+ pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
+
+ memset((void *)tmp, 0, sizeof(tmp));
+ }
+ }
+
+ for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
+ pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
+
+ rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
+
+ rp_info->pcie_pmu_event_attrs_group.name = event_attr_name;
+ rp_info->pcie_pmu_event_attrs_group.is_visible =
+ pcie_pmu_event_attr_is_visible;
+ rp_info->pcie_pmu_event_attrs_group.attrs =
+ rp_info->pcie_pmu_event_attrs;
+
+ rp_info->pcie_pmu_attr_groups[0] =
+ &rp_info->pcie_pmu_event_attrs_group;
+ rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
+ rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
+ rp_info->pcie_pmu_attr_groups[3] = NULL;
+
+ return 0;
+}
+
+static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
+ struct dwc_pcie_rp_info *rp_info)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct device *dev;
+ char *name;
+ int ret;
+
+ pcie_pmu = &rp_info->pcie_pmu;
+ dev = &rp_info->pdev->dev;
+
+ ret = dwc_pcie_pmu_attr_init(priv, rp_info);
+ if (ret) {
+ pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
+ return ret;
+ }
+
+ pcie_pmu->dev = dev;
+ pcie_pmu->pmu = (struct pmu) {
+ .module = THIS_MODULE,
+ .task_ctx_nr = perf_invalid_context,
+ .pmu_enable = NULL,
+ .pmu_disable = NULL,
+ .event_init = dwc_pcie_pmu_event_init,
+ .add = dwc_pcie_pmu_event_add,
+ .del = dwc_pcie_pmu_event_del,
+ .start = dwc_pcie_pmu_event_start,
+ .stop = dwc_pcie_pmu_event_stop,
+ .read = dwc_pcie_pmu_event_read,
+ .attr_groups = rp_info->pcie_pmu_attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
+ };
+
+ name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
+ rp_info->bdf);
+ if (!name)
+ return -ENOMEM;
+
+ /*
+ * Pick one CPU to be the preferred one on local NUMA node.
+ *
+ * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
+ * is a uncore PMU device.
+ */
+ pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
+ ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
+ if (ret) {
+ pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
+ rp_info->bdf);
+ return ret;
+ }
+
+ rp_info->pmu_is_register = DWC_PCIE_PMU_HAS_REGISTER;
+
+ return ret;
+}
+
+static int dwc_pcie_pmu_remove(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct dwc_pcie_rp_info *rp_info;
+
+ list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
+ if (rp_info->pmu_is_register) {
+ pcie_pmu = &rp_info->pcie_pmu;
+ perf_pmu_unregister(&pcie_pmu->pmu);
+ }
+ }
+ return 0;
+}
+
+static int dwc_pcie_pmu_probe(struct platform_device *pdev)
+{
+ int ret;
+ struct dwc_pcie_pmu_priv *priv;
+ struct dwc_pcie_rp_info *rp_info;
+
+ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->dev = &pdev->dev;
+ platform_set_drvdata(pdev, priv);
+
+ /* If RAS_DES PMU is not supported on current platform, keep silent */
+ ret = dwc_pcie_ras_des_discover(priv);
+ if (ret)
+ return ret;
+
+ list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
+ struct pci_dev *rp = rp_info->pdev;
+
+ ret = __dwc_pcie_pmu_probe(priv, rp_info);
+ if (ret) {
+ dev_err(&rp->dev, "PCIe PMU probe fail\n");
+ goto pmu_unregister;
+ }
+ }
+
+ return 0;
+
+pmu_unregister:
+ dwc_pcie_pmu_remove(pdev);
+
+ return ret;
+}
+
+static struct platform_driver dwc_pcie_pmu_driver = {
+ .probe = dwc_pcie_pmu_probe,
+ .remove = dwc_pcie_pmu_remove,
+ .driver = {.name = "dwc_pcie_pmu",},
+};
+
+static int __init dwc_pcie_pmu_init(void)
+{
+ int ret;
+
+ ret = platform_driver_register(&dwc_pcie_pmu_driver);
+
+ if (ret)
+ return ret;
+
+ dwc_pcie_pmu_dev =
+ platform_device_register_simple("dwc_pcie_pmu", -1, NULL, 0);
+ if (IS_ERR(dwc_pcie_pmu_dev)) {
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+ return PTR_ERR(dwc_pcie_pmu_dev);
+ }
+
+ return 0;
+}
+
+static void __exit dwc_pcie_pmu_exit(void)
+{
+ platform_device_unregister(dwc_pcie_pmu_dev);
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+}
+
+module_init(dwc_pcie_pmu_init);
+module_exit(dwc_pcie_pmu_exit);
+
+MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
+MODULE_AUTHOR("[email protected]");
+MODULE_AUTHOR("[email protected]");
+MODULE_LICENSE("GPL v2");
--
2.20.1.12.g72788fdb

2023-04-10 03:21:23

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 1/3] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver

Alibaba's T-Head Yitan 710 SoC is built on Synopsys' widely deployed and
silicon-proven DesignWare Core PCIe controller which implements PMU for
performance and functional debugging to facilitate system maintenance.
Document it to provide guidance on how to use it.

Signed-off-by: Shuai Xue <[email protected]>
---
.../admin-guide/perf/dwc_pcie_pmu.rst | 61 +++++++++++++++++++
Documentation/admin-guide/perf/index.rst | 1 +
2 files changed, 62 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst

diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..0672e959ebe4
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,61 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+To facilitate collection of statistics, Synopsys DesignWare Cores PCIe
+controller provides the following two features:
+
+- Time Based Analysis (RX/TX data throughput and time spent in each
+ low-power LTSSM state)
+- Lane Event counters (Error and Non-Error for lanes)
+
+The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
+only register counters provided by each PCIe Root Port.
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+
+The counters are 64-bit width and measure data in two categories,
+
+- percentage of time does the controller stay in LTSSM state in a
+ configurable duration. The measurement range of each Event in Group#0.
+- amount of data processed (Units of 16 bytes). The measurement range of
+ each Event in Group#1.
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information in
+specific lane by the controller.
+
+The counters are 32-bit width and the measured event is select by:
+
+- Group i
+- Event j within the Group i
+- and Lane k
+
+Some of the event counters only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver add PMU devices for each PCIe Root Port. And the PMU device is
+named based the BDF of Root Port. For example,
+
+ 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
+
+ $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+average RX bandwidth can be calculated like this:
+
+ PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 9de64a40adab..11a80cd28a2e 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -19,5 +19,6 @@ Performance monitor support
arm_dsu_pmu
thunderx2-pmu
alibaba_pmu
+ dwc_pcie_pmu
nvidia-pmu
meson-ddr-pmu
--
2.20.1.12.g72788fdb

2023-04-10 03:22:18

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v2 3/3] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e01e546f3a90..52e9b0b91272 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20278,6 +20278,12 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/dw_mmc*

+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M: Shuai Xue <[email protected]>
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
M: Eugeniy Paltsev <[email protected]>
S: Supported
--
2.20.1.12.g72788fdb

2023-04-10 07:32:45

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver

Hi Shuai,

kernel test robot noticed the following build warnings:

[auto build test WARNING on soc/for-next]
[also build test WARNING on linus/master v6.3-rc6 next-20230406]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Shuai-Xue/drivers-perf-add-DesignWare-PCIe-PMU-driver/20230410-121727
base: https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git for-next
patch link: https://lore.kernel.org/r/20230410031702.68355-3-xueshuai%40linux.alibaba.com
patch subject: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver
config: s390-allyesconfig (https://download.01.org/0day-ci/archive/20230410/[email protected]/config)
compiler: s390-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/279673b15957e3ea9c14fded1e41a861a6d8b2d9
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Shuai-Xue/drivers-perf-add-DesignWare-PCIe-PMU-driver/20230410-121727
git checkout 279673b15957e3ea9c14fded1e41a861a6d8b2d9
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=s390 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> drivers/perf/dwc_pcie_pmu.c:195:9: warning: no previous prototype for 'dwc_pcie_event_show' [-Wmissing-prototypes]
195 | ssize_t dwc_pcie_event_show(struct device *dev,
| ^~~~~~~~~~~~~~~~~~~


vim +/dwc_pcie_event_show +195 drivers/perf/dwc_pcie_pmu.c

194
> 195 ssize_t dwc_pcie_event_show(struct device *dev,
196 struct device_attribute *attr, char *page)
197 {
198 struct dwc_pcie_event_attr *eattr;
199
200 eattr = container_of(attr, typeof(*eattr), attr);
201
202 if (eattr->type == DWC_PCIE_LANE_EVENT)
203 return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
204 (unsigned long)eattr->eventid,
205 (unsigned long)eattr->type,
206 (unsigned long)eattr->lane);
207 else
208 return sprintf(page, "eventid=0x%lx, type=0x%lx",
209 (unsigned long)eattr->eventid,
210 (unsigned long)eattr->type);
211 }
212

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

2023-04-11 03:25:48

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 4/10/2023 11:17 AM, Shuai Xue wrote:
> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature. The PMU is not a PCIe
> Root Complex integrated End Point(RCiEP) device but only register counters
> provided by each PCIe Root Port.
>
> To facilitate collection of statistics the controller provides the
> following two features for each Root Port:
>
> - Time Based Analysis (RX/TX data throughput and time spent in each
> low-power LTSSM state)
> - Event counters (Error and Non-Error for lanes)
>
> Note, only one counter for each type and does not overflow interrupt.
>
> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
> named based the BDF of Root Port. For example,
>
> 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>
> the PMU device name for this Root Port is dwc_rootport_3018.
>
> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>
> $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>
> average RX bandwidth can be calculated like this:
>
> PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/perf/Kconfig | 7 +
> drivers/perf/Makefile | 1 +
> drivers/perf/dwc_pcie_pmu.c | 877 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 885 insertions(+)
> create mode 100644 drivers/perf/dwc_pcie_pmu.c
>
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index 66c259000a44..57bce3880cba 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
> Enable perf support for Marvell DDR Performance monitoring
> event on CN10K platform.
>
> +config DWC_PCIE_PMU
> + tristate "Enable Synopsys DesignWare PCIe PMU Support"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)
> + help
> + Enable perf support for Synopsys DesignWare PCIe PMU Performance
> + monitoring event on Yitian 710 platform.
> +
> source "drivers/perf/arm_cspmu/Kconfig"
>
> source "drivers/perf/amlogic/Kconfig"
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 13e45da61100..3f233e96524e 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
> obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
> obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
> obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
> +obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
> obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
> obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
> new file mode 100644
> index 000000000000..c8c09f120d4e
> --- /dev/null
> +++ b/drivers/perf/dwc_pcie_pmu.c
> @@ -0,0 +1,877 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Synopsys DesignWare PCIe PMU driver
> + *
> + * Copyright (C) 2021, 2022 Alibaba Inc.
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/smp.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>
> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
> +
> +#define ATTRI_NAME_MAX_SIZE 32
> +#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
> +
> +#define DWC_PCIE_EVENT_CNT_CTL 0x8
> +#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
> +#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
> +#define DWC_PCIE_CNT_STATUS BIT(7)
> +#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
> +#define DWC_PCIE_PER_EVENT_OFF FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
> +#define DWC_PCIE_PER_EVENT_ON FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
> +#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
> +#define DWC_PCIE_EVENT_PER_CLEAR 0x1
> +
> +#define DWC_PCIE_EVENT_CNT_DATA 0xC
> +
> +#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
> +#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
> +#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
> +#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
> +#define DWC_PCIE_DURATION_1MS 0x1
> +#define DWC_PCIE_DURATION_10MS 0x2
> +#define DWC_PCIE_DURATION_100MS 0x3
> +#define DWC_PCIE_DURATION_1S 0x4
> +#define DWC_PCIE_DURATION_2S 0x5
> +#define DWC_PCIE_DURATION_4S 0x6
> +#define DWC_PCIE_DURATION_4US 0xff
> +#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
> +
> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
> +
> +/* Event attributes */
> +#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
> +#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
> +#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
> +
> +#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
> +#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
> +#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
> +
> +#define DWC_PCIE_PMU_HAS_REGISTER 1
> +
> +enum dwc_pcie_event_type {
> + DWC_PCIE_TYPE_INVALID,
> + DWC_PCIE_TIME_BASE_EVENT,
> + DWC_PCIE_LANE_EVENT,
> +};
> +
> +struct dwc_event_counters {
> + const char name[32];
> + u32 event_id;
> +};
> +
> +struct dwc_pcie_pmu {
> + struct hlist_node node;
> + unsigned int on_cpu;
> + struct pmu pmu;
> + struct device *dev;
> +};
> +
> +struct dwc_pcie_rp_info {
> + u32 bdf;
> + u32 ras_des;
> + u32 num_lanes;
> +
> + struct list_head rp_node;
> + struct pci_dev *pdev;
> + struct dwc_pcie_pmu pcie_pmu;
> + u8 pmu_is_register;
> + struct perf_event *event;
> +
> + struct dwc_pcie_event_attr *lane_event_attrs;
> + struct attribute **pcie_pmu_event_attrs;
> + struct attribute_group pcie_pmu_event_attrs_group;
> + const struct attribute_group *pcie_pmu_attr_groups[4];
> +};
> +
> +struct dwc_pcie_pmu_priv {
> + struct device *dev;
> + u32 pcie_ctrl_num;
> + struct list_head rp_infos;
> +};
> +
> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
> +
> +static struct platform_device *dwc_pcie_pmu_dev;
> +static char *event_attr_name = "events";

Since only one place uses the 'event_attr_name' variable, just using the
"events" string in the code seems more readable.

> +
> +static ssize_t cpumask_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
> +
> + return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
> +}
> +static DEVICE_ATTR_RO(cpumask);
> +
> +#define DEVICE_ATTR_RO(_name) \
> + struct device_attribute dev_attr_##_name = __ATTR_RO(_name)

There is already a general DEVICE_ATTR_RO macro in linux/device.h,
please do not add duplicate and confusing macro definition.

> +#define __ATTR_RO(_name) { \

This also seems a more general macro definition, can you rename it as
something like 'DWC_PMU_ATTR_RO' or other better name?

> + .attr = { .name = __stringify(_name), .mode = 0444 }, \
> + .show = _name##_show, \
> +}
> +#define __ATTR(_name, _mode, _show, _store) { \

ditto.

> + .attr = {.name = __stringify(_name), \
> + .mode = VERIFY_OCTAL_PERMISSIONS(_mode) }, \
> + .show = _show, \
> + .store = _store, \
> +}
> +
> +static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
> + &dev_attr_cpumask.attr,
> + NULL
> +};
> +
> +static struct attribute_group pcie_pmu_cpumask_attrs_group = { > + .attrs = dwc_pcie_pmu_cpumask_attrs,
> +};
> +
> +struct dwc_pcie_format_attr {
> + struct device_attribute attr;
> + u64 field;
> + int config;
> +};
> +
> +static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
> + int lo = __ffs(fmt->field), hi = __fls(fmt->field);
> +
> + if (lo == hi)
> + return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
> +
> + if (!fmt->config)
> + return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
> +
> + return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
> + hi);
> +}
> +
> +#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
> + (&((struct dwc_pcie_format_attr[]) {{ \
> + .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
> + .config = _cfg, \
> + .field = _fld, \
> + }})[0].attr.attr)
> +
> +#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
> +
> +static struct attribute *dwc_pcie_format_attrs[] = {
> + dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
> + dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
> + dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
> + NULL,
> +};
> +
> +static struct attribute_group pcie_pmu_format_attrs_group = { > + .name = "format",
> + .attrs = dwc_pcie_format_attrs,
> +};
> +
> +struct dwc_pcie_event_attr {
> + struct device_attribute attr;
> + enum dwc_pcie_event_type type;
> + u16 eventid;
> + u8 lane;
> +};
> +
> +ssize_t dwc_pcie_event_show(struct device *dev,
> + struct device_attribute *attr, char *page)
> +{
> + struct dwc_pcie_event_attr *eattr;
> +
> + eattr = container_of(attr, typeof(*eattr), attr);
> +
> + if (eattr->type == DWC_PCIE_LANE_EVENT)
> + return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
> + (unsigned long)eattr->eventid,
> + (unsigned long)eattr->type,
> + (unsigned long)eattr->lane);
> + else

'else' is redundant.

> + return sprintf(page, "eventid=0x%lx, type=0x%lx",
> + (unsigned long)eattr->eventid,
> + (unsigned long)eattr->type);
> +}
> +
> +#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
> + (&((struct dwc_pcie_event_attr[]) {{ \
> + .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
> + .type = _type, \
> + .eventid = _eventid, \
> + .lane = _lane, \
> + }})[0].attr.attr)
> +
> +#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid) \
> + DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
> +
> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
> + /* Group #0 */
> + DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
> + /* Group #1 */
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
> + NULL
> +};
> +
> +static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
> + struct attribute *attr,
> + int unuse)
> +{
> + return attr->mode;
> +}
> +
> +static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
> +{
> + return (pci_is_pcie(pdev) &&
> + pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
> +}
> +
> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> +{
> + int index = 0;
> + struct pci_dev *pdev = NULL;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + INIT_LIST_HEAD(&priv->rp_infos);
> +
> + /* Match the rootport with VSEC_RAS_DES_ID */
> + for_each_pci_dev(pdev) {
> + u16 vsec;
> + u32 val;
> +
> + if (!pci_dev_is_rootport(pdev))
> + continue;
> +
> + rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
> + if (!rp_info)
> + return -ENOMEM;
> +
> + rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
> + rp_info->pdev = pdev;
> +
> + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> + DWC_PCIE_VSEC_RAS_DES_ID);
> + if (!vsec)
> + continue;
> +
> + pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
> + if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
> + PCI_VNDR_HEADER_LEN(val) != 0x100)
> + continue;
> + pci_dbg(pdev,
> + "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
> +
> + rp_info->ras_des = vsec;
> +
> + pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &val);

The linkcap value is not used, then why need read it?

> + rp_info->num_lanes = pcie_get_width_cap(pdev);
> +
> + list_add(&rp_info->rp_node, &priv->rp_infos);
> + index++;
> + }
> +
> + if (!index)
> + return -ENODEV;
> +
> + priv->pcie_ctrl_num = index;
> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
> + int event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val &= ~DWC_PCIE_CNT_ENABLE;
> + val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
> + int lane, int event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val &= ~DWC_PCIE_CNT_LANE_SEL;
> + val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
> + u32 enable)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val &= ~DWC_PCIE_CNT_ENABLE;
> +
> + if (enable)
> + val |= DWC_PCIE_PER_EVENT_ON;
> + else
> + val |= DWC_PCIE_PER_EVENT_OFF;
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
> + u32 enable)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, &val);
> +
> + if (enable)
> + val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
> + else
> + val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
> + u64 *counter)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
> + *counter = val;
> +}
> +
> +/* The results are cleared when next measurement starts. */
> +static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
> + u16 ras_des, u64 *counter)
> +{
> + u32 val;
> +
> + pci_read_config_dword(
> + pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
> + &val);
> + *counter = val;
> + *counter <<= 32;
> +
> + pci_read_config_dword(
> + pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
> + &val);
> +
> + *counter += val;
> +}
> +
> +static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
> + u16 ras_des, u32 event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + &val);
> +
> + val &= ~DWC_PCIE_TIME_BASED_REPORT_SEL;
> + val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
> + val &= ~DWC_PCIE_TIME_BASED_DURATION_SEL;
> +
> + /*
> + * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
> + * use it with any manually controlled duration.
> + */
> + val &= ~(DWC_PCIE_TIME_BASED_DURATION_SEL);

Remove redundant '()'.

> + val |= DWC_PCIE_DURATION_MANUAL_CTL;
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + val);
> +}
> +
> +static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
> +{
> + struct dwc_pcie_rp_info *rp_info;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
> +
> + rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
> +
> + return rp_info;
> +}
> +
> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
> +{
> + u64 counter;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + struct hw_perf_event *hwc = &event->hw;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> + u64 delta, prev, now;
> +
> + do {
> + prev = local64_read(&hwc->prev_count);
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
> + &counter);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
> +
> + now = counter;
> + } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
> +
> + delta = now - prev;

This can be overflow? better to add a mask to avoid possible overflow.

> +
> + local64_add(delta, &event->count);
> +}
> +
> +static int dwc_pcie_pmu_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct perf_event *sibling;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + if (hwc->sample_period) {
> + dev_dbg(pcie_pmu->dev, "Sampling not supported\n");

should use dev_err() if treating it as an error.

> + return -EOPNOTSUPP;
> + }
> +
> + if (event->cpu < 0) {
> + dev_dbg(pcie_pmu->dev, "Per-task mode not supported\n");

ditto.

> + return -EOPNOTSUPP;
> + }
> +
> + event->cpu = pcie_pmu->on_cpu;
> +
> + if (event->group_leader != event &&
> + !is_software_event(event->group_leader))
> + return -EINVAL;
> +
> + for_each_sibling_event(sibling, event->group_leader) {
> + if (sibling != event && !is_software_event(sibling))
> + return -EINVAL;
> + }
> +
> + hwc->idx = -1;
> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
> +{
> + local64_set(&hwc->prev_count, 0);
> +}
> +
> +static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> +
> + hwc->state = 0;
> + dwc_pcie_pmu_set_period(hwc);
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
> +}
> +
> +static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> +
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> +
> + if (event->hw.state & PERF_HES_STOPPED)
> + return;
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
> +
> + dwc_pcie_pmu_event_update(event);
> +}
> +
> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + struct hw_perf_event *hwc = &event->hw;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> + int event_id = DWC_PCIE_EVENT_ID(event);
> + int lane = DWC_PCIE_EVENT_LANE(event);
> +
> + /* Only one counter and it is in use */
> + if (rp_info->event)
> + return -ENOSPC;
> +
> + rp_info->event = event;
> +
> + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +
> + if (type == DWC_PCIE_LANE_EVENT) {
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
> + dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
> + dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
> + dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
> + } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
> + dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
> + } else {
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
> + return -EINVAL;
> + }
> +
> + if (flags & PERF_EF_START)
> + dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
> +
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> +
> + dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
> + perf_event_update_userpage(event);
> + rp_info->event = NULL;
> +}
> +
> +static void dwc_pcie_pmu_event_read(struct perf_event *event)
> +{
> + dwc_pcie_pmu_event_update(event);
> +}
> +
> +static struct dwc_event_counters event_array[] = {
> + {"tx_ack_dllp", 0x600},
> + {"tx_update_fc_dllp", 0x601},
> + {"rx_ack_dllp", 0x602},
> + {"rx_update_fc_dllp", 0x603},
> + {"rx_nulified_tlp", 0x604},
> + {"tx_nulified_tlp", 0x605},
> + {"rx_duplicate_tlp", 0x606},
> + {"tx_memory_write", 0x700},
> + {"tx_memory_read", 0x701},
> + {"tx_configuration_write", 0x702},
> + {"tx_configuration_read", 0x703},
> + {"tx_io_write", 0x704},
> + {"tx_io_read", 0x705},
> + {"tx_completion_without_data", 0x706},
> + {"tx_completion_with_data", 0x707},
> + {"tx_message_tlp", 0x708},
> + {"tx_atomic", 0x709},
> + {"tx_tlp_with_prefix", 0x70A},
> + {"rx_memory_write", 0x70B},
> + {"rx_memory_read", 0x70C},
> + {"rx_io_write", 0x70F},
> + {"rx_io_read", 0x710},
> + {"rx_completion_without_data", 0x711},
> + {"rx_completion_with_data", 0x712},
> + {"rx_message_tlp", 0x713},
> + {"rx_atomic", 0x714},
> + {"rx_tlp_with_prefix", 0x715},
> + {"tx_ccix_tlp", 0x716},
> + {"rx_ccix_tlp", 0x717},
> +};
> +
> +static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
> + struct dwc_pcie_rp_info *rp_info)
> +{
> + int i, j;
> + char lane[8];
> + const char tmp[64];
> + int events_per_lane;
> + int num_lane_events;
> + int time_base_count;
> + int num_attrs, attr_idx;
> + struct dwc_pcie_event_attr *lane_attrs;
> + struct attribute **pmu_attrs;
> +
> + memset((void *)tmp, 0, sizeof(tmp));
> + memset((void *)lane, 0, sizeof(lane));
> + time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
> + events_per_lane = ARRAY_SIZE(event_array);
> + num_lane_events = rp_info->num_lanes * events_per_lane;
> + num_attrs = time_base_count + num_lane_events;
> +
> + rp_info->lane_event_attrs =
> + devm_kcalloc(priv->dev, num_lane_events,
> + sizeof(struct dwc_pcie_event_attr),
> + GFP_KERNEL);
> + if (!rp_info->lane_event_attrs)
> + return -ENOMEM;
> + lane_attrs = rp_info->lane_event_attrs;
> + rp_info->pcie_pmu_event_attrs =
> + devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
> + GFP_KERNEL);
> + if (!rp_info->pcie_pmu_event_attrs)
> + return -ENOMEM;
> + pmu_attrs = rp_info->pcie_pmu_event_attrs;
> +
> + for (i = 0; i < num_lane_events; i++) {
> + lane_attrs[i].attr.attr.name =
> + devm_kzalloc(priv->dev, sizeof(char)
> + * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
> + if (!lane_attrs[i].attr.attr.name)
> + return -ENOMEM;
> + }
> +
> + attr_idx = 0;
> + for (i = 0; i < rp_info->num_lanes; i++) {
> + sprintf(lane, "_lane%d", i);
> +
> + for (j = 0; j < events_per_lane; j++) {
> + int pos = i * events_per_lane + j;
> +
> + strcat((char *)tmp, event_array[j].name);
> + strcat((char *)tmp, lane);
> + memcpy((void *)lane_attrs[pos].attr.attr.name,
> + (void *)tmp,
> + sizeof(tmp));
> +
> + lane_attrs[pos].attr.attr.mode =
> + VERIFY_OCTAL_PERMISSIONS(0444);
> + lane_attrs[pos].attr.show = dwc_pcie_event_show;
> + lane_attrs[pos].attr.store = NULL;
> + lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
> + lane_attrs[pos].eventid = event_array[j].event_id;
> + lane_attrs[pos].lane = i;
> + pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
> +
> + memset((void *)tmp, 0, sizeof(tmp));
> + }
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
> + pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
> +
> + rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
> +
> + rp_info->pcie_pmu_event_attrs_group.name = event_attr_name;
> + rp_info->pcie_pmu_event_attrs_group.is_visible =
> + pcie_pmu_event_attr_is_visible;
> + rp_info->pcie_pmu_event_attrs_group.attrs =
> + rp_info->pcie_pmu_event_attrs;
> +
> + rp_info->pcie_pmu_attr_groups[0] =
> + &rp_info->pcie_pmu_event_attrs_group;
> + rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
> + rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
> + rp_info->pcie_pmu_attr_groups[3] = NULL;
> +
> + return 0;
> +}
> +
> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
> + struct dwc_pcie_rp_info *rp_info)
> +{
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct device *dev;
> + char *name;
> + int ret;
> +
> + pcie_pmu = &rp_info->pcie_pmu;
> + dev = &rp_info->pdev->dev;
> +
> + ret = dwc_pcie_pmu_attr_init(priv, rp_info);
> + if (ret) {
> + pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
> + return ret;
> + }
> +
> + pcie_pmu->dev = dev;
> + pcie_pmu->pmu = (struct pmu) {
> + .module = THIS_MODULE,
> + .task_ctx_nr = perf_invalid_context,
> + .pmu_enable = NULL,
> + .pmu_disable = NULL,
> + .event_init = dwc_pcie_pmu_event_init,
> + .add = dwc_pcie_pmu_event_add,
> + .del = dwc_pcie_pmu_event_del,
> + .start = dwc_pcie_pmu_event_start,
> + .stop = dwc_pcie_pmu_event_stop,
> + .read = dwc_pcie_pmu_event_read,
> + .attr_groups = rp_info->pcie_pmu_attr_groups,
> + .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
> + };
> +
> + name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
> + rp_info->bdf);
> + if (!name)
> + return -ENOMEM;
> +
> + /*
> + * Pick one CPU to be the preferred one on local NUMA node.
> + *
> + * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
> + * is a uncore PMU device.
> + */
> + pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
> + ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
> + if (ret) {
> + pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
> + rp_info->bdf);
> + return ret;
> + }
> +
> + rp_info->pmu_is_register = DWC_PCIE_PMU_HAS_REGISTER;

Seems a boolean type, and no need define another macro.

> +
> + return ret;

return 0;

> +}
> +
> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
> +{
> + struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
> + if (rp_info->pmu_is_register) {
> + pcie_pmu = &rp_info->pcie_pmu;
> + perf_pmu_unregister(&pcie_pmu->pmu);
> + }
> + }
> + return 0;
> +}
> +
> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
> +{
> + int ret;
> + struct dwc_pcie_pmu_priv *priv;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->dev = &pdev->dev;
> + platform_set_drvdata(pdev, priv);
> +
> + /* If RAS_DES PMU is not supported on current platform, keep silent */
> + ret = dwc_pcie_ras_des_discover(priv);
> + if (ret)
> + return ret;
> +
> + list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
> + struct pci_dev *rp = rp_info->pdev;
> +
> + ret = __dwc_pcie_pmu_probe(priv, rp_info);
> + if (ret) {
> + dev_err(&rp->dev, "PCIe PMU probe fail\n");
> + goto pmu_unregister;
> + }
> + }
> +
> + return 0;
> +
> +pmu_unregister:
> + dwc_pcie_pmu_remove(pdev);
> +
> + return ret;
> +}
> +
> +static struct platform_driver dwc_pcie_pmu_driver = {
> + .probe = dwc_pcie_pmu_probe,
> + .remove = dwc_pcie_pmu_remove,
> + .driver = {.name = "dwc_pcie_pmu",},
> +};
> +
> +static int __init dwc_pcie_pmu_init(void)
> +{
> + int ret;
> +
> + ret = platform_driver_register(&dwc_pcie_pmu_driver);
> +
> + if (ret)
> + return ret;
> +
> + dwc_pcie_pmu_dev =
> + platform_device_register_simple("dwc_pcie_pmu", -1, NULL, 0);

You can use PLATFORM_DEVID_NONE macro instead of the magic number '-1'.

> + if (IS_ERR(dwc_pcie_pmu_dev)) {
> + platform_driver_unregister(&dwc_pcie_pmu_driver);
> + return PTR_ERR(dwc_pcie_pmu_dev);
> + }
> +
> + return 0;
> +}
> +
> +static void __exit dwc_pcie_pmu_exit(void)
> +{
> + platform_device_unregister(dwc_pcie_pmu_dev);
> + platform_driver_unregister(&dwc_pcie_pmu_driver);
> +}
> +
> +module_init(dwc_pcie_pmu_init);
> +module_exit(dwc_pcie_pmu_exit);
> +
> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
> +MODULE_AUTHOR("[email protected]");

Please correct the format for module author:
MODULE_AUTHOR("Xue Shuai <[email protected]>");

> +MODULE_AUTHOR("[email protected]");
> +MODULE_LICENSE("GPL v2");

2023-04-17 01:36:24

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 2023/4/11 AM11:17, Baolin Wang wrote:
>
>
> On 4/10/2023 11:17 AM, Shuai Xue wrote:
>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>> Core controller IP which provides statistics feature. The PMU is not a PCIe
>> Root Complex integrated End Point(RCiEP) device but only register counters
>> provided by each PCIe Root Port.
>>
>> To facilitate collection of statistics the controller provides the
>> following two features for each Root Port:
>>
>> - Time Based Analysis (RX/TX data throughput and time spent in each
>>    low-power LTSSM state)
>> - Event counters (Error and Non-Error for lanes)
>>
>> Note, only one counter for each type and does not overflow interrupt.
>>
>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>> named based the BDF of Root Port. For example,
>>
>>      30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>
>> the PMU device name for this Root Port is dwc_rootport_3018.
>>
>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>
>>      $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>
>> average RX bandwidth can be calculated like this:
>>
>>      PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>>   drivers/perf/Kconfig        |   7 +
>>   drivers/perf/Makefile       |   1 +
>>   drivers/perf/dwc_pcie_pmu.c | 877 ++++++++++++++++++++++++++++++++++++
>>   3 files changed, 885 insertions(+)
>>   create mode 100644 drivers/perf/dwc_pcie_pmu.c
>>
>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>> index 66c259000a44..57bce3880cba 100644
>> --- a/drivers/perf/Kconfig
>> +++ b/drivers/perf/Kconfig
>> @@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
>>         Enable perf support for Marvell DDR Performance monitoring
>>         event on CN10K platform.
>>   +config DWC_PCIE_PMU
>> +    tristate "Enable Synopsys DesignWare PCIe PMU Support"
>> +    depends on ARM64 || (COMPILE_TEST && 64BIT)
>> +    help
>> +      Enable perf support for Synopsys DesignWare PCIe PMU Performance
>> +      monitoring event on Yitian 710 platform.
>> +
>>   source "drivers/perf/arm_cspmu/Kconfig"
>>     source "drivers/perf/amlogic/Kconfig"
>> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
>> index 13e45da61100..3f233e96524e 100644
>> --- a/drivers/perf/Makefile
>> +++ b/drivers/perf/Makefile
>> @@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
>>   obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
>>   obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
>>   obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
>> +obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
>>   obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
>>   obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
>> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
>> new file mode 100644
>> index 000000000000..c8c09f120d4e
>> --- /dev/null
>> +++ b/drivers/perf/dwc_pcie_pmu.c
>> @@ -0,0 +1,877 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Synopsys DesignWare PCIe PMU driver
>> + *
>> + * Copyright (C) 2021, 2022 Alibaba Inc.
>> + */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/bitfield.h>
>> +#include <linux/bitops.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/device.h>
>> +#include <linux/errno.h>
>> +#include <linux/kernel.h>
>> +#include <linux/list.h>
>> +#include <linux/perf_event.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/smp.h>
>> +#include <linux/sysfs.h>
>> +#include <linux/types.h>
>> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
>> +
>> +#define ATTRI_NAME_MAX_SIZE            32
>> +#define DWC_PCIE_VSEC_RAS_DES_ID        0x02
>> +
>> +#define DWC_PCIE_EVENT_CNT_CTL            0x8
>> +#define DWC_PCIE_CNT_EVENT_SEL            GENMASK(27, 16)
>> +#define DWC_PCIE_CNT_LANE_SEL            GENMASK(11, 8)
>> +#define DWC_PCIE_CNT_STATUS            BIT(7)
>> +#define DWC_PCIE_CNT_ENABLE            GENMASK(4, 2)
>> +#define DWC_PCIE_PER_EVENT_OFF            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
>> +#define DWC_PCIE_PER_EVENT_ON            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
>> +#define DWC_PCIE_EVENT_CLEAR            GENMASK(1, 0)
>> +#define DWC_PCIE_EVENT_PER_CLEAR        0x1
>> +
>> +#define DWC_PCIE_EVENT_CNT_DATA            0xC
>> +
>> +#define DWC_PCIE_TIME_BASED_ANAL_CTL        0x10
>> +#define DWC_PCIE_TIME_BASED_REPORT_SEL        GENMASK(31, 24)
>> +#define DWC_PCIE_TIME_BASED_DURATION_SEL    GENMASK(15, 8)
>> +#define DWC_PCIE_DURATION_MANUAL_CTL        0x0
>> +#define DWC_PCIE_DURATION_1MS            0x1
>> +#define DWC_PCIE_DURATION_10MS            0x2
>> +#define DWC_PCIE_DURATION_100MS            0x3
>> +#define DWC_PCIE_DURATION_1S            0x4
>> +#define DWC_PCIE_DURATION_2S            0x5
>> +#define DWC_PCIE_DURATION_4S            0x6
>> +#define DWC_PCIE_DURATION_4US            0xff
>> +#define DWC_PCIE_TIME_BASED_CNT_ENABLE        0x1
>> +
>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW    0x14
>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH    0x18
>> +
>> +/* Event attributes */
>> +#define DWC_PCIE_CONFIG_EVENTID            GENMASK(15, 0)
>> +#define DWC_PCIE_CONFIG_TYPE            GENMASK(19, 16)
>> +#define DWC_PCIE_CONFIG_LANE            GENMASK(27, 20)
>> +
>> +#define DWC_PCIE_EVENT_ID(event)    FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
>> +#define DWC_PCIE_EVENT_TYPE(event)    FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
>> +#define DWC_PCIE_EVENT_LANE(event)    FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
>> +
>> +#define DWC_PCIE_PMU_HAS_REGISTER        1
>> +
>> +enum dwc_pcie_event_type {
>> +    DWC_PCIE_TYPE_INVALID,
>> +    DWC_PCIE_TIME_BASE_EVENT,
>> +    DWC_PCIE_LANE_EVENT,
>> +};
>> +
>> +struct dwc_event_counters {
>> +    const char name[32];
>> +    u32 event_id;
>> +};
>> +
>> +struct dwc_pcie_pmu {
>> +    struct hlist_node node;
>> +    unsigned int on_cpu;
>> +    struct pmu pmu;
>> +    struct device *dev;
>> +};
>> +
>> +struct dwc_pcie_rp_info {
>> +    u32 bdf;
>> +    u32 ras_des;
>> +    u32 num_lanes;
>> +
>> +    struct list_head rp_node;
>> +    struct pci_dev *pdev;
>> +    struct dwc_pcie_pmu pcie_pmu;
>> +    u8 pmu_is_register;
>> +    struct perf_event *event;
>> +
>> +    struct dwc_pcie_event_attr *lane_event_attrs;
>> +    struct attribute **pcie_pmu_event_attrs;
>> +    struct attribute_group pcie_pmu_event_attrs_group;
>> +    const struct attribute_group *pcie_pmu_attr_groups[4];
>> +};
>> +
>> +struct dwc_pcie_pmu_priv {
>> +    struct device *dev;
>> +    u32 pcie_ctrl_num;
>> +    struct list_head rp_infos;
>> +};
>> +
>> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
>> +
>> +static struct platform_device *dwc_pcie_pmu_dev;
>> +static char *event_attr_name = "events";
>
> Since only one place uses the 'event_attr_name' variable, just using the "events" string in the code seems more readable.
>

Ok, will fix in next version.

>> +
>> +static ssize_t cpumask_show(struct device *dev,
>> +                     struct device_attribute *attr,
>> +                     char *buf)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
>> +
>> +    return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
>> +}
>> +static DEVICE_ATTR_RO(cpumask);
>> +
>> +#define DEVICE_ATTR_RO(_name) \
>> +    struct device_attribute dev_attr_##_name = __ATTR_RO(_name)
>
> There is already a general DEVICE_ATTR_RO macro in linux/device.h, please do not add duplicate and confusing macro definition.
>
>> +#define __ATTR_RO(_name) {                        \
>
> This also seems a more general macro definition, can you rename it as something like 'DWC_PMU_ATTR_RO' or other better name?
>
>> +    .attr    = { .name = __stringify(_name), .mode = 0444 },        \
>> +    .show    = _name##_show,                        \
>> +}
>> +#define __ATTR(_name, _mode, _show, _store) {                \
>
> ditto.

I was going to copy the macro definitions from linux/device.h and see more details,
sorry for forgetting to delete it.

>
>> +    .attr = {.name = __stringify(_name),                \
>> +         .mode = VERIFY_OCTAL_PERMISSIONS(_mode) },        \
>> +    .show    = _show,                        \
>> +    .store    = _store,                        \
>> +}
>> +
>> +static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
>> +    &dev_attr_cpumask.attr,
>> +    NULL
>> +};
>> +
>> +static struct attribute_group pcie_pmu_cpumask_attrs_group = { > +    .attrs = dwc_pcie_pmu_cpumask_attrs,
>> +};
>> +
>> +struct dwc_pcie_format_attr {
>> +    struct device_attribute attr;
>> +    u64 field;
>> +    int config;
>> +};
>> +
>> +static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
>> +                    struct device_attribute *attr,
>> +                    char *buf)
>> +{
>> +    struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
>> +    int lo = __ffs(fmt->field), hi = __fls(fmt->field);
>> +
>> +    if (lo == hi)
>> +        return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
>> +
>> +    if (!fmt->config)
>> +        return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
>> +
>> +    return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
>> +            hi);
>> +}
>> +
>> +#define _dwc_pcie_format_attr(_name, _cfg, _fld)                \
>> +    (&((struct dwc_pcie_format_attr[]) {{                \
>> +        .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL),    \
>> +        .config = _cfg,                        \
>> +        .field = _fld,                        \
>> +    }})[0].attr.attr)
>> +
>> +#define dwc_pcie_format_attr(_name, _fld)    _dwc_pcie_format_attr(_name, 0, _fld)
>> +
>> +static struct attribute *dwc_pcie_format_attrs[] = {
>> +    dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
>> +    dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
>> +    dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
>> +    NULL,
>> +};
>> +
>> +static struct attribute_group pcie_pmu_format_attrs_group = { > +    .name = "format",
>> +    .attrs = dwc_pcie_format_attrs,
>> +};
>> +
>> +struct dwc_pcie_event_attr {
>> +    struct device_attribute attr;
>> +    enum dwc_pcie_event_type type;
>> +    u16 eventid;
>> +    u8 lane;
>> +};
>> +
>> +ssize_t dwc_pcie_event_show(struct device *dev,
>> +                struct device_attribute *attr, char *page)
>> +{
>> +    struct dwc_pcie_event_attr *eattr;
>> +
>> +    eattr = container_of(attr, typeof(*eattr), attr);
>> +
>> +    if (eattr->type == DWC_PCIE_LANE_EVENT)
>> +        return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
>> +                   (unsigned long)eattr->eventid,
>> +                   (unsigned long)eattr->type,
>> +                   (unsigned long)eattr->lane);
>> +    else
>
> 'else' is redundant.

Will remove it in next version.


>> +        return sprintf(page, "eventid=0x%lx, type=0x%lx",
>> +                   (unsigned long)eattr->eventid,
>> +                   (unsigned long)eattr->type);
>> +}
>> +
>> +#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane)        \
>> +    (&((struct dwc_pcie_event_attr[]) {{                \
>> +        .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL),    \
>> +        .type = _type,                        \
>> +        .eventid = _eventid,                    \
>> +        .lane = _lane,                    \
>> +    }})[0].attr.attr)
>> +
>> +#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid)            \
>> +    DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
>> +
>> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
>> +    /* Group #0 */
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
>> +    /* Group #1 */
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
>> +    NULL
>> +};
>> +
>> +static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
>> +                             struct attribute *attr,
>> +                             int unuse)
>> +{
>> +    return attr->mode;
>> +}
>> +
>> +static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
>> +{
>> +    return (pci_is_pcie(pdev) &&
>> +        pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
>> +}
>> +
>> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
>> +{
>> +    int index = 0;
>> +    struct pci_dev *pdev = NULL;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    INIT_LIST_HEAD(&priv->rp_infos);
>> +
>> +    /* Match the rootport with VSEC_RAS_DES_ID */
>> +    for_each_pci_dev(pdev) {
>> +        u16 vsec;
>> +        u32 val;
>> +
>> +        if (!pci_dev_is_rootport(pdev))
>> +            continue;
>> +
>> +        rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
>> +        if (!rp_info)
>> +            return -ENOMEM;
>> +
>> +        rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>> +        rp_info->pdev = pdev;
>> +
>> +        vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>> +                        DWC_PCIE_VSEC_RAS_DES_ID);
>> +        if (!vsec)
>> +            continue;
>> +
>> +        pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>> +        if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>> +            PCI_VNDR_HEADER_LEN(val) != 0x100)
>> +            continue;
>> +        pci_dbg(pdev,
>> +            "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>> +
>> +        rp_info->ras_des = vsec;
>> +
>> +        pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &val);
>
> The linkcap value is not used, then why need read it?

Yes, it is redundant. Will remove it in next version.

>
>> +        rp_info->num_lanes = pcie_get_width_cap(pdev);
>> +
>> +        list_add(&rp_info->rp_node, &priv->rp_infos);
>> +        index++;
>> +    }
>> +
>> +    if (!index)
>> +        return -ENODEV;
>> +
>> +    priv->pcie_ctrl_num = index;
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
>> +                     int event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val &= ~DWC_PCIE_CNT_ENABLE;
>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
>> +                     int lane, int event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val &= ~DWC_PCIE_CNT_LANE_SEL;
>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
>> +                     u32 enable)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val &= ~DWC_PCIE_CNT_ENABLE;
>> +
>> +    if (enable)
>> +        val |= DWC_PCIE_PER_EVENT_ON;
>> +    else
>> +        val |= DWC_PCIE_PER_EVENT_OFF;
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
>> +                     u32 enable)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, &val);
>> +
>> +    if (enable)
>> +        val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
>> +    else
>> +        val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
>> +                        u64 *counter)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
>> +    *counter = val;
>> +}
>> +
>> +/* The results are cleared when next measurement starts. */
>> +static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
>> +                        u16 ras_des, u64 *counter)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(
>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
>> +        &val);
>> +    *counter = val;
>> +    *counter <<= 32;
>> +
>> +    pci_read_config_dword(
>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
>> +        &val);
>> +
>> +    *counter += val;
>> +}
>> +
>> +static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
>> +                           u16 ras_des, u32 event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                  &val);
>> +
>> +    val &= ~DWC_PCIE_TIME_BASED_REPORT_SEL;
>> +    val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
>> +    val &= ~DWC_PCIE_TIME_BASED_DURATION_SEL;
>> +
>> +    /*
>> +     * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
>> +     * use it with any manually controlled duration.
>> +     */
>> +    val &= ~(DWC_PCIE_TIME_BASED_DURATION_SEL);
>
> Remove redundant '()'.

Will remove it.

>
>> +    val |= DWC_PCIE_DURATION_MANUAL_CTL;
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                   val);
>> +}
>> +
>> +static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
>> +{
>> +    struct dwc_pcie_rp_info *rp_info;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
>> +
>> +    rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
>> +
>> +    return rp_info;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
>> +{
>> +    u64 counter;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +    u64 delta, prev, now;
>> +
>> +    do {
>> +        prev = local64_read(&hwc->prev_count);
>> +
>> +        if (type == DWC_PCIE_LANE_EVENT)
>> +            dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
>> +        else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +            dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
>> +                                &counter);
>> +        else
>> +            dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>> +
>> +        now = counter;
>> +    } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
>> +
>> +    delta = now - prev;
>
> This can be overflow? better to add a mask to avoid possible overflow.

I think it can not. This Root Complex supports up to PCIe Gen5 (32 GT/s)
and one root port support up to x16 lanes, with peek bandwidth 64 GB/s.
On Yitian 710, one root port is x4 lane with peak bandwidth 16 GB/s.
The counter is 64 bit width with 16 bytes unit.

2^64*16/(64*10^9)/60/60/24/365 = 146 years

For x16 root port, it will not overflow within 146 yeasr and for yitian 710,
it will never overflow in my life too.

>
>> +
>> +    local64_add(delta, &event->count);
>> +}
>> +
>> +static int dwc_pcie_pmu_event_init(struct perf_event *event)
>> +{
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct perf_event *sibling;
>> +
>> +    if (event->attr.type != event->pmu->type)
>> +        return -ENOENT;
>> +
>> +    if (hwc->sample_period) {
>> +        dev_dbg(pcie_pmu->dev, "Sampling not supported\n");
>
> should use dev_err() if treating it as an error.
>
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (event->cpu < 0) {
>> +        dev_dbg(pcie_pmu->dev, "Per-task mode not supported\n");
>
> ditto.

I will use dev_err() instead.

>
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    event->cpu = pcie_pmu->on_cpu;
>> +
>> +    if (event->group_leader != event &&
>> +        !is_software_event(event->group_leader))
>> +        return -EINVAL;
>> +
>> +    for_each_sibling_event(sibling, event->group_leader) {
>> +        if (sibling != event && !is_software_event(sibling))
>> +            return -EINVAL;
>> +    }
>> +
>> +    hwc->idx = -1;
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
>> +{
>> +    local64_set(&hwc->prev_count, 0);
>> +}
>> +
>> +static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
>> +{
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +
>> +    hwc->state = 0;
>> +    dwc_pcie_pmu_set_period(hwc);
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT)
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
>> +    else
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>> +}
>> +
>> +static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +
>> +    if (event->hw.state & PERF_HES_STOPPED)
>> +        return;
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT)
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>> +    else
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>> +
>> +    dwc_pcie_pmu_event_update(event);
>> +}
>> +
>> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +    int event_id = DWC_PCIE_EVENT_ID(event);
>> +    int lane = DWC_PCIE_EVENT_LANE(event);
>> +
>> +    /* Only one counter and it is in use */
>> +    if (rp_info->event)
>> +        return -ENOSPC;
>> +
>> +    rp_info->event = event;
>> +
>> +    hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT) {
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>> +        dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
>> +        dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
>> +        dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
>> +    } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>> +        dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
>> +    } else {
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (flags & PERF_EF_START)
>> +        dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
>> +
>> +    perf_event_update_userpage(event);
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +
>> +    dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
>> +    perf_event_update_userpage(event);
>> +    rp_info->event = NULL;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_read(struct perf_event *event)
>> +{
>> +    dwc_pcie_pmu_event_update(event);
>> +}
>> +
>> +static struct dwc_event_counters event_array[] = {
>> +    {"tx_ack_dllp", 0x600},
>> +    {"tx_update_fc_dllp", 0x601},
>> +    {"rx_ack_dllp", 0x602},
>> +    {"rx_update_fc_dllp", 0x603},
>> +    {"rx_nulified_tlp", 0x604},
>> +    {"tx_nulified_tlp", 0x605},
>> +    {"rx_duplicate_tlp", 0x606},
>> +    {"tx_memory_write", 0x700},
>> +    {"tx_memory_read", 0x701},
>> +    {"tx_configuration_write", 0x702},
>> +    {"tx_configuration_read", 0x703},
>> +    {"tx_io_write", 0x704},
>> +    {"tx_io_read", 0x705},
>> +    {"tx_completion_without_data", 0x706},
>> +    {"tx_completion_with_data", 0x707},
>> +    {"tx_message_tlp", 0x708},
>> +    {"tx_atomic", 0x709},
>> +    {"tx_tlp_with_prefix", 0x70A},
>> +    {"rx_memory_write", 0x70B},
>> +    {"rx_memory_read", 0x70C},
>> +    {"rx_io_write", 0x70F},
>> +    {"rx_io_read", 0x710},
>> +    {"rx_completion_without_data", 0x711},
>> +    {"rx_completion_with_data", 0x712},
>> +    {"rx_message_tlp", 0x713},
>> +    {"rx_atomic", 0x714},
>> +    {"rx_tlp_with_prefix", 0x715},
>> +    {"tx_ccix_tlp", 0x716},
>> +    {"rx_ccix_tlp", 0x717},
>> +};
>> +
>> +static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
>> +                  struct dwc_pcie_rp_info *rp_info)
>> +{
>> +    int i, j;
>> +    char lane[8];
>> +    const char tmp[64];
>> +    int events_per_lane;
>> +    int num_lane_events;
>> +    int time_base_count;
>> +    int num_attrs, attr_idx;
>> +    struct dwc_pcie_event_attr *lane_attrs;
>> +    struct attribute **pmu_attrs;
>> +
>> +    memset((void *)tmp, 0, sizeof(tmp));
>> +    memset((void *)lane, 0, sizeof(lane));
>> +    time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
>> +    events_per_lane = ARRAY_SIZE(event_array);
>> +    num_lane_events = rp_info->num_lanes * events_per_lane;
>> +    num_attrs = time_base_count + num_lane_events;
>> +
>> +    rp_info->lane_event_attrs =
>> +        devm_kcalloc(priv->dev, num_lane_events,
>> +                sizeof(struct dwc_pcie_event_attr),
>> +                GFP_KERNEL);
>> +    if (!rp_info->lane_event_attrs)
>> +        return -ENOMEM;
>> +    lane_attrs = rp_info->lane_event_attrs;
>> +    rp_info->pcie_pmu_event_attrs =
>> +        devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
>> +             GFP_KERNEL);
>> +    if (!rp_info->pcie_pmu_event_attrs)
>> +        return -ENOMEM;
>> +    pmu_attrs = rp_info->pcie_pmu_event_attrs;
>> +
>> +    for (i = 0; i < num_lane_events; i++) {
>> +        lane_attrs[i].attr.attr.name =
>> +            devm_kzalloc(priv->dev, sizeof(char)
>> +                 * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
>> +        if (!lane_attrs[i].attr.attr.name)
>> +            return -ENOMEM;
>> +    }
>> +
>> +    attr_idx = 0;
>> +    for (i = 0; i < rp_info->num_lanes; i++) {
>> +        sprintf(lane, "_lane%d", i);
>> +
>> +        for (j = 0; j < events_per_lane; j++) {
>> +            int pos = i * events_per_lane + j;
>> +
>> +            strcat((char *)tmp, event_array[j].name);
>> +            strcat((char *)tmp, lane);
>> +            memcpy((void *)lane_attrs[pos].attr.attr.name,
>> +                   (void *)tmp,
>> +                   sizeof(tmp));
>> +
>> +            lane_attrs[pos].attr.attr.mode =
>> +                VERIFY_OCTAL_PERMISSIONS(0444);
>> +            lane_attrs[pos].attr.show = dwc_pcie_event_show;
>> +            lane_attrs[pos].attr.store = NULL;
>> +            lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
>> +            lane_attrs[pos].eventid = event_array[j].event_id;
>> +            lane_attrs[pos].lane = i;
>> +            pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
>> +
>> +            memset((void *)tmp, 0, sizeof(tmp));
>> +        }
>> +    }
>> +
>> +    for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
>> +        pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
>> +
>> +    rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
>> +
>> +    rp_info->pcie_pmu_event_attrs_group.name = event_attr_name;
>> +    rp_info->pcie_pmu_event_attrs_group.is_visible =
>> +        pcie_pmu_event_attr_is_visible;
>> +    rp_info->pcie_pmu_event_attrs_group.attrs =
>> +        rp_info->pcie_pmu_event_attrs;
>> +
>> +    rp_info->pcie_pmu_attr_groups[0] =
>> +        &rp_info->pcie_pmu_event_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[3] = NULL;
>> +
>> +    return 0;
>> +}
>> +
>> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
>> +                struct dwc_pcie_rp_info *rp_info)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct device *dev;
>> +    char *name;
>> +    int ret;
>> +
>> +    pcie_pmu = &rp_info->pcie_pmu;
>> +    dev = &rp_info->pdev->dev;
>> +
>> +    ret = dwc_pcie_pmu_attr_init(priv, rp_info);
>> +    if (ret) {
>> +        pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
>> +        return ret;
>> +    }
>> +
>> +    pcie_pmu->dev = dev;
>> +    pcie_pmu->pmu = (struct pmu) {
>> +        .module        = THIS_MODULE,
>> +        .task_ctx_nr    = perf_invalid_context,
>> +        .pmu_enable    = NULL,
>> +        .pmu_disable    = NULL,
>> +        .event_init    = dwc_pcie_pmu_event_init,
>> +        .add        = dwc_pcie_pmu_event_add,
>> +        .del        = dwc_pcie_pmu_event_del,
>> +        .start        = dwc_pcie_pmu_event_start,
>> +        .stop        = dwc_pcie_pmu_event_stop,
>> +        .read        = dwc_pcie_pmu_event_read,
>> +        .attr_groups    = rp_info->pcie_pmu_attr_groups,
>> +        .capabilities    = PERF_PMU_CAP_NO_EXCLUDE,
>> +    };
>> +
>> +    name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
>> +                  rp_info->bdf);
>> +    if (!name)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Pick one CPU to be the preferred one on local NUMA node.
>> +     *
>> +     * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
>> +     * is a uncore PMU device.
>> +     */
>> +    pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
>> +    ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>> +    if (ret) {
>> +        pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
>> +                 rp_info->bdf);
>> +        return ret;
>> +    }
>> +
>> +    rp_info->pmu_is_register = DWC_PCIE_PMU_HAS_REGISTER;
>
> Seems a boolean type, and no need define another macro.

Yes, will change to boolean.
>
>> +
>> +    return ret;
>
> return 0;
>

Will fix it.

>> +}
>> +
>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>> +{
>> +    struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>> +        if (rp_info->pmu_is_register) {
>> +            pcie_pmu = &rp_info->pcie_pmu;
>> +            perf_pmu_unregister(&pcie_pmu->pmu);
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>> +{
>> +    int ret;
>> +    struct dwc_pcie_pmu_priv *priv;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>> +    if (!priv)
>> +        return -ENOMEM;
>> +
>> +    priv->dev = &pdev->dev;
>> +    platform_set_drvdata(pdev, priv);
>> +
>> +    /* If RAS_DES PMU is not supported on current platform, keep silent */
>> +    ret = dwc_pcie_ras_des_discover(priv);
>> +    if (ret)
>> +        return ret;
>> +
>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>> +        struct pci_dev *rp = rp_info->pdev;
>> +
>> +        ret = __dwc_pcie_pmu_probe(priv, rp_info);
>> +        if (ret) {
>> +            dev_err(&rp->dev, "PCIe PMU probe fail\n");
>> +            goto pmu_unregister;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +pmu_unregister:
>> +    dwc_pcie_pmu_remove(pdev);
>> +
>> +    return ret;
>> +}
>> +
>> +static struct platform_driver dwc_pcie_pmu_driver = {
>> +    .probe = dwc_pcie_pmu_probe,
>> +    .remove = dwc_pcie_pmu_remove,
>> +    .driver = {.name = "dwc_pcie_pmu",},
>> +};
>> +
>> +static int __init dwc_pcie_pmu_init(void)
>> +{
>> +    int ret;
>> +
>> +    ret = platform_driver_register(&dwc_pcie_pmu_driver);
>> +
>> +    if (ret)
>> +        return ret;
>> +
>> +    dwc_pcie_pmu_dev =
>> +        platform_device_register_simple("dwc_pcie_pmu", -1, NULL, 0);
>
> You can use PLATFORM_DEVID_NONE macro instead of the magic number '-1'.
>

Good catch, will use PLATFORM_DEVID_NONE.

>> +    if (IS_ERR(dwc_pcie_pmu_dev)) {
>> +        platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +        return PTR_ERR(dwc_pcie_pmu_dev);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void __exit dwc_pcie_pmu_exit(void)
>> +{
>> +    platform_device_unregister(dwc_pcie_pmu_dev);
>> +    platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +}
>> +
>> +module_init(dwc_pcie_pmu_init);
>> +module_exit(dwc_pcie_pmu_exit);
>> +
>> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
>> +MODULE_AUTHOR("[email protected]");
>
> Please correct the format for module author:
> MODULE_AUTHOR("Xue Shuai <[email protected]>");

Will fix it.

>
>> +MODULE_AUTHOR("[email protected]");
>> +MODULE_LICENSE("GPL v2");


Thank you for your comments :)

Best Regards,
Shuai

2023-04-17 06:18:26

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 0/3] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

Changes since v2 addressing comments from Baolin:
- remove redundant macro definitions
- use dev_err to print error message
- change pmu_is_register to boolean
- use PLATFORM_DEVID_NONE macro
- fix module author format

Changes since v1:

1. address comments from Jonathan:
- drop marco for PMU name and VSEC version
- simplify code with PCI standard marco
- simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
- name register filed with single _ instead double
- wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
- check vendor id while matching vesc with pci_find_vsec_capability()
- remove RP_NUM_MAX and use a list to organize PMU devices for rootports
- replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
- comments on riping register together

2. address comments from Bjorn:
- rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
- rename cap_pos to ras_des
- simplify declare of device_attribute with DEVICE_ATTR_RO
- simplify code with PCI standard macro and API like pcie_get_width_cap()
- fix some code style problem and typo
- drop meaningless snaity check of container_of

3. address comments from Yicong:
- use sysfs_emit() to replace sprintf()
- simplify iteration of pci device with for_each_pci_dev
- pick preferred CPUs on a near die and add comments
- unregister PMU drivers only for failed ones
- log on behalf PMU device and give more hint
- fix some code style problem

(Thanks for all comments and they are very valuable to me)

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.


Shuai Xue (3):
docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
drivers/perf: add DesignWare PCIe PMU driver
MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

.../admin-guide/perf/dwc_pcie_pmu.rst | 61 ++
Documentation/admin-guide/perf/index.rst | 1 +
MAINTAINERS | 6 +
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 855 ++++++++++++++++++
6 files changed, 931 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
create mode 100644 drivers/perf/dwc_pcie_pmu.c

--
2.20.1.12.g72788fdb

2023-04-17 06:18:55

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 1/3] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver

Alibaba's T-Head Yitan 710 SoC is built on Synopsys' widely deployed and
silicon-proven DesignWare Core PCIe controller which implements PMU for
performance and functional debugging to facilitate system maintenance.
Document it to provide guidance on how to use it.

Signed-off-by: Shuai Xue <[email protected]>
---
.../admin-guide/perf/dwc_pcie_pmu.rst | 61 +++++++++++++++++++
Documentation/admin-guide/perf/index.rst | 1 +
2 files changed, 62 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst

diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..0672e959ebe4
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,61 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+To facilitate collection of statistics, Synopsys DesignWare Cores PCIe
+controller provides the following two features:
+
+- Time Based Analysis (RX/TX data throughput and time spent in each
+ low-power LTSSM state)
+- Lane Event counters (Error and Non-Error for lanes)
+
+The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
+only register counters provided by each PCIe Root Port.
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+
+The counters are 64-bit width and measure data in two categories,
+
+- percentage of time does the controller stay in LTSSM state in a
+ configurable duration. The measurement range of each Event in Group#0.
+- amount of data processed (Units of 16 bytes). The measurement range of
+ each Event in Group#1.
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information in
+specific lane by the controller.
+
+The counters are 32-bit width and the measured event is select by:
+
+- Group i
+- Event j within the Group i
+- and Lane k
+
+Some of the event counters only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver add PMU devices for each PCIe Root Port. And the PMU device is
+named based the BDF of Root Port. For example,
+
+ 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
+
+ $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+average RX bandwidth can be calculated like this:
+
+ PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 9de64a40adab..11a80cd28a2e 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -19,5 +19,6 @@ Performance monitor support
arm_dsu_pmu
thunderx2-pmu
alibaba_pmu
+ dwc_pcie_pmu
nvidia-pmu
meson-ddr-pmu
--
2.20.1.12.g72788fdb

2023-04-17 06:19:17

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver

This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
Core controller IP which provides statistics feature. The PMU is not a PCIe
Root Complex integrated End Point(RCiEP) device but only register counters
provided by each PCIe Root Port.

To facilitate collection of statistics the controller provides the
following two features for each Root Port:

- Time Based Analysis (RX/TX data throughput and time spent in each
low-power LTSSM state)
- Event counters (Error and Non-Error for lanes)

Note, only one counter for each type and does not overflow interrupt.

This driver adds PMU devices for each PCIe Root Port. And the PMU device is
named based the BDF of Root Port. For example,

30:03.0 PCI bridge: Device 1ded:8000 (rev 01)

the PMU device name for this Root Port is dwc_rootport_3018.

Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

$# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/

average RX bandwidth can be calculated like this:

PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 855 ++++++++++++++++++++++++++++++++++++
3 files changed, 863 insertions(+)
create mode 100644 drivers/perf/dwc_pcie_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 66c259000a44..57bce3880cba 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
Enable perf support for Marvell DDR Performance monitoring
event on CN10K platform.

+config DWC_PCIE_PMU
+ tristate "Enable Synopsys DesignWare PCIe PMU Support"
+ depends on ARM64 || (COMPILE_TEST && 64BIT)
+ help
+ Enable perf support for Synopsys DesignWare PCIe PMU Performance
+ monitoring event on Yitian 710 platform.
+
source "drivers/perf/arm_cspmu/Kconfig"

source "drivers/perf/amlogic/Kconfig"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 13e45da61100..3f233e96524e 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
+obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
new file mode 100644
index 000000000000..b7691cfe0df4
--- /dev/null
+++ b/drivers/perf/dwc_pcie_pmu.c
@@ -0,0 +1,855 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synopsys DesignWare PCIe PMU driver
+ *
+ * Copyright (C) 2021-2023 Alibaba Inc.
+ */
+
+#include <linux/pci.h>
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cpuhotplug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+#define PCI_VENDOR_ID_ALIBABA 0x1ded
+
+#define ATTRI_NAME_MAX_SIZE 32
+#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
+
+#define DWC_PCIE_EVENT_CNT_CTL 0x8
+#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
+#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
+#define DWC_PCIE_CNT_STATUS BIT(7)
+#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
+#define DWC_PCIE_PER_EVENT_OFF FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
+#define DWC_PCIE_PER_EVENT_ON FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
+#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
+#define DWC_PCIE_EVENT_PER_CLEAR 0x1
+
+#define DWC_PCIE_EVENT_CNT_DATA 0xC
+
+#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
+#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
+#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
+#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
+#define DWC_PCIE_DURATION_1MS 0x1
+#define DWC_PCIE_DURATION_10MS 0x2
+#define DWC_PCIE_DURATION_100MS 0x3
+#define DWC_PCIE_DURATION_1S 0x4
+#define DWC_PCIE_DURATION_2S 0x5
+#define DWC_PCIE_DURATION_4S 0x6
+#define DWC_PCIE_DURATION_4US 0xff
+#define DWC_PCIE_TIME_BASED_TIMER_START BIT(0)
+#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
+
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
+
+/* Event attributes */
+#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
+#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
+#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
+
+#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
+#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
+#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
+
+enum dwc_pcie_event_type {
+ DWC_PCIE_TYPE_INVALID,
+ DWC_PCIE_TIME_BASE_EVENT,
+ DWC_PCIE_LANE_EVENT,
+};
+
+struct dwc_event_counters {
+ const char name[32];
+ u32 event_id;
+};
+
+struct dwc_pcie_pmu {
+ struct hlist_node node;
+ unsigned int on_cpu;
+ struct pmu pmu;
+ struct device *dev;
+};
+
+struct dwc_pcie_rp_info {
+ u32 bdf;
+ u32 ras_des;
+ u32 num_lanes;
+
+ struct list_head rp_node;
+ struct pci_dev *pdev;
+ struct dwc_pcie_pmu pcie_pmu;
+ bool pmu_is_register;
+ struct perf_event *event;
+
+ struct dwc_pcie_event_attr *lane_event_attrs;
+ struct attribute **pcie_pmu_event_attrs;
+ struct attribute_group pcie_pmu_event_attrs_group;
+ const struct attribute_group *pcie_pmu_attr_groups[4];
+};
+
+struct dwc_pcie_pmu_priv {
+ struct device *dev;
+ u32 pcie_ctrl_num;
+ struct list_head rp_infos;
+};
+
+#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
+
+static struct platform_device *dwc_pcie_pmu_dev;
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
+
+ return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL
+};
+
+static struct attribute_group pcie_pmu_cpumask_attrs_group = {
+ .attrs = dwc_pcie_pmu_cpumask_attrs,
+};
+
+struct dwc_pcie_format_attr {
+ struct device_attribute attr;
+ u64 field;
+ int config;
+};
+
+static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
+ int lo = __ffs(fmt->field), hi = __fls(fmt->field);
+
+ if (lo == hi)
+ return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
+
+ if (!fmt->config)
+ return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
+
+ return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
+ hi);
+}
+
+#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
+ (&((struct dwc_pcie_format_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
+ .config = _cfg, \
+ .field = _fld, \
+ }})[0].attr.attr)
+
+#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
+
+static struct attribute *dwc_pcie_format_attrs[] = {
+ dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
+ dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
+ dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
+ NULL,
+};
+
+static struct attribute_group pcie_pmu_format_attrs_group = {
+ .name = "format",
+ .attrs = dwc_pcie_format_attrs,
+};
+
+struct dwc_pcie_event_attr {
+ struct device_attribute attr;
+ enum dwc_pcie_event_type type;
+ u16 eventid;
+ u8 lane;
+};
+
+static ssize_t dwc_pcie_event_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ struct dwc_pcie_event_attr *eattr;
+
+ eattr = container_of(attr, typeof(*eattr), attr);
+
+ if (eattr->type == DWC_PCIE_LANE_EVENT)
+ return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
+ (unsigned long)eattr->eventid,
+ (unsigned long)eattr->type,
+ (unsigned long)eattr->lane);
+
+ return sprintf(page, "eventid=0x%lx, type=0x%lx",
+ (unsigned long)eattr->eventid,
+ (unsigned long)eattr->type);
+}
+
+#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
+ (&((struct dwc_pcie_event_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
+ .type = _type, \
+ .eventid = _eventid, \
+ .lane = _lane, \
+ }})[0].attr.attr)
+
+#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
+
+static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
+ /* Group #0 */
+ DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
+ /* Group #1 */
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
+ DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
+ NULL
+};
+
+static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
+ struct attribute *attr,
+ int unuse)
+{
+ return attr->mode;
+}
+
+static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
+{
+ return (pci_is_pcie(pdev) &&
+ pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
+}
+
+static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
+{
+ int index = 0;
+ struct pci_dev *pdev = NULL;
+ struct dwc_pcie_rp_info *rp_info;
+
+ INIT_LIST_HEAD(&priv->rp_infos);
+
+ /* Match the rootport with VSEC_RAS_DES_ID */
+ for_each_pci_dev(pdev) {
+ u16 vsec;
+ u32 val;
+
+ if (!pci_dev_is_rootport(pdev))
+ continue;
+
+ rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
+ if (!rp_info)
+ return -ENOMEM;
+
+ rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
+ rp_info->pdev = pdev;
+
+ vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
+ DWC_PCIE_VSEC_RAS_DES_ID);
+ if (!vsec)
+ continue;
+
+ pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
+ if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
+ PCI_VNDR_HEADER_LEN(val) != 0x100)
+ continue;
+ pci_dbg(pdev,
+ "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
+
+ rp_info->ras_des = vsec;
+ rp_info->num_lanes = pcie_get_width_cap(pdev);
+
+ list_add(&rp_info->rp_node, &priv->rp_infos);
+ index++;
+ }
+
+ if (!index)
+ return -ENODEV;
+
+ priv->pcie_ctrl_num = index;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
+ int event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
+ int lane, int event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
+ u32 enable)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ if (enable)
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
+ else
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
+ u32 enable)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ &val);
+
+ if (enable)
+ val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ else
+ val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ val);
+}
+
+static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
+ u64 *counter)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
+ *counter = val;
+}
+
+/* The results are cleared when next measurement starts. */
+static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
+ u16 ras_des, u64 *counter)
+{
+ u32 val;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
+ &val);
+ *counter = val;
+ *counter <<= 32;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
+ &val);
+
+ *counter += val;
+}
+
+static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
+ u16 ras_des, u32 event_id)
+{
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ &val);
+
+ val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
+
+ /*
+ * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
+ * use it with any manually controlled duration.
+ */
+ val |= FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
+ DWC_PCIE_DURATION_MANUAL_CTL);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ val);
+}
+
+static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
+{
+ struct dwc_pcie_rp_info *rp_info;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
+
+ rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
+
+ return rp_info;
+}
+
+static void dwc_pcie_pmu_event_update(struct perf_event *event)
+{
+ u64 counter;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ u64 delta, prev, now;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
+ &counter);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+
+ now = counter;
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ delta = now - prev;
+
+ local64_add(delta, &event->count);
+}
+
+static int dwc_pcie_pmu_event_init(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct perf_event *sibling;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ if (hwc->sample_period) {
+ dev_err(pcie_pmu->dev, "Sampling not supported\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (event->cpu < 0) {
+ dev_err(pcie_pmu->dev, "Per-task mode not supported\n");
+ return -EOPNOTSUPP;
+ }
+
+ event->cpu = pcie_pmu->on_cpu;
+
+ if (event->group_leader != event &&
+ !is_software_event(event->group_leader))
+ return -EINVAL;
+
+ for_each_sibling_event(sibling, event->group_leader) {
+ if (sibling != event && !is_software_event(sibling))
+ return -EINVAL;
+ }
+
+ hwc->idx = -1;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
+{
+ local64_set(&hwc->prev_count, 0);
+}
+
+static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ hwc->state = 0;
+ dwc_pcie_pmu_set_period(hwc);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+}
+
+static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
+ else
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+
+ dwc_pcie_pmu_event_update(event);
+}
+
+static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+ struct pci_dev *pdev = rp_info->pdev;
+ u16 ras_des = rp_info->ras_des;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ int event_id = DWC_PCIE_EVENT_ID(event);
+ int lane = DWC_PCIE_EVENT_LANE(event);
+
+ /* Only one counter and it is in use */
+ if (rp_info->event)
+ return -ENOSPC;
+
+ rp_info->event = event;
+
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
+ dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
+ dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
+ dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
+ } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
+ dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
+ dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
+ } else {
+ dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
+ return -EINVAL;
+ }
+
+ if (flags & PERF_EF_START)
+ dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
+
+ dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
+ perf_event_update_userpage(event);
+ rp_info->event = NULL;
+}
+
+static void dwc_pcie_pmu_event_read(struct perf_event *event)
+{
+ dwc_pcie_pmu_event_update(event);
+}
+
+static struct dwc_event_counters event_array[] = {
+ {"tx_ack_dllp", 0x600},
+ {"tx_update_fc_dllp", 0x601},
+ {"rx_ack_dllp", 0x602},
+ {"rx_update_fc_dllp", 0x603},
+ {"rx_nulified_tlp", 0x604},
+ {"tx_nulified_tlp", 0x605},
+ {"rx_duplicate_tlp", 0x606},
+ {"tx_memory_write", 0x700},
+ {"tx_memory_read", 0x701},
+ {"tx_configuration_write", 0x702},
+ {"tx_configuration_read", 0x703},
+ {"tx_io_write", 0x704},
+ {"tx_io_read", 0x705},
+ {"tx_completion_without_data", 0x706},
+ {"tx_completion_with_data", 0x707},
+ {"tx_message_tlp", 0x708},
+ {"tx_atomic", 0x709},
+ {"tx_tlp_with_prefix", 0x70A},
+ {"rx_memory_write", 0x70B},
+ {"rx_memory_read", 0x70C},
+ {"rx_io_write", 0x70F},
+ {"rx_io_read", 0x710},
+ {"rx_completion_without_data", 0x711},
+ {"rx_completion_with_data", 0x712},
+ {"rx_message_tlp", 0x713},
+ {"rx_atomic", 0x714},
+ {"rx_tlp_with_prefix", 0x715},
+ {"tx_ccix_tlp", 0x716},
+ {"rx_ccix_tlp", 0x717},
+};
+
+static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
+ struct dwc_pcie_rp_info *rp_info)
+{
+ int i, j;
+ char lane[8];
+ const char tmp[64];
+ int events_per_lane;
+ int num_lane_events;
+ int time_base_count;
+ int num_attrs, attr_idx;
+ struct dwc_pcie_event_attr *lane_attrs;
+ struct attribute **pmu_attrs;
+
+ memset((void *)tmp, 0, sizeof(tmp));
+ memset((void *)lane, 0, sizeof(lane));
+ time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
+ events_per_lane = ARRAY_SIZE(event_array);
+ num_lane_events = rp_info->num_lanes * events_per_lane;
+ num_attrs = time_base_count + num_lane_events;
+
+ rp_info->lane_event_attrs =
+ devm_kcalloc(priv->dev, num_lane_events,
+ sizeof(struct dwc_pcie_event_attr),
+ GFP_KERNEL);
+ if (!rp_info->lane_event_attrs)
+ return -ENOMEM;
+ lane_attrs = rp_info->lane_event_attrs;
+ rp_info->pcie_pmu_event_attrs =
+ devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
+ GFP_KERNEL);
+ if (!rp_info->pcie_pmu_event_attrs)
+ return -ENOMEM;
+ pmu_attrs = rp_info->pcie_pmu_event_attrs;
+
+ for (i = 0; i < num_lane_events; i++) {
+ lane_attrs[i].attr.attr.name =
+ devm_kzalloc(priv->dev, sizeof(char)
+ * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
+ if (!lane_attrs[i].attr.attr.name)
+ return -ENOMEM;
+ }
+
+ attr_idx = 0;
+ for (i = 0; i < rp_info->num_lanes; i++) {
+ sprintf(lane, "_lane%d", i);
+
+ for (j = 0; j < events_per_lane; j++) {
+ int pos = i * events_per_lane + j;
+
+ strcat((char *)tmp, event_array[j].name);
+ strcat((char *)tmp, lane);
+ memcpy((void *)lane_attrs[pos].attr.attr.name,
+ (void *)tmp,
+ sizeof(tmp));
+
+ lane_attrs[pos].attr.attr.mode =
+ VERIFY_OCTAL_PERMISSIONS(0444);
+ lane_attrs[pos].attr.show = dwc_pcie_event_show;
+ lane_attrs[pos].attr.store = NULL;
+ lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
+ lane_attrs[pos].eventid = event_array[j].event_id;
+ lane_attrs[pos].lane = i;
+ pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
+
+ memset((void *)tmp, 0, sizeof(tmp));
+ }
+ }
+
+ for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
+ pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
+
+ rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
+
+ rp_info->pcie_pmu_event_attrs_group.name = "events";
+ rp_info->pcie_pmu_event_attrs_group.is_visible =
+ pcie_pmu_event_attr_is_visible;
+ rp_info->pcie_pmu_event_attrs_group.attrs =
+ rp_info->pcie_pmu_event_attrs;
+
+ rp_info->pcie_pmu_attr_groups[0] =
+ &rp_info->pcie_pmu_event_attrs_group;
+ rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
+ rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
+ rp_info->pcie_pmu_attr_groups[3] = NULL;
+
+ return 0;
+}
+
+static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
+ struct dwc_pcie_rp_info *rp_info)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct device *dev;
+ char *name;
+ int ret;
+
+ pcie_pmu = &rp_info->pcie_pmu;
+ dev = &rp_info->pdev->dev;
+
+ ret = dwc_pcie_pmu_attr_init(priv, rp_info);
+ if (ret) {
+ pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
+ return ret;
+ }
+
+ pcie_pmu->dev = dev;
+ pcie_pmu->pmu = (struct pmu) {
+ .module = THIS_MODULE,
+ .task_ctx_nr = perf_invalid_context,
+ .pmu_enable = NULL,
+ .pmu_disable = NULL,
+ .event_init = dwc_pcie_pmu_event_init,
+ .add = dwc_pcie_pmu_event_add,
+ .del = dwc_pcie_pmu_event_del,
+ .start = dwc_pcie_pmu_event_start,
+ .stop = dwc_pcie_pmu_event_stop,
+ .read = dwc_pcie_pmu_event_read,
+ .attr_groups = rp_info->pcie_pmu_attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
+ };
+
+ name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
+ rp_info->bdf);
+ if (!name)
+ return -ENOMEM;
+
+ /*
+ * Pick one CPU to be the preferred one on local NUMA node.
+ *
+ * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
+ * is a uncore PMU device.
+ */
+ pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
+ ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
+ if (ret) {
+ pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
+ rp_info->bdf);
+ return ret;
+ }
+
+ rp_info->pmu_is_register = true;
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_remove(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct dwc_pcie_rp_info *rp_info;
+
+ list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
+ if (rp_info->pmu_is_register) {
+ pcie_pmu = &rp_info->pcie_pmu;
+ perf_pmu_unregister(&pcie_pmu->pmu);
+ }
+ }
+ return 0;
+}
+
+static int dwc_pcie_pmu_probe(struct platform_device *pdev)
+{
+ int ret;
+ struct dwc_pcie_pmu_priv *priv;
+ struct dwc_pcie_rp_info *rp_info;
+
+ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->dev = &pdev->dev;
+ platform_set_drvdata(pdev, priv);
+
+ /* If RAS_DES PMU is not supported on current platform, keep silent */
+ ret = dwc_pcie_ras_des_discover(priv);
+ if (ret)
+ return ret;
+
+ list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
+ struct pci_dev *rp = rp_info->pdev;
+
+ ret = __dwc_pcie_pmu_probe(priv, rp_info);
+ if (ret) {
+ dev_err(&rp->dev, "PCIe PMU probe fail\n");
+ goto pmu_unregister;
+ }
+ }
+
+ return 0;
+
+pmu_unregister:
+ dwc_pcie_pmu_remove(pdev);
+
+ return ret;
+}
+
+static struct platform_driver dwc_pcie_pmu_driver = {
+ .probe = dwc_pcie_pmu_probe,
+ .remove = dwc_pcie_pmu_remove,
+ .driver = {.name = "dwc_pcie_pmu",},
+};
+
+static int __init dwc_pcie_pmu_init(void)
+{
+ int ret;
+
+ ret = platform_driver_register(&dwc_pcie_pmu_driver);
+
+ if (ret)
+ return ret;
+
+ dwc_pcie_pmu_dev = platform_device_register_simple(
+ "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
+ if (IS_ERR(dwc_pcie_pmu_dev)) {
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+ return PTR_ERR(dwc_pcie_pmu_dev);
+ }
+
+ return 0;
+}
+
+static void __exit dwc_pcie_pmu_exit(void)
+{
+ platform_device_unregister(dwc_pcie_pmu_dev);
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+}
+
+module_init(dwc_pcie_pmu_init);
+module_exit(dwc_pcie_pmu_exit);
+
+MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
+MODULE_AUTHOR("Shuai xue <[email protected]>");
+MODULE_AUTHOR("Wen Cheng <[email protected]>");
+MODULE_LICENSE("GPL v2");
--
2.20.1.12.g72788fdb

2023-04-17 06:19:46

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v3 3/3] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 90abe83c02f3..6d96e5bb8174 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20279,6 +20279,12 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/dw_mmc*

+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M: Shuai Xue <[email protected]>
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
M: Eugeniy Paltsev <[email protected]>
S: Supported
--
2.20.1.12.g72788fdb

2023-04-18 01:55:05

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 4/17/2023 9:16 AM, Shuai Xue wrote:

[snip]

>>> +
>>> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
>>> +{
>>> +    u64 counter;
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +    struct pci_dev *pdev = rp_info->pdev;
>>> +    u16 ras_des = rp_info->ras_des;
>>> +    struct hw_perf_event *hwc = &event->hw;
>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +    u64 delta, prev, now;
>>> +
>>> +    do {
>>> +        prev = local64_read(&hwc->prev_count);
>>> +
>>> +        if (type == DWC_PCIE_LANE_EVENT)
>>> +            dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
>>> +        else if (type == DWC_PCIE_TIME_BASE_EVENT)
>>> +            dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
>>> +                                &counter);
>>> +        else
>>> +            dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>> +
>>> +        now = counter;
>>> +    } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
>>> +
>>> +    delta = now - prev;
>>
>> This can be overflow? better to add a mask to avoid possible overflow.
>
> I think it can not. This Root Complex supports up to PCIe Gen5 (32 GT/s)
> and one root port support up to x16 lanes, with peek bandwidth 64 GB/s.
> On Yitian 710, one root port is x4 lane with peak bandwidth 16 GB/s.
> The counter is 64 bit width with 16 bytes unit.
>
> 2^64*16/(64*10^9)/60/60/24/365 = 146 years
>
> For x16 root port, it will not overflow within 146 yeasr and for yitian 710,
> it will never overflow in my life too.

However the lane event counter is 32bit, so still a maximum counter mask
is preferable.

2023-04-18 23:33:42

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver

On 2023-04-17 07:17, Shuai Xue wrote:
> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature. The PMU is not a PCIe
> Root Complex integrated End Point(RCiEP) device but only register counters
> provided by each PCIe Root Port.
>
> To facilitate collection of statistics the controller provides the
> following two features for each Root Port:
>
> - Time Based Analysis (RX/TX data throughput and time spent in each
> low-power LTSSM state)
> - Event counters (Error and Non-Error for lanes)
>
> Note, only one counter for each type and does not overflow interrupt.
>
> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
> named based the BDF of Root Port. For example,
>
> 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>
> the PMU device name for this Root Port is dwc_rootport_3018.
>
> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>
> $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>
> average RX bandwidth can be calculated like this:
>
> PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/perf/Kconfig | 7 +
> drivers/perf/Makefile | 1 +
> drivers/perf/dwc_pcie_pmu.c | 855 ++++++++++++++++++++++++++++++++++++
> 3 files changed, 863 insertions(+)
> create mode 100644 drivers/perf/dwc_pcie_pmu.c
>
> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
> index 66c259000a44..57bce3880cba 100644
> --- a/drivers/perf/Kconfig
> +++ b/drivers/perf/Kconfig
> @@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
> Enable perf support for Marvell DDR Performance monitoring
> event on CN10K platform.
>
> +config DWC_PCIE_PMU
> + tristate "Enable Synopsys DesignWare PCIe PMU Support"
> + depends on ARM64 || (COMPILE_TEST && 64BIT)

Is there anything here that really depends on 64BIT? Nothing obvious
stands out.

> + help
> + Enable perf support for Synopsys DesignWare PCIe PMU Performance
> + monitoring event on Yitian 710 platform.
> +
> source "drivers/perf/arm_cspmu/Kconfig"
>
> source "drivers/perf/amlogic/Kconfig"
> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
> index 13e45da61100..3f233e96524e 100644
> --- a/drivers/perf/Makefile
> +++ b/drivers/perf/Makefile
> @@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
> obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
> obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
> obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
> +obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
> obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
> obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
> new file mode 100644
> index 000000000000..b7691cfe0df4
> --- /dev/null
> +++ b/drivers/perf/dwc_pcie_pmu.c
> @@ -0,0 +1,855 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Synopsys DesignWare PCIe PMU driver
> + *
> + * Copyright (C) 2021-2023 Alibaba Inc.
> + */
> +
> +#include <linux/pci.h>
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/smp.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>
> +#define PCI_VENDOR_ID_ALIBABA 0x1ded

Shouldn't that belong in linux/pci_ids.h?

> +
> +#define ATTRI_NAME_MAX_SIZE 32
> +#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
> +
> +#define DWC_PCIE_EVENT_CNT_CTL 0x8
> +#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
> +#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
> +#define DWC_PCIE_CNT_STATUS BIT(7)
> +#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
> +#define DWC_PCIE_PER_EVENT_OFF FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
> +#define DWC_PCIE_PER_EVENT_ON FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)

Those two don't look right... :/

> +#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
> +#define DWC_PCIE_EVENT_PER_CLEAR 0x1
> +
> +#define DWC_PCIE_EVENT_CNT_DATA 0xC
> +
> +#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
> +#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
> +#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
> +#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
> +#define DWC_PCIE_DURATION_1MS 0x1
> +#define DWC_PCIE_DURATION_10MS 0x2
> +#define DWC_PCIE_DURATION_100MS 0x3
> +#define DWC_PCIE_DURATION_1S 0x4
> +#define DWC_PCIE_DURATION_2S 0x5
> +#define DWC_PCIE_DURATION_4S 0x6
> +#define DWC_PCIE_DURATION_4US 0xff
> +#define DWC_PCIE_TIME_BASED_TIMER_START BIT(0)
> +#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
> +
> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
> +
> +/* Event attributes */
> +#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
> +#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
> +#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
> +
> +#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
> +#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
> +#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
> +
> +enum dwc_pcie_event_type {
> + DWC_PCIE_TYPE_INVALID,
> + DWC_PCIE_TIME_BASE_EVENT,
> + DWC_PCIE_LANE_EVENT,
> +};
> +
> +struct dwc_event_counters {
> + const char name[32];
> + u32 event_id;
> +};
> +
> +struct dwc_pcie_pmu {
> + struct hlist_node node;

This isn't used anywhere (but it should be).

> + unsigned int on_cpu;
> + struct pmu pmu;
> + struct device *dev;
> +};
> +
> +struct dwc_pcie_rp_info {
> + u32 bdf;
> + u32 ras_des;
> + u32 num_lanes;
> +
> + struct list_head rp_node;
> + struct pci_dev *pdev;
> + struct dwc_pcie_pmu pcie_pmu;
> + bool pmu_is_register;
> + struct perf_event *event;
> +
> + struct dwc_pcie_event_attr *lane_event_attrs;
> + struct attribute **pcie_pmu_event_attrs;
> + struct attribute_group pcie_pmu_event_attrs_group;
> + const struct attribute_group *pcie_pmu_attr_groups[4];
> +};

Is there any particular reason for the seemingly arbitrary split between
dwc_pcie_pmu and dwc_pcie_rp_info? It doesn't appear obvious from the
design of the code; if anything it mostly just seems to make things a
bit more busy than they need to be.

> +
> +struct dwc_pcie_pmu_priv {
> + struct device *dev;
> + u32 pcie_ctrl_num;
> + struct list_head rp_infos;
> +};
> +
> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
> +
> +static struct platform_device *dwc_pcie_pmu_dev;
> +static ssize_t cpumask_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
> +
> + return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
> +}
> +static DEVICE_ATTR_RO(cpumask);
> +
> +static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
> + &dev_attr_cpumask.attr,
> + NULL
> +};
> +
> +static struct attribute_group pcie_pmu_cpumask_attrs_group = {
> + .attrs = dwc_pcie_pmu_cpumask_attrs,
> +};
> +
> +struct dwc_pcie_format_attr {
> + struct device_attribute attr;
> + u64 field;
> + int config;
> +};
> +
> +static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
> + int lo = __ffs(fmt->field), hi = __fls(fmt->field);
> +
> + if (lo == hi)
> + return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
> +
> + if (!fmt->config)
> + return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
> +
> + return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
> + hi);

A lot of this is unnecessary - you don't have any single-bit config
fields, and you aren't using config1 or config2, so it's kind of
confusing to have all the code and data for handling them.

Also, please use sysfs_emit() instead of all the assorted sprintf() and
snprintf() calls.

> +}
> +
> +#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
> + (&((struct dwc_pcie_format_attr[]) {{ \
> + .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
> + .config = _cfg, \
> + .field = _fld, \
> + }})[0].attr.attr)
> +
> +#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
> +
> +static struct attribute *dwc_pcie_format_attrs[] = {
> + dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
> + dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
> + dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
> + NULL,
> +};
> +
> +static struct attribute_group pcie_pmu_format_attrs_group = {
> + .name = "format",
> + .attrs = dwc_pcie_format_attrs,
> +};
> +
> +struct dwc_pcie_event_attr {
> + struct device_attribute attr;
> + enum dwc_pcie_event_type type;
> + u16 eventid;
> + u8 lane;
> +};
> +
> +static ssize_t dwc_pcie_event_show(struct device *dev,
> + struct device_attribute *attr, char *page)
> +{
> + struct dwc_pcie_event_attr *eattr;
> +
> + eattr = container_of(attr, typeof(*eattr), attr);
> +
> + if (eattr->type == DWC_PCIE_LANE_EVENT)
> + return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",

Convention seems to be that these strings do not have spaces in them, so
there's a small chance that it may confuse some userspace tools.

> + (unsigned long)eattr->eventid,
> + (unsigned long)eattr->type,
> + (unsigned long)eattr->lane);

Hmm, why use %lx and then have to cast everything, rather than just %x?

> +
> + return sprintf(page, "eventid=0x%lx, type=0x%lx",
> + (unsigned long)eattr->eventid,
> + (unsigned long)eattr->type);
> +}
> +
> +#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
> + (&((struct dwc_pcie_event_attr[]) {{ \
> + .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
> + .type = _type, \
> + .eventid = _eventid, \
> + .lane = _lane, \
> + }})[0].attr.attr)
> +
> +#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid) \
> + DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
> +
> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
> + /* Group #0 */
> + DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
> + /* Group #1 */
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
> + DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
> + NULL
> +};
> +
> +static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
> + struct attribute *attr,
> + int unuse)
> +{
> + return attr->mode;

There is no point implementing an optional callback which only
replicates the default behaviour of not having the callback.

Whether to simply remove it, or instead implement more meaningful
behaviour here to save complexity elsewhere, is something I'll come back
to later...

> +}
> +
> +static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
> +{
> + return (pci_is_pcie(pdev) &&
> + pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
> +}
> +
> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> +{
> + int index = 0;
> + struct pci_dev *pdev = NULL;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + INIT_LIST_HEAD(&priv->rp_infos);
> +
> + /* Match the rootport with VSEC_RAS_DES_ID */
> + for_each_pci_dev(pdev) {

Does the PCI layer not offer a more robust mechanism for this? (PCI
fixups come to mind, but I don't actually know whether that would be a
viable approach or not.) As things stand, it seems like you've got a
potential ordering problem if this is built-in and runs before PCI
devices have been fully discovered.

> + u16 vsec;
> + u32 val;
> +
> + if (!pci_dev_is_rootport(pdev))
> + continue;
> +
> + rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
> + if (!rp_info)
> + return -ENOMEM;

This leaks a refcount on the device.

> +
> + rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);

Do you really need to store this? It could just as well be a local
variable in the one scope where it's used later.

> + rp_info->pdev = pdev;
> +
> + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> + DWC_PCIE_VSEC_RAS_DES_ID);
> + if (!vsec)
> + continue;
> +
> + pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
> + if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
> + PCI_VNDR_HEADER_LEN(val) != 0x100)
> + continue;
> + pci_dbg(pdev,
> + "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
> +

Could you not finish all the checks before allocating rp_info? I'm
guessing you probably don't expect to find a mix of root ports where
only some have this capability, but if that ever did happen it would be
nicer not to leave a bunch of wasted memory hanging around for the
lifetime of the driver.

> + rp_info->ras_des = vsec;
> + rp_info->num_lanes = pcie_get_width_cap(pdev);
> +
> + list_add(&rp_info->rp_node, &priv->rp_infos);
> + index++;
> + }
> +
> + if (!index)
> + return -ENODEV;
> +
> + priv->pcie_ctrl_num = index;

You never use this.

> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
> + int event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
> + int lane, int event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}

What's the purpose of these two functions doing the exact same thing,
but one having an extra unused argument?

> +
> +static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
> + u32 enable)

Pass Boolean arguments as bool.

> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + if (enable)
> + val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
> + else
> + val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);

This looks suspicious - even if the values were defined correctly, the
read-modify-write implies that a transition from enabled to disabled
would result in ON | OFF == ON, which doesn't appear to make much sense.

> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
> + u32 enable)

bool again.

> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + &val);
> +
> + if (enable)
> + val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
> + else
> + val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + val);
> +}
> +
> +static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
> + u64 *counter)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
> + *counter = val;

Why not just return a u64 by value?

> +}
> +
> +/* The results are cleared when next measurement starts. */
> +static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
> + u16 ras_des, u64 *counter)
> +{
> + u32 val;
> +
> + pci_read_config_dword(
> + pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
> + &val);
> + *counter = val;
> + *counter <<= 32;
> +
> + pci_read_config_dword(
> + pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
> + &val);
> +
> + *counter += val;

Ditto.

> +}
> +
> +static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
> +
> + val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);

OK, does this EVENT_CNT_CTL register just have some really weird
behaviour where the fields we touch are self-clearing but other bits
still have to be preserved when written?

> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
> +}
> +
> +static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
> + u16 ras_des, u32 event_id)
> +{
> + u32 val;
> +
> + pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + &val);
> +
> + val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);

Ditto for this one, in fact.

> +
> + /*
> + * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
> + * use it with any manually controlled duration.
> + */
> + val |= FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
> + DWC_PCIE_DURATION_MANUAL_CTL);
> +
> + pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
> + val);
> +}
> +
> +static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
> +{
> + struct dwc_pcie_rp_info *rp_info;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
> +
> + rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
> +
> + return rp_info;
> +}
> +
> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
> +{
> + u64 counter;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + struct hw_perf_event *hwc = &event->hw;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> + u64 delta, prev, now;

That's an awful lot of boilerplate - straight away, pcie_pmu is
redundant since &pdev->dev would give you the same thing, but then you
don't actually need that anyway, so it would be even cleaner to pass
rp_info directly to the read_*_counter helpers for them to dereference
pdev and ras_des for themselves. Same thing in the start, stop and add
callbacks below.

> +
> + do {
> + prev = local64_read(&hwc->prev_count);
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
> + &counter);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);

If that could ever happen, you've got a bug in your event_init or
event_add, or something has corrupted memory in a way you cannot
possibly attempt to reason about. In fact, pretending to handle such a
theoretical bug here is its own *real* bug, since if that path is taken
then it leads to consuming an uninitialised variable below.

> +
> + now = counter;

Get rid of the redundant "counter" variable. It's really not helpful.

> + } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
> +
> + delta = now - prev;
> +
> + local64_add(delta, &event->count);
> +}
> +
> +static int dwc_pcie_pmu_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct perf_event *sibling;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + if (hwc->sample_period) {

This confused me - it might work, but please use is_sampling_event() to
be consistent with everything else.

> + dev_err(pcie_pmu->dev, "Sampling not supported\n");

dev_dbg() at best, but TBH I'd just remove it. I know this has been
debated a lot, but FWIW I'm firmly in the "giving users the capability
to use to use perf events should not give them the implicit capability
to flood the kernel log and fill up journald's disk quota" camp.

> + return -EOPNOTSUPP;

Use EINVAL. It's not that you simply don't support sampling, it's that
there is fundamentally no meaningful context in a PCIe lane to sample
from, so it is not a valid event at all.

> + }
> +
> + if (event->cpu < 0) {
> + dev_err(pcie_pmu->dev, "Per-task mode not supported\n");
> + return -EOPNOTSUPP;

Same comments as above.

> + }
> +
> + event->cpu = pcie_pmu->on_cpu;

It would seem neater not to start modifying the event until after you've
finished validating it.

> +
> + if (event->group_leader != event &&
> + !is_software_event(event->group_leader))
> + return -EINVAL;
> +
> + for_each_sibling_event(sibling, event->group_leader) {
> + if (sibling != event && !is_software_event(sibling))

An event cannot possibly be its own sibling.

> + return -EINVAL;
> + }
> +
> + hwc->idx = -1;

You never use this.

> +
> + return 0;

This would be the point to check that the event's config actually makes
sense and is supported by this PMU.

> +}
> +
> +static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
> +{
> + local64_set(&hwc->prev_count, 0);
> +}
> +
> +static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
> +{
> + struct hw_perf_event *hwc = &event->hw;
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> +
> + hwc->state = 0;
> + dwc_pcie_pmu_set_period(hwc);
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);

Again, that should be impossible if you validated it correctly in
event_init.

> +}
> +
> +static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> +
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> +
> + if (event->hw.state & PERF_HES_STOPPED)
> + return;
> +
> + if (type == DWC_PCIE_LANE_EVENT)
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
> + else if (type == DWC_PCIE_TIME_BASE_EVENT)
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
> + else
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);

Ditto.

> +
> + dwc_pcie_pmu_event_update(event);
> +}
> +
> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> + struct pci_dev *pdev = rp_info->pdev;
> + u16 ras_des = rp_info->ras_des;
> + struct hw_perf_event *hwc = &event->hw;
> + enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
> + int event_id = DWC_PCIE_EVENT_ID(event);
> + int lane = DWC_PCIE_EVENT_LANE(event);
> +
> + /* Only one counter and it is in use */
> + if (rp_info->event)
> + return -ENOSPC;

Out of curiosity, is this a technical limitation, or just a case of
keeping the initial driver simple? The registers seem to imply that lane
events and time events are pretty much independent of each other, so
maybe it might be possible for one of each to coexist.

> +
> + rp_info->event = event;
> +
> + hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
> +
> + if (type == DWC_PCIE_LANE_EVENT) {
> + dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
> + dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
> + dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
> + dwc_pcie_pmu_clear_event_counter(pdev, ras_des);

Eww, the helpers were already confusing enough, but is it necessary to
do *four* back-to-back RMW operations on the same register? Can you
really not just write the whole thing in one go?

> + } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
> + dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
> + dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
> + } else {
> + dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
> + return -EINVAL;
> + }
> +
> + if (flags & PERF_EF_START)
> + dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
> +
> + perf_event_update_userpage(event);
> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
> +{
> + struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
> +
> + dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
> + perf_event_update_userpage(event);
> + rp_info->event = NULL;
> +}
> +
> +static void dwc_pcie_pmu_event_read(struct perf_event *event)
> +{
> + dwc_pcie_pmu_event_update(event);
> +}

What use is a function wrapper that does nothing but call a function
with an identical signature? Whatever would call this can just call
dwc_pcie_pmu_event_update() directly.

> +
> +static struct dwc_event_counters event_array[] = {
> + {"tx_ack_dllp", 0x600},
> + {"tx_update_fc_dllp", 0x601},
> + {"rx_ack_dllp", 0x602},
> + {"rx_update_fc_dllp", 0x603},
> + {"rx_nulified_tlp", 0x604},
> + {"tx_nulified_tlp", 0x605},
> + {"rx_duplicate_tlp", 0x606},
> + {"tx_memory_write", 0x700},
> + {"tx_memory_read", 0x701},
> + {"tx_configuration_write", 0x702},
> + {"tx_configuration_read", 0x703},
> + {"tx_io_write", 0x704},
> + {"tx_io_read", 0x705},
> + {"tx_completion_without_data", 0x706},
> + {"tx_completion_with_data", 0x707},
> + {"tx_message_tlp", 0x708},
> + {"tx_atomic", 0x709},
> + {"tx_tlp_with_prefix", 0x70A},
> + {"rx_memory_write", 0x70B},
> + {"rx_memory_read", 0x70C},
> + {"rx_io_write", 0x70F},
> + {"rx_io_read", 0x710},
> + {"rx_completion_without_data", 0x711},
> + {"rx_completion_with_data", 0x712},
> + {"rx_message_tlp", 0x713},
> + {"rx_atomic", 0x714},
> + {"rx_tlp_with_prefix", 0x715},
> + {"tx_ccix_tlp", 0x716},
> + {"rx_ccix_tlp", 0x717},
> +};
> +
> +static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
> + struct dwc_pcie_rp_info *rp_info)
> +{
> + int i, j;
> + char lane[8];
> + const char tmp[64];
> + int events_per_lane;
> + int num_lane_events;
> + int time_base_count;
> + int num_attrs, attr_idx;
> + struct dwc_pcie_event_attr *lane_attrs;
> + struct attribute **pmu_attrs;
> +
> + memset((void *)tmp, 0, sizeof(tmp));
> + memset((void *)lane, 0, sizeof(lane));
> + time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
> + events_per_lane = ARRAY_SIZE(event_array);
> + num_lane_events = rp_info->num_lanes * events_per_lane;
> + num_attrs = time_base_count + num_lane_events;
> +
> + rp_info->lane_event_attrs =
> + devm_kcalloc(priv->dev, num_lane_events,
> + sizeof(struct dwc_pcie_event_attr),
> + GFP_KERNEL);
> + if (!rp_info->lane_event_attrs)
> + return -ENOMEM;
> + lane_attrs = rp_info->lane_event_attrs;
> + rp_info->pcie_pmu_event_attrs =
> + devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
> + GFP_KERNEL);
> + if (!rp_info->pcie_pmu_event_attrs)
> + return -ENOMEM;
> + pmu_attrs = rp_info->pcie_pmu_event_attrs;
> +
> + for (i = 0; i < num_lane_events; i++) {
> + lane_attrs[i].attr.attr.name =
> + devm_kzalloc(priv->dev, sizeof(char)
> + * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
> + if (!lane_attrs[i].attr.attr.name)
> + return -ENOMEM;
> + }
> +
> + attr_idx = 0;
> + for (i = 0; i < rp_info->num_lanes; i++) {
> + sprintf(lane, "_lane%d", i);
> +
> + for (j = 0; j < events_per_lane; j++) {
> + int pos = i * events_per_lane + j;
> +
> + strcat((char *)tmp, event_array[j].name);
> + strcat((char *)tmp, lane);
> + memcpy((void *)lane_attrs[pos].attr.attr.name,
> + (void *)tmp,
> + sizeof(tmp));
> +
> + lane_attrs[pos].attr.attr.mode =
> + VERIFY_OCTAL_PERMISSIONS(0444);
> + lane_attrs[pos].attr.show = dwc_pcie_event_show;
> + lane_attrs[pos].attr.store = NULL;
> + lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
> + lane_attrs[pos].eventid = event_array[j].event_id;
> + lane_attrs[pos].lane = i;
> + pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
> +
> + memset((void *)tmp, 0, sizeof(tmp));
> + }
> + }
> +
> + for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
> + pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
> +
> + rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
> +
> + rp_info->pcie_pmu_event_attrs_group.name = "events";
> + rp_info->pcie_pmu_event_attrs_group.is_visible =
> + pcie_pmu_event_attr_is_visible;
> + rp_info->pcie_pmu_event_attrs_group.attrs =
> + rp_info->pcie_pmu_event_attrs;
> +
> + rp_info->pcie_pmu_attr_groups[0] =
> + &rp_info->pcie_pmu_event_attrs_group;
> + rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
> + rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
> + rp_info->pcie_pmu_attr_groups[3] = NULL;
> +
> + return 0;

That took a while to make sense of... dynamically generating event attrs
is a bit horrible, especially if it means you waste a bunch of memory
duplicating the same stuff over and over again for every PMU instance.

You can achieve the same effect far more neatly by statically defining
all possible events, then using the .is_visible callback to hide any
which are not supported by the given PMU instance.

However, I'm not sure it's really worth even doing that (especially if
I've counted right and it means 464 events for a maximum of 16 lanes).
It doesn't seem like the difference in typing "tx_io_read_lane0" vs.
"tx_io_read,lane=0" will have a significant impact on user experience.
Conversely, I happen to know that some users actively dislike the
experience of "perf list" spewing out hundreds of events that swamp out
the ones they're looking for. If this is the SoC which also has arm-cmn
then it's already more than bad enough (sorry!) - describing 29 distinct
events as just 29 events will be a lot more popular and manageable. If
you like you can output "lane=?" in the event string to make it even
more obvious - perf tool will then require the user to provide a lane
value instead of defaulting to 0 if it isn't specified.

> +}
> +
> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
> + struct dwc_pcie_rp_info *rp_info)
> +{
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct device *dev;
> + char *name;
> + int ret;
> +
> + pcie_pmu = &rp_info->pcie_pmu;
> + dev = &rp_info->pdev->dev;
> +
> + ret = dwc_pcie_pmu_attr_init(priv, rp_info);
> + if (ret) {
> + pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
> + return ret;
> + }
> +
> + pcie_pmu->dev = dev;
> + pcie_pmu->pmu = (struct pmu) {
> + .module = THIS_MODULE,
> + .task_ctx_nr = perf_invalid_context,
> + .pmu_enable = NULL,
> + .pmu_disable = NULL,

If I counted right there are still 28 other fields being
default-initialised to 0 or NULL here, what's special about these two? ;)

> + .event_init = dwc_pcie_pmu_event_init,
> + .add = dwc_pcie_pmu_event_add,
> + .del = dwc_pcie_pmu_event_del,
> + .start = dwc_pcie_pmu_event_start,
> + .stop = dwc_pcie_pmu_event_stop,
> + .read = dwc_pcie_pmu_event_read,
> + .attr_groups = rp_info->pcie_pmu_attr_groups,
> + .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
> + };
> +
> + name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
> + rp_info->bdf);
> + if (!name)
> + return -ENOMEM;
> +
> + /*
> + * Pick one CPU to be the preferred one on local NUMA node.
> + *
> + * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
> + * is a uncore PMU device.
> + */
> + pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));

You need to cope with the possibility of that CPU going offline.

> + ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
> + if (ret) {
> + pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
> + rp_info->bdf);
> + return ret;
> + }
> +
> + rp_info->pmu_is_register = true;
> +
> + return 0;
> +}
> +
> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
> +{
> + struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
> + if (rp_info->pmu_is_register) {
> + pcie_pmu = &rp_info->pcie_pmu;
> + perf_pmu_unregister(&pcie_pmu->pmu);
> + }
> + }
> + return 0;
> +}
> +
> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
> +{
> + int ret;
> + struct dwc_pcie_pmu_priv *priv;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->dev = &pdev->dev;
> + platform_set_drvdata(pdev, priv);
> +
> + /* If RAS_DES PMU is not supported on current platform, keep silent */
> + ret = dwc_pcie_ras_des_discover(priv);
> + if (ret)
> + return ret;
> +
> + list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
> + struct pci_dev *rp = rp_info->pdev;
> +
> + ret = __dwc_pcie_pmu_probe(priv, rp_info);

If the PMUs are independent of each other, why not just probe them
directly during the PCIe capability walk? That way you could easily
ensure that the list only ever contains successfully-registered PMUs by
construction, and make cleanup even simpler.

> + if (ret) {
> + dev_err(&rp->dev, "PCIe PMU probe fail\n");
> + goto pmu_unregister;
> + }
> + }
> +
> + return 0;
> +
> +pmu_unregister:
> + dwc_pcie_pmu_remove(pdev);
> +
> + return ret;
> +}
> +
> +static struct platform_driver dwc_pcie_pmu_driver = {
> + .probe = dwc_pcie_pmu_probe,
> + .remove = dwc_pcie_pmu_remove,
> + .driver = {.name = "dwc_pcie_pmu",},
> +};
> +
> +static int __init dwc_pcie_pmu_init(void)
> +{
> + int ret;
> +
> + ret = platform_driver_register(&dwc_pcie_pmu_driver);
> +
> + if (ret)
> + return ret;
> +
> + dwc_pcie_pmu_dev = platform_device_register_simple(
> + "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
> + if (IS_ERR(dwc_pcie_pmu_dev)) {
> + platform_driver_unregister(&dwc_pcie_pmu_driver);
> + return PTR_ERR(dwc_pcie_pmu_dev);
> + }

Why go through all this bother of inventing a device and registering a
driver just to take a long round-trip through the driver core to call
dwc_pcie_pmu_probe()? Why not just do the work of dwc_pcie_pmu_probe()
right here? Sure, you'd also need to manually clean up a couple of
allocations on failure or exit instead of using devres, but that's still
considerably less hassle than invoking the whole driver model just to
disguise a list_head and a couple of function calls.

Thanks,
Robin.

> +
> + return 0;
> +}
> +
> +static void __exit dwc_pcie_pmu_exit(void)
> +{
> + platform_device_unregister(dwc_pcie_pmu_dev);
> + platform_driver_unregister(&dwc_pcie_pmu_driver);
> +}
> +
> +module_init(dwc_pcie_pmu_init);
> +module_exit(dwc_pcie_pmu_exit);
> +
> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
> +MODULE_AUTHOR("Shuai xue <[email protected]>");
> +MODULE_AUTHOR("Wen Cheng <[email protected]>");
> +MODULE_LICENSE("GPL v2");

2023-04-19 01:54:56

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v2 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 2023/4/18 AM9:51, Baolin Wang wrote:
>
>
> On 4/17/2023 9:16 AM, Shuai Xue wrote:
>
> [snip]
>
>>>> +
>>>> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
>>>> +{
>>>> +    u64 counter;
>>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>>> +    struct pci_dev *pdev = rp_info->pdev;
>>>> +    u16 ras_des = rp_info->ras_des;
>>>> +    struct hw_perf_event *hwc = &event->hw;
>>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>>> +    u64 delta, prev, now;
>>>> +
>>>> +    do {
>>>> +        prev = local64_read(&hwc->prev_count);
>>>> +
>>>> +        if (type == DWC_PCIE_LANE_EVENT)
>>>> +            dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
>>>> +        else if (type == DWC_PCIE_TIME_BASE_EVENT)
>>>> +            dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
>>>> +                                &counter);
>>>> +        else
>>>> +            dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>>> +
>>>> +        now = counter;
>>>> +    } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
>>>> +
>>>> +    delta = now - prev;
>>>
>>> This can be overflow? better to add a mask to avoid possible overflow.
>>
>> I think it can not. This Root Complex supports up to PCIe Gen5 (32 GT/s)
>> and one root port support up to x16 lanes, with peek bandwidth 64 GB/s.
>> On Yitian 710, one root port is x4 lane with peak bandwidth 16 GB/s.
>> The counter is 64 bit width with 16 bytes unit.
>>
>>     2^64*16/(64*10^9)/60/60/24/365 = 146 years
>>
>> For x16 root port, it will not overflow within 146 yeasr and for yitian 710,
>> it will never overflow in my life too.
>
> However the lane event counter is 32bit, so still a maximum counter mask is preferable.

You are right, will mask it in next version.

Thank you.
Best Regards,
Shuai

2023-04-27 06:44:15

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 2023/4/19 AM7:30, Robin Murphy wrote:
> On 2023-04-17 07:17, Shuai Xue wrote:
>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>> Core controller IP which provides statistics feature. The PMU is not a PCIe
>> Root Complex integrated End Point(RCiEP) device but only register counters
>> provided by each PCIe Root Port.
>>
>> To facilitate collection of statistics the controller provides the
>> following two features for each Root Port:
>>
>> - Time Based Analysis (RX/TX data throughput and time spent in each
>>    low-power LTSSM state)
>> - Event counters (Error and Non-Error for lanes)
>>
>> Note, only one counter for each type and does not overflow interrupt.
>>
>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>> named based the BDF of Root Port. For example,
>>
>>      30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>
>> the PMU device name for this Root Port is dwc_rootport_3018.
>>
>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>
>>      $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>
>> average RX bandwidth can be calculated like this:
>>
>>      PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>>   drivers/perf/Kconfig        |   7 +
>>   drivers/perf/Makefile       |   1 +
>>   drivers/perf/dwc_pcie_pmu.c | 855 ++++++++++++++++++++++++++++++++++++
>>   3 files changed, 863 insertions(+)
>>   create mode 100644 drivers/perf/dwc_pcie_pmu.c
>>
>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>> index 66c259000a44..57bce3880cba 100644
>> --- a/drivers/perf/Kconfig
>> +++ b/drivers/perf/Kconfig
>> @@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
>>         Enable perf support for Marvell DDR Performance monitoring
>>         event on CN10K platform.
>>   +config DWC_PCIE_PMU
>> +    tristate "Enable Synopsys DesignWare PCIe PMU Support"
>> +    depends on ARM64 || (COMPILE_TEST && 64BIT)
>
> Is there anything here that really depends on 64BIT? Nothing obvious stands out.
>
>> +    help
>> +      Enable perf support for Synopsys DesignWare PCIe PMU Performance
>> +      monitoring event on Yitian 710 platform.
>> +
>>   source "drivers/perf/arm_cspmu/Kconfig"
>>     source "drivers/perf/amlogic/Kconfig"
>> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
>> index 13e45da61100..3f233e96524e 100644
>> --- a/drivers/perf/Makefile
>> +++ b/drivers/perf/Makefile
>> @@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
>>   obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
>>   obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
>>   obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
>> +obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
>>   obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
>>   obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
>> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
>> new file mode 100644
>> index 000000000000..b7691cfe0df4
>> --- /dev/null
>> +++ b/drivers/perf/dwc_pcie_pmu.c
>> @@ -0,0 +1,855 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Synopsys DesignWare PCIe PMU driver
>> + *
>> + * Copyright (C) 2021-2023 Alibaba Inc.
>> + */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/bitfield.h>
>> +#include <linux/bitops.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/device.h>
>> +#include <linux/errno.h>
>> +#include <linux/kernel.h>
>> +#include <linux/list.h>
>> +#include <linux/perf_event.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/smp.h>
>> +#include <linux/sysfs.h>
>> +#include <linux/types.h>
>> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
>
> Shouldn't that belong in linux/pci_ids.h?

Yes, it also exist in drivers/infiniband/hw/erdma/erdma_hw.h
I will add a prepare patch to define it in linux/pci_ids.h.

>
>> +
>> +#define ATTRI_NAME_MAX_SIZE            32
>> +#define DWC_PCIE_VSEC_RAS_DES_ID        0x02
>> +
>> +#define DWC_PCIE_EVENT_CNT_CTL            0x8
>> +#define DWC_PCIE_CNT_EVENT_SEL            GENMASK(27, 16)
>> +#define DWC_PCIE_CNT_LANE_SEL            GENMASK(11, 8)
>> +#define DWC_PCIE_CNT_STATUS            BIT(7)
>> +#define DWC_PCIE_CNT_ENABLE            GENMASK(4, 2)
>> +#define DWC_PCIE_PER_EVENT_OFF            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
>> +#define DWC_PCIE_PER_EVENT_ON            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
>
> Those two don't look right... :/

You are right, those two are a nested FIELD_PREP, will fix it!

>
>> +#define DWC_PCIE_EVENT_CLEAR            GENMASK(1, 0)
>> +#define DWC_PCIE_EVENT_PER_CLEAR        0x1
>> +
>> +#define DWC_PCIE_EVENT_CNT_DATA            0xC
>> +
>> +#define DWC_PCIE_TIME_BASED_ANAL_CTL        0x10
>> +#define DWC_PCIE_TIME_BASED_REPORT_SEL        GENMASK(31, 24)
>> +#define DWC_PCIE_TIME_BASED_DURATION_SEL    GENMASK(15, 8)
>> +#define DWC_PCIE_DURATION_MANUAL_CTL        0x0
>> +#define DWC_PCIE_DURATION_1MS            0x1
>> +#define DWC_PCIE_DURATION_10MS            0x2
>> +#define DWC_PCIE_DURATION_100MS            0x3
>> +#define DWC_PCIE_DURATION_1S            0x4
>> +#define DWC_PCIE_DURATION_2S            0x5
>> +#define DWC_PCIE_DURATION_4S            0x6
>> +#define DWC_PCIE_DURATION_4US            0xff
>> +#define DWC_PCIE_TIME_BASED_TIMER_START     BIT(0)
>> +#define DWC_PCIE_TIME_BASED_CNT_ENABLE        0x1
>> +
>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW    0x14
>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH    0x18
>> +
>> +/* Event attributes */
>> +#define DWC_PCIE_CONFIG_EVENTID            GENMASK(15, 0)
>> +#define DWC_PCIE_CONFIG_TYPE            GENMASK(19, 16)
>> +#define DWC_PCIE_CONFIG_LANE            GENMASK(27, 20)
>> +
>> +#define DWC_PCIE_EVENT_ID(event)    FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
>> +#define DWC_PCIE_EVENT_TYPE(event)    FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
>> +#define DWC_PCIE_EVENT_LANE(event)    FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
>> +
>> +enum dwc_pcie_event_type {
>> +    DWC_PCIE_TYPE_INVALID,
>> +    DWC_PCIE_TIME_BASE_EVENT,
>> +    DWC_PCIE_LANE_EVENT,
>> +};
>> +
>> +struct dwc_event_counters {
>> +    const char name[32];
>> +    u32 event_id;
>> +};
>> +
>> +struct dwc_pcie_pmu {
>> +    struct hlist_node node;
>
> This isn't used anywhere (but it should be).

Yes, I will refator this struct.

>
>> +    unsigned int on_cpu;
>> +    struct pmu pmu;
>> +    struct device *dev;
>> +};
>> +
>> +struct dwc_pcie_rp_info {
>> +    u32 bdf;
>> +    u32 ras_des;
>> +    u32 num_lanes;
>> +
>> +    struct list_head rp_node;
>> +    struct pci_dev *pdev;
>> +    struct dwc_pcie_pmu pcie_pmu;
>> +    bool pmu_is_register;
>> +    struct perf_event *event;
>> +
>> +    struct dwc_pcie_event_attr *lane_event_attrs;
>> +    struct attribute **pcie_pmu_event_attrs;
>> +    struct attribute_group pcie_pmu_event_attrs_group;
>> +    const struct attribute_group *pcie_pmu_attr_groups[4];
>> +};
>
> Is there any particular reason for the seemingly arbitrary split between dwc_pcie_pmu and dwc_pcie_rp_info? It doesn't appear obvious from the design of the code; if anything it mostly just seems to make things a bit more busy than they need to be.

No particular reason, I intend to collect root port specific info to dwc_pcie_rp_info and
define PMU specific to dwc_pcie_pmu.

I will move it togother.

>
>> +
>> +struct dwc_pcie_pmu_priv {
>> +    struct device *dev;
>> +    u32 pcie_ctrl_num;
>> +    struct list_head rp_infos;
>> +};
>> +
>> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
>> +
>> +static struct platform_device *dwc_pcie_pmu_dev;
>> +static ssize_t cpumask_show(struct device *dev,
>> +                     struct device_attribute *attr,
>> +                     char *buf)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
>> +
>> +    return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
>> +}
>> +static DEVICE_ATTR_RO(cpumask);
>> +
>> +static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
>> +    &dev_attr_cpumask.attr,
>> +    NULL
>> +};
>> +
>> +static struct attribute_group pcie_pmu_cpumask_attrs_group = {
>> +    .attrs = dwc_pcie_pmu_cpumask_attrs,
>> +};
>> +
>> +struct dwc_pcie_format_attr {
>> +    struct device_attribute attr;
>> +    u64 field;
>> +    int config;
>> +};
>> +
>> +static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
>> +                    struct device_attribute *attr,
>> +                    char *buf)
>> +{
>> +    struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
>> +    int lo = __ffs(fmt->field), hi = __fls(fmt->field);
>> +
>> +    if (lo == hi)
>> +        return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
>> +
>> +    if (!fmt->config)
>> +        return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
>> +
>> +    return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
>> +            hi);
>
> A lot of this is unnecessary - you don't have any single-bit config fields, and you aren't using config1 or config2, so it's kind of confusing to have all the code and data for handling them.

Will remove it.

>
> Also, please use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.

Ok, I will use sysfs_emit() instead.

>
>> +}
>> +
>> +#define _dwc_pcie_format_attr(_name, _cfg, _fld)                \
>> +    (&((struct dwc_pcie_format_attr[]) {{                \
>> +        .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL),    \
>> +        .config = _cfg,                        \
>> +        .field = _fld,                        \
>> +    }})[0].attr.attr)
>> +
>> +#define dwc_pcie_format_attr(_name, _fld)    _dwc_pcie_format_attr(_name, 0, _fld)
>> +
>> +static struct attribute *dwc_pcie_format_attrs[] = {
>> +    dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
>> +    dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
>> +    dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
>> +    NULL,
>> +};
>> +
>> +static struct attribute_group pcie_pmu_format_attrs_group = {
>> +    .name = "format",
>> +    .attrs = dwc_pcie_format_attrs,
>> +};
>> +
>> +struct dwc_pcie_event_attr {
>> +    struct device_attribute attr;
>> +    enum dwc_pcie_event_type type;
>> +    u16 eventid;
>> +    u8 lane;
>> +};
>> +
>> +static ssize_t dwc_pcie_event_show(struct device *dev,
>> +                struct device_attribute *attr, char *page)
>> +{
>> +    struct dwc_pcie_event_attr *eattr;
>> +
>> +    eattr = container_of(attr, typeof(*eattr), attr);
>> +
>> +    if (eattr->type == DWC_PCIE_LANE_EVENT)
>> +        return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
>
> Convention seems to be that these strings do not have spaces in them, so there's a small chance that it may confuse some userspace tools.

Yes, you are right. I will remove the unnecessary spaces.

>
>> +                   (unsigned long)eattr->eventid,
>> +                   (unsigned long)eattr->type,
>> +                   (unsigned long)eattr->lane);
>
> Hmm, why use %lx and then have to cast everything, rather than just %x?

I misunderstand the usage of eattr->var, will just use %x instead.

>
>> +
>> +    return sprintf(page, "eventid=0x%lx, type=0x%lx",
>> +               (unsigned long)eattr->eventid,
>> +               (unsigned long)eattr->type);
>> +}
>> +
>> +#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane)        \
>> +    (&((struct dwc_pcie_event_attr[]) {{                \
>> +        .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL),    \
>> +        .type = _type,                        \
>> +        .eventid = _eventid,                    \
>> +        .lane = _lane,                    \
>> +    }})[0].attr.attr)
>> +
>> +#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid)            \
>> +    DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
>> +
>> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
>> +    /* Group #0 */
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
>> +    /* Group #1 */
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
>> +    NULL
>> +};
>> +
>> +static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
>> +                             struct attribute *attr,
>> +                             int unuse)
>> +{
>> +    return attr->mode;
>
> There is no point implementing an optional callback which only replicates the default behaviour of not having the callback.
>
> Whether to simply remove it, or instead implement more meaningful behaviour here to save complexity elsewhere, is something I'll come back to later...

I will use it for new defined lane event type in next version, as you suggested later.

>
>> +}
>> +
>> +static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
>> +{
>> +    return (pci_is_pcie(pdev) &&
>> +        pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
>> +}
>> +
>> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
>> +{
>> +    int index = 0;
>> +    struct pci_dev *pdev = NULL;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    INIT_LIST_HEAD(&priv->rp_infos);
>> +
>> +    /* Match the rootport with VSEC_RAS_DES_ID */
>> +    for_each_pci_dev(pdev) {
>
> Does the PCI layer not offer a more robust mechanism for this? (PCI fixups come to mind, but I don't actually know whether that would be a viable approach or not.)

I am afraid not yet. Jonathan try to add a PMU service but it is not merged into mainline.

As things stand, it seems like you've got a potential ordering problem if this is built-in and runs before PCI devices have been fully discovered.

If this is built-in, the module is init by device_initcall, level 6.
As far as I konw, PCI and PCIe devices are init in acpi_init,
a subsys_initcall, level 4, so we will not get a potential ordering
problem? If I missed anything, please correct me.

>
>> +        u16 vsec;
>> +        u32 val;
>> +
>> +        if (!pci_dev_is_rootport(pdev))
>> +            continue;
>> +
>> +        rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
>> +        if (!rp_info)
>> +            return -ENOMEM;
>
> This leaks a refcount on the device.

Aha, you are right, I should pci_dev_put first! Will fix it in next version.

>
>> +
>> +        rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>
> Do you really need to store this? It could just as well be a local variable in the one scope where it's used later.

Ok, I will remove the bdf from rp_info and get it form pdev at where it used.

>
>> +        rp_info->pdev = pdev;
>> +
>> +        vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>> +                        DWC_PCIE_VSEC_RAS_DES_ID);
>> +        if (!vsec)
>> +            continue;
>> +
>> +        pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>> +        if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>> +            PCI_VNDR_HEADER_LEN(val) != 0x100)
>> +            continue;
>> +        pci_dbg(pdev,
>> +            "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>> +
>
> Could you not finish all the checks before allocating rp_info? I'm guessing you probably don't expect to find a mix of root ports where only some have this capability, but if that ever did happen it would be nicer not to leave a bunch of wasted memory hanging around for the lifetime of the driver.

Good suggestion, I will do all check before allocating rp_info.

>
>> +        rp_info->ras_des = vsec;
>> +        rp_info->num_lanes = pcie_get_width_cap(pdev);
>> +
>> +        list_add(&rp_info->rp_node, &priv->rp_infos);
>> +        index++;
>> +    }
>> +
>> +    if (!index)
>> +        return -ENODEV;
>> +
>> +    priv->pcie_ctrl_num = index;
>
> You never use this.

Yes, I forgot to delete it after patch v1. Will delete it in next version.

>
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
>> +                     int event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
>> +                     int lane, int event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>
> What's the purpose of these two functions doing the exact same thing, but one having an extra unused argument?

Good catch, I should put it all together.

>
>> +
>> +static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
>> +                     u32 enable)
>
> Pass Boolean arguments as bool.

Will fix it.

>
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    if (enable)
>> +        val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
>> +    else
>> +        val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
>
> This looks suspicious - even if the values were defined correctly, the read-modify-write implies that a transition from enabled to disabled would result in ON | OFF == ON, which doesn't appear to make much sense.

Good catch, I will clear bits in DWC_PCIE_CNT_ENABLE first.

>
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
>> +                     u32 enable)
>
> bool again.

Will fix it.

>
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                  &val);
>> +
>> +    if (enable)
>> +        val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
>> +    else
>> +        val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                   val);
>> +}
>> +
>> +static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
>> +                        u64 *counter)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
>> +    *counter = val;
>
> Why not just return a u64 by value?

Will change it return a u64 from void.

>
>> +}
>> +
>> +/* The results are cleared when next measurement starts. */
>> +static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
>> +                        u16 ras_des, u64 *counter)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(
>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
>> +        &val);
>> +    *counter = val;
>> +    *counter <<= 32;
>> +
>> +    pci_read_config_dword(
>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
>> +        &val);
>> +
>> +    *counter += val;
>
> Ditto.

Ok, I will do it.


>
>> +}
>> +
>> +static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>> +
>> +    val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
>
> OK, does this EVENT_CNT_CTL register just have some really weird behaviour where the fields we touch are self-clearing but other bits still have to be preserved when written?
>
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>> +}
>> +
>> +static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
>> +                           u16 ras_des, u32 event_id)
>> +{
>> +    u32 val;
>> +
>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                  &val);
>> +
>> +    val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
>
> Ditto for this one, in fact.

No, all valid field are writeAsRead. You mean I should write to all filed
without a write-after-read when .add() a event, right?

>
>> +
>> +    /*
>> +     * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
>> +     * use it with any manually controlled duration.
>> +     */
>> +    val |= FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
>> +              DWC_PCIE_DURATION_MANUAL_CTL);
>> +
>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>> +                   val);
>> +}
>> +
>> +static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
>> +{
>> +    struct dwc_pcie_rp_info *rp_info;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
>> +
>> +    rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
>> +
>> +    return rp_info;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
>> +{
>> +    u64 counter;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +    u64 delta, prev, now;
>
> That's an awful lot of boilerplate - straight away, pcie_pmu is redundant since &pdev->dev would give you the same thing, but then you don't actually need that anyway, so it would be even cleaner to pass rp_info directly to the read_*_counter helpers for them to dereference pdev and ras_des for themselves. Same thing in the start, stop and add callbacks below.

Good suggestion. will do it.

>
>> +
>> +    do {
>> +        prev = local64_read(&hwc->prev_count);
>> +
>> +        if (type == DWC_PCIE_LANE_EVENT)
>> +            dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
>> +        else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +            dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
>> +                                &counter);
>> +        else
>> +            dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>
> If that could ever happen, you've got a bug in your event_init or event_add, or something has corrupted memory in a way you cannot possibly attempt to reason about. In fact, pretending to handle such a theoretical bug here is its own *real* bug, since if that path is taken then it leads to consuming an uninitialised variable below.

Yes, it should checked in init event.

>
>> +
>> +        now = counter;
>
> Get rid of the redundant "counter" variable. It's really not helpful.

Got it. Will remove it.

>
>> +    } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
>> +
>> +    delta = now - prev;
>> +
>> +    local64_add(delta, &event->count);
>> +}
>> +
>> +static int dwc_pcie_pmu_event_init(struct perf_event *event)
>> +{
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct perf_event *sibling;
>> +
>> +    if (event->attr.type != event->pmu->type)
>> +        return -ENOENT;
>> +
>> +    if (hwc->sample_period) {
>
> This confused me - it might work, but please use is_sampling_event() to be consistent with everything else.

Ok, I'd like to use is_sampling_event().

>
>> +        dev_err(pcie_pmu->dev, "Sampling not supported\n");
>
> dev_dbg() at best, but TBH I'd just remove it. I know this has been debated a lot, but FWIW I'm firmly in the "giving users the capability to use to use perf events should not give them the implicit capability to flood the kernel log and fill up journald's disk quota" camp.

Ok, I will remove it.
>
>> +        return -EOPNOTSUPP;
>
> Use EINVAL. It's not that you simply don't support sampling, it's that there is fundamentally no meaningful context in a PCIe lane to sample from, so it is not a valid event at all>
>
>> +    }
>> +
>> +    if (event->cpu < 0) {
>> +        dev_err(pcie_pmu->dev, "Per-task mode not supported\n");
>> +        return -EOPNOTSUPP;
>
> Same comments as above.

Thank you for explanation. I will use EINVAL.

>
>> +    }
>> +
>> +    event->cpu = pcie_pmu->on_cpu;
>
> It would seem neater not to start modifying the event until after you've finished validating it.

Will move it to later.

>
>> +
>> +    if (event->group_leader != event &&
>> +        !is_software_event(event->group_leader))
>> +        return -EINVAL;
>> +
>> +    for_each_sibling_event(sibling, event->group_leader) {
>> +        if (sibling != event && !is_software_event(sibling))
>
> An event cannot possibly be its own sibling.

Good catch, I should compare event->pmu with sibling->pmu.

>
>> +            return -EINVAL;
>> +    }
>> +
>> +    hwc->idx = -1;
>
> You never use this.

Will remove it.
>
>> +
>> +    return 0;
>
> This would be the point to check that the event's config actually makes sense and is supported by this PMU.
>
>> +}
>> +
>> +static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
>> +{
>> +    local64_set(&hwc->prev_count, 0);
>> +}
>> +
>> +static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
>> +{
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +
>> +    hwc->state = 0;
>> +    dwc_pcie_pmu_set_period(hwc);
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT)
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
>> +    else
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>
> Again, that should be impossible if you validated it correctly in event_init>
>
>> +}
>> +
>> +static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +
>> +    if (event->hw.state & PERF_HES_STOPPED)
>> +        return;
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT)
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>> +    else
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>
> Ditto.

Good suggestion, I will validate it correctly in event_init.

>
>> +
>> +    dwc_pcie_pmu_event_update(event);
>> +}
>> +
>> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +    struct pci_dev *pdev = rp_info->pdev;
>> +    u16 ras_des = rp_info->ras_des;
>> +    struct hw_perf_event *hwc = &event->hw;
>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>> +    int event_id = DWC_PCIE_EVENT_ID(event);
>> +    int lane = DWC_PCIE_EVENT_LANE(event);
>> +
>> +    /* Only one counter and it is in use */
>> +    if (rp_info->event)
>> +        return -ENOSPC;
>
> Out of curiosity, is this a technical limitation, or just a case of keeping the initial driver simple? The registers seem to imply that lane events and time events are pretty much independent of each other, so maybe it might be possible for one of each to coexist.

The are independent of each other, I just to keep the driver simple :)

>
>> +
>> +    rp_info->event = event;
>> +
>> +    hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>> +
>> +    if (type == DWC_PCIE_LANE_EVENT) {
>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>> +        dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
>> +        dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
>> +        dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
>
> Eww, the helpers were already confusing enough, but is it necessary to do *four* back-to-back RMW operations on the same register? Can you really not just write the whole thing in one go?

Sure, I will warp them in one prepare().

>
>> +    } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>> +        dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
>> +    } else {
>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>> +        return -EINVAL;
>> +    }
>> +
>> +    if (flags & PERF_EF_START)
>> +        dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
>> +
>> +    perf_event_update_userpage(event);
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
>> +{
>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>> +
>> +    dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
>> +    perf_event_update_userpage(event);
>> +    rp_info->event = NULL;
>> +}
>> +
>> +static void dwc_pcie_pmu_event_read(struct perf_event *event)
>> +{
>> +    dwc_pcie_pmu_event_update(event);
>> +}
>
> What use is a function wrapper that does nothing but call a function with an identical signature? Whatever would call this can just call dwc_pcie_pmu_event_update() directly.

Haha, I have to admit that I do not know why. The dmc620, dsu, smmu_pmu, arm-cci,
arm-cci drivers use .update() in .read() with the smae identical signature.

After a double check, the drivers fsl_imx8_ddr_perf.c, marvell_cn10k_ddr_pmu.c
use .update() as .read(), e.g.

.read = cn10k_ddr_perf_event_update,

I think we should use .update() as .read() directly, right?

>> +
>> +static struct dwc_event_counters event_array[] = {
>> +    {"tx_ack_dllp", 0x600},
>> +    {"tx_update_fc_dllp", 0x601},
>> +    {"rx_ack_dllp", 0x602},
>> +    {"rx_update_fc_dllp", 0x603},
>> +    {"rx_nulified_tlp", 0x604},
>> +    {"tx_nulified_tlp", 0x605},
>> +    {"rx_duplicate_tlp", 0x606},
>> +    {"tx_memory_write", 0x700},
>> +    {"tx_memory_read", 0x701},
>> +    {"tx_configuration_write", 0x702},
>> +    {"tx_configuration_read", 0x703},
>> +    {"tx_io_write", 0x704},
>> +    {"tx_io_read", 0x705},
>> +    {"tx_completion_without_data", 0x706},
>> +    {"tx_completion_with_data", 0x707},
>> +    {"tx_message_tlp", 0x708},
>> +    {"tx_atomic", 0x709},
>> +    {"tx_tlp_with_prefix", 0x70A},
>> +    {"rx_memory_write", 0x70B},
>> +    {"rx_memory_read", 0x70C},
>> +    {"rx_io_write", 0x70F},
>> +    {"rx_io_read", 0x710},
>> +    {"rx_completion_without_data", 0x711},
>> +    {"rx_completion_with_data", 0x712},
>> +    {"rx_message_tlp", 0x713},
>> +    {"rx_atomic", 0x714},
>> +    {"rx_tlp_with_prefix", 0x715},
>> +    {"tx_ccix_tlp", 0x716},
>> +    {"rx_ccix_tlp", 0x717},
>> +};
>> +
>> +static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
>> +                  struct dwc_pcie_rp_info *rp_info)
>> +{
>> +    int i, j;
>> +    char lane[8];
>> +    const char tmp[64];
>> +    int events_per_lane;
>> +    int num_lane_events;
>> +    int time_base_count;
>> +    int num_attrs, attr_idx;
>> +    struct dwc_pcie_event_attr *lane_attrs;
>> +    struct attribute **pmu_attrs;
>> +
>> +    memset((void *)tmp, 0, sizeof(tmp));
>> +    memset((void *)lane, 0, sizeof(lane));
>> +    time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
>> +    events_per_lane = ARRAY_SIZE(event_array);
>> +    num_lane_events = rp_info->num_lanes * events_per_lane;
>> +    num_attrs = time_base_count + num_lane_events;
>> +
>> +    rp_info->lane_event_attrs =
>> +        devm_kcalloc(priv->dev, num_lane_events,
>> +                sizeof(struct dwc_pcie_event_attr),
>> +                GFP_KERNEL);
>> +    if (!rp_info->lane_event_attrs)
>> +        return -ENOMEM;
>> +    lane_attrs = rp_info->lane_event_attrs;
>> +    rp_info->pcie_pmu_event_attrs =
>> +        devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
>> +             GFP_KERNEL);
>> +    if (!rp_info->pcie_pmu_event_attrs)
>> +        return -ENOMEM;
>> +    pmu_attrs = rp_info->pcie_pmu_event_attrs;
>> +
>> +    for (i = 0; i < num_lane_events; i++) {
>> +        lane_attrs[i].attr.attr.name =
>> +            devm_kzalloc(priv->dev, sizeof(char)
>> +                 * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
>> +        if (!lane_attrs[i].attr.attr.name)
>> +            return -ENOMEM;
>> +    }
>> +
>> +    attr_idx = 0;
>> +    for (i = 0; i < rp_info->num_lanes; i++) {
>> +        sprintf(lane, "_lane%d", i);
>> +
>> +        for (j = 0; j < events_per_lane; j++) {
>> +            int pos = i * events_per_lane + j;
>> +
>> +            strcat((char *)tmp, event_array[j].name);
>> +            strcat((char *)tmp, lane);
>> +            memcpy((void *)lane_attrs[pos].attr.attr.name,
>> +                   (void *)tmp,
>> +                   sizeof(tmp));
>> +
>> +            lane_attrs[pos].attr.attr.mode =
>> +                VERIFY_OCTAL_PERMISSIONS(0444);
>> +            lane_attrs[pos].attr.show = dwc_pcie_event_show;
>> +            lane_attrs[pos].attr.store = NULL;
>> +            lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
>> +            lane_attrs[pos].eventid = event_array[j].event_id;
>> +            lane_attrs[pos].lane = i;
>> +            pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
>> +
>> +            memset((void *)tmp, 0, sizeof(tmp));
>> +        }
>> +    }
>> +
>> +    for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
>> +        pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
>> +
>> +    rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
>> +
>> +    rp_info->pcie_pmu_event_attrs_group.name = "events";
>> +    rp_info->pcie_pmu_event_attrs_group.is_visible =
>> +        pcie_pmu_event_attr_is_visible;
>> +    rp_info->pcie_pmu_event_attrs_group.attrs =
>> +        rp_info->pcie_pmu_event_attrs;
>> +
>> +    rp_info->pcie_pmu_attr_groups[0] =
>> +        &rp_info->pcie_pmu_event_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
>> +    rp_info->pcie_pmu_attr_groups[3] = NULL;
>> +
>> +    return 0;
>
> That took a while to make sense of... dynamically generating event attrs is a bit horrible, especially if it means you waste a bunch of memory duplicating the same stuff over and over again for every PMU instance.
>
> You can achieve the same effect far more neatly by statically defining all possible events, then using the .is_visible callback to hide any which are not supported by the given PMU instance.
>
> However, I'm not sure it's really worth even doing that (especially if I've counted right and it means 464 events for a maximum of 16 lanes). It doesn't seem like the difference in typing "tx_io_read_lane0" vs. "tx_io_read,lane=0" will have a significant impact on user experience. Conversely, I happen to know that some users actively dislike the experience of "perf list" spewing out hundreds of events that swamp out the ones they're looking for. If this is the SoC which also has arm-cmn then it's already more than bad enough (sorry!) - describing 29 distinct events as just 29 events will be a lot more popular and manageable. If you like you can output "lane=?" in the event string to make it even more obvious - perf tool will then require the user to provide a lane value instead of defaulting to 0 if it isn't specified.

Got your point, the prolem is that we can't use part of symbolic name,
right?

"tx_io_read_lane0" vs. "tx_io_read,lane=0", I prefer the latter. But
the right latter is "eventid=0x705,lane=0", right?

I should add a description in document.

>
>> +}
>> +
>> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
>> +                struct dwc_pcie_rp_info *rp_info)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct device *dev;
>> +    char *name;
>> +    int ret;
>> +
>> +    pcie_pmu = &rp_info->pcie_pmu;
>> +    dev = &rp_info->pdev->dev;
>> +
>> +    ret = dwc_pcie_pmu_attr_init(priv, rp_info);
>> +    if (ret) {
>> +        pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
>> +        return ret;
>> +    }
>> +
>> +    pcie_pmu->dev = dev;
>> +    pcie_pmu->pmu = (struct pmu) {
>> +        .module        = THIS_MODULE,
>> +        .task_ctx_nr    = perf_invalid_context,
>> +        .pmu_enable    = NULL,
>> +        .pmu_disable    = NULL,
>
> If I counted right there are still 28 other fields being default-initialised to 0 or NULL here, what's special about these two? ;)

Nope, I will delete it. It is optional to implement.

>
>> +        .event_init    = dwc_pcie_pmu_event_init,
>> +        .add        = dwc_pcie_pmu_event_add,
>> +        .del        = dwc_pcie_pmu_event_del,
>> +        .start        = dwc_pcie_pmu_event_start,
>> +        .stop        = dwc_pcie_pmu_event_stop,
>> +        .read        = dwc_pcie_pmu_event_read,
>> +        .attr_groups    = rp_info->pcie_pmu_attr_groups,
>> +        .capabilities    = PERF_PMU_CAP_NO_EXCLUDE,
>> +    };
>> +
>> +    name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
>> +                  rp_info->bdf);
>> +    if (!name)
>> +        return -ENOMEM;
>> +
>> +    /*
>> +     * Pick one CPU to be the preferred one on local NUMA node.
>> +     *
>> +     * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
>> +     * is a uncore PMU device.
>> +     */
>> +    pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
>
> You need to cope with the possibility of that CPU going offline.

As I commented here, the PMU does NOT support interrupt, and the on_cpu
just indicate it is a uncore PMU device as Yicong Yang suggested. So the
selected CPU is not really used.

Do we need to care about that CPU going offline?

>
>> +    ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>> +    if (ret) {
>> +        pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
>> +                 rp_info->bdf);
>> +        return ret;
>> +    }
>> +
>> +    rp_info->pmu_is_register = true;
>> +
>> +    return 0;
>> +}
>> +
>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>> +{
>> +    struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>> +        if (rp_info->pmu_is_register) {
>> +            pcie_pmu = &rp_info->pcie_pmu;
>> +            perf_pmu_unregister(&pcie_pmu->pmu);
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>> +{
>> +    int ret;
>> +    struct dwc_pcie_pmu_priv *priv;
>> +    struct dwc_pcie_rp_info *rp_info;
>> +
>> +    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>> +    if (!priv)
>> +        return -ENOMEM;
>> +
>> +    priv->dev = &pdev->dev;
>> +    platform_set_drvdata(pdev, priv);
>> +
>> +    /* If RAS_DES PMU is not supported on current platform, keep silent */
>> +    ret = dwc_pcie_ras_des_discover(priv);
>> +    if (ret)
>> +        return ret;
>> +
>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>> +        struct pci_dev *rp = rp_info->pdev;
>> +
>> +        ret = __dwc_pcie_pmu_probe(priv, rp_info);
>
> If the PMUs are independent of each other, why not just probe them directly during the PCIe capability walk? That way you could easily ensure that the list only ever contains successfully-registered PMUs by construction, and make cleanup even simpler.

Yes, they are independent of each other and belongs to different root port.
Good suggestion, I will probe them during the PCIe capability walk in next version.


>> +        if (ret) {
>> +            dev_err(&rp->dev, "PCIe PMU probe fail\n");
>> +            goto pmu_unregister;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +pmu_unregister:
>> +    dwc_pcie_pmu_remove(pdev);
>> +
>> +    return ret;
>> +}
>> +
>> +static struct platform_driver dwc_pcie_pmu_driver = {
>> +    .probe = dwc_pcie_pmu_probe,
>> +    .remove = dwc_pcie_pmu_remove,
>> +    .driver = {.name = "dwc_pcie_pmu",},
>> +};
>> +
>> +static int __init dwc_pcie_pmu_init(void)
>> +{
>> +    int ret;
>> +
>> +    ret = platform_driver_register(&dwc_pcie_pmu_driver);
>> +
>> +    if (ret)
>> +        return ret;
>> +
>> +    dwc_pcie_pmu_dev = platform_device_register_simple(
>> +                "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>> +    if (IS_ERR(dwc_pcie_pmu_dev)) {
>> +        platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +        return PTR_ERR(dwc_pcie_pmu_dev);
>> +    }
>
> Why go through all this bother of inventing a device and registering a driver just to take a long round-trip through the driver core to call dwc_pcie_pmu_probe()? Why not just do the work of dwc_pcie_pmu_probe() right here? Sure, you'd also need to manually clean up a couple of allocations on failure or exit instead of using devres, but that's still considerably less hassle than invoking the whole driver model just to disguise a list_head and a couple of function calls.
>

I directly call dwc_pcie_pmu_probe in first internal version. I see
some PMU drivers register simple platform devices, like riscv_pmu_legacy.c
and riscv_pmu_sbi.c.

If you are asking to call dwc_pcie_pmu_probe() directly, I'd
like to change it back.

> Thanks,
> Robin.
>

Thank you very much for your valuable comments.

Best Regards,
Shuai


>> +
>> +    return 0;
>> +}
>> +
>> +static void __exit dwc_pcie_pmu_exit(void)
>> +{
>> +    platform_device_unregister(dwc_pcie_pmu_dev);
>> +    platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +}
>> +
>> +module_init(dwc_pcie_pmu_init);
>> +module_exit(dwc_pcie_pmu_exit);
>> +
>> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
>> +MODULE_AUTHOR("Shuai xue <[email protected]>");
>> +MODULE_AUTHOR("Wen Cheng <[email protected]>");
>> +MODULE_LICENSE("GPL v2");

2023-05-09 02:17:49

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 2023/4/27 14:33, Shuai Xue wrote:
>
>
> On 2023/4/19 AM7:30, Robin Murphy wrote:
>> On 2023-04-17 07:17, Shuai Xue wrote:
>>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>>> Core controller IP which provides statistics feature. The PMU is not a PCIe
>>> Root Complex integrated End Point(RCiEP) device but only register counters
>>> provided by each PCIe Root Port.
>>>
>>> To facilitate collection of statistics the controller provides the
>>> following two features for each Root Port:
>>>
>>> - Time Based Analysis (RX/TX data throughput and time spent in each
>>>    low-power LTSSM state)
>>> - Event counters (Error and Non-Error for lanes)
>>>
>>> Note, only one counter for each type and does not overflow interrupt.
>>>
>>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>>> named based the BDF of Root Port. For example,
>>>
>>>      30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>>
>>> the PMU device name for this Root Port is dwc_rootport_3018.
>>>
>>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>>
>>>      $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>>
>>> average RX bandwidth can be calculated like this:
>>>
>>>      PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>>
>>> Signed-off-by: Shuai Xue <[email protected]>
>>> ---
>>>   drivers/perf/Kconfig        |   7 +
>>>   drivers/perf/Makefile       |   1 +
>>>   drivers/perf/dwc_pcie_pmu.c | 855 ++++++++++++++++++++++++++++++++++++
>>>   3 files changed, 863 insertions(+)
>>>   create mode 100644 drivers/perf/dwc_pcie_pmu.c
>>>
>>> diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
>>> index 66c259000a44..57bce3880cba 100644
>>> --- a/drivers/perf/Kconfig
>>> +++ b/drivers/perf/Kconfig
>>> @@ -199,6 +199,13 @@ config MARVELL_CN10K_DDR_PMU
>>>         Enable perf support for Marvell DDR Performance monitoring
>>>         event on CN10K platform.
>>>   +config DWC_PCIE_PMU
>>> +    tristate "Enable Synopsys DesignWare PCIe PMU Support"
>>> +    depends on ARM64 || (COMPILE_TEST && 64BIT)
>>
>> Is there anything here that really depends on 64BIT? Nothing obvious stands out.
>>
>>> +    help
>>> +      Enable perf support for Synopsys DesignWare PCIe PMU Performance
>>> +      monitoring event on Yitian 710 platform.
>>> +
>>>   source "drivers/perf/arm_cspmu/Kconfig"
>>>     source "drivers/perf/amlogic/Kconfig"
>>> diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
>>> index 13e45da61100..3f233e96524e 100644
>>> --- a/drivers/perf/Makefile
>>> +++ b/drivers/perf/Makefile
>>> @@ -21,5 +21,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
>>>   obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
>>>   obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
>>>   obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
>>> +obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
>>>   obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
>>>   obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
>>> diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
>>> new file mode 100644
>>> index 000000000000..b7691cfe0df4
>>> --- /dev/null
>>> +++ b/drivers/perf/dwc_pcie_pmu.c
>>> @@ -0,0 +1,855 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Synopsys DesignWare PCIe PMU driver
>>> + *
>>> + * Copyright (C) 2021-2023 Alibaba Inc.
>>> + */
>>> +
>>> +#include <linux/pci.h>
>>> +#include <linux/bitfield.h>
>>> +#include <linux/bitops.h>
>>> +#include <linux/cpuhotplug.h>
>>> +#include <linux/cpumask.h>
>>> +#include <linux/device.h>
>>> +#include <linux/errno.h>
>>> +#include <linux/kernel.h>
>>> +#include <linux/list.h>
>>> +#include <linux/perf_event.h>
>>> +#include <linux/platform_device.h>
>>> +#include <linux/smp.h>
>>> +#include <linux/sysfs.h>
>>> +#include <linux/types.h>
>>> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
>>
>> Shouldn't that belong in linux/pci_ids.h?
>
> Yes, it also exist in drivers/infiniband/hw/erdma/erdma_hw.h
> I will add a prepare patch to define it in linux/pci_ids.h.
>
>>
>>> +
>>> +#define ATTRI_NAME_MAX_SIZE            32
>>> +#define DWC_PCIE_VSEC_RAS_DES_ID        0x02
>>> +
>>> +#define DWC_PCIE_EVENT_CNT_CTL            0x8
>>> +#define DWC_PCIE_CNT_EVENT_SEL            GENMASK(27, 16)
>>> +#define DWC_PCIE_CNT_LANE_SEL            GENMASK(11, 8)
>>> +#define DWC_PCIE_CNT_STATUS            BIT(7)
>>> +#define DWC_PCIE_CNT_ENABLE            GENMASK(4, 2)
>>> +#define DWC_PCIE_PER_EVENT_OFF            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x1)
>>> +#define DWC_PCIE_PER_EVENT_ON            FIELD_PREP(DWC_PCIE_CNT_ENABLE, 0x3)
>>
>> Those two don't look right... :/
>
> You are right, those two are a nested FIELD_PREP, will fix it!
>
>>
>>> +#define DWC_PCIE_EVENT_CLEAR            GENMASK(1, 0)
>>> +#define DWC_PCIE_EVENT_PER_CLEAR        0x1
>>> +
>>> +#define DWC_PCIE_EVENT_CNT_DATA            0xC
>>> +
>>> +#define DWC_PCIE_TIME_BASED_ANAL_CTL        0x10
>>> +#define DWC_PCIE_TIME_BASED_REPORT_SEL        GENMASK(31, 24)
>>> +#define DWC_PCIE_TIME_BASED_DURATION_SEL    GENMASK(15, 8)
>>> +#define DWC_PCIE_DURATION_MANUAL_CTL        0x0
>>> +#define DWC_PCIE_DURATION_1MS            0x1
>>> +#define DWC_PCIE_DURATION_10MS            0x2
>>> +#define DWC_PCIE_DURATION_100MS            0x3
>>> +#define DWC_PCIE_DURATION_1S            0x4
>>> +#define DWC_PCIE_DURATION_2S            0x5
>>> +#define DWC_PCIE_DURATION_4S            0x6
>>> +#define DWC_PCIE_DURATION_4US            0xff
>>> +#define DWC_PCIE_TIME_BASED_TIMER_START     BIT(0)
>>> +#define DWC_PCIE_TIME_BASED_CNT_ENABLE        0x1
>>> +
>>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW    0x14
>>> +#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH    0x18
>>> +
>>> +/* Event attributes */
>>> +#define DWC_PCIE_CONFIG_EVENTID            GENMASK(15, 0)
>>> +#define DWC_PCIE_CONFIG_TYPE            GENMASK(19, 16)
>>> +#define DWC_PCIE_CONFIG_LANE            GENMASK(27, 20)
>>> +
>>> +#define DWC_PCIE_EVENT_ID(event)    FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
>>> +#define DWC_PCIE_EVENT_TYPE(event)    FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
>>> +#define DWC_PCIE_EVENT_LANE(event)    FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
>>> +
>>> +enum dwc_pcie_event_type {
>>> +    DWC_PCIE_TYPE_INVALID,
>>> +    DWC_PCIE_TIME_BASE_EVENT,
>>> +    DWC_PCIE_LANE_EVENT,
>>> +};
>>> +
>>> +struct dwc_event_counters {
>>> +    const char name[32];
>>> +    u32 event_id;
>>> +};
>>> +
>>> +struct dwc_pcie_pmu {
>>> +    struct hlist_node node;
>>
>> This isn't used anywhere (but it should be).
>
> Yes, I will refator this struct.
>
>>
>>> +    unsigned int on_cpu;
>>> +    struct pmu pmu;
>>> +    struct device *dev;
>>> +};
>>> +
>>> +struct dwc_pcie_rp_info {
>>> +    u32 bdf;
>>> +    u32 ras_des;
>>> +    u32 num_lanes;
>>> +
>>> +    struct list_head rp_node;
>>> +    struct pci_dev *pdev;
>>> +    struct dwc_pcie_pmu pcie_pmu;
>>> +    bool pmu_is_register;
>>> +    struct perf_event *event;
>>> +
>>> +    struct dwc_pcie_event_attr *lane_event_attrs;
>>> +    struct attribute **pcie_pmu_event_attrs;
>>> +    struct attribute_group pcie_pmu_event_attrs_group;
>>> +    const struct attribute_group *pcie_pmu_attr_groups[4];
>>> +};
>>
>> Is there any particular reason for the seemingly arbitrary split between dwc_pcie_pmu and dwc_pcie_rp_info? It doesn't appear obvious from the design of the code; if anything it mostly just seems to make things a bit more busy than they need to be.
>
> No particular reason, I intend to collect root port specific info to dwc_pcie_rp_info and
> define PMU specific to dwc_pcie_pmu.
>
> I will move it togother.
>
>>
>>> +
>>> +struct dwc_pcie_pmu_priv {
>>> +    struct device *dev;
>>> +    u32 pcie_ctrl_num;
>>> +    struct list_head rp_infos;
>>> +};
>>> +
>>> +#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
>>> +
>>> +static struct platform_device *dwc_pcie_pmu_dev;
>>> +static ssize_t cpumask_show(struct device *dev,
>>> +                     struct device_attribute *attr,
>>> +                     char *buf)
>>> +{
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
>>> +
>>> +    return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->on_cpu));
>>> +}
>>> +static DEVICE_ATTR_RO(cpumask);
>>> +
>>> +static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
>>> +    &dev_attr_cpumask.attr,
>>> +    NULL
>>> +};
>>> +
>>> +static struct attribute_group pcie_pmu_cpumask_attrs_group = {
>>> +    .attrs = dwc_pcie_pmu_cpumask_attrs,
>>> +};
>>> +
>>> +struct dwc_pcie_format_attr {
>>> +    struct device_attribute attr;
>>> +    u64 field;
>>> +    int config;
>>> +};
>>> +
>>> +static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
>>> +                    struct device_attribute *attr,
>>> +                    char *buf)
>>> +{
>>> +    struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
>>> +    int lo = __ffs(fmt->field), hi = __fls(fmt->field);
>>> +
>>> +    if (lo == hi)
>>> +        return snprintf(buf, PAGE_SIZE, "config:%d\n", lo);
>>> +
>>> +    if (!fmt->config)
>>> +        return snprintf(buf, PAGE_SIZE, "config:%d-%d\n", lo, hi);
>>> +
>>> +    return snprintf(buf, PAGE_SIZE, "config%d:%d-%d\n", fmt->config, lo,
>>> +            hi);
>>
>> A lot of this is unnecessary - you don't have any single-bit config fields, and you aren't using config1 or config2, so it's kind of confusing to have all the code and data for handling them.
>
> Will remove it.
>
>>
>> Also, please use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
>
> Ok, I will use sysfs_emit() instead.
>
>>
>>> +}
>>> +
>>> +#define _dwc_pcie_format_attr(_name, _cfg, _fld)                \
>>> +    (&((struct dwc_pcie_format_attr[]) {{                \
>>> +        .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL),    \
>>> +        .config = _cfg,                        \
>>> +        .field = _fld,                        \
>>> +    }})[0].attr.attr)
>>> +
>>> +#define dwc_pcie_format_attr(_name, _fld)    _dwc_pcie_format_attr(_name, 0, _fld)
>>> +
>>> +static struct attribute *dwc_pcie_format_attrs[] = {
>>> +    dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
>>> +    dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
>>> +    dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
>>> +    NULL,
>>> +};
>>> +
>>> +static struct attribute_group pcie_pmu_format_attrs_group = {
>>> +    .name = "format",
>>> +    .attrs = dwc_pcie_format_attrs,
>>> +};
>>> +
>>> +struct dwc_pcie_event_attr {
>>> +    struct device_attribute attr;
>>> +    enum dwc_pcie_event_type type;
>>> +    u16 eventid;
>>> +    u8 lane;
>>> +};
>>> +
>>> +static ssize_t dwc_pcie_event_show(struct device *dev,
>>> +                struct device_attribute *attr, char *page)
>>> +{
>>> +    struct dwc_pcie_event_attr *eattr;
>>> +
>>> +    eattr = container_of(attr, typeof(*eattr), attr);
>>> +
>>> +    if (eattr->type == DWC_PCIE_LANE_EVENT)
>>> +        return sprintf(page, "eventid=0x%lx, type=0x%lx, lane=0x%lx\n",
>>
>> Convention seems to be that these strings do not have spaces in them, so there's a small chance that it may confuse some userspace tools.
>
> Yes, you are right. I will remove the unnecessary spaces.
>
>>
>>> +                   (unsigned long)eattr->eventid,
>>> +                   (unsigned long)eattr->type,
>>> +                   (unsigned long)eattr->lane);
>>
>> Hmm, why use %lx and then have to cast everything, rather than just %x?
>
> I misunderstand the usage of eattr->var, will just use %x instead.
>
>>
>>> +
>>> +    return sprintf(page, "eventid=0x%lx, type=0x%lx",
>>> +               (unsigned long)eattr->eventid,
>>> +               (unsigned long)eattr->type);
>>> +}
>>> +
>>> +#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane)        \
>>> +    (&((struct dwc_pcie_event_attr[]) {{                \
>>> +        .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL),    \
>>> +        .type = _type,                        \
>>> +        .eventid = _eventid,                    \
>>> +        .lane = _lane,                    \
>>> +    }})[0].attr.attr)
>>> +
>>> +#define DWC_PCIE_PMU_BASE_TIME_ATTR(_name, _eventid)            \
>>> +    DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
>>> +
>>> +static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
>>> +    /* Group #0 */
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(one_cycle, 0x00),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S, 0x01),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S, 0x02),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0, 0x03),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1, 0x04),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_1, 0x05),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_2, 0x06),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY, 0x07),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S, 0x08),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_AUX, 0x09),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(ONE_cycle, 0x10),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_L0S_, 0x11),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(RX_L0S_, 0x12),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L0_, 0x13),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(L1_, 0x17),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(CFG_RCVRY_, 0x17),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(TX_RX_L0S_, 0x18),
>>> +    /* Group #1 */
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
>>> +    DWC_PCIE_PMU_BASE_TIME_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
>>> +    NULL
>>> +};
>>> +
>>> +static inline umode_t pcie_pmu_event_attr_is_visible(struct kobject *kobj,
>>> +                             struct attribute *attr,
>>> +                             int unuse)
>>> +{
>>> +    return attr->mode;
>>
>> There is no point implementing an optional callback which only replicates the default behaviour of not having the callback.
>>
>> Whether to simply remove it, or instead implement more meaningful behaviour here to save complexity elsewhere, is something I'll come back to later...
>
> I will use it for new defined lane event type in next version, as you suggested later.
>
>>
>>> +}
>>> +
>>> +static inline bool pci_dev_is_rootport(struct pci_dev *pdev)
>>> +{
>>> +    return (pci_is_pcie(pdev) &&
>>> +        pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT);
>>> +}
>>> +
>>> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
>>> +{
>>> +    int index = 0;
>>> +    struct pci_dev *pdev = NULL;
>>> +    struct dwc_pcie_rp_info *rp_info;
>>> +
>>> +    INIT_LIST_HEAD(&priv->rp_infos);
>>> +
>>> +    /* Match the rootport with VSEC_RAS_DES_ID */
>>> +    for_each_pci_dev(pdev) {
>>
>> Does the PCI layer not offer a more robust mechanism for this? (PCI fixups come to mind, but I don't actually know whether that would be a viable approach or not.)
>
> I am afraid not yet. Jonathan try to add a PMU service but it is not merged into mainline.
>
> As things stand, it seems like you've got a potential ordering problem if this is built-in and runs before PCI devices have been fully discovered.
>
> If this is built-in, the module is init by device_initcall, level 6.
> As far as I konw, PCI and PCIe devices are init in acpi_init,
> a subsys_initcall, level 4, so we will not get a potential ordering
> problem? If I missed anything, please correct me.
>
>>
>>> +        u16 vsec;
>>> +        u32 val;
>>> +
>>> +        if (!pci_dev_is_rootport(pdev))
>>> +            continue;
>>> +
>>> +        rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
>>> +        if (!rp_info)
>>> +            return -ENOMEM;
>>
>> This leaks a refcount on the device.
>
> Aha, you are right, I should pci_dev_put first! Will fix it in next version.
>
>>
>>> +
>>> +        rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
>>
>> Do you really need to store this? It could just as well be a local variable in the one scope where it's used later.
>
> Ok, I will remove the bdf from rp_info and get it form pdev at where it used.
>
>>
>>> +        rp_info->pdev = pdev;
>>> +
>>> +        vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
>>> +                        DWC_PCIE_VSEC_RAS_DES_ID);
>>> +        if (!vsec)
>>> +            continue;
>>> +
>>> +        pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
>>> +        if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
>>> +            PCI_VNDR_HEADER_LEN(val) != 0x100)
>>> +            continue;
>>> +        pci_dbg(pdev,
>>> +            "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
>>> +
>>
>> Could you not finish all the checks before allocating rp_info? I'm guessing you probably don't expect to find a mix of root ports where only some have this capability, but if that ever did happen it would be nicer not to leave a bunch of wasted memory hanging around for the lifetime of the driver.
>
> Good suggestion, I will do all check before allocating rp_info.
>
>>
>>> +        rp_info->ras_des = vsec;
>>> +        rp_info->num_lanes = pcie_get_width_cap(pdev);
>>> +
>>> +        list_add(&rp_info->rp_node, &priv->rp_infos);
>>> +        index++;
>>> +    }
>>> +
>>> +    if (!index)
>>> +        return -ENODEV;
>>> +
>>> +    priv->pcie_ctrl_num = index;
>>
>> You never use this.
>
> Yes, I forgot to delete it after patch v1. Will delete it in next version.
>
>>
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_set_event_id(struct pci_dev *pdev, u16 ras_des,
>>> +                     int event_id)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>>> +
>>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id);
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_write_event_lane(struct pci_dev *pdev, u16 ras_des,
>>> +                     int lane, int event_id)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>>> +
>>> +    val |= FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, lane);
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>>> +}
>>
>> What's the purpose of these two functions doing the exact same thing, but one having an extra unused argument?
>
> Good catch, I should put it all together.
>
>>
>>> +
>>> +static void dwc_pcie_pmu_event_enable(struct pci_dev *pdev, u16 ras_des,
>>> +                     u32 enable)
>>
>> Pass Boolean arguments as bool.
>
> Will fix it.
>
>>
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>>> +
>>> +    if (enable)
>>> +        val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
>>> +    else
>>> +        val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
>>
>> This looks suspicious - even if the values were defined correctly, the read-modify-write implies that a transition from enabled to disabled would result in ON | OFF == ON, which doesn't appear to make much sense.
>
> Good catch, I will clear bits in DWC_PCIE_CNT_ENABLE first.
>
>>
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_base_time_enable(struct pci_dev *pdev, u16 ras_des,
>>> +                     u32 enable)
>>
>> bool again.
>
> Will fix it.
>
>>
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>>> +                  &val);
>>> +
>>> +    if (enable)
>>> +        val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
>>> +    else
>>> +        val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>>> +                   val);
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_read_event_counter(struct pci_dev *pdev, u16 ras_des,
>>> +                        u64 *counter)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
>>> +    *counter = val;
>>
>> Why not just return a u64 by value?
>
> Will change it return a u64 from void.
>
>>
>>> +}
>>> +
>>> +/* The results are cleared when next measurement starts. */
>>> +static void dwc_pcie_pmu_read_base_time_counter(struct pci_dev *pdev,
>>> +                        u16 ras_des, u64 *counter)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(
>>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH,
>>> +        &val);
>>> +    *counter = val;
>>> +    *counter <<= 32;
>>> +
>>> +    pci_read_config_dword(
>>> +        pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW,
>>> +        &val);
>>> +
>>> +    *counter += val;
>>
>> Ditto.
>
> Ok, I will do it.
>
>
>>
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_clear_event_counter(struct pci_dev *pdev, u16 ras_des)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
>>> +
>>> +    val |= FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
>>
>> OK, does this EVENT_CNT_CTL register just have some really weird behaviour where the fields we touch are self-clearing but other bits still have to be preserved when written?
>>
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_base_time_add_prepare(struct pci_dev *pdev,
>>> +                           u16 ras_des, u32 event_id)
>>> +{
>>> +    u32 val;
>>> +
>>> +    pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>>> +                  &val);
>>> +
>>> +    val |= FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id);
>>
>> Ditto for this one, in fact.
>
> No, all valid field are writeAsRead. You mean I should write to all filed
> without a write-after-read when .add() a event, right?
>
>>
>>> +
>>> +    /*
>>> +     * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
>>> +     * use it with any manually controlled duration.
>>> +     */
>>> +    val |= FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
>>> +              DWC_PCIE_DURATION_MANUAL_CTL);
>>> +
>>> +    pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
>>> +                   val);
>>> +}
>>> +
>>> +static struct dwc_pcie_rp_info *pmu_to_pcie_info(struct pmu *pmu)
>>> +{
>>> +    struct dwc_pcie_rp_info *rp_info;
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(pmu);
>>> +
>>> +    rp_info = container_of(pcie_pmu, struct dwc_pcie_rp_info, pcie_pmu);
>>> +
>>> +    return rp_info;
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_event_update(struct perf_event *event)
>>> +{
>>> +    u64 counter;
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +    struct pci_dev *pdev = rp_info->pdev;
>>> +    u16 ras_des = rp_info->ras_des;
>>> +    struct hw_perf_event *hwc = &event->hw;
>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +    u64 delta, prev, now;
>>
>> That's an awful lot of boilerplate - straight away, pcie_pmu is redundant since &pdev->dev would give you the same thing, but then you don't actually need that anyway, so it would be even cleaner to pass rp_info directly to the read_*_counter helpers for them to dereference pdev and ras_des for themselves. Same thing in the start, stop and add callbacks below.
>
> Good suggestion. will do it.
>
>>
>>> +
>>> +    do {
>>> +        prev = local64_read(&hwc->prev_count);
>>> +
>>> +        if (type == DWC_PCIE_LANE_EVENT)
>>> +            dwc_pcie_pmu_read_event_counter(pdev, ras_des, &counter);
>>> +        else if (type == DWC_PCIE_TIME_BASE_EVENT)
>>> +            dwc_pcie_pmu_read_base_time_counter(pdev, ras_des,
>>> +                                &counter);
>>> +        else
>>> +            dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>
>> If that could ever happen, you've got a bug in your event_init or event_add, or something has corrupted memory in a way you cannot possibly attempt to reason about. In fact, pretending to handle such a theoretical bug here is its own *real* bug, since if that path is taken then it leads to consuming an uninitialised variable below.
>
> Yes, it should checked in init event.
>
>>
>>> +
>>> +        now = counter;
>>
>> Get rid of the redundant "counter" variable. It's really not helpful.
>
> Got it. Will remove it.
>
>>
>>> +    } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
>>> +
>>> +    delta = now - prev;
>>> +
>>> +    local64_add(delta, &event->count);
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_event_init(struct perf_event *event)
>>> +{
>>> +    struct hw_perf_event *hwc = &event->hw;
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct perf_event *sibling;
>>> +
>>> +    if (event->attr.type != event->pmu->type)
>>> +        return -ENOENT;
>>> +
>>> +    if (hwc->sample_period) {
>>
>> This confused me - it might work, but please use is_sampling_event() to be consistent with everything else.
>
> Ok, I'd like to use is_sampling_event().
>
>>
>>> +        dev_err(pcie_pmu->dev, "Sampling not supported\n");
>>
>> dev_dbg() at best, but TBH I'd just remove it. I know this has been debated a lot, but FWIW I'm firmly in the "giving users the capability to use to use perf events should not give them the implicit capability to flood the kernel log and fill up journald's disk quota" camp.
>
> Ok, I will remove it.
>>
>>> +        return -EOPNOTSUPP;
>>
>> Use EINVAL. It's not that you simply don't support sampling, it's that there is fundamentally no meaningful context in a PCIe lane to sample from, so it is not a valid event at all>
>>
>>> +    }
>>> +
>>> +    if (event->cpu < 0) {
>>> +        dev_err(pcie_pmu->dev, "Per-task mode not supported\n");
>>> +        return -EOPNOTSUPP;
>>
>> Same comments as above.
>
> Thank you for explanation. I will use EINVAL.
>
>>
>>> +    }
>>> +
>>> +    event->cpu = pcie_pmu->on_cpu;
>>
>> It would seem neater not to start modifying the event until after you've finished validating it.
>
> Will move it to later.
>
>>
>>> +
>>> +    if (event->group_leader != event &&
>>> +        !is_software_event(event->group_leader))
>>> +        return -EINVAL;
>>> +
>>> +    for_each_sibling_event(sibling, event->group_leader) {
>>> +        if (sibling != event && !is_software_event(sibling))
>>
>> An event cannot possibly be its own sibling.
>
> Good catch, I should compare event->pmu with sibling->pmu.
>
>>
>>> +            return -EINVAL;
>>> +    }
>>> +
>>> +    hwc->idx = -1;
>>
>> You never use this.
>
> Will remove it.
>>
>>> +
>>> +    return 0;
>>
>> This would be the point to check that the event's config actually makes sense and is supported by this PMU.
>>
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
>>> +{
>>> +    local64_set(&hwc->prev_count, 0);
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
>>> +{
>>> +    struct hw_perf_event *hwc = &event->hw;
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +    struct pci_dev *pdev = rp_info->pdev;
>>> +    u16 ras_des = rp_info->ras_des;
>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +
>>> +    hwc->state = 0;
>>> +    dwc_pcie_pmu_set_period(hwc);
>>> +
>>> +    if (type == DWC_PCIE_LANE_EVENT)
>>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 1);
>>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 1);
>>> +    else
>>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>
>> Again, that should be impossible if you validated it correctly in event_init>
>>
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
>>> +{
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +    struct pci_dev *pdev = rp_info->pdev;
>>> +    u16 ras_des = rp_info->ras_des;
>>> +
>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +
>>> +    if (event->hw.state & PERF_HES_STOPPED)
>>> +        return;
>>> +
>>> +    if (type == DWC_PCIE_LANE_EVENT)
>>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>>> +    else if (type == DWC_PCIE_TIME_BASE_EVENT)
>>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>>> +    else
>>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>
>> Ditto.
>
> Good suggestion, I will validate it correctly in event_init.
>
>>
>>> +
>>> +    dwc_pcie_pmu_event_update(event);
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
>>> +{
>>> +    struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +    struct pci_dev *pdev = rp_info->pdev;
>>> +    u16 ras_des = rp_info->ras_des;
>>> +    struct hw_perf_event *hwc = &event->hw;
>>> +    enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
>>> +    int event_id = DWC_PCIE_EVENT_ID(event);
>>> +    int lane = DWC_PCIE_EVENT_LANE(event);
>>> +
>>> +    /* Only one counter and it is in use */
>>> +    if (rp_info->event)
>>> +        return -ENOSPC;
>>
>> Out of curiosity, is this a technical limitation, or just a case of keeping the initial driver simple? The registers seem to imply that lane events and time events are pretty much independent of each other, so maybe it might be possible for one of each to coexist.
>
> The are independent of each other, I just to keep the driver simple :)
>
>>
>>> +
>>> +    rp_info->event = event;
>>> +
>>> +    hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
>>> +
>>> +    if (type == DWC_PCIE_LANE_EVENT) {
>>> +        dwc_pcie_pmu_event_enable(pdev, ras_des, 0);
>>> +        dwc_pcie_pmu_write_event_lane(pdev, ras_des, lane, event_id);
>>> +        dwc_pcie_pmu_set_event_id(pdev, ras_des, event_id);
>>> +        dwc_pcie_pmu_clear_event_counter(pdev, ras_des);
>>
>> Eww, the helpers were already confusing enough, but is it necessary to do *four* back-to-back RMW operations on the same register? Can you really not just write the whole thing in one go?
>
> Sure, I will warp them in one prepare().
>
>>
>>> +    } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
>>> +        dwc_pcie_pmu_base_time_enable(pdev, ras_des, 0);
>>> +        dwc_pcie_pmu_base_time_add_prepare(pdev, ras_des, event_id);
>>> +    } else {
>>> +        dev_err(pcie_pmu->dev, "invalid event type: 0x%x\n", type);
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if (flags & PERF_EF_START)
>>> +        dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
>>> +
>>> +    perf_event_update_userpage(event);
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
>>> +{
>>> +    struct dwc_pcie_rp_info *rp_info = pmu_to_pcie_info(event->pmu);
>>> +
>>> +    dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
>>> +    perf_event_update_userpage(event);
>>> +    rp_info->event = NULL;
>>> +}
>>> +
>>> +static void dwc_pcie_pmu_event_read(struct perf_event *event)
>>> +{
>>> +    dwc_pcie_pmu_event_update(event);
>>> +}
>>
>> What use is a function wrapper that does nothing but call a function with an identical signature? Whatever would call this can just call dwc_pcie_pmu_event_update() directly.
>
> Haha, I have to admit that I do not know why. The dmc620, dsu, smmu_pmu, arm-cci,
> arm-cci drivers use .update() in .read() with the smae identical signature.
>
> After a double check, the drivers fsl_imx8_ddr_perf.c, marvell_cn10k_ddr_pmu.c
> use .update() as .read(), e.g.
>
> .read = cn10k_ddr_perf_event_update,
>
> I think we should use .update() as .read() directly, right?
>
>>> +
>>> +static struct dwc_event_counters event_array[] = {
>>> +    {"tx_ack_dllp", 0x600},
>>> +    {"tx_update_fc_dllp", 0x601},
>>> +    {"rx_ack_dllp", 0x602},
>>> +    {"rx_update_fc_dllp", 0x603},
>>> +    {"rx_nulified_tlp", 0x604},
>>> +    {"tx_nulified_tlp", 0x605},
>>> +    {"rx_duplicate_tlp", 0x606},
>>> +    {"tx_memory_write", 0x700},
>>> +    {"tx_memory_read", 0x701},
>>> +    {"tx_configuration_write", 0x702},
>>> +    {"tx_configuration_read", 0x703},
>>> +    {"tx_io_write", 0x704},
>>> +    {"tx_io_read", 0x705},
>>> +    {"tx_completion_without_data", 0x706},
>>> +    {"tx_completion_with_data", 0x707},
>>> +    {"tx_message_tlp", 0x708},
>>> +    {"tx_atomic", 0x709},
>>> +    {"tx_tlp_with_prefix", 0x70A},
>>> +    {"rx_memory_write", 0x70B},
>>> +    {"rx_memory_read", 0x70C},
>>> +    {"rx_io_write", 0x70F},
>>> +    {"rx_io_read", 0x710},
>>> +    {"rx_completion_without_data", 0x711},
>>> +    {"rx_completion_with_data", 0x712},
>>> +    {"rx_message_tlp", 0x713},
>>> +    {"rx_atomic", 0x714},
>>> +    {"rx_tlp_with_prefix", 0x715},
>>> +    {"tx_ccix_tlp", 0x716},
>>> +    {"rx_ccix_tlp", 0x717},
>>> +};
>>> +
>>> +static int dwc_pcie_pmu_attr_init(struct dwc_pcie_pmu_priv *priv,
>>> +                  struct dwc_pcie_rp_info *rp_info)
>>> +{
>>> +    int i, j;
>>> +    char lane[8];
>>> +    const char tmp[64];
>>> +    int events_per_lane;
>>> +    int num_lane_events;
>>> +    int time_base_count;
>>> +    int num_attrs, attr_idx;
>>> +    struct dwc_pcie_event_attr *lane_attrs;
>>> +    struct attribute **pmu_attrs;
>>> +
>>> +    memset((void *)tmp, 0, sizeof(tmp));
>>> +    memset((void *)lane, 0, sizeof(lane));
>>> +    time_base_count = ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs);
>>> +    events_per_lane = ARRAY_SIZE(event_array);
>>> +    num_lane_events = rp_info->num_lanes * events_per_lane;
>>> +    num_attrs = time_base_count + num_lane_events;
>>> +
>>> +    rp_info->lane_event_attrs =
>>> +        devm_kcalloc(priv->dev, num_lane_events,
>>> +                sizeof(struct dwc_pcie_event_attr),
>>> +                GFP_KERNEL);
>>> +    if (!rp_info->lane_event_attrs)
>>> +        return -ENOMEM;
>>> +    lane_attrs = rp_info->lane_event_attrs;
>>> +    rp_info->pcie_pmu_event_attrs =
>>> +        devm_kcalloc(priv->dev, num_attrs, sizeof(struct attribute *),
>>> +             GFP_KERNEL);
>>> +    if (!rp_info->pcie_pmu_event_attrs)
>>> +        return -ENOMEM;
>>> +    pmu_attrs = rp_info->pcie_pmu_event_attrs;
>>> +
>>> +    for (i = 0; i < num_lane_events; i++) {
>>> +        lane_attrs[i].attr.attr.name =
>>> +            devm_kzalloc(priv->dev, sizeof(char)
>>> +                 * ATTRI_NAME_MAX_SIZE, GFP_KERNEL);
>>> +        if (!lane_attrs[i].attr.attr.name)
>>> +            return -ENOMEM;
>>> +    }
>>> +
>>> +    attr_idx = 0;
>>> +    for (i = 0; i < rp_info->num_lanes; i++) {
>>> +        sprintf(lane, "_lane%d", i);
>>> +
>>> +        for (j = 0; j < events_per_lane; j++) {
>>> +            int pos = i * events_per_lane + j;
>>> +
>>> +            strcat((char *)tmp, event_array[j].name);
>>> +            strcat((char *)tmp, lane);
>>> +            memcpy((void *)lane_attrs[pos].attr.attr.name,
>>> +                   (void *)tmp,
>>> +                   sizeof(tmp));
>>> +
>>> +            lane_attrs[pos].attr.attr.mode =
>>> +                VERIFY_OCTAL_PERMISSIONS(0444);
>>> +            lane_attrs[pos].attr.show = dwc_pcie_event_show;
>>> +            lane_attrs[pos].attr.store = NULL;
>>> +            lane_attrs[pos].type = DWC_PCIE_LANE_EVENT;
>>> +            lane_attrs[pos].eventid = event_array[j].event_id;
>>> +            lane_attrs[pos].lane = i;
>>> +            pmu_attrs[attr_idx++] = &lane_attrs[pos].attr.attr;
>>> +
>>> +            memset((void *)tmp, 0, sizeof(tmp));
>>> +        }
>>> +    }
>>> +
>>> +    for (i = 0; i < ARRAY_SIZE(dwc_pcie_pmu_time_event_attrs); i++)
>>> +        pmu_attrs[attr_idx++] = dwc_pcie_pmu_time_event_attrs[i];
>>> +
>>> +    rp_info->pcie_pmu_event_attrs[attr_idx++] = NULL;
>>> +
>>> +    rp_info->pcie_pmu_event_attrs_group.name = "events";
>>> +    rp_info->pcie_pmu_event_attrs_group.is_visible =
>>> +        pcie_pmu_event_attr_is_visible;
>>> +    rp_info->pcie_pmu_event_attrs_group.attrs =
>>> +        rp_info->pcie_pmu_event_attrs;
>>> +
>>> +    rp_info->pcie_pmu_attr_groups[0] =
>>> +        &rp_info->pcie_pmu_event_attrs_group;
>>> +    rp_info->pcie_pmu_attr_groups[1] = &pcie_pmu_format_attrs_group;
>>> +    rp_info->pcie_pmu_attr_groups[2] = &pcie_pmu_cpumask_attrs_group;
>>> +    rp_info->pcie_pmu_attr_groups[3] = NULL;
>>> +
>>> +    return 0;
>>
>> That took a while to make sense of... dynamically generating event attrs is a bit horrible, especially if it means you waste a bunch of memory duplicating the same stuff over and over again for every PMU instance.
>>
>> You can achieve the same effect far more neatly by statically defining all possible events, then using the .is_visible callback to hide any which are not supported by the given PMU instance.
>>
>> However, I'm not sure it's really worth even doing that (especially if I've counted right and it means 464 events for a maximum of 16 lanes). It doesn't seem like the difference in typing "tx_io_read_lane0" vs. "tx_io_read,lane=0" will have a significant impact on user experience. Conversely, I happen to know that some users actively dislike the experience of "perf list" spewing out hundreds of events that swamp out the ones they're looking for. If this is the SoC which also has arm-cmn then it's already more than bad enough (sorry!) - describing 29 distinct events as just 29 events will be a lot more popular and manageable. If you like you can output "lane=?" in the event string to make it even more obvious - perf tool will then require the user to provide a lane value instead of defaulting to 0 if it isn't specified.
>
> Got your point, the prolem is that we can't use part of symbolic name,
> right?
>
> "tx_io_read_lane0" vs. "tx_io_read,lane=0", I prefer the latter. But
> the right latter is "eventid=0x705,lane=0", right?
>
> I should add a description in document.
>
>>
>>> +}
>>> +
>>> +static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv,
>>> +                struct dwc_pcie_rp_info *rp_info)
>>> +{
>>> +    struct dwc_pcie_pmu *pcie_pmu;
>>> +    struct device *dev;
>>> +    char *name;
>>> +    int ret;
>>> +
>>> +    pcie_pmu = &rp_info->pcie_pmu;
>>> +    dev = &rp_info->pdev->dev;
>>> +
>>> +    ret = dwc_pcie_pmu_attr_init(priv, rp_info);
>>> +    if (ret) {
>>> +        pci_err(rp_info->pdev, "PMU attr init fail ret=%d\n", ret);
>>> +        return ret;
>>> +    }
>>> +
>>> +    pcie_pmu->dev = dev;
>>> +    pcie_pmu->pmu = (struct pmu) {
>>> +        .module        = THIS_MODULE,
>>> +        .task_ctx_nr    = perf_invalid_context,
>>> +        .pmu_enable    = NULL,
>>> +        .pmu_disable    = NULL,
>>
>> If I counted right there are still 28 other fields being default-initialised to 0 or NULL here, what's special about these two? ;)
>
> Nope, I will delete it. It is optional to implement.
>
>>
>>> +        .event_init    = dwc_pcie_pmu_event_init,
>>> +        .add        = dwc_pcie_pmu_event_add,
>>> +        .del        = dwc_pcie_pmu_event_del,
>>> +        .start        = dwc_pcie_pmu_event_start,
>>> +        .stop        = dwc_pcie_pmu_event_stop,
>>> +        .read        = dwc_pcie_pmu_event_read,
>>> +        .attr_groups    = rp_info->pcie_pmu_attr_groups,
>>> +        .capabilities    = PERF_PMU_CAP_NO_EXCLUDE,
>>> +    };
>>> +
>>> +    name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
>>> +                  rp_info->bdf);
>>> +    if (!name)
>>> +        return -ENOMEM;
>>> +
>>> +    /*
>>> +     * Pick one CPU to be the preferred one on local NUMA node.
>>> +     *
>>> +     * Note, this PMU does NOT support interrupt, set on_cpu to indicate it
>>> +     * is a uncore PMU device.
>>> +     */
>>> +    pcie_pmu->on_cpu = cpumask_local_spread(0, dev_to_node(pcie_pmu->dev));
>>
>> You need to cope with the possibility of that CPU going offline.
>
> As I commented here, the PMU does NOT support interrupt, and the on_cpu
> just indicate it is a uncore PMU device as Yicong Yang suggested. So the
> selected CPU is not really used.
>
> Do we need to care about that CPU going offline?
>
>>
>>> +    ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
>>> +    if (ret) {
>>> +        pci_err(rp_info->pdev, "Error %d registering PMU @%x\n", ret,
>>> +                 rp_info->bdf);
>>> +        return ret;
>>> +    }
>>> +
>>> +    rp_info->pmu_is_register = true;
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>>> +{
>>> +    struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>>> +    struct dwc_pcie_pmu *pcie_pmu;
>>> +    struct dwc_pcie_rp_info *rp_info;
>>> +
>>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>>> +        if (rp_info->pmu_is_register) {
>>> +            pcie_pmu = &rp_info->pcie_pmu;
>>> +            perf_pmu_unregister(&pcie_pmu->pmu);
>>> +        }
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>>> +{
>>> +    int ret;
>>> +    struct dwc_pcie_pmu_priv *priv;
>>> +    struct dwc_pcie_rp_info *rp_info;
>>> +
>>> +    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>>> +    if (!priv)
>>> +        return -ENOMEM;
>>> +
>>> +    priv->dev = &pdev->dev;
>>> +    platform_set_drvdata(pdev, priv);
>>> +
>>> +    /* If RAS_DES PMU is not supported on current platform, keep silent */
>>> +    ret = dwc_pcie_ras_des_discover(priv);
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
>>> +        struct pci_dev *rp = rp_info->pdev;
>>> +
>>> +        ret = __dwc_pcie_pmu_probe(priv, rp_info);
>>
>> If the PMUs are independent of each other, why not just probe them directly during the PCIe capability walk? That way you could easily ensure that the list only ever contains successfully-registered PMUs by construction, and make cleanup even simpler.
>
> Yes, they are independent of each other and belongs to different root port.
> Good suggestion, I will probe them during the PCIe capability walk in next version.
>
>
>>> +        if (ret) {
>>> +            dev_err(&rp->dev, "PCIe PMU probe fail\n");
>>> +            goto pmu_unregister;
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +pmu_unregister:
>>> +    dwc_pcie_pmu_remove(pdev);
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static struct platform_driver dwc_pcie_pmu_driver = {
>>> +    .probe = dwc_pcie_pmu_probe,
>>> +    .remove = dwc_pcie_pmu_remove,
>>> +    .driver = {.name = "dwc_pcie_pmu",},
>>> +};
>>> +
>>> +static int __init dwc_pcie_pmu_init(void)
>>> +{
>>> +    int ret;
>>> +
>>> +    ret = platform_driver_register(&dwc_pcie_pmu_driver);
>>> +
>>> +    if (ret)
>>> +        return ret;
>>> +
>>> +    dwc_pcie_pmu_dev = platform_device_register_simple(
>>> +                "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>>> +    if (IS_ERR(dwc_pcie_pmu_dev)) {
>>> +        platform_driver_unregister(&dwc_pcie_pmu_driver);
>>> +        return PTR_ERR(dwc_pcie_pmu_dev);
>>> +    }
>>
>> Why go through all this bother of inventing a device and registering a driver just to take a long round-trip through the driver core to call dwc_pcie_pmu_probe()? Why not just do the work of dwc_pcie_pmu_probe() right here? Sure, you'd also need to manually clean up a couple of allocations on failure or exit instead of using devres, but that's still considerably less hassle than invoking the whole driver model just to disguise a list_head and a couple of function calls.
>>
>
> I directly call dwc_pcie_pmu_probe in first internal version. I see
> some PMU drivers register simple platform devices, like riscv_pmu_legacy.c
> and riscv_pmu_sbi.c.
>
> If you are asking to call dwc_pcie_pmu_probe() directly, I'd
> like to change it back.
>
>> Thanks,
>> Robin.
>>

Hi, Robin

Any comments on my response? Looking forward to your reply :)

>
> Thank you very much for your valuable comments.
>
> Best Regards,
> Shuai


>
>
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static void __exit dwc_pcie_pmu_exit(void)
>>> +{
>>> +    platform_device_unregister(dwc_pcie_pmu_dev);
>>> +    platform_driver_unregister(&dwc_pcie_pmu_driver);
>>> +}
>>> +
>>> +module_init(dwc_pcie_pmu_init);
>>> +module_exit(dwc_pcie_pmu_exit);
>>> +
>>> +MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
>>> +MODULE_AUTHOR("Shuai xue <[email protected]>");
>>> +MODULE_AUTHOR("Wen Cheng <[email protected]>");
>>> +MODULE_LICENSE("GPL v2");

2023-05-16 13:13:50

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver

This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
Core controller IP which provides statistics feature. The PMU is not a PCIe
Root Complex integrated End Point(RCiEP) device but only register counters
provided by each PCIe Root Port.

To facilitate collection of statistics the controller provides the
following two features for each Root Port:

- Time Based Analysis (RX/TX data throughput and time spent in each
low-power LTSSM state)
- Event counters (Error and Non-Error for lanes)

Note, only one counter for each type and does not overflow interrupt.

This driver adds PMU devices for each PCIe Root Port. And the PMU device is
named based the BDF of Root Port. For example,

30:03.0 PCI bridge: Device 1ded:8000 (rev 01)

the PMU device name for this Root Port is dwc_rootport_3018.

Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

$# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/

average RX bandwidth can be calculated like this:

PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 700 ++++++++++++++++++++++++++++++++++++
3 files changed, 708 insertions(+)
create mode 100644 drivers/perf/dwc_pcie_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 711f82400086..d5750cbc67c4 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -209,6 +209,13 @@ config MARVELL_CN10K_DDR_PMU
Enable perf support for Marvell DDR Performance monitoring
event on CN10K platform.

+config DWC_PCIE_PMU
+ tristate "Enable Synopsys DesignWare PCIe PMU Support"
+ depends on ARM64 || COMPILE_TEST
+ help
+ Enable perf support for Synopsys DesignWare PCIe PMU Performance
+ monitoring event on Yitian 710 platform.
+
source "drivers/perf/arm_cspmu/Kconfig"

source "drivers/perf/amlogic/Kconfig"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index dabc859540ce..13a6d1b286da 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -22,5 +22,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
+obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
new file mode 100644
index 000000000000..1ecb06579137
--- /dev/null
+++ b/drivers/perf/dwc_pcie_pmu.c
@@ -0,0 +1,700 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synopsys DesignWare PCIe PMU driver
+ *
+ * Copyright (C) 2021-2023 Alibaba Inc.
+ */
+
+#include <linux/pci.h>
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cpuhotplug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
+
+#define DWC_PCIE_EVENT_CNT_CTL 0x8
+/*
+ * Event Counter Data Select includes two parts:
+ * - 27-24: Group number(4-bit: 0..0x7)
+ * - 23-16: Event number(8-bit: 0..0x13) within the Group
+ *
+ * Put them togother as TRM used.
+ */
+#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
+#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
+#define DWC_PCIE_CNT_STATUS BIT(7)
+#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
+#define DWC_PCIE_PER_EVENT_OFF 0x1
+#define DWC_PCIE_PER_EVENT_ON 0x3
+#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
+#define DWC_PCIE_EVENT_PER_CLEAR 0x1
+
+#define DWC_PCIE_EVENT_CNT_DATA 0xC
+
+#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
+#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
+#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
+#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
+#define DWC_PCIE_DURATION_1MS 0x1
+#define DWC_PCIE_DURATION_10MS 0x2
+#define DWC_PCIE_DURATION_100MS 0x3
+#define DWC_PCIE_DURATION_1S 0x4
+#define DWC_PCIE_DURATION_2S 0x5
+#define DWC_PCIE_DURATION_4S 0x6
+#define DWC_PCIE_DURATION_4US 0xff
+#define DWC_PCIE_TIME_BASED_TIMER_START BIT(0)
+#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
+
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
+
+/* Event attributes */
+#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
+#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
+#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
+
+#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
+#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
+#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
+
+enum dwc_pcie_event_type {
+ DWC_PCIE_TYPE_INVALID,
+ DWC_PCIE_TIME_BASE_EVENT,
+ DWC_PCIE_LANE_EVENT,
+};
+
+#define DWC_PCIE_LANE_EVENT_MAX_PERIOD (GENMASK_ULL(31, 0))
+#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD (GENMASK_ULL(63, 0))
+
+
+struct dwc_pcie_pmu {
+ struct pci_dev *pdev; /* Root Port device */
+ u32 ras_des; /* RAS DES capability offset */
+ u32 nr_lanes;
+
+ struct list_head pmu_node;
+ struct hlist_node cpuhp_node;
+ struct pmu pmu;
+ struct perf_event *event;
+ int oncpu;
+};
+
+struct dwc_pcie_pmu_priv {
+ struct device *dev;
+ struct list_head pmu_nodes;
+};
+
+#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
+
+static struct platform_device *dwc_pcie_pmu_dev;
+static int dwc_pcie_pmu_hp_state;
+
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
+
+ return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->oncpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL
+};
+
+static struct attribute_group dwc_pcie_cpumask_attr_group = {
+ .attrs = dwc_pcie_pmu_cpumask_attrs,
+};
+
+struct dwc_pcie_format_attr {
+ struct device_attribute attr;
+ u64 field;
+ int config;
+};
+
+static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
+ int lo = __ffs(fmt->field), hi = __fls(fmt->field);
+
+ return sysfs_emit(buf, "config:%d-%d\n", lo, hi);
+}
+
+#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
+ (&((struct dwc_pcie_format_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
+ .config = _cfg, \
+ .field = _fld, \
+ }})[0].attr.attr)
+
+#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
+
+static struct attribute *dwc_pcie_format_attrs[] = {
+ dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
+ dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
+ dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
+ NULL,
+};
+
+static struct attribute_group dwc_pcie_format_attrs_group = {
+ .name = "format",
+ .attrs = dwc_pcie_format_attrs,
+};
+
+struct dwc_pcie_event_attr {
+ struct device_attribute attr;
+ enum dwc_pcie_event_type type;
+ u16 eventid;
+ u8 lane;
+};
+
+static ssize_t dwc_pcie_event_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dwc_pcie_event_attr *eattr;
+
+ eattr = container_of(attr, typeof(*eattr), attr);
+
+ if (eattr->type == DWC_PCIE_LANE_EVENT)
+ return sysfs_emit(buf, "eventid=0x%x,type=0x%x,lane=?\n",
+ eattr->eventid, eattr->type);
+
+ return sysfs_emit(buf, "eventid=0x%x,type=0x%x\n", eattr->eventid,
+ eattr->type);
+}
+
+#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
+ (&((struct dwc_pcie_event_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
+ .type = _type, \
+ .eventid = _eventid, \
+ .lane = _lane, \
+ }})[0].attr.attr)
+
+#define DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
+#define DWC_PCIE_PMU_LANE_EVENT_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_LANE_EVENT, _eventid, 0)
+
+static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
+ /* Group #0 */
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(one_cycle, 0x00),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_L0S, 0x01),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(RX_L0S, 0x02),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L0, 0x03),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1, 0x04),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_1, 0x05),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_2, 0x06),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(CFG_RCVRY, 0x07),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_RX_L0S, 0x08),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_AUX, 0x09),
+
+ /* Group #1 */
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
+
+ /*
+ * Leave it to the user to specify the lane ID to avoid generating
+ * a list of hundreds of events.
+ */
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ack_dllp, 0x600),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_update_fc_dllp, 0x601),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ack_dllp, 0x602),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_update_fc_dllp, 0x603),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_nulified_tlp, 0x604),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_nulified_tlp, 0x605),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_duplicate_tl, 0x606),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_write, 0x700),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_read, 0x701),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_write, 0x702),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_read, 0x703),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_write, 0x704),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_read, 0x705),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_without_data, 0x706),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_with_data, 0x707),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_message_tlp, 0x708),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_atomic, 0x709),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_tlp_with_prefix, 0x70A),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_write, 0x70B),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_read, 0x70C),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_write, 0x70F),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_read, 0x710),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_without_data, 0x711),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_with_data, 0x712),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_message_tlp, 0x713),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_atomic, 0x714),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_tlp_with_prefix, 0x715),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ccix_tlp, 0x716),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ccix_tlp, 0x717),
+
+ NULL
+};
+
+static const struct attribute_group dwc_pcie_event_attrs_group = {
+ .name = "events",
+ .attrs = dwc_pcie_pmu_time_event_attrs,
+};
+
+static const struct attribute_group *dwc_pcie_attr_groups[] = {
+ &dwc_pcie_event_attrs_group,
+ &dwc_pcie_format_attrs_group,
+ &dwc_pcie_cpumask_attr_group,
+ NULL
+};
+
+static void dwc_pcie_pmu_lane_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+ bool enable)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ /* Clear DWC_PCIE_CNT_ENABLE field first */
+ val &= ~DWC_PCIE_CNT_ENABLE;
+ if (enable)
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
+ else
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_time_based_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+ bool enable)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ &val);
+
+ if (enable)
+ val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ else
+ val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ val);
+}
+
+static u64 dwc_pcie_pmu_read_lane_event_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
+
+ return val;
+}
+
+static u64 dwc_pcie_pmu_read_time_based_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u64 count;
+ u32 val;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &val);
+ count = val;
+ count <<= 32;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW, &val);
+
+ count += val;
+
+ return count;
+}
+
+static void dwc_pcie_pmu_event_update(struct perf_event *event)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ u64 delta, prev, now;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ now = dwc_pcie_pmu_read_lane_event_counter(pcie_pmu);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ now = dwc_pcie_pmu_read_time_based_counter(pcie_pmu);
+
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ delta = (now - prev) & DWC_PCIE_LANE_EVENT_MAX_PERIOD;
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ delta = (now - prev) & DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD;
+
+ local64_add(delta, &event->count);
+}
+
+static int dwc_pcie_pmu_event_init(struct perf_event *event)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ struct perf_event *sibling;
+ u32 lane;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* We don't support sampling */
+ if (is_sampling_event(event))
+ return -EINVAL;
+
+ /* We cannot support task bound events */
+ if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK)
+ return -EINVAL;
+
+ if (event->group_leader != event &&
+ !is_software_event(event->group_leader))
+ return -EINVAL;
+
+ for_each_sibling_event(sibling, event->group_leader) {
+ if (sibling->pmu != event->pmu && !is_software_event(sibling))
+ return -EINVAL;
+ }
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ lane = DWC_PCIE_EVENT_LANE(event);
+ if (lane < 0 || lane >= pcie_pmu->nr_lanes)
+ return -EINVAL;
+ }
+
+ event->cpu = pcie_pmu->oncpu;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
+{
+ local64_set(&hwc->prev_count, 0);
+}
+
+static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ hwc->state = 0;
+ dwc_pcie_pmu_set_period(hwc);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_lane_event_enable(pcie_pmu, true);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_time_based_event_enable(pcie_pmu, true);
+}
+
+static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_lane_event_enable(pcie_pmu, false);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_time_based_event_enable(pcie_pmu, false);
+
+ dwc_pcie_pmu_event_update(event);
+ hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+}
+
+static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ int event_id = DWC_PCIE_EVENT_ID(event);
+ int lane = DWC_PCIE_EVENT_LANE(event);
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 ctrl;
+
+ /* Only one counter and it is in use */
+ if (pcie_pmu->event)
+ return -ENOSPC;
+
+ pcie_pmu->event = event;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ /* EVENT_COUNTER_DATA_REG needs clear manually */
+ ctrl = FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id) |
+ FIELD_PREP(DWC_PCIE_CNT_LANE_SEL, lane) |
+ FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF) |
+ FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL,
+ ctrl);
+ } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
+ /*
+ * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
+ * use it with any manually controlled duration. And it is
+ * cleared when next measurement starts.
+ */
+ ctrl = FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id) |
+ FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
+ DWC_PCIE_DURATION_MANUAL_CTL) |
+ DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ pci_write_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, ctrl);
+ }
+
+ if (flags & PERF_EF_START)
+ dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+
+ dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
+ perf_event_update_userpage(event);
+ pcie_pmu->event = NULL;
+}
+
+static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv)
+{
+ struct pci_dev *pdev = NULL;
+ struct dwc_pcie_pmu *pcie_pmu;
+ char *name;
+ u32 bdf;
+ int ret;
+
+ INIT_LIST_HEAD(&priv->pmu_nodes);
+
+ /* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
+ for_each_pci_dev(pdev) {
+ u16 vsec;
+ u32 val;
+
+ if (!(pci_is_pcie(pdev) &&
+ pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
+ continue;
+
+ vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
+ DWC_PCIE_VSEC_RAS_DES_ID);
+ if (!vsec)
+ continue;
+
+ pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
+ if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
+ PCI_VNDR_HEADER_LEN(val) != 0x100)
+ continue;
+ pci_dbg(pdev,
+ "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
+
+ bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
+ name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
+ bdf);
+ if (!name)
+ return -ENOMEM;
+
+ /* All checks passed, go go go */
+ pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
+ if (!pcie_pmu) {
+ pci_dev_put(pdev);
+ return -ENOMEM;
+ }
+
+ pcie_pmu->pdev = pdev;
+ pcie_pmu->ras_des = vsec;
+ pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
+ pcie_pmu->pmu = (struct pmu){
+ .module = THIS_MODULE,
+ .attr_groups = dwc_pcie_attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
+ .task_ctx_nr = perf_invalid_context,
+ .event_init = dwc_pcie_pmu_event_init,
+ .add = dwc_pcie_pmu_event_add,
+ .del = dwc_pcie_pmu_event_del,
+ .start = dwc_pcie_pmu_event_start,
+ .stop = dwc_pcie_pmu_event_stop,
+ .read = dwc_pcie_pmu_event_update,
+ };
+
+ /* Add this instance to the list used by the offline callback */
+ ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
+ &pcie_pmu->cpuhp_node);
+ if (ret) {
+ pci_err(pcie_pmu->pdev,
+ "Error %d registering hotplug @%x\n", ret, bdf);
+ return ret;
+ }
+ ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
+ if (ret) {
+ pci_err(pcie_pmu->pdev,
+ "Error %d registering PMU @%x\n", ret, bdf);
+ cpuhp_state_remove_instance_nocalls(
+ dwc_pcie_pmu_hp_state, &pcie_pmu->cpuhp_node);
+ return ret;
+ }
+
+ /* Add registered PMUs and unregister them when this driver remove */
+ list_add(&pcie_pmu->pmu_node, &priv->pmu_nodes);
+ }
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_remove(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
+ struct dwc_pcie_pmu *pcie_pmu;
+
+ list_for_each_entry(pcie_pmu, &priv->pmu_nodes, pmu_node) {
+ cpuhp_state_remove_instance(dwc_pcie_pmu_hp_state,
+ &pcie_pmu->cpuhp_node);
+ perf_pmu_unregister(&pcie_pmu->pmu);
+ }
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_probe(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv;
+
+ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->dev = &pdev->dev;
+ platform_set_drvdata(pdev, priv);
+
+ /* If one PMU registration fails, remove all. */
+ if (__dwc_pcie_pmu_probe(priv))
+ dwc_pcie_pmu_remove(pdev);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_migrate(struct dwc_pcie_pmu *pcie_pmu, unsigned int cpu)
+{
+ /* This PMU does NOT support interrupt, just migrate context. */
+ perf_pmu_migrate_context(&pcie_pmu->pmu, pcie_pmu->oncpu, cpu);
+ pcie_pmu->oncpu = cpu;
+}
+
+static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct pci_dev *pdev;
+ int node;
+
+ pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+ pdev = pcie_pmu->pdev;
+ node = dev_to_node(&pdev->dev);
+
+ if (node != NUMA_NO_NODE && cpu_to_node(pcie_pmu->oncpu) != node &&
+ cpu_to_node(cpu) == node)
+ dwc_pcie_pmu_migrate(pcie_pmu, cpu);
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct pci_dev *pdev;
+ int node;
+ cpumask_t mask;
+ unsigned int target;
+
+ pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+ if (cpu != pcie_pmu->oncpu)
+ return 0;
+
+ pdev = pcie_pmu->pdev;
+ node = dev_to_node(&pdev->dev);
+ if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
+ cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
+ target = cpumask_any(&mask);
+ else
+ target = cpumask_any_but(cpu_online_mask, cpu);
+ if (target < nr_cpu_ids)
+ dwc_pcie_pmu_migrate(pcie_pmu, target);
+
+ return 0;
+}
+
+static struct platform_driver dwc_pcie_pmu_driver = {
+ .probe = dwc_pcie_pmu_probe,
+ .remove = dwc_pcie_pmu_remove,
+ .driver = {.name = "dwc_pcie_pmu",},
+};
+
+static int __init dwc_pcie_pmu_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+ "perf/dwc_pcie_pmu:online",
+ dwc_pcie_pmu_online_cpu,
+ dwc_pcie_pmu_offline_cpu);
+ if (ret < 0)
+ return ret;
+
+ dwc_pcie_pmu_hp_state = ret;
+
+ ret = platform_driver_register(&dwc_pcie_pmu_driver);
+ if (ret) {
+ cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
+ return ret;
+ }
+
+ dwc_pcie_pmu_dev = platform_device_register_simple(
+ "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
+ if (IS_ERR(dwc_pcie_pmu_dev)) {
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+ return PTR_ERR(dwc_pcie_pmu_dev);
+ }
+
+ return 0;
+}
+
+static void __exit dwc_pcie_pmu_exit(void)
+{
+ platform_device_unregister(dwc_pcie_pmu_dev);
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+}
+
+module_init(dwc_pcie_pmu_init);
+module_exit(dwc_pcie_pmu_exit);
+
+MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
+MODULE_AUTHOR("Shuai xue <[email protected]>");
+MODULE_AUTHOR("Wen Cheng <[email protected]>");
+MODULE_LICENSE("GPL v2");
--
2.20.1.12.g72788fdb


2023-05-16 13:19:49

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver

Alibaba's T-Head Yitan 710 SoC is built on Synopsys' widely deployed and
silicon-proven DesignWare Core PCIe controller which implements PMU for
performance and functional debugging to facilitate system maintenance.
Document it to provide guidance on how to use it.

Signed-off-by: Shuai Xue <[email protected]>
---
.../admin-guide/perf/dwc_pcie_pmu.rst | 61 +++++++++++++++++++
Documentation/admin-guide/perf/index.rst | 1 +
2 files changed, 62 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst

diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..0672e959ebe4
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,61 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+To facilitate collection of statistics, Synopsys DesignWare Cores PCIe
+controller provides the following two features:
+
+- Time Based Analysis (RX/TX data throughput and time spent in each
+ low-power LTSSM state)
+- Lane Event counters (Error and Non-Error for lanes)
+
+The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
+only register counters provided by each PCIe Root Port.
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+
+The counters are 64-bit width and measure data in two categories,
+
+- percentage of time does the controller stay in LTSSM state in a
+ configurable duration. The measurement range of each Event in Group#0.
+- amount of data processed (Units of 16 bytes). The measurement range of
+ each Event in Group#1.
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information in
+specific lane by the controller.
+
+The counters are 32-bit width and the measured event is select by:
+
+- Group i
+- Event j within the Group i
+- and Lane k
+
+Some of the event counters only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver add PMU devices for each PCIe Root Port. And the PMU device is
+named based the BDF of Root Port. For example,
+
+ 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
+
+ $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+average RX bandwidth can be calculated like this:
+
+ PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 9de64a40adab..11a80cd28a2e 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -19,5 +19,6 @@ Performance monitor support
arm_dsu_pmu
thunderx2-pmu
alibaba_pmu
+ dwc_pcie_pmu
nvidia-pmu
meson-ddr-pmu
--
2.20.1.12.g72788fdb


2023-05-16 13:26:26

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

Changes since v3:

1. addressing comments from Robin Murphy:
- add a prepare patch to define pci id in linux/pci_ids.h
- remove unnecessary 64BIT dependency
- fix DWC_PCIE_PER_EVENT_OFF/ON macro
- remove dwc_pcie_pmu struct and move all its fileds into dwc_pcie_rp_info
- remove unnecessary format field show
- use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
- remove unnecessary spaces and remove unnecessary cast to follow event show convention
- remove pcie_pmu_event_attr_is_visible
- fix a refcout leak on error branch when walk pci device in for_each_pci_dev
- remove bdf field from dwc_pcie_rp_info and calculate it at runtime
- finish all the checks before allocating rp_info to avoid hanging wasted memory
- remove some unused fields
- warp out control register configuration from sub function to .add()
- make function return type with a proper signature
- fix lane event count enable by clear DWC_PCIE_CNT_ENABLE field first
- pass rp_info directly to the read_*_counter helpers and in start, stop and add callbacks
- move event type validtion into .event_init()
- use is_sampling_event() to be consistent with everything else of pmu drivers
- remove unnecessary dev_err message in .event_init()
- return EINVAL instead EOPNOTSUPP for not a valid event
- finish all the checks before start modifying the event
- fix sibling event check by comparing event->pmu with sibling->pmu
- probe PMU for each rootport independently
- use .update() as .read() directly
- remove dynamically generating symbolic name of lane event
- redefine static symbolic name of lane event and leave lane filed to user
- add CPU hotplug support

2. addressing comments from Baolin:
- add a mask to avoid possible overflow

Changes since v2 addressing comments from Baolin:
- remove redundant macro definitions
- use dev_err to print error message
- change pmu_is_register to boolean
- use PLATFORM_DEVID_NONE macro
- fix module author format

Changes since v1:

1. address comments from Jonathan:
- drop marco for PMU name and VSEC version
- simplify code with PCI standard marco
- simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
- name register filed with single _ instead double
- wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
- check vendor id while matching vesc with pci_find_vsec_capability()
- remove RP_NUM_MAX and use a list to organize PMU devices for rootports
- replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
- comments on riping register together

2. address comments from Bjorn:
- rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
- rename cap_pos to ras_des
- simplify declare of device_attribute with DEVICE_ATTR_RO
- simplify code with PCI standard macro and API like pcie_get_width_cap()
- fix some code style problem and typo
- drop meaningless snaity check of container_of

3. address comments from Yicong:
- use sysfs_emit() to replace sprintf()
- simplify iteration of pci device with for_each_pci_dev
- pick preferred CPUs on a near die and add comments
- unregister PMU drivers only for failed ones
- log on behalf PMU device and give more hint
- fix some code style problem

(Thanks for all comments and they are very valuable to me)

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.

Shuai Xue (4):
docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
PCI: move Alibaba Vendor ID linux/pci_ids.h
drivers/perf: add DesignWare PCIe PMU driver
MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

.../admin-guide/perf/dwc_pcie_pmu.rst | 61 ++
Documentation/admin-guide/perf/index.rst | 1 +
MAINTAINERS | 6 +
drivers/infiniband/hw/erdma/erdma_hw.h | 2 -
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 700 ++++++++++++++++++
include/linux/pci_ids.h | 2 +
8 files changed, 778 insertions(+), 2 deletions(-)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
create mode 100644 drivers/perf/dwc_pcie_pmu.c

--
2.20.1.12.g72788fdb


2023-05-16 13:27:05

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h

Move Alibaba Vendor ID (0x1ded) to linux/pci_ids.h so that it can shared by
several drivers.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/infiniband/hw/erdma/erdma_hw.h | 2 --
include/linux/pci_ids.h | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
index 37ad1bb1917c..6985b689f632 100644
--- a/drivers/infiniband/hw/erdma/erdma_hw.h
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -11,8 +11,6 @@
#include <linux/types.h>

/* PCIe device related definition. */
-#define PCI_VENDOR_ID_ALIBABA 0x1ded
-
#define ERDMA_PCI_WIDTH 64
#define ERDMA_FUNC_BAR 0
#define ERDMA_MISX_BAR 2
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 45c3d62e616d..d07791bae9c5 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2585,6 +2585,8 @@
#define PCI_VENDOR_ID_TEKRAM 0x1de1
#define PCI_DEVICE_ID_TEKRAM_DC290 0xdc29

+#define PCI_VENDOR_ID_ALIBABA 0x1ded
+
#define PCI_VENDOR_ID_TEHUTI 0x1fc9
#define PCI_DEVICE_ID_TEHUTI_3009 0x3009
#define PCI_DEVICE_ID_TEHUTI_3010 0x3010
--
2.20.1.12.g72788fdb


2023-05-16 13:29:03

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v4 4/4] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ebd26b3ca90e..14f4db0f8977 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20378,6 +20378,12 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/dw_mmc*

+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M: Shuai Xue <[email protected]>
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
M: Eugeniy Paltsev <[email protected]>
S: Supported
--
2.20.1.12.g72788fdb


2023-05-16 14:47:17

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver

On Mon, 17 Apr 2023 14:17:27 +0800
Shuai Xue <[email protected]> wrote:

> Alibaba's T-Head Yitan 710 SoC is built on Synopsys' widely deployed and
> silicon-proven DesignWare Core PCIe controller which implements PMU for

Keep to most relevant facts in description only. Something like:

Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe controller
which implements ...

Or ask for advertising fees from Synopsys :)


> performance and functional debugging to facilitate system maintenance.
> Document it to provide guidance on how to use it.
>
> Signed-off-by: Shuai Xue <[email protected]>
> ---
> .../admin-guide/perf/dwc_pcie_pmu.rst | 61 +++++++++++++++++++
> Documentation/admin-guide/perf/index.rst | 1 +
> 2 files changed, 62 insertions(+)
> create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>
> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> new file mode 100644
> index 000000000000..0672e959ebe4
> --- /dev/null
> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> @@ -0,0 +1,61 @@
> +======================================================================
> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
> +======================================================================
> +
> +DesignWare Cores (DWC) PCIe PMU
> +===============================
> +
> +To facilitate collection of statistics, Synopsys DesignWare Cores PCIe
> +controller provides the following two features:
> +
> +- Time Based Analysis (RX/TX data throughput and time spent in each
> + low-power LTSSM state)
> +- Lane Event counters (Error and Non-Error for lanes)
> +
> +The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
> +only register counters provided by each PCIe Root Port.
> +
> +Time Based Analysis
> +-------------------
> +
> +Using this feature you can obtain information regarding RX/TX data
> +throughput and time spent in each low-power LTSSM state by the controller.
> +
> +The counters are 64-bit width and measure data in two categories,
> +
> +- percentage of time does the controller stay in LTSSM state in a
> + configurable duration. The measurement range of each Event in Group#0.
> +- amount of data processed (Units of 16 bytes). The measurement range of
> + each Event in Group#1.
> +
> +Lane Event counters
> +-------------------
> +
> +Using this feature you can obtain Error and Non-Error information in
> +specific lane by the controller.
> +
> +The counters are 32-bit width and the measured event is select by:
> +
> +- Group i
> +- Event j within the Group i
> +- and Lane k
> +
> +Some of the event counters only exist for specific configurations.
> +
> +DesignWare Cores (DWC) PCIe PMU Driver
> +=======================================
> +
> +This driver add PMU devices for each PCIe Root Port. And the PMU device is
> +named based the BDF of Root Port. For example,
> +
> + 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
> +
> +the PMU device name for this Root Port is dwc_rootport_3018.
I'd suggest renaming to a scheme lie
dwc_rootport_30:03.0
to save people remembering how to break up the BDF parts.

> +
> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
> +
> + $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
> +
> +average RX bandwidth can be calculated like this:
> +
> + PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Could consider an example of the other type of event, the error counters
you mention.

Otherwise, looks good to me.

Jonathan

> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
> index 9de64a40adab..11a80cd28a2e 100644
> --- a/Documentation/admin-guide/perf/index.rst
> +++ b/Documentation/admin-guide/perf/index.rst
> @@ -19,5 +19,6 @@ Performance monitor support
> arm_dsu_pmu
> thunderx2-pmu
> alibaba_pmu
> + dwc_pcie_pmu
> nvidia-pmu
> meson-ddr-pmu


2023-05-16 15:12:43

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver


PCI folks, Question below directed at you. Please take a look.
+CC linux-cxl because a similar question is going to bite us shortly
if we want CXL PMUs to work well on RP or Switch ports.

> >> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> >> +{
> >> +??? int index = 0;
> >> +??? struct pci_dev *pdev = NULL;
> >> +??? struct dwc_pcie_rp_info *rp_info;
> >> +
> >> +??? INIT_LIST_HEAD(&priv->rp_infos);
> >> +
> >> +??? /* Match the rootport with VSEC_RAS_DES_ID */
> >> +??? for_each_pci_dev(pdev) {
> >
> > Does the PCI layer not offer a more robust mechanism for this? (PCI fixups come to mind, but I don't actually know whether that would be a viable approach or not.)
>
> I am afraid not yet. Jonathan try to add a PMU service but it is not merged into mainline.
I wouldn't read much into that 'failure'. We never persisted with that driver because it was for an old
generation of hardware. Mostly the aim with that was to explore the area of PCIe PMU in general
rather than to get the support upstream. Some of the counters on that hardware were too small to
be of much use anyway :)

Grabbing just relevant functions..

Bjorn, we need to figure out a way forwards for this sort of case and I'd
appreciate your input on the broad brush question of 'how should it be done'?

This is a case where a PCIe port (RP here) correctly has the PCIe class code
so binds to the pcie_port driver, but has a VSEC (others examples use DOE, or DVSEC)
that provides extended functionality. The referred to PCIe PMU from our older Hisilicon
platforms did it by adding another service driver - that probably doesn't extend well.

The approach used here is to separately walk the PCI topology and register the devices.
It can 'maybe' get away with that because no interrupts and I assume resets have no
nasty impacts on it because the device is fairly simple. In general that's not going
to work. CXL does a similar trick (which I don't much like, but too late now),
but we've also run into the problem of how to get interrupts if not the main driver.

So what approach should we look at to solve this in general?

Jonathan

> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> +{
> + int index = 0;
> + struct pci_dev *pdev = NULL;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + INIT_LIST_HEAD(&priv->rp_infos);
> +
> + /* Match the rootport with VSEC_RAS_DES_ID */
> + for_each_pci_dev(pdev) {
> + u16 vsec;
> + u32 val;
> +
> + if (!pci_dev_is_rootport(pdev))
> + continue;
> +
> + rp_info = devm_kzalloc(&pdev->dev, sizeof(*rp_info), GFP_KERNEL);
> + if (!rp_info)
> + return -ENOMEM;
> +
> + rp_info->bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
> + rp_info->pdev = pdev;
> +
> + vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> + DWC_PCIE_VSEC_RAS_DES_ID);
> + if (!vsec)
> + continue;
> +
> + pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
> + if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
> + PCI_VNDR_HEADER_LEN(val) != 0x100)
> + continue;
> + pci_dbg(pdev,
> + "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
> +
> + rp_info->ras_des = vsec;
> + rp_info->num_lanes = pcie_get_width_cap(pdev);
> +
> + list_add(&rp_info->rp_node, &priv->rp_infos);
> + index++;
> + }
> +
> + if (!index)
> + return -ENODEV;
> +
> + priv->pcie_ctrl_num = index;
> +
> + return 0;
> +}


> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
> +{
> + int ret;
> + struct dwc_pcie_pmu_priv *priv;
> + struct dwc_pcie_rp_info *rp_info;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->dev = &pdev->dev;
> + platform_set_drvdata(pdev, priv);
> +
> + /* If RAS_DES PMU is not supported on current platform, keep silent */
> + ret = dwc_pcie_ras_des_discover(priv);
> + if (ret)
> + return ret;
> +
> + list_for_each_entry(rp_info, &priv->rp_infos, rp_node) {
> + struct pci_dev *rp = rp_info->pdev;
> +
> + ret = __dwc_pcie_pmu_probe(priv, rp_info);
> + if (ret) {
> + dev_err(&rp->dev, "PCIe PMU probe fail\n");
> + goto pmu_unregister;
> + }
> + }
> +
> + return 0;
> +
> +pmu_unregister:
> + dwc_pcie_pmu_remove(pdev);
> +
> + return ret;
> +}

2023-05-16 19:27:48

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver

On Tue, May 16, 2023 at 04:03:04PM +0100, Jonathan Cameron wrote:
>
> PCI folks, Question below directed at you. Please take a look.
> +CC linux-cxl because a similar question is going to bite us shortly
> if we want CXL PMUs to work well on RP or Switch ports.
>
> > >> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> > >> +{
> > >> +??? int index = 0;
> > >> +??? struct pci_dev *pdev = NULL;
> > >> +??? struct dwc_pcie_rp_info *rp_info;
> > >> +
> > >> +??? INIT_LIST_HEAD(&priv->rp_infos);
> > >> +
> > >> +??? /* Match the rootport with VSEC_RAS_DES_ID */
> > >> +??? for_each_pci_dev(pdev) {
> > >
> > > Does the PCI layer not offer a more robust mechanism for this?
> > > (PCI fixups come to mind, but I don't actually know whether that
> > > would be a viable approach or not.)
> >
> > I am afraid not yet. Jonathan try to add a PMU service but it is
> > not merged into mainline.
>
> I wouldn't read much into that 'failure'. We never persisted with
> that driver because it was for an old generation of hardware.
> Mostly the aim with that was to explore the area of PCIe PMU in
> general rather than to get the support upstream. Some of the
> counters on that hardware were too small to be of much use anyway :)
>
> Grabbing just relevant functions..
>
> Bjorn, we need to figure out a way forwards for this sort of case
> and I'd appreciate your input on the broad brush question of 'how
> should it be done'?
>
> This is a case where a PCIe port (RP here) correctly has the PCIe
> class code so binds to the pcie_port driver, but has a VSEC (others
> examples use DOE, or DVSEC) that provides extended functionality.
> The referred to PCIe PMU from our older Hisilicon platforms did it
> by adding another service driver - that probably doesn't extend
> well.
>
> The approach used here is to separately walk the PCI topology and
> register the devices. It can 'maybe' get away with that because no
> interrupts and I assume resets have no nasty impacts on it because
> the device is fairly simple. In general that's not going to work.
> CXL does a similar trick (which I don't much like, but too late
> now), but we've also run into the problem of how to get interrupts
> if not the main driver.

Yes, this is a real problem. I think the "walk all PCI devices
looking for one we like" approach is terrible because it breaks a lot
of driver model assumptions (no device ID to autoload module via udev,
hotplug doesn't work, etc), but we don't have a good alternative right
now.

I think portdrv is slightly better because at least it claims the
device in the usual way and gives a way for service drivers to
register with it. But I don't really like that either because it
created a new weird /sys/bus/pci_express hierarchy full of these
sub-devices that aren't really devices, and it doesn't solve the
module load and hotplug issues.

I would like to have portdrv be completely built into the PCI core and
not claim Root Ports or Switch Ports. Then those devices would be
available via the usual driver model for driver loading and binding
and for hotplug.

Bjorn

2023-05-16 19:52:42

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver

On Tue, May 16, 2023 at 09:01:09PM +0800, Shuai Xue wrote:
> ...

> +#include <linux/pci.h>
> +#include <linux/bitfield.h>
> +#include <linux/bitops.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/errno.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/smp.h>
> +#include <linux/sysfs.h>
> +#include <linux/types.h>

Typically in alpha order.

> +#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
> +
> +#define DWC_PCIE_EVENT_CNT_CTL 0x8

Add a blank line here.

> +/*
> + * Event Counter Data Select includes two parts:

> +#define DWC_PCIE_EVENT_CNT_DATA 0xC
> +#define DWC_PCIE_DURATION_4US 0xff
...
Pick upper-case hex or lower-case hex and use consistently.

> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD (GENMASK_ULL(31, 0))
> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD (GENMASK_ULL(63, 0))

Unnecessary outer "()".

> +struct dwc_pcie_pmu {
> + struct pci_dev *pdev; /* Root Port device */
> + u32 ras_des; /* RAS DES capability offset */

u16 is enough to address all of config space.

2023-05-17 00:04:35

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver

Hi Shuai,

kernel test robot noticed the following build errors:

[auto build test ERROR on pci/next]
[also build test ERROR on pci/for-linus soc/for-next linus/master v6.4-rc2 next-20230516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Shuai-Xue/PCI-move-Alibaba-Vendor-ID-linux-pci_ids-h/20230517-013326
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
patch link: https://lore.kernel.org/r/20230516130110.59632-4-xueshuai%40linux.alibaba.com
patch subject: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver
config: sh-allmodconfig
compiler: sh4-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/f576345a26fff4584ed49f0f42e03c65d8a7f2bf
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Shuai-Xue/PCI-move-Alibaba-Vendor-ID-linux-pci_ids-h/20230517-013326
git checkout f576345a26fff4584ed49f0f42e03c65d8a7f2bf
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash drivers/perf/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

drivers/perf/dwc_pcie_pmu.c: In function '__dwc_pcie_pmu_probe':
>> drivers/perf/dwc_pcie_pmu.c:507:24: error: implicit declaration of function 'pci_find_vsec_capability'; did you mean 'pci_find_ext_capability'? [-Werror=implicit-function-declaration]
507 | vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
| ^~~~~~~~~~~~~~~~~~~~~~~~
| pci_find_ext_capability
cc1: some warnings being treated as errors


vim +507 drivers/perf/dwc_pcie_pmu.c

487
488 static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv)
489 {
490 struct pci_dev *pdev = NULL;
491 struct dwc_pcie_pmu *pcie_pmu;
492 char *name;
493 u32 bdf;
494 int ret;
495
496 INIT_LIST_HEAD(&priv->pmu_nodes);
497
498 /* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
499 for_each_pci_dev(pdev) {
500 u16 vsec;
501 u32 val;
502
503 if (!(pci_is_pcie(pdev) &&
504 pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
505 continue;
506
> 507 vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
508 DWC_PCIE_VSEC_RAS_DES_ID);
509 if (!vsec)
510 continue;
511
512 pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
513 if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
514 PCI_VNDR_HEADER_LEN(val) != 0x100)
515 continue;
516 pci_dbg(pdev,
517 "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
518
519 bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
520 name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
521 bdf);
522 if (!name)
523 return -ENOMEM;
524
525 /* All checks passed, go go go */
526 pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
527 if (!pcie_pmu) {
528 pci_dev_put(pdev);
529 return -ENOMEM;
530 }
531
532 pcie_pmu->pdev = pdev;
533 pcie_pmu->ras_des = vsec;
534 pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
535 pcie_pmu->pmu = (struct pmu){
536 .module = THIS_MODULE,
537 .attr_groups = dwc_pcie_attr_groups,
538 .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
539 .task_ctx_nr = perf_invalid_context,
540 .event_init = dwc_pcie_pmu_event_init,
541 .add = dwc_pcie_pmu_event_add,
542 .del = dwc_pcie_pmu_event_del,
543 .start = dwc_pcie_pmu_event_start,
544 .stop = dwc_pcie_pmu_event_stop,
545 .read = dwc_pcie_pmu_event_update,
546 };
547
548 /* Add this instance to the list used by the offline callback */
549 ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
550 &pcie_pmu->cpuhp_node);
551 if (ret) {
552 pci_err(pcie_pmu->pdev,
553 "Error %d registering hotplug @%x\n", ret, bdf);
554 return ret;
555 }
556 ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
557 if (ret) {
558 pci_err(pcie_pmu->pdev,
559 "Error %d registering PMU @%x\n", ret, bdf);
560 cpuhp_state_remove_instance_nocalls(
561 dwc_pcie_pmu_hp_state, &pcie_pmu->cpuhp_node);
562 return ret;
563 }
564
565 /* Add registered PMUs and unregister them when this driver remove */
566 list_add(&pcie_pmu->pmu_node, &priv->pmu_nodes);
567 }
568
569 return 0;
570 }
571

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests


Attachments:
(No filename) (5.42 kB)
config (253.55 kB)
Download all attachments

2023-05-17 01:31:34

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver



On 2023/5/16 22:32, Jonathan Cameron wrote:
> On Mon, 17 Apr 2023 14:17:27 +0800
> Shuai Xue <[email protected]> wrote:
>
>> Alibaba's T-Head Yitan 710 SoC is built on Synopsys' widely deployed and
>> silicon-proven DesignWare Core PCIe controller which implements PMU for
>
> Keep to most relevant facts in description only. Something like:
>
> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe controller
> which implements ...
>
> Or ask for advertising fees from Synopsys :)

Haha, I will keep it more simple facts.

>
>
>> performance and functional debugging to facilitate system maintenance.
>> Document it to provide guidance on how to use it.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> .../admin-guide/perf/dwc_pcie_pmu.rst | 61 +++++++++++++++++++
>> Documentation/admin-guide/perf/index.rst | 1 +
>> 2 files changed, 62 insertions(+)
>> create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>>
>> diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> new file mode 100644
>> index 000000000000..0672e959ebe4
>> --- /dev/null
>> +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> @@ -0,0 +1,61 @@
>> +======================================================================
>> +Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
>> +======================================================================
>> +
>> +DesignWare Cores (DWC) PCIe PMU
>> +===============================
>> +
>> +To facilitate collection of statistics, Synopsys DesignWare Cores PCIe
>> +controller provides the following two features:
>> +
>> +- Time Based Analysis (RX/TX data throughput and time spent in each
>> + low-power LTSSM state)
>> +- Lane Event counters (Error and Non-Error for lanes)
>> +
>> +The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
>> +only register counters provided by each PCIe Root Port.
>> +
>> +Time Based Analysis
>> +-------------------
>> +
>> +Using this feature you can obtain information regarding RX/TX data
>> +throughput and time spent in each low-power LTSSM state by the controller.
>> +
>> +The counters are 64-bit width and measure data in two categories,
>> +
>> +- percentage of time does the controller stay in LTSSM state in a
>> + configurable duration. The measurement range of each Event in Group#0.
>> +- amount of data processed (Units of 16 bytes). The measurement range of
>> + each Event in Group#1.
>> +
>> +Lane Event counters
>> +-------------------
>> +
>> +Using this feature you can obtain Error and Non-Error information in
>> +specific lane by the controller.
>> +
>> +The counters are 32-bit width and the measured event is select by:
>> +
>> +- Group i
>> +- Event j within the Group i
>> +- and Lane k
>> +
>> +Some of the event counters only exist for specific configurations.
>> +
>> +DesignWare Cores (DWC) PCIe PMU Driver
>> +=======================================
>> +
>> +This driver add PMU devices for each PCIe Root Port. And the PMU device is
>> +named based the BDF of Root Port. For example,
>> +
>> + 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>> +
>> +the PMU device name for this Root Port is dwc_rootport_3018.
> I'd suggest renaming to a scheme lie
> dwc_rootport_30:03.0
> to save people remembering how to break up the BDF parts.
>
>> +
>> +Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>> +
>> + $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>> +
>> +average RX bandwidth can be calculated like this:
>> +
>> + PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>
> Could consider an example of the other type of event, the error counters
> you mention.

Sure, I will add an example to show how to use it.

>
> Otherwise, looks good to me.
>
> Jonathan

Thank you.

Best Regards.
Shuai


>
>> diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
>> index 9de64a40adab..11a80cd28a2e 100644
>> --- a/Documentation/admin-guide/perf/index.rst
>> +++ b/Documentation/admin-guide/perf/index.rst
>> @@ -19,5 +19,6 @@ Performance monitor support
>> arm_dsu_pmu
>> thunderx2-pmu
>> alibaba_pmu
>> + dwc_pcie_pmu
>> nvidia-pmu
>> meson-ddr-pmu

2023-05-17 02:58:26

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver



On 2023/5/17 03:19, Bjorn Helgaas wrote:
> On Tue, May 16, 2023 at 09:01:09PM +0800, Shuai Xue wrote:
>> ...
>
>> +#include <linux/pci.h>
>> +#include <linux/bitfield.h>
>> +#include <linux/bitops.h>
>> +#include <linux/cpuhotplug.h>
>> +#include <linux/cpumask.h>
>> +#include <linux/device.h>
>> +#include <linux/errno.h>
>> +#include <linux/kernel.h>
>> +#include <linux/list.h>
>> +#include <linux/perf_event.h>
>> +#include <linux/platform_device.h>
>> +#include <linux/smp.h>
>> +#include <linux/sysfs.h>
>> +#include <linux/types.h>
>
> Typically in alpha order.

Got it, I will reorder them.

>
>> +#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
>> +
>> +#define DWC_PCIE_EVENT_CNT_CTL 0x8
>
> Add a blank line here.

Sure, will add it.

>
>> +/*
>> + * Event Counter Data Select includes two parts:
>
>> +#define DWC_PCIE_EVENT_CNT_DATA 0xC
>> +#define DWC_PCIE_DURATION_4US 0xff
> ...
> Pick upper-case hex or lower-case hex and use consistently.

Will pick upper-case hex for all macros.

>
>> +#define DWC_PCIE_LANE_EVENT_MAX_PERIOD (GENMASK_ULL(31, 0))
>> +#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD (GENMASK_ULL(63, 0))
>
> Unnecessary outer "()".

Ok, will remove it.

>
>> +struct dwc_pcie_pmu {
>> + struct pci_dev *pdev; /* Root Port device */
>> + u32 ras_des; /* RAS DES capability offset */
>
> u16 is enough to address all of config space.

Go it. will fix in next version.

Thank you :)

Best Regards,
Shuai



2023-05-17 04:36:29

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver



On 2023/5/17 07:21, kernel test robot wrote:
> Hi Shuai,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on pci/next]
> [also build test ERROR on pci/for-linus soc/for-next linus/master v6.4-rc2 next-20230516]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Shuai-Xue/PCI-move-Alibaba-Vendor-ID-linux-pci_ids-h/20230517-013326
> base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git next
> patch link: https://lore.kernel.org/r/20230516130110.59632-4-xueshuai%40linux.alibaba.com
> patch subject: [PATCH v4 3/4] drivers/perf: add DesignWare PCIe PMU driver
> config: sh-allmodconfig
> compiler: sh4-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
> wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # https://github.com/intel-lab-lkp/linux/commit/f576345a26fff4584ed49f0f42e03c65d8a7f2bf
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review Shuai-Xue/PCI-move-Alibaba-Vendor-ID-linux-pci_ids-h/20230517-013326
> git checkout f576345a26fff4584ed49f0f42e03c65d8a7f2bf
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh olddefconfig
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sh SHELL=/bin/bash drivers/perf/
>
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All errors (new ones prefixed by >>):
>
> drivers/perf/dwc_pcie_pmu.c: In function '__dwc_pcie_pmu_probe':
>>> drivers/perf/dwc_pcie_pmu.c:507:24: error: implicit declaration of function 'pci_find_vsec_capability'; did you mean 'pci_find_ext_capability'? [-Werror=implicit-function-declaration]
> 507 | vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> | ^~~~~~~~~~~~~~~~~~~~~~~~
> | pci_find_ext_capability
> cc1: some warnings being treated as errors

This will be fixed by:

- remove COMPILE_TEST and add depend on PCI in kconfig

Thank you.
Shuai

>
> vim +507 drivers/perf/dwc_pcie_pmu.c
>
> 487
> 488 static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv)
> 489 {
> 490 struct pci_dev *pdev = NULL;
> 491 struct dwc_pcie_pmu *pcie_pmu;
> 492 char *name;
> 493 u32 bdf;
> 494 int ret;
> 495
> 496 INIT_LIST_HEAD(&priv->pmu_nodes);
> 497
> 498 /* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
> 499 for_each_pci_dev(pdev) {
> 500 u16 vsec;
> 501 u32 val;
> 502
> 503 if (!(pci_is_pcie(pdev) &&
> 504 pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
> 505 continue;
> 506
> > 507 vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
> 508 DWC_PCIE_VSEC_RAS_DES_ID);
> 509 if (!vsec)
> 510 continue;
> 511
> 512 pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
> 513 if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
> 514 PCI_VNDR_HEADER_LEN(val) != 0x100)
> 515 continue;
> 516 pci_dbg(pdev,
> 517 "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
> 518
> 519 bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
> 520 name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
> 521 bdf);
> 522 if (!name)
> 523 return -ENOMEM;
> 524
> 525 /* All checks passed, go go go */
> 526 pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
> 527 if (!pcie_pmu) {
> 528 pci_dev_put(pdev);
> 529 return -ENOMEM;
> 530 }
> 531
> 532 pcie_pmu->pdev = pdev;
> 533 pcie_pmu->ras_des = vsec;
> 534 pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
> 535 pcie_pmu->pmu = (struct pmu){
> 536 .module = THIS_MODULE,
> 537 .attr_groups = dwc_pcie_attr_groups,
> 538 .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
> 539 .task_ctx_nr = perf_invalid_context,
> 540 .event_init = dwc_pcie_pmu_event_init,
> 541 .add = dwc_pcie_pmu_event_add,
> 542 .del = dwc_pcie_pmu_event_del,
> 543 .start = dwc_pcie_pmu_event_start,
> 544 .stop = dwc_pcie_pmu_event_stop,
> 545 .read = dwc_pcie_pmu_event_update,
> 546 };
> 547
> 548 /* Add this instance to the list used by the offline callback */
> 549 ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
> 550 &pcie_pmu->cpuhp_node);
> 551 if (ret) {
> 552 pci_err(pcie_pmu->pdev,
> 553 "Error %d registering hotplug @%x\n", ret, bdf);
> 554 return ret;
> 555 }
> 556 ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
> 557 if (ret) {
> 558 pci_err(pcie_pmu->pdev,
> 559 "Error %d registering PMU @%x\n", ret, bdf);
> 560 cpuhp_state_remove_instance_nocalls(
> 561 dwc_pcie_pmu_hp_state, &pcie_pmu->cpuhp_node);
> 562 return ret;
> 563 }
> 564
> 565 /* Add registered PMUs and unregister them when this driver remove */
> 566 list_add(&pcie_pmu->pmu_node, &priv->pmu_nodes);
> 567 }
> 568
> 569 return 0;
> 570 }
> 571
>

2023-05-17 09:59:14

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver

On Tue, 16 May 2023 14:17:52 -0500
Bjorn Helgaas <[email protected]> wrote:

> On Tue, May 16, 2023 at 04:03:04PM +0100, Jonathan Cameron wrote:
> >
> > PCI folks, Question below directed at you. Please take a look.
> > +CC linux-cxl because a similar question is going to bite us shortly
> > if we want CXL PMUs to work well on RP or Switch ports.
> >
> > > >> +static int dwc_pcie_ras_des_discover(struct dwc_pcie_pmu_priv *priv)
> > > >> +{
> > > >> +??? int index = 0;
> > > >> +??? struct pci_dev *pdev = NULL;
> > > >> +??? struct dwc_pcie_rp_info *rp_info;
> > > >> +
> > > >> +??? INIT_LIST_HEAD(&priv->rp_infos);
> > > >> +
> > > >> +??? /* Match the rootport with VSEC_RAS_DES_ID */
> > > >> +??? for_each_pci_dev(pdev) {
> > > >
> > > > Does the PCI layer not offer a more robust mechanism for this?
> > > > (PCI fixups come to mind, but I don't actually know whether that
> > > > would be a viable approach or not.)
> > >
> > > I am afraid not yet. Jonathan try to add a PMU service but it is
> > > not merged into mainline.
> >
> > I wouldn't read much into that 'failure'. We never persisted with
> > that driver because it was for an old generation of hardware.
> > Mostly the aim with that was to explore the area of PCIe PMU in
> > general rather than to get the support upstream. Some of the
> > counters on that hardware were too small to be of much use anyway :)
> >
> > Grabbing just relevant functions..
> >
> > Bjorn, we need to figure out a way forwards for this sort of case
> > and I'd appreciate your input on the broad brush question of 'how
> > should it be done'?
> >
> > This is a case where a PCIe port (RP here) correctly has the PCIe
> > class code so binds to the pcie_port driver, but has a VSEC (others
> > examples use DOE, or DVSEC) that provides extended functionality.
> > The referred to PCIe PMU from our older Hisilicon platforms did it
> > by adding another service driver - that probably doesn't extend
> > well.
> >
> > The approach used here is to separately walk the PCI topology and
> > register the devices. It can 'maybe' get away with that because no
> > interrupts and I assume resets have no nasty impacts on it because
> > the device is fairly simple. In general that's not going to work.
> > CXL does a similar trick (which I don't much like, but too late
> > now), but we've also run into the problem of how to get interrupts
> > if not the main driver.
>
> Yes, this is a real problem. I think the "walk all PCI devices
> looking for one we like" approach is terrible because it breaks a lot
> of driver model assumptions (no device ID to autoload module via udev,
> hotplug doesn't work, etc), but we don't have a good alternative right
> now.
>
> I think portdrv is slightly better because at least it claims the
> device in the usual way and gives a way for service drivers to
> register with it. But I don't really like that either because it
> created a new weird /sys/bus/pci_express hierarchy full of these
> sub-devices that aren't really devices, and it doesn't solve the
> module load and hotplug issues.
>
> I would like to have portdrv be completely built into the PCI core and
> not claim Root Ports or Switch Ports. Then those devices would be
> available via the usual driver model for driver loading and binding
> and for hotplug.

Let me see if I understand this correctly as I can think of a few options
that perhaps are inline with what you are thinking.

1) All the portdrv stuff converted to normal PCI core helper functions
that a driver bound to the struct pci_dev can use.
2) Driver core itself provides a bunch of extra devices alongside the
struct pci_dev one to which additional drivers can bind? - so kind
of portdrv handling, but squashed into the PCI device topology?
3) Have portdrv operated under the hood, so all the services etc that
it provides don't require a driver to be bound at all. Then
allow usual VID/DID based driver binding.

If 1 - we are going to run into class device restrictions and that will
just move where we have to handle the potential vendor specific parts.
We probably don't want that to be a hydra with all the functionality
and lookups etc driven from there, so do we end up with sub devices
of that new PCI port driver with a discover method based on either
vsec + VID or DVSEC with devices created under the main pci_dev.
That would have to include nastiness around interrupt discovery for
those sub devices. So ends up roughly like port_drv.

I don't think 2 solves anything.

For 3 - interrupts and ownership of facilities is going to be tricky
as initially those need to be owned by the PCI core (no device driver bound)
and then I guess handed off to the driver once it shows up? Maybe that
driver should call a pci_claim_port() that gives it control of everything
and pci_release_port() that hands it all back to the core. That seems
racey.

As another similar proposal to 3 (and one Greg KH will hate :) can we
do something similar to vfio and allow an unbind of a class driver followed by
a bind of a more specific one?

I think 1 is probably the easiest to implement, but it just moves
the problem.

If we had a way to reliably override the class driver if a more specific
one exists, that might work around the problem but I don't think we
can do that currently.

Jonathan


>
> Bjorn


2023-05-17 16:50:22

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver

On Wed, May 17, 2023 at 10:54:21AM +0100, Jonathan Cameron wrote:
> On Tue, 16 May 2023 14:17:52 -0500
> Bjorn Helgaas <[email protected]> wrote:
>
> > On Tue, May 16, 2023 at 04:03:04PM +0100, Jonathan Cameron wrote:
> ...

> > > The approach used here is to separately walk the PCI topology and
> > > register the devices. It can 'maybe' get away with that because no
> > > interrupts and I assume resets have no nasty impacts on it because
> > > the device is fairly simple. In general that's not going to work.
> > > CXL does a similar trick (which I don't much like, but too late
> > > now), but we've also run into the problem of how to get interrupts
> > > if not the main driver.
> >
> > Yes, this is a real problem. I think the "walk all PCI devices
> > looking for one we like" approach is terrible because it breaks a lot
> > of driver model assumptions (no device ID to autoload module via udev,
> > hotplug doesn't work, etc), but we don't have a good alternative right
> > now.
> >
> > I think portdrv is slightly better because at least it claims the
> > device in the usual way and gives a way for service drivers to
> > register with it. But I don't really like that either because it
> > created a new weird /sys/bus/pci_express hierarchy full of these
> > sub-devices that aren't really devices, and it doesn't solve the
> > module load and hotplug issues.
> >
> > I would like to have portdrv be completely built into the PCI core and
> > not claim Root Ports or Switch Ports. Then those devices would be
> > available via the usual driver model for driver loading and binding
> > and for hotplug.
>
> Let me see if I understand this correctly as I can think of a few options
> that perhaps are inline with what you are thinking.
>
> 1) All the portdrv stuff converted to normal PCI core helper functions
> that a driver bound to the struct pci_dev can use.
> 2) Driver core itself provides a bunch of extra devices alongside the
> struct pci_dev one to which additional drivers can bind? - so kind
> of portdrv handling, but squashed into the PCI device topology?
> 3) Have portdrv operated under the hood, so all the services etc that
> it provides don't require a driver to be bound at all. Then
> allow usual VID/DID based driver binding.
>
> If 1 - we are going to run into class device restrictions and that will
> just move where we have to handle the potential vendor specific parts.
> We probably don't want that to be a hydra with all the functionality
> and lookups etc driven from there, so do we end up with sub devices
> of that new PCI port driver with a discover method based on either
> vsec + VID or DVSEC with devices created under the main pci_dev.
> That would have to include nastiness around interrupt discovery for
> those sub devices. So ends up roughly like port_drv.
>
> I don't think 2 solves anything.
>
> For 3 - interrupts and ownership of facilities is going to be tricky
> as initially those need to be owned by the PCI core (no device driver bound)
> and then I guess handed off to the driver once it shows up? Maybe that
> driver should call a pci_claim_port() that gives it control of everything
> and pci_release_port() that hands it all back to the core. That seems
> racey.

Yes, 3 is the option I want to explore. That's what we already do for
things like ASPM. Agreed, interrupts is a potential issue. I think
the architected parts of config space should be implicitly owned by
the PCI core, with interfaces ? la pci_disable_link_state() if drivers
need them.

Bjorn

2023-05-19 10:35:16

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] drivers/perf: add DesignWare PCIe PMU driver



On 2023/5/18 00:27, Bjorn Helgaas wrote:
> On Wed, May 17, 2023 at 10:54:21AM +0100, Jonathan Cameron wrote:
>> On Tue, 16 May 2023 14:17:52 -0500
>> Bjorn Helgaas <[email protected]> wrote:
>>
>>> On Tue, May 16, 2023 at 04:03:04PM +0100, Jonathan Cameron wrote:
>> ...
>
>>>> The approach used here is to separately walk the PCI topology and
>>>> register the devices. It can 'maybe' get away with that because no
>>>> interrupts and I assume resets have no nasty impacts on it because
>>>> the device is fairly simple. In general that's not going to work.
>>>> CXL does a similar trick (which I don't much like, but too late
>>>> now), but we've also run into the problem of how to get interrupts
>>>> if not the main driver.
>>>
>>> Yes, this is a real problem. I think the "walk all PCI devices
>>> looking for one we like" approach is terrible because it breaks a lot
>>> of driver model assumptions (no device ID to autoload module via udev,
>>> hotplug doesn't work, etc), but we don't have a good alternative right
>>> now.
>>>
>>> I think portdrv is slightly better because at least it claims the
>>> device in the usual way and gives a way for service drivers to
>>> register with it. But I don't really like that either because it
>>> created a new weird /sys/bus/pci_express hierarchy full of these
>>> sub-devices that aren't really devices, and it doesn't solve the
>>> module load and hotplug issues.
>>>
>>> I would like to have portdrv be completely built into the PCI core and
>>> not claim Root Ports or Switch Ports. Then those devices would be
>>> available via the usual driver model for driver loading and binding
>>> and for hotplug.
>>
>> Let me see if I understand this correctly as I can think of a few options
>> that perhaps are inline with what you are thinking.
>>
>> 1) All the portdrv stuff converted to normal PCI core helper functions
>> that a driver bound to the struct pci_dev can use.
>> 2) Driver core itself provides a bunch of extra devices alongside the
>> struct pci_dev one to which additional drivers can bind? - so kind
>> of portdrv handling, but squashed into the PCI device topology?
>> 3) Have portdrv operated under the hood, so all the services etc that
>> it provides don't require a driver to be bound at all. Then
>> allow usual VID/DID based driver binding.
>>
>> If 1 - we are going to run into class device restrictions and that will
>> just move where we have to handle the potential vendor specific parts.
>> We probably don't want that to be a hydra with all the functionality
>> and lookups etc driven from there, so do we end up with sub devices
>> of that new PCI port driver with a discover method based on either
>> vsec + VID or DVSEC with devices created under the main pci_dev.
>> That would have to include nastiness around interrupt discovery for
>> those sub devices. So ends up roughly like port_drv.
>>
>> I don't think 2 solves anything.
>>
>> For 3 - interrupts and ownership of facilities is going to be tricky
>> as initially those need to be owned by the PCI core (no device driver bound)
>> and then I guess handed off to the driver once it shows up? Maybe that
>> driver should call a pci_claim_port() that gives it control of everything
>> and pci_release_port() that hands it all back to the core. That seems
>> racey.
>
> Yes, 3 is the option I want to explore. That's what we already do for
> things like ASPM. Agreed, interrupts is a potential issue. I think
> the architected parts of config space should be implicitly owned by
> the PCI core, with interfaces à la pci_disable_link_state() if drivers
> need them.
>

I agree "walk all PCI devices looking for one we like" approach is terrible
in general. And I am glad to modify my code to adapt to a more suitable solution
when it comes.

For now, I will collect comments from v3 and v4 and send a new version after
addressed them. Any alternative option is welcomed, always :)

> Bjorn

Thank you.

Best Regards,
Shuai

2023-05-22 03:57:44

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver

Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
controller which implements which implements PMU for performance and
functional debugging to facilitate system maintenance.

Document it to provide guidance on how to use it.

Signed-off-by: Shuai Xue <[email protected]>
---
.../admin-guide/perf/dwc_pcie_pmu.rst | 97 +++++++++++++++++++
Documentation/admin-guide/perf/index.rst | 1 +
2 files changed, 98 insertions(+)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst

diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
new file mode 100644
index 000000000000..c1f671cb64ec
--- /dev/null
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -0,0 +1,97 @@
+======================================================================
+Synopsys DesignWare Cores (DWC) PCIe Performance Monitoring Unit (PMU)
+======================================================================
+
+DesignWare Cores (DWC) PCIe PMU
+===============================
+
+The PMU is not a PCIe Root Complex integrated End Point (RCiEP) device but
+only PCIe configuration space register block provided by each PCIe Root
+Port in a Vendor-Specific Extended Capability named RAS DES (Debug, Error
+injection, and Statistics).
+
+As the name indicated, the RAS DES capability supports system level
+debugging, AER error injection, and collection of statistics. To facilitate
+collection of statistics, Synopsys DesignWare Cores PCIe controller
+provides the following two features:
+
+- Time Based Analysis (RX/TX data throughput and time spent in each
+ low-power LTSSM state)
+- Lane Event counters (Error and Non-Error for lanes)
+
+Time Based Analysis
+-------------------
+
+Using this feature you can obtain information regarding RX/TX data
+throughput and time spent in each low-power LTSSM state by the controller.
+
+The counters are 64-bit width and measure data in two categories,
+
+- percentage of time does the controller stay in LTSSM state in a
+ configurable duration. The measurement range of each Event in Group#0.
+- amount of data processed (Units of 16 bytes). The measurement range of
+ each Event in Group#1.
+
+Lane Event counters
+-------------------
+
+Using this feature you can obtain Error and Non-Error information in
+specific lane by the controller.
+
+The counters are 32-bit width and the measured event is select by:
+
+- Group i
+- Event j within the Group i
+- and Lane k
+
+Some of the event counters only exist for specific configurations.
+
+DesignWare Cores (DWC) PCIe PMU Driver
+=======================================
+
+This driver add PMU devices for each PCIe Root Port. And the PMU device is
+named based the BDF of Root Port. For example,
+
+ 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+
+the PMU device name for this Root Port is dwc_rootport_3018.
+
+The DWC PCIe PMU driver registers a perf PMU driver, which provides
+description of available events and configuration options in sysfs, see
+/sys/bus/event_source/devices/dwc_rootport_{bdf}.
+
+The "format" directory describes format of the config, fields of the
+perf_event_attr structure. The "events" directory provides configuration
+templates for all documented events. For example,
+"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
+
+The "perf list" command shall list the available events from sysfs, e.g.::
+
+ $# perf list | grep dwc_rootport
+ <...>
+ dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event]
+ <...>
+ dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event]
+
+Time Based Analysis Event Usage
+-------------------------------
+
+Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
+
+ $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+
+The average RX/TX bandwidth can be calculated using the following formula:
+
+ PCIe RX Bandwidth = PCIE_RX_DATA * 16B / Measure_Time_Window
+ PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
+
+Lane Event Usage
+-------------------------------
+
+Each lane has the same event set and to avoid generating a list of hundreds
+of events, the user need to specify the lane ID explicitly, e.g.::
+
+ $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
+
+The driver does not support sampling, therefore "perf record" will not
+work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 9de64a40adab..11a80cd28a2e 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -19,5 +19,6 @@ Performance monitor support
arm_dsu_pmu
thunderx2-pmu
alibaba_pmu
+ dwc_pcie_pmu
nvidia-pmu
meson-ddr-pmu
--
2.20.1.12.g72788fdb


2023-05-22 04:12:00

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h

Move Alibaba Vendor ID (0x1ded) to linux/pci_ids.h so that it can shared by
several drivers.

Signed-off-by: Shuai Xue <[email protected]>
---
drivers/infiniband/hw/erdma/erdma_hw.h | 2 --
include/linux/pci_ids.h | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
index 76ce2856be28..ee35ebef9ee7 100644
--- a/drivers/infiniband/hw/erdma/erdma_hw.h
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -11,8 +11,6 @@
#include <linux/types.h>

/* PCIe device related definition. */
-#define PCI_VENDOR_ID_ALIBABA 0x1ded
-
#define ERDMA_PCI_WIDTH 64
#define ERDMA_FUNC_BAR 0
#define ERDMA_MISX_BAR 2
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index 95f33dadb2be..9e8aec472f06 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -2586,6 +2586,8 @@
#define PCI_VENDOR_ID_TEKRAM 0x1de1
#define PCI_DEVICE_ID_TEKRAM_DC290 0xdc29

+#define PCI_VENDOR_ID_ALIBABA 0x1ded
+
#define PCI_VENDOR_ID_TEHUTI 0x1fc9
#define PCI_DEVICE_ID_TEHUTI_3009 0x3009
#define PCI_DEVICE_ID_TEHUTI_3010 0x3010
--
2.20.1.12.g72788fdb


2023-05-22 04:13:47

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 4/4] MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

Add maintainers for Synopsys DesignWare PCIe PMU driver and driver
document.

Signed-off-by: Shuai Xue <[email protected]>
---
MAINTAINERS | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index e0ad886d3163..70271eed279d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -20478,6 +20478,12 @@ L: [email protected]
S: Maintained
F: drivers/mmc/host/dw_mmc*

+SYNOPSYS DESIGNWARE PCIE PMU DRIVER
+M: Shuai Xue <[email protected]>
+S: Supported
+F: Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+F: drivers/perf/dwc_pcie_pmu.c
+
SYNOPSYS HSDK RESET CONTROLLER DRIVER
M: Eugeniy Paltsev <[email protected]>
S: Supported
--
2.20.1.12.g72788fdb


2023-05-22 04:17:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

changes since v4:

1. addressing commens from Bjorn Helgaas:
- reorder the includes by alpha
- change all macros with upper-case hex
- change ras_des type into u16
- remove unnecessary outer "()"
- minor format changes

2. Address commensts from Jonathan Cameron:
- rewrite doc and add a example to show how to use lane event

3. fix compile error reported by: kernel test robot
- remove COMPILE_TEST and add depend on PCI in kconfig
- add Reported-by: kernel test robot <[email protected]>

Changes since v3:

1. addressing comments from Robin Murphy:
- add a prepare patch to define pci id in linux/pci_ids.h
- remove unnecessary 64BIT dependency
- fix DWC_PCIE_PER_EVENT_OFF/ON macro
- remove dwc_pcie_pmu struct and move all its fileds into dwc_pcie_rp_info
- remove unnecessary format field show
- use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
- remove unnecessary spaces and remove unnecessary cast to follow event show convention
- remove pcie_pmu_event_attr_is_visible
- fix a refcout leak on error branch when walk pci device in for_each_pci_dev
- remove bdf field from dwc_pcie_rp_info and calculate it at runtime
- finish all the checks before allocating rp_info to avoid hanging wasted memory
- remove some unused fields
- warp out control register configuration from sub function to .add()
- make function return type with a proper signature
- fix lane event count enable by clear DWC_PCIE_CNT_ENABLE field first
- pass rp_info directly to the read_*_counter helpers and in start, stop and add callbacks
- move event type validtion into .event_init()
- use is_sampling_event() to be consistent with everything else of pmu drivers
- remove unnecessary dev_err message in .event_init()
- return EINVAL instead EOPNOTSUPP for not a valid event
- finish all the checks before start modifying the event
- fix sibling event check by comparing event->pmu with sibling->pmu
- probe PMU for each rootport independently
- use .update() as .read() directly
- remove dynamically generating symbolic name of lane event
- redefine static symbolic name of lane event and leave lane filed to user
- add CPU hotplug support

2. addressing comments from Baolin:
- add a mask to avoid possible overflow

Changes since v2 addressing comments from Baolin:
- remove redundant macro definitions
- use dev_err to print error message
- change pmu_is_register to boolean
- use PLATFORM_DEVID_NONE macro
- fix module author format

Changes since v1:

1. address comments from Jonathan:
- drop marco for PMU name and VSEC version
- simplify code with PCI standard marco
- simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
- name register filed with single _ instead double
- wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
- check vendor id while matching vesc with pci_find_vsec_capability()
- remove RP_NUM_MAX and use a list to organize PMU devices for rootports
- replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
- comments on riping register together

2. address comments from Bjorn:
- rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
- rename cap_pos to ras_des
- simplify declare of device_attribute with DEVICE_ATTR_RO
- simplify code with PCI standard macro and API like pcie_get_width_cap()
- fix some code style problem and typo
- drop meaningless snaity check of container_of

3. address comments from Yicong:
- use sysfs_emit() to replace sprintf()
- simplify iteration of pci device with for_each_pci_dev
- pick preferred CPUs on a near die and add comments
- unregister PMU drivers only for failed ones
- log on behalf PMU device and give more hint
- fix some code style problem

(Thanks for all comments and they are very valuable to me)

This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
Core controller IP which provides statistics feature.

Shuai Xue (4):
docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
PCI: move Alibaba Vendor ID linux/pci_ids.h
drivers/perf: add DesignWare PCIe PMU driver
MAINTAINERS: add maintainers for DesignWare PCIe PMU driver

.../admin-guide/perf/dwc_pcie_pmu.rst | 97 +++
Documentation/admin-guide/perf/index.rst | 1 +
MAINTAINERS | 6 +
drivers/infiniband/hw/erdma/erdma_hw.h | 2 -
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 701 ++++++++++++++++++
include/linux/pci_ids.h | 2 +
8 files changed, 815 insertions(+), 2 deletions(-)
create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
create mode 100644 drivers/perf/dwc_pcie_pmu.c

--
2.20.1.12.g72788fdb


2023-05-22 04:17:20

by Shuai Xue

[permalink] [raw]
Subject: [PATCH v5 3/4] drivers/perf: add DesignWare PCIe PMU driver

This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
Core controller IP which provides statistics feature. The PMU is not a PCIe
Root Complex integrated End Point(RCiEP) device but only register counters
provided by each PCIe Root Port.

To facilitate collection of statistics the controller provides the
following two features for each Root Port:

- Time Based Analysis (RX/TX data throughput and time spent in each
low-power LTSSM state)
- Event counters (Error and Non-Error for lanes)

Note, only one counter for each type and does not overflow interrupt.

This driver adds PMU devices for each PCIe Root Port. And the PMU device is
named based the BDF of Root Port. For example,

30:03.0 PCI bridge: Device 1ded:8000 (rev 01)

the PMU device name for this Root Port is dwc_rootport_3018.

Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::

$# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/

average RX bandwidth can be calculated like this:

PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window

Signed-off-by: Shuai Xue <[email protected]>
Reported-by: kernel test robot <[email protected]>
Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
---
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/dwc_pcie_pmu.c | 701 ++++++++++++++++++++++++++++++++++++
3 files changed, 709 insertions(+)
create mode 100644 drivers/perf/dwc_pcie_pmu.c

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 711f82400086..6ff3921d7a62 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -209,6 +209,13 @@ config MARVELL_CN10K_DDR_PMU
Enable perf support for Marvell DDR Performance monitoring
event on CN10K platform.

+config DWC_PCIE_PMU
+ tristate "Enable Synopsys DesignWare PCIe PMU Support"
+ depends on (ARM64 && PCI)
+ help
+ Enable perf support for Synopsys DesignWare PCIe PMU Performance
+ monitoring event on Yitian 710 platform.
+
source "drivers/perf/arm_cspmu/Kconfig"

source "drivers/perf/amlogic/Kconfig"
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index dabc859540ce..13a6d1b286da 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -22,5 +22,6 @@ obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
obj-$(CONFIG_APPLE_M1_CPU_PMU) += apple_m1_cpu_pmu.o
obj-$(CONFIG_ALIBABA_UNCORE_DRW_PMU) += alibaba_uncore_drw_pmu.o
+obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
diff --git a/drivers/perf/dwc_pcie_pmu.c b/drivers/perf/dwc_pcie_pmu.c
new file mode 100644
index 000000000000..e4e85575ea7d
--- /dev/null
+++ b/drivers/perf/dwc_pcie_pmu.c
@@ -0,0 +1,701 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Synopsys DesignWare PCIe PMU driver
+ *
+ * Copyright (C) 2021-2023 Alibaba Inc.
+ */
+
+#include <linux/bitfield.h>
+#include <linux/bitops.h>
+#include <linux/cpuhotplug.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/perf_event.h>
+#include <linux/pci.h>
+#include <linux/platform_device.h>
+#include <linux/smp.h>
+#include <linux/sysfs.h>
+#include <linux/types.h>
+
+#define DWC_PCIE_VSEC_RAS_DES_ID 0x02
+
+#define DWC_PCIE_EVENT_CNT_CTL 0x8
+
+/*
+ * Event Counter Data Select includes two parts:
+ * - 27-24: Group number(4-bit: 0..0x7)
+ * - 23-16: Event number(8-bit: 0..0x13) within the Group
+ *
+ * Put them togother as TRM used.
+ */
+#define DWC_PCIE_CNT_EVENT_SEL GENMASK(27, 16)
+#define DWC_PCIE_CNT_LANE_SEL GENMASK(11, 8)
+#define DWC_PCIE_CNT_STATUS BIT(7)
+#define DWC_PCIE_CNT_ENABLE GENMASK(4, 2)
+#define DWC_PCIE_PER_EVENT_OFF 0x1
+#define DWC_PCIE_PER_EVENT_ON 0x3
+#define DWC_PCIE_EVENT_CLEAR GENMASK(1, 0)
+#define DWC_PCIE_EVENT_PER_CLEAR 0x1
+
+#define DWC_PCIE_EVENT_CNT_DATA 0xC
+
+#define DWC_PCIE_TIME_BASED_ANAL_CTL 0x10
+#define DWC_PCIE_TIME_BASED_REPORT_SEL GENMASK(31, 24)
+#define DWC_PCIE_TIME_BASED_DURATION_SEL GENMASK(15, 8)
+#define DWC_PCIE_DURATION_MANUAL_CTL 0x0
+#define DWC_PCIE_DURATION_1MS 0x1
+#define DWC_PCIE_DURATION_10MS 0x2
+#define DWC_PCIE_DURATION_100MS 0x3
+#define DWC_PCIE_DURATION_1S 0x4
+#define DWC_PCIE_DURATION_2S 0x5
+#define DWC_PCIE_DURATION_4S 0x6
+#define DWC_PCIE_DURATION_4US 0xFF
+#define DWC_PCIE_TIME_BASED_TIMER_START BIT(0)
+#define DWC_PCIE_TIME_BASED_CNT_ENABLE 0x1
+
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW 0x14
+#define DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH 0x18
+
+/* Event attributes */
+#define DWC_PCIE_CONFIG_EVENTID GENMASK(15, 0)
+#define DWC_PCIE_CONFIG_TYPE GENMASK(19, 16)
+#define DWC_PCIE_CONFIG_LANE GENMASK(27, 20)
+
+#define DWC_PCIE_EVENT_ID(event) FIELD_GET(DWC_PCIE_CONFIG_EVENTID, (event)->attr.config)
+#define DWC_PCIE_EVENT_TYPE(event) FIELD_GET(DWC_PCIE_CONFIG_TYPE, (event)->attr.config)
+#define DWC_PCIE_EVENT_LANE(event) FIELD_GET(DWC_PCIE_CONFIG_LANE, (event)->attr.config)
+
+enum dwc_pcie_event_type {
+ DWC_PCIE_TYPE_INVALID,
+ DWC_PCIE_TIME_BASE_EVENT,
+ DWC_PCIE_LANE_EVENT,
+};
+
+#define DWC_PCIE_LANE_EVENT_MAX_PERIOD GENMASK_ULL(31, 0)
+#define DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD GENMASK_ULL(63, 0)
+
+
+struct dwc_pcie_pmu {
+ struct pci_dev *pdev; /* Root Port device */
+ u16 ras_des; /* RAS DES capability offset */
+ u32 nr_lanes;
+
+ struct list_head pmu_node;
+ struct hlist_node cpuhp_node;
+ struct pmu pmu;
+ struct perf_event *event;
+ int oncpu;
+};
+
+struct dwc_pcie_pmu_priv {
+ struct device *dev;
+ struct list_head pmu_nodes;
+};
+
+#define to_dwc_pcie_pmu(p) (container_of(p, struct dwc_pcie_pmu, pmu))
+
+static struct platform_device *dwc_pcie_pmu_dev;
+static int dwc_pcie_pmu_hp_state;
+
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(dev_get_drvdata(dev));
+
+ return cpumap_print_to_pagebuf(true, buf, cpumask_of(pcie_pmu->oncpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *dwc_pcie_pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL
+};
+
+static struct attribute_group dwc_pcie_cpumask_attr_group = {
+ .attrs = dwc_pcie_pmu_cpumask_attrs,
+};
+
+struct dwc_pcie_format_attr {
+ struct device_attribute attr;
+ u64 field;
+ int config;
+};
+
+static ssize_t dwc_pcie_pmu_format_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct dwc_pcie_format_attr *fmt = container_of(attr, typeof(*fmt), attr);
+ int lo = __ffs(fmt->field), hi = __fls(fmt->field);
+
+ return sysfs_emit(buf, "config:%d-%d\n", lo, hi);
+}
+
+#define _dwc_pcie_format_attr(_name, _cfg, _fld) \
+ (&((struct dwc_pcie_format_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_pmu_format_show, NULL), \
+ .config = _cfg, \
+ .field = _fld, \
+ }})[0].attr.attr)
+
+#define dwc_pcie_format_attr(_name, _fld) _dwc_pcie_format_attr(_name, 0, _fld)
+
+static struct attribute *dwc_pcie_format_attrs[] = {
+ dwc_pcie_format_attr(type, DWC_PCIE_CONFIG_TYPE),
+ dwc_pcie_format_attr(eventid, DWC_PCIE_CONFIG_EVENTID),
+ dwc_pcie_format_attr(lane, DWC_PCIE_CONFIG_LANE),
+ NULL,
+};
+
+static struct attribute_group dwc_pcie_format_attrs_group = {
+ .name = "format",
+ .attrs = dwc_pcie_format_attrs,
+};
+
+struct dwc_pcie_event_attr {
+ struct device_attribute attr;
+ enum dwc_pcie_event_type type;
+ u16 eventid;
+ u8 lane;
+};
+
+static ssize_t dwc_pcie_event_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct dwc_pcie_event_attr *eattr;
+
+ eattr = container_of(attr, typeof(*eattr), attr);
+
+ if (eattr->type == DWC_PCIE_LANE_EVENT)
+ return sysfs_emit(buf, "eventid=0x%x,type=0x%x,lane=?\n",
+ eattr->eventid, eattr->type);
+
+ return sysfs_emit(buf, "eventid=0x%x,type=0x%x\n", eattr->eventid,
+ eattr->type);
+}
+
+#define DWC_PCIE_EVENT_ATTR(_name, _type, _eventid, _lane) \
+ (&((struct dwc_pcie_event_attr[]) {{ \
+ .attr = __ATTR(_name, 0444, dwc_pcie_event_show, NULL), \
+ .type = _type, \
+ .eventid = _eventid, \
+ .lane = _lane, \
+ }})[0].attr.attr)
+
+#define DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_TIME_BASE_EVENT, _eventid, 0)
+#define DWC_PCIE_PMU_LANE_EVENT_ATTR(_name, _eventid) \
+ DWC_PCIE_EVENT_ATTR(_name, DWC_PCIE_LANE_EVENT, _eventid, 0)
+
+static struct attribute *dwc_pcie_pmu_time_event_attrs[] = {
+ /* Group #0 */
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(one_cycle, 0x00),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_L0S, 0x01),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(RX_L0S, 0x02),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L0, 0x03),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1, 0x04),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_1, 0x05),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_2, 0x06),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(CFG_RCVRY, 0x07),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(TX_RX_L0S, 0x08),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(L1_AUX, 0x09),
+
+ /* Group #1 */
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_PCIe_TLP_Data_Payload, 0x20),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_PCIe_TLP_Data_Payload, 0x21),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Tx_CCIX_TLP_Data_Payload, 0x22),
+ DWC_PCIE_PMU_TIME_BASE_EVENT_ATTR(Rx_CCIX_TLP_Data_Payload, 0x23),
+
+ /*
+ * Leave it to the user to specify the lane ID to avoid generating
+ * a list of hundreds of events.
+ */
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ack_dllp, 0x600),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_update_fc_dllp, 0x601),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ack_dllp, 0x602),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_update_fc_dllp, 0x603),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_nulified_tlp, 0x604),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_nulified_tlp, 0x605),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_duplicate_tl, 0x606),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_write, 0x700),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_memory_read, 0x701),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_write, 0x702),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_configuration_read, 0x703),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_write, 0x704),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_io_read, 0x705),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_without_data, 0x706),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_completion_with_data, 0x707),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_message_tlp, 0x708),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_atomic, 0x709),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_tlp_with_prefix, 0x70A),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_write, 0x70B),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_memory_read, 0x70C),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_write, 0x70F),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_io_read, 0x710),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_without_data, 0x711),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_completion_with_data, 0x712),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_message_tlp, 0x713),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_atomic, 0x714),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_tlp_with_prefix, 0x715),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(tx_ccix_tlp, 0x716),
+ DWC_PCIE_PMU_LANE_EVENT_ATTR(rx_ccix_tlp, 0x717),
+
+ NULL
+};
+
+static const struct attribute_group dwc_pcie_event_attrs_group = {
+ .name = "events",
+ .attrs = dwc_pcie_pmu_time_event_attrs,
+};
+
+static const struct attribute_group *dwc_pcie_attr_groups[] = {
+ &dwc_pcie_event_attrs_group,
+ &dwc_pcie_format_attrs_group,
+ &dwc_pcie_cpumask_attr_group,
+ NULL
+};
+
+static void dwc_pcie_pmu_lane_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+ bool enable)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, &val);
+
+ /* Clear DWC_PCIE_CNT_ENABLE field first */
+ val &= ~DWC_PCIE_CNT_ENABLE;
+ if (enable)
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_ON);
+ else
+ val |= FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF);
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL, val);
+}
+
+static void dwc_pcie_pmu_time_based_event_enable(struct dwc_pcie_pmu *pcie_pmu,
+ bool enable)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ &val);
+
+ if (enable)
+ val |= DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ else
+ val &= ~DWC_PCIE_TIME_BASED_CNT_ENABLE;
+
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL,
+ val);
+}
+
+static u64 dwc_pcie_pmu_read_lane_event_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 val;
+
+ pci_read_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_DATA, &val);
+
+ return val;
+}
+
+static u64 dwc_pcie_pmu_read_time_based_counter(struct dwc_pcie_pmu *pcie_pmu)
+{
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ u16 ras_des = pcie_pmu->ras_des;
+ u64 count;
+ u32 val;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_HIGH, &val);
+ count = val;
+ count <<= 32;
+
+ pci_read_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_DATA_REG_LOW, &val);
+
+ count += val;
+
+ return count;
+}
+
+static void dwc_pcie_pmu_event_update(struct perf_event *event)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ u64 delta, prev, now;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ now = dwc_pcie_pmu_read_lane_event_counter(pcie_pmu);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ now = dwc_pcie_pmu_read_time_based_counter(pcie_pmu);
+
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ delta = (now - prev) & DWC_PCIE_LANE_EVENT_MAX_PERIOD;
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ delta = (now - prev) & DWC_PCIE_TIME_BASED_EVENT_MAX_PERIOD;
+
+ local64_add(delta, &event->count);
+}
+
+static int dwc_pcie_pmu_event_init(struct perf_event *event)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ struct perf_event *sibling;
+ u32 lane;
+
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* We don't support sampling */
+ if (is_sampling_event(event))
+ return -EINVAL;
+
+ /* We cannot support task bound events */
+ if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK)
+ return -EINVAL;
+
+ if (event->group_leader != event &&
+ !is_software_event(event->group_leader))
+ return -EINVAL;
+
+ for_each_sibling_event(sibling, event->group_leader) {
+ if (sibling->pmu != event->pmu && !is_software_event(sibling))
+ return -EINVAL;
+ }
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ lane = DWC_PCIE_EVENT_LANE(event);
+ if (lane < 0 || lane >= pcie_pmu->nr_lanes)
+ return -EINVAL;
+ }
+
+ event->cpu = pcie_pmu->oncpu;
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_set_period(struct hw_perf_event *hwc)
+{
+ local64_set(&hwc->prev_count, 0);
+}
+
+static void dwc_pcie_pmu_event_start(struct perf_event *event, int flags)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+
+ hwc->state = 0;
+ dwc_pcie_pmu_set_period(hwc);
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_lane_event_enable(pcie_pmu, true);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_time_based_event_enable(pcie_pmu, true);
+}
+
+static void dwc_pcie_pmu_event_stop(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (event->hw.state & PERF_HES_STOPPED)
+ return;
+
+ if (type == DWC_PCIE_LANE_EVENT)
+ dwc_pcie_pmu_lane_event_enable(pcie_pmu, false);
+ else if (type == DWC_PCIE_TIME_BASE_EVENT)
+ dwc_pcie_pmu_time_based_event_enable(pcie_pmu, false);
+
+ dwc_pcie_pmu_event_update(event);
+ hwc->state |= PERF_HES_STOPPED | PERF_HES_UPTODATE;
+}
+
+static int dwc_pcie_pmu_event_add(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+ struct pci_dev *pdev = pcie_pmu->pdev;
+ struct hw_perf_event *hwc = &event->hw;
+ enum dwc_pcie_event_type type = DWC_PCIE_EVENT_TYPE(event);
+ int event_id = DWC_PCIE_EVENT_ID(event);
+ int lane = DWC_PCIE_EVENT_LANE(event);
+ u16 ras_des = pcie_pmu->ras_des;
+ u32 ctrl;
+
+ /* Only one counter and it is in use */
+ if (pcie_pmu->event)
+ return -ENOSPC;
+
+ pcie_pmu->event = event;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (type == DWC_PCIE_LANE_EVENT) {
+ /* EVENT_COUNTER_DATA_REG needs clear manually */
+ ctrl = FIELD_PREP(DWC_PCIE_CNT_EVENT_SEL, event_id) |
+ FIELD_PREP(DWC_PCIE_CNT_LANE_SEL, lane) |
+ FIELD_PREP(DWC_PCIE_CNT_ENABLE, DWC_PCIE_PER_EVENT_OFF) |
+ FIELD_PREP(DWC_PCIE_EVENT_CLEAR, DWC_PCIE_EVENT_PER_CLEAR);
+ pci_write_config_dword(pdev, ras_des + DWC_PCIE_EVENT_CNT_CTL,
+ ctrl);
+ } else if (type == DWC_PCIE_TIME_BASE_EVENT) {
+ /*
+ * TIME_BASED_ANAL_DATA_REG is a 64 bit register, we can safely
+ * use it with any manually controlled duration. And it is
+ * cleared when next measurement starts.
+ */
+ ctrl = FIELD_PREP(DWC_PCIE_TIME_BASED_REPORT_SEL, event_id) |
+ FIELD_PREP(DWC_PCIE_TIME_BASED_DURATION_SEL,
+ DWC_PCIE_DURATION_MANUAL_CTL) |
+ DWC_PCIE_TIME_BASED_CNT_ENABLE;
+ pci_write_config_dword(
+ pdev, ras_des + DWC_PCIE_TIME_BASED_ANAL_CTL, ctrl);
+ }
+
+ if (flags & PERF_EF_START)
+ dwc_pcie_pmu_event_start(event, PERF_EF_RELOAD);
+
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_event_del(struct perf_event *event, int flags)
+{
+ struct dwc_pcie_pmu *pcie_pmu = to_dwc_pcie_pmu(event->pmu);
+
+ dwc_pcie_pmu_event_stop(event, flags | PERF_EF_UPDATE);
+ perf_event_update_userpage(event);
+ pcie_pmu->event = NULL;
+}
+
+static int __dwc_pcie_pmu_probe(struct dwc_pcie_pmu_priv *priv)
+{
+ struct pci_dev *pdev = NULL;
+ struct dwc_pcie_pmu *pcie_pmu;
+ char *name;
+ u32 bdf;
+ int ret;
+
+ INIT_LIST_HEAD(&priv->pmu_nodes);
+
+ /* Match the rootport with VSEC_RAS_DES_ID, and register a PMU for it */
+ for_each_pci_dev(pdev) {
+ u16 vsec;
+ u32 val;
+
+ if (!(pci_is_pcie(pdev) &&
+ pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT))
+ continue;
+
+ vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_ALIBABA,
+ DWC_PCIE_VSEC_RAS_DES_ID);
+ if (!vsec)
+ continue;
+
+ pci_read_config_dword(pdev, vsec + PCI_VNDR_HEADER, &val);
+ if (PCI_VNDR_HEADER_REV(val) != 0x04 ||
+ PCI_VNDR_HEADER_LEN(val) != 0x100)
+ continue;
+ pci_dbg(pdev,
+ "Detected PCIe Vendor-Specific Extended Capability RAS DES\n");
+
+ bdf = PCI_DEVID(pdev->bus->number, pdev->devfn);
+ name = devm_kasprintf(priv->dev, GFP_KERNEL, "dwc_rootport_%x",
+ bdf);
+ if (!name)
+ return -ENOMEM;
+
+ /* All checks passed, go go go */
+ pcie_pmu = devm_kzalloc(&pdev->dev, sizeof(*pcie_pmu), GFP_KERNEL);
+ if (!pcie_pmu) {
+ pci_dev_put(pdev);
+ return -ENOMEM;
+ }
+
+ pcie_pmu->pdev = pdev;
+ pcie_pmu->ras_des = vsec;
+ pcie_pmu->nr_lanes = pcie_get_width_cap(pdev);
+ pcie_pmu->pmu = (struct pmu){
+ .module = THIS_MODULE,
+ .attr_groups = dwc_pcie_attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE,
+ .task_ctx_nr = perf_invalid_context,
+ .event_init = dwc_pcie_pmu_event_init,
+ .add = dwc_pcie_pmu_event_add,
+ .del = dwc_pcie_pmu_event_del,
+ .start = dwc_pcie_pmu_event_start,
+ .stop = dwc_pcie_pmu_event_stop,
+ .read = dwc_pcie_pmu_event_update,
+ };
+
+ /* Add this instance to the list used by the offline callback */
+ ret = cpuhp_state_add_instance(dwc_pcie_pmu_hp_state,
+ &pcie_pmu->cpuhp_node);
+ if (ret) {
+ pci_err(pcie_pmu->pdev,
+ "Error %d registering hotplug @%x\n", ret, bdf);
+ return ret;
+ }
+ ret = perf_pmu_register(&pcie_pmu->pmu, name, -1);
+ if (ret) {
+ pci_err(pcie_pmu->pdev,
+ "Error %d registering PMU @%x\n", ret, bdf);
+ cpuhp_state_remove_instance_nocalls(
+ dwc_pcie_pmu_hp_state, &pcie_pmu->cpuhp_node);
+ return ret;
+ }
+
+ /* Add registered PMUs and unregister them when this driver remove */
+ list_add(&pcie_pmu->pmu_node, &priv->pmu_nodes);
+ }
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_remove(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
+ struct dwc_pcie_pmu *pcie_pmu;
+
+ list_for_each_entry(pcie_pmu, &priv->pmu_nodes, pmu_node) {
+ cpuhp_state_remove_instance(dwc_pcie_pmu_hp_state,
+ &pcie_pmu->cpuhp_node);
+ perf_pmu_unregister(&pcie_pmu->pmu);
+ }
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_probe(struct platform_device *pdev)
+{
+ struct dwc_pcie_pmu_priv *priv;
+
+ priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
+ if (!priv)
+ return -ENOMEM;
+
+ priv->dev = &pdev->dev;
+ platform_set_drvdata(pdev, priv);
+
+ /* If one PMU registration fails, remove all. */
+ if (__dwc_pcie_pmu_probe(priv))
+ dwc_pcie_pmu_remove(pdev);
+
+ return 0;
+}
+
+static void dwc_pcie_pmu_migrate(struct dwc_pcie_pmu *pcie_pmu, unsigned int cpu)
+{
+ /* This PMU does NOT support interrupt, just migrate context. */
+ perf_pmu_migrate_context(&pcie_pmu->pmu, pcie_pmu->oncpu, cpu);
+ pcie_pmu->oncpu = cpu;
+}
+
+static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct pci_dev *pdev;
+ int node;
+
+ pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+ pdev = pcie_pmu->pdev;
+ node = dev_to_node(&pdev->dev);
+
+ if (node != NUMA_NO_NODE && cpu_to_node(pcie_pmu->oncpu) != node &&
+ cpu_to_node(cpu) == node)
+ dwc_pcie_pmu_migrate(pcie_pmu, cpu);
+
+ return 0;
+}
+
+static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
+{
+ struct dwc_pcie_pmu *pcie_pmu;
+ struct pci_dev *pdev;
+ int node;
+ cpumask_t mask;
+ unsigned int target;
+
+ pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
+ if (cpu != pcie_pmu->oncpu)
+ return 0;
+
+ pdev = pcie_pmu->pdev;
+ node = dev_to_node(&pdev->dev);
+ if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
+ cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
+ target = cpumask_any(&mask);
+ else
+ target = cpumask_any_but(cpu_online_mask, cpu);
+ if (target < nr_cpu_ids)
+ dwc_pcie_pmu_migrate(pcie_pmu, target);
+
+ return 0;
+}
+
+static struct platform_driver dwc_pcie_pmu_driver = {
+ .probe = dwc_pcie_pmu_probe,
+ .remove = dwc_pcie_pmu_remove,
+ .driver = {.name = "dwc_pcie_pmu",},
+};
+
+static int __init dwc_pcie_pmu_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+ "perf/dwc_pcie_pmu:online",
+ dwc_pcie_pmu_online_cpu,
+ dwc_pcie_pmu_offline_cpu);
+ if (ret < 0)
+ return ret;
+
+ dwc_pcie_pmu_hp_state = ret;
+
+ ret = platform_driver_register(&dwc_pcie_pmu_driver);
+ if (ret) {
+ cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
+ return ret;
+ }
+
+ dwc_pcie_pmu_dev = platform_device_register_simple(
+ "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
+ if (IS_ERR(dwc_pcie_pmu_dev)) {
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+ return PTR_ERR(dwc_pcie_pmu_dev);
+ }
+
+ return 0;
+}
+
+static void __exit dwc_pcie_pmu_exit(void)
+{
+ platform_device_unregister(dwc_pcie_pmu_dev);
+ platform_driver_unregister(&dwc_pcie_pmu_driver);
+}
+
+module_init(dwc_pcie_pmu_init);
+module_exit(dwc_pcie_pmu_exit);
+
+MODULE_DESCRIPTION("PMU driver for DesignWare Cores PCI Express Controller");
+MODULE_AUTHOR("Shuai xue <[email protected]>");
+MODULE_AUTHOR("Wen Cheng <[email protected]>");
+MODULE_LICENSE("GPL v2");
--
2.20.1.12.g72788fdb


2023-05-22 14:42:13

by Jonathan Cameron

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support

On Mon, 22 May 2023 11:54:24 +0800
Shuai Xue <[email protected]> wrote:

Hi,

Very rarely a good idea to send a new patch set version in reply to an old
one. Tends to just resort in it being way off the top of people's most recent
email (depending on client of course!)

Jonathan

> changes since v4:
>
> 1. addressing commens from Bjorn Helgaas:
> - reorder the includes by alpha
> - change all macros with upper-case hex
> - change ras_des type into u16
> - remove unnecessary outer "()"
> - minor format changes
>
> 2. Address commensts from Jonathan Cameron:
> - rewrite doc and add a example to show how to use lane event
>
> 3. fix compile error reported by: kernel test robot
> - remove COMPILE_TEST and add depend on PCI in kconfig
> - add Reported-by: kernel test robot <[email protected]>
>
> Changes since v3:
>
> 1. addressing comments from Robin Murphy:
> - add a prepare patch to define pci id in linux/pci_ids.h
> - remove unnecessary 64BIT dependency
> - fix DWC_PCIE_PER_EVENT_OFF/ON macro
> - remove dwc_pcie_pmu struct and move all its fileds into dwc_pcie_rp_info
> - remove unnecessary format field show
> - use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
> - remove unnecessary spaces and remove unnecessary cast to follow event show convention
> - remove pcie_pmu_event_attr_is_visible
> - fix a refcout leak on error branch when walk pci device in for_each_pci_dev
> - remove bdf field from dwc_pcie_rp_info and calculate it at runtime
> - finish all the checks before allocating rp_info to avoid hanging wasted memory
> - remove some unused fields
> - warp out control register configuration from sub function to .add()
> - make function return type with a proper signature
> - fix lane event count enable by clear DWC_PCIE_CNT_ENABLE field first
> - pass rp_info directly to the read_*_counter helpers and in start, stop and add callbacks
> - move event type validtion into .event_init()
> - use is_sampling_event() to be consistent with everything else of pmu drivers
> - remove unnecessary dev_err message in .event_init()
> - return EINVAL instead EOPNOTSUPP for not a valid event
> - finish all the checks before start modifying the event
> - fix sibling event check by comparing event->pmu with sibling->pmu
> - probe PMU for each rootport independently
> - use .update() as .read() directly
> - remove dynamically generating symbolic name of lane event
> - redefine static symbolic name of lane event and leave lane filed to user
> - add CPU hotplug support
>
> 2. addressing comments from Baolin:
> - add a mask to avoid possible overflow
>
> Changes since v2 addressing comments from Baolin:
> - remove redundant macro definitions
> - use dev_err to print error message
> - change pmu_is_register to boolean
> - use PLATFORM_DEVID_NONE macro
> - fix module author format
>
> Changes since v1:
>
> 1. address comments from Jonathan:
> - drop marco for PMU name and VSEC version
> - simplify code with PCI standard marco
> - simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
> - name register filed with single _ instead double
> - wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
> - check vendor id while matching vesc with pci_find_vsec_capability()
> - remove RP_NUM_MAX and use a list to organize PMU devices for rootports
> - replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
> - comments on riping register together
>
> 2. address comments from Bjorn:
> - rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
> - rename cap_pos to ras_des
> - simplify declare of device_attribute with DEVICE_ATTR_RO
> - simplify code with PCI standard macro and API like pcie_get_width_cap()
> - fix some code style problem and typo
> - drop meaningless snaity check of container_of
>
> 3. address comments from Yicong:
> - use sysfs_emit() to replace sprintf()
> - simplify iteration of pci device with for_each_pci_dev
> - pick preferred CPUs on a near die and add comments
> - unregister PMU drivers only for failed ones
> - log on behalf PMU device and give more hint
> - fix some code style problem
>
> (Thanks for all comments and they are very valuable to me)
>
> This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature.
>
> Shuai Xue (4):
> docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
> PCI: move Alibaba Vendor ID linux/pci_ids.h
> drivers/perf: add DesignWare PCIe PMU driver
> MAINTAINERS: add maintainers for DesignWare PCIe PMU driver
>
> .../admin-guide/perf/dwc_pcie_pmu.rst | 97 +++
> Documentation/admin-guide/perf/index.rst | 1 +
> MAINTAINERS | 6 +
> drivers/infiniband/hw/erdma/erdma_hw.h | 2 -
> drivers/perf/Kconfig | 7 +
> drivers/perf/Makefile | 1 +
> drivers/perf/dwc_pcie_pmu.c | 701 ++++++++++++++++++
> include/linux/pci_ids.h | 2 +
> 8 files changed, 815 insertions(+), 2 deletions(-)
> create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
> create mode 100644 drivers/perf/dwc_pcie_pmu.c
>


2023-05-22 16:21:09

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h

Please follow subject line capitalization style (learn it with "git
log --oneline include/linux/pci_ids.h"):

PCI: Add Alibaba Vendor ID

On Mon, May 22, 2023 at 11:54:26AM +0800, Shuai Xue wrote:
> Move Alibaba Vendor ID (0x1ded) to linux/pci_ids.h so that it can shared by
> several drivers.

It would be helpful for reviewers to list the drivers here, since only
one is obvious from the patch.

Thanks for sorting the entry correctly!

> Signed-off-by: Shuai Xue <[email protected]>
> ---
> drivers/infiniband/hw/erdma/erdma_hw.h | 2 --
> include/linux/pci_ids.h | 2 ++
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
> index 76ce2856be28..ee35ebef9ee7 100644
> --- a/drivers/infiniband/hw/erdma/erdma_hw.h
> +++ b/drivers/infiniband/hw/erdma/erdma_hw.h
> @@ -11,8 +11,6 @@
> #include <linux/types.h>
>
> /* PCIe device related definition. */
> -#define PCI_VENDOR_ID_ALIBABA 0x1ded
> -
> #define ERDMA_PCI_WIDTH 64
> #define ERDMA_FUNC_BAR 0
> #define ERDMA_MISX_BAR 2
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 95f33dadb2be..9e8aec472f06 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2586,6 +2586,8 @@
> #define PCI_VENDOR_ID_TEKRAM 0x1de1
> #define PCI_DEVICE_ID_TEKRAM_DC290 0xdc29
>
> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
> +
> #define PCI_VENDOR_ID_TEHUTI 0x1fc9
> #define PCI_DEVICE_ID_TEHUTI_3009 0x3009
> #define PCI_DEVICE_ID_TEHUTI_3010 0x3010
> --
> 2.20.1.12.g72788fdb
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2023-05-23 03:01:24

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 0/4] drivers/perf: add Synopsys DesignWare PCIe PMU driver support



On 2023/5/22 22:28, Jonathan Cameron wrote:
> On Mon, 22 May 2023 11:54:24 +0800
> Shuai Xue <[email protected]> wrote:
>
> Hi,
>
> Very rarely a good idea to send a new patch set version in reply to an old
> one. Tends to just resort in it being way off the top of people's most recent
> email (depending on client of course!)

Got your point. I will avoid using In-Reply-To explicitly.

>
> Jonathan

Thank you.

Best Regards,
Shuai

>
>> changes since v4:
>>
>> 1. addressing commens from Bjorn Helgaas:
>> - reorder the includes by alpha
>> - change all macros with upper-case hex
>> - change ras_des type into u16
>> - remove unnecessary outer "()"
>> - minor format changes
>>
>> 2. Address commensts from Jonathan Cameron:
>> - rewrite doc and add a example to show how to use lane event
>>
>> 3. fix compile error reported by: kernel test robot
>> - remove COMPILE_TEST and add depend on PCI in kconfig
>> - add Reported-by: kernel test robot <[email protected]>
>>
>> Changes since v3:
>>
>> 1. addressing comments from Robin Murphy:
>> - add a prepare patch to define pci id in linux/pci_ids.h
>> - remove unnecessary 64BIT dependency
>> - fix DWC_PCIE_PER_EVENT_OFF/ON macro
>> - remove dwc_pcie_pmu struct and move all its fileds into dwc_pcie_rp_info
>> - remove unnecessary format field show
>> - use sysfs_emit() instead of all the assorted sprintf() and snprintf() calls.
>> - remove unnecessary spaces and remove unnecessary cast to follow event show convention
>> - remove pcie_pmu_event_attr_is_visible
>> - fix a refcout leak on error branch when walk pci device in for_each_pci_dev
>> - remove bdf field from dwc_pcie_rp_info and calculate it at runtime
>> - finish all the checks before allocating rp_info to avoid hanging wasted memory
>> - remove some unused fields
>> - warp out control register configuration from sub function to .add()
>> - make function return type with a proper signature
>> - fix lane event count enable by clear DWC_PCIE_CNT_ENABLE field first
>> - pass rp_info directly to the read_*_counter helpers and in start, stop and add callbacks
>> - move event type validtion into .event_init()
>> - use is_sampling_event() to be consistent with everything else of pmu drivers
>> - remove unnecessary dev_err message in .event_init()
>> - return EINVAL instead EOPNOTSUPP for not a valid event
>> - finish all the checks before start modifying the event
>> - fix sibling event check by comparing event->pmu with sibling->pmu
>> - probe PMU for each rootport independently
>> - use .update() as .read() directly
>> - remove dynamically generating symbolic name of lane event
>> - redefine static symbolic name of lane event and leave lane filed to user
>> - add CPU hotplug support
>>
>> 2. addressing comments from Baolin:
>> - add a mask to avoid possible overflow
>>
>> Changes since v2 addressing comments from Baolin:
>> - remove redundant macro definitions
>> - use dev_err to print error message
>> - change pmu_is_register to boolean
>> - use PLATFORM_DEVID_NONE macro
>> - fix module author format
>>
>> Changes since v1:
>>
>> 1. address comments from Jonathan:
>> - drop marco for PMU name and VSEC version
>> - simplify code with PCI standard marco
>> - simplify code with FIELD_PREP()/FIELD_GET() to replace shift marco
>> - name register filed with single _ instead double
>> - wrap dwc_pcie_pmu_{write}_dword out and drop meaningless snaity check
>> - check vendor id while matching vesc with pci_find_vsec_capability()
>> - remove RP_NUM_MAX and use a list to organize PMU devices for rootports
>> - replace DWC_PCIE_CREATE_BDF with standard PCI_DEVID
>> - comments on riping register together
>>
>> 2. address comments from Bjorn:
>> - rename DWC_PCIE_VSEC_ID to DWC_PCIE_VSEC_RAS_DES_ID
>> - rename cap_pos to ras_des
>> - simplify declare of device_attribute with DEVICE_ATTR_RO
>> - simplify code with PCI standard macro and API like pcie_get_width_cap()
>> - fix some code style problem and typo
>> - drop meaningless snaity check of container_of
>>
>> 3. address comments from Yicong:
>> - use sysfs_emit() to replace sprintf()
>> - simplify iteration of pci device with for_each_pci_dev
>> - pick preferred CPUs on a near die and add comments
>> - unregister PMU drivers only for failed ones
>> - log on behalf PMU device and give more hint
>> - fix some code style problem
>>
>> (Thanks for all comments and they are very valuable to me)
>>
>> This patchset adds the PCIe Performance Monitoring Unit (PMU) driver support
>> for T-Head Yitian 710 SoC chip. Yitian 710 is based on the Synopsys PCI Express
>> Core controller IP which provides statistics feature.
>>
>> Shuai Xue (4):
>> docs: perf: Add description for Synopsys DesignWare PCIe PMU driver
>> PCI: move Alibaba Vendor ID linux/pci_ids.h
>> drivers/perf: add DesignWare PCIe PMU driver
>> MAINTAINERS: add maintainers for DesignWare PCIe PMU driver
>>
>> .../admin-guide/perf/dwc_pcie_pmu.rst | 97 +++
>> Documentation/admin-guide/perf/index.rst | 1 +
>> MAINTAINERS | 6 +
>> drivers/infiniband/hw/erdma/erdma_hw.h | 2 -
>> drivers/perf/Kconfig | 7 +
>> drivers/perf/Makefile | 1 +
>> drivers/perf/dwc_pcie_pmu.c | 701 ++++++++++++++++++
>> include/linux/pci_ids.h | 2 +
>> 8 files changed, 815 insertions(+), 2 deletions(-)
>> create mode 100644 Documentation/admin-guide/perf/dwc_pcie_pmu.rst
>> create mode 100644 drivers/perf/dwc_pcie_pmu.c
>>

2023-05-23 03:30:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h



On 2023/5/23 00:04, Bjorn Helgaas wrote:
> Please follow subject line capitalization style (learn it with "git
> log --oneline include/linux/pci_ids.h"):
>
> PCI: Add Alibaba Vendor ID

Sorry, I will rewrite the subject.

>
> On Mon, May 22, 2023 at 11:54:26AM +0800, Shuai Xue wrote:
>> Move Alibaba Vendor ID (0x1ded) to linux/pci_ids.h so that it can shared by
>> several drivers.
>
> It would be helpful for reviewers to list the drivers here, since only
> one is obvious from the patch.

Will add it.

Then, the commit log should be:


PCI: Add Alibaba Vendor ID to linux/pci_ids.h

The Alibaba Vendor ID (0x1ded) is now only used by Alibaba elasticRDMA
adapter driver. Move the Vendor ID to linux/pci_ids.h so that it can shared
by several drivers later.

>
> Thanks for sorting the entry correctly!

Aha, you are welcome :)

Thank you for valuable comments.

Best Regards,
Shuai

>
>> Signed-off-by: Shuai Xue <[email protected]>
>> ---
>> drivers/infiniband/hw/erdma/erdma_hw.h | 2 --
>> include/linux/pci_ids.h | 2 ++
>> 2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
>> index 76ce2856be28..ee35ebef9ee7 100644
>> --- a/drivers/infiniband/hw/erdma/erdma_hw.h
>> +++ b/drivers/infiniband/hw/erdma/erdma_hw.h
>> @@ -11,8 +11,6 @@
>> #include <linux/types.h>
>>
>> /* PCIe device related definition. */
>> -#define PCI_VENDOR_ID_ALIBABA 0x1ded
>> -
>> #define ERDMA_PCI_WIDTH 64
>> #define ERDMA_FUNC_BAR 0
>> #define ERDMA_MISX_BAR 2
>> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
>> index 95f33dadb2be..9e8aec472f06 100644
>> --- a/include/linux/pci_ids.h
>> +++ b/include/linux/pci_ids.h
>> @@ -2586,6 +2586,8 @@
>> #define PCI_VENDOR_ID_TEKRAM 0x1de1
>> #define PCI_DEVICE_ID_TEKRAM_DC290 0xdc29
>>
>> +#define PCI_VENDOR_ID_ALIBABA 0x1ded
>> +
>> #define PCI_VENDOR_ID_TEHUTI 0x1fc9
>> #define PCI_DEVICE_ID_TEHUTI_3009 0x3009
>> #define PCI_DEVICE_ID_TEHUTI_3010 0x3010
>> --
>> 2.20.1.12.g72788fdb
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2023-05-23 12:16:40

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h

On Tue, May 23, 2023 at 11:22:08AM +0800, Shuai Xue wrote:

> The Alibaba Vendor ID (0x1ded) is now only used by Alibaba elasticRDMA
> adapter driver. Move the Vendor ID to linux/pci_ids.h so that it can shared
> by several drivers later.

Well, not exactly. We don't want to merge changes that might be used
by unspecified drivers later. We only want to merge things that are
needed *now*, i.e., when this complete series is merged.

In this case, I think it will be used by another driver that is part
of this series ("dwc_pcie_pmu"), so the commit log should mention both
Alibaba elasticRDMA ("erdma"?) and "dwc_pcie_pmu".

Bjorn

2023-05-23 13:10:03

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 2/4] PCI: move Alibaba Vendor ID linux/pci_ids.h



On 2023/5/23 19:54, Bjorn Helgaas wrote:
> On Tue, May 23, 2023 at 11:22:08AM +0800, Shuai Xue wrote:
>
>> The Alibaba Vendor ID (0x1ded) is now only used by Alibaba elasticRDMA
>> adapter driver. Move the Vendor ID to linux/pci_ids.h so that it can shared
>> by several drivers later.
>
> Well, not exactly. We don't want to merge changes that might be used
> by unspecified drivers later. We only want to merge things that are
> needed *now*, i.e., when this complete series is merged.
>
>
> In this case, I think it will be used by another driver that is part
> of this series ("dwc_pcie_pmu"), so the commit log should mention both
> Alibaba elasticRDMA ("erdma"?) and "dwc_pcie_pmu".
>
> Bjorn

Yes, I have noticed the policy in head of include/linux/pci_ids.h.

> * Do not add new entries to this file unless the definitions
> * are shared between multiple drivers.

Actually, I mentioned both Alibaba elasticRDMA ("erdma") and PCIe PMU
"dwc_pcie_pmu" in initial draft. But I realized that dwc_pcie_pmu
is still in review, so I dropped it finally. :(

Anyway, I will add it back. Hope you are happy the bellow changes:

PCI: Add Alibaba Vendor ID to linux/pci_ids.h

The Alibaba Vendor ID (0x1ded) is now used by Alibaba elasticRDMA ("erdma") and
will be shared with the upcoming PCIe PMU ("dwc_pcie_pmu"). Move the Vendor ID
to linux/pci_ids.h so that it can shared by several drivers later.

Thank you.

Best Regards,
Shuai

2023-05-29 04:21:30

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver



On 5/22/2023 11:54 AM, Shuai Xue wrote:
> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
> controller which implements which implements PMU for performance and
> functional debugging to facilitate system maintenance.
>
> Document it to provide guidance on how to use it.
>
> Signed-off-by: Shuai Xue <[email protected]>

LGTM. Feel free to add:
Reviewed-by: Baolin Wang <[email protected]>

2023-05-29 06:15:01

by Baolin Wang

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] drivers/perf: add DesignWare PCIe PMU driver



On 5/22/2023 11:54 AM, Shuai Xue wrote:
> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
> Core controller IP which provides statistics feature. The PMU is not a PCIe
> Root Complex integrated End Point(RCiEP) device but only register counters
> provided by each PCIe Root Port.
>
> To facilitate collection of statistics the controller provides the
> following two features for each Root Port:
>
> - Time Based Analysis (RX/TX data throughput and time spent in each
> low-power LTSSM state)
> - Event counters (Error and Non-Error for lanes)
>
> Note, only one counter for each type and does not overflow interrupt.
>
> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
> named based the BDF of Root Port. For example,
>
> 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>
> the PMU device name for this Root Port is dwc_rootport_3018.
>
> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>
> $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>
> average RX bandwidth can be calculated like this:
>
> PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>
> Signed-off-by: Shuai Xue <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
> ---

[snip]

> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
> +{
> + struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
> + struct dwc_pcie_pmu *pcie_pmu;
> +
> + list_for_each_entry(pcie_pmu, &priv->pmu_nodes, pmu_node) {
> + cpuhp_state_remove_instance(dwc_pcie_pmu_hp_state,
> + &pcie_pmu->cpuhp_node);
> + perf_pmu_unregister(&pcie_pmu->pmu);
> + }
> +
> + return 0;
> +}
> +
> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
> +{
> + struct dwc_pcie_pmu_priv *priv;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + priv->dev = &pdev->dev;
> + platform_set_drvdata(pdev, priv);
> +
> + /* If one PMU registration fails, remove all. */
> + if (__dwc_pcie_pmu_probe(priv))
> + dwc_pcie_pmu_remove(pdev);

In this case, you should return error from __dwc_pcie_pmu_probe()
instead of returning 0, to release the requested resources of the PMU
deivce.

> +
> + return 0;
> +}
> +
> +static void dwc_pcie_pmu_migrate(struct dwc_pcie_pmu *pcie_pmu, unsigned int cpu)
> +{
> + /* This PMU does NOT support interrupt, just migrate context. */
> + perf_pmu_migrate_context(&pcie_pmu->pmu, pcie_pmu->oncpu, cpu);
> + pcie_pmu->oncpu = cpu;
> +}
> +
> +static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
> +{
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct pci_dev *pdev;
> + int node;
> +
> + pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
> + pdev = pcie_pmu->pdev;
> + node = dev_to_node(&pdev->dev);
> +
> + if (node != NUMA_NO_NODE && cpu_to_node(pcie_pmu->oncpu) != node &&
> + cpu_to_node(cpu) == node)
> + dwc_pcie_pmu_migrate(pcie_pmu, cpu);
> +
> + return 0;
> +}
> +
> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
> +{
> + struct dwc_pcie_pmu *pcie_pmu;
> + struct pci_dev *pdev;
> + int node;
> + cpumask_t mask;
> + unsigned int target;
> +
> + pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
> + if (cpu != pcie_pmu->oncpu)
> + return 0;
> +
> + pdev = pcie_pmu->pdev;
> + node = dev_to_node(&pdev->dev);
> + if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
> + cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
> + target = cpumask_any(&mask);
> + else
> + target = cpumask_any_but(cpu_online_mask, cpu);
> + if (target < nr_cpu_ids)
> + dwc_pcie_pmu_migrate(pcie_pmu, target);
> +
> + return 0;
> +}
> +
> +static struct platform_driver dwc_pcie_pmu_driver = {
> + .probe = dwc_pcie_pmu_probe,
> + .remove = dwc_pcie_pmu_remove,
> + .driver = {.name = "dwc_pcie_pmu",},
> +};
> +
> +static int __init dwc_pcie_pmu_init(void)
> +{
> + int ret;
> +
> + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
> + "perf/dwc_pcie_pmu:online",
> + dwc_pcie_pmu_online_cpu,
> + dwc_pcie_pmu_offline_cpu);
> + if (ret < 0)
> + return ret;
> +
> + dwc_pcie_pmu_hp_state = ret;
> +
> + ret = platform_driver_register(&dwc_pcie_pmu_driver);
> + if (ret) {
> + cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
> + return ret;
> + }
> +
> + dwc_pcie_pmu_dev = platform_device_register_simple(
> + "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
> + if (IS_ERR(dwc_pcie_pmu_dev)) {
> + platform_driver_unregister(&dwc_pcie_pmu_driver);
> + return PTR_ERR(dwc_pcie_pmu_dev);
> + }
> +
> + return 0;
> +}
> +
> +static void __exit dwc_pcie_pmu_exit(void)
> +{
> + platform_device_unregister(dwc_pcie_pmu_dev);
> + platform_driver_unregister(&dwc_pcie_pmu_driver);

You should also call 'cpuhp_remove_multi_state()' when exiting the driver.

With above issues fixed, you can add:
Reviewed-by: Baolin Wang <[email protected]>

2023-05-29 06:52:50

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 3/4] drivers/perf: add DesignWare PCIe PMU driver



On 2023/5/29 14:13, Baolin Wang wrote:
>
>
> On 5/22/2023 11:54 AM, Shuai Xue wrote:
>> This commit adds the PCIe Performance Monitoring Unit (PMU) driver support
>> for T-Head Yitian SoC chip. Yitian is based on the Synopsys PCI Express
>> Core controller IP which provides statistics feature. The PMU is not a PCIe
>> Root Complex integrated End Point(RCiEP) device but only register counters
>> provided by each PCIe Root Port.
>>
>> To facilitate collection of statistics the controller provides the
>> following two features for each Root Port:
>>
>> - Time Based Analysis (RX/TX data throughput and time spent in each
>>    low-power LTSSM state)
>> - Event counters (Error and Non-Error for lanes)
>>
>> Note, only one counter for each type and does not overflow interrupt.
>>
>> This driver adds PMU devices for each PCIe Root Port. And the PMU device is
>> named based the BDF of Root Port. For example,
>>
>>      30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
>>
>> the PMU device name for this Root Port is dwc_rootport_3018.
>>
>> Example usage of counting PCIe RX TLP data payload (Units of 16 bytes)::
>>
>>      $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
>>
>> average RX bandwidth can be calculated like this:
>>
>>      PCIe TX Bandwidth = PCIE_TX_DATA * 16B / Measure_Time_Window
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>> Reported-by: kernel test robot <[email protected]>
>> Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>> ---
>
> [snip]
>
>> +static int dwc_pcie_pmu_remove(struct platform_device *pdev)
>> +{
>> +    struct dwc_pcie_pmu_priv *priv = platform_get_drvdata(pdev);
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +
>> +    list_for_each_entry(pcie_pmu, &priv->pmu_nodes, pmu_node) {
>> +        cpuhp_state_remove_instance(dwc_pcie_pmu_hp_state,
>> +                        &pcie_pmu->cpuhp_node);
>> +        perf_pmu_unregister(&pcie_pmu->pmu);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int dwc_pcie_pmu_probe(struct platform_device *pdev)
>> +{
>> +    struct dwc_pcie_pmu_priv *priv;
>> +
>> +    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
>> +    if (!priv)
>> +        return -ENOMEM;
>> +
>> +    priv->dev = &pdev->dev;
>> +    platform_set_drvdata(pdev, priv);
>> +
>> +    /* If one PMU registration fails, remove all. */
>> +    if (__dwc_pcie_pmu_probe(priv))
>> +        dwc_pcie_pmu_remove(pdev);
>
> In this case, you should return error from __dwc_pcie_pmu_probe() instead of returning 0, to release the requested resources of the PMU deivce.

You are right, will fix it in next version.

>
>> +
>> +    return 0;
>> +}
>> +
>> +static void dwc_pcie_pmu_migrate(struct dwc_pcie_pmu *pcie_pmu, unsigned int cpu)
>> +{
>> +    /* This PMU does NOT support interrupt, just migrate context. */
>> +    perf_pmu_migrate_context(&pcie_pmu->pmu, pcie_pmu->oncpu, cpu);
>> +    pcie_pmu->oncpu = cpu;
>> +}
>> +
>> +static int dwc_pcie_pmu_online_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct pci_dev *pdev;
>> +    int node;
>> +
>> +    pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>> +    pdev = pcie_pmu->pdev;
>> +    node = dev_to_node(&pdev->dev);
>> +
>> +    if (node != NUMA_NO_NODE && cpu_to_node(pcie_pmu->oncpu) != node &&
>> +        cpu_to_node(cpu) == node)
>> +        dwc_pcie_pmu_migrate(pcie_pmu, cpu);
>> +
>> +    return 0;
>> +}
>> +
>> +static int dwc_pcie_pmu_offline_cpu(unsigned int cpu, struct hlist_node *cpuhp_node)
>> +{
>> +    struct dwc_pcie_pmu *pcie_pmu;
>> +    struct pci_dev *pdev;
>> +    int node;
>> +    cpumask_t mask;
>> +    unsigned int target;
>> +
>> +    pcie_pmu = hlist_entry_safe(cpuhp_node, struct dwc_pcie_pmu, cpuhp_node);
>> +    if (cpu != pcie_pmu->oncpu)
>> +        return 0;
>> +
>> +    pdev = pcie_pmu->pdev;
>> +    node = dev_to_node(&pdev->dev);
>> +    if (cpumask_and(&mask, cpumask_of_node(node), cpu_online_mask) &&
>> +        cpumask_andnot(&mask, &mask, cpumask_of(cpu)))
>> +        target = cpumask_any(&mask);
>> +    else
>> +        target = cpumask_any_but(cpu_online_mask, cpu);
>> +    if (target < nr_cpu_ids)
>> +        dwc_pcie_pmu_migrate(pcie_pmu, target);
>> +
>> +    return 0;
>> +}
>> +
>> +static struct platform_driver dwc_pcie_pmu_driver = {
>> +    .probe = dwc_pcie_pmu_probe,
>> +    .remove = dwc_pcie_pmu_remove,
>> +    .driver = {.name = "dwc_pcie_pmu",},
>> +};
>> +
>> +static int __init dwc_pcie_pmu_init(void)
>> +{
>> +    int ret;
>> +
>> +    ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
>> +                      "perf/dwc_pcie_pmu:online",
>> +                      dwc_pcie_pmu_online_cpu,
>> +                      dwc_pcie_pmu_offline_cpu);
>> +    if (ret < 0)
>> +        return ret;
>> +
>> +    dwc_pcie_pmu_hp_state = ret;
>> +
>> +    ret = platform_driver_register(&dwc_pcie_pmu_driver);
>> +    if (ret) {
>> +        cpuhp_remove_multi_state(dwc_pcie_pmu_hp_state);
>> +        return ret;
>> +    }
>> +
>> +    dwc_pcie_pmu_dev = platform_device_register_simple(
>> +                "dwc_pcie_pmu", PLATFORM_DEVID_NONE, NULL, 0);
>> +    if (IS_ERR(dwc_pcie_pmu_dev)) {
>> +        platform_driver_unregister(&dwc_pcie_pmu_driver);
>> +        return PTR_ERR(dwc_pcie_pmu_dev);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void __exit dwc_pcie_pmu_exit(void)
>> +{
>> +    platform_device_unregister(dwc_pcie_pmu_dev);
>> +    platform_driver_unregister(&dwc_pcie_pmu_driver);
>
> You should also call 'cpuhp_remove_multi_state()' when exiting the driver.

Good catch, will add it in next version.


>
> With above issues fixed, you can add:
> Reviewed-by: Baolin Wang <[email protected]>

Thank you :)

Best Regards,
Shuai

2023-05-29 07:04:03

by Shuai Xue

[permalink] [raw]
Subject: Re: [PATCH v5 1/4] docs: perf: Add description for Synopsys DesignWare PCIe PMU driver



On 2023/5/29 11:45, Baolin Wang wrote:
>
>
> On 5/22/2023 11:54 AM, Shuai Xue wrote:
>> Alibaba's T-Head Yitan 710 SoC includes Synopsys' DesignWare Core PCIe
>> controller which implements which implements PMU for performance and
>> functional debugging to facilitate system maintenance.
>>
>> Document it to provide guidance on how to use it.
>>
>> Signed-off-by: Shuai Xue <[email protected]>
>
> LGTM. Feel free to add:
> Reviewed-by: Baolin Wang <[email protected]>

Thank you :)

Best Regards,
Shuai