2024-01-31 14:24:37

by Zhang, Rui

Subject: [PATCH 0/5] intel_rapl & perf rapl: combine PMU support

This patch series is based on the patch series posted at
https://lore.kernel.org/all/[email protected]/

Problem statement
-----------------
MSR RAPL powercap sysfs is done in drivers/powercap/intel_rapl_msr.c.
MSR RAPL PMU is done in arch/x86/events/rapl.c.

They maintain two separate CPU model lists, describing the same feature
available on the same set of hardware. This adds significant and
unnecessary maintenance burden.

Now we need to introduce TPMI RAPL PMU support, which again shares most
of the logic with MSR RAPL PMU.

Solution
--------
Introduce PMU support as part of the RAPL framework and remove the
current MSR RAPL PMU code.

The idea is that, if a RAPL Package device is registered to the RAPL
framework, and is ready for energy reporting and control via powercap
sysfs, then it is also ready for PMU.

So introduce PMU support in the RAPL framework that works for all
registered RAPL Package devices. With this, the current MSR RAPL PMU
code can be removed completely.
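
Each interface driver opts in explicitly via a new flag. A minimal,
hypothetical sketch of the hook (the real one-line hunks are in patches
3 and 5; the function name here is illustrative only):

	/* Illustrative sketch: an interface driver opts into the shared PMU */
	static void example_rapl_if_init(struct rapl_if_priv *priv)
	{
		priv->enable_pmu = true;	/* new flag added by this series */
	}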

Given that the MSR RAPL and TPMI RAPL drivers won't function on the same
platform, the new RAPL PMU can be fully compatible with the current MSR
RAPL PMU, including using the same PMU name and events name/id/unit/scale.

For example, on platforms that use either MSR or TPMI, the same command
 perf stat -e power/energy-pkg/ -e power/energy-ram/ -e power/energy-cores/ FOO
can be used to get the energy consumption, as long as the events show up
in the "perf list" output.
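
Note that perf stat applies the advertised event scale by itself; a tool
that reads raw counts directly (e.g. via perf_event_open()) must apply
the fixed 2^-32 Joules scale on its own. An illustrative userspace
helper, not part of this series:

	#include <math.h>

	/* raw counts are in 2^-32 Joule units, see the energy-*.scale attributes */
	static double rapl_counts_to_joules(unsigned long long raw)
	{
		return ldexp((double)raw, -32);
	}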

Notes
-----
There are indeed some functional changes introduced, due to the
divergence between the two CPU model lists. These include:
1. Fix BROADWELL_D in the intel_rapl driver to use a fixed Dram domain
energy unit.
2. Enable PMU for some Intel platforms which were missing from
arch/x86/events/rapl.c. This includes:
ICELAKE_NNPI
ROCKETLAKE
LUNARLAKE_M
LAKEFIELD
ATOM_SILVERMONT
ATOM_SILVERMONT_MID
ATOM_AIRMONT
ATOM_AIRMONT_MID
ATOM_TREMONT
ATOM_TREMONT_D
ATOM_TREMONT_L
3. Change the logic for enumerating AMD/HYGON platforms
Previously, it was
X86_MATCH_FEATURE(X86_FEATURE_RAPL, &model_amd_hygon)
And now it is
X86_MATCH_VENDOR_FAM(AMD, 0x17, &rapl_defaults_amd)
X86_MATCH_VENDOR_FAM(AMD, 0x19, &rapl_defaults_amd)
X86_MATCH_VENDOR_FAM(HYGON, 0x18, &rapl_defaults_amd)

Any comments/concerns are welcome.

thanks,
rui

----------------------------------------------------------------
Zhang Rui (5):
powercap: intel_rapl: Sort header files
powercap: intel_rapl: Add PMU support
powercap: intel_rapl_tpmi: Enable PMU support for TPMI RAPL
powercap: intel_rapl: Fix BROADWELL_D
powercap: intel_rapl_msr: Enable PMU support for MSR RAPL

arch/x86/events/Kconfig | 8 -
arch/x86/events/Makefile | 1 -
arch/x86/events/rapl.c | 871 -----------------------------------
drivers/powercap/intel_rapl_common.c | 562 +++++++++++++++++++++-
drivers/powercap/intel_rapl_msr.c | 2 +
drivers/powercap/intel_rapl_tpmi.c | 1 +
include/linux/intel_rapl.h | 17 +
7 files changed, 569 insertions(+), 893 deletions(-)
delete mode 100644 arch/x86/events/rapl.c


2024-01-31 14:24:42

by Zhang, Rui

Subject: [PATCH 3/5] powercap: intel_rapl_tpmi: Enable PMU support for TPMI RAPL

Enable RAPL PMU support for the TPMI RAPL driver.

Signed-off-by: Zhang Rui <[email protected]>
---
drivers/powercap/intel_rapl_tpmi.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/drivers/powercap/intel_rapl_tpmi.c b/drivers/powercap/intel_rapl_tpmi.c
index f6b7f085977c..f131a086c535 100644
--- a/drivers/powercap/intel_rapl_tpmi.c
+++ b/drivers/powercap/intel_rapl_tpmi.c
@@ -285,6 +285,7 @@ static int intel_rapl_tpmi_probe(struct auxiliary_device *auxdev,
trp->priv.read_raw = tpmi_rapl_read_raw;
trp->priv.write_raw = tpmi_rapl_write_raw;
trp->priv.control_type = tpmi_control_type;
+ trp->priv.enable_pmu = true;

/* RAPL TPMI I/F is per physical package */
trp->rp = rapl_find_package_domain(info->package_id, &trp->priv, false);
--
2.34.1


2024-01-31 14:25:21

by Zhang, Rui

Subject: [PATCH 4/5] powercap: intel_rapl: Fix BROADWELL_D Dram energy unit

BROADWELL_D uses a fixed energy unit for the Dram domain, like the other
Xeon servers, e.g. BROADWELL_X.
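
With the hsw_server defaults, the Dram energy unit is fixed at 2^-16
Joules (~15.3 uJ) instead of the value read from MSR_RAPL_POWER_UNIT,
matching the RAPL_UNIT_QUIRK_INTEL_HSW handling in arch/x86/events/rapl.c.
An illustrative conversion with the fixed unit (sketch only, not part of
the patch):

	/* Dram energy counter -> microjoules, with the fixed 2^-16 J unit */
	static u64 dram_counts_to_uj(u64 counts)
	{
		return (counts * 1000000) >> 16;
	}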

Signed-off-by: Zhang Rui <[email protected]>
---
drivers/powercap/intel_rapl_common.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c
index 318174c87249..094e226b8bd3 100644
--- a/drivers/powercap/intel_rapl_common.c
+++ b/drivers/powercap/intel_rapl_common.c
@@ -1235,7 +1235,7 @@ static const struct x86_cpu_id rapl_ids[] __initconst = {

X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &rapl_defaults_core),
X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &rapl_defaults_core),
- X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &rapl_defaults_core),
+ X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &rapl_defaults_hsw_server),
X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &rapl_defaults_hsw_server),

X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &rapl_defaults_core),
--
2.34.1


2024-01-31 14:25:36

by Zhang, Rui

Subject: [PATCH 5/5] powercap: intel_rapl_msr: Enable PMU support for MSR RAPL

Enable PMU support in the intel_rapl_msr driver and remove the previous
MSR RAPL PMU support in arch/x86/events/rapl.c.

Note that there is some divergence between the CPU model lists of
these two drivers; switching from the arch/x86/events/rapl.c CPU model
list to the intel_rapl driver CPU model list indeed brings some
functional changes, including:
1. Enable PMU for some Intel platforms which were missing from
arch/x86/events/rapl.c. This includes:
ICELAKE_NNPI
ROCKETLAKE
LUNARLAKE_M
LAKEFIELD
ATOM_SILVERMONT
ATOM_SILVERMONT_MID
ATOM_AIRMONT
ATOM_AIRMONT_MID
ATOM_TREMONT
ATOM_TREMONT_D
ATOM_TREMONT_L
2. Change the logic for enumerating AMD/HYGON platforms
Previously, it was
X86_MATCH_FEATURE(X86_FEATURE_RAPL, &model_amd_hygon)
And now it is
X86_MATCH_VENDOR_FAM(AMD, 0x17, &rapl_defaults_amd)
X86_MATCH_VENDOR_FAM(AMD, 0x19, &rapl_defaults_amd)
X86_MATCH_VENDOR_FAM(HYGON, 0x18, &rapl_defaults_amd)

Signed-off-by: Zhang Rui <[email protected]>
---
Given that both CPU model lists describe the same feature available on
the same set of hardware, any bug exposed during this transition
suggests a pre-existing gap in one of the two implementations.
So IMO, the risk is acceptable.
---
arch/x86/events/Kconfig | 8 -
arch/x86/events/Makefile | 1 -
arch/x86/events/rapl.c | 871 ------------------------------
drivers/powercap/intel_rapl_msr.c | 2 +
4 files changed, 2 insertions(+), 880 deletions(-)
delete mode 100644 arch/x86/events/rapl.c

diff --git a/arch/x86/events/Kconfig b/arch/x86/events/Kconfig
index dabdf3d7bf84..21f7ea817dcf 100644
--- a/arch/x86/events/Kconfig
+++ b/arch/x86/events/Kconfig
@@ -9,14 +9,6 @@ config PERF_EVENTS_INTEL_UNCORE
Include support for Intel uncore performance events. These are
available on NehalemEX and more modern processors.

-config PERF_EVENTS_INTEL_RAPL
- tristate "Intel/AMD rapl performance events"
- depends on PERF_EVENTS && (CPU_SUP_INTEL || CPU_SUP_AMD) && PCI
- default y
- help
- Include support for Intel and AMD rapl performance events for power
- monitoring on modern processors.
-
config PERF_EVENTS_INTEL_CSTATE
tristate "Intel cstate performance events"
depends on PERF_EVENTS && CPU_SUP_INTEL && PCI
diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile
index 86a76efa8bb6..b9d51fa83b06 100644
--- a/arch/x86/events/Makefile
+++ b/arch/x86/events/Makefile
@@ -1,6 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y += core.o probe.o utils.o
-obj-$(CONFIG_PERF_EVENTS_INTEL_RAPL) += rapl.o
obj-y += amd/
obj-$(CONFIG_X86_LOCAL_APIC) += msr.o
obj-$(CONFIG_CPU_SUP_INTEL) += intel/
diff --git a/arch/x86/events/rapl.c b/arch/x86/events/rapl.c
deleted file mode 100644
index 8d98d468b976..000000000000
--- a/arch/x86/events/rapl.c
+++ /dev/null
@@ -1,871 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Support Intel/AMD RAPL energy consumption counters
- * Copyright (C) 2013 Google, Inc., Stephane Eranian
- *
- * Intel RAPL interface is specified in the IA-32 Manual Vol3b
- * section 14.7.1 (September 2013)
- *
- * AMD RAPL interface for Fam17h is described in the public PPR:
- * https://bugzilla.kernel.org/show_bug.cgi?id=206537
- *
- * RAPL provides more controls than just reporting energy consumption
- * however here we only expose the 3 energy consumption free running
- * counters (pp0, pkg, dram).
- *
- * Each of those counters increments in a power unit defined by the
- * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
- * but it can vary.
- *
- * Counter to rapl events mappings:
- *
- * pp0 counter: consumption of all physical cores (power plane 0)
- * event: rapl_energy_cores
- * perf code: 0x1
- *
- * pkg counter: consumption of the whole processor package
- * event: rapl_energy_pkg
- * perf code: 0x2
- *
- * dram counter: consumption of the dram domain (servers only)
- * event: rapl_energy_dram
- * perf code: 0x3
- *
- * gpu counter: consumption of the builtin-gpu domain (client only)
- * event: rapl_energy_gpu
- * perf code: 0x4
- *
- * psys counter: consumption of the builtin-psys domain (client only)
- * event: rapl_energy_psys
- * perf code: 0x5
- *
- * We manage those counters as free running (read-only). They may be
- * use simultaneously by other tools, such as turbostat.
- *
- * The events only support system-wide mode counting. There is no
- * sampling support because it does not make sense and is not
- * supported by the RAPL hardware.
- *
- * Because we want to avoid floating-point operations in the kernel,
- * the events are all reported in fixed point arithmetic (32.32).
- * Tools must adjust the counts to convert them to Watts using
- * the duration of the measurement. Tools may use a function such as
- * ldexp(raw_count, -32);
- */
-
-#define pr_fmt(fmt) "RAPL PMU: " fmt
-
-#include <linux/module.h>
-#include <linux/slab.h>
-#include <linux/perf_event.h>
-#include <linux/nospec.h>
-#include <asm/cpu_device_id.h>
-#include <asm/intel-family.h>
-#include "perf_event.h"
-#include "probe.h"
-
-MODULE_LICENSE("GPL");
-
-/*
- * RAPL energy status counters
- */
-enum perf_rapl_events {
- PERF_RAPL_PP0 = 0, /* all cores */
- PERF_RAPL_PKG, /* entire package */
- PERF_RAPL_RAM, /* DRAM */
- PERF_RAPL_PP1, /* gpu */
- PERF_RAPL_PSYS, /* psys */
-
- PERF_RAPL_MAX,
- NR_RAPL_DOMAINS = PERF_RAPL_MAX,
-};
-
-static const char *const rapl_domain_names[NR_RAPL_DOMAINS] __initconst = {
- "pp0-core",
- "package",
- "dram",
- "pp1-gpu",
- "psys",
-};
-
-/*
- * event code: LSB 8 bits, passed in attr->config
- * any other bit is reserved
- */
-#define RAPL_EVENT_MASK 0xFFULL
-#define RAPL_CNTR_WIDTH 32
-
-#define RAPL_EVENT_ATTR_STR(_name, v, str) \
-static struct perf_pmu_events_attr event_attr_##v = { \
- .attr = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), \
- .id = 0, \
- .event_str = str, \
-};
-
-struct rapl_pmu {
- raw_spinlock_t lock;
- int n_active;
- int cpu;
- struct list_head active_list;
- struct pmu *pmu;
- ktime_t timer_interval;
- struct hrtimer hrtimer;
-};
-
-struct rapl_pmus {
- struct pmu pmu;
- unsigned int maxdie;
- struct rapl_pmu *pmus[] __counted_by(maxdie);
-};
-
-enum rapl_unit_quirk {
- RAPL_UNIT_QUIRK_NONE,
- RAPL_UNIT_QUIRK_INTEL_HSW,
- RAPL_UNIT_QUIRK_INTEL_SPR,
-};
-
-struct rapl_model {
- struct perf_msr *rapl_msrs;
- unsigned long events;
- unsigned int msr_power_unit;
- enum rapl_unit_quirk unit_quirk;
-};
-
- /* 1/2^hw_unit Joule */
-static int rapl_hw_unit[NR_RAPL_DOMAINS] __read_mostly;
-static struct rapl_pmus *rapl_pmus;
-static cpumask_t rapl_cpu_mask;
-static unsigned int rapl_cntr_mask;
-static u64 rapl_timer_ms;
-static struct perf_msr *rapl_msrs;
-
-static inline struct rapl_pmu *cpu_to_rapl_pmu(unsigned int cpu)
-{
- unsigned int dieid = topology_logical_die_id(cpu);
-
- /*
- * The unsigned check also catches the '-1' return value for non
- * existent mappings in the topology map.
- */
- return dieid < rapl_pmus->maxdie ? rapl_pmus->pmus[dieid] : NULL;
-}
-
-static inline u64 rapl_read_counter(struct perf_event *event)
-{
- u64 raw;
- rdmsrl(event->hw.event_base, raw);
- return raw;
-}
-
-static inline u64 rapl_scale(u64 v, int cfg)
-{
- if (cfg > NR_RAPL_DOMAINS) {
- pr_warn("Invalid domain %d, failed to scale data\n", cfg);
- return v;
- }
- /*
- * scale delta to smallest unit (1/2^32)
- * users must then scale back: count * 1/(1e9*2^32) to get Joules
- * or use ldexp(count, -32).
- * Watts = Joules/Time delta
- */
- return v << (32 - rapl_hw_unit[cfg - 1]);
-}
-
-static u64 rapl_event_update(struct perf_event *event)
-{
- struct hw_perf_event *hwc = &event->hw;
- u64 prev_raw_count, new_raw_count;
- s64 delta, sdelta;
- int shift = RAPL_CNTR_WIDTH;
-
- prev_raw_count = local64_read(&hwc->prev_count);
- do {
- rdmsrl(event->hw.event_base, new_raw_count);
- } while (!local64_try_cmpxchg(&hwc->prev_count,
- &prev_raw_count, new_raw_count));
-
- /*
- * Now we have the new raw value and have updated the prev
- * timestamp already. We can now calculate the elapsed delta
- * (event-)time and add that to the generic event.
- *
- * Careful, not all hw sign-extends above the physical width
- * of the count.
- */
- delta = (new_raw_count << shift) - (prev_raw_count << shift);
- delta >>= shift;
-
- sdelta = rapl_scale(delta, event->hw.config);
-
- local64_add(sdelta, &event->count);
-
- return new_raw_count;
-}
-
-static void rapl_start_hrtimer(struct rapl_pmu *pmu)
-{
- hrtimer_start(&pmu->hrtimer, pmu->timer_interval,
- HRTIMER_MODE_REL_PINNED);
-}
-
-static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
-{
- struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
- struct perf_event *event;
- unsigned long flags;
-
- if (!pmu->n_active)
- return HRTIMER_NORESTART;
-
- raw_spin_lock_irqsave(&pmu->lock, flags);
-
- list_for_each_entry(event, &pmu->active_list, active_entry)
- rapl_event_update(event);
-
- raw_spin_unlock_irqrestore(&pmu->lock, flags);
-
- hrtimer_forward_now(hrtimer, pmu->timer_interval);
-
- return HRTIMER_RESTART;
-}
-
-static void rapl_hrtimer_init(struct rapl_pmu *pmu)
-{
- struct hrtimer *hr = &pmu->hrtimer;
-
- hrtimer_init(hr, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
- hr->function = rapl_hrtimer_handle;
-}
-
-static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
- struct perf_event *event)
-{
- if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
- return;
-
- event->hw.state = 0;
-
- list_add_tail(&event->active_entry, &pmu->active_list);
-
- local64_set(&event->hw.prev_count, rapl_read_counter(event));
-
- pmu->n_active++;
- if (pmu->n_active == 1)
- rapl_start_hrtimer(pmu);
-}
-
-static void rapl_pmu_event_start(struct perf_event *event, int mode)
-{
- struct rapl_pmu *pmu = event->pmu_private;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&pmu->lock, flags);
- __rapl_pmu_event_start(pmu, event);
- raw_spin_unlock_irqrestore(&pmu->lock, flags);
-}
-
-static void rapl_pmu_event_stop(struct perf_event *event, int mode)
-{
- struct rapl_pmu *pmu = event->pmu_private;
- struct hw_perf_event *hwc = &event->hw;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&pmu->lock, flags);
-
- /* mark event as deactivated and stopped */
- if (!(hwc->state & PERF_HES_STOPPED)) {
- WARN_ON_ONCE(pmu->n_active <= 0);
- pmu->n_active--;
- if (pmu->n_active == 0)
- hrtimer_cancel(&pmu->hrtimer);
-
- list_del(&event->active_entry);
-
- WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
- hwc->state |= PERF_HES_STOPPED;
- }
-
- /* check if update of sw counter is necessary */
- if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
- /*
- * Drain the remaining delta count out of a event
- * that we are disabling:
- */
- rapl_event_update(event);
- hwc->state |= PERF_HES_UPTODATE;
- }
-
- raw_spin_unlock_irqrestore(&pmu->lock, flags);
-}
-
-static int rapl_pmu_event_add(struct perf_event *event, int mode)
-{
- struct rapl_pmu *pmu = event->pmu_private;
- struct hw_perf_event *hwc = &event->hw;
- unsigned long flags;
-
- raw_spin_lock_irqsave(&pmu->lock, flags);
-
- hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
-
- if (mode & PERF_EF_START)
- __rapl_pmu_event_start(pmu, event);
-
- raw_spin_unlock_irqrestore(&pmu->lock, flags);
-
- return 0;
-}
-
-static void rapl_pmu_event_del(struct perf_event *event, int flags)
-{
- rapl_pmu_event_stop(event, PERF_EF_UPDATE);
-}
-
-static int rapl_pmu_event_init(struct perf_event *event)
-{
- u64 cfg = event->attr.config & RAPL_EVENT_MASK;
- int bit, ret = 0;
- struct rapl_pmu *pmu;
-
- /* only look at RAPL events */
- if (event->attr.type != rapl_pmus->pmu.type)
- return -ENOENT;
-
- /* check only supported bits are set */
- if (event->attr.config & ~RAPL_EVENT_MASK)
- return -EINVAL;
-
- if (event->cpu < 0)
- return -EINVAL;
-
- event->event_caps |= PERF_EV_CAP_READ_ACTIVE_PKG;
-
- if (!cfg || cfg >= NR_RAPL_DOMAINS + 1)
- return -EINVAL;
-
- cfg = array_index_nospec((long)cfg, NR_RAPL_DOMAINS + 1);
- bit = cfg - 1;
-
- /* check event supported */
- if (!(rapl_cntr_mask & (1 << bit)))
- return -EINVAL;
-
- /* unsupported modes and filters */
- if (event->attr.sample_period) /* no sampling */
- return -EINVAL;
-
- /* must be done before validate_group */
- pmu = cpu_to_rapl_pmu(event->cpu);
- if (!pmu)
- return -EINVAL;
- event->cpu = pmu->cpu;
- event->pmu_private = pmu;
- event->hw.event_base = rapl_msrs[bit].msr;
- event->hw.config = cfg;
- event->hw.idx = bit;
-
- return ret;
-}
-
-static void rapl_pmu_event_read(struct perf_event *event)
-{
- rapl_event_update(event);
-}
-
-static ssize_t rapl_get_attr_cpumask(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- return cpumap_print_to_pagebuf(true, buf, &rapl_cpu_mask);
-}
-
-static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
-
-static struct attribute *rapl_pmu_attrs[] = {
- &dev_attr_cpumask.attr,
- NULL,
-};
-
-static struct attribute_group rapl_pmu_attr_group = {
- .attrs = rapl_pmu_attrs,
-};
-
-RAPL_EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01");
-RAPL_EVENT_ATTR_STR(energy-pkg , rapl_pkg, "event=0x02");
-RAPL_EVENT_ATTR_STR(energy-ram , rapl_ram, "event=0x03");
-RAPL_EVENT_ATTR_STR(energy-gpu , rapl_gpu, "event=0x04");
-RAPL_EVENT_ATTR_STR(energy-psys, rapl_psys, "event=0x05");
-
-RAPL_EVENT_ATTR_STR(energy-cores.unit, rapl_cores_unit, "Joules");
-RAPL_EVENT_ATTR_STR(energy-pkg.unit , rapl_pkg_unit, "Joules");
-RAPL_EVENT_ATTR_STR(energy-ram.unit , rapl_ram_unit, "Joules");
-RAPL_EVENT_ATTR_STR(energy-gpu.unit , rapl_gpu_unit, "Joules");
-RAPL_EVENT_ATTR_STR(energy-psys.unit, rapl_psys_unit, "Joules");
-
-/*
- * we compute in 0.23 nJ increments regardless of MSR
- */
-RAPL_EVENT_ATTR_STR(energy-cores.scale, rapl_cores_scale, "2.3283064365386962890625e-10");
-RAPL_EVENT_ATTR_STR(energy-pkg.scale, rapl_pkg_scale, "2.3283064365386962890625e-10");
-RAPL_EVENT_ATTR_STR(energy-ram.scale, rapl_ram_scale, "2.3283064365386962890625e-10");
-RAPL_EVENT_ATTR_STR(energy-gpu.scale, rapl_gpu_scale, "2.3283064365386962890625e-10");
-RAPL_EVENT_ATTR_STR(energy-psys.scale, rapl_psys_scale, "2.3283064365386962890625e-10");
-
-/*
- * There are no default events, but we need to create
- * "events" group (with empty attrs) before updating
- * it with detected events.
- */
-static struct attribute *attrs_empty[] = {
- NULL,
-};
-
-static struct attribute_group rapl_pmu_events_group = {
- .name = "events",
- .attrs = attrs_empty,
-};
-
-PMU_FORMAT_ATTR(event, "config:0-7");
-static struct attribute *rapl_formats_attr[] = {
- &format_attr_event.attr,
- NULL,
-};
-
-static struct attribute_group rapl_pmu_format_group = {
- .name = "format",
- .attrs = rapl_formats_attr,
-};
-
-static const struct attribute_group *rapl_attr_groups[] = {
- &rapl_pmu_attr_group,
- &rapl_pmu_format_group,
- &rapl_pmu_events_group,
- NULL,
-};
-
-static struct attribute *rapl_events_cores[] = {
- EVENT_PTR(rapl_cores),
- EVENT_PTR(rapl_cores_unit),
- EVENT_PTR(rapl_cores_scale),
- NULL,
-};
-
-static struct attribute_group rapl_events_cores_group = {
- .name = "events",
- .attrs = rapl_events_cores,
-};
-
-static struct attribute *rapl_events_pkg[] = {
- EVENT_PTR(rapl_pkg),
- EVENT_PTR(rapl_pkg_unit),
- EVENT_PTR(rapl_pkg_scale),
- NULL,
-};
-
-static struct attribute_group rapl_events_pkg_group = {
- .name = "events",
- .attrs = rapl_events_pkg,
-};
-
-static struct attribute *rapl_events_ram[] = {
- EVENT_PTR(rapl_ram),
- EVENT_PTR(rapl_ram_unit),
- EVENT_PTR(rapl_ram_scale),
- NULL,
-};
-
-static struct attribute_group rapl_events_ram_group = {
- .name = "events",
- .attrs = rapl_events_ram,
-};
-
-static struct attribute *rapl_events_gpu[] = {
- EVENT_PTR(rapl_gpu),
- EVENT_PTR(rapl_gpu_unit),
- EVENT_PTR(rapl_gpu_scale),
- NULL,
-};
-
-static struct attribute_group rapl_events_gpu_group = {
- .name = "events",
- .attrs = rapl_events_gpu,
-};
-
-static struct attribute *rapl_events_psys[] = {
- EVENT_PTR(rapl_psys),
- EVENT_PTR(rapl_psys_unit),
- EVENT_PTR(rapl_psys_scale),
- NULL,
-};
-
-static struct attribute_group rapl_events_psys_group = {
- .name = "events",
- .attrs = rapl_events_psys,
-};
-
-static bool test_msr(int idx, void *data)
-{
- return test_bit(idx, (unsigned long *) data);
-}
-
-/* Only lower 32bits of the MSR represents the energy counter */
-#define RAPL_MSR_MASK 0xFFFFFFFF
-
-static struct perf_msr intel_rapl_msrs[] = {
- [PERF_RAPL_PP0] = { MSR_PP0_ENERGY_STATUS, &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PKG] = { MSR_PKG_ENERGY_STATUS, &rapl_events_pkg_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_RAM] = { MSR_DRAM_ENERGY_STATUS, &rapl_events_ram_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PP1] = { MSR_PP1_ENERGY_STATUS, &rapl_events_gpu_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group, test_msr, false, RAPL_MSR_MASK },
-};
-
-static struct perf_msr intel_rapl_spr_msrs[] = {
- [PERF_RAPL_PP0] = { MSR_PP0_ENERGY_STATUS, &rapl_events_cores_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PKG] = { MSR_PKG_ENERGY_STATUS, &rapl_events_pkg_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_RAM] = { MSR_DRAM_ENERGY_STATUS, &rapl_events_ram_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PP1] = { MSR_PP1_ENERGY_STATUS, &rapl_events_gpu_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_PSYS] = { MSR_PLATFORM_ENERGY_STATUS, &rapl_events_psys_group, test_msr, true, RAPL_MSR_MASK },
-};
-
-/*
- * Force to PERF_RAPL_MAX size due to:
- * - perf_msr_probe(PERF_RAPL_MAX)
- * - want to use same event codes across both architectures
- */
-static struct perf_msr amd_rapl_msrs[] = {
- [PERF_RAPL_PP0] = { 0, &rapl_events_cores_group, NULL, false, 0 },
- [PERF_RAPL_PKG] = { MSR_AMD_PKG_ENERGY_STATUS, &rapl_events_pkg_group, test_msr, false, RAPL_MSR_MASK },
- [PERF_RAPL_RAM] = { 0, &rapl_events_ram_group, NULL, false, 0 },
- [PERF_RAPL_PP1] = { 0, &rapl_events_gpu_group, NULL, false, 0 },
- [PERF_RAPL_PSYS] = { 0, &rapl_events_psys_group, NULL, false, 0 },
-};
-
-static int rapl_cpu_offline(unsigned int cpu)
-{
- struct rapl_pmu *pmu = cpu_to_rapl_pmu(cpu);
- int target;
-
- /* Check if exiting cpu is used for collecting rapl events */
- if (!cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask))
- return 0;
-
- pmu->cpu = -1;
- /* Find a new cpu to collect rapl events */
- target = cpumask_any_but(topology_die_cpumask(cpu), cpu);
-
- /* Migrate rapl events to the new target */
- if (target < nr_cpu_ids) {
- cpumask_set_cpu(target, &rapl_cpu_mask);
- pmu->cpu = target;
- perf_pmu_migrate_context(pmu->pmu, cpu, target);
- }
- return 0;
-}
-
-static int rapl_cpu_online(unsigned int cpu)
-{
- struct rapl_pmu *pmu = cpu_to_rapl_pmu(cpu);
- int target;
-
- if (!pmu) {
- pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
- if (!pmu)
- return -ENOMEM;
-
- raw_spin_lock_init(&pmu->lock);
- INIT_LIST_HEAD(&pmu->active_list);
- pmu->pmu = &rapl_pmus->pmu;
- pmu->timer_interval = ms_to_ktime(rapl_timer_ms);
- rapl_hrtimer_init(pmu);
-
- rapl_pmus->pmus[topology_logical_die_id(cpu)] = pmu;
- }
-
- /*
- * Check if there is an online cpu in the package which collects rapl
- * events already.
- */
- target = cpumask_any_and(&rapl_cpu_mask, topology_die_cpumask(cpu));
- if (target < nr_cpu_ids)
- return 0;
-
- cpumask_set_cpu(cpu, &rapl_cpu_mask);
- pmu->cpu = cpu;
- return 0;
-}
-
-static int rapl_check_hw_unit(struct rapl_model *rm)
-{
- u64 msr_rapl_power_unit_bits;
- int i;
-
- /* protect rdmsrl() to handle virtualization */
- if (rdmsrl_safe(rm->msr_power_unit, &msr_rapl_power_unit_bits))
- return -1;
- for (i = 0; i < NR_RAPL_DOMAINS; i++)
- rapl_hw_unit[i] = (msr_rapl_power_unit_bits >> 8) & 0x1FULL;
-
- switch (rm->unit_quirk) {
- /*
- * DRAM domain on HSW server and KNL has fixed energy unit which can be
- * different than the unit from power unit MSR. See
- * "Intel Xeon Processor E5-1600 and E5-2600 v3 Product Families, V2
- * of 2. Datasheet, September 2014, Reference Number: 330784-001 "
- */
- case RAPL_UNIT_QUIRK_INTEL_HSW:
- rapl_hw_unit[PERF_RAPL_RAM] = 16;
- break;
- /* SPR uses a fixed energy unit for Psys domain. */
- case RAPL_UNIT_QUIRK_INTEL_SPR:
- rapl_hw_unit[PERF_RAPL_PSYS] = 0;
- break;
- default:
- break;
- }
-
-
- /*
- * Calculate the timer rate:
- * Use reference of 200W for scaling the timeout to avoid counter
- * overflows. 200W = 200 Joules/sec
- * Divide interval by 2 to avoid lockstep (2 * 100)
- * if hw unit is 32, then we use 2 ms 1/200/2
- */
- rapl_timer_ms = 2;
- if (rapl_hw_unit[0] < 32) {
- rapl_timer_ms = (1000 / (2 * 100));
- rapl_timer_ms *= (1ULL << (32 - rapl_hw_unit[0] - 1));
- }
- return 0;
-}
-
-static void __init rapl_advertise(void)
-{
- int i;
-
- pr_info("API unit is 2^-32 Joules, %d fixed counters, %llu ms ovfl timer\n",
- hweight32(rapl_cntr_mask), rapl_timer_ms);
-
- for (i = 0; i < NR_RAPL_DOMAINS; i++) {
- if (rapl_cntr_mask & (1 << i)) {
- pr_info("hw unit of domain %s 2^-%d Joules\n",
- rapl_domain_names[i], rapl_hw_unit[i]);
- }
- }
-}
-
-static void cleanup_rapl_pmus(void)
-{
- int i;
-
- for (i = 0; i < rapl_pmus->maxdie; i++)
- kfree(rapl_pmus->pmus[i]);
- kfree(rapl_pmus);
-}
-
-static const struct attribute_group *rapl_attr_update[] = {
- &rapl_events_cores_group,
- &rapl_events_pkg_group,
- &rapl_events_ram_group,
- &rapl_events_gpu_group,
- &rapl_events_psys_group,
- NULL,
-};
-
-static int __init init_rapl_pmus(void)
-{
- int maxdie = topology_max_packages() * topology_max_die_per_package();
- size_t size;
-
- size = sizeof(*rapl_pmus) + maxdie * sizeof(struct rapl_pmu *);
- rapl_pmus = kzalloc(size, GFP_KERNEL);
- if (!rapl_pmus)
- return -ENOMEM;
-
- rapl_pmus->maxdie = maxdie;
- rapl_pmus->pmu.attr_groups = rapl_attr_groups;
- rapl_pmus->pmu.attr_update = rapl_attr_update;
- rapl_pmus->pmu.task_ctx_nr = perf_invalid_context;
- rapl_pmus->pmu.event_init = rapl_pmu_event_init;
- rapl_pmus->pmu.add = rapl_pmu_event_add;
- rapl_pmus->pmu.del = rapl_pmu_event_del;
- rapl_pmus->pmu.start = rapl_pmu_event_start;
- rapl_pmus->pmu.stop = rapl_pmu_event_stop;
- rapl_pmus->pmu.read = rapl_pmu_event_read;
- rapl_pmus->pmu.module = THIS_MODULE;
- rapl_pmus->pmu.capabilities = PERF_PMU_CAP_NO_EXCLUDE;
- return 0;
-}
-
-static struct rapl_model model_snb = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_PP1),
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_snbep = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM),
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_hsw = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM) |
- BIT(PERF_RAPL_PP1),
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_hsx = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM),
- .unit_quirk = RAPL_UNIT_QUIRK_INTEL_HSW,
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_knl = {
- .events = BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM),
- .unit_quirk = RAPL_UNIT_QUIRK_INTEL_HSW,
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_skl = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM) |
- BIT(PERF_RAPL_PP1) |
- BIT(PERF_RAPL_PSYS),
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_msrs,
-};
-
-static struct rapl_model model_spr = {
- .events = BIT(PERF_RAPL_PP0) |
- BIT(PERF_RAPL_PKG) |
- BIT(PERF_RAPL_RAM) |
- BIT(PERF_RAPL_PSYS),
- .unit_quirk = RAPL_UNIT_QUIRK_INTEL_SPR,
- .msr_power_unit = MSR_RAPL_POWER_UNIT,
- .rapl_msrs = intel_rapl_spr_msrs,
-};
-
-static struct rapl_model model_amd_hygon = {
- .events = BIT(PERF_RAPL_PKG),
- .msr_power_unit = MSR_AMD_RAPL_POWER_UNIT,
- .rapl_msrs = amd_rapl_msrs,
-};
-
-static const struct x86_cpu_id rapl_model_match[] __initconst = {
- X86_MATCH_FEATURE(X86_FEATURE_RAPL, &model_amd_hygon),
- X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE, &model_snb),
- X86_MATCH_INTEL_FAM6_MODEL(SANDYBRIDGE_X, &model_snbep),
- X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE, &model_snb),
- X86_MATCH_INTEL_FAM6_MODEL(IVYBRIDGE_X, &model_snbep),
- X86_MATCH_INTEL_FAM6_MODEL(HASWELL, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(HASWELL_X, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(HASWELL_L, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(HASWELL_G, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(BROADWELL, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_G, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_X, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(BROADWELL_D, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &model_knl),
- X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &model_knl),
- X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(CANNONLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_D, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(ATOM_GOLDMONT_PLUS, &model_hsw),
- X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ICELAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_D, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &model_hsx),
- X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(TIGERLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(ATOM_GRACEMONT, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(SAPPHIRERAPIDS_X, &model_spr),
- X86_MATCH_INTEL_FAM6_MODEL(EMERALDRAPIDS_X, &model_spr),
- X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_P, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(RAPTORLAKE_S, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(METEORLAKE, &model_skl),
- X86_MATCH_INTEL_FAM6_MODEL(METEORLAKE_L, &model_skl),
- {},
-};
-MODULE_DEVICE_TABLE(x86cpu, rapl_model_match);
-
-static int __init rapl_pmu_init(void)
-{
- const struct x86_cpu_id *id;
- struct rapl_model *rm;
- int ret;
-
- id = x86_match_cpu(rapl_model_match);
- if (!id)
- return -ENODEV;
-
- rm = (struct rapl_model *) id->driver_data;
-
- rapl_msrs = rm->rapl_msrs;
-
- rapl_cntr_mask = perf_msr_probe(rapl_msrs, PERF_RAPL_MAX,
- false, (void *) &rm->events);
-
- ret = rapl_check_hw_unit(rm);
- if (ret)
- return ret;
-
- ret = init_rapl_pmus();
- if (ret)
- return ret;
-
- /*
- * Install callbacks. Core will call them for each online cpu.
- */
- ret = cpuhp_setup_state(CPUHP_AP_PERF_X86_RAPL_ONLINE,
- "perf/x86/rapl:online",
- rapl_cpu_online, rapl_cpu_offline);
- if (ret)
- goto out;
-
- ret = perf_pmu_register(&rapl_pmus->pmu, "power", -1);
- if (ret)
- goto out1;
-
- rapl_advertise();
- return 0;
-
-out1:
- cpuhp_remove_state(CPUHP_AP_PERF_X86_RAPL_ONLINE);
-out:
- pr_warn("Initialization failed (%d), disabled\n", ret);
- cleanup_rapl_pmus();
- return ret;
-}
-module_init(rapl_pmu_init);
-
-static void __exit intel_rapl_exit(void)
-{
- cpuhp_remove_state_nocalls(CPUHP_AP_PERF_X86_RAPL_ONLINE);
- perf_pmu_unregister(&rapl_pmus->pmu);
- cleanup_rapl_pmus();
-}
-module_exit(intel_rapl_exit);
diff --git a/drivers/powercap/intel_rapl_msr.c b/drivers/powercap/intel_rapl_msr.c
index b4b6930cacb0..564623471a68 100644
--- a/drivers/powercap/intel_rapl_msr.c
+++ b/drivers/powercap/intel_rapl_msr.c
@@ -53,6 +53,7 @@ static struct rapl_if_priv rapl_msr_priv_intel = {
.regs[RAPL_DOMAIN_PLATFORM][RAPL_DOMAIN_REG_STATUS].msr = MSR_PLATFORM_ENERGY_STATUS,
.limits[RAPL_DOMAIN_PACKAGE] = BIT(POWER_LIMIT2),
.limits[RAPL_DOMAIN_PLATFORM] = BIT(POWER_LIMIT2),
+ .enable_pmu = true,
};

static struct rapl_if_priv rapl_msr_priv_amd = {
@@ -60,6 +61,7 @@ static struct rapl_if_priv rapl_msr_priv_amd = {
.reg_unit.msr = MSR_AMD_RAPL_POWER_UNIT,
.regs[RAPL_DOMAIN_PACKAGE][RAPL_DOMAIN_REG_STATUS].msr = MSR_AMD_PKG_ENERGY_STATUS,
.regs[RAPL_DOMAIN_PP0][RAPL_DOMAIN_REG_STATUS].msr = MSR_AMD_CORE_ENERGY_STATUS,
+ .enable_pmu = true,
};

/* Handles CPU hotplug on multi-socket systems.
--
2.34.1


2024-01-31 14:40:50

by Rafael J. Wysocki

Subject: Re: [PATCH 0/5] intel_rapl & perf rapl: combine PMU support

On Wed, Jan 31, 2024 at 3:24 PM Zhang Rui <[email protected]> wrote:
>
> This patch series is based on the patch series posted at
> https://lore.kernel.org/all/[email protected]/
>
> Problem statement
> -----------------
> MSR RAPL powercap sysfs is done in drivers/powercap/intel_rapl_msr.c.
> MSR RAPL PMU is done in arch/x86/events/rapl.c.
>
> They maintain two separate CPU model lists, describing the same feature
> available on the same set of hardware. This adds significant and
> unnecessary maintenance burden.
>
> Now we need to introduce TPMI RAPL PMU support, which again shares most
> of the logic with MSR RAPL PMU.
>
> Solution
> --------
> Introduce PMU support as part of the RAPL framework and remove the
> current MSR RAPL PMU code.
>
> The idea is that, if a RAPL Package device is registered to the RAPL
> framework, and is ready for energy reporting and control via powercap
> sysfs, then it is also ready for PMU.
>
> So introduce PMU support in the RAPL framework that works for all
> registered RAPL Package devices. With this, the current MSR RAPL PMU
> code can be removed completely.
>
> Given that the MSR RAPL and TPMI RAPL drivers won't function on the same
> platform, the new RAPL PMU can be fully compatible with the current MSR
> RAPL PMU, including using the same PMU name and events name/id/unit/scale.
>
> For example, on platforms that use either MSR or TPMI, the same command
>  perf stat -e power/energy-pkg/ -e power/energy-ram/ -e power/energy-cores/ FOO
> can be used to get the energy consumption, as long as the events show up
> in the "perf list" output.
>
> Notes
> -----
> There are indeed some functional changes introduced, due to the
> divergence between the two CPU model lists. These include:
> 1. Fix BROADWELL_D in the intel_rapl driver to use a fixed Dram domain
> energy unit.
> 2. Enable PMU for some Intel platforms which were missing from
> arch/x86/events/rapl.c. This includes:
> ICELAKE_NNPI
> ROCKETLAKE
> LUNARLAKE_M
> LAKEFIELD
> ATOM_SILVERMONT
> ATOM_SILVERMONT_MID
> ATOM_AIRMONT
> ATOM_AIRMONT_MID
> ATOM_TREMONT
> ATOM_TREMONT_D
> ATOM_TREMONT_L
> 3. Change the logic for enumerating AMD/HYGON platforms
> Previously, it was
> X86_MATCH_FEATURE(X86_FEATURE_RAPL, &model_amd_hygon)
> And now it is
> X86_MATCH_VENDOR_FAM(AMD, 0x17, &rapl_defaults_amd)
> X86_MATCH_VENDOR_FAM(AMD, 0x19, &rapl_defaults_amd)
> X86_MATCH_VENDOR_FAM(HYGON, 0x18, &rapl_defaults_amd)
>
> Any comments/concerns are welcome.

Say the first patch in the series is applied and the last one is not.
Will anything break?

Regardless of the above, if any existing code is moved unmodified by
this series to a new location, it would be nice to be able to see that
in the patches. Otherwise, some subtle differences may be missed.

2024-01-31 14:56:45

by Zhang, Rui

Subject: [PATCH 2/5] powercap: intel_rapl: Add PMU support

MSR RAPL powercap sysfs is done in drivers/powercap/intel_rapl_msr.c.
MSR RAPL PMU is done in arch/x86/events/rapl.c.

They maintain two separate CPU model lists, describing the same feature
available on the same set of hardware. This adds significant and
unnecessary maintenance burden.

On top of that, TPMI RAPL also needs PMU support, which shares similar
code with the MSR RAPL PMU.

To unify all of these, introduce PMU support as part of the RAPL framework.

The idea is that, if a RAPL Package device is registered to the RAPL
framework, and is ready for energy reporting and control via sysfs, it
is also ready for PMU. So register a PMU in the RAPL framework that
works for all registered RAPL Package devices that have the .enable_pmu
flag set. Note that the RAPL PMU does not distinguish RAPL Package
devices from different Interfaces, because MSR RAPL and TPMI RAPL do not
co-exist on any platform, and MMIO RAPL does not set the .enable_pmu flag.

The RAPL PMU is fully compatible with the current MSR RAPL PMU,
including using the same PMU name and events name/id/unit/scale, etc.
For example, on platforms that use either MSR or TPMI, the same command
 perf stat -e power/energy-pkg/ -e power/energy-ram/ -e power/energy-cores/ FOO
can be used to get the energy consumption, as long as the events show up
in the "perf list" output.

Note that the MSR RAPL PMU is CPU model based and the events supported
are known at driver initialization time, while TPMI RAPL is probed
dynamically, and the events supported by each TPMI RAPL device can be
different. Thus, when a new RAPL Package device is registered with
.enable_pmu set:
1. if the PMU has not been registered yet, register the PMU
2. if the current PMU already covers all events that the new RAPL
Package device supports, do nothing (always true for MSR RAPL)
3. otherwise, unregister the current PMU and re-register it with the
events supported by all probed RAPL Package devices.
For example, on a dual-package system using TPMI RAPL, it is possible
that Package 1 behaves as the TPMI domain root. In this case, the PMU is
registered without Psys event support when probing TPMI RAPL Package 0,
and re-registered with Psys event support when probing Package 1.
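
In code, this decision reduces to the domain map check in
rapl_pmu_update() (excerpted from the patch below; rp->domain_map is the
bitmap of domains supported by the newly registered package):

	/* Return if PMU already covers all events supported by current RAPL Package */
	if (rapl_pmu.registered && !(rp->domain_map & (~rapl_pmu.domain_map)))
		return 0;
	...
	/* Otherwise, (re)register with the union of all probed packages' events */
	rapl_pmu.domain_map |= rp->domain_map;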

Signed-off-by: Zhang Rui <[email protected]>
---
drivers/powercap/intel_rapl_common.c | 536 +++++++++++++++++++++++++++
include/linux/intel_rapl.h | 17 +
2 files changed, 553 insertions(+)

diff --git a/drivers/powercap/intel_rapl_common.c b/drivers/powercap/intel_rapl_common.c
index 315e304219e2..318174c87249 100644
--- a/drivers/powercap/intel_rapl_common.c
+++ b/drivers/powercap/intel_rapl_common.c
@@ -15,6 +15,8 @@
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/module.h>
+#include <linux/nospec.h>
+#include <linux/perf_event.h>
#include <linux/platform_device.h>
#include <linux/powercap.h>
#include <linux/processor.h>
@@ -1506,6 +1508,532 @@ static int rapl_detect_domains(struct rapl_package *rp)
return 0;
}

+#ifdef CONFIG_PERF_EVENTS
+
+/* Support for RAPL PMU */
+
+struct rapl_pmu {
+ struct pmu pmu;
+ u64 timer_ms;
+ cpumask_t cpu_mask;
+ unsigned long domain_map; /* Events supported by all registered RAPL packages */
+ bool registered;
+};
+
+static struct rapl_pmu rapl_pmu;
+
+/* PMU helpers */
+static int get_pmu_cpu(struct rapl_package *rp)
+{
+ int cpu;
+
+ if (!rp->priv->enable_pmu)
+ return nr_cpu_ids;
+
+ /* MSR RAPL uses rp->lead_cpu for PMU */
+ if (rp->lead_cpu >= 0)
+ return rp->lead_cpu;
+
+ /* TPMI RAPL uses any CPU in the package for PMU */
+ for_each_online_cpu(cpu)
+ if (topology_physical_package_id(cpu) == rp->id)
+ return cpu;
+
+ return nr_cpu_ids;
+}
+
+static bool is_rp_pmu_cpu(struct rapl_package *rp, int cpu)
+{
+ if (!rp->priv->enable_pmu)
+ return false;
+
+ /* MSR RAPL uses rp->lead_cpu for PMU */
+ if (rp->lead_cpu >= 0)
+ return cpu == rp->lead_cpu;
+
+ /* TPMI RAPL uses any CPU in the package for PMU */
+ return topology_physical_package_id(cpu) == rp->id;
+}
+
+static struct rapl_package_pmu_data *event_to_pmu_data(struct perf_event *event)
+{
+ struct rapl_package *rp = event->pmu_private;
+
+ return &rp->pmu_data;
+}
+
+/* PMU event callbacks */
+
+/* Return 0 for unsupported events or failed read */
+static u64 event_read_counter(struct perf_event *event)
+{
+ struct rapl_package *rp = event->pmu_private;
+ u64 val;
+ int ret;
+
+ if (event->hw.idx < 0)
+ return 0;
+
+ ret = rapl_read_data_raw(&rp->domains[event->hw.idx], ENERGY_COUNTER, false, &val);
+ return ret ? 0 : val;
+}
+
+static void __rapl_pmu_event_start(struct perf_event *event)
+{
+ struct rapl_package_pmu_data *data = event_to_pmu_data(event);
+
+ if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+ return;
+
+ event->hw.state = 0;
+
+ list_add_tail(&event->active_entry, &data->active_list);
+
+ local64_set(&event->hw.prev_count, event_read_counter(event));
+ if (++data->n_active == 1)
+ hrtimer_start(&data->hrtimer, data->timer_interval,
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int mode)
+{
+ struct rapl_package_pmu_data *data = event_to_pmu_data(event);
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&data->lock, flags);
+ __rapl_pmu_event_start(event);
+ raw_spin_unlock_irqrestore(&data->lock, flags);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ struct rapl_package_pmu_data *data = event_to_pmu_data(event);
+ u64 prev_raw_count, new_raw_count;
+ s64 delta, sdelta;
+
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ new_raw_count = event_read_counter(event);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count)
+ goto again;
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ */
+ delta = new_raw_count - prev_raw_count;
+
+ /*
+ * Scale delta to smallest unit (2^-32)
+ * users must then scale back: count * 1/(1e9*2^32) to get Joules
+ * or use ldexp(count, -32).
+ * Watts = Joules/Time delta
+ */
+ sdelta = delta * data->scale[event->hw.flags];
+
+ local64_add(sdelta, &event->count);
+
+ return new_raw_count;
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int mode)
+{
+ struct rapl_package_pmu_data *data = event_to_pmu_data(event);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&data->lock, flags);
+
+ /* Mark event as deactivated and stopped */
+ if (!(hwc->state & PERF_HES_STOPPED)) {
+ WARN_ON_ONCE(data->n_active <= 0);
+ if (--data->n_active == 0)
+ hrtimer_cancel(&data->hrtimer);
+
+ list_del(&event->active_entry);
+
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* Check if update of sw counter is necessary */
+ if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ rapl_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+
+ raw_spin_unlock_irqrestore(&data->lock, flags);
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int mode)
+{
+ struct rapl_package_pmu_data *data = event_to_pmu_data(event);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&data->lock, flags);
+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_START)
+ __rapl_pmu_event_start(event);
+
+ raw_spin_unlock_irqrestore(&data->lock, flags);
+
+ return 0;
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+/*
+ * RAPL energy status counters
+ */
+enum perf_rapl_events {
+ PERF_RAPL_PP0 = 0, /* all cores */
+ PERF_RAPL_PKG, /* entire package */
+ PERF_RAPL_RAM, /* DRAM */
+ PERF_RAPL_PP1, /* gpu */
+ PERF_RAPL_PSYS, /* psys */
+};
+#define RAPL_EVENT_MASK GENMASK(7, 0)
+
+static int event_to_domain(int event)
+{
+ switch (event) {
+ case PERF_RAPL_PP0:
+ return RAPL_DOMAIN_PP0;
+ case PERF_RAPL_PKG:
+ return RAPL_DOMAIN_PACKAGE;
+ case PERF_RAPL_RAM:
+ return RAPL_DOMAIN_DRAM;
+ case PERF_RAPL_PP1:
+ return RAPL_DOMAIN_PP1;
+ case PERF_RAPL_PSYS:
+ return RAPL_DOMAIN_PLATFORM;
+ default:
+ return RAPL_DOMAIN_MAX;
+ }
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+ struct rapl_package *pos, *rp = NULL;
+ u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+ int domain, idx;
+
+ /* Only look at RAPL events */
+ if (event->attr.type != event->pmu->type)
+ return -ENOENT;
+
+ /* Check only supported bits are set */
+ if (event->attr.config & ~RAPL_EVENT_MASK)
+ return -EINVAL;
+
+ if (event->cpu < 0)
+ return -EINVAL;
+
+ list_for_each_entry(pos, &rapl_packages, plist) {
+ if (is_rp_pmu_cpu(pos, event->cpu)) {
+ rp = pos;
+ break;
+ }
+ }
+ if (!rp)
+ return -ENODEV;
+
+ domain = event_to_domain(cfg - 1);
+ if (domain >= RAPL_DOMAIN_MAX)
+ return -EINVAL;
+
+ event->event_caps |= PERF_EV_CAP_READ_ACTIVE_PKG;
+ event->pmu_private = rp; /* Which package */
+ event->hw.flags = domain; /* Which domain */
+ event->hw.idx = -1; /* Index in rp->domains[] to get domain pointer */
+ for (idx = 0; idx < rp->nr_domains; idx++) {
+ if (rp->domains[idx].id == domain) {
+ event->hw.idx = idx;
+ break;
+ }
+ }
+
+ return 0;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+ rapl_event_update(event);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct rapl_package_pmu_data *data =
+ container_of(hrtimer, struct rapl_package_pmu_data, hrtimer);
+ struct perf_event *event;
+ unsigned long flags;
+
+ if (!data->n_active)
+ return HRTIMER_NORESTART;
+
+ raw_spin_lock_irqsave(&data->lock, flags);
+
+ list_for_each_entry(event, &data->active_list, active_entry)
+ rapl_event_update(event);
+
+ raw_spin_unlock_irqrestore(&data->lock, flags);
+
+ hrtimer_forward_now(hrtimer, data->timer_interval);
+
+ return HRTIMER_RESTART;
+}
+
+/* PMU sysfs attributes */
+
+/*
+ * There are no default events, but we need to create "events" group (with
+ * empty attrs) before updating it with detected events.
+ */
+static struct attribute *attrs_empty[] = {
+ NULL,
+};
+
+static struct attribute_group pmu_events_group = {
+ .name = "events",
+ .attrs = attrs_empty,
+};
+
+static ssize_t cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct rapl_package *rp;
+ int cpu;
+
+ cpus_read_lock();
+ cpumask_clear(&rapl_pmu.cpu_mask);
+
+ /* Choose a cpu for each RAPL Package */
+ list_for_each_entry(rp, &rapl_packages, plist) {
+ cpu = get_pmu_cpu(rp);
+ if (cpu < nr_cpu_ids)
+ cpumask_set_cpu(cpu, &rapl_pmu.cpu_mask);
+ }
+ cpus_read_unlock();
+
+ return cpumap_print_to_pagebuf(true, buf, &rapl_pmu.cpu_mask);
+}
+
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *pmu_cpumask_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL
+};
+
+static struct attribute_group pmu_cpumask_group = {
+ .attrs = pmu_cpumask_attrs,
+};
+
+PMU_FORMAT_ATTR(event, "config:0-7");
+static struct attribute *pmu_format_attr[] = {
+ &format_attr_event.attr,
+ NULL
+};
+
+static struct attribute_group pmu_format_group = {
+ .name = "format",
+ .attrs = pmu_format_attr,
+};
+
+static const struct attribute_group *pmu_attr_groups[] = {
+ &pmu_events_group,
+ &pmu_cpumask_group,
+ &pmu_format_group,
+ NULL
+};
+
+#define RAPL_EVENT_ATTR_STR(_name, v, str) \
+static struct perf_pmu_events_attr event_attr_##v = { \
+ .attr = __ATTR(_name, 0444, perf_event_sysfs_show, NULL), \
+ .event_str = str, \
+}
+
+RAPL_EVENT_ATTR_STR(energy-cores, rapl_cores, "event=0x01");
+RAPL_EVENT_ATTR_STR(energy-pkg, rapl_pkg, "event=0x02");
+RAPL_EVENT_ATTR_STR(energy-ram, rapl_ram, "event=0x03");
+RAPL_EVENT_ATTR_STR(energy-gpu, rapl_gpu, "event=0x04");
+RAPL_EVENT_ATTR_STR(energy-psys, rapl_psys, "event=0x05");
+
+RAPL_EVENT_ATTR_STR(energy-cores.unit, rapl_unit_cores, "Joules");
+RAPL_EVENT_ATTR_STR(energy-pkg.unit, rapl_unit_pkg, "Joules");
+RAPL_EVENT_ATTR_STR(energy-ram.unit, rapl_unit_ram, "Joules");
+RAPL_EVENT_ATTR_STR(energy-gpu.unit, rapl_unit_gpu, "Joules");
+RAPL_EVENT_ATTR_STR(energy-psys.unit, rapl_unit_psys, "Joules");
+
+RAPL_EVENT_ATTR_STR(energy-cores.scale, rapl_scale_cores, "2.3283064365386962890625e-10");
+RAPL_EVENT_ATTR_STR(energy-pkg.scale, rapl_scale_pkg, "2.3283064365386962890625e-10");
+RAPL_EVENT_ATTR_STR(energy-ram.scale, rapl_scale_ram, "2.3283064365386962890625e-10");
+RAPL_EVENT_ATTR_STR(energy-gpu.scale, rapl_scale_gpu, "2.3283064365386962890625e-10");
+RAPL_EVENT_ATTR_STR(energy-psys.scale, rapl_scale_psys, "2.3283064365386962890625e-10");
+
+#define RAPL_EVENT_GROUP(_name, domain) \
+static struct attribute *pmu_attr_##_name[] = { \
+ &event_attr_rapl_##_name.attr.attr, \
+ &event_attr_rapl_unit_##_name.attr.attr, \
+ &event_attr_rapl_scale_##_name.attr.attr, \
+ NULL \
+}; \
+static umode_t is_visible_##_name(struct kobject *kobj, struct attribute *attr, int event) \
+{ \
+ return rapl_pmu.domain_map & BIT(domain) ? attr->mode : 0; \
+} \
+static struct attribute_group pmu_group_##_name = { \
+ .name = "events", \
+ .attrs = pmu_attr_##_name, \
+ .is_visible = is_visible_##_name, \
+}
+
+RAPL_EVENT_GROUP(cores, RAPL_DOMAIN_PP0);
+RAPL_EVENT_GROUP(pkg, RAPL_DOMAIN_PACKAGE);
+RAPL_EVENT_GROUP(ram, RAPL_DOMAIN_DRAM);
+RAPL_EVENT_GROUP(gpu, RAPL_DOMAIN_PP1);
+RAPL_EVENT_GROUP(psys, RAPL_DOMAIN_PLATFORM);
+
+static const struct attribute_group *pmu_attr_update[] = {
+ &pmu_group_cores,
+ &pmu_group_pkg,
+ &pmu_group_ram,
+ &pmu_group_gpu,
+ &pmu_group_psys,
+ NULL
+};
+
+static int rapl_pmu_update(struct rapl_package *rp)
+{
+ int ret;
+
+ /* Return if PMU already covers all events supported by current RAPL Package */
+ if (rapl_pmu.registered && !(rp->domain_map & (~rapl_pmu.domain_map)))
+ return 0;
+
+ /* Unregister previous registered PMU */
+ if (rapl_pmu.registered) {
+ perf_pmu_unregister(&rapl_pmu.pmu);
+ memset(&rapl_pmu.pmu, 0, sizeof(struct pmu));
+ }
+
+ rapl_pmu.domain_map |= rp->domain_map;
+
+ memset(&rapl_pmu.pmu, 0, sizeof(struct pmu));
+ rapl_pmu.pmu.attr_groups = pmu_attr_groups;
+ rapl_pmu.pmu.attr_update = pmu_attr_update;
+ rapl_pmu.pmu.task_ctx_nr = perf_invalid_context;
+ rapl_pmu.pmu.event_init = rapl_pmu_event_init;
+ rapl_pmu.pmu.add = rapl_pmu_event_add;
+ rapl_pmu.pmu.del = rapl_pmu_event_del;
+ rapl_pmu.pmu.start = rapl_pmu_event_start;
+ rapl_pmu.pmu.stop = rapl_pmu_event_stop;
+ rapl_pmu.pmu.read = rapl_pmu_event_read;
+ rapl_pmu.pmu.module = THIS_MODULE;
+ rapl_pmu.pmu.capabilities = PERF_PMU_CAP_NO_EXCLUDE | PERF_PMU_CAP_NO_INTERRUPT;
+ ret = perf_pmu_register(&rapl_pmu.pmu, "power", -1);
+ if (ret)
+ pr_warn("Failed to register PMU\n");
+
+ rapl_pmu.registered = !ret;
+
+ return ret;
+}
+
+static int rapl_package_add_pmu(struct rapl_package *rp)
+{
+ struct rapl_package_pmu_data *data = &rp->pmu_data;
+ int idx;
+
+ if (!rp->priv->enable_pmu)
+ return 0;
+
+ for (idx = 0; idx < rp->nr_domains; idx++) {
+ struct rapl_domain *rd = &rp->domains[idx];
+ int domain = rd->id;
+ u64 val;
+
+ if (!test_bit(domain, &rp->domain_map))
+ continue;
+
+ /*
+ * The RAPL PMU granularity is 2^-32 Joules
+ * data->scale[]: times of 2^-32 Joules for each ENERGY COUNTER increase
+ */
+ val = rd->energy_unit * (1ULL << 32);
+ do_div(val, ENERGY_UNIT_SCALE * 1000000);
+ data->scale[domain] = val;
+
+ if (!rapl_pmu.timer_ms) {
+ struct rapl_primitive_info *rpi = get_rpi(rp, ENERGY_COUNTER);
+
+
+ /*
+ * Calculate the timer rate:
+ * Use reference of 200W for scaling the timeout to avoid counter
+ * overflows.
+ *
+ * max_count = rpi->mask >> rpi->shift + 1
+ * max_energy_pj = max_count * rd->energy_unit
+ * max_time_sec = (max_energy_pj / 1000000000) / 200w
+ *
+ * rapl_pmu.timer_ms = max_time_sec * 1000 / 2
+ */
+ val = (rpi->mask >> rpi->shift) + 1;
+ val *= rd->energy_unit;
+ do_div(val, 1000000 * 200 * 2);
+ rapl_pmu.timer_ms = val;
+
+ pr_info("%llu ms ovfl timer\n", rapl_pmu.timer_ms);
+ }
+
+ pr_info("Domain %s: hw unit %lld * 2^-32 Joules\n", rd->name, data->scale[domain]);
+ }
+
+ /* Initialize per package PMU data */
+ raw_spin_lock_init(&data->lock);
+ INIT_LIST_HEAD(&data->active_list);
+ data->timer_interval = ms_to_ktime(rapl_pmu.timer_ms);
+ hrtimer_init(&data->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ data->hrtimer.function = rapl_hrtimer_handle;
+
+ return rapl_pmu_update(rp);
+}
+
+static void rapl_package_remove_pmu(struct rapl_package *rp)
+{
+ struct rapl_package *pos;
+
+ if (!rp->priv->enable_pmu || !rapl_pmu.registered)
+ return;
+
+ list_for_each_entry(pos, &rapl_packages, plist) {
+ if (pos->priv->enable_pmu)
+ return;
+ }
+
+ perf_pmu_unregister(&rapl_pmu.pmu);
+ memset(&rapl_pmu, 0, sizeof(struct rapl_pmu));
+}
+#else
+static int rapl_package_add_pmu(struct rapl_package *rp) { return 0; }
+static void rapl_package_remove_pmu(struct rapl_package *rp) { }
+#endif
+
/* called from CPU hotplug notifier, hotplug lock held */
void rapl_remove_package_cpuslocked(struct rapl_package *rp)
{
@@ -1534,6 +2062,8 @@ void rapl_remove_package_cpuslocked(struct rapl_package *rp)
powercap_unregister_zone(rp->priv->control_type,
&rd_package->power_zone);
list_del(&rp->plist);
+ /* Do PMU update after rp removed from rapl_packages list */
+ rapl_package_remove_pmu(rp);
kfree(rp);
}
EXPORT_SYMBOL_GPL(rapl_remove_package_cpuslocked);
@@ -1613,6 +2143,12 @@ struct rapl_package *rapl_add_package_cpuslocked(int id, struct rapl_if_priv *pr
if (!ret) {
INIT_LIST_HEAD(&rp->plist);
list_add(&rp->plist, &rapl_packages);
+ /*
+ * PMU failure does not affect the registered RAPL packages.
+ * Thus it should not block the current RAPL package neither.
+ * Ignore the return value.
+ */
+ rapl_package_add_pmu(rp);
return rp;
}

diff --git a/include/linux/intel_rapl.h b/include/linux/intel_rapl.h
index f3196f82fd8a..6f27380d4bab 100644
--- a/include/linux/intel_rapl.h
+++ b/include/linux/intel_rapl.h
@@ -138,6 +138,7 @@ struct reg_action {
* @reg_unit: Register for getting energy/power/time unit.
* @regs: Register sets for different RAPL Domains.
* @limits: Number of power limits supported by each domain.
+ * @enable_pmu: Enable Perf PMU for Domain energy counters.
* @read_raw: Callback for reading RAPL interface specific
* registers.
* @write_raw: Callback for writing RAPL interface specific
@@ -152,12 +153,25 @@ struct rapl_if_priv {
union rapl_reg reg_unit;
union rapl_reg regs[RAPL_DOMAIN_MAX][RAPL_DOMAIN_REG_MAX];
int limits[RAPL_DOMAIN_MAX];
+ bool enable_pmu;
int (*read_raw)(int id, struct reg_action *ra);
int (*write_raw)(int id, struct reg_action *ra);
void *defaults;
void *rpi;
};

+#ifdef CONFIG_PERF_EVENTS
+struct rapl_package_pmu_data {
+ u64 scale[RAPL_DOMAIN_MAX];
+ raw_spinlock_t lock;
+ int n_active;
+ struct list_head active_list;
+ ktime_t timer_interval;
+ struct hrtimer hrtimer;
+ u8 domain_mask;
+};
+#endif
+
/* maximum rapl package domain name: package-%d-die-%d */
#define PACKAGE_DOMAIN_NAME_LENGTH 30

@@ -176,6 +190,9 @@ struct rapl_package {
struct cpumask cpumask;
char name[PACKAGE_DOMAIN_NAME_LENGTH];
struct rapl_if_priv *priv;
+#ifdef CONFIG_PERF_EVENTS
+ struct rapl_package_pmu_data pmu_data;
+#endif
};

struct rapl_package *rapl_find_package_domain_cpuslocked(int id, struct rapl_if_priv *priv,
--
2.34.1


2024-02-01 05:35:30

by Zhang, Rui

Subject: Re: [PATCH 0/5] intel_rapl & perf rapl: combine PMU support

On Wed, 2024-01-31 at 15:40 +0100, Rafael J. Wysocki wrote:
> On Wed, Jan 31, 2024 at 3:24 PM Zhang Rui <[email protected]>
> wrote:
> >
> > This patch series is based on the patch series posted at
> > https://lore.kernel.org/all/[email protected]/
> >
> > Problem statement
> > -----------------
> > MSR RAPL powercap sysfs is done in
> > drivers/powercap/intel_rapl_msr.c.
> > MSR RAPL PMU is done in arch/x86/events/rapl.c.
> >
> > They maintain two separate CPU model lists, describing the same
> > feature
> > available on the same set of hardware. This adds significant and
> > unnecessary maintenance burden.
> >
> > Now we need to introduce TPMI RAPL PMU support, which again shares
> > most
> > of the logic with MSR RAPL PMU.
> >
> > Solution
> > --------
> > Introduce PMU support as part of the RAPL framework and remove the
> > current MSR RAPL PMU code.
> >
> > The idea is that, if a RAPL Package device is registered to the RAPL
> > framework, and is ready for energy reporting and control via
> > powercap
> > sysfs, then it is also ready for PMU.
> >
> > So introduce PMU support in the RAPL framework that works for all
> > registered RAPL Package devices. With this, the current MSR RAPL PMU
> > code can be removed completely.
> >
> > Given that the MSR RAPL and TPMI RAPL drivers won't function on the
> > same platform, the new RAPL PMU can be fully compatible with the
> > current MSR RAPL PMU, including using the same PMU name and events
> > name/id/unit/scale.
> >
> > For example, on platforms that use either MSR or TPMI, the same
> > command
> >  perf stat -e power/energy-pkg/ -e power/energy-ram/ -e
> > power/energy-cores/ FOO
> > can be used to get the energy consumption, as long as the events
> > show up in the "perf list" output.
> >
> > Notes
> > -----
> > There are indeed some functional changes introduced, due to the
> > divergence between the two CPU model lists. These include:
> > 1. Fix BROADWELL_D in intel_rapl driver to use fixed Dram domain
> > energy
> >    unit.
> > 2. Enable PMU for some Intel platforms, which were missing in
> >    arch/x86/events/rapl.c. This includes
> >         ICELAKE_NNPI
> >         ROCKETLAKE
> >         LUNARLAKE_M
> >         LAKEFIELD
> >         ATOM_SILVERMONT
> >         ATOM_SILVERMONT_MID
> >         ATOM_AIRMONT
> >         ATOM_AIRMONT_MID
> >         ATOM_TREMONT
> >         ATOM_TREMONT_D
> >         ATOM_TREMONT_L
> > 3. Change the logic for enumerating AMD/HYGON platforms
> >    Previously, it was
> >         X86_MATCH_FEATURE(X86_FEATURE_RAPL,            
> > &model_amd_hygon)
> >    And now it is
> >         X86_MATCH_VENDOR_FAM(AMD, 0x17, &rapl_defaults_amd)
> >         X86_MATCH_VENDOR_FAM(AMD, 0x19, &rapl_defaults_amd)
> >         X86_MATCH_VENDOR_FAM(HYGON, 0x18, &rapl_defaults_amd)
> >
> > Any comments/concerns are welcome.
>
> Say the first patch in the series is applied and the last one is not.
> Will anything break?

No. Without the last patch:
1. for platforms using TPMI RAPL, the .enable_pmu flag is set and the
PMU is registered via the RAPL framework
2. for platforms using MSR RAPL, the .enable_pmu flag is not set, and
the PMU is still registered by arch/x86/events/rapl.c

>
> Regardless of the above. if any existing code is moved unmodified by
> this series to a new location,

The intel_rapl PMU support shares a lot of code with
arch/x86/events/rapl.c, but there are still quite a few differences,
including:
1. dynamic PMU probing
2. using intel_rapl wrappers to get the energy units and read the
energy counters
etc.

thanks,
rui
> it would be nice to be able to see that
> in the patches.  Otherwise, some subtle differences may be missed.