2013-10-10 14:50:22

by Stephane Eranian

Subject: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event, are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, and Haswell. The server SKUs add a 3rd counter.

The following events are available and exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on socket
- rapl-energy-pkg: power consumption of all cores + LLC cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in units
of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
the counts by multiplying them by 0.23 and dividing by 10^9 to
obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count C to Watts: W = C * 0.23 / (1e9 * time),
where time is the duration of the measurement in seconds.
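
As a rough illustration of the conversion described above, here is a small
user-space sketch. The helper names (rapl_count_to_joules,
rapl_count_to_watts) are made up for this example and are not part of perf;
the only thing taken from the text is the 2^-32 Joule unit.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: raw perf count (unit 2^-32 J) -> Joules. */
static double rapl_count_to_joules(uint64_t count)
{
	/* ldexp(x, -32) computes x * 2^-32, avoiding the rounded 0.23 constant */
	return ldexp((double)count, -32);
}

/* Hypothetical helper: average power over an interval of `seconds`. */
static double rapl_count_to_watts(uint64_t count, double seconds)
{
	return rapl_count_to_joules(count) / seconds;
}
```

For instance, a count of 55539138560 collected over roughly one second works
out to about 12.93 Joules, i.e. about 12.93 W on average.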

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

The PMU exports a cpumask in /sys/devices/rapl/cpumask. It
is used by perf to ensure only one instance of each RAPL event
is measured per processor socket. CPU hotplug is also supported.
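
To give an idea of how a tool might consume the cpumask file, here is a
sketch of a parser for a CPU list string such as "0,8" or "0-3,8". The
function name parse_cpulist is hypothetical, and the exact contents of the
file on a given machine are an assumption; this is not how perf itself
parses it.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: parse a cpulist string such as "0,8" or "0-3,8"
 * into cpus[]. Returns the number of CPUs stored, or -1 on a malformed
 * list or when more than `max` entries would be needed.
 */
static int parse_cpulist(const char *s, int *cpus, int max)
{
	int n = 0;

	while (*s && *s != '\n') {
		char *end;
		long lo = strtol(s, &end, 10);
		long hi = lo;

		if (end == s || lo < 0)
			return -1;
		if (*end == '-') {		/* range "lo-hi" */
			s = end + 1;
			hi = strtol(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
		}
		for (long c = lo; c <= hi; c++) {
			if (n >= max)
				return -1;
			cpus[n++] = (int)c;
		}
		if (*end == ',')
			end++;
		else if (*end && *end != '\n')
			return -1;
		s = end;
	}
	return n;
}
```

A tool would then open one system-wide event per CPU returned, which on a
2-socket system means two events per RAPL counter.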

The third patch adds a hrtimer to poll the counters given that
they do not interrupt on overflow. Hardware counters are 32 bits
wide.
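
The reason polling is needed can be seen from how a delta between two raw
readings is computed. The sketch below uses a mask rather than the
shift-left/shift-right pair found in the patch, and the names CNTR_WIDTH and
counter_delta are made up for illustration.

```c
#include <stdint.h>

#define CNTR_WIDTH 32	/* hardware counter width in bits, per the text above */

/*
 * Difference between two raw readings of a CNTR_WIDTH-bit counter.
 * The result is correct only if the counter wrapped at most once between
 * the two readings, hence the need to read it often enough.
 */
static uint64_t counter_delta(uint64_t prev, uint64_t now)
{
	const uint64_t mask = (1ULL << CNTR_WIDTH) - 1;

	return (now - prev) & mask;
}
```

If the counter wraps twice between two reads, the second wrap is silently
lost, which is what the timer is there to prevent.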

In v2, we add the locking necessary to protect the rapl_pmu
struct. We also add a description at the top of the file.
We check that the processor is an Intel one. We improved the data
layout of the rapl_pmu struct. We also lifted the restriction
on the number of instances of RAPL counters that can be active
at the same time. RAPL counters are free running, so it ought to
be possible to measure events as many times as necessary in parallel
via multiple tools. There is never multiplexing among RAPL events.

Supported CPUs: SandyBridge, IvyBridge, Haswell.

$ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
time counts events
1.000345931 772 278 493 rapl/rapl-energy-cores/
1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
2.000836387 771 751 936 rapl/rapl-energy-cores/
2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Stephane Eranian (3):
perf: add active_entry list head to struct perf_event
perf,x86: add Intel RAPL PMU support
perf,x86: add RAPL hrtimer support

arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 688 +++++++++++++++++++++++++++
include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +
4 files changed, 691 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

--
1.7.9.5


2013-10-10 14:50:28

by Stephane Eranian

Subject: [PATCH v2 1/3] perf: add active_entry list head to struct perf_event

This patch adds a new field to struct perf_event.
It is intended to be used to chain events which are
active (enabled). It helps in the hardware layer
for PMUs which do not have actual counter restrictions, i.e.,
free running read-only counters. Active events are chained
as opposed to being tracked via the counter they use.
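
To illustrate the chaining idea outside the kernel, here is a user-space
sketch of an intrusive list where the link is embedded in the event, in the
spirit of struct list_head. All names below (list_node, fake_event,
count_active, first_active_id) are hypothetical stand-ins, not kernel APIs.

```c
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail_node(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

/* Hypothetical event type: the link lives inside the event itself. */
struct fake_event {
	int id;
	struct list_node active_entry;
};

/* Recover the event from a pointer to its embedded link. */
#define event_of(ptr) \
	((struct fake_event *)((char *)(ptr) - offsetof(struct fake_event, active_entry)))

/* Count the events currently chained on the active list. */
static int count_active(struct list_node *head)
{
	int n = 0;

	for (struct list_node *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Id of the oldest active event, or -1 if the list is empty. */
static int first_active_id(struct list_node *head)
{
	return head->next == head ? -1 : event_of(head->next)->id;
}
```

Since no counter slot is consumed per event, any number of events can sit on
the list at once, which suits free running counters.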

Signed-off-by: Stephane Eranian <[email protected]>
---
include/linux/perf_event.h | 1 +
kernel/events/core.c | 1 +
2 files changed, 2 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2e069d1..a376384 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -435,6 +435,7 @@ struct perf_event {
struct perf_cgroup *cgrp; /* cgroup event is attach to */
int cgrp_defer_enabled;
#endif
+ struct list_head active_entry;

#endif /* CONFIG_PERF_EVENTS */
};
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c716385..b1dbf79 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6629,6 +6629,7 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
INIT_LIST_HEAD(&event->event_entry);
INIT_LIST_HEAD(&event->sibling_list);
INIT_LIST_HEAD(&event->rb_entry);
+ INIT_LIST_HEAD(&event->active_entry);

init_waitqueue_head(&event->waitq);
init_irq_work(&event->pending, perf_pending_event);
--
1.7.9.5

2013-10-10 14:50:32

by Stephane Eranian

Subject: [PATCH v2 2/3] perf,x86: add Intel RAPL PMU support

This patch adds a new uncore PMU to expose the Intel
RAPL energy consumption counters. Up to 3 counters,
each counting a particular RAPL event, are exposed.

The RAPL counters are available on Intel SandyBridge,
IvyBridge, and Haswell. The server SKUs add a 3rd counter.

The following events are available and exposed in sysfs:
- rapl-energy-cores: power consumption of all cores on socket
- rapl-energy-pkg: power consumption of all cores + LLC cache
- rapl-energy-dram: power consumption of DRAM

The RAPL PMU is uncore by nature and is implemented such
that it only works in system-wide mode. Measuring only
one CPU per socket is sufficient. The /sys/devices/rapl/cpumask
is exported and can be used by tools to figure out which CPU
to monitor by default. For instance, on a 2-socket system, 2 CPUs
(one on each socket) will be shown.

The counters all count in the same unit. The perf_events API
exposes all RAPL counters as 64-bit integers counting in units
of 1/2^32 Joules (or 0.23 nJ). User level tools must convert
the counts by multiplying them by 0.23 and dividing by 10^9 to
obtain Joules. The reason for this is that the kernel avoids
doing floating point math whenever possible because it is
expensive (user floating-point state must be saved). The method
used avoids kernel floating-point and minimizes the loss of
precision (bits). Thanks to PeterZ for suggesting this approach.

To convert the raw count to Watts:
W = C * 0.23 / (1e9 * time)
or ldexp(C, -32) / time.

RAPL PMU is a new standalone PMU which registers with the
perf_event core subsystem. The PMU type (attr->type) is
dynamically allocated and is available from /sys/devices/rapl/type.

Sampling is not supported by the RAPL PMU. There is no
privilege level filtering either.

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/Makefile | 2 +-
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 623 +++++++++++++++++++++++++++
2 files changed, 624 insertions(+), 1 deletion(-)
create mode 100644 arch/x86/kernel/cpu/perf_event_intel_rapl.c

diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index 47b56a7..6359506 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -36,7 +36,7 @@ obj-$(CONFIG_CPU_SUP_AMD) += perf_event_amd_iommu.o
endif
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_p6.o perf_event_knc.o perf_event_p4.o
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_lbr.o perf_event_intel_ds.o perf_event_intel.o
-obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o
+obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_uncore.o perf_event_intel_rapl.o
endif


diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
new file mode 100644
index 0000000..abaaf4f
--- /dev/null
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -0,0 +1,623 @@
+/*
+ * perf_event_intel_rapl.c: support Intel RAPL energy consumption counters
+ * Copyright (C) 2013 Google, Inc., Stephane Eranian
+ *
+ * Intel RAPL interface is specified in the IA-32 Manual Vol3b
+ * section 14.7.1 (September 2013)
+ *
+ * RAPL provides more controls than just reporting energy consumption
+ * however here we only expose the 3 energy consumption free running
+ * counters (pp0, pkg, dram).
+ *
+ * Each of those counters increments in a power unit defined by the
+ * RAPL_POWER_UNIT MSR. On SandyBridge, this unit is 1/(2^16) Joules
+ * but it can vary.
+ *
+ * Counter to rapl events mappings:
+ *
+ * pp0 counter: consumption of all physical cores (power plane 0)
+ * event: rapl_energy_cores
+ * perf code: 0x1
+ *
+ * pkg counter: consumption of the whole processor package
+ * event: rapl_energy_pkg
+ * perf code: 0x2
+ *
+ * dram counter: consumption of the dram domain (servers only)
+ * event: rapl_energy_dram
+ * perf code: 0x3
+ *
+ * We manage those counters as free running (read-only). They may be
+ * used simultaneously by other tools, such as turbostat.
+ *
+ * The events only support system-wide mode counting. There is no
+ * sampling support because it does not make sense and is not
+ * supported by the RAPL hardware.
+ *
+ * Because we want to avoid floating-point operations in the kernel,
+ * the events are all reported in fixed point arithmetic (32.32).
+ * Tools must adjust the counts to convert them to Watts using
+ * the duration of the measurement. Tools may use a function such as
+ * ldexp(raw_count, -32);
+ */
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/perf_event.h>
+#include <asm/cpu_device_id.h>
+#include "perf_event.h"
+
+/*
+ * RAPL energy status counters
+ */
+#define RAPL_IDX_PP0_NRG_STAT 0 /* all cores */
+#define INTEL_RAPL_PP0 0x1 /* pseudo-encoding */
+#define RAPL_IDX_PKG_NRG_STAT 1 /* entire package */
+#define INTEL_RAPL_PKG 0x2 /* pseudo-encoding */
+#define RAPL_IDX_RAM_NRG_STAT 2 /* DRAM */
+#define INTEL_RAPL_RAM 0x3 /* pseudo-encoding */
+
+/* Clients have PP0, PKG */
+#define RAPL_IDX_CLN (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT)
+
+/* Servers have PP0, PKG, RAM */
+#define RAPL_IDX_SRV (1<<RAPL_IDX_PP0_NRG_STAT|\
+ 1<<RAPL_IDX_PKG_NRG_STAT|\
+ 1<<RAPL_IDX_RAM_NRG_STAT)
+
+/*
+ * event code: LSB 8 bits, passed in attr->config
+ * any other bit is reserved
+ */
+#define RAPL_EVENT_MASK 0xFFULL
+
+#define DEFINE_RAPL_FORMAT_ATTR(_var, _name, _format) \
+static ssize_t __rapl_##_var##_show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *page) \
+{ \
+ BUILD_BUG_ON(sizeof(_format) >= PAGE_SIZE); \
+ return sprintf(page, _format "\n"); \
+} \
+static struct kobj_attribute format_attr_##_var = \
+ __ATTR(_name, 0444, __rapl_##_var##_show, NULL)
+
+#define RAPL_EVENT_DESC(_name, _config) \
+{ \
+ .attr = __ATTR(_name, 0444, rapl_event_show, NULL), \
+ .config = _config, \
+}
+
+#define RAPL_CNTR_WIDTH 32 /* 32-bit rapl counters */
+
+struct rapl_pmu {
+ spinlock_t lock;
+ atomic_t refcnt;
+ int hw_unit; /* 1/2^hw_unit Joule */
+ int phys_id;
+ int n_active; /* number of active events */
+ struct list_head active_list;
+};
+
+static struct pmu rapl_pmu_class;
+static cpumask_t rapl_cpu_mask;
+static int rapl_cntr_mask;
+
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu);
+static DEFINE_PER_CPU(struct rapl_pmu *, rapl_pmu_kfree);
+
+static DEFINE_SPINLOCK(rapl_hotplug_lock);
+
+static inline u64 rapl_read_counter(struct perf_event *event)
+{
+ u64 raw;
+ rdmsrl(event->hw.event_base, raw);
+ return raw;
+}
+
+static inline u64 rapl_scale(u64 v)
+{
+ /*
+ * scale delta to smallest unit (1/2^32)
+ * users must then scale back: count * 1/(1e9*2^32) to get Joules
+ * or use ldexp(count, -32).
+ * Watts = Joules/Time delta
+ */
+ return v << (32 - __get_cpu_var(rapl_pmu)->hw_unit);
+}
+
+static u64 rapl_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev_raw_count, new_raw_count;
+ s64 delta, sdelta;
+ int shift = RAPL_CNTR_WIDTH;
+
+again:
+ prev_raw_count = local64_read(&hwc->prev_count);
+ rdmsrl(event->hw.event_base, new_raw_count);
+
+ if (local64_cmpxchg(&hwc->prev_count, prev_raw_count,
+ new_raw_count) != prev_raw_count) {
+ cpu_relax();
+ goto again;
+ }
+
+ /*
+ * Now we have the new raw value and have updated the prev
+ * timestamp already. We can now calculate the elapsed delta
+ * (event-)time and add that to the generic event.
+ *
+ * Careful, not all hw sign-extends above the physical width
+ * of the count.
+ */
+ delta = (new_raw_count << shift) - (prev_raw_count << shift);
+ delta >>= shift;
+
+ sdelta = rapl_scale(delta);
+
+ local64_add(sdelta, &event->count);
+
+ return new_raw_count;
+}
+
+static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
+ struct perf_event *event)
+{
+ if (WARN_ON_ONCE(!(event->hw.state & PERF_HES_STOPPED)))
+ return;
+
+ event->hw.state = 0;
+
+ list_add_tail(&event->active_entry, &pmu->active_list);
+
+ local64_set(&event->hw.prev_count, rapl_read_counter(event));
+
+ pmu->n_active++;
+}
+
+static void rapl_pmu_event_start(struct perf_event *event, int mode)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+ __rapl_pmu_event_start(pmu, event);
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static void rapl_pmu_event_stop(struct perf_event *event, int mode)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+
+ /* mark event as deactivated and stopped */
+ if (!(hwc->state & PERF_HES_STOPPED)) {
+ WARN_ON_ONCE(pmu->n_active <= 0);
+ pmu->n_active--;
+
+ list_del(&event->active_entry);
+
+ WARN_ON_ONCE(hwc->state & PERF_HES_STOPPED);
+ hwc->state |= PERF_HES_STOPPED;
+ }
+
+ /* check if update of sw counter is necessary */
+ if ((mode & PERF_EF_UPDATE) && !(hwc->state & PERF_HES_UPTODATE)) {
+ /*
+ * Drain the remaining delta count out of a event
+ * that we are disabling:
+ */
+ rapl_event_update(event);
+ hwc->state |= PERF_HES_UPTODATE;
+ }
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+}
+
+static int rapl_pmu_event_add(struct perf_event *event, int mode)
+{
+ struct rapl_pmu *pmu = __get_cpu_var(rapl_pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long flags;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+
+ hwc->state = PERF_HES_UPTODATE | PERF_HES_STOPPED;
+
+ if (mode & PERF_EF_START)
+ __rapl_pmu_event_start(pmu, event);
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+
+ return 0;
+}
+
+static void rapl_pmu_event_del(struct perf_event *event, int flags)
+{
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+}
+
+static int rapl_pmu_event_init(struct perf_event *event)
+{
+ u64 cfg = event->attr.config & RAPL_EVENT_MASK;
+ int bit, msr, ret = 0;
+
+ /* only look at RAPL events */
+ if (event->attr.type != rapl_pmu_class.type)
+ return -ENOENT;
+
+ /* check only supported bits are set */
+ if (event->attr.config & ~RAPL_EVENT_MASK)
+ return -EINVAL;
+
+ /*
+ * check event is known (determines counter)
+ */
+ switch (cfg) {
+ case INTEL_RAPL_PP0:
+ bit = RAPL_IDX_PP0_NRG_STAT;
+ msr = MSR_PP0_ENERGY_STATUS;
+ break;
+ case INTEL_RAPL_PKG:
+ bit = RAPL_IDX_PKG_NRG_STAT;
+ msr = MSR_PKG_ENERGY_STATUS;
+ break;
+ case INTEL_RAPL_RAM:
+ bit = RAPL_IDX_RAM_NRG_STAT;
+ msr = MSR_DRAM_ENERGY_STATUS;
+ break;
+ default:
+ return -EINVAL;
+ }
+ /* check event supported */
+ if (!(rapl_cntr_mask & (1 << bit)))
+ return -EINVAL;
+
+ /* unsupported modes and filters */
+ if (event->attr.exclude_user ||
+ event->attr.exclude_kernel ||
+ event->attr.exclude_hv ||
+ event->attr.exclude_idle ||
+ event->attr.exclude_host ||
+ event->attr.exclude_guest ||
+ event->attr.sample_period) /* no sampling */
+ return -EINVAL;
+
+ /* must be done before validate_group */
+ event->hw.event_base = msr;
+ event->hw.config = cfg;
+ event->hw.idx = bit;
+
+ return ret;
+}
+
+static void rapl_pmu_event_read(struct perf_event *event)
+{
+ rapl_event_update(event);
+}
+
+static ssize_t rapl_get_attr_cpumask(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ int n = cpulist_scnprintf(buf, PAGE_SIZE - 2, &rapl_cpu_mask);
+
+ buf[n++] = '\n';
+ buf[n] = '\0';
+ return n;
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, rapl_get_attr_cpumask, NULL);
+
+static struct attribute *rapl_pmu_attrs[] = {
+ &dev_attr_cpumask.attr,
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_attr_group = {
+ .attrs = rapl_pmu_attrs,
+};
+
+EVENT_ATTR_STR(rapl-energy-cores, rapl_pp0, "event=0x01");
+EVENT_ATTR_STR(rapl-energy-pkg , rapl_pkg, "event=0x02");
+EVENT_ATTR_STR(rapl-energy-ram , rapl_ram, "event=0x03");
+
+static struct attribute *rapl_events_srv_attr[] = {
+ EVENT_PTR(rapl_pp0),
+ EVENT_PTR(rapl_pkg),
+ EVENT_PTR(rapl_ram),
+ NULL,
+};
+
+static struct attribute *rapl_events_cln_attr[] = {
+ EVENT_PTR(rapl_pp0),
+ EVENT_PTR(rapl_pkg),
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_events_group = {
+ .name = "events",
+ .attrs = NULL, /* patched at runtime */
+};
+
+DEFINE_RAPL_FORMAT_ATTR(event, event, "config:0-7");
+static struct attribute *rapl_formats_attr[] = {
+ &format_attr_event.attr,
+ NULL,
+};
+
+static struct attribute_group rapl_pmu_format_group = {
+ .name = "format",
+ .attrs = rapl_formats_attr,
+};
+
+const struct attribute_group *rapl_attr_groups[] = {
+ &rapl_pmu_attr_group,
+ &rapl_pmu_format_group,
+ &rapl_pmu_events_group,
+ NULL,
+};
+
+static struct pmu rapl_pmu_class = {
+ .attr_groups = rapl_attr_groups,
+ .task_ctx_nr = perf_invalid_context, /* system-wide only */
+ .event_init = rapl_pmu_event_init,
+ .add = rapl_pmu_event_add, /* must have */
+ .del = rapl_pmu_event_del, /* must have */
+ .start = rapl_pmu_event_start,
+ .stop = rapl_pmu_event_stop,
+ .read = rapl_pmu_event_read,
+};
+
+static void rapl_exit_cpu(int cpu)
+{
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ /* if CPU not in RAPL mask, nothing to do */
+ if (!cpumask_test_and_clear_cpu(cpu, &rapl_cpu_mask))
+ return;
+
+ /* find a new cpu on same package */
+ for_each_online_cpu(i) {
+ if (i == cpu || i == 0)
+ continue;
+ if (phys_id == topology_physical_package_id(i)) {
+ cpumask_set_cpu(i, &rapl_cpu_mask);
+ break;
+ }
+ }
+
+ WARN_ON(cpumask_empty(&rapl_cpu_mask));
+}
+
+static void rapl_init_cpu(int cpu)
+{
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ spin_lock(&rapl_hotplug_lock);
+
+ /* check if phys_id is already covered */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (i == 0)
+ continue;
+ if (phys_id == topology_physical_package_id(i))
+ return;
+ }
+ /* was not found, so add it */
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+
+ spin_unlock(&rapl_hotplug_lock);
+}
+
+static int rapl_cpu_prepare(int cpu)
+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ int phys_id = topology_physical_package_id(cpu);
+
+ if (pmu)
+ return 0;
+
+ if (phys_id < 0)
+ return -1;
+
+ pmu = kzalloc_node(sizeof(*pmu), GFP_KERNEL, cpu_to_node(cpu));
+ if (!pmu)
+ return -1;
+
+ spin_lock_init(&pmu->lock);
+ atomic_set(&pmu->refcnt, 1);
+
+ INIT_LIST_HEAD(&pmu->active_list);
+
+ pmu->phys_id = phys_id;
+ /*
+ * grab power unit as: 1/2^unit Joules
+ *
+ * we cache in local PMU instance
+ */
+ rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
+ pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;
+
+ /* set RAPL pmu for this cpu for now */
+ per_cpu(rapl_pmu_kfree, cpu) = NULL;
+ per_cpu(rapl_pmu, cpu) = pmu;
+
+ return 0;
+}
+
+static int rapl_cpu_starting(int cpu)
+{
+ struct rapl_pmu *pmu2;
+ struct rapl_pmu *pmu1 = per_cpu(rapl_pmu, cpu);
+ int i, phys_id = topology_physical_package_id(cpu);
+
+ if (pmu1)
+ return 0;
+
+ spin_lock(&rapl_hotplug_lock);
+
+ for_each_online_cpu(i) {
+ pmu2 = per_cpu(rapl_pmu, i);
+
+ if (!pmu2 || i == cpu)
+ continue;
+
+ if (pmu2->phys_id == phys_id) {
+ per_cpu(rapl_pmu, cpu) = pmu2;
+ per_cpu(rapl_pmu_kfree, cpu) = pmu1;
+ atomic_inc(&pmu2->refcnt);
+ break;
+ }
+ }
+ spin_unlock(&rapl_hotplug_lock);
+ return 0;
+}
+
+static int rapl_cpu_dying(int cpu)
+{
+ struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
+ struct perf_event *event, *tmp;
+
+ if (!pmu)
+ return 0;
+
+ spin_lock(&rapl_hotplug_lock);
+
+ /*
+ * stop all syswide RAPL events on that CPU
+ * as a consequence also stops the hrtimer
+ */
+ list_for_each_entry_safe(event, tmp, &pmu->active_list, active_entry) {
+ rapl_pmu_event_stop(event, PERF_EF_UPDATE);
+ }
+
+ per_cpu(rapl_pmu, cpu) = NULL;
+
+ if (atomic_dec_and_test(&pmu->refcnt))
+ kfree(pmu);
+
+ spin_unlock(&rapl_hotplug_lock);
+ return 0;
+}
+
+static int rapl_cpu_notifier(struct notifier_block *self,
+ unsigned long action, void *hcpu)
+{
+ unsigned int cpu = (long)hcpu;
+
+ /* allocate/free data structure for uncore box */
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_UP_PREPARE:
+ rapl_cpu_prepare(cpu);
+ break;
+ case CPU_STARTING:
+ rapl_cpu_starting(cpu);
+ break;
+ case CPU_UP_CANCELED:
+ case CPU_DYING:
+ rapl_cpu_dying(cpu);
+ break;
+ case CPU_ONLINE:
+ kfree(per_cpu(rapl_pmu_kfree, cpu));
+ per_cpu(rapl_pmu_kfree, cpu) = NULL;
+ break;
+ case CPU_DEAD:
+ per_cpu(rapl_pmu, cpu) = NULL;
+ break;
+ default:
+ break;
+ }
+
+ /* select the cpu that collects uncore events */
+ switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DOWN_FAILED:
+ case CPU_STARTING:
+ rapl_init_cpu(cpu);
+ break;
+ case CPU_DOWN_PREPARE:
+ rapl_exit_cpu(cpu);
+ break;
+ default:
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static const struct x86_cpu_id rapl_cpu_match[] = {
+ [0] = { .vendor = X86_VENDOR_INTEL, .family = 6 },
+ [1] = {},
+};
+static int __init rapl_pmu_init(void)
+{
+ struct rapl_pmu *pmu;
+ int i, cpu, ret;
+
+ /*
+ * check for Intel processor family 6
+ */
+ if (!x86_match_cpu(rapl_cpu_match))
+ return 0;
+
+ /* check supported CPU */
+ switch (boot_cpu_data.x86_model) {
+ case 42: /* Sandy Bridge */
+ case 58: /* Ivy Bridge */
+ case 60: /* Haswell */
+ rapl_cntr_mask = RAPL_IDX_CLN;
+ rapl_pmu_events_group.attrs = rapl_events_cln_attr;
+ break;
+ case 45: /* Sandy Bridge-EP */
+ case 62: /* IvyTown */
+ rapl_cntr_mask = RAPL_IDX_SRV;
+ rapl_pmu_events_group.attrs = rapl_events_srv_attr;
+ break;
+
+ default:
+ /* unsupported */
+ return 0;
+ }
+ get_online_cpus();
+
+ for_each_online_cpu(cpu) {
+ int phys_id = topology_physical_package_id(cpu);
+
+ /* save on prepare by only calling prepare for new phys_id */
+ for_each_cpu(i, &rapl_cpu_mask) {
+ if (phys_id == topology_physical_package_id(i)) {
+ phys_id = -1;
+ break;
+ }
+ }
+ if (phys_id < 0) {
+ pmu = per_cpu(rapl_pmu, i);
+ if (pmu) {
+ per_cpu(rapl_pmu, cpu) = pmu;
+ atomic_inc(&pmu->refcnt);
+ }
+ continue;
+ }
+ rapl_cpu_prepare(cpu);
+ cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ }
+
+ perf_cpu_notifier(rapl_cpu_notifier);
+
+ ret = perf_pmu_register(&rapl_pmu_class, "rapl", -1);
+ WARN_ON(ret);
+
+ pmu = __get_cpu_var(rapl_pmu);
+ pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
+ " API unit is 2^-32 Joules,"
+ " %d fixed counters\n",
+ pmu->hw_unit,
+ hweight32(rapl_cntr_mask));
+
+ put_online_cpus();
+
+ return 0;
+}
+device_initcall(rapl_pmu_init);
--
1.7.9.5

2013-10-10 14:50:35

by Stephane Eranian

Subject: [PATCH v2 3/3] perf,x86: add RAPL hrtimer support

The RAPL PMU counters do not interrupt on overflow.
Therefore, the kernel needs to poll the counters
to avoid missing an overflow. This patch adds
the hrtimer code to do this.

The timer interval is calculated at boot time
based on the power unit used by the HW.
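
As a worked example of that calculation, the stand-alone sketch below mirrors
the arithmetic used in the patch: a 200 W reference draw, a 32-bit counter,
and polling at half the wrap time. The function name rapl_poll_interval_ms is
made up for illustration.

```c
#include <stdint.h>

/*
 * Poll interval in milliseconds for a 32-bit counter that increments in
 * units of 1/2^hw_unit Joules, assuming a 200 W reference draw and
 * polling at half the time it takes the counter to wrap.
 */
static uint64_t rapl_poll_interval_ms(int hw_unit)
{
	if (hw_unit < 32)
		return 1000 * (1ULL << (32 - hw_unit - 1)) / (2 * 100);
	return 2;
}
```

With hw_unit = 16, the counter wraps after 2^32 / 2^16 = 65536 Joules, which
takes about 328 seconds at 200 W; the function returns 163840 ms, i.e. half
of that.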

Signed-off-by: Stephane Eranian <[email protected]>
---
arch/x86/kernel/cpu/perf_event_intel_rapl.c | 75 +++++++++++++++++++++++++--
1 file changed, 70 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/perf_event_intel_rapl.c b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
index abaaf4f..c5a6f51 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_rapl.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_rapl.c
@@ -92,11 +92,13 @@ static struct kobj_attribute format_attr_##_var = \

struct rapl_pmu {
spinlock_t lock;
- atomic_t refcnt;
int hw_unit; /* 1/2^hw_unit Joule */
- int phys_id;
- int n_active; /* number of active events */
+ struct hrtimer hrtimer;
struct list_head active_list;
+ ktime_t timer_interval; /* in ktime_t unit */
+ int n_active; /* number of active events */
+ int phys_id;
+ atomic_t refcnt;
};

static struct pmu rapl_pmu_class;
@@ -161,6 +163,47 @@ static u64 rapl_event_update(struct perf_event *event)
return new_raw_count;
}

+static void rapl_start_hrtimer(struct rapl_pmu *pmu)
+{
+ __hrtimer_start_range_ns(&pmu->hrtimer,
+ pmu->timer_interval, 0,
+ HRTIMER_MODE_REL_PINNED, 0);
+}
+
+static void rapl_stop_hrtimer(struct rapl_pmu *pmu)
+{
+ hrtimer_cancel(&pmu->hrtimer);
+}
+
+static enum hrtimer_restart rapl_hrtimer_handle(struct hrtimer *hrtimer)
+{
+ struct rapl_pmu *pmu = container_of(hrtimer, struct rapl_pmu, hrtimer);
+ struct perf_event *event;
+ unsigned long flags;
+
+ if (!pmu->n_active)
+ return HRTIMER_NORESTART;
+
+ spin_lock_irqsave(&pmu->lock, flags);
+
+ list_for_each_entry(event, &pmu->active_list, active_entry) {
+ rapl_event_update(event);
+ }
+
+ spin_unlock_irqrestore(&pmu->lock, flags);
+
+ hrtimer_forward_now(&pmu->hrtimer, pmu->timer_interval);
+
+ return HRTIMER_RESTART;
+}
+
+static void rapl_hrtimer_init(struct rapl_pmu *pmu)
+{
+ hrtimer_init(&pmu->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ pmu->hrtimer.function = rapl_hrtimer_handle;
+}
+
+
static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
struct perf_event *event)
{
@@ -174,6 +217,8 @@ static void __rapl_pmu_event_start(struct rapl_pmu *pmu,
local64_set(&event->hw.prev_count, rapl_read_counter(event));

pmu->n_active++;
+ if (pmu->n_active == 1)
+ rapl_start_hrtimer(pmu);
}

static void rapl_pmu_event_start(struct perf_event *event, int mode)
@@ -198,6 +243,8 @@ static void rapl_pmu_event_stop(struct perf_event *event, int mode)
if (!(hwc->state & PERF_HES_STOPPED)) {
WARN_ON_ONCE(pmu->n_active <= 0);
pmu->n_active--;
+ if (pmu->n_active == 0)
+ rapl_stop_hrtimer(pmu);

list_del(&event->active_entry);

@@ -416,6 +463,7 @@ static int rapl_cpu_prepare(int cpu)
{
struct rapl_pmu *pmu = per_cpu(rapl_pmu, cpu);
int phys_id = topology_physical_package_id(cpu);
+ u64 ms;

if (pmu)
return 0;
@@ -441,6 +489,20 @@ static int rapl_cpu_prepare(int cpu)
rdmsrl(MSR_RAPL_POWER_UNIT, pmu->hw_unit);
pmu->hw_unit = (pmu->hw_unit >> 8) & 0x1FULL;

+ /*
+ * use reference of 200W for scaling the timeout
+ * to avoid missing counter overflows.
+ * 200W = 200 Joules/sec
+ * divide interval by 2 to avoid lockstep (2 * 100)
+ * if hw unit is 32, then we use 2 ms 1/200/2
+ */
+ if (pmu->hw_unit < 32)
+ ms = 1000 * (1ULL << (32 - pmu->hw_unit - 1)) / (2 * 100);
+ else
+ ms = 2;
+
+ pmu->timer_interval = ms_to_ktime(ms);
+
/* set RAPL pmu for this cpu for now */
per_cpu(rapl_pmu_kfree, cpu) = NULL;
per_cpu(rapl_pmu, cpu) = pmu;
@@ -602,6 +664,7 @@ static int __init rapl_pmu_init(void)
}
rapl_cpu_prepare(cpu);
cpumask_set_cpu(cpu, &rapl_cpu_mask);
+ rapl_hrtimer_init(per_cpu(rapl_pmu, cpu));
}

perf_cpu_notifier(rapl_cpu_notifier);
@@ -612,9 +675,11 @@ static int __init rapl_pmu_init(void)
pmu = __get_cpu_var(rapl_pmu);
pr_info("RAPL PMU detected, hw unit 2^-%d Joules,"
" API unit is 2^-32 Joules,"
- " %d fixed counters\n",
+ " %d fixed counters"
+ " %llu ms ovfl timer\n",
pmu->hw_unit,
- hweight32(rapl_cntr_mask));
+ hweight32(rapl_cntr_mask),
+ ktime_to_ms(pmu->timer_interval));

put_online_cpus();

--
1.7.9.5

2013-10-10 17:43:23

by Andi Kleen

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support


Looks all good to me now.

Reviewed-by: Andi Kleen <[email protected]>

-Andi

--
[email protected] -- Speaking for myself only

2013-10-10 18:00:57

by Borislav Petkov

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
> $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
> time counts events
> 1.000345931 772 278 493 rapl/rapl-energy-cores/
> 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
> 2.000836387 771 751 936 rapl/rapl-energy-cores/
> 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Hmm, so I'm looking at builtin-stat.c::print_interval() and since
it gets the perf_evsel counters and you can deduce the counter name
from it, you probably could match the rapl counters and do the Watts
conversion above as a special case.

I dunno, it is much better than having some naked numbers for which
people have to go stare at the sources + CPU vendor docs as to what they
actually mean.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-10-16 12:46:33

by Ingo Molnar

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support


So, the RAPL patch-set clearly needs more work.

* Borislav Petkov <[email protected]> wrote:

> On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
> > $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
> > time counts events
> > 1.000345931 772 278 493 rapl/rapl-energy-cores/
> > 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
> > 2.000836387 771 751 936 rapl/rapl-energy-cores/
> > 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/

Why is there the rapl/rapl duplication in the event name? It should be
rapl/energy-cores, rapl/energy-pkg, etc.

I'm also not sure about the Intel-specific naming. Joules per core and
Joules per socket ought to be pretty generic, even if the initial
implementation is Intel-only. I.e.:

power/energy-core
power/energy-pkg

> Hmm, so I'm looking at builtin-stat.c::print_interval() and since it
> gets the perf_evsel counters and you can deduce the counter name from
> it, you probably could match the rapl counters and do the Watts
> conversion above as a special case.
>
> I dunno, it is much better than having some naked numbers for which
> people have to go stare at the sources + CPU vendor docs as to what they
> actually mean.

So what should happen here is to extend the sysfs attributes that tell us
that it's in 32.32 fixed-point format.

We should also tell user-space that the unit of this counter is 'Joule'.

Then things like:

perf stat -a -e power/* sleep 1

would output, without knowing any RAPL details:

0.20619 Joule power/energy-core
2.42151 Joule power/energy-pkg

or so.

Other platforms offering energy measurement facilities will then name
their counters in the same power/* (or energy/*) namespace, with new names
if they do something fundamentally differently.

Tooling can then generalize along these abstractions, as much as the
hardware allows it.

Thanks,

Ingo

2013-10-16 13:13:57

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <[email protected]> wrote:
>
> So, the RAPL patch-set clearly needs more work.
>
> * Borislav Petkov <[email protected]> wrote:
>
>> On Thu, Oct 10, 2013 at 04:50:05PM +0200, Stephane Eranian wrote:
>> > $ perf stat -a -e rapl/rapl-energy-cores/,rapl/rapl-energy-pkg/ -I 1000 sleep 10
>> > time counts events
>> > 1.000345931 772 278 493 rapl/rapl-energy-cores/
>> > 1.000345931 55 539 138 560 rapl/rapl-energy-pkg/
>> > 2.000836387 771 751 936 rapl/rapl-energy-cores/
>> > 2.000836387 55 326 015 488 rapl/rapl-energy-pkg/
>
> Why is there the rapl/rapl duplication in the event name? It should be
> rapl/energy-cores, rapl/energy-pkg, etc.
>
yeah, I thought about doing that too. I will change the names.

> I'm also not sure about the Intel-specific naming. Joules per core and
> Joules per socket ought to be pretty generic, even if the initial
> implementation is Intel-only. I.e.:
>
Joules per cores (with an s)
Joules per package.
Joules per dram, i.e., all the DRAM attached to a socket (I think).


> power/energy-core
> power/energy-pkg
>
Fine with me. Or joules-cores to make the unit explicit

>> Hmm, so I'm looking at builtin-stat.c::print_interval() and since it
>> gets the perf_evsel counters and you can deduce the counter name from
>> it, you probably could match the rapl counters and do the Watts
>> conversion above as a special case.
>>
>> I dunno, it is much better than having some naked numbers for which
>> people have to go stare at the sources + CPU vendor docs as to what they
>> actually mean.
>
> So what should happen here is to extend the sysfs attributes that tell us
> that it's in 32.32 fixed-point format.
>
We could add that in sysfs, but then I am wondering how would the tool realize
it has to use this file. We'd have to create something generic like a scaling
factor. If the file is there, then use it, if not assume 1x. Is that what you
are thinking about?
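
For reference, the conversion implied by a 32.32 fixed-point count can be sketched as below. This is only an illustration, not perf code: the function names are made up, and it assumes the raw count is in units of 1/2^32 Joules as described in the cover letter. The sample value is the first rapl-energy-cores reading quoted above.

```python
# Minimal sketch: interpret a raw RAPL count as 32.32 fixed-point Joules.
# Function names are illustrative and are not part of perf.

FRACTION_BITS = 32  # the low 32 bits hold the fractional part


def fixed_32_32_to_joules(raw_count: int) -> float:
    """Convert a count in units of 1/2^32 Joules to Joules."""
    return raw_count / (1 << FRACTION_BITS)


def average_watts(raw_count: int, interval_seconds: float) -> float:
    """Average power over an interval: energy divided by elapsed time."""
    return fixed_32_32_to_joules(raw_count) / interval_seconds


if __name__ == "__main__":
    raw = 772_278_493  # counts reported for the first ~1 second interval
    print(f"{fixed_32_32_to_joules(raw):.5f} J")
    print(f"{average_watts(raw, 1.000345931):.5f} W")
```

A generic scaling-factor file would let the tool do the same multiplication without knowing that the factor happens to be 2^-32 here.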


> We should also tell user-space that the unit of this counter is 'Joule'.
>
> Then things like:
>
> perf stat -a -e power/* sleep 1
>
> would output, without knowing any RAPL details:
>
> 0.20619 Joule power/energy-core
> 2.42151 Joule power/energy-pkg
>
Not sure there is already some support for this in perf stat. Arnaldo?
If not, then we need another sysfs file to export the unit. Another
possibility is for perf stat to recognize the power/* prefix and extract
the unit from the event name. In my example power/joules-cores -> joules.

2013-10-16 17:53:45

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <[email protected]> wrote:
> > We should also tell user-space that the unit of this counter is 'Joule'.
> >
> > Then things like:
> >
> > perf stat -a -e power/* sleep 1
> >
> > would output, without knowing any RAPL details:
> >
> > 0.20619 Joule power/energy-core
> > 2.42151 Joule power/energy-pkg
> >
> Not sure there is already some support for this in perf stat. Arnaldo?

Nope, there is not, we would have to have some table somewhere with
"event-regexp: unit-string"

> If not that we need another sysfs file to export the unit. Another
> possibility is for perf stat to recognize the power/* and extract the
> unit from the event name. In my example power/joules-cores -> joules.

I.e. you would be encoding the counter unit as the suffix. You might as
well call it "power/cores.joules" and use the dot as the separator for
the unit, but that would just be a compact form of encoding the
counter->unit table.
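
To make the two options concrete, here is a rough sketch of both: a tool-side table of "event-regexp: unit-string" entries, and a unit carried as a dot suffix on the event name. Everything in it is hypothetical (the table contents, the function names, the naming scheme); nothing like this exists in perf as of this thread.

```python
import re

# Option 1: an explicit table mapping event-name patterns to unit strings.
# The single entry below is an example only.
UNIT_TABLE = [
    (re.compile(r"^power/energy-"), "Joules"),
]


def unit_from_table(event_name):
    """Return the unit of the first matching pattern, or None."""
    for pattern, unit in UNIT_TABLE:
        if pattern.search(event_name):
            return unit
    return None


# Option 2: the unit is encoded in the event name after the last dot,
# e.g. "power/cores.joules" -> ("power/cores", "joules").
def unit_from_suffix(event_name):
    """Return (base_name, unit); unit is None when there is no dot suffix."""
    pmu, slash, event = event_name.rpartition("/")
    base, dot, unit = event.rpartition(".")
    if not dot:
        return event_name, None
    return pmu + slash + base, unit


if __name__ == "__main__":
    print(unit_from_table("power/energy-pkg"))
    print(unit_from_suffix("power/cores.joules"))
    print(unit_from_suffix("ref-cycles"))
```

Either way the information is the same counter->unit mapping; the suffix form just moves it from a table in the tool into the event name.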

- Arnaldo

2013-10-16 18:14:08

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

On Wed, Oct 16, 2013 at 7:53 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
>> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <[email protected]> wrote:
>> > We should also tell user-space that the unit of this counter is 'Joule'.
>> >
>> > Then things like:
>> >
>> > perf stat -a -e power/* sleep 1
>> >
>> > would output, without knowing any RAPL details:
>> >
>> > 0.20619 Joule power/energy-core
>> > 2.42151 Joule power/energy-pkg
>> >
>> Not sure there is already some support for this in perf stat. Arnaldo?
>
> Nope, there is not, we would have to have some table somewhere with
> "event-regexp: unit-string"
>
>> If not that we need another sysfs file to export the unit. Another
>> possibility is for perf stat to recognize the power/* and extract the
>> unit from the event name. In my example power/joules-cores -> joules.
>
> I.e. you would be encoding the counter unit as the suffix, might as well
> call it "power/cores.joules" and use the dot as the separator for the
> unit, but would be just a compact form to encode the counter->unit
> table.
>
May be easier to add a sysfs entry with the unit to display.

2013-10-17 08:14:28

by Ingo Molnar

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support


* Stephane Eranian <[email protected]> wrote:

> On Wed, Oct 16, 2013 at 7:53 PM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Em Wed, Oct 16, 2013 at 03:13:54PM +0200, Stephane Eranian escreveu:
> >> On Wed, Oct 16, 2013 at 2:46 PM, Ingo Molnar <[email protected]> wrote:
> >> > We should also tell user-space that the unit of this counter is 'Joule'.
> >> >
> >> > Then things like:
> >> >
> >> > perf stat -a -e power/* sleep 1
> >> >
> >> > would output, without knowing any RAPL details:
> >> >
> >> > 0.20619 Joule power/energy-core
> >> > 2.42151 Joule power/energy-pkg
> >> >
> >> Not sure there is already some support for this in perf stat. Arnaldo?
> >
> > Nope, there is not, we would have to have some table somewhere with
> > "event-regexp: unit-string"
> >
> >> If not that we need another sysfs file to export the unit. Another
> >> possibility is for perf stat to recognize the power/* and extract the
> >> unit from the event name. In my example power/joules-cores -> joules.
> >
> > I.e. you would be encoding the counter unit as the suffix, might as well
> > call it "power/cores.joules" and use the dot as the separator for the
> > unit, but would be just a compact form to encode the counter->unit
> > table.
>
> May be easier to add a sysfs entry with the unit to display.

Yes - with no entry meaning a raw 'count' or such.

Thanks,

Ingo

2013-10-17 09:07:59

by Peter Zijlstra

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

On Thu, Oct 17, 2013 at 10:14:20AM +0200, Ingo Molnar wrote:
> > > I.e. you would be encoding the counter unit as the suffix, might as well
> > > call it "power/cores.joules" and use the dot as the separator for the
> > > unit, but would be just a compact form to encode the counter->unit
> > > table.
> >
> > May be easier to add a sysfs entry with the unit to display.
>
> Yes - with no entry meaning a raw 'count' or such.

The downside to such a sysfs entry will be the scope. It would either be
pmu wide (unwieldy for many PMUs) or be only per listed event; and we
really don't want exhaustive event lists in the kernel.

2013-10-17 09:12:20

by Borislav Petkov

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

On Thu, Oct 17, 2013 at 11:07:30AM +0200, Peter Zijlstra wrote:
> The downside to such a sysfs entry will be the scope. It would either
> be pmu wide (unwieldy for many PMUs) or be only per listed event; and
> we really don't want exhaustive event lists in the kernel.

So why not teach perf tool to recognize the PMU instead of adding
anything to the kernel?

It seems much easier to me...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-10-17 20:09:51

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Peter,

On Thu, Oct 17, 2013 at 11:07 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Oct 17, 2013 at 10:14:20AM +0200, Ingo Molnar wrote:
>> > > I.e. you would be encoding the counter unit as the suffix, might as well
>> > > call it "power/cores.joules" and use the dot as the separator for the
>> > > unit, but would be just a compact form to encode the counter->unit
>> > > table.
>> >
>> > May be easier to add a sysfs entry with the unit to display.
>>
>> Yes - with no entry meaning a raw 'count' or such.
>
> The downside to such a sysfs entry will be the scope. It would either be
> pmu wide (unwieldy for many PMUs) or be only per listed event; and we
> really don't want exhaustive event lists in the kernel.
>
Why not put it in the events subdir:

/sys/devices/power/events/energy-cores
/sys/devices/power/events/energy-cores.unit
/sys/devices/power/events/energy-cores.scaling
$ cat energy-cores.unit
Joules
$ cat energy-cores.scaling
0.00000000023

Perf could easily look up those files, and if they are not present it will print
the event as it does today. If present, it will print the unit and apply
the scaling factor to the raw count (already scaled for multiplexing).

Borislav, the scaling factor cannot be hardcoded into perf because it
can change from processor to processor.
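
The lookup-with-fallback described here could look roughly like the sketch below. It is not the perf implementation: the function name is invented, the ".unit" and ".scaling" suffixes are simply the ones proposed in this message, and the directory is passed in so the example does not depend on a real /sys tree.

```python
import os


def read_event_metadata(events_dir, event,
                        unit_suffix=".unit", scale_suffix=".scaling"):
    """Return (unit, scale) for an event.

    Falls back to (None, 1.0) when the optional files are absent, which
    corresponds to printing the raw count as is done today.
    """
    def read_text(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except FileNotFoundError:
            return None

    unit = read_text(os.path.join(events_dir, event + unit_suffix))
    scale_text = read_text(os.path.join(events_dir, event + scale_suffix))
    scale = float(scale_text) if scale_text is not None else 1.0
    return unit, scale


def scaled_value(raw_count, scale):
    """Apply the per-event scaling factor to a raw count."""
    return raw_count * scale
```

A caller would do something like `unit, scale = read_event_metadata("/sys/devices/power/events", "energy-cores")` and then print `scaled_value(count, scale)` followed by the unit when one was found.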

2013-10-22 16:47:40

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Hi,

I have updated my RAPL patches to implement the suggested changes.
I will post the patch very soon. The new look and feel is as follows:

# perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
#        time unit        counts events
  1.000264953 Joules        2.09 power/energy-cores/ [100.00%]
  1.000264953 Joules        5.94 power/energy-pkg/
  1.000264953        160,530,320 ref-cycles
  2.000640422 Joules        2.07 power/energy-cores/
  2.000640422 Joules        5.94 power/energy-pkg/
  2.000640422        152,673,056 ref-cycles
  3.000964416 Joules        2.08 power/energy-cores/
  3.000964416 Joules        5.93 power/energy-pkg/
  3.000964416        158,779,184 ref-cycles

# ls -1 /sys/devices/power/events/
energy-cores
energy-cores.scale
energy-cores.unit
energy-pkg
energy-pkg.scale
energy-pkg.unit

# cat /sys/devices/power/events/energy-cores.scale
2.3e-10
# cat /sys/devices/power/events/energy-cores.unit
Joules

Of course, this unit and scaling support is generic and not limited
to the RAPL events. For now, this only works with events exported
by the kernel via sysfs.
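
A side note on the 2.3e-10 value shown above: if the underlying unit on this machine is exactly 1/2^32 Joules, as the cover letter states, then the two-digit rounding costs a little over 1%. The sketch below only does that arithmetic; the constant names are made up, and the exact unit is an assumption since it can differ between processors.

```python
# Compare the rounded scale exported in sysfs with the exact 2^-32 factor.
# Assumes the hardware energy unit is exactly 1/2^32 J on this machine.

EXACT_SCALE = 2.0 ** -32   # Joules per count
ROUNDED_SCALE = 2.3e-10    # value shown in energy-cores.scale


def relative_error(approx: float, exact: float) -> float:
    """Relative difference between an approximation and the exact value."""
    return abs(approx - exact) / exact


def to_joules(raw_count: int, scale: float) -> float:
    """Apply a scale factor to a raw count."""
    return raw_count * scale


if __name__ == "__main__":
    print(f"exact scale   : {EXACT_SCALE:.6e}")
    print(f"rounded scale : {ROUNDED_SCALE:.6e}")
    print(f"relative error: {relative_error(ROUNDED_SCALE, EXACT_SCALE):.4%}")
```

Exporting more significant digits in the scale file would avoid this without changing the interface.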




2013-10-22 22:18:42

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Em Tue, Oct 22, 2013 at 06:47:38PM +0200, Stephane Eranian escreveu:
> I have updated my RAPL patches to implement the suggested changes.
> I will post the patch very soon. The new look and feel is as follows:

> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
> #        time unit        counts events
>   1.000264953 Joules        2.09 power/energy-cores/ [100.00%]
>   1.000264953 Joules        5.94 power/energy-pkg/
>   1.000264953        160,530,320 ref-cycles
>   2.000640422 Joules        2.07 power/energy-cores/
>   2.000640422 Joules        5.94 power/energy-pkg/
>   2.000640422        152,673,056 ref-cycles
>   3.000964416 Joules        2.08 power/energy-cores/
>   3.000964416 Joules        5.93 power/energy-pkg/
>   3.000964416        158,779,184 ref-cycles

What about:

# perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
# time events
1.000264953 2.09 Joules power/energy-cores/
1.000264953 5.94 Joules power/energy-pkg/
1.000264953 160,530,320 ref-cycles
2.000640422 2.07 Joules power/energy-cores/
2.000640422 5.94 Joules power/energy-pkg/
2.000640422 152,673,056 ref-cycles
3.000964416 2.08 Joules power/energy-cores/
3.000964416 5.93 Joules power/energy-pkg/
3.000964416 158,779,184 ref-cycles

?

Or even 2.09J power/energy-cores/?

I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
available.

- Arnaldo



2013-10-23 07:07:50

by Andi Kleen

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
> #        time unit        counts events
>   1.000264953 Joules        2.09 power/energy-cores/ [100.00%]
>   1.000264953 Joules        5.94 power/energy-pkg/
>   1.000264953        160,530,320 ref-cycles
>   2.000640422 Joules        2.07 power/energy-cores/
>   2.000640422 Joules        5.94 power/energy-pkg/
>   2.000640422        152,673,056 ref-cycles
>   3.000964416 Joules        2.08 power/energy-cores/
>   3.000964416 Joules        5.93 power/energy-pkg/
>   3.000964416        158,779,184 ref-cycles

Can you add some column marker that there is no unit (like -) ?

This is just in case someone wants to parse this with a tool. Yes they
should be using -x, but it is still better to be always parseable.

-Andi

2013-10-23 09:24:53

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Andi,

On Wed, Oct 23, 2013 at 9:07 AM, Andi Kleen <[email protected]> wrote:
>> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
>> #        time unit        counts events
>>   1.000264953 Joules        2.09 power/energy-cores/ [100.00%]
>>   1.000264953 Joules        5.94 power/energy-pkg/
>>   1.000264953        160,530,320 ref-cycles
>>   2.000640422 Joules        2.07 power/energy-cores/
>>   2.000640422 Joules        5.94 power/energy-pkg/
>>   2.000640422        152,673,056 ref-cycles
>>   3.000964416 Joules        2.08 power/energy-cores/
>>   3.000964416 Joules        5.93 power/energy-pkg/
>>   3.000964416        158,779,184 ref-cycles
>
> Can you add some column marker that there is no unit (like -) ?
>
> This is just in case someone wants to parse this with a tool. Yes they
> should be using -x, but it is still better to be always parseable.
>
It is parseable, it's just that you get an empty field: ,,
But I can add a "?".
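
To illustrate why an empty field is still parseable, here is a small sketch of a consumer of separator-delimited output. The field order (time, count, unit, event) is assumed purely for the example and is not taken from the actual -x output format; the function name is also made up.

```python
def parse_interval_line(line, sep=","):
    """Parse 'time,count,unit,event' where the unit field may be empty.

    The field layout is an assumption made for this example only.
    """
    # maxsplit=3 keeps any separator inside the event name intact.
    time_s, count, unit, event = line.rstrip("\n").split(sep, 3)
    return {
        "time": float(time_s),
        "count": float(count),
        "unit": unit or None,  # an empty field means "no unit"
        "event": event,
    }


if __name__ == "__main__":
    print(parse_interval_line("1.000264953,2.09,Joules,power/energy-cores/"))
    print(parse_interval_line("1.000264953,160530320,,ref-cycles"))
```

A placeholder only matters for whitespace-separated output, where an empty column silently shifts the remaining fields to the left.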

2013-10-23 09:34:45

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Arnaldo,

On Wed, Oct 23, 2013 at 12:18 AM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Tue, Oct 22, 2013 at 06:47:38PM +0200, Stephane Eranian escreveu:
>> I have updated my RAPL patches to implement the suggested changes.
>> I will post the patch very soon. The new look and feel is as follows:
>
>> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
>> #        time unit        counts events
>>   1.000264953 Joules        2.09 power/energy-cores/ [100.00%]
>>   1.000264953 Joules        5.94 power/energy-pkg/
>>   1.000264953        160,530,320 ref-cycles
>>   2.000640422 Joules        2.07 power/energy-cores/
>>   2.000640422 Joules        5.94 power/energy-pkg/
>>   2.000640422        152,673,056 ref-cycles
>>   3.000964416 Joules        2.08 power/energy-cores/
>>   3.000964416 Joules        5.93 power/energy-pkg/
>>   3.000964416        158,779,184 ref-cycles
>
> What about:
>
> # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
> # time events
> 1.000264953 2.09 Joules power/energy-cores/
> 1.000264953 5.94 Joules power/energy-pkg/
> 1.000264953 160,530,320 ref-cycles
> 2.000640422 2.07 Joules power/energy-cores/
> 2.000640422 5.94 Joules power/energy-pkg/
> 2.000640422 152,673,056 ref-cycles
> 3.000964416 2.08 Joules power/energy-cores/
> 3.000964416 5.93 Joules power/energy-pkg/
> 3.000964416 158,779,184 ref-cycles
>
> ?
>
> Or even 2.09J power/energy-cores/?
>
I can try that.

> I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
> available.
>
I don't have this function in my tree yet (tip.git).


2013-10-23 14:23:20

by Arnaldo Carvalho de Melo

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Em Wed, Oct 23, 2013 at 11:34:42AM +0200, Stephane Eranian escreveu:
> On Wed, Oct 23, 2013 at 12:18 AM, Arnaldo Carvalho de Melo
> > What about:

> > # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
> > # time events
> > 1.000264953 2.09 Joules power/energy-cores/
> > 1.000264953 5.94 Joules power/energy-pkg/
> > 1.000264953 160,530,320 ref-cycles
> > 2.000640422 2.07 Joules power/energy-cores/
> > 2.000640422 5.94 Joules power/energy-pkg/
> > 2.000640422 152,673,056 ref-cycles
> > 3.000964416 2.08 Joules power/energy-cores/
> > 3.000964416 5.93 Joules power/energy-pkg/
> > 3.000964416 158,779,184 ref-cycles

> > ?
> > Or even 2.09J power/energy-cores/?

> I can try that.

> > I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
> > available.

> I don't have this function in my tree yet (tip.git).

That would be a new one :-)

At some point I'll study the %pM, etc things in the kernel printk code
to come up with something like perf_evsel__{f,scn}printf that allows us
to use just one string format and then pick things like units as a
modifier, but till then having these fprintf variants seems good enough.

- Arnaldo

2013-10-23 14:33:08

by Stephane Eranian

Subject: Re: [PATCH v2 0/3] perf,x86: add Intel RAPL PMU support

Arnaldo,

On Wed, Oct 23, 2013 at 4:22 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Wed, Oct 23, 2013 at 11:34:42AM +0200, Stephane Eranian escreveu:
>> On Wed, Oct 23, 2013 at 12:18 AM, Arnaldo Carvalho de Melo
>> > What about:
>
>> > # perf stat -a -e power/energy-cores/,power/energy-pkg/,ref-cycles -I 1000 sleep 1000
>> > # time events
>> > 1.000264953 2.09 Joules power/energy-cores/
>> > 1.000264953 5.94 Joules power/energy-pkg/
>> > 1.000264953 160,530,320 ref-cycles
>> > 2.000640422 2.07 Joules power/energy-cores/
>> > 2.000640422 5.94 Joules power/energy-pkg/
>> > 2.000640422 152,673,056 ref-cycles
>> > 3.000964416 2.08 Joules power/energy-cores/
>> > 3.000964416 5.93 Joules power/energy-pkg/
>> > 3.000964416 158,779,184 ref-cycles
>
>> > ?
>> > Or even 2.09J power/energy-cores/?
>
>> I can try that.
>
>> > I.e. a perf_evsel__fprintf_value(evsel) would append a unit string, if
>> > available.
>
>> I don't have this function in my tree yet (tip.git).
>
> That would be a new one :-)
>
> At some point I'll study the %pM, etc things in the kernel printk code
> to come up with something like perf_evsel__{f,scn}printf that allows us
> to use just one string format and then pick things like units as a
> modifier, but till then having these fprintf variants seems good enough.
>
Having the printf() would only be good for printing the value, but the problem is
that you'd need to synchronize with the column headers and widths. So
if you say fprintf_value() prints the count + unit, then you also need to line up
with the column header, which comes from somewhere else. I am
talking about the interval printing mode here.
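
The alignment concern can be shown with a small sketch in which the header and the rows share one set of column widths. The widths and function names are arbitrary choices for the example, and the value-then-unit column order follows Arnaldo's suggested layout; none of this is existing perf code.

```python
# Shared column widths: the header and every row must agree on these,
# which is why a value+unit printer cannot pick its own width in isolation.
TIME_W, COUNT_W, UNIT_W = 15, 18, 8


def format_header():
    """Header line; '#' takes the first character of the time column."""
    return ("#" + "time".rjust(TIME_W - 1) + " "
            + "counts".rjust(COUNT_W) + " "
            + "unit".ljust(UNIT_W) + " events")


def format_row(time_s, count_text, unit, event):
    """One interval row; a missing unit still occupies its column."""
    unit_text = unit if unit else ""
    return (f"{time_s:{TIME_W}.9f} {count_text:>{COUNT_W}} "
            f"{unit_text:<{UNIT_W}} {event}")


if __name__ == "__main__":
    print(format_header())
    print(format_row(1.000264953, "2.09", "Joules", "power/energy-cores/"))
    print(format_row(1.000264953, "160,530,320", None, "ref-cycles"))
```

If the unit were appended inside the value printer instead, the header code would need to know the widest unit string in advance to stay lined up.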